OPNsense Forum

Archive => 24.1, 24.4 Legacy Series => Topic started by: Matzke on February 26, 2024, 06:15:26 PM

Title: strange behavior after update to Version 24.1.2 (and until 24.7.2)
Post by: Matzke on February 26, 2024, 06:15:26 PM
Dear all,

As I already wrote in my short post about the WireGuard problems, I want to open a new topic so that I don't hijack the old one.

I'm not sure whether there is one general problem that causes different symptoms (which is why I posted my observations in the other topic).

Here is a copy of my old post:

Quote
Hello,
I also have some strange problems after the update. I don't want to hijack this thread, but I think it might be the same root cause that manifests differently for everyone.

OPNSense A:
Update and direct reboot
Everything seemed to work fine, but later today (the day after the update) I received error messages that some servers were not reachable because of a DNS problem. According to the GUI, Unbound was not running, BUT browsing the Internet on the clients still worked, so part of the DNS server must have been running. A reboot of OPNsense seemed to fix the problem, but I'll have to wait and see tomorrow.

OPNSense B:
Update and direct reboot
- A device can no longer connect to its cloud server.
I can address the device within my internal network (several VLANs routed via OPNSense), so the routing must basically work
- Internet access on my test client worked, websites could be loaded
- a "ping google.de" on the same test client shows no connection
- a "tracert google.de" stops at the OPNSense
- DNS worked, as both of the above commands were able to resolve an IP. I tried it with 3 different hosts, always the same behavior
- a restart of Unbound brought no change
- I checked to see if there was another update available on the OPNSense - the update routine could not connect to the update server either
After rebooting OPNsense, everything seemed to work again (the device had its cloud connection, ping worked again, tracert worked again) - I made no other changes!

P.S. My Wireguard worked at least after the second reboot, before that I don't know.

Both OPNSense machines have been running for several years, nothing was changed in the configurations before the update. So it seems that something is sporadically unstable.
Title: Re: strange behavior after update to Version 24.1.2
Post by: Matzke on February 26, 2024, 06:25:39 PM
And now my new observation from today:

I didn't make any changes in OPNsense - it was just rebooted by a cron job. After that, I can report the following behavior (which is very strange):

Just a short note on my configuration:

2 WAN interfaces and 2 gateways -> OPNSense -> multiple internal VLAN Interfaces

--> gateway two is switched off and marked as down, so it is configured but not present.

Now the strange behavior (it's a little bit like reported above):

- normal internet usage (using a browser) works without problems, which is why I didn't notice this morning that there was a problem
- routing between internal VLANs works without problems
- one device in my technic-vlan can't connect to its cloud servers
- ping to different targets on the internet results in timeouts
- traceroute to the same targets stops at OPNsense
- all connections from the WAN side stop working (no services were reachable - neither HAProxy on OPNsense itself nor other NATed services behind OPNsense)
- no OPNsense update possible (can't reach servers)
- Unbound worked as expected (I pinged a target I had never connected to before, so it couldn't be in the cache: I did a random Google search and used the first hit, a local flower shop in a foreign country)
- all services were started (green arrows) except crowdsec-plugin
- starting crowdsec-plugin manually (after that also a green arrow) doesn't change the behavior
- no errors in the logfiles as far as I can see

--> rebooting OPNsense without any other interaction -> everything works fine directly after the reboot

I don't know what could cause this strange behavior but I can imagine that this behavior causes a wide variety of error patterns for other users.
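
The symptom list above mixes three different checks (name resolution, TCP as used by the browser, and ICMP as used by ping). A minimal sketch of how a test client could run them separately and label the combination is shown below. It is only an illustration, not something from the original post: the function names are made up for this example, the ping call assumes a Unix-style ping that accepts -c, and the labels are plain descriptions, not OPNsense diagnostics.

```python
import socket
import subprocess


def dns_resolves(host):
    """Return True if the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except socket.gaierror:
        return False


def tcp_connects(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def icmp_replies(host):
    """Return True if one echo request is answered (assumes 'ping -c')."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def describe(dns_ok, tcp_ok, icmp_ok):
    """Turn the three check results into a short description."""
    if not dns_ok:
        return "name resolution fails"
    if tcp_ok and not icmp_ok:
        return "DNS and TCP work, ICMP is lost"
    if not tcp_ok and icmp_ok:
        return "DNS and ICMP work, TCP fails"
    if not tcp_ok and not icmp_ok:
        return "DNS works, but neither TCP nor ICMP passes"
    return "no problem detected"
```

On a client, describe(dns_resolves(h), tcp_connects(h), icmp_replies(h)) for a host h such as google.de would give one line per target; the pattern reported above (browser works, ping times out) corresponds to the second label.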
Title: Re: strange behavior after update to Version 24.1.2
Post by: BoneStorm on March 03, 2024, 01:26:21 AM
I came here by searching for my strange upgrade problem. This post seems to be the only reference, but the signature fits. Please read below for a workaround.

I'm running a physical HA setup of OPNsense and upgraded from 23.7.12_5 to 24.1.2. I had just fixed my HA setup prior to the upgrade and tested it well, so I'm confident things broke during the upgrade itself.

I'm running WAN with a fixed private VIP with CARP enabled. The WAN default GW is ping monitored. Right after the upgrade things were fine, so I moved forward and upgraded the other node too. Then, after some minutes, the misbehavior became visible.

* DNS broke - no name resolution
* GW pings failed - declaring GW down
* tcpdump on WAN showed ICMP packets leaving OPNsense and being answered successfully by the remote
* the OPNsense shell ping, however, reported timeouts
* same signature on DNS - DNS queries leaving, but Unbound reports server failure
* existing connections (flows in the connection table) were successfully held and cached DNS records were still served, so it was not entirely obvious that things were going wrong
* tcpdump attached to pflogd0 did not indicate any drop
* for troubleshooting I added permit ip any any statements to WAN ingress - no luck
* pfctl -d - disabling pf made the OPNsense shell ping to the directly connected WAN default GW work instantly
* the issue persisted through multiple reboots, including with the other HA node held artificially down to reduce noise

I tried to make sense out of pfctl rules webgui summary to see where things went wrong, but could not pinpoint an issue here.

Workaround:
* I pulled the backup from the history prior to the upgrade from both nodes
* fresh install of old 23.7 release (from an old stick I had around)
* load config and restore the cluster

Hope it helps either to confirm this is a real issue, or to spread the word about a workaround which worked for me(tm)
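
The observations above narrow the fault down step by step: replies are visible in tcpdump, the shell ping still times out, and the ping works once pf is disabled. A minimal sketch of that elimination logic follows. It is an illustration added for readers, with a made-up function name and labels; it only encodes the reasoning in the list and does not query pf or tcpdump itself.

```python
def locate_fault(reply_seen_on_wire, ping_ok_with_pf, ping_ok_without_pf):
    """Narrow down where an ICMP reply gets lost.

    reply_seen_on_wire:  tcpdump on the interface shows the echo reply
    ping_ok_with_pf:     shell ping succeeds while pf is enabled
    ping_ok_without_pf:  shell ping succeeds after 'pfctl -d'
    """
    if ping_ok_with_pf:
        return "no fault"
    if not reply_seen_on_wire:
        return "upstream or link: no reply reaches the interface"
    if ping_ok_without_pf:
        return "packet filter: reply arrives but is lost while pf is enabled"
    return "local stack: reply arrives but is lost even with pf disabled"
```

With the values reported in this post, locate_fault(True, False, True) points at the packet filter, even though pflogd0 showed no drops.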
Title: Re: strange behavior after update to Version 24.1.2
Post by: Matzke on March 03, 2024, 09:13:35 AM
Dear all,

I don't know if this is a stable solution (fresh install and import of config) because it needs time to see if it worked or not.

My OPNsense has a restart job on days 1, 4 and 7 of the week, so I can say that not every restart triggers the problem.

But today I have the same problems as stated above (nothing changed - only the reboot in the night).



Title: Re: strange behavior after update to Version 24.1.2
Post by: godyang on March 24, 2024, 02:11:56 AM
I am experiencing the same problem.

In my case,
1. It worked fine after the initial upgrade to 24.1. However, after patching several packages and rebooting the box, it started to show this behavior.
2. A few websites still work fine, like Google and YouTube. However, most others are not reachable.
3. DNS seems to resolve fine. I can't reach the IP addresses / ports, though.
4. On the OPNsense box itself, the Internet is still fully accessible. Problems seem to occur only on the hosts behind it.
Title: Re: strange behavior after update to Version 24.1.2
Post by: godyang on March 24, 2024, 07:27:56 AM
I found another interesting pattern. Google and YouTube work fine because they have IPv6 endpoints. Once I turned off IPv6 on my WAN/LAN interfaces, they were no longer accessible.

I also attempted to install and configure 24.1 from scratch without importing a backup config. Again, it worked fine until I upgraded to 24.1.4. I suspect the NAT implementation is somehow broken by the package upgrade.
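
The IPv4 versus IPv6 difference described above can be checked from a client without changing any interface settings. The following is a minimal sketch, assuming a plain TCP connect is a good enough probe; the function name and the result labels are invented for this example and are not part of any OPNsense tooling.

```python
import socket


def reachable_by_family(host, port=443, timeout=3.0):
    """Try one TCP connection per address family and report the outcome."""
    results = {}
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            # Ask the resolver only for addresses of this family.
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            results[label] = "no address"
            continue
        sockaddr = infos[0][4]
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[label] = "connected"
        except OSError:
            results[label] = "connect failed"
    return results
```

If a site that still loads reports "connected" only for IPv6 while IPv4-only sites report "connect failed", that would match the pattern described in this post.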
Title: Re: strange behavior after update to Version 24.1.2
Post by: tb_one on March 26, 2024, 11:06:51 AM
We have some of these issues, too.

Weird and strange behaviour. I opened another thread about these issues a short time ago:
https://forum.opnsense.org/index.php?topic=39654.0

Quote from: BoneStorm on March 03, 2024, 01:26:21 AM
...
* DNS broke - no name resolution
* GW pings failed - declaring GW down
* tcpdump on WAN showed ICMP packets leaving OPNsense and being answered successfully by the remote
* the OPNsense shell ping, however, reported timeouts
* same signature on DNS - DNS queries leaving, but Unbound reports server failure
* existing connections (flows in the connection table) were successfully held and cached DNS records were still served, so it was not entirely obvious that things were going wrong
* tcpdump attached to pflogd0 did not indicate any drop
* for troubleshooting I added permit ip any any statements to WAN ingress - no luck
* pfctl -d - disabling pf made the OPNsense shell ping to the directly connected WAN default GW work instantly
* the issue persisted through multiple reboots, including with the other HA node held artificially down to reduce noise
...
Title: Re: strange behavior after update to Version 24.1.2
Post by: bassopt on March 26, 2024, 11:25:21 AM
Try using bind.

I've never used Unbound on OPNsense because it always seems broken somehow. Still testing it, but it looks way more promising.
Title: Re: strange behavior after update to Version 24.1.2
Post by: Matzke on April 01, 2024, 03:15:28 PM
I don't think that it is an unbound problem because I also can't reach IP-addresses (also internal when routed via OPNSense) and I also can't reach my NATted devices behind OPNSense from outside.

The problem still persists (but sporadically; I think it occurs every 2-5 restarts).

The first thing that stands out: on the dashboard I can see that the crowdsec service isn't started.

This has nothing to do with the error itself; it seems that OPNsense has problems starting all services correctly.

When I restart crowdsec it runs, but the problem is still present. After restarting the pf and routing services it seems to work.

I hope somebody can solve this problem, because as it is OPNsense is unstable on my side!!!
Title: Re: strange behavior after update to Version 24.1.2
Post by: Matzke on April 06, 2024, 02:49:23 PM
It's still present in OPNsense 24.1.5_3-amd64

It seems that it's enough to manually restart service "routing".

I don't know why I (and also some other people) have such big problems after the upgrade, and I don't know why nobody tries to solve this problem. It started with the release of version 24; before that I used OPNsense for years without this problem. The installation was an upgrade from the GUI.
Title: Re: strange behavior after update to Version 24.1.2
Post by: BoneStorm on April 17, 2024, 10:36:14 PM
Hi Matzke,

Quote from: Matzke on April 06, 2024, 02:49:23 PM
It's still present in OPNsense 24.1.5_3-amd64

It seems that it's enough to manually restart service "routing".
...

That makes me a bit more confident to give the upgrade another try in the next few days, once spare time permits. Thanks for that information.
Title: Re: strange behavior after update to Version 24.1.2
Post by: Matzke on April 18, 2024, 09:43:05 PM
... thanks for that info - I just updated to 24.1.6 and will report what's going on.

It will take some time because I only restart on weekends until this failure is solved.

BTW - I didn't read anything in the changelog which could fix this behavior, or did I overlook something?
Title: Re: strange behavior after update to Version 24.1.2
Post by: Matzke on April 20, 2024, 09:15:07 AM
OPNsense 24.1.6-amd64  --> still the same problem.

And yes - restart routing seems to work, just checked...

Why does nobody (especially from OPNsense) try to help???
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.6)
Post by: tb_one on May 03, 2024, 01:19:07 PM
We found a solution after many hours and days of searching, and it was so simple.

Go to your WAN interface and make sure (if it is your only WAN interface and you have no multi-WAN setup) that IPv4 Upstream Gateway is set to "Auto-Detect". Another admin in our company had set it manually to the default gateway given to us by our ISP. That never caused problems until now, on 24.1.

kind regards.
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.6)
Post by: Matzke on May 25, 2024, 08:42:00 AM
Dear All,

The problem is still present, but not after every reboot. I do a reboot every Saturday, and today it happened again for the first time since my last post.

I have a multi-WAN setup (but one WAN interface is mostly down, because it exists only for testing and as a fallback in case the primary WAN interface fails).

Perhaps these hints give you a way to check where the problem could be. As I already wrote, this configuration worked for years with prior versions of OPNsense.

Restarting the "Routing" service seems to be enough - but that only works when I'm on site, because all connections from WAN (including VPN) don't work.
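
Because the restart workaround above needs someone on site, one conceivable stopgap is a small watchdog running on the firewall itself that triggers the restart when outside connectivity is lost. The sketch below only illustrates the idea under stated assumptions: the thread does not name a restart command, so RESTART_COMMAND is a hypothetical placeholder that would have to be replaced, the probe target 8.8.8.8 is just an example, and the ping call assumes a ping that accepts -c. It is not a tested fix for this issue.

```python
import subprocess
import time

# Hypothetical placeholder: the thread does not say which command the GUI
# "Routing" restart corresponds to, so this path must be replaced.
RESTART_COMMAND = ["/path/to/restart-routing"]


def ping_probe(target="8.8.8.8"):
    """Return True if one echo request to target is answered."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", target],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def run_restart():
    """Run the (placeholder) restart command and ignore its exit status."""
    subprocess.run(RESTART_COMMAND, check=False)


def watchdog(probe, restart, failures_needed=3, interval=60.0,
             max_checks=None, sleep=time.sleep):
    """Call restart() after failures_needed consecutive failed probes.

    Runs forever unless max_checks is given; returns the restart count.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= failures_needed:
                restart()
                restarts += 1
                failures = 0  # start counting again after a restart
        sleep(interval)
    return restarts
```

A call such as watchdog(ping_probe, run_restart) would loop indefinitely; requiring several consecutive failures is a design choice to avoid restarting the service on a single lost packet.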
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: Matzke on June 02, 2024, 10:17:14 PM
... still present on 24.1.8
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: franco on June 03, 2024, 12:25:28 PM
... GitHub ticket?


Cheers,
Franco
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: schnipp on June 11, 2024, 05:24:32 PM
I can confirm, the issue still persists in Opnsense version 24.1.8

@Matzke: Can you please share the GitHub ticket ID?
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: franco on June 11, 2024, 06:37:56 PM
Might be related to https://github.com/opnsense/core/issues/7452 -- the patch is https://github.com/opnsense/core/commit/9a3caab85 but will only land in 24.1.9.


Cheers,
Franco
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: schnipp on June 12, 2024, 05:49:40 PM
Thank you.

I am not sure whether this issue is covered by the mentioned ticket. Since the error is not critical (restarting the service after booting temporarily fixes the problem), I will test after OPNsense 24.1.9 is released  :)
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: schnipp on June 27, 2024, 04:14:39 PM
Looks good, IPv6 is back again.  :)

Many thanks
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.8)
Post by: Matzke on July 01, 2024, 12:23:09 AM
Dear all,

I just updated to OPNsense 24.1.9_4-amd64 and after a reboot - problem is still present and not solved.

I didn't open a ticket because I have very little time (at the moment for health reasons, with lots of hospital visits myself). I hope it will be solved.

The problem is somewhat critical because it can only be solved easily when I'm on the site where OPNsense is installed (I can't solve it remotely because all outside WAN ports (NAT as well as OpenVPN) are unreachable).
Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.9_4)
Post by: Yszty on August 08, 2024, 11:08:40 PM
Same problem here with: OPNsense 24.7_9-amd64

I set up a mirrored configuration: same devices, same ISP, mask /28. The only difference is the OPNsense version. With OPNsense 24.1.6-amd64 everything works.


On OPNsense 24.7_9-amd64, after a switch from backup to master:

* DNS broke - no name resolution
* GW pings failed - declaring GW down

After some time:

* GW went up, BUT

I was able to ping 8.8.8.8,
but unable to ping google.com.

After permanently switching off the backup node and restarting the master, everything started working normally.




Title: Re: strange behavior after update to Version 24.1.2 (and until 24.1.9_4)
Post by: Matzke on August 26, 2024, 12:11:48 AM
Dear Yszty,

I think this is not the same problem because I don't have a mirror configuration - I have a single OpnSense on every site.

But unfortunately I can report that the problem still exists in 24.7.2-amd64.

I think a restart of the "Gateway Monitor" service can solve the problem after a restart of OPNsense, but for that I always have to be on site and can't do anything remotely, because in 90% of cases OPNsense isn't reachable from outside after a restart/update.

Please help.