strange behavior after update to Version 24.1.2 (and until 24.7.2)

Started by Matzke, February 26, 2024, 06:15:26 PM

Dear all,

As I already wrote in my short post about the WireGuard problems, I want to open a new topic so that I don't hijack the old one.

I'm not sure whether there is one general problem that causes different symptoms (which is why I posted my observations in the other topic).

Here is a copy of my old post:

Quote
Hello,
I also have some strange problems after the update. I don't want to hijack this thread, but I think it might be the same origin that manifests differently for everyone.

OPNSense A:
Update and direct reboot
Everything seemed to work fine, but later today (the day after the update) I received error messages that some servers were not reachable, caused by a DNS problem. According to the GUI, Unbound was not running - BUT the Internet via browser on the clients was working, so part of the DNS server must have been running. A reboot of OPNSense seemed to have fixed the problem - but I'll have to wait and see tomorrow.

OPNSense B:
Update and direct reboot
- A device can no longer connect to its cloud server.
I can address the device within my internal network (several VLANs routed via OPNSense), so the routing must basically work
- Internet access on my test client worked, websites could be loaded
- a "ping google.de" on the same test client shows no connection
- a "tracert google.de" stops at the OPNSense
- DNS worked, as both of the above commands were able to resolve an IP. I tried it with 3 different hosts, always the same behavior
- a restart of Unbound brought no change
- I checked to see if there was another update available on the OPNSense - the update routine could not connect to the update server either
After rebooting the OPNSense, everything seemed to work again (device had cloud connection, ping worked again, tracert worked again) - I made no other changes!

P.S. My Wireguard worked at least after the second reboot, before that I don't know.

Both OPNSense machines have been running for several years, nothing was changed in the configurations before the update. So it seems that something is sporadically unstable.

And now my new observation from today:

I didn't make any changes in OPNSense - it was just rebooted by a cron job. After that, I can report the following behavior (which is very strange):

A short note on my configuration:

2 WAN interfaces and 2 gateways -> OPNSense -> multiple internal VLAN Interfaces

--> gateway two is switched off and marked as down, so it is configured but not present.

Now the strange behavior (it's a little bit like reported above):

- normal internet usage (using a browser) works without problems, which is why I didn't notice this morning that there was a problem
- routing between internal VLANs works without problems
- one device in my technic-vlan can't connect to its cloud servers
- ping to different targets on the internet results in timeouts
- traceroute to the same targets stops at OPNSense
- all connections from the WAN side stop working (no services are reachable - neither HAProxy on the OPNSense itself nor other NATed services behind OPNSense)
- no OPNSense update possible (can't reach servers)
- Unbound works as expected (I pinged a target which I had never connected to before, so it couldn't be in the cache; I did a random Google search and used the first hit, which was a local flower shop in a foreign country)
- all services were started (green arrows) except crowdsec-plugin
- starting crowdsec-plugin manually (after that also a green arrow) doesn't change the behavior
- no errors in the log files as far as I can see

--> rebooting OPNSense without any other interaction -> everything works fine directly after the reboot

I don't know what could cause this strange behavior but I can imagine that this behavior causes a wide variety of error patterns for other users.
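
For anyone who wants to compare notes: the key observation above is that name resolution still works while the resolved address cannot be pinged. Below is a minimal Python sketch that checks the two steps separately from a client. It is only an illustration, not something used in this thread; the function names are made up, and it assumes a Unix-style ping that accepts "-c 1" (Windows uses "-n").

```python
import socket
import subprocess


def resolve(host):
    """Return the first IPv4 address for host, or None if DNS fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def ping_once(address, timeout=5):
    """Send one ICMP echo with the system ping; True if a reply came back."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", address],  # assumption: Unix-style ping
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_host(host, resolver=resolve, pinger=ping_once):
    """Classify a host as 'dns-failed', 'resolves-but-unreachable' or 'ok'."""
    address = resolver(host)
    if address is None:
        return "dns-failed"
    if not pinger(address):
        return "resolves-but-unreachable"
    return "ok"
```

Calling check_host("google.de") while the problem is active should report "resolves-but-unreachable" if your box shows the same signature as mine; "dns-failed" would point to a different problem.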

March 03, 2024, 01:26:21 AM #2 Last Edit: March 03, 2024, 10:19:28 PM by BoneStorm
I came here while searching for my strange upgrade problem. This thread seems to be the only reference, but the signature fits. Please read below for a workaround.

I'm running a physical HA setup of OPNsense and upgraded from 23.7.12_5 to 24.1.2. I had just fixed my HA setup prior to the upgrade and tested it well. So I'm confident things broke during the upgrade itself.

I'm running WAN with a fixed private VIP with CARP enabled. The WAN default GW is ping-monitored. Right after the upgrade things were fine, so I moved on to upgrading the other node too. Then, after a few minutes, the misbehavior became visible.

* DNS broke - no name resolution
* GW pings failed - declaring GW down
* tcpdump on WAN showed ICMP packets leaving OPNsense and being answered successfully by the remote
* ping from the OPNsense shell, however, reported timeouts
* same signature for DNS - queries leaving, but Unbound reports server failure
* existing connections (flows in the connection table) were successfully held and cached DNS records were still served, so it was not entirely obvious things were going wrong
* tcpdump attached to pflogd0 did not indicate any drop
* for troubleshooting I added permit ip any any rules to WAN ingress - no luck
* pfctl -d - disabling pf made the ping from the OPNsense shell to the directly connected WAN default GW work instantly
* the issue persisted through multiple reboots, including with the other HA node held down artificially to reduce noise

I tried to make sense of the pfctl rules summary in the web GUI to see where things went wrong, but could not pinpoint an issue there.
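
To spell out the reasoning behind the list above: three observations (is the reply visible in tcpdump, does the local ping succeed, does it succeed once pf is disabled) narrow down where the reply gets lost. Here is that logic as a small Python sketch. The function name and the wording of the conclusions are my own illustration, and it only encodes the decision table, it does not run tcpdump or pfctl itself.

```python
def locate_drop(reply_on_wire, ping_ok, ping_ok_pf_disabled):
    """Map three manual observations to the likely place the reply is lost.

    reply_on_wire       -- tcpdump on WAN shows the echo reply arriving
    ping_ok             -- ping from the firewall shell succeeds with pf enabled
    ping_ok_pf_disabled -- the same ping succeeds after 'pfctl -d'
    """
    if ping_ok:
        return "no fault observed"
    if not reply_on_wire:
        return "reply never arrives: look upstream of the firewall"
    if ping_ok_pf_disabled:
        return "reply arrives but is lost in pf: look at the loaded ruleset and states"
    return "reply arrives and pf is not involved: look at the local network stack"
```

My case corresponds to locate_drop(True, False, True), i.e. the reply reaches the interface and only gets through once pf is out of the way, even though pflog showed no drop.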

Workaround:
* I pulled the pre-upgrade backup from the config history of both nodes
* fresh install of the old 23.7 release (from an old stick I had around)
* load the config and restore the cluster

Hope it helps either to confirm that this is a real issue or to spread the word about a workaround which worked for me(tm)

Dear all,

I don't know if this is a stable solution (fresh install and import of config) because it needs time to see if it worked or not.

My OPNSense has a restart job on days 1, 4 and 7 of the week, so I can say that not every restart triggers the problem.

But today I have the same problems as described above (nothing changed - only the reboot during the night).




I am experiencing the same problem.

In my case,
1. It worked fine after the initial upgrade to 24.1. However after patching several packages and rebooting the box, it started to have this behavior.
2. A few websites are still working fine, like Google and YouTube. However, most others can't be reached.
3. DNS seems to resolve fine. I can't reach the IP addresses / ports though.
4. On the OPNsense box itself, the Internet is still fully accessible. The problems seem to occur only on hosts behind it.

I found another interesting pattern. Google and YouTube are working fine because they have IPv6 endpoints. Once I turned off IPv6 on my WAN/LAN interfaces, they were no longer accessible.

I also attempted to install and configure 24.1 from scratch without importing a backup config. Again, it was working fine until upgrading to 24.1.4. I suspect the NAT implementation was somehow broken by the package upgrade.
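
If you want to check for the same IPv4/IPv6 split without changing interface settings, a TCP connection can be forced over one address family at a time. This is a minimal Python sketch under my own assumptions (port 443, a 5 second timeout, made-up function names); run it on a client behind the firewall.

```python
import socket

FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def try_connect(host, port, family, timeout=5):
    """Attempt one TCP connection to host:port over the given address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except OSError:
        return False  # no address of this family, or DNS failed
    for fam, kind, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, kind, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address of this family
    return False


def reachability_by_family(host, port=443, connect=try_connect):
    """Return e.g. {'IPv4': False, 'IPv6': True} for one host."""
    return {name: connect(host, port, family) for name, family in FAMILIES.items()}
```

A result with IPv6 reachable and IPv4 not, for a site that has both, would match what I see here and would fit the suspicion that only the NATed IPv4 path is affected.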

We have some of these issues, too.

Weird and strange behaviour. I opened another thread about these issues a little while ago:
https://forum.opnsense.org/index.php?topic=39654.0

Quote from: BoneStorm on March 03, 2024, 01:26:21 AM
...
* DNS broke - no name resolution
* GW pings failed - declaring GW down
* tcpdump on WAN showed ICMP packets leaving OPNsense and being answered successfully by the remote
* ping from the OPNsense shell, however, reported timeouts
* same signature for DNS - queries leaving, but Unbound reports server failure
* existing connections (flows in the connection table) were successfully held and cached DNS records were still served, so it was not entirely obvious things were going wrong
* tcpdump attached to pflogd0 did not indicate any drop
* for troubleshooting I added permit ip any any rules to WAN ingress - no luck
* pfctl -d - disabling pf made the ping from the OPNsense shell to the directly connected WAN default GW work instantly
* the issue persisted through multiple reboots, including with the other HA node held down artificially to reduce noise
...

March 26, 2024, 11:25:21 AM #7 Last Edit: March 26, 2024, 04:08:15 PM by bassopt
Try using bind.

I've never used Unbound on OPNsense because it always seems broken somehow. I'm still testing BIND, but it looks way more promising.

I don't think it is an Unbound problem, because I also can't reach IP addresses (including internal ones when routed via OPNSense), and I can't reach my NATed devices behind OPNSense from outside either.

The problem still persists (but sporadically; I think it occurs every 2-5 restarts).

The first thing that catches the eye: on the dashboard I can see that the crowdsec service isn't started.

That in itself has nothing to do with the error; it rather seems that OPNSense has trouble starting all services correctly.

When I restart crowdsec it runs again, but the problem is still present. After restarting the pf and routing services, it seems to work.

I hope somebody can solve this problem, because as it stands OPNSense is unstable on my side!!!

It's still present in OPNsense 24.1.5_3-amd64

It seems that it's enough to manually restart the "routing" service.

I don't know why I (and some other people) have such big problems after the upgrade, and I don't know why nobody tries to solve this problem. It started after the release of version 24 - before that I used OPNSense for years without this problem. The installation was an upgrade from the GUI.

Hi Matzke,

Quote from: Matzke on April 06, 2024, 02:49:23 PM
It's still present in OPNsense 24.1.5_3-amd64

It seems that it's enough to manually restart service "routing".
...

That makes me a bit more confident to give the upgrade another try in the next few days, once spare time permits. Thanks for that information.

... thanks for that info - I just updated to 24.1.6 and will report what's going on.

It will take some time, because I only restart on weekends until this failure is solved.

BTW - I didn't read anything in the changelog that could fix this behavior, or did I overlook something?

OPNsense 24.1.6-amd64  --> still the same problem.

And yes - restart routing seems to work, just checked...

Why does nobody (especially from OPNSense) try to help???

We found a solution after many hours, even days, of searching, and it was so simple.

Go to your WAN interface and make sure (if it is your only WAN interface and you don't have a multi-WAN setup) that IPv4 Upstream Gateway is set to "Auto-Detect". Another admin in our company had set it manually to the default gateway given to us by our ISP. That never caused problems until now that we're on 24.1.

Kind regards.

Dear All,

The problem is still present, but not after every reboot. Today it happened again, the first time since my last post, and I do a reboot every Saturday.

I have a multi-WAN setup (but one WAN interface is down most of the time, because it is only there for testing and as a fallback in case the primary WAN interface fails).

Perhaps these hints help you check where the problem could be. As I already wrote - this configuration worked for years with prior versions of OPNSense.

Restarting the "Routing" service seems to be enough - but that only works when I'm on site, because all connections from WAN (including VPN) don't work.
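
Since the manual restart is only possible on site, one idea is a small watchdog that runs on the firewall after boot: probe connectivity, and if the probe fails, trigger the restart and probe again. The following Python is just a rough sketch of that idea, not something I have in production. The probe target and the command that restarts routing are deliberately left as parameters, because in this thread the restart was only ever done from the GUI and I don't want to guess a command line.

```python
import subprocess
import time


def run_ok(command, timeout=10):
    """Run a command (list of arguments); True if it exits with status 0."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def watchdog(probe, repair, attempts=3, delay=30, sleep=time.sleep):
    """Probe once; on failure call repair, wait, and probe again.

    probe and repair are caller-supplied callables (hypothetical here).
    Returns 'healthy', 'repaired' or 'still-broken'.
    """
    if probe():
        return "healthy"
    for _ in range(attempts):
        repair()
        sleep(delay)  # give the service time to come back up
        if probe():
            return "repaired"
    return "still-broken"
```

A probe could be something like lambda: run_ok(["ping", "-c", "1", "192.0.2.1"]), with the documentation address 192.0.2.1 replaced by a target that normally answers, and repair would wrap whatever restarts the routing service on your box. Whether a ping from the firewall itself is the right probe is an assumption; on my system the box could not reach the update servers either, but others in this thread reported that the box itself stayed online.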