I have 3 OPNsense firewalls running, all on 23.7.5, all as virtual machines on Hyper-V 2019 DC.
This morning one of them suddenly stopped working correctly; there is no internet access anymore from any LAN segment. The live view is not displaying any activity anymore, but strangely, all VPNs work flawlessly.
As this physical server (the one where OPNsense is not working anymore) is the only one that got Windows Updates yesterday, I am in the process of uninstalling these updates now.
I hope this solves the problem; I will share the results.
One more point: the firewall itself has internet access; only the LAN segments do not.
Uninstalling the updates did NOT do the trick - still no internet and no activity in the live log.
Anyone any thoughts?
Check your VM config and verify that it's actually passing traffic to the VM. If nothing changed in OPNsense and the only machine having issues had Windows Updates, go through and verify everything. Start with no assumptions of anything working.
No problems found - rolled back the updates, behaviour stays the same.
In the bootlog, I see:
filter_configure_sync[285] failed.
several times.
On the working firewalls, there is no failed entry in the bootlog.
I do not know what this means, can anyone explain?
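One way to dig into that error on the broken box, assuming a stock OPNsense 23.7 install (the paths below are what I know from such installs; verify them on your version):

```shell
# Re-run the filter rule generation by hand and watch for errors
configctl filter reload

# Dry-run parse the generated ruleset; syntax errors show up here
pfctl -nf /tmp/rules.debug

# Search the backend log for the failing step
grep -i 'filter_configure' /var/log/configd/latest.log
```

If the pfctl dry run complains, the line it points at usually narrows things down a lot.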
Just because windows says it rolled back the updates doesn't necessarily mean that it did.
That error appears to be an issue with an HA setup sync and makes sense if your VM connectivity is broken. It wouldn't show up on the others because they're able to talk.
You need to verify all of the base details of your VM config and hosting setup. Without doing that you're just going to be chasing your tail.
VM connectivity is not broken ....
Working flawlessly:
- OpenVPN: clients can connect and run RDP over this connection
- IPsec (routing): the LANs on the other OPNsense firewalls can be accessed normally (intercontinental connection)
- IPsec client connections: working perfectly
- Ping and lookup from the firewall itself: works normally
So the routing between the LANs and the VPN adapters works.
The VPN adapters can connect (going over the WAN).
Only routing from LAN to the internet fails; routing to anything else (that also connects over the same WAN) works.
Well, I restored the complete machine with a backup from Saturday, and it is working again.
Restoring only the config from Saturday did not work.
Changing the backup from weekly to daily now.
It's about 8 hours since the previous firewall failed.
Now the next one fails (other hardware, other continent ...).
I know it is hard to take time to diagnose when the system is broken and you must restore functionality, but are you able to do diagnostics and try to pinpoint - not the root of the problem - but where the problem is?
"...suddenly stopped working correctly; there is no internet access anymore from any LAN segment" is descriptive of the symptom but gives no clue as to what the reason might be. From a distance: could it be that DNS is not resolving and hence it looks like there is no access? Or is there actually a problem with routing? For that, the usual network engineer diagnostics are required: print your routes, your interface definitions (ifconfig), your traceroutes, etc.
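For the record, on FreeBSD/OPNsense those diagnostics would look roughly like this (9.9.9.9 is just an arbitrary public IP picked for testing):

```shell
netstat -rn -f inet     # IPv4 routing table; is there a 'default' entry?
ifconfig -a             # interface state, addresses, link status
traceroute -n 9.9.9.9   # -n skips DNS, so this separates routing from resolution
drill opnsense.org      # DNS check; drill ships in the FreeBSD base system
```

Captures of these from a broken node next to a working one would make the comparison concrete.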
Also, virtualisation is another layer of complexity to account for and diagnose.
When 30+ people can not work, there is no time to investigate.
But the symptoms are:
- Incoming traffic works
- Outgoing traffic to VPNs works
- Outgoing traffic to the internet does not
- Outgoing traffic to the internet from the firewall itself works
- Edit: traffic between LANs (on the same firewall and on the other side of VPNs) works
When I do a tracert to any IP address, the only hop that answers is the gateway.
So DNS indeed does not work, as the DNS servers are on the internet.
But raw traffic to an IP address does not work either, as the routing stops at the gateway.
Edit2: As I said in another thread:
If some of the developers of Deciso B.V. are reading this: if you contact me, I can deliver 2 disks with an installation (Hyper-V) that has the problem, for investigation.
Indeed, no time to investigate, but this information, useful as it is, fails to give enough information to tell what the problem might be. And transplanting the disks does not replicate the environment, so it is of very limited use.
By the way, statements are fine when accompanied by captures of the diagnostics; otherwise it is open to interpretation.
I would get off Hyper-V for starters. Not a great hypervisor for FreeBSD.
Next, if you can, try bare metal.
After that, consider commercial support for a couple of hours to diagnose, if there is no in-house networking expertise.
I was thinking about what was changed.
I have 3 firewalls, and until now, 2 have failed.
Also, I have been running OPNsense since the second half of August and did not encounter anything like this before.
Changes made in the last days:
1. Upgraded to 23.7.5 on all three firewalls
2. Working on an incoming SNAT workaround
It can't be 1, as only 2 of the 3 failed (the configurations are comparable).
For 2: the Description field of some Port Forwards was changed.
The description was: "PortForward: my pf description"
This was changed to: "SNAT x.x.x.x #PortForward: my pf description"
This last change was only made on the 2 firewalls that failed, not on the one that did not fail.
As this description is used to generate a rule on the WAN, this might be the cause (maybe the # character?)
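If the description really ended up unquoted in the generated ruleset, the # would matter: in pf.conf syntax an unquoted # starts a comment, and the parser ignores the rest of the line. A purely hypothetical illustration - the rule text below is made up, and OPNsense normally quotes labels, so this is only a guess at the mechanism:

```shell
# A made-up rule line with the new description embedded unquoted;
# sed mimics how the pf.conf parser would drop everything after '#'
echo 'pass in on em0 label SNAT x.x.x.x #PortForward: my pf description' \
  | sed 's/ *#.*//'
# -> pass in on em0 label SNAT x.x.x.x
```

Checking /tmp/rules.debug on a failed box for truncated lines around those rules would confirm or rule this out.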
Quote from: cookiemonster on October 11, 2023, 10:32:39 PM
Indeed, no time to investigate, but this information, useful as it is, fails to give enough information to tell what the problem might be. And transplanting the disks does not replicate the environment, so it is of very limited use.
By the way, statements are fine when accompanied by captures of the diagnostics; otherwise it is open to interpretation.
I would get off Hyper-V for starters. Not a great hypervisor for FreeBSD.
Next, if you can, try bare metal.
After that, consider commercial support for a couple of hours to diagnose, if there is no in-house networking expertise.
Well, I can be simple about that: if the choice comes down to OPNsense or Hyper-V, OPNsense will go.
Have you tried different emulated network interfaces? I don't know what Hyper-V offers, but while paravirtualised does have the least overhead, Intel E1000 is considered the most robust with FreeBSD guests.
After I restored the hard disk (not the VM hardware), everything works again, so it looks like a problem with corruption or the configuration of OPNsense on the hard disk.
The fact that restoring only the hard disk solves the problem rules the VM environment out (for me, that is).
As I look at what was happening, it seems the default route somehow stopped working, while every other route still worked.
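If it happens again, that theory could be checked directly before restoring (standard FreeBSD commands; the gateway shown should be whatever your WAN uses):

```shell
# Show the current default route, if any; 'not in table' would
# confirm the default route is gone
route -n get default

# Full IPv4 routing table, for comparison with a working node
netstat -rn -f inet
```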
Quote from: tverweij on October 12, 2023, 05:06:53 PM
As I look at what was happening, it seems the default route somehow stopped working, while every other route still worked.
That's what I would have thought at a point in the thread, hence my suggestions at the time. Still unproven though.
Quote from: tverweij on October 12, 2023, 05:06:53 PM
After I restored the hard disk (not the VM hardware), everything works again, so it looks like a problem with corruption or the configuration of OPNsense on the hard disk.
The fact that restoring only the hard disk solves the problem rules the VM environment out (for me, that is).
Do you mean you restored the virtual disk that the virtual machine is using? And are you using ZFS or UFS as the filesystem?
Also, as I find this interesting: what's the flow here? Did you create a backup prior to a change and then restore this backup? Sorry, but you haven't ruled anything out yet in terms of environment.
I looked at it and searched Google.
Everywhere, the standard Hyper-V network adapters are advised, also for FreeBSD (and yes, FreeBSD is 100% supported).
ok then. Fingers crossed said corruption is a one-off.
Quote from: cookiemonster on October 12, 2023, 05:21:45 PM
Do you mean you restored the virtual disk that the virtual machine is using? And are you using ZFS or UFS as the filesystem?
Also, as I find this interesting: what's the flow here? Did you create a backup prior to a change and then restore this backup? Sorry, but you haven't ruled anything out yet in terms of environment.
I mean that I restored the plain VHDX file that is attached to the SCSI controller and functions as the hard disk. It would be the same in VMware when the VMDK file is restored.
I use ZFS.
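Since it's ZFS, one option next time is to snapshot the broken state before restoring, so there is something left to dissect afterwards. A sketch, assuming the default OPNsense pool name zroot (check yours with zpool list):

```shell
# Recursively snapshot every dataset in the pool, preserving
# the broken state under a descriptive name
zfs snapshot -r zroot@broken-state

# Confirm the snapshots exist
zfs list -t snapshot
```

The snapshot costs almost nothing and can be sent off-box later with zfs send if the developers want it.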