After upgrade to 23.7.7_3 - link down/up - and after that NO connection outside

Started by lar.hed, November 04, 2023, 09:56:57 AM

Nah, all good.

So we talked about this just now and would like to know what changes in /tmp/rules.debug when this happens... It sounds like something differs in the file between having the filter reload lines (as it was before 23.7.7) and how it is now (since 23.7.7).

Could you make a copy of the file in both cases and diff the two?


Thanks,
Franco

Maybe I should describe the process:

Comment out filter_configure(false, false); lines. Provoke error case.

Copy /tmp/rules.debug to e.g. /root/rules.bad

run

# /usr/local/etc/rc.filter_configure

(problem should be fixed)

Copy /tmp/rules.debug to e.g. /root/rules.good

And then let us know what this returns:

# diff -u /root/rules.bad /root/rules.good
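
The steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names snapshot and compare and the RULES override are made up for this sketch and are not part of OPNsense; the paths are the ones named in this thread.

```sh
#!/bin/sh
# Sketch of the capture-and-compare steps (hypothetical helper names).
# RULES defaults to the path from the thread; override it for a dry run.
RULES="${RULES:-/tmp/rules.debug}"

# Save the current ruleset under the given name, e.g. /root/rules.bad
snapshot() {
    cp "$RULES" "$1"
}

# Print a unified diff of two snapshots.
# diff exits 1 when the files differ, which is not an error here.
compare() {
    diff -u "$1" "$2" || [ $? -eq 1 ]
}

# Intended sequence on the firewall, as root, while the problem is present:
#   snapshot /root/rules.bad
#   /usr/local/etc/rc.filter_configure
#   snapshot /root/rules.good
#   compare /root/rules.bad /root/rules.good
```

Keeping the two copies under /root means they survive a later filter reload, which rewrites /tmp/rules.debug.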

And here is the result:

I removed the result since my WAN seems to have gone down in the middle of it all. Yikes.

NO, don't trust that - this is my WAN-LTE failover working - in the middle of it all, my WAN connection was dropped. I like this..... NOT!

Okay, now I have kind of an inverted problem: I cannot recreate the problem.

And to be very clear: the filter lines are commented out, so they are not executed, and YES, I have rebooted my bare-metal OPNsense firewall. And now it works, and yes, WAN is back up. ???

Nope, I have no way of triggering the problem anymore. :'(

I am partly glad to have this problem gone, but even with everything back to normal I would very much like to know what happened and why.

What I have tried is reboot, cold restart, all cables out, and some more. There is nothing I can do to trigger this.
Except maybe reinstall everything from 23.7 and then upgrade, restore config - maybe that might re-trigger this. I might have to look into that, just need some more time....

Okay, so this morning it was back: I had the same problem as before. I am also running the UNmodified file, with the lines completely removed, just as it was last night when I rebooted. So why this extra time between link down - some 12 hours or so - and link up -> no connection to the outside world on this particular directly connected PC (1.1.1.1 works, so raw IP traffic works perfectly)?

So I immediately brought up MobaXterm, logged into OPNsense and copied the file. Then I ran the suggested command "/usr/local/etc/rc.filter_configure" - and the Internet connection was restored. I then copied the file again, and here is the result - it looks a bit like the one before (no, there is no WAN or LTE down this time - all traffic goes over WAN):

diff -u /root/rules.bad /root/rules.good
--- /root/rules.bad     2023-11-08 09:14:25.069074000 +0100
+++ /root/rules.good    2023-11-08 09:15:13.266804000 +0100
@@ -68,6 +68,7 @@
no nat proto carp all
no rdr proto carp all
# [prio: 200]
+nat on igb7 inet from (igb2:network) to any port 500 -> (igb7:0) static-port # Automatic outbound rule
nat on igb7 inet from (vlan01:network) to any port 500 -> (igb7:0) static-port # Automatic outbound rule
nat on igb7 inet from (igb0:network) to any port 500 -> (igb7:0) static-port # Automatic outbound rule
nat on igb7 inet from (igb5:network) to any port 500 -> (igb7:0) static-port # Automatic outbound rule
@@ -76,6 +77,7 @@
nat on igb7 inet from (igb4:network) to any port 500 -> (igb7:0) static-port # Automatic outbound rule
nat on igb7 inet from (igb6:network) to any port 500 -> (igb7:0) static-port # Automatic outbound rule
nat on igb7 inet from 127.0.0.0/8 to any port 500 -> (igb7:0) static-port # Automatic outbound rule
+nat on igb7 inet from (igb2:network) to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
nat on igb7 inet from (vlan01:network) to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
nat on igb7 inet from (igb0:network) to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
nat on igb7 inet from (igb5:network) to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
@@ -84,6 +86,7 @@
nat on igb7 inet from (igb4:network) to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
nat on igb7 inet from (igb6:network) to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
nat on igb7 inet from 127.0.0.0/8 to any -> (igb7:0) port 1024:65535 # Automatic outbound rule
+nat on igb1 inet from (igb2:network) to any port 500 -> (igb1:0) static-port # Automatic outbound rule
nat on igb1 inet from (vlan01:network) to any port 500 -> (igb1:0) static-port # Automatic outbound rule
nat on igb1 inet from (igb0:network) to any port 500 -> (igb1:0) static-port # Automatic outbound rule
nat on igb1 inet from (igb5:network) to any port 500 -> (igb1:0) static-port # Automatic outbound rule
@@ -92,6 +95,7 @@
nat on igb1 inet from (igb4:network) to any port 500 -> (igb1:0) static-port # Automatic outbound rule
nat on igb1 inet from (igb6:network) to any port 500 -> (igb1:0) static-port # Automatic outbound rule
nat on igb1 inet from 127.0.0.0/8 to any port 500 -> (igb1:0) static-port # Automatic outbound rule
+nat on igb1 inet from (igb2:network) to any -> (igb1:0) port 1024:65535 # Automatic outbound rule
nat on igb1 inet from (vlan01:network) to any -> (igb1:0) port 1024:65535 # Automatic outbound rule
nat on igb1 inet from (igb0:network) to any -> (igb1:0) port 1024:65535 # Automatic outbound rule
nat on igb1 inet from (igb5:network) to any -> (igb1:0) port 1024:65535 # Automatic outbound rule
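
To see at a glance which source networks the added lines belong to, the diff can be summarized with standard tools. A minimal sketch, assuming the diff is piped in on stdin; the function name added_nat_sources is made up for this example:

```sh
# Count added ("+") nat rules per source network in a unified diff on stdin.
# The "+++ file" header line is skipped because it does not start with "+nat ".
added_nat_sources() {
    grep '^+nat ' \
        | sed -n 's/.* from \([^ ]*\) to .*/\1/p' \
        | sort \
        | uniq -c
}

# Usage on the firewall:
#   diff -u /root/rules.bad /root/rules.good | added_nat_sources
```

For the diff above, all four added lines have (igb2:network) as their source.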


Some interface info:
igb1 = WAN (Primary)
igb7 = LTE (failover for WAN that is)

igb2 = PC that has this link-down/link-up problem

igb0 = Home Assistant server
igb5 = Laser printer with built in scanner
igb4 = Extra server interface, currently not connected at all
igb6 / vlan01 = Unifi AP, where vlan01 is IoT
igb3 = Media with things like Kef speakers, Chromecast and projector

I find some strange things in the above, such as all of the "Automatic outbound rule" lines. Why do they appear when the WAN link is stable? Do note that the box has been rebooted after the WAN problem, and the WAN has been up since then...

Anyways, the thing to accept is that the command:
/usr/local/etc/rc.filter_configure

Solves my problem with link-down/<a large amount of time, it seems>/link-up and no internet connection (which looks a lot like a DNS problem, but since all other interfaces have DNS resolution it is more likely to be something not DNS related - like the filter....)

Oh and now I have behaved so I have also reapplied the patch (not edited the file) in a correct manner....

Thanks for the debugging. Highly appreciated. igb2 is static IPv4, right?

> Oh and now I have behaved so I have also reapplied the patch (not edited the file) in a correct manner....

Hehe, that made me happy <3


Cheers,
Franco

Quote from: franco on November 08, 2023, 01:45:23 PM
Thanks for the debugging. Highly appreciated. igb2 is static IPv4, right?

igb2 is DHCP.

igb2 is actually my work PC (Microsoft Surface Book 2, connected over USB-C<->Thunderbolt to my Dell 4021Q screen, which has an Ethernet port connected to the igb2 interface). The igb2 interface has DHCP since from time to time I do use a Dlink switch when I need more connections at my work desk. So it needs to be DHCP for those very few occasions.

Quote from: franco on November 08, 2023, 01:45:23 PM
> Oh and now I have behaved so I have also reapplied the patch (not edited the file) in a correct manner....

Hehe, that made me happy <3

Just for the record: I just returned home, and the link has been down for at least 5 hours. No problem after reapplying the patch - it works like it always has.

For what it is worth: Still working after that patch.

I have also done a few more diffs on rules.debug - the one last night returned nothing, the one this morning returned a lot more rows, but those lines are not interesting (state, block country and stuff). Let me know if anyone needs them, but I would say they do not bring anything new to the table.

> igb2 is DHCP

Are you sure? In the interface settings IPv4 mode is set to "DHCP"? We were pondering over it but would make an educated guess that you mean it runs a DHCP server (which also requires a static IPv4 address) since you plug in clients...


Cheers,
Franco

Okay, if you phrase it like that I need to change my answer:

Yes, the interface is static (10.168.2.1/24 - Upstream GW = Auto detect) - however, the client connected to that port, aka "Surface Book 2 PC with Windows 10", is DHCP (10.168.2.20). So yes, the interface is static - I just assumed (assumption is the mother of all f*ckups and all that) you referenced my PC and not the interface port on my h/w running OPNsense. This is clearly my mistake, sorry for the confusion.

Ok, thanks for clarifying. So we know what the problem is but the fix is really really tricky to pull off.

The idea is simple: leave the static addresses on the interface when rc.linkup pulls it down.

The reality is overly complex: this pertains to virtual IPs as well, CARPs are already an exception and multiple code points calling the offending interface_bring_down() either do too much or too little in the scope of what is happening. interface_bring_down() is a convoluted piece of code that does historic things for historic reasons but without a real plan of action. I'll try to unwind this in the coming days.
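
For anyone who wants to check whether an interface has lost its static address after a link down, here is a minimal read-only sketch. It assumes the usual ifconfig output format, where IPv4 addresses appear on lines starting with "inet "; the helper name has_ipv4 is made up for this example, and igb2 is simply the port from this thread.

```sh
# Succeeds when the ifconfig output on stdin lists at least one IPv4 address.
# "inet " with a trailing space does not match "inet6" lines.
has_ipv4() {
    grep -q '^[[:space:]]*inet '
}

# Usage on the firewall (changes nothing, only inspects):
#   ifconfig igb2 | has_ipv4 || echo "igb2 currently has no IPv4 address"
```

If the address turns out to be missing while the igb2 rules are also absent from /tmp/rules.debug, that would be consistent with the explanation above, but it is not proof on its own.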

The good news is that this is an edge case that has nothing to do with why the filter reload was removed from the file and that decision stands. It actually would bring a lot more stability to the system if we manage to unwind interface_bring_down() behaviour and fix all callers.

But that also means when 23.7.8 hits today and you eventually install it you need to reapply the patch for now to keep it from breaking on your end.

So far I'm only aware of your report further indicating that we are going in the right direction. Thanks for all your help so far!


Cheers,
Franco