Messages - TheDJ

#1
German - Deutsch / Re: DS-Lite OPNsense
October 27, 2024, 09:38:24 PM
If I am seeing this correctly, this problem is now being tracked in https://github.com/opnsense/core/issues/7713.
#2
After weeks of hunting down this behavior and literally exchanging every hardware component, I found the problem: it (presumably) was a firmware upgrade on a Wi-Fi access point that served as a wireless backhaul.

I deployed the new v7.XX branch for the Zyxel NWA220AX-6E at roughly the same time. I performed multiple firmware upgrades on that device and even had it swapped via an RMA afterward, so I didn't think it was related.
Today I DOWNgraded it to a 6.XX firmware that I still had, and 'poof', all issues seem to be gone. I will continue to monitor the situation, but I believe the firmware for that device is borked. It leads to packet loss and, in turn, to TCP states being closed.
#3
This is true for very fresh traffic after a reboot/reconnect, but it should stabilize after a few minutes. For me, the behavior is ongoing even after a few days.
#4
Just for the record: 24.7.4 did not change/improve the behavior.

I didn't expect it to, because I did not see anything in the changelog that would indicate better behavior for v4, but I just wanted to note it here.

Is there anything else that could be done? I am very open to suggestions.
#5
Try widening the widget a bit more. I think I noticed the same behavior when I resized it and it got too small.
#6
@meyergru and @rkube: what are your ISPs (I assume both of you are in Germany)?

As mentioned, I am on a Telekom dual-stack line with 250/40. Maybe it is a routing/peering issue that coincidentally appeared at the same time. Then the TCP packets might arrive just a little too late (running out of the state TTL) and the state is closed? This would also explain why it is not perfectly consistent and now even hits 24.1.
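If that theory holds, it should be visible in the pf state timeouts. As a side note (this is just a generic pf check on my part, nothing OPNsense-specific), the effective timeout values can be printed on the shell:

# pfctl -st

tcp.established defaults to 86400s in pf, so an established state should not expire mid-connection unless something else tears it down first.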
#7
Quote from: rkube on September 09, 2024, 04:16:09 PM
I just went back to OPNsense 24.1 (imported config from 24.7) and, with debug logging turned on,... taddahhh... I also see the same 'pf: loose state match' notices.

Thanks, good to know. Maybe it's a different (but somehow related) issue that did not surface in the same way until now.

Do you also see the performance degradation/FW hits?
#8
Do you notice any FW hits on the default deny for this traffic?

For me, these FW hits correspond quite well with the TCP state losses, as far as I can tell. Right now, I wouldn't know any other reason why incoming port 443 traffic would be blocked (especially with the A and PA flags).
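As a side note, for anyone who wants to watch these blocks live: assuming the default deny rule has logging enabled (it does on a stock OPNsense), the blocked packets can be captured straight off the pflog interface:

# tcpdump -n -e -ttt -i pflog0 tcp port 443

The out-of-state packets show up there with their TCP flags, which makes the A/PA pattern easy to spot.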
#9
Quote from: meyergru on September 08, 2024, 10:47:18 PM
So there are also VLANs and LAGGs in the mix? Maybe netmap and Suricata as well? ???

For me, only VLANs. No LAGGs, no netmap, no Suricata (I had it in IDS mode before, but turned it off without any difference). Also, these VLANs had been stable for months before.
I also had a traffic shaper running beforehand, but it makes no difference whether it is on or off (although the iperf results show far more Retr packets with the shaper running).

With the shaper:
# iperf3 -c iperf.online.net --bidir
Connecting to host iperf.online.net, port 5201
[  5] local 10.200.10.2 port 44254 connected to 51.158.1.21 port 5201
[  7] local 10.200.10.2 port 44266 connected to 51.158.1.21 port 5201
[ ID][Role] Interval           Transfer     Bitrate         Retr  Cwnd
[  5][TX-C]   0.00-1.00   sec  6.12 MBytes  51.4 Mbits/sec   15    222 KBytes       
[  7][RX-C]   0.00-1.00   sec  23.2 MBytes   195 Mbits/sec                 
[  5][TX-C]   1.00-2.00   sec  4.88 MBytes  40.9 Mbits/sec   59    222 KBytes       
[  7][RX-C]   1.00-2.00   sec  26.7 MBytes   224 Mbits/sec                 
[  5][TX-C]   2.00-3.00   sec  4.75 MBytes  39.8 Mbits/sec   92    228 KBytes       
[  7][RX-C]   2.00-3.00   sec  26.7 MBytes   224 Mbits/sec                 
[  5][TX-C]   3.00-4.00   sec  3.57 MBytes  29.9 Mbits/sec   77    222 KBytes       
[  7][RX-C]   3.00-4.00   sec  27.0 MBytes   227 Mbits/sec                 
[  5][TX-C]   4.00-5.00   sec  4.76 MBytes  39.9 Mbits/sec  136    166 KBytes       
[  7][RX-C]   4.00-5.00   sec  27.1 MBytes   227 Mbits/sec                 
[  5][TX-C]   5.00-6.00   sec  3.52 MBytes  29.5 Mbits/sec  145    225 KBytes       
[  7][RX-C]   5.00-6.00   sec  26.9 MBytes   225 Mbits/sec                 
[  5][TX-C]   6.00-7.00   sec  4.76 MBytes  39.9 Mbits/sec   90    219 KBytes       
[  7][RX-C]   6.00-7.00   sec  27.0 MBytes   227 Mbits/sec                 
[  5][TX-C]   7.00-8.00   sec  4.70 MBytes  39.4 Mbits/sec   84    148 KBytes       
[  7][RX-C]   7.00-8.00   sec  26.3 MBytes   221 Mbits/sec                 
[  5][TX-C]   8.00-9.00   sec  3.52 MBytes  29.6 Mbits/sec   85    222 KBytes       
[  7][RX-C]   8.00-9.00   sec  27.7 MBytes   232 Mbits/sec                 
[  5][TX-C]   9.00-10.00  sec  4.80 MBytes  40.3 Mbits/sec  123    152 KBytes       
[  7][RX-C]   9.00-10.00  sec  26.9 MBytes   226 Mbits/sec                 
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  45.4 MBytes  38.1 Mbits/sec  906             sender
[  5][TX-C]   0.00-10.02  sec  42.3 MBytes  35.4 Mbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec   277 MBytes   233 Mbits/sec  2261             sender
[  7][RX-C]   0.00-10.02  sec   266 MBytes   222 Mbits/sec                  receiver

iperf Done.


Without the shaper:

# iperf3 -c iperf.online.net --bidir
Connecting to host iperf.online.net, port 5201
[  5] local 10.200.10.2 port 52252 connected to 51.158.1.21 port 5201
[  7] local 10.200.10.2 port 52266 connected to 51.158.1.21 port 5201
[ ID][Role] Interval           Transfer     Bitrate         Retr  Cwnd
[  5][TX-C]   0.00-1.00   sec  7.46 MBytes  62.6 Mbits/sec   91    382 KBytes       
[  7][RX-C]   0.00-1.00   sec  23.7 MBytes   199 Mbits/sec                 
[  5][TX-C]   1.00-2.00   sec  4.66 MBytes  39.1 Mbits/sec   33    294 KBytes       
[  7][RX-C]   1.00-2.00   sec  29.1 MBytes   244 Mbits/sec                 
[  5][TX-C]   2.00-3.00   sec  4.73 MBytes  39.7 Mbits/sec   12    259 KBytes       
[  7][RX-C]   2.00-3.00   sec  30.2 MBytes   253 Mbits/sec                 
[  5][TX-C]   3.00-4.00   sec  4.70 MBytes  39.4 Mbits/sec    0    276 KBytes       
[  7][RX-C]   3.00-4.00   sec  31.9 MBytes   267 Mbits/sec                 
[  5][TX-C]   4.00-5.00   sec  4.70 MBytes  39.4 Mbits/sec    0    253 KBytes       
[  7][RX-C]   4.00-5.00   sec  30.7 MBytes   257 Mbits/sec                 
[  5][TX-C]   5.00-6.00   sec  4.63 MBytes  38.8 Mbits/sec    0    264 KBytes       
[  7][RX-C]   5.00-6.00   sec  29.4 MBytes   247 Mbits/sec                 
[  5][TX-C]   6.00-7.00   sec  4.70 MBytes  39.5 Mbits/sec    0    273 KBytes       
[  7][RX-C]   6.00-7.00   sec  33.6 MBytes   282 Mbits/sec                 
[  5][TX-C]   7.00-8.00   sec  4.67 MBytes  39.2 Mbits/sec    0    270 KBytes       
[  7][RX-C]   7.00-8.00   sec  31.9 MBytes   267 Mbits/sec                 
[  5][TX-C]   8.00-9.00   sec  4.66 MBytes  39.1 Mbits/sec    0    262 KBytes       
[  7][RX-C]   8.00-9.00   sec  31.5 MBytes   265 Mbits/sec                 
[  5][TX-C]   9.00-10.00  sec  4.70 MBytes  39.4 Mbits/sec    0   5.62 KBytes       
[  7][RX-C]   9.00-10.00  sec  31.4 MBytes   264 Mbits/sec                 
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec  49.6 MBytes  41.6 Mbits/sec  136             sender
[  5][TX-C]   0.00-10.02  sec  46.7 MBytes  39.1 Mbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec   320 MBytes   268 Mbits/sec  1054             sender
[  7][RX-C]   0.00-10.02  sec   303 MBytes   254 Mbits/sec                  receiver

iperf Done.


Both iperf runs are in line with what I expect for my line. But the TCP state losses and FW hits still happen.

What still seems strange to me: all of the TCP state losses and FW hits happen on v4 addresses, although the devices have SLAAC GUAs available. Of course, the public servers for which those connection dropouts happen might only have v4 addresses, so I'm not sure whether that is a specific symptom.
The TCP dropouts also happen more often for some apps (e.g. the German ZDF Mediathek) than for others.
For my internal networks, I have never experienced a state loss; only towards the Internet.
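One more data point that could be collected (again a generic pf check on my part, nothing from this thread so far): pf keeps global counters that increase when packets fail to match an existing state:

# pfctl -si | grep -E 'state-mismatch|state-insert|state-limit'

If state-mismatch climbs in step with the dropouts, that would at least confirm the states are lost inside pf rather than somewhere on the path.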

What else could be done to diagnose this? I am close to downgrading to 24.1. The timeouts are really annoying during regular use.
#10
Another shot in the dark from my side (I would not be able to explain it, but maybe someone else can): could it be a VLAN routing problem? The (LAN) interfaces that have this problem on my device are all VLANs. Not LAGGs, but VLANs, just like your setup.
At the same time, I have a road-warrior WireGuard VPN set up on the same box, leaving via the same WAN, which (at least from very light use) has not encountered any problem in this regard.
#11
Yeah, I switched the mirror and then the download worked.

However, even after several more hours on the no_sa kernel, the TCP state losses (and FW rule hits) are still there and completely unchanged.
The "ICMP error message too short (ip6)" messages (which were the initial starting point for this thread) are gone, as the others described, but the TCP behavior did not change.
#12
Quote from: doktornotor on September 06, 2024, 11:01:24 PM
Fetching the kernel works fine for me.

It's very weird, but fetching the kernel no longer works for me: after the status "Fetching kernel-24.7.3-no_sa-amd64.txz" the loading dots just keep running for a long time, with no error message or anything.
The same happens for the other 24.7.3 test kernel in the snapshot directory (https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/sets/).

Trying to fetch a non-existent kernel fails immediately with an error message:
Fetching kernel-24.7.3-test-amd64.txz: ..[fetch: https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/sets/kernel-24.7.3-test-amd64.txz.sig: Not Found] failed, no signature found
I have already reverted to the ZFS snapshot that I set up before my kernel testing earlier this week.
Maybe my OPNsense installation is more broken than I thought.

This means I can't currently verify anything regarding the TCP timeouts with the -no_sa kernel.

EDIT: Scratch that - changing the mirror works. I will now also test the -no_sa kernel more thoroughly than before.
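For anyone following along: test kernels like this are installed from the shell. The usual pattern (from memory, so double-check the exact flags for your release) is:

# opnsense-update -zkr 24.7.3-no_sa

followed by a reboot, which matches the "Fetching kernel-24.7.3-no_sa-amd64.txz" status above.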
#13
I don't run the IGMP Proxy service (actually, I'm not even using IGMP at all in my network).

So, I would assume that this is not related.

As the other thread is currently more active, I posted my assumption based on the current testing there just a few moments ago.
But I am very open to switching to this one if we feel the "ICMP error message too short (ip6)" logs can be targeted with the full-revert kernel (and are therefore manageable), while the state losses don't seem to be affected by the -no_sa kernel.

P.S. I am also a native German speaker, but I think we should keep it international-friendly in this thread :)
#14
Is it possible that the 24.7.3-no_sa kernel is no longer online? Fetching it times out for me (tried it earlier today and again just now).

Anyway: as I said before, I already tried the kernel a few days ago, and the TCP state losses (on v4) were still very much a thing even with the -no_sa kernel. I don't have any logs from that run, but the behavior described by Reza sounds like those problems persist even with the -no_sa kernel:

Quote from: rkube on September 06, 2024, 06:38:06 PM
Other, but perhaps unproblematic, debug messages such as 'pf: loose state match...' and 'pf: dropping packet with ip options' are still logged frequently if debug logging is enabled.

So I am really not sure what is going on, and whether it really is the upstream bug.

It's one thing to see debug messages, but a completely different thing to actually have legitimate packets hit the FW rule due to the state losses.
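(Side note: the 'pf: loose state match' messages only appear once the pf debug level is raised; I assume rkube used something along the lines of

# pfctl -x loud

and turned it back down to a lower level such as urgent afterwards.)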
If the state loss problems should be treated separately from the ICMP ip6 log messages, I am very open to discussing them in this thread: https://forum.opnsense.org/index.php?topic=42657.0
#15
I tried the experimental kernel beforehand and it did not seem to have an effect (however, I did not check the logs; I only noticed that the TCP behavior was not getting better).

If I have the time, I will do some more detailed testing.