24.7.2 IPv6 woes

Started by CruxtheNinth, August 26, 2024, 08:28:06 AM

Oh, so 24.7.4 seems close. I guess, there is always more work to be done... no objections (although I already regret my former advice: looking back I would now rather go with the full SA revert).
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 440 up, Bufferbloat A+

We can still do that. My plan is to test the amended SA as it comes out. If it's still wacky, we go full revert. Ok?


Cheers,
Franco

September 09, 2024, 06:11:42 PM #77 Last Edit: September 09, 2024, 07:06:09 PM by Server07
Quote
to install the  new "fixed" version for testing:

pkg add -f https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/misc/dhcp6c-20240907.pkg

OR

revert to previously known working from 24.7.1

opnsense-revert -r 24.7.1 dhcp6c

reboot afterwards

Many thanks - did the revert and WAN now has a proper 2a00: .... address.
AdguardHome starting after upgrade - see other thread and crash due to service -yy ...




Can we move the off-topic questions somewhere else please?

As far as the SA goes, FreeBSD core responded by making this our problem, so 24.7.4 will scrub the whole SA effort and we can continue to build on good IPv6 connectivity...


Cheers,
Franco

Franco, I do not know if you followed the separate FreeBSD report.

You do not need to read all of that, just look at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281395#c26 and https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281395#c27.

I wonder if some or all reports of this were on Proxmox VMs and were not related to the SA at all. After the Proxmox setting was changed, I could not reproduce the ND problems on FreeBSD 14.1-RELEASE or FreeBSD 14.1-STABLE any more.

But maybe it is just my inability to reproduce the correct environment to bring the problem to life under plain FreeBSD.

I confirmed weird ND behaviour on the first patch of the SA set, namely pf not allowing the ND to be answered which ended up in two excess retransmissions. This was on plain hardware between OPNsense and a FritzBox.

As stated in the other patch, disabling the state tracking brought the behaviour back to normal even with the patches applied. The impractical thing is that disabling state tracking for ICMP, even by rule, will resurface the vulnerability the SA addresses.

Commit https://github.com/openbsd/src/commit/2633ae8c4c8a64 makes this very clear: "Fixes a bug uncovered by one of the previous commits that virtually breaks IPv6 connectivity after few minutes of use." It was eventually pulled in after claims of conspiracy, but the SA has not been amended so far. I feel this has never been our battle to begin with, especially given that FreeBSD release engineering practice has never had such a negative impact on a supported release.

I understand the sentiment that this is our problem now. It made the decision to avoid the bad code completely much easier. Whether it was the right call on the FreeBSD end I doubt, given the technical evidence, but if they want to go by a verifiable test case that is certainly fair. But not now, not at the will of a third party that apparently does not care if FreeBSD releases are reliable or not.


Cheers,
Franco

I am willing to buy you and kp@ any number of beers or other drinks provided that both of you sit down at a table with me at the same time  :)
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

No need, we certainly all made our own choices.


Cheers,
Franco

I also know that I witnessed the ND problem on OPNsense bare metal, so I have no real doubt of what is going on.

I underestimated the difficulty to simulate the needed pf rules on a pure FreeBSD basis just to prove a point by "playing by their rules". I now can fully understand that you lack the enthusiasm to do that.

Do you expect that FreeBSD will at some point fix their code? If I had not filed those two bugs, they could simply have dropped all that.

BTW: When I saw this thread, I already thought that BSI had included CVE-2024-6640 in the list of 11 (!) CVEs contained in that advisory (as luck would have it, it was not).

It is scandalous how these guys bundle up a few CVEs and use the highest CVSS score for one combined advisory, creating a false sense of urgency. Going by that rule, one would have to fix each and every small problem, which creates huge usability concerns. Considerations like reverting the SA to restore operation would be impossible that way.

Quote from: meyergru on September 14, 2024, 01:55:45 AM
I underestimated the difficulty to simulate the needed pf rules on a pure FreeBSD basis just to prove a point by "playing by their rules". I now can fully understand that you lack the enthusiasm to do that.

Do you expect that FreeBSD will at some point fix their code? If I had not filed those two bugs, they could simply have dropped all that.

My point is that I asked the researcher how much he was involved in the review and test of the SA. He told me he was not involved at all. He didn't even know it caused regressions until I told him.

The testing on this has been low effort, while the patch size and age make it high risk. It's basically a feature that was added into all supported FreeBSD releases on the SA/SO fast track, which to my knowledge never happened before. Please correct me if I'm wrong. I'm not overly fond of the strict errata policy (not fixing kernel panics) in supported FreeBSD releases, but it's workable given that nothing ever gets worse either.

For now we wait and see what happens when this hits pfSense. When we have to deal with 14.2 inclusion we will add more patches from OpenBSD and reassess; and I'll try to chase relevant OpenBSD devs on the subject at EuroBSDCon.


Cheers,
Franco

@CruxtheNinth back to the topic at hand :)

I've added

https://github.com/opnsense/dhcp6c/commit/9c42063ba5
https://github.com/opnsense/dhcp6c/commit/0a714091cb

Which should be safe changes:

# pkg add -f https://pkg.opnsense.org/FreeBSD:14:amd64/snapshots/misc/dhcp6c-20240907_1.pkg

Now what I think is a problem is the following...

https://github.com/opnsense/dhcp6c/blob/master/dhcp6c.c#L157-L159

You can see someone tried to be clever and discard random() use when arc4random() is available, but...

# git grep '[^s4]random('
common.c:               ev->retrans = (random() % (SOL_MAX_DELAY));
common.c:                       r = (double)((random() % 1000) + 1) / 10000;
common.c:                       r = (double)((random() % 2000) - 1000) / 10000;
missing/arc4random.c: * a stub function to make random() to return good random numbers.

All of the timer calculations in common.c still use random() without srandom() ever seeding the RNG.

Why can this be a problem?

Because the values produced are likely not random and probably not uniformly distributed either. Changing it all to a consistent arc4random() use seems to cause the weird renewal issues seen before.
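
To illustrate the missing srandom() call: POSIX specifies that an unseeded random() behaves as if srandom(1) had been called, so every process start replays the same sequence. The following is only a minimal sketch and not code from dhcp6c. The helper names draw_unseeded() and draw_seeded() are made up for the illustration, and the modulus 1000 merely mirrors one of the expressions in the grep output above.

```c
#define _XOPEN_SOURCE 600 /* expose random()/srandom() under strict -std modes */
#include <stdlib.h>

/* Hypothetical helper: draw n values the way common.c does,
 * i.e. calling random() without any prior srandom(). */
void draw_unseeded(long *out, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = random() % 1000;
}

/* Hypothetical helper: same draw, but after an explicit srandom(seed). */
void draw_seeded(unsigned int seed, long *out, int n)
{
    int i;

    srandom(seed);
    for (i = 0; i < n; i++)
        out[i] = random() % 1000;
}
```

If draw_unseeded() is the first use of random() in a process, it should return exactly what draw_seeded(1, ...) returns, on every run. For retransmission jitter that means every start of the daemon uses the same "random" delays.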

We will take a closer look, but for now I'd just like to know if the current build presented above doesn't make this worse.


Thanks,
Franco

Okay I found the issue I think:

random() returns the signed type long, while arc4random*() returns an unsigned 32-bit integer.

This underflows during the (random - 1000) step: the calculation is meant to return values between -0.1 and 0.1, but with an unsigned operand the subtraction wraps around and produces values such as this instead:

RAND -0.1 <= 429926.6292 <= 0.1

Which will get the timer stuck or overflow itself as well.
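
The wraparound can be reproduced in isolation. This is a sketch under assumptions, not the dhcp6c code itself: the function names are invented for the example, the expression is copied from the common.c grep output, and the cast in the last function is just one possible way to avoid the problem, not necessarily what the linked commits do. It assumes a platform where int is 32 bits wide.

```c
#include <stdint.h>

/* Pattern from common.c with a signed input, as returned by random():
 * the result lies in [-0.1, 0.1). */
double jitter_signed(long r)
{
    return (double)((r % 2000) - 1000) / 10000;
}

/* Same expression with an unsigned 32-bit input, as returned by
 * arc4random(): (r % 2000) is unsigned, so subtracting 1000 wraps
 * around to a value near 2^32 whenever r % 2000 < 1000. */
double jitter_unsigned_buggy(uint32_t r)
{
    return (double)((r % 2000) - 1000) / 10000;
}

/* One possible fix (an assumption, see above): convert the remainder
 * to a signed type before subtracting. */
double jitter_unsigned_fixed(uint32_t r)
{
    return (double)((int32_t)(r % 2000) - 1000) / 10000;
}
```

With an input of 500, the signed and the fixed variants give -0.05, while the buggy variant gives a value above 400000, far outside the intended range. With an input of 1500 all three agree, which is why the problem only shows up for roughly half of the drawn values.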


Cheers,
Franco


Someone mind testing this second revision above? Ideally it's 24.7.5 material for next week but I would like to be sure.


Cheers,
Franco

September 20, 2024, 12:09:54 PM #89 Last Edit: September 20, 2024, 12:25:24 PM by marjohn56
I'm going to load it now. Seeing some weird IPv6 issues.


I'm running 25.1*. Yesterday I rebooted and everything came up OK, but ten minutes later IPv6 stopped working. The interfaces all have the correct addresses; it just appears there's no route. dpinger is still saying all good. BTW, my ISP runs a five minute lease renewal, even though the address ranges are reserved!


I then killed and restarted dhcp6c and hey presto IPv6 started working and it's been working fine for 24 hours. I'll load the test dhcp6c and reboot, and see what happens.

++++


Same thing: five minutes after a reboot there is no route, and after restarting dhcp6c all is good again. It may not be dhcp6c related, but it's a pita. The only good thing is that it appears I can cause the problem just by restarting OPNsense.

OPNsense 24.7 - Qotom Q355G4 - ISP - Squirrel 1Gbps.

Team Rebellion Member - If we've helped you remember to applaud