radvd stops announcing IPv6 prefix after a while (radvd freeze?)

Started by direx, September 08, 2020, 07:53:05 AM

Hi,

I have a problem which was introduced after updating to 20.7:

After roughly two days of uptime on my OPNsense box, IPv6 in my networks stops working. This has nothing to do with a changing prefix (mine changes every 180 days); I found that radvd no longer announces the IPv6 prefix. As a result, all clients eventually lose IPv6 connectivity.

Clicking the restart button for "radvd" in the web UI fixes this, and clients regain connectivity afterwards. The strange part is that the radvd process is still running (output before the restart):


# ps aux|grep rad
root    42763   0.0  0.1 1061048  3196  -  Ss   Sun21       0:30.35 /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog


Neither radvd.conf nor the output of "netstat -6an" changes across a radvd restart.
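A small helper makes this before/after comparison repeatable. It is a sketch, not an OPNsense tool: the config path is taken from the radvd command line above, and the function name `snapshot` and the `/tmp/before`, `/tmp/after` directories are hypothetical.

```shell
#!/bin/sh
# Config path from the radvd command line shown above; override if different.
RADVD_CONF="${RADVD_CONF:-/var/etc/radvd.conf}"

# Save the generated radvd config and the IPv6 socket table into directory $1.
snapshot() {
    mkdir -p "$1"
    cp "$RADVD_CONF" "$1/radvd.conf"
    netstat -6an > "$1/netstat.txt"
}
```

Usage: run `snapshot /tmp/before`, restart radvd from the web UI, run `snapshot /tmp/after`, then `diff -r /tmp/before /tmp/after`. Expect some noise in the netstat part, since queue counters can differ between runs.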

This really looks like a bug to me (radvd freezing), but I don't know how to debug it. Any hints on how to get to the root cause of the radvd issue? It looks like the "strace" command is not available, so I am a little helpless here.
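One possible starting point, sketched under assumptions: FreeBSD's base system ships `truss` (a system call tracer comparable to strace; `ktrace -p` plus `kdump` is another option), and `tcpdump` can show whether router solicitations arrive and whether advertisements still go out. The interface name `igb1` and the helper function names are hypothetical; the filter assumes no IPv6 extension headers sit in front of the ICMPv6 header.

```shell
#!/bin/sh
# Hypothetical LAN interface name; set IFACE to the interface radvd serves.
IFACE="${IFACE:-igb1}"
RADVD_PID="${RADVD_PID:-/var/run/radvd.pid}"

# pcap filter for Router Solicitations (ICMPv6 type 133) and Router
# Advertisements (type 134). ip6[40] is the first byte after the fixed
# 40-byte IPv6 header, i.e. the ICMPv6 type when no extension headers follow.
ra_filter() {
    echo 'icmp6 and (ip6[40] == 133 or ip6[40] == 134)'
}

# Show RS/RA traffic: solicitations arriving with no advertisements
# leaving would point at radvd (or the socket it listens on).
watch_ra() {
    tcpdump -n -vv -i "$IFACE" "$(ra_filter)"
}

# Attach to the running radvd and print its system calls (run as root).
trace_radvd() {
    truss -p "$(cat "$RADVD_PID")"
}
```

Running `watch_ra` while the problem is present, then `trace_radvd` to see whether radvd is blocked or still looping through its timers, may help separate a hung daemon from packets being dropped lower in the stack.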


Regards,
direx

Yesterday, after a cold boot, I didn't notice I had no IPv6 until 90 minutes later, and fixing it required a save/apply of the unchanged WAN interface followed by a save/apply of the unchanged LAN interface. Then routing started. Hosts appeared to have IPv6 addresses but no connectivity (IPv6 is monitored by a Nagios ping). There are a few more IPv6 threads that may be related (one was solved by moving to the 21.1 development version). In my testing, and possibly unrelated to the problem above, radvd does not seem to respond directly to router solicitations from hosts. It does send periodic advertisements, and I increased their frequency using the manual settings.

https://forum.opnsense.org/index.php?topic=18868.0
https://forum.opnsense.org/index.php?topic=18549.0
https://forum.opnsense.org/index.php?topic=18591.0

Just an FYI. Oh, and I have not seen the problem you describe where radvd stops altogether. You might want to try the manual router advertisement settings. Something definitely seems wrong compared to the 20.1.x series.

Cheers.
HP T730/AMD  RX-427BB/8GB/500GB SSD
HP NC365T 4-PORT


I can confirm this, too: two OPNsense systems are affected by this issue.

I registered on this forum just to say that I've been hit with this problem too.

A restart of the radvd service fixes the problem immediately, but radvd then stops working again after 24 to 48 hours (IPv6 router solicitation stops working) until the service is restarted manually.

Somehow the kernel hits a limit for multicast join/leave in FreeBSD 12. We haven't had the chance to debug this further and there seem to be no relevant patches in FreeBSD.
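If the multicast theory holds, one thing worth watching is whether the LAN interface still belongs to the all-routers group ff02::2, which a router daemon listens on for solicitations. A hedged sketch using FreeBSD's `ifmcstat`: the interface name `igb1` is hypothetical, the function names are made up for illustration, and the match assumes `ifmcstat` prints group addresses such as `ff02::2%igb1`. A missing group while the bug is present would support the theory but would not prove it.

```shell
#!/bin/sh
# Hypothetical LAN interface; replace with the interface radvd serves.
IFACE="${IFACE:-igb1}"

# Succeed if the text on stdin lists ff02::2 (but not ff02::22, ff02::2:...).
has_allrouters() {
    grep -Eq 'ff02::2([^0-9a-fA-F:]|$)'
}

# Log whether the interface is currently joined to the all-routers group.
check_membership() {
    if ifmcstat -i "$IFACE" -f inet6 | has_allrouters; then
        echo "$(date): ff02::2 joined on $IFACE"
    else
        echo "$(date): ff02::2 NOT joined on $IFACE"
    fi
}
```

Calling `check_membership` every few minutes (for example in a `while :; do check_membership; sleep 300; done` loop) would show whether the membership disappears around the time RAs stop.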


Cheers,
Franco

I'm new to OPNsense, but noticed a couple of days ago that IPv6 addresses were not being given to devices on the LAN side. A restart of radvd got it working again.

Would a potential work-around for this issue be to set up a cron job to pkill and then restart radvd once a day?
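A minimal sketch of such a workaround, assuming the paths and arguments from the `ps aux` output earlier in the thread are what OPNsense uses on your box. It bypasses the web UI's service handling (the restart button remains the supported path), and the script name `/root/radvd-restart.sh` is hypothetical.

```shell
#!/bin/sh
# Defaults copied from the radvd command line in the `ps aux` output earlier
# in this thread; override via the environment if your paths differ.
RADVD_BIN="${RADVD_BIN:-/usr/local/sbin/radvd}"
RADVD_PID="${RADVD_PID:-/var/run/radvd.pid}"
RADVD_CONF="${RADVD_CONF:-/var/etc/radvd.conf}"

restart_radvd() {
    if [ -s "$RADVD_PID" ]; then
        pid=$(cat "$RADVD_PID")
        kill "$pid" 2>/dev/null
        # Give the old daemon up to about 5 seconds to exit.
        i=0
        while kill -0 "$pid" 2>/dev/null && [ "$i" -lt 5 ]; do
            sleep 1
            i=$((i + 1))
        done
    fi
    # Start a fresh instance with the same arguments seen in `ps aux`.
    "$RADVD_BIN" -p "$RADVD_PID" -C "$RADVD_CONF" -m syslog
}

# Only act when called as "radvd-restart.sh restart", so sourcing is harmless.
case "${1:-}" in
    restart) restart_radvd ;;
esac
```

A root crontab entry such as `0 4 * * * /root/radvd-restart.sh restart` would run it daily; a shorter interval like `*/30 * * * *` trades more restarts for a smaller outage window.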


Hey folks,

I just found that our office in Frankfurt is suffering the same problem. They have an uplink ISP that does not provide v6 at all, so we deployed an OPNsense (20.7) to route through a WireGuard tunnel to our main office in Karlsruhe.

It works great, if it weren't for the router advertisements.

Since the OPNsense box in Frankfurt is the only router for v6 in the LAN, we are in control of everything, and Macs (our preferred developer platform) will probably honour almost anything - would it be a viable workaround to switch to DHCPv6 from SLAAC?

From https://github.com/opnsense/core/issues/4338 I gather the issue is not fully fixed in the update planned for tomorrow?

Thanks!
Patrick

EDIT: I just found out that DHCPv6 does not work without RAs... OK.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

One question:

Why are we using radvd at all? I assume it is this product:
http://www.litech.org/radvd/

The FreeBSD base system contains rtadvd which is running in production on our site without a single problem ...

Kind regards,
Patrick


Yes, but that's a port of an external piece of software:
https://www.freshports.org/net/radvd/

rtadvd is in base and basically what we run everywhere if it's plain FreeBSD and not OPNsense:
https://www.freebsd.org/cgi/man.cgi?query=rtadvd&apropos=0&sektion=0&manpath=FreeBSD+12.1-RELEASE&arch=default&format=html

Why the extra package?
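For comparison, enabling the base-system rtadvd on plain FreeBSD is a few `/etc/rc.conf` lines. This is a sketch for a stock FreeBSD host, not for OPNsense (which generates its own configuration), and `igb1` is a hypothetical LAN interface name.

```shell
# /etc/rc.conf fragment for plain FreeBSD (not OPNsense).
# igb1 is a hypothetical LAN interface name.
ipv6_gateway_enable="YES"   # act as an IPv6 router; rtadvd expects this
rtadvd_enable="YES"         # start the base-system rtadvd at boot
rtadvd_interfaces="igb1"    # interfaces to send router advertisements on
```

After `service rtadvd start`, rtadvd uses `/etc/rtadvd.conf` for per-interface tuning; for an interface without an entry, it falls back to defaults and advertises the on-link prefixes it finds for that interface.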

Can the component in question be rolled back to what was used in 20.1? Or could a watchdog be added to restart it, or restarting radvd be made an option in the cron task menu?

Quote from: CloudHoppingFlowerChild on October 23, 2020, 04:19:43 AM
Can the element in question be rolled back to what was used in 20.1? Add a watchdog to restart it or make restarting radvd an option in the cron task menu?

The base HardenedBSD was also upgraded to 12 with the 20.7 release. As discussed on the OPNsense GitHub, this issue might involve more than just radvd, so simply reverting to the old radvd package might not guarantee a solution. That said, I'm not sure why radvd is used instead of FreeBSD's native rtadvd.

For what it's worth, I am restarting radvd via cron every 30 minutes as a stopgap for now.

I think the problem is in the kernel, since radvd itself is the same version. A rollback is not easily possible in this regard.

As for why radvd and not rtadvd... radvd came before or was more reliable (think over 10 years ago) and nobody did the work to evaluate rtadvd migration since then. Maybe now that 12.1 is out and provides challenges to radvd it's time to do this evaluation, but even then the kernel issue might be affecting rtadvd too.

At this point it is still too early to tell.


Cheers,
Franco