OPNsense Forum

Archive => 23.1 Legacy Series => Topic started by: rssor on May 29, 2023, 04:50:22 am

Title: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on May 29, 2023, 04:50:22 am
Edit: Repro/summary below (https://forum.opnsense.org/index.php?topic=34241.msg165693#msg165693)

I'm seeing what appear to be new instances of dpinger/updaterrd.sh appear gradually over time, lining up with DHCPv6 renewal times. I have two wan interfaces, and the dpinger instances are specific to the primary WAN DHCPv6 gateway.

dpinger going wild on only one of 4 monitored gateways (there were well over 150 total):
Code: [Select]
root    12033   0.0  0.1  17732   2180  -  Is   11:16      0:11.11 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    12140   0.0  0.1  17732   2180  -  Is   Fri02      0:13.83 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    12674   0.0  0.1  13504   2108  -  Is   01:17      0:03.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    12776   0.0  0.1  17732   2180  -  Is   20:47      0:06.05 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    13736   0.0  0.1  17732   2180  -  Is   18:16      0:07.63 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    14579   0.0  0.1  17732   2180  -  Is   10:46      0:10.94 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    15498   0.0  0.1  17732   2180  -  Is   19:16      0:07.24 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root    15832   0.0  0.1  17732   2180  -  Is   14:16      0:09.50 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s

All of the dpinger processes operate in parallel resulting in large amounts of ICMP traffic.

updaterrd.sh also showing up at lease renewals (wc -l reports 250+ here)
Code: [Select]
root    94437   0.0  0.0  13508   776  -  Is   Sat07       0:13.04 /bin/sh /var/db/rrd/updaterrd.sh
root    94634   0.0  0.0  13508   720  -  Is   Fri03       0:24.33 /bin/sh /var/db/rrd/updaterrd.sh
root    94713   0.0  0.0  13508   720  -  Is   Fri21       0:17.22 /bin/sh /var/db/rrd/updaterrd.sh
root    94840   0.0  0.0  13508   720  -  Is   Thu23       0:26.19 /bin/sh /var/db/rrd/updaterrd.sh
root    95958   0.0  0.0  13508   780  -  Is   Sat17       0:09.44 /bin/sh /var/db/rrd/updaterrd.sh
root    96010   0.0  0.0  13508   772  -  Ss   12:17       0:02.78 /bin/sh /var/db/rrd/updaterrd.sh
root    96619   0.0  0.0  13508   776  -  Is   Fri15       0:19.62 /bin/sh /var/db/rrd/updaterrd.sh
root    96973   0.0  0.0  13508   776  -  Ss   07:46       0:04.10 /bin/sh /var/db/rrd/updaterrd.sh
root    97095   0.0  0.0  13508   720  -  Is   Fri19       0:17.42 /bin/sh /var/db/rrd/updaterrd.sh
root    97793   0.0  0.0  13508   776  -  Ss   Sat07       0:13.07 /bin/sh /var/db/rrd/updaterrd.sh
root    98511   0.0  0.0  13508   776  -  Ss   22:46       0:07.43 /bin/sh /var/db/rrd/updaterrd.sh
root    99268   0.0  0.0  13508   720  -  Is   Thu22       0:26.39 /bin/sh /var/db/rrd/updaterrd.sh
root    99893   0.0  0.0  13508   720  -  Is   Fri20       0:17.44 /bin/sh /var/db/rrd/updaterrd.sh

I have temporarily terminated monitoring of the WAN IPv6 gateway, but the updaterrd.sh processes continue to creep up over time.

The best clue I've found so far is in the logs as this newwanipv6 script running constantly, but only for this particular interface.

Code: [Select]
2023-05-28T22:19:26-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: No IP change detected (current: 2600:WAN:IP::, interface: WAN[wan])
2023-05-28T22:17:22-04:00 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '70636'' returned exit code '1', the output was 'kill: 70636: No such process'
2023-05-28T22:17:22-04:00 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '70636'' returned exit code '1', the output was 'kill: 70636: No such process'

I suspect that this is to me having to provide a DHCPv6 configuration override in order to pull multiple prefixes simultaneously (AT&T CPE quirk, must pull multiple /64s instead of one larger /60). Clarification: Each /64 prefix must have its own IA_PD, otherwise the server will not respond.

My tentative next steps are to try to specify a configuration override identical to the non-overridden version that pulls a single /64, just to figure out if this is an error in how overridden DHCPv6 configs function or if it's an error in my actual configuration or an unavoidable side effect of having multiple delegated prefixes confusing the system.

The machine is still in its weird state, is there any additional information I should collect before rebooting and continuing to try to track this down?

Even if this is an error in my dhcpv6 config override, are the dpinger instances not realizing that the pid file already corresponds to an already-running dpinger for that interface its own separate bug?
Title: Re: 23.1.8: Hundreds of dpinger and updaterrd.sh processes
Post by: rssor on May 29, 2023, 07:02:05 am
I still haven't rebooted, but I poked at it a little more and I think the root cause is the presence of multiple delegated prefixes. I still suspect there's some other unrelated bug in whatever kicks dpinger off failing to kill off the old one, but fixing this bug should make the others not matter.

Just after a manual dhcp reload
Code: [Select]
root@gw:~ # /usr/local/sbin/ifctl -i hn1 -6p
2600:WAN:0:f::/64
2600:WAN:0:e::/64
2600:WAN:0:c::/64
2600:WAN:0:8::/64
2600:WAN:0:d::/64

After the first renewal
Code: [Select]
root@gw:~ # /usr/local/sbin/ifctl -i hn1 -6p
2600:WAN:0:f::/64


I instrumented the wan script slightly to log more verbosely, and it looks like it might be related to https://github.com/opnsense/dhcp6c/commit/c3343396236b1b6bf03d7ead5c3bf638044228d7 (NOTE: on second thought, probably not, since each of my prefixes renew separately so there's only ever one per invocation of the script).

I'm just printing $PDNEW, $PDOLD, $ARGS and $PDINFO
Code: [Select]
2023-05-29T00:04:31-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:e::/64 is not 2600:WAN:0:f::/64 - -a 2600:WAN:0:f::/64 - 2600:WAN:0:f::/64
2023-05-29T00:04:31-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:c::/64 is not 2600:WAN:0:e::/64 - -a 2600:WAN:0:e::/64 - 2600:WAN:0:e::/64
2023-05-29T00:04:31-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:8::/64 is not 2600:WAN:0:c::/64 - -a 2600:WAN:0:c::/64 - 2600:WAN:0:c::/64
2023-05-29T00:04:31-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:d::/64 is not 2600:WAN:0:8::/64 - -a 2600:WAN:0:8::/64 - 2600:WAN:0:8::/64
2023-05-29T00:04:31-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) is not 2600:WAN:0:d::/64 - -a 2600:WAN:0:d::/64 - 2600:WAN:0:d::/64
2023-05-29T00:04:31-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:f::/64

This is causing FORCE=force in /var/etc/dhcp6c_wan_script.sh, which I think explains most of the rest of the behavior.

Is the intended solution that I need to provide a customized dhcp6c_wan_script.sh in addition to the dhcp6 or is this meant to work gracefully on its own?

Since each prefix appears to get its own renewal/script invocation, PDINFO only ever has one prefix in it, so the most recent prefix winds up being the only winner as the script can't know which subset is actually being updated.

I'll keep poking at it, but I don't quite follow why it initially seems to get every prefix but then renews them piecemeal.

EDIT: So, this seems to be an unavoidable consequence of being required to use separate IA_PDs for each /64 prefix, which violates assumptions made elsewhere since dhcp6c will split up the renewals, depriving the script of necessary information.

I think it should be possible to workaround this if dhcp6c, when populating PDINFO, also provides the IAID associated with the prefixes provided, in the vicinity of https://github.com/opnsense/dhcp6c/blob/555bdcdeff3e0f450d50a713e46aa39394544fa5/dhcp6c_script.c#L415 . At present, the wan script seems to assume that all prefixes are from the same IA_PD and that every renew response includes prefix information from every IA_PD. I might be misunderstanding something because even if there was only one IA_PD wouldn't the IA_NA message renewal clear out the prefix information and trigger this same issue? I'll chalk this up to the limits of my understanding of this system for now.
Title: Re: 23.1.8: Hundreds of dpinger and updaterrd.sh processes
Post by: rssor on May 29, 2023, 10:19:08 am
Okay, I think I actually have a minimal repro without a dhcp config override. I think the multiple IA_PD prefixes via such an override just exacerbates the issue due to the extra invocations of the automatically generated script.

I believe the minimal relevant configuration is:

This is 30 minutes after boot, just after the first renewal, having made no configuration changes in the UI since booting:
Code: [Select]
root    30467   0.0  0.1  13504   2552  -  Is   04:01     0:00.02 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr
root    93748   0.0  0.1  13504   2552  -  Is   03:31     0:00.13 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr

Causing 2 pings per second presently, I assume it'll go up to 3 in another 30 due to the short lifetime handed out by the equipment. I did modify a single line of the dhcp6c script for this interface to print some env vars.

I'm still not sure if this is the root cause, but because the IA_NA and IA_PD are renewed in separate messages, even in a totally standard configuration it looks like the dhcp6 script always runs FORCE=force because only one of the renewals has the PD information.

Code: [Select]
2023-05-29T04:01:24-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) is not 2600:WAN:0:f::/64 - -a 2600:WAN:0:f::/64 2600:WAN:0:f::/64
2023-05-29T04:01:24-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:f::/64 is not -

The modified line 32 from the automatically generated script:
Code: [Select]
        /usr/bin/logger -t dhcp6c "dhcp6c_script: $REASON on hn1 renewal (PDINFO) ${PDOLD} is not ${PDNEW} - ${ARGS} ${PDINFO}"

EDIT:

Wait, is this just a race condition? It looks like I have multiple rc.newwanipv6's running at the same time, overlapping.

Code: [Select]
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : webgui_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : vxlan_configure_do())
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : webgui_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : vxlan_configure_do())
2023-05-29T04:31:25-04:00 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '36660'' returned exit code '1', the output was 'kill: 36660: No such process'
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : unbound_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : openssh_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : opendns_configure_do())
2023-05-29T04:31:25-04:00 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/usr/local/sbin/ntpd -g -c '/var/etc/ntpd.conf'' returned exit code '1', the output was ''
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : unbound_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : openssh_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : opendns_configure_do())
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : ntpd_configure_do())
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : dnsmasq_configure_do())
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (,wan)
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure vpn (execute task : openvpn_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure vpn (execute task : ipsec_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure vpn (,wan)
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : ntpd_configure_do())
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (execute task : dnsmasq_configure_do())
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure newwanip (,wan)
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure vpn (execute task : openvpn_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure vpn (execute task : ipsec_configure_do(,wan))
2023-05-29T04:31:25-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure vpn (,wan)
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure monitor (,WAN_DHCP6)
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: keeping current inet6 default gateway 'fe80::12c4:caff:fe39:8051%hn1'
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: configuring inet6 default gateway on wan
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: entering configure using 'wan'
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:WAN:0:0::, old: 2600:WAN:0:0::, interface: WAN[wan], device: hn1, force: yes)
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: plugins_configure monitor (,WAN_DHCP6)
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: keeping current inet6 default gateway 'fe80::12c4:caff:fe39:8051%hn1'
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: configuring inet6 default gateway on wan
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: entering configure using 'wan'
2023-05-29T04:31:24-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:WAN:0:0::, old: 2600:WAN:0:0::, interface: WAN[wan], device: hn1, force: yes)
2023-05-29T04:31:24-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) is not 2600:WAN:0:f::/64 - -a 2600:WAN:0:f::/64 2600:WAN:0:f::/64
2023-05-29T04:31:24-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 executing
2023-05-29T04:31:24-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 renewal (PDINFO) 2600:WAN:0:f::/64 is not -
2023-05-29T04:31:24-04:00 Notice dhcp6c dhcp6c_script: RENEW on hn1 executing

It seems like it doesn't always get an extra dpinger, it must just depend on how the execution overlaps. Multiple IA_PD renews must just result in it clashing enough that it accumulates duplicates much more rapidly. But even a single delegated prefix in a stock configuration is at risk of triggering this issue.

Should the configctl line in the script have that -d arg that causes it to return immediately? dhcp6c looks like it would block and wait, which might be the desired behavior here so we wouldn't be processing more than one of these renewals at a time. That said, maybe this actually needs to be enforced in rc.newwanipv6 itself.

Or a combination of both, since those ifctl calls might cause surprising things if the dhcp6c script and another rc.newwanipv6 are in flight at the same time...
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: bringha on May 29, 2023, 11:35:26 am
Hi all,

not sure whether my thing is related but this morning I found my opnsense in a kind of blocked state, last night at 3.40am it began to start countless processes of

Code: [Select]
/usr/local/bin/php /usr/local/etc/rc.newwanipv6 pppoe0 force

until 10am I had about 400 processes running with 100% CPU and 92% Memory load. Reboot failed, only a hard power cycle brought the system back.

It seems also be related to dhcp6c, in the system log I found every minute several of these logs
Code: [Select]
<13>1 2023-05-29T03:39:53+02:00 OPNsense.zuhause.xx dhcp6c 25660 - [meta sequenceId="1"] dhcp6c_script: REQUEST on pppoe0 executing
<13>1 2023-05-29T03:39:53+02:00 OPNsense.zuhause.xx dhcp6c 28830 - [meta sequenceId="2"] dhcp6c_script: REQUEST on pppoe0 renewal (REASON)
<13>1 2023-05-29T03:39:54+02:00 OPNsense.zuhause.xx opnsense 29301 - [meta sequenceId="3"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2003:XX:XXXX:XXXX
:3eec:efff:fe57:21ce, old: 2003:XX:XXXX:XXXX:3eec:efff:fe57:21ce, interface: WAN[wan], device: pppoe0, force: yes)
<13>1 2023-05-29T03:39:54+02:00 OPNsense.zuhause.xx opnsense 29301 - [meta sequenceId="4"] /usr/local/etc/rc.newwanipv6: plugins_configure dhcp (,inet6)
<13>1 2023-05-29T03:39:54+02:00 OPNsense.zuhause.xx opnsense 29301 - [meta sequenceId="5"] /usr/local/etc/rc.newwanipv6: plugins_configure dhcp (execute task : dhc
pd_dhcp_configure(,inet6))
The dhcpd logs show that it is tried to constantly start dhcpd although it is running already.

Race condition might be a reason, however this seems to be new since 23.1.8 (never seen this before)

Br br
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on May 29, 2023, 01:29:59 pm
I'm hoping I've root caused this correctly, and filed an issue at https://github.com/opnsense/dhcp6c/issues/34
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on May 29, 2023, 02:16:20 pm
Thanks for the analysis... I'm not entirely sure if the previous logic was right so I would try to fix this in core via:

https://github.com/opnsense/core/commit/03519568aa

If we talk about RENEW/REBIND we don't have a difference in resulting PDINFO in its entirety anyway. In general it sucks a bit that REQUEST behaves so different to RENEW/REBIND, but complaining about inconsistency in IPv6 is just trying to explain IPv6 in general. More of the same. :)


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on May 30, 2023, 05:28:14 am
Thanks for the rapid fix. Back to the original config (5 IA_PDs, 1 IA_NA) where there are 6 instances of wannewipv6 every 30 minutes running simultaneously it's been holding steady with seemingly none of the old negative behavior.

It looks like I can still collect additional dpinger processes over time but the duplicates I have now match when I made configuration changes or restarted services and are on the v4 gateways. I'm assuming it's just the inherent raciness of multiple new address scripts running simultaneously. Not really a problem for steady operation in practice.

It's a shame that the dhcp6c process is so stingy about sharing its lease state. The logic in core did make sense to me, it was just let down by dhcp6c refusing to give the scripts the rest of the info it has buried in there somewhere. Although since address changes during renewals are rare I suppose it doesn't really matter in practice.

Thanks again!

Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on May 30, 2023, 08:51:38 am
It's a shame that the dhcp6c process is so stingy about sharing its lease state. The logic in core did make sense to me, it was just let down by dhcp6c refusing to give the scripts the rest of the info it has buried in there somewhere. Although since address changes during renewals are rare I suppose it doesn't really matter in practice.

It's sort of surprisingly inconsistent, but I think that is actually either from the RFC or how it was always carried out in practice in the IPv6 field. In theory you can batch renewals/rebinds, but for one the timers inside dhcp6c need to be consolidated and DHCPv6 servers all need to properly support this. Seeing how uncommonly supported multiple PD handling is out (nevermind ISPs handing them out as they are normally just ignored) I wouldn't be surprised some servers cannot handle such a batch renew/rebind.

Care to debug the dpinger issue you are seeing further?


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on May 30, 2023, 10:29:46 am
In theory you can batch renewals/rebinds, but for one the timers inside dhcp6c need to be consolidated and DHCPv6 servers all need to properly support this. Seeing how uncommonly supported multiple PD handling is out (nevermind ISPs handing them out as they are normally just ignored) I wouldn't be surprised some servers cannot handle such a batch renew/rebind.
Yeah, that's why my first thought was to rather than change the behavior of the client, just forcibly extract all the PD information from the dhcp6_if* iaconf_list available in the caller of client6_script by plumbing it through as an argument, since it does seem that all the current data is available there (and that line of control flow would have just been iterating over the same IA list before getting to client6_script anyways).

That way PDINFO is just always the entire current state, since really all the script needs to do is figure out if anything changed, not really any reason to limit it to just the PDs in the particular response since the rest of the system doesn't really care about which IA_PD any given prefix came from anyways.

So less work than coalescing the timers and risking any kind of behavior change, but much more work than simply only worrying about the initial assignment in the first place.  :)

And almost certainly not worth the bother, since it's not like renews will ever change any of it in the real world.

Care to debug the dpinger issue you are seeing further?
Sure! And I come with some corrections: it actually doesn't look like it's from configuration updates/service restarts, it's actually just that some (possibly just the ipv4) gateway dpingers double up on boot sometimes. So my most recent boot has 2 on each ipv4 interface, it looks like the boot before that I had it double up on just WAN2, and I know I've seen it boot with only the 4 expected as well.

Code: [Select]
root    15021   0.0  0.1  13504   2128  -  Is   13:01      0:02.90 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN_IP -p /var/run/dpinger_WAN_DHCP.pi
root    45935   0.0  0.1  13504   2128  -  Is   13:01      0:03.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN_IP -p /var/run/dpinger_WAN_DHCP.pi
root    43971   0.0  0.1  13504   2132  -  Is   13:01      0:02.85 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:WAN:IP:: -p /v
root    49476   0.0  0.1  13504   2132  -  Is   13:01      0:03.02 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP6 -B 2603:WAN2:IP:: -p /va
root    18056   0.0  0.1  13504   2128  -  Is   13:01      0:03.01 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP -B 67.WAN2_IP -p /var/run/dpinger_WAN2_DHCP.p
root    55265   0.0  0.1  13504   2128  -  Is   13:01      0:03.22 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP -B 67.WAN2_IP -p /va

I've restarted radvd and updated firewall rules a bunch which is what I thought had done it before, but everything seems to actually stay fixed at where it was after boot. So this particular boot seems to be the worst case and also... not that bad since it apparently won't get worse.

If the occasional extra dpinger on boot is worth tracking down, is there any info in particular that would help debug it?
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on May 31, 2023, 10:06:41 am
Let's try a simple one for now: https://github.com/opnsense/core/commit/74c3125e0d2f

# opnsense-patch 74c3125e0d2f

Might not be it but it's good to rule it out.


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 01, 2023, 11:36:58 am
So after 5 reboots, that patch hasn't had any extra dpingers. I do still have anywhere between 1-3 updaterrd.sh scripts running after boot, so that might also need the same treatment?

I did go ahead and collect some before-patch data before applying, but since this patch seems to have addressed the issue it's kind of pointless now. :) Maybe it'll at least help confirm what was going on.

23.1.9, no patches (well, I forgot to revert the other 2 before updating to 23.1.9, and I don't know if reverting them would partially downgrade me, so I'm assuming/hoping they're harmless)

Reboot 1: 0 extra
Reboot 2: 1 extra (WAN2_DHCP)
Reboot 3: 0 extra
Reboot 4: 1 extra (WAN2_DHCP)
Reboot 5: 0 extra
Reboot 6: 2 extra (WAN_DHCP, WAN2_DHCP)
Reboot 7: 1 extra (WAN_DHCP6) !! this is not ipv4 only
Reboot 8: 1 extra (WAN2_WHCP6)
Reboot 9: 1 extra (WAN2_DHCP)

Example when 2 extra dpingers exist (WAN_DHCP, WAN2_DHCP):
Code: [Select]
root@gw:~ # opnsense-log | grep dping
<13>1 2023-06-01T04:28:21-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="219"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 56603 - [meta sequenceId="274"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 58156 - [meta sequenceId="281"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 59296 - [meta sequenceId="286"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 60996 - [meta sequenceId="293"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 62698 - [meta sequenceId="301"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T04:28:26-04:00 gw.domain.tld opnsense 36428 - [meta sequenceId="368"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))

Example when 1 extra dpinger exists (WAN_DHCP6):
Code: [Select]
<13>1 2023-06-01T04:34:40-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="220"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 97726 - [meta sequenceId="270"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 7700 - [meta sequenceId="277"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 17484 - [meta sequenceId="282"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 35563 - [meta sequenceId="287"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 46648 - [meta sequenceId="299"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 64173 - [meta sequenceId="304"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:34:45-04:00 gw.domain.tld opnsense 67519 - [meta sequenceId="363"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))


Another example when 1 extra dpinger exists (WAN2_DHCP6), with more info:
Code: [Select]
root@gw:~ # opnsense-log | grep "renewal starting"
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 17978 - [meta sequenceId="260"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 67.WAN2, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 25179 - [meta sequenceId="267"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 104.WAN, old: , interface: WAN[wan], device: hn1, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 33607 - [meta sequenceId="276"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2603:SPECTRUM, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 51419 - [meta sequenceId="286"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:ATT, old: , interface: WAN[wan], device: hn1, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 57618 - [meta sequenceId="288"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2603:SPECTRUM, old: , interface: WAN2[opt4], device: hn4, force: yes)
<13>1 2023-06-01T04:38:31-04:00 gw.domain.tld opnsense 35985 - [meta sequenceId="315"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:ATT, old: , interface: WAN[wan], device: hn1, force: yes)
root@gw:~ # opnsense-log | grep dping
<13>1 2023-06-01T04:38:26-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="217"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 17978 - [meta sequenceId="266"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 25179 - [meta sequenceId="274"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 33607 - [meta sequenceId="280"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 51419 - [meta sequenceId="292"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 57618 - [meta sequenceId="295"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 57115 - [meta sequenceId="303"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T04:38:31-04:00 gw.domain.tld opnsense 35985 - [meta sequenceId="319"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:38:32-04:00 gw.domain.tld opnsense 73205 - [meta sequenceId="386"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
root@gw:~ # opnsense-log | grep TERM
<11>1 2023-06-01T04:38:25-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="201"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '59132'' returned exit code '1', the output was 'kill: 59132: No such process'
<11>1 2023-06-01T04:38:26-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="206"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '78770'' returned exit code '1', the output was 'kill: 78770: No such process'
<11>1 2023-06-01T04:38:31-04:00 gw.domain.tld opnsense 25179 - [meta sequenceId="357"] /usr/local/etc/rc.newwanip: The command '/bin/kill -'TERM' '89507'' returned exit code '1', the output was 'kill: 89507: No such process'
<11>1 2023-06-01T04:38:32-04:00 gw.domain.tld opnsense 33607 - [meta sequenceId="22"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '63279'' returned exit code '1', the output was 'kill: 63279: No such process'
<11>1 2023-06-01T04:38:33-04:00 gw.domain.tld opnsense 57618 - [meta sequenceId="39"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '77855'' returned exit code '1', the output was 'kill: 77855: No such process'
<11>1 2023-06-01T04:38:33-04:00 gw.domain.tld opnsense 35985 - [meta sequenceId="40"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '28005'' returned exit code '1', the output was 'kill: 28005: No such process'

Last example, IPv4 when 1 extra dpinger exists (WAN2_DHCP):
Code: [Select]
root@gw:~ # opnsense-log | grep "renewal starting"
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 49018 - [meta sequenceId="262"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 67.WAN2, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 67077 - [meta sequenceId="269"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 104.WAN, old: , interface: WAN[wan], device: hn1, force: no)
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 83739 - [meta sequenceId="276"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2603:SPECTRUM, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 596 - [meta sequenceId="287"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:ATT, old: , interface: WAN[wan], device: hn1, force: no)
root@gw:~ # opnsense-log | grep dping
<13>1 2023-06-01T05:01:38-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="218"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 49018 - [meta sequenceId="268"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 67077 - [meta sequenceId="275"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 83739 - [meta sequenceId="280"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 596 - [meta sequenceId="293"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 11080 - [meta sequenceId="302"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T05:01:42-04:00 gw.domain.tld opnsense 98545 - [meta sequenceId="373"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
root@gw:~ # opnsense-log | grep TERM
<11>1 2023-06-01T05:01:37-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="203"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '67954'' returned exit code '1', the output was 'kill: 67954: No such process'
<11>1 2023-06-01T05:01:37-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="204"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '78770'' returned exit code '1', the output was 'kill: 78770: No such process'
<11>1 2023-06-01T05:01:42-04:00 gw.domain.tld opnsense 83739 - [meta sequenceId="356"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '97307'' returned exit code '1', the output was 'kill: 97307: No such process'
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 03, 2023, 01:43:16 pm
Sorry for the late follow-up. Looks like more service-related snafu here:

https://github.com/opnsense/core/commit/fdf46f317c3
https://github.com/opnsense/core/commit/0200f79a19

# opnsense-patch fdf46f317c3 0200f79a19


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 04, 2023, 03:30:22 pm
Have only rebooted three times since applying, but have only seen 2 rrd processes since, one marked daemon with the pid of the other, so that seems to have cleared it up.

On occasion I'm seeing a dpinger process die or never fail to start (usually for the WAN2_DHCP6 gateway), and it looks like it might be related to the lease taking a long time to acquire initially. I'll see if I can find a reliable repro for this, but basically while it will eventually get a lease and be able to pass traffic, it shows up as down forever on the dashboard, both the gateway and the dpinger instance for that gateway in the services list.

I've got one last problem which is sometimes after boot it's just straight up using the wrong gateway. The UI says that the active gateways are WAN for v4/v6 (reflecting the configuration), but looking at System -> Routes -> Status WAN2 is the default. Sometimes it's only the IPv4 gateway that's wrong, sometimes it's the IPv6 gateway that's wrong. System -> Gateways -> Single always shows the intended gateways as active, but the actual active routes disagree.

This is certainly some totally unrelated issue from where I started though, so I can start another thread if that would be appropriate.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 05, 2023, 09:00:37 am
The two issues are likely related. Are you using a gateway group and/or default gateway switching?

The GUI shows the state according to what it thinks is correct at the moment. It doesn't care if that's not the case. If you e.g. edit a gateway and hit apply this should fix itself... Some of this might be caused by dpinger not correctly alerting in all relevant trigger cases or not being able to run (see below).

For different reasons the dpinger process might not be able to start/acquire an address to listen on (in case of IPv6 that might be a DAD timer -- duplicate address detection grace period in which the new address is not allowed to be used). This or a similar issue with address reading/use.

For both things the development version has relevant patches that can help (but wouldn't rule out there is another issue still).


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 05, 2023, 12:38:18 pm
Yes, two gateway groups doing failover, one for IPv4, one for IPv6, default switching enabled and once up it mostly works fine. (Well, except that long-running connections that got made on WAN2 aren't killed when the primary gateway comes back up, but it looks like intended behavior that the source tracking table isn't getting killed on gateway recovery since that feature got removed on purpose. So I think I just have some scripting to do here.  :) )

Looking at the logs, it seems like it might just be racy, I'm not sure enough time elapsed for it to be related to dpinger. Gateway health was green and there's not even a blip of packet loss in the health graphs. 9:19:03 is when it finally got things right. Seems strange that they seem to disagree on gateway status during the same second in the timestamps.

Log messages from boot with "rout"
Code: [Select]
2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: setting inet default route to 104.56.90.1
2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
2023-06-04T09:19:03-04:00 Notice dhclient New Routers (hn1): 104.56.90.1
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (1)
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: setting inet6 default route to fe80::12c4:caff:fe39:8051%hn1
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on wan
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet default gateway '67.8.168.1'
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
2023-06-04T09:09:04-04:00 Notice kernel <118>[+] Backgrounding route monitor
2023-06-04T09:09:04-04:00 Notice kernel <118>[#] route -q -n add -inet 10.6.0.0/16 -interface wg1
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (1)
2023-06-04T09:09:03-04:00 Error opnsense /usr/local/etc/rc.routing_configure: The command '/sbin/route add -inet '10.6.0.0/16' -interface 'wg1'' returned exit code '65', the output was 'route: interface 'wg1' does not exist'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to 67.8.168.1
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: entering configure using 'wan'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: entering configure using 'opt4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: keeping current inet default gateway '104.56.90.1'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: setting inet default route to 104.56.90.1
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt4'
2023-06-04T09:08:59-04:00 Notice kernel <118>Starting router advertisement service...done.
2023-06-04T09:08:59-04:00 Notice kernel <118>Setting up routes...done.
2023-06-04T09:08:58-04:00 Error opnsense /usr/local/etc/rc.bootup: The command '/sbin/route add -inet '10.6.0.0/16' -interface 'wg1'' returned exit code '65', the output was 'route: interface 'wg1' does not exist'
2023-06-04T09:08:58-04:00 Error opnsense /usr/local/etc/rc.bootup: ROUTING: refusing to set inet gateway on addressless opt5(wg1)
2023-06-04T09:08:58-04:00 Notice opnsense /usr/local/etc/rc.bootup: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP, WAN2_DHCP6
2023-06-04T09:08:58-04:00 Notice opnsense /usr/local/etc/rc.bootup: ROUTING: entering configure using defaults
2023-06-04T09:08:58-04:00 Notice dhclient New Routers (hn4): 67.8.168.1
2023-06-04T09:08:58-04:00 Notice dhclient New Routers (hn1): 104.56.90.1

Log messages from boot with "gateway"
Code: [Select]
2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on wan
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet default gateway '67.8.168.1'
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: keeping current inet default gateway '104.56.90.1'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:00-04:00 Notice kernel <118>Setting up gateway monitors...done.
2023-06-04T09:08:59-04:00 Notice opnsense /usr/local/etc/rc.bootup: Gateway currently empty for 2001:4860:4860::8888 on wan
2023-06-04T09:08:58-04:00 Error opnsense /usr/local/etc/rc.bootup: ROUTING: refusing to set inet gateway on addressless opt5(wg1)
2023-06-04T09:08:58-04:00 Notice opnsense /usr/local/etc/rc.bootup: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP, WAN2_DHCP6

Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 05, 2023, 12:54:42 pm
As an extra data point would you mind switching to the development version? Because I've rewritten the gateway watcher monitoring/alert and merged it into a single service that would be more deterministic at least from the side we can control. IPv4 and IPv6 could still race depending on the timing with the ISP but it would be good to have that data point as well.


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 05, 2023, 06:14:31 pm
Sure thing. Should I revert all the various patches first, or can I just flip it over just fine?

I'm unsure if I was supposed to revert any of the patches before the last upgrade to 23.1.9 too.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 06, 2023, 09:08:51 am
Every time you install a package (and this includes the core package for stable or development) all known files are reset to their known state undoing all patches (except the ones that have been added into the package in between).

Mileage may vary here depending on where and when patches were committed and when the last release build was. One handy but sometimes counter-intuitive feature of opnsense-patch (or "patch" in general) is that it will reverse a patch already in place so sometimes people get confused by installing a version that already has the change and then undoing it and testing again to see it fail.

But since we talk mostly about minor glitches here I think it's not that imperative. All patches discussed here will be in the community version 23.1.10 anyway. No need to retest these :)


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 06, 2023, 10:51:49 pm
No such luck, got a broken dpinger on the first reboot.

Code: [Select]
2023-06-06T16:47:45-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-06-06T16:47:44-04:00 Notice opnsense-devel /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-06T16:47:44-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
2023-06-06T16:47:42-04:00 Notice opnsense-devel /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))

WAN2_DHCP6's dpinger service shows as not running in the dashboard and the gateway shows as failed with no stats.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 07, 2023, 04:16:31 am
In fact, I'm not sure if this is new or if this has been happening, but for some reason opnsense is now sending DHCP (IPv4) renewal requests for WAN2 out of my WAN interface with my WAN2 IP as the source address.

It looks like this is caused by the route to the DHCP server being covered by my primary WAN interface:

Code: [Select]
root@gw:/var/etc # route get 72.31.136.237
   route to: 072-031-136-237.res.spectrum.com
destination: default
       mask: default
    gateway: 104-56-90-1.lightspeed.dybhfl.sbcglobal.net
        fib: 0
  interface: hn1
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire

EDIT: Actually, I think this is an upstream bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266508

Except that it's not quite root caused correctly, it's not because they're both on the same subnet, it's because one of the interfaces already has a route to the server in question. Since my Spectrum DHCP server isn't on the same subnet as my Spectrum WAN IP, it winds up trying to renew out the wrong interface because the primary gateway is the route. It's smart enough to use the right source IP, but not smart enough to only send it out the interface that's allowed to originate traffic with that IP. Brilliant.

That seems particularly egregious as upstream bugs go, but upstream is gonna upstream I guess.

EDIT 2: And it seems I can’t try to work around this with policy based routing, because primary WAN also uses DHCP and thus has the automatic firewall rule that immediately allows all outgoing DHCP before my rule can redirect it out WAN2 where it belongs. :(
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 07, 2023, 10:11:02 am
Thanks for testing... it's strange that this pops up now if it's a fundamental issue. Switching back to community will work around this?

For dpinger at least it looks like it's trying to set it up but failing eventually. Might have been started but stopped later on. A lingering PID file would tell us perhaps.


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 07, 2023, 02:35:52 pm
After reverting back to community I seem to have seen a duplicate dpinger once again, on WAN_DHCP, on a boot where WAN2_DHCP6 also failed to start as it did on devel:

This is from community, apologies for not having a set of this output pidfiles from devel. I can try switching back again tomorrow.

Code: [Select]
root@gw:~ # ps aux | grep dpin
root     4197   0.0  0.1  13504  2580  -  Is   08:15    0:00.05 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B fe80::215:5dff:fe01:2405%hn1 -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 fe80::12c4:caff:fe39:8051%hn1
root     4433   0.0  0.1  13504  2548  -  Is   08:15    0:00.05 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 104.56.90.1
root     5337   0.0  0.1  13504  2548  -  Is   08:15    0:00.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP -B 67.WA2 -p /var/run/dpinger_WAN2_DHCP.pid -u /var/run/dpinger_WAN2_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 67.8.168.1
root    98809   0.0  0.1  13504  2548  -  Is   08:15    0:00.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 104.56.90.1
root    59089   0.0  0.1  12752  2228  0  S+   08:26    0:00.00 grep dpin
root@gw:~ # ls -l /var/run/dping*
-rw-r-----  1 root  wheel  5 Jun  7 08:15 /var/run/dpinger_WAN2_DHCP.pid
srw-rw-rw-  1 root  wheel  0 Jun  7 08:15 /var/run/dpinger_WAN2_DHCP.sock
-rw-r-----  1 root  wheel  5 Jun  7 08:15 /var/run/dpinger_WAN_DHCP.pid
srw-rw-rw-  1 root  wheel  0 Jun  7 08:15 /var/run/dpinger_WAN_DHCP.sock
-rw-r-----  1 root  wheel  5 Jun  7 08:15 /var/run/dpinger_WAN_DHCP6.pid
srw-rw-rw-  1 root  wheel  0 Jun  7 08:15 /var/run/dpinger_WAN_DHCP6.sock

For that DHCP wrong interface thing, there probably aren't a lot people doing multi-wan with:

I do have some pcaps of that DHCP behavior over both interfaces, and essentially while all the unicast request attempts go out the wrong interface it does eventually go back to a broadcast (with the same transaction ID) which is going out the right interface at which point the request goes through.

I can think of a few ways to work around the issue, but it would involve an option that makes the existing allow outgoing DHCP automatic rules more specific or an option to add even earlier automatic floating rules to implement policy based routing for the unicast requests.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 07, 2023, 02:44:00 pm
Wait, some of this is an error on my part, I went back to community but I did not apply the three most recent patches again, false alarm on the duplicate dpinger!

Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 07, 2023, 05:53:21 pm
I think the DHCP issues are even more fundamental on multi-WAN setups, and it's even present on community. Here's en example of me attempting a 'Reload' in the Interface Overview for WAN2:

WAN2 PCAP:
Code: [Select]
38 11:38:43.310005 0.0.0.0 255.255.255.255 DHCP 342 DHCP Request  - Transaction ID 0x81518a2f
39 11:38:43.390628 72.31.136.237 67.WAN2_IP DHCP 359 DHCP ACK      - Transaction ID 0x81518a2f

Out of WAN1:
Code: [Select]
91 11:38:43.390695 72.31.136.237 67.WAN2_IP DHCP 359 DHCP ACK      - Transaction ID 0x81518a2f

In summary:
Sends the request from 0.0.0.0 out WAN2, gets the response to WAN address (I'm assuming it cleared the address from that interface first), and immediately routes it back out the other WAN interface.

Code: [Select]
WAN 2023-06-07T11:48:34-04:00 <- 72.31.136.237:67 67.WAN2_IP:68 udp let out anything from firewall host itself
WAN2 2023-06-07T11:48:34-04:00 -> 72.31.136.237:67 67.WAN2_IP:68 udp allow DHCP client on WAN2


The rules making this happen are so early I'm not sure I can even prevent this from happening.

Also hitting 'Reload' on the interface overview is one of the things that can make WAN2_DHCP6's dpinger instance fail and not recover.

EDIT: and, to follow up on the /var/run dpinger pidfiles for WAN2_DHCP6 when it fails: they are missing, so I think something deleted them.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 08, 2023, 10:22:54 am
It might make sense to wait for 23.1.10 and test more on the community version to reproduce the renew/rebind issue you are seeing. I've added the updaterrd/dpinger patches to that upcoming version to avoid more manual patching.


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 09, 2023, 06:27:29 pm
Sure. I suspect reproducing the DHCPv4 stuff will be no problem at all, and with longer-running pcaps I think I've got it figured out. I don't think it's new, I just hadn't used OPNSense before the end of May.

Given:
Once the lease T1 timer is hit, dhclient attempts a unicast renewal to the DHCP server. FreeBSD kernel elects to deliver this out WAN1 based on the default route, despite using WAN2's address as the source IP. Primary WAN ISP clearly just drops this.

Once the lease T2 timer is hit, dhclient reverts to broadcasts and this does go out the right interface, crisis is averted.

dhclient's behavior of unicast renewals at T1 and broadcasts after T2 is correct, the only error is the interface used. With freebsd not having SO_BINDTODEVICE or an equivalent, I'm not sure dhclient can even be fixed directly.

This also explains why nobody has noticed: upstream ISP likely just drops the bad traffic and dhclient is able to renew at T2, so people wouldn't notice T1 renewals failing in this kind of a multi-wan setup.

Additionally, the automatic rules are a bit too loose and inadvertently allow received DHCP messages to be reflected from one WAN interface to the other. as a result, it's not possible to manually add a firewall rule to implement policy based routing.

I'd need a rule like "src: WAN2 address dest: * gateway: WAN2" as (I think) a floating rule (or at least WAN_GROUP) to stop this from happening.

Assuming I'm not wildly off this probably warrants its own separate issue. A slightly stricter outgoing dhcp allow rule would allow me to work around the issue with a manual policy based route, without having to fight upstream.

Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: billimek on June 10, 2023, 04:52:01 pm
I found this thread after experiencing the same symptoms.  Running 23.1.9 with no patches.  I had a system load of over 100 and rebooted to try to fix it before finding this thread.

I have gateway configuration set-up but don't really need it, and wondering if I disable that feature, if this will solve the problem?

(https://i.imgur.com/Sd5H06S.png)

(https://i.imgur.com/C9RNRiY.png)

(https://i.imgur.com/gFaYiKT.png)
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 10, 2023, 07:12:40 pm
If your issue is the same as mine, then you need up to the three patches in
Code: [Select]
opnsense-patch 74c3125e0d2f
opnsense-patch fdf46f317c3 0200f79a19
The first one at minimum.

To determine if your issue is the same as mine, look at the start time of the dpinger processes, then look through your logs for RENEW. If the new dpingers are starting when DHCPv6 renewals occur then it might be the same issue.

That said 23.1.9 did prevent the underlying cause from being triggered from the DHCPv6 renewals, at least for me.

My issue at least was not caused by multiple gateways, but rather the monitoring on its own in combination with the newwanipv6 logic running multiple times simultaneously.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 12, 2023, 11:34:44 am
About the unicast routing issue I think this is a little difficult to address it judging by the past unsolved problems not just in FreeBSD:

2013: https://lists.isc.org/pipermail/dhcp-users/2013-February/016398.html
2016: https://bugzilla.redhat.com/show_bug.cgi?id=1297445
2022: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266508 (mentioned by rssor before)


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 12, 2023, 12:22:27 pm
Only chance that I see here... not sure if this works... hard to test...

https://github.com/opnsense/src/commit/5dc1ca155bf1e5

I'm going to let this run locally for the time being to see what happens on renewal WRT the warning() log message.


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 12, 2023, 07:50:51 pm
In the event it's not possible to fix the unicast issue, am I right that a new floating automatic egress rule buried in advanced settings might address the issue? Then there's the secondary issue with reflected responses I think might be fixable with a second rule.

Rule 1: It seems like I can define an egress rule that matches on src ip of a WAN interface and forces it out the correct gateway, and if that's correct the only issue is that I can't add it before the automatic DHCP out rule I can't control.

Is that viable, or would having a rule like that the needs to be updated on IP changes as a floating egress problematic?

Rule 2: The only other rule I don't think I can make at the moment is preventing entirely a packet with incoming interface WAN1 going out WAN2 and the other way around (automatic DHCP rules permit it before I can block it).

This would solve the issue during a manual reload, where the broadcast goes out and I get a unicast response, but since the interface doesn't have the address yet it reflects it out the other interface. But since these are both gateways these are actually the only two interfaces on the entire machine I never want traffic to pass between.
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 27, 2023, 03:09:08 pm
And thoughts on this? I'm not very optimistic about the patches doing anything to help here, but they also do not appear to have a downside that I could find.


Cheers,
Franco
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: rssor on June 27, 2023, 04:18:30 pm
After seeing that even the on Linux the same unicast routing issue seems to exist (policy based routing is the only option there as well) I don't think any patch will work, so just some way to narrow the allowed out rules (so that I can use policy based rules to prevent the traffic from moving from one WAN interface to the other) is probably the only action worth taking here.

The reason I want the slightly tweaked rules is that for the traffic I posted on June 7th:
Code: [Select]
WAN 2023-06-07T11:48:34-04:00 <- 72.31.136.237:67 67.WAN2_IP:68 udp let out anything from firewall host itself
WAN2 2023-06-07T11:48:34-04:00 -> 72.31.136.237:67 67.WAN2_IP:68 udp allow DHCP client on WAN2
Has a DHCP response from WAN2's DHCP server come in on WAN2 and get routed out WAN1 because the "let out anything from firewall host itself" rule doesn't actually verify that it's from the firewall host itself.

I'm trying to avoid touching the machine until 23.7 comes out, as it has a kernel fix I need to avoid constant filesystem corruption on the hardware. ( https://github.com/opnsense/src/commit/567cc4e6bfd92d7351e385569f2bb4b7c89b6db0 )
Title: Re: 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
Post by: franco on June 28, 2023, 01:31:02 pm
You are right. We should pin the gateway to the automatic firewall rules, but I'm not sure if we can do that without specifying the IP address... perhaps "(em1:0)" as source works with kernel address expansion, but it would have to be a floating outgoing rule since we don't know where it originates from?

(Sorry for stating something obvious perhaps, not enough time to go through the thread again at the moment.)


Cheers,
Franco