1
23.1 Legacy Series / 23.1.8: dhcp6c script race condition causes hundreds of dpinger processes?
« on: May 29, 2023, 04:50:22 am »
Edit: Repro/summary below
I'm seeing what appear to be new instances of dpinger/updaterrd.sh appear gradually over time, lining up with DHCPv6 renewal times. I have two wan interfaces, and the dpinger instances are specific to the primary WAN DHCPv6 gateway.
dpinger going wild on only one of 4 monitored gateways (there were well over 150 total):
All of the dpinger processes operate in parallel resulting in large amounts of ICMP traffic.
updaterrd.sh also showing up at lease renewals (wc -l reports 250+ here)
I have temporarily terminated monitoring of the WAN IPv6 gateway, but the updaterrd.sh processes continue to creep up over time.
The best clue I've found so far is in the logs as this newwanipv6 script running constantly, but only for this particular interface.
I suspect that this is to me having to provide a DHCPv6 configuration override in order to pull multiple prefixes simultaneously (AT&T CPE quirk, must pull multiple /64s instead of one larger /60). Clarification: Each /64 prefix must have its own IA_PD, otherwise the server will not respond.
My tentative next steps are to try to specify a configuration override identical to the non-overridden version that pulls a single /64, just to figure out if this is an error in how overridden DHCPv6 configs function or if it's an error in my actual configuration or an unavoidable side effect of having multiple delegated prefixes confusing the system.
The machine is still in its weird state, is there any additional information I should collect before rebooting and continuing to try to track this down?
Even if this is an error in my dhcpv6 config override, are the dpinger instances not realizing that the pid file already corresponds to an already-running dpinger for that interface its own separate bug?
I'm seeing what appear to be new instances of dpinger/updaterrd.sh appear gradually over time, lining up with DHCPv6 renewal times. I have two wan interfaces, and the dpinger instances are specific to the primary WAN DHCPv6 gateway.
dpinger going wild on only one of 4 monitored gateways (there were well over 150 total):
Code: [Select]
root 12033 0.0 0.1 17732 2180 - Is 11:16 0:11.11 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 12140 0.0 0.1 17732 2180 - Is Fri02 0:13.83 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 12674 0.0 0.1 13504 2108 - Is 01:17 0:03.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 12776 0.0 0.1 17732 2180 - Is 20:47 0:06.05 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 13736 0.0 0.1 17732 2180 - Is 18:16 0:07.63 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 14579 0.0 0.1 17732 2180 - Is 10:46 0:10.94 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 15498 0.0 0.1 17732 2180 - Is 19:16 0:07.24 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
root 15832 0.0 0.1 17732 2180 - Is 14:16 0:09.50 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:2600:WAN:IP:: -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s
All of the dpinger processes operate in parallel resulting in large amounts of ICMP traffic.
updaterrd.sh also showing up at lease renewals (wc -l reports 250+ here)
Code: [Select]
root 94437 0.0 0.0 13508 776 - Is Sat07 0:13.04 /bin/sh /var/db/rrd/updaterrd.sh
root 94634 0.0 0.0 13508 720 - Is Fri03 0:24.33 /bin/sh /var/db/rrd/updaterrd.sh
root 94713 0.0 0.0 13508 720 - Is Fri21 0:17.22 /bin/sh /var/db/rrd/updaterrd.sh
root 94840 0.0 0.0 13508 720 - Is Thu23 0:26.19 /bin/sh /var/db/rrd/updaterrd.sh
root 95958 0.0 0.0 13508 780 - Is Sat17 0:09.44 /bin/sh /var/db/rrd/updaterrd.sh
root 96010 0.0 0.0 13508 772 - Ss 12:17 0:02.78 /bin/sh /var/db/rrd/updaterrd.sh
root 96619 0.0 0.0 13508 776 - Is Fri15 0:19.62 /bin/sh /var/db/rrd/updaterrd.sh
root 96973 0.0 0.0 13508 776 - Ss 07:46 0:04.10 /bin/sh /var/db/rrd/updaterrd.sh
root 97095 0.0 0.0 13508 720 - Is Fri19 0:17.42 /bin/sh /var/db/rrd/updaterrd.sh
root 97793 0.0 0.0 13508 776 - Ss Sat07 0:13.07 /bin/sh /var/db/rrd/updaterrd.sh
root 98511 0.0 0.0 13508 776 - Ss 22:46 0:07.43 /bin/sh /var/db/rrd/updaterrd.sh
root 99268 0.0 0.0 13508 720 - Is Thu22 0:26.39 /bin/sh /var/db/rrd/updaterrd.sh
root 99893 0.0 0.0 13508 720 - Is Fri20 0:17.44 /bin/sh /var/db/rrd/updaterrd.sh
I have temporarily terminated monitoring of the WAN IPv6 gateway, but the updaterrd.sh processes continue to creep up over time.
The best clue I've found so far is in the logs as this newwanipv6 script running constantly, but only for this particular interface.
Code: [Select]
2023-05-28T22:19:26-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: No IP change detected (current: 2600:WAN:IP::, interface: WAN[wan])
2023-05-28T22:17:22-04:00 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '70636'' returned exit code '1', the output was 'kill: 70636: No such process'
2023-05-28T22:17:22-04:00 Error opnsense /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '70636'' returned exit code '1', the output was 'kill: 70636: No such process'
I suspect that this is to me having to provide a DHCPv6 configuration override in order to pull multiple prefixes simultaneously (AT&T CPE quirk, must pull multiple /64s instead of one larger /60). Clarification: Each /64 prefix must have its own IA_PD, otherwise the server will not respond.
My tentative next steps are to try to specify a configuration override identical to the non-overridden version that pulls a single /64, just to figure out if this is an error in how overridden DHCPv6 configs function or if it's an error in my actual configuration or an unavoidable side effect of having multiple delegated prefixes confusing the system.
The machine is still in its weird state, is there any additional information I should collect before rebooting and continuing to try to track this down?
Even if this is an error in my dhcpv6 config override, are the dpinger instances not realizing that the pid file already corresponds to an already-running dpinger for that interface its own separate bug?