Menu

Show posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.

Show posts Menu

Messages - rssor

#1
After seeing that even the on Linux the same unicast routing issue seems to exist (policy based routing is the only option there as well) I don't think any patch will work, so just some way to narrow the allowed out rules (so that I can use policy based rules to prevent the traffic from moving from one WAN interface to the other) is probably the only action worth taking here.

The reason I want the slightly tweaked rules is that for the traffic I posted on June 7th:

WAN 2023-06-07T11:48:34-04:00 <- 72.31.136.237:67 67.WAN2_IP:68 udp let out anything from firewall host itself
WAN2 2023-06-07T11:48:34-04:00 -> 72.31.136.237:67 67.WAN2_IP:68 udp allow DHCP client on WAN2

Has a DHCP response from WAN2's DHCP server come in on WAN2 and get routed out WAN1 because the "let out anything from firewall host itself" rule doesn't actually verify that it's from the firewall host itself.

I'm trying to avoid touching the machine until 23.7 comes out, as it has a kernel fix I need to avoid constant filesystem corruption on the hardware. ( https://github.com/opnsense/src/commit/567cc4e6bfd92d7351e385569f2bb4b7c89b6db0 )
#2
In the event it's not possible to fix the unicast issue, am I right that a new floating automatic egress rule buried in advanced settings might address the issue? Then there's the secondary issue with reflected responses I think might be fixable with a second rule.

Rule 1: It seems like I can define an egress rule that matches on src ip of a WAN interface and forces it out the correct gateway, and if that's correct the only issue is that I can't add it before the automatic DHCP out rule I can't control.

Is that viable, or would having a rule like that the needs to be updated on IP changes as a floating egress problematic?

Rule 2: The only other rule I don't think I can make at the moment is preventing entirely a packet with incoming interface WAN1 going out WAN2 and the other way around (automatic DHCP rules permit it before I can block it).

This would solve the issue during a manual reload, where the broadcast goes out and I get a unicast response, but since the interface doesn't have the address yet it reflects it out the other interface. But since these are both gateways these are actually the only two interfaces on the entire machine I never want traffic to pass between.
#3
If your issue is the same as mine, then you need up to the three patches in

opnsense-patch 74c3125e0d2f
opnsense-patch fdf46f317c3 0200f79a19

The first one at minimum.

To determine if your issue is the same as mine, look at the start time of the dpinger processes, then look through your logs for RENEW. If the new dpingers are starting when DHCPv6 renewals occur then it might be the same issue.

That said 23.1.9 did prevent the underlying cause from being triggered from the DHCPv6 renewals, at least for me.

My issue at least was not caused by multiple gateways, but rather the monitoring on its own in combination with the newwanipv6 logic running multiple times simultaneously.
#4
Sure. I suspect reproducing the DHCPv4 stuff will be no problem at all, and with longer-running pcaps I think I've got it figured out. I don't think it's new, I just hadn't used OPNSense before the end of May.

Given:

  • Multi-WAN configuration
  • Failover WAN (called WAN2) ISP uses DHCP relay for DHCP server not on WAN2's net

Once the lease T1 timer is hit, dhclient attempts a unicast renewal to the DHCP server. FreeBSD kernel elects to deliver this out WAN1 based on the default route, despite using WAN2's address as the source IP. Primary WAN ISP clearly just drops this.

Once the lease T2 timer is hit, dhclient reverts to broadcasts and this does go out the right interface, crisis is averted.

dhclient's behavior of unicast renewals at T1 and broadcasts after T2 is correct, the only error is the interface used. With freebsd not having SO_BINDTODEVICE or an equivalent, I'm not sure dhclient can even be fixed directly.

This also explains why nobody has noticed: upstream ISP likely just drops the bad traffic and dhclient is able to renew at T2, so people wouldn't notice T1 renewals failing in this kind of a multi-wan setup.

Additionally, the automatic rules are a bit too loose and inadvertently allow received DHCP messages to be reflected from one WAN interface to the other. as a result, it's not possible to manually add a firewall rule to implement policy based routing.

I'd need a rule like "src: WAN2 address dest: * gateway: WAN2" as (I think) a floating rule (or at least WAN_GROUP) to stop this from happening.

Assuming I'm not wildly off this probably warrants its own separate issue. A slightly stricter outgoing dhcp allow rule would allow me to work around the issue with a manual policy based route, without having to fight upstream.

#5
I think the DHCP issues are even more fundamental on multi-WAN setups, and it's even present on community. Here's en example of me attempting a 'Reload' in the Interface Overview for WAN2:

WAN2 PCAP:

38 11:38:43.310005 0.0.0.0 255.255.255.255 DHCP 342 DHCP Request  - Transaction ID 0x81518a2f
39 11:38:43.390628 72.31.136.237 67.WAN2_IP DHCP 359 DHCP ACK      - Transaction ID 0x81518a2f


Out of WAN1:

91 11:38:43.390695 72.31.136.237 67.WAN2_IP DHCP 359 DHCP ACK      - Transaction ID 0x81518a2f


In summary:
Sends the request from 0.0.0.0 out WAN2, gets the response to WAN address (I'm assuming it cleared the address from that interface first), and immediately routes it back out the other WAN interface.


WAN 2023-06-07T11:48:34-04:00 <- 72.31.136.237:67 67.WAN2_IP:68 udp let out anything from firewall host itself
WAN2 2023-06-07T11:48:34-04:00 -> 72.31.136.237:67 67.WAN2_IP:68 udp allow DHCP client on WAN2



The rules making this happen are so early I'm not sure I can even prevent this from happening.

Also hitting 'Reload' on the interface overview is one of the things that can make WAN2_DHCP6's dpinger instance fail and not recover.

EDIT: and, to follow up on the /var/run dpinger pidfiles for WAN2_DHCP6 when it fails: they are missing, so I think something deleted them.
#6
Wait, some of this is an error on my part, I went back to community but I did not apply the three most recent patches again, false alarm on the duplicate dpinger!

#7
After reverting back to community I seem to have seen a duplicate dpinger once again, on WAN_DHCP, on a boot where WAN2_DHCP6 also failed to start as it did on devel:

This is from community, apologies for not having a set of this output pidfiles from devel. I can try switching back again tomorrow.


root@gw:~ # ps aux | grep dpin
root     4197   0.0  0.1  13504  2580  -  Is   08:15    0:00.05 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B fe80::215:5dff:fe01:2405%hn1 -p /var/run/dpinger_WAN_DHCP6.pid -u /var/run/dpinger_WAN_DHCP6.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 fe80::12c4:caff:fe39:8051%hn1
root     4433   0.0  0.1  13504  2548  -  Is   08:15    0:00.05 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 104.56.90.1
root     5337   0.0  0.1  13504  2548  -  Is   08:15    0:00.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP -B 67.WA2 -p /var/run/dpinger_WAN2_DHCP.pid -u /var/run/dpinger_WAN2_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 67.8.168.1
root    98809   0.0  0.1  13504  2548  -  Is   08:15    0:00.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN -p /var/run/dpinger_WAN_DHCP.pid -u /var/run/dpinger_WAN_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 20 -d 0 104.56.90.1
root    59089   0.0  0.1  12752  2228  0  S+   08:26    0:00.00 grep dpin
root@gw:~ # ls -l /var/run/dping*
-rw-r-----  1 root  wheel  5 Jun  7 08:15 /var/run/dpinger_WAN2_DHCP.pid
srw-rw-rw-  1 root  wheel  0 Jun  7 08:15 /var/run/dpinger_WAN2_DHCP.sock
-rw-r-----  1 root  wheel  5 Jun  7 08:15 /var/run/dpinger_WAN_DHCP.pid
srw-rw-rw-  1 root  wheel  0 Jun  7 08:15 /var/run/dpinger_WAN_DHCP.sock
-rw-r-----  1 root  wheel  5 Jun  7 08:15 /var/run/dpinger_WAN_DHCP6.pid
srw-rw-rw-  1 root  wheel  0 Jun  7 08:15 /var/run/dpinger_WAN_DHCP6.sock


For that DHCP wrong interface thing, there probably aren't a lot people doing multi-wan with:

  • Primary and failover interfaces as DHCP clients
  • Failover interface ISP DHCP server not on the same subnet as the failover interface address

I do have some pcaps of that DHCP behavior over both interfaces, and essentially while all the unicast request attempts go out the wrong interface it does eventually go back to a broadcast (with the same transaction ID) which is going out the right interface at which point the request goes through.

I can think of a few ways to work around the issue, but it would involve an option that makes the existing allow outgoing DHCP automatic rules more specific or an option to add even earlier automatic floating rules to implement policy based routing for the unicast requests.
#8
In fact, I'm not sure if this is new or if this has been happening, but for some reason opnsense is now sending DHCP (IPv4) renewal requests for WAN2 out of my WAN interface with my WAN2 IP as the source address.

It looks like this is caused by the route to the DHCP server being covered by my primary WAN interface:


root@gw:/var/etc # route get 72.31.136.237
   route to: 072-031-136-237.res.spectrum.com
destination: default
       mask: default
    gateway: 104-56-90-1.lightspeed.dybhfl.sbcglobal.net
        fib: 0
  interface: hn1
      flags: <UP,GATEWAY,DONE,STATIC>
recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire


EDIT: Actually, I think this is an upstream bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266508

Except that it's not quite root caused correctly, it's not because they're both on the same subnet, it's because one of the interfaces already has a route to the server in question. Since my Spectrum DHCP server isn't on the same subnet as my Spectrum WAN IP, it winds up trying to renew out the wrong interface because the primary gateway is the route. It's smart enough to use the right source IP, but not smart enough to only send it out the interface that's allowed to originate traffic with that IP. Brilliant.

That seems particularly egregious as upstream bugs go, but upstream is gonna upstream I guess.

EDIT 2: And it seems I can't try to work around this with policy based routing, because primary WAN also uses DHCP and thus has the automatic firewall rule that immediately allows all outgoing DHCP before my rule can redirect it out WAN2 where it belongs. :(
#9
No such luck, got a broken dpinger on the first reboot.


2023-06-06T16:47:45-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-06-06T16:47:44-04:00 Notice opnsense-devel /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-06T16:47:44-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
2023-06-06T16:47:43-04:00 Notice opnsense-devel /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
2023-06-06T16:47:42-04:00 Notice opnsense-devel /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))


WAN2_DHCP6's dpinger service shows as not running in the dashboard and the gateway shows as failed with no stats.
#10
Sure thing. Should I revert all the various patches first, or can I just flip it over just fine?

I'm unsure if I was supposed to revert any of the patches before the last upgrade to 23.1.9 too.
#11
Yes, two gateway groups doing failover, one for IPv4, one for IPv6, default switching enabled and once up it mostly works fine. (Well, except that long-running connections that got made on WAN2 aren't killed when the primary gateway comes back up, but it looks like intended behavior that the source tracking table isn't getting killed on gateway recovery since that feature got removed on purpose. So I think I just have some scripting to do here.  :) )

Looking at the logs, it seems like it might just be racy, I'm not sure enough time elapsed for it to be related to dpinger. Gateway health was green and there's not even a blip of packet loss in the health graphs. 9:19:03 is when it finally got things right. Seems strange that they seem to disagree on gateway status during the same second in the timestamps.

Log messages from boot with "rout"

2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: setting inet default route to 104.56.90.1
2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
2023-06-04T09:19:03-04:00 Notice dhclient New Routers (hn1): 104.56.90.1
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (1)
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: setting inet6 default route to fe80::12c4:caff:fe39:8051%hn1
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on wan
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet default gateway '67.8.168.1'
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
2023-06-04T09:09:04-04:00 Notice kernel <118>[+] Backgrounding route monitor
2023-06-04T09:09:04-04:00 Notice kernel <118>[#] route -q -n add -inet 10.6.0.0/16 -interface wg1
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: plugins_configure monitor (1)
2023-06-04T09:09:03-04:00 Error opnsense /usr/local/etc/rc.routing_configure: The command '/sbin/route add -inet '10.6.0.0/16' -interface 'wg1'' returned exit code '65', the output was 'route: interface 'wg1' does not exist'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: setting inet default route to 67.8.168.1
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: entering configure using defaults
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: entering configure using 'wan'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: entering configure using 'opt4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: keeping current inet default gateway '104.56.90.1'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: setting inet default route to 104.56.90.1
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt4'
2023-06-04T09:08:59-04:00 Notice kernel <118>Starting router advertisement service...done.
2023-06-04T09:08:59-04:00 Notice kernel <118>Setting up routes...done.
2023-06-04T09:08:58-04:00 Error opnsense /usr/local/etc/rc.bootup: The command '/sbin/route add -inet '10.6.0.0/16' -interface 'wg1'' returned exit code '65', the output was 'route: interface 'wg1' does not exist'
2023-06-04T09:08:58-04:00 Error opnsense /usr/local/etc/rc.bootup: ROUTING: refusing to set inet gateway on addressless opt5(wg1)
2023-06-04T09:08:58-04:00 Notice opnsense /usr/local/etc/rc.bootup: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP, WAN2_DHCP6
2023-06-04T09:08:58-04:00 Notice opnsense /usr/local/etc/rc.bootup: ROUTING: entering configure using defaults
2023-06-04T09:08:58-04:00 Notice dhclient New Routers (hn4): 67.8.168.1
2023-06-04T09:08:58-04:00 Notice dhclient New Routers (hn1): 104.56.90.1


Log messages from boot with "gateway"

2023-06-04T09:19:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on wan
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet default gateway '67.8.168.1'
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:04-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: configuring inet default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.routing_configure: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: keeping current inet6 default gateway 'fe80::2bc:60ff:fe92:4419%hn4'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: configuring inet6 default gateway on opt4
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanipv6: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: keeping current inet default gateway '104.56.90.1'
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: configuring inet default gateway on wan
2023-06-04T09:09:03-04:00 Notice opnsense /usr/local/etc/rc.newwanip: ROUTING: ignoring down gateways: WAN_DHCP6, WAN2_DHCP6
2023-06-04T09:09:00-04:00 Notice kernel <118>Setting up gateway monitors...done.
2023-06-04T09:08:59-04:00 Notice opnsense /usr/local/etc/rc.bootup: Gateway currently empty for 2001:4860:4860::8888 on wan
2023-06-04T09:08:58-04:00 Error opnsense /usr/local/etc/rc.bootup: ROUTING: refusing to set inet gateway on addressless opt5(wg1)
2023-06-04T09:08:58-04:00 Notice opnsense /usr/local/etc/rc.bootup: ROUTING: ignoring down gateways: WAN_DHCP, WAN_DHCP6, WAN2_DHCP, WAN2_DHCP6


#12
Have only rebooted three times since applying, but have only seen 2 rrd processes since, one marked daemon with the pid of the other, so that seems to have cleared it up.

On occasion I'm seeing a dpinger process die or never fail to start (usually for the WAN2_DHCP6 gateway), and it looks like it might be related to the lease taking a long time to acquire initially. I'll see if I can find a reliable repro for this, but basically while it will eventually get a lease and be able to pass traffic, it shows up as down forever on the dashboard, both the gateway and the dpinger instance for that gateway in the services list.

I've got one last problem which is sometimes after boot it's just straight up using the wrong gateway. The UI says that the active gateways are WAN for v4/v6 (reflecting the configuration), but looking at System -> Routes -> Status WAN2 is the default. Sometimes it's only the IPv4 gateway that's wrong, sometimes it's the IPv6 gateway that's wrong. System -> Gateways -> Single always shows the intended gateways as active, but the actual active routes disagree.

This is certainly some totally unrelated issue from where I started though, so I can start another thread if that would be appropriate.
#13
So after 5 reboots, that patch hasn't had any extra dpingers. I do still have anywhere between 1-3 updaterrd.sh scripts running after boot, so that might also need the same treatment?

I did go ahead and collect some before-patch data before applying, but since this patch seems to have addressed the issue it's kind of pointless now. :) Maybe it'll at least help confirm what was going on.

23.1.9, no patches (well, I forgot to revert the other 2 before updating to 23.1.9, and I don't know if reverting them would partially downgrade me, so I'm assuming/hoping they're harmless)

Reboot 1: 0 extra
Reboot 2: 1 extra (WAN2_DHCP)
Reboot 3: 0 extra
Reboot 4: 1 extra (WAN2_DHCP)
Reboot 5: 0 extra
Reboot 6: 2 extra (WAN_DHCP, WAN2_DHCP)
Reboot 7: 1 extra (WAN_DHCP6) !! this is not ipv4 only
Reboot 8: 1 extra (WAN2_WHCP6)
Reboot 9: 1 extra (WAN2_DHCP)

Example when 2 extra dpingers exist (WAN_DHCP, WAN2_DHCP):

root@gw:~ # opnsense-log | grep dping
<13>1 2023-06-01T04:28:21-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="219"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 56603 - [meta sequenceId="274"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 58156 - [meta sequenceId="281"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 59296 - [meta sequenceId="286"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 60996 - [meta sequenceId="293"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:28:25-04:00 gw.domain.tld opnsense 62698 - [meta sequenceId="301"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T04:28:26-04:00 gw.domain.tld opnsense 36428 - [meta sequenceId="368"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))


Example when 1 extra dpinger exists (WAN_DHCP6):

<13>1 2023-06-01T04:34:40-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="220"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 97726 - [meta sequenceId="270"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 7700 - [meta sequenceId="277"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 17484 - [meta sequenceId="282"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 35563 - [meta sequenceId="287"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 46648 - [meta sequenceId="299"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T04:34:44-04:00 gw.domain.tld opnsense 64173 - [meta sequenceId="304"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:34:45-04:00 gw.domain.tld opnsense 67519 - [meta sequenceId="363"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))



Another example when 1 extra dpinger exists (WAN2_DHCP6), with more info:

root@gw:~ # opnsense-log | grep "renewal starting"
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 17978 - [meta sequenceId="260"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 67.WAN2, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 25179 - [meta sequenceId="267"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 104.WAN, old: , interface: WAN[wan], device: hn1, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 33607 - [meta sequenceId="276"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2603:SPECTRUM, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 51419 - [meta sequenceId="286"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:ATT, old: , interface: WAN[wan], device: hn1, force: no)
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 57618 - [meta sequenceId="288"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2603:SPECTRUM, old: , interface: WAN2[opt4], device: hn4, force: yes)
<13>1 2023-06-01T04:38:31-04:00 gw.domain.tld opnsense 35985 - [meta sequenceId="315"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:ATT, old: , interface: WAN[wan], device: hn1, force: yes)
root@gw:~ # opnsense-log | grep dping
<13>1 2023-06-01T04:38:26-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="217"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 17978 - [meta sequenceId="266"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 25179 - [meta sequenceId="274"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 33607 - [meta sequenceId="280"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 51419 - [meta sequenceId="292"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 57618 - [meta sequenceId="295"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T04:38:30-04:00 gw.domain.tld opnsense 57115 - [meta sequenceId="303"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T04:38:31-04:00 gw.domain.tld opnsense 35985 - [meta sequenceId="319"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T04:38:32-04:00 gw.domain.tld opnsense 73205 - [meta sequenceId="386"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
root@gw:~ # opnsense-log | grep TERM
<11>1 2023-06-01T04:38:25-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="201"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '59132'' returned exit code '1', the output was 'kill: 59132: No such process'
<11>1 2023-06-01T04:38:26-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="206"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '78770'' returned exit code '1', the output was 'kill: 78770: No such process'
<11>1 2023-06-01T04:38:31-04:00 gw.domain.tld opnsense 25179 - [meta sequenceId="357"] /usr/local/etc/rc.newwanip: The command '/bin/kill -'TERM' '89507'' returned exit code '1', the output was 'kill: 89507: No such process'
<11>1 2023-06-01T04:38:32-04:00 gw.domain.tld opnsense 33607 - [meta sequenceId="22"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '63279'' returned exit code '1', the output was 'kill: 63279: No such process'
<11>1 2023-06-01T04:38:33-04:00 gw.domain.tld opnsense 57618 - [meta sequenceId="39"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '77855'' returned exit code '1', the output was 'kill: 77855: No such process'
<11>1 2023-06-01T04:38:33-04:00 gw.domain.tld opnsense 35985 - [meta sequenceId="40"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '28005'' returned exit code '1', the output was 'kill: 28005: No such process'


Last example, IPv4 when 1 extra dpinger exists (WAN2_DHCP):

root@gw:~ # opnsense-log | grep "renewal starting"
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 49018 - [meta sequenceId="262"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 67.WAN2, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 67077 - [meta sequenceId="269"] /usr/local/etc/rc.newwanip: IP renewal starting (new: 104.WAN, old: , interface: WAN[wan], device: hn1, force: no)
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 83739 - [meta sequenceId="276"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2603:SPECTRUM, old: , interface: WAN2[opt4], device: hn4, force: no)
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 596 - [meta sequenceId="287"] /usr/local/etc/rc.newwanipv6: IP renewal starting (new: 2600:ATT, old: , interface: WAN[wan], device: hn1, force: no)
root@gw:~ # opnsense-log | grep dping
<13>1 2023-06-01T05:01:38-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="218"] /usr/local/etc/rc.bootup: plugins_configure monitor (execute task : dpinger_configure_do(1,null,1))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 49018 - [meta sequenceId="268"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 67077 - [meta sequenceId="275"] /usr/local/etc/rc.newwanip: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 83739 - [meta sequenceId="280"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN2_DHCP6))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 596 - [meta sequenceId="293"] /usr/local/etc/rc.newwanipv6: plugins_configure monitor (execute task : dpinger_configure_do(,WAN_DHCP6))
<13>1 2023-06-01T05:01:41-04:00 gw.domain.tld opnsense 11080 - [meta sequenceId="302"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
<13>1 2023-06-01T05:01:42-04:00 gw.domain.tld opnsense 98545 - [meta sequenceId="373"] /usr/local/etc/rc.routing_configure: plugins_configure monitor (execute task : dpinger_configure_do(1))
root@gw:~ # opnsense-log | grep TERM
<11>1 2023-06-01T05:01:37-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="203"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '67954'' returned exit code '1', the output was 'kill: 67954: No such process'
<11>1 2023-06-01T05:01:37-04:00 gw.domain.tld opnsense 273 - [meta sequenceId="204"] /usr/local/etc/rc.bootup: The command '/bin/kill -'TERM' '78770'' returned exit code '1', the output was 'kill: 78770: No such process'
<11>1 2023-06-01T05:01:42-04:00 gw.domain.tld opnsense 83739 - [meta sequenceId="356"] /usr/local/etc/rc.newwanipv6: The command '/bin/kill -'TERM' '97307'' returned exit code '1', the output was 'kill: 97307: No such process'
#14
Quote from: franco on May 30, 2023, 08:51:38 AM
In theory you can batch renewals/rebinds, but for one the timers inside dhcp6c need to be consolidated and DHCPv6 servers all need to properly support this. Seeing how uncommonly supported multiple PD handling is out (nevermind ISPs handing them out as they are normally just ignored) I wouldn't be surprised some servers cannot handle such a batch renew/rebind.
Yeah, that's why my first thought was to rather than change the behavior of the client, just forcibly extract all the PD information from the dhcp6_if* iaconf_list available in the caller of client6_script by plumbing it through as an argument, since it does seem that all the current data is available there (and that line of control flow would have just been iterating over the same IA list before getting to client6_script anyways).

That way PDINFO is just always the entire current state, since really all the script needs to do is figure out if anything changed, not really any reason to limit it to just the PDs in the particular response since the rest of the system doesn't really care about which IA_PD any given prefix came from anyways.

So less work than coalescing the timers and risking any kind of behavior change, but much more work than simply only worrying about the initial assignment in the first place.  :)

And almost certainly not worth the bother, since it's not like renews will ever change any of it in the real world.

Quote from: franco on May 30, 2023, 08:51:38 AM
Care to debug the dpinger issue you are seeing further?
Sure! And I come with some corrections: it actually doesn't look like it's from configuration updates/service restarts, it's actually just that some (possibly just the ipv4) gateway dpingers double up on boot sometimes. So my most recent boot has 2 on each ipv4 interface, it looks like the boot before that I had it double up on just WAN2, and I know I've seen it boot with only the 4 expected as well.


root    15021   0.0  0.1  13504   2128  -  Is   13:01      0:02.90 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN_IP -p /var/run/dpinger_WAN_DHCP.pi
root    45935   0.0  0.1  13504   2128  -  Is   13:01      0:03.04 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP -B 104.WAN_IP -p /var/run/dpinger_WAN_DHCP.pi
root    43971   0.0  0.1  13504   2132  -  Is   13:01      0:02.85 /usr/local/bin/dpinger -f -S -r 0 -i WAN_DHCP6 -B 2600:WAN:IP:: -p /v
root    49476   0.0  0.1  13504   2132  -  Is   13:01      0:03.02 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP6 -B 2603:WAN2:IP:: -p /va
root    18056   0.0  0.1  13504   2128  -  Is   13:01      0:03.01 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP -B 67.WAN2_IP -p /var/run/dpinger_WAN2_DHCP.p
root    55265   0.0  0.1  13504   2128  -  Is   13:01      0:03.22 /usr/local/bin/dpinger -f -S -r 0 -i WAN2_DHCP -B 67.WAN2_IP -p /va


I've restarted radvd and updated firewall rules a bunch which is what I thought had done it before, but everything seems to actually stay fixed at where it was after boot. So this particular boot seems to be the worst case and also... not that bad since it apparently won't get worse.

If the occasional extra dpinger on boot is worth tracking down, is there any info in particular that would help debug it?
#15
Thanks for the rapid fix. Back to the original config (5 IA_PDs, 1 IA_NA) where there are 6 instances of wannewipv6 every 30 minutes running simultaneously it's been holding steady with seemingly none of the old negative behavior.

It looks like I can still collect additional dpinger processes over time but the duplicates I have now match when I made configuration changes or restarted services and are on the v4 gateways. I'm assuming it's just the inherent raciness of multiple new address scripts running simultaneously. Not really a problem for steady operation in practice.

It's a shame that the dhcp6c process is so stingy about sharing its lease state. The logic in core did make sense to me, it was just let down by dhcp6c refusing to give the scripts the rest of the info it has buried in there somewhere. Although since address changes during renewals are rare I suppose it doesn't really matter in practice.

Thanks again!