I have an issue that seems to be ongoing and I cannot find a fix in the forums. Apologies if this has already been resolved.
I'm using Starlink, where the WAN frequently drops out. When that happens, dpinger needs to be restarted for gateway monitoring to work again, and the routing service has to be restarted to get the default route back.
Has anyone found a fix for this yet? I have disabled sticky connections in the firewall settings.
I've been having the same issue for quite some time now. I also have Starlink plus a second WAN on PPPoE. I have IPv4 and IPv6 enabled on SL and only IPv4 on the PPPoE link.
I can usually trigger this issue easily: if I reboot OPNsense, the issue starts happening after about 2 hours; something seems to flap on the SL interface around that time. When that trigger happens, you can see the two dpinger instances (v4 and v6) quitting and then being restarted by the scripts to resume checking the gateways. The v6 one usually recovers and keeps working, but almost every time the v4 one starts failing and flags the gateway as down.
The gateway is actually up, and I can manually run dpinger from the command line to validate this.
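For reference, a check along these lines works from the shell (100.79.101.92 is my SL address and 1.1.1.1 my monitor IP, adjust for your own setup):
# grab the exact dpinger invocation used for the SL gateway
ps ax | grep "[d]pinger" | grep "1\.1\.1\.1"
# a plain ping bound to the SL address is a simpler sanity check of the same path
ping -S 100.79.101.92 1.1.1.1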
So today, I decided to troubleshoot it and I discovered 2 things.
The dpinger that got restarted has all the right parameters, but its traffic is going out through the PPPoE interface instead of the SL interface (I had to run a tcpdump on the PPPoE interface in OPNsense to see this), even though -B "SL PUBLIC IP" is there with the SL IP in it.
While this dpinger is not working, I can run another one on the command line with the exact same parameters and that one works fine. Checking tcpdump again on the one I started, with the exact same parameters, I see it uses the SL interface, unlike the bad one that uses the PPPoE interface. So I thought, what's going on here? What could cause one dpinger process to use one interface and another dpinger process to use another? The routes are the same, the IPs are the same, the command line is the same...
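A quick way to see this yourself is to capture on both WANs while dpinger runs; the echo requests should only ever show up on the SL side (igb0, pppoe0 and 1.1.1.1 are from my setup, adjust to yours):
# should show the ICMP echo requests (SL interface)
tcpdump -ni igb0 icmp and host 1.1.1.1
# should stay silent; if the requests show up here, you're hitting this bug (PPPoE interface)
tcpdump -ni pppoe0 icmp and host 1.1.1.1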
Then I thought about checking the firewall states: maybe something (like with UDP, for instance) was being reused, or hadn't timed out or been cleared properly when the interface flapped, and the packets weren't being handled properly because of this.
And there it was, I could see one state with the bad dpinger and another state with the good dpinger.
Checking the good dpinger, I saw that the rule handling it was "let out anything from firewall host itself (force gw)", as you usually see on any dpinger state, while the bad dpinger showed up under the rule "allow access to DHCP server", which doesn't make sense. So it seemed the old dpinger was stuck on a state tied to the wrong rule.
Without restarting dpinger (like I usually do to fix this), I only deleted the bad state in the table. As soon as I did, the packets started flowing out the SL interface as they should, stopped going to the PPPoE interface, the gateway got flagged UP within a few checks, and the state now showed the rule "let out anything from firewall host itself (force gw)" as it should.
All that being said, this looks like something bad happening during the interface flap or the DHCP renewal (probably a timing issue in the scripts or a state not being cleared). My guess is that when dpinger starts monitoring, a bad state is either created or kept, and that makes the dpinger traffic go out the wrong interface. The state never gets a chance to expire or be reset because it keeps being reused, so dpinger keeps flagging the gateway as down: the ICMP packets are routed out the wrong interface (PPPoE in my case) with the source IP of the SL interface.
Restarting dpinger fixes this because the state is linked to the process (you can see the process ID in the state). Deleting the state (Firewall / Diagnostics / States, search for your dpinger process ID or the IP it monitors) also fixes it: a new state gets created that routes the packets out the proper interface.
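From the shell, that workaround boils down to something like this (1.1.1.1 is my monitor IP; use the state id from your own output):
# find the stuck state and note its id
pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
# kill it by id; dpinger's next probe recreates the state with the proper gateway
pfctl -k id -k <state id>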
Sorry, I posted a bit rapidly and made a few mistakes, and my description wasn't super clear. I edited my reply a bit, hopefully it is better :-) If not, please do not hesitate to ask for clarifications.
Also, here are two screenshots from Firewall / Diagnostics / States showing the rule for the
- good dpinger process: "let out anything from firewall host itself (force gw)"
- bad dpinger process: "allow access to DHCP server"
My monitoring IP for SL is 1.1.1.1, which makes it easy to find the state, since 1.1.1.1 is only used for monitoring the SL gateway and nothing else.
More troubleshooting. I manually flapped the interface to see what happens, and I see that sometimes the state uses the SL gateway (100.64.0.1) and sometimes 0.0.0.0.
At some point I even had two states pointing to two different gateways. Also, since SL pushes 1.1.1.1 as a DNS server via DHCP, I also end up with a route being added for it, but it only lasts for some time and then goes away.
SL interface is igb0 in the logs below
SL gateway flagged UP and dpinger working well. 1.1.1.1 is not in the route table.
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:23255 -> 1.1.1.1:23255 0:0
age 00:22:12, expires in 00:00:10, 1309:1305 pkts, 36652:36540 bytes, rule 100
id: 3253546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
Now I disconnect and reconnect SL and wait for DHCP to get an IP, and I see this: it seems to be using the default gateway, weird... dpinger still works, maybe because a temporary route to 1.1.1.1 was added on the initial DHCP?
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:54053 -> 1.1.1.1:54053 0:0
age 00:01:41, expires in 00:00:10, 100:100 pkts, 2800:2800 bytes, rule 93
id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: igb0
and this in the routing table (only the top few routes to keep this simple...)
netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
1.1.1.1 100.64.0.1 UGHS igb0
8.8.4.4 10.50.45.70 UGHS pppoe0
After a minute or two, SL issues a DHCP renewal, the GW goes down temporarily for dpinger, and I see this: two different states, one on the default gateway and the other on the SL gateway.
pfctl -ss -vv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:61626 -> 1.1.1.1:61626 0:0
age 00:00:14, expires in 00:00:09, 14:14 pkts, 392:392 bytes, rule 100
id: 7451546400000003 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
--
all icmp 100.79.101.92:54053 -> 1.1.1.1:54053 0:0
age 00:03:33, expires in 00:00:00, 195:148 pkts, 5460:4144 bytes, rule 93
id: 6e58546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: igb0
After some time, the state using 0.0.0.0 disappears and the route to 1.1.1.1 disappears as well.
SL is still marked as UP now, so for some reason the problem did not happen this time. But you can see that if something gets stuck on 0.0.0.0 (which is my main WAN, PPPoE, by default), SL's dpinger would stop working and send its packets to PPPoE instead of SL.
I'll try to reproduce the issue again later and post the results, and I'll also try to catch a pfctl output and netstat -rn when the issue happens. If you could do the same, maybe we'll see something clearer than in the UI.
Also on Starlink and seeing this in the logs:
2023-05-04T20:40:33-04:00 Notice dhclient Creating resolv.conf
2023-05-04T20:40:33-04:00 Error dhclient unknown dhcp option value 0x52
Quote from: tracerrx on May 05, 2023, 02:43:01 AM
Also on Starlink and seeing this in the logs:
2023-05-04T20:40:33-04:00 Notice dhclient Creating resolv.conf
2023-05-04T20:40:33-04:00 Error dhclient unknown dhcp option value 0x52
Yeah, that part has been there forever, likely just because dhclient doesn't support (or need) this option in the DHCP reply we get from SL (option 82, 0x52 in hex). It's probably something the original SL router needs and/or supports, but it isn't used by a standard DHCP client. (I've replaced some values with xxxx.)
Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message Option 53, length 1: ACK
Subnet-Mask Option 1, length 4: 255.192.0.0
Server-ID Option 54, length 4: 100.64.0.1
Default-Gateway Option 3, length 4: 100.64.0.1
Lease-Time Option 51, length 4: 300
Domain-Name-Server Option 6, length 8: 1.1.1.1,8.8.8.8
Classless-Static-Route Option 121, length 23: (192.168.100.1/32:0.0.0.0),(34.120.255.244/32:0.0.0.0),(default:100.64.0.1)
MTU Option 26, length 2: 1500
Agent-Information Option 82, length 24:
Circuit-ID SubOption 1, length 4: xxxx
Unknown SubOption 5, length 4:
0x0000: xxxx xxxx
Unknown SubOption 151, length 8:
0x0000: xxxx xxxx xxxx xxxx
Unknown SubOption 152, length 0:
END Option 255, length 0
PAD Option 0, length 0, occurs 28
Also reported here: https://forum.opnsense.org/index.php?topic=28391.0
Option 82 : https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol#Relay_agent_information_sub-options
I was able to trigger the dpinger issue this way.
The gateway was up and the state showed this
all icmp 100.79.101.92:34217 -> 1.1.1.1:34217 0:0
age 00:03:13, expires in 00:00:10, 190:190 pkts, 5320:5320 bytes, rule 100
id: ed96546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
which is normal and 1.1.1.1 was not in the routing table.
I then unplugged the RJ45 on OPNsense (igb0) and reconnected it almost right away. This triggered a DHCP exchange. Once the lease came in, I had the 1.1.1.1 route in the routing table and the state table showed this (notice the rule has changed from 100 to 90 and the gateway is now 0.0.0.0, which is the default and would use PPPoE, which is not good):
all icmp 100.79.101.92:16758 -> 1.1.1.1:16758 0:0
age 00:02:50, expires in 00:00:10, 168:146 pkts, 4704:4088 bytes, rule 90
id: 6f87546400000002 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: igb0
After about 2 mins (SL's DHCP renewal time), the 1.1.1.1 route disappeared from the routing table on renewal, but the state stayed on gateway: 0.0.0.0. At that point gateway monitoring started to fail (since the packets were being routed out the wrong interface).
After another 2 mins, another DHCP renewal I guess, the state changed to this (notice rule 100 now, and the SL gateway instead of 0.0.0.0):
all icmp 100.79.101.92:34217 -> 1.1.1.1:34217 0:0
age 00:03:13, expires in 00:00:10, 190:190 pkts, 5320:5320 bytes, rule 100
id: ed96546400000000 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: igb0
And gateway monitoring went back to UP. So far it seems able to recover after some time and several renewals, even though the network never actually went down; it all looks like a mix of routing and firewall-state behaviour. So I suppose it may stay down depending on timing...
I have
- "Allow DNS server list to be overridden by DHCP/PPP on WAN" unchecked in general
- "Allow default gateway switching" checked in general
- "Disable Host Route" unchecked on all gateways in gateways (Description: Do not create a dedicated host route for this monitor when it is checked).
So since I have that last setting unchecked, a route should in theory be added for the monitor. That is the case when the interface comes up on the initial DHCP, but the route seems to be removed on the next DHCP renewal, and I'm not sure why. Maybe it conflicts with the DNS (1.1.1.1), since I use the same IP (1.1.1.1) for monitoring, or something...
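A crude way to watch whether the renewal is what removes the route is a loop like this in a spare shell (assuming the monitor IP is 1.1.1.1):
while sleep 30; do
  date
  # print the monitor host route if present, otherwise flag that it is gone
  netstat -rn | grep "^1\.1\.1\.1 " || echo "host route to 1.1.1.1 is gone"
done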
So I tried checking "Disable Host Route" and saved the gateway, and I now have this: monitoring works, but I'm not sure why I see origif: pppoe0 (not SL). Checking with tcpdump, I see the ICMP queries going out on igb0 (SL) and not pppoe0, so I suppose the gateway entry forces the traffic out the right interface. I also no longer see the 1.1.1.1 route in the routing table...
State with "Disable host route" checked in the SL gateway.
all icmp 100.79.101.92:59191 -> 1.1.1.1:59191 0:0
age 00:00:26, expires in 00:00:10, 26:25 pkts, 728:700 bytes, rule 100
id: 9394546400000002 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: pppoe0
Disconnecting/reconnecting the SL RJ45 ends up creating these two states:
all icmp 100.79.101.92:22967 -> 1.1.1.1:22967 0:0
age 00:00:16, expires in 00:00:04, 1:0 pkts, 28:0 bytes, rule 90
id: 79a6546400000000 creatorid: 837fd2f8 gateway: 0.0.0.0
origif: pppoe0
--
all icmp 100.79.101.92:51249 -> 1.1.1.1:51249 0:0
age 00:00:13, expires in 00:00:10, 14:14 pkts, 392:392 bytes, rule 100
id: 9bcf546400000001 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: pppoe0
After a few seconds only this one remains
all icmp 100.79.101.92:51249 -> 1.1.1.1:51249 0:0
age 00:00:57, expires in 00:00:09, 56:56 pkts, 1568:1568 bytes, rule 100
id: 9bcf546400000001 creatorid: 837fd2f8 gateway: 100.64.0.1
origif: pppoe0
I'm still unable to make the real problem happen though (dpinger flagging SL as down and keeping it down until I restart dpinger). I'll continue to try to reproduce the issue.
At least I can see something weird with the routes/states that could explain why it may get flagged down at some point: the route to 1.1.1.1 disappears while the state keeps gateway 0.0.0.0, which happened for about 2 minutes (SL DHCP renewal) and then fixed itself.
Maybe the state should expire in this situation (there seems to be a 10 second timeout) but never does, because dpinger keeps it alive by pinging every second? I don't know... lol
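For reference, the ICMP state lifetimes pf uses can be listed with the command below; I assume the 10 seconds shown as "expires in" on these states comes from one of them:
# show pf's configured ICMP state timeouts
pfctl -s timeouts | grep icmp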
@xaxero what IP do you monitor for your SL gateway ? Something that may also conflict with what we receive in the DHCP from SL ?
Good Morning
I am trying this 2 ways (I use 2 Starlink Maritime interfaces)
Name Interface Protocol Priority Gateway Monitor IP RTT RTTd Loss Status Description
StarlinkBackup_GWv4 (active) StarlinkBackup IPv4 199 192.168.192.1 100.64.0.1 40.0 ms 9.0 ms 0.0 % Online
Starmain_VLAN_GWv4 4G_VLAN IPv4 201 192.168.191.1 1.1.1.1 0.0 ms 0.0 ms 100.0 % Offline StrMain
For one of them I use the remote gateway IP (100.64.0.1) as the monitor, and this works better than 1.1.1.1, which is down every morning.
Quote from: xaxero on May 05, 2023, 07:25:18 AM
For one of them I use the remote gateway IP (100.64.0.1) as the monitor, and this works better than 1.1.1.1, which is down every morning.
Ha, interesting. So you may end up in the same situation as me, since 1.1.1.1 is pushed by SL as a DNS server in their DHCP reply. If you hit the same bug I'm trying to figure out, you may end up having the gateway wrongly flagged as down.
SL pushes two DNS servers via DHCP (1.1.1.1 and 8.8.8.8), and I think this could create issues if you monitor those IPs for the gateways. So try using something else, 8.8.4.4 for instance, which should not be affected by a DHCP client script adding or removing routes automatically on each renewal (in theory).
You could also try what I'm testing, which is enabling "Disable host route" in the gateway settings, so that gateway monitoring does not try to add a route and depend on it while the DHCP client script may want to remove it (since, I suppose, we do not use the SL DNS servers that are pushed to us, per General settings). I assume you don't let WAN DHCP-learned DNS servers override the DNS servers you've likely defined manually.
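To sanity-check a candidate monitor IP from the SL side before changing the gateway config, a source-bound ping works (ping -S forces the source address; 100.79.101.92 is my SL address and 8.8.4.4 is just an example target, substitute your own):
# loss-free replies with SL-like latency suggest the IP is fine to monitor from that WAN
ping -c 10 -S 100.79.101.92 8.8.4.4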
OK Done as you suggested on the one interface (main) Will see how it goes.
Name Interface Protocol Priority Gateway Monitor IP RTT RTTd Loss Status Description
StarlinkBackup_GWv4 (active) StarlinkBackup IPv4 199 192.168.192.1 100.64.0.1 46.8 ms 11.7 ms 0.0 % Online StarBK
StarMain_VLAN_GWv4 4G_VLAN IPv4 201 192.168.191.1 208.67.222.222 36.3 ms 7.8 ms 0.0 % Online StrMain
I'm having similar problems for some weeks now. I have 2 WAN gateways. The problem shows up mostly on the backup GW, but sometimes on the primary too. The gateway goes into Error status with 100% packet loss. I attached a screenshot that shows the states in Firewall / Diagnostics after the GW changed to Error status. This GW has 8.8.4.4 as its monitor IP. After I delete the entry with the state 0:0, everything goes back to normal for some hours.
Currently I have OPNsense 23.1.6-amd64; this issue appeared with version 23.1.5.
Also, this: https://github.com/opnsense/core/issues/6544 was released in 23.1.7_3 (I am running _1) not long ago.
_3 also contains a few other patches that could possibly impact our current issue, so it's worth upgrading/testing at the very least.
The upgrade didn't solve the issue for me.
To test, I unchecked "Disable host route" again on the SL gateway so that a route gets added (to replicate the issue we had). By the way, with "Disable host route" checked, I have not had the problem again so far.
So, back to testing. After unchecking "Disable host route", I unplugged and replugged the SL ethernet cable on my igb0.
Once the link came back up, 1.1.1.1 got added to the routing table (since this is the IP I monitor), which is expected. The state now looked like this. Notice that the gateway is 0.0.0.0; it should normally be the SL gateway (100.64.0.1) to make sure dpinger uses that interface to monitor (ICMP ping) 1.1.1.1. Right there I knew the issue would probably trigger later on (on DHCP renewal):
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:00:58, expires in 00:00:10, 58:57 pkts, 1624:1596 bytes, rule 90
id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
And the gateway was marked as UP. About 2-3 minutes later (SL DHCP renewal), the 1.1.1.1 route disappeared from the routing table and the gateway is now marked as DOWN
The state is still this
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:12:03, expires in 00:00:10, 715:148 pkts, 20020:4144 bytes, rule 90
id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
And it is not recovering, and you can see it is linked to this dpinger process:
root 47126 0.0 0.0 17728 2624 - Is 14:51 0:00.09 /usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B 100.79.101.92 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 75 -d 0 1.1.1.1
That dpinger is not working and is flagging the gateway as down, even though the gateway is actually UP, because the packets are going out the wrong interface. They should be going out igb0 (SL), but they are going out my other (default) WAN, pppoe0, so they fail (100.79.101.92 is my current SL IP):
tcpdump -i pppoe0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
15:05:28.529901 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 843, length 8
15:05:29.545644 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 844, length 8
15:05:30.553962 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 47126, seq 845, length 8
This happens because the gateway in the state is set to 0.0.0.0, which is wrong; it should be 100.64.0.1.
If I test manually it works, and the latency is clearly SL's, as it would be 2-3 ms over my pppoe0 link:
ping -S 100.79.101.92 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 100.79.101.92: 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=58 time=56.306 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=64.790 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=58 time=47.758 ms
The state is also unable to expire and be relearned, since dpinger pings every second and that keeps it alive.
Killing or restarting dpinger releases the state and fixes the issue.
I'll kill the state to test it:
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:21:06, expires in 00:00:09, 1249:148 pkts, 34972:4144 bytes, rule 90
id: 58da556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
root@xxxxx:~ # pfctl -k id -k 58da556400000000
killed 1 states
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
all icmp 100.79.101.92:47126 -> 1.1.1.1:47126 0:0
age 00:00:03, expires in 00:00:10, 4:4 pkts, 112:112 bytes, rule 100
id: f1c1556400000002 creatorid: 7ac5a56d gateway: 100.64.0.1
origif: pppoe0
And now the gateway is back UP.
I'll re-check "disable host route" in the SL gateway since this seems to help as it does seem to prevent that the gateway in the state be 0.0.0.0 since there is never a 1.1.1.1 route if I do this. It's a workaround but it seems to work for now. Probably using something else than 1.1.1.1 would also work since DHCP renewal would not play with the route as it seems to be doing after 3 mins (SL DHCP renew time).
Actually, in 23.1.7_3 the problem seems worse... Unplugging and replugging the SL ethernet cable on igb0 seems to trigger the problem every single time.
I did it 3 times and the state looks like this each time now... :-\
all icmp 100.79.101.92:7232 -> 1.1.1.1:7232 0:0
age 00:00:36, expires in 00:00:10, 36:0 pkts, 1008:0 bytes, rule 90
id: c4e7556400000000 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: pppoe0
Again, restarting dpinger or killing the state brings the gateway status back to UP and the state back to what it should be:
all icmp 100.79.101.92:9493 -> 1.1.1.1:9493 0:0
age 00:00:16, expires in 00:00:09, 16:16 pkts, 448:448 bytes, rule 100
id: 15eb556400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
origif: pppoe0
Also notice the rule goes from 90 to 100. 100 is usually what I see when it works; I believe it's the default rule that allows traffic from OPNsense itself to anywhere, and 90 is the rule associated with DHCP.
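To double-check which rules those numbers correspond to, the verbose ruleset listing prefixes each rule with what I believe is the same index:
# list rules 90 and 100 (plus their counters on the following line)
pfctl -vv -sr | grep -A 1 -e "^@90 " -e "^@100 "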
With 23.1.7_3, the SL gateway always ends up being flagged as down if I use 1.1.1.1, whether I use "Disable Host Route" or not. I tried multiple things to keep it up, but after some time it ends up failing because the gateway in the state ends up sending the packets to the pppoe0 WAN instead of SL.
So I'm dropping the idea of using 1.1.1.1 altogether for now, as it seems really problematic, likely because of the DHCP renewal on SL that pushes 1.1.1.1 as a DNS server, maybe? Anyway, I'll test with 9.9.9.9 instead and see how it goes.
Did using an IP other than 1.1.1.1 fix it for you @xaxero? Also, have you upgraded to 23.1.7_3 yet?
Good Morning
Changing to OpenDNS has resulted in a big improvement: 48 hours with no issues. However, SL has been very stable. For the second unit I simply use the SL gateway address.
Note: As I am using the dual antenna setup, I have put a second router at the front end simply to NAT the traffic, so that I have a unique gateway for each antenna, tagging the packets onto separate VLANs down to our main router several decks below. 2 WANs with the same gateway was problematic if we had to do a full system power cycle.
With the front end router I disable gateway monitoring and do all the dpinger work on the main router. Also, "Disable host route" may have helped as well.
Another slimy hack is to force all passenger traffic through the 4G-Starlink-Primary interface via the firewall, so it bypasses dpinger completely. The more critical ship traffic goes through the gateway failover, and the worst case scenario is that we are stuck on the VSAT until I can restart dpinger.
I have attached the gateway configuration of the front end and the core routers. So far it has been working well.
You can use the following to inspect host route behaviour now:
# pluginctl -r host_routes
An overlap between facilities IS possible and the last match wins which may break DNS or monitoring facility... That's why disable host route was added to monitor settings in which case the DNS is still active and dpinger monitoring latches on to interface IP anyway so routing should be ok (if no PBR is used breaking that as well).
Cheers,
Franco
Quote from: franco on May 08, 2023, 12:01:59 PM
You can use the following to inspect host route behaviour now:
# pluginctl -r host_routes
An overlap between facilities IS possible and the last match wins which may break DNS or monitoring facility... That's why disable host route was added to monitor settings in which case the DNS is still active and dpinger monitoring latches on to interface IP anyway so routing should be ok (if no PBR is used breaking that as well).
Hello franco :)
Ok, so everything remained stable while I was using 9.9.9.9 (though I did not test for very long, maybe 12h). I've now configured 1.1.1.1 again on SL, saved the gateway, and then saved the interface as well to restart it.
For now I see this (everything normal and gateway is marked UP)
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:47540 -> 1.1.1.1:47540 0:0
age 00:03:49, expires in 00:00:10, 225:225 pkts, 6300:6300 bytes, rule 100
id: a7325d6400000000 creatorid: 7ac5a56d gateway: 100.64.0.1
origif: igb0
root@xxxxx:~ # pluginctl -r host_routes
{
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
"dpinger": {
"8.8.4.4": "10.50.45.70",
"1.1.1.1": "100.64.0.1",
"2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
"149.112.112.112": "192.168.2.1",
"2620:fe::9": "2001:470:xx:4x:x"
}
}
10.50.45.70 is my default gateway that uses pppoe0 interface
100.64.0.1 is SL and is used as backup gateway on igb0
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
1.1.1.1 100.64.0.1 UGHS igb0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
After 2-3 mins, the routing table loses 1.1.1.1 (SL's DHCP renewal, I guess), but so far everything remains functional:
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
100.64.0.0/10 link#4 U igb0
Everything else remains the same and the gateway is, for now, marked UP. When I get back home I'll test the ethernet cable pull/plug, since that usually seems to trigger the issue, and I'll let you know what I get.
Hello RedVortex :)
Hmm, how about this one?
# grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*
If SL is pushing routes it will scrub them on a renew perhaps.
Cheers,
Franco
Hi,
FWIW, I see this too with multi-WAN gateway monitoring.
The monitor IP / dpinger is not reliable in simulated fail and failback scenarios.
I can only "fix" it by restarting the Gateway service :(
Keep in mind that some DNS servers have been known to rate-limit or block ping requests, so it looks bad but it's not. From the OPNsense perspective the alarm has to be raised, even though it may be unnecessary and disruptive.
Cheers,
Franco
So I've been testing Multi-Wan gateway failover for quite a few hours now.
It does not work with the Trigger Level = "Packet Loss" option on 23.latest, or even going back to 22.7.latest.
Scenario: primary gateway with Trigger Level = "Packet Loss" set, then block downstream ping. This does NOT cause the gateway to be marked as down, nor the default route to be flipped to the secondary. I have to manually restart the Gateway service (then it notices).
Failback works OK.
It works OK if Trigger Level = "Member Down"; however, this is a less likely real-world scenario, where the ISP link is up but internet service is interrupted.
See https://github.com/opnsense/core/issues/6231 -- packetloss and delay triggers have been broken inherently with the switch from apinger to dpinger. The latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour and dpinger then is left to only monitor.
Cheers,
Franco
Quote from: franco on May 11, 2023, 09:26:23 AM
See https://github.com/opnsense/core/issues/6231 -- packetloss and delay triggers have been broken inherently with the switch from apinger to dpinger. The latter never supported the lower thresholds. I'm trying to avoid dealing with dpinger for alarm decisions in 23.7 to bring back the desired behaviour and dpinger then is left to only monitor.
Cheers,
Franco
thanks Franco. Read through the issues thread. Appreciate the detail there.
What timeframe are you thinking for the fix ?
It might take 1 more month for the final code to hit development, but as I said the plan is to have it in production for 23.7 in July (not sooner due to considerable changes).
Cheers,
Franco
I am collating the data from this post and others. This applies to Starlink only but may be useful elsewhere. I applied the following fixes from everyone's suggestions and the gateways are stable. We are having frequent outages as we are in laser link territory, but the link is stable overall.
1/. WAN definition: reject leases from 192.168.100.1 (note: the gateways are on a separate router in my case)
2/. Gateway: Disable host route.
3/. Monitor IP that is not 1.1.1.1 (in my case OpenDNS), and bind each interface to a DNS server via Settings: General.
Interfaces have been going up and down for the last 24 hours, the gateways are (so far) behaving, and the routes are changing dynamically.
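A quick way to confirm point 1/. is taking effect is to check which server handed out the lease in the dhclient lease file (substitute your own WAN interface name; igb0 is just the one used in the examples earlier in this thread):
# the server identifier should be your upstream router, not 192.168.100.1
grep -n "dhcp-server-identifier" /var/db/dhclient.leases.igb0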
Last thought - perhaps we could include httping as an option in the future alongside dpinger; HTTP traffic gets much higher priority than ICMP.
That leaves only the question of who will write and integrate a new solution for a problem someone thought was solved a decade ago. ;)
Cheers,
Franco
The problem occurred again today after an Ethernet flap on the SL side (likely a firmware update on their end).
root@xxxxx:~ # pluginctl -r host_routes
{
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
"dpinger": {
"8.8.4.4": "10.50.45.70",
"1.1.1.1": "100.64.0.1",
"2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
"149.112.112.112": "192.168.2.1",
"2620:fe::9": "2001:470:xx:4x:x"
}
}
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
100.64.0.0/10 link#4 U igb0
root@xxxxx:~ # grep -nr "1\.1\.1\.1" /var/db/dhclient.leases.*
/var/db/dhclient.leases.igb0:7: option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:24: option domain-name-servers 1.1.1.1,8.8.8.8;
/var/db/dhclient.leases.igb0:41: option domain-name-servers 1.1.1.1,8.8.8.8;
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:23279 -> 1.1.1.1:23279 0:0
age 00:05:24, expires in 00:00:10, 319:148 pkts, 8932:4144 bytes, rule 90
id: 36e4776400000003 creatorid: 7ac5a56d gateway: 0.0.0.0
origif: igb0
root@xxxxx:~ # tcpdump -i pppoe0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on pppoe0, link-type NULL (BSD loopback), capture size 262144 bytes
22:21:08.078198 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 385, length 8
22:21:09.141706 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 386, length 8
22:21:10.205204 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 23279, seq 387, length 8
^C
3 packets captured
679 packets received by filter
0 packets dropped by kernel
After the Ethernet flap, the 1.1.1.1 route was present (likely added because of the DNS received from the SL DHCP), but this route got removed after 2-3 minutes (on SL's DHCP renewal, I think). At that point, since the dpinger state continued using 0.0.0.0 (but not the dpinger command line itself), the gateway went down: packets were now being routed to pppoe (my main provider) instead of igb0 (SL), which cannot work since dpinger is then using the SL source IP on my other provider and the packets are likely being dropped.
What I expect to happen: the state should use the SL gateway, not 0.0.0.0, whatever the routes are.
Ok, let's do this then: https://github.com/opnsense/core/commit/c12e77519f164
However, in multi-WAN you really need to set a gateway for each global DNS server being used:
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
8.8.8.8 would naturally need the SL gateway and 8.8.4.4 the other WAN gateway as per:
https://docs.opnsense.org/manual/how-tos/multiwan.html#step-3-configure-dns-for-each-gateway
Perhaps even adding 1.1.1.1 as a global DNS server tied to the SL gateway would fix the current situation as well (a DNS server and its route are always enforced, unlike gateway monitoring). And from the docs you can see that coupling these facilities through the same server on the same link makes sense.
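With those gateways assigned under System > Settings > General, re-running the command from earlier in the thread should presumably show them on the core entries instead of null:
# after binding a gateway to each global DNS server, 8.8.8.8 should list the SL
# gateway and 8.8.4.4 the other WAN gateway here (instead of null) - presumably
pluginctl -r host_routes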
Cheers,
Franco
I applied the patch, so far so good, I'll do more testing this week and let you know how it goes.
Is it just me, or do I have a feeling of déjà vu? I think we troubleshot something along those lines a few months ago before letting it go, after deciding there was a lot of cleanup needed around those scripts :-) Hopefully this time we'll nail it, lol
We did, but some progress was made on the code so it's good to revisit (and debugging was kinda easy this time).
For the route drop of the nameserver it's probably better to aim for symmetry or at least not undo routes that haven't even been added (by DNS itself). I've added the proposed change to upcoming 23.1.8 and will circle back at some point. Already have an idea on how to pull this off.
Cheers,
Franco
I also think the problem is somewhat complicated by the fact that we use 1.1.1.1 for two things. We'll need to decide which one wins.
On SL, they push us 1.1.1.1 as DNS. Even though I do not use their DNS (I do not allow WAN-pushed DNS to override mine), it seems to play with the routing table. On top of that, I also use 1.1.1.1 for gateway monitoring, where you can select whether you want dpinger to add a route for the monitor or not. And on top of that, someone might add 1.1.1.1 to their DNS configuration and (may or may not; I know I don't) select a gateway for it, which I think may also add routes...
So I think we may need to decide at some point what takes priority (likely based on which functionality absolutely needs its route, or something like that), or an order of priority of what does what.
I mean, any provider could decide to start pushing a route, a DNS server, or something else that we are already using as a monitoring IP, and we may (or may not) have selected to add a route for the monitoring; who would win? :-)
Actually, as per the docs, the suggestion is to make each IP exclusive to the attached uplink and that's it. Otherwise you start validating in a circle, and some of this, like DNS servers learned via DHCP(v6), is runtime information, which further complicates the issue.
The individual areas can already validate against double use, but throwing the host route into the routing table is sort of a black box. We only know that a route was there, but not why. Is it ours? Is it someone else's? Who knows.
Cheers,
Franco
Testing the patch went well and I upgraded to 23.1.8 last night and so far so good.
As long as the route remains there, it should work; we'll see over the next few days. SL is on igb0, its GW is 100.64.0.1 and it monitors IP 1.1.1.1; my main provider is on pppoe0, its GW is 10.50.45.70 and it monitors IP 8.8.4.4.
DNS servers (not bound to a gateway) are 8.8.8.8 and 8.8.4.4
The state shows the default gateway 0.0.0.0 (i.e. a plain routing-table lookup) being used to reach 1.1.1.1 rather than 100.64.0.1, but I see the packets flowing through igb0 (SL), not pppoe0, which is what we want. So we're good, as long as the 1.1.1.1 route remains there.
root@xxxxx:~ # pluginctl -r host_routes
{
"core": {
"8.8.8.8": null,
"8.8.4.4": null
},
"dpinger": {
"8.8.4.4": "10.50.45.70",
"1.1.1.1": "100.64.0.1",
"2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
"149.112.112.112": "192.168.2.1",
"2620:fe::9": "2001:470:xxx:x::x"
}
}
root@xxxxx:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:29707 -> 1.1.1.1:29707 0:0
age 12:19:33, expires in 00:00:09, 43592:43538 pkts, 1220576:1219064 bytes, rule 90
id: 3b7e706400000001 creatorid: c307077d gateway: 0.0.0.0
origif: igb0
root@xxxxx:~ # netstat -rn | head
Routing tables
Internet:
Destination Gateway Flags Netif Expire
default 10.50.45.70 UGS pppoe0
1.1.1.1 100.64.0.1 UGHS igb0
8.8.4.4 10.50.45.70 UGHS pppoe0
10.2.0.0/16 192.168.2.1 UGS em0
10.50.45.70 link#16 UHS pppoe0
34.120.255.244 link#4 UHS igb0
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
I'm curious... I also run Starlink and have not had this issue on this version yet. I know it sometimes takes time, and some instability on the SL gateway, for the issue to show up. But the bug has not happened anymore since the 23.1.8 fixes.
Can you check the output of
pluginctl -r host_routes
Quote from: Jetro on November 14, 2023, 11:01:43 PM
Same problem on 23.7.8.
I have 4 Gateways (FTTH, FTTC, FWA, SAT) and Starlink is the only one presenting this problem.
For me the problem had not been happening since the last 23.1.x patches on Starlink, but it started to appear again in 24.1-rc1 and is still ongoing on 24.1 final.
Here's the link to the issue in the 24.1 forum if you feel like troubleshooting it with us: https://forum.opnsense.org/index.php?topic=38603.0