Multi-WAN (PPPoE + Starlink) - SL Gateway falsely being marked down after outage

Started by RedVortex, February 04, 2024, 10:22:51 PM

@franco FYI, since we worked on this together last time, along with @xaxero, who I suspect might be affected as well since he was also using multi-WAN and Starlink like me.

This seems to be a regression of this old issue, which had stayed fixed until 24.1:

https://forum.opnsense.org/index.php?topic=33831.msg163808#msg163808

After Starlink updates itself during the night, the gateway sometimes gets flagged as down and never comes back up by itself, even though it is up and working.

When Starlink goes down, it temporarily falls back to 192.168.100.1 in the 192.168.100.0/24 range (and gives opnsense .100). When it comes back up, it returns to its normal 100.64.x.x IPs.

For some reason, dpinger has a hard time adjusting when that happens. I'm not sure whether it is being restarted properly on the gateway change or during the temporary network flap. It seems to remain stuck on the 192.168.100.1 gateway.

You can also see that the pf state of the dpinger process is stuck on the 192.168.100.1 gateway. If I manually clear the state, it switches to the right gateway, but that's not enough to bring the gateway back up; I need to reload dpinger.

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:924 -> 1.1.1.1:924       0:0
   age 14:00:37, expires in 00:00:09, 49623:0 pkts, 1439067:0 bytes, rule 102
   id: adb0c36500000002 creatorid: 5f0e2da3 gateway: 192.168.100.1
   origif: igb0


root@opnsense:~ # pfctl -k id -k adb0c36500000002
killed 1 states



root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:924 -> 1.1.1.1:924       0:0
   age 00:00:06, expires in 00:00:09, 6:6 pkts, 174:174 bytes, rule 104
   id: 7518c56500000002 creatorid: 5f0e2da3 gateway: 100.64.0.1
   origif: igb0


igb0 is my Starlink interface. You can see that before the state is cleared, packets are going out but not coming back.

root@opnsense:~ # tcpdump -i igb0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on igb0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:45:36.973765 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49523, length 9
15:45:37.974749 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49524, length 9
15:45:38.994204 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49525, length 9
15:45:40.033544 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49526, length 9


After I clear the state, replies are coming back normally.

root@opnsense:~ # tcpdump -i igb0 icmp and host 1.1.1.1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on igb0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:55:44.088583 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 50121, length 9
15:55:44.125365 IP 1.1.1.1 > 100.79.101.92: ICMP echo reply, id 924, seq 50121, length 9
15:55:45.092572 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 50122, length 9
15:55:45.146659 IP 1.1.1.1 > 100.79.101.92: ICMP echo reply, id 924, seq 50122, length 9
15:55:46.139495 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 50123, length 9
15:55:46.191996 IP 1.1.1.1 > 100.79.101.92: ICMP echo reply, id 924, seq 50123, length 9
15:55:47.157407 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 50124, length 9
15:55:47.210673 IP 1.1.1.1 > 100.79.101.92: ICMP echo reply, id 924, seq 50124, length 9
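For longer captures, eyeballing request/reply pairs gets tedious. Here is a minimal Python sketch that lists echo requests with no matching reply in saved `tcpdump` output. The function name `unanswered_requests` is made up for illustration, and the regex assumes the exact line layout shown in the captures above. Note that the last request in a capture can show up as unanswered simply because the capture stopped before the reply arrived.

```python
import re

# Matches tcpdump lines such as:
# "15:45:36.973765 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49523, length 9"
ICMP_RE = re.compile(r"ICMP echo (request|reply), id (\d+), seq (\d+)")


def unanswered_requests(capture):
    """Return sorted sequence numbers of echo requests with no matching reply."""
    requests, replies = set(), set()
    for line in capture.splitlines():
        match = ICMP_RE.search(line)
        if not match:
            continue
        key = (int(match.group(2)), int(match.group(3)))  # (id, seq)
        if match.group(1) == "request":
            requests.add(key)
        else:
            replies.add(key)
    return sorted(seq for _, seq in requests - replies)


# Two lines from the capture taken while the state was stuck.
STUCK = """\
15:45:36.973765 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49523, length 9
15:45:37.974749 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 49524, length 9
"""

# Two lines from the capture taken after the state was cleared.
HEALTHY = """\
15:55:44.088583 IP 100.79.101.92 > 1.1.1.1: ICMP echo request, id 924, seq 50121, length 9
15:55:44.125365 IP 1.1.1.1 > 100.79.101.92: ICMP echo reply, id 924, seq 50121, length 9
"""

print(unanswered_requests(STUCK))    # [49523, 49524]
print(unanswered_requests(HEALTHY))  # []
```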


Logs of gateway/dpinger

root@opnsense:~ # tail -100 /var/log/gateways/latest.log | grep -v DHCP6
<12>1 2024-02-04T01:45:05-05:00 opnsense dpinger 57453 - [meta sequenceId="1"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:06-05:00 opnsense dpinger 57453 - [meta sequenceId="3"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:07-05:00 opnsense dpinger 57453 - [meta sequenceId="5"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:08-05:00 opnsense dpinger 57453 - [meta sequenceId="7"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:09-05:00 opnsense dpinger 57453 - [meta sequenceId="9"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:10-05:00 opnsense dpinger 57453 - [meta sequenceId="11"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:11-05:00 opnsense dpinger 57453 - [meta sequenceId="13"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:12-05:00 opnsense dpinger 57453 - [meta sequenceId="15"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:13-05:00 opnsense dpinger 57453 - [meta sequenceId="17"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:14-05:00 opnsense dpinger 57453 - [meta sequenceId="19"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:15-05:00 opnsense dpinger 57453 - [meta sequenceId="21"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:16-05:00 opnsense dpinger 57453 - [meta sequenceId="24"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:17-05:00 opnsense dpinger 57453 - [meta sequenceId="26"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:18-05:00 opnsense dpinger 57453 - [meta sequenceId="28"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:19-05:00 opnsense dpinger 57453 - [meta sequenceId="30"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:20-05:00 opnsense dpinger 57453 - [meta sequenceId="32"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:21-05:00 opnsense dpinger 57453 - [meta sequenceId="35"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:22-05:00 opnsense dpinger 57453 - [meta sequenceId="37"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:23-05:00 opnsense dpinger 57453 - [meta sequenceId="39"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:24-05:00 opnsense dpinger 57453 - [meta sequenceId="41"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:25-05:00 opnsense dpinger 57453 - [meta sequenceId="43"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:26-05:00 opnsense dpinger 57453 - [meta sequenceId="45"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:27-05:00 opnsense dpinger 57453 - [meta sequenceId="47"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:28-05:00 opnsense dpinger 57453 - [meta sequenceId="48"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:29-05:00 opnsense dpinger 57453 - [meta sequenceId="49"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:30-05:00 opnsense dpinger 57453 - [meta sequenceId="50"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:31-05:00 opnsense dpinger 57453 - [meta sequenceId="51"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<165>1 2024-02-04T01:45:31-05:00 opnsense dpinger 53072 - [meta sequenceId="52"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 43.7 ms RTTd: 9.1 ms Loss: 42.0 %)
<12>1 2024-02-04T01:45:32-05:00 opnsense dpinger 57453 - [meta sequenceId="53"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:33-05:00 opnsense dpinger 57453 - [meta sequenceId="54"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:34-05:00 opnsense dpinger 57453 - [meta sequenceId="55"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:35-05:00 opnsense dpinger 57453 - [meta sequenceId="56"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:36-05:00 opnsense dpinger 57453 - [meta sequenceId="57"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:37-05:00 opnsense dpinger 57453 - [meta sequenceId="58"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:38-05:00 opnsense dpinger 57453 - [meta sequenceId="59"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:38-05:00 opnsense dpinger 57453 - [meta sequenceId="60"] exiting on signal 15
<12>1 2024-02-04T01:45:38-05:00 opnsense dpinger 1844 - [meta sequenceId="61"] send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 0ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 1.1.1.1  bind_addr 192.168.100.100  identifier "STARLINK_DHCP "
<12>1 2024-02-04T01:45:38-05:00 opnsense dpinger 59623 - [meta sequenceId="62"] exiting on signal 15
<165>1 2024-02-04T01:45:38-05:00 opnsense dpinger 53072 - [meta sequenceId="63"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2024-02-04T01:45:38-05:00 opnsense dpinger 1844 - [meta sequenceId="64"] exiting on signal 15
<165>1 2024-02-04T01:45:38-05:00 opnsense dpinger 53072 - [meta sequenceId="65"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2024-02-04T01:45:38-05:00 opnsense dpinger 11941 - [meta sequenceId="66"] send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 0ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 1.1.1.1  bind_addr 192.168.100.100  identifier "STARLINK_DHCP "
<165>1 2024-02-04T01:45:39-05:00 opnsense dpinger 53072 - [meta sequenceId="67"] Reloaded gateway watcher configuration on SIGHUP
<165>1 2024-02-04T01:45:43-05:00 opnsense dpinger 53072 - [meta sequenceId="68"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)
<12>1 2024-02-04T01:46:41-05:00 opnsense dpinger 11941 - [meta sequenceId="69"] exiting on signal 15
<12>1 2024-02-04T01:46:41-05:00 opnsense dpinger 66460 - [meta sequenceId="70"] send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 0ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 1.1.1.1  bind_addr 100.79.101.92  identifier "STARLINK_DHCP "
<165>1 2024-02-04T01:46:41-05:00 opnsense dpinger 53072 - [meta sequenceId="71"] Reloaded gateway watcher configuration on SIGHUP
<165>1 2024-02-04T01:46:44-05:00 opnsense dpinger 53072 - [meta sequenceId="73"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2024-02-04T15:38:41-05:00 opnsense dpinger 36972 - [meta sequenceId="1"] send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 0ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 1.1.1.1  bind_addr 100.79.101.92  identifier "STARLINK_DHCP "
<165>1 2024-02-04T15:38:43-05:00 opnsense dpinger 53072 - [meta sequenceId="2"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: down -> none RTT: 52.6 ms RTTd: 8.4 ms Loss: 0.0 %)
<12>1 2024-02-04T15:38:44-05:00 opnsense dpinger 36972 - [meta sequenceId="3"] exiting on signal 2

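To make a log like this easier to read, here is a rough Python sketch that counts the `sendto error` lines per gateway and errno and lists the alarm transitions. The name `summarize` and the regexes are assumptions based only on the line format shown above. The errno names are hard-coded for FreeBSD (22 is EINVAL, 65 is EHOSTUNREACH) because the numbering is not the same on every operating system.

```python
import re
from collections import Counter

# FreeBSD errno names, hard-coded on purpose: the numbering differs
# between operating systems (65 means something else on Linux).
FREEBSD_ERRNO = {22: "EINVAL", 65: "EHOSTUNREACH"}

SENDTO_RE = re.compile(r"(\S+) \S+: sendto error: (\d+)$")
ALERT_RE = re.compile(
    r"^<\d+>1 (\S+) .* ALERT: (\S+) \(Addr: \S+ Alarm: (\w+) -> (\w+)"
)


def summarize(log_text):
    """Count sendto errors per (gateway, errno) and list alarm transitions."""
    errors = Counter()
    transitions = []
    for line in log_text.splitlines():
        line = line.rstrip()
        sendto = SENDTO_RE.search(line)
        if sendto:
            errors[(sendto.group(1), int(sendto.group(2)))] += 1
            continue
        alert = ALERT_RE.match(line)
        if alert:
            # (timestamp, gateway, old alarm, new alarm)
            transitions.append(alert.groups())
    return errors, transitions


SAMPLE_LOG = """\
<12>1 2024-02-04T01:45:30-05:00 opnsense dpinger 57453 - [meta sequenceId="50"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<12>1 2024-02-04T01:45:31-05:00 opnsense dpinger 57453 - [meta sequenceId="51"] STARLINK_DHCP 1.1.1.1: sendto error: 22
<165>1 2024-02-04T01:45:31-05:00 opnsense dpinger 53072 - [meta sequenceId="52"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 43.7 ms RTTd: 9.1 ms Loss: 42.0 %)
<165>1 2024-02-04T01:45:43-05:00 opnsense dpinger 53072 - [meta sequenceId="68"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)
<165>1 2024-02-04T15:38:43-05:00 opnsense dpinger 53072 - [meta sequenceId="2"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: down -> none RTT: 52.6 ms RTTd: 8.4 ms Loss: 0.0 %)
"""

errors, transitions = summarize(SAMPLE_LOG)
for (gateway, code), count in errors.items():
    print(f"{gateway}: {count} x sendto error {code} ({FREEBSD_ERRNO.get(code, 'unknown')})")
for when, gateway, old, new in transitions:
    print(f"{when} {gateway}: {old} -> {new}")
```

In the log above, the transitions show the gateway going `down` at 01:45:43 and only returning to `none` at 15:38:43, right after dpinger was started again by hand.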

The process before I reload it

root    66460   0.0  0.0  13340  2508  -  Is   01:46       0:02.54 /usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B 100.79.101.92 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -s 1s -l 4s -t 60s -d 1 1.1.1.1

And after I reload it (it now marks the interface as up):

root@opnsense:~ # ps aux | grep LINK_DHCP\
root    91462   0.0  0.0  13340  2512  -  Is   16:07       0:00.03 /usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B 100.79.101.92 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -s 1s -l 4s -t 60s -d 1 1.1.1.1


They are the same...

And the state after the dpinger reload is still normal, like the one I got after manually killing the state to reset it.

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:25926 -> 1.1.1.1:25926       0:0
   age 00:01:53, expires in 00:00:10, 112:112 pkts, 3248:3248 bytes, rule 104
   id: ba21c56500000002 creatorid: 5f0e2da3 gateway: 100.64.0.1
   origif: igb0


It seems like the state isn't cleared properly and/or dpinger isn't resetting properly after an interface flap or gateway change.

I forgot to take the output during the outage of

pluginctl -r host_routes

But here it is after everything is good. Next outage I'll take it before fixing it.

root@opnsense:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "192.168.170.2": "192.168.170.2",
        "192.168.171.2": "192.168.171.2",
        "2620:fe::9": "2001:470:xx:x::x"
    }
}

Same situation this morning

root@opnsense:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "192.168.170.2": "192.168.170.2",
        "192.168.171.2": "192.168.171.2",
        "2620:fe::9": "2001:470:xx:x::x"
    }
}


root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:47956 -> 1.1.1.1:47956       0:0
   age 08:02:44, expires in 00:00:09, 28494:0 pkts, 826326:0 bytes, rule 102
   id: ba64cd6500000000 creatorid: 5f0e2da3 gateway: 192.168.100.1
   origif: igb0


After killing the state, dpinger now sees the state as up (I did not restart/reload dpinger, I only cleared the state above)

root@opnsense:~ # pfctl -k id -k ba64cd6500000000
killed 1 states

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:47956 -> 1.1.1.1:47956       0:0
   age 00:00:17, expires in 00:00:09, 17:17 pkts, 493:493 bytes, rule 104
   id: 7168ce6500000000 creatorid: 5f0e2da3 gateway: 100.64.0.1
   origif: igb0


<165>1 2024-02-06T01:36:14-05:00 opnsense dpinger 53072 - [meta sequenceId="75"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)
<12>1 2024-02-06T01:37:03-05:00 opnsense dpinger 4447 - [meta sequenceId="76"] exiting on signal 15
<12>1 2024-02-06T01:37:03-05:00 opnsense dpinger 47956 - [meta sequenceId="77"] send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 0ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 1.1.1.1  bind_addr 100.79.101.92  identifier "STARLINK_DHCP "
<165>1 2024-02-06T01:37:03-05:00 opnsense dpinger 53072 - [meta sequenceId="78"] Reloaded gateway watcher configuration on SIGHUP
<165>1 2024-02-06T01:37:21-05:00 opnsense dpinger 53072 - [meta sequenceId="79"] Reloaded gateway watcher configuration on SIGHUP
<12>1 2024-02-06T01:38:19-05:00 opnsense dpinger 35161 - [meta sequenceId="80"] send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 0ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 2001:4860:4860::8844  bind_addr 2605:59c8:2300:98f9:xxxx:xxxx:xxxx:xxxx  identifier "STARLINK_DHCP6 "
<165>1 2024-02-06T01:38:19-05:00 opnsense dpinger 53072 - [meta sequenceId="81"] Reloaded gateway watcher configuration on SIGHUP
<165>1 2024-02-06T01:38:20-05:00 opnsense dpinger 53072 - [meta sequenceId="82"] ALERT: STARLINK_DHCP6 (Addr: 2001:4860:4860::8844 Alarm: down -> none RTT: 51.3 ms RTTd: 3.9 ms Loss: 0.0 %)
<165>1 2024-02-06T09:41:27-05:00 opnsense dpinger 53072 - [meta sequenceId="1"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 30.7 ms RTTd: 3.7 ms Loss: 75.0 %)
<165>1 2024-02-06T09:41:57-05:00 opnsense dpinger 53072 - [meta sequenceId="2"] ALERT: STARLINK_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 30.6 ms RTTd: 5.4 ms Loss: 25.0 %)


I used to have this issue in the past as well, but it hasn't been a problem for a while.  Currently still on 23.7.12_5 as I was waiting for a few patches before upgrading.  However, if this issue is now back in 24.x I'll be waiting a bit longer :)
OPNsense 24.7.7 running on:
Dell Optiplex 3050
Intel I5-7600 @ 3.5Ghz (4 Cores)
Intel I350-T4 Nic
8G DDR4
256G SSD

Quote from: axsdenied on February 06, 2024, 08:36:11 PM
I used to have this issue in the past as well, but it hasn't been a problem for a while.  Currently still on 23.7.12_5 as I was waiting for a few patches before upgrading.  However, if this issue is now back in 24.x I'll be waiting a bit longer :)

Yeah... This is definitely a regression. Almost every day I need to reset the state or the gateway like this, so that the state goes back to the right gateway instead of being stuck on the Starlink temporary IP/gateway after it reboots or updates itself. The temporary gateway on which it gets stuck is 192.168.100.1, but the gateway once it is really up is 100.64.0.1.

Killing the state resets it properly.

It is likely something that happens (or doesn't happen in this case) during the interface flap and/or the DHCP address issuance by Starlink to opnsense so the states never reset to the new gateway...

Bad state (My gateway monitoring is configured to ping 1.1.1.1 on Starlink)

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:28961 -> 1.1.1.1:28961       0:0
   age 08:33:39, expires in 00:00:10, 30306:0 pkts, 878874:0 bytes, rule 102
   id: ec7de16500000001 creatorid: 5f0e2da3 gateway: 192.168.100.1
   origif: igb0


Killing the bad state

root@opnsense:~ # pfctl -k id -k ec7de16500000001
killed 1 states


The right state after killing the bad one. Gateway is now marked as up.


root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:28961 -> 1.1.1.1:28961       0:0
   age 00:00:02, expires in 00:00:10, 3:3 pkts, 87:87 bytes, rule 104
   id: 3564d96500000002 creatorid: 5f0e2da3 gateway: 100.64.0.1
   origif: igb0

After your reply I re-read your post.  I actually block DHCP leases from 192.168.100.1 on the WAN interface so that the modem can't temporarily assign an address from that block to opnsense.

If you don't, you could also have issues like what you're describing, as it's technically a valid network config; it just can't route anywhere, and sometimes when the real network becomes available it doesn't swap.

Quote from: axsdenied on February 13, 2024, 04:37:16 PM
After your reply I re-read your post.  I actually block DHCP leases from 192.168.100.1 on the WAN interface so that the modem can't temporarily assign an address from that block to opnsense.

If you don't, you could also have issues like what you're describing, as it's technically a valid network config; it just can't route anywhere, and sometimes when the real network becomes available it doesn't swap.

I do block it on interfaces other than Starlink. The reason I keep it enabled on Starlink is to be able to access the dish in case there is an issue: snow on the dish, firmware going bad, or needing to reach the antenna when it is stowed. In all those cases, the dish falls back to its 192.168.100.1 IP and that's the only way to access it. As soon as it comes back up, it re-issues an IP in the Starlink network. When that happens, I expect the state to be cleared and/or dpinger to be reloaded/restarted, which should also clear the state.

But yes, as a workaround I could block those, or even set up a cronjob that flushes the state every now and then when it finds it is stuck on 192.168.100.1 or something... But in theory the DHCP, interface and gateway scripts should all handle this automatically. It was working fine in 23.x once it was fixed (it was buggy at some point in 22.x or early 23.x; I can't remember exactly when it started to happen, but it was around the time the devs were working on the scripts that handle gateways, interfaces, etc.).

Thanks for the idea, I may give it a try if no bugfix is made soon. I did not open a new bug report since this is a regression, but maybe I should...  :-\

Most of this was discussed and patched in this other thread: https://forum.opnsense.org/index.php?topic=33831.0

I don't have Starlink so I don't have firsthand experience, but out of curiosity, when the Starlink network is up and everything is working with a Starlink network IP, can you still access the dish via the 192.168.100.x network?

Quote from: axsdenied on February 14, 2024, 06:44:42 PM
I don't have Starlink so I don't have firsthand experience, but out of curiosity, when the Starlink network is up and everything is working with a Starlink network IP, can you still access the dish via the 192.168.100.x network?

Yes, the dish keeps this IP, but once it has a proper SL connection, the DHCP address it hands you is no longer in this range.

opnsense usually handles this properly because SL still advertises this IP in the DHCP options (Classless-Static-Route, Option 121), which list the other networks that can be reached through it. That list includes this address (and some other public IPs too, I guess for their services in AWS).

Here's a DHCP reply when the SL dish is connected to the SL network.

You can see in the DHCP reply that the default gateway is 100.64.0.1 (which is the case when SL is UP). The SL dish still uses 192.168.100.1 and, in fact, when you use the SL app to manage the antenna, it connects to this IP.

13:08:23.480240 xx:xx:xx:xx:xx:xx > xx:xx:xx:xx:xx:xx, ethertype IPv4 (0x0800), length 350: (tos 0x0, ttl 64, id 49857, offset 0, flags [DF], proto UDP (17), length 336)
    100.64.0.1.67 > 100.79.101.92.68: [no cksum] BOOTP/DHCP, Reply, length 308, xid 0x12b7a4ac, Flags [none] (0x0000)
  Your-IP 100.79.101.92
  Server-IP 10.10.10.10
  Gateway-IP 192.168.100.100
  Client-Ethernet-Address xx:xx:xx:xx:xx:xx
  Vendor-rfc1048 Extensions
    Magic Cookie 0x63825363
    DHCP-Message Option 53, length 1: ACK
    Subnet-Mask Option 1, length 4: 255.192.0.0
    Server-ID Option 54, length 4: 100.64.0.1
    Default-Gateway Option 3, length 4: 100.64.0.1
    Lease-Time Option 51, length 4: 300
    Domain-Name-Server Option 6, length 8: 1.1.1.1,8.8.8.8
    Classless-Static-Route Option 121, length 23: (192.168.100.1/32:0.0.0.0),(34.120.255.244/32:0.0.0.0),(default:100.64.0.1)
    MTU Option 26, length 2: 1500
    END Option 255, length 0
    PAD Option 0, length 0
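As a cross-check on the Option 121 line, here is a small Python sketch of the RFC 3442 encoding: each route is one prefix-length byte, then only the significant octets of the destination, then the 4-byte router. The function names are made up for illustration. Encoding the three routes from the ACK above gives 9 + 9 + 5 = 23 bytes, which matches the `length 23` that tcpdump reports. As far as I understand RFC 3442, a router of 0.0.0.0 means the destination is reachable directly on the link.

```python
import ipaddress


def encode_routes(routes):
    """Encode [(destination_cidr, router_ip), ...] into option 121 payload bytes."""
    payload = bytearray()
    for cidr, router in routes:
        net = ipaddress.ip_network(cidr)
        significant = (net.prefixlen + 7) // 8  # only significant octets are sent
        payload.append(net.prefixlen)
        payload += net.network_address.packed[:significant]
        payload += ipaddress.ip_address(router).packed
    return bytes(payload)


def decode_routes(payload):
    """Decode option 121 payload bytes into [(destination_cidr, router_ip), ...]."""
    payload = bytes(payload)
    routes, i = [], 0
    while i < len(payload):
        prefixlen = payload[i]
        significant = (prefixlen + 7) // 8
        # Pad the destination back to four octets.
        dest = payload[i + 1:i + 1 + significant] + bytes(4 - significant)
        router = payload[i + 1 + significant:i + 5 + significant]
        routes.append((
            f"{ipaddress.IPv4Address(dest)}/{prefixlen}",
            str(ipaddress.IPv4Address(router)),
        ))
        i += 5 + significant
    return routes


# Routes from the ACK above; tcpdump prints 0.0.0.0/0 as "default".
ROUTES = [
    ("192.168.100.1/32", "0.0.0.0"),
    ("34.120.255.244/32", "0.0.0.0"),
    ("0.0.0.0/0", "100.64.0.1"),
]

payload = encode_routes(ROUTES)
print(len(payload))            # 23
print(decode_routes(payload))
```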


If I put the SL dish in stow mode (flipped down to not talk to satellites, or when SL is down, maintenance, whatever) the DHCP reply becomes this. The GW is .1 and it gives me .100 in the 192.168.100.0/24 range

13:11:42.957591 xx:xx:xx:xx:xx:xx > xx:xx:xx:xx:xx:xx, ethertype IPv4 (0x0800), length 320: (tos 0x0, ttl 255, id 0, offset 0, flags [none], proto UDP (17), length 306)
    192.168.100.1.67 > 192.168.100.100.68: [no cksum] BOOTP/DHCP, Reply, length 278, xid 0xae69f181, Flags [none] (0x0000)
  Your-IP 192.168.100.100
  Client-Ethernet-Address xx:xx:xx:xx:xx:xx
  Vendor-rfc1048 Extensions
    Magic Cookie 0x63825363
    DHCP-Message Option 53, length 1: ACK
    Subnet-Mask Option 1, length 4: 255.255.255.0
    Server-ID Option 54, length 4: 192.168.100.1
    Default-Gateway Option 3, length 4: 192.168.100.1
    Lease-Time Option 51, length 4: 5
    Domain-Name-Server Option 6, length 4: 192.168.100.1
    MTU Option 26, length 2: 1500
    END Option 255, length 0


And now I have the same problem, the gateway is now marked as down even though SL is back up.

It's weird because, for a few seconds when SL comes back up, I see 2 states. One of them would be the right one, but it ends up disappearing and the bad state remains.

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:59388 -> 1.1.1.1:59388       0:0
   age 00:02:54, expires in 00:00:09, 171:0 pkts, 4959:0 bytes, rule 104
   id: 9512dc6500000002 creatorid: 5f0e2da3 gateway: 192.168.100.1
   origif: igb0
--
all icmp 100.79.101.92:63965 (192.168.22.14:14148) -> 1.1.1.1:63965       0:0
   age 00:00:11, expires in 00:00:00, 2:2 pkts, 168:168 bytes, rule 104
   id: e113dc6500000002 creatorid: 5f0e2da3 gateway: 100.64.0.1
   origif: igb0


And after a few seconds... The bad one remains and the gateway remains marked as down

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:59388 -> 1.1.1.1:59388       0:0
   age 00:05:58, expires in 00:00:09, 353:0 pkts, 10237:0 bytes, rule 104
   id: 9512dc6500000002 creatorid: 5f0e2da3 gateway: 192.168.100.1
   origif: igb0


And dpinger is configured to use the right gateway (100.64.0.1), but it doesn't work, likely because of the bad state.

root@opnsense:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "100.64.0.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "192.168.170.2": "192.168.170.2",
        "192.168.171.2": "192.168.171.2",
        "2620:fe::9": "2001:470:xx:x::x"
    }
}
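The mismatch between the two outputs can be checked mechanically. Below is a minimal Python sketch that compares the `dpinger` section of `pluginctl -r host_routes` with the gateway pinned in each ICMP state from `pfctl -ss -vvv`. The function name `gateway_mismatches` is made up, and the parsing assumes the exact layouts shown in this thread.

```python
import json
import re

HEAD_RE = re.compile(r"^all icmp \S+ (?:\(\S+\) )?-> (\d+\.\d+\.\d+\.\d+):\d+")
ID_RE = re.compile(r"id: (\w+) creatorid: \w+ gateway: (\S+)")


def gateway_mismatches(host_routes_json, pfctl_output):
    """Return (monitor_ip, expected_gw, state_gw, state_id) for ICMP states
    whose pinned gateway differs from the dpinger host route."""
    expected = json.loads(host_routes_json).get("dpinger", {})
    mismatches, dest = [], None
    for line in pfctl_output.splitlines():
        head = HEAD_RE.match(line)
        if head:
            dest = head.group(1)
            continue
        if line and not line[0].isspace():
            dest = None  # other state types, ALTQ notices, grep "--" separators
            continue
        ident = ID_RE.search(line)
        if ident and dest in expected:
            state_id, state_gw = ident.groups()
            if state_gw != expected[dest]:
                mismatches.append((dest, expected[dest], state_gw, state_id))
            dest = None
    return mismatches


HOST_ROUTES = '{"core": {"8.8.8.8": null}, "dpinger": {"1.1.1.1": "100.64.0.1"}}'

STATES = """\
all icmp 100.79.101.92:59388 -> 1.1.1.1:59388       0:0
   age 00:02:54, expires in 00:00:09, 171:0 pkts, 4959:0 bytes, rule 104
   id: 9512dc6500000002 creatorid: 5f0e2da3 gateway: 192.168.100.1
   origif: igb0
--
all icmp 100.79.101.92:63965 (192.168.22.14:14148) -> 1.1.1.1:63965       0:0
   age 00:00:11, expires in 00:00:00, 2:2 pkts, 168:168 bytes, rule 104
   id: e113dc6500000002 creatorid: 5f0e2da3 gateway: 100.64.0.1
   origif: igb0
"""

for dest, want, have, state_id in gateway_mismatches(HOST_ROUTES, STATES):
    print(f"{dest}: host route says {want}, state {state_id} uses {have}")
```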


While SL was down, dpinger updated itself to use the dish IP properly, so it seems dpinger is doing its job but something else with the states is not working well.

Here's how it looks when SL is down

root@opnsense:~ # pluginctl -r host_routes
{
    "core": {
        "8.8.8.8": null,
        "8.8.4.4": null
    },
    "dpinger": {
        "8.8.4.4": "10.50.45.70",
        "1.1.1.1": "192.168.100.1",
        "2001:4860:4860::8844": "fe80::200:xxxx:xxxx:xxx%igb0",
        "149.112.112.112": "192.168.2.1",
        "192.168.170.2": "192.168.170.2",
        "192.168.171.2": "192.168.171.2",
        "2620:fe::9": "2001:470:xx:x::x"
    }
}

Problem is still present in 24.1.2

Bad state

No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:33064 -> 1.1.1.1:33064       0:0
   age 08:41:39, expires in 00:00:10, 30734:0 pkts, 891286:0 bytes, rule 104
   id: d928da6500000003 creatorid: d7e1a47d gateway: 192.168.100.1
   origif: igb0


Killing it

root@opnsense:~ # pfctl -k id -k d928da6500000003
killed 1 states


The state is now back to what it should be and the gateway is recovering

root@opnsense:~ # pfctl -ss -vvv | grep "1\.1\.1\.1" -A 3
No ALTQ support in kernel
ALTQ related functions disabled
all icmp 100.79.101.92:33064 -> 1.1.1.1:33064       0:0
   age 00:00:05, expires in 00:00:09, 5:5 pkts, 145:145 bytes, rule 104
   id: 7698db6500000002 creatorid: d7e1a47d gateway: 100.64.0.1
   origif: igb0

Can confirm this.

I was on 23.7.12_5 with a dual-WAN configuration. Since I use OPNsense as the core of my enterprise network, I take some tests very seriously. One of them is testing the failover behaviour in depth; I have two different ISPs, both via cable modem.
I know that disconnecting the coax cable from the cable modem makes the Sense boxes go crazy when configured for failover, but none of this happened on the 23.7.x series.

Since upgrading to 24.1.5_3, some of those behaviours have come back, including:

Interface with a public IP address but marked as down, with no response when pinging the monitor IP.
Lots of sendto error: 65 in the gateway log for the gateway marked as down.
Suddenly high ping and then sendto error: 65.
Sometimes, when unplugging and replugging the coax cable of a cable modem, it takes a very long time for OPNsense to mark the gateway as up again, and then it starts to flap.

Some logs:

2024-05-03T12:12:20-03:00 Warning dpinger FIBERTEL_DHCP 1.1.1.1: sendto error: 65
2024-05-03T12:05:44-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 10.9 ms RTTd: 2.8 ms Loss: 30.0 %)
2024-05-03T12:05:34-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 10.8 ms RTTd: 2.6 ms Loss: 12.0 %)
2024-05-03T07:28:59-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 10.6 ms RTTd: 3.0 ms Loss: 3.0 %)
2024-05-03T07:28:49-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 10.7 ms RTTd: 3.3 ms Loss: 20.0 %)
2024-05-03T07:21:42-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 11.1 ms RTTd: 1.9 ms Loss: 32.0 %)
2024-05-03T07:21:31-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 11.3 ms RTTd: 2.2 ms Loss: 12.0 %)
2024-05-03T06:41:05-03:00 Notice dpinger ALERT: TELECENTRO_DHCP (Addr: 1.0.0.1 Alarm: loss -> none RTT: 16.3 ms RTTd: 4.9 ms Loss: 10.0 %)
2024-05-03T06:40:20-03:00 Notice dpinger ALERT: TELECENTRO_DHCP (Addr: 1.0.0.1 Alarm: none -> loss RTT: 15.4 ms RTTd: 6.6 ms Loss: 12.0 %)
2024-05-03T03:32:54-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 10.6 ms RTTd: 3.0 ms Loss: 3.0 %)
2024-05-03T03:32:44-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 10.8 ms RTTd: 3.3 ms Loss: 20.0 %)
2024-05-03T03:28:40-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 13.8 ms RTTd: 22.9 ms Loss: 30.0 %)
2024-05-03T03:28:30-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 13.1 ms RTTd: 20.5 ms Loss: 12.0 %)
2024-05-03T02:18:13-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 10.7 ms RTTd: 3.0 ms Loss: 3.0 %)
2024-05-03T02:18:03-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 10.7 ms RTTd: 3.3 ms Loss: 20.0 %)
2024-05-03T02:06:51-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 11.4 ms RTTd: 2.9 ms Loss: 32.0 %)
2024-05-03T02:06:41-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 11.2 ms RTTd: 2.6 ms Loss: 12.0 %)
2024-05-03T00:17:12-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 11.9 ms RTTd: 4.0 ms Loss: 3.0 %)
2024-05-03T00:17:02-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 11.9 ms RTTd: 4.3 ms Loss: 20.0 %)
2024-05-03T00:12:29-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 10.6 ms RTTd: 1.4 ms Loss: 30.0 %)
2024-05-03T00:12:18-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 10.6 ms RTTd: 1.2 ms Loss: 12.0 %)
2024-05-02T23:16:49-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 10.4 ms RTTd: 0.8 ms Loss: 3.0 %)
2024-05-02T23:16:38-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 10.4 ms RTTd: 0.8 ms Loss: 20.0 %)
2024-05-02T23:11:37-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 10.5 ms RTTd: 0.8 ms Loss: 32.0 %)
2024-05-02T23:11:27-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 10.7 ms RTTd: 1.0 ms Loss: 12.0 %)
2024-05-02T23:01:44-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: delay -> none RTT: 10.4 ms RTTd: 1.4 ms Loss: 0.0 %)
2024-05-02T23:01:33-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> delay RTT: 498.6 ms RTTd: 1574.7 ms Loss: 3.0 %)
2024-05-02T23:00:40-03:00 Warning dpinger FIBERTEL_DHCP 1.1.1.1: sendto error: 65
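To put numbers on the flapping, here is a small Python sketch that pairs each `-> down` alert with the next `down ->` alert for the same gateway and reports how long it stayed down. The name `down_intervals` is made up, and the regex assumes the GUI log format pasted above (newest entry first).

```python
import re
from datetime import datetime

ALERT_RE = re.compile(
    r"^(\S+) Notice dpinger ALERT: (\S+) \(Addr: \S+ Alarm: (\w+) -> (\w+)"
)


def down_intervals(log_text):
    """Return [(gateway, went_down, came_back, seconds)] in chronological order."""
    events = []
    for line in log_text.splitlines():
        match = ALERT_RE.match(line)
        if match:
            when = datetime.fromisoformat(match.group(1))
            events.append((when, match.group(2), match.group(3), match.group(4)))
    events.sort(key=lambda event: event[0])  # the pasted log is newest first
    down_since, intervals = {}, []
    for when, gateway, old, new in events:
        if new == "down":
            down_since[gateway] = when
        elif old == "down" and gateway in down_since:
            start = down_since.pop(gateway)
            intervals.append((gateway, start, when, int((when - start).total_seconds())))
    return intervals


SAMPLE = """\
2024-05-03T12:12:20-03:00 Warning dpinger FIBERTEL_DHCP 1.1.1.1: sendto error: 65
2024-05-03T07:28:59-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> none RTT: 10.6 ms RTTd: 3.0 ms Loss: 3.0 %)
2024-05-03T07:28:49-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 10.7 ms RTTd: 3.3 ms Loss: 20.0 %)
2024-05-03T07:21:42-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 11.1 ms RTTd: 1.9 ms Loss: 32.0 %)
2024-05-03T07:21:31-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: none -> loss RTT: 11.3 ms RTTd: 2.2 ms Loss: 12.0 %)
2024-05-03T03:32:44-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: down -> loss RTT: 10.8 ms RTTd: 3.3 ms Loss: 20.0 %)
2024-05-03T03:28:40-03:00 Notice dpinger ALERT: FIBERTEL_DHCP (Addr: 1.1.1.1 Alarm: loss -> down RTT: 13.8 ms RTTd: 22.9 ms Loss: 30.0 %)
"""

for gateway, start, end, seconds in down_intervals(SAMPLE):
    print(f"{gateway} down from {start.time()} to {end.time()} ({seconds} s)")
```

On these sample lines, the two outages come out at 244 and 427 seconds.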

I was just hit by this after upgrading to 24.1.7


2024-05-21T19:29:59-04:00 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.4.4 bind_addr 100.99.yy.xx identifier "WAN_SL_DHCP "
2024-05-21T19:29:59-04:00 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 192.168.1.64 identifier "WAN_MX_DHCP "
2024-05-21T19:29:59-04:00 Warning dpinger exiting on signal 15
2024-05-21T19:29:59-04:00 Warning dpinger exiting on signal 15
2024-05-21T19:29:59-04:00 Warning dpinger exiting on signal 15
2024-05-21T19:13:59-04:00 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 1.1.1.1 bind_addr 100.99.yy.xx identifier "WAN_SL_DHCP "
2024-05-21T19:13:59-04:00 Warning dpinger exiting on signal 15
2024-05-21T19:13:57-04:00 Warning dpinger WAN_SL_DHCP 1.1.1.1: sendto error: 22
2024-05-21T19:13:56-04:00 Warning dpinger WAN_SL_DHCP 1.1.1.1: sendto error: 22


I've set the Starlink GW as a far GW for now...

Also there's another similar post

EDIT: Setting it as a far GW doesn't help at all! :-(

Quote from: mircsicz on May 22, 2024, 01:33:08 AM
I was just hit by this after upgrading to 24.1.7



I've set the Starlink GW as a far GW for now...

Also there's another similar post

EDIT: Setting it as a far GW doesn't help at all! :-(

Bump on this. I updated to 24.1.8 and the problem still happens.

A little update on this: on Wednesday I went to System: Gateways: Group, clicked edit on the failover group, didn't change anything and saved it.

I also changed the monitor IPs to 1.0.0.1 for ISP 1 and 8.8.8.8 for ISP 2 and saved (before, I was using 1.0.0.1 and 1.1.1.1).

After these changes I ran some tests to trigger the failover; all the errors are gone and online availability behaved as expected.

I will update this if something happens again.