Randomly lose Internet connectivity, must reboot or Reload all Services CLI

Started by FLguy, January 29, 2025, 03:57:30 PM

I just did a search on the error message and there were a few entries in the forum.
https://forum.opnsense.org/index.php?topic=34340.0

Interestingly, in that thread the OP restarted from scratch...
If you end up doing that, it could be interesting to do a config comparison (current and new-functionally-equivalent) to possibly narrow down the root cause.
I've done some simple search/replace on configs (e.g. change from passthrough to bridges under proxmox).
If you have a lot of rules, it might be possible to copy/paste large sections of the old config.
That would not only accelerate the transition, but it could also make the comparison easier.
There's a risk of moving the problem along though...
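For the config comparison, something like this rough sketch could do the job. It only does a line-based diff of two exported configs with Python's standard `difflib`; the file names `config-old.xml` and `config-new.xml` are placeholders I made up, and the sample snippets are illustrative rather than taken from a real export.

```python
import difflib


def diff_configs(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines between two exported config texts."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="config-old.xml",  # placeholder label, not a real path
            tofile="config-new.xml",    # placeholder label, not a real path
            lineterm="",
            n=1,  # one line of context keeps the output short
        )
    )


if __name__ == "__main__":
    # Illustrative snippets only: passthrough NIC vs. a bridged virtual NIC.
    old = "<wan>\n  <if>igb0</if>\n  <ipaddr>dhcp</ipaddr>\n</wan>"
    new = "<wan>\n  <if>vtnet0</if>\n  <ipaddr>dhcp</ipaddr>\n</wan>"
    print("\n".join(diff_configs(old, new)))
```

If the new config is built by copy/pasting sections of the old one, the diff should stay small enough to read by eye.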

Quote from: EricPerl on April 01, 2025, 08:32:14 PM
I just did a search on the error message and there were a few entries in the forum.
https://forum.opnsense.org/index.php?topic=34340.0

Damn, I did see this post after all. It doesn't provide any helpful information. He mentioned trying to reach a gateway outside the subnet. My default gateway is the IP that arpresolve complains about, and of course that's in the same subnet as my interface.

Your suggestion to move my WAN-F interface got me thinking: there is no special configuration on either my WAN-F = WAN or WAN-C = OPT2 interface. They are both DHCP. I can just swap the two cables between those two interfaces and see if the problem starts happening with my cable provider. So now the interface where the cable provider is plugged in is WAN-F, and WAN-C is now <wan>, and that interface is still disabled. So far, it's been over 24 hours with no outages.

After 3 days, I will re-enable the other interface to see what happens. These two interfaces are configured in the same way:

    <wan>
      <if>igb0</if>
      <descr/>
      <if>igb0</if>
      <descr/>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>dhcp</ipaddr>
      <dhcphostname/>
      <alias-address/>
      <alias-subnet>32</alias-subnet>
      <dhcprejectfrom/>
      <adv_dhcp_pt_timeout/>
      <adv_dhcp_pt_retry/>
      <adv_dhcp_pt_select_timeout/>
      <adv_dhcp_pt_reboot/>
      <adv_dhcp_pt_backoff_cutoff/>
      <adv_dhcp_pt_initial_interval/>
      <adv_dhcp_pt_values>SavedCfg</adv_dhcp_pt_values>
      <adv_dhcp_send_options/>
      <adv_dhcp_request_options/>
      <adv_dhcp_required_options/>
      <adv_dhcp_option_modifiers/>
      <adv_dhcp_config_advanced/>
      <adv_dhcp_config_file_override/>
      <adv_dhcp_config_file_override_path/>
    </wan>
    <opt2>
      <if>igb2</if>
      <descr>WANC</descr>
      <enable>1</enable>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>dhcp</ipaddr>
      <dhcphostname/>
      <alias-address/>
      <alias-subnet>32</alias-subnet>
      <dhcprejectfrom/>
      <adv_dhcp_pt_timeout/>
      <adv_dhcp_pt_retry/>
      <adv_dhcp_pt_select_timeout/>
      <adv_dhcp_pt_reboot/>
      <adv_dhcp_pt_backoff_cutoff/>
      <adv_dhcp_pt_initial_interval/>
      <adv_dhcp_pt_values>SavedCfg</adv_dhcp_pt_values>
      <adv_dhcp_send_options/>
      <adv_dhcp_request_options/>
      <adv_dhcp_required_options/>
      <adv_dhcp_option_modifiers/>
      <adv_dhcp_config_advanced/>
      <adv_dhcp_config_file_override/>
      <adv_dhcp_config_file_override_path/>
    </opt2>
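To double-check that two interface sections really match, a small script can list only the settings that differ. This is a sketch, assuming both sections sit under a common parent element (called `<interfaces>` here) and that comparing the text of each child tag is enough; the helper names are my own, and the sample is a trimmed-down version of the excerpt above.

```python
import xml.etree.ElementTree as ET


def section_settings(section):
    """Map each child tag to its text; empty elements become ''."""
    return {child.tag: (child.text or "").strip() for child in section}


def interface_differences(interfaces_xml, name_a, name_b):
    """Return {tag: (value_a, value_b)} for every tag that differs.

    A value of None means the tag is absent from that section.
    """
    root = ET.fromstring(interfaces_xml)
    sec_a, sec_b = root.find(name_a), root.find(name_b)
    if sec_a is None or sec_b is None:
        raise ValueError("interface section not found")
    a, b = section_settings(sec_a), section_settings(sec_b)
    diffs = {}
    for tag in sorted(set(a) | set(b)):
        if a.get(tag) != b.get(tag):
            diffs[tag] = (a.get(tag), b.get(tag))
    return diffs


if __name__ == "__main__":
    sample = """
    <interfaces>
      <wan><if>igb0</if><descr/><ipaddr>dhcp</ipaddr></wan>
      <opt2><if>igb2</if><descr>WANC</descr><enable>1</enable><ipaddr>dhcp</ipaddr></opt2>
    </interfaces>
    """
    for tag, values in interface_differences(sample, "wan", "opt2").items():
        print(tag, values)
```

Anything it prints beyond the device name, the description and the enable flag would be worth a closer look.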

If you're not using any load balancing or auto failover, you might be better off only having 1 WAN interface, which also means 1 GW.
With 2 interfaces, you probably have 2 GWs, one for each interface.
You might have different configurations for these (monitoring, declared as upstream, priority).
Even if you disable an interface, I'm not sure what it does to its GW.

Switching cables is one way to investigate.
You might still have gremlins left over from some multi-WAN behavior.
That makes 4 combinations to test: enabled interface x physical cable option.
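To keep track of those combinations, a tiny checklist generator might help. This is only a sketch: the labels are my own shorthand for the two interfaces and the two cabling layouts, and `outage_seen` is a made-up field to fill in by hand after each test run.

```python
from itertools import product

# Hypothetical labels for the two things being varied.
ENABLED_INTERFACES = ["wan (igb0)", "opt2 (igb2)"]
CABLE_LAYOUTS = ["original cabling", "cables swapped"]


def build_test_matrix():
    """Return one record per (enabled interface, cabling) combination."""
    return [
        {"enabled": iface, "cabling": layout, "outage_seen": None}
        for iface, layout in product(ENABLED_INTERFACES, CABLE_LAYOUTS)
    ]


if __name__ == "__main__":
    for case in build_test_matrix():
        print(case)
```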

The arp message might just be a consequence of the existence of these 2 GWs.
The fiber connection goes down, the system tries to use the cable GW, it ends up out of the fiber subnet.
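A quick way to sanity-check that theory is to test whether a given gateway address falls inside the subnet of the interface it is reached through. A minimal sketch with Python's standard `ipaddress` module follows; the addresses are documentation ranges, not anyone's real WAN addresses.

```python
import ipaddress


def gateway_on_link(interface_cidr: str, gateway: str) -> bool:
    """True if the gateway lies inside the interface's subnet."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(gateway) in network


if __name__ == "__main__":
    fiber_if = "192.0.2.10/24"    # example address for the fiber WAN
    fiber_gw = "192.0.2.1"        # gateway in the same subnet
    cable_gw = "198.51.100.1"     # gateway belonging to the other WAN
    print(gateway_on_link(fiber_if, fiber_gw))
    print(gateway_on_link(fiber_if, cable_gw))
```

If the gateway that arpresolve complains about is on-link for the interface in question, as reported earlier in the thread, this explanation becomes less likely.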

Quote from: EricPerl on April 02, 2025, 09:08:00 PM
If you're not using any load balancing or auto failover, you might be better off only having 1 WAN interface, which also means 1 GW.
You might still have gremlins left over from some multi-WAN behavior.

The arp message might just be a consequence of the existence of these 2 GWs.

No, no, no. I want to get back to the multi-WAN load balancing configuration that I had in November. The only reason I am disabling interfaces or gateways is to troubleshoot this issue. This issue could definitely be related to multi-WAN, even though most, if not all, of the posts related to the arpresolve message are about single-WAN configurations.

OK. You might as well compare your GW configs then: System > Gateways > Configuration.

I think I have just found the reason for this problem and other similar ones rising on the forum from time to time.
I can reproduce the alleged bug and its solution, at least on my systems.

If you have exhausted the available table entries in "Firewall: Aliases", then your firewall will refuse to let network clients browse the Internet. Clients can ping the firewall's LAN address, but their traffic will not go out through the WAN connection(s). This is true with single and multiple WAN; load balancing or failover mode does not make any difference.

The solution is tricky, because simply increasing the value of "Firewall Maximum Table Entries" in "Firewall, Settings, Advanced" will leave the firewall non-operational.
You have to restore a working firewall backup, even if it shows that the available entries are exhausted, then delete the rules that use aliases, starting with the big ones (hint: geoip aliases often contain many records), then delete aliases to bring the table entries under the predefined limit. At this point the limit can be increased, and the aliases and rules can be recreated.
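The "start from the big ones" part can be planned on paper before touching the firewall. Here is a minimal sketch, assuming you have noted the entry count of each alias yourself; the alias names, counts and the limit below are invented for illustration and are not OPNsense defaults.

```python
def aliases_to_remove(alias_sizes: dict[str, int], limit: int) -> list[str]:
    """Pick aliases to delete, largest first, until the total is under the limit."""
    total = sum(alias_sizes.values())
    removed = []
    # Walk the aliases from largest to smallest entry count.
    for name, size in sorted(alias_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if total < limit:
            break
        removed.append(name)
        total -= size
    return removed


if __name__ == "__main__":
    # Hypothetical alias names and entry counts.
    sizes = {"geoip_block": 900_000, "ads": 150_000, "lan_hosts": 50}
    print(aliases_to_remove(sizes, limit=1_000_000))
```

Whatever ends up in that list is what needs recreating once the limit has been raised.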

While the alleged bug is being investigated, I would suggest increasing the default "Firewall Maximum Table Entries" value by an order of magnitude (10x) BEFORE this capacity is exhausted. In my experience this does not slow down even the least powerful systems.