22.7 Legacy Series / ARP issue in 22.7.3_2
« on: September 08, 2022, 01:35:14 pm »
I'm in the process of rolling out several identical OPNsense boxes, all on the same version (22.7.3_2), running on top of Proxmox 7.2 on small firewall appliances bought off Amazon. Two of the three have had zero issues so far and are running fine. Unfortunately, the third one, my home box with dual WAN, is having an ARP issue, likely caused by its direct exposure to the Internet.
On this box, I have AT&T fiber on WAN1 and Spectrum cable on WAN2. Both providers come into their own modems first and then go to the appliance. Spectrum is operating in bridged mode, so the appliance gets DHCP right from Spectrum. AT&T does this odd thing where the modem passes the public IP address it gets from the provider through to the appliance using DHCP (kind of like a DHCP relay). This works fine, and I get a public IP on WAN1.
Twice now, however, I have had the firewall completely lock up. The first time, a couple of days after putting the appliance into service, I didn't do any troubleshooting; I just cold cycled the appliance and everything magically came back up. Then, a couple of days after that incident, I lost all internet again (despite having two WANs in failover). This time, I went to the Proxmox GUI and saw that the VM for the firewall had an exclamation point on it.
I went to the console and it was filled with messages saying that the IP address of its WAN1 interface was being used by some other MAC address ("arp: XX:XX:XX:XX:XX:XX is using my IP address (YYY.YYY.YYY.YYY) on vtnet2!"). I attempted to recover the console, but the VM was completely hung: no ping, SSH, or console access at all. After a hard reset of just that VM, everything is back and working fine (with the same public IP I've always gotten from AT&T on WAN1).
I think it's possible I am the target of some kind of ARP poisoning attack. If AT&T had simply assigned my outside IP to someone else by accident, that alone shouldn't overwhelm the OPNsense VM. I think it's more likely that someone is playing with black hat tools and spraying bad ARP packets at the firewall, and the load from this attack causes the firewall to seize up. At least, I hope that's the case - a firewall shouldn't completely hang just because one of its IPs was duplicated.
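One way to tell an accidental duplicate from a spray might be to count how many distinct MAC addresses show up in those messages. Below is a minimal Python sketch, assuming the console messages also end up in a log file that can be read after a reboot (I have only seen them on the console, so the log path is an assumption, and the function name `tally_conflicts` is just something made up for illustration). The regular expression follows the message quoted above and treats the parentheses around the IP as optional.

```python
import re
import sys
from collections import Counter

# Matches the kernel message quoted in the post, for example:
#   arp: aa:bb:cc:dd:ee:ff is using my IP address (203.0.113.7) on vtnet2!
# The parentheses around the IP address are treated as optional.
ARP_DUP = re.compile(
    r"arp: (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) is using my IP address "
    r"\(?(?P<ip>[0-9.]+)\)? on (?P<ifname>\w+)!",
    re.IGNORECASE,
)


def tally_conflicts(lines):
    """Count duplicate-IP ARP messages per (interface, MAC) pair."""
    counts = Counter()
    for line in lines:
        match = ARP_DUP.search(line)
        if match:
            key = (match.group("ifname"), match.group("mac").lower())
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 arp_tally.py <path-to-saved-log>
    with open(sys.argv[1], errors="replace") as fh:
        for (ifname, mac), n in tally_conflicts(fh).most_common():
            print(f"{ifname}  {mac}  {n}")
```

A single MAC repeating would point toward one misconfigured or misassigned host; many different MACs would fit the spraying theory better. Neither result would explain on its own why the VM hangs.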
I cannot set the address on WAN1 to be a hard, static IP, because if I do so, the modem will think the lease is expired and give the IP address up to someone else. So I can only really think of three alternatives:
1) Find some way to block ARP packets on my WAN1 interface. This is undesirable because I'd then have to set the upstream gateway's MAC by hand, which might change on the provider's side, and the MAC of the "usurper" will likely be different every time.
2) Change the AT&T modem so it no longer passes the public IP through to the firewall, and let the modem absorb the ARP attack. This is also undesirable because I would lose the ability to reach my outside interface directly from the Internet for things like SSL VPN.
3) Have opnsense configured in some way to rate-limit this activity, or be patched by some upgrade to not completely hang up/seize when its WAN interface is "being duplicated." I would expect the behavior in this case to be "WAN1 is crap, so the gateway monitoring pings fail, switch to WAN2 until WAN1 sorts itself out" instead of "WAN1's IP is being usurped! WHAAAA... CAN... NOT... COMPUTE..."
Any assistance/pointers would be appreciated.