Problem with Load Balancing via Gateway Group and shared forwarding

Started by _richii, February 20, 2023, 04:05:01 PM

There seems to be a problem with Load Balancing (2 Gateways on Tier 1) via Gateway Groups and the shared forwarding feature.
As soon as the "Policy Based Routing" Firewall rules with the Load Balancing Gateway Group as a gateway are in place, two things happen:

1. The hardware console is spammed with arpresolve errors:
arpresolve: can't allocate llinfo for <IP> on igb0
arpresolve: can't allocate llinfo for <IP> on igb1

2. There are random(?) timeouts for outgoing traffic:
- First try loading a web page fails
- Second try loading a web page works
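To see which interface and address the arpresolve messages from point 1 pile up on, the saved console output could be tallied with something like the sketch below. It only assumes the message format quoted above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the console message format quoted above, e.g.
#   arpresolve: can't allocate llinfo for 192.0.2.1 on igb0
ARP_RE = re.compile(r"arpresolve: can't allocate llinfo for (\S+) on (\S+)")


def tally_arpresolve(lines):
    """Count arpresolve errors per (interface, IP) pair.

    `lines` is any iterable of strings, for example an open log file.
    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = ARP_RE.search(line)
        if match:
            ip, iface = match.groups()
            counts[(iface, ip)] += 1
    return counts
```

For example, `tally_arpresolve(open("console.log")).most_common()` lists the noisiest pairs first, where `console.log` stands for wherever the console output was saved.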

When the Gateway Group is set to Failover (first gateway on Tier 1, second gateway on Tier 2), there are no problems.
When the shared forwarding feature under Firewall -> Settings -> Advanced is disabled, there are no problems either.
But when the feature is disabled, there is no traffic shaping possible for PBR Firewall rules, or at least this is stated in the shared forwarding help text.

This problem also existed in 22.7.11_1, see -> https://forum.opnsense.org/index.php?topic=32374.0 for further information.

Does anybody use the Load Balancing feature of Gateway Groups and can reproduce this?
Or is using shared forwarding and Load Balancing via Gateway Groups mutually exclusive / not supported?
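For anyone trying to reproduce the intermittent timeouts, a small probe run from a LAN client might make results easier to compare. This is only a rough sketch: it assumes that every fresh TCP connection can be assigned to either gateway of the group, so a pattern of alternating failures would hint at one path being broken. The function name, defaults and target host are placeholders, not anything OPNsense provides.

```python
import socket
import time


def probe(host, port=443, attempts=20, timeout=3.0, pause=1.0,
          connect=socket.create_connection, sleep=time.sleep):
    """Open `attempts` fresh TCP connections and record which ones succeed.

    Returns a list of booleans, True for each successful connect.
    `connect` and `sleep` are parameters only so the function can be
    exercised without real network access.
    """
    results = []
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
            results.append(True)
        except OSError:  # covers timeouts, refused and unreachable
            results.append(False)
        sleep(pause)
    return results
```

Calling `probe("example.org")` once with the Load Balancing group active and once with the Failover group would give two lists that can be compared directly.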

OS / Hardware
OPNsense 23.1.1_2-amd64
FreeBSD 13.1-RELEASE-p6
OpenSSL 1.1.1t 7 Feb 2023
CPU: Intel(R) Pentium(R) Gold G6605 CPU @ 4.30GHz
Mainboard: Supermicro X12STL-IF
Network:
- Onboard: 2x Intel i210 RJ45 1GbE network ports (WAN)
- PCI-E: Mellanox ConnectX-4 Lx with 2x SFP28 25/10/1GbE network ports (LAN)

Hi all,

@_richii: Thank you for your time narrowing down this issue.
I have a much more modest setup, but I'm also experiencing this issue.

I'll run some tests based on your inputs and report back.

Thank you for your help!

EDIT: Forgot to mention that the issue only manifests itself when a large number of users are online (300+ users).

The hardware and software used is listed below:
OS: OPNsense 23.1.3-amd64
Hardware: Protecli FW4B
Intel Celeron® J3160
nics: 4x Intel(R) I211 (copper gigabit)

Hi,

After following your suggestions it is possible to use the system.

While reading the log files I stumbled upon a message regarding dpinger.

2023-03-17T14:33:23   Warning   dpinger   WAN_DHCP 8.8.8.8: sendto error: 65

There is some information about this error in the Netgate documentation. (https://docs.netgate.com/pfsense/en/latest/troubleshooting/gateway-errors.html)
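As a side note on reading that log line: the number after "sendto error" is an errno value, and on FreeBSD 65 is EHOSTUNREACH ("No route to host"). A minimal sketch for decoding such lines follows. The table is hard-coded on purpose, because Python's own errno module reflects the machine the script runs on and the numbers differ on Linux. The regex assumes the log layout shown above, and the function name is made up.

```python
import re

# Small subset of errno values from FreeBSD's <sys/errno.h>.
FREEBSD_ERRNO = {
    55: ("ENOBUFS", "No buffer space available"),
    64: ("EHOSTDOWN", "Host is down"),
    65: ("EHOSTUNREACH", "No route to host"),
}

# Assumes the layout "... dpinger <gateway> <monitor IP>: sendto error: <n>"
DPINGER_RE = re.compile(r"dpinger\s+(\S+)\s+(\S+): sendto error: (\d+)")


def explain_dpinger_error(line):
    """Return (gateway, monitor_ip, code, name, text), or None if no match."""
    match = DPINGER_RE.search(line)
    if match is None:
        return None
    gateway, monitor_ip = match.group(1), match.group(2)
    code = int(match.group(3))
    name, text = FREEBSD_ERRNO.get(code, ("unknown", "not in this table"))
    return gateway, monitor_ip, code, name, text
```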

Following that advice, I've created a pipe and rules to accommodate ICMP traffic.
Since I'm monitoring the upstream gateway, I've also ticked "Disable Host Route" in the gateway settings.

If the system is stable for a few days, I'll resume testing by re-enabling gateway groups.

BR,
Pedro

Unfortunately the system wasn't stable, so the operation was rolled back. (After a while using this infrastructure there are huge latency spikes.)

So far I've been using fq-codel as the shaping algorithm. I'll try WFQ tomorrow to check whether it has a different outcome.

@_richii: What queuing algorithm are you testing with?

BR,
Pedro

Hi all,

A quick update on this topic.

This issue was caused by the ISP gateways: both circuits are set to bridge mode but still cause issues. (!)
I've migrated this internet access to a corporate link with proper routers, and there is no longer any issue with the gateway.

Now I have to sort this issue out with the ISP, because internal policies don't allow wifi to use our IP space...

Best,
Pedro

I am experiencing this exact same issue with my system. I cannot set both gateways to the same tier, and even when they are set on different tiers it causes issues.

I am on version 23.1.7_3