High availability / pfsync interface overload problem
« on: June 13, 2024, 02:31:23 pm »
Good afternoon
I have the following configuration: two OPNsense firewalls (primary and backup) built on HPE DL360 Gen10 servers, each with two dual-port network cards, one 2x25 Gbit and one 2x10 Gbit. Of the two 25 Gbit ports, the first is the uplink (WAN) and the second is the parent interface for the internal VLANs. The two 10 Gbit ports are combined in a LAGG and used as the pfsync link between the servers, so 20 Gbps in total.

The problem: when the network load grows during the day and the state table approaches three and a half million entries, 3,500,000 (State table size at 13% of the limit), the load on the pfsync link rises to 400-500 Mbps (far below the capacity of that link) and synchronization starts to break down. Pinging the neighbour over the pfsync link shows packet loss and "no buffer space available" messages.
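For reference, this is roughly how I watch the pfsync link and the buffer situation when the errors appear (just a sketch; lagg0 stands in for my pfsync LAGG, adjust the interface names):

# State table usage and pf counters on the current node
pfctl -si | grep -A3 "State Table"

# mbuf / cluster usage, to see whether the "no buffer space available"
# errors line up with mbuf exhaustion
netstat -m

# Per-interface traffic, errors and drops on the pfsync LAGG
# (lagg0 is an example name, substitute the real pfsync interface)
netstat -I lagg0 -dhw 1

# Current pfsync interface settings (syncdev, syncpeer, maxupd, defer)
ifconfig pfsync0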
Then, apparently because pfsync packets are being lost, not all connections get closed on the backup, and the state table on the backup server "swells", approaching 99 percent of the limit (about 24 million entries).
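To see this imbalance build up, I compare the state tables on the two nodes, roughly like this (run on each firewall):

# Summary including "current entries" plus the insert/removal rates
pfctl -si

# Raw count of state entries (slow when there are millions of states)
pfctl -ss | wc -l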
After that the CARP addresses start flapping, moving from the master to the backup and back, the CARP status becomes incorrect, and it goes to -240 on both servers. We have already swapped the network cards used for pfsync: first 1 Gbit, then 4 Gbit, and the latest option is 20 Gbit, but the situation repeats every time. This looks like some kind of software problem, and I have not found a way to fix it.
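When this happens I also look at the CARP demotion counter. My assumption (I am not certain of the mechanism) is that the -240 comes from pfsync adjusting the demotion level by its demotion factor, which defaults to 240:

# Current CARP demotion level on each node
sysctl net.inet.carp.demotion

# How much pfsync adjusts the demotion level by (default 240,
# which matches the value I see)
sysctl net.pfsync.carp_demotion_factor

# CARP packet statistics
netstat -s -p carp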