OPNsense Forum

Archive => 18.7 Legacy Series => Topic started by: drivera on February 06, 2019, 03:07:52 pm

Title: Firewall failover not working
Post by: drivera on February 06, 2019, 03:07:52 pm
Hi, all!

I'm having trouble with failover. I had a power outage last night, and one of my ISPs (the Cable provider) seems to be down due to some line damage; they'll be offline for a few hours. This happened in the wee hours of the morning, while I was asleep.

When I woke up, I found that I had no internet service. This means that failover had once again been unsuccessful.

After some poking around, I've discovered that even though I have a failover gateway group set up for both my ISPs (Cable and ADSL), and the group is (apparently) configured properly (Cable is Tier 1, ADSL is Tier 2, Trigger Level is "Member Down"), the failover algorithm does not work as expected.

This is the behavior that I would expect:

* When the Cable link is up, the Cable link is promoted to default gateway
* When the Cable link is down, the ADSL link is promoted to default gateway
* When the Cable link comes back up, the Cable link is promoted to default gateway irrespective of the ADSL link's status
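
To put the expected behavior in concrete terms, here is a minimal Python sketch of the selection rule I have in mind. This is not OPNsense code: the `Gateway` class, its `is_up` field, and the `pick_default` function are names made up purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    name: str
    tier: int     # 1 = preferred, 2 = backup (as in the gateway group)
    is_up: bool   # hypothetical flag: result of the monitor check


def pick_default(gateways: List[Gateway]) -> Optional[Gateway]:
    """Return the lowest-tier gateway that is up, or None if all are down."""
    candidates = [gw for gw in gateways if gw.is_up]
    return min(candidates, key=lambda gw: gw.tier, default=None)
```

If this were re-evaluated on every status change, the Cable link would be picked again as soon as it comes back up, whatever the ADSL link is doing.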

This is the behavior I'm seeing:

* When the Cable link is up, the Cable link is promoted to default gateway
* When the Cable link is down, the ADSL link is promoted to default gateway
* When the Cable link comes back up, the ADSL link remains as default gateway unless I explicitly mark the Cable gateway as the default gateway
* However, when I mark the Cable gateway as the default gateway and a prolonged outage occurs, it seems that failover simply won't happen at all: the Cable gateway remains the default gateway regardless of its up/down status

I've configured each gateway (Cable + ADSL) with a Monitor IP setting pointing at an address on the far side of the link, so it can be used to determine whether the link really is up rather than just appearing to be up. I've noticed that even though there's an outage, the Cable link's status shows as "pending" rather than "down". Perhaps this is the issue? Perhaps the algorithm is assuming that the link is up because it's not explicitly marked as "down"?

Thoughts?

If that's the case, then the algorithm should definitely compare the link's current status (UP, UP+Latency, UP+Packet Loss, UP+Latency+Packet Loss, Pending, Down) against the "Trigger Level" condition set up in the gateway group, and treat a link as down until it reaches UP or one of the UP+* states (i.e. Pending and Down should be equivalent, I think).
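
As a rough sketch of that comparison for the "Member Down" trigger level only (again, not actual OPNsense code; the `Status` enum and `is_failed` function are hypothetical names, and the status strings are just the ones listed above):

```python
from enum import Enum


class Status(Enum):
    UP = "UP"
    UP_LATENCY = "UP+Latency"
    UP_LOSS = "UP+Packet Loss"
    UP_LATENCY_LOSS = "UP+Latency+Packet Loss"
    PENDING = "Pending"
    DOWN = "Down"


# Statuses that count as a failed member under "Member Down".
# Pending is included on purpose: a link that has not been
# confirmed up is treated the same as one that is down.
FAILED_ON_MEMBER_DOWN = {Status.DOWN, Status.PENDING}


def is_failed(status: Status) -> bool:
    """True if a gateway in this status should be skipped by failover."""
    return status in FAILED_ON_MEMBER_DOWN
```

With a check like this, a Cable link stuck in "Pending" would be skipped and the ADSL link would take over.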

Thanks!