OPNsense Forum

Archive => 17.7 Legacy Series => Topic started by: sebastian on August 07, 2017, 04:13:07 am

Title: [DANGEROUS] Losing a NIC should NEVER trigger any form of reset!
Post by: sebastian on August 07, 2017, 04:13:07 am
Noticed this pretty dangerous bug:

I accidentally lost the WAN NIC on the firewall (it had just come loose a bit).

This caused OPNsense to do an automatic, unattended interface assignment reset, resetting everything that has to do with interfaces to default.

Of course, this made all other settings related to the interfaces invalid, and thus caused a complete lockout of the firewall, since I had firewall rules in place that prevent access to the GUI from "unauthorized" interfaces (and I had no backup).
Notice that it was the WAN interface that was lost. The system would work just fine without WAN; it just would not have any internet.

This is COMPLETELY STUPID, as this prevents rescue of a system whose NIC breaks - a NIC can easily be replaced with a new one.

If an interface is lost, for example if re0 is configured but the system can't find it at boot, it should continue as-is and just ignore the interface (and hope the interface comes back next boot, for example if it's replaced or reseated).
With the interface ignored, even on a system where that particular interface never returns (for example a permanently broken NIC port), the interface could then be reassigned to another NIC port, or be put on a VLAN as a temporary solution until the hardware is replaced.

It should NEVER automatically touch interface settings. Let the system administrator do that instead.
Title: Re: [DANGEROUS] Losing a NIC should NEVER trigger any form of reset!
Post by: franco on August 07, 2017, 05:08:44 am
Please back up your configs and restore as needed; in fact there are a number of backups in your system at any time. The configuration reassignment is for recovery from bad situations.

It's impossible to "guess" what cards are exchanged, removed or added and how their drivers are going to be named, etc.

The problem is that even if we could limit this process to "WAN" and "LAN" it's impossible to tell if anyone keeps using them or deletes them and finally has an "OPTX", "OPTY", "OPTZ" setting that defies all attempts to make sense of it in code. Which one is vital, which one is not?

That being said, how about an option to "lock" a particular interface from being removed? We can't enable this by default, but it would take away the "guessing" and let the user define what he expects in a state of missing interfaces?


Cheers,
Franco
Title: Re: [DANGEROUS] Losing a NIC should NEVER trigger any form of reset!
Post by: franco on August 07, 2017, 06:22:32 am
This should help in the future... Interfaces can now be locked: a locked interface cannot be deleted from the GUI, and the reassignment does not run on a mismatch.

https://github.com/opnsense/core/commit/81aed987
https://github.com/opnsense/core/commit/f22ade58

It's not perfect, but it's progress.


Thanks,
Franco
Title: Re: [DANGEROUS] Losing a NIC should NEVER trigger any form of reset!
Post by: sebastian on August 07, 2017, 06:45:37 am
The point is that automatically resetting interfaces is not going to help anyway.

If an administrator did switch out a network interface, the administrator will also be present to do the interface reassignment manually.

It is best not to touch anything. In some cases an interface might change name after a reboot, but an automatic reset is not going to fix that anyway.

I think the following logic would be best to recover from a "bad situation": store the physical MAC of each interface, and also its name (e.g. em0, re0, etc.).

Upon mismatch, do the following:

1: First, match every interface found in hardware by its MAC to the configured interface stored with that MAC. This will catch any cases where an interface just changes name, from let's say re0 to re1.
If this interface has VLAN assignments, automatically rename them too.

2: For any interfaces left, match by exact driver name, so any interface configured as re2 will be matched to re2.

3: For any interfaces left here, match the driver name against any interface with the same driver, in the same order.
So if re4, re5 and em1 are configured, and the physical interfaces are re6, re7 and em0, then do the following replacements:
re4 --> re6
re5 --> re7
em1 --> em0

4: Assign any leftover hardware interfaces to any leftover firewall interfaces, in the same order as enumerated and configured.

5: Any interface left at this point should be left UNTOUCHED, even if there's no hardware interface left.
Don't delete interfaces because they don't exist in the real machine; just leave them there in the GUI with a "None" assignment, so the user can reassign them manually. (But still keep all firewall rules and configs. These rules and configs would simply be dormant due to the non-existing interface, so when that interface is replaced or fixed, it can be reassigned without any loss of data.)


Those 5 steps are the best bet to recover from a bad situation.
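
To make the order of the five steps concrete, here is a rough Python sketch of the matching. It is only an illustration under assumptions, not OPNsense code: the names `Configured`, `Hardware`, `driver` and `reassign` are hypothetical, it assumes a MAC was stored with each assignment, it only looks at physical devices, and the VLAN renaming mentioned in step 1 is left out.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Configured:          # hypothetical: one assigned firewall interface
    name: str              # logical name, e.g. "wan"
    device: str            # device it was assigned to, e.g. "re0"
    mac: Optional[str]     # MAC stored at assignment time (assumed available)


@dataclass
class Hardware:            # hypothetical: one NIC found at boot
    device: str
    mac: str


def driver(device: str) -> str:
    """Strip the trailing unit number: 're4' -> 're'."""
    return re.sub(r"\d+$", "", device)


def reassign(configured: List[Configured],
             hardware: List[Hardware]) -> Dict[str, Optional[str]]:
    """Map each logical interface to a device, or None if nothing fits."""
    result: Dict[str, Optional[str]] = {}
    free = list(hardware)        # unclaimed NICs, in enumeration order
    pending = list(configured)   # interfaces not matched yet

    # Steps 1-4, from strictest to loosest criterion.
    steps = [
        lambda c, h: c.mac is not None and c.mac.lower() == h.mac.lower(),
        lambda c, h: c.device == h.device,
        lambda c, h: driver(c.device) == driver(h.device),
        lambda c, h: True,
    ]
    for matches in steps:
        for cfg in list(pending):
            hw = next((h for h in free if matches(cfg, h)), None)
            if hw is not None:
                result[cfg.name] = hw.device
                free.remove(hw)
                pending.remove(cfg)

    # Step 5: keep whatever is left, just without a device.
    for cfg in pending:
        result[cfg.name] = None
    return result


# The example from step 3: no MAC or exact name matches, same drivers.
cfg = [Configured("opt1", "re4", "00:00:00:00:00:01"),
       Configured("opt2", "re5", "00:00:00:00:00:02"),
       Configured("opt3", "em1", "00:00:00:00:00:03")]
hw = [Hardware("re6", "00:00:00:00:00:11"),
      Hardware("re7", "00:00:00:00:00:12"),
      Hardware("em0", "00:00:00:00:00:13")]
print(reassign(cfg, hw))  # {'opt1': 're6', 'opt2': 're7', 'opt3': 'em0'}
```

An interface that ends up as `None` stays in the result, which corresponds to step 5: nothing is deleted, it just has no device until the administrator assigns one.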


I mean, this is completely incomprehensible:
The WAN interface dropped (re0).
Then it reassigned LAN from em0_vlan1 to em0 (the parent interface), and then deleted all assignments for em0_vlan3 (Wireless) and em0_vlan4 (Server), which also deleted all firewall rules, service settings and such for those interfaces.

Why did it touch em0 at all? It was the WAN interface that was lost, not one of the LAN interfaces.
It's as if it reset the interface assignments to factory defaults.

That's NOT going to recover anything; it will only make things worse.

I could understand it if it had assigned WAN to em0 (since em0 itself was unassigned, as I'm using VLANs on em0) as an attempt to recover.