21.7 Boot hang at “Configuring VLAN interfaces...” with imported 21.1 config

Started by MacLemon, July 09, 2021, 05:17:04 PM

UPDATE
This thread was started with 21.7-RC1 but applies to the 21.7-RELEASE (2021-07-29) version as well.
Not only new installs, but also upgrades with VLANs on LAGG interfaces seem to be affected.

Summary:
When importing an existing OPNsense 21.1.7_1 config into a freshly installed 21.7-RC1 system (on different hardware), the system hangs forever during reboot at "Configuring VLAN interfaces...".

Steps to Reproduce:

  • Production firewall: OPNsense 21.1.7_1 on intel Atom C2758 with 8 x i350 (igb0-3 on PCIe NIC, igb4-7 onboard) Only igb0-3 are in use!
  • Export Config file as usual with all sections.
  • Install 21.7-RC1 on new hardware (Xeon-D 2143, 4 x i350 (igb0-3), 4 x X722 (ixl0-3))
  • Import config from the old firewall
  • reboot
  • See boot hang at configuring VLANs.

Expected Results:
I'd expect the import to automatically reassign the interfaces according to their names.
Match old igb0 to new igb0. (Same name, different MAC) and so on.

This is in fact the behaviour I actually see happen flawlessly when trying the same migration from 21.1.7_1 on the old hardware to 21.1 on the new hardware. It just works, and works as I had hoped for.

Actual Results:
When importing the same file exported on 21.1.7_1 on the old hardware into 21.7-RC1 on the new hardware the system hangs at the first reboot (and all subsequent reboots).

No errors are shown.


Regression:
The hardware change works *perfectly* fine (it's almost boring) when importing that same file from 21.1.7_1 into 21.1 on the new hardware.
System boots up as expected and automagically assigns all the NICs correctly.

Notes:
I've also imported the config into 21.1 on the new hardware, exported it again into a fresh file, and imported that file into 21.7-RC1. That basically resembles reinstalling on the same hardware and reimporting an existing config dump from that exact hardware.
This results in the same problem.

Version Information:
Old hardware: 21.1.7_1
New hardware: 21.1 and 21.7-RC1 tested



Is there anything else I could have missed during these tests? Any obvious mistakes I've overlooked while covering my ears from the fan noise on my desk?

Your input is much appreciated.
MacLemon

Do you use third-party repositories? E.g. SunnyValley.

If yes, and there is no binary available, Opnsense setup chokes to death.
OPNsense HW:

Minisforum Venus series UN100C, 16 GB RAM, 512 GB SSD
T-bao N9N Pro, 16 GB RAM, 512 GB SSD

Thanks for the input.
Nope, Sunny Valley is not in use.

After importing the config the system hangs before it has fully booted up. So it doesn't even get to a point where it could fail to download/install any plugins.

On a config reimport it's very unlikely that third-party repositories play a role. When their code is not installed, the corresponding settings in config.xml won't be executed.

The question is what is hanging there. Looking at the code, legacy_interface_listget() is executed first, which also checks for WLAN-capable cards. Is there such a thing plugged into the new HW?


Cheers,
Franco

No WiFi hardware here.
NICs:

  • 4 x intel i350 Gigabit/s (igb)
  • 2 x intel X722 10GE Copper (ixl)
  • 2 x intel X722 10GE SFP+ (ixl)

The new hardware used is a Thomas-Krenn RI1102D-F (v2.1) which is basically a SUPERMICRO X11SDV-4C-TP8F motherboard preassembled in a chassis with 2x8GB RAM and an NVMe SSD as boot drive.


The console output up to the hang reads like this:

Configuring Kernel Modules...done.
Setting up extended sysctls...done.
Setting timezone...done.
Writing firmware setting...done
Writing trust files...done.
Settings hostname: <opnsense.example.org>
Generating /etc/hosts...done.
Configuring system logging...done.
Configuring loopback interface...done.
Creating wireless clone interfaces...done.
Configuring LAGG interfaces...done.
Configuring VLAN interfaces...


This is where it never continues any further.

Importing the very same file into 21.1 works flawlessly.
Importing it on 21.7-RC1 shows this symptom of a non-booting firewall.

To me that points to a difference in how the config is parsed for the VLAN section, or in the way the VLAN interfaces are getting configured.

We do use a LAGG (LACP) and all the VLANs are on that LAGG if this is of any help.

> Importing the very same file into 21.1 works flawlessly. (1)
> Importing it on 21.7-RC1 shows this symptom of a non-booting firewall. (2)

The question is whether (1) was confirmed on the new hardware as well, and whether it is actually the upgrade that makes it hang. I'm not convinced the code is at fault, since it wasn't changed considerably.


Cheers,
Franco

I did test:

  • export config on 21.1.7_1 on the existing hardware
  • import into 21.1 on the new hardware
Imports all the settings correctly, maps the igb0-3 interfaces correctly, reboots completely, just works.
I'd say, the common OPNsense experience with updates. :-)

I also did test

  • export config on 21.1.7_1 on the existing hardware
  • import into 21.7 on the new hardware
which results in the mentioned hang while configuring VLAN interfaces.

I've also tested

  • export config on 21.1.7_1 on the existing hardware
  • import into 21.1 on the new hardware
  • export config again from 21.1 on the new hardware
  • import the new export into 21.7 on the new hardware
which results in the same hang.

Is there anything else I could test? I do have the new hardware at my disposal for tests. :-)

When you put it like that maybe something in the RC is causing this. The first suspect would be

https://github.com/opnsense/core/commit/a98d776fa4ff0

Can you try to patch it and see if the hang is still there?

# opnsense-patch a98d776fa4ff0

(patching actually un-patches it, since the commit is already part of the RC, but it works splendidly for testing)


Cheers,
Franco


  • Installed OPNsense 21.7-RC1
  • Applied Patch as instructed (output below)
  • rebooted for good measure (came back just fine)
  • Imported unencrypted config file that had been exported on the old hardware with 21.1.7_1
  • auto-reboot after Config import.

Same result so far. The reboot after the import hangs at
Configuring VLAN interfaces...


Full output emitted when applying the patch.

# opnsense-patch a98d776fa4ff0
Fetched a98d776fa4ff0 via https://github.com/opnsense/core
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|From a98d776fa4ff04d616e46b45a3bc60f8c1407269 Mon Sep 17 00:00:00 2001
|From: Ad Schellevis <ad@opnsense.org>
|Date: Wed, 16 Jun 2021 16:18:50 +0200
|Subject: [PATCH] Interfaces / Hardware settings - Overwite global settings,
| closes https://github.com/opnsense/core/issues/5050
|
|---
| src/etc/inc/interfaces.lib.inc |  32 +++++++----
| src/www/interfaces.php         | 102 +++++++++++++++++++++++++++++++++
| 2 files changed, 123 insertions(+), 11 deletions(-)
|
|diff --git a/src/etc/inc/interfaces.lib.inc b/src/etc/inc/interfaces.lib.inc
|index 9ca22ab996..cc4c59470b 100644
|--- a/src/etc/inc/interfaces.lib.inc
|+++ b/src/etc/inc/interfaces.lib.inc
--------------------------
Patching file etc/inc/interfaces.lib.inc using Plan A...
Reversed (or previously applied) patch detected!  Assuming -R.Hunk #1 succeeded at 386 (offset 1 line).
Hunk #2 succeeded at 399 (offset 1 line).
Hunk #3 succeeded at 439 (offset 1 line).
Hmm...  The next patch looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/src/www/interfaces.php b/src/www/interfaces.php
|index 30a95ff65e..c078c1a4de 100644
|--- a/src/www/interfaces.php
|+++ b/src/www/interfaces.php
--------------------------
Patching file www/interfaces.php using Plan A...
Reversed (or previously applied) patch detected!  Assuming -R.Hunk #1 succeeded at 388.
Hunk #2 succeeded at 1307.
Hunk #3 succeeded at 1699.
Hunk #4 succeeded at 1913.
done
All patches have been applied successfully.  Have a nice day.



Here's the VLAN config section extracted from the config file. I've replaced the customer's name with "customer" and the names of some other vendors we use with "vendor". The general description structure stays identical. (The only "special" characters in the description fields are spaces and a "-".)
All VLANs are assigned to the same single lagg0 interface.


<vlans>
  <vlan>
    <if>lagg0</if>
    <tag>1104</tag>
    <pcp>1</pcp>
    <descr>Customer Studio</descr>
    <vlanif>lagg0_vlan1104</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1254</tag>
    <pcp>7</pcp>
    <descr>AdminVLAN</descr>
    <vlanif>lagg0_vlan1254</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1251</tag>
    <pcp>0</pcp>
    <descr>XXX DMZ</descr>
    <vlanif>lagg0_vlan1251</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1250</tag>
    <pcp>3</pcp>
    <descr>YYY DMZ</descr>
    <vlanif>lagg0_vlan1250</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1105</tag>
    <pcp>1</pcp>
    <descr>CustomerPublic</descr>
    <vlanif>lagg0_vlan1105</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1109</tag>
    <pcp>2</pcp>
    <descr>Customer labels</descr>
    <vlanif>lagg0_vlan1109</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1106</tag>
    <pcp>2</pcp>
    <descr>VendorService</descr>
    <vlanif>lagg0_vlan1106</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1107</tag>
    <pcp>1</pcp>
    <descr>VendorDemo</descr>
    <vlanif>lagg0_vlan1107</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1108</tag>
    <pcp>1</pcp>
    <descr>CustomerDemo</descr>
    <vlanif>lagg0_vlan1108</vlanif>
  </vlan>
</vlans>
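
For anyone who wants to rule out a malformed VLAN section before looking at the hardware, here is a minimal sketch of a sanity check in Python. It only assumes the element names visible in the excerpt above (`if`, `tag`, `pcp`, `vlanif`). The `check_vlans` helper is hypothetical and not part of OPNsense, and the `<parent>_vlan<tag>` naming test merely mirrors the pattern seen in this config rather than a documented rule. The tag and priority ranges are the usual 802.1Q limits.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the <vlans> section posted in this thread.
# A real check would read the section from the exported config file instead.
SAMPLE = """
<vlans>
  <vlan>
    <if>lagg0</if>
    <tag>1104</tag>
    <pcp>1</pcp>
    <descr>Customer Studio</descr>
    <vlanif>lagg0_vlan1104</vlanif>
  </vlan>
  <vlan>
    <if>lagg0</if>
    <tag>1254</tag>
    <pcp>7</pcp>
    <descr>AdminVLAN</descr>
    <vlanif>lagg0_vlan1254</vlanif>
  </vlan>
</vlans>
"""


def check_vlans(xml_text):
    """Return a list of human-readable problems found in a <vlans> section.

    Hypothetical helper for illustration only; not an OPNsense function.
    """
    problems = []
    seen = set()
    root = ET.fromstring(xml_text.strip())
    for vlan in root.findall("vlan"):
        parent = vlan.findtext("if", default="")
        tag = int(vlan.findtext("tag", default="0"))
        pcp = int(vlan.findtext("pcp", default="0"))
        vlanif = vlan.findtext("vlanif", default="")

        # 802.1Q allows VLAN IDs 1-4094 and priority code points 0-7.
        if not 1 <= tag <= 4094:
            problems.append(f"{vlanif}: tag {tag} is outside 1-4094")
        if not 0 <= pcp <= 7:
            problems.append(f"{vlanif}: pcp {pcp} is outside 0-7")

        # Assumption: names follow the <parent>_vlan<tag> pattern seen above.
        if vlanif != f"{parent}_vlan{tag}":
            problems.append(f"{vlanif}: name does not match {parent}_vlan{tag}")

        # The same tag twice on one parent would be a conflict.
        if (parent, tag) in seen:
            problems.append(f"{vlanif}: duplicate tag {tag} on {parent}")
        seen.add((parent, tag))
    return problems


if __name__ == "__main__":
    print(check_vlans(SAMPLE) or "no problems found")
```

A clean result does not prove the config is harmless. It only shows that the section is well-formed and internally consistent, which would point further towards hardware or driver behaviour.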

I tried to reproduce with a lagg0 and the vlans on top and boot went fine. I half-suspect the config isn't the underlying issue, but it would be good to rule that out. Would you mind sending it over to franco@opnsense.org with the necessary redactions?

If that were true it might still be tied to specific hardware behaviour in which case we can only try to find the command that hangs the system.
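
One rough way to narrow down which step hangs is to create a single VLAN by hand from a root shell, one command at a time. The sketch below only builds the command strings; it does not run anything. It uses standard FreeBSD ifconfig(8) parameters (`create`, `vlan`, `vlandev`, `vlanpcp`, `name`), but the `vlan_commands` helper, the `vlan0` unit number and the split into separate steps are assumptions for illustration. The commands OPNsense issues during boot may well differ.

```python
def vlan_commands(parent, tag, pcp, vlanif, unit=0):
    """Build FreeBSD ifconfig(8) command lines for creating one VLAN by hand.

    Hypothetical helper: these are plain ifconfig invocations, NOT copied
    from the OPNsense boot scripts, which may do things differently.
    """
    tmp = f"vlan{unit}"  # temporary cloned interface name (assumed unit number)
    return [
        f"ifconfig {parent}",                              # parent present and up?
        f"ifconfig {tmp} create",                          # clone a vlan interface
        f"ifconfig {tmp} vlan {tag} vlandev {parent}",     # attach tag to parent
        f"ifconfig {tmp} vlanpcp {pcp}",                   # set 802.1p priority
        f"ifconfig {tmp} name {vlanif}",                   # rename to final name
    ]


if __name__ == "__main__":
    # Values taken from the first <vlan> entry posted in this thread.
    for cmd in vlan_commands("lagg0", 1104, 1, "lagg0_vlan1104"):
        print(cmd)
```

Pasting the printed lines into a shell one by one on the test hardware should show whether a particular step blocks, for example setting the priority on a VLAN whose parent is a lagg.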


Cheers,
Franco

I just had this exact issue happen to me when trying to upgrade from 21.1 to 21.7. After updating via the web UI and rebooting, it hangs at "Configuring VLAN interfaces". Never proceeds further. No hardware change for me. Running on a Supermicro X10SDV-TP8F.

I also have a lagg interface that has vlans associated with it.

Getting the exact same issue.  Upgraded this morning to 21.1.9, then an in-place upgrade to 21.7.  Hardware is a QOTOM Q555G6-S05.  I used the in-place upgrade method rather than a config import.  This is the 2nd node of a 2-node HA cluster.

If I can help troubleshoot in any way please let me know what is needed.
Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)


In an attempt to bring my network back up I tried using my config with the 21.7 install config importer. That failed and hung configuring the VLANs. Went back a release to 21.1 and the installer successfully imported the config and booted fully.

@franco, I just realized this is a post for RC1.  Should we start a new thread since this is now the GA code?