21.7 Boot hang at “Configuring VLAN interfaces...” with imported 21.1 config

Started by MacLemon, July 09, 2021, 05:17:04 PM

Same issue here with the final 21.7.

After the upgrade it gets stuck at "vlan changing name to...".
I use a mini PC with an Intel CPU and NICs (nrg-systems.de).

What I tested:

Clean install of 21.7 (works), then restore the 21.1.9 backup
-> freeze on vlan config

Clean install 21.1.9, restore config 21.1.9 and upgrade to 21.7
-> freeze on vlan config

Clean install 21.1.9 and restore config 21.1.9
-> everything works fine

I also use LAGG interfaces with VLANs.

Thanks for the help!

Best Regards

I'm not sure if it's related to this, but there's definitely something wrong with VLANs in 21.7. The system was fine until I began adding VLANs. Using the same hardware that failed to upgrade, here's what I did and what I ran into. Apologies if it's a ramble or is missing relevant data; it's 2 AM and I've been testing this for the last few hours because OCD. The issue is 100% repeatable.

Steps to reproduce:

  • Clean install of 21.7.  Mirrored ZFS (Old 21.1 setup was GEOM Mirror)
  • Minor changes.  Change theme, point logging to syslog server, etc.
  • Create lagg0 from igb0 and igb1.
  • Create lagg1 from igb2 and igb3.
  • Assign lagg0 to WAN interface
  • Assign lagg1 to LAN interface
  • Create VLAN 20, parent interface lagg0.  No problems at all up to this point.
  • Here's where the trouble begins.  Create VLAN 30, parent interface lagg0

As soon as I created VLAN 30 on lagg0 the system started having trouble.

The first time I tried to add VLAN 30, the system froze up and I started getting scrolling drive alerts (ATA_IDENTIFY, CAM status, etc) on the console.  Had to hard power cycle the system to get it back.

After the power cycle, the second time I tried to add VLAN 30, the system froze up and the GUI just sat there with a dot moving on the browser tab as if the browser was waiting, but there were no drive alerts on the console. The system shut down gracefully via the power button this time.

The third time, I tried to add VLAN 40 instead and had pretty much the same results.
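
For anyone who wants to reproduce this from the console rather than the GUI, here is a rough sketch of the plain FreeBSD ifconfig sequence for a lagg with tagged VLANs on top. It only builds and prints the commands; it does not run them. The LACP protocol is an assumption (the protocol in use is not stated above), the helper name lagg_vlan_commands is made up for illustration, and the temporary vlan0/vlan1 names are inferred from the "changing name" boot messages reported in this thread. What the GUI actually does behind the scenes may well differ.

```python
from typing import List


def lagg_vlan_commands(lagg: str, members: List[str], vlan_ids: List[int],
                       laggproto: str = "lacp") -> List[str]:
    """Build (but do not run) ifconfig commands for a lagg with VLANs on top.

    laggproto defaults to "lacp", which is an assumption, not taken from the post.
    """
    # Member ports have to be up before they can carry lagg traffic.
    cmds = [f"ifconfig {member} up" for member in members]
    cmds.append(f"ifconfig {lagg} create")
    ports = " ".join(f"laggport {member}" for member in members)
    cmds.append(f"ifconfig {lagg} laggproto {laggproto} {ports} up")
    for index, vlan_id in enumerate(vlan_ids):
        temp_name = f"vlan{index}"
        # Create the VLAN under a temporary name, then rename it to the
        # <parent>_vlan<tag> form seen in the boot messages.
        cmds.append(f"ifconfig {temp_name} create vlan {vlan_id} vlandev {lagg}")
        cmds.append(f"ifconfig {temp_name} name {lagg}_vlan{vlan_id}")
    return cmds


if __name__ == "__main__":
    for command in lagg_vlan_commands("lagg0", ["igb0", "igb1"], [20, 30]):
        print(command)
```

Running the printed commands by hand on a spare box, one at a time, would show whether the hang follows the second VLAN creation or its rename.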

I was sending my logs to a syslog server during this and captured the following in the configd.py.log file during the first attempt.

2021-07-29T00:39:27-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [8d7456bd-a164-4aab-b572-0893ce42a42c] Linkup stopping igb0
2021-07-29T00:39:27-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [e0604275-642d-4c72-a2e4-0c8c8a70bb27] Linkup stopping igb1
2021-07-29T00:39:28-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [08bc2d5f-8eb8-4dbd-925b-4e2ef22476be] Linkup stopping lagg0
2021-07-29T00:39:28-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [fa84fb2a-16f0-4b3e-b20c-ee3f3a789bd6] Linkup stopping lagg0_vlan20
2021-07-29T00:39:28-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [1ce28a8d-8096-48bc-afc2-567dda5db8c2] trigger config changed event
2021-07-29T00:39:32-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [742fcbb1-5fc2-422f-b61a-c90904ea3e33] Linkup starting igb0
2021-07-29T00:39:32-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [d4daa58f-dbbb-45b6-b39e-f4dfecaa9e02] Linkup starting lagg0
2021-07-29T00:39:32-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [b3a9ba7a-5fed-4cc8-8343-a2ac6ff3d9e8] New IPv4 on lagg0
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [04145dfb-5625-461b-b94d-ea82d39ea3fc] generate template OPNsense/Filter
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: generate template container OPNsense/Filter
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [3b49d59a-fa71-4f63-b6e5-9ceb2e936307] refresh url table aliases
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [898420e5-6f67-441e-8b55-0710f9fad03a] Linkup starting lagg0_vlan20
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [a73be435-5bb7-4e6c-aea6-f009c7a96986] Linkup starting igb1
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: message 3b49d59a-fa71-4f63-b6e5-9ceb2e936307 [filter.refresh_aliases] returned {"status": "ok"}
2021-07-29T00:39:35-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [4dab0966-c2a9-4e31-b4df-971ef18750ad] Reloading filter
2021-07-29T00:39:35-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [81acf0e3-927e-4ad0-a090-e19f66f43802] generate template OPNsense/Filter
2021-07-29T00:39:35-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: generate template container OPNsense/Filter
2021-07-29T00:39:35-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [29824610-759d-45d3-ad3e-14289162d97b] refresh url table aliases
2021-07-29T00:39:35-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: message 29824610-759d-45d3-ad3e-14289162d97b [filter.refresh_aliases] returned {"status": "ok"}
2021-07-29T00:39:44-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [d9cf2ad2-0fa5-41b8-aaab-09ad4c2df34f] Linkup stopping igb0


The "Linkup stopping igb0" line is the last entry logged before the system powered back up.
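
To make that pattern easier to spot in a longer capture, here is a minimal sketch that scans configd.py log lines for the "Linkup stopping" / "Linkup starting" messages and lists the interfaces whose most recent event was a stop. The function name and regular expression are my own, based only on the message format in the excerpt above; the sample lines are copied from that excerpt.

```python
import re
from typing import Dict, List

# Matches the "Linkup stopping <if>" / "Linkup starting <if>" messages
# as they appear in the excerpt (format assumed from those lines only).
LINKUP_RE = re.compile(r"\] Linkup (stopping|starting) (\S+)\s*$")


def interfaces_left_stopped(log_lines: List[str]) -> List[str]:
    """Return interfaces whose last Linkup event in the log was 'stopping'."""
    last_event: Dict[str, str] = {}
    for line in log_lines:
        match = LINKUP_RE.search(line)
        if match:
            action, ifname = match.groups()
            last_event[ifname] = action  # later lines overwrite earlier ones
    return [name for name, action in last_event.items() if action == "stopping"]


SAMPLE = """\
2021-07-29T00:39:27-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [8d7456bd-a164-4aab-b572-0893ce42a42c] Linkup stopping igb0
2021-07-29T00:39:27-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [e0604275-642d-4c72-a2e4-0c8c8a70bb27] Linkup stopping igb1
2021-07-29T00:39:28-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [08bc2d5f-8eb8-4dbd-925b-4e2ef22476be] Linkup stopping lagg0
2021-07-29T00:39:28-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [fa84fb2a-16f0-4b3e-b20c-ee3f3a789bd6] Linkup stopping lagg0_vlan20
2021-07-29T00:39:28-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [1ce28a8d-8096-48bc-afc2-567dda5db8c2] trigger config changed event
2021-07-29T00:39:32-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [742fcbb1-5fc2-422f-b61a-c90904ea3e33] Linkup starting igb0
2021-07-29T00:39:32-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [d4daa58f-dbbb-45b6-b39e-f4dfecaa9e02] Linkup starting lagg0
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [898420e5-6f67-441e-8b55-0710f9fad03a] Linkup starting lagg0_vlan20
2021-07-29T00:39:33-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [a73be435-5bb7-4e6c-aea6-f009c7a96986] Linkup starting igb1
2021-07-29T00:39:44-04:00 inner-fw2.lan.thejeffcoats.net configd.py[84073]: [d9cf2ad2-0fa5-41b8-aaab-09ad4c2df34f] Linkup stopping igb0
""".splitlines()

if __name__ == "__main__":
    print(interfaces_left_stopped(SAMPLE))
```

On this excerpt only igb0 is reported: every interface went down and came back once, then igb0 was stopped a second time and nothing further was logged. That does not prove igb0 is the cause; it may simply be where logging ended when the system froze.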

Not sure if there's a real drive problem. SMART status for the drive being called out in the console errors (ada1) shows no problems. I ran a short test and it returned no errors. I do have spares that I can swap in to eliminate the drive as a potential source of the problem. The fact that the system loads 21.1 afterwards without issue leads me to think the drive is not the root cause, or even related.

smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.1-RELEASE-p19-HBSD amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Silicon Motion based SSDs
Device Model:     TS256GMSA370
Serial Number:    F915720124
LU WWN Device Id: 5 7c3548 19c3583bc
Firmware Version: P1225CH1
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 29 02:00:13 2021 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (    0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: (   1) minutes.
Extended self-test routine
recommended polling time: (   1) minutes.
Conveyance self-test routine
recommended polling time: (   1) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       343
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       56
160 Uncorrectable_Error_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
161 Valid_Spare_Block_Cnt   0x0000   100   100   000    Old_age   Offline      -       155
163 Initial_Bad_Block_Count 0x0000   100   100   000    Old_age   Offline      -       10
164 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       411497
165 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       251
166 Min_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       148
167 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       197
168 Max_Erase_Count_of_Spec 0x0000   100   100   000    Old_age   Offline      -       3000
169 Remaining_Lifetime_Perc 0x0000   100   100   000    Old_age   Offline      -       94
175 Program_Fail_Count_Chip 0x0000   100   100   000    Old_age   Offline      -       0
176 Erase_Fail_Count_Chip   0x0000   100   100   000    Old_age   Offline      -       0
177 Wear_Leveling_Count     0x0000   100   100   050    Old_age   Offline      -       2157
178 Runtime_Invalid_Blk_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total  0x0000   100   100   000    Old_age   Offline      -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0000   100   100   000    Old_age   Offline      -       11
194 Temperature_Celsius     0x0000   100   100   000    Old_age   Offline      -       60
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   016    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       0
232 Available_Reservd_Space 0x0000   100   100   000    Old_age   Offline      -       100
241 Host_Writes_32MiB       0x0000   100   100   000    Old_age   Offline      -       303133
242 Host_Reads_32MiB        0x0000   100   100   000    Old_age   Offline      -       24735
245 TLC_Writes_32MiB        0x0000   100   100   000    Old_age   Offline      -       1645988

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        87         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
    7        0    65535  Read_scanning was completed without error
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


ZFS shows no errors.
root@inner-fw2:~ # zpool status
  pool: zroot
state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors


At the end of all this I reinstalled 21.1, restored my config from backup, and was back online with no errors or issues. 

Not sure this will help, but it may give some clues. I can test the upgrade again on this firewall at will if that helps capture the problem; it's the standby firewall in my HA setup.

Thank you!

Al

Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

Please keep responses in this thread. I am starting to suspect that this is an (igb?) driver issue with the latest kernel code. We did not add any code related to LAGG or VLAN in any case... the only code updates related to VLAN are those in iflib, which is also used by the igb driver.

If you replace the kernel with the old one does it still hang?

# opnsense-update -zkr 21.1.8
# opnsense-shell reboot


Cheers,
Franco

Tried that; here are the steps and results.


  • Clean Install
  • Assign interfaces igb4 to WAN and igb5 to LAN temporarily from CLI
  • Assign LAN IP Address from CLI
  • Connect to GUI, run wizard for minimal config.
  • Create lagg0 from igb0 and igb1.
  • Assign lagg0 to WAN interface.
  • Create lagg1 from igb2 and igb3.
  • Assign lagg1 to LAN interface.
  • Set ipv6 to none on WAN interface.
  • Create upstream gateway and assign to WAN interface.
  • Reboot, just because.
  • Run command to replace kernel with old one:  opnsense-update -zkr 21.1.8
  • Run command to reboot:  opnsense-shell reboot
  • Assign VLAN ID 20 to lagg0:  Success
  • Assign VLAN ID 30 to lagg0:  Success
  • Assign VLAN ID 40 to lagg0:  Success

YAY!  So that seemed to do the trick.  No hangs and no disk errors with that downgrade of the kernel.  Now I really am going to bed, it's 3:15 ;)  Thanks and have a good day!

Al

Thanks for confirming. Now comes the hard part figuring out what change in the kernel actually causes this... *sigh*


Cheers,
Franco

I have the same issue:

After the update to 21.7, OPNsense hangs at "Configuring VLAN interfaces".

Hardware:
Intel Celeron G3900 2-Core 2,80GHz 2MB
8 GB (1x 8GB) ECC DDR4 2666 RAM
Supermicro X11SSH-LN4F with onboard quad LAN (Intel® Ethernet Controller I210-AT)

If I select the kernel.old image on the boot screen, OPNsense starts fine. But I have to do this on every startup/reboot.

Quote from: franco on July 29, 2021, 09:22:45 AM
Thanks for confirming. Now comes the hard part figuring out what change in the kernel actually causes this... *sigh*


Cheers,
Franco

If I can help in any way, I'm at your disposal. Unfortunately I'm not a developer, so I cannot help in the way I think you need. :(

Ok one more thing to try:

Set two tunables, "hint.ahci.0.msi" and "hint.ahci.1.msi", to "0" and try booting the new kernel.

These can also be set from the loader prompt (3. escape to loader prompt)

set hint.ahci.0.msi=0
set hint.ahci.1.msi=0
boot


Cheers,
Franco

Quote from: franco on July 29, 2021, 09:39:21 AM
Ok one more thing to try:

Set two tunables, "hint.ahci.0.msi" and "hint.ahci.1.msi", to "0" and try booting the new kernel.

These can also be set from the loader prompt (3. escape to loader prompt)

set hint.ahci.0.msi=0
set hint.ahci.1.msi=0
boot


Cheers,
Franco

I've applied them. How do I upgrade the kernel again on the install I just ran the downgrade on?

I rebooted and selected the old kernel, and it's hung at "Configuring VLAN interfaces...". It looks like it successfully changes the name of vlan0_vlan20 to lagg0_vlan20, then hangs while changing vlan1 to lagg0_vlan30, and I start getting ahcich1 timeout errors and CAM status timeouts. Took a screenshot, will attach.

I already tried that, and it didn't work :(

Same error message, and it is still stuck on VLAN configuration.

Not sure if it is related, but with Sensei there seems to be a problem using igb with netmap:
https://forum.opnsense.org/index.php?topic=24133.0
As soon as I use igb with netmap (emulated or native), my OPNsense becomes unresponsive.

Quote from: r4nc0r on July 29, 2021, 09:24:50 AM
I have the same issue:

After the update to 21.7, OPNsense hangs at "Configuring VLAN interfaces".

Hardware:
Intel Celeron G3900 2-Core 2,80GHz 2MB
8 GB (1x 8GB) ECC DDR4 2666 RAM
Supermicro X11SSH-LN4F with onboard quad LAN (Intel® Ethernet Controller I210-AT)

If I select the kernel.old image on the boot screen, OPNsense starts fine. But I have to do this on every startup/reboot.

Same problem here (and also a friend of mine) after upgrading to 21.7. Booting kernel.old works but not the new kernel.

CPU: Intel(R) Pentium(R) CPU G4560 @ 3.50GHz (3504.14-MHz K8-class CPU)
Quad Intel(R) PRO/1000 PCI-Express

VLANs on LAGG configured

My friend has a Qotom box. Not sure which model, but it has Intel NICs and VLANs on LAGG configured.

Do custom added tunables survive updates between point releases? 

I've got my system back to 21.1.9_1 and want to try setting those tunables again. I think earlier this morning I added them to my primary firewall instead of the secondary I had been testing with, so I want to try it again.

Tried the upgrade again. 


  • Restored back to 21.1.9_1
  • Added the tunables, rebooted
  • Upgraded to 21.7 via GUI
  • Failed to boot, stuck at "Configuring VLAN interfaces".  Had to power cycle.
  • Booted, selected option 3 at the loader, set the tunables, and ran boot. System is stuck at "Configuring VLAN interfaces...".

Figured it was worth a shot when I didn't have keyboard rash on my face ;)