OPNsense suddenly unreachable - How to debug

Started by chemlud, August 21, 2024, 05:16:52 PM

Previous topic - Next topic
Hy!

Have here a Dell Optiplex box (doing fine for years) that failed some weeks ago after an update (new SSD installed some weeks before). As a result I reinstalled another fresh SSD and ZFS some weeks ago. Afterwards box was stable.

About 1 week ago I upgraded to 24.7.1 and today all of a sudden one LAN client was unreachable, a reboot resulted in no IP via DHCP. Moreover I could not reach the OPNsense via GUI or serial console. Rebooting (hard reset) brought the OPNsense back, but I want to learn what is failing here.

Where to start looking? Which logs might help?
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....


...looks normal, no errors reported, just in the end a little chatter (Suricata at work, I guess)

# dmesg
...
845.542461 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
845.549880 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
845.557405 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
845.678136 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
igb4: link state changed to DOWN
846.207238 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
846.214659 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
846.222222 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
846.792433 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
846.800203 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
846.807833 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
846.820656 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
igb3: link state changed to DOWN
847.368629 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
847.376714 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
847.384138 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
847.917440 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
847.924851 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
847.932375 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
847.944330 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
igb2: link state changed to DOWN
848.469548 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
848.476990 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
848.484500 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
849.031964 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
849.039428 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
849.046955 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
849.060860 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
igb0: link state changed to DOWN
849.974814 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
849.982272 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
849.989856 [ 850] iflib_netmap_config       txr 2 rxr 2 txd 1024 rxd 1024 rbufsz 2048
igb4: link state changed to UP
850.891927 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 2048
850.899350 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 2048
850.906881 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 2048
850.918983 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 2048
igb3: link state changed to UP
em0: link state changed to DOWN
igb2: link state changed to UP
igb0: link state changed to UP
853.993664 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 48
854.001073 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 2048
854.008590 [ 850] iflib_netmap_config       txr 1 rxr 1 txd 1024 rxd 1024 rbufsz 2048
em0: link state changed to UP


What would be the equivalent to

journalctl -r -b -1

in BSD, maybe there is something in the systemlogs before the box apparently crashed?
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

Quote from: chemlud on August 22, 2024, 08:44:54 AM
What would be the equivalent to

journalctl -r -b -1

in BSD, maybe there is something in the systemlogs before the box apparently crashed?

Fortunately, none...  :-X ::) I'd suggest to use the old-fashioned grep, tail or whatever, looking at /var/log/*/latest.log

Latest 20 lines from each log:

# tail -n 20 /var/log/*/latest.log


August 22, 2024, 09:02:15 AM #4 Last Edit: August 22, 2024, 09:12:06 AM by chemlud
....I didn't invent systemd :-P only have to use it...

In the last 20 lines of each log nothing special.

Is there a way to view specifically the logs before the (second) last boot?

It crashed yesterday afternoon, in the evening I updated to 24.7.2 (including reboot), so the interesting stuff might be much more than 20 lines away...
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

Quote from: chemlud on August 22, 2024, 09:02:15 AM
....I didn't invent systemd :-P only have to use it...

;D ;D ;D

Quote from: chemlud on August 21, 2024, 05:16:52 PM
I could not reach the OPNsense via serial console.

Well, that is rather weird. Maybe the hardware has had it overall... not just the SSD. Perhaps run some memtests, CPU stress tests or whatever to how it goes.

August 22, 2024, 10:14:17 AM #6 Last Edit: August 22, 2024, 10:17:13 AM by chemlud
Had a look in the system logs for yesterday, especially afternoon, nothing special in there, some unbound error

# tail -n 300 /var/log/system/system_20240821.log

....
<11>1 2024-08-21T18:14:51+02:00 OPN.arpa opnsense 3186 - [meta sequenceId="159"] /usr/local/etc/rc.linkup: The command '/sbin/mount -r -t nullfs '/lib' '/var/unbound/lib'' returned exit code '1', the output was 'mount_nullfs: /var/unbound/lib: Resource d'
<12>1 2024-08-21T18:14:51+02:00 OPN.arpa opnsense 19375 - [meta sequenceId="160"] /usr/local/etc/rc.newwanip: warning: ignoring missing default tunable request: debug.pfftpproxy
<12>1 2024-08-21T18:14:51+02:00 OPN.arpa opnsense 3186 - [meta sequenceId="161"] /usr/local/etc/rc.linkup: warning: ignoring missing default tunable request: debug.pfftpproxy
<13>1 2024-08-21T18:14:53+02:00 OPN.arpa opnsense 19375 - [meta sequenceId="162"] /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : vxlan_configure_do())
<13>1 2024-08-21T18:14:53+02:00 OPN.arpa opnsense 19375 - [meta sequenceId="163"] /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : webgui_configure_do(,wan))
<13>1 2024-08-21T18:14:53+02:00 OPN.arpa opnsense 19375 - [meta sequenceId="164"] /usr/local/etc/rc.newwanip: plugins_configure newwanip (execute task : wireguard_sync(,wan))
....


...but that should not block the serial console, huh?

Boss says: No problem...

https://forum.opnsense.org/index.php?topic=39469.msg196003#msg196003

Think of changing hardware (except NICs, RAM and SSD) and see what happenz next.
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....

August 22, 2024, 10:50:18 AM #7 Last Edit: August 22, 2024, 10:52:01 AM by doktornotor
Quote from: chemlud on August 22, 2024, 09:02:15 AM
It crashed yesterday afternoon, in the evening I updated to 24.7.2 (including reboot), so the interesting stuff might be much more than 20 lines away...

Well, there are timestamps, use them with grep to get something around the time it got frozen. The unbound bind-mount remount is certainly not causing the freeze.

grep 2024-08-21 /var/log/*/latest.log

or more specific, incl. hour, minutes...

August 22, 2024, 11:05:58 AM #8 Last Edit: August 22, 2024, 12:28:29 PM by chemlud
Quote from: doktornotor on August 22, 2024, 10:50:18 AM
...

grep 2024-08-21 /var/log/*/latest.log

or more specific, incl. hour, minutes...

Thanks! What would be the grep for anything after 10 am on 21-AUG-2024?

If there is nothing in system.log, what should freeze the whole box and write something meaningful to a log before saying bye-bye?

I changed the hardware now (keeping SSD, RAM and NICs), hope that the nightmare is over for now.

Need a fresh spare box though...

PS: You should really add your avatar here. I liked it... :-p

Perfect! :-D
kind regards
chemlud
____
"The price of reliability is the pursuit of the utmost simplicity."
C.A.R. Hoare

felix eichhorns premium katzenfutter mit der extraportion energie

A router is not a switch - A router is not a switch - A router is not a switch - A rou....