Diagnose Frequent Firewall Lockups

Started by zzyzx, June 12, 2023, 08:33:50 PM

Since updating to the 23 series, maybe just coincidental, my firewall frequently locks up (three times in the past month) and becomes unresponsive. When it does lock up, the hardware gets much hotter, so the CPU seems to be chewing on something.

When I (hard) reset, it sometimes recovers normally, but I've had to reinstall/restore twice now due to a kernel panic, I assume from the reset. What is the best way to diagnose the cause?

Thanks.

Your hardware details and logs will be needed before anyone can suggest anything.

Thanks

Hardware is a fitlet2 with Celeron J3455 quad-core, 8GB RAM, 105GB SSD

No SMART errors are listed. The only oddity I could see in the dmesg/system logs was this error, repeated multiple times:
pid 29620 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)

Which logs can I provide to help diagnose?

Thanks for the help.

More info from the most recent lockup. Same symptoms, firewall becomes unresponsive and hardware is very hot. Hard reset often results in kernel panic on reboot:
Solaris(panic): zfs: removing nonexistent segment from range tree (offset (4a7172000 size=1000)

although I think this is a result of the hard reset and not the root cause of the initial lockup.


I'm not an expert, but the dump seems to show the kernel panicking during ZFS operations. Your filesystem is likely corrupted now and you will need to reinstall. But if it keeps happening then, in my experience, there is likely a hardware problem. It could be RAM flipping bits and sending those to the filesystem, or a storage subsystem problem. It is not necessarily the disk itself.
These are horrible to diagnose.
I would start by replacing whatever you can at the lowest cost first. Start with the disk, then monitor. Replace the memory or run with one stick only, then monitor. That sort of thing.
Unless someone can see more clearly than I can from the crash report.

I don't have any comments on the crash report, but before swapping out hardware I'd recommend doing some burn-in testing. Set up memtest and leave it running for a few days, etc.

Thanks for the responses.

I agree, the ZFS filesystem issues are likely a symptom of another underlying issue. I'm swapping out hardware this weekend and I'll run some RAM tests to see whether they highlight any culprits.

One thing I'm considering is that these lockups happen most frequently when WireGuard is in heavier use. Hard to test, but I'll report back if something more conclusive surfaces.

Today the GUI was very slow, practically unresponsive. When the dashboard finally loaded, all the tables were empty. Login via SSH had no problems.

Before reboot this was in the system log. Do these entries indicate a problem?

<13>1 2023-06-26T16:07:54-07:00 thechekt.lunas.lan dhclient 43547 - [meta sequenceId="1"] Creating resolv.conf
<11>1 2023-06-26T21:03:00-07:00 thechekt.lunas.lan configctl 79000 - [meta sequenceId="1"] error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 66, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
<11>1 2023-06-26T22:02:00-07:00 thechekt.lunas.lan configctl 51892 - [meta sequenceId="1"] error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 66, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out
<11>1 2023-06-26T22:03:00-07:00 thechekt.lunas.lan configctl 36621 - [meta sequenceId="1"] error in configd communication  Traceback (most recent call last):   File "/usr/local/sbin/configctl", line 66, in exec_config_cmd     line = sock.recv(65536).decode() socket.timeout: timed out


Earlier entries were the usual repeats:

pid 29620 (python3.9), jid 0, uid 0: exited on signal 11 (core dumped)
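
To check whether these entries cluster around the lockups, the log can be tallied with a short script. This is only a rough sketch, assuming the log lines look like the ones quoted in this thread; the function name `tally` and the regular expressions are illustrative and are not part of any firewall tooling.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread (an assumption,
# not an official log format definition).
CRASH = re.compile(r"pid (\d+) \(([^)]+)\).*exited on signal (\d+)")
CONFIGD_TIMEOUT = re.compile(r"error in configd communication")
HOUR = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2})")  # date plus hour of the timestamp


def tally(lines):
    """Count crashes per (process, signal) and configd errors per hour."""
    crashes = Counter()
    timeouts = Counter()
    for line in lines:
        crash = CRASH.search(line)
        if crash:
            crashes[(crash.group(2), int(crash.group(3)))] += 1
        if CONFIGD_TIMEOUT.search(line):
            stamp = HOUR.search(line)
            # Lines without a recognisable timestamp are grouped together.
            timeouts[stamp.group(1) if stamp else "unknown"] += 1
    return crashes, timeouts
```

Passing an open log file (or `sys.stdin`) to `tally` returns two counters. If the configd timeouts or the signal 11 crashes pile up in the hour before a lockup, that points at the time window worth looking at more closely.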

Load average from top seemed OK. Temps are often on the high side at 50-55C, but not crazy.


last pid: 82800;  load averages:  0.38,  0.38,  0.31    up 0+19:21:47  22:22:23
49 processes:  1 running, 48 sleeping
CPU:  0.8% user,  0.0% nice,  2.4% system,  0.0% interrupt, 96.8% idle
Mem: 87M Active, 367M Inact, 622M Wired, 40K Buf, 6624M Free
ARC: 263M Total, 65M MFU, 162M MRU, 280K Anon, 2329K Header, 33M Other
     183M Compressed, 527M Uncompressed, 2.88:1 Ratio
Swap: 8192M Total, 8192M Free