Firewall Frequently Locking Up, Requiring Hard Reboot

Started by milkywaygoodfellas, August 14, 2022, 01:45:45 AM

Every so often, up to multiple times per day, my firewall appliance locks up and requires a hard reboot to restore services and internet connectivity.

So far, I have been unable to find any logs or crash dumps that would help me isolate the issue, apart from one crash report, which I submitted via the web interface.

I have no idea where to start. Can someone point me in the right direction to troubleshoot this issue? At this point I'm not sure if it's hardware or software.

I'm running it on a KingNovy fanless PC with 6x Intel I225-V, a Celeron N5105, 16 GB of RAM, and a 256 GB NVMe drive.

A good start would be to connect to the console while it's locked up and see what it says.

I'd love to, but I can't even SSH into it when it happens.

I think he means locally on the device.  Not remoting into it ;)
OPNsense 25.1.9 running on:
Dell Optiplex 3050
Intel I5-7600 @ 3.5Ghz (4 Cores)
Intel I350-T4 Nic
8G DDR4
256G SSD


I managed to retrieve these crash dumps. Briefly going through them, I'm starting to suspect overheating or other hardware issues?

Looks like the panic was caused by "pfctl".  Are you doing packet inspection of any kind? Perhaps it's choking on session states?

Quote from: axsdenied on August 15, 2022, 05:29:10 PM
Looks like the panic was caused by "pfctl".  Are you doing packet inspection of any kind? Perhaps it's choking on session states?
Just the defaults... IDS was enabled in IPS mode but with no rules downloaded. I did not modify any of those settings from the base install.

For readability:

db:0:kdb.enter.default>  show pcpu
cpuid        = 0
dynamic pcpu = 0xfc0f40
curthread    = 0xfffffe0138c28720: pid 3489 tid 102014 critnest 1 "pfctl"
curpcb       = 0xfffffe0138c28c30
fpcurthread  = 0xfffffe0138c28720: pid 3489 "pfctl"
idlethread   = 0xfffffe00207933a0: tid 100003 "idle: cpu0"
self         = 0xffffffff82c10000
curpmap      = 0xfffffe011668f518
tssp         = 0xffffffff82c10384
rsp0         = 0xfffffe0118fea000
kcr3         = 0x351ae2000
ucr3         = 0x16fe6d000
scr3         = 0x16fe6d000
gs32p        = 0xffffffff82c10404
ldt          = 0xffffffff82c10444
tss          = 0xffffffff82c10434
curvnet      = 0xfffff80001202dc0
db:0:kdb.enter.default>  bt
Tracing pid 3489 tid 102014 td 0xfffffe0138c28720
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0118fe93c0
vpanic() at vpanic+0x1b0/frame 0xfffffe0118fe9410
panic() at panic+0x43/frame 0xfffffe0118fe9470
trap_fatal() at trap_fatal+0x385/frame 0xfffffe0118fe94d0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0118fe9530
calltrap() at calltrap+0x8/frame 0xfffffe0118fe9530
--- trap 0xc, rip = 0xffffffff80debe14, rsp = 0xfffffe0118fe9600, rbp = 0xfffffe0118fe9620 ---
rn_walktree() at rn_walktree+0x64/frame 0xfffffe0118fe9620
pfr_get_addrs() at pfr_get_addrs+0x219/frame 0xfffffe0118fe9680
pfioctl() at pfioctl+0x23be/frame 0xfffffe0118fe9b50
devfs_ioctl() at devfs_ioctl+0xc6/frame 0xfffffe0118fe9ba0
vn_ioctl() at vn_ioctl+0x1a4/frame 0xfffffe0118fe9cb0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe0118fe9cd0
kern_ioctl() at kern_ioctl+0x25b/frame 0xfffffe0118fe9d40
sys_ioctl() at sys_ioctl+0xf1/frame 0xfffffe0118fe9e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe0118fe9f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0118fe9f30
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8012446da, rsp = 0x7fffffffdc38, rbp = 0x7fffffffe0d0 ---
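
For anyone reading along: the frames that matter in a trace like this sit between the "--- trap" marker (trap 0xc is a page fault on amd64) and the "--- syscall" marker. Below is a minimal Python sketch that pulls those frames out of pasted ddb output. The function name and the shortened SAMPLE are made up for illustration; this is not part of any OPNsense or FreeBSD tool.

```python
import re

# Matches ddb frame lines such as
# "rn_walktree() at rn_walktree+0x64/frame 0xfffffe0118fe9620"
FRAME_RE = re.compile(
    r"^(?P<func>\w+)\(\) at (?P=func)\+0x[0-9a-f]+/frame 0x[0-9a-f]+$"
)


def kernel_frames_below_trap(backtrace: str) -> list[str]:
    """Return function names listed between '--- trap' and '--- syscall'."""
    frames = []
    in_faulting_path = False
    for raw_line in backtrace.splitlines():
        line = raw_line.strip()
        if line.startswith("--- trap"):
            # Everything above this marker is the panic machinery itself.
            in_faulting_path = True
            continue
        if line.startswith("--- syscall"):
            # Below this marker we are back in userland.
            break
        match = FRAME_RE.match(line)
        if in_faulting_path and match:
            frames.append(match.group("func"))
    return frames


# Shortened copy of the trace above (hypothetical helper input).
SAMPLE = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0118fe9530
calltrap() at calltrap+0x8/frame 0xfffffe0118fe9530
--- trap 0xc, rip = 0xffffffff80debe14, rsp = 0xfffffe0118fe9600, rbp = 0xfffffe0118fe9620 ---
rn_walktree() at rn_walktree+0x64/frame 0xfffffe0118fe9620
pfr_get_addrs() at pfr_get_addrs+0x219/frame 0xfffffe0118fe9680
pfioctl() at pfioctl+0x23be/frame 0xfffffe0118fe9b50
devfs_ioctl() at devfs_ioctl+0xc6/frame 0xfffffe0118fe9ba0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8012446da, rsp = 0x7fffffffdc38, rbp = 0x7fffffffe0d0 ---
"""

if __name__ == "__main__":
    for name in kernel_frames_below_trap(SAMPLE):
        print(name)
```

The first name returned is where the fault happened (rn_walktree here), reached from pfr_get_addrs via a pf ioctl issued by the pfctl process.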


I haven't seen this before but if it doesn't happen on 22.1 it should be easy to find the bad commit.

This is new for 22.7, right?


Cheers,
Franco

Quote from: franco on August 15, 2022, 08:15:36 PM
I haven't seen this before but if it doesn't happen on 22.1 it should be easy to find the bad commit.

This is new for 22.7, right?


Cheers,
Franco
Yeah, I never had this problem on 22.1. I disabled IDS/IPS entirely and it seems to have greatly helped stability: it was crashing multiple times a day today and yesterday, and since turning off Intrusion Detection under Services it hasn't crashed again (yet).

Just a quick update - since disabling IDS/IPS in my last post, the firewall has not crashed again as of this reply.

Did you have any hardware offloading enabled?  i.e. CRC, TSO, LRO or VLAN?
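
One way to double-check this from a shell is to look at the "options=" line that ifconfig prints for each interface. Here is a rough Python sketch that extracts the offload-related capability names from that output. The mapping of the GUI settings to these names (CRC to RXCSUM/TXCSUM, VLAN to VLAN_HWFILTER) is my assumption, and the sample text, including its hex values, is invented for illustration rather than taken from a real box.

```python
import re

# Capability names as printed by FreeBSD's ifconfig. Treat this set as an
# assumption about which names correspond to the GUI offload checkboxes.
OFFLOAD_FLAGS = {
    "RXCSUM", "TXCSUM", "RXCSUM_IPV6", "TXCSUM_IPV6",
    "TSO4", "TSO6", "LRO", "VLAN_HWFILTER",
}


def enabled_offloads(ifconfig_output: str) -> set[str]:
    """Return the offload capabilities listed on the 'options=' line."""
    match = re.search(r"options=[0-9a-f]+<([^>]*)>", ifconfig_output)
    if match is None:
        return set()
    return set(match.group(1).split(",")) & OFFLOAD_FLAGS


# Invented sample; the hex values are placeholders, not real bitmasks.
SAMPLE = (
    "igc0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\toptions=0<VLAN_MTU,RXCSUM,TSO4,LRO>\n"
)

if __name__ == "__main__":
    print(sorted(enabled_offloads(SAMPLE)))
```

An empty result would agree with everything being disabled; any names that show up are worth comparing against the interface settings in the GUI.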

August 17, 2022, 06:17:30 PM #12 Last Edit: August 17, 2022, 06:32:04 PM by milkywaygoodfellas
Quote from: axsdenied on August 17, 2022, 06:04:55 PM
Did you have any hardware offloading enabled?  i.e. CRC, TSO, LRO or VLAN?
Nope, all disabled.

And I spoke too soon... another crash dump some time yesterday apparently. This time, however, the firewall rebooted itself instead of staying locked up until I power cycled it.

Caused by PHP this time, apparently?

Given the change in behavior, this is feeling more like potentially a hardware issue, but it's still not remotely clear.

To rule that out, are you able to go back to 22.1 and test?

Otherwise, check CPU temps or set up alerts.
You could also run a memtest on the box, just for good measure.

Historically, for me, it's rarely been memory issues; however, it WAS 1 out of 99 times. And that 1 time drove me nuts in troubleshooting before I discovered the issue ;)
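
For the temperature check, something like the following Python sketch could be a starting point. It assumes an Intel CPU with the coretemp driver loaded, so that sysctl exposes dev.cpu.N.temperature values; the function names, the 80 C limit, and the sample readings are all hypothetical.

```python
import re

# Matches sysctl lines such as "dev.cpu.0.temperature: 61.0C"
TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C", re.MULTILINE)


def parse_cpu_temps(sysctl_output: str) -> dict[int, float]:
    """Map core number to temperature in degrees Celsius."""
    return {
        int(m.group(1)): float(m.group(2))
        for m in TEMP_RE.finditer(sysctl_output)
    }


def cores_over(temps: dict[int, float], limit_c: float) -> list[int]:
    """Return the cores at or above limit_c, in ascending order."""
    return sorted(core for core, temp in temps.items() if temp >= limit_c)


# Invented readings, only to show the expected input shape.
SAMPLE = (
    "dev.cpu.0.temperature: 61.0C\n"
    "dev.cpu.1.temperature: 84.0C\n"
)

if __name__ == "__main__":
    temps = parse_cpu_temps(SAMPLE)
    print(temps)
    print("hot cores:", cores_over(temps, limit_c=80.0))
```

On the firewall itself, the input would come from running "sysctl dev.cpu" and passing its output to parse_cpu_temps, for example from a cron job that logs or alerts when cores_over returns anything.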

Quote from: axsdenied on August 17, 2022, 07:55:19 PM
Given the change in behavior, this is feeling more like potentially a hardware issue, but it's still not remotely clear.

To rule that out, are you able to go back to 22.1 and test?

Otherwise, check CPU temps or set up alerts.
You could also run a memtest on the box, just for good measure.

Historically, for me, it's rarely been memory issues; however, it WAS 1 out of 99 times. And that 1 time drove me nuts in troubleshooting before I discovered the issue ;)
I can try a live disk of 22.1 to see. In the meantime, I made some tweaks and it was running stable again, so I turned IDS/IPS back on and it almost immediately locked up with no crash dump, same as before. I turned it back off and so far so good, but it's only been a couple of hours.