Kernel Panics

Started by tuaris, June 27, 2017, 11:20:07 PM

June 27, 2017, 11:20:07 PM Last Edit: July 19, 2017, 03:09:19 AM by tuaris
I'm on a Soekris net6501, and after updating to OPNsense 17.1.8-i386 the firewall is kernel panicking at random intervals (sometimes after 9 hours, sometimes after 2, sometimes after a few minutes).


Fatal double fault:
eip = 0xc0a30252
esp = 0xeba10fc0
ebp = 0xeba11518
cpuid = 0; apic id = 00
panic: double fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper(c188419d,ff,c1b0c1e0,c1b0c1f0,c796b680,...) at db_trace_self_wrapper+0x2a/frame 0xc1d7e328
kdb_backtrace(c1a5ffcd,0,c1a56ed3,c1d7e3e4,0,...) at kdb_backtrace+0x2d/frame 0xc1d7e390
vpanic(c1a56ed3,c1d7e3e4,c1d7e3e4,c1d7e3ec,c14922b6,...) at vpanic+0x114/frame 0xc1d7e3c4
panic(c1a56ed3,0,0,0,0,...) at panic+0x1b/frame 0xc1d7e3d8
dblfault_handler() at dblfault_handler+0xa6/frame 0xc1d7e3d8
--- trap 0x17, eip = 0xc0a30252, esp = 0xeba10fc0, ebp = 0xeba11518 ---
random_fortuna_pre_read(73a4bcb5,eba11530,46,8c3680,c796b680,...) at random_fortuna_pre_read+0x22/frame 0xeba11518
read_random(eba11680,100,c0defb8f,c1db6600,41474b,...) at read_random+0x26/frame 0xeba11640
arc4rand(eba117d0,2,0,1,c78a4e00,...) at arc4rand+0x74/frame 0xeba11798
ip_fillid(c8b0a810,c8b0a810,14,2,1,...) at ip_fillid+0x103/frame 0xeba117f0
pfsync_sendout(c78a4e84,0,c2172130,683,0,...) at pfsync_sendout+0xbb/frame 0xeba11844
pfsync_insert_state(c92b0340,0,8000,0,10000000,...) at pfsync_insert_state+0x118/frame 0xeba11880
pf_state_insert(c796e000,c922cbc0,c922cbc0,c92b0340,8603,...) at pf_state_insert+0x87d/frame 0xeba118d8
pf_test_rule(1,c796d800,c82ae900,14,eba11c00,...) at pf_test_rule+0x397c/frame 0xeba11bb0
pf_test(1,c79c5400,eba11d04,0,c1dc5954,...) at pf_test+0x855/frame 0xeba11cb8
pf_check_in(0,eba11d04,c79c5400,1,0,...) at pf_check_in+0x29/frame 0xeba11cd8
pfil_run_hooks(c1dc5954,eba11e24,c79c5400,1,0,...) at pfil_run_hooks+0x88/frame 0xeba11d38
enc_hhook(3,2,c79c1b30,eba11e10,0,...) at enc_hhook+0x217/frame 0xeba11d80
hhook_run_hooks(c793ec80,eba11e10,0,c8d0ca40,eba11e78,...) at hhook_run_hooks+0xa1/frame 0xeba11dd8
ipsec_run_hhooks(eba11e10,3,10,1,c2427cac,...) at ipsec_run_hhooks+0x58/frame 0xeba11df0
ipsec4_common_input_cb(c82ae900,c8ff8500,14,9,40,...) at ipsec4_common_input_cb+0x512/frame 0xeba11e78
esp_input_cb(c90babf4,eba12658,c832908a,eba11fb8,c11550f9,...) at esp_input_cb+0x88f/frame 0xeba11f80
crypto_done(c90babf4,c832908a,8,eba12060,eba12070,...) at crypto_done+0x1b9/frame 0xeba11fb8
swcr_process(c75abc80,c90babf4,0,c27be8c0,80,...) at swcr_process+0xd97/frame 0xeba126b8
crypto_invoke(0,c8329092,c8feb038,c,c,...) at crypto_invoke+0x73/frame 0xeba126f0
crypto_dispatch(c90babf4,c18ac6b8,1ad,c8feb038,c20ab55a,...) at crypto_dispatch+0x65/frame 0xeba12718
esp_input(c82ae900,c8ff8500,14,9,d4,...) at esp_input+0x556/frame 0xeba127f8
ipsec_common_input(c82ae900,14,9,2,32,...) at ipsec_common_input+0x6e7/frame 0xeba1288c
esp4_input(eba128f4,eba128f0,32,1,0,...) at esp4_input+0x34/frame 0xeba128a8
ip_input(c82ae900,c0e69bf8,b7debbe1,80015188,5f5e9218,...) at ip_input+0x32b/frame 0xeba12918
netisr_dispatch_src(1,0,c82ae900) at netisr_dispatch_src+0xd0/frame 0xeba12960
netisr_dispatch(1,c82ae900,0,c82ae900,2,...) at netisr_dispatch+0x20/frame 0xeba12974
ether_demux(c7f69400,c82ae900,6,0,7470c88c,...) at ether_demux+0x131/frame 0xeba129a0
ether_nh_input(c82ae900,c0e69bf8,dc675435,80015188,5f5e9218,...) at ether_nh_input+0x383/frame 0xeba129f0
netisr_dispatch_src(5,0,c82ae900) at netisr_dispatch_src+0xd0/frame 0xeba12a38
netisr_dispatch(5,c82ae900,c78a7400,eba12ab4,c0f88053,...) at netisr_dispatch+0x20/frame 0xeba12a4c
ether_input(c7f69400,c82ae900,1,0,10000200,...) at ether_input+0x2a/frame 0xeba12a60
vlan_input(c78a7400,c82ae900,0,c82ae900,2,...) at vlan_input+0x223/frame 0xeba12ab4
ether_demux(c78a7400,c82ae900,6,0,c8684800,...) at ether_demux+0x9a/frame 0xeba12ae0
ether_nh_input(c82ae900,801,eba12b90,eba12b8c,c8632d00,...) at ether_nh_input+0x383/frame 0xeba12b2c
netisr_dispatch_src(5,0,c82ae900) at netisr_dispatch_src+0xd0/frame 0xeba12b74
netisr_dispatch(5,c82ae900,c796b680,eba12bac,c0f79729,...) at netisr_dispatch+0x20/frame 0xeba12b88
ether_input(c7981400,c82ae900,eba12c0c,c0790343,c7981400,...) at ether_input+0x2a/frame 0xeba12b9c
if_input(c7981400,c82ae900,1,0,c827d9c0,...) at if_input+0x19/frame 0xeba12bac
em_rxeof(c7981400,c1d4ef00,c793b5c8,0,c7935680,...) at em_rxeof+0x343/frame 0xeba12c0c
em_msix_rx(c7970900,c0e6523f,c796b680,0,109,...) at em_msix_rx+0x2f/frame 0xeba12c28
intr_event_execute_handlers(109,c793b580,c187b89f,555,aa55aa55,...) at intr_event_execute_handlers+0x299/frame 0xeba12c64
ithread_loop(c7971da0,eba12ce8,aa55aa55,aa55aa55,aa55aa55,...) at ithread_loop+0xc0/frame 0xeba12ca4
fork_exit(c0e084b0,c7971da0,eba12ce8) at fork_exit+0x71/frame 0xeba12cd4
fork_trampoline() at fork_trampoline+0x8/frame 0xeba12cd4
--- trap 0, eip = 0, esp = 0xeba12d20, ebp = 0 ---
KDB: enter: panic

Which version did you come from prior to the update?


I gave up (having the firewall reboot in the middle of phone calls is a deal breaker!), got a new SSD, and installed 17.1.4.  Restored my configuration, reinstalled my plugins, and rebooted.



I discovered that OPNsense doesn't like multi-boot. I was hoping to have both SSDs installed and be able to boot into different firmware, but even if I boot off the second SSD, the firmware on the first still gets loaded. Very odd.

Interestingly I noticed something on the dashboard that wasn't working in 17.1.8...



The gateway status panel has content, whereas in 17.1.8 it did not. I do remember this working before doing the update. So perhaps there is indeed something broken in the latest firmware.


I'm starting to notice a pattern with the kernel panics. They seem to happen regularly at ~7:30 UTC and ~13:00 UTC.

I'm assuming the lack of additional responses means that either this is a known bug, no one knows what is wrong, or no one wants to help?

Really hoping I can make this work.

Hi tuaris,

No responses from me means not enough time for helping out here.

I don't expect the update is the issue. You could easily go back to an older kernel (it crashes there after all):

# opnsense-update -kr 17.1.4
# /usr/local/etc/rc.reboot

If the crashes continue, this is due to heavy traffic and/or heat.

Your stack trace is also interesting in that it includes Firewall State Sync, IPsec and VLANs at the same time.

Also, how many services are you running? IPS? Web proxy? How is your RAM usage?

random_fortuna_pre_read() at the top is not part of a networking subsystem; the box crashes trying to get random bytes for the kernel for an IP packet it tries to send out.
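
As a rough aid for reading the trace, the frame addresses it prints can be turned into stack-usage numbers. The sketch below is not from any poster in this thread; it takes the two esp values and four frame lines copied from the backtrace and compares the stack depth at the fault with a kernel stack budget. `KSTACK_PAGES = 2` is an assumption about this i386 kernel and should be checked against the actual kernel configuration. The helper names are hypothetical.

```python
import re

# esp values copied from the panic output in this thread.
FAULT_ESP = 0xEBA10FC0         # esp on the "trap 0x17" (double fault) line
THREAD_ENTRY_ESP = 0xEBA12D20  # esp on the final "trap 0" line

PAGE_SIZE = 4096
KSTACK_PAGES = 2  # ASSUMPTION: kernel stack size in pages; verify for your kernel

# Four frames copied verbatim from the backtrace (callee first, caller last).
TRACE = """\
ipsec4_common_input_cb(c82ae900,c8ff8500,14,9,40,...) at ipsec4_common_input_cb+0x512/frame 0xeba11e78
esp_input_cb(c90babf4,eba12658,c832908a,eba11fb8,c11550f9,...) at esp_input_cb+0x88f/frame 0xeba11f80
crypto_done(c90babf4,c832908a,8,eba12060,eba12070,...) at crypto_done+0x1b9/frame 0xeba11fb8
swcr_process(c75abc80,c90babf4,0,c27be8c0,80,...) at swcr_process+0xd97/frame 0xeba126b8
"""

FRAME_RE = re.compile(r"\) at (\w+)\+0x[0-9a-f]+/frame (0x[0-9a-f]+)")


def parse_frames(trace):
    """Return [(function, frame_address)] in the order the trace prints them."""
    return [(m.group(1), int(m.group(2), 16)) for m in FRAME_RE.finditer(trace)]


def frame_sizes(frames):
    """Approximate bytes used by each caller: the gap between its frame
    pointer and the frame pointer of the function it called."""
    return {
        caller: caller_fp - callee_fp
        for (_, callee_fp), (caller, caller_fp) in zip(frames, frames[1:])
    }


used = THREAD_ENTRY_ESP - FAULT_ESP
budget = PAGE_SIZE * KSTACK_PAGES
sizes = frame_sizes(parse_frames(TRACE))
largest = max(sizes, key=sizes.get)

print(f"stack depth at fault: {used} bytes of an assumed {budget}")
print(f"fault esp offset within its page: {FAULT_ESP % PAGE_SIZE:#x}")
print(f"largest sampled frame: {largest} ({sizes[largest]} bytes)")
```

With the values from this trace the depth comes to 7520 bytes, and swcr_process is the largest of the four sampled frames at 1792 bytes. A depth that close to the assumed budget would be consistent with the kernel stack running out, but that is only one possible reading and the numbers are approximate.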

You could also try to shape your traffic a bit to take the edge off... The Soekris net6501 isn't the fastest hardware around anymore.


Cheers,
Franco

PS: How is 17.1.9 performing?

Thanks, I didn't mean to sound too negative. That last post was made after the box crashed at the worst possible moment :).

I have begun to notice a pattern. Whenever I put stress on it (by means of heavy VPN, VLAN, and sometimes traffic usage) it does seem to trigger the problem. I use several VLANs, a few IPsec tunnels, the PPTP and UPnP plugins, and interface bonding with LAGG. There are several services running behind it using port forwards: VoIP, multiple HTTP services, mail, etc.

I totally understand it's a pretty taxing setup. Interestingly enough, with the exception of interface bonding, the previous device (a net4801) handled the load using m0n0wall (it's currently got an uptime of 780 days!). I also get that the difference between OPNsense and m0n0wall is significant.

I purchased the higher-end net6501-70 expecting that it would be more than capable of handling my needs (50 Mbit/s up/down and 200+ nodes). I will try the packet shaping. I had started on it, but I found it a little harder to use than what I was used to with m0n0wall.

17.1.9 is performing well; it still panics, but not as often. I even shut off some logging and stats collection, and it has improved slightly.

Out of curiosity, how old is your net6501-70?
I bought mine fairly soon after they came out, and it died a few years later.
Was a known issue. Something with heat, iirc.

Mine was bought by franco, and afaik still works? ;-)
Hobbyist at home, sysadmin at work. Sometimes the first is mixed with the second.

It sounds like a heat problem indeed, it's summer-time after all. A fan might already help...

The Soekris from you is still up and running in a remote branch, dutifully pushing IPsec, but not doing any heavy lifting. :)


Cheers,
Franco

I should have mentioned mine was within warranty, so the board got replaced.
Had the newer/bigger heatsink on it.
But I had a -30, and the -70 has a fan on the heatsink, iirc?

Good to hear it's useful :-)

Quote from: weust on July 12, 2017, 01:43:31 PM
Out of curiosity, how old is your net6501-70?
I bought mine fairly soon after they came out, and it died a few years later.
Was a known issue. Something with heat, iirc.

Mine was bought by franco, and afaik still works? ;-)

Mine is no more than a month old. Purchased brand new directly from Soekris EU. I've already contacted them about a possible hardware issue, but they are saying it's software related. I guess the only way to really know for sure is to do some tests.

Quote from: franco on July 12, 2017, 02:45:20 PM
It sounds like a heat problem indeed, it's summer-time after all. A fan might already help...

The Soekris from you is still up and running in a remote branch, dutifully pushing IPsec, but not doing any heavy lifting. :)


Cheers,
Franco

Currently at 67 °C.

Ok. Then you have the newer revision.
Temp is fine too. That CPU runs a bit hot, which is normal.