Kernel panic when unplugging WAN network interface

Started by draga, October 01, 2018, 02:15:38 PM

Hello everybody, I've been using OPNsense for 9 months now. Until last week it was just a multi-WAN gateway and firewall, but now I'm implementing some more advanced configurations (IPv6, ZeroTier, etc.).
While running some tests, I noticed that unplugging either of my two WAN interfaces (both show the same problem) triggers a kernel panic and everything hangs. Sometimes it reboots, sometimes it just sits there and stops working.
Both of my WANs are connected via PPPoE (one is a wireless connection, the other ADSL), and this is the error I get:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c2ab63
stack pointer           = 0x28:0xfffffe011abf2f60
frame pointer           = 0x28:0xfffffe011abf2f70
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq259: igb0:que 3)


I've also tried to follow this, but no luck:
https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html#Intel_igb.284.29_and_em.284.29_Cards

What could it be? Searching the forum I found some similar posts related to PPPoE WAN devices, but from what I read this should have been fixed long ago. My version is OPNsense 18.7.4-amd64 - FreeBSD 11.1 RELEASE-p14 - LibreSSL 2.7.4
Thank you
Stefano


I haven't tried yet. I'll try as soon as possible.
Do you mean also re-importing the configuration from the backup?

Thank you

Yes, you can import the backup XML, just to check whether the problem is related to 18.7 or was already present before.

Hello,
sorry for the long delay, strong flu here.

I just tested a new installation from the ISO and a restore from the backup. No updates. Same result.

Thank you.

Hello everybody,
I'm still having the same problem. This morning one of my two WANs is flapping and, from time to time, the PPPoE connection disappears (it's a wireless provider). If the outage lasts for more than one or two minutes, the entire OPNsense box hangs and I have to restart the APU manually; otherwise it stays stuck.

Is there anything I can do or try to avoid this? It's not a big issue when I'm here, but it is a serious one when I'm away and the APU stays blocked for days.

Thank you!

I had that kind of problem too.
Do you have the "Kill states" option enabled? (Firewall->Settings->Advanced->Gateway Monitoring)

Yes, I had that checked. I've now unchecked it to see what happens. Thank you, I will report back here ASAP.

It didn't work. As soon as I unplug one of the WANs, the system hangs :(

Under Firewall->Settings->Advanced, try with:

unchecked:
"Kill states"
"Bind states to Interface"


checked should be:
"Use sticky Connection" (with MultiWAN)
"Shared forwarding"
"Gateway switching"

Thank you. Actually everything was ok except sticky connections.
Unfortunately, it didn't help. Here's what appears on console, then the APU hangs:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c2f323
stack pointer           = 0x28:0xfffffe0120f9f5e0
frame pointer           = 0x28:0xfffffe0120f9f620
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 31763 (dpinger)

Ok. Those were the options that caused trouble at our company. I don't know where the problem comes from. I see a "page fault" in this message. Maybe run a memtest on that device, just to confirm that the RAM is OK.

Maybe you could run tcpdump over SSH while disconnecting an interface, so you can see what happens.

stack pointer, frame pointer... sorry. That's where I'm out. :-\

The error looks like a software bug in the NIC driver or kernel. Unplugging the interface (network cable) triggers a hardware interrupt (irq259 in the error message). The code behind that accesses a virtual memory address which is not mapped to a physical memory page.
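
To make that reading of the trap message concrete, here is a minimal Python sketch that pulls the "key = value" fields out of the first panic quoted in this thread and checks whether the faulting address sits in the first memory page. The function names (`parse_trap`, `describe`) are made up for illustration, and treating any address below 4096 as a NULL pointer dereference is a heuristic assumption, not something the kernel reports.

```python
import re

# Lines copied from the first panic quoted in this thread.
PANIC = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c2ab63
current process         = 12 (irq259: igb0:que 3)
"""

PAGE_SIZE = 4096  # amd64 base page size


def parse_trap(text):
    """Collect the 'key = value' lines of a trap message into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([a-z ]+?)\s*=\s*(.+)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def describe(fields):
    """Summarise the fault: likely NULL dereference, and in which context."""
    address = int(fields["fault virtual address"], 16)
    return {
        # Assumption: a fault inside the first page is treated as a
        # dereference of a NULL (or near-NULL) pointer.
        "null_dereference": address < PAGE_SIZE,
        "context": fields["current process"],
    }


summary = describe(parse_trap(PANIC))
print(summary)
```

For this trap the context field is the igb0 interrupt thread. The second panic quoted earlier has the same fault address but names dpinger as the current process, so the faulting code may not be limited to the NIC interrupt path.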

To prevent undefined system behaviour and possible data corruption, the kernel panics into fail-stop mode. In this case, the NIC driver or kernel needs an update. Is the NIC driver installed separately or shipped with OPNsense? In the latter case, the FreeBSD kernel team needs a bug report.
OPNsense 24.7.11_2-amd64

Thank you everybody.
Yes, it is the stock OPNsense kernel, so I guess the stock FreeBSD driver. The device is an APU2C4 with Intel NICs.

As I have another APU around, with Realtek NICs, I switched to it and tried again.
I don't see any error on the serial console, but the APU hangs and becomes unreachable. So different behaviour, but same result.
Thank you.