OPNsense Forum

Archive => 18.1 Legacy Series => Topic started by: ky41083 on November 09, 2017, 06:05:00 am

Title: Kernel Panic When Using Certain Scheduler Types
Post by: ky41083 on November 09, 2017, 06:05:00 am
Assigning certain scheduler types to pipes in the traffic shaper causes the system to kernel panic and immediately reboot. I didn't get a crash dump any of the multiple times this happened.

The message(s) you will find in the system log before the panic will be very similar to this:
Code:
heap_extract: empty heap [random memory address hex value]
The higher the volume of traffic being processed, the sooner the panic. If you already have a very high traffic load on an OPNsense instance, you won't have time to read the system log from the WebGUI; the most you will see is the above message being output to the console. The most consistent way to trigger (reproduce) the panic, in my case, was to hit the bandwidth limit on any of the pipes in question. This would trigger a kernel panic within a few seconds, if not less. If you are testing, set an arbitrarily low limit on a pipe and push enough traffic to hit that limit.
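
For anyone who wants to check a saved copy of the system log for this message, here is a minimal Python sketch. It assumes the message has the form shown above (the text "heap_extract: empty heap" followed by an address); the function name and the sample lines are made up for illustration and are not real log output.

```python
import re

# Assumption: the message is "heap_extract: empty heap" followed by an
# address-like token, as described in the post. The real format may vary.
HEAP_EXTRACT_RE = re.compile(r"heap_extract: empty heap\s+(\S+)")


def find_heap_extract_warnings(log_lines):
    """Return (line_number, address) pairs for each matching log line."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = HEAP_EXTRACT_RE.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Placeholder lines for illustration only (not copied from a real system).
sample_log = [
    "heap_extract: empty heap 0xdeadbeef",
    "some unrelated log line",
    "heap_extract: empty heap 0xdeadbeef",
]
hits = find_heap_extract_warnings(sample_log)
print(hits)
```

In practice you would pass in the lines of whatever log file you saved before the reboot, for example `open(path).readlines()`.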

I have tracked it down to the way byte-based scheduling algorithms work in ipfw. My fix for this issue was to:

- Unplug or disable interfaces to stop traffic flowing through the shaper.
- Temporarily disable all pipes + queues, apply, reboot (this stops the kernel panics).
- Plug in or enable interfaces from first step.
- Change the scheduler type for all pipes to a packet-based algorithm (rather than a byte-based one). I chose Deficit Round Robin, for example.
- Enable all pipes + queues, apply, reboot.

Everything should be just fine now. If you still get heap_extract messages, double check that you didn't miss the scheduler type change on any pipes.
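
To double check that no pipe was missed, something like the following Python sketch could scan an exported configuration. Treat the details as assumptions: it assumes each pipe appears as a `<pipe>` element with `<number>` and `<scheduler>` children, that an empty `<scheduler>` means the default scheduler, and that `rr` is the stored value for Deficit Round Robin. Verify these against your own config file before relying on it; the sample XML is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names and values are assumptions, not a
# verified copy of the OPNsense configuration schema.
SAMPLE_CONFIG = """
<opnsense>
  <OPNsense>
    <TrafficShaper>
      <pipes>
        <pipe><number>10000</number><scheduler></scheduler></pipe>
        <pipe><number>10001</number><scheduler>rr</scheduler></pipe>
      </pipes>
    </TrafficShaper>
  </OPNsense>
</opnsense>
"""


def pipes_needing_change(config_xml, safe_schedulers=("rr",)):
    """Return (pipe_number, scheduler) for pipes not using a 'safe' scheduler."""
    root = ET.fromstring(config_xml.strip())
    flagged = []
    for pipe in root.iter("pipe"):
        number = pipe.findtext("number", default="?")
        # A missing or empty <scheduler> is treated as the default scheduler.
        scheduler = (pipe.findtext("scheduler") or "").strip()
        if scheduler not in safe_schedulers:
            flagged.append((number, scheduler or "(default)"))
    return flagged


flagged = pipes_needing_change(SAMPLE_CONFIG)
print(flagged)
```

The `safe_schedulers` parameter is there so you can list whichever scheduler values you have confirmed do not panic on your hardware.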

The biggest issue with all of this, from my standpoint, is that the default scheduler type (the one you never see unless you toggle the advanced slider) is Weighted Fair Queuing. Weighted Fair Queuing is indeed a byte-based algorithm, and it is responsible for causing this kernel panic on the hardware in question.

My best guess is that some network interface drivers misreport byte rates and feed invalid values into an unchecked Weighted Fair Queuing function, and since ipfw shaping algorithms are all kernel modules, we get a panic. As changing to a packet-based algorithm stopped the kernel panics, I would guess the drivers in question are at least reporting packet rates correctly, or the packet rate functions are checked. I base this guess on the fact that I have multiple production OPNsense VMs running that pair vmxnet3 interfaces with Weighted Fair Queuing pipes, and I have yet to see a single kernel panic on any of them.

Given that the above fix worked without changing anything else at all, I would guess a change to ipfw's WFQ code is responsible? Even if it happens to be a coding issue elsewhere in the base system, relying on unchecked values from NIC drivers, or from anything else outside of ipfw's control, in the kernel seems like a bad idea. Case in point.
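
To illustrate what "checked" could mean here, below is a small Python sketch of a heap-backed queue that validates its inputs and refuses to extract from an empty heap. This is not ipfw's or dummynet's actual code, and it does not confirm the guess above about the cause; the function names and the "finish time" key are hypothetical and only show the general idea.

```python
import heapq
import math


def enqueue_checked(heap, finish_time, packet_id):
    """Push a packet, rejecting keys that are non-numeric, non-finite, or negative."""
    if not isinstance(finish_time, (int, float)):
        return False
    if not math.isfinite(finish_time) or finish_time < 0:
        return False
    heapq.heappush(heap, (finish_time, packet_id))
    return True


def heap_extract_checked(heap):
    """Pop the smallest item, or return None when the heap is empty."""
    if not heap:
        # An unchecked heapq.heappop() would raise IndexError here.
        return None
    return heapq.heappop(heap)


heap = []
accepted = [
    enqueue_checked(heap, 2.0, "b"),
    enqueue_checked(heap, 1.0, "a"),
    enqueue_checked(heap, float("nan"), "c"),  # rejected: not a finite number
]
first = heap_extract_checked(heap)
second = heap_extract_checked(heap)
third = heap_extract_checked(heap)  # heap is empty by now
print(accepted, first, second, third)
```

The point is only that both ends are guarded: bad values never reach the heap, and an empty heap yields a harmless result instead of a crash.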

To me, this screams "fix upstream". I would bet vanilla FreeBSD suffers from the same issue. Personally, I lack the understanding of the FreeBSD source code to properly get a fix pushed upstream, or even attention drawn to the proper source code points. Hopefully someone who does, does.

Simply reply to this thread if I can offer any more information than I already have. I will do what I can.

Worst case scenario, this thread will exist for anyone else seeing the above issue on OPNsense, or FreeBSD, in the future.

| Hardware Used |
- Dell Dimension 5150 (Intel Pentium D CPU 2.80GHz, 945G chipset)
- 1x Intel 82801GB 10/100 Ethernet (using fxp driver)
- 1x 3Com 3c905C-TX Fast Etherlink XL (using xl driver)
- SanDisk SDSSDA120G SSD (boot device in AHCI mode)

| Software Used |
- OPNsense 17.7.7_1-amd64 (panic experienced on multiple versions of 17.7)
Title: Re: Kernel Panic When Using Certain Scheduler Types
Post by: ky41083 on November 09, 2017, 06:32:11 am
This issue has also been cited at least once in this forum, see:
https://forum.opnsense.org/index.php?topic=4907.0

So we know it has been happening since at least 17.1, and is not isolated to my individual case.
Title: Re: Kernel Panic When Using Certain Scheduler Types
Post by: Cherubim on May 30, 2018, 06:18:32 pm
I can only say:
Thank you! Thank you! Thank you!

This problem has been driving me nuts!
I have a J1900 platform, and after switching the pipes to "Deficit Round Robin" there are no more sudden reboots.

Thank god I found this thread. I had messed with every tunable I could think of and nothing helped.

By the way: I am running 18.1.8, so the issue is still there.
Title: Re: Kernel Panic When Using Certain Scheduler Types
Post by: danin on August 02, 2018, 05:08:22 am
Hi there. First off - dear lord thank you so much. I run XenCenter on a Dell R710, and switched on Traffic Shaping a month ago. Literally no issues, I'm not even kidding.

Today, I realized - I wasn't shaping IPv6. I realized this when I switched a downloader to IPv6 and suddenly my network was --SLAMMED-- and the downloader wasn't throttled, and I went "Oh, haha, the IPv4 rule bindings don't cover IPv6. Oh, crap, that means it's not throttling this. Oh, super crap, that means it's not throttling --anything--." This explains users' reports that it was still inconsistent during --some-- downloaders.

So I slapped together some IPv6 rules, and started messing with other things. Suddenly, while I was messing with Unbound, it started crashing. HARD. XenCenter couldn't even reboot the thing, then on manual reboot, XenCenter couldn't take down an unrelated VM. Thought it was Unbound, so I did a full refresh: reinstalled everything and restored the config after manually deleting Unbound's section from the file. Stable - okay, fine. Crash. Found this thread through the one linked above, disabled all throttling, and it's all good. BUT. I got to thinking.

Is this IPv6 related? Can anyone replicate the crash on IPv4-only rulesets? IPv6-only rulesets? Mixed? I don't want to test right now, I manage the network for a household of streamers and they've been glaring daggers all day while I sorted this. I'll see if I can figure out a way to test it in more depth on my own, in the meantime.

Oh, and FYI - I was on 18.1.6 or .7 or so at the time. Updated to 18.7 today.

EDIT: Uh - also, just noticed. 17.7 Legacy Series? Let's get this moved, this is --current--.