17.1.b & Suricata fails on ESXi

Started by phoenix, December 29, 2016, 01:47:36 PM

Hi Franco

Thanks for that prod, I'd forgotten about testing the E1000 NIC; the obvious sometimes escapes me. I did try the VMXNET2 NIC as well, and that also failed to allow IDS to be enabled, but I guess that's to be expected.

I should point out to anyone else who tries this: you can't leave the VMXNETx NIC in the system. It has to be removed and replaced with the E1000 NIC, followed by a clean install of 17.1.b; after that it works a treat with IDS up and running smoothly.

Thanks for your help, and I wish you and the OPNsense team (and the other forum members) a happy and prosperous New Year. Have a great weekend. :)
Regards


Bill

Great to hear you've gotten it working with the emulated Intel driver. That confirms that it's the same issue that I saw and should be fixed with the patch Franco linked to.

There's an unfortunate side effect of this: the CPU usage goes up to 100% and the Load is 1.3%. Using the VMXNET3 driver on 16.7, the Load was about the same with CPU usage around the 12% mark. This is a VM on a lightly loaded server, so I'll leave it as it is for now and keep an eye on it.

Would it be worth mentioning this problem in the Release Notes for 17.1 (and the RCs), just in case anyone else hits it?
Regards


Bill


I ran into this with the intel-em-kmod driver we maintain. It surprisingly (but not unjustly) uses the netmap(4) emulation mode as opposed to its native support, which made it easy to run into the same panic. A first test with the new netmap(4) changes in 12-CURRENT had no conclusive results. We're definitely not going to solve this for the initial 17.1 release, but I will work with the authors to see if we can resolve this ASAP and port it over.


Cheers,
Franco
--

775.468651 [ 268] generic_find_num_desc     called, in tx 1024 rx 1024
775.476185 [ 276] generic_find_num_queues   called, in txq 0 rxq 0
775.483286 [ 801] generic_netmap_dtor       Restored native NA 0
775.496255 [ 268] generic_find_num_desc     called, in tx 1024 rx 1024
775.503779 [ 276] generic_find_num_queues   called, in txq 0 rxq 0
775.511347 [ 801] generic_netmap_dtor       Restored native NA 0
775.527056 [ 276] generic_find_num_queues   called, in txq 0 rxq 0
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80732c2a
stack pointer           = 0x28:0xfffffe00a17cb300
frame pointer           = 0x28:0xfffffe00a17cb350
code segment            = base 0x0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 80820 (W#01-em1+)
[ thread pid 80820 tid 100213 ]
Stopped at      generic_xmit_frame+0x2a:        movl    (%rax),%eax
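
For anyone comparing a panic of their own against the one above, here is a minimal Python sketch that pulls the key fields out of such a dump. The function name `parse_panic` and the field handling are illustrative assumptions, not part of any OPNsense or FreeBSD tool; the sample text is copied from the trace above. In this trace the fault address is 0x1 and the faulting instruction reads through %rax, which is consistent with a near-null pointer dereference in generic_xmit_frame.

```python
import re

# Sample lines copied from the panic output above.
PANIC_TEXT = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80732c2a
current process         = 80820 (W#01-em1+)
Stopped at      generic_xmit_frame+0x2a:        movl    (%rax),%eax
"""


def parse_panic(text):
    """Hypothetical helper: collect trap number, faulting function and
    the simple 'key = value' lines of a FreeBSD panic dump into a dict."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"Fatal trap (\d+): (.+)", line)
        if m:
            info["trap"] = int(m.group(1))
            info["reason"] = m.group(2).strip()
            continue
        m = re.match(r"Stopped at\s+(\w+)\+(0x[0-9a-f]+):", line)
        if m:
            info["function"] = m.group(1)
            info["offset"] = int(m.group(2), 16)
            continue
        # Plain 'key = value' lines; skip the combined cpuid/apic line
        # and continuation lines that have no key before the '='.
        if "=" in line and ";" not in line:
            key, _, value = line.partition("=")
            key = key.strip()
            if key:
                info[key] = value.strip()
    return info


if __name__ == "__main__":
    for key, value in parse_panic(PANIC_TEXT).items():
        print(f"{key}: {value}")
```

Matching on the function name and offset (here generic_xmit_frame+0x2a) is usually the quickest way to tell whether two reports describe the same crash.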

Bill,

I looked into this all the way up to involving FreeBSD/netmap people.

The good news is: the panic is gone in the 12-CURRENT code and we have a working backport.

The bad news for now: neither 12-CURRENT nor the backport for 17.1 works in our inline IPS setup with Suricata.

I'll drop by again when we have more info.


Cheers,
Franco

Hi Franco

Thanks for both of those updates; I seem to have missed the post on Jan 3rd. It's not an urgent problem for me, so I reverted to using the VMXNET3 NICs so I could drop the CPU usage and stay on the 17.1 beta. I'm quite happy to leave Suricata disabled for now and I'll wait for any updates you get on this. I'm also willing to be a guinea pig if you need it tested. :)

Thanks for all your hard work on this and a Happy New Year to you and all the team.
Regards


Bill

Hi Bill,

A happy new year to you too! :)

The issue is a bit problematic, as it is largely present in FreeBSD 11.0 but was working just fine in 10.3. It unfortunately points to "us" being a major provider/user of this functionality, which is really only a small subset or niche feature that others are *not* directly using, not even the developers themselves. This has the mixed implication that we have to make sure the features we use are not deleted as unused or silently broken months before they are released.

I don't know how we can pull this off, but hopefully with the current discussions we will find a way in the next few weeks.


Cheers,
Franco

How about this kernel then? Make sure to snapshot. :)

# opnsense-update -kr 17.1.b-netmap-fix


Cheers,
Franco

Gosh, that was quick. :)

I (almost) always take a snapshot and I did today. I've just done the update, and after enabling IPS/IDS and updating the rules all seems quite calm, with normal, relatively low CPU usage. This is also on a VM with the VMXNET3 NICs installed. If anything breaks or looks out of place I'll post here.
Regards


Bill

Quick? Took me a couple of days to dig through 2 years of netmap commit history to find it. :D

That's a good sign. If the guys at Deciso and the netmap peeps are ok with it I shall add the fix just in time for 17.1-RC1.


Cheers,
Franco

Sounds good to me, I'll keep a close eye on it for the moment and see what happens. Without IDS enabled it's been running at about 2-3% CPU usage, and with it enabled it seems to be hovering around 7-8%. There was obviously a larger spike to 10-12% as the rules were downloaded, but that dropped after a few minutes.

Thanks for all your hard work on this and enjoy the rest of the evening. :)
Regards


Bill

Thank you Bill, you too!


Cheers,
Franco

January 16, 2017, 08:03:09 AM #28 Last Edit: January 16, 2017, 03:37:49 PM by phoenix
Good morning Franco

Bad news I'm afraid. A short while after updating the install yesterday the CPU usage went up to 100%. I didn't notice this yesterday evening as internet access was still OK, but this morning I saw the CPU usage was up and internet access was almost impossible.

A reboot also had problems with various timeouts and I had to reset the VM to get it to boot correctly. That worked, but CPU usage went straight up to 100%. Disabling IDS/IPS and resetting the VM doesn't resolve the 100% CPU problem, and it runs like that all the time.

I've taken a snapshot of this current system so if you need me to do anything on that to get you some logs then let me know.
Regards


Bill

I've just been doing some testing with this and the high CPU use may not be a problem with IPS/IDS. I've enabled IPS/IDS again with the updated kernel/drivers and I'll leave it running tonight and do some more tests in the morning. I'll post the results later tomorrow.
Regards


Bill