20.7.3 + esxi webgui = kernel panic

Started by GreenMatter, October 13, 2020, 05:18:15 PM


After updating to 20.7.3 I'm seeing quite strange behavior: when I try to open the ESXi WebGUI (the host sits in one of my VLANs), OPNsense immediately restarts itself with a kernel panic. Yet when I'm connected to the LAN over OpenVPN, I can open the ESXi WebGUI without a problem.
I tried the Netmap kernel and it's exactly the same.
Here is what I found in the crash reporter:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80e3b142
stack pointer         = 0x28:0xfffffe00403f28d0
frame pointer         = 0x28:0xfffffe00403f29a0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_1)
trap number = 12
panic: page fault
cpuid = 1
time = 1602601568
__HardenedBSD_version = 1200059 __FreeBSD_version = 1201000
version = FreeBSD 12.1-RELEASE-p10-HBSD #1  ebb8c1489c7(master)-dirty: Mon Sep 21 13:50:27 CEST 2020
    root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00403f2580
vpanic() at vpanic+0x1a2/frame 0xfffffe00403f25d0
panic() at panic+0x43/frame 0xfffffe00403f2630
trap_fatal() at trap_fatal+0x39c/frame 0xfffffe00403f2690
trap_pfault() at trap_pfault+0x49/frame 0xfffffe00403f26f0
trap() at trap+0x29f/frame 0xfffffe00403f2800
calltrap() at calltrap+0x8/frame 0xfffffe00403f2800
--- trap 0xc, rip = 0xffffffff80e3b142, rsp = 0xfffffe00403f28d0, rbp = 0xfffffe00403f29a0 ---
iflib_rxeof() at iflib_rxeof+0x542/frame 0xfffffe00403f29a0
_task_fn_rx() at _task_fn_rx+0xc0/frame 0xfffffe00403f29e0
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x144/frame 0xfffffe00403f2a40
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x98/frame 0xfffffe00403f2a70
fork_exit() at fork_exit+0x83/frame 0xfffffe00403f2ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00403f2ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
OPNsense on:
Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz (4 cores)
8 GB RAM
50 GB HDD
and plenty of vlans ;-)

October 14, 2020, 09:06:41 PM #1 Last Edit: October 14, 2020, 11:47:16 PM by GreenMatter


Has nobody experienced such a kernel panic?
Most likely it is not related to ESXi itself but to the ARP table and the ESXi host's IP address. I've found this bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234296 and had previously seen a lot of messages like "arp: ... moved from (MAC address) to (other MAC address)".
I think there's no other way than downgrading...
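For reference, this kind of ARP flapping can be watched from the OPNsense shell (a sketch using standard FreeBSD tools; nothing here is specific to this setup):

```shell
# Dump the ARP table and note which MAC the ESXi host's entry resolves to;
# if it keeps changing between runs, the flapping is happening right now.
arp -an
# The "arp: ... moved from ... to ..." messages also land in the kernel
# message buffer:
dmesg | grep -i "moved from"
```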


EDIT:
After restoring OPNsense to 20.1.9, and even after applying the 20.7.3 config backup, all is fine: no kernel panic. So obviously something is wrong with 20.7...

October 23, 2020, 02:19:50 PM #2 Last Edit: October 23, 2020, 02:28:17 PM by GreenMatter
I've updated a test instance of OPNsense to 20.7.4 and unfortunately it's still the same:


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x0
fault code      = supervisor write data, page not present
instruction pointer   = 0x20:0xffffffff80e3b142
stack pointer           = 0x28:0xfffffe00403f28d0
frame pointer           = 0x28:0xfffffe00403f29a0
code segment      = base 0x0, limit 0xfffff, type 0x1b
         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process      = 0 (if_io_tqg_1)
trap number      = 12
panic: page fault
cpuid = 1
time = 1603450742
__HardenedBSD_version = 1200059 __FreeBSD_version = 1201000
version = FreeBSD 12.1-RELEASE-p10-HBSD #0  6e16e28f1bf(stable/20.7)-dirty: Tue Oct 20 13:30:19 CEST 2020
    root@sensey64:/usr/obj/usr/src/amd64.amd64/sys/SMP
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00403f2580
vpanic() at vpanic+0x1a2/frame 0xfffffe00403f25d0
panic() at panic+0x43/frame 0xfffffe00403f2630
trap_fatal() at trap_fatal+0x39c/frame 0xfffffe00403f2690
trap_pfault() at trap_pfault+0x49/frame 0xfffffe00403f26f0
trap() at trap+0x29f/frame 0xfffffe00403f2800
calltrap() at calltrap+0x8/frame 0xfffffe00403f2800
--- trap 0xc, rip = 0xffffffff80e3b142, rsp = 0xfffffe00403f28d0, rbp = 0xfffffe00403f29a0 ---
iflib_rxeof() at iflib_rxeof+0x542/frame 0xfffffe00403f29a0
_task_fn_rx() at _task_fn_rx+0xc0/frame 0xfffffe00403f29e0
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x144/frame 0xfffffe00403f2a40
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x98/frame 0xfffffe00403f2a70
fork_exit() at fork_exit+0x83/frame 0xfffffe00403f2ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00403f2ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

I think it's caused by this (repeated panics in the system log):

2020-10-23T12:48:08   kernel   KDB: enter: panic
2020-10-23T12:48:08   kernel   panic() at panic+0x43/frame 0xfffffe00403f2630
2020-10-23T12:48:08   kernel   vpanic() at vpanic+0x1a2/frame 0xfffffe00403f25d0
2020-10-23T12:48:08   kernel   panic: page fault
2020-10-23T12:46:58   kernel   KDB: enter: panic
2020-10-23T12:46:58   kernel   panic() at panic+0x43/frame 0xfffffe00403f2630
2020-10-23T12:46:58   kernel   vpanic() at vpanic+0x1a2/frame 0xfffffe00403f25d0
2020-10-23T12:46:58   kernel   panic: page fault
2020-10-23T12:43:55   kernel   KDB: enter: panic
2020-10-23T12:43:55   kernel   panic() at panic+0x43/frame 0xfffffe00403f2630
2020-10-23T12:43:55   kernel   vpanic() at vpanic+0x1a2/frame 0xfffffe00403f25d0
2020-10-23T12:43:55   kernel   panic: page fault

And I noticed that the ESXi host, 172.16.0.8 (on the screenshot), has quite high outbound bandwidth: hundreds of GB just from browsing the WebGUI...? It's visible in the 20.1.9 traffic report, and when I refresh the ESXi WebGUI, Bandwidth Out hits 3.4 G...

I've also just noticed that Monit sends a notification about a full drive, while the drive actually has plenty of free space:


Resource limit matched Service RootFs
Date:        Fri, 23 Oct 2020 12:32:00
Action:      alert
Description: space usage 100.0% matches resource limit [space usage > 75.0%]
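One thing worth ruling out (my assumption, not confirmed here): on FreeBSD, savecore writes kernel crash dumps under /var/crash after each panic, so a run of panics can genuinely fill the root filesystem. A quick check:

```shell
# See whether repeated panics have filled / with crash dumps.
df -h /
ls -lh /var/crash
du -sh /var/crash
```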

Any ideas? I really don't know where to start.


Which version of ESXi are you running?
"The S in IoT stands for Security!" :)

Quote from: Gauss23 on October 23, 2020, 02:50:10 PM
Which version of ESXi are you running?
I used to have 6.7, but recently I upgraded ESXi to 7.0 U1.
The result is the same for both ESXi versions... Other services/servers on that particular subnet (where the ESXi host is) do not affect OPNsense.
One more thing: when I connect to the LAN over OpenVPN, all is fine. And there's nothing special on either interface (OpenVPN and the trusted one I usually connect from), just rules allowing access to LAN and WAN. The only difference between them is the MTU: on OpenVPN I use a huge MTU of 24000. I know that's not what's advised, but it gives me the most reliable and fastest connection. Inside the LAN it's a regular 1500...


If an MTU of 24000 is the most reliable, something is wrong in general.
To me it sounds like you have a loop somewhere, and too many packets are also reaching the VM.

Quote from: mimugmail on October 23, 2020, 03:26:40 PM
If an MTU of 24000 is the most reliable, something is wrong in general.
To me it sounds like you have a loop somewhere, and too many packets are also reaching the VM.

Sounds strange, especially on an OpenVPN interface. You know that your OpenVPN packets presumably need to leave your box through your WAN interface with an MTU of 1500? I guess this will lead to a lot of fragmented traffic.
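As a rough back-of-the-envelope check of what a 24000-byte tunnel MTU implies on a 1500-byte WAN path (the ~60-byte encapsulation overhead below is an assumption, not a measured value):

```shell
# Every full-size tunnel packet must be cut into IPv4 fragments on the WAN.
tun_mtu=24000          # the tun-mtu configured on the OpenVPN link
overhead=60            # assumed OpenVPN + UDP + IP encapsulation overhead
wan_mtu=1500
frag_payload=$((wan_mtu - 20))    # 1480 bytes of data per IPv4 fragment
total=$((tun_mtu + overhead))
frags=$(( (total + frag_payload - 1) / frag_payload ))   # ceiling division
echo "$frags"          # prints 17: one tunnel packet -> ~17 WAN fragments
```

Losing any one of those ~17 fragments costs the whole tunnel packet, which is why such a large tun-mtu is usually discouraged.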

Quote from: mimugmail on October 23, 2020, 03:26:40 PM
If an MTU of 24000 is the most reliable, something is wrong in general.
To me it sounds like you have a loop somewhere, and too many packets are also reaching the VM.
In general, I don't have any settings fancier than the outgoing VPN interface and VLAN subnets. I have tried various combinations of mssfix/link-mtu/tun-mtu and fragment, and I ended up with:
Quote
mssfix 0
fragment 0
tun-mtu 24000
for UDP connections. All LAN interfaces have MTU 1500.
Anyway, what exactly is different between 20.1 and 20.7, such that 20.7 gives a kernel panic and what looks like an instant overfill of a drive when the ESXi host is accessed? Like I said, I don't know where to start.

Quote from: Gauss23 on October 23, 2020, 04:17:56 PM
Sounds strange, especially on an OpenVPN interface. You know that your OpenVPN packets presumably need to leave your box through your WAN interface with an MTU of 1500? I guess this will lead to a lot of fragmented traffic.
Yes, it's strange, but I can't find a better setting. The OPNsense VM uses vmxnet3 and I have hardware offload enabled: should I disable it?
Do you know how to troubleshoot this kernel panic in relation to the ESXi WebGUI?



Quote from: mimugmail on October 23, 2020, 05:21:07 PM
Yes, disable all offloading ...
All hardware offloads have been disabled. But there must be some loop in there. Since I use VLANs in the firewall, OPNsense sits in a trunking port group with VLAN ID 4095 (VGT); all non-VLAN-aware VMs are tagged by the vSwitch (VST), and I communicate with them via tagged ports on the physical switch (EST).
So where is the error here, and how can I connect ESXi differently?
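For the record, the OPNsense offload checkboxes correspond to FreeBSD interface flags; a sketch of clearing them from the shell, assuming the LAN interface is vmx1 as described above:

```shell
# Turn off checksum, TSO and LRO offloads on the vmxnet3 LAN interface.
# (In OPNsense this is normally done via Interfaces > Settings; these are
# the underlying FreeBSD knobs.)
ifconfig vmx1 -txcsum -rxcsum -tso -lro
# Confirm the offload options are gone:
ifconfig vmx1 | grep -i options
```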

October 25, 2020, 05:48:44 PM #11 Last Edit: October 25, 2020, 05:52:57 PM by GreenMatter
I think I've nailed it...
The OPNsense VM is connected to the LAN through a trunked port group (VLAN ID 4095) on a vSwitch with only one uplink. Two interfaces are configured:
vmx0 - WAN - vSwitch0
vmx1 - LAN - vSwitch2
My LAN consists of VLANs, and I use UniFi switches and APs which used to require an untagged/native management subnet (not anymore; that changed recently). That subnet uses VLAN ID 1, and I place all management interfaces of various services there (including ESXi).

        |vmx1_vlan1      |
vmx1----|vmx1_vlan11     |---port group id4095---vswitch2----Unifi switch port with tagged vlans
        |vmx1_vlanx+1... |                      |
                                                |
ESXi-------vmk0--------------port group id1-----|

And since Sensei required a parent interface to be monitored, I had created one with network port vmx1 - and this was the cause of all these problems. So I removed that parent interface and edited vlan1 to be just the LAN with network port vmx1 (it effectively became an untagged interface). Because of that I needed to change the settings of the ESXi port group and the UniFi switch port:

        |                |
vmx1----|vmx1_vlan11     |---port group id4095---vswitch2----Unifi switch port with native id1 and all other vlans tagged
        |vmx1_vlanx+1... |                      |
                                                |
ESXi-------vmk0--------------port group id0-----|

Once this was done, the ESXi interface stopped pumping out huge amounts of data, and I could upgrade OPNsense to 20.7.
BTW, isn't it strange that such an issue/loop could trigger a kernel panic? Shouldn't OPNsense be more "bulletproof"?
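In case anyone wants to verify a setup like this, a hedged way to spot the duplication on the parent interface (tcpdump assumed available; interface name taken from the diagrams above):

```shell
# Watch the parent interface for 802.1Q-tagged frames; the same management
# traffic showing up both tagged and untagged on vmx1 points at the kind of
# loop described above.
tcpdump -e -n -i vmx1 -c 20 vlan
```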