Kernel crash when starting a VNC connection

Started by computeralex92, February 13, 2022, 02:15:11 PM

Hello,

To help my parents and others, I use the VNC Viewer from RealVNC, which connects via RealVNC to a VNC server on their computers.
After updating to 22.1, I can crash OPNsense by trying to connect to one of their computers.

Here is part of the crash log:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80eb0dfd
stack pointer         = 0x28:0xfffffe000e1c06c0
frame pointer         = 0x28:0xfffffe000e1c07e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_2)
trap number = 12
panic: page fault
cpuid = 2
time = 1644757931
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000e1c0480
vpanic() at vpanic+0x17f/frame 0xfffffe000e1c04d0
panic() at panic+0x43/frame 0xfffffe000e1c0530
trap_fatal() at trap_fatal+0x385/frame 0xfffffe000e1c0590
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000e1c05f0
calltrap() at calltrap+0x8/frame 0xfffffe000e1c05f0
--- trap 0xc, rip = 0xffffffff80eb0dfd, rsp = 0xfffffe000e1c06c0, rbp = 0xfffffe000e1c07e0 ---
ip6_forward() at ip6_forward+0x62d/frame 0xfffffe000e1c07e0
pf_refragment6() at pf_refragment6+0x164/frame 0xfffffe000e1c0830
pf_test6() at pf_test6+0xfdb/frame 0xfffffe000e1c09a0
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe000e1c09d0
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe000e1c0a10
ip6_tryforward() at ip6_tryforward+0x2ce/frame 0xfffffe000e1c0a90
ip6_input() at ip6_input+0x60f/frame 0xfffffe000e1c0b70
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe000e1c0bc0
ether_demux() at ether_demux+0x138/frame 0xfffffe000e1c0bf0
ether_nh_input() at ether_nh_input+0x355/frame 0xfffffe000e1c0c50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe000e1c0ca0
ether_input() at ether_input+0x69/frame 0xfffffe000e1c0d00
iflib_rxeof() at iflib_rxeof+0xc27/frame 0xfffffe000e1c0e00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000e1c0e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe000e1c0ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe000e1c0ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000e1c0f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000e1c0f30
--- trap 0x80388000, rip = 0xffffffff80c2bcff, rsp = 0, rbp = 0x6 ---
mi_startup() at mi_startup+0xdf/frame 0x6
KDB: enter: panic
FreeBSD 13.0-STABLE stable/22.1-n248053-232cb14f501 SMP


I reported all three crashes via the reporting tool.
What could be the source of these crashes?

Thanks,

Alex

Hi Alex,

Might be my fault... again. :)

Can you try adding a tunable "net.pf.share_forward6" with value "0", apply, and try again?
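If it's easier from a root shell, here is a minimal Python sketch of the same thing (untested; note that a plain sysctl write does not survive a reboot, so the tunable added in the GUI is what persists):

# Minimal sketch: read net.pf.share_forward6 and set it to 0 if needed.
# Run as root on the firewall. A runtime sysctl write is not persistent,
# so also add the tunable via the GUI to keep it across reboots.
import subprocess

OID = "net.pf.share_forward6"

# 'sysctl -n' prints only the value, without the OID name
current = subprocess.run(["sysctl", "-n", OID],
                         capture_output=True, text=True).stdout.strip()
print(f"{OID} = {current}")

if current != "0":
    # FreeBSD's sysctl sets values with name=value syntax
    subprocess.run(["sysctl", f"{OID}=0"], check=True)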


Cheers,
Franco

Hello Franco,

thanks for your reply.
Ah, failures happen. Checking the changelog for 22.1 (and all the versions before), I'm very glad to see how smoothly everything works.

I set the tunable and will test it as soon as possible, but that may take until tomorrow afternoon.
I'll keep you updated.

Alex


Hello Franco,

I was able to test it yesterday; after changing the tunable, everything works without any issues.

Thanks for the quick fix,
Alex

Not a fix, unfortunately. Will take a closer look. Might need your help reproducing.


Cheers,
Franco

Quote from: franco on February 15, 2022, 10:40:17 AM
Not a fix, unfortunately. Will take a closer look. Might need your help reproducing.

Sure, just reach out if I can help with anything.

Reporting back here... it seems I saw this one in December and took a guess:

https://github.com/opnsense/src/commit/1fca8e5780b58bdf99

Looks like this wasn't the real issue. Although our code is involved here a little, the actual error happens in a file we don't really change, and it seems this isn't even a route-to case to begin with, or something goes wrong while trying to treat it as such. I'll keep looking.


Cheers,
Franco

Hi Alex,

Small update: this only appears to happen when fragmentation kicks in. The kernel reassembles the packet to scan it before fragmenting it again, and the crash occurs while fragmenting it back onto the wire.

Maybe that can actually help with reproduction :)


Cheers,
Franco

Tried to reproduce this locally, but it doesn't want to crash, probably because it only affects transit traffic through the firewall. To be continued...
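For anyone who wants to help reproduce: a rough sketch using scapy (an assumption on my part, not a confirmed trigger) that generates pre-fragmented IPv6 traffic which pf has to reassemble, scan, and refragment. It needs to be sent from a host on one side of the firewall to a host on the other, since only transit traffic appears to be affected:

# Rough reproduction sketch (scapy required, run as root on a LAN host).
# Replace dst with a real IPv6 host beyond the firewall so the traffic
# transits it; 2001:db8::1 is only a documentation placeholder.
from scapy.all import IPv6, IPv6ExtHdrFragment, ICMPv6EchoRequest, fragment6, send

dst = "2001:db8::1"  # placeholder: an IPv6 host on the far side of the firewall

# Build an echo request larger than the MTU, then pre-fragment it so the
# firewall must reassemble it for pf scanning and refragment it on egress.
pkt = IPv6(dst=dst) / IPv6ExtHdrFragment() / ICMPv6EchoRequest(data=b"A" * 3000)
send(fragment6(pkt, 1280))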

I believe I can reproduce this as well, with 22.1.8. I haven't noticed a particular pattern, but it happens at least once per hour after updating to 22.1.x.

In case it matters, my deployment has two WAN interfaces and uses IPv6 Prefix Translation to map the LAN prefix to whichever WAN is active.


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x10
fault code      = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff80eb0b9d
stack pointer           = 0x28:0xfffffe00085a94b0
frame pointer           = 0x28:0xfffffe00085a95d0
code segment      = base 0x0, limit 0xfffff, type 0x1b
         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process      = 0 (if_io_tqg_1)
trap number      = 12
panic: page fault
cpuid = 1
time = 1653915865
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00085a9270
vpanic() at vpanic+0x17f/frame 0xfffffe00085a92c0
panic() at panic+0x43/frame 0xfffffe00085a9320
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00085a9380
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00085a93e0
calltrap() at calltrap+0x8/frame 0xfffffe00085a93e0
--- trap 0xc, rip = 0xffffffff80eb0b9d, rsp = 0xfffffe00085a94b0, rbp = 0xfffffe00085a95d0 ---
ip6_forward() at ip6_forward+0x62d/frame 0xfffffe00085a95d0
pf_refragment6() at pf_refragment6+0x164/frame 0xfffffe00085a9620
pf_test6() at pf_test6+0xfdb/frame 0xfffffe00085a9790
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe00085a97c0
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe00085a9800
ip6_tryforward() at ip6_tryforward+0x2ce/frame 0xfffffe00085a9880
ip6_input() at ip6_input+0x60f/frame 0xfffffe00085a9960
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00085a99b0
ether_demux() at ether_demux+0x138/frame 0xfffffe00085a99e0
ng_ether_rcv_upper() at ng_ether_rcv_upper+0x88/frame 0xfffffe00085a9a00
ng_apply_item() at ng_apply_item+0x2bd/frame 0xfffffe00085a9aa0
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe00085a9ae0
ng_apply_item() at ng_apply_item+0x2bd/frame 0xfffffe00085a9b80
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe00085a9bc0
ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe00085a9bf0
ether_nh_input() at ether_nh_input+0x1f1/frame 0xfffffe00085a9c50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00085a9ca0
ether_input() at ether_input+0x69/frame 0xfffffe00085a9d00
iflib_rxeof() at iflib_rxeof+0xc27/frame 0xfffffe00085a9e00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00085a9e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00085a9ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00085a9ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00085a9f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00085a9f30
--- trap 0, rip = 0xffffffff80c2b91f, rsp = 0, rbp = 0x3 ---
mi_startup() at mi_startup+0xdf/frame 0x3
KDB: enter: panic

Setting the tunable net.pf.share_forward6 to 0 does seem effective; the system has been stable for 24+ hours, where previously it would have panicked repeatedly.

I now see console messages that I have not seen before, occurring at roughly the frequency at which the system previously panicked:

cannot forward src fe80:1::201:5cff:fea2:8846, dst 2602:248:7b4a:ff60:54bb:8c4c:a0f7:dd1a, nxt 58, rcvif igb0, outif igb2
cannot forward src fe80:1::201:5cff:fea2:8846, dst 2602:248:7b4a:ff60:54bb:8c4c:a0f7:dd1a, nxt 58, rcvif igb0, outif igb2
cannot forward from fe80:4::28a1:40ff:fe31:e546 to fe80:4::5c91:f6ff:fedc:25b6 nxt 58 received on igb2
cannot forward from fe80:4::28a1:40ff:fe31:e546 to fe80:4::5c91:f6ff:fedc:25b6 nxt 58 received on igb2
cannot forward from fe80:4::bc21:c3ff:fea4:9bc8 to fe80:4::7c83:26ff:fe48:f5ba nxt 58 received on igb2
cannot forward from fe80:4::bc21:c3ff:fea4:9bc8 to fe80:4::7c83:26ff:fe48:f5ba nxt 58 received on igb2
cannot forward from fe80:4::bc21:c3ff:fea4:9bc8 to fe80:4::7c83:26ff:fe48:f5ba nxt 58 received on igb2
cannot forward from fe80:4::682b:b5ff:fedb:5a10 to fe80:4::409f:1fff:fe95:c6d1 nxt 58 received on igb2
cannot forward from fe80:4::682b:b5ff:fedb:5a10 to fe80:4::409f:1fff:fe95:c6d1 nxt 58 received on igb2
cannot forward src fe80:1::201:5cff:fea2:8846, dst 2602:248:7b4a:ff60:54bb:8c4c:a0f7:dd1a, nxt 58, rcvif igb0, outif igb2
cannot forward src fe80:1::201:5cff:fea2:8846, dst 2602:248:7b4a:ff60:54bb:8c4c:a0f7:dd1a, nxt 58, rcvif igb0, outif igb2
cannot forward from fe80:4::682b:b5ff:fedb:5a10 to fe80:4::409f:1fff:fe95:c6d1 nxt 58 received on igb2
cannot forward from fe80:4::682b:b5ff:fedb:5a10 to fe80:4::409f:1fff:fe95:c6d1 nxt 58 received on igb2
cannot forward from fe80:4::682b:b5ff:fedb:5a10 to fe80:4::409f:1fff:fe95:c6d1 nxt 58 received on igb2

igb0 is one of the WAN interfaces; igb2 is the LAN.

2602:248:7b4a:ff60:: is the prefix I use within the home, which comes from the *other* WAN interface igb1 (sonic.net). I use PAT on igb0 (Comcast) to rewrite 2602:248:7b4a:ff60: on outgoing packets to the Comcast-supplied IPv6 prefix. At least, I think I do: that packets are arriving from Comcast destined to 2602:248:7b4a:ff60:: seems unexpected, especially with a link-local source address.
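To pin down what those unexpected packets actually are, here is a small capture sketch, assuming scapy is available on the firewall (plain tcpdump on igb0 would show the same thing):

# Diagnostic sketch: sniff ICMPv6 (next header 58) packets with link-local
# sources arriving on igb0, matching the "cannot forward" messages above.
# Run as root on the firewall.
from scapy.all import sniff

sniff(iface="igb0", filter="icmp6 and src net fe80::/10",
      prn=lambda p: p.summary(), count=20)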


igb0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
   description: WAN
   options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
   ether __:__:__:__:__:__
   inet 24.4.201.__ netmask 0xfffffe00 broadcast 255.255.255.255
   inet6 fe80::a236:9fff:fe59:19b0%igb0 prefixlen 64 scopeid 0x1
   inet6 2001:558:6045:5c:54bb:8c4c:a0f7:dd1a prefixlen 128
   groups: AllWAN
   media: Ethernet autoselect (1000baseT <full-duplex>)
   status: active
   nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
igb1: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
   description: WAN2
   options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
   ether __:__:__:__:__:__
   inet 135.180.175.__ netmask 0xfffffc00 broadcast 135.180.175.255
   groups: AllWAN
   media: Ethernet autoselect (1000baseT <full-duplex>)
   status: active
   nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
igb2: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
   description: LAN
   options=4800028<VLAN_MTU,JUMBO_MTU,NOMAP>
   ether __:__:__:__:__:__
   inet 10.1.10.1 netmask 0xffffff00 broadcast 10.1.10.255
   inet6 fe80::a236:9fff:fe59:19b2%igb2 prefixlen 64 scopeid 0x4
   inet6 2602:248:7b4a:ff60::1 prefixlen 64
   media: Ethernet autoselect (1000baseT <full-duplex>)
   status: active
   nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
opt1_stf: flags=4041<UP,RUNNING,LINK2> metric 0 mtu 1280
   inet6 2602:248:7b4a:ff60:: prefixlen 28
   groups: stf
   v4net 135.180.175.__/0 -> tv4br 184.23.144.1
   nd6 options=103<PERFORMNUD,ACCEPT_RTADV,NO_DAD>

I added this issue to the list of things to do for the 22.7 release, since the impact on 22.1 was relatively small (judging by the number of reports). I still believe that while shared forwarding surfaces the error, it is the underlying fragmentation handling that is broken; you just don't see it without shared forwarding, because route-to makes the packet disappear before it ever reaches the refragmentation code.

If you could confirm whether the issue still exists on the 22.7.b kernel, that would be helpful; see https://forum.opnsense.org/index.php?topic=28505.0


Cheers,
Franco