PSA: PF regression in 24.7.10 kernel and fix

Started by newsense, December 04, 2024, 12:51:12 AM

My backtrace is slightly different if this helps.

db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100041 td 0xfffff80103950740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe00d3bf22b0
panic() at panic+0x43/frame 0xfffffe00d3bf2310
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00d3bf2370
trap_pfault() at trap_pfault+0x46/frame 0xfffffe00d3bf23c0
calltrap() at calltrap+0x8/frame 0xfffffe00d3bf23c0
--- trap 0xc, rip = 0xffffffff8109fa60, rsp = 0xfffffe00d3bf2490, rbp = 0xfffffe00d3bf2490 ---
memcmp() at memcmp+0x110/frame 0xfffffe00d3bf2490
pf_find_state() at pf_find_state+0xc0/frame 0xfffffe00d3bf24e0
pf_test_state_tcp() at pf_test_state_tcp+0x1c4/frame 0xfffffe00d3bf2650
pf_test() at pf_test+0x131e/frame 0xfffffe00d3bf2800
pf_check_in() at pf_check_in+0x27/frame 0xfffffe00d3bf2820
pfil_mbuf_in() at pfil_mbuf_in+0x38/frame 0xfffffe00d3bf2850
ip_tryforward() at ip_tryforward+0x17f/frame 0xfffffe00d3bf2910
ip_input() at ip_input+0x56c/frame 0xfffffe00d3bf2970
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe00d3bf29c0
ether_demux() at ether_demux+0x149/frame 0xfffffe00d3bf29f0
ng_ether_rcv_upper() at ng_ether_rcv_upper+0x8c/frame 0xfffffe00d3bf2a10
ng_apply_item() at ng_apply_item+0x13e/frame 0xfffffe00d3bf2ab0
ng_snd_item() at ng_snd_item+0x274/frame 0xfffffe00d3bf2af0
ng_apply_item() at ng_apply_item+0x13e/frame 0xfffffe00d3bf2b90
ng_snd_item() at ng_snd_item+0x274/frame 0xfffffe00d3bf2bd0
ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe00d3bf2c00
ether_nh_input() at ether_nh_input+0x1dc/frame 0xfffffe00d3bf2c60
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe00d3bf2cb0
ether_input() at ether_input+0x56/frame 0xfffffe00d3bf2d00
iflib_rxeof() at iflib_rxeof+0xc0e/frame 0xfffffe00d3bf2e00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00d3bf2e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe00d3bf2ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00d3bf2ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00d3bf2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00d3bf2f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

> My backtrace is slightly different if this helps.

Hmm, juicy... a memcmp()? Are you sure yours is fixed with the amended kernel? It might just keep happening if we assume the revert does nothing for your case.

If it keeps happening try the 24.7.8 kernel to see if that fixes it.


Cheers,
Franco

I reverted to the 24.7.8 kernel yesterday.  I have 3 panics from the original 24.7.10 kernel, 2 from pf_find_state() and 1 from pf_detach_state().

Could be related, got a GitHub issue with it now too: https://github.com/opnsense/src/issues/230

It would further point to the commit in question corrupting the states to set up a panic in another code path later on. That is if the new 24.7.10 kernel fixes this issue, too.


Cheers,
Franco

Updated to the latest kernel now, will report issues here.

After "opnsense-update -fk" uname -v shows me "stable/24.7-n267981-8375762712f"
With this kernel the system crashes too. 

After "opnsense-update -zkr 24.7.10-state" uname -v shows me "route_del_fix-n267981-8375762712f".
This kernel runs stable. I'm using the default mirror.

Quote from: martin87 on December 04, 2024, 06:57:36 PM
After "opnsense-update -fk" uname -v shows me "stable/24.7-n267981-8375762712f"
With this kernel the system crashes too. 

After "opnsense-update -zkr 24.7.10-state" uname -v shows me "route_del_fix-n267981-8375762712f".
This kernel runs stable. I'm using the default mirror.

I highlighted the commit hashes (8375762712f in both version strings) to emphasise that the builds are in fact the same.


Cheers,
Franco
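
For anyone who wants to check this themselves, here is a minimal sketch that compares the trailing commit hash of two build strings. It assumes the "<branch>-n<count>-<hash>" layout of the strings quoted in this thread and works on those strings as pasted, not on raw `uname -v` output. `build_hash` is a hypothetical helper name, not an OPNsense tool.

```shell
# Hypothetical helper: print the commit hash at the end of a build string.
# Assumes the "<branch>-n<count>-<hash>" layout seen in this thread.
build_hash() {
    # Strip everything up to and including the last "-".
    printf '%s\n' "${1##*-}"
}

a="stable/24.7-n267981-8375762712f"
b="route_del_fix-n267981-8375762712f"

if [ "$(build_hash "$a")" = "$(build_hash "$b")" ]; then
    echo "same commit: $(build_hash "$a")"
else
    echo "different commits"
fi
```

Only the branch label differs between the two strings, so the comparison reports the same commit.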

Quote from: franco on December 04, 2024, 07:07:48 PM
I highlighted the commit hashes to emphasise that the builds are in fact the same.


Cheers,
Franco

OK, thank you, but with the "stable/24.7-n267981-8375762712f" kernel it crashes with:

--- trap 0xc, rip = 0xffffffff80f6be42, rsp = 0xfffffe00b2899c40, rbp = 0xfffffe00b2899c50 ---
vm_radix_lookup_unlocked() at vm_radix_lookup_unlocked+0x62/frame 0xfffffe00b2899c50
vm_fault() at vm_fault+0x85d/frame 0xfffffe00b2899d70
vm_fault_trap() at vm_fault_trap+0x4d/frame 0xfffffe00b2899dc0
trap_pfault() at trap_pfault+0x1be/frame 0xfffffe00b2899e10
trap() at trap+0x4ab/frame 0xfffffe00b2899f30
calltrap() at calltrap+0x8/frame 0xfffffe00b2899f30
--- trap 0xc, rip = 0x82213a86e, rsp = 0x820f06990, rbp = 0x820f069a0 ---

Hmm, that's not the panic we're looking for (Star Wars giggles)

In all seriousness it may be coincidental. Is this recurring or a one off?

The panic occurred about 1 hour after the update to "stable/24.7-n267981-8375762712f".
After that I switched to "route_del_fix-n267981-8375762712f", which is stable for me. I tried the stable/24.7 kernel one more time, but after about 1 hour it crashed again.

Yesterday with "stable/24.7-n267979-0d692990122" I got this:

--- trap 0x9, rip = 0xffffffff80d053f7, rsp = 0xfffffe00b2e4c690, rbp = 0xfffffe00b2e4c6b0 ---
rn_walktree() at rn_walktree+0x77/frame 0xfffffe00b2e4c6b0
pfr_get_addrs() at pfr_get_addrs+0x122/frame 0xfffffe00b2e4c710
pfioctl() at pfioctl+0x221e/frame 0xfffffe00b2e4cbf0
devfs_ioctl() at devfs_ioctl+0xcb/frame 0xfffffe00b2e4cc40
vn_ioctl() at vn_ioctl+0xce/frame 0xfffffe00b2e4ccb0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00b2e4ccd0
kern_ioctl() at kern_ioctl+0x255/frame 0xfffffe00b2e4cd40
sys_ioctl() at sys_ioctl+0xff/frame 0xfffffe00b2e4ce00
amd64_syscall() at amd64_syscall+0x100/frame 0xfffffe00b2e4cf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00b2e4cf30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0xc46b237d5fa, rsp = 0xc46af1d9bf8, rbp = 0xc46af1da090 ---

> rn_walktree() at rn_walktree+0x77/frame 0xfffffe00b2e4c6b0
> pfr_get_addrs() at pfr_get_addrs+0x122/frame 0xfffffe00b2e4c710

I'm aware of this panic. Apparently some pfctl invocation is causing it, but we have no further debug info.

It's certainly interesting that a significant number of panics point to pf(4) code.


Cheers,
Franco
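
When comparing panics across reports, it can help to pull out just the function that faulted, i.e. the first frame printed after each "--- trap" line. The following is a rough sketch that assumes the ddb backtrace layout pasted in this thread. `faulting_frame` is a hypothetical helper name, and the input is read from stdin.

```shell
# Hypothetical helper: print the function name of the first frame that
# follows each "--- trap" line in a ddb backtrace read from stdin.
faulting_frame() {
    awk '
        /^--- trap/ { want = 1; next }   # the next line is the faulting frame
        want {
            sub(/\(\).*/, "", $1)        # "rn_walktree()" -> "rn_walktree"
            print $1
            want = 0
        }
    '
}

faulting_frame <<'EOF'
--- trap 0x9, rip = 0xffffffff80d053f7, rsp = 0xfffffe00b2e4c690, rbp = 0xfffffe00b2e4c6b0 ---
rn_walktree() at rn_walktree+0x77/frame 0xfffffe00b2e4c6b0
pfr_get_addrs() at pfr_get_addrs+0x122/frame 0xfffffe00b2e4c710
EOF
```

For the trace above this prints rn_walktree. The caller one frame up (pfr_get_addrs) is what ties it to pf, so the reported frame is a starting point rather than a diagnosis.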

I sent the crash report via GUI yesterday. Or should I post it here?

Now it would be interesting to know why the system crashes with "stable/24.7-n267981-8375762712f" and not with "route_del_fix-n267981-8375762712f". Otherwise I'll just do a clean install tomorrow.

No need to reinstall, nothing to gain doing that.

If none of the .10 kernels work for you simply go back to .8 until everything is sorted out.


# opnsense-update -kr 24.7.8

# opnsense-shell reboot

Don't know, it seems circumstantial. We have a report on the business edition which means this issue is inherent to 24.7.6 or 24.7.8 kernels anyway.


Cheers,
Franco

Quote from: newsense on December 04, 2024, 08:30:35 PM
No need to reinstall, nothing to gain doing that.

If none of the .10 kernels work for you simply go back to .8 until everything is sorted out.


# opnsense-update -kr 24.7.8

# opnsense-shell reboot


Ok, then I'll wait until everything is sorted out. The "route_del_fix-n267981-8375762712f" kernel works stable.