OPNsense Forum

Archive => 23.1 Legacy Series => Topic started by: mr_penguin on June 05, 2023, 04:23:01 PM

Title: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 05, 2023, 04:23:01 PM
Hi,
I have been using OPNsense for several years now, and at some point in the last year or so I started to get random crashes.

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address   = 0x0
fault code      = supervisor read data, page not present

The stack trace always ends at pf_test_state_icmp().

I suspected hardware issues, so I bought a completely new system, installed OPNsense, and restored my config. Same issue. Seems to point to a software issue, but I can't figure out where to start looking.

I have a HA pair setup, with the backup instance on VMware. Notably, that one doesn't seem to have the crash problem.
The primary was a Qotom Q355-G4, and has been replaced with https://www.aliexpress.us/item/3256804355685285.html configured with 8GB RAM, Intel N6005. No hardware has been shared between them.

My plugins are:
os-acme-client
os-chrony
os-etpro-telemetry
os-mdns-repeater
os-smart
os-theme-vicuna
os-vnstat
os-wireguard

I have a pair of IPsec tunnels and a handful of WireGuard clients. I am using CARP on the WAN interface and on all of the internal interfaces. The interfaces are configured as LAGGs with only 1 member interface each (to provide failover compatibility with the VMware instance).

I have Hybrid Outbound NAT configured to set the CARP WAN address as the source for my internal networks.

No unusual rules, no policy-based routing. I used to have a Dual WAN setup, but no longer have dual WANs; that interface is disabled. I also used to have IPv6 configured, but no longer have IPv6 on my WAN. I have a he.net gif tunnel set up, but it is disabled.

The crashes happen randomly, no pattern whatsoever. Sometimes it's 12 hours, sometimes it's 2. I'm at a loss where to look. The pf_test_state_icmp() is the only clue I have so far. I have no rules referencing ICMP, all tunables are default. I cleared them just to be sure.
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 06, 2023, 12:54:47 PM
Do you have a full backtrace? Is this on latest 23.1.8/9?

And this is since 23.1 or earlier? 22.1 added FreeBSD 13, perhaps since then this was the case...


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 06, 2023, 03:56:54 PM
I can grab the full backtrace the next time it happens. I have been submitting bug reports as it happens. This is on the latest 23.1.9, and has been happening since at least the 22.1 series, possibly even longer.
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 06, 2023, 06:49:30 PM
Attached are 2 consecutive crash dumps, only minutes apart. At first glance, the stack traces are identical.
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 07, 2023, 09:56:12 AM
Thanks! Didn't know about previous submissions.

So this is indeed IPv4 traffic, and I couldn't find a relevant issue within FreeBSD. There are two possibilities here: the problem occurs in response to a TCP/UDP packet or to a clean ICMP ping, and I'm leaning towards the former. Not sure how to proceed.

A debug kernel and a core dump might be the best option here.


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 07, 2023, 07:37:08 PM
Sounds good to me. How do I get a debug kernel?
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 08, 2023, 12:31:55 PM
I have built one now but need to test real quick how to get to the core dump. BRB :)


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 08, 2023, 01:12:25 PM
So here is what to do:

1. Install debug kernel:

# opnsense-update -zkr dbg-23.1.8_5

2. Reboot to activate kernel.

3. Adjust action on panic after bootup:

# ddb script kdb.enter.default="bt; dump; reset"

You can test with the following to see that it picked it up:

# ddb scripts

4. Wait for panic. After a panic there will be a core file here:

# ls /var/crash/vmcore.[0-9]*

It's a mini dump of just over 200 MB. Perhaps you can send me a PM from where I can grab it.

Thanks in advance!


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 12, 2023, 09:28:21 AM
Hi,

I've been looking at the odd core dump... here is the null pointer dereference (this time UDP, not ICMP):

#17 0xffffffff8237ed0f in pf_test_state_udp (state=<optimized out>, state@entry=0xfffffe001099b828,
    direction=<optimized out>, kif=<optimized out>, kif@entry=0xfffff800245b3a00, m=m@entry=0xfffff801e9409800,
    off=20, h=<optimized out>, pd=pd@entry=0xfffffe001099b758) at /usr/src/sys/netpfil/pf/pf.c:5086
5086         if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], pd->af) ||
(kgdb) list
5081   
5082      /* translate source/destination address, if necessary */
5083      if ((*state)->key[PF_SK_WIRE] != (*state)->key[PF_SK_STACK]) {
5084         struct pf_state_key *nk = (*state)->key[pd->didx];
5085   
5086         if (PF_ANEQ(pd->src, &nk->addr[pd->sidx], pd->af) ||
5087             nk->port[pd->sidx] != uh->uh_sport)
5088            pf_change_ap(m, pd->src, &uh->uh_sport, pd->ip_sum,
5089                &uh->uh_sum, &nk->addr[pd->sidx],
5090                nk->port[pd->sidx], 1, pd->af);
(kgdb) p nk
$10 = (struct pf_state_key *) 0x0

> I have a HA pair setup

This caught my attention. Skimming through the upper frames, the state sync via pfsync seems to be incomplete: it has brought in a lot of dead pointers, which causes code paths that should always have valid data attached to fail.
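
For readers following the trace, here is a minimal sketch of the failure mode, assuming simplified stand-in types. Every `sketch_*` name and the `SK_*` constants are hypothetical and are not part of pf; the real `struct pf_state_key` is larger and the real comparison is the `PF_ANEQ` macro. This is not a proposed fix for pf.c, since the open question is why the synced state ends up with a NULL key in the first place. It only shows where the unchecked dereference sits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct pf_state_key (IPv4 only). */
struct sketch_state_key {
    uint32_t addr[2];   /* one address per side */
    uint16_t port[2];   /* one port per side */
};

/* Stand-ins for PF_SK_WIRE / PF_SK_STACK. */
enum { SK_WIRE = 0, SK_STACK = 1 };

/* Hypothetical stand-in for the state: two key pointers, as in the trace. */
struct sketch_state {
    struct sketch_state_key *key[2];
};

/*
 * Mirrors the shape of the check around pf.c:5083-5087 in the listing.
 * Returns 1 if the source address/port would need translating, 0 if not,
 * and -1 if the state is inconsistent (the selected key pointer is NULL).
 * The code in the listing has no equivalent of the -1 branch, so a NULL
 * nk is dereferenced and the kernel faults.
 */
int
sketch_needs_translation(const struct sketch_state *st, int didx, int sidx,
    uint32_t src, uint16_t sport)
{
    const struct sketch_state_key *nk;

    assert(st != NULL);

    /* Same key on both sides: nothing to translate. */
    if (st->key[SK_WIRE] == st->key[SK_STACK])
        return 0;

    nk = st->key[didx];
    if (nk == NULL)     /* the case seen in the core dump: p nk => 0x0 */
        return -1;

    return (nk->addr[sidx] != src || nk->port[sidx] != sport);
}
```

A state whose two keys differ but where one of them is NULL passes the first comparison and reaches the dereference, which matches the `p nk` output in the kgdb session.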

I know it's a lot to ask, but if you disable state sync, does the crashing stop?


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 12, 2023, 02:46:43 PM
Thanks for digging into this. I have disabled state sync on both nodes and will let you know the results.
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 13, 2023, 11:53:56 PM
Well it's hard to prove that a random crash has stopped but we went from multiple crashes a day to 36 hours and counting of uptime with the state sync disabled. It looks like you are onto something.
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 14, 2023, 08:39:51 AM
Ok, would you mind creating an issue over at https://bugs.freebsd.org for FreeBSD 13.1 and let me know which one you created? The pf/pfsync maintainer should take a look at this because I don't know what should be fixed.


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 23, 2023, 12:11:37 AM
Bug opened with FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272153
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 23, 2023, 08:27:31 AM
Thanks, I dropped another comment there. Let's see what happens.


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 23, 2023, 12:19:08 PM
Ok, this is going to take a long time... If you want to come along for the ride:

First make sure to update to 23.1.10 and then install the 13.2 debug kernel:

# opnsense-update -zkr dbg-13.2
# opnsense-shell reboot

Restart pfsync and wait for panic. I've modified the crash reporter code so that vmcore files are automatically being emitted when booted from a debug kernel.

If you don't have time for this I understand. I think the upstream policy here is more of a deterrent than anything else.


Cheers,
Franco
Title: Re: Random crashing with pf_test_state_icmp()
Post by: mr_penguin on June 23, 2023, 03:13:06 PM
Updated and new debug kernel installed. I'll PM you when I have a core dump to share.
Title: Re: Random crashing with pf_test_state_icmp()
Post by: franco on June 23, 2023, 03:19:45 PM
Thanks a lot!