OPNsense Forum

English Forums => 24.7, 24.10 Production Series => Topic started by: blblblb on August 14, 2024, 01:37:45 AM

Title: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on August 14, 2024, 01:37:45 AM
I'm glad this is not on a production system yet. The host is a SYS-E300-9A-4C (A2SDi-4C-HLN4F motherboard).

Additional hardware: Chelsio T320, rest is standard for the model.

I could not get a capture of the kernel panic, but it happens immediately after importing the previous, known-good 24.1.10 configuration, right after "initializing.... done".

Tested from the live DVD, and also on a boot environment upgraded online from 24.1.10. The setup uses a LAGG of the SFP+ ports. Everything else is pretty much standard for any decent enterprise setup: WAN groups, IPsec, some OpenVPN clients, and quite a few VLANs.

Considering that Deciso's commercial offerings actually use the A2SDi platform, this is not great news. Chelsio T320s are also the most common/popular SFP+ NICs for FreeBSD hosts.

Sometimes I wish Deciso did not use us as guinea pigs for QA that should have been done in-house. No harm done on this one, but anyone else with a similar setup, beware. Make sure you create and activate a boot environment for the upgrade so you can revert if this hits you.
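A minimal sketch of the "boot environment first" advice, assuming a ZFS-based install where FreeBSD's bectl(8) is available. The function name make_rollback_be and the default name pre-24.7 are hypothetical, chosen only for illustration:

```sh
#!/bin/sh
# Sketch: create a boot environment (BE) of the running system before a
# major upgrade, so the old system can be booted again if the new kernel
# panics. Assumes a ZFS install with bectl(8); names are hypothetical.

make_rollback_be() {
    be_name="${1:-pre-24.7}"          # hypothetical default BE name
    # Clone the currently running boot environment under the new name.
    bectl create "$be_name" || return 1
    # List all BEs so the new entry can be checked before upgrading.
    bectl list
}
```

Usage: run `make_rollback_be pre-24.7`, then upgrade as usual. If the upgraded system fails to boot, the saved environment can be picked from the loader's boot environment menu (where the loader offers one), or reactivated from a working system with `bectl activate pre-24.7`.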

I hope the above comment is not taken personally (hi Franco). I'm just surprised this is the third time an upgrade has caused issues. Before boot environments were properly supported, it was a bigger deal.

TL;DR 24.1.10 to 24.7 = kernel panic on an A2SDi-4C-HLN4F system with QAT and a Chelsio NIC.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: netnut on August 14, 2024, 02:05:54 AM
Quote from: blblblb on August 14, 2024, 01:37:45 AM
...
I cannot/could not get a capture of the kernel panic, but it happens immediately after importing the previous 24.1.10 good configuration. After "initializing.... done".
...
Considering the fact that Deciso's commercial offerings actually use the A2SDi platform, this is not great news.

So there's no actual root cause, but you have "an issue". How is this related to the A2SDi platform?

I've been running an A2SDi for years here, on old-skool UFS, through over 10 major OPNsense upgrades: rock solid. It sounds like it's not only a kernel panicking here...
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on August 14, 2024, 03:39:34 AM
Would you mind leaving your personal/subjective assumptions and trolling attempts out of the thread? A root cause is less likely to elude *you* if you actually try to diagnose it, instead of derailing a thread for personal reasons (like picking arguments with strangers on the internet...).

Also, please enlighten us with that A2SDi you have "run for years". Sounds like BS. The A2SDi is not quite "years old", although it is far from new (it has not been completely superseded, just like the X10SDV line). What "10 major revisions"? There were breaking changes that make a smooth upgrade path impossible without reinstalls.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on August 14, 2024, 03:48:02 AM
For the developers and anyone who actually has interest in diagnosing the issue:

A quick look at the panic log (there seems to be a double fault, so kdb won't help) shows some stack frames related to cxgbc0 task queuing (i.e., the Chelsio driver).

I also tested on a production system with the same hardware (a redundancy spare kept in storage), also with a T320, and the trap kicks in there too. Again: double fault, then a loop, then a hard CPU reset.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: netnut on August 14, 2024, 04:43:03 AM
Quote from: blblblb on August 14, 2024, 01:37:45 AM
...
Considering the fact that Deciso's commercial offerings actually use the A2SDi platform, this is not great news. Chelsio T320s are also the most common/popular SFP+ NIC for FreeBSD hosts.
Sometimes I wish Deciso did not use us as guinea pigs for QA that should have been done in-house. No harm done on this one, but anyone else with a similar setup beware. Make sure you create and activate a boot env for the upgrade so you can revert if this hits you.

I replied as an OPNsense user and A2SDi owner in a public forum, on the assumption that this post was addressed to OPNsense users who own an A2SDi.

Quote
Also, please enlighten us with that A2SDi you have "run for years". Sounds like BS. The A2SDi is not quite "years old", although it is far from new (it has not been completely superseded, just like the X10SDV line). What "10 major revisions"? There were breaking changes that make a smooth upgrade path impossible without reinstalls.

March 2019, but never mind...
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on August 14, 2024, 05:16:54 AM
You are dodging the "issue". If you take an honest look at your entire post history in this forum, you might find a pattern. I would not call it a case study in social ineptitude, but it comes close.

You don't need to explain yourself or bring up your personal circumstances in the thread. That's the cliff notes for you.

So, moving on: if you have a Chelsio T320 and are actually curious to debug the problem, I can tell you how to configure it and replicate the BIOS settings.

I'm out of time for free QA today, but I did find some posts from other users hinting at kernel issues that needed to be ironed out and weren't. The Chelsio driver is one of the most stable NIC drivers in FreeBSD, written by a core developer. An out-of-bounds read (or perhaps a lock contention issue) in the driver indicates this is very likely an OPNsense mistake (though I haven't reviewed all the patches Deciso has cherry-picked from upstream).

It needs proper debugging.

cxgbc0@pci0:2:0:0: class=0x020000 rev=0x00 hdr=0x00 vendor=0x1425 device=0x0031 subvendor=0x1425 subdevice=0x0001
    vendor     = 'Chelsio Communications Inc'
    device     = 'T320 10GbE Dual Port Adapter'
    class      = network
    subclass   = ethernet
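For anyone wondering whether their box has the same card, a small sketch that filters `pciconf -lv` output for devices attached by the T3 driver, which shows up as cxgbcN as in the listing above. The function name find_cxgb_t3 is hypothetical, and it only matches when the driver has attached; an unclaimed card would appear as noneN with vendor=0x1425 (Chelsio) instead:

```sh
#!/bin/sh
# Sketch: read `pciconf -lv` output on stdin and print the header line of
# every device claimed by the cxgb (Chelsio T3) driver, e.g. "cxgbc0@pci0:...".
# Exit status is grep's: 0 if at least one such device was found, 1 if none.
find_cxgb_t3() {
    grep -E '^cxgbc[0-9]+@pci'
}
```

Usage: `pciconf -lv | find_cxgb_t3 && echo "Chelsio T3 card present"`.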


Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: newsense on August 14, 2024, 06:20:57 AM
Best to open a GH issue for this. Word on the street is they're taking insults really well up there and should you provide kernel debug logs from both OPNsense and a fresh install of FreeBSD 14.1 a written apology is (almost) guaranteed.


opnsense-update -zkr kernel-dbg-24.7.1


Should the fine gentleman have some more professional venting to impart to the lesser beings on these forums, please do not hesitate to open a new thread.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on August 14, 2024, 06:50:37 AM
I will need that panic. I can always look for clues and fixes in the FreeBSD code.

As far as OS upgrades go, I'm not sure what else to expect.

I think the reaction from the involved parties after the fact is much more important: who takes the time to look into it with you? Who will ship a fix and who will not? Who tells you they are the "bestest evar", and how does that match up with the two previous questions? Always take that into account when choosing a platform.


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on August 14, 2024, 07:48:05 AM
Franco, OPNsense is great, but you do have a habit of both releasing unstable major versions (at least where more complex environments are involved; I don't expect a basic KVM or ESXi setup to fall apart in some odd "homelab") and failing to commit resources to providing LTS-like (long-term stable, a la Debian) updates. This puts the burden of actual QA on your users, who become guinea pigs until all the issues are ironed out. More often than not, that time buffer creates problems of its own. This affected pfSense in the past too, although they have even less of an excuse than you do.

If you kept providing updates for the previous major version for a while, as some sort of LTS channel, this would literally be a non-issue. Making the stable prior major version EOL before 24.7 has had all its kinks ironed out is how you get a flood of posts from folks encountering problems.

This is not a personal attack, and it merits a response that does not trail off into ad hominems or attempts to shrug it off. It's also not grandstanding. You can do better with your DevOps approach as a business, let alone as a FOSS project.

I will see if I have time to get a serial console log from the person I'm helping out. Meanwhile, feel free to link or send an established diagnostic procedure she can follow. I'll do what I can.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on August 14, 2024, 07:52:39 AM
Quote from: newsense on August 14, 2024, 06:20:57 AM
opnsense-update -zkr kernel-dbg-24.7.1

This presumes a bootable environment, or are you suggesting running this from 24.1.10? I can set up a tunnel and use the BMC to get a working console, but like I said, there is a double fault at some point and a loop that makes kdb unusable. The debug symbols might help if present, but kdb won't be workable. It will be a few hours until I can do this, though.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on August 14, 2024, 08:04:15 AM
We can talk about everything, no problem. :)

So are we talking about FreeBSD or OPNsense core now? Doesn't the panic suggest FreeBSD? In your case I believe it's a question of FreeBSD 13.2 vs. 14.1.

I think I know what you are asking for WRT LTS, but we are deeply reliant on the update schedules and EoL policies of FreeBSD and other third-party software, which have always caused mayhem when you least want it.  ;)

I don't mind if you think this shouldn't count as a valid argument. And I also don't want to go into the details here. We try to make the best of it, but it isn't always easy.

As far as the community edition goes, it's a free option. It has a number of competitors, each with their own strengths and weaknesses. Use what works best, or consider paying for something even better.

However, I think in today's world you will always run into these issues eventually, regardless of vendor, paid or unpaid. You can mitigate this with official hardware from a number of vendors, for example.

And as far as working on such "QA" issues goes, it's impossible to cover all hardware and software scenarios. It's impossible to ask everyone to do the right thing up front. I work on such issues daily, and rarely is it OPNsense core code: IPv6, the FreeBSD kernel, OpenVPN, FreeBSD ports, pkg... just to name a few.


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: doktornotor on August 14, 2024, 08:58:56 AM
Hmmm, "any decent enterprise setup". Unless I've missed something, the business edition is still FreeBSD 13 based, so it does not suffer from any of these early regressions? But then again, some users apparently cannot wait till October (https://forum.opnsense.org/index.php?topic=41756.0).  ;D  :P
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: boom42 on August 27, 2024, 01:59:56 PM
I'm also getting crashes and reboot loops after upgrading to the current version of OPNsense (24.7) and then installing a Chelsio T320 dual-port NIC. I thought my old OPNsense installation was corrupted or something (it's almost 6 years old), so I figured I'd wipe the SSD and do a fresh install of OPNsense, but the OPNsense USB installer gave the same crash and boot loop.

I thought the problem was FreeBSD, so I tried a TrueNAS Core installation and then the latest pfsense installer (on USB) and they both booted and ran fine. Ubuntu 22.04 also ran fine.

I'd hate to have to go back to using pfSense just because of this issue. Are there any quick settings or fixes I could try to get this NIC working with the current version of OPNsense?


P.S. - The computer is a Xeon E3-1220 v2 3.1 GHz 4c/4t CPU, 8 GB RAM, 180 GB SATA SSD on a Supermicro X9SCM motherboard.


Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on August 27, 2024, 02:23:42 PM
It's not very likely an OPNsense issue. Testing TrueNAS and pfSense is fine, but these are fuzzy data points, as the breakage is somewhere within the driver or network framework, which is FreeBSD code on some branch or version after all.

If there is a relevant commit to fix it we just need to know which and apply it (or revert it). :)


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on September 09, 2024, 10:28:46 AM
I'm happy to revisit this and test if you have updated minor revision images for the installer (USB/ISO).

Have you checked how much your fork differs from upstream's sys/kernel? I don't think expecting users to cherry-pick commits (or go through your cherry-picking history) is a realistic approach.

How is it not an OPNsense issue if other FreeBSD based systems (on the same major version) function properly?

@boom42 How are you using the Chelsio NICs? Did you configure them, or does the panic/double fault happen regardless of whether they are in use? (Test with no ports connected, link down, and no configuration using them, e.g. no interface assigned.)

If you have a serial port or SOL/IPMI console text log, that would also be quite helpful to see whether we have the same stack trace (the function calls up to the point where the first fault occurs before the panic). It's very likely the same issue.
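For anyone unsure how to get such a log: a common FreeBSD approach is to mirror the console onto a serial port that a terminal or the BMC's SOL can capture. The variables below are standard FreeBSD loader.conf(5) settings; placing them in /boot/loader.conf.local, and which COM port the SOL maps to, are assumptions to verify for your install (OPNsense also exposes serial console options in its administration settings).

```
# /boot/loader.conf.local  (assumed location for local loader overrides)
boot_multicons="YES"             # print console output on several devices
boot_serial="YES"                # enable the serial console
console="comconsole,vidconsole"  # serial plus the normal video console
comconsole_speed="115200"        # must match the terminal / SOL settings
# comconsole_port="0x2F8"        # e.g. COM2, if that is where SOL listens
```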
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on September 09, 2024, 10:40:34 AM
> Have you checked how much your fork differs from upstream's sys/kernel? I don't think expecting users to cherry pick commits (or go through your cherry picking history) is a realistic approach.

I think you have that logic backwards.


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on September 09, 2024, 03:19:41 PM
I wasn't sure if I was reading gonzopancho or you Franco  ::)

No, really. Users can do free QA; are you also expecting them to be developers (kernel developers at that, which you are not either!) for the foundations of your commercial product? That is really all that usually happens with OPNsense: the free users are guinea pigs for the commercial offering. In this respect, pfSense is doing the same.

No need to be conceited or abrasive about it.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: doktornotor on September 09, 2024, 04:03:09 PM
Gonzo, is that you?

Let's clarify this - those Netgate devs with FreeBSD commit bit seem to use every FreeBSD user as a guinea pig for their experiments lately. Such as the recent "stateful ICMP" SNAFU.

BTW, "blb" means a dumba*s in Czech. Nomen omen, it seems.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: blblblb on September 09, 2024, 07:32:50 PM
Quote from: doktornotor on September 09, 2024, 04:03:09 PM
Gonzo, is that you?

Let's clarify this - those Netgate devs with FreeBSD commit bit seem to use every FreeBSD user as a guinea pig for their experiments lately. Such as the recent "stateful ICMP" SNAFU.

BTW, "blb" means a dumba*s in Czech. Nomen omen, it seems.

Anything productive to add to the discussion? Those "Netgate devs with commit privileges" were FreeBSD developers before working for Netgate. Judging by your post history and attitude, are you sure you are not talking about yourself? You seem to have a problem grasping a whole lot of things, and engage strangers online aggressively. Copium much?

Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: doktornotor on September 09, 2024, 07:35:43 PM
Quote from: blblblb on September 09, 2024, 07:32:50 PM
Anything productive to add to the discussion?

Nah, not really... GIGO.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: chemlud on September 09, 2024, 08:29:30 PM
pfsense fanboy trash currently on high level here :-D
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on September 09, 2024, 08:47:40 PM
Boy, that escalated quickly. Let's give blblblb an opportunity to breathe.


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: doktornotor on September 09, 2024, 08:52:38 PM
Well, I already pledged (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280701#c84) to do my best to avoid future interactions with the Netgate-affiliated folks. Will gladly extend that to this forum as well.  :-X
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on September 09, 2024, 08:54:07 PM
And to be fair I'll explain:

>> Have you checked how much your fork differs from upstream's sys/kernel? I don't think expecting users to cherry pick commits (or go through your cherry picking history) is a realistic approach.
>
> I think you have that logic backwards.

I said this because I asked for a panic capture but have received nothing so far.

If this is a problem introduced in FreeBSD 14.1, I will take my chances that the issue lies with FreeBSD 14.1 (vs. 13.2), not OPNsense 24.7 (vs. 24.1). We already have several points of reference for this stance in the short time we have been on FreeBSD 14.1, and our kernel is already more stable than FreeBSD 14.1 ever will be, although we did help bring some of these fixes to FreeBSD for later releases like 14.2 and 15.0.

And given a panic that could be caused by us we would assume a fix could be written... is that such a novel idea?


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: ScottD on October 30, 2024, 04:59:26 PM
I ran into the same panic and investigated further, with some help from Franco. I've raised a FreeBSD bug (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=282374) and proposed a patch to the cxgb driver to fix the issue. That fix has been accepted by the upstream maintainer and committed to FreeBSD 15.0, with a planned backport to 14.x.
As Franco suggested in OPNsense issue 224 (https://github.com/opnsense/src/issues/224), I've submitted a pull request to OPNsense (https://github.com/opnsense/src/pull/226) with the same fix, so we may see it sooner in OPNsense 24.7 than in FreeBSD 14.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on November 05, 2024, 01:59:12 PM
Yes, Scott's fix will be part of 24.7.8 tomorrow.


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: nd2420 on November 21, 2024, 05:12:40 PM
This is my first time posting, so my apologies if I'm not doing this correctly.

Is there a way to jump directly from 24.1.10 to 24.7.8 during the update/upgrade process (i.e. tell it to skip 24.7.1 and install 24.7.8)?

I cannot do the update normally: the 24.7.1 kernel that gets installed first panics during initialization and then boot-loops, because my system has a Chelsio card installed.

Sadly, I cannot pull out the Chelsio card, as the system is remote.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on November 21, 2024, 06:18:40 PM
Good question. Needs a little trickery but it's possible :)

# opnsense-update -ukr 24.7.8 -a FreeBSD:14:amd64 -A 24.7
# opnsense-update -ubp -A 24.1

(reboot)

(apply last batch of stable updates)
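The two commands above can be wrapped in a small script. This is a sketch for illustration only: the function name skip_to_fixed_kernel and the OPNSENSE_UPDATE override (handy for a dry run) are hypothetical, and the comments paraphrase Franco's explanation later in this thread rather than the opnsense-update manual.

```sh
#!/bin/sh
# Sketch of the two-step jump from 24.1.x straight to the 24.7.8 kernel.
# Run as root on the 24.1 system, then reboot and apply the remaining
# stable updates. Set OPNSENSE_UPDATE to another command for a dry run.

skip_to_fixed_kernel() {
    cmd="${OPNSENSE_UPDATE:-opnsense-update}"
    # Step 1: fetch the 24.7.8 kernel built for FreeBSD 14 (24.7 series).
    "$cmd" -ukr 24.7.8 -a FreeBSD:14:amd64 -A 24.7 || return 1
    # Step 2: -A is persistent, so switch back to the 24.1 view to fetch
    # base and packages along the regular 24.1 -> 24.7 upgrade path.
    "$cmd" -ubp -A 24.1
}
```

The function stops if the kernel step fails. Per Franco's reply further down, the base/packages step aborts safely when the kernel is missing anyway, but stopping early makes the failure obvious.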


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: nd2420 on November 22, 2024, 02:18:10 PM
opnsense-update -ukr -a FreeBSD:14:amd64 -A 24.7

When I run that command, the shell reports "usage: man opnsense-update".

This command did run:

opnsense-update -ubp -A 24.1

but when I reboot, it stays on 24.1.10_8.

--------------

Is there something I am missing?
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on November 22, 2024, 03:00:37 PM
Sorry, I've fixed the command. I tested it, but transcribing it from the console went wrong (I couldn't copy and paste).

> but when i reboot it stays on 24.1.10_8

That's expected: the kernel is missing, so it aborts without destroying the system :)


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: nd2420 on November 22, 2024, 03:56:27 PM
# opnsense-update -ukr 24.7.8 -a FreeBSD:14:amd64 -A 24.7
# opnsense-update -ubp -A 24.1

So the first command worked much better; I saw it download the 24.7.8 packages this time.

I then ran the second command without rebooting in between.

After the second command completed, I rebooted properly from the console menu.

------------

After the reboot, the main console screen showed it had booted into 24.7, not 24.7.8.

Is it normal that it is running the 24.7.8 kernel but still showing 24.7 until I do another round of updates?
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on November 22, 2024, 05:44:30 PM
> is it normal that it is running 24.7.8 kernel, but still showing 24.7 until i do another round of updates ?

Yes:

> (apply last batch of stable updates)


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: newsense on November 23, 2024, 08:29:08 AM
@nd2420

>>> # opnsense-update -ubp -A 24.1

I suspect this was a typo: 24.1 is nonsensical, should be 24.7 instead since you need to upgrade base and packages built for 24.7.x


Disregard, 24.1 is correct after all.
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: franco on November 23, 2024, 09:01:16 AM
Since -A is persistent, after grabbing the kernel with -A 24.7 we need to switch back to -A 24.1 to get the actual upgrade set link from 24.1's perspective, which may be newer than the one from -A 24.7 due to upgrade pinning (e.g. the packages go to 24.7.1, as far as I remember). It's not a day-to-day task. :)


Cheers,
Franco
Title: Re: Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)
Post by: nd247 on November 23, 2024, 03:45:13 PM
Oh, thank you so much, it worked!

No panics, and I'm now running 24.7.9_1.