Upgrading 24.1.10 to 24.7: kernel panic and reset (SYS-E300-9A-4C test setup)

Started by blblblb, August 14, 2024, 01:37:45 AM

I'm glad this is not on a production system yet. The host is a SYS-E300-9A-4C (A2SDi-4C-HLN4F motherboard).

Additional hardware: Chelsio T320, rest is standard for the model.

I could not get a capture of the kernel panic, but it happens immediately after importing the previous known-good 24.1.10 configuration, right after "initializing.... done".

Tested from Live DVD, and also on a boot environment upgraded from 24.1.10 online. Using a LAGG of the SFP+ ports. Everything else is pretty much standard for any decent enterprise setup. WAN groups, IPSec, some client OVPN, and quite a few VLANs.

Considering the fact that Deciso's commercial offerings actually use the A2SDi platform, this is not great news. Chelsio T320s are also the most common/popular SFP+ NIC for FreeBSD hosts.

Sometimes I wish Deciso did not use us as guinea pigs for QA that should have been done in-house. No harm done on this one, but anyone else with a similar setup beware. Make sure you create and activate a boot env for the upgrade so you can revert if this hits you.
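For anyone unfamiliar with boot environments, the precaution above could look roughly like this. It is a minimal sketch, assuming a ZFS-root install where bectl(8) is available; the environment name and the `step` dry-run helper are made up for illustration. As written it only prints the commands, and runs them only when RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: assumes a ZFS-root install with bectl(8) available.
# BE_NAME and the step() helper are hypothetical, not part of OPNsense.
BE_NAME="${BE_NAME:-upgrade-24.7}"

# Print each command; execute it only when RUN=1 is set.
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

step bectl create "$BE_NAME"    # clone the running 24.1.10 system
step bectl activate "$BE_NAME"  # boot the clone next, leaving the original untouched
step bectl list                 # confirm both environments are present
```

Reboot into the new environment and upgrade there. If it panics, boot the original environment again, either from the loader's boot environment menu where one is shown or with `bectl activate` from a working shell.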

I hope the above comment is not taken personally (hi Franco). I'm just surprised this is the third time an upgrade has caused issues. Before boot environments were properly supported, it was a bigger deal.

TL;DR 24.1.10 to 24.7 = kernel panic on an A2SDi-4C-HLN4F system with QAT and a Chelsio NIC.

Quote from: blblblb on August 14, 2024, 01:37:45 AM
...
I could not get a capture of the kernel panic, but it happens immediately after importing the previous known-good 24.1.10 configuration, right after "initializing.... done".
...
Considering the fact that Deciso's commercial offerings actually use the A2SDi platform, this is not great news.

So there's no actual root cause, but you have "an issue". How is this related to the A2SDi platform?

Running an A2SDi for years here, running old-skool UFS, over 10 major OPNsense upgrades, rock solid. It sounds like it's not only a kernel panicking here...

Would you mind leaving your personal/subjective assumptions and trolling attempts out of the thread? A root cause is less likely to elude *you* if you are actually trying to diagnose it, instead of derailing a thread for personal reasons (like picking arguments with strangers on the internet...).

Also, please enlighten us with that A2SDi you have "run for years". Sounds like BS. The A2SDi is not quite "years old", although it is far from new (it has not been completely superseded, just like the X10SDV line). What "10 major revisions"? There were breaking changes that make that impossible as a smooth upgrade path without reinstalls.

For the developers and anyone who actually has interest in diagnosing the issue:

A quick look at the panic log (there seems to be a double fault so kdb won't help) shows some stack frames that are related to the cxgbc0 task queuing (so, Chelsio driver).

I also tested on a production system with the same hardware (redundancy spare kept in storage), also with a T320, and the trap also kicks in. Again, double fault, then a loop, then a hard CPU reset.

Quote from: blblblb on August 14, 2024, 01:37:45 AM
...
Considering the fact that Deciso's commercial offerings actually use the A2SDi platform, this is not great news. Chelsio T320s are also the most common/popular SFP+ NIC for FreeBSD hosts.
Sometimes I wish Deciso did not use us as guinea pigs for QA that should have been done in-house. No harm done on this one, but anyone else with a similar setup beware. Make sure you create and activate a boot env for the upgrade so you can revert if this hits you.

I replied as an OPNsense user and A2SDi owner in a public forum, on the assumption that this post is addressed to OPNsense users who own an A2SDi.

Quote
Also, please enlighten us with that A2SDi you have "run for years". Sounds like BS. The A2SDi is not quite "years old", although it is far from new (it has not been completely superseded, just like the X10SDV line). What "10 major revisions"? There were breaking changes that make that impossible as a smooth upgrade path without reinstalls.

March 2019, but never mind...

You are dodging the "issue". If you take an honest look at your entire post history in this forum, you might find a pattern. I would not call it a case study in social ineptitude, but it comes close to it.

You don't need to explain yourself or bring up your personal circumstances in the thread. That's the cliff notes for you.

So, moving on and forward, if you have a Chelsio T320 and actually are curious to debug the problem, I can tell you how to configure it and replicate the BIOS settings.

I'm out of time for free QA today, but I did find some posts from other users that might hint at kernel issues that needed to be ironed out and were not. The Chelsio driver is one of the most stable NIC drivers in FreeBSD, written by a core developer. An out-of-bounds read (or perhaps a lock contention issue) in the driver indicates this is very likely an OPNsense mistake (I say this without having reviewed all the cherry-picked patches Deciso has taken from upstream).

It needs proper debugging.

cxgbc0@pci0:2:0:0: class=0x020000 rev=0x00 hdr=0x00 vendor=0x1425 device=0x0031 subvendor=0x1425 subdevice=0x0001
    vendor     = 'Chelsio Communications Inc'
    device     = 'T320 10GbE Dual Port Adapter'
    class      = network
    subclass   = ethernet
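
To check whether another box carries the same adapter, the IDs in the output above (vendor 0x1425, device 0x0031) can be matched against `pciconf -l` lines. A minimal sketch, assuming the field order shown above; the `is_t320` helper name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a `pciconf -l` line describes a Chelsio T320.
# The IDs are taken from the output quoted above. The surrounding spaces keep
# the pattern from matching the subvendor=/subdevice= fields instead.
is_t320() {
    case "$1" in
        *" vendor=0x1425 device=0x0031 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live host the lines would come from pciconf itself, for example:
#   pciconf -l | while read -r line; do is_t320 "$line" && echo "$line"; done
sample='cxgbc0@pci0:2:0:0: class=0x020000 rev=0x00 hdr=0x00 vendor=0x1425 device=0x0031 subvendor=0x1425 subdevice=0x0001'
if is_t320 "$sample"; then
    echo "T320 found: ${sample%%@*}"
fi
```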



Best to open a GH issue for this. Word on the street is they're taking insults really well up there and should you provide kernel debug logs from both OPNsense and a fresh install of FreeBSD 14.1 a written apology is (almost) guaranteed.


opnsense-update -zkr kernel-dbg-24.7.1


Should the fine gentleman have some more professional venting to impart to the lesser beings on these forums, please do not hesitate to open a new thread.

I will need that panic. Can always look for clues and fixes in FreeBSD code.

As far as any OS upgrades go I'm not sure what else to expect.

I think the reaction from involved parties after the fact is much more important: who provides you with time to look into it? Who will ship a fix and who will not? Who tells you they are the "bestest evar", and how does that match up with the two questions before? Always take that into account when choosing a platform.


Cheers,
Franco

Franco, OPNsense is great, but you do have a habit of both releasing unstable major versions (as far as more complex environments are concerned; I don't expect a basic KVM or ESXi setup to fall apart in some odd "homelab") and failing to commit resources to providing LTS-like (long-term stable, à la Debian) updates. This creates a burden on your users to produce actual QA, as they become guinea pigs until all the issues are ironed out. More often than not, that time buffer creates problems of its own. This affected pfSense in the past too, although they have even less of an excuse than you would.

If you provided a buffer of time with updates for the previous major versions as some sort of LTS channel, this would be literally a non-issue. Making the stable prior major revision EOL before 24.7 has all the kinks ironed out is how you get a flood of posts from folks encountering problems.

This is not a personal attack, and merits a response that does not trail along ad hominems or attempts to shrug it off. It's also not grandstanding. You can do better with your devops approach as a business, let alone as a FOSS project.

I will see if I have time to get a serial console log from the person I'm helping out. Feel free to link or send an established diagnostics procedure she can follow in the meantime. I'll do what I can.

Quote from: newsense on August 14, 2024, 06:20:57 AM
opnsense-update -zkr kernel-dbg-24.7.1

This presumes a bootable environment, or are you suggesting running this from 24.1.10? I can set up a tunnel and use the BMC to get a working console, but like I said, there is a double fault at some point and a loop that makes kdb unusable. The debug symbols might help if present, but kdb won't be workable. It will be a few hours until I can do this, though.

We can talk about everything, no problem. :)

So are we talking about FreeBSD or OPNsense core now? The panic would suggest FreeBSD? It's a question between FreeBSD 13.2 and 14.1 in your case I believe.

I think I know what you are asking for WRT LTS, but we are deeply reliant on the update schedules and EoL policies of FreeBSD and other third-party software, which have always caused mayhem when you least want it.  ;)

I don't mind if you think this should not count as a valid argument. And I also don't want to go into the details here. We try to make the best of it, but it isn't always easy.

As far as community edition goes it's a free option. It has a number of competitors with their strengths and weaknesses. Use what works best or consider paying for something even better.

However, I think in today's world you will always run into these issues eventually, regardless of vendor, paid or unpaid. You can mitigate with official hardware for a bunch of vendors, for example.

And as far as working on such "QA" issues goes, it's impossible to cover all hardware and software scenarios. It's impossible to ask everyone to do the right thing up front. I work on such issues daily, and it is rarely OPNsense core code. IPv6, FreeBSD kernel, OpenVPN, FreeBSD ports, pkg... just to name a few.


Cheers,
Franco

Hmmm, "any decent enterprise setup". Unless I've missed something, the business edition is still FreeBSD 13 based, so it does not suffer from any of these early regressions? But then again, some users apparently cannot wait till October.  ;D  :P

I'm also getting crashes and reboot loops after upgrading to the current version of OPNsense (24.7) and then installing a Chelsio T320 dual-port NIC. I thought my old OPNsense installation was corrupted or something (it's almost 6 years old), so I figured I'd wipe the SSD and do a fresh install, but the OPNsense USB installer gave the same crash and boot loop.

I thought the problem was FreeBSD, so I tried a TrueNAS Core installation and then the latest pfSense installer (on USB), and they both booted and ran fine. Ubuntu 22.04 also ran fine.

I'd hate to have to go back to using pfSense just because of this issue. Are there any quick settings or fixes I could try to get this NIC working with the current version of OPNsense?


P.S. - Computer is a Xeon E3-1220 v2 3.1 GHz 4c/4t CPU, 8 GB RAM, 180 GB SATA SSD on a Supermicro X9SCM motherboard



It's not that likely to be an OPNsense issue. Testing TrueNAS and pfSense is fine, but these are fuzzy data points, as the breakage is somewhere within the driver or network framework, which is FreeBSD code on some branch or version after all.

If there is a relevant commit to fix it we just need to know which and apply it (or revert it). :)


Cheers,
Franco

I'm happy to revisit this and test if you have updated minor revision images for the installer (USB/ISO).

Have you checked how much your fork differs from upstream's sys/kernel? I don't think expecting users to cherry pick commits (or go through your cherry picking history) is a realistic approach.

How is it not an OPNsense issue if other FreeBSD based systems (on the same major version) function properly?

@boom42 How are you using the Chelsio NICs? Did you configure them, or does the panic/double fault happen regardless of whether they are in use? (Test with no ports connected, link down, and no configuration using them, e.g. no interface assigned.)

If you have a serial port or SOL/IPMI console text log that would also be quite helpful to see if we have the same stack trace (the calls to functions up to the point where the first fault occurs before the panic). It's very likely the same issue.