Kernel panic after upgrade

Started by tamer, February 01, 2019, 09:51:22 PM

It's a start. Thank you for listening to what I said.


Cheers,
Franco

Quote from: RGijsen on March 06, 2019, 11:23:59 AM
For shits and giggles I created a Hyper-V gen1 VM, installed 19.1 and updated to the latest-as-of-yet 19.1.2, ran fine under gen1. Mounted the disk under a Gen2, and *poof*, still crash. So no, 19.1.2 didn't fix it, although we would have already known that.

While I totally understand the limited resources of the team (all respect for them!), it's getting hard for us to rely on this given that 18.x is now EOL (i.e. not secure in my book) while 19.x doesn't run at all.

Microsoft's documentation shows that FreeBSD isn't supported in gen2: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/should-i-create-a-generation-1-or-2-virtual-machine-in-hyper-v#BKMK_FreeBSD

However, the documentation linked to above shows the 10.x line, not the 11.2 version that OPNsense is on.

There's also this document: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/should-i-create-a-generation-1-or-2-virtual-machine-in-hyper-v#use-uefi-firmware

I've started work on debugging Hyper-V regressions. My employer lent me some hardware to test on. I've got it for a period of one to three weeks. I hope to report back soon with results.

Please check this updated document, yours is from 2016: https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/supported-freebsd-virtual-machines-on-hyper-v
It states gen2 is supported, as long as you disable Secure Boot. That matches what I find with other FreeBSD installs on Hyper-V as well.

Franco,

as someone who has done a lot of support for the stuff he designed and implemented, I can understand your frustration when a lot of people start screaming "it's broken, fix it!" without constructive content. On the other hand, while the community can help solve this issue (by providing test resources, reports, patches, or even just the time to test various builds on the hardware we own), this is such a fundamental problem with the end product's functionality that I think you should take the lead and help us help you, even if it turns out that the fix will be found not in OPNsense but in the underlying OS.

There are multiple threads about kernel crashes in the forum, and they are a mix of general discussion about the issue, pure complaints and reports from people trying various things. Maybe you should close these and start a new pinned thread, summarising what you know and think of the issue so far in the opening post, and asking people who are unable to boot 19.1 to list hardware/configuration combinations so we can work towards reproduction.

Hi bitwolf,

You are certainly right.

What I can tell you is I cannot take the lead on this without jeopardising my day job, my private life and areas of the OPNsense project that do work without worrying about them from a user perspective. This is where my frustration comes from.

It's too much to ask for me personally.


Cheers,
Franco

Never fear, for lattera is here!

I'm at least looking into the Hyper-V regression(s). I, too, am doing this in my spare time, but it's worth it. :)

Quote from: RGijsen on March 06, 2019, 04:14:54 PM
Honestly, I don't care how you feel about me. If you've made your image of me based on the 14 (count 'em!) posts I've made so far, that tells me more about you than about me.

As stated, I tried to be constructive, as do other people. But hey, we don't click. So whatever I offer is probably not any good. Too bad, I can live with that. I can't contribute code-wise if it's <> .NET or gwbasic. But if my car's engine blows out, I can't fix it either. That doesn't mean I shouldn't drive one.

Anyway, it seems I can't help at all fixing this issue. Sorry community, my bad, I tried.

All good points, and I do see that you still try to stay matter-of-fact.

My advice to franco is not to blame the users for being upset; the serious people do not blame you, but the fact that their route to the WWW has been cut, and, as you know, that's a real show stopper today  ;)

What actually amazes me is your statement about whether this can all be fixed in OPNsense. That sounds really frustrated and makes me doubt whether you still want to continue the project.

If that's how it is, please tell us, because, come on, not only you but we as users also do not want to waste our time if it is useless.

Ok, let's all just settle down. Grab a beer or a Crown Royal or both and sit back...relax.

First, I started with m0n0wall, then was going to develop something with pf (because I liked it better), then found pfSense. It was great, but its development seemed to go downhill after the main developer left and pfSense was purchased. Then I found OPNsense (via the m0n0wall website) and I will not go back to pfSense... there's just a lot more activity here from a development standpoint, and it filled in a lot of holes.

Being a Software Developer, SysOps, Network Engineer, DB Admin and/or 'what other IT people don't want to do', I can understand BOTH sides of the fence, but let's work together. Communication on both sides is key to resolving any issue; venting helps only one party and usually fuels the other (look at this thread).

I came upon this issue due to reading about how BIOS was going to be dead in the 2k20's and EFI was the way to go...so here I am because of a change I made in December to start using EFI.

I am using Proxmox VE 5.3-11 or KVM/QEMU for the most part. I have not used any bare metal OPNsense yet.

We know that 18.7.x used FreeBSD 11.1 and 19.1.x is using FreeBSD 11.2 both with HardenedBSD. We know that using it with a BIOS works and EFI does not. So, is there something WE all can do to provide back traces or more information on the boot up process that might signal a "Well there it is!!!".
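On the "what can we do" front, one concrete thing is enabling kernel crash dumps so a panic leaves a vmcore behind. A minimal configuration sketch for a FreeBSD/HardenedBSD-based system follows (run as root; it assumes a swap device large enough to hold the dump, and note that an early-boot trap like the ones reported here may panic before dumping is even set up, in which case a serial console capture of the trap message is the fallback):

```shell
# Sketch: enable kernel crash dumps on a FreeBSD/HardenedBSD-based system.
# Assumes a swap device large enough to hold the dump; run as root.
sysrc dumpdev="AUTO"        # dump to the configured swap device on panic
sysrc dumpdir="/var/crash"  # where savecore(8) extracts dumps at boot

# After the next panic and reboot, savecore stores the dump automatically;
# a backtrace can then be pulled out of the first saved dump with kgdb:
kgdb /boot/kernel/kernel /var/crash/vmcore.0
```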

We know the EFI files changed in 19.1.x (the size changed), so does something on the EFI partition need to change for the two to play nice? The ISO image also does the exact same thing when just installing a fresh VM with EFI, so it's not only an upgrade problem.

I wish I knew more about the EFI system so I could be of more help, but I am willing to provide data to help us all get through this.

Kev a.k.a. The Grand Wazoo
I don't usually make changes, but when I do it is in production...stay on call my friends.


I have just done some tests on our lab DELL Poweredge R340; it's not currently available for me to use, but as long as I don't touch the HDDs, and do it out of hours, I can reboot it as many times as I want, at least for now.

Given the constraint above I focussed on trying to narrow down the conditions for the kernel trap using just a bootable iso.

Franco says the problem is upstream, so instead of booting the OPNsense iso (which for some reason takes half an hour to get to the point of the crash when mounted as virtual ISO via the iDRAC) I used unmodified OS ISOs.

Here are the results so far:

UEFI ENABLED
HardenedBSD-11-STABLE-v1100056.13-amd64-bootonly.iso
doesn't even manage to boot from the iso

UEFI DISABLED
HardenedBSD-11-STABLE-v1100056.13-amd64-bootonly.iso
kernel trap 12

FreeBSD-11.1-RELEASE-amd64-bootonly.iso
boots all the way to the installer

HardenedBSD-12-STABLE-v1200058.3-amd64-bootonly.iso
boots all the way to the installer

I then tried to disable virtualization support in the CPU based on some comments about Meltdown/Spectre fixes in the other thread, but I still get kernel trap 12, so that's not a viable workaround.

So at least in the case of Dell bare metal it seems that UEFI is not the culprit, as disabling it doesn't stop the kernel traps. It also doesn't seem to be a FreeBSD 11 problem, as the vanilla FBSD iso works. This leaves changes between FreeBSD 11 and HardenedBSD 11 as the most likely cause of the kernel trap, but looking at the repo it seems like the classic needle in a haystack. The interesting result from this testing is that HardenedBSD 12 works, so maybe an easier investigation path would be to look at the changes between HBSD 11 and 12 that are not merged from FBSD? Shawn, what do you think?

Another option to collect more data could be to have a 19.1 debug iso (i.e. one with DDB enabled in the kernel) so we can actually collect core dumps for these crashes. I am sure that, given enough time, many of us, me included, could set up a HBSD dev environment and build the image ourselves, but if this can be a useful investigation avenue it seems better if one of the lead devs could just run the existing build workflow with the kernel option set.
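For reference, a debug kernel along those lines would only need a few extra lines in the kernel configuration. These option names come from stock FreeBSD's NOTES; how they get wired into the OPNsense/HBSD build workflow is up to the devs:

```
options     KDB          # kernel debugger framework
options     DDB          # in-kernel interactive debugger; prints a backtrace on panic
options     GDB          # remote kernel debugging over serial with kgdb
makeoptions DEBUG=-g     # build the kernel with debug symbols
```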

I see further up the thread that a number of people complained about the same crashes on ESXi; our own production firewalls run on ESXi 6 and upgraded to 19.1 successfully. I can do some tests along those lines tomorrow, as this seems to imply there might be a simple workaround in the VM settings for the people running ESXi. It could also be a way forward for the people who get kernel traps on overspecced bare metal, at least until the upstream issue is fixed or OPNsense has moved to HBSD 12 (but that's at least a year away).

Quote from: bitwolf on March 07, 2019, 12:21:37 AM
I have just done some tests on our lab DELL Poweredge R340; it's not currently available for me to use, but as long as I don't touch the HDDs, and do it out of hours, I can reboot it as many times as I want, at least for now.

Given the constraint above I focussed on trying to narrow down the conditions for the kernel trap using just a bootable iso.

Franco says the problem is upstream, so instead of booting the OPNsense iso (which for some reason takes half an hour to get to the point of the crash when mounted as virtual ISO via the iDRAC) I used unmodified OS ISOs.

Here are the results so far:

UEFI ENABLED
HardenedBSD-11-STABLE-v1100056.13-amd64-bootonly.iso
doesn't even manage to boot from the iso

UEFI DISABLED
HardenedBSD-11-STABLE-v1100056.13-amd64-bootonly.iso
kernel trap 12

FreeBSD-11.1-RELEASE-amd64-bootonly.iso
boots all the way to the installer

HardenedBSD-12-STABLE-v1200058.3-amd64-bootonly.iso
boots all the way to the installer

I'm seeing the same type of results in Hyper-V as well. However, it's with UEFI enabled due to being Generation 2. Generation 1 works fine for me.

It's possible that the issue with the Dell systems is related to the issue with the Hyper-V systems.

Quote from: bitwolf on March 07, 2019, 12:21:37 AM
So at least in the case of Dell bare metal it seems that UEFI is not the culprit, as disabling it doesn't stop the kernel traps. It also seems not to be a FreeBSD 11 problem, as the vanilla FBSD iso works. This leaves changes between FreeBSD 11 and HardenedBSD 11 as the most likely cause for the kernel trap, but looking at the repo it seems the classical needle in a haystack. The interesting result from this testing is that HardenedBSD 12 works, so maybe an easier investigation path could be to look at the changes between HBSD 11 and 12 that are not merged from FBSD? Shawn what do you think?

Another option to collect more data could be to have a 19.1 debug iso (ie one with DDB enabled in the kernel) so we can actually collect core dumps for these crashes. I am sure that given enough time many of us, me included, could set up a HBSD dev environment and build the image myself, but if this can be a useful investigation avenue it seems better if one of the lead devs could just run the existing build workflow with the kernel option set.

I'm building a custom version of HardenedBSD 11-STABLE/amd64 with DDB/KDB and remote KGDB along with CFLAGS="-g -O0" for "ALL THE THINGS!" I can upload the installation media once they're built.

As far as attempting to see what needs to be backported from 12-STABLE to 11-STABLE, that would entail _A LOT_ of work. More work than I have time for. However, if someone in the community wants to take that on, I'm definitely not going to stop him/her and would love to review patches. ;)

Quote from: bitwolf on March 07, 2019, 12:21:37 AM
I see further up the thread that a number of people complained about the same crashes on ESXi; our own production firewalls run on ESXi 6 but upgraded to 19.1 successfully, I can do some tests in that sense tomorrow, as this seems to imply there might be a simple workaround in the VM settings for the people running ESXi. This could also be a way forward for the people who have kernel traps on overspecced bare metal, at least up to the point the upstream issue is fixed, or OPNSense has moved to HBSD 12 (but that's at least a year away).

OPNsense's move to HardenedBSD 12 is eight months away, assuming Franco does the initial import of the source code soon. :)

Quote from: TheGrandWazoo on March 06, 2019, 11:03:51 PM
Ok, let's all just settle down. Grab a beer or a Crown Royal or both and sit back...relax.

Sorry, you got something wrong (same as franco did), but criticizing something factually does not mean being vicious.

We all want to have running systems; that's why (most of us) just report errors hoping they might be fixed soon.

franco complained about not being paid enough for his work; the admin wants an Intel NUC from the community for 550 € just to test Hyper-V.

franco also stated that the freeze-problems in 19.x with virtualized and some other bare metal systems might never be solved.

To me, no offence, this does not look very respectable.

I am afraid I have to look for alternatives again concerning our firewalls.

Quote from: peter008 on March 07, 2019, 09:19:40 AM
franco complained about not being paid enough for his work; the admin wants an Intel NUC from the community for 550 € just to test Hyper-V.

FYI: it takes resources to debug issues. No resources means no debugging. My employer is awesome and lent me a laptop on which I can do the necessary debugging. That's what happens when one looks for potential solutions rather than griping with feelings of entitlement. ;P

If you have a better suggestion, rather than a gripe, I'm all ears.

Hi,

I've spent more than a day trying to replicate the issue and track down its origin, since it doesn't occur on all UEFI boot systems.
VirtualBox, for example, boots without issues in UEFI mode; on Parallels (macOS) I was able to reproduce the crash as well.


fpuinit_bsp1 () at /usr/src/sys/amd64/amd64/fpu.c:241
fpuinit () at /usr/src/sys/amd64/amd64/fpu.c:277
0xffffffff810adb3b in hammer_time (modulep=<optimized out>, physfree=<optimized out>) at /usr/src/sys/amd64/amd64/machdep.c:1801
0xffffffff80316024 in btext () at /usr/src/sys/amd64/amd64/locore.S:79


Let me make one thing very clear: none of our systems suffer from this issue, and a lot of people were actively involved during the beta stages up to 19.1 using all kinds of hardware.

I've seen a couple of people complaining, nagging, and not being of **any** help to anyone.
I understand you have an issue, we all do, but... there are always alternatives: using other types of setups, being involved earlier, and actively helping improve the system.
Don't forget, if your setup fails and you have done nothing to prevent that from happening, it's still your issue... nobody got paid to solve it for you.

The available patch [1] might not be the final fix, nor will it fix all the issues in the world, but it looks promising.

I would like to thank Franco, Shawn and anybody involved in actually pinning this issue down.

A kernel with debug options enabled is available on our website [2], but if Franco has some time available he can probably move it to a better spot, or maybe build an iso with that kernel.


Best regards,

Ad



Quote from: AdSchellevis on March 07, 2019, 06:34:02 PM

fpuinit_bsp1 () at /usr/src/sys/amd64/amd64/fpu.c:241
fpuinit () at /usr/src/sys/amd64/amd64/fpu.c:277
0xffffffff810adb3b in hammer_time (modulep=<optimized out>, physfree=<optimized out>) at /usr/src/sys/amd64/amd64/machdep.c:1801
0xffffffff80316024 in btext () at /usr/src/sys/amd64/amd64/locore.S:79


I would like to thank Franco, Shawn and anybody involved in actually pinning this issue down.

A kernel with debug options enabled is available on our website [2], but if Franco has some time available he can probably move it to a better spot, maybe build some iso with kernel.


Best regards,

Ad


Hey Ad,

I've been working on this for the past few days. Put in around 20 hours so far tracking down the issue. :)

We effectively have two forum topics for the same problem. I've documented the issue here: https://forum.opnsense.org/index.php?topic=11403.msg54432#msg54432

So, I've figured out the root cause. I need to do more research in order to write a patch. I'm hoping to have a patch ready within the next week or two.

I know you guys are going to think I am "BAT SHIT INSANE" but I was able to get the system to boot with EFI... but before everyone goes nuts, let me tell you what I did, and maybe it will make sense or just confuse the crap out of you.

I downloaded the 11.2 bootonly ISO of FreeBSD, copied the EFI files to the /boot dir, and copied loader.efi over /efi/boot/BOOTx64.efi on the EFI system partition (you have to mount the partition first: 'mount -t msdosfs /dev/<your efi partition> /mnt').
This did NOT work. I received the same error.

I then proceeded to download the 12.0 bootonly ISO of FreeBSD and copied the EFI files as mentioned above. This did NOT work either.

Here's the "Bat Shit Insane" part... I copied /boot/kernel/kernel from the 12.0 bootonly ISO over OPNsense's /boot/kernel/kernel and, 'Holy Shit', I am up and running using EFI firmware to boot OPNsense.

Now I know you are saying "This does not do squat for me", but it might give someone in FreeBSD and/or HardenedBSD land a "light bulb" moment about a change that differs between the two trains. Or it could help OPNsense talk to the FreeBSD/HardenedBSD kernel developers and give them some insight.
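For anyone who wants to repeat the experiment, the steps above boil down to roughly the following. This is a sketch, not a supported procedure: the device name /dev/ada0p1 and the /path/to/freebsd-12.0 tree are placeholders for your own setup, OPNsense's kernel modules may not load against a 12.0 kernel, and running a stock FreeBSD kernel under an OPNsense 19.1 userland is untested territory, so back everything up first:

```shell
# Sketch of the kernel-swap experiment; device name and source paths are examples.
# 1. Mount the EFI system partition (FAT, hence msdosfs)
mount -t msdosfs /dev/ada0p1 /mnt

# 2. Replace the default UEFI boot file with the FreeBSD loader
cp /path/to/freebsd-12.0/boot/loader.efi /mnt/efi/boot/BOOTx64.efi

# 3. Back up the OPNsense kernel, then drop in the FreeBSD 12.0 kernel
cp /boot/kernel/kernel /boot/kernel/kernel.opnsense
cp /path/to/freebsd-12.0/boot/kernel/kernel /boot/kernel/kernel

umount /mnt
reboot
```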

Trying my best guys. I hope this helps.

Guides I used to help me with this...
https://wiki.freebsd.org/UEFI
https://www.happyassassin.net/2014/01/25/uefi-boot-how-does-that-actually-work-then/
https://www.freebsd.org/doc/en_US.ISO8859-1/books/arch-handbook/boot-kernel.html - because the panic is actually in the mi_startup().
https://forums.freebsd.org/threads/linuxkpi-kernel-panic-in-freebsd-11-2-prerelease-4-r333170-intel-skylake-hd-graphics.65848/

Kev a.k.a. The Grand Wazoo.