[Fixed] AMD hw random reboot

Started by ks, June 21, 2024, 06:44:41 PM

June 21, 2024, 06:44:41 PM Last Edit: July 02, 2024, 10:51:45 AM by ks
Hello,

I'm experiencing a strange issue that is driving me nuts, and I'd appreciate any help you can offer before I nuke it all with napalm...

I had a bare-metal OPNsense install running like a charm on an old 5th-gen i3, with no issues.

First rule: don't repair what isn't broken... yep I know

Then upgrade time came, so I had to move the OPNsense installation (or rather, I'm trying to...) to better hardware to handle 2x 10G fiber connections in parallel, WireGuard VPN and side software like Suricata etc.
My choice was an X470 motherboard with a Ryzen 5600G, 32GB RAM and an NVMe drive for the installation.

The first, and main, issue is that OPNsense reboots at random, so I first suspected:
- a faulty PSU (tried three),
- RAM (tested with MemTest),
- NVMe support (changed),
- the motherboard (swapped for a different model), and
- the CPU too (swapped for a Ryzen 2400G)

but despite all the tests I've done, it seems that OPNsense 24.1 and 23.7 reboot at random on AMD hardware.

Has anyone else experienced this kind of issue?

Thanks in advance

Something something disable power states in the BIOS. Don't remember off the top of my head, please search the forum or google.
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

No problems with my HP T740 and AMD v1756b processor other than the BSD [GIANT-LOCKED] error on install (workaround posted). And since the official OPNsense hardware is all AMD, I'm guessing it is not an AMD issue.

What NICs are you using? Could it be a Realtek NIC problem?

After that, can't help.

I prefer and use AMD every time I can. No problems here.
Any clues in the logs? If there are none because it crashes hard, one option, if you have another machine running, is to send the logs to it. Windows machines need not apply :)

June 22, 2024, 05:06:22 PM #4 Last Edit: June 22, 2024, 05:08:08 PM by ks
Quote from: Patrick M. Hausen on June 21, 2024, 07:01:22 PM
Something something disable power states in the BIOS. Don't remember off the top of my head, please search the forum or google.

Thanks for the hint! I found some posts online and here in the forums related to PBO, which I disabled without any change.


Quote from: Greg_E on June 21, 2024, 07:31:27 PM
What NICs are you using? Could it be a Realtek NIC problem?

After that, can't help.

The only Realtek NIC present is the onboard one, which I have never used.
I have 3 PCI 10Gtek SFP+ cards with some RJ45 and 10G fiber modules; all the traffic passes through them. The cards' controller is the Intel 82599.

Quote from: cookiemonster on June 21, 2024, 11:30:06 PM
I prefer and use AMD every time I can. No problems here.
Any clues in the logs? If there are none because it crashes hard, one option, if you have another machine running, is to send the logs to it. Windows machines need not apply :)

I never got to the logs, unfortunately. I could eventually set up a spare machine to receive the logs, but at this point I'm not sure it's worth continuing down this path.


Thanks all!

I can't remember if OPNsense can be set to do a kernel dump to save the logs leading up to the crash; this is why I suggested sending the logs to a separate machine. That can be useful for diagnosis when the faulting machine is not saving these dumps.
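
For anyone who wants to try the "send logs to another machine" idea without installing a full syslog daemon, here is a minimal sketch of a UDP receiver in Python. It assumes the firewall has been configured with a remote syslog target (UDP) pointing at the spare machine; the port 5514, the file name opnsense-remote.log and the function names are my own placeholders, not anything OPNsense prescribes.

```python
import socket


def receive_syslog(sock, log_path, max_messages=None):
    """Append every UDP datagram received on sock to log_path, one per line.

    Runs forever when max_messages is None; otherwise stops after that many
    datagrams and returns how many were written.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_messages is None or count < max_messages:
            data, (host, _port) = sock.recvfrom(8192)
            line = data.decode("utf-8", errors="replace").rstrip()
            log.write(f"{host} {line}\n")
            # Flush right away so the last lines before a crash are on disk.
            log.flush()
            count += 1
    return count


def serve(port=5514, log_path="opnsense-remote.log"):
    """Listen on all interfaces. Port and file name are assumed defaults."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    receive_syslog(sock, log_path)
```

Calling serve() on the spare machine and leaving it running should capture whatever the firewall manages to send before it goes down. A hard reset may still cut off the final messages, so treat an empty tail as inconclusive rather than as proof that nothing was logged.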

Early Ryzen CPUs (Zen/Zen+) had some issues running Linux, though these typically manifest as system lockups rather than reboots.  Later CPUs seem to work much better, as there are many data centers running the same processor die (EPYC).

Some immediately say "Disable C-States" but that is a very drastic solution, one of last resort.  It basically disables one of the best features of modern AMD CPUs - its power management.  Unless you have hundreds of active clients being routed on your bare metal system, you will appreciate the power savings over time.

The two things I've found to be effective on my Ryzen-based servers (my OPNsense is on an Intel N5105) are:

-- In the BIOS, set Power Supply Idle Control to Typical Current Idle (or some equivalent wording in your particular BIOS)
-- Don't use XMP or any overclocked timings for your DRAM.  Your 2400G is a 1st-gen Zen processor, so your memory speed should be set much lower than the marketing "DDR4 3200" would make you think.  (See below)

I initially had issues running a 1st gen Ryzen 1500X in an Unraid server.  After changing these two parameters in my BIOS, that system has run flawlessly for a couple of years.




That's correct, as I discovered myself: the combination of an MSI X470 Gaming Plus Max and a Ryzen 5 5600G simply doesn't work at all.
During boot the system lists a lot of MCA errors.

Quote from: connervt on June 23, 2024, 11:24:06 PM
Early Ryzen CPUs (Zen/Zen+) had some issues running Linux, though these typically manifest as system lockups rather than reboots.  Later CPUs seem to work much better, as there are many data centers running the same processor die (EPYC).

I asked AMD about the MCA errors and the answer was: you're on your own, since you're using an unsupported OS.  :-X

Then MSI, which I also contacted, sent me an unofficial BIOS to try.
But at that point the rig's fate was decided: I took it apart and sold it in pieces, since I had no more time to spend beta-testing BIOS firmware for MSI.

Thanks all for the support, and try to avoid the MSI mobo + Ryzen APU combo unless you have time and resources for testing.

Interesting.  I've been running my server on nearly the same combination for the past 18 months without a single issue.

Without seeing the actual MCA errors it is impossible to start assigning blame.  Hardware not working as expected could have many root causes: CPU, motherboard, RAM, BIOS settings or the BIOS code itself, and the list could go on.
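
If someone does capture the boot messages, a small helper like the sketch below can pull out just the MCA lines for posting. It assumes the kernel prefixes these lines with "MCA:" and reports the bank as "Bank N", which is how I understand FreeBSD prints them; adjust the pattern if your output differs. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed format: lines such as "MCA: Bank 5, Status 0x..." in the boot log.
BANK_RE = re.compile(r"MCA: Bank (\d+)")


def extract_mca_lines(dmesg_text):
    """Return the stripped lines of a kernel message dump that mention MCA."""
    return [line.strip() for line in dmesg_text.splitlines() if "MCA:" in line]


def count_by_bank(mca_lines):
    """Count MCA reports per bank number; lines without a bank are skipped."""
    counts = Counter()
    for line in mca_lines:
        match = BANK_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feed it the saved text of the boot log (for example, the contents of a file you captured with dmesg) and post the resulting lines. Seeing whether the reports cluster on one bank is usually more helpful than "a lot of MCA errors".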

But sometimes it is best to move on to different hardware, especially if you don't have a compelling reason to stick with the one giving issues.

Quote from: ks on June 21, 2024, 06:44:41 PM

Has anyone else experienced this kind of issue?

Thanks in advance

I have this issue too from time to time (sometimes once a month, sometimes multiple times a week) with a 5700G and an ASUS PRIME B550M-A board. RAM is from Kingston, default BIOS settings loaded, no overclocking.
I didn't find the cause, so I ordered some Intel hardware to migrate to...
Firewall Specs: AMD Ryzen 5700G, 16GB DDR4 3200MHz RAM, Intel E810 Quad Port SFP28 NIC
Internet Specs: Init7 25GBit FTTH