Proxmox crashing, bad RAM?

Started by NevadaTech, July 01, 2026, 06:29:11 PM

Please review this excerpt from my Proxmox log. Prox is 8.4.19, the mobo is Asrock X470D4U, the RAM is 128GB (4x32GB) Nemix.

I have 5 other servers similar to this one working fine. The exception is one other server with the same HARDWARE ERROR that pops up on the console. I believe that one also has Nemix RAM. I believe the other four servers have Kingston KSM26ED8/16ME RAM but no errors.

12:37 Hardware Error
13:21 Hardware Error
14:53 Hardware Error
15:45 server reboot

I also see a SMART thermal message but that seems like more of a 'notice'.

Below the Proxmox log is the output of 'dmidecode -t 17'. While it lists specs, it doesn't list the actual manufacturer part number. I believe the RAM is actually rated for 3200 MT/s but is running at the lower 2666 MT/s. I tried 'lshw -C memory', but lshw is not installed.


------------------------------------------ start some Proxmox log dump
Jun 30 12:37:21 virt09b kernel: mce: [Hardware Error]: Machine check events logged
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: Corrected error, no action required.
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: Error Addr: 0x0000000bbf588300
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000000040a801101
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Jun 30 12:37:21 virt09b kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x4)
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jun 30 12:43:12 virt09b smartd[1543]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jun 30 13:04:37 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:08:53 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:17:01 virt09b CRON[151843]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 30 13:17:01 virt09b CRON[151844]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 30 13:17:01 virt09b CRON[151843]: pam_unix(cron:session): session closed for user root
Jun 30 13:19:41 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:21:03 virt09b kernel: mce: [Hardware Error]: Machine check events logged
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: Corrected error, no action required.
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: Error Addr: 0x0000000bbf588300
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000000040a801101
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Jun 30 13:21:03 virt09b kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x4)
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jun 30 13:43:12 virt09b smartd[1543]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jun 30 13:45:16 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:47:15 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:50:08 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:54:31 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 13:57:41 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:05:23 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:06:44 virt09b systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Jun 30 14:06:44 virt09b systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Jun 30 14:06:44 virt09b systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Jun 30 14:06:44 virt09b systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jun 30 14:17:01 virt09b CRON[172546]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 30 14:17:01 virt09b CRON[172547]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 30 14:17:01 virt09b CRON[172546]: pam_unix(cron:session): session closed for user root
Jun 30 14:27:29 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:28:37 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:35:25 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:43:12 virt09b smartd[1543]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Jun 30 14:44:19 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:45:10 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:53:09 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 14:53:53 virt09b kernel: mce: [Hardware Error]: Machine check events logged
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: Corrected error, no action required.
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: Error Addr: 0x0000000bbf520300
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000000040a801101
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Jun 30 14:53:53 virt09b kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x4)
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jun 30 14:56:46 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 15:06:35 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 15:07:25 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 15:08:03 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 15:13:12 virt09b smartd[1543]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jun 30 15:15:21 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 15:17:01 virt09b CRON[193198]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 30 15:17:01 virt09b CRON[193199]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jun 30 15:17:01 virt09b CRON[193198]: pam_unix(cron:session): session closed for user root
Jun 30 15:38:08 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
Jun 30 15:45:39 virt09b kernel: AMD-Vi: Completion-Wait loop timed out
-- Reboot --
Jun 30 15:47:33 virt09b kernel: Linux version 6.8.12-30-pve (build@proxmox) (gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-30 (2026-06-11T10:10Z) ()
Jun 30 15:47:33 virt09b kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-30-pve root=/dev/mapper/pve-root ro quiet
Jun 30 15:47:33 virt09b kernel: KERNEL supported cpus:
Jun 30 15:47:33 virt09b kernel:   Intel GenuineIntel



------------------------------------------ end some Proxmox log dump

------------------------------------------
Output of 'dmidecode -t 17':

Memory Device
        Array Handle: 0x0014
        Error Information Handle: 0x0021
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL B
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2666 MT/s
        Manufacturer: Unknown
        Serial Number: 5D270016
        Asset Tag: Not Specified
        Part Number: Unknown
        Rank: 2
        Configured Memory Speed: 2666 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Unknown
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 32 GB
        Cache Size: None
        Logical Size: None

------------------------------------------

Yep, looks like a bad stick. You could test it with memtest86 just to verify, but I doubt it would give a different result.
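
For what it's worth, here is a rough sketch (not from anyone in the thread) of how the three machine-check records in the posted log could be summarized. It decodes the flag bits of the MC17_STATUS word and counts how often each Error Addr shows up. The bit positions in STATUS_BITS are an assumption based on the flags the kernel printed next to the value, and the helper names are made up for illustration.

```python
import re
from collections import Counter

# Lines copied verbatim from the log excerpt in the first post.
LOG = """\
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
Jun 30 12:37:21 virt09b kernel: [Hardware Error]: Error Addr: 0x0000000bbf588300
Jun 30 13:21:03 virt09b kernel: [Hardware Error]: Error Addr: 0x0000000bbf588300
Jun 30 14:53:53 virt09b kernel: [Hardware Error]: Error Addr: 0x0000000bbf520300
"""

# ASSUMPTION: flag bits of the MCA status word, highest bit first.
# Chosen to match the flags the kernel printed in the log above.
STATUS_BITS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV",
    57: "PCC", 55: "TCC", 53: "SyndV", 46: "CECC", 45: "UECC",
    44: "Deferred", 43: "Poison", 40: "Scrub",
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flag bits that are set in a status word."""
    return [name for bit, name in STATUS_BITS.items() if (status >> bit) & 1]


def first_status(log: str) -> int:
    """Extract the first MCx_STATUS value found in the log text."""
    match = re.search(r"MC\d+_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)", log)
    if match is None:
        raise ValueError("no MCx_STATUS line found")
    return int(match.group(1), 16)


def tally_addresses(log: str) -> Counter:
    """Count how many times each 'Error Addr' value appears."""
    return Counter(int(a, 16) for a in re.findall(r"Error Addr: (0x[0-9a-fA-F]+)", log))


if __name__ == "__main__":
    print("flags:", "|".join(decode_status(first_status(LOG))))
    for addr, count in tally_addresses(LOG).most_common():
        print(f"{addr:#018x}: {count} corrected error(s)")
```

On the posted excerpt this shows a corrected ECC error (CECC) flagged by the scrubber, twice at one address and once at another address 0x68000 bytes lower. EDAC could not translate the normalized address, so treat this as a hint that the errors cluster in one region rather than as proof of which DIMM is at fault.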

The only Nemix DIMMs I've seen used re-marked chips - not for me. YMMV. Heck, it might even have a warranty... but the manufacturer has an incentive to reject claims at the moment.

Quote from: NevadaTech on July 01, 2026, 06:29:11 PM
[...] I also see a SMART thermal message but that seems like more of a 'notice'. [...]

What temperature? The controller shouldn't exceed its limits, but higher temperatures are generally detrimental. Might be worth addressing while you're in it.
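
A small sketch for pulling the smartd lines out of the posted log, in case it helps answer that. Note that smartd is reporting the normalized value of attribute 190 here, not necessarily degrees. The conversion in approx_celsius (100 minus the normalized value) is an assumption that holds on many drives but not all; the raw value from 'smartctl -A' is the thing to check. Function names are made up for illustration.

```python
import re

# smartd lines copied verbatim from the log excerpt in the first post.
LOG = """\
Jun 30 12:43:12 virt09b smartd[1543]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jun 30 13:43:12 virt09b smartd[1543]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
Jun 30 14:43:12 virt09b smartd[1543]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 67
Jun 30 15:13:12 virt09b smartd[1543]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 66
"""

PATTERN = re.compile(
    r"Device: (\S+) .*Attribute: (\d+) (\S+) changed from (\d+) to (\d+)"
)


def smart_changes(log: str) -> list[tuple[str, int, str, int, int]]:
    """Return (device, attribute id, attribute name, old, new) per smartd line."""
    return [
        (dev, int(attr), name, int(old), int(new))
        for dev, attr, name, old, new in PATTERN.findall(log)
    ]


def approx_celsius(normalized: int) -> int:
    """ASSUMPTION: normalized value of attribute 190 is 100 minus degrees C."""
    return 100 - normalized


if __name__ == "__main__":
    for dev, attr, name, old, new in smart_changes(LOG):
        print(f"{dev} {attr} {name}: {old} -> {new} "
              f"(roughly {approx_celsius(new)} C if the assumption holds)")
```

If the assumption holds for these drives, the values in the log would correspond to the low-to-mid 30s Celsius, which would explain why it reads like a notice rather than a warning.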

Does the BIOS/UEFI have built-in hardware tests? If so, I would first use its memory test tool, just to verify things.
Mini-pc N150 i226v x520, FREEDOM

X470 chips in general and the consumer platform based Asrock Mainboards are notorious for failing early, too.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, Leox LXT-010H-D

1100 down / 450 up, Bufferbloat A+

Quote from: meyergru on July 01, 2026, 09:29:16 PM
X470 chips in general and the consumer platform based Asrock Mainboards are notorious for failing early, too.
It's from their Rack Series : https://www.asrockrack.com/general/productdetail.asp?Model=X470D4U#Specifications
So basically Workstation/Server hardware even though it's just a X470 Chipset which is indeed Mainstream Consumer stuff :)

But the first thing that came to my mind after reading the title : Why ask this on the OPNsense Forum ?!
And why did you not simply run memtest86+ to check the RAM ?!

Seems straightforward, but could be just me...
Weird guy who likes everything Linux and *BSD on PC/Laptop/Tablet/Mobile and funny little ARM based boards :)

Quote
But the first thing that came to my mind after reading the title : Why ask this on the OPNsense Forum ?!

This + Upgrade your BIOS.

ASRock is known to have random problems due to motherboard firmware.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
N355 - i226-V | AQC113C | 16G | 500G - PROD

PRXMX
N5105 - i226-V | 2x8G | 512G - NODE #1
N100 - i226-V | 16G | 1T - NODE #2

Quote from: nero355 on Today at 04:45:48 PM
It's from their Rack Series : https://www.asrockrack.com/general/productdetail.asp?Model=X470D4U#Specifications

I know. I meant that these specific ASRock Rack mainboards are a chimera of consumer chipsets and server features, and that never worked out well; forums are full of failure reports. The stability of X470 / X570 was not stellar in the first place.
Intel N100, 4* I226-V, 2* 82559, 16 GByte, 500 GByte NVME, Leox LXT-010H-D

1100 down / 450 up, Bufferbloat A+

Quote from: meyergru on Today at 05:29:07 PM
Quote from: nero355 on Today at 04:45:48 PM
It's from their Rack Series : https://www.asrockrack.com/general/productdetail.asp?Model=X470D4U#Specifications
I know. I meant that these specific ASRock Rack mainboards are a chimera of consumer chipsets and server features, and that never worked out well; forums are full of failure reports. The stability of X470 / X570 was not stellar in the first place.
Ahh, OK. Much clearer, thnx! :)
Weird guy who likes everything Linux and *BSD on PC/Laptop/Tablet/Mobile and funny little ARM based boards :)