DEC2752 - Stop/Crash at 00:01 every day

Started by stuckoff, Today at 01:13:53 PM

Hi everyone,

I'm experiencing a recurring system instability issue with one of my appliances that started around December 21st, 2025. After 6 months of perfect stability, the system now becomes unresponsive almost every night at exactly 00:01.

Symptoms:
    Connectivity: Internet access stops for the network.
    Management: No access to WebGUI or SSH.
    Persistence: The management IP still responds to Pings.
    HA/CARP: Interestingly, services do not fail over to the secondary node because the primary node keeps its CARP VIPs (the kernel is still "alive" enough to prevent failover, but the userland is dead).

Logs: The system logs point clearly to an Out of Memory (OOM) event and swap exhaustion:

2026-01-06T00:11:26 Notice lockout_handler lockout 138.197.98.69 [using table sshlockout] after 6 attempts
2026-01-06T00:06:30 Notice kernel swp_pager_getswapspace(15): failed
2026-01-06T00:06:26 Notice kernel <3>pid 67504 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:06:21 Notice kernel swp_pager_getswapspace(14): failed
2026-01-06T00:06:21 Notice kernel swap_pager: out of swap space
2026-01-06T00:05:41 Notice kernel <3>pid 70232 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:05:26 Notice kernel <3>pid 66425 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:05:06 Notice kernel <3>pid 64687 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:04:47 Notice kernel <3>pid 62108 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:04:22 Notice kernel <3>pid 61411 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:04:07 Notice kernel <3>pid 60957 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:03:53 Notice kernel <3>pid 59937 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:03:33 Notice kernel <3>pid 59157 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:03:14 Notice kernel <3>pid 59042 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:02:54 Notice kernel <3>pid 57475 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:02:37 Notice kernel <3>pid 57642 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:02:24 Notice kernel <3>pid 54836 (8bcK6gTx), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:21 Notice kernel <3>pid 54500 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:20 Notice kernel <3>pid 50993 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:18 Notice kernel <3>pid 51388 (8bcK6gTx), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:13 Notice kernel swp_pager_getswapspace(24): failed
2026-01-06T00:02:13 Notice kernel swap_pager: out of swap space
2026-01-06T00:02:01 Notice kernel swap_pager: out of swap space
2026-01-06T00:02:00 Notice kernel <3>pid 40094 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:01:47 Notice kernel <3>pid 40583 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:01:10 Notice kernel <3>pid 49376 (PfNxCZtE), jid 0, uid 0, was killed: a thread waited too long to allocate a page
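To see at a glance which process names the kernel is killing, and for what reason, the log excerpt above can be tallied with a short script. This is only a rough sketch: the regular expression is inferred from the lines shown here (not from any documented log format), and `tally_kills` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this post, e.g.:
# 2026-01-06T00:06:26 Notice kernel <3>pid 67504 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
KILL_RE = re.compile(
    r"^(?P<ts>\S+) Notice kernel (?:<\d+>)?pid (?P<pid>\d+) \((?P<name>[^)]+)\), "
    r"jid \d+, uid \d+, was killed: (?P<reason>.+)$"
)


def tally_kills(lines):
    """Count kernel kill messages per (process name, reason)."""
    counts = Counter()
    for line in lines:
        m = KILL_RE.match(line.strip())
        if m:
            counts[(m.group("name"), m.group("reason"))] += 1
    return counts


# A few lines copied from the log above; in practice, read the exported log file.
SAMPLE = """\
2026-01-06T00:06:30 Notice kernel swp_pager_getswapspace(15): failed
2026-01-06T00:06:26 Notice kernel <3>pid 67504 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:05:41 Notice kernel <3>pid 70232 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:01:10 Notice kernel <3>pid 49376 (PfNxCZtE), jid 0, uid 0, was killed: a thread waited too long to allocate a page
""".splitlines()

if __name__ == "__main__":
    for (name, reason), n in tally_kills(SAMPLE).most_common():
        print(f"{n:3d}  {name}  {reason}")
```

The tally only shows which processes were chosen as victims; a killed process is not necessarily the one consuming the memory.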

Scheduled Tasks: The crash timing (00:01) coincides with a cron job that I have set to run hourly:

1      *     *       *       *       (/usr/local/sbin/configctl -d syslog archive) > /dev/null

My Questions:

    Identification: How can I identify which process is actually causing the leak? The process names mentioned in the logs (i2RsVQl2, 8bcK6gTx, PfNxCZtE) look randomized/obfuscated. Is this normal for certain plugins, or a sign of something else?
    Timing: If the cron job runs every hour, why does the crash only occur at the midnight (00:01) run and not at 23:01 or 01:01?
    Root Cause: Since this was stable for 6 months, could this be related to log rotation/archiving of a particularly large "daily" log file that builds up?
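
To make the Timing question concrete: a cron entry of the form `1 * * * *` matches minute 1 of every hour. The sketch below simply enumerates those times for one day; `fire_times` is a hypothetical helper written for this post, not part of cron or of the firewall software.

```python
from datetime import datetime, timedelta


def fire_times(day, minute=1):
    """Return the times on `day` at which a `MINUTE * * * *` cron entry fires."""
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(hours=h, minutes=minute) for h in range(24)]


times = fire_times(datetime(2026, 1, 6))
print(len(times))                                               # 24
print(times[0].strftime("%H:%M"), times[-1].strftime("%H:%M"))  # 00:01 23:01
```

So the job should run 24 times a day, and the 00:01 run is only one of them. That would suggest, without proving it, that something is different about the midnight run compared with the other 23.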

Any advice on how to debug this via console or remote logging before the crash occurs would be greatly appreciated.

Well, how much swap and RAM do you have?

Also check /var/log for disk usage.


Cheers,
Franco

8GB RAM
10GB Swap
/var/log is 582MB

Quote from: stuckoff on Today at 01:13:53 PM
Any advice on how to debug this via console or remote logging before the crash occurs would be greatly appreciated.

Connect a serial console and leave it overnight. Serial terminals don't lose their content when the system reboots; it all just scrolls up. So if there is a panic message or similar, you will see it in the morning.
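
Alongside the serial console, it may help to record which processes hold the most memory in the minutes before 00:01. The following is a minimal sketch, assuming a Python 3 interpreter is available on the box and that `ps -axo pid,rss,comm` prints a header followed by PID, resident set size (in KiB on FreeBSD) and command name. `top_by_rss` and `snapshot` are hypothetical helper names.

```python
import subprocess


def top_by_rss(ps_output, n=5):
    """Parse `ps -axo pid,rss,comm` output; return the n largest (rss, pid, command)."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[1]), int(parts[0]), parts[2]))
    return sorted(rows, reverse=True)[:n]


def snapshot(n=5):
    """Run ps once and return the n processes with the largest resident set size."""
    out = subprocess.run(
        ["ps", "-axo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return top_by_rss(out, n)


# Made-up sample output, used only to show the parsing.
SAMPLE_PS = """\
  PID    RSS COMMAND
    1   1200 init
  345 812000 python3.11
  400  20480 sshd
"""

if __name__ == "__main__":
    for rss, pid, comm in top_by_rss(SAMPLE_PS, 2):
        print(f"{rss:>10} KiB  pid {pid:<6} {comm}")
```

Calling `snapshot()` every minute from 23:50 onwards and appending the result to a file, or sending it to a remote syslog host, would leave a record that survives the hang.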