DEC2752 - Stop/Crash at 00:01 every day

Started by stuckoff, January 16, 2026, 01:13:53 PM

Hi everyone,

I'm experiencing a recurring system instability issue with one of my appliances that started around December 21st, 2025. After 6 months of perfect stability, the system now becomes unresponsive almost every night at exactly 00:01.

Symptoms:
    Connectivity: Internet access stops for the network.
    Management: No access to WebGUI or SSH.
    Persistence: The management IP still responds to Pings.
    HA/CARP: Interestingly, services do not fail over to the secondary node because the primary node keeps its CARP VIPs (the kernel is still "alive" enough to prevent failover, but the userland is dead).

Logs: The system logs point clearly to an Out of Memory (OOM) event and swap exhaustion:

2026-01-06T00:11:26 Notice lockout_handler lockout 138.197.98.69 [using table sshlockout] after 6 attempts
2026-01-06T00:06:30 Notice kernel swp_pager_getswapspace(15): failed
2026-01-06T00:06:26 Notice kernel <3>pid 67504 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:06:21 Notice kernel swp_pager_getswapspace(14): failed
2026-01-06T00:06:21 Notice kernel swap_pager: out of swap space
2026-01-06T00:05:41 Notice kernel <3>pid 70232 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:05:26 Notice kernel <3>pid 66425 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:05:06 Notice kernel <3>pid 64687 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:04:47 Notice kernel <3>pid 62108 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:04:22 Notice kernel <3>pid 61411 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:04:07 Notice kernel <3>pid 60957 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:03:53 Notice kernel <3>pid 59937 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:03:33 Notice kernel <3>pid 59157 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:03:14 Notice kernel <3>pid 59042 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:02:54 Notice kernel <3>pid 57475 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:02:37 Notice kernel <3>pid 57642 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:02:24 Notice kernel <3>pid 54836 (8bcK6gTx), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:21 Notice kernel <3>pid 54500 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:20 Notice kernel <3>pid 50993 (i2RsVQl2), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:18 Notice kernel <3>pid 51388 (8bcK6gTx), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-06T00:02:13 Notice kernel swp_pager_getswapspace(24): failed
2026-01-06T00:02:13 Notice kernel swap_pager: out of swap space
2026-01-06T00:02:01 Notice kernel swap_pager: out of swap space
2026-01-06T00:02:00 Notice kernel <3>pid 40094 (i2RsVQl2), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:01:47 Notice kernel <3>pid 40583 (8bcK6gTx), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-06T00:01:10 Notice kernel <3>pid 49376 (PfNxCZtE), jid 0, uid 0, was killed: a thread waited too long to allocate a page
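
To see at a glance which process names the kernel is killing most often, the log lines above can be tallied. This is only a rough sketch: `count_kills` is a made-up helper name, and it assumes the "pid N (name) ... was killed" layout shown in the excerpt.

```shell
#!/bin/sh
# count_kills [FILE...]: tally "was killed" kernel messages per process name.
# Reads the named log files, or stdin when no file is given.
count_kills() {
    grep -h 'was killed' "$@" |
        # Keep only the name between the parentheses after "pid N".
        sed -E 's/.*pid [0-9]+ \(([^)]*)\).*/\1/' |
        sort | uniq -c | sort -rn
}
```

For example, `count_kills /var/log/system/system_*.log` would print one line per process name with the number of kills in front, highest count first.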

Scheduled Tasks: The crash timing (00:01) coincides with a cron job that I have set to run hourly:

1      *     *       *       *       (/usr/local/sbin/configctl -d syslog archive) > /dev/null

My Questions:

    Identification: How can I identify which process is actually causing the leak? The process names in the logs (i2RsVQl2, 8bcK6gTx, PfNxCZtE) look randomized/obfuscated. Is this normal for certain plugins, or a sign of something else?
    Timing: If the cron job runs every hour, why does the crash only occur at the midnight (00:01) run and not at 23:01 or 01:01?
    Root Cause: Since this was stable for 6 months, could this be related to log rotation/archiving of a specifically large "daily" log file that builds up?

Any advice on how to debug this via console or remote logging before the crash occurs would be greatly appreciated.
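
One way to catch the culprit before userland dies might be to record the largest processes every few seconds around midnight. The sketch below is an assumption-laden example, not an OPNsense feature: the helper names `top_by_rss` and `snapshot` and any log path are hypothetical, and `swapinfo` is only called when it exists.

```shell
#!/bin/sh
# top_by_rss [N]: read "PID RSS COMMAND" lines on stdin and print the
# N lines with the largest RSS (default 10).
top_by_rss() {
    sort -k2,2nr | head -n "${1:-10}"
}

# snapshot FILE: append one timestamped sample to FILE.
snapshot() {
    {
        date '+%Y-%m-%dT%H:%M:%S'
        # Drop the ps header line, then keep the biggest processes.
        ps -axo pid,rss,command | sed 1d | top_by_rss 10
        # swapinfo is FreeBSD-specific, so only call it when present.
        if command -v swapinfo >/dev/null 2>&1; then
            swapinfo -k
        fi
    } >> "$1"
}
```

Started shortly before midnight, a loop such as `while :; do snapshot /root/memwatch.log; sleep 5; done` (path chosen only for illustration) would leave a record of what was growing right up to the point where the system stops responding.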

Well, how much swap and RAM do you have?

Also check /var/log for disk usage.
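
A minimal sketch of how those checks could be run from a shell. `largest_entries` is a hypothetical helper name; the RAM and swap commands in the comments are the standard FreeBSD ones.

```shell
#!/bin/sh
# RAM and swap on FreeBSD/OPNsense can be read with:
#   sysctl -n hw.physmem    # physical memory in bytes
#   swapinfo -h             # configured swap devices and usage

# largest_entries DIR [N]: print the N largest entries directly under DIR,
# sizes in KiB, biggest first (default 10).
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

Running `largest_entries /var/log` would show which log directory, if any, has grown unusually large.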


Cheers,
Franco


Quote from: stuckoff on January 16, 2026, 01:13:53 PM
Any advice on how to debug this via console or remote logging before the crash occurs would be greatly appreciated.

Connect a serial console and leave it overnight. Serial terminals don't lose their content when the system reboots; it all just scrolls up. So if there is a panic message or similar, you will see it in the morning.
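
If the serial output is captured on another machine, timestamping each line makes it easier to match against the 00:01 cron run. A small sketch, assuming the capture host has `cu` available; the helper name `stamp_lines`, the device name and the speed are all assumptions to adjust for your adapter.

```shell
#!/bin/sh
# stamp_lines: prefix every line read on stdin with the local time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line"
    done
}

# Example on the capturing machine (device and speed are assumptions):
#   cu -l /dev/cuaU0 -s 115200 | stamp_lines >> console.log
```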


Yes, I did this and here is the log from 00:02 last night:

# >>> Invoking stop script 'beep'
>>> Invoking stop script 'freebsd'
snmpd not running? (check /var/run/net_snmpd.pid).
>>> Invoking stop script 'backup'
>>> Invoking backup script 'captiveportal'
>>> Invoking backup script 'dhcpleases'
>>> Invoking backup script 'duid'
>>> Invoking backup script 'netflow'
>>> Invoking backup script 'rrd'
>>> Invoking stop script 'config'
Enter full pathname of shell or RETURN for /bin/sh:

The system log looks the same:
root@fw:/var/log/system # cat system_20260117.log
2026-01-17T00:02:20+02:00 fw.lan kernel - - [meta sequenceId="1"] <3>[30505] pid 78673 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:02:20+02:00 fw.lan kernel - - [meta sequenceId="2"] <3>[30527] pid 99067 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:02:20+02:00 fw.lan kernel - - [meta sequenceId="3"] <3>[30541] pid 16919 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:02:20+02:00 fw.lan kernel - - [meta sequenceId="4"] <3>[30560] pid 89389 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:02:20+02:00 fw.lan kernel - - [meta sequenceId="5"] <3>[30571] pid 6829 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:02:20+02:00 fw.lan kernel - - [meta sequenceId="6"] <3>[30598] pid 5238 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:04:57+02:00 fw.lan kernel - - [meta sequenceId="1"] <3>[30628] pid 86199 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:04:57+02:00 fw.lan kernel - - [meta sequenceId="2"] <3>[30667] pid 27392 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:04:57+02:00 fw.lan kernel - - [meta sequenceId="3"] <3>[30679] pid 68394 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:04:57+02:00 fw.lan kernel - - [meta sequenceId="4"] <3>[30721] pid 65517 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:04:57+02:00 fw.lan kernel - - [meta sequenceId="5"] <3>[30734] pid 76047 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:06:06+02:00 fw.lan kernel - - [meta sequenceId="1"] <3>[30795] pid 1872 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:06:43+02:00 fw.lan kernel - - [meta sequenceId="2"] <3>[30838] pid 39818 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:07:28+02:00 fw.lan kernel - - [meta sequenceId="3"] <3>[30877] pid 742 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:07:28+02:00 fw.lan kernel - - [meta sequenceId="4"] <3>[30908] pid 30132 (FtcZhyzw), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:09:01+02:00 fw.lan kernel - - [meta sequenceId="1"] <3>[30955] pid 27447 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:09:01+02:00 fw.lan kernel - - [meta sequenceId="2"] <3>[30978] pid 18283 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:10:57+02:00 fw.lan kernel - - [meta sequenceId="1"] <3>[31085] pid 96108 (At3plE9U), jid 0, uid 0, was killed: a thread waited too long to allocate a page
2026-01-17T00:11:45+02:00 fw.lan kernel - - [meta sequenceId="2"] <3>[31135] pid 4483 (FtcZhyzw), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-17T00:11:45+02:00 fw.lan kernel - - [meta sequenceId="3"] <3>[31139] pid 95992 (FtcZhyzw), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-17T00:11:45+02:00 fw.lan kernel - - [meta sequenceId="4"] <3>[31140] pid 12278 (FtcZhyzw), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-17T00:11:45+02:00 fw.lan kernel - - [meta sequenceId="5"] <3>[31141] pid 28622 (FtcZhyzw), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-17T00:11:45+02:00 fw.lan kernel - - [meta sequenceId="6"] <3>[31142] pid 32151 (php), jid 0, uid 0, was killed: failed to reclaim memory
2026-01-17T00:12:12+02:00 fw.lan syslog-ng 26461 - [meta sequenceId="7"] syslog-ng shutting down; version='4.10.2'

Here is the process list:
root@fw:/var/log/system # ps ax
  PID TT  STAT       TIME COMMAND
    0  -  DLs     4:59.88 [kernel]
    1  -  ILs     0:00.10 /sbin/init
    2  -  WL      0:05.07 [clock]
    3  -  DL      0:00.00 [crypto]
    4  -  DL      0:00.09 [cam]
    5  -  DL      0:00.00 [busdma]
    6  -  DL      0:08.42 [zfskern]
    7  -  DL      4:33.70 [pf purge]
    8  -  DL      0:58.76 [rand_harvestq]
    9  -  DL      0:39.94 [pagedaemon]
   10  -  DL      0:00.00 [audit]
   11  -  RNL  5091:39.61 [idle]
   12  -  WL      1:29.44 [intr]
   13  -  DL      0:00.00 [geom]
   14  -  DL      0:00.00 [sequencer 00]
   15  -  DL      0:01.43 [usb]
   16  -  DL      0:15.36 [vmdaemon]
   17  -  DL      0:00.72 [bufdaemon]
   18  -  DL      0:00.15 [vnlru]
   19  -  DL      0:00.28 [syncer]
   31  -  DL      0:00.01 [aiod1]
   32  -  DL      0:00.01 [aiod2]
   33  -  DL      0:00.01 [aiod3]
   34  -  DL      0:00.01 [aiod4]
  165  -  DL      0:08.99 [md98]
81979 u2  Ss      0:00.10 -sh (sh)
88824 u2  R+      0:00.00 ps ax

It looks like the system is dropping to single-user mode.
There is no traffic on this firewall, as we disconnected the WAN and LAN interfaces and switched to our backup device a few days ago.
I tested the memory and the NVMe drive; no errors were reported.

I tested the filesystem:
# zpool status zroot
  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Sat Jan 17 13:25:05 2026
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          nda0p3    ONLINE       0     0     0

errors: No known data errors
#

I have now changed the time at which this script is executed:
(/usr/local/sbin/configctl -d syslog archive) > /dev/null
Let's see when it stops next time :-)