OPNsense reboots multiple times per day

Started by jones, August 20, 2024, 12:59:49 AM

Previous topic - Next topic
For some reason, my OPNsense router reboots multiple times per day, and I can't understand why. Looking into the "General" logs, I don't see anything special, except that it does reboot indeed (see `---<<BOOT>>---`) below:

```
2024-08-20T00:48:08     Notice  kernel   FreeBSD clang version 18.1.5 (https://github.com/llvm/llvm-project.git llvmorg-18.1.5-0-g617a15a9eac9)
2024-08-20T00:48:08     Notice  kernel   FreeBSD 14.1-RELEASE-p3 stable/24.7-n267778-bb2c86773c1b SMP amd64
2024-08-20T00:48:08     Notice  kernel   FreeBSD is a registered trademark of The FreeBSD Foundation.
2024-08-20T00:48:08     Notice  kernel          The Regents of the University of California. All rights reserved.
2024-08-20T00:48:08     Notice  kernel   Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
2024-08-20T00:48:08     Notice  kernel   Copyright (c) 1992-2023 The FreeBSD Project.
2024-08-20T00:48:08     Notice  kernel   ---<<BOOT>>---
2024-08-20T00:48:08     Notice  syslog-ng        syslog-ng starting up; version='4.8.0'
2024-08-20T00:38:49     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:38:33     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:38:22     Notice  opnsense         /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for opt1(igb2)
2024-08-20T00:38:17     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure dns (execute task : unbound_configure_do())
2024-08-20T00:38:17     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure dns (execute task : dnsmasq_configure_do())
2024-08-20T00:38:17     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:38:17     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure dns ()
2024-08-20T00:38:17     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:38:16     Error   opnsense         /usr/local/etc/rc.linkup: The command '/usr/sbin/daemon -f -p '/var/run/dhcpleases6.pid' '/usr/local/opnsense/scripts/dhcp/prefixes.sh'' returned exit code '3', the output was 'daemon: process already running, pid: 9786'
2024-08-20T00:38:15     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
2024-08-20T00:38:15     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure dhcp ()
2024-08-20T00:38:15     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure ipsec (execute task : ipsec_configure_do(,opt1))
2024-08-20T00:38:15     Notice  opnsense         /usr/local/etc/rc.linkup: plugins_configure ipsec (,opt1)
2024-08-20T00:38:15     Notice  opnsense         /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt1'
2024-08-20T00:38:14     Notice  opnsense         /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for opt1(igb2)
2024-08-20T00:38:13     Notice  kernel   <6>igb2: link state changed to DOWN
2024-08-20T00:38:13     Notice  kernel   <6>igb2: link state changed to UP
2024-08-20T00:36:11     Notice  dhcp6c   dhcp6c_script: RENEW on igb0_vlan10 executing
2024-08-20T00:35:49     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:30:30     Notice  dhclient         dhclient-script: Creating resolv.conf
2024-08-20T00:30:30     Notice  dhclient         dhclient-script: Reason RENEW on igb0_vlan10 executing
2024-08-20T00:28:17     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:19:54     Warning radvd    prefix length should be 64 for igb1
2024-08-20T00:12:11     Notice  dhcp6c   dhcp6c_script: RENEW on igb0_vlan10 executing
```

In the attached system(1).log, it can be seen that it happens regularly.

Why is that?

No idea if this is even remotely related but I was having the same issue and it turns out I had bad ram in the appliance. Replaced it and am living happily ever after.

Good luck diagnosing and fixing your problem.

As can be gleaned from the log, your device seems to be a PCENGINES APU.2E2 with only 2 GByte of soldered DDR3 RAM, so I would think this is pre-tested RAM.

According to OpnSense hardware requirements, 2 GByte is at the bare minimum for systems that "run all OPNsense standard features that do not need disk writes".

When I tried an OpnSense VM and gave it only 2 GByte of memory, I experienced hangups, which in your case, may or may not trigger a watchdog. I think, depending on what settings you use, this size of RAM might be on the edge or even insufficient and maybe the situation has gotten worse with the advent of FreeBSD 14.1 kernel in OpnSense.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+

Thanks for the answers!

I actually have an APU.6B4 [1] with 4 GB DDR3-1333 DRAM.
Shouldn't that be enough?

Also is there a way to run diagnostics with OPNsense, in case my RAM is faulty?

[1]: https://www.pcengines.ch/apu6b4.htm

Without a swap partition to capture kernel crash reports it's impossible to say what the issue could be.


Cheers,
Franco

August 20, 2024, 12:13:00 PM #5 Last Edit: August 20, 2024, 12:50:25 PM by jones
> Without a swap partition to capture kernel crash reports it's impossible to say what the issue could be.

Oh, let me enable the swap partition!

Will the crash reports then appear in the same log file?

EDIT: I did enable the swap partition in System > Settings > Miscellaneous and rebooted the router, and found a widget on the dashboard that shows it. Somehow it shows 10GB, which seems off (in "Miscellaneous" it was mentioned 2GB).

EDIT 2: Actually, `$ top` says that there is now 10GB of swap as well:

```
CPU:  5.6% user,  0.0% nice,  2.4% system,  0.0% interrupt, 92.0% idle
Mem: 148M Active, 141M Inact, 547M Wired, 56K Buf, 3071M Free
ARC: 219M Total, 70M MFU, 109M MRU, 3288K Anon, 1915K Header, 35M Other
     134M Compressed, 360M Uncompressed, 2.68:1 Ratio
Swap: 10G Total, 10G Free
```

I guess now I just have to wait for the next crash and find a way to extract the kernel crash reports!

Ok swap partition is there then, but swap file don't work for dumps.

Nothing in the crash reporter? If not it's a power issue and the box just turns off/reboots.

Cheers,
Franco

> Ok swap partition is there then

Maybe I was not clear. I just enabled it, and now it seems to be there! So I need to wait for a new reboot (which should happen today anyway) :-).

> but swap file don't work for dumps.

And the system.log I attached above is a "dump", so it wouldn't be attached there? Is that what you mean?

> Nothing in the crash reporter?

I'll go read about the "crash reporter" so that I find it when it reboots again. I'll keep you posted!

Sorry to be clear:

swap partition -> can do crash dumps
swap file option -> cannot do crash dumps

But I think the swap file is 2 gb so the partition should be there at 8 gb too. Fingers crossed. :)


Cheers,
Franco

Oh I see! So I probably already had a swap partition, and I did not need to enable the swap file. Still learning, sorry about that :-).

So there was something in the Crash Reporter, but I could not easily download it as a file so I submitted it. And now it disappeared! I thought I would get a copy on my email address, but I did not.

So I guess next time there is a crash I will go and manually copy-paste all the boxes of the crash dump, or something like that  ::)

If you submitted it please PM the email you used to submit. I can't find it at first glance.


Cheers,
Franco

> If you submitted it please PM the email you used to submit. I can't find it at first glance.

I did, thanks!

In the meantime, it just rebooted again, but this time there is nothing in the crash reporter. The CPU temperature has been stable the whole morning at around 60 degrees Celcius, the RAM is stable around 15% of usage, the disk at 7%.

A power issue has been mentioned multiple times... should I try to buy a new power supply and see if that's better? Or is there a better way to check that?

Observing the crash in a console first might help, but it sure sounds like it's just restarting.


Cheers,
Franco

> Observing the crash in a console first might help

So connecting over serial to the router, staying logged in without doing anything in particular, and hoping that a message appears before it reboots?

Or did you mean something more elaborate? I have an easy serial access so I can do that :D

Ok, found them:

Tracing pid 4 tid 100060 td 0xfffffe0008d8f3a0
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0062c9eb90
vpanic() at vpanic+0x182/frame 0xfffffe0062c9ebe0
panic() at panic+0x43/frame 0xfffffe0062c9ec40
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0062c9eca0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0062c9ed00
calltrap() at calltrap+0x8/frame 0xfffffe0062c9ed00
--- trap 0xc, rip = 0xffffffff80cf2170, rsp = 0xfffffe0062c9edd0, rbp = 0xfffffe0062c9ede0 ---
turnstile_broadcast() at turnstile_broadcast+0x40/frame 0xfffffe0062c9ede0
__mtx_unlock_sleep() at __mtx_unlock_sleep+0x7a/frame 0xfffffe0062c9ee10
unlock_mtx() at unlock_mtx+0x30/frame 0xfffffe0062c9ee20
_sleep() at _sleep+0x141/frame 0xfffffe0062c9eea0
xpt_done_td() at xpt_done_td+0x8b/frame 0xfffffe0062c9eef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0062c9ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0062c9ef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---


Tracing pid 51170 tid 103454 td 0xfffffe00b16cd740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00afda7680
vpanic() at vpanic+0x182/frame 0xfffffe00afda76d0
panic() at panic+0x43/frame 0xfffffe00afda7730
pmap_remove_pages() at pmap_remove_pages+0x944/frame 0xfffffe00afda7880
exec_new_vmspace() at exec_new_vmspace+0x207/frame 0xfffffe00afda78e0
exec_elf64_imgact() at exec_elf64_imgact+0xb1c/frame 0xfffffe00afda79e0
kern_execve() at kern_execve+0x7dd/frame 0xfffffe00afda7d70
sys_execve() at sys_execve+0x5a/frame 0xfffffe00afda7e00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe00afda7f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00afda7f30
--- syscall (59, FreeBSD ELF64, execve), rip = 0x335ff689626a, rsp = 0x335ff3ef3508, rbp = 0x335ff3ef3650 ---


Tracing pid 4 tid 100060 td 0xfffffe0008d8f3a0
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0062c9ebb0
vpanic() at vpanic+0x182/frame 0xfffffe0062c9ec00
panic() at panic+0x43/frame 0xfffffe0062c9ec60
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0062c9ecc0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0062c9ed20
calltrap() at calltrap+0x8/frame 0xfffffe0062c9ed20
--- trap 0xc, rip = 0xffffffff80ce834e, rsp = 0xfffffe0062c9edf0, rbp = 0xfffffe0062c9ee20 ---
sleepq_add() at sleepq_add+0x13e/frame 0xfffffe0062c9ee20
_sleep() at _sleep+0x15d/frame 0xfffffe0062c9eea0
xpt_done_td() at xpt_done_td+0x8b/frame 0xfffffe0062c9eef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0062c9ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0062c9ef30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---


Looks like memory corruptions to me (all are different and very generic).


Cheers,
Franco