I installed OPNsense 23.1.7 for the first time last week on a new system (Intel i3-7100U, Realtek 8111 NICs). OPNsense keeps freezing roughly every 15-20 minutes (about 3 times an hour), every single hour. I have scoured the logs and checked dmesg, but nothing is logged at or near those times. I can confirm this affects both wireless and wired clients. I can see the lockup because SSH sessions freeze and netdata shows a flat line (0) in the bandwidth during that time (which never happens on a busy network).
I have even tried switching to an all-new box (i5-8500, Intel X520 NICs) with a completely new motherboard, RAM, SSD, etc. to rule out the hardware, with no change in results. Can anyone point me in the right direction on how to troubleshoot or remedy this issue? I would really appreciate it, as I'm at the end of my rope here and this is impacting my meetings while WFH.
So, how is your core temperature? You could have a look at the health section to see if there is something going on.
Have you manually enabled any enhancements like the C3 state? Is PowerD enabled?
Edit: And can you please give a short overview of the other hardware components like SSD, RAM? Is your system running with ZFS or UFS?
Sure, I haven't seen any alerts or anything concerning in the health section. CPU usage is back in single digits at full bandwidth (300/300) thanks to the Intel NIC. I have another similar model SFF (HP ProDesk 400 G4) regularly pushed to limits on another machine and it has no issues with heat thanks to the power-efficiency of the i5-8500T (typo in OP).
I'm not sure about other additional enhancements like C3, as I haven't gone off the rails adding things, but I did try changing PowerD from default Adaptive to Hi-Adaptive to Maximum to rule out any power issues (on both boxes).
System Components:
CPU: Intel i5-8400T
Memory: 8 GB SK Hynix DDR4 SODIMM (HMA81GS6CJR8N-VK)
SSD: 128 GB Samsung PM991 (MZALQ128HBHQ-000L1)
NIC: Intel X520-T2
Running on UFS
Is there anything else I missed that would be helpful to troubleshoot?
Being a bit predictable, maybe top or htop should give a hint where to look.
Your hardware should have enough power for running with ZFS. As far as I know, ZFS is way more stable than UFS in case of power loss.
If you haven't changed anything to C3 yourself, it'll use C1 by default, which is fine. How stable is your system if PowerD is disabled?
Also you could run a memtest on your system to ensure your RAM isn't faulty.
How is your BIOS configured? I'm using UEFI boot (no legacy) and have enabled C- and P-states. It's using Max Performance as default. Any HPET (High precision event timer) configured?
I'm using a VENOEN P09B2G hardware and upgraded memory from noname to 16 GB Crucial. SSD is a 256 GB Kingston one. CPU J4125.
Edit: Any additional plugins installed?
Quote from: cookiemonster on May 11, 2023, 05:15:07 PM
Being a bit predictable, maybe top or htop should give a hint where to look.
From direct console of course, as ssh sessions freeze too.
I will give this one a shot soon. Have to pull out an unmounted monitor and run it in the media closet that houses the ONT and OPNsense box. But I will absolutely try this as the SSH session top always froze and I never was able to see anything unusual in all of the attempts at trying it that way.
Quote from: Cyberturtle on May 11, 2023, 05:15:41 PM
Good to know about ZFS. I haven't used it yet as I thought it was primarily for arrays, but I will absolutely use it next time I configure a system. I don't think this issue is related to the Memory or the SSD as I had two completely independent systems (all different parts, none shared) experience the exact same issue (one Intel CPU-Realtek box, one Intel CPU-Intel NIC). I even swapped the RAM on the old box, ran memtest 4-pass with it on a different system, and it came up clean. Both systems had different NVMe drives as well.
BIOS is Legacy Boot. I'll have to check the P-states, C-states, and other options when I am able to pull out a monitor and bring down the internet for a little soon, but they were all unmodified from the Lenovo system defaults, aside from Legacy Boot, which was a quick fix for switching from Windows 10 UEFI to quickly install OPNsense on this second machine.
Checking the hardware is the first logical step, but you're experiencing it on two different systems, so that kind of helps. If the problem affects the network, i.e. the port and/or the services accessed on it such as ssh, then ssh is of little help, as you know.
I'd narrow it down first before changing system settings that were working before.
A few questions to figure out the scenario:
If the only change is an upgrade to 23.1.7 on a previously working system, can you downgrade to the previous version?
Is the WAN going to a router on bridge mode, something else?
Are you virtualizing any of this? You mention the hardware but not if you're installing OPN on a VM on it.
Are you on PPPoE? If not, what is the setup? A topology would be ideal.
Any services running, of the optional type? Suricata, Zenarmor, etc. Look out for the netflow process; there was a time when it was a high consumer of CPU cycles. No reports of it for a while, but if you have it enabled, see if disabling it helps.
Quote from: cookiemonster on May 11, 2023, 05:55:36 PM
This is a brand new install of 23.1.7. I was previously using consumer-level routers to run my network until last week, with decent traffic shaping, but finally decided to take the plunge into OPNsense.
Network Topology High-level:
Fiber --> Verizon FiOS ONT --> (Ethernet --> Intel x520-T2) OPNsense --> Switch --> Wireless APs & addt'l Switch to office
Everything is wireless (~40 clients) aside from the devices hardwired to the office switch. OPNsense is running bare-metal on the Lenovo m720q. I don't think Verizon FiOS uses PPPoE, but I could be wrong here. I don't think I'm running Suricata or Zenarmor, as I haven't installed or enabled them, but I do want to look into them eventually. I've attached a screenshot of the optional plugins installed and services running, if that's helpful.
Attached optional plug-ins installed (too large for last post)
Here's a sample screenshot showing the last hour of CPU and IPv4 packets per second. Notice the two flat lines at 0 pps, each lasting about one minute. This regularly happens 2-3 times per hour.
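For anyone who would rather check the numbers than eyeball a graph, here is a minimal Python sketch of how the flat lines could be picked out of a series of packets-per-second samples. It assumes the samples are already available as (timestamp, pps) pairs; how they get exported from the monitoring tool is not shown, and the function names `find_flat_lines` and `gaps_between` are made up for this example.

```python
def find_flat_lines(samples, min_samples=2):
    """Return (start, end) timestamps of runs where pps stayed at zero.

    `samples` is a time-ordered list of (timestamp, pps) pairs.
    Runs shorter than `min_samples` readings are ignored as noise.
    """
    runs = []
    run = []  # timestamps of the zero run currently being built
    for ts, pps in samples:
        if pps == 0:
            run.append(ts)
            continue
        if len(run) >= min_samples:
            runs.append((run[0], run[-1]))
        run = []
    if len(run) >= min_samples:  # the series ended in the middle of a run
        runs.append((run[0], run[-1]))
    return runs


def gaps_between(runs):
    """Seconds from the start of each flat line to the start of the next."""
    return [later[0] - earlier[0] for earlier, later in zip(runs, runs[1:])]
```

If the gaps come out roughly constant, that hints at something periodic (a timer, a lease, a scheduled job) rather than load.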
A few more data points to help isolate this issue:
I am able to access top, htop, etc. and notice no CPU spikes or performance issues when the network outage is happening on the LAN, so it doesn't appear to be a CPU lockup or anything like that. top and htop don't reveal any programs doing anything out of the ordinary during that time either.
When I run pings to 8.8.8.8 from a downstream client device, the pings time out on the client device during the outage period.
When I run pings to 8.8.8.8 from OPNsense shell, the pings are successful 100% of the time, even during the outage period for LAN devices.
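The ping comparison above can be made repeatable with a small timestamped probe run on both machines. This is only a sketch: it assumes Python 3 and a Unix-style `ping` that accepts `-c 1` on each machine, and `ping_once` and `monitor` are names invented for this example. The time limit is enforced by Python rather than by a ping flag, because the timeout flags differ between platforms.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout=2.0):
    """Return True if a single ICMP echo to `host` is answered in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # enforced by Python, not by a ping option
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def monitor(host, count, interval=5.0, probe=ping_once,
            clock=time.time, sleep=time.sleep):
    """Probe `host` `count` times and return a list of (timestamp, ok)."""
    samples = []
    for _ in range(count):
        now = clock()
        ok = probe(host)
        samples.append((now, ok))
        stamp = datetime.fromtimestamp(now).isoformat(timespec="seconds")
        print(f"{stamp} {host} {'ok' if ok else 'FAIL'}", flush=True)
        sleep(interval)
    return samples
```

Calling `monitor("8.8.8.8", count=720, interval=5)` would cover about an hour. The `probe`, `clock` and `sleep` parameters exist only so the loop can be exercised without a network.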
It doesn't seem like the CPU is overtaxed. I would check dmesg at the console when it happens.
We're looking for clues in that log buffer even if top doesn't report a spike, maybe some errors.
Your diagnosis seems to suggest the problem could be downstream from the firewall. What I would do, after restarting the switch just in case, is diagnose at both ends in parallel: wired client and firewall. We want to eliminate wireless from the equation for now.
Start with dmesg and top at the firewall. Network diagnostics from the client: ping, nslookup, etc.
And I would reconfigure it without AdGuard too, to eliminate name resolution blocks. That wouldn't explain a network freeze at the client as you know.
That said, when you say OPN freezes, can you describe where (a particular settings page), or something else? I'm thinking that, from the diagnostics so far, if the network stutters (let's say the switch drops packets), then from the client it would look like OPN is frozen when it is just the link to it that is. Thinking aloud here.
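To make the "both ends in parallel" idea concrete, here is a rough Python sketch that lines up two ping logs, one from a wired client and one from the firewall, and labels each time slot. It assumes each log is a list of (timestamp, ok) pairs and that the two clocks are roughly in sync; the `classify` function and its verdict strings are invented for illustration.

```python
VERDICTS = {
    (True, True): "ok",
    (False, True): "client only: suspect the LAN side",
    (False, False): "both: suspect the firewall or upstream",
    (True, False): "firewall only: unexpected, recheck the probe",
}


def classify(client, firewall, bucket=5):
    """Compare two ping logs taken in parallel.

    Samples are grouped into `bucket`-second slots so that small clock
    offsets between the two machines do not matter. Only slots present
    in both logs are reported, keyed by the slot's start time.
    """
    def slots(samples):
        seen = {}
        for ts, ok in samples:
            key = int(ts // bucket)
            # A slot counts as failed if any probe inside it failed.
            seen[key] = seen.get(key, True) and ok
        return seen

    c, f = slots(client), slots(firewall)
    return {key * bucket: VERDICTS[(c[key], f[key])]
            for key in sorted(c.keys() & f.keys())}
```

Slots where only the client fails match the pattern described earlier in the thread, where pings from the firewall shell kept succeeding while LAN clients timed out.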
Quote from: cookiemonster on May 11, 2023, 10:34:20 PM
I think you're spot on in narrowing the issue down to something downstream on the LAN. All my devices, including those hardwired to the switch just below OPNsense, were losing all internet access, including ping to 8.8.8.8 (to rule out DNS). I tried replacing the main switch downstream from OPNsense, going from a 1G switch to a 2.5G switch (a planned upgrade anyway as part of this project), but the results didn't change. One thing I did notice: an old AiMesh node I was using as a temporary switch while waiting for a 5-port switch, and which was not configured as part of the new AiMesh AP configuration, started turning its lights on and off and blinking rapidly whenever my clients lost internet connectivity during the minute of downtime. I replaced that old AiMesh node (not part of my newer AiMesh wireless AP system) and the connectivity issues have not recurred in the hour since. I think that may have been the issue all along.
I've attached a better diagram of the network topology at play here BEFORE I replaced that old AiMesh "switch". The current topology replaces that OLD AiMesh "switch" with an actual unmanaged switch. Everything seems to be working now. I will of course update if the issue re-emerges, but I think we're in the clear now. Thank you all for your time and help looking into this.
I'm glad you've solved it.