Total freeze of router after 22.1.2_1 update

Started by Edwin70, March 03, 2022, 02:51:14 PM

Previous topic - Next topic
Hello,

After the 22.1.2_1 update I did this morning I have an issue that the router (a Protectli Vault – 4 Port): every noew and then (the interval is between 30 minutes and 90 minutes) the router becomes totally unresponsive. No network traffic, no response from the webGUI, nothing. After 1 or 2 minutes the router comes back as if nothing happened. There is nothing in de logs (as far as I can see) and I don't see a spike in the CPU or something.

I see a dip to zero in the "System Information - Utilization and Processes | Processor" (See attachment).

Any ideas what this can be or how I can get more information?

You could try the usual methods to narrow down the problem. Check processes with ps and top, look for clues in dmesg, etc.

What kind of console do these protectli devices have? Connect a serial or a VGA monitor and watch ...
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)


March 10, 2022, 09:23:27 PM #4 Last Edit: March 10, 2022, 09:38:14 PM by teosoft
For the record, I have the same hardware and the same problem.
I see CPU usage at 100%

The only way to make it work I've found so far is to switch to a backup hardware and not upgrade to OPNsense 22.1.2_1-amd64.

Backup hardware is running OPNsense 22.1.1_3-amd64




March 10, 2022, 09:53:08 PM #5 Last Edit: March 10, 2022, 10:50:14 PM by Edwin70
I wasn't able to look into it further because I'm on a short holiday. But this does explain it. Thanks for the update.
My synology backup to an external cloud provider also fails daily. Probably also caused by this issue.

Unfortunately, I don't have backup hardware available. So I hope a solution will be available soon. I wonder if the people from Protectli have looked into this issue.
Edit: I submitted a support ticket at Protectli.

I have a FW4A 4 port Protectli box and I upgraded it last night.  Both of us have been working all day with no performance issues but I'm going to track its CPU usage to see if I can help confirm.  My only break in CPU utilization was as it was upgrading.

The router (and thus the internet connection) is totally unworkable. I saw a different post about unbound being a CPU issue (https://forum.opnsense.org/index.php?topic=27372.0), so I turned off unbound but without a result.

So, I dug a little deeper. First I ran the top command in a shell. The top command shows the following anomaly: 3 or 4 python3 processes are running at around 100%, at that point the OPNSense web client is unresponsive. Is this normal?
I also checked the Systems:Diagnostics:Activity and I often see the next process at around 100%.
/usr/local/bin/python3 /usr/local/opnsense/scripts/net-flow/flowd_aggregate.py (python3.8)

Next thing, I went into the logging. First the general log. The dreaded detached event. Repeated over en over again (about every 15 seconds):

2022-03-13T16:00:14 Error opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for dynamic wan(igb0)
2022-03-13T16:00:12 Error opnsense /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for dynamic wan(igb0)


Then, in the backend logging I see this group of messages starting with "Linkup stopping ibg0" and it also repeats every 15 seconds.:
2022-03-13T16:00:14 Notice configd.py [2fe1a6bf-9de6-4587-94bd-1167389692e1] Linkup starting igb0
2022-03-13T16:00:14 Notice configd.py [fabf8c49-b168-4ed7-b42d-f1b2647aaa35] refresh url table aliases
2022-03-13T16:00:14 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2022-03-13T16:00:14 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_tables.conf
2022-03-13T16:00:14 Notice configd.py generate template container OPNsense/Filter
2022-03-13T16:00:14 Notice configd.py [57ceaac2-7276-4ebb-a33c-7c8a9e60f4f6] generate template OPNsense/Filter
2022-03-13T16:00:12 Info configd.py message bca4da3d-f319-410f-b859-6bdb7e39b8af [filter.refresh_aliases] returned {"status": "ok"}
2022-03-13T16:00:12 Notice configd.py [c818074e-e5e8-409c-af59-fbdc38e355e2] request pf current overall table record count and table-entries limit
2022-03-13T16:00:12 Notice configd.py [216f9585-e577-4763-8111-8bf13be4fc78] Reloading filter
2022-03-13T16:00:11 Notice configd.py [ad5a48da-7168-4194-a278-3793fe655f20] Linkup stopping igb0


What is cause and what is effect? Is it the detached event that starts the steps in the backend log? Or is there a process running that stops the WAN interface which leads to the detached event? The words "Linkup stopping igb0" suggest it is done on purpose and the timestamps also indicate this: stopping at 16:00:11 and detached event at 16:00:12.

Anybody any ideas?

For now I want to go back to a previous version of OPNSense, but where can I find the previous version 22.1.1_3? The mirrors only show the latest. I went back to a 21.7.1 release I still had on a USB stick and then to 21.7.8. And now it runs without problems.
I was lucky to get a second identical machine back from a friend, so I still have the problem machine if anyone has any solutions I can try.

In your setup I would do this after a reinstall or upgrade:
- Disable netflow completely.
- Disable aliases, especially the url table aliases.
- Then I would delete and recreate them. The disablement is to get any details from them.
- Then re-enable them one by one and monitor performance.

I will disable netflow and see what happens.

As for Aliases: I haven't added my own, only the defaults (bogons, bogonsv6, sshlockout and virusprot) are there. Or am I looking in the wrong place?

That's the right place. I'm going by these entries:
2022-03-13T16:00:14 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_geoip.conf
2022-03-13T16:00:14 Debug configd.py OPNsense/Filter generated //usr/local/etc/filter_tables.conf

But it sounds like these are maybe autogenerated as part of something else, or even the log might be checking if they need to be created, so it could be nothing.
But then netflow is IMO unlikey to be what causes the link to go down, it'll reduce the spikes in processing for sure.
It would be useful to know what is the WAN setup though. Is it single WAN, has it got ipv6 enabled, does it have VLAN tags, etc. As much info as possible. See the release notes, there are changes to the interfaces setup required.

March 14, 2022, 02:40:25 PM #11 Last Edit: March 14, 2022, 02:58:23 PM by Edwin70
It is a single WAN (to a bridged ZIGGO cable modem), IPv6 config type is none and no VLAN. The filters are indeed standard; in generic configuration for the WAN interface the block private and bogon networks is enabled. So that  generated the filters.
MTU is blank.
The only odd thing I have in the WAN config is a DHCP option modifier: "supersede dhcp-server-identifier 255.255.255.255" which I added due to a different problem I had in the past.

Edit: I also have entered MAC-address in the config. And I see two other subjects talking about WAN flapping with MAC-spoofing in combination with intrusion detection. Although I don't have intrusion detection enabled, this MAC spoofing seems to have a bad effect on the WAN connection. So I will experiment with that.