Hey everyone,
I got a new instance up and installed OPNsense. At the final stage, where I have to change the root password and/or reboot, I chose reboot, but the system can't restart gracefully: it freezes on the message "some processes would not die; ps axl advised".
Reproduce:
- Fresh instance with WAN DHCPv4, no LAGGs, VLANs, or LAN. Right after installation, at the "complete installation" step:
- - chose Reboot, did not change the root password
- - the console output freezes and the malfunction starts right after lighttpd is stopped
- - after sending Ctrl+Alt+Del it continues with "Invoking stop script 'beep'" through "Invoking stop script 'config'"
- - after a few seconds: "some processes would not die; ps axl advised"
- - console tty1 is frozen
- - webui is already down
- manual restart of instance
- server boots up
- Login UI -> Wizard
- - WAN DHCPv4, no LAN, Europe/Berlin timezone, changed root PW
- Restart from WebUI
- same behavior as when the install was done and it was time to reboot
- - ui shows "The system is rebooting now, please wait... "
So I manually restarted again. The version is:
OPNsense 24.1-amd64
FreeBSD 13.2-RELEASE-p9
OpenSSL 3.0.12
So I ran an "update from console". After the update the machine wanted to restart, and this time the reboot worked fine.
We are now on version:
OPNsense 24.1.9_4-amd64
FreeBSD 13.2-RELEASE-p11
OpenSSL 3.0.14
So I tested a reboot from the UI one last time after the successful update and restart, but the issue still persists when rebooting from the UI.
So I restarted manually again and tested a reboot from the console (8 - Shell -> "reboot" command):
Same behavior: it freezes; after sending Ctrl+Alt+Del the stop scripts are invoked up to 'config', and a few seconds after 'config' it again prints "some processes would not die; ps axl advised".
The instance is located in the Hetzner Cloud, where I already have several OPNsense systems running fine - same version, same instance type, same networking.
Would be nice if someone can help me find the issue.
Best regards
I've tested it now on another new system with the same ISO, just to be sure:
- same behavior
I downloaded the ISO again, had it uploaded by support, and tested with the new ISO instead of Hetzner's ISO:
- same behavior
I have another production system that works fine on the same version, but it is already months old.
//Edit
I have compared the two config.xml files, from the production system and the new system. Apart from a few deliberate changes in the production system, such as other NTP servers, an OpenVPN instance and so on, there are no differences. In terms of configuration, the systems are basically the same.
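For anyone who wants to repeat the config.xml comparison, here is a minimal sketch of one way to do it. It is not necessarily how the comparison above was made. It flattens each XML file into "path -> values" and reports only the paths that differ. The function names and the tiny XML snippets in the note below are made up for illustration.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Map each leaf element path (e.g. 'opnsense/system/hostname') to its text values."""
    root = ET.fromstring(xml_text)
    leaves = {}

    def walk(node, path):
        children = list(node)
        if not children:
            # Leaf element: record its text; repeated tags accumulate in a list.
            leaves.setdefault(path, []).append((node.text or "").strip())
        for child in children:
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return leaves


def diff_configs(xml_a, xml_b):
    """Return {path: (values_in_a, values_in_b)} for every path that differs."""
    a, b = flatten(xml_a), flatten(xml_b)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }
```

With two exported files it could be called as `diff_configs(open("production.xml").read(), open("new.xml").read())`, where both file names are placeholders. Attributes and element order are ignored in this sketch, so it only catches differences in element text.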
The production system is also located on the same server_type from the cloud, so the processor etc. is also identical.
I used both the ISO from Hetzner and one I pulled myself from the mirror. Both have the same problem. (The hash is identical anyway)
Attached is the output of "ps axl" on a running system. Since the system is frozen when the issue occurs, I can't get the output of the command in the failing state.
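One possible way around that, as an untested sketch: start a small logger before triggering the reboot, so that the last "ps axl" snapshot taken before the freeze is already on disk. The script below assumes Python is available on the box and that the log path is still writable during shutdown. The function names and the log path are placeholders, and the logger itself will be killed at some point during shutdown.

```python
import os
import subprocess
import time


def snapshot(cmd, path):
    """Run cmd once and append its timestamped output to path, forcing it to disk."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(path, "a") as log:
        log.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')} ---\n")
        log.write(result.stdout)
        log.flush()
        os.fsync(log.fileno())  # push the snapshot to disk before the next one
    return result.returncode


def watch(cmd, path, interval=1.0, count=None):
    """Take a snapshot every `interval` seconds; count=None loops until killed."""
    taken = 0
    while count is None or taken < count:
        snapshot(cmd, path)
        taken += 1
        time.sleep(interval)
    return taken
```

Running `watch(["ps", "axl"], "/root/ps_axl.log")` in the background and then issuing the reboot should leave the most recent process list in the log after the manual restart, provided the file system was not already read-only at that point.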
What I hadn't noticed until now is that the system is permanently running at over 50% CPU utilization, while an identical older system sits somewhere between 1% and 10%.
root@gw:~ # ps aux | sort -nrk 3,3 | head -n 10
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 6 100.0 0.0 0 16 - RL 14:40 1187:40.25 [rand_harvestq]
root 11 96.0 0.0 0 32 - RNL 14:40 1155:22.87 [idle]
unbound 81072 0.0 1.0 71624 41988 - Is 10:04 0:00.20 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root 96080 0.0 0.1 13572 2924 - S 10:04 0:00.18 /bin/sh /var/db/rrd/updaterrd.sh
root 95968 0.0 0.1 12728 2128 - Is 10:04 0:00.00 daemon: /var/db/rrd/updaterrd.sh[96080] (daemon)
root 95531 0.0 0.1 12748 2180 v7 Is+ 14:40 0:00.00 /usr/libexec/getty Pc ttyv7
root 94636 0.0 0.1 12748 2180 v6 Is+ 14:40 0:00.00 /usr/libexec/getty Pc ttyv6
root 94055 0.0 0.1 12748 2172 v5 Is+ 14:40 0:00.00 /usr/libexec/getty Pc ttyv5
root 93937 0.0 0.1 12748 2180 v4 Is+ 14:40 0:00.00 /usr/libexec/getty Pc ttyv4
root 93515 0.0 0.1 12748 2176 v3 Is+ 14:40 0:00.00 /usr/libexec/getty Pc ttyv3
root@gw:~ # vmstat 2 5
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
1 0 0 838M 1.9G 1.2K 0 0 0 1.1K 11 0 0 4 583 115 1 50 49
0 0 0 853M 1.9G 16K 0 0 10 17K 21 1 0 3 6.5K 335 4 54 42
0 0 0 838M 1.9G 7.2K 0 0 7 7.9K 20 27 0 18 1.4K 205 1 51 48
0 0 0 838M 1.9G 82 0 0 0 105 20 0 0 2 93 101 0 50 50
0 0 0 841M 1.9G 13K 0 0 0 14K 20 0 0 2 4.4K 257 2 53 45
root@gw:~ # top -b -o cpu -n 10 | head -n 20
last pid: 55468; load averages: 1.17, 1.10, 1.05 up 0+19:51:44 10:32:12
45 processes: 1 running, 44 sleeping
CPU: 1.0% user, 0.0% nice, 50.4% system, 0.0% interrupt, 48.6% idle
Mem: 59M Active, 1228M Inact, 634M Wired, 386M Buf, 1901M Free
Swap: 8192M Total, 8192M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
18456 root 1 20 0 26M 14M select 1 0:13 0.00% python3.11
9919 root 3 20 0 40M 13M kqread 1 0:12 0.00% syslog-ng
224 root 1 20 0 86M 39M accept 1 0:10 0.00% python3.11
53977 root 1 20 0 22M 11M kqread 1 0:07 0.00% lighttpd
55066 root 1 52 0 63M 32M accept 1 0:06 0.00% php-cgi
85232 root 1 20 0 27M 15M select 1 0:03 0.00% python3.11
19307 root 1 52 0 61M 30M accept 1 0:02 0.00% php-cgi
40421 root 1 52 0 59M 30M accept 1 0:01 0.00% php-cgi
56995 root 1 52 0 13M 2512K nanslp 1 0:01 0.00% cron
62065 root 1 52 0 57M 30M accept 1 0:01 0.00% php-cgi
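To put a number on the CPU split, the vmstat samples above can be averaged with a few lines of Python. This is only a sketch: it assumes the last three columns of each sample row are us/sy/id, as in the output shown, and the function name is made up.

```python
def mean_cpu_split(vmstat_output):
    """Average the trailing us/sy/id columns over all vmstat sample rows."""
    rows = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        # Sample rows start with a number (the 'r' column); header rows do not.
        if len(fields) >= 3 and fields[0].isdigit():
            rows.append([int(value) for value in fields[-3:]])
    if not rows:
        raise ValueError("no vmstat sample rows found")
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# The five samples from the "vmstat 2 5" run above.
SAMPLES = """\
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
1 0 0 838M 1.9G 1.2K 0 0 0 1.1K 11 0 0 4 583 115 1 50 49
0 0 0 853M 1.9G 16K 0 0 10 17K 21 1 0 3 6.5K 335 4 54 42
0 0 0 838M 1.9G 7.2K 0 0 7 7.9K 20 27 0 18 1.4K 205 1 51 48
0 0 0 838M 1.9G 82 0 0 0 105 20 0 0 2 93 101 0 50 50
0 0 0 841M 1.9G 13K 0 0 0 14K 20 0 0 2 4.4K 257 2 53 45
"""

us, sy, idle = mean_cpu_split(SAMPLES)
print(f"user {us:.1f}%  system {sy:.1f}%  idle {idle:.1f}%")
# prints: user 1.6%  system 51.6%  idle 46.8%
```

Note that the first vmstat row reports averages since boot rather than the current interval, so it may be worth dropping it when comparing against the healthy system.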
I also noticed something today when setting up the OpenVPN, although both systems - the defective one and the productive working one - have the same patch level:
OPNsense 24.1.9_4-amd64
FreeBSD 13.2-RELEASE-p11
OpenSSL 3.0.14
The client configuration section in VPN -> OpenVPN -> Servers looks different on the two systems; have a look at the screenshots. productive.png is the working system, malfunctional.png is the defective system.
Current state:
- It's isolated to the Q35 chipset; i440fx works fine
- 24.1 has multiple issues (reboots not working, high CPU usage)
- 23.7 seems to have only the high CPU usage; reboots work fine
Currently it's unknown whether this affects Q35 in general or only the machines provided by our hosting provider.