after fresh install: some processes would not die; ps axl advised

Started by trezyckz, July 08, 2024, 11:09:36 AM

Hey everyone,

I got a new instance up and installed OPNsense. At the final stage, where I have to change the root password and/or reboot, I chose reboot, but the system can't restart gracefully: it freezes on the message "some processes would not die; ps axl advised".

Reproduce:
- Fresh instance with WAN DHCPv4; no LAGGs, VLANs, or LAN. Right after installation, at the step that completes the installation:
- - chose Reboot, did not change the root password
- - the console output freezes and the malfunction starts right after lighttpd is stopped
- - on sending Ctrl+Alt+Del, it continues with "Invoking stop script 'beep'" through "Invoking stop script 'config'"
- - after a few seconds: "some processes would not die; ps axl advised"
- - console tty1 is frozen
- - web UI is already down

- manual restart of instance
- server boots up
- Login UI -> Wizard
- - WAN DHCPv4, no LAN, Europe/Berlin timezone, changed root PW
- Restart from WebUI
- same behavior as when the install was done and it was time to reboot
- - UI shows "The system is rebooting now, please wait... "

So I manually restarted again. The version is:
OPNsense 24.1-amd64
FreeBSD 13.2-RELEASE-p9
OpenSSL 3.0.12

So I ran an "update from console". After the update, the machine wanted to restart, and this time the reboot worked fine.

We are now on version:
OPNsense 24.1.9_4-amd64
FreeBSD 13.2-RELEASE-p11
OpenSSL 3.0.14

So I tried rebooting from the UI one last time after the successful update and restart, but the issue still persists with a reboot from the UI.

So I rebooted again manually and tested a reboot from the console (8 - Shell -> "reboot" command):
Same behavior: it freezes; after sending Ctrl+Alt+Del, the stop scripts are invoked up to 'config', and a few seconds after 'config' it again prints "some processes would not die; ps axl advised".

The instance is located in the Hetzner Cloud, where I already have several OPNsense systems running fine - same version, same instance type, same networking.

Would be nice if someone can help me find the issue.

Best regards

I've tested it now on another new system with the same ISO, just to be sure:
- same behavior

Downloaded the ISO again, had support upload it, and tested with the new ISO instead of Hetzner's ISO:
- same behavior

I have another production system which works fine on the same version, but it's already months old.

//Edit
I have compared the two config.xml files, from the production system and the new system. Apart from a few deliberate changes in the production system, such as other NTP servers, an OpenVPN instance and so on, there are no differences. In terms of configuration, the systems are basically the same.
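For anyone who wants to repeat this comparison, here is a minimal sketch of how two config.xml exports could be diffed leaf by leaf instead of by eye. The function names `flatten` and `diff_configs` are made up for this example, element attributes are ignored, and repeated sibling elements are matched by position only, so reordered entries will show up as differences.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Map each leaf element's path (with sibling index) to its text."""
    result = {}

    def walk(node, path):
        children = list(node)
        if not children:
            result[path] = (node.text or "").strip()
            return
        seen = {}  # per-tag counter so repeated siblings get distinct paths
        for child in children:
            idx = seen.get(child.tag, 0)
            seen[child.tag] = idx + 1
            walk(child, f"{path}/{child.tag}[{idx}]")

    root = ET.fromstring(xml_text)
    walk(root, root.tag)
    return result


def diff_configs(old_xml, new_xml):
    """Return {path: (old_value, new_value)} for every differing leaf."""
    old, new = flatten(old_xml), flatten(new_xml)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    }
```

It would be called with the contents of the two files, for example `diff_configs(open("production.xml").read(), open("new.xml").read())`, where both file names are placeholders. A value of `None` on one side means the element exists on only one system.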

The production system is also located on the same server_type from the cloud, so the processor etc. is also identical.

I used both the ISO from Hetzner and one I pulled myself from the mirror. Both have the same problem. (The hash is identical anyway)

Attached is the output of "ps axl" on a running system. Because the system is frozen when the issue occurs, I can't get the output of the command in the malfunctioning state.
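One possible way around the frozen console is to log "ps axl" to disk repeatedly before triggering the reboot, so that the last snapshots written before the hang can be read after the manual restart. The following is only a sketch under assumptions: the function names and the log path are hypothetical, and it relies on Python being available on the box (python3.11 shows up in the process list). The logger itself is terminated during shutdown like any other process, so the log only covers the time until then.

```python
import os
import subprocess
import time


def snapshot(log_path, command=("ps", "axl")):
    """Append one timestamped run of `command` to log_path and force it to disk."""
    output = subprocess.run(
        command, capture_output=True, text=True, check=False
    ).stdout
    with open(log_path, "a") as log:
        log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n{output}\n")
        log.flush()
        os.fsync(log.fileno())  # the box may hang before buffers are written
    return output


def watch(log_path, interval=1.0, count=None, command=("ps", "axl")):
    """Take snapshots every `interval` seconds; count=None loops until killed."""
    taken = 0
    while count is None or taken < count:
        snapshot(log_path, command)
        taken += 1
        time.sleep(interval)
    return taken
```

Started in the background with something like `watch("/root/ps-axl.log")` (placeholder path) shortly before the reboot, the tail of the log should show which processes were still around when the stop scripts ran.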

What I hadn't really noticed until now is that the system is permanently running at over 50% CPU utilization, while an identical older system sits somewhere between 1% and 10%.


root@gw:~ # ps aux | sort -nrk 3,3 | head -n 10

USER      PID  %CPU %MEM   VSZ   RSS TT  STAT STARTED       TIME COMMAND
root        6 100.0  0.0     0    16  -  RL   14:40   1187:40.25 [rand_harvestq]
root       11  96.0  0.0     0    32  -  RNL  14:40   1155:22.87 [idle]
unbound 81072   0.0  1.0 71624 41988  -  Is   10:04      0:00.20 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    96080   0.0  0.1 13572  2924  -  S    10:04      0:00.18 /bin/sh /var/db/rrd/updaterrd.sh
root    95968   0.0  0.1 12728  2128  -  Is   10:04      0:00.00 daemon: /var/db/rrd/updaterrd.sh[96080] (daemon)
root    95531   0.0  0.1 12748  2180 v7  Is+  14:40      0:00.00 /usr/libexec/getty Pc ttyv7
root    94636   0.0  0.1 12748  2180 v6  Is+  14:40      0:00.00 /usr/libexec/getty Pc ttyv6
root    94055   0.0  0.1 12748  2172 v5  Is+  14:40      0:00.00 /usr/libexec/getty Pc ttyv5
root    93937   0.0  0.1 12748  2180 v4  Is+  14:40      0:00.00 /usr/libexec/getty Pc ttyv4
root    93515   0.0  0.1 12748  2176 v3  Is+  14:40      0:00.00 /usr/libexec/getty Pc ttyv3



root@gw:~ # vmstat 2 5


procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
1  0  0 838M 1.9G 1.2K   0   0   0 1.1K   11   0   0    4  583  115  1 50 49
0  0  0 853M 1.9G  16K   0   0  10  17K   21   1   0    3 6.5K  335  4 54 42
0  0  0 838M 1.9G 7.2K   0   0   7 7.9K   20  27   0   18 1.4K  205  1 51 48
0  0  0 838M 1.9G   82   0   0   0  105   20   0   0    2   93  101  0 50 50
0  0  0 841M 1.9G  13K   0   0   0  14K   20   0   0    2 4.4K  257  2 53 45
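To put a number on the vmstat samples above, here is a small sketch that averages the last three columns (us, sy, id) over all data rows. The function names are made up for this example, and it assumes the column layout shown in this output, with the CPU columns last.

```python
def cpu_columns(vmstat_text):
    """Return (us, sy, id) tuples from the data rows of `vmstat` output."""
    rows = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows start with a number; header rows start with "procs" or "r".
        if len(fields) >= 3 and fields[0].isdigit():
            us, sy, idle = (int(x) for x in fields[-3:])
            rows.append((us, sy, idle))
    return rows


def mean_cpu(vmstat_text):
    """Average user, system and idle percentages over all samples."""
    rows = cpu_columns(vmstat_text)
    if not rows:
        raise ValueError("no vmstat data rows found")
    return tuple(sum(col) / len(rows) for col in zip(*rows))


sample = """\
procs    memory    page                      disks     faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr da0 cd0   in   sy   cs us sy id
1  0  0 838M 1.9G 1.2K   0   0   0 1.1K   11   0   0    4  583  115  1 50 49
0  0  0 853M 1.9G  16K   0   0  10  17K   21   1   0    3 6.5K  335  4 54 42
0  0  0 838M 1.9G 7.2K   0   0   7 7.9K   20  27   0   18 1.4K  205  1 51 48
0  0  0 838M 1.9G   82   0   0   0  105   20   0   0    2   93  101  0 50 50
0  0  0 841M 1.9G  13K   0   0   0  14K   20   0   0    2 4.4K  257  2 53 45
"""
print(mean_cpu(sample))  # (1.6, 51.6, 46.8)
```

For the five samples above this gives roughly 1.6% user, 51.6% system and 46.8% idle, so nearly all of the load is kernel time. Note that the first vmstat row reports averages since boot rather than a 2-second interval.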



root@gw:~ # top -b -o cpu -n 10 | head -n 20
last pid: 55468;  load averages:  1.17,  1.10,  1.05  up 0+19:51:44    10:32:12
45 processes:  1 running, 44 sleeping
CPU:  1.0% user,  0.0% nice, 50.4% system,  0.0% interrupt, 48.6% idle
Mem: 59M Active, 1228M Inact, 634M Wired, 386M Buf, 1901M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
18456 root          1  20    0    26M    14M select   1   0:13   0.00% python3.11
 9919 root          3  20    0    40M    13M kqread   1   0:12   0.00% syslog-ng
  224 root          1  20    0    86M    39M accept   1   0:10   0.00% python3.11
53977 root          1  20    0    22M    11M kqread   1   0:07   0.00% lighttpd
55066 root          1  52    0    63M    32M accept   1   0:06   0.00% php-cgi
85232 root          1  20    0    27M    15M select   1   0:03   0.00% python3.11
19307 root          1  52    0    61M    30M accept   1   0:02   0.00% php-cgi
40421 root          1  52    0    59M    30M accept   1   0:01   0.00% php-cgi
56995 root          1  52    0    13M  2512K nanslp   1   0:01   0.00% cron
62065 root          1  52    0    57M    30M accept   1   0:01   0.00% php-cgi




I also noticed something today while setting up OpenVPN, even though both systems (the defective one and the working production one) have the same patch level:

OPNsense 24.1.9_4-amd64
FreeBSD 13.2-RELEASE-p11
OpenSSL 3.0.14

The client configuration section in VPN -> OpenVPN -> Servers looks different on the two systems; have a look at the screenshots. productive.png is the working system, malfunctional.png is the defective one.

Current state:
- It's isolated to the Q35 chipset; i440FX works fine
- 24.1 has multiple issues (restarts not working, high CPU usage)
- 23.7 seems to only have the high CPU usage, but reboots work fine

At the moment it's unknown whether it's Q35 in general or just the machines provided by our hosting provider.