OPNsense GUI / CLI not accessible (obsolete)

Started by tokade, December 24, 2019, 02:40:12 PM

December 24, 2019, 02:40:12 PM Last Edit: December 30, 2019, 11:33:05 PM by tokade
Hi all,

Since the last update to

OPNsense 19.7.8-amd64
FreeBSD 11.2-RELEASE-p16-HBSD
OpenSSL 1.0.2t 10 Sep 2019

something strange has been happening every night to my OPNsense system, which runs as a VM under Xen. In the morning the system is accessible neither through the GUI nor via the CLI (ssh). All other functions seem OK; at least I can access the internet.

I can't find anything in the logs, and even a fresh installation with the configuration restored from backup shows the same behavior.

I can force this behavior by running a health check via the GUI or by running pkg check -sav in an ssh session. Both make the GUI and the CLI die.

What can I do? Where can I look for more information about what is going wrong?

Any help is appreciated, since I'm new to OPNsense...

Kind regards
Torsten

Does the system have enough memory? Sounds like an OOM kill.

Hi Fabian,

Thanks for your reply. The system runs as a domU under Xen with the following parameters:
- Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz (2 cores)
- 2 GB RAM
- 40 GB disk (root & 4 GB swap)

I can increase the memory if 2 GB isn't enough.

This morning the same thing happened again. I could only ping the system; everything else was "frozen". No access via GUI / CLI, no Internet connection from the dom0 or the LAN clients. Here are the logs from last night and from this morning after the hard "reset" of the domU:

configd.log
Dec 25 02:51:50 stargate-fw configd.py: [bc5268e0-4f45-4f6f-ad02-c6074ef0803c] update IPv6 prefixes
Dec 25 02:54:49 stargate-fw configd.py: [55f0f84d-93a7-4a9a-91ab-daa956e522ec] update IPv6 prefixes
Dec 25 03:00:55 stargate-fw configd.py: [7fdfe1ee-2093-4192-9d2d-215fb11d8f11] update IPv6 prefixes
Dec 25 08:49:37 stargate-fw configd.py: [2cdab0be-9148-401b-85e6-07f2fa488d15] Linkup starting xn5
Dec 25 08:49:37 stargate-fw configd.py: [326cb23a-9e40-4563-848b-8904025946f4] Linkup starting xn3

dhcpd.log
Dec 25 03:00:54 stargate-fw dhcpd: Solicit message from fe80::215:99ff:fe96:3ae5 port 546, transaction ID 0xBBC37200
Dec 25 03:00:54 stargate-fw dhcpd: Advertise NA: address XXXX:192:168:6:c2f9 to client with duid XXXX iaid = 24183525 valid for 7200 seconds
Dec 25 03:00:54 stargate-fw dhcpd: Sending Advertise to fe80::215:99ff:fe96:3ae5 port 546
Dec 25 03:00:55 stargate-fw dhcpd: Request message from fe80::215:99ff:fe96:3ae5 port 546, transaction ID 0xA058E100
Dec 25 03:00:55 stargate-fw dhcpd: Reply NA: address XXXX:192:168:6:c2f9 to client with duid XXXX iaid = 24183525 valid for 7200 seconds
Dec 25 03:00:55 stargate-fw dhcpd: Sending Reply to fe80::215:99ff:fe96:3ae5 port 546
Dec 25 08:49:40 stargate-fw dhcpd: Internet Systems Consortium DHCP Server 4.4.1
Dec 25 08:49:40 stargate-fw dhcpd: Copyright 2004-2018 Internet Systems Consortium.
Dec 25 08:49:40 stargate-fw dhcpd: All rights reserved.
Dec 25 08:49:40 stargate-fw dhcpd: For info, please visit https://www.isc.org/software/dhcp/
Dec 25 08:49:40 stargate-fw dhcpd: Config file: /etc/dhcpd.conf

ntpd.log
Dec 24 15:30:44 stargate-fw ntpd[42685]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 24 15:30:44 stargate-fw ntpd[42685]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 24 15:30:55 stargate-fw ntpd[42685]: receive: Unexpected origin timestamp 0xe1ac9f9e.c1188da3 does not match aorg 0000000000.00000000 from server@198.255.68.106 xmt 0xe1ac9f9f.0091d92b
Dec 24 15:30:55 stargate-fw ntpd[42685]: receive: Unexpected origin timestamp 0xe1ac9f9e.c10e69a9 does not match aorg 0000000000.00000000 from server@103.38.121.36 xmt 0xe1ac9f9f.12adc2f5
Dec 24 15:36:31 stargate-fw ntpd[42685]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Dec 25 08:49:46 stargate-fw ntpdate[14816]: Can't find host 0.opnsense.pool.ntp.org: hostname nor servname provided, or not known (8)
Dec 25 08:49:46 stargate-fw ntpdate[14816]: Can't find host 1.opnsense.pool.ntp.org: hostname nor servname provided, or not known (8)

ppps.log
Dec 24 15:27:55 stargate-fw ppp: [opt2] IPV6CP: rec'd Configure Ack #1 (Ack-Sent)
Dec 24 15:27:55 stargate-fw ppp: [opt2] IPV6CP: state change Ack-Sent --> Opened
Dec 24 15:27:55 stargate-fw ppp: [opt2] IPV6CP: LayerUp
Dec 24 15:27:55 stargate-fw ppp: [opt2]   0216:3eff:fea0:b0c1 -> 0200:00ff:fe00:0000
Dec 25 08:49:39 stargate-fw ppp: Multi-link PPP daemon for FreeBSD
Dec 25 08:49:39 stargate-fw ppp:   
Dec 25 08:49:39 stargate-fw ppp: process 13219 started, version 5.8 (root@opn-build-amd64-2 00:42 14-Jul-2019)
Dec 25 08:49:39 stargate-fw ppp: web: web is not running
Dec 25 08:49:39 stargate-fw ppp: [opt2] Bundle: Interface ng0 created
Dec 25 08:49:39 stargate-fw ppp: [opt2_link0] Link: OPEN event
Dec 25 08:49:39 stargate-fw ppp: [opt2_link0] LCP: Open event
Dec 25 08:49:39 stargate-fw ppp: [opt2_link0] LCP: state change Initial --> Starting
Dec 25 08:49:39 stargate-fw ppp: [opt2_link0] LCP: LayerStart
Dec 25 08:49:39 stargate-fw ppp: [opt2_link0] PPPoE: Connecting to ''

routing.log
Dec 24 15:27:51 stargate-fw radvd[41112]: version 1.15 started
Dec 24 15:27:55 stargate-fw radvd[33472]: Exiting, sigterm or sigint received.
Dec 24 15:27:55 stargate-fw radvd[33472]: sending stop adverts
Dec 24 15:27:55 stargate-fw radvd[33472]: removing /var/run/radvd.pid
Dec 24 15:27:55 stargate-fw rtsold[22940]: <make_packet> link-layer address option has null length on pppoe0. Treat as not included.
Dec 24 15:27:55 stargate-fw radvd[67189]: version 1.15 started
Dec 24 15:27:58 stargate-fw radvd[19577]: Exiting, sigterm or sigint received.
Dec 24 15:27:58 stargate-fw radvd[19577]: sending stop adverts
Dec 24 15:27:58 stargate-fw radvd[19577]: removing /var/run/radvd.pid
Dec 24 15:27:58 stargate-fw radvd[31781]: version 1.15 started
Dec 24 15:30:50 stargate-fw radvd[40350]: Exiting, sigterm or sigint received.
Dec 24 15:30:50 stargate-fw radvd[40350]: sending stop adverts
Dec 24 15:30:50 stargate-fw radvd[40350]: removing /var/run/radvd.pid
Dec 24 15:30:50 stargate-fw radvd[49177]: version 1.15 started
Dec 25 08:49:41 stargate-fw radvd[45195]: version 1.15 started
Dec 25 08:49:45 stargate-fw radvd[94719]: Exiting, sigterm or sigint received.
Dec 25 08:49:45 stargate-fw radvd[94719]: sending stop adverts
Dec 25 08:49:45 stargate-fw radvd[94719]: removing /var/run/radvd.pid
Dec 25 08:49:45 stargate-fw rtsold[4502]: <make_packet> link-layer address option has null length on pppoe0. Treat as not included.
Dec 25 08:49:45 stargate-fw radvd[90559]: version 1.15 started
Dec 25 08:49:47 stargate-fw radvd[48749]: Exiting, sigterm or sigint received.
Dec 25 08:49:47 stargate-fw radvd[48749]: sending stop adverts
Dec 25 08:49:47 stargate-fw radvd[48749]: removing /var/run/radvd.pid
Dec 25 08:49:47 stargate-fw radvd[71479]: version 1.15 started

system.log
Dec 25 02:28:20 stargate-fw dhcp6c[61476]: Received REPLY for RENEW
Dec 25 02:43:20 stargate-fw dhcp6c[61476]: Sending Renew
Dec 25 02:43:20 stargate-fw dhcp6c[61476]: Received REPLY for RENEW
Dec 25 02:58:21 stargate-fw dhcp6c[61476]: Sending Renew
Dec 25 02:58:21 stargate-fw dhcp6c[61476]: Received REPLY for RENEW
Dec 25 08:48:49 stargate-fw syslogd: kernel boot file is /boot/kernel/kernel
Dec 25 08:48:49 stargate-fw kernel: Copyright (c) 2013-2018 The HardenedBSD Project.
Dec 25 08:48:49 stargate-fw kernel: Copyright (c) 1992-2018 The FreeBSD Project.
Dec 25 08:48:49 stargate-fw kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Dec 25 08:48:49 stargate-fw kernel: The Regents of the University of California. All rights reserved.
Dec 25 08:48:49 stargate-fw kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Dec 25 08:48:49 stargate-fw kernel: FreeBSD 11.2-RELEASE-p16-HBSD  87a7fc985c3(stable/19.7) amd64


I can't find any hints about the problem in the logs myself, and there is no crash report after the restart. Is there anything else I can provide, or anything I can debug while running a health check to reproduce the problem?
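
One thing the logs do show is a long silence between the last entry of the night and the first entry after the reset. A minimal Python sketch for locating that gap in a pasted log excerpt follows. It assumes BSD-style syslog lines that start with "Mon DD HH:MM:SS" and, since the stamps carry no year, a year passed in by hand. The function names are illustrative only and not part of OPNsense.

```python
from datetime import datetime

def parse_syslog_time(line, year=2019):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year=2019):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    entries = []
    for line in lines:
        try:
            entries.append((parse_syslog_time(line, year), line))
        except ValueError:
            continue  # skip blank lines and lines without a timestamp
    best = None
    for (t0, l0), (t1, l1) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# Three lines taken from the system.log excerpt above.
sample = [
    "Dec 25 02:58:21 stargate-fw dhcp6c[61476]: Sending Renew",
    "Dec 25 02:58:21 stargate-fw dhcp6c[61476]: Received REPLY for RENEW",
    "Dec 25 08:48:49 stargate-fw syslogd: kernel boot file is /boot/kernel/kernel",
]
gap, before, after = largest_gap(sample)
print(gap)     # 5:50:28
print(before)  # last line written before the silence
```

Run over each log file, this narrows down when logging stopped. It does not say why, and a quiet log is not by itself proof of a freeze.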

Kind regards and merry XMas
Torsten

Please try again with 4 GB. That is what I use, and it has never crashed.

Hi Fabian,

Same result with 4 GB RAM. After "pkg check -sav", the GUI froze and the ssh connection froze too. I had a VNC viewer open to OPNsense running top, and I could see about 2.5 GB of free RAM during the check.

In the VNC viewer, top kept working after the freeze, and after stopping top I could still execute commands in the shell. Immediately after the exit command, that view froze too and the OPNsense menu was not shown.

Is there anything I can try or debug?

Kind regards
Torsten

pkg check can freeze the system. I don't know why, but it has always been this way on certain test hardware, though never on VMs, I think.

Worth an upstream ticket here maybe: https://github.com/freebsd/pkg


Cheers,
Franco

Hi Franco,

After reading the thread https://forum.opnsense.org/index.php?topic=12828.0 about PPPoE, I'm no longer sure that "pkg check" is really the reason for my nightly freezes. I can also force the described crash by editing or adding a point-to-point device: https://forum.opnsense.org/index.php?topic=15357.0

Is there any job or service that runs the check every night?

I will try the workaround mentioned in the other thread; hopefully the nightly freezes will stop. Otherwise OPNsense isn't usable...

Kind regards
Torsten

December 28, 2019, 10:57:37 PM #7 Last Edit: December 29, 2019, 10:50:28 AM by tokade
Hi,

I have to come back to this. The workaround mentioned in the 19.1 thread doesn't prevent my system from freezing. My ISP doesn't drop the connection, and I don't use the 'periodic interface reset'. So there must be another reason for the nightly freeze.

Through further testing I found another command that crashes my system: sysctl -a (report sent via the system after reboot).

Bugs involving crashes caused by sysctl have been reported for FreeBSD. Maybe another process or job uses sysctl during the night. Maybe sysctl is also invoked when editing or adding point-to-point devices, so that both problems have the same cause.

Unfortunately, I haven't found a solution while searching for the bug.
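
If sysctl -a hangs the system, one way to narrow it down could be to query the variables one at a time and write each name to disk before reading it, so that the last name in the log points at the suspect. The Python sketch below is only an idea under stated assumptions: that listing names with sysctl -aN (names only) does not itself trigger the freeze, that the disk stays writable up to that moment, and that /root/sysctl-probe.log is a hypothetical log path. None of this has been verified on the affected system.

```python
import os
import subprocess

def probe_sysctls(names, query, log_path):
    """Query each name, logging it with fsync first so a freeze leaves a trace."""
    with open(log_path, "a") as log:
        for name in names:
            log.write(name + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the name onto disk before reading it
            query(name)

def last_probed(log_path):
    """Return the last name written to the log, i.e. the last one attempted."""
    with open(log_path) as log:
        names = log.read().split()
    return names[-1] if names else None

def sysctl_names():
    # Assumption: 'sysctl -aN' prints variable names only and is safe to run.
    out = subprocess.run(["sysctl", "-aN"], capture_output=True,
                         text=True, check=True)
    return out.stdout.split()

def sysctl_query(name):
    # Read a single variable; give up after 10 seconds instead of waiting forever.
    try:
        subprocess.run(["sysctl", name], capture_output=True,
                       text=True, timeout=10)
        return True
    except subprocess.TimeoutExpired:
        return False

def main():
    # Hypothetical log location; not called automatically.
    probe_sysctls(sysctl_names(), sysctl_query, "/root/sysctl-probe.log")
```

Calling main() on the firewall and, after the reset, last_probed("/root/sysctl-probe.log") would show which variable was being read when things stopped. Whether a single variable is responsible at all is an open question.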

Kind regards
Torsten


The system is still up this morning, and everything seems to work in the background (internet access, firewall, unbound, dhcp, ...). When trying to log in, I can type the password. After that I get a little more output on my VNC console / ssh than on the previous days:

FreeBSD/amd64 (stargate-fw.cgnf.net) (ttyu0)

login: root
Password:
Last login: Sun Dec 29 10:42:42 on ttyv0
----------------------------------------------
|      Hello, this is OPNsense 19.7          |         @@@@@@@@@@@@@@@
|                                            |        @@@@         @@@@
| Website:      https://opnsense.org/        |         @@@\\\   ///@@@
| Handbook:     https://docs.opnsense.org/   |       ))))))))   ((((((((
| Forums:       https://forum.opnsense.org/  |         @@@///   \\\@@@
| Lists:        https://lists.opnsense.org/  |        @@@@         @@@@
| Code:         https://github.com/opnsense  |         @@@@@@@@@@@@@@@
----------------------------------------------


After that nothing happens, but I can interrupt with Ctrl-C and get back to the login prompt. The WebGUI isn't working at all.

How can I restart the GUI or the CLI process via a cron job?
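
Whether a cron job on the firewall itself would still run in this state is not clear from this thread. As a starting point, here is a minimal Python sketch of a check that could run on a schedule from another host and report whether the web GUI still answers. The function name and the example address are hypothetical. Certificate verification is switched off on the assumption that the GUI uses a self-signed certificate. What to do when the check fails (alert, restart a service, reset the domU) is deliberately left open.

```python
import ssl
import urllib.error
import urllib.request

def gui_responds(url, timeout=10.0):
    """Return True if the server at url sends any HTTP response in time."""
    # Assumption: self-signed certificate, so verification is disabled here.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # refused, unreachable, or timed out

# Example call (address is a placeholder):
# gui_responds("https://192.0.2.1/")
```

A check like this only detects a hung web server. It would not notice the case described above, where ssh still accepts the password and the session hangs afterwards.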

Kind regards
Torsten

Sorry for the late reply; I am on vacation at the moment.

I face the same issue, but not on a VM: it's a PC Engines apu4d with 4 GB RAM. Fortunately, my son "reboots" the machine if he can't game (on my advice :) )

I thought it might be a cron job, e.g. the ACME update or perhaps the update of the IPS signatures. But today it happened during the daytime for the first time, and as far as I remember no job runs at that time, so that cannot be the reason.

Even when the web UI and the CLI are not accessible, the LAN interface is pingable.

I found some time to take a look at my firewall.

First of all, I tested whether it also happens on the development version 20.1. In short: yes, it does.

I took a look at the logs, where I found that suricata crashed because of too little swap. I had none and have now added swap to the system.

Could it be that the network stack gets into trouble when suricata crashes?

Today I ran a lot of tests, starting with the LiveCD and a fresh install. I think I have narrowed down the reason for my crashes, which are related neither to suricata (not in use at all) nor to the update to 19.7.8.

The crashes start when I use VLANs in the new OPNsense domU again; without them, I couldn't force a crash. I'm now sure that the VLANs are the reason for the nightly crashes too, since I introduced them some days before the update. Here is what I tested, with the results for the commands / functions that crash my domU:

domU from LiveCD mode:
OPNsense 19.7-amd64
FreeBSD 11.2-RELEASE-p11-HBSD
OpenSSL 1.0.2s 28 May 2019

without IPv6 on WAN
no configuration
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

---

domU from fresh install:
OPNsense 19.7-amd64
FreeBSD 11.2-RELEASE-p11-HBSD
OpenSSL 1.0.2s 28 May 2019

without IPv6 on WAN
minimal configuration (WAN & LAN)
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

There is a mandatory update for the package manager available.
Package Name   Current Version   New Version   Required Action
pkg            1.10.5_5          1.12.0        upgrade
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

install os-xen 1.2
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

reboot
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

update --> 19.7.8
OPNsense 19.7.8-amd64
FreeBSD 11.2-RELEASE-p16-HBSD
OpenSSL 1.0.2t 10 Sep 2019

sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

----

activate DHCPv6 on WAN
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

----

activate IPv6 on LAN (track WAN)
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

----

restore DHCP conf from backup and reboot
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

----

restore DHCPv6, system tunables, unbound conf from backup and reboot
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

---

restore system, interfaces from backup and reboot
sysctl -a via CLI: OK
Audit health via GUI: OK
edit pppoe via GUI: OK

---

restore interfaces from backup and reboot
sysctl -a via CLI: crashes
Audit health via GUI: OK
edit pppoe via GUI: crashes


The only thing I haven't tested is VLANs before the update to 19.7.8. I am marking this thread as obsolete and will open a new one.
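
The step-by-step results above can also be kept as data, which makes it easy to see at which step the first check fails. The Python sketch below simply transcribes the list from this post (True = OK, False = crashes). The structure and the function name are illustrative only.

```python
CHECKS = ("sysctl -a via CLI", "Audit health via GUI", "edit pppoe via GUI")
ALL_OK = (True, True, True)

# Steps in the order they were tested, transcribed from the list above.
steps = [
    ("LiveCD 19.7, no configuration", ALL_OK),
    ("fresh install 19.7, minimal configuration (WAN & LAN)", ALL_OK),
    ("pkg upgrade 1.10.5_5 -> 1.12.0", ALL_OK),
    ("install os-xen 1.2", ALL_OK),
    ("reboot", ALL_OK),
    ("update --> 19.7.8", ALL_OK),
    ("activate DHCPv6 on WAN", ALL_OK),
    ("activate IPv6 on LAN (track WAN)", ALL_OK),
    ("restore DHCP conf from backup and reboot", ALL_OK),
    ("restore DHCPv6, system tunables, unbound conf from backup and reboot", ALL_OK),
    ("restore system, interfaces from backup and reboot", ALL_OK),
    ("restore interfaces from backup and reboot", (False, True, False)),
]

def first_failure(steps):
    """Return (step_name, failed_checks) for the first step with a failure."""
    for name, results in steps:
        failed = [check for check, ok in zip(CHECKS, results) if not ok]
        if failed:
            return name, failed
    return None

print(first_failure(steps))
# ('restore interfaces from backup and reboot', ['sysctl -a via CLI', 'edit pppoe via GUI'])
```

This only confirms the order of events: every step before the interface restore passed all three checks, and the step that brought the interfaces back failed two of them.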

Kind regards
Torsten