OPNsense Forum

Archive => 17.7 Legacy Series => Topic started by: sedace on December 31, 2017, 12:11:27 am

Title: [Resolved] Crash 1/day since upgrade from 17.7.7 to 17.7.11
Post by: sedace on December 31, 2017, 12:11:27 am
Hi,

I'm trying to figure out what's causing a strange crash since I upgraded the firmware and packages a couple of days ago. I was also setting up some NAT port forwards around the same time, so it could be that; I'm disabling those for now. The symptom is that the GUI, SSH, and the local console all freeze. Some internet services keep running: for example, I could still connect to one PC via TeamViewer and it could still get on the internet, but other devices were offline.

Both crashes happened around 11:30 EST based on a gap in services in the System Health screen.

I had 23 days of straight uptime prior to the upgrade 2 days ago, and all I adjusted were firewall rules in NAT to enable port forwarding.

Anyway, I don't expect anyone to be able to tell me how to fix the problem. What I'm looking for is advice on which log files go back far enough to show what happened before the crash, or whether there's a setting I can enable to archive log files periodically so I don't lose them on a reboot.

Thanks in advance. 

[Resolved] The system crashed during a reboot; I reinstalled the software from scratch and it seems to be working fine. I'm not sure whether the upgrade from the web interface caused an issue or the internal SSD had errors, but data corruption issues were reported afterwards. I have had close to 4 days of uptime since the last reboot after the reinstall. After the base install I did all the updates from the shell/SSH login, and everything is on the same version as it was when I encountered the issue above. [/End Update]
Title: Re: Crash 1/day since upgrade from 17.7.7 to 17.7.11
Post by: bartjsmit on December 31, 2017, 11:51:33 am
It may be that you have exhausted a limited resource somewhere in your system (e.g. RAM) which is making it unstable. You could even be hitting a faulty memory location.

General troubleshooting best practice:

- Back up your current configuration, then restore the oldest backup listed under option 13 of the console menu
- Reduce the configuration to the barest minimum (i.e. NAT/rules for basic operation) and add features one by one
- Run a system test, especially on RAM: http://memtest.org/
- Confirm you have sufficient CPU, RAM and disk storage

Bart...
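
As a rough illustration of the last checklist item, here is a minimal sketch of a resource check using only the Python standard library. It assumes Python is available on the box; the function name resource_report is made up for this example. It covers disk usage and load average only, since the standard library has no portable call for RAM usage.

```python
import os
import shutil


def resource_report(path="/"):
    """Return disk usage for `path` and the load averages as a dict.

    Hypothetical helper for illustration; not part of OPNsense.
    """
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    disk_used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    load_1min, load_5min, load_15min = os.getloadavg()
    return {
        "disk_used_pct": round(disk_used_pct, 1),
        "load_1min": load_1min,
        "load_5min": load_5min,
        "load_15min": load_15min,
        "cpu_count": os.cpu_count() or 1,
    }


if __name__ == "__main__":
    for key, value in resource_report("/").items():
        print(f"{key}: {value}")
```

A 1-minute load average that stays well above cpu_count, or a disk close to 100% used, would be worth a closer look before blaming the upgrade.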
Title: Re: Crash 1/day since upgrade from 17.7.7 to 17.7.11
Post by: sedace on December 31, 2017, 07:54:20 pm
Thanks. Unfortunately, while trying to get NAT port forwarding working I created, modified, and deleted variations of one rule multiple times (while trying to sort out an issue with inbound connections), which means I don't have a backup from before the 17.7.11 update was completed to restore to. [Basically, I was trying to forward port 25565 (a Minecraft server for my son and a friend to use) from outside to a host inside the network, but I couldn't get it to work on a single inbound port. The client seems to randomize the requesting port number, so the only way I could get the rule to work was to accept any inbound port to the server IP and direct it to the internal host on 25565. But that's another issue I'm not worried about right now. (FWIW I only have 3 rules in NAT right now: one for the default anti-lockout, one for VoIP, and the disabled Minecraft rule.)]
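
A possible explanation for the randomized port: TCP clients normally connect from an ephemeral source port, so a port forward usually matches on the destination port only and leaves the source port as "any". The sketch below is a toy model of that matching logic, not OPNsense's actual implementation. The class PortForward, its fields, and the address 192.168.1.50 are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PortForward:
    """Toy model of a port forward rule (hypothetical, for illustration)."""
    dest_port: int                     # port the outside client connects to
    target_ip: str                     # internal host (assumed address)
    target_port: int                   # port on the internal host
    source_port: Optional[int] = None  # None means "any source port"

    def translate(self, src_port: int, dst_port: int) -> Optional[Tuple[str, int]]:
        """Return the internal (ip, port) if the packet matches, else None."""
        if self.source_port is not None and src_port != self.source_port:
            return None
        if dst_port != self.dest_port:
            return None
        return (self.target_ip, self.target_port)


# Rule that leaves the source port open and matches destination 25565 only.
any_source = PortForward(dest_port=25565, target_ip="192.168.1.50", target_port=25565)
# Rule that also pins the source port to 25565.
pinned_source = PortForward(dest_port=25565, target_ip="192.168.1.50",
                            target_port=25565, source_port=25565)

# A client connecting from ephemeral source port 51034 to destination 25565.
print(any_source.translate(51034, 25565))
print(pinned_source.translate(51034, 25565))
```

In this model the first rule forwards the connection and the second does not, which would match the symptom of a single-port rule that never fires if the port was entered in the source field.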


If I still have issues, my ultimate plan is to do a fresh install using the older code. I was just hoping to get some more clues about the state before the crash, but I wasn't sure which logs to look at, or whether the circular logs are written to disk regularly by default or I have to specify that somewhere.
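
On the question of keeping logs across a crash, one option is to copy the log files to a timestamped directory at regular intervals. The following is a minimal sketch under stated assumptions: the logs are ordinary files matching *.log in a single directory, the destination is on persistent storage, and the function name snapshot_logs and both paths in the usage line are hypothetical.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(src_dir, dest_root, pattern="*.log"):
    """Copy files matching `pattern` from src_dir into dest_root/<timestamp>/.

    Returns the list of copied file paths. Hypothetical helper for illustration.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / stamp
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for log_file in sorted(Path(src_dir).glob(pattern)):
        if log_file.is_file():
            target = dest / log_file.name
            shutil.copy2(log_file, target)  # copy2 also keeps modification times
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Both paths are assumptions; adjust them to the actual system.
    for path in snapshot_logs("/var/log", "/root/log-snapshots"):
        print(path)
```

Run from cron every few minutes, this would leave a copy from shortly before each freeze. If the logs are stored in a circular binary format, the copies would need the same tool to read them back.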



As I noted, the system was rock solid for weeks before the update, so I'm pretty confident the hardware is good and it's likely software related. If I have issues after reverting to an older version, I'll give memtest and other hardware diagnostics a shot.

Today while monitoring the VGA console I saw the error "swap_pager_getswapspace(2): failed", but swap usage (<25%) and memory usage (<15%) are minimal and disk space is plentiful. I ran top sorted by size and didn't see anything strange, but there could be something there:
  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
64518 root          1  20    0  1093M  7912K select  1   0:01   0.02% sshd
53790 root          1  20    0  1067M  2572K select  1   0:00   0.00% sshd
 3308 root          1  20    0  1060M  2240K select  1   0:02   0.00% miniupnpd
10051 dhcpd         1  20    0  1057M  1660K select  1   0:12   0.02% dhcpd
60776 root          1  20    0  1054M  3752K pause   0   0:00   0.00% csh
65272 root          1  20    0  1054M  2804K wait    1   0:00   0.00% sh
 7922 root          1  52    0  1054M   764K wait    0   0:12   0.00% sh
46530 root          1  20    0  1053M  1364K bpf     1   2:07   0.11% filterlog
29043 _dhcp         1  20    0  1051M  1280K select  1   0:00   0.00% dhclient
24559 root          1  52    0  1051M  1116K select  1   0:00   0.00% dhclient
20379 root          1  20    0  1051M  1348K select  0   0:41   0.04% syslogd
62521 root          2  20    0  1051M  1252K piperd  0   0:00   0.00% sshlockou
79599 root          1  20    0  1051M  2304K select  1   0:05   0.01% ntpd
53240 root          1  20    0  1049M  1336K select  0   0:19   0.02% apinger
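
When reading output like this, it helps to remember that in top the SIZE column is the virtual size of a process while RES is the memory actually resident in RAM, so large SIZE values do not by themselves mean RAM is being consumed. The sketch below totals both columns for a few of the rows above. It assumes the 12-column layout shown here; the helper names to_kib and summarize are made up for illustration.

```python
UNITS_KIB = {"K": 1, "M": 1024, "G": 1024 * 1024}


def to_kib(value):
    """Convert a top size string such as '1093M' or '7912K' to KiB."""
    return int(value[:-1]) * UNITS_KIB[value[-1]]


def summarize(top_rows):
    """Return (total SIZE, total RES) in KiB for the given top process rows."""
    total_size = 0
    total_res = 0
    for row in top_rows:
        fields = row.split()
        total_size += to_kib(fields[5])  # SIZE is the 6th column
        total_res += to_kib(fields[6])   # RES is the 7th column
    return total_size, total_res


rows = [
    "64518 root          1  20    0  1093M  7912K select  1   0:01   0.02% sshd",
    " 3308 root          1  20    0  1060M  2240K select  1   0:02   0.00% miniupnpd",
    "46530 root          1  20    0  1053M  1364K bpf     1   2:07   0.11% filterlog",
]
size_kib, res_kib = summarize(rows)
print(f"virtual: {size_kib // 1024} MiB, resident: {res_kib // 1024} MiB")
```

For these three rows the virtual total is over 3 GiB while the resident total is only a few MiB, so sorting by RES may be more telling than sorting by SIZE.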



So far it hasn't locked up today, but it hasn't quite been 24 hours since the last time; about 3 hours to go. I also disabled the NAT rule I created and set outbound NAT back to Automatic from Hybrid. That is pretty much the same config I had before the update to 17.7.11, short of deleting the rule rather than disabling it. I'll report back either way.
Title: Re: Crash 1/day since upgrade from 17.7.7 to 17.7.11
Post by: sedace on January 01, 2018, 12:31:28 am
Well, I was having a problem with my VoIP so I initiated a reboot. The system got stuck but eventually restarted, at which point the VGA console seemed to get stuck in a loop of tar messages (corrupt file, etc.). So I ended up reinstalling from scratch with 17.7.5, restoring my backup configuration, and tweaking the settings to get it back to where it was since the last backup. I am now back on 17.7.11 and we'll see if the problems recur.

On a separate note, Happy New Year all.