opnsense unreachable, disk full with Filter and Suricata logs

Started by 9axqe, August 20, 2023, 10:48:03 AM

Hello,

I came home from 2 weeks away to find the web GUI of the router showing "CSRF check failed". I tried incognito tabs, different devices and different browsers (Firefox on macOS, Chrome on macOS, Safari on iOS) all with the same error.

I also could not ping the router somehow (ping 192.168.1.1), but Internet access was working.

Then, I decided to power cycle the router (I know, it was late, I was tired, went for the dumb option...). Now it seems it's completely gone, DHCP is not even coming up anymore, SSH is not reachable. Manually configuring my computer's IP to something within the home subnet (192.168.1.0/24) range is also not working. All of this attempted directly wired to the DEC695 to exclude any other networking issues.

A couple of days before I came back home, I had remotely upgraded (via API, using the Home Assistant integration) to 23.7.1_3 and it was working fine after the upgrade as far as I could tell: the GUI was remotely reachable, and Home Assistant on my home network was reachable.

I'm trying to understand what I can attempt on the Serial Console to troubleshoot this.

I see the prompt "root@:/ #" when connecting to the mini-USB port of my DEC695. Running ifconfig on the serial console shows no IPs configured on the WAN interface (which is/was set to igb0 in my case), which is also not normal: it should get its IPv4 and IPv6 addresses via DHCP from my ISP. The serial console is connected using "screen", but strangely I'm not asked for a password. ifconfig does display interfaces igb0 to igb3.

Any help appreciated.

I had the idea to reboot with serial connected to see errors.

I see this:

Launching the init system...flock: cannot open lock file /var/run/booting: No space left on device

(and multiple other "no space left on device" errors)

So I guess I managed to fill up some partition somehow. Weird thing is, I checked that disk was 3% full when leaving. Log partition was filling up quickly though and I had made a note to check why on my return.

If you can login at all via console, you can check under /var/log and delete some big log files, then reboot and see if it fixes the condition.
Intel N100, 4 x I226-V, 16 GByte, 256 GByte NVME, ZTE F6005

1100 down / 800 up, Bufferbloat A+


root@:/ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default    2.6G    2.6G      0B   100%    /
devfs                 1.0K    1.0K      0B   100%    /dev
/dev/gpt/efifs        256M    872K    255M     0%    /boot/efi
zroot                  96K     96K      0B   100%    /zroot
zroot/var/audit        96K     96K      0B   100%    /var/audit
zroot/usr/home         96K     96K      0B   100%    /usr/home
zroot/var/crash        96K     96K      0B   100%    /var/crash
zroot/tmp             376K    376K      0B   100%    /tmp
zroot/var/log         220G    220G      0B   100%    /var/log
zroot/var/mail        144K    144K      0B   100%    /var/mail
zroot/usr/ports        96K     96K      0B   100%    /usr/ports
zroot/usr/src          96K     96K      0B   100%    /usr/src
zroot/var/tmp         108K    108K      0B   100%    /var/tmp
root@:/ #




hmmm, does not seem full though... Unless I'm not reading this correctly ("capacity" and "used" columns seem to contradict one another).

I also do not understand how to delete anything in /var/log: I can cd to /zroot/var/log/ but it's empty it seems (ls -la shows nothing).

I see the following lines when booting:

mkdir: /tmp/.cdrom: No space left on device

chmod: /tmp: No space left on device

chmod: /var/lib/php/sessions: No space left on device

chmod: /root: No space left on device

etc. etc.

It seems highly unlikely to me that all these partitions are full.

If the SSD was broken, would I see the same type of errors?

I doubt that the SSD is broken.

ZFS has a zpool (zroot) that has a bunch of datasets which all share the free space of the zpool.
Yours is 100% full, so all datasets appear full as well, which causes those errors.

It does not matter where you make some space, but /var/log is using up the most space (i.e. 220G).
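
To see the shared free space on the console, something like the following might help. It is only a sketch: the pool name zroot is taken from the df output above, show_pool_usage is a made-up helper name, and it assumes the stock zpool and zfs tools are on the PATH.

```sh
# Sketch: show how full a ZFS pool is and which datasets use the space.
# show_pool_usage is a hypothetical helper name; "zroot" is the pool name
# from the df output in this thread.
show_pool_usage() {
    pool=${1:-zroot}
    if ! command -v zpool >/dev/null 2>&1; then
        echo "zpool not found, is this a ZFS system?" >&2
        return 1
    fi
    # Pool-wide totals: every dataset draws from this one pool of free space.
    zpool list -o name,size,allocated,free,capacity "$pool" || return 1
    # Per-dataset breakdown, including space held by snapshots.
    zfs list -r -o space "$pool"
}
```

Calling show_pool_usage zroot should make it obvious that "Avail" is the same number for every dataset, because it is the pool's free space and not a per-dataset figure.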

You should cd to /var/log (that is the mountpoint), then 'du -sc *'. You will see how much space is in all files and/or directories. Delete some big files or cd to the subdirectory containing most space, 'du -sc *' again there and see if there are older large files.
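
If the du output is long, sorting it helps. A minimal sketch, assuming a POSIX shell; largest_entries is a made-up helper name and the defaults (/var/log, 10 entries) are only illustrative:

```sh
# Sketch: list the N largest entries directly under a directory, biggest first.
# largest_entries is a hypothetical helper name, not an OPNsense tool.
largest_entries() {
    dir=${1:-/var/log}
    count=${2:-10}
    # du -k prints sizes in KiB so that sort -rn can order them numerically;
    # -x keeps du on one filesystem (on ZFS: one dataset).
    ( cd "$dir" && du -skx -- * 2>/dev/null | sort -rn | head -n "$count" )
}
```

For example, largest_entries /var/log 5 prints the five biggest files or directories, and you can then repeat it on the biggest subdirectory.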

If you are using flowd, you could:

cd /var/log
rm flowd.log.??????

Another candidate would be /var/log/system, where you could do:

cd /var/log/system
rm system_????????.log

After that, reboot.
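
If you would rather delete by age than by name pattern, find can do that. This is only a sketch: prune_old_logs is a made-up helper name, and it assumes the rotated files end in .log like the system_????????.log files above. Run the find without -delete first if you want to see what would go.

```sh
# Sketch: delete *.log files under a directory that are older than N days.
# prune_old_logs is a hypothetical helper name; the '*.log' pattern is an
# assumption based on the file names mentioned in this thread.
prune_old_logs() {
    dir=$1
    days=$2
    # -mtime +N matches files last modified more than N full days ago.
    # -print lists each file as it is removed.
    find "$dir" -type f -name '*.log' -mtime +"$days" -print -delete
}
```

For example, prune_old_logs /var/log/system 7 would keep roughly the last week of files in that directory.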

Thanks for the tips. Seems flowd and filter are the worst offenders...

how do I:

1. Limit logs size? (at the expense of log retention duration of course)
2. prevent logs from filling up partitions required for the device to properly start?




root@:/var/log # du -sch *
51K    acmeclient
232K    audit
512B    boot.log
35M    configd
33M    ddclient
11M    dhcpd
97K    dmesg.today
85K    dmesg.yesterday
171G    filter
1.0M    firewall
48G    flowd.log
10M    flowd.log.000001
10M    flowd.log.000002
10M    flowd.log.000003
10M    flowd.log.000004
10M    flowd.log.000005
10M    flowd.log.000006
10M    flowd.log.000007
10M    flowd.log.000008
10M    flowd.log.000009
10M    flowd.log.000010
324K    gateways
164K    lighttpd
252K    monit
4.5K    mount.today
4.5K    mount.yesterday
512B    ntp
2.4M    ntpd
4.5K    pf.today
4.5K    pf.yesterday
170K    pkg
1.3M    portalauth
1.1M    resolver
388K    routing
4.5K    setuid.today
4.5K    setuid.yesterday
512B    squid
34M    suricata
6.2M    system
13K    userlog
4.5K    utx.lastlogin
8.5K    utx.log
220G    total




Under System: Settings: Logging, you can set the retention time in days. There, you can disable logging to the local disk altogether, e.g. if you send the log data to an external log server under System: Settings: Logging / targets.

Under System: Settings: Miscellaneous, you can also move /var/log to a RAM disk, where it is non-critical if it fills up. The only disadvantage is that logs are not kept across reboots if you do that.

thanks @meyergru, I have set it to 10 days to start with, I'll monitor disk usage.
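
For the monitoring part, a small check along these lines could work. It is a sketch under assumptions: usage_percent and check_usage are made-up names, the 80% default is arbitrary, and it relies only on POSIX df -P and awk.

```sh
# Sketch: warn when a filesystem is fuller than a given percentage.
# usage_percent and check_usage are hypothetical helper names.
usage_percent() {
    # -P forces the POSIX one-line-per-filesystem layout; column 5 is capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_usage() {
    path=${1:-/var/log}
    limit=${2:-80}
    pct=$(usage_percent "$path")
    # An empty result means df could not report on the path.
    [ -n "$pct" ] || return 2
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is at ${pct}% (limit ${limit}%)" >&2
        return 1
    fi
    return 0
}
```

Something like check_usage /var/log 80 could be run by hand or from a scheduled job; how to hook it into alerting is left open here.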

I think something is not normal, I can't even set it beyond 2 days at the moment.

How can it be that flowd.log becomes **48GB** large?... Could it be that something is broken with the log rotation for the IDS/IPS process?
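
To find out which file is actually growing, and how fast, a rough measurement like this might help. It is a sketch: growth_rate is a made-up helper name and the 10 second default is arbitrary.

```sh
# Sketch: estimate how fast a file grows, in bytes per second.
# growth_rate is a hypothetical helper name, not an OPNsense tool.
growth_rate() {
    file=$1
    interval=${2:-10}
    [ -f "$file" ] || return 1
    # tr strips the leading spaces that some wc implementations print.
    before=$(wc -c < "$file" | tr -d ' ')
    sleep "$interval"
    after=$(wc -c < "$file" | tr -d ' ')
    # Integer division; a negative result means the file was rotated or truncated.
    echo $(( (after - before) / interval ))
}
```

For example, growth_rate /var/log/flowd.log 30 prints the average bytes per second over 30 seconds, which can be compared with the rate for the newest file under /var/log/filter.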

top also shows that the logging processes are consuming a LOT of CPU: syslog-ng and filterlog are consistently, and by far, the two processes consuming the most CPU:


root@sense:/var/log # top

last pid: 86782;  load averages:  2.32,  2.31,  2.25                                              up 1+07:34:11  04:21:42
68 processes:  3 running, 65 sleeping
CPU: 41.0% user,  0.0% nice, 29.9% system,  1.9% interrupt, 27.3% idle
Mem: 207M Active, 660M Inact, 4724K Laundry, 6032M Wired, 2056K Buf, 994M Free
ARC: 5165M Total, 50M MFU, 5038M MRU, 5095K Anon, 11M Header, 60M Other
     4900M Compressed, 5036M Uncompressed, 1.03:1 Ratio
Swap: 8418M Total, 8418M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 4396 root          6  21    0    54M    14M kqread   3  18.0H  66.14% syslog-ng
33799 root          1  52    0    13M  3220K bpf      0 303:55  51.77% filterlog
11798 root          1  37    0    58M    31M select   2   0:01  15.86% php-cgi
88877 root          1  36    0    62M    33M piperd   0   0:04   5.99% php-cgi


There seems to be a lot of traffic going through that FW.

You need to decide whether some of the logging can be reduced or at the very least sent to a different storage share, else you're gonna keep it busy processing graphs instead of passing traffic

No, that's the thing, there's nothing going through it. It's a DEC695 and it's at my home. I have about 30 Mbps (less than 4 MB/s) going through it at the moment, and syslog-ng and filterlog each consume around 50% (of a CPU core, I guess).

More questions:
1. why would syslog-ng do anything since I am not sending syslogs anywhere...
2. looking at the logs, flowd.log is huge (was at 48GB in less than a month), hence maybe something to do with flowd. Any settings in particular I could check?


Filter logs are huge too, and I'm not sure how to reduce logging there. I already disabled logging on all my rules, but it seems I cannot disable it on the automatically generated ones. The "filter" logs are everything generated by firewall rules, right?

Suricata can trigger on a lot of things if you enabled _everything_

For example, if you only have Windows laptops and iPhones  then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.

Out of 63864 entries, chances are you're gonna hit one or more generic rules often enough to quickly fill up storage.