opnsense unreachable, disk full with Filter and Suricata logs

9axqe · August 20, 2023, 10:48:03 AM

Hello,

I came home from 2 weeks away to find the web GUI of the router showing "CSRF check failed". I tried incognito tabs, different devices and different browsers (Firefox on macOS, Chrome on macOS, Safari on iOS) all with the same error.

I also could not ping the router somehow (ping 192.168.1.1), but Internet access was working.

Then, I decided to power cycle the router (I know, it was late, I was tired, went for the dumb option...). Now it seems it's completely gone, DHCP is not even coming up anymore, SSH is not reachable. Manually configuring my computer's IP to something within the home subnet (192.168.1.0/24) range is also not working. All of this attempted directly wired to the DEC695 to exclude any other networking issues.

A couple of days before I came back home, I had remotely upgraded (via API, using the Home Assistant integration) to 23.7.1_3 and it was working fine after the upgrade as far as I could tell: GUI was remotely reachable, Home Assistant on my home network was reachable.

I'm trying to understand what I can attempt on the Serial Console to troubleshoot this.

I see prompt "root@:/ #" when connecting to the mini-USB port of my DEC695. When running ifconfig on the serial console it shows no IPs configured on WAN interface (which is/was set to igb0 in my case) for example, which is also not normal, it should get its IPv4 and IPv6 via DHCP from my ISP. Serial Console is connecting using "screen" but strangely I'm not asked for a password. ifconfig displays interfaces igb0 to 3 for example.

Any help appreciated.

9axqe · August 20, 2023, 10:54:06 AM

I had the idea to reboot with serial connected to see errors.

I see this:

Launching the init system...flock: cannot open lock file /var/run/booting: No space left on device

(and multiple other "no space left on device" errors)

So I guess I managed to fill up some partition somehow. Weird thing is, I checked that disk was 3% full when leaving. Log partition was filling up quickly though and I had made a note to check why on my return.

meyergru · August 20, 2023, 10:57:26 AM

If you can login at all via console, you can check under /var/log and delete some big log files, then reboot and see if it fixes the condition.

9axqe · August 20, 2023, 11:01:17 AM

root@:/ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default    2.6G    2.6G      0B   100%    /
devfs                 1.0K    1.0K      0B   100%    /dev
/dev/gpt/efifs        256M    872K    255M     0%    /boot/efi
zroot                  96K     96K      0B   100%    /zroot
zroot/var/audit        96K     96K      0B   100%    /var/audit
zroot/usr/home         96K     96K      0B   100%    /usr/home
zroot/var/crash        96K     96K      0B   100%    /var/crash
zroot/tmp             376K    376K      0B   100%    /tmp
zroot/var/log         220G    220G      0B   100%    /var/log
zroot/var/mail        144K    144K      0B   100%    /var/mail
zroot/usr/ports        96K     96K      0B   100%    /usr/ports
zroot/usr/src          96K     96K      0B   100%    /usr/src
zroot/var/tmp         108K    108K      0B   100%    /var/tmp
root@:/ #

hmmm, does not seem full though... Unless I'm not reading this correctly ("capacity" and "used" columns seem to contradict one another).

I also do not understand how to delete anything in /var/log: I can cd to /zroot/var/log/ but it's empty it seems (ls -la shows nothing).

9axqe · August 20, 2023, 11:41:31 AM

I see the following lines when booting:

mkdir: /tmp/.cdrom: No space left on device

chmod: /tmp: No space left on device

chmod: /var/lib/php/sessions: No space left on device

chmod: /root: No space left on device

etc. etc.

It seems highly unlikely to me that all these partitions are full.

If the SSD was broken, would I see the same type of errors?

meyergru · August 20, 2023, 12:37:04 PM

I doubt that the SSD is broken.

ZFS has a zpool (zroot) that has a bunch of datasets which all share the free space of the zpool.
Yours is 100% full, so all datasets appear full as well, which is causing those errors.
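To see the shared free space directly rather than through df, something along these lines should work from the console. This is only a sketch: the pool name zroot is taken from the df output above, and the wrapper name show_pool_space is made up for illustration. `zpool list` and `zfs list -o space` are the standard OpenZFS commands.

```shell
# Hypothetical helper: show how space is shared across a ZFS pool.
# Assumes the pool is called "zroot" unless another name is given.
show_pool_space() {
    pool="${1:-zroot}"
    if ! command -v zfs >/dev/null 2>&1; then
        echo "zfs not found on this system" >&2
        return 1
    fi
    # Pool-wide view: total size, allocated and free space of the pool.
    zpool list "$pool"
    # Per-dataset view: AVAIL is the same shared pool free space for every
    # dataset (unless quotas or reservations are set); USED shows which
    # dataset actually consumed the space.
    zfs list -r -o space "$pool"
}
```

Running `show_pool_space zroot` on a full pool would be expected to show AVAIL at or near zero for every dataset, with one dataset accounting for nearly all of USED.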

It does not matter where you make some space, but /var/log is using up the most space (i.e. 220G).

You should cd to /var/log (that is the mountpoint), then 'du -sc *'. You will see how much space is in all files and/or directories. Delete some big files or cd to the subdirectory containing most space, 'du -sc *' again there and see if there are older large files.
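The drill-down described above can be sketched as a small function. The name largest_entries is hypothetical; it sorts by size so the worst offenders come first, and uses KiB instead of human-readable units so that the sort is numeric.

```shell
# Hypothetical helper: list the N largest entries directly under a directory.
# Sizes are in KiB so the output sorts numerically (unlike `du -h`).
largest_entries() {
    dir="${1:-/var/log}"
    count="${2:-10}"
    # Subshell so the caller's working directory is left untouched.
    (
        cd "$dir" || exit 1
        du -sk -- * 2>/dev/null | sort -rn | head -n "$count"
    )
}
```

For example, `largest_entries /var/log 5` lists the five biggest files or directories; repeat it on whichever subdirectory tops the list.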

If you are using flowd, you could:

cd /var/log
rm flowd.log.??????

Another candidate would be /var/log/system, where you could do:

cd /var/log/system
rm system_????????.log

After that, reboot.
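A slightly more cautious variant of the rm commands above is sketched here. The function names and the DRY_RUN variable are made up for illustration. The second helper covers a log that a daemon is still writing to: space from a deleted but still-open file is only released once the process closes it, whereas truncating the file frees it immediately.

```shell
# Hypothetical helper: delete rotated logs matching a glob in one directory.
# Set DRY_RUN=1 to only print what would be removed.
purge_rotated() {
    dir="$1"
    pattern="$2"
    find "$dir" -maxdepth 1 -type f -name "$pattern" -print | sort |
    while IFS= read -r f; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would remove: $f"
        else
            rm -f -- "$f" && echo "removed: $f"
        fi
    done
}

# Hypothetical helper: empty a log file in place instead of deleting it.
# The file keeps its name and permissions; its contents are discarded.
truncate_log() {
    : > "$1"
}
```

With `DRY_RUN=1` set, `purge_rotated /var/log 'flowd.log.??????'` only lists the matches; with it unset or 0, the same call deletes them.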

9axqe · August 20, 2023, 12:55:45 PM

Thanks for the tips. Seems flowd and filter are the worst offenders...

how do I:

1. Limit logs size? (at the expense of log retention duration of course)
2. prevent logs from filling up partitions required for the device to properly start?

root@:/var/log # du -sch *
 51K    acmeclient
232K    audit
512B    boot.log
 35M    configd
 33M    ddclient
 11M    dhcpd
 97K    dmesg.today
 85K    dmesg.yesterday
171G    filter
1.0M    firewall
 48G    flowd.log
 10M    flowd.log.000001
 10M    flowd.log.000002
 10M    flowd.log.000003
 10M    flowd.log.000004
 10M    flowd.log.000005
 10M    flowd.log.000006
 10M    flowd.log.000007
 10M    flowd.log.000008
 10M    flowd.log.000009
 10M    flowd.log.000010
324K    gateways
164K    lighttpd
252K    monit
4.5K    mount.today
4.5K    mount.yesterday
512B    ntp
2.4M    ntpd
4.5K    pf.today
4.5K    pf.yesterday
170K    pkg
1.3M    portalauth
1.1M    resolver
388K    routing
4.5K    setuid.today
4.5K    setuid.yesterday
512B    squid
 34M    suricata
6.2M    system
 13K    userlog
4.5K    utx.lastlogin
8.5K    utx.log
220G    total

meyergru · August 20, 2023, 02:11:48 PM

Under System: Settings: Logging, you can set the retention time in days. There, you can disable logging to the local disk altogether, e.g. if you send the log data to an external log server under System: Settings: Logging / targets.

Under System: Settings: Miscellaneous, you can also put /var/log on a RAM disk, where it is non-critical if it fills up. The only disadvantage is that logs are not kept over reboots if you do that.

9axqe · August 20, 2023, 03:25:10 PM

thanks @meyergru, I have set it to 10 days to start with, I'll monitor disk usage.
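For the monitoring part, a check along these lines could be run by hand or from a scheduled job. It is a sketch only: the function name check_capacity and the threshold are assumptions, not anything OPNsense ships.

```shell
# Hypothetical helper: warn when a filesystem is fuller than a threshold.
# Usage: check_capacity MOUNTPOINT MAX_PERCENT
check_capacity() {
    mount_point="$1"
    max_pct="$2"
    # `df -P` prints one POSIX-format line per filesystem;
    # column 5 is the capacity in percent.
    pct=$(df -P "$mount_point" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    [ -n "$pct" ] || return 2
    if [ "$pct" -gt "$max_pct" ]; then
        echo "WARNING: $mount_point is ${pct}% full (limit ${max_pct}%)"
        return 1
    fi
    echo "OK: $mount_point is ${pct}% full"
}
```

For example, `check_capacity /var/log 80` returns non-zero once /var/log is more than 80% full, which makes it easy to hook into whatever alerting is already in place.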

9axqe · August 23, 2023, 04:04:31 AM

I think something is not normal, I can't even set it beyond 2 days at the moment.

How can it be that flowd.log becomes **48GB** large?... Could it be that something is broken with the log rotation for the IDS/IPS process?
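For scale: if the 48 GB built up over about 30 days (an assumption), that is 48,000,000,000 / (30 × 86,400) ≈ 18,500 bytes per second, sustained. One way to check the current write rate is to sample the file size twice; a minimal sketch, with the hypothetical name growth_rate:

```shell
# Hypothetical helper: estimate how fast a file grows, in bytes per second.
# Usage: growth_rate FILE [SECONDS]   (SECONDS must be a positive integer)
growth_rate() {
    file="$1"
    interval="${2:-10}"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Sample the size, wait, then sample again.
    before=$(wc -c < "$file" | tr -d ' ')
    sleep "$interval"
    after=$(wc -c < "$file" | tr -d ' ')
    # Integer average over the interval; negative if the file was rotated.
    echo $(( (after - before) / interval ))
}
```

For example, `growth_rate /var/log/flowd.log 60` prints the average bytes per second over one minute.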

9axqe · August 23, 2023, 04:30:13 AM

a top also shows logging processes are consuming a LOT of CPU: syslog-ng and filterlog are consistently and by far the two processes consuming the most CPU:

root@sense:/var/log # top

last pid: 86782;  load averages:  2.32,  2.31,  2.25                                              up 1+07:34:11  04:21:42
68 processes:  3 running, 65 sleeping
CPU: 41.0% user,  0.0% nice, 29.9% system,  1.9% interrupt, 27.3% idle
Mem: 207M Active, 660M Inact, 4724K Laundry, 6032M Wired, 2056K Buf, 994M Free
ARC: 5165M Total, 50M MFU, 5038M MRU, 5095K Anon, 11M Header, 60M Other
     4900M Compressed, 5036M Uncompressed, 1.03:1 Ratio
Swap: 8418M Total, 8418M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 4396 root          6  21    0    54M    14M kqread   3  18.0H  66.14% syslog-ng
33799 root          1  52    0    13M  3220K bpf      0 303:55  51.77% filterlog
11798 root          1  37    0    58M    31M select   2   0:01  15.86% php-cgi
88877 root          1  36    0    62M    33M piperd   0   0:04   5.99% php-cgi

newsense · August 23, 2023, 05:37:42 AM

There seems to be a lot of traffic going through that FW.

You need to decide whether some of the logging can be reduced or at the very least sent to a different storage share, else you're gonna keep it busy processing graphs instead of passing traffic

9axqe · August 23, 2023, 08:45:47 AM

No, that's the thing, there's nothing going through it. It's a DEC695 and it's at my home. I have 30 Mbps (less than 4 MB/s) going through it at the moment, and syslog-ng and filterlog each consume about 50% (of a CPU core, I guess).

More questions:
1. why would syslog-ng do anything since I am not sending syslogs anywhere...
2. looking at the logs, flowd.log is huge (was at 48GB in less than a month), hence maybe something to do with flowd. Any settings in particular I could check?

9axqe · August 23, 2023, 08:55:31 AM

Filter logs are huge too, not sure how to reduce logging on this. I already disabled logging on all my rules, I cannot disable it on the automatically generated ones it seems. "Filter" logs is everything generated by firewall rules right?
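One way to find out what is generating the filter entries is to count lines per interface, action and direction. The sketch below rests on an assumption about the log format that should be checked against a real line first: that each entry ends in filterlog's comma-separated record, with the interface in field 5, the action in field 7 and the direction in field 8. The function name filter_breakdown is hypothetical.

```shell
# Hypothetical helper: count filter log lines per interface/action/direction.
# ASSUMPTION: each line ends in filterlog's comma-separated record, where
# field 5 is the interface, field 7 the action and field 8 the direction.
# Reads the files given as arguments, or standard input if there are none.
filter_breakdown() {
    awk -F, 'NF >= 8 { n[$5 " " $7 " " $8]++ }
             END { for (k in n) print n[k], k }' "$@" | sort -rn
}
```

With logs this large, it is probably better to feed it a sample rather than whole files, for example by piping the output of `tail -n 100000` on one file under /var/log/filter into `filter_breakdown`.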

newsense · August 23, 2023, 10:13:04 AM

Suricata can trigger on a lot of things if you enabled _everything_

For example, if you only have Windows laptops and iPhones then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.

Out of 63864 entries, chances are you're going to hit one or more generic rules often enough to quickly fill up storage.
