OPNsense Forum

English Forums => 23.7 Legacy Series => Topic started by: 9axqe on August 20, 2023, 10:48:03 am

Title: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 20, 2023, 10:48:03 am
Hello,

I came home from 2 weeks away to find the web GUI of the router showing "CSRF check failed". I tried incognito tabs, different devices and different browsers (Firefox on macOS, Chrome on macOS, Safari on iOS) all with the same error.

I also could not ping the router somehow (ping 192.168.1.1), but Internet access was working.

Then, I decided to power cycle the router (I know, it was late, I was tired, went for the dumb option...). Now it seems it's completely gone, DHCP is not even coming up anymore, SSH is not reachable. Manually configuring my computer's IP to something within the home subnet (192.168.1.0/24) range is also not working. All of this attempted directly wired to the DEC695 to exclude any other networking issues.

A couple of days before I came back home, I had remotely upgraded (via API, using the Home Assistant integration) to 23.7.1_3 and it was working fine after the upgrade as far as I could tell: the GUI was remotely reachable, and Home Assistant on my home network was reachable.

I'm trying to understand what I can attempt on the Serial Console to troubleshoot this.

I see the prompt "root@:/ #" when connecting to the mini-USB port of my DEC695. Running ifconfig on the serial console displays interfaces igb0 to igb3, but it shows no IPs configured on the WAN interface (which is/was set to igb0 in my case). That is also not normal: it should get its IPv4 and IPv6 addresses via DHCP from my ISP. I connect to the serial console using "screen" but, strangely, I'm not asked for a password.

Any help appreciated.
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 10:54:06 am
I had the idea to reboot with serial connected to see errors.

I see this:

Launching the init system...flock: cannot open lock file /var/run/booting: No space left on device

(and multiple other "no space left on device" errors)

So I guess I managed to fill up some partition somehow. Weird thing is, I checked that disk was 3% full when leaving. Log partition was filling up quickly though and I had made a note to check why on my return.
Title: Re: DEC695 unreachable, DHCP not running
Post by: meyergru on August 20, 2023, 10:57:26 am
If you can login at all via console, you can check under /var/log and delete some big log files, then reboot and see if it fixes the condition.
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 11:01:17 am
Code:
root@:/ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default    2.6G    2.6G      0B   100%    /
devfs                 1.0K    1.0K      0B   100%    /dev
/dev/gpt/efifs        256M    872K    255M     0%    /boot/efi
zroot                  96K     96K      0B   100%    /zroot
zroot/var/audit        96K     96K      0B   100%    /var/audit
zroot/usr/home         96K     96K      0B   100%    /usr/home
zroot/var/crash        96K     96K      0B   100%    /var/crash
zroot/tmp             376K    376K      0B   100%    /tmp
zroot/var/log         220G    220G      0B   100%    /var/log
zroot/var/mail        144K    144K      0B   100%    /var/mail
zroot/usr/ports        96K     96K      0B   100%    /usr/ports
zroot/usr/src          96K     96K      0B   100%    /usr/src
zroot/var/tmp         108K    108K      0B   100%    /var/tmp
root@:/ #



hmmm, does not seem full though... Unless I'm not reading this correctly ("capacity" and "used" columns seem to contradict one another).

I also do not understand how to delete anything in /var/log: I can cd to /zroot/var/log/ but it's empty it seems (ls -la shows nothing).
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 11:41:31 am
I see the following lines when booting:

mkdir: /tmp/.cdrom: No space left on device

chmod: /tmp: No space left on device

chmod: /var/lib/php/sessions: No space left on device

chmod: /root: No space left on device

etc. etc.

It seems highly unlikely to me that all these partitions are full.

If the SSD was broken, would I see the same type of errors?
Title: Re: DEC695 unreachable, DHCP not running
Post by: meyergru on August 20, 2023, 12:37:04 pm
I doubt that the SSD is broken.

ZFS has a zpool (zroot) that has a bunch of datasets which all share the free space of the zpool.
Yours is 100% full, so all datasets appear full as well, which is causing those errors.

It does not matter where you make some space, but /var/log is using up the most space (i.e. 220G).
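To see the shared free space on the console, the datasets can be listed with their own usage. This is only a minimal sketch: it assumes the pool is named zroot as in the df output above, and `biggest_dataset` is a hypothetical helper name, not an OPNsense tool. It uses `usedbydataset` rather than `used` because `used` on a parent dataset also counts its children.

```shell
#!/bin/sh
# Hypothetical helper: read `zfs list -Hp -o name,usedbydataset,avail`
# output (tab-separated, exact byte counts) on stdin and print the dataset
# that holds the most data itself, followed by that size in bytes.
biggest_dataset() {
    awk -F '\t' '
        $2 + 0 > max { max = $2 + 0; used = $2; name = $1 }
        END { if (name != "") print name, used }
    '
}

# On the firewall itself (needs the zfs tools, so it is left as a comment):
#   zpool list zroot
#   zfs list -Hp -o name,usedbydataset,avail -r zroot | biggest_dataset
```

Every dataset should report the same `avail` value there, because they all draw from the one pool.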

You should cd to /var/log (that is the mountpoint), then 'du -sc *'. You will see how much space is in all files and/or directories. Delete some big files or cd to the subdirectory containing most space, 'du -sc *' again there and see if there are older large files.

If you are using flowd, you could:

cd /var/log
rm flowd.log.??????

Another candidate would be /var/log/system, where you could do:

cd /var/log/system
rm system_????????.log

After that, reboot.
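The "du, then descend into the biggest directory" routine above can be sketched as a small read-only helper. `largest_entries` is a hypothetical name and the default directory and count are assumptions; it deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest entries directly under a
# directory, sizes in KiB, largest first. Read-only; deletes nothing.
largest_entries() {
    dir=${1:-/var/log}
    count=${2:-10}
    # du -sk prints "<KiB><TAB><path>"; sort numerically, descending.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example on the console:
#   largest_entries /var/log 5
#   largest_entries /var/log/filter 5
```

Repeat it on whichever subdirectory comes out on top until you reach the files worth removing.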
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 12:55:45 pm
Thanks for the tips. Seems flowd and filter are the worst offenders...

how do I:

1. Limit logs size? (at the expense of log retention duration of course)
2. prevent logs from filling up partitions required for the device to properly start?



Code:
root@:/var/log # du -sch *
 51K    acmeclient
232K    audit
512B    boot.log
 35M    configd
 33M    ddclient
 11M    dhcpd
 97K    dmesg.today
 85K    dmesg.yesterday
171G    filter
1.0M    firewall
 48G    flowd.log
 10M    flowd.log.000001
 10M    flowd.log.000002
 10M    flowd.log.000003
 10M    flowd.log.000004
 10M    flowd.log.000005
 10M    flowd.log.000006
 10M    flowd.log.000007
 10M    flowd.log.000008
 10M    flowd.log.000009
 10M    flowd.log.000010
324K    gateways
164K    lighttpd
252K    monit
4.5K    mount.today
4.5K    mount.yesterday
512B    ntp
2.4M    ntpd
4.5K    pf.today
4.5K    pf.yesterday
170K    pkg
1.3M    portalauth
1.1M    resolver
388K    routing
4.5K    setuid.today
4.5K    setuid.yesterday
512B    squid
 34M    suricata
6.2M    system
 13K    userlog
4.5K    utx.lastlogin
8.5K    utx.log
220G    total


Title: Re: DEC695 unreachable, DHCP not running
Post by: meyergru on August 20, 2023, 02:11:48 pm
Under System: Settings: Logging, you can set the retention time in days. There, you can disable logging to the local disk altogether, e.g. if you send the log data to an external log server under System: Settings: Logging / targets.

Under System: Settings: Miscellaneous, you can also move /var/log to a RAM disk, where it is non-critical if it fills up. The only disadvantage is that logs are not kept over reboots if you do that.
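Before picking a retention value, it can help to preview which files a given number of days would cover. A minimal sketch, assuming the dated file naming implied by the `system_????????.log` glob earlier in the thread; `old_logs` is a hypothetical helper and it only lists files, it never deletes them.

```shell
#!/bin/sh
# Hypothetical helper: print dated log files under a directory that were
# last modified more than N days ago. It only lists; it never deletes.
old_logs() {
    dir=${1:-/var/log}
    days=${2:-10}
    find "$dir" -type f -name '*_????????.log' -mtime +"$days" -print
}

# Example on the console:
#   old_logs /var/log 10
```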
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 03:25:10 pm
thanks @meyergru, I have set it to 10 days to start with, I'll monitor disk usage.
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 23, 2023, 04:04:31 am
I think something is not normal, I can't even set it beyond 2 days at the moment.

How can it be that flowd.log becomes **48GB** large?... Could it be that something is broken with the log rotation for the IDS/IPS process?
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 23, 2023, 04:30:13 am
Running top also shows that the logging processes are consuming a LOT of CPU: syslog-ng and filterlog are consistently, and by far, the two processes consuming the most CPU:

Code:
root@sense:/var/log # top

last pid: 86782;  load averages:  2.32,  2.31,  2.25                                              up 1+07:34:11  04:21:42
68 processes:  3 running, 65 sleeping
CPU: 41.0% user,  0.0% nice, 29.9% system,  1.9% interrupt, 27.3% idle
Mem: 207M Active, 660M Inact, 4724K Laundry, 6032M Wired, 2056K Buf, 994M Free
ARC: 5165M Total, 50M MFU, 5038M MRU, 5095K Anon, 11M Header, 60M Other
     4900M Compressed, 5036M Uncompressed, 1.03:1 Ratio
Swap: 8418M Total, 8418M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 4396 root          6  21    0    54M    14M kqread   3  18.0H  66.14% syslog-ng
33799 root          1  52    0    13M  3220K bpf      0 303:55  51.77% filterlog
11798 root          1  37    0    58M    31M select   2   0:01  15.86% php-cgi
88877 root          1  36    0    62M    33M piperd   0   0:04   5.99% php-cgi

Title: Re: DEC695 unreachable, DHCP not running
Post by: newsense on August 23, 2023, 05:37:42 am
There seems to be a lot of traffic going through that FW.

You need to decide whether some of the logging can be reduced or at the very least sent to a different storage share, else you're gonna keep it busy processing graphs instead of passing traffic
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 23, 2023, 08:45:47 am
No, that's the thing, there's nothing going through it. It's a DEC695 and it's at my home. I have 30 Mbps (less than 4 MB/s) going through it at the moment, and syslog-ng and filterlog each consume 50% (of a CPU core, I guess).

More questions:
1. why would syslog-ng do anything since I am not sending syslogs anywhere...
2. looking at the logs, flowd.log is huge (was at 48GB in less than a month), hence maybe something to do with flowd. Any settings in particular I could check?
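To put those sizes in perspective, the average write rate can be worked out with plain shell arithmetic. This is a rough sketch: `avg_rate` is a hypothetical helper, the 30-day window comes from "less than a month" (so the real rate is at least this high), and applying the same window to the 171G filter directory is purely an assumption.

```shell
#!/bin/sh
# Hypothetical helper: average write rate in bytes per second for a log
# that reached SIZE_GIB gibibytes after DAYS days (integer arithmetic).
avg_rate() {
    size_gib=$1
    days=$2
    echo $(( size_gib * 1024 * 1024 * 1024 / (days * 86400) ))
}

avg_rate 48 30     # flowd.log, 48G in at most 30 days: prints 19884
avg_rate 171 30    # filter, assuming the same 30 days: prints 70837
```

That is roughly 19 KiB/s and 69 KiB/s of log data written around the clock, which is a lot for a link carrying less than 4 MB/s.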

Title: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 08:55:31 am
Filter logs are huge too, not sure how to reduce logging on this. I already disabled logging on all my rules, I cannot disable it on the automatically generated ones it seems. "Filter" logs is everything generated by firewall rules right?
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: newsense on August 23, 2023, 10:13:04 am
Suricata can trigger on a lot of things if you enabled _everything_

For example, if you only have Windows laptops and iPhones  then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.

Out of 63864 entries chances are you're gonna hit enough times one or more generic rules that will quickly fill up storage
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: dinguz on August 23, 2023, 10:47:31 am
You could also enable compression in ZFS to mitigate the problem somewhat.
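For reference, a sketch of the commands involved, assuming the log dataset is zroot/var/log as in the df output earlier in the thread. `compression_commands` is a hypothetical helper that only prints the commands so the dataset name can be checked before anything is run. Note that changing the property only affects data written afterwards, and the first command shows whether compression is already on.

```shell
#!/bin/sh
# Hypothetical helper: print the zfs commands for inspecting and enabling
# compression on a dataset. It only prints them; run them by hand after
# checking the dataset name with `zfs list`.
compression_commands() {
    dataset=${1:-zroot/var/log}
    echo "zfs get compression,compressratio $dataset"
    echo "zfs set compression=lz4 $dataset"
}

compression_commands zroot/var/log
```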
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 11:42:49 am
Quote from: newsense on August 23, 2023, 10:13:04 am
"Suricata can trigger on a lot of things if you enabled _everything_

For example, if you only have Windows laptops and iPhones  then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.

Out of 63864 entries chances are you're gonna hit enough times one or more generic rules that will quickly fill up storage"

I just checked: I have not actually downloaded a single ruleset yet, I just enabled the service. I will have to fix this memory issue first before enabling anything here.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 11:51:58 am
I noticed that my FW tables are very large: 537079/1000000

The reason is, I have implemented some geoblocking.

Could such a large table cause firewall logs to become more verbose? The only block rule leveraging the alias for geographic IP lists is NOT logging though (small "i" is gray).

Secondly, how can flowd.log be 48GB large in less than a month if IPS is enabled but not a single ruleset has been downloaded? Is it expected? (as a reminder: small home network, single server on the network that is only used to backup computers at night) I'm starting to worry what is going to happen if I enable a couple of rulesets...
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: dinguz on August 23, 2023, 12:40:17 pm
In System:Settings:Logging, you can disable logging of the default pass and default block rules, have you checked this?
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 04:19:28 pm
I found the culprit, I think: CPU went down and RAM usage also stabilized. I had one rule to block DoH and DoT to public DNS servers using a public list with 70k such IPs, and that was causing the crazy amount of logs, at least for filter. CPU usage is also down now; I don't see the syslog-ng and filterlog processes eating up 50% of a CPU core all the time...

It seems I need to be careful with these large fw aliases, I didn't expect this.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: newsense on August 23, 2023, 05:10:52 pm
70K DoH/DoT servers seems a bit excessive, doubt there are that many to begin with...


Block outbound destination port 853 to any and udp/443 to the major providers (Google, Cloudflare, Q9, and the default one in Firefox for your country). Together with a redirect of regular DNS to your internal resolver (Unbound, AGH, Pi-hole), you've accomplished the same thing.

Check in Firewall - Live view for any suspicious udp/443 traffic towards obscure DNS servers - shouldn't be that many hits - if at all
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 05:13:49 pm
Nope, I made the wrong assumption. I enabled and disabled logging for all 4 rules individually to double check and the one I assumed was the problem (using an alias with 70k IPs in it) is not causing the problem.

It's the one blocking outbound to TCP/853 (DoT).

The almost identical one, blocking outbound traffic to UDP/8853 (DoQ, DNS over QUIC), is NOT causing this issue.

I tested three times in a row. If I enable logging for the TCP rule, the moment I click apply I see filterlog and syslog-ng shooting up in the top I have running. If I turn it off, it goes away.

Not reproducible for the UDP rule.

If anyone is bored and wants to try to reproduce this... ;) The interface I applied it to is a LAN bridge, and it's applied to IPv4 and IPv6.

So far, I do not understand what's so special about the TCP rule that it would cause this...
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 05:21:02 pm
Quote from: newsense on August 23, 2023, 05:10:52 pm
"70K DoH/DoT servers seems a bit excessive, doubt there are that many to begin with..."

There is probably more than that actually, 70k are the known ones...

I use this list: https://public-dns.info/nameservers.txt
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: newsense on August 23, 2023, 05:32:55 pm
Sure, most of them are regular servers, not  DoH/DoT enabled and you wouldn't be hitting any of it with the DNS/53 intercept rules (need one for IPv6 btw)
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 06:13:03 pm
Quote from: newsense on August 23, 2023, 05:32:55 pm
"Sure, most of them are regular servers, not  DoH/DoT enabled and you wouldn't be hitting any of it with the DNS/53 intercept rules (need one for IPv6 btw)"

ah, right, but they could add it any time. Ideally, I'd have the same list filtered down by DoH capability...
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: CJ on August 24, 2023, 01:45:59 pm
I'm using the same list to block DoH but just standard port blocking for DNS and DoT.  I have all of them set as Floating rules with logging turned on and I don't have this issue.  However, I'm not running a bridge or Suricata currently.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 24, 2023, 04:30:25 pm
It's on the bridge for me, and after upgrading to 23.7.2 I retested, same behaviour, the moment I enable logging on this particular rule, syslog-ng and filterlog processes go nuts.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: CJ on August 25, 2023, 03:33:08 pm
Quote from: 9axqe on August 24, 2023, 04:30:25 pm
"It's on the bridge for me, and after upgrading to 23.7.2 I retested, same behaviour, the moment I enable logging on this particular rule, syslog-ng and filterlog processes go nuts."

Try changing it to a floating rule.  I'm curious if that will make a difference or if it's just due to the bridge.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 25, 2023, 08:44:22 pm
I found what the issue was. Maybe I should simply have opened the logs.

Home Assistant was sending 1-2Mbps (!) of DoT attempts to Cloudflare somehow. I changed the rule from "reject" to "block" and it stopped. That's why this specific rule was causing such a crazy cpu consumption...
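For anyone landing here with the same symptom, "opening the logs" can be reduced to counting which source address hits a given destination port most often. A minimal sketch under stated assumptions: `top_sources` is a hypothetical helper, the records are the comma-separated filterlog lines, and the default field positions (19 for source, 22 for destination port) are my assumption for IPv4 TCP/UDP records, so check them against one real line first. IPv6 records are laid out differently, so pass other positions for those.

```shell
#!/bin/sh
# Hypothetical helper: count log lines per source address for one
# destination port. Reads comma-separated filterlog-style records on stdin.
# The field positions are 1-based; the defaults (19 = source address,
# 22 = destination port) are an ASSUMPTION for IPv4 TCP/UDP records.
top_sources() {
    port=$1
    src_field=${2:-19}
    dport_field=${3:-22}
    awk -F ',' -v port="$port" -v s="$src_field" -v d="$dport_field" \
        '$d == port { print $s }' | sort | uniq -c | sort -rn
}

# Example on the console (the file naming is assumed, adjust as needed):
#   cat /var/log/filter/filter_*.log | top_sources 853 | head
```

A single address with a huge count, as with Home Assistant here, points at one client retrying in a loop. Presumably "reject" made it worse because the client gets an immediate refusal and can retry at once, while "block" leaves it waiting for a timeout.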