OPNsense Forum

English Forums => 23.7 Legacy Series => Topic started by: 9axqe on August 20, 2023, 10:48:03 am

Title: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 20, 2023, 10:48:03 am: Hello,

I came home from 2 weeks away to find the web GUI of the router showing "CSRF check failed". I tried incognito tabs, different devices and different browsers (Firefox on macOS, Chrome on macOS, Safari on iOS) all with the same error.

I also could not ping the router somehow (ping 192.168.1.1), but Internet access was working.

Then, I decided to power cycle the router (I know, it was late, I was tired, went for the dumb option...). Now it seems it's completely gone, DHCP is not even coming up anymore, SSH is not reachable. Manually configuring my computer's IP to something within the home subnet (192.168.1.0/24) range is also not working. All of this attempted directly wired to the DEC695 to exclude any other networking issues.

A couple of days before I came back home, I had remotely upgraded (via API, using the Home Assistant integration) to 23.7.1_3 and it was working fine after the upgrade as far as I could tell: the GUI was remotely reachable, and Home Assistant on my home network was reachable.

I'm trying to understand what I can attempt on the Serial Console to troubleshoot this.

I see prompt "root@:/ #" when connecting to the mini-USB port of my DEC695. When running ifconfig on the serial console it shows no IPs configured on WAN interface (which is/was set to igb0 in my case) for example, which is also not normal, it should get its IPv4 and IPv6 via DHCP from my ISP. Serial Console is connecting using "screen" but strangely I'm not asked for a password. ifconfig displays interfaces igb0 to 3 for example.

Any help appreciated.
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 10:54:06 am: I had the idea to reboot with serial connected to see errors.

I see this:

Launching the init system...flock: cannot open lock file /var/run/booting: No space left on device

(and multiple other "no space left on device" errors)

So I guess I managed to fill up some partition somehow. Weird thing is, I checked that disk was 3% full when leaving. Log partition was filling up quickly though and I had made a note to check why on my return.
Title: Re: DEC695 unreachable, DHCP not running
Post by: meyergru on August 20, 2023, 10:57:26 am: If you can login at all via console, you can check under /var/log and delete some big log files, then reboot and see if it fixes the condition.
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 11:01:17 am: Code:
root@:/ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default    2.6G    2.6G      0B   100%    /
devfs                 1.0K    1.0K      0B   100%    /dev
/dev/gpt/efifs        256M    872K    255M     0%    /boot/efi
zroot                  96K     96K      0B   100%    /zroot
zroot/var/audit        96K     96K      0B   100%    /var/audit
zroot/usr/home         96K     96K      0B   100%    /usr/home
zroot/var/crash        96K     96K      0B   100%    /var/crash
zroot/tmp             376K    376K      0B   100%    /tmp
zroot/var/log         220G    220G      0B   100%    /var/log
zroot/var/mail        144K    144K      0B   100%    /var/mail
zroot/usr/ports        96K     96K      0B   100%    /usr/ports
zroot/usr/src          96K     96K      0B   100%    /usr/src
zroot/var/tmp         108K    108K      0B   100%    /var/tmp
root@:/ #

hmmm, does not seem full though... Unless I'm not reading this correctly ("capacity" and "used" columns seem to contradict one another).

I also do not understand how to delete anything in /var/log: I can cd to /zroot/var/log/ but it's empty it seems (ls -la shows nothing).
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 11:41:31 am: I see the following lines when booting:

mkdir: /tmp/.cdrom: No space left on device

chmod: /tmp: No space left on device

chmod: /var/lib/php/sessions: No space left on device

chmod: /root: No space left on device

etc. etc.

It seems highly unlikely to me that all these partitions are full.

If the SSD was broken, would I see the same type of errors?
Title: Re: DEC695 unreachable, DHCP not running
Post by: meyergru on August 20, 2023, 12:37:04 pm: I doubt that the SSD is broken.

ZFS has a zpool (zroot) that has a bunch of datasets which all share the free space of the zpool.
Yours is 100% full, so all datasets appear full as well, which causes those errors.

It does not matter where you make some space, but /var/log is using up the most space (i.e. 220G).

You should cd to /var/log (that is the mountpoint), then 'du -sc *'. You will see how much space is in all files and/or directories. Delete some big files or cd to the subdirectory containing most space, 'du -sc *' again there and see if there are older large files.

If you are using flowd, you could:

cd /var/log
rm flowd.log.??????

Another candidate would be /var/log/system, where you could do:

cd /var/log/system
rm system_????????.log

After that, reboot.
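
The du step above can be wrapped in a small helper so the biggest entries show up first. This is only a sketch: largest_entries is a made-up name, not an OPNsense tool, and the /var/log default is simply the mountpoint discussed in this thread. It only lists entries and deletes nothing.

```shell
# largest_entries [DIR] [COUNT]
# Print the COUNT largest files/directories directly under DIR, biggest first.
# Sizes are in KiB (du -k) so that sort can compare them numerically.
# Note: the shell glob (*) skips hidden entries.
largest_entries() {
    dir="${1:-/var/log}"   # default taken from this thread, adjust as needed
    count="${2:-10}"
    # -s: one total per argument. The subshell keeps the cd local.
    ( cd "$dir" && du -sk -- * 2>/dev/null | sort -rn | head -n "$count" )
}
```

Calling largest_entries /var/log 5 and then largest_entries /var/log/filter 5 (assuming filter is a directory, as the later du output suggests) narrows the search one level at a time, which is the same drill-down described above.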
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 12:55:45 pm: Thanks for the tips. Seems flowd and filter are the worst offenders...

how do I:

1. Limit log size? (at the expense of log retention duration of course)
2. Prevent logs from filling up partitions required for the device to properly start?

Code:
root@:/var/log # du -sch *
 51K    acmeclient
232K    audit
512B    boot.log
 35M    configd
 33M    ddclient
 11M    dhcpd
 97K    dmesg.today
 85K    dmesg.yesterday
171G    filter
1.0M    firewall
 48G    flowd.log
 10M    flowd.log.000001
 10M    flowd.log.000002
 10M    flowd.log.000003
 10M    flowd.log.000004
 10M    flowd.log.000005
 10M    flowd.log.000006
 10M    flowd.log.000007
 10M    flowd.log.000008
 10M    flowd.log.000009
 10M    flowd.log.000010
324K    gateways
164K    lighttpd
252K    monit
4.5K    mount.today
4.5K    mount.yesterday
512B    ntp
2.4M    ntpd
4.5K    pf.today
4.5K    pf.yesterday
170K    pkg
1.3M    portalauth
1.1M    resolver
388K    routing
4.5K    setuid.today
4.5K    setuid.yesterday
512B    squid
 34M    suricata
6.2M    system
 13K    userlog
4.5K    utx.lastlogin
8.5K    utx.log
220G    total
Title: Re: DEC695 unreachable, DHCP not running
Post by: meyergru on August 20, 2023, 02:11:48 pm: Under System: Settings: Logging, you can set the retention time in days. There, you can disable logging to the local disk altogether, e.g. if you send the log data to an external log server under System: Settings: Logging / targets.

You can also send /var/log to a RAM disk for which it is non-critical if it fills up under System: Settings: Miscellaneous. The only disadvantage is that logs are not kept over reboots if you do that.
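
To see from the console what a given retention period would leave behind, something like the following can list log files older than a number of days. It is a manual illustration only and not how OPNsense implements its retention setting; old_logs is a hypothetical helper name, and the '*.log' pattern is an assumption based on the file names mentioned earlier in this thread.

```shell
# old_logs DIR DAYS
# List regular *.log files under DIR that were last modified more than
# DAYS full days ago. Nothing is deleted; the function only prints paths.
old_logs() {
    dir="$1"
    days="$2"
    find "$dir" -type f -name '*.log' -mtime +"$days" -print
}
```

Running old_logs /var/log/system 10 would then show which files fall outside a 10-day window. Review the list before removing anything by hand.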
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 20, 2023, 03:25:10 pm: thanks @meyergru, I have set it to 10 days to start with, I'll monitor disk usage.
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 23, 2023, 04:04:31 am: I think something is not normal, I can't even set it beyond 2 days at the moment.

How can it be that flowd.log becomes **48GB** large?... Could it be that something is broken with the log rotation for the IDS/IPS process?
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 23, 2023, 04:30:13 am: top also shows that the logging processes are consuming a LOT of CPU: syslog-ng and filterlog are consistently, and by far, the two processes consuming the most CPU:

Code:
root@sense:/var/log # top
last pid: 86782;  load averages:  2.32,  2.31,  2.25    up 1+07:34:11  04:21:42
68 processes:  3 running, 65 sleeping
CPU: 41.0% user,  0.0% nice, 29.9% system,  1.9% interrupt, 27.3% idle
Mem: 207M Active, 660M Inact, 4724K Laundry, 6032M Wired, 2056K Buf, 994M Free
ARC: 5165M Total, 50M MFU, 5038M MRU, 5095K Anon, 11M Header, 60M Other
     4900M Compressed, 5036M Uncompressed, 1.03:1 Ratio
Swap: 8418M Total, 8418M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 4396 root          6  21    0    54M    14M kqread   3  18.0H  66.14% syslog-ng
33799 root          1  52    0    13M  3220K bpf      0 303:55  51.77% filterlog
11798 root          1  37    0    58M    31M select   2   0:01  15.86% php-cgi
88877 root          1  36    0    62M    33M piperd   0   0:04   5.99% php-cgi
Title: Re: DEC695 unreachable, DHCP not running
Post by: newsense on August 23, 2023, 05:37:42 am: There seems to be a lot of traffic going through that FW.

You need to decide whether some of the logging can be reduced or at the very least sent to a different storage share, else you're gonna keep it busy processing graphs instead of passing traffic
Title: Re: DEC695 unreachable, DHCP not running
Post by: 9axqe on August 23, 2023, 08:45:47 am: No, that's the thing, there's nothing going through it. It's a DEC695 and it's at my home. I have 30 Mbps (less than 4 MB/s) going through it at the moment, and syslog-ng and filterlog each consume 50% (of a CPU core, I guess).

More questions:
1. why would syslog-ng do anything since I am not sending syslogs anywhere...
2. looking at the logs, flowd.log is huge (was at 48GB in less than a month), hence maybe something to do with flowd. Any settings in particular I could check?
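
To put the 48GB figure in perspective, the average write rate can be worked out with shell arithmetic. This is a rough sketch: the 30-day period is an assumption (the post only says "less than a month", so the real rate is at least this high), the size is treated as GiB, and avg_bytes_per_sec is a made-up helper name.

```shell
# avg_bytes_per_sec SIZE_GIB DAYS
# Average write rate, in whole bytes per second, needed to accumulate
# SIZE_GIB gibibytes of log data over DAYS days (integer division).
avg_bytes_per_sec() {
    size_gib="$1"
    days="$2"
    echo $(( size_gib * 1024 * 1024 * 1024 / (days * 86400) ))
}

# 48 GiB of flowd.log over an assumed 30 days
avg_bytes_per_sec 48 30
```

Under those assumptions this comes to roughly 20 kB of flow records written every second, around the clock, on a link that carries less than 4 MB/s at its busiest.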
Title: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 08:55:31 am: Filter logs are huge too, not sure how to reduce logging on this. I already disabled logging on all my rules, I cannot disable it on the automatically generated ones it seems. "Filter" logs is everything generated by firewall rules right?
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: newsense on August 23, 2023, 10:13:04 am: Suricata can trigger on a lot of things if you enabled _everything_

For example, if you only have Windows laptops and iPhones then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.

Out of 63864 entries chances are you're gonna hit enough times one or more generic rules that will quickly fill up storage
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: dinguz on August 23, 2023, 10:47:31 am: You could also enable compression in ZFS to mitigate the problem somewhat.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 11:42:49 am: Quote from: newsense on August 23, 2023, 10:13:04 am
Suricata can trigger on a lot of things if you enabled _everything_

For example, if you only have Windows laptops and iPhones then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.

Out of 63864 entries chances are you're gonna hit enough times one or more generic rules that will quickly fill up storage

I just checked: I have not yet downloaded a single ruleset, I just enabled the service. I will have to fix this memory issue first before enabling anything here.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 11:51:58 am: I noticed that my FW tables are very large: 537079/1000000

The reason is, I have implemented some geoblocking.

Could such a large table cause firewall logs to become more verbose? The only block rule leveraging the alias for geographic IP lists is NOT logging though (small "i" is gray).

Secondly, how can flowd.log be 48GB large in less than a month if IPS is enabled but not a single ruleset has been downloaded? Is it expected? (as a reminder: small home network, single server on the network that is only used to backup computers at night) I'm starting to worry what is going to happen if I enable a couple of rulesets...
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: dinguz on August 23, 2023, 12:40:17 pm: In System:Settings:Logging, you can disable logging of the default pass and default block rules, have you checked this?
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 04:19:28 pm: I found the culprit I think, CPU went down and RAM usage also stabilized. I had one rule to block DoH and DoT to public DNS servers using a public list with 70k such IPs, that was causing the crazy amount of logs, at least for filter. CPU usage is also down now, I don't see the syslog-ng and filterlog processes eating up 50% of a CPU core all the time...

It seems I need to be careful with these large fw aliases, I didn't expect this.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: newsense on August 23, 2023, 05:10:52 pm: 70K DoH/DoT servers seems a bit excessive, doubt there are that many to begin with...

Block outbound Dport 853 to any, udp/443 to major providers (Google, Cloudflare, Q9, and the default one in Firefox for your country) and with a redirect of regular dns to your internal one (Unbound, AGH, Pi-hole) you've accomplished the same thing

Check in Firewall - Live view for any suspicious udp/443 traffic towards obscure DNS servers - shouldn't be that many hits - if at all
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 05:13:49 pm: Nope, I made the wrong assumption. I enabled and disabled logging for all 4 rules individually to double check and the one I assumed was the problem (using an alias with 70k IPs in it) is not causing the problem.

It's the one blocking outbound to TCP/853 (DoT).

The almost identical one, blocking outbound traffic to UDP/8853 (DoQ, DNS over quic) is NOT causing this issue.

I tested three times in a row. If I enable logging for the TCP rule, the moment I click apply I see filterlog and syslog-ng shooting up in the top I have running. If I turn it off, it goes away.

Not reproducible for the UDP rule.

If anyone is bored and wants to try to reproduce this... ;) The interface I applied it to is a LAN bridge, and it's applied to IPv4 and IPv6.

So far, I do not understand what's so special about the TCP rule that it would cause this...
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 05:21:02 pm: Quote from: newsense on August 23, 2023, 05:10:52 pm
70K DoH/DoT servers seems a bit excessive, doubt there are that many to begin with...

There are probably more than that actually, 70k are the known ones...

I use this list: https://public-dns.info/nameservers.txt
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: newsense on August 23, 2023, 05:32:55 pm: Sure, most of them are regular servers, not DoH/DoT enabled and you wouldn't be hitting any of it with the DNS/53 intercept rules (need one for IPv6 btw)
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 23, 2023, 06:13:03 pm: Quote from: newsense on August 23, 2023, 05:32:55 pm
Sure, most of them are regular servers, not DoH/DoT enabled and you wouldn't be hitting any of it with the DNS/53 intercept rules (need one for IPv6 btw)

ah, right, but they could add it any time. Ideally, I'd have the same list filtered down by DoH capability...
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: CJ on August 24, 2023, 01:45:59 pm: I'm using the same list to block DoH but just standard port blocking for DNS and DoT. I have all of them set as Floating rules with logging turned on and I don't have this issue. However, I'm not running a bridge or Suricata currently.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 24, 2023, 04:30:25 pm: It's on the bridge for me, and after upgrading to 23.7.2 I retested, same behaviour, the moment I enable logging on this particular rule, syslog-ng and filterlog processes go nuts.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: CJ on August 25, 2023, 03:33:08 pm: Quote from: 9axqe on August 24, 2023, 04:30:25 pm
It's on the bridge for me, and after upgrading to 23.7.2 I retested, same behaviour, the moment I enable logging on this particular rule, syslog-ng and filterlog processes go nuts.

Try changing it to a floating rule. I'm curious if that will make a difference or if it's just due to the bridge.
Title: Re: opnsense unreachable, disk full with Filter and Suricata logs
Post by: 9axqe on August 25, 2023, 08:44:22 pm: I found what the issue was. Maybe I should simply have opened the logs.

Home Assistant was sending 1-2Mbps (!) of DoT attempts to Cloudflare somehow. I changed the rule from "reject" to "block" and it stopped. That's why this specific rule was causing such a crazy cpu consumption...
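
For anyone chasing a similar problem, a quick way to spot a single noisy host is to count how often each value appears in one column of the comma-separated filter log. The sketch below is generic on purpose: top_field_values is a hypothetical helper, and which column holds the source address is an assumption to verify against your own log lines (field 19 is what I would expect for IPv4 entries in the pf-style filterlog format, but check before relying on it).

```shell
# top_field_values FILE FIELD [COUNT]
# Count how often each value occurs in comma-separated column FIELD of FILE
# and print the COUNT most frequent ones as "<occurrences> <value>".
top_field_values() {
    file="$1"
    field="$2"
    count="${3:-10}"
    awk -F',' -v f="$field" '
        NF >= f { seen[$f]++ }                       # skip lines that are too short
        END     { for (v in seen) print seen[v], v }
    ' "$file" | sort -rn | head -n "$count"
}
```

Pointing it at one of the files under /var/log/filter (the exact file name depends on the installation) with the source-address column should put a host like the Home Assistant box at the top of the list straight away.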