Hello,
I came home from 2 weeks away to find the web GUI of the router showing "CSRF check failed". I tried incognito tabs, different devices and different browsers (Firefox on macOS, Chrome on macOS, Safari on iOS) all with the same error.
I also could not ping the router somehow (ping 192.168.1.1), but Internet access was working.
Then I decided to power cycle the router (I know, it was late, I was tired, I went for the dumb option...). Now it seems completely gone: DHCP is not coming up anymore and SSH is not reachable. Manually configuring my computer's IP to an address within the home subnet (192.168.1.0/24) does not work either. All of this was attempted while wired directly to the DEC695, to exclude any other networking issues.
A couple of days before I came back home, I had remotely upgraded (via API, using the Home Assistant integration) to 23.7.1_3, and it was working fine after the upgrade as far as I could tell: the GUI was remotely reachable, and Home Assistant on my home network was reachable.
I'm trying to understand what I can attempt on the Serial Console to troubleshoot this.
I see the prompt "root@:/ #" when connecting to the mini-USB port of my DEC695 using "screen" (strangely, I'm not asked for a password). Running ifconfig on the serial console shows interfaces igb0 to igb3, but no IPs configured on the WAN interface (which is/was set to igb0 in my case), which is also not normal: it should get its IPv4 and IPv6 via DHCP from my ISP.
Any help appreciated.
I had the idea to reboot with the serial console connected to see the errors.
I see this:
Launching the init system...flock: cannot open lock file /var/run/booting: No space left on device
(and multiple other "no space left on device" errors)
So I guess I managed to fill up some partition somehow. The weird thing is, I checked that the disk was only 3% full when I left. The log partition was filling up quickly though, and I had made a note to check why on my return.
If you can log in at all via the console, you can check under /var/log and delete some big log files, then reboot and see if that fixes the condition.
root@:/ # df -h
Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/default 2.6G 2.6G 0B 100% /
devfs 1.0K 1.0K 0B 100% /dev
/dev/gpt/efifs 256M 872K 255M 0% /boot/efi
zroot 96K 96K 0B 100% /zroot
zroot/var/audit 96K 96K 0B 100% /var/audit
zroot/usr/home 96K 96K 0B 100% /usr/home
zroot/var/crash 96K 96K 0B 100% /var/crash
zroot/tmp 376K 376K 0B 100% /tmp
zroot/var/log 220G 220G 0B 100% /var/log
zroot/var/mail 144K 144K 0B 100% /var/mail
zroot/usr/ports 96K 96K 0B 100% /usr/ports
zroot/usr/src 96K 96K 0B 100% /usr/src
zroot/var/tmp 108K 108K 0B 100% /var/tmp
root@:/ #
Hmmm, it does not seem full though... unless I'm not reading this correctly (the "capacity" and "used" columns seem to contradict one another).
I also do not understand how to delete anything in /var/log: I can cd to /zroot/var/log/ but it seems to be empty (ls -la shows nothing).
I see the following lines when booting:
mkdir: /tmp/.cdrom: No space left on device
chmod: /tmp: No space left on device
chmod: /var/lib/php/sessions: No space left on device
chmod: /root: No space left on device
etc. etc.
It seems highly unlikely to me that all these partitions are full.
If the SSD was broken, would I see the same type of errors?
I doubt that the SSD is broken.
ZFS has a zpool (zroot) that has a bunch of datasets which all share the free space of the zpool.
Yours is 100% full, so all datasets appear full as well, which is causing those errors.
It does not matter where you make some space, but /var/log is using up the most space (i.e. 220G).
You should cd to /var/log (that is the mountpoint), then run 'du -sc *'. That shows how much space each file and directory uses. Delete some big files, or cd into the subdirectory containing the most space, run 'du -sc *' again there, and see if there are older large files.
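Something like this (a rough console sketch; zpool, zfs and du are all part of the standard FreeBSD userland on the appliance) confirms that the pool itself is out of space and sorts the /var/log consumers so the biggest ones end up last:

# confirm it is the pool, not a single dataset, that is out of space
zpool list zroot
zfs list -o name,used,avail -r zroot
# then sort what is under /var/log, biggest consumers listed last
cd /var/log
du -sch * | sort -h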
If you are using flowd, you could:
cd /var/log
rm flowd.log.??????
Another candidate would be /var/log/system, where you could do:
cd /var/log/system
rm system_????????.log
After that, reboot.
Thanks for the tips. Seems flowd and filter are the worst offenders...
How do I:
1. Limit log size (at the expense of log retention duration, of course)?
2. Prevent logs from filling up partitions required for the device to start properly?
root@:/var/log # du -sch *
51K acmeclient
232K audit
512B boot.log
35M configd
33M ddclient
11M dhcpd
97K dmesg.today
85K dmesg.yesterday
171G filter
1.0M firewall
48G flowd.log
10M flowd.log.000001
10M flowd.log.000002
10M flowd.log.000003
10M flowd.log.000004
10M flowd.log.000005
10M flowd.log.000006
10M flowd.log.000007
10M flowd.log.000008
10M flowd.log.000009
10M flowd.log.000010
324K gateways
164K lighttpd
252K monit
4.5K mount.today
4.5K mount.yesterday
512B ntp
2.4M ntpd
4.5K pf.today
4.5K pf.yesterday
170K pkg
1.3M portalauth
1.1M resolver
388K routing
4.5K setuid.today
4.5K setuid.yesterday
512B squid
34M suricata
6.2M system
13K userlog
4.5K utx.lastlogin
8.5K utx.log
220G total
Under System: Settings: Logging, you can set the retention time in days. There, you can disable logging to the local disk altogether, e.g. if you send the log data to an external log server under System: Settings: Logging / targets.
You can also send /var/log to a RAM disk (under System: Settings: Miscellaneous), where it is non-critical if it fills up. The only disadvantage is that logs are not kept across reboots if you do that.
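If you enable that RAM disk option, a quick way to verify it after the reboot (just a sketch; the exact filesystem type shown may differ between versions) is:

mount | grep /var/log
df -h /var/log

/var/log should then no longer be backed by the zroot/var/log dataset.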
Thanks @meyergru, I have set it to 10 days to start with; I'll monitor disk usage.
I think something is not normal, though: I can't even set it beyond 2 days at the moment.
How can it be that flowd.log grows to **48GB**?... Could it be that something is broken with the log rotation for the IDS/IPS process?
top also shows that the logging processes are consuming a LOT of CPU: syslog-ng and filterlog are consistently, and by far, the two processes consuming the most CPU:
root@sense:/var/log # top
last pid: 86782; load averages: 2.32, 2.31, 2.25 up 1+07:34:11 04:21:42
68 processes: 3 running, 65 sleeping
CPU: 41.0% user, 0.0% nice, 29.9% system, 1.9% interrupt, 27.3% idle
Mem: 207M Active, 660M Inact, 4724K Laundry, 6032M Wired, 2056K Buf, 994M Free
ARC: 5165M Total, 50M MFU, 5038M MRU, 5095K Anon, 11M Header, 60M Other
4900M Compressed, 5036M Uncompressed, 1.03:1 Ratio
Swap: 8418M Total, 8418M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
4396 root 6 21 0 54M 14M kqread 3 18.0H 66.14% syslog-ng
33799 root 1 52 0 13M 3220K bpf 0 303:55 51.77% filterlog
11798 root 1 37 0 58M 31M select 2 0:01 15.86% php-cgi
88877 root 1 36 0 62M 33M piperd 0 0:04 5.99% php-cgi
There seems to be a lot of traffic going through that FW.
You need to decide whether some of the logging can be reduced, or at the very least sent to a different storage share; otherwise you're gonna keep it busy processing graphs instead of passing traffic.
No, that's the thing, there's nothing going through it. It's a DEC695 and it's at my home. I have 30 Mbps going through it at the moment (Mbps, i.e. less than 4 MB/s), and syslog-ng and filterlog each consume 50% (of a CPU core, I guess).
More questions:
1. Why would syslog-ng do anything, since I am not sending syslogs anywhere?
2. Looking at the logs, flowd.log is huge (it was at 48GB in less than a month), so maybe it's something to do with flowd. Are there any settings in particular I could check?
The filter logs are huge too, and I'm not sure how to reduce logging for those. I already disabled logging on all my rules; it seems I cannot disable it on the automatically generated ones. The "filter" log is everything generated by firewall rules, right?
Suricata can trigger on a lot of things if you enabled _everything_
For example, if you only have Windows laptops and iPhones then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.
Out of 63864 entries chances are you're gonna hit enough times one or more generic rules that will quickly fill up storage
You could also enable compression in ZFS to mitigate the problem somewhat.
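As a rough sketch of what that looks like on the console (lz4 is a common choice; the dataset name is taken from the df output above, and compression only applies to newly written data):

zfs set compression=lz4 zroot/var/log
# check the effect once new log data has been written
zfs get compression,compressratio zroot/var/log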
Quote from: newsense on August 23, 2023, 10:13:04 AM
Suricata can trigger on a lot of things if you enabled _everything_
For example, if you only have Windows laptops and iPhones then Microsoft Exchange or Oracle Weblogic rules don't need to be enabled.
Out of 63864 entries chances are you're gonna hit enough times one or more generic rules that will quickly fill up storage
I just checked: I actually have not downloaded a single ruleset yet, I just enabled the service. I will have to fix this memory issue first before enabling anything here.
I noticed that my FW tables are very large: 537079/1000000.
The reason is that I have implemented some geoblocking.
Could such a large table cause the firewall logs to become more verbose? The only block rule leveraging the alias for geographic IP lists is NOT logging, though (the small "i" is gray).
Secondly, how can flowd.log be 48GB in less than a month if IPS is enabled but not a single ruleset has been downloaded? Is that expected? (As a reminder: small home network, with a single server that is only used to back up computers at night.) I'm starting to worry about what is going to happen if I enable a couple of rulesets...
In System: Settings: Logging, you can disable logging of the default pass and default block rules; have you checked this?
I found the culprit, I think; CPU went down and RAM usage also stabilized. I had one rule to block DoH and DoT to public DNS servers using a public list with 70k such IPs, and that was causing the crazy amount of logs, at least for filter. CPU usage is also down now; I don't see the syslog-ng and filterlog processes eating up 50% of a CPU core all the time...
It seems I need to be careful with these large FW aliases; I didn't expect this.
70K DoH/DoT servers seems a bit excessive, doubt there are that many to begin with...
Block outbound dport 853 to any, and udp/443 to the major providers (Google, Cloudflare, Quad9, and the default one in Firefox for your country); with a redirect of regular DNS to your internal resolver (Unbound, AGH, Pi-hole) you've accomplished the same thing.
Check in Firewall - Live View for any suspicious udp/443 traffic towards obscure DNS servers - there shouldn't be that many hits, if any at all.
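If you prefer the console over the Live View, a rough equivalent (pflog0 only carries traffic from rules that actually have logging enabled, so this assumes logging is on for the rules in question):

# watch logged DoT/DoH-style traffic live; a noisy source address should stand out quickly
tcpdump -n -i pflog0 'tcp dst port 853 or udp dst port 443'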
Nope, I made the wrong assumption. I enabled and disabled logging for all 4 rules individually to double-check, and the one I assumed was the problem (the one using an alias with 70k IPs in it) is not causing it.
It's the one blocking outbound to TCP/853 (DoT).
The almost identical one, blocking outbound traffic to UDP/8853 (DoQ, DNS over QUIC), is NOT causing this issue.
I tested three times in a row. If I enable logging for the TCP rule, the moment I click apply I see filterlog and syslog-ng shooting up in the top I have running. If I turn it off, it goes away.
Not reproducible for the UDP rule.
If anyone is bored and wants to try to reproduce this... ;) The interface I applied it to is a LAN bridge, and it's applied to IPv4 and IPv6.
So far, I do not understand what's so special about the TCP rule that it would cause this...
Quote from: newsense on August 23, 2023, 05:10:52 PM
70K DoH/DoT servers seems a bit excessive, doubt there are that many to begin with...
There are probably more than that, actually; these 70k are just the known ones...
I use this list: https://public-dns.info/nameservers.txt
Sure, most of them are regular servers, not DoH/DoT enabled and you wouldn't be hitting any of it with the DNS/53 intercept rules (need one for IPv6 btw)
Quote from: newsense on August 23, 2023, 05:32:55 PM
Sure, most of them are regular servers, not DoH/DoT enabled and you wouldn't be hitting any of it with the DNS/53 intercept rules (need one for IPv6 btw)
Ah, right, but they could add it at any time. Ideally, I'd have the same list filtered down by DoH capability...
I'm using the same list to block DoH but just standard port blocking for DNS and DoT. I have all of them set as Floating rules with logging turned on and I don't have this issue. However, I'm not running a bridge or Suricata currently.
It's on the bridge for me, and after upgrading to 23.7.2 I retested, same behaviour, the moment I enable logging on this particular rule, syslog-ng and filterlog processes go nuts.
Quote from: 9axqe on August 24, 2023, 04:30:25 PM
It's on the bridge for me, and after upgrading to 23.7.2 I retested, same behaviour, the moment I enable logging on this particular rule, syslog-ng and filterlog processes go nuts.
Try changing it to a floating rule. I'm curious if that will make a difference or if it's just due to the bridge.
I found what the issue was. Maybe I should simply have opened the logs sooner.
Home Assistant was somehow sending 1-2 Mbps (!) of DoT attempts to Cloudflare. I changed the rule from "reject" to "block" and it stopped. That's why this specific rule was causing such crazy CPU consumption...
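For anyone else chasing something similar: a crude console sketch for spotting when such a flood started, using the dated filter log files (same naming scheme as the system_*.log files mentioned earlier; the comma match is only a rough filter for port 853):

cd /var/log/filter
# count log lines mentioning port 853 per daily file; a sudden jump marks the start of the flood
grep -c ',853,' filter_????????.log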