I posted here (https://forum.opnsense.org/index.php?topic=50771.0) earlier while still on 25.7, but I figured I could upgrade to 26.1 and maybe things would change.
And they did... it got worse.
I upgraded from 25.7 to 26.1 and now sqlite3 is constantly writing to the SSD, and I cannot figure out why. Disk usage does not seem to grow (I did not notice a change in the last 3 hours), but the constant writing is worrying.
A top -S -m io -o total shows:
PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
83231 root 481 0 2 2396 0 2398 97.44% sqlite3
35108 hostd 22 0 0 63 0 63 2.56% hostwatch
73985 root 0 0 0 0 0 0 0.00% php-cgi
70657 root 0 0 0 0 0 0 0.00% openvpn
59009 root 0 0 0 0 0 0 0.00% cron
So it is clearly sqlite3 writing all the time. But I have no idea which OPNsense function / service this could be.
I have read that Neighbor Discovery can cause load, so I disabled that. No change.
Then I disabled the Wazuh Agent I installed recently. No change.
I also disabled Netflow. No change.
Captive Portal seems to use sqlite3 too, but that is turned off for me.
I also disabled Suricata to see if it changes anything... no change.
This has been happening ever since the upgrade to 26.1, which was 5 or 6 hours ago. Even if this were some internal database migration, I would expect it to have finished long ago.
How can I track down what sqlite3 is doing here?
Any help is appreciated very much by me and my SSD ;-)
Disable hostwatch: Interfaces --> Neighbours --> Auto discovery.
Thank you for trying to help. But as I wrote above, I already disabled that one (I have seen the thread about it) and even rebooted after disabling, just to be sure. It changes nothing, unfortunately. sqlite3 is still hammering the disk after all those hours. My disk usage does not even change (at least not from what I can see), and I wonder if some process is writing to the same table entry 24/7 here.
I just cannot figure out how to look into sqlite3 to see which PID the SQL queries are coming from :-(
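Edit: My plan is to attach to the process with the stock FreeBSD tools and see which files it has open and what syscalls it issues. Roughly along these lines (the PID is just a placeholder for whatever top shows, so adjust as needed):
# list the files the process currently has open
procstat -f <pid>
# alternative view of the same thing
fstat -p <pid>
# live-trace the system calls it makes (Ctrl+C to detach)
truss -p <pid>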
New day, fresh ideas... I found procstat and played around with it a bit.
root@OPNsense:/dev/fd # procstat -f 88990
PID COMM FD T V FLAGS REF OFFSET PRO NAME
88990 sqlite3 text v r r------- - - - /usr/local/bin/sqlite3
88990 sqlite3 cwd v d r------- - - - /
88990 sqlite3 root v d r------- - - - /
88990 sqlite3 0 v r r------- 1 8519680 - /var/netflow/src_addr_086400.sqlite.clean.sql
88990 sqlite3 1 v c rw------ 6 0 - /dev/null
88990 sqlite3 2 v c rw------ 6 0 - /dev/null
88990 sqlite3 3 v r rw------ 1 0 - /var/netflow/src_addr_086400.sqlite.fix
88990 sqlite3 4 v r rw------ 2 0 - /var/netflow/src_addr_086400.sqlite.fix-journal
So it seems to have something to do with Netflow, maybe? I am not sure about the filenames yet... .clean and .fix sound like something that would happen after an upgrade or crash. I will keep investigating.
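To figure out which OPNsense service actually launches these sqlite3 processes, my next idea is to walk up to the parent process. Something like this should do it (88990 is the PID from the procstat output above, and the second call takes whatever PPID the first one prints):
# show basic process info, including the parent PID (PPID)
ps -o pid,ppid,user,command -p 88990
# then look up what that parent actually is
ps -o pid,command -p <ppid_from_above>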
It did not stop when I removed all interfaces from Netflow (apart from WAN, which seems to be the minimum you need to select). But I will try to stop Netflow entirely to see whether the writing goes away. That will not be a fix, but at least I would know which process is causing the I/O.
So after configuring Netflow like this:
- Listening interfaces: Nothing selected
- WAN interfaces: Nothing selected
A new sqlite3 process was opened, but the writing continues... still on Netflow database files?
top -S -m io -o total
PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
4640 root 484 1 0 2525 0 2525 100.00% sqlite3
47616 root 2 0 0 0 0 0 0.00% lighttpd
1 root 0 0 0 0 0 0 0.00% init
and
root@OPNsense:/dev/fd # procstat -f 4640
PID COMM FD T V FLAGS REF OFFSET PRO NAME
4640 sqlite3 text v r r------- - - - /usr/local/bin/sqlite3
4640 sqlite3 cwd v d r------- - - - /
4640 sqlite3 root v d r------- - - - /
4640 sqlite3 0 v r r------- 1 1572864 - /var/netflow/src_addr_086400.sqlite.clean.sql
4640 sqlite3 1 v c rw------ 6 0 - /dev/null
4640 sqlite3 2 v c rw------ 6 0 - /dev/null
4640 sqlite3 3 v r rw------ 2 0 - /var/netflow/src_addr_086400.sqlite.fix
4640 sqlite3 4 v r rw------ 1 0 - /var/netflow/src_addr_086400.sqlite.fix-journal
So I would assume that it is not Netflow collection itself causing the I/O, but something that works on the data within these database files / tables?
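To get an idea of what is actually inside those helper files, I will simply peek at them next; something like this (paths taken straight from the procstat output above):
# how big are the helper files next to the real databases?
ls -lh /var/netflow/*.sql /var/netflow/*.fix 2>/dev/null
# what kind of SQL is being replayed, and how much of it?
head -n 20 /var/netflow/src_addr_086400.sqlite.clean.sql
wc -l /var/netflow/src_addr_086400.sqlite.clean.sql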
More testing... I tried Reporting --> Settings --> Repair Netflow Data.
After that, my /var/netflow folder grew from around 500 MB to 650 MB, and now I have two sqlite3 processes writing to the disk. Space usage has stayed about the same for over 30 minutes now.
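For the record, I am tracking the folder size with a simple du call, in case anyone wants to compare numbers on their own box (run it a few minutes apart and watch the delta):
# total size of the Netflow databases
du -sh /var/netflow
# per-file sizes, biggest first
ls -lhS /var/netflow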
Next try: Reporting --> Settings --> Reset Netflow Data.
This made the folder shrink from 650 MB to around 300 MB, but both sqlite3 processes are still running and writing to the database files:
root@OPNsense:/var/netflow # procstat -f 7155
PID COMM FD T V FLAGS REF OFFSET PRO NAME
7155 sqlite3 text v r r------- - - - /usr/local/bin/sqlite3
7155 sqlite3 cwd v d r------- - - - /usr/local/opnsense/service
7155 sqlite3 root v d r------- - - - /
7155 sqlite3 0 v r r------- 1 13631488 - /var/netflow/src_addr_details_086400.sqlite.clean.sql
7155 sqlite3 1 v c rw------ 15 0 - /dev/null
7155 sqlite3 2 v c rw------ 15 0 - /dev/null
7155 sqlite3 3 v r rw------ 2 0 - /var/netflow/src_addr_details_086400.sqlite.fix
7155 sqlite3 4 v r rw------ 1 0 - /var/netflow/src_addr_details_086400.sqlite.fix-journal
root@OPNsense:/var/netflow # procstat -f 79089
PID COMM FD T V FLAGS REF OFFSET PRO NAME
79089 sqlite3 text v r r------- - - - /usr/local/bin/sqlite3
79089 sqlite3 cwd v d r------- - - - /
79089 sqlite3 root v d r------- - - - /
79089 sqlite3 0 v r r------- 1 12451840 - /var/netflow/src_addr_086400.sqlite.clean.sql
79089 sqlite3 1 v c rw------ 2 0 - /dev/null
79089 sqlite3 2 v c rw------ 2 0 - /dev/null
79089 sqlite3 3 v r rw------ 2 0 - /var/netflow/src_addr_086400.sqlite.fix
79089 sqlite3 4 v r rw------ 1 0 - /var/netflow/src_addr_086400.sqlite.fix-journal
The folder before I reset the Netflow data:
root@OPNsense:/var/netflow # ls -lsa
total 634961
9 drwxr-x--- 2 root wheel 25 Feb 8 09:08 .
9 drwxr-xr-x 31 root wheel 31 Feb 3 14:04 ..
33 -rw-r----- 1 root wheel 102400 Feb 6 23:59 dst_port_000300.sqlite
13 -rw-r----- 1 root wheel 29240 Feb 7 14:33 dst_port_000300.sqlite-journal
29 -rw-r----- 1 root wheel 376832 Feb 6 23:59 dst_port_003600.sqlite
9 -rw-r----- 1 root wheel 21032 Feb 7 14:33 dst_port_003600.sqlite-journal
24705 -rw-r----- 1 root wheel 73572352 Feb 6 23:59 dst_port_086400.sqlite
13 -rw-r----- 1 root wheel 21032 Feb 7 14:33 dst_port_086400.sqlite-journal
105 -rw-r----- 1 root wheel 2527232 Feb 6 23:59 interface_000030.sqlite
905 -rw-r----- 1 root wheel 2355200 Feb 6 23:59 interface_000300.sqlite
453 -rw-r----- 1 root wheel 983040 Feb 6 23:59 interface_003600.sqlite
269 -rw-r----- 1 root wheel 585728 Feb 8 09:05 interface_086400.sqlite
5 -rw-r----- 1 root wheel 12288 Feb 6 23:59 metadata.sqlite
169 -rw-r----- 1 root wheel 495616 Feb 6 23:59 src_addr_000300.sqlite
109 -rw-r----- 1 root wheel 286720 Feb 8 09:03 src_addr_003600.sqlite
174105 -rw-r----- 1 root wheel 486756352 Feb 7 16:00 src_addr_086400.sqlite
92393 -rw-r----- 1 root wheel 431340383 Feb 8 09:01 src_addr_086400.sqlite.clean.sql
817 -rw-r----- 1 root wheel 2871296 Feb 8 09:08 src_addr_086400.sqlite.fix
1 -rw-r----- 1 root wheel 12824 Feb 8 09:08 src_addr_086400.sqlite.fix-journal
92401 -rw-r----- 1 root wheel 431340429 Feb 8 09:01 src_addr_086400.sqlite.sql
124177 -rw-r----- 1 root wheel 433369088 Feb 8 09:01 src_addr_details_086400.sqlite
62001 -rw-r----- 1 root wheel 353618428 Feb 8 09:05 src_addr_details_086400.sqlite.clean.sql
221 -rw-r----- 1 root wheel 1363968 Feb 8 09:08 src_addr_details_086400.sqlite.fix
1 -rw-r----- 1 root wheel 12824 Feb 8 09:08 src_addr_details_086400.sqlite.fix-journal
62021 -rw-r----- 1 root wheel 353618447 Feb 8 09:05 src_addr_details_086400.sqlite.sql
And the folder after I reset the Netflow Data:
root@OPNsense:/var/netflow # ls -lsa
total 319059
9 drwxr-x--- 2 root wheel 13 Feb 8 09:37 .
9 drwxr-xr-x 31 root wheel 31 Feb 3 14:04 ..
13 -rw-r----- 1 root wheel 29240 Feb 7 14:33 dst_port_000300.sqlite-journal
9 -rw-r----- 1 root wheel 21032 Feb 7 14:33 dst_port_003600.sqlite-journal
13 -rw-r----- 1 root wheel 21032 Feb 7 14:33 dst_port_086400.sqlite-journal
92393 -rw-r----- 1 root wheel 431340383 Feb 8 09:01 src_addr_086400.sqlite.clean.sql
5029 -rw-r----- 1 root wheel 15032320 Feb 8 09:37 src_addr_086400.sqlite.fix
ls: ./src_addr_086400.sqlite.fix-journal: No such file or directory
1 -rw-r----- 1 root wheel 12824 Feb 8 09:37 src_addr_086400.sqlite.fix-journal
92401 -rw-r----- 1 root wheel 431340429 Feb 8 09:01 src_addr_086400.sqlite.sql
62001 -rw-r----- 1 root wheel 353618428 Feb 8 09:05 src_addr_details_086400.sqlite.clean.sql
5165 -rw-r----- 1 root wheel 17735680 Feb 8 09:37 src_addr_details_086400.sqlite.fix
ls: ./src_addr_details_086400.sqlite.fix-journal: No such file or directory
1 -rw-r----- 1 root wheel 12824 Feb 8 09:37 src_addr_details_086400.sqlite.fix-journal
62021 -rw-r----- 1 root wheel 353618447 Feb 8 09:05 src_addr_details_086400.sqlite.sql
The "No such file or directory" only comes up sometimes when doin an ls -lsa, and I suppose that those exist only for a fraction of a second. Maybe these are the files that sqlite3 writes to constantly? Like create the file, write, delete, repeat...?
So after rebooting, no more writing to the disk!
I think the reset of the Netflow database helped. Although there's still around 300 MB of data left in the /var/netflow directory, the reboot killed both sqlite3 processes.
I have seen close to zero write activity on the disk for several minutes now, so the issue seems to be fixed.
Questions that are open:
1. I do not see any sqlite3 processes at all anymore. Is this normal, or did I break something by resetting the Netflow database?
2. I still wonder what triggered the constant writing to the database. Losing my Netflow history is not a big deal for me, but knowing the real root cause and being able to fix it without losing the history would be nice to have.
Edit: A nice side effect is that my CPU temps are back below 50°C instead of hovering around 57-60°C.
Since I think it is fixed now, I want to sum it up for everyone else who might encounter the same problem and is looking for a fix.
Issue
As the subject of this thread says, after upgrading to 26.1 I noticed constant disk activity. My OPNsense box has an LED indicator for disk access, otherwise I might not have noticed it, because it did not impact the performance of OPNsense at all and there was close to no CPU utilization. If you want to check for yourself, you can run
top -S -m io -o total
In my case, sqlite3 was constantly sitting at the top of the list with close to 100% of the I/O, virtually all of it writes.
Investigation
Knowing it was sqlite3 causing the disk I/O (and, even worse, write operations on an SSD), I narrowed down where the transactions were coming from by using
procstat -f <pid_of_top_process_of_previous_command>
This showed me that sqlite3 was writing to the files in /var/netflow.
Not a solution
Trying to disable Netflow did not help. I removed all interfaces from Reporting --> Netflow --> Listening interfaces and WAN interfaces. Even after a reboot, sqlite3 was still writing to the files.
Using Reporting --> Settings --> Repair Netflow data made it worse, spawning a second sqlite3 process; the two just shared the maximum I/O the SSD was capable of. So no solution either.
Solution
Using the Reporting --> Settings --> Reset Netflow data function did reduce the size of the /var/netflow folder from over 600 MB to around 300 MB, but the writing still occurred because the two sqlite3 processes kept running.
A reboot of OPNsense finally killed the two sqlite3 processes, and they did not come back.
Disk I/O now happens so rarely that I can barely notice it on the disk activity LED. System temps went down by about 7-10°C.
After resetting the Netflow database it takes a while for Reporting --> Insight to show data again; let it sit for about 10 minutes and you should see new data.
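If you want to double-check that the writing has really stopped after the reboot, something like the following should be enough; nothing OPNsense-specific, just stock FreeBSD tools:
# confirm no stray sqlite3 processes are left over
pgrep -lf sqlite3
# watch per-device disk activity for a while (updates every 5 seconds, Ctrl+C to stop)
iostat -x 5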