Messages - buedi

#1
Just linking the other thread, because it has the solution to my problem: https://forum.opnsense.org/index.php?topic=50783.0
#2
Since I think it is fixed now, I want to sum it up for everyone else who might encounter the same problem and is looking for a fix.

Issue
As the subject of this thread says, after upgrading to 26.1 I noticed constant disk activity. My OPNsense box has an LED indicator for disk access; otherwise I might not have noticed, because it did not impact the performance of OPNsense at all and there was close to no CPU utilization. If you want to check for yourself, you can with
top -S -m io -o total

In my case sqlite3 was constantly sitting on top of the list with 100% write transactions.
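For later reference, such a snapshot can also be post-processed with standard tools. A sketch that pulls the PID with the most write transactions out of a captured `top -S -m io -o total` listing (the sample lines are taken from output later in this thread):

```shell
# Extract the PID with the most write transactions from a captured
# snapshot of `top -S -m io -o total` (sample data from this thread).
snapshot='  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
 4640 root         484      1      0   2525      0   2525 100.00% sqlite3
47616 root           2      0      0      0      0      0   0.00% lighttpd'

# Field 6 is WRITE, field 1 is PID; skip the header line.
top_writer=$(printf '%s\n' "$snapshot" | awk 'NR > 1 { print $6, $1 }' | sort -rn | head -n 1 | awk '{ print $2 }')
echo "$top_writer"
```

On a live system you would feed this from `top` in batch mode instead of a saved snippet.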


Investigation
Knowing it was sqlite3 causing the disk I/O, worse still write transactions on an SSD, I narrowed down where the SQL transactions might come from by using
procstat -f <pid_of_top_process_of_previous_command>
This showed me that sqlite3 was writing to the files in /var/netflow.
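The same check can be scripted by filtering a saved `procstat -f` dump for the paths of interest; a sketch using two captured lines from this thread:

```shell
# List the unique /var/netflow files a process has open, from saved output.
# The last field of each procstat -f line is the file name.
procstat_out=' 4640 sqlite3              3 v r rw------   2       0 -   /var/netflow/src_addr_086400.sqlite.fix
 4640 sqlite3              1 v c rw------   6       0 -   /dev/null'

paths=$(printf '%s\n' "$procstat_out" | awk '{ print $NF }' | grep '^/var/netflow/' | sort -u)
echo "$paths"
```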

Not a solution
Trying to disable Netflow did not help. I removed all interfaces from Reporting --> Netflow --> Listening interfaces and WAN interfaces. Even after a reboot, sqlite3 was still writing to the files.

Using Reporting --> Settings --> Repair Netflow data made it worse, spawning a second sqlite3 process; the two just shared the maximum I/O the SSD was capable of. So no solution either.

Solution
Using the Reporting --> Settings --> Reset Netflow data function did reduce the size of the /var/netflow folder from over 600MB to around 300MB, but the writing still occurred because the 2 sqlite3 processes were still running.
A reboot of OPNsense finally killed the 2 sqlite3 processes and they did not come back.

Disk I/O now happens so rarely that I can barely notice it on the disk activity LED. System temperatures went down by about 7-10°C.
After the reset of the Netflow database, it takes a while for Reporting --> Insight to show data again; let it sit for about 10 minutes and you should see new data.
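To track whether a reset actually frees space, the folder size can be checked before and after with `du`. A sketch (demoed against a temporary directory rather than /var/netflow):

```shell
# Report a directory's size in kilobytes.
dirsize_kb() {
  du -sk "$1" | awk '{ print $1 }'
}

# Demo: create a temp dir with a 64 KB file and measure it.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/blob" bs=1024 count=64 2>/dev/null
size=$(dirsize_kb "$tmp")
echo "$size"
rm -r "$tmp"
```

On the firewall you would simply run `du -sk /var/netflow` before and after the reset and compare.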
#3
So after rebooting, no more writing to the disk!
I think the reset of the Netflow database helped. Although there's still around 300MB of data left in the /var/netflow directory, the reboot killed both sqlite3 processes.

I see close to 0 write access on the disk now for several minutes, so the issue seems to be fixed.

Questions that are open:
1. I do not see any sqlite3 processes anymore at all. Is this right or is something broken after I reset the Netflow database?
2. I still wonder what triggered the constant writing to the database. Losing my Netflow history is not a big deal for me, but knowing the real root cause and being able to fix it without losing the history would be nice to have.

Edit: A nice side effect is that my CPU temps are back to sub-50°C instead of around 57-60°C.
#4
More testing... I tried Reporting --> Settings --> Repair Netflow Data.
After that, my /var/netflow folder grew from around 500MB to 650MB, and now I have 2 sqlite3 processes writing to the disk. Space usage has stayed about the same for over 30 minutes now.

Next try: Reporting --> Settings --> Reset Netflow Data.
This made the folder shrink from 650MB to around 300MB, but both sqlite3 processes are still running and writing to the database files:

root@OPNsense:/var/netflow # procstat -f 7155
  PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
 7155 sqlite3           text v r r-------   -       - -   /usr/local/bin/sqlite3
 7155 sqlite3            cwd v d r-------   -       - -   /usr/local/opnsense/service
 7155 sqlite3           root v d r-------   -       - -   /
 7155 sqlite3              0 v r r-------   1 13631488 -   /var/netflow/src_addr_details_086400.sqlite.clean.sql
 7155 sqlite3              1 v c rw------  15       0 -   /dev/null
 7155 sqlite3              2 v c rw------  15       0 -   /dev/null
 7155 sqlite3              3 v r rw------   2       0 -   /var/netflow/src_addr_details_086400.sqlite.fix
 7155 sqlite3              4 v r rw------   1       0 -   /var/netflow/src_addr_details_086400.sqlite.fix-journal
root@OPNsense:/var/netflow # procstat -f 79089
  PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
79089 sqlite3           text v r r-------   -       - -   /usr/local/bin/sqlite3
79089 sqlite3            cwd v d r-------   -       - -   /
79089 sqlite3           root v d r-------   -       - -   /
79089 sqlite3              0 v r r-------   1 12451840 -   /var/netflow/src_addr_086400.sqlite.clean.sql
79089 sqlite3              1 v c rw------   2       0 -   /dev/null
79089 sqlite3              2 v c rw------   2       0 -   /dev/null
79089 sqlite3              3 v r rw------   2       0 -   /var/netflow/src_addr_086400.sqlite.fix
79089 sqlite3              4 v r rw------   1       0 -   /var/netflow/src_addr_086400.sqlite.fix-journal

The folder before I reset the Netflow data:
root@OPNsense:/var/netflow # ls -lsa
total 634961
     9 drwxr-x---   2 root wheel        25 Feb  8 09:08 .
     9 drwxr-xr-x  31 root wheel        31 Feb  3 14:04 ..
    33 -rw-r-----   1 root wheel    102400 Feb  6 23:59 dst_port_000300.sqlite
    13 -rw-r-----   1 root wheel     29240 Feb  7 14:33 dst_port_000300.sqlite-journal
    29 -rw-r-----   1 root wheel    376832 Feb  6 23:59 dst_port_003600.sqlite
     9 -rw-r-----   1 root wheel     21032 Feb  7 14:33 dst_port_003600.sqlite-journal
 24705 -rw-r-----   1 root wheel  73572352 Feb  6 23:59 dst_port_086400.sqlite
    13 -rw-r-----   1 root wheel     21032 Feb  7 14:33 dst_port_086400.sqlite-journal
   105 -rw-r-----   1 root wheel   2527232 Feb  6 23:59 interface_000030.sqlite
   905 -rw-r-----   1 root wheel   2355200 Feb  6 23:59 interface_000300.sqlite
   453 -rw-r-----   1 root wheel    983040 Feb  6 23:59 interface_003600.sqlite
   269 -rw-r-----   1 root wheel    585728 Feb  8 09:05 interface_086400.sqlite
     5 -rw-r-----   1 root wheel     12288 Feb  6 23:59 metadata.sqlite
   169 -rw-r-----   1 root wheel    495616 Feb  6 23:59 src_addr_000300.sqlite
   109 -rw-r-----   1 root wheel    286720 Feb  8 09:03 src_addr_003600.sqlite
174105 -rw-r-----   1 root wheel 486756352 Feb  7 16:00 src_addr_086400.sqlite
 92393 -rw-r-----   1 root wheel 431340383 Feb  8 09:01 src_addr_086400.sqlite.clean.sql
   817 -rw-r-----   1 root wheel   2871296 Feb  8 09:08 src_addr_086400.sqlite.fix
     1 -rw-r-----   1 root wheel     12824 Feb  8 09:08 src_addr_086400.sqlite.fix-journal
 92401 -rw-r-----   1 root wheel 431340429 Feb  8 09:01 src_addr_086400.sqlite.sql
124177 -rw-r-----   1 root wheel 433369088 Feb  8 09:01 src_addr_details_086400.sqlite
 62001 -rw-r-----   1 root wheel 353618428 Feb  8 09:05 src_addr_details_086400.sqlite.clean.sql
   221 -rw-r-----   1 root wheel   1363968 Feb  8 09:08 src_addr_details_086400.sqlite.fix
     1 -rw-r-----   1 root wheel     12824 Feb  8 09:08 src_addr_details_086400.sqlite.fix-journal
 62021 -rw-r-----   1 root wheel 353618447 Feb  8 09:05 src_addr_details_086400.sqlite.sql

And the folder after I reset the Netflow Data:
root@OPNsense:/var/netflow # ls -lsa
total 319059
    9 drwxr-x---   2 root wheel        13 Feb  8 09:37 .
    9 drwxr-xr-x  31 root wheel        31 Feb  3 14:04 ..
   13 -rw-r-----   1 root wheel     29240 Feb  7 14:33 dst_port_000300.sqlite-journal
    9 -rw-r-----   1 root wheel     21032 Feb  7 14:33 dst_port_003600.sqlite-journal
   13 -rw-r-----   1 root wheel     21032 Feb  7 14:33 dst_port_086400.sqlite-journal
92393 -rw-r-----   1 root wheel 431340383 Feb  8 09:01 src_addr_086400.sqlite.clean.sql
 5029 -rw-r-----   1 root wheel  15032320 Feb  8 09:37 src_addr_086400.sqlite.fix
ls: ./src_addr_086400.sqlite.fix-journal: No such file or directory
    1 -rw-r-----   1 root wheel     12824 Feb  8 09:37 src_addr_086400.sqlite.fix-journal
92401 -rw-r-----   1 root wheel 431340429 Feb  8 09:01 src_addr_086400.sqlite.sql
62001 -rw-r-----   1 root wheel 353618428 Feb  8 09:05 src_addr_details_086400.sqlite.clean.sql
 5165 -rw-r-----   1 root wheel  17735680 Feb  8 09:37 src_addr_details_086400.sqlite.fix
ls: ./src_addr_details_086400.sqlite.fix-journal: No such file or directory
    1 -rw-r-----   1 root wheel     12824 Feb  8 09:37 src_addr_details_086400.sqlite.fix-journal
62021 -rw-r-----   1 root wheel 353618447 Feb  8 09:05 src_addr_details_086400.sqlite.sql
The "No such file or directory" only comes up sometimes when doing an ls -lsa, and I suppose those files exist only for a fraction of a second. Maybe these are the files sqlite3 writes to constantly? Like: create the file, write, delete, repeat...?
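That create/write/delete pattern matches how SQLite rollback journals behave: a `-journal` file is created for a transaction and deleted on commit. A rough way to confirm short-lived files is a polling loop (a sketch; the directory and iteration count are parameters, and polling can still miss files that live only milliseconds). The demo runs against a temporary directory instead of /var/netflow:

```shell
# Poll a directory repeatedly and collect every *-journal name ever seen.
watch_journals() {
  dir=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    ls "$dir" 2>/dev/null | grep -e '-journal$'
    i=$((i + 1))
  done | sort -u
}

# Demo with a temporary directory instead of /var/netflow:
tmp=$(mktemp -d)
touch "$tmp/src_addr_086400.sqlite.fix-journal"
seen=$(watch_journals "$tmp" 3)
echo "$seen"
rm -r "$tmp"
```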
#5
So after configuring Netflow like this:
- Listening interfaces: Nothing selected
- WAN interfaces: Nothing selected

A new sqlite3 process was opened, but the writing continues... still on Netflow database files?

top -S -m io -o total
  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
 4640 root         484      1      0   2525      0   2525 100.00% sqlite3
47616 root           2      0      0      0      0      0   0.00% lighttpd
    1 root           0      0      0      0      0      0   0.00% init

and
root@OPNsense:/dev/fd # procstat -f 4640
  PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
 4640 sqlite3           text v r r-------   -       - -   /usr/local/bin/sqlite3
 4640 sqlite3            cwd v d r-------   -       - -   /
 4640 sqlite3           root v d r-------   -       - -   /
 4640 sqlite3              0 v r r-------   1 1572864 -   /var/netflow/src_addr_086400.sqlite.clean.sql
 4640 sqlite3              1 v c rw------   6       0 -   /dev/null
 4640 sqlite3              2 v c rw------   6       0 -   /dev/null
 4640 sqlite3              3 v r rw------   2       0 -   /var/netflow/src_addr_086400.sqlite.fix
 4640 sqlite3              4 v r rw------   1       0 -   /var/netflow/src_addr_086400.sqlite.fix-journal

So I would assume that it is not Netflow itself causing the I/O, but something that works with the data within these database files / tables?
#6
New day, fresh Ideas... I found procstat and played around with it a bit.

root@OPNsense:/dev/fd # procstat -f 88990
  PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
88990 sqlite3           text v r r-------   -       - -   /usr/local/bin/sqlite3
88990 sqlite3            cwd v d r-------   -       - -   /
88990 sqlite3           root v d r-------   -       - -   /
88990 sqlite3              0 v r r-------   1 8519680 -   /var/netflow/src_addr_086400.sqlite.clean.sql
88990 sqlite3              1 v c rw------   6       0 -   /dev/null
88990 sqlite3              2 v c rw------   6       0 -   /dev/null
88990 sqlite3              3 v r rw------   1       0 -   /var/netflow/src_addr_086400.sqlite.fix
88990 sqlite3              4 v r rw------   2       0 -   /var/netflow/src_addr_086400.sqlite.fix-journal

So it seems to have something to do with Netflow, maybe? I am not sure about the filenames yet... .clean and .fix sound like something that would happen after an upgrade or crash. I will keep investigating.
It did not stop when I removed all interfaces from Netflow (apart from WAN, which seems to be the minimum you need to select). But I will try to stop Netflow entirely to see if the writing goes away. That will not be a fix, but at least I would know which process is causing the I/O.
#7
Thank you for trying to help. But as I wrote above, I disabled that one already (I have seen the thread about it) and even rebooted after disabling, just to be sure. It changes nothing, unfortunately. sqlite3 is still hammering the disk after all these hours. My disk usage does not even change (at least not that I can see), and I wonder if some process is writing to the same table entry 24/7 here.

I just cannot figure out how to look into sqlite3 to see which SQL queries come from which PID :-(
#8
I posted here (https://forum.opnsense.org/index.php?topic=50771.0) earlier while on 25.7, but I figured I could upgrade to 26.1; maybe things would change.
And they did... it got worse.

I upgraded from 25.7 to 26.1, and now sqlite3 is constantly writing to the SSD and I cannot figure out why. It does not look like more space is being consumed (I have not noticed a change in the last 3 hours), but the constant writing is worrying.

A top -S -m io -o total shows:
  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
83231 root         481      0      2   2396      0   2398  97.44% sqlite3
35108 hostd         22      0      0     63      0     63   2.56% hostwatch
73985 root           0      0      0      0      0      0   0.00% php-cgi
70657 root           0      0      0      0      0      0   0.00% openvpn
59009 root           0      0      0      0      0      0   0.00% cron

So it is clearly sqlite3 writing all the time. But I have no idea which OPNsense function / service this could be.

I have read that Neighbor Discovery can cause load, so I disabled that. No change.
Then I disabled the Wazuh Agent I installed recently. No change.
I also disabled Netflow. No change.
Captive Portal seems to use sqlite3 too, but that is turned off for me.
I also disabled Suricata to see if it changes anything... no change.

This has been happening since I upgraded to 26.1, and the upgrade was 5 or 6 hours ago. I suspect that even if this were some internal database migration, it should have finished long ago.

How can I track down what sqlite3 is doing here?

Any help is appreciated very much by me and my SSD ;-)
#9
Thank you very much. I figured that before tracking this down, I would upgrade to 26.1 instead... and now it got worse. Since this is another version, I will open another thread in the 26.1 section.

To have the answer to my question here, just in case someone else stumbles upon it:

```
top -S -m io -o total
```

Shows the processes causing I/O.
#10
Hi everyone,

I am still on 25.7, and for a few weeks I have been encountering system instabilities and increased disk I/O. The two might not be related, but disk I/O is what I want to look into first, because it is also visible from the outside.

My OPNsense box has a disk activity LED. Usually it flashes once every 5-10 seconds, I would say. Recently it is more like every 0.5 seconds, sometimes even constantly on for a few seconds. When I set up my OPNsense around a year ago, I took care to minimize logging as much as possible, but either things have changed due to new functionality / updates, or my system is having issues that cause more logging.

I lack the BSD knowledge to find out which processes or parts of OPNsense are causing the disk I/O, and I would appreciate it if someone could point me in the right direction.

I am not saying that OPNsense is causing the crashes I encounter, but maybe it is logging some faults that lead to the crash after a while. Also, if possible, I want to find the root cause of the disk I/O and bring it down again, for less heat and less wear on the SSD.

Any help is appreciated very much.

Thank you very much in advance :-)
#11
Thanks for your reply. I did some packet captures, and I see my packets coming in from my client, meant for example to reach the ipv6.google.com IPv6 address. I am not yet able to see in the capture itself whether that packet was routed properly. Maybe I should analyze the pcap files with Wireshark to see more information?

I am beginning to doubt that the tunnel is doing anything at all, even though I can reach ipv6.google.com from my OPNsense. When pinging ipv6.google.com directly from the OPNsense, I get a round-trip time of about 0.1 to 0.2 ms. When pinging www.google.com over IPv6, I get a more realistic round-trip time of around 15-20 ms. I refuse to believe that IPv6 manages to improve round-trip time that much ;-)

Also, to set up the GIF interface, I get the following information from Hurricane Electric:
- Server IPv4 address (the endpoint I need to talk to over the IPv4 network to reach the HE Tunnelbroker)
- Server IPv6 address (in a /64, ending in ::1)
- Client IPv6 address (ending in ::2, in the same /64 as the Server IPv6 address)

Now my common sense and the tutorial at https://docs.opnsense.org/manual/how-tos/ipv6_tunnelbroker.html tell me that when configuring the GIF interface, I need to
- Put the Server IPv6 address into the GIF tunnel remote address and
- the Client IPv6 address into the GIF tunnel local address

If I configure it that way, I cannot ping a remote IPv6 address.
When I enter them the other way round, I can (seemingly) ping remote IPv6 addresses. And when I revert the configuration again (to how it should be according to the tutorial), it still seems to work.

None of that makes sense to me. The low ping times, and the fact that I have to configure the GIF interface "wrong" to get IPv6 up and running, and can then revert it and it still seems to work (on the OPNsense only, though), throw me off track. I cannot comprehend this.

When I do a live trace via Firewall --> Log Files --> Live View and filter for, say, the ipv6.google.com destination address, it shows that the firewall rules allow the traffic to the IPv6 tunnel interface I created. Because of the low latency, I suppose my packets never leave the OPNsense: no matter what IPv6 address DNS resolves, the traffic stays on the OPNsense. Otherwise, ping times on par with those for 127.0.0.1 should not be possible.

Maybe I should sleep another night over this...
#12
There is no CGNAT in my case, and yes, both IPs show the same. I am self-hosting all kinds of stuff from home and get a "pretty static" (usually for months) IPv4 address from my ISP.

I did make some progress today, though. I picked another provider and was able to establish the tunnel and ping from my OPNsense to the other end of the tunnel and to ipv6.google.com. I am not able to get this working from my LAN yet, although I have a firewall policy on my LAN interface that allows IPv6 and IPv4 (both rules are identical) to *.

Router advertisement seems to work, as my systems in the LAN get an IPv6 address from the block the Tunnelbroker assigned me. I can also ping the IPv6 addresses of the OPNsense, but traffic either does not get routed through the tunnel or does not find its way back.

How would I debug that? Is tcpdump the way to go, or is it meant for packet inspection rather than checking routing issues? I am not very proficient in the BSD area. Pointing me to the right tools should be enough, as I am willing to learn and get better at managing OPNsense and using BSD. So any tips are very welcome on how you would start debugging the current situation: OPNsense can now use the tunnel, but systems in the LAN cannot, despite being assigned their IPv6 address and a default route to the OPNsense.
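tcpdump is indeed the usual first tool for this: capturing on the LAN interface and on the tunnel interface at the same time shows whether packets arrive from the LAN and whether they leave through the tunnel. A sketch (the interface names are assumptions, and the captured line below is hypothetical):

```shell
# On the firewall one would typically compare two live captures (FreeBSD):
#   tcpdump -ni <lan_if> icmp6    # do the echo requests arrive from the LAN?
#   tcpdump -ni gif0 icmp6        # do they go out through the tunnel?
# Counting echo requests in a saved text dump (hypothetical sample line):
dump='12:00:01.000000 IP6 2001:db8::10 > 2001:db8:1::1: ICMP6, echo request, id 7, seq 1, length 16'
echo_reqs=$(printf '%s\n' "$dump" | grep -c 'echo request')
echo "$echo_reqs"
```

If requests show up on the LAN side but never on gif0, routing or filtering on the firewall is the suspect; if they leave gif0 and no replies come back, the return path via the tunnel broker is.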

Thank you very much in advance :-)
#13
Hello everyone,
I searched the forum and some other bits of the internet and it seems like this setup usually is a no-brainer. But for some odd reason, I cannot get it up and running and I am a bit lost on how to debug this.
I got myself a /64 prefix from tunnelbroker.net and tried to configure it on my OPNsense. Although everything on my end shows up green / up, I cannot even ping the remote end of the tunnel.
What I did is what is in the documentation here: https://docs.opnsense.org/manual/how-tos/ipv6_tunnelbroker.html.
I ended up with a gif interface in the interface overview, which shows up with the correct IPv6 addresses.
Also in the gateways, I made sure the tunnel is the default IPv6 gateway.

ifconfig shows me that the interface is there with the correct prefix length:
```
gif0: flags=1008051<UP,POINTOPOINT,RUNNING,MULTICAST,LOWER_UP> metric 0 mtu 1280
        description: IPv6Tunnel (opt7)
        options=80000<LINKSTATE>
        tunnel inet 1xx.x.x.9 --> 216.66.80.30
        inet6 fe80::aab8:e0ff:fe03:fec5%gif0 prefixlen 64 scopeid 0xf
        inet6 2001:470:xxxx:xxx::2 prefixlen 64
        groups: gif
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
```
netstat -rn6 shows me that the IPv6 tunnel is indeed the default gateway:
```
Routing tables

Internet6:
Destination                       Gateway                       Flags         Netif Expire
default                           2001:470:xxxx:xxx::1          UGS            gif0
```

But I cannot ping the other end of the tunnel. All "local" IPv6 addresses work. Even with SLAAC configured, my clients get valid IPv6 addresses, and up to the LAN interface on the OPNsense I can ping all hosts. It just seems like nothing wants to go through the tunnel.
But if I look at the live view and filter the destination IP I am trying to ping, it shows no blocked traffic... quite the contrary, it shows that the packet was sent through the tunnel interface.

And this is where I am lost... I have the impression that all interfaces are configured correctly and that the route for IPv6 traffic into the tunnel is honored. Tunnelbroker.net is a free service, and I want to make sure I have checked everything on my side before opening a ticket and asking them for help. Is there anything else I can do to determine whether I have a problem on my end?
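For a scripted sanity check, the outgoing interface of the IPv6 default route can be pulled out of a captured `netstat -rn6` dump shaped like the one above; a sketch with a hypothetical documentation-prefix gateway:

```shell
# Extract the Netif column of the IPv6 default route from saved output.
routes='Internet6:
Destination                       Gateway                       Flags         Netif Expire
default                           2001:db8:1::1                 UGS            gif0'

default_if=$(printf '%s\n' "$routes" | awk '$1 == "default" { print $4 }')
echo "$default_if"
```

If this prints something other than the tunnel interface, IPv6 traffic is being routed elsewhere before it ever reaches the gif tunnel.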

#14
Ooooh, now it dawns on me! So static mappings are not used to reserve specific IPs within the range; they are configured on top of (and outside) the range. That's it. As I hoped... total user error on my side. Thank you very much for helping me! :-)
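In ISC dhcpd terms, that maps to the rule that a fixed-address host declaration should point outside any dynamic range. A sketch with entirely hypothetical addresses and MAC (not taken from this setup):

```
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.150;          # dynamic pool
}

host example-host {
  hardware ethernet 00:11:22:33:44:55;  # hypothetical MAC
  fixed-address 10.0.0.151;             # deliberately outside the range
}
```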
#15
Hi everyone,

I am pretty sure there is something I am doing wrong and you can point me in the right direction.
I run OPNsense 24.7.5_3-amd64 and use ISC DHCPv4 to handle my LAN IP pool.
It is set to hand out IPs in the range 10.0.0.100 to 10.0.0.150. Within that range, I configured a static mapping for one of my devices' MAC addresses, so it always gets the 10.0.0.149 address.

For whatever reason, every new system or VM I join to the network gets the 10.0.0.149 address. It feels like, instead of picking one of the other 49 free addresses, it gives out the static one on purpose rather than by accident. But I cannot wrap my head around why.

Attached is a screenshot of the current situation. The host "BOEXLE" is the one with the correct MAC and the static reservation. I spun up a container on another system and it got the .149. Yesterday I spun up a KVM VM on one of my other hosts and it got the .149 too. I do not understand why. I thought reserving an IP within the pool for a specific MAC would prevent that IP from being handed out to another system.
Attached is a Screenshot of the current situation. The Host "BOEXLE" is the one with the correct MAC and the static reservation. I spun up a Container on another system and it gets the .149. Yesterday I spun up a KVM VM on one of my other hosts and it got the .149 too. I do not understand why this is. I thought reserving a IP within the pool for a specific MAC should prevent from handing out this IP to another system.