After Update to 19.7.2 Service "flowd_aggregate Insight Aggregator" is stopped

Started by thowe, August 05, 2019, 07:15:34 PM

Does not look like a broken or full disk:


root@opnsense01:~ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default    212G    980M    211G     0%    /
devfs                 1.0K    1.0K      0B   100%    /dev
zroot/tmp             211G    1.5M    211G     0%    /tmp
zroot/usr/home        211G    304K    211G     0%    /usr/home
zroot/usr/ports       212G    681M    211G     0%    /usr/ports
zroot/usr/src         211G     88K    211G     0%    /usr/src
zroot/var/audit       211G     88K    211G     0%    /var/audit
zroot/var/crash       211G     88K    211G     0%    /var/crash
zroot/var/log         211G    549M    211G     0%    /var/log
zroot/var/mail        211G    116K    211G     0%    /var/mail
zroot/var/tmp         211G    104K    211G     0%    /var/tmp
zroot                 211G     88K    211G     0%    /zroot
devfs                 1.0K    1.0K      0B   100%    /var/dhcpd/dev
root@opnsense01:~ # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot   220G  2.24G   218G        -         -    46%     1%  1.00x  ONLINE  -
root@opnsense01:~ # zpool status
  pool: zroot
state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0p4   ONLINE       0     0     0
            da1p4   ONLINE       0     0     0

errors: No known data errors
root@opnsense01:~ #

When I run the command, nothing happens. I let it sit for 10-15 minutes, and if I press Ctrl-C, this is the output:

Traceback (most recent call last):
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
    Main()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
    self.run()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 179, in run
    time.sleep(0.5)
KeyboardInterrupt

@unipacket: Running this command is like starting the flowd_aggregate service in the foreground. You are supposed to wait until it crashes.

Since keeping an SSH session open for days is error-prone in itself, I installed tmux, which allows reconnecting to running sessions.

I wouldn't be surprised if ZFS is a factor here. There are numerous mentions on the Internet of sqlite I/O errors and their "sudden" nature. Nothing concrete, but enough to suggest a read/write lock is failing here in the sqlite database mode. Some suggest turning off journaling, others turning off WAL. The downside seems to be losing the ability to recover, though the real question is whether the database is corrupted or still working fine and the I/O error only throws off the writer.
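For reference, inspecting or switching a database's journal mode is a one-statement PRAGMA. A minimal sketch using Python's sqlite3 module (the path here is illustrative, not an actual OPNsense file):

```python
import os
import sqlite3
import tempfile

# Illustrative path only; the real Insight databases live under /var/netflow/.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")

con = sqlite3.connect(db_path)

# A fresh database uses the rollback journal, reported as "delete".
mode = con.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # "delete"

# WAL is persistent: it is recorded in the database file itself, so it
# stays in effect for future connections until changed back.
new_mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(new_mode)  # "wal"

con.close()
```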

Do you happen to read Insight data from the GUI when the process crashes?


Cheers,
Franco

@franco: Just tried to confirm your assumption. Started the process on the CLI again and played around in ui/diagnostics/networkinsight. This does not trigger a crash.

Also read up on SQLite and ZFS. It looks like there was no final solution to this discussion here:
https://sqlite-users.sqlite.narkive.com/QCh23tfL/i-o-errors-with-wal-on-zfs

Is opnsense using WAL for the SQLite DBs?

Thanks for the quick test. I wouldn't expect this to happen all the time, but rather sooner or later when a reader and writer collide. If the writer is all alone and the disk is fine, the expectation is that the I/O error never occurs. If it still does, it is even harder to get to the bottom of.

We don't set any mode, so sqlite picks its default. I haven't found a quick reference to what the default is, though.


Cheers,
Franco

So the current setup on these files is to use the DELETE mode:

root@opnsense01:/var/netflow # sqlite3 ./metadata.sqlite
SQLite version 3.29.0 2019-07-10 17:32:03
Enter ".help" for usage hints.
sqlite> PRAGMA database_list;
0|main|/var/netflow/./metadata.sqlite
sqlite> PRAGMA main.journal_mode;
delete
sqlite>


https://www.sqlite.org/pragma.html

According to the above post, everything should be working fine with the DELETE journal mode...

One could use
PRAGMA query_only = ON;

for read-only processes like the web interface.

Not sure if that helps, but it would make sure the WebIF does not cause the problem...

And if you are sure that the script is the only one writing, maybe you could use
PRAGMA main.locking_mode=EXCLUSIVE;
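To illustrate how those two PRAGMAs behave, here is a minimal Python sketch (the database path and table are made up for the example; this is not OPNsense code):

```python
import os
import sqlite3
import tempfile

# Illustrative path only.
db_path = os.path.join(tempfile.mkdtemp(), "insight.sqlite")

# Writer: EXCLUSIVE locking mode claims the file on first access and holds
# the lock until the connection is closed.
writer = sqlite3.connect(db_path)
writer.execute("PRAGMA locking_mode=EXCLUSIVE")
writer.execute("CREATE TABLE samples (ts INTEGER, octets INTEGER)")
writer.execute("INSERT INTO samples VALUES (1, 512)")
writer.commit()
writer.close()  # releases the exclusive lock

# Reader (e.g. the web interface): query_only makes any write attempt fail
# immediately instead of competing for the write lock.
reader = sqlite3.connect(db_path)
reader.execute("PRAGMA query_only=ON")
rows = reader.execute("SELECT * FROM samples").fetchall()
print(rows)  # [(1, 512)]
try:
    reader.execute("INSERT INTO samples VALUES (2, 1024)")
except sqlite3.OperationalError as exc:
    print("write rejected:", exc)
reader.close()
```

Note that query_only only guards against accidental writes from that connection; it does not change how SQLite arbitrates locks between separate processes.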


Running into a similar issue. Netflow crashes after a minute or two. Not sure, but the problem occurred after adding an additional interface via "Interfaces: Assignments" for ovpns1. Maybe there is a 'corpse' in a config file now?!

Output:

root@OPNsense:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
    Main()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
    self.run()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
    aggregate_flowd(self.config, do_vacuum)
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 68, in aggregate_flowd
    for flow_record in parse_flow(prev_recv, config.flowd_source):
  File "/usr/local/opnsense/scripts/netflow/lib/parse.py", line 74, in parse_flow
    for flow_record in FlowParser(filename, recv_stamp):
  File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 139, in __iter__
    data_fields=ntohl(header[3])
  File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 118, in _parse_binary
    raw_data[raw_data_idx:raw_data_idx + fsize]
struct.error: unpack requires a buffer of 8 bytes
root@OPNsense:~ #

I tried to reset and repair via the GUI and also an rm of the sqlite files, but nothing helped.

@Conti can you open a new issue on GitHub with the exact same crash dump? There are likely broken flow records in your data, but the aggregator should skip those. I'll try to take a look sometime next week.
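The "unpack requires a buffer of 8 bytes" crash is the classic symptom of a truncated record at the end of a file. A hedged sketch of the skip-broken-records idea: the field layout below is invented for illustration and is not flowparser.py's actual format.

```python
import struct

# Hypothetical fixed-size record: two 32-bit counters in network byte order.
RECORD = struct.Struct("!II")  # RECORD.size == 8

def parse_records(raw_data):
    """Yield complete records and silently skip a truncated tail."""
    idx = 0
    # Only unpack while a full record remains in the buffer; a shorter
    # tail is a broken record and would otherwise raise struct.error.
    while idx + RECORD.size <= len(raw_data):
        yield RECORD.unpack_from(raw_data, idx)
        idx += RECORD.size

# Two complete records followed by 2 stray bytes (a truncated third record).
data = struct.pack("!II", 1, 2) + struct.pack("!II", 3, 4) + b"\x00\x01"
records = list(parse_records(data))
print(records)  # [(1, 2), (3, 4)]
```

Checking the remaining buffer length before every unpack turns a hard crash into a skipped record, which is the behavior franco describes the aggregator should have.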


It looks like I fixed the problem with a workaround:
I changed the backup setting in system_advanced_misc.php back to "Power off" from "24 hours". Since then flowd_aggregate keeps running...

Mine stopped working, and while looking for an answer I came across this thread. I tried all the above advice and nothing worked. I decided to reboot, and that seems to have fixed it, at least for now. Unfortunately, I had tried resetting everything in that settings folder, so I don't really know what fixed it, but it appears you have to reboot, or kill and restart a process, to really get to the root of it. Happy hunting for the next one to come across this. :(