After Update to 19.7.2 Service "flowd_aggregate Insight Aggregator" is stopped

Started by thowe, August 05, 2019, 07:15:34 PM

Does not look like a broken or full disk:


root@opnsense01:~ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default    212G    980M    211G     0%    /
devfs                 1.0K    1.0K      0B   100%    /dev
zroot/tmp             211G    1.5M    211G     0%    /tmp
zroot/usr/home        211G    304K    211G     0%    /usr/home
zroot/usr/ports       212G    681M    211G     0%    /usr/ports
zroot/usr/src         211G     88K    211G     0%    /usr/src
zroot/var/audit       211G     88K    211G     0%    /var/audit
zroot/var/crash       211G     88K    211G     0%    /var/crash
zroot/var/log         211G    549M    211G     0%    /var/log
zroot/var/mail        211G    116K    211G     0%    /var/mail
zroot/var/tmp         211G    104K    211G     0%    /var/tmp
zroot                 211G     88K    211G     0%    /zroot
devfs                 1.0K    1.0K      0B   100%    /var/dhcpd/dev
root@opnsense01:~ # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot   220G  2.24G   218G        -         -    46%     1%  1.00x  ONLINE  -
root@opnsense01:~ # zpool status
  pool: zroot
state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0p4   ONLINE       0     0     0
            da1p4   ONLINE       0     0     0

errors: No known data errors
root@opnsense01:~ #

When I run the command, nothing happens. I let it sit for 10-15 minutes, and if I press Ctrl-C, this is the output:

Traceback (most recent call last):
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
    Main()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
    self.run()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 179, in run
    time.sleep(0.5)
KeyboardInterrupt

@unipacket: Running this command is like starting the flowd_aggregate service in the foreground. You are supposed to wait until it crashes.

Since keeping an SSH session open for days is error-prone in itself, I installed tmux, which allows reconnecting to running sessions.

I wouldn't be surprised if ZFS is a factor here. There are numerous mentions on the Internet of sqlite I/O errors and their "sudden" nature. Nothing concrete, but enough to suggest a read/write lock is failing here in the sqlite database mode. Some suggest turning off journaling, others turning off WAL. The downside seems to be losing the ability to recover, though the real question is whether the database is corrupted or still working fine and the I/O error only throws off the writer.
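For reference, inspecting or switching a database's journal mode is a one-statement PRAGMA. A minimal sketch using Python's sqlite3 module (the path here is illustrative, not an actual OPNsense file):

```python
import os
import sqlite3
import tempfile

# Illustrative path only; the real Insight databases live under /var/netflow/.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")

con = sqlite3.connect(db_path)

# A fresh database uses the rollback journal, reported as "delete".
mode = con.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # "delete"

# WAL is persistent: it is recorded in the database file itself, so it
# stays in effect for future connections until changed back.
new_mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(new_mode)  # "wal"

con.close()
```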

Do you happen to read Insight data from the GUI when the process crashes?


Cheers,
Franco

@franco: Just tried to confirm your assumption. Started the process on the CLI again and played around in ui/diagnostics/networkinsight. This does not trigger a crash.

Also read up on SQLite and ZFS. It looks like there was no final solution to this discussion here:
https://sqlite-users.sqlite.narkive.com/QCh23tfL/i-o-errors-with-wal-on-zfs

Is opnsense using WAL for the SQLite DBs?

Thanks for the quick test. I wouldn't expect this to happen all the time, but rather sooner or later when a reader and writer collide. If the writer is all alone and the disk is fine, the expectation is that the I/O error never occurs. If it still does, it is even harder to get to the bottom of.

We don't set any mode, so sqlite picks its default. I haven't found a quick reference to what the default is, though.


Cheers,
Franco

So the current setup on these files is to use the DELETE mode:

root@opnsense01:/var/netflow # sqlite3 ./metadata.sqlite
SQLite version 3.29.0 2019-07-10 17:32:03
Enter ".help" for usage hints.
sqlite> PRAGMA database_list;
0|main|/var/netflow/./metadata.sqlite
sqlite> PRAGMA main.journal_mode;
delete
sqlite>


https://www.sqlite.org/pragma.html

According to the above post, everything should be working fine with the DELETE journal mode...

One could use
PRAGMA query_only = ON;

for read-only processes like the web interface.

Not sure if that helps, but it would make sure the WebIF does not cause the problem...

And if you are sure that the script is the only one writing, maybe you could use
PRAGMA main.locking_mode=EXCLUSIVE;
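To illustrate how those two PRAGMAs behave, here is a minimal Python sketch (the database path and table are made up for the example; this is not OPNsense code):

```python
import os
import sqlite3
import tempfile

# Illustrative path only.
db_path = os.path.join(tempfile.mkdtemp(), "insight.sqlite")

# Writer: EXCLUSIVE locking mode claims the file on first access and holds
# the lock until the connection is closed.
writer = sqlite3.connect(db_path)
writer.execute("PRAGMA locking_mode=EXCLUSIVE")
writer.execute("CREATE TABLE samples (ts INTEGER, octets INTEGER)")
writer.execute("INSERT INTO samples VALUES (1, 512)")
writer.commit()
writer.close()  # releases the exclusive lock

# Reader (e.g. the web interface): query_only makes any write attempt fail
# immediately instead of competing for the write lock.
reader = sqlite3.connect(db_path)
reader.execute("PRAGMA query_only=ON")
rows = reader.execute("SELECT * FROM samples").fetchall()
print(rows)  # [(1, 512)]
try:
    reader.execute("INSERT INTO samples VALUES (2, 1024)")
except sqlite3.OperationalError as exc:
    print("write rejected:", exc)
reader.close()
```

Note that query_only only guards against accidental writes from that connection; it does not change how SQLite arbitrates locks between separate processes.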


Running into a similar issue. Netflow crashes after a minute or two. Not sure, but the problem occurred after adding an additional interface via "Interfaces: Assignments" for ovpns1. Maybe there is a 'corpse' in a config file now?!

Output:

root@OPNsense:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
    Main()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
    self.run()
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
    aggregate_flowd(self.config, do_vacuum)
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 68, in aggregate_flowd
    for flow_record in parse_flow(prev_recv, config.flowd_source):
  File "/usr/local/opnsense/scripts/netflow/lib/parse.py", line 74, in parse_flow
    for flow_record in FlowParser(filename, recv_stamp):
  File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 139, in __iter__
    data_fields=ntohl(header[3])
  File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 118, in _parse_binary
    raw_data[raw_data_idx:raw_data_idx + fsize]
struct.error: unpack requires a buffer of 8 bytes
root@OPNsense:~ #

I tried to reset and repair via the GUI and also an rm of the sqlite files, but nothing helped.

@Conti can you open a new issue on GitHub with the exact same crash dump? There are likely broken flow records in your data, but the aggregator should skip those. I'll try to take a look sometime next week.
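The "unpack requires a buffer of 8 bytes" crash is the classic symptom of a truncated record at the end of a file. A hedged sketch of the skip-broken-records idea: the field layout below is invented for illustration and is not flowparser.py's actual format.

```python
import struct

# Hypothetical fixed-size record: two 32-bit counters in network byte order.
RECORD = struct.Struct("!II")  # RECORD.size == 8

def parse_records(raw_data):
    """Yield complete records and silently skip a truncated tail."""
    idx = 0
    # Only unpack while a full record remains in the buffer; a shorter
    # tail is a broken record and would otherwise raise struct.error.
    while idx + RECORD.size <= len(raw_data):
        yield RECORD.unpack_from(raw_data, idx)
        idx += RECORD.size

# Two complete records followed by 2 stray bytes (a truncated third record).
data = struct.pack("!II", 1, 2) + struct.pack("!II", 3, 4) + b"\x00\x01"
records = list(parse_records(data))
print(records)  # [(1, 2), (3, 4)]
```

Checking the remaining buffer length before every unpack turns a hard crash into a skipped record, which is the behavior franco describes the aggregator should have.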


It looks like I fixed the problem with a workaround:
I changed the backup setting in system_advanced_misc.php back to "Power off" from "24 hours". Since then flowd_aggregate keeps running...

Mine stopped working, and while looking for an answer I came across this thread. I tried all the above advice and nothing worked. I decided to reboot, and that seems to have fixed it, at least for now. Unfortunately, I had tried resetting everything in that settings folder, so I don't really know what fixed it, but it appears you have to reboot, or kill and restart a process, to really get to the root of it. Happy hunting for the next one to come across this. :(