OPNsense Forum
Archive => 19.7 Legacy Series => Topic started by: thowe on August 05, 2019, 07:15:34 pm
-
Hello
Today I updated from 19.7.1 to 19.7.2 on my APU2c4. The update itself ran without problems.
But after the update the Service "flowd_aggregate" was stopped and could not be started again.
In the menu "Reporting: NetFlow" I could not reapply the current settings there, as an error stated that the WAN interface was missing in Listening Interfaces (it really was missing).
After manually re-adding the WAN interface there, I could apply the settings. In the dashboard the service "flowd_aggregate" was shown green/running, but after a refresh of the dashboard the service was shown as stopped again.
In the general log I found:
/flowd_aggregate.py: flowd aggregate died with message
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 80, in aggregate_flowd
stream_agg_object.add(copy.copy(flow_record))
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/source.py", line 117, in add
super(FlowSourceAddrDetails, self).add(flow)
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 185, in add
self._update_cur.execute(self._update_stmt, flow)
sqlite3.DatabaseError: database disk image is malformed
What caused this issue?
What is the best thing to resolve this issue?
Thanks!
Tom
-
Try the button "Repair Netflow Data" or, if this doesn't work, "Reset Netflow Data" in Reporting: Settings.
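For anyone curious what a repair has to detect, the following sketch shows how to check a database's integrity from Python using sqlite's built-in check. The /var/netflow path is an assumption based on this thread, and this is not the actual OPNsense repair code:

```python
# Minimal sketch: detect a corrupted sqlite database file the way
# "Repair Netflow Data" first has to, via PRAGMA integrity_check.
import sqlite3

def integrity_ok(db_path):
    """Return True if sqlite reports the database file as intact."""
    con = sqlite3.connect(db_path)
    try:
        # integrity_check returns the single row ('ok',) for a healthy file
        result = con.execute('PRAGMA integrity_check').fetchone()[0]
        return result == 'ok'
    finally:
        con.close()

# Example (path is an assumption for illustration):
# print(integrity_ok('/var/netflow/metadata.sqlite'))
```

A database that fails this check is the situation where only repair or reset helps.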
-
In "Reporting: Settings" the "Repair Netflow Data" option did the trick. All OK now.
I was not aware of the possibility. Thanks! :-)
-
This is still an issue for me. I have reset and also repaired but the Aggregator service does not stay active. Anyone else seeing this in 19.7.2?
-
@spetrillo
Possibly. For the past couple of days, I've been having issues with Insight not working with VLANs (https://forum.opnsense.org/index.php?topic=13707.0). I saw your post today and, when I logged into OPNsense, also found the flowd_aggregate service stopped. I had upgraded to 19.7.2 yesterday and tried the "Repair Netflow Data" option then. After reading thowe's comment, I tried it again today since the service was stopped and, at least for the past hour, the service has been running and, to my surprise, reports seem to be working again. Going to keep an eye on it, though, to be sure.
-
I definitely have something going on here. I repaired the database, but it returns a return code of 1 when complete, and the daemon never starts again.
-
I agree. Checked today and found the flowd_aggregate service stopped. Also seeing similar errors. Not sure what could be going on.
-
Is anyone else having an issue with keeping the Aggregator running?
-
So am I the only person hitting this issue?
-
Seeing the same thing here. As a test, I did a fresh install of OPNsense in a VM and the flowd_aggregate has been running for days. I'm wondering if it is something with my config.
-
Hmmm... that's an interesting thought. I am going to do a clean install of 19.7.2 and see if that changes things.
-
I have the problem as well.
-
Did you do a clean install of 19.7 or was it on top of 19.1? Originally I did an install on top of 19.1, which I think caused my issue. I did a clean install of 19.7 last night and the aggregator is staying up now.
-
Well, I hoped that a clean build would cure it, but the service is stopped again and I cannot get it started again, no matter whether I repair or reset. This is definitely a problem.
Can any of the devs chime in here? Is this a bug?
-
Can you execute the following on a console?
/usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
If it fails with something like:
Traceback (most recent call last):
File "/usr/local/opnsense/site-python/sqlite3_helper.py", line 60, in check_and_repair
cur.execute('analyze')
sqlite3.DatabaseError: database disk image is malformed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 144, in run
check_and_repair('%s/*.sqlite' % self.config.database_dir)
File "/usr/local/opnsense/site-python/sqlite3_helper.py", line 62, in check_and_repair
if e.find('malformed') > -1 or force_repair:
AttributeError: 'DatabaseError' object has no attribute 'find'
It's a corrupted database combined with a bug trying to repair it.
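The crash pattern can be reproduced in isolation. This is a minimal sketch of the bug described above (not the actual sqlite3_helper.py code): the except block calls .find() on the exception object instead of on its string representation, so the repair path itself crashes:

```python
# Sketch of the repair bug: DatabaseError has no .find() method,
# so the "is this corruption?" test raises AttributeError instead.
import sqlite3

def is_malformed_buggy(e):
    return e.find('malformed') > -1        # AttributeError: exceptions have no .find

def is_malformed_fixed(e):
    return str(e).find('malformed') > -1   # convert to str first

err = sqlite3.DatabaseError('database disk image is malformed')
print(is_malformed_fixed(err))  # True
try:
    is_malformed_buggy(err)
except AttributeError:
    print('buggy variant crashes, masking the real repair path')
```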
Should be fixed with (on OPNsense 19.7.2):
opnsense-patch e5574648
service flowd_aggregate restart
Best regards,
Ad
-
Thanks...will validate this later tonight.
-
Did what you asked and here is the output back to the console:
root@OPNsense:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 68, in aggregate_flowd
for flow_record in parse_flow(prev_recv, config.flowd_source):
File "/usr/local/opnsense/scripts/netflow/lib/parse.py", line 74, in parse_flow
for flow_record in FlowParser(filename, recv_stamp):
File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 141, in __iter__
record['recv_sec'] = record['recv_time'][0]
KeyError: 'recv_time'
Since it looked somewhat similar I continued with your instructions and installed the patch and then restarted the aggregator service. It did not stay up long.
-
OK, that doesn't look good. Maybe the flowd.log file is corrupted and not handled properly on our end.
Can you try https://github.com/opnsense/core/commit/d8ef93932b1696edd795ec38be57a2ec3e0187ea?
opnsense-patch d8ef9393
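For context, here is a hypothetical sketch of the kind of guard needed for this crash: skipping flow records that lack expected fields instead of raising KeyError. The field name follows the traceback above; this is not the actual commit:

```python
# Sketch: defensively skip truncated/corrupted flow records rather than
# letting a missing key ('recv_time' in the traceback above) kill the loop.
def iter_valid_records(records, required=('recv_time',)):
    for record in records:
        if all(field in record for field in required):
            yield record
        # else: silently skip records missing required fields

records = [{'recv_time': (1565000000, 0)}, {}, {'recv_time': (1565000001, 0)}]
print(len(list(iter_valid_records(records))))  # 2
```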
-
Thanks AdSchellevis. I ran the same command and got a similar result as spetrillo. I've applied the patch and did a repair Netflow from the web GUI. Will keep an eye on it and see if the service continues to stop or not.
-
@AdSchellevis, Unfortunately no luck. The service stopped again. Is there anything else I can try or provide in terms of logs? I'm not against performing a fresh install to fix the problem but if this is a bug, I would not mind help trying to fix it.
@spetrillo, did you have any luck with the patch?
-
@unipacket : the simplest thing is to keep it running in a console until it crashes and dump the traceback here, usually the log would also contain some info, but a full trace is easier to debug.
-
No luck here. It's down again. I will work on a trace log also.
-
How do I start the service from the console window? When I run the command
/usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
the output is similar to what spetrillo posted previously. I'm wondering if I should be running a different command? Thanks.
-
Similar might be something different. If it's the exact same, maybe the patch didn't apply properly, in which case you'd better upgrade tomorrow and try again (19.7.3 is scheduled for tomorrow).
You can always dump the output here, so we can take a look.
-
Upgraded to 19.7.3 and did a "Repair Netflow Data" but still no go. Output from flowd_aggregate.py --console:
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 80, in aggregate_flowd
stream_agg_object.add(copy.copy(flow_record))
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/ports.py", line 71, in add
super(FlowDstPortTotals, self).add(flow)
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 185, in add
self._update_cur.execute(self._update_stmt, flow)
I'm thinking mine might just be plain broken and a reinstall will fix it. Is there anything else I can try before attempting a reinstall?
-
Your output seems incomplete, the relevant parts seem to be missing.
You can always flush all stats in Reporting -> Settings -> Netflow data if you want to remove the current stats and start from scratch.
-
Just upgraded from 19.1.x to 19.7.3 and my service is stopped. I've cleared the data; flowd runs for a couple of days and then stops.
-
Thanks :) I'll try flushing stats and see what happens. Will keep you posted.
-
Hi,
same problem here with 19.7.3.
I installed tmux and ran the commands below in the tmux session...
root@opnsense01:~ # rm /var/netflow/*
root@opnsense01:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 72, in aggregate_flowd
stream_agg_object.commit()
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 160, in commit
self._db_connection.commit()
sqlite3.OperationalError: disk I/O error
root@opnsense01:~ #
-
Disk full or damaged?
df -h
might help find the first one. The error itself is likely not related to flowd_aggregate; usually this means it's a victim of hardware-related issues.
-
Does not look like a broken or full disk:
root@opnsense01:~ # df -h
Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/default 212G 980M 211G 0% /
devfs 1.0K 1.0K 0B 100% /dev
zroot/tmp 211G 1.5M 211G 0% /tmp
zroot/usr/home 211G 304K 211G 0% /usr/home
zroot/usr/ports 212G 681M 211G 0% /usr/ports
zroot/usr/src 211G 88K 211G 0% /usr/src
zroot/var/audit 211G 88K 211G 0% /var/audit
zroot/var/crash 211G 88K 211G 0% /var/crash
zroot/var/log 211G 549M 211G 0% /var/log
zroot/var/mail 211G 116K 211G 0% /var/mail
zroot/var/tmp 211G 104K 211G 0% /var/tmp
zroot 211G 88K 211G 0% /zroot
devfs 1.0K 1.0K 0B 100% /var/dhcpd/dev
root@opnsense01:~ # zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 220G 2.24G 218G - - 46% 1% 1.00x ONLINE -
root@opnsense01:~ # zpool status
pool: zroot
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
da0p4 ONLINE 0 0 0
da1p4 ONLINE 0 0 0
errors: No known data errors
root@opnsense01:~ #
-
When I run the command, nothing happens. I let it sit for 10-15 minutes, and if I press Ctrl+C, this is the output:
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 179, in run
time.sleep(0.5)
KeyboardInterrupt
-
@unipacket: Running this command is like starting the flowd_aggregate service in the foreground; you are supposed to wait until it crashes.
Since keeping an SSH session open for days is error-prone in itself, I installed tmux, which allows reconnecting to running sessions.
-
I wouldn't be surprised if ZFS is a factor here. There are numerous mentions on the Internet about sqlite I/O errors and their "sudden" nature. Nothing concrete, but enough to suggest a read/write lock is failing in the sqlite database here. Some suggest turning off journaling, others turning off WAL. Their downside seems to be losing the ability to recover, though the real question is whether the database is corrupted or still working fine and the I/O error only throws off the writer.
Do you happen to read Insight data from the GUI when the process crashes?
Cheers,
Franco
-
@franco: Just tried to confirm your assumption. Started the process on the CLI again and played around in ui/diagnostics/networkinsight. This does not trigger a crash.
Also read up on SQLite and ZFS. It looks like there was no final solution in this discussion:
https://sqlite-users.sqlite.narkive.com/QCh23tfL/i-o-errors-with-wal-on-zfs
Is opnsense using WAL for the SQLite DBs?
-
Thanks for the quick test. I wouldn't expect this to happen all the time, but rather sooner or later when a reader and writer collide. If the writer is all alone and the disk is fine, the expectation is that the I/O error never occurs. If it still does, it is even harder to get to the bottom of.
We don't set any mode so sqlite picks its default. I haven't found a quick reference to what the default is though.
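For what it's worth, the effective default can be checked against a throwaway database (a Python sketch, not OPNsense code; run it against a scratch file, not the live netflow databases):

```python
# Check which journal mode sqlite picks when nothing is configured,
# using a temporary file-backed database.
import os
import sqlite3
import tempfile

fd, path = tempfile.mkstemp(suffix='.sqlite')
os.close(fd)
con = sqlite3.connect(path)
mode = con.execute('PRAGMA journal_mode').fetchone()[0]
con.close()
os.unlink(path)
print(mode)  # 'delete' for file-backed databases unless changed
```

For file-backed databases the stock default is the rollback journal, reported as "delete".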
Cheers,
Franco
-
So the current setup on these files is to use the DELETE mode:
root@opnsense01:/var/netflow # sqlite3 ./metadata.sqlite
SQLite version 3.29.0 2019-07-10 17:32:03
Enter ".help" for usage hints.
sqlite> PRAGMA database_list;
0|main|/var/netflow/./metadata.sqlite
sqlite> PRAGMA main.journal_mode;
delete
sqlite>
https://www.sqlite.org/pragma.html
According to the above post, everything should be working fine with the DELETE journal mode...
-
One could use
PRAGMA query_only = ON;
for read-only processes like the web interface. Not sure if that helps, but it would make sure the web interface does not cause the problem...
And if you are sure that the script is the only writer, maybe you could use
PRAGMA main.locking_mode = EXCLUSIVE;
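The effect of query_only can be demonstrated on a scratch database (a Python sketch, not OPNsense code): a connection in this mode cannot write, which would rule the GUI reader out as a source of write conflicts.

```python
# Sketch: a query_only connection is rejected on any write attempt.
import os
import sqlite3
import tempfile

fd, path = tempfile.mkstemp(suffix='.sqlite')
os.close(fd)

writer = sqlite3.connect(path)
writer.execute('CREATE TABLE flows (octets INTEGER)')
writer.commit()

reader = sqlite3.connect(path)
reader.execute('PRAGMA query_only = ON')  # this connection may only read
try:
    reader.execute('INSERT INTO flows VALUES (1)')
except sqlite3.OperationalError as e:
    print('write rejected:', e)

reader.close()
writer.close()
os.unlink(path)
```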
-
@rainerle: thanks for the tip! I'll have to try it next time.
-
Running into a similar issue. Netflow crashes after a minute or two. Not sure, but the problem started after adding an additional interface via "Interfaces: Assignments" for ovpns1. Maybe there is a 'corpse' in a config file now?!
Output:
root@OPNsense:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 68, in aggregate_flowd
for flow_record in parse_flow(prev_recv, config.flowd_source):
File "/usr/local/opnsense/scripts/netflow/lib/parse.py", line 74, in parse_flow
for flow_record in FlowParser(filename, recv_stamp):
File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 139, in __iter__
data_fields=ntohl(header[3])
File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 118, in _parse_binary
raw_data[raw_data_idx:raw_data_idx + fsize]
struct.error: unpack requires a buffer of 8 bytes
root@OPNsense:~ #
I tried to reset and repair via the GUI and also an rm of the sqlite files, but nothing helped.
-
@Conti can you open a new issue on GitHub with the exact same crashdump? There are likely broken flow records in your data, but the aggregator should skip those. I’ll try to take a look somewhere next week.
-
Done!
https://github.com/opnsense/core/issues/3715
Edit: Fixed
-
Looks like I fixed the problem with a workaround:
I changed the backup setting in system_advanced_misc.php from "24 hours" back to "Power off". Since then flowd_aggregate keeps running...
-
Mine stopped working and, looking for an answer, I came across this thread. I tried all the above advice and nothing worked. I decided to reboot, and that seems to have fixed it, at least for now. Unfortunately, I had tried resetting everything on that settings page, so I don't really know what fixed it, but it appears you have to reboot, or kill and restart a process, to really get to the root of it. Happy hunting for the next one to come across this. :(