OPNsense Forum
Archive => 19.7 Legacy Series => Topic started by: thowe on August 05, 2019, 07:15:34 pm
-
Hello
Today I updated from 19.7.1 to 19.7.2 on my APU2c4. The update itself ran without problems.
But after the update the Service "flowd_aggregate" was stopped and could not be started again.
In the menu "Reporting: NetFlow" I could not reapply the current settings there, as an error stated that the WAN interface was missing in Listening Interfaces (it really was missing).
After manually re-adding the WAN interface there, I could apply the settings. In the dashboard the service "flowd_aggregate" was shown green/running, but after a refresh of the dashboard the service was shown as stopped again.
In the general log I found:
/flowd_aggregate.py: flowd aggregate died with message
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 80, in aggregate_flowd
stream_agg_object.add(copy.copy(flow_record))
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/source.py", line 117, in add
super(FlowSourceAddrDetails, self).add(flow)
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 185, in add
self._update_cur.execute(self._update_stmt, flow)
sqlite3.DatabaseError: database disk image is malformed
What caused this issue?
What is the best thing to resolve this issue?
Thanks!
Tom
-
Try the button "Repair Netflow Data" or, if this doesn't work, "Reset Netflow Data" in Reporting: Settings.
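For anyone curious what a repair has to detect, the following sketch shows how to check a database's integrity from Python using sqlite's built-in check. The /var/netflow path is an assumption based on this thread, and this is not the actual OPNsense repair code:

```python
# Minimal sketch: detect a corrupted sqlite database file the way
# "Repair Netflow Data" first has to, via PRAGMA integrity_check.
import sqlite3

def integrity_ok(db_path):
    """Return True if sqlite reports the database file as intact."""
    con = sqlite3.connect(db_path)
    try:
        # integrity_check returns the single row ('ok',) for a healthy file
        result = con.execute('PRAGMA integrity_check').fetchone()[0]
        return result == 'ok'
    finally:
        con.close()

# Example (path is an assumption for illustration):
# print(integrity_ok('/var/netflow/metadata.sqlite'))
```

A database that fails this check is the situation where only repair or reset helps.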
-
In "Reporting: Settings" the "Repair Netflow Data" option did the trick. All OK now.
I was not aware of the possibility. Thanks! :-)
-
This is still an issue for me. I have reset and also repaired but the Aggregator service does not stay active. Anyone else seeing this in 19.7.2?
-
@spetrillo
Possibly. For the past couple of days, I've been having issues with Insight not working with VLANs (https://forum.opnsense.org/index.php?topic=13707.0). I saw your post today and, when I logged into OPNsense, also found the flowd_aggregate service stopped. I had upgraded to 19.7.2 yesterday and tried the "Repair Netflow Data" option then. After reading thowe's comment, I tried it again today since the service was stopped and, at least for the past hour, the service has been running and, to my surprise, reports seem to be working again. Going to keep an eye on it, though, to be sure.
-
I definitely have something going on here. I repaired the database, but it returns a return code of 1 when complete, and the daemon never starts again.
-
I agree. Checked today and found the flowd_aggregate service stopped. Also seeing similar errors. Not sure what could be going on.
-
Is anyone else having an issue with keeping the Aggregator running?
-
So am I the only person hitting this issue?
-
Seeing the same thing here. As a test, I did a fresh install of OPNsense in a VM and the flowd_aggregate has been running for days. I'm wondering if it is something with my config.
-
Hmmm... that's an interesting thought. I am going to do a clean install of 19.7.2 and see if that changes things.
-
I have the problem as well.
-
Did you do a clean install of 19.7 or was it on top of 19.1? Originally I did an install on top of 19.1, which I think caused my issue. I did a clean install of 19.7 last night and the aggregator is staying up now.
-
Well, I hoped that a clean build would cure it, but the service is stopped again and I cannot get it started again, no matter whether I repair or reset. This is definitely a problem.
Can any of the devs chime in here? Is this a bug?
-
Can you execute the following on a console?
/usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
If it fails with something like:
Traceback (most recent call last):
File "/usr/local/opnsense/site-python/sqlite3_helper.py", line 60, in check_and_repair
cur.execute('analyze')
sqlite3.DatabaseError: database disk image is malformed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 144, in run
check_and_repair('%s/*.sqlite' % self.config.database_dir)
File "/usr/local/opnsense/site-python/sqlite3_helper.py", line 62, in check_and_repair
if e.find('malformed') > -1 or force_repair:
AttributeError: 'DatabaseError' object has no attribute 'find'
It's a corrupted database combined with a bug trying to repair it.
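The crash pattern can be reproduced in isolation. This is a minimal sketch of the bug described above (not the actual sqlite3_helper.py code): the except block calls .find() on the exception object instead of on its string representation, so the repair path itself crashes:

```python
# Sketch of the repair bug: DatabaseError has no .find() method,
# so the "is this corruption?" test raises AttributeError instead.
import sqlite3

def is_malformed_buggy(e):
    return e.find('malformed') > -1        # AttributeError: exceptions have no .find

def is_malformed_fixed(e):
    return str(e).find('malformed') > -1   # convert to str first

err = sqlite3.DatabaseError('database disk image is malformed')
print(is_malformed_fixed(err))  # True
try:
    is_malformed_buggy(err)
except AttributeError:
    print('buggy variant crashes, masking the real repair path')
```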
Should be fixed with (on OPNsense 19.7.2):
opnsense-patch e5574648
service flowd_aggregate restart
Best regards,
Ad
-
Thanks...will validate this later tonight.
-
Did what you asked and here is the output back to the console:
root@OPNsense:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 68, in aggregate_flowd
for flow_record in parse_flow(prev_recv, config.flowd_source):
File "/usr/local/opnsense/scripts/netflow/lib/parse.py", line 74, in parse_flow
for flow_record in FlowParser(filename, recv_stamp):
File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 141, in __iter__
record['recv_sec'] = record['recv_time'][0]
KeyError: 'recv_time'
Since it looked somewhat similar I continued with your instructions and installed the patch and then restarted the aggregator service. It did not stay up long.
-
OK, that doesn't look good. Maybe the flowd.log file is corrupted and not handled properly on our end.
Can you try https://github.com/opnsense/core/commit/d8ef93932b1696edd795ec38be57a2ec3e0187ea?
opnsense-patch d8ef9393
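For context, here is a hypothetical sketch of the kind of guard needed for this crash: skipping flow records that lack expected fields instead of raising KeyError. The field name follows the traceback above; this is not the actual commit:

```python
# Sketch: defensively skip truncated/corrupted flow records rather than
# letting a missing key ('recv_time' in the traceback above) kill the loop.
def iter_valid_records(records, required=('recv_time',)):
    for record in records:
        if all(field in record for field in required):
            yield record
        # else: silently skip records missing required fields

records = [{'recv_time': (1565000000, 0)}, {}, {'recv_time': (1565000001, 0)}]
print(len(list(iter_valid_records(records))))  # 2
```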
-
Thanks AdSchellevis. I ran the same command and got a similar result as spetrillo. I've applied the patch and did a repair Netflow from the web GUI. Will keep an eye on it and see if the service continues to stop or not.
-
@AdSchellevis, Unfortunately no luck. The service stopped again. Is there anything else I can try or provide in terms of logs? I'm not against performing a fresh install to fix the problem but if this is a bug, I would not mind help trying to fix it.
@spetrillo, did you have any luck with the patch?
-
@unipacket : the simplest thing is to keep it running in a console until it crashes and dump the traceback here, usually the log would also contain some info, but a full trace is easier to debug.
-
No luck here. It's down again. I will work on a trace log also.
-
How do I start the service from the console window? When I run the command
/usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
the output is similar to what spetrillo posted previously. I'm wondering if I should be running a different command? Thanks.
-
Similar might be something different. If it's the exact same, maybe the patch didn't apply properly, in which case you'd better upgrade tomorrow and try again (19.7.3 is scheduled for tomorrow).
You can always dump the output here, so we can take a look.
-
Upgraded to 19.7.3 and did a "Repair Netflow Data" but still no go. Output from flowd_aggregate.py --console:
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 80, in aggregate_flowd
stream_agg_object.add(copy.copy(flow_record))
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/ports.py", line 71, in add
super(FlowDstPortTotals, self).add(flow)
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 185, in add
self._update_cur.execute(self._update_stmt, flow)
I'm thinking mine might just be plain broken and a reinstall will fix it. Is there anything else I can try before attempting a reinstall?
-
Your output seems incomplete, the relevant parts seem to be missing.
You can always flush all stats in Reporting -> Settings -> Netflow data if you want to remove the current stats and start from scratch.
-
Just upgraded from 19.1.x to 19.7.3 and my service is stopped. I've cleared the data; flowd runs for a couple of days and then stops.
-
Thanks :) I'll try flushing stats and see what happens. Will keep you posted.
-
Hi,
same problem here with 19.7.3.
I installed tmux and ran the commands below in the tmux session...
root@opnsense01:~ # rm /var/netflow/*
root@opnsense01:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 72, in aggregate_flowd
stream_agg_object.commit()
File "/usr/local/opnsense/scripts/netflow/lib/aggregates/__init__.py", line 160, in commit
self._db_connection.commit()
sqlite3.OperationalError: disk I/O error
root@opnsense01:~ #
-
Disk full or damaged?
df -h
might help find the first one. The error itself is likely not related to flowd_aggregate; usually this means it's a victim of hardware-related issues.
-
Does not look like a broken or full disk:
root@opnsense01:~ # df -h
Filesystem Size Used Avail Capacity Mounted on
zroot/ROOT/default 212G 980M 211G 0% /
devfs 1.0K 1.0K 0B 100% /dev
zroot/tmp 211G 1.5M 211G 0% /tmp
zroot/usr/home 211G 304K 211G 0% /usr/home
zroot/usr/ports 212G 681M 211G 0% /usr/ports
zroot/usr/src 211G 88K 211G 0% /usr/src
zroot/var/audit 211G 88K 211G 0% /var/audit
zroot/var/crash 211G 88K 211G 0% /var/crash
zroot/var/log 211G 549M 211G 0% /var/log
zroot/var/mail 211G 116K 211G 0% /var/mail
zroot/var/tmp 211G 104K 211G 0% /var/tmp
zroot 211G 88K 211G 0% /zroot
devfs 1.0K 1.0K 0B 100% /var/dhcpd/dev
root@opnsense01:~ # zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 220G 2.24G 218G - - 46% 1% 1.00x ONLINE -
root@opnsense01:~ # zpool status
pool: zroot
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
da0p4 ONLINE 0 0 0
da1p4 ONLINE 0 0 0
errors: No known data errors
root@opnsense01:~ #
-
When I run the command, nothing happens. I let it sit for 10-15 minutes, and if I press Ctrl+C, this is the output:
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 179, in run
time.sleep(0.5)
KeyboardInterrupt
-
@unipacket: Running this command is like starting the flowd_aggregate service in the foreground; you are supposed to wait until it crashes.
Since keeping an SSH session open for days is error-prone in itself, I installed tmux, which allows reconnecting to running sessions.
-
I wouldn't be surprised if ZFS is a factor here. There are numerous mentions on the Internet about sqlite I/O errors and their "sudden" nature. Nothing concrete, but enough to suggest a read/write lock is failing in the sqlite database here. Some suggest turning off journaling, others turning off WAL. Their downside seems to be losing the ability to recover, though the real question is whether the database is corrupted or still working fine and the I/O error only throws off the writer.
Do you happen to read Insight data from the GUI when the process crashes?
Cheers,
Franco
-
@franco: Just tried to confirm your assumption. Started the process on the CLI again and played around in ui/diagnostics/networkinsight. This does not trigger a crash.
Also read up on SQLite and ZFS. It looks like there was no final solution in this discussion:
https://sqlite-users.sqlite.narkive.com/QCh23tfL/i-o-errors-with-wal-on-zfs
Is opnsense using WAL for the SQLite DBs?
-
Thanks for the quick test. I wouldn't expect this to happen all the time, but rather sooner or later when a reader and writer collide. If the writer is all alone and the disk is fine, the expectation is that the I/O error never occurs. If it still does, it is even harder to get to the bottom of.
We don't set any mode so sqlite picks its default. I haven't found a quick reference to what the default is though.
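For what it's worth, the effective default can be checked against a throwaway database (a Python sketch, not OPNsense code; run it against a scratch file, not the live netflow databases):

```python
# Check which journal mode sqlite picks when nothing is configured,
# using a temporary file-backed database.
import os
import sqlite3
import tempfile

fd, path = tempfile.mkstemp(suffix='.sqlite')
os.close(fd)
con = sqlite3.connect(path)
mode = con.execute('PRAGMA journal_mode').fetchone()[0]
con.close()
os.unlink(path)
print(mode)  # 'delete' for file-backed databases unless changed
```

For file-backed databases the stock default is the rollback journal, reported as "delete".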
Cheers,
Franco
-
So the current setup on these files is to use the DELETE mode:
root@opnsense01:/var/netflow # sqlite3 ./metadata.sqlite
SQLite version 3.29.0 2019-07-10 17:32:03
Enter ".help" for usage hints.
sqlite> PRAGMA database_list;
0|main|/var/netflow/./metadata.sqlite
sqlite> PRAGMA main.journal_mode;
delete
sqlite>
https://www.sqlite.org/pragma.html
According to the above post, everything should be working fine with the DELETE journal mode...
-
One could use
PRAGMA query_only = ON;
for read-only processes like the web interface. Not sure if that helps, but it would make sure the web interface does not cause the problem...
And if you are sure that the script is the only writer, maybe you could use
PRAGMA main.locking_mode = EXCLUSIVE;
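The effect of query_only can be demonstrated on a scratch database (a Python sketch, not OPNsense code): a connection in this mode cannot write, which would rule the GUI reader out as a source of write conflicts.

```python
# Sketch: a query_only connection is rejected on any write attempt.
import os
import sqlite3
import tempfile

fd, path = tempfile.mkstemp(suffix='.sqlite')
os.close(fd)

writer = sqlite3.connect(path)
writer.execute('CREATE TABLE flows (octets INTEGER)')
writer.commit()

reader = sqlite3.connect(path)
reader.execute('PRAGMA query_only = ON')  # this connection may only read
try:
    reader.execute('INSERT INTO flows VALUES (1)')
except sqlite3.OperationalError as e:
    print('write rejected:', e)

reader.close()
writer.close()
os.unlink(path)
```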
-
@rainerle: thanks for the tip! I'll have to try it next time.
-
Running into a similar issue. Netflow crashes after a minute or two. Not sure, but the problem started after adding an additional interface via "Interfaces: Assignments" for ovpns1. Maybe there is a 'corpse' in a config file now?!
Output:
root@OPNsense:~ # /usr/local/opnsense/scripts/netflow/flowd_aggregate.py --console
Traceback (most recent call last):
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 224, in <module>
Main()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 136, in __init__
self.run()
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 160, in run
aggregate_flowd(self.config, do_vacuum)
File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 68, in aggregate_flowd
for flow_record in parse_flow(prev_recv, config.flowd_source):
File "/usr/local/opnsense/scripts/netflow/lib/parse.py", line 74, in parse_flow
for flow_record in FlowParser(filename, recv_stamp):
File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 139, in __iter__
data_fields=ntohl(header[3])
File "/usr/local/opnsense/scripts/netflow/lib/flowparser.py", line 118, in _parse_binary
raw_data[raw_data_idx:raw_data_idx + fsize]
struct.error: unpack requires a buffer of 8 bytes
root@OPNsense:~ #
I tried to reset and repair via the GUI and also an rm of the sqlite files, but nothing helped.
-
@Conti can you open a new issue on GitHub with the exact same crashdump? There are likely broken flow records in your data, but the aggregator should skip those. I’ll try to take a look somewhere next week.
-
Done!
https://github.com/opnsense/core/issues/3715
Edit: Fixed
-
Looks like I fixed the problem with a workaround:
I changed the backup setting in system_advanced_misc.php from "24 hours" back to "Power off". Since then flowd_aggregate keeps running...
-
Mine stopped working and, looking for an answer, I came across this thread. I tried all the above advice and nothing worked. I decided to reboot, and that seems to have fixed it, at least for now. Unfortunately, I had tried resetting everything on that settings page, so I don't really know what fixed it, but it appears you have to reboot, or kill and restart a process, to really get to the root of it. Happy hunting for the next one to come across this. :(