[SOLVED] Netflow Insight No Data Available

Started by adcwb, August 25, 2016, 06:09:54 PM

Previous topic - Next topic
August 25, 2016, 06:09:54 PM Last Edit: September 02, 2016, 05:01:05 PM by franco
OPNsense was running fine until yesterday; as of today the Insight page just shows "No Data Available".
The log shows the following:

flowd_aggregate.py: flowd aggregate died with message:

Traceback (most recent call last):
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 145, in run
    aggregate_flowd(do_vacuum)
  File "/usr/local/opnsense/scripts/netflow/flowd_aggregate.py", line 79, in aggregate_flowd
    stream_agg_object.add(flow_record_cpy)
  File "/usr/local/opnsense/scripts/netflow/lib/aggregates/source.py", line 105, in add
    super(FlowSourceAddrDetails, self).add(flow)
  File "/usr/local/opnsense/scripts/netflow/lib/aggregate.py", line 258, in add
    self._update_cur.execute(self._update_stmt, flow)
DatabaseError: database disk image is malformed

On the Netflow -> Cache page I can see that it is still tracking flows.
The flowd-reader command seems fine too, outputting the traffic.
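(For anyone wanting to reproduce that check, the usual invocation against the active dump is something like:

# flowd-reader /var/log/flowd.log
)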
I tried restarting the flowd_aggregate service, without success.

Any thoughts?


Hi there,

Which version are you using? What image flavour did you use?

We do have an auto-repair, but it's not in the release yet. You can apply it on top of 16.7.2 using:

# opnsense-patch 5ec2101

The respective commit is:

https://github.com/opnsense/core/commit/5ec2101

If this works we can queue it up for 16.7.3 next week.
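
Roughly speaking, the fix boils down to running an integrity check on the aggregation databases before the aggregator starts, and rebuilding anything that is broken. A minimal sketch of that idea, assuming SQLite's PRAGMA integrity_check and the /var/netflow/ database location (illustrative only, not the literal code from the commit):

# Illustrative sketch only -- not the code from commit 5ec2101.
import glob
import os
import sqlite3

def check_and_repair(db_filename):
    """Return True if the database passes PRAGMA integrity_check;
    otherwise remove it so the aggregator can rebuild it from the flowd logs."""
    try:
        conn = sqlite3.connect(db_filename)
        result = conn.execute('PRAGMA integrity_check').fetchone()
        conn.close()
        if result and result[0] == 'ok':
            return True
    except sqlite3.DatabaseError:
        pass
    # Malformed database: drop it, the aggregator recreates it on startup.
    os.remove(db_filename)
    return False

# Assumption: the aggregate databases live under /var/netflow/.
for db in glob.glob('/var/netflow/*.sqlite'):
    check_and_repair(db)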


Cheers,
Franco

Bumping this.

Since my upgrade to 16.7.3 yesterday (31/8) I now get "No Data Available" in my Insight display. 31/8 was fine and I have data up until the time of the upgrade, then nothing. 1/9 only has NULL entries!

On the NetFlow section's Cache tab, pressing refresh makes the numbers change. In my world that means something is happening, but it does not get reflected in Insight.

NetFlow capture is on with "Capture local" checked, Destination set to 127.0.0.1:2056 and Version set to v9.

I'd be happy to supply more information, but right now I don't know what would help.   :-\

   /Jonas...

Hi Jonas,

You might want to try this patch, which we have given a longer test interval:

https://github.com/opnsense/core/commit/5ec2101ac2

You can install and activate it from the command line like this:

# opnsense-patch 5ec2101ac2
# service flowd_aggregate restart

The graphs should come back to life immediately.

I haven't seen any glitches there in the last month (I'm using the patched opnsense-devel version).


Cheers,
Franco

Quote from: franco on September 01, 2016, 02:05:37 PM
...
You can install and activate it from the command line like this:

# opnsense-patch 5ec2101ac2
# service flowd_aggregate restart

The graphs should come back to life immediately.
...

Sorry, no joy!


Edit /etc/motd to change this login announcement.

0) Logout                             7) Ping host
1) Assign Interfaces                  8) Shell
2) Set interface(s) IP address        9) pfTop
3) Reset the root password           10) Filter Logs
4) Reset to factory defaults         11) Restart web interface
5) Power off system                  12) Upgrade from console
6) Reboot system                     13) Restore a configuration

Enter an option: 8

root@OPNsense:~ # opnsense-patch 5ec2101ac2
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|From 5ec2101ac2c67824b94306fe01d1ed97211ea730 Mon Sep 17 00:00:00 2001
|From: Ad Schellevis <ad@opnsense.org>
|Date: Sun, 7 Aug 2016 16:10:19 +0200
|Subject: [PATCH] (insight) check database integrity before start, repair if
| broken
|
|---
| src/opnsense/scripts/netflow/flowd_aggregate.py |  5 +-
| src/opnsense/site-python/sqlite3_helper.py      | 69 +++++++++++++++++++++++++
| 2 files changed, 73 insertions(+), 1 deletion(-)
| create mode 100644 src/opnsense/site-python/sqlite3_helper.py
|
|diff --git a/src/opnsense/scripts/netflow/flowd_aggregate.py b/src/opnsense/scripts/netflow/flowd_aggregate.py
|index 8915576..1368046 100755
|--- a/src/opnsense/scripts/netflow/flowd_aggregate.py
|+++ b/src/opnsense/scripts/netflow/flowd_aggregate.py
--------------------------
Patching file opnsense/scripts/netflow/flowd_aggregate.py using Plan A...
Hunk #1 succeeded at 28.
Hunk #2 succeeded at 36.
Hunk #3 succeeded at 130.
Hmm...  The next patch looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/src/opnsense/site-python/sqlite3_helper.py b/src/opnsense/site-python/sqlite3_helper.py
|new file mode 100644
|index 0000000..6c32f7e
|--- /dev/null
|+++ b/src/opnsense/site-python/sqlite3_helper.py
--------------------------
(Creating file opnsense/site-python/sqlite3_helper.py...)
Patching file opnsense/site-python/sqlite3_helper.py using Plan A...
Empty context always matches.
Hunk #1 succeeded at 1.
done
All patches have been applied successfully.  Have a nice day.
root@OPNsense:~ # service flowd_aggregate restart
flowd_aggregate not running? (check /var/run/flowd_aggregate.pid).
Starting flowd_aggregate.
root@OPNsense:~ # service flowd_aggregate status
flowd_aggregate is running as pid 3230.
root@OPNsense:~ #


Still "No Data Available" in the graphs and still nothing in Details either.

   /Jonas...

It's interesting that flowd_aggregate wasn't running. Can you please check the system log for configd log messages?

Let me rephrase that. Now I don't have any data in the Details tab of Insight! :|

A screenshot would make this as unambiguous as possible.

Another thing I remember that can happen is the netflow dump growing to GB proportions, so that flowd_aggregate dies while loading the dump:

# ls -lah /var/log/flowd*

Like this?

root@OPNsense:/var/log # grep configd system.log


Sep  1 14:27:02 OPNsense configd.py: [8c342056-8a26-44c0-ae6b-d981b679f975] retrieve flow cache statistics
Sep  1 14:27:03 OPNsense configd.py: [d7ebf617-1294-4f55-ad1b-fdcc295a9b3a] retrieve flow cache statistics
Sep  1 14:27:04 OPNsense configd.py: [1b911f1b-91c1-4f33-9d3a-b277539a3513] retrieve flow cache statistics
Sep  1 14:27:05 OPNsense configd.py: [411f13af-6197-4cb8-b80c-64ac0171ba81] retrieve flow cache statistics
Sep  1 14:27:06 OPNsense configd.py: [1acdfd9f-0ec3-402c-b8f8-7fe958ef956a] retrieve flow cache statistics
Sep  1 14:27:12 OPNsense configd.py: [fb34970c-ed49-4e68-af22-c887f843d8f5] request netflow data aggregator timeseries for FlowInterfaceTotals
Sep  1 14:27:12 OPNsense configd.py: [1361e984-a95d-4cab-a5ab-e6535430d168] request netflow data aggregator top usage for FlowDstPortTotals
Sep  1 14:27:12 OPNsense configd.py: [b372ae49-6fd6-4d75-b865-663b3fb6094c] request netflow data aggregator top usage for FlowSourceAddrTotals
Sep  1 14:27:18 OPNsense configd.py: [b372ae49-6fd6-4d75-b865-663b3fb6094c] Script action failed with Command '/usr/local/opnsense/scripts/netflow/get_top_usage.py /provider "FlowSourceAddrTotals" /start_time "1472724000" /end_time "1472732832" /key_fields "src_addr" /value_field "octets" /filter "if=igb2" /max_hits "25"' returned non-zero exit status 1 at Traceback (most recent call last):   File "/usr/local/opnsense/service/modules/processhandler.py", line 477, in execute     stdout=output_stream, stderr=error_stream)   File "/usr/local/lib/python2.7/subprocess.py", line 541, in check_call     raise CalledProcessError(retcode, cmd) CalledProcessError: Command '/usr/local/opnsense/scripts/netflow/get_top_usage.py /provider "FlowSourceAddrTotals" /start_time "1472724000" /end_time "1472732832" /key_fields "src_addr" /value_field "octets" /filter "if=igb2" /max_hits "25"' returned non-zero exit status 1
Sep  1 14:27:18 OPNsense configd.py: [36e69f2b-e378-4c3c-bc63-17281d3949b9] request netflow data aggregator top usage for FlowInterfaceTotals
Sep  1 14:27:18 OPNsense configd.py: [3ffe4cb6-33b2-4a5c-aaaa-ecfa8f74b607] request netflow data aggregator top usage for FlowInterfaceTotals
Sep  1 14:27:18 OPNsense configd.py: [20965b29-5041-4182-9f05-a24412707e2c] request netflow data aggregator metadata
Sep  1 14:27:21 OPNsense configd.py: [d3ae606f-35d0-412d-9bf5-5bfa532baf06] request netflow data aggregator top usage for FlowSourceAddrDetails
Sep  1 14:27:26 OPNsense configd.py: [d3ae606f-35d0-412d-9bf5-5bfa532baf06] Script action failed with Command '/usr/local/opnsense/scripts/netflow/get_top_usage.py /provider "FlowSourceAddrDetails" /start_time "1472601600" /end_time "1472774399" /key_fields "service_port,protocol,if,src_addr,dst_addr" /value_field "octets" /filter "if=igb2" /max_hits "100"' returned non-zero exit status 1 at Traceback (most recent call last):   File "/usr/local/opnsense/service/modules/processhandler.py", line 477, in execute     stdout=output_stream, stderr=error_stream)   File "/usr/local/lib/python2.7/subprocess.py", line 541, in check_call     raise CalledProcessError(retcode, cmd) CalledProcessError: Command '/usr/local/opnsense/scripts/netflow/get_top_usage.py /provider "FlowSourceAddrDetails" /start_time "1472601600" /end_time "1472774399" /key_fields "service_port,protocol,if,src_addr,dst_addr" /value_field "octets" /filter "if=igb2" /max_hits "100"' returned non-zero exit status 1
Sep  1 14:29:06 OPNsense configd.py: [853a42a1-b5f4-41f7-b7bf-73a88d5214b9] request netflow data aggregator timeseries for FlowInterfaceTotals
Sep  1 14:29:06 OPNsense configd.py: [353e5e7d-98ad-445b-b57f-ff30c57485e1] request netflow data aggregator top usage for FlowDstPortTotals
Sep  1 14:29:06 OPNsense configd.py: [f20bf34b-c7b5-462f-9f92-b8444fb94d3a] request netflow data aggregator top usage for FlowInterfaceTotals
Sep  1 14:29:06 OPNsense configd.py: [189c0c02-246f-4383-a536-e132671533bd] request netflow data aggregator metadata
Sep  1 14:29:06 OPNsense configd.py: [20bc5ecb-3d1f-4980-b139-2c51acc3feab] request netflow data aggregator top usage for FlowSourceAddrTotals
Sep  1 14:29:06 OPNsense configd.py: [9cdb69f0-b4df-4607-b772-568a1a3c91af] request netflow data aggregator top usage for FlowInterfaceTotals
Sep  1 14:29:13 OPNsense configd.py: [6d482a0f-a957-4f8a-9b5e-131dde3b9aea] request netflow data aggregator top usage for FlowSourceAddrDetails

Quote from: franco on September 01, 2016, 02:36:32 PM
A screenshot would make this as unambiguous as possible.

Another thing I remember that can happen is the netflow dump growing to GB proportions, so that flowd_aggregate dies while loading the dump:

# ls -lah /var/log/flowd*

How do I attach/include a picture???


root@OPNsense:/var/log # ls -lah /var/log/flowd*
-rw-------  1 root  wheel    86M Sep  1 14:36 /var/log/flowd.log
-rw-------  1 root  wheel    11M Aug 31 20:39 /var/log/flowd.log.000001
-rw-------  1 root  wheel    11M Aug 31 18:55 /var/log/flowd.log.000002
-rw-------  1 root  wheel    11M Aug 31 16:42 /var/log/flowd.log.000003
-rw-------  1 root  wheel    11M Aug 31 15:15 /var/log/flowd.log.000004
-rw-------  1 root  wheel    11M Aug 31 13:11 /var/log/flowd.log.000005
-rw-------  1 root  wheel    12M Aug 31 11:06 /var/log/flowd.log.000006
-rw-------  1 root  wheel    11M Aug 31 09:06 /var/log/flowd.log.000007
-rw-------  1 root  wheel    11M Aug 31 06:57 /var/log/flowd.log.000008
-rw-------  1 root  wheel    11M Aug 31 04:39 /var/log/flowd.log.000009
-rw-------  1 root  wheel    11M Aug 31 02:22 /var/log/flowd.log.000010

Data is back in the Details tab, it just took a VERY long time to show up. Still no data for today, and nothing after roughly 21:50 yesterday.

Thanks for the log, Ad will take a look at this when he is back on Monday.

Click the "Attachments and other options" link directly beneath the text area when writing a reply. :)

The flowd dump size looks ok; all the data from yesterday is there (although it seems like quite a lot for a single day).

The last thing we can try is to remove the sqlite database while keeping the flowd data; it should recalculate your graphs unless there is a problem with the dumps themselves.

# service flowd_aggregate stop
# rm /var/netflow/*
# service flowd_aggregate start


Cheers,
Franco

Oh, ok, then there's an issue with the current /var/log/flowd.log file; it's grown beyond the 11 MB rotation limit.

Let's try bumping flowd as a last resort...

# service flowd restart

Quote from: franco on September 01, 2016, 02:51:49 PM
Oh, ok, then there's an issue with the current /var/log/flowd.log file; it's grown beyond the 11 MB rotation limit.

Let's try bumping flowd as a last resort...

# service flowd restart

Ok, will do that. Should I wait for the DB rebuild to finish first?

Quote from: franco on September 01, 2016, 02:51:49 PM
Oh, ok, then there's an issue with the current /var/log/flowd.log file; it's grown beyond the 11 MB rotation limit.
I've just been looking at this as I also don't have any graphs (I use other tools to monitor the firewall) and my log file seems to have grown rather large:

ls -l /var/log/flowd*
-rw-------  1 root  wheel  3545620816 Sep  1 15:10 /var/log/flowd.log
I guess I can just stop the service and remove this file, and it will recreate it?
Regards


Bill