Gateway Monitoring

Started by kbrennan1, September 11, 2015, 11:35:37 AM

September 11, 2015, 11:35:37 AM Last Edit: September 15, 2015, 11:55:10 AM by kbrennan1
Hi All,

I am a recently landed m0n0wall migrant trying to get gateway group failover working!

I'm having an issue with gateway groups and monitoring upstream IP addresses.

My setup is running on a Deciso A10 SSD appliance with version 15.7.11. There are two WAN interfaces to different ISPs; one is Tier 1 and the other is Tier 2 in the gateway group. I have disabled the default "disable gateway monitoring" setting, and I have no other non-default firewall or NAT rules set at the moment, as this is a new installation.

I cannot monitor the upstream gateway IP, as it will always be available: it sits on the ISP CPE with an Ethernet presentation, so even if the ISP fails, the monitor will always succeed. For the same reason, monitoring link up/down events will not work either. I have set the gateway monitor to use packet loss as the only metric.

My issue is that when I monitor 8.8.8.8 from the Tier 1 interface and 8.8.4.4 from the Tier 2 interface, they never fail, even when I disconnect the ISP side of the CPE. The only things that will cause the failure condition to trigger are either the physical WAN port on OPNsense being disconnected, or restarting the apinger service. The gateway system logs do not show the failure (until I unplug the cable or restart the service).

Once the failover condition has been triggered, outbound routing is as expected. The failback process works with no issues.

I initially tested this configuration in VMware and I put it down to a virtualisation oddity, but now that I can recreate the same issue on a physical device I'm not so sure.

I found a few other monitoring problems on these boards, but they were related to the service not actually starting.

I'd be grateful for any suggestions.

Cheers

Kevin


**EDIT**
I've re-run this setup with a packet sniffer, and I can see that the only time apinger attempts to send an ICMP packet is either on service startup or when there is an active failure condition. It *never* sends an ICMP packet while it thinks the gateway is up.
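For anyone wanting to reproduce the capture, something along these lines is what I used (igb0 is just an example interface name; substitute your own WAN interface and monitor IP):

  # watch for apinger's ICMP echo requests/replies on the Tier 1 WAN interface
  tcpdump -ni igb0 'icmp and host 8.8.8.8'

On a working setup you would expect one echo request per probe interval; here the capture stays silent until apinger is restarted or a failure condition is already active.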

Another oddity I noticed was that the gateway section in the XML config file only had properly closed tags once the gateway had been configured explicitly. I configured the gateway explicitly using the default values, and the config XML file was then correct.

Has anyone had any issues like this in the past?
I was wondering if a cron job to restart the apinger service every X seconds would work. I think it would, but I lack the knowledge to script it in such a way that it would persist after a reload.

Ok, so I have a rough workaround to this problem.

I've added two cron jobs to run every 60 seconds via the config xml file.
1:  killall -9 apinger
and
2: /usr/local/sbin/apinger -c /var/etc/apinger.conf

It is pretty ugly and the failure event can take up to 60 seconds before it is noticed. The failback is instant.
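For reference, the entries in config.xml look roughly like this (the <cron><item> layout below is my understanding of the pfSense-derived config format, so treat it as a sketch and check it against your own file before editing):

  <cron>
    <item>
      <minute>*/1</minute>
      <hour>*</hour>
      <mday>*</mday>
      <month>*</month>
      <wday>*</wday>
      <who>root</who>
      <command>/usr/bin/killall -9 apinger</command>
    </item>
    <item>
      <minute>*/1</minute>
      <hour>*</hour>
      <mday>*</mday>
      <month>*</month>
      <wday>*</wday>
      <who>root</who>
      <command>/usr/local/sbin/apinger -c /var/etc/apinger.conf</command>
    </item>
  </cron>

Because these live in config.xml rather than in a hand-edited crontab, they survive reboots and config reloads.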

I still think this is a bug, but my skills do not allow me to go any further.
Any tips/suggestions etc would be great!

Thanks

Kevin

Hi Kevin,

apinger has been, and still is, a PITA.
In the pfSense forum there are numerous threads about it.
I can confirm that I have similar issues to yours on all of my (currently still) pfSense boxes which have more than one WAN. The odd thing is that, in my case, the circumstances under which apinger fails to report the real values are not predictable. This behaviour makes running multi-WAN setups very troublesome and support-intensive.

I don't know if there has been a fix in OPNsense yet, which would make switching from pfSense a no-brainer for me.
The pfSense guys have announced a completely rewritten replacement for apinger for their version 2.3.

Perhaps Franco could share some knowledge about the status of the OPNsense apinger?!?

Thanks,
Harry

apinger is an interesting case of half-baked maintenance, high complexity, high annoyance, but very little impact. As we don't get paid for doing our open source work (obviously), we are going to tiptoe around the issue until a better solution can be funded. I've tried to play with it and cleaned up the port in the process, but I'm not going to debug apinger, because its behaviour is highly unpredictable and the code is too complex for it to be just one fix. It's going to be a larger rewrite.

Yes, pfSense said they would replace it, but other than a fork of the new dpinger they have done nothing on GitHub since May; see the attached screenshot. I even had to help Denny make his code compile on FreeBSD and be easily included in the ports tree, but that inclusion in FreeBSD ports never happened.

We've made it so far with OPNsense, I see danger from fixing apinger for fame, making others less likely to migrate away from their current solutions. I'm all about open source; and I'm also allowed to say no. Hope that helps.

Quote from: franco on September 18, 2015, 06:32:27 PM

We've made it so far with OPNsense, I see danger from fixing apinger for fame, making others less likely to migrate away from their current solutions. I'm all about open source; and I'm also allowed to say no. Hope that helps.

I don't understand how it's a danger?
But I have a big need for a better monitor/failover system with two poor residential ISPs.

Because nobody is making a commitment towards funding an apinger rework or integrating the new dpinger. I can only assume that this is not in high demand or simply not worth the funding. For us as a project that situation is too risky to just go ahead and fix it, potentially working on it for a week or two. But I can be wrong. :)

I wish I had the money to move this along on either OPNsense or pfSense, but as I am just a home user, all I can do with my limited funds is help troubleshoot with my two low-quality connections. Dual WAN is the feature that brought me to pfSense and OPNsense, though.

We were talking about this amongst ourselves and were wondering which exact feature is unreliable, and whether you can say that OPNsense is affected by this as well? Thanks for your help!

Yes. As there has been almost no cleanup of or change to apinger, I don't think it fixed itself (that would be some cool code). I'm not sure how to describe it most accurately, but:
1. It will fall to unrealistically low ping times, at times <1 ms, and stay there until it gets killed and restarted.
2. It will report 100%, and sometimes more than 100%, packet loss when pinging from the firewall or another computer shows this is not correct.
3. It is unreliable at marking a gateway down and bringing it back up.

These issues are most noticeable when the ISP is poor or having known problems.

One of my ISPs was so bad they had blocked ICMP from 2002 to around 2009, so I had no means of failover during that time. I had an old Xincom 502 router that had three different means of detecting failed connections: one was traffic flow, another was an HTTP check, and the third was ICMP.

I'm not sure what any of the Linux distros currently use, but I am starting to study that in my free time, now that I've gotten to playing with VirtualBox some.


October 06, 2015, 08:59:17 AM #9 Last Edit: October 06, 2015, 09:07:10 AM by windozer
apinger looks for (1) a bad-quality connection and (2) link down. No. 1 caused such issues for me. The settings below give me a stable connection, instead of several or more reconnections a day (a sketch after the list shows how these might map to apinger's own config):

Latency threshold: 700-999 ms
Packet loss threshold: 80-95 %
Probe interval: 10 s
Down: 50 s
Avg Delay Qty: 20
Avg packet loss qty: use calculated value
Loss probe value: use calculated value
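For anyone curious how values like these end up in apinger's own configuration, a minimal sketch of a generated /var/etc/apinger.conf might look roughly like this (the syntax follows the apinger project's config format; the alarm names and target address are illustrative, not copied from a real generated file):

  # sketch only: thresholds from the GUI values above
  alarm delay "gw_delay" {
    delay_low 700ms      # latency threshold, lower bound
    delay_high 999ms     # latency threshold, upper bound
  }
  alarm loss "gw_loss" {
    percent_low 80       # packet loss threshold, lower bound
    percent_high 95      # packet loss threshold, upper bound
  }
  alarm down "gw_down" {
    time 50s             # "Down: 50"
  }
  target "8.8.8.8" {
    description "WAN_GW"
    interval 10s           # "Probe interval: 10"
    avg_delay_samples 20   # "Avg Delay Qty: 20"
    alarms "gw_loss","gw_delay","gw_down"
  }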

I've tried several settings, monitoring IPs and different hardware platforms, but never got a really reliable apinger service at any time. The problem is that at some point (not reproducible) apinger reports a gateway as down with 100% packet loss. Only after a service restart does apinger work again. It has caused me several hours of trouble and angry users, and is quite the opposite of what I want to achieve by implementing multiple WANs.
So far I have had to keep an eye on the WAN connections from outside, which adds burden on my part, as I simply dare not rely on the current monitoring and/or gateway switching. In general, the current situation disqualifies XXXsense for any new multi-WAN implementations on my side, and I'm right now investigating a stable replacement for my current installations, as my hopes for soon-to-come relief are pretty much shattered by this thread and:
https://forum.pfsense.org/index.php?topic=100255.0

The majority of my users strongly depend on a stable internet connection, and therefore, in my opinion, it is mandatory to have at least two WANs from different providers. Without proper and stable monitoring / failover / load balancing, it is pretty senseless...

I would like to help fund a stable monitoring solution, but as I'm only a one-man show, I'm probably not able to fund it alone.
Perhaps we could set up a funding pool for this together?

Franco, how about defining a proper solution together, estimating the amount of money that needs to be raised, and trying to get it funded by the community?

Cheers,
Harry

I still have trouble understanding how this has been, and is, such a low-priority issue for ***sense, but maybe you don't need such stuff for business connections?
I can't imagine this hasn't caused some users to turn away during troubleshooting, and it could get worse as more economical hardware comes to the market.

But, like the saying goes, it depends on what side of the bathroom door you're on, and it seems most are inside, unlike me. :)

Thanks to everyone for stepping in here. A clear consensus is needed on what is broken and how it could be fixed. Bringing a few people together is a good start; it allows me to analyse the problem and try a few things. An easier solution would be to rotate the apinger service, e.g. every hour, to see if that already helps. What do you guys think?
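Something along these lines in a cron entry would do the rotation, reusing the binary and config paths from Kevin's workaround above (a sketch, assuming those paths; adjust the schedule to taste):

  # /etc/crontab-style entry: restart apinger at the top of every hour
  0  *  *  *  *  root  /usr/bin/killall apinger; /usr/local/sbin/apinger -c /var/etc/apinger.conf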

I too have fought the apinger gremlins with ***sense, and ended up disabling the gateway monitoring as I found it too unreliable.

I would chip in some funding for a stable, reliable, and predictable gateway monitoring solution.

Rotation could help, but there have been times when I had to stop it and wait (making a hole in the graph) and then start it, as a simple restart did not bring the pings back from 0.9 ms to the normal 30-ish ms, and I'm not sure why.
If you set it up to warn at 200 ms and kill at 700 ms, I have had it mark the gateway down with pings well under 700 ms.