OPNsense Forum

English Forums => General Discussion => Topic started by: kbrennan1 on September 11, 2015, 11:35:37 am

Title: Gateway Monitoring
Post by: kbrennan1 on September 11, 2015, 11:35:37 am
Hi All,

I am a recently landed m0n0wall migrant trying to get gateway group failover working!

I'm having an issue with gateway groups and monitoring upstream IP addresses.

My setup is running on a Deciso A10 SSD appliance with version 15.7.11. There are two WAN interfaces to different ISPs. One is Tier 1 and the other is Tier 2 in the gateway group. I have unticked the default "disable gateway monitoring" option, and I have no other non-default firewall or NAT rules set at the moment, as this is a new installation.

I cannot monitor the upstream gateway IP, as it will always be available: it sits on the ISP CPE, which has Ethernet presentation, so even if the ISP fails, the monitor will always work. For the same reason, monitoring the link up/down events will not work either. I have set the gateway monitor to use packet loss as the only metric.

My issue is that when I monitor 8.8.8.8 from the tier 1 interface and 8.8.4.4 from the tier 2 interface, they never fail, even when I disconnect the ISP side of the CPE. The only things that will cause the failure condition to trigger are either the physical WAN port on OPNsense being disconnected, or restarting the apinger service. The gateway system logs do not show the failure (until I unplug the cable or restart the service).

Once the failover condition has been triggered, outbound routing is as expected. The failback process works with no issues.

I initially tested this configuration in VMware and I put it down to a virtualisation oddity, but now that I can recreate the same issue on a physical device I'm not so sure.

I found a few other monitoring problems on these boards, but they were related to the service not actually starting.

I'd be grateful for any suggestions.

Cheers

Kevin


**EDIT**
I've recreated this setup with a packet sniffer and I can see that the only time apinger attempts to send an ICMP packet is either on service startup or when there is an active failure condition. It *never* sends an ICMP packet when it thinks the gateway is up.

Another oddity I noticed was that the gateway section in the XML config file only closed the tags if the default configuration was present. I configured the gateway explicitly using the default values and the config xml file was correct.

Has anyone had any issues like this in the past?
I was wondering if a cron job to restart the apinger service every X seconds would work. I think it would, but I lack the knowledge to script that in such a way that it would persist after a reload.
Title: Re: Gateway Monitoring
Post by: kbrennan1 on September 15, 2015, 01:25:57 pm
Ok, so I have a rough workaround to this problem.

I've added two cron jobs to run every 60 seconds via the config xml file.
1:  killall -9 apinger
and
2: /usr/local/sbin/apinger -c /var/etc/apinger.conf

It is pretty ugly and the failure event can take up to 60 seconds before it is noticed. The failback is instant.
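
The two cron entries above could also be folded into one script, so that the kill always happens before the start. The following is only a minimal sketch of that idea: the binary path, config path and killall arguments are the ones quoted in this post, while the function names are made up for illustration, and it assumes apinger detaches into the background when started this way.

```python
import subprocess

# Paths and arguments are copied from the two cron entries in the post above.
APINGER_BIN = "/usr/local/sbin/apinger"
APINGER_CONF = "/var/etc/apinger.conf"


def restart_commands():
    """Return the two workaround commands in the order they must run."""
    return [
        ["killall", "-9", "apinger"],
        [APINGER_BIN, "-c", APINGER_CONF],
    ]


def restart_apinger(run=subprocess.run):
    """Kill any running apinger, then start a fresh instance.

    `run` is injectable so the sequence can be tried without a real firewall.
    killall exits non-zero when no process matched, so its result is
    deliberately ignored.
    """
    kill_cmd, start_cmd = restart_commands()
    run(kill_cmd, check=False)
    return run(start_cmd, check=False)


if __name__ == "__main__":
    restart_apinger()
```

Running both steps from one script avoids relying on the order in which cron fires two separate entries scheduled for the same minute. The up-to-60-second detection delay described above stays the same.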

I still think this is a bug, but my skills do not allow me to go any further.
Any tips/suggestions etc would be great!

Thanks

Kevin
Title: Re: Gateway Monitoring
Post by: guest976 on September 17, 2015, 09:10:18 am
Hi Kevin,

apinger has been and still is APITA.
In the pfSense forum there are numerous threads about it.
I can confirm that I have issues similar to yours on all of my (currently still) pfSense boxes which have more than one WAN. The odd thing is that (in my case) the circumstances under which apinger fails to report the real values are not predictable. This behavior makes running Multi-WAN setups very troublesome and support-intensive.
 
I don't know if there has been a fix in OPNsense yet, which would make switching from pfSense a no-brainer for me.
The pfSense guys have announced a completely rewritten replacement for apinger for their version 2.3.

Perhaps Franco could share some knowledge about the status of the OPNsense apinger?!?

Thanks,
Harry
Title: Re: Gateway Monitoring
Post by: franco on September 18, 2015, 06:32:27 pm
apinger is an interesting case of half-baked maintenance, high complexity, high annoyance, but very little impact. As we don't get paid for doing our open source work (obviously) we are going to tiptoe around the issue until a better solution can be funded. I've tried to play with it and cleaned up the port in the process, but I'm not going to debug apinger, because its behaviour is highly unpredictable and the code too complex to say it's just going to be one fix. It's going to be a larger rewrite.

Yes, pfSense said they would replace it, but other than a fork of the new dpinger they did nothing on GitHub since May, see attached screenshot. I even had to help Denny to make his code compile on FreeBSD and to be easily included in the ports tree, but that inclusion in FreeBSD ports never happened.

We've made it so far with OPNsense, I see danger from fixing apinger for fame, making others less likely to migrate away from their current solutions. I'm all about open source; and I'm also allowed to say no. Hope that helps.
Title: Re: Gateway Monitoring
Post by: grandrivers on October 05, 2015, 08:15:26 am

Quote from: franco on September 18, 2015, 06:32:27 pm
We've made it so far with OPNsense, I see danger from fixing apinger for fame, making others less likely to migrate away from their current solutions. I'm all about open source; and I'm also allowed to say no. Hope that helps.

I don't understand how it's a danger?
But I have a big need for a better monitor/failover system with 2 poor residential ISPs.
Title: Re: Gateway Monitoring
Post by: franco on October 05, 2015, 09:26:44 am
Because nobody is making a commitment towards funding an apinger rework or integrating the new dpinger. I can only assume that this is not in high demand or simply not worth the funding. For us as a project that situation is too risky to just go ahead and fix it, potentially working on it for a week or two. But I can be wrong. :)
Title: Re: Gateway Monitoring
Post by: grandrivers on October 05, 2015, 06:55:46 pm
I wish I had the $$ to move this along on either OPNsense or pfSense, but as I am just a home user, all I can do with my limited $$ is help troubleshoot with my 2 low-quality connections. Dual WAN is the feature that brought me to pfSense and OPNsense.
Title: Re: Gateway Monitoring
Post by: franco on October 05, 2015, 11:01:46 pm
We were talking about this amongst us and were wondering which exact feature is unreliable and whether you can say that OPNsense is affected by this as well? Thanks for your help!
Title: Re: Gateway Monitoring
Post by: grandrivers on October 06, 2015, 01:20:57 am
Yes. As there has been almost no cleanup or change to apinger, I don't think it fixed itself (that would be some cool code). I'm not sure how to describe it in the most accurate way:
1 it will fall to unrealistically low ping times at times (<1ms) and stay there until it gets killed and restarted
2 it will report 100% and sometimes >100% packet loss when a ping started from the firewall or another computer shows this is not correct
3 it is unreliable at marking a gateway down and bringing it back

These issues are most noticeable when the ISP is poor or having known problems.
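
For reference, a loss figure computed over a fixed window of recent probes cannot exceed 100%, which is why the ">100%" readings in point 2 look like a bookkeeping problem rather than a real measurement. The sketch below is not apinger's code; the class name and window size are made up for illustration.

```python
from collections import deque


class ProbeWindow:
    """Keep the outcome of the last `size` probes and derive loss and delay."""

    def __init__(self, size=20):
        # Each entry is a round-trip time in ms, or None for a lost probe.
        self.results = deque(maxlen=size)

    def record(self, rtt_ms):
        """Store one probe result; the oldest one drops out when the window is full."""
        self.results.append(rtt_ms)

    def loss_percent(self):
        """Lost probes as a share of probes in the window, always within 0..100."""
        if not self.results:
            return 0.0
        lost = sum(1 for r in self.results if r is None)
        return 100.0 * lost / len(self.results)

    def avg_delay_ms(self):
        """Mean RTT of the answered probes, or None if nothing was answered."""
        answered = [r for r in self.results if r is not None]
        if not answered:
            return None
        return sum(answered) / len(answered)
```

Because the lost count is taken from the same window as the total, the ratio stays bounded no matter how long the monitor has been running.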

One of my ISPs was so bad that they blocked ICMP from 2002 to 2009ish, so I had no means of failover during that time.
I had an old Xincom 502 router that had 3 different means of detecting failed connections:
one was traffic flow, another was an HTTP check, and the third was ICMP.

I'm not sure what any of the Linux distros currently use, but I am starting to study that in my free time, now that I've got to playing with VirtualBox some.

 
Title: Re: Gateway Monitoring
Post by: windozer on October 06, 2015, 08:59:17 am
apinger looks for (1) a bad quality connection, and (2) link down. No. 1 caused such issues for me. These settings give me a stable connection, instead of several or more reconnections a day.

Latency threshold: 700-999
Packet loss threshold: 80-95
Probe interval: 10
Down: 50
Avg Delay Qty: 20
Avg packet loss qty: use calculated value
Loss probe value: use calculated value
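
As a rough illustration of what such threshold pairs do, the sketch below classifies a gateway from an average delay and a loss percentage. It assumes the lower value of each pair raises a warning and the higher value marks the gateway down, and that a gateway with no replies counts as down. Both are assumptions made for this example, not confirmed apinger behaviour, and the function name is hypothetical. The default numbers are the ones listed above.

```python
def classify(avg_delay_ms, loss_percent,
             latency_thresholds=(700, 999), loss_thresholds=(80, 95)):
    """Return "online", "warning" or "down" for one gateway.

    Assumption: the first value of each pair is the warning level and the
    second is the down level.
    """
    latency_warn, latency_down = latency_thresholds
    loss_warn, loss_down = loss_thresholds

    if avg_delay_ms is None:
        # No probe was answered, so there is no delay to judge.
        return "down"
    if loss_percent >= loss_down or avg_delay_ms >= latency_down:
        return "down"
    if loss_percent >= loss_warn or avg_delay_ms >= latency_warn:
        return "warning"
    return "online"
```

With these defaults a link at 750 ms and 0% loss would be classed as "warning", and one at 30 ms with 96% loss as "down". The wide thresholds are what keep a slow but working link from being dropped.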
Title: Re: Gateway Monitoring
Post by: guest976 on October 06, 2015, 07:50:03 pm
I've tried several settings, monitoring IPs and different hardware platforms, but never got a really reliable apinger service at any time. The problem is that at some point (not reproducible) apinger reports a gateway as down with 100% packet loss. Only after a service restart does apinger work again. It has caused me several hours of trouble and angry users, and is quite the opposite of what I want to achieve by implementing multiple WANs.
So far I have to keep an eye on the WAN connections from outside, which adds a burden on my part, as I simply dare not rely on the current monitoring and/or gateway switching. In general, the current situation disqualifies XXXsense for any new multi-WAN implementations on my side, and I'm right now investigating a stable replacement for my current installations, as my hopes for relief any time soon are pretty much shattered by this thread and:
https://forum.pfsense.org/index.php?topic=100255.0

The majority of my users depend strongly on a stable internet connection, and therefore, in my opinion, it is mandatory to have at least 2 WANs from different providers. Without proper and stable monitoring / failover / load balancing it is pretty senseless...

I would like to help fund a stable monitoring solution, but as I'm only a "one-man show", I'm probably not able to fund it alone.
Perhaps we could set up a funding pool for this together?

Franco, how about defining a proper solution together, estimating the amount of money needed to be raised and try to get it funded by the community?

Cheers,
Harry
Title: Re: Gateway Monitoring
Post by: grandrivers on October 07, 2015, 02:08:13 am
I still have trouble understanding how this has been/is such a low-priority issue for ***sense, but maybe you don't need such stuff for business connections?
I can't imagine this won't cause or hasn't caused some people to turn away rather than troubleshoot, and it could get worse as more economical hardware comes to the market.

But like the saying goes, it depends on what side of the bathroom door you're on, and it seems most are inside, unlike me :)
Title: Re: Gateway Monitoring
Post by: franco on October 11, 2015, 01:45:16 pm
Thanks to everyone for stepping in here. A clear consensus is needed on what is broken and how it could be fixed. Bringing a few people together is a good start; it allows me to analyse the problem and try a few things. An easier solution would be to rotate the apinger service, e.g. every hour, to see if that already helps. What do you guys think?
Title: Re: Gateway Monitoring
Post by: va176thunderbolt on October 11, 2015, 04:37:12 pm
I too have fought the apinger gremlins with ***sense, and ended up disabling the gateway monitoring as I found it too unreliable.

I would chip in some funding for a stable, reliable, and predictable gateway monitoring solution.
Title: Re: Gateway Monitoring
Post by: grandrivers on October 13, 2015, 09:22:56 pm
Rotation could help, but there have been times when I had to stop it and wait (making a hole in the graph), then start it, as a simple restart did not bring the pings back from 0.9 ms to the normal 30ish ms, and I'm not sure why.
If you set it up to warn at 200 ms and kill at 700 ms, I have had it mark the gateway down with pings well under 700 ms.
Title: Re: Gateway Monitoring
Post by: va176thunderbolt on October 29, 2015, 07:52:52 pm
franco - a gateway monitoring solution that is dependable and reliable is what is needed. I don't believe that ICMP pings are adequate in solely determining that a given gateway is "up" (usable) or "down" (unusable). I've experienced too many instances where apinger reports high packet loss and/or high response time to a gateway while a computer behind the firewall running Pingplotter logs vastly different data.

Maybe we could:
1) wget a (or multiple) user specified url(s) and measure the response time in retrieving the url(s) (say for example, pull the status page of my modem and https://www.google.com)
2) open a tcp connection or complete a ssl handshake to a remote host and measure the response time
3) leverage an application like DNS for connectivity & response time (push a query out a connection and ensure it resolves within a user defined time).
4) integrate with vnstat to say push traffic away from a gateway that's capped and reaching the monthly cap.
5) allow me to define a set of tests per WAN connection in determining usability (for example, WAN1 must be able to retrieve https://www.google.com in under 800ms and resolve www.netflix.com via my ISP's DNS server in under 70ms. For WAN2, retrieve the modem's status page in under 10ms and monthly volume under 250GB).
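
Ideas 2) and 5) can be sketched in a few lines. The code below is only an illustration of the proposal, not anything that exists in OPNsense or pfSense: both function names are hypothetical, and binding to a WAN address only picks the source IP. Whether the probe then really leaves through that WAN depends on the firewall's routing, which the sketch ignores.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=2.0, source_ip=None):
    """Return the TCP connect time in milliseconds, or None on failure.

    `source_ip` (optional) binds the probe to one local address, for
    example the address of a specific WAN interface.
    """
    source = (source_ip, 0) if source_ip else None
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=source):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return None


def gateway_usable(measurements, limits_ms):
    """True only if every named test produced a value within its limit.

    `measurements` maps a test name to a time in ms (or None if it failed);
    `limits_ms` maps the same names to the maximum acceptable time.
    """
    for name, limit in limits_ms.items():
        value = measurements.get(name)
        if value is None or value > limit:
            return False
    return True
```

A per-WAN rule such as "WAN1 must reach www.google.com on port 443 in under 800ms" would then be a call to tcp_connect_ms with that WAN's address as source_ip, with the result passed to gateway_usable together with a limit of 800. DNS or HTTP checks would slot into the same dictionary as additional named tests.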
Title: Re: Gateway Monitoring
Post by: franco on April 22, 2016, 01:10:06 pm
I just found this old thread and will revive it for the purpose of advancing the state of apinger:

https://github.com/opnsense/apinger

I'm currently cleaning up the code base, fixing minor and potential problems. The next big hunk is to make it behave when NTP is running; apinger currently does not like that, which is probably the cause of most of the "weird" issues that have been spotted.
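
One plausible way a clock adjustment can upset a ping monitor is shown below: if the round-trip time is taken from a wall clock that gets stepped between sending and receiving, the result can be negative or far too large. This is an assumed mechanism for illustration only, not an analysis of apinger's source. `stepped_wall_clock` is a made-up stand-in for a clock that was moved between two reads.

```python
import time


def measure_rtt_ms(probe, clock=time.monotonic):
    """Time one probe with the given clock and return the elapsed milliseconds."""
    start = clock()
    probe()
    return (clock() - start) * 1000.0


def stepped_wall_clock(readings):
    """Hypothetical wall clock that returns the given readings one by one."""
    it = iter(readings)
    return lambda: next(it)


if __name__ == "__main__":
    # Wall clock moved back by half a second while the probe was in flight.
    bad = measure_rtt_ms(lambda: None, clock=stepped_wall_clock([100.0, 99.5]))
    print("wall clock stepped back:", bad, "ms")
    # A monotonic clock never goes backwards, so the result cannot be negative.
    good = measure_rtt_ms(lambda: None)
    print("monotonic clock:", good, "ms")
```

Using a monotonic clock for interval measurements is the usual way to make a timer independent of such adjustments.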


Cheers,
Franco
Title: Re: Gateway Monitoring
Post by: Perun on April 11, 2018, 02:54:16 pm
Hi

I still have issues with apinger on 18.1.6. I ping my monitoring IPs (DNS servers of my ISPs) from the console and get 15-30ms RTT. With apinger I sometimes see >10000ms at the same time.

What does a cron job to 'fix' it look like?

Greetz
Title: Re: Gateway Monitoring
Post by: prayuth01 on April 11, 2018, 04:59:12 pm
I still do not have these problems.

thank you
Title: Re: Gateway Monitoring
Post by: Perun on April 12, 2018, 10:15:03 am
Can someone tell me how to restart apinger with a command?
Title: Re: Gateway Monitoring
Post by: muchacha_grande on April 12, 2018, 01:03:04 pm
Hi,
   I have experienced some issues with apinger in the past, and my conclusion is that it needs some guarantee that the ping traffic will leave the FW with priority, so that even in situations of high upload traffic the ICMP packets reach the destination.
   A customer of mine called his ISP to say that the service was faulty, and they responded that he was using all the upload bandwidth. The solution at that moment was to limit the bandwidth consumed by a Dropbox sync task on a PC.
   I think that to make sure apinger or dpinger get the best measurement possible, they need to have some QoS in the FW, so that the measurement results will be more reliable.

Cheers