OPNsense Forum

Archive => 17.7 Legacy Series => Topic started by: whitwye on August 21, 2017, 08:23:47 pm

Title: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 08:23:47 pm
Franco suggests perhaps my gateway monitoring isn't set up correctly. So let's run through that here.

At System > Gateways > All

Quote
Name    Interface    Gateway    Monitor IP    Description    
GW_WAN    GLOBAL    207.239.<obfuscated>    8.8.4.4    GlobalGW
GW_WAN_2 (default)    COGENT    38.105.<obfuscated>    8.8.8.8    CogentGW

So far so good. But double-checking the "Disable Gateway Monitoring" boxes shows them checked. [Note on interface conventions: 99 times out of 100, checkboxes are used to enable things, not disable them.] Is this it? Uncheck both boxes and apply.

Then I disable the COGENT interface. And ...

Quote
root@OPNsense:/tmp # route get 207.136.236.70
route: route has not been found

But it does know the special route to the check IP:

Quote
root@OPNsense:/tmp # route get 8.8.4.4
   route to: google-public-dns-b.google.com
destination: google-public-dns-b.google.com
    gateway: 207.239.<obfuscated>
        fib: 0
  interface: igb2
      flags: <UP,GATEWAY,HOST,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0

And can ping that. But it can't ping 8.8.8.8, or 207.136.236.70. And when I try to connect to one of the public IPs on the GLOBAL interface from outside, I can't.
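For anyone repeating this check, the two "route get" results above can be compared mechanically. The following is only a minimal sketch in Python: the function names are made up for illustration, and the gateway 192.0.2.1 is a placeholder from the documentation address range standing in for the obfuscated GLOBAL gateway.

```python
def parse_route_get(text):
    """Parse FreeBSD `route get` output into a dict of its 'key: value' lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # The metrics header and its row of numbers carry no colon.
            continue
        fields[key.strip()] = value.strip()
    if "flags" in fields:
        # "<UP,GATEWAY,HOST,DONE,STATIC>" -> {"UP", "GATEWAY", ...}
        fields["flags"] = set(fields["flags"].strip("<>").split(","))
    return fields


def describe(fields):
    """One-line summary: missing route, host route, or network/default route."""
    if "interface" not in fields:
        return "no route"
    is_host = "HOST" in fields.get("flags", set())
    kind = "host route" if is_host else "network/default route"
    return f"{kind} via {fields['gateway']} on {fields['interface']}"


# Placeholder gateway (192.0.2.1); the real one is obfuscated in the post.
MONITOR = """\
   route to: google-public-dns-b.google.com
destination: google-public-dns-b.google.com
    gateway: 192.0.2.1
        fib: 0
  interface: igb2
      flags: <UP,GATEWAY,HOST,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0
"""
NOT_FOUND = "route: route has not been found\n"

print(describe(parse_route_get(MONITOR)))
print(describe(parse_route_get(NOT_FOUND)))
```

The HOST flag is what separates the per-monitor static route from a default route, so it is the field worth watching while an interface is taken down and brought back.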

Now I do nothing but enable the COGENT interface.

Quote
root@OPNsense:/tmp # route get 207.136.236.70
   route to: vt.electrainfo.com
destination: default
       mask: default
    gateway: g<obfuscated>1.atlas.cogentco.com
        fib: 0
  interface: igb1
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0

And I can both ping IPs on the GLOBAL interface, and connect to services NATed behind them -- which were unavailable with the COGENT interface down.

So my bad on misreading the checkbox function. Yet, getting that right's not enough to make MultiWAN work. Is there documentation on what's supposed to be going on under the covers here, so I can check on where that might be going wrong?

Thanks again,
Whit


Title: Re: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 08:45:11 pm
A more specific set of questions:

I see that 8.8.8.8 and 8.8.4.4 have static routes set, so that's why they work no matter what.
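Why the monitor IPs keep working can be illustrated with a longest-prefix-match lookup: a /32 host route still matches after the default route is removed. This is a simplified sketch, not the kernel's actual lookup code; the gateway addresses are documentation-range placeholders, and directly connected networks are left out.

```python
import ipaddress


def lookup(routes, destination):
    """Return the (network, gateway) pair with the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda item: item[0].prefixlen)


# Placeholder gateways (assumptions, not the real obfuscated addresses).
GLOBAL_GW = "192.0.2.1"
COGENT_GW = "198.51.100.1"

routes = [
    (ipaddress.ip_network("0.0.0.0/0"), COGENT_GW),   # default route via COGENT
    (ipaddress.ip_network("8.8.4.4/32"), GLOBAL_GW),  # monitor host route, GLOBAL
    (ipaddress.ip_network("8.8.8.8/32"), COGENT_GW),  # monitor host route, COGENT
]

# With COGENT disabled, its default route and monitor host route disappear.
routes_cogent_down = [r for r in routes if r[1] != COGENT_GW]

print(lookup(routes, "207.136.236.70"))              # falls through to the default
print(lookup(routes_cogent_down, "8.8.4.4"))         # host route still matches
print(lookup(routes_cogent_down, "207.136.236.70"))  # nothing left to match
```

In this model, the last lookup returning nothing corresponds to the "route has not been found" message seen earlier with the COGENT interface disabled.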

What should I see set, and where, when the COGENT interface is taken down and the GLOBAL interface should be used? The default route to COGENT is removed from the route table without being replaced. I take it that's okay. What should I be looking for to take its place operationally? Is it some set of pf reply-to and route-to rules? What should these look like? If they aren't appearing as planned, what's the right way to enter them manually to test and make sure the concept is at least right? What component of the automated system is in charge of maintaining those entries correctly?
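One way to look for such rules is to filter the loaded ruleset (as printed by "pfctl -sr") for the route-to and reply-to keywords. The sketch below assumes the single-gateway form "route-to (interface gateway)"; the sample lines are hand-written illustrations using placeholder addresses, not output captured from this system, and the rules OPNsense actually generates may look different.

```python
import re

# Matches e.g. "reply-to (igb2 192.0.2.1)"; gateway pools in braces are not handled.
POLICY_RE = re.compile(r"\b(route-to|reply-to) \((\S+) (\S+)\)")


def policy_routes(ruleset_text):
    """Return (keyword, interface, gateway) for each rule line that uses policy routing."""
    found = []
    for line in ruleset_text.splitlines():
        match = POLICY_RE.search(line)
        if match:
            found.append(match.groups())
    return found


# Hypothetical sample in the style of `pfctl -sr` output.
SAMPLE_RULES = """\
pass in quick on igb2 reply-to (igb2 192.0.2.1) inet proto tcp from any to 192.0.2.10 port = https flags S/SA keep state
pass out route-to (igb2 192.0.2.1) inet from 192.0.2.2 to ! 192.0.2.0/27 flags S/SA keep state
pass in quick on igb0 inet from 10.0.0.0/24 to any flags S/SA keep state
"""

for keyword, interface, gateway in policy_routes(SAMPLE_RULES):
    print(f"{keyword}: {interface} -> {gateway}")
```

Running the same filter before and after taking a gateway down would show whether any policy-routing entries change at all, which is a narrower question than whether failover works.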

Thanks again,
Whit
Title: Re: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 09:14:02 pm
Perhaps I'm not doing the right test by disabling the COGENT (aka WAN1) interface. So instead I'll use that Mark Gateway Down checkbox.

Interesting. Now OPNsense can ping me out here at 207.136.236.70. But all is not good.

OPNsense can't ping 8.8.4.4 any more -- the test IP for GLOBAL (aka WAN2) -- although it can ping 8.8.8.8 now. Odd. It sees the route to 8.8.4.4 as through the right gateway. It can ping the gateway. But it can't ping 8.8.4.4 any more.

More importantly, OPNsense has dropped two CARP IPs from WAN2 (both that I had assigned). The Lobby CARP widget, which used to show "> MASTER" next to both of those IPs, now shows the IPs but without "> MASTER", and ifconfig confirms they're not assigned. I also see that while COGENT shows status as "Offline", GLOBAL shows status as "Pending" on the dashboard -- probably because pinging it isn't working.
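Checking ifconfig by eye gets tedious across repeated tests, so the comparison could be scripted. This is a rough sketch that assumes the FreeBSD ifconfig layout, where CARP addresses carry a trailing "vhid N" and each vhid has a "carp: STATE vhid N ..." line. The addresses and vhid numbers are placeholders, not the real ones from this setup.

```python
import re


def carp_summary(ifconfig_text):
    """Map vhid -> {'state': ..., 'addresses': [...]} from ifconfig output."""
    summary = {}
    for line in ifconfig_text.splitlines():
        addr = re.match(r"\s+inet (\S+) .*\bvhid (\d+)", line)
        state = re.match(r"\s+carp: (\w+) vhid (\d+)", line)
        if addr:
            entry = summary.setdefault(int(addr.group(2)), {"state": None, "addresses": []})
            entry["addresses"].append(addr.group(1))
        elif state:
            entry = summary.setdefault(int(state.group(2)), {"state": None, "addresses": []})
            entry["state"] = state.group(1)
    return summary


def missing_vips(expected, ifconfig_text):
    """Return the expected CARP addresses that are not configured on the interface."""
    present = {a for e in carp_summary(ifconfig_text).values() for a in e["addresses"]}
    return sorted(set(expected) - present)


# Placeholder output: fixed IP plus two CARP IPs (before), fixed IP only (after).
BEFORE = (
    "igb2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tinet 192.0.2.2 netmask 0xffffffe0 broadcast 192.0.2.31\n"
    "\tinet 192.0.2.10 netmask 0xffffffe0 broadcast 192.0.2.31 vhid 1\n"
    "\tinet 192.0.2.11 netmask 0xffffffe0 broadcast 192.0.2.31 vhid 2\n"
    "\tcarp: MASTER vhid 1 advbase 1 advskew 0\n"
    "\tcarp: MASTER vhid 2 advbase 1 advskew 0\n"
)
AFTER = (
    "igb2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tinet 192.0.2.2 netmask 0xffffffe0 broadcast 192.0.2.31\n"
)
EXPECTED = ["192.0.2.10", "192.0.2.11"]

print(missing_vips(EXPECTED, BEFORE))
print(missing_vips(EXPECTED, AFTER))
```

Feeding it the real "ifconfig igb2" output after each step would give a simple pass/fail record of when the CARP IPs disappear.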

Now I go to System > Gateways > All and uncheck the Mark Gateway as Down box. Results:

The gateway for COGENT now shows as "Online." The gateway for GLOBAL shows as "Offline," and the CARP IPs for GLOBAL are not back to MASTER status yet, with ifconfig confirming they're not back up. Meanwhile GLOBAL's gateway page does not have it marked as either disabled or set to pretend to be down. I can reach GLOBAL's fixed IP from outside, but not of course the CARP IPs it has dropped for some reason.

System > Gateways > Group Status also shows the GLOBAL gateway as "Offline," which is not accurate since I can reach the fixed IP on that gateway from outside. Interfaces > [Global] does show it as enabled, too.
Title: Re: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 09:41:13 pm
I go to Interfaces > [Global], check Disable Interface, apply it; then I check Enable Interface, apply it.

Now the Dashboard widget shows it as "Online", but I still can't ping 8.8.4.4 from the CLI, and the CARP IPs are still not working.

So I go to Firewall > Virtual IPs > Settings, and open the edit screen for one of the CARP IPs, make no changes, but save and apply it.

That CARP IP comes back to life, but not the second one on that interface. So I do the same sequence with it. Now it's back. (Obviously in a real-life operation, once the full set of IPs on each public interface is under CARP control, such manual steps would be out of the question, what with the propensity of failover to occur in the wee hours during ISP maintenance, when we're largely asleep, yet vital automated data transmissions are ongoing using our systems.)

Now I can also ping 8.8.4.4!

By all appearances there's a deep set of bugs here which are not entirely between the chair and the keyboard. Are other people using MultiWAN and CARP with the current OPNsense in production, with reliable configurations? If so, I'd sure welcome any hints about what you've done to achieve that.

Best,
Whit
Title: Re: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 10:31:28 pm
Replication:

System > Gateways > All > CogentGW > Mark Gateway as Down, Apply

Now on Dashboard that Gateway shows as "Pending" and the other (untouched) gateway shows as "Offline" again. (Note, above I believe I reported this in reverse -- the first gateway listed there is in fact the second gateway, as an accident of the initial configuration order. Hmm. Could that be part of what has OPNsense confused, the historical order, rather than any present settings?)

Pinging 8.8.4.4 (the test IP assigned the Global gateway) fails as before. But the CARP IPs for the second gateway are still there, at least for now, and are shown as "> MASTER" still in the CARP widget.

System > Gateways > All > CogentGW > Uncheck Gateway as Down, Apply.

Still can't ping 8.8.4.4. Lobby > Dashboard showing GlobalGW as "Online" but other gateway, which I've done nothing to change, as "Offline." CARP IPs on other gateway remain up.

System > Gateways > All > GlobalGW, Edit, No changes, Save, Apply.

Lobby > Dashboard still shows that as offline (CARP IPs still up for it).

System > Gateways > All > GlobalGW, Edit, Mark Gateway as Down, Apply, Edit, Uncheck that, Apply.

Still can't ping 8.8.4.4. Lobby > Dashboard still shows this gateway as down.

Interfaces > Global, Uncheck "Enable Interface", Apply, then reverse to check "Enable Interface", Apply.

Pinging 8.8.4.4 works. Lobby > Dashboard still shows status of this gateway as "Offline" in Gateways widget. Now I notice that the CARP IPs are up, but the gateway's fixed IP is no longer assigned to it! It has been in earlier rounds of experimentation.

Double check Interfaces > [Global]. Hmm, the Apply Changes button is showing. Did that not take? Press it again. Now the fixed IP is back up on that interface. It's lost the two CARP IPs. Still can't ping 8.8.4.4. The CARP IPs no longer have "> Master" beside them in the CARP Widget.

Firewall: Virtual IPs: Settings > Edit a CARP IP, no changes, save it. It's back on the interface and "> Master" is back in the CARP widget. Still can't ping 8.8.4.4. Can ping the IP of that gateway though. And "route get" shows that gateway as the route to 8.8.4.4. But of course the Gateway widget still shows it as "Offline", since it can't be pinged. "traceroute 8.8.4.4" goes nowhere.

I'm really wondering if the accident of having assigned the secondary WAN interface first, in the order of setting the system up, is breaking some hidden assumption in the logic of the underlying code. It's not obvious how to test that short of a total reinstallation.

As it is, stuff that's sort-of-related, but in some of these cases shouldn't be at all dependent on each other, is getting tripped up in repeatable, although varying, ways.

Whit

Title: Re: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 10:57:19 pm
The traceroute/ping blockage is quite specific to 8.8.4.4:

Quote
root@OPNsense:~ # traceroute -i igb2 8.8.4.4
traceroute to 8.8.4.4 (8.8.4.4), 64 hops max, 40 byte packets
 1  *

- Stuck (igb2 = Global interface)

Quote
root@OPNsense:~ # traceroute -i igb2 8.8.4.3
traceroute to 8.8.4.3 (8.8.4.3), 64 hops max, 40 byte packets
 1  207.239.xxx.yy (207.239.xxx.yy)  2.610 ms  2.898 ms  2.404 ms
 2  207.239.84.185 (207.239.84.185)  4.164 ms  4.279 ms  3.935 ms
 3  216.156.16.212.ptr.us.xo.net (216.156.16.212)  5.160 ms  5.160 ms  4.959 ms
 4  216.156.16.133.ptr.us.xo.net (216.156.16.133)  5.189 ms  5.188 ms  4.478 ms
...

- So 8.8.4.3 works fine through that same interface, and of course on the other interface it just works to 8.8.4.4:

Quote
root@OPNsense:~ # traceroute -i igb1 8.8.4.4
traceroute to 8.8.4.4 (8.8.4.4), 64 hops max, 40 byte packets
 1  g<obfuscated>1.atlas.cogentco.com (38.105.xxx.yy)  0.954 ms  0.687 ms  0.575 ms
 2  154.24.38.82 (154.24.38.82)  0.967 ms  0.766 ms  0.758 ms
 3  154.24.38.85 (154.24.38.85)  0.800 ms  0.547 ms  0.671 ms
 4  be2897.ccr42.jfk02.atlas.cogentco.com (154.54.84.213)  0.908 ms  1.170 ms  0.962 ms
 5  be3295.ccr31.jfk05.atlas.cogentco.com (154.54.80.2)  0.804 ms  0.944 ms  0.970 ms
...

WTF?

Title: Re: Gateway monitoring reality check
Post by: whitwye on August 21, 2017, 11:41:51 pm
I'm really wondering if the accident of having assigned the secondary WAN interface first, in the order of setting the system up, is breaking some hidden assumption in the logic of the underlying code. It's not obvious how to test that short of a total reinstallation.

Specifically the gateways are in this order

Quote
    <gateway_item>
      <interface>opt1</interface>
      <gateway>207.239.xxx.yy</gateway>
      <name>GW_WAN</name>
      <weight>1</weight>
      <ipprotocol>inet</ipprotocol>
      <interval>1</interval>
      <descr>GlobalGW</descr>
      <avg_delay_samples/>
      <avg_loss_samples/>
      <avg_loss_delay_samples/>
      <monitor>8.8.4.4</monitor>
    </gateway_item>
    <gateway_item>
      <interface>wan</interface>
      <gateway>38.105.xxx.yy</gateway>
      <name>GW_WAN_2</name>
      <weight>5</weight>
      <ipprotocol>inet</ipprotocol>
      <interval>1</interval>
      <descr>CogentGW</descr>
      <avg_delay_samples/>
      <avg_loss_samples/>
      <avg_loss_delay_samples/>
      <monitor>8.8.8.8</monitor>
      <defaultgw>1</defaultgw>
    </gateway_item>

While the interfaces are in this order

Quote
    <wan>
      <if>igb1</if>
      <descr>Cogent</descr>
      <enable>1</enable>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>38.105.xxx.yy</ipaddr>
      <subnet>27</subnet>
      <gateway>GW_WAN_2</gateway>
    </wan>
    <opt1>
      <if>igb2</if>
      <descr>Global</descr>
      <enable>1</enable>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>207.239.xxx.yy</ipaddr>
      <subnet>27</subnet>
      <gateway>GW_WAN</gateway>
    </opt1>

Now, I noticed that the Web interface won't let "GW_WAN" and "GW_WAN_2" be renamed, which seems strange, unless there's something internal which very much depends on their being named just as the configuration defaulted them to. Most often GW_WAN will be on the primary gateway, and GW_WAN_2 on the secondary. So my wild guess here is that somewhere backend code is depending on that being the case, thus the extreme confusion the system is showing in my case in handling interface and gateway failover. (This isn't entirely a wild guess. I've seen pfSense failover get confused by a minor order change in the XML file like this.)
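The ordering difference can at least be stated precisely by reading both sections of the configuration and comparing them. This is a minimal sketch over a trimmed copy of the stanzas above. The enclosing <opnsense>, <interfaces> and <gateways> elements are assumed rather than taken from the excerpt, and the check says nothing about whether the backend code actually depends on order.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the wrapper element names are assumptions.
SAMPLE = """
<opnsense>
  <interfaces>
    <wan><if>igb1</if><descr>Cogent</descr><gateway>GW_WAN_2</gateway></wan>
    <opt1><if>igb2</if><descr>Global</descr><gateway>GW_WAN</gateway></opt1>
  </interfaces>
  <gateways>
    <gateway_item>
      <interface>opt1</interface><name>GW_WAN</name><monitor>8.8.4.4</monitor>
    </gateway_item>
    <gateway_item>
      <interface>wan</interface><name>GW_WAN_2</name><monitor>8.8.8.8</monitor>
      <defaultgw>1</defaultgw>
    </gateway_item>
  </gateways>
</opnsense>
"""


def gateway_report(xml_text):
    """Compare interface order with gateway order and check the name cross-references."""
    root = ET.fromstring(xml_text.strip())
    iface_gw = {node.tag: node.findtext("gateway") for node in root.find("interfaces")}
    items = list(root.iter("gateway_item"))
    gw_iface = {item.findtext("name"): item.findtext("interface") for item in items}
    return {
        "interface_order": list(iface_gw),
        "gateway_order": list(gw_iface.values()),
        "default": [i.findtext("name") for i in items if i.findtext("defaultgw") == "1"],
        # True when every interface points at a gateway that points back at it.
        "consistent": all(gw_iface.get(gw) == iface for iface, gw in iface_gw.items()),
    }


report = gateway_report(SAMPLE)
print(report)
```

On this sample the name cross-references are consistent and only the stanza order differs, so any code that resolves gateways by name should be unaffected; only code that walks the stanzas positionally could be confused.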

Whit
Title: Re: Gateway monitoring reality check
Post by: whitwye on August 22, 2017, 05:16:12 am
Later. Checked 8.8.4.4 pings -- still not working. In the Dashboard the Global interface is listed as disabled, although it's still live on its 3 IPs. Disabled the Global interface through the checkbox on the Interfaces: [Global] page. Re-enabled the Global interface. Now pings to 8.8.4.4 work again. Still don't know what was causing those to fail.

But its two CARP IPs are gone again too, listed on the CARP Dashboard widget without "> MASTER" beside them again, and missing from the ifconfig listing.

So there are definite patterns to what it gets wrong. Things don't fail exactly the same way each time around, but there is a large degree of repetition. I'm sure it's a small minority of users who have complex MultiWAN environments. I'd like to hear from others who've found it stable and dependable for this use, about any tricks they may have found necessary to make it so.

Thanks,
Whit
Title: Re: Gateway monitoring reality check
Post by: whitwye on August 22, 2017, 03:36:39 pm
I spend hours thoroughly documenting that there are deep bugs in the current OPNsense and get no response at all? This kind of thorough testing is invaluable to any responsible software house. At least there should be thanks here, and hopefully some recognition of where the bugs might be and suggestions on how to possibly work around them.

Having a second interface get into trouble when a first interface goes down is not good. Having VIPs get lost when an interface they're on is disabled and then re-enabled is not good. These are reproducible bugs. If others with similar configuration requirements aren't seeing them, the only thing that should be at all unusual in my setup is the reversed order of the gateway stanzas in the XML file. If that is the cause, it is due to the poor programming standards in the original pfSense project, since someone there thought it was good practice to depend on the order of stanzas in at least one other context. I can straighten out the stanzas and test this hypothesis, but only if those at the core of this project care.

OPNsense has a brilliant interface. The design is far improved over pfSense. But if there's no concern about it being seriously broken underneath, it won't deserve much of a future. Please tell me I'm wrong to suspect this, and that those at the core of this project care about quality.

Whit
Title: Re: Gateway monitoring reality check
Post by: phoenix on August 22, 2017, 05:52:12 pm
Quote
OPNsense has a brilliant interface. The design is far improved over pfSense. But if there's no concern about it being seriously broken underneath, it won't deserve much of a future. Please tell me I'm wrong to suspect this, and that those at the core of this project care about quality.

I think you're wrong about this: they wouldn't be involved in this project if they didn't care about quality software, but I'm sure the developers can speak for themselves. You might also consider that they, like most people on these forums, have to earn a living and have a private life, and that they work on this project when they have the time, just like every other open source product. Rewriting pfSense is quite a horrendous task, especially with all the improvements they've made. Don't forget that there's no such thing as bug-free software (I'm just pulling your leg here). :)

I can't comment on your problem as I'm not a developer nor very knowledgeable about FreeBSD, but your detailed investigation is quite impressive. I would also say that you might consider putting a bug report on GitHub and referencing this thread in it; that's usually where bug reports go. :)
Title: Re: Gateway monitoring reality check
Post by: mimugmail on August 22, 2017, 06:23:49 pm
And don't forget it's vacation time. I'm back in September and can then set up a test environment with your config.
Title: Re: Gateway monitoring reality check
Post by: franco on August 23, 2017, 05:44:16 pm
There may be a misconception between "I spend hours thoroughly documenting that there are deep bugs in the current OPNsense and get no response at all?" and actionable intel to go forward to verify and subsequently fix bugs.

I personally triage and help point to available docs or solutions. But I have to distribute that time evenly among all users and still be able to plan for version 18.1, update ports and infrastructure for 17.7.x, help contributors review their code, and fix bugs that are confirmed and pinned down to exact lines of code.

Also, I have a day job that does not pay me to do the previous. It really doesn't. And I don't want it to, to avoid such sticky situations.

The bottom line is: non-actionable intel will make anyone need to go back and spend the same amount of hours to verify your findings, if at all software-only.

Gateway monitoring and multi-WAN have their rough edges. If a solution cannot be found in a reasonable time frame, looking at other, mostly more costly projects / products is a sensible way out. Please don't be sad that a solution does not come from OPNsense in all cases. We just can't cover everything all the time.

If you still want to help, find a commitment from somebody who will check out your findings like mimugmail offered. It's all about engaging people and getting them excited for your findings, as funny as it may seem.


Cheers,
Franco