Messages - whitwye

#1
I spent hours thoroughly documenting deep bugs in the current OPNsense and got no response at all? This kind of thorough testing is invaluable to any responsible software house. At the least there should be thanks here, and hopefully some recognition of where the bugs might be and suggestions on how to work around them.

Having a second interface get into trouble when a first interface goes down is not good. Having VIPs get lost when an interface they're on is disabled and then re-enabled is not good. These are reproducible bugs. If others with similar configuration requirements aren't seeing them, the only thing at all unusual in my setup should be the reversed order of the gateway stanzas in the XML file. If that's the cause, it traces back to poor programming standards inherited from the original pfSense project, where someone thought it was good practice to depend on the order of stanzas in at least one other context. I can straighten out the stanzas and test this hypothesis; but only if those at the core of this project care.

OPNsense has a brilliant interface. The design is far improved over pfSense. But if there's no concern about it being seriously broken underneath, it won't deserve much of a future. Please tell me I'm wrong to suspect this, and that those at the core of this project care about quality.

Whit
#2
Later. Checked 8.8.4.4 pings -- still not working. In the Dashboard the Global interface is listed as disabled, although it's still live on its 3 IPs. Disabled the Global interface through the checkbox on the Interfaces: [Global] page. Re-enabled the Global interface. Now pings to 8.8.4.4 work again. Still don't know what was causing them to fail.

But its two CARP IPs are gone again too, listed on the CARP Dashboard widget without "> MASTER" beside them again, and missing from the ifconfig listing.

So there are definite patterns to what it gets wrong. Things don't fail exactly the same way each time around, but there is a large degree of repetition. I'm sure it's a small minority of users who have complex MultiWAN environments. I'd like to hear from others who've found it stable and dependable for this use, about any tricks they may have found necessary to make it so.

Thanks,
Whit
#3
Quote from: whitwye on August 21, 2017, 10:31:28 PM
I'm really wondering if the accident of having assigned the secondary WAN interface first, in the order of setting the system up, is breaking some hidden assumption in the logic of the underlying code. It's not obvious how to test that short of a total reinstallation.

Specifically the gateways are in this order

Quote
    <gateway_item>
      <interface>opt1</interface>
      <gateway>207.239.xxx.yy</gateway>
      <name>GW_WAN</name>
      <weight>1</weight>
      <ipprotocol>inet</ipprotocol>
      <interval>1</interval>
      <descr>GlobalGW</descr>
      <avg_delay_samples/>
      <avg_loss_samples/>
      <avg_loss_delay_samples/>
      <monitor>8.8.4.4</monitor>
    </gateway_item>
    <gateway_item>
      <interface>wan</interface>
      <gateway>38.105.xxx.yy</gateway>
      <name>GW_WAN_2</name>
      <weight>5</weight>
      <ipprotocol>inet</ipprotocol>
      <interval>1</interval>
      <descr>CogentGW</descr>
      <avg_delay_samples/>
      <avg_loss_samples/>
      <avg_loss_delay_samples/>
      <monitor>8.8.8.8</monitor>
      <defaultgw>1</defaultgw>
    </gateway_item>

While the interfaces are in this order

Quote
    <wan>
      <if>igb1</if>
      <descr>Cogent</descr>
      <enable>1</enable>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>38.105.xxx.yy</ipaddr>
      <subnet>27</subnet>
      <gateway>GW_WAN_2</gateway>
    </wan>
    <opt1>
      <if>igb2</if>
      <descr>Global</descr>
      <enable>1</enable>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>207.239.xxx.yy</ipaddr>
      <subnet>27</subnet>
      <gateway>GW_WAN</gateway>
    </opt1>

Now, I noticed that the Web interface won't let "GW_WAN" and "GW_WAN_2" be renamed, which seems strange, unless there's something internal which very much depends on their being named just as the configuration defaulted them. Most often GW_WAN will be on the primary gateway, and GW_WAN_2 on the secondary. So my wild guess here is that somewhere the backend code depends on that being the case, hence the extreme confusion the system is showing in my case in handling interface and gateway failover. (This isn't entirely a wild guess. I've seen pfSense failover get confused by a minor order change in the XML file like this.)
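To test that stanza-order hypothesis without a full reinstall, the order of <gateway_item> entries can be checked offline with a short script. This is a hypothetical diagnostic, not part of OPNsense; the sample XML below just mirrors the fragment quoted above, and on a real system one would parse /conf/config.xml instead:

```python
# Hypothetical diagnostic: list <gateway_item> entries in the order they
# appear in the configuration, to confirm which gateway comes first.
import xml.etree.ElementTree as ET

# Trimmed sample mirroring the config fragment in this post; for a live
# check, use ET.parse('/conf/config.xml').getroot() instead.
sample = """<opnsense>
  <gateways>
    <gateway_item>
      <interface>opt1</interface>
      <name>GW_WAN</name>
    </gateway_item>
    <gateway_item>
      <interface>wan</interface>
      <name>GW_WAN_2</name>
      <defaultgw>1</defaultgw>
    </gateway_item>
  </gateways>
</opnsense>"""

root = ET.fromstring(sample)
order = [(item.findtext('name'),
          item.findtext('interface'),
          item.find('defaultgw') is not None)   # True if marked default gw
         for item in root.iter('gateway_item')]
print(order)
```

Here the non-default gateway appears first, just as in my config; comparing this ordering between a working and a broken install would confirm or rule out the order dependency.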

Whit
#4
The traceroute/ping blockage is quite specific to 8.8.4.4:

Quote
root@OPNsense:~ # traceroute -i igb2 8.8.4.4
traceroute to 8.8.4.4 (8.8.4.4), 64 hops max, 40 byte packets
1  *

- Stuck (igb2 = Global interface)

Quote
root@OPNsense:~ # traceroute -i igb2 8.8.4.3
traceroute to 8.8.4.3 (8.8.4.3), 64 hops max, 40 byte packets
1  207.239.xxx.yy (207.239.xxx.yy)  2.610 ms  2.898 ms  2.404 ms
2  207.239.84.185 (207.239.84.185)  4.164 ms  4.279 ms  3.935 ms
3  216.156.16.212.ptr.us.xo.net (216.156.16.212)  5.160 ms  5.160 ms  4.959 ms
4  216.156.16.133.ptr.us.xo.net (216.156.16.133)  5.189 ms  5.188 ms  4.478 ms
...

- So 8.8.4.3 works fine through that same interface, and of course traceroute to 8.8.4.4 just works on the other interface:

Quote
root@OPNsense:~ # traceroute -i igb1 8.8.4.4
traceroute to 8.8.4.4 (8.8.4.4), 64 hops max, 40 byte packets
1  g<obfuscated>1.atlas.cogentco.com (38.105.xxx.yy)  0.954 ms  0.687 ms  0.575 ms
2  154.24.38.82 (154.24.38.82)  0.967 ms  0.766 ms  0.758 ms
3  154.24.38.85 (154.24.38.85)  0.800 ms  0.547 ms  0.671 ms
4  be2897.ccr42.jfk02.atlas.cogentco.com (154.54.84.213)  0.908 ms  1.170 ms  0.962 ms
5  be3295.ccr31.jfk05.atlas.cogentco.com (154.54.80.2)  0.804 ms  0.944 ms  0.970 ms
...

WTF?

#5
Replication:

System > Gateways > All > CogentGW > Mark Gateway as Down, Apply

Now on Dashboard that Gateway shows as "Pending" and the other (untouched) gateway shows as "Offline" again. (Note, above I believe I reported this in reverse -- the first gateway listed there is in fact the second gateway, as an accident of the initial configuration order. Hmm. Could that be part of what has OPNsense confused, the historical order, rather than any present settings?)

Pinging 8.8.4.4 (the test IP assigned the Global gateway) fails as before. But the CARP IPs for the second gateway are still there, at least for now, and are shown as "> MASTER" still in the CARP widget.

System > Gateways > All > CogentGW > Uncheck Gateway as Down, Apply.

Still can't ping 8.8.4.4. Lobby > Dashboard showing GlobalGW as "Online" but other gateway, which I've done nothing to change, as "Offline." CARP IPs on other gateway remain up.

System > Gateways > All > GlobalGW, Edit, No changes, Save, Apply.

Lobby > Dashboard still shows that as offline (CARP IPs still up for it).

System > Gateways > All > GlobalGW, Edit, Mark Gateway as Down, Apply, Edit, Uncheck that, Apply.

Still can't ping 8.8.4.4. Lobby > Dashboard still shows this gateway as down.

Interfaces > Global, Uncheck "Enable Interface", Apply, then reverse to check "Enable Interface", Apply.

Pinging 8.8.4.4 works. Lobby > Dashboard still shows the status of this gateway as "Offline" in the Gateways widget. Now I notice that the CARP IPs are up, but the gateway's fixed IP is no longer assigned to it! It had been in earlier rounds of experimentation.

Double check Interfaces > [Global]. Hmm, the Apply Changes button is showing. Did that not take? Press it again. Now the fixed IP is back up on that interface. But it's lost the two CARP IPs. Still can't ping 8.8.4.4. The CARP IPs no longer have "> MASTER" beside them in the CARP widget.

Firewall: Virtual IPs: Settings > Edit a CARP IP, no changes, save it. It's back on the interface and "> Master" is back in the CARP widget. Still can't ping 8.8.4.4. Can ping the IP of that gateway though. And "route get" shows that gateway as the route to 8.8.4.4. But of course the Gateway widget still shows it as "Offline", since it can't be pinged. "traceroute 8.8.4.4" goes nowhere.

I'm really wondering if the accident of having assigned the secondary WAN interface first, in the order of setting the system up, is breaking some hidden assumption in the logic of the underlying code. It's not obvious how to test that short of a total reinstallation.

As is it, stuff that's sort-of-related but shouldn't in some of these cases be at all dependent on each other is getting tripped up in repeatable, although varying ways.

Whit
#6
I go to Interfaces > [Global], check Disable Interface, apply it; then I check Enable Interface, apply it.

Now the Dashboard widget shows it as "Online", but I still can't ping 8.8.4.4 from the CLI, and the CARP IPs are still not working.

So I go to Firewall > Virtual IPs > Settings, and open the edit screen for one of the CARP IPs, make no changes, but save and apply it.

That CARP IP comes back to life, but not the second one on that interface. So I do the same sequence with it. Now it's back. (Obviously in real-life operation, once the full set of IPs on each public interface is under CARP control, such manual steps would be out of the question, given failover's propensity to occur in the wee hours during ISP maintenance, when we're largely asleep yet vital automated data transmissions are still running through our systems.)

Now I can also ping 8.8.4.4!

By all appearances there's a deep set of bugs here which are not entirely between the chair and the keyboard. Are other people using MultiWAN and CARP with the current OPNsense in production, with reliable configurations? If so, I'd sure welcome any hints about what you've done to achieve that.

Best,
Whit
#7
Perhaps I'm not doing the right test by disabling the COGENT (aka WAN1) interface. So instead I'll use the Mark Gateway as Down checkbox.

Interesting. Now OPNsense can ping me out here at 207.136.236.70. But all is not good.

OPNsense can't ping 8.8.4.4 any more -- the test IP for GLOBAL (aka WAN2) -- although it can ping 8.8.8.8 now. Odd. It sees the route to 8.8.4.4 as through the right gateway. It can ping the gateway. But it can't ping 8.8.4.4 any more.

More importantly, OPNsense has dropped two CARP IPs from WAN2 (both of the ones I had assigned). The Lobby CARP widget, which used to show "> MASTER" next to both of those IPs, now shows the IPs without "> MASTER", and ifconfig confirms they're not assigned. I also note that while COGENT shows status as "Offline", GLOBAL shows status as "Pending" on the dashboard -- probably because pinging it isn't working.

Now I go to System > Gateways > All  and uncheck the Mark Gateway as Down box. Results:

The gateway for COGENT now shows as "Online." The gateway for GLOBAL shows as "Offline," and the CARP IPs for GLOBAL are not back to MASTER status yet, with ifconfig confirming they're not back up. Meanwhile GLOBAL's gateway page does not have it marked as either disabled or set to pretend to be down. I can reach GLOBAL's fixed IP from outside, but of course not the CARP IPs it has dropped for some reason.

System > Gateways > Group Status also shows the GLOBAL gateway as "Offline," which is not accurate since I can reach the fixed IP on that gateway from outside. Interfaces > [Global] does show it as enabled, too.
#8
A more specific set of questions:

I see that 8.8.8.8 and 8.8.4.4 have static routes set, so that's why they work no matter what.

What should I see set, where, when the COGENT interface is taken down and the GLOBAL interface should be used? The default route to COGENT is removed from the route table without being replaced. I take it that's okay. What should I be looking for to take its place operationally? Is it some set of pf reply-to and route-to rules? What should these look like? If they aren't appearing as planned, what's the right way to enter them manually, to test and make sure the concept is at least right? What component of the automated system is in charge of maintaining those entries correctly?
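For what it's worth, pf-based multi-WAN setups generally express this with rule pairs of roughly the following shape. This is only a sketch -- the gateway address is a placeholder and these are not the exact rules OPNsense writes -- but it shows the form to look for:

```
# Sketch only: placeholder gateway address (198.51.100.1), interface names
# from this thread. Return traffic for connections arriving on GLOBAL (igb2)
# goes back out GLOBAL's gateway regardless of the default route:
pass in on igb2 reply-to (igb2 198.51.100.1) from any to (igb2)
# Outbound traffic sourced from GLOBAL's address is pinned to its gateway:
pass out route-to (igb2 198.51.100.1) from (igb2) to any
```

The generated versions should be visible in /tmp/rules.debug, or on the live ruleset via `pfctl -sr | grep -E 'route-to|reply-to'`.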

Thanks again,
Whit
#9
17.7 Legacy Series / Gateway monitoring reality check
August 21, 2017, 08:23:47 PM
Franco suggests perhaps my gateway monitoring isn't set up correctly. So let's run through that here.

At System > Gateways > All

QuoteName    Interface    Gateway    Monitor IP    Description    
GW_WAN    GLOBAL    207.239.<obfuscated>    8.8.4.4    GlobalGW    
GW_WAN_2 (default)    COGENT    38.105.<obfuscated>    8.8.8.8    CogentGW

So far so good. But indeed double-checking on the "Disable Gateway Monitoring" boxes shows them checked. [Note on interface conventions: 99 times out of 100, checkboxes are used to enable things, not disable them.] Is this it? Uncheck both boxes and apply.

Then I disable the COGENT interface. And ...

Quote
root@OPNsense:/tmp # route get 207.136.236.70
route: route has not been found

But it does know the special route to the check IP:

Quote
root@OPNsense:/tmp # route get 8.8.4.4
   route to: google-public-dns-b.google.com
destination: google-public-dns-b.google.com
    gateway: 207.239.<obfuscated>
        fib: 0
  interface: igb2
      flags: <UP,GATEWAY,HOST,DONE,STATIC>
recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0

And can ping that. But it can't ping 8.8.8.8, or 207.136.236.70. And when I try to connect to one of the public IPs on the GLOBAL interface from outside, I can't.

Now I do nothing but enable the COGENT interface.

Quote
root@OPNsense:/tmp # route get 207.136.236.70
   route to: vt.electrainfo.com
destination: default
       mask: default
    gateway: g<obfuscated>1.atlas.cogentco.com
        fib: 0
  interface: igb1
      flags: <UP,GATEWAY,DONE,STATIC>
recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0

And I can both ping IPs on the GLOBAL interface, and connect to services NATed behind them -- which were unavailable with the COGENT interface down.

So my bad on misreading the checkbox function. Yet getting that right is not enough to make MultiWAN work. Is there documentation on what's supposed to be going on under the covers here, so I can check where that might be going wrong?

Thanks again,
Whit


#10
Franco,

I've read that page dozens of times. Is there no other documentation on this?

Step 3 I followed, and am certain it's right. It's simple enough.

As for step 5, the example there is for the LAN IP of the firewall and DNS service. We're not running DNS on the firewall. We're not concerned (yet) with traffic behind the firewall being sent out correctly either. Is there an (undocumented) requirement that the firewall be used as a DNS server for MultiWAN to work?

Right now, if we take WAN1's interface down, incoming traffic from outside on WAN2 is no longer returned by WAN2, even though it is returned by WAN2 just fine if WAN1 is also up. And traffic generated from the firewall does not find its way out WAN2.

Is there documentation pertinent to those problems, or the theory by which they are supposed to be handled, or steps to diagnose them?

Thanks,
Whit
#11
Given that OPNsense isn't using multiple routing tables (which is how Linux is typically configured for policy routing), but instead is using PF's route-to and reply-to options, where can I learn about what in theory should be happening with those as interface availability changes?

I'm intrigued by Franco's statement that there's an optimal pf rule set that will make the result robust, but puzzled on what that rule set should look like. As I've mentioned in other threads, so far I can't get OPNsense to handle WAN2 correctly when WAN1 is taken down. I'll be thankful for any suggestions of recipes that should work, or pointers to documentation that gives enough background to deduce what such recipes should look like.

Specifically, what rules applied to either the floating or WAN2 interface rule set would enable WAN2 to successfully return or originate traffic, regardless of WAN1's state? I see no evidence that OPNsense will ever originate traffic on WAN2. But it at least returns traffic on WAN2 while WAN1 is up, yet fails to return traffic on WAN2 once WAN1 is down -- an odd and unexpected dependency. Taking WAN1 down removes the default route from the system; but apparently the power of pf route-to and reply-to rules should make success independent of that. Besides, the presence or absence of the WAN1 default shouldn't on the face of it affect WAN2's success in replying on its IPs, since replies work with WAN1 up, and they don't take that default route anyway.
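One caveat worth checking while testing this: pf binds reply-to into each state at creation time, so states established while WAN1 was up keep their old routing even after the rule set changes. A test sequence using standard pfctl commands (the state flush drops all current connections, so test use only):

```
# Show the loaded rules that carry policy routing:
pfctl -sr | grep -E 'route-to|reply-to'
# Flush existing states so new connections pick up the current reply-to
# binding (drops all active connections -- test use only):
pfctl -F states
```

If WAN2 starts replying after a state flush, the stale-state explanation fits; if not, the generated rules themselves are the suspect.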
#12
Another data point: Disabling WAN2 has no effect on WAN1.

So:

Disable WAN1, and WAN2 can no longer respond to outside traffic coming in, nor originate traffic. (There's nothing yet using this system for LAN devices going outwards, so I haven't tested that.)

Disable WAN2, and WAN1 continues working for both outside traffic coming in and originating traffic.

Checking with "netstat -nr": disabling WAN1 removes the default route via WAN1 and does not replace it with a default route via WAN2. WAN2 does have its IPv4 Upstream Gateway set in the configuration, but it is not substituted in this case.
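One way to watch for that substitution from the CLI is to pull the default entry out of the routing table. The table text below is a made-up sample (192.0.2.1 is a documentation address) so the extraction can be shown standalone; on the live box you'd pipe `netstat -rn -f inet` directly into the same awk:

```shell
# Made-up sample of FreeBSD `netstat -rn -f inet` output, standing in
# for the live routing table:
table='Destination        Gateway            Flags     Netif
default            192.0.2.1          UGS       igb1
8.8.8.8            192.0.2.1          UGHS      igb1'

# Print the default gateway, or nothing if the default route is missing:
printf '%s\n' "$table" | awk '$1 == "default" { print $2 }'
```

Run before and after disabling WAN1, this shows immediately whether anything substitutes a WAN2 default route.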
#13
Note: Tried the patch here: https://forum.opnsense.org/index.php?topic=5785.0 (and above in this thread). As I noted there, it does not fix what I'm seeing.
#14
17.7 Legacy Series / Re: [SOLVED] Multi WAN Problem
August 21, 2017, 04:02:03 PM
Hi Franco,

Is the "suboptimal settings" thing documented somewhere, perhaps in the form of suggested optimal settings for a multi-wan setup?

Also, I just tried the patch. It does not fix the problem I'm seeing: that WAN2 works fine just until WAN1 has "Enable Interface" unchecked and applied. Of course, this isn't the same thing as WAN1 failing. But logically a working config for WAN2 shouldn't depend on WAN1 being enabled, should it? I'm open to any advice. I'd really like to get this working. I'm much impressed with the parts of OPNsense that do work.

Thanks,
Whit
#15
Thanks. Tried one more thing: I'd had WAN2 set to take over outward routing as failover. Reconfigured it to work in load-balancing mode instead. Didn't make a difference. Turning off the first WAN interface results in traffic sent to WAN2 IPs going unanswered (it had been working with WAN1 on in either mode), and in the firewall being unable to initiate any outgoing traffic.

I can see interesting changes with

Quote
diff rules.debug rules.debug.old

in /tmp, with route-to and reply-to rules changing as the configuration changes and WAN1 is turned off and on. So the system's not failing to recognize the changes. It's just not responding adequately.
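When eyeballing those diffs, it can help to narrow to just the policy-routing lines. The rule text below is made up (documentation addresses) so the filter can be shown standalone; on the live box the equivalent is `grep -E 'route-to|reply-to' /tmp/rules.debug`:

```shell
# Made-up pf rule lines standing in for /tmp/rules.debug content:
rules='pass out route-to (igb1 192.0.2.1) from (igb1) to any
pass in on igb2 reply-to (igb2 198.51.100.1) from any to (igb2)
pass in on igb0 from any to any'

# Keep only the lines that carry policy routing:
printf '%s\n' "$rules" | grep -E 'route-to|reply-to'
```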

If anyone has additional configuration steps to suggest which might work around this, I'm up for more experimentation.