Strange routing dropouts for restricted vLan

Started by part_time_nerd, September 03, 2017, 08:00:08 PM

September 03, 2017, 08:00:08 PM Last Edit: September 03, 2017, 08:13:15 PM by part_time_nerd
Hi all,

A few months ago I started changing my home network for the (hopefully) better, beginning with the installation of OPNsense as the core router. My vanilla "one router, one subnet, one SSID" home network was replaced by a set of 5 subnets: one for management (vLan 1), one private (vLan 2), one for the kids (vLan 3), one for guests (vLan 4) and one for all the IoT crap I like but don't trust (vLan 5). Since this is a private side project, it had to fit into my sparse spare time, and since I ran into problems setting up a proper external WLAN AP solution for vLans 2-5, the router ran for months while we continued to use only vLan 1.

Now the AP is there and I have moved the 5-vLan setup to production. That went quite well so far but now I get very strange routing problems on vLan 5: after some time running as expected, the subnet becomes inaccessible (from management subnet: "no route to host"). When I reboot opnsense, the routing turns back to normal and the subnet becomes available again. As far as I am aware, there are no scheduled firewall rules or anything of that sort in my configuration. Moreover, I could not find any suspicious OPNsense log entries around the time when the subnet becomes unavailable. Unfortunately my BSD knowledge is rather limited, so I didn't look very far under the hood.

Facts that might be worth mentioning:
* vLan5 is trunked on the same LAN port as all other vLans, none of which exhibits this problem.
* It is, however, the only one of the vLans that is not routed to the WAN by default; instead, individual IPs get a tailored set of access rules for whatever they need to reach.
* Allowed outgoing connections also die when the routing dies.
* Listing the routes in the OPNsense GUI "after the fact" still lists correct routes for the affected subnet.

Here is my version information:

OPNsense 17.7-amd64
FreeBSD 11.0-RELEASE-p11
OpenSSL 1.0.2l 25 May 2017

I'd appreciate any suggestions on how to tackle the problem and find its root cause.

Quote from: part_time_nerd on September 03, 2017, 08:00:08 PM
but now I get very strange routing problems on vLan 5: after some time running as expected, the subnet becomes inaccessible (from management subnet: "no route to host"). When I reboot opnsense, the routing turns back to normal and the subnet becomes available again.

This sounds quite odd. Just a few suggestions:


  • use tcpdump on the OPNsense CLI to debug traffic on the VLAN 5 interface when it becomes inaccessible
  • test option "Disable reply-to on WAN rules" in Firewall -> Settings -> Advanced (be aware that this may break network connectivity)
  • make sure "Shared forwarding" is disabled in Firewall -> Settings -> Advanced (assuming you don't use CaptivePortal and Traffic Shaper)
  • disable hardware offload features in Interfaces -> Settings


Regards
- Frank


Hello fraenki,

many thanks for your kind reply and suggestions!

I began following your tips and, for starters, did the following:

* firmware upgrade to 17.7.1
* I checked the option "Disable reply-to on WAN rules" as suggested.
* Shared forwarding is not enabled

Now I am waiting to see if the routing failures happen again. It has been stable for 24 hours now, but I had more than 24 hours of stable routing before, so the last word has not been spoken here.

I'd prefer not to disable hardware offloading, but I'm willing to give it a try if everything else fails.

Unfortunately the problem appeared again. I have now disabled hardware VLAN filtering in the interface settings dialogue. Waiting again.

Hi again,

I am now at the point where I have disabled all HW offloading options, with no success. I have, however, implemented a logging mechanism that allowed me to pin down more precisely when the problem occurs.

Using that I found a common pattern in the System log, which looks like this:




Sep 4 21:53:46    opnsense: /usr/local/etc/rc.newwanip: Interface '' is disabled or empty, nothing to do.
Sep 4 21:53:46    opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'ovpns1'
Sep 4 21:53:46    configd.py: [66637140-dccd-4a86-a069-63ba711697f4] rc.newwanip starting ovpns1
Sep 4 21:53:45    kernel: ovpns1: link state changed to UP
Sep 4 21:53:45    configd.py: [2c8dd3bf-c9d2-4cf3-abe9-dac306ef241a] Reloading filter
Sep 4 21:53:44    configd.py: [5962f6e9-ca80-4081-8a90-d05894fb95fc] Reloading filter
Sep 4 21:53:44    kernel: ovpns1: link state changed to DOWN
Sep 4 21:53:44    opnsense: /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface WAN.
Sep 4 21:53:44    opnsense: /usr/local/etc/rc.newwanip: ROUTING: setting IPv4 default route to 109.XX.XY.1
Sep 4 21:53:44    opnsense: /usr/local/etc/rc.newwanip: On (IP address: 109.XX.YY.47) (interface: WAN[wan]) (real interface: vtnet3).
Sep 4 21:53:43    opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'vtnet3'




Sep 9 18:00:40    opnsense: /usr/local/etc/rc.newwanip: Interface '' is disabled or empty, nothing to do.
Sep 9 18:00:40    opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'ovpns1'
Sep 9 18:00:39    configd.py: [deda5b84-e388-495b-884d-c433c864d1aa] rc.newwanip starting ovpns1
Sep 9 18:00:39    kernel: ovpns1: link state changed to UP
Sep 9 18:00:39    configd.py: [bbe682dd-c1c8-4854-9137-8e35d0ea96bf] Reloading filter
Sep 9 18:00:38    configd.py: [1a22f2b2-de2e-4350-8b7b-0dcb9670981a] Reloading filter
Sep 9 18:00:38    kernel: ovpns1: link state changed to DOWN
Sep 9 18:00:38    opnsense: /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface WAN.
Sep 9 18:00:38    opnsense: /usr/local/etc/rc.newwanip: ROUTING: setting IPv4 default route to 109.XX.XY.1
Sep 9 18:00:37    opnsense: /usr/local/etc/rc.newwanip: On (IP address: 109.XX.XX.47) (interface: WAN[wan]) (real interface: vtnet3).
Sep 9 18:00:37    opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'vtnet3'




So it appears that when this happens (which I presume is a DHCP renewal on the WAN port), my restricted vLan 5 loses all routing to anywhere.
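
To check whether every outage really coincides with such a renewal, the renewal events can be pulled out of the system log and compared with the outage timestamps. A minimal sketch, assuming the log text is piped in on stdin; the function name renewals is made up for illustration, and the matched message is the one from the excerpts above:

```shell
#!/bin/sh
# Hypothetical helper: print "<month> <day> <time> <interface>" for every
# rc.newwanip renewal message found on stdin.
renewals() {
    # -F: match the message literally, not as a regular expression
    grep -F 'rc.newwanip: IP renewal is starting on' |
        awk '{ print $1, $2, $3, $NF }' |   # timestamp fields + last field (interface)
        tr -d "'"                           # strip the quotes around the interface name
}
```

On this OPNsense version the system log is, as far as I know, a circular log that is read with clog, so something like `clog /var/log/system.log | renewals` run from /bin/sh should list the events. Treat the log path and the clog call as assumptions and adjust them to your installation.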

Here are my routing tables:
(I took the liberty of masking parts of the WAN addresses with a few X, Y and Z characters.)

root@router:~ #   netstat -rW
Routing tables

Internet:
Destination        Gateway            Flags       Use    Mtu      Netif Expire
default            ip-1X9-91-XX-1.hsiZZ.provider.com UGS  1769411   1500     vtnet3
10.10.42.0/24      10.10.42.2         UGS           0   1500     ovpns1
10.10.42.1         link#9             UHS           0  16384        lo0
10.10.42.2         link#9             UH            0   1500     ovpns1
ip1X.9X.6X.8X.in-addr.arpa 52:54:00:03:6d:fd UHS      549   1500     vtnet3
8X.2XX.1X9.4       52:54:00:03:6d:fd  UHS           0   1500     vtnet3
109.91.52.0/22     link#4             U             0   1500     vtnet3
ip-1X9-91-XX-1.hsiZZ.provider.com 52:54:00:03:6d:fd UHS           0   1500     vtnet3
ip-1X9-91-XY-YY.hsiZZ.provider.com link#4 UHS           0  16384        lo0
localhost          link#6             UH        12164  16384        lo0
192.168.10.0/24    link#13            U             0   1500 vtnet1_vlan4
192.168.10.1       link#13            UHS           0  16384        lo0
192.168.23.0/24    link#11            U        483426   1500 vtnet1_vlan2
192.168.23.1       link#11            UHS           0  16384        lo0
192.168.42.0/24    link#10            U       4827227   1500 vtnet1_vlan1
router             link#10            UHS           0  16384        lo0
192.168.42.88      link#3             UHS           0  16384        lo0
192.168.42.88/32   link#3             U             0   1500     vtnet2
192.168.100.0/24   link#14            U         40610   1500 vtnet1_vlan5
192.168.100.1      link#14            UHS           0  16384        lo0
192.168.123.0/24   link#12            U          1944   1500 vtnet1_vlan3
192.168.123.1      link#12            UHS           0  16384        lo0
192.168.200.0/24   link#1             U         95885   1500     vtnet0
192.168.200.10     link#1             UHS           0  16384        lo0

Internet6:
Destination        Gateway            Flags       Use    Mtu    Netif Expire
::1                link#6             UH            0  16384      lo0
fe80::%vtnet0/64   link#1             U             0   1500   vtnet0
fe80::5054:ff:feXX:4765%vtnet0 link#1 UHS           0  16384      lo0
fe80::%vtnet1/64   link#2             U             0   1500   vtnet1
fe80::5054:ff:feXX:d898%vtnet1 link#2 UHS           0  16384      lo0
fe80::%vtnet2/64   link#3             U             0   1500   vtnet2
fe80::5054:ff:feXX:448b%vtnet2 link#3 UHS           0  16384      lo0
fe80::%vtnet3/64   link#4             U             0   1500   vtnet3
fe80::5054:ff:feXX:6dfd%vtnet3 link#4 UHS           0  16384      lo0
fe80::%lo0/64      link#6             U             0  16384      lo0
fe80::1%lo0        link#6             UHS           0  16384      lo0
fe80::a85e:b6XX:93YY:f37b%ovpns1 link#9 UHS         0  16384      lo0
fe80::%vtnet1_vlan1/64 link#10        U           258   1500 vtnet1_vlan1
fe80::5054:ff:feXX:d898%vtnet1_vlan1 link#10 UHS        0  16384      lo0
fe80::%vtnet1_vlan2/64 link#11        U             0   1500 vtnet1_vlan2
fe80::5054:ff:feXX:d898%vtnet1_vlan2 link#11 UHS        0  16384      lo0
fe80::%vtnet1_vlan3/64 link#12        U             0   1500 vtnet1_vlan3
fe80::5054:ff:feXX:d898%vtnet1_vlan3 link#12 UHS        0  16384      lo0
fe80::%vtnet1_vlan4/64 link#13        U             0   1500 vtnet1_vlan4
fe80::5054:ff:feXX:d898%vtnet1_vlan4 link#13 UHS        0  16384      lo0
fe80::%vtnet1_vlan5/64 link#14        U             0   1500 vtnet1_vlan5
fe80::5054:ff:feXX:d898%vtnet1_vlan5 link#14 UHS        0  16384      lo0


As you can see, the routing appears to be identical for all interfaces. But when I try to connect to something in the .100 subnet, I get this:

root@router:~ #   ping 192.168.100.15
PING 192.168.100.15 (192.168.100.15): 56 data bytes
ping: sendto: Invalid argument
ping: sendto: Invalid argument
^C
--- 192.168.100.15 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@router:~ #   ssh 192.168.100.15
ssh: connect to host 192.168.100.15 port 22: Invalid argument


I am not sure what "Invalid argument" means in this context.

I am also not entirely sure how to use tcpdump sensibly, as suggested in fraenki's post, to find out more about the issue. So I simply tried to capture the above ssh connection attempt using the command:

tcpdump -i vtnet1_vlan5 -vv

But this does not capture anything related to the commands above, only some unanswered ARP requests, IPv6 router advertisements and some DNS UDP packets sent by clients in the .100 subnet to the router, also without replies.
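
A narrower capture filter would make such a trace easier to read. The following is only a sketch: capture_filter is a made-up helper name, and the choice of ARP, ICMP and SSH traffic simply mirrors the ping and ssh tests above.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter expression that keeps ARP traffic
# plus ICMP and SSH (TCP port 22) packets to or from a single host.
capture_filter() {
    printf 'arp or (host %s and (icmp or tcp port 22))\n' "$1"
}
```

From /bin/sh it could be used as `tcpdump -eni vtnet1_vlan5 "$(capture_filter 192.168.100.15)"`, where -e prints the link-level (MAC) header and -n disables name resolution. If the ping already fails locally with "Invalid argument", it is quite possible that nothing but the ARP traffic shows up, which would match the observation above.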

Any more ideas?

All right... I might have found the culprit.

I had enabled the "Enable Static ARP entries" option in the DHCP section for this interface. Once I unticked this option, everything went back to normal.
This post pointed me in the right direction:
https://moh10ly.wordpress.com/2015/02/14/ping-on-pfsense-gives-invalid-argument/

Of course, all devices on that subnet DO have DHCP static mappings registered in the section below that option. But it looks like the static ARP entries somehow get lost during the reload procedure shown in the logs above.
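
One way to test that suspicion is to look at the ARP table right after such a reload and see whether the mapped hosts still have permanent entries. A minimal sketch, assuming the output of `arp -an` is piped in on stdin and that static entries are marked with the word "permanent" as on stock FreeBSD; has_permanent_arp is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the `arp -an` output on stdin
# contains a permanent (static) entry for the IP address given as $1.
has_permanent_arp() {
    # -F matches the address literally; the parentheses and trailing space
    # keep 192.168.100.1 from also matching 192.168.100.15.
    grep -F "($1) " | grep -q 'permanent'
}
```

For example, `arp -an | has_permanent_arp 192.168.100.15 && echo present || echo missing`, run from /bin/sh once before and once after a WAN renewal, would show whether the entry disappears. Whether that is really the mechanism behind the outage is an assumption, not something confirmed in this thread.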
Could that be a bug or am I getting something completely wrong here?


... and the bug got closed without any response. Duh.