[Resolved] VLAN connectivity issue

Started by ajeffco, December 06, 2021, 10:58:23 PM

December 06, 2021, 10:58:23 PM Last Edit: December 15, 2021, 09:57:24 PM by ajeffco
Hello All,

I recently posted about a VLAN connectivity issue with one of the VLANs I have set up on OPNsense.  I attached an overly complicated diagram, which probably didn't help to show the issue, and a lengthy write-up probably didn't help either  :).  Unfortunately that didn't get any answers, so I've attached a hopefully simpler diagram and, hopefully, a clearer question.

Out of the 5 VLANs I have configured on my OPNsense server, 4 work 100% flawlessly.  The fifth works for up to 6 minutes after OPNsense is rebooted, then suddenly loses WAN access.  VLAN <-> VLAN connectivity still works after the 6 minutes.

Can anyone please give me some advice on where to start troubleshooting this?  I've checked the firewall rules, which match those of another, working VLAN.  I don't know where else to even start looking.  I'd attach logs, but I'm not even sure where the problem is.

ping:
ajeffco@relay:~$ ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=54 time=17.2 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=54 time=17.5 ms
.
.
.
64 bytes from 1.1.1.1: icmp_seq=27 ttl=54 time=17.3 ms
64 bytes from 1.1.1.1: icmp_seq=28 ttl=54 time=17.9 ms
64 bytes from 1.1.1.1: icmp_seq=29 ttl=54 time=18.7 ms  <--- Hangs here, have to Control-C
^C
--- 1.1.1.1 ping statistics ---
43 packets transmitted, 29 received, 32.5581% packet loss, time 42361ms
rtt min/avg/max/mdev = 15.276/17.575/20.984/1.583 ms


tcpdump:
17:07:26.516623 IP one.one.one.one > relay: ICMP echo reply, id 4, seq 26, length 64
17:07:27.503010 IP relay > one.one.one.one: ICMP echo request, id 4, seq 27, length 64
17:07:27.520290 IP one.one.one.one > relay: ICMP echo reply, id 4, seq 27, length 64
17:07:28.504571 IP relay > one.one.one.one: ICMP echo request, id 4, seq 28, length 64
17:07:28.522405 IP one.one.one.one > relay: ICMP echo reply, id 4, seq 28, length 64
17:07:29.505772 IP relay > one.one.one.one: ICMP echo request, id 4, seq 29, length 64
17:07:29.524413 IP one.one.one.one > relay: ICMP echo reply, id 4, seq 29, length 64
17:07:30.507778 IP relay > one.one.one.one: ICMP echo request, id 4, seq 30, length 64
17:07:31.533815 IP relay > one.one.one.one: ICMP echo request, id 4, seq 31, length 64
17:07:32.557693 IP relay > one.one.one.one: ICMP echo request, id 4, seq 32, length 64
17:07:33.581693 IP relay > one.one.one.one: ICMP echo request, id 4, seq 33, length 64
17:07:34.605654 IP relay > one.one.one.one: ICMP echo request, id 4, seq 34, length 64
17:07:35.629762 IP relay > one.one.one.one: ICMP echo request, id 4, seq 35, length 64
17:07:36.653730 IP relay > one.one.one.one: ICMP echo request, id 4, seq 36, length 64
17:07:37.677737 IP relay > one.one.one.one: ICMP echo request, id 4, seq 37, length 64
17:07:38.701782 IP relay > one.one.one.one: ICMP echo request, id 4, seq 38, length 64
17:07:39.725736 IP relay > one.one.one.one: ICMP echo request, id 4, seq 39, length 64


Thanks for any help.

Al

EDIT: Added ping and tcpdump info.
Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

It's not DNS (for once) since you're pinging an IP
Firewall rules are extremely unlikely since they would have blocked all traffic
Same for routing - pretty binary

Are you running Suricata?

In general, try to pare your OPNsense down to the bare minimum, step by step, until you find a feature that points to the root cause.  Bummer that each test takes six minutes, though.
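If you want to catch pf in the act rather than wait it out, one low-effort check is to watch the pf log interface on the firewall while the client is pinging (just a sketch; substitute the client's IP, and note that only rules with logging enabled show up here):

# watch live pf log entries for the problem client
tcpdump -n -e -ttt -i pflog0 host <client-ip>

If nothing shows up there when the pings start hanging, pf isn't blocking them.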

Bart...

Hello Bart,

Thanks for taking the time to respond.  I am not running Suricata.  These are pretty bare: the only things happening on this setup are HA (CARP), routing, firewall, VLANs, and SNMP.  No DNS, no DHCP, etc.  Some of those, such as SNMP, wouldn't be related in any way I can think of.

And that's the odd thing: I'm pinging an IP, so no DNS, and I can't see how a firewall rule would pass traffic and then suddenly stop passing it 0-300 seconds later.  It has me stumped and scratching my head, which is why I reached out for help.

I've looked at every log I could find, turned on logging on every firewall rule related to that VLAN, etc., and I just don't see anything out of the ordinary in the logs, and no symptoms other than it works and then suddenly stops, and only for that VLAN.  The other VLANs continue working with no problem.  Are there buffers of any kind that could be looked at?  I've even thought about removing any config related to that VLAN ID and trying a different VLAN ID.
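On the buffers question, the only generic things I know to look at on the FreeBSD side are the mbuf and interface counters (just a sketch of what I plan to run on the firewall; I'm not sure they'd show anything VLAN-specific):

# network buffer (mbuf) usage, including any denied or delayed requests
netstat -m
# per-interface packet, error, and drop counters
netstat -i -d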

I'm going to beg a favor from a network engineer at work tomorrow who owes me favors for helping him with Linux, and see if he can help find it.  After I get past him telling me I should be using Palo Alto ;), I'm really hoping he can help me track it down.

Thanks again and have a great day!

Al
Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

When in doubt use tcpdump.  ;)
Deciso DEC750
People who think they know everything are a great annoyance to those of us who do. (Isaac Asimov)

Quote from: pmhausen on December 07, 2021, 09:20:36 AM
When in doubt use tcpdump.  ;)

100% agree.  On the OPNsense server, I do not see any ICMP traffic for the client that works and then fails, even though it was still pinging 1.1.1.1 while I was running the tcpdump on the OPNsense server.  I do, however, see IoT VLAN and Trusted VLAN ICMP traffic in the OPNsense tcpdump.

Here it is from the client with the issue (it's working atm):
root@relay:~# tcpdump host 1.1.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens18, link-type EN10MB (Ethernet), capture size 262144 bytes
03:37:53.245740 IP relay > one.one.one.one: ICMP echo request, id 19, seq 288, length 64
03:37:53.261023 IP one.one.one.one > relay: ICMP echo reply, id 19, seq 288, length 64
03:37:54.247317 IP relay > one.one.one.one: ICMP echo request, id 19, seq 289, length 64
03:37:54.263127 IP one.one.one.one > relay: ICMP echo reply, id 19, seq 289, length 64
03:37:55.248441 IP relay > one.one.one.one: ICMP echo request, id 19, seq 290, length 64
03:37:55.266224 IP one.one.one.one > relay: ICMP echo reply, id 19, seq 290, length 64


Here's from OPNsense:
root@inner-fw1:~ # tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vtnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:34:58.643527 IP ringpro-0e > ec2-54-90-209-52.compute-1.amazonaws.com: ICMP echo request, id 15475, seq 7754, length 64
03:34:58.688944 IP ec2-54-90-209-52.compute-1.amazonaws.com > ringpro-0e: ICMP echo reply, id 15475, seq 7754, length 64
03:35:02.875423 IP helm > gateway: ICMP echo request, id 32019, seq 0, length 64
03:35:02.875455 IP gateway > helm: ICMP echo reply, id 32019, seq 0, length 64
03:35:03.875450 IP helm > gateway: ICMP echo request, id 32019, seq 1, length 64
03:35:03.875473 IP gateway > helm: ICMP echo reply, id 32019, seq 1, length 64
03:35:04.875723 IP helm > gateway: ICMP echo request, id 32019, seq 2, length 64
03:35:04.875752 IP gateway > helm: ICMP echo reply, id 32019, seq 2, length 64
03:35:06.154937 IP sysmon > inner-fw1: ICMP echo request, id 6228, seq 0, length 64
03:35:06.154969 IP inner-fw1 > sysmon: ICMP echo reply, id 6228, seq 0, length 64
03:35:07.155075 IP sysmon > inner-fw1: ICMP echo request, id 6228, seq 1, length 64
03:35:07.155098 IP inner-fw1 > sysmon: ICMP echo reply, id 6228, seq 1, length 64
03:35:08.156146 IP sysmon > inner-fw1: ICMP echo request, id 6228, seq 2, length 64


The sysmon (Zabbix) server is on the trusted VLAN (50), and the helm device is on the IoT VLAN (40).

At no point did the OPNsense tcpdump show any packets for VLAN 20.

I then ran a tcpdump on the Proxmox host these are running on as well, and it looks pretty much the same as the OPNsense tcpdump in that it sees everything but the VLAN 20 device, which had stopped pinging by then.

I rebooted the OPNsense firewall and the results are the same.  The VLAN 20 device is pinging away to 1.1.1.1, and the OPNsense and Proxmox tcpdumps do not show any packets for it in the output of that "tcpdump icmp", but the VLAN 20 device's tcpdump shows the same output as in the first post.

It's weird to me that it's working but not showing up on the firewall or the hypervisor when the others are.
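For the next failure, here's the kind of capture I'm planning to run on the firewall to see whether the tagged frames are reaching it at all (just a sketch; I'm assuming the VLANs ride on vtnet0 as in the capture above, and the VLAN 20 child interface name would need to be substituted):

# capture on the parent interface with the link-level header shown,
# so the 802.1Q tag is visible if the frames arrive tagged
tcpdump -e -n -i vtnet0 vlan 20 and icmp
# the same capture directly on the VLAN 20 child interface
tcpdump -n -i <vlan20-interface> icmp

If the frames show up on the parent but not the child interface (or the other way around), that should narrow down where they're getting lost.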

The IP "path" (not sure the right term) looks like this.

VLAN 20 Device (172.16.20.15) -> fw1 VIP for VLAN 20 (172.16.20.1) -> fw1 outer VIP for VLAN 20 (192.168.0.20) -> spectrum firewall (192.168.0.1).

The VLAN setup is duplicated for VLANs 30, 40 and 50, all of which are working flawlessly.
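Something else I can check when it hangs is whether the firewall still has a state for the ping (just a sketch; 172.16.20.15 is the VLAN 20 device):

# list pf states involving the VLAN 20 device
pfctl -ss | grep 172.16.20.15
# global pf state-table counters (searches, inserts, removals, memory)
pfctl -si

If the state is still there but the replies stop, the problem is probably past the firewall; if the state disappears, something on the firewall is killing it.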



It's currently still working after the reboot.  Here's a traceroute to show the path out, in case it sheds any light; it does at least confirm that it's currently working and that NAT is working too:
root@relay:~# traceroute 1.1.1.1
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
1  172.16.20.1 (172.16.20.1)  0.513 ms  0.498 ms  0.555 ms
2  192.168.0.1 (192.168.0.1)  1.281 ms  1.281 ms  1.340 ms
3  072-031-136-201.res.spectrum.com (72.31.136.201)  8.805 ms  14.860 ms  14.851 ms
4  ten-0-5-0-5.orld57-ser2.bhn.net (72.31.223.126)  14.830 ms  14.821 ms  14.816 ms
5  bundle-ether36.orld71-car2.bhn.net (72.31.194.110)  18.114 ms  18.050 ms  18.100 ms
6  072-031-067-218.res.spectrum.com (72.31.67.218)  17.243 ms  15.304 ms 072-031-067-216.res.spectrum.com (72.31.67.216)  15.287 ms
7  10.bu-ether15.orldfljo00w-bcr00.tbone.rr.com (66.109.6.98)  19.511 ms  19.488 ms 0.xe-2-2-1.pr0.atl20.tbone.rr.com (66.109.9.138)  21.056 ms
8  66.109.5.131 (66.109.5.131)  19.449 ms  18.623 ms  17.288 ms
9  66.109.1.243 (66.109.1.243)  20.341 ms 108.162.211.48 (108.162.211.48)  20.323 ms  24.839 ms
10  172.70.80.2 (172.70.80.2)  23.478 ms  21.717 ms 172.70.80.4 (172.70.80.4)  21.698 ms
11  one.one.one.one (1.1.1.1)  19.996 ms  19.159 ms  18.262 ms

Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

Hi,

just one question:
is the OPT1 interface disabled?  If it is enabled and has an IP, this might lead to issues, because FreeBSD seems not to like tagged and untagged traffic on the same interface.
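A quick way to check on the firewall shell (just a sketch, assuming the VLANs sit on vtnet0):

# list the VLAN child interfaces (FreeBSD puts them in the "vlan" group)
ifconfig -g vlan
# then look at the parent itself: if it also carries an inet address,
# tagged and untagged traffic are sharing the same NIC
ifconfig vtnet0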

KH

Quote from: KHE on December 07, 2021, 10:01:47 AM
Hi,

just one question:
is the OPT1 interface disabled?  If it is enabled and has an IP, this might lead to issues, because FreeBSD seems not to like tagged and untagged traffic on the same interface.

KH

It's not, and now I realize the diagram I drew for this post doesn't match what I have.  I've drawn and redrawn that diagram a bunch of times to make it simple enough to show this issue, and forgot where I was when I made it :).

OPT1 = pfsync interface to the other firewall for CARP.
WAN = Enabled/Up, and is where all the VLANs are attached. 

I'll make a new diagram later, gotta crash soon, it's 4:30 AM :)

Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

Hello,

I have updated the diagram to show the correct OPNsense connectivity regarding which interface the VLANs are attached to.  I added the actual internet connectivity.  Since the problem VLAN can't get past the inner firewall when the issue starts, I don't know how relevant the outer firewall is.  I tried to keep it as simple as possible to show the logical config, so some things, like both firewalls being on the same Proxmox host, are not shown.

I also updated the native/access terms to tagged/untagged based on feedback from one of the network engineers where I work.

Thanks again for the efforts to help me resolve this issue.  Have a great day!

Al
Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

December 13, 2021, 08:00:56 AM #8 Last Edit: December 13, 2021, 08:03:38 AM by ajeffco
I spent some time this weekend trying to isolate this issue.  To test, I loaded VyOS.  It's loaded on the same Proxmox host as OPNsense, using the same Proxmox interfaces, the same configuration (without CARP), the same VLAN config, etc.

The test machine that loses connectivity in short order on the OPNsense VLAN setup works without issue in the VyOS setup.  The only thing that changed on the test machine is the gateway IP.  I left it up overnight last night and have been randomly poking at it throughout the day, and it has yet to fail.  This morning I got called in to work to help with the log4j issue.  Before I connected, I switched my work desktop to the same VLAN as the test machine, and it has worked all day without issue.

To me this isolates the issue to OPNsense.  I'm still looking for any advice to help find and squash whatever is causing this.

Thanks,

Al
Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)

An update...

I removed all references to the problem VLAN from the network switches, Proxmox hosts, and firewalls that needed to carry that traffic.  I then added and configured a new VLAN with a different VLAN ID.  The issue is now gone.  I don't know why that VLAN ID was flaking out like that, but the issue is now resolved.

Thanks to those who did reply!

Have a great day.

Al
Dual Virtual OPNsense on PVE with HA via CARP
Node 1: OPNsense 24.7.3_1 - Protectli Vault FW6E (i7)
Node 2: OPNsense 24.7.3_1 - Qotom-Q555G6-S05 (i5)