VLAN mismatch on reply packet

Started by Chura, June 26, 2023, 10:32:17 AM

I'm having weird issues with my VLANs. Surprisingly, rebooting OPNsense fixes them, but after a few hours or days they come back.

I have a router-on-a-stick setup: one interface that carries both tagged and untagged packets.
mlxen1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	description: LAN (opt2)
	options=9c00a8<VLAN_MTU,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE,NETMAP>
	ether 7c:55:30:90:ce:e0
	inet 192.168.192.99 netmask 0xffffff00 broadcast 192.168.192.255
	status: active
vlan098: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	description: WifiGuests (opt3)
	options=180000<LINKSTATE,NETMAP>
	ether 7c:55:30:90:ce:e0
	inet 192.168.195.99 netmask 0xffffff00 broadcast 192.168.195.255
	groups: vlan
	vlan: 98 vlanproto: 802.1q vlanpcp: 0 parent interface: mlxen1
	status: active
vlan099: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	description: IoT (opt4)
	options=180000<LINKSTATE,NETMAP>
	ether 7c:55:30:90:ce:e0
	inet 192.168.199.99 netmask 0xffffff00 broadcast 192.168.199.255
	groups: vlan
	vlan: 99 vlanproto: 802.1q vlanpcp: 0 parent interface: mlxen1
	status: active


When I try to ping 192.168.199.x, OPNsense sends the ARP request on the right VLAN; the client receives it and answers.

Request as seen by OPNsense, on the correct VLAN:
11:25:29.540675 7c:55:30:90:ce:e0 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 99, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.199.101 tell 192.168.199.99, length 28

On the client (behind an access port, so it doesn't know anything about VLANs):
11:25:29.556193 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.199.101 tell 192.168.199.99, length 42
11:25:29.556199 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.199.101 is-at c2:f0:c0:81:73:22, length 28


But OPNsense sees the reply untagged!
11:29:39.182949 c2:f0:c0:81:73:22 > 7c:55:30:90:ce:e0, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 192.168.199.101 is-at c2:f0:c0:81:73:22, length 42
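To make the mismatch concrete: the only difference between the two frames is whether a 4-byte 802.1Q tag (TPID 0x8100 followed by a 16-bit TCI whose low 12 bits are the VLAN ID) sits right after the source MAC. Below is a minimal Python sketch, assuming plain Ethernet II frames. `parse_vlan` and `build_frame` are hypothetical helpers, and the frames are synthetic ones built from the MAC addresses in the captures above, not the real captured bytes.

```python
import struct
from typing import Optional, Tuple

TPID_8021Q = 0x8100     # Tag Protocol Identifier that marks an 802.1Q tag
ETHERTYPE_ARP = 0x0806


def parse_vlan(frame: bytes) -> Tuple[Optional[int], int]:
    """Return (VLAN ID or None if untagged, inner EtherType) for an Ethernet II frame."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    (tpid,) = struct.unpack_from("!H", frame, 12)  # bytes 12-13 follow dst+src MAC
    if tpid != TPID_8021Q:
        return None, tpid                          # untagged: this field is the EtherType
    if len(frame) < 18:
        raise ValueError("frame too short for an 802.1Q header")
    tci, ethertype = struct.unpack_from("!HH", frame, 14)
    return tci & 0x0FFF, ethertype                 # low 12 bits of the TCI are the VLAN ID


def build_frame(dst: str, src: str, ethertype: int,
                vlan: Optional[int] = None, payload: bytes = b"") -> bytes:
    """Build a synthetic Ethernet frame, optionally with an 802.1Q tag (PCP 0, DEI 0)."""
    header = bytes.fromhex(dst.replace(":", "")) + bytes.fromhex(src.replace(":", ""))
    if vlan is not None:
        header += struct.pack("!HH", TPID_8021Q, vlan & 0x0FFF)
    return header + struct.pack("!H", ethertype) + payload


# Synthetic frames built from the MAC addresses in the captures above
request = build_frame("ff:ff:ff:ff:ff:ff", "7c:55:30:90:ce:e0", ETHERTYPE_ARP, vlan=99)
reply = build_frame("7c:55:30:90:ce:e0", "c2:f0:c0:81:73:22", ETHERTYPE_ARP)

print(parse_vlan(request))  # (99, 2054)
print(parse_vlan(reply))    # (None, 2054)
```

The request carries VLAN 99 and lands on vlan099. The reply has no tag, so the kernel hands it to the parent mlxen1 instead.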

Is there a fix for this? Is it a config error?
I have OPNsense hosted on Proxmox; however, mlxenX is a passed-through PCIe NIC.
My other options would be:
1. Disable passthrough and create a vmbr for each VLAN. Might that affect a high-speed internet link in the future? (2.5 Gb planned)
2. I'm not sure whether the cause is the combination of tagged and untagged traffic, but maybe I could start tagging the untagged traffic as well (the switch is configured with native VLAN 1). Using another port for untagged traffic is not possible for me.

June 26, 2023, 10:41:26 AM #1 Last Edit: June 26, 2023, 10:49:07 AM by Seimus
So the setup is OPN (VLAN GW) > Switch > HOST.

OPN is able to handle both TAGGED and UNTAGGED frames; I tested this during a migration I did a few months ago.

Do you have your Parent interface on OPN assigned?
The question here is, how is your Switch configured?
Is your Switch managed?
Is your Switch capable of VLANs?
Do you have your UPLINK from Switch towards OPN, on switch configured as TRUNK + native VLAN?
Do you have ports towards the specific HOST in the specific "access" VLANs?


Also have a look > https://github.com/opnsense/core/pull/4918#issuecomment-819265246

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Split your thoughts to keep things in their 'rightful' places  8)

VLANs live on layer 2, while ping (and other IP traffic) happens on layer 3.

You'll need to have an IP subnet per VLAN and router(s) that can make sure packets go from source to destination and (often overlooked) back again.

Quote from: Seimus on June 26, 2023, 10:41:26 AM

Do you have your Parent interface on OPN assigned?
[Chura:] My parent is the untagged mlxen1 interface.

The question here is, how is your Switch configured?
[Chura:] 1 Untagged, 98 and 99 Tagged

Is your Switch managed?
[Chura:] Yes

Is your Switch capable of VLANs?
[Chura:] Yes, and it works for a few hours or days until suddenly it doesn't.

Do you have your UPLINK from Switch towards OPN, on switch configured as TRUNK + native VLAN?
[Chura:] Yes, exactly. Trunk (GE6   Trunk   1UP, 98T, 99T)

Do you have ports towards the specific HOST in the specific "access" VLANs?
[Chura:] Everything else on the switch is an access port in VLAN 1, 98 or 99.
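The answers above describe standard 802.1Q behaviour: untagged frames arriving on a port are classified into its native VLAN (PVID), tagged frames keep their VLAN if the port is a member, and on egress the native VLAN is sent untagged while the others are tagged. Below is a simplified Python model of that. It is a generic sketch, not this switch vendor's implementation. The `Port` class and the access port name `iot-access` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Port:
    name: str
    pvid: int                                      # native VLAN: untagged frames belong here
    tagged: Set[int] = field(default_factory=set)  # VLANs carried with an 802.1Q tag

    def ingress(self, tag: Optional[int]) -> Optional[int]:
        """VLAN a frame is classified into on arrival (None = dropped)."""
        if tag is None:
            return self.pvid
        return tag if tag in self.tagged else None

    def egress_tag(self, vlan: int) -> Optional[int]:
        """Tag written on the wire when a VLAN leaves this port (None = sent untagged)."""
        if vlan != self.pvid and vlan not in self.tagged:
            raise ValueError(f"VLAN {vlan} is not allowed on {self.name}")
        return None if vlan == self.pvid else vlan


trunk = Port("GE6", pvid=1, tagged={98, 99})   # uplink to OPNsense: 1 untagged, 98T, 99T
iot = Port("iot-access", pvid=99)              # hypothetical access port of the IoT client

# The VLAN-unaware client sends its ARP reply untagged into its access port...
vlan = iot.ingress(None)
# ...and the switch should forward it to OPNsense with tag 99, i.e. onto vlan099
print(vlan, trunk.egress_tag(vlan))            # 99 99
```

Under this model, only VLAN 1 should ever leave GE6 untagged. An untagged reply from 192.168.199.101 therefore suggests the tag is being lost somewhere other than the port configuration.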

Quote from: bartjsmit on June 26, 2023, 10:49:29 AM
Split your thoughts to keep things in their 'rightful' places  8)

VLAN is layer-2 while ping (and other IP traffic) happens on layer-3

You'll need to have an IP subnet per VLAN and router(s) that can make sure packets go from source to destination and (often overlooked) back again.

Not sure what you mean, friend; this is exactly how my setup is.
mlxen1 which is the parent has IP (192.168.192.99/24)
vlan098 which is tagged 98 on mlxen1 has IP (192.168.195.99/24)
vlan099 which is tagged 99 on mlxen1 has IP (192.168.199.99/24)

ICMP was just an example: the ARP packets are not steered to the right VLAN, so nothing on the higher layers will work either.
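With one subnet per interface, every ARP reply should arrive on the interface that owns the sender's subnet. Below is a hedged Python sketch of that check using the standard `ipaddress` module. The `INTERFACES` table copies the addresses from this thread. `expected_interface` and `arrived_on_wrong_interface` are hypothetical helpers, not OPNsense or FreeBSD functions.

```python
import ipaddress
from typing import Optional

# Interface addresses copied from the ifconfig output in this thread
INTERFACES = {
    "mlxen1": ipaddress.ip_interface("192.168.192.99/24"),
    "vlan098": ipaddress.ip_interface("192.168.195.99/24"),
    "vlan099": ipaddress.ip_interface("192.168.199.99/24"),
}


def expected_interface(ip: str) -> Optional[str]:
    """Name of the interface whose subnet contains ip, or None if no subnet matches."""
    addr = ipaddress.ip_address(ip)
    for name, iface in INTERFACES.items():
        if addr in iface.network:
            return name
    return None


def arrived_on_wrong_interface(sender_ip: str, seen_on: str) -> bool:
    """True if a reply from sender_ip showed up on an interface that does not own its subnet."""
    expected = expected_interface(sender_ip)
    return expected is not None and expected != seen_on


print(expected_interface("192.168.199.101"))                    # vlan099
print(arrived_on_wrong_interface("192.168.199.101", "mlxen1"))  # True
```

The reply captured above (192.168.199.101 seen untagged on mlxen1) fails this check. That is why the layer 3 configuration alone cannot fix it.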

Quote from: Seimus on June 26, 2023, 10:41:26 AM

Also have a look > https://github.com/opnsense/core/pull/4918#issuecomment-819265246


I've seen that; this is why I thought of option 2 above.
I've now tried it: I created vlan1 on OPNsense and assigned it to LAN instead of mlxen1,
left mlxen1 unassigned,
and configured my switch to tag everything towards OPNsense:
GE6 General 1T, 98T, 99T, 4095P
Now it's even weirder: I still see packets arriving on mlxen1, so they are blocked by policy. (By the way, right after a reboot there is a split second where packets work fine. I guess that's after networking is set up but before the firewall rules are applied; I hope the same isn't allowed from WAN to LAN.)

After a few days of debugging, I've determined that SR-IOV is not working properly. I'm not sure whether it's because of the CPU/motherboard/card combination, but it doesn't work as expected.
As part of the debugging I created another Proxmox instance. When I tried to assign it a different virtual interface, the system crashed, and the same happened when I tried it with Windows or Debian guests.

I tried removing the PCI passthrough from the machine and using Linux bridges. The behaviour was super odd, with nothing coming back from the physical interface.

I then created Linux bridges and assigned them to VirtIO interfaces on the guests; that didn't work out either.
So I disabled SR-IOV in the kernel completely and rebooted, and everything worked great!
Yes, I have no passthrough now, but at least I have a working configuration. I hope it won't limit me when 2.5 Gb arrives at my doorstep.