Help with routing from the opnsense firewall itself?

Started by surfrock66, March 18, 2024, 06:35:17 PM

I've been going through a network transition as part of a learning journey and am having an issue I can't seem to solve.  High level, I have a 10.*.*.* network with a bunch of /16 VLANS and I just put in a new Layer3 switch that acts as the gateway for each VLAN.  The /16 is a legacy thing from a previous configuration, and 10.*.1.254 is the gateway on each VLAN.  The L3 switch has a default 0.0.0.0/0 route pointing to the opnsense box, which is 10.99.1.40.  99 is my networking device vlan.  Opnsense is 24.1.2 running on a standalone box with 4 NICS, one going to my comcast gateway and 2 others are a LACP LAGG to the L3 switch (a trunk carrying VLANS 99 and 6, 6 being my wireguard network which is not currently set up).  I have a DHCP and DNS server on the LAN, on the 2 VLAN, and there is an IP helper on each vlan for it.  Everything there is working fine.
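For readers following along, the addressing scheme described above can be sketched as below. This is only an illustration: the function name vlan_plan is made up, and it assumes every routed VLAN follows the 10.<vlan>.0.0/16 pattern with 10.<vlan>.1.254 as the gateway on the L3 switch.

```python
import ipaddress

def vlan_plan(vlan_id: int):
    """Return (subnet, gateway) for a VLAN, assuming the 10.<vlan>.0.0/16 scheme."""
    subnet = ipaddress.ip_network(f"10.{vlan_id}.0.0/16")
    # Assumption from the post: the L3 switch is 10.<vlan>.1.254 on each VLAN.
    gateway = ipaddress.ip_address(f"10.{vlan_id}.1.254")
    return subnet, gateway

# VLAN IDs mentioned in the thread: 2 (DHCP/DNS), 4 (a client VLAN), 99 (network devices).
for vlan in (2, 4, 99):
    subnet, gateway = vlan_plan(vlan)
    print(f"VLAN {vlan}: subnet {subnet}, gateway {gateway}")
```

Under this scheme the firewall's own address, 10.99.1.40, sits inside the VLAN 99 subnet, and the DNS server at 10.2.2.213 sits inside the VLAN 2 subnet.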

Each of my other vlans has been defined as an alias in opnsense, and I have a NAT rule permitting traffic.  At this time, all clients on the LAN have internet access, and from the WAN my port forward rules are working.  Almost everything appears to be working.

...With the exception of the firewall itself.  It can ping the WAN, but cannot ping anything on the LAN on any VLAN (including the 99 which is the VLAN it's on, or other VLANS).  Actually, I can ssh into the box from a client on the 4 vlan, get in fine, then can't ping back to the client I'm connected from.  One additional thing, when I assign IP addresses I have to set a default gateway for the LAN network and tag it as an upstream gateway...this didn't make sense to me, but if I didn't do that all LAN clients lose internet access.  That LAN_GW gateway is 255 priority but is tagged as upstream, where the WAN_GW is priority 254.  I was thinking it was a static route thing, so I defined static routes for all my VLANS to go through the LAN_GW gateway but that didn't change anything.

I've changed so many things and done so many experiments that I'm a bit lost, and am looking for some guidance of what the gateways, static routes, and rules SHOULD be configured like in a configuration like mine.  If opnsense were doing the L3 routing, I think I'd have to add all vlans to the trunk and make a vlan interface on each, but I don't think that's the case here?

I am very much learning right now, but I have this sense that the firewall is not seeing my LAN networks as LAN, and is routing connections to the WAN interface.  I've tried traceroute to the LAN and it times out.  I've tried "ping -S 10.99.1.40 10.2.2.213" and it times out.  The firewall rules are mostly default, save for some things I had to do to get my chromecasts to point to pihole.

Quote from: surfrock66 on March 18, 2024, 06:35:17 PM
...
I've changed so many things and done so many experiments that I'm a bit lost, and am looking for some guidance of what the gateways, static routes, and rules SHOULD be configured like in a configuration like mine.
...

From your topology description, this shouldn't need more than a simple static route. Because you changed so many things, start over with a clean install; otherwise this relatively simple issue will turn into a ping-pong of "it's not working".

So, clean install, create your topology and after that you don't need more than:

A gateway:

System -> Gateways -> Configuration -> Plus Sign to add -> Name [VLAN_ROUTER], Interface [LACP_TRUNK], Address Family [IPv4], Disable Gateway Monitoring [CHECK]

Leave ALL others DEFAULT (definitely not checking Far and/or Upstream gateway)

Save


A static route:

System -> Routes -> Configuration -> Plus sign to add -> Network Address [10.0.0.0/8] -> Gateway [VLAN_ROUTER] (or whatever you chose to name it in the previous step)

Save


That should really be all...
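Why one static route is enough can be illustrated with a longest-prefix-match lookup, which is how a routing table picks the next hop. The sketch below is not OPNsense code; the table contents are assumptions based on the thread (a /16 on the firewall's LAN interface, 10.99.1.254 as the L3 switch, WAN_GW as the default route), and the lookup function is hypothetical.

```python
import ipaddress

# Assumed routing table on the firewall once the static route is in place.
ROUTES = [
    ("0.0.0.0/0", "WAN_GW"),          # default route towards the internet
    ("10.99.0.0/16", "on-link"),      # directly connected LAN subnet (assumed /16)
    ("10.0.0.0/8", "VLAN_ROUTER"),    # static route via the L3 switch (10.99.1.254)
]

def lookup(dest, routes=ROUTES):
    """Return the next hop of the most specific route that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = [
        (ipaddress.ip_network(net), hop)
        for net, hop in routes
        if addr in ipaddress.ip_network(net)
    ]
    # Longest prefix wins: /16 beats /8, and /8 beats /0.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("10.2.2.213"))   # a host on another VLAN
print(lookup("10.99.1.254"))  # the L3 switch itself
print(lookup("8.8.8.8"))      # an internet host
```

If the 10.0.0.0/8 entry is removed from this table, the only route left that matches 10.2.2.213 is the default route, which would be consistent with the firewall sending LAN-bound traffic towards the WAN.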



Just a couple of questions, for my understanding.

1) You said the interface should be the LACP Trunk; I had made a vlan interface off of that.  Should the LAN be the LACP LAGG (lagg0) or the vlan interface (lagg0_vlan99).  I had put the latter, just confirming.

2) When the CLI asks if it needs a gateway when defining the LAN IP, it says something like "probably yes for WAN, probably no for LAN" but in my case since the LAN requires a gateway, I put yes and put in the 10.99.1.254 address.  If I don't do that, I can't get to the web interface after setting it up.  That seems to check the "upstream gateway" box for that defined gateway, hence my confusion over that setting.

Quote from: surfrock66 on March 18, 2024, 08:52:24 PM
Just a couple of questions, for my understanding.

1) You said the interface should be the LACP Trunk; I had made a vlan interface off of that.  Should the LAN be the LACP LAGG (lagg0) or the vlan interface (lagg0_vlan99).  I had put the latter, just confirming.

Yes, you're right, sorry for the confusion.

I hope you don't have the VLANs that live behind your L3 switch on this trunk? Those should be routed, as you described in your topology.

Quote
2) When the CLI asks if it needs a gateway when defining the LAN IP, it says something like "probably yes for WAN, probably no for LAN" but in my case since the LAN requires a gateway, I put yes and put in the 10.99.1.254 address.  If I don't do that, I can't get to the web interface after setting it up.  That seems to check the "upstream gateway" box for that defined gateway, hence my confusion over that setting.

Don't do that; you need a static route, as explained.

See also this reply from @Maurice in another topic (https://forum.opnsense.org/index.php?topic=39481.msg193615#msg193615)

Ok great, I'll try this when I get home tonight.

On the trunk going to the opnsense box from the L3 switch, I just have 99 (network vlan) and 6 (doing experiments with wireguard).  All the other real vlans (2, 3, 4, 5, 7, 10, etc) are NOT on that trunk, and must go through the L3 device.

I wasn't able to do a full cleanroom test due to family needing internet and me not being able to take a downtime, however I had some time for a quick round of tests and think I have some interesting information.

I have the static route in, so I untagged the "LAN_GW" as an upstream gateway, and tagged "WAN_GW" as an upstream gateway.  No change in the ability for opnsense to ping anything (it can ping WAN, not LAN), however all my LAN clients lost internet.  In this state, from opnsense, I ran a "ping -S 10.99.1.40 10.2.2.213" (that's my DNS server).  This failed, but interestingly enough I was looking at the live logs, and even though the interface is LAN, the source IP was the WAN IP.  I'm very confused; I've confirmed the LAN and WAN interfaces are correct and they have correctly assigned default gateways.  See the attached picture.

This would make sense; is opnsense doing something to switch the LAN and WAN somehow?  I'm blown away how this is the case; that being said, it makes sense that tagging the LAN interface as upstream allows traffic out.

I would say you over-complicated your setup and config a bit, which resulted in a bad design.

Can you draw us a diagram of how your network is connected, specifically OPN <> Switch, and highlight all VLANs and VLAN gateways on it?

https://www.drawio.com/

Regards,
S.

March 19, 2024, 04:16:55 PM #7 Last Edit: March 19, 2024, 04:34:53 PM by surfrock66
I have a diagram here, and can provide more detail as needed.  At the bottom you will see why I have /16; in truth, it's from back when I only had a single subnet, and I made it /16 so I could use the third octet to form DHCP scopes.  That's how the network worked in my head and I knew the IP scheme, so when it came time to add VLANS much later, I just made those the 2nd octet, and that's how we are here today.  Maybe one day I'll re-do that, but it's not in scope right now:

https://nextcloud.surfrock66.com/s/txnZdzxHaiA5t65

I'm trying to get a time the family will tolerate an extended outage; I have backups but these things go however they go.  The one big thing worrying me is, I did have a working wireguard setup before, and I'd love to preserve that (all my key pairs) and my port forward rules (I have a lot of weird rules set up).  I don't see a path to wiping this and starting over that doesn't involve doing all that from scratch, huh.

Quote from: surfrock66 on March 19, 2024, 06:26:53 AM
...
I've confirmed the LAN and WAN interfaces are correct and they have correctly assigned default gateways.  See the attached picture.
...

Did you read the comment (last paragraph) I linked from Maurice? LAN doesn't need a default gateway, it needs a static route...

Yes but when I disabled that, I lost all WAN access from LAN clients, so I re-enabled it; sorry the order of things is a bit sloppy as I have to rush and minimize the downtime for now.

You need to set up static routes instead of the "LAN default gateway" ...

I might be missing something here, but don't I need to have a gateway defined in order to assign a static route to it?  I have the LAN_GW in there, and then in static routes, when I define it, it points to the subnet 10.0.0.0/8 and then I select LAN_GW from the drop down list, right?

So, if I take "upstream gateway" off the LAN_GW, that's fine, my LAN loses connection to the internet.  What then confuses me is if I go to the LAN interface and scroll to the bottom, I have a static IPv4, then below that a drop down to choose an ipv4 default gateway.  It has "LAN_GW" selected, and the only other option is "Auto-Detect."  If I try to choose "Auto-Detect" I get an error saying something like it "conflicts with a static route."  I think I have the order I did that correct, it was from memory last night so I don't have the exact orange/red error popup and I'm going to try again tonight.

It seems I can't disable the LAN_GW without losing the static route, and I can't detach it from the interface?  I'm leaning to starting from scratch but I want to iterate until I can get a downtime in case it's still solve-able.  If I'm misunderstanding something though I am totally open to that.
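The relationship being asked about (a gateway entry is a named next hop that the static route points at, not a default gateway for the interface) can be sanity-checked with a small sketch. The values are taken from the thread, except the /16 mask on the firewall's LAN address, which is an assumption; the helper names are made up for illustration.

```python
import ipaddress

LAN_INTERFACE = ipaddress.ip_interface("10.99.1.40/16")  # firewall LAN address; /16 assumed
NEXT_HOP = ipaddress.ip_address("10.99.1.254")           # L3 switch on VLAN 99
STATIC_ROUTE = ipaddress.ip_network("10.0.0.0/8")        # route that points at NEXT_HOP

def next_hop_is_on_link(next_hop, interface):
    """A next hop is only usable if it sits in the interface's own subnet."""
    return next_hop in interface.network

def route_covers(route, subnet):
    """True if the given subnet falls entirely inside the static route."""
    return ipaddress.ip_network(subnet).subnet_of(route)

print("next hop on-link:", next_hop_is_on_link(NEXT_HOP, LAN_INTERFACE))
for vlan in (2, 4, 99):
    subnet = f"10.{vlan}.0.0/16"
    print(subnet, "covered by static route:", route_covers(STATIC_ROUTE, subnet))
```

So the gateway still has to exist as an object for the route to reference, but nothing in this check requires it to be the interface's default gateway or to be tagged as upstream.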

Quote from: netnut on March 18, 2024, 08:35:31 PM
From your topology description, this shouldn't need more than a simple static route. Because you changed so many things, start over with a clean install; otherwise this relatively simple issue will turn into a ping-pong of "it's not working".

I don't know the SLA you negotiated with your family, but remember: 9.9999% uptime is also "five nines".

Do yourself a favor, start over, create the static route as explained and live happily ever after.

I have the installer to re-install tonight after the family goes to bed, but I had a minute to try a configuration.  I have a static route for 10.0.0.0/8, it has a gateway (because there's no way to create a static route without one unless I'm missing something).  The LAN interface has no gateway attached, it's set to auto-detect (the only 2 options are auto-detect and the LAN_GW).  The only active gateways are LAN_GW, WAN_GW (set as upstream) and WAN_GWv6.  I set that, as in the screenshots, and I get the following.  Can ping WAN from opnsense, can't ping LAN from opnsense.  Exit, can ping LAN from LAN, can't ping WAN from LAN.  I go back and tag the LAN_GW as upstream, internet comes back on.

surfrock66@sr66-opnsense-1:~ $ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=115 time=28.971 ms
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 28.971/28.971/28.971/0.000 ms
surfrock66@sr66-opnsense-1:~ $ ping -S 10.99.1.40 10.2.2.213
PING 10.2.2.213 (10.2.2.213) from 10.99.1.40: 56 data bytes

^C
--- 10.2.2.213 ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss
surfrock66@sr66-opnsense-1:~ $ exit
Connection to 10.99.1.40 closed.
surfrock66@sr66-thelio:~/.scripts$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.


^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2055ms

surfrock66@sr66-thelio:~/.scripts$ ping 10.2.2.213
PING 10.2.2.213 (10.2.2.213) 56(84) bytes of data.
64 bytes from 10.2.2.213: icmp_seq=1 ttl=63 time=0.462 ms
64 bytes from 10.2.2.213: icmp_seq=2 ttl=63 time=0.339 ms
^C
--- 10.2.2.213 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1020ms
rtt min/avg/max/mdev = 0.339/0.400/0.462/0.061 ms


I worry that if I start from scratch, I am gonna end up in the exact same place since I got here and don't understand how.

Ok, good news, I re-imaged and after about an hour of tinkering it's working.  (My wife is a doctor who does tele-medicine from home so it was tricky to get a downtime, even riskier if I couldn't get back to working; usually she works when the kids are in bed and that's usually my window for these kinds of projects).  I still have my old config backup; I have a lot of firewall rules and services to put back in (I had redirects for google trying to reach their dns from chromecasts to my pihole, I had a zabbix client pointing to my zabbix server, I had wireguard working and want to see if I can restore existing key exchanges, it was tied to my LDAP server, etc).  I really want to compare my old backup with a new one when this is done and see if I can't figure out what was broken.  I want to document what I did because I found a bunch of people with similar questions that only had incomplete answers:

1) From the CLI, the WAN interface was DHCP, I set up the lagg between my 2 ports (lagg0), created a vlan 99 interface off of it (lagg0_vlan99) and made that the LAN interface with a static IP and no gateway.
2) I made a gateway for my 10.99.1.254 LAN gateway, had to assign it to the LAN interface when I made it.  It is not tagged as upstream.  One thing I noticed, WAN_GW is priority 255; it was 254 before.  Just a difference I noticed.
3) I made an alias for each of my VLANS that might need internet access
4) In Outbound NAT, I switched it to Hybrid and made rules to allow traffic through to each VLAN.
5) Under Firewall->Rules->LAN I created a pass rule for each VLAN (This will get tuned later)

With this, LAN clients can access the WAN, WAN clients can access things on the LAN (after putting in a port forward), and the firewall can ping both LAN and WAN.