Installation keeps bricking itself while in the middle of configuring

Started by Red Squirrel, November 28, 2024, 06:35:05 AM

Been fighting with this for a while now. I noticed there is a newer version, so I downloaded it hoping this would stop happening, but it's still happening. So that I don't have to sit at my workbench and listen to the very loud switch, I temporarily enabled the admin interface on the WAN port and plugged it into my network. That way I can sit at my regular PC to do all the configuring of vlans etc., then go to the workbench, where I have a switch and laptop, to test the vlans.

After adding the 6th or so vlan, I lose connectivity and never get it back, even after rebooting. I also can't connect to anything via the laptop/switch. The console shows all the interfaces with the right IPs, but I can't ping anything. The only way to fix this is to completely reinstall and start over from scratch.

Is there a way to stop this from happening?

I'm using it on a Sophos XG 115

Are you sure your switch doesn't accidentally block a port, e.g., Layer 2 Loop Protection, STP or something else?
Hardware:
DEC740

The switch has a very basic config: I just set up a trunk port for the pfsense LAN port and then a couple of vlans to test. It always works fine initially, until it just decides not to.

What seems to happen is that any time a change is made to interfaces, there is a chance I lose access to the web interface. Sometimes I lose it completely, and if rebooting or reassigning the interfaces via CLI (just repeating the settings already in place) does not help, I need to reinstall. Sometimes I just lose access via the WAN interface and can still reach it from within one of the vlans. But it's very hit and miss. At one point I was able to ping the WAN but not access the web UI, so I was poking around in the live logs to see if I could spot anything, then all of a sudden it stopped responding to pings. After a reboot, ping worked again, and the web UI worked again. It's very sporadic.

I also made sure to check "Prevent interface removal" on all interfaces.  When I lose access to WAN it's really weird, since in the CLI I can see that it shows the IP, but on my DHCP server I don't see a lease.  I ended up plugging the WAN interface into another vlan (on my existing network) and now it works again, but it worked before on the other vlan, so it's really strange how it's hit and miss like this.  I have experience with pfsense and that's what I'm hoping to upgrade from, so I'm not new at setting something like this up.

Is any of this a known issue, where stuff just spontaneously stops working while in the middle of configuring? I left it alone overnight and nothing changed, so the failures really seem to be caused by configuring, even if what I'm configuring is unrelated to what stops working. Ex: I configure a new vlan, and then an existing one stops working.

I've configured a whole bunch of OPNsense instances over the years and I cannot remember having issues like that. But I also used either VMs, server hardware, or official appliances.

Maybe you are configuring something wrong. I recently wrote this docs article; maybe it can help you:

https://docs.opnsense.org/manual/how-tos/vlan_and_lagg.html

Maybe it's a hardware issue and you have to get something different to run it.

Hoping it's not hardware, as I've paid over $500 into this already with taxes, shipping, extra power adapter etc. (it has a redundant PSU).

I have configured the vlans the right way, and they work, except every now and then, when I hit apply config, it will just brick everything. Sometimes I can fix it by going on the console and using assign interfaces to basically remove and re-add one, or even just typing in the same info that's already there. Once this is in production I won't have access to the console though, so it's a bit of an issue if this happens once I'm done setting it up.

I was recommended this Sophos box as a great thing to install it on, but I'm starting to second-guess it, as I am wondering if maybe it really is a hardware issue... For about the same money I could have bought an SFF machine and thrown a quad-port NIC in it.

Make sure you have a dedicated link to the FW, either directly into a port or into a switch that has a dedicated lan port going up to the FW.

I wouldn't be worried about HW issues, it is very clear you're doing something that kind of works until it doesn't, and you're not saying what action you're doing right before losing it all.

Hard to tell what causes it as it's so random. I'm just doing initial config things like adding the vlans etc and then suddenly the web UI will hang, and then that's that. I lose access to the firewall on all interfaces or sometimes just the WAN.

I think I may have figured out the cause though. I think you're really supposed to hit apply after each individual configuration item, such as adding a new interface or vlan. I was doing a bunch at a time and then hitting apply, but I guess that messes things up.

I don't want to jinx it yet but so far I have not lost connectivity again since hitting apply each time.

So it seems that as soon as I change the IP of my main vlan to the proper one that it's going to have in prod, that's when all hell breaks loose. All the other vlans are set and are fine, but the minute I set the main one, everything breaks. None of the interfaces will give out DHCP or be accessible. I need to go to the physical console to set the main vlan back to the 192.168.x.x range. The main vlan is nothing special, it's just a designation, so I'm not sure why that one causes issues and not the others. Also, I created an "emergency" interface that is just a standard interface with no vlan, but its DHCP server is not giving out the IP range that I assigned but rather IPs from the main vlan range. So that's a problem.

Is there some weird bug with setting an interface to the IP 10.1.1.1/24? As soon as I set the vlan to that is when things break. This happens even with WAN unplugged, which rules out some weird conflict with my main network. That shouldn't be an issue anyway, otherwise the other vlan configs would have broken it too.
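One way to rule out an address conflict on paper is to list every interface address and check the resulting networks for overlap. The following is only a minimal sketch using Python's standard `ipaddress` module. Only 10.1.1.1/24 comes from this thread; the other interface names and addresses are made-up placeholders to be replaced with the real config.

```python
import ipaddress
from itertools import combinations

# Placeholder interface list. Only 10.1.1.1/24 is taken from the thread;
# the other names and addresses are assumptions for illustration.
interfaces = {
    "main_vlan": "10.1.1.1/24",
    "test_vlan": "192.168.20.1/24",  # assumed
    "wan": "192.168.1.50/24",        # assumed
}


def find_overlaps(interfaces):
    """Return pairs of interface names whose networks overlap."""
    nets = {
        name: ipaddress.ip_interface(addr).network
        for name, addr in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


print(find_overlaps(interfaces))  # [] means no two interfaces share address space
```

An empty list only shows that the addressing plan itself is consistent; it says nothing about whether the firewall applied it correctly.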

Please share all networks and vlans and the lagg configuration. If you write everything down you might find out what's wrong.

If not, then do the same configuration in a VM and see if it crashes.

Here's some of the config, let me know if there's something specific you need to see. 

https://imgur.com/a/M71enki

Right now things work, except for the emrg interface, which is supposed to give me a regular non-vlan interface to plug a laptop into and access the config, except it's handing out the wrong IP range. I get 192.168.1.x when it should be 192.168.33.x as per the config. I cannot access the web interface at either IP range. I tried setting a static IP in the 33 range and still nothing. Once I can get this interface working, at least I can move this to my rack and troubleshoot from there.

I also previously figured out what each port is and wrote down the last digit of the MAC address, so I know for a fact that when I'm plugging stuff in and out, it's in the right port. I keep wondering if it's something as simple as that, but it's not.
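To make the mismatch explicit when testing from the laptop, the leased address can be compared against the range the interface is supposed to serve. This is a small sketch with Python's `ipaddress` module. The 192.168.33.0/24 and 192.168.1.x ranges are from the post above; the /24 mask and the host parts (.100) are assumptions.

```python
import ipaddress


def lease_matches(lease_ip, expected_cidr):
    """True if the leased address falls inside the expected subnet."""
    network = ipaddress.ip_network(expected_cidr, strict=False)
    return ipaddress.ip_address(lease_ip) in network


# Host parts (.100) and the /24 mask are placeholders, not from the thread.
print(lease_matches("192.168.1.100", "192.168.33.0/24"))   # False
print(lease_matches("192.168.33.100", "192.168.33.0/24"))  # True
```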

Ironically the reason I'm trying to get this done is because I want to start on a Proxmox cluster but I need to get this off my workbench to make room.  I was running into similar issues several months back and kind of gave up not sure what my next steps would be but trying to revisit it again.

Also just realized I can't access the admin interface from ANY inside port now. Only WAN. Everything keeps changing every time it screws up. I get completely different issues each time; it's all over the place and makes no sense.

November 29, 2024, 10:18:35 AM #10 Last Edit: November 29, 2024, 10:20:16 AM by Monviech (Cedrik)
I don't see anything obvious in what you sent as screenshots.

I would try configuring the same thing on different hardware or a VM if you have something around.

Maybe install a hypervisor on the Sophos box and see if that makes a difference, since it abstracts the hardware.

If things make no sense, changing the approach is what I would do. Use something different.

I posted in another thread about this.

I found that how the issue showed up depended on the browser. Safari was the worst.

Firefox worked better and did not give the "page no longer displayed" message when applying a change as simple as an MTU change on a VLAN.

Hmmm interesting, I am using Firefox though.

Never thought of trying to install Proxmox on the Sophos. I suppose that could actually work if it has VT-d. It would at least ensure that I can get console access to the firewall itself should something go wrong, and also let me do full OS-level backups by backing up the VM, so upgrades etc. would be less risky. I'm very close to having this working, so I think I will just keep working at it, and once I feel it's ready for prod I'll swap it in but keep the old one in place.

It seems it only messes up while I'm configuring it and not when it's idle, and only when doing major config changes like changing/adding interfaces, and sometimes it even fixes itself. Like that weird DHCP issue, it just solved itself. Still does not bode well though...

For what it's worth, I will do a memtest on the Sophos and see if there is also a way to do a SMART test; maybe the storage is failing.

EDIT:

I just saw this thread: https://forum.opnsense.org/index.php?topic=43995.15

It looks on par with what I'm experiencing, and may be why stuff started working on its own when I left it alone. In most cases I was not giving it minutes; I would just assume everything broke right away. The next time it happens I will give it time to see if it starts working again.
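To put a number on "give it time", something like the following could be left running after hitting apply. It is only a sketch: `wait_for_recovery` and `tcp_probe` are hypothetical helper names, and the port, timeout, and interval values are assumptions rather than anything OPNsense prescribes.

```python
import socket
import time


def tcp_probe(host, port=443, connect_timeout=2.0):
    """True if a TCP connection to host:port succeeds (e.g. the web UI port)."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_for_recovery(probe, timeout=600.0, interval=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it succeeds or timeout seconds have passed.

    Returns the seconds elapsed before the first success, or None on timeout.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


# Example call (address is a placeholder for the firewall's interface IP):
# elapsed = wait_for_recovery(lambda: tcp_probe("10.1.1.1"))
```

If this returns a number after a few minutes instead of None, the box recovered by itself and a reinstall was not needed.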

I was hoping this would be one of those things that just solves itself since it felt like things were getting better, but I left it for a few days and it just died.  Force rebooting it got it working again, but this is unacceptable.  I will have to find a different solution I think.