Messages - Andreas_

#31
I think I now know what's causing it, see https://github.com/opnsense/core/issues/3468
#32
19.1 Legacy Series / Re: CARP over LAGG problems
May 03, 2019, 05:15:52 PM
Quote from: mimugmail on May 03, 2019, 11:02:23 AM
Isn't the backup unit also on the same switch? Then it should not fail over ...

That's exactly the problem. Via the second switch both FWs have degraded but fully functional LAGG connectivity. CARP shouldn't react to LAGG degradation, but it does.
#33
19.1 Legacy Series / CARP over LAGG problems
May 03, 2019, 09:25:38 AM
I usually do my OPNsense upgrades by first updating the machine that is normally the backup, then disabling CARP on the master and updating it as well.
Now when upgrading from 19.1.2 to 19.1.6 (which needs a reboot), I found that afterwards some VHIDs would go to master and some to backup (net.inet.carp.preempt=0; it should be 1, but 0 is helpful for debugging here). The VHIDs that became master are all on a LAGG interface (directly or via VLAN); the others, remaining on backup, are on physical interfaces. Disabling and re-enabling CARP on the master machine resolved the situation. Apparently, the LAGG interface didn't receive CARP packets from the master in time while booting up, so the rebooted machine suspected it needed to become master itself.

After my HA setup was settled and working normally, I started to upgrade the switches one by one. With one switch down, the LAGG interface is still workable, since only one of the two physical interfaces loses connection, but CARP seems to increase demotion based on the physical interfaces, not the resulting LAGG interface. To keep CARP from failing over unnecessarily (which would affect e.g. OpenVPN connections), CARP on the backup has to be disabled temporarily.

So there seem to be two issues here: CARP expecting traffic before LAGG is ready, and CARP demotion reacting to LAGG slave interfaces instead of the LAGG interface itself.
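To illustrate the demotion issue, here is a minimal Python sketch of how CARP demotion-based failover works (a simplified model for illustration, not OPNsense or FreeBSD source; the per-interface demotion step and the skew values are assumptions). Demotion is added to the master's advertised skew, and the backup takes over once the master's effective skew is worse than its own. If demotion is counted per physical LAGG member rather than per LAGG interface, a single member-port loss can already demote the master past the backup:

```python
# Simplified model of CARP demotion-based failover (illustrative only,
# not OPNsense/FreeBSD code). Assumption: demotion is added to the
# advertised skew, and the backup preempts when its own skew is lower.

DEMOTION_PER_DOWN_IF = 240  # assumed per-interface demotion step


def effective_skew(advskew: int, down_interfaces: int) -> int:
    """Advertised skew grows with the demotion counter (capped at 254)."""
    return min(advskew + down_interfaces * DEMOTION_PER_DOWN_IF, 254)


def backup_preempts(master_skew: int, backup_skew: int,
                    master_down_ifs: int) -> bool:
    """Backup takes over when the master's effective skew exceeds its own."""
    return effective_skew(master_skew, master_down_ifs) > backup_skew


# Healthy: master (advskew 0) beats backup (advskew 100).
print(backup_preempts(0, 100, 0))  # False
# One LAGG member port down: if demotion is counted per physical member,
# the master is demoted past the backup even though the LAGG still carries
# traffic -- an unnecessary failover.
print(backup_preempts(0, 100, 1))  # True
```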


#34
Removed bind912 for me without installing bind914; posted a GitHub issue.
#35
I had the problem for a long time. It turned out to be a problem with the switches the WAN was connected to: IGMP snooping had to be disabled.
#36
Recently I made a new attempt at diagnosing the problem, but hadn't gotten around to updating this thread.
Since I couldn't test on the production system, I eventually built another identical firewall and loaded the original configuration onto it. I could immediately see the outgoing CARP packets.  :o
I then put an additional switch between the firewall and the upstream (provider switch) in the production system, and could see my packets there as well.
A longer consultation with the data-center staff finally led to IGMP snooping being disabled for us on the core switch, and lo and behold, the CARP multicast packets went through again. I suspect that at some point a software update was installed on the core switch that implements IGMP snooping incorrectly, namely by also filtering ALL-NODES multicast.

I didn't try to find out why pfctl -d had worked; I was just glad to finally have working standby redundancy again.
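The snooping misbehavior can be made concrete: groups in 224.0.0.0/24 are link-local multicast, for which no IGMP membership reports are sent, so a snooping switch must always flood them (RFC 4541). CARP advertisements go to 224.0.0.18 and ALL-NODES is 224.0.0.1, both inside that range. A minimal sketch of the check:

```python
# Sketch: IGMP snooping must always flood groups in 224.0.0.0/24
# (link-local multicast, RFC 4541) -- hosts send no membership reports
# for them. CARP advertisements use 224.0.0.18, so a switch that filters
# this range swallows CARP, matching the behavior described above.
from ipaddress import ip_address, ip_network

LINK_LOCAL_MCAST = ip_network("224.0.0.0/24")


def must_flood(group: str) -> bool:
    """True if an IGMP-snooping switch must forward this group unconditionally."""
    return ip_address(group) in LINK_LOCAL_MCAST


print(must_flood("224.0.0.18"))       # CARP/VRRP advertisements -> True
print(must_flood("224.0.0.1"))        # ALL-NODES -> True
print(must_flood("239.255.255.250"))  # SSDP -> False, snooping may prune
```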
#37
From time to time, we're suffering from some strange issue:
Triggered by a workstation on LAN1 sending a ws-discovery multicast on port 3702 (or some other service, just as an example), several thousand duplicated packets can be seen on LAN2 (with the LAN1 address as sender and multicast as destination), with the source MAC address of the backup firewall of a CARP pair.

Or in other words:
The CARP backup firewall, which should be listening passively, creates IP multicast packets with its own LAN2 MAC as source address and the LAN1 IP address of a client as source, at a rate of about 5000/s, and will not stop until the firewall is kicked with pfctl -d; pfctl -e.

Hotfix is to drop UDP traffic to specific ports (such as 3702) on the LAN1 network, but a firewall shouldn't create such packets on its own, right? It's 19.1 (had this already with 18.1/18.7), no specific Multicast/IGMP settings or modules.
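The signature described above is easy to flag mechanically: frames sourced from the backup firewall's own MAC but carrying a foreign (LAN1) source IP, at storm rate. A small sketch, where the MAC, subnet, and threshold are assumptions for the example (the observed tuples would come from something like `tcpdump -e -n udp port 3702` on LAN2):

```python
# Sketch to flag the storm signature described above: frames whose source
# MAC is the backup firewall itself, but whose source IP does not belong
# to the segment they appear on. MAC, subnet and rate are hypothetical.
from collections import Counter
from ipaddress import ip_address, ip_network

BACKUP_FW_MAC = "00:11:22:33:44:55"   # hypothetical backup-firewall MAC
LAN2_NET = ip_network("10.0.2.0/24")  # hypothetical LAN2 subnet
STORM_RATE = 5000                     # packets per second


def is_storm(frames_per_second):
    """frames_per_second: list of (src_mac, src_ip) seen in one second on LAN2."""
    suspicious = Counter(
        mac for mac, ip in frames_per_second
        if mac == BACKUP_FW_MAC and ip_address(ip) not in LAN2_NET
    )
    return suspicious.get(BACKUP_FW_MAC, 0) >= STORM_RATE


# 6000 replayed ws-discovery frames in one second -> storm detected
sample = [(BACKUP_FW_MAC, "10.0.1.5")] * 6000
print(is_storm(sample))  # True
print(is_storm([]))      # False
```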
#38
I now have 3 phase2 entries: 2 old and one new.

When clicking on 'clone' on an old entry, I get
<fw-addr>/vpn_ipsec_phase2.php?dup=
the new entry has
<fwaddr>/vpn_ipsec_phase2.php?dup=5c9e3fd2320ae

Checking further:
It seems that p2 entries I entered last year are OK, while the problem only exists with entries that are several years old.
Looking at a config backup, it seems that the older entries are missing the uniqid property. It seems I have to add the uniqid manually in a config and restore from there?
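The manual fix could be sketched like this: walk a config.xml backup and add a uniqid to every phase2 entry that lacks one, then restore the edited file. This is an illustration under assumptions, not a supported OPNsense migration; the 13-hex-digit value merely mimics the shape of PHP's uniqid(), and the sample XML structure is simplified.

```python
# Hedged sketch: add a uniqid to phase2 entries that are missing one in a
# config.xml backup. Illustration only -- the element layout is simplified
# and the ID format just mimics the shape of PHP's uniqid().
import time
import xml.etree.ElementTree as ET


def php_style_uniqid() -> str:
    """Roughly emulate PHP uniqid(): hex of the current time in microseconds."""
    return format(int(time.time() * 1_000_000), "x")[:13]


def add_missing_uniqids(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    for p2 in root.iter("phase2"):
        if p2.find("uniqid") is None:       # only old entries lack it
            uid = ET.SubElement(p2, "uniqid")
            uid.text = php_style_uniqid()
    return ET.tostring(root, encoding="unicode")


sample = "<opnsense><ipsec><phase2><ikeid>1</ikeid></phase2></ipsec></opnsense>"
fixed = add_missing_uniqids(sample)
print("<uniqid>" in fixed)  # True
```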
#39
The last time I edited/cloned phase2 entries was with 18.7.x.
#40
I'd like to add a third phase2 entry to an ipsec definition, so I pressed "clone" on an existing phase2 entry. The fields of the page coming up are not prefilled, and if filled and saved the entry won't show up. Looking at a config backup, the entry misses the ikeid, but has a uniqueid instead.

Investigating further, editing apparently is affected as well: the parameters shown are not the ones of the phase2 entry being edited.
Only adding a fresh entry seems to work.

Version 19.1.2 and 19.1.4 tested, after refreshing with F5.
#41
The config is 380k with tons of firewall rules, a CA, etc.; it would be nearly impossible to re-enter all of that without errors.
And the router also has to be available 24/7...
I hope to be able to set up a physical router soon.
#42
I have since made further attempts, partly involuntarily:
- Hardware replacement, fresh 18.7.9 install, restore of the 18.1 config with adjusted interface names. The problematic WAN interface (no lagg, no vlan) now has an ix driver instead of igb: still no CARP packets, so that wasn't the cause.
- Installation in a Xen DomU, again with adjusted interfaces. Here the CARP broadcast works as expected...

I have gone over the config up and down. I removed thermal_hardware coretemp and the dashboard widget (otherwise kernel crash); other than that I see nothing suspicious (sysctl largely at defaults, apart from carp demotion). Despite virtualization, the hardware is close to the original (C2558/2 cores instead of C2758/C3758 8 cores), including aesni hardware_crypto.

I'm pretty much at a loss. It really looks as if the culprit is the switch connecting the two routers, filtering the broadcasts. I can only test via tcpdump on the slave router because the router pair sits in a data center. But then the problem shouldn't be switchable off and on via pfctl -d and -e.
#43
Downgrading to unbound 1.7.3 from 18.1 helps! So apparently there's a regression from 1.7.3 to 1.8.1 and up.

Used https://pkg.opnsense.org/FreeBSD:11:amd64/18.1/latest/All/unbound-1.7.3.txz and pkg add -f
#44
OK, this is a lot worse: about 5 minutes after restart, the service stops using the domain override and tries to resolve upstream.

A router pair with a similar config still on 18.7.6 (unbound 1.8.1) doesn't show this problem, so I tried that binary instead of the 1.8.2_2 version; no improvement.
#45
I upgraded from 18.1.x to 18.7.9 and afterwards had some trouble with unbound not resolving some domain overrides. It would resolve some addresses from cache and try others from the root servers. I had to edit and save a domain override unmodified to get unbound back to normal operation.
This happened on two machines (master and slave).