Topics - klosz007

#1
Hi,

Preface:
I have a persistently connected OpenVPN client instance configured to my customer's network.
I want it to be highly available, so the client config is replicated between the HA nodes.
The remote OpenVPN server allows only one connection per user account, and I have only one account there, so only one client can be connected at a time (on either the primary or the secondary node).

I achieved that by configuring the client to follow the WAN CARP VIP.
By default the WAN CARP VIP is present on the primary node, so the OpenVPN client runs on the primary node and is shut down on the secondary.

Issue:
That worked perfectly up to 24.7.
In 25.1 it initially seemed to work well too - if the CARP VIP moves to the standby node, the client on the primary node shuts down and spins up on the secondary node, and so on. So far so good.

Unfortunately something's broken in 25.1:

If the VIP is on the primary node and I go to HA->Status on the primary node and click "Synchronize and restart services", HA tries to (re)start the OpenVPN client on the secondary node (even though the VIP has not moved there).
Weirdly, this does not happen instantly but after a while.

With two OpenVPN clients (on the primary and the secondary node) spun up and trying to use the same account, while only one connection per account is allowed on the remote server, weird things start to happen: the clients randomly and periodically connect and disconnect, etc.

What fixes it is manually stopping the OpenVPN client on the standby node, or moving the VIP to the secondary node and back (by entering and leaving persistent CARP maintenance mode).
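As a stop-gap, that manual "stop the client on the standby node" step can be sketched as a small script. Everything here is an assumption to adapt: vhid 1 for the WAN VIP, and a placeholder stop command (OPNsense actually manages OpenVPN instances through configd, so substitute the appropriate call):

```shell
#!/bin/sh
# Sketch: keep the OpenVPN client down while this node is CARP BACKUP.
# Assumptions: the WAN VIP uses vhid 1; the stop command is a placeholder.

carp_state() {
  # Print MASTER/BACKUP for the given vhid from `ifconfig` output on stdin.
  awk -v vhid="$1" '$1 == "carp:" && $3 == "vhid" && $4 == vhid { print $2 }'
}

STATE="$(ifconfig 2>/dev/null | carp_state 1)"
if [ "$STATE" = "BACKUP" ]; then
  # Placeholder stop command -- replace with however you manage the client.
  service openvpn onestop >/dev/null 2>&1 || true
fi
```

Run from cron (or a CARP event hook) it would re-assert the desired state shortly after a spurious restart on the backup node.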

Any idea what is wrong? Could this be a bug in 25.1?

Actually, I have two client instances (to two different customers) and both behave exactly the same way.
#2
25.1, 25.4 Legacy Series / Loader needs to be updated
February 24, 2025, 11:45:23 PM
Hi,

If you noticed this warning on the bootloader screen, it's a known and mostly harmless bug in FreeBSD 14: the upgrade does not update the EFI bootloader in the EFI partition.
I just noticed it now on OPNsense, and I had been searching for quite some time for a fix on my FreeBSD 14.x installs that were upgraded from 13.x.

The fix is easy from a shell:
cp /boot/loader.efi /boot/efi/efi/freebsd/loader.efi
cp /boot/loader.efi /boot/efi/efi/boot/bootx64.efi

Then reboot and the warning is gone.
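A quick sanity check after the copy, before rebooting. This is a sketch that assumes the ESP is mounted at /boot/efi with the same paths as in the fix above:

```shell
#!/bin/sh
# Verify the installed EFI loaders now match the freshly built one.

same_file() {
  # Succeeds if both files are byte-identical.
  cmp -s "$1" "$2"
}

for f in /boot/efi/efi/freebsd/loader.efi /boot/efi/efi/boot/bootx64.efi; do
  if [ -f "$f" ] && same_file /boot/loader.efi "$f"; then
    echo "$f: up to date"
  else
    echo "$f: missing or stale"
  fi
done
```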
#3
Hi,

I spent the whole afternoon and evening trying to upgrade my HA cluster from 24.7.12_4 to 25.1.

Everything was done by the book:
- upgrade the standby from 24.7.12_4 to 25.1
- wait until it comes online and verify its state
- put the primary into persistent CARP maintenance mode
- upgrade the primary from 24.7.12_4 to 25.1
- wait until it comes online and verify its state
- disable CARP maintenance mode

When the standby is already at 25.1 and the primary is still at 24.7.12_4, the HA status page on the primary node is still fine - it displays the status of the services on the secondary node (and says the standby runs 25.1).
When the primary gets upgraded to 25.1 too, the HA status page welcomes me with a yellow message: "The backup firewall is not accessible (check user credentials)".

I tried upgrading from 25.1 to 25.1.1 - same problem.

Then I restored the VMs from snapshots back to 24.7 and tried to upgrade to 25.1 via a full reinstall from ISO plus config restore (same order - standby node first, then primary) - same problem right after upgrading the primary node to 25.1.

Eventually I surrendered and restored both VMs from snapshots to return to 24.7.12_4.

What has changed in 25.1? What's going on?
Does the sync user require some new privileges in 25.1?

Thanks,
Zbyszek


#4
Hi,

I have been using Monit on OPNsense for some time to monitor HTTP connectivity to some hosts, or to monitor HTTPS connectivity plus certificate validity period reporting. This has worked well so far.

Starting with 24.7.6, I'm flooded with false-negative alerts: a given monitored host reports "Connection failed" and then, a minute or so later, "Connection succeeded". Then there is a period of silence (a couple of minutes) and the failed/succeeded alerts repeat.

The alerts look like this:
        Action:      alert
        Host:        opnsense2.localdomain
        Description: failed protocol test [HTTP] at [192.168.7.1]:443 [TCP/IP TLS] -- Poll failed: Interrupted system call


The common message for all failed alerts is "Interrupted system call".

Only some monitored hosts are affected, and both my HTTP and HTTPS monitors are affected. No changes have been made to these hosts; it was OPNsense that was upgraded to 24.7.6, and it all started right after that.
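For reference, a minimal Monit check of the kind described, as a sketch (the host name and address are placeholders matching the alert above). One thing worth trying while debugging is raising the poll timeout, since a short timeout makes transient interruptions more likely to surface as failures:

```
check host opnsense2 with address 192.168.7.1
    if failed port 443 protocol https with timeout 30 seconds then alert
    if failed port 443 protocol https and certificate valid > 14 days then alert
```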

Has anyone experienced similar behavior?

Thanks
#5
Hi,

Has anyone ever found a definitive solution to these periodic crashes of OPNsense? They are still present in 24.1.

I saw these for the first time when I migrated my OPNsense VM instance (running on ESXi) from a regular generic PC to a Chinese mini PC with an N5105 + I226-Vs.
Then I read about Chinese PCs with N5105s being a possible culprit.

So I replaced this mini PC with another one with a Pentium 7505 - same story.

You could blame Chinese mini PCs. But... recently I have seen the same crashes on OPNsense running under KVM on a Synology device with an AMD V1500 CPU.

So it's not caused by specific CPU or hypervisor.

Moreover, all my other VMs are perfectly stable, including those running non-OPNsense FreeBSD 13.2. So something is wrong with OPNsense, but it's a corner case specific to some config, since not all OPNsense VMs under my supervision are experiencing this.


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8238d57f
stack pointer = 0x0:0xfffffe000b9af6c0
frame pointer = 0x0:0xfffffe000b9af720
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_3)
trap number = 12
panic: page fault
cpuid = 3
time = 1706361743
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000b9af480
vpanic() at vpanic+0x151/frame 0xfffffe000b9af4d0
panic() at panic+0x43/frame 0xfffffe000b9af530
trap_fatal() at trap_fatal+0x387/frame 0xfffffe000b9af590
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000b9af5f0
calltrap() at calltrap+0x8/frame 0xfffffe000b9af5f0
--- trap 0xc, rip = 0xffffffff8238d57f, rsp = 0xfffffe000b9af6c0, rbp = 0xfffffe000b9af720 ---
pf_test_state_udp() at pf_test_state_udp+0x28f/frame 0xfffffe000b9af720
pf_test() at pf_test+0xc57/frame 0xfffffe000b9af890
pf_check_in() at pf_check_in+0x25/frame 0xfffffe000b9af8b0
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe000b9af8f0
ip_input() at ip_input+0x799/frame 0xfffffe000b9af980
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe000b9af9d0
ether_demux() at ether_demux+0x159/frame 0xfffffe000b9afa00
ng_ether_rcv_upper() at ng_ether_rcv_upper+0x8c/frame 0xfffffe000b9afa20
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe000b9afab0
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe000b9afaf0
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe000b9afb80
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe000b9afbc0
ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe000b9afbf0
ether_nh_input() at ether_nh_input+0x1f2/frame 0xfffffe000b9afc50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe000b9afca0
ether_input() at ether_input+0x69/frame 0xfffffe000b9afd00
iflib_rxeof() at iflib_rxeof+0xbcb/frame 0xfffffe000b9afe00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000b9afe40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe000b9afec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe000b9afef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000b9aff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000b9aff30
--- trap 0x6c617470, rip = 0x5ac8b830975c0, rsp = 0x8b8d4820000005a8, rbp = 0x30646870 ---
KDB: enter: panic

Thanks.
#6
Hi,

Please consider adding the following in System/Trust/Certificates:

- add "by CA" filter (display certificates belonging to specific CA - I have multiple CAs)
- make Valid From/To separate columns (not part of DN)
- add column sorting (especially on "Valid To" column - to quickly identify which certificates expire first)

The current layout is good as long as you have one CA and just a handful of certs to manage.

Thanks !
#7
Hi,

I have just spent an entire evening troubleshooting a very weird issue which eventually turned out to be caused by improper CARP interface creation.
It could easily be avoided by improving the GUI for CARP IP creation.

The GUI asks for an IP address, but it must be provided together with a netmask (e.g. 192.168.100.1/24) which should be the same as the underlying interface's netmask.

The problem is that if one provides just the IP, the CARP VIP gets created with a single-host /32 netmask. That leads to weird issues - e.g. DHCP no longer works properly. And it is hard to notice that no netmask was provided.


I think there should be a separate dropdown to pick the netmask length, so that one cannot enter an IP without a netmask.
Ideally it would suggest the correct netmask for the CARP VIP, or warn if the selected netmask differs from the underlying interface's.
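Until then, the /32 trap can at least be detected from a shell. A sketch (the interface name igb0 is a placeholder) that converts ifconfig's hex netmask to a prefix length and flags single-host VIPs:

```shell
#!/bin/sh
# Flag CARP VIPs that were created with a single-host /32 netmask.

mask_to_prefix() {
  # Convert an ifconfig hex netmask (e.g. 0xffffff00) to a prefix length
  # by counting the set bits.
  printf '%d\n' "$1" | awk '{ n = $1; b = 0
    while (n > 0) { b += n % 2; n = int(n / 2) }
    print b }'
}

# "igb0" is a placeholder for the interface carrying the VIP.
ifconfig igb0 2>/dev/null | awk '$1 == "inet" && /vhid/ { print $2, $4 }' |
while read -r ip mask; do
  if [ "$(mask_to_prefix "$mask")" -eq 32 ]; then
    echo "warning: CARP VIP $ip has a /32 netmask"
  fi
done
```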

Thanks !
#8
Hi,

Question as in the subject :-)
#9
High availability / HA setup with no WAN CARP IP
November 09, 2023, 08:39:46 PM
Hi,

I would like to set up an HA config at home, but I have only one static public IP, which is assigned by the broadband modem via DHCP to a specific MAC address (currently used by my one and only OPNsense instance - it owns this public IP).
All other devices connected to the broadband modem (currently none) receive CG-NAT IPs.

I need some redundancy. OPNsense runs as an ESXi VM, and whenever I shut down this ESXi host I have no Internet at home - wife screams, children cry (or vice versa), etc. :-)

I have a second ESXi host, but no shared storage to easily and quickly move the OPNsense VM to it when I need a maintenance window. I could, however, run an OPNsense HA cluster with the nodes as VMs on both ESXi hosts.

I know it is not a recommended config to have no WAN CARP IP and to use just two different WAN IPs on the two nodes (moreover, one public and one CG-NAT).


Besides the obvious limitations - services running on the public IP (VPN, HAProxy, etc.) will not be accessible if the primary instance is down - will there be any other impact, malfunction, or limitation with such a config?



Another option to consider is another physical router doing just NAT, nothing else; the WAN interfaces of the OPNsense HA cluster plus the CARP IP would then be private NAT IPs, and the WAN CARP IP would be configured as the DMZ host. That effectively becomes an inelegant double-NAT config. But it costs another device to maintain and adds another SPOF. Such a router would also have to be fast (I have a 1000 Mbit broadband downlink), hence expensive and power hungry.

Thanks for any advice.
#10
Hi,

Any idea what these messages mean and why they appear? The console is flooded with them (after 2 days of uptime I have 150 such messages logged). It must have started in some recent 23.1 update.

sonewconn: pcb 0xfffff80015e1be00 (local:/var/etc/openvpn/server5.sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences)

It starts after each reboot and seems to go away if I restart the OpenVPN server service. But it returns after the subsequent reboot.
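To see how often each socket overflows, the kernel messages can be tallied with a small sketch (the sed pattern matches the message format above; the log path is an assumption to adjust). On FreeBSD, `netstat -Lan` also shows the live listen-queue depths:

```shell
#!/bin/sh
# Extract the socket path and queue depth from sonewconn overflow messages.

parse_overflow() {
  sed -n 's/.*(\(local:[^)]*\)): Listen queue overflow: \([0-9]*\).*/\1 \2/p'
}

# Typical use (log path may differ on your install):
# parse_overflow < /var/log/system/latest.log | sort | uniq -c
```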

Thanks,
Zbyszek
#11
Hi,

I'm talking about System/Access/Groups.
I have manually created a user group but I cannot find a way to delete it. Even though it is empty and has no permissions assigned, the "delete" button does not show up next to the "edit" (pen) button. Is this a bug or am I doing something incorrectly?

Thanks,
Zbyszek
#12
Hi,

The documentation (https://docs.opnsense.org/manual/aliases.html) mentions an "OpenVPN group" alias type, but I cannot find such a type in the dropdown list when creating a new alias.
Has this feature been removed, or is this a bug?

Thanks,
Zbyszek
#13
Hi,

I have been dealing with this issue for some time with no luck. It first appeared for me in 22.7.10 (I believe). After each reboot it took a significant time (say, 2 minutes) for Unbound to accept UDP connections. During that time OPNsense was not accepting web interface or SSH connections either. If I just waited out these 2 minutes, Unbound eventually started and everything worked fine.

It got worse in 22.7.11 though.
I was hoping that 23.1 would resolve it, but I upgraded to 23.1 today and the same issue is still there.

Now it not only takes 2 minutes for Unbound to accept UDP connections (and for the web GUI and SSH to start working), but after that time all DNS queries fail with a 'SERVFAIL' message. There's nothing really informative in the Unbound log (/var/log/resolver/latest.log).

When I increased logging to level 2, each query in the log looks like this:

<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="701"] [25609:3] info: resolving mtalk.google.com. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="702"] [25609:3] info: priming . IN NS
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="703"] [25609:0] info: resolving mtalk.google.com. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="704"] [25609:0] info: priming . IN NS
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="705"] [25609:3] info: resolving mtalk.google.com.localdomain. A IN
<30>1 2023-01-28T18:56:45+01:00 opnsense.localdomain unbound 25609 - [meta sequenceId="706"] [25609:3] info: priming . IN NS


So basically, every attempt to resolve a name is followed by a "priming . IN NS" message.


The fix is very easy - manually restart Unbound from the GUI. Then, instantly, everything returns to normal.
The first few messages in the log then look like this:

<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4577"] [94863:3] info: resolving prod.amcs-tachyon.com. A IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4578"] [94863:3] info: priming . IN NS
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4579"] [94863:3] info: response for . NS IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4580"] [94863:3] info: reply from <.> 199.7.91.13#53
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4581"] [94863:3] info: query response was ANSWER
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4582"] [94863:3] info: priming successful for . NS IN
<30>1 2023-01-28T18:58:18+01:00 opnsense.localdomain unbound 94863 - [meta sequenceId="4583"] [94863:0] info: control cmd:  list_local_data


(a "priming successful for . NS IN" message appears and subsequent queries are resolved).

If I reboot OPNsense, the same story repeats: I have to wait ~2 minutes, then restart Unbound to make it work.
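Until the root cause is found, the restart could be automated with a crude post-boot probe. A sketch, assuming drill (shipped in FreeBSD base) and a placeholder restart command - substitute however your install actually manages Unbound:

```shell
#!/bin/sh
# Probe the local resolver and restart Unbound if it cannot answer.

rcode_ok() {
  # Succeeds if a drill/dig response header on stdin reports NOERROR.
  grep -q 'rcode: NOERROR'
}

if command -v drill >/dev/null 2>&1; then
  if ! drill @127.0.0.1 opnsense.localdomain | rcode_ok; then
    # Placeholder restart command -- replace with your service manager call.
    service unbound onerestart >/dev/null 2>&1 || true
  fi
fi
```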

Have you encountered such an issue? What may be causing it?

Thanks,
Zbigniew