Network hangs from time to time: not sure why

Started by XabiX, January 07, 2025, 03:43:35 PM

January 07, 2025, 03:43:35 PM Last Edit: January 07, 2025, 05:55:14 PM by XabiX
Hello All,

I have been running OPNsense 24.7.11_2 on Proxmox 8.3 (kernel 6.11.0-2-pve), and pfSense before that for a while. I am seeing intermittent instability and cannot find any log that really helps me troubleshoot it. During calls, the connection hangs for 2 to 3 seconds from time to time and then carries on.

Here are some logs. What else should I be checking?

If I restart the services, I get:
Enter an option: 11

Writing firmware settings: FreeBSD OPNsense
Writing trust files...done.
Scanning /usr/share/certs/untrusted for certificates...
Scanning /usr/share/certs/trusted for certificates...
Scanning /usr/local/share/certs for certificates...
certctl: No changes to trust store were made.
Writing trust bundles...done.
Configuring login behaviour...done.
Configuring CRON...done.
Setting timezone: Europe/Paris
Setting hostname: OPNsense.localdomain
Generating /etc/resolv.conf...done.
Generating /etc/hosts...done.
Configuring loopback interface...done.
Configuring LAGG interfaces...done.
Configuring VLAN interfaces...done.
Configuring CAM interface...done.
Configuring Download interface...done.
Configuring LAN interface...done.
Configuring POP interface...done.
Configuring WAN interface...done.
Configuring WIFI interface...done.
Setting up routes...done.
Setting up gateway monitor...done.
Configuring firewall.......done.
Starting DHCPv4 service...done.
Starting DHCPv6 service...done.
Starting router advertisement service...done.
Starting NTP service...done.
Configuring OpenSSH...done.
Starting web GUI...done.
Syncing OpenVPN settings...done.
Stopping ntopng.
Waiting for PIDS: 54790.
Stopping redis.
Waiting for PIDS: 45839.
Stopping node_exporter.
Stopping acme_http_challenge.
Waiting for PIDS: 31589.
Stopping flowd.
Stopping mdns_repeater.
Waiting for PIDS: 19673.
Stopping qemu_guest_agent.
Waiting for PIDS: 15465.
Stopping monit.
Waiting for PIDS: 89582.
Stopping flowd_aggregate...done
setup vtnet1
setup vtnet0 [egress only]
setup vtnet2
Starting flowd_aggregate.
Starting monit.
Starting Monit 5.34.3 daemon with http interface at /var/run/monit.sock
kldload: can't load virtio_console: module already loaded or in kernel
Starting qemu_guest_agent.
Starting mdns_repeater.
Starting flowd.
rmdir: /var/etc/acme-client/home/deploy: Not a directory
rmdir: /var/etc/acme-client/home/dnsapi: Not a directory
rmdir: /var/etc/acme-client/home/notify: Not a directory
Starting acme_http_challenge.
Starting node_exporter.
Starting redis.
Certificates generated /usr/local/share/ntopng/httpdocs/ssl/ntopng-cert.pem
Starting ntopng.
md5sum: invalid option -- q
usage: md5sum [-bctwz] [files ...]
usage: grep [-abcDEFGHhIiLlmnOopqRSsUVvwxz] [-A num] [-B num] [-C num]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context=num] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]
06/Jan/2025 15:02:22 [Ntop.cpp:4052] WARNING: Unable to find timezone: using UTC
06/Jan/2025 15:02:22 [Redis.cpp:171] Successfully connected to redis 127.0.0.1@0
06/Jan/2025 15:02:22 [Redis.cpp:171] Successfully connected to redis 127.0.0.1@0
06/Jan/2025 15:02:22 [Ntop.cpp:2642] Parent process is exiting (this is normal)

The client has disconnected from the server.  Reason:
Invalid packet header.  This probably indicates a problem with key exchange or encryption.

What I noticed is that my client gets disconnected from the host when the issue appears:
root@Proxmox ~# ping 1.1.1.1
64 bytes from 1.1.1.1: icmp_seq=858 ttl=57 time=9.94 ms
64 bytes from 1.1.1.1: icmp_seq=859 ttl=57 time=10.1 ms

The client has disconnected from the server.  Reason:
Invalid packet header.  This probably indicates a problem with key exchange or encryption.

Could this be an issue on the Proxmox side rather than on OPNsense? Is there any other log worth checking on OPNsense before I look at the Proxmox side?

Is a key exchange happening on OPNsense all the time? Could it have something to do with the certificate?
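
One way to pin down exactly when the hangs occur is to leave a timestamped ping running (on Linux, iputils `ping -D` prefixes each reply with a Unix timestamp) and scan the saved output for gaps afterwards. The following is a minimal Python sketch under that assumption; `find_gaps` and the 1.5 s threshold are illustrative choices, not part of any OPNsense or Proxmox tooling.

```python
import re

# Matches reply lines as printed by `ping -D` (iputils), for example:
# [1736264615.123456] 64 bytes from 1.1.1.1: icmp_seq=858 ttl=57 time=9.94 ms
PING_LINE = re.compile(r"^\[(\d+\.\d+)\].*icmp_seq=(\d+)")


def find_gaps(lines, max_interval=1.5):
    """Return (prev_seq, seq, seconds) for consecutive replies that are
    more than max_interval seconds apart or that skip a sequence number."""
    gaps = []
    prev = None
    for line in lines:
        m = PING_LINE.match(line)
        if not m:
            continue  # ignore headers, summaries and unrelated output
        ts, seq = float(m.group(1)), int(m.group(2))
        if prev is not None:
            prev_ts, prev_seq = prev
            if ts - prev_ts > max_interval or seq - prev_seq > 1:
                gaps.append((prev_seq, seq, round(ts - prev_ts, 3)))
        prev = (ts, seq)
    return gaps
```

Running the same capture from the Proxmox host and from a LAN client at the same time, then comparing the reported gaps, could help tell whether the stall is upstream of the OPNsense VM or inside it.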

Thanks
XabiX

January 08, 2025, 05:14:00 PM #1 Last Edit: January 08, 2025, 06:13:28 PM by XabiX
Any idea what the issue could be? Maybe a problem with the Proxmox driver for the RTL8126 NIC on the 6.11 kernel?

FYI, here is a capture taken from my laptop, connected to the Proxmox host over a wired link. I wonder whether this has anything to do with OPNsense, but my girlfriend has the same issue on WiFi.

dmesg | grep -i r8169
[    0.890543] r8169 0000:0a:00.0: enabling device (0000 -> 0003)
[    0.901128] r8169 0000:0a:00.0 eth0: RTL8126A, 34:5a:60:03:c4:ad, XID 64a, IRQ 58
[    0.901132] r8169 0000:0a:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[    3.160413] r8169 0000:0a:00.0 enp10s0: renamed from eth0
[   21.473988] r8169 0000:0a:00.0 enp10s0: entered allmulticast mode
[   21.474024] r8169 0000:0a:00.0 enp10s0: entered promiscuous mode
[   21.500368] RTL8251B 5Gbps PHY r8169-0-a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-a00:00, irq=MAC)
[   21.940489] r8169 0000:0a:00.0 enp10s0: Link is Down
[   24.815676] r8169 0000:0a:00.0 enp10s0: Link is Up - 2.5Gbps/Full - flow control off
root@Proxmox ~#
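
To check whether the physical link itself drops when a hang occurs, the "Link is Up/Down" lines can be pulled out of the full `dmesg` output. This is a small Python sketch that assumes the default kernel log layout shown above (`[seconds] driver pci-address interface: message`); the function name `link_events` is hypothetical.

```python
import re

# Matches kernel log lines such as:
# [   21.940489] r8169 0000:0a:00.0 enp10s0: Link is Down
LINK_EVENT = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+\S+\s+\S+\s+(\S+): Link is (Up|Down)"
)


def link_events(lines):
    """Return a list of (seconds_since_boot, interface, state) tuples."""
    events = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3)))
    return events
```

A single Down/Up pair shortly after boot, as in the capture above, is expected while the link negotiates. Further "Down" entries later in the log that line up with the hangs would point at the NIC, cable or switch port rather than at OPNsense.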

January 10, 2025, 02:37:10 PM #2 Last Edit: January 10, 2025, 02:46:56 PM by XabiX
Hello Team,

I have deactivated Unbound, NetFlow and ntopng to reduce the load, but I still have the issue and no idea what to look for.

Attached is my VM conf. CPU is AMD Ryzen 7 9700X.

Which logs on OPNsense could I look at to see interrupts or local freezes on the guest?

Maybe an ARP/IP conflict:
2025-01-08T17:19:45    Error    dhcpd    uid lease 192.168.30.197 for client 6c:7e:67:c5:5f:c1 is duplicate on 192.168.30.0/24
My laptop runs Zscaler. I am not sure whether that could cause strange behaviour, but normally this MAC has a static IP and should not be duplicated.
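
To see how often this happens and whether other clients are affected, the duplicate-lease messages can be grouped by MAC address from an exported copy of the DHCP log. This is a minimal Python sketch that assumes the message wording shown above; `duplicate_leases` is a hypothetical helper, and it only collects the messages without explaining why they occur.

```python
import re
from collections import defaultdict

# Matches the dhcpd message text, for example:
# uid lease 192.168.30.197 for client 6c:7e:67:c5:5f:c1 is duplicate on 192.168.30.0/24
DUPLICATE = re.compile(
    r"uid lease (\S+) for client ([0-9A-Fa-f:]{17}) is duplicate on (\S+)"
)


def duplicate_leases(lines):
    """Map each client MAC to the set of (ip, subnet) pairs reported as duplicate."""
    found = defaultdict(set)
    for line in lines:
        m = DUPLICATE.search(line)
        if m:
            ip, mac, subnet = m.groups()
            found[mac.lower()].add((ip, subnet))
    return dict(found)
```

If only this one MAC shows up, comparing the reported address with the static mapping configured for it would be a reasonable next step.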


Thanks

It looks like there is a Realtek NIC somewhere, and that rarely bodes well for FreeBSD-based operating systems. That could play a part, but you seem to be using virtio, which might isolate the device from OPNsense's FreeBSD. With Proxmox's virtualisation in the mix as well, it is hard to tell where to look other than at each of those ingredients in turn.
On top of it all, you seem to have a number of services that could be putting some pressure on the device, and as I said, Realtek NICs are not great under pressure. I'm referring to the LAGG interface, VLAN interface, ntopng, redis?!, flowd and monit. Are those services you run on OPNsense?
And, jumbo frames on a Realtek?! Is that right?