Hey all, I'm trying to figure out what's going wrong. Nearly every day I'm losing connection on my WAN interface, and I can't find anything in the logs (though I'm not really sure which logs I should be looking at). I'm running OPNsense on bare metal on an MSI Cubi NUC 1M (Intel Core 5, 16GB RAM, 500GB SSD). It has 2x Intel I226-V, and I have the WAN interface set to auto-negotiate the speed.
When it loses connection I either need to reboot, or just go into the interface settings and click Save, which seems to be enough to get it reconnected. As for OPNsense, I keep the version up to date (25.7.2). I'm running IDS/IPS on just my LAN interface with the Hyperscan pattern matcher. CrowdSec and WireGuard are also running. I've disabled all hardware settings in the interface settings (CRC, TSO, LRO, VLAN filtering).
What logs should I be looking at to help me figure out what the issue is?
Any help would be appreciated
Just to add to this: when it loses WAN, it reports 100% packet loss. These are the hourly figures from the last couple of months:
iso_time | loss | delay | stddev |
2025-06-17T14:00:00+10:00 | 92.330097922 | 0.00052491607689 | 0.00018458892351 |
2025-06-23T05:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T06:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T07:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T08:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T09:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T10:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T11:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T12:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T13:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T14:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T15:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T16:00:00+10:00 | 67.538149056 | 0.037166575359 | 0.13227880102 |
2025-07-07T08:00:00+10:00 | 100 | 0 | 0 |
2025-07-07T09:00:00+10:00 | 100 | 0 | 0 |
2025-07-27T01:00:00+10:00 | 100 | 0 | 0 |
2025-07-27T02:00:00+10:00 | 100 | 0 | 0 |
2025-07-27T03:00:00+10:00 | 100 | 0 | 0 |
2025-07-29T12:00:00+10:00 | 100 | 0 | 0 |
2025-07-29T13:00:00+10:00 | 100 | 0 | 0 |
2025-07-29T14:00:00+10:00 | 75.036629833 | 0.0015248240753 | 0.00042127605431 |
2025-08-14T01:00:00+10:00 | 100 | 0 | 0 |
2025-08-14T02:00:00+10:00 | 100 | 0 | 0 |
2025-08-14T03:00:00+10:00 | 100 | 0 | 0 |
2025-08-14T04:00:00+10:00 | 100 | 0 | 0 |
2025-08-14T05:00:00+10:00 | 100 | 0 | 0 |
2025-08-19T03:00:00+10:00 | 100 | 0 | 0 |
2025-08-21T01:00:00+10:00 | 71.897430038 | 0.002538875804 | 0.0013372099454 |
2025-08-21T02:00:00+10:00 | 100 | 0 | 0 |
2025-08-21T03:00:00+10:00 | 100 | 0 | 0 |
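To make the pattern in a table like this easier to read, the hourly rows can be merged into outage episodes (consecutive hours at 100% loss). This is only a minimal Python sketch: the function names are made up for illustration, and it assumes the rows keep the pipe-separated layout shown above, with one row per affected hour.

```python
from datetime import datetime, timedelta

# A subset of the hourly rows posted above: "iso_time | loss | delay | stddev |".
ROWS = """\
iso_time | loss | delay | stddev |
2025-06-23T05:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T06:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T07:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T08:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T09:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T10:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T11:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T12:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T13:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T14:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T15:00:00+10:00 | 100 | 0 | 0 |
2025-06-29T16:00:00+10:00 | 67.538149056 | 0.037166575359 | 0.13227880102 |
2025-07-07T08:00:00+10:00 | 100 | 0 | 0 |
2025-07-07T09:00:00+10:00 | 100 | 0 | 0 |
"""


def parse_rows(text):
    """Return a list of (timestamp, loss_percent) tuples, skipping the header."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 4 or not parts[0] or parts[0] == "iso_time":
            continue
        rows.append((datetime.fromisoformat(parts[0]), float(parts[1])))
    return rows


def outage_episodes(rows, threshold=100.0):
    """Merge consecutive hourly buckets with loss >= threshold.

    Returns a list of (first_hour, number_of_hours) tuples.
    """
    episodes = []  # each entry: [first_hour, last_hour, hours]
    for ts, loss in sorted(rows):
        if loss < threshold:
            continue
        if episodes and ts - episodes[-1][1] == timedelta(hours=1):
            episodes[-1][1] = ts
            episodes[-1][2] += 1
        else:
            episodes.append([ts, ts, 1])
    return [(first, hours) for first, _, hours in episodes]


if __name__ == "__main__":
    for first, hours in outage_episodes(parse_rows(ROWS)):
        print(first.isoformat(), f"{hours} h at 100% loss")
```

Partial-loss hours (such as the 67.5% row) are deliberately left out of the episodes; lowering `threshold` would include them.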
Quote from: jstarta on August 21, 2025, 09:38:14 PM[...]
What logs should I be looking at to help me figure out what the issue is? [...]
I'd look at ARP. One of the logs (General, I believe) may log ARP changes, but that's usually only when ARP moves between bridge member interfaces. You'll probably have to look when you lose connectivity. It could also be the (apparent) i226 ASPM issue (https://forum.opnsense.org/index.php?topic=48296.0).
The most important bit in this mystery is the type of your WAN connection.
Quote from: pfry on August 23, 2025, 01:26:19 AM
Quote from: jstarta on August 21, 2025, 09:38:14 PM[...]
What logs should I be looking at to help me figure out what the issue is? [...]
I'd look at ARP. One of the logs (General, I believe) may log ARP changes, but that's usually only when ARP moves between bridge member interfaces. You'll probably have to look when you lose connectivity. It could also be the (apparent) i226 ASPM issue (https://forum.opnsense.org/index.php?topic=48296.0).
I only see two entries for the WAN interface. I'll take a look at my BIOS for the ASPM settings (thanks for the hint).
Quote from: Jyling on August 23, 2025, 04:33:48 AM
The most important bit in this mystery is the type of your WAN connection.
It's set as an IPv4 DHCP connection, though I guess technically it's static IPv4 because my ISP gives me a static IP.
Just a quick add: I checked the BIOS for any ASPM stuff but couldn't see anything. I did see an ErP Ready setting, which I've just disabled (it seemed to have something to do with limiting power).
I added the tunable "hw.pci.enable_aspm" and set it to 0. I'll give it a reboot at some point and then see how it all goes. This BIOS is definitely lacking a lot of advanced features :(
Quote from: jstarta on August 23, 2025, 05:18:46 AM
It's set as an IPv4 DHCP connection, though I guess technically it's static IPv4 because my ISP gives me a static IP
Cable, Ethernet or fiberoptics?
Curious if you ever resolved the issue, as I have the same MSI and the same issue. I tested IPFire for a week and never had a problem, so I know the hardware is solid. If you did resolve the issue, could you please share the fix? Thank you.
Quote from: Jyling on August 23, 2025, 05:42:39 PM
Quote from: jstarta on August 23, 2025, 05:18:46 AM
It's set as an IPv4 DHCP connection, though I guess technically it's static IPv4 because my ISP gives me a static IP
Cable, Ethernet or fiberoptics?
Ethernet. Setting that tunable didn't seem to fix things unfortunately.
I had a look at the pciconf output for both interfaces, and it looks like the tunable 'hw.pci.enable_aspm=0' didn't take effect, because the output states that ASPM is still enabled:
root@OPNsense:~ # pciconf -lbcevV igc1
igc1@pci0:89:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x1462 subdevice=0xb0b1
vendor = 'Intel Corporation'
device = 'Ethernet Controller I226-V'
class = network
subclass = ethernet
bar [10] = type Memory, range 32, base 0x6a300000, size 1048576, enabled
bar [1c] = type Memory, range 32, base 0x6a400000, size 16384, enabled
cap 01[40] = powerspec 3 supports D0 D3 current D0
cap 05[50] = MSI supports 1 message, 64 bit, vector masks
cap 11[70] = MSI-X supports 5 messages, enabled
Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
max read 512
link x1(x1) speed 5.0(5.0) ASPM L1(L1)
ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
ecap 0003[140] = Serial 1 d843aeffffbc6cac
ecap 0018[1c0] = LTR 1
ecap 001f[1f0] = Precision Time Measurement 1
ecap 001e[1e0] = L1 PM Substates 1
root@OPNsense:~ # pciconf -lbcevV igc0
igc0@pci0:88:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x1462 subdevice=0xb0b1
vendor = 'Intel Corporation'
device = 'Ethernet Controller I226-V'
class = network
subclass = ethernet
bar [10] = type Memory, range 32, base 0x6a600000, size 1048576, enabled
bar [1c] = type Memory, range 32, base 0x6a700000, size 16384, enabled
cap 01[40] = powerspec 3 supports D0 D3 current D0
cap 05[50] = MSI supports 1 message, 64 bit, vector masks
cap 11[70] = MSI-X supports 5 messages, enabled
Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
max read 512
link x1(x1) speed 5.0(5.0) ASPM L1(L1)
ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
ecap 0003[140] = Serial 1 d843aeffffbc6cab
ecap 0018[1c0] = LTR 1
ecap 001f[1f0] = Precision Time Measurement 1
ecap 001e[1e0] = L1 PM Substates 1
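The part of these dumps that matters for the ASPM question is the PCIe "link" line. A minimal Python sketch for pulling the ASPM field out of saved `pciconf -lbcevV` output follows. It assumes the field always has the form `ASPM <state>(<state>)` as in the dumps in this thread, and that the first value is the active state and the value in parentheses is what the link supports; the function name is made up for illustration.

```python
import re

# Matches e.g. "ASPM L1(L1)" or "ASPM disabled(L1)" on the PCIe "link" line.
ASPM_RE = re.compile(r"ASPM\s+([^\s(]+)\(([^)]*)\)")


def aspm_state(pciconf_text):
    """Return (active, supported) from pciconf output, or None if no ASPM field."""
    match = ASPM_RE.search(pciconf_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # The "link" line as posted for igc1 in this thread.
    sample = "link x1(x1) speed 5.0(5.0) ASPM L1(L1)"
    active, supported = aspm_state(sample)
    if active.lower() == "disabled":
        print("ASPM is disabled on this link")
    else:
        print(f"ASPM is still active: {active} (supported: {supported})")
```

Saving the output of `pciconf -lbcevV igc0` and `pciconf -lbcevV igc1` to a file before and after a reboot and running both through `aspm_state` gives a quick before/after comparison.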
Hmmm, well, my N150 box with the I226-V has ASPM disabled on igc, but I don't see which setting disables it; my hw.pci.enable_aspm seems to be set to "1".
Are you running powerd?
sysctl -a |grep hw.pci.enable
hw.pci.enable_pcie_ei: 0
hw.pci.enable_pcie_hp: 1
hw.pci.enable_mps_tune: 1
hw.pci.enable_aspm: 1
hw.pci.enable_ari: 1
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.pci.enable_io_modes: 1
pciconf -lbcevV igc1
cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
max read 512
link x1(x1) speed 5.0(5.0) ASPM disabled(L1)
Is your WAN on DHCP? What does the lease time look like?
Look for the "option dhcp-lease-time" line in "/var/db/dhclient.leases.igcX", X being your WAN iface number.
@jstarta: Please show the output of "sysctl hw.pci" - I do not believe that the ASPM setting was applied correctly.
Quote from: BrandyWine on August 27, 2025, 05:54:33 AM
[...] You have WAN dhcp? What does the lease time look like? [...]
Not sure if this is normal, but there are a lot of leases:
root@OPNsense:~ # cat /var/db/dhclient.leases.igc1
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
renew 3 2025/8/27 07:24:18;
rebind 3 2025/8/27 07:35:33;
expire 3 2025/8/27 07:39:18;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
option dhcp-renewal-time 900;
option dhcp-rebinding-time 1575;
renew 3 2025/8/27 07:32:05;
rebind 3 2025/8/27 07:43:20;
expire 3 2025/8/27 07:47:05;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
renew 3 2025/8/27 07:47:05;
rebind 3 2025/8/27 07:58:20;
expire 3 2025/8/27 08:02:05;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
option dhcp-renewal-time 900;
option dhcp-rebinding-time 1575;
renew 3 2025/8/27 08:02:05;
rebind 3 2025/8/27 08:13:20;
expire 3 2025/8/27 08:17:05;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
renew 3 2025/8/27 08:17:05;
rebind 3 2025/8/27 08:28:20;
expire 3 2025/8/27 08:32:05;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
option dhcp-renewal-time 900;
option dhcp-rebinding-time 1575;
renew 3 2025/8/27 08:32:05;
rebind 3 2025/8/27 08:43:20;
expire 3 2025/8/27 08:47:05;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
option dhcp-renewal-time 900;
option dhcp-rebinding-time 1575;
renew 3 2025/8/27 08:47:06;
rebind 3 2025/8/27 08:58:21;
expire 3 2025/8/27 09:02:06;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
renew 3 2025/8/27 09:02:06;
rebind 3 2025/8/27 09:13:21;
expire 3 2025/8/27 09:17:06;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
option dhcp-renewal-time 900;
option dhcp-rebinding-time 1575;
renew 3 2025/8/27 09:17:06;
rebind 3 2025/8/27 09:28:21;
expire 3 2025/8/27 09:32:06;
}
lease {
interface "igc1";
fixed-address AAA.BBB.CC1.132;
option subnet-mask 255.255.252.0;
option routers AAA.BBB.CC0.1;
option domain-name-servers XXX.YYY.ZZZ.142,XXX.YYY.ZZZ.242;
option host-name "opnsense";
option dhcp-lease-time 1800;
option dhcp-message-type 5;
option dhcp-server-identifier AAA.BBB.CC0.1;
renew 3 2025/8/27 09:32:06;
rebind 3 2025/8/27 09:43:21;
expire 3 2025/8/27 09:47:06;
}
Quote from: meyergru on August 27, 2025, 09:55:23 AM
@jstarta: Please show the output of "sysctl hw.pci" - I do not believe that the ASPM setting was applied correctly.
root@OPNsense:~ # sysctl hw.pci
hw.pci.mcfg: 1
hw.pci.host_mem_start: 2147483648
hw.pci.default_vgapci_unit: 0
hw.pci.enable_pcie_ei: 0
hw.pci.pcie_hp_detach_timeout: 5000
hw.pci.enable_pcie_hp: 1
hw.pci.clear_pcib: 0
hw.pci.iov_max_config: 1048576
hw.pci.intx_reroute: 1
hw.pci.enable_mps_tune: 1
hw.pci.clear_aer_on_attach: 0
hw.pci.enable_aspm: 0
hw.pci.enable_ari: 1
hw.pci.clear_buses: 0
hw.pci.clear_bars: 0
hw.pci.usb_early_takeover: 1
hw.pci.honor_msi_blacklist: 1
hw.pci.msix_rewrite_table: 0
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.pci.do_power_suspend: 0
hw.pci.do_power_resume: 1
hw.pci.do_power_nodriver: 0
hw.pci.realloc_bars: 1
hw.pci.enable_io_modes: 1
root@OPNsense:~ # pciconf -lbcevV igc1
igc1@pci0:89:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x1462 subdevice=0xb0b1
vendor = 'Intel Corporation'
device = 'Ethernet Controller I226-V'
class = network
subclass = ethernet
bar [10] = type Memory, range 32, base 0x6a300000, size 1048576, enabled
bar [1c] = type Memory, range 32, base 0x6a400000, size 16384, enabled
cap 01[40] = powerspec 3 supports D0 D3 current D0
cap 05[50] = MSI supports 1 message, 64 bit, vector masks
cap 11[70] = MSI-X supports 5 messages, enabled
Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR RO NS
max read 512
link x1(x1) speed 5.0(5.0) ASPM L1(L1)
ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
ecap 0003[140] = Serial 1 d843aeffffbc6cac
ecap 0018[1c0] = LTR 1
ecap 001f[1f0] = Precision Time Measurement 1
ecap 001e[1e0] = L1 PM Substates 1
That is really strange. The sysctl seems to be active, yet your ASPM is enabled? Never saw that. Mine is disabled, but I can disable it in the BIOS, too.
Maybe you could ask MSI for a BIOS where you can disable that. Also, if it is a standard BIOS, there are tools out there with which you can modify your BIOS to show more settings. Of course, you need the BIOS image first and some companies do not even offer any.
How do I confirm that the igc driver is loaded correctly? I think I read somewhere there should be a kernel module present and loaded. I can't find it now though. I'd have thought that because it's identified the device that the driver wouldn't be the issue.
I'm new to BSD so I don't know how to really troubleshoot this stuff unfortunately.
The command is kldstat, but I think most of the common NIC drivers are statically linked.
I had this exact issue recently, and I spent over a month trying to figure it out.
In the end, I did many things in a desperate attempt to get reliable connectivity back, so I can't pinpoint one thing, but the sum of things was this:
1. Deleted all the VLANs I had recently created, going back to a single default network. Probably didn't help, but it was a variable.
2. Switched from CloudFlare to Quad-9. PingTool showed CloudFlare frequently becoming unresponsive, and when pinging the OPNsense WAN interface, the default gateway, and CloudFlare / Quad-9 on both IPv4 and IPv6, the point of disconnect most commonly showed up between my WAN interface and the default gateway. But CloudFlare was still unresponsive more often than Quad-9.
3. Had a technician install a splitter between the street and the cable modem, as the signal was "coming in too hot".
I believe #3 had the most to do with signal quality and reliability, and I'm going to go back and test #1 as soon as I get time.
FYI, I searched all over the place for a good Ping program that would run simultaneous pings to multiple addresses and log the results. PingTool (ping-tool.com) was the best I could find for free. It doesn't do fancy graphs, but it does keep a running table of results, and it will email you if it sees one of your targets go down or come back up for a specified time period.
Quote from: jstarta on August 27, 2025, 11:28:25 AM
Not sure if this is normal, but there are a lot of leases: [...]
The list of leases is just one lease that keeps getting renewed; the system keeps a historical record when the DHCP iface comes up.
I think your lease is set to expire after 30 minutes, and the DHCP client does the renew request every 15 minutes, halfway through the lease time.
That seems a bit fast, but I'm not sure it's causing the issue.
DHCP renewals are, for the most part, just noise and should not cause an iface to bounce in any way.
What type of DHCP device is connected to your WAN iface?
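The 15-minute renew interval lines up with the numbers in the lease file: a lease time of 1800 s with a renewal time of 900 s and a rebinding time of 1575 s, which are the usual defaults of 0.5 and 0.875 of the lease time from RFC 2131. A small Python sketch of that arithmetic, with made-up function names, checked against one of the posted lease entries:

```python
from datetime import datetime, timedelta


def dhcp_timers(lease_seconds):
    """Default renewal (T1) and rebinding (T2) times as fractions of the lease.

    RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease. A server may
    override both, so treat these as the expected values only.
    """
    return int(lease_seconds * 0.5), int(lease_seconds * 0.875)


def lease_schedule(expire, lease_seconds):
    """Given the expiry time of a lease, return its (renew, rebind) times."""
    start = expire - timedelta(seconds=lease_seconds)
    t1, t2 = dhcp_timers(lease_seconds)
    return start + timedelta(seconds=t1), start + timedelta(seconds=t2)


if __name__ == "__main__":
    # Values from one posted entry: lease time 1800, expire 2025/8/27 07:47:05.
    renew, rebind = lease_schedule(datetime(2025, 8, 27, 7, 47, 5), 1800)
    print("renew :", renew)
    print("rebind:", rebind)
```

If the computed times match the `renew` and `rebind` lines in the lease file, the client is simply following the normal schedule for a 30-minute lease.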
Quote from: meyergru on August 27, 2025, 12:19:40 PM
The command is kldstat, but I think most of the common NIC drivers are statically linked.
We need to distinguish between linking and loading.
kldstat only lists dynamically loaded KLMs, i.e. stuff outside the compiled kernel.
Static loading usually means it's in-tree and compiled into the kernel.
To save on kernel size, I would prefer the in-tree stuff to mostly live as dynamic KLMs, so that only what's needed gets loaded during init.
It's nice to have big almost-monolithic kernel, but then not so nice due to size. Pros & Cons.
But no security device would ever run full monolithic.
modprobe perhaps a better utility.
Quote from: jstarta on August 27, 2025, 12:13:46 PM
How do I confirm that the igc driver is loaded correctly?
Well, loaded "correctly" will be hard to know.
It's either loaded into kernel or it's not.
Your iface list post #9 shows igc, yes? That's the driver being used by kernel for i226-V intel controller.
With other lower numbered intel controllers we find igb being used.
The correct driver is loaded. I suspect that's not a place to be looking, igc from kernel build works good for i226-V.
Also, when you say "losing WAN", how so? Is it only connections that rely on DNS that fail, or is all IP traffic dead?
Maybe turn off "allow DHCP to override system-set DNS settings"; maybe your ISP's DNS is flaky? Try setting your fw DNS to something like 9.9.9.11.
The logs will clearly show if the WAN iface bounced.
Try a continuous ping from the fw to something outside, see if it ever drops off.
Quote from: allenlook on August 27, 2025, 04:26:55 PM
3. Had a technician install a splitter between the street and the cable modem, as the signal was "coming in too hot".
Your scenario is different from the OP's. You are on cable, they are on Ethernet.
The OP should ping from the OPNsense box itself, not from the LAN, to eliminate the FreeBSD+FW quirks.
If there is any variance in ping results between the LAN and the OPNsense box, then the problem is within the router. Otherwise it is between the GW interface and the provider's infrastructure.
A good test would be to use an alternative router and ping from it, then compare. Temporarily use any non-FreeBSD router distro.
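The comparison described above can be written down as a small Python sketch. It assumes two saved `ping` runs against the same target, one from the firewall and one from a LAN host, and reads the loss figure from the usual "N packets transmitted, ... % packet loss" summary line. The function names and the 1% tolerance are made up for illustration.

```python
import re

# Summary line, e.g. "10 packets transmitted, 10 packets received, 0.0% packet loss"
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received,.*?([\d.]+)% packet loss"
)


def parse_loss(ping_output):
    """Return the packet loss percentage from ping output, or None if absent."""
    match = SUMMARY_RE.search(ping_output)
    return float(match.group(3)) if match else None


def locate_fault(fw_loss, lan_loss, tolerance=1.0):
    """Apply the reasoning above to two loss figures for the same target."""
    if fw_loss <= tolerance and lan_loss <= tolerance:
        return "no loss observed"
    if abs(fw_loss - lan_loss) > tolerance:
        return "results differ: look inside the router"
    return "results agree: look between the WAN interface and the provider"


if __name__ == "__main__":
    fw = parse_loss("10 packets transmitted, 0 packets received, 100.0% packet loss")
    lan = parse_loss("10 packets transmitted, 0 packets received, 100.0% packet loss")
    print(locate_fault(fw, lan))
```

Both runs need to cover the same time window, otherwise a short outage can show up in one capture and not the other.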
$ kldstat
Id Refs Address Size Name
1 71 0xffffffff80200000 216dad8 kernel
2 1 0xffffffff8236e000 16650 if_lagg.ko
3 2 0xffffffff82385000 3558 if_infiniband.ko
4 1 0xffffffff82389000 ed60 if_bridge.ko
5 2 0xffffffff82398000 8990 bridgestp.ko
6 1 0xffffffff823a2000 1e280 opensolaris.ko
7 1 0xffffffff823c1000 11a78 pfsync.ko
8 3 0xffffffff823d3000 908a0 pf.ko
9 1 0xffffffff82464000 3c10 pflog.ko
10 1 0xffffffff832ce000 aa30 if_gre.ko
11 1 0xffffffff832d9000 4be0 if_enc.ko
12 1 0xffffffff832de000 fb90 carp.ko
13 1 0xffffffff832ee000 5e9300 zfs.ko
14 1 0xffffffff84510000 b4270 if_iwlwifi.ko
15 1 0xffffffff845c5000 3378 lindebugfs.ko
16 1 0xffffffff845c9000 d200 rtsx.ko
17 1 0xffffffff845d7000 4250 ichsmb.ko
18 1 0xffffffff845dc000 2178 smbus.ko
19 1 0xffffffff845df000 3390 acpi_wmi.ko
20 1 0xffffffff845e3000 5640 ng_ubt.ko
21 4 0xffffffff845e9000 abb8 netgraph.ko
22 3 0xffffffff845f4000 a250 ng_hci.ko
23 2 0xffffffff845ff000 2670 ng_bluetooth.ko
24 1 0xffffffff84602000 2f5c0 if_wg.ko
25 1 0xffffffff84632000 4850 nullfs.ko
Yep, I don't think it's a driver issue specifically. I have already disabled "Allow DNS server list to be overridden by DHCP/PPP on WAN" as well.
When it drops out, it's just 100% packet loss. Next time it happens, I'll try to capture as many different types of logs as I can.
What sort of logs should I be capturing to try and help us identify the root cause?
Quick edit: I've set up a ping from OPNsense to my remote VPS, and I have it pinging back as well so I can monitor traffic in both directions.
Also, I just wanted to quickly thank everybody for your help so far. It's been fantastic and I'm learning a lot. Hopefully we can get to the bottom of it, as there are a few others who also have issues.
For brevity's sake, here are the tunables I've added so far:
hw.pci.enable_aspm = 0
hw.em.smart_pwr_down = 0
hw.pci.do_power_nodriver = 0
hw.pci.do_power_suspend = 0
net.link.ether.inet.max_age = 120
dev.igc.0.fc = 0
dev.igc.1.fc = 0
hw.igc.eee_setting = 0
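Since an earlier tunable in this thread looked as if it had not taken effect, it can help to compare a list like this against what `sysctl` actually reports after a reboot. A minimal Python sketch follows; it assumes the tunables are written as `name = value` (as above) and that the live values are pasted in the `name: value` form that `sysctl` prints. The function names are made up for illustration.

```python
WANTED = """\
hw.pci.enable_aspm = 0
hw.em.smart_pwr_down = 0
hw.pci.do_power_nodriver = 0
hw.pci.do_power_suspend = 0
net.link.ether.inet.max_age = 120
dev.igc.0.fc = 0
dev.igc.1.fc = 0
hw.igc.eee_setting = 0
"""

# Three lines taken from the "sysctl hw.pci" output posted earlier in the thread.
LIVE = """\
hw.pci.enable_aspm: 0
hw.pci.do_power_suspend: 0
hw.pci.do_power_nodriver: 0
"""


def parse_pairs(text, sep):
    """Parse 'name<sep>value' lines into a dict of stripped strings."""
    pairs = {}
    for line in text.splitlines():
        if sep in line:
            name, value = line.split(sep, 1)
            pairs[name.strip()] = value.strip()
    return pairs


def unapplied(wanted_text, sysctl_text):
    """Return {name: (wanted, live)} for every tunable whose live value differs.

    A live value of None only means the name was not in the pasted output.
    """
    wanted = parse_pairs(wanted_text, "=")
    live = parse_pairs(sysctl_text, ":")
    return {n: (v, live.get(n)) for n, v in wanted.items() if live.get(n) != v}


if __name__ == "__main__":
    for name, (want, have) in unapplied(WANTED, LIVE).items():
        print(f"{name}: wanted {want}, saw {have}")
```

This only confirms that the sysctl values are set; whether the hardware honours them is a separate check against the `pciconf` output.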
Do you have a /var/log/messages file? If so you can cat or grep that file looking for entries related to igc or interfaces. State changes should be logged.
I also suspect it's not related to any power or sleep settings: the WAN iface is always active just from the fw itself doing stuff, and the fw never actually drops into a sleep state.
The interface hardware seems OK, so we need to look elsewhere. A DHCP issue is a DHCP issue, not a hardware issue, etc. I don't suspect DHCP either.
I did mean to ask earlier: in your DHCP client file, is the provided IP always the same, or did it change?
When you say "100% packet loss", what tool is used to derive that? Ping using IP? Other?
Another thing to look at is "arp -a". Note the entry on your WAN igc interface and its MAC address, then keep running the command and watch the timer count down. When the timer gets to zero, watch for the ARP renewal and check that you get an IP and MAC address back quickly; any delay here would cause 100% packet loss. Your Intel WAN iface should be the MAC that starts with 00:e0:b4, so you want to look at the other entry, the one with the timer (usually at the top of the list). This is your DFG, aka the ISP's IP and MAC on the WAN side.
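Watching `arp -a` by hand is tedious, so here is a minimal Python sketch that compares two saved snapshots and reports whether the gateway entry went missing or changed MAC. It assumes the FreeBSD-style line format `? (IP) at MAC on IFACE expires in N seconds [ethernet]`. The addresses and MACs in the example are placeholders, not values from this thread, and the function names are made up for illustration.

```python
import re

# e.g. "? (192.0.2.1) at 02:00:00:aa:bb:01 on igc1 expires in 1176 seconds [ethernet]"
ARP_RE = re.compile(
    r"\((?P<ip>[^)]+)\) at (?P<mac>[0-9a-fA-F:]{17}) on (?P<ifc>\S+)"
)


def parse_arp(text):
    """Return {ip: (mac, interface)}; incomplete entries are skipped."""
    table = {}
    for line in text.splitlines():
        match = ARP_RE.search(line)
        if match:
            table[match["ip"]] = (match["mac"].lower(), match["ifc"])
    return table


def check_gateway(before, after, gateway_ip):
    """Compare the gateway entry between two 'arp -a' snapshots."""
    old = parse_arp(before).get(gateway_ip)
    new = parse_arp(after).get(gateway_ip)
    if new is None:
        return "missing"
    if old is not None and old[0] != new[0]:
        return "changed"
    return "ok"


if __name__ == "__main__":
    # Placeholder snapshots (documentation addresses, invented MACs).
    snap1 = "? (192.0.2.1) at 02:00:00:aa:bb:01 on igc1 expires in 1176 seconds [ethernet]"
    snap2 = "? (192.0.2.1) at 02:00:00:aa:bb:02 on igc1 expires in 1199 seconds [ethernet]"
    print(check_gateway(snap1, snap2, "192.0.2.1"))
```

A result of "missing" during an outage would point at ARP resolution of the gateway; "ok" with the same MAC would suggest looking elsewhere.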
Quote from: BrandyWine on August 28, 2025, 03:49:36 AM
Do you have a /var/log/messages file? [...]
When you say "100% packet loss", what tool is used to derive that? [...]
Another thing to look at is "arp -a" [...]
There was no /var/log/messages file unfortunately. Under the Gateways configuration it shows Loss: 100%.
The provided IP address is always the same. I'll keep looking at that 'arp -a' command; I had a look a few times and it always seemed to refresh in the last 5 seconds or so.
So it is from the router. Swap it for anything else that is not based on FreeBSD and compare. If the packet loss persists, then kick your provider in the ribs.
Quote from: Jyling on August 28, 2025, 03:39:10 PM
So it is from the router. Swap it for anything else that is not based on FreeBSD and compare. If the packet loss persists, then kick your provider in the ribs.
I would place a small switch on the WAN side (so no need to take out the fw), then plug in a laptop or something just for short period to see, the ISP should hand out more than 1 IP. Run a continuous ping to something outside, see what happens.
Quote from: BrandyWine on August 28, 2025, 06:51:38 PM
[...] the ISP should hand out more than 1 IP
Good luck with that, in most scenarios.
I've been unable to get to the bottom of the issues unfortunately, so I've put it in a VM under Proxmox. It took a bit of doing because Unbound and dnsmasq are the defaults now, and I didn't want to just restore from backup, so that I wouldn't bring across any weird nonsense I had done on my previous install when trying to get stuff working.
I'll let everybody know how things go - I really wish I could have figured it out but it was getting on my nerves constantly having to restart stuff.
The fw WAN is likely not in a /30.
So let's ask.... OP, what subnet is your FW WAN getting from dhcp, or now whatever OS is connecting to the ISP?
Not sure what version of OPNsense you are running, but note that FreeBSD 14.3-RELEASE has a noted fix for the igc driver.
https://www.freebsd.org/releases/14.3R/relnotes/
Have you seen the other thread related to igc?
Maybe look that over.
https://forum.opnsense.org/index.php?topic=48564.msg246022#msg246022
Quote from: BrandyWine on August 30, 2025, 06:21:37 AM
The fw WAN is likely not in a /30.
So let's ask.... OP, what subnet is your FW WAN getting from dhcp, or now whatever OS is connecting to the ISP?
It's a /22, but I have a static IP so I'll always get the same IP from the ISP.
Quote from: BrandyWine on August 30, 2025, 08:26:50 AM
Not sure what version of OPNsense you are running, but note that FreeBSD 14.3-RELEASE has a noted fix for the igc driver.
https://www.freebsd.org/releases/14.3R/relnotes/
It's on the latest.
I'll have a look at those links you've sent. So far the switch to running OPNsense as a VM under Proxmox has been flawless.
Quote from: jstarta on August 31, 2025, 11:01:55 AM
It's a /22, but I have a static IP so I'll always get the same IP from the ISP
Is that /22 an rfc1918 block, or ISP public? Do you also get DHCP?
Quote from: BrandyWine on September 02, 2025, 06:32:38 PM
Quote from: jstarta on August 31, 2025, 11:01:55 AM
It's a /22, but I have a static IP so I'll always get the same IP from the ISP
Is that /22 an rfc1918 block, or ISP public? Do you also get DHCP?
No, it's not. When I initially went with this ISP I had them disable CG-NAT (In case that's what you were thinking it might be).
I've had 5 days uptime with zero problems since installing it as a VM (Under Proxmox), so it seems it's an issue with BSD drivers.
Quote from: jstarta on September 02, 2025, 09:34:25 PM
I've had 5 days uptime with zero problems since installing it as a VM (Under Proxmox), so it seems it's an issue with BSD drivers.
Using the same hardware?
When did you get to the newer OPNsense 25.7.x (bsd 14.3) ?
Maybe there's a conflict between the bsd kernel driver and the controller firmware. Must be millions of devices running the Intel 226 running bsd 14.3 (guessing how many).
I don't recall from the thread: did you try an older version of OPNsense (https://docs.opnsense.org/releases.html)? That would have given you a definitive answer on whether bsd 14.3 with the updated igc driver was the cause.
I want to chime in and say I have had the same problem of WAN experiencing 100% Packet Loss periodically since upgrading to 25.7. I have two CWWK mini PC routers, one with an N100 CPU and the other with an N350 CPU. Both come with four i226 ports. I have a Verizon Fios connection (residential).
The 100% packet loss happens about once a day. Interestingly, when the N350 mini PC router experienced 100% packet loss, the machine continued to function on the LAN side, but with no WAN connection. I reviewed all the logs but couldn't find anything wrong. When the N100 mini PC router experienced 100% packet loss, it would reboot itself, and the WAN connection would come back.
I have tried all the suggestions in this thread, and nothing worked. Here are the things I have tried:
- Setting hw.pci.enable_aspm to 0 didn't make a difference.
- Setting hw.igc.eee_setting to 0 didn't make a difference.
- I have checked the WAN DHCP leases, and it seemed to work as expected.
- I have turned off VTd in the BIOS, but it didn't make a difference.
For my next troubleshooting step, I will install Proxmox on one of the mini PCs and virtualize OPNsense to see if the problem persists.
Quote from: pdhsker on September 06, 2025, 04:10:06 PM[...]
I have checked the WAN DHCP leases, and it seemed to work as expected.[...]
How about ARP? Don't just check presence - check that the MACs are correct. Incorrect ARP is unlikely outside of a bridged link with multiple devices, but you never can tell.
Quote from: pdhsker on September 06, 2025, 04:10:06 PM
I have tried all the suggestions in this thread, and nothing worked.
An arp issue would be very strange considering DHCP is working ok.
Do you have IPv6 enabled on WAN side? If so try disabling IPv6.
Quote from: BrandyWine on September 07, 2025, 08:53:57 AM
Quote from: pdhsker on September 06, 2025, 04:10:06 PM
I have tried all the suggestions in this thread, and nothing worked.
An arp issue would be very strange considering DHCP is working ok.
Do you have IPv6 enabled on WAN side? If so try disabling IPv6.
I did have IPv6 enabled. I will try to disable it and see if that makes any difference.
On the other hand, I have installed Proxmox on the N350 mini pc router, then Opnsense as a VM. It has been more than 24 hours, and the router has been stable so far.
Quote from: pdhsker on September 08, 2025, 04:36:18 AM
[...] I have installed Proxmox on the N350 mini pc router, then Opnsense as a VM. It has been more than 24 hours, and the router has been stable so far.
Making it a VM isn't really fixing the issue. It's a workaround with an impact on performance, plus now you have to manage Proxmox.
Quote from: BrandyWine on September 08, 2025, 07:15:48 AM
Making it a VM isn't really fixing the issue. It's a workaround with an impact on performance, plus now you have to manage Proxmox.
I know, but my purpose was to find out if the problem was caused by me making mistakes in my settings or the driver. The VM Opnsense has been stable for five days, so I am pretty sure the problem was with the driver. Surprisingly, I didn't see many people reporting this problem, so I wonder if a combination of driver issues and my settings causes it.
Quote from: jstarta on August 27, 2025, 01:54:46 AM
root@OPNsense:~ # pciconf -lbcevV igc1
igc1@pci0:89:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x1462 subdevice=0xb0b1
Ok, your 226V looks weird. subvendor 1462? 1462 is vendor MSI, and subdevice b0b1?
Wow, that's weird. MSI does not make the 226V, so their ID should not be used at all. Drivers do depend on hardware ID's, so perhaps that's your issue.
All the 226V's that I have seen thus far are full Intel device "class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000"
I would go to the Intel i226 Firmware thread and flash the 226's. The 226's are still physical devices no matter what stuff runs above that layer, etc. After that I would think about just going back to native OPNsense, no VM.
Quote
Field | Description |
Subvendor ID | Identifies the manufacturer of the PCI device. |
Subdevice ID | Identifies the specific model or version of the device. |
These identifiers are crucial for device drivers and operating systems to correctly recognize and manage PCI devices.
Quote from: pdhsker on September 11, 2025, 05:03:55 AM
Quote from: BrandyWine on September 08, 2025, 07:15:48 AM
Making it a VM isn't really fixing the issue. It's a workaround with an impact on performance, plus now you have to manage Proxmox.
I know, but my purpose was to find out if the problem was caused by me making mistakes in my settings or the driver. The VM Opnsense has been stable for five days, so I am pretty sure the problem was with the driver. Surprisingly, I didn't see many people reporting this problem, so I wonder if a combination of driver issues and my settings causes it.
I just wanted to leave a quick note here that after upgrading to 25.7.5, the system has been running smoothly for more than a week without dropping the connection. Hopefully, the underlying problem has been fixed.