netmap_transmit bce0 drop mbuf that needs checksum offload

Started by andreaslink, September 14, 2020, 10:58:00 PM

September 14, 2020, 10:58:00 PM Last Edit: September 24, 2020, 07:36:44 PM by andreaslink
I'm running OPNsense (20.7.2-amd64) with one Broadcom NetXtreme II BCM5709 port for WAN (bce0) and one for LAN (bce1). In addition I have 4x Intel 82580, which I use for other LANs such as IoT (igb1) and guests (igb0).

I have "some" traffic on WAN, a fairly constant 60 to 100 Mbit/s (mainly due to IP cam streams), which I consider manageable with my setup. I also have IDS/IPS up and running as well as Sensei.

After "a while" (usually only minutes after reboot) of traffic I get the following error in the log, multiple times per second:

2020-09-10T00:28:10   kernel   490.690419 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:28:05   kernel   485.572543 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:28:00   kernel   480.194945 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:28:00   kernel   479.940436 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:54   kernel   474.761838 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:49   kernel   469.475112 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:44   kernel   464.324372 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:39   kernel   459.205033 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:33   kernel   453.830080 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:28   kernel   448.126626 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:23   kernel   443.431391 [ 320] generic_netmap_register Emulated adapter for bce0 activated
2020-09-10T00:27:23   kernel   443.431259 [1130] generic_netmap_attach Emulated adapter for bce0 created (prev was NULL)
2020-09-10T00:27:23   kernel   bce0: permanently promiscuous mode enabled
2020-09-10T00:27:23   kernel   443.407436 [1035] generic_netmap_dtor Emulated netmap adapter for bce0 destroyed
2020-09-10T00:27:23   kernel   443.407409 [1130] generic_netmap_attach Emulated adapter for bce0 created (prev was NULL)
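
To quantify how often the drop is logged, the excerpt above can be tallied per interface. This is only a minimal sketch: the function name count_netmap_drops is made up for illustration, and it assumes the log lines keep exactly the wording shown above.

```shell
# Count "drop mbuf that needs checksum offload" messages per interface.
# Reads log lines on stdin; the interface is the word after "netmap_transmit".
count_netmap_drops() {
    awk '/netmap_transmit .* drop mbuf that needs checksum offload/ {
        for (i = 1; i < NF; i++)
            if ($i == "netmap_transmit") n[$(i + 1)]++
    }
    END { for (ifn in n) print ifn, n[ifn] }' | sort
}

# Possible usage on the firewall (kernel message buffer):
#   dmesg | count_netmap_drops
```

Fed with the excerpt above, this would report 10 drops for bce0; the generic_netmap_* and promiscuous-mode lines are ignored.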

As you can see in the attached screenshot, MBUF usage is at 0%: roughly 9,720 in use, far below the limit of 1,271,626, so there should be plenty of mbufs available.
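
For reference, the usage figure can be worked out from the two numbers quoted here. A minimal sketch; mbuf_pct is a hypothetical helper name, and the inputs are simply the values from this post:

```shell
# Print mbuf usage as a percentage with two decimals.
# $1 = mbufs in use, $2 = configured limit (kern.ipc.nmbclusters).
mbuf_pct() {
    awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.2f%%\n", used / limit * 100 }'
}

mbuf_pct 9720 1271626   # prints 0.76%
```

Well under one percent is consistent with the 0% shown on the dashboard, so mbuf exhaustion looks unlikely as the cause.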

So what triggers this error?

I can get rid of it by deactivating IDS/IPS; since I started testing without it, the error has not shown up again. So is it somehow related to IPS throughput? Nonetheless, I would like to turn IDS/IPS on again :).

How can I tune my system so that "netmap_transmit" can handle the load? (BTW: What process/step is it, and what does it do here?)
And why does the mbuf "need checksum offload"? What exactly does that mean?

Some more config details:

I have all three checkboxes ticked, so all three of these are disabled:
- Hardware CRC
- Hardware TSO
- Hardware LRO


root@OPNsense:~ # sysctl -a | grep nmbclusters
kern.ipc.nmbclusters: 1271626

root@OPNsense:~ # sysctl -a | grep msi
hw.sdhci.enable_msi: 1
hw.puc.msi_disable: 0
hw.pci.honor_msi_blacklist: 1
hw.pci.msix_rewrite_table: 0
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.mfi.msi: 1
hw.malo.pci.msi_disable: 0
hw.ix.enable_msix: 1
hw.bce.msi_enable: 1
hw.aac.enable_msi: 1
machdep.disable_msix_migration: 0
machdep.num_msi_irqs: 512
dev.igb.3.iflib.disable_msix: 0
dev.igb.2.iflib.disable_msix: 0
dev.igb.1.iflib.disable_msix: 0
dev.igb.0.iflib.disable_msix: 0


BTW: I also experimented with the following values, which did not bring any change:

kern.ipc.nmbclusters="2543660"
hw.bce.tso_enable="0"
hw.pci.enable_msix="0"
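
To rule out that such a value silently failed to apply, the configured value can be compared with what the running kernel reports. The sketch below is an assumption-laden illustration: loader_value is a made-up helper, and it assumes the tunables live in a loader.conf-style file of name="value" lines (the actual file depends on how they were set).

```shell
# Print the value assigned to tunable $2 in loader.conf-style file $1.
# If the name occurs more than once, the last assignment wins.
loader_value() {
    awk -F= -v key="$2" '
        $1 == key { gsub(/"/, "", $2); val = $2 }
        END { if (val != "") print val }' "$1"
}

# Possible comparison against the live system (the path is an assumption):
#   [ "$(loader_value /boot/loader.conf kern.ipc.nmbclusters)" = \
#     "$(sysctl -n kern.ipc.nmbclusters)" ] && echo "applied"
```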
Running OPNsense on 4 core Intel Xeon E5506, 20GB RAM, 2x Broadcom NetXtreme II BCM5709, 4x Intel 82580
Ubench Single CPU: 307897 (0.39s)

Hi @andreaslink, do you have offloadings and vlan hardware filtering set to disabled? See Interfaces -> Settings

If so, please try the official netmap test kernel which will be announced today

opnsense-update -kr 20.7.2-netmap


Awesome @mb, thank you! I have done that and rebooted:

root@OPNsense:~ # opnsense-update -kr 20.7.2-netmap
Fetching kernel-20.7.2-netmap-amd64.txz: ....... done
!!!!!!!!!!!! ATTENTION !!!!!!!!!!!!!!!
! A critical upgrade is in progress. !
! Please do not turn off the system. !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Installing kernel-20.7.2-netmap-amd64.txz... done
Please reboot.

I've also activated IDS/IPS again to monitor it now. Five minutes later there are no problems yet, so I'm still monitoring.
I'll keep you posted!

PS: And as requested, all offloadings and vlan hardware filtering were already set to disabled.

Just to give some feedback here: I have now been testing for 24 h under "full load" incl. IDS/IPS and Sensei, and the messages have not appeared again. So I consider this issue solved with the new kernel "kernel-20.7.2-netmap-amd64.txz"!

Thank you very much :)!

PS: I assume that installing this test kernel ahead of the next official update won't be a problem for the upcoming release, or will I get into trouble with this kernel now?

Hi Andreas,

That's great to hear. You're welcome, and thanks for the update.

No, you're fine. 20.7.3 will just install its own kernel.

We will have a new kernel with 20.7.3 from the looks of it, but we will give netmap another test round so it's a later 20.7.x for sure.


Cheers,
Franco

Hmm, OK, understandable and good to know. So I'm hoping I won't see these errors again, but if I do, at least I'll know why :). Thanks for making me aware.

Some bad news :(. The router had been running for nearly 6 days without this issue, but today, all of a sudden, the error messages returned:
[4006] netmap_transmit bce0 drop mbuf that needs checksum offload

MBUF usage is slightly higher than normal, but far (!) away from the critical maximum:

MBUF Usage  0% (10432/1271498)

So I'm not sure whether I can still consider this solved, but it is at least remarkably better.

I called it sudden, but I'm afraid the trigger might be related to my WireGuard site-to-site VPN: the errors started shortly after the other side sent a test ping, following a fairly long period without any traffic between the two locations. I'm not sure whether this is related or just coincidence, but I think I should mention it.

Hi @andreas,

Thanks for the mention. That might be related. It looks like somehow HW checksum offload was enabled on the interface. Netmap requires all HW offloading be disabled.

What does ifconfig bce0 show?

Sorry for the delay, I had to provoke it first.
I can now reliably reproduce it with these steps:

  • Reboot OPNsense --> Everything is fine
  • Run a constant ping from my network into the other network across the WireGuard tunnel --> Everything is fine
  • Stop my ping --> Everything is fine
  • Let the other WireGuard side ping a device in my local network --> MBUF error appears almost immediately with the first ping

So when everything is working fine it looks like this:

root@OPNsense:~ # ifconfig bce0
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
   options=80028<VLAN_MTU,JUMBO_MTU,LINKSTATE>
   ether 00:21:5e:c8:be:88
   inet6 fe80::221:5eff:fec8:be88%bce0 prefixlen 64 scopeid 0x1
   inet6 2a02:2f4:xxxx:xxx0:221:5eff:fec8:be88 prefixlen 64 autoconf
   inet6 fd00:0:cafe:affe:221:5eff:fec8:be88 prefixlen 64 autoconf
   inet 192.168.0.100 netmask 0xffffff00 broadcast 192.168.0.255
   media: Ethernet autoselect (1000baseT <full-duplex>)
   status: active
   nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>


After the ping from the remote WireGuard peer triggered the error:

root@OPNsense:~ # ifconfig bce0
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
   options=80028<VLAN_MTU,JUMBO_MTU,LINKSTATE>
   ether 00:21:5e:c8:be:88
   inet6 fe80::221:5eff:fec8:be88%bce0 prefixlen 64 scopeid 0x1
   inet6 2a02:2f4:xxxx:xxx0:221:5eff:fec8:be88 prefixlen 64 autoconf
   inet6 fd00:0:cafe:affe:221:5eff:fec8:be88 prefixlen 64 autoconf
   inet 192.168.0.100 netmask 0xffffff00 broadcast 192.168.0.255
   media: Ethernet autoselect (1000baseT <full-duplex>)
   status: active
   nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>


I cannot see any difference here.
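
To check this mechanically rather than by eye, the options= line can be filtered for offload capabilities. A rough sketch: offload_flags is a hypothetical helper, and the list of capability names (as FreeBSD's ifconfig prints them) is an assumption about which flags matter and may not be exhaustive.

```shell
# Read `ifconfig <if>` output on stdin and print any offload-related
# capabilities found in the interface options= line, one per line.
# The "nd6 options=" line is skipped because it does not start with "options=".
offload_flags() {
    sed -n 's/^[[:space:]]*options=[0-9a-f]*<\(.*\)>.*/\1/p' | head -n 1 |
        tr ',' '\n' |
        grep -E '^(RXCSUM|TXCSUM|RXCSUM_IPV6|TXCSUM_IPV6|TSO4|TSO6|LRO|VLAN_HWCSUM|VLAN_HWTSO|VLAN_HWFILTER)$' ||
        true   # no match is not an error here
}

# Possible usage:
#   ifconfig bce0 | offload_flags
```

For both outputs above (options=80028<VLAN_MTU,JUMBO_MTU,LINKSTATE>) this prints nothing, i.e. none of the listed offloads is reported as enabled on bce0 before or after the trigger.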

Hi Andreas,

I thought this might be related to netmap offloads being enabled. But it looks like this is different.

Can you share the exact mbuf error message? Maybe a screenshot?

Ok, I see.
I hope this helps. For a screenshot I need to provoke it again later; these are the messages from v20.7.2 (see dates):


2020-09-10T00:28:10   kernel   490.690419 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:28:05   kernel   485.572543 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:28:00   kernel   480.194945 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:28:00   kernel   479.940436 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:54   kernel   474.761838 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:49   kernel   469.475112 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:44   kernel   464.324372 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:39   kernel   459.205033 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:33   kernel   453.830080 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:28   kernel   448.126626 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-10T00:27:23   kernel   443.431391 [ 320] generic_netmap_register Emulated adapter for bce0 activated
2020-09-10T00:27:23   kernel   443.431259 [1130] generic_netmap_attach Emulated adapter for bce0 created (prev was NULL)
2020-09-10T00:27:23   kernel   bce0: permanently promiscuous mode enabled
2020-09-10T00:27:23   kernel   443.407436 [1035] generic_netmap_dtor Emulated netmap adapter for bce0 destroyed
2020-09-10T00:27:23   kernel   443.407409 [1130] generic_netmap_attach Emulated adapter for bce0 created (prev was NULL)


Another case:

2020-09-09T23:42:03   kernel   723.581121 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-09T23:41:58   kernel   718.205255 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-09T23:41:53   kernel   713.085191 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-09T23:41:48   kernel   707.965228 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-09T23:41:42   kernel   702.589255 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-09T23:41:37   kernel   697.337566 [4006] netmap_transmit bce0 drop mbuf that needs checksum offload
2020-09-09T23:02:00   configctl[8592]   error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 68, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2020-09-09T23:02:00   configctl[53856]   error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 68, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2020-09-09T22:43:00   /update_tables.py[68067]   fetch alias url https://www.spamhaus.org/drop/drop.txt (lines: 944)
2020-09-09T22:02:00   /update_tables.py[62598]   fetch alias url https://www.spamhaus.org/drop/dropv6.txt (lines: 39)
2020-09-09T22:02:00   configctl[9337]   error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 68, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2020-09-09T22:02:00   configctl[46137]   error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 68, in exec_config_cmd line = sock.recv(65536).decode() socket.timeout: timed out
2020-09-09T21:59:09   /flowd_aggregate.py[5222]   vacuum done

Hi @andreas,

Do these checksum offloading errors start just after you start the ping from the other side of the vpn tunnel?

Yes, exactly at that moment (or about 2 seconds later).

Understood, thanks for the update.

Weird. Then it seems WireGuard somehow manages to enable offloading on the bce adapter....

Anyone who has a similar problem with wireguard + Suricata / Sensei? I wonder if this is a common problem?