netmap_transmit error

Started by awptechnologies, February 23, 2025, 03:39:16 AM

February 23, 2025, 03:39:16 AM Last Edit: February 23, 2025, 03:42:37 AM by awptechnologies
I use intrusion detection (both IDS and IPS) on my LAN interface bge0.

Under heavy load I get the error: netmap_transmit bge0 full hwcur 358 hwtail 24 qlen 333.

The three numbers change, and the messages usually appear in pairs.


Is this a bad thing or normal? Also, are there certain tunables I can adjust to fix these errors?
I have already tried dev.netmap.admode with all options (0, 1, 2); none seem to have any effect, other than 1 preventing intrusion detection from starting.
I also raised dev.netmap.buf_size from 2048 to 8192, but I still get the error.

This is an 8-core system running in a VM on Proxmox. I use CPU affinity to dedicate 8 cores to OPNsense, and I also have vm.numa.disabled set to 0 so it can see the NUMA nodes, since cores 0-7 span two NUMA nodes on the host. The network card is passed through; it is a Broadcom NetXtreme.
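(For reference, a quick sanity check of that NUMA layout on both sides might look like this; numactl is assumed to be installed on the Proxmox host, and these commands are illustrative rather than from the original post:)

# on the Proxmox host (Linux): show NUMA nodes and which CPUs belong to each
numactl --hardware
# inside the OPNsense VM (FreeBSD): how many NUMA domains the guest sees
sysctl vm.ndomains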

I just want to know what tunables people are running to fix the issue and allow maximum throughput for OPNsense.

I have also set (see the read-back sketch below):
net.isr.maxthreads = 8
net.isr.bindthreads = 1
net.inet.rss.enabled = 1
dev.bge.1.msi = 1
dev.bge.0.msi = 1
kern.ipc.soacceptqueue = 256 (up from the default 128)
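(A quick way to read back all of the values above in one shot; OID names copied verbatim from the list, assuming they exist as spelled on this system:)

# verify the currently active values of the tunables listed above
sysctl net.isr.maxthreads net.isr.bindthreads net.inet.rss.enabled kern.ipc.soacceptqueue
sysctl dev.bge.0.msi dev.bge.1.msi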

I'm having the same issue.

I tried these; the messages seem less frequent, but the issue is not resolved.

Original values:
dev.netmap.buf_num: 163840
dev.netmap.ring_num: 200
dev.netmap.buf_size: 2048

New values:
sysctl dev.netmap.buf_num=200000
sysctl dev.netmap.ring_num=256
sysctl dev.netmap.buf_size=4096
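(Side note: plain sysctl assignments like these do not survive a reboot on their own. A minimal sketch, assuming a stock OPNsense install:)

# one-off runtime change (lost at reboot):
sysctl dev.netmap.buf_num=200000
# to persist it, add the same key/value under System > Settings > Tunables
# in the GUI, which re-applies it at every boot; verify afterwards with:
sysctl dev.netmap.buf_num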

Are you using Hyperscan in intrusion detection?

Also, are these packets bypassing intrusion detection when the buffer is full? What is the actual reason they happen? Slow hardware? Bad settings?

Quote from: awptechnologies on February 24, 2025, 01:29:19 AM: Are you using Hyperscan in intrusion detection?

Also, are these packets bypassing intrusion detection when the buffer is full? What is the actual reason they happen? Slow hardware? Bad settings?

I started to experience this on the latest update, or at least it's noticeably worse, causing my LAN interface to hang.

My hardware:
CPU: (4 cores, 1.50GHz)
RAM: 16GB (16947675136 bytes)
Cores: 4 (no Hyper-Threading)
NICs: Realtek Gigabit (re0 for WAN, re1 for LAN)
Current CPU Frequency: 1500MHz
Available Free Memory Pages: 2,356,511

I've tried these tweaks, incrementally increasing them and rebooting to test. Any high load with IDS/IPS enabled, whether with Hyperscan, Aho-Corasick, or the Aho-Corasick "Ken Steele" variant, results in the LAN interface hanging.


THEN!!! I realised (because I'm a dumb***...) that when I re-imaged my FW, I forgot to reinstall the Realtek driver plugin :D

Not sure if the OP might be having the same or a similar issue with a missing NIC plugin?

I use a Broadcom NIC because it is built into my Dell R630. As far as I can tell, there is no plugin related to the driver I have, which is bge. I think it must be included in FreeBSD by default.
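(Correct: bge(4) ships in the FreeBSD base kernel, so there is no separate plugin. An illustrative way to confirm the in-kernel driver probed and attached:)

# confirm the bge driver attached the onboard NICs
dmesg | grep -i bge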

I am also seeing this, but it only happens for me when I limit CPU core boost speed for power savings. When I set it to full power, this doesn't happen, so it seems to be related to insufficient CPU frequency. I have 16 cores dedicated and they are all running @ 1 GHz. In my case the message reads:
ix1 full hwcur
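(A hedged way to check what the cores are actually clocked at on FreeBSD; these are the standard cpufreq OIDs and may not all be exposed in every VM or power mode:)

# current frequency and the available frequency levels for core 0
sysctl dev.cpu.0.freq dev.cpu.0.freq_levels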

March 03, 2025, 08:14:35 AM #6 Last Edit: March 03, 2025, 04:07:46 PM by franco
Yeah, it basically means the ring buffer fills up quickly because too many packets are coming in versus going out.


Cheers,
Franco

This is happening as well with ZA (no surprise).

It's indeed as Franco mentioned.
If there are too many packets in a given time interval and the CPU is not able to empty the queue (by default the NIC queue, usually 1024 entries) fast enough, you will see this error. It is more of a notification telling you the queue is getting full; if a queue is full, tail drop will happen.
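(One way to watch those drops happen in real time, as a rough sketch using standard FreeBSD netstat, nothing OPNsense-specific:)

# per-second interface counters; the idrops column climbs when the queue overflows
netstat -ihw 1
# one-shot per-interface totals, including the drop counters
netstat -id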

What is interesting: this started to happen after the upgrade to 25.1; prior to this upgrade it was not happening.
I am not sure if netmap had some changes.
The only thing that changed was the FreeBSD version, but I'm not sure if it's related.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

The answer is relatively simple: we no longer carry this patch https://github.com/opnsense/src/commit/36fb07bfef7d38906403a28fb2c613712eb6baa4 because it's not in FreeBSD. Functionally it's the same as before, with the message or without it.

Quote: Also mutes a spammy message.  Bravely going where no man has gone before.  :)

hahaha this made my day


Personally I like to see that message, because now I have an exact timestamp for when I see a performance hit on the network. I was always aware of the potential limitation when using ZA + netmap. But now, when I see a message with a timestamp during an issue, I am 100% sure what caused it.

For me this is a QoL improvement ;)

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

In the early days I think this wasn't even rate limited, but I could be wrong. It was pretty annoying in the beginning.


Cheers,
Franco

I have the same problem; after a few days packet flow is broken.
I'm running OPNsense on VMware with a vmxnet3 NIC. Is there any way to run it, or is it simply not compatible?

Quote from: bazbaz on April 15, 2025, 09:45:45 AM: I have the same problem; after a few days packet flow is broken.
What do you mean by this?

Quote from: bazbaz on April 15, 2025, 09:45:45 AM: I'm running OPNsense on VMware with a vmxnet3 NIC. Is there any way to run it, or is it simply not compatible?
I am not sure what you mean by this either.


That error is shown due to what is discussed above. It was always there whenever you reached, at a certain point, more packets than the CPU could handle while having netmap on. The devs just disabled the suppression of this message, which is why you can see it now.

Regards,
S.
Networking is love. You may hate it, but in the end, you always come back to it.

OPNSense HW
APU2D2 - deceased
N5105 - i226-V | Patriot 2x8G 3200 DDR4 | L 790 512G - VM HA(SOON)
N100   - i226-V | Crucial 16G  4800 DDR5 | S 980 500G - PROD

Quote from: Seimus on April 15, 2025, 10:05:41 AM: That error is shown due to what is discussed above. It was always there whenever you reached, at a certain point, more packets than the CPU could handle while having netmap on. The devs just disabled the suppression of this message, which is why you can see it now.


I disabled inspection mode. It simply does not work with VMware NICs. After some time, packets entering the firewall no longer exit from the target interface until a full reboot. It is not a performance matter; it is something that stops working.


Today at 02:09:59 PM #14 Last Edit: Today at 02:23:04 PM by Melroy vd Berg
I would like to respond to this thread. I think it's an important topic to this day.

We also have Suricata running in IPS mode, which uses netmap under the hood.

I found and read the following reply from Giuseppe, who is one of the netmap collaborators, here.


Stating:
Quote: The ones you are interested in are ring_num and buf_num

Meaning, you can of course increase the buffer size itself, but you most likely want to increase the number of buffers available to netmap.

What I have tried thus far (see the plain sysctl form after this list):

  • Doubling the buffer size by setting dev.netmap.buf_size to 4096
  • More importantly, increasing the number of buffers by setting dev.netmap.buf_num to 327680
  • As well as setting dev.netmap.ring_num to 400

You might want to add these values to the tunables and then reboot the system.
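(For clarity, the three values from the list above in plain sysctl form, same numbers as the bullets; run them once, or add them as tunables as described, then reboot:)

sysctl dev.netmap.buf_size=4096
sysctl dev.netmap.buf_num=327680
sysctl dev.netmap.ring_num=400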

WARNING:

Increasing these values requires sufficient RAM to be present (at least 4 GB or more). You have been warned in case you do not have enough RAM left.

During reboot Suricata might use some CPU cycles, and sysctl dev.netmap | grep curr will initially show "0" until everything is allocated. I believe this is expected.

Eventually dev.netmap.buf_curr_num should match the buf_num set earlier.
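(A small check along those lines, using the curr OIDs exposed by FreeBSD's netmap module:)

# requested vs. currently allocated netmap buffers and rings
sysctl dev.netmap.buf_num dev.netmap.buf_curr_num
sysctl dev.netmap.ring_num dev.netmap.ring_curr_num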

That being said... running a speedtest over a 3+ Gbit/s fiber connection still causes buffer issues in netmap, despite the settings above:

2025-10-17T02:00:29 Notice kernel [99224] 229.066251 [4335] netmap_transmit           ax1 full hwcur 430 hwtail 179 qlen 250
2025-10-17T02:00:29 Notice kernel [99224] 229.059118 [4335] netmap_transmit           ax1 full hwcur 430 hwtail 179 qlen 250
2025-10-17T02:00:28 Notice kernel [99223] 228.063878 [4335] netmap_transmit           ax1 full hwcur 448 hwtail 194 qlen 253
2025-10-17T02:00:28 Notice kernel [99223] 228.055056 [4335] netmap_transmit           ax1 full hwcur 449 hwtail 224 qlen 224
2025-10-17T02:00:27 Notice kernel [99222] 227.047952 [4335] netmap_transmit           ax1 full hwcur 288 hwtail 505 qlen 294
2025-10-17T02:00:27 Notice kernel [99222] 227.039051 [4335] netmap_transmit           ax1 full hwcur 289 hwtail 68 qlen 220
2025-10-17T02:00:26 Notice kernel [99221] 226.092928 [4335] netmap_transmit           ax1 full hwcur 467 hwtail 238 qlen 228
2025-10-17T02:00:26 Notice kernel [99221] 226.084023 [4335] netmap_transmit           ax1 full hwcur 468 hwtail 240 qlen 227
2025-10-17T02:00:25 Notice kernel [99220] 225.196415 [4335] netmap_transmit           ax1 full hwcur 233 hwtail 482 qlen 262
2025-10-17T02:00:25 Notice kernel [99220] 225.188117 [4335] netmap_transmit           ax1 full hwcur 483 hwtail 233 qlen 249
2025-10-17T02:00:24 Notice kernel [99219] 224.038394 [4335] netmap_transmit           ax1 full hwcur 54 hwtail 338 qlen 227
2025-10-17T02:00:24 Notice kernel [99219] 224.030190 [4335] netmap_transmit           ax1 full hwcur 339 hwtail 54 qlen 284
2025-10-17T02:00:23 Notice kernel [99218] 223.335506 [4335] netmap_transmit           ax1 full hwcur 301 hwtail 29 qlen 271
2025-10-17T02:00:23 Notice kernel [99218] 223.325235 [4335] netmap_transmit           ax1 full hwcur 30 hwtail 301 qlen 240
2025-10-16T22:57:20 Notice kernel [88235] 240.462029 [4335] netmap_transmit           ax1 full hwcur 466 hwtail 188 qlen 277
2025-10-16T22:57:20 Notice kernel [88235] 240.452645 [4335] netmap_transmit           ax1 full hwcur 189 hwtail 466 qlen 234
2025-10-16T17:41:57 Notice kernel [69312] 317.711273 [4335] netmap_transmit           ax1 full hwcur 169 hwtail 391 qlen 289
2025-10-16T17:41:57 Notice kernel [69312] 317.702335 [4335] netmap_transmit           ax1 full hwcur 170 hwtail 483 qlen 198
2025-10-16T13:31:43 Notice kernel [54299] 303.926446 [4335] netmap_transmit           ax1 full hwcur 463 hwtail 188 qlen 274
2025-10-16T06:41:43 Notice kernel [29698] 703.601969 [4335] netmap_transmit           ax1 full hwcur 12 hwtail 270 qlen 253
2025-10-16T06:41:43 Notice kernel [29698] 703.593897 [4335] netmap_transmit           ax1 full hwcur 271 hwtail 12 qlen 258
2025-10-16T06:41:43 Notice kernel [135] ax1: VLAN Stripping Disabled
2025-10-16T06:41:43 Notice kernel [135] ax1: VLAN filtering Disabled
2025-10-16T06:41:43 Notice kernel [135] ax1: Receive checksum offload Disabled
2025-10-16T06:41:43 Notice kernel [135] ax1: RSS Enabled
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 7
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 6
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 5
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 4
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 3
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 2
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 1
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 0
2025-10-16T06:41:43 Notice kernel [135] ax1: VLAN Stripping Disabled
2025-10-16T06:41:43 Notice kernel [135] ax1: VLAN filtering Disabled
2025-10-16T06:41:43 Notice kernel [135] ax1: Receive checksum offload Disabled
2025-10-16T06:41:43 Notice kernel [135] ax1: RSS Enabled
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 7
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 6
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 5
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 4
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 3
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 2
2025-10-16T06:41:43 Notice kernel [135] ax1: xgbe_config_sph_mode: SPH disabled in channel 1

I also monitored the processes using ps -axfu during a speedtest. As expected, Suricata uses the most CPU cycles, but it is not maxing out, meaning there is CPU power left that Suricata is not using.

My conclusion: increasing the buffers might help but doesn't solve the issue. Suricata is currently just too slow in processing the traffic. Alternatively, other fine-tuning or configuration might be required to keep the buffer from filling up. I have no idea which other tunables might increase the throughput of Suricata in IPS mode. Maybe enabling RSS? No idea at this moment how to continue further.

PS: I also found this note: https://docs.opnsense.org/troubleshooting/performance.html#note-regarding-ips saying that IPS is limited to 1 thread, but I'm not sure if that note is still valid.
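(One hedged way to check what the generated config actually does with threads; the suricata.yaml path is the usual OPNsense location and may differ on other setups:)

# show the netmap section of the generated Suricata config, including threads
grep -A6 '^netmap' /usr/local/etc/suricata/suricata.yaml
# confirm whether RSS is active on the FreeBSD side
sysctl net.inet.rss.enabled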
Hardware: DEC3852
Version: OPNsense v25.7.5