[performance] Enable ice_ddp_load=YES on Intel E823

Started by erik78se, Today at 12:11:58 PM

# OPNsense + Intel `ice` (E823) latency spikes fixed by RX tuning + DDP load

This post is a performance tuning guide for others running into packet drops on heavily used `ice` devices. It may apply to other devices as well, though we have not verified that.

Our trigger at Dwellir was periodic latency spikes (up to ~3500 ms) on a number of services we provide, running on an Intel E823 (`ice0`/`ice1`) behind a `lagg0`. The spikes affected our services broadly. We first saw them in a Grafana dashboard fed by the Prometheus node exporter, which prompted the investigation.

The main signal was heavy RX drops on `ice0`, `ice1`, and `lagg0`, plus very high `rx_no_desc`/`rx_discards` counters.

## Symptoms of the problem
- Application/TCP latency was mostly fine, but with periodic spikes up to ~3500 ms.
- Spikes appeared every few minutes (roughly 5-7 minute recurrence, with 2-3 minute high-drop windows).
- The Prometheus node exporter metric `node_network_receive_drop_total` increased heavily on these devices:
  - `ice0`
  - `ice1`
  - `lagg0`
- RX drops were the main issue; TX was comparatively clean.
- Drops were visible on parent interfaces, while some VLAN interfaces appeared clean.
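The drop counters behind that metric can also be read directly on the box. A minimal sketch, parsing a captured `netstat -i` sample so the field positions are visible (the `ice0` line and its numbers are illustrative, not our real counters):

```shell
#!/bin/sh
# Parse the Idrop (input drops) column from a captured `netstat -i` sample.
# On a live box, replace the sample with real output: netstat -i -n -I ice0
sample='Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
ice0   9000 <Link#1>      aa:bb:cc:dd:ee:01  9912345     0 48211  8812345     0     0'
# Idrop is the 7th whitespace-separated field on FreeBSD.
echo "$sample" | awk 'NR > 1 { printf "%s RX drops: %s\n", $1, $7 }'
```

Watching this value climb between two samples a few seconds apart is the fastest way to confirm the drops are still happening.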

## What fixed it
The solution was a combination, not one setting:

1. Increase descriptor rings (`override_nrxds` / `override_ntxds`) and enable extra queues and MSI-X vectors (can be done in the Tunables web UI):
```
dev.ice.0.iflib.override_nrxds="4096"
dev.ice.0.iflib.override_nrxqs="4"
dev.ice.0.iflib.override_ntxds="4096"
dev.ice.0.iflib.override_ntxqs="4"
dev.ice.0.iflib.override_qs_enable="1"
dev.ice.0.iflib.use_extra_msix_vectors="4"
dev.ice.1.iflib.override_nrxds="4096"
dev.ice.1.iflib.override_nrxqs="4"
dev.ice.1.iflib.override_ntxds="4096"
dev.ice.1.iflib.override_ntxqs="4"
dev.ice.1.iflib.override_qs_enable="1"
dev.ice.1.iflib.use_extra_msix_vectors="4"
```
2. Ensure DDP is loaded at boot (`ice_ddp_load="YES"`)
3. Enable the queue override and increase the queue count (`override_qs_enable="1"`, `override_nrxqs`/`override_ntxqs`; we settled on `4`)
4. Reboot and verify active queues in `dmesg`/`sysctl`
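Step 4 can be scripted. A sketch that checks a boot log for the two signals we cared about (the sample lines mimic our `dmesg` output; exact wording may vary between driver versions):

```shell
#!/bin/sh
# Scan a boot log for Safe Mode (DDP missing) vs. the active queue count.
# On a live box, use real output instead: log=$(dmesg)
log='ice0: Using 4 Tx and Rx queues
ice1: Using 4 Tx and Rx queues'
if echo "$log" | grep -qi 'safe mode'; then
  echo "DDP not loaded: driver is in Safe Mode"
else
  # Count how many ports report the expected queue configuration.
  echo "$log" | grep -c 'Using 4 Tx and Rx queues'
fi
```

If the Safe Mode branch fires, none of the queue tuning is effective, so check `ice_ddp_load` before touching anything else.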

Adding `override_nrxds` / `override_ntxds` (`dev.ice.X.iflib.override_nrxds`, etc.) can be done from the UI, but some other parameters require editing `/boot/loader.conf.local` on the device, since the GUI in the current version (OPNsense 25.7.6) does not support adding boot-time loader tunables.

`cat /boot/loader.conf.local`:
```
ice_ddp_load="YES"
dev.ice.0.iflib.override_qs_enable="1"
dev.ice.1.iflib.override_qs_enable="1"
dev.ice.0.iflib.override_nrxqs="4"
dev.ice.0.iflib.override_ntxqs="4"
dev.ice.1.iflib.override_nrxqs="4"
dev.ice.1.iflib.override_ntxqs="4"
```
Note: some of these values overlap (tunables vs `loader.conf.local`), which we could probably clean up.
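Since a typo in `/boot/loader.conf.local` only shows up after a reboot, a quick consistency check before rebooting can help. This is a hypothetical helper, not an OPNsense tool; it just confirms every `dev.ice.0.*` override has a matching `dev.ice.1.*` line:

```shell
#!/bin/sh
# Paste the file contents here (or on a live box: conf=$(cat /boot/loader.conf.local)).
conf='ice_ddp_load="YES"
dev.ice.0.iflib.override_qs_enable="1"
dev.ice.1.iflib.override_qs_enable="1"
dev.ice.0.iflib.override_nrxqs="4"
dev.ice.0.iflib.override_ntxqs="4"
dev.ice.1.iflib.override_nrxqs="4"
dev.ice.1.iflib.override_ntxqs="4"'
# For every dev.ice.0.* line, require the matching dev.ice.1.* line.
echo "$conf" | grep '^dev\.ice\.0\.' | sed 's/ice\.0\./ice.1./' |
while read -r twin; do
  echo "$conf" | grep -qF "$twin" || echo "missing: $twin"
done
```

Silent output means both ports are configured symmetrically.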

## Why it worked
- RX resources were too small for burst traffic (ring/queue starvation).
- The driver was entering Safe Mode when the DDP package wasn't loaded, so queue scaling looked configured but wasn't effective. `ice_ddp_load="YES"` fixes this.

## Verification
After DDP + queue activation:
- Multiple RX/TX queues became active (`rxq0..rxq3`)
- `dmesg` reported `Using 4 Tx and Rx queues`
- RX drops flattened to near zero
- Latency spikes disappeared
- Wired memory increases on FreeBSD, but the system does not run out of memory.
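The wired-memory growth is expected: each RX descriptor pins a receive buffer. A rough back-of-envelope using our ring sizes, assuming 2 KiB mbuf clusters (the buffer size is an assumption; jumbo frames use larger clusters):

```shell
#!/bin/sh
# Rough RX buffer memory pinned per port:
# queues * descriptors per ring * assumed buffer size (2048 B)
rxqs=4
nrxds=4096
bufsize=2048
echo "$(( rxqs * nrxds * bufsize / 1048576 )) MiB per port"
# -> 32 MiB per port, plus TX rings and descriptor memory on top
```

So going from default ring sizes to `4096` descriptors across four queues on two ports plausibly accounts for the wired-memory step we observed.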

## Quick check for others
If you tune queues on `ice` and it doesn't behave as expected, check `dmesg` first for DDP/Safe Mode messages.

## Attributions
Thanks to the team at Dwellir (https://dwellir.com) for assisting in resolving this issue, and to the OPNsense community as well.