[CALL FOR TESTING] Netmap generic mode queue stall fixes

Started by franco, January 27, 2023, 11:38:45 AM

Hi!

Zenarmor and OPNsense have been working with Klara to bring netmap improvements to FreeBSD, some of which have already landed in the development branch for upcoming FreeBSD 14.

One of the goals of the project was to find and remove bugs from netmap. One of those bugs caused network traffic to become unresponsive in generic mode, which is the mode used when the driver itself doesn't support netmap but can still be made to interact with it through a netmap wrapper...

It's easy to spot these on your system, e.g.:

# dmesg | grep generic_netmap_register
442.167865 [ 320] generic_netmap_register   Emulated adapter for gif1 activated

If you see log messages here then you might be affected and perhaps saw the behaviour before: suricata/zenarmor needs to be restarted in order to continue packet flow.
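
A small sketch for pulling just the interface names out of those messages, assuming the log line format shown above. The function name emulated_ifaces is made up for illustration and is not part of OPNsense:

```shell
# Print the unique interface names that netmap reports as emulated (generic mode).
# Reads kernel log lines on stdin so it can be tried on saved output as well.
# Assumption: lines look like
#   "... generic_netmap_register   Emulated adapter for <iface> activated"
emulated_ifaces() {
    sed -n 's/.*generic_netmap_register.*Emulated adapter for \([^ ]*\) activated.*/\1/p' | sort -u
}

# On the firewall itself this would be used as:
#   dmesg | emulated_ifaces
```

An empty result means no adapter was registered in emulated mode since boot, so this patch most likely does not apply.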

The change in question is: https://github.com/opnsense/src/commit/0c47d02eefec

And the kernel can be installed on 23.1 easily:

# opnsense-update -zkr 23.1.2-netmap
# opnsense-shell reboot

We hope some of you can try this one out and see whether the problems disappear (or whether it perhaps causes another dropout, like the one we already solved internally in an earlier version of the patch).

The patch does have implications for reliability in generic mode (which always was and always will be less reliable than native netmap mode), but we will explain these at a later time.


Cheers,
Franco

awesome!

Works for me, now zenarmor reports can be seen again (using zenarmor 1.12.4 and App/RulesDB 1.12.22122618).

Great work, thank you!
Oliver

Hi Oliver,

huh, I'm missing some context here. It's not supposed to fix a previously unbroken Zenarmor. Perhaps the reboot did it for you? ;)


Cheers,
Franco

I installed the new kernel and it works!

The problems with the phone registering through the fritzbox and with the rest of the family's web surfing are gone.

Unfortunately both interfaces use VLANs and therefore the generic netmap driver...

Not quite sure if this applies to my situation - looking for clarification.

I have been troubleshooting an issue with Sensei/ZA which I have documented here:

https://forum.opnsense.org/index.php?topic=31544.0

Sunny Valley support has indicated the problem is netmap and asked me to give this a try, which I did yesterday. The result is that it "works", but the interface is still flapping, so it didn't resolve my particular issue. I have a feeling this doesn't apply to me because I have OPNs configured as a transparent filtering bridge and am using the ZA bridge deployment mode. It doesn't stall, it just doesn't work at all, which seems different.

IF I understand correctly (big assumption), their "bridge mode" currently uses netmap and bypasses the OS, but the problem is that ZA won't pass traffic at all unless the bridge is also configured in OPNs (resulting in the flapping). Therefore, the solution is to either fix netmap or add support to if_bridge(4). It should be noted that this config did previously work (with the OPNs bridge or without), so not sure where a change was implemented to break it.

Am I on the right path here? Apologies if I am off target; I am a bit out of my comfort zone on this one.

> Not quite sure if this applies to my situation - looking for clarification.

Quote from: franco on January 27, 2023, 11:38:45 AM
It's easy to spot these on your system, e.g.:

# dmesg | grep generic_netmap_register
442.167865 [ 320] generic_netmap_register   Emulated adapter for gif1 activated

If you see log messages here then you might be affected and perhaps saw the behaviour before: suricata/zenarmor needs to be restarted in order to continue packet flow.

I actually did that but I'm not super well versed in SSH shell. I was pretty sure blank means not applicable but wanted to check. What threw me off was Sunny Valley wanted me to try this. I even asked if this applied b/c it seemed like it didn't. Thanks Franco.

Ok, so you are not using the generic netmap mode in that case. The patch is not for you, but we do have an if_bridge patch coming up shortly (iterating through QA at the moment).

However, moving everything to the bridge will only work around the hardware interface going down/up. The actual issue might persist. The down/up is a failsafe for removing hardware filter option settings from the device, which needs a hard reset, but in theory the reset is not needed if the hardware bits are all set correctly already. A patch for that was discussed but is not planned at this point in the project.


Cheers,
Franco

Appreciate the clarification, figured that was the case. It seems they don't have a working solution for a transparent bridge config at this time. Like I mentioned, it was previously working without the bridge configured in OPNs, but something changed along the way. Also, good to know the longer term bridge fix for this scenario isn't forthcoming. I have a call with them today and hope to get their reporting only mode functioning which if I understand correctly uses pcap.

Take care, love OPNs and the work you guys are doing here.

Could be that it was working either due to older FreeBSD state or old code paths that have subsequently been rewritten. For both things there is a problem:

1. FreeBSD state does sometimes deteriorate due to surrounding networking changes. Netmap has its limits both in technical and organisational sense. It's being worked on but the main consumers seem to be OPNsense/pfSense and research projects (where this originally came from). That's also why we involved Klara to look at a few shortcomings and problems encountered over the years.

2. The rework of code paths is always done to simplify and to take side effects out of the configuration paths as they are reported. There is no ill intention of breaking a certain setup (and none was implied here, but I feel I should state it explicitly). Beyond that, these reworks do seem to trigger other side effects that lie more in the kernel than in our code, which could have the adverse effect described as well.

I think at least starting to see kernel issues for what they are is a good step, all things considered. Work is being done, although really slowly in the grand scheme of things, but it is steady, taking one step at a time. :)


Cheers,
Franco

Quote from: franco on January 27, 2023, 11:38:45 AM
...

And the kernel can be installed on 23.1 easily:

# opnsense-update -zkr 23.1-netmap
# opnsense-shell reboot

...
I accidentally went back to the original kernel by installing 23.1_6, and my problems reappeared.
So I am looking forward to this fix becoming permanent.

@Franco: Will this fix be included in 23.2 or earlier?

The "results" (or rather the lack thereof) seem promising. We've heard of no crashes, no regressions, and no reliability problems such as more dropped packets than before.

The review is https://reviews.freebsd.org/D38065 but it's currently on hold because the netmap developer has a different view on the subject. I'm unsure how quickly this will be resolved.


Cheers,
Franco

I'm affected by the netmap/Zenarmor issue and will install the patch today to test. Thanks for bringing this forward! :)
Will report back in 2-3 days as it usually took a while for Zenarmor to get stuck on the old kernel.

I just posted a comment in the Zenarmor noting that I think netmap is causing problems beyond those with generic mode (which I do not use). I am getting regular kernel panics that seem to point to netmap:

--- trap 0xc, rip = 0xffffffff81226810, rsp = 0xfffffe00dafcb758, rbp = 0xfffffe00dafcb830 ---
lapic_handle_timer() at lapic_handle_timer/frame 0xfffffe00dafcb830
virtqueue_notify() at virtqueue_notify+0x87/frame 0xfffffe00dafcb860
vtnet_txq_mq_start_locked() at vtnet_txq_mq_start_locked+0xa2/frame 0xfffffe00dafcb8b0
vtnet_txq_mq_start() at vtnet_txq_mq_start+0x61/frame 0xfffffe00dafcb8e0
vlan_transmit() at vlan_transmit+0xf3/frame 0xfffffe00dafcb930
nm_os_generic_xmit_frame() at nm_os_generic_xmit_frame+0x6d/frame 0xfffffe00dafcb950
generic_netmap_txsync() at generic_netmap_txsync+0x2eb/frame 0xfffffe00dafcba40
netmap_ioctl() at netmap_ioctl+0x1a4/frame 0xfffffe00dafcbb10
freebsd_netmap_ioctl() at freebsd_netmap_ioctl+0x74/frame 0xfffffe00dafcbb50

I'll implement the potential solution noted here to see if it makes a difference.

@jbhorner, thanks. Quick question: are you using vlans on a vtnet interface?
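
One way to check is sketched below. It assumes the usual FreeBSD ifconfig output, where each VLAN interface has a detail line containing "parent interface: <name>". The function name vlans_on_vtnet is only illustrative:

```shell
# List VLAN interfaces whose parent is a vtnet device.
# Reads ifconfig-style output on stdin and prints "<vlan iface> -> <parent>".
# Assumption: interface header lines start in column one ("name: flags=...")
# and VLAN details appear on an indented "parent interface: vtnetN" line.
vlans_on_vtnet() {
    awk '
        /^[^ \t]/ { iface = $1; sub(/:$/, "", iface) }   # remember current interface
        /parent interface: vtnet/ { print iface " -> " $NF }
    '
}

# On the firewall itself this would be used as:
#   ifconfig | vlans_on_vtnet
```

Any line printed means a VLAN sits on top of a vtnet interface, which matches the vlan_transmit and vtnet_txq_mq_start frames in the trace above.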