[CALL FOR TESTING] Netmap generic mode queue stall fixes

Started by franco, January 27, 2023, 11:38:45 AM

Yes, I do.  I don't use pass-throughs on my VM. They cause problems with snapshots. (Or at least they have for me in the past.)

After my last reply here, it had another kernel panic (post patch). Not sure what's going on here so will just have to stick with a prior release snapshot, or pfSense. I'm sure it will be sorted soon...I just won't have time to deal with the crashes.

I might bring it up later so that I can forward the crash log. But for now, it's peacefully sleeping...

Cheers!

February 04, 2023, 02:41:11 AM #16 Last Edit: February 04, 2023, 02:43:25 AM by almodovaris
The test kernel works.

# dmesg | grep generic_netmap_register gives me nothing.

Can I test Zenarmor multicore? I.e. eastpect multicore.
OPNsense HW:

Minisforum Venus series UN100C, 16 GB RAM, 512 GB SSD
T-bao N9N Pro, 16 GB RAM, 512 GB SSD

Quote from: jbhorner on February 04, 2023, 02:30:49 AM
Yes, I do.  I don't use pass-throughs on my VM. They cause problems with snapshots. (Or at least they have for me in the past.)

After my last reply here, it had another kernel panic (post patch).

Thanks for more information @jbhorner. Since you're using vlan(4), you're actually using the netmap emulated driver. We'll take a look.

For the time being, and for the sake of clarity, can you please confirm whether these crashes happen when you're using the netmap beta kernel?

Just reporting my findings.

Performance is much better overall, but large file transfers (like Win11.iso) between VLANs still result in a complete lockup of the router when Zenarmor is active. All VLANs are on the same parent interface (ix0), which is monitored by Zenarmor.

I am also seeing 100% A+ ratings on the Waveform bufferbloat test with the Shaper configured. I used to see an A rating with a moderate increase in bufferbloat. It's absolutely zero now with a consistent A+. Even without Zenarmor, I'll be running this kernel for a while. Thank you all for your efforts to improve the platform.

@djr92, thanks, very helpful. Glad to hear that you've seen improvements in bufferbloat tests.

WRT the stalls, can you try the same test with Zenarmor in bypass mode? I want to make sure it's not ZA-related. In bypass mode, ZA acts as a dummy bridge switching packets back and forth.

Hello.

For clarity, I had this same issue on two different machines before testing the new kernel. This issue only appears when ZA is in routed mode (netmap). In bypass or passive mode I have no issue.

The issue is specifically a router crash: the router becomes unresponsive and all connectivity fails for a period of time. It often requires a power reset to recover.

The issue only appears when ZA is in routed mode and it only happens when transferring a single large file between vlans. Something like a 5GB .iso file. I can transfer a 5GB folder full of smaller files with no issue.

I have tried with the vlans on the same parent interface and I've also tried with the vlans on different physical interfaces. Same issue.

I'm currently using a Netgate 6100 with the 10G (ix) uplinks. I also had the same issue on my previous Dell SFF PC using X550-T2 NIC (also ix).

It happened on the stock kernel and it also happens with this new test kernel.

I still get errors, but this patch is enabling ZA to actually work in L3 routed mode. The following combination seems to work best for me using Intel I225-V 2.5G interfaces (Protectli VP2420):
- Disable flow control in tunables (dev.igc.0.fc, dev.igc.1.fc, dev.igc.2.fc, dev.igc.3.fc all set to 0)
- Install this 23.1-netmap kernel
- Set ZA to run in L3 Reporting and Blocking with emulated driver.
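For anyone with a different port count, the tunable names in the first bullet follow a simple pattern. Here is a minimal sketch that just generates the name=value pairs; the helper name and its defaults are made up for illustration, and only the `dev.igc.N.fc` names and the value 0 come from the list above.

```python
def flow_control_tunables(driver="igc", ports=4, value=0):
    """Build 'dev.<driver>.<n>.fc=<value>' strings, one per port.

    Hypothetical helper: it only formats the strings, it does not
    apply anything to the running system.
    """
    return [f"dev.{driver}.{n}.fc={value}" for n in range(ports)]


if __name__ == "__main__":
    # Print the four entries mentioned in the post (igc0 through igc3).
    for line in flow_control_tunables():
        print(line)
```

The output is only text to copy into the tunables configuration; whether flow control needs to be off on every port is something to verify on your own hardware.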

If any one of the above settings is changed, I get flapping interfaces and other issues, especially with wireless. Wired and wireless connect to different interfaces on the firewall with different subnets and firewall rules.

Most of the errors occur on the wireless interface (igc2):
424.125647 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
438.207994 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
452.313472 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
484.519552 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
498.622187 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
514.752345 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
544.042637 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
558.191049 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
572.323451 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
599.501288 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
614.628354 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
632.829857 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
763.054897 [ 320] generic_netmap_register   Emulated adapter for igc2 activated

With an occasional error on the wired interface:
325.237102 [ 320] generic_netmap_register   Emulated adapter for igc0 activated
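To make the cadence of these messages easier to see, here is a rough sketch that counts the "Emulated adapter ... activated" lines per interface and computes the time between consecutive ones. It assumes the leading number is a timestamp in seconds and does not handle that counter wrapping around; the function names and the shortened sample are illustrative only.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 424.125647 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d+\.\d+)\s+\[\s*\d+\]\s+generic_netmap_register\s+"
    r"Emulated adapter for (?P<ifname>\S+) activated"
)


def activation_times(log_text):
    """Map interface name -> list of activation timestamps (seconds)."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            times[match.group("ifname")].append(float(match.group("ts")))
    return dict(times)


def gaps(timestamps):
    """Seconds between consecutive activations (no wrap-around handling)."""
    return [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]


# Shortened sample taken from the log lines posted above.
SAMPLE = """\
424.125647 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
438.207994 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
452.313472 [ 320] generic_netmap_register   Emulated adapter for igc2 activated
325.237102 [ 320] generic_netmap_register   Emulated adapter for igc0 activated
"""

if __name__ == "__main__":
    for ifname, stamps in sorted(activation_times(SAMPLE).items()):
        print(ifname, len(stamps), gaps(stamps))
```

Feeding it the full dmesg output instead of SAMPLE should show whether the re-activations cluster at a regular interval on one interface, which might be worth including in a report.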

Quote from: jbhorner on February 03, 2023, 08:15:41 PM
I just posted a comment in the Zenarmor noting that I think Netmap is causing a bit more of an issue than issues with generic mode (which I do not use). I am getting regular kernel panics that seem to point to Netmap:

Could you please show the full stack trace and panic message? The snippet you pasted just shows a thread taking a timer interrupt while pushing packets out of netmap, so it's hard to draw any conclusions.

February 07, 2023, 08:14:46 AM #23 Last Edit: February 07, 2023, 08:30:59 AM by Phiolin
Unfortunately I just had Zenarmor pass out again on the netmap test kernel. I use VLANs on a vtnet interface, so I'm the classic case for this issue I guess.
Had to restart Zenarmor and then everything came back.
Let me know if you need any more specific information!

% uname -a
FreeBSD redacted.local 13.1-RELEASE-p5 FreeBSD 13.1-RELEASE-p5 netmap-n250377-0c47d02eefe SMP amd64


Actually, why did this start happening anyway? I never had these issues before like... idk, November 2022 or so?
I guess there have been kernel changes in this area that are now causing the issue, so it's good that it is being looked into, but I wonder if it wouldn't be easier to just roll back whatever change introduced the problem in the first place?

Adding some context here...

@markj is from the FreeBSD Project/Klara Systems. We're currently collaborating with Klara and Mark to sort out outstanding netmap issues.

In this regard, any help you can provide here would be much appreciated by the community, since it'll help ship a reliable netmap kernel not just for OPNsense, but for the whole BSD ecosystem which is relying on FreeBSD, since these improvements will be upstreamed.

Thanks in advance for all your attention and help!

After giving it a few more days of testing, like many others I'm still having issues with netmap, especially when significant bandwidth intensive traffic is taking place. I've resorted to placing Zenarmor into passive mode so that it uses just pcap, and I'm not having issues with that. I try to monitor the reports regularly and look for threats to possibly block in the firewall rules, in addition to the other measures already in place (DNSBLs, Geo-IP, URL table subscriptions, CrowdSec, etc.).

Here's hoping netmap fixes come soon...

@SpinningRust, thanks for the feedback.

Which Ethernet adapter were you using for the ZA-protected interface? Were there any VLANs involved?

I have it set on the interfaces to my LAN/wired network (igc0) and my access point (igc2). I had originally set up VLANs for each of these, intending to eventually separate traffic logically in a downlink switch for IoT, etc. with additional VLANs, or for multiple SSIDs on the wireless. I haven't done that yet since I don't have managed switches or an AP that supports VLAN trunks. So the VLANs are pointless right now: they were only associated with the parent interfaces and have never been assigned as interfaces for firewall policies, etc.

I've deleted the vlans, but they do still show up in Zenarmor as an assignable interface, though I've never used them. Not sure how to clear them out of Zenarmor since they no longer exist.

> I'm still having issues with netmap, especially when significant bandwidth intensive traffic is taking place

Let's be a bit more clear about this: we are fixing queue stalls. If you had queue stalls and still see queue stalls we would like to know before moving the goalpost to performance and further reliability.

Does anyone see queue stalls with the kernel published here? No ping going through at all? Single connections being stuck must be excluded and I'm not even sure if this is something that Zenarmor could cause as well given the nature of flow tracking in the user application.


Cheers,
Franco
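One way to separate "no ping going through at all" from a single stuck connection is to log one probe result per second and look for long runs of consecutive losses. The sketch below does only the classification step; how the probe results are collected, the one-second spacing, and the `min_len` threshold are all assumptions chosen for illustration, not anything prescribed in this thread.

```python
def loss_runs(replies, min_len=5):
    """Return (start_index, length) for each run of lost probes.

    `replies` is a sequence of booleans, True meaning the probe was
    answered. Only runs of at least `min_len` consecutive losses are
    reported; shorter gaps are treated as ordinary packet loss.
    """
    runs = []
    start = None
    for i, ok in enumerate(replies):
        if not ok and start is None:
            start = i                      # a loss run begins here
        elif ok and start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None                   # the run ended with a reply
    # Handle a run that is still open at the end of the capture.
    if start is not None and len(replies) - start >= min_len:
        runs.append((start, len(replies) - start))
    return runs


if __name__ == "__main__":
    # Two replies, eight losses in a row, then recovery.
    sample = [True, True] + [False] * 8 + [True, True]
    print(loss_runs(sample))
```

A long run reported here while other traffic is also dead would point toward a stall; isolated single losses or one hung transfer with pings still answered would not.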

February 09, 2023, 01:20:49 PM #29 Last Edit: February 09, 2023, 01:25:49 PM by SpinningRust
Quote from: franco on February 09, 2023, 09:26:16 AM
Does anyone see queue stalls with the kernel published here? No ping going through at all?

Yes, I believe so, but I'm very new to this and may be experiencing a different issue. However, dmesg fills up with errors. My previous comment was that the best way to cause the errors is a large upload/download or something bandwidth intensive. It was not in regard to performance.

Also, I can replicate the issues, to a lesser degree, in some testing with the IPS feature, which I believe also uses netmap... but I'm not sure if it's using the emulated netmap driver. I have extensively tested the emulated driver for Zenarmor. While it works longer than the native netmap driver, it will still fail, causing Wi-Fi connectivity or other LAN activity to drop completely for periods of time before eventually recovering.

So, for now, I'm using both IDS and Zenarmor in passive mode with no issues at all since netmap isn't used.