PC Engines APU2 1Gbit traffic not achievable

Started by Ricardo, July 27, 2018, 12:24:54 PM

Hi, I have been following this thread and other related forums about achieving 1 Gbit via PPPoE with PC Engines' APU2. Setting net.isr.dispatch = "deferred" yielded only a small speed improvement, from 400 Mbps to 450 Mbps. Using the ISP-provided DIR-842, I can hit 800+ Mbps. I am on the latest OPNsense with the stock kernel. pfSense on the same APU2 with net.isr.dispatch = "deferred" yielded 520-550 Mbps.
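For anyone who wants to double-check what the box is actually running with, here is a minimal sketch (assuming Python 3 and shell access on the firewall; the tunable itself is set persistently under System > Settings > Tunables and generally needs a reboot to take full effect):

Code: [Select]
#!/usr/bin/env python3
"""Print the current netisr settings on a FreeBSD-based firewall.

Minimal sketch: it only shells out to the standard `sysctl` utility,
so it has to be run on the OPNsense box itself."""
import subprocess

def read_sysctl(name: str) -> str:
    # `sysctl -n` prints just the value, without the "name:" prefix
    return subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    ).stdout.strip()

if __name__ == "__main__":
    for oid in ("net.isr.dispatch", "net.isr.maxthreads", "net.isr.bindthreads"):
        try:
            print(f"{oid} = {read_sysctl(oid)}")
        except subprocess.CalledProcessError:
            print(f"{oid}: not available on this kernel")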

I have an APU2 board with OPNsense as well. My board only achieves about 120 MBit/s per NIC in iPerf  >:(
I posted the problem here: https://forum.opnsense.org/index.php?topic=11228.0

Hi,
I've just found this blog entry: https://teklager.se/en/knowledge-base/apu2-1-gigabit-throughput-pfsense/
So the APU2 series should be able to achieve 1 Gbit with pfSense.  ::)

best regards
Dirk

February 25, 2019, 09:55:15 AM #78 Last Edit: February 25, 2019, 09:57:57 AM by Ricardo
Quote from: monstermania on February 25, 2019, 09:52:57 AM
Hi,
I've just found this blog entry: https://teklager.se/en/knowledge-base/apu2-1-gigabit-throughput-pfsense/
So the APU2 series should be able to achieve 1 Gbit with pfSense.  ::)

best regards
Dirk

IF(!!!) the WAN type is NOT PPPoE! That fact is not revealed in that blog. PPPoE can cause a giant speed decrease, thanks to a FreeBSD PPPoE handling defect.
I can only achieve 160-200 Mbit, and that fluctuates heavily between test runs. A cheap Asus RT-AC66U B1 can easily reach 800+ Mbit on the very same modem/subscription.

December 18, 2019, 03:12:52 PM #79 Last Edit: December 19, 2019, 02:22:26 PM by pjdouillard
This topic hasn't received much love in the last few months, but I can attest that the issue is still present: OPNsense 19.7 on an APU2 cannot reach 1 Gbps from WAN to LAN with the default setup on a single traffic flow.

So I dug around, found a few threads here and there about this, and finally found this topic, to which I am replying. I saw that many did some tests, saw the proposed solution at TekLager, etc., but they don't really address the single-flow issue.

I've read about the single-threaded vs multi-threaded behavior of the *BSDs vs Linux, but single-flow traffic will only use one thread anyway, so I had to discard that as a probable cause too.

I then decided to run my own tests and see whether this was related to a single APU2 or to all of them. I tested 3 x APU2 with different firewalls, and these are the speeds I get with https://speedtest.net (with NAT enabled, of course):

OPNsense                     down: ~500 Mbps    up: ~500 Mbps
pfSense                      down: ~700 Mbps    up: ~700 Mbps
OpenWRT                      down: ~910 Mbps    up: ~910 Mbps
IPFire                       down: ~910 Mbps    up: ~910 Mbps

pfSense on Netgate SG-3100   down: ~910 Mbps    up: ~910 Mbps

My gaming PC (8700K) connected directly to the ISP's modem   down: ~915 Mbps    up: ~915 Mbps

I also did some tests by virtualizing all these firewalls (except OpenWRT) on my workstation (AMD 3950X) with VirtualBox (a type 2 hypervisor - not the best, I know; I didn't have the time to set something up on the ESXi cluster), and you can subtract ~200 Mbps from all the speeds above. That means that even virtualized, IPFire is faster than both OPNsense and pfSense running on the APU2. I also saw that all of them use only ONE thread and almost the same amount of CPU% while the transfer is going on.

My conclusions so far are these:
-The PC Engines APU2 is not the issue - it is probably a driver issue for OPNsense/pfSense.
-Single-threaded handling of a single traffic flow is not the issue either, since some firewalls are able to max out the speed on one thread.
-pfSense is still based on FreeBSD, which has one of the best network stacks in the world, but it might not be using the proper drivers for the NICs on the APU - that's my feeling, but I can't verify it.
-OPNsense is now based on HardenedBSD (a fork of FreeBSD), which adds lots of exploit mitigations directly into the code. Those security enhancements might be the cause of the APU2's slow transfer speed. OPNsense installed on premises with a ten-year-old Xeon X5650 (2.66 GHz) can run at 1 Gbps without breaking a sweat, so maybe a few MHz more are required for OPNsense to max out that 1 Gbps pipe.
-OpenWRT and IPFire are Linux-based and benefit from a much broader 'workforce' optimizing everything around them. The NICs are probably detected properly and the proper drivers are being used, and the nature of how Linux works could also help speed everything up a little more. And the Linux kernel is a dragster compared to the FreeBSD kernel (sorry FreeBSD, but I still love you - I am wearing your t-shirt today!!).

My next step, if I have time, would be to do direct speed tests internally with iperf3 in order to have another speed chart I can refer to.

Edit: FreeBSD vs HardenedBSD Features Comparison https://hardenedbsd.org/content/easy-feature-comparison

Edit 2: Another thing that came to mind is the ability of the running OS (in our case OPNsense) to 'turbo' the cores up to 1.4 GHz on the AMD GX-412TC CPU that the APU2 uses. The base frequency is 1 GHz, but with turbo it can reach 1.4 GHz. I am running the latest 4.10 firmware, but I can't (don't know how to) verify which frequency is being used during a transfer. That could explain the difference in transfer speed and why OPNsense can't max out a 1 Gbps link while others can. Link on how to upgrade the BIOS on the APU2: https://teklager.se/en/knowledge-base/apu-bios-upgrade/
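One way to watch this would be to poll the frequency FreeBSD reports while a transfer is running; a minimal sketch (assuming shell access to the box and that the cpufreq driver exposes dev.cpu.N.freq - note that hardware boost states are managed by the firmware and may not show up in this sysctl):

Code: [Select]
#!/usr/bin/env python3
"""Sample the CPU frequency FreeBSD reports while a speed test runs.

Run this in one SSH session and start the transfer in another; if the
reported frequency never moves above the 1 GHz base clock, the turbo
question becomes a lot more interesting."""
import subprocess
import time

def cpu_freq_mhz(core: int = 0) -> int:
    out = subprocess.run(
        ["sysctl", "-n", f"dev.cpu.{core}.freq"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

if __name__ == "__main__":
    for _ in range(30):                 # ~30 seconds of one-second samples
        print(f"core 0: {cpu_freq_mhz(0)} MHz")
        time.sleep(1)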

Your effort is greatly appreciated. I gave up on this topic a long time ago, but if you have the energy to go and find the resolution, you have all my support :) !
One thing I would like to ask you: could you re-check your results with PPPoE emulated on the WAN interface, instead of plain IP on the WAN interface? I expect your results will be much, much worse under OPNsense than what you achieved in this test.

My APU2 is connected via a CAT6a Ethernet cable to the ISP's modem, which in turn is connected via another CAT6a Ethernet cable to the fiber-optic transceiver. The connection from the ISP's modem is then done via PPPoE (which I don't manage - it is set up automatically by the ISP).

So the APU2 isn't doing the PPPoE connectivity (as it would have in the typical scenario 15 years ago, via DSL for example), and that is a good thing.  Now if your setup requires the APU2 to perform the PPPoE connectivity, that doesn't really impact the transmission speed.

"Now if your setup requires the APU2 to perform the PPPoE connectivity, that doesn't really impact the transmission speed."

There is a very high chance that the PPPoE session handling and the single-threaded MPD daemon are the biggest bottleneck preventing the APU2 from reaching 1 Gigabit speed.

December 19, 2019, 06:58:58 AM #83 Last Edit: December 19, 2019, 02:15:59 PM by pjdouillard
I've set up another test lab (under VirtualBox) to test the iperf3 speed between two Ubuntu servers, each behind an OPNsense 19.7.8 (fresh update from tonight!). All VMs are using 4 vCPUs and 4 GB of RAM.

-First iperf3 test (60 seconds, 1 traffic flow):
The virtual switch performance between SVR1 and SVR2 connected directly together yields ~2.4 Gbps of bandwidth.

-Second iperf3 test (60 seconds, 1 traffic flow):
This time, SVR1 is behind FW1 and SVR2 is behind FW2.  Both FW1 and FW2 are connected directly on the same virtual switch. Minimum rules are set to allow connectivity between SVR1 and SVR2 for iperf3.  Both FW1 and FW2 are NATing outbound connectivity. The performance result yields ~380Mbps.

-Third iperf3 test with PPPoE (60 seconds, 1 traffic flow):
FW1 has the PPPoE Server plugin installed and configured.  FW2 is the PPPoE client that will initiate the connection. The performance result yields ~380Mbps.

-Fourth iperf3 test with PPPoE (60 seconds, 2 traffic flows): ~380 Mbps

-Fifth iperf3 test with PPPoE (60 seconds, 4 traffic flows): ~390 Mbps

So unless I missed something, PPPoE connectivity doesn't affect network speed, as I mentioned earlier.

I will try to replicate the same setup but with 2 x APU2 and post back the performance I get.
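For reference, the 1/2/4-flow runs above can be repeated from a script so the APU2 numbers end up in the same kind of table; a minimal sketch (assuming iperf3 is installed on both servers, an `iperf3 -s` instance is already running on SVR2, and the server address below is a placeholder):

Code: [Select]
#!/usr/bin/env python3
"""Repeat the 1/2/4-flow iperf3 runs and print a small comparison.

Uses iperf3's JSON output (-J) and reports what actually arrived at
the server side."""
import json
import subprocess

SERVER = "192.0.2.10"   # placeholder: replace with SVR2's address
DURATION = 60           # seconds, same as the tests above

def run_iperf3(streams: int) -> float:
    out = subprocess.run(
        ["iperf3", "-c", SERVER, "-t", str(DURATION), "-P", str(streams), "-J"],
        capture_output=True, text=True, check=True,
    ).stdout
    result = json.loads(out)
    # sum_received = aggregate throughput measured at the receiving end
    return result["end"]["sum_received"]["bits_per_second"] / 1e6

if __name__ == "__main__":
    for streams in (1, 2, 4):
        print(f"{streams} flow(s): ~{run_iperf3(streams):.0f} Mbps")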

December 19, 2019, 03:30:29 PM #84 Last Edit: December 19, 2019, 03:36:28 PM by Ricardo
Thanks for your effort, this one was a really interesting test series.
The reason I suspect that the PPPoE encapsulation is a serious limiting factor is that the internet is full of articles that all say the same thing: PPPoE traffic is unsuitable for receive-queue distribution. The result is that only one CPU core can effectively process the entire PPPoE flow, which means the other cores sit idle while one core is at 100% load. Because the APU2 has very weak single-core CPU processing power, if multi-queue receive is effectively disabled for PPPoE, that is a big warning against using this product for 1 Gbit networks.
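A quick way to confirm the one-core-at-100% behaviour on the APU2 would be to watch per-core load during a PPPoE download; a minimal sketch (assuming Python 3 plus the third-party psutil package is available on the box - psutil works on FreeBSD as well as Linux):

Code: [Select]
#!/usr/bin/env python3
"""Print per-core CPU load once per second during a transfer.

If one core sits near 100% while the other three idle, the
single-receive-queue PPPoE theory fits."""
import psutil

if __name__ == "__main__":
    for _ in range(30):   # ~30 one-second samples
        per_core = psutil.cpu_percent(interval=1.0, percpu=True)
        print("  ".join(f"cpu{i}: {load:5.1f}%" for i, load in enumerate(per_core)))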
But anyway, I am really curious to see the next test results.

As far as I can recall, I could only do 600 Mbit/s in the LAN --> WAN direction (i.e. upload from a LAN client to an internet server); the WAN --> LAN direction (i.e. download from an internet server to a LAN client) was much slower. And all these results were using pure IP between two directly connected test PCs. When I installed the firewall in my production system, I reconfigured the WAN interface to PPPoE, and the real-world results were lower than the testbench results.

December 19, 2019, 04:45:47 PM #85 Last Edit: December 19, 2019, 05:13:08 PM by pjdouillard
Be cautious about what you read regarding a single-threaded process being the limiting factor.

When a single traffic flow enters a device, the ASIC performs the heavy lifting most of the time. The work required afterwards to analyze, route, NAT, etc. that traffic is mostly done by one CPU core (or one thread) somewhere up the processing stack.
But that work cannot be efficiently distributed (or parallelized) across many threads (cores) for a single traffic flow - it would be inefficient in the end, since the destination is the same for all the threads and they would have to 'wait' for each other, thus slowing down other traffic flows that require processing.

When multiple traffic flows enter the same device, the other CPU cores will of course be used to handle the load appropriately.

The only ways to optimize or accelerate a single traffic flow on a CPU core are:
-good, optimized network code
-the appropriate network drivers to 'talk' to the NIC
-a faster CPU core (i.e. a higher frequency in GHz/MHz)

A comparison for this behavior is the same kind of (wrong) thinking people have about link aggregation: if you bundle 4 x 1 Gbps links together, people assume that their new speed for a single traffic flow is now 4 Gbps, and they are surprised to see that their maximum speed is still only 1 Gbps, because a single flow still travels over a single 1 Gbps link. With multiple traffic flows, the compound traffic can reach the 4 Gbps speed, because now each of the 1 Gbps links is being used.
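As a toy illustration of why that is (this is not the actual lagg(4) hash, just the general idea, and the interface names are hypothetical): the aggregation layer hashes the flow's addresses/ports and always picks the same member link, so a single flow can never spread across links.

Code: [Select]
#!/usr/bin/env python3
"""Toy model of flow-to-link selection on a 4 x 1 Gbps LAG."""
import hashlib

LINKS = ["igb0", "igb1", "igb2", "igb3"]   # hypothetical 4-port LAG

def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    # Hash the flow identifiers and map the result to one member link
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return LINKS[digest % len(LINKS)]

if __name__ == "__main__":
    # The same flow always lands on the same member link...
    print(pick_link("10.0.0.2", "93.184.216.34", 50000, 443))
    print(pick_link("10.0.0.2", "93.184.216.34", 50000, 443))
    # ...only different flows (e.g. different source ports) can use other links.
    for sport in (50001, 50002, 50003):
        print(pick_link("10.0.0.2", "93.184.216.34", sport, 443))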

I hope that clears up some confusion.

But in the end, there is definitely something not running properly on both OPNsense and pfSense on those APU boards.
The APU's hardware is OK - many others and I have shown that.
So what remains are:
a) bad drivers for the Intel i210/i211 NICs
b) bad code optimization (the code itself, or the inability to make the CPU cores reach their 1.4 GHz turbo speed)
c) both a & b

The Netgate SG-3100 that I have has an ARM Cortex-A9, a dual-core CPU running at 1.6 GHz, and it is able to sustain that 1 Gbps speed. And we saw above that pfSense is somewhat faster on the APU than OPNsense. IMO, I really think we are facing a FreeBSD NIC driver issue for the Intel i210/i211 chipset.

December 20, 2019, 11:40:56 PM #86 Last Edit: December 20, 2019, 11:43:20 PM by pjdouillard
I haven't had the time to set up the APU, but I re-did the same test under ESXi because I was curious about the performance I could reach.

The ESXi 6.7 host is a Ryzen 2700X with 32 GB of RAM and its storage on a networked FreeNAS. All four VMs were running on it with 2 vCPUs and 4 GB RAM each.

The direct iperf3 bandwidth across the virtual switch from svr1 to svr2 was ~24 Gbps.
The same flow, but with svr1 having to pass through fw1 (NAT + rules) and then fw2 (NAT + rules) before reaching svr2, gave an iperf3 bandwidth of ~4 Gbps.

That's a far cry from what I've achieved on faster hardware under VirtualBox lol.

On another subject: I had an issue with this setup under ESXi, as the automatic outbound NAT rules weren't generated on both firewalls for some reason (they were under VirtualBox, though). I find that odd, but I recall that a few weeks ago, while I was giving a class at the college and using OPNsense to set up an OpenVPN VPN with my students, I was seeing internal network addresses reaching my firewall's WAN port. The day before I wasn't seeing this, and I hadn't changed the setup, so I blamed VirtualBox for the problem... but now I see the same behavior under ESXi, and I am wondering if there is an issue with the automatic outbound NAT rule generation somehow. What is causing this behavior?

NAT always reduces throughput, since the traffic has to traverse the CPU. Automatic NAT can cause problems when you have multiple devices/networks to reach. Then you have to remove the upstream gateway in the interface configuration and add manual NAT rules. :)

December 21, 2019, 05:06:25 PM #88 Last Edit: December 22, 2019, 11:54:17 AM by pjdouillard
I wouldn't say that NAT always reduces throughput; it depends on what devices are used.
APUs and a lot of other cheap, low-powered devices do have issues with NAT, yes - that was the main reason I ditched several consumer-grade routers when I got 1 Gbps fiber at home four years ago. Back then, only the Linksys 3200ACM was able to keep up the speed with NAT active... until mysteriously - as for hundreds of other people who posted on the Linksys forums - connections started to drop randomly and internet connectivity became a nightmare.

That's when I started looking for something better, and I ended up with pfSense on an SG-3100 two years ago. All my problems were solved and remain so to this day.


Can we please quit the others-are-so-great talk now? I don't think mentioning it in every other post really helps this community in any substantial way.


Cheers,
Franco