PC Engines APU2 1Gbit traffic not achievable

Started by Ricardo, July 27, 2018, 12:24:54 PM

Will try to see if any of these make a difference. But in general I am very skeptical that they will, and nobody from the forum owners has replied anything meaningful since this thread started :(
(apart from basically saying it's not practical to compare BSD and Linux)

Maybe the forum owners don't use the APU?

Have you followed the interrupt stuff from:
https://wiki.freebsd.org/NetworkPerformanceTuning


How many queues does your NIC have? Perhaps you can lower the number of queues on the NIC if single-stream throughput is so important for you, but then I'd guess all other traffic will be starved...
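
In case it helps, a rough way to check this on the APU2 (assuming the stock igb(4) driver and interface names; adjust igb0/igb1 to your setup):

dmesg | grep -i igb      # the attach messages show how many MSI-X vectors / queues each port got
vmstat -i | grep igb     # one "igbX:que N" interrupt line per active queue
top -CHIPS               # per-thread CPU view; watch the igb0:que threads while the test runs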

Quote from: ricsip on August 09, 2018, 02:32:27 PM
Will try to see if any of these make a difference. But in general I am very skeptical that they will, and nobody from the forum owners has replied anything meaningful since this thread started :(
(apart from basically saying it's not practical to compare BSD and Linux)

Hi ricsip,

be aware that there are a lot of circumstances (especially with the hardware, or your test setup) in which things will not work in an optimal way... There is no "silver bullet" - so complaining will not help. It is also possible that other users of OPNsense & APU2 do not have a requirement for a single near-full 1Gbit flow. From my perspective you have three choices: try stronger hardware, use other software, or do some more testing and let the community participate...

regards pylox



https://calomel.org/freebsd_network_tuning.html

# Disable Hyper Threading (HT), also known as Intel's proprietary simultaneous
# multithreading (SMT) because implementations typically share TLBs and L1
# caches between threads which is a security concern. SMT is likely to slow
# down workloads not specifically optimized for SMT if you have a CPU with more
# than two(2) real CPU cores. Secondly, multi-queue network cards are as much
# as 20% slower when network queues are bound to real CPU cores as well as SMT
# virtual cores due to interrupt processing inefficiencies.
machdep.hyperthreading_allowed="0"  # (default 1, allow Hyper Threading (HT))

# Intel igb(4): The Intel i350-T2 dual port NIC supports up to eight(8)
# input/output queues per network port, the card has two(2) network ports.
#
# Multiple transmit and receive queues in network hardware allow network
# traffic streams to be distributed into queues. Queues can be mapped by the
# FreeBSD network card driver to specific processor cores leading to reduced
# CPU cache misses. Queues also distribute the workload over multiple CPU
# cores, process network traffic in parallel and prevent network traffic or
# interrupt processing from overwhelming a single CPU core.
#
# http://www.intel.com/content/dam/doc/white-paper/improving-network-performance-in-multi-core-systems-paper.pdf
#
# For a firewall under heavy CPU load we recommend setting the number of
# network queues equal to the total number of real CPU cores in the machine
# divided by the number of active network ports. For example, a firewall with
# four(4) real CPU cores and an i350-T2 dual port NIC should use two(2) queues
# per network port (hw.igb.num_queues=2). This equals a total of four(4)
# network queues over two(2) network ports which map to four(4) real CPU
# cores. A FreeBSD server with four(4) real CPU cores and a single network port
# should use four(4) network queues (hw.igb.num_queues=4). Or, set
# hw.igb.num_queues to zero(0) to allow the FreeBSD driver to automatically set
# the number of network queues to the number of CPU cores. It is not recommended
# to allow more network queues than real CPU cores per network port.
#
# Query total interrupts per queue with "vmstat -i" and use "top -CHIPS" to
# watch CPU usage per igb0:que. Multiple network queues will trigger more total
# interrupts compared to a single network queue, but the processing of each of
# those queues will be spread over multiple CPU cores allowing the system to
# handle increased network traffic loads.
hw.igb.num_queues="2"  # (default 0, queues equal the number of real CPU cores)

# Intel igb(4): FreeBSD puts an upper limit on the number of received
# packets a network card can process to 100 packets per interrupt cycle. This
# limit is in place because of inefficiencies in IRQ sharing when the network
# card is using the same IRQ as another device. When the Intel network card is
# assigned a unique IRQ (dmesg) and MSI-X is enabled through the driver
# (hw.igb.enable_msix=1) then interrupt scheduling is significantly more
# efficient and the NIC can be allowed to process packets as fast as they are
# received. A value of "-1" means unlimited packet processing and sets the same
# value to dev.igb.0.rx_processing_limit and dev.igb.1.rx_processing_limit. A
# process limit of "-1" is around one(1%) percent faster than "100" on a
# saturated network connection.
hw.igb.rx_process_limit="-1"  # (default 100 packets to process concurrently)
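
For reference, all three of the above are boot-time loader tunables. A minimal sketch of applying and verifying them on an APU2 (assuming a stock OPNsense/FreeBSD layout; adjust interface numbers to your box):

# append to /boot/loader.conf (or add them via the OPNsense tunables page), then reboot
machdep.hyperthreading_allowed="0"
hw.igb.num_queues="2"
hw.igb.rx_process_limit="-1"

# after the reboot, spot-check that the driver picked the values up
sysctl dev.igb.0.rx_processing_limit dev.igb.1.rx_processing_limit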

If these suggestions improve performance, I'd love to hear about it.

Quote from: mimugmail on August 09, 2018, 04:16:00 PM
https://calomel.org/freebsd_network_tuning.html
[...]

Testing is in progress, but at the moment I am overloaded with my other tasks. Just wanted to let you know I didn't abandon the thread. As my goal is to get this fixed, I will post the results here in the next couple of days anyway.

August 15, 2018, 11:01:04 AM #21 Last Edit: August 15, 2018, 11:22:33 AM by ricsip
Quote from: pylox on August 07, 2018, 07:55:27 PM
Quote from: ricsip on August 06, 2018, 02:18:30 PM
Hello pylox, all

just to be clear: I am testing through a plain IP+NAT connection (PPPoE was mentioned as a possible bottleneck, but not tested YET), and that simple test setup reaches only approx. 40-50% of the maximum possible throughput. If I add PPPoE, it will be even slower. That's the point of this thread: trying to find at least one credible person who is currently using an APU2 with OPNsense and who confirms their speed can reach at least 85-90% of gigabit. Even over PPPoE!
Then the next round will be to see what needs to be fine-tuned to get the same performance at my ISP.
......

Hi ricsip,

this is very hard to find. Unfortunately I do not have a test setup with an APU2 (and not much time).
But you can try different things:

1. Change these tunables and measure...
vm.pmap.pti="0"  #(disable meltdown patch - this is an AMD processor)
hw.ibrs_disable="1" #(disable spectre patch temporarily)

2. Try to disable igb flow control for each interface and measure
hw.igb.<x>.fc=0  #(x = number of interface)

3. Change the network interface interrupt rate and measure
hw.igb.max_interrupt_rate="16000" #(start with 16000, can be increased up to 64000)

4. Disable Energy Efficiency for each interface and measure
dev.igb.<x>.eee_disabled="1" #(x = number of interface)

Should be enough for the first time...;-)

regards pylox

Ok, I did all the steps above. No improvement; the measurements/results are still wildly sporadic after each test execution.

The only difference is that the CPU load characteristics went from 99% SYS + 60-70% IRQ to 100% + 60-70% IRQ (SYS dropped to 1-2%).

Note1: I only tried hw.igb.max_interrupt_rate="8000" --> "16000", not anything higher.
Note2: step 2 ("Try to disable igb flow control for each interface and measure: hw.igb.<x>.fc=0") contains a TYPO; the OID is actually dev.igb.<x>.fc=0
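
For anyone repeating these steps, here is a sketch of how the four suggestions split between boot-time tunables and runtime sysctls (FreeBSD 11-era OPNsense assumed; igb0/igb1 are just examples):

# boot-time tunables: add to /boot/loader.conf (or the OPNsense tunables page) and reboot
vm.pmap.pti="0"                      # step 1: disable the Meltdown page-table isolation patch
hw.igb.max_interrupt_rate="16000"    # step 3: raise the per-queue interrupt rate limit

# runtime sysctls: can be changed on the fly and re-checked between test runs
sysctl hw.ibrs_disable=1             # step 1: disable the Spectre IBRS mitigation
sysctl dev.igb.0.fc=0                # step 2: disable flow control (corrected OID, see Note2)
sysctl dev.igb.1.fc=0
sysctl dev.igb.0.eee_disabled=1      # step 4: disable Energy Efficient Ethernet
sysctl dev.igb.1.eee_disabled=1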

August 15, 2018, 11:55:19 AM #22 Last Edit: August 15, 2018, 12:26:55 PM by ricsip
Quote from: mimugmail on August 09, 2018, 04:16:00 PM
https://calomel.org/freebsd_network_tuning.html
[...]

I have also gone through this. No measurable improvement in throughput.

machdep.hyperthreading_allowed="0"  # (default 1, allow Hyper Threading (HT)) --> NOT APPLICABLE in my case. This AMD CPU has 4 physical cores and sysctl hw.ncpu --> 4, so HT (even if it were supported, which I am not sure about) is not active currently.
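
A quick way to confirm that, assuming the usual FreeBSD SMP sysctls are present:

sysctl hw.ncpu kern.smp.cores kern.smp.threads_per_core
# the GX-412TC in the APU2 should report 4 logical CPUs, 4 cores and 1 thread per core, i.e. no SMT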

hw.igb.num_queues="2"  # (default 0, queues equal the number of real CPU cores)
--> I have 4 cores and 2 active NICs, and each NIC supports up to 4 queues. I used the default
hw.igb.num_queues="0", but tried it with hw.igb.num_queues="2" as well.
No improvement in throughput (for a single flow).
But! It seems to have degraded the multi-flow performance heavily.

hw.igb.enable_msix=1 has been set like that since the beginning.
hw.igb.rx_process_limit="-1" --> was set, but no real improvement in throughput.
dev.igb.0.rx_processing_limit and dev.igb.1.rx_processing_limit are both set to "-1", as described in the previous entry.
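
For reference, the kind of single-flow vs. multi-flow comparison meant here, assuming an iperf3 server somewhere on the far side of the firewall (the address is only a placeholder):

iperf3 -c 192.0.2.10 -t 30           # one TCP stream through the NAT path
iperf3 -c 192.0.2.10 -t 30 -P 4      # four parallel streams; compare the aggregate [SUM] line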

I am very sad that this won't be solvable under OPNsense without switching to a competitor or replacing the hardware itself.

Sorry .. none of us are magicians.  ::)

You can go for commercial vendors like Cisco, where you are limited to 85 Mbit and have to purchase an extra license.


Well, that's disappointing. OPNsense is a great piece of software. Maybe I'll check back in when FreeBSD 12 is released, as I think this is overall a better solution for my needs than ipfire.

When you send me such a device I can do some testing. No other idea how to help.

I'm willing to chip in to buy the OPNsense project an APU2.


Looks like I'm the only one willing to chip in?

Quote from: KantFreeze on August 21, 2018, 04:37:21 PM
Looks like I'm the only one willing to chip in?

@KantFreeze:
Let's be reasonable. Nobody will send complimentary equipment to unknown people on the internet. At least that is my view.

@mimugmail: how about a donation towards you, so you can buy a brand new APU2 for yourself and spend some valuable time determining its maximum performance capabilities and documenting your findings? No need to return the device at the end; you should keep it for future OPNsense release benchmarks / regression tests.

I bought my APU2 from a local reseller (motherboard + black case + external PSU + a 16 GB mSATA SSD); the total was approx. 200 EUR. If there are 10 real volunteers, I am willing to spend a 20 EUR (non-refundable) "donation" on this project.

DM me for the details, if you are interested.