Raspberry PI 4B Network Performance #

This article discusses the network performance of the built-in ethernet controller of the Raspberry Pi 4B.

I ran all measurements on Alpine Linux 3.15 and a 5.15.4-0-rpi4 aarch64 kernel. My measurement tool of choice was iperf3.

As a communication partner, I used a Windows 10 PC on which I stopped all other network services and disconnected LAN/Internet to allow for undisturbed measurements.

Half-duplex #

When one machine is sending, and the other one is receiving data (half-duplex), the built-in ethernet controller can almost fully saturate the 1Gbit/s line. With a network layer MTU of 1500 bytes the theoretical max throughput is ~949Mbit/s:

1500bytes Ethernet payload - 40bytes TCP/IP header = 1460bytes (payload available to iperf)
1500bytes + 26bytes + 12bytes layer 1 and 2 frame = 1538bytes (on phy medium per frame)
net / gross: 1460bytes / 1538bytes = 0.9493% of link speed

Setup:

Windows 10 i5 4 Core PC with IP: 10.0.0.100
RPI4B 8GB with IP: 10.0.0.1
10 parallel TCP streams
One test run is 60 seconds
5 seconds warm-up time

PC sending #

iperf3.exe -c 10.0.0.1 -t 65 -O 5 -P 10

(CPU Affinity manually set to CPU 3 and 4)

[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-64.99  sec   737 MBytes  95.1 Mbits/sec                  sender
[  6]   0.00-64.99  sec   737 MBytes  95.1 Mbits/sec                  sender
[  8]   0.00-64.99  sec   736 MBytes  95.0 Mbits/sec                  sender
[ 10]   0.00-64.99  sec   736 MBytes  95.0 Mbits/sec                  sender
[ 12]   0.00-64.99  sec   736 MBytes  94.9 Mbits/sec                  sender
[ 14]   0.00-64.99  sec   736 MBytes  94.9 Mbits/sec                  sender
[ 16]   0.00-64.99  sec   735 MBytes  94.9 Mbits/sec                  sender
[ 18]   0.00-64.99  sec   735 MBytes  94.9 Mbits/sec                  sender
[ 20]   0.00-64.99  sec   734 MBytes  94.8 Mbits/sec                  sender
[ 22]   0.00-64.99  sec   733 MBytes  94.6 Mbits/sec                  sender
[SUM]   0.00-64.99  sec  7.18 GBytes   949 Mbits/sec                  sender

Raspberry PI4 sending #

iperf3.exe -c 10.0.0.100 -t 65 -O 5 -P 10 -i 60 -A 3 -Z

The Z option enables the “zero-copy” optimization
The A 3 option runs iperf on the 4th CPU to ensure CPUs are equally used

From the perspective of the Windows 10 machine:

iperf3 output on the Raspberry PI4:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-65.00  sec   559 MBytes  72.1 Mbits/sec    0             sender
[  7]   0.00-65.00  sec   548 MBytes  70.8 Mbits/sec    0             sender
[  9]   0.00-65.00  sec   648 MBytes  83.6 Mbits/sec    0             sender
[ 11]   0.00-65.00  sec   821 MBytes   106 Mbits/sec    0             sender
[ 13]   0.00-65.00  sec   570 MBytes  73.6 Mbits/sec    0             sender
[ 15]   0.00-65.00  sec  1.03 GBytes   136 Mbits/sec    0             sender
[ 17]   0.00-65.00  sec   822 MBytes   106 Mbits/sec    0             sender
[ 19]   0.00-65.00  sec  1.29 GBytes   171 Mbits/sec    0             sender
[ 21]   0.00-65.00  sec   190 MBytes  24.5 Mbits/sec    0             sender
[ 23]   0.00-65.00  sec   826 MBytes   107 Mbits/sec    0             sender
[SUM]   0.00-65.00  sec  7.19 GBytes   950 Mbits/sec    0             sender

Full-duplex #

When both machines send and receive simultaneously, the throughput drops significantly to about ~550Mbits/sec for both sides.

Windows sender:

[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-64.99  sec   407 MBytes  52.6 Mbits/sec                  sender
[  6]   0.00-64.99  sec   481 MBytes  62.1 Mbits/sec                  sender
[  8]   0.00-64.99  sec   518 MBytes  66.8 Mbits/sec                  sender
[ 10]   0.00-64.99  sec   394 MBytes  50.9 Mbits/sec                  sender
[ 12]   0.00-64.99  sec   211 MBytes  27.2 Mbits/sec                  sender
[ 14]   0.00-64.99  sec   315 MBytes  40.7 Mbits/sec                  sender
[ 16]   0.00-64.99  sec   434 MBytes  56.0 Mbits/sec                  sender
[ 18]   0.00-64.99  sec   487 MBytes  62.9 Mbits/sec                  sender
[ 20]   0.00-64.99  sec   301 MBytes  38.8 Mbits/sec                  sender
[ 22]   0.00-64.99  sec   632 MBytes  81.6 Mbits/sec                  sender
[SUM]   0.00-64.99  sec  4.08 GBytes   540 Mbits/sec                  sender

Raspberry PI sender:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-65.00  sec  85.9 MBytes  11.1 Mbits/sec    0             sender
[  7]   0.00-65.00  sec  89.4 MBytes  11.5 Mbits/sec    0             sender
[  9]   0.00-65.00  sec   410 MBytes  53.0 Mbits/sec    0             sender
[ 11]   0.00-65.00  sec   982 MBytes   127 Mbits/sec    0             sender
[ 13]   0.00-65.00  sec  90.0 MBytes  11.6 Mbits/sec    0             sender
[ 15]   0.00-65.00  sec  91.7 MBytes  11.8 Mbits/sec    0             sender
[ 17]   0.00-65.00  sec  87.8 MBytes  11.3 Mbits/sec    0             sender
[ 19]   0.00-65.00  sec   421 MBytes  54.3 Mbits/sec    0             sender
[ 21]   0.00-65.00  sec   985 MBytes   127 Mbits/sec    0             sender
[ 23]   0.00-65.00  sec   984 MBytes   127 Mbits/sec    0             sender
[SUM]   0.00-65.00  sec  4.13 GBytes   545 Mbits/sec    0             sender

Performance Optimization 1 - Interrupt CPU Affinity #

The following shows the interrupt distribution over the 4 CPUs on the Raspberry PI4:

lbox:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  9:          0          0          0          0     GICv2  25 Level     vgic
 11:      59242       6754      27139      15696     GICv2  30 Level     arch_timer
 12:          0          0          0          0     GICv2  27 Level     kvm guest vtimer
 18:        860          0          0          0     GICv2  65 Level     fe00b880.mailbox
 22:          0          0          0          0     GICv2 112 Level     bcm2708_fb DMA
 24:        347          0          0          0     GICv2 114 Level     DMA IRQ
 31:         55          0          0          0     GICv2  66 Level     VCHIQ doorbell
 32:       7045          0          0          0     GICv2 158 Level     mmc1, mmc0
 33:          0          0          0          0     GICv2  48 Level     arm-pmu
 34:          0          0          0          0     GICv2  49 Level     arm-pmu
 35:          0          0          0          0     GICv2  50 Level     arm-pmu
 36:          0          0          0          0     GICv2  51 Level     arm-pmu
 38:    1899524          0          0          0     GICv2 189 Level     eth0
 39:     214360          0          0          0     GICv2 190 Level     eth0
 46:        364          0          0          0  BRCM STB PCIe MSI 524288 Edge      xhci_hcd

Interrupt 38 is responsible for sending and interrupt 39 for receiving data. These interrupts are only processed by CPU0, which is not ideal.

Thus, for the first performance optimization, I made the following changes:

Pin the interrupt which signals a send operation has finished to CPU 1 (counting from 0)
Pin the interrupt which signals data has been received to CPU 2

echo 2 > /proc/irq/38/smp_affinity
echo 4 > /proc/irq/39/smp_affinity

Note: The RPI4 ARM GICv2 cannot signal one particular interrupt to more than one CPU. For example, setting the affinity mask for IRQ 38 to 6 will still lead to CPU 1 processing all interrupts.

After another full-duplex run, the interrupt stats showed a much better distribution over CPU 1 and 2:

 38:    1899524     107186          0          0     GICv2 189 Level     eth0
 39:     214360          0     398379          0     GICv2 190 Level     eth0

The Raspberry PI was almost back to its half-duplex sending speed, whereas the Windows PC 10 send-performance (Raspberry PI receive performance) did not increase quite as much and also, the throughput fluctuated much more:

Windows Sender:

[SUM]   0.00-64.99  sec  4.64 GBytes   613 Mbits/sec                  sender

Raspberry PI Sender:

[SUM]   0.00-65.00  sec  6.54 GBytes   864 Mbits/sec  315             sender

Note: The Raspberry PI stats show 315 TCP retransmissions.

Performance Optimization 2 - Paket Steering #

So far, I have looked at hardware interrupts signalled when data has been received on the link layer. However, the kernel also needs to execute code that handles further data processing (Layer 3 and above). For instance, it needs to decide whether a packet is dropped, forwarded to another interface or sent to a local process.

The following commands enable all CPUs for the kernel code which handles the receive and transmit queues:

echo f >/sys/class/net/eth0/queues/tx-0/xps_cpus
echo f >/sys/class/net/eth0/queues/tx-1/xps_cpus
echo f >/sys/class/net/eth0/queues/tx-2/xps_cpus
echo f >/sys/class/net/eth0/queues/tx-3/xps_cpus
echo f >/sys/class/net/eth0/queues/tx-4/xps_cpus
echo f >/sys/class/net/eth0/queues/rx-0/rps_cpus

Note: Some more insights about the way this works can be found here

This brings another performance gain and removes the throughput fluctuation seen before:

Windows Sender:

[SUM]   0.00-64.99  sec  7.08 GBytes   935 Mbits/sec                  sender

Raspberry PI Sender:

[SUM]   0.00-65.00  sec  5.78 GBytes   764 Mbits/sec    0             sender

Performance increased on both machines, with the Windows machine sending again at almost line speed.

Looking at top, the interrupt controller seems to be at peak processing capacity in regards to the send queue:

Mem: 97216K used, 7903080K free, 120K shrd, 3384K buff, 21636K cached
CPU0:   0% usr   0% sys   0% nic  98% idle   0% io   0% irq   0% sirq
CPU1:   0% usr   0% sys   0% nic   0% idle   0% io   0% irq 100% sirq
CPU2:   0% usr   0% sys   0% nic  74% idle   0% io   0% irq  24% sirq
CPU3:   3% usr  62% sys   0% nic  30% idle   0% io   0% irq   3% sirq
Load average: 0.57 0.15 0.04 6/114 2240
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   17     2 root     RW       0   0%   1  28% [ksoftirqd/1]
 2231  2228 root     S     2332   0%   3  14% iperf3 -s
 2240  2186 root     R     2304   0%   3   5% iperf3 -c 10.0.0.100 -t 65 -O 5 -P 10 -i 60 -A 3 -Z
   22     2 root     RW       0   0%   2   5% [ksoftirqd/2]
 2239  2236 root     R     1668   0%   0   0% top
    7     2 root     IW       0   0%   2   0% [kworker/u8:0-ev]
   38     2 root     IW       0   0%   0   0% [kworker/0:2-eve]
   11     2 root     SW       0   0%   0   0% [ksoftirqd/0]

Performance Optimization Attempt - Overclocking the CPU #

I added the following settings to the according usercfg.txt section:

over_voltage=6
arm_freq=2000
force_turbo=1

and verified that the CPU speed is indeed at 2Ghz by running vcgencmd measure_clock arm. (Please note that force_turbo=1 enforces these settings even when the CPU is idle and should not be used under normal circumstances)

Increasing the CPU speed showed a negligible improvement with the Raspberry PI send-performance changing from 764 to 773 Mbits/sec.

This article goes into great detail about overclocking the PI4.

Performance Optimization Attempt - PCIe Payload Setting #

Inspired by this comment (coming from Jeff Geerling’s post here), I added pci=pcie_bus_perf to the kernel parameters in cmdline.txt and the effect of this kernel parameter can be verified by executing lspci -vv:

lbox:~# lspci -vv
00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2711 PCIe Bridge (rev 10) (prog-if 00 [Normal decode])
        Device tree node: /sys/firmware/devicetree/base/scb/pcie@7d500000/pci@0,0
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00000000-00000fff [size=4K]
        Memory behind bridge: c0000000-c00fffff [size=1M]
        Prefetchable memory behind bridge: [disabled]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
                        ClockPM+ Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (ok), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
...
01:00.0 USB controller: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller (rev 01) (prog-if 30 [XHCI])
        Subsystem: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller
...
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 256 bytes

Under DevCap the tool shows the maximum payload supported by the PCI Bridge, respectively, the USB controller. Under DevCtl you can find the current setting which are MaxPayload 512 bytes, MaxReadReq 512 bytes for the PCI Bridge and MaxPayload 256 bytes, MaxReadReq 256 bytes for the USB controller.

(Details about the kernel parameter can be found here.)

With this change, I could not observe any further performance improvement. I assume that the PI4 ethernet controller is not sitting on the PCI bus but connected to the CPU via different means. However, changing the maximum payload size might affect external USB3 devices.

Summary #

After applying the interrupt tweaks described above, the Raspberry PI4B shows an impressive full-duplex throughput over 20 connections with a download speed of 936Mbit/s and an upload speed of 764Mbit/s.

Overclocking the CPU or increasing the PCI-express payload did not result in a higher upload speed. The output of top or atop suggest the interrupt controller is at its peak processing capacity for the send queue with 95% and higher utilization.

It also needs to be considered that iperf itself requires CPU power and might limit throughput.

Finally, the Raspberry PI shows higher receive than send-performance in full-duplex in all scenarios.