XFRM pCPU
Goal: scalable IPsec throughput with multiple CPUs (with IPsec HW offload).
The idea of a per-CPU SA in the outgoing direction was discussed at the Linux IPsec workshop in March 2019 in Prague. During the following days a small group of people worked on a prototype of the user space (IKE) side, Libreswan, and the Linux kernel xfrm code. Libreswan calls the option "clones"; in the kernel it is called pCPU. These names may change as we adapt the idea to TOS bits or TCP/UDP DST port hashing.
Results
The test results, as of Nov 2019, show the aggregated throughput increasing linearly with the number of CPUs. We tested using physical servers with Mellanox CX4 NICs. These NICs (with the latest Linux CX5 driver) support RSS for ESP. In the test, clear-text traffic was generated by a hardware traffic generator; one IPsec gateway encrypted and forwarded it to the other IPsec gateway, which decrypted it, and the traffic generator received the clear text.
|Traffic generator|-----|IPsec gateway west|=====IPsec 40Gbps link=====|IPsec gateway east|-----|Traffic generator|
The initial numbers are 17-18 Gbps; with 3 flows we see about 6-7 Gbps per CPU.
How to test this
Libreswan source with clones support (branch clones-3)
git clone --single-branch --branch clones-3 https://github.com/antonyantony/libreswan
Sample config ipsec.conf
<pre>
conn westnet-eastnet
	rightid=@east
	leftid=@west
	left=192.1.2.45
	right=192.1.2.23
	rightsubnet=192.0.2.0/24
	leftsubnet=192.0.1.0/24
	authby=secret
	clones=2
	auto=add
	nic-offload=no

ipsec auto --up westnet-eastnet
taskset 0x1 ping -n -c 2 -I 192.0.1.254 192.0.2.254
taskset 0x2 ping -n -c 2 -I 192.0.1.254 192.0.2.254
ipsec trafficstatus
ipsec whack --trafficstatus
006 #2: "westnet-eastnet-0", type=ESP, add_time=1234567890, inBytes=0, outBytes=0, id='@east'
006 #4: "westnet-eastnet-1", type=ESP, add_time=1234567890, inBytes=168, outBytes=168, id='@east'
006 #3: "westnet-eastnet-2", type=ESP, add_time=1234567890, inBytes=168, outBytes=168, id='@east'
</pre>
NOTE: both SA #3 and #4 have outgoing traffic on them.
Kernel source with pCPU (branch pcpu-2)
git clone -b pcpu-2 https://github.com/antonyantony/linux
Kernel / xfrm plans
- Release the private branch via Steffen's repository for wider testing.
- Kernel support for rekey. One could rekey in any order: either a head SA or a sub SA.
- One main difference: when installing a new sub SA during a rekey, add_sa() will delete the old sub SA, so Libreswan should not try to delete it.
- Ben would like to add a feature to bind a sub SA to a head SA?
- The latest iproute2 seems to be needed; otherwise "ip x s" may loop.
- Bug fixes.
Libreswan Plans
- Currently clones=n is supported. Both sides should have the same number.
- Support for asymmetric configuration, e.g. 8 on one side (initiator) and 4 on the responder.
- Fix rekey: we should not delete a sub SA; only delete the head SA during its rekey.
- Fix bugs in "ipsec auto --down" and delete.
- Don't allow a clone instance on its own to add|delete|down using the unaliased name.
- Test interop with an unsupported version. Ideally we should detect it and not install clones. It could be that we will install clones and the last one would be used.
nCPU < nSAs
Let's say there are 4 CPUs and the number of clones configured is 8, because the other end has 8 CPUs. The head SA's list only has 4 slots for sub SAs, so Libreswan should install only 4 send SAs.
A bit of detail about the Child SA initiator and what the initiator and responder are committing to. As I understand the RFC, and also from Tero, when an initiator sends a request to set up an SA, the IKE Child SA request (both in IKE_AUTH and CREATE_CHILD_SA) is for a bidirectional SA: the initiator is committing to receive on that SA, and the responder is committing to receive and not to send. If the IKE daemon on the 4-CPU side installs 8 receive SAs and 4 sub (send) SAs, everything would work.
Linux kernel XFRM details
You need extra flags and attributes on the XFRM_MSG_NEWSA, XFRM_MSG_UPDSA, and XFRM_MSG_GETSA calls, and only for the outgoing SA.
XFRM_MSG_NEWSA head SA
Set the XFRM_SA_PCPU_HEAD flag in XFRMA_SA_EXTRA_FLAGS.
XFRM_MSG_NEWSA sub SA
Set the XFRM_SA_PCPU_SUB flag in XFRMA_SA_EXTRA_FLAGS and set the new attribute XFRMA_SA_PCPU to the <cpu id>. The CPU SA ID starts from 0 and is a u32.
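To make the attribute layout concrete, here is a minimal C sketch of how a user-space daemon might build an XFRM_MSG_NEWSA request for a sub SA over netlink. The XFRM_SA_PCPU_HEAD/XFRM_SA_PCPU_SUB flags and the XFRMA_SA_PCPU attribute exist only in the pcpu-2 branch, so the numeric values and the add_attr() helper below are assumptions for illustration, not code from that branch.
<pre>
/*
 * Minimal sketch: building an XFRM_MSG_NEWSA netlink request for a pCPU
 * sub SA.  The pCPU flag/attribute values are NOT in mainline headers;
 * they come from the pcpu-2 branch and the numbers used here are
 * placeholders for illustration only.
 */
#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/xfrm.h>

/* Assumed definitions from the pcpu-2 branch -- values are illustrative. */
#define XFRM_SA_PCPU_HEAD  0x20   /* bit in XFRMA_SA_EXTRA_FLAGS (assumed) */
#define XFRM_SA_PCPU_SUB   0x40   /* bit in XFRMA_SA_EXTRA_FLAGS (assumed) */
#define XFRMA_SA_PCPU      32     /* attribute carrying the cpu/sub-SA id (assumed) */

struct newsa_req {
	struct nlmsghdr n;
	struct xfrm_usersa_info info;
	char attrs[256];              /* room for the extra attributes */
};

/* Append one netlink attribute; bounds checking omitted in this sketch. */
static void add_attr(struct nlmsghdr *n, int type, const void *data, int len)
{
	struct rtattr *rta = (struct rtattr *)((char *)n + NLMSG_ALIGN(n->nlmsg_len));

	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(len);
	memcpy(RTA_DATA(rta), data, len);
	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

/* Build a request for sub SA <cpu_id>.  A head SA is built the same way,
 * except extra_flags carries XFRM_SA_PCPU_HEAD and no XFRMA_SA_PCPU
 * attribute is added. */
static void build_newsa_sub(struct newsa_req *req, __u32 cpu_id)
{
	__u32 extra_flags = XFRM_SA_PCPU_SUB;

	memset(req, 0, sizeof(*req));
	req->n.nlmsg_len = NLMSG_LENGTH(sizeof(req->info));
	req->n.nlmsg_type = XFRM_MSG_NEWSA;
	req->n.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;

	/* ... fill req->info (addresses, SPI, mode, algorithms) as usual ... */

	add_attr(&req->n, XFRMA_SA_EXTRA_FLAGS, &extra_flags, sizeof(extra_flags));
	add_attr(&req->n, XFRMA_SA_PCPU, &cpu_id, sizeof(cpu_id)); /* u32, starts at 0 */
}
</pre>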
XFRM_MSG_UPDSA
Both the head SA and the sub SAs need the extra attributes:
- head SA: set XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_HEAD.
- sub SA: set XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_SUB and XFRMA_SA_PCPU to <sub-sa-id>. The sub SA ID starts from 0 and is a u32.
XFRM_MSG_GETSA call only changes for the sub SA
- sub SA: set XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_SUB and XFRMA_SA_PCPU to <sub-sa-id>.
- also set XFRMA_SRCADDR to the source address (see the sketch below).
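The GETSA request body stays the usual struct xfrm_usersa_id; only the extra attributes change. A minimal sketch, reusing the add_attr() helper and the assumed pCPU constants from the XFRM_MSG_NEWSA example above:
<pre>
/* Sketch: XFRM_MSG_GETSA for one specific sub SA.  add_attr() and the
 * assumed pCPU constants are as in the XFRM_MSG_NEWSA sketch above. */
struct getsa_req {
	struct nlmsghdr n;
	struct xfrm_usersa_id id;
	char attrs[128];
};

static void build_getsa_sub(struct getsa_req *req, __u32 sub_sa_id,
			    const xfrm_address_t *src)
{
	__u32 extra_flags = XFRM_SA_PCPU_SUB;

	memset(req, 0, sizeof(*req));
	req->n.nlmsg_len = NLMSG_LENGTH(sizeof(req->id));
	req->n.nlmsg_type = XFRM_MSG_GETSA;
	req->n.nlmsg_flags = NLM_F_REQUEST;

	/* ... fill req->id (daddr, spi, family, proto) to identify the SA ... */

	add_attr(&req->n, XFRMA_SA_EXTRA_FLAGS, &extra_flags, sizeof(extra_flags));
	add_attr(&req->n, XFRMA_SA_PCPU, &sub_sa_id, sizeof(sub_sa_id));
	add_attr(&req->n, XFRMA_SRCADDR, src, sizeof(*src));
}
</pre>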
What kind of workload is supported
As of Nov 2019 only
Can we distribute a 4-tuple workload?
Yes. The application on the sender side must run on the right CPU, e.g. under something like "taskset 0x1 ping -n -c 2 -I 192.0.1.254 192.0.2.254", numactl, or similar; a programmatic alternative is sketched below.
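If running the sender under taskset or numactl is awkward, the application can pin itself with the standard Linux CPU-affinity API instead. A minimal sketch (nothing here is specific to the pCPU patches):
<pre>
/* Pin the calling thread to one CPU so its outgoing traffic maps to that
 * CPU's sub SA; equivalent in effect to launching the tool under taskset. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling thread */
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}
</pre>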
Receiver side RSS support
To get this working you need Receive Side Scaling (RSS). The receiver NIC should be able to steer different flows, based on SPI, into separate queues; otherwise the receiver seems to get overwhelmed. We used a Mellanox CX4 to test. Some cards we initially tested did not seem to support RSS for ESP flows, only for TCP and UDP. While figuring out RSS for these cards we tried a slightly different approach: with ESP-in-UDP encapsulation, along with the ESP-in-UDP GRO patches, we could see the flows getting distributed on the receiver.
RSS Commands
Enable GRO and it should work. Ideally you should be able to run the following:
ethtool -N <nic> rx-flow-hash esp4
Another argument is that, even if the NIC is ESP-agnostic, the 16 bits of the SPI in an ESP packet are aligned with the UDP port number field and should provide enough entropy.
ethtool -N eno2 rx-flow-hash udp4 sdfn