XFRM pCPU
Revision as of 00:53, 19 November 2019
Goal: scalable IPsec throughput with multiple CPUs (without IPsec HW offload)
The idea of per-CPU SAs in the outgoing direction was discussed at the Linux IPsec workshop in Prague in March 2019. A small group of people worked on a prototype covering user space (IKE, in Libreswan) and the Linux kernel (XFRM). The libreswan implementation calls this option "clones". In the Linux kernel it is called pCPU. These names may change as we adapt the idea to include TOS bits over TCP/UDP DST port hashing.
The tests were performed without IPsec hardware offload, to separate the performance numbers of per-CPU SAs from hardware interaction.
Results
The test results, as of Nov 2019, show an aggregate throughput increase that is linear in the number of CPUs.
We tested on physical servers with Mellanox CX4 NICs. These NICs (using the latest Linux driver CX5) support RSS for ESP. In the tests, the clear text traffic was generated by a hardware traffic generator, which sends traffic to the first IPsec gateway. That gateway encrypts the traffic and sends it to the second IPsec gateway, which decrypts the traffic into clear text and forwards it to the receiving end of the traffic generator.
|Traffic Generator Sender|-----|IPsec Gateway #1|=====ipsec 40Gbps link====|IPsec Gateway #2|---|Traffic Generator Receiver|
The initial measurements we obtained are 17-18 Gbps with 3 CPUs. We see about 6-7 Gbps per CPU.
Test setup using libreswan
Linux kernel source with pCPU support
git clone -b pcpu-2 https://github.com/antonyantony/linux
Kernel / xfrm future plans
- Release private branch at Steffen's repository for wider testing.
- Kernel support for IPsec rekey. One could rekey in any order, either the head SA or a sub SA first.
- One main difference is that when installing a new sub SA during a rekey, add_sa() would delete the old sub SA. Libreswan should not try to delete it. Or convince Steffen to allow deleting an old sub SA.
- Ben would like to add a feature to bind a sub SA to a head SA?
- Seems to need the latest iproute2, otherwise "ip x s" may loop.
- Bug fixes: noticed a kernel crash after running overnight?
Libreswan with clones support
git clone --single-branch --branch clones-3 https://github.com/antonyantony/libreswan
Sample config | ipsec.conf
conn westnet-eastnet
    rightid=@east
    leftid=@west
    left=192.1.2.45
    right=192.1.2.23
    rightsubnet=192.0.2.0/24
    leftsubnet=192.0.1.0/24
    authby=secret
    clones=2
    auto=add
    nic-offload=no
Initiate the connection and test the multiple CPU IPsec SAs:
ipsec auto --up westnet-eastnet
taskset 0x1 ping -n -c 2 -I 192.0.1.254 192.0.2.254
taskset 0x2 ping -n -c 2 -I 192.0.1.254 192.0.2.254
ipsec trafficstatus

ipsec trafficstatus
006 #2: "westnet-eastnet-0", type=ESP, add_time=1234567890, inBytes=0, outBytes=0, id='@east'
006 #4: "westnet-eastnet-1", type=ESP, add_time=1234567890, inBytes=168, outBytes=168, id='@east'
006 #3: "westnet-eastnet-2", type=ESP, add_time=1234567890, inBytes=168, outBytes=168, id='@east'
NOTE: Both SA #3 and SA #4 should have outgoing traffic on them.
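The taskset arguments used above are CPU affinity bitmasks: 0x1 pins the ping to CPU 0 and 0x2 pins it to CPU 1. As a small sketch, the masks for further CPUs could be derived like this (the helper name taskset_mask is ours, not part of libreswan or taskset):

```python
def taskset_mask(cpu: int) -> str:
    """Return the hex affinity mask that pins a process to one CPU."""
    if cpu < 0:
        raise ValueError("CPU index must be non-negative")
    # Bit i of the mask selects CPU i.
    return hex(1 << cpu)


# Print one test command per CPU, reusing the addresses from the example above.
for cpu in range(4):
    print(f"taskset {taskset_mask(cpu)} ping -n -c 2 -I 192.0.1.254 192.0.2.254")
```

With clones=2 only the first two commands are expected to land on distinct sub SAs.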
Future Libreswan plans
- Current support using clones=n requires both endpoints to have the same number of clones. The future plan is to allow asymmetric configurations, such as one side using 8 clones on 4 CPUs and the other side using 12 clones on 12 CPUs.
- Match Rekey support behaviour between kernel and libreswan. Deleting sub and head SA during a rekey procedure needs to be worked out with kernel
- Complete support for ipsec auto --down and delete
- Prevent individual clone instances from being manipulated on their own using ipsec auto add|delete|down
- Ensure interoperability against IPsec gateways that do not support clone SA's, such as previous versions of libreswan without clone support.
Linux kernel XFRM details
The changes are to the SADB entry, aka state or SA. The new concept is a head SA and sub SAs for the outgoing direction. These are supported with additional XFRMA_SA_EXTRA_FLAGS and attributes of the SADB entry. The SPDB entry, aka policy, can point to the head SA if your policy has an SPI in it. Policies installed by libreswan do not have an SPI in them. You need only one policy. Note that libreswan might install multiple identical policies, and this works too.
The head SA is a catch-all SA. It is not associated with a specific CPU. When there are N CPUs, install N+1 SAs: one head SA and N sub SAs.
To add an SADB entry you need extra attributes in the netlink calls XFRM_MSG_NEWSA, XFRM_MSG_UPDSA, and XFRM_MSG_GETSA, and only for the outgoing SA. Installing the incoming (receiving) SA in the kernel remains unchanged.
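The N+1 rule can be sketched as a small planning helper. This is only an illustration of the counting; the function name and the tuple layout are our own and do not correspond to any libreswan or kernel structure.

```python
from typing import List, Optional, Tuple


def plan_outgoing_sas(n_cpus: int) -> List[Tuple[str, Optional[int]]]:
    """List the outgoing SAs to install: one head SA plus one sub SA per CPU."""
    if n_cpus < 1:
        raise ValueError("need at least one CPU")
    # The head SA is the catch-all and is not tied to any CPU.
    plan: List[Tuple[str, Optional[int]]] = [("head", None)]
    # Sub SA ids start from 0, one per CPU.
    plan += [("sub", cpu) for cpu in range(n_cpus)]
    return plan


for kind, cpu in plan_outgoing_sas(4):
    print(kind, cpu)
```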
XFRM_MSG_NEWSA head SA
XFRMA_SA_EXTRA_FLAGS includes the XFRM_SA_PCPU_HEAD flag
XFRM_MSG_NEWSA sub SA
XFRMA_SA_EXTRA_FLAGS includes the XFRM_SA_PCPU_SUB flag, and the new attribute XFRMA_SA_PCPU is set to the <cpu id>. The CPU SA ID starts from 0 and is a u32.
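As a rough sketch of what these extra attributes look like on the wire, the snippet below packs them in the generic netlink attribute layout (16-bit length, 16-bit type, then the payload). All numeric constants are placeholders chosen for illustration; the real values must be taken from include/uapi/linux/xfrm.h in the pCPU branch. The helper names are ours.

```python
import struct

# ASSUMPTION: placeholder values, not taken from the pCPU kernel branch.
XFRMA_SA_EXTRA_FLAGS = 24   # assumed attribute type
XFRMA_SA_PCPU = 100         # hypothetical attribute type
XFRM_SA_PCPU_HEAD = 0x100   # hypothetical flag bit
XFRM_SA_PCPU_SUB = 0x200    # hypothetical flag bit

NLA_HDRLEN = 4  # struct nlattr: u16 nla_len, u16 nla_type


def nla_u32(attr_type: int, value: int) -> bytes:
    """Pack one netlink attribute carrying a u32 in host byte order."""
    payload = struct.pack("=I", value)
    # nla_len covers header plus payload; a u32 payload needs no padding.
    return struct.pack("=HH", NLA_HDRLEN + len(payload), attr_type) + payload


def head_sa_attrs() -> bytes:
    """Extra attributes for a head SA: only the HEAD flag."""
    return nla_u32(XFRMA_SA_EXTRA_FLAGS, XFRM_SA_PCPU_HEAD)


def sub_sa_attrs(cpu_id: int) -> bytes:
    """Extra attributes for a sub SA: the SUB flag plus the CPU id."""
    return (nla_u32(XFRMA_SA_EXTRA_FLAGS, XFRM_SA_PCPU_SUB)
            + nla_u32(XFRMA_SA_PCPU, cpu_id))


print(head_sa_attrs().hex())
print(sub_sa_attrs(3).hex())
```

These bytes would be appended after the usual xfrm_usersa_info payload of an XFRM_MSG_NEWSA request; building the full message is out of scope here.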
XFRM_MSG_UPDSA
Both the head SA and the sub SAs need extra attributes:
- The head SA sets the XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_HEAD
- The sub SA sets the XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_SUB and XFRMA_SA_PCPU is set to <sub-sa-id>.
XFRM_MSG_GETSA
This call only requires changes for sub SAs:
- The sub SA XFRMA_SA_EXTRA_FLAGS is set to XFRM_SA_PCPU_SUB and XFRMA_SA_PCPU is set to <sub-sa-id>.
- Set XFRMA_SRCADDR to the src addr
This is the call used by libreswan's "ipsec trafficstatus". Without these changes it will not find the sub SAs.
when nCPU < nSAs
Suppose there are 4 CPUs and the number of clones configured is 8, because the other end has 8 CPUs. The head SA's list has only 4 places for sub SAs. Libreswan should install only 4 outbound sub SAs and install 8 inbound sub SAs. This is a local policy and does not affect the remote IPsec peer. From the view of the remote peer, 4 inbound SAs appear to be unused. The remote peer can still use all 8 of its outbound SAs. IPsec SAs are negotiated as a bundle of one inbound and one outbound SA. Both ends commit to receiving on their inbound SAs, but are free to decide on which outbound SAs they will send traffic. This setup is therefore compliant with RFC 7296.
In our example above, the IKE daemon on the 4-CPU machine has a list of all 8 SA bundles, but will have installed only 4 outbound SA's along with the 8 inbound SA's in the Linux kernel. The "ip xfrm state" will show this.
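The counting in this example can be written down as a short sketch. The function name and the returned dictionary are hypothetical; the rule itself (outbound limited by local CPUs, inbound always equal to the negotiated clones) is the one described above.

```python
from typing import Dict


def sub_sas_to_install(local_cpus: int, clones: int) -> Dict[str, int]:
    """Number of sub SAs to install in the kernel in each direction."""
    return {
        # The head SA's list has one slot per local CPU.
        "outbound": min(local_cpus, clones),
        # Every negotiated bundle's inbound SA must be installed.
        "inbound": clones,
    }


print(sub_sas_to_install(local_cpus=4, clones=8))
```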
Supported Work loads
As of Nov 2019, to make full use of the cloned SAs, the network traffic load has to be distributed over different CPUs.
If the traffic is generated on the IPsec machine itself, the application(s) writing the traffic (e.g. using the send() or write() syscalls) need to be running on different CPUs. This can often be done using the taskset or numactl commands.
For forwarded traffic, you need RSS support on the NIC receiving the clear text. RSS will steer different flows onto different CPUs and thus use different sub SAs. If all the traffic consists of one single flow, the traffic cannot be distributed over different CPUs, to avoid out-of-order delivery.
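The property that matters here is that the flow-to-CPU mapping is deterministic per flow. The sketch below illustrates it with a CRC32 over the 4-tuple; this is a simplified stand-in, not the hash that real NICs use for RSS, and the function name is ours.

```python
import zlib


def pick_cpu(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
             n_cpus: int) -> int:
    """Map a flow's 4-tuple to a CPU index; the same flow always maps the same way."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    # CRC32 is used only because it is deterministic and in the stdlib.
    return zlib.crc32(key) % n_cpus


# Two packets of the same flow land on the same CPU, so one flow
# never spreads across sub SAs and cannot be reordered by them.
first = pick_cpu("192.0.1.254", "192.0.2.254", 40000, 443, 4)
second = pick_cpu("192.0.1.254", "192.0.2.254", 40000, 443, 4)
print(first == second)
```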
Can we distribute locally generated 4-tuple flows?
Yes. See above.
Receiver side RSS support
To get this working you need Receive Side Scaling (RSS). The receiver NIC should be able to steer different flows, based on SPI, into separate queues to prevent the receiver from getting overwhelmed. We used a Mellanox CX4 to test. Some cards initially tested did not seem to support RSS for ESP flows, only for TCP and UDP. While figuring out RSS for these cards we tried a slightly different approach: with ESP in UDP encapsulation, along with the ESP in UDP GRO patches, we could see the flows getting distributed on the receiver.
RSS Commands
Enable GRO and it should work. Ideally you should be able to run the following:
ethtool -N <nic> rx-flow-hash esp4
Another argument is that, even if the NIC is agnostic to ESP, the 16 bits of the SPI in the ESP packet are aligned with the UDP port number and should provide enough entropy.
ethtool -N eno2 rx-flow-hash udp4 sdfn
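To make the alignment argument concrete: an ESP header starts with the 32-bit SPI, while a UDP header starts with the 16-bit source port followed by the 16-bit destination port. A NIC that hashes on the first four bytes of the layer 4 header (the f and n fields in the command above) would therefore see the two halves of the SPI where it expects ports. The sketch below only shows that byte overlap; whether a given NIC actually behaves this way for ESP is an assumption to verify per card.

```python
import struct
from typing import Tuple


def spi_as_udp_ports(spi: int) -> Tuple[int, int]:
    """Split a 32-bit SPI into the two 16-bit fields that overlay the UDP ports."""
    # Network byte order: the high 16 bits sit where the source port would be,
    # the low 16 bits where the destination port would be.
    return struct.unpack("!HH", struct.pack("!I", spi))


src_like, dst_like = spi_as_udp_ports(0x12345678)
print(hex(src_like), hex(dst_like))
```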
Future research/ideas
- Test with SR-IOV and virtualisation (KVM): needs systems with a NIC that supports SR-IOV and RSS for ESP, or at least UDP.
- Software RSS https://www.linux-kvm.org/page/Multiqueue
- can IKE daemon use other flow distribution methods based on SPI??? DPDK???
- is this another way of flow control??? https://doc.dpdk.org/dts/test_plans/link_flowctrl_test_plan.html
Install SAs only for the CPUs that have workload
In the current model libreswan is configured with clones=N. When the connection comes up, libreswan negotiates the head SA as part of the IKE_AUTH exchange, and then immediately negotiates N sub SAs using CREATE_CHILD_SA, "New Child SA, RFC 7296 #1.3.1". Negotiating N sub SAs takes N CREATE_CHILD_SA exchanges, or round trip times. The next question is whether we need an SA for every CPU. What if the workloads are only going to be on CPUi and CPUj, say CPU 2 and CPU 4? In the current model we can't optimize for this.
Let's say you have 72 cores and, e.g. in a multi-tenant model, only CPU 2 and CPU 4 need a sub SA. One idea is to add an SA-based acquire message. This would need extending the XFRM acquire message to include the CPU id. With the CPU id in the acquire message, the traffic is initially forwarded via the head SA, if there is one.
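The saving from this idea can be sketched with a tiny bookkeeping class: a sub SA is negotiated only the first time an acquire arrives for a given CPU. This is a hypothetical model of the proposal, not existing pluto or kernel code; the class and method names are ours.

```python
class OnDemandSubSAs:
    """Track which CPUs already have a sub SA and count negotiations."""

    def __init__(self) -> None:
        self.installed = set()   # CPU ids that already have a sub SA
        self.exchanges = 0       # CREATE_CHILD_SA exchanges spent so far

    def on_acquire(self, cpu_id: int) -> bool:
        """Handle an acquire carrying a CPU id; return True if a sub SA was negotiated."""
        if cpu_id in self.installed:
            return False
        # One CREATE_CHILD_SA exchange per newly needed sub SA.
        self.installed.add(cpu_id)
        self.exchanges += 1
        return True


tracker = OnDemandSubSAs()
for cpu in (4, 4, 2, 4, 2):      # workload only on CPU 2 and CPU 4
    tracker.on_acquire(cpu)
print(sorted(tracker.installed), tracker.exchanges)
```

On the 72-core machine in the example this would take 2 exchanges instead of the 72 that clones=72 requires up front.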
There are two cases.
1. There is no head SA, just an XFRM policy entry. The first packet of a new flow that matches the policy arrives on CPU 4. XFRM creates a larval state with a timeout (30 seconds by default) and sends an acquire message, with the new CPU id attribute in it, to the IKE daemon. Pluto, the IKE daemon, starts the IKE negotiation. While the negotiation (IKE_SA_INIT and IKE_AUTH) is going on, the traffic is dropped or the first packet is cached. Pluto first negotiates the head SA in IKE_AUTH and installs it. If the flow continues, the first few packets will use the head SA while pluto negotiates and installs the sub SA for CPU 4. The traffic then switches to this sub SA for CPU 4.
Now new flow arrive on CPU 2. Here is a bit of XFRM magic. Lets say larval expired. There won