XFRM pCPU

From Libreswan
Revision as of 06:35, 18 November 2019

Goal: scalable IPsec throughput with multiple CPUs (without IPsec HW offload)

The idea of a per-CPU SA in the outgoing direction was discussed at the Linux IPsec workshop in Prague in March 2019. A small group of people then worked on a prototype covering both the user space IKE daemon (Libreswan) and the Linux kernel (XFRM). The Libreswan implementation calls this option "clones". In the Linux kernel it is called pCPU. These names may change as we adapt the idea to include TOS bits over TCP/UDP DST port hashing.

The tests were performed without IPsec hardware offload, to separate the performance numbers of per-CPU SAs from hardware interaction.

Results

The test results, as of Nov 2019, show that the aggregated throughput increases linearly with the number of CPUs.

We tested on physical servers with Mellanox CX4 NICs. These NICs (using the latest Linux driver CX5) support RSS for ESP. In the tests, the clear text traffic was generated by a hardware traffic generator, which sends it to the first IPsec gateway. That gateway encrypts the traffic and sends it to the second IPsec gateway. The second gateway decrypts the traffic into clear text and forwards it to the receiving end of the traffic generator.

|Traffic Generator Sender|-----|IPsec Gateway #1|=====IPsec 40Gbps link====|IPsec Gateway #2|---|Traffic Generator Receiver|

The initial measurements we obtained are: 17-18 Gbps with 3 CPUs. We see about 6-7 Gbps per CPU.

Test setup using libreswan

Linux kernel source with pCPU support

git clone -b pcpu-2 https://github.com/antonyantony/linux

Kernel / xfrm plans

  • Release the private branch via Steffen's repository for wider testing.
  • Kernel support for rekey. One could rekey in any order: either the head SA or a sub SA.
  • One main difference is that when installing a new sub SA during a rekey, add_sa() would delete the old sub SA. Libreswan should not try to delete it.
  • Ben would like to add a feature to bind a sub SA to a head SA?
  • This seems to need the latest iproute2, otherwise "ip x s" may loop.
  • Bug fixes.

Libreswan with clones support

git clone --single-branch --branch clones-3 https://github.com/antonyantony/libreswan

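Sample config: ipsec.conf (https://github.com/antonyantony/libreswan/blob/clones-1/testing/pluto/ikev2-68-sa-clones/ipsec.conf)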

conn westnet-eastnet
        rightid=@east
        leftid=@west
        left=192.1.2.45
        right=192.1.2.23
        rightsubnet=192.0.2.0/24
        leftsubnet=192.0.1.0/24
        authby=secret
        clones=2
        auto=add
        nic-offload=no

Initiate the connection and test the per-CPU IPsec SAs:

ipsec auto --up westnet-eastnet
taskset 0x1 ping -n -c 2 -I 192.0.1.254 192.0.2.254
taskset 0x2 ping -n -c 2 -I 192.0.1.254 192.0.2.254

ipsec trafficstatus

ipsec whack --trafficstatus
006 #2: "westnet-eastnet-0", type=ESP, add_time=1234567890, inBytes=0, outBytes=0, id='@east'
006 #4: "westnet-eastnet-1", type=ESP, add_time=1234567890, inBytes=168, outBytes=168, id='@east'
006 #3: "westnet-eastnet-2", type=ESP, add_time=1234567890, inBytes=168, outBytes=168, id='@east'

NOTE: Both SA #3 and SA #4 should have outgoing traffic on them.
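The test above can be generalized to more clones with a small helper. The following Python sketch only builds strings: it assumes that the naming pattern seen in the trafficstatus output (one "-0" entry plus one entry per clone) and the one-CPU-per-ping pattern of the taskset commands carry over to other values of clones. The function names are hypothetical and not part of Libreswan.

```python
from typing import List


def clone_names(conn: str, clones: int) -> List[str]:
    """Connection names as they appear in the trafficstatus output above.

    Assumption: '<conn>-0' plus one '<conn>-<i>' entry per clone.
    """
    return [f"{conn}-{i}" for i in range(clones + 1)]


def taskset_ping_commands(clones: int, src: str, dst: str) -> List[str]:
    """One ping per CPU, each pinned with a single-bit taskset mask."""
    return [
        f"taskset {hex(1 << cpu)} ping -n -c 2 -I {src} {dst}"
        for cpu in range(clones)
    ]


if __name__ == "__main__":
    for name in clone_names("westnet-eastnet", 2):
        print(name)
    for cmd in taskset_ping_commands(2, "192.0.1.254", "192.0.2.254"):
        print(cmd)
```

With clones=2 this reproduces the two taskset commands used above (masks 0x1 and 0x2).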

Future Libreswan plans

  • Current support using clones=n requires both endpoints to have the same clone number. The future plan is to allow asymmetric configurations, such as one side using 8 clones on 4 CPUs and the other side using 12 clones on 12 CPUs.
  • Match rekey behaviour between the kernel and Libreswan. Deleting sub and head SAs during a rekey procedure needs to be worked out with the kernel.
  • Complete support for ipsec auto --down and delete.
  • Prevent clone instances from being manipulated on their own using ipsec auto add|delete|down.
  • Ensure interoperability with IPsec gateways that do not support clone SAs, such as previous versions of Libreswan without clone support.


Linux kernel XFRM details

Most changes are to the SADB entry, also known as state or SA. The new concept is a head SA with sub SAs. These are supported with additional XFRMA_SA_EXTRA_FLAGS flags and attributes of the SADB entry. [The SA policy should not specify SPI???] Check with Ben. He said it works.

You need extra flags for XFRM_MSG_NEWSA, XFRM_MSG_UPDSA and XFRM_MSG_GETSA, only for the outbound SA.

XFRM_MSG_NEWSA head SA

XFRMA_SA_EXTRA_FLAGS includes the XFRM_SA_PCPU_HEAD flag

XFRM_MSG_NEWSA sub SA

XFRMA_SA_EXTRA_FLAGS includes the XFRM_SA_PCPU_SUB flag, and the new attribute XFRMA_SA_PCPU is set to the <cpu id>. The CPU SA ID is a u32 and starts from 0.

XFRM_MSG_UPDSA

Both the head SA and the sub SAs need extra attributes:

  • The head SA sets the XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_HEAD
  • The sub SA sets the XFRMA_SA_EXTRA_FLAGS to XFRM_SA_PCPU_SUB and XFRMA_SA_PCPU is set to <sub-sa-id>.

XFRM_MSG_GETSA

This call only requires changes for sub SAs:

  • The sub SA XFRMA_SA_EXTRA_FLAGS is set to XFRM_SA_PCPU_SUB and XFRMA_SA_PCPU is set to <sub-sa-id>.
  • Set XFRMA_SRCADDR to the source address.
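The per-message rules above can be summarized in a small model. This Python sketch is not netlink code: it only records which attribute names from this page accompany each message for a head or sub SA. The flag and attribute names are taken from the text; the function name, the dictionary layout and the string form of the source address are assumptions for illustration.

```python
from typing import Optional

U32_MAX = 2**32 - 1
MESSAGES = ("XFRM_MSG_NEWSA", "XFRM_MSG_UPDSA", "XFRM_MSG_GETSA")


def pcpu_attrs(msg: str, role: str,
               cpu_id: Optional[int] = None,
               src_addr: Optional[str] = None) -> dict:
    """Return the extra pCPU attributes for one message (names only)."""
    if msg not in MESSAGES:
        raise ValueError(f"unsupported message: {msg}")
    if role == "head":
        # GETSA needs no changes for the head SA.
        if msg == "XFRM_MSG_GETSA":
            return {}
        return {"XFRMA_SA_EXTRA_FLAGS": {"XFRM_SA_PCPU_HEAD"}}
    if role == "sub":
        # The sub SA id is a u32 counted from 0.
        if cpu_id is None or not 0 <= cpu_id <= U32_MAX:
            raise ValueError("sub SA needs a u32 cpu_id starting from 0")
        attrs = {
            "XFRMA_SA_EXTRA_FLAGS": {"XFRM_SA_PCPU_SUB"},
            "XFRMA_SA_PCPU": cpu_id,
        }
        if msg == "XFRM_MSG_GETSA":
            # GETSA for a sub SA additionally carries the source address.
            if src_addr is None:
                raise ValueError("GETSA for a sub SA needs src_addr")
            attrs["XFRMA_SRCADDR"] = src_addr
        return attrs
    raise ValueError(f"unknown role: {role}")
```

For example, pcpu_attrs("XFRM_MSG_GETSA", "sub", cpu_id=1, src_addr="192.1.2.45") yields the sub flag, the CPU id and the source address, while the same call for a head SA yields no extra attributes.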

When nCPU < nSAs

Consider a host with 4 CPUs where the number of clones configured is 8 because the other end has 8 CPUs. The head SA's list only has 4 places for sub SAs. Libreswan should install only 4 outbound sub SAs, but all 8 inbound sub SAs. This is a local policy and does not affect the remote IPsec peer. From the view of the remote peer, 4 of its inbound SAs appear to be unused. The remote peer can still use all 8 of its outbound SAs. IPsec SAs are negotiated as a bundle of one inbound and one outbound SA. Both ends commit to receiving on their inbound SAs, but are free to decide on which outbound SAs they will send traffic. This setup is therefore compliant with RFC 7296.

In our example above, the IKE daemon on the 4-CPU machine has a list of all 8 SA bundles, but will have installed only 4 outbound SAs along with the 8 inbound SAs. The "ip xfrm state" command will show this.
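The install policy described above comes down to simple arithmetic. The Python sketch below is a hypothetical helper, not Libreswan code: it assumes that every negotiated bundle gets its inbound SA installed and that outbound sub SAs are installed for CPU ids 0 up to the smaller of the local CPU count and the clone count.

```python
def plan_sub_sas(local_cpus: int, clones: int) -> dict:
    """Decide how many sub SAs to install locally for `clones` SA bundles."""
    if local_cpus < 1 or clones < 1:
        raise ValueError("local_cpus and clones must be at least 1")
    outbound = min(local_cpus, clones)
    return {
        "inbound": clones,                          # always receive on every bundle
        "outbound": outbound,                       # limited by the head SA's list
        "outbound_cpu_ids": list(range(outbound)),  # ids start from 0
    }


if __name__ == "__main__":
    # The example from the text: 4 local CPUs, 8 clones configured.
    print(plan_sub_sas(4, 8))
```

For the 4-CPU host with 8 clones this gives 8 inbound and 4 outbound sub SAs; the 8-CPU peer running the same calculation installs 8 of each.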

Supported workloads

As of Nov 2019, to make full use of the cloned SAs, the network traffic load has to be distributed over different CPUs.

If the traffic is generated on the IPsec machine itself, the applications writing the traffic (e.g. using the send() or write() syscalls) need to run on different CPUs. This can often be done using the taskset or numactl commands.

For forwarded traffic, you need RSS support on the NIC receiving the clear text. RSS will steer different flows onto different CPUs, and they will thus use different sub SAs. If all the traffic consists of a single flow, it cannot be distributed over different CPUs, because that could cause out-of-order delivery.
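The effect of per-flow steering can be illustrated with a toy model. In the Python sketch below, zlib.crc32 is only a stand-in for the hash a real NIC computes for RSS, and the function name is hypothetical. The point it shows is that the CPU (and therefore the sub SA) depends only on the flow's 4-tuple, so packets of a single flow always land on the same CPU.

```python
import zlib


def cpu_for_flow(src_ip: str, dst_ip: str,
                 src_port: int, dst_port: int, n_cpus: int) -> int:
    """Map a 4-tuple to a CPU index; the same flow always maps to the same CPU."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_cpus


if __name__ == "__main__":
    # Several flows between the same hosts, differing only in source port.
    for sport in range(40000, 40008):
        cpu = cpu_for_flow("192.0.1.254", "192.0.2.254", sport, 5001, 4)
        print(f"flow sport={sport} -> CPU {cpu} -> sub SA {cpu}")
```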

Can we distribute locally generated 4-tuple flows?

Yes. See above.

Receiver side RSS support

To get this working you need Receive Side Scaling (RSS, see https://www.kernel.org/doc/Documentation/networking/scaling.txt). The receiver NIC should be able to steer different flows, based on SPI, into separate queues to prevent the receiver from getting overwhelmed. We used a Mellanox CX4 to test. Some cards we initially tested did not seem to support RSS for ESP flows, only for TCP and UDP. While figuring out RSS for these cards we tried a slightly different approach: with ESP in UDP encapsulation, along with the ESP in UDP GRO patches, we could see the flows getting distributed on the receiver.

RSS Commands

Enable GRO and it should work. Ideally you should be able to run the following:

 ethtool -N <nic> rx-flow-hash esp4 

Another argument is that, if the NIC is protocol agnostic, 16 bits of the SPI of the ESP packet are aligned with the UDP port number and should provide enough entropy.

 ethtool -N eno2 rx-flow-hash udp4 sdfn
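One reading of the alignment argument can be sketched in Python. In the ethtool rx-flow-hash syntax, "f" and "n" select bytes 0-1 and bytes 2-3 of the layer 4 header, which are the source and destination ports for UDP. An ESP header starts with the 32-bit SPI followed by the 32-bit sequence number, so a NIC that hashes those byte offsets regardless of protocol would be hashing the two halves of the SPI. The helper names below are hypothetical, and whether a given NIC behaves this way is an assumption.

```python
import struct
from typing import Tuple


def esp_header(spi: int, seq: int) -> bytes:
    """First 8 bytes of an ESP packet: 32-bit SPI, then 32-bit sequence number."""
    return struct.pack("!II", spi, seq)


def bytes_hashed_as_ports(l4_header: bytes) -> Tuple[int, int]:
    """Read bytes 0-1 and 2-3 of the layer 4 header, as 'f' and 'n' would."""
    first, second = struct.unpack("!HH", l4_header[:4])
    return first, second


if __name__ == "__main__":
    hdr = esp_header(0x12345678, 1)
    hi, lo = bytes_hashed_as_ports(hdr)
    # The two "port" fields are the upper and lower 16 bits of the SPI.
    print(hex(hi), hex(lo))
```

Since each sub SA has its own SPI, different sub SAs would feed different values into the hash under this reading.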