
Dockered DPDK: packaging Open vSwitch

I recently attended the NFV World Congress in San Jose, and had a great time talking to vendors about their solutions and current trends toward widespread NFV adoption. Intel’s hot new(ish) multicore programming framework – the Data Plane Development Kit, or DPDK – was part of the marketing spiel of almost everyone even remotely invested in the NFVI. The main interest is in the poll mode driver, which dedicates a CPU core to polling devices rather than waiting for interrupts to signal that a packet has arrived. This has resulted in some amazing packet processing rates, such as a DPDK-accelerated Open vSwitch switching 14.88 million packets per second, the 64-byte line rate of a 10GbE link.

Since I’ve been working with Docker lately, I naturally started imagining what could be done by combining crazy-fast DPDK applications with the lightweight virtualization and deployment flexibility of Docker. Many DPDK applications – Open vSwitch among them – impose requirements on the DPDK build that could break other applications relying on the same libraries. This makes DPDK a great candidate for containerization, since we can give each application its very own tested build and run environment.

I was not, of course, the first to think of this – some Googling will turn up quite a few bits and pieces that have been helpful in writing this post.  My goal here is to bring that information into a consolidated tutorial and to explain the containerized DPDK framework that I have published to Dockerhub.

DPDK Framework in a Container

DPDK applications need to access a set of headers and libraries for compilation, so I decided to create a base container (Github, Dockerhub) with those resources.  Here’s the Dockerfile:
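A condensed sketch of what the Dockerfile does looks something like this (the base image, package list, DPDK version, and config tweaks here are illustrative, not the exact contents of the repo):

    FROM ubuntu:14.04

    # Toolchain and basics for fetching and building DPDK
    RUN apt-get update && apt-get install -y build-essential git

    # Where the DPDK SDK lives inside the container
    ENV RTE_SDK /usr/src/dpdk

    # Grab the DPDK source (version pinned here only as an example)
    RUN git clone git://dpdk.org/dpdk $RTE_SDK && \
        cd $RTE_SDK && git checkout v2.0.0

    # Don't build kernel modules in the container; uio and igb_uio belong
    # to the host kernel, so disable them in the build-time configuration
    RUN sed -i 's/CONFIG_RTE_EAL_IGB_UIO=y/CONFIG_RTE_EAL_IGB_UIO=n/' \
            $RTE_SDK/config/common_linuxapp && \
        sed -i 's/CONFIG_RTE_LIBRTE_KNI=y/CONFIG_RTE_LIBRTE_KNI=n/' \
            $RTE_SDK/config/common_linuxapp

    # Compilation is deferred to the application image, which must supply
    # dpdk_env.sh (e.g. RTE_TARGET) and dpdk_config.sh (pre-build tweaks)
    ONBUILD COPY dpdk_env.sh dpdk_config.sh /usr/src/
    ONBUILD RUN . /usr/src/dpdk_env.sh && \
                . /usr/src/dpdk_config.sh && \
                cd $RTE_SDK && make install T=$RTE_TARGET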

Pretty basic stuff at first – get some packages, set the all-important RTE_SDK environment variable, grab the source. One important point is to avoid relying on kernel headers, which would be seriously non-portable. The uio and igb_uio kernel modules have to be built and installed by the host that will run the DPDK container, so we configure the SDK to skip compiling kernel modules, which means kernel headers don’t need to be installed on the build system.

The key part of this build script is that compilation is deferred until the application image is built, so that the application can specify its requirements. This is done by requiring that the DPDK application supply dpdk_env.sh and dpdk_config.sh, which set environment variables (such as RTE_TARGET) and run a set of commands before compilation occurs. For example, Open vSwitch requires that DPDK be compiled with CONFIG_RTE_BUILD_COMBINE_LIBS=y, which would be handled in dpdk_config.sh.

DPDK Application in a Container

Now that the framework is there, time to use it in an application.  In this post I will demonstrate Open vSwitch in a container (Github, Dockerhub), which could be plenty useful.  To begin, here’s the dpdk_env.sh and dpdk_config.sh files:
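The gist of the two files is something like this (the RTE_TARGET value and the sed-based config tweak are examples; see the repo for the exact contents):

    # dpdk_env.sh -- environment variables for the DPDK build
    export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc

    # dpdk_config.sh -- commands run just before DPDK is compiled.
    # OVS needs DPDK built as a single combined library.
    sed -i 's/CONFIG_RTE_BUILD_COMBINE_LIBS=n/CONFIG_RTE_BUILD_COMBINE_LIBS=y/' \
        $RTE_SDK/config/common_linuxapp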

OVS has some special requirements for DPDK, which is kind of the point of putting it in a container, right? Here’s the Dockerfile to build it:
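Again, this is a rough sketch with assumed package names, paths, and versions; the linked repo has the real thing:

    # Build on the DPDK base image; its ONBUILD triggers fire here first
    FROM rakurai/dpdk

    # Extra build dependencies for Open vSwitch (approximate list)
    RUN apt-get update && apt-get install -y autoconf automake libtool

    # Fetch OVS and build it against the combined DPDK library
    RUN git clone https://github.com/openvswitch/ovs.git /usr/src/ovs && \
        cd /usr/src/ovs && \
        ./boot.sh && \
        ./configure --with-dpdk=$RTE_SDK/x86_64-ivshmem-linuxapp-gcc && \
        make && make install

    # Startup script that brings up ovsdb-server and ovs-vswitchd
    COPY start_ovs.sh /usr/local/bin/start_ovs.sh
    CMD ["/usr/local/bin/start_ovs.sh"]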

The ONBUILD instructions in the DPDK Dockerfile are executed first, of course, which compiles the DPDK framework. Then we install more packages for OVS, get the source, and compile with DPDK options. In the last few lines, we copy in the final script, which starts everything OVS needs to be running:
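The essential steps, loosely following the OVS-DPDK installation guide (socket paths, coremask, and memory sizes are examples you would tune), are roughly:

    #!/bin/bash
    # start_ovs.sh: create the database, start ovsdb-server, then start
    # the DPDK-enabled ovs-vswitchd in the foreground

    mkdir -p /usr/local/etc/openvswitch /usr/local/var/run/openvswitch
    ovsdb-tool create /usr/local/etc/openvswitch/conf.db \
        /usr/local/share/openvswitch/vswitch.ovsschema

    ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \
        --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
        --pidfile --detach

    ovs-vsctl --no-wait init

    # -c is the coremask for DPDK, -n the number of memory channels,
    # --socket-mem the hugepage memory (in MB) to claim per NUMA socket
    ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024 \
        -- unix:/usr/local/var/run/openvswitch/db.sock --pidfile

Keeping ovs-vswitchd in the foreground is what keeps the container alive.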

Now, you could go a bit differently here, and the repository I linked to may change somewhat. It would arguably be more Dockerish to put ovsdb-server in its own container and link the two, but this is a self-contained example, so we’ll just go with this.

Running Open vSwitch

Before we start it up, we need to fulfill some prerequisites. I won’t go into details on the how and why, but please see the DPDK Getting Started Guide and the OVS-DPDK installation guide.  OVS requires 1GB huge pages, so you need your /etc/default/grub to have at least these options:
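Something along these lines (eight 1GB pages is just an example; size it to your workload):

    GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=8"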

followed by an update-grub and reboot. You also need to mount them with this or the /etc/fstab equivalent:
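For example (the mount point is arbitrary; just keep it consistent with what you hand to the container later):

    mkdir -p /mnt/huge
    mount -t hugetlbfs -o pagesize=1G none /mnt/huge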

Compile the kernel module on the host system and insert it. Download DPDK, extract it, and run the dpdk/tools/setup.sh script. Choose the x86_64-native-linuxapp-gcc build target (currently option 9), then insert the UIO module (currently option 12). Finally, bind one of your interfaces with option 18; you’ll have to bring that interface down first.
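If you would rather script this than step through the interactive menu, the equivalent commands look roughly like this (the DPDK directory and interface name are examples):

    cd dpdk

    # Build the target, including the igb_uio kernel module, on the host
    make install T=x86_64-native-linuxapp-gcc

    # Load the UIO modules
    modprobe uio
    insmod x86_64-native-linuxapp-gcc/kmod/igb_uio.ko

    # Bring the interface down and bind it to igb_uio (eth2 is an example)
    ip link set eth2 down
    ./tools/dpdk_nic_bind.py --bind=igb_uio eth2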

Now you can start the container. Here’s what I used:
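Something like the following, with the hugepage mount and uio device passed through (the paths should match your host setup):

    docker run -it --privileged \
        -v /mnt/huge:/mnt/huge \
        --device /dev/uio0:/dev/uio0 \
        rakurai/ovs-dpdk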

This gives the container access to the huge page mount and the uio0 device that you just bound to the UIO driver. I found that I needed to run the container with --privileged to access parts of the /dev/uio0 filesystem, though it appears that some people are able to get around this. I will update this post if I find out how to run the container without --privileged.

If all goes well, you now have DPDK-accelerated OVS running in a container, and you can go about adding interfaces to the container, adding them to OVS, and setting up rules for forwarding packets at ludicrous speeds. Good luck, and please let me know how it works out for you!

Links

DPDK base Docker container – rakurai/dpdk (Github, Dockerhub)
Open vSwitch Docker container – rakurai/ovs-dpdk (Github, Dockerhub)
DPDK Getting Started Guide
OVS-DPDK installation guide

Exposing Docker containers with SR-IOV

In some of my recent research in NFV, I’ve needed to expose Docker containers to the host’s network, treating them like fully functional virtual machines with their own interfaces and routable IP addresses.  This type of exposure is overkill for many applications, but necessary for user space packet processing such as that required for NFV.  An example use case might be if you want to give your containerized virtual firewall direct access to a physical interface, bypassing the OVS/bridge and the associated overhead, but without the security vulnerabilities of --net=host.

You have a few options in this kind of situation.  You could directly assign the interface, giving the container exclusive access.  The number of available physical NICs is limited, though, so a more realistic option for most of us is to virtualize the interface, and let the container think it has a real NIC.  Jérôme Petazzoni’s pipework script makes it a breeze to do this; by default, if you assign a physical interface to a container with pipework, it will create a virtual interface, place it under a fast macvlan L2 switch, and assign the virtual interface to the container’s network namespace. This comes with a cost, of course: macvlan is still a software switch, another layer between your container and the NIC.

A third option is to use Single Root IO Virtualization (SR-IOV), which allows a PCI device to present itself as multiple virtual functions to the OS.  Each function acts as a device and gets its own MAC, and the NIC can then use its built in hardware classifier to place incoming packets in separate RX queues based on the destination MAC.  Scott Lowe has a great intro to SR-IOV here.

There are a few reasons that you might want to use SR-IOV rather than a macvlan bridge. There are architectural benefits to moving packet processing from the kernel to user space – it can improve cache locality and reduce context switches (Rathore et al., 2013). It can also make it easier to quantify and control the CPU usage of the user space application, which is critical in multi-tenant environments.

On the other hand, there are situations where you would definitely not want to use SR-IOV – primarily when you have containers on the same host that need to communicate with each other. In this case, packets must be copied over the PCIe bus to the NIC to be switched back to the host, which has a pretty dismal performance penalty.  I published a paper recently that covers this and other performance issues concerning Docker networking with SR-IOV, macvlan, and OVS; take a look at the results of chaining multiple containers and the increase in latency and jitter (Anderson et al., 2016).

So, let’s dig in and make it happen. You’ll want to make sure that the NIC you intend to virtualize actually supports SR-IOV, and check how many virtual functions it supports. I’m working with some Dell C8220s with Intel 82599 10G NICs, which support up to 63 virtual functions. Here’s a list of Intel devices with support; other manufacturers should have their own lists.

Creating Virtual NICs

First, get a list of your available NICs. Here’s a handy one-liner:
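For example, with lspci (one of several ways to do it; -D shows the PCI domain):

    lspci -D -nn | grep -i ethernet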

This gives you the PCI slot, class, and other useful information of your Ethernet devices, like this:
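For the 82599 in this machine, the line of interest looks something like this (illustrative):

    0000:82:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb]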

In this case, I’m going to be virtualizing the 10G NIC, so I note the slot: 0000:82:00.0.  Next, decide how many virtual functions you’re going to need, in addition to the physical device.  I’m going to be assigning interfaces to 2 Docker containers, so I’ll create 2 VFs.  Next, we’ll just write that number into the sriov_numvfs file for the device:
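With the slot noted above, that’s just:

    echo 2 > /sys/bus/pci/devices/0000:82:00.0/sriov_numvfs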

Now, check ifconfig -a. You should see a number of new interfaces, starting with “eth” and numbered after your existing interfaces. They’ve been assigned random MACs and are ready for you to use.

Plumb It

My preferred tool for adding interfaces to Docker containers is pipework, but in this case pipework would virtualize the virtual interface again with a macvlan bridge. As a workaround, I forked the pipework repository and made it accept --direct-phys as the first argument, which forces it to skip the macvlan and bring the interface directly into the container’s network namespace. I’ve submitted the change upstream, and if it makes its way into the original project, I’ll update this post.

First, I’ll make a container for testing:
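For example, a throwaway Ubuntu container (the name and image are arbitrary):

    docker run -itd --name=sriov-test ubuntu:14.04 /bin/bash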

Now, let’s give that container a new virtual NIC, with the modified pipework:
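Assuming the first VF showed up as eth4 (adjust the interface name and addressing to your setup):

    ./pipework --direct-phys eth4 sriov-test 192.168.1.10/24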

By default, pipework will name the new interface eth1 inside the container (Note: see bottom of post for one caveat).  Just to double check:
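For instance, from the host (assuming the image has net-tools for ifconfig):

    docker exec sriov-test ifconfig eth1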

Note that the MAC is the same one ifconfig showed on the host, and that the interface is no longer visible in the host’s ifconfig: it now lives in the container’s network namespace. Now, to try it out with another physical machine on that interface’s network:
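For example, pinging a neighbor on the same subnet (the address is illustrative):

    docker exec sriov-test ping -c 3 192.168.1.20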

If you like doing things the hard way, here are the steps to mimic how pipework puts the interface into the container’s network namespace:
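A sketch of the manual equivalent (container name, interface names, and addresses are the same examples as above):

    # Expose the container's network namespace to the ip netns tooling
    CONTAINER_PID=$(docker inspect -f '{{.State.Pid}}' sriov-test)
    mkdir -p /var/run/netns
    ln -s /proc/$CONTAINER_PID/ns/net /var/run/netns/$CONTAINER_PID

    # Move the VF into the container's namespace, rename and configure it
    ip link set eth4 netns $CONTAINER_PID
    ip netns exec $CONTAINER_PID ip link set eth4 name eth1
    ip netns exec $CONTAINER_PID ip addr add 192.168.1.10/24 dev eth1
    ip netns exec $CONTAINER_PID ip link set eth1 up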

EDIT:

One issue that you may have with this approach happens when you stop the container. If you’ve renamed the interface to something else, as pipework and my example above do, there will be a conflict when the kernel tries to move the interface back to the host’s namespace. The simplest solution is to avoid renaming the interface, unless it’s critical that it have a specific name within the container. This is pretty easy with pipework; just specify the container interface name:
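For example, keeping the VF’s host-side name inside the container:

    ./pipework --direct-phys eth4 -i eth4 sriov-test 192.168.1.10/24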

Let me know how it works out for you in the comments below.

Links:
Jérôme Petazzoni’s pipework
Modified pipework with --direct-phys option
Scott Lowe on SR-IOV
Intel devices that support SR-IOV
Intro to Linux namespaces