Exposing Docker containers with SR-IOV

In some of my recent research in NFV, I’ve needed to expose Docker containers to the host’s network, treating them like fully functional virtual machines with their own interfaces and routable IP addresses.  This type of exposure is overkill for many applications, but necessary for user space packet processing such as that required for NFV.  An example use case might be if you want to give your containerized virtual firewall direct access to a physical interface, bypassing the OVS/bridge and the associated overhead, but without the security vulnerabilities of --net=host.

You have a few options in this kind of situation.  You could directly assign the interface, giving the container exclusive access.  The number of available physical NICs is limited, though, so a more realistic option for most of us is to virtualize the interface, and let the container think it has a real NIC.  Jérôme Petazzoni’s pipework script makes it a breeze to do this; by default, if you assign a physical interface to a container with pipework, it will create a virtual interface, place it under a fast macvlan L2 switch, and assign the virtual interface to the container’s network namespace. This comes with a cost, of course: macvlan is still a software switch, another layer between your container and the NIC.

A third option is to use Single Root IO Virtualization (SR-IOV), which allows a PCI device to present itself as multiple virtual functions to the OS.  Each function acts as a device and gets its own MAC, and the NIC can then use its built-in hardware classifier to place incoming packets in separate RX queues based on the destination MAC.  Scott Lowe has a great intro to SR-IOV here.

There are a few reasons that you might want to use SR-IOV rather than a macvlan bridge.  There are architectural benefits to moving packet processing from the kernel to user space: it can improve cache locality and reduce context switches (Rathore et al., 2013).  It can also make it easier to quantify and control the CPU usage of the user space application, which is critical in multi-tenant environments.

On the other hand, there are situations where you would definitely not want to use SR-IOV, primarily when you have containers on the same host that need to communicate with each other. In this case, packets must be copied over the PCIe bus to the NIC to be switched back to the host, which carries a pretty dismal performance penalty.  I recently published a paper that covers this and other performance issues concerning Docker networking with SR-IOV, macvlan, and OVS; take a look at the results of chaining multiple containers and the increase in latency and jitter (Anderson et al., 2016).

So, let’s dig in and make it happen.  You’ll want to verify that the NIC you intend to virtualize actually supports SR-IOV, and find out how many virtual functions it supports.  I’m working with some Dell C8220s with Intel 82599 10G NICs, which support up to 63 virtual functions.  Here’s a list of Intel devices with SR-IOV support; other manufacturers should have their own lists.
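For example, once you know the device’s PCI address (we’ll find it in the next section), the driver reports its VF limit through sysfs:

    # Maximum number of VFs the driver will allow for this device
    $ cat /sys/bus/pci/devices/0000:82:00.0/sriov_totalvfs
    63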

Creating Virtual NICs

First, get a list of your available NICs.  Here’s a handy one-liner:
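Something along these lines does the trick, using lspci from pciutils (the exact flags are a matter of taste):

    $ lspci -Dvmm | grep -B 1 -A 4 Ethernet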

This gives you the PCI slot, class, and other useful information of your Ethernet devices, like this:
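For the 10G NIC on these machines, the record looks roughly like this (vendor and device strings will vary with your hardware; the Slot field is the part we care about):

    Slot:   0000:82:00.0
    Class:  Ethernet controller
    Vendor: Intel Corporation
    Device: 82599 10-Gigabit Network Connection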

In this case, I’m going to be virtualizing the 10G NIC, so I note the slot: 0000:82:00.0.  Next, decide how many virtual functions you’re going to need, in addition to the physical device.  I’m going to be assigning interfaces to 2 Docker containers, so I’ll create 2 VFs.  Then we just write that number into the sriov_numvfs file for the device:
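    # Create 2 VFs on the 82599 (set it back to 0 first if you ever need to change the count)
    $ echo 2 | sudo tee /sys/bus/pci/devices/0000:82:00.0/sriov_numvfs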

Now, check ifconfig -a. You should see a number of new interfaces created, starting with “eth” and numbered after your existing interfaces. They’ve been assigned random MACs, and are ready for you to use. Here’s mine:
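Abridged, the two new VFs look something like this (the interface names and the randomly generated MACs will differ on your system):

    eth3      Link encap:Ethernet  HWaddr 72:91:5a:a6:0f:01
              BROADCAST MULTICAST  MTU:1500  Metric:1
    eth4      Link encap:Ethernet  HWaddr 72:91:5a:a6:0f:02
              BROADCAST MULTICAST  MTU:1500  Metric:1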

Plumb It

My preferred tool for adding interfaces to Docker containers is pipework, but in this case pipework will virtualize the virtual interface with a macvlan bridge.  As a workaround, I forked the pipework repository and made it accept --direct-phys as the first argument, which forces it to skip the macvlan and bring the interface directly into the container’s network namespace. I’ve submitted the change upstream, and if it makes its way into the original project, I’ll update this post.

First, I’ll make a container for testing:
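Anything long-running will do; here I just start a bash process in a stock Ubuntu image (the image and the $server variable name are arbitrary) and capture the container ID:

    # Start a throwaway container and keep its ID for later
    $ server=$(sudo docker run -itd ubuntu /bin/bash)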

Now, let’s give that container a new virtual NIC, with the modified pipework:
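Here eth4 is one of the VFs created above, and 10.10.1.3/24 is an address on the network the physical NIC attaches to; substitute your own:

    # Move VF eth4 into the container; pipework renames it to eth1 inside by default
    $ sudo pipework --direct-phys eth4 $server 10.10.1.3/24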

By default, pipework will name the new interface eth1 inside the container (Note: see bottom of post for one caveat).  Just to double-check:
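A quick look from inside the container, using docker exec:

    $ sudo docker exec $server ifconfig eth1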

Note that the MAC is the same one we saw in ifconfig on the host, and that the interface is no longer visible in the host’s ifconfig: this is because the interface is now in the container’s network namespace. Now, to try it out with another physical machine on that interface’s network:
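Here 10.10.1.1 stands in for whichever machine is on the other end of that network:

    $ sudo docker exec $server ping -c 3 10.10.1.1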

If you like doing things the hard way, here are the steps to mimic how pipework put the interface in the container’s network namespace:
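Roughly, it boils down to the following, using eth4 as the VF and eth1 as the name inside the container, as above:

    # Find the container's PID and expose its network namespace to the ip tool
    pid=$(sudo docker inspect -f '{{.State.Pid}}' $server)
    sudo mkdir -p /var/run/netns
    sudo ln -s /proc/$pid/ns/net /var/run/netns/$pid

    # Move the VF into that namespace, rename it, and configure it
    sudo ip link set eth4 netns $pid
    sudo ip netns exec $pid ip link set eth4 name eth1
    sudo ip netns exec $pid ip addr add 10.10.1.3/24 dev eth1
    sudo ip netns exec $pid ip link set eth1 up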

EDIT:

One issue that you may have with this approach happens when you stop the container.  If you’ve renamed the interface to something else, as pipework and my example above do, there will be a conflict when the kernel tries to move the interface back to the host’s namespace.  The simplest solution is just to avoid renaming the interface, unless it’s critical that the interface have a specific name within the container.  This is pretty easy with pipework; just specify the container interface name:
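    # The VF keeps its host name (eth4) inside the container, so it can move back cleanly on exit
    $ sudo pipework --direct-phys eth4 -i eth4 $server 10.10.1.4/24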

Let me know how it works out for you in the comments below.

Links:
Jérôme Petazzoni’s pipework
Modified pipework with --direct-phys option
Scott Lowe on SR-IOV
Intel devices that support SR-IOV
Intro to Linux namespaces

Comments

  1. So does this mean we could create a DMZ on a single physical host: run Docker containers that are not on the same network on a single physical host, make sure they cannot communicate with each other, or add firewall rules to filter communication between containers? What’s the impact on a multi-host architecture when you need to segregate containers on different VLANs?

    1. Well, I’m not very familiar with the particulars of creating a DMZ, but yes, container interfaces on the same host could be members of different IP subnets. Note though that SDN rules that you would normally apply to Open vSwitch on the host would need to move to the physical switch that the host attaches to. Supposedly, NICs that support SR-IOV will also respect VLAN tags when switching among VFs, but it may be implementation dependent.

  2. Just like enabling SR-IOV on a NIC, there is a way to find the max VFs the NIC supports:
    $ cat /sys/bus/pci/devices/0000:82:00.0/sriov_totalvfs

  3. There is a small mistake in your last EDIT. The command must be:
    sudo pipework --direct-phys eth4 -i eth4 $server 10.10.1.4/24

    One side note: I use an i350-T2 and there is a conflict in BIOS/PCI, so I had to set GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on pci=assign-busses" to let Linux handle the bus assignment.

    Nevertheless, it works like a charm. Thanks for the post.

  4. “In this case, just as with macvlan passthru mode, packets must be copied to the NIC to be switched back to the host, which has a pretty dismal performance penalty.”

    Not necessarily; SR-IOV NICs have a switch in the adapter that switches packets between VFs of the same PF. They need not leave the host to be routed back at the switch.

    1. You’re correct, an SR-IOV NIC can switch packets between VFs without having to leave the host. That’s actually what I meant to say, but my wording was pretty unclear in comparing it to macvlan passthru, which uses the external switch to ‘hairpin’ the connection. Thanks for pointing it out; I have edited the text to clarify that.

      The issue, however, still stands. The time required to interrupt the NIC, copy the packet to the NIC memory, switch it, interrupt the CPU, and copy the packet back to the recipient is the dismal performance penalty I was referring to. This is fairly easy to test out yourself, but I’ve recently had a paper published detailing that and other networking performance issues with Docker. Take a look at the VNF chain performance comparison, and note the increase in latency and jitter as a packet has to be switched by the NIC.
