Linux containers, as a lighter-weight virtualization alternative to virtual machines, are gaining momentum. The High Performance Computing (HPC) community is eyeing Linux containers with interest, hoping that they can provide the isolation and configurability of virtual machines, but without the performance penalties.
In this article, I will show a simple example of libvirt-based container configuration in which I assign the container one of the ultra-low latency (usNIC) enabled Ethernet interfaces available in the host. This allows bare-metal performance of HPC applications, but within the confines of a Linux container.
Before we jump into the specific libvirt configuration details, let's first quickly review the following points:
Introduction to Linux Containers
Fun fact: there is no formal definition of a Linux "container." Most people identify a Linux container with keywords like LXC, libvirt, Docker, namespaces, cgroups, etc.
Some of those keywords identify user space tools used to configure and manage some form of containers (LXC, libvirt, and Docker). Others identify some of the building blocks used to define a container (namespaces and cgroups).
Even in the Linux kernel, there is no definition of a "container."
However, the kernel does provide a number of features that can be combined to define what many people call a "container." None of these features are mandatory, and depending on what level of sharing or isolation you need between containers - or between the host and containers - the definition/configuration of a "container" will (or will not) make use of certain features.
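A quick way to see these building blocks in action is to inspect the namespace membership of a process, which the kernel exposes under /proc/<pid>/ns. The short sketch below is illustrative only; the exact set of namespace entries depends on your kernel version:

```shell
#!/bin/sh
# Each symlink under /proc/self/ns identifies one namespace this process
# belongs to; two processes share a namespace when the links point to the
# same inode.
ls -l /proc/self/ns

# The network namespace identifier can be read directly; the number in
# brackets is the inode identifying the namespace instance.
readlink /proc/self/ns/net
```

Container runtimes combine several of these namespaces (plus cgroups) when they create what we informally call a container.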
In the context of this article, I will focus on assignment of usNIC enabled devices in libvirt-based LXC containers. For simplicity, I will ignore all security-related aspects.
Network namespaces, PCI, and filesystems
Given the relationship between devices and the filesystem, I will focus on filesystem-related aspects and ignore the other commonly configured parts of a container, such as CPU, generic devices, etc.
Assigning containers their own view of the filesystem, with different degrees of sharing between the host filesystem and the container filesystem, is already possible and easy to achieve (see the mount namespace documentation). However, what is still not possible is to partition or virtualize (i.e., make namespace-aware) certain parts of the filesystem.
Filesystem elements such as the virtual filesystems commonly mounted in /proc, /sys, and /dev are examples that fall into that category. These special filesystems provide a lot of information and configuration knobs that you may not want to share between the host and all containers, or between containers.
Also, a number of device drivers place special files in /dev that user space can use to interact with the devices via the device driver.
Even though network interfaces do not normally need to add anything to /dev (i.e., there is no /dev/enp7s0f0), usNIC enabled Ethernet interfaces have entries in /dev because the Libfabric and Verbs libraries require access to those entries.
Sidenote: For more information on why modern Linux distributions no longer use interface names like ethX, and how names like enp7s0f0 are derived, see the systemd documentation on predictable network interface names.
The tools you use to manage containers may assign a new network namespace to each container you create by default, or may require you to ask for one explicitly. Libvirt, as explained in its documentation, does this automatically when you assign a host network interface to the container. Specifically: when you create a new network namespace, you have the option of moving into the container any of the network interfaces (e.g., enp7s0f0) available in the host.
You can do this by hand using the ip link command, or you can have that assignment taken care of for you by one of the container management tools. Later we will see how libvirt does that for us.
Once you have moved a network interface into a container, that network device will be visible and usable only inside that container.
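You can demonstrate this isolation without any container tooling. The sketch below uses unshare to enter a fresh network namespace (wrapped in an unprivileged user namespace via -r, so no real root is needed); inside it, only the loopback interface exists. Moving a physical interface in would then be done, as root, with ip link set <dev> netns <pid>:

```shell
#!/bin/sh
# In the current (host) network namespace, several interfaces are visible:
ip -o link show

echo "--- inside a new network namespace ---"

# 'unshare -r -n' creates a new user namespace (mapping us to root inside
# it) plus a new network namespace, then runs the command there. Only the
# loopback device exists in a freshly created network namespace.
unshare -r -n ip -o link show
```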
Figure 1: (a) host with no containers (b) container that has been assigned a new network namespace which shares all network interfaces with the host (c) container that has been assigned a new network namespace and one of the host network interfaces (no longer visible in the host)

However, the Ethernet adapter also has an identity as a PCI device. As such, it appears in /sys and can be seen via commands like lspci from any network namespace - not only from the one where the associated network device (enp7s0f0) lives.
This gap derives from the fact that the Ethernet device is hooked to both the PCI layer and the networking layer, but only the latter has been assigned a namespace.
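You can observe this dual identity from the host side: every entry in /sys/class/net that corresponds to real hardware carries a device symlink pointing back to its bus (typically PCI) device. A small sketch (interface names are system-specific; purely virtual interfaces such as lo have no such link):

```shell
#!/bin/sh
# Map each network interface to the bus device backing it, if any.
for dev in /sys/class/net/*; do
    ifname=$(basename "$dev")
    if [ -e "$dev/device" ]; then
        # Physical NIC: resolve the symlink to the bus device path.
        echo "$ifname -> $(readlink -f "$dev/device")"
    else
        # Virtual interface (loopback, bridges, veth, ...): no bus device.
        echo "$ifname -> (virtual, no backing bus device)"
    fi
done
```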
Figure 2: (a) host with no containers (b) container that has been assigned a new network namespace which can not access any of the host network interfaces (c) container that has been assigned a new network namespace and one of the host network interfaces.

Tools you can use to assign devices to containers
You can classify containers based on different criteria, such as what they will be used to run inside. At the two extremes, you have these options:
- a container meant to run a single application
- a container meant to run a complete Linux distribution
In the first case, you only need to populate the container filesystem with what is strictly needed to run a given application. Most likely, not much more than a set of libraries. Other parts of the filesystem may be shared with the host (including the virtual filesystems), or may not be needed at all.
In the second case, you want to assign the container a full filesystem and have less (if any) sharing with the host filesystem, including the special entries like /proc, /sys, /dev, etc.
Even though full-distribution container support is still not considered "ready for prime time" due to the limitations imposed by a few special filesystems as discussed above, there are a number of generic mechanisms available that can be used to provide some degree of device/resource assignment and isolation between containers; the two we will rely on here are bind mounts and the cgroup device controller.
You can check LXD for an example of a project whose goal is to add whatever is missing in order to make containers as isolated as virtual machines in terms of resource usage/access.
In section "Example of libvirt LXC container configuration" we will see a simple example of how you can tell libvirt to use bind mounts and cgroup device controllers to assign a usNIC enabled Ethernet interface to a container.
Support for bind mounts has been available for a long time (see man mount for the details).
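As a quick refresher, a bind mount makes an existing directory appear at a second location. The sketch below performs one inside an unprivileged user+mount namespace so it can run without root; the paths are throw-away examples:

```shell
#!/bin/sh
set -e
# Prepare a source directory with a marker file.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "hello from the source" > "$src/marker"

# 'unshare -r -m' gives us a private mount namespace in which we are
# allowed to mount; the bind mount is visible only inside that namespace.
unshare -r -m sh -c "mount --bind '$src' '$dst' && cat '$dst/marker'"

# Back in the original namespace the bind mount never existed, so the
# target directory is still empty.
ls "$dst"
```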
cgroup device controller support may already be enabled on your distro by default. If not, you can enable it with this kernel configuration option:

    CONFIG_CGROUP_DEVICE=y

You can find some documentation about this feature in the kernel file Documentation/cgroups/devices.txt. We will not configure it manually as described in that document; instead, we will tell libvirt to do that for us.
Loading the required kernel modules and understanding the role of key filesystem entries
For a detailed description of how to deploy usNIC, you can refer to the usNIC deployment guide (available at cisco.com). Keep in mind that the only piece not covered there, which is the focus of this article, is making sure that certain files created when you load the usNIC kernel modules are visible and usable inside the container's filesystem.
Normally, users do not need to have a detailed knowledge of what files are created by the kernel modules and used by the user space libraries. In our case, however, we do need to have some knowledge about these files in order to properly populate the container filesystem.
Before I show you the libvirt XML configuration, let's first discuss the role of three key files/directories we will need to tell libvirt about.
Once you have created a "Virtual NIC" (vNIC) on the Cisco UCS Virtual Interface Card (VIC) and enabled the usNIC feature in it (per the Cisco documentation cited above), you will see the following three filesystem entries in the host:
- /dev/infiniband/uverbsX: the special file that the Libfabric and Verbs libraries need to access.
- /sys/class/infiniband/usnic_X/: the iface file in this directory tells you which network interface (visible with ifconfig) this usNIC entry is associated with.
- /sys/class/infiniband_verbs/uverbsX/: the dev file in this directory contains the major:minor device ID, which will match what you see in /dev/infiniband/uverbsX. You can refer back to this information when/if you want to check whether libvirt configures the cgroup device whitelist properly (see the example below). The ibdev file contains the name of the associated entry in /sys/class/infiniband/usnic_X/.
Note that:
- The /sys/class/infiniband/usnic_X/ directory will be populated when you load the usNIC kernel driver module (i.e., usnic_verbs.ko).
- The /dev/infiniband/ and /sys/class/infiniband_verbs/ directories will also be populated when you load the usNIC kernel driver module.

In order to find the mapping between one of the network interfaces visible with ifconfig and the associated uverbsX entry in /dev/infiniband, you can either use the files in /sys described above, or use the usd_devinfo command that comes with the usnic-utils package.
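Putting the /sys entries above together, a small script can print the interface-to-uverbs mapping. This is a sketch based on the file layout described above; on a machine without usNIC hardware (or without usnic_verbs.ko loaded) it simply reports that nothing was found:

```shell
#!/bin/sh
# For each uverbsX entry, read 'ibdev' (the usnic_X name) and then that
# device's 'iface' file (the associated network interface name).
found=0
for d in /sys/class/infiniband_verbs/uverbs*; do
    [ -d "$d" ] || continue
    found=1
    ibdev=$(cat "$d/ibdev")
    iface=$(cat "/sys/class/infiniband/$ibdev/iface")
    printf '%s -> %s -> %s\n' "$iface" "$ibdev" "/dev/infiniband/$(basename "$d")"
done
[ "$found" -eq 1 ] || echo "no usNIC/uverbs devices found"
```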
Example of libvirt LXC container configuration
Libvirt describes the configuration of containers (as well as virtual machines) with an XML file; all of libvirt's XML options are documented in detail on the libvirt website. In the context of this article, the part of that documentation most worth reading is the "Device Nodes" section of the LXC driver pages, which I will refer back to below.
Let's start with a simple container configuration and add the delta needed to assign one usNIC enabled host Ethernet interface to the container. This example shows how to create a container on a Cisco UCS C240-M3 rack server running CentOS 7.
Here is a stripped-down version of the container XML; I have removed the details that are not relevant for this discussion:
    <domain type='lxc'>
      <name>container_1</name>
      <memory unit='GiB'>8</memory>
      <currentMemory unit='GiB'>0</currentMemory>
      <os>
        <type arch='x86_64'>exe</type>
        <init>/sbin/init</init>
      </os>
      <devices>
        <filesystem type='mount' accessmode='passthrough'>
          <source dir='/usr/local/var/lib/lxc/container_1/rootfs'/>
          <target dir='/'/>
        </filesystem>
        <console type='pty'/>
      </devices>
    </domain>
The only detail worth noting is that the container root filesystem is located at /usr/local/var/lib/lxc/container_1/rootfs in the host.
Note that with this basic configuration, and according to the section "Device Nodes" mentioned above, the container's /dev tree will not contain any of the special entries from the host's /dev tree, including the /dev/infiniband directory that we need for usNIC:
    [container_1]# ls /dev/infiniband
    ls: cannot access /dev/infiniband: No such file or directory
However, since /sys is shared with the host, you can see the entries associated with usNIC enabled Ethernet interfaces:

    [container_1]# find /sys/class -name uverbs*
    /sys/class/infiniband_verbs/uverbs0
    /sys/class/infiniband_verbs/uverbs1
    /sys/class/infiniband_verbs/uverbs2
    /sys/class/infiniband_verbs/uverbs3
    [container_1]# find /sys/class -name usnic*
    /sys/class/infiniband/usnic_0
    /sys/class/infiniband/usnic_1
    /sys/class/infiniband/usnic_2
    /sys/class/infiniband/usnic_3
But notice that none of the /dev/infiniband/uverbsX devices are present (yet) in the container. Running a simple usNIC diagnostic program in the container shows warnings (one for each device I have configured on my server):

    [container_1]# /opt/cisco/usnic/bin/usd_devinfo
    usd_open_for_attrs: No such device
    usd_open_for_attrs: No such device
    usd_open_for_attrs: No such device
    usd_open_for_attrs: No such device
Since we did not assign any host network interface to the container, by default, libvirt allowed the container to see all Ethernet interfaces (i.e., it did not create a new network namespace):
    [container_1]# ip link
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    8: enp6s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
        link/ether 00:25:b5:00:00:04 brd ff:ff:ff:ff:ff:ff
    9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
        link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
    10: enp8s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
        link/ether 00:25:b5:00:00:24 brd ff:ff:ff:ff:ff:ff
    11: enp9s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
        link/ether 00:25:b5:01:01:0f brd ff:ff:ff:ff:ff:ff
Now we edit the libvirt configuration to assign one usNIC enabled interface to the container. This means that inside the container:
- /dev/infiniband/ will show an entry for the assigned usNIC enabled interface
- ifconfig will also show the usNIC enabled Ethernet interface

Let's assign enp7s0f0 (i.e., usnic_1) to the container. Here is the new libvirt LXC container configuration; the changes compared to container_1 are the two <hostdev> elements:
    <domain type='lxc'>
      <name>container_2</name>
      <memory unit='GiB'>8</memory>
      <currentMemory unit='GiB'>0</currentMemory>
      <os>
        <type arch='x86_64'>exe</type>
        <init>/sbin/init</init>
      </os>
      <devices>
        <filesystem type='mount' accessmode='passthrough'>
          <source dir='/usr/local/var/lib/lxc/centos_container/rootfs'/>
          <target dir='/'/>
        </filesystem>
        <hostdev mode='capabilities' type='misc'>
          <source>
            <char>/dev/infiniband/uverbs1</char>
          </source>
        </hostdev>
        <hostdev mode='capabilities' type='net'>
          <source>
            <interface>enp7s0f0</interface>
          </source>
        </hostdev>
        <console type='pty'/>
      </devices>
    </domain>
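Before handing the edited file to libvirt, it can be handy to check that it is at least well-formed XML (this validates syntax only, not libvirt semantics; the file path and the abbreviated contents below are illustrative):

```shell
#!/bin/sh
# Write an abbreviated container definition, reduced here to the two new
# <hostdev> elements plus the surrounding skeleton, to a scratch file.
cat > /tmp/container_2.xml <<'EOF'
<domain type='lxc'>
  <name>container_2</name>
  <devices>
    <hostdev mode='capabilities' type='misc'>
      <source><char>/dev/infiniband/uverbs1</char></source>
    </hostdev>
    <hostdev mode='capabilities' type='net'>
      <source><interface>enp7s0f0</interface></source>
    </hostdev>
  </devices>
</domain>
EOF

# Parse it; on a host with libvirt, 'virsh -c lxc:/// define' on the full
# configuration file would be the next step.
python3 -c "import xml.dom.minidom; xml.dom.minidom.parse('/tmp/container_2.xml'); print('XML OK')"
```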
You can find more details about the above two new pieces of configuration in the libvirt domain XML documentation.
If I start the container with the new "container_2" configuration, this is what I can see from within it:
- only one Ethernet interface (enp7s0f0), besides the loopback interface
- the /dev/infiniband/uverbs1 device
- the usual full view of /sys (as with the previous configuration, container_1)

Specifically:
    [container_2]# ip link
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000
        link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
    [container_2]# ls -ls /dev/infiniband/
    total 0
    0 crwx------. 1 root root 231, 193 Apr  1 20:44 uverbs1
    [container_2]# find /sys/class -name uverbs*
    /sys/class/infiniband_verbs/uverbs0
    /sys/class/infiniband_verbs/uverbs1
    /sys/class/infiniband_verbs/uverbs2
    /sys/class/infiniband_verbs/uverbs3
    [container_2]# find /sys/class -name usnic*
    /sys/class/infiniband/usnic_0
    /sys/class/infiniband/usnic_1
    /sys/class/infiniband/usnic_2
    /sys/class/infiniband/usnic_3
Here is how the usNIC diagnostic command usd_devinfo shows the information about the visible usNIC enabled network interfaces (there are still some warnings because of the uverbsX entries that are present in /sys but not in /dev/infiniband):
    [container_2]# /opt/cisco/usnic/bin/usd_devinfo
    usd_open_for_attrs: No such device
    usnic_1:
        Interface:          enp7s0f0
        MAC Address:        00:25:b5:00:00:14
        IP Address:         10.0.7.1
        Netmask:            255.255.255.0
        Prefix len:         24
        MTU:                9000
        Link State:         UP
        Bandwidth:          10 Gb/s
        Device ID:          UCSB-PCIE-CSC-02 [VIC 1225] [0x0085]
        Firmware:           2.2(2.5)
        VFs:                64
        CQ per VF:          6
        QP per VF:          6
        Max CQ:             256
        Max CQ Entries:     65535
        Max QP:             384
        Max Send Credits:   4095
        Max Recv Credits:   4095
        Capabilities:
            CQ sharing: yes
            PIO Sends:  no
    usd_open_for_attrs: No such device
    usd_open_for_attrs: No such device
Let's compare the content of /dev/infiniband in the host and in the container:

    [container_2]# ls -ls /dev/infiniband/
    total 0
    0 crwx------. 1 root root 231, 193 Apr  1 20:44 uverbs1

    [host]# ls -ls /dev/infiniband/
    total 0
    0 crw-rw-rw-. 1 root root 231, 192 Mar 31 17:30 uverbs0
    0 crw-rw-rw-. 1 root root 231, 193 Mar 31 17:30 uverbs1
    0 crw-rw-rw-. 1 root root 231, 194 Mar 31 17:30 uverbs2
    0 crw-rw-rw-. 1 root root 231, 195 Mar 31 17:30 uverbs3
As you can see, uverbs1 - and only uverbs1 - is visible in the container. The device major number for all uverbsX entries is 231, while the device minors are 192/193/194/195.
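If you want to read those major:minor numbers programmatically (for example, to compare against the dev file in /sys/class/infiniband_verbs/uverbsX/), stat can print them directly. A small sketch, demonstrated on /dev/null since it exists on every system:

```shell
#!/bin/sh
# stat's %t and %T print a device node's major and minor numbers in hex;
# /dev/null is character device 1:3 on Linux.
hexpair=$(stat -c '%t:%T' /dev/null)
echo "hex  major:minor = $hexpair"

# Convert to decimal, the form used in 'ls -l' output and in the cgroup
# devices.list whitelist:
maj=$(printf '%d' "0x$(stat -c '%t' /dev/null)")
min=$(printf '%d' "0x$(stat -c '%T' /dev/null)")
echo "dec  major:minor = $maj:$min"
```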
Let's now compare the devices.list device whitelist for the container and for the host:
    [container_2]# cat /sys/fs/cgroup/devices/devices.list
    c 1:3 rwm
    c 1:5 rwm
    c 1:7 rwm
    c 1:8 rwm
    c 1:9 rwm
    c 5:0 rwm
    c 5:2 rwm
    c 10:229 rwm
    c 231:193 rwm
    c 136:* rwm
    [host]# cat /sys/fs/cgroup/devices/devices.list
    a *:* rwm
As you can see from the two commands above, the host cgroup allows access to all devices (a *:* rwm), while the container's whitelist contains only a small set of standard character devices plus c 231:193 rwm - exactly the major:minor pair of uverbs1 that we asked libvirt to assign.
We can see that "ping" works just fine from inside the container (using the enp7s0f0 interface):
    [container_2]# ip addr show dev enp7s0f0
    9: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
        link/ether 00:25:b5:00:00:14 brd ff:ff:ff:ff:ff:ff
        inet 10.0.7.1/24 brd 10.0.7.255 scope global enp7s0f0
           valid_lft forever preferred_lft forever
        inet6 fe80::225:b5ff:fe00:14/64 scope link
           valid_lft forever preferred_lft forever
    [container_2]# ping -c 1 10.0.7.2
    PING 10.0.7.2 (10.0.7.2) 56(84) bytes of data.
    64 bytes from 10.0.7.2: icmp_seq=1 ttl=64 time=0.279 ms

    --- 10.0.7.2 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.279/0.279/0.279/0.000 ms
We can test the usnic_1 interface using the usd_pingpong command against another container, similarly configured with a usNIC enabled interface, on another Cisco UCS C240-M3 rack server connected over a regular IP/Ethernet network:
    [container_2]# /opt/cisco/usnic/bin/usd_pingpong -d usnic_1 -h 10.0.7.2
    open usnic_1 OK, IP=10.0.7.1
    QP create OK, addr -h 10.0.7.1 -p 3333
    sending params...
    payload_size=4, pkt_size=46
    posted 63 RX buffers, size=64 (4)
    100000 pkts, 1.790 us / HRT
The 1.790 microsecond half-round-trip ping-pong time shown above demonstrates that we are getting bare-metal performance inside of the container.
Wrapup
As Linux containers become more mainstream - potentially even in HPC - it will become more important to understand how to expose native hardware functionality properly. Documentation and "best practice" knowledge are still somewhat scarce in the rapidly-evolving Linux containers ecosystem; this blog entry explains some of the underlying concepts and shows some examples of how adding just a few lines of XML allows bare-metal performance with the isolation and configurability of Linux containers.