InfiniBand

How does InfiniBand work?

Summary: This post describes the series of coordinated events that occur under the hood between the CPU and NIC through the PCI Express fabric to transmit a message and signal its completion over the InfiniBand interconnect.

The primary method of sending a message over InfiniBand is through the Verbs API. libibverbs is the standard implementation of this API and is maintained by the Linux-RDMA community. There are two kinds of functions in Verbs: slow-path and fast-path functions. Slow-path functions (such as ibv_open_device, ibv_alloc_pd, etc.) are those related to the creation and configuration of resources (such as the Context, Protection Domain, and Memory Region). They are called “slow” because they involve the kernel and hence incur the expensive overhead of a context switch. Fast-path functions (such as ibv_post_send, ibv_poll_cq, etc.) deal with initiation and completion of operations. They are called “fast” because they bypass the kernel and, hence, are much faster than the slow-path functions. The critical path of communication consists primarily of fast-path functions and occasionally a slow-path function such as ibv_reg_mr to register Memory Regions on the fly (depending on the communication middleware). This post focuses on mechanisms that occur after the programmer has executed an ibv_post_send.
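
To make the slow-path/fast-path split concrete, here is a minimal sketch in C of the slow-path setup (error handling, QP connection setup, and resource cleanup are omitted; the buffer size and queue depths are arbitrary illustrative choices):

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Slow path: every call below enters the kernel to create or configure a resource. */
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);        /* Context */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                         /* Protection Domain */

    void *buf = malloc(4096);                                      /* payload buffer */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,                  /* Memory Region */
                                   IBV_ACCESS_LOCAL_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);    /* Completion Queue */

    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 1,  .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);                  /* Queue Pair */

    /* The QP must still be connected and transitioned to RTS (also slow path).
     * Only then do the fast-path calls -- ibv_post_send and ibv_poll_cq,
     * discussed in the rest of this post -- come into play; they never enter the kernel. */
    (void)mr; (void)qp;
    return 0;
}
```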

Quick PCIe background

The Network Interface Card (NIC) is typically attached to the server through a PCI Express (PCIe) slot. The main conductor of the PCIe I/O subsystem is the Root Complex (RC). The RC connects the processor and memory to the PCIe fabric. The PCIe fabric may consist of a hierarchy of devices. The peripherals connected to the PCIe fabric are called PCIe endpoints. The PCIe protocol consists of three layers: the Transaction layer, the Data Link layer, and the Physical layer. The first, the upper-most layer, describes the type of transaction occurring. For this post, two types of Transaction Layer Packets (TLPs) are relevant: Memory Write (MWr) and Memory Read (MRd). Unlike the standalone MWr TLP, the MRd TLP is coupled with a Completion with Data (CplD) transaction from the target PCIe endpoint, which contains the data requested by the initiator. The Data Link layer ensures the successful execution of all transactions using Data Link Layer Packet (DLLP) acknowledgements (ACK/NACK) and a credit-based flow-control mechanism. An initiator can issue a transaction as long as it has enough credits for that transaction. Its credits are replenished when it receives Update Flow Control (UpdateFC) DLLPs from its neighbors. Such a flow-control mechanism allows the PCIe protocol to have multiple outstanding transactions.

Basic mechanisms involved

First, I will describe how messages are sent using the completely offloaded approach, that is, the CPU only informs the NIC that there is a message to be transmitted; the NIC does everything else to transmit the data. In such an approach, the CPU is more available for computation. However, this approach can be detrimental to the communication performance of small messages (this will become evident shortly). To improve communication performance in such cases, InfiniBand offers certain operational features, which I describe in the next section.

From a CPU programmer’s perspective, there exists a transmit queue (TxQ; in Verbs, the send queue of the Queue Pair (QP)) and a completion queue (CQ in Verbs). The user posts their message descriptor (MD; Work Queue Element/Entry (WQE, pronounced "wookie") in Verbs) to the transmit queue, after which they poll the CQ to confirm the completion of the posted message. The user could also request to be notified with an interrupt regarding the completion; however, the polling approach is latency-oriented since it involves no context switch to the kernel in the critical path. The actual transmission of a message over the network occurs through coordination between the processor chip and the NIC using memory-mapped I/O (MMIO) and direct memory access (DMA) reads and writes. I will describe these steps below the following figure.

[Figure: the steps of CPU-NIC coordination to transmit a message over InfiniBand]

Step 0: The user first enqueues an MD into the TxQ. The network driver then prepares the device-specific MD, which contains headers for the NIC and a pointer to the payload.
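
From the application's point of view, Step 0 is the ibv_post_send call. Here is a hedged sketch (the helper name post_one_send is mine; qp, mr, and buf are assumed to come from the slow-path setup, and the QP is assumed to be connected):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Step 0 from the user's perspective: describe the message and enqueue it. */
static int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* pointer to the payload */
        .length = len,
        .lkey   = mr->lkey,         /* key of the registered Memory Region */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,            /* user cookie, echoed back in the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;

    /* Before ibv_post_send returns, the driver has built the device-specific
     * MD (WQE) in the TxQ and rung the DoorBell (Step 1); the NIC does the rest. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```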

Step 1: Using an 8-byte atomic write to a memory-mapped location, the CPU (the network driver) notifies the NIC that a message is ready to be sent. This is called ringing the DoorBell. The RC executes the DoorBell using an MWr PCIe transaction.

Step 2: After the DoorBell ring, the NIC fetches the MD using a DMA read. An MRd PCIe transaction conducts the DMA read.

Step 3: The NIC then fetches the payload from a registered memory region using another DMA read (another MRd TLP). Note that the virtual address has to be translated to its physical address before the NIC can perform the DMA read.

Step 4: Once the NIC receives the payload, it transmits the read data over the network. Upon a successful transmission, the NIC receives an acknowledgment (ACK) from the target NIC.

Step 5: Upon the reception of the ACK, the NIC will DMA-write (using a MWr TLP) a completion queue entry (CQE; a.k.a. cookie in Verbs; 64 bytes in Mellanox InfiniBand) to the CQ associated with the TxQ. The CPU will then poll for this completion to make progress.
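
On the CPU side, Step 5 surfaces through ibv_poll_cq. A minimal polling loop might look like the following sketch (the helper name wait_for_completion is mine; cq is the CQ associated with the TxQ):

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Step 5 from the user's perspective: spin on the CQ until the NIC has
 * DMA-written the CQE for the posted send, then check its status. */
static int wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* fast path: reads the CQE from host memory */
    } while (n == 0);                  /* 0 means no CQE yet; keep polling */

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "send failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;                          /* wc.wr_id identifies the completed WQE */
}
```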

In summary, the critical data path of each post entails one MMIO write, two DMA reads, and one DMA write. The DMA reads translate to round-trip PCIe latencies, which are expensive; for example, the round-trip PCIe latency of a ThunderX2 machine is around 125 nanoseconds.

Operational features

Inlining, Postlist, Unsignaled Completions, and Programmed I/O are IB’s operational features that help reduce this overhead. I describe them below considering the depth of the QP to be n.

Postlist: Instead of posting only one WQE per ibv_post_send, IB allows the application to post a linked list of WQEs with just one call to ibv_post_send. It can reduce the number of DoorBell rings from n to 1.
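
A sketch of what a Postlist post might look like (the helper name post_list and the batch limit of 64 are illustrative; the scatter-gather entries are assumed to be filled in as in the earlier single-send sketch):

```c
#include <string.h>
#include <infiniband/verbs.h>

/* Postlist: chain k WQEs together and post them with one call,
 * i.e., one DoorBell ring instead of k. */
static int post_list(struct ibv_qp *qp, struct ibv_sge *sges, int k)
{
    struct ibv_send_wr wrs[64];          /* assumes k <= 64 for this sketch */
    struct ibv_send_wr *bad_wr;

    memset(wrs, 0, sizeof(wrs));
    for (int i = 0; i < k; i++) {
        wrs[i].wr_id      = i;
        wrs[i].sg_list    = &sges[i];
        wrs[i].num_sge    = 1;
        wrs[i].opcode     = IBV_WR_SEND;
        wrs[i].send_flags = IBV_SEND_SIGNALED;
        wrs[i].next       = (i + 1 < k) ? &wrs[i + 1] : NULL;   /* the linked list */
    }
    return ibv_post_send(qp, &wrs[0], &bad_wr);                  /* one call, one DoorBell */
}
```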

Inlining: Here, the CPU (the network driver) copies the data into the WQE. Hence, with its first DMA read for the WQE, the NIC gets the payload as well, eliminating the second DMA read for the payload.
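
A sketch of an inlined send, assuming the QP was created with a sufficiently large cap.max_inline_data and that the payload fits within it (the helper name post_inline_send is mine):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Inlining: the payload travels inside the WQE itself, so the NIC never
 * DMA-reads the payload buffer. */
static int post_inline_send(struct ibv_qp *qp, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        /* .lkey is not consulted for inlined data: the CPU copies it, not the NIC */
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```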

Unsignaled Completions: Instead of signaling a completion for each WQE, IB allows the application to turn off completions for WQEs provided that at least one out of every n WQEs is signaled. Turning off completions reduces the DMA writes of CQEs by the NIC. Additionally, the application polls fewer CQEs, reducing the overhead of making progress.
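
A sketch of selective signaling, assuming the QP was created with sq_sig_all set to 0 in ibv_qp_init_attr so that only WQEs carrying IBV_SEND_SIGNALED generate CQEs (the helper name and the signal-every-n-th policy are illustrative):

```c
#include <infiniband/verbs.h>

/* Unsignaled Completions: with sq_sig_all = 0 at QP creation, a WQE produces a
 * CQE only if IBV_SEND_SIGNALED is set. Signaling every n-th WQE (n = QP depth)
 * reclaims send-queue slots while skipping most CQE DMA-writes. */
static int post_selectively_signaled(struct ibv_qp *qp, struct ibv_sge *sge,
                                     unsigned int post_count, unsigned int n)
{
    struct ibv_send_wr wr = {
        .wr_id      = post_count,
        .sg_list    = sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        /* request a CQE only once every n posts */
        .send_flags = (post_count % n == n - 1) ? IBV_SEND_SIGNALED : 0,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```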

BlueFlame: BlueFlame is Mellanox’s terminology for programmed I/O—it writes the WQE along with the DoorBell, cutting off the DMA read for the WQE itself. Note that BlueFlame is used only without Postlist. With Postlist, the NIC will DMA-read the WQEs in the linked list.

To reduce the overhead of PCIe round-trip latency, developers typically use Inlining and BlueFlame together for small messages; together they eliminate both DMA reads, that is, two PCIe round-trip latencies. While the use of Inlining and BlueFlame depends on message size, the use of Postlist and Unsignaled Completions depends primarily on the user's design choices and application semantics.

For the RDMA novice: libfabric, libibverbs, InfiniBand, OFED, MOFED?

Prior to entering the high-performance, user-level networking space in April 2017, I wasn't even aware of the existence of the field of high-performance interconnects. I barely knew the sockets interface. I had heard about the concepts of DMA and RDMA but nothing about RDMA programming (while RDMA is technically a subset of high-performance, user-level networking, the terms are commonly used interchangeably). At SC'16, I had heard about InfiniBand, but only at Mellanox Technologies' booths, so my brain subconsciously associated the two. To the RDMA novices: I was wrong; to the RDMA experts: you can imagine the turmoil in my head when the first RDMA interface I was learning to use was the abstract `libfabric`.

[Image: word cloud of RDMA-related terms]

While I could further digress from the topic of this post and write about my then confused state of mind, I will instead leave a list of links for those just getting started with the high-performance, user-level networking space:

  1. If you don't know much about `sockets`, I suggest you quickly read about it and write a simple send/receive program using the `sockets` API to understand the implications of maintaining flow control of data transmission/reception in your code.

  2. The OFI guide: read all the sections before `OFI Architecture` to understand the motivation behind this field.

  3. InfiniBand: this Red Hat guide officially defines what InfiniBand is.

From what I have read so far, InfiniBand is the first official interface (building off of the Virtual Interface Architecture) for high-performance, user-level networking. It is high-performance because, unlike TCP/IP, the kernel is not involved (hence, user-level) in the operations that transmit or receive data; the kernel is involved only in the creation of the resources used for issuing data transmission and reception. Additionally, unlike TCP/IP, the InfiniBand interface permits RDMA operations (remote reads, writes, atomics, etc.). The InfiniBand (IB) specification has both hardware and software components.

`libibverbs` is the software component (Verbs API) of the IB interface. As `sockets` is to TCP/IP, `libibverbs` is to IB. Your best bet to learn how to code with `libibverbs` is the amazing RDMAmojo blog written by Dotan Barak, the creator of the man pages for `libibverbs`. You could rely solely on his blog to learn about the InfiniBand concepts. He writes about the API in excruciating detail with very helpful FAQs. Here's his big-picture, tutorial-style presentation. Other critical software components of the IB interface are the user-space libraries and kernel-space modules that implement the API and enable IB resource creation. This is precisely what the OpenFabrics Enterprise Distribution (OFED) is. OFED's user-space libraries are in the rdma-core repository, and the kernel components are in the drivers/infiniband subsystem of the Linux tree.

The hardware component of IB is where different vendors come into play. The IB interface is abstract; hence, multiple vendors can have different implementations of the IB specification. Mellanox Technologies has been an active, prominent InfiniBand hardware vendor. In addition to meeting the IB hardware specifications in their NIC designs, the vendors have to support the `libibverbs` API by providing a user-space driver and a kernel-space driver that do the actual work (of setting up resources on the NIC) when a `libibverbs` function such as `ibv_open_device` is called. These vendor-specific libraries and kernel modules are a standard part of OFED. The vendor-specific user-space libraries are called providers in rdma-core. These providers span both IB and other technologies, such as RoCE and iWARP, which implement RDMA over Ethernet adapters (I'll delve into the convergence between IB and Ethernet in another post). Mellanox OFED (MOFED) is Mellanox's implementation of the OFED libraries and kernel modules. MOFED contains certain optimizations that target Mellanox hardware (the mlx4 and mlx5 providers) but haven't been incorporated into OFED yet.

`libfabric` is another, fairly recent API that serves a level of abstraction higher than that of `libibverbs`. Alongside InfiniBand, several other user-level networking interfaces exist; typically they are proprietary and vendor-specific. Cray has the uGNI interface, Intel Omni-Path has PSM2, Cisco has usNIC, and so on. The underlying concepts (message queues, completion queues, registered memory, etc.) are similar across the different interfaces, with certain differences in capabilities and semantics. The OpenFabrics Interfaces (OFI) effort intends to unify all of the available interfaces by providing an abstract API: `libfabric`. Each vendor then supports OFI through its `libfabric` provider, which calls the corresponding functions in the vendor's own interface. This way, a user-level networking application written using the `libfabric` API is portable across different vendors. Based on the hardware that the application will run on, the right `libfabric` provider can then be selected.
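
As a small taste of that abstraction, here is a hedged sketch that asks `libfabric` which providers on the local machine can satisfy a reliable, connectionless messaging endpoint (the hints and the API version are illustrative choices):

```c
#include <stdio.h>
#include <rdma/fabric.h>

/* Ask libfabric which providers (verbs, psm2, usnic, sockets, ...) can satisfy
 * these requirements; the matching provider is what ends up translating
 * libfabric calls into its native interface. */
int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info, *cur;

    hints->caps          = FI_MSG;       /* send/receive messaging */
    hints->ep_attr->type = FI_EP_RDM;    /* reliable datagram endpoint */

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) == 0) {
        for (cur = info; cur; cur = cur->next)
            printf("provider: %s, fabric: %s\n",
                   cur->fabric_attr->prov_name, cur->fabric_attr->name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}
```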

Hope that helps you with an introduction to high-performance, user-level networking. If not, hope it gives you enough search keywords to use in your favorite search engine. All the best!

If you are interested in learning how transmitting a message works under the hood, check out my newer blog post: How are messages transmitted on InfiniBand?