fi_intro(7)                      Libfabric v1.18.1                      fi_intro(7)

NAME
fi_intro - libfabric introduction

OVERVIEW
This introduction is part of the libfabric programmer’s guide.  See
fi_guide(7).  This section provides insight into the motivation for the
libfabric design and the underlying networking features that are exposed
through the API.
13
REVIEW OF SOCKETS COMMUNICATION
The sockets API is a widely used networking API.  This guide assumes
that the reader has a working knowledge of programming with sockets.  It
references socket-based communication throughout in an effort to help
explain libfabric concepts and how they relate to or differ from the
socket API.  To be clear, the intent of this guide is not to criticize
the socket API, but to use sockets as a starting point from which to
explain certain network features or limitations.  The following sections
provide a high-level overview of socket semantics for reference.
23
24 Connected (TCP) Communication
25 The most widely used type of socket is SOCK_STREAM. This sort of sock‐
26 et usually runs over TCP/IP, and as a result is often referred to as a
27 `TCP' socket. TCP sockets are connection-oriented, requiring an ex‐
28 plicit connection setup before data transfers can occur. A single TCP
29 socket can only transfer data to a single peer socket. Communicating
30 with multiple peers requires the use of one socket per peer.
31
32 Applications using TCP sockets are typically labeled as either a client
33 or server. Server applications listen for connection requests, and ac‐
34 cept them when they occur. Clients, on the other hand, initiate con‐
35 nections to the server. In socket API terms, a server calls listen(),
36 and the client calls connect(). After a connection has been estab‐
37 lished, data transfers between a client and server are similar. The
38 following code segments highlight the general flow for a sample client
39 and server. Error handling and some subtleties of the socket API are
40 omitted for brevity.
41
/* Example server code flow to initiate listen */
struct addrinfo *ai, hints;
int listen_fd;

memset(&hints, 0, sizeof hints);
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE;
getaddrinfo(NULL, "7471", &hints, &ai);

listen_fd = socket(ai->ai_family, SOCK_STREAM, 0);
bind(listen_fd, ai->ai_addr, ai->ai_addrlen);
freeaddrinfo(ai);

fcntl(listen_fd, F_SETFL, O_NONBLOCK);
listen(listen_fd, 128);
57
In this example, the server will listen for connection requests on port
7471 across all addresses in the system.  The call to getaddrinfo() is
used to form the local socket address.  The node parameter is set to
NULL, which results in a wildcard IP address being returned.  The port
is hard-coded to 7471.  The AI_PASSIVE flag signifies that the address
will be used by the listening side of the connection.  That is, the
address information should be relative to the local node.
65
66 This example will work with both IPv4 and IPv6. The getaddrinfo() call
67 abstracts the address format away from the server, improving its porta‐
68 bility. Using the data returned by getaddrinfo(), the server allocates
69 a socket of type SOCK_STREAM, and binds the socket to port 7471.
70
In practice, most enterprise-level applications make use of non-blocking
sockets.  This is needed for a single application thread to manage
multiple socket connections.  The fcntl() command sets the listening
socket to non-blocking mode.  This will affect how the server processes
connection requests (shown below).  Finally, the server starts listening
for connection requests by calling listen().  Until listen() is called,
connection requests that arrive at the server will be rejected by the
operating system.
79
/* Example client code flow to start connection */
struct addrinfo *ai, hints;
int client_fd;

memset(&hints, 0, sizeof hints);
hints.ai_socktype = SOCK_STREAM;
getaddrinfo("10.31.20.04", "7471", &hints, &ai);

client_fd = socket(ai->ai_family, SOCK_STREAM, 0);
fcntl(client_fd, F_SETFL, O_NONBLOCK);

connect(client_fd, ai->ai_addr, ai->ai_addrlen);
freeaddrinfo(ai);
93
Similar to the server, the client makes use of getaddrinfo().  Since
the AI_PASSIVE flag is not specified, the given address is treated as
that of the destination.  The client expects to reach the server at IP
address 10.31.20.04, port 7471.  For this example the address is
hard-coded into the client.  More typically, the address will be given
to the client via the command line, through a configuration file, or
from a service.  Often the port number will be well-known, and the
client will find the server by name, with DNS (the domain name system)
providing the name to address resolution.  Fortunately, the
getaddrinfo() call can be used to convert host names into IP addresses.
104
Whether the client is given the server’s network address directly or a
name which must be translated into the network address, the mechanism
used to provide this information to the client varies widely.  A simple
and common mechanism is for users to provide the server’s address using
a command line option.  The problem of telling applications where their
peers are located increases significantly for applications that
communicate with hundreds to millions of peer processes, often requiring
a separate, dedicated application to solve.  For a typical client-server
socket application, this is not an issue, so we will defer further
discussion until later.
115
Using the getaddrinfo() results, the client opens a socket, configures
it for non-blocking mode, and initiates the connection request.  At this
point, the network stack has sent a request to the server to establish
the connection.  Because the socket has been set to non-blocking,
connect() will return immediately and not wait for the connection to be
established.  As a result, any attempt to send data at this point will
likely fail.
123
/* Example server code flow to accept a connection */
struct pollfd fds;
int server_fd;

fds.fd = listen_fd;
fds.events = POLLIN;

poll(&fds, 1, -1);

server_fd = accept(listen_fd, NULL, NULL);
fcntl(server_fd, F_SETFL, O_NONBLOCK);
135
Applications that use non-blocking sockets use select(), poll(), or an
equivalent such as epoll() to receive notification of when a socket is
ready to send or receive data.  In this case, the server wishes to know
when the listening socket has a connection request to process.  It adds
the listening socket to a poll set, then waits until a connection
request arrives (i.e. POLLIN is true).  The poll() call blocks until
POLLIN is set on the socket.  POLLIN indicates that the socket has data
to accept.  Since this is a listening socket, the data is a connection
request.  The server accepts the request by calling accept(), which
returns a new socket to the server that is ready for data transfers.
Finally, the server sets the new socket to non-blocking mode.
147
/* Example client code flow to establish a connection */
struct pollfd fds;
int err;
socklen_t len;

fds.fd = client_fd;
fds.events = POLLOUT;

poll(&fds, 1, -1);

len = sizeof err;
getsockopt(client_fd, SOL_SOCKET, SO_ERROR, &err, &len);
160
161 The client is notified that its connection request has completed when
162 its connecting socket is `ready to send data' (i.e. POLLOUT is true).
163 The poll() call blocks until POLLOUT is set on the socket, indicating
164 the connection attempt is done. Note that the connection request may
165 have completed with an error, and the client still needs to check if
166 the connection attempt was successful. That is not conveyed to the ap‐
167 plication by the poll() call. The getsockopt() call is used to re‐
168 trieve the result of the connection attempt. If err in this example is
169 set to 0, then the connection attempt succeeded. The socket is now
170 ready to send and receive data.
171
After a connection has been established, the process of sending or
receiving data is the same for both the client and server.  The examples
below differ only by the name of the socket variable used by the client
or server application.
176
/* Example of client sending data to server */
struct pollfd fds;
size_t offset, size;
ssize_t ret;
char buf[4096];

fds.fd = client_fd;
fds.events = POLLOUT;

size = sizeof(buf);
for (offset = 0; offset < size; ) {
    poll(&fds, 1, -1);

    ret = send(client_fd, buf + offset, size - offset, 0);
    offset += ret;
}
192
Network communication involves buffering of data at both the sending
and receiving sides of the connection.  TCP uses a credit-based scheme
to manage flow control, ensuring that there is sufficient buffer space
at the receive side of a connection to accept incoming data.  This flow
control is hidden from the application by the socket API.  As a result,
stream-based sockets may not transfer all the data that the application
requests to send as part of a single operation.
200
In this example, the client maintains an offset into the buffer that it
wishes to send.  As data is accepted by the network, the offset
increases.  The client then waits until the network is ready to accept
more data before attempting another transfer.  The poll() operation
supports this.  When the client socket is ready for data, poll() sets
POLLOUT to true.  This indicates that send() will transfer some
additional amount of data.  The client issues a send() request for the
remaining amount of buffer that it wishes to transfer.  If send()
transfers less data than requested, the client updates the offset, waits
for the network to become ready, then tries again.
211
/* Example of server receiving data from client */
struct pollfd fds;
size_t offset, size;
ssize_t ret;
char buf[4096];

fds.fd = server_fd;
fds.events = POLLIN;

size = sizeof(buf);
for (offset = 0; offset < size; ) {
    poll(&fds, 1, -1);

    ret = recv(server_fd, buf + offset, size - offset, 0);
    offset += ret;
}
227
The flow for receiving data is similar to that used to send it.
Because of the streaming nature of the socket, there is no guarantee
that the receiver will obtain all of the available data as part of a
single call.  The server instead must wait until the socket is ready to
receive data (POLLIN), then call recv() to obtain whatever data is
available.  In this example, the server knows to expect exactly 4 KB of
data from the client.  More generally, a client and server will exchange
communication protocol headers at the start of all messages, and the
header will include the size of the message.
237
It is worth noting that the previous two examples are written so that
they are simple to understand.  They are poorly constructed when
considering performance.  In both cases, the application always precedes
a data transfer call (send or recv) with poll().  The impact is that
even if the network is ready to transfer data or has data queued for
receiving, the application will always experience the latency and
processing overhead of poll().  A better approach is to call send() or
recv() prior to entering the for() loops, and only enter the loops if
needed.
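
A minimal sketch of that better approach is shown below.  It reuses the
client_fd and buf variables from the earlier client example, attempts
the send first, and only blocks in poll() when the socket reports that
it cannot accept more data (EAGAIN or EWOULDBLOCK).

/* Example of sending data, polling only when the socket is busy */
struct pollfd fds;
size_t offset, size;
ssize_t ret;
char buf[4096];

fds.fd = client_fd;
fds.events = POLLOUT;

size = sizeof(buf);
for (offset = 0; offset < size; ) {
    ret = send(client_fd, buf + offset, size - offset, 0);
    if (ret > 0) {
        offset += ret;
    } else if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* The socket cannot accept more data; now block in poll() */
        poll(&fds, 1, -1);
    } else {
        break;  /* error handling omitted for brevity */
    }
}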
246
Connection-less (UDP) Communication
As mentioned, TCP sockets are connection-oriented.  They may be used to
communicate between exactly 2 processes.  For parallel applications that
need to communicate with thousands of peer processes, the overhead of
managing this many simultaneous sockets can be significant, to the point
where the application performance may decrease as more processes are
added.
254
To support communicating with a large number of peers, or for
applications that do not need the overhead of reliable communication,
the socket API offers another commonly used socket type, SOCK_DGRAM.
Datagrams are unreliable, connectionless messages.  The most common type
of SOCK_DGRAM socket runs over UDP/IP.  As a result, datagram sockets
are often referred to as UDP sockets.
261
UDP sockets use the same socket API as that described above for TCP
sockets; however, the communication behavior differs.  First, an
application using UDP sockets does not need to connect to a peer prior
to sending it a message.  The destination address is specified as part
of the send operation.  A second major difference is that the message is
not guaranteed to arrive at the peer.  Network congestion in switches,
routers, or the remote NIC can discard the message, and no attempt will
be made to resend it.  The sender will not be notified whether the
message arrived or was dropped.  Another difference between TCP and UDP
sockets is the maximum size of the transfer that is allowed.  UDP
sockets limit messages to at most 64 KB, though in practice, applications
use a much smaller size, usually aligned to the network MTU size (for
example, 1500 bytes).
275
Most uses of UDP sockets replace the send() / recv() calls with sendto()
and recvfrom().
278
/* Example send to peer at given IP address and UDP port */
struct addrinfo *ai, hints;

memset(&hints, 0, sizeof hints);
hints.ai_socktype = SOCK_DGRAM;
getaddrinfo("10.31.20.04", "7471", &hints, &ai);

ret = sendto(client_fd, buf, size, 0, ai->ai_addr, ai->ai_addrlen);
287
In the above example, we use getaddrinfo() to convert the given IP
address and UDP port number into a sockaddr.  That is passed into the
sendto() call in order to specify the destination of the message.  Note
the similarities between this flow and the TCP socket flow.  The
recvfrom() call allows us to receive the address of the sender of a
message.  Note that unlike streaming sockets, the entire message is
accepted by the network on success.  All contents of the buf parameter,
specified by the size parameter, have been queued by the network layer.
296
Although not shown in the examples above, the application could call
poll() or an equivalent prior to calling sendto() to ensure that the
socket is ready to accept new data.  Similarly, poll() may be used prior
to calling recvfrom() to check if there is data ready to be read from
the socket.
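
As an illustrative sketch, such a readiness check before a datagram send
could look like the following, reusing client_fd, buf, size, and ai from
the surrounding examples.

/* Example readiness check before a datagram send */
struct pollfd fds;

fds.fd = client_fd;
fds.events = POLLOUT;

if (poll(&fds, 1, -1) > 0 && (fds.revents & POLLOUT))
    ret = sendto(client_fd, buf, size, 0, ai->ai_addr, ai->ai_addrlen);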
301
/* Example receive a message from a peer */
struct sockaddr_in addr;
socklen_t addrlen;

addrlen = sizeof(addr);
ret = recvfrom(client_fd, buf, size, 0, (struct sockaddr *) &addr, &addrlen);
308
This example will receive any incoming message from any peer.  The
address of the peer will be provided in the addr parameter.  In this
case, we only provide enough space to record an IPv4 address (limited by
our use of struct sockaddr_in).  Supporting an IPv6 address would simply
require passing in a larger address buffer (mapped to struct
sockaddr_in6, for example).
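
One portable way to handle both address families, shown here as a
sketch, is to receive into a struct sockaddr_storage, which is large
enough to hold any supported socket address, and then examine the
returned address family.

/* Example receive that can record an IPv4 or IPv6 peer address */
struct sockaddr_storage addr;
socklen_t addrlen;

addrlen = sizeof(addr);
ret = recvfrom(client_fd, buf, size, 0, (struct sockaddr *) &addr, &addrlen);

if (addr.ss_family == AF_INET6)
    ;   /* interpret addr as a struct sockaddr_in6 */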
315
Advantages
The socket API has two significant advantages.  First, it is available
on a wide variety of operating systems and platforms, and works over the
vast majority of available networking hardware.  It can even be used for
communication between processes on the same system without any network
hardware.  It is easily the de facto networking API.  This by itself
makes it appealing to use.
323
The second key advantage is that it is relatively easy to program to.
The importance of this should not be overlooked.  Networking APIs that
offer access to higher performing features, but are difficult to program
to correctly or well, often result in lower application performance.
This is not unlike coding an application in a higher-level language such
as C or C++, versus assembly.  Although writing directly in assembly
offers the promise of better performance, for the vast majority of
developers, applications will perform better if written in C or C++
using an optimizing compiler.  Applications should have a clear need for
high-performance networking before selecting an alternative API to
sockets.
335
Disadvantages
When considering the problems with the socket API as it pertains to
high-performance networking, we limit our discussion to the two most
common socket types: streaming (TCP) and datagram (UDP).
340
Most applications require that network data be sent reliably.  This
invariably means using a connection-oriented TCP socket.  TCP sockets
transfer data as a stream of bytes.  However, many applications operate
on messages.  The result is that applications often insert headers that
are simply used to convert application messages to / from a byte stream.
These headers consume additional network bandwidth and processing.  The
streaming nature of the interface also results in the application using
loops, as shown in the examples above, to send and receive larger
messages.  The complexity of those loops can be significant if the
application is managing sockets to hundreds or thousands of peers.
351
352 Another issue highlighted by the above examples deals with the asyn‐
353 chronous nature of network traffic. When using a reliable transport,
354 it is not enough to place an application’s data onto the network. If
355 the network is busy, it could drop the packet, or the data could become
356 corrupted during a transfer. The data must be kept until it has been
357 acknowledged by the peer, so that it can be resent if needed. The
358 socket API is defined such that the application owns the contents of
359 its memory buffers after a socket call returns.
360
361 As an example, if we examine the socket send() call, once send() re‐
362 turns the application is free to modify its buffer. The network imple‐
363 mentation has a couple of options. One option is for the send call to
364 place the data directly onto the network. The call must then block be‐
365 fore returning to the user until the peer acknowledges that it received
366 the data, at which point send() can return. The obvious problem with
367 this approach is that the application is blocked in the send() call un‐
368 til the network stack at the peer can process the data and generate an
369 acknowledgment. This can be a significant amount of time where the ap‐
370 plication is blocked and unable to process other work, such as respond‐
371 ing to messages from other clients. Such an approach is not feasible.
372
373 A better option is for the send() call to copy the application’s data
374 into an internal buffer. The data transfer is then issued out of that
375 buffer, which allows retrying the operation in case of a failure. The
376 send() call in this case is not blocked, but all data that passes
377 through the network will result in a memory copy to a local buffer,
378 even in the absence of any errors.
379
380 Allowing immediate re-use of a data buffer is a feature of the socket
381 API that keeps it simple and easy to program to. However, such a fea‐
382 ture can potentially have a negative impact on network performance.
383 For network or memory limited applications, or even applications con‐
384 cerned about power consumption, an alternative API may be attractive.
385
A slightly more hidden problem occurs in the socket APIs designed for
UDP sockets.  This problem is an inefficiency in the implementation that
results from the API being designed for ease of use.  In order for the
application to send data to a peer, it needs to provide the IP address
and UDP port number of the peer.  That involves passing a sockaddr
structure to the sendto() and recvfrom() calls.  However, IP addresses
are higher-level, network layer addresses.  In order to transfer data
between systems, low-level link layer addresses are needed, for example
Ethernet addresses.  The network layer must map IP addresses to Ethernet
addresses on every send operation.  When scaled to thousands of peers,
that overhead on every send call can be significant.
397
398 Finally, because the socket API is often considered in conjunction with
399 TCP and UDP protocols, it is intentionally detached from the underlying
400 network hardware implementation, including NICs, switches, and routers.
401 Access to available network features is therefore constrained by what
402 the API can support.
403
It is worth noting here that some operating systems support enhanced
APIs that may be used to interact with TCP and UDP sockets.  For
example, Linux supports an interface known as io_uring, and Windows has
an asynchronous socket API.  Those APIs can help alleviate some of the
problems described above.  However, an application will still be
restricted by the features that the TCP and UDP protocols provide.
410
HIGH-PERFORMANCE NETWORKING
By analyzing the socket API in the context of high-performance
networking, we can start to see some features that are desirable for a
network API.
415
Avoiding Memory Copies
The socket API implementation usually results in data copies occurring
at both the sender and the receiver.  This is a trade-off between
keeping the interface easy to use and providing reliability.  Ideally,
all memory copies would be avoided when transferring data over the
network.  There are techniques and APIs that can be used to avoid memory
copies, but in practice, the cost of avoiding a copy can often exceed
the cost of the copy itself, in particular for small transfers (measured
in bytes, versus kilobytes or more).
425
426 To avoid a memory copy at the sender, we need to place the application
427 data directly onto the network. If we also want to avoid blocking the
428 sending application, we need some way for the network layer to communi‐
429 cate with the application when the buffer is safe to re-use. This
430 would allow the original buffer to be re-used in case the data needs to
431 be re-transmitted. This leads us to crafting a network interface that
432 behaves asynchronously. The application will need to issue a request,
433 then receive some sort of notification when the request has completed.
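
The following fragment illustrates the general shape of such an
asynchronous interface.  The endpoint handle and the async_send(),
do_other_work(), and wait_for_completion() calls are hypothetical
placeholders used only to show the submit/notify split; they are not
part of any real API.

/* Hypothetical asynchronous send flow (illustrative only) */
struct request *req;

/* Submit the transfer; the call returns before the data is sent */
req = async_send(endpoint, buf, size);

/* buf must not be modified until the completion is observed */
do_other_work();

/* Retrieve the notification indicating that buf may be reused */
wait_for_completion(req);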
434
435 Avoiding a memory copy at the receiver is more challenging. When data
436 arrives from the network, it needs to land into an available memory
437 buffer, or it will be dropped, resulting in the sender re-transmitting
438 the data. If we use socket recv() semantics, the only way to avoid a
439 copy at the receiver is for the recv() to be called before the send().
440 Recv() would then need to block until the data has arrived. Not only
441 does this block the receiver, it is impractical to use outside of an
442 application with a simple request-reply protocol.
443
444 Instead, what is needed is a way for the receiving application to pro‐
445 vide one or more buffers to the network for received data to land. The
446 network then needs to notify the application when data is available.
447 This sort of mechanism works well if the receiver does not care where
448 in its memory space the data is located; it only needs to be able to
449 process the incoming message.
450
451 As an alternative, it is possible to reverse this flow, and have the
452 network layer hand its buffer to the application. The application
453 would then be responsible for returning the buffer to the network layer
454 when it is done with its processing. While this approach can avoid
455 memory copies, it suffers from a few drawbacks. First, the network
456 layer does not know what size of messages to expect, which can lead to
457 inefficient memory use. Second, many would consider this a more diffi‐
458 cult programming model to use. And finally, the network buffers would
459 need to be mapped into the application process’ memory space, which
460 negatively impacts performance.
461
462 In addition to processing messages, some applications want to receive
463 data and store it in a specific location in memory. For example, a
464 database may want to merge received data records into an existing ta‐
465 ble. In such cases, even if data arriving from the network goes di‐
466 rectly into an application’s receive buffers, it may still need to be
467 copied into its final location. It would be ideal if the network sup‐
468 ported placing data that arrives from the network into a specific memo‐
469 ry buffer, with the buffer determined based on the contents of the da‐
470 ta.
471
Network Buffers
Based on the problems described above, we can start to see that avoiding
memory copies depends upon the ownership of the memory buffers used for
network traffic.  With socket-based transports, the network buffers are
owned and managed by the networking stack.  This is usually handled by
the operating system kernel.  However, this results in the data
`bouncing' between the application buffers and the network buffers.  By
putting the application in control of managing the network buffers, we
can avoid this overhead.  The cost of doing so is additional complexity
in the application.
482
483 Note that even though we want the application to own the network buf‐
484 fers, we would still like to avoid the situation where the application
485 implements a complex network protocol. The trade-off is that the app
486 provides the data buffers to the network stack, but the network stack
487 continues to handle things like flow control, reliability, and segmen‐
488 tation and reassembly.
489
490 Resource Management
491 We define resource management to mean properly allocating network re‐
492 sources in order to avoid overrunning data buffers or queues. Flow
493 control is a common aspect of resource management. Without proper flow
494 control, a sender can overrun a slow or busy receiver. This can result
495 in dropped packets, re-transmissions, and increased network congestion.
496 Significant research and development has gone into implementing flow
497 control algorithms. Because of its complexity, it is not something
498 that an application developer should need to deal with. That said,
499 there are some applications where flow control simply falls out of the
500 network protocol. For example, a request-reply protocol naturally has
501 flow control built in.
502
For our purposes, we expand the definition of resource management beyond
flow control.  Flow control typically only deals with available network
buffering at a peer.  We also want to be concerned about having
available space in outbound data transfer queues.  That is, as we issue
commands to the local NIC to send data, we must ensure that those
commands can be queued at the NIC.  When we consider reliability, this
means tracking outstanding requests until they have been acknowledged.
Resource management will need to ensure that we do not overflow that
request queue.
511
512 Additionally, supporting asynchronous operations (described in detail
513 below) will introduce potential new queues. Those queues also must not
514 overflow.
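
A hypothetical sketch of this kind of tracking is shown below, assuming
a fixed transmit queue depth; the endpoint handle and post_send() call
are placeholders, and a real implementation would be more involved.

/* Hypothetical transmit credit tracking (illustrative only) */
#define TX_QUEUE_DEPTH 128

int tx_outstanding;     /* requests posted but not yet completed */

if (tx_outstanding < TX_QUEUE_DEPTH) {
    post_send(endpoint, buf, size);
    tx_outstanding++;
} else {
    /* Queue is full; wait for completions before posting more work */
}

/* For each send completion reported by the network layer */
tx_outstanding--;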
515
Asynchronous Operations
Arguably, the key feature for achieving high performance is supporting
asynchronous operations, or the ability to overlap different
communications, and to overlap communication with computation.  The
socket API supports asynchronous transfers with its non-blocking mode.
However, because the API itself operates synchronously, the result is
additional data copies.  For an API to be asynchronous, an application
needs to be able to submit work, then later receive some sort of
notification that the work is done.  In order to avoid extra memory
copies, the application must agree not to modify its data buffers until
the operation completes.
527
528 There are two main ways to notify an application that it is safe to re-
529 use its data buffers. One mechanism is for the network layer to invoke
530 some sort of callback or send a signal to the application that the re‐
531 quest is done. Some asynchronous APIs use this mechanism. The draw‐
532 back of this approach is that signals interrupt an application’s pro‐
533 cessing. This can negatively impact the CPU caches, plus requires in‐
534 terrupt processing. Additionally, it is often difficult to develop an
535 application that can handle processing a signal that can occur at any‐
536 time.
537
538 An alternative mechanism for supporting asynchronous operations is to
539 write events into some sort of completion queue when an operation com‐
540 pletes. This provides a way to indicate to an application when a data
541 transfer has completed, plus gives the application control over when
542 and how to process completed requests. For example, it can process re‐
543 quests in batches to improve code locality and performance.
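
A sketch of such batched processing follows.  The completion structure,
the cq handle, and the cq_read() and handle_completion() calls are
hypothetical, standing in for whatever completion queue interface is
provided.

/* Hypothetical completion queue processing loop (illustrative only) */
struct completion events[64];
int i, n;

/* Read up to a batch of completed operations; returns the count */
n = cq_read(cq, events, 64);
for (i = 0; i < n; i++) {
    /* Each event identifies a finished request whose buffer
     * may now be reused by the application.
     */
    handle_completion(&events[i]);
}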
544
545 Interrupts and Signals
546 Interrupts are a natural extension to supporting asynchronous opera‐
547 tions. However, when dealing with an asynchronous API, they can nega‐
548 tively impact performance. Interrupts, even when directed to a kernel
549 agent, can interfere with application processing.
550
551 If an application has an asynchronous interface with completed opera‐
552 tions written into a completion queue, it is often sufficient for the
553 application to simply check the queue for events. As long as the ap‐
554 plication has other work to perform, there is no need for it to block.
555 This alleviates the need for interrupt generation. A NIC merely needs
556 to write an entry into the completion queue and update a tail pointer
557 to signal that a request is done.
558
559 If we follow this argument, then it can be beneficial to give the ap‐
560 plication control over when interrupts should occur and when to write
561 events to some sort of wait object. By having the application notify
562 the network layer that it will wait until a completion occurs, we can
563 better manage the number and type of interrupts that are generated.
564
565 Event Queues
566 As outlined above, there are performance advantages to having an API
567 that reports completions or provides other types of notification using
568 an event queue. A very simple type of event queue merely tracks com‐
569 pleted operations. As data is received or a send completes, an entry
570 is written into the event queue.
571
572 Direct Hardware Access
573 When discussing the network layer, most software implementations refer
574 to kernel modules responsible for implementing the necessary transport
575 and network protocols. However, if we want network latency to approach
576 sub-microsecond speeds, then we need to remove as much software between
577 the application and its access to the hardware as possible. One way to
578 do this is for the application to have direct access to the network in‐
579 terface controller’s command queues. Similarly, the NIC requires di‐
580 rect access to the application’s data buffers and control structures,
581 such as the above mentioned completion queues.
582
583 Note that when we speak about an application having direct access to
584 network hardware, we’re referring to the application process. Natural‐
585 ly, an application developer is highly unlikely to code for a specific
586 hardware NIC. That work would be left to some sort of network library
587 specifically targeting the NIC. The actual network layer, which imple‐
588 ments the network transport, could be part of the network library or
589 offloaded onto the NIC’s hardware or firmware.
590
591 Kernel Bypass
592 Kernel bypass is a feature that allows the application to avoid calling
593 into the kernel for data transfer operations. This is possible when it
594 has direct access to the NIC hardware. Complete kernel bypass is im‐
595 practical because of security concerns and resource management con‐
596 straints. However, it is possible to avoid kernel calls for what are
597 called `fast-path' operations, such as send or receive.
598
599 For security and stability reasons, operating system kernels cannot re‐
600 ly on data that comes from user space applications. As a result, even
601 a simple kernel call often requires acquiring and releasing locks, cou‐
602 pled with data verification checks. If we can limit the effects of a
603 poorly written or malicious application to its own process space, we
604 can avoid the overhead that comes with kernel validation without im‐
605 pacting system stability.
606
607 Direct Data Placement
608 Direct data placement means avoiding data copies when sending and re‐
609 ceiving data, plus placing received data into the correct memory buffer
610 where needed. On a broader scale, it is part of having direct hardware
611 access, with the application and NIC communicating directly with shared
612 memory buffers and queues.
613
Direct data placement is often associated with RDMA - remote direct
memory access.  RDMA is a technique that allows reading and writing
memory that belongs to a peer process running on a node across the
network.  Advanced RDMA hardware is capable of accessing the target
memory buffers without involving the execution of the peer process.
RDMA relies on offloading the network transport onto the NIC in order to
avoid interrupting the target process.
621
The main advantages of supporting direct data placement are avoiding
memory copies and minimizing processing overhead.
624
DESIGNING INTERFACES FOR PERFORMANCE
We want to design a network interface that can meet the requirements
outlined above.  Moreover, we also want to take into account the
performance of the interface itself.  It is often not obvious how an
interface can adversely affect performance, versus performance being a
result of the underlying implementation.  The following sections
describe how interface choices can impact performance.  Of course, when
we begin defining the actual APIs that an application will use, we will
need to trade off raw performance for ease of use where it makes sense.
634
When considering performance goals for an API, we need to take into
account the target application use cases.  For the purposes of this
discussion, we want to consider applications that communicate with
thousands to millions of peer processes.  Data transfers will include
millions of small messages per second per peer, and large transfers that
may be up to gigabytes of data.  At such extreme scales, even small
optimizations are measurable, in terms of both performance and power.
If we have a million peers sending a million messages per second,
eliminating even a single instruction from the code path quickly
multiplies to saving billions of instructions per second from the
overall execution, when viewing the operation of the entire application.
646
647 We once again refer to the socket API as part of this discussion in or‐
648 der to illustrate how an API can affect performance.
649
/* Notable socket function prototypes */
/* "control" functions */
int socket(int domain, int type, int protocol);
int bind(int socket, const struct sockaddr *addr, socklen_t addrlen);
int listen(int socket, int backlog);
int accept(int socket, struct sockaddr *addr, socklen_t *addrlen);
int connect(int socket, const struct sockaddr *addr, socklen_t addrlen);
int shutdown(int socket, int how);
int close(int socket);

/* "fast path" data operations - send only (receive calls not shown) */
ssize_t send(int socket, const void *buf, size_t len, int flags);
ssize_t sendto(int socket, const void *buf, size_t len, int flags,
    const struct sockaddr *dest_addr, socklen_t addrlen);
ssize_t sendmsg(int socket, const struct msghdr *msg, int flags);
ssize_t write(int socket, const void *buf, size_t count);
ssize_t writev(int socket, const struct iovec *iov, int iovcnt);

/* "indirect" data operations */
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
int select(int nfds, fd_set *readfds, fd_set *writefds,
    fd_set *exceptfds, struct timeval *timeout);
672
673 Examining this list, there are a couple of features to note. First,
674 there are multiple calls that can be used to send data, as well as mul‐
675 tiple calls that can be used to wait for a non-blocking socket to be‐
676 come ready. This will be discussed in more detail further on. Second,
677 the operations have been split into different groups (terminology is
678 ours). Control operations are those functions that an application sel‐
679 dom invokes during execution. They often only occur as part of ini‐
680 tialization.
681
Data operations, on the other hand, may be called hundreds to millions
of times during an application’s lifetime.  They deal directly or
indirectly with transferring or receiving data over the network.  Data
operations can be split into two groups.  Fast path calls interact with
the network stack to immediately send or receive data.  In order to
achieve high bandwidth and low latency, those operations need to be as
fast as possible.  Non-fast path operations that still deal with data
transfers are those calls that, while still frequently called by the
application, are not as performance critical.  For example, the select()
and poll() calls are used to block an application thread until a socket
becomes ready.  Because those calls suspend the thread execution,
performance is a lesser concern.  (Performance of those operations is
still a concern, but the cost of executing the operating system
scheduler often swamps all but the most substantial performance gains.)
697
Call Setup Costs
The amount of work that an application needs to perform before issuing a
data transfer operation can affect performance, especially message
rates.  Obviously, the more parameters an application must push onto the
stack to call a function, the higher its instruction count.  However,
replacing stack variables with a single data structure does not help to
reduce the setup costs.
705
706 Suppose that an application wishes to send a single data buffer of a
707 given size to a peer. If we examine the socket API, the best fit for
708 such an operation is the write() call. That call takes only those val‐
709 ues which are necessary to perform the data transfer. The send() call
710 is a close second, and send() is a more natural function name for net‐
711 work communication, but send() requires one extra argument over
712 write(). Other functions are even worse in terms of setup costs. The
713 sendmsg() function, for example, requires that the application format a
714 data structure, the address of which is passed into the call. This re‐
715 quires significantly more instructions from the application if done for
716 every data transfer.
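
The difference in setup cost can be seen by issuing the same
single-buffer transfer both ways.  The sketch below uses the standard
socket calls; client_fd, buf, and size are carried over from the earlier
examples.

/* Same single-buffer transfer issued two ways */
struct msghdr msg;
struct iovec iov;

/* write(): the arguments map directly onto the request */
ret = write(client_fd, buf, size);

/* sendmsg(): a msghdr and iovec must be initialized first */
iov.iov_base = buf;
iov.iov_len = size;
memset(&msg, 0, sizeof msg);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
ret = sendmsg(client_fd, &msg, 0);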
717
Even though all other send functions can be replaced by sendmsg(), it is
useful to have multiple ways for the application to issue send requests.
Not only are the other calls easier to read and use (which lowers
software maintenance costs), but they can also improve performance.
723
724 Branches and Loops
725 When designing an API, developers rarely consider how the API impacts
726 the underlying implementation. However, the selection of API parame‐
727 ters can require that the underlying implementation add branches or use
728 control loops. Consider the difference between the write() and
729 writev() calls. The latter passes in an array of I/O vectors, which
730 may be processed using a loop such as this:
731
/* Sample implementation for processing an array */
for (i = 0; i < iovcnt; i++) {
    ...
}
736
In order to process the iovec array, the natural software construct
would be to use a loop to iterate over the entries.  Loops result in
additional processing.  Typically, a loop requires initializing a loop
control variable (e.g. i = 0), incrementing it with ALU operations (e.g.
i++), and performing a comparison (e.g. i < iovcnt).  This overhead is
necessary to handle an arbitrary number of iovec entries.  If the common
case is that the application wants to send a single data buffer, write()
is a better option.
745
In addition to control loops, an API can result in the implementation
needing branches.  Branches can change the execution flow of a program,
impacting processor pipelining techniques.  Processor branch prediction
helps alleviate this issue.  However, while branch prediction can be
correct nearly 100% of the time while running a micro-benchmark, such as
a network bandwidth or latency test, with more realistic network
traffic, the impact can become measurable.
753
754 We can easily see how an API can introduce branches into the code flow
755 if we examine the send() call. Send() takes an extra flags parameter
756 over the write() call. This allows the application to modify the be‐
757 havior of send(). From the viewpoint of implementing send(), the flags
758 parameter must be checked. In the best case, this adds one additional
759 check (flags are non-zero). In the worst case, every valid flag may
760 need a separate check, resulting in potentially dozens of checks.
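
A simplified, purely illustrative view of that flag handling might look
like the following; a real implementation would support many more flags
and combinations.

/* Hypothetical flag handling inside a send() implementation */
if (flags) {
    if (flags & MSG_OOB)
        ;   /* mark the data as urgent */
    if (flags & MSG_DONTROUTE)
        ;   /* bypass the routing tables */
    if (flags & MSG_DONTWAIT)
        ;   /* force non-blocking behavior for this call */
    /* ... one check per supported flag ... */
}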
761
762 Overall, the sockets API is well designed considering these performance
763 implications. It provides complex calls where they are needed, with
764 simpler functions available that can avoid some of the overhead inher‐
765 ent in other calls.
766
767 Command Formatting
768 The ultimate objective of invoking a network function is to transfer or
769 receive data from the network. In this section, we’re dropping to the
770 very bottom of the software stack to the component responsible for di‐
771 rectly accessing the hardware. This is usually referred to as the net‐
772 work driver, and its implementation is often tied to a specific piece
773 of hardware, or a series of NICs by a single hardware vendor.
774
775 In order to signal a NIC that it should read a memory buffer and copy
776 that data onto the network, the software driver usually needs to write
777 some sort of command to the NIC. To limit hardware complexity and
778 cost, a NIC may only support a couple of command formats. This differs
779 from the software interfaces that we’ve been discussing, where we can
780 have different APIs of varying complexity in order to reduce overhead.
781 There can be significant costs associated with formatting the command
782 and posting it to the hardware.
783
784 With a standard NIC, the command is formatted by a kernel driver. That
785 driver sits at the bottom of the network stack servicing requests from
786 multiple applications. It must typically format each command only af‐
787 ter a request has passed through the network stack.
788
789 With devices that are directly accessible by a single application,
790 there are opportunities to use pre-formatted command structures. The
791 more of the command that can be initialized prior to the application
792 submitting a network request, the more streamlined the process, and the
793 better the performance.
794
795 As an example, a NIC needs to have the destination address as part of a
796 send operation. If an application is sending to a single peer, that
797 information can be cached and be part of a pre-formatted network head‐
798 er. This is only possible if the NIC driver knows that the destination
799 will not change between sends. The closer that the driver can be to
800 the application, the greater the chance for optimization. An optimal
801 approach is for the driver to be part of a library that executes en‐
802 tirely within the application process space.
803
804 Memory Footprint
805 Memory footprint concerns are most notable among high-performance com‐
806 puting (HPC) applications that communicate with thousands of peers.
807 Excessive memory consumption impacts application scalability, limiting
808 the number of peers that can operate in parallel to solve problems.
809 There is often a trade-off between minimizing the memory footprint
810 needed for network communication, application performance, and ease of
811 use of the network interface.
812
As we discussed with the socket API semantics, part of the ease of using
sockets comes from the network layer copying the user’s buffer into an
internal buffer belonging to the network stack.  The amount of internal
buffering that’s made available to the application directly correlates
with the bandwidth that an application can achieve.  In general, larger
internal buffering increases network performance, at the cost of
increasing the memory footprint consumed by the application.  This
memory footprint exists independent of the amount of memory allocated
directly by the application.  Eliminating network buffering not only
helps with performance, but also with scalability, by reducing the
memory footprint needed to support the application.
824
825 While network memory buffering increases as an application scales, it
826 can often be configured to a fixed size. The amount of buffering need‐
827 ed is dependent on the number of active communication streams being
828 used at any one time. That number is often significantly lower than
829 the total number of peers that an application may need to communicate
830 with. The amount of memory required to address the peers, however,
831 usually has a linear relationship with the total number of peers.
832
833 With the socket API, each peer is identified using a struct sockaddr.
834 If we consider a UDP based socket application using IPv4 addresses, a
835 peer is identified by the following address.
836
/* IPv4 socket address - with typedefs removed */
struct sockaddr_in {
    uint16_t sin_family;        /* AF_INET */
    uint16_t sin_port;
    struct {
        uint32_t s_addr;
    } sin_addr;
};
845
In total, the application requires 8 bytes of addressing for each peer.
If the app communicates with a million peers, that explodes to roughly 8
MB of memory space consumed just to maintain the address list.  If IPv6
addressing is needed, then the requirement increases by a factor of 4.
851
Luckily, there are some tricks that can be used to help reduce the
addressing memory footprint, though doing so will introduce more
instructions into the code path to access the network stack.  For
instance, we can notice that all addresses in the above example have the
same sin_family value (AF_INET).  There’s no need to store that for each
address.  This potentially shrinks each address from 8 bytes to 6.  (We
may be left with unaligned data, but that’s a trade-off for reducing the
memory consumption.)  Depending on how the addresses are assigned,
further reduction may be possible.  For example, if the application uses
the same set of port addresses at each node, then we can eliminate
storing the port, and instead calculate it from some base value.  This
type of trick can be applied to the IP portion of the address if the app
is lucky enough to run across sequential IP addresses.
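
The sketch below shows what such a compacted address table might look
like, assuming every peer uses AF_INET and a port derived from a common
base; the peer_addr structure, the peers array, and the BASE_PORT value
are hypothetical.

/* Hypothetical compacted peer address table (illustrative only) */
#define BASE_PORT 7471

struct peer_addr {
    uint32_t ipv4;          /* stored in network byte order */
    uint16_t port_offset;   /* actual port = BASE_PORT + port_offset */
};                          /* 6 bytes per peer, ignoring padding */

/* Expand a compact entry into a sockaddr_in before calling sendto() */
struct sockaddr_in sin;

memset(&sin, 0, sizeof sin);
sin.sin_family = AF_INET;
sin.sin_port = htons(BASE_PORT + peers[i].port_offset);
sin.sin_addr.s_addr = peers[i].ipv4;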
865
866 The main issue with this sort of address reduction is that it is diffi‐
867 cult to achieve. It requires that each application check for and han‐
868 dle address compression, exposing the application to the addressing
869 format used by the networking stack. It should be kept in mind that
870 TCP/IP and UDP/IP addresses are logical addresses, not physical. When
871 running over Ethernet, the addresses that appear at the link layer are
872 MAC addresses, not IP addresses. The IP to MAC address association is
873 managed by the network software. We would like to provide addressing
874 that is simple for an application to use, but at the same time can pro‐
875 vide a minimal memory footprint.
876
877 Communication Resources
878 We need to take a brief detour in the discussion in order to delve
879 deeper into the network problem and solution space. Instead of contin‐
880 uing to think of a socket as a single entity, with both send and re‐
881 ceive capabilities, we want to consider its components separately. A
882 network socket can be viewed as three basic constructs: a transport
883 level address, a send or transmit queue, and a receive queue. Because
884 our discussion will begin to pivot away from pure socket semantics, we
885 will refer to our network `socket' as an endpoint.
886
887 In order to reduce an application’s memory footprint, we need to con‐
888 sider features that fall outside of the socket API. So far, much of
889 the discussion has been around sending data to a peer. We now want to
890 focus on the best mechanisms for receiving data.
891
892 With sockets, when an app has data to receive (indicated, for example,
893 by a POLLIN event), we call recv(). The network stack copies the re‐
894 ceive data into its buffer and returns. If we want to avoid the data
895 copy on the receive side, we need a way for the application to post its
896 buffers to the network stack before data arrives.
897
Arguably, a natural way of extending the socket API to support this
feature is to have each call to recv() simply post the buffer to the
network layer.  As data is received, the receive buffers are removed in
the order that they were posted.  Data is copied into the posted buffer
and returned to the user.  It should be noted that the size of the
posted receive buffer may be larger (or smaller) than the amount of data
received.  If the available buffer space is larger, hypothetically, the
network layer could wait a short amount of time to see if more data
arrives.  If nothing more arrives, the receive completes with the buffer
returned to the application.
908
909 This raises an issue regarding how to handle buffering on the receive
910 side. So far, with sockets we’ve mostly considered a streaming proto‐
911 col. However, many applications deal with messages which end up being
912 layered over the data stream. If they send an 8 KB message, they want
913 the receiver to receive an 8 KB message. Message boundaries need to be
914 maintained.
915
If an application sends and receives a fixed sized message, buffer
allocation becomes trivial.  The app can post X number of buffers, each
of an optimal size.  However, if there is a wide mix in message sizes,
difficulties arise.  It is not uncommon for an app to have 80% of its
messages be a couple hundred bytes or less, but 80% of the total data
that it sends to be in large transfers that are, say, a megabyte or
more.  Pre-posting receive buffers in such a situation is challenging.
923
A common technique used to handle this situation is to implement one
application level protocol for smaller messages, and use a separate
protocol for transfers that are larger than some given threshold.  This
would allow an application to post a bunch of smaller messages, say 4
KB, to receive data.  For transfers that are larger than 4 KB, a
different communication protocol is used, possibly over a different
socket or endpoint.
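
The selection logic itself is simple, as the hypothetical sketch below
shows; the threshold, the endpoint handle, and the send_eager() and
send_rendezvous() calls are placeholders for whatever the two protocols
are called.

/* Hypothetical protocol selection by message size (illustrative only) */
#define EAGER_THRESHOLD 4096

if (size <= EAGER_THRESHOLD) {
    /* Small message: copy into a pre-posted receive buffer at the peer */
    send_eager(endpoint, buf, size);
} else {
    /* Large message: negotiate the transfer, then move the bulk data */
    send_rendezvous(endpoint, buf, size);
}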
931
Shared Receive Queues
If an application pre-posts receive buffers to a network queue, it needs
to balance the size of each buffer posted, the number of buffers that
are posted to each queue, and the number of queues that are in use.
With a socket-like approach, each socket would maintain an independent
receive queue where data is placed.  If an application is using 1000
endpoints and posts 100 buffers, each 4 KB, that results in 400 MB of
memory space being consumed to receive data.  (We can start to realize
that by eliminating memory copies, one of the trade-offs is increased
memory consumption.)  While 400 MB seems like a lot of memory, there is
less than half a megabyte allocated to a single receive queue.  At
today’s networking speeds, that amount of space can be consumed within
milliseconds.  The result is that if only a few endpoints are in use,
the application will experience long delays where flow control kicks in
and backs the transfers off.
947
948 There are a couple of observations that we can make here. The first is
949 that in order to achieve high scalability, we need to move away from a
950 connection-oriented protocol, such as streaming sockets. Secondly, we
951 need to reduce the number of receive queues that an application uses.
952
A shared receive queue is a network queue that can receive data for many
different endpoints at once.  With shared receive queues, we no longer
associate a receive queue with a specific transport address.  Instead,
network data will target a specific endpoint address.  As data arrives,
the endpoint will remove an entry from the shared receive queue, place
the data into the application’s posted buffer, and return it to the
user.  Shared receive queues can greatly reduce the amount of buffer
space needed by an application.  In the previous example, if a shared
receive queue were used, the app could post 10 times the number of
buffers (1000 total), yet still consume 100 times less memory (4 MB
total).  This is far more scalable.  The drawback is that the
application must now be aware of receive queues and shared receive
queues, rather than considering the network only at the level of a
socket.
966
Multi-Receive Buffers
Shared receive queues greatly improve application scalability; however,
the approach as defined so far still results in some inefficiencies.
We’ve only considered the case of posting a series of fixed sized memory
buffers to the receive queue.  As mentioned, determining the size of
each buffer is challenging.  Transfers larger than the fixed size
require using some other protocol in order to complete.  If transfers
are typically much smaller than the fixed size, then the extra buffer
space goes unused.
976
977 Again referring to our example, if the application posts 1000 buffers,
978 then it can only receive 1000 messages before the queue is emptied. At
979 data rates measured in millions of messages per second, this will in‐
980 troduce stalls in the data stream. An obvious solution is to increase
981 the number of buffers posted. The problem is dealing with variable
982 sized messages, including some which are only a couple hundred bytes in
983 length. For example, if the average message size in our case is 256
984 bytes or less, then even though we’ve allocated 4 MB of buffer space,
985 we only make use of 6% of that space. The rest is wasted in order to
986 handle messages which may only occasionally be up to 4 KB.
987
A second optimization that we can make is to fill up each posted receive
buffer as messages arrive.  So, instead of a 4 KB buffer being removed
from use as soon as a single 256 byte message arrives, it can instead
receive up to sixteen 256-byte messages.  We refer to such a feature as
`multi-receive' buffers.
993
994 With multi-receive buffers, instead of posting a bunch of smaller buf‐
995 fers, we instead post a single larger buffer, say the entire 4 MB, at
996 once. As data is received, it is placed into the posted buffer. Un‐
997 like TCP streams, we still maintain message boundaries. The advantages
998 here are twofold. Not only is memory used more efficiently, allowing
999 us to receive more smaller messages at once and larger messages over‐
1000 all, but we reduce the number of function calls that the application
1001 must make to maintain its supply of available receive buffers.
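
Conceptually, the usage model reduces to posting one large region, as in
the hypothetical sketch below; the post_multi_recv() call and the srq
handle are placeholders for whatever interface a provider exposes.

/* Hypothetical multi-receive buffer posting (illustrative only) */
char *region = malloc(4 * 1024 * 1024);

/* Post the entire 4 MB region once; the network layer fills it with
 * complete messages and reports each one, releasing the region back
 * to the application only when its remaining space is exhausted.
 */
post_multi_recv(srq, region, 4 * 1024 * 1024);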
1002
When combined with shared receive queues, multi-receive buffers help
support optimal receive side buffering and processing.  The main
drawback to supporting multi-receive buffers is that the application
will not necessarily know up front how many messages may be associated
with a single posted memory buffer.  This is rarely a problem for
applications.
1009
Optimal Hardware Allocation
As part of scalability considerations, we not only need to consider the
processing and memory resources of the host system, but also the
allocation and use of the NIC hardware.  We’ve referred to network
endpoints as a combination of transport addressing, transmit queues, and
receive queues.  The latter two queues are often implemented as hardware
command queues.  Command queues are used to signal the NIC to perform
some sort of work.  A transmit queue indicates that the NIC should
transfer data.  A transmit command often contains information such as
the address of the buffer to transmit, the length of the buffer, and
destination addressing data.  The actual format and data contents vary
based on the hardware implementation.
1022
1023 NICs have limited resources. Only the most scalable, high-performance
1024 applications likely need to be concerned with utilizing NIC hardware
1025 optimally. However, such applications are an important and specific
1026 focus of libfabric. Managing NIC resources is often handled by a re‐
1027 source manager application, which is responsible for allocating systems
1028 to competing applications, among other activities.
1029
1030 Supporting applications that wish to make optimal use of hardware re‐
1031 quires that hardware related abstractions be exposed to the applica‐
1032 tion. Such abstractions cannot require a specific hardware implementa‐
1033 tion, and care must be taken to ensure that the resulting API is still
       usable by developers unfamiliar with such low-level de‐
1035 tails. Exposing concepts such as shared receive queues is an example
1036 of giving an application more control over how hardware resources are
1037 used.
1038
1039 Sharing Command Queues
1040 By exposing the transmit and receive queues to the application, we open
       the possibility for an application that uses multiple endpoints to
       determine how those queues might be shared.  We talked about
1043 the benefits of sharing a receive queue among endpoints. The benefits
1044 of sharing transmit queues are not as obvious.
1045
1046 An application that uses more addressable endpoints than there are
1047 transmit queues will need to share transmit queues among the endpoints.
1048 By controlling which endpoint uses which transmit queue, the applica‐
1049 tion can prioritize traffic. A transmit queue can also be configured
1050 to optimize for a specific type of data transfer, such as large trans‐
1051 fers only.
1052
1053 From the perspective of a software API, sharing transmit or receive
1054 queues implies exposing those constructs to the application, and allow‐
1055 ing them to be associated with different endpoint addresses.
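
       In libfabric, this takes the form of a shared transmit context that
       multiple endpoints bind to, mirroring the shared receive context
       shown earlier.  A minimal sketch, again assuming the domain and end‐
       points already exist and omitting attribute selection and error han‐
       dling:

          /* Sketch: two endpoints sharing a single transmit context */
          struct fid_stx *stx;

          fi_stx_context(domain, NULL, &stx, NULL);

          /* Sends posted on either endpoint flow through stx, letting
           * the application control how the hardware queue is shared.
           */
          fi_ep_bind(ep1, &stx->fid, 0);
          fi_ep_bind(ep2, &stx->fid, 0);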
1056
1057 Multiple Queues
       The opposite of a shared command queue is an endpoint that has multiple
1059 queues. An application that can take advantage of multiple transmit or
1060 receive queues can increase parallel handling of messages without syn‐
1061 chronization constraints. Being able to use multiple command queues
1062 through a single endpoint has advantages over using multiple endpoints.
1063 Multiple endpoints require separate addresses, which increases memory
1064 use. A single endpoint with multiple queues can continue to expose a
1065 single address, while taking full advantage of available NIC resources.
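
       Libfabric calls an endpoint of this kind a scalable endpoint.  The
       sketch below requests two transmit and two receive contexts on a sin‐
       gle address; it assumes that the fi_info structure (info) was ob‐
       tained with ep_attr->tx_ctx_cnt and ep_attr->rx_ctx_cnt set to 2,
       that the domain is already open, and that error handling is omitted.

          /* Sketch: one address, multiple command queues */
          struct fid_ep *sep, *tx[2], *rx[2];
          int i;

          fi_scalable_ep(domain, info, &sep, NULL);

          for (i = 0; i < 2; i++) {
              /* Each context may be driven by its own thread without
               * synchronizing with the others.
               */
              fi_tx_context(sep, i, NULL, &tx[i], NULL);
              fi_rx_context(sep, i, NULL, &rx[i], NULL);
          }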
1066
1067 Progress Model Considerations
1068 One aspect of the sockets programming interface that developers often
1069 don’t consider is the location of the protocol implementation. This is
1070 usually managed by the operating system kernel. The network stack is
1071 responsible for handling flow control messages, timing out transfers,
1072 re-transmitting unacknowledged transfers, processing received data, and
1073 sending acknowledgments. This processing requires that the network
1074 stack consume CPU cycles. Portions of that processing can be done
1075 within the context of the application thread, but much must be handled
1076 by kernel threads dedicated to network processing.
1077
       When network processing moves directly into the application process,
1079 we need to be concerned with how network communication makes forward
1080 progress. For example, how and when are acknowledgments sent? How are
1081 timeouts and message re-transmissions handled? The progress model de‐
1082 fines this behavior, and it depends on how much of the network process‐
1083 ing has been offloaded onto the NIC.
1084
1085 More generally, progress is the ability of the underlying network im‐
1086 plementation to complete processing of an asynchronous request. In
1087 many cases, the processing of an asynchronous request requires the use
1088 of the host processor. For performance reasons, it may be undesirable
1089 for the provider to allocate a thread for this purpose, which will com‐
1090 pete with the application thread(s). We can avoid thread context
1091 switches if the application thread can be used to make forward progress
1092 on requests – check for acknowledgments, retry timed out operations,
1093 etc. Doing so requires that the application periodically call into the
1094 network stack.
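
       With libfabric, that periodic call is usually a completion polling
       routine.  When a provider relies on the application for progress,
       reading a completion queue also gives the provider an opportunity to
       send acknowledgments, retry transfers, and so on.  A minimal sketch
       of such a loop, assuming a completion queue (cq) was opened earlier
       and using hypothetical application helpers:

          /* Sketch: polling a completion queue both retrieves events
           * and, for manual progress providers, drives the transport.
           */
          struct fi_cq_entry entry;
          ssize_t ret;

          for (;;) {
              ret = fi_cq_read(cq, &entry, 1);
              if (ret > 0)
                  handle_completion(entry.op_context);  /* app-defined */
              else if (ret != -FI_EAGAIN)
                  break;                    /* error handling omitted */

              do_other_work();              /* app-defined */
          }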
1095
1096 Ordering
1097 Network ordering is a complex subject. With TCP sockets, data is sent
1098 and received in the same order. Buffers are re-usable by the applica‐
1099 tion immediately upon returning from a function call. As a result, or‐
1100 dering is simple to understand and use. UDP sockets complicate things
1101 slightly. With UDP sockets, messages may be received out of order from
1102 how they were sent. In practice, this often doesn’t occur, particular‐
       ly if the application only communicates over a local area network,
1104 such as Ethernet.
1105
1106 With our evolving network API, there are situations where exposing dif‐
1107 ferent order semantics can improve performance. These details will be
1108 discussed further below.
1109
1110 Messages
1111 UDP sockets allow messages to arrive out of order because each message
1112 is routed from the sender to the receiver independently. This allows
1113 packets to take different network paths, to avoid congestion or take
1114 advantage of multiple network links for improved bandwidth. We would
1115 like to take advantage of the same features in those cases where the
1116 application doesn’t care in which order messages arrive.
1117
1118 Unlike UDP sockets, however, our definition of message ordering is more
       subtle.  UDP messages are small, MTU-sized packets.  In our case, mes‐
1120 sages may be gigabytes in size. We define message ordering to indicate
1121 whether the start of each message is processed in order or out of or‐
       der.  This is related to, but separate from, the order in which the mes‐
1123 sage payload is received.
1124
1125 An example will help clarify this distinction. Suppose that an appli‐
1126 cation has posted two messages to its receive queue. The first receive
1127 points to a 4 KB buffer. The second receive points to a 64 KB buffer.
1128 The sender will transmit a 4 KB message followed by a 64 KB message.
1129 If messages are processed in order, then the 4 KB send will match with
       the 4 KB receive, and the 64 KB send will match with the 64 KB re‐
1131 ceive. However, if messages can be processed out of order, then the
1132 sends and receives can mismatch, resulting in the 64 KB send being
1133 truncated.
1134
1135 In this example, we’re not concerned with what order the data is re‐
       ceived in.  The 64 KB send could be broken into 64 1-KB transfers that
1137 take different routes to the destination. So, bytes 2k-3k could be re‐
1138 ceived before bytes 1k-2k. Message ordering is not concerned with or‐
1139 dering within a message, only between messages. With ordered messages,
1140 the messages themselves need to be processed in order.
1141
       The more relaxed the message ordering, the more optimizations the
       network stack can apply when transferring the data.  However, the applica‐
1144 tion must be aware of message ordering semantics, and be able to select
1145 the desired semantic for its needs. For the purposes of this section,
       messages refers to transport-level operations, which include RDMA and
1147 similar operations (some of which have not yet been discussed).
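
       In libfabric, the application states which ordering it needs when it
       requests an endpoint, by setting ordering bits in the transmit and
       receive attributes of the hints passed to fi_getinfo().  As a sketch,
       the following asks that sends be matched in the order they were
       posted (send-after-send ordering); leaving the bit clear permits the
       out of order matching described above.  Error handling is omitted.

          /* Sketch: requesting ordered message processing via hints */
          struct fi_info *hints, *info;

          hints = fi_allocinfo();
          hints->ep_attr->type = FI_EP_RDM;

          /* FI_ORDER_SAS: process the start of each send in the order
           * the sends were submitted.
           */
          hints->tx_attr->msg_order = FI_ORDER_SAS;
          hints->rx_attr->msg_order = FI_ORDER_SAS;

          fi_getinfo(FI_VERSION(1, 18), NULL, "7471", 0, hints, &info);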
1148
1149 Data
1150 Data ordering refers to the receiving and placement of data both within
1151 and between messages. Data ordering is most important to messages that
1152 can update the same target memory buffer. For example, imagine an ap‐
1153 plication that writes a series of database records directly into a peer
1154 memory location. Data ordering, combined with message ordering, en‐
1155 sures that the data from the second write updates memory after the
1156 first write completes. The result is that the memory location will
1157 contain the records carried in the second write.
1158
1159 Enforcing data ordering between messages requires that the messages
1160 themselves be ordered. Data ordering can also apply within a single
1161 message, though this level of ordering is usually less important to ap‐
1162 plications. Intra-message data ordering indicates that the data for a
1163 single message is received in order. Some applications use this fea‐
1164 ture to `spin' reading the last byte of a receive buffer. Once the
1165 byte changes, the application knows that the operation has completed
1166 and all earlier data has been received. (Note that while such behavior
1167 is interesting for benchmark purposes, using such a feature in this way
1168 is strongly discouraged. It is not portable between networks or plat‐
1169 forms.)
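
       Libfabric treats these guarantees as additional ordering bits that
       the application may request through the same hints mechanism shown
       in the previous section.  For the database record example, a sketch
       of the relevant request might look like the following; write-after-
       write ordering (FI_ORDER_WAW) asks that RMA writes targeting the
       same memory be processed in the order they were issued, and the re‐
       maining endpoint setup is assumed.

          /* Sketch: request that a later RMA write update target
           * memory only after earlier writes have been processed.
           */
          hints->tx_attr->msg_order |= FI_ORDER_WAW;
          hints->rx_attr->msg_order |= FI_ORDER_WAW;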
1170
1171 Completions
       Completion ordering refers to the order in which asynchronous operations
1173 report their completion to the application. Typically, unreliable data
       transfers will naturally complete in the order that they are submitted
1175 to a transmit queue. Each operation is transmitted to the network,
1176 with the completion occurring immediately after. For reliable data
1177 transfers, an operation cannot complete until it has been acknowledged
1178 by the peer. Since ack packets can be lost or possibly take different
1179 paths through the network, operations can be marked as completed out of
       order.  Out of order acks are more likely if messages can be processed
1181 out of order.
1182
       Asynchronous interfaces require that the application track its out‐
1184 standing requests. Handling out of order completions can increase ap‐
1185 plication complexity, but it does allow for optimizing network utiliza‐
1186 tion.
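
       A common way to handle this with libfabric, as with other asynchro‐
       nous interfaces, is to attach an application request structure to
       each operation through its context parameter; the completion then
       returns that pointer.  The sketch below uses a hypothetical struct
       app_request and assumes the endpoint, completion queue, destination
       address, and transmit buffer were set up earlier.

          /* Sketch: matching out of order completions to requests via
           * the per-operation context pointer.
           */
          struct app_request {          /* hypothetical app state */
              void   *buffer;
              size_t  length;
              int     done;
          };

          struct app_request *req = calloc(1, sizeof *req);
          req->buffer = tx_buf;
          req->length = tx_len;

          /* The final argument is returned in the completion entry */
          fi_send(ep, req->buffer, req->length, NULL, dest_addr, req);

          /* Later, while polling the completion queue */
          struct fi_cq_entry entry;
          if (fi_cq_read(cq, &entry, 1) > 0) {
              struct app_request *done = entry.op_context;
              done->done = 1;           /* may complete out of order */
          }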
1187
CONCLUSION
       Libfabric is well architected to support the previously discussed fea‐
1190 tures. For further information on the libfabric architecture, see the
1191 next programmer’s guide section: fi_arch(7).
1192
AUTHORS
       OpenFabrics.
1195
1196
1197
1198Libfabric Programmer’s Manual 2023-01-02 fi_intro(7)