fi_intro(7)                    Libfabric v1.18.1                   fi_intro(7)

NAME

6       fi_intro - libfabric introduction
7

OVERVIEW

       This introduction is part of the libfabric programmer’s guide.  See
       fi_guide(7).  This section provides insight into the motivation for
       the libfabric design and the underlying networking features that are
       exposed through the API.
13

Review of Sockets Communication

       The sockets API is a widely used networking API.  This guide assumes
       that the reader has a working knowledge of programming to sockets.
       It makes reference to socket-based communication throughout in an
       effort to explain libfabric concepts and how they relate to or
       differ from the socket API.  To be clear, the intent of this guide
       is not to criticize the socket API, but to use sockets as a starting
       point in order to explain certain network features or limitations.
       The following sections provide a high-level overview of socket
       semantics for reference.
23
24   Connected (TCP) Communication
25       The most widely used type of socket is SOCK_STREAM.  This sort of sock‐
26       et usually runs over TCP/IP, and as a result is often referred to as  a
27       `TCP'  socket.   TCP  sockets are connection-oriented, requiring an ex‐
28       plicit connection setup before data transfers can occur.  A single  TCP
29       socket  can  only transfer data to a single peer socket.  Communicating
30       with multiple peers requires the use of one socket per peer.
31
32       Applications using TCP sockets are typically labeled as either a client
33       or server.  Server applications listen for connection requests, and ac‐
34       cept them when they occur.  Clients, on the other hand,  initiate  con‐
35       nections  to the server.  In socket API terms, a server calls listen(),
36       and the client calls connect().  After a  connection  has  been  estab‐
37       lished,  data  transfers  between a client and server are similar.  The
38       following code segments highlight the general flow for a sample  client
39       and  server.   Error handling and some subtleties of the socket API are
40       omitted for brevity.
41
42              /* Example server code flow to initiate listen */
43              struct addrinfo *ai, hints;
44              int listen_fd;
45
46              memset(&hints, 0, sizeof hints);
47              hints.ai_socktype = SOCK_STREAM;
48              hints.ai_flags = AI_PASSIVE;
49              getaddrinfo(NULL, "7471", &hints, &ai);
50
51              listen_fd = socket(ai->ai_family, SOCK_STREAM, 0);
52              bind(listen_fd, ai->ai_addr, ai->ai_addrlen);
53              freeaddrinfo(ai);
54
55              fcntl(listen_fd, F_SETFL, O_NONBLOCK);
56              listen(listen_fd, 128);
57
       In this example, the server will listen for connection requests on
       port 7471 across all addresses in the system.  The call to
       getaddrinfo() is used to form the local socket address.  The node
       parameter is set to NULL, which results in a wildcard IP address
       being returned.  The port is hard-coded to 7471.  The AI_PASSIVE
       flag signifies that the address will be used by the listening side
       of the connection.  That is, the address information should be
       relative to the local node.
65
66       This example will work with both IPv4 and IPv6.  The getaddrinfo() call
67       abstracts the address format away from the server, improving its porta‐
68       bility.  Using the data returned by getaddrinfo(), the server allocates
69       a socket of type SOCK_STREAM, and binds the socket to port 7471.
70
       In practice, most enterprise-level applications make use of
       non-blocking sockets.  This is needed for a single application
       thread to manage multiple socket connections.  The fcntl() command
       sets the listening socket to non-blocking mode.  This will affect
       how the server processes connection requests (shown below).
       Finally, the server starts listening for connection requests by
       calling listen().  Until listen() is called, connection requests
       that arrive at the server will be rejected by the operating system.
79
80              /* Example client code flow to start connection */
81              struct addrinfo *ai, hints;
82              int client_fd;
83
84              memset(&hints, 0, sizeof hints);
85              hints.ai_socktype = SOCK_STREAM;
86              getaddrinfo("10.31.20.04", "7471", &hints, &ai);
87
88              client_fd = socket(ai->ai_family, SOCK_STREAM, 0);
89              fcntl(client_fd, F_SETFL, O_NONBLOCK);
90
91              connect(client_fd, ai->ai_addr, ai->ai_addrlen);
92              freeaddrinfo(ai);
93
94       Similar to the server, the client makes use  of  getaddrinfo().   Since
95       the  AI_PASSIVE  flag is not specified, the given address is treated as
96       that of the destination.  The client expects to reach the server at  IP
97       address  10.31.20.04, port 7471.  For this example the address is hard-
98       coded into the client.  More typically, the address will  be  given  to
99       the  client via the command line, through a configuration file, or from
100       a service.  Often the port number will be well-known,  and  the  client
101       will  find the server by name, with DNS (domain name service) providing
102       the name to address resolution.  Fortunately, the getaddrinfo call  can
103       be used to convert host names into IP addresses.
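
       As a minimal sketch, the client lookup above could instead resolve a
       host name; the name shown here is hypothetical, and hints and ai are
       the same variables used in the client example.

              /* Possible lookup using a host name (name is hypothetical) */
              memset(&hints, 0, sizeof hints);
              hints.ai_socktype = SOCK_STREAM;
              getaddrinfo("server.example.com", "7471", &hints, &ai);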
104
       Whether the client is given the server’s network address directly or
       a name which must be translated into the network address, the
       mechanism used to provide this information to the client varies
       widely.  A simple mechanism that is commonly used is for users to
       provide the server’s address using a command line option.  The
       problem of telling applications where their peers are located
       increases significantly for applications that communicate with
       hundreds to millions of peer processes, often requiring a separate,
       dedicated application to solve.  For a typical client-server socket
       application, this is not an issue, so we will defer further
       discussion until later.
115
116       Using the getaddrinfo() results, the client opens a socket,  configures
117       it  for  non-blocking  mode,  and initiates the connection request.  At
118       this point, the network stack has sent a request to the server  to  es‐
119       tablish  the connection.  Because the socket has been set to non-block‐
120       ing, the connect call will return immediately and not wait for the con‐
121       nection  to  be  established.   As a result any attempt to send data at
122       this point will likely fail.
123
              /* Example server code flow to accept a connection */
              struct pollfd fds;
              int server_fd;

              fds.fd = listen_fd;
              fds.events = POLLIN;

              poll(&fds, 1, -1);

              server_fd = accept(listen_fd, NULL, NULL);
              fcntl(server_fd, F_SETFL, O_NONBLOCK);
135
       Applications that use non-blocking sockets use select(), poll(), or
       an equivalent such as epoll to receive notification of when a socket
       is ready to send or receive data.  In this case, the server wishes
       to know when the listening socket has a connection request to
       process.  It adds the listening socket to a poll set, then waits
       until a connection request arrives (i.e. POLLIN is true).  The
       poll() call blocks until POLLIN is set on the socket.  POLLIN
       indicates that the socket has data to accept.  Since this is a
       listening socket, the data is a connection request.  The server
       accepts the request by calling accept().  That call returns a new
       socket to the server, which is ready for data transfers.  Finally,
       the server sets the new socket to non-blocking mode.
147
              /* Example client code flow to establish a connection */
              struct pollfd fds;
              int err;
              socklen_t len;

              fds.fd = client_fd;
              fds.events = POLLOUT;

              poll(&fds, 1, -1);

              len = sizeof err;
              getsockopt(client_fd, SOL_SOCKET, SO_ERROR, &err, &len);
160
161       The client is notified that its connection request has  completed  when
162       its  connecting  socket is `ready to send data' (i.e. POLLOUT is true).
163       The poll() call blocks until POLLOUT is set on the  socket,  indicating
164       the  connection  attempt is done.  Note that the connection request may
165       have completed with an error, and the client still needs  to  check  if
166       the connection attempt was successful.  That is not conveyed to the ap‐
167       plication by the poll() call.  The getsockopt() call  is  used  to  re‐
168       trieve the result of the connection attempt.  If err in this example is
169       set to 0, then the connection attempt succeeded.   The  socket  is  now
170       ready to send and receive data.
171
172       After  a connection has been established, the process of sending or re‐
173       ceiving data is the same for both the client and server.  The  examples
174       below  differ only by name of the socket variable used by the client or
175       server application.
176
              /* Example of client sending data to server */
              struct pollfd fds;
              size_t offset, size;
              ssize_t ret;
              char buf[4096];

              fds.fd = client_fd;
              fds.events = POLLOUT;

              size = sizeof(buf);
              for (offset = 0; offset < size; ) {
                  poll(&fds, 1, -1);

                  ret = send(client_fd, buf + offset, size - offset, 0);
                  offset += ret;
              }
192
193       Network communication involves buffering of data at  both  the  sending
194       and  receiving sides of the connection.  TCP uses a credit based scheme
195       to manage flow control to ensure that there is sufficient buffer  space
196       at the receive side of a connection to accept incoming data.  This flow
197       control is hidden from the application by the socket API.  As a result,
198       stream based sockets may not transfer all the data that the application
199       requests to send as part of a single operation.
200
201       In this example, the client maintains an offset into the buffer that it
202       wishes  to  send.   As  data is accepted by the network, the offset in‐
203       creases.  The client then waits until the network is  ready  to  accept
204       more  data  before  attempting  another transfer.  The poll() operation
205       supports this.  When the client socket is ready for data, it sets POLL‐
206       OUT  to  true.   This indicates that send will transfer some additional
207       amount of data.  The client issues a send() request for  the  remaining
208       amount  of buffer that it wishes to transfer.  If send() transfers less
209       data than requested, the client updates the offset, waits for the  net‐
210       work to become ready, then tries again.
211
              /* Example of server receiving data from client */
              struct pollfd fds;
              size_t offset, size;
              ssize_t ret;
              char buf[4096];

              fds.fd = server_fd;
              fds.events = POLLIN;

              size = sizeof(buf);
              for (offset = 0; offset < size; ) {
                  poll(&fds, 1, -1);

                  ret = recv(server_fd, buf + offset, size - offset, 0);
                  offset += ret;
              }
227
228       The  flow  for  receiving data is similar to that used to send it.  Be‐
229       cause of the streaming nature of the socket, there is no guarantee that
230       the  receiver will obtain all of the available data as part of a single
231       call.  The server instead must wait until the socket is  ready  to  re‐
232       ceive  data  (POLLIN),  before  calling  receive to obtain what data is
233       available.  In this example, the server knows to expect exactly 4 KB of
234       data  from  the  client.   More generally, a client and server will ex‐
235       change communication protocol headers at the start of all messages, and
236       the header will include the size of the message.
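
       One hedged sketch of such a framing header is shown below; the
       structure and field names are illustrative only and are not part of
       any standard or of libfabric.

              /* Hypothetical framing header sent before each message */
              struct msg_hdr {
                  uint32_t msg_size;   /* number of payload bytes that follow */
                  uint32_t msg_type;   /* application-defined message type */
              };

              /* The receiver reads the fixed-size header first, then reads
               * exactly msg_size more bytes to recover the message boundary
               * from the byte stream. */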
237
       It is worth noting that the previous two examples are written to be
       simple to understand; they are poorly constructed when considering
       performance.  In both cases, the application always precedes a data
       transfer call (send or recv) with poll().  The result is that even
       if the network is ready to transfer data or has data queued for
       receiving, the application will always experience the latency and
       processing overhead of poll().  A better approach is to call send()
       or recv() prior to entering the for() loops, and only enter the
       loops if needed.
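
       The following sketch shows one way the send loop could be
       restructured along those lines.  It assumes the same client_fd, fds,
       buf, and size variables as the earlier example, with ret declared as
       ssize_t, and it still omits full error handling.

              /* Possible send loop that polls only when the socket is busy */
              for (offset = 0; offset < size; ) {
                  ret = send(client_fd, buf + offset, size - offset, 0);
                  if (ret > 0) {
                      offset += ret;
                      continue;
                  }

                  if (ret < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
                      break;    /* a real application would handle the error */

                  /* no buffer space available; wait until the socket is ready */
                  poll(&fds, 1, -1);
              }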
246
247   Connection-less (UDP) Communication
       As mentioned, TCP sockets are connection-oriented.  They may be used
       to communicate between exactly 2 processes.  For parallel
       applications that need to communicate with thousands of peer
       processes, the overhead of managing this many simultaneous sockets
       can be significant, to the point where the application performance
       may decrease as more processes are added.
254
       To support communicating with a large number of peers, or for
       applications that do not need the overhead of reliable
       communication, the socket API offers another commonly used socket
       type, SOCK_DGRAM.  Datagrams are unreliable, connectionless
       messages.  The most common type of SOCK_DGRAM socket runs over
       UDP/IP.  As a result, datagram sockets are often referred to as UDP
       sockets.
261
262       UDP sockets use the same socket API as that  described  above  for  TCP
263       sockets; however, the communication behavior differs.  First, an appli‐
264       cation using UDP sockets does not need to connect to a  peer  prior  to
265       sending  it a message.  The destination address is specified as part of
266       the send operation.  A second major difference is that the  message  is
267       not  guaranteed to arrive at the peer.  Network congestion in switches,
268       routers, or the remote NIC can discard the message, and no attempt will
269       be  made  to  resend the message.  The sender will not be notified that
270       the message either arrived or was dropped.  Another difference  between
271       TCP  and  UDP  sockets  is the maximum size of the transfer that is al‐
272       lowed.  UDP sockets limit messages to at most 64k, though in  practice,
273       applications  use  a  much smaller size, usually aligned to the network
274       MTU size (for example, 1500 bytes).
275
       Most uses of UDP sockets replace the socket send() / recv() calls
       with sendto() and recvfrom().
278
279              /* Example send to peer at given IP address and UDP port */
280              struct addrinfo *ai, hints;
281
282              memset(&hints, 0, sizeof hints);
283              hints.ai_socktype = SOCK_DGRAM;
284              getaddrinfo("10.31.20.04", "7471", &hints, &ai);
285
286              ret = sendto(client_fd, buf, size, 0, ai->ai_addr, ai->ai_addrlen);
287
       In the above example, we use getaddrinfo() to convert the given IP
       address and UDP port number into a sockaddr.  That is passed into
       the sendto() call in order to specify the destination of the
       message.  Note the similarities between this flow and the TCP socket
       flow.  The recvfrom() call allows us to receive the address of the
       sender of a message.  Note that unlike streaming sockets, the entire
       message is accepted by the network on success.  All contents of the
       buf parameter, specified by the size parameter, have been queued by
       the network layer.
296
297       Although not shown, the application could call poll() or an  equivalent
298       prior  to calling sendto() to ensure that the socket is ready to accept
299       new data.  Similarly, poll() may be used prior to calling recvfrom() to
300       check if there is data ready to be read from the socket.
301
              /* Example receive a message from a peer */
              struct sockaddr_in addr;
              socklen_t addrlen;

              addrlen = sizeof(addr);
              ret = recvfrom(client_fd, buf, size, 0,
                             (struct sockaddr *) &addr, &addrlen);
308
       This example will receive any incoming message from any peer.  The
       address of the peer will be provided in the addr parameter.  In this
       case, we only provide enough space to record an IPv4 address
       (limited by our use of struct sockaddr_in).  Supporting an IPv6
       address would simply require passing in a larger address buffer
       (mapped to struct sockaddr_in6 for example).
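
       A minimal sketch of that change, using struct sockaddr_storage so
       that either address family fits, might look like the following;
       client_fd, buf, size, and ret are assumed to be set up as in the
       earlier examples.

              /* Possible receive flow that can hold an IPv4 or IPv6 peer address */
              struct sockaddr_storage ss;
              socklen_t sslen;

              sslen = sizeof(ss);
              ret = recvfrom(client_fd, buf, size, 0,
                             (struct sockaddr *) &ss, &sslen);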
315
316   Advantages
317       The socket API has two significant advantages.  First, it is  available
318       on  a  wide  variety of operating systems and platforms, and works over
319       the vast majority of available networking hardware.  It can  even  work
320       for communication between processes on the same system without any net‐
321       work hardware.  It is easily the de-facto networking API.  This by  it‐
322       self makes it appealing to use.
323
324       The  second  key advantage is that it is relatively easy to program to.
325       The importance of this should not be overlooked.  Networking APIs  that
326       offer  access  to higher performing features, but are difficult to pro‐
327       gram to correctly or well, often result in  lower  application  perfor‐
328       mance.  This is not unlike coding an application in a higher-level lan‐
329       guage such as C or C++, versus assembly.  Although writing directly  to
330       assembly  language  offers  the promise of being better performing, for
331       the vast majority of developers, their applications will perform better
332       if  written in C or C++, and using an optimized compiler.  Applications
333       should have a clear need for high-performance networking before select‐
334       ing an alternative API to sockets.
335
336   Disadvantages
       When considering the problems with the socket API as it pertains to
       high-performance networking, we limit our discussion to the two most
       common socket types: streaming (TCP) and datagram (UDP).
340
       Most applications require that network data be sent reliably.  This
       invariably means using a connection-oriented TCP socket.  TCP
       sockets transfer data as a stream of bytes.  However, many
       applications operate on messages.  The result is that applications
       often insert headers that are simply used to convert application
       messages to and from a byte stream.  These headers consume
       additional network bandwidth and processing.  The streaming nature
       of the interface also results in the application using loops, as
       shown in the examples above, to send and receive larger messages.
       The complexity of those loops can be significant if the application
       is managing sockets to hundreds or thousands of peers.
351
352       Another issue highlighted by the above examples deals  with  the  asyn‐
353       chronous  nature  of network traffic.  When using a reliable transport,
354       it is not enough to place an application’s data onto the  network.   If
355       the network is busy, it could drop the packet, or the data could become
356       corrupted during a transfer.  The data must be kept until it  has  been
357       acknowledged  by  the  peer,  so  that it can be resent if needed.  The
358       socket API is defined such that the application owns  the  contents  of
359       its memory buffers after a socket call returns.
360
361       As  an  example,  if we examine the socket send() call, once send() re‐
362       turns the application is free to modify its buffer.  The network imple‐
363       mentation  has a couple of options.  One option is for the send call to
364       place the data directly onto the network.  The call must then block be‐
365       fore returning to the user until the peer acknowledges that it received
366       the data, at which point send() can return.  The obvious  problem  with
367       this approach is that the application is blocked in the send() call un‐
368       til the network stack at the peer can process the data and generate  an
369       acknowledgment.  This can be a significant amount of time where the ap‐
370       plication is blocked and unable to process other work, such as respond‐
371       ing to messages from other clients.  Such an approach is not feasible.
372
373       A  better  option is for the send() call to copy the application’s data
374       into an internal buffer.  The data transfer is then issued out of  that
375       buffer,  which allows retrying the operation in case of a failure.  The
376       send() call in this case is not  blocked,  but  all  data  that  passes
377       through  the  network  will  result in a memory copy to a local buffer,
378       even in the absence of any errors.
379
380       Allowing immediate re-use of a data buffer is a feature of  the  socket
381       API  that keeps it simple and easy to program to.  However, such a fea‐
382       ture can potentially have a negative  impact  on  network  performance.
383       For  network  or memory limited applications, or even applications con‐
384       cerned about power consumption, an alternative API may be attractive.
385
       A slightly more hidden problem occurs in the socket APIs designed
       for UDP sockets.  This problem is an inefficiency in the
       implementation that results from the API being designed for ease of
       use.  In order for the application to send data to a peer, it needs
       to provide the IP address and UDP port number of the peer.  That
       involves passing a sockaddr structure to the sendto() and recvfrom()
       calls.  However, IP addresses are higher-level, network layer
       addresses.  In order to transfer data between systems, low-level
       link layer addresses are needed, for example Ethernet addresses.
       The network layer must map IP addresses to Ethernet addresses on
       every send operation.  When scaled to thousands of peers, that
       overhead on every send call can be significant.
397
398       Finally, because the socket API is often considered in conjunction with
399       TCP and UDP protocols, it is intentionally detached from the underlying
400       network hardware implementation, including NICs, switches, and routers.
401       Access  to  available network features is therefore constrained by what
402       the API can support.
403
       It is worth noting here that some operating systems support enhanced
       APIs that may be used to interact with TCP and UDP sockets.  For
       example, Linux supports an interface known as io_uring, and Windows
       has an asynchronous socket API.  Those APIs can help alleviate some
       of the problems described above.  However, an application will still
       be restricted by the features that the TCP and UDP protocols
       provide.
410

High-Performance Networking

412       By analyzing the socket API in the context of high-performance network‐
413       ing, we can start to see some features that are desirable for a network
414       API.
415
416   Avoiding Memory Copies
       The socket API implementation usually results in data copies
       occurring at both the sender and the receiver.  This is a trade-off
       between keeping the interface easy to use and providing reliability.
       Ideally, all memory copies would be avoided when transferring data
       over the network.  There are techniques and APIs that can be used to
       avoid memory copies, but in practice, the cost of avoiding a copy
       can often be more than the copy itself, in particular for small
       transfers (measured in bytes, versus kilobytes or more).
425
426       To avoid a memory copy at the sender, we need to place the  application
427       data  directly onto the network.  If we also want to avoid blocking the
428       sending application, we need some way for the network layer to communi‐
429       cate  with  the  application  when  the buffer is safe to re-use.  This
430       would allow the original buffer to be re-used in case the data needs to
431       be  re-transmitted.  This leads us to crafting a network interface that
432       behaves asynchronously.  The application will need to issue a  request,
433       then receive some sort of notification when the request has completed.
434
435       Avoiding  a memory copy at the receiver is more challenging.  When data
436       arrives from the network, it needs to land  into  an  available  memory
437       buffer,  or it will be dropped, resulting in the sender re-transmitting
438       the data.  If we use socket recv() semantics, the only way to  avoid  a
439       copy  at the receiver is for the recv() to be called before the send().
440       Recv() would then need to block until the data has arrived.   Not  only
441       does  this  block  the receiver, it is impractical to use outside of an
442       application with a simple request-reply protocol.
443
444       Instead, what is needed is a way for the receiving application to  pro‐
445       vide one or more buffers to the network for received data to land.  The
446       network then needs to notify the application when  data  is  available.
447       This  sort  of mechanism works well if the receiver does not care where
448       in its memory space the data is located; it only needs to  be  able  to
449       process the incoming message.
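
       A hypothetical interface along these lines is sketched below purely
       for illustration; none of these names are part of the socket API or
       of libfabric.

              /* Hypothetical buffer-posting receive interface (illustrative only) */
              int post_recv(struct endpoint *ep, void *buf, size_t len,
                            void *context);
              int wait_recv_done(struct endpoint *ep, void **context,
                                 size_t *bytes_received);

              /* The application posts buffers ahead of time, then is told
               * which posted buffer was filled and how much data it holds. */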
450
451       As  an  alternative,  it is possible to reverse this flow, and have the
452       network layer hand its buffer  to  the  application.   The  application
453       would then be responsible for returning the buffer to the network layer
454       when it is done with its processing.  While  this  approach  can  avoid
455       memory  copies,  it  suffers  from a few drawbacks.  First, the network
456       layer does not know what size of messages to expect, which can lead  to
457       inefficient memory use.  Second, many would consider this a more diffi‐
458       cult programming model to use.  And finally, the network buffers  would
459       need  to  be  mapped  into the application process’ memory space, which
460       negatively impacts performance.
461
462       In addition to processing messages, some applications want  to  receive
463       data  and  store  it  in a specific location in memory.  For example, a
464       database may want to merge received data records into an  existing  ta‐
465       ble.   In  such  cases, even if data arriving from the network goes di‐
466       rectly into an application’s receive buffers, it may still need  to  be
467       copied  into its final location.  It would be ideal if the network sup‐
468       ported placing data that arrives from the network into a specific memo‐
469       ry  buffer, with the buffer determined based on the contents of the da‐
470       ta.
471
472   Network Buffers
473       Based on the problems described above, we can start to see that  avoid‐
474       ing memory copies depends upon the ownership of the memory buffers used
475       for network traffic.  With socket based transports, the network buffers
476       are owned and managed by the networking stack.  This is usually handled
477       by the operating system kernel.  However,  this  results  in  the  data
478       `bouncing' between the application buffers and the network buffers.  By
479       putting the application in control of managing the network buffers,  we
480       can avoid this overhead.  The cost for doing so is additional complexi‐
481       ty in the application.
482
483       Note that even though we want the application to own the  network  buf‐
484       fers,  we would still like to avoid the situation where the application
485       implements a complex network protocol.  The trade-off is that  the  app
486       provides  the  data buffers to the network stack, but the network stack
487       continues to handle things like flow control, reliability, and  segmen‐
488       tation and reassembly.
489
490   Resource Management
491       We  define  resource management to mean properly allocating network re‐
492       sources in order to avoid overrunning data  buffers  or  queues.   Flow
493       control is a common aspect of resource management.  Without proper flow
494       control, a sender can overrun a slow or busy receiver.  This can result
495       in dropped packets, re-transmissions, and increased network congestion.
496       Significant research and development has gone  into  implementing  flow
497       control  algorithms.   Because  of  its complexity, it is not something
498       that an application developer should need to  deal  with.   That  said,
499       there  are some applications where flow control simply falls out of the
500       network protocol.  For example, a request-reply protocol naturally  has
501       flow control built in.
502
       For our purposes, we expand the definition of resource management
       beyond flow control.  Flow control typically only deals with
       available network buffering at a peer.  We also want to be concerned
       about having available space in outbound data transfer queues.  That
       is, as we issue commands to the local NIC to send data, we need
       those commands to be queued at the NIC.  When we consider
       reliability, this means tracking outstanding requests until they
       have been acknowledged.  Resource management will need to ensure
       that we do not overflow that request queue.
511
512       Additionally, supporting asynchronous operations (described  in  detail
513       below) will introduce potential new queues.  Those queues also must not
514       overflow.
515
516   Asynchronous Operations
       Arguably, the key feature for achieving high performance is
       supporting asynchronous operations, or the ability to overlap
       different communications, and to overlap communication with
       computation.  The socket API supports asynchronous transfers with
       its non-blocking mode.  However, because the API itself operates
       synchronously, the result is additional data copies.  For an API to
       be asynchronous, an application needs to be able to submit work,
       then later receive some sort of notification that the work is done.
       In order to avoid extra memory copies, the application must agree
       not to modify its data buffers until the operation completes.
527
       There are two main ways to notify an application that it is safe to
       re-use its data buffers.  One mechanism is for the network layer to
       invoke some sort of callback or send a signal to the application
       that the request is done.  Some asynchronous APIs use this
       mechanism.  The drawback of this approach is that signals interrupt
       an application’s processing.  This can negatively impact the CPU
       caches, plus requires interrupt processing.  Additionally, it is
       often difficult to develop an application that can handle processing
       a signal that may occur at any time.
537
538       An  alternative  mechanism for supporting asynchronous operations is to
539       write events into some sort of completion queue when an operation  com‐
540       pletes.   This provides a way to indicate to an application when a data
541       transfer has completed, plus gives the application  control  over  when
542       and how to process completed requests.  For example, it can process re‐
543       quests in batches to improve code locality and performance.
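
       A sketch of such batch processing appears below; cq_read(),
       handle_completion(), struct completion, and the cq object are
       hypothetical names used only to illustrate the idea.

              /* Hypothetical completion queue processing loop (illustrative only) */
              struct completion comps[64];
              int n, i;

              do {
                  n = cq_read(cq, comps, 64);  /* reap up to 64 completions */
                  for (i = 0; i < n; i++)
                      handle_completion(&comps[i]);
              } while (n > 0);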
544
545   Interrupts and Signals
546       Interrupts are a natural extension to  supporting  asynchronous  opera‐
547       tions.   However, when dealing with an asynchronous API, they can nega‐
548       tively impact performance.  Interrupts, even when directed to a  kernel
549       agent, can interfere with application processing.
550
551       If  an  application has an asynchronous interface with completed opera‐
552       tions written into a completion queue, it is often sufficient  for  the
553       application  to  simply check the queue for events.  As long as the ap‐
554       plication has other work to perform, there is no need for it to  block.
555       This  alleviates the need for interrupt generation.  A NIC merely needs
556       to write an entry into the completion queue and update a  tail  pointer
557       to signal that a request is done.
558
559       If  we  follow this argument, then it can be beneficial to give the ap‐
560       plication control over when interrupts should occur and when  to  write
561       events  to  some sort of wait object.  By having the application notify
562       the network layer that it will wait until a completion occurs,  we  can
563       better manage the number and type of interrupts that are generated.
564
565   Event Queues
566       As  outlined  above,  there are performance advantages to having an API
567       that reports completions or provides other types of notification  using
568       an  event  queue.  A very simple type of event queue merely tracks com‐
569       pleted operations.  As data is received or a send completes,  an  entry
570       is written into the event queue.
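
       One possible shape for such an entry is sketched here; the structure
       is hypothetical and meant only to show the kind of information a
       completion might carry.

              /* Hypothetical completion event entry (illustrative only) */
              struct event_entry {
                  void     *context;  /* identifies the application request */
                  uint64_t  flags;    /* type of operation that completed */
                  size_t    len;      /* number of bytes transferred */
              };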
571
572   Direct Hardware Access
573       When  discussing the network layer, most software implementations refer
574       to kernel modules responsible for implementing the necessary  transport
575       and network protocols.  However, if we want network latency to approach
576       sub-microsecond speeds, then we need to remove as much software between
577       the application and its access to the hardware as possible.  One way to
578       do this is for the application to have direct access to the network in‐
579       terface  controller’s  command queues.  Similarly, the NIC requires di‐
580       rect access to the application’s data buffers and  control  structures,
581       such as the above mentioned completion queues.
582
583       Note  that  when  we speak about an application having direct access to
584       network hardware, we’re referring to the application process.  Natural‐
585       ly,  an application developer is highly unlikely to code for a specific
586       hardware NIC.  That work would be left to some sort of network  library
587       specifically targeting the NIC.  The actual network layer, which imple‐
588       ments the network transport, could be part of the  network  library  or
589       offloaded onto the NIC’s hardware or firmware.
590
591   Kernel Bypass
592       Kernel bypass is a feature that allows the application to avoid calling
593       into the kernel for data transfer operations.  This is possible when it
594       has  direct  access to the NIC hardware.  Complete kernel bypass is im‐
595       practical because of security concerns  and  resource  management  con‐
596       straints.   However,  it is possible to avoid kernel calls for what are
597       called `fast-path' operations, such as send or receive.
598
599       For security and stability reasons, operating system kernels cannot re‐
600       ly  on data that comes from user space applications.  As a result, even
601       a simple kernel call often requires acquiring and releasing locks, cou‐
602       pled  with  data verification checks.  If we can limit the effects of a
603       poorly written or malicious application to its own  process  space,  we
604       can  avoid  the  overhead that comes with kernel validation without im‐
605       pacting system stability.
606
607   Direct Data Placement
608       Direct data placement means avoiding data copies when sending  and  re‐
609       ceiving data, plus placing received data into the correct memory buffer
610       where needed.  On a broader scale, it is part of having direct hardware
611       access, with the application and NIC communicating directly with shared
612       memory buffers and queues.
613
       Direct data placement is often associated with RDMA, or remote
       direct memory access.  RDMA is a technique that allows reading and
       writing memory that belongs to a peer process running on a node
       across the network.  Advanced RDMA hardware is capable of accessing
       the target memory buffers without involving the execution of the
       peer process.  RDMA relies on offloading the network transport onto
       the NIC in order to avoid interrupting the target process.
621
       The main advantages of supporting direct data placement are avoiding
       memory copies and minimizing processing overhead.
624

Designing Interfaces for Performance

626       We  want  to  design a network interface that can meet the requirements
627       outlined above.  Moreover, we also want to take into account  the  per‐
628       formance  of  the interface itself.  It is often not obvious how an in‐
629       terface can adversely affect performance, versus  performance  being  a
630       result  of  the  underlying implementation.  The following sections de‐
631       scribe how interface choices can impact performance.  Of  course,  when
632       we begin defining the actual APIs that an application will use, we will
633       need to trade off raw performance for ease of use where it makes sense.
634
       When considering performance goals for an API, we need to take into
       account the target application use cases.  For the purposes of this
       discussion, we want to consider applications that communicate with
       thousands to millions of peer processes.  Data transfers will
       include millions of small messages per second per peer, and large
       transfers that may be up to gigabytes of data.  At such extreme
       scales, even small optimizations are measurable, in terms of both
       performance and power.  If we have a million peers sending a million
       messages per second, eliminating even a single instruction from the
       code path quickly multiplies to saving billions of instructions per
       second across the overall execution, when viewing the operation of
       the entire application.
646
647       We once again refer to the socket API as part of this discussion in or‐
648       der to illustrate how an API can affect performance.
649
650              /* Notable socket function prototypes */
651              /* "control" functions */
652              int socket(int domain, int type, int protocol);
653              int bind(int socket, const struct sockaddr *addr, socklen_t addrlen);
654              int listen(int socket, int backlog);
655              int accept(int socket, struct sockaddr *addr, socklen_t *addrlen);
656              int connect(int socket, const struct sockaddr *addr, socklen_t addrlen);
657              int shutdown(int socket, int how);
658              int close(int socket);
659
660              /* "fast path" data operations - send only (receive calls not shown) */
661              ssize_t send(int socket, const void *buf, size_t len, int flags);
662              ssize_t sendto(int socket, const void *buf, size_t len, int flags,
663                  const struct sockaddr *dest_addr, socklen_t addrlen);
664              ssize_t sendmsg(int socket, const struct msghdr *msg, int flags);
665              ssize_t write(int socket, const void *buf, size_t count);
666              ssize_t writev(int socket, const struct iovec *iov, int iovcnt);
667
668              /* "indirect" data operations */
669              int poll(struct pollfd *fds, nfds_t nfds, int timeout);
670              int select(int nfds, fd_set *readfds, fd_set *writefds,
671                  fd_set *exceptfds, struct timeval *timeout);
672
673       Examining  this  list,  there are a couple of features to note.  First,
674       there are multiple calls that can be used to send data, as well as mul‐
675       tiple  calls  that can be used to wait for a non-blocking socket to be‐
676       come ready.  This will be discussed in more detail further on.  Second,
677       the  operations  have  been split into different groups (terminology is
678       ours).  Control operations are those functions that an application sel‐
679       dom  invokes  during  execution.  They often only occur as part of ini‐
680       tialization.
681
       Data operations, on the other hand, may be called hundreds to
       millions of times during an application’s lifetime.  They deal
       directly or indirectly with transferring or receiving data over the
       network.  Data operations can be split into two groups.  Fast path
       calls interact with the network stack to immediately send or receive
       data.  In order to achieve high bandwidth and low latency, those
       operations need to be as fast as possible.  Non-fast path operations
       are those calls that still deal with data transfers and, while
       frequently called by the application, are not as performance
       critical.  For example, the select() and poll() calls are used to
       block an application thread until a socket becomes ready.  Because
       those calls suspend the thread execution, performance is a lesser
       concern.  (Performance of those operations is still a concern, but
       the cost of executing the operating system scheduler often swamps
       any but the most substantial performance gains.)
697
698   Call Setup Costs
       The amount of work that an application needs to perform before
       issuing a data transfer operation can affect performance, especially
       message rates.  Obviously, the more parameters an application must
       push onto the stack to call a function, the higher its instruction
       count.  However, replacing stack variables with a single data
       structure does not help to reduce the setup costs.
705
706       Suppose  that  an  application wishes to send a single data buffer of a
707       given size to a peer.  If we examine the socket API, the best  fit  for
708       such an operation is the write() call.  That call takes only those val‐
709       ues which are necessary to perform the data transfer.  The send()  call
710       is  a close second, and send() is a more natural function name for net‐
711       work  communication,  but  send()  requires  one  extra  argument  over
712       write().   Other functions are even worse in terms of setup costs.  The
713       sendmsg() function, for example, requires that the application format a
714       data structure, the address of which is passed into the call.  This re‐
715       quires significantly more instructions from the application if done for
716       every data transfer.
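
       The difference in setup cost can be seen in the following sketch,
       which sends the same single buffer both ways; client_fd, buf, and
       size are assumed to be set up as in the earlier examples.

              /* Setup required to send one buffer with sendmsg() versus write() */
              struct msghdr msg;
              struct iovec iov;

              iov.iov_base = buf;
              iov.iov_len = size;
              memset(&msg, 0, sizeof msg);
              msg.msg_iov = &iov;
              msg.msg_iovlen = 1;
              sendmsg(client_fd, &msg, 0);

              /* versus */
              write(client_fd, buf, size);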
717
       Even though all other send functions can be replaced by sendmsg(),
       it is useful to have multiple ways for the application to issue send
       requests.  Not only are the other calls easier to read and use
       (which lowers software maintenance costs), but they can also improve
       performance.
723
724   Branches and Loops
725       When  designing  an API, developers rarely consider how the API impacts
726       the underlying implementation.  However, the selection of  API  parame‐
727       ters can require that the underlying implementation add branches or use
728       control  loops.   Consider  the  difference  between  the  write()  and
729       writev()  calls.   The  latter passes in an array of I/O vectors, which
730       may be processed using a loop such as this:
731
732              /* Sample implementation for processing an array */
733              for (i = 0; i < iovcnt; i++) {
734                  ...
735              }
736
       In order to process the iovec array, the natural software construct
       would be to use a loop to iterate over the entries.  Loops result in
       additional processing.  Typically, a loop requires initializing a
       loop control variable (e.g. i = 0), and adds ALU operations
       (e.g. i++) and a comparison (e.g. i < iovcnt).  This overhead is
       necessary to handle an arbitrary number of iovec entries.  If the
       common case is that the application wants to send a single data
       buffer, write() is a better option.
745
       In addition to control loops, an API can result in the
       implementation needing branches.  Branches can change the execution
       flow of a program, impacting processor pipelining techniques.
       Processor branch prediction helps alleviate this issue.  However,
       while branch prediction can be correct nearly 100% of the time when
       running a micro-benchmark, such as a network bandwidth or latency
       test, with more realistic network traffic the impact can become
       measurable.
753
754       We  can easily see how an API can introduce branches into the code flow
755       if we examine the send() call.  Send() takes an extra  flags  parameter
756       over  the  write() call.  This allows the application to modify the be‐
757       havior of send().  From the viewpoint of implementing send(), the flags
758       parameter  must be checked.  In the best case, this adds one additional
759       check (flags are non-zero).  In the worst case, every  valid  flag  may
760       need a separate check, resulting in potentially dozens of checks.
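
       The sketch below illustrates, in simplified form, the kind of checks
       that a send() implementation might perform on the flags parameter;
       the local variables are hypothetical.

              /* Simplified sketch of flag checks inside a send() implementation */
              int nonblocking = 0, delay_transmit = 0;

              if (flags) {
                  if (flags & MSG_DONTWAIT)
                      nonblocking = 1;
                  if (flags & MSG_MORE)
                      delay_transmit = 1;
                  if (flags & ~(MSG_DONTWAIT | MSG_MORE))
                      return -EOPNOTSUPP;   /* reject unsupported flags */
              }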
761
762       Overall, the sockets API is well designed considering these performance
763       implications.  It provides complex calls where they  are  needed,  with
764       simpler  functions available that can avoid some of the overhead inher‐
765       ent in other calls.
766
767   Command Formatting
768       The ultimate objective of invoking a network function is to transfer or
769       receive  data from the network.  In this section, we’re dropping to the
770       very bottom of the software stack to the component responsible for  di‐
771       rectly accessing the hardware.  This is usually referred to as the net‐
772       work driver, and its implementation is often tied to a  specific  piece
773       of hardware, or a series of NICs by a single hardware vendor.
774
775       In  order  to signal a NIC that it should read a memory buffer and copy
776       that data onto the network, the software driver usually needs to  write
777       some  sort  of  command  to  the NIC.  To limit hardware complexity and
778       cost, a NIC may only support a couple of command formats.  This differs
779       from  the  software interfaces that we’ve been discussing, where we can
780       have different APIs of varying complexity in order to reduce  overhead.
781       There  can  be significant costs associated with formatting the command
782       and posting it to the hardware.
783
784       With a standard NIC, the command is formatted by a kernel driver.  That
785       driver  sits at the bottom of the network stack servicing requests from
786       multiple applications.  It must typically format each command only  af‐
787       ter a request has passed through the network stack.
788
789       With  devices  that  are  directly  accessible by a single application,
790       there are opportunities to use pre-formatted command  structures.   The
791       more  of  the  command that can be initialized prior to the application
792       submitting a network request, the more streamlined the process, and the
793       better the performance.
794
795       As an example, a NIC needs to have the destination address as part of a
796       send operation.  If an application is sending to a  single  peer,  that
797       information  can be cached and be part of a pre-formatted network head‐
798       er.  This is only possible if the NIC driver knows that the destination
799       will  not  change  between sends.  The closer that the driver can be to
800       the application, the greater the chance for optimization.   An  optimal
801       approach  is  for  the driver to be part of a library that executes en‐
802       tirely within the application process space.
803
804   Memory Footprint
805       Memory footprint concerns are most notable among high-performance  com‐
806       puting  (HPC)  applications  that  communicate with thousands of peers.
807       Excessive memory consumption impacts application scalability,  limiting
808       the  number  of  peers  that can operate in parallel to solve problems.
809       There is often a trade-off  between  minimizing  the  memory  footprint
810       needed  for network communication, application performance, and ease of
811       use of the network interface.
812
       As we discussed with the socket API semantics, part of the ease of
       using sockets comes from the network layer copying the user’s buffer
       into an internal buffer belonging to the network stack.  The amount
       of internal buffering that’s made available to the application
       directly correlates with the bandwidth that an application can
       achieve.  In general, larger internal buffering increases network
       performance, at the cost of increasing the memory footprint consumed
       by the application.  This memory footprint exists independent of the
       amount of memory allocated directly by the application.  Eliminating
       network buffering not only helps with performance, but also
       scalability, by reducing the memory footprint needed to support the
       application.
824
825       While network memory buffering increases as an application  scales,  it
826       can often be configured to a fixed size.  The amount of buffering need‐
827       ed is dependent on the number of  active  communication  streams  being
828       used  at  any  one time.  That number is often significantly lower than
829       the total number of peers that an application may need  to  communicate
830       with.   The  amount  of  memory required to address the peers, however,
831       usually has a linear relationship with the total number of peers.
832
833       With the socket API, each peer is identified using a  struct  sockaddr.
834       If  we  consider a UDP based socket application using IPv4 addresses, a
835       peer is identified by the following address.
836
              /* IPv4 socket address - with typedefs removed */
              struct sockaddr_in {
                  uint16_t sin_family; /* AF_INET */
                  uint16_t sin_port;
                  struct {
                      uint32_t s_addr;
                  } sin_addr;
              };
845
       In total, the application requires 8 bytes of addressing for each
       peer.  If the app communicates with a million peers, that explodes
       to roughly 8 MB of memory space that is consumed just to maintain
       the address list.  If IPv6 addressing is needed, then the
       requirement increases by a factor of 4.
851
       Luckily, there are some tricks that can be used to help reduce the
       addressing memory footprint, though doing so will introduce more
       instructions into the code path to access the network stack.  For
       instance, we can notice that all addresses in the above example have
       the same sin_family value (AF_INET).  There’s no need to store that
       for each address.  This potentially shrinks each address from 8
       bytes to 6.  (We may be left with unaligned data, but that’s a
       trade-off for reducing the memory consumption.)  Depending on how
       the addresses are assigned, further reduction may be possible.  For
       example, if the application uses the same set of port numbers at
       each node, then we can eliminate storing the port, and instead
       calculate it from some base value.  This type of trick can be
       applied to the IP portion of the address if the app is lucky enough
       to run across sequential IP addresses.
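
       As a hypothetical example of such a reduction, an application that
       knows every peer uses AF_INET and a port computed from a common base
       could store only the 4-byte IP address; BASE_PORT, PROCS_PER_NODE,
       and the peer_port() helper below are made up for illustration.

              /* Hypothetical compressed peer address: 4 bytes per peer */
              #define BASE_PORT      7471
              #define PROCS_PER_NODE 4

              struct peer_addr {
                  uint32_t ipv4_addr;   /* network byte order */
              };

              static uint16_t peer_port(uint32_t peer_rank)
              {
                  /* ports are assigned from a common base on every node */
                  return BASE_PORT + (peer_rank % PROCS_PER_NODE);
              }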
865
866       The main issue with this sort of address reduction is that it is diffi‐
867       cult  to achieve.  It requires that each application check for and han‐
868       dle address compression, exposing the  application  to  the  addressing
869       format  used  by  the networking stack.  It should be kept in mind that
870       TCP/IP and UDP/IP addresses are logical addresses, not physical.   When
871       running  over Ethernet, the addresses that appear at the link layer are
872       MAC addresses, not IP addresses.  The IP to MAC address association  is
873       managed  by  the network software.  We would like to provide addressing
874       that is simple for an application to use, but at the same time can pro‐
875       vide a minimal memory footprint.
876
877   Communication Resources
878       We  need  to  take  a  brief detour in the discussion in order to delve
879       deeper into the network problem and solution space.  Instead of contin‐
880       uing  to  think  of a socket as a single entity, with both send and re‐
881       ceive capabilities, we want to consider its components  separately.   A
882       network  socket  can  be  viewed as three basic constructs: a transport
883       level address, a send or transmit queue, and a receive queue.   Because
884       our  discussion will begin to pivot away from pure socket semantics, we
885       will refer to our network `socket' as an endpoint.
886
887       In order to reduce an application’s memory footprint, we need  to  con‐
888       sider  features  that  fall outside of the socket API.  So far, much of
889       the discussion has been around sending data to a peer.  We now want  to
890       focus on the best mechanisms for receiving data.
891
892       With  sockets, when an app has data to receive (indicated, for example,
893       by a POLLIN event), we call recv().  The network stack copies  the  re‐
894       ceive  data  into its buffer and returns.  If we want to avoid the data
895       copy on the receive side, we need a way for the application to post its
896       buffers to the network stack before data arrives.
897
       Arguably, a natural way of extending the socket API to support this
       feature is to have each call to recv() simply post the buffer to the
       network layer.  As data is received, the receive buffers are removed
       in the order that they were posted.  Data is copied into the posted
       buffer and returned to the user.  It should be noted that the size
       of the posted receive buffer may be larger (or smaller) than the
       amount of data received.  If the available buffer space is larger,
       hypothetically, the network layer could wait a short amount of time
       to see if more data arrives.  If nothing more arrives, the receive
       completes with the buffer returned to the application.
908
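       The sketch below models the idea: the application posts buffers ahead
       of time, and an arriving message lands in the oldest posted buffer.
       This is a conceptual model only, not a real network API; the memcpy()
       stands in for the stack or hardware placing wire data.

              #include <stddef.h>
              #include <string.h>

              #define MAX_POSTED 16

              struct posted_buf {
                  void   *buf;
                  size_t  len;
              };

              /* FIFO of application-posted receive buffers. */
              struct recv_queue {
                  struct posted_buf posted[MAX_POSTED];
                  unsigned head, tail;
              };

              /* Application side: hand a buffer to the stack before any
               * data arrives. */
              static void post_recv(struct recv_queue *q, void *buf, size_t len)
              {
                  q->posted[q->tail++ % MAX_POSTED] =
                      (struct posted_buf){ .buf = buf, .len = len };
              }

              /* Stack side: an arriving message is placed in the oldest
               * posted buffer, which is then returned to the application.
               * Assumes at least one buffer is posted; data is truncated
               * if the posting is smaller than the message. */
              static size_t deliver(struct recv_queue *q, const void *data,
                                    size_t len)
              {
                  struct posted_buf *p = &q->posted[q->head++ % MAX_POSTED];
                  size_t n = len < p->len ? len : p->len;

                  memcpy(p->buf, data, n);
                  return n;
              }
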
909       This raises an issue regarding how to handle buffering on  the  receive
910       side.   So far, with sockets we’ve mostly considered a streaming proto‐
911       col.  However, many applications deal with messages which end up  being
912       layered  over the data stream.  If they send an 8 KB message, they want
913       the receiver to receive an 8 KB message.  Message boundaries need to be
914       maintained.
915
916       If  an application sends and receives a fixed sized message, buffer al‐
917       location becomes trivial.  The app can post X number of buffers each of
918       an  optimal  size.   However,  if there is a wide mix in message sizes,
919       difficulties arise.  It is not uncommon for an app to have 80%  of  its
920       messages be a couple hundred bytes or less, but 80% of the total da‐
921       ta that it sends to be in large transfers that are, say, a megabyte  or
922       more.  Pre-posting receive buffers in such a situation is challenging.
923
924       A commonly used technique to handle this situation is to implement
925       one application level protocol for smaller messages, and use a separate
926       protocol for transfers that are larger than some given threshold.  This
927       would allow an application to post a bunch of smaller messages,  say  4
928       would allow an application to post a bunch of smaller buffers, say 4
929       ferent communication protocol is used, possibly over a different socket
930       or endpoint.
931
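       The send-side decision might look like the hedged sketch below.  The
       4 KB threshold and the helper names are invented for illustration, and
       the large-transfer path is left as a stub.

              #include <stddef.h>
              #include <sys/socket.h>
              #include <sys/types.h>

              /* Illustrative cutoff, matching the 4 KB receive buffers
               * described above. */
              #define EAGER_THRESHOLD 4096

              /* Stub for the separate large-transfer protocol (for example,
               * a rendezvous exchange over another socket or endpoint). */
              static ssize_t send_large(int fd, const void *buf, size_t len)
              {
                  (void)fd; (void)buf;
                  return (ssize_t)len;    /* pretend the transfer completed */
              }

              /* Choose a protocol based on the message size. */
              static ssize_t send_msg(int fd, const void *buf, size_t len)
              {
                  if (len <= EAGER_THRESHOLD)
                      return send(fd, buf, len, 0);  /* small: send directly */

                  return send_large(fd, buf, len);   /* large: bulk path */
              }
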
932   Shared Receive Queues
933       If  an  application  pre-posts  receive  buffers to a network queue, it
934       needs to balance the size of each buffer posted, the number of  buffers
935       that  are  posted  to  each queue, and the number of queues that are in
936       use.  With a socket-like approach, each socket would maintain an inde‐
937       pendent receive queue where data is placed.  If an application is using
938       1000 endpoints and posts 100 buffers, each 4 KB, that results in 400 MB
939       of memory space being consumed to receive data.  (We can start to real‐
940       ize that by eliminating memory copies, one of the trade-offs is in‐
941       creased  memory  consumption.) While 400 MB seems like a lot of memory,
942       there is less than half a megabyte allocated to a single receive queue.
943       At  today’s  networking  speeds,  that  amount of space can be consumed
944       within milliseconds.  The result is that if only a few endpoints are in
945       use,  the  application  will  experience long delays where flow control
946       will kick in and back the transfers off.
947
948       There are a couple of observations that we can make here.  The first is
949       that  in order to achieve high scalability, we need to move away from a
950       connection-oriented protocol, such as streaming sockets.  Secondly,  we
951       need to reduce the number of receive queues that an application uses.
952
953       A  shared  receive  queue  is a network queue that can receive data for
954       many different endpoints at once.  With shared receive  queues,  we  no
955       longer  associate  a  receive  queue with a specific transport address.
956       Instead network data will target a specific endpoint address.  As  data
957       arrives,  the  endpoint  will  remove  an entry from the shared receive
958       queue, place the data into the application’s posted buffer, and  return
959       it to the user.  Shared receive queues can greatly reduce the amount of
960       buffer space needed by an application.  In the previous example, if a
961       shared  receive queue were used, the app could post 10 times the number
962       of buffers (1000 total), yet still consume 100 times less memory (4  MB
963       total).   This is far more scalable.  The drawback is that the applica‐
964       tion must now be aware of receive queues  and  shared  receive  queues,
965       rather than considering the network only at the level of a socket.
966
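       Structurally, the change is that each endpoint references one shared
       queue rather than owning its own, as in the sketch below.  The names
       are illustrative only.

              #include <sys/socket.h>

              struct shared_rx_queue;    /* one pool of posted receive
                                            buffers, details omitted */

              /* Many endpoints drain the same shared receive queue, so 1000
               * postings of 4 KB each (4 MB total) serve every endpoint,
               * rather than 100 postings per endpoint (400 MB total across
               * 1000 endpoints). */
              struct shared_endpoint {
                  struct sockaddr_storage addr;   /* per-endpoint address */
                  struct shared_rx_queue *srq;    /* shared with other
                                                     endpoints */
              };
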
967   Multi-Receive Buffers
968       Shared receive queues greatly improve application scalability; however,
969       they still result in some inefficiencies as defined so far.  We’ve only
970       considered  the  case of posting a series of fixed sized memory buffers
971       to the receive queue.  As mentioned, determining the size of each  buf‐
972       fer is challenging.  Transfers larger than the fixed size require using
973       some other protocol in order to complete.  If transfers  are  typically
974       much  smaller than the fixed size, then the extra buffer space goes un‐
975       used.
976
977       Again referring to our example, if the application posts 1000  buffers,
978       then it can only receive 1000 messages before the queue is emptied.  At
979       data rates measured in millions of messages per second, this  will  in‐
980       troduce  stalls in the data stream.  An obvious solution is to increase
981       the number of buffers posted.  The problem  is  dealing  with  variable
982       sized messages, including some which are only a couple hundred bytes in
983       length.  For example, if the average message size in our  case  is  256
984       bytes  or  less, then even though we’ve allocated 4 MB of buffer space,
985       we only make use of 6% of that space.  The rest is wasted in  order  to
986       handle messages which may only occasionally be up to 4 KB.
987
988       A  second  optimization  that we can make is to fill up each posted re‐
989       ceive buffer as messages arrive.  So, instead of a 4  KB  buffer  being
990       removed  from  use as soon as a single 256 byte message arrives, it can
991       instead receive up to 16 256-byte messages.  We refer to such a fea‐
992       ture as `multi-receive' buffers.
993
994       With  multi-receive buffers, instead of posting a bunch of smaller buf‐
995       fers, we instead post a single larger buffer, say the entire 4  MB,  at
996       once.   As  data is received, it is placed into the posted buffer.  Un‐
997       like TCP streams, we still maintain message boundaries.  The advantages
998       here  are  twofold.  Not only is memory used more efficiently, allowing
999       us to receive more small messages at once and larger messages over‐
1000       all,  but  we  reduce the number of function calls that the application
1001       must make to maintain its supply of available receive buffers.
1002
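       The bookkeeping such a feature implies is sketched below: a single
       large posted region is handed out piecewise as messages arrive, and it
       is reposted only once it has been exhausted.  The names are
       illustrative only.

              #include <stddef.h>

              /* One large posted region consumed by many arriving messages. */
              struct multi_recv {
                  char   *base;    /* start of the posted region */
                  size_t  size;    /* total size of the region */
                  size_t  used;    /* bytes already consumed by arrivals */
              };

              /* Reserve space in the region for an arriving message.
               * Returns NULL when the region is exhausted and a new
               * buffer must be posted. */
              static void *claim(struct multi_recv *mr, size_t msg_len)
              {
                  void *where;

                  if (mr->used + msg_len > mr->size)
                      return NULL;

                  where = mr->base + mr->used;
                  mr->used += msg_len;
                  return where;
              }
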
1003       When combined with shared receive queues,  multi-receive  buffers  help
1004       support  optimal receive side buffering and processing.  The main draw‐
1005       back to supporting multi-receive buffers is that the application will
1006       not  necessarily know up front how many messages may be associated with
1007       a single posted memory buffer.  This is rarely a problem  for  applica‐
1008       tions.
1009
1010   Optimal Hardware Allocation
1011       As part of scalability considerations, we not only need to consider the
1012       processing and memory resources of the host system, but also the  allo‐
1013       cation  and  use  of  the NIC hardware.  We’ve referred to network end‐
1014       points as a combination of transport addressing, transmit queues, and re‐
1015       ceive  queues.  The latter two queues are often implemented as hardware
1016       command queues.  Command queues are used to signal the NIC  to  perform
1017       some  sort  of  work.   A  transmit queue indicates that the NIC should
1018       transfer data.  A transmit command often contains information  such  as
1019       the  address  of  the buffer to transmit, the length of the buffer, and
1020       destination addressing data.  The actual format and data contents  vary
1021       based on the hardware implementation.
1022
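       As a rough illustration only, a transmit command might carry fields
       such as those below.  Real descriptor layouts are hardware specific
       and will differ from this sketch.

              #include <stdint.h>

              /* Hypothetical transmit command written to a hardware command
               * queue.  Real NICs define their own, often very different,
               * formats. */
              struct tx_command {
                  uint64_t buf_addr;   /* address of the buffer to transmit */
                  uint32_t length;     /* number of bytes to send */
                  uint32_t dest;       /* destination addressing data */
              };
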
1023       NICs  have limited resources.  Only the most scalable, high-performance
1024       applications likely need to be concerned with  utilizing  NIC  hardware
1025       optimally.   However,  such  applications are an important and specific
1026       focus of libfabric.  Managing NIC resources is often handled by  a  re‐
1027       source manager application, which is responsible for allocating systems
1028       to competing applications, among other activities.
1029
1030       Supporting applications that wish to make optimal use of  hardware  re‐
1031       quires  that  hardware  related abstractions be exposed to the applica‐
1032       tion.  Such abstractions cannot require a specific hardware implementa‐
1033       tion,  and care must be taken to ensure that the resulting API is still
1034       usable by developers unfamiliar with dealing with such  low  level  de‐
1035       tails.   Exposing  concepts such as shared receive queues is an example
1036       of giving an application more control over how hardware  resources  are
1037       used.
1038
1039   Sharing Command Queues
1040       By exposing the transmit and receive queues to the application, we open
1041       the possibility for an application that uses multiple end‐
1042       points  to determine how those queues might be shared.  We talked about
1043       the benefits of sharing a receive queue among endpoints.  The  benefits
1044       of sharing transmit queues are not as obvious.
1045
1046       An  application  that  uses  more  addressable endpoints than there are
1047       transmit queues will need to share transmit queues among the endpoints.
1048       By  controlling  which endpoint uses which transmit queue, the applica‐
1049       tion can prioritize traffic.  A transmit queue can also  be  configured
1050       to  optimize for a specific type of data transfer, such as large trans‐
1051       fers only.
1052
1053       From the perspective of a software API,  sharing  transmit  or  receive
1054       queues implies exposing those constructs to the application, and allow‐
1055       ing them to be associated with different endpoint addresses.
1056
1057   Multiple Queues
1058       The opposite of a shared command queue is an endpoint that has multiple
1059       queues.  An application that can take advantage of multiple transmit or
1060       receive queues can increase parallel handling of messages without  syn‐
1061       chronization  constraints.   Being  able to use multiple command queues
1062       through a single endpoint has advantages over using multiple endpoints.
1063       Multiple  endpoints  require separate addresses, which increases memory
1064       use.  A single endpoint with multiple queues can continue to  expose  a
1065       single address, while taking full advantage of available NIC resources.
1066
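       Structurally, this amounts to one address fronting several command
       queues, as in the illustrative sketch below.  Each application thread
       could, for example, own one transmit queue and post to it without
       locking.

              #include <sys/socket.h>

              #define NUM_TX_QUEUES 4     /* e.g., one per application thread */

              struct tx_queue;            /* details omitted */
              struct rx_queue;            /* details omitted */

              /* One address fronting several command queues. */
              struct multi_queue_endpoint {
                  struct sockaddr_storage addr;
                  struct tx_queue *tx[NUM_TX_QUEUES];
                  struct rx_queue *rx[NUM_TX_QUEUES];
              };
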
1067   Progress Model Considerations
1068       One  aspect  of the sockets programming interface that developers often
1069       don’t consider is the location of the protocol implementation.  This is
1070       usually  managed  by the operating system kernel.  The network stack is
1071       responsible for handling flow control messages, timing  out  transfers,
1072       re-transmitting unacknowledged transfers, processing received data, and
1073       sending acknowledgments.  This processing  requires  that  the  network
1074       stack  consume  CPU  cycles.   Portions  of that processing can be done
1075       within the context of the application thread, but much must be  handled
1076       by kernel threads dedicated to network processing.
1077
1078       If we move the network processing directly into the application process,
1079       we need to be concerned with how network  communication  makes  forward
1080       progress.  For example, how and when are acknowledgments sent?  How are
1081       timeouts and message re-transmissions handled?  The progress model  de‐
1082       fines this behavior, and it depends on how much of the network process‐
1083       ing has been offloaded onto the NIC.
1084
1085       More generally, progress is the ability of the underlying  network  im‐
1086       plementation  to  complete  processing  of an asynchronous request.  In
1087       many cases, the processing of an asynchronous request requires the  use
1088       of  the host processor.  For performance reasons, it may be undesirable
1089       for the provider to allocate a thread for this purpose, which will com‐
1090       pete  with  the  application  thread(s).   We  can avoid thread context
1091       switches if the application thread can be used to make forward progress
1092       on  requests  –  check for acknowledgments, retry timed out operations,
1093       etc.  Doing so requires that the application periodically call into the
1094       network stack.
1095
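       A minimal sketch of application-driven progress follows.  The
       progress() and request_complete() calls are placeholders standing in
       for whatever the network stack provides; they are not real functions.

              struct net_ctx;     /* opaque network state */
              struct request;     /* an outstanding asynchronous operation */

              /* Placeholders, defined here only so the sketch is
               * self-contained. */
              static void progress(struct net_ctx *net) { (void)net; }
              static int  request_complete(struct request *req)
              {
                  (void)req;
                  return 1;
              }

              /* The application thread drives acknowledgments,
               * retransmissions, and completion processing by calling into
               * the stack periodically. */
              static void wait_for(struct net_ctx *net, struct request *req)
              {
                  while (!request_complete(req)) {
                      progress(net);
                      /* other application work can be interleaved here */
                  }
              }
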
1096   Ordering
1097       Network  ordering is a complex subject.  With TCP sockets, data is sent
1098       and received in the same order.  Buffers are re-usable by the  applica‐
1099       tion immediately upon returning from a function call.  As a result, or‐
1100       dering is simple to understand and use.  UDP sockets complicate  things
1101       slightly.  With UDP sockets, messages may be received out of order from
1102       how they were sent.  In practice, this often doesn’t occur, particular‐
1103       ly if the application only communicates over a local area network,
1104       such as Ethernet.
1105
1106       With our evolving network API, there are situations where exposing dif‐
1107       ferent  order semantics can improve performance.  These details will be
1108       discussed further below.
1109
1110   Messages
1111       UDP sockets allow messages to arrive out of order because each  message
1112       is  routed  from the sender to the receiver independently.  This allows
1113       packets to take different network paths, to avoid  congestion  or  take
1114       advantage  of  multiple network links for improved bandwidth.  We would
1115       like to take advantage of the same features in those  cases  where  the
1116       application doesn’t care in which order messages arrive.
1117
1118       Unlike UDP sockets, however, our definition of message ordering is more
1119       subtle.  UDP messages are small, MTU sized packets.  In our case,  mes‐
1120       sages may be gigabytes in size.  We define message ordering to indicate
1121       whether the start of each message is processed in order or out  of  or‐
1122       der.   This  is related to, but separate from the order of how the mes‐
1123       sage payload is received.
1124
1125       An example will help clarify this distinction.  Suppose that an  appli‐
1126       cation has posted two messages to its receive queue.  The first receive
1127       points to a 4 KB buffer.  The second receive points to a 64 KB  buffer.
1128       The  sender  will  transmit a 4 KB message followed by a 64 KB message.
1129       If messages are processed in order, then the 4 KB send will match  with
1130       the 4 KB receive, and the 64 KB send will match with the 64 KB re‐
1131       ceive.  However, if messages can be processed out of  order,  then  the
1132       sends  and  receives  can  mismatch,  resulting in the 64 KB send being
1133       truncated.
1134
1135       In this example, we’re not concerned with what order the  data  is  re‐
1136       ceived in.  The 64 KB send could be broken into 64 1-KB transfers that
1137       take different routes to the destination.  So, bytes 2k-3k could be re‐
1138       ceived  before bytes 1k-2k.  Message ordering is not concerned with or‐
1139       dering within a message, only between messages.  With ordered messages,
1140       the messages themselves need to be processed in order.
1141
1142       The more relaxed the message ordering, the more optimizations that
1143       the network stack can use to transfer the data.  However, the  applica‐
1144       tion must be aware of message ordering semantics, and be able to select
1145       the desired semantic for its needs.  For the purposes of this  section,
1146       messages  refers to transport level operations, which includes RDMA and
1147       similar operations (some of which have not yet been discussed).
1148
1149   Data
1150       Data ordering refers to the receiving and placement of data both within
1151       and between messages.  Data ordering is most important to messages that
1152       can update the same target memory buffer.  For example, imagine an  ap‐
1153       plication that writes a series of database records directly into a peer
1154       memory location.  Data ordering, combined with  message  ordering,  en‐
1155       sures  that  the  data  from  the second write updates memory after the
1156       first write completes.  The result is that  the  memory  location  will
1157       contain the records carried in the second write.
1158
1159       Enforcing  data  ordering  between  messages requires that the messages
1160       themselves be ordered.  Data ordering can also apply  within  a  single
1161       message, though this level of ordering is usually less important to ap‐
1162       plications.  Intra-message data ordering indicates that the data for  a
1163       single  message  is received in order.  Some applications use this fea‐
1164       ture to `spin' reading the last byte of a  receive  buffer.   Once  the
1165       byte  changes,  the  application knows that the operation has completed
1166       and all earlier data has been received.  (Note that while such behavior
1167       is interesting for benchmark purposes, using such a feature in this way
1168       is strongly discouraged.  It is not portable between networks or  plat‐
1169       forms.)
1170
1171   Completions
1172       Completion ordering refers to the sequence that asynchronous operations
1173       report their completion to the application.  Typically, unreliable data
1174       transfers will naturally complete in the order that they are submitted
1175       to a transmit queue.  Each operation is  transmitted  to  the  network,
1176       with  the  completion  occurring  immediately after.  For reliable data
1177       transfers, an operation cannot complete until it has been  acknowledged
1178       by  the peer.  Since ack packets can be lost or possibly take different
1179       paths through the network, operations can be marked as completed out of
1180       order.  Out of order acks are more likely if messages can be processed
1181       out of order.
1182
1183       Asynchronous interfaces require that the application track its out‐
1184       standing  requests.  Handling out of order completions can increase ap‐
1185       plication complexity, but it does allow for optimizing network utiliza‐
1186       tion.
1187
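       One simple way to track outstanding requests when completions can
       arrive out of order is sketched below.  The structure is illustrative
       only.

              #include <stdbool.h>
              #include <stddef.h>

              #define MAX_OUTSTANDING 64

              /* One slot per outstanding asynchronous operation. */
              struct pending {
                  void *context;   /* application cookie for the operation */
                  bool  done;      /* set when its completion is reported */
              };

              static struct pending outstanding[MAX_OUTSTANDING];

              /* Completions may arrive in any order; each simply marks its
               * slot as done. */
              static void on_completion(size_t idx)
              {
                  outstanding[idx].done = true;
              }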

Libfabric Architecture

1189       Libfabric  is well architected to support the previously discussed fea‐
1190       tures.  For further information on the libfabric architecture, see  the
1191       next programmer’s guide section: fi_arch(7).
1192

AUTHORS

1194       OpenFabrics.
1195
1196
1197
1198Libfabric Programmer’s Manual     2023-01-02                       fi_intro(7)