[Connected communications diagram: a passive endpoint and an active endpoint, each bound to an event queue, exchanging CONNREQ and CONNECTED events during connection setup]

Connections require the use of both passive and active endpoints.
In order to establish a connection, an application must first create a
passive endpoint and associate it with an event queue. The event queue
will be used to report the connection management events. The application
then calls listen on the passive endpoint. A single passive endpoint can
be used to form multiple connections.

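As an illustration, a minimal sketch of the listening side (error handling omitted; the fabric, fi_info, and event queue objects — here called fabric, info, and eq — are assumed to have been opened earlier):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>

/* Assumed from earlier setup: fabric, info (from fi_getinfo), and eq. */
struct fid_pep *pep;

fi_passive_ep(fabric, info, &pep, NULL);  /* allocate the passive endpoint  */
fi_pep_bind(pep, &eq->fid, 0);            /* CM events are reported to eq   */
fi_listen(pep);                           /* start accepting CONNREQs       */
```
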
The connecting peer allocates an active endpoint, which is also
associated with an event queue. Connect is called on the active
endpoint, which results in sending a connection request (CONNREQ)
message to the passive endpoint. The CONNREQ event is inserted into
the passive endpoint’s event queue, where the listening application can
process it.

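A corresponding sketch for the connecting peer, assuming a domain and event queue (domain, eq) already exist and info was returned by fi_getinfo() for the destination:

```c
/* Assumed from earlier setup: domain, eq, and info for the destination. */
struct fid_ep *ep;

fi_endpoint(domain, info, &ep, NULL);      /* active endpoint              */
fi_ep_bind(ep, &eq->fid, 0);               /* CONNECTED will arrive on eq  */
fi_enable(ep);
fi_connect(ep, info->dest_addr, NULL, 0);  /* sends a CONNREQ to the peer  */
```
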
Upon processing the CONNREQ, the listening application will allocate
an active endpoint to use with the connection. The active endpoint is
bound with an event queue. Although the diagram shows the use of a
separate event queue, the active endpoint may use the same event queue
as used by the passive endpoint. Accept is called on the active endpoint
to finish forming the connection. It should be noted that the OFI accept
call is different from the accept call used by sockets. The differences
result from OFI supporting process direct I/O.

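On the listening side, one possible way to process the CONNREQ and accept the connection (a sketch; domain and eq are assumed from the earlier setup):

```c
struct fi_eq_cm_entry entry;
uint32_t event;

/* Read the CONNREQ; entry.info describes the requesting peer. */
fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);

struct fid_ep *ep;
fi_endpoint(domain, entry.info, &ep, NULL);  /* endpoint for this connection */
fi_ep_bind(ep, &eq->fid, 0);                 /* the passive EQ may be reused */
fi_enable(ep);
fi_accept(ep, NULL, 0);                      /* completes with CONNECTED     */
```
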
libfabric does not define the connection establishment protocol, but
does support a traditional three-way handshake used by many technologies.
After calling accept, a response is sent to the connecting active endpoint.
That response generates a CONNECTED event on the remote event queue. If a
three-way handshake is used, the remote endpoint will generate an
acknowledgment message that will generate a CONNECTED event for the accepting
endpoint. Regardless of the connection protocol, both the active and passive
sides of the connection will receive a CONNECTED event that signals that the
connection has been established.

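Either side can then block on its event queue until the connection is fully established, for example:

```c
struct fi_eq_cm_entry entry;
uint32_t event;

/* Block until the connection is established (or a shutdown/error occurs). */
fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);
if (event != FI_CONNECTED) {
        /* handle FI_SHUTDOWN or an unexpected event */
}
```
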
## Connectionless Communications

Connectionless communication allows data transfers between active endpoints
without going through a connection setup process. The diagram below shows
the basic components needed to set up connectionless communication.
Connectionless communication setup differs from UDP sockets in that it
requires that the remote addresses be stored with libfabric.

[Connectionless setup diagram: (1) insert_addr() adds a peer address to the address vector, (2) send() is posted to the active endpoint, and (3) the endpoint looks up the peer through the address vector]

libfabric requires that the addresses of peer endpoints be inserted into a local
addressing table, or address vector, before data transfers can be initiated
against the remote endpoint. Address vectors abstract fabric-specific
addressing requirements and avoid long queuing delays on data transfers
when address resolution is needed. For example, IP addresses may need to be
resolved into Ethernet MAC addresses. Address vectors allow this resolution
to occur during application initialization time. libfabric does not define
how an address vector must be implemented, only its conceptual model.

All connectionless endpoints that transfer data must be associated with an
address vector.

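A sketch of this setup, assuming domain and ep already exist and peer_addr holds the peer's address in the format used by the provider:

```c
#include <rdma/fi_domain.h>

/* Assumed: domain, ep, and peer_addr in the provider's address format. */
struct fi_av_attr av_attr = { .type = FI_AV_MAP };
struct fid_av *av;
fi_addr_t peer;

fi_av_open(domain, &av_attr, &av, NULL);        /* create the address vector    */
fi_ep_bind(ep, &av->fid, 0);                    /* endpoint resolves through av */
fi_av_insert(av, peer_addr, 1, &peer, 0, NULL); /* returns an fi_addr_t handle  */

/* The returned handle identifies the peer in data transfer calls. */
fi_send(ep, buf, len, NULL, peer, NULL);
```
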
# Endpoints

At a low level, endpoints are usually associated with a transmit context, or
queue, and a receive context, or queue. Although the terms transmit and
receive queues are easier to understand, libfabric uses the terminology
context, since queue-like behavior of acting as a FIFO (first-in, first-out)
is not guaranteed. Transmit and receive contexts may be implemented using
hardware queues mapped directly into the process’s address space. An endpoint
may be configured only to transmit or receive data. Data transfer requests
are converted by the underlying provider into commands that are inserted into
hardware transmit and/or receive contexts.

Endpoints are also associated with completion queues. Completion queues are
used to report the completion of asynchronous data transfer operations.

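For example, completion queues might be opened and bound to an endpoint as follows (a sketch; domain and ep are assumed from earlier setup):

```c
/* Assumed: domain and ep from earlier setup. */
struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
struct fid_cq *txcq, *rxcq;

fi_cq_open(domain, &cq_attr, &txcq, NULL);
fi_cq_open(domain, &cq_attr, &rxcq, NULL);

/* Transmit completions report to txcq, receive completions to rxcq. */
fi_ep_bind(ep, &txcq->fid, FI_TRANSMIT);
fi_ep_bind(ep, &rxcq->fid, FI_RECV);
fi_enable(ep);
```
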
## Shared Contexts

An advanced usage model allows for sharing resources among multiple endpoints.
The most common form of sharing is having multiple connected endpoints
make use of a single receive context. This can reduce receive side buffering
requirements, allowing the number of connected endpoints that an application
can manage to scale to larger numbers.

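One way this is expressed in the API is through a shared receive context; a sketch, with domain, info, and two connected endpoints ep1 and ep2 assumed from earlier setup:

```c
/* Assumed: domain, info, and two connected endpoints ep1 and ep2. */
struct fid_ep *srx;                      /* shared receive context */

fi_srx_context(domain, info->rx_attr, &srx, NULL);

/* Both endpoints draw from one pool of posted receive buffers. */
fi_ep_bind(ep1, &srx->fid, 0);
fi_ep_bind(ep2, &srx->fid, 0);

/* Buffers are posted to the shared context, not the individual endpoints. */
fi_recv(srx, buf, len, NULL, FI_ADDR_UNSPEC, NULL);
```
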
# Data Transfers

Obviously, a primary goal of network communication is to transfer data between
processes running on different systems. In a similar way that the socket API
defines different data transfer semantics for TCP versus UDP sockets, that is,
streaming versus datagram messages, libfabric defines different types of data
transfers. However, unlike sockets, libfabric allows different semantics over
a single endpoint, even when communicating with the same peer.

libfabric uses separate API sets for the different data transfer semantics,
although there are strong similarities between the API sets. The differences
are the result of the parameters needed to invoke each type of data transfer.

## Message transfers

Message transfers are most similar to UDP datagram transfers, except that
transfers may be sent and received reliably. Message transfers may also be
gigabytes in size, depending on the provider implementation. The sender
requests that data be transferred as a single transport operation to a peer.
Even if the data is referenced using an I/O vector, it is treated as a single
logical unit or message. The data is placed into a waiting receive buffer
at the peer, with the receive buffer usually chosen using FIFO ordering.
Note that even though receive buffers are selected using FIFO ordering, the
received messages may complete out of order. This can occur as a result of
data between and within messages taking different paths through the network,
handling lost or retransmitted packets, etc.

Message transfers are usually invoked using API calls that contain the string
"send" or "recv". As a result, they may be referred to simply as sent or
received messages.

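A minimal sketch of a message exchange over a connected endpoint pair (buffer names and lengths are placeholders; completions are reported through the bound completion queues):

```c
/* Receiver: post a buffer to the endpoint's receive context. */
fi_recv(rx_ep, rx_buf, rx_len, NULL, FI_ADDR_UNSPEC, NULL);

/* Sender: the I/O is handled as one logical message. */
fi_send(tx_ep, tx_buf, tx_len, NULL, FI_ADDR_UNSPEC, NULL);

/* Both operations complete asynchronously through the bound CQs. */
struct fi_cq_entry comp;
fi_cq_read(txcq, &comp, 1);
```
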
Message transfers involve the target process posting memory buffers to the
receive (Rx) context of its endpoint. When a message arrives from the network,
a receive buffer is removed from the Rx context, and the data is copied from
the network into the receive buffer. Messages are matched with posted receives
in the order that they are received. Note that this may differ from the order
that messages are sent, depending on the transmit side's ordering semantics.

Conceptually, on the transmit side, messages are posted to a transmit (Tx)
context. The network processes messages from the Tx context, packetizing
the data into outbound messages. Although many implementations process the
Tx context in order (i.e. the Tx context is a true queue), ordering guarantees
specified through the libfabric API determine the actual processing order. As
a general rule, the more relaxed an application is on its message and data
ordering, the more optimizations the networking software and hardware can
leverage, providing better performance.

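Such ordering requirements are conveyed when selecting a provider; for example, a sketch that requests only send-after-send ordering through the fi_getinfo() hints:

```c
struct fi_info *hints = fi_allocinfo();

/* Request only send-after-send ordering; leave everything else relaxed
 * so the provider can apply more aggressive optimizations.
 */
hints->tx_attr->msg_order = FI_ORDER_SAS;
hints->rx_attr->msg_order = FI_ORDER_SAS;

struct fi_info *info;
fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
```
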
## Tagged messages

Tagged messages are similar to message transfers except that the messages
carry one additional piece of information, a message tag. Tags are
application-defined values that are part of the message transfer protocol and
are used to route packets at the receiver. At a high level, they are roughly
similar to message ids. The difference is that tag values are set by the
application, may be any value, and duplicate tag values are allowed.

Each sent message carries a single tag value, which is used to select a receive
buffer into which the data is copied. On the receiving side, message buffers
are also marked with a tag. Messages that arrive from the network search
through the posted receive buffers until a matching tag is found.

Tags are often used to identify virtual communication groups or roles.
In practice, message tags are typically divided into fields. For example, the
upper 16 bits of the tag may indicate a virtual group, with the lower 16 bits
identifying the message purpose. The tag message interface in libfabric is
designed around this usage model. Each sent message carries exactly one tag
value, specified through the API. At the receiver, buffers are associated
with both a tag value and a mask. The mask is used as part of the buffer
matching process. The mask is applied against the received tag value carried
in the sent message prior to checking the tag against the receive buffer. For
example, the mask may indicate to ignore the lower 16 bits of a tag. If
the resulting values match, then the tags are said to match. The received
data is then placed into the matched buffer.

For performance reasons, the mask is specified as 'ignore' bits. Although
this is backwards from how many developers think of a mask (where the bits
that are valid would be set to 1), the definition ends up mapping well with
applications. The actual operation performed when matching tags is:

```
send_tag | ignore == recv_tag | ignore

/* this is equivalent to:
 * send_tag & ~ignore == recv_tag & ~ignore
 */
```

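As a sketch of this model, a receive that matches any message for a given virtual group while ignoring the purpose field (the field layout here is illustrative, and ep, the buffers, and dest_addr are placeholders):

```c
#include <rdma/fi_tagged.h>

/* Illustrative layout: virtual group in the upper bits of the tag,
 * message purpose in the lower 16 bits.
 */
uint64_t group  = 5;
uint64_t tag    = group << 16;
uint64_t ignore = 0xffff;        /* ignore the purpose field when matching */

/* Receiver: matches any message sent to this group, regardless of purpose. */
fi_trecv(ep, rx_buf, rx_len, NULL, FI_ADDR_UNSPEC, tag, ignore, NULL);

/* Sender: each message carries exactly one tag value. */
fi_tsend(ep, tx_buf, tx_len, NULL, dest_addr, (group << 16) | 0x7, NULL);
```
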
Tagged messages are the equivalent of message transfers if a single tag
value is used. But tagged messages require that the receiver perform a
matching operation at the target, which can impact performance versus
untagged messages.

## RMA

RMA operations are architected such that they can require no processing by the
CPU at the RMA target. NICs which offload transport functionality can perform
RMA operations without impacting host processing. RMA write operations
transmit data from the initiator to the target. The memory location where the
data should be written is carried within the transport message itself, with
verification checks at the target to prevent invalid access.

RMA read operations fetch data from the target system and transfer it back to
the initiator of the request, where it is placed into memory. This too can be
done without involving the host processor at the target system when the NIC
supports transport offloading.

The advantage of RMA operations is that they decouple the processing of the
peers. Data can be placed or fetched whenever the initiator is ready without
necessarily impacting the peer process.

Because RMA operations allow a peer to directly access the memory of a
process, additional protection mechanisms are used to prevent unintentional
or unwanted access. RMA memory that is updated by a write operation or is
fetched by a read operation must be registered for access with the correct
permissions specified.

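A sketch of the two halves of an RMA write, assuming the target's registered key and buffer address are exchanged out of band (domain, ep, dest_addr, remote_addr, and the buffers are placeholders):

```c
#include <rdma/fi_rma.h>

/* Target: register the buffer and expose it for remote writes. */
struct fid_mr *mr;
fi_mr_reg(domain, target_buf, buf_len, FI_REMOTE_WRITE, 0, 0, 0, &mr, NULL);
uint64_t key = fi_mr_key(mr);   /* shared with the initiator out of band */

/* Initiator: place local data directly into the target's buffer. */
fi_write(ep, local_buf, xfer_len, NULL, dest_addr,
         remote_addr /* target buffer address or offset */, key, NULL);
```
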
## Atomic operations

Atomic transfers are used to read and update data located in remote memory
regions in an atomic fashion. Conceptually, they are similar to local atomic
operations of a similar nature (e.g. atomic increment, compare and swap, etc.).
The benefit of atomic operations is they enable offloading basic arithmetic
capabilities onto a NIC. Unlike other data transfer operations, which merely
need to transfer bytes of data, atomics require knowledge of the format of the
data being accessed.

A single atomic function operates across an array of data, applying an atomic
operation to each entry. The atomicity of an operation is limited to a single
data type or entry, however, not across the entire array. libfabric defines a
wide variety of atomic operations across all common data types. However,
support for a given operation is dependent on the provider implementation.

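For example, a sketch of atomically adding to a 64-bit counter in registered remote memory (remote_counter_addr and key are placeholders obtained as in the RMA case):

```c
#include <rdma/fi_atomic.h>

uint64_t addend = 1;

/* Apply FI_SUM to a single FI_UINT64 element at the remote address;
 * the update is atomic with respect to other fabric atomics on that entry.
 */
fi_atomic(ep, &addend, 1, NULL, dest_addr,
          remote_counter_addr, key, FI_UINT64, FI_SUM, NULL);
```
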
## Collective operations

In general, collective operations can be thought of as coordinated atomic
operations between a set of peer endpoints, almost like a multicast atomic
request. A single collective operation can result in data being collected
from multiple peers, combined using a set of atomic primitives, and the
results distributed to all peers. A collective operation is a group
communication exchange. It involves multiple peers exchanging data with other
peers participating in the collective call. Collective operations require
close coordination by all participating members, and collective calls can
strain the fabric, as well as local and remote data buffers.

Collective operations are an area of heavy research, with dedicated libraries
focused almost exclusively on implementing collective operations efficiently.
Such libraries are a specific target of libfabric. The main objective of the
libfabric collective APIs is to expose network acceleration features for
implementing collectives to higher-level libraries and applications. It is
recommended that applications needing collective communication target
higher-level libraries, such as MPI, instead of using libfabric collective
APIs for that purpose.