[Connected communications diagram: a passive endpoint and an active endpoint, each bound to an event queue, exchanging CONNREQ and CONNECTED events during connection setup]

Connections require the use of both passive and active endpoints.
In order to establish a connection, an application must first create a
passive endpoint and associate it with an event queue. The event queue
will be used to report the connection management events. The application
then calls listen on the passive endpoint. A single passive endpoint can
be used to form multiple connections.

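As an illustration, a minimal sketch of the listening side (error handling omitted; the fabric, fi_info, and event queue objects — here called fabric, info, and eq — are assumed to have been opened earlier):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>

/* Assumed from earlier setup: fabric, info (from fi_getinfo), and eq. */
struct fid_pep *pep;

fi_passive_ep(fabric, info, &pep, NULL);  /* allocate the passive endpoint  */
fi_pep_bind(pep, &eq->fid, 0);            /* CM events are reported to eq   */
fi_listen(pep);                           /* start accepting CONNREQs       */
```
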
The connecting peer allocates an active endpoint, which is also
associated with an event queue. Connect is called on the active
endpoint, which results in sending a connection request (CONNREQ)
message to the passive endpoint. The CONNREQ event is inserted into
the passive endpoint’s event queue, where the listening application can
process it.

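A corresponding sketch for the connecting peer, assuming a domain and event queue (domain, eq) already exist and info was returned by fi_getinfo() for the destination:

```c
/* Assumed from earlier setup: domain, eq, and info for the destination. */
struct fid_ep *ep;

fi_endpoint(domain, info, &ep, NULL);      /* active endpoint              */
fi_ep_bind(ep, &eq->fid, 0);               /* CONNECTED will arrive on eq  */
fi_enable(ep);
fi_connect(ep, info->dest_addr, NULL, 0);  /* sends a CONNREQ to the peer  */
```
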
Upon processing the CONNREQ, the listening application will allocate
an active endpoint to use with the connection. The active endpoint is
bound with an event queue. Although the diagram shows the use of a
separate event queue, the active endpoint may use the same event queue
as used by the passive endpoint. Accept is called on the active endpoint
to finish forming the connection. It should be noted that the OFI accept
call is different from the accept call used by sockets. The differences
result from OFI supporting process direct I/O.

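On the listening side, one possible way to process the CONNREQ and accept the connection (a sketch; domain and eq are assumed from the earlier setup):

```c
struct fi_eq_cm_entry entry;
uint32_t event;

/* Read the CONNREQ; entry.info describes the requesting peer. */
fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);

struct fid_ep *ep;
fi_endpoint(domain, entry.info, &ep, NULL);  /* endpoint for this connection */
fi_ep_bind(ep, &eq->fid, 0);                 /* the passive EQ may be reused */
fi_enable(ep);
fi_accept(ep, NULL, 0);                      /* completes with CONNECTED     */
```
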
libfabric does not define the connection establishment protocol, but
does support a traditional three-way handshake used by many technologies.
After calling accept, a response is sent to the connecting active endpoint.
That response generates a CONNECTED event on the remote event queue. If a
three-way handshake is used, the remote endpoint will generate an
acknowledgment message that will generate a CONNECTED event for the accepting
endpoint. Regardless of the connection protocol, both the active and passive
sides of the connection will receive a CONNECTED event that signals that the
connection has been established.

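Either side can then block on its event queue until the connection is fully established, for example:

```c
struct fi_eq_cm_entry entry;
uint32_t event;

/* Block until the connection is established (or a shutdown/error occurs). */
fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);
if (event != FI_CONNECTED) {
        /* handle FI_SHUTDOWN or an unexpected event */
}
```
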
## Connectionless Communications

Connectionless communication allows data transfers between active endpoints
without going through a connection setup process. The diagram below shows
the basic components needed to set up connectionless communication.
Connectionless communication setup differs from UDP sockets in that it
requires that the remote addresses be stored with libfabric.

[Connectionless setup diagram: (1) insert_addr() adds a peer address to the address vector, (2) send() is posted to the active endpoint, and (3) the endpoint looks up the peer through the address vector]

libfabric requires that the addresses of peer endpoints be inserted into a local
addressing table, or address vector, before data transfers can be initiated
against the remote endpoint. Address vectors abstract fabric-specific
addressing requirements and avoid long queuing delays on data transfers
when address resolution is needed. For example, IP addresses may need to be
resolved into Ethernet MAC addresses. Address vectors allow this resolution
to occur during application initialization time. libfabric does not define
how an address vector must be implemented, only its conceptual model.

All connectionless endpoints that transfer data must be associated with an
address vector.

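A sketch of this setup, assuming domain and ep already exist and peer_addr holds the peer's address in the format used by the provider:

```c
#include <rdma/fi_domain.h>

/* Assumed: domain, ep, and peer_addr in the provider's address format. */
struct fi_av_attr av_attr = { .type = FI_AV_MAP };
struct fid_av *av;
fi_addr_t peer;

fi_av_open(domain, &av_attr, &av, NULL);        /* create the address vector    */
fi_ep_bind(ep, &av->fid, 0);                    /* endpoint resolves through av */
fi_av_insert(av, peer_addr, 1, &peer, 0, NULL); /* returns an fi_addr_t handle  */

/* The returned handle identifies the peer in data transfer calls. */
fi_send(ep, buf, len, NULL, peer, NULL);
```
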
# Endpoints

At a low level, endpoints are usually associated with a transmit context, or
queue, and a receive context, or queue. Although the terms transmit and
receive queues are easier to understand, libfabric uses the terminology
context, since queue-like behavior of acting as a FIFO (first-in, first-out)
is not guaranteed. Transmit and receive contexts may be implemented using
hardware queues mapped directly into the process’s address space. An endpoint
may be configured only to transmit or receive data. Data transfer requests
are converted by the underlying provider into commands that are inserted into
hardware transmit and/or receive contexts.

Endpoints are also associated with completion queues. Completion queues are
used to report the completion of asynchronous data transfer operations.

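For example, completion queues might be opened and bound to an endpoint as follows (a sketch; domain and ep are assumed from earlier setup):

```c
/* Assumed: domain and ep from earlier setup. */
struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
struct fid_cq *txcq, *rxcq;

fi_cq_open(domain, &cq_attr, &txcq, NULL);
fi_cq_open(domain, &cq_attr, &rxcq, NULL);

/* Transmit completions report to txcq, receive completions to rxcq. */
fi_ep_bind(ep, &txcq->fid, FI_TRANSMIT);
fi_ep_bind(ep, &rxcq->fid, FI_RECV);
fi_enable(ep);
```
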
## Shared Contexts

An advanced usage model allows for sharing resources among multiple endpoints.
The most common form of sharing is having multiple connected endpoints
make use of a single receive context. This can reduce receive side buffering
requirements, allowing the number of connected endpoints that an application
can manage to scale to larger numbers.

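One way this is expressed in the API is through a shared receive context; a sketch, with domain, info, and two connected endpoints ep1 and ep2 assumed from earlier setup:

```c
/* Assumed: domain, info, and two connected endpoints ep1 and ep2. */
struct fid_ep *srx;                      /* shared receive context */

fi_srx_context(domain, info->rx_attr, &srx, NULL);

/* Both endpoints draw from one pool of posted receive buffers. */
fi_ep_bind(ep1, &srx->fid, 0);
fi_ep_bind(ep2, &srx->fid, 0);

/* Buffers are posted to the shared context, not the individual endpoints. */
fi_recv(srx, buf, len, NULL, FI_ADDR_UNSPEC, NULL);
```
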
# Data Transfers

Obviously, a primary goal of network communication is to transfer data between
processes running on different systems. In a similar way that the socket API
defines different data transfer semantics for TCP versus UDP sockets, that is,
streaming versus datagram messages, libfabric defines different types of data
transfers. However, unlike sockets, libfabric allows different semantics over
a single endpoint, even when communicating with the same peer.

libfabric uses separate API sets for the different data transfer semantics,
although there are strong similarities between the API sets. The differences
are the result of the parameters needed to invoke each type of data transfer.

## Message transfers

Message transfers are most similar to UDP datagram transfers, except that
transfers may be sent and received reliably. Message transfers may also be
gigabytes in size, depending on the provider implementation. The sender
requests that data be transferred as a single transport operation to a peer.
Even if the data is referenced using an I/O vector, it is treated as a single
logical unit or message. The data is placed into a waiting receive buffer
at the peer, with the receive buffer usually chosen using FIFO ordering.
Note that even though receive buffers are selected using FIFO ordering, the
received messages may complete out of order. This can occur as a result of
data between and within messages taking different paths through the network,
handling lost or retransmitted packets, etc.

Message transfers are usually invoked using API calls that contain the string
"send" or "recv". As a result, they may be referred to simply as sent or
received messages.

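A minimal sketch of a message exchange over a connected endpoint pair (buffer names and lengths are placeholders; completions are reported through the bound completion queues):

```c
/* Receiver: post a buffer to the endpoint's receive context. */
fi_recv(rx_ep, rx_buf, rx_len, NULL, FI_ADDR_UNSPEC, NULL);

/* Sender: the I/O is handled as one logical message. */
fi_send(tx_ep, tx_buf, tx_len, NULL, FI_ADDR_UNSPEC, NULL);

/* Both operations complete asynchronously through the bound CQs. */
struct fi_cq_entry comp;
fi_cq_read(txcq, &comp, 1);
```
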
Message transfers involve the target process posting memory buffers to the
receive (Rx) context of its endpoint. When a message arrives from the network,
a receive buffer is removed from the Rx context, and the data is copied from
the network into the receive buffer. Messages are matched with posted receives
in the order that they are received. Note that this may differ from the order
that messages are sent, depending on the transmit side's ordering semantics.

Conceptually, on the transmit side, messages are posted to a transmit (Tx)
context. The network processes messages from the Tx context, packetizing
the data into outbound messages. Although many implementations process the
Tx context in order (i.e. the Tx context is a true queue), ordering guarantees
specified through the libfabric API determine the actual processing order. As
a general rule, the more relaxed an application is on its message and data
ordering, the more optimizations the networking software and hardware can
leverage, providing better performance.

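Such ordering requirements are conveyed when selecting a provider; for example, a sketch that requests only send-after-send ordering through the fi_getinfo() hints:

```c
struct fi_info *hints = fi_allocinfo();

/* Request only send-after-send ordering; leave everything else relaxed
 * so the provider can apply more aggressive optimizations.
 */
hints->tx_attr->msg_order = FI_ORDER_SAS;
hints->rx_attr->msg_order = FI_ORDER_SAS;

struct fi_info *info;
fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
```
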
## Tagged messages

Tagged messages are similar to message transfers except that the messages
carry one additional piece of information, a message tag. Tags are
application-defined values that are part of the message transfer protocol and
are used to route packets at the receiver. At a high level, they are roughly
similar to message ids. The difference is that tag values are set by the
application, may be any value, and duplicate tag values are allowed.

Each sent message carries a single tag value, which is used to select a receive
buffer into which the data is copied. On the receiving side, message buffers
are also marked with a tag. Messages that arrive from the network search
through the posted receive buffers until a matching tag is found.

Tags are often used to identify virtual communication groups or roles.
In practice, message tags are typically divided into fields. For example, the
upper 16 bits of the tag may indicate a virtual group, with the lower 16 bits
identifying the message purpose. The tag message interface in libfabric is
designed around this usage model. Each sent message carries exactly one tag
value, specified through the API. At the receiver, buffers are associated
with both a tag value and a mask. The mask is used as part of the buffer
matching process. The mask is applied against the received tag value carried
in the sent message prior to checking the tag against the receive buffer. For
example, the mask may indicate to ignore the lower 16 bits of a tag. If
the resulting values match, then the tags are said to match. The received
data is then placed into the matched buffer.

For performance reasons, the mask is specified as 'ignore' bits. Although
this is backwards from how many developers think of a mask (where the bits
that are valid would be set to 1), the definition ends up mapping well with
applications. The actual operation performed when matching tags is:

```
send_tag | ignore == recv_tag | ignore

/* this is equivalent to:
 * send_tag & ~ignore == recv_tag & ~ignore
 */
```

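As a sketch of this model, a receive that matches any message for a given virtual group while ignoring the purpose field (the field layout here is illustrative, and ep, the buffers, and dest_addr are placeholders):

```c
#include <rdma/fi_tagged.h>

/* Illustrative layout: virtual group in the upper bits of the tag,
 * message purpose in the lower 16 bits.
 */
uint64_t group  = 5;
uint64_t tag    = group << 16;
uint64_t ignore = 0xffff;        /* ignore the purpose field when matching */

/* Receiver: matches any message sent to this group, regardless of purpose. */
fi_trecv(ep, rx_buf, rx_len, NULL, FI_ADDR_UNSPEC, tag, ignore, NULL);

/* Sender: each message carries exactly one tag value. */
fi_tsend(ep, tx_buf, tx_len, NULL, dest_addr, (group << 16) | 0x7, NULL);
```
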
Tagged messages are the equivalent of message transfers if a single tag
value is used. But tagged messages require that the receiver perform a
matching operation at the target, which can impact performance versus
untagged messages.

## RMA

RMA operations are architected such that they can require no processing by the
CPU at the RMA target. NICs which offload transport functionality can perform
RMA operations without impacting host processing. RMA write operations
transmit data from the initiator to the target. The memory location where the
data should be written is carried within the transport message itself, with
verification checks at the target to prevent invalid access.

RMA read operations fetch data from the target system and transfer it back to
the initiator of the request, where it is placed into memory. This too can be
done without involving the host processor at the target system when the NIC
supports transport offloading.

The advantage of RMA operations is that they decouple the processing of the
peers. Data can be placed or fetched whenever the initiator is ready without
necessarily impacting the peer process.

Because RMA operations allow a peer to directly access the memory of a
process, additional protection mechanisms are used to prevent unintentional
or unwanted access. RMA memory that is updated by a write operation or is
fetched by a read operation must be registered for access with the correct
permissions specified.

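A sketch of the two halves of an RMA write, assuming the target's registered key and buffer address are exchanged out of band (domain, ep, dest_addr, remote_addr, and the buffers are placeholders):

```c
#include <rdma/fi_rma.h>

/* Target: register the buffer and expose it for remote writes. */
struct fid_mr *mr;
fi_mr_reg(domain, target_buf, buf_len, FI_REMOTE_WRITE, 0, 0, 0, &mr, NULL);
uint64_t key = fi_mr_key(mr);   /* shared with the initiator out of band */

/* Initiator: place local data directly into the target's buffer. */
fi_write(ep, local_buf, xfer_len, NULL, dest_addr,
         remote_addr /* target buffer address or offset */, key, NULL);
```
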
## Atomic operations

Atomic transfers are used to read and update data located in remote memory
regions in an atomic fashion. Conceptually, they are similar to local atomic
operations of a similar nature (e.g. atomic increment, compare and swap, etc.).
The benefit of atomic operations is they enable offloading basic arithmetic
capabilities onto a NIC. Unlike other data transfer operations, which merely
need to transfer bytes of data, atomics require knowledge of the format of the
data being accessed.

A single atomic function operates across an array of data, applying an atomic
operation to each entry. The atomicity of an operation is limited to a single
data type or entry, however, not across the entire array. libfabric defines a
wide variety of atomic operations across all common data types. However,
support for a given operation is dependent on the provider implementation.

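For example, a sketch of atomically adding to a 64-bit counter in registered remote memory (remote_counter_addr and key are placeholders obtained as in the RMA case):

```c
#include <rdma/fi_atomic.h>

uint64_t addend = 1;

/* Apply FI_SUM to a single FI_UINT64 element at the remote address;
 * the update is atomic with respect to other fabric atomics on that entry.
 */
fi_atomic(ep, &addend, 1, NULL, dest_addr,
          remote_counter_addr, key, FI_UINT64, FI_SUM, NULL);
```
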
## Collective operations

In general, collective operations can be thought of as coordinated atomic
operations between a set of peer endpoints, almost like a multicast atomic
request. A single collective operation can result in data being collected
from multiple peers, combined using a set of atomic primitives, and the
results distributed to all peers. A collective operation is a group
communication exchange. It involves multiple peers exchanging data with other
peers participating in the collective call. Collective operations require
close coordination by all participating members, and collective calls can
strain the fabric, as well as local and remote data buffers.

Collective operations are an area of heavy research, with dedicated libraries
focused almost exclusively on implementing collective operations efficiently.
Such libraries are a specific target of libfabric. The main objective of the
libfabric collective APIs is to expose network acceleration features for
implementing collectives to higher-level libraries and applications. It is
recommended that applications needing collective communication target
higher-level libraries, such as MPI, instead of using libfabric collective
APIs for that purpose.