fi_peer(3)                   Libfabric v1.18.1                   fi_peer(3)

NAME
fi_export_fid / fi_import_fid
    Share a fabric object between different providers or resources

struct fid_peer_av
    An address vector sharable between independent providers

struct fid_peer_av_set
    An AV set sharable between independent providers

struct fid_peer_cq
    A completion queue that may be shared between independent providers

struct fid_peer_srx
    A shared receive context that may be shared between independent
    providers

SYNOPSIS
    #include <rdma/fabric.h>
    #include <rdma/fi_ext.h>
    #include <rdma/providers/fi_peer.h>

    int fi_export_fid(struct fid *fid, uint64_t flags,
                      struct fid **expfid, void *context);

    int fi_import_fid(struct fid *fid, struct fid *expfid, uint64_t flags);

ARGUMENTS
fid     Returned fabric identifier for opened object.

expfid  Exported fabric object that may be shared with another provider.

flags   Control flags for the operation.

context User defined context that will be associated with a fabric
        object.

DESCRIPTION
NOTICE: The peer APIs described by this man page are developmental and
may change between libfabric versions. The data structures and API
definitions should not be considered stable between versions.
Providers being used as peers must target the same libfabric version.

Functions defined in this man page are typically used by providers to
communicate with other providers, known as peer providers, or by other
libraries to communicate with the libfabric core, known as peer
libraries. Most middleware and applications should not need to access
this functionality, as the documentation mainly targets provider
developers.

Peer providers are a way for independently developed providers to be
used together in a tight fashion, such that layering overhead and
duplicate provider functionality can be avoided. Peer providers are
linked by having one provider export specific functionality to another.
This is done by having one provider export a sharable fabric object
(fid), which is imported by one or more peer providers.

As an example, a provider which uses TCP to communicate with remote
peers may wish to use the shared memory provider to communicate with
local peers. To remove layering overhead, the TCP based provider may
export its completion queue and shared receive context and import those
into the shared memory provider.

The general mechanisms used to share fabric objects between peer
providers are similar, independent of the object being shared.
However, because the goal of using peer providers is to avoid overhead,
providers must be explicitly written to support the peer provider
mechanisms.

There are two peer provider models. In the example listed above, both
peers are full providers in their own right and usable in a stand-alone
fashion. In a second model, one of the peers is known as an offload
provider. An offload provider implements a subset of the libfabric API
and targets the use of specific acceleration hardware. For example,
network switches may support collective operations, such as barrier or
broadcast. An offload provider may be written specifically to leverage
this capability; however, such a provider is not usable for general
purposes. As a result, an offload provider is paired with a main peer
provider.

PEER AV
The peer AV allows the sharing of addressing metadata between
providers. It specifically targets the use case of having a main
provider paired with an offload provider, where the offload provider
leverages the communication that has already been established through
the main provider. In other situations, such as the pairing of a tcp
provider with a shared memory provider mentioned above, each peer will
likely have its own AV that is not shared.

The setup for a peer AV is similar to the setup for a shared CQ,
described below. The owner of the AV creates a fid_peer_av object that
links back to its actual fid_av. The fid_peer_av is then imported by
the offload provider.

Peer AVs are configured by the owner calling the peer’s fi_av_open()
call, passing in the FI_PEER flag, and pointing the context parameter
to struct fi_peer_av_context.

The data structures to support peer AVs are:

    struct fid_peer_av;

    struct fi_ops_av_owner {
        size_t size;
        int (*query)(struct fid_peer_av *av, struct fi_av_attr *attr);
        fi_addr_t (*ep_addr)(struct fid_peer_av *av, struct fid_ep *ep);
    };

    struct fid_peer_av {
        struct fid fid;
        struct fi_ops_av_owner *owner_ops;
    };

    struct fi_peer_av_context {
        size_t size;
        struct fid_peer_av *av;
    };

fi_ops_av_owner::query()
This call returns current attributes for the peer AV. The owner sets
the fields of the input struct fi_av_attr based on the current state of
the AV for return to the caller.

fi_ops_av_owner::ep_addr()
This lookup function returns the fi_addr of the address associated with
the given local endpoint. If the address of the local endpoint has not
been inserted into the AV, the function should return FI_ADDR_NOTAVAIL.
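The owner side of the two callbacks above can be sketched as follows.
This is only an illustration under stated assumptions: the types are
trimmed local stand-ins so that the fragment compiles without libfabric
(real code would include rdma/providers/fi_peer.h), and every name
prefixed with ex_ or EX_, including the table-based AV, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the libfabric types named in this section. */
typedef uint64_t fi_addr_t;
#define EX_ADDR_NOTAVAIL ((fi_addr_t) -1)  /* stands in for FI_ADDR_NOTAVAIL */

struct fid { void *context; };
struct fid_ep { struct fid fid; };
struct fi_av_attr { size_t count; };       /* trimmed to a single field */
struct fid_peer_av;

struct fi_ops_av_owner {
    size_t size;
    int (*query)(struct fid_peer_av *av, struct fi_av_attr *attr);
    fi_addr_t (*ep_addr)(struct fid_peer_av *av, struct fid_ep *ep);
};

struct fid_peer_av {
    struct fid fid;
    struct fi_ops_av_owner *owner_ops;
};

/* Hypothetical owner AV: a small table mapping endpoints to fi_addr. */
#define EX_AV_MAX 8
struct ex_owner_av {
    struct fid_peer_av peer_av;    /* first member, so the casts are valid */
    size_t used;
    struct fid_ep *ep[EX_AV_MAX];
    fi_addr_t addr[EX_AV_MAX];
};

static int ex_av_query(struct fid_peer_av *av, struct fi_av_attr *attr)
{
    struct ex_owner_av *owner = (struct ex_owner_av *) av;

    attr->count = owner->used;     /* report the current AV state */
    return 0;
}

static fi_addr_t ex_av_ep_addr(struct fid_peer_av *av, struct fid_ep *ep)
{
    struct ex_owner_av *owner = (struct ex_owner_av *) av;
    size_t i;

    for (i = 0; i < owner->used; i++) {
        if (owner->ep[i] == ep)
            return owner->addr[i];
    }
    return EX_ADDR_NOTAVAIL;       /* endpoint was never inserted */
}

static struct fi_ops_av_owner ex_av_ops = {
    .size = sizeof(struct fi_ops_av_owner),
    .query = ex_av_query,
    .ep_addr = ex_av_ep_addr,
};

/* Owner initializes the sharable object before handing it to a peer. */
void ex_owner_av_init(struct ex_owner_av *owner)
{
    assert(owner != NULL);
    owner->peer_av.fid.context = owner;
    owner->peer_av.owner_ops = &ex_av_ops;
    owner->used = 0;
}

/* Record that local endpoint ep is known under addr. */
int ex_owner_av_insert(struct ex_owner_av *owner, struct fid_ep *ep,
                       fi_addr_t addr)
{
    if (owner->used == EX_AV_MAX)
        return -1;
    owner->ep[owner->used] = ep;
    owner->addr[owner->used] = addr;
    owner->used++;
    return 0;
}
```

An offload peer holding the fid_peer_av would reach these functions
only through owner_ops, for example
av->owner_ops->ep_addr(av, ep), and treat the not-available value as
"endpoint unknown to the owner".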

PEER AV SET
The peer AV set allows the sharing of collective addressing data
between providers. It specifically targets the use case pairing a main
provider with a collective offload provider. The setup for a peer AV
set is similar to a shared CQ, described below. The owner of the AV
set creates a fid_peer_av_set object that links back to its fid_av_set.
The fid_peer_av_set is imported by the offload provider.

Peer AV sets are configured by the owner calling the peer’s
fi_av_set_open() call, passing in the FI_PEER_AV flag, and pointing the
context parameter to struct fi_peer_av_set_context.

The data structures to support peer AV sets are:

    struct fi_ops_av_set_owner {
        size_t size;
        int (*members)(struct fid_peer_av_set *av, fi_addr_t *addr,
                       size_t *count);
    };

    struct fid_peer_av_set {
        struct fid fid;
        struct fi_ops_av_set_owner *owner_ops;
    };

    struct fi_peer_av_set_context {
        size_t size;
        struct fid_peer_av_set *av_set;
    };

fi_ops_av_set_owner::members()
This call returns an array of AV addresses that are members of the AV
set. On input, the count parameter specifies the size of the addr
array. On return, count is set to the number of addresses in the AV
set. If the input count value is too small, the function returns
-FI_ETOOSMALL. Otherwise, the function fills the addr array with the
fi_addr values of the set members.
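The count handling described above can be illustrated with a minimal
sketch. The types are local stand-ins, EX_ETOOSMALL is a placeholder
value rather than the real FI_ETOOSMALL, and all ex_ names are
hypothetical. The sketch also assumes that count is updated even when
the call fails with the too-small error, which lets the peer probe for
the required size; the text above does not state this explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t fi_addr_t;
#define EX_ETOOSMALL 1   /* placeholder value standing in for FI_ETOOSMALL */

struct fid { void *context; };
struct fid_peer_av_set;

struct fi_ops_av_set_owner {
    size_t size;
    int (*members)(struct fid_peer_av_set *av, fi_addr_t *addr,
                   size_t *count);
};

struct fid_peer_av_set {
    struct fid fid;
    struct fi_ops_av_set_owner *owner_ops;
};

/* Hypothetical owner-side AV set holding up to 16 members. */
struct ex_owner_av_set {
    struct fid_peer_av_set peer_set;   /* first member, so casts are valid */
    size_t n;
    fi_addr_t member[16];
};

static int ex_set_members(struct fid_peer_av_set *av, fi_addr_t *addr,
                          size_t *count)
{
    struct ex_owner_av_set *set = (struct ex_owner_av_set *) av;
    size_t avail = *count;

    *count = set->n;                   /* always report the real size */
    if (avail < set->n)
        return -EX_ETOOSMALL;
    if (set->n)
        memcpy(addr, set->member, set->n * sizeof(*addr));
    return 0;
}

static struct fi_ops_av_set_owner ex_set_ops = {
    .size = sizeof(struct fi_ops_av_set_owner),
    .members = ex_set_members,
};

void ex_owner_av_set_init(struct ex_owner_av_set *set)
{
    assert(set != NULL);
    set->peer_set.fid.context = set;
    set->peer_set.owner_ops = &ex_set_ops;
    set->n = 0;
}

/* Peer-side helper: probe for the size, then fetch into a heap buffer.
 * Returns NULL if the set is empty or on error; caller frees the result. */
fi_addr_t *ex_peer_get_members(struct fid_peer_av_set *set, size_t *n)
{
    fi_addr_t *buf;
    int ret;

    *n = 0;
    ret = set->owner_ops->members(set, NULL, n);
    if (ret != -EX_ETOOSMALL)
        return NULL;                   /* empty set */
    buf = malloc(*n * sizeof(*buf));
    if (!buf)
        return NULL;
    ret = set->owner_ops->members(set, buf, n);
    if (ret) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

The probe-then-fetch pattern keeps the peer from guessing a buffer
size; a peer that already knows an upper bound could call members()
once with a fixed array instead.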

PEER CQ
The peer CQ defines a mechanism by which a peer provider may insert
completions into the CQ owned by another provider. This avoids the
overhead of the libfabric user needing to access multiple CQs.

To set up a peer CQ, a provider creates a fid_peer_cq object, which
links back to the provider’s actual fid_cq. The fid_peer_cq object is
then imported by a peer provider. The fid_peer_cq defines callbacks
that the providers use to communicate with each other. The provider
that allocates the fid_peer_cq is known as the owner, with the other
provider referred to as the peer. An owner may set up peer
relationships with multiple providers.

Peer CQs are configured by the owner calling the peer’s fi_cq_open()
call. The owner passes in the FI_PEER flag to fi_cq_open(). When
FI_PEER is specified, the context parameter passed into fi_cq_open()
must reference a struct fi_peer_cq_context. Providers that do not
support peer CQs must fail the fi_cq_open() call with -FI_EINVAL
(indicating an invalid flag). The fid_peer_cq referenced by struct
fi_peer_cq_context must remain valid until the peer’s CQ is closed.

The data structures to support peer CQs are defined as follows:

    struct fi_ops_cq_owner {
        size_t size;
        ssize_t (*write)(struct fid_peer_cq *cq, void *context,
                         uint64_t flags, size_t len, void *buf,
                         uint64_t data, uint64_t tag, fi_addr_t src);
        ssize_t (*writeerr)(struct fid_peer_cq *cq,
                            const struct fi_cq_err_entry *err_entry);
    };

    struct fid_peer_cq {
        struct fid fid;
        struct fi_ops_cq_owner *owner_ops;
    };

    struct fi_peer_cq_context {
        size_t size;
        struct fid_peer_cq *cq;
    };

For struct fid_peer_cq, the owner initializes the fid and owner_ops
fields. struct fi_ops_cq_owner is used by the peer to communicate with
the owning provider.

If manual progress is needed on the peer CQ, the owner should drive
progress by calling fi_cq_read() on the peer’s CQ with the buf
parameter set to NULL and count equal to 0. The peer provider should
set the other functions that attempt to read the peer’s CQ
(e.g. fi_cq_readerr, fi_cq_sread) to return -FI_ENOSYS.

fi_ops_cq_owner::write()
This call directs the owner to insert new completions into the CQ. The
fi_cq_attr::format field, along with other related attributes,
determines which input parameters are valid. Parameters that are not
reported as part of a completion are ignored by the owner, and should
be set to 0, NULL, or other appropriate value by the caller. For
example, if source addressing is not returned with a completion, then
the src parameter should be set to FI_ADDR_NOTAVAIL and is ignored on
input.

The owner is responsible for locking, event signaling, and handling CQ
overflow. Data passed through the write callback is relative to the
peer. For example, the fi_addr_t is relative to the peer’s AV. The
owner is responsible for converting the address if source addressing is
needed.
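A minimal sketch of an owner-side write() and writeerr() is shown
below, assuming a fixed-depth ring as the owner’s CQ storage. The
types are trimmed local stand-ins, the ex_ and EX_ names are
hypothetical, and returning a placeholder "try again" error on
overflow is only one possible policy, since overflow handling is left
to the owner.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>                 /* ssize_t */

typedef uint64_t fi_addr_t;
#define EX_EAGAIN 11                   /* placeholder error value */

struct fid { void *context; };
struct fid_peer_cq;
struct fi_cq_err_entry;                /* left opaque in this sketch */

struct fi_ops_cq_owner {
    size_t size;
    ssize_t (*write)(struct fid_peer_cq *cq, void *context, uint64_t flags,
                     size_t len, void *buf, uint64_t data, uint64_t tag,
                     fi_addr_t src);
    ssize_t (*writeerr)(struct fid_peer_cq *cq,
                        const struct fi_cq_err_entry *err_entry);
};

struct fid_peer_cq {
    struct fid fid;
    struct fi_ops_cq_owner *owner_ops;
};

/* Hypothetical completion format: context, flags and length only. */
struct ex_comp { void *context; uint64_t flags; size_t len; };

#define EX_CQ_DEPTH 4
struct ex_owner_cq {
    struct fid_peer_cq peer_cq;        /* first member, so casts are valid */
    struct ex_comp ring[EX_CQ_DEPTH];
    size_t head, tail;                 /* tail - head == queued entries */
    size_t errors;
};

static ssize_t ex_cq_write(struct fid_peer_cq *cq, void *context,
                           uint64_t flags, size_t len, void *buf,
                           uint64_t data, uint64_t tag, fi_addr_t src)
{
    struct ex_owner_cq *owner = (struct ex_owner_cq *) cq;
    struct ex_comp *comp;

    /* Not reported by this completion format, so ignored on input. */
    (void) buf; (void) data; (void) tag; (void) src;

    /* A real owner would take its CQ lock here. */
    if (owner->tail - owner->head == EX_CQ_DEPTH)
        return -EX_EAGAIN;             /* overflow policy is an assumption */
    comp = &owner->ring[owner->tail++ % EX_CQ_DEPTH];
    comp->context = context;
    comp->flags = flags;
    comp->len = len;
    /* A real owner would signal waiters and release the lock here. */
    return 0;
}

static ssize_t ex_cq_writeerr(struct fid_peer_cq *cq,
                              const struct fi_cq_err_entry *err_entry)
{
    (void) err_entry;                  /* a real owner would queue it */
    ((struct ex_owner_cq *) cq)->errors++;
    return 0;
}

static struct fi_ops_cq_owner ex_cq_ops = {
    .size = sizeof(struct fi_ops_cq_owner),
    .write = ex_cq_write,
    .writeerr = ex_cq_writeerr,
};

void ex_owner_cq_init(struct ex_owner_cq *owner)
{
    assert(owner != NULL);
    owner->peer_cq.fid.context = owner;
    owner->peer_cq.owner_ops = &ex_cq_ops;
    owner->head = owner->tail = 0;
    owner->errors = 0;
}

/* Application-facing read on the owner; returns 1 if an entry was copied. */
int ex_owner_cq_read(struct ex_owner_cq *owner, struct ex_comp *out)
{
    if (owner->head == owner->tail)
        return 0;
    *out = owner->ring[owner->head++ % EX_CQ_DEPTH];
    return 1;
}
```

Because the application only ever reads from the owner, completions
written by the peer and completions generated by the owner itself end
up in the same queue and in a single order.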

(TBD: should CQ overflow push back to the user for flow control? Do we
need backoff / resume callbacks in ops_cq_user?)

fi_ops_cq_owner::writeerr()
The behavior of this call is similar to the write() ops. It inserts a
completion indicating that a data transfer has failed into the CQ.

EXAMPLE PEER CQ SETUP
The above description defines the generic mechanism for sharing CQs
between providers. This section outlines one possible implementation
to demonstrate the use of the APIs. In the example, provider A uses
provider B as a peer for data transfers targeting endpoints on the
local node.

    1. Provider A is configured to use provider B as a peer. This may
       be coded into provider A or set through an environment variable.
    2. The application calls:
           fi_cq_open(domain_a, attr, &cq_a, app_context)
    3. Provider A allocates cq_a and automatically configures it to be
       used as a peer cq.
    4. Provider A takes these steps:
           allocate peer_cq and reference cq_a
           set peer_cq_context->cq = peer_cq
           set attr_b.flags |= FI_PEER
           fi_cq_open(domain_b, attr_b, &cq_b, peer_cq_context)
    5. Provider B allocates a cq, but configures it such that all
       completions are written to the peer_cq. The cq ops to read from
       the cq are set to enosys calls.
    6. Provider B inserts its own callbacks into the peer_cq object. It
       creates a reference between the peer_cq object and its own cq.
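Steps 3 through 5 above can be sketched in code. This is a simplified
illustration, not provider source: ex_b_cq_open() stands in for
provider B’s handling of fi_cq_open() and takes only the flags and
context, EX_PEER is a placeholder bit standing in for FI_PEER, the
error values are placeholders, and all ex_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>                 /* ssize_t */

typedef uint64_t fi_addr_t;
#define EX_PEER   (1ULL << 0)          /* placeholder bit for FI_PEER */
#define EX_ENOMEM 12                   /* placeholder error values */
#define EX_EINVAL 22
#define EX_ENOSYS 38

struct fid { void *context; };
struct fid_peer_cq;

struct fi_ops_cq_owner {               /* writeerr omitted for brevity */
    size_t size;
    ssize_t (*write)(struct fid_peer_cq *cq, void *context, uint64_t flags,
                     size_t len, void *buf, uint64_t data, uint64_t tag,
                     fi_addr_t src);
};
struct fid_peer_cq { struct fid fid; struct fi_ops_cq_owner *owner_ops; };
struct fi_peer_cq_context { size_t size; struct fid_peer_cq *cq; };

/* Provider A (owner): cq_a embeds the sharable peer_cq. */
struct ex_cq_a {
    struct fid_peer_cq peer_cq;
    size_t completions;
    void *last_context;
};

static ssize_t ex_a_write(struct fid_peer_cq *cq, void *context,
                          uint64_t flags, size_t len, void *buf,
                          uint64_t data, uint64_t tag, fi_addr_t src)
{
    struct ex_cq_a *cq_a = cq->fid.context;    /* link back to cq_a */

    (void) flags; (void) len; (void) buf; (void) data; (void) tag; (void) src;
    cq_a->completions++;
    cq_a->last_context = context;
    return 0;
}

static struct fi_ops_cq_owner ex_a_ops = {
    .size = sizeof(struct fi_ops_cq_owner),
    .write = ex_a_write,
};

/* Provider B (peer): its cq only forwards to the imported peer_cq. */
struct ex_cq_b { struct fid_peer_cq *peer_cq; };

/* This sketch supports peer mode only, so a missing flag is an error. */
int ex_b_cq_open(uint64_t flags, void *context, struct ex_cq_b **cq_b)
{
    struct fi_peer_cq_context *ctx = context;
    struct ex_cq_b *cq;

    if (!(flags & EX_PEER) || !ctx || ctx->size < sizeof(*ctx) || !ctx->cq)
        return -EX_EINVAL;
    cq = calloc(1, sizeof(*cq));
    if (!cq)
        return -EX_ENOMEM;
    cq->peer_cq = ctx->cq;             /* must outlive cq_b */
    *cq_b = cq;
    return 0;
}

/* Step 5: blocking reads on B's cq are not supported. */
ssize_t ex_b_cq_sread(struct ex_cq_b *cq_b, void *buf, size_t count)
{
    (void) cq_b; (void) buf; (void) count;
    return -EX_ENOSYS;
}

/* Step 5: B reports a finished transfer through the owner's write(). */
ssize_t ex_b_complete(struct ex_cq_b *cq_b, void *op_context, size_t len)
{
    struct fid_peer_cq *peer_cq = cq_b->peer_cq;

    /* (fi_addr_t) -1 stands in for FI_ADDR_NOTAVAIL. */
    return peer_cq->owner_ops->write(peer_cq, op_context, 0, len, NULL,
                                     0, 0, (fi_addr_t) -1);
}

/* Steps 3 and 4: A prepares peer_cq and opens B's cq with the peer flag. */
int ex_a_setup(struct ex_cq_a *cq_a, struct ex_cq_b **cq_b)
{
    struct fi_peer_cq_context ctx;

    assert(cq_a != NULL && cq_b != NULL);
    cq_a->completions = 0;
    cq_a->last_context = NULL;
    cq_a->peer_cq.fid.context = cq_a;
    cq_a->peer_cq.owner_ops = &ex_a_ops;
    ctx.size = sizeof(ctx);
    ctx.cq = &cq_a->peer_cq;
    return ex_b_cq_open(EX_PEER, &ctx, cq_b);
}
```

Note that only the fid_peer_cq, not the context structure, has to stay
valid after the open call returns, which is why the context can live
on the stack in ex_a_setup().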

PEER DOMAIN
The peer domain allows a provider to access the operations of a domain
object of its peer. For example, an offload provider can use a peer
domain to register memory buffers with the main provider.

The setup of a peer domain is similar to the setup for a peer CQ
outlined above. The owner’s domain object is imported directly into
the peer.

Peer domains are configured by the owner calling the peer’s
fi_domain2() call. The owner passes in the FI_PEER flag to
fi_domain2(). When FI_PEER is specified, the context parameter passed
into fi_domain2() must reference a struct fi_peer_domain_context.
Providers that do not support peer domains must fail the fi_domain2()
call with -FI_EINVAL. The fid_domain referenced by struct
fi_peer_domain_context must remain valid until the peer’s domain is
closed.

The data structures to support peer domains are defined as follows:

    struct fi_peer_domain_context {
        size_t size;
        struct fid_domain *domain;
    };

PEER EQ
The peer EQ defines a mechanism by which a peer provider may insert
events into the EQ owned by another provider. This avoids the overhead
of the libfabric user needing to access multiple EQs.

The setup of a peer EQ is similar to the setup for a peer CQ outlined
above. The owner’s EQ object is imported directly into the peer
provider.

Peer EQs are configured by the owner calling the peer’s fi_eq_open()
call. The owner passes in the FI_PEER flag to fi_eq_open(). When
FI_PEER is specified, the context parameter passed into fi_eq_open()
must reference a struct fi_peer_eq_context. Providers that do not
support peer EQs must fail the fi_eq_open() call with -FI_EINVAL
(indicating an invalid flag). The fid_eq referenced by struct
fi_peer_eq_context must remain valid until the peer’s EQ is closed.

The data structures to support peer EQs are defined as follows:

    struct fi_peer_eq_context {
        size_t size;
        struct fid_eq *eq;
    };

PEER SRX
The peer SRX defines a mechanism by which peer providers may share a
common shared receive context. This avoids the overhead of having
separate receive queues, can eliminate memory copies, and ensures
correct application level message ordering.

The setup of a peer SRX is similar to the setup for a peer CQ outlined
above. A fid_peer_srx object links the owner of the SRX with the peer
provider. Peer SRXs are configured by the owner calling the peer’s
fi_srx_context() call with the FI_PEER flag set. The context parameter
passed to fi_srx_context() must be a struct fi_peer_srx_context.

The owner provider initializes all elements of the fid_peer_srx and
referenced structures (fi_ops_srx_owner and fi_ops_srx_peer), with the
exception of the fi_ops_srx_peer callback functions. Those must be
initialized by the peer provider prior to returning from the
fi_srx_context() call and are used by the owner to control peer
actions.

The data structures to support peer SRXs are defined as follows:

    struct fid_peer_srx;

    /* Castable to dlist_entry */
    struct fi_peer_rx_entry {
        struct fi_peer_rx_entry *next;
        struct fi_peer_rx_entry *prev;
        struct fid_peer_srx *srx;
        fi_addr_t addr;
        size_t size;
        uint64_t tag;
        uint64_t flags;
        void *context;
        size_t count;
        void **desc;
        void *peer_context;
        void *owner_context;
        struct iovec *iov;
    };

    struct fi_ops_srx_owner {
        size_t size;
        int (*get_msg)(struct fid_peer_srx *srx, fi_addr_t addr,
                       size_t size, struct fi_peer_rx_entry **entry);
        int (*get_tag)(struct fid_peer_srx *srx, fi_addr_t addr,
                       uint64_t tag, struct fi_peer_rx_entry **entry);
        int (*queue_msg)(struct fi_peer_rx_entry *entry);
        int (*queue_tag)(struct fi_peer_rx_entry *entry);
        void (*free_entry)(struct fi_peer_rx_entry *entry);
    };

    struct fi_ops_srx_peer {
        size_t size;
        int (*start_msg)(struct fi_peer_rx_entry *entry);
        int (*start_tag)(struct fi_peer_rx_entry *entry);
        int (*discard_msg)(struct fi_peer_rx_entry *entry);
        int (*discard_tag)(struct fi_peer_rx_entry *entry);
    };

    struct fid_peer_srx {
        struct fid_ep ep_fid;
        struct fi_ops_srx_owner *owner_ops;
        struct fi_ops_srx_peer *peer_ops;
    };

    struct fi_peer_srx_context {
        size_t size;
        struct fid_peer_srx *srx;
    };

The ownership of structure field values and callback functions is
similar to those defined for peer CQs, relative to owner versus peer
ops.

fi_peer_rx_entry
fi_peer_rx_entry defines a common receive entry for use between the
owner and peer. The entry is allocated and set by the owner and passed
between owner and peer to communicate details of the application-posted
receive entry. All fields are only modifiable by the owner, except for
the peer_context, which is provided for the peer to save peer-specific
information for unexpected message processing. Similarly, the
owner_context can be used by the owner as needed for storing extra
owner-specific information.

fi_ops_srx_owner::get_msg() / get_tag()
These calls are invoked by the peer provider to obtain the receive
buffer(s) where an incoming message should be placed. The peer
provider will pass in the relevant fields to request a matching
rx_entry from the owner. If source addressing is required, the addr
will be passed in; otherwise, the address will be set to
FI_ADDR_NOTAVAIL. The size field indicates the received message size.
This field is used by the owner when handling multi-receive data
buffers, but may be ignored otherwise. The peer provider is
responsible for checking that an incoming message fits within the
provided buffer space. The tag parameter is used for tagged messages.

An fi_peer_rx_entry is allocated by the owner, whether or not a match
was found. If a match was found, the owner will return FI_SUCCESS and
the rx_entry will be filled in with the appropriate receive fields for
the peer to process accordingly. If no match was found, the owner will
return -FI_ENOENT; the rx_entry will still be valid but will not match
an existing posted receive. When the peer gets -FI_ENOENT, it should
allocate whatever resources it needs to process the message later (on
start_msg/tag) and set the rx_entry->peer_context appropriately,
followed by a call to the owner’s queue_msg/tag. The get and queue
calls should be serialized. When the owner gets a matching receive for
the queued unexpected message, it will call the peer’s start function
to notify the peer of the updated rx_entry (or the peer’s discard
function if the message is to be discarded). (TBD: The peer may need
to update the src addr if the remote endpoint is inserted into the AV
after the message has been received.)

fi_ops_srx_peer::start_msg() / start_tag()
These calls indicate that an asynchronous get_msg() or get_tag() has
completed and a buffer is now available to receive the message.
Control of the fi_peer_rx_entry is returned to the peer provider and
has been initialized for receiving the incoming message.

fi_ops_srx_peer::discard_msg() / discard_tag()
Indicates that the message and data associated with the specified
fi_peer_rx_entry should be discarded. This often indicates that the
application has canceled or discarded the receive operation. No
completion should be generated by the peer provider for a discarded
message. Control of the fi_peer_rx_entry is returned to the peer
provider.

EXAMPLE PEER SRX SETUP
The above description defines the generic mechanism for sharing SRXs
between providers. This section outlines one possible implementation
to demonstrate the use of the APIs. In the example, provider A uses
provider B as a peer for data transfers targeting endpoints on the
local node.

    1. Provider A is configured to use provider B as a peer. This may
       be coded into provider A or set through an environment variable.
    2. The application calls:
           fi_srx_context(domain_a, attr, &srx_a, app_context)
    3. Provider A allocates srx_a and automatically configures it to be
       used as a peer srx.
    4. Provider A takes these steps:
           allocate peer_srx and reference srx_a
           set peer_srx_context->srx = peer_srx
           set attr_b.flags |= FI_PEER
           fi_srx_context(domain_b, attr_b, &srx_b, peer_srx_context)
    5. Provider B allocates an srx, but configures it such that all
       receive buffers are obtained from the peer_srx. The srx ops to
       post receives are set to enosys calls.
    6. Provider B inserts its own callbacks into the peer_srx object.
       It creates a reference between the peer_srx object and its own
       srx.

EXAMPLE PEER SRX RECEIVE FLOW
The following outlines show simplified, example software flows for
receive message handling using a peer SRX. The first flow demonstrates
the case where a receive buffer is waiting when the message arrives.

    1. Application calls fi_recv() / fi_trecv() on owner.
    2. Owner queues the receive buffer.
    3. A message is received by the peer provider.
    4. The peer calls owner->get_msg() / get_tag().
    5. The owner removes the queued receive buffer and returns it to
       the peer. The get entry call will complete with FI_SUCCESS.
    6. When the peer finishes processing the message and completes it
       on its own CQ, the peer will call free_entry to free the entry
       with the owner.

The second case below shows the flow when a message arrives before the
application has posted the matching receive buffer.

    1. A message is received by the peer provider.
    2. The peer calls owner->get_msg() / get_tag().
    3. The owner fails to find a matching receive buffer.
    4. The owner allocates a rx_entry with any known fields and returns
       -FI_ENOENT.
    5. The peer allocates any resources needed to handle the
       asynchronous processing of the unexpected message and sets
       peer_context accordingly.
    6. The peer calls the owner's queue_msg() / queue_tag() function
       when it is ready to queue the unexpected message.
    7. The application calls fi_recv() / fi_trecv() on owner, posting
       the matching receive buffer.
    8. The owner matches the receive with the queued message on the
       peer.
    9. The owner removes the queued request, fills in the rest of the
       known fields and calls the peer->start_msg() / start_tag()
       function.
    10. When the peer finishes processing the message and completes it
        on its own CQ, the peer will call free_entry to free the entry
        with the owner.
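Both flows can be traced with a small sketch. It is deliberately
reduced and rests on several assumptions: the structures are trimmed
stand-ins with hypothetical ex_ names, the owner holds at most one
posted receive and one unexpected message, matching ignores addresses
and tags, the owner functions are called directly instead of through
owner_ops, and the peer keeps a pointer to the message data where a
real peer would copy or pin it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define EX_ENOENT 2                    /* placeholder error values */
#define EX_ENOMEM 12

struct ex_srx;
struct ex_rx_entry {                   /* trimmed fi_peer_rx_entry stand-in */
    struct ex_srx *srx;
    size_t size;                       /* received message size */
    void *buf;                         /* stands in for iov / count */
    size_t buf_len;
    void *peer_context;
};

struct ex_peer_ops { int (*start_msg)(struct ex_rx_entry *entry); };

struct ex_srx {
    struct ex_peer_ops *peer_ops;
    void *posted_buf;                  /* one posted receive, or NULL */
    size_t posted_len;
    struct ex_rx_entry *unexp;         /* one queued unexpected message */
};

/* Owner: always allocates an entry; -EX_ENOENT means "no match yet". */
int ex_get_msg(struct ex_srx *srx, size_t size, struct ex_rx_entry **entry)
{
    struct ex_rx_entry *e = calloc(1, sizeof(*e));

    if (!e)
        return -EX_ENOMEM;
    e->srx = srx;
    e->size = size;
    *entry = e;
    if (!srx->posted_buf)
        return -EX_ENOENT;
    e->buf = srx->posted_buf;
    e->buf_len = srx->posted_len;
    srx->posted_buf = NULL;
    return 0;
}

int ex_queue_msg(struct ex_rx_entry *entry)
{
    entry->srx->unexp = entry;
    return 0;
}

void ex_free_entry(struct ex_rx_entry *entry)
{
    free(entry);
}

/* Owner side of fi_recv(): match a queued message or post the buffer. */
int ex_owner_recv(struct ex_srx *srx, void *buf, size_t len)
{
    struct ex_rx_entry *e = srx->unexp;

    if (!e) {
        srx->posted_buf = buf;
        srx->posted_len = len;
        return 0;
    }
    srx->unexp = NULL;
    e->buf = buf;
    e->buf_len = len;
    return srx->peer_ops->start_msg(e);
}

/* Peer: message saved while waiting for a matching receive. */
struct ex_pending { const void *data; size_t len; };

/* Peer checks the fit (truncating here), then returns the entry. */
static void ex_peer_copy(struct ex_rx_entry *e, const void *data, size_t len)
{
    memcpy(e->buf, data, len < e->buf_len ? len : e->buf_len);
    ex_free_entry(e);
}

int ex_peer_start_msg(struct ex_rx_entry *e)
{
    struct ex_pending *p = e->peer_context;

    ex_peer_copy(e, p->data, p->len);
    free(p);
    return 0;
}

/* Peer: called when a message arrives from the wire. */
int ex_peer_on_message(struct ex_srx *srx, const void *data, size_t len)
{
    struct ex_rx_entry *e;
    struct ex_pending *p;
    int ret = ex_get_msg(srx, len, &e);

    if (ret == 0) {                    /* first flow: buffer was waiting */
        ex_peer_copy(e, data, len);
        return 0;
    }
    if (ret != -EX_ENOENT)
        return ret;
    p = malloc(sizeof(*p));            /* second flow: unexpected message */
    if (!p) {
        ex_free_entry(e);
        return -EX_ENOMEM;
    }
    p->data = data;
    p->len = len;
    e->peer_context = p;
    return ex_queue_msg(e);
}
```

In the second flow the entry changes hands twice: the peer gives it to
the owner with the queue call, and the owner gives it back through
start_msg() once the application posts the receive.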

FI_EXPORT_FID / FI_IMPORT_FID
The fi_export_fid function is reserved for future use.

The fi_import_fid call may be used to import a fabric object created
and owned by the libfabric user. This allows upper level libraries or
the application to override or define low-level libfabric behavior.
Details on specific uses of fi_import_fid are outside the scope of this
documentation.

PEER TRANSFER
Providers frequently send control messages to their remote counterparts
as part of their wire protocol. For example, a provider may send an
ACK message to guarantee reliable delivery of a message or to meet a
requested completion semantic. When two or more providers are
coordinating as peers, it can be more efficient if control messages for
both peer providers go over the same transport. In some cases, such as
when one of the peers is an offload provider, it may even be required.
Peer transfers define the mechanism by which such communication occurs.

Peer transfers enable one peer to send and receive data transfers over
its associated peer. Providers that require this functionality
indicate this by setting the FI_PEER_TRANSFER flag as a mode bit,
i.e. fi_info::mode.

To use such a provider as a peer, the main, or owner, provider must
set up peer transfers by opening a peer transfer endpoint and accepting
transfers with this flag set. Setup of peer transfers involves the
following data structures:

    struct fi_ops_transfer_peer {
        size_t size;
        ssize_t (*complete)(struct fid_ep *ep,
                            struct fi_cq_tagged_entry *buf,
                            fi_addr_t *src_addr);
        ssize_t (*comperr)(struct fid_ep *ep, struct fi_cq_err_entry *buf);
    };

    struct fi_peer_transfer_context {
        size_t size;
        struct fi_info *info;
        struct fid_ep *ep;
        struct fi_ops_transfer_peer *peer_ops;
    };

Peer transfer contexts form a virtual link between endpoints allocated
on each of the peer providers. The setup of a peer transfer context
occurs through the fi_endpoint() API. The main provider calls
fi_endpoint() with the FI_PEER_TRANSFER mode bit set in the info
parameter, and the context parameter must reference the struct
fi_peer_transfer_context defined above.

The size field indicates the size of struct fi_peer_transfer_context
being passed to the peer. This is used for backward compatibility.
The info field is optional. If given, it defines the attributes of the
main provider’s objects. It may be used to report the capabilities and
restrictions on peer transfers, such as whether memory registration is
required, maximum message sizes, data and completion ordering
semantics, and so forth. If the importing provider cannot meet these
restrictions, it must fail the fi_endpoint() call.

The peer_ops field contains callbacks from the main provider into the
peer and is used to report the completion (success or failure) of peer
initiated data transfers. The callback functions defined in struct
fi_ops_transfer_peer must be set by the peer provider before returning
from the fi_endpoint() call. Actions that the peer provider can take
from within the completion callbacks are mostly unrestricted, and can
include any of the following types of operations: initiation of
additional data transfers, writing events to the owner’s CQ or EQ, and
memory registration/deregistration. The owner must ensure that
deadlock cannot occur prior to invoking the peer’s callback should the
peer invoke any of these operations. Further, the owner must avoid
recursive calls into the completion callbacks.
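One way an owner might avoid recursive calls into the completion
callbacks is to queue completions that arise while a callback is
already running and drain them from the outermost invocation. The
sketch below illustrates that idea only; it is an assumption about a
possible design, not the behavior of any existing provider. The
callback type is reduced to an integer transfer id instead of the
fi_ops_transfer_peer::complete() signature, and all ex_ names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct ex_ep;

/* Stands in for fi_ops_transfer_peer::complete(), reduced to an id. */
typedef void (*ex_complete_fn)(struct ex_ep *ep, int id);

#define EX_PENDING_MAX 16
struct ex_ep {
    ex_complete_fn complete;           /* set by the peer at endpoint open */
    int in_callback;                   /* owner is inside a peer callback */
    int pending[EX_PENDING_MAX];       /* completions not yet reported */
    size_t npending;
    int depth, max_depth, handled;     /* instrumentation for the example */
};

/* Owner: report that peer-initiated transfer id has completed. */
int ex_owner_report(struct ex_ep *ep, int id)
{
    size_t i;

    assert(ep != NULL && ep->complete != NULL);
    if (ep->npending == EX_PENDING_MAX)
        return -1;
    ep->pending[ep->npending++] = id;
    if (ep->in_callback)
        return 0;                      /* the outer call will drain it */

    ep->in_callback = 1;
    /* npending may grow while callbacks run; the loop picks that up. */
    for (i = 0; i < ep->npending; i++)
        ep->complete(ep, ep->pending[i]);
    ep->npending = 0;
    ep->in_callback = 0;
    return 0;
}

/* Hypothetical peer callback: finishing transfer 1 starts transfer 2,
 * which this toy owner completes immediately. */
void ex_peer_complete(struct ex_ep *ep, int id)
{
    ep->depth++;
    if (ep->depth > ep->max_depth)
        ep->max_depth = ep->depth;
    ep->handled++;
    if (id == 1)
        ex_owner_report(ep, 2);
    ep->depth--;
}
```

With this arrangement the follow-up completion is delivered after the
first callback has returned, so the callback nesting depth never
exceeds one. A real owner would also need to consider locking around
the pending list.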

RETURN VALUE
Returns FI_SUCCESS on success. On error, a negative value
corresponding to fabric errno is returned. Fabric errno values are
defined in rdma/fi_errno.h.

SEE ALSO
fi_provider(7), fi_provider(3), fi_cq(3)

AUTHORS
OpenFabrics.

Libfabric Programmer’s Manual       2023-03-14               fi_peer(3)