fi_peer(3)                     Libfabric v1.17.0                     fi_peer(3)

NAME
       fi_export_fid / fi_import_fid
       Share a fabric object between different providers or resources

       struct fid_peer_av
              An address vector sharable between independent providers

       struct fid_peer_av_set
              An AV set sharable between independent providers

       struct fid_peer_cq
              A completion queue that may be shared between independent
              providers

       struct fid_peer_srx
              A shared receive context that may be shared between independent
              providers

SYNOPSIS
       #include <rdma/fabric.h>
       #include <rdma/fi_ext.h>

       int fi_export_fid(struct fid *fid, uint64_t flags,
           struct fid **expfid, void *context);

       int fi_import_fid(struct fid *fid, struct fid *expfid, uint64_t flags);

ARGUMENTS
       fid    Returned fabric identifier for the opened object.

       expfid Exported fabric object that may be shared with another
              provider.

       flags  Control flags for the operation.

       context
              User-defined context that will be associated with the fabric
              object.

DESCRIPTION
       NOTICE: The peer APIs described by this man page are developmental
       and may change between libfabric versions. The data structures and
       API definitions should not be considered stable between versions.
       Providers being used as peers must target the same libfabric version.
       Functions defined in this man page are typically used by providers to
       communicate with other providers, known as peer providers, or by
       other libraries to communicate with the libfabric core, known as peer
       libraries. Most middleware and applications should not need to access
       this functionality; this documentation mainly targets provider
       developers.

       Peer providers are a way for independently developed providers to be
       used together in a tightly coupled fashion, such that layering
       overhead and duplicate provider functionality can be avoided. Peer
       providers are linked by having one provider export specific
       functionality to another. This is done by having one provider export
       a sharable fabric object (fid), which is imported by one or more peer
       providers.

       As an example, a provider which uses TCP to communicate with remote
       peers may wish to use the shared memory provider to communicate with
       local peers. To remove layering overhead, the TCP based provider may
       export its completion queue and shared receive context and import
       those into the shared memory provider.

       The general mechanisms used to share fabric objects between peer
       providers are similar, independent of the object being shared.
       However, because the goal of using peer providers is to avoid
       overhead, providers must be explicitly written to support the peer
       provider mechanisms.

       There are two peer provider models. In the example listed above, both
       peers are full providers in their own right and usable in a
       stand-alone fashion. In a second model, one of the peers is known as
       an offload provider. An offload provider implements a subset of the
       libfabric API and targets the use of specific acceleration hardware.
       For example, network switches may support collective operations, such
       as barrier or broadcast. An offload provider may be written
       specifically to leverage this capability; however, such a provider is
       not usable for general purposes. As a result, an offload provider is
       paired with a main peer provider.

PEER AV
       The peer AV allows the sharing of addressing metadata between
       providers. It specifically targets the use case of having a main
       provider paired with an offload provider, where the offload provider
       leverages the communication that has already been established through
       the main provider. In other situations, such as the tcp and shared
       memory pairing mentioned above, each peer will likely have its own AV
       that is not shared.

       The setup for a peer AV is similar to the setup for a shared CQ,
       described below. The owner of the AV creates a fid_peer_av object
       that links back to its actual fid_av. The fid_peer_av is then
       imported by the offload provider.

       Peer AVs are configured by the owner calling the peer's fi_av_open()
       call, passing in the FI_PEER flag, and pointing the context parameter
       to struct fi_peer_av_context.
       The data structures to support peer AVs are:

       struct fid_peer_av;

       struct fi_ops_av_owner {
           size_t size;
           int (*query)(struct fid_peer_av *av, struct fi_av_attr *attr);
           fi_addr_t (*ep_addr)(struct fid_peer_av *av, struct fid_ep *ep);
       };

       struct fid_peer_av {
           struct fid fid;
           struct fi_ops_av_owner *owner_ops;
       };

       struct fi_peer_av_context {
           size_t size;
           struct fid_peer_av *av;
       };

       fi_ops_av_owner::query()
       This call returns the current attributes of the peer AV. The owner
       sets the fields of the input struct fi_av_attr based on the current
       state of the AV for return to the caller.

       fi_ops_av_owner::ep_addr()
       This lookup function returns the fi_addr of the address associated
       with the given local endpoint. If the address of the local endpoint
       has not been inserted into the AV, the function should return
       FI_ADDR_NOTAVAIL.
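
       As a point of reference, the owner-side callbacks above can be
       sketched with a toy implementation. All types below are trimmed
       stand-ins for the real definitions in <rdma/fabric.h> and
       <rdma/fi_ext.h>, and mock_av / demo_peer_av are hypothetical names
       used only for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Trimmed stand-ins for the libfabric types used by the peer AV API */
typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t) -1)

struct fid { void *context; };
struct fid_ep { struct fid fid; };
struct fi_av_attr { size_t count; };   /* trimmed to one field */

struct fid_peer_av;

struct fi_ops_av_owner {
    size_t size;
    int (*query)(struct fid_peer_av *av, struct fi_av_attr *attr);
    fi_addr_t (*ep_addr)(struct fid_peer_av *av, struct fid_ep *ep);
};

struct fid_peer_av {
    struct fid fid;
    struct fi_ops_av_owner *owner_ops;
};

/* Hypothetical owner AV: a flat table mapping endpoints to addresses.
 * The exported fid_peer_av is the first member so the owner can cast
 * back to its private structure inside the callbacks. */
#define MOCK_AV_SIZE 4
struct mock_av {
    struct fid_peer_av peer_av;
    struct fid_ep *eps[MOCK_AV_SIZE];
    fi_addr_t addrs[MOCK_AV_SIZE];
    size_t count;
};

static int mock_query(struct fid_peer_av *av, struct fi_av_attr *attr)
{
    struct mock_av *mav = (struct mock_av *) av;

    attr->count = mav->count;      /* report current AV state */
    return 0;                      /* FI_SUCCESS */
}

static fi_addr_t mock_ep_addr(struct fid_peer_av *av, struct fid_ep *ep)
{
    struct mock_av *mav = (struct mock_av *) av;

    for (size_t i = 0; i < mav->count; i++) {
        if (mav->eps[i] == ep)
            return mav->addrs[i];
    }
    return FI_ADDR_NOTAVAIL;       /* endpoint was never inserted */
}

static struct fi_ops_av_owner mock_av_ops = {
    .size = sizeof(struct fi_ops_av_owner),
    .query = mock_query,
    .ep_addr = mock_ep_addr,
};

/* Offload-provider view: query the AV, then look up two endpoints */
int demo_peer_av(void)
{
    struct fid_ep ep1, ep2;
    struct mock_av mav = { 0 };
    struct fi_av_attr attr = { 0 };

    mav.peer_av.owner_ops = &mock_av_ops;
    mav.eps[0] = &ep1;
    mav.addrs[0] = 7;
    mav.count = 1;

    if (mav.peer_av.owner_ops->query(&mav.peer_av, &attr) != 0 ||
        attr.count != 1)
        return -1;
    if (mav.peer_av.owner_ops->ep_addr(&mav.peer_av, &ep1) != 7)
        return -1;
    /* ep2 was never inserted, so the lookup must fail */
    if (mav.peer_av.owner_ops->ep_addr(&mav.peer_av, &ep2) !=
        FI_ADDR_NOTAVAIL)
        return -1;
    return 0;
}
```

       The cast from fid_peer_av back to the owner's private structure works
       because the exported object is embedded as the first member, which is
       the same container pattern libfabric providers use internally.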

PEER AV SETS
       The peer AV set allows the sharing of collective addressing data
       between providers. It specifically targets the use case of pairing a
       main provider with a collective offload provider. The setup for a
       peer AV set is similar to a shared CQ, described below. The owner of
       the AV set creates a fid_peer_av_set object that links back to its
       fid_av_set. The fid_peer_av_set is imported by the offload provider.

       Peer AV sets are configured by the owner calling the peer's
       fi_av_set_open() call, passing in the FI_PEER_AV flag, and pointing
       the context parameter to struct fi_peer_av_set_context.

       The data structures to support peer AV sets are:

       struct fi_ops_av_set_owner {
           size_t size;
           int (*members)(struct fid_peer_av_set *av, fi_addr_t *addr,
               size_t *count);
       };

       struct fid_peer_av_set {
           struct fid fid;
           struct fi_ops_av_set_owner *owner_ops;
       };

       struct fi_peer_av_set_context {
           size_t size;
           struct fi_peer_av_set *av_set;
       };

       fi_ops_av_set_owner::members()
       This call returns an array of AV addresses that are members of the AV
       set. The size of the array is specified through the count parameter.
       On return, count is set to the number of addresses in the AV set. If
       the input count value is too small, the function returns
       -FI_ETOOSMALL. Otherwise, the function returns an array of fi_addr
       values.
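
       The size-probing contract above follows the usual two-call pattern:
       retry with a larger array when -FI_ETOOSMALL is returned. The sketch
       below uses trimmed stand-in types, and the FI_ETOOSMALL value is a
       stand-in for the real constant in <rdma/fi_errno.h>; mock_av_set and
       demo_av_set_members are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t fi_addr_t;
#define FI_ETOOSMALL 275   /* stand-in; real value is in rdma/fi_errno.h */

struct fid { void *context; };
struct fid_peer_av_set;

struct fi_ops_av_set_owner {
    size_t size;
    int (*members)(struct fid_peer_av_set *av, fi_addr_t *addr,
        size_t *count);
};

struct fid_peer_av_set {
    struct fid fid;
    struct fi_ops_av_set_owner *owner_ops;
};

/* Hypothetical owner AV set backed by a fixed member table */
struct mock_av_set {
    struct fid_peer_av_set peer_av_set;   /* exported object, first member */
    fi_addr_t members[8];
    size_t nmembers;
};

static int mock_members(struct fid_peer_av_set *av, fi_addr_t *addr,
    size_t *count)
{
    struct mock_av_set *set = (struct mock_av_set *) av;

    if (*count < set->nmembers) {
        *count = set->nmembers;    /* tell the caller how much is needed */
        return -FI_ETOOSMALL;
    }
    for (size_t i = 0; i < set->nmembers; i++)
        addr[i] = set->members[i];
    *count = set->nmembers;
    return 0;                      /* FI_SUCCESS */
}

/* Caller pattern: probe, then retry with an adequately sized array */
int demo_av_set_members(void)
{
    struct fi_ops_av_set_owner ops = {
        .size = sizeof(ops), .members = mock_members };
    struct mock_av_set set = { 0 };
    fi_addr_t small[1], big[8];
    size_t count = 1;

    set.peer_av_set.owner_ops = &ops;
    set.members[0] = 10;
    set.members[1] = 11;
    set.members[2] = 12;
    set.nmembers = 3;

    /* first call: array too small, count is updated to the needed size */
    if (set.peer_av_set.owner_ops->members(&set.peer_av_set, small,
            &count) != -FI_ETOOSMALL || count != 3)
        return -1;
    /* second call succeeds and fills the array */
    if (set.peer_av_set.owner_ops->members(&set.peer_av_set, big,
            &count) != 0)
        return -1;
    return (int) big[2];
}
```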

PEER CQS
       The peer CQ defines a mechanism by which a peer provider may insert
       completions into the CQ owned by another provider. This avoids the
       overhead of the libfabric user needing to access multiple CQs.

       To set up a peer CQ, a provider creates a fid_peer_cq object, which
       links back to the provider's actual fid_cq. The fid_peer_cq object is
       then imported by a peer provider. The fid_peer_cq defines callbacks
       that the providers use to communicate with each other. The provider
       that allocates the fid_peer_cq is known as the owner, with the other
       provider referred to as the peer. An owner may set up peer
       relationships with multiple providers.

       Peer CQs are configured by the owner calling the peer's fi_cq_open()
       call. The owner passes in the FI_PEER flag to fi_cq_open(). When
       FI_PEER is specified, the context parameter passed into fi_cq_open()
       must reference a struct fi_peer_cq_context. Providers that do not
       support peer CQs must fail the fi_cq_open() call with -FI_EINVAL
       (indicating an invalid flag). The fid_peer_cq referenced by struct
       fi_peer_cq_context must remain valid until the peer's CQ is closed.

       The data structures to support peer CQs are defined as follows:

       struct fi_ops_cq_owner {
           size_t size;
           ssize_t (*write)(struct fid_peer_cq *cq, void *context,
               uint64_t flags, size_t len, void *buf, uint64_t data,
               uint64_t tag, fi_addr_t src);
           ssize_t (*writeerr)(struct fid_peer_cq *cq,
               const struct fi_cq_err_entry *err_entry);
       };

       struct fid_peer_cq {
           struct fid fid;
           struct fi_ops_cq_owner *owner_ops;
       };

       struct fi_peer_cq_context {
           size_t size;
           struct fid_peer_cq *cq;
       };

       For struct fid_peer_cq, the owner initializes the fid and owner_ops
       fields. struct fi_ops_cq_owner is used by the peer to communicate
       with the owning provider.

       If manual progress is needed on the peer CQ, the owner should drive
       progress by using the fi_cq_read() function with the buf parameter
       set to NULL and count equal to 0. The peer provider should set other
       functions that attempt to read the peer's CQ (i.e. fi_cq_readerr,
       fi_cq_sread, etc.) to return -FI_ENOSYS.

       fi_ops_cq_owner::write()
       This call directs the owner to insert new completions into the CQ.
       The fi_cq_attr::format field, along with other related attributes,
       determines which input parameters are valid. Parameters that are not
       reported as part of a completion are ignored by the owner, and should
       be set to 0, NULL, or another appropriate value by the user. For
       example, if source addressing is not returned with a completion, then
       the src parameter should be set to FI_ADDR_NOTAVAIL and will be
       ignored on input.

       The owner is responsible for locking, event signaling, and handling
       CQ overflow. Data passed through the write callback is relative to
       the user. For example, the fi_addr_t is relative to the peer's AV.
       The owner is responsible for converting the address if source
       addressing is needed.

       (TBD: should CQ overflow push back to the user for flow control? Do
       we need backoff / resume callbacks in ops_cq_user?)

       fi_ops_cq_owner::writeerr()
       The behavior of this call is similar to the write() op. It inserts a
       completion into the CQ indicating that a data transfer has failed.
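
       The write() path can be sketched as follows, assuming a
       FI_CQ_FORMAT_CONTEXT-style CQ so that only the context parameter is
       reported. All types are trimmed stand-ins, the array-backed CQ omits
       the locking, signaling, and overflow handling a real owner must
       provide, and mock_cq / demo_peer_cq_write are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t) -1)

struct fid { void *context; };
struct fid_peer_cq;
struct fi_cq_err_entry;          /* opaque; unused in this sketch */

struct fi_ops_cq_owner {
    size_t size;
    ssize_t (*write)(struct fid_peer_cq *cq, void *context, uint64_t flags,
        size_t len, void *buf, uint64_t data, uint64_t tag, fi_addr_t src);
    ssize_t (*writeerr)(struct fid_peer_cq *cq,
        const struct fi_cq_err_entry *err_entry);
};

struct fid_peer_cq {
    struct fid fid;
    struct fi_ops_cq_owner *owner_ops;
};

/* Owner-side CQ: a fixed array of CONTEXT-format entries */
#define MOCK_CQ_SIZE 16
struct mock_cq {
    struct fid_peer_cq peer_cq;  /* exported object, first member */
    void *entries[MOCK_CQ_SIZE];
    size_t head, tail;
};

static ssize_t mock_cq_write(struct fid_peer_cq *cq, void *context,
    uint64_t flags, size_t len, void *buf, uint64_t data, uint64_t tag,
    fi_addr_t src)
{
    struct mock_cq *mcq = (struct mock_cq *) cq;

    /* CQ format is CONTEXT: the other parameters are not reported */
    (void) flags; (void) len; (void) buf; (void) data; (void) tag;
    (void) src;
    if (mcq->tail - mcq->head >= MOCK_CQ_SIZE)
        return -1;               /* overflow; a real owner must handle it */
    mcq->entries[mcq->tail++ % MOCK_CQ_SIZE] = context;
    return 0;
}

static struct fi_ops_cq_owner mock_cq_ops = {
    .size = sizeof(struct fi_ops_cq_owner),
    .write = mock_cq_write,
};

/* Peer side: report one completed transfer through the owner's CQ */
int demo_peer_cq_write(void)
{
    struct mock_cq mcq = { 0 };
    int op_context = 42;

    mcq.peer_cq.owner_ops = &mock_cq_ops;
    /* unreported parameters are passed as 0 / NULL / FI_ADDR_NOTAVAIL */
    mcq.peer_cq.owner_ops->write(&mcq.peer_cq, &op_context, 0, 0, NULL,
        0, 0, FI_ADDR_NOTAVAIL);
    return *(int *) mcq.entries[mcq.head];   /* owner's read side */
}
```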

       EXAMPLE PEER CQ SETUP
       The above description defines the generic mechanism for sharing CQs
       between providers. This section outlines one possible implementation
       to demonstrate the use of the APIs. In the example, provider A uses
       provider B as a peer for data transfers targeting endpoints on the
       local node.

       1. Provider A is configured to use provider B as a peer. This may be
          coded into provider A or set through an environment variable.
       2. The application calls:
          fi_cq_open(domain_a, attr, &cq_a, app_context)
       3. Provider A allocates cq_a and automatically configures it to be
          used as a peer CQ.
       4. Provider A takes these steps:
          allocate peer_cq and reference cq_a
          set peer_cq_context->cq = peer_cq
          set attr_b.flags |= FI_PEER
          fi_cq_open(domain_b, attr_b, &cq_b, peer_cq_context)
       5. Provider B allocates a CQ, but configures it such that all
          completions are written to the peer_cq. The CQ ops to read from
          the CQ are set to enosys calls.
       6. Provider B inserts its own callbacks into the peer_cq object. It
          creates a reference between the peer_cq object and its own CQ.

PEER DOMAIN
       The peer domain allows a provider to access the operations of a
       domain object of its peer. For example, an offload provider can use a
       peer domain to register memory buffers with the main provider.

       The setup of a peer domain is similar to the setup for a peer CQ
       outlined above. The owner's domain object is imported directly into
       the peer.

       Peer domains are configured by the owner calling the peer's
       fi_domain2() call. The owner passes in the FI_PEER flag to
       fi_domain2(). When FI_PEER is specified, the context parameter passed
       into fi_domain2() must reference a struct fi_peer_domain_context.
       Providers that do not support peer domains must fail the fi_domain2()
       call with -FI_EINVAL. The fid_domain referenced by struct
       fi_peer_domain_context must remain valid until the peer's domain is
       closed.

       The data structures to support peer domains are defined as follows:

       struct fi_peer_domain_context {
           size_t size;
           struct fid_domain *domain;
       };

PEER EQS
       The peer EQ defines a mechanism by which a peer provider may insert
       events into the EQ owned by another provider. This avoids the
       overhead of the libfabric user needing to access multiple EQs.

       The setup of a peer EQ is similar to the setup for a peer CQ outlined
       above. The owner's EQ object is imported directly into the peer
       provider.

       Peer EQs are configured by the owner calling the peer's fi_eq_open()
       call. The owner passes in the FI_PEER flag to fi_eq_open(). When
       FI_PEER is specified, the context parameter passed into fi_eq_open()
       must reference a struct fi_peer_eq_context. Providers that do not
       support peer EQs must fail the fi_eq_open() call with -FI_EINVAL
       (indicating an invalid flag). The fid_eq referenced by struct
       fi_peer_eq_context must remain valid until the peer's EQ is closed.

       The data structures to support peer EQs are defined as follows:

       struct fi_peer_eq_context {
           size_t size;
           struct fid_eq *eq;
       };

PEER SRX
       The peer SRX defines a mechanism by which peer providers may share a
       common shared receive context. This avoids the overhead of having
       separate receive queues, can eliminate memory copies, and ensures
       correct application-level message ordering.

       The setup of a peer SRX is similar to the setup for a peer CQ
       outlined above. A fid_peer_srx object links the owner of the SRX with
       the peer provider. Peer SRXs are configured by the owner calling the
       peer's fi_srx_context() call with the FI_PEER flag set. The context
       parameter passed to fi_srx_context() must be a struct
       fi_peer_srx_context.

       The owner provider initializes all elements of the fid_peer_srx and
       referenced structures (fi_ops_srx_owner and fi_ops_srx_peer), with
       the exception of the fi_ops_srx_peer callback functions. Those must
       be initialized by the peer provider prior to returning from the
       fi_srx_context() call and are used by the owner to control peer
       actions.
       The data structures to support peer SRXs are defined as follows:

       struct fid_peer_srx;

       /* Castable to dlist_entry */
       struct fi_peer_rx_entry {
           struct fi_peer_rx_entry *next;
           struct fi_peer_rx_entry *prev;
           struct fi_peer_srx *srx;
           fi_addr_t addr;
           size_t size;
           uint64_t tag;
           uint64_t flags;
           void *context;
           size_t count;
           void **desc;
           void *peer_context;
           void *user_context;
           struct iovec *iov;
       };

       struct fi_ops_srx_owner {
           size_t size;
           int (*get_msg)(struct fid_peer_srx *srx, fi_addr_t addr,
               size_t size, struct fi_peer_rx_entry **entry);
           int (*get_tag)(struct fid_peer_srx *srx, fi_addr_t addr,
               uint64_t tag, struct fi_peer_rx_entry **entry);
           int (*queue_msg)(struct fi_peer_rx_entry *entry);
           int (*queue_tag)(struct fi_peer_rx_entry *entry);
           void (*free_entry)(struct fi_peer_rx_entry *entry);
       };

       struct fi_ops_srx_peer {
           size_t size;
           int (*start_msg)(struct fid_peer_srx *srx);
           int (*start_tag)(struct fid_peer_srx *srx);
           int (*discard_msg)(struct fid_peer_srx *srx);
           int (*discard_tag)(struct fid_peer_srx *srx);
       };

       struct fid_peer_srx {
           struct fid_ep ep_fid;
           struct fi_ops_srx_owner *owner_ops;
           struct fi_ops_srx_peer *peer_ops;
       };

       struct fi_peer_srx_context {
           size_t size;
           struct fid_peer_srx *srx;
       };

       The ownership of structure field values and callback functions is
       similar to those defined for peer CQs, relative to owner versus peer
       ops.

       fi_ops_srx_owner::get_msg() / get_tag()
       These calls are invoked by the peer provider to obtain the receive
       buffer(s) where an incoming message should be placed. The peer
       provider passes in the relevant fields to request a matching rx_entry
       from the owner. If source addressing is required, the addr is passed
       in; otherwise, the address is set to FI_ADDR_NOTAVAIL. The size field
       indicates the received message size. This field is used by the owner
       when handling multi-received data buffers, but may be ignored
       otherwise. The peer provider is responsible for checking that an
       incoming message fits within the provided buffer space. The tag
       parameter is used for tagged messages.

       An fi_peer_rx_entry is allocated by the owner, whether or not a match
       was found. If a match was found, the owner returns FI_SUCCESS and the
       rx_entry is filled in with the appropriate receive fields for the
       peer to process accordingly. If no match was found, the owner returns
       -FI_ENOENT; the rx_entry is still valid but does not match an
       existing posted receive. When the peer gets -FI_ENOENT, it should
       allocate whatever resources it needs to process the message later (on
       start_msg/tag), set rx_entry->user_context appropriately, and then
       call the owner's queue_msg/tag. The get and queue calls should be
       serialized. When the owner gets a matching receive for the queued
       unexpected message, it calls the peer's start function to notify the
       peer of the updated rx_entry (or the peer's discard function if the
       message is to be discarded). (TBD: The peer may need to update the
       src addr if the remote endpoint is inserted into the AV after the
       message has been received.)
       fi_ops_srx_peer::start_msg() / start_tag()
       These calls indicate that an asynchronous get_msg() or get_tag() has
       completed and a buffer is now available to receive the message.
       Control of the fi_peer_rx_entry, which has been initialized for
       receiving the incoming message, is returned to the peer provider.

       fi_ops_srx_peer::discard_msg() / discard_tag()
       These calls indicate that the message and data associated with the
       specified fi_peer_rx_entry should be discarded. This often indicates
       that the application has canceled or discarded the receive operation.
       No completion should be generated by the peer provider for a
       discarded message. Control of the fi_peer_rx_entry is returned to the
       peer provider.

       EXAMPLE PEER SRX SETUP
       The above description defines the generic mechanism for sharing SRXs
       between providers. This section outlines one possible implementation
       to demonstrate the use of the APIs. In the example, provider A uses
       provider B as a peer for data transfers targeting endpoints on the
       local node.

       1. Provider A is configured to use provider B as a peer. This may be
          coded into provider A or set through an environment variable.
       2. The application calls:
          fi_srx_context(domain_a, attr, &srx_a, app_context)
       3. Provider A allocates srx_a and automatically configures it to be
          used as a peer SRX.
       4. Provider A takes these steps:
          allocate peer_srx and reference srx_a
          set peer_srx_context->srx = peer_srx
          set attr_b.flags |= FI_PEER
          fi_srx_context(domain_b, attr_b, &srx_b, peer_srx_context)
       5. Provider B allocates an srx, but configures it such that all
          receive buffers are obtained from the peer_srx. The srx ops to
          post receives are set to enosys calls.
       6. Provider B inserts its own callbacks into the peer_srx object. It
          creates a reference between the peer_srx object and its own srx.

       EXAMPLE PEER SRX RECEIVE FLOW
       The following outlines show simplified, example software flows for
       receive message handling using a peer SRX. The first flow
       demonstrates the case where a receive buffer is waiting when the
       message arrives.

       1. The application calls fi_recv() / fi_trecv() on the owner.
       2. The owner queues the receive buffer.
       3. A message is received by the peer provider.
       4. The peer calls owner->get_msg() / get_tag().
       5. The owner removes the queued receive buffer and returns it to the
          peer. The get call completes with FI_SUCCESS.
       6. When the peer finishes processing the message and completes it on
          its own CQ, the peer calls free_entry to free the entry with the
          owner.

       The second case below shows the flow when a message arrives before
       the application has posted the matching receive buffer.

       1. A message is received by the peer provider.
       2. The peer calls owner->get_msg() / get_tag().
       3. The owner fails to find a matching receive buffer.
       4. The owner allocates an rx_entry with any known fields and returns
          -FI_ENOENT.
       5. The peer allocates any resources needed to handle the asynchronous
          processing and sets peer_context accordingly.
       6. The peer calls the owner's queue function and the owner queues the
          peer request on an unexpected/pending list.
       7. The application calls fi_recv() / fi_trecv() on the owner, posting
          the matching receive buffer.
       8. The owner matches the receive with the queued message on the peer.
       9. The owner removes the queued request, fills in the rest of the
          known fields, and calls the peer->start_msg() / start_tag()
          function.
       10. When the peer finishes processing the message and completes it on
           its own CQ, the peer calls free_entry to free the entry with the
           owner.
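
       The unexpected-message flow above can be sketched in miniature. The
       callback signatures are deliberately simplified (they take the mock
       state directly rather than a fid_peer_srx), the single-slot queues
       stand in for real lists, and mock_srx, owner_post_recv(), and
       demo_srx_unexpected() are hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef uint64_t fi_addr_t;
#define FI_ENOENT 2   /* ENOENT; the real constant is in rdma/fi_errno.h */

struct fid_peer_srx;

/* Trimmed fi_peer_rx_entry: only the fields this sketch touches */
struct fi_peer_rx_entry {
    struct fid_peer_srx *srx;
    fi_addr_t addr;
    size_t size;
    void *context;        /* application context once matched */
    void *peer_context;   /* peer's bookkeeping for deferred processing */
};

struct mock_srx {
    /* owner state: one posted receive, one queued unexpected message */
    bool have_recv;
    void *recv_context;
    struct fi_peer_rx_entry unexp;
    bool unexp_queued;
    /* peer state: number of start_msg() notifications delivered */
    int started;
};

/* owner_ops->get_msg(): hand out a posted buffer, or -FI_ENOENT */
static int mock_get_msg(struct mock_srx *srx, fi_addr_t addr, size_t size,
    struct fi_peer_rx_entry **entry)
{
    *entry = &srx->unexp;     /* entry is allocated match or no match */
    (*entry)->addr = addr;
    (*entry)->size = size;
    if (!srx->have_recv)
        return -FI_ENOENT;
    srx->have_recv = false;
    (*entry)->context = srx->recv_context;
    return 0;                 /* FI_SUCCESS: matched a posted receive */
}

/* owner_ops->queue_msg(): park the unexpected message with the owner */
static int mock_queue_msg(struct mock_srx *srx)
{
    srx->unexp_queued = true;
    return 0;
}

/* peer_ops->start_msg(): buffer now available; peer resumes delivery */
static int mock_start_msg(struct mock_srx *srx)
{
    srx->started++;
    return 0;
}

/* fi_recv() on the owner: match a queued unexpected message if present */
static void owner_post_recv(struct mock_srx *srx, void *app_context)
{
    if (srx->unexp_queued) {
        srx->unexp_queued = false;
        srx->unexp.context = app_context;  /* fill in remaining fields */
        mock_start_msg(srx);               /* notify the peer */
    } else {
        srx->have_recv = true;
        srx->recv_context = app_context;
    }
}

/* A message arrives at the peer before any receive is posted */
int demo_srx_unexpected(void)
{
    struct mock_srx srx = { 0 };
    struct fi_peer_rx_entry *rxe;
    int app_ctx;

    if (mock_get_msg(&srx, 1, 64, &rxe) != -FI_ENOENT)
        return -1;                    /* no posted receive yet */
    rxe->peer_context = &srx;         /* peer saves what it needs */
    mock_queue_msg(&srx);             /* then queues with the owner */

    owner_post_recv(&srx, &app_ctx);  /* application posts the match */
    return srx.started;               /* start_msg() ran exactly once */
}
```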

FI_EXPORT_FID / FI_IMPORT_FID
       The fi_export_fid function is reserved for future use.

       The fi_import_fid call may be used to import a fabric object created
       and owned by the libfabric user. This allows upper-level libraries or
       the application to override or define low-level libfabric behavior.
       Details on specific uses of fi_import_fid are outside the scope of
       this documentation.

PEER TRANSFERS
       Providers frequently send control messages to their remote
       counterparts as part of their wire protocol. For example, a provider
       may send an ACK message to guarantee reliable delivery of a message
       or to meet a requested completion semantic. When two or more
       providers are coordinating as peers, it can be more efficient if
       control messages for both peer providers go over the same transport.
       In some cases, such as when one of the peers is an offload provider,
       it may even be required. Peer transfers define the mechanism by which
       such communication occurs.

       Peer transfers enable one peer to send and receive data transfers
       over its associated peer. Providers that require this functionality
       indicate it by setting the FI_PEER_TRANSFER flag as a mode bit, i.e.
       fi_info::mode.

       To use such a provider as a peer, the main, or owner, provider must
       set up peer transfers by opening a peer transfer endpoint and
       accepting transfers with this flag set. Setup of peer transfers
       involves the following data structures:

       struct fi_ops_transfer_peer {
           size_t size;
           ssize_t (*complete)(struct fid_ep *ep,
               struct fi_cq_tagged_entry *buf, fi_addr_t *src_addr);
           ssize_t (*comperr)(struct fid_ep *ep,
               struct fi_cq_err_entry *buf);
       };

       struct fi_peer_transfer_context {
           size_t size;
           struct fi_info *info;
           struct fid_ep *ep;
           struct fi_ops_transfer_peer *peer_ops;
       };

       Peer transfer contexts form a virtual link between endpoints
       allocated on each of the peer providers. The setup of a peer transfer
       context occurs through the fi_endpoint2() API. The main provider
       calls fi_endpoint2() passing in the FI_PEER_TRANSFER flag. When
       specified, the context parameter must reference the struct
       fi_peer_transfer_context defined above.

       The size field indicates the size of struct fi_peer_transfer_context
       being passed to the peer. This is used for backward compatibility.
       The info field is optional. If given, it defines the attributes of
       the main provider's objects. It may be used to report the
       capabilities and restrictions on peer transfers, such as whether
       memory registration is required, maximum message sizes, data and
       completion ordering semantics, and so forth. If the importing
       provider cannot meet these restrictions, it must fail the
       fi_endpoint2() call.

       The peer_ops field contains callbacks from the main provider into the
       peer and is used to report the completion (success or failure) of
       peer initiated data transfers. The callback functions defined in
       struct fi_ops_transfer_peer must be set by the peer provider before
       returning from the fi_endpoint2() call. Actions that the peer
       provider can take from within the completion callbacks are mostly
       unrestricted, and can include any of the following types of
       operations: initiation of additional data transfers, writing events
       to the owner's CQ or EQ, and memory registration/deregistration.
       Prior to invoking the peer's callback, the owner must ensure that
       deadlock cannot occur should the peer invoke any of these operations.
       Further, the owner must avoid recursive calls into the completion
       callbacks.
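
       The completion-reporting path can be sketched as follows, using
       trimmed stand-in types for the definitions in <rdma/fabric.h>.
       peer_complete(), report_transfer_done(), and mock_peer_ep are
       hypothetical names for illustration only:

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

typedef uint64_t fi_addr_t;

struct fid_ep { void *context; };

/* Trimmed fi_cq_tagged_entry; the real one is in <rdma/fi_domain.h> */
struct fi_cq_tagged_entry {
    void *op_context;
    uint64_t flags;
    size_t len;
    void *buf;
    uint64_t data;
    uint64_t tag;
};

struct fi_cq_err_entry;  /* opaque; unused in this sketch */

struct fi_ops_transfer_peer {
    size_t size;
    ssize_t (*complete)(struct fid_ep *ep, struct fi_cq_tagged_entry *buf,
        fi_addr_t *src_addr);
    ssize_t (*comperr)(struct fid_ep *ep, struct fi_cq_err_entry *buf);
};

/* Peer side: record completions reported by the main provider */
struct mock_peer_ep {
    struct fid_ep ep;            /* embedded first, so the peer can cast */
    int completions;
    uint64_t last_tag;
};

static ssize_t peer_complete(struct fid_ep *ep,
    struct fi_cq_tagged_entry *buf, fi_addr_t *src_addr)
{
    struct mock_peer_ep *pep = (struct mock_peer_ep *) ep;

    (void) src_addr;
    pep->completions++;          /* a real peer would write its own CQ */
    pep->last_tag = buf->tag;
    return 0;
}

/* Owner side: a peer-initiated transfer has finished on the wire */
static void report_transfer_done(struct fi_ops_transfer_peer *peer_ops,
    struct fid_ep *peer_ep, uint64_t tag)
{
    struct fi_cq_tagged_entry comp = { .tag = tag };

    /* the owner must not invoke this recursively from the callback */
    peer_ops->complete(peer_ep, &comp, NULL);
}

int demo_peer_transfer(void)
{
    struct fi_ops_transfer_peer ops = {
        .size = sizeof(ops), .complete = peer_complete };
    struct mock_peer_ep pep = { 0 };

    report_transfer_done(&ops, &pep.ep, 0xabc);
    return pep.completions == 1 && pep.last_tag == 0xabc;
}
```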

RETURN VALUE
       Returns FI_SUCCESS on success. On error, a negative value
       corresponding to fabric errno is returned. Fabric errno values are
       defined in rdma/fi_errno.h.

SEE ALSO
       fi_provider(7), fi_provider(3), fi_cq(3)

AUTHORS
       OpenFabrics.

Libfabric Programmer's Manual       2022-12-11                      fi_peer(3)