fi_endpoint(3)                 Libfabric v1.18.1                 fi_endpoint(3)
2
3
4
NAME
       fi_endpoint - Fabric endpoint operations
7
8 fi_endpoint / fi_endpoint2 / fi_scalable_ep / fi_passive_ep / fi_close
9 Allocate or close an endpoint.
10
11 fi_ep_bind
12 Associate an endpoint with hardware resources, such as event
13 queues, completion queues, counters, address vectors, or shared
14 transmit/receive contexts.
15
16 fi_scalable_ep_bind
17 Associate a scalable endpoint with an address vector
18
19 fi_pep_bind
20 Associate a passive endpoint with an event queue
21
22 fi_enable
23 Transitions an active endpoint into an enabled state.
24
25 fi_cancel
26 Cancel a pending asynchronous data transfer
27
28 fi_ep_alias
29 Create an alias to the endpoint
30
31 fi_control
32 Control endpoint operation.
33
34 fi_getopt / fi_setopt
35 Get or set endpoint options.
36
37 fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
38 Open a transmit or receive context.
39
40 fi_tc_dscp_set / fi_tc_dscp_get
41 Convert between a DSCP value and a network traffic class
42
43 fi_rx_size_left / fi_tx_size_left (DEPRECATED)
44 Query the lower bound on how many RX/TX operations may be posted
              without an operation returning -FI_EAGAIN. These functions have
46 been deprecated and will be removed in a future version of the
47 library.
48
SYNOPSIS
       #include <rdma/fabric.h>
51
52 #include <rdma/fi_endpoint.h>
53
54 int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
55 struct fid_ep **ep, void *context);
56
57 int fi_endpoint2(struct fid_domain *domain, struct fi_info *info,
58 struct fid_ep **ep, uint64_t flags, void *context);
59
60 int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
61 struct fid_ep **sep, void *context);
62
       int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
64 struct fid_pep **pep, void *context);
65
66 int fi_tx_context(struct fid_ep *sep, int index,
67 struct fi_tx_attr *attr, struct fid_ep **tx_ep,
68 void *context);
69
70 int fi_rx_context(struct fid_ep *sep, int index,
71 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
72 void *context);
73
74 int fi_stx_context(struct fid_domain *domain,
75 struct fi_tx_attr *attr, struct fid_stx **stx,
76 void *context);
77
78 int fi_srx_context(struct fid_domain *domain,
79 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
80 void *context);
81
82 int fi_close(struct fid *ep);
83
84 int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);
85
86 int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);
87
88 int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);
89
90 int fi_enable(struct fid_ep *ep);
91
92 int fi_cancel(struct fid_ep *ep, void *context);
93
94 int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);
95
96 int fi_control(struct fid *ep, int command, void *arg);
97
98 int fi_getopt(struct fid *ep, int level, int optname,
99 void *optval, size_t *optlen);
100
101 int fi_setopt(struct fid *ep, int level, int optname,
102 const void *optval, size_t optlen);
103
104 uint32_t fi_tc_dscp_set(uint8_t dscp);
105
106 uint8_t fi_tc_dscp_get(uint32_t tclass);
107
108 DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);
109
110 DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
111
ARGUMENTS
       fid    On creation, specifies a fabric or access domain. On bind,
114 identifies the event queue, completion queue, counter, or ad‐
115 dress vector to bind to the endpoint. In other cases, it’s a
116 fabric identifier of an associated resource.
117
118 info Details about the fabric interface endpoint to be opened, ob‐
119 tained from fi_getinfo.
120
121 ep A fabric endpoint.
122
123 sep A scalable fabric endpoint.
124
125 pep A passive fabric endpoint.
126
127 context
128 Context associated with the endpoint or asynchronous operation.
129
130 index Index to retrieve a specific transmit/receive context.
131
132 attr Transmit or receive context attributes.
133
134 flags Additional flags to apply to the operation.
135
136 command
137 Command of control operation to perform on endpoint.
138
139 arg Optional control argument.
140
141 level Protocol level at which the desired option resides.
142
143 optname
144 The protocol option to read or set.
145
146 optval The option value that was read or to set.
147
148 optlen The size of the optval buffer.
149
DESCRIPTION
       Endpoints are transport level communication portals. There are two
152 types of endpoints: active and passive. Passive endpoints belong to a
153 fabric domain and are most often used to listen for incoming connection
154 requests. However, a passive endpoint may be used to reserve a fabric
155 address that can be granted to an active endpoint. Active endpoints
156 belong to access domains and can perform data transfers.
157
158 Active endpoints may be connection-oriented or connectionless, and may
159 provide data reliability. The data transfer interfaces – messages
160 (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
161 (fi_atomic) – are associated with active endpoints. In basic configu‐
162 rations, an active endpoint has transmit and receive queues. In gener‐
163 al, operations that generate traffic on the fabric are posted to the
164 transmit queue. This includes all RMA and atomic operations, along
165 with sent messages and sent tagged messages. Operations that post buf‐
166 fers for receiving incoming data are submitted to the receive queue.
167
168 Active endpoints are created in the disabled state. They must transi‐
169 tion into an enabled state before accepting data transfer operations,
170 including posting of receive buffers. The fi_enable call is used to
171 transition an active endpoint into an enabled state. The fi_connect
172 and fi_accept calls will also transition an endpoint into the enabled
173 state, if it is not already active.
174
175 In order to transition an endpoint into an enabled state, it must be
176 bound to one or more fabric resources. An endpoint that will generate
177 asynchronous completions, either through data transfer operations or
178 communication establishment events, must be bound to the appropriate
179 completion queues or event queues, respectively, before being enabled.
180 Additionally, endpoints that use manual progress must be associated
181 with relevant completion queues or event queues in order to drive
182 progress. For endpoints that are only used as the target of RMA or
183 atomic operations, this means binding the endpoint to a completion
184 queue associated with receive processing. Connectionless endpoints
185 must be bound to an address vector.
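
       As a brief illustration of the sequence above, the following sketch
       allocates, binds, and enables a connectionless endpoint. It is not
       part of the libfabric distribution; the helper name is hypothetical,
       the domain, completion queue, and address vector are assumed to have
       been opened earlier from the same fi_getinfo() results, and cleanup
       on failure is omitted.

       #include <rdma/fabric.h>
       #include <rdma/fi_eq.h>
       #include <rdma/fi_domain.h>
       #include <rdma/fi_endpoint.h>

       static int open_enabled_ep(struct fid_domain *domain,
                                  struct fi_info *info, struct fid_cq *cq,
                                  struct fid_av *av, struct fid_ep **ep)
       {
           int ret;

           ret = fi_endpoint(domain, info, ep, NULL);
           if (ret)
               return ret;

           /* Direct both transmit and receive completions to one CQ. */
           ret = fi_ep_bind(*ep, &cq->fid, FI_TRANSMIT | FI_RECV);
           if (ret)
               return ret;

           /* Connectionless endpoints must be bound to an address vector. */
           ret = fi_ep_bind(*ep, &av->fid, 0);
           if (ret)
               return ret;

           return fi_enable(*ep);
       }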
186
187 Once an endpoint has been activated, it may be associated with an ad‐
188 dress vector. Receive buffers may be posted to it and calls may be
189 made to connection establishment routines. Connectionless endpoints
190 may also perform data transfers.
191
192 The behavior of an endpoint may be adjusted by setting its control data
193 and protocol options. This allows the underlying provider to redirect
194 function calls to implementations optimized to meet the desired appli‐
195 cation behavior.
196
197 If an endpoint experiences a critical error, it will transition back
198 into a disabled state. Critical errors are reported through the event
199 queue associated with the EP. In certain cases, a disabled endpoint
200 may be re-enabled. The ability to transition back into an enabled
201 state is provider specific and depends on the type of error that the
202 endpoint experienced. When an endpoint is disabled as a result of a
203 critical error, all pending operations are discarded.
204
205 fi_endpoint / fi_passive_ep / fi_scalable_ep
206 fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
207 new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
208 The properties and behavior of the endpoint are defined based on the
209 provided struct fi_info. See fi_getinfo for additional details on
210 fi_info. fi_info flags that control the operation of an endpoint are
211 defined below. See section SCALABLE ENDPOINTS.
212
213 If an active endpoint is allocated in order to accept a connection re‐
214 quest, the fi_info parameter must be the same as the fi_info structure
215 provided with the connection request (FI_CONNREQ) event.
216
217 An active endpoint may acquire the properties of a passive endpoint by
218 setting the fi_info handle field to the passive endpoint fabric de‐
219 scriptor. This is useful for applications that need to reserve the
220 fabric address of an endpoint prior to knowing if the endpoint will be
221 used on the active or passive side of a connection. For example, this
222 feature is useful for simulating socket semantics. Once an active end‐
223 point acquires the properties of a passive endpoint, the passive end‐
224 point is no longer bound to any fabric resources and must no longer be
225 used. The user is expected to close the passive endpoint after opening
226 the active endpoint in order to free up any lingering resources that
227 had been used.
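
       A minimal sketch of this pattern follows, assuming the fabric, domain,
       and fi_info objects already exist; the function name is illustrative
       and error handling is abbreviated.

       static int promote_to_active(struct fid_fabric *fabric,
                                    struct fid_domain *domain,
                                    struct fi_info *info, struct fid_ep **ep)
       {
           struct fid_pep *pep;
           int ret;

           ret = fi_passive_ep(fabric, info, &pep, NULL);
           if (ret)
               return ret;

           info->handle = &pep->fid;   /* hand the reserved address over */
           ret = fi_endpoint(domain, info, ep, NULL);

           /* The passive endpoint no longer owns fabric resources; close it. */
           fi_close(&pep->fid);
           return ret;
       }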
228
229 fi_endpoint2
       Similar to fi_endpoint, but accepts an extra flags parameter. It is
       mainly used for opening endpoints that use the peer transfer feature.
       See fi_peer(3).
233
234 fi_close
       Closes an endpoint and releases all resources associated with it.
236
       When closing a scalable endpoint, there must be no open transmit or
       receive contexts associated with the scalable endpoint. If
239 resources are still associated with the scalable endpoint when attempt‐
240 ing to close, the call will return -FI_EBUSY.
241
242 Outstanding operations posted to the endpoint when fi_close is called
243 will be discarded. Discarded operations will silently be dropped, with
244 no completions reported. Additionally, a provider may discard previ‐
245 ously completed operations from the associated completion queue(s).
246 The behavior to discard completed operations is provider specific.
247
248 fi_ep_bind
249 fi_ep_bind is used to associate an endpoint with other allocated re‐
250 sources, such as completion queues, counters, address vectors, event
251 queues, shared contexts, and memory regions. The type of objects that
252 must be bound with an endpoint depend on the endpoint type and its con‐
253 figuration.
254
255 Passive endpoints must be bound with an EQ that supports connection
256 management events. Connectionless endpoints must be bound to a single
257 address vector. If an endpoint is using a shared transmit and/or re‐
258 ceive context, the shared contexts must be bound to the endpoint. CQs,
259 counters, AV, and shared contexts must be bound to endpoints before
260 they are enabled either explicitly or implicitly.
261
262 An endpoint must be bound with CQs capable of reporting completions for
263 any asynchronous operation initiated on the endpoint. For example, if
264 the endpoint supports any outbound transfers (sends, RMA, atomics,
265 etc.), then it must be bound to a completion queue that can report
266 transmit completions. This is true even if the endpoint is configured
267 to suppress successful completions, in order that operations that com‐
268 plete in error may be reported to the user.
269
270 An active endpoint may direct asynchronous completions to different
271 CQs, based on the type of operation. This is specified using
272 fi_ep_bind flags. The following flags may be OR’ed together when bind‐
273 ing an endpoint to a completion domain CQ.
274
275 FI_RECV
276 Directs the notification of inbound data transfers to the speci‐
277 fied completion queue. This includes received messages. This
278 binding automatically includes FI_REMOTE_WRITE, if applicable to
279 the endpoint.
280
281 FI_SELECTIVE_COMPLETION
282 By default, data transfer operations write CQ completion entries
283 into the associated completion queue after they have successful‐
284 ly completed. Applications can use this bind flag to selective‐
285 ly enable when completions are generated. If FI_SELECTIVE_COM‐
286 PLETION is specified, data transfer operations will not generate
287 CQ entries for successful completions unless FI_COMPLETION is
288 set as an operational flag for the given operation. Operations
289 that fail asynchronously will still generate completions, even
290 if a completion is not requested. FI_SELECTIVE_COMPLETION must
291 be OR’ed with FI_TRANSMIT and/or FI_RECV flags.
292
293 When FI_SELECTIVE_COMPLETION is set, the user must determine when a re‐
294 quest that does NOT have FI_COMPLETION set has completed indirectly,
295 usually based on the completion of a subsequent operation or by using
296 completion counters. Use of this flag may improve performance by al‐
297 lowing the provider to avoid writing a CQ completion entry for every
298 operation.
299
300 See Notes section below for additional information on how this flag in‐
301 teracts with the FI_CONTEXT and FI_CONTEXT2 mode bits.
302
303 FI_TRANSMIT
304 Directs the completion of outbound data transfer requests to the
305 specified completion queue. This includes send message, RMA,
306 and atomic operations.
307
308 An endpoint may optionally be bound to a completion counter. Associat‐
309 ing an endpoint with a counter is in addition to binding the EP with a
310 CQ. When binding an endpoint to a counter, the following flags may be
311 specified.
312
313 FI_READ
314 Increments the specified counter whenever an RMA read, atomic
315 fetch, or atomic compare operation initiated from the endpoint
316 has completed successfully or in error.
317
318 FI_RECV
319 Increments the specified counter whenever a message is received
320 over the endpoint. Received messages include both tagged and
321 normal message operations.
322
323 FI_REMOTE_READ
324 Increments the specified counter whenever an RMA read, atomic
325 fetch, or atomic compare operation is initiated from a remote
326 endpoint that targets the given endpoint. Use of this flag re‐
327 quires that the endpoint be created using FI_RMA_EVENT.
328
329 FI_REMOTE_WRITE
330 Increments the specified counter whenever an RMA write or base
331 atomic operation is initiated from a remote endpoint that tar‐
332 gets the given endpoint. Use of this flag requires that the
333 endpoint be created using FI_RMA_EVENT.
334
335 FI_SEND
336 Increments the specified counter whenever a message transfer
337 initiated over the endpoint has completed successfully or in er‐
338 ror. Sent messages include both tagged and normal message oper‐
339 ations.
340
341 FI_WRITE
342 Increments the specified counter whenever an RMA write or base
343 atomic operation initiated from the endpoint has completed suc‐
344 cessfully or in error.
345
346 An endpoint may only be bound to a single CQ or counter for a given
       type of operation. For example, an EP may not bind to two counters both
348 using FI_WRITE. Furthermore, providers may limit CQ and counter bind‐
349 ings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
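
       The following sketch shows one way these bind flags might be combined;
       the helper name is hypothetical, and the CQ and counter are assumed to
       have been opened with fi_cq_open() and fi_cntr_open() (declared in
       <rdma/fi_domain.h>).

       static int bind_ep_resources(struct fid_ep *ep, struct fid_cq *cq,
                                    struct fid_cntr *cntr)
       {
           int ret;

           /* One CQ for transmit and receive completions, with successful
            * completions suppressed unless FI_COMPLETION is set per
            * operation. */
           ret = fi_ep_bind(ep, &cq->fid,
                            FI_TRANSMIT | FI_RECV | FI_SELECTIVE_COMPLETION);
           if (ret)
               return ret;

           /* Count completed sends and RMA/atomic writes. */
           return fi_ep_bind(ep, &cntr->fid, FI_SEND | FI_WRITE);
       }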
350
351 fi_scalable_ep_bind
352 fi_scalable_ep_bind is used to associate a scalable endpoint with an
353 address vector. See section on SCALABLE ENDPOINTS. A scalable end‐
354 point has a single transport level address and can support multiple
355 transmit and receive contexts. The transmit and receive contexts share
356 the transport-level address. Address vectors that are bound to scal‐
357 able endpoints are implicitly bound to any transmit or receive contexts
358 created using the scalable endpoint.
359
360 fi_enable
361 This call transitions the endpoint into an enabled state. An endpoint
362 must be enabled before it may be used to perform data transfers. En‐
363 abling an endpoint typically results in hardware resources being as‐
364 signed to it. Endpoints making use of completion queues, counters,
365 event queues, and/or address vectors must be bound to them before being
366 enabled.
367
368 Calling connect or accept on an endpoint will implicitly enable an end‐
369 point if it has not already been enabled.
370
371 fi_enable may also be used to re-enable an endpoint that has been dis‐
372 abled as a result of experiencing a critical error. Applications
373 should check the return value from fi_enable to see if a disabled end‐
       point has successfully been re-enabled.
375
376 fi_cancel
377 fi_cancel attempts to cancel an outstanding asynchronous operation.
378 Canceling an operation causes the fabric provider to search for the op‐
379 eration and, if it is still pending, complete it as having been can‐
380 celed. An error queue entry will be available in the associated error
381 queue with error code FI_ECANCELED. On the other hand, if the opera‐
382 tion completed before the call to fi_cancel, then the completion status
383 of that operation will be available in the associated completion queue.
384 No specific entry related to fi_cancel itself will be posted.
385
386 Cancel uses the context parameter associated with an operation to iden‐
387 tify the request to cancel. Operations posted without a valid context
388 parameter – either no context parameter is specified or the context
389 value was ignored by the provider – cannot be canceled. If multiple
390 outstanding operations match the context parameter, only one will be
391 canceled. In this case, the operation which is canceled is provider
392 specific. The cancel operation is asynchronous, but will complete
393 within a bounded period of time.
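
       A short sketch of this usage follows; the helper name is hypothetical,
       and depending on the libfabric version fi_cancel may be declared to
       take either the endpoint or its fid, so the fid member is passed here.

       static void post_then_cancel(struct fid_ep *ep, void *buf, size_t len,
                                    struct fi_context *ctx)
       {
           /* Post a receive using ctx as the operation context. */
           if (fi_recv(ep, buf, len, NULL, FI_ADDR_UNSPEC, ctx))
               return;

           /* If still pending, the receive completes with FI_ECANCELED on
            * the endpoint's error queue; otherwise its normal completion
            * stands. */
           (void) fi_cancel(&ep->fid, ctx);
       }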
394
395 fi_ep_alias
396 This call creates an alias to the specified endpoint. Conceptually, an
397 endpoint alias provides an alternate software path from the application
398 to the underlying provider hardware. An alias EP differs from its par‐
399 ent endpoint only by its default data transfer flags. For example, an
400 alias EP may be configured to use a different completion mode. By de‐
401 fault, an alias EP inherits the same data transfer flags as the parent
402 endpoint. An application can use fi_control to modify the alias EP op‐
403 erational flags.
404
405 When allocating an alias, an application may configure either the
406 transmit or receive operational flags. This avoids needing a separate
407 call to fi_control to set those flags. The flags passed to fi_ep_alias
408 must include FI_TRANSMIT or FI_RECV (not both) with other operational
409 flags OR’ed in. This will override the transmit or receive flags, re‐
410 spectively, for operations posted through the alias endpoint. All al‐
411 located aliases must be closed for the underlying endpoint to be re‐
412 leased.
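
       For example, the sketch below creates an alias whose transmit
       operations always generate completions, leaving the parent endpoint's
       flags untouched; the function name is illustrative only.

       static int alias_with_completions(struct fid_ep *ep,
                                         struct fid_ep **alias)
       {
           return fi_ep_alias(ep, alias, FI_TRANSMIT | FI_COMPLETION);
       }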
413
414 fi_control
415 The control operation is used to adjust the default behavior of an end‐
416 point. It allows the underlying provider to redirect function calls to
417 implementations optimized to meet the desired application behavior. As
418 a result, calls to fi_ep_control must be serialized against all other
419 calls to an endpoint.
420
421 The base operation of an endpoint is selected during creation using
422 struct fi_info. The following control commands and arguments may be
423 assigned to an endpoint.
424
       FI_BACKLOG - int *value
426 This option only applies to passive endpoints. It is used to
427 set the connection request backlog for listening endpoints.
428
       FI_GETOPSFLAG - uint64_t *flags
430 Used to retrieve the current value of flags associated with the
431 data transfer operations initiated on the endpoint. The control
432 argument must include FI_TRANSMIT or FI_RECV (not both) flags to
433 indicate the type of data transfer flags to be returned. See
434 below for a list of control flags.
435
436 FI_GETWAIT – void **
437 This command allows the user to retrieve the file descriptor as‐
438 sociated with a socket endpoint. The fi_control arg parameter
439 should be an address where a pointer to the returned file de‐
              scriptor will be written. See fi_eq.3 for additional details on
              using fi_control with FI_GETWAIT. The file descriptor may be used
442 for notification that the endpoint is ready to send or receive
443 data.
444
       FI_SETOPSFLAG - uint64_t *flags
446 Used to change the data transfer operation flags associated with
447 an endpoint. The control argument must include FI_TRANSMIT or
448 FI_RECV (not both) to indicate the type of data transfer that
449 the flags should apply to, with other flags OR’ed in. The given
450 flags will override the previous transmit and receive attributes
451 that were set when the endpoint was created. Valid control
452 flags are defined below.
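
       A brief sketch of FI_GETOPSFLAG and FI_SETOPSFLAG follows; the helper
       name is hypothetical and error handling is abbreviated.

       static int enable_tx_completions(struct fid_ep *ep)
       {
           uint64_t flags = FI_TRANSMIT;   /* select the transmit flags */
           int ret;

           ret = fi_control(&ep->fid, FI_GETOPSFLAG, &flags);
           if (ret)
               return ret;

           /* Add FI_COMPLETION as a default transmit operation flag. */
           flags |= FI_TRANSMIT | FI_COMPLETION;
           return fi_control(&ep->fid, FI_SETOPSFLAG, &flags);
       }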
453
454 fi_getopt / fi_setopt
       Endpoint protocol options may be retrieved using fi_getopt or set
       using fi_setopt. Applications specify the level at which a desired
       option exists, identify the option, and provide input/output buffers
       to get or set the option. fi_setopt provides an application a way to
       adjust low-level protocol and implementation specific details of an
       endpoint.
460
461 The following option levels and option names and parameters are de‐
462 fined.
463
       FI_OPT_ENDPOINT
465
466 FI_OPT_BUFFERED_LIMIT - size_t
467 Defines the maximum size of a buffered message that will be re‐
468 ported to users as part of a receive completion when the
469 FI_BUFFERED_RECV mode is enabled on an endpoint.
470
471 fi_getopt() will return the currently configured threshold, or the
       provider’s default threshold if one has not been set by the application.
473 fi_setopt() allows an application to configure the threshold. If the
474 provider cannot support the requested threshold, it will fail the
475 fi_setopt() call with FI_EMSGSIZE. Calling fi_setopt() with the
476 threshold set to SIZE_MAX will set the threshold to the maximum sup‐
477 ported by the provider. fi_getopt() can then be used to retrieve the
478 set size.
479
480 In most cases, the sending and receiving endpoints must be configured
481 to use the same threshold value, and the threshold must be set prior to
482 enabling the endpoint.
484
485 FI_OPT_BUFFERED_MIN - size_t
486 Defines the minimum size of a buffered message that will be re‐
487 ported. Applications would set this to a size that’s big enough
488 to decide whether to discard or claim a buffered receive or when
489 to claim a buffered receive on getting a buffered receive com‐
490 pletion. The value is typically used by a provider when sending
491 a rendezvous protocol request where it would send at least
492 FI_OPT_BUFFERED_MIN bytes of application data along with it. A
493 smaller sized rendezvous protocol message usually results in
494 better latency for the overall transfer of a large message.
496
497 FI_OPT_CM_DATA_SIZE - size_t
498 Defines the size of available space in CM messages for user-de‐
499 fined data. This value limits the amount of data that applica‐
500 tions can exchange between peer endpoints using the fi_connect,
501 fi_accept, and fi_reject operations. The size returned is de‐
502 pendent upon the properties of the endpoint, except in the case
503 of passive endpoints, in which the size reflects the maximum
504 size of the data that may be present as part of a connection re‐
505 quest event. This option is read only.
507
508 FI_OPT_MIN_MULTI_RECV - size_t
509 Defines the minimum receive buffer space available when the re‐
510 ceive buffer is released by the provider (see FI_MULTI_RECV).
511 Modifying this value is only guaranteed to set the minimum buf‐
512 fer space needed on receives posted after the value has been
513 changed. It is recommended that applications that want to over‐
514 ride the default MIN_MULTI_RECV value set this option before en‐
515 abling the corresponding endpoint.
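
       A minimal sketch of setting this option is shown below; the helper
       name is hypothetical and min_space is an application-chosen value.

       static int set_min_multi_recv(struct fid_ep *ep, size_t min_space)
       {
           /* Set the FI_MULTI_RECV minimum free-space threshold before the
            * endpoint is enabled. */
           return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                            &min_space, sizeof(min_space));
       }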
517
518 FI_OPT_FI_HMEM_P2P - int
519 Defines how the provider should handle peer to peer FI_HMEM
520 transfers for this endpoint. By default, the provider will
              choose whether to use peer to peer support based on the type of
522 transfer (FI_HMEM_P2P_ENABLED). Valid values defined in fi_end‐
523 point.h are:
524
525 • FI_HMEM_P2P_ENABLED: Peer to peer support may be used by the
526 provider to handle FI_HMEM transfers, and which transfers are
527 initiated using peer to peer is subject to the provider imple‐
528 mentation.
529
       • FI_HMEM_P2P_REQUIRED: Peer to peer support must be used for
         transfers; transfers that cannot be performed using p2p will
         be reported as failing.
533
534 • FI_HMEM_P2P_PREFERRED: Peer to peer support should be used by
535 the provider for all transfers if available, but the provider
536 may choose to copy the data to initiate the transfer if peer
537 to peer support is unavailable.
538
       • FI_HMEM_P2P_DISABLED: Peer to peer support should not be used.

       fi_setopt() will return -FI_EOPNOTSUPP if the mode requested cannot be
       supported by the provider. The FI_HMEM_DISABLE_P2P environment vari‐
       able discussed in fi_mr(3) takes precedence over this setopt option.
544
545 FI_OPT_XPU_TRIGGER - struct fi_trigger_xpu *
546 This option only applies to the fi_getopt() call. It is used to
547 query the maximum number of variables required to support XPU
548 triggered operations, along with the size of each variable.
549
550 The user provides a filled out struct fi_trigger_xpu on input. The
551 iface and device fields should reference an HMEM domain. If the
552 provider does not support XPU triggered operations from the given de‐
553 vice, fi_getopt() will return -FI_EOPNOTSUPP. On input, var should
554 reference an array of struct fi_trigger_var data structures, with count
555 set to the size of the referenced array. If count is 0, the var field
556 will be ignored, and the provider will return the number of fi_trig‐
557 ger_var structures needed. If count is > 0, the provider will set
558 count to the needed value, and for each fi_trigger_var available, set
559 the datatype and count of the variable used for the trigger.
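
       The sketch below queries the required variable count for a CUDA HMEM
       interface; the helper name is hypothetical, device selection is
       omitted, and struct fi_trigger_xpu is assumed to be declared in
       <rdma/fi_trigger.h>.

       #include <rdma/fi_trigger.h>

       static int query_xpu_trigger(struct fid_ep *ep)
       {
           struct fi_trigger_xpu xpu = {
               .iface = FI_HMEM_CUDA,
               .var = NULL,
               .count = 0,     /* 0: only report the required count */
           };
           size_t len = sizeof(xpu);
           int ret;

           ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_XPU_TRIGGER,
                           &xpu, &len);
           /* On success, xpu.count holds the number of fi_trigger_var
            * structures needed to trigger an operation. */
           return ret;
       }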
561
       FI_OPT_CUDA_API_PERMITTED - bool *
              This option only applies to the fi_setopt call. It is used to
              control an endpoint’s behavior in making calls to the CUDA API.
              By default, an endpoint is permitted to call the CUDA API. To
              prohibit an endpoint from making such calls, set this option to
              false. If an endpoint’s support of CUDA memory relies on making
              calls to the CUDA API, it will return -FI_EOPNOTSUPP for the
              call to fi_setopt. If either the CUDA library or a CUDA device
              is not available, the endpoint will return -FI_EINVAL. All
              providers that support the FI_HMEM capability implement this
              option.
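
       A minimal sketch of disabling CUDA API calls follows; the helper name
       is hypothetical.

       #include <stdbool.h>

       static int forbid_cuda_calls(struct fid_ep *ep)
       {
           bool permitted = false;

           /* Fails with -FI_EOPNOTSUPP if the provider cannot support CUDA
            * memory without calling the CUDA API. */
           return fi_setopt(&ep->fid, FI_OPT_ENDPOINT,
                            FI_OPT_CUDA_API_PERMITTED,
                            &permitted, sizeof(permitted));
       }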
573
574 fi_tc_dscp_set
       This call converts a defined DSCP value into a libfabric traffic class
       value. It should be used when assigning a DSCP value to the tclass
       field in either the domain or endpoint attributes.
578
579 fi_tc_dscp_get
580 This call returns the DSCP value associated with the tclass field for
581 the domain or endpoint attributes.
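
       For example, the sketch below requests DSCP 46 (the common expedited
       forwarding code point) for transmit traffic in the hints passed to
       fi_getinfo(); the DSCP value and function name are illustrative only.

       static void request_dscp(struct fi_info *hints)
       {
           hints->tx_attr->tclass = fi_tc_dscp_set(46);
       }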
582
583 fi_rx_size_left (DEPRECATED)
584 This function has been deprecated and will be removed in a future ver‐
585 sion of the library. It may not be supported by all providers.
586
587 The fi_rx_size_left call returns a lower bound on the number of receive
588 operations that may be posted to the given endpoint without that opera‐
589 tion returning -FI_EAGAIN. Depending on the specific details of the
590 subsequently posted receive operations (e.g., number of iov entries,
591 which receive function is called, etc.), it may be possible to post
592 more receive operations than originally indicated by fi_rx_size_left.
593
594 fi_tx_size_left (DEPRECATED)
595 This function has been deprecated and will be removed in a future ver‐
596 sion of the library. It may not be supported by all providers.
597
598 The fi_tx_size_left call returns a lower bound on the number of trans‐
599 mit operations that may be posted to the given endpoint without that
600 operation returning -FI_EAGAIN. Depending on the specific details of
601 the subsequently posted transmit operations (e.g., number of iov en‐
602 tries, which transmit function is called, etc.), it may be possible to
603 post more transmit operations than originally indicated by
604 fi_tx_size_left.
605
ENDPOINT ATTRIBUTES
       The fi_ep_attr structure defines the set of attributes associated with
608 an endpoint. Endpoint attributes may be further refined using the
609 transmit and receive context attributes as shown below.
610
611 struct fi_ep_attr {
612 enum fi_ep_type type;
613 uint32_t protocol;
614 uint32_t protocol_version;
615 size_t max_msg_size;
616 size_t msg_prefix_size;
617 size_t max_order_raw_size;
618 size_t max_order_war_size;
619 size_t max_order_waw_size;
620 uint64_t mem_tag_format;
621 size_t tx_ctx_cnt;
622 size_t rx_ctx_cnt;
623 size_t auth_key_size;
624 uint8_t *auth_key;
625 };
626
627 type - Endpoint Type
628 If specified, indicates the type of fabric interface communication de‐
629 sired. Supported types are:
630
631 FI_EP_DGRAM
632 Supports a connectionless, unreliable datagram communication.
633 Message boundaries are maintained, but the maximum message size
634 may be limited to the fabric MTU. Flow control is not guaran‐
635 teed.
636
637 FI_EP_MSG
638 Provides a reliable, connection-oriented data transfer service
639 with flow control that maintains message boundaries.
640
641 FI_EP_RDM
642 Reliable datagram message. Provides a reliable, connectionless
643 data transfer service with flow control that maintains message
644 boundaries.
645
646 FI_EP_SOCK_DGRAM
647 A connectionless, unreliable datagram endpoint with UDP socket-
648 like semantics. FI_EP_SOCK_DGRAM is most useful for applica‐
649 tions designed around using UDP sockets. See the SOCKET END‐
650 POINT section for additional details and restrictions that apply
651 to datagram socket endpoints.
652
653 FI_EP_SOCK_STREAM
654 Data streaming endpoint with TCP socket-like semantics. Pro‐
655 vides a reliable, connection-oriented data transfer service that
656 does not maintain message boundaries. FI_EP_SOCK_STREAM is most
657 useful for applications designed around using TCP sockets. See
658 the SOCKET ENDPOINT section for additional details and restric‐
659 tions that apply to stream endpoints.
660
661 FI_EP_UNSPEC
662 The type of endpoint is not specified. This is usually provided
663 as input, with other attributes of the endpoint or the provider
664 selecting the type.
665
666 Protocol
667 Specifies the low-level end to end protocol employed by the provider.
668 A matching protocol must be used by communicating endpoints to ensure
669 interoperability. The following protocol values are defined. Provider
670 specific protocols are also allowed. Provider specific protocols will
671 be indicated by having the upper bit of the protocol value set to one.
672
673 FI_PROTO_EFA
674 Proprietary protocol on Elastic Fabric Adapter fabric. It sup‐
675 ports both DGRAM and RDM endpoints.
676
677 FI_PROTO_GNI
678 Protocol runs over Cray GNI low-level interface.
679
680 FI_PROTO_IB_RDM
681 Reliable-datagram protocol implemented over InfiniBand reliable-
682 connected queue pairs.
683
684 FI_PROTO_IB_UD
685 The protocol runs over Infiniband unreliable datagram queue
686 pairs.
687
688 FI_PROTO_IWARP
689 The protocol runs over the Internet wide area RDMA protocol
690 transport.
691
692 FI_PROTO_IWARP_RDM
693 Reliable-datagram protocol implemented over iWarp reliable-con‐
694 nected queue pairs.
695
696 FI_PROTO_NETWORKDIRECT
697 Protocol runs over Microsoft NetworkDirect service provider in‐
698 terface. This adds reliable-datagram semantics over the Net‐
              workDirect connection-oriented endpoint semantics.
700
701 FI_PROTO_PSMX
702 The protocol is based on an Intel proprietary protocol known as
703 PSM, performance scaled messaging. PSMX is an extended version
704 of the PSM protocol to support the libfabric interfaces.
705
706 FI_PROTO_PSMX2
707 The protocol is based on an Intel proprietary protocol known as
708 PSM2, performance scaled messaging version 2. PSMX2 is an ex‐
709 tended version of the PSM2 protocol to support the libfabric in‐
710 terfaces.
711
712 FI_PROTO_PSMX3
713 The protocol is Intel’s protocol known as PSM3, performance
714 scaled messaging version 3. PSMX3 is implemented over RoCEv2
715 and verbs.
716
717 FI_PROTO_RDMA_CM_IB_RC
718 The protocol runs over Infiniband reliable-connected queue
719 pairs, using the RDMA CM protocol for connection establishment.
720
721 FI_PROTO_RXD
722 Reliable-datagram protocol implemented over datagram endpoints.
723 RXD is a libfabric utility component that adds RDM endpoint se‐
724 mantics over DGRAM endpoint semantics.
725
726 FI_PROTO_RXM
727 Reliable-datagram protocol implemented over message endpoints.
728 RXM is a libfabric utility component that adds RDM endpoint se‐
729 mantics over MSG endpoint semantics.
730
731 FI_PROTO_SOCK_TCP
732 The protocol is layered over TCP packets.
733
734 FI_PROTO_UDP
735 The protocol sends and receives UDP datagrams. For example, an
736 endpoint using FI_PROTO_UDP will be able to communicate with a
737 remote peer that is using Berkeley SOCK_DGRAM sockets using IP‐
738 PROTO_UDP.
739
740 FI_PROTO_UNSPEC
741 The protocol is not specified. This is usually provided as in‐
742 put, with other attributes of the socket or the provider select‐
743 ing the actual protocol.
744
745 protocol_version - Protocol Version
746 Identifies which version of the protocol is employed by the provider.
       The protocol version allows providers to extend an existing protocol,
       for example by adding support for additional features or functionality,
       in a backward compatible manner. Providers that support different ver‐
750 sions of the same protocol should inter-operate, but only when using
751 the capabilities defined for the lesser version.
752
753 max_msg_size - Max Message Size
754 Defines the maximum size for an application data transfer as a single
755 operation.
756
757 msg_prefix_size - Message Prefix Size
758 Specifies the size of any required message prefix buffer space. This
       field will be 0 unless the FI_MSG_PREFIX mode is enabled. If msg_pre‐
       fix_size is > 0, the specified value will be a multiple of 8 bytes.
761
762 Max RMA Ordered Size
763 The maximum ordered size specifies the delivery order of transport data
764 into target memory for RMA and atomic operations. Data ordering is
765 separate, but dependent on message ordering (defined below). Data or‐
766 dering is unspecified where message order is not defined.
767
768 Data ordering refers to the access of the same target memory by subse‐
769 quent operations. When back to back RMA read or write operations ac‐
770 cess the same registered memory location, data ordering indicates
771 whether the second operation reads or writes the target memory after
772 the first operation has completed. For example, will an RMA read that
773 follows an RMA write read back the data that was written? Similarly,
774 will an RMA write that follows an RMA read update the target buffer af‐
775 ter the read has transferred the original data? Data ordering answers
776 these questions, even in the presence of errors, such as the need to
777 resend data because of lost or corrupted network traffic.
778
779 RMA ordering applies between two operations, and not within a single
780 data transfer. Therefore, ordering is defined per byte-addressable
781 memory location. I.e. ordering specifies whether location X is ac‐
782 cessed by the second operation after the first operation. Nothing is
783 implied about the completion of the first operation before the second
784 operation is initiated. For example, if the first operation updates
785 locations X and Y, but the second operation only accesses location X,
786 there are no guarantees defined relative to location Y and the second
787 operation.
788
789 In order to support large data transfers being broken into multiple
790 packets and sent using multiple paths through the fabric, data ordering
791 may be limited to transfers of a specific size or less. Providers
792 specify when data ordering is maintained through the following values.
793 Note that even if data ordering is not maintained, message ordering may
794 be.
795
796 max_order_raw_size
797 Read after write size. If set, an RMA or atomic read operation
798 issued after an RMA or atomic write operation, both of which are
799 smaller than the size, will be ordered. Where the target memory
800 locations overlap, the RMA or atomic read operation will see the
801 results of the previous RMA or atomic write.
802
803 max_order_war_size
804 Write after read size. If set, an RMA or atomic write operation
805 issued after an RMA or atomic read operation, both of which are
806 smaller than the size, will be ordered. The RMA or atomic read
807 operation will see the initial value of the target memory loca‐
808 tion before a subsequent RMA or atomic write updates the value.
809
810 max_order_waw_size
811 Write after write size. If set, an RMA or atomic write opera‐
812 tion issued after an RMA or atomic write operation, both of
813 which are smaller than the size, will be ordered. The target
814 memory location will reflect the results of the second RMA or
815 atomic write.
816
817 An order size value of 0 indicates that ordering is not guaranteed. A
818 value of -1 guarantees ordering for any data size.
819
820 mem_tag_format - Memory Tag Format
821 The memory tag format is a bit array used to convey the number of
822 tagged bits supported by a provider. Additionally, it may be used to
823 divide the bit array into separate fields. The mem_tag_format option‐
824 ally begins with a series of bits set to 0, to signify bits which are
825 ignored by the provider. Following the initial prefix of ignored bits,
826 the array will consist of alternating groups of bits set to all 1’s or
827 all 0’s. Each group of bits corresponds to a tagged field. The impli‐
828 cation of defining a tagged field is that when a mask is applied to the
829 tagged bit array, all bits belonging to a single field will either be
830 set to 1 or 0, collectively.
831
832 For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
833 bits, separated into 3 fields. The first field consists of 2-bits, the
834 second field 4-bits, and the final field 8-bits. Valid masks for such
835 a tagged field would be a bitwise OR’ing of zero or more of the follow‐
836 ing values: 0x3000, 0x0F00, and 0x00FF. The provider may not validate
837 the mask provided by the application for performance reasons.
838
839 By identifying fields within a tag, a provider may be able to optimize
840 their search routines. An application which requests tag fields must
841 provide tag masks that either set all mask bits corresponding to a
842 field to all 0 or all 1. When negotiating tag fields, an application
843 can request a specific number of fields of a given size. A provider
844 must return a tag format that supports the requested number of fields,
845 with each field being at least the size requested, or fail the request.
846 A provider may increase the size of the fields. When reporting comple‐
847 tions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the provider
848 would clear out any unsupported tag bits in the tag field of the com‐
849 pletion entry.
850
851 It is recommended that field sizes be ordered from smallest to largest.
852 A generic, unstructured tag and mask can be achieved by requesting a
853 bit array consisting of alternating 1’s and 0’s.
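
       As a sketch of the 0x30FF example above, an application might pack
       three field values into a tag as follows; the field names are purely
       hypothetical, and the shift amounts follow directly from that layout.

       static uint64_t make_tag(uint64_t job, uint64_t rank, uint64_t msg)
       {
           return ((job  & 0x3)  << 12) |   /* field mask 0x3000 */
                  ((rank & 0xF)  << 8)  |   /* field mask 0x0F00 */
                   (msg  & 0xFF);           /* field mask 0x00FF */
       }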
854
855 tx_ctx_cnt - Transmit Context Count
856 Number of transmit contexts to associate with the endpoint. If not
857 specified (0), 1 context will be assigned if the endpoint supports out‐
858 bound transfers. Transmit contexts are independent transmit queues
859 that may be separately configured. Each transmit context may be bound
860 to a separate CQ, and no ordering is defined between contexts. Addi‐
861 tionally, no synchronization is needed when accessing contexts in par‐
862 allel.
863
864 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
865 be configured to use a shared transmit context, if supported by the
866 provider. Providers that do not support shared transmit contexts will
867 fail the request.
868
869 See the scalable endpoint and shared contexts sections for additional
870 details.
871
872 rx_ctx_cnt - Receive Context Count
873 Number of receive contexts to associate with the endpoint. If not
874 specified, 1 context will be assigned if the endpoint supports inbound
875 transfers. Receive contexts are independent processing queues that may
876 be separately configured. Each receive context may be bound to a sepa‐
877 rate CQ, and no ordering is defined between contexts. Additionally, no
878 synchronization is needed when accessing contexts in parallel.
879
880 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
881 be configured to use a shared receive context, if supported by the
882 provider. Providers that do not support shared receive contexts will
883 fail the request.
884
885 See the scalable endpoint and shared contexts sections for additional
886 details.
887
888 auth_key_size - Authorization Key Length
889 The length of the authorization key in bytes. This field will be 0 if
890 authorization keys are not available or used. This field is ignored
891 unless the fabric is opened with API version 1.5 or greater.
892
893 auth_key - Authorization Key
894 If supported by the fabric, an authorization key (a.k.a. job key) to
895 associate with the endpoint. An authorization key is used to limit
896 communication between endpoints. Only peer endpoints that are pro‐
897 grammed to use the same authorization key may communicate. Authoriza‐
898 tion keys are often used to implement job keys, to ensure that process‐
899 es running in different jobs do not accidentally cross traffic. The
900 domain authorization key will be used if auth_key_size is set to 0.
901 This field is ignored unless the fabric is opened with API version 1.5
902 or greater.
903
TRANSMIT CONTEXT ATTRIBUTES
       Attributes specific to the transmit capabilities of an endpoint are
906 specified using struct fi_tx_attr.
907
908 struct fi_tx_attr {
909 uint64_t caps;
910 uint64_t mode;
911 uint64_t op_flags;
912 uint64_t msg_order;
913 uint64_t comp_order;
914 size_t inject_size;
915 size_t size;
916 size_t iov_limit;
917 size_t rma_iov_limit;
918 uint32_t tclass;
919 };
920
921 caps - Capabilities
922 The requested capabilities of the context. The capabilities must be a
923 subset of those requested of the associated endpoint. See the CAPABIL‐
924 ITIES section of fi_getinfo(3) for capability details. If the caps
925 field is 0 on input to fi_getinfo(3), the applicable capability bits
926 from the fi_info structure will be used.
927
928 The following capabilities apply to the transmit attributes: FI_MSG,
929 FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM,
930 FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX,
931 FI_COLLECTIVE, and FI_XPU.
932
933 Many applications will be able to ignore this field and rely solely on
934 the fi_info::caps field. Use of this field provides fine grained con‐
935 trol over the transmit capabilities associated with an endpoint. It is
936 useful when handling scalable endpoints, with multiple transmit con‐
937 texts, for example, and allows configuring a specific transmit context
938 with fewer capabilities than that supported by the endpoint or other
939 transmit contexts.
940
941 mode
942 The operational mode bits of the context. The mode bits will be a sub‐
943 set of those associated with the endpoint. See the MODE section of
944 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
945 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
946 stead. On return from fi_getinfo(3), the mode will be set only to
947 those constraints specific to transmit operations.
948
949 op_flags - Default transmit operation flags
       Flags that control the behavior of operations submitted against the
951 context. Applicable flags are listed in the Operation Flags section.
952
953 msg_order - Message Ordering
954 Message ordering refers to the order in which transport layer headers
955 (as viewed by the application) are identified and processed. Relaxed
956 message order enables data transfers to be sent and received out of or‐
957 der, which may improve performance by utilizing multiple paths through
958 the fabric from the initiating endpoint to a target endpoint. Message
959 order applies only between a single source and destination endpoint
960 pair. Ordering between different target endpoints is not defined.
961
962 Message order is determined using a set of ordering bits. Each set bit
963 indicates that ordering is maintained between data transfers of the
964 specified type. Message order is defined for [read | write | send] op‐
965 erations submitted by an application after [read | write | send] opera‐
966 tions.
967
968 Message ordering only applies to the end to end transmission of trans‐
       port headers. Message ordering is necessary for, but does not by itself
       guarantee, the order in which message data is sent or received by the
       transport layer. Message ordering requires matching ordering semantics
       on the receiving side of a data transfer operation in order to
       guarantee that ordering is met.
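
       Message ordering is normally requested through the hints passed to
       fi_getinfo(), as in the sketch below; the flag combination and helper
       name are illustrative only, and hints is assumed to have been
       allocated with fi_allocinfo().

       static void request_ordering(struct fi_info *hints)
       {
           /* Keep sends ordered relative to other sends, and RMA reads
            * ordered after RMA writes. */
           hints->tx_attr->msg_order = FI_ORDER_SAS | FI_ORDER_RMA_RAW;
           hints->rx_attr->msg_order = FI_ORDER_SAS | FI_ORDER_RMA_RAW;
       }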
974
975 FI_ORDER_ATOMIC_RAR
976 Atomic read after read. If set, atomic fetch operations are
977 transmitted in the order submitted relative to other atomic
978 fetch operations. If not set, atomic fetches may be transmitted
979 out of order from their submission.
980
981 FI_ORDER_ATOMIC_RAW
982 Atomic read after write. If set, atomic fetch operations are
983 transmitted in the order submitted relative to atomic update op‐
984 erations. If not set, atomic fetches may be transmitted ahead
985 of atomic updates.
986
987 FI_ORDER_ATOMIC_WAR
              Atomic write after read. If set, atomic update operations are
989 transmitted in the order submitted relative to atomic fetch op‐
990 erations. If not set, atomic updates may be transmitted ahead
991 of atomic fetches.
992
993 FI_ORDER_ATOMIC_WAW
              Atomic write after write. If set, atomic update operations are
              transmitted in the order submitted relative to other atomic up‐
              date operations. If not set, atomic updates may be transmitted
              out of order from their submission.
998
999 FI_ORDER_NONE
1000 No ordering is specified. This value may be used as input in
1001 order to obtain the default message order supported by the
1002 provider. FI_ORDER_NONE is an alias for the value 0.
1003
1004 FI_ORDER_RAR
1005 Read after read. If set, RMA and atomic read operations are
1006 transmitted in the order submitted relative to other RMA and
1007 atomic read operations. If not set, RMA and atomic reads may be
1008 transmitted out of order from their submission.
1009
1010 FI_ORDER_RAS
1011 Read after send. If set, RMA and atomic read operations are
1012 transmitted in the order submitted relative to message send op‐
1013 erations, including tagged sends. If not set, RMA and atomic
1014 reads may be transmitted ahead of sends.
1015
1016 FI_ORDER_RAW
1017 Read after write. If set, RMA and atomic read operations are
1018 transmitted in the order submitted relative to RMA and atomic
1019 write operations. If not set, RMA and atomic reads may be
1020 transmitted ahead of RMA and atomic writes.
1021
1022 FI_ORDER_RMA_RAR
1023 RMA read after read. If set, RMA read operations are transmit‐
1024 ted in the order submitted relative to other RMA read opera‐
1025 tions. If not set, RMA reads may be transmitted out of order
1026 from their submission.
1027
1028 FI_ORDER_RMA_RAW
1029 RMA read after write. If set, RMA read operations are transmit‐
1030 ted in the order submitted relative to RMA write operations. If
1031 not set, RMA reads may be transmitted ahead of RMA writes.
1032
1033 FI_ORDER_RMA_WAR
1034 RMA write after read. If set, RMA write operations are trans‐
1035 mitted in the order submitted relative to RMA read operations.
1036 If not set, RMA writes may be transmitted ahead of RMA reads.
1037
1038 FI_ORDER_RMA_WAW
1039 RMA write after write. If set, RMA write operations are trans‐
1040 mitted in the order submitted relative to other RMA write opera‐
1041 tions. If not set, RMA writes may be transmitted out of order
1042 from their submission.
1043
1044 FI_ORDER_SAR
1045 Send after read. If set, message send operations, including
              tagged sends, are transmitted in the order submitted relative to RMA
1047 and atomic read operations. If not set, message sends may be
1048 transmitted ahead of RMA and atomic reads.
1049
1050 FI_ORDER_SAS
1051 Send after send. If set, message send operations, including
1052 tagged sends, are transmitted in the order submitted relative to
              other message sends. If not set, message sends may be transmit‐
1054 ted out of order from their submission.
1055
1056 FI_ORDER_SAW
1057 Send after write. If set, message send operations, including
              tagged sends, are transmitted in the order submitted relative to RMA
1059 and atomic write operations. If not set, message sends may be
1060 transmitted ahead of RMA and atomic writes.
1061
1062 FI_ORDER_WAR
1063 Write after read. If set, RMA and atomic write operations are
1064 transmitted in the order submitted relative to RMA and atomic
1065 read operations. If not set, RMA and atomic writes may be
1066 transmitted ahead of RMA and atomic reads.
1067
1068 FI_ORDER_WAS
1069 Write after send. If set, RMA and atomic write operations are
1070 transmitted in the order submitted relative to message send op‐
1071 erations, including tagged sends. If not set, RMA and atomic
1072 writes may be transmitted ahead of sends.
1073
1074 FI_ORDER_WAW
1075 Write after write. If set, RMA and atomic write operations are
1076 transmitted in the order submitted relative to other RMA and
1077 atomic write operations. If not set, RMA and atomic writes may
1078 be transmitted out of order from their submission.
1079
1080 comp_order - Completion Ordering
1081 Completion ordering refers to the order in which completed requests are
1082 written into the completion queue. Completion ordering is similar to
1083 message order. Relaxed completion order may enable faster reporting of
1084 completed transfers, allow acknowledgments to be sent over different
1085 fabric paths, and support more sophisticated retry mechanisms. This
1086 can result in lower-latency completions, particularly when using con‐
1087 nectionless endpoints. Strict completion ordering may require that
1088 providers queue completed operations or limit available optimizations.
1089
1090 For transmit requests, completion ordering depends on the endpoint com‐
1091 munication type. For unreliable communication, completion ordering ap‐
1092 plies to all data transfer requests submitted to an endpoint. For re‐
1093 liable communication, completion ordering only applies to requests that
1094 target a single destination endpoint. Completion ordering of requests
1095 that target different endpoints over a reliable transport is not de‐
1096 fined.
1097
1098 Applications should specify the completion ordering that they support
1099 or require. Providers should return the completion order that they ac‐
       tually provide, with the constraint that the returned ordering is no
       less strict than that specified by the application. Supported completion
1102 order values are:
1103
1104 FI_ORDER_NONE
1105 No ordering is defined for completed operations. Requests sub‐
1106 mitted to the transmit context may complete in any order.
1107
1108 FI_ORDER_STRICT
1109 Requests complete in the order in which they are submitted to
1110 the transmit context.
1111
1112 inject_size
1113 The requested inject operation size (see the FI_INJECT flag) that the
1114 context will support. This is the maximum size data transfer that can
1115 be associated with an inject operation (such as fi_inject) or may be
1116 used with the FI_INJECT data transfer flag.
1117
1118 size
1119 The size of the transmit context. The mapping of the size value to re‐
1120 sources is provider specific, but it is directly related to the number
1121 of command entries allocated for the endpoint. A smaller size value
1122 consumes fewer hardware and software resources, while a larger size al‐
1123 lows queuing more transmit requests.
1124
       While the size attribute guides the size of the underlying endpoint
       transmit queue, there is not necessarily a one-to-one mapping between a
       transmit operation and a queue entry. A single transmit operation may
       consume multiple queue entries; for example, one per scatter-gather en‐
       try. Additionally, the size field is intended to guide the allocation
       of the endpoint’s transmit context. Specifically, for connectionless
       endpoints, there may be lower-level queues used to track communication
       on a per peer basis. The sizes of any lower-level queues may be
       significantly smaller than the endpoint’s transmit size, in order to
       reduce resource utilization.
1135
1136 iov_limit
1137 This is the maximum number of IO vectors (scatter-gather elements) that
1138 a single posted operation may reference.
1139
1140 rma_iov_limit
1141 This is the maximum number of RMA IO vectors (scatter-gather elements)
1142 that an RMA or atomic operation may reference. The rma_iov_limit cor‐
1143 responds to the rma_iov_count values in RMA and atomic operations. See
1144 struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
1145 for additional details. This limit applies to both the number of RMA
1146 IO vectors that may be specified when initiating an operation from the
1147 local endpoint, as well as the maximum number of IO vectors that may be
1148 carried in a single request from a remote endpoint.
1149
1150 Traffic Class (tclass)
1151 Traffic classes can be a differentiated services code point (DSCP) val‐
1152 ue, one of the following defined labels, or a provider-specific defini‐
1153 tion. If tclass is unset or set to FI_TC_UNSPEC, the endpoint will use
1154 the default traffic class associated with the domain.
1155
1156 FI_TC_BEST_EFFORT
1157 This is the default in the absence of any other local or fabric
1158 configuration. This class carries the traffic for a number of
1159 applications executing concurrently over the same network infra‐
1160 structure. Even though it is shared, network capacity and re‐
1161 source allocation are distributed fairly across the applica‐
1162 tions.
1163
1164 FI_TC_BULK_DATA
1165 This class is intended for large data transfers associated with
1166 I/O and is present to separate sustained I/O transfers from oth‐
1167 er application inter-process communications.
1168
1169 FI_TC_DEDICATED_ACCESS
1170 This class operates at the highest priority, except the manage‐
1171 ment class. It carries a high bandwidth allocation, minimum la‐
1172 tency targets, and the highest scheduling and arbitration prior‐
1173 ity.
1174
1175 FI_TC_LOW_LATENCY
1176 This class supports low latency, low jitter data patterns typi‐
1177 cally caused by transactional data exchanges, barrier synchro‐
1178 nizations, and collective operations that are typical of HPC ap‐
1179 plications. This class often requires maximum tolerable laten‐
              cies that data transfers must achieve for correct or performant
              operation. Fulfillment of such requests in this class will
1182 typically require accompanying bandwidth and message size limi‐
1183 tations so as not to consume excessive bandwidth at high priori‐
1184 ty.
1185
1186 FI_TC_NETWORK_CTRL
1187 This class is intended for traffic directly related to fabric
1188 (network) management, which is critical to the correct operation
1189 of the network. Its use is typically restricted to privileged
1190 network management applications.
1191
1192 FI_TC_SCAVENGER
1193 This class is used for data that is desired but does not have
1194 strict delivery requirements, such as in-band network or appli‐
1195 cation level monitoring data. Use of this class indicates that
1196 the traffic is considered lower priority and should not inter‐
1197 fere with higher priority workflows.
1198
1199 fi_tc_dscp_set / fi_tc_dscp_get
1200 DSCP values are supported via the DSCP get and set functions.
1201 The definitions for DSCP values are outside the scope of libfab‐
1202 ric. See the fi_tc_dscp_set and fi_tc_dscp_get function defini‐
1203 tions for details on their use.
1204
RECEIVE CONTEXT ATTRIBUTES
       Attributes specific to the receive capabilities of an endpoint are
1207 specified using struct fi_rx_attr.
1208
1209 struct fi_rx_attr {
1210 uint64_t caps;
1211 uint64_t mode;
1212 uint64_t op_flags;
1213 uint64_t msg_order;
1214 uint64_t comp_order;
1215 size_t total_buffered_recv;
1216 size_t size;
1217 size_t iov_limit;
1218 };
1219
1220 caps - Capabilities
1221 The requested capabilities of the context. The capabilities must be a
1222 subset of those requested of the associated endpoint. See the CAPABIL‐
1223 ITIES section of fi_getinfo(3) for capability details. If the caps
1224 field is 0 on input to fi_getinfo(3), the applicable capability bits
1225 from the fi_info structure will be used.
1226
1227 The following capabilities apply to the receive attributes: FI_MSG,
1228 FI_RMA, FI_TAGGED, FI_ATOMIC, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_RECV,
1229 FI_HMEM, FI_TRIGGER, FI_RMA_PMEM, FI_DIRECTED_RECV, FI_VARIABLE_MSG,
1230 FI_MULTI_RECV, FI_SOURCE, FI_RMA_EVENT, FI_SOURCE_ERR, FI_COLLECTIVE,
1231 and FI_XPU.
1232
1233 Many applications will be able to ignore this field and rely solely on
1234 the fi_info::caps field. Use of this field provides fine grained con‐
1235 trol over the receive capabilities associated with an endpoint. It is
1236 useful, for example, when handling scalable endpoints with multiple
1237 receive contexts, as it allows configuring a specific receive context
1238 with fewer capabilities than those supported by the endpoint or other
1239 receive contexts.
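
By way of illustration, the hedged sketch below narrows the receive capabilities requested through the fi_getinfo(3) hints; the specific capability bits chosen are example values, not requirements.

      #include <rdma/fabric.h>
      #include <rdma/fi_errno.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: request fewer receive capabilities than the
       * endpoint as a whole.  The capability bits are example values. */
      static int narrow_rx_caps(struct fi_info **info)
      {
          struct fi_info *hints = fi_allocinfo();
          int ret;

          if (!hints)
              return -FI_ENOMEM;

          /* Endpoint-level capabilities requested by the application. */
          hints->caps = FI_MSG | FI_TAGGED | FI_RMA;

          /* This receive context only needs untagged message receives. */
          hints->rx_attr->caps = FI_MSG | FI_RECV;

          ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, info);
          fi_freeinfo(hints);
          return ret;
      }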
1240
1241 mode
1242 The operational mode bits of the context. The mode bits will be a sub‐
1243 set of those associated with the endpoint. See the MODE section of
1244 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
1245 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
1246 stead. On return from fi_getinfo(3), the mode will be set only to
1247 those constraints specific to receive operations.
1248
1249 op_flags - Default receive operation flags
1250 Flags that apply by default to operations submitted against the
1251 context. Applicable flags are listed in the Operation Flags section.
1252
1253 msg_order - Message Ordering
1254 For a description of message ordering, see the msg_order field in the
1255 Transmit Context Attribute section. Receive context message ordering
1256 defines the order in which received transport message headers are pro‐
1257 cessed when received by an endpoint. When ordering is set, it indi‐
1258 cates that message headers will be processed in order, based on how the
1259 transmit side has identified the messages. Typically, this means that
1260 messages will be handled in order based on a message level sequence
1261 number.
1262
1263 The following ordering flags, as defined for transmit ordering, also
1264 apply to the processing of received operations: FI_ORDER_NONE, FI_OR‐
1265 DER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_OR‐
1266 DER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR,
1267 FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOM‐
1268 IC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOM‐
1269 IC_WAW.
1270
1271 comp_order - Completion Ordering
1272 For a description of completion ordering, see the comp_order field in
1273 the Transmit Context Attribute section.
1274
1275 FI_ORDER_DATA
1276 When set, this bit indicates that received data is written into
1277 memory in order. Data ordering applies to memory accessed as
1278 part of a single operation and between operations if message or‐
1279 dering is guaranteed.
1280
1281 FI_ORDER_NONE
1282 No ordering is defined for completed operations. Receive opera‐
1283 tions may complete in any order, regardless of their submission
1284 order.
1285
1286 FI_ORDER_STRICT
1287 Receive operations complete in the order in which they are pro‐
1288 cessed by the receive context, based on the receive side msg_or‐
1289 der attribute.
1290
1291 total_buffered_recv
1292 This field is supported for backwards compatibility purposes. It is a
1293 hint to the provider of the total available space that may be needed to
1294 buffer messages that are received for which there is no matching re‐
1295 ceive operation. The provider may adjust or ignore this value. The
1296 allocation of internal network buffering among received messages is
1297 provider specific. For instance, a provider may limit the size of mes‐
1298 sages which can be buffered or the amount of buffering allocated to a
1299 single message.
1300
1301 If receive side buffering is disabled (total_buffered_recv = 0) and a
1302 message is received by an endpoint, then the behavior is dependent on
1303 whether resource management has been enabled (FI_RM_ENABLED has been
1304 set or not). See the Resource Management section of fi_domain(3) for
1305 further clarification. It is recommended that applications enable re‐
1306 source management if they anticipate receiving unexpected messages,
1307 rather than modifying this value.
1308
1309 size
1310 The size of the receive context. The mapping of the size value to re‐
1311 sources is provider specific, but it is directly related to the number
1312 of command entries allocated for the endpoint. A smaller size value
1313 consumes fewer hardware and software resources, while a larger size
1314 allows queuing more receive requests.
1315
1316 While the size attribute guides the size of the underlying endpoint receive
1317 queue, there is not necessarily a one-to-one mapping between a receive
1318 operation and a queue entry. A single receive operation may consume
1319 multiple queue entries; for example, one per scatter-gather entry. Ad‐
1320 ditionally, the size field is intended to guide the allocation of the
1321 endpoint’s receive context. Specifically, for connectionless end‐
1322 points, there may be lower-level queues used to track communication on
1323 a per-peer basis. The sizes of any lower-level queues may be signifi‐
1324 cantly smaller than the endpoint’s receive size, in order to reduce
1325 resource utilization.
1326
1327 iov_limit
1328 This is the maximum number of IO vectors (scatter-gather elements) that
1329 a single posted operation may reference.
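
As a hedged illustration, the sketch below posts a two-element scatter-gather receive with fi_recvv, assuming the provider reported an iov_limit of at least two and that local memory registration (FI_MR_LOCAL) is not required; the buffers and sizes are supplied by the caller.

      #include <sys/uio.h>
      #include <rdma/fabric.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: post a two-element scatter-gather receive, assuming
       * the provider reported rx_attr->iov_limit >= 2.  Buffers and sizes
       * are supplied by the caller and are illustrative. */
      static ssize_t post_sg_recv(struct fid_ep *ep, void *hdr, size_t hdr_len,
                                  void *payload, size_t payload_len, void *ctx)
      {
          struct iovec iov[2] = {
              { .iov_base = hdr,     .iov_len = hdr_len     },
              { .iov_base = payload, .iov_len = payload_len },
          };

          /* desc may be NULL when FI_MR_LOCAL is not required. */
          return fi_recvv(ep, iov, NULL, 2, FI_ADDR_UNSPEC, ctx);
      }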
1330
1331 SCALABLE ENDPOINTS
1332 A scalable endpoint is a communication portal that supports multiple
1333 transmit and receive contexts. Scalable endpoints are loosely modeled
1334 after the networking concept of transmit/receive side scaling, also
1335 known as multi-queue. Support for scalable endpoints is domain specif‐
1336 ic. Scalable endpoints may improve the performance of multi-threaded
1337 and parallel applications, by allowing threads to access independent
1338 transmit and receive queues. A scalable endpoint has a single trans‐
1339 port level address, which can reduce the memory requirements needed to
1340 store remote addressing data, versus using standard endpoints. Scal‐
1341 able endpoints cannot be used directly for communication operations,
1342 and require the application to explicitly create transmit and receive
1343 contexts as described below.
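
The following sketch is illustrative only: it requests four transmit and four receive contexts before opening a scalable endpoint. The counts are assumptions; they are normally requested through the hints passed to fi_getinfo(3) and are limited by the domain attributes max_ep_tx_ctx and max_ep_rx_ctx.

      #include <rdma/fabric.h>
      #include <rdma/fi_domain.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: open a scalable endpoint with 4 transmit and 4
       * receive contexts.  The counts are example values; they are normally
       * requested through the fi_getinfo(3) hints and are capped by the
       * domain's max_ep_tx_ctx / max_ep_rx_ctx attributes. */
      static int open_scalable_ep(struct fid_domain *domain,
                                  struct fi_info *info, struct fid_ep **sep)
      {
          info->ep_attr->tx_ctx_cnt = 4;
          info->ep_attr->rx_ctx_cnt = 4;

          return fi_scalable_ep(domain, info, sep, NULL);
      }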
1344
1345 fi_tx_context
1346 Transmit contexts are independent transmit queues. Ordering and syn‐
1347 chronization between contexts are not defined. Conceptually a transmit
1348 context behaves like a send-only endpoint. A transmit context
1349 may be configured with fewer capabilities than the base endpoint and
1350 with different attributes (such as ordering requirements and inject
1351 size) than other contexts associated with the same scalable endpoint.
1352 Each transmit context has its own completion queue. The number of
1353 transmit contexts associated with an endpoint is specified during end‐
1354 point creation.
1355
1356 The fi_tx_context call is used to retrieve a specific context, identi‐
1357 fied by an index (see above for details on transmit context at‐
1358 tributes). Providers may dynamically allocate contexts when fi_tx_con‐
1359 text is called, or may statically create all contexts when fi_endpoint
1360 is invoked. By default, a transmit context inherits the properties of
1361 its associated endpoint. However, applications may request context
1362 specific attributes through the attr parameter. Support for per trans‐
1363 mit context attributes is provider specific and not guaranteed.
1364 Providers will return the actual attributes assigned to the context
1365 through the attr parameter, if provided.
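
For illustration, the hedged sketch below retrieves transmit context 0 with narrower, context-specific attributes and binds it to an existing completion queue; the attribute values, and the assumption that per-context attributes are honored by the provider, are examples only.

      #include <rdma/fabric.h>
      #include <rdma/fi_eq.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: retrieve transmit context 0 of a scalable endpoint
       * with narrower, context-specific attributes, bind it to an existing
       * completion queue, and enable it.  Error unwinding is omitted. */
      static int open_tx_ctx0(struct fid_ep *sep, struct fid_cq *cq,
                              struct fid_ep **tx_ctx)
      {
          struct fi_tx_attr tx_attr = {
              .op_flags = FI_INJECT_COMPLETE,
              .size     = 256,   /* example value, smaller than the default */
          };
          int ret;

          ret = fi_tx_context(sep, 0, &tx_attr, tx_ctx, NULL);
          if (ret)
              return ret;

          ret = fi_ep_bind(*tx_ctx, &cq->fid, FI_TRANSMIT);
          if (ret)
              return ret;

          return fi_enable(*tx_ctx);
      }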
1366
1367 fi_rx_context
1368 Receive contexts are independent receive queues for receiving incoming
1369 data. Ordering and synchronization between contexts are not guaran‐
1370 teed. Conceptually a receive context behaves like a receive-only
1371 endpoint. A receive context may be configured with fewer capabilities
1372 than the base endpoint and with different attributes (such as ordering
1373 requirements and inject size) than other contexts associated with the
1374 same scalable endpoint. Each receive context has its own completion
1375 queue. The number of receive contexts associated with an endpoint is
1376 specified during endpoint creation.
1377
1378 Receive contexts are often associated with steering flows that select
1379 which receive context processes an incoming packet targeting a scal‐
1380 able endpoint. However, receive contexts may be targeted directly by
1381 the initiator, if supported by the underlying protocol. Such contexts
1382 are referred to as `named'. Support for named contexts must be re‐
1383 quested by setting the FI_NAMED_RX_CTX capability when the correspond‐
1384 ing endpoint is created. Support for named receive contexts is coor‐
1385 dinated with address vectors. See fi_av(3) and fi_rx_addr(3).
1386
1387 The fi_rx_context call is used to retrieve a specific context, identi‐
1388 fied by an index (see above for details on receive context attributes).
1389 Providers may dynamically allocate contexts when fi_rx_context is
1390 called, or may statically create all contexts when fi_endpoint is in‐
1391 voked. By default, a receive context inherits the properties of its
1392 associated endpoint. However, applications may request context specif‐
1393 ic attributes through the attr parameter. Support for per receive con‐
1394 text attributes is provider specific and not guaranteed. Providers
1395 will return the actual attributes assigned to the context through the
1396 attr parameter, if provided.
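
As a hedged example of named receive contexts, the sketch below constructs a peer address that targets a specific receive context with fi_rx_addr; it assumes FI_NAMED_RX_CTX was requested and that rx_ctx_bits matches the value supplied in the fi_av_attr used to open the address vector.

      #include <rdma/fabric.h>
      #include <rdma/fi_domain.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: with FI_NAMED_RX_CTX, an initiator can address a
       * specific receive context of a peer's scalable endpoint.  The
       * rx_ctx_bits value must match the one given in the fi_av_attr used
       * to open the address vector. */
      static ssize_t send_to_rx_ctx(struct fid_ep *tx_ctx, const void *buf,
                                    size_t len, fi_addr_t peer, int rx_index,
                                    int rx_ctx_bits, void *context)
      {
          fi_addr_t dest = fi_rx_addr(peer, rx_index, rx_ctx_bits);

          /* desc may be NULL when FI_MR_LOCAL is not required. */
          return fi_send(tx_ctx, buf, len, NULL, dest, context);
      }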
1397
1398 SHARED CONTEXTS
1399 Shared contexts are transmit and receive contexts explicitly shared
1400 among one or more endpoints. A shareable context allows an application
1401 to use a single dedicated provider resource among multiple transport
1402 addressable endpoints. This can greatly reduce the resources needed to
1403 manage communication over multiple endpoints by multiplexing transmit
1404 and/or receive processing, with the potential cost of serializing ac‐
1405 cess across multiple endpoints. Support for shareable contexts is do‐
1406 main specific.
1407
1408 Conceptually, shareable transmit contexts are transmit queues that may
1409 be accessed by many endpoints. The use of a shared transmit context is
1410 mostly opaque to an application. Applications must allocate and bind
1411 shared transmit contexts to endpoints, but operations are posted di‐
1412 rectly to the endpoint. Shared transmit contexts are not associated
1413 with completion queues or counters. Completed operations are posted to
1414 the CQs bound to the endpoint. An endpoint may only be associated with
1415 a single shared transmit context.
1416
1417 Unlike shared transmit contexts, applications interact directly with
1418 shared receive contexts. Users post receive buffers directly to a
1419 shared receive context, with the buffers usable by any endpoint bound
1420 to the shared receive context. Shared receive contexts are not associ‐
1421 ated with completion queues or counters. Completed receive operations
1422 are posted to the CQs bound to the endpoint. An endpoint may only be
1423 associated with a single shared receive context, and all connection‐
1424 less endpoints associated with a shared receive context must also
1425 share the same address vector.
1426
1427 Endpoints associated with a shared transmit context may use dedicated
1428 receive contexts, and vice-versa. Alternatively, an endpoint may use
1429 shared transmit and receive contexts. There is no requirement that the
1430 same group of endpoints sharing a context of one type also share the
1431 context of an alternate type. Furthermore, an endpoint may use a shared
1432 context of one type, but a scalable set of contexts of the alternate type.
1433
1434 fi_stx_context
1435 This call is used to open a shareable transmit context (see above for
1436 details on the transmit context attributes). Endpoints associated with
1437 a shared transmit context must use a subset of the transmit context’s
1438 attributes. Note that this is the reverse of the requirement for
1439 transmit contexts for scalable endpoints.
1440
1441 fi_srx_context
1442 This allocates a shareable receive context (see above for details on
1443 the receive context attributes). Endpoints associated with a shared
1444 receive context must use a subset of the receive context’s attributes.
1445 Note that this is the reverse of the requirement for receive contexts
1446 for scalable endpoints.
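
The following hedged sketch allocates one shared transmit and one shared receive context and binds both to an endpoint. Passing NULL attributes to request default attributes is an assumption of the example, and error unwinding is omitted.

      #include <rdma/fabric.h>
      #include <rdma/fi_domain.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: allocate shared transmit/receive contexts and bind
       * both to an endpoint.  NULL attributes are assumed to select the
       * defaults; error unwinding is omitted. */
      static int bind_shared_contexts(struct fid_domain *domain,
                                      struct fid_ep *ep,
                                      struct fid_stx **stx,
                                      struct fid_ep **srx)
      {
          int ret;

          ret = fi_stx_context(domain, NULL, stx, NULL);
          if (ret)
              return ret;

          ret = fi_srx_context(domain, NULL, srx, NULL);
          if (ret)
              return ret;

          /* Transmit operations are still posted to the endpoint, while
           * receive buffers are posted directly to the shared RX context. */
          ret = fi_ep_bind(ep, &(*stx)->fid, 0);
          if (ret)
              return ret;

          return fi_ep_bind(ep, &(*srx)->fid, 0);
      }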
1447
1448 SOCKET ENDPOINTS
1449 The following feature and description should be considered experimen‐
1450 tal. Until the experimental tag is removed, the interfaces, semantics,
1451 and data structures associated with socket endpoints may change between
1452 library versions.
1453
1454 This section applies to endpoints of type FI_EP_SOCK_STREAM and
1455 FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1456
1457 Socket endpoints are defined with semantics that allow them to more
1458 easily be adopted by developers familiar with the UNIX socket API, or
1459 by middleware that exposes the socket API, while still taking advantage
1460 of high-performance hardware features.
1461
1462 The key difference between socket endpoints and other active endpoints
1463 is that socket endpoints use synchronous data transfers. Buffers passed
1464 into send and receive operations revert to the control of the applica‐
1465 tion upon returning from the function call. As a result, no data
1466 transfer completions are reported to the application, and socket end‐
1467 points are not associated with completion queues or counters.
1468
1469 Socket endpoints support a subset of message operations: fi_send,
1470 fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
1471 Because data transfers are synchronous, the return value from send and
1472 receive operations indicates the number of bytes transferred on success,
1473 or a negative value on error, including -FI_EAGAIN if the endpoint can‐
1474 not send or receive any data because of full or empty queues, respec‐
1475 tively.
1476
1477 Socket endpoints are associated with event queues and address vectors,
1478 and process connection management events asynchronously, similar to
1479 other endpoints. Unlike UNIX sockets, socket endpoints must still be
1480 declared as either active or passive.
1481
1482 Socket endpoints behave like non-blocking sockets. In order to support
1483 select and poll semantics, active socket endpoints are associated with
1484 a file descriptor that is signaled whenever the endpoint is ready to
1485 send and/or receive data. The file descriptor may be retrieved using
1486 fi_control.
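
Because socket endpoints are experimental, the sketch below is only a rough illustration of the synchronous semantics: it retries a send while the transmit queue is full, waiting on the endpoint's file descriptor. The use of the FI_GETWAIT control command to obtain that descriptor is an assumption of this example.

      #include <poll.h>
      #include <rdma/fabric.h>
      #include <rdma/fi_errno.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: synchronous send on a connected socket endpoint.
       * A non-negative return is the byte count; -FI_EAGAIN means the
       * transmit queue is full.  Using FI_GETWAIT to obtain the readiness
       * file descriptor is an assumption of this example. */
      static ssize_t sock_ep_send(struct fid_ep *ep, const void *buf, size_t len)
      {
          struct pollfd pfd = { .events = POLLOUT };
          ssize_t ret;

          for (;;) {
              ret = fi_send(ep, buf, len, NULL, FI_ADDR_UNSPEC, NULL);
              if (ret != -FI_EAGAIN)
                  return ret;        /* bytes sent, or another error */

              if (fi_control(&ep->fid, FI_GETWAIT, &pfd.fd))
                  return -FI_EOTHER;

              poll(&pfd, 1, -1);     /* wait until the endpoint is ready */
          }
      }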
1487
1488 OPERATION FLAGS
1489 Operation flags are formed by OR-ing the following flags together and
1490 define the default flags applied to an endpoint’s data transfer opera‐
1491 tions when no flags parameter is available. Data transfer operations
1492 that take flags as input override the op_flags value of the endpoint’s
1493 transmit or receive context attributes. A usage sketch follows this list.
1494
1495 FI_COMMIT_COMPLETE
1496 Indicates that a completion should not be generated (locally or
1497 at the peer) until the result of an operation has been made
1498 persistent. See fi_cq(3) for additional details on completion
1499 semantics.
1500
1501 FI_COMPLETION
1502 Indicates that a completion queue entry should be written for
1503 data transfer operations. This flag only applies to operations
1504 issued on an endpoint that was bound to a completion queue with
1505 the FI_SELECTIVE_COMPLETION flag set; otherwise, it is ignored.
1506 See the fi_ep_bind section above for more detail.
1507
1508 FI_DELIVERY_COMPLETE
1509 Indicates that a completion should be generated when the opera‐
1510 tion has been processed by the destination endpoint(s). See
1511 fi_cq(3) for additional details on completion semantics.
1512
1513 FI_INJECT
1514 Indicates that all outbound data buffers should be returned to
1515 the user’s control immediately after a data transfer call re‐
1516 turns, even if the operation is handled asynchronously. This
1517 may require that the provider copy the data into a local buffer
1518 and transfer out of that buffer. A provider can limit the total
1519 amount of send data that may be buffered and/or the size of a
1520 single send that can use this flag. This limit is indicated us‐
1521 ing inject_size (see inject_size above).
1522
1523 FI_INJECT_COMPLETE
1524 Indicates that a completion should be generated when the source
1525 buffer(s) may be reused. See fi_cq(3) for additional details on
1526 completion semantics.
1527
1528 FI_MULTICAST
1529 Indicates that data transfers will target multicast addresses by
1530 default. Any fi_addr_t passed into a data transfer operation
1531 will be treated as a multicast address.
1532
1533 FI_MULTI_RECV
1534 Applies to posted receive operations. This flag allows the user
1535 to post a single buffer that will receive multiple incoming mes‐
1536 sages. Received messages will be packed into the receive buffer
1537 until the buffer has been consumed. Use of this flag may cause
1538 a single posted receive operation to generate multiple comple‐
1539 tions as messages are placed into the buffer. The placement of
1540 received data into the buffer may be subject to provider-spe‐
1541 cific alignment restrictions. The buffer will be released by
1542 the provider when the available buffer space falls below the
1543 specified minimum (see FI_OPT_MIN_MULTI_RECV).
1544
1545 FI_TRANSMIT_COMPLETE
1546 Indicates that a completion should be generated when the trans‐
1547 mit operation has completed relative to the local provider. See
1548 fi_cq(3) for additional details on completion semantics.
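
As noted above, a hedged sketch of setting default operation flags through the fi_getinfo(3) hints follows; the specific flag combinations are illustrative assumptions, not recommendations.

      #include <rdma/fabric.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: request default operation flags through the
       * fi_getinfo(3) hints.  The flag combinations are example choices. */
      static void set_default_op_flags(struct fi_info *hints)
      {
          /* Transmits complete only after delivery and always report a
           * completion, even under an FI_SELECTIVE_COMPLETION binding. */
          hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE | FI_COMPLETION;

          /* Posted receive buffers act as multi-receive buffers. */
          hints->rx_attr->op_flags = FI_MULTI_RECV | FI_COMPLETION;

          /* The multi-receive release threshold could later be tuned with
           * fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
           *           &min, sizeof min);  (illustrative) */
      }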
1549
1550 NOTES
1551 Users should call fi_close to release all resources allocated to the
1552 fabric endpoint.
1553
1554 Endpoints allocated with the FI_CONTEXT or FI_CONTEXT2 mode bits set
1555 must typically provide struct fi_context or struct fi_context2, re‐
1556 spectively, as their per operation context parameter. (See fi_getinfo(3)
1557 for details.) However, when FI_SELECTIVE_COMPLETION is enabled to sup‐
1558 press CQ completion entries, and an operation is initiated without the
1559 FI_COMPLETION flag set, then the context parameter is ignored. An ap‐
1560 plication does not need to pass a valid context into such data transfers.
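
A minimal sketch, assuming the provider set the FI_CONTEXT mode bit (with FI_CONTEXT2, struct fi_context2 would be used instead); the request structure layout and buffer size are illustrative only.

      #include <rdma/fabric.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: with the FI_CONTEXT mode bit, each outstanding
       * operation supplies its own struct fi_context, which must remain
       * valid until the matching completion is read.  The request layout
       * and buffer size are illustrative. */
      struct my_request {
          struct fi_context ctx;    /* passed as the operation context */
          char buf[4096];           /* application payload buffer      */
      };

      static ssize_t post_recv(struct fid_ep *ep, struct my_request *req)
      {
          return fi_recv(ep, req->buf, sizeof req->buf, NULL,
                         FI_ADDR_UNSPEC, &req->ctx);
      }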
1561
1562 Operations that complete in error and are not associated with a valid
1563 operational context will use the endpoint context in any error report‐
1564 ing structures.
1565
1566 Although applications typically associate individual completions with
1567 either completion queues or counters, an endpoint can be attached to
1568 both a counter and completion queue. When combined with using selec‐
1569 tive completions, this allows an application to use counters to track
1570 successful completions, with a CQ used to report errors. Operations
1571 that complete with an error increment the error counter and generate a
1572 CQ completion event.
1573
1574 As mentioned in fi_getinfo(3), the ep_attr structure can be used to
1575 query providers that support various endpoint attributes. fi_getinfo
1576 can return provider info structures that can support the minimal set of
1577 requirements (such that the application maintains correctness). Howev‐
1578 er, it can also return provider info structures that exceed application
1579 requirements. As an example, consider an application requesting
1580 msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may
1581 have all the ordering bits set. The application can reset the ordering
1582 bits it does not require before creating the endpoint. The provider is
1583 free to implement a stricter ordering than is required by the applica‐
1584 tion.
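
For example, a hedged sketch of clearing unneeded ordering bits on the fi_info returned by fi_getinfo(3) before creating the endpoint; retaining only FI_ORDER_SAS is an arbitrary choice for illustration.

      #include <rdma/fabric.h>
      #include <rdma/fi_endpoint.h>

      /* Hedged sketch: clear ordering bits the application does not rely
       * on before opening the endpoint.  Keeping only send-after-send
       * ordering is an arbitrary example. */
      static void trim_msg_order(struct fi_info *info)
      {
          info->tx_attr->msg_order &= FI_ORDER_SAS;
          info->rx_attr->msg_order &= FI_ORDER_SAS;
      }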
1585
1586 RETURN VALUES
1587 Returns 0 on success. On error, a negative value corresponding to fab‐
1588 ric errno is returned. For fi_cancel, a return value of 0 indicates
1589 that the cancel request was submitted for processing. For fi_se‐
1590 topt/fi_getopt, a return value of -FI_ENOPROTOOPT indicates the
1591 provider does not support the requested option.
1592
1593 Fabric errno values are defined in rdma/fi_errno.h.
1594
1595 ERRORS
1596 -FI_EDOMAIN
1597 A resource domain was not bound to the endpoint or an attempt
1598 was made to bind multiple domains.
1599
1600 -FI_ENOCQ
1601 The endpoint has not been configured with the necessary completion queue.
1602
1603 -FI_EOPBADSTATE
1604 The endpoint’s state does not permit the requested operation.
1605
1606 SEE ALSO
1607 fi_getinfo(3), fi_domain(3), fi_cq(3), fi_msg(3), fi_tagged(3),
1608 fi_rma(3), fi_peer(3)
1609
1610 AUTHORS
1611 OpenFabrics.
1612
1613
1614
1615Libfabric Programmer’s Manual 2023-03-15 fi_endpoint(3)