fi_endpoint(3) Libfabric v1.15.1 fi_endpoint(3)

fi_endpoint - Fabric endpoint operations
7
8 fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
9 Allocate or close an endpoint.
10
11 fi_ep_bind
12 Associate an endpoint with hardware resources, such as event
13 queues, completion queues, counters, address vectors, or shared
14 transmit/receive contexts.
15
16 fi_scalable_ep_bind
17 Associate a scalable endpoint with an address vector
18
19 fi_pep_bind
20 Associate a passive endpoint with an event queue
21
22 fi_enable
23 Transitions an active endpoint into an enabled state.
24
25 fi_cancel
26 Cancel a pending asynchronous data transfer
27
28 fi_ep_alias
29 Create an alias to the endpoint
30
31 fi_control
32 Control endpoint operation.
33
34 fi_getopt / fi_setopt
35 Get or set endpoint options.
36
37 fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
38 Open a transmit or receive context.
39
40 fi_tc_dscp_set / fi_tc_dscp_get
41 Convert between a DSCP value and a network traffic class
42
fi_rx_size_left / fi_tx_size_left (DEPRECATED)
Query the lower bound on how many RX/TX operations may be posted
without an operation returning -FI_EAGAIN. These functions have
been deprecated and will be removed in a future version of the
library.
48
#include <rdma/fabric.h>

#include <rdma/fi_endpoint.h>

int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **ep, void *context);

int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **sep, void *context);

int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
    struct fid_pep **pep, void *context);

int fi_tx_context(struct fid_ep *sep, int index,
    struct fi_tx_attr *attr, struct fid_ep **tx_ep,
    void *context);

int fi_rx_context(struct fid_ep *sep, int index,
    struct fi_rx_attr *attr, struct fid_ep **rx_ep,
    void *context);

int fi_stx_context(struct fid_domain *domain,
    struct fi_tx_attr *attr, struct fid_stx **stx,
    void *context);

int fi_srx_context(struct fid_domain *domain,
    struct fi_rx_attr *attr, struct fid_ep **rx_ep,
    void *context);

int fi_close(struct fid *ep);

int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);

int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

int fi_enable(struct fid_ep *ep);

int fi_cancel(struct fid_ep *ep, void *context);

int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);

int fi_control(struct fid *ep, int command, void *arg);

int fi_getopt(struct fid *ep, int level, int optname,
    void *optval, size_t *optlen);

int fi_setopt(struct fid *ep, int level, int optname,
    const void *optval, size_t optlen);

uint32_t fi_tc_dscp_set(uint8_t dscp);

uint8_t fi_tc_dscp_get(uint32_t tclass);

DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);

DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
108
110 fid On creation, specifies a fabric or access domain. On bind,
111 identifies the event queue, completion queue, counter, or ad‐
112 dress vector to bind to the endpoint. In other cases, it’s a
113 fabric identifier of an associated resource.
114
115 info Details about the fabric interface endpoint to be opened, ob‐
116 tained from fi_getinfo.
117
118 ep A fabric endpoint.
119
120 sep A scalable fabric endpoint.
121
122 pep A passive fabric endpoint.
123
124 context
125 Context associated with the endpoint or asynchronous operation.
126
127 index Index to retrieve a specific transmit/receive context.
128
129 attr Transmit or receive context attributes.
130
131 flags Additional flags to apply to the operation.
132
133 command
134 Command of control operation to perform on endpoint.
135
136 arg Optional control argument.
137
138 level Protocol level at which the desired option resides.
139
140 optname
141 The protocol option to read or set.
142
143 optval The option value that was read or to set.
144
145 optlen The size of the optval buffer.
146
148 Endpoints are transport level communication portals. There are two
149 types of endpoints: active and passive. Passive endpoints belong to a
150 fabric domain and are most often used to listen for incoming connection
151 requests. However, a passive endpoint may be used to reserve a fabric
152 address that can be granted to an active endpoint. Active endpoints
153 belong to access domains and can perform data transfers.
154
155 Active endpoints may be connection-oriented or connectionless, and may
156 provide data reliability. The data transfer interfaces – messages
157 (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
158 (fi_atomic) – are associated with active endpoints. In basic configu‐
159 rations, an active endpoint has transmit and receive queues. In gener‐
160 al, operations that generate traffic on the fabric are posted to the
161 transmit queue. This includes all RMA and atomic operations, along
162 with sent messages and sent tagged messages. Operations that post buf‐
163 fers for receiving incoming data are submitted to the receive queue.
164
Active endpoints are created in the disabled state. They must transition
into an enabled state before accepting data transfer operations,
including posting of receive buffers. The fi_enable call is used to
transition an active endpoint into an enabled state. The fi_connect
and fi_accept calls will also transition an endpoint into the enabled
state, if it is not already enabled.
171
172 In order to transition an endpoint into an enabled state, it must be
173 bound to one or more fabric resources. An endpoint that will generate
174 asynchronous completions, either through data transfer operations or
175 communication establishment events, must be bound to the appropriate
176 completion queues or event queues, respectively, before being enabled.
177 Additionally, endpoints that use manual progress must be associated
178 with relevant completion queues or event queues in order to drive
179 progress. For endpoints that are only used as the target of RMA or
180 atomic operations, this means binding the endpoint to a completion
181 queue associated with receive processing. Connectionless endpoints
182 must be bound to an address vector.
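
The binding rules above can be sketched as a simple readiness check. This is a minimal model in plain C, not libfabric code: struct demo_ep_bindings and demo_ready_to_enable are hypothetical names invented for illustration, and a real provider may apply additional or different checks when fi_enable is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how an application intends to use an endpoint
 * and what it has bound so far. Not a libfabric type. */
struct demo_ep_bindings {
    bool connectionless;   /* endpoint type has no connections */
    bool posts_transmits;  /* will initiate sends, RMA, or atomics */
    bool posts_receives;   /* will post receive buffers */
    bool uses_cm_events;   /* will generate connection establishment events */
    bool has_tx_cq;        /* CQ bound for transmit completions */
    bool has_rx_cq;        /* CQ bound for receive completions */
    bool has_eq;           /* event queue bound */
    bool has_av;           /* address vector bound */
};

/* Returns true when the bindings satisfy the rules stated in the text. */
static bool demo_ready_to_enable(const struct demo_ep_bindings *b)
{
    assert(b != NULL);
    if (b->posts_transmits && !b->has_tx_cq)
        return false;  /* asynchronous transmit completions need a CQ */
    if (b->posts_receives && !b->has_rx_cq)
        return false;  /* receive completions need a CQ */
    if (b->uses_cm_events && !b->has_eq)
        return false;  /* connection events are reported through an EQ */
    if (b->connectionless && !b->has_av)
        return false;  /* connectionless endpoints need an address vector */
    return true;
}
```

An application could run a check of this kind before calling fi_enable to produce a clearer diagnostic than a bare error return.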
183
184 Once an endpoint has been activated, it may be associated with an ad‐
185 dress vector. Receive buffers may be posted to it and calls may be
186 made to connection establishment routines. Connectionless endpoints
187 may also perform data transfers.
188
189 The behavior of an endpoint may be adjusted by setting its control data
190 and protocol options. This allows the underlying provider to redirect
191 function calls to implementations optimized to meet the desired appli‐
192 cation behavior.
193
194 If an endpoint experiences a critical error, it will transition back
195 into a disabled state. Critical errors are reported through the event
196 queue associated with the EP. In certain cases, a disabled endpoint
197 may be re-enabled. The ability to transition back into an enabled
198 state is provider specific and depends on the type of error that the
199 endpoint experienced. When an endpoint is disabled as a result of a
200 critical error, all pending operations are discarded.
201
202 fi_endpoint / fi_passive_ep / fi_scalable_ep
203 fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
204 new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
205 The properties and behavior of the endpoint are defined based on the
206 provided struct fi_info. See fi_getinfo for additional details on
207 fi_info. fi_info flags that control the operation of an endpoint are
208 defined below. See section SCALABLE ENDPOINTS.
209
210 If an active endpoint is allocated in order to accept a connection re‐
211 quest, the fi_info parameter must be the same as the fi_info structure
212 provided with the connection request (FI_CONNREQ) event.
213
214 An active endpoint may acquire the properties of a passive endpoint by
215 setting the fi_info handle field to the passive endpoint fabric de‐
216 scriptor. This is useful for applications that need to reserve the
217 fabric address of an endpoint prior to knowing if the endpoint will be
218 used on the active or passive side of a connection. For example, this
219 feature is useful for simulating socket semantics. Once an active end‐
220 point acquires the properties of a passive endpoint, the passive end‐
221 point is no longer bound to any fabric resources and must no longer be
222 used. The user is expected to close the passive endpoint after opening
223 the active endpoint in order to free up any lingering resources that
224 had been used.
225
226 fi_close
Closes an endpoint and releases all resources associated with it.

When closing a scalable endpoint, there must be no open transmit or
receive contexts associated with the scalable endpoint. If resources
are still associated with the scalable endpoint when attempting to
close it, the call will return -FI_EBUSY.
233
234 Outstanding operations posted to the endpoint when fi_close is called
235 will be discarded. Discarded operations will silently be dropped, with
236 no completions reported. Additionally, a provider may discard previ‐
237 ously completed operations from the associated completion queue(s).
238 The behavior to discard completed operations is provider specific.
239
240 fi_ep_bind
fi_ep_bind is used to associate an endpoint with other allocated
resources, such as completion queues, counters, address vectors, event
queues, shared contexts, and memory regions. The types of objects that
must be bound with an endpoint depend on the endpoint type and its
configuration.
246
247 Passive endpoints must be bound with an EQ that supports connection
248 management events. Connectionless endpoints must be bound to a single
249 address vector. If an endpoint is using a shared transmit and/or re‐
250 ceive context, the shared contexts must be bound to the endpoint. CQs,
251 counters, AV, and shared contexts must be bound to endpoints before
252 they are enabled either explicitly or implicitly.
253
254 An endpoint must be bound with CQs capable of reporting completions for
255 any asynchronous operation initiated on the endpoint. For example, if
256 the endpoint supports any outbound transfers (sends, RMA, atomics,
257 etc.), then it must be bound to a completion queue that can report
258 transmit completions. This is true even if the endpoint is configured
259 to suppress successful completions, in order that operations that com‐
260 plete in error may be reported to the user.
261
262 An active endpoint may direct asynchronous completions to different
263 CQs, based on the type of operation. This is specified using
264 fi_ep_bind flags. The following flags may be OR’ed together when bind‐
265 ing an endpoint to a completion domain CQ.
266
267 FI_RECV
268 Directs the notification of inbound data transfers to the speci‐
269 fied completion queue. This includes received messages. This
270 binding automatically includes FI_REMOTE_WRITE, if applicable to
271 the endpoint.
272
273 FI_SELECTIVE_COMPLETION
274 By default, data transfer operations write CQ completion entries
275 into the associated completion queue after they have successful‐
276 ly completed. Applications can use this bind flag to selective‐
277 ly enable when completions are generated. If FI_SELECTIVE_COM‐
278 PLETION is specified, data transfer operations will not generate
279 CQ entries for successful completions unless FI_COMPLETION is
280 set as an operational flag for the given operation. Operations
281 that fail asynchronously will still generate completions, even
282 if a completion is not requested. FI_SELECTIVE_COMPLETION must
283 be OR’ed with FI_TRANSMIT and/or FI_RECV flags.
284
285 When FI_SELECTIVE_COMPLETION is set, the user must determine when a re‐
286 quest that does NOT have FI_COMPLETION set has completed indirectly,
287 usually based on the completion of a subsequent operation or by using
288 completion counters. Use of this flag may improve performance by al‐
289 lowing the provider to avoid writing a CQ completion entry for every
290 operation.
291
292 See Notes section below for additional information on how this flag in‐
293 teracts with the FI_CONTEXT and FI_CONTEXT2 mode bits.
294
295 FI_TRANSMIT
296 Directs the completion of outbound data transfer requests to the
297 specified completion queue. This includes send message, RMA,
298 and atomic operations.
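
The FI_SELECTIVE_COMPLETION rules described above reduce to a small decision function. The sketch below is a plain C model under stated assumptions: the DEMO_ bit values are placeholders rather than the real flag values from rdma/fabric.h, and demo_writes_cq_entry is a hypothetical helper, not a libfabric call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits standing in for the real libfabric flags. */
#define DEMO_SELECTIVE_COMPLETION (UINT64_C(1) << 0) /* FI_SELECTIVE_COMPLETION */
#define DEMO_COMPLETION           (UINT64_C(1) << 1) /* FI_COMPLETION */

/* Decide whether an operation produces a CQ entry.
 * bind_flags: flags used when the CQ was bound to the endpoint
 * op_flags:   flags supplied with the individual operation
 * failed:     true if the operation failed asynchronously */
static bool demo_writes_cq_entry(uint64_t bind_flags, uint64_t op_flags,
                                 bool failed)
{
    if (failed)
        return true;   /* errors are always reported */
    if (!(bind_flags & DEMO_SELECTIVE_COMPLETION))
        return true;   /* default: every successful operation is reported */
    return (op_flags & DEMO_COMPLETION) != 0;
}

/* Self-check covering the cases described in the text. */
static void demo_selective_completion_checks(void)
{
    assert(demo_writes_cq_entry(0, 0, false));
    assert(!demo_writes_cq_entry(DEMO_SELECTIVE_COMPLETION, 0, false));
    assert(demo_writes_cq_entry(DEMO_SELECTIVE_COMPLETION, DEMO_COMPLETION, false));
    assert(demo_writes_cq_entry(DEMO_SELECTIVE_COMPLETION, 0, true));
}
```

A common pattern that follows from this is to request FI_COMPLETION only on the last operation of a batch and infer completion of the earlier ones from it or from a counter.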
299
300 An endpoint may optionally be bound to a completion counter. Associat‐
301 ing an endpoint with a counter is in addition to binding the EP with a
302 CQ. When binding an endpoint to a counter, the following flags may be
303 specified.
304
305 FI_READ
306 Increments the specified counter whenever an RMA read, atomic
307 fetch, or atomic compare operation initiated from the endpoint
308 has completed successfully or in error.
309
310 FI_RECV
311 Increments the specified counter whenever a message is received
312 over the endpoint. Received messages include both tagged and
313 normal message operations.
314
315 FI_REMOTE_READ
316 Increments the specified counter whenever an RMA read, atomic
317 fetch, or atomic compare operation is initiated from a remote
318 endpoint that targets the given endpoint. Use of this flag re‐
319 quires that the endpoint be created using FI_RMA_EVENT.
320
321 FI_REMOTE_WRITE
322 Increments the specified counter whenever an RMA write or base
323 atomic operation is initiated from a remote endpoint that tar‐
324 gets the given endpoint. Use of this flag requires that the
325 endpoint be created using FI_RMA_EVENT.
326
327 FI_SEND
328 Increments the specified counter whenever a message transfer
329 initiated over the endpoint has completed successfully or in er‐
330 ror. Sent messages include both tagged and normal message oper‐
331 ations.
332
333 FI_WRITE
334 Increments the specified counter whenever an RMA write or base
335 atomic operation initiated from the endpoint has completed suc‐
336 cessfully or in error.
337
An endpoint may only be bound to a single CQ or counter for a given
type of operation. For example, an EP may not bind to two counters both
using FI_WRITE. Furthermore, providers may limit CQ and counter
bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
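
The one-binding-per-operation-type rule can be modeled with a bitmask of operation types that already have a counter. This is an illustrative sketch only: struct demo_ep, demo_bind_cntr, the DEMO_ bits, and the choice of -EINVAL as the error code are all assumptions, not libfabric definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder operation-type bits; not the real libfabric flag values. */
enum {
    DEMO_SEND  = 1 << 0,
    DEMO_RECV  = 1 << 1,
    DEMO_READ  = 1 << 2,
    DEMO_WRITE = 1 << 3,
};

/* Hypothetical endpoint state: which operation types have a counter. */
struct demo_ep {
    uint64_t cntr_bound;
};

/* Bind a counter for the requested operation types.
 * Returns 0 on success, or -EINVAL (assumed error code) if any requested
 * type is already bound to a counter. Nothing is changed on failure. */
static int demo_bind_cntr(struct demo_ep *ep, uint64_t flags)
{
    assert(ep != NULL && flags != 0);
    if (ep->cntr_bound & flags)
        return -EINVAL;
    ep->cntr_bound |= flags;
    return 0;
}
```

In this model a single counter may cover several operation types at once, but a second counter that overlaps on any one type is rejected.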
342
343 fi_scalable_ep_bind
344 fi_scalable_ep_bind is used to associate a scalable endpoint with an
345 address vector. See section on SCALABLE ENDPOINTS. A scalable end‐
346 point has a single transport level address and can support multiple
347 transmit and receive contexts. The transmit and receive contexts share
348 the transport-level address. Address vectors that are bound to scal‐
349 able endpoints are implicitly bound to any transmit or receive contexts
350 created using the scalable endpoint.
351
352 fi_enable
353 This call transitions the endpoint into an enabled state. An endpoint
354 must be enabled before it may be used to perform data transfers. En‐
355 abling an endpoint typically results in hardware resources being as‐
356 signed to it. Endpoints making use of completion queues, counters,
357 event queues, and/or address vectors must be bound to them before being
358 enabled.
359
360 Calling connect or accept on an endpoint will implicitly enable an end‐
361 point if it has not already been enabled.
362
fi_enable may also be used to re-enable an endpoint that has been
disabled as a result of experiencing a critical error. Applications
should check the return value from fi_enable to see if a disabled
endpoint has successfully been re-enabled.
367
368 fi_cancel
369 fi_cancel attempts to cancel an outstanding asynchronous operation.
370 Canceling an operation causes the fabric provider to search for the op‐
371 eration and, if it is still pending, complete it as having been can‐
372 celed. An error queue entry will be available in the associated error
373 queue with error code FI_ECANCELED. On the other hand, if the opera‐
374 tion completed before the call to fi_cancel, then the completion status
375 of that operation will be available in the associated completion queue.
376 No specific entry related to fi_cancel itself will be posted.
377
378 Cancel uses the context parameter associated with an operation to iden‐
379 tify the request to cancel. Operations posted without a valid context
380 parameter – either no context parameter is specified or the context
381 value was ignored by the provider – cannot be canceled. If multiple
382 outstanding operations match the context parameter, only one will be
383 canceled. In this case, the operation which is canceled is provider
384 specific. The cancel operation is asynchronous, but will complete
385 within a bounded period of time.
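
The context-matching behavior of fi_cancel can be illustrated with a toy pending-operation list. The sketch below is a model, not provider code: struct demo_op, struct demo_queue, and demo_cancel are hypothetical, and canceling the first match is just one possible policy, since the text leaves the choice among multiple matches to the provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PENDING 8

/* Hypothetical record of one posted operation. */
struct demo_op {
    void *context;  /* context pointer supplied when the operation was posted */
    bool pending;   /* still outstanding */
    bool canceled;  /* would be reported with error code FI_ECANCELED */
};

struct demo_queue {
    struct demo_op ops[DEMO_MAX_PENDING];
    size_t count;
};

/* Cancel at most one pending operation whose context matches.
 * Returns the number of operations canceled (0 or 1). */
static int demo_cancel(struct demo_queue *q, void *context)
{
    assert(q != NULL && q->count <= DEMO_MAX_PENDING);
    if (context == NULL)
        return 0;  /* no usable context: the operation cannot be identified */
    for (size_t i = 0; i < q->count; i++) {
        struct demo_op *op = &q->ops[i];
        if (op->pending && op->context == context) {
            op->pending = false;
            op->canceled = true;
            return 1;  /* only one match is canceled per call */
        }
    }
    return 0;  /* already completed or never posted: nothing to cancel */
}
```

Applications that need to cancel individual operations reliably would therefore give each outstanding operation a distinct context value.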
386
387 fi_ep_alias
388 This call creates an alias to the specified endpoint. Conceptually, an
389 endpoint alias provides an alternate software path from the application
390 to the underlying provider hardware. An alias EP differs from its par‐
391 ent endpoint only by its default data transfer flags. For example, an
392 alias EP may be configured to use a different completion mode. By de‐
393 fault, an alias EP inherits the same data transfer flags as the parent
394 endpoint. An application can use fi_control to modify the alias EP op‐
395 erational flags.
396
397 When allocating an alias, an application may configure either the
398 transmit or receive operational flags. This avoids needing a separate
399 call to fi_control to set those flags. The flags passed to fi_ep_alias
400 must include FI_TRANSMIT or FI_RECV (not both) with other operational
401 flags OR’ed in. This will override the transmit or receive flags, re‐
402 spectively, for operations posted through the alias endpoint. All al‐
403 located aliases must be closed for the underlying endpoint to be re‐
404 leased.
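
The requirement that the flags include FI_TRANSMIT or FI_RECV, but not both, is an exclusive-or test on two bits. The following is a minimal sketch with placeholder bit values; DEMO_TRANSMIT, DEMO_RECV, demo_valid_direction, and demo_op_flags are hypothetical names and do not reflect the real flag encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder direction bits standing in for FI_TRANSMIT and FI_RECV. */
#define DEMO_TRANSMIT (UINT64_C(1) << 0)
#define DEMO_RECV     (UINT64_C(1) << 1)

/* True when exactly one of the two direction bits is set. */
static bool demo_valid_direction(uint64_t flags)
{
    bool tx = (flags & DEMO_TRANSMIT) != 0;
    bool rx = (flags & DEMO_RECV) != 0;
    return tx != rx;
}

/* Strip the direction selector, leaving the operational flags that would
 * override the transmit or receive defaults of the alias. */
static uint64_t demo_op_flags(uint64_t flags)
{
    assert(demo_valid_direction(flags));
    return flags & ~(DEMO_TRANSMIT | DEMO_RECV);
}
```

The same either-or rule is stated for the FI_GETOPSFLAG and FI_SETOPSFLAG control commands, so a check of this shape applies there as well.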
405
406 fi_control
The control operation is used to adjust the default behavior of an
endpoint. It allows the underlying provider to redirect function calls to
implementations optimized to meet the desired application behavior. As
a result, calls to fi_control must be serialized against all other
calls to an endpoint.
412
413 The base operation of an endpoint is selected during creation using
414 struct fi_info. The following control commands and arguments may be
415 assigned to an endpoint.
416
FI_BACKLOG - int *value
418 This option only applies to passive endpoints. It is used to
419 set the connection request backlog for listening endpoints.
420
FI_GETOPSFLAG - uint64_t *flags
422 Used to retrieve the current value of flags associated with the
423 data transfer operations initiated on the endpoint. The control
424 argument must include FI_TRANSMIT or FI_RECV (not both) flags to
425 indicate the type of data transfer flags to be returned. See
426 below for a list of control flags.
427
FI_GETWAIT - void **
This command allows the user to retrieve the file descriptor
associated with a socket endpoint. The fi_control arg parameter
should be an address where a pointer to the returned file
descriptor will be written. See fi_eq(3) for additional details on
using fi_control with FI_GETWAIT. The file descriptor may be used
for notification that the endpoint is ready to send or receive
data.
436
FI_SETOPSFLAG - uint64_t *flags
438 Used to change the data transfer operation flags associated with
439 an endpoint. The control argument must include FI_TRANSMIT or
440 FI_RECV (not both) to indicate the type of data transfer that
441 the flags should apply to, with other flags OR’ed in. The given
442 flags will override the previous transmit and receive attributes
443 that were set when the endpoint was created. Valid control
444 flags are defined below.
445
446 fi_getopt / fi_setopt
Endpoint protocol options may be retrieved using fi_getopt or set
using fi_setopt. Applications specify the level at which a desired option
exists, identify the option, and provide input/output buffers to get or
set the option. fi_setopt provides an application a way to adjust
low-level protocol and implementation specific details of an endpoint.
452
453 The following option levels and option names and parameters are de‐
454 fined.
455
FI_OPT_ENDPOINT
457
458 FI_OPT_BUFFERED_LIMIT - size_t
459 Defines the maximum size of a buffered message that will be re‐
460 ported to users as part of a receive completion when the
461 FI_BUFFERED_RECV mode is enabled on an endpoint.
462
fi_getopt() will return the currently configured threshold, or the
provider’s default threshold if one has not been set by the application.
fi_setopt() allows an application to configure the threshold. If the
provider cannot support the requested threshold, it will fail the
fi_setopt() call with -FI_EMSGSIZE. Calling fi_setopt() with the
threshold set to SIZE_MAX will set the threshold to the maximum
supported by the provider. fi_getopt() can then be used to retrieve the
set size.
471
472 In most cases, the sending and receiving endpoints must be configured
473 to use the same threshold value, and the threshold must be set prior to
474 enabling the endpoint.
476
477 FI_OPT_BUFFERED_MIN - size_t
478 Defines the minimum size of a buffered message that will be re‐
479 ported. Applications would set this to a size that’s big enough
480 to decide whether to discard or claim a buffered receive or when
481 to claim a buffered receive on getting a buffered receive com‐
482 pletion. The value is typically used by a provider when sending
483 a rendezvous protocol request where it would send at least
484 FI_OPT_BUFFERED_MIN bytes of application data along with it. A
485 smaller sized rendezvous protocol message usually results in
486 better latency for the overall transfer of a large message.
488
489 FI_OPT_CM_DATA_SIZE - size_t
490 Defines the size of available space in CM messages for user-de‐
491 fined data. This value limits the amount of data that applica‐
492 tions can exchange between peer endpoints using the fi_connect,
493 fi_accept, and fi_reject operations. The size returned is de‐
494 pendent upon the properties of the endpoint, except in the case
495 of passive endpoints, in which the size reflects the maximum
496 size of the data that may be present as part of a connection re‐
497 quest event. This option is read only.
499
500 FI_OPT_MIN_MULTI_RECV - size_t
501 Defines the minimum receive buffer space available when the re‐
502 ceive buffer is released by the provider (see FI_MULTI_RECV).
503 Modifying this value is only guaranteed to set the minimum buf‐
504 fer space needed on receives posted after the value has been
505 changed. It is recommended that applications that want to over‐
506 ride the default MIN_MULTI_RECV value set this option before en‐
507 abling the corresponding endpoint.
509
FI_OPT_FI_HMEM_P2P - int
Defines how the provider should handle peer to peer FI_HMEM
transfers for this endpoint. By default, the provider will
choose whether to use peer to peer support based on the type of
transfer (FI_HMEM_P2P_ENABLED). Valid values defined in
fi_endpoint.h are:
516
517 • FI_HMEM_P2P_ENABLED: Peer to peer support may be used by the
518 provider to handle FI_HMEM transfers, and which transfers are
519 initiated using peer to peer is subject to the provider imple‐
520 mentation.
521
• FI_HMEM_P2P_REQUIRED: Peer to peer support must be used for
transfers; transfers that cannot be performed using peer to peer
will be reported as failing.
525
526 • FI_HMEM_P2P_PREFERRED: Peer to peer support should be used by
527 the provider for all transfers if available, but the provider
528 may choose to copy the data to initiate the transfer if peer
529 to peer support is unavailable.
530
• FI_HMEM_P2P_DISABLED: Peer to peer support should not be used.

fi_setopt() will return -FI_EOPNOTSUPP if the mode requested cannot be
supported by the provider. The FI_HMEM_DISABLE_P2P environment
variable discussed in fi_mr(3) takes precedence over this setopt option.
536
537 FI_OPT_XPU_TRIGGER - struct fi_trigger_xpu *
538 This option only applies to the fi_getopt() call. It is used to
539 query the maximum number of variables required to support XPU
540 triggered operations, along with the size of each variable.
541
542 The user provides a filled out struct fi_trigger_xpu on input. The
543 iface and device fields should reference an HMEM domain. If the
544 provider does not support XPU triggered operations from the given de‐
545 vice, fi_getopt() will return -FI_EOPNOTSUPP. On input, var should
546 reference an array of struct fi_trigger_var data structures, with count
547 set to the size of the referenced array. If count is 0, the var field
548 will be ignored, and the provider will return the number of fi_trig‐
549 ger_var structures needed. If count is > 0, the provider will set
550 count to the needed value, and for each fi_trigger_var available, set
551 the datatype and count of the variable used for the trigger.
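
Of the options above, the FI_OPT_BUFFERED_LIMIT threshold has the most involved get/set behavior, which can be sketched as follows. This is a plain C model under stated assumptions: struct demo_provider and the demo_ functions are hypothetical, and the standard EMSGSIZE value from errno.h stands in for the fabric error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical provider state for the buffered-message threshold. */
struct demo_provider {
    size_t max_limit;      /* largest threshold the provider supports */
    size_t current_limit;  /* starts out as the provider default */
};

/* Model of setting the threshold: SIZE_MAX selects the provider maximum,
 * and an unsupported size is rejected without changing the setting. */
static int demo_set_buffered_limit(struct demo_provider *p, size_t requested)
{
    assert(p != NULL);
    if (requested == SIZE_MAX) {
        p->current_limit = p->max_limit;
        return 0;
    }
    if (requested > p->max_limit)
        return -EMSGSIZE;
    p->current_limit = requested;
    return 0;
}

/* Model of reading the threshold back: the configured value, or the
 * default if the application never set one. */
static size_t demo_get_buffered_limit(const struct demo_provider *p)
{
    assert(p != NULL);
    return p->current_limit;
}
```

Setting SIZE_MAX and then reading the value back is the way the text suggests for discovering the provider maximum.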
552
553 fi_tc_dscp_set
This call converts a DSCP defined value into a libfabric traffic class
value. It should be used to assign a DSCP value when setting the
tclass field in either domain or endpoint attributes.
557
558 fi_tc_dscp_get
559 This call returns the DSCP value associated with the tclass field for
560 the domain or endpoint attributes.
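
One way such a conversion pair could work is to tag the DSCP code point with a marker bit so it can be told apart from other traffic class values. The sketch below is an assumption made for illustration: DEMO_TC_DSCP and its value, and the demo_ functions, are hypothetical, and the actual encoding is defined by the libfabric headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed marker bit that identifies a tclass value as carrying a DSCP
 * code point. The value is illustrative only. */
#define DEMO_TC_DSCP UINT32_C(0x100)

/* Wrap a 6-bit DSCP code point into a traffic class value. */
static uint32_t demo_tc_dscp_set(uint8_t dscp)
{
    assert(dscp < 64);  /* DSCP code points occupy 6 bits */
    return (uint32_t)dscp | DEMO_TC_DSCP;
}

/* Recover the DSCP code point, or 0 if the traffic class value does not
 * carry one. */
static uint8_t demo_tc_dscp_get(uint32_t tclass)
{
    return (tclass & DEMO_TC_DSCP) ? (uint8_t)(tclass & 0xffu) : 0;
}
```

With real libfabric, an application would assign the result of fi_tc_dscp_set to the tclass field and read it back with fi_tc_dscp_get rather than manipulating bits directly.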
561
562 fi_rx_size_left (DEPRECATED)
563 This function has been deprecated and will be removed in a future ver‐
564 sion of the library. It may not be supported by all providers.
565
566 The fi_rx_size_left call returns a lower bound on the number of receive
567 operations that may be posted to the given endpoint without that opera‐
568 tion returning -FI_EAGAIN. Depending on the specific details of the
569 subsequently posted receive operations (e.g., number of iov entries,
570 which receive function is called, etc.), it may be possible to post
571 more receive operations than originally indicated by fi_rx_size_left.
572
573 fi_tx_size_left (DEPRECATED)
574 This function has been deprecated and will be removed in a future ver‐
575 sion of the library. It may not be supported by all providers.
576
577 The fi_tx_size_left call returns a lower bound on the number of trans‐
578 mit operations that may be posted to the given endpoint without that
579 operation returning -FI_EAGAIN. Depending on the specific details of
580 the subsequently posted transmit operations (e.g., number of iov en‐
581 tries, which transmit function is called, etc.), it may be possible to
582 post more transmit operations than originally indicated by
583 fi_tx_size_left.
584
586 The fi_ep_attr structure defines the set of attributes associated with
587 an endpoint. Endpoint attributes may be further refined using the
588 transmit and receive context attributes as shown below.
589
struct fi_ep_attr {
    enum fi_ep_type type;
    uint32_t        protocol;
    uint32_t        protocol_version;
    size_t          max_msg_size;
    size_t          msg_prefix_size;
    size_t          max_order_raw_size;
    size_t          max_order_war_size;
    size_t          max_order_waw_size;
    uint64_t        mem_tag_format;
    size_t          tx_ctx_cnt;
    size_t          rx_ctx_cnt;
    size_t          auth_key_size;
    uint8_t         *auth_key;
};
605
606 type - Endpoint Type
607 If specified, indicates the type of fabric interface communication de‐
608 sired. Supported types are:
609
610 FI_EP_DGRAM
611 Supports a connectionless, unreliable datagram communication.
612 Message boundaries are maintained, but the maximum message size
613 may be limited to the fabric MTU. Flow control is not guaran‐
614 teed.
615
616 FI_EP_MSG
617 Provides a reliable, connection-oriented data transfer service
618 with flow control that maintains message boundaries.
619
620 FI_EP_RDM
621 Reliable datagram message. Provides a reliable, connectionless
622 data transfer service with flow control that maintains message
623 boundaries.
624
625 FI_EP_SOCK_DGRAM
626 A connectionless, unreliable datagram endpoint with UDP sock‐
627 et-like semantics. FI_EP_SOCK_DGRAM is most useful for applica‐
628 tions designed around using UDP sockets. See the SOCKET END‐
629 POINT section for additional details and restrictions that apply
630 to datagram socket endpoints.
631
632 FI_EP_SOCK_STREAM
633 Data streaming endpoint with TCP socket-like semantics. Pro‐
634 vides a reliable, connection-oriented data transfer service that
635 does not maintain message boundaries. FI_EP_SOCK_STREAM is most
636 useful for applications designed around using TCP sockets. See
637 the SOCKET ENDPOINT section for additional details and restric‐
638 tions that apply to stream endpoints.
639
640 FI_EP_UNSPEC
641 The type of endpoint is not specified. This is usually provided
642 as input, with other attributes of the endpoint or the provider
643 selecting the type.
644
protocol - Protocol
646 Specifies the low-level end to end protocol employed by the provider.
647 A matching protocol must be used by communicating endpoints to ensure
648 interoperability. The following protocol values are defined. Provider
649 specific protocols are also allowed. Provider specific protocols will
650 be indicated by having the upper bit of the protocol value set to one.
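
The upper-bit convention for provider specific protocols can be tested with a single mask on the 32-bit protocol value. This is a small sketch: DEMO_PROV_SPECIFIC and both helper functions are hypothetical names, and the protocol numbers used in the checks are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Upper bit of the 32-bit protocol value, as described in the text. */
#define DEMO_PROV_SPECIFIC (UINT32_C(1) << 31)

/* True if the protocol value is in the provider specific range. */
static bool demo_is_provider_specific(uint32_t protocol)
{
    return (protocol & DEMO_PROV_SPECIFIC) != 0;
}

/* Communicating endpoints must use a matching protocol to interoperate. */
static bool demo_protocols_match(uint32_t local, uint32_t remote)
{
    return local == remote;
}

/* Self-check using invented protocol numbers. */
static void demo_protocol_checks(void)
{
    uint32_t standard = 7;                       /* invented value */
    uint32_t vendor = DEMO_PROV_SPECIFIC | 7;    /* invented value */
    assert(!demo_is_provider_specific(standard));
    assert(demo_is_provider_specific(vendor));
    assert(!demo_protocols_match(standard, vendor));
}
```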
651
652 FI_PROTO_GNI
653 Protocol runs over Cray GNI low-level interface.
654
655 FI_PROTO_IB_RDM
656 Reliable-datagram protocol implemented over InfiniBand reli‐
657 able-connected queue pairs.
658
FI_PROTO_IB_UD
The protocol runs over InfiniBand unreliable datagram queue
pairs.
662
663 FI_PROTO_IWARP
664 The protocol runs over the Internet wide area RDMA protocol
665 transport.
666
FI_PROTO_IWARP_RDM
Reliable-datagram protocol implemented over iWARP
reliable-connected queue pairs.
670
FI_PROTO_NETWORKDIRECT
Protocol runs over Microsoft NetworkDirect service provider
interface. This adds reliable-datagram semantics over the
NetworkDirect connection-oriented endpoint semantics.
675
676 FI_PROTO_PSMX
677 The protocol is based on an Intel proprietary protocol known as
678 PSM, performance scaled messaging. PSMX is an extended version
679 of the PSM protocol to support the libfabric interfaces.
680
681 FI_PROTO_PSMX2
682 The protocol is based on an Intel proprietary protocol known as
683 PSM2, performance scaled messaging version 2. PSMX2 is an ex‐
684 tended version of the PSM2 protocol to support the libfabric in‐
685 terfaces.
686
687 FI_PROTO_PSMX3
688 The protocol is Intel’s protocol known as PSM3, performance
689 scaled messaging version 3. PSMX3 is implemented over RoCEv2
690 and verbs.
691
FI_PROTO_RDMA_CM_IB_RC
The protocol runs over InfiniBand reliable-connected queue
pairs, using the RDMA CM protocol for connection establishment.
695
696 FI_PROTO_RXD
697 Reliable-datagram protocol implemented over datagram endpoints.
698 RXD is a libfabric utility component that adds RDM endpoint se‐
699 mantics over DGRAM endpoint semantics.
700
701 FI_PROTO_RXM
702 Reliable-datagram protocol implemented over message endpoints.
703 RXM is a libfabric utility component that adds RDM endpoint se‐
704 mantics over MSG endpoint semantics.
705
706 FI_PROTO_SOCK_TCP
707 The protocol is layered over TCP packets.
708
709 FI_PROTO_UDP
710 The protocol sends and receives UDP datagrams. For example, an
711 endpoint using FI_PROTO_UDP will be able to communicate with a
712 remote peer that is using Berkeley SOCK_DGRAM sockets using IP‐
713 PROTO_UDP.
714
715 FI_PROTO_UNSPEC
716 The protocol is not specified. This is usually provided as in‐
717 put, with other attributes of the socket or the provider select‐
718 ing the actual protocol.
719
720 protocol_version - Protocol Version
721 Identifies which version of the protocol is employed by the provider.
722 The protocol version allows providers to extend an existing protocol
723 in a backward-compatible manner, for example by adding support for
724 additional features or functionality. Providers that support different
725 versions of the same protocol should interoperate, but only when using
726 the capabilities defined for the lesser version.
727
728 max_msg_size - Max Message Size
729 Defines the maximum size for an application data transfer as a single
730 operation.
731
732 msg_prefix_size - Message Prefix Size
733 Specifies the size of any required message prefix buffer space. This
734 field will be 0 unless the FI_MSG_PREFIX mode is enabled. If
735 msg_prefix_size is > 0, the specified value will be a multiple of 8 bytes.
736
737 Max RMA Ordered Size
738 The maximum ordered size specifies the delivery order of transport data
739 into target memory for RMA and atomic operations. Data ordering is
740 separate, but dependent on message ordering (defined below). Data or‐
741 dering is unspecified where message order is not defined.
742
743 Data ordering refers to the access of the same target memory by subse‐
744 quent operations. When back to back RMA read or write operations ac‐
745 cess the same registered memory location, data ordering indicates
746 whether the second operation reads or writes the target memory after
747 the first operation has completed. For example, will an RMA read that
748 follows an RMA write read back the data that was written? Similarly,
749 will an RMA write that follows an RMA read update the target buffer af‐
750 ter the read has transferred the original data? Data ordering answers
751 these questions, even in the presence of errors, such as the need to
752 resend data because of lost or corrupted network traffic.
753
754 RMA ordering applies between two operations, and not within a single
755 data transfer. Therefore, ordering is defined per byte-addressable
756 memory location. I.e. ordering specifies whether location X is ac‐
757 cessed by the second operation after the first operation. Nothing is
758 implied about the completion of the first operation before the second
759 operation is initiated. For example, if the first operation updates
760 locations X and Y, but the second operation only accesses location X,
761 there are no guarantees defined relative to location Y and the second
762 operation.
763
764 In order to support large data transfers being broken into multiple
765 packets and sent using multiple paths through the fabric, data ordering
766 may be limited to transfers of a specific size or less. Providers
767 specify when data ordering is maintained through the following values.
768 Note that even if data ordering is not maintained, message ordering may
769 be.
770
771 max_order_raw_size
772 Read after write size. If set, an RMA or atomic read operation
773 issued after an RMA or atomic write operation, both of which are
774 smaller than the size, will be ordered. Where the target memory
775 locations overlap, the RMA or atomic read operation will see the
776 results of the previous RMA or atomic write.
777
778 max_order_war_size
779 Write after read size. If set, an RMA or atomic write operation
780 issued after an RMA or atomic read operation, both of which are
781 smaller than the size, will be ordered. The RMA or atomic read
782 operation will see the initial value of the target memory loca‐
783 tion before a subsequent RMA or atomic write updates the value.
784
785 max_order_waw_size
786 Write after write size. If set, an RMA or atomic write opera‐
787 tion issued after an RMA or atomic write operation, both of
788 which are smaller than the size, will be ordered. The target
789 memory location will reflect the results of the second RMA or
790 atomic write.
791
792 An order size value of 0 indicates that ordering is not guaranteed. A
793 value of -1 guarantees ordering for any data size.
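
The size rules above can be sketched as a small check. This is an illustrative helper only: data_order_applies is a hypothetical name, not a libfabric call, and treating a transfer exactly at the limit as ordered is an assumption based on the "specific size or less" wording.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of libfabric: reports whether two
 * back-to-back operations of the given lengths fall under an ordering
 * limit such as max_order_raw_size, max_order_war_size, or
 * max_order_waw_size. */
static bool data_order_applies(size_t max_order_size,
                               size_t first_len, size_t second_len)
{
    if (max_order_size == 0)            /* 0: ordering is not guaranteed */
        return false;
    if (max_order_size == (size_t)-1)   /* -1: ordered for any data size */
        return true;
    /* Assumption: a transfer exactly at the limit still counts. */
    return first_len <= max_order_size && second_len <= max_order_size;
}
```

An application might call data_order_applies(attr->max_order_raw_size, write_len, read_len) before relying on a read observing an earlier write, and fall back to waiting for the write's completion when it returns false.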
794
795 mem_tag_format - Memory Tag Format
796 The memory tag format is a bit array used to convey the number of
797 tagged bits supported by a provider. Additionally, it may be used to
798 divide the bit array into separate fields. The mem_tag_format option‐
799 ally begins with a series of bits set to 0, to signify bits which are
800 ignored by the provider. Following the initial prefix of ignored bits,
801 the array will consist of alternating groups of bits set to all 1’s or
802 all 0’s. Each group of bits corresponds to a tagged field. The impli‐
803 cation of defining a tagged field is that when a mask is applied to the
804 tagged bit array, all bits belonging to a single field will either be
805 set to 1 or 0, collectively.
806
807 For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
808 bits, separated into 3 fields. The first field consists of 2-bits, the
809 second field 4-bits, and the final field 8-bits. Valid masks for such
810 a tagged field would be a bitwise OR’ing of zero or more of the follow‐
811 ing values: 0x3000, 0x0F00, and 0x00FF. The provider may not validate
812 the mask provided by the application for performance reasons.
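
The 0x30FF example can be reproduced with a short routine that walks the bit array from the most significant bit. This is a sketch only: tag_format_fields is a hypothetical helper name and is not provided by libfabric.

```c
#include <stdint.h>

/* Hypothetical helper, not a libfabric call: splits a mem_tag_format
 * value into field widths, most significant field first. Leading zero
 * bits are skipped as bits ignored by the provider.
 * Returns the number of fields written to widths[]. */
static int tag_format_fields(uint64_t fmt, int widths[64])
{
    int nfields = 0;
    int bit = 63;

    /* Skip the optional prefix of ignored (zero) bits. */
    while (bit >= 0 && !((fmt >> bit) & 1))
        bit--;

    /* Each run of identical bits (all 1's or all 0's) is one field. */
    while (bit >= 0) {
        uint64_t val = (fmt >> bit) & 1;
        int width = 0;

        while (bit >= 0 && ((fmt >> bit) & 1) == val) {
            width++;
            bit--;
        }
        widths[nfields++] = width;
    }
    return nfields;
}
```

For a format of 0x30FF the routine reports three fields of 2, 4, and 8 bits, matching the 14 tagged bits described above.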
813
814 By identifying fields within a tag, a provider may be able to optimize
815 its search routines. An application that requests tag fields must
816 provide tag masks that either set all mask bits corresponding to a
817 field to all 0 or all 1. When negotiating tag fields, an application
818 can request a specific number of fields of a given size. A provider
819 must return a tag format that supports the requested number of fields,
820 with each field being at least the size requested, or fail the request.
821 A provider may increase the size of the fields. When reporting comple‐
822 tions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the provider
823 would clear out any unsupported tag bits in the tag field of the com‐
824 pletion entry.
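
Because the provider may not validate masks, an application could check them itself. The following sketch tests the all-0 or all-1 rule for every field. tag_mask_valid is a hypothetical helper, not a libfabric function, and leaving mask bits in the ignored prefix unchecked is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a libfabric call: returns true when, for
 * every field described by fmt, the mask bits of that field are either
 * all 0 or all 1. */
static bool tag_mask_valid(uint64_t fmt, uint64_t mask)
{
    int bit = 63;

    /* Assumption: mask bits in the ignored prefix are not checked. */
    while (bit >= 0 && !((fmt >> bit) & 1))
        bit--;

    while (bit >= 0) {
        uint64_t val = (fmt >> bit) & 1;
        uint64_t field = 0;

        /* Collect the bits of the current field. */
        while (bit >= 0 && ((fmt >> bit) & 1) == val) {
            field |= 1ULL << bit;
            bit--;
        }

        /* A partially set field violates the rule. */
        uint64_t selected = mask & field;
        if (selected != 0 && selected != field)
            return false;
    }
    return true;
}
```

With a format of 0x30FF, the masks 0x3000, 0x0F00, 0x00FF, and any bitwise OR of them pass, while a mask such as 0x00F0 covers only part of the last field and is rejected.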
825
826 It is recommended that field sizes be ordered from smallest to largest.
827 A generic, unstructured tag and mask can be achieved by requesting a
828 bit array consisting of alternating 1’s and 0’s.
829
830 tx_ctx_cnt - Transmit Context Count
831 Number of transmit contexts to associate with the endpoint. If not
832 specified (0), 1 context will be assigned if the endpoint supports out‐
833 bound transfers. Transmit contexts are independent transmit queues
834 that may be separately configured. Each transmit context may be bound
835 to a separate CQ, and no ordering is defined between contexts. Addi‐
836 tionally, no synchronization is needed when accessing contexts in par‐
837 allel.
838
839 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
840 be configured to use a shared transmit context, if supported by the
841 provider. Providers that do not support shared transmit contexts will
842 fail the request.
843
844 See the scalable endpoint and shared contexts sections for additional
845 details.
846
847 rx_ctx_cnt - Receive Context Count
848 Number of receive contexts to associate with the endpoint. If not
849 specified, 1 context will be assigned if the endpoint supports inbound
850 transfers. Receive contexts are independent processing queues that may
851 be separately configured. Each receive context may be bound to a sepa‐
852 rate CQ, and no ordering is defined between contexts. Additionally, no
853 synchronization is needed when accessing contexts in parallel.
854
855 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
856 be configured to use a shared receive context, if supported by the
857 provider. Providers that do not support shared receive contexts will
858 fail the request.
859
860 See the scalable endpoint and shared contexts sections for additional
861 details.
862
863 auth_key_size - Authorization Key Length
864 The length of the authorization key in bytes. This field will be 0 if
865 authorization keys are not available or used. This field is ignored
866 unless the fabric is opened with API version 1.5 or greater.
867
868 auth_key - Authorization Key
869 If supported by the fabric, an authorization key (a.k.a. job key) to
870 associate with the endpoint. An authorization key is used to limit
871 communication between endpoints. Only peer endpoints that are pro‐
872 grammed to use the same authorization key may communicate. Authoriza‐
873 tion keys are often used to implement job keys, to ensure that process‐
874 es running in different jobs do not accidentally cross traffic. The
875 domain authorization key will be used if auth_key_size is set to 0.
876 This field is ignored unless the fabric is opened with API version 1.5
877 or greater.
878
880 Attributes specific to the transmit capabilities of an endpoint are
881 specified using struct fi_tx_attr.
882
883 struct fi_tx_attr {
884 uint64_t caps;
885 uint64_t mode;
886 uint64_t op_flags;
887 uint64_t msg_order;
888 uint64_t comp_order;
889 size_t inject_size;
890 size_t size;
891 size_t iov_limit;
892 size_t rma_iov_limit;
893 uint32_t tclass;
894 };
895
896 caps - Capabilities
897 The requested capabilities of the context. The capabilities must be a
898 subset of those requested of the associated endpoint. See the CAPABIL‐
899 ITIES section of fi_getinfo(3) for capability details. If the caps
900 field is 0 on input to fi_getinfo(3), the applicable capability bits
901 from the fi_info structure will be used.
902
903 The following capabilities apply to the transmit attributes: FI_MSG,
904 FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND, FI_HMEM,
905 FI_TRIGGER, FI_FENCE, FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX,
906 FI_COLLECTIVE, and FI_XPU.
907
908 Many applications will be able to ignore this field and rely solely on
909 the fi_info::caps field. Use of this field provides fine grained con‐
910 trol over the transmit capabilities associated with an endpoint. It is
911 useful when handling scalable endpoints, with multiple transmit con‐
912 texts, for example, and allows configuring a specific transmit context
913 with fewer capabilities than that supported by the endpoint or other
914 transmit contexts.
915
916 mode
917 The operational mode bits of the context. The mode bits will be a sub‐
918 set of those associated with the endpoint. See the MODE section of
919 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
920 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
921 stead. On return from fi_getinfo(3), the mode will be set only to
922 those constraints specific to transmit operations.
923
924 op_flags - Default transmit operation flags
925 Flags that control the operation of operations submitted against the
926 context. Applicable flags are listed in the Operation Flags section.
927
928 msg_order - Message Ordering
929 Message ordering refers to the order in which transport layer headers
930 (as viewed by the application) are identified and processed. Relaxed
931 message order enables data transfers to be sent and received out of or‐
932 der, which may improve performance by utilizing multiple paths through
933 the fabric from the initiating endpoint to a target endpoint. Message
934 order applies only between a single source and destination endpoint
935 pair. Ordering between different target endpoints is not defined.
936
937 Message order is determined using a set of ordering bits. Each set bit
938 indicates that ordering is maintained between data transfers of the
939 specified type. Message order is defined for [read | write | send] op‐
940 erations submitted by an application after [read | write | send] opera‐
941 tions.
942
943 Message ordering only applies to the end-to-end transmission of
944 transport headers. Message ordering is necessary for, but does not
945 guarantee, the order in which message data is sent or received by the
946 transport layer. Message ordering requires matching ordering semantics
947 on the receiving side of a data transfer operation in order to guarantee
948 that ordering is met.
949
950 FI_ORDER_ATOMIC_RAR
951 Atomic read after read. If set, atomic fetch operations are
952 transmitted in the order submitted relative to other atomic
953 fetch operations. If not set, atomic fetches may be transmitted
954 out of order from their submission.
955
956 FI_ORDER_ATOMIC_RAW
957 Atomic read after write. If set, atomic fetch operations are
958 transmitted in the order submitted relative to atomic update op‐
959 erations. If not set, atomic fetches may be transmitted ahead
960 of atomic updates.
961
962 FI_ORDER_ATOMIC_WAR
963 Atomic write after read. If set, atomic update operations are
964 transmitted in the order submitted relative to atomic fetch
965 operations. If not set, atomic updates may be transmitted ahead
966 of atomic fetches.
967
968 FI_ORDER_ATOMIC_WAW
969 Atomic write after write. If set, atomic update operations are
970 transmitted in the order submitted relative to other atomic
971 update operations. If not set, atomic updates may be transmitted
972 out of order from their submission.
973
974 FI_ORDER_NONE
975 No ordering is specified. This value may be used as input in
976 order to obtain the default message order supported by the
977 provider. FI_ORDER_NONE is an alias for the value 0.
978
979 FI_ORDER_RAR
980 Read after read. If set, RMA and atomic read operations are
981 transmitted in the order submitted relative to other RMA and
982 atomic read operations. If not set, RMA and atomic reads may be
983 transmitted out of order from their submission.
984
985 FI_ORDER_RAS
986 Read after send. If set, RMA and atomic read operations are
987 transmitted in the order submitted relative to message send op‐
988 erations, including tagged sends. If not set, RMA and atomic
989 reads may be transmitted ahead of sends.
990
991 FI_ORDER_RAW
992 Read after write. If set, RMA and atomic read operations are
993 transmitted in the order submitted relative to RMA and atomic
994 write operations. If not set, RMA and atomic reads may be
995 transmitted ahead of RMA and atomic writes.
996
997 FI_ORDER_RMA_RAR
998 RMA read after read. If set, RMA read operations are transmit‐
999 ted in the order submitted relative to other RMA read opera‐
1000 tions. If not set, RMA reads may be transmitted out of order
1001 from their submission.
1002
1003 FI_ORDER_RMA_RAW
1004 RMA read after write. If set, RMA read operations are transmit‐
1005 ted in the order submitted relative to RMA write operations. If
1006 not set, RMA reads may be transmitted ahead of RMA writes.
1007
1008 FI_ORDER_RMA_WAR
1009 RMA write after read. If set, RMA write operations are trans‐
1010 mitted in the order submitted relative to RMA read operations.
1011 If not set, RMA writes may be transmitted ahead of RMA reads.
1012
1013 FI_ORDER_RMA_WAW
1014 RMA write after write. If set, RMA write operations are trans‐
1015 mitted in the order submitted relative to other RMA write opera‐
1016 tions. If not set, RMA writes may be transmitted out of order
1017 from their submission.
1018
1019 FI_ORDER_SAR
1020 Send after read. If set, message send operations, including
1021 tagged sends, are transmitted in the order submitted relative to
1022 RMA and atomic read operations. If not set, message sends may be
1023 transmitted ahead of RMA and atomic reads.
1024
1025 FI_ORDER_SAS
1026 Send after send. If set, message send operations, including
1027 tagged sends, are transmitted in the order submitted relative to
1028 other message sends. If not set, message sends may be
1029 transmitted out of order from their submission.
1030
1031 FI_ORDER_SAW
1032 Send after write. If set, message send operations, including
1033 tagged sends, are transmitted in the order submitted relative to
1034 RMA and atomic write operations. If not set, message sends may be
1035 transmitted ahead of RMA and atomic writes.
1036
1037 FI_ORDER_WAR
1038 Write after read. If set, RMA and atomic write operations are
1039 transmitted in the order submitted relative to RMA and atomic
1040 read operations. If not set, RMA and atomic writes may be
1041 transmitted ahead of RMA and atomic reads.
1042
1043 FI_ORDER_WAS
1044 Write after send. If set, RMA and atomic write operations are
1045 transmitted in the order submitted relative to message send op‐
1046 erations, including tagged sends. If not set, RMA and atomic
1047 writes may be transmitted ahead of sends.
1048
1049 FI_ORDER_WAW
1050 Write after write. If set, RMA and atomic write operations are
1051 transmitted in the order submitted relative to other RMA and
1052 atomic write operations. If not set, RMA and atomic writes may
1053 be transmitted out of order from their submission.
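
Since message order is a set of bits, checking whether a returned msg_order covers what an application needs reduces to a mask comparison. The sketch below uses placeholder bit values so that it builds without libfabric. The DEMO_ORDER_* names and values are assumptions for illustration and are not the library's definitions; real code would use the FI_ORDER_* constants from <rdma/fabric.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for illustration only. They are NOT the values
 * libfabric assigns to FI_ORDER_WAW, FI_ORDER_SAS, or FI_ORDER_RAW. */
#define DEMO_ORDER_WAW (1ULL << 0)
#define DEMO_ORDER_SAS (1ULL << 1)
#define DEMO_ORDER_RAW (1ULL << 2)

/* Hypothetical helper: true when every ordering bit the application
 * requires is also set in the ordering the provider reports. */
static bool msg_order_satisfied(uint64_t provided, uint64_t required)
{
    return (provided & required) == required;
}
```

A required value of 0 (no ordering) is always satisfied, which mirrors the use of FI_ORDER_NONE as an input that accepts the provider's default.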
1054
1055 comp_order - Completion Ordering
1056 Completion ordering refers to the order in which completed requests are
1057 written into the completion queue. Completion ordering is similar to
1058 message order. Relaxed completion order may enable faster reporting of
1059 completed transfers, allow acknowledgments to be sent over different
1060 fabric paths, and support more sophisticated retry mechanisms. This
1061 can result in lower-latency completions, particularly when using con‐
1062 nectionless endpoints. Strict completion ordering may require that
1063 providers queue completed operations or limit available optimizations.
1064
1065 For transmit requests, completion ordering depends on the endpoint com‐
1066 munication type. For unreliable communication, completion ordering ap‐
1067 plies to all data transfer requests submitted to an endpoint. For re‐
1068 liable communication, completion ordering only applies to requests that
1069 target a single destination endpoint. Completion ordering of requests
1070 that target different endpoints over a reliable transport is not de‐
1071 fined.
1072
1073 Applications should specify the completion ordering that they support
1074 or require. Providers should return the completion order that they ac‐
1075 tually provide, with the constraint that the returned ordering is
1076 stricter than that specified by the application. Supported completion
1077 order values are:
1078
1079 FI_ORDER_NONE
1080 No ordering is defined for completed operations. Requests sub‐
1081 mitted to the transmit context may complete in any order.
1082
1083 FI_ORDER_STRICT
1084 Requests complete in the order in which they are submitted to
1085 the transmit context.
1086
1087 inject_size
1088 The requested inject operation size (see the FI_INJECT flag) that the
1089 context will support. This is the maximum size data transfer that can
1090 be associated with an inject operation (such as fi_inject) or may be
1091 used with the FI_INJECT data transfer flag.
1092
1093 size
1094 The size of the transmit context. The mapping of the size value to re‐
1095 sources is provider specific, but it is directly related to the number
1096 of command entries allocated for the endpoint. A smaller size value
1097 consumes fewer hardware and software resources, while a larger size al‐
1098 lows queuing more transmit requests.
1099
1100 While the size attribute guides the size of underlying endpoint trans‐
1101 mit queue, there is not necessarily a one-to-one mapping between a
1102 transmit operation and a queue entry. A single transmit operation may
1103 consume multiple queue entries; for example, one per scatter-gather
1104 entry. Additionally, the size field is intended to guide the allocation
1105 of the endpoint’s transmit context. Specifically, for connectionless
1106 endpoints, there may be lower-level queues used to track communication
1107 on a per-peer basis. The sizes of any lower-level queues may be
1108 significantly smaller than the endpoint’s transmit size, in order to
1109 reduce resource utilization.
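
As a rough illustration of why size is not a count of operations, the sketch below estimates how many operations fit in a context under one possible mapping. It assumes one queue entry per scatter-gather element, which is only the example given above. The actual mapping is provider specific, and min_postable_ops is a hypothetical helper, not a libfabric call.

```c
#include <stddef.h>

/* Hypothetical helper: number of operations that fit in a transmit
 * context of `size` entries, ASSUMING each operation consumes one
 * queue entry per scatter-gather element. Providers may map
 * operations to entries differently. */
static size_t min_postable_ops(size_t size, size_t iov_count)
{
    if (iov_count == 0)
        iov_count = 1;  /* treat an operation with no IOVs as one entry */
    return size / iov_count;
}
```

Under this assumption, a context of size 128 holds 32 operations that each use 4 IO vectors, so an application sizing its outstanding-request window may want to account for iov_limit as well as size.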
1110
1111 iov_limit
1112 This is the maximum number of IO vectors (scatter-gather elements) that
1113 a single posted operation may reference.
1114
1115 rma_iov_limit
1116 This is the maximum number of RMA IO vectors (scatter-gather elements)
1117 that an RMA or atomic operation may reference. The rma_iov_limit cor‐
1118 responds to the rma_iov_count values in RMA and atomic operations. See
1119 struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
1120 for additional details. This limit applies to both the number of RMA
1121 IO vectors that may be specified when initiating an operation from the
1122 local endpoint, as well as the maximum number of IO vectors that may be
1123 carried in a single request from a remote endpoint.
1124
1125 Traffic Class (tclass)
1126 Traffic classes can be a differentiated services code point (DSCP) val‐
1127 ue, one of the following defined labels, or a provider-specific defini‐
1128 tion. If tclass is unset or set to FI_TC_UNSPEC, the endpoint will use
1129 the default traffic class associated with the domain.
1130
1131 FI_TC_BEST_EFFORT
1132 This is the default in the absence of any other local or fabric
1133 configuration. This class carries the traffic for a number of
1134 applications executing concurrently over the same network infra‐
1135 structure. Even though it is shared, network capacity and re‐
1136 source allocation are distributed fairly across the applica‐
1137 tions.
1138
1139 FI_TC_BULK_DATA
1140 This class is intended for large data transfers associated with
1141 I/O and is present to separate sustained I/O transfers from oth‐
1142 er application inter-process communications.
1143
1144 FI_TC_DEDICATED_ACCESS
1145 This class operates at the highest priority, except the manage‐
1146 ment class. It carries a high bandwidth allocation, minimum la‐
1147 tency targets, and the highest scheduling and arbitration prior‐
1148 ity.
1149
1150 FI_TC_LOW_LATENCY
1151 This class supports low latency, low jitter data patterns
1152 typically caused by transactional data exchanges, barrier
1153 synchronizations, and collective operations that are typical of HPC
1154 applications. This class often requires maximum tolerable
1155 latencies that data transfers must achieve for correct or performant
1156 operation. Fulfillment of such requests in this class will
1157 typically require accompanying bandwidth and message size
1158 limitations so as not to consume excessive bandwidth at high
1159 priority.
1160
1161 FI_TC_NETWORK_CTRL
1162 This class is intended for traffic directly related to fabric
1163 (network) management, which is critical to the correct operation
1164 of the network. Its use is typically restricted to privileged
1165 network management applications.
1166
1167 FI_TC_SCAVENGER
1168 This class is used for data that is desired but does not have
1169 strict delivery requirements, such as in-band network or appli‐
1170 cation level monitoring data. Use of this class indicates that
1171 the traffic is considered lower priority and should not inter‐
1172 fere with higher priority workflows.
1173
1174 fi_tc_dscp_set / fi_tc_dscp_get
1175 DSCP values are supported via the DSCP get and set functions.
1176 The definitions for DSCP values are outside the scope of libfab‐
1177 ric. See the fi_tc_dscp_set and fi_tc_dscp_get function defini‐
1178 tions for details on their use.
1179
1181 Attributes specific to the receive capabilities of an endpoint are
1182 specified using struct fi_rx_attr.
1183
1184 struct fi_rx_attr {
1185 uint64_t caps;
1186 uint64_t mode;
1187 uint64_t op_flags;
1188 uint64_t msg_order;
1189 uint64_t comp_order;
1190 size_t total_buffered_recv;
1191 size_t size;
1192 size_t iov_limit;
1193 };
1194
1195 caps - Capabilities
1196 The requested capabilities of the context. The capabilities must be a
1197 subset of those requested of the associated endpoint. See the
1198 CAPABILITIES section of fi_getinfo(3) for capability details. If the caps
1199 field is 0 on input to fi_getinfo(3), the applicable capability bits
1200 from the fi_info structure will be used.
1201
1202 The following capabilities apply to the receive attributes: FI_MSG,
1203 FI_RMA, FI_TAGGED, FI_ATOMIC, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_RECV,
1204 FI_HMEM, FI_TRIGGER, FI_RMA_PMEM, FI_DIRECTED_RECV, FI_VARIABLE_MSG,
1205 FI_MULTI_RECV, FI_SOURCE, FI_RMA_EVENT, FI_SOURCE_ERR, FI_COLLECTIVE,
1206 and FI_XPU.
1207
1208 Many applications will be able to ignore this field and rely solely on
1209 the fi_info::caps field. Use of this field provides fine grained con‐
1210 trol over the receive capabilities associated with an endpoint. It is
1211 useful when handling scalable endpoints, with multiple receive con‐
1212 texts, for example, and allows configuring a specific receive context
1213 with fewer capabilities than that supported by the endpoint or other
1214 receive contexts.
1215
1216 mode
1217 The operational mode bits of the context. The mode bits will be a sub‐
1218 set of those associated with the endpoint. See the MODE section of
1219 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
1220 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
1221 stead. On return from fi_getinfo(3), the mode will be set only to
1222 those constraints specific to receive operations.
1223
1224 op_flags - Default receive operation flags
1225 Flags that control the operation of operations submitted against the
1226 context. Applicable flags are listed in the Operation Flags section.
1227
1228 msg_order - Message Ordering
1229 For a description of message ordering, see the msg_order field in the
1230 Transmit Context Attribute section. Receive context message ordering
1231 defines the order in which received transport message headers are pro‐
1232 cessed when received by an endpoint. When ordering is set, it indi‐
1233 cates that message headers will be processed in order, based on how the
1234 transmit side has identified the messages. Typically, this means that
1235 messages will be handled in order based on a message level sequence
1236 number.
1237
1238 The following ordering flags, as defined for transmit ordering, also
1239 apply to the processing of received operations: FI_ORDER_NONE, FI_OR‐
1240 DER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_OR‐
1241 DER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR,
1242 FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOM‐
1243 IC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOM‐
1244 IC_WAW.
1245
1246 comp_order - Completion Ordering
1247 For a description of completion ordering, see the comp_order field in
1248 the Transmit Context Attribute section.
1249
1250 FI_ORDER_DATA
1251 When set, this bit indicates that received data is written into
1252 memory in order. Data ordering applies to memory accessed as
1253 part of a single operation and between operations if message or‐
1254 dering is guaranteed.
1255
1256 FI_ORDER_NONE
1257 No ordering is defined for completed operations. Receive opera‐
1258 tions may complete in any order, regardless of their submission
1259 order.
1260
1261 FI_ORDER_STRICT
1262 Receive operations complete in the order in which they are pro‐
1263 cessed by the receive context, based on the receive side msg_or‐
1264 der attribute.
1265
1266 total_buffered_recv
1267 This field is supported for backwards compatibility purposes. It is a
1268 hint to the provider of the total available space that may be needed to
1269 buffer messages that are received for which there is no matching re‐
1270 ceive operation. The provider may adjust or ignore this value. The
1271 allocation of internal network buffering among received messages is
1272 provider specific. For instance, a provider may limit the size of mes‐
1273 sages which can be buffered or the amount of buffering allocated to a
1274 single message.
1275
1276 If receive side buffering is disabled (total_buffered_recv = 0) and a
1277 message is received by an endpoint, then the behavior is dependent on
1278 whether resource management has been enabled (FI_RM_ENABLED has been set
1279 or not). See the Resource Management section of fi_domain.3 for fur‐
1280 ther clarification. It is recommended that applications enable re‐
1281 source management if they anticipate receiving unexpected messages,
1282 rather than modifying this value.
1283
1284 size
1285 The size of the receive context. The mapping of the size value to re‐
1286 sources is provider specific, but it is directly related to the number
1287 of command entries allocated for the endpoint. A smaller size value
1288 consumes fewer hardware and software resources, while a larger size
1289 allows queuing more receive requests.
1290
1291 While the size attribute guides the size of underlying endpoint receive
1292 queue, there is not necessarily a one-to-one mapping between a receive
1293 operation and a queue entry. A single receive operation may consume
1294 multiple queue entries; for example, one per scatter-gather entry.
1295 Additionally, the size field is intended to guide the allocation of the
1296 endpoint’s receive context. Specifically, for connectionless
1297 endpoints, there may be lower-level queues used to track communication on
1298 a per-peer basis. The sizes of any lower-level queues may be
1299 significantly smaller than the endpoint’s receive size, in order to reduce
1300 resource utilization.
1301
1302 iov_limit
1303 This is the maximum number of IO vectors (scatter-gather elements) that
1304 a single posted operation may reference.
1305
1307 A scalable endpoint is a communication portal that supports multiple
1308 transmit and receive contexts. Scalable endpoints are loosely modeled
1309 after the networking concept of transmit/receive side scaling, also
1310 known as multi-queue. Support for scalable endpoints is domain specif‐
1311 ic. Scalable endpoints may improve the performance of multi-threaded
1312 and parallel applications, by allowing threads to access independent
1313 transmit and receive queues. A scalable endpoint has a single trans‐
1314 port level address, which can reduce the memory requirements needed to
1315 store remote addressing data, versus using standard endpoints. Scal‐
1316 able endpoints cannot be used directly for communication operations,
1317 and require the application to explicitly create transmit and receive
1318 contexts as described below.
1319
1320 fi_tx_context
1321 Transmit contexts are independent transmit queues. Ordering and syn‐
1322 chronization between contexts are not defined. Conceptually a transmit
1323 context behaves similarly to a send-only endpoint. A transmit context
1324 may be configured with fewer capabilities than the base endpoint and
1325 with different attributes (such as ordering requirements and inject
1326 size) than other contexts associated with the same scalable endpoint.
1327 Each transmit context has its own completion queue. The number of
1328 transmit contexts associated with an endpoint is specified during end‐
1329 point creation.
1330
1331 The fi_tx_context call is used to retrieve a specific context, identi‐
1332 fied by an index (see above for details on transmit context at‐
1333 tributes). Providers may dynamically allocate contexts when fi_tx_con‐
1334 text is called, or may statically create all contexts when fi_endpoint
1335 is invoked. By default, a transmit context inherits the properties of
1336 its associated endpoint. However, applications may request context
1337 specific attributes through the attr parameter. Support for per trans‐
1338 mit context attributes is provider specific and not guaranteed.
1339 Providers will return the actual attributes assigned to the context
1340 through the attr parameter, if provided.
1341
1342 fi_rx_context
1343 Receive contexts are independent receive queues for receiving incoming
1344 data. Ordering and synchronization between contexts are not
1345 guaranteed. Conceptually a receive context behaves similarly to a receive-only
1346 endpoint. A receive context may be configured with fewer capabilities
1347 than the base endpoint and with different attributes (such as ordering
1348 requirements and inject size) than other contexts associated with the
1349 same scalable endpoint. Each receive context has its own completion
1350 queue. The number of receive contexts associated with an endpoint is
1351 specified during endpoint creation.
1352
1353 Receive contexts are often associated with steering flows that specify
1354 which incoming packets targeting a scalable endpoint to process. How‐
1355 ever, receive contexts may be targeted directly by the initiator, if
1356 supported by the underlying protocol. Such contexts are referred to as
1357 `named'. Support for named contexts must be indicated by setting the
1358 FI_NAMED_RX_CTX capability in caps when the corresponding endpoint is
1359 created. Support for named receive contexts is coordinated with address
1360 vectors. See fi_av(3) and fi_rx_addr(3).
1361
1362 The fi_rx_context call is used to retrieve a specific context, identi‐
1363 fied by an index (see above for details on receive context attributes).
1364 Providers may dynamically allocate contexts when fi_rx_context is
1365 called, or may statically create all contexts when fi_endpoint is in‐
1366 voked. By default, a receive context inherits the properties of its
1367 associated endpoint. However, applications may request context specif‐
1368 ic attributes through the attr parameter. Support for per receive con‐
1369 text attributes is provider specific and not guaranteed. Providers
1370 will return the actual attributes assigned to the context through the
1371 attr parameter, if provided.
1372
1374 Shared contexts are transmit and receive contexts explicitly shared
1375 among one or more endpoints. A shareable context allows an application
1376 to use a single dedicated provider resource among multiple transport
1377 addressable endpoints. This can greatly reduce the resources needed to
1378 manage communication over multiple endpoints by multiplexing transmit
1379 and/or receive processing, with the potential cost of serializing ac‐
1380 cess across multiple endpoints. Support for shareable contexts is do‐
1381 main specific.
1382
1383 Conceptually, shareable transmit contexts are transmit queues that may
1384 be accessed by many endpoints. The use of a shared transmit context is
1385 mostly opaque to an application. Applications must allocate and bind
1386 shared transmit contexts to endpoints, but operations are posted di‐
1387 rectly to the endpoint. Shared transmit contexts are not associated
1388 with completion queues or counters. Completed operations are posted to
1389 the CQs bound to the endpoint. An endpoint may only be associated with
1390 a single shared transmit context.
1391
1392 Unlike shared transmit contexts, applications interact directly with
1393 shared receive contexts. Users post receive buffers directly to a
1394 shared receive context, with the buffers usable by any endpoint bound
1395 to the shared receive context. Shared receive contexts are not associ‐
1396 ated with completion queues or counters. Completed receive operations
1397 are posted to the CQs bound to the endpoint. An endpoint may only be
1398 associated with a single shared receive context, and all connectionless
1399 endpoints associated with a shared receive context must also share the
1400 same address vector.
1401
1402 Endpoints associated with a shared transmit context may use dedicated
1403 receive contexts, and vice-versa. Or an endpoint may use shared trans‐
1404 mit and receive contexts. And there is no requirement that the same
1405 group of endpoints sharing a context of one type also share the context
1406 of an alternate type. Furthermore, an endpoint may use a shared con‐
1407 text of one type, but a scalable set of contexts of the alternate type.
1408
1409 fi_stx_context
1410 This call is used to open a shareable transmit context (see above for
1411 details on the transmit context attributes). Endpoints associated with
1412 a shared transmit context must use a subset of the transmit context’s
1413 attributes. Note that this is the reverse of the requirement for
1414 transmit contexts for scalable endpoints.
1415
1416 fi_srx_context
1417 This allocates a shareable receive context (see above for details on
1418 the receive context attributes). Endpoints associated with a shared
1419 receive context must use a subset of the receive context’s attributes.
1420 Note that this is the reverse of the requirement for receive contexts
1421 for scalable endpoints.
1422
1424 The following feature and description should be considered experimen‐
1425 tal. Until the experimental tag is removed, the interfaces, semantics,
1426 and data structures associated with socket endpoints may change between
1427 library versions.
1428
1429 This section applies to endpoints of type FI_EP_SOCK_STREAM and
1430 FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1431
1432 Socket endpoints are defined with semantics that allow them to more
1433 easily be adopted by developers familiar with the UNIX socket API, or
1434 by middleware that exposes the socket API, while still taking advantage
1435 of high-performance hardware features.
1436
1437 The key difference between socket endpoints and other active endpoints
1438 is that socket endpoints use synchronous data transfers. Buffers passed
1439 into send and receive operations revert to the control of the applica‐
1440 tion upon returning from the function call. As a result, no data
1441 transfer completions are reported to the application, and socket end‐
1442 points are not associated with completion queues or counters.
1443
1444 Socket endpoints support a subset of message operations: fi_send,
1445 fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
1446 Because data transfers are synchronous, the return value from send and
1447 receive operations indicates the number of bytes transferred on success,
1448 or a negative value on error, including -FI_EAGAIN if the endpoint can‐
1449 not send or receive any data because of full or empty queues, respec‐
1450 tively.
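
The return-value convention described above can be sketched as a small
helper that interprets one result. This is a minimal sketch, not libfabric
code: account_transfer and enum xfer_status are hypothetical names, and
the local FI_EAGAIN definition assumes that rdma/fi_errno.h maps it to the
system EAGAIN value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Assumption: mirrors rdma/fi_errno.h, where FI_EAGAIN maps to
 * the system EAGAIN value. Defined here only so the sketch builds
 * without the libfabric headers. */
#ifndef FI_EAGAIN
#define FI_EAGAIN EAGAIN
#endif

enum xfer_status {
    XFER_OK,     /* zero or more bytes were transferred */
    XFER_RETRY,  /* queue full or empty, try again later */
    XFER_ERROR   /* any other negative return value */
};

/*
 * Hypothetical helper: interpret the return value of a send or
 * receive call on a socket endpoint and add any transferred bytes
 * to *done.
 */
static enum xfer_status account_transfer(ssize_t ret, size_t *done)
{
    assert(done != NULL);

    if (ret >= 0) {
        *done += (size_t)ret;
        return XFER_OK;
    }
    if (ret == -FI_EAGAIN)
        return XFER_RETRY;
    return XFER_ERROR;
}
```

Here ret would be the value returned by a call such as fi_send or fi_recv
on a socket endpoint. On XFER_RETRY, an application would typically wait
on the file descriptor associated with the endpoint before trying again.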
1451
1452 Socket endpoints are associated with event queues and address vectors,
1453 and process connection management events asynchronously, similar to
1454 other endpoints. Unlike UNIX sockets, socket endpoints must still be
1455 declared as either active or passive.
1456
1457 Socket endpoints behave like non-blocking sockets. In order to support
1458 select and poll semantics, active socket endpoints are associated with
1459 a file descriptor that is signaled whenever the endpoint is ready to
1460 send and/or receive data. The file descriptor may be retrieved using
1461 fi_control.
1462
1464 Operation flags are obtained by OR-ing the following flags together.
1465 Operation flags define the default flags applied to an endpoint’s data
1466 transfer operations, where a flags parameter is not available. Data
1467 transfer operations that take flags as input override the op_flags val‐
1468 ue of transmit or receive context attributes of an endpoint.
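
The override rule can be modeled in a few lines. The sketch below is only
a model of the rule as stated here: effective_flags and the DEMO_* bits
are hypothetical placeholders, and real programs would use the FI_* flags
from the libfabric headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bits for illustration only; the real FI_* flag
 * values from <rdma/fabric.h> are not reproduced here. */
#define DEMO_COMPLETION        (UINT64_C(1) << 0)
#define DEMO_INJECT            (UINT64_C(1) << 1)
#define DEMO_DELIVERY_COMPLETE (UINT64_C(1) << 2)

/*
 * Hypothetical model of which flags govern one data transfer.
 * Calls without a flags parameter fall back to the context's
 * op_flags; calls with one use the flags they were given.
 */
static uint64_t effective_flags(uint64_t op_flags, int has_flags_param,
                                uint64_t call_flags)
{
    /* call_flags is meaningful only when the call takes flags. */
    assert(has_flags_param || call_flags == 0);

    return has_flags_param ? call_flags : op_flags;
}
```

For example, fi_send has no flags parameter, so it would use the op_flags
of the transmit context, whereas fi_sendmsg takes a flags argument that
replaces them for that call.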
1469
1470 FI_COMMIT_COMPLETE
1471 Indicates that a completion should not be generated (locally or
1472 at the peer) until the result of an operation has been made
1473 persistent. See fi_cq(3) for additional details on completion
1474 semantics.
1475
1476 FI_COMPLETION
1477 Indicates that a completion queue entry should be written for
1478 data transfer operations. This flag only applies to operations
1479 issued on an endpoint that was bound to a completion queue with
1480 the FI_SELECTIVE_COMPLETION flag set; otherwise, it is ignored.
1481 See the fi_ep_bind section above for more detail.
1482
1483 FI_DELIVERY_COMPLETE
1484 Indicates that a completion should be generated when the opera‐
1485 tion has been processed by the destination endpoint(s). See
1486 fi_cq(3) for additional details on completion semantics.
1487
1488 FI_INJECT
1489 Indicates that all outbound data buffers should be returned to
1490 the user’s control immediately after a data transfer call re‐
1491 turns, even if the operation is handled asynchronously. This
1492 may require that the provider copy the data into a local buffer
1493 and transfer out of that buffer. A provider can limit the total
1494 amount of send data that may be buffered and/or the size of a
1495 single send that can use this flag. This limit is indicated us‐
1496 ing inject_size (see inject_size above).
1497
1498 FI_INJECT_COMPLETE
1499 Indicates that a completion should be generated when the source
1500 buffer(s) may be reused. See fi_cq(3) for additional details on
1501 completion semantics.
1502
1503 FI_MULTICAST
1504 Indicates that data transfers will target multicast addresses by
1505 default. Any fi_addr_t passed into a data transfer operation
1506 will be treated as a multicast address.
1507
1508 FI_MULTI_RECV
1509 Applies to posted receive operations. This flag allows the user
1510 to post a single buffer that will receive multiple incoming mes‐
1511 sages. Received messages will be packed into the receive buffer
1512 until the buffer has been consumed. Use of this flag may cause
1513 a single posted receive operation to generate multiple comple‐
1514 tions as messages are placed into the buffer. The placement of
1515 received data into the buffer may be subjected to provider spe‐
1516 cific alignment restrictions. The buffer will be released by
1517 the provider when the available buffer space falls below the
1518 specified minimum (see FI_OPT_MIN_MULTI_RECV).
1519
1520 FI_TRANSMIT_COMPLETE
1521 Indicates that a completion should be generated when the trans‐
1522 mit operation has completed relative to the local provider. See
1523 fi_cq(3) for additional details on completion semantics.
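
The buffer-release rule described for FI_MULTI_RECV can be pictured with
simple bookkeeping. The sketch below is a simplified model under stated
assumptions: it ignores provider specific alignment padding, assumes every
message fits in the remaining space, and uses the hypothetical names
struct multi_recv_buf and multi_recv_place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one FI_MULTI_RECV buffer. */
struct multi_recv_buf {
    size_t size;      /* total bytes posted */
    size_t used;      /* bytes consumed by received messages */
    size_t min_free;  /* threshold, cf. FI_OPT_MIN_MULTI_RECV */
};

/*
 * Account for one received message of len bytes. Returns 1 when
 * the remaining space has dropped below min_free, which is the
 * point at which the provider releases the buffer, and 0 while
 * the buffer can still accept messages.
 */
static int multi_recv_place(struct multi_recv_buf *b, size_t len)
{
    assert(b != NULL && len <= b->size - b->used);

    b->used += len;
    return (b->size - b->used) < b->min_free;
}
```

With a 4096-byte buffer and a 1024-byte minimum, two 1500-byte messages
leave 1096 bytes free and the buffer stays posted; a further 500-byte
message leaves 596 bytes, so the buffer would be released.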
1524
1526 Users should call fi_close to release all resources allocated to the
1527 fabric endpoint.
1528
1529 Endpoints allocated with the FI_CONTEXT or FI_CONTEXT2 mode bits set
1530 must typically provide struct fi_context(2) as their per-operation
1531 context parameter. (See fi_getinfo(3) for details.) However, when
1532 FI_SELECTIVE_COMPLETION is enabled to suppress CQ completion entries, and an
1533 operation is initiated without the FI_COMPLETION flag set, then the
1534 context parameter is ignored. An application does not need to pass in
1535 a valid struct fi_context(2) into such data transfers.
1536
1537 Operations that complete in error that are not associated with valid
1538 operational context will use the endpoint context in any error report‐
1539 ing structures.
1540
1541 Although applications typically associate individual completions with
1542 either completion queues or counters, an endpoint can be attached to
1543 both a counter and completion queue. When combined with using selec‐
1544 tive completions, this allows an application to use counters to track
1545 successful completions, with a CQ used to report errors. Operations
1546 that complete with an error increment the error counter and generate a
1547 CQ completion event.
1548
1549 As mentioned in fi_getinfo(3), the ep_attr structure can be used to
1550 query providers that support various endpoint attributes. fi_getinfo
1551 can return provider info structures that can support the minimal set of
1552 requirements (such that the application maintains correctness). Howev‐
1553 er, it can also return provider info structures that exceed application
1554 requirements. As an example, consider an application requesting
1555 msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may
1556 have all the ordering bits set. The application can reset the ordering
1557 bits it does not require before creating the endpoint. The provider is
1558 free to implement a stricter ordering than is required by the applica‐
1559 tion.
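
Resetting the ordering bits amounts to a bit mask. The sketch below shows
the idea with placeholder values: trim_msg_order and the DEMO_ORDER_* bits
are hypothetical, and real programs would use the FI_ORDER_* values from
the libfabric headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder ordering bits for illustration only; real programs
 * use the FI_ORDER_* values from the libfabric headers. */
#define DEMO_ORDER_NONE UINT64_C(0)
#define DEMO_ORDER_A    (UINT64_C(1) << 0)
#define DEMO_ORDER_B    (UINT64_C(1) << 1)
#define DEMO_ORDER_C    (UINT64_C(1) << 2)

/*
 * Hypothetical helper: keep only the ordering bits the application
 * requires out of those the provider offered.
 */
static uint64_t trim_msg_order(uint64_t offered, uint64_t required)
{
    /* The provider must already offer everything that is required. */
    assert((offered & required) == required);

    return offered & required;
}
```

In a real program, offered would be read from the msg_order field of
info->tx_attr and info->rx_attr as returned by fi_getinfo, and the trimmed
value would be written back before the endpoint is created.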
1560
1562 Returns 0 on success. On error, a negative value corresponding to fab‐
1563 ric errno is returned. For fi_cancel, a return value of 0 indicates
1564 that the cancel request was submitted for processing.
1565
1566 Fabric errno values are defined in rdma/fi_errno.h.
1567
1569 -FI_EDOMAIN
1570 A resource domain was not bound to the endpoint or an attempt
1571 was made to bind multiple domains.
1572
1573 -FI_ENOCQ
1574 The endpoint has not been configured with the necessary completion queue.
1575
1576 -FI_EOPBADSTATE
1577 The endpoint’s state does not permit the requested operation.
1578
1580 fi_getinfo(3), fi_domain(3), fi_cq(3), fi_msg(3), fi_tagged(3),
1581 fi_rma(3)
1582
1584 OpenFabrics.
1585
1586
1587
1588Libfabric Programmer’s Manual 2021-11-20 fi_endpoint(3)