fi_endpoint(3)                 Libfabric v1.8.0                 fi_endpoint(3)

NAME
fi_endpoint - Fabric endpoint operations

fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
Allocate or close an endpoint.
10
11 fi_ep_bind
12 Associate an endpoint with hardware resources, such as event
13 queues, completion queues, counters, address vectors, or shared
14 transmit/receive contexts.
15
16 fi_scalable_ep_bind
17 Associate a scalable endpoint with an address vector
18
19 fi_pep_bind
20 Associate a passive endpoint with an event queue
21
22 fi_enable
23 Transitions an active endpoint into an enabled state.
24
25 fi_cancel
26 Cancel a pending asynchronous data transfer
27
28 fi_ep_alias
29 Create an alias to the endpoint
30
31 fi_control
32 Control endpoint operation.
33
34 fi_getopt / fi_setopt
35 Get or set endpoint options.
36
37 fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
38 Open a transmit or receive context.
39
fi_rx_size_left / fi_tx_size_left (DEPRECATED)
Query the lower bound on how many RX/TX operations may be posted
without an operation returning -FI_EAGAIN. These functions have
been deprecated and will be removed in a future version of the
library.
45
SYNOPSIS
#include <rdma/fabric.h>
48
49 #include <rdma/fi_endpoint.h>
50
51 int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
52 struct fid_ep **ep, void *context);
53
54 int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
55 struct fid_ep **sep, void *context);
56
int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
58 struct fid_pep **pep, void *context);
59
60 int fi_tx_context(struct fid_ep *sep, int index,
61 struct fi_tx_attr *attr, struct fid_ep **tx_ep,
62 void *context);
63
64 int fi_rx_context(struct fid_ep *sep, int index,
65 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
66 void *context);
67
68 int fi_stx_context(struct fid_domain *domain,
69 struct fi_tx_attr *attr, struct fid_stx **stx,
70 void *context);
71
72 int fi_srx_context(struct fid_domain *domain,
73 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
74 void *context);
75
76 int fi_close(struct fid *ep);
77
78 int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);
79
80 int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);
81
82 int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);
83
84 int fi_enable(struct fid_ep *ep);
85
86 int fi_cancel(struct fid_ep *ep, void *context);
87
88 int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);
89
90 int fi_control(struct fid *ep, int command, void *arg);
91
92 int fi_getopt(struct fid *ep, int level, int optname,
93 void *optval, size_t *optlen);
94
95 int fi_setopt(struct fid *ep, int level, int optname,
96 const void *optval, size_t optlen);
97
98 DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);
99
100 DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
101
ARGUMENTS
fid On creation, specifies a fabric or access domain. On bind,
104 identifies the event queue, completion queue, counter, or ad‐
105 dress vector to bind to the endpoint. In other cases, it's a
106 fabric identifier of an associated resource.
107
108 info Details about the fabric interface endpoint to be opened, ob‐
109 tained from fi_getinfo.
110
111 ep A fabric endpoint.
112
113 sep A scalable fabric endpoint.
114
115 pep A passive fabric endpoint.
116
117 context
118 Context associated with the endpoint or asynchronous operation.
119
120 index Index to retrieve a specific transmit/receive context.
121
122 attr Transmit or receive context attributes.
123
124 flags Additional flags to apply to the operation.
125
126 command
127 Command of control operation to perform on endpoint.
128
129 arg Optional control argument.
130
131 level Protocol level at which the desired option resides.
132
133 optname
134 The protocol option to read or set.
135
136 optval The option value that was read or to set.
137
138 optlen The size of the optval buffer.
139
DESCRIPTION
Endpoints are transport level communication portals. There are two
142 types of endpoints: active and passive. Passive endpoints belong to a
143 fabric domain and are most often used to listen for incoming connection
144 requests. However, a passive endpoint may be used to reserve a fabric
145 address that can be granted to an active endpoint. Active endpoints
146 belong to access domains and can perform data transfers.
147
148 Active endpoints may be connection-oriented or connectionless, and may
149 provide data reliability. The data transfer interfaces -- messages
150 (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
151 (fi_atomic) -- are associated with active endpoints. In basic configu‐
152 rations, an active endpoint has transmit and receive queues. In gener‐
153 al, operations that generate traffic on the fabric are posted to the
154 transmit queue. This includes all RMA and atomic operations, along
155 with sent messages and sent tagged messages. Operations that post buf‐
156 fers for receiving incoming data are submitted to the receive queue.
157
158 Active endpoints are created in the disabled state. They must transi‐
159 tion into an enabled state before accepting data transfer operations,
160 including posting of receive buffers. The fi_enable call is used to
161 transition an active endpoint into an enabled state. The fi_connect
162 and fi_accept calls will also transition an endpoint into the enabled
163 state, if it is not already active.
164
165 In order to transition an endpoint into an enabled state, it must be
166 bound to one or more fabric resources. An endpoint that will generate
167 asynchronous completions, either through data transfer operations or
168 communication establishment events, must be bound to the appropriate
169 completion queues or event queues, respectively, before being enabled.
170 Additionally, endpoints that use manual progress must be associated
171 with relevant completion queues or event queues in order to drive
172 progress. For endpoints that are only used as the target of RMA or
173 atomic operations, this means binding the endpoint to a completion
174 queue associated with receive processing. Unconnected endpoints must
175 be bound to an address vector.
176
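The sequence below is a minimal sketch of that flow for a connectionless
endpoint: allocate the endpoint, bind a completion queue and an address
vector, and enable it. The helper name open_rdm_ep and the reduced error
handling are illustrative only; the domain and info arguments are assumed
to come from earlier fi_domain and fi_getinfo calls.

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>

    /* Hypothetical helper: allocate and enable a connectionless endpoint. */
    static int open_rdm_ep(struct fid_domain *domain, struct fi_info *info,
                           struct fid_ep **ep, struct fid_cq **cq,
                           struct fid_av **av)
    {
        struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
        struct fi_av_attr av_attr = { .type = FI_AV_UNSPEC };
        int ret;

        ret = fi_endpoint(domain, info, ep, NULL);     /* allocate active EP */
        if (ret)
            return ret;

        ret = fi_cq_open(domain, &cq_attr, cq, NULL);
        if (ret)
            return ret;

        ret = fi_av_open(domain, &av_attr, av, NULL);
        if (ret)
            return ret;

        /* One CQ receives both transmit and receive completions; a
         * connectionless endpoint must also bind an address vector. */
        ret = fi_ep_bind(*ep, &(*cq)->fid, FI_TRANSMIT | FI_RECV);
        if (ret)
            return ret;

        ret = fi_ep_bind(*ep, &(*av)->fid, 0);
        if (ret)
            return ret;

        return fi_enable(*ep);   /* endpoint may now post data transfers */
    }
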
177 Once an endpoint has been activated, it may be associated with an ad‐
178 dress vector. Receive buffers may be posted to it and calls may be
179 made to connection establishment routines. Connectionless endpoints
180 may also perform data transfers.
181
182 The behavior of an endpoint may be adjusted by setting its control data
183 and protocol options. This allows the underlying provider to redirect
184 function calls to implementations optimized to meet the desired appli‐
185 cation behavior.
186
187 If an endpoint experiences a critical error, it will transition back
188 into a disabled state. Critical errors are reported through the event
189 queue associated with the EP. In certain cases, a disabled endpoint
190 may be re-enabled. The ability to transition back into an enabled
191 state is provider specific and depends on the type of error that the
192 endpoint experienced. When an endpoint is disabled as a result of a
193 critical error, all pending operations are discarded.
194
195 fi_endpoint / fi_passive_ep / fi_scalable_ep
196 fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
197 new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
198 The properties and behavior of the endpoint are defined based on the
199 provided struct fi_info. See fi_getinfo for additional details on
200 fi_info. fi_info flags that control the operation of an endpoint are
201 defined below. See section SCALABLE ENDPOINTS.
202
203 If an active endpoint is allocated in order to accept a connection re‐
204 quest, the fi_info parameter must be the same as the fi_info structure
205 provided with the connection request (FI_CONNREQ) event.
206
207 An active endpoint may acquire the properties of a passive endpoint by
208 setting the fi_info handle field to the passive endpoint fabric de‐
209 scriptor. This is useful for applications that need to reserve the
210 fabric address of an endpoint prior to knowing if the endpoint will be
211 used on the active or passive side of a connection. For example, this
212 feature is useful for simulating socket semantics. Once an active end‐
213 point acquires the properties of a passive endpoint, the passive end‐
214 point is no longer bound to any fabric resources and must no longer be
215 used. The user is expected to close the passive endpoint after opening
216 the active endpoint in order to free up any lingering resources that
217 had been used.
218
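A rough sketch of this pattern is shown below, assuming fabric, domain, and
info already exist; the variable names and abbreviated error handling are
illustrative, not part of the API.

    /* Reserve a fabric address with a passive endpoint, then hand its
     * properties to an active endpoint. */
    struct fi_eq_attr eq_attr = { .wait_obj = FI_WAIT_UNSPEC };
    struct fid_eq *eq;
    struct fid_pep *pep;
    struct fid_ep *ep;
    int ret;

    ret = fi_passive_ep(fabric, info, &pep, NULL);
    if (ret)
        return ret;

    ret = fi_eq_open(fabric, &eq_attr, &eq, NULL);
    if (ret)
        return ret;

    ret = fi_pep_bind(pep, &eq->fid, 0);     /* EQ required for CM events */
    if (ret)
        return ret;

    /* Later, transfer the reserved address to an active endpoint by
     * pointing the info handle at the passive endpoint's fid. */
    info->handle = &pep->fid;
    ret = fi_endpoint(domain, info, &ep, NULL);
    if (ret)
        return ret;

    /* The passive endpoint no longer owns fabric resources; close it. */
    ret = fi_close(&pep->fid);
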
fi_close
Closes an endpoint and releases all resources associated with it.

When closing a scalable endpoint, there must be no open transmit or
receive contexts associated with the scalable endpoint. If resources
are still associated with the scalable endpoint when attempting to
close, the call will return -FI_EBUSY.
226
227 Outstanding operations posted to the endpoint when fi_close is called
228 will be discarded. Discarded operations will silently be dropped, with
229 no completions reported. Additionally, a provider may discard previ‐
230 ously completed operations from the associated completion queue(s).
231 The behavior to discard completed operations is provider specific.
232
233 fi_ep_bind
234 fi_ep_bind is used to associate an endpoint with other allocated re‐
235 sources, such as completion queues, counters, address vectors, event
236 queues, shared contexts, and memory regions. The type of objects that
237 must be bound with an endpoint depend on the endpoint type and its con‐
238 figuration.
239
240 Passive endpoints must be bound with an EQ that supports connection
241 management events. Connectionless endpoints must be bound to a single
242 address vector. If an endpoint is using a shared transmit and/or re‐
243 ceive context, the shared contexts must be bound to the endpoint. CQs,
244 counters, AV, and shared contexts must be bound to endpoints before
245 they are enabled either explicitly or implicitly.
246
247 An endpoint must be bound with CQs capable of reporting completions for
248 any asynchronous operation initiated on the endpoint. For example, if
249 the endpoint supports any outbound transfers (sends, RMA, atomics,
250 etc.), then it must be bound to a completion queue that can report
251 transmit completions. This is true even if the endpoint is configured
252 to suppress successful completions, in order that operations that com‐
253 plete in error may be reported to the user.
254
255 An active endpoint may direct asynchronous completions to different
256 CQs, based on the type of operation. This is specified using
257 fi_ep_bind flags. The following flags may be OR'ed together when bind‐
258 ing an endpoint to a completion domain CQ.
259
260 FI_TRANSMIT
261 Directs the completion of outbound data transfer requests to the
262 specified completion queue. This includes send message, RMA,
263 and atomic operations.
264
265 FI_RECV
266 Directs the notification of inbound data transfers to the speci‐
267 fied completion queue. This includes received messages. This
268 binding automatically includes FI_REMOTE_WRITE, if applicable to
269 the endpoint.
270
271 FI_SELECTIVE_COMPLETION
272 By default, data transfer operations write CQ completion entries
273 into the associated completion queue after they have successful‐
274 ly completed. Applications can use this bind flag to selective‐
275 ly enable when completions are generated. If FI_SELECTIVE_COM‐
276 PLETION is specified, data transfer operations will not generate
277 CQ entries for successful completions unless FI_COMPLETION is
278 set as an operational flag for the given operation. Operations
279 that fail asynchronously will still generate completions, even
280 if a completion is not requested. FI_SELECTIVE_COMPLETION must
281 be OR'ed with FI_TRANSMIT and/or FI_RECV flags.
282
283 When FI_SELECTIVE_COMPLETION is set, the user must determine when a re‐
284 quest that does NOT have FI_COMPLETION set has completed indirectly,
285 usually based on the completion of a subsequent operation or by using
286 completion counters. Use of this flag may improve performance by al‐
287 lowing the provider to avoid writing a CQ completion entry for every
288 operation.
289
290 See Notes section below for additional information on how this flag in‐
291 teracts with the FI_CONTEXT and FI_CONTEXT2 mode bits.
292
293 An endpoint may optionally be bound to a completion counter. Associat‐
294 ing an endpoint with a counter is in addition to binding the EP with a
295 CQ. When binding an endpoint to a counter, the following flags may be
296 specified.
297
298 FI_SEND
299 Increments the specified counter whenever a message transfer
300 initiated over the endpoint has completed successfully or in er‐
301 ror. Sent messages include both tagged and normal message oper‐
302 ations.
303
304 FI_RECV
305 Increments the specified counter whenever a message is received
306 over the endpoint. Received messages include both tagged and
307 normal message operations.
308
309 FI_READ
310 Increments the specified counter whenever an RMA read, atomic
311 fetch, or atomic compare operation initiated from the endpoint
312 has completed successfully or in error.
313
314 FI_WRITE
315 Increments the specified counter whenever an RMA write or base
316 atomic operation initiated from the endpoint has completed suc‐
317 cessfully or in error.
318
319 FI_REMOTE_READ
320 Increments the specified counter whenever an RMA read, atomic
321 fetch, or atomic compare operation is initiated from a remote
322 endpoint that targets the given endpoint. Use of this flag re‐
323 quires that the endpoint be created using FI_RMA_EVENT.
324
325 FI_REMOTE_WRITE
326 Increments the specified counter whenever an RMA write or base
327 atomic operation is initiated from a remote endpoint that tar‐
328 gets the given endpoint. Use of this flag requires that the
329 endpoint be created using FI_RMA_EVENT.
330
An endpoint may only be bound to a single CQ or counter for a given
type of operation. For example, an EP may not bind to two counters
both using FI_WRITE. Furthermore, providers may limit CQ and counter
bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
335
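As an illustration, the fragment below directs transmit and receive
completions to separate CQs, requests selective completion on the transmit
side, and additionally counts completed sends. The tx_cq, rx_cq, and cntr
objects are assumed to have been opened earlier with fi_cq_open and
fi_cntr_open; error handling is abbreviated.

    int ret;

    ret = fi_ep_bind(ep, &tx_cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
    if (ret)
        return ret;

    ret = fi_ep_bind(ep, &rx_cq->fid, FI_RECV);
    if (ret)
        return ret;

    /* Counter binding is in addition to, not instead of, the CQ binding. */
    ret = fi_ep_bind(ep, &cntr->fid, FI_SEND);
    if (ret)
        return ret;

    ret = fi_enable(ep);
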
336 fi_scalable_ep_bind
337 fi_scalable_ep_bind is used to associate a scalable endpoint with an
338 address vector. See section on SCALABLE ENDPOINTS. A scalable end‐
339 point has a single transport level address and can support multiple
340 transmit and receive contexts. The transmit and receive contexts share
341 the transport-level address. Address vectors that are bound to scal‐
342 able endpoints are implicitly bound to any transmit or receive contexts
343 created using the scalable endpoint.
344
345 fi_enable
346 This call transitions the endpoint into an enabled state. An endpoint
347 must be enabled before it may be used to perform data transfers. En‐
348 abling an endpoint typically results in hardware resources being as‐
349 signed to it. Endpoints making use of completion queues, counters,
350 event queues, and/or address vectors must be bound to them before being
351 enabled.
352
353 Calling connect or accept on an endpoint will implicitly enable an end‐
354 point if it has not already been enabled.
355
fi_enable may also be used to re-enable an endpoint that has been
disabled as a result of experiencing a critical error. Applications
should check the return value from fi_enable to see if a disabled
endpoint has been successfully re-enabled.
360
361 fi_cancel
362 fi_cancel attempts to cancel an outstanding asynchronous operation.
363 Canceling an operation causes the fabric provider to search for the op‐
364 eration and, if it is still pending, complete it as having been can‐
365 celed. An error queue entry will be available in the associated error
366 queue with error code FI_ECANCELED. On the other hand, if the opera‐
367 tion completed before the call to fi_cancel, then the completion status
368 of that operation will be available in the associated completion queue.
369 No specific entry related to fi_cancel itself will be posted.
370
371 Cancel uses the context parameter associated with an operation to iden‐
372 tify the request to cancel. Operations posted without a valid context
373 parameter -- either no context parameter is specified or the context
374 value was ignored by the provider -- cannot be canceled. If multiple
375 outstanding operations match the context parameter, only one will be
376 canceled. In this case, the operation which is canceled is provider
377 specific. The cancel operation is asynchronous, but will complete
378 within a bounded period of time.
379
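The fragment below sketches the typical usage, assuming the endpoint has
been enabled and bound to a CQ; the buffer and context names are
illustrative.

    struct fi_context recv_ctx;
    char buf[256];
    ssize_t ret;

    /* Post a receive, identified by its context pointer. */
    ret = fi_recv(ep, buf, sizeof(buf), NULL, FI_ADDR_UNSPEC, &recv_ctx);
    if (ret)
        return (int) ret;

    /* Later, the application decides the receive is no longer needed.
     * If it was still pending, an FI_ECANCELED error entry will appear
     * in the bound CQ's error queue. */
    ret = fi_cancel(&ep->fid, &recv_ctx);
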
380 fi_ep_alias
381 This call creates an alias to the specified endpoint. Conceptually, an
382 endpoint alias provides an alternate software path from the application
383 to the underlying provider hardware. An alias EP differs from its par‐
384 ent endpoint only by its default data transfer flags. For example, an
385 alias EP may be configured to use a different completion mode. By de‐
386 fault, an alias EP inherits the same data transfer flags as the parent
387 endpoint. An application can use fi_control to modify the alias EP op‐
388 erational flags.
389
390 When allocating an alias, an application may configure either the
391 transmit or receive operational flags. This avoids needing a separate
392 call to fi_control to set those flags. The flags passed to fi_ep_alias
393 must include FI_TRANSMIT or FI_RECV (not both) with other operational
394 flags OR'ed in. This will override the transmit or receive flags, re‐
395 spectively, for operations posted through the alias endpoint. All al‐
396 located aliases must be closed for the underlying endpoint to be re‐
397 leased.
398
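For example, the following sketch creates an alias whose transmit
operations always generate completions, leaving the parent endpoint's
defaults untouched; names are illustrative.

    struct fid_ep *alias_ep;
    int ret;

    ret = fi_ep_alias(ep, &alias_ep, FI_TRANSMIT | FI_COMPLETION);
    if (ret)
        return ret;

    /* Sends posted through alias_ep use FI_COMPLETION; sends posted
     * through ep are unchanged.  The alias must be closed with fi_close
     * before the underlying endpoint can be released. */
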
399 fi_control
400 The control operation is used to adjust the default behavior of an end‐
401 point. It allows the underlying provider to redirect function calls to
402 implementations optimized to meet the desired application behavior. As
403 a result, calls to fi_ep_control must be serialized against all other
404 calls to an endpoint.
405
406 The base operation of an endpoint is selected during creation using
407 struct fi_info. The following control commands and arguments may be
408 assigned to an endpoint.
409
FI_GETOPSFLAG -- uint64_t *flags
411 Used to retrieve the current value of flags associated with the
412 data transfer operations initiated on the endpoint. The control
413 argument must include FI_TRANSMIT or FI_RECV (not both) flags to
414 indicate the type of data transfer flags to be returned. See
415 below for a list of control flags.
416
FI_SETOPSFLAG -- uint64_t *flags
418 Used to change the data transfer operation flags associated with
419 an endpoint. The control argument must include FI_TRANSMIT or
420 FI_RECV (not both) to indicate the type of data transfer that
421 the flags should apply to, with other flags OR'ed in. The given
422 flags will override the previous transmit and receive attributes
423 that were set when the endpoint was created. Valid control
424 flags are defined below.
425
FI_BACKLOG - int *value
427 This option only applies to passive endpoints. It is used to
428 set the connection request backlog for listening endpoints.
429
FI_GETWAIT (void **)
This command allows the user to retrieve the file descriptor
associated with a socket endpoint. The fi_control arg parameter
should be an address where a pointer to the returned file
descriptor will be written. See fi_eq.3 for additional details on
using fi_control with FI_GETWAIT. The file descriptor may be used
for notification that the endpoint is ready to send or receive
data.
438
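The fragment below sketches how FI_GETOPSFLAG and FI_SETOPSFLAG are
typically driven through fi_control; the chosen flag values are
illustrative.

    uint64_t flags;
    int ret;

    /* Read back the current transmit operation flags. */
    flags = FI_TRANSMIT;
    ret = fi_control(&ep->fid, FI_GETOPSFLAG, &flags);
    if (ret)
        return ret;

    /* Replace them so that every send generates a completion. */
    flags = FI_TRANSMIT | FI_COMPLETION;
    ret = fi_control(&ep->fid, FI_SETOPSFLAG, &flags);
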
439 fi_getopt / fi_setopt
Endpoint protocol operations may be retrieved using fi_getopt or set
using fi_setopt. Applications specify the level at which a desired
option exists, identify the option, and provide input/output buffers
to get or set the option. fi_setopt provides an application a way to
adjust low-level protocol and implementation specific details of an
endpoint. A short usage sketch follows the option list below.

The following option levels, option names, and parameters are
defined.
448
FI_OPT_ENDPOINT
450
451 FI_OPT_MIN_MULTI_RECV - size_t
452 Defines the minimum receive buffer space available when the re‐
453 ceive buffer is released by the provider (see FI_MULTI_RECV).
454 Modifying this value is only guaranteed to set the minimum buf‐
455 fer space needed on receives posted after the value has been
456 changed. It is recommended that applications that want to over‐
457 ride the default MIN_MULTI_RECV value set this option before en‐
458 abling the corresponding endpoint.
460
461 FI_OPT_CM_DATA_SIZE - size_t
462 Defines the size of available space in CM messages for user-de‐
463 fined data. This value limits the amount of data that applica‐
464 tions can exchange between peer endpoints using the fi_connect,
465 fi_accept, and fi_reject operations. The size returned is de‐
466 pendent upon the properties of the endpoint, except in the case
467 of passive endpoints, in which the size reflects the maximum
468 size of the data that may be present as part of a connection re‐
469 quest event. This option is read only.
471
472 FI_OPT_BUFFERED_LIMIT - size_t
473 Defines the maximum size of a buffered message that will be re‐
474 ported to users as part of a receive completion when the
475 FI_BUFFERED_RECV mode is enabled on an endpoint.
476
fi_getopt() will return the currently configured threshold, or the
provider's default threshold if one has not been set by the application.
479 fi_setopt() allows an application to configure the threshold. If the
480 provider cannot support the requested threshold, it will fail the
481 fi_setopt() call with FI_EMSGSIZE. Calling fi_setopt() with the
482 threshold set to SIZE_MAX will set the threshold to the maximum sup‐
483 ported by the provider. fi_getopt() can then be used to retrieve the
484 set size.
485
486 In most cases, the sending and receiving endpoints must be configured
487 to use the same threshold value, and the threshold must be set prior to
enabling the endpoint.
489
FI_OPT_BUFFERED_MIN - size_t
Defines the minimum size of a buffered message that will be
reported. Applications would set this to a size large enough to
decide whether to discard or claim a buffered receive upon getting
a buffered receive completion. The value is typically used by a
provider when sending a rendezvous protocol request, where it would
send at least FI_OPT_BUFFERED_MIN bytes of application data along
with it. A smaller rendezvous protocol message usually results in
better latency for the overall transfer of a large message.
500
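The fragment below is a small sketch of the fi_setopt / fi_getopt pattern
using two of the options above; the chosen sizes are illustrative and a
provider may reject or adjust them.

    size_t min_recv = 4096;
    size_t cm_size;
    size_t len = sizeof(cm_size);
    int ret;

    /* Raise the multi-receive threshold before enabling the endpoint. */
    ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                    &min_recv, sizeof(min_recv));
    if (ret)
        return ret;

    /* Query how much user data CM messages can carry (read only). */
    ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CM_DATA_SIZE,
                    &cm_size, &len);
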
501 fi_rx_size_left (DEPRECATED)
502 This function has been deprecated and will be removed in a future ver‐
503 sion of the library. It may not be supported by all providers.
504
505 The fi_rx_size_left call returns a lower bound on the number of receive
506 operations that may be posted to the given endpoint without that opera‐
507 tion returning -FI_EAGAIN. Depending on the specific details of the
508 subsequently posted receive operations (e.g., number of iov entries,
509 which receive function is called, etc.), it may be possible to post
510 more receive operations than originally indicated by fi_rx_size_left.
511
512 fi_tx_size_left (DEPRECATED)
513 This function has been deprecated and will be removed in a future ver‐
514 sion of the library. It may not be supported by all providers.
515
516 The fi_tx_size_left call returns a lower bound on the number of trans‐
517 mit operations that may be posted to the given endpoint without that
518 operation returning -FI_EAGAIN. Depending on the specific details of
519 the subsequently posted transmit operations (e.g., number of iov en‐
520 tries, which transmit function is called, etc.), it may be possible to
521 post more transmit operations than originally indicated by
522 fi_tx_size_left.
523
ENDPOINT ATTRIBUTES
The fi_ep_attr structure defines the set of attributes associated with
526 an endpoint. Endpoint attributes may be further refined using the
527 transmit and receive context attributes as shown below.
528
529 struct fi_ep_attr {
530 enum fi_ep_type type;
531 uint32_t protocol;
532 uint32_t protocol_version;
533 size_t max_msg_size;
534 size_t msg_prefix_size;
535 size_t max_order_raw_size;
536 size_t max_order_war_size;
537 size_t max_order_waw_size;
538 uint64_t mem_tag_format;
539 size_t tx_ctx_cnt;
540 size_t rx_ctx_cnt;
541 size_t auth_key_size;
542 uint8_t *auth_key;
543 };
544
type - Endpoint Type
If specified, indicates the type of fabric interface communication
desired. A short sketch of requesting a type through fi_getinfo hints
follows the list below. Supported types are:
548
549 FI_EP_UNSPEC
550 The type of endpoint is not specified. This is usually provided
551 as input, with other attributes of the endpoint or the provider
552 selecting the type.
553
554 FI_EP_MSG
555 Provides a reliable, connection-oriented data transfer service
556 with flow control that maintains message boundaries.
557
FI_EP_DGRAM
Supports connectionless, unreliable datagram communication.
560 Message boundaries are maintained, but the maximum message size
561 may be limited to the fabric MTU. Flow control is not guaran‐
562 teed.
563
564 FI_EP_RDM
565 Reliable datagram message. Provides a reliable, unconnected da‐
566 ta transfer service with flow control that maintains message
567 boundaries.
568
569 FI_EP_SOCK_STREAM
570 Data streaming endpoint with TCP socket-like semantics. Pro‐
571 vides a reliable, connection-oriented data transfer service that
572 does not maintain message boundaries. FI_EP_SOCK_STREAM is most
573 useful for applications designed around using TCP sockets. See
574 the SOCKET ENDPOINT section for additional details and restric‐
575 tions that apply to stream endpoints.
576
577 FI_EP_SOCK_DGRAM
578 A connectionless, unreliable datagram endpoint with UDP sock‐
579 et-like semantics. FI_EP_SOCK_DGRAM is most useful for applica‐
580 tions designed around using UDP sockets. See the SOCKET END‐
581 POINT section for additional details and restrictions that apply
582 to datagram socket endpoints.
583
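The fragment below is a small sketch of selecting an endpoint type through
fi_getinfo hints; the requested capabilities are illustrative.

    struct fi_info *hints, *info;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return -FI_ENOMEM;

    hints->ep_attr->type = FI_EP_RDM;      /* request reliable datagram EP */
    hints->caps = FI_MSG;                  /* basic two-sided messaging    */

    ret = fi_getinfo(FI_VERSION(1, 8), NULL, NULL, 0, hints, &info);
    fi_freeinfo(hints);
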
584 Protocol
585 Specifies the low-level end to end protocol employed by the provider.
586 A matching protocol must be used by communicating endpoints to ensure
587 interoperability. The following protocol values are defined. Provider
588 specific protocols are also allowed. Provider specific protocols will
589 be indicated by having the upper bit of the protocol value set to one.
590
591 FI_PROTO_UNSPEC
592 The protocol is not specified. This is usually provided as in‐
593 put, with other attributes of the socket or the provider select‐
594 ing the actual protocol.
595
FI_PROTO_RDMA_CM_IB_RC
The protocol runs over InfiniBand reliable-connected queue
pairs, using the RDMA CM protocol for connection establishment.
599
600 FI_PROTO_IWARP
601 The protocol runs over the Internet wide area RDMA protocol
602 transport.
603
FI_PROTO_IB_UD
The protocol runs over InfiniBand unreliable datagram queue
pairs.
607
608 FI_PROTO_PSMX
609 The protocol is based on an Intel proprietary protocol known as
610 PSM, performance scaled messaging. PSMX is an extended version
611 of the PSM protocol to support the libfabric interfaces.
612
613 FI_PROTO_UDP
614 The protocol sends and receives UDP datagrams. For example, an
615 endpoint using FI_PROTO_UDP will be able to communicate with a
616 remote peer that is using Berkeley SOCK_DGRAM sockets using IP‐
617 PROTO_UDP.
618
619 FI_PROTO_SOCK_TCP
620 The protocol is layered over TCP packets.
621
622 FI_PROTO_IWARP_RDM
623 Reliable-datagram protocol implemented over iWarp reliable-con‐
624 nected queue pairs.
625
626 FI_PROTO_IB_RDM
627 Reliable-datagram protocol implemented over InfiniBand reli‐
628 able-connected queue pairs.
629
630 FI_PROTO_GNI
631 Protocol runs over Cray GNI low-level interface.
632
633 FI_PROTO_RXM
634 Reliable-datagram protocol implemented over message endpoints.
635 RXM is a libfabric utility component that adds RDM endpoint se‐
636 mantics over MSG endpoint semantics.
637
638 FI_PROTO_RXD
639 Reliable-datagram protocol implemented over datagram endpoints.
640 RXD is a libfabric utility component that adds RDM endpoint se‐
641 mantics over DGRAM endpoint semantics.
642
FI_PROTO_NETWORKDIRECT
Protocol runs over Microsoft NetworkDirect service provider
interface. This adds reliable-datagram semantics over the
NetworkDirect connection-oriented endpoint semantics.
647
648 FI_PROTO_PSMX2
649 The protocol is based on an Intel proprietary protocol known as
650 PSM2, performance scaled messaging version 2. PSMX2 is an ex‐
651 tended version of the PSM2 protocol to support the libfabric in‐
652 terfaces.
653
protocol_version - Protocol Version
Identifies which version of the protocol is employed by the provider.
The protocol version allows providers to extend an existing protocol
in a backward compatible manner, for example by adding support for
additional features or functionality. Providers that support different
versions of the same protocol should interoperate, but only when using
the capabilities defined for the lesser version.
661
662 max_msg_size - Max Message Size
663 Defines the maximum size for an application data transfer as a single
664 operation.
665
msg_prefix_size - Message Prefix Size
Specifies the size of any required message prefix buffer space. This
field will be 0 unless the FI_MSG_PREFIX mode is enabled. If
msg_prefix_size is > 0, the specified value will be a multiple of 8 bytes.
670
671 Max RMA Ordered Size
672 The maximum ordered size specifies the delivery order of transport data
673 into target memory for RMA and atomic operations. Data ordering is
674 separate, but dependent on message ordering (defined below). Data or‐
675 dering is unspecified where message order is not defined.
676
677 Data ordering refers to the access of target memory by subsequent oper‐
678 ations. When back to back RMA read or write operations access the same
679 registered memory location, data ordering indicates whether the second
680 operation reads or writes the target memory after the first operation
681 has completed. Because RMA ordering applies between two operations,
682 and not within a single data transfer, ordering is defined per byte-ad‐
683 dressable memory location. I.e. ordering specifies whether location X
684 is accessed by the second operation after the first operation. Nothing
685 is implied about the completion of the first operation before the sec‐
686 ond operation is initiated.
687
688 In order to support large data transfers being broken into multiple
689 packets and sent using multiple paths through the fabric, data ordering
690 may be limited to transfers of a specific size or less. Providers
691 specify when data ordering is maintained through the following values.
692 Note that even if data ordering is not maintained, message ordering may
693 be.
694
695 max_order_raw_size
696 Read after write size. If set, an RMA or atomic read operation
697 issued after an RMA or atomic write operation, both of which are
698 smaller than the size, will be ordered. Where the target memory
699 locations overlap, the RMA or atomic read operation will see the
700 results of the previous RMA or atomic write.
701
702 max_order_war_size
703 Write after read size. If set, an RMA or atomic write operation
704 issued after an RMA or atomic read operation, both of which are
705 smaller than the size, will be ordered. The RMA or atomic read
706 operation will see the initial value of the target memory loca‐
707 tion before a subsequent RMA or atomic write updates the value.
708
709 max_order_waw_size
710 Write after write size. If set, an RMA or atomic write opera‐
711 tion issued after an RMA or atomic write operation, both of
712 which are smaller than the size, will be ordered. The target
713 memory location will reflect the results of the second RMA or
714 atomic write.
715
716 An order size value of 0 indicates that ordering is not guaranteed. A
717 value of -1 guarantees ordering for any data size.
718
719 mem_tag_format - Memory Tag Format
720 The memory tag format is a bit array used to convey the number of
721 tagged bits supported by a provider. Additionally, it may be used to
722 divide the bit array into separate fields. The mem_tag_format option‐
723 ally begins with a series of bits set to 0, to signify bits which are
724 ignored by the provider. Following the initial prefix of ignored bits,
725 the array will consist of alternating groups of bits set to all 1's or
726 all 0's. Each group of bits corresponds to a tagged field. The impli‐
727 cation of defining a tagged field is that when a mask is applied to the
728 tagged bit array, all bits belonging to a single field will either be
729 set to 1 or 0, collectively.
730
731 For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
732 bits, separated into 3 fields. The first field consists of 2-bits, the
733 second field 4-bits, and the final field 8-bits. Valid masks for such
734 a tagged field would be a bitwise OR'ing of zero or more of the follow‐
735 ing values: 0x3000, 0x0F00, and 0x00FF. The provider may not validate
736 the mask provided by the application for performance reasons.
737
738 By identifying fields within a tag, a provider may be able to optimize
739 their search routines. An application which requests tag fields must
740 provide tag masks that either set all mask bits corresponding to a
741 field to all 0 or all 1. When negotiating tag fields, an application
742 can request a specific number of fields of a given size. A provider
743 must return a tag format that supports the requested number of fields,
744 with each field being at least the size requested, or fail the request.
745 A provider may increase the size of the fields. When reporting comple‐
746 tions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the provider
747 would clear out any unsupported tag bits in the tag field of the com‐
748 pletion entry.
749
750 It is recommended that field sizes be ordered from smallest to largest.
751 A generic, unstructured tag and mask can be achieved by requesting a
752 bit array consisting of alternating 1's and 0's.
753
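The fragment below restates the 0x30FF example in code: it requests the
2/4/8-bit field layout through fi_getinfo hints and composes a tag and a
mask from the individual fields. The hints variable is assumed to come
from fi_allocinfo, and the field values are illustrative.

    uint64_t tag_format = 0x30FF;     /* 2-, 4-, and 8-bit tag fields     */
    uint64_t field0 = 0x3000;         /* mask covering field 0 (bits 13:12) */
    uint64_t field1 = 0x0F00;         /* mask covering field 1 (bits 11:8)  */
    uint64_t field2 = 0x00FF;         /* mask covering field 2 (bits 7:0)   */

    hints->ep_attr->mem_tag_format = tag_format;

    /* Compose a tag with field values 2, 5, and 0x7C (tag == 0x257C). */
    uint64_t tag = (0x2ULL << 12) | (0x5ULL << 8) | 0x7C;

    /* Ignore (wildcard) field 1 when matching; tagged receive calls treat
     * the ignore argument as bits of the tag to disregard. */
    uint64_t ignore = field1;
    (void) field0; (void) field2; (void) tag; (void) ignore;
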
754 tx_ctx_cnt - Transmit Context Count
755 Number of transmit contexts to associate with the endpoint. If not
756 specified (0), 1 context will be assigned if the endpoint supports out‐
757 bound transfers. Transmit contexts are independent transmit queues
758 that may be separately configured. Each transmit context may be bound
759 to a separate CQ, and no ordering is defined between contexts. Addi‐
760 tionally, no synchronization is needed when accessing contexts in par‐
761 allel.
762
763 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
764 be configured to use a shared transmit context, if supported by the
765 provider. Providers that do not support shared transmit contexts will
766 fail the request.
767
768 See the scalable endpoint and shared contexts sections for additional
769 details.
770
771 rx_ctx_cnt - Receive Context Count
772 Number of receive contexts to associate with the endpoint. If not
773 specified, 1 context will be assigned if the endpoint supports inbound
774 transfers. Receive contexts are independent processing queues that may
775 be separately configured. Each receive context may be bound to a sepa‐
776 rate CQ, and no ordering is defined between contexts. Additionally, no
777 synchronization is needed when accessing contexts in parallel.
778
779 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
780 be configured to use a shared receive context, if supported by the
781 provider. Providers that do not support shared receive contexts will
782 fail the request.
783
784 See the scalable endpoint and shared contexts sections for additional
785 details.
786
787 auth_key_size - Authorization Key Length
788 The length of the authorization key in bytes. This field will be 0 if
789 authorization keys are not available or used. This field is ignored
790 unless the fabric is opened with API version 1.5 or greater.
791
792 auth_key - Authorization Key
793 If supported by the fabric, an authorization key (a.k.a. job key) to
794 associate with the endpoint. An authorization key is used to limit
795 communication between endpoints. Only peer endpoints that are pro‐
796 grammed to use the same authorization key may communicate. Authoriza‐
797 tion keys are often used to implement job keys, to ensure that process‐
798 es running in different jobs do not accidentally cross traffic. The
799 domain authorization key will be used if auth_key_size is set to 0.
800 This field is ignored unless the fabric is opened with API version 1.5
801 or greater.
802
TRANSMIT CONTEXT ATTRIBUTES
Attributes specific to the transmit capabilities of an endpoint are
805 specified using struct fi_tx_attr.
806
807 struct fi_tx_attr {
808 uint64_t caps;
809 uint64_t mode;
810 uint64_t op_flags;
811 uint64_t msg_order;
812 uint64_t comp_order;
813 size_t inject_size;
814 size_t size;
815 size_t iov_limit;
816 size_t rma_iov_limit;
817 };
818
819 caps - Capabilities
820 The requested capabilities of the context. The capabilities must be a
821 subset of those requested of the associated endpoint. See the CAPABIL‐
822 ITIES section of fi_getinfo(3) for capability details. If the caps
823 field is 0 on input to fi_getinfo(3), the caps value from the fi_info
824 structure will be used.
825
826 mode
827 The operational mode bits of the context. The mode bits will be a sub‐
828 set of those associated with the endpoint. See the MODE section of
829 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
830 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
831 stead. On return from fi_getinfo(3), the mode will be set only to
832 those constraints specific to transmit operations.
833
834 op_flags - Default transmit operation flags
835 Flags that control the operation of operations submitted against the
836 context. Applicable flags are listed in the Operation Flags section.
837
838 msg_order - Message Ordering
839 Message ordering refers to the order in which transport layer headers
840 (as viewed by the application) are identified and processed. Relaxed
841 message order enables data transfers to be sent and received out of or‐
842 der, which may improve performance by utilizing multiple paths through
843 the fabric from the initiating endpoint to a target endpoint. Message
844 order applies only between a single source and destination endpoint
845 pair. Ordering between different target endpoints is not defined.
846
847 Message order is determined using a set of ordering bits. Each set bit
848 indicates that ordering is maintained between data transfers of the
849 specified type. Message order is defined for [read | write | send] op‐
850 erations submitted by an application after [read | write | send] opera‐
851 tions.
852
Message ordering only applies to the end-to-end transmission of
transport headers. Message ordering is necessary for, but does not
guarantee, the order in which message data is sent or received by the
transport layer. Message ordering requires matching ordering semantics
on the receiving side of a data transfer operation in order to
guarantee that ordering is met.
859
860 FI_ORDER_NONE
861 No ordering is specified. This value may be used as input in
862 order to obtain the default message order supported by the
863 provider. FI_ORDER_NONE is an alias for the value 0.
864
865 FI_ORDER_RAR
866 Read after read. If set, RMA and atomic read operations are
867 transmitted in the order submitted relative to other RMA and
868 atomic read operations. If not set, RMA and atomic reads may be
869 transmitted out of order from their submission.
870
871 FI_ORDER_RAW
872 Read after write. If set, RMA and atomic read operations are
873 transmitted in the order submitted relative to RMA and atomic
874 write operations. If not set, RMA and atomic reads may be
875 transmitted ahead of RMA and atomic writes.
876
877 FI_ORDER_RAS
878 Read after send. If set, RMA and atomic read operations are
879 transmitted in the order submitted relative to message send op‐
880 erations, including tagged sends. If not set, RMA and atomic
881 reads may be transmitted ahead of sends.
882
883 FI_ORDER_WAR
884 Write after read. If set, RMA and atomic write operations are
885 transmitted in the order submitted relative to RMA and atomic
886 read operations. If not set, RMA and atomic writes may be
887 transmitted ahead of RMA and atomic reads.
888
889 FI_ORDER_WAW
890 Write after write. If set, RMA and atomic write operations are
891 transmitted in the order submitted relative to other RMA and
892 atomic write operations. If not set, RMA and atomic writes may
893 be transmitted out of order from their submission.
894
895 FI_ORDER_WAS
896 Write after send. If set, RMA and atomic write operations are
897 transmitted in the order submitted relative to message send op‐
898 erations, including tagged sends. If not set, RMA and atomic
899 writes may be transmitted ahead of sends.
900
FI_ORDER_SAR
Send after read. If set, message send operations, including
tagged sends, are transmitted in the order submitted relative to
RMA and atomic read operations. If not set, message sends may be
transmitted ahead of RMA and atomic reads.

FI_ORDER_SAW
Send after write. If set, message send operations, including
tagged sends, are transmitted in the order submitted relative to
RMA and atomic write operations. If not set, message sends may be
transmitted ahead of RMA and atomic writes.

FI_ORDER_SAS
Send after send. If set, message send operations, including
tagged sends, are transmitted in the order submitted relative to
other message send operations. If not set, message sends may be
transmitted out of order from their submission.
918
919 FI_ORDER_RMA_RAR
920 RMA read after read. If set, RMA read operations are transmit‐
921 ted in the order submitted relative to other RMA read opera‐
922 tions. If not set, RMA reads may be transmitted out of order
923 from their submission.
924
925 FI_ORDER_RMA_RAW
926 RMA read after write. If set, RMA read operations are transmit‐
927 ted in the order submitted relative to RMA write operations. If
928 not set, RMA reads may be transmitted ahead of RMA writes.
929
930 FI_ORDER_RMA_WAR
931 RMA write after read. If set, RMA write operations are trans‐
932 mitted in the order submitted relative to RMA read operations.
933 If not set, RMA writes may be transmitted ahead of RMA reads.
934
935 FI_ORDER_RMA_WAW
936 RMA write after write. If set, RMA write operations are trans‐
937 mitted in the order submitted relative to other RMA write opera‐
938 tions. If not set, RMA writes may be transmitted out of order
939 from their submission.
940
941 FI_ORDER_ATOMIC_RAR
942 Atomic read after read. If set, atomic fetch operations are
943 transmitted in the order submitted relative to other atomic
944 fetch operations. If not set, atomic fetches may be transmitted
945 out of order from their submission.
946
947 FI_ORDER_ATOMIC_RAW
948 Atomic read after write. If set, atomic fetch operations are
949 transmitted in the order submitted relative to atomic update op‐
950 erations. If not set, atomic fetches may be transmitted ahead
951 of atomic updates.
952
FI_ORDER_ATOMIC_WAR
Atomic write after read. If set, atomic update operations are
transmitted in the order submitted relative to atomic fetch
operations. If not set, atomic updates may be transmitted ahead
of atomic fetches.

FI_ORDER_ATOMIC_WAW
Atomic write after write. If set, atomic update operations are
transmitted in the order submitted relative to other atomic
update operations. If not set, atomic updates may be transmitted
out of order from their submission.
964
965 comp_order - Completion Ordering
966 Completion ordering refers to the order in which completed requests are
967 written into the completion queue. Completion ordering is similar to
968 message order. Relaxed completion order may enable faster reporting of
969 completed transfers, allow acknowledgments to be sent over different
970 fabric paths, and support more sophisticated retry mechanisms. This
971 can result in lower-latency completions, particularly when using uncon‐
972 nected endpoints. Strict completion ordering may require that
973 providers queue completed operations or limit available optimizations.
974
975 For transmit requests, completion ordering depends on the endpoint com‐
976 munication type. For unreliable communication, completion ordering ap‐
977 plies to all data transfer requests submitted to an endpoint. For re‐
978 liable communication, completion ordering only applies to requests that
979 target a single destination endpoint. Completion ordering of requests
980 that target different endpoints over a reliable transport is not de‐
981 fined.
982
983 Applications should specify the completion ordering that they support
984 or require. Providers should return the completion order that they ac‐
985 tually provide, with the constraint that the returned ordering is
986 stricter than that specified by the application. Supported completion
987 order values are:
988
989 FI_ORDER_NONE
990 No ordering is defined for completed operations. Requests sub‐
991 mitted to the transmit context may complete in any order.
992
993 FI_ORDER_STRICT
994 Requests complete in the order in which they are submitted to
995 the transmit context.
996
997 inject_size
998 The requested inject operation size (see the FI_INJECT flag) that the
999 context will support. This is the maximum size data transfer that can
1000 be associated with an inject operation (such as fi_inject) or may be
1001 used with the FI_INJECT data transfer flag.
1002
1003 size
1004 The size of the context. The size is specified as the minimum number
1005 of transmit operations that may be posted to the endpoint without the
1006 operation returning -FI_EAGAIN.
1007
1008 iov_limit
1009 This is the maximum number of IO vectors (scatter-gather elements) that
1010 a single posted operation may reference.
1011
1012 rma_iov_limit
1013 This is the maximum number of RMA IO vectors (scatter-gather elements)
1014 that an RMA or atomic operation may reference. The rma_iov_limit cor‐
1015 responds to the rma_iov_count values in RMA and atomic operations. See
1016 struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
1017 for additional details. This limit applies to both the number of RMA
1018 IO vectors that may be specified when initiating an operation from the
1019 local endpoint, as well as the maximum number of IO vectors that may be
1020 carried in a single request from a remote endpoint.
1021
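As a small sketch, the fragment below fills in a few transmit attributes as
fi_getinfo hints; the specific values are illustrative, and hints is
assumed to come from fi_allocinfo.

    hints->tx_attr->msg_order = FI_ORDER_SAS;      /* ordered sends        */
    hints->tx_attr->comp_order = FI_ORDER_NONE;    /* any completion order */
    hints->tx_attr->op_flags = FI_COMPLETION;      /* always report sends  */
    hints->tx_attr->iov_limit = 4;                 /* need 4 SGEs per send */

    /* After fi_getinfo() returns, info->tx_attr reports the attributes the
     * provider actually supports, e.g. the usable inject_size. */
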
RECEIVE CONTEXT ATTRIBUTES
Attributes specific to the receive capabilities of an endpoint are
1024 specified using struct fi_rx_attr.
1025
1026 struct fi_rx_attr {
1027 uint64_t caps;
1028 uint64_t mode;
1029 uint64_t op_flags;
1030 uint64_t msg_order;
1031 uint64_t comp_order;
1032 size_t total_buffered_recv;
1033 size_t size;
1034 size_t iov_limit;
1035 };
1036
caps - Capabilities
The requested capabilities of the context. The capabilities must be a
subset of those requested of the associated endpoint. See the
CAPABILITIES section of fi_getinfo(3) for capability details. If the
caps field is 0 on input to fi_getinfo(3), the caps value from the
fi_info structure will be used.
1043
1044 mode
1045 The operational mode bits of the context. The mode bits will be a sub‐
1046 set of those associated with the endpoint. See the MODE section of
1047 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
1048 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
1049 stead. On return from fi_getinfo(3), the mode will be set only to
1050 those constraints specific to receive operations.
1051
1052 op_flags - Default receive operation flags
1053 Flags that control the operation of operations submitted against the
1054 context. Applicable flags are listed in the Operation Flags section.
1055
1056 msg_order - Message Ordering
1057 For a description of message ordering, see the msg_order field in the
1058 Transmit Context Attribute section. Receive context message ordering
1059 defines the order in which received transport message headers are pro‐
1060 cessed when received by an endpoint. When ordering is set, it indi‐
1061 cates that message headers will be processed in order, based on how the
1062 transmit side has identified the messages. Typically, this means that
1063 messages will be handled in order based on a message level sequence
1064 number.
1065
1066 The following ordering flags, as defined for transmit ordering, also
1067 apply to the processing of received operations: FI_ORDER_NONE, FI_OR‐
1068 DER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_OR‐
1069 DER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR,
1070 FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOM‐
1071 IC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOM‐
1072 IC_WAW.
1073
1074 comp_order - Completion Ordering
1075 For a description of completion ordering, see the comp_order field in
1076 the Transmit Context Attribute section.
1077
1078 FI_ORDER_NONE
1079 No ordering is defined for completed operations. Receive opera‐
1080 tions may complete in any order, regardless of their submission
1081 order.
1082
1083 FI_ORDER_STRICT
1084 Receive operations complete in the order in which they are pro‐
1085 cessed by the receive context, based on the receive side msg_or‐
1086 der attribute.
1087
1088 FI_ORDER_DATA
1089 When set, this bit indicates that received data is written into
1090 memory in order. Data ordering applies to memory accessed as
1091 part of a single operation and between operations if message or‐
1092 dering is guaranteed.
1093
total_buffered_recv
This field is supported for backwards compatibility purposes. It is a
hint to the provider of the total available space that may be needed
to buffer messages that are received for which there is no matching
receive operation. The provider may adjust or ignore this value. The
allocation of internal network buffering among received messages is
provider specific. For instance, a provider may limit the size of
messages which can be buffered or the amount of buffering allocated
to a single message.

If receive side buffering is disabled (total_buffered_recv = 0) and a
message is received by an endpoint, then the behavior is dependent on
whether resource management has been enabled (FI_RM_ENABLED has been
set or not). See the Resource Management section of fi_domain.3 for
further clarification. It is recommended that applications enable
resource management if they anticipate receiving unexpected messages,
rather than modifying this value.
1111
1112 size
1113 The size of the context. The size is specified as the minimum number
1114 of receive operations that may be posted to the endpoint without the
1115 operation returning -FI_EAGAIN.
1116
iov_limit
This is the maximum number of IO vectors (scatter-gather elements) that
a single posted operation may reference.
1120
SCALABLE ENDPOINTS
A scalable endpoint is a communication portal that supports multiple
1123 transmit and receive contexts. Scalable endpoints are loosely modeled
1124 after the networking concept of transmit/receive side scaling, also
1125 known as multi-queue. Support for scalable endpoints is domain specif‐
1126 ic. Scalable endpoints may improve the performance of multi-threaded
1127 and parallel applications, by allowing threads to access independent
1128 transmit and receive queues. A scalable endpoint has a single trans‐
1129 port level address, which can reduce the memory requirements needed to
1130 store remote addressing data, versus using standard endpoints. Scal‐
1131 able endpoints cannot be used directly for communication operations,
1132 and require the application to explicitly create transmit and receive
1133 contexts as described below.
1134
1135 fi_tx_context
1136 Transmit contexts are independent transmit queues. Ordering and syn‐
1137 chronization between contexts are not defined. Conceptually a transmit
1138 context behaves similar to a send-only endpoint. A transmit context
1139 may be configured with fewer capabilities than the base endpoint and
1140 with different attributes (such as ordering requirements and inject
1141 size) than other contexts associated with the same scalable endpoint.
1142 Each transmit context has its own completion queue. The number of
1143 transmit contexts associated with an endpoint is specified during end‐
1144 point creation.
1145
1146 The fi_tx_context call is used to retrieve a specific context, identi‐
1147 fied by an index (see above for details on transmit context at‐
1148 tributes). Providers may dynamically allocate contexts when fi_tx_con‐
1149 text is called, or may statically create all contexts when fi_endpoint
1150 is invoked. By default, a transmit context inherits the properties of
1151 its associated endpoint. However, applications may request context
1152 specific attributes through the attr parameter. Support for per trans‐
1153 mit context attributes is provider specific and not guaranteed.
1154 Providers will return the actual attributes assigned to the context
1155 through the attr parameter, if provided.
1156
1157 fi_rx_context
1158 Receive contexts are independent receive queues for receiving incoming
1159 data. Ordering and synchronization between contexts are not guaran‐
1160 teed. Conceptually a receive context behaves similar to a receive-only
1161 endpoint. A receive context may be configured with fewer capabilities
1162 than the base endpoint and with different attributes (such as ordering
1163 requirements and inject size) than other contexts associated with the
1164 same scalable endpoint. Each receive context has its own completion
1165 queue. The number of receive contexts associated with an endpoint is
1166 specified during endpoint creation.
1167
Receive contexts are often associated with steering flows that specify
which incoming packets targeting a scalable endpoint to process.
However, receive contexts may be targeted directly by the initiator, if
supported by the underlying protocol. Such contexts are referred to as
'named'. Support for named contexts must be indicated by setting the
FI_NAMED_RX_CTX capability in caps when the corresponding endpoint is
created. Support for named receive contexts is coordinated with address
vectors. See fi_av(3) and fi_rx_addr(3).
1176
1177 The fi_rx_context call is used to retrieve a specific context, identi‐
1178 fied by an index (see above for details on receive context attributes).
1179 Providers may dynamically allocate contexts when fi_rx_context is
1180 called, or may statically create all contexts when fi_endpoint is in‐
1181 voked. By default, a receive context inherits the properties of its
1182 associated endpoint. However, applications may request context specif‐
1183 ic attributes through the attr parameter. Support for per receive con‐
1184 text attributes is provider specific and not guaranteed. Providers
1185 will return the actual attributes assigned to the context through the
1186 attr parameter, if provided.
1187
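The sketch below creates a scalable endpoint with one transmit and one
receive context per worker. NWORKERS, the per-context CQ arrays, and the
ordering of the enable calls are illustrative, and the exact sequence a
provider accepts may vary.

    struct fid_ep *sep, *tx[NWORKERS], *rx[NWORKERS];
    int i, ret;

    info->ep_attr->tx_ctx_cnt = NWORKERS;
    info->ep_attr->rx_ctx_cnt = NWORKERS;

    ret = fi_scalable_ep(domain, info, &sep, NULL);
    if (ret)
        return ret;

    for (i = 0; i < NWORKERS; i++) {
        ret = fi_tx_context(sep, i, NULL, &tx[i], NULL);
        if (ret)
            return ret;
        ret = fi_rx_context(sep, i, NULL, &rx[i], NULL);
        if (ret)
            return ret;
    }

    /* The AV bound to the scalable endpoint is implicitly bound to every
     * transmit and receive context created from it. */
    ret = fi_scalable_ep_bind(sep, &av->fid, 0);
    if (ret)
        return ret;

    /* Each context gets its own CQ and is enabled individually. */
    for (i = 0; i < NWORKERS; i++) {
        ret = fi_ep_bind(tx[i], &tx_cq[i]->fid, FI_TRANSMIT);
        if (ret)
            return ret;
        ret = fi_ep_bind(rx[i], &rx_cq[i]->fid, FI_RECV);
        if (ret)
            return ret;
    }

    ret = fi_enable(sep);
    if (ret)
        return ret;
    for (i = 0; i < NWORKERS; i++) {
        ret = fi_enable(tx[i]);
        if (ret)
            return ret;
        ret = fi_enable(rx[i]);
        if (ret)
            return ret;
    }
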
SHARED CONTEXTS
Shared contexts are transmit and receive contexts explicitly shared
1190 among one or more endpoints. A shareable context allows an application
1191 to use a single dedicated provider resource among multiple transport
1192 addressable endpoints. This can greatly reduce the resources needed to
1193 manage communication over multiple endpoints by multiplexing transmit
1194 and/or receive processing, with the potential cost of serializing ac‐
1195 cess across multiple endpoints. Support for shareable contexts is do‐
1196 main specific.
1197
1198 Conceptually, shareable transmit contexts are transmit queues that may
1199 be accessed by many endpoints. The use of a shared transmit context is
1200 mostly opaque to an application. Applications must allocate and bind
1201 shared transmit contexts to endpoints, but operations are posted di‐
1202 rectly to the endpoint. Shared transmit contexts are not associated
1203 with completion queues or counters. Completed operations are posted to
1204 the CQs bound to the endpoint. An endpoint may only be associated with
1205 a single shared transmit context.
1206
1207 Unlike shared transmit contexts, applications interact directly with
1208 shared receive contexts. Users post receive buffers directly to a
1209 shared receive context, with the buffers usable by any endpoint bound
1210 to the shared receive context. Shared receive contexts are not associ‐
1211 ated with completion queues or counters. Completed receive operations
are posted to the CQs bound to the endpoint.  An endpoint may only be
associated with a single shared receive context, and all
connectionless endpoints associated with a shared receive context
must also share the same address vector.
1216
Endpoints associated with a shared transmit context may use dedicated
receive contexts, and vice-versa; alternatively, an endpoint may use
both shared transmit and shared receive contexts.  There is no
requirement that the same group of endpoints sharing a context of one
type also share a context of the other type.  Furthermore, an endpoint
may use a shared context of one type, but a scalable set of contexts
of the other type.
1223
1224 fi_stx_context
1225 This call is used to open a shareable transmit context (see above for
1226 details on the transmit context attributes). Endpoints associated with
1227 a shared transmit context must use a subset of the transmit context's
1228 attributes. Note that this is the reverse of the requirement for
1229 transmit contexts for scalable endpoints.
1230
1231 fi_srx_context
1232 This allocates a shareable receive context (see above for details on
1233 the receive context attributes). Endpoints associated with a shared
1234 receive context must use a subset of the receive context's attributes.
1235 Note that this is the reverse of the requirement for receive contexts
1236 for scalable endpoints.
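
As a sketch, the following allocates a shared transmit context and a
shared receive context, then binds both to an ordinary endpoint.  It
assumes the endpoint's fi_info was obtained with ep_attr->tx_ctx_cnt
and rx_ctx_cnt set to FI_SHARED_CONTEXT; domain, info, cq, buf, and
len are placeholders, NULL attributes request defaults (a provider may
require explicit attributes), and error handling is omitted.

    struct fid_stx *stx;
    struct fid_ep *srx, *ep;

    fi_stx_context(domain, NULL, &stx, NULL);
    fi_srx_context(domain, NULL, &srx, NULL);

    fi_endpoint(domain, info, &ep, NULL);
    fi_ep_bind(ep, &stx->fid, 0);     /* share the transmit queue */
    fi_ep_bind(ep, &srx->fid, 0);     /* share the receive queue  */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    fi_enable(ep);

    /* Receive buffers are posted to the shared receive context. */
    fi_recv(srx, buf, len, NULL, FI_ADDR_UNSPEC, NULL);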

SOCKET ENDPOINTS
1239 The following feature and description should be considered experimen‐
1240 tal. Until the experimental tag is removed, the interfaces, semantics,
1241 and data structures associated with socket endpoints may change between
1242 library versions.
1243
1244 This section applies to endpoints of type FI_EP_SOCK_STREAM and
1245 FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1246
1247 Socket endpoints are defined with semantics that allow them to more
1248 easily be adopted by developers familiar with the UNIX socket API, or
1249 by middleware that exposes the socket API, while still taking advantage
1250 of high-performance hardware features.
1251
The key difference between socket endpoints and other active endpoints
is that socket endpoints use synchronous data transfers.  Buffers
passed into send and receive operations revert to the control of the
application upon returning from the function call.  As a result, no
data transfer completions are reported to the application, and socket
endpoints are not associated with completion queues or counters.
1258
Socket endpoints support a subset of message operations: fi_send,
fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
Because data transfers are synchronous, the return value from send and
receive operations indicates the number of bytes transferred on
success, or a negative value on error, including -FI_EAGAIN if the
endpoint cannot send or receive any data because of full or empty
queues, respectively.
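
A minimal sketch of a blocking-style send loop on a stream socket
endpoint is shown below; ep, dest_addr, buf, and len are placeholders,
and a real application would typically wait on the endpoint's file
descriptor instead of spinning on -FI_EAGAIN.

    ssize_t rc;
    size_t sent = 0;

    while (sent < len) {
        rc = fi_send(ep, (const char *) buf + sent, len - sent,
                     NULL, dest_addr, NULL);
        if (rc == -FI_EAGAIN)
            continue;               /* transmit queue full: retry */
        if (rc < 0)
            return (int) rc;        /* fatal error */
        sent += (size_t) rc;        /* bytes accepted so far */
    }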
1266
Socket endpoints are associated with event queues and address vectors,
and they process connection management events asynchronously, similar
to other endpoints.  Unlike UNIX sockets, socket endpoints must still
be declared as either active or passive.
1271
1272 Socket endpoints behave like non-blocking sockets. In order to support
1273 select and poll semantics, active socket endpoints are associated with
1274 a file descriptor that is signaled whenever the endpoint is ready to
1275 send and/or receive data. The file descriptor may be retrieved using
1276 fi_control.
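
The text above does not name the fi_control command used to retrieve
the descriptor; the sketch below assumes FI_GETWAIT serves that
purpose, which should be confirmed against the provider in use.  ep is
a placeholder for an enabled active socket endpoint.

    #include <poll.h>

    int fd, ret;
    struct pollfd pfd;

    ret = fi_control(&ep->fid, FI_GETWAIT, &fd);
    if (ret)
        return ret;

    pfd.fd = fd;
    pfd.events = POLLIN | POLLOUT;
    ret = poll(&pfd, 1, -1);    /* wait until ready to send/receive */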

OPERATION FLAGS
Operation flags are obtained by OR-ing the following flags together.
Operation flags define the default flags applied to an endpoint's data
transfer operations when a flags parameter is not available.  Data
transfer operations that take flags as input override the op_flags
value of the transmit or receive context attributes of an endpoint.
1284
1285 FI_INJECT
1286 Indicates that all outbound data buffers should be returned to
1287 the user's control immediately after a data transfer call re‐
1288 turns, even if the operation is handled asynchronously. This
1289 may require that the provider copy the data into a local buffer
1290 and transfer out of that buffer. A provider can limit the total
1291 amount of send data that may be buffered and/or the size of a
1292 single send that can use this flag. This limit is indicated us‐
1293 ing inject_size (see inject_size above).
1294
FI_MULTI_RECV
       Applies to posted receive operations.  This flag allows the user
       to post a single buffer that will receive multiple incoming
       messages.  Received messages will be packed into the receive
       buffer until the buffer has been consumed.  Use of this flag may
       cause a single posted receive operation to generate multiple
       completions as messages are placed into the buffer.  The
       placement of received data into the buffer may be subject to
       provider-specific alignment restrictions.  The buffer will be
       released by the provider when the available buffer space falls
       below the specified minimum (see FI_OPT_MIN_MULTI_RECV).  A
       posting example is shown after this list of flags.
1306
FI_COMPLETION
       Indicates that a completion queue entry should be written for
       data transfer operations.  This flag only applies to operations
       issued on an endpoint that was bound to a completion queue with
       the FI_SELECTIVE_COMPLETION flag set; otherwise, it is ignored.
       See the fi_ep_bind section above for more detail.
1313
1314 FI_INJECT_COMPLETE
1315 Indicates that a completion should be generated when the source
1316 buffer(s) may be reused. See fi_cq(3) for additional details on
1317 completion semantics.
1318
1319 FI_TRANSMIT_COMPLETE
1320 Indicates that a completion should be generated when the trans‐
1321 mit operation has completed relative to the local provider. See
1322 fi_cq(3) for additional details on completion semantics.
1323
1324 FI_DELIVERY_COMPLETE
1325 Indicates that a completion should be generated when the opera‐
1326 tion has been processed by the destination endpoint(s). See
1327 fi_cq(3) for additional details on completion semantics.
1328
FI_COMMIT_COMPLETE
       Indicates that a completion should not be generated (locally or
       at the peer) until the result of the operation has been made
       persistent.  See fi_cq(3) for additional details on completion
       semantics.
1334
1335 FI_MULTICAST
1336 Indicates that data transfers will target multicast addresses by
1337 default. Any fi_addr_t passed into a data transfer operation
1338 will be treated as a multicast address.
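
The following sketch posts a multi-receive buffer as referenced in the
FI_MULTI_RECV description above.  rx_ep, big_buf, and BIG_LEN are
placeholder names, and struct iovec requires <sys/uio.h>.

    size_t min_free = 1024;   /* release buffer when < 1 KiB remains */
    struct iovec iov = { .iov_base = big_buf, .iov_len = BIG_LEN };
    struct fi_msg msg = {
        .msg_iov   = &iov,
        .iov_count = 1,
        .addr      = FI_ADDR_UNSPEC,
    };
    int ret;

    ret = fi_setopt(&rx_ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                    &min_free, sizeof min_free);
    if (!ret)
        ret = fi_recvmsg(rx_ep, &msg, FI_MULTI_RECV);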

NOTES
1341 Users should call fi_close to release all resources allocated to the
1342 fabric endpoint.
1343
Endpoints allocated with the FI_CONTEXT or FI_CONTEXT2 mode bits set
must typically provide a struct fi_context or struct fi_context2,
respectively, as their per operation context parameter.  (See
fi_getinfo(3) for details.)  However, when FI_SELECTIVE_COMPLETION is
enabled to suppress CQ completion entries, and an operation is
initiated without the FI_COMPLETION flag set, the context parameter is
ignored.  An application does not need to pass a valid context
structure into such data transfers.
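
For example, when FI_CONTEXT is required, each outstanding operation
supplies provider-owned scratch space similar to the sketch below; the
structure must remain valid until the operation completes, so a stack
variable is used here only for illustration.

    struct fi_context ctx;

    fi_send(ep, buf, len, NULL, dest_addr, &ctx);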
1351
Operations that complete in error and that are not associated with a
valid operational context will use the endpoint context in any error
reporting structures.
1355
1356 Although applications typically associate individual completions with
1357 either completion queues or counters, an endpoint can be attached to
1358 both a counter and completion queue. When combined with using selec‐
1359 tive completions, this allows an application to use counters to track
1360 successful completions, with a CQ used to report errors. Operations
1361 that complete with an error increment the error counter and generate a
1362 CQ completion event.
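
A sketch of this counter-plus-CQ arrangement follows; ep, cq, cntr,
buf, len, and dest_addr are placeholder names and error checking is
omitted.

    /* Success entries are suppressed for operations posted without
     * FI_COMPLETION; the counter still tracks them. */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
    fi_ep_bind(ep, &cntr->fid, FI_SEND);
    fi_enable(ep);

    /* On success only the counter is incremented; on failure the
     * counter's error count is incremented and a CQ error entry is
     * written. */
    fi_send(ep, buf, len, NULL, dest_addr, NULL);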
1363
As mentioned in fi_getinfo(3), the ep_attr structure can be used to
query for providers that support various endpoint attributes.
fi_getinfo can return provider info structures that meet only the
application's minimal requirements (such that the application
maintains correctness).  However, it can also return provider info
structures that exceed those requirements.  As an example, consider
an application requesting
1370 msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may
1371 have all the ordering bits set. The application can reset the ordering
1372 bits it does not require before creating the endpoint. The provider is
1373 free to implement a stricter ordering than is required by the applica‐
1374 tion.
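
A sketch of this pattern, assuming hints was allocated with
fi_allocinfo and that domain creation is handled elsewhere:

    hints->ep_attr->type = FI_EP_RDM;
    hints->tx_attr->msg_order = FI_ORDER_NONE;

    ret = fi_getinfo(FI_VERSION(1, 8), NULL, NULL, 0, hints, &info);
    if (!ret) {
        /* Keep only the ordering the application actually needs,
         * even if the provider reported stricter ordering bits. */
        info->tx_attr->msg_order = FI_ORDER_NONE;
        info->rx_attr->msg_order = FI_ORDER_NONE;
        ret = fi_endpoint(domain, info, &ep, NULL);
    }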

RETURN VALUES
Returns 0 on success.  On error, a negative value corresponding to a
fabric errno is returned.  For fi_cancel, a return value of 0
indicates that the cancel request was submitted for processing.
1380
1381 Fabric errno values are defined in rdma/fi_errno.h.

ERRORS
1384 -FI_EDOMAIN
1385 A resource domain was not bound to the endpoint or an attempt
1386 was made to bind multiple domains.
1387
-FI_ENOCQ
       The endpoint has not been configured with a necessary
       completion queue.
1390
1391 -FI_EOPBADSTATE
1392 The endpoint's state does not permit the requested operation.

SEE ALSO
fi_getinfo(3), fi_domain(3), fi_cq(3), fi_msg(3), fi_tagged(3),
fi_rma(3)

AUTHORS
1399 OpenFabrics.
1400
1401
1402
1403Libfabric Programmer's Manual 2019-05-14 fi_endpoint(3)