fi_endpoint(3) Libfabric v1.7.0 fi_endpoint(3)
2
3
4
NAME
6 fi_endpoint - Fabric endpoint operations
7
8 fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
9 Allocate or close an endpoint.
10
11 fi_ep_bind
12 Associate an endpoint with hardware resources, such as event
13 queues, completion queues, counters, address vectors, or shared
14 transmit/receive contexts.
15
16 fi_scalable_ep_bind
17 Associate a scalable endpoint with an address vector
18
19 fi_pep_bind
20 Associate a passive endpoint with an event queue
21
22 fi_enable
23 Transitions an active endpoint into an enabled state.
24
25 fi_cancel
26 Cancel a pending asynchronous data transfer
27
28 fi_ep_alias
29 Create an alias to the endpoint
30
31 fi_control
32 Control endpoint operation.
33
34 fi_getopt / fi_setopt
35 Get or set endpoint options.
36
37 fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
38 Open a transmit or receive context.
39
40 fi_rx_size_left / fi_tx_size_left (DEPRECATED)
Query the lower bound on how many RX/TX operations may be posted
without an operation returning -FI_EAGAIN. These functions have
been deprecated and will be removed in a future version of the
library.
45
SYNOPSIS
47 #include <rdma/fabric.h>
48
49 #include <rdma/fi_endpoint.h>
50
51 int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
52 struct fid_ep **ep, void *context);
53
54 int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
55 struct fid_ep **sep, void *context);
56
int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
    struct fid_pep **pep, void *context);
59
60 int fi_tx_context(struct fid_ep *sep, int index,
61 struct fi_tx_attr *attr, struct fid_ep **tx_ep,
62 void *context);
63
64 int fi_rx_context(struct fid_ep *sep, int index,
65 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
66 void *context);
67
68 int fi_stx_context(struct fid_domain *domain,
69 struct fi_tx_attr *attr, struct fid_stx **stx,
70 void *context);
71
72 int fi_srx_context(struct fid_domain *domain,
73 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
74 void *context);
75
76 int fi_close(struct fid *ep);
77
78 int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);
79
80 int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);
81
82 int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);
83
84 int fi_enable(struct fid_ep *ep);
85
86 int fi_cancel(struct fid_ep *ep, void *context);
87
88 int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);
89
90 int fi_control(struct fid *ep, int command, void *arg);
91
92 int fi_getopt(struct fid *ep, int level, int optname,
93 void *optval, size_t *optlen);
94
95 int fi_setopt(struct fid *ep, int level, int optname,
96 const void *optval, size_t optlen);
97
98 DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);
99
100 DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
101
ARGUMENTS
103 fid On creation, specifies a fabric or access domain. On bind,
104 identifies the event queue, completion queue, counter, or ad‐
105 dress vector to bind to the endpoint. In other cases, it's a
106 fabric identifier of an associated resource.
107
108 info Details about the fabric interface endpoint to be opened, ob‐
109 tained from fi_getinfo.
110
111 ep A fabric endpoint.
112
113 sep A scalable fabric endpoint.
114
115 pep A passive fabric endpoint.
116
117 context
118 Context associated with the endpoint or asynchronous operation.
119
120 index Index to retrieve a specific transmit/receive context.
121
122 attr Transmit or receive context attributes.
123
124 flags Additional flags to apply to the operation.
125
126 command
127 Command of control operation to perform on endpoint.
128
129 arg Optional control argument.
130
131 level Protocol level at which the desired option resides.
132
133 optname
134 The protocol option to read or set.
135
136 optval The option value that was read or to set.
137
138 optlen The size of the optval buffer.
139
DESCRIPTION
141 Endpoints are transport level communication portals. There are two
142 types of endpoints: active and passive. Passive endpoints belong to a
143 fabric domain and are most often used to listen for incoming connection
144 requests. However, a passive endpoint may be used to reserve a fabric
145 address that can be granted to an active endpoint. Active endpoints
146 belong to access domains and can perform data transfers.
147
148 Active endpoints may be connection-oriented or connectionless, and may
149 provide data reliability. The data transfer interfaces -- messages
150 (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
151 (fi_atomic) -- are associated with active endpoints. In basic configu‐
152 rations, an active endpoint has transmit and receive queues. In gener‐
153 al, operations that generate traffic on the fabric are posted to the
154 transmit queue. This includes all RMA and atomic operations, along
155 with sent messages and sent tagged messages. Operations that post buf‐
156 fers for receiving incoming data are submitted to the receive queue.
157
158 Active endpoints are created in the disabled state. They must transi‐
159 tion into an enabled state before accepting data transfer operations,
160 including posting of receive buffers. The fi_enable call is used to
transition an active endpoint into an enabled state. The fi_connect
and fi_accept calls will also transition an endpoint into the enabled
state, if it is not already enabled.
164
165 In order to transition an endpoint into an enabled state, it must be
166 bound to one or more fabric resources. An endpoint that will generate
167 asynchronous completions, either through data transfer operations or
168 communication establishment events, must be bound to the appropriate
169 completion queues or event queues, respectively, before being enabled.
170 Additionally, endpoints that use manual progress must be associated
171 with relevant completion queues or event queues in order to drive
172 progress. For endpoints that are only used as the target of RMA or
173 atomic operations, this means binding the endpoint to a completion
174 queue associated with receive processing. Unconnected endpoints must
175 be bound to an address vector.
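The binding requirements above can be read as a checklist that an application runs before calling fi_enable. The following is a minimal sketch of that checklist, not libfabric code: struct demo_ep_plan and demo_ready_to_enable are hypothetical names, counters are not modeled, and the rule that a connection-oriented endpoint needs an event queue is taken from the statement about communication establishment events.

```c
#include <stdbool.h>

/* Hypothetical summary of an application's plans and bindings for one
 * active endpoint. None of these names come from libfabric. */
struct demo_ep_plan {
    bool connectionless;   /* unconnected endpoint, needs an address vector */
    bool will_transmit;    /* will initiate sends, RMA, or atomics */
    bool will_receive;     /* will post receive buffers */
    bool tx_cq_bound;      /* a CQ was bound for transmit completions */
    bool rx_cq_bound;      /* a CQ was bound for receive completions */
    bool eq_bound;         /* an event queue was bound */
    bool av_bound;         /* an address vector was bound */
};

/* Returns true when the bindings described in the text are in place. */
bool demo_ready_to_enable(const struct demo_ep_plan *p)
{
    if (p->will_transmit && !p->tx_cq_bound)
        return false;         /* transmit completions have nowhere to go */
    if (p->will_receive && !p->rx_cq_bound)
        return false;         /* receive completions have nowhere to go */
    if (p->connectionless)
        return p->av_bound;   /* unconnected endpoints need an AV */
    return p->eq_bound;       /* connection events are reported on an EQ */
}
```

A real application would not track these flags separately; it would simply issue the fi_ep_bind calls in this order of concern and then check the return value of fi_enable.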
176
177 Once an endpoint has been activated, it may be associated with an ad‐
178 dress vector. Receive buffers may be posted to it and calls may be
179 made to connection establishment routines. Connectionless endpoints
180 may also perform data transfers.
181
182 The behavior of an endpoint may be adjusted by setting its control data
183 and protocol options. This allows the underlying provider to redirect
184 function calls to implementations optimized to meet the desired appli‐
185 cation behavior.
186
187 If an endpoint experiences a critical error, it will transition back
188 into a disabled state. Critical errors are reported through the event
189 queue associated with the EP. In certain cases, a disabled endpoint
190 may be re-enabled. The ability to transition back into an enabled
191 state is provider specific and depends on the type of error that the
192 endpoint experienced. When an endpoint is disabled as a result of a
193 critical error, all pending operations are discarded.
194
195 fi_endpoint / fi_passive_ep / fi_scalable_ep
196 fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
197 new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
198 The properties and behavior of the endpoint are defined based on the
199 provided struct fi_info. See fi_getinfo for additional details on
200 fi_info. fi_info flags that control the operation of an endpoint are
201 defined below. See section SCALABLE ENDPOINTS.
202
203 If an active endpoint is allocated in order to accept a connection re‐
204 quest, the fi_info parameter must be the same as the fi_info structure
205 provided with the connection request (FI_CONNREQ) event.
206
207 An active endpoint may acquire the properties of a passive endpoint by
208 setting the fi_info handle field to the passive endpoint fabric de‐
209 scriptor. This is useful for applications that need to reserve the
210 fabric address of an endpoint prior to knowing if the endpoint will be
211 used on the active or passive side of a connection. For example, this
212 feature is useful for simulating socket semantics. Once an active end‐
213 point acquires the properties of a passive endpoint, the passive end‐
214 point is no longer bound to any fabric resources and must no longer be
215 used. The user is expected to close the passive endpoint after opening
216 the active endpoint in order to free up any lingering resources that
217 had been used.
218
219 fi_close
Closes an endpoint and releases all resources associated with it.
221
When closing a scalable endpoint, there must be no open transmit or
receive contexts associated with the scalable endpoint. If resources
are still associated with the scalable endpoint when attempting to
close it, the call will return -FI_EBUSY.
226
227 Outstanding operations posted to the endpoint when fi_close is called
228 will be discarded. Discarded operations will silently be dropped, with
229 no completions reported. Additionally, a provider may discard previ‐
230 ously completed operations from the associated completion queue(s).
231 The behavior to discard completed operations is provider specific.
232
233 fi_ep_bind
234 fi_ep_bind is used to associate an endpoint with hardware resources.
235 The common use of fi_ep_bind is to direct asynchronous operations asso‐
236 ciated with an endpoint to a completion queue. An endpoint must be
237 bound with CQs capable of reporting completions for any asynchronous
238 operation initiated on the endpoint. This is true even for endpoints
239 which are configured to suppress successful completions, in order that
240 operations that complete in error may be reported to the user. For
241 passive endpoints, this requires binding the endpoint with an EQ that
242 supports the communication management (CM) domain.
243
244 An active endpoint may direct asynchronous completions to different
245 CQs, based on the type of operation. This is specified using
246 fi_ep_bind flags. The following flags may be used separately or OR'ed
247 together when binding an endpoint to a completion domain CQ.
248
249 FI_TRANSMIT
250 Directs the completion of outbound data transfer requests to the
251 specified completion queue. This includes send message, RMA,
252 and atomic operations.
253
254 FI_RECV
255 Directs the notification of inbound data transfers to the speci‐
256 fied completion queue. This includes received messages. This
257 binding automatically includes FI_REMOTE_WRITE, if applicable to
258 the endpoint.
259
260 FI_SELECTIVE_COMPLETION
By default, data transfer operations generate completion entries
into a completion queue after they have successfully completed.
Applications can use this bind flag to select which operations
generate completions. If FI_SELECTIVE_COMPLETION is specified,
data transfer operations will not generate entries for successful
completions unless FI_COMPLETION is set as an operational flag for
the given operation. Operations that fail asynchronously will still
generate completions, even if a completion is not requested.
FI_SELECTIVE_COMPLETION must be OR'ed with FI_TRANSMIT and/or
FI_RECV flags.
271
When FI_SELECTIVE_COMPLETION is set, the user must determine
indirectly when a request that does NOT have FI_COMPLETION set has
completed, usually based on the completion of a subsequent operation.
Use of this flag may improve performance by allowing the provider to
avoid writing a completion entry for every operation.
277
278 Example: An application can selectively generate send completions by
279 using the following general approach:
280
281 fi_tx_attr::op_flags = 0; // default - no completion
282 fi_ep_bind(ep, cq, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
283 fi_send(ep, ...); // no completion
284 fi_sendv(ep, ...); // no completion
285 fi_sendmsg(ep, ..., FI_COMPLETION); // completion!
286 fi_inject(ep, ...); // no completion
287
288 Example: An application can selectively disable send completions by
289 modifying the operational flags:
290
291 fi_tx_attr::op_flags = FI_COMPLETION; // default - completion
292 fi_ep_bind(ep, cq, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
293 fi_send(ep, ...); // completion
294 fi_sendv(ep, ...); // completion
295 fi_sendmsg(ep, ..., 0); // no completion!
296 fi_inject(ep, ...); // no completion!
297
298 Example: Omitting FI_SELECTIVE_COMPLETION when binding will generate
299 completions for all non-fi_inject calls:
300
301 fi_tx_attr::op_flags = 0;
302 fi_ep_bind(ep, cq, FI_TRANSMIT); // default - completion
303 fi_send(ep, ...); // completion
304 fi_sendv(ep, ...); // completion
305 fi_sendmsg(ep, ..., 0); // completion!
306 fi_sendmsg(ep, ..., FI_COMPLETION); // completion
307 fi_sendmsg(ep, ..., FI_INJECT|FI_COMPLETION); // completion!
308 fi_inject(ep, ...); // no completion!
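The three examples above follow one decision rule, which can be written out as a small predicate. This is an illustrative model only: the DEMO_* bit values are placeholders (real code takes the FI_* definitions from <rdma/fabric.h>), demo_writes_cq_entry is a hypothetical name, and treating a failed fi_inject like any other failed operation is an assumption that the text does not confirm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; they are not the
 * libfabric definitions of FI_COMPLETION / FI_SELECTIVE_COMPLETION. */
#define DEMO_COMPLETION           (UINT64_C(1) << 0)
#define DEMO_SELECTIVE_COMPLETION (UINT64_C(1) << 1)

/*
 * bind_flags:  flags given when the CQ was bound to the endpoint
 * op_flags:    flags in effect for this operation (the endpoint default
 *              for fi_send/fi_sendv, or the flags passed to fi_sendmsg)
 * inject_call: true for a call to fi_inject()
 * failed:      true if the operation failed asynchronously
 */
bool demo_writes_cq_entry(uint64_t bind_flags, uint64_t op_flags,
                          bool inject_call, bool failed)
{
    if (failed)
        return true;    /* failures are reported even if not requested */
    if (inject_call)
        return false;   /* fi_inject reports no successful completion */
    if (!(bind_flags & DEMO_SELECTIVE_COMPLETION))
        return true;    /* default: every successful operation completes */
    return (op_flags & DEMO_COMPLETION) != 0;
}
```

For instance, with DEMO_SELECTIVE_COMPLETION in bind_flags and op_flags of 0, the predicate is false for a successful send, matching the first example; without the bind flag it is true for every call except fi_inject, matching the third.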
309
310 An endpoint may also be bound to a fabric counter. When binding an
311 endpoint to a counter, the following flags may be specified.
312
313 FI_SEND
314 Increments the specified counter whenever a message transfer
315 initiated over the endpoint has completed successfully or in er‐
316 ror. Sent messages include both tagged and normal message oper‐
317 ations.
318
319 FI_RECV
320 Increments the specified counter whenever a message is received
321 over the endpoint. Received messages include both tagged and
322 normal message operations.
323
324 FI_READ
325 Increments the specified counter whenever an RMA read, atomic
326 fetch, or atomic compare operation initiated from the endpoint
327 has completed successfully or in error.
328
329 FI_WRITE
330 Increments the specified counter whenever an RMA write or base
331 atomic operation initiated from the endpoint has completed suc‐
332 cessfully or in error.
333
334 FI_REMOTE_READ
335 Increments the specified counter whenever an RMA read, atomic
336 fetch, or atomic compare operation is initiated from a remote
337 endpoint that targets the given endpoint. Use of this flag re‐
338 quires that the endpoint be created using FI_RMA_EVENT.
339
340 FI_REMOTE_WRITE
341 Increments the specified counter whenever an RMA write or base
342 atomic operation is initiated from a remote endpoint that tar‐
343 gets the given endpoint. Use of this flag requires that the
344 endpoint be created using FI_RMA_EVENT.
345
An endpoint may only be bound to a single CQ or counter for a given
type of operation. For example, an EP may not bind to two counters
both using FI_WRITE. Furthermore, providers may limit CQ and counter
bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
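The one-binding-per-operation-type rule amounts to rejecting a bind whose flags overlap flags that are already bound. The sketch below models that bookkeeping for counters. The DEMO_* values, struct demo_ep, and the -1 error return are all hypothetical; the error code an actual provider returns is not stated here.

```c
#include <stdint.h>

/* Placeholder bits standing in for FI_SEND, FI_RECV, FI_READ, FI_WRITE. */
#define DEMO_SEND  (UINT64_C(1) << 0)
#define DEMO_RECV  (UINT64_C(1) << 1)
#define DEMO_READ  (UINT64_C(1) << 2)
#define DEMO_WRITE (UINT64_C(1) << 3)

/* Hypothetical endpoint state: which operation types already have a counter. */
struct demo_ep {
    uint64_t cntr_bound;
};

/* Returns 0 on success, or -1 if flags is empty or any requested
 * operation type is already bound to a counter. */
int demo_bind_cntr(struct demo_ep *ep, uint64_t flags)
{
    if (flags == 0 || (ep->cntr_bound & flags) != 0)
        return -1;
    ep->cntr_bound |= flags;
    return 0;
}
```

A rejected bind leaves the recorded state unchanged, so a later bind that uses only the non-conflicting flags can still succeed.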
350
351 Connectionless endpoints must be bound to a single address vector.
352
353 If an endpoint is using a shared transmit and/or receive context, the
354 shared contexts must be bound to the endpoint. CQs, counters, AV, and
355 shared contexts must be bound to endpoints before they are enabled.
356
357 fi_scalable_ep_bind
358 fi_scalable_ep_bind is used to associate a scalable endpoint with an
359 address vector. See section on SCALABLE ENDPOINTS. A scalable end‐
360 point has a single transport level address and can support multiple
361 transmit and receive contexts. The transmit and receive contexts share
362 the transport-level address. Address vectors that are bound to scal‐
363 able endpoints are implicitly bound to any transmit or receive contexts
364 created using the scalable endpoint.
365
366 fi_enable
367 This call transitions the endpoint into an enabled state. An endpoint
368 must be enabled before it may be used to perform data transfers. En‐
369 abling an endpoint typically results in hardware resources being as‐
370 signed to it. Endpoints making use of completion queues, counters,
371 event queues, and/or address vectors must be bound to them before being
372 enabled.
373
374 Calling connect or accept on an endpoint will implicitly enable an end‐
375 point if it has not already been enabled.
376
fi_enable may also be used to re-enable an endpoint that has been
disabled as a result of experiencing a critical error. Applications
should check the return value from fi_enable to see if a disabled
endpoint has successfully been re-enabled.
381
382 fi_cancel
383 fi_cancel attempts to cancel an outstanding asynchronous operation.
384 Canceling an operation causes the fabric provider to search for the op‐
385 eration and, if it is still pending, complete it as having been can‐
386 celed. An error queue entry will be available in the associated error
387 queue with error code FI_ECANCELED. On the other hand, if the opera‐
388 tion completed before the call to fi_cancel, then the completion status
389 of that operation will be available in the associated completion queue.
390 No specific entry related to fi_cancel itself will be posted.
391
392 Cancel uses the context parameter associated with an operation to iden‐
393 tify the request to cancel. Operations posted without a valid context
394 parameter -- either no context parameter is specified or the context
395 value was ignored by the provider -- cannot be canceled. If multiple
396 outstanding operations match the context parameter, only one will be
397 canceled. In this case, the operation which is canceled is provider
398 specific. The cancel operation is asynchronous, but will complete
399 within a bounded period of time.
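The context-matching behavior can be illustrated with a toy model of a pending-operation list. This is not how any provider is implemented: struct demo_op and demo_cancel are hypothetical, the model is synchronous, and picking the first match is just one possible provider-specific choice when several operations share a context.

```c
#include <stddef.h>

enum demo_state { DEMO_PENDING, DEMO_DONE, DEMO_CANCELED };

/* Hypothetical record of one posted operation. */
struct demo_op {
    void *context;          /* context supplied when the operation was posted */
    enum demo_state state;
};

/* Cancels at most one pending operation whose context matches.
 * Returns the index of the canceled operation, or -1 if none matched. */
int demo_cancel(struct demo_op *ops, size_t n, void *context)
{
    if (context == NULL)
        return -1;          /* operations without a context cannot be found */
    for (size_t i = 0; i < n; i++) {
        if (ops[i].state == DEMO_PENDING && ops[i].context == context) {
            ops[i].state = DEMO_CANCELED;
            return (int)i;
        }
    }
    return -1;              /* already completed, or never posted */
}
```

An operation that has already completed is left alone, which mirrors the statement that its normal completion remains available in the completion queue.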
400
401 fi_ep_alias
402 This call creates an alias to the specified endpoint. Conceptually, an
403 endpoint alias provides an alternate software path from the application
404 to the underlying provider hardware. An alias EP differs from its par‐
405 ent endpoint only by its default data transfer flags. For example, an
406 alias EP may be configured to use a different completion mode. By de‐
407 fault, an alias EP inherits the same data transfer flags as the parent
408 endpoint. An application can use fi_control to modify the alias EP op‐
409 erational flags.
410
411 When allocating an alias, an application may configure either the
412 transmit or receive operational flags. This avoids needing a separate
413 call to fi_control to set those flags. The flags passed to fi_ep_alias
414 must include FI_TRANSMIT or FI_RECV (not both) with other operational
415 flags OR'ed in. This will override the transmit or receive flags, re‐
416 spectively, for operations posted through the alias endpoint. All al‐
417 located aliases must be closed for the underlying endpoint to be re‐
418 leased.
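The "FI_TRANSMIT or FI_RECV (not both)" requirement is an exactly-one-of check on the direction bits, with any other operational flags ignored by the check. A minimal sketch follows; the DEMO_* bit values are placeholders rather than the libfabric definitions, and demo_valid_direction is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits standing in for FI_TRANSMIT and FI_RECV. */
#define DEMO_TRANSMIT (UINT64_C(1) << 0)
#define DEMO_RECV     (UINT64_C(1) << 1)

/* True when flags selects exactly one transfer direction. */
bool demo_valid_direction(uint64_t flags)
{
    uint64_t dir = flags & (DEMO_TRANSMIT | DEMO_RECV);

    return dir == DEMO_TRANSMIT || dir == DEMO_RECV;
}
```

An application could apply the same check to its own flag values before calling fi_ep_alias, so that an invalid combination is caught before it reaches the provider.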
419
420 fi_control
The control operation is used to adjust the default behavior of an
endpoint. It allows the underlying provider to redirect function calls
to implementations optimized to meet the desired application behavior.
As a result, calls to fi_control must be serialized against all other
calls to an endpoint.
426
427 The base operation of an endpoint is selected during creation using
428 struct fi_info. The following control commands and arguments may be
429 assigned to an endpoint.
430
FI_GETOPSFLAG -- uint64_t *flags
432 Used to retrieve the current value of flags associated with the
433 data transfer operations initiated on the endpoint. The control
434 argument must include FI_TRANSMIT or FI_RECV (not both) flags to
435 indicate the type of data transfer flags to be returned. See
436 below for a list of control flags.
437
FI_SETOPSFLAG -- uint64_t *flags
439 Used to change the data transfer operation flags associated with
440 an endpoint. The control argument must include FI_TRANSMIT or
441 FI_RECV (not both) to indicate the type of data transfer that
442 the flags should apply to, with other flags OR'ed in. The given
443 flags will override the previous transmit and receive attributes
444 that were set when the endpoint was created. Valid control
445 flags are defined below.
446
FI_BACKLOG - int *value
448 This option only applies to passive endpoints. It is used to
449 set the connection request backlog for listening endpoints.
450
451 FI_GETWAIT (void **)
This command allows the user to retrieve the file descriptor
associated with a socket endpoint. The fi_control arg parameter
should be an address where a pointer to the returned file
descriptor will be written. See fi_eq.3 for additional details on
using fi_control with FI_GETWAIT. The file descriptor may be used
for notification that the endpoint is ready to send or receive
data.
459
460 fi_getopt / fi_setopt
Endpoint protocol options may be retrieved using fi_getopt or set
using fi_setopt. Applications specify the level at which a desired
option exists, identify the option, and provide input/output buffers to
get or set the option. fi_setopt provides an application a way to
adjust low-level protocol and implementation specific details of an
endpoint.
466
467 The following option levels and option names and parameters are de‐
468 fined.
469
FI_OPT_ENDPOINT
471
472 FI_OPT_MIN_MULTI_RECV - size_t
473 Defines the minimum receive buffer space available when the re‐
474 ceive buffer is released by the provider (see FI_MULTI_RECV).
475 Modifying this value is only guaranteed to set the minimum buf‐
476 fer space needed on receives posted after the value has been
477 changed. It is recommended that applications that want to over‐
478 ride the default MIN_MULTI_RECV value set this option before en‐
479 abling the corresponding endpoint.
481
482 FI_OPT_CM_DATA_SIZE - size_t
483 Defines the size of available space in CM messages for user-de‐
484 fined data. This value limits the amount of data that applica‐
485 tions can exchange between peer endpoints using the fi_connect,
486 fi_accept, and fi_reject operations. The size returned is de‐
487 pendent upon the properties of the endpoint, except in the case
488 of passive endpoints, in which the size reflects the maximum
489 size of the data that may be present as part of a connection re‐
490 quest event. This option is read only.
492
493 FI_OPT_BUFFERED_LIMIT - size_t
494 Defines the maximum size of a buffered message that will be re‐
495 ported to users as part of a receive completion when the
496 FI_BUFFERED_RECV mode is enabled on an endpoint.
497
fi_getopt() will return the currently configured threshold, or the
provider's default threshold if one has not been set by the application.
fi_setopt() allows an application to configure the threshold. If the
provider cannot support the requested threshold, it will fail the
fi_setopt() call with FI_EMSGSIZE. Calling fi_setopt() with the
threshold set to SIZE_MAX will set the threshold to the maximum
supported by the provider. fi_getopt() can then be used to retrieve the
set size.
506
In most cases, the sending and receiving endpoints must be configured
to use the same threshold value, and the threshold must be set prior to
enabling the endpoint.
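The threshold negotiation described for FI_OPT_BUFFERED_LIMIT can be pictured with the following model of the provider side. It is a sketch under stated assumptions: demo_set_buffered_limit and provider_max are hypothetical, "cannot support" is modeled simply as exceeding a single maximum, and -1 stands in for the FI_EMSGSIZE failure.

```c
#include <stddef.h>
#include <stdint.h>

/* provider_max is a hypothetical provider limit. Returns 0 on success
 * and -1 where the text says fi_setopt() fails with FI_EMSGSIZE. */
int demo_set_buffered_limit(size_t provider_max, size_t requested,
                            size_t *limit)
{
    if (requested == SIZE_MAX) {
        *limit = provider_max;   /* request for the provider's maximum */
        return 0;
    }
    if (requested > provider_max)
        return -1;               /* unsupported threshold, *limit unchanged */
    *limit = requested;
    return 0;
}
```

After a SIZE_MAX request, the value left in *limit plays the role of what fi_getopt() would report as the configured size.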
510
511 FI_OPT_BUFFERED_MIN - size_t
Defines the minimum size of a buffered message that will be
reported. Applications would set this to a size that is large enough
to decide whether to discard or claim a buffered receive, or when
to claim it, upon getting a buffered receive completion. The value
is typically used by a provider when sending a rendezvous protocol
request, where it would send at least FI_OPT_BUFFERED_MIN bytes of
application data along with the request. A smaller rendezvous
protocol message usually results in better latency for the overall
transfer of a large message.
521
522 fi_rx_size_left (DEPRECATED)
523 This function has been deprecated and will be removed in a future ver‐
524 sion of the library. It may not be supported by all providers.
525
526 The fi_rx_size_left call returns a lower bound on the number of receive
527 operations that may be posted to the given endpoint without that opera‐
528 tion returning -FI_EAGAIN. Depending on the specific details of the
529 subsequently posted receive operations (e.g., number of iov entries,
530 which receive function is called, etc.), it may be possible to post
531 more receive operations than originally indicated by fi_rx_size_left.
532
533 fi_tx_size_left (DEPRECATED)
534 This function has been deprecated and will be removed in a future ver‐
535 sion of the library. It may not be supported by all providers.
536
537 The fi_tx_size_left call returns a lower bound on the number of trans‐
538 mit operations that may be posted to the given endpoint without that
539 operation returning -FI_EAGAIN. Depending on the specific details of
540 the subsequently posted transmit operations (e.g., number of iov en‐
541 tries, which transmit function is called, etc.), it may be possible to
542 post more transmit operations than originally indicated by
543 fi_tx_size_left.
544
ENDPOINT ATTRIBUTES
546 The fi_ep_attr structure defines the set of attributes associated with
547 an endpoint. Endpoint attributes may be further refined using the
548 transmit and receive context attributes as shown below.
549
550 struct fi_ep_attr {
551 enum fi_ep_type type;
552 uint32_t protocol;
553 uint32_t protocol_version;
554 size_t max_msg_size;
555 size_t msg_prefix_size;
556 size_t max_order_raw_size;
557 size_t max_order_war_size;
558 size_t max_order_waw_size;
559 uint64_t mem_tag_format;
560 size_t tx_ctx_cnt;
561 size_t rx_ctx_cnt;
562 size_t auth_key_size;
563 uint8_t *auth_key;
564 };
565
566 type - Endpoint Type
567 If specified, indicates the type of fabric interface communication de‐
568 sired. Supported types are:
569
570 FI_EP_UNSPEC
571 The type of endpoint is not specified. This is usually provided
572 as input, with other attributes of the endpoint or the provider
573 selecting the type.
574
575 FI_EP_MSG
576 Provides a reliable, connection-oriented data transfer service
577 with flow control that maintains message boundaries.
578
579 FI_EP_DGRAM
580 Supports a connectionless, unreliable datagram communication.
581 Message boundaries are maintained, but the maximum message size
582 may be limited to the fabric MTU. Flow control is not guaran‐
583 teed.
584
585 FI_EP_RDM
586 Reliable datagram message. Provides a reliable, unconnected da‐
587 ta transfer service with flow control that maintains message
588 boundaries.
589
590 FI_EP_SOCK_STREAM
591 Data streaming endpoint with TCP socket-like semantics. Pro‐
592 vides a reliable, connection-oriented data transfer service that
593 does not maintain message boundaries. FI_EP_SOCK_STREAM is most
594 useful for applications designed around using TCP sockets. See
595 the SOCKET ENDPOINT section for additional details and restric‐
596 tions that apply to stream endpoints.
597
598 FI_EP_SOCK_DGRAM
599 A connectionless, unreliable datagram endpoint with UDP sock‐
600 et-like semantics. FI_EP_SOCK_DGRAM is most useful for applica‐
601 tions designed around using UDP sockets. See the SOCKET END‐
602 POINT section for additional details and restrictions that apply
603 to datagram socket endpoints.
604
605 Protocol
606 Specifies the low-level end to end protocol employed by the provider.
607 A matching protocol must be used by communicating endpoints to ensure
608 interoperability. The following protocol values are defined. Provider
609 specific protocols are also allowed. Provider specific protocols will
610 be indicated by having the upper bit of the protocol value set to one.
611
612 FI_PROTO_UNSPEC
613 The protocol is not specified. This is usually provided as in‐
614 put, with other attributes of the socket or the provider select‐
615 ing the actual protocol.
616
617 FI_PROTO_RDMA_CM_IB_RC
The protocol runs over InfiniBand reliable-connected queue
pairs, using the RDMA CM protocol for connection establishment.
620
621 FI_PROTO_IWARP
622 The protocol runs over the Internet wide area RDMA protocol
623 transport.
624
625 FI_PROTO_IB_UD
The protocol runs over InfiniBand unreliable datagram queue
pairs.
628
629 FI_PROTO_PSMX
630 The protocol is based on an Intel proprietary protocol known as
631 PSM, performance scaled messaging. PSMX is an extended version
632 of the PSM protocol to support the libfabric interfaces.
633
634 FI_PROTO_UDP
635 The protocol sends and receives UDP datagrams. For example, an
636 endpoint using FI_PROTO_UDP will be able to communicate with a
637 remote peer that is using Berkeley SOCK_DGRAM sockets using IP‐
638 PROTO_UDP.
639
640 FI_PROTO_SOCK_TCP
641 The protocol is layered over TCP packets.
642
643 FI_PROTO_IWARP_RDM
644 Reliable-datagram protocol implemented over iWarp reliable-con‐
645 nected queue pairs.
646
647 FI_PROTO_IB_RDM
648 Reliable-datagram protocol implemented over InfiniBand reli‐
649 able-connected queue pairs.
650
651 FI_PROTO_GNI
652 Protocol runs over Cray GNI low-level interface.
653
654 FI_PROTO_RXM
655 Reliable-datagram protocol implemented over message endpoints.
656 RXM is a libfabric utility component that adds RDM endpoint se‐
657 mantics over MSG endpoint semantics.
658
659 FI_PROTO_RXD
660 Reliable-datagram protocol implemented over datagram endpoints.
661 RXD is a libfabric utility component that adds RDM endpoint se‐
662 mantics over DGRAM endpoint semantics.
663
664 FI_PROTO_NETWORKDIRECT
Protocol runs over Microsoft NetworkDirect service provider
interface. This adds reliable-datagram semantics over the
NetworkDirect connection-oriented endpoint semantics.
668
669 FI_PROTO_PSMX2
670 The protocol is based on an Intel proprietary protocol known as
671 PSM2, performance scaled messaging version 2. PSMX2 is an ex‐
672 tended version of the PSM2 protocol to support the libfabric in‐
673 terfaces.
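Since provider specific protocols are marked by the upper bit of the protocol value, and the protocol field of struct fi_ep_attr is a uint32_t, an application can tell the two kinds apart with a single mask. The helper name below is hypothetical; only the bit position comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the upper (most significant) bit of the 32-bit protocol
 * value is set, which marks a provider specific protocol. */
bool demo_is_provider_protocol(uint32_t protocol)
{
    return (protocol & (UINT32_C(1) << 31)) != 0;
}
```

A caller would typically pass the protocol value from the ep_attr of an fi_info structure returned by fi_getinfo.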
674
675 protocol_version - Protocol Version
Identifies which version of the protocol is employed by the provider.
The protocol version allows providers to extend an existing protocol in
a backward compatible manner, for example by adding support for
additional features or functionality. Providers that support different
versions of the same protocol should interoperate, but only when using
the capabilities defined for the lesser version.
682
683 max_msg_size - Max Message Size
684 Defines the maximum size for an application data transfer as a single
685 operation.
686
687 msg_prefix_size - Message Prefix Size
Specifies the size of any required message prefix buffer space. This
field will be 0 unless the FI_MSG_PREFIX mode is enabled. If
msg_prefix_size is greater than 0, the specified value will be a
multiple of 8 bytes.
691
692 Max RMA Ordered Size
693 The maximum ordered size specifies the delivery order of transport data
694 into target memory for RMA and atomic operations. Data ordering is
695 separate, but dependent on message ordering (defined below). Data or‐
696 dering is unspecified where message order is not defined.
697
698 Data ordering refers to the access of target memory by subsequent oper‐
699 ations. When back to back RMA read or write operations access the same
700 registered memory location, data ordering indicates whether the second
701 operation reads or writes the target memory after the first operation
702 has completed. Because RMA ordering applies between two operations,
703 and not within a single data transfer, ordering is defined per byte-ad‐
704 dressable memory location. I.e. ordering specifies whether location X
705 is accessed by the second operation after the first operation. Nothing
706 is implied about the completion of the first operation before the sec‐
707 ond operation is initiated.
708
709 In order to support large data transfers being broken into multiple
710 packets and sent using multiple paths through the fabric, data ordering
711 may be limited to transfers of a specific size or less. Providers
712 specify when data ordering is maintained through the following values.
713 Note that even if data ordering is not maintained, message ordering may
714 be.
715
716 max_order_raw_size
717 Read after write size. If set, an RMA or atomic read operation
718 issued after an RMA or atomic write operation, both of which are
719 smaller than the size, will be ordered. Where the target memory
720 locations overlap, the RMA or atomic read operation will see the
721 results of the previous RMA or atomic write.
722
723 max_order_war_size
724 Write after read size. If set, an RMA or atomic write operation
725 issued after an RMA or atomic read operation, both of which are
726 smaller than the size, will be ordered. The RMA or atomic read
727 operation will see the initial value of the target memory loca‐
728 tion before a subsequent RMA or atomic write updates the value.
729
730 max_order_waw_size
731 Write after write size. If set, an RMA or atomic write opera‐
732 tion issued after an RMA or atomic write operation, both of
733 which are smaller than the size, will be ordered. The target
734 memory location will reflect the results of the second RMA or
735 atomic write.
736
737 An order size value of 0 indicates that ordering is not guaranteed. A
738 value of -1 guarantees ordering for any data size.
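
The size rules above can be sketched as a small predicate. This is a minimal illustration rather than provider code: struct rma_access and the function names are hypothetical, and the sketch assumes that an operation whose size equals the limit is still ordered (the text speaks of "a specific size or less").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one RMA or atomic access to target memory. */
struct rma_access {
    uint64_t addr;  /* target address of the first byte accessed */
    size_t len;     /* number of bytes accessed */
};

/* True if the two accesses share at least one byte-addressable location. */
bool rma_overlap(struct rma_access a, struct rma_access b)
{
    if (a.len == 0 || b.len == 0)
        return false;
    if (a.addr <= b.addr)
        return b.addr - a.addr < a.len;
    return a.addr - b.addr < b.len;
}

/*
 * True if data ordering is guaranteed between two operations of the given
 * sizes, where max_order_size is the matching max_order_*_size value.
 * Assumption: sizes equal to the limit are still ordered.
 */
bool rma_data_ordered(size_t max_order_size, size_t first_len,
                      size_t second_len)
{
    if (max_order_size == 0)            /* ordering is not guaranteed */
        return false;
    if (max_order_size == (size_t)-1)   /* ordered for any data size */
        return true;
    return first_len <= max_order_size && second_len <= max_order_size;
}

/* Read after write: is the read guaranteed to see the earlier write? */
bool raw_read_sees_write(size_t max_order_raw_size, struct rma_access wr,
                         struct rma_access rd)
{
    return rma_overlap(wr, rd) &&
           rma_data_ordered(max_order_raw_size, wr.len, rd.len);
}

/* Small self-check of the three cases described above. */
void rma_order_example(void)
{
    struct rma_access w = { 0x1000, 64 };
    struct rma_access r = { 0x1020, 64 };   /* overlaps the write */

    assert(raw_read_sees_write(256, w, r));          /* both within limit */
    assert(!raw_read_sees_write(0, w, r));           /* no guarantee */
    assert(raw_read_sees_write((size_t)-1, w, r));   /* any size */
}
```

In a real application the limit would come from the max_order_raw_size, max_order_war_size, or max_order_waw_size field of the endpoint attributes returned by fi_getinfo(3).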
739
740 mem_tag_format - Memory Tag Format
741 The memory tag format is a bit array used to convey the number of
742 tagged bits supported by a provider. Additionally, it may be used to
743 divide the bit array into separate fields. The mem_tag_format option‐
744 ally begins with a series of bits set to 0, to signify bits which are
745 ignored by the provider. Following the initial prefix of ignored bits,
746 the array will consist of alternating groups of bits set to all 1's or
747 all 0's. Each group of bits corresponds to a tagged field. The impli‐
748 cation of defining a tagged field is that when a mask is applied to the
749 tagged bit array, all bits belonging to a single field will either be
750 set to 1 or 0, collectively.
751
752 For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
753 bits, separated into 3 fields. The first field consists of 2-bits, the
754 second field 4-bits, and the final field 8-bits. Valid masks for such
755 a tagged field would be a bitwise OR'ing of zero or more of the follow‐
756 ing values: 0x3000, 0x0F00, and 0x00FF. The provider may not validate
757 the mask provided by the application for performance reasons.
758
759 By identifying fields within a tag, a provider may be able to optimize
760 its search routines. An application that requests tag fields must
761 provide tag masks that set the mask bits corresponding to a field to
762 either all 0 or all 1. When negotiating tag fields, an application
763 can request a specific number of fields of a given size. A provider
764 must return a tag format that supports the requested number of fields,
765 with each field being at least the size requested, or fail the request.
766 A provider may increase the size of the fields. When reporting
767 completions (see FI_CQ_FORMAT_TAGGED), the provider is not guaranteed
768 to clear unsupported tag bits in the tag field of the completion
769 entry.
770
771 It is recommended that field sizes be ordered from smallest to largest.
772 A generic, unstructured tag and mask can be achieved by requesting a
773 bit array consisting of alternating 1's and 0's.
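
The 0x30FF example can be checked mechanically. The sketch below splits a mem_tag_format value into per-field masks and tests whether an application mask treats each field as all 0 or all 1. The helper names and MAX_TAG_FIELDS are hypothetical and are not part of libfabric. Mask bits that fall in the ignored prefix are not examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TAG_FIELDS 64   /* a 64-bit format holds at most 64 groups */

/*
 * Split a mem_tag_format value into one mask per field, most significant
 * field first. Leading 0 bits are the ignored prefix. Returns the number
 * of fields found.
 */
int tag_format_fields(uint64_t format, uint64_t masks[MAX_TAG_FIELDS])
{
    int nfields = 0;
    int bit = 63;

    assert(masks != NULL);

    /* Skip the optional prefix of ignored (0) bits. */
    while (bit >= 0 && !((format >> bit) & 1))
        bit--;

    /* Each run of identical bits forms one field. */
    while (bit >= 0) {
        uint64_t value = (format >> bit) & 1;
        uint64_t mask = 0;

        while (bit >= 0 && ((format >> bit) & 1) == value) {
            mask |= (uint64_t)1 << bit;
            bit--;
        }
        masks[nfields++] = mask;
    }
    return nfields;
}

/* Number of tagged bits: everything below the ignored prefix. */
int tag_format_bits(uint64_t format)
{
    int bits = 0;

    while (format) {
        bits++;
        format >>= 1;
    }
    return bits;
}

/* True if every field is either fully set or fully clear in the mask. */
bool tag_mask_valid(uint64_t format, uint64_t mask)
{
    uint64_t masks[MAX_TAG_FIELDS];
    int n = tag_format_fields(format, masks);

    for (int i = 0; i < n; i++) {
        uint64_t m = mask & masks[i];

        if (m != 0 && m != masks[i])
            return false;
    }
    return true;
}

/* Self-check using the 0x30FF format discussed above. */
void tag_format_example(void)
{
    uint64_t masks[MAX_TAG_FIELDS];
    int n = tag_format_fields(0x30FF, masks);

    assert(n == 3 && tag_format_bits(0x30FF) == 14);
    assert(masks[0] == 0x3000 && masks[1] == 0x0F00 && masks[2] == 0x00FF);
    assert(tag_mask_valid(0x30FF, 0x3000 | 0x00FF));
    assert(!tag_mask_valid(0x30FF, 0x1000));   /* splits the 2-bit field */
}
```

Because the provider may skip mask validation for performance reasons, a check of this kind is most useful in application debug builds.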
774
775 tx_ctx_cnt - Transmit Context Count
776 Number of transmit contexts to associate with the endpoint. If not
777 specified (0), 1 context will be assigned if the endpoint supports out‐
778 bound transfers. Transmit contexts are independent transmit queues
779 that may be separately configured. Each transmit context may be bound
780 to a separate CQ, and no ordering is defined between contexts. Addi‐
781 tionally, no synchronization is needed when accessing contexts in par‐
782 allel.
783
784 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
785 be configured to use a shared transmit context, if supported by the
786 provider. Providers that do not support shared transmit contexts will
787 fail the request.
788
789 See the scalable endpoint and shared contexts sections for additional
790 details.
791
792 rx_ctx_cnt - Receive Context Count
793 Number of receive contexts to associate with the endpoint. If not
794 specified (0), 1 context will be assigned if the endpoint supports inbound
795 transfers. Receive contexts are independent processing queues that may
796 be separately configured. Each receive context may be bound to a sepa‐
797 rate CQ, and no ordering is defined between contexts. Additionally, no
798 synchronization is needed when accessing contexts in parallel.
799
800 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
801 be configured to use a shared receive context, if supported by the
802 provider. Providers that do not support shared receive contexts will
803 fail the request.
804
805 See the scalable endpoint and shared contexts sections for additional
806 details.
807
808 auth_key_size - Authorization Key Length
809 The length of the authorization key in bytes. This field will be 0 if
810 authorization keys are not available or used. This field is ignored
811 unless the fabric is opened with API version 1.5 or greater.
812
813 auth_key - Authorization Key
814 If supported by the fabric, an authorization key (a.k.a. job key) to
815 associate with the endpoint. An authorization key is used to limit
816 communication between endpoints. Only peer endpoints that are pro‐
817 grammed to use the same authorization key may communicate. Authoriza‐
818 tion keys are often used to implement job keys, to ensure that process‐
819 es running in different jobs do not accidentally cross traffic. The
820 domain authorization key will be used if auth_key_size is set to 0.
821 This field is ignored unless the fabric is opened with API version 1.5
822 or greater.
823
825 Attributes specific to the transmit capabilities of an endpoint are
826 specified using struct fi_tx_attr.
827
828 struct fi_tx_attr {
829 uint64_t caps;
830 uint64_t mode;
831 uint64_t op_flags;
832 uint64_t msg_order;
833 uint64_t comp_order;
834 size_t inject_size;
835 size_t size;
836 size_t iov_limit;
837 size_t rma_iov_limit;
838 };
839
840 caps - Capabilities
841 The requested capabilities of the context. The capabilities must be a
842 subset of those requested of the associated endpoint. See the CAPABIL‐
843 ITIES section of fi_getinfo(3) for capability details. If the caps
844 field is 0 on input to fi_getinfo(3), the caps value from the fi_info
845 structure will be used.
846
847 mode
848 The operational mode bits of the context. The mode bits will be a sub‐
849 set of those associated with the endpoint. See the MODE section of
850 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
851 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
852 stead. On return from fi_getinfo(3), the mode will be set only to
853 those constraints specific to transmit operations.
854
855 op_flags - Default transmit operation flags
856 Flags that control the behavior of operations submitted against the
857 context. Applicable flags are listed in the Operation Flags section.
858
859 msg_order - Message Ordering
860 Message ordering refers to the order in which transport layer headers
861 (as viewed by the application) are processed. Relaxed message order
862 enables data transfers to be sent and received out of order, which may
863 improve performance by utilizing multiple paths through the fabric from
864 the initiating endpoint to a target endpoint. Message order applies
865 only between a single source and destination endpoint pair. Ordering
866 between different target endpoints is not defined.
867
868 Message order is determined using a set of ordering bits. Each set bit
869 indicates that ordering is maintained between data transfers of the
870 specified type. Message order is defined for [read | write | send] op‐
871 erations submitted by an application after [read | write | send] opera‐
872 tions.
873
874 Message ordering only applies to the end-to-end transmission of
875 transport headers. Message ordering is necessary for, but does not
876 guarantee, the order in which message data is sent or received by the
877 transport layer. Message ordering requires matching ordering semantics
878 on the receiving side of a data transfer operation in order to guarantee
879 that ordering is met.
880
881 FI_ORDER_NONE
882 No ordering is specified. This value may be used as input in
883 order to obtain the default message order supported by the
884 provider. FI_ORDER_NONE is an alias for the value 0.
885
886 FI_ORDER_RAR
887 Read after read. If set, RMA and atomic read operations are
888 transmitted in the order submitted relative to other RMA and
889 atomic read operations. If not set, RMA and atomic reads may be
890 transmitted out of order from their submission.
891
892 FI_ORDER_RAW
893 Read after write. If set, RMA and atomic read operations are
894 transmitted in the order submitted relative to RMA and atomic
895 write operations. If not set, RMA and atomic reads may be
896 transmitted ahead of RMA and atomic writes.
897
898 FI_ORDER_RAS
899 Read after send. If set, RMA and atomic read operations are
900 transmitted in the order submitted relative to message send op‐
901 erations, including tagged sends. If not set, RMA and atomic
902 reads may be transmitted ahead of sends.
903
904 FI_ORDER_WAR
905 Write after read. If set, RMA and atomic write operations are
906 transmitted in the order submitted relative to RMA and atomic
907 read operations. If not set, RMA and atomic writes may be
908 transmitted ahead of RMA and atomic reads.
909
910 FI_ORDER_WAW
911 Write after write. If set, RMA and atomic write operations are
912 transmitted in the order submitted relative to other RMA and
913 atomic write operations. If not set, RMA and atomic writes may
914 be transmitted out of order from their submission.
915
916 FI_ORDER_WAS
917 Write after send. If set, RMA and atomic write operations are
918 transmitted in the order submitted relative to message send op‐
919 erations, including tagged sends. If not set, RMA and atomic
920 writes may be transmitted ahead of sends.
921
922 FI_ORDER_SAR
923 Send after read. If set, message send operations, including
924 tagged sends, are transmitted in the order submitted relative to RMA
925 and atomic read operations. If not set, message sends may be
926 transmitted ahead of RMA and atomic reads.
927
928 FI_ORDER_SAW
929 Send after write. If set, message send operations, including
930 tagged sends, are transmitted in the order submitted relative to RMA
931 and atomic write operations. If not set, message sends may be
932 transmitted ahead of RMA and atomic writes.
933
934 FI_ORDER_SAS
935 Send after send. If set, message send operations, including
936 tagged sends, are transmitted in the order submitted relative to
937 other message sends. If not set, message sends may be
938 transmitted out of order from their submission.
939
940 comp_order - Completion Ordering
941 Completion ordering refers to the order in which completed requests are
942 written into the completion queue. Completion ordering is similar to
943 message order. Relaxed completion order may enable faster reporting of
944 completed transfers, allow acknowledgments to be sent over different
945 fabric paths, and support more sophisticated retry mechanisms. This
946 can result in lower-latency completions, particularly when using uncon‐
947 nected endpoints. Strict completion ordering may require that
948 providers queue completed operations or limit available optimizations.
949
950 For transmit requests, completion ordering depends on the endpoint com‐
951 munication type. For unreliable communication, completion ordering ap‐
952 plies to all data transfer requests submitted to an endpoint. For re‐
953 liable communication, completion ordering only applies to requests that
954 target a single destination endpoint. Completion ordering of requests
955 that target different endpoints over a reliable transport is not de‐
956 fined.
957
958 Applications should specify the completion ordering that they support
959 or require. Providers should return the completion order that they ac‐
960 tually provide, with the constraint that the returned ordering is
961 stricter than that specified by the application. Supported completion
962 order values are:
963
964 FI_ORDER_NONE
965 No ordering is defined for completed operations. Requests sub‐
966 mitted to the transmit context may complete in any order.
967
968 FI_ORDER_STRICT
969 Requests complete in the order in which they are submitted to
970 the transmit context.
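
The scope of transmit completion ordering described above can be restated as a small predicate. This is only a sketch: the function name is hypothetical, the boolean parameters stand in for the comp_order value and the endpoint communication type, and dest_a and dest_b are made-up identifiers for destination endpoints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the completions of two transmit requests are guaranteed to be
 * reported in submission order.
 *   strict   - comp_order is FI_ORDER_STRICT (false models FI_ORDER_NONE)
 *   reliable - the endpoint uses reliable communication
 */
bool tx_completion_order_guaranteed(bool strict, bool reliable,
                                    uint64_t dest_a, uint64_t dest_b)
{
    if (!strict)
        return false;           /* requests may complete in any order */
    if (!reliable)
        return true;            /* applies to all requests on the endpoint */
    return dest_a == dest_b;    /* applies per destination endpoint only */
}

/* Self-check covering each branch. */
void comp_order_example(void)
{
    assert(tx_completion_order_guaranteed(true, false, 1, 2));
    assert(tx_completion_order_guaranteed(true, true, 1, 1));
    assert(!tx_completion_order_guaranteed(true, true, 1, 2));
    assert(!tx_completion_order_guaranteed(false, false, 1, 1));
}
```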
971
972 inject_size
973 The requested inject operation size (see the FI_INJECT flag) that the
974 context will support. This is the maximum size data transfer that can
975 be associated with an inject operation (such as fi_inject) or may be
976 used with the FI_INJECT data transfer flag.
977
978 size
979 The size of the context. The size is specified as the minimum number
980 of transmit operations that may be posted to the endpoint without the
981 operation returning -FI_EAGAIN.
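
The meaning of size can be illustrated with a toy model in which a context accepts exactly size outstanding operations and then reports that it is full. Everything here is hypothetical (struct toy_tx_ctx and the toy_* functions are not libfabric interfaces), -EAGAIN from errno.h stands in for -FI_EAGAIN, and a real provider may accept more than size operations, since size is only a minimum.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of a transmit context; not a libfabric type. */
struct toy_tx_ctx {
    size_t size;        /* advertised context size */
    size_t in_flight;   /* operations posted but not yet completed */
};

/* Post one operation: 0 on success, -EAGAIN when the context is full. */
int toy_post(struct toy_tx_ctx *ctx)
{
    assert(ctx != NULL);
    if (ctx->in_flight >= ctx->size)
        return -EAGAIN;         /* stands in for -FI_EAGAIN */
    ctx->in_flight++;
    return 0;
}

/* Retire one completed operation, freeing a slot in the context. */
void toy_complete(struct toy_tx_ctx *ctx)
{
    assert(ctx != NULL && ctx->in_flight > 0);
    ctx->in_flight--;
}

/* Post with retry: when the context is full, reap a completion first. */
int toy_post_retry(struct toy_tx_ctx *ctx)
{
    int ret;

    assert(ctx != NULL && ctx->size > 0);
    while ((ret = toy_post(ctx)) == -EAGAIN)
        toy_complete(ctx);      /* a real application would poll its CQ */
    return ret;
}

/* Self-check: size posts succeed, the next one must wait. */
void toy_ctx_example(void)
{
    struct toy_tx_ctx ctx = { 2, 0 };
    int first = toy_post(&ctx);
    int second = toy_post(&ctx);
    int third = toy_post(&ctx);
    int retried = toy_post_retry(&ctx);

    assert(first == 0 && second == 0);
    assert(third == -EAGAIN);
    assert(retried == 0 && ctx.in_flight == 2);
}
```

The retry loop reflects the usual response to a full context: make progress on completions, then post again.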
982
983 iov_limit
984 This is the maximum number of IO vectors (scatter-gather elements) that
985 a single posted operation may reference.
986
987 rma_iov_limit
988 This is the maximum number of RMA IO vectors (scatter-gather elements)
989 that an RMA or atomic operation may reference. The rma_iov_limit cor‐
990 responds to the rma_iov_count values in RMA and atomic operations. See
991 struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
992 for additional details. This limit applies to both the number of RMA
993 IO vectors that may be specified when initiating an operation from the
994 local endpoint, as well as the maximum number of IO vectors that may be
995 carried in a single request from a remote endpoint.
996
998 Attributes specific to the receive capabilities of an endpoint are
999 specified using struct fi_rx_attr.
1000
1001 struct fi_rx_attr {
1002 uint64_t caps;
1003 uint64_t mode;
1004 uint64_t op_flags;
1005 uint64_t msg_order;
1006 uint64_t comp_order;
1007 size_t total_buffered_recv;
1008 size_t size;
1009 size_t iov_limit;
1010 };
1011
1012 caps - Capabilities
1013 The requested capabilities of the context. The capabilities must be a
1014 subset of those requested of the associated endpoint. See the
1015 CAPABILITIES section of fi_getinfo(3) for capability details. If the caps
1016 field is 0 on input to fi_getinfo(3), the caps value from the fi_info
1017 structure will be used.
1018
1019 mode
1020 The operational mode bits of the context. The mode bits will be a sub‐
1021 set of those associated with the endpoint. See the MODE section of
1022 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
1023 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
1024 stead. On return from fi_getinfo(3), the mode will be set only to
1025 those constraints specific to receive operations.
1026
1027 op_flags - Default receive operation flags
1028 Flags that control the behavior of operations submitted against the
1029 context. Applicable flags are listed in the Operation Flags section.
1030
1031 msg_order - Message Ordering
1032 For a description of message ordering, see the msg_order field in the
1033 Transmit Context Attribute section. Receive context message ordering
1034 defines the order in which received transport message headers are pro‐
1035 cessed when received by an endpoint.
1036
1037 The following ordering flags, as defined for transmit ordering, also
1038 apply to the processing of received operations: FI_ORDER_NONE, FI_OR‐
1039 DER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_OR‐
1040 DER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, and FI_ORDER_SAS.
1041
1042 comp_order - Completion Ordering
1043 For a description of completion ordering, see the comp_order field in
1044 the Transmit Context Attribute section.
1045
1046 FI_ORDER_NONE
1047 No ordering is defined for completed operations. Receive opera‐
1048 tions may complete in any order, regardless of their submission
1049 order.
1050
1051 FI_ORDER_STRICT
1052 Receive operations complete in the order in which they are pro‐
1053 cessed by the receive context, based on the receive side msg_or‐
1054 der attribute.
1055
1056 FI_ORDER_DATA
1057 When set, this bit indicates that received data is written into
1058 memory in order. Data ordering applies to memory accessed as
1059 part of a single operation and between operations if message or‐
1060 dering is guaranteed.
1061
1062 total_buffered_recv
1063 This field is supported for backwards compatibility purposes. It is a
1064 hint to the provider of the total available space that may be needed to
1065 buffer messages that are received for which there is no matching re‐
1066 ceive operation. The provider may adjust or ignore this value. The
1067 allocation of internal network buffering among received messages is
1068 provider specific. For instance, a provider may limit the size of mes‐
1069 sages which can be buffered or the amount of buffering allocated to a
1070 single message.
1071
1072 If receive side buffering is disabled (total_buffered_recv = 0) and a
1073 message is received by an endpoint, then the behavior is dependent on
1074 whether resource management has been enabled (FI_RM_ENABLED has been set
1075 or not). See the Resource Management section of fi_domain.3 for fur‐
1076 ther clarification. It is recommended that applications enable re‐
1077 source management if they anticipate receiving unexpected messages,
1078 rather than modifying this value.
1079
1080 size
1081 The size of the context. The size is specified as the minimum number
1082 of receive operations that may be posted to the endpoint without the
1083 operation returning -FI_EAGAIN.
1084
1085 iov_limit
1086 This is the maximum number of IO vectors (scatter-gather elements) that
1087 a single posted operation may reference.
1088
1090 A scalable endpoint is a communication portal that supports multiple
1091 transmit and receive contexts. Scalable endpoints are loosely modeled
1092 after the networking concept of transmit/receive side scaling, also
1093 known as multi-queue. Support for scalable endpoints is domain specif‐
1094 ic. Scalable endpoints may improve the performance of multi-threaded
1095 and parallel applications, by allowing threads to access independent
1096 transmit and receive queues. A scalable endpoint has a single trans‐
1097 port level address, which can reduce the memory requirements needed to
1098 store remote addressing data, versus using standard endpoints. Scal‐
1099 able endpoints cannot be used directly for communication operations,
1100 and require the application to explicitly create transmit and receive
1101 contexts as described below.
1102
1103 fi_tx_context
1104 Transmit contexts are independent transmit queues. Ordering and syn‐
1105 chronization between contexts are not defined. Conceptually a transmit
1106 context behaves similarly to a send-only endpoint. A transmit context
1107 may be configured with fewer capabilities than the base endpoint and
1108 with different attributes (such as ordering requirements and inject
1109 size) than other contexts associated with the same scalable endpoint.
1110 Each transmit context has its own completion queue. The number of
1111 transmit contexts associated with an endpoint is specified during end‐
1112 point creation.
1113
1114 The fi_tx_context call is used to retrieve a specific context, identi‐
1115 fied by an index (see above for details on transmit context at‐
1116 tributes). Providers may dynamically allocate contexts when fi_tx_con‐
1117 text is called, or may statically create all contexts when fi_endpoint
1118 is invoked. By default, a transmit context inherits the properties of
1119 its associated endpoint. However, applications may request context
1120 specific attributes through the attr parameter. Support for per trans‐
1121 mit context attributes is provider specific and not guaranteed.
1122 Providers will return the actual attributes assigned to the context
1123 through the attr parameter, if provided.
1124
1125 fi_rx_context
1126 Receive contexts are independent receive queues for receiving incoming
1127 data. Ordering and synchronization between contexts are not guaranteed.
1128 Conceptually, a receive context behaves similarly to a receive-only
1129 endpoint. A receive context may be configured with fewer capabilities
1130 than the base endpoint and with different attributes (such as ordering
1131 requirements and inject size) than other contexts associated with the
1132 same scalable endpoint. Each receive context has its own completion
1133 queue. The number of receive contexts associated with an endpoint is
1134 specified during endpoint creation.
1135
1136 Receive contexts are often associated with steering flows that specify
1137 which incoming packets targeting a scalable endpoint to process. How‐
1138 ever, receive contexts may be targeted directly by the initiator, if
1139 supported by the underlying protocol. Such contexts are referred to as
1140 'named'. Support for named contexts must be indicated by setting the
1141 FI_NAMED_RX_CTX capability when the corresponding endpoint is
1142 created. Support for named receive contexts is coordinated with address
1143 vectors. See fi_av(3) and fi_rx_addr(3).
1144
1145 The fi_rx_context call is used to retrieve a specific context, identi‐
1146 fied by an index (see above for details on receive context attributes).
1147 Providers may dynamically allocate contexts when fi_rx_context is
1148 called, or may statically create all contexts when fi_endpoint is in‐
1149 voked. By default, a receive context inherits the properties of its
1150 associated endpoint. However, applications may request context specif‐
1151 ic attributes through the attr parameter. Support for per receive con‐
1152 text attributes is provider specific and not guaranteed. Providers
1153 will return the actual attributes assigned to the context through the
1154 attr parameter, if provided.
1155
1157 Shared contexts are transmit and receive contexts explicitly shared
1158 among one or more endpoints. A shareable context allows an application
1159 to use a single dedicated provider resource among multiple transport
1160 addressable endpoints. This can greatly reduce the resources needed to
1161 manage communication over multiple endpoints by multiplexing transmit
1162 and/or receive processing, with the potential cost of serializing ac‐
1163 cess across multiple endpoints. Support for shareable contexts is do‐
1164 main specific.
1165
1166 Conceptually, shareable transmit contexts are transmit queues that may
1167 be accessed by many endpoints. The use of a shared transmit context is
1168 mostly opaque to an application. Applications must allocate and bind
1169 shared transmit contexts to endpoints, but operations are posted di‐
1170 rectly to the endpoint. Shared transmit contexts are not associated
1171 with completion queues or counters. Completed operations are posted to
1172 the CQs bound to the endpoint. An endpoint may only be associated with
1173 a single shared transmit context.
1174
1175 Unlike shared transmit contexts, applications interact directly with
1176 shared receive contexts. Users post receive buffers directly to a
1177 shared receive context, with the buffers usable by any endpoint bound
1178 to the shared receive context. Shared receive contexts are not associ‐
1179 ated with completion queues or counters. Completed receive operations
1180 are posted to the CQs bound to the endpoint. An endpoint may only be
1181 associated with a single shared receive context, and all connectionless
1182 endpoints associated with a shared receive context must also share the
1183 same address vector.
1184
1185 Endpoints associated with a shared transmit context may use dedicated
1186 receive contexts, and vice-versa. Or an endpoint may use shared trans‐
1187 mit and receive contexts. And there is no requirement that the same
1188 group of endpoints sharing a context of one type also share the context
1189 of an alternate type. Furthermore, an endpoint may use a shared con‐
1190 text of one type, but a scalable set of contexts of the alternate type.
1191
1192 fi_stx_context
1193 This call is used to open a shareable transmit context (see above for
1194 details on the transmit context attributes). Endpoints associated with
1195 a shared transmit context must use a subset of the transmit context's
1196 attributes. Note that this is the reverse of the requirement for
1197 transmit contexts for scalable endpoints.
1198
1199 fi_srx_context
1200 This allocates a shareable receive context (see above for details on
1201 the receive context attributes). Endpoints associated with a shared
1202 receive context must use a subset of the receive context's attributes.
1203 Note that this is the reverse of the requirement for receive contexts
1204 for scalable endpoints.
1205
1207 The following feature and description should be considered experimen‐
1208 tal. Until the experimental tag is removed, the interfaces, semantics,
1209 and data structures associated with socket endpoints may change between
1210 library versions.
1211
1212 This section applies to endpoints of type FI_EP_SOCK_STREAM and
1213 FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1214
1215 Socket endpoints are defined with semantics that allow them to more
1216 easily be adopted by developers familiar with the UNIX socket API, or
1217 by middleware that exposes the socket API, while still taking advantage
1218 of high-performance hardware features.
1219
1220 The key difference between socket endpoints and other active endpoints
1221 is that socket endpoints use synchronous data transfers. Buffers passed
1222 into send and receive operations revert to the control of the applica‐
1223 tion upon returning from the function call. As a result, no data
1224 transfer completions are reported to the application, and socket end‐
1225 points are not associated with completion queues or counters.
1226
1227 Socket endpoints support a subset of message operations: fi_send,
1228 fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
1229 Because data transfers are synchronous, the return value from send and
1230 receive operations indicates the number of bytes transferred on success,
1231 or a negative value on error, including -FI_EAGAIN if the endpoint can‐
1232 not send or receive any data because of full or empty queues, respec‐
1233 tively.
1234
1235 Socket endpoints are associated with event queues and address vectors,
1236 and process connection management events asynchronously, similar to
1237 other endpoints. Unlike UNIX sockets, socket endpoints must still be
1238 declared as either active or passive.
1239
1240 Socket endpoints behave like non-blocking sockets. In order to support
1241 select and poll semantics, active socket endpoints are associated with
1242 a file descriptor that is signaled whenever the endpoint is ready to
1243 send and/or receive data. The file descriptor may be retrieved using
1244 fi_control.
1245
1247 Operation flags are obtained by OR-ing the following flags together.
1248 Operation flags define the default flags applied to an endpoint's data
1249 transfer operations, where a flags parameter is not available. Data
1250 transfer operations that take flags as input override the op_flags val‐
1251 ue of transmit or receive context attributes of an endpoint.
1252
1253 FI_INJECT
1254 Indicates that all outbound data buffers should be returned to
1255 the user's control immediately after a data transfer call re‐
1256 turns, even if the operation is handled asynchronously. This
1257 may require that the provider copy the data into a local buffer
1258 and transfer out of that buffer. A provider can limit the total
1259 amount of send data that may be buffered and/or the size of a
1260 single send that can use this flag. This limit is indicated us‐
1261 ing inject_size (see inject_size above).
1262
1263 FI_MULTI_RECV
1264 Applies to posted receive operations. This flag allows the user
1265 to post a single buffer that will receive multiple incoming mes‐
1266 sages. Received messages will be packed into the receive buffer
1267 until the buffer has been consumed. Use of this flag may cause
1268 a single posted receive operation to generate multiple comple‐
1269 tions as messages are placed into the buffer. The placement of
1270 received data into the buffer may be subject to provider-specific
1271 alignment restrictions. The buffer will be released by
1272 the provider when the available buffer space falls below the
1273 specified minimum (see FI_OPT_MIN_MULTI_RECV).
1274
1275 FI_COMPLETION
1276 Indicates that a completion entry should be generated for data
1277 transfer operations. This flag only applies to operations is‐
1278 sued on endpoints that were bound to a CQ or counter with the
1279 FI_SELECTIVE_COMPLETION flag. See the fi_ep_bind section above
1280 for more detail.
1281
1282 FI_INJECT_COMPLETE
1283 Indicates that a completion should be generated when the source
1284 buffer(s) may be reused. See fi_cq(3) for additional details on
1285 completion semantics.
1286
1287 FI_TRANSMIT_COMPLETE
1288 Indicates that a completion should be generated when the trans‐
1289 mit operation has completed relative to the local provider. See
1290 fi_cq(3) for additional details on completion semantics.
1291
1292 FI_DELIVERY_COMPLETE
1293 Indicates that a completion should be generated when the opera‐
1294 tion has been processed by the destination endpoint(s). See
1295 fi_cq(3) for additional details on completion semantics.
1296
1297 FI_COMMIT_COMPLETE
1298 Indicates that a completion should not be generated (locally or
1299 at the peer) until the results of an operation have been made
1300 persistent. See fi_cq(3) for additional details on completion
1301 semantics.
1302
1303 FI_MULTICAST
1304 Indicates that data transfers will target multicast addresses by
1305 default. Any fi_addr_t passed into a data transfer operation
1306 will be treated as a multicast address.
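
The FI_MULTI_RECV behavior listed above can be modeled with a short simulation. This is a sketch under stated assumptions, not provider logic: multi_recv_fill is a hypothetical name, min_free plays the role of the FI_OPT_MIN_MULTI_RECV threshold, alignment padding is ignored, and a message that does not fit in the remaining space is assumed to end use of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a multi-receive buffer. Returns how many of the incoming
 * messages are packed into the buffer before it is released, and stores
 * the number of bytes consumed in *bytes_used. Each placed message
 * corresponds to one completion for the single posted receive.
 */
size_t multi_recv_fill(size_t buf_len, size_t min_free,
                       const size_t *msg_len, size_t nmsgs,
                       size_t *bytes_used)
{
    size_t used = 0;
    size_t placed = 0;

    assert(msg_len != NULL && bytes_used != NULL);
    while (placed < nmsgs) {
        size_t avail = buf_len - used;

        /* Released once free space drops below the minimum. */
        if (avail < min_free || msg_len[placed] > avail)
            break;
        used += msg_len[placed];
        placed++;
    }
    *bytes_used = used;
    return placed;
}

/* Self-check: a 1024-byte buffer with a 256-byte minimum. */
void multi_recv_example(void)
{
    const size_t lens[] = { 300, 300, 300, 300 };
    size_t used = 0;
    size_t placed = multi_recv_fill(1024, 256, lens, 4, &used);

    /* After three messages only 124 bytes remain, below the minimum. */
    assert(placed == 3 && used == 900);
}
```

The fourth message in the self-check would have to land in a newly posted buffer, which is why applications using this flag typically keep more than one buffer posted.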
1307
1309 Users should call fi_close to release all resources allocated to the
1310 fabric endpoint.
1311
1312 Endpoints allocated with the FI_CONTEXT mode set must typically provide
1313 struct fi_context as their per operation context parameter. (See
1314 fi_getinfo.3 for details.) However, when FI_SELECTIVE_COMPLETION is en‐
1315 abled to suppress completion entries, and an operation is initiated
1316 without the FI_COMPLETION flag set, the context parameter is ignored.
1317 An application does not need to pass a valid struct fi_context into
1318 such data transfers.
1319
1320 Operations that complete in error and are not associated with a valid
1321 operational context will use the endpoint context in any error
1322 reporting structures.
1323
1324 Although applications typically associate individual completions with
1325 either completion queues or counters, an endpoint can be attached to
1326 both a counter and completion queue. When combined with using selec‐
1327 tive completions, this allows an application to use counters to track
1328 successful completions, with a CQ used to report errors. Operations
1329 that complete with an error increment the error counter and generate a
1330 completion event. The generation of entries going to the CQ can then
1331 be controlled using FI_SELECTIVE_COMPLETION.
1332
1333 As mentioned in fi_getinfo(3), the ep_attr structure can be used to
1334 query providers that support various endpoint attributes. fi_getinfo
1335 can return provider info structures that can support the minimal set of
1336 requirements (such that the application maintains correctness). Howev‐
1337 er, it can also return provider info structures that exceed application
1338 requirements. As an example, consider an application requesting
1339 msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may
1340 have all the ordering bits set. The application can reset the ordering
1341 bits it does not require before creating the endpoint. The provider is
1342 free to implement a stricter ordering than is required by the applica‐
1343 tion.
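
The msg_order example above amounts to two bit operations: checking that the returned ordering covers what the application requires, and clearing the bits it does not need. The sketch below uses illustrative bit values (the DEMO_ORDER_* names are hypothetical). Real code would apply the FI_ORDER_* macros to the msg_order fields of the structures returned by fi_getinfo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not the values of the FI_ORDER_* macros. */
enum {
    DEMO_ORDER_RAW = 1 << 0,
    DEMO_ORDER_WAW = 1 << 1,
    DEMO_ORDER_SAS = 1 << 2
};

/* True if every ordering guarantee the application requires is provided. */
bool order_satisfied(uint64_t provided, uint64_t required)
{
    return (provided & required) == required;
}

/* Keep only the guarantees the application requires, clearing the rest. */
uint64_t order_trim(uint64_t provided, uint64_t required)
{
    return provided & required;
}

/* Self-check: the provider returns more ordering than was requested. */
void order_example(void)
{
    uint64_t provided = DEMO_ORDER_RAW | DEMO_ORDER_WAW | DEMO_ORDER_SAS;
    uint64_t required = DEMO_ORDER_SAS;

    assert(order_satisfied(provided, required));
    assert(order_trim(provided, required) == DEMO_ORDER_SAS);
    assert(!order_satisfied(DEMO_ORDER_RAW, required));
}
```

Trimming before endpoint creation tells the provider which guarantees the application actually relies on, although the provider remains free to implement stricter ordering.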
1344
1346 Returns 0 on success. On error, a negative value corresponding to fab‐
1347 ric errno is returned. For fi_cancel, a return value of 0 indicates
1348 that the cancel request was submitted for processing.
1349
1350 Fabric errno values are defined in rdma/fi_errno.h.
1351
1353 -FI_EDOMAIN
1354 A resource domain was not bound to the endpoint or an attempt
1355 was made to bind multiple domains.
1356
1357 -FI_ENOCQ
1358 The endpoint has not been configured with a necessary completion queue.
1359
1360 -FI_EOPBADSTATE
1361 The endpoint's state does not permit the requested operation.
1362
1364 fi_getinfo(3), fi_domain(3), fi_cq(3), fi_msg(3), fi_tagged(3),
1365 fi_rma(3)
1366
1368 OpenFabrics.
1369
1370
1371
1372Libfabric Programmer's Manual 2018-11-30 fi_endpoint(3)