fi_endpoint(3)                Libfabric v1.6.1                fi_endpoint(3)

6 fi_endpoint - Fabric endpoint operations
7
8 fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
9 Allocate or close an endpoint.
10
11 fi_ep_bind
12 Associate an endpoint with hardware resources, such as event
13 queues, completion queues, counters, address vectors, or shared
14 transmit/receive contexts.
15
16 fi_scalable_ep_bind
Associate a scalable endpoint with an address vector.

fi_pep_bind
Associate a passive endpoint with an event queue.
21
22 fi_enable
23 Transitions an active endpoint into an enabled state.
24
25 fi_cancel
Cancel a pending asynchronous data transfer.

fi_ep_alias
Create an alias to the endpoint.
30
31 fi_control
32 Control endpoint operation.
33
34 fi_getopt / fi_setopt
35 Get or set endpoint options.
36
37 fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
38 Open a transmit or receive context.
39
fi_rx_size_left / fi_tx_size_left (DEPRECATED)
Query the lower bound on how many RX/TX operations may be posted
without an operation returning -FI_EAGAIN. These functions have
been deprecated and will be removed in a future version of the
library.
45
#include <rdma/fabric.h>

#include <rdma/fi_endpoint.h>

int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **ep, void *context);

int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
    struct fid_ep **sep, void *context);

int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
    struct fid_pep **pep, void *context);

int fi_tx_context(struct fid_ep *sep, int index,
    struct fi_tx_attr *attr, struct fid_ep **tx_ep,
    void *context);

int fi_rx_context(struct fid_ep *sep, int index,
    struct fi_rx_attr *attr, struct fid_ep **rx_ep,
    void *context);

int fi_stx_context(struct fid_domain *domain,
    struct fi_tx_attr *attr, struct fid_stx **stx,
    void *context);

int fi_srx_context(struct fid_domain *domain,
    struct fi_rx_attr *attr, struct fid_ep **rx_ep,
    void *context);

int fi_close(struct fid *ep);

int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);

int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

int fi_enable(struct fid_ep *ep);

int fi_cancel(struct fid_ep *ep, void *context);

int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);

int fi_control(struct fid *ep, int command, void *arg);

int fi_getopt(struct fid *ep, int level, int optname,
    void *optval, size_t *optlen);

int fi_setopt(struct fid *ep, int level, int optname,
    const void *optval, size_t optlen);

DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);

DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
101
103 fid : On creation, specifies a fabric or access domain. On bind, iden‐
104 tifies the event queue, completion queue, counter, or address vector to
105 bind to the endpoint. In other cases, it's a fabric identifier of an
106 associated resource.
107
108 info : Details about the fabric interface endpoint to be opened,
109 obtained from fi_getinfo.
110
111 ep : A fabric endpoint.
112
113 sep : A scalable fabric endpoint.
114
115 pep : A passive fabric endpoint.
116
117 context : Context associated with the endpoint or asynchronous opera‐
118 tion.
119
120 index : Index to retrieve a specific transmit/receive context.
121
122 attr : Transmit or receive context attributes.
123
124 flags : Additional flags to apply to the operation.
125
126 command : Command of control operation to perform on endpoint.
127
128 arg : Optional control argument.
129
130 level : Protocol level at which the desired option resides.
131
132 optname : The protocol option to read or set.
133
optval : The option value that was read, or the value to set.
135
136 optlen : The size of the optval buffer.
137
139 Endpoints are transport level communication portals. There are two
140 types of endpoints: active and passive. Passive endpoints belong to a
141 fabric domain and are most often used to listen for incoming connection
142 requests. However, a passive endpoint may be used to reserve a fabric
143 address that can be granted to an active endpoint. Active endpoints
144 belong to access domains and can perform data transfers.
145
146 Active endpoints may be connection-oriented or connectionless, and may
147 provide data reliability. The data transfer interfaces -- messages
148 (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
149 (fi_atomic) -- are associated with active endpoints. In basic configu‐
150 rations, an active endpoint has transmit and receive queues. In gen‐
151 eral, operations that generate traffic on the fabric are posted to the
152 transmit queue. This includes all RMA and atomic operations, along
153 with sent messages and sent tagged messages. Operations that post buf‐
154 fers for receiving incoming data are submitted to the receive queue.
155
Active endpoints are created in the disabled state. They must
transition into an enabled state before accepting data transfer
operations, including posting of receive buffers. The fi_enable call is
used to transition an active endpoint into an enabled state. The
fi_connect and fi_accept calls will also transition an endpoint into
the enabled state, if it is not already enabled.
162
163 In order to transition an endpoint into an enabled state, it must be
164 bound to one or more fabric resources. An endpoint that will generate
165 asynchronous completions, either through data transfer operations or
166 communication establishment events, must be bound to the appropriate
167 completion queues or event queues, respectively, before being enabled.
168 Additionally, endpoints that use manual progress must be associated
169 with relevant completion queues or event queues in order to drive
170 progress. For endpoints that are only used as the target of RMA or
171 atomic operations, this means binding the endpoint to a completion
172 queue associated with receive processing. Unconnected endpoints must
173 be bound to an address vector.
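
The binding requirements above can be summarized as a small check. This
is only a sketch of the rules as stated in this section: struct ep_model,
enum enable_check, and check_enable are hypothetical names used for
illustration and are not part of libfabric, and a real provider may
enforce additional conditions.

```c
#include <stdbool.h>

/* Hypothetical model of an active endpoint's bindings; not libfabric types. */
struct ep_model {
    bool connectionless;      /* unconnected endpoint */
    bool reports_completions; /* data transfers generate completions */
    bool reports_cm_events;   /* connection establishment events expected */
    bool cq_bound;            /* bound to a completion queue */
    bool eq_bound;            /* bound to an event queue */
    bool av_bound;            /* bound to an address vector */
};

enum enable_check {
    ENABLE_OK,
    ENABLE_NEEDS_CQ,
    ENABLE_NEEDS_EQ,
    ENABLE_NEEDS_AV
};

/* Report the first missing binding that would prevent enabling the
 * endpoint, following the rules described in the text. */
enum enable_check check_enable(const struct ep_model *ep)
{
    if (ep->reports_completions && !ep->cq_bound)
        return ENABLE_NEEDS_CQ;
    if (ep->reports_cm_events && !ep->eq_bound)
        return ENABLE_NEEDS_EQ;
    if (ep->connectionless && !ep->av_bound)
        return ENABLE_NEEDS_AV;
    return ENABLE_OK;
}
```

In this model, a connectionless endpoint with a CQ but no address vector
yields ENABLE_NEEDS_AV, mirroring the last rule in the paragraph above.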
174
175 Once an endpoint has been activated, it may be associated with an
176 address vector. Receive buffers may be posted to it and calls may be
177 made to connection establishment routines. Connectionless endpoints
178 may also perform data transfers.
179
180 The behavior of an endpoint may be adjusted by setting its control data
181 and protocol options. This allows the underlying provider to redirect
182 function calls to implementations optimized to meet the desired appli‐
183 cation behavior.
184
185 If an endpoint experiences a critical error, it will transition back
186 into a disabled state. Critical errors are reported through the event
187 queue associated with the EP. In certain cases, a disabled endpoint
188 may be re-enabled. The ability to transition back into an enabled
189 state is provider specific and depends on the type of error that the
190 endpoint experienced. When an endpoint is disabled as a result of a
191 critical error, all pending operations are discarded.
192
193 fi_endpoint / fi_passive_ep / fi_scalable_ep
194 fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
195 new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
196 The properties and behavior of the endpoint are defined based on the
197 provided struct fi_info. See fi_getinfo for additional details on
198 fi_info. fi_info flags that control the operation of an endpoint are
199 defined below. See section SCALABLE ENDPOINTS.
200
201 If an active endpoint is allocated in order to accept a connection
202 request, the fi_info parameter must be the same as the fi_info struc‐
203 ture provided with the connection request (FI_CONNREQ) event.
204
205 An active endpoint may acquire the properties of a passive endpoint by
206 setting the fi_info handle field to the passive endpoint fabric
207 descriptor. This is useful for applications that need to reserve the
208 fabric address of an endpoint prior to knowing if the endpoint will be
209 used on the active or passive side of a connection. For example, this
210 feature is useful for simulating socket semantics. Once an active end‐
211 point acquires the properties of a passive endpoint, the passive end‐
212 point is no longer bound to any fabric resources and must no longer be
213 used. The user is expected to close the passive endpoint after opening
214 the active endpoint in order to free up any lingering resources that
215 had been used.
216
217 fi_close
Closes an endpoint and releases all resources associated with it.

When closing a scalable endpoint, there must be no open transmit or
receive contexts associated with the scalable endpoint. If resources
are still associated with the scalable endpoint when attempting to
close it, the call will return -FI_EBUSY.
224
225 Outstanding operations posted to the endpoint when fi_close is called
226 will be discarded. Discarded operations will silently be dropped, with
227 no completions reported. Additionally, a provider may discard previ‐
228 ously completed operations from the associated completion queue(s).
229 The behavior to discard completed operations is provider specific.
230
231 fi_ep_bind
232 fi_ep_bind is used to associate an endpoint with hardware resources.
233 The common use of fi_ep_bind is to direct asynchronous operations asso‐
234 ciated with an endpoint to a completion queue. An endpoint must be
235 bound with CQs capable of reporting completions for any asynchronous
236 operation initiated on the endpoint. This is true even for endpoints
237 which are configured to suppress successful completions, in order that
238 operations that complete in error may be reported to the user. For
239 passive endpoints, this requires binding the endpoint with an EQ that
240 supports the communication management (CM) domain.
241
242 An active endpoint may direct asynchronous completions to different
243 CQs, based on the type of operation. This is specified using
244 fi_ep_bind flags. The following flags may be used separately or OR'ed
245 together when binding an endpoint to a completion domain CQ.
246
247 FI_TRANSMIT : Directs the completion of outbound data transfer requests
248 to the specified completion queue. This includes send message, RMA,
249 and atomic operations.
250
251 FI_RECV : Directs the notification of inbound data transfers to the
252 specified completion queue. This includes received messages. This
253 binding automatically includes FI_REMOTE_WRITE, if applicable to the
254 endpoint.
255
256 FI_SELECTIVE_COMPLETION : By default, data transfer operations generate
257 completion entries into a completion queue after they have successfully
258 completed. Applications can use this bind flag to selectively enable
259 when completions are generated. If FI_SELECTIVE_COMPLETION is speci‐
260 fied, data transfer operations will not generate entries for successful
261 completions unless FI_COMPLETION is set as an operational flag for the
262 given operation. FI_SELECTIVE_COMPLETION must be OR'ed with FI_TRANS‐
263 MIT and/or FI_RECV flags.
264
When FI_SELECTIVE_COMPLETION is set, the user must determine
indirectly when a request that does NOT have FI_COMPLETION set has
completed, usually based on the completion of a subsequent operation.
Use of this flag may improve performance by allowing the provider to
avoid writing a completion entry for every operation.
270
271 Example: An application can selectively generate send completions by
272 using the following general approach:
273
fi_tx_attr::op_flags = 0;            // default - no completion
fi_ep_bind(ep, cq, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
fi_send(ep, ...);                    // no completion
fi_sendv(ep, ...);                   // no completion
fi_sendmsg(ep, ..., FI_COMPLETION);  // completion!
fi_inject(ep, ...);                  // no completion
280
281 Example: An application can selectively disable send completions by
282 modifying the operational flags:
283
fi_tx_attr::op_flags = FI_COMPLETION;  // default - completion
fi_ep_bind(ep, cq, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
fi_send(ep, ...);                      // completion
fi_sendv(ep, ...);                     // completion
fi_sendmsg(ep, ..., 0);                // no completion!
fi_inject(ep, ...);                    // no completion!
290
291 Example: Omitting FI_SELECTIVE_COMPLETION when binding will generate
292 completions for all non-fi_inject calls:
293
fi_tx_attr::op_flags = 0;
fi_ep_bind(ep, cq, FI_TRANSMIT);                // default - completion
fi_send(ep, ...);                               // completion
fi_sendv(ep, ...);                              // completion
fi_sendmsg(ep, ..., 0);                         // completion!
fi_sendmsg(ep, ..., FI_COMPLETION);             // completion
fi_sendmsg(ep, ..., FI_INJECT|FI_COMPLETION);   // completion!
fi_inject(ep, ...);                             // no completion!
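
The three examples above follow one rule for successful transmit
completions, which can be written as a small predicate. This is a
self-contained sketch: the MODEL_* bit values are stand-ins chosen for
illustration (real code should use the FI_* constants from
<rdma/fabric.h>), and generates_completion is a hypothetical helper, not
a libfabric call.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit values for illustration only; not the real FI_* values. */
#define MODEL_COMPLETION           (UINT64_C(1) << 0)
#define MODEL_SELECTIVE_COMPLETION (UINT64_C(1) << 1)

/* Decide whether a successful send writes a CQ entry.
 * bind_flags:   flags passed when binding the CQ to the endpoint.
 * op_flags:     flags in effect for the operation: the default
 *               fi_tx_attr op_flags for fi_send/fi_sendv, or the flags
 *               argument for fi_sendmsg.
 * is_fi_inject: true when the call is fi_inject itself. */
bool generates_completion(uint64_t bind_flags, uint64_t op_flags,
                          bool is_fi_inject)
{
    if (is_fi_inject)
        return false;   /* fi_inject never reports a completion */
    if (!(bind_flags & MODEL_SELECTIVE_COMPLETION))
        return true;    /* without selective completion, every other call does */
    return (op_flags & MODEL_COMPLETION) != 0;
}
```

The predicate covers successful completions only; as noted earlier,
operations that complete in error are still reported.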
302
303 An endpoint may also be bound to a fabric counter. When binding an
304 endpoint to a counter, the following flags may be specified.
305
306 FI_SEND : Increments the specified counter whenever a message transfer
307 initiated over the endpoint has completed successfully or in error.
308 Sent messages include both tagged and normal message operations.
309
310 FI_RECV : Increments the specified counter whenever a message is
311 received over the endpoint. Received messages include both tagged and
312 normal message operations.
313
314 FI_READ : Increments the specified counter whenever an RMA read or
315 atomic fetch operation initiated from the endpoint has completed suc‐
316 cessfully or in error.
317
318 FI_WRITE : Increments the specified counter whenever an RMA write or
319 atomic operation initiated from the endpoint has completed successfully
320 or in error.
321
322 FI_REMOTE_READ : Increments the specified counter whenever an RMA read
323 or atomic fetch operation is initiated from a remote endpoint that tar‐
324 gets the given endpoint. Use of this flag requires that the endpoint
325 be created using FI_RMA_EVENT.
326
327 FI_REMOTE_WRITE : Increments the specified counter whenever an RMA
328 write or atomic operation is initiated from a remote endpoint that tar‐
329 gets the given endpoint. Use of this flag requires that the endpoint
330 be created using FI_RMA_EVENT.
331
An endpoint may only be bound to a single CQ or counter for a given
type of operation. For example, an EP may not bind to two counters both
using FI_WRITE. Furthermore, providers may limit CQ and counter
bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
336
337 Connectionless endpoints must be bound to a single address vector.
338
339 If an endpoint is using a shared transmit and/or receive context, the
340 shared contexts must be bound to the endpoint. CQs, counters, AV, and
341 shared contexts must be bound to endpoints before they are enabled.
342
343 fi_scalable_ep_bind
344 fi_scalable_ep_bind is used to associate a scalable endpoint with an
345 address vector. See section on SCALABLE ENDPOINTS. A scalable end‐
346 point has a single transport level address and can support multiple
347 transmit and receive contexts. The transmit and receive contexts share
348 the transport-level address. Address vectors that are bound to scal‐
349 able endpoints are implicitly bound to any transmit or receive contexts
350 created using the scalable endpoint.
351
352 fi_enable
353 This call transitions the endpoint into an enabled state. An endpoint
354 must be enabled before it may be used to perform data transfers.
355 Enabling an endpoint typically results in hardware resources being
356 assigned to it. Endpoints making use of completion queues, counters,
357 event queues, and/or address vectors must be bound to them before being
358 enabled.
359
360 Calling connect or accept on an endpoint will implicitly enable an end‐
361 point if it has not already been enabled.
362
fi_enable may also be used to re-enable an endpoint that has been
disabled as a result of experiencing a critical error. Applications
should check the return value from fi_enable to see if a disabled
endpoint has successfully been re-enabled.
367
368 fi_cancel
369 fi_cancel attempts to cancel an outstanding asynchronous operation.
370 Canceling an operation causes the fabric provider to search for the
371 operation and, if it is still pending, complete it as having been can‐
372 celed. An error queue entry will be available in the associated error
373 queue with error code FI_ECANCELED. On the other hand, if the opera‐
374 tion completed before the call to fi_cancel, then the completion status
375 of that operation will be available in the associated completion queue.
376 No specific entry related to fi_cancel itself will be posted.
377
378 Cancel uses the context parameter associated with an operation to iden‐
379 tify the request to cancel. Operations posted without a valid context
380 parameter -- either no context parameter is specified or the context
381 value was ignored by the provider -- cannot be canceled. If multiple
382 outstanding operations match the context parameter, only one will be
383 canceled. In this case, the operation which is canceled is provider
384 specific. The cancel operation is asynchronous, but will complete
385 within a bounded period of time.
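
The context-matching rules above can be illustrated with a toy model of
a pending-operation list. Everything here (struct op_model, enum
op_state, model_cancel) is hypothetical and exists only to show the
matching logic. The real fi_cancel is asynchronous and reports the
result through the error queue rather than a return value, and which of
several matching operations is canceled is provider specific; this
sketch simply picks the first.

```c
#include <stdbool.h>
#include <stddef.h>

enum op_state { OP_PENDING, OP_COMPLETED, OP_CANCELED };

/* Hypothetical record of one posted operation. */
struct op_model {
    void *context;        /* context supplied when the operation was posted */
    enum op_state state;
};

/* Cancel at most one pending operation whose context matches.
 * Returns the index of the canceled operation, or -1 if nothing was
 * canceled (no match, already completed, or no usable context). */
int model_cancel(struct op_model *ops, size_t count, void *context)
{
    if (context == NULL)
        return -1;  /* operations without a context cannot be identified */
    for (size_t i = 0; i < count; i++) {
        if (ops[i].state == OP_PENDING && ops[i].context == context) {
            ops[i].state = OP_CANCELED;
            return (int)i;  /* only one matching operation is canceled */
        }
    }
    return -1;
}
```

An operation that already completed is left untouched, which corresponds
to its normal completion remaining available in the completion queue.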
386
387 fi_ep_alias
388 This call creates an alias to the specified endpoint. Conceptually, an
389 endpoint alias provides an alternate software path from the application
390 to the underlying provider hardware. An alias EP differs from its par‐
391 ent endpoint only by its default data transfer flags. For example, an
392 alias EP may be configured to use a different completion mode. By
393 default, an alias EP inherits the same data transfer flags as the par‐
394 ent endpoint. An application can use fi_control to modify the alias EP
395 operational flags.
396
397 When allocating an alias, an application may configure either the
398 transmit or receive operational flags. This avoids needing a separate
399 call to fi_control to set those flags. The flags passed to fi_ep_alias
400 must include FI_TRANSMIT or FI_RECV (not both) with other operational
401 flags OR'ed in. This will override the transmit or receive flags,
402 respectively, for operations posted through the alias endpoint. All
403 allocated aliases must be closed for the underlying endpoint to be
404 released.
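
The "FI_TRANSMIT or FI_RECV (not both)" requirement amounts to an
exclusive-or test on the two direction bits. The sketch below uses
stand-in bit values (MODEL_TRANSMIT, MODEL_RECV) rather than the real
FI_* constants, and direction_flags_valid is a hypothetical helper an
application might use before calling fi_ep_alias.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit values for illustration only; not the real FI_* values. */
#define MODEL_TRANSMIT (UINT64_C(1) << 0)
#define MODEL_RECV     (UINT64_C(1) << 1)

/* True when exactly one of the two direction bits is set; other
 * operational flags OR'ed into the value are ignored by this check. */
bool direction_flags_valid(uint64_t flags)
{
    bool tx = (flags & MODEL_TRANSMIT) != 0;
    bool rx = (flags & MODEL_RECV) != 0;
    return tx != rx;
}
```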
405
406 fi_control
The control operation is used to adjust the default behavior of an
endpoint. It allows the underlying provider to redirect function calls
to implementations optimized to meet the desired application behavior.
As a result, calls to fi_control must be serialized against all other
calls to an endpoint.
412
413 The base operation of an endpoint is selected during creation using
414 struct fi_info. The following control commands and arguments may be
415 assigned to an endpoint.
416
FI_GETOPSFLAG -- uint64_t *flags : Used to retrieve the current
value of flags associated with the data transfer operations initiated
on the endpoint. The control argument must include FI_TRANSMIT or
FI_RECV (not both) flags to indicate the type of data transfer flags to
be returned. See below for a list of control flags.

FI_SETOPSFLAG -- uint64_t *flags : Used to change the data transfer
operation flags associated with an endpoint. The control argument must
include FI_TRANSMIT or FI_RECV (not both) to indicate the type of data
transfer that the flags should apply to, with other flags OR'ed in.
The given flags will override the previous transmit and receive
attributes that were set when the endpoint was created. Valid control
flags are defined below.

FI_BACKLOG -- int *value : This option only applies to passive
endpoints. It is used to set the connection request backlog for
listening endpoints.
434
FI_GETWAIT (void **) : This command allows the user to retrieve the
file descriptor associated with a socket endpoint. The fi_control arg
parameter should be an address where a pointer to the returned file
descriptor will be written. See fi_eq(3) for additional details on
using fi_control with FI_GETWAIT. The file descriptor may be used for
notification that the endpoint is ready to send or receive data.
441
442 fi_getopt / fi_setopt
Endpoint protocol options may be retrieved using fi_getopt or set
using fi_setopt. Applications specify the level at which a desired
option exists, identify the option, and provide input/output buffers to
get or set the option. fi_setopt provides an application a way to
adjust low-level protocol and implementation specific details of an
endpoint.
448
449 The following option levels and option names and parameters are
450 defined.
451
452 FI_OPT_ENDPOINT
453
454 · FI_OPT_MIN_MULTI_RECV - size_t : Defines the minimum receive buffer
455 space available when the receive buffer is released by the provider
456 (see FI_MULTI_RECV). Modifying this value is only guaranteed to set
457 the minimum buffer space needed on receives posted after the value
458 has been changed. It is recommended that applications that want to
459 override the default MIN_MULTI_RECV value set this option before
460 enabling the corresponding endpoint.
461
462 · FI_OPT_CM_DATA_SIZE - size_t : Defines the size of available space in
463 CM messages for user-defined data. This value limits the amount of
464 data that applications can exchange between peer endpoints using the
465 fi_connect, fi_accept, and fi_reject operations. The size returned
466 is dependent upon the properties of the endpoint, except in the case
467 of passive endpoints, in which the size reflects the maximum size of
468 the data that may be present as part of a connection request event.
469 This option is read only.
470
471 fi_rx_size_left (DEPRECATED)
472 This function has been deprecated and will be removed in a future ver‐
473 sion of the library. It may not be supported by all providers.
474
475 The fi_rx_size_left call returns a lower bound on the number of receive
476 operations that may be posted to the given endpoint without that opera‐
477 tion returning -FI_EAGAIN. Depending on the specific details of the
478 subsequently posted receive operations (e.g., number of iov entries,
479 which receive function is called, etc.), it may be possible to post
480 more receive operations than originally indicated by fi_rx_size_left.
481
482 fi_tx_size_left (DEPRECATED)
483 This function has been deprecated and will be removed in a future ver‐
484 sion of the library. It may not be supported by all providers.
485
486 The fi_tx_size_left call returns a lower bound on the number of trans‐
487 mit operations that may be posted to the given endpoint without that
488 operation returning -FI_EAGAIN. Depending on the specific details of
489 the subsequently posted transmit operations (e.g., number of iov
490 entries, which transmit function is called, etc.), it may be possible
491 to post more transmit operations than originally indicated by
492 fi_tx_size_left.
493
495 The fi_ep_attr structure defines the set of attributes associated with
496 an endpoint. Endpoint attributes may be further refined using the
497 transmit and receive context attributes as shown below.
498
499 struct fi_ep_attr {
500 enum fi_ep_type type;
501 uint32_t protocol;
502 uint32_t protocol_version;
503 size_t max_msg_size;
504 size_t msg_prefix_size;
505 size_t max_order_raw_size;
506 size_t max_order_war_size;
507 size_t max_order_waw_size;
508 uint64_t mem_tag_format;
509 size_t tx_ctx_cnt;
510 size_t rx_ctx_cnt;
511 size_t auth_key_size;
512 uint8_t *auth_key;
513 };
514
515 type - Endpoint Type
516 If specified, indicates the type of fabric interface communication
517 desired. Supported types are:
518
519 FI_EP_UNSPEC : The type of endpoint is not specified. This is usually
520 provided as input, with other attributes of the endpoint or the
521 provider selecting the type.
522
523 FI_EP_MSG : Provides a reliable, connection-oriented data transfer ser‐
524 vice with flow control that maintains message boundaries.
525
526 FI_EP_DGRAM : Supports a connectionless, unreliable datagram communica‐
527 tion. Message boundaries are maintained, but the maximum message size
528 may be limited to the fabric MTU. Flow control is not guaranteed.
529
530 FI_EP_RDM : Reliable datagram message. Provides a reliable, uncon‐
531 nected data transfer service with flow control that maintains message
532 boundaries.
533
534 FI_EP_SOCK_STREAM : Data streaming endpoint with TCP socket-like seman‐
535 tics. Provides a reliable, connection-oriented data transfer service
536 that does not maintain message boundaries. FI_EP_SOCK_STREAM is most
537 useful for applications designed around using TCP sockets. See the
538 SOCKET ENDPOINT section for additional details and restrictions that
539 apply to stream endpoints.
540
541 FI_EP_SOCK_DGRAM : A connectionless, unreliable datagram endpoint with
542 UDP socket-like semantics. FI_EP_SOCK_DGRAM is most useful for appli‐
543 cations designed around using UDP sockets. See the SOCKET ENDPOINT
544 section for additional details and restrictions that apply to datagram
545 socket endpoints.
546
protocol - Protocol
548 Specifies the low-level end to end protocol employed by the provider.
549 A matching protocol must be used by communicating endpoints to ensure
550 interoperability. The following protocol values are defined. Provider
551 specific protocols are also allowed. Provider specific protocols will
552 be indicated by having the upper bit of the protocol value set to one.
553
554 FI_PROTO_UNSPEC : The protocol is not specified. This is usually pro‐
555 vided as input, with other attributes of the socket or the provider
556 selecting the actual protocol.
557
FI_PROTO_RDMA_CM_IB_RC : The protocol runs over InfiniBand
reliable-connected queue pairs, using the RDMA CM protocol for
connection establishment.
561
562 FI_PROTO_IWARP : The protocol runs over the Internet wide area RDMA
563 protocol transport.
564
FI_PROTO_IB_UD : The protocol runs over InfiniBand unreliable datagram
queue pairs.
567
568 FI_PROTO_PSMX : The protocol is based on an Intel proprietary protocol
569 known as PSM, performance scaled messaging. PSMX is an extended ver‐
570 sion of the PSM protocol to support the libfabric interfaces.
571
572 FI_PROTO_UDP : The protocol sends and receives UDP datagrams. For
573 example, an endpoint using FI_PROTO_UDP will be able to communicate
574 with a remote peer that is using Berkeley SOCK_DGRAM sockets using
575 IPPROTO_UDP.
576
577 FI_PROTO_SOCK_TCP : The protocol is layered over TCP packets.
578
579 FI_PROTO_IWARP_RDM : Reliable-datagram protocol implemented over iWarp
580 reliable-connected queue pairs.
581
582 FI_PROTO_IB_RDM : Reliable-datagram protocol implemented over Infini‐
583 Band reliable-connected queue pairs.
584
585 FI_PROTO_GNI : Protocol runs over Cray GNI low-level interface.
586
587 FI_PROTO_RXM : Reliable-datagram protocol implemented over message end‐
588 points. RXM is a libfabric utility component that adds RDM endpoint
589 semantics over MSG endpoint semantics.
590
591 FI_PROTO_RXD : Reliable-datagram protocol implemented over datagram
592 endpoints. RXD is a libfabric utility component that adds RDM endpoint
593 semantics over DGRAM endpoint semantics.
594
FI_PROTO_NETWORKDIRECT : Protocol runs over Microsoft NetworkDirect
service provider interface. This adds reliable-datagram semantics over
the NetworkDirect connection-oriented endpoint semantics.
598
599 FI_PROTO_PSMX2 : The protocol is based on an Intel proprietary protocol
600 known as PSM2, performance scaled messaging version 2. PSMX2 is an
601 extended version of the PSM2 protocol to support the libfabric inter‐
602 faces.
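
As stated at the start of this section, provider specific protocols are
marked by the upper bit of the protocol value. Since the protocol field
in struct fi_ep_attr is a uint32_t, that is bit 31. The helper below is
a hypothetical illustration of that test, not a libfabric function.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the upper bit (bit 31) of a 32-bit protocol value is set,
 * which the text describes as the marker for provider specific protocols. */
bool protocol_is_provider_specific(uint32_t protocol)
{
    return (protocol & (UINT32_C(1) << 31)) != 0;
}
```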
603
604 protocol_version - Protocol Version
Identifies which version of the protocol is employed by the provider.
The protocol version allows providers to extend an existing protocol,
for example by adding support for additional features or functionality,
in a backward compatible manner. Providers that support different
versions of the same protocol should inter-operate, but only when using
the capabilities defined for the lesser version.
611
612 max_msg_size - Max Message Size
613 Defines the maximum size for an application data transfer as a single
614 operation.
615
616 msg_prefix_size - Message Prefix Size
Specifies the size of any required message prefix buffer space. This
field will be 0 unless the FI_MSG_PREFIX mode is enabled. If
msg_prefix_size is greater than 0, the specified value will be a
multiple of 8 bytes.
620
621 Max RMA Ordered Size
The maximum ordered size specifies the delivery order of transport data
into target memory for RMA and atomic operations. Data ordering is
separate from, but dependent on, message ordering (defined below). Data
ordering is unspecified where message order is not defined.
626
Data ordering refers to the access of target memory by subsequent
operations. When back-to-back RMA read or write operations access the
same registered memory location, data ordering indicates whether the
second operation reads or writes the target memory after the first
operation has completed. Because RMA ordering applies between two
operations, and not within a single data transfer, ordering is defined
per byte-addressable memory location. That is, ordering specifies
whether location X is accessed by the second operation after the first
operation. Nothing is implied about the completion of the first
operation before the second operation is initiated.
637
638 In order to support large data transfers being broken into multiple
639 packets and sent using multiple paths through the fabric, data ordering
640 may be limited to transfers of a specific size or less. Providers
641 specify when data ordering is maintained through the following values.
642 Note that even if data ordering is not maintained, message ordering may
643 be.
644
645 max_order_raw_size : Read after write size. If set, an RMA or atomic
646 read operation issued after an RMA or atomic write operation, both of
647 which are smaller than the size, will be ordered. Where the target
648 memory locations overlap, the RMA or atomic read operation will see the
649 results of the previous RMA or atomic write.
650
651 max_order_war_size : Write after read size. If set, an RMA or atomic
652 write operation issued after an RMA or atomic read operation, both of
653 which are smaller than the size, will be ordered. The RMA or atomic
654 read operation will see the initial value of the target memory location
655 before a subsequent RMA or atomic write updates the value.
656
657 max_order_waw_size : Write after write size. If set, an RMA or atomic
658 write operation issued after an RMA or atomic write operation, both of
659 which are smaller than the size, will be ordered. The target memory
660 location will reflect the results of the second RMA or atomic write.
661
662 An order size value of 0 indicates that ordering is not guaranteed. A
663 value of -1 guarantees ordering for any data size.
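
The interpretation of the three max_order_*_size values can be sketched
as a single predicate. ordering_guaranteed is a hypothetical helper, not
a libfabric call. It assumes the size limit is inclusive ("a specific
size or less"); the per-field descriptions say "smaller than the size",
so an application that needs to be conservative may prefer a strict
comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Decide whether two back-to-back operations of the given lengths are
 * covered by an ordering guarantee of max_order_size bytes.
 * Assumption: the limit is treated as inclusive. */
bool ordering_guaranteed(size_t max_order_size, size_t first_len,
                         size_t second_len)
{
    if (max_order_size == 0)
        return false;               /* ordering is not guaranteed */
    if (max_order_size == (size_t)-1)
        return true;                /* ordered for any data size */
    return first_len <= max_order_size && second_len <= max_order_size;
}
```

The same check applies to the read-after-write, write-after-read, and
write-after-write limits; only the value passed as max_order_size
changes.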
664
665 mem_tag_format - Memory Tag Format
666 The memory tag format is a bit array used to convey the number of
667 tagged bits supported by a provider. Additionally, it may be used to
668 divide the bit array into separate fields. The mem_tag_format option‐
669 ally begins with a series of bits set to 0, to signify bits which are
670 ignored by the provider. Following the initial prefix of ignored bits,
671 the array will consist of alternating groups of bits set to all 1's or
672 all 0's. Each group of bits corresponds to a tagged field. The impli‐
673 cation of defining a tagged field is that when a mask is applied to the
674 tagged bit array, all bits belonging to a single field will either be
675 set to 1 or 0, collectively.
676
677 For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
678 bits, separated into 3 fields. The first field consists of 2-bits, the
679 second field 4-bits, and the final field 8-bits. Valid masks for such
680 a tagged field would be a bitwise OR'ing of zero or more of the follow‐
681 ing values: 0x3000, 0x0F00, and 0x00FF.
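
The field layout described above can be recovered from a format value
with a few bit operations. The following is a minimal sketch, not
provider code: the names tag_fields, tag_mask_valid, and MAX_TAG_FIELDS
are invented for this example. It treats any mask bit that falls in the
ignored prefix as invalid, which is an assumption taken from the list of
valid masks above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TAG_FIELDS 64   /* hypothetical bound: one field per bit */

/*
 * Split a mem_tag_format value into per-field masks, most significant
 * field first. Returns the number of fields found.
 */
static int tag_fields(uint64_t fmt, uint64_t fields[MAX_TAG_FIELDS])
{
    int n = 0;
    int bit = 63;

    /* Skip the optional prefix of ignored (zero) bits. */
    while (bit >= 0 && ((fmt >> bit) & 1) == 0)
        bit--;

    while (bit >= 0) {
        uint64_t val = (fmt >> bit) & 1;
        uint64_t mask = 0;

        /* One field is a run of identical bits (all 1's or all 0's). */
        while (bit >= 0 && ((fmt >> bit) & 1) == val) {
            mask |= UINT64_C(1) << bit;
            bit--;
        }
        fields[n++] = mask;
    }
    return n;
}

/* A mask is valid if every field is either fully set or fully clear. */
static bool tag_mask_valid(uint64_t fmt, uint64_t mask)
{
    uint64_t fields[MAX_TAG_FIELDS];
    uint64_t covered = 0;
    int n = tag_fields(fmt, fields);

    for (int i = 0; i < n; i++) {
        uint64_t part = mask & fields[i];

        if (part != 0 && part != fields[i])
            return false;       /* field only partially selected */
        covered |= fields[i];
    }
    return (mask & ~covered) == 0;  /* no bits in the ignored prefix */
}
```

For the format 0x30FF, tag_fields yields the three masks 0x3000, 0x0F00,
and 0x00FF, matching the example above.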
682
683 By identifying fields within a tag, a provider may be able to optimize
684 their search routines. An application which requests tag fields must
685 provide tag masks that either set all mask bits corresponding to a
686 field to all 0 or all 1. When negotiating tag fields, an application
687 can request a specific number of fields of a given size. A provider
688 must return a tag format that supports the requested number of fields,
689 with each field being at least the size requested, or fail the request.
690 A provider may increase the size of the fields. When reporting comple‐
691 tions (see FI_CQ_FORMAT_TAGGED), the provider must provide the exact
692 value of the received tag, clearing out any unsupported tag bits.
693
694 It is recommended that field sizes be ordered from smallest to largest.
695 A generic, unstructured tag and mask can be achieved by requesting a
696 bit array consisting of alternating 1's and 0's.
697
698 tx_ctx_cnt - Transmit Context Count
699 Number of transmit contexts to associate with the endpoint. If not
700 specified (0), 1 context will be assigned if the endpoint supports out‐
701 bound transfers. Transmit contexts are independent transmit queues
702 that may be separately configured. Each transmit context may be bound
703 to a separate CQ, and no ordering is defined between contexts. Addi‐
704 tionally, no synchronization is needed when accessing contexts in par‐
705 allel.
706
707 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
708 be configured to use a shared transmit context, if supported by the
709 provider. Providers that do not support shared transmit contexts will
710 fail the request.
711
712 See the scalable endpoint and shared contexts sections for additional
713 details.
714
715 rx_ctx_cnt - Receive Context Count
Number of receive contexts to associate with the endpoint. If not
specified (0), 1 context will be assigned if the endpoint supports inbound
718 transfers. Receive contexts are independent processing queues that may
719 be separately configured. Each receive context may be bound to a sepa‐
720 rate CQ, and no ordering is defined between contexts. Additionally, no
721 synchronization is needed when accessing contexts in parallel.
722
723 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
724 be configured to use a shared receive context, if supported by the
725 provider. Providers that do not support shared receive contexts will
726 fail the request.
727
728 See the scalable endpoint and shared contexts sections for additional
729 details.
730
731 auth_key_size - Authorization Key Length
732 The length of the authorization key in bytes. This field will be 0 if
733 authorization keys are not available or used. This field is ignored
734 unless the fabric is opened with API version 1.5 or greater.
735
736 auth_key - Authorization Key
737 If supported by the fabric, an authorization key (a.k.a. job key) to
738 associate with the endpoint. An authorization key is used to limit
739 communication between endpoints. Only peer endpoints that are pro‐
740 grammed to use the same authorization key may communicate. Authoriza‐
741 tion keys are often used to implement job keys, to ensure that pro‐
742 cesses running in different jobs do not accidentally cross traffic.
743 The domain authorization key will be used if auth_key_size is set to 0.
744 This field is ignored unless the fabric is opened with API version 1.5
745 or greater.
746
748 Attributes specific to the transmit capabilities of an endpoint are
749 specified using struct fi_tx_attr.
750
751 struct fi_tx_attr {
752 uint64_t caps;
753 uint64_t mode;
754 uint64_t op_flags;
755 uint64_t msg_order;
756 uint64_t comp_order;
757 size_t inject_size;
758 size_t size;
759 size_t iov_limit;
760 size_t rma_iov_limit;
761 };
762
763 caps - Capabilities
764 The requested capabilities of the context. The capabilities must be a
765 subset of those requested of the associated endpoint. See the CAPABIL‐
766 ITIES section of fi_getinfo(3) for capability details. If the caps
767 field is 0 on input to fi_getinfo(3), the caps value from the fi_info
768 structure will be used.
769
770 mode
771 The operational mode bits of the context. The mode bits will be a sub‐
772 set of those associated with the endpoint. See the MODE section of
773 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
774 to fi_getinfo(3), with the mode value of the fi_info structure used
775 instead. On return from fi_getinfo(3), the mode will be set only to
776 those constraints specific to transmit operations.
777
778 op_flags - Default transmit operation flags
Flags that control the behavior of operations submitted against the
780 context. Applicable flags are listed in the Operation Flags section.
781
782 msg_order - Message Ordering
783 Message ordering refers to the order in which transport layer headers
784 (as viewed by the application) are processed. Relaxed message order
785 enables data transfers to be sent and received out of order, which may
786 improve performance by utilizing multiple paths through the fabric from
787 the initiating endpoint to a target endpoint. Message order applies
788 only between a single source and destination endpoint pair. Ordering
789 between different target endpoints is not defined.
790
791 Message order is determined using a set of ordering bits. Each set bit
792 indicates that ordering is maintained between data transfers of the
793 specified type. Message order is defined for [read | write | send]
794 operations submitted by an application after [read | write | send]
795 operations.
796
Message ordering only applies to the end-to-end transmission of transport
headers. Message ordering is necessary for, but does not by itself
guarantee, the order in which message data is sent or received by the transport
800 layer. Message ordering requires matching ordering semantics on the
801 receiving side of a data transfer operation in order to guarantee that
802 ordering is met.
803
804 FI_ORDER_NONE : No ordering is specified. This value may be used as
805 input in order to obtain the default message order supported by the
806 provider. FI_ORDER_NONE is an alias for the value 0.
807
808 FI_ORDER_RAR : Read after read. If set, RMA and atomic read operations
809 are transmitted in the order submitted relative to other RMA and atomic
810 read operations. If not set, RMA and atomic reads may be transmitted
811 out of order from their submission.
812
813 FI_ORDER_RAW : Read after write. If set, RMA and atomic read opera‐
814 tions are transmitted in the order submitted relative to RMA and atomic
815 write operations. If not set, RMA and atomic reads may be transmitted
816 ahead of RMA and atomic writes.
817
818 FI_ORDER_RAS : Read after send. If set, RMA and atomic read operations
819 are transmitted in the order submitted relative to message send opera‐
820 tions, including tagged sends. If not set, RMA and atomic reads may be
821 transmitted ahead of sends.
822
823 FI_ORDER_WAR : Write after read. If set, RMA and atomic write opera‐
824 tions are transmitted in the order submitted relative to RMA and atomic
825 read operations. If not set, RMA and atomic writes may be transmitted
826 ahead of RMA and atomic reads.
827
828 FI_ORDER_WAW : Write after write. If set, RMA and atomic write opera‐
829 tions are transmitted in the order submitted relative to other RMA and
830 atomic write operations. If not set, RMA and atomic writes may be
831 transmitted out of order from their submission.
832
833 FI_ORDER_WAS : Write after send. If set, RMA and atomic write opera‐
834 tions are transmitted in the order submitted relative to message send
835 operations, including tagged sends. If not set, RMA and atomic writes
836 may be transmitted ahead of sends.
837
838 FI_ORDER_SAR : Send after read. If set, message send operations,
839 including tagged sends, are transmitted in order submitted relative to
840 RMA and atomic read operations. If not set, message sends may be
841 transmitted ahead of RMA and atomic reads.
842
843 FI_ORDER_SAW : Send after write. If set, message send operations,
844 including tagged sends, are transmitted in order submitted relative to
845 RMA and atomic write operations. If not set, message sends may be
846 transmitted ahead of RMA and atomic writes.
847
848 FI_ORDER_SAS : Send after send. If set, message send operations,
849 including tagged sends, are transmitted in the order submitted relative
to other message sends. If not set, message sends may be transmitted
851 out of order from their submission.
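
The nine ordering bits form a 3x3 table indexed by the later and earlier
operation types. The sketch below shows one way an application might look
up the bit that governs a pair of operations. The EX_ORDER_* values and
the ex_* names are stand-ins defined only for this example; real code
would use the FI_ORDER_* constants from the libfabric headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bits for illustration only; not the libfabric definitions. */
#define EX_ORDER_RAR (UINT64_C(1) << 0)
#define EX_ORDER_RAW (UINT64_C(1) << 1)
#define EX_ORDER_RAS (UINT64_C(1) << 2)
#define EX_ORDER_WAR (UINT64_C(1) << 3)
#define EX_ORDER_WAW (UINT64_C(1) << 4)
#define EX_ORDER_WAS (UINT64_C(1) << 5)
#define EX_ORDER_SAR (UINT64_C(1) << 6)
#define EX_ORDER_SAW (UINT64_C(1) << 7)
#define EX_ORDER_SAS (UINT64_C(1) << 8)

enum ex_op { EX_OP_READ, EX_OP_WRITE, EX_OP_SEND };

/* Return the "<later> after <earlier>" bit for a pair of operations. */
static uint64_t ex_order_bit(enum ex_op later, enum ex_op earlier)
{
    static const uint64_t bits[3][3] = {
        /* later = read  */ { EX_ORDER_RAR, EX_ORDER_RAW, EX_ORDER_RAS },
        /* later = write */ { EX_ORDER_WAR, EX_ORDER_WAW, EX_ORDER_WAS },
        /* later = send  */ { EX_ORDER_SAR, EX_ORDER_SAW, EX_ORDER_SAS },
    };
    return bits[later][earlier];
}

/* True if msg_order keeps `later` behind `earlier` in submission order. */
static bool ex_pair_is_ordered(uint64_t msg_order,
                               enum ex_op earlier, enum ex_op later)
{
    return (msg_order & ex_order_bit(later, earlier)) != 0;
}
```

Note that the relation is directional: a read submitted after a write is
governed by the RAW bit, while a write submitted after a read is governed
by the WAR bit.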
852
853 comp_order - Completion Ordering
854 Completion ordering refers to the order in which completed requests are
855 written into the completion queue. Completion ordering is similar to
856 message order. Relaxed completion order may enable faster reporting of
857 completed transfers, allow acknowledgments to be sent over different
858 fabric paths, and support more sophisticated retry mechanisms. This
859 can result in lower-latency completions, particularly when using uncon‐
860 nected endpoints. Strict completion ordering may require that
861 providers queue completed operations or limit available optimizations.
862
863 For transmit requests, completion ordering depends on the endpoint com‐
864 munication type. For unreliable communication, completion ordering
865 applies to all data transfer requests submitted to an endpoint. For
866 reliable communication, completion ordering only applies to requests
867 that target a single destination endpoint. Completion ordering of
868 requests that target different endpoints over a reliable transport is
869 not defined.
870
871 Applications should specify the completion ordering that they support
or require. Providers should return the completion order that they
actually provide, with the constraint that the returned ordering is at
least as strict as that specified by the application. Supported completion
875 order values are:
876
877 FI_ORDER_NONE : No ordering is defined for completed operations.
878 Requests submitted to the transmit context may complete in any order.
879
880 FI_ORDER_STRICT : Requests complete in the order in which they are sub‐
881 mitted to the transmit context.
882
883 inject_size
884 The requested inject operation size (see the FI_INJECT flag) that the
885 context will support. This is the maximum size data transfer that can
886 be associated with an inject operation (such as fi_inject) or may be
887 used with the FI_INJECT data transfer flag.
888
889 size
890 The size of the context. The size is specified as the minimum number
891 of transmit operations that may be posted to the endpoint without the
892 operation returning -FI_EAGAIN.
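
To make the relationship between the context size and -FI_EAGAIN
concrete, the sketch below models a transmit context as a simple counter.
Everything here is a toy: struct sim_tx_ctx and the sim_* functions are
hypothetical and do not call into libfabric, and the model treats size as
an exact capacity even though a provider may accept more. It also assumes
that FI_EAGAIN has the same value as the system EAGAIN.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model of a transmit context, for illustration only. */
struct sim_tx_ctx {
    size_t size;        /* stand-in for fi_tx_attr.size */
    size_t in_flight;   /* operations posted but not yet completed */
};

/* Post one operation; fails with -EAGAIN when the queue is full. */
static int sim_post(struct sim_tx_ctx *ctx)
{
    if (ctx->in_flight >= ctx->size)
        return -EAGAIN;
    ctx->in_flight++;
    return 0;
}

/* Retire one completed operation, as reading a CQ entry would.
 * Returns the number of operations retired (0 or 1). */
static size_t sim_reap(struct sim_tx_ctx *ctx)
{
    if (ctx->in_flight == 0)
        return 0;
    ctx->in_flight--;
    return 1;
}

/* Post an operation, draining one completion whenever the queue is full. */
static int sim_post_retry(struct sim_tx_ctx *ctx)
{
    int ret;

    while ((ret = sim_post(ctx)) == -EAGAIN) {
        if (sim_reap(ctx) == 0)
            break;      /* nothing to drain: give up rather than spin */
    }
    return ret;
}
```

In a real application the drain step would typically read the completion
queue bound to the context before retrying the post.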
893
894 iov_limit
895 This is the maximum number of IO vectors (scatter-gather elements) that
896 a single posted operation may reference.
897
898 rma_iov_limit
899 This is the maximum number of RMA IO vectors (scatter-gather elements)
900 that an RMA or atomic operation may reference. The rma_iov_limit cor‐
901 responds to the rma_iov_count values in RMA and atomic operations. See
902 struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
903 for additional details. This limit applies to both the number of RMA
904 IO vectors that may be specified when initiating an operation from the
905 local endpoint, as well as the maximum number of IO vectors that may be
906 carried in a single request from a remote endpoint.
907
909 Attributes specific to the receive capabilities of an endpoint are
910 specified using struct fi_rx_attr.
911
912 struct fi_rx_attr {
913 uint64_t caps;
914 uint64_t mode;
915 uint64_t op_flags;
916 uint64_t msg_order;
917 uint64_t comp_order;
918 size_t total_buffered_recv;
919 size_t size;
920 size_t iov_limit;
921 };
922
923 caps - Capabilities
924 The requested capabilities of the context. The capabilities must be a
subset of those requested of the associated endpoint. See the CAPABILITIES
section of fi_getinfo(3) for capability details. If the caps
927 field is 0 on input to fi_getinfo(3), the caps value from the fi_info
928 structure will be used.
929
930 mode
931 The operational mode bits of the context. The mode bits will be a sub‐
932 set of those associated with the endpoint. See the MODE section of
933 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
934 to fi_getinfo(3), with the mode value of the fi_info structure used
935 instead. On return from fi_getinfo(3), the mode will be set only to
936 those constraints specific to receive operations.
937
938 op_flags - Default receive operation flags
Flags that control the behavior of operations submitted against the
940 context. Applicable flags are listed in the Operation Flags section.
941
942 msg_order - Message Ordering
943 For a description of message ordering, see the msg_order field in the
944 Transmit Context Attribute section. Receive context message ordering
945 defines the order in which received transport message headers are pro‐
946 cessed when received by an endpoint.
947
948 The following ordering flags, as defined for transmit ordering, also
949 apply to the processing of received operations: FI_ORDER_NONE,
950 FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW,
951 FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, and FI_ORDER_SAS.
952
953 comp_order - Completion Ordering
954 For a description of completion ordering, see the comp_order field in
955 the Transmit Context Attribute section.
956
957 FI_ORDER_NONE : No ordering is defined for completed operations.
958 Receive operations may complete in any order, regardless of their sub‐
959 mission order.
960
961 FI_ORDER_STRICT : Receive operations complete in the order in which
962 they are processed by the receive context, based on the receive side
963 msg_order attribute.
964
965 FI_ORDER_DATA : When set, this bit indicates that received data is
966 written into memory in order. Data ordering applies to memory accessed
967 as part of a single operation and between operations if message order‐
968 ing is guaranteed.
969
970 total_buffered_recv
971 This field is supported for backwards compatibility purposes. It is a
972 hint to the provider of the total available space that may be needed to
973 buffer messages that are received for which there is no matching
974 receive operation. The provider may adjust or ignore this value. The
allocation of internal network buffering among received messages is
976 provider specific. For instance, a provider may limit the size of mes‐
977 sages which can be buffered or the amount of buffering allocated to a
978 single message.
979
980 If receive side buffering is disabled (total_buffered_recv = 0) and a
981 message is received by an endpoint, then the behavior is dependent on
whether resource management has been enabled (FI_RM_ENABLED has been set
983 or not). See the Resource Management section of fi_domain.3 for fur‐
984 ther clarification. It is recommended that applications enable
985 resource management if they anticipate receiving unexpected messages,
986 rather than modifying this value.
987
988 size
989 The size of the context. The size is specified as the minimum number
990 of receive operations that may be posted to the endpoint without the
991 operation returning -FI_EAGAIN.
992
993 iov_limit
994 This is the maximum number of IO vectors (scatter-gather elements) that
a single posted operation may reference.
996
998 A scalable endpoint is a communication portal that supports multiple
999 transmit and receive contexts. Scalable endpoints are loosely modeled
1000 after the networking concept of transmit/receive side scaling, also
1001 known as multi-queue. Support for scalable endpoints is domain spe‐
1002 cific. Scalable endpoints may improve the performance of
1003 multi-threaded and parallel applications, by allowing threads to access
1004 independent transmit and receive queues. A scalable endpoint has a
1005 single transport level address, which can reduce the memory require‐
1006 ments needed to store remote addressing data, versus using standard
1007 endpoints. Scalable endpoints cannot be used directly for communica‐
1008 tion operations, and require the application to explicitly create
1009 transmit and receive contexts as described below.
1010
1011 fi_tx_context
1012 Transmit contexts are independent transmit queues. Ordering and syn‐
1013 chronization between contexts are not defined. Conceptually a transmit
context behaves similarly to a send-only endpoint. A transmit context
1015 may be configured with fewer capabilities than the base endpoint and
1016 with different attributes (such as ordering requirements and inject
1017 size) than other contexts associated with the same scalable endpoint.
1018 Each transmit context has its own completion queue. The number of
1019 transmit contexts associated with an endpoint is specified during end‐
1020 point creation.
1021
1022 The fi_tx_context call is used to retrieve a specific context, identi‐
1023 fied by an index (see above for details on transmit context
1024 attributes). Providers may dynamically allocate contexts when
1025 fi_tx_context is called, or may statically create all contexts when
1026 fi_endpoint is invoked. By default, a transmit context inherits the
1027 properties of its associated endpoint. However, applications may
1028 request context specific attributes through the attr parameter. Sup‐
1029 port for per transmit context attributes is provider specific and not
1030 guaranteed. Providers will return the actual attributes assigned to
1031 the context through the attr parameter, if provided.
1032
1033 fi_rx_context
1034 Receive contexts are independent receive queues for receiving incoming
data. Ordering and synchronization between contexts are not guaranteed.
Conceptually, a receive context behaves similarly to a receive-only
1037 endpoint. A receive context may be configured with fewer capabilities
1038 than the base endpoint and with different attributes (such as ordering
1039 requirements and inject size) than other contexts associated with the
1040 same scalable endpoint. Each receive context has its own completion
1041 queue. The number of receive contexts associated with an endpoint is
1042 specified during endpoint creation.
1043
1044 Receive contexts are often associated with steering flows, that specify
1045 which incoming packets targeting a scalable endpoint to process. How‐
1046 ever, receive contexts may be targeted directly by the initiator, if
1047 supported by the underlying protocol. Such contexts are referred to as
1048 'named'. Support for named contexts must be indicated by setting the
1049 caps FI_NAMED_RX_CTX capability when the corresponding endpoint is cre‐
1050 ated. Support for named receive contexts is coordinated with address
1051 vectors. See fi_av(3) and fi_rx_addr(3).
1052
1053 The fi_rx_context call is used to retrieve a specific context, identi‐
1054 fied by an index (see above for details on receive context attributes).
1055 Providers may dynamically allocate contexts when fi_rx_context is
1056 called, or may statically create all contexts when fi_endpoint is
1057 invoked. By default, a receive context inherits the properties of its
1058 associated endpoint. However, applications may request context spe‐
1059 cific attributes through the attr parameter. Support for per receive
1060 context attributes is provider specific and not guaranteed. Providers
1061 will return the actual attributes assigned to the context through the
1062 attr parameter, if provided.
1063
1065 Shared contexts are transmit and receive contexts explicitly shared
1066 among one or more endpoints. A shareable context allows an application
1067 to use a single dedicated provider resource among multiple transport
1068 addressable endpoints. This can greatly reduce the resources needed to
1069 manage communication over multiple endpoints by multiplexing transmit
1070 and/or receive processing, with the potential cost of serializing
1071 access across multiple endpoints. Support for shareable contexts is
1072 domain specific.
1073
1074 Conceptually, shareable transmit contexts are transmit queues that may
1075 be accessed by many endpoints. The use of a shared transmit context is
1076 mostly opaque to an application. Applications must allocate and bind
1077 shared transmit contexts to endpoints, but operations are posted
1078 directly to the endpoint. Shared transmit contexts are not associated
1079 with completion queues or counters. Completed operations are posted to
1080 the CQs bound to the endpoint. An endpoint may only be associated with
1081 a single shared transmit context.
1082
1083 Unlike shared transmit contexts, applications interact directly with
1084 shared receive contexts. Users post receive buffers directly to a
1085 shared receive context, with the buffers usable by any endpoint bound
1086 to the shared receive context. Shared receive contexts are not associ‐
1087 ated with completion queues or counters. Completed receive operations
1088 are posted to the CQs bound to the endpoint. An endpoint may only be
1089 associated with a single receive context, and all connectionless end‐
1090 points associated with a shared receive context must also share the
1091 same address vector.
1092
Endpoints associated with a shared transmit context may use dedicated
receive contexts, and vice versa. Alternatively, an endpoint may use both
shared transmit and shared receive contexts. There is no requirement that
the same group of endpoints sharing a context of one type also share the
context of the alternate type. Furthermore, an endpoint may use a shared
context of one type, but a scalable set of contexts of the alternate type.
1099
1100 fi_stx_context
1101 This call is used to open a shareable transmit context (see above for
1102 details on the transmit context attributes). Endpoints associated with
1103 a shared transmit context must use a subset of the transmit context's
1104 attributes. Note that this is the reverse of the requirement for
1105 transmit contexts for scalable endpoints.
1106
1107 fi_srx_context
1108 This allocates a shareable receive context (see above for details on
1109 the receive context attributes). Endpoints associated with a shared
1110 receive context must use a subset of the receive context's attributes.
1111 Note that this is the reverse of the requirement for receive contexts
1112 for scalable endpoints.
1113
1115 The following feature and description should be considered experimen‐
1116 tal. Until the experimental tag is removed, the interfaces, semantics,
1117 and data structures associated with socket endpoints may change between
1118 library versions.
1119
1120 This section applies to endpoints of type FI_EP_SOCK_STREAM and
1121 FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1122
1123 Socket endpoints are defined with semantics that allow them to more
1124 easily be adopted by developers familiar with the UNIX socket API, or
1125 by middleware that exposes the socket API, while still taking advantage
1126 of high-performance hardware features.
1127
The key difference between socket endpoints and other active endpoints
is that socket endpoints use synchronous data transfers. Buffers passed
1130 into send and receive operations revert to the control of the applica‐
1131 tion upon returning from the function call. As a result, no data
1132 transfer completions are reported to the application, and socket end‐
1133 points are not associated with completion queues or counters.
1134
1135 Socket endpoints support a subset of message operations: fi_send,
1136 fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
1137 Because data transfers are synchronous, the return value from send and
receive operations indicates the number of bytes transferred on success,
1139 or a negative value on error, including -FI_EAGAIN if the endpoint can‐
1140 not send or receive any data because of full or empty queues, respec‐
1141 tively.
1142
1143 Socket endpoints are associated with event queues and address vectors,
1144 and process connection management events asynchronously, similar to
other endpoints. Unlike UNIX sockets, socket endpoints must still be
1146 declared as either active or passive.
1147
1148 Socket endpoints behave like non-blocking sockets. In order to support
1149 select and poll semantics, active socket endpoints are associated with
1150 a file descriptor that is signaled whenever the endpoint is ready to
1151 send and/or receive data. The file descriptor may be retrieved using
1152 fi_control.
1153
1155 Operation flags are obtained by OR-ing the following flags together.
1156 Operation flags define the default flags applied to an endpoint's data
1157 transfer operations, where a flags parameter is not available. Data
1158 transfer operations that take flags as input override the op_flags
1159 value of transmit or receive context attributes of an endpoint.
1160
1161 FI_INJECT : Indicates that all outbound data buffers should be returned
1162 to the user's control immediately after a data transfer call returns,
1163 even if the operation is handled asynchronously. This may require that
1164 the provider copy the data into a local buffer and transfer out of that
1165 buffer. A provider can limit the total amount of send data that may be
1166 buffered and/or the size of a single send that can use this flag. This
1167 limit is indicated using inject_size (see inject_size above).
1168
1169 FI_MULTI_RECV : Applies to posted receive operations. This flag allows
1170 the user to post a single buffer that will receive multiple incoming
1171 messages. Received messages will be packed into the receive buffer
1172 until the buffer has been consumed. Use of this flag may cause a sin‐
1173 gle posted receive operation to generate multiple completions as mes‐
1174 sages are placed into the buffer. The placement of received data into
1175 the buffer may be subjected to provider specific alignment restric‐
1176 tions. The buffer will be released by the provider when the available
1177 buffer space falls below the specified minimum (see
1178 FI_OPT_MIN_MULTI_RECV).
1179
1180 FI_COMPLETION : Indicates that a completion entry should be generated
1181 for data transfer operations. This flag only applies to operations
1182 issued on endpoints that were bound to a CQ or counter with the
1183 FI_SELECTIVE_COMPLETION flag. See the fi_ep_bind section above for
1184 more detail.
1185
1186 FI_INJECT_COMPLETE : Indicates that a completion should be generated
1187 when the source buffer(s) may be reused. A completion guarantees that
1188 the buffers will not be read from again and the application may reclaim
1189 them. No other guarantees are made with respect to the state of the
1190 operation.
1191
1192 Note: This flag is used to control when a completion entry is inserted
1193 into a completion queue. It does not apply to operations that do not
1194 generate a completion queue entry, such as the fi_inject operation, and
1195 is not subject to the inject_size message limit restriction.
1196
1197 FI_TRANSMIT_COMPLETE : Indicates that a completion should be generated
1198 when the transmit operation has completed relative to the local
1199 provider. The exact behavior is dependent on the endpoint type.
1200
1201 For reliable endpoints:
1202
1203 Indicates that a completion should be generated when the operation has
1204 been delivered to the peer endpoint. A completion guarantees that the
1205 operation is no longer dependent on the fabric or local resources. The
1206 state of the operation at the peer endpoint is not defined.
1207
1208 For unreliable endpoints:
1209
1210 Indicates that a completion should be generated when the operation has
1211 been delivered to the fabric. A completion guarantees that the opera‐
1212 tion is no longer dependent on local resources. The state of the oper‐
1213 ation within the fabric is not defined.
1214
1215 FI_DELIVERY_COMPLETE : Indicates that a completion should not be gener‐
1216 ated until an operation has been processed by the destination end‐
1217 point(s). A completion guarantees that the result of the operation is
1218 available.
1219
1220 This completion mode applies only to reliable endpoints. For opera‐
1221 tions that return data to the initiator, such as RMA read or
1222 atomic-fetch, the source endpoint is also considered a destination end‐
1223 point. This is the default completion mode for such operations.
1224
FI_COMMIT_COMPLETE : Indicates that a completion should not be generated
(locally or at the peer) until the result of an operation has
1227 been made persistent. A completion guarantees that the result is both
1228 available and durable, in the case of power failure.
1229
1230 This completion mode applies only to operations that target persistent
1231 memory regions over reliable endpoints. This completion mode is exper‐
1232 imental.
1233
1234 FI_MULTICAST : Indicates that data transfers will target multicast
1235 addresses by default. Any fi_addr_t passed into a data transfer opera‐
1236 tion will be treated as a multicast address.
1237
1239 Users should call fi_close to release all resources allocated to the
1240 fabric endpoint.
1241
1242 Endpoints allocated with the FI_CONTEXT mode set must typically provide
1243 struct fi_context as their per operation context parameter. (See
1244 fi_getinfo.3 for details.) However, when FI_SELECTIVE_COMPLETION is
1245 enabled to suppress completion entries, and an operation is initiated
1246 without FI_COMPLETION flag set, then the context parameter is ignored.
1247 An application does not need to pass in a valid struct fi_context into
1248 such data transfers.
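
The rule in the preceding paragraph can be summarized as a small decision
function. This is only a sketch of the logic as described here: the
function names are invented, and EX_COMPLETION is a stand-in bit defined
for the example rather than the real FI_COMPLETION value. It covers
successful completions only, since operations that fail are still
reported.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for FI_COMPLETION; the value is chosen for this example. */
#define EX_COMPLETION (UINT64_C(1) << 0)

/*
 * Will a successful operation write a completion entry?
 * selective: the endpoint was bound with FI_SELECTIVE_COMPLETION.
 * flags:     the flags in effect for this operation.
 */
static bool ex_writes_completion(bool selective, uint64_t flags)
{
    if (!selective)
        return true;    /* every operation reports a completion */
    return (flags & EX_COMPLETION) != 0;
}

/*
 * Must the application pass a valid struct fi_context for this operation?
 * context_mode: the endpoint was allocated with the FI_CONTEXT mode bit.
 */
static bool ex_needs_fi_context(bool context_mode, bool selective,
                                uint64_t flags)
{
    return context_mode && ex_writes_completion(selective, flags);
}
```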
1249
1250 Operations that complete in error that are not associated with valid
1251 operational context will use the endpoint context in any error report‐
1252 ing structures.
1253
1254 Although applications typically associate individual completions with
1255 either completion queues or counters, an endpoint can be attached to
1256 both a counter and completion queue. When combined with using selec‐
1257 tive completions, this allows an application to use counters to track
1258 successful completions, with a CQ used to report errors. Operations
1259 that complete with an error increment the error counter and generate a
1260 completion event. The generation of entries going to the CQ can then
1261 be controlled using FI_SELECTIVE_COMPLETION.
1262
1263 As mentioned in fi_getinfo(3), the ep_attr structure can be used to
1264 query providers that support various endpoint attributes. fi_getinfo
1265 can return provider info structures that can support the minimal set of
1266 requirements (such that the application maintains correctness). How‐
1267 ever, it can also return provider info structures that exceed applica‐
1268 tion requirements. As an example, consider an application requesting
1269 msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may
1270 have all the ordering bits set. The application can reset the ordering
1271 bits it does not require before creating the endpoint. The provider is
1272 free to implement a stricter ordering than is required by the applica‐
1273 tion.
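
Resetting the unneeded ordering bits amounts to a bitwise AND against the
set the application requires. The helper below is a hypothetical sketch
of that step, not a libfabric function. It also checks that the provider
offered every required bit, on the assumption that an application would
want to detect that case before creating the endpoint.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * offered:  msg_order bits returned by fi_getinfo for a provider.
 * required: msg_order bits the application depends on for correctness.
 * On success, stores the reduced bit set in *trimmed and returns true.
 * Returns false if the provider did not offer a required bit.
 */
static bool trim_msg_order(uint64_t offered, uint64_t required,
                           uint64_t *trimmed)
{
    if ((offered & required) != required)
        return false;

    /* Keep only the bits the application actually needs. */
    *trimmed = offered & required;
    return true;
}
```

The trimmed value would then be written back into the tx_attr and rx_attr
msg_order fields of the fi_info structure passed to fi_endpoint.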
1274
1276 Returns 0 on success. On error, a negative value corresponding to fab‐
1277 ric errno is returned. For fi_cancel, a return value of 0 indicates
1278 that the cancel request was submitted for processing.
1279
1280 Fabric errno values are defined in rdma/fi_errno.h.
1281
1283 -FI_EDOMAIN : A resource domain was not bound to the endpoint or an
1284 attempt was made to bind multiple domains.
1285
-FI_ENOCQ : The endpoint has not been configured with the necessary
completion queue.
1288
1289 -FI_EOPBADSTATE : The endpoint's state does not permit the requested
1290 operation.
1291
1293 fi_getinfo(3), fi_domain(3), fi_msg(3), fi_tagged(3), fi_rma(3)
1294
1296 OpenFabrics.
1297
1298
1299
Libfabric Programmer's Manual 2018-02-13 fi_endpoint(3)