fi_endpoint(3)                 Libfabric v1.14.0                fi_endpoint(3)

NAME

       fi_endpoint - Fabric endpoint operations

       fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
              Allocate or close an endpoint.

       fi_ep_bind
              Associate an endpoint with hardware resources, such as event
              queues, completion queues, counters, address vectors, or shared
              transmit/receive contexts.

       fi_scalable_ep_bind
              Associate a scalable endpoint with an address vector.

       fi_pep_bind
              Associate a passive endpoint with an event queue.

       fi_enable
              Transition an active endpoint into an enabled state.

       fi_cancel
              Cancel a pending asynchronous data transfer.

       fi_ep_alias
              Create an alias to the endpoint.

       fi_control
              Control endpoint operation.

       fi_getopt / fi_setopt
              Get or set endpoint options.

       fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
              Open a transmit or receive context.

       fi_tc_dscp_set / fi_tc_dscp_get
              Convert between a DSCP value and a network traffic class.

       fi_rx_size_left / fi_tx_size_left (DEPRECATED)
              Query the lower bound on how many RX/TX operations may be posted
              without an operation returning -FI_EAGAIN.  These functions have
              been deprecated and will be removed in a future version of the
              library.

SYNOPSIS

              #include <rdma/fabric.h>

              #include <rdma/fi_endpoint.h>

              int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **ep, void *context);

              int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **sep, void *context);

              int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
                  struct fid_pep **pep, void *context);

              int fi_tx_context(struct fid_ep *sep, int index,
                  struct fi_tx_attr *attr, struct fid_ep **tx_ep,
                  void *context);

              int fi_rx_context(struct fid_ep *sep, int index,
                  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
                  void *context);

              int fi_stx_context(struct fid_domain *domain,
                  struct fi_tx_attr *attr, struct fid_stx **stx,
                  void *context);

              int fi_srx_context(struct fid_domain *domain,
                  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
                  void *context);

              int fi_close(struct fid *ep);

              int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);

              int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

              int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

              int fi_enable(struct fid_ep *ep);

              int fi_cancel(struct fid_ep *ep, void *context);

              int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);

              int fi_control(struct fid *ep, int command, void *arg);

              int fi_getopt(struct fid *ep, int level, int optname,
                  void *optval, size_t *optlen);

              int fi_setopt(struct fid *ep, int level, int optname,
                  const void *optval, size_t optlen);

              uint32_t fi_tc_dscp_set(uint8_t dscp);

              uint8_t fi_tc_dscp_get(uint32_t tclass);

              DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);

              DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);

ARGUMENTS

       fid    On creation, specifies a fabric or access domain.  On bind,
              identifies the event queue, completion queue, counter, or
              address vector to bind to the endpoint.  In other cases, it’s a
              fabric identifier of an associated resource.

       info   Details about the fabric interface endpoint to be opened,
              obtained from fi_getinfo.

       ep     A fabric endpoint.

       sep    A scalable fabric endpoint.

       pep    A passive fabric endpoint.

       context
              Context associated with the endpoint or asynchronous operation.

       index  Index to retrieve a specific transmit/receive context.

       attr   Transmit or receive context attributes.

       flags  Additional flags to apply to the operation.

       command
              Command of control operation to perform on endpoint.

       arg    Optional control argument.

       level  Protocol level at which the desired option resides.

       optname
              The protocol option to read or set.

       optval The option value that was read or to set.

       optlen The size of the optval buffer.

DESCRIPTION

       Endpoints are transport level communication portals.  There are two
       types of endpoints: active and passive.  Passive endpoints belong to a
       fabric domain and are most often used to listen for incoming connection
       requests.  However, a passive endpoint may be used to reserve a fabric
       address that can be granted to an active endpoint.  Active endpoints
       belong to access domains and can perform data transfers.

       Active endpoints may be connection-oriented or connectionless, and may
       provide data reliability.  The data transfer interfaces – messages
       (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
       (fi_atomic) – are associated with active endpoints.  In basic
       configurations, an active endpoint has transmit and receive queues.  In
       general, operations that generate traffic on the fabric are posted to
       the transmit queue.  This includes all RMA and atomic operations, along
       with sent messages and sent tagged messages.  Operations that post
       buffers for receiving incoming data are submitted to the receive queue.

       Active endpoints are created in the disabled state.  They must
       transition into an enabled state before accepting data transfer
       operations, including posting of receive buffers.  The fi_enable call
       is used to transition an active endpoint into an enabled state.  The
       fi_connect and fi_accept calls will also transition an endpoint into
       the enabled state, if it is not already enabled.

       In order to transition an endpoint into an enabled state, it must be
       bound to one or more fabric resources.  An endpoint that will generate
       asynchronous completions, either through data transfer operations or
       communication establishment events, must be bound to the appropriate
       completion queues or event queues, respectively, before being enabled.
       Additionally, endpoints that use manual progress must be associated
       with relevant completion queues or event queues in order to drive
       progress.  For endpoints that are only used as the target of RMA or
       atomic operations, this means binding the endpoint to a completion
       queue associated with receive processing.  Connectionless endpoints
       must be bound to an address vector.

       Once an endpoint has been activated, it may be associated with an
       address vector.  Receive buffers may be posted to it and calls may be
       made to connection establishment routines.  Connectionless endpoints
       may also perform data transfers.

       The behavior of an endpoint may be adjusted by setting its control data
       and protocol options.  This allows the underlying provider to redirect
       function calls to implementations optimized to meet the desired
       application behavior.

       If an endpoint experiences a critical error, it will transition back
       into a disabled state.  Critical errors are reported through the event
       queue associated with the EP.  In certain cases, a disabled endpoint
       may be re-enabled.  The ability to transition back into an enabled
       state is provider specific and depends on the type of error that the
       endpoint experienced.  When an endpoint is disabled as a result of a
       critical error, all pending operations are discarded.

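       The enable rules above can be pictured as a small state machine.  The
       sketch below is a toy model in plain C, not libfabric code: the
       ep_model type, the MODEL_* flags, and the helper names are all
       hypothetical, and it assumes every endpoint generates data transfer
       completions.  Real providers apply additional, provider-specific
       checks.

```c
#include <stdbool.h>

/* Hypothetical binding flags for this model; not libfabric constants. */
enum {
    MODEL_BOUND_CQ = 1 << 0,   /* a completion queue is bound */
    MODEL_BOUND_EQ = 1 << 1,   /* an event queue is bound */
    MODEL_BOUND_AV = 1 << 2    /* an address vector is bound */
};

struct ep_model {
    bool connectionless;   /* true for connectionless endpoints */
    unsigned bound;        /* OR of MODEL_BOUND_* flags */
    bool enabled;          /* endpoints start out disabled */
    int pending_ops;       /* operations posted but not yet completed */
};

/* Model of enabling: returns 0 on success, -1 if a binding is missing.
 * Assumption: the endpoint always needs a CQ for its completions. */
static int model_enable(struct ep_model *ep)
{
    if (ep->enabled)
        return 0;
    if (!(ep->bound & MODEL_BOUND_CQ))
        return -1;
    if (ep->connectionless && !(ep->bound & MODEL_BOUND_AV))
        return -1;
    ep->enabled = true;
    return 0;
}

/* Model of posting: only an enabled endpoint accepts operations. */
static int model_post(struct ep_model *ep)
{
    if (!ep->enabled)
        return -1;
    ep->pending_ops++;
    return 0;
}

/* Model of a critical error: the endpoint returns to the disabled state
 * and all pending operations are discarded. */
static void model_critical_error(struct ep_model *ep)
{
    ep->enabled = false;
    ep->pending_ops = 0;
}
```

       In this model a connectionless endpoint with only a CQ bound fails to
       enable until an address vector is added, and a critical error drops
       any pending work.
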
   fi_endpoint / fi_passive_ep / fi_scalable_ep
       fi_endpoint allocates a new active endpoint.  fi_passive_ep allocates a
       new passive endpoint.  fi_scalable_ep allocates a scalable endpoint.
       The properties and behavior of the endpoint are defined based on the
       provided struct fi_info.  See fi_getinfo for additional details on
       fi_info.  fi_info flags that control the operation of an endpoint are
       defined below.  See section SCALABLE ENDPOINTS.

       If an active endpoint is allocated in order to accept a connection
       request, the fi_info parameter must be the same as the fi_info
       structure provided with the connection request (FI_CONNREQ) event.

       An active endpoint may acquire the properties of a passive endpoint by
       setting the fi_info handle field to the passive endpoint fabric
       descriptor.  This is useful for applications that need to reserve the
       fabric address of an endpoint prior to knowing if the endpoint will be
       used on the active or passive side of a connection.  For example, this
       feature is useful for simulating socket semantics.  Once an active
       endpoint acquires the properties of a passive endpoint, the passive
       endpoint is no longer bound to any fabric resources and must no longer
       be used.  The user is expected to close the passive endpoint after
       opening the active endpoint in order to free up any lingering resources
       that had been used.

   fi_close
       Closes an endpoint and releases all resources associated with it.

       When closing a scalable endpoint, there must be no opened transmit
       contexts or receive contexts associated with the scalable endpoint.  If
       resources are still associated with the scalable endpoint when
       attempting to close, the call will return -FI_EBUSY.

       Outstanding operations posted to the endpoint when fi_close is called
       will be discarded.  Discarded operations will silently be dropped, with
       no completions reported.  Additionally, a provider may discard
       previously completed operations from the associated completion
       queue(s).  The behavior to discard completed operations is provider
       specific.

   fi_ep_bind
       fi_ep_bind is used to associate an endpoint with other allocated
       resources, such as completion queues, counters, address vectors, event
       queues, shared contexts, and memory regions.  The types of objects that
       must be bound with an endpoint depend on the endpoint type and its
       configuration.

       Passive endpoints must be bound with an EQ that supports connection
       management events.  Connectionless endpoints must be bound to a single
       address vector.  If an endpoint is using a shared transmit and/or
       receive context, the shared contexts must be bound to the endpoint.
       CQs, counters, AVs, and shared contexts must be bound to endpoints
       before the endpoints are enabled, either explicitly or implicitly.

       An endpoint must be bound with CQs capable of reporting completions for
       any asynchronous operation initiated on the endpoint.  For example, if
       the endpoint supports any outbound transfers (sends, RMA, atomics,
       etc.), then it must be bound to a completion queue that can report
       transmit completions.  This is true even if the endpoint is configured
       to suppress successful completions, so that operations that complete in
       error can be reported to the user.

       An active endpoint may direct asynchronous completions to different
       CQs, based on the type of operation.  This is specified using
       fi_ep_bind flags.  The following flags may be OR’ed together when
       binding an endpoint to a CQ.

       FI_RECV
              Directs the notification of inbound data transfers to the
              specified completion queue.  This includes received messages.
              This binding automatically includes FI_REMOTE_WRITE, if
              applicable to the endpoint.

       FI_SELECTIVE_COMPLETION
              By default, data transfer operations write CQ completion entries
              into the associated completion queue after they have
              successfully completed.  Applications can use this bind flag to
              selectively enable when completions are generated.  If
              FI_SELECTIVE_COMPLETION is specified, data transfer operations
              will not generate CQ entries for successful completions unless
              FI_COMPLETION is set as an operational flag for the given
              operation.  Operations that fail asynchronously will still
              generate completions, even if a completion is not requested.
              FI_SELECTIVE_COMPLETION must be OR’ed with FI_TRANSMIT and/or
              FI_RECV flags.

              When FI_SELECTIVE_COMPLETION is set, the user must determine
              when a request that does NOT have FI_COMPLETION set has
              completed indirectly, usually based on the completion of a
              subsequent operation or by using completion counters.  Use of
              this flag may improve performance by allowing the provider to
              avoid writing a CQ completion entry for every operation.

              See Notes section below for additional information on how this
              flag interacts with the FI_CONTEXT and FI_CONTEXT2 mode bits.

       FI_TRANSMIT
              Directs the completion of outbound data transfer requests to the
              specified completion queue.  This includes send message, RMA,
              and atomic operations.

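       The FI_SELECTIVE_COMPLETION rules above reduce to a three-input
       decision.  The function below is a minimal sketch of that decision in
       plain C; it is a model for illustration only, and the function and
       parameter names are hypothetical rather than part of libfabric.

```c
#include <stdbool.h>

/* Returns true if this model would write a CQ entry for an operation.
 *   selective_bind: the CQ was bound with the selective completion flag
 *   op_completion:  the operation itself requested a completion
 *   failed:         the operation completed in error */
static bool model_writes_cq_entry(bool selective_bind, bool op_completion,
                                  bool failed)
{
    if (failed)
        return true;        /* failures are reported even if not requested */
    if (!selective_bind)
        return true;        /* default: every successful operation reports */
    return op_completion;   /* selective: only when explicitly requested */
}
```

       An operation for which this model returns false leaves no CQ entry, so
       the application has to infer its completion from a later operation or
       from a counter, as described above.
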
       An endpoint may optionally be bound to a completion counter.
       Associating an endpoint with a counter is in addition to binding the EP
       with a CQ.  When binding an endpoint to a counter, the following flags
       may be specified.

       FI_READ
              Increments the specified counter whenever an RMA read, atomic
              fetch, or atomic compare operation initiated from the endpoint
              has completed successfully or in error.

       FI_RECV
              Increments the specified counter whenever a message is received
              over the endpoint.  Received messages include both tagged and
              normal message operations.

       FI_REMOTE_READ
              Increments the specified counter whenever an RMA read, atomic
              fetch, or atomic compare operation is initiated from a remote
              endpoint that targets the given endpoint.  Use of this flag
              requires that the endpoint be created using FI_RMA_EVENT.

       FI_REMOTE_WRITE
              Increments the specified counter whenever an RMA write or base
              atomic operation is initiated from a remote endpoint that
              targets the given endpoint.  Use of this flag requires that the
              endpoint be created using FI_RMA_EVENT.

       FI_SEND
              Increments the specified counter whenever a message transfer
              initiated over the endpoint has completed successfully or in
              error.  Sent messages include both tagged and normal message
              operations.

       FI_WRITE
              Increments the specified counter whenever an RMA write or base
              atomic operation initiated from the endpoint has completed
              successfully or in error.

       An endpoint may only be bound to a single CQ or counter for a given
       type of operation.  For example, an EP may not bind to two counters
       both using FI_WRITE.  Furthermore, providers may limit CQ and counter
       bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM,
       etc.).

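       The one-binding-per-operation-type rule can be tracked with a bitmask.
       The following is a sketch under stated assumptions, not libfabric
       code: the MODEL_* bits, the bind_model type, and model_bind_cntr are
       hypothetical, and the bit values are not the real flag values.

```c
#include <stdint.h>

/* Hypothetical operation-type bits for this model; not libfabric values. */
#define MODEL_SEND   (UINT64_C(1) << 0)
#define MODEL_RECV   (UINT64_C(1) << 1)
#define MODEL_READ   (UINT64_C(1) << 2)
#define MODEL_WRITE  (UINT64_C(1) << 3)

struct bind_model {
    uint64_t cntr_ops;   /* operation types that already have a counter */
};

/* Try to bind a counter for the requested operation types.
 * Returns 0 on success, or -1 if no type was requested or if any requested
 * type already has a counter bound.  On failure nothing is changed. */
static int model_bind_cntr(struct bind_model *ep, uint64_t flags)
{
    if (flags == 0 || (ep->cntr_ops & flags) != 0)
        return -1;
    ep->cntr_ops |= flags;
    return 0;
}
```

       A second bind that includes MODEL_WRITE is rejected once a counter
       already covers writes, while a bind for a different operation type
       still succeeds.
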
   fi_scalable_ep_bind
       fi_scalable_ep_bind is used to associate a scalable endpoint with an
       address vector.  See section on SCALABLE ENDPOINTS.  A scalable
       endpoint has a single transport level address and can support multiple
       transmit and receive contexts.  The transmit and receive contexts share
       the transport-level address.  Address vectors that are bound to
       scalable endpoints are implicitly bound to any transmit or receive
       contexts created using the scalable endpoint.

   fi_enable
       This call transitions the endpoint into an enabled state.  An endpoint
       must be enabled before it may be used to perform data transfers.
       Enabling an endpoint typically results in hardware resources being
       assigned to it.  Endpoints making use of completion queues, counters,
       event queues, and/or address vectors must be bound to them before being
       enabled.

       Calling connect or accept on an endpoint will implicitly enable the
       endpoint if it has not already been enabled.

       fi_enable may also be used to re-enable an endpoint that has been
       disabled as a result of experiencing a critical error.  Applications
       should check the return value from fi_enable to see if a disabled
       endpoint has successfully been re-enabled.

   fi_cancel
       fi_cancel attempts to cancel an outstanding asynchronous operation.
       Canceling an operation causes the fabric provider to search for the
       operation and, if it is still pending, complete it as having been
       canceled.  An error queue entry will be available in the associated
       error queue with error code FI_ECANCELED.  On the other hand, if the
       operation completed before the call to fi_cancel, then the completion
       status of that operation will be available in the associated completion
       queue.  No specific entry related to fi_cancel itself will be posted.

       Cancel uses the context parameter associated with an operation to
       identify the request to cancel.  Operations posted without a valid
       context parameter – either no context parameter is specified or the
       context value was ignored by the provider – cannot be canceled.  If
       multiple outstanding operations match the context parameter, only one
       will be canceled.  In this case, the operation which is canceled is
       provider specific.  The cancel operation is asynchronous, but will
       complete within a bounded period of time.

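       The context-matching behavior described above can be sketched as a
       search over pending operations.  This is a toy model, not provider
       code: op_model and model_cancel are hypothetical names, a NULL context
       stands in for "no valid context", and picking the first match is an
       arbitrary choice, since the source leaves that provider specific.

```c
#include <stdbool.h>
#include <stddef.h>

struct op_model {
    void *context;   /* user context supplied when the operation was posted */
    bool pending;    /* operation is still outstanding */
    bool canceled;   /* operation was completed as canceled */
};

/* Cancel at most one pending operation whose context matches.
 * Returns true if an operation was canceled.  Operations that already
 * completed are left alone; their normal completion stands. */
static bool model_cancel(struct op_model *ops, size_t n, void *context)
{
    if (context == NULL)
        return false;   /* assumption: NULL means no usable context */
    for (size_t i = 0; i < n; i++) {
        if (ops[i].pending && ops[i].context == context) {
            ops[i].pending = false;
            ops[i].canceled = true;   /* would surface as FI_ECANCELED */
            return true;              /* only one match is canceled */
        }
    }
    return false;
}
```

       Because only one matching operation is canceled per call, applications
       that need to cancel a specific request are better served by giving
       each outstanding operation a distinct context.
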
   fi_ep_alias
       This call creates an alias to the specified endpoint.  Conceptually, an
       endpoint alias provides an alternate software path from the application
       to the underlying provider hardware.  An alias EP differs from its
       parent endpoint only by its default data transfer flags.  For example,
       an alias EP may be configured to use a different completion mode.  By
       default, an alias EP inherits the same data transfer flags as the
       parent endpoint.  An application can use fi_control to modify the alias
       EP operational flags.

       When allocating an alias, an application may configure either the
       transmit or receive operational flags.  This avoids needing a separate
       call to fi_control to set those flags.  The flags passed to fi_ep_alias
       must include FI_TRANSMIT or FI_RECV (not both) with other operational
       flags OR’ed in.  This will override the transmit or receive flags,
       respectively, for operations posted through the alias endpoint.  All
       allocated aliases must be closed for the underlying endpoint to be
       released.

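       The "FI_TRANSMIT or FI_RECV (not both)" requirement is an exactly-one
       check on two bits.  A minimal sketch in plain C follows; the MODEL_*
       constants and model_direction_ok are hypothetical stand-ins, and the
       bit positions are not the real libfabric flag values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical direction bits for this model; not libfabric values. */
#define MODEL_TRANSMIT (UINT64_C(1) << 0)
#define MODEL_RECV     (UINT64_C(1) << 1)

/* True when exactly one of the two direction bits is set.  Any other
 * operational flags OR'ed in are ignored by the check. */
static bool model_direction_ok(uint64_t flags)
{
    uint64_t dir = flags & (MODEL_TRANSMIT | MODEL_RECV);

    return dir == MODEL_TRANSMIT || dir == MODEL_RECV;
}
```

       An application could run a check of this shape before passing flags
       to fi_ep_alias to catch a missing or doubled direction early.
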
   fi_control
       The control operation is used to adjust the default behavior of an
       endpoint.  It allows the underlying provider to redirect function calls
       to implementations optimized to meet the desired application behavior.
       As a result, calls to fi_control must be serialized against all other
       calls to an endpoint.

       The base operation of an endpoint is selected during creation using
       struct fi_info.  The following control commands and arguments may be
       assigned to an endpoint.

       FI_BACKLOG – int *value
              This option only applies to passive endpoints.  It is used to
              set the connection request backlog for listening endpoints.

       FI_GETOPSFLAG – uint64_t *flags
              Used to retrieve the current value of flags associated with the
              data transfer operations initiated on the endpoint.  The control
              argument must include FI_TRANSMIT or FI_RECV (not both) flags to
              indicate the type of data transfer flags to be returned.  See
              below for a list of control flags.

       FI_GETWAIT – void **
              This command allows the user to retrieve the file descriptor
              associated with a socket endpoint.  The fi_control arg parameter
              should be an address where a pointer to the returned file
              descriptor will be written.  See fi_eq(3) for additional details
              on using fi_control with FI_GETWAIT.  The file descriptor may be
              used for notification that the endpoint is ready to send or
              receive data.

       FI_SETOPSFLAG – uint64_t *flags
              Used to change the data transfer operation flags associated with
              an endpoint.  The control argument must include FI_TRANSMIT or
              FI_RECV (not both) to indicate the type of data transfer that
              the flags should apply to, with other flags OR’ed in.  The given
              flags will override the previous transmit and receive attributes
              that were set when the endpoint was created.  Valid control
              flags are defined below.

   fi_getopt / fi_setopt
       Endpoint protocol options may be retrieved using fi_getopt or set using
       fi_setopt.  Applications specify the level at which a desired option
       exists, identify the option, and provide input/output buffers to get or
       set the option.  fi_setopt provides an application a way to adjust
       low-level protocol and implementation specific details of an endpoint.

       The following option levels and option names and parameters are
       defined.

       FI_OPT_ENDPOINT

       FI_OPT_BUFFERED_LIMIT - size_t
              Defines the maximum size of a buffered message that will be
              reported to users as part of a receive completion when the
              FI_BUFFERED_RECV mode is enabled on an endpoint.

              fi_getopt() will return the currently configured threshold, or
              the provider’s default threshold if one has not been set by the
              application.  fi_setopt() allows an application to configure the
              threshold.  If the provider cannot support the requested
              threshold, it will fail the fi_setopt() call with FI_EMSGSIZE.
              Calling fi_setopt() with the threshold set to SIZE_MAX will set
              the threshold to the maximum supported by the provider.
              fi_getopt() can then be used to retrieve the set size.

              In most cases, the sending and receiving endpoints must be
              configured to use the same threshold value, and the threshold
              must be set prior to enabling the endpoint.

       FI_OPT_BUFFERED_MIN - size_t
              Defines the minimum size of a buffered message that will be
              reported.  Applications would set this to a size that’s big
              enough to decide whether to discard or claim a buffered receive
              or when to claim a buffered receive on getting a buffered
              receive completion.  The value is typically used by a provider
              when sending a rendezvous protocol request where it would send
              at least FI_OPT_BUFFERED_MIN bytes of application data along
              with it.  A smaller sized rendezvous protocol message usually
              results in better latency for the overall transfer of a large
              message.

       FI_OPT_CM_DATA_SIZE - size_t
              Defines the size of available space in CM messages for
              user-defined data.  This value limits the amount of data that
              applications can exchange between peer endpoints using the
              fi_connect, fi_accept, and fi_reject operations.  The size
              returned is dependent upon the properties of the endpoint,
              except in the case of passive endpoints, in which case the size
              reflects the maximum size of the data that may be present as
              part of a connection request event.  This option is read only.

       FI_OPT_MIN_MULTI_RECV - size_t
              Defines the minimum receive buffer space available when the
              receive buffer is released by the provider (see FI_MULTI_RECV).
              Modifying this value is only guaranteed to set the minimum
              buffer space needed on receives posted after the value has been
              changed.  It is recommended that applications that want to
              override the default MIN_MULTI_RECV value set this option before
              enabling the corresponding endpoint.

       FI_OPT_FI_HMEM_P2P - int
              Defines how the provider should handle peer to peer FI_HMEM
              transfers for this endpoint.  By default, the provider will
              choose whether to use peer to peer support based on the type of
              transfer (FI_HMEM_P2P_ENABLED).  Valid values defined in
              fi_endpoint.h are:

              • FI_HMEM_P2P_ENABLED: Peer to peer support may be used by the
                provider to handle FI_HMEM transfers.  Which transfers are
                initiated using peer to peer is subject to the provider
                implementation.

              • FI_HMEM_P2P_REQUIRED: Peer to peer support must be used for
                transfers; transfers that cannot be performed using p2p will
                be reported as failing.

              • FI_HMEM_P2P_PREFERRED: Peer to peer support should be used by
                the provider for all transfers if available, but the provider
                may choose to copy the data to initiate the transfer if peer
                to peer support is unavailable.

              • FI_HMEM_P2P_DISABLED: Peer to peer support should not be used.

              fi_setopt() will return -FI_EOPNOTSUPP if the mode requested
              cannot be supported by the provider.  The FI_HMEM_DISABLE_P2P
              environment variable discussed in fi_mr(3) takes precedence over
              this setopt option.

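       The FI_OPT_BUFFERED_LIMIT threshold rules above (accept a supported
       value, reject an unsupported one, and treat SIZE_MAX as "use the
       provider maximum") can be sketched as follows.  This is a model in
       plain C under stated assumptions: limit_model and model_set_limit are
       hypothetical names, and the -1 return merely stands in for the
       FI_EMSGSIZE failure.

```c
#include <stddef.h>
#include <stdint.h>

struct limit_model {
    size_t provider_max;   /* largest threshold the provider supports */
    size_t threshold;      /* currently configured threshold */
};

/* Model of setting the buffered limit.
 * Returns 0 on success, -1 if the requested threshold is unsupported. */
static int model_set_limit(struct limit_model *m, size_t requested)
{
    if (requested == SIZE_MAX) {
        m->threshold = m->provider_max;   /* "give me the maximum" */
        return 0;
    }
    if (requested > m->provider_max)
        return -1;                        /* stands in for FI_EMSGSIZE */
    m->threshold = requested;
    return 0;
}
```

       After a SIZE_MAX request, reading back the threshold (fi_getopt in the
       real interface) reveals the value the provider actually selected.
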
   fi_tc_dscp_set
       This call converts a DSCP defined value into a libfabric traffic class
       value.  It should be used when assigning a DSCP value when setting the
       tclass field in either domain or endpoint attributes.

   fi_tc_dscp_get
       This call returns the DSCP value associated with the tclass field for
       the domain or endpoint attributes.

   fi_rx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in a future
       version of the library.  It may not be supported by all providers.

       The fi_rx_size_left call returns a lower bound on the number of receive
       operations that may be posted to the given endpoint without that
       operation returning -FI_EAGAIN.  Depending on the specific details of
       the subsequently posted receive operations (e.g., number of iov
       entries, which receive function is called, etc.), it may be possible to
       post more receive operations than originally indicated by
       fi_rx_size_left.

   fi_tx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in a future
       version of the library.  It may not be supported by all providers.

       The fi_tx_size_left call returns a lower bound on the number of
       transmit operations that may be posted to the given endpoint without
       that operation returning -FI_EAGAIN.  Depending on the specific details
       of the subsequently posted transmit operations (e.g., number of iov
       entries, which transmit function is called, etc.), it may be possible
       to post more transmit operations than originally indicated by
       fi_tx_size_left.

ENDPOINT ATTRIBUTES

       The fi_ep_attr structure defines the set of attributes associated with
       an endpoint.  Endpoint attributes may be further refined using the
       transmit and receive context attributes as shown below.

              struct fi_ep_attr {
                  enum fi_ep_type type;
                  uint32_t        protocol;
                  uint32_t        protocol_version;
                  size_t          max_msg_size;
                  size_t          msg_prefix_size;
                  size_t          max_order_raw_size;
                  size_t          max_order_war_size;
                  size_t          max_order_waw_size;
                  uint64_t        mem_tag_format;
                  size_t          tx_ctx_cnt;
                  size_t          rx_ctx_cnt;
                  size_t          auth_key_size;
                  uint8_t         *auth_key;
              };

589   type - Endpoint Type
590       If specified, indicates the type of fabric interface communication  de‐
591       sired.  Supported types are:
592
593       FI_EP_DGRAM
594              Supports  a  connectionless,  unreliable datagram communication.
595              Message boundaries are maintained, but the maximum message  size
596              may  be  limited to the fabric MTU.  Flow control is not guaran‐
597              teed.
598
599       FI_EP_MSG
600              Provides a reliable, connection-oriented data  transfer  service
601              with flow control that maintains message boundaries.
602
603       FI_EP_RDM
604              Reliable  datagram message.  Provides a reliable, connectionless
605              data transfer service with flow control that  maintains  message
606              boundaries.
607
608       FI_EP_SOCK_DGRAM
609              A  connectionless,  unreliable  datagram endpoint with UDP sock‐
610              et-like semantics.  FI_EP_SOCK_DGRAM is most useful for applica‐
611              tions  designed  around  using UDP sockets.  See the SOCKET END‐
612              POINT section for additional details and restrictions that apply
613              to datagram socket endpoints.
614
615       FI_EP_SOCK_STREAM
616              Data  streaming  endpoint  with TCP socket-like semantics.  Pro‐
617              vides a reliable, connection-oriented data transfer service that
618              does not maintain message boundaries.  FI_EP_SOCK_STREAM is most
619              useful for applications designed around using TCP sockets.   See
620              the  SOCKET ENDPOINT section for additional details and restric‐
621              tions that apply to stream endpoints.
622
623       FI_EP_UNSPEC
624              The type of endpoint is not specified.  This is usually provided
625              as  input, with other attributes of the endpoint or the provider
626              selecting the type.
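
       The endpoint type descriptions above can be condensed into a small
       lookup table.  The sketch below is illustrative only: struct ep_traits,
       ep_table, and ep_lookup are hypothetical names, not libfabric
       definitions, and the message boundary entry for FI_EP_SOCK_DGRAM is an
       assumption drawn from UDP socket semantics rather than from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative summary of the endpoint types described above.
 * This is a stand-in table, not a libfabric data structure. */
struct ep_traits {
    const char *name;
    bool reliable;
    bool connection_oriented;
    bool keeps_msg_boundaries;
};

static const struct ep_traits ep_table[] = {
    { "FI_EP_DGRAM",       false, false, true  },
    { "FI_EP_MSG",         true,  true,  true  },
    { "FI_EP_RDM",         true,  false, true  },
    /* Boundary behavior assumed from UDP socket semantics. */
    { "FI_EP_SOCK_DGRAM",  false, false, true  },
    { "FI_EP_SOCK_STREAM", true,  true,  false },
};

/* Return the traits for a type name, or NULL when the type is not in the
 * table (for example FI_EP_UNSPEC, where the type is selected later). */
static const struct ep_traits *ep_lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(ep_table) / sizeof(ep_table[0]); i++) {
        if (strcmp(ep_table[i].name, name) == 0)
            return &ep_table[i];
    }
    return NULL;
}
```

       An application could use such a table to decide, for instance, whether
       it must add its own framing (FI_EP_SOCK_STREAM) or its own
       retransmission (FI_EP_DGRAM).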
627
628   Protocol
629       Specifies the low-level end to end protocol employed by  the  provider.
630       A  matching  protocol must be used by communicating endpoints to ensure
631       interoperability.  The following protocol values are defined.  Provider
632       specific  protocols are also allowed.  Provider specific protocols will
633       be indicated by having the upper bit of the protocol value set to one.
634
635       FI_PROTO_GNI
636              Protocol runs over Cray GNI low-level interface.
637
638       FI_PROTO_IB_RDM
639              Reliable-datagram protocol  implemented  over  InfiniBand  reli‐
640              able-connected queue pairs.
641
642       FI_PROTO_IB_UD
643              The protocol runs over InfiniBand unreliable datagram queue
644              pairs.
645
646       FI_PROTO_IWARP
647              The protocol runs over the  Internet  wide  area  RDMA  protocol
648              transport.
649
650       FI_PROTO_IWARP_RDM
651              Reliable-datagram  protocol implemented over iWarp reliable-con‐
652              nected queue pairs.
653
654       FI_PROTO_NETWORKDIRECT
655              Protocol runs over the Microsoft NetworkDirect service provider
656              interface.  This adds reliable-datagram semantics over the
657              NetworkDirect connection-oriented endpoint semantics.
658
659       FI_PROTO_PSMX
660              The protocol is based on an Intel proprietary protocol known  as
661              PSM,  performance scaled messaging.  PSMX is an extended version
662              of the PSM protocol to support the libfabric interfaces.
663
664       FI_PROTO_PSMX2
665              The protocol is based on an Intel proprietary protocol known  as
666              PSM2,  performance  scaled messaging version 2.  PSMX2 is an ex‐
667              tended version of the PSM2 protocol to support the libfabric in‐
668              terfaces.
669
670       FI_PROTO_PSMX3
671              The  protocol  is  Intel’s  protocol  known as PSM3, performance
672              scaled messaging version 3.  PSMX3 is  implemented  over  RoCEv2
673              and verbs.
674
675       FI_PROTO_RDMA_CM_IB_RC
676              The protocol runs over InfiniBand reliable-connected queue
677              pairs, using the RDMA CM protocol for connection establishment.
678
679       FI_PROTO_RXD
680              Reliable-datagram protocol implemented over datagram  endpoints.
681              RXD  is a libfabric utility component that adds RDM endpoint se‐
682              mantics over DGRAM endpoint semantics.
683
684       FI_PROTO_RXM
685              Reliable-datagram protocol implemented over  message  endpoints.
686              RXM  is a libfabric utility component that adds RDM endpoint se‐
687              mantics over MSG endpoint semantics.
688
689       FI_PROTO_SOCK_TCP
690              The protocol is layered over TCP packets.
691
692       FI_PROTO_UDP
693              The protocol sends and receives UDP datagrams.  For example,  an
694              endpoint  using  FI_PROTO_UDP will be able to communicate with a
695              remote peer that is using Berkeley SOCK_DGRAM sockets using  IP‐
696              PROTO_UDP.
697
698       FI_PROTO_UNSPEC
699              The  protocol is not specified.  This is usually provided as in‐
700              put, with other attributes of the socket or the provider select‐
701              ing the actual protocol.
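
       As noted at the start of this section, provider specific protocols are
       marked by the upper bit of the protocol value.  A minimal sketch of that
       test follows, assuming the 32-bit protocol field shown in struct
       fi_ep_attr.  PROTO_PROVIDER_BIT and proto_is_provider_specific are
       hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bit of a 32-bit protocol value (illustrative name). */
#define PROTO_PROVIDER_BIT (UINT32_C(1) << 31)

/* True when the protocol value is flagged as provider specific. */
static bool proto_is_provider_specific(uint32_t protocol)
{
    return (protocol & PROTO_PROVIDER_BIT) != 0;
}
```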
702
703   protocol_version - Protocol Version
704       Identifies which version of the protocol is employed by the provider.
705       The protocol version allows providers to extend an existing protocol
706       in a backward compatible manner, for example by adding support for
707       additional features or functionality.  Providers that support different
708       versions of the same protocol should interoperate, but only when using
709       the capabilities defined for the lesser version.
710
711   max_msg_size - Max Message Size
712       Defines the maximum size for an application data transfer as  a  single
713       operation.
714
715   msg_prefix_size - Message Prefix Size
716       Specifies the size of any required message prefix buffer space.  This
717       field will be 0 unless the FI_MSG_PREFIX mode is enabled.  If
718       msg_prefix_size is > 0, the specified value will be a multiple of 8 bytes.
719
720   Max RMA Ordered Size
721       The maximum ordered size specifies the delivery order of transport data
722       into target memory for RMA and atomic operations.  Data ordering is
723       separate from, but dependent on, message ordering (defined below).  Data
724       ordering is unspecified where message order is not defined.
725
726       Data ordering refers to the access of the same target memory by  subse‐
727       quent  operations.   When back to back RMA read or write operations ac‐
728       cess the same  registered  memory  location,  data  ordering  indicates
729       whether  the  second  operation reads or writes the target memory after
730       the first operation has completed.  For example, will an RMA read  that
731       follows  an  RMA write read back the data that was written?  Similarly,
732       will an RMA write that follows an RMA read update the target buffer af‐
733       ter  the read has transferred the original data?  Data ordering answers
734       these questions, even in the presence of errors, such as  the  need  to
735       resend data because of lost or corrupted network traffic.
736
737       RMA  ordering  applies  between two operations, and not within a single
738       data transfer.  Therefore, ordering  is  defined  per  byte-addressable
739       memory location.  That is, ordering specifies whether location X is
740       accessed by the second operation after the first operation.  Nothing is
741       implied  about  the completion of the first operation before the second
742       operation is initiated.  For example, if the  first  operation  updates
743       locations  X  and Y, but the second operation only accesses location X,
744       there are no guarantees defined relative to location Y and  the  second
745       operation.
746
747       In  order  to  support  large data transfers being broken into multiple
748       packets and sent using multiple paths through the fabric, data ordering
749       may  be  limited  to  transfers  of a specific size or less.  Providers
750       specify when data ordering is maintained through the following  values.
751       Note that even if data ordering is not maintained, message ordering may
752       be.
753
754       max_order_raw_size
755              Read after write size.  If set, an RMA or atomic read  operation
756              issued after an RMA or atomic write operation, both of which are
757              smaller than the size, will be ordered.  Where the target memory
758              locations overlap, the RMA or atomic read operation will see the
759              results of the previous RMA or atomic write.
760
761       max_order_war_size
762              Write after read size.  If set, an RMA or atomic write operation
763              issued  after an RMA or atomic read operation, both of which are
764              smaller than the size, will be ordered.  The RMA or atomic  read
765              operation  will see the initial value of the target memory loca‐
766              tion before a subsequent RMA or atomic write updates the value.
767
768       max_order_waw_size
769              Write after write size.  If set, an RMA or atomic  write  opera‐
770              tion  issued  after  an  RMA  or atomic write operation, both of
771              which are smaller than the size, will be  ordered.   The  target
772              memory  location  will  reflect the results of the second RMA or
773              atomic write.
774
775       An order size value of 0 indicates that ordering is not guaranteed.   A
776       value of -1 guarantees ordering for any data size.
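
       The interpretation of the three order size fields can be sketched as a
       single predicate.  This is a hedged illustration, not provider code:
       data_ordered is a hypothetical helper, and treating a transfer whose
       length equals the limit as ordered is an assumption based on the phrase
       "a specific size or less" above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decide whether two back to back operations are covered by a data
 * ordering guarantee.  max_order_size is one of max_order_raw_size,
 * max_order_war_size, or max_order_waw_size, chosen by the caller to
 * match the operation pair.
 */
static bool data_ordered(size_t max_order_size,
                         size_t first_len, size_t second_len)
{
    if (max_order_size == 0)
        return false;              /* ordering is not guaranteed */
    if (max_order_size == SIZE_MAX)
        return true;               /* -1: ordered for any data size */
    /* Assumption: a length equal to the limit still qualifies. */
    return first_len <= max_order_size && second_len <= max_order_size;
}
```

       For example, with max_order_raw_size set to 256, a 64-byte write
       followed by a 128-byte read would be ordered, while a 512-byte read
       would not be covered.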
777
778   mem_tag_format - Memory Tag Format
779       The  memory  tag  format  is  a  bit array used to convey the number of
780       tagged bits supported by a provider.  Additionally, it may be  used  to
781       divide  the bit array into separate fields.  The mem_tag_format option‐
782       ally begins with a series of bits set to 0, to signify bits  which  are
783       ignored by the provider.  Following the initial prefix of ignored bits,
784       the array will consist of alternating groups of bits set to all 1’s  or
785       all 0’s.  Each group of bits corresponds to a tagged field.  The impli‐
786       cation of defining a tagged field is that when a mask is applied to the
787       tagged  bit  array, all bits belonging to a single field will either be
788       set to 1 or 0, collectively.
789
790       For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
791       bits, separated into 3 fields.  The first field consists of 2-bits, the
792       second field 4-bits, and the final field 8-bits.  Valid masks for  such
793       a tagged field would be a bitwise OR’ing of zero or more of the follow‐
794       ing values: 0x3000, 0x0F00, and 0x00FF.  The provider may not  validate
795       the mask provided by the application for performance reasons.
796
797       By  identifying fields within a tag, a provider may be able to optimize
798       its search routines.  An application that requests tag fields must
799       provide  tag  masks  that  either  set all mask bits corresponding to a
800       field to all 0 or all 1.  When negotiating tag fields,  an  application
801       can  request  a  specific number of fields of a given size.  A provider
802       must return a tag format that supports the requested number of  fields,
803       with each field being at least the size requested, or fail the request.
804       A provider may increase the size of the fields.  When reporting comple‐
805       tions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the provider
806       would clear out any unsupported tag bits in the tag field of  the  com‐
807       pletion entry.
808
809       It is recommended that field sizes be ordered from smallest to largest.
810       A generic, unstructured tag and mask can be achieved  by  requesting  a
811       bit array consisting of alternating 1’s and 0’s.
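
       The field rules above can be checked mechanically.  The following
       sketch decodes a mem_tag_format value into its tagged bit count and
       fields, and tests whether a mask sets each field to all 0s or all 1s.
       The helper names (tag_bits, tag_fields, tag_mask_valid) are
       hypothetical; libfabric does not provide them, and providers may not
       perform this validation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of tagged bits: position of the highest set bit, plus one.
 * Leading 0 bits are the ignored prefix. */
static unsigned tag_bits(uint64_t fmt)
{
    unsigned n = 0;
    while (fmt) {
        n++;
        fmt >>= 1;
    }
    return n;
}

/* Number of fields: runs of identical bits within the tagged bits. */
static unsigned tag_fields(uint64_t fmt)
{
    unsigned n = tag_bits(fmt), fields = 0;
    int prev = -1;
    for (unsigned i = 0; i < n; i++) {
        int bit = (int)((fmt >> i) & 1);
        if (bit != prev) {
            fields++;
            prev = bit;
        }
    }
    return fields;
}

/* True when, for every field, the mask bits are all 0 or all 1. */
static bool tag_mask_valid(uint64_t fmt, uint64_t mask)
{
    unsigned n = tag_bits(fmt), i = 0;
    while (i < n) {
        unsigned start = i;
        uint64_t bit = (fmt >> i) & 1;
        while (i < n && ((fmt >> i) & 1) == bit)
            i++;
        unsigned len = i - start;
        uint64_t ones = (len == 64) ? ~UINT64_C(0)
                                    : ((UINT64_C(1) << len) - 1);
        uint64_t field = ones << start;
        uint64_t m = mask & field;
        if (m != 0 && m != field)
            return false;
    }
    return true;
}
```

       With the 0x30FF example, tag_bits reports 14 and tag_fields reports 3.
       A mask of 0x30FF or 0x0F00 passes the check, while 0x0100 fails because
       it covers only part of the 4-bit field.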
812
813   tx_ctx_cnt - Transmit Context Count
814       Number  of  transmit  contexts  to associate with the endpoint.  If not
815       specified (0), 1 context will be assigned if the endpoint supports out‐
816       bound  transfers.   Transmit  contexts  are independent transmit queues
817       that may be separately configured.  Each transmit context may be  bound
818       to  a  separate CQ, and no ordering is defined between contexts.  Addi‐
819       tionally, no synchronization is needed when accessing contexts in  par‐
820       allel.
821
822       If  the  count is set to the value FI_SHARED_CONTEXT, the endpoint will
823       be configured to use a shared transmit context,  if  supported  by  the
824       provider.   Providers that do not support shared transmit contexts will
825       fail the request.
826
827       See the scalable endpoint and shared contexts sections  for  additional
828       details.
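
       The three cases for the context count (unspecified, shared, explicit)
       can be summarized in code.  This is a sketch under stated assumptions:
       SHARED_CONTEXT_SENTINEL stands in for FI_SHARED_CONTEXT and its value
       here is assumed, and plan_contexts, struct ctx_plan, and enum ctx_kind
       are hypothetical names.  The same logic applies to rx_ctx_cnt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for FI_SHARED_CONTEXT; the value is an assumption. */
#define SHARED_CONTEXT_SENTINEL SIZE_MAX

enum ctx_kind { CTX_NONE, CTX_DEDICATED, CTX_SHARED };

struct ctx_plan {
    enum ctx_kind kind;
    size_t count;       /* number of dedicated contexts, else 0 */
};

/* Interpret a tx_ctx_cnt (or rx_ctx_cnt) value. */
static struct ctx_plan plan_contexts(size_t ctx_cnt, bool ep_supports_direction)
{
    struct ctx_plan plan = { CTX_NONE, 0 };

    if (ctx_cnt == SHARED_CONTEXT_SENTINEL) {
        plan.kind = CTX_SHARED;        /* provider may still fail this */
    } else if (ctx_cnt == 0) {
        if (ep_supports_direction) {   /* default: one context */
            plan.kind = CTX_DEDICATED;
            plan.count = 1;
        }
    } else {
        plan.kind = CTX_DEDICATED;
        plan.count = ctx_cnt;
    }
    return plan;
}
```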
829
830   rx_ctx_cnt - Receive Context Count
831       Number  of  receive  contexts  to  associate with the endpoint.  If not
832       specified (0), 1 context will be assigned if the endpoint supports inbound
833       transfers.  Receive contexts are independent processing queues that may
834       be separately configured.  Each receive context may be bound to a sepa‐
835       rate CQ, and no ordering is defined between contexts.  Additionally, no
836       synchronization is needed when accessing contexts in parallel.
837
838       If the count is set to the value FI_SHARED_CONTEXT, the  endpoint  will
839       be  configured  to  use  a  shared receive context, if supported by the
840       provider.  Providers that do not support shared receive  contexts  will
841       fail the request.
842
843       See  the  scalable endpoint and shared contexts sections for additional
844       details.
845
846   auth_key_size - Authorization Key Length
847       The length of the authorization key in bytes.  This field will be 0  if
848       authorization  keys  are  not available or used.  This field is ignored
849       unless the fabric is opened with API version 1.5 or greater.
850
851   auth_key - Authorization Key
852       If supported by the fabric, an authorization key (a.k.a.  job  key)  to
853       associate  with  the  endpoint.   An authorization key is used to limit
854       communication between endpoints.  Only peer  endpoints  that  are  pro‐
855       grammed  to use the same authorization key may communicate.  Authoriza‐
856       tion keys are often used to implement job keys, to ensure that process‐
857       es  running  in  different jobs do not accidentally cross traffic.  The
858       domain authorization key will be used if auth_key_size  is  set  to  0.
859       This  field is ignored unless the fabric is opened with API version 1.5
860       or greater.
861

TRANSMIT CONTEXT ATTRIBUTES

863       Attributes specific to the transmit capabilities  of  an  endpoint  are
864       specified using struct fi_tx_attr.
865
866              struct fi_tx_attr {
867                  uint64_t  caps;
868                  uint64_t  mode;
869                  uint64_t  op_flags;
870                  uint64_t  msg_order;
871                  uint64_t  comp_order;
872                  size_t    inject_size;
873                  size_t    size;
874                  size_t    iov_limit;
875                  size_t    rma_iov_limit;
876                  uint32_t  tclass;
877              };
878
879   caps - Capabilities
880       The  requested capabilities of the context.  The capabilities must be a
881       subset of those requested of the associated endpoint.  See the CAPABIL‐
882       ITIES  section  of  fi_getinfo(3)  for capability details.  If the caps
883       field is 0 on input to fi_getinfo(3), the  applicable  capability  bits
884       from the fi_info structure will be used.
885
886       The  following  capabilities  apply to the transmit attributes: FI_MSG,
887       FI_RMA, FI_TAGGED,  FI_ATOMIC,  FI_READ,  FI_WRITE,  FI_SEND,  FI_HMEM,
888       FI_TRIGGER,  FI_FENCE,  FI_MULTICAST, FI_RMA_PMEM, FI_NAMED_RX_CTX, and
889       FI_COLLECTIVE.
890
891       Many applications will be able to ignore this field and rely solely  on
892       the  fi_info::caps field.  Use of this field provides fine grained con‐
893       trol over the transmit capabilities associated with an endpoint.  It is
894       useful  when  handling  scalable endpoints, with multiple transmit con‐
895       texts, for example, and allows configuring a specific transmit  context
896       with fewer capabilities than those supported by the endpoint or other
897       transmit contexts.
898
899   mode
900       The operational mode bits of the context.  The mode bits will be a sub‐
901       set  of  those  associated  with the endpoint.  See the MODE section of
902       fi_getinfo(3) for details.  A mode value of 0 will be ignored on  input
903       to fi_getinfo(3), with the mode value of the fi_info structure used in‐
904       stead.  On return from fi_getinfo(3), the mode  will  be  set  only  to
905       those constraints specific to transmit operations.
906
907   op_flags - Default transmit operation flags
908       Flags that control the behavior of operations submitted against the
909       context.  Applicable flags are listed in the Operation Flags section.
910
911   msg_order - Message Ordering
912       Message ordering refers to the order in which transport  layer  headers
913       (as  viewed  by the application) are identified and processed.  Relaxed
914       message order enables data transfers to be sent and received out of or‐
915       der,  which may improve performance by utilizing multiple paths through
916       the fabric from the initiating endpoint to a target endpoint.   Message
917       order  applies  only  between  a single source and destination endpoint
918       pair.  Ordering between different target endpoints is not defined.
919
920       Message order is determined using a set of ordering bits.  Each set bit
921       indicates  that  ordering  is  maintained between data transfers of the
922       specified type.  Message order is defined for [read | write | send] op‐
923       erations submitted by an application after [read | write | send] opera‐
924       tions.
925
926       Message ordering only applies to the end to end transmission of
927       transport headers.  Message ordering is necessary for, but does not
928       guarantee, the order in which message data is sent or received by the
929       transport layer.  Message ordering requires matching ordering
930       semantics on the receiving side of a data transfer operation in order
931       to guarantee that ordering is met.
932
933       FI_ORDER_ATOMIC_RAR
934              Atomic  read  after  read.   If set, atomic fetch operations are
935              transmitted in the order  submitted  relative  to  other  atomic
936              fetch operations.  If not set, atomic fetches may be transmitted
937              out of order from their submission.
938
939       FI_ORDER_ATOMIC_RAW
940              Atomic read after write.  If set, atomic  fetch  operations  are
941              transmitted in the order submitted relative to atomic update op‐
942              erations.  If not set, atomic fetches may be  transmitted  ahead
943              of atomic updates.
944
945       FI_ORDER_ATOMIC_WAR
946              Atomic write after read.  If set, atomic update operations are
947              transmitted in the order submitted relative to atomic fetch  op‐
948              erations.   If  not set, atomic updates may be transmitted ahead
949              of atomic fetches.
950
951       FI_ORDER_ATOMIC_WAW
952              Atomic write after write.  If set, atomic update operations are
953              transmitted in the order submitted relative to other atomic
954              update operations.  If not set, atomic updates may be transmitted
955              out of order from their submission.
956
957       FI_ORDER_NONE
958              No  ordering  is  specified.  This value may be used as input in
959              order to obtain the  default  message  order  supported  by  the
960              provider.  FI_ORDER_NONE is an alias for the value 0.
961
962       FI_ORDER_RAR
963              Read  after  read.   If  set, RMA and atomic read operations are
964              transmitted in the order submitted relative  to  other  RMA  and
965              atomic read operations.  If not set, RMA and atomic reads may be
966              transmitted out of order from their submission.
967
968       FI_ORDER_RAS
969              Read after send.  If set, RMA and  atomic  read  operations  are
970              transmitted  in the order submitted relative to message send op‐
971              erations, including tagged sends.  If not set,  RMA  and  atomic
972              reads may be transmitted ahead of sends.
973
974       FI_ORDER_RAW
975              Read  after  write.   If set, RMA and atomic read operations are
976              transmitted in the order submitted relative to  RMA  and  atomic
977              write  operations.   If  not  set,  RMA  and atomic reads may be
978              transmitted ahead of RMA and atomic writes.
979
980       FI_ORDER_RMA_RAR
981              RMA read after read.  If set, RMA read operations are  transmit‐
982              ted  in  the  order  submitted relative to other RMA read opera‐
983              tions.  If not set, RMA reads may be transmitted  out  of  order
984              from their submission.
985
986       FI_ORDER_RMA_RAW
987              RMA read after write.  If set, RMA read operations are transmit‐
988              ted in the order submitted relative to RMA write operations.  If
989              not set, RMA reads may be transmitted ahead of RMA writes.
990
991       FI_ORDER_RMA_WAR
992              RMA  write  after read.  If set, RMA write operations are trans‐
993              mitted in the order submitted relative to RMA  read  operations.
994              If not set, RMA writes may be transmitted ahead of RMA reads.
995
996       FI_ORDER_RMA_WAW
997              RMA  write after write.  If set, RMA write operations are trans‐
998              mitted in the order submitted relative to other RMA write opera‐
999              tions.   If  not set, RMA writes may be transmitted out of order
1000              from their submission.
1001
1002       FI_ORDER_SAR
1003              Send after read.  If set,  message  send  operations,  including
1004              tagged sends, are transmitted in the order submitted relative to RMA
1005              and atomic read operations.  If not set, message  sends  may  be
1006              transmitted ahead of RMA and atomic reads.
1007
1008       FI_ORDER_SAS
1009              Send after send.  If set, message send operations, including
1010              tagged sends, are transmitted in the order submitted relative to
1011              other message sends.  If not set, message sends may be
1012              transmitted out of order from their submission.
1013
1014       FI_ORDER_SAW
1015              Send after write.  If set, message  send  operations,  including
1016              tagged sends, are transmitted in the order submitted relative to RMA
1017              and atomic write operations.  If not set, message sends  may  be
1018              transmitted ahead of RMA and atomic writes.
1019
1020       FI_ORDER_WAR
1021              Write  after  read.  If set, RMA and atomic write operations are
1022              transmitted in the order submitted relative to  RMA  and  atomic
1023              read  operations.   If  not  set,  RMA  and atomic writes may be
1024              transmitted ahead of RMA and atomic reads.
1025
1026       FI_ORDER_WAS
1027              Write after send.  If set, RMA and atomic write  operations  are
1028              transmitted  in the order submitted relative to message send op‐
1029              erations, including tagged sends.  If not set,  RMA  and  atomic
1030              writes may be transmitted ahead of sends.
1031
1032       FI_ORDER_WAW
1033              Write  after write.  If set, RMA and atomic write operations are
1034              transmitted in the order submitted relative  to  other  RMA  and
1035              atomic  write operations.  If not set, RMA and atomic writes may
1036              be transmitted out of order from their submission.
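
       Because message order is expressed as a set of bits, an application can
       test whether the ordering reported by a provider covers what it needs
       with a simple mask comparison.  The sketch below uses stand-in
       constants (ORD_SAS, ORD_RAW, ORD_WAW) with arbitrary bit positions;
       they are not the values of the FI_ORDER_* flags, and order_satisfied is
       a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in ordering bits; positions are arbitrary, not FI_ORDER_* values. */
#define ORD_NONE UINT64_C(0)
#define ORD_RAW  (UINT64_C(1) << 0)
#define ORD_WAW  (UINT64_C(1) << 1)
#define ORD_SAS  (UINT64_C(1) << 2)

/* True when every required ordering bit is present in the provided set. */
static bool order_satisfied(uint64_t provided, uint64_t required)
{
    return (provided & required) == required;
}
```

       A request for no ordering (the value 0) is satisfied by any provider
       setting, which matches the use of FI_ORDER_NONE as an input value.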
1037
1038   comp_order - Completion Ordering
1039       Completion ordering refers to the order in which completed requests are
1040       written  into  the completion queue.  Completion ordering is similar to
1041       message order.  Relaxed completion order may enable faster reporting of
1042       completed  transfers,  allow  acknowledgments to be sent over different
1043       fabric paths, and support more sophisticated  retry  mechanisms.   This
1044       can  result  in lower-latency completions, particularly when using con‐
1045       nectionless endpoints.  Strict completion  ordering  may  require  that
1046       providers queue completed operations or limit available optimizations.
1047
1048       For transmit requests, completion ordering depends on the endpoint com‐
1049       munication type.  For unreliable communication, completion ordering ap‐
1050       plies  to all data transfer requests submitted to an endpoint.  For re‐
1051       liable communication, completion ordering only applies to requests that
1052       target  a single destination endpoint.  Completion ordering of requests
1053       that target different endpoints over a reliable transport  is  not  de‐
1054       fined.
1055
1056       Applications should specify the completion ordering that they support
1057       or require.  Providers should return the completion order that they
1058       actually provide, with the constraint that the returned ordering is
1059       at least as strict as that specified by the application.  Supported
1060       completion order values are:
1061
1062       FI_ORDER_NONE
1063              No  ordering is defined for completed operations.  Requests sub‐
1064              mitted to the transmit context may complete in any order.
1065
1066       FI_ORDER_STRICT
1067              Requests complete in the order in which they  are  submitted  to
1068              the transmit context.
1069
1070   inject_size
1071       The  requested  inject operation size (see the FI_INJECT flag) that the
1072       context will support.  This is the maximum size data transfer that  can
1073       be  associated  with  an inject operation (such as fi_inject) or may be
1074       used with the FI_INJECT data transfer flag.
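
       An application typically compares the length of a transfer with
       inject_size to decide whether the inject path is available.  A minimal
       sketch of that decision follows; enum tx_path and choose_tx_path are
       hypothetical names, and the actual posting calls are left out.

```c
#include <stddef.h>

enum tx_path {
    TX_PATH_INJECT,   /* small enough for an inject operation */
    TX_PATH_SEND      /* must use a regular transmit operation */
};

/* inject_size is the maximum transfer size usable with an inject
 * operation, so a length equal to it still qualifies. */
static enum tx_path choose_tx_path(size_t len, size_t inject_size)
{
    return len <= inject_size ? TX_PATH_INJECT : TX_PATH_SEND;
}
```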
1075
1076   size
1077       The size of the transmit context.  The mapping of the size value to re‐
1078       sources  is provider specific, but it is directly related to the number
1079       of command entries allocated for the endpoint.  A  smaller  size  value
1080       consumes fewer hardware and software resources, while a larger size al‐
1081       lows queuing more transmit requests.
1082
1083       While the size attribute guides the size of underlying endpoint  trans‐
1084       mit  queue,  there  is  not  necessarily a one-to-one mapping between a
1085       transmit operation and a queue entry.  A single transmit operation  may
1086       consume multiple queue entries; for example, one per scatter-gather en‐
1087       try.  Additionally, the size field is intended to guide the  allocation
1088       of  the  endpoint’s transmit context.  Specifically, for connectionless
1089       endpoints, there may be lower-level queues used to track communication
1090       on a per peer basis.  The sizes of any lower-level queues may be
1091       significantly smaller than the endpoint’s transmit size, in order to
1092       reduce resource utilization.
1093
1094   iov_limit
1095       This is the maximum number of IO vectors (scatter-gather elements) that
1096       a single posted operation may reference.
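
       When a gather list holds more entries than iov_limit allows, one option
       is to post it as several operations.  The sketch below only computes
       how many operations that would take; ops_needed is a hypothetical
       helper, and splitting is appropriate only if the application protocol
       tolerates the data arriving as separate transfers.

```c
#include <assert.h>
#include <stddef.h>

/* Number of posted operations needed to cover iov_count entries when
 * each operation may reference at most iov_limit entries. */
static size_t ops_needed(size_t iov_count, size_t iov_limit)
{
    assert(iov_limit > 0);
    /* Ceiling division written to avoid overflow in the addition. */
    return iov_count / iov_limit + (iov_count % iov_limit != 0);
}
```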
1097
1098   rma_iov_limit
1099       This is the maximum number of RMA IO vectors (scatter-gather  elements)
1100       that  an RMA or atomic operation may reference.  The rma_iov_limit cor‐
1101       responds to the rma_iov_count values in RMA and atomic operations.  See
1102       struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
1103       for additional details.  This limit applies to both the number  of  RMA
1104       IO  vectors that may be specified when initiating an operation from the
1105       local endpoint, as well as the maximum number of IO vectors that may be
1106       carried in a single request from a remote endpoint.
1107
1108   Traffic Class (tclass)
1109       Traffic classes can be a differentiated services code point (DSCP) val‐
1110       ue, one of the following defined labels, or a provider-specific defini‐
1111       tion.  If tclass is unset or set to FI_TC_UNSPEC, the endpoint will use
1112       the default traffic class associated with the domain.
1113
1114       FI_TC_BEST_EFFORT
1115              This is the default in the absence of any other local or  fabric
1116              configuration.   This  class carries the traffic for a number of
1117              applications executing concurrently over the same network infra‐
1118              structure.   Even  though it is shared, network capacity and re‐
1119              source allocation are distributed  fairly  across  the  applica‐
1120              tions.
1121
1122       FI_TC_BULK_DATA
1123              This  class is intended for large data transfers associated with
1124              I/O and is present to separate sustained I/O transfers from oth‐
1125              er application inter-process communications.
1126
1127       FI_TC_DEDICATED_ACCESS
1128              This  class operates at the highest priority, except the manage‐
1129              ment class.  It carries a high bandwidth allocation, minimum la‐
1130              tency targets, and the highest scheduling and arbitration prior‐
1131              ity.
1132
1133       FI_TC_LOW_LATENCY
1134              This class supports low latency, low jitter data patterns
1135              typically caused by transactional data exchanges, barrier
1136              synchronizations, and collective operations that are typical of HPC
1137              applications.  This class often requires maximum tolerable
1138              latencies that data transfers must achieve for correct or performant
1139              operation.  Fulfillment of such requests in this class will
1140              typically require accompanying bandwidth and message size
1141              limitations so as not to consume excessive bandwidth at high
1142              priority.
1143
1144       FI_TC_NETWORK_CTRL
1145              This class is intended for traffic directly  related  to  fabric
1146              (network) management, which is critical to the correct operation
1147              of the network.  Its use is typically restricted  to  privileged
1148              network management applications.
1149
1150       FI_TC_SCAVENGER
1151              This  class  is  used for data that is desired but does not have
1152              strict delivery requirements, such as in-band network or  appli‐
1153              cation  level monitoring data.  Use of this class indicates that
1154              the traffic is considered lower priority and should  not  inter‐
1155              fere with higher priority workflows.
1156
1157       fi_tc_dscp_set / fi_tc_dscp_get
1158              DSCP  values  are  supported via the DSCP get and set functions.
1159              The definitions for DSCP values are outside the scope of libfab‐
1160              ric.  See the fi_tc_dscp_set and fi_tc_dscp_get function defini‐
1161              tions for details on their use.
1162

RECEIVE CONTEXT ATTRIBUTES

1164       Attributes specific to the receive  capabilities  of  an  endpoint  are
1165       specified using struct fi_rx_attr.
1166
1167              struct fi_rx_attr {
1168                  uint64_t  caps;
1169                  uint64_t  mode;
1170                  uint64_t  op_flags;
1171                  uint64_t  msg_order;
1172                  uint64_t  comp_order;
1173                  size_t    total_buffered_recv;
1174                  size_t    size;
1175                  size_t    iov_limit;
1176              };
1177
1178   caps - Capabilities
1179       The  requested capabilities of the context.  The capabilities must be a
1180       subset of those requested of the associated endpoint.  See the CAPABIL‐
1181       ITIES  section  of  fi_getinfo(3)  for capability details.  If the caps
1182       field is 0 on input to fi_getinfo(3), the  applicable  capability  bits
1183       from the fi_info structure will be used.
1184
1185       The  following  capabilities  apply  to the receive attributes: FI_MSG,
1186       FI_RMA, FI_TAGGED, FI_ATOMIC, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_RECV,
1187       FI_HMEM,  FI_TRIGGER,  FI_RMA_PMEM,  FI_DIRECTED_RECV, FI_VARIABLE_MSG,
1188       FI_MULTI_RECV, FI_SOURCE, FI_RMA_EVENT, FI_SOURCE_ERR,  and  FI_COLLEC‐
1189       TIVE.
1190
1191       Many  applications will be able to ignore this field and rely solely on
1192       the fi_info::caps field.  Use of this field provides fine grained  con‐
1193       trol  over the receive capabilities associated with an endpoint.  It is
1194       useful when handling scalable endpoints,  with  multiple  receive  con‐
1195       texts,  for  example, and allows configuring a specific receive context
1196       with fewer capabilities than those supported by the endpoint  or  other
1197       receive contexts.
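
       The subset rule and the "0 means inherit" rule above can be sketched as
       plain  bit  arithmetic.   This  is  only an illustration: the DEMO_CAP_*
       values and both function names are hypothetical  stand-ins,  not  libfab‐
       ric  definitions, and the sketch inherits all endpoint capability bits
       rather than filtering them down to the receive-applicable ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in bits for illustration only; real applications
 * use the FI_* capability flags from <rdma/fabric.h>. */
#define DEMO_CAP_MSG    (1ULL << 0)
#define DEMO_CAP_TAGGED (1ULL << 1)
#define DEMO_CAP_RECV   (1ULL << 2)

/* True when every bit set in ctx_caps is also set in ep_caps. */
bool demo_caps_is_subset(uint64_t ctx_caps, uint64_t ep_caps)
{
    return (ctx_caps & ~ep_caps) == 0;
}

/* Resolve the capabilities for a receive context.
 * requested == 0 models "inherit from the endpoint".
 * Returns false (leaving *out untouched) if the request is not a subset. */
bool demo_resolve_rx_caps(uint64_t ep_caps, uint64_t requested, uint64_t *out)
{
    assert(out != NULL);
    if (requested == 0) {
        *out = ep_caps;
        return true;
    }
    if (!demo_caps_is_subset(requested, ep_caps))
        return false;
    *out = requested;
    return true;
}
```

       An application could run a check of this shape on its own  fi_rx_attr
       values before requesting a context, so that an invalid request is caught
       before it reaches the provider.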
1198
1199   mode
1200       The operational mode bits of the context.  The mode bits will be a sub‐
1201       set of those associated with the endpoint.  See  the  MODE  section  of
1202       fi_getinfo(3)  for details.  A mode value of 0 will be ignored on input
1203       to fi_getinfo(3), with the mode value of the fi_info structure used in‐
1204       stead.   On  return  from  fi_getinfo(3),  the mode will be set only to
1205       those constraints specific to receive operations.
1206
1207   op_flags - Default receive operation flags
1208       Flags that control the behavior of operations  submitted  against  the
1209       context.  Applicable flags are listed in the Operation Flags section.
1210
1211   msg_order - Message Ordering
1212       For  a  description of message ordering, see the msg_order field in the
1213       Transmit Context Attribute section.  Receive context  message  ordering
1214       defines  the order in which received transport message headers are pro‐
1215       cessed when received by an endpoint.  When ordering is  set,  it  indi‐
1216       cates that message headers will be processed in order, based on how the
1217       transmit side has identified the messages.  Typically, this means  that
1218       messages  will  be  handled  in order based on a message level sequence
1219       number.
1220
1221       The following ordering flags, as defined for  transmit  ordering,  also
1222       apply  to  the processing of received operations: FI_ORDER_NONE, FI_OR‐
1223       DER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_OR‐
1224       DER_WAS,  FI_ORDER_SAR,  FI_ORDER_SAW,  FI_ORDER_SAS, FI_ORDER_RMA_RAR,
1225       FI_ORDER_RMA_RAW,  FI_ORDER_RMA_WAR,  FI_ORDER_RMA_WAW,  FI_ORDER_ATOM‐
1226       IC_RAR,  FI_ORDER_ATOMIC_RAW,  FI_ORDER_ATOMIC_WAR,  and FI_ORDER_ATOM‐
1227       IC_WAW.
1228
1229   comp_order - Completion Ordering
1230       For a description of completion ordering, see the comp_order  field  in
1231       the Transmit Context Attribute section.
1232
1233       FI_ORDER_DATA
1234              When  set, this bit indicates that received data is written into
1235              memory in order.  Data ordering applies to  memory  accessed  as
1236              part of a single operation and between operations if message or‐
1237              dering is guaranteed.
1238
1239       FI_ORDER_NONE
1240              No ordering is defined for completed operations.  Receive opera‐
1241              tions  may complete in any order, regardless of their submission
1242              order.
1243
1244       FI_ORDER_STRICT
1245              Receive operations complete in the order in which they are  pro‐
1246              cessed by the receive context, based on the receive side msg_or‐
1247              der attribute.
1248
1249   total_buffered_recv
1250       This field is supported for backwards compatibility purposes.  It is  a
1251       hint to the provider of the total available space that may be needed to
1252       buffer messages that are received for which there is  no  matching  re‐
1253       ceive  operation.   The  provider may adjust or ignore this value.  The
1254       allocation of internal network  buffering  among  received  messages  is
1255       provider specific.  For instance, a provider may limit the size of mes‐
1256       sages which can be buffered or the amount of buffering allocated  to  a
1257       single message.
1258
1259       If  receive  side buffering is disabled (total_buffered_recv = 0) and a
1260       message is received by an endpoint, then the behavior is  dependent  on
1261       whether  resource management has been enabled (FI_RM_ENABLED has been set
1262       or not).  See the Resource Management section of fi_domain.3  for  fur‐
1263       ther  clarification.   It  is  recommended that applications enable re‐
1264       source management if they  anticipate  receiving  unexpected  messages,
1265       rather than modifying this value.
1266
1267   size
1268       The  size of the receive context.  The mapping of the size value to re‐
1269       sources is provider specific, but it is directly related to the  number
1270       of  command  entries  allocated for the endpoint.  A smaller size value
1271       consumes fewer hardware and software resources, while a larger size al‐
1272       lows queuing more receive requests.
1273
1274       While the size attribute guides the size of underlying endpoint receive
1275       queue, there is not necessarily a one-to-one mapping between a  receive
1276       operation  and  a  queue entry.  A single receive operation may consume
1277       multiple queue entries; for example, one per scatter-gather entry.  Ad‐
1278       ditionally,  the  size field is intended to guide the allocation of the
1279       endpoint’s receive  context.   Specifically,  for  connectionless  end‐
1280       points, there may be lower-level queues used to track communication on a
1281       per peer basis.  The sizes of any lower-level queues may be significantly
1282       smaller than the endpoint’s receive size, in order to reduce
1283       resource utilization.
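
       To make the queue-entry arithmetic concrete, the sketch below estimates
       a lower bound on how many receives fit in a context of  a  given  size.
       It  assumes  the  "one  entry  per scatter-gather element" case that the
       text gives as an example; actual providers may  consume  entries  dif‐
       ferently.  The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound on the number of receives that can be queued, ASSUMING
 * each receive consumes one queue entry per scatter-gather element.
 * rx_size   - the size value of the receive context
 * iov_count - scatter-gather elements used by every posted receive */
size_t demo_min_postable_recvs(size_t rx_size, size_t iov_count)
{
    assert(iov_count > 0);
    return rx_size / iov_count;
}
```

       Under  this  assumption,  a  context of size 1024 holds 256 receives of
       four scatter-gather elements each, which is why  an  application  that
       posts  multi-element receives may want a larger size than its count of
       outstanding operations suggests.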
1284
1285   iov_limit
1286       This is the maximum number of IO vectors (scatter-gather elements) that
1287       a single posted operation may reference.
1288

SCALABLE ENDPOINTS

1290       A  scalable  endpoint  is a communication portal that supports multiple
1291       transmit and receive contexts.  Scalable endpoints are loosely  modeled
1292       after  the  networking  concept  of transmit/receive side scaling, also
1293       known as multi-queue.  Support for scalable endpoints is domain specif‐
1294       ic.   Scalable  endpoints may improve the performance of multi-threaded
1295       and parallel applications, by allowing threads  to  access  independent
1296       transmit  and  receive queues.  A scalable endpoint has a single trans‐
1297       port level address, which can reduce the memory requirements needed  to
1298       store  remote  addressing data, versus using standard endpoints.  Scal‐
1299       able endpoints cannot be used directly  for  communication  operations,
1300       and  require  the application to explicitly create transmit and receive
1301       contexts as described below.
1302
1303   fi_tx_context
1304       Transmit contexts are independent transmit queues.  Ordering  and  syn‐
1305       chronization between contexts are not defined.  Conceptually a transmit
1306       context behaves similar to a send-only endpoint.   A  transmit  context
1307       may  be  configured  with fewer capabilities than the base endpoint and
1308       with different attributes (such as  ordering  requirements  and  inject
1309       size)  than  other contexts associated with the same scalable endpoint.
1310       Each transmit context has its own  completion  queue.   The  number  of
1311       transmit  contexts associated with an endpoint is specified during end‐
1312       point creation.
1313
1314       The fi_tx_context call is used to retrieve a specific context,  identi‐
1315       fied  by  an  index  (see  above  for  details  on transmit context at‐
1316       tributes).  Providers may dynamically allocate contexts when fi_tx_con‐
1317       text  is called, or may statically create all contexts when fi_endpoint
1318       is invoked.  By default, a transmit context inherits the properties  of
1319       its  associated  endpoint.   However,  applications may request context
1320       specific attributes through the attr parameter.  Support for per trans‐
1321       mit  context  attributes  is  provider  specific  and  not  guaranteed.
1322       Providers will return the actual attributes  assigned  to  the  context
1323       through the attr parameter, if provided.
1324
1325   fi_rx_context
1326       Receive  contexts are independent receive queues for receiving incoming
1327       data.  Ordering and synchronization between contexts  are  not  guaran‐
1328       teed.  Conceptually a receive context behaves similar to a receive-only
1329       endpoint.  A receive context may be configured with fewer  capabilities
1330       than  the base endpoint and with different attributes (such as ordering
1331       requirements and inject size) than other contexts associated  with  the
1332       same  scalable  endpoint.   Each receive context has its own completion
1333       queue.  The number of receive contexts associated with an  endpoint  is
1334       specified during endpoint creation.
1335
1336       Receive contexts are often associated with steering flows, that specify
1337       which incoming packets targeting a scalable endpoint to process.   How‐
1338       ever,  receive  contexts  may be targeted directly by the initiator, if
1339       supported by the underlying protocol.  Such contexts are referred to as
1340       `named'.   Support  for named contexts must be indicated by setting the
1341       caps FI_NAMED_RX_CTX capability when the corresponding endpoint is cre‐
1342       ated.   Support  for named receive contexts is coordinated with address
1343       vectors.  See fi_av(3) and fi_rx_addr(3).
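
       The  text  above does not spell out how an initiator addresses a named
       receive context.  The sketch below ASSUMES a layout  in  which  the  most
       significant  rx_ctx_bits bits of a 64-bit address carry the context in‐
       dex; the type and function names are hypothetical.   Applications  should
       call fi_rx_addr rather than reimplementing this.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_addr_t;   /* hypothetical stand-in for fi_addr_t */

/* Combine a base address with a receive context index, ASSUMING the
 * index occupies the top rx_ctx_bits bits of the 64-bit address. */
demo_addr_t demo_rx_addr(demo_addr_t base, unsigned rx_index,
                         unsigned rx_ctx_bits)
{
    assert(rx_ctx_bits > 0 && rx_ctx_bits < 64);
    /* The index must fit in the reserved bits ... */
    assert(((uint64_t)rx_index >> rx_ctx_bits) == 0);
    /* ... and the base address must not already use them. */
    assert((base >> (64 - rx_ctx_bits)) == 0);
    return ((uint64_t)rx_index << (64 - rx_ctx_bits)) | base;
}

/* Recover the context index from an address built by demo_rx_addr. */
unsigned demo_rx_index(demo_addr_t addr, unsigned rx_ctx_bits)
{
    assert(rx_ctx_bits > 0 && rx_ctx_bits < 64);
    return (unsigned)(addr >> (64 - rx_ctx_bits));
}
```

       With 8 reserved bits, base address 5 and context index 2 would combine
       to 0x0200000000000005 under this assumed layout.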
1344
1345       The fi_rx_context call is used to retrieve a specific context,  identi‐
1346       fied by an index (see above for details on receive context attributes).
1347       Providers may  dynamically  allocate  contexts  when  fi_rx_context  is
1348       called,  or  may statically create all contexts when fi_endpoint is in‐
1349       voked.  By default, a receive context inherits the  properties  of  its
1350       associated endpoint.  However, applications may request context specif‐
1351       ic attributes through the attr parameter.  Support for per receive con‐
1352       text  attributes  is  provider  specific and not guaranteed.  Providers
1353       will return the actual attributes assigned to the context  through  the
1354       attr parameter, if provided.
1355

SHARED CONTEXTS

1357       Shared  contexts  are  transmit  and receive contexts explicitly shared
1358       among one or more endpoints.  A shareable context allows an application
1359       to  use  a  single dedicated provider resource among multiple transport
1360       addressable endpoints.  This can greatly reduce the resources needed to
1361       manage  communication  over multiple endpoints by multiplexing transmit
1362       and/or receive processing, with the potential cost of  serializing  ac‐
1363       cess  across multiple endpoints.  Support for shareable contexts is do‐
1364       main specific.
1365
1366       Conceptually, shareable transmit contexts are transmit queues that  may
1367       be accessed by many endpoints.  The use of a shared transmit context is
1368       mostly opaque to an application.  Applications must allocate  and  bind
1369       shared  transmit  contexts  to endpoints, but operations are posted di‐
1370       rectly to the endpoint.  Shared transmit contexts  are  not  associated
1371       with completion queues or counters.  Completed operations are posted to
1372       the CQs bound to the endpoint.  An endpoint may only be associated with
1373       a single shared transmit context.
1374
1375       Unlike  shared  transmit  contexts, applications interact directly with
1376       shared receive contexts.  Users post  receive  buffers  directly  to  a
1377       shared  receive  context, with the buffers usable by any endpoint bound
1378       to the shared receive context.  Shared receive contexts are not associ‐
1379       ated  with completion queues or counters.  Completed receive operations
1380       are posted to the CQs bound to the endpoint.  An endpoint may  only  be
1381       associated  with  a single shared receive context, and all connectionless
1382       endpoints associated with a shared receive context  must  also  share  the
1383       same address vector.
1384
1385       Endpoints  associated  with a shared transmit context may use dedicated
1386       receive contexts, and vice-versa.  Or an endpoint may use shared trans‐
1387       mit  and  receive  contexts.  And there is no requirement that the same
1388       group of endpoints sharing a context of one type also share the context
1389       of  an  alternate type.  Furthermore, an endpoint may use a shared con‐
1390       text of one type, but a scalable set of contexts of the alternate type.
1391
1392   fi_stx_context
1393       This call is used to open a shareable transmit context (see  above  for
1394       details on the transmit context attributes).  Endpoints associated with
1395       a shared transmit context must use a subset of the  transmit  context’s
1396       attributes.   Note  that  this  is  the  reverse of the requirement for
1397       transmit contexts for scalable endpoints.
1398
1399   fi_srx_context
1400       This allocates a shareable receive context (see above  for  details  on
1401       the  receive  context  attributes).  Endpoints associated with a shared
1402       receive context must use a subset of the receive context’s  attributes.
1403       Note  that  this is the reverse of the requirement for receive contexts
1404       for scalable endpoints.
1405

SOCKET ENDPOINTS

1407       The following feature and description should be  considered  experimen‐
1408       tal.  Until the experimental tag is removed, the interfaces, semantics,
1409       and data structures associated with socket endpoints may change between
1410       library versions.
1411
1412       This  section  applies  to  endpoints  of  type  FI_EP_SOCK_STREAM  and
1413       FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1414
1415       Socket endpoints are defined with semantics that  allow  them  to  more
1416       easily  be  adopted by developers familiar with the UNIX socket API, or
1417       by middleware that exposes the socket API, while still taking advantage
1418       of high-performance hardware features.
1419
1420       The  key difference between socket endpoints and other active endpoints
1421       is that socket endpoints use synchronous data  transfers.   Buffers  passed
1422       into  send and receive operations revert to the control of the applica‐
1423       tion upon returning from the function  call.   As  a  result,  no  data
1424       transfer  completions  are reported to the application, and socket end‐
1425       points are not associated with completion queues or counters.
1426
1427       Socket endpoints support  a  subset  of  message  operations:  fi_send,
1428       fi_sendv,  fi_sendmsg,  fi_recv,  fi_recvv,  fi_recvmsg, and fi_inject.
1429       Because data transfers are synchronous, the return value from send  and
1430       receive operations indicates the number of bytes transferred on success,
1431       or a negative value on error, including -FI_EAGAIN if the endpoint can‐
1432       not  send  or receive any data because of full or empty queues, respec‐
1433       tively.
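
       Because a synchronous send may accept only part of a buffer,  a  caller
       typically  loops  until the data is sent or the queue fills.  The sketch
       below shows one way to structure that loop.  Everything  in  it  is  hypo‐
       thetical  scaffolding:  demo_send_fn stands in for a call such as fi_send
       on a socket endpoint, EAGAIN from <errno.h> stands in for FI_EAGAIN, and
       the toy endpoint exists only to exercise the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback modeling a synchronous send: returns the number
 * of bytes accepted, or a negative error code such as -EAGAIN. */
typedef ssize_t (*demo_send_fn)(void *ep, const void *buf, size_t len);

/* Send as much of buf as the endpoint accepts without blocking.
 * Returns the number of bytes sent (possibly short), -EAGAIN if nothing
 * could be sent because the queue is full, or another negative error. */
ssize_t demo_send_some(demo_send_fn send_fn, void *ep,
                       const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(send_fn != NULL);
    while (done < len) {
        ssize_t n = send_fn(ep, p + done, len - done);
        if (n == -EAGAIN)               /* queue full: report progress so far */
            return done ? (ssize_t)done : n;
        if (n < 0)                      /* hard error */
            return n;
        if (n == 0)                     /* no progress: avoid spinning */
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Toy endpoint used only to exercise the loop: accepts at most `chunk`
 * bytes per call until `room` bytes of queue space are exhausted. */
struct demo_ep {
    size_t room;
    size_t chunk;
};

ssize_t demo_mock_send(void *ep, const void *buf, size_t len)
{
    struct demo_ep *e = ep;
    size_t n;

    (void)buf;
    if (e->room == 0)
        return -EAGAIN;
    n = len < e->chunk ? len : e->chunk;
    if (n > e->room)
        n = e->room;
    e->room -= n;
    return (ssize_t)n;
}
```

       On a short return or -EAGAIN, the caller would wait for the  endpoint’s
       file  descriptor  to  be  signaled  and  then  retry with the remaining
       bytes.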
1434
1435       Socket endpoints are associated with event queues and address  vectors,
1436       and  process  connection  management  events asynchronously, similar to
1437       other endpoints.  Unlike UNIX sockets, socket endpoints must  still  be
1438       declared as either active or passive.
1439
1440       Socket endpoints behave like non-blocking sockets.  In order to support
1441       select and poll semantics, active socket endpoints are associated  with
1442       a  file  descriptor  that is signaled whenever the endpoint is ready to
1443       send and/or receive data.  The file descriptor may be  retrieved  using
1444       fi_control.
1445

OPERATION FLAGS

1447       Operation  flags  are  obtained by OR-ing the following flags together.
1448       Operation flags define the default flags applied to an endpoint’s  data
1449       transfer  operations,  where  a flags parameter is not available.  Data
1450       transfer operations that take flags as input override the op_flags val‐
1451       ue of transmit or receive context attributes of an endpoint.
1452
1453       FI_COMMIT_COMPLETE
1454              Indicates  that a completion should not be generated (locally or
1455              at the peer) until the result of an  operation  has  been  made
1456              persistent.   See  fi_cq(3) for additional details on completion
1457              semantics.
1458
1459       FI_COMPLETION
1460              Indicates that a completion queue entry should  be  written  for
1461              data  transfer operations.  This flag only applies to operations
1462              issued on an endpoint that was bound to a completion queue  with
1463              the  FI_SELECTIVE_COMPLETION flag set; otherwise, it is ignored.
1464              See the fi_ep_bind section above for more detail.
1465
1466       FI_DELIVERY_COMPLETE
1467              Indicates that a completion should be generated when the  opera‐
1468              tion  has  been  processed  by the destination endpoint(s).  See
1469              fi_cq(3) for additional details on completion semantics.
1470
1471       FI_INJECT
1472              Indicates that all outbound data buffers should be  returned  to
1473              the  user’s  control  immediately after a data transfer call re‐
1474              turns, even if the operation is  handled  asynchronously.   This
1475              may  require that the provider copy the data into a local buffer
1476              and transfer out of that buffer.  A provider can limit the total
1477              amount  of  send  data that may be buffered and/or the size of a
1478              single send that can use this flag.  This limit is indicated us‐
1479              ing inject_size (see inject_size above).
1480
1481       FI_INJECT_COMPLETE
1482              Indicates  that a completion should be generated when the source
1483              buffer(s) may be reused.  See fi_cq(3) for additional details on
1484              completion semantics.
1485
1486       FI_MULTICAST
1487              Indicates that data transfers will target multicast addresses by
1488              default.  Any fi_addr_t passed into a  data  transfer  operation
1489              will be treated as a multicast address.
1490
1491       FI_MULTI_RECV
1492              Applies to posted receive operations.  This flag allows the user
1493              to post a single buffer that will receive multiple incoming mes‐
1494              sages.  Received messages will be packed into the receive buffer
1495              until the buffer has been consumed.  Use of this flag may  cause
1496              a  single  posted receive operation to generate multiple comple‐
1497              tions as messages are placed into the buffer.  The placement  of
1498              received  data into the buffer may be subject to provider-specific
1499              alignment restrictions.  The buffer  will  be  released  by
1500              the  provider  when  the  available buffer space falls below the
1501              specified minimum (see FI_OPT_MIN_MULTI_RECV).
1502
1503       FI_TRANSMIT_COMPLETE
1504              Indicates that a completion should be generated when the  trans‐
1505              mit operation has completed relative to the local provider.  See
1506              fi_cq(3) for additional details on completion semantics.
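
       The FI_MULTI_RECV behavior described above (pack messages into one buf‐
       fer,  honor  an  alignment, release the buffer once free space drops be‐
       low a minimum) can be modeled in a few lines.  This is a simplified mod‐
       el  with hypothetical names, not provider code; in particular, a message
       that does not fit is simply rejected here, whereas real  provider  be‐
       havior in that case is not specified by the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of one posted multi-receive buffer. */
struct demo_multi_recv {
    size_t capacity;  /* total bytes in the posted buffer */
    size_t used;      /* bytes consumed so far */
    size_t min_free;  /* release threshold, cf. FI_OPT_MIN_MULTI_RECV */
    bool   released;  /* true once the buffer is handed back */
};

/* Round v up to the next multiple of align (align > 0). */
size_t demo_round_up(size_t v, size_t align)
{
    return (v + align - 1) / align * align;
}

/* Place one incoming message of msg_len bytes into the buffer.
 * On success stores the placement offset in *offset and returns true.
 * Returns false if the buffer was released or the message does not fit. */
bool demo_multi_recv_place(struct demo_multi_recv *b, size_t msg_len,
                           size_t align, size_t *offset)
{
    size_t start;

    assert(b != NULL && offset != NULL && align > 0);
    if (b->released)
        return false;

    start = demo_round_up(b->used, align);
    if (start > b->capacity || msg_len > b->capacity - start)
        return false;

    *offset = start;
    b->used = start + msg_len;

    /* Release the buffer once remaining space falls below the minimum. */
    if (b->capacity - b->used < b->min_free)
        b->released = true;
    return true;
}
```

       Each successful placement corresponds to one completion for the  single
       posted  receive, which is why the application sees multiple completions
       for one buffer.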
1507

NOTES

1509       Users should call fi_close to release all resources  allocated  to  the
1510       fabric endpoint.
1511
1512       Endpoints  allocated  with  the FI_CONTEXT or FI_CONTEXT2 mode bits set
1513       must typically provide struct fi_context(2) as their per operation con‐
1514       text  parameter.   (See fi_getinfo.3 for details.) However, when FI_SE‐
1515       LECTIVE_COMPLETION is enabled to suppress CQ completion entries, and an
1516       operation  is  initiated  without  the FI_COMPLETION flag set, then the
1517       context parameter is ignored.  An application does not need to pass  in
1518       a valid struct fi_context(2) into such data transfers.
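
       The interaction between selective completion, the context’s  default
       op_flags,  and per-call flags can be summarized as a small decision func‐
       tion.  The sketch below is a model only: DEMO_COMPLETION is  a  hypotheti‐
       cal  stand-in  for FI_COMPLETION, and the structure and function names
       are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_COMPLETION (1ULL << 0)  /* hypothetical stand-in for FI_COMPLETION */

/* Inputs that decide whether a successful operation writes a CQ entry. */
struct demo_op {
    bool            cq_selective;  /* CQ bound with selective completion */
    uint64_t        ctx_op_flags;  /* default op_flags of the tx/rx context */
    const uint64_t *call_flags;    /* NULL when the call takes no flags */
};

/* Per-call flags, when available, override the context's op_flags. */
uint64_t demo_effective_flags(const struct demo_op *op)
{
    assert(op != NULL);
    return op->call_flags ? *op->call_flags : op->ctx_op_flags;
}

/* True if a successful operation writes a CQ entry. Without selective
 * completion every operation does; with it, only flagged operations do. */
bool demo_writes_success_entry(const struct demo_op *op)
{
    assert(op != NULL);
    if (!op->cq_selective)
        return true;
    return (demo_effective_flags(op) & DEMO_COMPLETION) != 0;
}
```

       In this model, an operation for which  demo_writes_success_entry  re‐
       turns  false  is  the case in which the context parameter is ignored and
       need not point to a valid struct fi_context(2).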
1519
1520       Operations  that  complete  in error that are not associated with valid
1521       operational context will use the endpoint context in any error  report‐
1522       ing structures.
1523
1524       Although  applications  typically associate individual completions with
1525       either completion queues or counters, an endpoint can  be  attached  to
1526       both  a  counter and completion queue.  When combined with using selec‐
1527       tive completions, this allows an application to use counters  to  track
1528       successful  completions,  with  a CQ used to report errors.  Operations
1529       that complete with an error increment the error counter and generate  a
1530       CQ completion event.
1531
1532       As  mentioned  in  fi_getinfo(3),  the ep_attr structure can be used to
1533       query providers that support various endpoint  attributes.   fi_getinfo
1534       can return provider info structures that can support the minimal set of
1535       requirements (such that the application maintains correctness).  Howev‐
1536       er, it can also return provider info structures that exceed application
1537       requirements.   As  an  example,  consider  an  application  requesting
1538       msg_order  as  FI_ORDER_NONE.  The resulting output from fi_getinfo may
1539       have all the ordering bits set.  The application can reset the ordering
1540       bits it does not require before creating the endpoint.  The provider is
1541       free to implement a stricter ordering than is required by the  applica‐
1542       tion.
1543

RETURN VALUES

1545       Returns 0 on success.  On error, a negative value corresponding to fab‐
1546       ric errno is returned.  For fi_cancel, a return value  of  0  indicates
1547       that the cancel request was submitted for processing.
1548
1549       Fabric errno values are defined in rdma/fi_errno.h.
1550

ERRORS

1552       -FI_EDOMAIN
1553              A  resource  domain  was not bound to the endpoint or an attempt
1554              was made to bind multiple domains.
1555
1556       -FI_ENOCQ
1557              The endpoint has not been configured with the necessary  comple‐
                  tion queue.
1558
1559       -FI_EOPBADSTATE
1560              The endpoint’s state does not permit the requested operation.
1561

SEE ALSO

1563       fi_getinfo(3),   fi_domain(3),   fi_cq(3),  fi_msg(3),    fi_tagged(3),
1564       fi_rma(3)
1565

AUTHORS

1567       OpenFabrics.
1568
1569
1570
1571Libfabric Programmer’s Manual     2021-10-29                    fi_endpoint(3)