fi_endpoint(3)                 Libfabric v1.8.0                 fi_endpoint(3)

NAME
fi_endpoint - Fabric endpoint operations

fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
Allocate or close an endpoint.
10
11 fi_ep_bind
12 Associate an endpoint with hardware resources, such as event
13 queues, completion queues, counters, address vectors, or shared
14 transmit/receive contexts.
15
16 fi_scalable_ep_bind
17 Associate a scalable endpoint with an address vector
18
19 fi_pep_bind
20 Associate a passive endpoint with an event queue
21
22 fi_enable
23 Transitions an active endpoint into an enabled state.
24
25 fi_cancel
26 Cancel a pending asynchronous data transfer
27
28 fi_ep_alias
29 Create an alias to the endpoint
30
31 fi_control
32 Control endpoint operation.
33
34 fi_getopt / fi_setopt
35 Get or set endpoint options.
36
37 fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
38 Open a transmit or receive context.
39
fi_rx_size_left / fi_tx_size_left (DEPRECATED)
Query the lower bound on how many RX/TX operations may be posted
without an operation returning -FI_EAGAIN. These functions have
been deprecated and will be removed in a future version of the
library.
45
SYNOPSIS
#include <rdma/fabric.h>
48
49 #include <rdma/fi_endpoint.h>
50
51 int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
52 struct fid_ep **ep, void *context);
53
54 int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
55 struct fid_ep **sep, void *context);
56
int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
58 struct fid_pep **pep, void *context);
59
60 int fi_tx_context(struct fid_ep *sep, int index,
61 struct fi_tx_attr *attr, struct fid_ep **tx_ep,
62 void *context);
63
64 int fi_rx_context(struct fid_ep *sep, int index,
65 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
66 void *context);
67
68 int fi_stx_context(struct fid_domain *domain,
69 struct fi_tx_attr *attr, struct fid_stx **stx,
70 void *context);
71
72 int fi_srx_context(struct fid_domain *domain,
73 struct fi_rx_attr *attr, struct fid_ep **rx_ep,
74 void *context);
75
76 int fi_close(struct fid *ep);
77
78 int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);
79
80 int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);
81
82 int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);
83
84 int fi_enable(struct fid_ep *ep);
85
86 int fi_cancel(struct fid_ep *ep, void *context);
87
88 int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);
89
90 int fi_control(struct fid *ep, int command, void *arg);
91
92 int fi_getopt(struct fid *ep, int level, int optname,
93 void *optval, size_t *optlen);
94
95 int fi_setopt(struct fid *ep, int level, int optname,
96 const void *optval, size_t optlen);
97
98 DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);
99
100 DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);
101
ARGUMENTS
fid On creation, specifies a fabric or access domain. On bind,
104 identifies the event queue, completion queue, counter, or ad‐
105 dress vector to bind to the endpoint. In other cases, it's a
106 fabric identifier of an associated resource.
107
108 info Details about the fabric interface endpoint to be opened, ob‐
109 tained from fi_getinfo.
110
111 ep A fabric endpoint.
112
113 sep A scalable fabric endpoint.
114
115 pep A passive fabric endpoint.
116
117 context
118 Context associated with the endpoint or asynchronous operation.
119
120 index Index to retrieve a specific transmit/receive context.
121
122 attr Transmit or receive context attributes.
123
124 flags Additional flags to apply to the operation.
125
126 command
127 Command of control operation to perform on endpoint.
128
129 arg Optional control argument.
130
131 level Protocol level at which the desired option resides.
132
133 optname
134 The protocol option to read or set.
135
136 optval The option value that was read or to set.
137
138 optlen The size of the optval buffer.
139
DESCRIPTION
Endpoints are transport level communication portals. There are two
142 types of endpoints: active and passive. Passive endpoints belong to a
143 fabric domain and are most often used to listen for incoming connection
144 requests. However, a passive endpoint may be used to reserve a fabric
145 address that can be granted to an active endpoint. Active endpoints
146 belong to access domains and can perform data transfers.
147
148 Active endpoints may be connection-oriented or connectionless, and may
149 provide data reliability. The data transfer interfaces -- messages
150 (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
151 (fi_atomic) -- are associated with active endpoints. In basic configu‐
152 rations, an active endpoint has transmit and receive queues. In gener‐
153 al, operations that generate traffic on the fabric are posted to the
154 transmit queue. This includes all RMA and atomic operations, along
155 with sent messages and sent tagged messages. Operations that post buf‐
156 fers for receiving incoming data are submitted to the receive queue.
157
158 Active endpoints are created in the disabled state. They must transi‐
159 tion into an enabled state before accepting data transfer operations,
160 including posting of receive buffers. The fi_enable call is used to
161 transition an active endpoint into an enabled state. The fi_connect
162 and fi_accept calls will also transition an endpoint into the enabled
163 state, if it is not already active.
164
165 In order to transition an endpoint into an enabled state, it must be
166 bound to one or more fabric resources. An endpoint that will generate
167 asynchronous completions, either through data transfer operations or
168 communication establishment events, must be bound to the appropriate
169 completion queues or event queues, respectively, before being enabled.
170 Additionally, endpoints that use manual progress must be associated
171 with relevant completion queues or event queues in order to drive
172 progress. For endpoints that are only used as the target of RMA or
173 atomic operations, this means binding the endpoint to a completion
174 queue associated with receive processing. Unconnected endpoints must
175 be bound to an address vector.
176
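The sequence below is a minimal sketch of that flow for a connectionless
endpoint: allocate the endpoint, bind a completion queue and an address
vector, and enable it. The helper name open_rdm_ep and the reduced error
handling are illustrative only; the domain and info arguments are assumed
to come from earlier fi_domain and fi_getinfo calls.

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>

    /* Hypothetical helper: allocate and enable a connectionless endpoint. */
    static int open_rdm_ep(struct fid_domain *domain, struct fi_info *info,
                           struct fid_ep **ep, struct fid_cq **cq,
                           struct fid_av **av)
    {
        struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
        struct fi_av_attr av_attr = { .type = FI_AV_UNSPEC };
        int ret;

        ret = fi_endpoint(domain, info, ep, NULL);     /* allocate active EP */
        if (ret)
            return ret;

        ret = fi_cq_open(domain, &cq_attr, cq, NULL);
        if (ret)
            return ret;

        ret = fi_av_open(domain, &av_attr, av, NULL);
        if (ret)
            return ret;

        /* One CQ receives both transmit and receive completions; a
         * connectionless endpoint must also bind an address vector. */
        ret = fi_ep_bind(*ep, &(*cq)->fid, FI_TRANSMIT | FI_RECV);
        if (ret)
            return ret;

        ret = fi_ep_bind(*ep, &(*av)->fid, 0);
        if (ret)
            return ret;

        return fi_enable(*ep);   /* endpoint may now post data transfers */
    }
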
177 Once an endpoint has been activated, it may be associated with an ad‐
178 dress vector. Receive buffers may be posted to it and calls may be
179 made to connection establishment routines. Connectionless endpoints
180 may also perform data transfers.
181
182 The behavior of an endpoint may be adjusted by setting its control data
183 and protocol options. This allows the underlying provider to redirect
184 function calls to implementations optimized to meet the desired appli‐
185 cation behavior.
186
187 If an endpoint experiences a critical error, it will transition back
188 into a disabled state. Critical errors are reported through the event
189 queue associated with the EP. In certain cases, a disabled endpoint
190 may be re-enabled. The ability to transition back into an enabled
191 state is provider specific and depends on the type of error that the
192 endpoint experienced. When an endpoint is disabled as a result of a
193 critical error, all pending operations are discarded.
194
195 fi_endpoint / fi_passive_ep / fi_scalable_ep
196 fi_endpoint allocates a new active endpoint. fi_passive_ep allocates a
197 new passive endpoint. fi_scalable_ep allocates a scalable endpoint.
198 The properties and behavior of the endpoint are defined based on the
199 provided struct fi_info. See fi_getinfo for additional details on
200 fi_info. fi_info flags that control the operation of an endpoint are
201 defined below. See section SCALABLE ENDPOINTS.
202
203 If an active endpoint is allocated in order to accept a connection re‐
204 quest, the fi_info parameter must be the same as the fi_info structure
205 provided with the connection request (FI_CONNREQ) event.
206
207 An active endpoint may acquire the properties of a passive endpoint by
208 setting the fi_info handle field to the passive endpoint fabric de‐
209 scriptor. This is useful for applications that need to reserve the
210 fabric address of an endpoint prior to knowing if the endpoint will be
211 used on the active or passive side of a connection. For example, this
212 feature is useful for simulating socket semantics. Once an active end‐
213 point acquires the properties of a passive endpoint, the passive end‐
214 point is no longer bound to any fabric resources and must no longer be
215 used. The user is expected to close the passive endpoint after opening
216 the active endpoint in order to free up any lingering resources that
217 had been used.
218
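A rough sketch of this pattern is shown below, assuming fabric, domain, and
info already exist; the variable names and abbreviated error handling are
illustrative, not part of the API.

    /* Reserve a fabric address with a passive endpoint, then hand its
     * properties to an active endpoint. */
    struct fi_eq_attr eq_attr = { .wait_obj = FI_WAIT_UNSPEC };
    struct fid_eq *eq;
    struct fid_pep *pep;
    struct fid_ep *ep;
    int ret;

    ret = fi_passive_ep(fabric, info, &pep, NULL);
    if (ret)
        return ret;

    ret = fi_eq_open(fabric, &eq_attr, &eq, NULL);
    if (ret)
        return ret;

    ret = fi_pep_bind(pep, &eq->fid, 0);     /* EQ required for CM events */
    if (ret)
        return ret;

    /* Later, transfer the reserved address to an active endpoint by
     * pointing the info handle at the passive endpoint's fid. */
    info->handle = &pep->fid;
    ret = fi_endpoint(domain, info, &ep, NULL);
    if (ret)
        return ret;

    /* The passive endpoint no longer owns fabric resources; close it. */
    ret = fi_close(&pep->fid);
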
fi_close
Closes an endpoint and releases all resources associated with it.

When closing a scalable endpoint, there must be no open transmit or
receive contexts associated with the scalable endpoint. If resources
are still associated with the scalable endpoint when attempting to
close, the call will return -FI_EBUSY.
226
227 Outstanding operations posted to the endpoint when fi_close is called
228 will be discarded. Discarded operations will silently be dropped, with
229 no completions reported. Additionally, a provider may discard previ‐
230 ously completed operations from the associated completion queue(s).
231 The behavior to discard completed operations is provider specific.
232
233 fi_ep_bind
234 fi_ep_bind is used to associate an endpoint with other allocated re‐
235 sources, such as completion queues, counters, address vectors, event
236 queues, shared contexts, and memory regions. The type of objects that
237 must be bound with an endpoint depend on the endpoint type and its con‐
238 figuration.
239
240 Passive endpoints must be bound with an EQ that supports connection
241 management events. Connectionless endpoints must be bound to a single
242 address vector. If an endpoint is using a shared transmit and/or re‐
243 ceive context, the shared contexts must be bound to the endpoint. CQs,
244 counters, AV, and shared contexts must be bound to endpoints before
245 they are enabled either explicitly or implicitly.
246
247 An endpoint must be bound with CQs capable of reporting completions for
248 any asynchronous operation initiated on the endpoint. For example, if
249 the endpoint supports any outbound transfers (sends, RMA, atomics,
250 etc.), then it must be bound to a completion queue that can report
251 transmit completions. This is true even if the endpoint is configured
252 to suppress successful completions, in order that operations that com‐
253 plete in error may be reported to the user.
254
255 An active endpoint may direct asynchronous completions to different
256 CQs, based on the type of operation. This is specified using
257 fi_ep_bind flags. The following flags may be OR'ed together when bind‐
258 ing an endpoint to a completion domain CQ.
259
260 FI_TRANSMIT
261 Directs the completion of outbound data transfer requests to the
262 specified completion queue. This includes send message, RMA,
263 and atomic operations.
264
265 FI_RECV
266 Directs the notification of inbound data transfers to the speci‐
267 fied completion queue. This includes received messages. This
268 binding automatically includes FI_REMOTE_WRITE, if applicable to
269 the endpoint.
270
271 FI_SELECTIVE_COMPLETION
272 By default, data transfer operations write CQ completion entries
273 into the associated completion queue after they have successful‐
274 ly completed. Applications can use this bind flag to selective‐
275 ly enable when completions are generated. If FI_SELECTIVE_COM‐
276 PLETION is specified, data transfer operations will not generate
277 CQ entries for successful completions unless FI_COMPLETION is
278 set as an operational flag for the given operation. Operations
279 that fail asynchronously will still generate completions, even
280 if a completion is not requested. FI_SELECTIVE_COMPLETION must
281 be OR'ed with FI_TRANSMIT and/or FI_RECV flags.
282
283 When FI_SELECTIVE_COMPLETION is set, the user must determine when a re‐
284 quest that does NOT have FI_COMPLETION set has completed indirectly,
285 usually based on the completion of a subsequent operation or by using
286 completion counters. Use of this flag may improve performance by al‐
287 lowing the provider to avoid writing a CQ completion entry for every
288 operation.
289
290 See Notes section below for additional information on how this flag in‐
291 teracts with the FI_CONTEXT and FI_CONTEXT2 mode bits.
292
293 An endpoint may optionally be bound to a completion counter. Associat‐
294 ing an endpoint with a counter is in addition to binding the EP with a
295 CQ. When binding an endpoint to a counter, the following flags may be
296 specified.
297
298 FI_SEND
299 Increments the specified counter whenever a message transfer
300 initiated over the endpoint has completed successfully or in er‐
301 ror. Sent messages include both tagged and normal message oper‐
302 ations.
303
304 FI_RECV
305 Increments the specified counter whenever a message is received
306 over the endpoint. Received messages include both tagged and
307 normal message operations.
308
309 FI_READ
310 Increments the specified counter whenever an RMA read, atomic
311 fetch, or atomic compare operation initiated from the endpoint
312 has completed successfully or in error.
313
314 FI_WRITE
315 Increments the specified counter whenever an RMA write or base
316 atomic operation initiated from the endpoint has completed suc‐
317 cessfully or in error.
318
319 FI_REMOTE_READ
320 Increments the specified counter whenever an RMA read, atomic
321 fetch, or atomic compare operation is initiated from a remote
322 endpoint that targets the given endpoint. Use of this flag re‐
323 quires that the endpoint be created using FI_RMA_EVENT.
324
325 FI_REMOTE_WRITE
326 Increments the specified counter whenever an RMA write or base
327 atomic operation is initiated from a remote endpoint that tar‐
328 gets the given endpoint. Use of this flag requires that the
329 endpoint be created using FI_RMA_EVENT.
330
An endpoint may only be bound to a single CQ or counter for a given
type of operation. For example, an EP may not bind to two counters
both using FI_WRITE. Furthermore, providers may limit CQ and counter
bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM, etc.).
335
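As an illustration, the fragment below directs transmit and receive
completions to separate CQs, requests selective completion on the transmit
side, and additionally counts completed sends. The tx_cq, rx_cq, and cntr
objects are assumed to have been opened earlier with fi_cq_open and
fi_cntr_open; error handling is abbreviated.

    int ret;

    ret = fi_ep_bind(ep, &tx_cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
    if (ret)
        return ret;

    ret = fi_ep_bind(ep, &rx_cq->fid, FI_RECV);
    if (ret)
        return ret;

    /* Counter binding is in addition to, not instead of, the CQ binding. */
    ret = fi_ep_bind(ep, &cntr->fid, FI_SEND);
    if (ret)
        return ret;

    ret = fi_enable(ep);
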
336 fi_scalable_ep_bind
337 fi_scalable_ep_bind is used to associate a scalable endpoint with an
338 address vector. See section on SCALABLE ENDPOINTS. A scalable end‐
339 point has a single transport level address and can support multiple
340 transmit and receive contexts. The transmit and receive contexts share
341 the transport-level address. Address vectors that are bound to scal‐
342 able endpoints are implicitly bound to any transmit or receive contexts
343 created using the scalable endpoint.
344
345 fi_enable
346 This call transitions the endpoint into an enabled state. An endpoint
347 must be enabled before it may be used to perform data transfers. En‐
348 abling an endpoint typically results in hardware resources being as‐
349 signed to it. Endpoints making use of completion queues, counters,
350 event queues, and/or address vectors must be bound to them before being
351 enabled.
352
353 Calling connect or accept on an endpoint will implicitly enable an end‐
354 point if it has not already been enabled.
355
fi_enable may also be used to re-enable an endpoint that has been
disabled as a result of experiencing a critical error. Applications
should check the return value from fi_enable to see if a disabled
endpoint has been successfully re-enabled.
360
361 fi_cancel
362 fi_cancel attempts to cancel an outstanding asynchronous operation.
363 Canceling an operation causes the fabric provider to search for the op‐
364 eration and, if it is still pending, complete it as having been can‐
365 celed. An error queue entry will be available in the associated error
366 queue with error code FI_ECANCELED. On the other hand, if the opera‐
367 tion completed before the call to fi_cancel, then the completion status
368 of that operation will be available in the associated completion queue.
369 No specific entry related to fi_cancel itself will be posted.
370
371 Cancel uses the context parameter associated with an operation to iden‐
372 tify the request to cancel. Operations posted without a valid context
373 parameter -- either no context parameter is specified or the context
374 value was ignored by the provider -- cannot be canceled. If multiple
375 outstanding operations match the context parameter, only one will be
376 canceled. In this case, the operation which is canceled is provider
377 specific. The cancel operation is asynchronous, but will complete
378 within a bounded period of time.
379
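The fragment below sketches the typical usage, assuming the endpoint has
been enabled and bound to a CQ; the buffer and context names are
illustrative.

    struct fi_context recv_ctx;
    char buf[256];
    ssize_t ret;

    /* Post a receive, identified by its context pointer. */
    ret = fi_recv(ep, buf, sizeof(buf), NULL, FI_ADDR_UNSPEC, &recv_ctx);
    if (ret)
        return (int) ret;

    /* Later, the application decides the receive is no longer needed.
     * If it was still pending, an FI_ECANCELED error entry will appear
     * in the bound CQ's error queue. */
    ret = fi_cancel(&ep->fid, &recv_ctx);
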
380 fi_ep_alias
381 This call creates an alias to the specified endpoint. Conceptually, an
382 endpoint alias provides an alternate software path from the application
383 to the underlying provider hardware. An alias EP differs from its par‐
384 ent endpoint only by its default data transfer flags. For example, an
385 alias EP may be configured to use a different completion mode. By de‐
386 fault, an alias EP inherits the same data transfer flags as the parent
387 endpoint. An application can use fi_control to modify the alias EP op‐
388 erational flags.
389
390 When allocating an alias, an application may configure either the
391 transmit or receive operational flags. This avoids needing a separate
392 call to fi_control to set those flags. The flags passed to fi_ep_alias
393 must include FI_TRANSMIT or FI_RECV (not both) with other operational
394 flags OR'ed in. This will override the transmit or receive flags, re‐
395 spectively, for operations posted through the alias endpoint. All al‐
396 located aliases must be closed for the underlying endpoint to be re‐
397 leased.
398
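For example, the following sketch creates an alias whose transmit
operations always generate completions, leaving the parent endpoint's
defaults untouched; names are illustrative.

    struct fid_ep *alias_ep;
    int ret;

    ret = fi_ep_alias(ep, &alias_ep, FI_TRANSMIT | FI_COMPLETION);
    if (ret)
        return ret;

    /* Sends posted through alias_ep use FI_COMPLETION; sends posted
     * through ep are unchanged.  The alias must be closed with fi_close
     * before the underlying endpoint can be released. */
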
399 fi_control
400 The control operation is used to adjust the default behavior of an end‐
401 point. It allows the underlying provider to redirect function calls to
402 implementations optimized to meet the desired application behavior. As
403 a result, calls to fi_ep_control must be serialized against all other
404 calls to an endpoint.
405
406 The base operation of an endpoint is selected during creation using
407 struct fi_info. The following control commands and arguments may be
408 assigned to an endpoint.
409
FI_GETOPSFLAG -- uint64_t *flags
411 Used to retrieve the current value of flags associated with the
412 data transfer operations initiated on the endpoint. The control
413 argument must include FI_TRANSMIT or FI_RECV (not both) flags to
414 indicate the type of data transfer flags to be returned. See
415 below for a list of control flags.
416
FI_SETOPSFLAG -- uint64_t *flags
418 Used to change the data transfer operation flags associated with
419 an endpoint. The control argument must include FI_TRANSMIT or
420 FI_RECV (not both) to indicate the type of data transfer that
421 the flags should apply to, with other flags OR'ed in. The given
422 flags will override the previous transmit and receive attributes
423 that were set when the endpoint was created. Valid control
424 flags are defined below.
425
FI_BACKLOG - int *value
427 This option only applies to passive endpoints. It is used to
428 set the connection request backlog for listening endpoints.
429
FI_GETWAIT (void **)
This command allows the user to retrieve the file descriptor
associated with a socket endpoint. The fi_control arg parameter
should be an address where a pointer to the returned file
descriptor will be written. See fi_eq.3 for additional details on
using fi_control with FI_GETWAIT. The file descriptor may be used
for notification that the endpoint is ready to send or receive
data.
438
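The fragment below sketches how FI_GETOPSFLAG and FI_SETOPSFLAG are
typically driven through fi_control; the chosen flag values are
illustrative.

    uint64_t flags;
    int ret;

    /* Read back the current transmit operation flags. */
    flags = FI_TRANSMIT;
    ret = fi_control(&ep->fid, FI_GETOPSFLAG, &flags);
    if (ret)
        return ret;

    /* Replace them so that every send generates a completion. */
    flags = FI_TRANSMIT | FI_COMPLETION;
    ret = fi_control(&ep->fid, FI_SETOPSFLAG, &flags);
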
439 fi_getopt / fi_setopt
Endpoint protocol operations may be retrieved using fi_getopt or set
using fi_setopt. Applications specify the level at which a desired
option exists, identify the option, and provide input/output buffers
to get or set the option. fi_setopt provides an application a way to
adjust low-level protocol and implementation specific details of an
endpoint. A short usage sketch follows the option list below.

The following option levels, option names, and parameters are
defined.
448
FI_OPT_ENDPOINT
450
451 FI_OPT_MIN_MULTI_RECV - size_t
452 Defines the minimum receive buffer space available when the re‐
453 ceive buffer is released by the provider (see FI_MULTI_RECV).
454 Modifying this value is only guaranteed to set the minimum buf‐
455 fer space needed on receives posted after the value has been
456 changed. It is recommended that applications that want to over‐
457 ride the default MIN_MULTI_RECV value set this option before en‐
458 abling the corresponding endpoint.
460
461 FI_OPT_CM_DATA_SIZE - size_t
462 Defines the size of available space in CM messages for user-de‐
463 fined data. This value limits the amount of data that applica‐
464 tions can exchange between peer endpoints using the fi_connect,
465 fi_accept, and fi_reject operations. The size returned is de‐
466 pendent upon the properties of the endpoint, except in the case
467 of passive endpoints, in which the size reflects the maximum
468 size of the data that may be present as part of a connection re‐
469 quest event. This option is read only.
471
472 FI_OPT_BUFFERED_LIMIT - size_t
473 Defines the maximum size of a buffered message that will be re‐
474 ported to users as part of a receive completion when the
475 FI_BUFFERED_RECV mode is enabled on an endpoint.
476
fi_getopt() will return the currently configured threshold, or the
provider's default threshold if one has not been set by the application.
479 fi_setopt() allows an application to configure the threshold. If the
480 provider cannot support the requested threshold, it will fail the
481 fi_setopt() call with FI_EMSGSIZE. Calling fi_setopt() with the
482 threshold set to SIZE_MAX will set the threshold to the maximum sup‐
483 ported by the provider. fi_getopt() can then be used to retrieve the
484 set size.
485
486 In most cases, the sending and receiving endpoints must be configured
487 to use the same threshold value, and the threshold must be set prior to
enabling the endpoint.
489
FI_OPT_BUFFERED_MIN - size_t
Defines the minimum size of a buffered message that will be
reported. Applications would set this to a size large enough to
decide whether to discard or claim a buffered receive upon getting
a buffered receive completion. The value is typically used by a
provider when sending a rendezvous protocol request, where it would
send at least FI_OPT_BUFFERED_MIN bytes of application data along
with it. A smaller rendezvous protocol message usually results in
better latency for the overall transfer of a large message.
500
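The fragment below is a small sketch of the fi_setopt / fi_getopt pattern
using two of the options above; the chosen sizes are illustrative and a
provider may reject or adjust them.

    size_t min_recv = 4096;
    size_t cm_size;
    size_t len = sizeof(cm_size);
    int ret;

    /* Raise the multi-receive threshold before enabling the endpoint. */
    ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                    &min_recv, sizeof(min_recv));
    if (ret)
        return ret;

    /* Query how much user data CM messages can carry (read only). */
    ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CM_DATA_SIZE,
                    &cm_size, &len);
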
501 fi_rx_size_left (DEPRECATED)
502 This function has been deprecated and will be removed in a future ver‐
503 sion of the library. It may not be supported by all providers.
504
505 The fi_rx_size_left call returns a lower bound on the number of receive
506 operations that may be posted to the given endpoint without that opera‐
507 tion returning -FI_EAGAIN. Depending on the specific details of the
508 subsequently posted receive operations (e.g., number of iov entries,
509 which receive function is called, etc.), it may be possible to post
510 more receive operations than originally indicated by fi_rx_size_left.
511
512 fi_tx_size_left (DEPRECATED)
513 This function has been deprecated and will be removed in a future ver‐
514 sion of the library. It may not be supported by all providers.
515
516 The fi_tx_size_left call returns a lower bound on the number of trans‐
517 mit operations that may be posted to the given endpoint without that
518 operation returning -FI_EAGAIN. Depending on the specific details of
519 the subsequently posted transmit operations (e.g., number of iov en‐
520 tries, which transmit function is called, etc.), it may be possible to
521 post more transmit operations than originally indicated by
522 fi_tx_size_left.
523
ENDPOINT ATTRIBUTES
The fi_ep_attr structure defines the set of attributes associated with
526 an endpoint. Endpoint attributes may be further refined using the
527 transmit and receive context attributes as shown below.
528
529 struct fi_ep_attr {
530 enum fi_ep_type type;
531 uint32_t protocol;
532 uint32_t protocol_version;
533 size_t max_msg_size;
534 size_t msg_prefix_size;
535 size_t max_order_raw_size;
536 size_t max_order_war_size;
537 size_t max_order_waw_size;
538 uint64_t mem_tag_format;
539 size_t tx_ctx_cnt;
540 size_t rx_ctx_cnt;
541 size_t auth_key_size;
542 uint8_t *auth_key;
543 };
544
type - Endpoint Type
If specified, indicates the type of fabric interface communication
desired. A short sketch of requesting a type through fi_getinfo hints
follows the list below. Supported types are:
548
549 FI_EP_UNSPEC
550 The type of endpoint is not specified. This is usually provided
551 as input, with other attributes of the endpoint or the provider
552 selecting the type.
553
554 FI_EP_MSG
555 Provides a reliable, connection-oriented data transfer service
556 with flow control that maintains message boundaries.
557
FI_EP_DGRAM
Supports connectionless, unreliable datagram communication.
560 Message boundaries are maintained, but the maximum message size
561 may be limited to the fabric MTU. Flow control is not guaran‐
562 teed.
563
564 FI_EP_RDM
565 Reliable datagram message. Provides a reliable, unconnected da‐
566 ta transfer service with flow control that maintains message
567 boundaries.
568
569 FI_EP_SOCK_STREAM
570 Data streaming endpoint with TCP socket-like semantics. Pro‐
571 vides a reliable, connection-oriented data transfer service that
572 does not maintain message boundaries. FI_EP_SOCK_STREAM is most
573 useful for applications designed around using TCP sockets. See
574 the SOCKET ENDPOINT section for additional details and restric‐
575 tions that apply to stream endpoints.
576
577 FI_EP_SOCK_DGRAM
578 A connectionless, unreliable datagram endpoint with UDP sock‐
579 et-like semantics. FI_EP_SOCK_DGRAM is most useful for applica‐
580 tions designed around using UDP sockets. See the SOCKET END‐
581 POINT section for additional details and restrictions that apply
582 to datagram socket endpoints.
583
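The fragment below is a small sketch of selecting an endpoint type through
fi_getinfo hints; the requested capabilities are illustrative.

    struct fi_info *hints, *info;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return -FI_ENOMEM;

    hints->ep_attr->type = FI_EP_RDM;      /* request reliable datagram EP */
    hints->caps = FI_MSG;                  /* basic two-sided messaging    */

    ret = fi_getinfo(FI_VERSION(1, 8), NULL, NULL, 0, hints, &info);
    fi_freeinfo(hints);
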
584 Protocol
585 Specifies the low-level end to end protocol employed by the provider.
586 A matching protocol must be used by communicating endpoints to ensure
587 interoperability. The following protocol values are defined. Provider
588 specific protocols are also allowed. Provider specific protocols will
589 be indicated by having the upper bit of the protocol value set to one.
590
591 FI_PROTO_UNSPEC
592 The protocol is not specified. This is usually provided as in‐
593 put, with other attributes of the socket or the provider select‐
594 ing the actual protocol.
595
FI_PROTO_RDMA_CM_IB_RC
The protocol runs over InfiniBand reliable-connected queue
pairs, using the RDMA CM protocol for connection establishment.
599
600 FI_PROTO_IWARP
601 The protocol runs over the Internet wide area RDMA protocol
602 transport.
603
FI_PROTO_IB_UD
The protocol runs over InfiniBand unreliable datagram queue
pairs.
607
608 FI_PROTO_PSMX
609 The protocol is based on an Intel proprietary protocol known as
610 PSM, performance scaled messaging. PSMX is an extended version
611 of the PSM protocol to support the libfabric interfaces.
612
613 FI_PROTO_UDP
614 The protocol sends and receives UDP datagrams. For example, an
615 endpoint using FI_PROTO_UDP will be able to communicate with a
616 remote peer that is using Berkeley SOCK_DGRAM sockets using IP‐
617 PROTO_UDP.
618
619 FI_PROTO_SOCK_TCP
620 The protocol is layered over TCP packets.
621
622 FI_PROTO_IWARP_RDM
623 Reliable-datagram protocol implemented over iWarp reliable-con‐
624 nected queue pairs.
625
626 FI_PROTO_IB_RDM
627 Reliable-datagram protocol implemented over InfiniBand reli‐
628 able-connected queue pairs.
629
630 FI_PROTO_GNI
631 Protocol runs over Cray GNI low-level interface.
632
633 FI_PROTO_RXM
634 Reliable-datagram protocol implemented over message endpoints.
635 RXM is a libfabric utility component that adds RDM endpoint se‐
636 mantics over MSG endpoint semantics.
637
638 FI_PROTO_RXD
639 Reliable-datagram protocol implemented over datagram endpoints.
640 RXD is a libfabric utility component that adds RDM endpoint se‐
641 mantics over DGRAM endpoint semantics.
642
FI_PROTO_NETWORKDIRECT
Protocol runs over Microsoft NetworkDirect service provider
interface. This adds reliable-datagram semantics over the
NetworkDirect connection-oriented endpoint semantics.
647
648 FI_PROTO_PSMX2
649 The protocol is based on an Intel proprietary protocol known as
650 PSM2, performance scaled messaging version 2. PSMX2 is an ex‐
651 tended version of the PSM2 protocol to support the libfabric in‐
652 terfaces.
653
protocol_version - Protocol Version
Identifies which version of the protocol is employed by the provider.
The protocol version allows providers to extend an existing protocol
in a backward compatible manner, for example by adding support for
additional features or functionality. Providers that support different
versions of the same protocol should interoperate, but only when using
the capabilities defined for the lesser version.
661
662 max_msg_size - Max Message Size
663 Defines the maximum size for an application data transfer as a single
664 operation.
665
msg_prefix_size - Message Prefix Size
Specifies the size of any required message prefix buffer space. This
field will be 0 unless the FI_MSG_PREFIX mode is enabled. If
msg_prefix_size is > 0, the specified value will be a multiple of 8 bytes.
670
671 Max RMA Ordered Size
672 The maximum ordered size specifies the delivery order of transport data
673 into target memory for RMA and atomic operations. Data ordering is
674 separate, but dependent on message ordering (defined below). Data or‐
675 dering is unspecified where message order is not defined.
676
677 Data ordering refers to the access of target memory by subsequent oper‐
678 ations. When back to back RMA read or write operations access the same
679 registered memory location, data ordering indicates whether the second
680 operation reads or writes the target memory after the first operation
681 has completed. Because RMA ordering applies between two operations,
682 and not within a single data transfer, ordering is defined per byte-ad‐
683 dressable memory location. I.e. ordering specifies whether location X
684 is accessed by the second operation after the first operation. Nothing
685 is implied about the completion of the first operation before the sec‐
686 ond operation is initiated.
687
688 In order to support large data transfers being broken into multiple
689 packets and sent using multiple paths through the fabric, data ordering
690 may be limited to transfers of a specific size or less. Providers
691 specify when data ordering is maintained through the following values.
692 Note that even if data ordering is not maintained, message ordering may
693 be.
694
695 max_order_raw_size
696 Read after write size. If set, an RMA or atomic read operation
697 issued after an RMA or atomic write operation, both of which are
698 smaller than the size, will be ordered. Where the target memory
699 locations overlap, the RMA or atomic read operation will see the
700 results of the previous RMA or atomic write.
701
702 max_order_war_size
703 Write after read size. If set, an RMA or atomic write operation
704 issued after an RMA or atomic read operation, both of which are
705 smaller than the size, will be ordered. The RMA or atomic read
706 operation will see the initial value of the target memory loca‐
707 tion before a subsequent RMA or atomic write updates the value.
708
709 max_order_waw_size
710 Write after write size. If set, an RMA or atomic write opera‐
711 tion issued after an RMA or atomic write operation, both of
712 which are smaller than the size, will be ordered. The target
713 memory location will reflect the results of the second RMA or
714 atomic write.
715
716 An order size value of 0 indicates that ordering is not guaranteed. A
717 value of -1 guarantees ordering for any data size.
718
719 mem_tag_format - Memory Tag Format
720 The memory tag format is a bit array used to convey the number of
721 tagged bits supported by a provider. Additionally, it may be used to
722 divide the bit array into separate fields. The mem_tag_format option‐
723 ally begins with a series of bits set to 0, to signify bits which are
724 ignored by the provider. Following the initial prefix of ignored bits,
725 the array will consist of alternating groups of bits set to all 1's or
726 all 0's. Each group of bits corresponds to a tagged field. The impli‐
727 cation of defining a tagged field is that when a mask is applied to the
728 tagged bit array, all bits belonging to a single field will either be
729 set to 1 or 0, collectively.
730
731 For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
732 bits, separated into 3 fields. The first field consists of 2-bits, the
733 second field 4-bits, and the final field 8-bits. Valid masks for such
734 a tagged field would be a bitwise OR'ing of zero or more of the follow‐
735 ing values: 0x3000, 0x0F00, and 0x00FF. The provider may not validate
736 the mask provided by the application for performance reasons.
737
738 By identifying fields within a tag, a provider may be able to optimize
739 their search routines. An application which requests tag fields must
740 provide tag masks that either set all mask bits corresponding to a
741 field to all 0 or all 1. When negotiating tag fields, an application
742 can request a specific number of fields of a given size. A provider
743 must return a tag format that supports the requested number of fields,
744 with each field being at least the size requested, or fail the request.
745 A provider may increase the size of the fields. When reporting comple‐
746 tions (see FI_CQ_FORMAT_TAGGED), it is not guaranteed that the provider
747 would clear out any unsupported tag bits in the tag field of the com‐
748 pletion entry.
749
750 It is recommended that field sizes be ordered from smallest to largest.
751 A generic, unstructured tag and mask can be achieved by requesting a
752 bit array consisting of alternating 1's and 0's.
753
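The fragment below restates the 0x30FF example in code: it requests the
2/4/8-bit field layout through fi_getinfo hints and composes a tag and a
mask from the individual fields. The hints variable is assumed to come
from fi_allocinfo, and the field values are illustrative.

    uint64_t tag_format = 0x30FF;     /* 2-, 4-, and 8-bit tag fields     */
    uint64_t field0 = 0x3000;         /* mask covering field 0 (bits 13:12) */
    uint64_t field1 = 0x0F00;         /* mask covering field 1 (bits 11:8)  */
    uint64_t field2 = 0x00FF;         /* mask covering field 2 (bits 7:0)   */

    hints->ep_attr->mem_tag_format = tag_format;

    /* Compose a tag with field values 2, 5, and 0x7C (tag == 0x257C). */
    uint64_t tag = (0x2ULL << 12) | (0x5ULL << 8) | 0x7C;

    /* Ignore (wildcard) field 1 when matching; tagged receive calls treat
     * the ignore argument as bits of the tag to disregard. */
    uint64_t ignore = field1;
    (void) field0; (void) field2; (void) tag; (void) ignore;
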
754 tx_ctx_cnt - Transmit Context Count
755 Number of transmit contexts to associate with the endpoint. If not
756 specified (0), 1 context will be assigned if the endpoint supports out‐
757 bound transfers. Transmit contexts are independent transmit queues
758 that may be separately configured. Each transmit context may be bound
759 to a separate CQ, and no ordering is defined between contexts. Addi‐
760 tionally, no synchronization is needed when accessing contexts in par‐
761 allel.
762
763 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
764 be configured to use a shared transmit context, if supported by the
765 provider. Providers that do not support shared transmit contexts will
766 fail the request.
767
768 See the scalable endpoint and shared contexts sections for additional
769 details.
770
771 rx_ctx_cnt - Receive Context Count
772 Number of receive contexts to associate with the endpoint. If not
773 specified, 1 context will be assigned if the endpoint supports inbound
774 transfers. Receive contexts are independent processing queues that may
775 be separately configured. Each receive context may be bound to a sepa‐
776 rate CQ, and no ordering is defined between contexts. Additionally, no
777 synchronization is needed when accessing contexts in parallel.
778
779 If the count is set to the value FI_SHARED_CONTEXT, the endpoint will
780 be configured to use a shared receive context, if supported by the
781 provider. Providers that do not support shared receive contexts will
782 fail the request.
783
784 See the scalable endpoint and shared contexts sections for additional
785 details.
786
787 auth_key_size - Authorization Key Length
788 The length of the authorization key in bytes. This field will be 0 if
789 authorization keys are not available or used. This field is ignored
790 unless the fabric is opened with API version 1.5 or greater.
791
792 auth_key - Authorization Key
793 If supported by the fabric, an authorization key (a.k.a. job key) to
794 associate with the endpoint. An authorization key is used to limit
795 communication between endpoints. Only peer endpoints that are pro‐
796 grammed to use the same authorization key may communicate. Authoriza‐
797 tion keys are often used to implement job keys, to ensure that process‐
798 es running in different jobs do not accidentally cross traffic. The
799 domain authorization key will be used if auth_key_size is set to 0.
800 This field is ignored unless the fabric is opened with API version 1.5
801 or greater.
802
TRANSMIT CONTEXT ATTRIBUTES
Attributes specific to the transmit capabilities of an endpoint are
805 specified using struct fi_tx_attr.
806
807 struct fi_tx_attr {
808 uint64_t caps;
809 uint64_t mode;
810 uint64_t op_flags;
811 uint64_t msg_order;
812 uint64_t comp_order;
813 size_t inject_size;
814 size_t size;
815 size_t iov_limit;
816 size_t rma_iov_limit;
817 };
818
819 caps - Capabilities
820 The requested capabilities of the context. The capabilities must be a
821 subset of those requested of the associated endpoint. See the CAPABIL‐
822 ITIES section of fi_getinfo(3) for capability details. If the caps
823 field is 0 on input to fi_getinfo(3), the caps value from the fi_info
824 structure will be used.
825
826 mode
827 The operational mode bits of the context. The mode bits will be a sub‐
828 set of those associated with the endpoint. See the MODE section of
829 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
830 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
831 stead. On return from fi_getinfo(3), the mode will be set only to
832 those constraints specific to transmit operations.
833
834 op_flags - Default transmit operation flags
835 Flags that control the operation of operations submitted against the
836 context. Applicable flags are listed in the Operation Flags section.
837
838 msg_order - Message Ordering
839 Message ordering refers to the order in which transport layer headers
840 (as viewed by the application) are identified and processed. Relaxed
841 message order enables data transfers to be sent and received out of or‐
842 der, which may improve performance by utilizing multiple paths through
843 the fabric from the initiating endpoint to a target endpoint. Message
844 order applies only between a single source and destination endpoint
845 pair. Ordering between different target endpoints is not defined.
846
847 Message order is determined using a set of ordering bits. Each set bit
848 indicates that ordering is maintained between data transfers of the
849 specified type. Message order is defined for [read | write | send] op‐
850 erations submitted by an application after [read | write | send] opera‐
851 tions.
852
Message ordering only applies to the end-to-end transmission of
transport headers. Message ordering is necessary for, but does not
guarantee, the order in which message data is sent or received by the
transport layer. Message ordering requires matching ordering semantics
on the receiving side of a data transfer operation in order to
guarantee that ordering is met.
859
860 FI_ORDER_NONE
861 No ordering is specified. This value may be used as input in
862 order to obtain the default message order supported by the
863 provider. FI_ORDER_NONE is an alias for the value 0.
864
865 FI_ORDER_RAR
866 Read after read. If set, RMA and atomic read operations are
867 transmitted in the order submitted relative to other RMA and
868 atomic read operations. If not set, RMA and atomic reads may be
869 transmitted out of order from their submission.
870
871 FI_ORDER_RAW
872 Read after write. If set, RMA and atomic read operations are
873 transmitted in the order submitted relative to RMA and atomic
874 write operations. If not set, RMA and atomic reads may be
875 transmitted ahead of RMA and atomic writes.
876
877 FI_ORDER_RAS
878 Read after send. If set, RMA and atomic read operations are
879 transmitted in the order submitted relative to message send op‐
880 erations, including tagged sends. If not set, RMA and atomic
881 reads may be transmitted ahead of sends.
882
883 FI_ORDER_WAR
884 Write after read. If set, RMA and atomic write operations are
885 transmitted in the order submitted relative to RMA and atomic
886 read operations. If not set, RMA and atomic writes may be
887 transmitted ahead of RMA and atomic reads.
888
889 FI_ORDER_WAW
890 Write after write. If set, RMA and atomic write operations are
891 transmitted in the order submitted relative to other RMA and
892 atomic write operations. If not set, RMA and atomic writes may
893 be transmitted out of order from their submission.
894
895 FI_ORDER_WAS
896 Write after send. If set, RMA and atomic write operations are
897 transmitted in the order submitted relative to message send op‐
898 erations, including tagged sends. If not set, RMA and atomic
899 writes may be transmitted ahead of sends.
900
FI_ORDER_SAR
Send after read. If set, message send operations, including
tagged sends, are transmitted in the order submitted relative to
RMA and atomic read operations. If not set, message sends may be
transmitted ahead of RMA and atomic reads.

FI_ORDER_SAW
Send after write. If set, message send operations, including
tagged sends, are transmitted in the order submitted relative to
RMA and atomic write operations. If not set, message sends may be
transmitted ahead of RMA and atomic writes.

FI_ORDER_SAS
Send after send. If set, message send operations, including
tagged sends, are transmitted in the order submitted relative to
other message send operations. If not set, message sends may be
transmitted out of order from their submission.
918
919 FI_ORDER_RMA_RAR
920 RMA read after read. If set, RMA read operations are transmit‐
921 ted in the order submitted relative to other RMA read opera‐
922 tions. If not set, RMA reads may be transmitted out of order
923 from their submission.
924
925 FI_ORDER_RMA_RAW
926 RMA read after write. If set, RMA read operations are transmit‐
927 ted in the order submitted relative to RMA write operations. If
928 not set, RMA reads may be transmitted ahead of RMA writes.
929
930 FI_ORDER_RMA_WAR
931 RMA write after read. If set, RMA write operations are trans‐
932 mitted in the order submitted relative to RMA read operations.
933 If not set, RMA writes may be transmitted ahead of RMA reads.
934
935 FI_ORDER_RMA_WAW
936 RMA write after write. If set, RMA write operations are trans‐
937 mitted in the order submitted relative to other RMA write opera‐
938 tions. If not set, RMA writes may be transmitted out of order
939 from their submission.
940
941 FI_ORDER_ATOMIC_RAR
942 Atomic read after read. If set, atomic fetch operations are
943 transmitted in the order submitted relative to other atomic
944 fetch operations. If not set, atomic fetches may be transmitted
945 out of order from their submission.
946
947 FI_ORDER_ATOMIC_RAW
948 Atomic read after write. If set, atomic fetch operations are
949 transmitted in the order submitted relative to atomic update op‐
950 erations. If not set, atomic fetches may be transmitted ahead
951 of atomic updates.
952
FI_ORDER_ATOMIC_WAR
Atomic write after read. If set, atomic update operations are
transmitted in the order submitted relative to atomic fetch
operations. If not set, atomic updates may be transmitted ahead
of atomic fetches.

FI_ORDER_ATOMIC_WAW
Atomic write after write. If set, atomic update operations are
transmitted in the order submitted relative to other atomic
update operations. If not set, atomic updates may be transmitted
out of order from their submission.
964
965 comp_order - Completion Ordering
966 Completion ordering refers to the order in which completed requests are
967 written into the completion queue. Completion ordering is similar to
968 message order. Relaxed completion order may enable faster reporting of
969 completed transfers, allow acknowledgments to be sent over different
970 fabric paths, and support more sophisticated retry mechanisms. This
971 can result in lower-latency completions, particularly when using uncon‐
972 nected endpoints. Strict completion ordering may require that
973 providers queue completed operations or limit available optimizations.
974
975 For transmit requests, completion ordering depends on the endpoint com‐
976 munication type. For unreliable communication, completion ordering ap‐
977 plies to all data transfer requests submitted to an endpoint. For re‐
978 liable communication, completion ordering only applies to requests that
979 target a single destination endpoint. Completion ordering of requests
980 that target different endpoints over a reliable transport is not de‐
981 fined.
982
983 Applications should specify the completion ordering that they support
984 or require. Providers should return the completion order that they ac‐
985 tually provide, with the constraint that the returned ordering is
986 stricter than that specified by the application. Supported completion
987 order values are:
988
989 FI_ORDER_NONE
990 No ordering is defined for completed operations. Requests sub‐
991 mitted to the transmit context may complete in any order.
992
993 FI_ORDER_STRICT
994 Requests complete in the order in which they are submitted to
995 the transmit context.
996
997 inject_size
998 The requested inject operation size (see the FI_INJECT flag) that the
999 context will support. This is the maximum size data transfer that can
1000 be associated with an inject operation (such as fi_inject) or may be
1001 used with the FI_INJECT data transfer flag.
1002
1003 size
1004 The size of the context. The size is specified as the minimum number
1005 of transmit operations that may be posted to the endpoint without the
1006 operation returning -FI_EAGAIN.
1007
1008 iov_limit
1009 This is the maximum number of IO vectors (scatter-gather elements) that
1010 a single posted operation may reference.
1011
1012 rma_iov_limit
1013 This is the maximum number of RMA IO vectors (scatter-gather elements)
1014 that an RMA or atomic operation may reference. The rma_iov_limit cor‐
1015 responds to the rma_iov_count values in RMA and atomic operations. See
1016 struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3 and fi_atomic.3,
1017 for additional details. This limit applies to both the number of RMA
1018 IO vectors that may be specified when initiating an operation from the
1019 local endpoint, as well as the maximum number of IO vectors that may be
1020 carried in a single request from a remote endpoint.
1021
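As a small sketch, the fragment below fills in a few transmit attributes as
fi_getinfo hints; the specific values are illustrative, and hints is
assumed to come from fi_allocinfo.

    hints->tx_attr->msg_order = FI_ORDER_SAS;      /* ordered sends        */
    hints->tx_attr->comp_order = FI_ORDER_NONE;    /* any completion order */
    hints->tx_attr->op_flags = FI_COMPLETION;      /* always report sends  */
    hints->tx_attr->iov_limit = 4;                 /* need 4 SGEs per send */

    /* After fi_getinfo() returns, info->tx_attr reports the attributes the
     * provider actually supports, e.g. the usable inject_size. */
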
RECEIVE CONTEXT ATTRIBUTES
Attributes specific to the receive capabilities of an endpoint are
1024 specified using struct fi_rx_attr.
1025
1026 struct fi_rx_attr {
1027 uint64_t caps;
1028 uint64_t mode;
1029 uint64_t op_flags;
1030 uint64_t msg_order;
1031 uint64_t comp_order;
1032 size_t total_buffered_recv;
1033 size_t size;
1034 size_t iov_limit;
1035 };
1036
caps - Capabilities
The requested capabilities of the context. The capabilities must be a
subset of those requested of the associated endpoint. See the
CAPABILITIES section of fi_getinfo(3) for capability details. If the
caps field is 0 on input to fi_getinfo(3), the caps value from the
fi_info structure will be used.
1043
1044 mode
1045 The operational mode bits of the context. The mode bits will be a sub‐
1046 set of those associated with the endpoint. See the MODE section of
1047 fi_getinfo(3) for details. A mode value of 0 will be ignored on input
1048 to fi_getinfo(3), with the mode value of the fi_info structure used in‐
1049 stead. On return from fi_getinfo(3), the mode will be set only to
1050 those constraints specific to receive operations.
1051
1052 op_flags - Default receive operation flags
1053 Flags that control the operation of operations submitted against the
1054 context. Applicable flags are listed in the Operation Flags section.
1055
1056 msg_order - Message Ordering
1057 For a description of message ordering, see the msg_order field in the
1058 Transmit Context Attribute section. Receive context message ordering
1059 defines the order in which received transport message headers are pro‐
1060 cessed when received by an endpoint. When ordering is set, it indi‐
1061 cates that message headers will be processed in order, based on how the
1062 transmit side has identified the messages. Typically, this means that
1063 messages will be handled in order based on a message level sequence
1064 number.
1065
1066 The following ordering flags, as defined for transmit ordering, also
1067 apply to the processing of received operations: FI_ORDER_NONE, FI_OR‐
1068 DER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_OR‐
1069 DER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR,
1070 FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOM‐
1071 IC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, and FI_ORDER_ATOM‐
1072 IC_WAW.
1073
1074 comp_order - Completion Ordering
1075 For a description of completion ordering, see the comp_order field in
1076 the Transmit Context Attribute section.
1077
1078 FI_ORDER_NONE
1079 No ordering is defined for completed operations. Receive opera‐
1080 tions may complete in any order, regardless of their submission
1081 order.
1082
1083 FI_ORDER_STRICT
1084 Receive operations complete in the order in which they are pro‐
1085 cessed by the receive context, based on the receive side msg_or‐
1086 der attribute.
1087
1088 FI_ORDER_DATA
1089 When set, this bit indicates that received data is written into
1090 memory in order. Data ordering applies to memory accessed as
1091 part of a single operation and between operations if message or‐
1092 dering is guaranteed.
1093
total_buffered_recv
This field is supported for backwards compatibility purposes. It is a
hint to the provider of the total available space that may be needed
to buffer messages that are received for which there is no matching
receive operation. The provider may adjust or ignore this value. The
allocation of internal network buffering among received messages is
provider specific. For instance, a provider may limit the size of
messages which can be buffered or the amount of buffering allocated
to a single message.

If receive side buffering is disabled (total_buffered_recv = 0) and a
message is received by an endpoint, then the behavior is dependent on
whether resource management has been enabled (FI_RM_ENABLED has been
set or not). See the Resource Management section of fi_domain.3 for
further clarification. It is recommended that applications enable
resource management if they anticipate receiving unexpected messages,
rather than modifying this value.
1111
1112 size
1113 The size of the context. The size is specified as the minimum number
1114 of receive operations that may be posted to the endpoint without the
1115 operation returning -FI_EAGAIN.
1116
iov_limit
This is the maximum number of IO vectors (scatter-gather elements) that
a single posted operation may reference.
1120
SCALABLE ENDPOINTS
A scalable endpoint is a communication portal that supports multiple
1123 transmit and receive contexts. Scalable endpoints are loosely modeled
1124 after the networking concept of transmit/receive side scaling, also
1125 known as multi-queue. Support for scalable endpoints is domain specif‐
1126 ic. Scalable endpoints may improve the performance of multi-threaded
1127 and parallel applications, by allowing threads to access independent
1128 transmit and receive queues. A scalable endpoint has a single trans‐
1129 port level address, which can reduce the memory requirements needed to
1130 store remote addressing data, versus using standard endpoints. Scal‐
1131 able endpoints cannot be used directly for communication operations,
1132 and require the application to explicitly create transmit and receive
1133 contexts as described below.
1134
1135 fi_tx_context
1136 Transmit contexts are independent transmit queues. Ordering and syn‐
1137 chronization between contexts are not defined. Conceptually a transmit
1138 context behaves similar to a send-only endpoint. A transmit context
1139 may be configured with fewer capabilities than the base endpoint and
1140 with different attributes (such as ordering requirements and inject
1141 size) than other contexts associated with the same scalable endpoint.
1142 Each transmit context has its own completion queue. The number of
1143 transmit contexts associated with an endpoint is specified during end‐
1144 point creation.
1145
1146 The fi_tx_context call is used to retrieve a specific context, identi‐
1147 fied by an index (see above for details on transmit context at‐
1148 tributes). Providers may dynamically allocate contexts when fi_tx_con‐
1149 text is called, or may statically create all contexts when fi_endpoint
1150 is invoked. By default, a transmit context inherits the properties of
1151 its associated endpoint. However, applications may request context
1152 specific attributes through the attr parameter. Support for per trans‐
1153 mit context attributes is provider specific and not guaranteed.
1154 Providers will return the actual attributes assigned to the context
1155 through the attr parameter, if provided.
1156
1157 fi_rx_context
1158 Receive contexts are independent receive queues for receiving incoming
1159 data. Ordering and synchronization between contexts are not guaran‐
1160 teed. Conceptually a receive context behaves similar to a receive-only
1161 endpoint. A receive context may be configured with fewer capabilities
1162 than the base endpoint and with different attributes (such as ordering
1163 requirements and inject size) than other contexts associated with the
1164 same scalable endpoint. Each receive context has its own completion
1165 queue. The number of receive contexts associated with an endpoint is
1166 specified during endpoint creation.
1167
Receive contexts are often associated with steering flows that specify
which incoming packets targeting a scalable endpoint to process.
However, receive contexts may be targeted directly by the initiator, if
supported by the underlying protocol. Such contexts are referred to as
'named'. Support for named contexts must be indicated by setting the
FI_NAMED_RX_CTX capability in caps when the corresponding endpoint is
created. Support for named receive contexts is coordinated with address
vectors. See fi_av(3) and fi_rx_addr(3).
1176
1177 The fi_rx_context call is used to retrieve a specific context, identi‐
1178 fied by an index (see above for details on receive context attributes).
1179 Providers may dynamically allocate contexts when fi_rx_context is
1180 called, or may statically create all contexts when fi_endpoint is in‐
1181 voked. By default, a receive context inherits the properties of its
1182 associated endpoint. However, applications may request context specif‐
1183 ic attributes through the attr parameter. Support for per receive con‐
1184 text attributes is provider specific and not guaranteed. Providers
1185 will return the actual attributes assigned to the context through the
1186 attr parameter, if provided.
1187
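The sketch below creates a scalable endpoint with one transmit and one
receive context per worker. NWORKERS, the per-context CQ arrays, and the
ordering of the enable calls are illustrative, and the exact sequence a
provider accepts may vary.

    struct fid_ep *sep, *tx[NWORKERS], *rx[NWORKERS];
    int i, ret;

    info->ep_attr->tx_ctx_cnt = NWORKERS;
    info->ep_attr->rx_ctx_cnt = NWORKERS;

    ret = fi_scalable_ep(domain, info, &sep, NULL);
    if (ret)
        return ret;

    for (i = 0; i < NWORKERS; i++) {
        ret = fi_tx_context(sep, i, NULL, &tx[i], NULL);
        if (ret)
            return ret;
        ret = fi_rx_context(sep, i, NULL, &rx[i], NULL);
        if (ret)
            return ret;
    }

    /* The AV bound to the scalable endpoint is implicitly bound to every
     * transmit and receive context created from it. */
    ret = fi_scalable_ep_bind(sep, &av->fid, 0);
    if (ret)
        return ret;

    /* Each context gets its own CQ and is enabled individually. */
    for (i = 0; i < NWORKERS; i++) {
        ret = fi_ep_bind(tx[i], &tx_cq[i]->fid, FI_TRANSMIT);
        if (ret)
            return ret;
        ret = fi_ep_bind(rx[i], &rx_cq[i]->fid, FI_RECV);
        if (ret)
            return ret;
    }

    ret = fi_enable(sep);
    if (ret)
        return ret;
    for (i = 0; i < NWORKERS; i++) {
        ret = fi_enable(tx[i]);
        if (ret)
            return ret;
        ret = fi_enable(rx[i]);
        if (ret)
            return ret;
    }
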
SHARED CONTEXTS
Shared contexts are transmit and receive contexts explicitly shared
1190 among one or more endpoints. A shareable context allows an application
1191 to use a single dedicated provider resource among multiple transport
1192 addressable endpoints. This can greatly reduce the resources needed to
1193 manage communication over multiple endpoints by multiplexing transmit
1194 and/or receive processing, with the potential cost of serializing ac‐
1195 cess across multiple endpoints. Support for shareable contexts is do‐
1196 main specific.
1197
1198 Conceptually, shareable transmit contexts are transmit queues that may
1199 be accessed by many endpoints. The use of a shared transmit context is
1200 mostly opaque to an application. Applications must allocate and bind
1201 shared transmit contexts to endpoints, but operations are posted di‐
1202 rectly to the endpoint. Shared transmit contexts are not associated
1203 with completion queues or counters. Completed operations are posted to
1204 the CQs bound to the endpoint. An endpoint may only be associated with
1205 a single shared transmit context.
1206
1207 Unlike shared transmit contexts, applications interact directly with
1208 shared receive contexts. Users post receive buffers directly to a
1209 shared receive context, with the buffers usable by any endpoint bound
1210 to the shared receive context. Shared receive contexts are not associ‐
1211 ated with completion queues or counters. Completed receive operations
are posted to the CQs bound to the endpoint.  An endpoint may only be
associated with a single shared receive context, and all
connectionless endpoints associated with a shared receive context
must also share the same address vector.
1216
Endpoints associated with a shared transmit context may use dedicated
receive contexts, and vice-versa; alternatively, an endpoint may use
both shared transmit and shared receive contexts.  There is no
requirement that the same group of endpoints sharing a context of one
type also share a context of the other type.  Furthermore, an endpoint
may use a shared context of one type, but a scalable set of contexts
of the other type.
1223
1224 fi_stx_context
1225 This call is used to open a shareable transmit context (see above for
1226 details on the transmit context attributes). Endpoints associated with
1227 a shared transmit context must use a subset of the transmit context's
1228 attributes. Note that this is the reverse of the requirement for
1229 transmit contexts for scalable endpoints.
1230
1231 fi_srx_context
1232 This allocates a shareable receive context (see above for details on
1233 the receive context attributes). Endpoints associated with a shared
1234 receive context must use a subset of the receive context's attributes.
1235 Note that this is the reverse of the requirement for receive contexts
1236 for scalable endpoints.
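
As a sketch, the following allocates a shared transmit context and a
shared receive context, then binds both to an ordinary endpoint.  It
assumes the endpoint's fi_info was obtained with ep_attr->tx_ctx_cnt
and rx_ctx_cnt set to FI_SHARED_CONTEXT; domain, info, cq, buf, and
len are placeholders, NULL attributes request defaults (a provider may
require explicit attributes), and error handling is omitted.

    struct fid_stx *stx;
    struct fid_ep *srx, *ep;

    fi_stx_context(domain, NULL, &stx, NULL);
    fi_srx_context(domain, NULL, &srx, NULL);

    fi_endpoint(domain, info, &ep, NULL);
    fi_ep_bind(ep, &stx->fid, 0);     /* share the transmit queue */
    fi_ep_bind(ep, &srx->fid, 0);     /* share the receive queue  */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    fi_enable(ep);

    /* Receive buffers are posted to the shared receive context. */
    fi_recv(srx, buf, len, NULL, FI_ADDR_UNSPEC, NULL);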

SOCKET ENDPOINTS
1239 The following feature and description should be considered experimen‐
1240 tal. Until the experimental tag is removed, the interfaces, semantics,
1241 and data structures associated with socket endpoints may change between
1242 library versions.
1243
1244 This section applies to endpoints of type FI_EP_SOCK_STREAM and
1245 FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.
1246
1247 Socket endpoints are defined with semantics that allow them to more
1248 easily be adopted by developers familiar with the UNIX socket API, or
1249 by middleware that exposes the socket API, while still taking advantage
1250 of high-performance hardware features.
1251
The key difference between socket endpoints and other active endpoints
is that socket endpoints use synchronous data transfers.  Buffers
passed into send and receive operations revert to the control of the
application upon returning from the function call.  As a result, no
data transfer completions are reported to the application, and socket
endpoints are not associated with completion queues or counters.
1258
Socket endpoints support a subset of message operations: fi_send,
fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
Because data transfers are synchronous, the return value from send and
receive operations indicates the number of bytes transferred on
success, or a negative value on error, including -FI_EAGAIN if the
endpoint cannot send or receive any data because of full or empty
queues, respectively.
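
A minimal sketch of a blocking-style send loop on a stream socket
endpoint is shown below; ep, dest_addr, buf, and len are placeholders,
and a real application would typically wait on the endpoint's file
descriptor instead of spinning on -FI_EAGAIN.

    ssize_t rc;
    size_t sent = 0;

    while (sent < len) {
        rc = fi_send(ep, (const char *) buf + sent, len - sent,
                     NULL, dest_addr, NULL);
        if (rc == -FI_EAGAIN)
            continue;               /* transmit queue full: retry */
        if (rc < 0)
            return (int) rc;        /* fatal error */
        sent += (size_t) rc;        /* bytes accepted so far */
    }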
1266
Socket endpoints are associated with event queues and address vectors,
and they process connection management events asynchronously, similar
to other endpoints.  Unlike UNIX sockets, socket endpoints must still
be declared as either active or passive.
1271
1272 Socket endpoints behave like non-blocking sockets. In order to support
1273 select and poll semantics, active socket endpoints are associated with
1274 a file descriptor that is signaled whenever the endpoint is ready to
1275 send and/or receive data. The file descriptor may be retrieved using
1276 fi_control.
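
The text above does not name the fi_control command used to retrieve
the descriptor; the sketch below assumes FI_GETWAIT serves that
purpose, which should be confirmed against the provider in use.  ep is
a placeholder for an enabled active socket endpoint.

    #include <poll.h>

    int fd, ret;
    struct pollfd pfd;

    ret = fi_control(&ep->fid, FI_GETWAIT, &fd);
    if (ret)
        return ret;

    pfd.fd = fd;
    pfd.events = POLLIN | POLLOUT;
    ret = poll(&pfd, 1, -1);    /* wait until ready to send/receive */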

OPERATION FLAGS
Operation flags are obtained by OR-ing the following flags together.
Operation flags define the default flags applied to an endpoint's data
transfer operations when a flags parameter is not available.  Data
transfer operations that take flags as input override the op_flags
value of the transmit or receive context attributes of an endpoint.
1284
1285 FI_INJECT
1286 Indicates that all outbound data buffers should be returned to
1287 the user's control immediately after a data transfer call re‐
1288 turns, even if the operation is handled asynchronously. This
1289 may require that the provider copy the data into a local buffer
1290 and transfer out of that buffer. A provider can limit the total
1291 amount of send data that may be buffered and/or the size of a
1292 single send that can use this flag. This limit is indicated us‐
1293 ing inject_size (see inject_size above).
1294
FI_MULTI_RECV
       Applies to posted receive operations.  This flag allows the user
       to post a single buffer that will receive multiple incoming
       messages.  Received messages will be packed into the receive
       buffer until the buffer has been consumed.  Use of this flag may
       cause a single posted receive operation to generate multiple
       completions as messages are placed into the buffer.  The
       placement of received data into the buffer may be subject to
       provider-specific alignment restrictions.  The buffer will be
       released by the provider when the available buffer space falls
       below the specified minimum (see FI_OPT_MIN_MULTI_RECV).  A
       posting example is shown after this list of flags.
1306
FI_COMPLETION
       Indicates that a completion queue entry should be written for
       data transfer operations.  This flag only applies to operations
       issued on an endpoint that was bound to a completion queue with
       the FI_SELECTIVE_COMPLETION flag set; otherwise, it is ignored.
       See the fi_ep_bind section above for more detail.
1313
1314 FI_INJECT_COMPLETE
1315 Indicates that a completion should be generated when the source
1316 buffer(s) may be reused. See fi_cq(3) for additional details on
1317 completion semantics.
1318
1319 FI_TRANSMIT_COMPLETE
1320 Indicates that a completion should be generated when the trans‐
1321 mit operation has completed relative to the local provider. See
1322 fi_cq(3) for additional details on completion semantics.
1323
1324 FI_DELIVERY_COMPLETE
1325 Indicates that a completion should be generated when the opera‐
1326 tion has been processed by the destination endpoint(s). See
1327 fi_cq(3) for additional details on completion semantics.
1328
FI_COMMIT_COMPLETE
       Indicates that a completion should not be generated (locally or
       at the peer) until the result of the operation has been made
       persistent.  See fi_cq(3) for additional details on completion
       semantics.
1334
1335 FI_MULTICAST
1336 Indicates that data transfers will target multicast addresses by
1337 default. Any fi_addr_t passed into a data transfer operation
1338 will be treated as a multicast address.
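
The following sketch posts a multi-receive buffer as referenced in the
FI_MULTI_RECV description above.  rx_ep, big_buf, and BIG_LEN are
placeholder names, and struct iovec requires <sys/uio.h>.

    size_t min_free = 1024;   /* release buffer when < 1 KiB remains */
    struct iovec iov = { .iov_base = big_buf, .iov_len = BIG_LEN };
    struct fi_msg msg = {
        .msg_iov   = &iov,
        .iov_count = 1,
        .addr      = FI_ADDR_UNSPEC,
    };
    int ret;

    ret = fi_setopt(&rx_ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                    &min_free, sizeof min_free);
    if (!ret)
        ret = fi_recvmsg(rx_ep, &msg, FI_MULTI_RECV);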

NOTES
1341 Users should call fi_close to release all resources allocated to the
1342 fabric endpoint.
1343
Endpoints allocated with the FI_CONTEXT or FI_CONTEXT2 mode bits set
must typically provide a struct fi_context or struct fi_context2,
respectively, as their per operation context parameter.  (See
fi_getinfo(3) for details.)  However, when FI_SELECTIVE_COMPLETION is
enabled to suppress CQ completion entries, and an operation is
initiated without the FI_COMPLETION flag set, the context parameter is
ignored.  An application does not need to pass a valid context
structure into such data transfers.
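
For example, when FI_CONTEXT is required, each outstanding operation
supplies provider-owned scratch space similar to the sketch below; the
structure must remain valid until the operation completes, so a stack
variable is used here only for illustration.

    struct fi_context ctx;

    fi_send(ep, buf, len, NULL, dest_addr, &ctx);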
1351
Operations that complete in error and that are not associated with a
valid operational context will use the endpoint context in any error
reporting structures.
1355
1356 Although applications typically associate individual completions with
1357 either completion queues or counters, an endpoint can be attached to
1358 both a counter and completion queue. When combined with using selec‐
1359 tive completions, this allows an application to use counters to track
1360 successful completions, with a CQ used to report errors. Operations
1361 that complete with an error increment the error counter and generate a
1362 CQ completion event.
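
A sketch of this counter-plus-CQ arrangement follows; ep, cq, cntr,
buf, len, and dest_addr are placeholder names and error checking is
omitted.

    /* Success entries are suppressed for operations posted without
     * FI_COMPLETION; the counter still tracks them. */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
    fi_ep_bind(ep, &cntr->fid, FI_SEND);
    fi_enable(ep);

    /* On success only the counter is incremented; on failure the
     * counter's error count is incremented and a CQ error entry is
     * written. */
    fi_send(ep, buf, len, NULL, dest_addr, NULL);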
1363
As mentioned in fi_getinfo(3), the ep_attr structure can be used to
query for providers that support various endpoint attributes.
fi_getinfo can return provider info structures that meet only the
application's minimal requirements (such that the application
maintains correctness).  However, it can also return provider info
structures that exceed those requirements.  As an example, consider
an application requesting
1370 msg_order as FI_ORDER_NONE. The resulting output from fi_getinfo may
1371 have all the ordering bits set. The application can reset the ordering
1372 bits it does not require before creating the endpoint. The provider is
1373 free to implement a stricter ordering than is required by the applica‐
1374 tion.
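
A sketch of this pattern, assuming hints was allocated with
fi_allocinfo and that domain creation is handled elsewhere:

    hints->ep_attr->type = FI_EP_RDM;
    hints->tx_attr->msg_order = FI_ORDER_NONE;

    ret = fi_getinfo(FI_VERSION(1, 8), NULL, NULL, 0, hints, &info);
    if (!ret) {
        /* Keep only the ordering the application actually needs,
         * even if the provider reported stricter ordering bits. */
        info->tx_attr->msg_order = FI_ORDER_NONE;
        info->rx_attr->msg_order = FI_ORDER_NONE;
        ret = fi_endpoint(domain, info, &ep, NULL);
    }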

RETURN VALUES
Returns 0 on success.  On error, a negative value corresponding to a
fabric errno is returned.  For fi_cancel, a return value of 0
indicates that the cancel request was submitted for processing.
1380
1381 Fabric errno values are defined in rdma/fi_errno.h.

ERRORS
1384 -FI_EDOMAIN
1385 A resource domain was not bound to the endpoint or an attempt
1386 was made to bind multiple domains.
1387
-FI_ENOCQ
       The endpoint has not been configured with a necessary
       completion queue.
1390
1391 -FI_EOPBADSTATE
1392 The endpoint's state does not permit the requested operation.

SEE ALSO
fi_getinfo(3), fi_domain(3), fi_cq(3), fi_msg(3), fi_tagged(3),
fi_rma(3)

AUTHORS
1399 OpenFabrics.
1400
1401
1402
1403Libfabric Programmer's Manual 2019-05-14 fi_endpoint(3)