fi_endpoint(3)                 Libfabric v1.6.1                 fi_endpoint(3)

NAME

       fi_endpoint - Fabric endpoint operations

       fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
              Allocate or close an endpoint.

       fi_ep_bind
              Associate an endpoint with hardware resources, such as event
              queues, completion queues, counters, address vectors, or shared
              transmit/receive contexts.

       fi_scalable_ep_bind
              Associate a scalable endpoint with an address vector.

       fi_pep_bind
              Associate a passive endpoint with an event queue.

       fi_enable
              Transition an active endpoint into an enabled state.

       fi_cancel
              Cancel a pending asynchronous data transfer.

       fi_ep_alias
              Create an alias to the endpoint.

       fi_control
              Control endpoint operation.

       fi_getopt / fi_setopt
              Get or set endpoint options.

       fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
              Open a transmit or receive context, or a shared transmit or
              receive context.

       fi_rx_size_left / fi_tx_size_left (DEPRECATED)
              Query the lower bound on how many RX/TX operations may be posted
              without an operation returning -FI_EAGAIN.  These functions have
              been deprecated and will be removed in a future version of the
              library.


SYNOPSIS

              #include <rdma/fabric.h>

              #include <rdma/fi_endpoint.h>

              int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **ep, void *context);

              int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **sep, void *context);

              int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
                  struct fid_pep **pep, void *context);

              int fi_tx_context(struct fid_ep *sep, int index,
                  struct fi_tx_attr *attr, struct fid_ep **tx_ep,
                  void *context);

              int fi_rx_context(struct fid_ep *sep, int index,
                  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
                  void *context);

              int fi_stx_context(struct fid_domain *domain,
                  struct fi_tx_attr *attr, struct fid_stx **stx,
                  void *context);

              int fi_srx_context(struct fid_domain *domain,
                  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
                  void *context);

              int fi_close(struct fid *ep);

              int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);

              int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

              int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

              int fi_enable(struct fid_ep *ep);

              int fi_cancel(struct fid_ep *ep, void *context);

              int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);

              int fi_control(struct fid *ep, int command, void *arg);

              int fi_getopt(struct fid *ep, int level, int optname,
                  void *optval, size_t *optlen);

              int fi_setopt(struct fid *ep, int level, int optname,
                  const void *optval, size_t optlen);

              DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);

              DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);


ARGUMENTS

       fid : On creation, specifies a fabric or access domain.  On bind,
       identifies the event queue, completion queue, counter, or address
       vector to bind to the endpoint.  In other cases, it is the fabric
       identifier of an associated resource.

       info : Details about the fabric interface endpoint to be opened,
       obtained from fi_getinfo.

       ep : A fabric endpoint.

       sep : A scalable fabric endpoint.

       pep : A passive fabric endpoint.

       context : Context associated with the endpoint or asynchronous
       operation.

       index : Index to retrieve a specific transmit/receive context.

       attr : Transmit or receive context attributes.

       flags : Additional flags to apply to the operation.

       command : Control command to perform on the endpoint.

       arg : Optional control argument.

       level : Protocol level at which the desired option resides.

       optname : The protocol option to read or set.

       optval : The option value that was read, or the value to set.

       optlen : The size of the optval buffer.


DESCRIPTION

       Endpoints are transport level communication portals.  There are two
       types of endpoints: active and passive.  Passive endpoints belong to a
       fabric domain and are most often used to listen for incoming
       connection requests.  However, a passive endpoint may be used to
       reserve a fabric address that can be granted to an active endpoint.
       Active endpoints belong to access domains and can perform data
       transfers.

       Active endpoints may be connection-oriented or connectionless, and may
       provide data reliability.  The data transfer interfaces -- messages
       (fi_msg), tagged messages (fi_tagged), RMA (fi_rma), and atomics
       (fi_atomic) -- are associated with active endpoints.  In basic
       configurations, an active endpoint has transmit and receive queues.
       In general, operations that generate traffic on the fabric are posted
       to the transmit queue.  This includes all RMA and atomic operations,
       along with sent messages and sent tagged messages.  Operations that
       post buffers for receiving incoming data are submitted to the receive
       queue.

       Active endpoints are created in the disabled state.  They must
       transition into an enabled state before accepting data transfer
       operations, including posting of receive buffers.  The fi_enable call
       is used to transition an active endpoint into an enabled state.  The
       fi_connect and fi_accept calls will also transition an endpoint into
       the enabled state, if it is not already enabled.

       In order to transition an endpoint into an enabled state, it must be
       bound to one or more fabric resources.  An endpoint that will generate
       asynchronous completions, either through data transfer operations or
       communication establishment events, must be bound to the appropriate
       completion queues or event queues, respectively, before being enabled.
       Additionally, endpoints that use manual progress must be associated
       with relevant completion queues or event queues in order to drive
       progress.  For endpoints that are only used as the target of RMA or
       atomic operations, this means binding the endpoint to a completion
       queue associated with receive processing.  Unconnected endpoints must
       be bound to an address vector.

       Once an endpoint has been activated, it may be associated with an
       address vector.  Receive buffers may be posted to it and calls may be
       made to connection establishment routines.  Connectionless endpoints
       may also perform data transfers.

       The behavior of an endpoint may be adjusted by setting its control
       data and protocol options.  This allows the underlying provider to
       redirect function calls to implementations optimized to meet the
       desired application behavior.

       If an endpoint experiences a critical error, it will transition back
       into a disabled state.  Critical errors are reported through the event
       queue associated with the EP.  In certain cases, a disabled endpoint
       may be re-enabled.  The ability to transition back into an enabled
       state is provider specific and depends on the type of error that the
       endpoint experienced.  When an endpoint is disabled as a result of a
       critical error, all pending operations are discarded.

   fi_endpoint / fi_passive_ep / fi_scalable_ep
       fi_endpoint allocates a new active endpoint.  fi_passive_ep allocates
       a new passive endpoint.  fi_scalable_ep allocates a scalable endpoint.
       The properties and behavior of the endpoint are defined based on the
       provided struct fi_info.  See fi_getinfo for additional details on
       fi_info.  fi_info flags that control the operation of an endpoint are
       defined below.  See section SCALABLE ENDPOINTS.

       If an active endpoint is allocated in order to accept a connection
       request, the fi_info parameter must be the same as the fi_info
       structure provided with the connection request (FI_CONNREQ) event.

       An active endpoint may acquire the properties of a passive endpoint by
       setting the fi_info handle field to the passive endpoint fabric
       descriptor.  This is useful for applications that need to reserve the
       fabric address of an endpoint prior to knowing if the endpoint will be
       used on the active or passive side of a connection.  For example, this
       feature is useful for simulating socket semantics.  Once an active
       endpoint acquires the properties of a passive endpoint, the passive
       endpoint is no longer bound to any fabric resources and must no longer
       be used.  The user is expected to close the passive endpoint after
       opening the active endpoint in order to free up any lingering
       resources that had been used.

   fi_close
       Closes an endpoint and releases all resources associated with it.

       When closing a scalable endpoint, there must be no open transmit or
       receive contexts associated with the scalable endpoint.  If resources
       are still associated with the scalable endpoint when attempting to
       close, the call will return -FI_EBUSY.

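       The scalable endpoint close rule works like reference counting: every
       open context holds the scalable endpoint open.  The following is a
       minimal, self-contained C model of that rule, not libfabric code.  All
       model_* names are hypothetical, and the POSIX EBUSY from <errno.h>
       stands in for FI_EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: a scalable endpoint counts its open contexts. */
struct model_sep {
    size_t open_tx;   /* transmit contexts still open */
    size_t open_rx;   /* receive contexts still open */
    int closed;       /* set once the close succeeds */
};

/* Opening a context (cf. fi_tx_context / fi_rx_context) takes a reference. */
void model_open_tx(struct model_sep *sep) { sep->open_tx++; }
void model_open_rx(struct model_sep *sep) { sep->open_rx++; }

/* Closing a context drops its reference. */
void model_close_tx(struct model_sep *sep)
{
    assert(sep->open_tx > 0);
    sep->open_tx--;
}

void model_close_rx(struct model_sep *sep)
{
    assert(sep->open_rx > 0);
    sep->open_rx--;
}

/* Closing the scalable endpoint is refused while any context is open. */
int model_close_sep(struct model_sep *sep)
{
    if (sep->open_tx != 0 || sep->open_rx != 0)
        return -EBUSY;
    sep->closed = 1;
    return 0;
}
```

       In an application, this means calling fi_close on each context
       obtained from the scalable endpoint before calling fi_close on the
       scalable endpoint itself.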
       Outstanding operations posted to the endpoint when fi_close is called
       will be discarded.  Discarded operations will silently be dropped,
       with no completions reported.  Additionally, a provider may discard
       previously completed operations from the associated completion
       queue(s).  The behavior to discard completed operations is provider
       specific.

   fi_ep_bind
       fi_ep_bind is used to associate an endpoint with hardware resources.
       The common use of fi_ep_bind is to direct asynchronous operations
       associated with an endpoint to a completion queue.  An endpoint must
       be bound with CQs capable of reporting completions for any
       asynchronous operation initiated on the endpoint.  This is true even
       for endpoints that are configured to suppress successful completions,
       so that operations that complete in error can still be reported to the
       user.  For passive endpoints, this requires binding the endpoint with
       an EQ that supports the communication management (CM) domain.

       An active endpoint may direct asynchronous completions to different
       CQs, based on the type of operation.  This is specified using
       fi_ep_bind flags.  The following flags may be used separately or OR'ed
       together when binding an endpoint to a CQ.

       FI_TRANSMIT : Directs the completion of outbound data transfer
       requests to the specified completion queue.  This includes send
       message, RMA, and atomic operations.

       FI_RECV : Directs the notification of inbound data transfers to the
       specified completion queue.  This includes received messages.  This
       binding automatically includes FI_REMOTE_WRITE, if applicable to the
       endpoint.

       FI_SELECTIVE_COMPLETION : By default, data transfer operations
       generate completion entries into a completion queue after they have
       successfully completed.  Applications can use this bind flag to
       selectively enable when completions are generated.  If
       FI_SELECTIVE_COMPLETION is specified, data transfer operations will
       not generate entries for successful completions unless FI_COMPLETION
       is set as an operational flag for the given operation.
       FI_SELECTIVE_COMPLETION must be OR'ed with the FI_TRANSMIT and/or
       FI_RECV flags.

       When FI_SELECTIVE_COMPLETION is set, the user must determine
       indirectly when a request that does NOT have FI_COMPLETION set has
       completed, usually based on the completion of a subsequent operation.
       Use of this flag may improve performance by allowing the provider to
       avoid writing a completion entry for every operation.

       Example: An application can selectively generate send completions by
       using the following general approach, where info is the struct fi_info
       used to open ep and cq is a struct fid_cq *:

              info->tx_attr->op_flags = 0;        // default - no completion
              fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
              fi_send(ep, ...);                   // no completion
              fi_sendv(ep, ...);                  // no completion
              fi_sendmsg(ep, ..., FI_COMPLETION); // completion!
              fi_inject(ep, ...);                 // no completion

       Example: An application can selectively disable send completions by
       modifying the operational flags:

              info->tx_attr->op_flags = FI_COMPLETION; // default - completion
              fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
              fi_send(ep, ...);       // completion
              fi_sendv(ep, ...);      // completion
              fi_sendmsg(ep, ..., 0); // no completion!
              fi_inject(ep, ...);     // no completion!

       Example: Omitting FI_SELECTIVE_COMPLETION when binding will generate
       completions for all non-fi_inject calls:

              info->tx_attr->op_flags = 0;
              fi_ep_bind(ep, &cq->fid, FI_TRANSMIT); // default - completion
              fi_send(ep, ...);                   // completion
              fi_sendv(ep, ...);                  // completion
              fi_sendmsg(ep, ..., 0);             // completion!
              fi_sendmsg(ep, ..., FI_COMPLETION); // completion
              fi_sendmsg(ep, ..., FI_INJECT | FI_COMPLETION); // completion!
              fi_inject(ep, ...);                 // no completion!

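       The three examples above follow a small set of rules for successful
       transmit completions.  The sketch below encodes those rules as a pure C
       function so they can be checked in isolation.  It does not call
       libfabric.  MODEL_COMPLETION and MODEL_INJECT are hypothetical
       stand-ins whose bit values are not those of FI_COMPLETION and
       FI_INJECT.  Failed operations are reported regardless of these rules
       and are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the libfabric bit values. */
#define MODEL_COMPLETION ((uint64_t)1 << 0)  /* plays FI_COMPLETION */
#define MODEL_INJECT     ((uint64_t)1 << 1)  /* plays FI_INJECT */

/* The transmit call used to post the operation. */
enum model_call { MODEL_SEND, MODEL_SENDV, MODEL_SENDMSG, MODEL_INJECT_CALL };

/*
 * Returns true if a successful transmit writes a CQ entry.
 *   selective: FI_SELECTIVE_COMPLETION was OR'ed into the fi_ep_bind flags
 *   op_flags:  default transmit operational flags (tx_attr op_flags)
 *   msg_flags: flags argument of fi_sendmsg (ignored for other calls)
 * In this model an inject flag on fi_sendmsg does not suppress the
 * completion; only the fi_inject call does, as in the third example.
 */
bool model_generates_completion(bool selective, uint64_t op_flags,
                                enum model_call call, uint64_t msg_flags)
{
    if (call == MODEL_INJECT_CALL)
        return false;                               /* fi_inject: never */
    if (!selective)
        return true;                                /* every other call */
    if (call == MODEL_SENDMSG)
        return (msg_flags & MODEL_COMPLETION) != 0; /* per-call flags */
    return (op_flags & MODEL_COMPLETION) != 0;      /* default op_flags */
}
```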
       An endpoint may also be bound to a fabric counter.  When binding an
       endpoint to a counter, the following flags may be specified.

       FI_SEND : Increments the specified counter whenever a message transfer
       initiated over the endpoint has completed successfully or in error.
       Sent messages include both tagged and normal message operations.

       FI_RECV : Increments the specified counter whenever a message is
       received over the endpoint.  Received messages include both tagged and
       normal message operations.

       FI_READ : Increments the specified counter whenever an RMA read or
       atomic fetch operation initiated from the endpoint has completed
       successfully or in error.

       FI_WRITE : Increments the specified counter whenever an RMA write or
       atomic operation initiated from the endpoint has completed
       successfully or in error.

       FI_REMOTE_READ : Increments the specified counter whenever an RMA read
       or atomic fetch operation is initiated from a remote endpoint that
       targets the given endpoint.  Use of this flag requires that the
       endpoint be created using FI_RMA_EVENT.

       FI_REMOTE_WRITE : Increments the specified counter whenever an RMA
       write or atomic operation is initiated from a remote endpoint that
       targets the given endpoint.  Use of this flag requires that the
       endpoint be created using FI_RMA_EVENT.

       An endpoint may only be bound to a single CQ or counter for a given
       type of operation.  For example, an EP may not bind to two counters
       both using FI_WRITE.  Furthermore, providers may limit CQ and counter
       bindings to endpoints of the same endpoint type (DGRAM, MSG, RDM,
       etc.).

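       The counter binding rules above can be summarized as a small table
       check: each operation type holds at most one counter, and the remote
       types require FI_RMA_EVENT.  The sketch below is a hypothetical C
       model, not libfabric code.  The M_* bits only mirror the flag names
       and do not use their real values, and CQ bindings are not modeled.
       Rejecting a partially conflicting request without changing anything
       is a modeling choice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation-type bits; NOT the libfabric flag values. */
enum {
    M_SEND = 1 << 0,
    M_RECV = 1 << 1,
    M_READ = 1 << 2,
    M_WRITE = 1 << 3,
    M_REMOTE_READ = 1 << 4,
    M_REMOTE_WRITE = 1 << 5
};
#define M_NTYPES 6

struct model_ep {
    bool rma_event;          /* endpoint was created with FI_RMA_EVENT */
    int counter[M_NTYPES];   /* bound counter id per type, 0 = none */
};

/* Bind counter cntr_id (nonzero) for every type set in flags.  Returns 0
 * on success or -1 if rejected; a rejected call changes nothing. */
int model_bind_counter(struct model_ep *ep, int cntr_id, unsigned flags)
{
    assert(cntr_id != 0);
    if ((flags & (M_REMOTE_READ | M_REMOTE_WRITE)) && !ep->rma_event)
        return -1;           /* remote events require FI_RMA_EVENT */
    for (int i = 0; i < M_NTYPES; i++)
        if ((flags & (1u << i)) && ep->counter[i] != 0)
            return -1;       /* this type already has a counter */
    for (int i = 0; i < M_NTYPES; i++)
        if (flags & (1u << i))
            ep->counter[i] = cntr_id;
    return 0;
}
```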
       Connectionless endpoints must be bound to a single address vector.

       If an endpoint is using a shared transmit and/or receive context, the
       shared contexts must be bound to the endpoint.  CQs, counters, AV, and
       shared contexts must be bound to endpoints before they are enabled.

   fi_scalable_ep_bind
       fi_scalable_ep_bind is used to associate a scalable endpoint with an
       address vector.  See the SCALABLE ENDPOINTS section.  A scalable
       endpoint has a single transport level address and can support
       multiple transmit and receive contexts.  The transmit and receive
       contexts share the transport-level address.  Address vectors that are
       bound to scalable endpoints are implicitly bound to any transmit or
       receive contexts created using the scalable endpoint.

   fi_enable
       This call transitions the endpoint into an enabled state.  An endpoint
       must be enabled before it may be used to perform data transfers.
       Enabling an endpoint typically results in hardware resources being
       assigned to it.  Endpoints making use of completion queues, counters,
       event queues, and/or address vectors must be bound to them before
       being enabled.

       Calling connect or accept on an endpoint will implicitly enable an
       endpoint if it has not already been enabled.

       fi_enable may also be used to re-enable an endpoint that has been
       disabled as a result of experiencing a critical error.  Applications
       should check the return value from fi_enable to see if a disabled
       endpoint has successfully been re-enabled.

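       The enable rules in this section and in DESCRIPTION form a small state
       machine.  The sketch below is a hypothetical C model of that state
       machine, not libfabric code.  It assumes the endpoint performs data
       transfers, so a CQ is always required.  It also allows re-enabling
       whenever the bindings are present, whereas real re-enable support is
       provider specific and depends on the error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_DISABLED, MODEL_ENABLED };

struct model_ep {
    enum model_state state;   /* endpoints start out disabled */
    bool connectionless;      /* e.g. an RDM or DGRAM endpoint */
    bool cq_bound;            /* a CQ able to report completions */
    bool av_bound;            /* an address vector is bound */
    size_t pending_ops;       /* posted but not yet completed */
};

/* Model of fi_enable: 0 on success, -1 if a required binding is missing. */
int model_enable(struct model_ep *ep)
{
    if (!ep->cq_bound)
        return -1;
    if (ep->connectionless && !ep->av_bound)
        return -1;            /* unconnected endpoints need an AV */
    ep->state = MODEL_ENABLED;
    return 0;
}

/* Data transfers, including posting receive buffers, need ENABLED. */
int model_post(struct model_ep *ep)
{
    if (ep->state != MODEL_ENABLED)
        return -1;
    ep->pending_ops++;
    return 0;
}

/* A critical error disables the endpoint and discards pending operations. */
void model_critical_error(struct model_ep *ep)
{
    ep->state = MODEL_DISABLED;
    ep->pending_ops = 0;
}
```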
   fi_cancel
       fi_cancel attempts to cancel an outstanding asynchronous operation.
       Canceling an operation causes the fabric provider to search for the
       operation and, if it is still pending, complete it as having been
       canceled.  An error queue entry will be available in the associated
       error queue with error code FI_ECANCELED.  On the other hand, if the
       operation completed before the call to fi_cancel, then the completion
       status of that operation will be available in the associated
       completion queue.  No specific entry related to fi_cancel itself will
       be posted.

       Cancel uses the context parameter associated with an operation to
       identify the request to cancel.  Operations posted without a valid
       context parameter -- either no context parameter is specified or the
       context value was ignored by the provider -- cannot be canceled.  If
       multiple outstanding operations match the context parameter, only one
       will be canceled.  In this case, the operation which is canceled is
       provider specific.  The cancel operation is asynchronous, but will
       complete within a bounded period of time.

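       The matching rules above can be modeled as a search over outstanding
       operations keyed by context.  This is a hypothetical C sketch, not
       libfabric code.  It cancels the first pending match, whereas a real
       provider may pick any matching operation.  Its bool return exists only
       for the model: real cancel results are observed through the completion
       and error queues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_status { MODEL_PENDING, MODEL_DONE, MODEL_CANCELED };

struct model_op {
    void *context;              /* context passed when the op was posted */
    enum model_status status;
};

/* Cancel at most one still-pending operation whose context matches.
 * Operations that already completed, or that have no (NULL) context,
 * are never canceled.  Returns true if an operation was canceled. */
bool model_cancel(struct model_op *ops, size_t n, void *context)
{
    if (context == NULL)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].context == context && ops[i].status == MODEL_PENDING) {
            ops[i].status = MODEL_CANCELED;  /* reported as FI_ECANCELED */
            return true;                     /* only one is canceled */
        }
    }
    return false;
}
```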
   fi_ep_alias
       This call creates an alias to the specified endpoint.  Conceptually,
       an endpoint alias provides an alternate software path from the
       application to the underlying provider hardware.  An alias EP differs
       from its parent endpoint only by its default data transfer flags.  For
       example, an alias EP may be configured to use a different completion
       mode.  By default, an alias EP inherits the same data transfer flags
       as the parent endpoint.  An application can use fi_control to modify
       the alias EP operational flags.

       When allocating an alias, an application may configure either the
       transmit or receive operational flags.  This avoids needing a separate
       call to fi_control to set those flags.  The flags passed to
       fi_ep_alias must include FI_TRANSMIT or FI_RECV (not both) with other
       operational flags OR'ed in.  This will override the transmit or
       receive flags, respectively, for operations posted through the alias
       endpoint.  All allocated aliases must be closed for the underlying
       endpoint to be released.

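       The "FI_TRANSMIT or FI_RECV, not both" rule used by fi_ep_alias, and by
       the FI_GETOPSFLAG and FI_SETOPSFLAG control commands, can be modeled as
       follows.  This is a hypothetical C sketch.  M_TRANSMIT and M_RECV are
       local bit values, not the libfabric ones, and returning -1 for an
       invalid combination is a modeling choice rather than the documented
       return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; NOT the libfabric FI_TRANSMIT / FI_RECV. */
#define M_TRANSMIT ((uint64_t)1 << 0)
#define M_RECV     ((uint64_t)1 << 1)

/* Default operational flags, kept separately for each direction. */
struct model_opflags {
    uint64_t tx;
    uint64_t rx;
};

/* Apply a direction-qualified flag set.  Exactly one of M_TRANSMIT and
 * M_RECV must be present.  The remaining bits replace that direction's
 * operational flags.  Returns -1 if neither or both are given. */
int model_set_opsflag(struct model_opflags *f, uint64_t flags)
{
    bool tx = (flags & M_TRANSMIT) != 0;
    bool rx = (flags & M_RECV) != 0;
    if (tx == rx)
        return -1;
    uint64_t ops = flags & ~(M_TRANSMIT | M_RECV);
    if (tx)
        f->tx = ops;
    else
        f->rx = ops;
    return 0;
}
```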
   fi_control
       The control operation is used to adjust the default behavior of an
       endpoint.  It allows the underlying provider to redirect function
       calls to implementations optimized to meet the desired application
       behavior.  As a result, calls to fi_control must be serialized against
       all other calls to an endpoint.

       The base operation of an endpoint is selected during creation using
       struct fi_info.  The following control commands and arguments may be
       assigned to an endpoint.

       FI_GETOPSFLAG -- uint64_t *flags : Used to retrieve the current value
       of flags associated with the data transfer operations initiated on the
       endpoint.  The control argument must include FI_TRANSMIT or FI_RECV
       (not both) flags to indicate the type of data transfer flags to be
       returned.  See below for a list of control flags.

       FI_SETOPSFLAG -- uint64_t *flags : Used to change the data transfer
       operation flags associated with an endpoint.  The control argument
       must include FI_TRANSMIT or FI_RECV (not both) to indicate the type of
       data transfer that the flags should apply to, with other flags OR'ed
       in.  The given flags will override the previous transmit and receive
       attributes that were set when the endpoint was created.  Valid control
       flags are defined below.

       FI_BACKLOG -- int *value : This option only applies to passive
       endpoints.  It is used to set the connection request backlog for
       listening endpoints.

       FI_GETWAIT -- void ** : This command allows the user to retrieve the
       file descriptor associated with a socket endpoint.  The fi_control arg
       parameter should be an address where a pointer to the returned file
       descriptor will be written.  See fi_eq(3) for additional details on
       using fi_control with FI_GETWAIT.  The file descriptor may be used for
       notification that the endpoint is ready to send or receive data.

   fi_getopt / fi_setopt
       Endpoint protocol options may be retrieved using fi_getopt or set
       using fi_setopt.  Applications specify the level at which a desired
       option exists, identify the option, and provide input/output buffers
       to get or set the option.  fi_setopt gives an application a way to
       adjust low-level protocol and implementation specific details of an
       endpoint.

       The following option levels, option names, and parameters are
       defined.

       FI_OPT_ENDPOINT

       · FI_OPT_MIN_MULTI_RECV - size_t : Defines the minimum receive buffer
         space available when the receive buffer is released by the provider
         (see FI_MULTI_RECV).  Modifying this value is only guaranteed to set
         the minimum buffer space needed on receives posted after the value
         has been changed.  It is recommended that applications that want to
         override the default MIN_MULTI_RECV value set this option before
         enabling the corresponding endpoint.

       · FI_OPT_CM_DATA_SIZE - size_t : Defines the size of available space
         in CM messages for user-defined data.  This value limits the amount
         of data that applications can exchange between peer endpoints using
         the fi_connect, fi_accept, and fi_reject operations.  The size
         returned is dependent upon the properties of the endpoint, except in
         the case of passive endpoints, in which the size reflects the
         maximum size of the data that may be present as part of a connection
         request event.  This option is read only.

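       The effect of FI_OPT_MIN_MULTI_RECV can be illustrated with a model of
       a single multi-receive buffer that absorbs several incoming messages.
       This is a hypothetical C sketch, not provider code.  It assumes the
       buffer is released as soon as its free space drops below the
       configured minimum, and the model_mrecv names are invented for
       illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mrecv {
    size_t size;       /* total size of the posted buffer */
    size_t used;       /* bytes consumed by received messages */
    size_t min_free;   /* the FI_OPT_MIN_MULTI_RECV value */
    bool released;     /* buffer handed back to the application */
};

/* Place one message of len bytes into the buffer.  Returns false if the
 * buffer was already released or the message does not fit. */
bool model_mrecv_deliver(struct model_mrecv *b, size_t len)
{
    if (b->released || len > b->size - b->used)
        return false;
    b->used += len;
    if (b->size - b->used < b->min_free)
        b->released = true;   /* free space below the minimum: release */
    return true;
}
```

       For example, with a 4096-byte buffer and a minimum of 1024 bytes, a
       2000-byte message leaves 2096 bytes free and the buffer stays posted.
       A following 1500-byte message leaves 596 bytes free, so the model
       releases the buffer.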
   fi_rx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in a future
       version of the library.  It may not be supported by all providers.

       The fi_rx_size_left call returns a lower bound on the number of
       receive operations that may be posted to the given endpoint without
       that operation returning -FI_EAGAIN.  Depending on the specific
       details of the subsequently posted receive operations (e.g., number of
       iov entries, which receive function is called, etc.), it may be
       possible to post more receive operations than originally indicated by
       fi_rx_size_left.

   fi_tx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in a future
       version of the library.  It may not be supported by all providers.

       The fi_tx_size_left call returns a lower bound on the number of
       transmit operations that may be posted to the given endpoint without
       that operation returning -FI_EAGAIN.  Depending on the specific
       details of the subsequently posted transmit operations (e.g., number
       of iov entries, which transmit function is called, etc.), it may be
       possible to post more transmit operations than originally indicated by
       fi_tx_size_left.


ENDPOINT ATTRIBUTES

       The fi_ep_attr structure defines the set of attributes associated with
       an endpoint.  Endpoint attributes may be further refined using the
       transmit and receive context attributes as shown below.

              struct fi_ep_attr {
                  enum fi_ep_type type;
                  uint32_t        protocol;
                  uint32_t        protocol_version;
                  size_t          max_msg_size;
                  size_t          msg_prefix_size;
                  size_t          max_order_raw_size;
                  size_t          max_order_war_size;
                  size_t          max_order_waw_size;
                  uint64_t        mem_tag_format;
                  size_t          tx_ctx_cnt;
                  size_t          rx_ctx_cnt;
                  size_t          auth_key_size;
                  uint8_t         *auth_key;
              };

   type - Endpoint Type
       If specified, indicates the type of fabric interface communication
       desired.  Supported types are:

       FI_EP_UNSPEC : The type of endpoint is not specified.  This is usually
       provided as input, with other attributes of the endpoint or the
       provider selecting the type.

       FI_EP_MSG : Provides a reliable, connection-oriented data transfer
       service with flow control that maintains message boundaries.

       FI_EP_DGRAM : Supports connectionless, unreliable datagram
       communication.  Message boundaries are maintained, but the maximum
       message size may be limited to the fabric MTU.  Flow control is not
       guaranteed.

       FI_EP_RDM : Reliable datagram message.  Provides a reliable,
       unconnected data transfer service with flow control that maintains
       message boundaries.

       FI_EP_SOCK_STREAM : Data streaming endpoint with TCP socket-like
       semantics.  Provides a reliable, connection-oriented data transfer
       service that does not maintain message boundaries.  FI_EP_SOCK_STREAM
       is most useful for applications designed around using TCP sockets.
       See the SOCKET ENDPOINT section for additional details and
       restrictions that apply to stream endpoints.

       FI_EP_SOCK_DGRAM : A connectionless, unreliable datagram endpoint with
       UDP socket-like semantics.  FI_EP_SOCK_DGRAM is most useful for
       applications designed around using UDP sockets.  See the SOCKET
       ENDPOINT section for additional details and restrictions that apply to
       datagram socket endpoints.

547   Protocol
548       Specifies  the  low-level end to end protocol employed by the provider.
549       A matching protocol must be used by communicating endpoints  to  ensure
550       interoperability.  The following protocol values are defined.  Provider
551       specific protocols are also allowed.  Provider specific protocols  will
552       be indicated by having the upper bit of the protocol value set to one.
553
554       FI_PROTO_UNSPEC  : The protocol is not specified.  This is usually pro‐
555       vided as input, with other attributes of the  socket  or  the  provider
556       selecting the actual protocol.
557
558       FI_PROTO_RDMA_CM_IB_RC  :  The  protocol  runs  over  Infiniband  reli‐
559       able-connected queue pairs, using the RDMA CM protocol  for  connection
560       establishment.
561
562       FI_PROTO_IWARP  :  The  protocol  runs over the Internet wide area RDMA
563       protocol transport.
564
565       FI_PROTO_IB_UD : The protocol runs over Infiniband unreliable  datagram
566       queue pairs.
567
568       FI_PROTO_PSMX  : The protocol is based on an Intel proprietary protocol
569       known as PSM, performance scaled messaging.  PSMX is an  extended  ver‐
570       sion of the PSM protocol to support the libfabric interfaces.
571
572       FI_PROTO_UDP  :  The  protocol  sends  and receives UDP datagrams.  For
573       example, an endpoint using FI_PROTO_UDP will  be  able  to  communicate
574       with  a  remote  peer  that  is using Berkeley SOCK_DGRAM sockets using
575       IPPROTO_UDP.
576
577       FI_PROTO_SOCK_TCP : The protocol is layered over TCP packets.
578
579       FI_PROTO_IWARP_RDM : Reliable-datagram protocol implemented over  iWarp
580       reliable-connected queue pairs.
581
582       FI_PROTO_IB_RDM  :  Reliable-datagram protocol implemented over Infini‐
583       Band reliable-connected queue pairs.
584
585       FI_PROTO_GNI : Protocol runs over Cray GNI low-level interface.
586
587       FI_PROTO_RXM : Reliable-datagram protocol implemented over message end‐
588       points.   RXM  is  a libfabric utility component that adds RDM endpoint
589       semantics over MSG endpoint semantics.
590
591       FI_PROTO_RXD : Reliable-datagram  protocol  implemented  over  datagram
592       endpoints.  RXD is a libfabric utility component that adds RDM endpoint
593       semantics over DGRAM endpoint semantics.
594
       FI_PROTO_NETWORKDIRECT : The protocol runs over the Microsoft
       NetworkDirect service provider interface.  This adds reliable-datagram
       semantics over the NetworkDirect connection-oriented endpoint
       semantics.
598
599       FI_PROTO_PSMX2 : The protocol is based on an Intel proprietary protocol
600       known  as  PSM2,  performance  scaled messaging version 2.  PSMX2 is an
601       extended version of the PSM2 protocol to support the  libfabric  inter‐
602       faces.
603
604   protocol_version - Protocol Version
605       Identifies  which  version of the protocol is employed by the provider.
       The protocol version allows providers to extend an existing protocol
       in a backward-compatible manner, for example by adding support for
       additional features or functionality.  Providers that support
       different versions of the same protocol should interoperate, but only
       when using the capabilities defined for the lesser version.
611
612   max_msg_size - Max Message Size
613       Defines the maximum size for an application data transfer as  a  single
614       operation.
615
616   msg_prefix_size - Message Prefix Size
617       Specifies  the  size of any required message prefix buffer space.  This
       field will be 0 unless the FI_MSG_PREFIX mode is enabled.  If
       msg_prefix_size is > 0, the specified value will be a multiple of 8
       bytes.
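       When FI_MSG_PREFIX is in effect, the application reserves
       msg_prefix_size bytes at the front of each message buffer and places
       its data after them.  The sketch below computes that layout in plain
       C.  It assumes, as described for FI_MSG_PREFIX in fi_getinfo(3), that
       the pointer passed to send and receive operations is the start of the
       prefix region; struct prefixed_buf, prefixed_buf_alloc(), and
       prefixed_buf_free() are hypothetical helpers, not libfabric API, and
       in real code prefix_size would be read from
       info->ep_attr->msg_prefix_size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type (not part of libfabric): one contiguous buffer
 * with room for a provider-owned message prefix followed by the payload. */
struct prefixed_buf {
    unsigned char *base;    /* start of the prefix region; handed to the provider */
    unsigned char *payload; /* start of the application data */
    size_t total_len;       /* prefix_size + payload_len */
};

/* Allocate space for payload_len bytes of data preceded by prefix_size
 * bytes (msg_prefix_size, or 0 when FI_MSG_PREFIX is not in effect).
 * Returns 0 on success, -1 on overflow or allocation failure. */
static int prefixed_buf_alloc(struct prefixed_buf *buf, size_t prefix_size,
                              size_t payload_len)
{
    if (payload_len > SIZE_MAX - prefix_size)
        return -1;
    buf->total_len = prefix_size + payload_len;
    /* malloc(0) may return NULL, so always request at least one byte. */
    buf->base = malloc(buf->total_len ? buf->total_len : 1);
    if (buf->base == NULL)
        return -1;
    buf->payload = buf->base + prefix_size;
    return 0;
}

static void prefixed_buf_free(struct prefixed_buf *buf)
{
    free(buf->base);
    buf->base = NULL;
    buf->payload = NULL;
    buf->total_len = 0;
}
```

       With a prefix size of 0 the payload pointer equals the base pointer,
       so the same code path works whether or not the mode is enabled.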
620
621   Max RMA Ordered Size
622       The maximum ordered size specifies the delivery order of transport data
623       into target memory for RMA and atomic  operations.   Data  ordering  is
624       separate,  but  dependent  on  message  ordering (defined below).  Data
625       ordering is unspecified where message order is not defined.
626
627       Data ordering refers to the access of target memory by subsequent oper‐
628       ations.  When back to back RMA read or write operations access the same
629       registered memory location, data ordering indicates whether the  second
630       operation  reads  or writes the target memory after the first operation
631       has completed.  Because RMA ordering applies  between  two  operations,
632       and  not  within  a  single  data  transfer,  ordering  is  defined per
633       byte-addressable memory location.   I.e.   ordering  specifies  whether
634       location  X  is accessed by the second operation after the first opera‐
635       tion.  Nothing is implied about the completion of the  first  operation
636       before the second operation is initiated.
637
638       In  order  to  support  large data transfers being broken into multiple
639       packets and sent using multiple paths through the fabric, data ordering
640       may  be  limited  to  transfers  of a specific size or less.  Providers
641       specify when data ordering is maintained through the following  values.
642       Note that even if data ordering is not maintained, message ordering may
643       be.
644
645       max_order_raw_size : Read after write size.  If set, an RMA  or  atomic
646       read  operation  issued after an RMA or atomic write operation, both of
647       which are smaller than the size, will be  ordered.   Where  the  target
648       memory locations overlap, the RMA or atomic read operation will see the
649       results of the previous RMA or atomic write.
650
651       max_order_war_size : Write after read size.  If set, an RMA  or  atomic
652       write  operation  issued after an RMA or atomic read operation, both of
653       which are smaller than the size, will be ordered.  The  RMA  or  atomic
654       read operation will see the initial value of the target memory location
655       before a subsequent RMA or atomic write updates the value.
656
657       max_order_waw_size : Write after write size.  If set, an RMA or  atomic
658       write  operation issued after an RMA or atomic write operation, both of
659       which are smaller than the size, will be ordered.   The  target  memory
660       location will reflect the results of the second RMA or atomic write.
661
662       An  order size value of 0 indicates that ordering is not guaranteed.  A
663       value of -1 guarantees ordering for any data size.
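       The three limits above, together with the special values 0 and -1,
       can be folded into a single check.  In the minimal sketch below,
       rma_pair_is_ordered() is a hypothetical helper; it treats -1 as
       SIZE_MAX (the value -1 takes when stored in a size_t field such as
       max_order_raw_size) and follows this page's wording that both
       transfers must be smaller than the limit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decide whether two back-to-back RMA or atomic
 * operations touching the same target memory are data-ordered, given one
 * of max_order_raw_size, max_order_war_size, or max_order_waw_size. */
static bool rma_pair_is_ordered(size_t max_order_size, size_t first_len,
                                size_t second_len)
{
    if (max_order_size == 0)        /* 0: ordering is not guaranteed */
        return false;
    if (max_order_size == SIZE_MAX) /* -1 stored in a size_t: any size */
        return true;
    /* Otherwise both transfers must be smaller than the limit. */
    return first_len < max_order_size && second_len < max_order_size;
}
```

       For example, with max_order_raw_size set to 64, an 8-byte read issued
       after an 8-byte write to the same location is ordered, while the same
       read issued after a 128-byte write carries no guarantee.  Message
       ordering (msg_order) still has to be established separately.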
664
665   mem_tag_format - Memory Tag Format
666       The memory tag format is a bit array  used  to  convey  the  number  of
667       tagged  bits  supported by a provider.  Additionally, it may be used to
668       divide the bit array into separate fields.  The mem_tag_format  option‐
669       ally  begins  with a series of bits set to 0, to signify bits which are
670       ignored by the provider.  Following the initial prefix of ignored bits,
671       the  array will consist of alternating groups of bits set to all 1's or
672       all 0's.  Each group of bits corresponds to a tagged field.  The impli‐
673       cation of defining a tagged field is that when a mask is applied to the
674       tagged bit array, all bits belonging to a single field will  either  be
675       set to 1 or 0, collectively.
676
677       For example, a mem_tag_format of 0x30FF indicates support for 14 tagged
678       bits, separated into 3 fields.  The first field consists of 2-bits, the
679       second  field 4-bits, and the final field 8-bits.  Valid masks for such
680       a tagged field would be a bitwise OR'ing of zero or more of the follow‐
681       ing values: 0x3000, 0x0F00, and 0x00FF.
682
       By identifying fields within a tag, a provider may be able to optimize
       its search routines.  An application which requests tag fields must
685       provide  tag  masks  that  either  set all mask bits corresponding to a
686       field to all 0 or all 1.  When negotiating tag fields,  an  application
687       can  request  a  specific number of fields of a given size.  A provider
688       must return a tag format that supports the requested number of  fields,
689       with each field being at least the size requested, or fail the request.
690       A provider may increase the size of the fields.  When reporting comple‐
691       tions  (see  FI_CQ_FORMAT_TAGGED),  the provider must provide the exact
692       value of the received tag, clearing out any unsupported tag bits.
693
694       It is recommended that field sizes be ordered from smallest to largest.
695       A  generic,  unstructured  tag and mask can be achieved by requesting a
696       bit array consisting of alternating 1's and 0's.
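       The field layout can be recovered mechanically from mem_tag_format:
       skip the leading zero bits, then treat each run of identical bits as
       one field.  The sketch below does that and applies the mask rule
       stated above.  tag_format_fields() and tag_mask_is_valid() are
       hypothetical helpers, not part of libfabric.  Applied to the 0x30FF
       example, it yields the field masks 0x3000, 0x0F00, and 0x00FF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TAG_FIELDS 64

/* Split a mem_tag_format value into per-field masks, most significant
 * field first. Leading zero bits are ignored; each run of identical bits
 * below the highest set bit is one field. Returns the number of fields. */
static size_t tag_format_fields(uint64_t format, uint64_t fields[MAX_TAG_FIELDS])
{
    size_t n = 0;
    int bit = 63;

    /* Skip the prefix of ignored (zero) bits. */
    while (bit >= 0 && !(format & (UINT64_C(1) << bit)))
        bit--;

    while (bit >= 0) {
        uint64_t value = (format >> bit) & 1;
        uint64_t mask = 0;

        /* Extend the current field while the bit value stays the same. */
        while (bit >= 0 && ((format >> bit) & 1) == value) {
            mask |= UINT64_C(1) << bit;
            bit--;
        }
        fields[n++] = mask;
    }
    return n;
}

/* A mask is valid if it covers each field entirely or not at all and has
 * no bits outside the tagged range. */
static bool tag_mask_is_valid(uint64_t format, uint64_t mask)
{
    uint64_t fields[MAX_TAG_FIELDS];
    uint64_t tagged = 0;
    size_t n = tag_format_fields(format, fields);

    for (size_t i = 0; i < n; i++) {
        uint64_t covered = mask & fields[i];

        if (covered != 0 && covered != fields[i])
            return false;
        tagged |= fields[i];
    }
    return (mask & ~tagged) == 0;
}
```

       A mask of 0x0F00 is therefore valid for 0x30FF, while 0x0100 is not,
       because it covers only part of the 4-bit middle field.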
697
698   tx_ctx_cnt - Transmit Context Count
699       Number of transmit contexts to associate with  the  endpoint.   If  not
700       specified (0), 1 context will be assigned if the endpoint supports out‐
701       bound transfers.  Transmit contexts  are  independent  transmit  queues
702       that  may be separately configured.  Each transmit context may be bound
703       to a separate CQ, and no ordering is defined between  contexts.   Addi‐
704       tionally,  no synchronization is needed when accessing contexts in par‐
705       allel.
706
707       If the count is set to the value FI_SHARED_CONTEXT, the  endpoint  will
708       be  configured  to  use  a shared transmit context, if supported by the
709       provider.  Providers that do not support shared transmit contexts  will
710       fail the request.
711
712       See  the  scalable endpoint and shared contexts sections for additional
713       details.
714
715   rx_ctx_cnt - Receive Context Count
       Number of receive contexts to associate with the endpoint.  If not
       specified (0), 1 context will be assigned if the endpoint supports inbound
718       transfers.  Receive contexts are independent processing queues that may
719       be separately configured.  Each receive context may be bound to a sepa‐
720       rate CQ, and no ordering is defined between contexts.  Additionally, no
721       synchronization is needed when accessing contexts in parallel.
722
723       If  the  count is set to the value FI_SHARED_CONTEXT, the endpoint will
724       be configured to use a shared receive  context,  if  supported  by  the
725       provider.   Providers  that do not support shared receive contexts will
726       fail the request.
727
728       See the scalable endpoint and shared contexts sections  for  additional
729       details.
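       The rules for tx_ctx_cnt and rx_ctx_cnt can be summarized as a small
       decision function.  In the sketch below, EXAMPLE_SHARED_CONTEXT is a
       placeholder for FI_SHARED_CONTEXT, and normalize_ctx_cnt() and
       enum ctx_mode are hypothetical; the sketch mirrors the behavior
       described above and is not provider code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder value; real code uses FI_SHARED_CONTEXT from
 * <rdma/fi_endpoint.h>. */
#define EXAMPLE_SHARED_CONTEXT SIZE_MAX

enum ctx_mode { CTX_NONE, CTX_DEDICATED, CTX_SHARED, CTX_REJECTED };

/* Decide how contexts are assigned in one direction.
 *   requested:        tx_ctx_cnt or rx_ctx_cnt from the endpoint attributes
 *   ep_uses_dir:      endpoint supports outbound (tx) or inbound (rx) transfers
 *   shared_supported: provider supports shared contexts of this type
 * *count receives the number of dedicated contexts (0 if shared or none). */
static enum ctx_mode normalize_ctx_cnt(size_t requested, bool ep_uses_dir,
                                       bool shared_supported, size_t *count)
{
    *count = 0;
    if (requested == EXAMPLE_SHARED_CONTEXT)
        return shared_supported ? CTX_SHARED : CTX_REJECTED;
    if (requested == 0) {
        /* Unspecified: one context if the endpoint uses this direction. */
        if (!ep_uses_dir)
            return CTX_NONE;
        *count = 1;
        return CTX_DEDICATED;
    }
    *count = requested;
    return CTX_DEDICATED;
}
```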
730
731   auth_key_size - Authorization Key Length
732       The  length of the authorization key in bytes.  This field will be 0 if
733       authorization keys are not available or used.  This  field  is  ignored
734       unless the fabric is opened with API version 1.5 or greater.
735
736   auth_key - Authorization Key
737       If  supported  by the fabric, an authorization key (a.k.a.  job key) to
738       associate with the endpoint.  An authorization key  is  used  to  limit
739       communication  between  endpoints.   Only  peer endpoints that are pro‐
740       grammed to use the same authorization key may communicate.   Authoriza‐
741       tion  keys  are  often  used to implement job keys, to ensure that pro‐
742       cesses running in different jobs do  not  accidentally  cross  traffic.
743       The domain authorization key will be used if auth_key_size is set to 0.
744       This field is ignored unless the fabric is opened with API version  1.5
745       or greater.
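       The key selection rule in the two entries above can be summarized as
       follows.  In this sketch, EXAMPLE_VERSION reproduces the encoding of
       libfabric's FI_VERSION() macro, and struct auth_key_ref,
       effective_auth_key(), and auth_keys_match() are hypothetical.
       Treating an endpoint opened with an API version below 1.5 as if
       auth_key_size were 0 is an assumption about how the ignored field
       behaves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same encoding as libfabric's FI_VERSION(major, minor). */
#define EXAMPLE_VERSION(major, minor) \
    (((uint32_t)(major) << 16) | (uint32_t)(minor))

/* Hypothetical view of an authorization key. */
struct auth_key_ref {
    const uint8_t *key;
    size_t size;
};

/* Key an endpoint ends up using: its own key when the fabric was opened
 * with API 1.5 or later and auth_key_size is non-zero, else the domain key. */
static struct auth_key_ref effective_auth_key(uint32_t api_version,
                                              struct auth_key_ref ep_key,
                                              struct auth_key_ref domain_key)
{
    if (api_version < EXAMPLE_VERSION(1, 5) || ep_key.size == 0)
        return domain_key;
    return ep_key;
}

/* Only peers programmed with the same key may communicate. */
static bool auth_keys_match(struct auth_key_ref a, struct auth_key_ref b)
{
    if (a.size != b.size)
        return false;
    return a.size == 0 || memcmp(a.key, b.key, a.size) == 0;
}
```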
746

TRANSMIT CONTEXT ATTRIBUTES

748       Attributes  specific  to  the  transmit capabilities of an endpoint are
749       specified using struct fi_tx_attr.
750
751              struct fi_tx_attr {
752                  uint64_t  caps;
753                  uint64_t  mode;
754                  uint64_t  op_flags;
755                  uint64_t  msg_order;
756                  uint64_t  comp_order;
757                  size_t    inject_size;
758                  size_t    size;
759                  size_t    iov_limit;
760                  size_t    rma_iov_limit;
761              };
762
763   caps - Capabilities
764       The requested capabilities of the context.  The capabilities must be  a
765       subset of those requested of the associated endpoint.  See the CAPABIL‐
766       ITIES section of fi_getinfo(3) for capability  details.   If  the  caps
767       field  is  0 on input to fi_getinfo(3), the caps value from the fi_info
768       structure will be used.
769
770   mode
771       The operational mode bits of the context.  The mode bits will be a sub‐
772       set  of  those  associated  with the endpoint.  See the MODE section of
773       fi_getinfo(3) for details.  A mode value of 0 will be ignored on  input
774       to  fi_getinfo(3),  with  the  mode value of the fi_info structure used
775       instead.  On return from fi_getinfo(3), the mode will be  set  only  to
776       those constraints specific to transmit operations.
777
778   op_flags - Default transmit operation flags
       Flags that control the behavior of operations submitted against the
       context.  Applicable flags are listed in the Operation Flags section.
781
782   msg_order - Message Ordering
783       Message ordering refers to the order in which transport  layer  headers
784       (as  viewed  by  the application) are processed.  Relaxed message order
785       enables data transfers to be sent and received out of order, which  may
786       improve performance by utilizing multiple paths through the fabric from
787       the initiating endpoint to a target endpoint.   Message  order  applies
788       only  between  a single source and destination endpoint pair.  Ordering
789       between different target endpoints is not defined.
790
791       Message order is determined using a set of ordering bits.  Each set bit
792       indicates  that  ordering  is  maintained between data transfers of the
793       specified type.  Message order is defined for [read  |  write  |  send]
794       operations  submitted  by  an  application  after [read | write | send]
795       operations.
796
       Message ordering only applies to the end-to-end transmission of
       transport headers.  Message ordering is necessary for, but does not
       guarantee, the order in which message data is sent or received by the
       transport layer.  Message ordering requires matching ordering
       semantics on the receiving side of a data transfer operation in order
       to guarantee that ordering is met.
803
804       FI_ORDER_NONE  :  No  ordering is specified.  This value may be used as
805       input in order to obtain the default message  order  supported  by  the
806       provider.  FI_ORDER_NONE is an alias for the value 0.
807
808       FI_ORDER_RAR : Read after read.  If set, RMA and atomic read operations
809       are transmitted in the order submitted relative to other RMA and atomic
810       read  operations.   If not set, RMA and atomic reads may be transmitted
811       out of order from their submission.
812
813       FI_ORDER_RAW : Read after write.  If set, RMA and  atomic  read  opera‐
814       tions are transmitted in the order submitted relative to RMA and atomic
815       write operations.  If not set, RMA and atomic reads may be  transmitted
816       ahead of RMA and atomic writes.
817
818       FI_ORDER_RAS : Read after send.  If set, RMA and atomic read operations
819       are transmitted in the order submitted relative to message send  opera‐
820       tions, including tagged sends.  If not set, RMA and atomic reads may be
821       transmitted ahead of sends.
822
823       FI_ORDER_WAR : Write after read.  If set, RMA and atomic  write  opera‐
824       tions are transmitted in the order submitted relative to RMA and atomic
825       read operations.  If not set, RMA and atomic writes may be  transmitted
826       ahead of RMA and atomic reads.
827
828       FI_ORDER_WAW  : Write after write.  If set, RMA and atomic write opera‐
829       tions are transmitted in the order submitted relative to other RMA  and
830       atomic  write  operations.   If  not  set, RMA and atomic writes may be
831       transmitted out of order from their submission.
832
833       FI_ORDER_WAS : Write after send.  If set, RMA and atomic  write  opera‐
834       tions  are  transmitted in the order submitted relative to message send
835       operations, including tagged sends.  If not set, RMA and atomic  writes
836       may be transmitted ahead of sends.
837
838       FI_ORDER_SAR  :  Send  after  read.   If  set, message send operations,
       including tagged sends, are transmitted in the order submitted relative to
840       RMA  and  atomic  read  operations.   If  not set, message sends may be
841       transmitted ahead of RMA and atomic reads.
842
843       FI_ORDER_SAW : Send after write.   If  set,  message  send  operations,
       including tagged sends, are transmitted in the order submitted relative to
845       RMA and atomic write operations.  If not  set,  message  sends  may  be
846       transmitted ahead of RMA and atomic writes.
847
848       FI_ORDER_SAS  :  Send  after  send.   If  set, message send operations,
849       including tagged sends, are transmitted in the order submitted relative
       to other message sends.  If not set, message sends may be transmitted
851       out of order from their submission.
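       Each ordering flag names a later operation class and an earlier one,
       so the nine flags form a grid of read, write, and send against read,
       write, and send.  The sketch below uses that structure to answer
       whether one operation class stays behind another.  The EX_ORDER() bit
       encoding, enum op_class, is_ordered(), and end_to_end_order() are
       hypothetical and do not reproduce the numeric values of the
       FI_ORDER_* macros.  Modeling the end-to-end guarantee as the
       intersection of the transmit and receive msg_order values is an
       assumption based on the requirement for matching ordering semantics
       on the receiving side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Operation classes named by the ordering flags. */
enum op_class { OP_READ = 0, OP_WRITE = 1, OP_SEND = 2 };

/* Hypothetical encoding: bit (later * 3 + earlier) means "later after
 * earlier", so the nine bits run RAR, RAW, RAS, WAR, WAW, WAS, SAR, SAW,
 * SAS. These are NOT the values of the FI_ORDER_* macros. */
#define EX_ORDER(later, earlier) (UINT64_C(1) << ((later) * 3 + (earlier)))
#define EX_ORDER_NONE UINT64_C(0)

/* Does an operation of class `later` stay behind earlier `earlier` ops? */
static bool is_ordered(uint64_t msg_order, enum op_class later,
                       enum op_class earlier)
{
    return (msg_order & EX_ORDER(later, earlier)) != 0;
}

/* Assumed model: ordering holds end to end only if both sides provide it. */
static uint64_t end_to_end_order(uint64_t tx_msg_order, uint64_t rx_msg_order)
{
    return tx_msg_order & rx_msg_order;
}
```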
852
853   comp_order - Completion Ordering
854       Completion ordering refers to the order in which completed requests are
855       written  into  the completion queue.  Completion ordering is similar to
856       message order.  Relaxed completion order may enable faster reporting of
857       completed  transfers,  allow  acknowledgments to be sent over different
858       fabric paths, and support more sophisticated  retry  mechanisms.   This
859       can result in lower-latency completions, particularly when using uncon‐
860       nected  endpoints.   Strict  completion  ordering  may   require   that
861       providers queue completed operations or limit available optimizations.
862
863       For transmit requests, completion ordering depends on the endpoint com‐
864       munication type.  For  unreliable  communication,  completion  ordering
865       applies  to  all  data transfer requests submitted to an endpoint.  For
866       reliable communication, completion ordering only  applies  to  requests
867       that  target  a  single  destination  endpoint.  Completion ordering of
868       requests that target different endpoints over a reliable  transport  is
869       not defined.
870
871       Applications  should  specify the completion ordering that they support
872       or require.  Providers should return the  completion  order  that  they
       actually provide, with the constraint that the returned ordering is at
       least as strict as that specified by the application.  Supported completion
875       order values are:
876
877       FI_ORDER_NONE  :  No  ordering  is  defined  for  completed operations.
878       Requests submitted to the transmit context may complete in any order.
879
880       FI_ORDER_STRICT : Requests complete in the order in which they are sub‐
881       mitted to the transmit context.
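       Strict completion ordering may require a provider to hold back
       completions for operations that finish early.  The sketch below models
       that queueing with a small reorder buffer keyed by submission sequence
       number.  struct reorder_q, RQ_DEPTH, rq_complete(), and rq_drain() are
       hypothetical; they illustrate FI_ORDER_STRICT semantics and do not
       describe how any particular provider implements them.  Under
       FI_ORDER_NONE the buffer is unnecessary, since each completion could
       be reported as soon as it is seen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RQ_DEPTH 8 /* hypothetical window of in-flight operations */

/* Operations are numbered in submission order; completions may arrive in
 * any order but are reported in order. */
struct reorder_q {
    bool done[RQ_DEPTH]; /* done[seq % RQ_DEPTH]: completion seen */
    uint64_t next;       /* next sequence number to report */
};

/* Record that operation seq finished. The caller must keep
 * seq - next < RQ_DEPTH so that slots are not reused too early. */
static void rq_complete(struct reorder_q *q, uint64_t seq)
{
    q->done[seq % RQ_DEPTH] = true;
}

/* Report finished operations in submission order, stopping at the first
 * one still outstanding. Stores up to max sequence numbers in out and
 * returns how many were stored. */
static size_t rq_drain(struct reorder_q *q, uint64_t *out, size_t max)
{
    size_t n = 0;

    while (n < max && q->done[q->next % RQ_DEPTH]) {
        q->done[q->next % RQ_DEPTH] = false;
        out[n++] = q->next++;
    }
    return n;
}
```

       If operation 2 finishes first, nothing is reported until operations 0
       and 1 have also finished; the three completions are then written in
       submission order.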
882
883   inject_size
884       The  requested  inject operation size (see the FI_INJECT flag) that the
885       context will support.  This is the maximum size data transfer that  can
886       be  associated  with  an inject operation (such as fi_inject) or may be
887       used with the FI_INJECT data transfer flag.
888
889   size
890       The size of the context.  The size is specified as the  minimum  number
891       of  transmit  operations that may be posted to the endpoint without the
892       operation returning -FI_EAGAIN.
893
894   iov_limit
895       This is the maximum number of IO vectors (scatter-gather elements) that
896       a single posted operation may reference.
897
898   rma_iov_limit
899       This  is the maximum number of RMA IO vectors (scatter-gather elements)
900       that an RMA or atomic operation may reference.  The rma_iov_limit  cor‐
901       responds to the rma_iov_count values in RMA and atomic operations.  See
       struct fi_msg_rma and struct fi_msg_atomic in fi_rma(3) and fi_atomic(3)
       for additional details.  This limit applies to both the number of RMA
904       IO vectors that may be specified when initiating an operation from  the
905       local endpoint, as well as the maximum number of IO vectors that may be
906       carried in a single request from a remote endpoint.
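       The limits above can be checked on the application side before
       posting.  In the sketch below, struct tx_limits, struct tx_budget, and
       the tx_* functions are hypothetical.  The credit counter assumes that
       a transmit slot becomes available again once the application has
       reaped the matching completion; operations that generate no
       completion (such as fi_inject) are outside this simple model.  Because
       size is only a lower bound, a provider may accept more operations than
       the budget allows, so the counter is conservative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of struct fi_tx_attr, copied from the attributes
 * returned by fi_getinfo(3). */
struct tx_limits {
    size_t inject_size;
    size_t size;
    size_t iov_limit;
};

/* Hypothetical application-side count of transmit queue slots. */
struct tx_budget {
    size_t credits;
};

static void tx_budget_init(struct tx_budget *b, const struct tx_limits *lim)
{
    b->credits = lim->size; /* size ops may be posted without -FI_EAGAIN */
}

/* True if a transfer of len bytes in iov_count elements fits both the
 * inject limit and the IO vector limit. */
static bool tx_can_inject(const struct tx_limits *lim, size_t len,
                          size_t iov_count)
{
    return len <= lim->inject_size && iov_count <= lim->iov_limit;
}

/* Take one slot before posting. false means posting might return
 * -FI_EAGAIN, so the caller should reap completions first. */
static bool tx_reserve(struct tx_budget *b)
{
    if (b->credits == 0)
        return false;
    b->credits--;
    return true;
}

/* Return one slot; this sketch assumes one call per completion reaped. */
static void tx_release(struct tx_budget *b)
{
    b->credits++;
}
```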
907

RECEIVE CONTEXT ATTRIBUTES

909       Attributes specific to the receive  capabilities  of  an  endpoint  are
910       specified using struct fi_rx_attr.
911
912              struct fi_rx_attr {
913                  uint64_t  caps;
914                  uint64_t  mode;
915                  uint64_t  op_flags;
916                  uint64_t  msg_order;
917                  uint64_t  comp_order;
918                  size_t    total_buffered_recv;
919                  size_t    size;
920                  size_t    iov_limit;
921              };
922
923   caps - Capabilities
924       The  requested capabilities of the context.  The capabilities must be a
       subset of those requested of the associated endpoint.  See the
       CAPABILITIES section of fi_getinfo(3) for capability details.  If the caps
927       field is 0 on input to fi_getinfo(3), the caps value from  the  fi_info
928       structure will be used.
929
930   mode
931       The operational mode bits of the context.  The mode bits will be a sub‐
932       set of those associated with the endpoint.  See  the  MODE  section  of
933       fi_getinfo(3)  for details.  A mode value of 0 will be ignored on input
934       to fi_getinfo(3), with the mode value of  the  fi_info  structure  used
935       instead.   On  return  from fi_getinfo(3), the mode will be set only to
936       those constraints specific to receive operations.
937
938   op_flags - Default receive operation flags
       Flags that control the behavior of operations submitted against the
       context.  Applicable flags are listed in the Operation Flags section.
941
942   msg_order - Message Ordering
943       For  a  description of message ordering, see the msg_order field in the
944       Transmit Context Attribute section.  Receive context  message  ordering
945       defines  the order in which received transport message headers are pro‐
946       cessed when received by an endpoint.
947
948       The following ordering flags, as defined for  transmit  ordering,  also
949       apply   to   the  processing  of  received  operations:  FI_ORDER_NONE,
950       FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS,  FI_ORDER_WAR,  FI_ORDER_WAW,
951       FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, and FI_ORDER_SAS.
952
953   comp_order - Completion Ordering
954       For  a  description of completion ordering, see the comp_order field in
955       the Transmit Context Attribute section.
956
957       FI_ORDER_NONE :  No  ordering  is  defined  for  completed  operations.
958       Receive  operations may complete in any order, regardless of their sub‐
959       mission order.
960
961       FI_ORDER_STRICT : Receive operations complete in  the  order  in  which
962       they  are  processed  by the receive context, based on the receive side
963       msg_order attribute.
964
965       FI_ORDER_DATA : When set, this bit  indicates  that  received  data  is
966       written into memory in order.  Data ordering applies to memory accessed
967       as part of a single operation and between operations if message  order‐
968       ing is guaranteed.
969
970   total_buffered_recv
971       This  field is supported for backwards compatibility purposes.  It is a
972       hint to the provider of the total available space that may be needed to
973       buffer  messages  that  are  received  for  which  there is no matching
974       receive operation.  The provider may adjust or ignore this value.   The
       allocation of internal network buffering among received messages is
976       provider specific.  For instance, a provider may limit the size of mes‐
977       sages  which  can be buffered or the amount of buffering allocated to a
978       single message.
979
980       If receive side buffering is disabled (total_buffered_recv = 0)  and  a
981       message  is  received by an endpoint, then the behavior is dependent on
       whether resource management has been enabled (FI_RM_ENABLED has been set
       or not).  See the Resource Management section of fi_domain(3) for
       further clarification.  It is recommended that applications enable
985       resource  management  if they anticipate receiving unexpected messages,
986       rather than modifying this value.
987
988   size
989       The size of the context.  The size is specified as the  minimum  number
990       of  receive  operations  that may be posted to the endpoint without the
991       operation returning -FI_EAGAIN.
992
993   iov_limit
994       This is the maximum number of IO vectors (scatter-gather elements) that
       a single posted operation may reference.
996

SCALABLE ENDPOINTS

998       A  scalable  endpoint  is a communication portal that supports multiple
999       transmit and receive contexts.  Scalable endpoints are loosely  modeled
1000       after  the  networking  concept  of transmit/receive side scaling, also
1001       known as multi-queue.  Support for scalable endpoints  is  domain  spe‐
1002       cific.    Scalable   endpoints   may   improve   the   performance   of
1003       multi-threaded and parallel applications, by allowing threads to access
1004       independent  transmit  and  receive  queues.  A scalable endpoint has a
1005       single transport level address, which can reduce  the  memory  require‐
1006       ments  needed  to  store  remote addressing data, versus using standard
1007       endpoints.  Scalable endpoints cannot be used directly  for  communica‐
1008       tion  operations,  and  require  the  application  to explicitly create
1009       transmit and receive contexts as described below.
1010
1011   fi_tx_context
       Transmit contexts are independent transmit queues.  Ordering and
       synchronization between contexts are not defined.  Conceptually, a
       transmit context behaves similarly to a send-only endpoint.  A transmit context
1015       may  be  configured  with fewer capabilities than the base endpoint and
1016       with different attributes (such as  ordering  requirements  and  inject
1017       size)  than  other contexts associated with the same scalable endpoint.
1018       Each transmit context has its own  completion  queue.   The  number  of
1019       transmit  contexts associated with an endpoint is specified during end‐
1020       point creation.
1021
1022       The fi_tx_context call is used to retrieve a specific context,  identi‐
1023       fied   by   an  index  (see  above  for  details  on  transmit  context
1024       attributes).   Providers  may  dynamically   allocate   contexts   when
1025       fi_tx_context  is  called,  or  may statically create all contexts when
1026       fi_endpoint is invoked.  By default, a transmit  context  inherits  the
1027       properties  of  its  associated  endpoint.   However,  applications may
1028       request context specific attributes through the attr  parameter.   Sup‐
1029       port  for  per transmit context attributes is provider specific and not
1030       guaranteed.  Providers will return the actual  attributes  assigned  to
1031       the context through the attr parameter, if provided.
1032
1033   fi_rx_context
       Receive contexts are independent receive queues for receiving incoming
       data.  Ordering and synchronization between contexts are not
       guaranteed.  Conceptually, a receive context behaves similarly to a
       receive-only endpoint.  A receive context may be configured with fewer
       capabilities than the base endpoint and with different attributes
       (such as ordering requirements) than other contexts associated with
       the same scalable endpoint.  Each receive context has its own
       completion queue.  The number of receive contexts associated with an
       endpoint is specified during endpoint creation.
1043
1044       Receive contexts are often associated with steering flows, that specify
1045       which incoming packets targeting a scalable endpoint to process.   How‐
1046       ever,  receive  contexts  may be targeted directly by the initiator, if
1047       supported by the underlying protocol.  Such contexts are referred to as
       'named'.  Support for named contexts must be indicated by setting the
       FI_NAMED_RX_CTX capability when the corresponding endpoint is created.
       Support for named receive contexts is coordinated with address
1051       vectors.  See fi_av(3) and fi_rx_addr(3).
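       When named receive contexts are used, the address given to a data
       transfer carries the target context index alongside the peer's
       fi_addr_t.  The sketch below mirrors, to our understanding, the
       encoding used by the fi_rx_addr() inline, which places the index in
       the top rx_ctx_bits bits of the address (rx_ctx_bits is taken from the
       address vector attributes).  example_rx_addr() and example_addr_t are
       hypothetical stand-ins; real code should call fi_rx_addr() directly.

```c
#include <stdint.h>

/* Stand-in for fi_addr_t, which is a 64-bit address handle. */
typedef uint64_t example_addr_t;

/* Hypothetical helper: combine a peer address with a receive context
 * index stored in the top rx_ctx_bits bits of the result. */
static example_addr_t example_rx_addr(example_addr_t addr, int rx_index,
                                      int rx_ctx_bits)
{
    if (rx_ctx_bits <= 0) /* no index bits: also avoids a 64-bit shift */
        return addr;
    return ((uint64_t)rx_index << (64 - rx_ctx_bits)) | addr;
}
```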
1052
1053       The fi_rx_context call is used to retrieve a specific context,  identi‐
1054       fied by an index (see above for details on receive context attributes).
1055       Providers may  dynamically  allocate  contexts  when  fi_rx_context  is
1056       called,  or  may  statically  create  all  contexts when fi_endpoint is
1057       invoked.  By default, a receive context inherits the properties of  its
1058       associated  endpoint.   However,  applications may request context spe‐
1059       cific attributes through the attr parameter.  Support for  per  receive
1060       context  attributes is provider specific and not guaranteed.  Providers
1061       will return the actual attributes assigned to the context  through  the
1062       attr parameter, if provided.
1063

SHARED CONTEXTS

1065       Shared  contexts  are  transmit  and receive contexts explicitly shared
1066       among one or more endpoints.  A shareable context allows an application
1067       to  use  a  single dedicated provider resource among multiple transport
1068       addressable endpoints.  This can greatly reduce the resources needed to
1069       manage  communication  over multiple endpoints by multiplexing transmit
1070       and/or receive processing,  with  the  potential  cost  of  serializing
1071       access  across  multiple  endpoints.  Support for shareable contexts is
1072       domain specific.
1073
1074       Conceptually, shareable transmit contexts are transmit queues that  may
1075       be accessed by many endpoints.  The use of a shared transmit context is
1076       mostly opaque to an application.  Applications must allocate  and  bind
1077       shared  transmit  contexts  to  endpoints,  but  operations  are posted
1078       directly to the endpoint.  Shared transmit contexts are not  associated
1079       with completion queues or counters.  Completed operations are posted to
1080       the CQs bound to the endpoint.  An endpoint may only be associated with
1081       a single shared transmit context.
1082
1083       Unlike  shared  transmit  contexts, applications interact directly with
1084       shared receive contexts.  Users post  receive  buffers  directly  to  a
1085       shared  receive  context, with the buffers usable by any endpoint bound
1086       to the shared receive context.  Shared receive contexts are not associ‐
1087       ated  with completion queues or counters.  Completed receive operations
       are posted to the CQs bound to the endpoint.  An endpoint may only be
       associated with a single shared receive context, and all connectionless
       endpoints associated with a shared receive context must also share the
1091       same address vector.
1092
1093       Endpoints  associated  with a shared transmit context may use dedicated
1094       receive contexts, and vice-versa.  Or an endpoint may use shared trans‐
1095       mit  and  receive  contexts.  And there is no requirement that the same
1096       group of endpoints sharing a context of one type also share the context
1097       of  an  alternate type.  Furthermore, an endpoint may use a shared con‐
1098       text of one type, but a scalable set of contexts of the alternate type.
1099
1100   fi_stx_context
1101       This call is used to open a shareable transmit context (see  above  for
1102       details on the transmit context attributes).  Endpoints associated with
1103       a shared transmit context must use a subset of the  transmit  context's
1104       attributes.   Note  that  this  is  the  reverse of the requirement for
1105       transmit contexts for scalable endpoints.
1106
1107   fi_srx_context
1108       This allocates a shareable receive context (see above  for  details  on
1109       the  receive  context  attributes).  Endpoints associated with a shared
1110       receive context must use a subset of the receive context's  attributes.
1111       Note  that  this is the reverse of the requirement for receive contexts
1112       for scalable endpoints.
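       The subset rules for scalable and shared contexts run in opposite
       directions.  The sketch below states both for capability bits only
       (the text above applies the rule to all attributes); bits_subset(),
       scalable_ctx_caps_ok(), and shared_ctx_caps_ok() are hypothetical
       helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if every bit set in inner is also set in outer. */
static bool bits_subset(uint64_t inner, uint64_t outer)
{
    return (inner & ~outer) == 0;
}

/* Scalable endpoint: a tx/rx context may only narrow the endpoint's caps. */
static bool scalable_ctx_caps_ok(uint64_t ep_caps, uint64_t ctx_caps)
{
    return bits_subset(ctx_caps, ep_caps);
}

/* Shared context: each endpoint bound to it must stay within the
 * context's caps. */
static bool shared_ctx_caps_ok(uint64_t ep_caps, uint64_t ctx_caps)
{
    return bits_subset(ep_caps, ctx_caps);
}
```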
1113

SOCKET ENDPOINTS

1115       The following feature and description should be  considered  experimen‐
1116       tal.  Until the experimental tag is removed, the interfaces, semantics,
1117       and data structures associated with socket endpoints may change between
1118       library versions.
1119
       This section applies to endpoints of type FI_EP_SOCK_STREAM and
       FI_EP_SOCK_DGRAM, commonly referred to as socket endpoints.

       Socket endpoints are defined with semantics that make them easier to
       adopt for developers familiar with the UNIX socket API, or for
       middleware that exposes the socket API, while still taking advantage
       of high-performance hardware features.

       The key difference between socket endpoints and other active endpoints
       is that socket endpoints use synchronous data transfers.  Buffers
       passed into send and receive operations revert to the control of the
       application when the function call returns.  As a result, no data
       transfer completions are reported to the application, and socket
       endpoints are not associated with completion queues or counters.

       Socket endpoints support a subset of message operations: fi_send,
       fi_sendv, fi_sendmsg, fi_recv, fi_recvv, fi_recvmsg, and fi_inject.
       Because data transfers are synchronous, the return value of a send or
       receive operation indicates the number of bytes transferred on
       success, or a negative value on error.  The possible errors include
       -FI_EAGAIN, returned when the endpoint cannot send any data because its
       queues are full, or cannot receive any data because its queues are
       empty.

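       Because a send on a socket endpoint returns a byte count, a caller that
       must transfer a whole buffer has to loop over partial results, much as
       with write(2) on a non-blocking socket.  The sketch below assumes that
       -FI_EAGAIN compares equal to -EAGAIN.  The names sock_send_fn,
       send_all, mock_ep, and mock_send are hypothetical; the function pointer
       stands in for a call such as fi_send on a socket endpoint, and mock_send
       is a test double, not a provider.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for a synchronous socket-endpoint send: returns the
 * number of bytes sent (> 0) or a negative error such as -EAGAIN. */
typedef ssize_t (*sock_send_fn)(void *ep, const void *buf, size_t len);

/* Send all len bytes.  Returns len on success, the first hard error, or
 * -EAGAIN after max_retries consecutive calls that made no progress. */
ssize_t send_all(sock_send_fn send_fn, void *ep, const void *buf,
                 size_t len, int max_retries)
{
    const char *p = buf;
    size_t done = 0;
    int retries = 0;

    while (done < len) {
        ssize_t n = send_fn(ep, p + done, len - done);
        if (n == -EAGAIN || n == 0) {
            /* Queues full: a real caller would poll the endpoint's fd here. */
            if (++retries > max_retries)
                return -EAGAIN;
            continue;
        }
        if (n < 0)
            return n;              /* hard error */
        retries = 0;
        done += (size_t)n;         /* partial transfer: advance and go again */
    }
    return (ssize_t)done;
}

/* Test double: accepts at most 4 bytes per call and reports -EAGAIN on
 * every second call, to exercise the partial and retry paths. */
struct mock_ep {
    int calls;
    size_t used;
    char out[64];
};

ssize_t mock_send(void *ep, const void *buf, size_t len)
{
    struct mock_ep *m = ep;
    size_t n = len < 4 ? len : 4;

    if (++m->calls % 2 == 0)
        return -EAGAIN;
    if (m->used + n > sizeof m->out)
        return -ENOSPC;
    memcpy(m->out + m->used, buf, n);
    m->used += n;
    return (ssize_t)n;
}
```

       Spinning on -EAGAIN is only acceptable in a sketch; the file descriptor
       described below is the intended way to wait for queue space.
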
       Socket endpoints are associated with event queues and address vectors,
       and process connection management events asynchronously, similar to
       other endpoints.  Unlike UNIX sockets, socket endpoints must still be
       declared as either active or passive.

       Socket endpoints behave like non-blocking sockets.  To support select
       and poll semantics, active socket endpoints are associated with a file
       descriptor that is signaled whenever the endpoint is ready to send
       and/or receive data.  The file descriptor may be retrieved using
       fi_control.

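       Once the descriptor has been retrieved, readiness can be awaited with
       standard poll(2).  The sketch below is plain POSIX.  It does not show
       the fi_control command that returns the descriptor; instead a pipe
       stands in for the endpoint's descriptor, and wait_ready and pipe_demo
       are hypothetical names.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait until fd reports any of the requested events (POLLIN to receive,
 * POLLOUT to send).  Returns the revents mask, 0 on timeout, or -errno. */
int wait_ready(int fd, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);   /* retry if interrupted by a signal */

    if (rc < 0)
        return -errno;
    return rc == 0 ? 0 : pfd.revents;
}

/* Demo with a pipe standing in for the endpoint's descriptor: returns 1 if
 * the read end is idle before a write and readable after it. */
int pipe_demo(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int before = wait_ready(fds[0], POLLIN, 0);   /* nothing written yet */
    ssize_t w = write(fds[1], "x", 1);
    int after = wait_ready(fds[0], POLLIN, 0);

    close(fds[0]);
    close(fds[1]);
    return (before == 0 && w == 1 && (after & POLLIN)) ? 1 : 0;
}
```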

OPERATION FLAGS

       Operation flags are formed by OR-ing the following flags together.
       They define the default flags applied to an endpoint's data transfer
       operations when no flags parameter is available.  Data transfer
       operations that take flags as input override the op_flags value of
       the endpoint's transmit or receive context attributes.

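       The default-versus-override rule fits in a few lines.  This is a model
       rather than provider code, and effective_flags is a hypothetical
       helper.  Context op_flags apply to calls such as fi_send that take no
       flags argument, while calls such as fi_sendmsg supply their own flags,
       which, per the text above, are used in place of the defaults.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of which flags govern a data transfer.
 *   ctx_op_flags:  op_flags from the transmit/receive context attributes
 *   has_flags_arg: true for calls that take a flags parameter
 *   call_flags:    the flags passed to such a call */
uint64_t effective_flags(uint64_t ctx_op_flags, bool has_flags_arg,
                         uint64_t call_flags)
{
    /* Per-call flags override the defaults; they are not OR-ed with them. */
    return has_flags_arg ? call_flags : ctx_op_flags;
}
```
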
       FI_INJECT : Indicates that all outbound data buffers should be returned
       to the user's control immediately after a data transfer call returns,
       even if the operation is handled asynchronously.  This may require the
       provider to copy the data into a local buffer and transfer it out of
       that buffer.  A provider can limit the total amount of send data that
       may be buffered and/or the size of a single send that can use this
       flag.  This limit is indicated by inject_size (see inject_size above).

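       The provider-side behavior behind FI_INJECT can be modeled as a copy
       into a bounce buffer bounded by inject_size.  The bounce_buf struct and
       inject_stage function below are hypothetical, and the fixed 64-byte
       staging area is an arbitrary assumption of the sketch.  The copy is
       what lets the caller reuse its own buffer as soon as the call returns.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of FI_INJECT staging: the payload is copied into a
 * provider-owned buffer, and the transfer later reads from that copy. */
struct bounce_buf {
    size_t inject_size;        /* limit advertised by the provider */
    size_t len;                /* bytes currently staged */
    unsigned char data[64];    /* arbitrary capacity for this sketch */
};

/* Returns 0 if the payload was staged, -1 if it exceeds the limit. */
int inject_stage(struct bounce_buf *bb, const void *buf, size_t len)
{
    if (len > bb->inject_size || len > sizeof bb->data)
        return -1;                 /* too large for the inject path */
    memcpy(bb->data, buf, len);    /* caller's buffer is free after this */
    bb->len = len;
    return 0;
}
```
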
       FI_MULTI_RECV : Applies to posted receive operations.  This flag allows
       the user to post a single buffer that will receive multiple incoming
       messages.  Received messages are packed into the receive buffer until
       the buffer has been consumed.  Use of this flag may cause a single
       posted receive operation to generate multiple completions as messages
       are placed into the buffer.  The placement of received data into the
       buffer may be subject to provider-specific alignment restrictions.
       The provider releases the buffer when the available buffer space falls
       below the specified minimum (see FI_OPT_MIN_MULTI_RECV).

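       The packing and release rule can be simulated for a single buffer.  In
       the hypothetical model below, multi_recv and multi_recv_place are
       illustrative names, min_free plays the role of FI_OPT_MIN_MULTI_RECV,
       and a fixed power-of-two alignment is an assumed stand-in for whatever
       alignment a real provider imposes.

```c
#include <stddef.h>

/* Hypothetical model of one FI_MULTI_RECV buffer.  Messages are packed back
 * to back, each following message starting at the next multiple of align.
 * Once free space drops below min_free the buffer is released. */
struct multi_recv {
    size_t size;      /* total buffer size in bytes */
    size_t offset;    /* next free byte */
    size_t align;     /* alignment, assumed to be a power of two */
    size_t min_free;  /* release threshold */
    int released;     /* nonzero once the buffer is returned to the app */
};

size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Place one incoming message of len bytes.  Returns its offset within the
 * buffer (each placement would produce its own completion), or -1 if the
 * buffer was already released or the message does not fit. */
long multi_recv_place(struct multi_recv *mr, size_t len)
{
    if (mr->released || len > mr->size - mr->offset)
        return -1;

    long at = (long)mr->offset;
    size_t next = align_up(mr->offset + len, mr->align);
    mr->offset = next < mr->size ? next : mr->size;

    if (mr->size - mr->offset < mr->min_free)
        mr->released = 1;
    return at;
}
```

       With a 64-byte buffer, 8-byte alignment, and a 16-byte minimum,
       messages of 10, 20, and 10 bytes land at offsets 0, 16, and 40, after
       which only 8 bytes remain and the buffer is released.
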
       FI_COMPLETION : Indicates that a completion entry should be generated
       for data transfer operations.  This flag only applies to operations
       issued on endpoints that were bound to a CQ or counter with the
       FI_SELECTIVE_COMPLETION flag.  See the fi_ep_bind section above for
       more detail.

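       The interaction between the bind-time FI_SELECTIVE_COMPLETION flag and
       the per-operation FI_COMPLETION flag reduces to a small predicate.  The
       M_* bits and generates_completion below are hypothetical and do not
       carry libfabric's actual flag values.  The model covers successful
       operations only; errors are still reported, as described in NOTES.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bits for this model only; NOT libfabric's constants. */
#define M_SELECTIVE_COMPLETION (UINT64_C(1) << 0)   /* bind-time flag */
#define M_COMPLETION           (UINT64_C(1) << 1)   /* per-operation flag */

/* Does a successful operation produce a CQ entry or counter update?
 *   bind_flags: flags given when binding the CQ or counter
 *   op_flags:   flags in effect for the operation */
bool generates_completion(uint64_t bind_flags, uint64_t op_flags)
{
    if (!(bind_flags & M_SELECTIVE_COMPLETION))
        return true;                        /* every operation completes */
    return (op_flags & M_COMPLETION) != 0;  /* only flagged operations do */
}
```
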
       FI_INJECT_COMPLETE : Indicates that a completion should be generated
       when the source buffer(s) may be reused.  A completion guarantees that
       the buffers will not be read from again and that the application may
       reclaim them.  No other guarantees are made with respect to the state
       of the operation.

       Note: This flag controls when a completion entry is inserted into a
       completion queue.  It does not apply to operations that do not
       generate a completion queue entry, such as the fi_inject operation,
       and is not subject to the inject_size message limit.

       FI_TRANSMIT_COMPLETE : Indicates that a completion should be generated
       when the transmit operation has completed relative to the local
       provider.  The exact behavior depends on the endpoint type.

       For reliable endpoints:

       Indicates that a completion should be generated when the operation has
       been delivered to the peer endpoint.  A completion guarantees that the
       operation is no longer dependent on the fabric or local resources.
       The state of the operation at the peer endpoint is not defined.

       For unreliable endpoints:

       Indicates that a completion should be generated when the operation has
       been delivered to the fabric.  A completion guarantees that the
       operation is no longer dependent on local resources.  The state of the
       operation within the fabric is not defined.

       FI_DELIVERY_COMPLETE : Indicates that a completion should not be
       generated until an operation has been processed by the destination
       endpoint(s).  A completion guarantees that the result of the operation
       is available.

       This completion mode applies only to reliable endpoints.  For
       operations that return data to the initiator, such as RMA read or
       atomic-fetch, the source endpoint is also considered a destination
       endpoint.  This is the default completion mode for such operations.

       FI_COMMIT_COMPLETE : Indicates that a completion should not be
       generated (locally or at the peer) until the result of an operation
       has been made persistent.  A completion guarantees that the result is
       both available and durable, even in the event of a power failure.

       This completion mode applies only to operations that target persistent
       memory regions over reliable endpoints.  This completion mode is
       experimental.

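       The four completion levels form a ladder from weakest to strongest
       guarantee.  The hypothetical model below maps a level, the endpoint's
       reliability, and whether the target is persistent memory to the event
       after which a completion may be reported.  The LVL_* and TRIG_* names
       and trigger_for are illustrative, not libfabric's FI_*_COMPLETE flags,
       and the model ignores the default-mode rule for RMA read and
       atomic-fetch.

```c
#include <stdbool.h>

/* Hypothetical model of the completion levels, weakest to strongest. */
enum completion_level {
    LVL_INJECT_COMPLETE,
    LVL_TRANSMIT_COMPLETE,
    LVL_DELIVERY_COMPLETE,
    LVL_COMMIT_COMPLETE
};

/* The event after which a completion may be reported. */
enum completion_trigger {
    TRIG_INVALID,          /* combination not permitted */
    TRIG_BUFFER_REUSABLE,  /* source buffers will not be read again */
    TRIG_IN_FABRIC,        /* delivered to the fabric (unreliable) */
    TRIG_AT_PEER,          /* delivered to the peer (reliable) */
    TRIG_PROCESSED,        /* result available at the destination */
    TRIG_PERSISTENT        /* result durable in persistent memory */
};

enum completion_trigger trigger_for(enum completion_level lvl, bool reliable,
                                    bool persistent_target)
{
    switch (lvl) {
    case LVL_INJECT_COMPLETE:
        return TRIG_BUFFER_REUSABLE;
    case LVL_TRANSMIT_COMPLETE:
        return reliable ? TRIG_AT_PEER : TRIG_IN_FABRIC;
    case LVL_DELIVERY_COMPLETE:
        return reliable ? TRIG_PROCESSED : TRIG_INVALID;
    case LVL_COMMIT_COMPLETE:
        return (reliable && persistent_target) ? TRIG_PERSISTENT
                                               : TRIG_INVALID;
    }
    return TRIG_INVALID;
}
```
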
       FI_MULTICAST : Indicates that data transfers will target multicast
       addresses by default.  Any fi_addr_t passed into a data transfer
       operation will be treated as a multicast address.


NOTES

       Users should call fi_close to release all resources allocated to the
       fabric endpoint.

       Endpoints allocated with the FI_CONTEXT mode set must typically provide
       a struct fi_context as their per-operation context parameter (see
       fi_getinfo(3) for details).  However, when FI_SELECTIVE_COMPLETION is
       enabled to suppress completion entries and an operation is initiated
       without the FI_COMPLETION flag set, the context parameter is ignored.
       An application does not need to pass a valid struct fi_context into
       such data transfers.

       Operations that complete in error and are not associated with a valid
       operation context will report the endpoint's context in any error
       reporting structures.

       Although applications typically associate individual completions with
       either completion queues or counters, an endpoint can be attached to
       both a counter and a completion queue.  Combined with selective
       completions, this allows an application to use counters to track
       successful completions, with a CQ used to report errors.  Operations
       that complete with an error increment the error counter and generate
       a completion event.  The generation of entries going to the CQ can
       then be controlled using FI_SELECTIVE_COMPLETION.

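       The division of labor described above can be simulated without a
       provider.  The hypothetical model below assumes the counter is bound
       without selective completion and the CQ with it, and that operations
       are posted without FI_COMPLETION unless request_cq_entry is set, so
       successes touch only the counter while failures reach both.  With
       libfabric itself the application would read these values with calls
       such as fi_cntr_read, fi_cntr_readerr, and fi_cq_readerr; the struct
       and function here are illustrative only.

```c
#include <stdbool.h>

/* Hypothetical simulation of a counter plus a selective-completion CQ. */
struct completion_state {
    unsigned long cntr_success;  /* counter: successful completions */
    unsigned long cntr_err;      /* counter: error count */
    unsigned long cq_entries;    /* entries written to the CQ */
};

void complete_op(struct completion_state *s, bool ok, bool request_cq_entry)
{
    if (ok) {
        s->cntr_success++;
        if (request_cq_entry)
            s->cq_entries++;   /* success reaches the CQ only on request */
    } else {
        s->cntr_err++;         /* errors bump the error count ... */
        s->cq_entries++;       /* ... and also generate a CQ event */
    }
}
```

       After three successes and one failure, the counter shows 3 successes
       and 1 error, and the CQ holds a single entry describing the failure.
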
       As mentioned in fi_getinfo(3), the ep_attr structure can be used to
       query providers that support various endpoint attributes.  fi_getinfo
       can return provider info structures that support the minimal set of
       requirements (so that the application remains correct).  However, it
       can also return provider info structures that exceed the
       application's requirements.  For example, consider an application
       requesting msg_order as FI_ORDER_NONE.  The resulting output from
       fi_getinfo may have all the ordering bits set.  The application can
       clear the ordering bits it does not require before creating the
       endpoint.  The provider is still free to implement stricter ordering
       than the application requires.

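       Trimming the returned ordering bits is a mask operation.  The ORD_*
       bits and trim_order below are hypothetical and do not carry the values
       of libfabric's FI_ORDER_* constants; the sketch only shows clearing
       bits that were not requested and detecting a provider that lacks a
       required one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ordering bits for this model; NOT libfabric's FI_ORDER_*. */
#define ORD_RAR (UINT64_C(1) << 0)
#define ORD_RAW (UINT64_C(1) << 1)
#define ORD_WAR (UINT64_C(1) << 2)
#define ORD_WAW (UINT64_C(1) << 3)
#define ORD_SAS (UINT64_C(1) << 4)

/* Keep only the ordering bits the application needs.  Returns false if
 * the provider did not report a bit the application requires. */
bool trim_order(uint64_t returned, uint64_t required, uint64_t *out)
{
    if ((required & ~returned) != 0)
        return false;              /* requirement cannot be met */
    *out = returned & required;    /* clear everything not requested */
    return true;
}
```

       Clearing bits only relaxes what the application asks for; the provider
       may still deliver stricter ordering.
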

RETURN VALUES

       Returns 0 on success.  On error, a negative value corresponding to a
       fabric errno is returned.  For fi_cancel, a return value of 0
       indicates that the cancel request was submitted for processing.

       Fabric errno values are defined in rdma/fi_errno.h.


ERRORS

       -FI_EDOMAIN : A resource domain was not bound to the endpoint, or an
       attempt was made to bind multiple domains.

       -FI_ENOCQ : The endpoint has not been configured with a necessary
       completion queue.

       -FI_EOPBADSTATE : The endpoint's state does not permit the requested
       operation.


SEE ALSO

       fi_getinfo(3), fi_domain(3), fi_msg(3), fi_tagged(3), fi_rma(3)


AUTHORS

       OpenFabrics.


Libfabric Programmer's Manual     2018-02-13                    fi_endpoint(3)