fi_setup(7)                    Libfabric v1.18.1                   fi_setup(7)


NAME
       fi_setup - libfabric setup and initialization

OVERVIEW
       A full description of the libfabric API is documented in the relevant
       man pages.  This section provides an introduction to select
       interfaces, including how they may be used.  It does not attempt to
       capture all subtleties or use cases, nor describe all possible data
       structures or fields.  However, it is useful for new developers
       trying to kick-start using libfabric.

fi_getinfo()
       The fi_getinfo() call is one of the first calls that applications
       invoke.  It is designed to be easy to use for simple applications,
       but extensible enough to configure a network for optimal performance.
       It serves several purposes.  First, it abstracts away network
       implementation and addressing details.  Second, it allows an
       application to specify which features it requires of the network.
       Last, it provides a mechanism for a provider to report how an
       application can use the network in order to achieve the best
       performance.  fi_getinfo() is loosely based on the getaddrinfo()
       call.

              /* API prototypes */
              struct fi_info *fi_allocinfo(void);

              int fi_getinfo(int version, const char *node, const char *service,
                  uint64_t flags, struct fi_info *hints, struct fi_info **info);

              /* Sample initialization code flow */
              struct fi_info *hints, *info;

              hints = fi_allocinfo();

              /* hints will point to a cleared fi_info structure.
               * Initialize hints here to request specific network capabilities.
               */

              fi_getinfo(FI_VERSION(1, 16), NULL, NULL, 0, hints, &info);
              fi_freeinfo(hints);

              /* Use the returned info structure to allocate fabric resources */

       The hints parameter is the key for requesting fabric services.  The
       fi_info structure contains several data fields, plus pointers to a
       wide variety of attributes.  The fi_allocinfo() call simplifies the
       creation of an fi_info structure and is strongly recommended.  In
       this example, the application is merely attempting to get a list of
       the providers available in the system and the features that they
       support.  Note that the API is designed to be extensible.
       Versioning information is provided as part of the fi_getinfo() call.
       The version is used by libfabric to determine what API features the
       application is aware of.  In this case, the application indicates
       that it can properly handle any feature that was defined for the 1.16
       release (or earlier).

       Applications should always hard code the version that they are
       written for into the fi_getinfo() call.  This ensures that newer
       versions of libfabric will provide backwards compatibility with the
       version used by the application.  Newer versions of libfabric must
       support applications that were compiled against an older version of
       the library.  They must also support applications written against
       header files from an older library version, but re-compiled against
       newer header files.  Among other things, the version parameter allows
       libfabric to determine if an application is aware of new fields that
       may have been added to structures, or if the data in those fields may
       be uninitialized.

       Typically, an application will initialize the hints parameter to list
       the features that it will use.

              /* Taking a peek at the contents of fi_info */
              struct fi_info {
                  struct fi_info *next;
                  uint64_t caps;
                  uint64_t mode;
                  uint32_t addr_format;
                  size_t src_addrlen;
                  size_t dest_addrlen;
                  void *src_addr;
                  void *dest_addr;
                  fid_t handle;
                  struct fi_tx_attr *tx_attr;
                  struct fi_rx_attr *rx_attr;
                  struct fi_ep_attr *ep_attr;
                  struct fi_domain_attr *domain_attr;
                  struct fi_fabric_attr *fabric_attr;
                  struct fid_nic *nic;
              };

       The fi_info structure references several different attributes, which
       correspond to the different libfabric objects that an application
       allocates.  For basic applications, modifying or accessing most
       attribute fields is unnecessary.  Many applications will only need to
       deal with a few fields of fi_info, most notably the endpoint type,
       capability (caps) bits, and mode bits.  These are defined in more
       detail below.

       On success, the fi_getinfo() function returns a linked list of
       fi_info structures.  Each entry in the list will meet the conditions
       specified through the hints parameter.  The returned entries may come
       from different network providers, or may differ in the returned
       attributes.  For example, if hints does not specify a particular
       endpoint type, there may be an entry for each of the three endpoint
       types.  As a general rule, libfabric attempts to return the list of
       fi_info structures in order from most desirable to least.
       High-performance network providers are listed before more generic
       providers.

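
       A minimal sketch of walking the returned list (assuming info was
       returned by the fi_getinfo() call shown earlier; error handling
       omitted):

```c
/* Inspect each candidate fi_info entry, most preferred first */
struct fi_info *cur;

for (cur = info; cur; cur = cur->next)
    printf("provider: %s\n", cur->fabric_attr->prov_name);

/* fi_freeinfo(info) releases the entire returned list when done */
```
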
   Capabilities (fi_info::caps)
       The fi_info caps field is used to specify the features and services
       that the application requires of the network.  This field is a
       bit-mask of desired capabilities.  There are capability bits for each
       of the data transfer services previously mentioned: FI_MSG,
       FI_TAGGED, FI_RMA, FI_ATOMIC, and FI_COLLECTIVE.  Applications should
       set the bit for each set of operations that they will use.  These
       bits are often the only caps bits set by an application.

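
       For example, an application that will use sends/receives and RMA
       might fill in hints as follows (a sketch; assumes the hints structure
       from the earlier fi_allocinfo() example):

```c
/* Request the data transfer services this application will use */
hints->caps = FI_MSG | FI_RMA;
```
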
       Capabilities are grouped into three general categories: primary,
       secondary, and primary modifiers.  Primary capabilities must
       explicitly be requested by an application, and a provider must enable
       support for only those primary capabilities which were selected.
       This is required for both performance and security reasons.  Primary
       modifiers are used to limit a primary capability, such as restricting
       an endpoint to being send-only.

       Secondary capabilities may optionally be requested by an application.
       If requested, a provider must either support the capability or fail
       the fi_getinfo() request.  A provider may optionally report
       non-requested secondary capabilities if doing so would not compromise
       performance or security.  That is, a provider may grant an
       application a secondary capability, whether or not the application
       requested it.  The most commonly accessed secondary capability bits
       indicate if provider communication is restricted to the local node
       (for example, the shared memory provider only supports local
       communication) and/or remote nodes (which can be the case for NICs
       that lack loopback support).  Other secondary capability bits mostly
       deal with features targeting highly-scalable applications, but may
       not be commonly supported across multiple providers.

       Because different providers support different sets of capabilities,
       applications that desire optimal network performance may need to code
       for a capability being either present or absent.  When present, such
       capabilities can offer a scalability or performance boost.  When
       absent, an application may prefer to adjust its protocol or
       implementation to work around the network limitations.  Although
       providers can often emulate features, doing so can impact overall
       performance, including the performance of data transfers that
       otherwise appear unrelated to the feature in use.  For example, if a
       provider needs to insert protocol headers into the message stream in
       order to implement a given capability, the insertion of that header
       could negatively impact the performance of all transfers.  By
       exposing such limitations to the application, the application
       developer has better control over how to best emulate the feature or
       work around its absence.

       It is recommended that applications code for only those capabilities
       required to achieve the best performance.  If a capability would have
       little to no effect on overall performance, developers should avoid
       using such features as part of an initial implementation.  This will
       allow the application to work well across the widest variety of
       hardware.  Application optimizations can then add support for less
       common features.  To see which features are supported by which
       providers, see the libfabric Provider Feature Matrix for the relevant
       release.

   Mode Bits (fi_info::mode)
       Where capability bits represent features desired by applications,
       mode bits correspond to behavior needed by the provider.  That is,
       capability bits are top-down requests, whereas mode bits are
       bottom-up restrictions.  Mode bits are set by the provider to request
       that the application use the API in a specific way in order to
       achieve optimal performance.  Mode bits often imply that the
       additional work to implement certain communication semantics needed
       by the application will be less if implemented by the application
       than if that same implementation were forced down into the provider.
       Mode bits arise as a result of hardware implementation restrictions.

       An application developer decides which mode bits they want to or can
       easily support as part of their development process.  Each mode bit
       describes a particular behavior that the application must follow to
       use various interfaces.  Applications set the mode bits that they
       support when calling fi_getinfo().  If a provider requires a mode bit
       that isn’t set, that provider will be skipped by fi_getinfo().  If a
       provider does not need a mode bit that is set, it will respond to the
       fi_getinfo() call with the mode bit cleared.  This indicates that the
       application does not need to perform the action required by the mode
       bit.

       One of the most common mode bits needed by providers is FI_CONTEXT
       (and FI_CONTEXT2).  This mode bit requires that applications pass a
       libfabric defined data structure (struct fi_context) into any data
       transfer function.  That structure must remain valid and unused by
       the application until the data transfer operation completes.  The
       purpose behind this mode bit is that the struct fi_context provides
       “scratch” space that the provider can use to track the request.  For
       example, it may need to insert the request into a linked list while
       it is pending, or track the number of times that an outbound transfer
       has been retried.  Since many applications already track outstanding
       operations with their own data structure, embedding the struct
       fi_context into that same structure can improve overall performance.
       This avoids the provider needing to allocate and free internal
       structures for each request.

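
       A sketch of that embedding follows.  The struct app_request name,
       its fields, and the post_send() helper are hypothetical, for
       illustration only:

```c
/* Hypothetical application request tracking structure */
struct app_request {
    struct fi_context ctx;   /* scratch space reserved for the provider */
    void *buffer;
    size_t len;
    /* ... other application bookkeeping ... */
};

/* When FI_CONTEXT is required, pass the embedded context as the final
 * argument of data transfer calls; it must stay valid and untouched by
 * the application until the operation completes. */
ssize_t post_send(struct fid_ep *ep, struct app_request *req, void *desc,
                  fi_addr_t dest)
{
    return fi_send(ep, req->buffer, req->len, desc, dest, &req->ctx);
}
```
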
       Continuing with this example, if an application does not already
       track outstanding requests, then it would leave the FI_CONTEXT mode
       bit unset.  This would indicate that the provider needs to get and
       release its own structure for tracking purposes.  In this case, the
       costs would essentially be the same whether it is done by the
       application or the provider.

       For the broadest support of different network technologies,
       applications should attempt to support as many mode bits as
       feasible.  It is recommended that providers support applications
       that cannot support any mode bits, with as small an impact as
       possible.  However, implementation of mode bit avoidance in the
       provider can still impact performance, even when the mode bit is
       disabled.  As a result, some providers may always require specific
       mode bits be set.

FIDs (fid_t)
       FID stands for fabric identifier.  It is the base object type
       assigned to all libfabric API objects.  All fabric resources are
       represented by a fid structure, and all fids are derived from a base
       fid type.  In object-oriented terms, a fid would be the parent class.
       The contents of a fid are visible to the application.

              /* Base FID definition */
              enum {
                  FI_CLASS_UNSPEC,
                  FI_CLASS_FABRIC,
                  FI_CLASS_DOMAIN,
                  ...
              };

              struct fi_ops {
                  size_t size;
                  int (*close)(struct fid *fid);
                  ...
              };

              /* All fabric interface descriptors must start with this structure */
              struct fid {
                  size_t fclass;
                  void *context;
                  struct fi_ops *ops;
              };

       The fid structure is designed as a trade-off between minimizing
       memory footprint and software overhead.  Each fid is identified as a
       specific object class, which helps with debugging.  Examples are
       given above (e.g. FI_CLASS_FABRIC).  The context field is an
       application defined data value, assigned to an object during its
       creation.  The use of the context field is application specific, but
       it is meant to be read by applications.  Applications often set
       context to a corresponding structure that they have allocated.  The
       context field is the only field that applications are recommended to
       access directly.  Access to other fields should be done using defined
       function calls (for example, the close() operation).

       The ops field points to a set of function pointers.  The fi_ops
       structure defines the operations that apply to that class.  The size
       field in the fi_ops structure is used for extensibility, and allows
       the fi_ops structure to grow in a backward compatible manner as new
       operations are added.  The fid deliberately points to the fi_ops
       structure, rather than embedding the operations directly.  This
       allows multiple fids to point to the same set of ops, which minimizes
       the memory footprint of each fid.  (Internally, providers usually set
       ops to a static data structure, with the fid structure dynamically
       allocated.)

       Although it’s possible for applications to access function pointers
       directly, it is strongly recommended that the static inline functions
       defined in the man pages be used instead.  This is required by
       applications that may be built using the FABRIC_DIRECT library
       feature.  (FABRIC_DIRECT is a compile time option that allows for
       highly optimized builds by tightly coupling an application with a
       specific provider.)

       Other OFI classes are derived from this structure, adding their own
       set of operations.

              /* Example of deriving a new class for a fabric object */
              struct fi_ops_fabric {
                  size_t size;
                  int (*domain)(struct fid_fabric *fabric, struct fi_info *info,
                      struct fid_domain **dom, void *context);
                  ...
              };

              struct fid_fabric {
                  struct fid fid;
                  struct fi_ops_fabric *ops;
              };

       Other fid classes follow a similar pattern as that shown for
       fid_fabric.  The base fid structure is followed by zero or more
       pointers to operation sets.

Fabric (fid_fabric)
       The top-level object that applications open is the fabric identifier.
       The fabric can mostly be viewed as a container object by
       applications, though it does identify which provider(s) applications
       use.

       Opening a fabric is usually a straightforward call after calling
       fi_getinfo().

              int fi_fabric(struct fi_fabric_attr *attr, struct fid_fabric **fabric, void *context);

       The fabric attributes can be directly accessed from struct fi_info.
       The newly opened fabric is returned through the `fabric' parameter.
       The `context' parameter appears in many operations.  It is a
       user-specified value that is associated with the fabric.  It may be
       used to point to an application specific structure and is retrievable
       from struct fid_fabric.

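
       Putting it together, opening the fabric from an entry returned by
       fi_getinfo() looks roughly like the following sketch (error handling
       omitted):

```c
struct fid_fabric *fabric;

/* The fabric attributes come straight from the returned fi_info */
fi_fabric(info->fabric_attr, &fabric, NULL);
```
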
   Attributes (fi_fabric_attr)
       The fabric attributes are straightforward.

              struct fi_fabric_attr {
                  struct fid_fabric *fabric;
                  char *name;
                  char *prov_name;
                  uint32_t prov_version;
                  uint32_t api_version;
              };

       The only field that applications are likely to use directly is
       prov_name.  This is a string value that can be used by hints to
       select a specific provider for use.  On most systems, there will be
       multiple providers available.  Only one is likely to represent the
       high-performance network attached to the system.  Others are generic
       providers that may be available on any system, such as the TCP socket
       and UDP providers.

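
       As a sketch, an application can steer fi_getinfo() toward one
       provider by naming it in hints.  The provider string "tcp" here is
       just an example; since fi_freeinfo() frees prov_name, the string must
       be heap-allocated:

```c
/* Restrict the returned entries to a single provider by name */
hints->fabric_attr->prov_name = strdup("tcp");
```
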
       The fabric field is used to help applications manage opened fabric
       resources.  If an application has already opened a fabric that can
       support the returned fi_info structure, this field will be set to
       that fabric.

Domains (fid_domain)
       Domains frequently map to a specific local network interface adapter.
       A domain may refer to an entire NIC, a port on a multi-port NIC, a
       virtual device exposed by a NIC, multiple NICs being used in a
       multi-rail fashion, and so forth.  Although it’s convenient to think
       of a domain as referring to a NIC, such an association isn’t expected
       by libfabric.  From the viewpoint of the application, a domain
       identifies a set of resources that may be used together.

       Similar to a fabric, opening a domain is straightforward after
       calling fi_getinfo().

              int fi_domain(struct fid_fabric *fabric, struct fi_info *info,
                  struct fid_domain **domain, void *context);

       The fi_info structure returned from fi_getinfo() can be passed
       directly to fi_domain() to open a new domain.

   Attributes (fi_domain_attr)
       One of the goals of a domain is to define the relationship between
       data transfer services (endpoints) and completion services
       (completion queues and counters).  Many of the domain attributes
       describe that relationship and its impact to the application.

              struct fi_domain_attr {
                  struct fid_domain *domain;
                  char *name;
                  enum fi_threading threading;
                  enum fi_progress control_progress;
                  enum fi_progress data_progress;
                  enum fi_resource_mgmt resource_mgmt;
                  enum fi_av_type av_type;
                  enum fi_mr_mode mr_mode;
                  size_t mr_key_size;
                  size_t cq_data_size;
                  size_t cq_cnt;
                  size_t ep_cnt;
                  size_t tx_ctx_cnt;
                  size_t rx_ctx_cnt;
                  ...
              };

       Full details of the domain attributes and their meaning are in the
       fi_domain man page.  Information on select attributes and their
       impact to the application are described below.

   Threading (fi_threading)
       libfabric defines a unique threading model.  The libfabric design is
       heavily influenced by object-oriented programming concepts.  A
       multi-threaded application must determine how libfabric objects
       (domains, endpoints, completion queues, etc.) will be allocated among
       its threads, or if any thread can access any object.  For example, an
       application may spawn a new thread to handle each new connected
       endpoint.  The domain threading field provides a mechanism for an
       application to identify which objects may be accessed simultaneously
       by different threads.  This in turn allows a provider to optimize or,
       in some cases, eliminate internal synchronization and locking around
       those objects.

       Threading defines where providers could optimize synchronization
       primitives.  However, providers may still implement more
       serialization than is needed by the application.  (This is usually a
       result of keeping the provider implementation simpler.)

       It is recommended that applications target either FI_THREAD_SAFE
       (full thread safety implemented by the provider) or FI_THREAD_DOMAIN
       (objects associated with a single domain will only be accessed by a
       single thread).

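
       For example, an application that dedicates one thread per domain
       could request relaxed locking with a hints sketch like the following:

```c
/* All objects under a domain will be driven by a single thread */
hints->domain_attr->threading = FI_THREAD_DOMAIN;
```
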
   Progress (fi_progress)
       Progress models are a result of using the host processor in order to
       perform some portion of the transport protocol.  In order to simplify
       development, libfabric defines two progress models: automatic or
       manual.  It does not attempt to identify which specific interface
       features may be offloaded, or what operations require additional
       processing by the application’s thread.

       Automatic progress means that an operation initiated by the
       application will eventually complete, even if the application makes
       no further calls into the libfabric API.  The operation is either
       offloaded entirely onto hardware, the provider uses an internal
       thread, or the operating system kernel may perform the task.  The use
       of automatic progress may increase system overhead and latency in the
       latter two cases.  For control operations, such as connection setup,
       this is usually acceptable.  However, the impact to data transfers
       may be measurable, especially if internal threads are required to
       provide automatic progress.

       The manual progress model can avoid this overhead for providers that
       do not offload all transport features into hardware.  With manual
       progress the provider implementation will handle transport operations
       as part of specific libfabric functions.  For example, a call to
       fi_cq_read(), which retrieves an array of completed operations, may
       also be responsible for sending ack messages to notify peers that a
       message has been received.  Since reading the completion queue is
       part of the normal operation of an application, there is minimal
       impact to the application and additional threads are avoided.

       Applications need to take care when using manual progress,
       particularly if they link into libfabric multiple times through
       different code paths or library dependencies.  If application threads
       are used to drive progress, such as responding to received data with
       ACKs, then it is critical that the application thread call into
       libfabric in a timely manner.

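
       A typical manual-progress polling loop looks roughly like the sketch
       below (cq is assumed to be a completion queue bound to the endpoint;
       -FI_EAGAIN simply means the queue is empty, but each call may still
       have driven protocol processing):

```c
struct fi_cq_entry entry;
ssize_t ret;

/* Poll until one completion arrives; every call also advances any
 * pending provider protocol work (acks, retries, etc.) */
do {
    ret = fi_cq_read(cq, &entry, 1);
} while (ret == -FI_EAGAIN);
```
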
   Memory Registration (fid_mr)
       RMA, atomic, and collective operations can read and write memory that
       is owned by a peer process without requiring the involvement of the
       target processor.  Because the memory can be modified over the
       network, an application must opt into exposing its memory to peers.
       This is handled by the memory registration process.  Registered
       memory regions associate memory buffers with permissions granted for
       access by fabric resources.  A memory buffer must be registered
       before it can be used as the target of a remote RMA, atomic, or
       collective data transfer.  Additionally, a fabric provider may
       require that data buffers be registered before being used even in the
       case of local transfers.  The latter is necessary to ensure that the
       virtual to physical page mappings do not change while network
       hardware is performing the transfer.

       In order to handle diverse hardware requirements, there is a set of
       mr_mode bits associated with memory registration.  The mr_mode bits
       behave similarly to fi_info mode bits.  Applications indicate which
       types of restrictions they can support, and the providers clear those
       bits which aren’t needed.

       For hardware that requires memory registration, managing registration
       is critical to achieving good performance and scalability.  The act
       of registering memory is costly and should be avoided on a per data
       transfer basis.  libfabric has extensive internal support for
       managing memory registration: hiding registration from the user
       application, caching registrations to reduce per transfer overhead,
       and detecting when cached registrations are no longer valid.  It is
       recommended that applications that are not natively designed to
       account for registering memory make use of libfabric’s registration
       cache.  This can be done by simply not setting the relevant mr_mode
       bits.

   Memory Region APIs
       The following APIs highlight how to allocate and access a registered
       memory region.  Note that this is not a complete list of memory
       region (MR) calls; for full details on each API, readers should refer
       directly to the fi_mr man page.

              int fi_mr_reg(struct fid_domain *domain, const void *buf, size_t len,
                  uint64_t access, uint64_t offset, uint64_t requested_key, uint64_t flags,
                  struct fid_mr **mr, void *context);

              void * fi_mr_desc(struct fid_mr *mr);
              uint64_t fi_mr_key(struct fid_mr *mr);

       By default, memory regions are associated with a domain.  An MR is
       accessible by any endpoint that is opened on that domain.  A region
       starts at the address specified by `buf', and is `len' bytes long.
       The `access' parameter consists of permission flags that are OR’ed
       together.  The permissions indicate which type of operations may be
       invoked against the region (e.g. FI_READ, FI_WRITE, FI_REMOTE_READ,
       FI_REMOTE_WRITE).  The `buf' parameter typically references allocated
       virtual memory.

       An MR is associated with local and remote protection keys.  The local
       key is referred to as a memory descriptor and may be retrieved by
       calling fi_mr_desc().  This call is only needed if the FI_MR_LOCAL
       mr_mode bit has been set.  The memory descriptor is passed directly
       into data transfer operations, for example:

              /* fi_mr_desc() example using fi_send() */
              fi_send(ep, buf, len, fi_mr_desc(mr), 0, NULL);

       The remote key, or simply MR key, is used by the peer when targeting
       the MR with an RMA or atomic operation.  In many cases, the key will
       need to be sent in a separate message to the initiating peer.  Where
       a key is used, the libfabric API defines it as a 64-bit value.  The
       actual key size used by a provider is part of its domain attributes.
       Support for larger key sizes, as required by some providers, is
       conveyed through an mr_mode bit, and requires the use of extended MR
       API calls that map the larger size to a 64-bit value.
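
       A sketch of registering a buffer as an RMA write target and
       retrieving its key (error handling omitted; how the key reaches the
       peer is application-defined):

```c
struct fid_mr *mr;
uint64_t key;

/* Allow remote peers to write this buffer via RMA */
fi_mr_reg(domain, buf, len, FI_REMOTE_WRITE, 0, 0, 0, &mr, NULL);

key = fi_mr_key(mr);
/* ... send `key' (and the buffer's address) to the initiating peer ... */
```
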

Endpoints
       Endpoints are transport level communication portals.  Opening an
       endpoint is trivial after calling fi_getinfo().

   Active (fid_ep)
       Active endpoints may be connection-oriented or connection-less.  They
       are considered active as they may be used to perform data transfers.
       All data transfer interfaces – messages (fi_msg), tagged messages
       (fi_tagged), RMA (fi_rma), atomics (fi_atomic), and collectives
       (fi_collective) – are associated with active endpoints, though an
       individual endpoint may not be enabled to use all data transfer
       services.  In standard configurations, an active endpoint has one
       transmit and one receive queue.  In general, operations that generate
       traffic on the fabric are posted to the transmit queue.  This
       includes all RMA and atomic operations, along with sent messages and
       sent tagged messages.  Operations that post buffers for receiving
       incoming data are submitted to the receive queue.

       Active endpoints are created in the disabled state.  The endpoint
       must first be configured prior to being enabled.  Endpoints must
       transition into an enabled state before accepting data transfer
       operations, including the posting of receive buffers.  The
       fi_enable() call is used to transition an active endpoint into an
       enabled state.  The fi_connect() and fi_accept() calls will also
       transition an endpoint into the enabled state, if it is not already
       enabled.

              int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **ep, void *context);

534   Enabling (fi_enable)
535       In order to transition an endpoint into an enabled state,  it  must  be
536       bound  to one or more fabric resources.  This includes binding the end‐
537       point to a completion queue and  event  queue.   Unconnected  endpoints
538       must also be bound to an address vector.
539
540              /* Example to enable an unconnected endpoint */
541
542              /* Allocate an address vector and associate it with the endpoint */
543              fi_av_open(domain, &av_attr, &av, NULL);
544              fi_ep_bind(ep, &av->fid, 0);
545
546              /* Allocate and associate completion queues with the endpoint */
547              fi_cq_open(domain, &cq_attr, &cq, NULL);
548              fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
549
550              fi_enable(ep);
551
552       In the above example, we allocate an AV and CQ.  The attributes for the
553       AV and CQ are omitted (additional discussion below).   Those  are  then
554       associated  with the endpoint through the fi_ep_bind() call.  After all
555       necessary resources have been assigned to the endpoint, we  enable  it.
556       Enabling the endpoint indicates to the provider that it should allocate
557       any hardware and software resources and complete the initialization for
558       the  endpoint.   (If  the  endpoint  is  not bound to all necessary re‐
559       sources, the fi_enable() call will fail.)
560
561       The fi_enable() call is always called for unconnected endpoints.   Con‐
562       nected endpoints may be able to skip calling fi_enable(), since fi_con‐
563       nect() and fi_accept() will enable the endpoint automatically.   Howev‐
564       er,  applications  may  still call fi_enable() prior to calling fi_con‐
565       nect() or fi_accept().  Doing so allows the application to post receive
566       buffers  to  the endpoint, which ensures that they are available to re‐
567       ceive data in the case the peer endpoint sends messages immediately af‐
568       ter it establishes the connection.
569
570   Passive (fid_pep)
571       Passive  endpoints are used to listen for incoming connection requests.
572       Passive endpoints are of type FI_EP_MSG, and may not perform  any  data
573       transfers.   An  application wishing to create a passive endpoint typi‐
574       cally calls fi_getinfo() using the FI_SOURCE flag, often only  specify‐
575       ing a `service' address.  The service address corresponds to a TCP port
576       number.
577
578       Passive endpoints are associated with event queues.  Event  queues  re‐
579       port  connection requests from peers.  Unlike active endpoints, passive
580       endpoints are not associated with a domain.  This allows an application
581       to listen for connection requests across multiple domains, though still
582       restricted to a single provider.
583
584              /* Example passive endpoint listen */
585              fi_passive_ep(fabric, info, &pep, NULL);
586
587              fi_eq_open(fabric, &eq_attr, &eq, NULL);
588              fi_pep_bind(pep, &eq->fid, 0);
589
590              fi_listen(pep);
591
592       A passive endpoint must be bound to an event queue before calling  lis‐
593       ten.   This ensures that connection requests can be reported to the ap‐
594       plication.  To accept new connections, the application waits for a  re‐
595       quest, allocates a new active endpoint for it, and accepts the request.
596
597              /* Example accepting a new connection */
598
599              /* Wait for a CONNREQ event */
600              fi_eq_sread(eq, &event, &cm_entry, sizeof cm_entry, -1, 0);
601              assert(event == FI_CONNREQ);
602
603              /* Allocate a new endpoint for the connection */
604              if (!cm_entry.info->domain_attr->domain)
605                  fi_domain(fabric, cm_entry.info, &domain, NULL);
606              fi_endpoint(domain, cm_entry.info, &ep, NULL);
607
608              fi_ep_bind(ep, &eq->fid, 0);
609              fi_cq_open(domain, &cq_attr, &cq, NULL);
610              fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
611
612              fi_enable(ep);
613              fi_recv(ep, rx_buf, len, NULL, 0, NULL);
614
615              fi_accept(ep, NULL, 0);
616              fi_eq_sread(eq, &event, &cm_entry, sizeof cm_entry, -1, 0);
617              assert(event == FI_CONNECTED);
618
619       The  connection  request  event (FI_CONNREQ) includes information about
620       the type of endpoint to allocate, including default attributes to  use.
621       If  a  domain has not already been opened for the endpoint, one must be
622       opened.  Then the endpoint and related resources can be allocated.  Un‐
623       like  the unconnected endpoint example above, a connected endpoint does
624       not have an AV, but does need to be bound to an event queue.   In  this
625       case,  we use the same EQ as the listening endpoint.  Once the other EP
626       resources (e.g. CQ) have been allocated and bound, the EP  can  be  en‐
627       abled.
628
629       To accept the connection, the application calls fi_accept().  Note that
630       because of thread synchronization issues, it is possible for the active
631       endpoint to receive data even before fi_accept() can return.  The post‐
632       ing of receive buffers prior to calling fi_accept() handles this condi‐
633       tion,  which  avoids  network flow control issues occurring immediately
634       after connecting.
635
636       The fi_eq_sread() calls are blocking (synchronous) read  calls  to  the
637       event queue.  These calls wait until an event occurs, which in this
638       case are the FI_CONNREQ and FI_CONNECTED events.
639
640   EP Attributes (fi_ep_attr)
641       The properties of an endpoint are specified using endpoint  attributes.
642       These are attributes for the endpoint as a whole.  There are additional
643       attributes specifically related to the transmit  and  receive  contexts
644       underpinning the endpoint (details below).
645
646              struct fi_ep_attr {
647                  enum fi_ep_type type;
648                  uint32_t        protocol;
649                  uint32_t        protocol_version;
650                  size_t          max_msg_size;
651                  ...
652              };
653
654       A  full  description  of each field is available in the fi_endpoint man
655       page, with selected details listed below.
656
657   Endpoint Type (fi_ep_type)
658       This indicates the type of endpoint: reliable datagram (FI_EP_RDM), re‐
659       liable-connected  (FI_EP_MSG),  or  unreliable  datagram (FI_EP_DGRAM).
660       Nearly all applications will want to specify the  endpoint  type  as  a
661       hint passed into fi_getinfo, as most applications will only be coded to
662       support a single endpoint type.
663
664   Maximum Message Size (max_msg_size)
665       This size is the maximum size for any data transfer operation that goes
666       over  the  endpoint.   For unreliable datagram endpoints, this is often
667       the MTU of the underlying network.  For reliable endpoints, this  value
668       is often a restriction of the underlying transport protocol.  A common
669       lower bound for the maximum message size is 2GB, though some providers
670       support an arbitrarily large size.  Applications that require transfers larger
671       than the maximum reported size are required to break up a single, large
672       transfer into multiple operations.
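
       As a sketch (this is application logic, not a libfabric interface),
       the segmentation arithmetic looks like the following; the helper
       names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Split a transfer of `total' bytes into pieces no larger than the
 * endpoint's reported max_msg_size.  In a real application, each chunk
 * would be posted as its own data transfer operation at its offset. */
static size_t count_chunks(size_t total, size_t max_msg_size)
{
    return (total + max_msg_size - 1) / max_msg_size;
}

static size_t chunk_len(size_t total, size_t max_msg_size, size_t idx)
{
    size_t off = idx * max_msg_size;
    size_t rem = total - off;
    return rem < max_msg_size ? rem : max_msg_size;
}
```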
673
674       Providers  expose their hardware or network limits to the applications,
675       rather than segmenting large transfers internally, in order to minimize
676       completion overhead.  For example, for a provider to support large mes‐
677       sage segmentation internally, it would need to emulate  all  completion
678       mechanisms  (queues  and  counters) in software, even if transfers that
679       are larger than the transport-supported maximum were never used.
680
681   Message Order Size (max_order_xxx_size)
682       These fields specify data ordering.  They define the delivery order  of
683       transport  data into target memory for RMA and atomic operations.  Data
684       ordering requires message ordering.  If message ordering is not  speci‐
685       fied, these fields do not apply.
686
687       For  example,  suppose  that an application issues two RMA write opera‐
688       tions to the same target memory  location.   (The  application  may  be
689       writing a time stamp value every time a local condition is met, for in‐
690       stance).  Message ordering indicates that the first write as  initiated
691       by  the  sender is the first write processed by the receiver.  Data or‐
692       dering indicates whether the data from the first write  updates  memory
693       before the second write updates memory.
694
695       The max_order_xxx_size fields indicate how large a message may be while
696       still achieving data ordering.  If a field is 0, then no data  ordering
697       is  guaranteed.   If a field is the same as the max_msg_size, then data
698       order is guaranteed for all messages.
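
       The check an application might perform can be sketched as follows
       (a hypothetical helper, not part of the libfabric API); a field value
       of 0 means no ordering, and a value equal to max_msg_size means
       ordering at any size:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns nonzero if two back-to-back RMA writes
 * of `xfer_size' bytes are guaranteed to update target memory in order,
 * given the endpoint's reported max_order_waw_size. */
static int waw_data_ordered(size_t max_order_waw_size, size_t xfer_size)
{
    /* a value of 0 means data ordering is never guaranteed */
    return max_order_waw_size != 0 && xfer_size <= max_order_waw_size;
}
```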
699
700       Providers may support data ordering up to max_msg_size for back-to-back
701       operations of the same type.  For example, an RMA write followed by an
702       RMA write may have data ordering regardless of the  size  of  the  data
703       transfer  (max_order_waw_size  = max_msg_size).  Mixed operations, such
704       as a read followed by a write, are often restricted.  This  is  because
705       RMA  read  operations  may  require acknowledgments from the initiator,
706       which impacts the re-transmission protocol.
707
708       For example, consider an RMA read followed by a write.  The target will
709       process  the  read request, retrieve the data, and send a reply.  While
710       that is occurring, a write is received that wants to  update  the  same
711       memory  location  accessed  by  the  read.  If the target processes the
712       write, it will overwrite the memory used by the read.  If the read  re‐
713       sponse  is  lost, and the read is retried, the target will be unable to
714       re-send the data.  To handle this, the target either  needs  to:  defer
715       handling the write until it receives an acknowledgment for the read re‐
716       sponse, buffer the read response so it can be re-transmitted, or  indi‐
717       cate that data ordering is not guaranteed.
718
719       Because the read or write operation may be gigabytes in size, deferring
720       the write may add significant latency, and buffering the read  response
721       may  be  impractical.  The max_order_xxx_size fields indicate how large
722       back-to-back operations may be with ordering still maintained.  In many
723       cases, read-after-write and write-after-read ordering may be significant‐
724       ly limited, but still usable for implementing specific algorithms, such
725       as a global locking mechanism.
726
727   Rx/Tx Context Attributes (fi_rx_attr / fi_tx_attr)
728       The  endpoint attributes define the overall abilities for the endpoint;
729       however, attributes that apply specifically to receive or transmit con‐
730       texts are defined by struct fi_rx_attr and fi_tx_attr, respectively:
731
732              struct fi_rx_attr {
733                  uint64_t caps;
734                  uint64_t mode;
735                  uint64_t op_flags;
736                  uint64_t msg_order;
737                  uint64_t comp_order;
738                  ...
739              };
740
741              struct fi_tx_attr {
742                  uint64_t caps;
743                  uint64_t mode;
744                  uint64_t op_flags;
745                  uint64_t msg_order;
746                  uint64_t comp_order;
747                  size_t inject_size;
748                  ...
749              };
750
751       Rx/Tx  context  capabilities must be a subset of the endpoint capabili‐
752       ties.  For many applications, the default attributes  returned  by  the
753       provider will be sufficient, with the application only needing to spec‐
754       ify endpoint attributes.
755
756       Both context attributes include an op_flags field.  This field is  used
757       by  applications to specify the default operation flags to use with any
758       call.  For example, by  setting  the  transmit  context’s  op_flags  to
759       FI_INJECT,  the  application  has  indicated  to  the provider that all
760       transmit operations should assume `inject' behavior is desired; i.e.,
761       ownership of the buffer passed to a call reverts to the application as
762       soon as the call returns.  The op_flags value applies to all operations
763       that do not provide flags as part of the call (e.g. fi_send).  One
764       use of op_flags is to specify the default completion  semantic  desired
765       (discussed  next)  by the application.  By setting the default op_flags
766       at initialization time, we can avoid passing the flags as arguments in‐
767       to  some  data transfer calls, avoid parsing the flags, and can prepare
768       submitted commands ahead of time.
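
       The pattern can be mocked as follows (the constants and helper are
       stand-ins for illustration, not libfabric definitions): calls without
       a flags argument (fi_send style) inherit the context op_flags, while
       calls that accept flags (fi_sendmsg style) pass them explicitly:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag values, for illustration only */
#define OP_INJECT            (1ULL << 0)
#define OP_TRANSMIT_COMPLETE (1ULL << 1)

/* Mock of how a provider resolves the effective flags for an operation:
 * use the caller's flags when supplied, otherwise fall back to the
 * transmit context's default op_flags. */
static uint64_t effective_flags(uint64_t ctx_op_flags,
                                const uint64_t *call_flags /* NULL if none */)
{
    return call_flags ? *call_flags : ctx_op_flags;
}
```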
769
770       It should be noted that some attributes are  dependent  upon  the  peer
771       endpoint  having  supporting attributes in order to achieve correct ap‐
772       plication behavior.  For example, message order must be compatible
773       between  the  initiator’s  transmit attributes and the target’s receive
774       attributes.  Any mismatch may result in incorrect behavior  that  could
775       be difficult to debug.
776

Completions

778       Data  transfer  operations  complete asynchronously.  Libfabric defines
779       two mechanisms by which an application can be notified that an operation
780       has  completed:  completion  queues  and counters.  Regardless of which
781       mechanism is used to notify the application that an operation is  done,
782       developers must be aware of what a completion indicates.
783
784       In  all cases, a completion indicates that it is safe to reuse the buf‐
785       fer(s) associated with the data transfer.  This completion mode is  re‐
786       ferred to as inject complete and corresponds to the operational flag
787       FI_INJECT_COMPLETE.  However, a completion may also guarantee stronger
788       semantics.
789
790       Although  libfabric  does  not define an implementation, a provider can
791       meet the requirement for inject complete by copying  the  application’s
792       buffer into a network buffer before generating the completion.  Even if
793       the transmit operation is lost and must be retried,  the  provider  can
794       resend  the  original  data from the copied location.  For large trans‐
795       fers, a provider may not mark a request as inject  complete  until  the
796       data  has  been  acknowledged  by  the  target.  Applications, however,
797       should only infer that it is safe to reuse their data buffer for an in‐
798       ject complete operation.
799
800       Transmit  complete is a completion mode that provides slightly stronger
801       guarantees to the application.  The meaning of  transmit  complete  de‐
802       pends  on whether the endpoint is reliable or unreliable.  For an unre‐
803       liable endpoint (FI_EP_DGRAM), a transmit completion indicates that the
804       request  has  been  delivered to the network.  That is, the message has
805       been delivered at least as far as hardware queues  on  the  local  NIC.
806       For reliable endpoints, a transmit complete occurs when the request has
807       reached the target endpoint.  Typically, this indicates that the target
808       has  acked  the  request.  Transmit complete maps to the operation flag
809       FI_TRANSMIT_COMPLETE.
810
811       A third completion mode is defined to provide guarantees beyond  trans‐
812       mit  complete.   With  transmit complete, an application knows that the
813       message  is  no  longer  dependent  on  the  local   NIC   or   network
814       (e.g. switches).   However,  the data may be buffered at the remote NIC
815       and has not necessarily been written to the target memory.   As  a  re‐
816       sult,  data  sent  in  the request may not be visible to all processes.
817       The third completion mode is delivery complete.
818
819       Delivery complete indicates that  the  results  of  the  operation  are
820       available  to  all  processes  on  the fabric.  The distinction between
821       transmit and delivery complete is  subtle,  but  important.   It  often
822       deals  with  when  the target endpoint generates an acknowledgment to a
823       message.  For providers that offload transport  protocol  to  the  NIC,
824       support  for transmit complete is common.  Delivery complete guarantees
825       are more easily met by providers that implement portions of their  pro‐
826       tocol  on  the  host  processor.   Delivery complete corresponds to the
827       FI_DELIVERY_COMPLETE operation flag.
828
829       Applications can request a default completion mode when opening an end‐
830       point  by  setting  one  of  the  above  mentioned complete flags as an
831       op_flags for the context’s attributes.  However, it is usually recom‐
832       mended that applications use the provider’s defaults for best perfor‐
833       mance and amend their protocol to achieve the desired completion semantics.
834       For  example,  many  applications will perform a `finalize' or `commit'
835       procedure as part of their operation, which synchronizes the processing
836       of  all peers and guarantees that all previously sent data has been re‐
837       ceived.
838
839       A full discussion of completion semantics is given  in  the  fi_cq  man
840       page.
841
842   CQs (fid_cq)
843       Completion  queues  often map directly to provider hardware mechanisms,
844       and libfabric is designed around minimizing the software impact of  ac‐
845       cessing  those mechanisms.  Unlike other objects discussed so far (fab‐
846       rics, domains, endpoints), completion queues are not part of the fi_in‐
847       fo structure or involved with the fi_getinfo() call.
848
849       All  active endpoints must be bound with one or more completion queues.
850       This is true even if completions will be suppressed by the  application
851       (e.g. using  the  FI_SELECTIVE_COMPLETION flag).  Completion queues are
852       needed to report operations that  complete  in  error  and  help  drive
853       progress in the case of manual progress.
854
855       CQs  are  allocated  separately  from endpoints and are associated with
856       endpoints through the fi_ep_bind() function.
857
858   CQ Format (fi_cq_format)
859       In order to minimize the amount of data that a  provider  must  report,
860       the  type of completion data written back to the application is select-
861       able.  This limits the number of bytes the provider writes  to  memory,
862       and  allows  necessary completion data to fit into a compact structure.
863       Each CQ format maps to a  specific  completion  structure.   Developers
864       should analyze each structure, select the smallest one that contains
865       all of the data the application requires, and specify the corresponding
866       enum value as the CQ format.
867
868       For  example,  if  an application only needs to know which request com‐
869       pleted, along with the size of a received message, it  can  select  the
870       following:
871
872              cq_attr->format = FI_CQ_FORMAT_MSG;
873
874              struct fi_cq_msg_entry {
875                  void      *op_context;
876                  uint64_t  flags;
877                  size_t    len;
878              };
879
880       Once  the format has been selected, the underlying provider will assume
881       that read operations against the CQ will pass in an array of the corre‐
882       sponding  structure.   The  CQ  data  formats  are designed such that a
883       structure that reports more information can be cast to one that reports
884       less.
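
       The layout compatibility can be demonstrated with local stand-ins for
       the completion structures (the real definitions live in the libfabric
       headers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the CQ entry layouts: each larger format
 * begins with the same fields as the smaller ones. */
struct cq_entry      { void *op_context; };
struct cq_msg_entry  { void *op_context; uint64_t flags; size_t len; };
struct cq_data_entry { void *op_context; uint64_t flags; size_t len;
                       void *buf; uint64_t data; };

/* A completion written in the data format can be read through a cast
 * to the smaller msg format. */
static size_t msg_len(const struct cq_data_entry *e)
{
    return ((const struct cq_msg_entry *) e)->len;
}
```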
885
886   Reading Completions (fi_cq_read)
887       Completions  may  be  read  from  a CQ by using one of the non-blocking
888       calls, fi_cq_read / fi_cq_readfrom,  or  one  of  the  blocking  calls,
889       fi_cq_sread  /  fi_cq_sreadfrom.  Regardless of which call is used, ap‐
890       plications pass in an array of completion structures based on  the  se‐
891       lected CQ format.  The CQ interfaces are optimized for batch completion
892       processing, allowing the application to retrieve  multiple  completions
893       from  a single read call.  The difference between the read and readfrom
894       calls is that readfrom returns source addressing  data,  if  available.
895       The  readfrom  derivative  of  the calls is only useful for unconnected
896       endpoints, and only if the corresponding endpoint has  been  configured
897       with the FI_SOURCE capability.
898
899       FI_SOURCE  requires  that the provider use the source address available
900       in the raw completion data, such as the packet’s source address, to re‐
901       trieve a matching entry in the endpoint’s address vector.  Applications
902       that carry some sort of source identifier as part of their data packets
903       can avoid the overhead associated with using FI_SOURCE.
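
       A sketch of that application-level alternative follows; the header
       layout is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Application-level source identification: prefix each message with a
 * small header naming the sender (here, a rank), so the receiver does
 * not need the provider to reverse-map the source address via FI_SOURCE. */
struct msg_hdr { uint32_t src_rank; };

static void write_hdr(void *buf, uint32_t rank)
{
    struct msg_hdr h = { rank };
    memcpy(buf, &h, sizeof h);
}

static uint32_t read_src(const void *buf)
{
    struct msg_hdr h;
    memcpy(&h, buf, sizeof h);
    return h.src_rank;
}
```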
904
905   Retrieving Errors
906       Because the selected completion structure is insufficient to report all
907       data necessary to debug or handle an operation that completes in error,
908       failed  operations  are reported using a separate fi_cq_readerr() func‐
909       tion.  This call takes as input a CQ error entry structure,  which  al‐
910       lows  the  provider to report more information regarding the reason for
911       the failure.
912
913              /* read error prototype */
914              ssize_t fi_cq_readerr(struct fid_cq *cq, struct fi_cq_err_entry *buf, uint64_t flags);
915
916              /* error data structure */
917              struct fi_cq_err_entry {
918                  void      *op_context;
919                  uint64_t  flags;
920                  size_t    len;
921                  void      *buf;
922                  uint64_t  data;
923                  uint64_t  tag;
924                  size_t    olen;
925                  int       err;
926                  int       prov_errno;
927                  void      *err_data;
928                  size_t    err_data_size;
929              };
930
931              /* Sample error handling */
932              struct fi_cq_msg_entry entry;
933              struct fi_cq_err_entry err_entry;
934              int ret;
935
936              ret = fi_cq_read(cq, &entry, 1);
937              if (ret == -FI_EAVAIL)
938                  ret = fi_cq_readerr(cq, &err_entry, 0);
939
940       As illustrated, if an error entry has been inserted into the completion
941       queue,  then attempting to read the CQ will result in the read call re‐
942       turning -FI_EAVAIL (error available).  This indicates that the applica‐
943       tion must use the fi_cq_readerr() call to remove the failed operation’s
944       completion information before other completions can be reaped from  the
945       CQ.
946
947       A  fabric error code regarding the failure is reported as the err field
948       in the fi_cq_err_entry structure.  A provider specific  error  code  is
949       also available through the prov_errno field.  This field can be decoded
950       into a displayable string  using  the  fi_cq_strerror()  routine.   The
951       err_data  field  is provider specific data that assists the provider in
952       decoding the reason for the failure.
953

Address Vectors (fid_av)

955       A primary goal of address vectors is to allow applications to  communi‐
956       cate with thousands to millions of peers while minimizing the amount of
957       data needed to store peer addressing  information.   It  pushes  fabric
958       specific  addressing details away from the application to the provider.
959       This allows the provider to optimize how  it  converts  addresses  into
960       routing  data, and enables data compression techniques that may be dif‐
961       ficult for an application to achieve without being aware  of  low-level
962       fabric addressing details.  For example, providers may be able to algo‐
963       rithmically calculate addressing components, rather  than  storing  the
964       data  locally.   Additionally,  providers can communicate with resource
965       management entities or fabric manager agents to obtain quality of  ser‐
966       vice or other information about the fabric, in order to improve network
967       utilization.
968
969       An equally important objective is ensuring that  the  resulting  inter‐
970       faces, particularly data transfer operations, are fast and easy to use.
971       Conceptually, an address vector converts an endpoint  address  into  an
972       fi_addr_t.   The  fi_addr_t  (fabric  interface  address datatype) is a
973       64-bit value that is used in all `fast-path' operations –  data  trans‐
974       fers and completions.
975
976       Address  vectors  are  associated  with  domain  objects.   This allows
977       providers to implement portions of an address vector, such  as  quality
978       of service mappings, in hardware.
979
980   AV Type (fi_av_type)
981       There  are two types of address vectors.  The type refers to the format
982       of the returned fi_addr_t values for addresses that are  inserted  into
983       the  AV.  With type FI_AV_TABLE, returned addresses are simple indices,
984       and developers may think of the AV as an array of addresses.  Each  ad‐
985       dress  that  is inserted into the AV is mapped to the index of the next
986       free array slot.  The advantage of FI_AV_TABLE is that applications can
987       refer  to peers using a simple index, eliminating an application’s need
988       to store any addressing data; i.e., the application can generate the
989       fi_addr_t values itself.  This type maps well to applications, such
990       as MPI, where a peer is referenced by rank.
991
992       The second type is FI_AV_MAP.  This type does not define  any  specific
993       format for the fi_addr_t value.  Applications that use type map are re‐
994       quired to provide the correct fi_addr_t for a given peer when issuing a
995       data transfer operation.  The advantage of FI_AV_MAP is that a provider
996       can use the fi_addr_t to encode the target’s address, which avoids  re‐
997       trieving  the data from memory.  As a simple example, consider a fabric
998       that uses TCP/IPv4 based addressing.  An fi_addr_t is large  enough  to
999       contain  the address, which allows a provider to copy the data from the
1000       fi_addr_t directly into an outgoing packet.
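
       One possible encoding is sketched below.  This is purely illustrative;
       providers choose their own encodings, and applications must always
       treat FI_AV_MAP fi_addr_t values as opaque:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical provider-side encoding: pack a 32-bit IPv4 address and a
 * 16-bit port directly into the 64-bit fi_addr_t, so the target address
 * is available without a memory lookup. */
static uint64_t pack_ipv4(uint32_t addr, uint16_t port)
{
    return ((uint64_t) addr << 16) | port;
}

static uint32_t unpack_addr(uint64_t fi_addr) { return (uint32_t) (fi_addr >> 16); }
static uint16_t unpack_port(uint64_t fi_addr) { return (uint16_t) (fi_addr & 0xffff); }
```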
1001
1002   Sharing AVs Between Processes
1003       Large scale parallel programs typically run with multiple processes al‐
1004       located  on  each  node.   Because these processes communicate with the
1005       same set of peers, the addressing data needed by each  process  is  the
1006       same.   Libfabric defines a mechanism by which processes running on the
1007       same node may share their address vectors.  This  allows  a  system  to
1008       maintain  a  single  copy  of addressing data, rather than one copy per
1009       process.
1010
1011       Although libfabric does not require any implementation for how  an  ad‐
1012       dress vector is shared, the interfaces map well to using shared memory.
1013       Address vectors which will be shared are given an application  specific
1014       name.  How an application selects a name that avoids conflicts with un‐
1015       related processes, or how it communicates the name to peer processes,
1016       is outside the scope of libfabric.
1017
1018       In addition to having a name, a shared AV also has a base map address –
1019       map_addr.  Use of map_addr is only important for address  vectors  that
1020       are  of type FI_AV_MAP, and allows applications to share fi_addr_t val‐
1021       ues.  From the viewpoint of the application, the map_addr is  the  base
1022       value  for  all fi_addr_t values.  A common use for map_addr is for the
1023       process that creates the initial address vector to request a value from
1024       the  provider,  exchange  the returned map_addr with its peers, and for
1025       the peers to open the shared AV using the same map_addr.   This  allows
1026       the  fi_addr_t  values to be stored in shared memory that is accessible
1027       by all peers.
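
       The arithmetic implied above can be sketched as follows (illustrative
       only; the offset scheme is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* With a shared FI_AV_MAP vector, peers that open the AV with the same
 * map_addr observe identical fi_addr_t values: each value can be viewed
 * as an offset relative to the agreed base. */
static uint64_t shared_fi_addr(uint64_t map_addr, uint64_t offset)
{
    return map_addr + offset;
}
```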
1028

Using Native Wait Objects: TryWait

1030       There is an important difference between using libfabric completion ob‐
1031       jects,  versus sockets, that may not be obvious from the discussions so
1032       far.  With sockets, the object that is signaled is the same object that
1033       abstracts  the  queues,  namely  the file descriptor.  When data is re‐
1034       ceived on a socket, that data is placed in a queue associated  directly
1035       with  the fd.  Reading from the fd retrieves that data.  If an applica‐
1036       tion wishes to block until data arrives on a socket, it calls  select()
1037       or  poll()  on  the fd.  The fd is signaled when a message is received,
1038       which releases the blocked thread, allowing it to read the fd.
1039
1040       By associating the wait object with the underlying data queue, applica‐
1041       tions  are  exposed  to an interface that is easy to use and race free.
1042       If data is available to read from the socket at the  time  select()  or
1043       poll() is called, those calls simply return that the fd is readable.
1044
1045       There are a couple of significant disadvantages to this approach, which
1046       have been discussed previously, but from different  perspectives.   The
1047       first  is  that every socket must be associated with its own fd.  There
1048       is no way to share a wait object among multiple sockets.   (This  is  a
1049       main  reason  for  the  development of epoll semantics).  The second is
1050       that the queues are maintained in the kernel, so that the select() and
1051       poll() calls can check them.
1052
1053       Libfabric  allows  for  the separation of the wait object from the data
1054       queues.  For applications that use libfabric  interfaces  to  wait  for
1055       events,  such as fi_cq_sread, this separation is mostly hidden from the
1056       application.  The exception is that applications may receive a  signal,
1057       but  no events are retrieved when a queue is read.  This separation al‐
1058       lows the queues to reside in the application’s memory space, while wait
1059       objects  may  still  use kernel components.  A reason for the latter is
1060       that wait objects may be signaled as part of system interrupt  process‐
1061       ing, which would go through a kernel driver.
1062
1063       Applications  that  want to use native wait objects (e.g. file descrip‐
1064       tors) directly in operating system calls  must  perform  an  additional
1065       step  in their processing.  In order to handle race conditions that can
1066       occur between inserting an event into a completion or event object  and
1067       signaling  the corresponding wait object, libfabric defines an `fi_try‐
1068       wait()' function.  The fi_trywait  implementation  is  responsible  for
1069       handling potential race conditions which could result in an application
1070       either losing events or hanging.  The  following  example  demonstrates
1071       the use of fi_trywait().
1072
              /* Get the native wait object -- an fd in this case */
              fi_control(&cq->fid, FI_GETWAIT, (void *) &fd);

              while (1) {
                  ret = fi_trywait(fabric, &cq->fid, 1);
                  if (ret == FI_SUCCESS) {
                      /* It’s safe to block on the fd.  select() modifies
                       * its fd sets and timeout, so reinitialize both
                       * before each call.
                       */
                      FD_ZERO(&fds);
                      FD_SET(fd, &fds);
                      timeout.tv_sec = 1;
                      timeout.tv_usec = 0;
                      select(fd + 1, &fds, NULL, NULL, &timeout);
                  } else if (ret == -FI_EAGAIN) {
                      /* Read and process all completions from the CQ */
                      do {
                          ret = fi_cq_read(cq, &comp, 1);
                      } while (ret > 0);
                  } else {
                      /* something really bad happened */
                      break;
                  }
              }
1092
       In this example, the application has allocated a CQ with an fd as its
       wait object.  Before blocking in select() on the fd, the application
       must first call fi_trywait() and have it succeed (return code of
       FI_SUCCESS).  Success indicates that a blocking operation can now be
       invoked on the native wait object without fear of the application
       hanging or events being lost.  If fi_trywait() returns -FI_EAGAIN, it
       usually indicates that there are queued events to process.
1100

Environment Variables

       Environment variables are used by providers to configure internal
       options for optimal performance or memory consumption.  Libfabric
       provides an interface for querying which environment variables are
       usable, along with a utility to display the information on the
       command line.  Although environment variables are usually configured
       by an administrator, an application can query for variables
       programmatically.
1108
1109              /* APIs to query for supported environment variables */
1110              enum fi_param_type {
1111                  FI_PARAM_STRING,
1112                  FI_PARAM_INT,
1113                  FI_PARAM_BOOL,
1114                  FI_PARAM_SIZE_T,
1115              };
1116
1117              struct fi_param {
1118                  /* The name of the environment variable */
1119                  const char *name;
1120                  /* What type of value it stores */
1121                  enum fi_param_type type;
1122                  /* A description of how the variable is used */
1123                  const char *help_string;
1124                  /* The current value of the variable */
1125                  const char *value;
1126              };
1127
1128              int fi_getparams(struct fi_param **params, int *count);
1129              void fi_freeparams(struct fi_param *params);
1130
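       As a sketch of programmatic use (assuming the libfabric development
       headers are installed and the program is linked with -lfabric), an
       application could dump the supported variables itself using the
       fi_getparams() and fi_freeparams() calls shown above:

```c
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_param *params;
    int i, count;

    if (fi_getparams(&params, &count))
        return 1;

    /* Print each variable's name and its help text */
    for (i = 0; i < count; i++)
        printf("# %s: %s\n", params[i].name, params[i].help_string);

    fi_freeparams(params);
    return 0;
}
```

       The output mirrors what the fi_info utility produces with its `-e'
       option, described below.
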
       Modifying environment variables is typically a tuning activity done
       on larger clusters.  However, there are a few variables that are
       useful for developers.  These can be seen by executing the fi_info
       command.
1135
1136              $ fi_info -e
1137              # FI_LOG_LEVEL: String
1138              # Specify logging level: warn, trace, info, debug (default: warn)
1139
1140              # FI_LOG_PROV: String
1141              # Specify specific provider to log (default: all)
1142
1143              # FI_PROVIDER: String
1144              # Only use specified provider (default: all available)
1145
1146       The  fi_info  application,  which  ships with libfabric, can be used to
1147       list all environment variables for all providers.  The `-e' option will
1148       list  all variables, and the `-g' option can be used to filter the out‐
1149       put to only those variables with a matching substring.   Variables  are
1150       documented  directly  in  code  with  the  description available as the
1151       help_string output.
1152
       The FI_LOG_LEVEL variable can be used to increase the debug output
       from libfabric and the providers.  Note that in a release build of
       libfabric, debug output from data path operations (transmit, receive,
       and completion processing) may not be available.  The FI_PROVIDER
       variable can be used to enable or disable specific providers.  This
       is useful to ensure that a given provider will be used.
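
       For example, a developer might combine the two when debugging.  The
       invocation below (shown with fi_info; any libfabric application
       responds to the same variables) restricts discovery to the tcp
       provider while raising the log level:

```shell
$ FI_LOG_LEVEL=info FI_PROVIDER=tcp fi_info
```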
1159

AUTHORS

1161       OpenFabrics.
1162
1163
1164
1165Libfabric Programmer’s Manual     2023-01-02                       fi_setup(7)