fi_collective(3)               Libfabric v1.17.0              fi_collective(3)

NAME

       fi_join_collective
              Operation where a subset of peers join a new collective group.

       fi_barrier / fi_barrier2
              Collective operation that does not complete until all peers
              have entered the barrier call.

       fi_broadcast
              A single sender transmits data to all peers, including itself.

       fi_alltoall
              Each peer distributes a slice of its local data to all peers.

       fi_allreduce
              Collective operation where all peers broadcast an atomic
              operation to all other peers.

       fi_allgather
              Each peer sends a complete copy of its local data to all peers.

       fi_reduce_scatter
              Collective call where data is collected from all peers and
              merged (reduced).  The results of the reduction are
              distributed back to the peers, with each peer receiving a
              slice of the results.

       fi_reduce
              Collective call where data is collected from all peers to a
              root peer and merged (reduced).

       fi_scatter
              A single sender distributes (scatters) a slice of its local
              data to all peers.

       fi_gather
              All peers send their data to a root peer.

       fi_query_collective
              Returns information about which collective operations are
              supported by a provider, and limitations on the collective.

SYNOPSIS

              #include <rdma/fi_collective.h>

              int fi_join_collective(struct fid_ep *ep, fi_addr_t coll_addr,
                  const struct fid_av_set *set,
                  uint64_t flags, struct fid_mc **mc, void *context);

              ssize_t fi_barrier(struct fid_ep *ep, fi_addr_t coll_addr,
                  void *context);

              ssize_t fi_barrier2(struct fid_ep *ep, fi_addr_t coll_addr,
                  uint64_t flags, void *context);

              ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t count, void *desc,
                  fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_alltoall(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              int fi_query_collective(struct fid_domain *domain,
                  enum fi_collective_op coll, struct fi_collective_attr *attr,
                  uint64_t flags);

ARGUMENTS

       ep     Fabric endpoint on which to initiate collective operation.

       set    Address vector set defining the collective membership.

       mc     Multicast group associated with the collective.

       buf    Local data buffer that specifies the first operand of the
              collective operation.

       datatype
              Datatype associated with atomic operands.

       op     Atomic operation to perform.

       result Local data buffer to store the result of the collective
              operation.

       desc / result_desc
              Data descriptor associated with the local data buffer and
              local result buffer, respectively.

       coll_addr
              Address referring to the collective group of endpoints.

       root_addr
              Single endpoint that is the source or destination of
              collective data.

       flags  Additional flags to apply to the atomic operation.

       context
              User specified pointer to associate with the operation.  This
              parameter is ignored if the operation will not generate a
              successful completion, unless an op flag specifies the context
              parameter be used for required input.

DESCRIPTION (EXPERIMENTAL APIs)

       The collective APIs are new to the 1.9 libfabric release.  Although
       efforts have been made to design the APIs such that they align well
       with applications and are implementable by the providers, the APIs
       should be considered experimental and may be subject to change in
       future versions of the library until the experimental tag has been
       removed.

       In general, collective operations can be thought of as coordinated
       atomic operations between a set of peer endpoints.  Readers should
       refer to the fi_atomic(3) man page for details on the atomic
       operations and datatypes defined by libfabric.

       A collective operation is a group communication exchange.  It
       involves multiple peers exchanging data with other peers
       participating in the collective call.  Collective operations require
       close coordination by all participating members.  All participants
       must invoke the same collective call before any single member can
       complete its operation locally.  As a result, collective calls can
       strain the fabric, as well as local and remote data buffers.

       Libfabric collective interfaces target fabrics that support
       offloading portions of the collective communication into network
       switches, NICs, and other devices.  However, no implementation
       requirement is placed on the provider.

       The first step in using a collective call is identifying the peer
       endpoints that will participate.  Collective membership follows one
       of two models, both supported by libfabric.  In the first model, the
       application manages the membership.  This usually means that the
       application is performing a collective operation itself using point
       to point communication to identify the members who will participate.
       Additionally, the application may be interacting with a fabric
       resource manager to reserve network resources needed to execute
       collective operations.  In this model, the application will inform
       libfabric that the membership has already been established.

       A separate model moves the membership management under libfabric and
       directly into the provider.  In this model, the application must
       identify which peer addresses will be members.  That information is
       conveyed to the libfabric provider, which is then responsible for
       coordinating the creation of the collective group.  In the provider
       managed model, the provider will usually perform the necessary
       collective operation to establish the communication group and
       interact with any fabric management agents.

       In both models, the collective membership is communicated to the
       provider by creating and configuring an address vector set (AV set).
       An AV set represents an ordered subset of addresses in an address
       vector (AV).  Details on creating and configuring an AV set are
       available in fi_av_set(3).

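       As an illustration of the provider managed model, the following
       sketch builds an AV set from peer addresses that were previously
       inserted into an AV and retrieves the address later passed to
       fi_join_collective.  The av, peer_addrs, and num_peers variables are
       assumed to have been set up elsewhere, and error handling is omitted.

              /* Sketch: build an AV set describing the collective
               * membership.  Assumes av and peer_addrs[] were initialized
               * elsewhere. */
              struct fi_av_set_attr set_attr = {
                  .count = num_peers,            /* expected membership size */
                  .start_addr = FI_ADDR_NOTAVAIL, /* create the set empty */
                  .end_addr = FI_ADDR_NOTAVAIL,
                  .stride = 1,
              };
              struct fid_av_set *set;
              fi_addr_t coll_addr;

              fi_av_set(av, &set_attr, &set, NULL);
              for (size_t i = 0; i < num_peers; i++)
                  fi_av_set_insert(set, peer_addrs[i]);

              /* Address representing the AV set, used by fi_join_collective */
              fi_av_set_addr(set, &coll_addr);
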
       Once an AV set has been programmed with the collective membership
       information, an endpoint is joined to the set.  This uses the
       fi_join_collective operation and operates asynchronously.  This
       differs from how an endpoint is associated synchronously with an AV
       using the fi_ep_bind() call.  Upon completion of the
       fi_join_collective operation, an fi_addr is provided that is used as
       the target address when invoking a collective operation.

       For developer convenience, a set of collective APIs is defined.
       Collective APIs differ from message and RMA interfaces in that the
       format of the data is known to the provider, and the collective may
       perform an operation on that data.  This aligns collective operations
       closely with the atomic interfaces.

   Join Collective (fi_join_collective)
       This call attaches an endpoint to a collective membership group.
       Libfabric treats collective members as a multicast group, and the
       fi_join_collective call attaches the endpoint to that multicast
       group.  By default, the endpoint will join the group based on the
       data transfer capabilities of the endpoint.  For example, if the
       endpoint has been configured to both send and receive data, then the
       endpoint will be able to initiate and receive transfers to and from
       the collective.  The input flags may be used to restrict access to
       the collective group, subject to endpoint capability limitations.

       Join collective operations complete asynchronously, and may involve
       fabric transfers, depending on the provider implementation.  An
       endpoint must be bound to an event queue prior to calling
       fi_join_collective.  The result of the join operation will be
       reported to the EQ as an FI_JOIN_COMPLETE event.  Applications cannot
       issue collective transfers until receiving notification that the join
       operation has completed.  Note that an endpoint may begin receiving
       messages from the collective group as soon as the join completes,
       which can occur prior to the FI_JOIN_COMPLETE event being generated.

       The join collective operation is itself a collective operation.  All
       participating peers must call fi_join_collective before any
       individual peer will report that the join has completed.  Application
       managed collective memberships are an exception.  With application
       managed memberships, the fi_join_collective call may be completed
       locally without fabric communication.  For provider managed
       memberships, the join collective call requires as input a coll_addr
       that refers to either an address associated with an AV set (see
       fi_av_set_addr) or an existing collective group (obtained through a
       previous call to fi_join_collective).  The fi_join_collective call
       will create a new collective subgroup.  If application managed
       memberships are used, coll_addr should be set to FI_ADDR_UNAVAIL.

       Applications must call fi_close on the collective group to disconnect
       the endpoint from the group.  After a join operation has completed,
       the fi_mc_addr call may be used to retrieve the address associated
       with the multicast group.  See fi_cm(3) for additional details on
       fi_mc_addr().

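       The sketch below joins an endpoint to the collective described by the
       AV set above and waits for the FI_JOIN_COMPLETE event before any
       collective transfers are issued.  It assumes ep is already bound to
       the event queue eq; error handling is omitted.

              /* Sketch: join the collective and wait for completion on the
               * EQ.  Assumes ep is bound to eq and coll_addr refers to the
               * AV set built earlier. */
              struct fid_mc *coll_mc;
              struct fi_eq_entry eq_entry;
              uint32_t event;
              fi_addr_t group_addr;
              ssize_t ret;

              fi_join_collective(ep, coll_addr, set, 0, &coll_mc, NULL);

              /* Poll the EQ until the join result is reported */
              do {
                  ret = fi_eq_read(eq, &event, &eq_entry, sizeof(eq_entry), 0);
              } while (ret == -FI_EAGAIN);
              /* Expect event == FI_JOIN_COMPLETE on success */

              /* Address used as coll_addr in subsequent collective calls */
              group_addr = fi_mc_addr(coll_mc);
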
   Barrier (fi_barrier)
       The fi_barrier operation provides a mechanism to synchronize peers.
       Barrier does not result in any data being transferred at the
       application level.  A barrier does not complete locally until all
       peers have invoked the barrier call.  This signifies to the local
       application that work completed by peers prior to their barrier call
       has finished.

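       For example, a peer might synchronize with the group and reap the
       barrier completion from its completion queue.  The sketch assumes ep
       is bound to the completion queue cq, that the CQ uses a context-only
       completion format, and that group_addr was returned by fi_mc_addr();
       error handling is omitted.

              /* Sketch: synchronize with all peers in the collective group */
              struct fi_cq_entry comp;
              ssize_t ret;

              ret = fi_barrier(ep, group_addr, /* context */ NULL);

              /* Poll until the barrier completes locally */
              do {
                  ret = fi_cq_read(cq, &comp, 1);
              } while (ret == -FI_EAGAIN);
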
   Barrier (fi_barrier2)
       The fi_barrier2 operation is the same as fi_barrier, but with an
       extra parameter to pass in operation flags.

   Broadcast (fi_broadcast)
       fi_broadcast transfers an array of data from a single sender to all
       other members of the collective group.  The input buf parameter is
       treated as the transmit buffer if the local rank is the root,
       otherwise it is the receive buffer.  The broadcast operation acts as
       an atomic write or read to a data array.  As a result, the format of
       the data in buf is specified through the datatype parameter.  Any
       non-void datatype may be broadcast.

       The following diagram shows an example of broadcast being used to
       transfer an array of integers to a group of peers.

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
               |____^    ^
               |_________|
               broadcast

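       The diagram corresponds to a call such as the following sketch, where
       the root transmits a three element integer array and every other
       member receives it into the same buffer.  The root_addr and
       group_addr values are assumed to identify the root peer and the
       joined collective, and the NULL desc assumes local memory
       registration is not required.

              /* Sketch: broadcast three integers from the root to all
               * peers.  On the root, data[] holds the values to send; on
               * every other peer it is overwritten with the received
               * values. */
              uint32_t data[3] = { 1, 5, 9 };

              fi_broadcast(ep, data, 3, /* desc */ NULL, group_addr,
                           root_addr, FI_UINT32, /* flags */ 0,
                           /* context */ NULL);
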
   All to All (fi_alltoall)
       The fi_alltoall collective involves distributing (or scattering)
       different portions of an array of data to peers.  It is best
       explained using an example.  Here three peers perform an all to all
       collective to exchange different entries in an integer array.

              [1]   [2]   [3]
              [5]   [6]   [7]
              [9]  [10]  [11]
                 \   |   /
                 All to all
                 /   |   \
              [1]   [5]   [9]
              [2]   [6]  [10]
              [3]   [7]  [11]

       Each peer sends a piece of its data to the other peers.

       All to all operations may be performed on any non-void datatype.
       However, all to all does not perform an operation on the data itself,
       so no operation is specified.

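       Mapping the diagram to the API, each of the three peers could issue a
       call such as the sketch below.  The count argument is treated here as
       the number of elements exchanged with each peer, mirroring the
       diagram; that interpretation, the three peer group size, and the NULL
       descriptors are illustrative assumptions.

              /* Sketch: each peer contributes one slice per peer and
               * receives one slice from every peer. */
              uint32_t send_data[3];    /* one entry destined for each peer */
              uint32_t recv_data[3];    /* one entry gathered from each peer */

              fi_alltoall(ep, send_data, 1, NULL, recv_data, NULL,
                          group_addr, FI_UINT32, 0, NULL);
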
   All Reduce (fi_allreduce)
       fi_allreduce can be described as all peers providing input into an
       atomic operation, with the result copied back to each peer.
       Conceptually, this can be viewed as each peer issuing a multicast
       atomic operation to all other peers, fetching the results, and
       combining them.  The combining of the results is referred to as the
       reduction.  The fi_allreduce() operation takes as input an array of
       data and the specified atomic operation to perform.  The results of
       the reduction are written into the result buffer.

       Any non-void datatype may be specified.  Valid atomic operations are
       listed below in the fi_query_collective call.  The following diagram
       shows an example of an all reduce operation involving summing an
       array of integers between three peers.

               [1]  [1]  [1]
               [5]  [5]  [5]
               [9]  [9]  [9]
                 \   |   /
                    sum
                 /   |   \
               [3]  [3]  [3]
              [15] [15] [15]
              [27] [27] [27]
                All Reduce

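       The summation shown above corresponds to a call such as the following
       sketch, where each peer contributes a three element array and
       receives the element-wise sum computed across all peers.  The NULL
       descriptors assume local memory registration is not required;
       completion handling is omitted.

              /* Sketch: element-wise sum of a three element array across
               * every member of the collective group. */
              uint64_t local[3]  = { 1, 5, 9 };   /* this peer's input */
              uint64_t result[3];                 /* receives { 3, 15, 27 } */

              fi_allreduce(ep, local, 3, NULL, result, NULL,
                           group_addr, FI_UINT64, FI_SUM, 0, NULL);
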
   All Gather (fi_allgather)
       Conceptually, all gather can be viewed as the opposite of the scatter
       component from reduce-scatter.  All gather collects data from all
       peers into a single array, then copies that array back to each peer.

              [1]  [5]  [9]
                \   |   /
               All gather
                /   |   \
              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]

       All gather may be performed on any non-void datatype.  However, all
       gather does not perform an operation on the data itself, so no
       operation is specified.

   Reduce-Scatter (fi_reduce_scatter)
       The fi_reduce_scatter collective is similar to an fi_allreduce
       operation, followed by all to all.  With reduce scatter, all peers
       provide input into an atomic operation, similar to all reduce.
       However, rather than the full result being copied to each peer, each
       participant receives only a slice of the result.

       This is shown by the following example:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                    |
                   [3]
                  [15]
                  [27]
                    |
                 scatter
                /   |   \
              [3] [15] [27]

       The reduce scatter call supports the same datatypes and atomic
       operations as fi_allreduce.

   Reduce (fi_reduce)
       The fi_reduce collective is the first half of an fi_allreduce
       operation.  With reduce, all peers provide input into an atomic
       operation, with the results collected by a single `root' endpoint.

       This is shown by the following example, with the leftmost peer
       identified as the root:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                  /
               [3]
              [15]
              [27]

       The reduce call supports the same datatypes and atomic operations as
       fi_allreduce.

   Scatter (fi_scatter)
       The fi_scatter collective is the second half of an fi_reduce_scatter
       operation.  The data from a single `root' endpoint is split and
       distributed to all peers.

       This is shown by the following example:

               [3]
              [15]
              [27]
                  \
                 scatter
                /   |   \
              [3] [15] [27]

       The scatter operation is used to distribute results to the peers.  No
       atomic operation is performed on the data.

   Gather (fi_gather)
       The fi_gather operation is used to collect (gather) the results from
       all peers and store them at a `root' peer.

       This is shown by the following example, with the leftmost peer
       identified as the root.

              [1]  [5]  [9]
                \   |   /
                  gather
                 /
              [1]
              [5]
              [9]

       The gather operation does not perform any operation on the data
       itself.

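       The gather diagram maps to a call such as the sketch below, in which
       each peer contributes a single integer and the root receives the
       combined array.  The three peer group size and the reading of count
       as the per-peer element count are illustrative assumptions; the
       result buffer is only meaningful at the root.

              /* Sketch: gather one value from each of three peers at the
               * root.  The value 5 stands in for this peer's contribution
               * (the middle peer in the diagram). */
              uint64_t mine = 5;
              uint64_t gathered[3];     /* significant on the root only */

              fi_gather(ep, &mine, 1, NULL, gathered, NULL,
                        group_addr, root_addr, FI_UINT64, 0, NULL);
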
   Query Collective Attributes (fi_query_collective)
       The fi_query_collective call reports which collective operations are
       supported by the underlying provider, for suitably configured
       endpoints.  Collective operations needed by an application that are
       not supported by the provider must be implemented by the application.
       The query call checks whether a provider supports a specific
       collective operation for a given datatype and operation, if
       applicable.

       The name of the collective, as well as the datatype and associated
       operation, if applicable, are provided as input into
       fi_query_collective.

       The coll parameter may reference one of these collectives:
       FI_BARRIER, FI_BROADCAST, FI_ALLTOALL, FI_ALLREDUCE, FI_ALLGATHER,
       FI_REDUCE_SCATTER, FI_REDUCE, FI_SCATTER, or FI_GATHER.  Additional
       details on the collective operation are specified through the struct
       fi_collective_attr parameter.  For collectives that act on data, the
       operation and related data type must be specified through the given
       attributes.

              struct fi_collective_attr {
                  enum fi_op op;
                  enum fi_datatype datatype;
                  struct fi_atomic_attr datatype_attr;
                  size_t max_members;
                  uint64_t mode;
              };

       For a description of struct fi_atomic_attr, see fi_atomic(3).

       op     On input, this specifies the atomic operation involved with
              the collective call.  This should be set to one of the
              following values: FI_MIN, FI_MAX, FI_SUM, FI_PROD, FI_LOR,
              FI_LAND, FI_BOR, FI_BAND, FI_LXOR, FI_BXOR, FI_ATOMIC_READ,
              FI_ATOMIC_WRITE, or FI_NOOP.  For collectives that do not
              exchange application data (fi_barrier), this should be set to
              FI_NOOP.

       datatype
              On input, specifies the datatype of the data being modified by
              the collective.  This should be set to one of the following
              values: FI_INT8, FI_UINT8, FI_INT16, FI_UINT16, FI_INT32,
              FI_UINT32, FI_INT64, FI_UINT64, FI_FLOAT, FI_DOUBLE,
              FI_FLOAT_COMPLEX, FI_DOUBLE_COMPLEX, FI_LONG_DOUBLE,
              FI_LONG_DOUBLE_COMPLEX, or FI_VOID.  For collectives that do
              not exchange application data (fi_barrier), this should be set
              to FI_VOID.

       datatype_attr.count
              The maximum number of elements that may be used with the
              collective.

       datatype_attr.size
              The size of the datatype as supported by the provider.
              Applications should validate the size of datatypes that differ
              based on the platform, such as FI_LONG_DOUBLE.

       max_members
              The maximum number of peers that may participate in a
              collective operation.

       mode   This field is reserved and should be 0.

       If a collective operation is supported, the query call will return
       FI_SUCCESS, along with attributes on the limits for using that
       collective operation through the provider.

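       For instance, an application that intends to issue the fi_allreduce
       sum shown earlier could first verify provider support and its limits
       with a query such as the following sketch, where domain is an open
       fid_domain; error handling is omitted.

              /* Sketch: check whether the provider supports an all reduce
               * sum on 64-bit unsigned integers, and read back its limits. */
              struct fi_collective_attr attr = {
                  .op = FI_SUM,
                  .datatype = FI_UINT64,
                  .mode = 0,
              };
              int ret;

              ret = fi_query_collective(domain, FI_ALLREDUCE, &attr, 0);
              if (ret == FI_SUCCESS) {
                  /* attr.datatype_attr.count: max elements per call
                   * attr.max_members: max peers in the collective */
              }
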
   Completions
       Collective operations map to underlying fi_atomic operations.  For a
       discussion of atomic completion semantics, see fi_atomic(3).  The
       completion, ordering, and atomicity of collective operations match
       those defined for point to point atomic operations.

FLAGS

       The following flags are defined for the specified operations.

       FI_SCATTER
              Applies to fi_query_collective.  When set, requests attribute
              information on the reduce-scatter collective operation.

RETURN VALUE

       Returns 0 on success.  On error, a negative value corresponding to
       fabric errno is returned.  Fabric errno values are defined in
       rdma/fi_errno.h.

ERRORS

       -FI_EAGAIN
              See fi_msg(3) for a detailed description of handling
              FI_EAGAIN.

       -FI_EOPNOTSUPP
              The requested atomic operation is not supported on this
              endpoint.

       -FI_EMSGSIZE
              The number of collective operations in a single request
              exceeds that supported by the underlying provider.

NOTES

       Collective operations map to atomic operations.  As such, they follow
       most of the conventions and restrictions of peer to peer atomic
       operations.  This includes data atomicity, data alignment, and
       message ordering semantics.  See fi_atomic(3) for additional
       information on the datatypes and operations defined for atomic and
       collective operations.

SEE ALSO

       fi_getinfo(3), fi_av(3), fi_atomic(3), fi_cm(3)

AUTHORS

       OpenFabrics.



Libfabric Programmer’s Manual     2022-12-11                  fi_collective(3)