fi_collective(3)               Libfabric v1.18.1              fi_collective(3)

NAME

       fi_join_collective
              Operation where a subset of peers join a new collective
              group.

       fi_barrier / fi_barrier2
              Collective operation that does not complete until all peers
              have entered the barrier call.

       fi_broadcast
              A single sender transmits data to all peers, including
              itself.

       fi_alltoall
              Each peer distributes a slice of its local data to all peers.

       fi_allreduce
              Collective operation where all peers provide input into an
              atomic operation, with the result distributed back to all
              peers.

       fi_allgather
              Each peer sends a complete copy of its local data to all
              peers.

       fi_reduce_scatter
              Collective call where data is collected from all peers and
              merged (reduced).  The results of the reduction are
              distributed back to the peers, with each peer receiving a
              slice of the results.

       fi_reduce
              Collective call where data is collected from all peers to a
              root peer and merged (reduced).

       fi_scatter
              A single sender distributes (scatters) a slice of its local
              data to all peers.

       fi_gather
              All peers send their data to a root peer.

       fi_query_collective
              Returns information about which collective operations are
              supported by a provider, and any limitations on the
              collective.

SYNOPSIS

              #include <rdma/fi_collective.h>

              int fi_join_collective(struct fid_ep *ep, fi_addr_t coll_addr,
                  const struct fid_av_set *set,
                  uint64_t flags, struct fid_mc **mc, void *context);

              ssize_t fi_barrier(struct fid_ep *ep, fi_addr_t coll_addr,
                  void *context);

              ssize_t fi_barrier2(struct fid_ep *ep, fi_addr_t coll_addr,
                  uint64_t flags, void *context);

              ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t count, void *desc,
                  fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_alltoall(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              int fi_query_collective(struct fid_domain *domain,
                  enum fi_collective_op coll, struct fi_collective_attr *attr,
                  uint64_t flags);

ARGUMENTS

       ep     Fabric endpoint on which to initiate the collective
              operation.

       set    Address vector set defining the collective membership.

       mc     Multicast group associated with the collective.

       buf    Local data buffer that specifies the first operand of the
              collective operation.

       datatype
              Datatype associated with the atomic operands.

       op     Atomic operation to perform.

       result Local data buffer to store the result of the collective
              operation.

       desc / result_desc
              Data descriptor associated with the local data buffer and
              local result buffer, respectively.

       coll_addr
              Address referring to the collective group of endpoints.

       root_addr
              Single endpoint that is the source or destination of
              collective data.

       flags  Additional flags to apply to the operation.

       context
              User specified pointer to associate with the operation.  This
              parameter is ignored if the operation will not generate a
              successful completion, unless an op flag specifies the
              context parameter be used for required input.

DESCRIPTION

       In general, collective operations can be thought of as coordinated
       atomic operations between a set of peer endpoints.  Readers should
       refer to the fi_atomic(3) man page for details on the atomic
       operations and datatypes defined by libfabric.

       A collective operation is a group communication exchange.  It
       involves multiple peers exchanging data with other peers
       participating in the collective call.  Collective operations require
       close coordination by all participating members.  All participants
       must invoke the same collective call before any single member can
       complete its operation locally.  As a result, collective calls can
       strain the fabric, as well as local and remote data buffers.

       Libfabric collective interfaces target fabrics that support
       offloading portions of the collective communication into network
       switches, NICs, and other devices.  However, no implementation
       requirement is placed on the provider.

       The first step in using a collective call is identifying the peer
       endpoints that will participate.  Collective membership follows one
       of two models, both supported by libfabric.  In the first model, the
       application manages the membership.  This usually means that the
       application is performing a collective operation itself using point
       to point communication to identify the members who will participate.
       Additionally, the application may be interacting with a fabric
       resource manager to reserve network resources needed to execute
       collective operations.  In this model, the application will inform
       libfabric that the membership has already been established.

       A separate model moves the membership management under libfabric and
       directly into the provider.  In this model, the application must
       identify which peer addresses will be members.  That information is
       conveyed to the libfabric provider, which is then responsible for
       coordinating the creation of the collective group.  In the provider
       managed model, the provider will usually perform the necessary
       collective operation to establish the communication group and
       interact with any fabric management agents.

       In both models, the collective membership is communicated to the
       provider by creating and configuring an address vector set (AV set).
       An AV set represents an ordered subset of addresses in an address
       vector (AV).  Details on creating and configuring an AV set are
       available in fi_av_set(3).

       Once an AV set has been programmed with the collective membership
       information, an endpoint is joined to the set.  This uses the
       fi_join_collective operation and operates asynchronously.  This
       differs from how an endpoint is associated synchronously with an AV
       using the fi_ep_bind() call.  Upon completion of the
       fi_join_collective operation, an fi_addr is provided that is used as
       the target address when invoking a collective operation.

       For developer convenience, a set of collective APIs are defined.
       Collective APIs differ from message and RMA interfaces in that the
       format of the data is known to the provider, and the collective may
       perform an operation on that data.  This aligns collective
       operations closely with the atomic interfaces.

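       As an illustration of the setup flow described above, the following
       fragment builds an AV set from a contiguous range of previously
       inserted addresses and retrieves the address that will later be
       passed to fi_join_collective.  This is a minimal sketch only: the av
       handle, the group size, and the assumption that members occupy
       fi_addr values 0 through nranks - 1 are illustrative, and error
       handling is omitted; see fi_av_set(3) for the complete attribute
       list.

              size_t nranks = 3;                /* assumed group size */
              struct fi_av_set_attr av_set_attr = {
                  .count      = nranks,         /* expected number of members */
                  .start_addr = 0,              /* first member's fi_addr */
                  .end_addr   = nranks - 1,     /* last member's fi_addr */
                  .stride     = 1,              /* take every address in range */
              };
              struct fid_av_set *av_set;
              fi_addr_t world_addr;

              /* Describe the collective membership as an ordered subset
                 of addresses already inserted into the AV. */
              fi_av_set(av, &av_set_attr, &av_set, NULL);

              /* Address used as coll_addr when joining the collective. */
              fi_av_set_addr(av_set, &world_addr);
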
   Join Collective (fi_join_collective)
       This call attaches an endpoint to a collective membership group.
       Libfabric treats collective members as a multicast group, and the
       fi_join_collective call attaches the endpoint to that multicast
       group.  By default, the endpoint will join the group based on the
       data transfer capabilities of the endpoint.  For example, if the
       endpoint has been configured to both send and receive data, then the
       endpoint will be able to initiate and receive transfers to and from
       the collective.  The input flags may be used to restrict access to
       the collective group, subject to endpoint capability limitations.

       Join collective operations complete asynchronously, and may involve
       fabric transfers, dependent on the provider implementation.  An
       endpoint must be bound to an event queue prior to calling
       fi_join_collective.  The result of the join operation will be
       reported to the EQ as an FI_JOIN_COMPLETE event.  Applications
       cannot issue collective transfers until receiving notification that
       the join operation has completed.  Note that an endpoint may begin
       receiving messages from the collective group as soon as the join
       completes, which can occur prior to the FI_JOIN_COMPLETE event being
       generated.

       The join collective operation is itself a collective operation.  All
       participating peers must call fi_join_collective before any
       individual peer will report that the join has completed.
       Application managed collective memberships are an exception.  With
       application managed memberships, the fi_join_collective call may be
       completed locally without fabric communication.  For provider
       managed memberships, the join collective call requires as input a
       coll_addr that refers to either an address associated with an AV set
       (see fi_av_set_addr) or an existing collective group (obtained
       through a previous call to fi_join_collective).  The
       fi_join_collective call will create a new collective subgroup.  If
       application managed memberships are used, coll_addr should be set to
       FI_ADDR_UNAVAIL.

       Applications must call fi_close on the collective group to
       disconnect the endpoint from the group.  After a join operation has
       completed, the fi_mc_addr call may be used to retrieve the address
       associated with the multicast group.  See fi_cm(3) for additional
       details on fi_mc_addr().

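       Building on the earlier sketch, the fragment below joins an endpoint
       to the collective and waits for the join to complete.  It assumes ep
       is bound to the event queue eq, that av_set and world_addr come from
       the previous example, and it omits error handling.

              struct fid_mc *coll_mc;
              struct fi_eq_entry eq_entry;
              uint32_t event;
              fi_addr_t coll_addr;

              /* Join the collective group defined by the AV set. */
              fi_join_collective(ep, world_addr, av_set, 0, &coll_mc, NULL);

              /* Block until the join completes; collective transfers may
                 only be issued after FI_JOIN_COMPLETE is reported.  Error
                 events are not handled in this sketch. */
              fi_eq_sread(eq, &event, &eq_entry, sizeof(eq_entry), -1, 0);

              /* Address used as coll_addr in subsequent collective calls. */
              coll_addr = fi_mc_addr(coll_mc);
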
   Barrier (fi_barrier)
       The fi_barrier operation provides a mechanism to synchronize peers.
       Barrier does not result in any data being transferred at the
       application level.  A barrier does not complete locally until all
       peers have invoked the barrier call.  This signifies to the local
       application that all work completed by peers prior to their entering
       the barrier has finished.

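       As an illustration of the calling sequence, the fragment below
       enters a barrier and waits for its local completion.  It assumes
       coll_addr was obtained from a completed join, that ep is bound to
       the completion queue cq, and it omits error and -FI_EAGAIN handling.

              struct fi_cq_entry comp;

              /* Enter the barrier; it completes only after all peers
                 have entered it. */
              fi_barrier(ep, coll_addr, NULL);

              /* Wait for the local completion of the barrier. */
              fi_cq_sread(cq, &comp, 1, NULL, -1);
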
   Barrier (fi_barrier2)
       The fi_barrier2 operation is the same as fi_barrier, but with an
       extra parameter to pass in operation flags.

   Broadcast (fi_broadcast)
       fi_broadcast transfers an array of data from a single sender to all
       other members of the collective group.  The input buf parameter is
       treated as the transmit buffer if the local rank is the root,
       otherwise it is the receive buffer.  The broadcast operation acts as
       an atomic write or read to a data array.  As a result, the format of
       the data in buf is specified through the datatype parameter.  Any
       non-void datatype may be broadcast.

       The following diagram shows an example of broadcast being used to
       transfer an array of integers to a group of peers.

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
               |____^    ^
               |_________|
               broadcast

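       A fragment mirroring the diagram above is shown below.  The
       root_addr, coll_addr, and cq objects are carried over from the
       earlier sketches, and passing NULL descriptors assumes the provider
       does not require local memory registration.

              uint32_t data[3] = { 1, 5, 9 };   /* significant at the root */
              struct fi_cq_entry comp;

              /* buf is the transmit buffer at the root and the receive
                 buffer on every other peer. */
              fi_broadcast(ep, data, 3, NULL, coll_addr, root_addr,
                           FI_UINT32, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
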
   All to All (fi_alltoall)
       The fi_alltoall collective involves distributing (or scattering)
       different portions of an array of data to peers.  It is best
       explained using an example.  Here three peers perform an all to all
       collective to exchange different entries in an integer array.

              [1]   [2]   [3]
              [5]   [6]   [7]
              [9]  [10]  [11]
                 \   |   /
                 All to all
                 /   |   \
              [1]   [5]   [9]
              [2]   [6]  [10]
              [3]   [7]  [11]

       Each peer sends a piece of its data to the other peers.

       All to all operations may be performed on any non-void datatype.
       However, all to all does not perform an operation on the data
       itself, so no operation is specified.

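       The fragment below corresponds to the three peer example above.  The
       buffer sizing, including the interpretation of count as the number
       of elements in the local array, simply mirrors the diagram and is an
       assumption, as are coll_addr and the completion handling.

              uint32_t send[3] = { 1, 5, 9 };   /* this peer's column above */
              uint32_t recv[3];                 /* one entry from each peer */
              struct fi_cq_entry comp;

              fi_alltoall(ep, send, 3, NULL, recv, NULL, coll_addr,
                          FI_UINT32, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
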
   All Reduce (fi_allreduce)
       fi_allreduce can be described as all peers providing input into an
       atomic operation, with the result copied back to each peer.
       Conceptually, this can be viewed as each peer issuing a multicast
       atomic operation to all other peers, fetching the results, and
       combining them.  The combining of the results is referred to as the
       reduction.  The fi_allreduce() operation takes as input an array of
       data and the specified atomic operation to perform.  The results of
       the reduction are written into the result buffer.

       Any non-void datatype may be specified.  Valid atomic operations are
       listed below in the fi_query_collective call.  The following diagram
       shows an example of an all reduce operation involving summing an
       array of integers between three peers.

               [1]  [1]  [1]
               [5]  [5]  [5]
               [9]  [9]  [9]
                 \   |   /
                    sum
                 /   |   \
               [3]  [3]  [3]
              [15] [15] [15]
              [27] [27] [27]
                All Reduce

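       Continuing the same pattern, the fragment below performs the
       summation shown above.  Buffer sizes, coll_addr, and the completion
       handling are assumptions carried over from the earlier sketches.

              uint64_t local[3]  = { 1, 5, 9 };
              uint64_t global[3];               /* receives { 3, 15, 27 } */
              struct fi_cq_entry comp;

              fi_allreduce(ep, local, 3, NULL, global, NULL, coll_addr,
                           FI_UINT64, FI_SUM, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
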
   All Gather (fi_allgather)
       Conceptually, all gather can be viewed as the opposite of the
       scatter component from reduce-scatter.  All gather collects data
       from all peers into a single array, then copies that array back to
       each peer.

              [1]  [5]  [9]
                \   |   /
               All gather
                /   |   \
              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]

       All gather may be performed on any non-void datatype.  However, all
       gather does not perform an operation on the data itself, so no
       operation is specified.

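       A matching fragment is shown below.  Each peer contributes one
       element, and the result buffer is assumed to hold one element per
       group member, as in the diagram; coll_addr and cq carry over from
       the earlier sketches.

              uint64_t mine = 1;                /* 1, 5, or 9 depending on peer */
              uint64_t everyone[3];             /* receives { 1, 5, 9 } */
              struct fi_cq_entry comp;

              fi_allgather(ep, &mine, 1, NULL, everyone, NULL, coll_addr,
                           FI_UINT64, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
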
   Reduce-Scatter (fi_reduce_scatter)
       The fi_reduce_scatter collective is similar to an fi_allreduce
       operation, followed by all to all.  With reduce scatter, all peers
       provide input into an atomic operation, similar to all reduce.
       However, rather than the full result being copied to each peer, each
       participant receives only a slice of the result.

       This is shown by the following example:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                    |
                   [3]
                  [15]
                  [27]
                    |
                 scatter
                /   |   \
              [3] [15] [27]

       The reduce scatter call supports the same datatype and atomic
       operation as fi_allreduce.

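       A matching fragment is shown below.  Each peer contributes three
       elements and, under the sizing assumed here from the diagram,
       receives a single element of the summed result; the completion
       handling and coll_addr are assumptions as before.

              uint64_t local[3] = { 1, 5, 9 };
              uint64_t slice;                   /* 3, 15, or 27 per peer */
              struct fi_cq_entry comp;

              fi_reduce_scatter(ep, local, 3, NULL, &slice, NULL, coll_addr,
                                FI_UINT64, FI_SUM, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
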
   Reduce (fi_reduce)
       The fi_reduce collective is the first half of an fi_allreduce
       operation.  With reduce, all peers provide input into an atomic
       operation, with the results collected by a single `root' endpoint.

       This is shown by the following example, with the leftmost peer
       identified as the root:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                  /
               [3]
              [15]
              [27]

       The reduce call supports the same datatype and atomic operation as
       fi_allreduce.

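       The corresponding fragment is shown below; root_addr, coll_addr, and
       the buffer sizes are assumptions, and the result buffer is only
       meaningful at the root.

              uint64_t local[3] = { 1, 5, 9 };
              uint64_t total[3];                /* { 3, 15, 27 } at the root */
              struct fi_cq_entry comp;

              fi_reduce(ep, local, 3, NULL, total, NULL, coll_addr,
                        root_addr, FI_UINT64, FI_SUM, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
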
   Scatter (fi_scatter)
       The fi_scatter collective is the second half of an fi_reduce_scatter
       operation.  The data from a single `root' endpoint is split and
       distributed to all peers.

       This is shown by the following example:

               [3]
              [15]
              [27]
                  \
                 scatter
                /   |   \
              [3] [15] [27]

       The scatter operation is used to distribute results to the peers.
       No atomic operation is performed on the data.

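       A matching fragment is shown below.  The count of 3 mirrors the
       root's array in the diagram and is an assumption, as are root_addr,
       coll_addr, and the one element receive buffer.

              uint64_t results[3] = { 3, 15, 27 };  /* significant at root */
              uint64_t slice;                       /* 3, 15, or 27 per peer */
              struct fi_cq_entry comp;

              fi_scatter(ep, results, 3, NULL, &slice, NULL, coll_addr,
                         root_addr, FI_UINT64, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
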
   Gather (fi_gather)
       The fi_gather operation is used to collect (gather) the results from
       all peers and store them at a `root' peer.

       This is shown by the following example, with the leftmost peer
       identified as the root.

              [1]  [5]  [9]
                \   |   /
                  gather
                 /
              [1]
              [5]
              [9]

       The gather operation does not perform any operation on the data
       itself.

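       And the matching gather fragment, with the same assumptions as
       above; the gathered array is only meaningful at the root.

              uint64_t mine = 1;                /* 1, 5, or 9 depending on peer */
              uint64_t gathered[3];             /* { 1, 5, 9 } at the root */
              struct fi_cq_entry comp;

              fi_gather(ep, &mine, 1, NULL, gathered, NULL, coll_addr,
                        root_addr, FI_UINT64, 0, NULL);
              fi_cq_sread(cq, &comp, 1, NULL, -1);
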
   Query Collective Attributes (fi_query_collective)
       The fi_query_collective call reports which collective operations are
       supported by the underlying provider, for suitably configured
       endpoints.  Collective operations needed by an application that are
       not supported by the provider must be implemented by the
       application.  The query call checks whether a provider supports a
       specific collective operation for a given datatype and operation, if
       applicable.

       The name of the collective, as well as the datatype and associated
       operation, if applicable, are provided as input into
       fi_query_collective.

       The coll parameter may reference one of these collectives:
       FI_BARRIER, FI_BROADCAST, FI_ALLTOALL, FI_ALLREDUCE, FI_ALLGATHER,
       FI_REDUCE_SCATTER, FI_REDUCE, FI_SCATTER, or FI_GATHER.  Additional
       details on the collective operation are specified through the struct
       fi_collective_attr parameter.  For collectives that act on data, the
       operation and related data type must be specified through the given
       attributes.

              struct fi_collective_attr {
                  enum fi_op op;
                  enum fi_datatype datatype;
                  struct fi_atomic_attr datatype_attr;
                  size_t max_members;
                  uint64_t mode;
              };

       For a description of struct fi_atomic_attr, see fi_atomic(3).

       op     On input, this specifies the atomic operation involved with
              the collective call.  This should be set to one of the
              following values: FI_MIN, FI_MAX, FI_SUM, FI_PROD, FI_LOR,
              FI_LAND, FI_BOR, FI_BAND, FI_LXOR, FI_BXOR, FI_ATOMIC_READ,
              FI_ATOMIC_WRITE, or FI_NOOP.  For collectives that do not
              exchange application data (fi_barrier), this should be set to
              FI_NOOP.

       datatype
              On input, specifies the datatype of the data being modified
              by the collective.  This should be set to one of the
              following values: FI_INT8, FI_UINT8, FI_INT16, FI_UINT16,
              FI_INT32, FI_UINT32, FI_INT64, FI_UINT64, FI_FLOAT,
              FI_DOUBLE, FI_FLOAT_COMPLEX, FI_DOUBLE_COMPLEX,
              FI_LONG_DOUBLE, FI_LONG_DOUBLE_COMPLEX, or FI_VOID.  For
              collectives that do not exchange application data
              (fi_barrier), this should be set to FI_VOID.

       datatype_attr.count
              The maximum number of elements that may be used with the
              collective.

       datatype_attr.size
              The size of the datatype as supported by the provider.
              Applications should validate the size of datatypes that
              differ based on the platform, such as FI_LONG_DOUBLE.

       max_members
              The maximum number of peers that may participate in a
              collective operation.

       mode   This field is reserved and should be 0.

       If a collective operation is supported, the query call will return
       FI_SUCCESS, along with attributes on the limits for using that
       collective operation through the provider.

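       For example, the following fragment checks whether the provider
       supports an all reduce summation of 64-bit unsigned integers; the
       domain handle is assumed to have been opened already.

              struct fi_collective_attr attr = {
                  .op       = FI_SUM,
                  .datatype = FI_UINT64,
                  .mode     = 0,
              };

              if (fi_query_collective(domain, FI_ALLREDUCE, &attr, 0) ==
                  FI_SUCCESS) {
                  /* Supported: attr.max_members and attr.datatype_attr.count
                     now describe the provider's limits for this collective. */
              }
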
   Completions
       Collective operations map to underlying fi_atomic operations.  For a
       discussion of atomic completion semantics, see fi_atomic(3).  The
       completion, ordering, and atomicity of collective operations match
       those defined for point to point atomic operations.

FLAGS

       The following flags are defined for the specified operations.

       FI_SCATTER
              Applies to fi_query_collective.  When set, requests attribute
              information on the reduce-scatter collective operation.

RETURN VALUE

       Returns 0 on success.  On error, a negative value corresponding to
       fabric errno is returned.  Fabric errno values are defined in
       rdma/fi_errno.h.

ERRORS

       -FI_EAGAIN
              See fi_msg(3) for a detailed description of handling
              FI_EAGAIN.

       -FI_EOPNOTSUPP
              The requested atomic operation is not supported on this
              endpoint.

       -FI_EMSGSIZE
              The number of collective operations in a single request
              exceeds that supported by the underlying provider.

NOTES

       Collective operations map to atomic operations.  As such, they
       follow most of the same conventions and restrictions as peer to peer
       atomic operations.  This includes data atomicity, data alignment,
       and message ordering semantics.  See fi_atomic(3) for additional
       information on the datatypes and operations defined for atomic and
       collective operations.

SEE ALSO

       fi_getinfo(3), fi_av(3), fi_atomic(3), fi_cm(3)

AUTHORS

       OpenFabrics.



Libfabric Programmer’s Manual     2023-01-02                  fi_collective(3)