fi_collective(3)               Libfabric v1.10.0              fi_collective(3)

NAME

       fi_join_collective
              Operation where a subset of peers join a new collective group.

       fi_barrier
              Collective operation that does not complete until all peers have
              entered the barrier call.

       fi_broadcast
              A single sender transmits data to all peers, including itself.

       fi_alltoall
              Each peer distributes a slice of its local data to all peers.

       fi_allreduce
              Collective operation where all peers broadcast an atomic
              operation to all other peers.

       fi_allgather
              Each peer sends a complete copy of its local data to all peers.

       fi_reduce_scatter
              Collective call where data is collected from all peers and
              merged (reduced). The results of the reduction are distributed
              back to the peers, with each peer receiving a slice of the
              results.

       fi_reduce
              Collective call where data is collected from all peers to a root
              peer and merged (reduced).

       fi_scatter
              A single sender distributes (scatters) a slice of its local data
              to all peers.

       fi_gather
              All peers send their data to a root peer.

       fi_query_collective
              Returns information about which collective operations are
              supported by a provider, and limitations on the collective.

SYNOPSIS

              #include <rdma/fi_collective.h>

              int fi_join_collective(struct fid_ep *ep, fi_addr_t coll_addr,
                  const struct fid_av_set *set,
                  uint64_t flags, struct fid_mc **mc, void *context);

              ssize_t fi_barrier(struct fid_ep *ep, fi_addr_t coll_addr,
                  void *context);

              ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t count, void *desc,
                  fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_alltoall(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              int fi_query_collective(struct fid_domain *domain,
                  enum fi_collective_op coll, struct fi_collective_attr *attr,
                  uint64_t flags);

ARGUMENTS

       ep     Fabric endpoint on which to initiate collective operation.

       set    Address vector set defining the collective membership.

       mc     Multicast group associated with the collective.

       buf    Local data buffer that specifies first operand of collective
              operation.

       datatype
              Datatype associated with atomic operands.

       op     Atomic operation to perform.

       result Local data buffer to store the result of the collective
              operation.

       desc / result_desc
              Data descriptor associated with the local data buffer and local
              result buffer, respectively.

       coll_addr
              Address referring to the collective group of endpoints.

       root_addr
              Single endpoint that is the source or destination of collective
              data.

       flags  Additional flags to apply for the atomic operation.

       context
              User specified pointer to associate with the operation. This
              parameter is ignored if the operation will not generate a
              successful completion, unless an op flag specifies the context
              parameter be used for required input.

DESCRIPTION (EXPERIMENTAL APIs)

       The collective APIs are new to the 1.9 libfabric release. Although
       efforts have been made to design the APIs such that they align well
       with applications and are implementable by the providers, the APIs
       should be considered experimental and may be subject to change in
       future versions of the library until the experimental tag has been
       removed.

       In general, collective operations can be thought of as coordinated
       atomic operations between a set of peer endpoints. Readers should
       refer to the fi_atomic(3) man page for details on the atomic
       operations and datatypes defined by libfabric.

       A collective operation is a group communication exchange. It involves
       multiple peers exchanging data with other peers participating in the
       collective call. Collective operations require close coordination by
       all participating members. All participants must invoke the same
       collective call before any single member can complete its operation
       locally. As a result, collective calls can strain the fabric, as well
       as local and remote data buffers.

       Libfabric collective interfaces target fabrics that support offloading
       portions of the collective communication into network switches, NICs,
       and other devices. However, no implementation requirement is placed on
       the provider.

       The first step in using a collective call is identifying the peer
       endpoints that will participate. Collective membership follows one of
       two models, both supported by libfabric. In the first model, the
       application manages the membership. This usually means that the
       application is itself performing a collective operation, using
       point-to-point communication, to identify the members who will
       participate. Additionally, the application may be interacting with a
       fabric resource manager to reserve network resources needed to execute
       collective operations. In this model, the application will inform
       libfabric that the membership has already been established.

       A separate model moves the membership management under libfabric and
       directly into the provider. In this model, the application must
       identify which peer addresses will be members. That information is
       conveyed to the libfabric provider, which is then responsible for
       coordinating the creation of the collective group. In the
       provider-managed model, the provider will usually perform the
       necessary collective operation to establish the communication group
       and interact with any fabric management agents.

       In both models, the collective membership is communicated to the
       provider by creating and configuring an address vector set (AV set).
       An AV set represents an ordered subset of addresses in an address
       vector (AV). Details on creating and configuring an AV set are
       available in fi_av_set(3).

       Once an AV set has been programmed with the collective membership
       information, an endpoint is joined to the set. This uses the
       fi_join_collective operation and operates asynchronously. This differs
       from how an endpoint is associated synchronously with an AV using the
       fi_ep_bind() call. Upon completion of the fi_join_collective
       operation, an fi_addr is provided that is used as the target address
       when invoking a collective operation.

       For developer convenience, a set of collective APIs is defined.
       Collective APIs differ from message and RMA interfaces in that the
       format of the data is known to the provider, and the collective may
       perform an operation on that data. This aligns collective operations
       closely with the atomic interfaces.

   Join Collective (fi_join_collective)
       This call attaches an endpoint to a collective membership group.
       Libfabric treats collective members as a multicast group, and the
       fi_join_collective call attaches the endpoint to that multicast group.
       By default, the endpoint will join the group based on the data
       transfer capabilities of the endpoint. For example, if the endpoint
       has been configured to both send and receive data, then the endpoint
       will be able to initiate and receive transfers to and from the
       collective. The input flags may be used to restrict access to the
       collective group, subject to endpoint capability limitations.

       Join collective operations complete asynchronously, and may involve
       fabric transfers, dependent on the provider implementation. An
       endpoint must be bound to an event queue prior to calling
       fi_join_collective. The result of the join operation will be reported
       to the EQ as an FI_JOIN_COMPLETE event. Applications cannot issue
       collective transfers until receiving notification that the join
       operation has completed. Note that an endpoint may begin receiving
       messages from the collective group as soon as the join completes,
       which can occur prior to the FI_JOIN_COMPLETE event being generated.

       The join collective operation is itself a collective operation. All
       participating peers must call fi_join_collective before any individual
       peer will report that the join has completed. Application managed
       collective memberships are an exception. With application managed
       memberships, the fi_join_collective call may be completed locally
       without fabric communication. For provider managed memberships, the
       join collective call requires as input a coll_addr that refers to
       either an address associated with an AV set (see fi_av_set_addr) or an
       existing collective group (obtained through a previous call to
       fi_join_collective). The fi_join_collective call will create a new
       collective subgroup. If application managed memberships are used,
       coll_addr should be set to FI_ADDR_UNAVAIL.

       Applications must call fi_close on the collective group to disconnect
       the endpoint from the group. After a join operation has completed, the
       fi_mc_addr call may be used to retrieve the address associated with
       the multicast group. See fi_cm(3) for additional details on
       fi_mc_addr().

   Barrier (fi_barrier)
       The fi_barrier operation provides a mechanism to synchronize peers.
       Barrier does not result in any data being transferred at the
       application level. A barrier does not complete locally until all peers
       have invoked the barrier call. This signifies to the local application
       that the work peers completed before they called barrier has finished.

   Broadcast (fi_broadcast)
       fi_broadcast transfers an array of data from a single sender to all
       other members of the collective group. The input buf parameter is
       treated as the transmit buffer if the local rank is the root,
       otherwise it is the receive buffer. The broadcast operation acts as an
       atomic write or read to a data array. As a result, the format of the
       data in buf is specified through the datatype parameter. Any non-void
       datatype may be broadcast.

       The following diagram shows an example of broadcast being used to
       transfer an array of integers to a group of peers.

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
               |____^    ^
               |_________|
               broadcast

   All to All (fi_alltoall)
       The fi_alltoall collective involves distributing (or scattering)
       different portions of an array of data to peers. It is best explained
       using an example. Here three peers perform an all to all collective to
       exchange different entries in an integer array.

              [1]   [2]   [3]
              [5]   [6]   [7]
              [9]  [10]  [11]
                 \   |   /
                 All to all
                 /   |   \
              [1]   [5]   [9]
              [2]   [6]  [10]
              [3]   [7]  [11]

       Each peer sends a piece of its data to the other peers.

       All to all operations may be performed on any non-void datatype.
       However, all to all does not perform an operation on the data itself,
       so no operation is specified.
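
       The exchange in the diagram above amounts to a transpose of the
       per-peer arrays. The sketch below models that data movement in plain
       C inside a single process. The function sim_alltoall and its flat,
       one-row-per-peer layout are assumptions made for illustration; they
       are not part of libfabric and no fabric traffic is involved.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of all to all, not a libfabric call.
 * send and recv each hold npeers rows of npeers ints, stored row by row;
 * row p belongs to peer p.
 */
void sim_alltoall(const int *send, int *recv, size_t npeers)
{
    assert(send != NULL && recv != NULL);
    for (size_t p = 0; p < npeers; p++) {
        for (size_t q = 0; q < npeers; q++) {
            /* Peer p stores, in slot q, the entry peer q addressed to p. */
            recv[p * npeers + q] = send[q * npeers + p];
        }
    }
}
```

       Called with the three input columns of the diagram stored as rows
       {1,5,9}, {2,6,10}, {3,7,11}, the model leaves the rows {1,2,3},
       {5,6,7}, {9,10,11} in recv.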

   All Reduce (fi_allreduce)
       fi_allreduce can be described as all peers providing input into an
       atomic operation, with the result copied back to each peer.
       Conceptually, this can be viewed as each peer issuing a multicast
       atomic operation to all other peers, fetching the results, and
       combining them. The combining of the results is referred to as the
       reduction. The fi_allreduce() operation takes as input an array of
       data and the specified atomic operation to perform. The results of the
       reduction are written into the result buffer.

       Any non-void datatype may be specified. Valid atomic operations are
       listed below in the description of fi_query_collective. The following
       diagram shows an example of an all reduce operation involving summing
       an array of integers between three peers.

               [1]  [1]  [1]
               [5]  [5]  [5]
               [9]  [9]  [9]
                 \   |   /
                    sum
                 /   |   \
               [3]  [3]  [3]
              [15] [15] [15]
              [27] [27] [27]
                All Reduce
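
       As a rough, single-process model of the summing all reduce shown
       above, the following C sketch reduces each array position across the
       peers and writes the full result to every peer. The name
       sim_allreduce_sum and the row-per-peer buffer layout are hypothetical;
       a real application would call fi_allreduce with FI_SUM instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a summing all reduce, not a libfabric call.
 * in and out each hold npeers rows of count ints; row p belongs to peer p.
 */
void sim_allreduce_sum(const int *in, int *out, size_t npeers, size_t count)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        int sum = 0;
        for (size_t p = 0; p < npeers; p++)
            sum += in[p * count + i];   /* the reduction step */
        for (size_t p = 0; p < npeers; p++)
            out[p * count + i] = sum;   /* every peer gets the full result */
    }
}
```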

   All Gather (fi_allgather)
       Conceptually, all gather can be viewed as the opposite of the scatter
       component from reduce-scatter. All gather collects data from all peers
       into a single array, then copies that array back to each peer.

              [1]  [5]  [9]
                \   |   /
               All gather
                /   |   \
              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]

       All gather may be performed on any non-void datatype. However, all
       gather does not perform an operation on the data itself, so no
       operation is specified.
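
       The collect-then-copy behavior can be sketched in a few lines of
       plain C. This is a single-process illustration only: sim_allgather
       and its buffer layout are assumptions made for the example and do not
       correspond to any libfabric function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of all gather, not a libfabric call.
 * in holds the contributions of all peers back to back (count ints each).
 * out holds npeers rows of npeers * count ints; row p belongs to peer p.
 */
void sim_allgather(const int *in, int *out, size_t npeers, size_t count)
{
    size_t total = npeers * count;   /* size of the gathered array */

    assert(in != NULL && out != NULL);
    for (size_t p = 0; p < npeers; p++) {
        /* Each peer ends up with a complete copy of the gathered array. */
        memcpy(out + p * total, in, total * sizeof(int));
    }
}
```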

   Reduce-Scatter (fi_reduce_scatter)
       The fi_reduce_scatter collective is similar to an fi_allreduce
       operation, followed by all to all. With reduce scatter, all peers
       provide input into an atomic operation, similar to all reduce.
       However, rather than the full result being copied to each peer, each
       participant receives only a slice of the result.

       This is shown by the following example:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                    |
                   [3]
                  [15]
                  [27]
                    |
                 scatter
                /   |   \
              [3] [15] [27]

       The reduce scatter call supports the same datatype and atomic
       operation as fi_allreduce.
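
       The two stages in the diagram can be folded into one loop: each peer
       only needs the sum of the slice it will keep. The C sketch below
       models this in a single process. The function name
       sim_reduce_scatter_sum and the flat buffer layout are hypothetical and
       are not part of libfabric.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a summing reduce-scatter, not a libfabric call.
 * in holds npeers rows of npeers * count ints; row q is peer q's input.
 * out holds npeers rows of count ints; row p is the slice kept by peer p.
 */
void sim_reduce_scatter_sum(const int *in, int *out,
                            size_t npeers, size_t count)
{
    size_t row = npeers * count;   /* ints contributed by each peer */

    assert(in != NULL && out != NULL);
    for (size_t p = 0; p < npeers; p++) {
        for (size_t i = 0; i < count; i++) {
            int sum = 0;
            for (size_t q = 0; q < npeers; q++)
                sum += in[q * row + p * count + i];
            out[p * count + i] = sum;   /* peer p keeps only slice p */
        }
    }
}
```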

   Reduce (fi_reduce)
       The fi_reduce collective is the first half of an fi_allreduce
       operation. With reduce, all peers provide input into an atomic
       operation, with the results collected by a single 'root' endpoint.

       This is shown by the following example, with the leftmost peer
       identified as the root:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                  /
               [3]
              [15]
              [27]

       The reduce call supports the same datatype and atomic operation as
       fi_allreduce.
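
       Because the reduction operation is a parameter of the call, a model
       of reduce is naturally written against a callback. The following C
       sketch is a single-process illustration under that assumption; the
       names reduce_fn, op_sum, op_max and sim_reduce are hypothetical and
       stand in for the enum fi_op values that a real fi_reduce call takes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for reduction operations such as FI_SUM, FI_MAX. */
typedef int (*reduce_fn)(int, int);

int op_sum(int a, int b) { return a + b; }
int op_max(int a, int b) { return a > b ? a : b; }

/*
 * Hypothetical model of reduce, not a libfabric call.
 * in holds npeers rows of count ints; root_out holds count ints and is
 * the only buffer that receives the result.
 */
void sim_reduce(const int *in, int *root_out,
                size_t npeers, size_t count, reduce_fn op)
{
    assert(in != NULL && root_out != NULL && op != NULL && npeers > 0);
    for (size_t i = 0; i < count; i++) {
        int acc = in[i];   /* start from peer 0's operand */
        for (size_t p = 1; p < npeers; p++)
            acc = op(acc, in[p * count + i]);
        root_out[i] = acc;
    }
}
```

       With op_sum and three peers that each supply {1,5,9}, the root buffer
       receives {3,15,27}, matching the diagram.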

   Scatter (fi_scatter)
       The fi_scatter collective is the second half of an fi_reduce_scatter
       operation. The data from a single 'root' endpoint is split and
       distributed to all peers.

       This is shown by the following example:

               [3]
              [15]
              [27]
                  \
                 scatter
                /   |   \
              [3] [15] [27]

       The scatter operation is used to distribute results to the peers. No
       atomic operation is performed on the data.

   Gather (fi_gather)
       The fi_gather operation is used to collect (gather) the results from
       all peers and store them at a 'root' peer.

       This is shown by the following example, with the leftmost peer
       identified as the root.

              [1]  [5]  [9]
                \   |   /
                  gather
                 /
              [1]
              [5]
              [9]

       The gather operation does not perform any operation on the data itself.
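
       Scatter and gather move the same slices in opposite directions, so a
       scatter followed by a gather returns the root's original array. The
       C sketch below models both in a single process. The functions
       sim_scatter and sim_gather and the array-of-pointers layout are
       assumptions made for illustration and are not libfabric calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical models, not libfabric calls. peer_bufs[p] is the buffer
 * of peer p and holds count ints; root_buf holds npeers * count ints.
 */
void sim_scatter(const int *root_buf, int *const *peer_bufs,
                 size_t npeers, size_t count)
{
    assert(root_buf != NULL && peer_bufs != NULL);
    for (size_t p = 0; p < npeers; p++) {
        /* Peer p receives slice p of the root's array. */
        memcpy(peer_bufs[p], root_buf + p * count, count * sizeof(int));
    }
}

void sim_gather(int *const *peer_bufs, int *root_buf,
                size_t npeers, size_t count)
{
    assert(root_buf != NULL && peer_bufs != NULL);
    for (size_t p = 0; p < npeers; p++) {
        /* The root stores peer p's data in slice p of its array. */
        memcpy(root_buf + p * count, peer_bufs[p], count * sizeof(int));
    }
}
```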

   Query Collective Attributes (fi_query_collective)
       The fi_query_collective call reports which collective operations are
       supported by the underlying provider, for suitably configured
       endpoints. Collective operations needed by an application that are not
       supported by the provider must be implemented by the application. The
       query call checks whether a provider supports a specific collective
       operation for a given datatype and operation, if applicable.

       The name of the collective, as well as the datatype and associated
       operation, if applicable, are provided as input into
       fi_query_collective.

       The coll parameter may reference one of these collectives: FI_BARRIER,
       FI_BROADCAST, FI_ALLTOALL, FI_ALLREDUCE, FI_ALLGATHER,
       FI_REDUCE_SCATTER, FI_REDUCE, FI_SCATTER, or FI_GATHER. Additional
       details on the collective operation are specified through the struct
       fi_collective_attr parameter. For collectives that act on data, the
       operation and related data type must be specified through the given
       attributes.

              struct fi_collective_attr {
                  enum fi_op op;
                  enum fi_datatype datatype;
                  struct fi_atomic_attr datatype_attr;
                  size_t max_members;
                  uint64_t mode;
              };

       For a description of struct fi_atomic_attr, see fi_atomic(3).

       op     On input, this specifies the atomic operation involved with the
              collective call. This should be set to one of the following
              values: FI_MIN, FI_MAX, FI_SUM, FI_PROD, FI_LOR, FI_LAND,
              FI_BOR, FI_BAND, FI_LXOR, FI_BXOR, FI_ATOMIC_READ,
              FI_ATOMIC_WRITE, or FI_NOOP. For collectives that do not
              exchange application data (fi_barrier), this should be set to
              FI_NOOP.

       datatype
              On input, specifies the datatype of the data being modified by
              the collective. This should be set to one of the following
              values: FI_INT8, FI_UINT8, FI_INT16, FI_UINT16, FI_INT32,
              FI_UINT32, FI_INT64, FI_UINT64, FI_FLOAT, FI_DOUBLE,
              FI_FLOAT_COMPLEX, FI_DOUBLE_COMPLEX, FI_LONG_DOUBLE,
              FI_LONG_DOUBLE_COMPLEX, or FI_VOID. For collectives that do not
              exchange application data (fi_barrier), this should be set to
              FI_VOID.

       datatype_attr.count
              The maximum number of elements that may be used with the
              collective.

       datatype_attr.size
              The size of the datatype as supported by the provider.
              Applications should validate the size of datatypes that differ
              based on the platform, such as FI_LONG_DOUBLE.

       max_members
              The maximum number of peers that may participate in a collective
              operation.

       mode   This field is reserved and should be 0.

       If a collective operation is supported, the query call will return
       FI_SUCCESS, along with attributes on the limits for using that
       collective operation through the provider.

   Completions
       Collective operations map to underlying fi_atomic operations. For a
       discussion of atomic completion semantics, see fi_atomic(3). The
       completion, ordering, and atomicity of collective operations match
       those defined for point-to-point atomic operations.

FLAGS

       The following flags are defined for the specified operations.

       FI_SCATTER
              Applies to fi_query_collective. When set, requests attribute
              information on the reduce-scatter collective operation.

RETURN VALUE

       Returns 0 on success. On error, a negative value corresponding to
       fabric errno is returned. Fabric errno values are defined in
       rdma/fi_errno.h.

ERRORS

       -FI_EAGAIN
              See fi_msg(3) for a detailed description of handling FI_EAGAIN.

       -FI_EOPNOTSUPP
              The requested atomic operation is not supported on this
              endpoint.

       -FI_EMSGSIZE
              The number of collective operations in a single request exceeds
              that supported by the underlying provider.

NOTES

       Collective operations map to atomic operations. As such, they follow
       most of the same conventions and restrictions as point-to-point atomic
       operations. This includes data atomicity, data alignment, and message
       ordering semantics. See fi_atomic(3) for additional information on the
       datatypes and operations defined for atomic and collective operations.

SEE ALSO

       fi_getinfo(3), fi_av(3), fi_atomic(3), fi_cm(3)

AUTHORS

       OpenFabrics.



Libfabric Programmer's Manual     2020-04-13                  fi_collective(3)