fi_collective(3)               Libfabric v1.17.0              fi_collective(3)


NAME
       fi_join_collective
              Operation where a subset of peers join a new collective group.

       fi_barrier / fi_barrier2
              Collective operation that does not complete until all peers
              have entered the barrier call.

       fi_broadcast
              A single sender transmits data to all peers, including itself.

       fi_alltoall
              Each peer distributes a slice of its local data to all peers.

       fi_allreduce
              Collective operation where all peers provide input to an
              atomic operation, with the result distributed back to all
              peers.

       fi_allgather
              Each peer sends a complete copy of its local data to all
              peers.

       fi_reduce_scatter
              Collective call where data is collected from all peers and
              merged (reduced).  The results of the reduction are
              distributed back to the peers, with each peer receiving a
              slice of the results.

       fi_reduce
              Collective call where data is collected from all peers to a
              root peer and merged (reduced).

       fi_scatter
              A single sender distributes (scatters) a slice of its local
              data to all peers.

       fi_gather
              All peers send their data to a root peer.

       fi_query_collective
              Returns information about which collective operations are
              supported by a provider, and limitations on the collective.

SYNOPSIS
       #include <rdma/fi_collective.h>

       int fi_join_collective(struct fid_ep *ep, fi_addr_t coll_addr,
           const struct fid_av_set *set,
           uint64_t flags, struct fid_mc **mc, void *context);

       ssize_t fi_barrier(struct fid_ep *ep, fi_addr_t coll_addr,
           void *context);

       ssize_t fi_barrier2(struct fid_ep *ep, fi_addr_t coll_addr,
           uint64_t flags, void *context);

       ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t count, void *desc,
           fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_alltoall(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
           uint64_t flags, void *context);

       ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
           uint64_t flags, void *context);

       ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
           fi_addr_t root_addr, enum fi_datatype datatype, enum fi_op op,
           uint64_t flags, void *context);

       ssize_t fi_scatter(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
           fi_addr_t root_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
           fi_addr_t root_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       int fi_query_collective(struct fid_domain *domain,
           enum fi_collective_op coll, struct fi_collective_attr *attr,
           uint64_t flags);

ARGUMENTS
       ep     Fabric endpoint on which to initiate collective operation.

       set    Address vector set defining the collective membership.

       mc     Multicast group associated with the collective.

       buf    Local data buffer that specifies the first operand of the
              collective operation.

       datatype
              Datatype associated with the atomic operands.

       op     Atomic operation to perform.

       result Local data buffer to store the result of the collective
              operation.

       desc / result_desc
              Data descriptor associated with the local data buffer and
              local result buffer, respectively.

       coll_addr
              Address referring to the collective group of endpoints.

       root_addr
              Single endpoint that is the source or destination of
              collective data.

       flags  Additional flags to apply to the operation.

       context
              User specified pointer to associate with the operation.  This
              parameter is ignored if the operation will not generate a
              successful completion, unless an op flag specifies the
              context parameter be used for required input.

DESCRIPTION
       The collective APIs are new to the 1.9 libfabric release.  Although
       efforts have been made to design the APIs such that they align well
       with applications and are implementable by the providers, the APIs
       should be considered experimental and may be subject to change in
       future versions of the library until the experimental tag has been
       removed.

       In general, collective operations can be thought of as coordinated
       atomic operations between a set of peer endpoints.  Readers should
       refer to the fi_atomic(3) man page for details on the atomic
       operations and datatypes defined by libfabric.

       A collective operation is a group communication exchange.  It
       involves multiple peers exchanging data with other peers
       participating in the collective call.  Collective operations require
       close coordination by all participating members.  All participants
       must invoke the same collective call before any single member can
       complete its operation locally.  As a result, collective calls can
       strain the fabric, as well as local and remote data buffers.

       Libfabric collective interfaces target fabrics that support
       offloading portions of the collective communication into network
       switches, NICs, and other devices.  However, no implementation
       requirement is placed on the provider.

       The first step in using a collective call is identifying the peer
       endpoints that will participate.  Collective membership follows one
       of two models, both supported by libfabric.  In the first model, the
       application manages the membership.  This usually means that the
       application is performing a collective operation itself using point
       to point communication to identify the members who will participate.
       Additionally, the application may be interacting with a fabric
       resource manager to reserve network resources needed to execute
       collective operations.  In this model, the application will inform
       libfabric that the membership has already been established.

       A separate model moves the membership management under libfabric and
       directly into the provider.  In this model, the application must
       identify which peer addresses will be members.  That information is
       conveyed to the libfabric provider, which is then responsible for
       coordinating the creation of the collective group.  In the provider
       managed model, the provider will usually perform the necessary
       collective operation to establish the communication group and
       interact with any fabric management agents.

       In both models, the collective membership is communicated to the
       provider by creating and configuring an address vector set (AV set).
       An AV set represents an ordered subset of addresses in an address
       vector (AV).  Details on creating and configuring an AV set are
       available in fi_av_set(3).

       Once an AV set has been programmed with the collective membership
       information, an endpoint is joined to the set.  This uses the
       fi_join_collective operation and operates asynchronously.  This
       differs from how an endpoint is associated synchronously with an AV
       using the fi_ep_bind() call.  Upon completion of the
       fi_join_collective operation, an fi_addr is provided that is used as
       the target address when invoking a collective operation.

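       The following fragment sketches this flow for a provider managed
       membership.  It is illustrative only: the open address vector av,
       the peer addresses in peer_addrs[0..num_peers-1] (as returned by
       fi_av_insert), and all error handling are assumed and omitted here.

           #include <rdma/fi_collective.h>

           struct fi_av_set_attr set_attr = {
               .count      = num_peers,          /* expected group size */
               .start_addr = FI_ADDR_NOTAVAIL,   /* begin with an empty set */
               .end_addr   = FI_ADDR_NOTAVAIL,
               .stride     = 1,
           };
           struct fid_av_set *av_set;
           fi_addr_t set_addr;

           /* Create the AV set and add each collective member to it. */
           fi_av_set(av, &set_attr, &av_set, NULL);
           for (size_t i = 0; i < num_peers; i++)
               fi_av_set_insert(av_set, peer_addrs[i]);

           /* Address passed as coll_addr to fi_join_collective below. */
           fi_av_set_addr(av_set, &set_addr);
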
       For developer convenience, a set of collective APIs are defined.
       Collective APIs differ from message and RMA interfaces in that the
       format of the data is known to the provider, and the collective may
       perform an operation on that data.  This aligns collective
       operations closely with the atomic interfaces.

   Join Collective (fi_join_collective)
       This call attaches an endpoint to a collective membership group.
       Libfabric treats collective members as a multicast group, and the
       fi_join_collective call attaches the endpoint to that multicast
       group.  By default, the endpoint will join the group based on the
       data transfer capabilities of the endpoint.  For example, if the
       endpoint has been configured to both send and receive data, then the
       endpoint will be able to initiate and receive transfers to and from
       the collective.  The input flags may be used to restrict access to
       the collective group, subject to endpoint capability limitations.

       Join collective operations complete asynchronously, and may involve
       fabric transfers, dependent on the provider implementation.  An
       endpoint must be bound to an event queue prior to calling
       fi_join_collective.  The result of the join operation will be
       reported to the EQ as an FI_JOIN_COMPLETE event.  Applications
       cannot issue collective transfers until receiving notification that
       the join operation has completed.  Note that an endpoint may begin
       receiving messages from the collective group as soon as the join
       completes, which can occur prior to the FI_JOIN_COMPLETE event being
       generated.

       The join collective operation is itself a collective operation.  All
       participating peers must call fi_join_collective before any
       individual peer will report that the join has completed.
       Application managed collective memberships are an exception.  With
       application managed memberships, the fi_join_collective call may be
       completed locally without fabric communication.  For provider
       managed memberships, the join collective call requires as input a
       coll_addr that refers to either an address associated with an AV set
       (see fi_av_set_addr) or an existing collective group (obtained
       through a previous call to fi_join_collective).  The
       fi_join_collective call will create a new collective subgroup.  If
       application managed memberships are used, coll_addr should be set to
       FI_ADDR_NOTAVAIL.

       Applications must call fi_close on the collective group to
       disconnect the endpoint from the group.  After a join operation has
       completed, the fi_mc_addr call may be used to retrieve the address
       associated with the multicast group.  See fi_cm(3) for additional
       details on fi_mc_addr().

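       As a hedged sketch (not a complete program), a provider managed join
       using the AV set built earlier might look like the following.  The
       endpoint ep is assumed to be bound to event queue eq, and av_set and
       set_addr are assumed to come from fi_av_set / fi_av_set_addr; error
       and error-event handling are omitted.

           struct fid_mc *mc;
           struct fi_eq_entry entry;
           uint32_t event;
           fi_addr_t coll_addr;

           fi_join_collective(ep, set_addr, av_set, 0, &mc, NULL);

           /* Block until the join completes; a real application would also
            * handle error events and negative returns from fi_eq_read. */
           do {
               event = 0;
               fi_eq_read(eq, &event, &entry, sizeof(entry), 0);
           } while (event != FI_JOIN_COMPLETE);

           /* Target address for subsequent collective operations. */
           coll_addr = fi_mc_addr(mc);

           /* ... collective data transfers ... */

           /* Leave the collective group when finished. */
           fi_close(&mc->fid);
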
   Barrier (fi_barrier)
       The fi_barrier operation provides a mechanism to synchronize peers.
       Barrier does not result in any data being transferred at the
       application level.  A barrier does not complete locally until all
       peers have invoked the barrier call.  This signifies to the local
       application that work by peers that completed prior to them calling
       barrier has finished.

   Barrier (fi_barrier2)
       The fi_barrier2 operation is the same as fi_barrier, but with an
       extra parameter to pass in operation flags.

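       A minimal sketch of a blocking barrier follows.  It assumes ep has
       already joined the collective group addressed by coll_addr (obtained
       from fi_mc_addr) and is bound to completion queue cq; error handling
       is omitted.

           struct fi_cq_entry comp;

           /* Initiate the barrier; a NULL context is used since the CQ is
            * polled directly below. */
           fi_barrier(ep, coll_addr, NULL);

           /* The operation completes locally only after every member of
            * the group has entered the barrier. */
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
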
   Broadcast (fi_broadcast)
       fi_broadcast transfers an array of data from a single sender to all
       other members of the collective group.  The input buf parameter is
       treated as the transmit buffer if the local rank is the root,
       otherwise it is the receive buffer.  The broadcast operation acts as
       an atomic write or read to a data array.  As a result, the format of
       the data in buf is specified through the datatype parameter.  Any
       non-void datatype may be broadcast.

       The following diagram shows an example of broadcast being used to
       transfer an array of integers to a group of peers.

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
               |____^    ^
               |_________|
               broadcast

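       As an illustrative sketch of the diagram above (ep, cq, coll_addr,
       and root_addr are assumed to be set up as described earlier, and
       error handling is omitted), the root's array is broadcast to every
       member:

           struct fi_cq_entry comp;
           int32_t data[3] = {1, 5, 9};   /* meaningful only at the root */

           fi_broadcast(ep, data, 3, NULL, coll_addr, root_addr,
                        FI_INT32, 0, NULL);

           /* After the local completion, non-root peers find the root's
            * values in data[]. */
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
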
   All to All (fi_alltoall)
       The fi_alltoall collective involves distributing (or scattering)
       different portions of an array of data to peers.  It is best
       explained using an example.  Here three peers perform an all to all
       collective to exchange different entries in an integer array.

              [1]   [2]   [3]
              [5]   [6]   [7]
              [9]  [10]  [11]
                 \   |   /
                All to all
                 /   |   \
              [1]   [5]   [9]
              [2]   [6]  [10]
              [3]   [7]  [11]

       Each peer sends a piece of its data to the other peers.

       All to all operations may be performed on any non-void datatype.
       However, all to all does not perform an operation on the data
       itself, so no operation is specified.

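       A sketch for the three peer exchange above follows.  The one element
       per peer layout, with count equal to the group size, is an
       assumption made for illustration; ep, cq, and coll_addr are assumed
       to be set up as before, and error handling is omitted.

           struct fi_cq_entry comp;
           uint64_t send_data[3] = {1, 5, 9};   /* one entry per peer */
           uint64_t recv_data[3];

           fi_alltoall(ep, send_data, 3, NULL, recv_data, NULL,
                       coll_addr, FI_UINT64, 0, NULL);

           /* recv_data[i] holds the entry contributed by peer i. */
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
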
   All Reduce (fi_allreduce)
       fi_allreduce can be described as all peers providing input into an
       atomic operation, with the result copied back to each peer.
       Conceptually, this can be viewed as each peer issuing a multicast
       atomic operation to all other peers, fetching the results, and
       combining them.  The combining of the results is referred to as the
       reduction.  The fi_allreduce() operation takes as input an array of
       data and the specified atomic operation to perform.  The results of
       the reduction are written into the result buffer.

       Any non-void datatype may be specified.  Valid atomic operations are
       listed below in the fi_query_collective call.  The following diagram
       shows an example of an all reduce operation involving summing an
       array of integers between three peers.

               [1]  [1]  [1]
               [5]  [5]  [5]
               [9]  [9]  [9]
                 \   |   /
                    sum
                 /   |   \
               [3]  [3]  [3]
              [15] [15] [15]
              [27] [27] [27]
                 All Reduce

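       The summation in the diagram could be issued as in the following
       sketch (ep, cq, and coll_addr assumed as before; error handling
       omitted):

           struct fi_cq_entry comp;
           uint64_t local[3] = {1, 5, 9};   /* this peer's contribution */
           uint64_t sums[3];                /* reduction result */

           fi_allreduce(ep, local, 3, NULL, sums, NULL, coll_addr,
                        FI_UINT64, FI_SUM, 0, NULL);

           /* On completion, every peer's sums[] holds the element-wise
            * totals {3, 15, 27}. */
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
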
   All Gather (fi_allgather)
       Conceptually, all gather can be viewed as the opposite of the
       scatter component from reduce-scatter.  All gather collects data
       from all peers into a single array, then copies that array back to
       each peer.

              [1]  [5]  [9]
                \   |   /
               All gather
                /   |   \
              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]

       All gather may be performed on any non-void datatype.  However, all
       gather does not perform an operation on the data itself, so no
       operation is specified.

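       A sketch matching the diagram, where each peer contributes a single
       value and receives one value per member (the result sizing is an
       assumption for illustration; ep, cq, and coll_addr as before):

           struct fi_cq_entry comp;
           uint64_t mine = 5;       /* this peer's value */
           uint64_t all[3];         /* one entry per group member */

           fi_allgather(ep, &mine, 1, NULL, all, NULL, coll_addr,
                        FI_UINT64, 0, NULL);
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
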
   Reduce-Scatter (fi_reduce_scatter)
       The fi_reduce_scatter collective is similar to an fi_allreduce
       operation, followed by a scatter.  With reduce scatter, all peers
       provide input into an atomic operation, similar to all reduce.
       However, rather than the full result being copied to each peer, each
       participant receives only a slice of the result.

       This is shown by the following example:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
               sum (reduce)
                    |
                   [3]
                  [15]
                  [27]
                    |
                 scatter
                /   |   \
              [3]  [15]  [27]

       The reduce scatter call supports the same datatype and atomic
       operation as fi_allreduce.

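       A sketch of the example above follows.  Treating the result buffer
       as one element (this peer's slice of the sums) is an assumption for
       illustration; ep, cq, and coll_addr as before, error handling
       omitted.

           struct fi_cq_entry comp;
           uint64_t local[3] = {1, 5, 9};
           uint64_t my_slice;       /* this peer's share of the result */

           fi_reduce_scatter(ep, local, 3, NULL, &my_slice, NULL,
                             coll_addr, FI_UINT64, FI_SUM, 0, NULL);
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
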
   Reduce (fi_reduce)
       The fi_reduce collective is the first half of an fi_allreduce
       operation.  With reduce, all peers provide input into an atomic
       operation, with the results collected by a single `root' endpoint.

       This is shown by the following example, with the leftmost peer
       identified as the root:

               [1]  [1]  [1]
               [5]  [5]  [5]
               [9]  [9]  [9]
                 \   |   /
                sum (reduce)
                 /
               [3]
              [15]
              [27]

       The reduce call supports the same datatype and atomic operation as
       fi_allreduce.

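       A sketch of a rooted sum follows (ep, cq, coll_addr, and root_addr
       assumed as before; the result buffer is significant only at the
       root):

           struct fi_cq_entry comp;
           uint64_t local[3] = {1, 5, 9};
           uint64_t totals[3];      /* written only at the root */

           fi_reduce(ep, local, 3, NULL, totals, NULL, coll_addr,
                     root_addr, FI_UINT64, FI_SUM, 0, NULL);
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
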
   Scatter (fi_scatter)
       The fi_scatter collective is the second half of an fi_reduce_scatter
       operation.  The data from a single `root' endpoint is split and
       distributed to all peers.

       This is shown by the following example:

                   [3]
                  [15]
                  [27]
                     \
                  scatter
                 /   |   \
               [3]  [15]  [27]

       The scatter operation is used to distribute results to the peers.
       No atomic operation is performed on the data.

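       A sketch matching the example (the per-peer result of one element
       and the count of three are assumptions for illustration; ep, cq,
       coll_addr, and root_addr as before):

           struct fi_cq_entry comp;
           uint64_t full[3] = {3, 15, 27};   /* meaningful only at the root */
           uint64_t my_part;

           fi_scatter(ep, full, 3, NULL, &my_part, NULL, coll_addr,
                      root_addr, FI_UINT64, 0, NULL);
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
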
   Gather (fi_gather)
       The fi_gather operation is used to collect (gather) the results from
       all peers and store them at a `root' peer.

       This is shown by the following example, with the leftmost peer
       identified as the root.

              [1]  [5]  [9]
                \   |   /
                 gather
                 /
              [1]
              [5]
              [9]

       The gather operation does not perform any operation on the data
       itself.

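       A sketch matching the example (each peer contributes one element and
       the root's result buffer holds one element per member, which is an
       assumption for illustration; ep, cq, coll_addr, and root_addr as
       before):

           struct fi_cq_entry comp;
           uint64_t mine = 5;
           uint64_t gathered[3];    /* written only at the root */

           fi_gather(ep, &mine, 1, NULL, gathered, NULL, coll_addr,
                     root_addr, FI_UINT64, 0, NULL);
           while (fi_cq_read(cq, &comp, 1) == -FI_EAGAIN)
               ;
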
   Query Collective Attributes (fi_query_collective)
       The fi_query_collective call reports which collective operations are
       supported by the underlying provider, for suitably configured
       endpoints.  Collective operations needed by an application that are
       not supported by the provider must be implemented by the
       application.  The query call checks whether a provider supports a
       specific collective operation for a given datatype and operation, if
       applicable.

       The name of the collective, as well as the datatype and associated
       operation, if applicable, are provided as input into
       fi_query_collective.

       The coll parameter may reference one of these collectives:
       FI_BARRIER, FI_BROADCAST, FI_ALLTOALL, FI_ALLREDUCE, FI_ALLGATHER,
       FI_REDUCE_SCATTER, FI_REDUCE, FI_SCATTER, or FI_GATHER.  Additional
       details on the collective operation are specified through the struct
       fi_collective_attr parameter.  For collectives that act on data, the
       operation and related data type must be specified through the given
       attributes.

       struct fi_collective_attr {
           enum fi_op op;
           enum fi_datatype datatype;
           struct fi_atomic_attr datatype_attr;
           size_t max_members;
           uint64_t mode;
       };

       For a description of struct fi_atomic_attr, see fi_atomic(3).

       op     On input, this specifies the atomic operation involved with
              the collective call.  This should be set to one of the
              following values: FI_MIN, FI_MAX, FI_SUM, FI_PROD, FI_LOR,
              FI_LAND, FI_BOR, FI_BAND, FI_LXOR, FI_BXOR, FI_ATOMIC_READ,
              FI_ATOMIC_WRITE, or FI_NOOP.  For collectives that do not
              exchange application data (fi_barrier), this should be set to
              FI_NOOP.

       datatype
              On input, specifies the datatype of the data being modified
              by the collective.  This should be set to one of the
              following values: FI_INT8, FI_UINT8, FI_INT16, FI_UINT16,
              FI_INT32, FI_UINT32, FI_INT64, FI_UINT64, FI_FLOAT,
              FI_DOUBLE, FI_FLOAT_COMPLEX, FI_DOUBLE_COMPLEX,
              FI_LONG_DOUBLE, FI_LONG_DOUBLE_COMPLEX, or FI_VOID.  For
              collectives that do not exchange application data
              (fi_barrier), this should be set to FI_VOID.

       datatype_attr.count
              The maximum number of elements that may be used with the
              collective.

       datatype_attr.size
              The size of the datatype as supported by the provider.
              Applications should validate the size of datatypes that
              differ based on the platform, such as FI_LONG_DOUBLE.

       max_members
              The maximum number of peers that may participate in a
              collective operation.

       mode   This field is reserved and should be 0.

       If a collective operation is supported, the query call will return
       FI_SUCCESS, along with attributes on the limits for using that
       collective operation through the provider.

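       For illustration, a sketch that checks for an offloaded FI_SUM all
       reduce on FI_UINT64 data follows (domain is assumed to be an open
       fabric domain):

           struct fi_collective_attr attr = {
               .op = FI_SUM,
               .datatype = FI_UINT64,
               .mode = 0,               /* reserved */
           };

           if (fi_query_collective(domain, FI_ALLREDUCE, &attr, 0) ==
               FI_SUCCESS) {
               /* Supported: attr.datatype_attr.count is the maximum
                * element count per call and attr.max_members the largest
                * group size. */
           } else {
               /* Not supported; the application must implement the
                * collective itself. */
           }
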
   Completions
       Collective operations map to underlying fi_atomic operations.  For a
       discussion of atomic completion semantics, see fi_atomic(3).  The
       completion, ordering, and atomicity of collective operations match
       those defined for point to point atomic operations.

FLAGS
       The following flags are defined for the specified operations.

       FI_SCATTER
              Applies to fi_query_collective.  When set, requests attribute
              information on the reduce-scatter collective operation.

RETURN VALUE
       Returns 0 on success.  On error, a negative value corresponding to
       fabric errno is returned.  Fabric errno values are defined in
       rdma/fi_errno.h.

ERRORS
       -FI_EAGAIN
              See fi_msg(3) for a detailed description of handling
              FI_EAGAIN.

       -FI_EOPNOTSUPP
              The requested atomic operation is not supported on this
              endpoint.

       -FI_EMSGSIZE
              The number of collective operations in a single request
              exceeds that supported by the underlying provider.

NOTES
       Collective operations map to atomic operations.  As such, they
       follow most of the conventions and restrictions of peer to peer
       atomic operations.  This includes data atomicity, data alignment,
       and message ordering semantics.  See fi_atomic(3) for additional
       information on the datatypes and operations defined for atomic and
       collective operations.

SEE ALSO
       fi_getinfo(3), fi_av(3), fi_atomic(3), fi_cm(3)

AUTHORS
       OpenFabrics.

Libfabric Programmer’s Manual     2022-12-11               fi_collective(3)