fi_collective(3)               Libfabric v1.12.1              fi_collective(3)


NAME
       fi_join_collective
              Operation where a subset of peers join a new collective group.

       fi_barrier
              Collective operation that does not complete until all peers
              have entered the barrier call.

       fi_broadcast
              A single sender transmits data to all peers, including itself.

       fi_alltoall
              Each peer distributes a slice of its local data to all peers.

       fi_allreduce
              Collective operation where all peers contribute input to an
              atomic operation, with the combined result returned to every
              peer.

       fi_allgather
              Each peer sends a complete copy of its local data to all
              peers.

       fi_reduce_scatter
              Collective call where data is collected from all peers and
              merged (reduced).  The result of the reduction is distributed
              back to the peers, with each peer receiving a slice of the
              result.

       fi_reduce
              Collective call where data is collected from all peers to a
              root peer and merged (reduced).

       fi_scatter
              A single sender distributes (scatters) a slice of its local
              data to all peers.

       fi_gather
              All peers send their data to a root peer.

       fi_query_collective
              Returns information about which collective operations are
              supported by a provider, and limitations on the collective.

SYNOPSIS
       #include <rdma/fi_collective.h>

       int fi_join_collective(struct fid_ep *ep, fi_addr_t coll_addr,
           const struct fid_av_set *set,
           uint64_t flags, struct fid_mc **mc, void *context);

       ssize_t fi_barrier(struct fid_ep *ep, fi_addr_t coll_addr,
           void *context);

       ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t count, void *desc,
           fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_alltoall(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
           uint64_t flags, void *context);

       ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc,
           fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
           uint64_t flags, void *context);

       ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
           fi_addr_t root_addr, enum fi_datatype datatype, enum fi_op op,
           uint64_t flags, void *context);

       ssize_t fi_scatter(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
           fi_addr_t root_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
           void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
           fi_addr_t root_addr, enum fi_datatype datatype,
           uint64_t flags, void *context);

       int fi_query_collective(struct fid_domain *domain,
           enum fi_collective_op coll, struct fi_collective_attr *attr, uint64_t flags);

ARGUMENTS
       ep     Fabric endpoint on which to initiate the collective operation.

       set    Address vector set defining the collective membership.

       mc     Multicast group associated with the collective.

       buf    Local data buffer that specifies the first operand of the
              collective operation.

       datatype
              Datatype associated with the atomic operands.

       op     Atomic operation to perform.

       result Local data buffer to store the result of the collective
              operation.

       desc / result_desc
              Data descriptor associated with the local data buffer and
              local result buffer, respectively.

       coll_addr
              Address referring to the collective group of endpoints.

       root_addr
              Single endpoint that is the source or destination of
              collective data.

       flags  Additional flags to apply to the collective operation.

       context
              User specified pointer to associate with the operation.  This
              parameter is ignored if the operation will not generate a
              successful completion, unless an op flag specifies that the
              context parameter be used for required input.

DESCRIPTION
       The collective APIs are new to the libfabric 1.9 release.  Although
       efforts have been made to design the APIs such that they align well
       with applications and are implementable by the providers, the APIs
       should be considered experimental and may be subject to change in
       future versions of the library until the experimental tag has been
       removed.

       In general, collective operations can be thought of as coordinated
       atomic operations between a set of peer endpoints.  Readers should
       refer to the fi_atomic(3) man page for details on the atomic
       operations and datatypes defined by libfabric.

       A collective operation is a group communication exchange.  It
       involves multiple peers exchanging data with other peers
       participating in the collective call.  Collective operations require
       close coordination by all participating members.  All participants
       must invoke the same collective call before any single member can
       complete its operation locally.  As a result, collective calls can
       strain the fabric, as well as local and remote data buffers.

       Libfabric collective interfaces target fabrics that support
       offloading portions of the collective communication into network
       switches, NICs, and other devices.  However, no implementation
       requirement is placed on the provider.

       The first step in using a collective call is identifying the peer
       endpoints that will participate.  Collective membership follows one
       of two models, both supported by libfabric.  In the first model, the
       application manages the membership.  This usually means that the
       application is performing a collective operation itself using point
       to point communication to identify the members who will participate.
       Additionally, the application may be interacting with a fabric
       resource manager to reserve network resources needed to execute
       collective operations.  In this model, the application will inform
       libfabric that the membership has already been established.

       A separate model moves the membership management under libfabric and
       directly into the provider.  In this model, the application must
       identify which peer addresses will be members.  That information is
       conveyed to the libfabric provider, which is then responsible for
       coordinating the creation of the collective group.  In the provider
       managed model, the provider will usually perform the necessary
       collective operation to establish the communication group and
       interact with any fabric management agents.

       In both models, the collective membership is communicated to the
       provider by creating and configuring an address vector set (AV set).
       An AV set represents an ordered subset of addresses in an address
       vector (AV).  Details on creating and configuring an AV set are
       available in fi_av_set(3).
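
       The following is a minimal sketch of creating an AV set and
       retrieving the address used to reference it.  The helper name, the
       assumption that the collective members occupy the first num_peers
       consecutive AV entries, and the use of NULL contexts are illustrative
       only; see fi_av_set(3) for the full interface.

          #include <rdma/fi_domain.h>
          #include <rdma/fi_collective.h>

          /* Hypothetical helper: build an AV set from consecutive AV
           * entries.  Assumes 'av' is an open address vector whose first
           * 'num_peers' entries are the intended collective members. */
          static int create_av_set(struct fid_av *av, size_t num_peers,
                                   struct fid_av_set **set,
                                   fi_addr_t *coll_addr)
          {
                  struct fi_av_set_attr attr = {
                          .count = num_peers,        /* maximum membership */
                          .start_addr = 0,           /* first AV index */
                          .end_addr = num_peers - 1,
                          .stride = 1,
                  };
                  int ret;

                  ret = fi_av_set(av, &attr, set, NULL);
                  if (ret)
                          return ret;

                  /* Address passed as coll_addr when joining the group. */
                  return fi_av_set_addr(*set, coll_addr);
          }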

       Once an AV set has been programmed with the collective membership
       information, an endpoint is joined to the set.  This uses the
       fi_join_collective operation and operates asynchronously.  This
       differs from how an endpoint is associated synchronously with an AV
       using the fi_ep_bind() call.  Upon completion of the
       fi_join_collective operation, an fi_addr is provided that is used as
       the target address when invoking a collective operation.

       For developer convenience, a set of collective APIs are defined.
       Collective APIs differ from message and RMA interfaces in that the
       format of the data is known to the provider, and the collective may
       perform an operation on that data.  This aligns collective
       operations closely with the atomic interfaces.

   Join Collective (fi_join_collective)
       This call attaches an endpoint to a collective membership group.
       Libfabric treats collective members as a multicast group, and the
       fi_join_collective call attaches the endpoint to that multicast
       group.  By default, the endpoint will join the group based on the
       data transfer capabilities of the endpoint.  For example, if the
       endpoint has been configured to both send and receive data, then the
       endpoint will be able to initiate and receive transfers to and from
       the collective.  The input flags may be used to restrict access to
       the collective group, subject to endpoint capability limitations.

       Join collective operations complete asynchronously, and may involve
       fabric transfers, dependent on the provider implementation.  An
       endpoint must be bound to an event queue prior to calling
       fi_join_collective.  The result of the join operation will be
       reported to the EQ as an FI_JOIN_COMPLETE event.  Applications
       cannot issue collective transfers until receiving notification that
       the join operation has completed.  Note that an endpoint may begin
       receiving messages from the collective group as soon as the join
       completes, which can occur prior to the FI_JOIN_COMPLETE event being
       generated.

       The join collective operation is itself a collective operation.  All
       participating peers must call fi_join_collective before any
       individual peer will report that the join has completed.
       Application managed collective memberships are an exception.  With
       application managed memberships, the fi_join_collective call may be
       completed locally without fabric communication.  For provider
       managed memberships, the join collective call requires as input a
       coll_addr that refers to either an address associated with an AV set
       (see fi_av_set_addr) or an existing collective group (obtained
       through a previous call to fi_join_collective).  The
       fi_join_collective call will create a new collective subgroup.  If
       application managed memberships are used, coll_addr should be set to
       FI_ADDR_UNAVAIL.

       Applications must call fi_close on the collective group to
       disconnect the endpoint from the group.  After a join operation has
       completed, the fi_mc_addr call may be used to retrieve the address
       associated with the multicast group.  See fi_cm(3) for additional
       details on fi_mc_addr().
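
       As an illustration, the sketch below joins an endpoint to the group
       described by an AV set and waits for the FI_JOIN_COMPLETE event.  It
       assumes the endpoint is already bound to the event queue eq and that
       set_addr was obtained from fi_av_set_addr; all names are
       illustrative.

          #include <rdma/fi_collective.h>
          #include <rdma/fi_eq.h>
          #include <rdma/fi_cm.h>

          /* Hypothetical helper: join 'ep' to the collective described by
           * 'set' and return the address used for collective operations. */
          static int join_group(struct fid_ep *ep, struct fid_eq *eq,
                                fi_addr_t set_addr,
                                const struct fid_av_set *set,
                                struct fid_mc **mc, fi_addr_t *coll_addr)
          {
                  struct fi_eq_entry entry;
                  uint32_t event;
                  ssize_t rd;
                  int ret;

                  ret = fi_join_collective(ep, set_addr, set, 0, mc, NULL);
                  if (ret)
                          return ret;

                  /* Block until the asynchronous join is reported. */
                  rd = fi_eq_sread(eq, &event, &entry, sizeof(entry), -1, 0);
                  if (rd < 0)
                          return (int) rd;
                  if (event != FI_JOIN_COMPLETE)
                          return -FI_EOTHER;

                  /* Collective operations target the group address. */
                  *coll_addr = fi_mc_addr(*mc);
                  return 0;
          }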

   Barrier (fi_barrier)
       The fi_barrier operation provides a mechanism to synchronize peers.
       Barrier does not result in any data being transferred at the
       application level.  A barrier does not complete locally until all
       peers have invoked the barrier call.  This signifies to the local
       application that work which peers completed prior to calling the
       barrier has finished.
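
       For example, a barrier might be issued and its completion awaited on
       the endpoint's completion queue as in the sketch below.  It assumes
       the endpoint is bound to a completion queue cq that was opened with
       FI_CQ_FORMAT_CONTEXT and a wait object so that fi_cq_sread can
       block; the names are illustrative.

          #include <rdma/fi_collective.h>
          #include <rdma/fi_eq.h>

          /* Hypothetical helper: enter the barrier and wait until all
           * peers in the group identified by 'coll_addr' have entered. */
          static int do_barrier(struct fid_ep *ep, struct fid_cq *cq,
                                fi_addr_t coll_addr)
          {
                  struct fi_cq_entry comp;
                  ssize_t ret;

                  do {
                          ret = fi_barrier(ep, coll_addr, NULL);
                  } while (ret == -FI_EAGAIN);
                  if (ret)
                          return (int) ret;

                  /* Wait for the barrier completion to reach the CQ. */
                  do {
                          ret = fi_cq_sread(cq, &comp, 1, NULL, -1);
                  } while (ret == -FI_EAGAIN);
                  return ret < 0 ? (int) ret : 0;
          }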

   Broadcast (fi_broadcast)
       fi_broadcast transfers an array of data from a single sender to all
       other members of the collective group.  The input buf parameter is
       treated as the transmit buffer if the local rank is the root,
       otherwise it is the receive buffer.  The broadcast operation acts as
       an atomic write or read to a data array.  As a result, the format of
       the data in buf is specified through the datatype parameter.  Any
       non-void datatype may be broadcast.

       The following diagram shows an example of broadcast being used to
       transfer an array of integers to a group of peers.

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
               |____^    ^
               |_________|
               broadcast
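
       A call corresponding to the diagram might look like the sketch
       below, where every member, including the root, posts the same
       fi_broadcast with matching count and datatype.  The NULL descriptor
       (no FI_MR_LOCAL requirement assumed) and variable names are
       illustrative.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: every peer calls this with the same
           * root_addr.  'data' is the source on the root and the
           * destination on all other peers; completion is reported to
           * the endpoint's bound CQ. */
          static ssize_t bcast_ints(struct fid_ep *ep, fi_addr_t coll_addr,
                                    fi_addr_t root_addr, uint32_t *data,
                                    size_t count)
          {
                  return fi_broadcast(ep, data, count, NULL, coll_addr,
                                      root_addr, FI_UINT32, 0, NULL);
          }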

   All to All (fi_alltoall)
       The fi_alltoall collective involves distributing (or scattering)
       different portions of an array of data to peers.  It is best
       explained using an example.  Here three peers perform an all to all
       collective to exchange different entries in an integer array.

              [1]   [2]   [3]
              [5]   [6]   [7]
              [9]  [10]  [11]
                \    |    /
                All to all
                /    |    \
              [1]   [5]   [9]
              [2]   [6]  [10]
              [3]   [7]  [11]

       Each peer sends a piece of its data to the other peers.

       All to all operations may be performed on any non-void datatype.
       However, all to all does not perform an operation on the data
       itself, so no operation is specified.
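
       The exchange above could be posted as in the sketch below, where
       each of the three peers supplies a three-element send array and
       receives a three-element result.  Treating count as the number of
       elements in the full local array (one slice per peer) is an
       assumption made for illustration, as are the names and NULL
       descriptors.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch matching the diagram: each of three peers
           * contributes a column of three integers and receives a row. */
          static ssize_t alltoall_ints(struct fid_ep *ep, fi_addr_t coll_addr,
                                       const uint32_t send[3],
                                       uint32_t recv[3])
          {
                  return fi_alltoall(ep, send, 3, NULL, recv, NULL,
                                     coll_addr, FI_UINT32, 0, NULL);
          }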

   All Reduce (fi_allreduce)
       fi_allreduce can be described as all peers providing input into an
       atomic operation, with the result copied back to each peer.
       Conceptually, this can be viewed as each peer issuing a multicast
       atomic operation to all other peers, fetching the results, and
       combining them.  The combining of the results is referred to as the
       reduction.  The fi_allreduce() operation takes as input an array of
       data and the specified atomic operation to perform.  The results of
       the reduction are written into the result buffer.

       Any non-void datatype may be specified.  Valid atomic operations are
       listed below in the fi_query_collective call.  The following diagram
       shows an example of an all reduce operation involving summing an
       array of integers between three peers.

              [1]   [1]   [1]
              [5]   [5]   [5]
              [9]   [9]   [9]
                \    |    /
                    sum
                /    |    \
              [3]   [3]   [3]
             [15]  [15]  [15]
             [27]  [27]  [27]
                All Reduce
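
       The summation shown above corresponds to a call along these lines.
       It assumes the peers have already joined the group referenced by
       coll_addr and that NULL descriptors are acceptable because the
       domain does not require FI_MR_LOCAL; the names are illustrative.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: element-wise sum across all peers; every
           * peer receives the full reduced array in 'result'.  Completion
           * is reported to the endpoint's bound CQ. */
          static ssize_t sum_all(struct fid_ep *ep, fi_addr_t coll_addr,
                                 const uint64_t *local, uint64_t *result,
                                 size_t count)
          {
                  ssize_t ret;

                  do {
                          ret = fi_allreduce(ep, local, count, NULL, result,
                                             NULL, coll_addr, FI_UINT64,
                                             FI_SUM, 0, NULL);
                  } while (ret == -FI_EAGAIN);
                  return ret;
          }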

   All Gather (fi_allgather)
       Conceptually, all gather can be viewed as the opposite of the
       scatter component from reduce-scatter.  All gather collects data
       from all peers into a single array, then copies that array back to
       each peer.

              [1]   [5]   [9]
                \    |    /
                All gather
                /    |    \
              [1]   [1]   [1]
              [5]   [5]   [5]
              [9]   [9]   [9]

       All gather may be performed on any non-void datatype.  However, all
       gather does not perform an operation on the data itself, so no
       operation is specified.
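
       In code, the gather-and-copy shown above might be posted as follows,
       with each of the three peers contributing one element and receiving
       the combined three-element array.  Treating count as the number of
       elements contributed by the local peer is an illustrative
       assumption, as are the names and NULL descriptors.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: each peer contributes one uint32_t and
           * receives the combined array from all three peers. */
          static ssize_t gather_all(struct fid_ep *ep, fi_addr_t coll_addr,
                                    const uint32_t *mine, uint32_t all[3])
          {
                  return fi_allgather(ep, mine, 1, NULL, all, NULL,
                                      coll_addr, FI_UINT32, 0, NULL);
          }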

   Reduce-Scatter (fi_reduce_scatter)
       The fi_reduce_scatter collective is similar to an fi_allreduce
       operation, followed by all to all.  With reduce scatter, all peers
       provide input into an atomic operation, similar to all reduce.
       However, rather than the full result being copied to each peer, each
       participant receives only a slice of the result.

       This is shown by the following example:

              [1]   [1]   [1]
              [5]   [5]   [5]
              [9]   [9]   [9]
                \    |    /
                sum (reduce)
                     |
                    [3]
                   [15]
                   [27]
                     |
                  scatter
                /    |    \
              [3]  [15]  [27]

       The reduce scatter call supports the same datatype and atomic
       operation as fi_allreduce.
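
       A corresponding call is sketched below.  Note that the
       fi_reduce_scatter signature takes no explicit root; which slice a
       peer receives is determined by its position in the collective group.
       The slice sizing shown (one element per peer) and the names are
       illustrative assumptions.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: sum arrays across peers, then receive
           * only this peer's slice of the reduced result in 'slice'. */
          static ssize_t reduce_scatter_sum(struct fid_ep *ep,
                                            fi_addr_t coll_addr,
                                            const uint64_t *local,
                                            size_t count, uint64_t *slice)
          {
                  return fi_reduce_scatter(ep, local, count, NULL, slice,
                                           NULL, coll_addr, FI_UINT64,
                                           FI_SUM, 0, NULL);
          }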

   Reduce (fi_reduce)
       The fi_reduce collective is the first half of an fi_allreduce
       operation.  With reduce, all peers provide input into an atomic
       operation, with the results collected by a single 'root' endpoint.

       This is shown by the following example, with the leftmost peer
       identified as the root:

              [1]   [1]   [1]
              [5]   [5]   [5]
              [9]   [9]   [9]
                \    |    /
                sum (reduce)
                /
              [3]
             [15]
             [27]

       The reduce call supports the same datatype and atomic operation as
       fi_allreduce.
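
       For example, every peer might post the call below with the same
       root_addr, with the reduced array delivered to the root's result
       buffer.  The names and NULL descriptors are illustrative
       assumptions.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: all peers contribute 'local'; the summed
           * array is collected into 'result' at the root endpoint. */
          static ssize_t reduce_to_root(struct fid_ep *ep,
                                        fi_addr_t coll_addr,
                                        fi_addr_t root_addr,
                                        const uint64_t *local,
                                        uint64_t *result, size_t count)
          {
                  return fi_reduce(ep, local, count, NULL, result, NULL,
                                   coll_addr, root_addr, FI_UINT64, FI_SUM,
                                   0, NULL);
          }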

   Scatter (fi_scatter)
       The fi_scatter collective is the second half of an fi_reduce_scatter
       operation.  The data from a single 'root' endpoint is split and
       distributed to all peers.

       This is shown by the following example:

              [3]
             [15]
             [27]
                \
                  scatter
                /    |    \
              [3]  [15]  [27]

       The scatter operation is used to distribute results to the peers.
       No atomic operation is performed on the data.
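
       A matching call is sketched below: the root supplies the full array
       in buf and each peer receives its slice in result.  Treating count
       as the size of the full array at the root is an assumption made for
       illustration, as are the names and NULL descriptors.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: the root splits a three-element array so
           * that each of three peers receives one element in 'slice'. */
          static ssize_t scatter_ints(struct fid_ep *ep, fi_addr_t coll_addr,
                                      fi_addr_t root_addr,
                                      const uint64_t full[3],
                                      uint64_t *slice)
          {
                  return fi_scatter(ep, full, 3, NULL, slice, NULL,
                                    coll_addr, root_addr, FI_UINT64,
                                    0, NULL);
          }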

   Gather (fi_gather)
       The fi_gather operation is used to collect (gather) the results from
       all peers and store them at a 'root' peer.

       This is shown by the following example, with the leftmost peer
       identified as the root.

              [1]   [5]   [9]
                \    |    /
                  gather
                /
              [1]
              [5]
              [9]

       The gather operation does not perform any operation on the data
       itself.
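
       In code, the collection shown above might be posted as follows, with
       each peer contributing one element and the root receiving the
       combined array.  The names, buffer sizing, and NULL descriptors are
       illustrative assumptions.

          #include <rdma/fi_collective.h>

          /* Hypothetical sketch: each peer sends one uint64_t; the root
           * gathers one element from each peer into 'all'. */
          static ssize_t gather_to_root(struct fid_ep *ep,
                                        fi_addr_t coll_addr,
                                        fi_addr_t root_addr,
                                        const uint64_t *mine,
                                        uint64_t all[3])
          {
                  return fi_gather(ep, mine, 1, NULL, all, NULL,
                                   coll_addr, root_addr, FI_UINT64,
                                   0, NULL);
          }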

   Query Collective Attributes (fi_query_collective)
       The fi_query_collective call reports which collective operations are
       supported by the underlying provider, for suitably configured
       endpoints.  Collective operations needed by an application that are
       not supported by the provider must be implemented by the
       application.  The query call checks whether a provider supports a
       specific collective operation for a given datatype and operation, if
       applicable.

       The name of the collective, as well as the datatype and associated
       operation, if applicable, are provided as input into
       fi_query_collective.

       The coll parameter may reference one of these collectives:
       FI_BARRIER, FI_BROADCAST, FI_ALLTOALL, FI_ALLREDUCE, FI_ALLGATHER,
       FI_REDUCE_SCATTER, FI_REDUCE, FI_SCATTER, or FI_GATHER.  Additional
       details on the collective operation are specified through the struct
       fi_collective_attr parameter.  For collectives that act on data, the
       operation and related data type must be specified through the given
       attributes.

          struct fi_collective_attr {
              enum fi_op op;
              enum fi_datatype datatype;
              struct fi_atomic_attr datatype_attr;
              size_t max_members;
              uint64_t mode;
          };

       For a description of struct fi_atomic_attr, see fi_atomic(3).

       op     On input, this specifies the atomic operation involved with
              the collective call.  This should be set to one of the
              following values: FI_MIN, FI_MAX, FI_SUM, FI_PROD, FI_LOR,
              FI_LAND, FI_BOR, FI_BAND, FI_LXOR, FI_BXOR, FI_ATOMIC_READ,
              FI_ATOMIC_WRITE, or FI_NOOP.  For collectives that do not
              exchange application data (fi_barrier), this should be set
              to FI_NOOP.

       datatype
              On input, specifies the datatype of the data being modified
              by the collective.  This should be set to one of the
              following values: FI_INT8, FI_UINT8, FI_INT16, FI_UINT16,
              FI_INT32, FI_UINT32, FI_INT64, FI_UINT64, FI_FLOAT,
              FI_DOUBLE, FI_FLOAT_COMPLEX, FI_DOUBLE_COMPLEX,
              FI_LONG_DOUBLE, FI_LONG_DOUBLE_COMPLEX, or FI_VOID.  For
              collectives that do not exchange application data
              (fi_barrier), this should be set to FI_VOID.

       datatype_attr.count
              The maximum number of elements that may be used with the
              collective.

       datatype_attr.size
              The size of the datatype as supported by the provider.
              Applications should validate the size of datatypes that
              differ based on the platform, such as FI_LONG_DOUBLE.

       max_members
              The maximum number of peers that may participate in a
              collective operation.

       mode   This field is reserved and should be 0.

       If a collective operation is supported, the query call will return
       FI_SUCCESS, along with attributes on the limits for using that
       collective operation through the provider.
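
       For instance, an application could check whether the provider
       supports a 64-bit unsigned sum for fi_allreduce with a sketch along
       these lines; the helper name is illustrative, and the attribute
       fields consulted afterwards follow the structure shown above.

          #include <rdma/fi_collective.h>

          /* Hypothetical check: is FI_ALLREDUCE of FI_UINT64 with FI_SUM
           * supported, and if so, up to how many elements and members? */
          static int check_sum_support(struct fid_domain *domain,
                                       size_t *max_count,
                                       size_t *max_members)
          {
                  struct fi_collective_attr attr = {
                          .op = FI_SUM,
                          .datatype = FI_UINT64,
                          .mode = 0,
                  };
                  int ret;

                  ret = fi_query_collective(domain, FI_ALLREDUCE, &attr, 0);
                  if (ret)
                          return ret;   /* non-zero if not supported */

                  *max_count = attr.datatype_attr.count;
                  *max_members = attr.max_members;
                  return 0;
          }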

   Completions
       Collective operations map to underlying fi_atomic operations.  For a
       discussion of atomic completion semantics, see fi_atomic(3).  The
       completion, ordering, and atomicity of collective operations match
       those defined for point to point atomic operations.

FLAGS
       The following flags are defined for the specified operations.

       FI_SCATTER
              Applies to fi_query_collective.  When set, requests attribute
              information on the reduce-scatter collective operation.

RETURN VALUE
       Returns 0 on success.  On error, a negative value corresponding to
       fabric errno is returned.  Fabric errno values are defined in
       rdma/fi_errno.h.

ERRORS
       -FI_EAGAIN
              See fi_msg(3) for a detailed description of handling
              FI_EAGAIN.

       -FI_EOPNOTSUPP
              The requested atomic operation is not supported on this
              endpoint.

       -FI_EMSGSIZE
              The number of collective operations in a single request
              exceeds that supported by the underlying provider.

NOTES
       Collective operations map to atomic operations.  As such, they
       follow most of the same conventions and restrictions as peer to
       peer atomic operations.  This includes data atomicity, data
       alignment, and message ordering semantics.  See fi_atomic(3) for
       additional information on the datatypes and operations defined for
       atomic and collective operations.

SEE ALSO
       fi_getinfo(3), fi_av(3), fi_atomic(3), fi_cm(3)

AUTHORS
       OpenFabrics.


Libfabric Programmer's Manual     2020-04-13                  fi_collective(3)