fi_trigger(3)                  Libfabric v1.15.1                 fi_trigger(3)

NAME
fi_trigger - Triggered operations

SYNOPSIS
#include <rdma/fi_trigger.h>

DESCRIPTION
Triggered operations allow an application to queue a data transfer request
that is deferred until a specified condition is met. A typical use is to
send a message only after receiving all input data. Triggered operations
can help reduce the latency needed to initiate a transfer by removing the
need to return control back to an application prior to the data transfer
starting.

An endpoint must be created with the FI_TRIGGER capability in order for
triggered operations to be specified. Such an endpoint is referred to as a
trigger-able endpoint. A triggered operation is requested by specifying
the FI_TRIGGER flag as part of the operation.

Any data transfer operation is potentially trigger-able, subject to
provider constraints. Trigger-able endpoints are initialized such that
only those interfaces supported by the provider which are trigger-able
are available.

Triggered operations require that applications use struct
fi_triggered_context as their per operation context parameter, or if the
provider requires the FI_CONTEXT2 mode, struct fi_triggered_context2. The
use of struct fi_triggered_context[2] replaces struct fi_context[2], if
required by the provider. Although struct fi_triggered_context[2] is not
opaque to the application, the contents of the structure may be modified
by the provider once it has been submitted as an operation. This
structure has requirements similar to those of struct fi_context[2]. It
must be allocated by the application and remain valid until the
corresponding operation completes or is successfully canceled.

Struct fi_triggered_context[2] is used to specify the condition that must
be met before the triggered data transfer is initiated. If the condition
is met when the request is made, then the data transfer may be initiated
immediately. The format of struct fi_triggered_context[2] is described
below.

struct fi_triggered_context {
    enum fi_trigger_event event_type;   /* trigger type */
    union {
        struct fi_trigger_threshold threshold;
        struct fi_trigger_xpu xpu;
        void *internal[3];              /* reserved */
    } trigger;
};

struct fi_triggered_context2 {
    enum fi_trigger_event event_type;   /* trigger type */
    union {
        struct fi_trigger_threshold threshold;
        struct fi_trigger_xpu xpu;
        void *internal[7];              /* reserved */
    } trigger;
};

The triggered context indicates the type of event assigned to the trigger,
along with a union of trigger details that is based on the event type.
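As an illustration of how a threshold trigger might be described, the sketch below fills in a triggered context. To compile without libfabric it declares local stand-ins copied from the structures on this page (the xpu union member is omitted and the enum values are placeholders); real code would include <rdma/fi_trigger.h> instead. The helper name init_threshold_trigger is hypothetical and not part of the API.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins copied from this page so the sketch compiles without
 * libfabric. Real code includes <rdma/fi_trigger.h> instead. */
struct fid_cntr;                        /* opaque event counter object */

enum fi_trigger_event {                 /* numeric values are placeholders */
    FI_TRIGGER_THRESHOLD,
    FI_TRIGGER_XPU
};

struct fi_trigger_threshold {
    struct fid_cntr *cntr;              /* event counter to check */
    size_t threshold;                   /* threshold value */
};

struct fi_triggered_context {
    enum fi_trigger_event event_type;   /* trigger type */
    union {
        struct fi_trigger_threshold threshold;
        void *internal[3];              /* reserved */
    } trigger;
};

/* Hypothetical helper: describe a trigger that defers the operation the
 * context is attached to until cntr reaches threshold. */
static void init_threshold_trigger(struct fi_triggered_context *ctx,
                                   struct fid_cntr *cntr, size_t threshold)
{
    assert(ctx != NULL);
    ctx->event_type = FI_TRIGGER_THRESHOLD;
    ctx->trigger.threshold.cntr = cntr;
    ctx->trigger.threshold.threshold = threshold;
}
```

In an application, the initialized context would then be passed as the per operation context of a data transfer call that is issued with the FI_TRIGGER flag, and it must stay valid until that operation completes or is canceled.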

COMPLETION BASED TRIGGERS
Completion based triggers defer a data transfer until one or more related
data transfers complete. For example, a send operation may be deferred
until a receive operation completes, indicating that the data to be
transferred is now available.

The following trigger event related to completion based transfers is
defined.

FI_TRIGGER_THRESHOLD
    This indicates that the data transfer operation will be deferred
    until an event counter crosses an application specified threshold
    value. The threshold is specified using struct fi_trigger_threshold:

    struct fi_trigger_threshold {
        struct fid_cntr *cntr;   /* event counter to check */
        size_t threshold;        /* threshold value */
    };

Threshold operations are triggered in the order of the threshold values.
This is true even if the counter increments by a value greater than 1. If
two triggered operations have the same threshold, they will be triggered
in the order in which they were submitted to the endpoint.
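The ordering rule can be illustrated with a small stand-alone model. This is a sketch of the documented behavior, not provider code: struct pending_op and fire_order are hypothetical names, the array order stands in for submission order, and "crossing" the threshold is assumed to mean that the counter value is greater than or equal to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one queued triggered operation. */
struct pending_op {
    size_t threshold;   /* counter value at which the operation fires */
    int id;             /* caller-chosen label; array order = submission order */
};

/*
 * Write to fired[] the ids of the operations that fire once the counter
 * holds counter_value, in firing order: ascending threshold, with equal
 * thresholds kept in submission order. Returns the number of ids written.
 * fired[] must have room for n entries, and n must not exceed 64.
 */
static size_t fire_order(const struct pending_op *ops, size_t n,
                         size_t counter_value, int *fired)
{
    size_t order[64];   /* indices into ops[], kept sorted by threshold */
    size_t count = 0;

    assert(n <= 64);

    /* Stable insertion sort of the eligible operations by threshold. */
    for (size_t i = 0; i < n; i++) {
        if (ops[i].threshold > counter_value)
            continue;   /* threshold not reached yet */
        size_t j = count++;
        while (j > 0 && ops[order[j - 1]].threshold > ops[i].threshold) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = i;
    }

    for (size_t k = 0; k < count; k++)
        fired[k] = ops[order[k]].id;
    return count;
}
```

For operations submitted with thresholds 3, 1, 3, 2 and 5, a counter that jumps directly from 0 to 3 fires the second operation first, then the fourth, then the first and third in their submission order; the fifth stays queued.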

XPU TRIGGERS
XPU based triggers work in conjunction with heterogeneous memory (FI_HMEM
capability). XPU triggers define a split execution model for specifying a
data transfer separately from initiating the transfer. Unlike completion
triggers, the user controls the timing of when the transfer starts by
writing data into a trigger variable location.

XPU transfers allow the requesting and triggering to occur on separate
computational domains. For example, a process running on the host CPU can
set up a data transfer, with a compute kernel running on a GPU signaling
the start of the transfer. XPU refers to a CPU, GPU, FPGA, or other
acceleration device with some level of computational ability.

Endpoints must be created with both the FI_TRIGGER and FI_XPU capabilities
to use XPU triggers. Endpoints enabled for XPU triggers only support XPU
triggered operations. The behavior of mixing XPU triggered operations with
normal data transfers or non-XPU triggered operations is not defined by
the API and is subject to provider support and implementation.

The use of XPU triggers requires coordination between the fabric provider,
application, and submitting XPU. As a result, hardware implementation
details need to be conveyed across the computational domains. The XPU
trigger API abstracts those details. When submitting an XPU trigger
operation, the user identifies the XPU where the triggering will occur.
The triggering XPU must match the location of the local memory regions.
For example, if triggering will be done by a GPU kernel, the type of GPU
and its local identifier are given. As output, the fabric provider will
return a list of variables and corresponding values. The XPU signals that
the data transfer is safe to initiate by writing the given values to the
specified variable locations. The number of variables and their sizes are
provider specific.

XPU trigger operations are submitted using the FI_TRIGGER flag with struct
fi_triggered_context or struct fi_triggered_context2, as required by the
provider. The trigger event_type is:

FI_TRIGGER_XPU
    Indicates that the data transfer operation will be deferred until the
    user writes provider specified data to provider indicated memory
    locations. The user indicates which device will initiate the write.
    The struct fi_trigger_xpu is used to convey both input and output
    data regarding the signaling of the trigger.

    struct fi_trigger_var {
        enum fi_datatype datatype;
        int count;
        void *addr;
        union {
            uint8_t val8;
            uint16_t val16;
            uint32_t val32;
            uint64_t val64;
            uint8_t *data;
        } value;
    };

    struct fi_trigger_xpu {
        int count;
        enum fi_hmem_iface iface;
        union {
            uint64_t reserved;
            int cuda;
            int ze;
        } device;
        struct fi_trigger_var *var;
    };

On input to a triggered operation, the iface field indicates the software
interface that will be used to write the variables. The device union
specifies the device identifier. For valid iface and device values, see
fi_mr(3). The iface and device must match the iface and device of any
local HMEM memory regions. Count should be set to the number of
fi_trigger_var structures available, with the var field pointing to an
array of struct fi_trigger_var. The user is responsible for ensuring that
there are sufficient fi_trigger_var structures available and of an
appropriate size. The count and size of fi_trigger_var structures can be
obtained by calling fi_getopt() on the endpoint with the
FI_OPT_XPU_TRIGGER option. See fi_endpoint(3) for details.

Each fi_trigger_var structure referenced should have its datatype and
count fields initialized, with count set to the number of values
referenced by the struct fi_trigger_var. If the count is 1, one of the val
fields will be used to return the necessary data (val8, val16, etc.). If
count > 1, the data field will return all necessary data used to signal
the trigger. The data field must reference a buffer large enough to hold
the returned bytes.

On output, the provider will set the fi_trigger_xpu count to the number of
fi_trigger_var variables that must be signaled. Count will be less than or
equal to the input value. The provider will initialize each valid
fi_trigger_var entry with information needed to signal the trigger. The
datatype indicates the size of the data that must be written. Valid
datatype values are FI_UINT8, FI_UINT16, FI_UINT32, and FI_UINT64. For
signal variables <= 64 bits, the count field will be 1. If a trigger
requires writing more than 64 bits, the datatype field will be set to
FI_UINT8, with count set to the number of bytes that must be written. The
data that must be written to signal the start of an operation is returned
through either the value union val fields or the data array.

Users signal the start of a transfer by writing the returned data to the
given memory address. The write must occur from the specified input XPU
location (based on the iface and device fields). If a transfer cannot be
initiated for some reason, such as an error occurring before the transfer
can start, the triggered operation should be canceled to release any
allocated resources. If multiple variables are specified, they must be
updated in order.
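The signaling step can be sketched as a loop over the returned variables. The code below is a host-side, single-threaded model only: it declares a local copy of struct fi_trigger_var from this page and a placeholder enum fi_datatype (its numeric values are not the library's), and the helper name signal_trigger is hypothetical. A real trigger would be written from the XPU named by iface and device, using whatever stores and memory fences that device requires.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins so the sketch compiles without libfabric; the enum
 * values are placeholders, not the library's numeric values. */
enum fi_datatype { FI_UINT8, FI_UINT16, FI_UINT32, FI_UINT64 };

struct fi_trigger_var {
    enum fi_datatype datatype;
    int count;
    void *addr;
    union {
        uint8_t val8;
        uint16_t val16;
        uint32_t val32;
        uint64_t val64;
        uint8_t *data;
    } value;
};

/* Hypothetical helper: write each returned value to its address, walking
 * the array in order. Returns 0 on success, or -1 if an entry has a
 * datatype and count combination this page does not describe. */
static int signal_trigger(const struct fi_trigger_var *var, int nvar)
{
    for (int i = 0; i < nvar; i++) {
        const struct fi_trigger_var *v = &var[i];

        if (v->count > 1) {
            /* More than 64 bits: FI_UINT8 with count as a byte count. */
            if (v->datatype != FI_UINT8)
                return -1;
            memcpy(v->addr, v->value.data, (size_t)v->count);
            continue;
        }
        if (v->count != 1)
            return -1;

        switch (v->datatype) {
        case FI_UINT8:
            *(volatile uint8_t *)v->addr = v->value.val8;
            break;
        case FI_UINT16:
            *(volatile uint16_t *)v->addr = v->value.val16;
            break;
        case FI_UINT32:
            *(volatile uint32_t *)v->addr = v->value.val32;
            break;
        case FI_UINT64:
            *(volatile uint64_t *)v->addr = v->value.val64;
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

The var array and the count passed to such a helper would be the ones the provider filled in when the triggered operation was submitted.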

Note that the provider will not modify the fi_trigger_xpu or
fi_trigger_var structures after returning from the data transfer call.

In order to support multiple provider implementations, users should
trigger data transfer operations in the same order that they are queued
and should serialize the writing of triggers that reference the same
endpoint. Providers may return the same trigger variable for multiple data
transfer requests.

DEFERRED WORK QUEUES
The following feature and description are enhancements to triggered
operation support.

The deferred work queue interface provides primitive constructs that can
be used to implement application-level collective operations. Deferred
work requests are a more advanced form of triggered operation. They allow
an application to queue operations to a deferred work queue that is
associated with the domain. Note that the deferred work queue is a
conceptual construct, rather than an implementation requirement. Deferred
work requests consist of three main components: an event or condition that
must first be met, an operation to perform, and a completion notification.

Because deferred work requests are posted directly to the domain, they can
support a broader set of conditions and operations. Deferred work requests
are submitted using struct fi_deferred_work. That structure, along with
the corresponding operation structures (referenced through the op union)
used to describe the work, must remain valid until the operation completes
or is canceled. The format of the deferred work request is as follows:

struct fi_deferred_work {
    struct fi_context2 context;

    uint64_t threshold;
    struct fid_cntr *triggering_cntr;
    struct fid_cntr *completion_cntr;

    enum fi_trigger_op op_type;

    union {
        struct fi_op_msg *msg;
        struct fi_op_tagged *tagged;
        struct fi_op_rma *rma;
        struct fi_op_atomic *atomic;
        struct fi_op_fetch_atomic *fetch_atomic;
        struct fi_op_compare_atomic *compare_atomic;
        struct fi_op_cntr *cntr;
    } op;
};

Once a work request has been posted to the deferred work queue, it will
remain on the queue until the triggering counter (success plus error
counter values) has reached the indicated threshold. If the triggering
condition has already been met at the time the work request is queued, the
operation will be initiated immediately.

On the completion of a deferred data transfer, the specified completion
counter will be incremented by one. Note that deferred counter operations
do not update the completion counter; only the counter specified through
the fi_op_cntr is modified. The completion_cntr field must be NULL for
counter operations.
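The rules in the two paragraphs above can be restated as a small stand-alone model. Nothing in this sketch is libfabric API: the model_ types and the work_ functions are hypothetical simplifications, each counter is reduced to a success count and an error count, and a successful completion is assumed to increment the success count of the completion counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; none of these names are libfabric API. */
struct model_cntr {
    uint64_t success;   /* successful events counted */
    uint64_t error;     /* error events counted */
};

enum model_op_kind { MODEL_OP_DATA_TRANSFER, MODEL_OP_COUNTER };

struct model_work {
    uint64_t threshold;
    struct model_cntr *triggering_cntr;
    struct model_cntr *completion_cntr;   /* must be NULL for counter ops */
    enum model_op_kind kind;
};

/* A counter operation must not name a completion counter. */
static bool work_is_valid(const struct model_work *w)
{
    return !(w->kind == MODEL_OP_COUNTER && w->completion_cntr != NULL);
}

/* The request leaves the queue once success plus error reaches threshold. */
static bool work_is_ready(const struct model_work *w)
{
    assert(w->triggering_cntr != NULL);
    return w->triggering_cntr->success + w->triggering_cntr->error
           >= w->threshold;
}

/* Model the completion of a deferred data transfer: the completion
 * counter, if one was given, goes up by one. Counter operations leave
 * it untouched. */
static void work_complete(const struct model_work *w)
{
    if (w->kind == MODEL_OP_DATA_TRANSFER && w->completion_cntr != NULL)
        w->completion_cntr->success++;
}
```

With a triggering counter holding two successes and one error, a request with threshold 3 is already ready when queued, which corresponds to the case where the operation is initiated immediately.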

Because deferred work targets support of collective communication
operations, posted work requests do not generate any completions at the
endpoint by default. For example, completed operations are not written to
the EP’s completion queue and do not update the EP counter (unless the EP
counter is explicitly referenced as the completion_cntr). An application
may request EP completions by specifying the FI_COMPLETION flag as part of
the operation.

It is the responsibility of the application to detect and handle
situations that could result in a deferred work request’s condition not
being met. For example, if a work request is dependent upon the successful
completion of a data transfer operation, and that operation fails, then
the application must cancel the work request.

To submit a deferred work request, applications should use the domain’s
fi_control function with command FI_QUEUE_WORK and struct fi_deferred_work
as the fi_control arg parameter. To cancel a deferred work request, use
fi_control with command FI_CANCEL_WORK and the corresponding struct
fi_deferred_work to cancel. The fi_control command FI_FLUSH_WORK will
cancel all queued work requests. FI_FLUSH_WORK may be used to flush all
work queued to the domain, or may be used to cancel all requests waiting
on a specific triggering_cntr.

Deferred work requests are not acted upon by the provider until the
associated event has occurred, although certain validation checks may
still occur when a request is submitted. Referenced data buffers are not
read or otherwise accessed, but the provider may validate fabric objects,
such as endpoints and counters, and check that input parameters fall
within supported ranges. If a specific request is not supported by the
provider, it will fail the operation with -FI_ENOSYS.

SEE ALSO
fi_getinfo(3), fi_endpoint(3), fi_mr(3), fi_alias(3), fi_cntr(3)

AUTHORS
OpenFabrics.

Libfabric Programmer’s Manual         2021-11-20         fi_trigger(3)