fi_trigger(3)                  Libfabric v1.15.1                 fi_trigger(3)

NAME

       fi_trigger - Triggered operations

SYNOPSIS

              #include <rdma/fi_trigger.h>

DESCRIPTION

       Triggered operations allow an application to queue a data transfer
       request that is deferred until a specified condition is met.  A typical
       use is to send a message only after receiving all input data.
       Triggered operations can help reduce the latency needed to initiate a
       transfer by removing the need to return control to the application
       before the data transfer starts.

       An endpoint must be created with the FI_TRIGGER capability in order for
       triggered operations to be specified; such an endpoint is referred to
       as a trigger-able endpoint.  A triggered operation is requested by
       specifying the FI_TRIGGER flag as part of the operation.

       Any data transfer operation is potentially trigger-able, subject to
       provider constraints.  Trigger-able endpoints are initialized such that
       only those interfaces supported by the provider which are trigger-able
       are available.

       Triggered operations require that applications use struct
       fi_triggered_context as their per-operation context parameter or, if
       the provider requires the FI_CONTEXT2 mode, struct
       fi_triggered_context2.  The use of struct fi_triggered_context[2]
       replaces struct fi_context[2], if required by the provider.  Although
       struct fi_triggered_context[2] is not opaque to the application, the
       contents of the structure may be modified by the provider once it has
       been submitted as an operation.  This structure has similar
       requirements to struct fi_context[2]: it must be allocated by the
       application and remain valid until the corresponding operation
       completes or is successfully canceled.

       Struct fi_triggered_context[2] is used to specify the condition that
       must be met before the triggered data transfer is initiated.  If the
       condition is already met when the request is made, the data transfer
       may be initiated immediately.  The format of struct
       fi_triggered_context[2] is described below.

              struct fi_triggered_context {
                  enum fi_trigger_event event_type;   /* trigger type */
                  union {
                      struct fi_trigger_threshold threshold;
                      struct fi_trigger_xpu xpu;
                      void *internal[3]; /* reserved */
                  } trigger;
              };

              struct fi_triggered_context2 {
                  enum fi_trigger_event event_type;   /* trigger type */
                  union {
                      struct fi_trigger_threshold threshold;
                      struct fi_trigger_xpu xpu;
                      void *internal[7]; /* reserved */
                  } trigger;
              };

       The triggered context indicates the type of event assigned to the
       trigger, along with a union of trigger details that is based on the
       event type.

COMPLETION BASED TRIGGERS

       Completion based triggers defer a data transfer until one or more
       related data transfers complete.  For example, a send operation may be
       deferred until a receive operation completes, indicating that the data
       to be transferred is now available.

       The following trigger event related to completion based transfers is
       defined.

       FI_TRIGGER_THRESHOLD
              This indicates that the data transfer operation will be deferred
              until an event counter crosses an application specified
              threshold value.  The threshold is specified using struct
              fi_trigger_threshold:

              struct fi_trigger_threshold {
                  struct fid_cntr *cntr; /* event counter to check */
                  size_t threshold;      /* threshold value */
              };

       Threshold operations are triggered in the order of the threshold
       values.  This is true even if the counter increments by a value greater
       than 1.  If two triggered operations have the same threshold, they will
       be triggered in the order in which they were submitted to the endpoint.

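       The ordering rule can be illustrated with a small, self-contained
       model.  The sketch below does not call libfabric: the model_trig_op
       type and the model_next_ready() and model_cntr_add() helpers are
       hypothetical names.  They show how operations whose thresholds are at
       or below the counter value would be initiated in threshold order, with
       ties broken by submission order, even when a single increment crosses
       several thresholds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of threshold-triggered operations; these are not
 * libfabric types or functions. */
struct model_trig_op {
    size_t threshold;   /* counter value the operation waits for */
    int    seq;         /* submission order, used to break ties */
    int    fired;       /* nonzero once the operation has been initiated */
};

/* Return the index of the next operation to initiate: the unfired one with
 * the smallest threshold <= value, ties broken by submission order.
 * Returns -1 when no operation is ready. */
static int model_next_ready(const struct model_trig_op *ops, int n,
                            size_t value)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (ops[i].fired || ops[i].threshold > value)
            continue;
        if (best < 0 || ops[i].threshold < ops[best].threshold ||
            (ops[i].threshold == ops[best].threshold &&
             ops[i].seq < ops[best].seq))
            best = i;
    }
    return best;
}

/* Add inc to the modeled counter, then initiate every operation whose
 * threshold is now met, in threshold order.  The seq of each initiated
 * operation is appended to order (capacity cap, current length len);
 * the new length is returned. */
static int model_cntr_add(struct model_trig_op *ops, int n, size_t *value,
                          size_t inc, int *order, int len, int cap)
{
    int i;

    *value += inc;
    while ((i = model_next_ready(ops, n, *value)) >= 0) {
        assert(len < cap);
        ops[i].fired = 1;
        order[len++] = ops[i].seq;
    }
    return len;
}
```

       For example, with operations submitted with thresholds 3, 2, 1 and 2
       (submission order 0 to 3), one model_cntr_add() call that adds 3
       initiates them in submission order 2, 1, 3, 0.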

XPU TRIGGERS

       XPU based triggers work in conjunction with heterogeneous memory
       (FI_HMEM capability).  XPU triggers define a split execution model for
       specifying a data transfer separately from initiating the transfer.
       Unlike completion triggers, the user controls the timing of when the
       transfer starts by writing data into a trigger variable location.

       XPU transfers allow the requesting and triggering to occur on separate
       computational domains.  For example, a process running on the host CPU
       can set up a data transfer, with a compute kernel running on a GPU
       signaling the start of the transfer.  XPU refers to a CPU, GPU, FPGA,
       or other acceleration device with some level of computational ability.

       Endpoints must be created with both the FI_TRIGGER and FI_XPU
       capabilities to use XPU triggers.  XPU trigger-enabled endpoints only
       support XPU triggered operations.  The behavior of mixing XPU triggered
       operations with normal data transfers or non-XPU triggered operations
       is not defined by the API and is subject to provider support and
       implementation.

       The use of XPU triggers requires coordination between the fabric
       provider, the application, and the submitting XPU.  As a result,
       hardware implementation details need to be conveyed across the
       computational domains.  The XPU trigger API abstracts those details.
       When submitting an XPU trigger operation, the user identifies the XPU
       where the triggering will occur.  The triggering XPU must match the
       location of the local memory regions.  For example, if triggering will
       be done by a GPU kernel, the type of GPU and its local identifier are
       given.  As output, the fabric provider will return a list of variables
       and corresponding values.  The XPU signals that the data transfer is
       safe to initiate by writing the given values to the specified variable
       locations.  The number of variables and their sizes are provider
       specific.

       XPU trigger operations are submitted using the FI_TRIGGER flag with
       struct fi_triggered_context or struct fi_triggered_context2, as
       required by the provider.  The trigger event_type is:

       FI_TRIGGER_XPU
              Indicates that the data transfer operation will be deferred
              until the user writes provider specified data to provider
              indicated memory locations.  The user indicates which device
              will initiate the write.  The struct fi_trigger_xpu is used to
              convey both input and output data regarding the signaling of the
              trigger.

              struct fi_trigger_var {
                  enum fi_datatype datatype;
                  int count;
                  void *addr;
                  union {
                      uint8_t val8;
                      uint16_t val16;
                      uint32_t val32;
                      uint64_t val64;
                      uint8_t *data;
                  } value;
              };

              struct fi_trigger_xpu {
                  int count;
                  enum fi_hmem_iface iface;
                  union {
                      uint64_t reserved;
                      int cuda;
                      int ze;
                  } device;
                  struct fi_trigger_var *var;
              };

       On input to a triggered operation, the iface field indicates the
       software interface that will be used to write the variables.  The
       device union specifies the device identifier.  For valid iface and
       device values, see fi_mr(3).  The iface and device must match the iface
       and device of any local HMEM memory regions.  Count should be set to
       the number of fi_trigger_var structures available, with the var field
       pointing to an array of struct fi_trigger_var.  The user is responsible
       for ensuring that there are sufficient fi_trigger_var structures
       available and of an appropriate size.  The count and size of
       fi_trigger_var structures can be obtained by calling fi_getopt() on the
       endpoint with the FI_OPT_XPU_TRIGGER option.  See fi_endpoint(3) for
       details.

       Each referenced fi_trigger_var structure should have its datatype and
       count fields initialized, with count set to the number of values
       referenced by the struct fi_trigger_var.  If the count is 1, one of the
       val fields (val8, val16, etc.) will be used to return the necessary
       data.  If count > 1, the data field will return all data necessary to
       signal the trigger.  The data field must reference a buffer large
       enough to hold the returned bytes.

       On output, the provider will set the fi_trigger_xpu count to the number
       of fi_trigger_var variables that must be signaled.  Count will be less
       than or equal to the input value.  The provider will initialize each
       valid fi_trigger_var entry with the information needed to signal the
       trigger.  The datatype indicates the size of the data that must be
       written.  Valid datatype values are FI_UINT8, FI_UINT16, FI_UINT32, and
       FI_UINT64.  For signal variables of 64 bits or less, the count field
       will be 1.  If a trigger requires writing more than 64 bits, the
       datatype field will be set to FI_UINT8, with count set to the number of
       bytes that must be written.  The data that must be written to signal
       the start of an operation is returned through either the val fields of
       the value union or the data array.

       Users signal the start of a transfer by writing the returned data to
       the given memory address.  The write must occur from the specified
       input XPU location (based on the iface and device fields).  If a
       transfer cannot be initiated for some reason, such as an error
       occurring before the transfer can start, the triggered operation should
       be canceled to release any allocated resources.  If multiple variables
       are specified, they must be updated in order.

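       A host-side model of the signaling step follows.  It is a sketch under
       stated assumptions: model_trigger_var mirrors the fields of struct
       fi_trigger_var but is a hypothetical type, the MODEL_UINT* constants
       stand in for FI_UINT8 through FI_UINT64, and the writes are plain CPU
       stores.  In practice the writes must be issued from the XPU named by
       the iface and device fields (for example, from a GPU kernel), and a
       real implementation would also have to consider memory ordering and
       visibility on that device.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for FI_UINT8..FI_UINT64 and struct fi_trigger_var.
 * They mirror the documented fields but are not libfabric definitions. */
enum model_datatype { MODEL_UINT8, MODEL_UINT16, MODEL_UINT32, MODEL_UINT64 };

struct model_trigger_var {
    enum model_datatype datatype;
    int count;
    void *addr;
    union {
        uint8_t val8;
        uint16_t val16;
        uint32_t val32;
        uint64_t val64;
        uint8_t *data;
    } value;
};

/* Write one variable's signal value to its address with a plain store.
 * Returns 0 on success, -1 for a datatype/count combination that the
 * description above does not produce. */
static int model_signal_var(const struct model_trigger_var *v)
{
    assert(v != NULL && v->addr != NULL);

    if (v->count > 1) {
        /* Signals wider than 64 bits: count bytes taken from value.data. */
        if (v->datatype != MODEL_UINT8 || v->value.data == NULL)
            return -1;
        memcpy(v->addr, v->value.data, (size_t)v->count);
        return 0;
    }
    if (v->count != 1)
        return -1;

    /* Single value: the datatype selects the width of the store. */
    switch (v->datatype) {
    case MODEL_UINT8:  *(uint8_t *)v->addr  = v->value.val8;  return 0;
    case MODEL_UINT16: *(uint16_t *)v->addr = v->value.val16; return 0;
    case MODEL_UINT32: *(uint32_t *)v->addr = v->value.val32; return 0;
    case MODEL_UINT64: *(uint64_t *)v->addr = v->value.val64; return 0;
    }
    return -1;
}

/* Signal the variables strictly in array order.  Returns the number of
 * variables written before the first failure (count when all succeed). */
static int model_signal_all(const struct model_trigger_var *vars, int count)
{
    int i;

    for (i = 0; i < count; i++)
        if (model_signal_var(&vars[i]) != 0)
            break;
    return i;
}
```

       model_signal_all() stops at the first variable it cannot write, which
       keeps later variables from being updated ahead of earlier ones; the
       triggered operation would then be canceled as described above.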
       Note that the provider will not modify the fi_trigger_xpu or
       fi_trigger_var structures after returning from the data transfer call.

       In order to support multiple provider implementations, users should
       trigger data transfer operations in the same order that they are queued
       and should serialize the writing of triggers that reference the same
       endpoint.  Providers may return the same trigger variable for multiple
       data transfer requests.

DEFERRED WORK QUEUES

       The following feature and description are enhancements to triggered
       operation support.

       The deferred work queue interface is designed as a set of primitive
       constructs that can be used to implement application-level collective
       operations.  These constructs are a more advanced form of triggered
       operation.  They allow an application to queue operations to a deferred
       work queue that is associated with the domain.  Note that the deferred
       work queue is a conceptual construct, rather than an implementation
       requirement.  Deferred work requests consist of three main components:
       an event or condition that must first be met, an operation to perform,
       and a completion notification.

       Because deferred work requests are posted directly to the domain, they
       can support a broader set of conditions and operations.  Deferred work
       requests are submitted using struct fi_deferred_work.  That structure,
       along with the corresponding operation structures (referenced through
       the op union) used to describe the work, must remain valid until the
       operation completes or is canceled.  The format of the deferred work
       request is as follows:

              struct fi_deferred_work {
                  struct fi_context2    context;

                  uint64_t              threshold;
                  struct fid_cntr       *triggering_cntr;
                  struct fid_cntr       *completion_cntr;

                  enum fi_trigger_op    op_type;

                  union {
                      struct fi_op_msg            *msg;
                      struct fi_op_tagged         *tagged;
                      struct fi_op_rma            *rma;
                      struct fi_op_atomic         *atomic;
                      struct fi_op_fetch_atomic   *fetch_atomic;
                      struct fi_op_compare_atomic *compare_atomic;
                      struct fi_op_cntr           *cntr;
                  } op;
              };

       Once a work request has been posted to the deferred work queue, it will
       remain on the queue until the triggering counter (success plus error
       counter values) has reached the indicated threshold.  If the triggering
       condition has already been met at the time the work request is queued,
       the operation will be initiated immediately.

       On the completion of a deferred data transfer, the specified completion
       counter will be incremented by one.  Note that deferred counter
       operations do not update the completion counter; only the counter
       specified through the fi_op_cntr is modified.  The completion_cntr
       field must be NULL for counter operations.

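       The queueing and completion rules above can be modeled without a
       provider.  In the sketch below, model_cntr, model_work, model_submit()
       and model_progress() are hypothetical names, not libfabric APIs, and
       data transfers are assumed to complete successfully as soon as they
       are initiated.  A request becomes ready once the success plus error
       count of its triggering counter reaches its threshold, a completed
       transfer adds one to its completion counter, and a counter operation
       changes only its target counter and is rejected if it names a
       completion counter.  Feeding one request's completion counter into the
       next request's triggering counter is the kind of chaining from which
       collective operations can be assembled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a domain's deferred work queue.  None of these
 * names are libfabric APIs; they only illustrate the rules above. */
struct model_cntr {
    uint64_t success;
    uint64_t error;
};

enum model_op { MODEL_OP_XFER, MODEL_OP_CNTR_ADD };

struct model_work {
    uint64_t threshold;
    struct model_cntr *triggering_cntr;
    struct model_cntr *completion_cntr;  /* must be NULL for counter ops */
    enum model_op op_type;
    struct model_cntr *target;           /* counter ops: counter to modify */
    uint64_t value;                      /* counter ops: amount to add */
    int done;
};

/* The condition counts both success and error events. */
static int model_ready(const struct model_work *w)
{
    const struct model_cntr *c = w->triggering_cntr;

    return c->success + c->error >= w->threshold;
}

/* Initiate a ready request.  A transfer is assumed to complete successfully
 * at once and adds one to its completion counter; a counter operation
 * updates only its target counter. */
static void model_execute(struct model_work *w)
{
    if (w->op_type == MODEL_OP_CNTR_ADD)
        w->target->success += w->value;
    else if (w->completion_cntr != NULL)
        w->completion_cntr->success += 1;
    w->done = 1;
}

/* Queue one request.  Counter operations that name a completion counter
 * are rejected; a request whose condition already holds runs immediately. */
static int model_submit(struct model_work *w)
{
    assert(w != NULL && w->triggering_cntr != NULL);

    if (w->op_type == MODEL_OP_CNTR_ADD &&
        (w->completion_cntr != NULL || w->target == NULL))
        return -1;
    w->done = 0;
    if (model_ready(w))
        model_execute(w);
    return 0;
}

/* Initiate pending requests until no further condition is met, so requests
 * chained through counters all fire.  Returns how many requests ran. */
static int model_progress(struct model_work *q, int n)
{
    int ran = 0, progressed = 1;

    while (progressed) {
        progressed = 0;
        for (int i = 0; i < n; i++) {
            if (!q[i].done && model_ready(&q[i])) {
                model_execute(&q[i]);
                ran++;
                progressed = 1;
            }
        }
    }
    return ran;
}
```

       For example, two transfers chained through an intermediate counter,
       followed by a counter add, all run in one model_progress() call once
       the first triggering counter reaches its threshold, whether that count
       comes from success or error events.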
       Because deferred work targets support of collective communication
       operations, posted work requests do not generate any completions at the
       endpoint by default.  For example, completed operations are not written
       to the EP’s completion queue, nor do they update the EP counter (unless
       the EP counter is explicitly referenced as the completion_cntr).  An
       application may request EP completions by specifying the FI_COMPLETION
       flag as part of the operation.

       It is the responsibility of the application to detect and handle
       situations that could result in a deferred work request’s condition not
       being met.  For example, if a work request depends upon the successful
       completion of a data transfer operation that fails, the application
       must cancel the work request.

       To submit a deferred work request, applications should use the
       domain’s fi_control function with command FI_QUEUE_WORK and struct
       fi_deferred_work as the fi_control arg parameter.  To cancel a deferred
       work request, use fi_control with command FI_CANCEL_WORK and the
       corresponding struct fi_deferred_work to cancel.  The fi_control
       command FI_FLUSH_WORK will cancel all queued work requests.
       FI_FLUSH_WORK may be used to flush all work queued to the domain, or
       may be used to cancel all requests waiting on a specific
       triggering_cntr.

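       Cancellation can be sketched in the same spirit.  The model below uses
       hypothetical names (model_pending, model_cancel(), model_flush()) and
       does not describe how fi_control() interprets its arg parameter for
       FI_FLUSH_WORK; the convention that a NULL counter flushes everything is
       an assumption of this sketch.  It shows the two scopes described
       above: canceling one request, and flushing either all pending requests
       or only those waiting on a given triggering counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of pending deferred work; not libfabric types. */
struct model_cntr {
    int id;
};

struct model_pending {
    const struct model_cntr *triggering_cntr;
    int canceled;
};

/* Cancel one specific request, in the spirit of FI_CANCEL_WORK.
 * Returns 1 if it was pending, 0 if it had already been canceled. */
static int model_cancel(struct model_pending *w)
{
    assert(w != NULL);
    if (w->canceled)
        return 0;
    w->canceled = 1;
    return 1;
}

/* Cancel all pending requests or, when cntr is non-NULL, only those waiting
 * on cntr, in the spirit of the two uses of FI_FLUSH_WORK.  The NULL
 * convention is an assumption of this model.  Returns how many requests
 * were newly canceled. */
static int model_flush(struct model_pending *q, int n,
                       const struct model_cntr *cntr)
{
    int canceled = 0;

    for (int i = 0; i < n; i++) {
        if (cntr != NULL && q[i].triggering_cntr != cntr)
            continue;
        canceled += model_cancel(&q[i]);
    }
    return canceled;
}
```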
       Deferred work requests are not acted upon by the provider until the
       associated event has occurred, although certain validation checks may
       still occur when a request is submitted.  Referenced data buffers are
       not read or otherwise accessed; however, the provider may validate
       fabric objects, such as endpoints and counters, and check that input
       parameters fall within supported ranges.  If a specific request is not
       supported by the provider, it will fail the operation with -FI_ENOSYS.

SEE ALSO

       fi_getinfo(3), fi_endpoint(3), fi_mr(3), fi_alias(3), fi_cntr(3)

AUTHORS

       OpenFabrics.

Libfabric Programmer’s Manual     2021-11-20                     fi_trigger(3)