strigger(1)                     Slurm Commands                    strigger(1)

NAME
       strigger - Used to set, get or clear Slurm trigger information.

SYNOPSIS
       strigger --set   [OPTIONS...]
       strigger --get   [OPTIONS...]
       strigger --clear [OPTIONS...]

DESCRIPTION
       strigger is used to set, get or clear Slurm trigger information.
       Triggers include events such as a node failing, a job reaching its
       time limit or a job terminating.  These events can cause actions
       such as the execution of an arbitrary script.  Typical uses include
       notifying system administrators of node failures and gracefully
       terminating a job when its time limit is approaching.  A hostlist
       expression for the nodelist or job ID is passed as an argument to
       the program.

       Trigger events are not processed instantly; instead, a check for
       trigger events is performed periodically (currently every 15
       seconds).  Any trigger events which occur within that interval are
       compared against the registered triggers at the end of the
       interval, and the trigger program is executed once for any event
       occurring in that interval.  The record of those events (e.g.
       nodes which went DOWN in the previous 15 seconds) is then cleared.
       The trigger program must set a new trigger before the end of the
       next interval to ensure that no trigger events are missed, OR the
       trigger must be created with an argument of "--flags=PERM".  If
       desired, multiple trigger programs can be set for the same event.
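
       For example, a node-down trigger can be made permanent with
       --flags=PERM so that it does not need to be re-registered after
       each event (the program path below is illustrative, not a script
       shipped with Slurm):

       # the program path below is only an illustrative placeholder
       $ strigger --set --node --down --flags=PERM \
                  --program=/usr/sbin/notify_node_down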

       IMPORTANT NOTE: This command can set triggers only if run by the
       user SlurmUser unless SlurmUser is configured as user root.  This
       is required for the slurmctld daemon to set the appropriate user
       and group IDs for the executed program.  Also note that the
       trigger program is executed on the same node that the slurmctld
       daemon runs on rather than on some allocated compute node.  To
       check the value of SlurmUser, run the command:

              scontrol show config | grep SlurmUser
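
       If SlurmUser is, for example, an account named "slurm" (an
       assumption; verify it with the command above), an administrator
       with sudo rights could register a trigger as that user with:

       # assumes SlurmUser=slurm; the program path is from the EXAMPLES below
       $ sudo -u slurm strigger --set --node --down \
                  --program=/usr/sbin/slurm_admin_notify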

OPTIONS
       -a, --primary_slurmctld_failure
              Trigger an event when the primary slurmctld fails.

       -A, --primary_slurmctld_resumed_operation
              Trigger an event when the primary slurmctld resumes
              operation after failure.

       -b, --primary_slurmctld_resumed_control
              Trigger an event when the primary slurmctld resumes
              control.

       -B, --backup_slurmctld_failure
              Trigger an event when the backup slurmctld fails.

       -c, --backup_slurmctld_resumed_operation
              Trigger an event when the backup slurmctld resumes
              operation after failure.

       -C, --backup_slurmctld_assumed_control
              Trigger an event when the backup slurmctld assumes control.

       --burst_buffer
              Trigger an event when a burst buffer error occurs.

       --clear
              Clear or delete a previously defined event trigger.  The
              --id, --jobid or --user option must be specified to
              identify the trigger(s) to be cleared.  Only user root or
              the trigger's creator can delete a trigger.

       -d, --down
              Trigger an event if the specified node goes into a DOWN
              state.

       -D, --drained
              Trigger an event if the specified node goes into a DRAINED
              state.

       -e, --primary_slurmctld_acct_buffer_full
              Trigger an event when the primary slurmctld accounting
              buffer is full.

       -F, --fail
              Trigger an event if the specified node goes into a FAILING
              state.

       -f, --fini
              Trigger an event when the specified job completes
              execution.

       --flags=type
              Associate flags with the trigger.  Multiple flags should be
              comma separated.  Valid flags include:

              PERM   Make the trigger permanent.  Do not purge it after
                     the event occurs.

       --front_end
              Trigger events based upon changes in state of front end
              nodes rather than compute nodes.  Applies to Cray ALPS
              architectures only, where the slurmd daemon executes on
              front end nodes rather than on the compute nodes.  Use this
              option with either the --up or --down option.

       -g, --primary_slurmdbd_failure
              Trigger an event when the primary slurmdbd fails.  The
              trigger is launched by slurmctld when it tries to connect
              to slurmdbd but receives no response on the socket.

       -G, --primary_slurmdbd_resumed_operation
              Trigger an event when the primary slurmdbd resumes
              operation after failure.  This event is triggered when
              opening the connection from slurmctld to slurmdbd results
              in a response.  This can happen in several situations:
              periodically (every 15 seconds) when checking the
              connection status, when saving state, when the agent queue
              is filling, and so on.

       --get  Show registered event triggers.  Options can be used for
              filtering purposes.
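
              For example, the list can be restricted to the triggers
              created by one user (the user name "joe" is illustrative):

              # user name is illustrative
              $ strigger --get --user=joe --noheader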

       -h, --primary_database_failure
              Trigger an event when the primary database fails.  This
              event is triggered when the accounting plugin tries to open
              a connection with MySQL and it fails, and the slurmctld
              needs the database for some operations.

       -H, --primary_database_resumed_operation
              Trigger an event when the primary database resumes
              operation after failure.  This happens when the connection
              to MySQL from the accounting plugin is restored.

       -i, --id=id
              Trigger ID number.

       -I, --idle
              Trigger an event if the specified node remains in an IDLE
              state for at least the time period specified by the
              --offset option.  This can be useful to hibernate a node
              that remains idle, thus reducing power consumption.

       -j, --jobid=id
              Job ID of interest.  NOTE: The --jobid option cannot be
              used in conjunction with the --node option.  When the
              --jobid option is used in conjunction with the --up or
              --down option, all nodes allocated to that job will be
              considered the nodes used as a trigger event.

       -M, --clusters=<string>
              Clusters to issue commands to.  Note that the SlurmDBD must
              be up for this option to work properly.

       -n, --node[=host]
              Host name(s) of interest.  By default, all nodes associated
              with the job (if --jobid is specified) or on the system are
              considered for event triggers.  NOTE: The --node option
              cannot be used in conjunction with the --jobid option.
              When the --jobid option is used in conjunction with the
              --up, --down or --drained option, all nodes allocated to
              that job will be considered the nodes used as a trigger
              event.  Since this option's argument is optional, for
              proper parsing the single letter option must be followed
              immediately with the value and not include a space between
              them.  For example "-ntux" and not "-n tux".
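
              For example (the node names are illustrative), a down-node
              trigger limited to a set of hosts can be set with a
              hostlist expression, keeping the value immediately after
              the -n:

              # node names tux[1-4] are illustrative
              $ strigger --set --down -ntux[1-4] \
                         --program=/usr/sbin/slurm_admin_notify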

       -N, --noheader
              Do not print the header when displaying a list of triggers.

       -o, --offset=seconds
              The specified action should follow the event by this time
              interval.  Specify a negative value if the action should
              precede the event.  The default value is zero if no
              --offset option is specified.  The resolution of this time
              is about 20 seconds, so to execute a script not less than
              five minutes prior to a job reaching its time limit,
              specify --offset=-320 (5 minutes plus 20 seconds).
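
              For example, using the job ID and script path from the
              EXAMPLES section below, a clean-up script could be started
              roughly five minutes before the job's time limit:

              $ strigger --set --jobid=1234 --time --offset=-320 \
                         --program=/home/joe/clean_up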

       -p, --program=path
              Execute the program at the specified fully qualified
              pathname when the event occurs.  You may quote the path and
              include extra program arguments if desired.  The program
              will be executed as the user who sets the trigger.  If the
              program fails to terminate within 5 minutes, it will be
              killed along with any spawned processes.
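
              For example, extra arguments can be passed by quoting the
              whole value (the mail address argument is illustrative):

              # the extra argument (a mail address) is illustrative
              $ strigger --set --node --down \
                         --program="/usr/sbin/slurm_admin_notify admin@site.com"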

       -Q, --quiet
              Do not report non-fatal errors.  This can be useful to
              clear triggers which may have already been purged.

       -r, --reconfig
              Trigger an event when the system configuration changes.
              This is triggered when the slurmctld daemon reads its
              configuration file or when a node state changes.

       --set  Register an event trigger based upon the supplied options.
              NOTE: An event is only triggered once.  A new event trigger
              must be established for future events of the same type to
              be processed.  Triggers can only be set if the command is
              run by the user SlurmUser unless SlurmUser is configured as
              user root.

       -t, --time
              Trigger an event when the specified job's time limit is
              reached.  This must be used in conjunction with the --jobid
              option.

       -u, --up
              Trigger an event if the specified node is returned to
              service from a DOWN state.

       --user=user_name_or_id
              Clear or get triggers created by the specified user.  For
              example, a trigger created by user root for a job created
              by user adam could be cleared with the option --user=root.
              Specify either a user name or user ID.
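
              For example, all triggers created by user root could be
              cleared with:

              $ strigger --clear --user=root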

       -v, --verbose
              Print detailed event logging.  This includes time-stamps on
              data structures, record counts, etc.

       -V, --version
              Print version information and exit.

OUTPUT FIELD DESCRIPTIONS
       TRIG_ID
              Trigger ID number.

       RES_TYPE
              Resource type: job or node.

       RES_ID Resource ID: job ID or host names or "*" for any host.

       TYPE   Trigger type: time or fini (for jobs only), down or up (for
              jobs or nodes), or drained, idle or reconfig (for nodes
              only).

       OFFSET Time offset in seconds.  Negative numbers indicate the
              action should occur before the event (if possible).

       USER   Name of the user requesting the action.

       PROGRAM
              Pathname of the program to execute when the event occurs.

PERFORMANCE
       Executing strigger sends a remote procedure call to slurmctld.  If
       enough calls from strigger or other Slurm client commands that
       send remote procedure calls to the slurmctld daemon come in at
       once, it can result in a degradation of performance of the
       slurmctld daemon, possibly resulting in a denial of service.

       Do not run strigger or other Slurm client commands that send
       remote procedure calls to slurmctld from loops in shell scripts or
       other programs.  Ensure that programs limit calls to strigger to
       the minimum necessary for the information you are trying to
       gather.

ENVIRONMENT VARIABLES
       Some strigger options may be set via environment variables.  These
       environment variables, along with their corresponding options, are
       listed below.  (Note: command line options will always override
       these settings.)

       SLURM_CONF          The location of the Slurm configuration file.
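
       For example (the file path is an illustrative, typical location),
       an alternate configuration file can be selected for a single
       invocation:

       # /etc/slurm/slurm.conf is an illustrative path
       $ SLURM_CONF=/etc/slurm/slurm.conf strigger --get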

EXAMPLES
       Execute the program "/usr/sbin/primary_slurmctld_failure" whenever
       the primary slurmctld fails.

       $ cat /usr/sbin/primary_slurmctld_failure
       #!/bin/bash
       # Submit a trigger for the next primary slurmctld failure event
       strigger --set --primary_slurmctld_failure \
                --program=/usr/sbin/primary_slurmctld_failure
       # Notify the administrator of the failure by e-mail
       /bin/mail -s Primary_SLURMCTLD_FAILURE slurm_admin@site.com

       $ strigger --set --primary_slurmctld_failure \
                  --program=/usr/sbin/primary_slurmctld_failure

       Execute the program "/usr/sbin/slurm_admin_notify" whenever any
       node in the cluster goes down.  The subject line will include the
       node names which have entered the down state (passed as an
       argument to the script by Slurm).

       $ cat /usr/sbin/slurm_admin_notify
       #!/bin/bash
       # Submit a trigger for the next node-down event
       strigger --set --node --down \
                --program=/usr/sbin/slurm_admin_notify
       # Notify the administrator by e-mail, listing the down nodes
       /bin/mail -s "NodesDown:$*" slurm_admin@site.com

       $ strigger --set --node --down \
                  --program=/usr/sbin/slurm_admin_notify

       Execute the program "/usr/sbin/slurm_suspend_node" whenever any
       node in the cluster remains in the idle state for at least 600
       seconds.

       $ strigger --set --node --idle --offset=600 \
                  --program=/usr/sbin/slurm_suspend_node

       Execute the program "/home/joe/clean_up" when job 1234 is within
       10 minutes of reaching its time limit.

       $ strigger --set --jobid=1234 --time --offset=-600 \
                  --program=/home/joe/clean_up

       Execute the program "/home/joe/node_died" when any node allocated
       to job 1234 enters the DOWN state.

       $ strigger --set --jobid=1234 --down \
                  --program=/home/joe/node_died

       Show all triggers associated with job 1235.

       $ strigger --get --jobid=1235
       TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM
           123      job   1235 time   -600  joe /home/joe/clean_up
           125      job   1235 down      0  joe /home/joe/node_died

       Delete event trigger 125.

       $ strigger --clear --id=125

       Execute the program "/home/joe/job_fini" upon completion of job
       1237.

       $ strigger --set --jobid=1237 --fini --program=/home/joe/job_fini

COPYING
       Copyright (C) 2007 The Regents of the University of California.
       Produced at Lawrence Livermore National Laboratory (cf,
       DISCLAIMER).
       Copyright (C) 2008-2010 Lawrence Livermore National Security.
       Copyright (C) 2010-2013 SchedMD LLC.

       This file is part of Slurm, a resource management program.  For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or
       (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

SEE ALSO
       scontrol(1), sinfo(1), squeue(1)

February 2021                   Slurm Commands                    strigger(1)