strigger(1)                     Slurm Commands                    strigger(1)

NAME
       strigger - Used to set, get or clear Slurm trigger information.

SYNOPSIS
       strigger --set [OPTIONS...]
       strigger --get [OPTIONS...]
       strigger --clear [OPTIONS...]

DESCRIPTION
       strigger is used to set, get or clear Slurm trigger information.
       Triggers include events such as a node failing, a job reaching its
       time limit or a job terminating.  These events can cause actions such
       as the execution of an arbitrary script.  Typical uses include
       notifying system administrators of node failures and gracefully
       terminating a job when its time limit is approaching.  A hostlist
       expression for the nodelist or job ID is passed as an argument to the
       program.

       Trigger events are not processed instantly, but a check is performed
       for trigger events on a periodic basis (currently every 15 seconds).
       Any trigger events which occur within that interval will be compared
       against the trigger programs set at the end of the time interval.
       The trigger program will be executed once for any event occurring in
       that interval.  The record of those events (e.g. nodes which went
       DOWN in the previous 15 seconds) will then be cleared.  The trigger
       program must set a new trigger before the end of the next interval to
       ensure that no trigger events are missed OR the trigger must be
       created with an argument of "--flags=PERM".  If desired, multiple
       trigger programs can be set for the same event.
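
       For example, a node DOWN trigger that persists after each event, and
       therefore does not need to re-register itself, might be set as shown
       below (the notification script path is illustrative only):

       > strigger --set --node --down --flags=PERM \
                  --program=/usr/sbin/slurm_admin_notify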

       IMPORTANT NOTE: This command can only set triggers if run by the user
       SlurmUser unless SlurmUser is configured as user root.  This is
       required for the slurmctld daemon to set the appropriate user and
       group IDs for the executed program.  Also note that the trigger
       program is executed on the same node that the slurmctld daemon uses
       rather than some allocated compute node.  To check the value of
       SlurmUser, run the command:

       scontrol show config | grep SlurmUser

ARGUMENTS
       -a, --primary_slurmctld_failure
              Trigger an event when the primary slurmctld fails.

       -A, --primary_slurmctld_resumed_operation
              Trigger an event when the primary slurmctld resumes operation
              after failure.

       -b, --primary_slurmctld_resumed_control
              Trigger an event when the primary slurmctld resumes control.

       -B, --backup_slurmctld_failure
              Trigger an event when the backup slurmctld fails.

       -c, --backup_slurmctld_resumed_operation
              Trigger an event when the backup slurmctld resumes operation
              after failure.

       -C, --backup_slurmctld_assumed_control
              Trigger an event when the backup slurmctld assumes control.

       --burst_buffer
              Trigger an event when a burst buffer error occurs.

       --clear
              Clear or delete a previously defined event trigger.  The --id,
              --jobid or --user option must be specified to identify the
              trigger(s) to be cleared.  Only user root or the trigger's
              creator can delete a trigger.
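
              For example, all triggers associated with a given job might be
              cleared as shown below (the job ID is illustrative):

              > strigger --clear --jobid=1234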

       -d, --down
              Trigger an event if the specified node goes into a DOWN state.

       -D, --drained
              Trigger an event if the specified node goes into a DRAINED
              state.

       -e, --primary_slurmctld_acct_buffer_full
              Trigger an event when the primary slurmctld accounting buffer
              is full.

       -F, --fail
              Trigger an event if the specified node goes into a FAILING
              state.

       -f, --fini
              Trigger an event when the specified job completes execution.

       --flags=type
              Associate flags with the trigger.  Multiple flags should be
              comma separated.  Valid flags include:

              PERM   Make the trigger permanent.  Do not purge it after the
                     event occurs.

       --front_end
              Trigger events based upon changes in state of front end nodes
              rather than compute nodes.  Applies to Cray ALPS architectures
              only, where the slurmd daemon executes on front end nodes
              rather than the compute nodes.  Use this option with either
              the --up or --down option.
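
              For example, on such a system a trigger for front end nodes
              going DOWN might be set as shown below (the notification
              script path is illustrative):

              > strigger --set --front_end --down \
                         --program=/usr/sbin/slurm_admin_notify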

       -g, --primary_slurmdbd_failure
              Trigger an event when the primary slurmdbd fails.

       -G, --primary_slurmdbd_resumed_operation
              Trigger an event when the primary slurmdbd resumes operation
              after failure.

       --get  Show registered event triggers.  Options can be used for
              filtering purposes.

       -h, --primary_database_failure
              Trigger an event when the primary database fails.

       -H, --primary_database_resumed_operation
              Trigger an event when the primary database resumes operation
              after failure.

       -i, --id=id
              Trigger ID number.

       -I, --idle
              Trigger an event if the specified node remains in an IDLE
              state for at least the time period specified by the --offset
              option.  This can be useful to hibernate a node that remains
              idle, thus reducing power consumption.

       -j, --jobid=id
              Job ID of interest.  NOTE: The --jobid option cannot be used
              in conjunction with the --node option.  When the --jobid
              option is used in conjunction with the --up or --down option,
              all nodes allocated to that job will be considered the nodes
              used as a trigger event.

       -M, --clusters=<string>
              Clusters to issue commands to.  Note that the SlurmDBD must be
              up for this option to work properly.

       -n, --node[=host]
              Host name(s) of interest.  By default, all nodes associated
              with the job (if --jobid is specified) or on the system are
              considered for event triggers.  NOTE: The --node option cannot
              be used in conjunction with the --jobid option.  When the
              --jobid option is used in conjunction with the --up, --down or
              --drained option, all nodes allocated to that job will be
              considered the nodes used as a trigger event.  Since this
              option's argument is optional, for proper parsing the single
              letter option must be followed immediately with the value and
              not include a space between them.  For example "-ntux" and not
              "-n tux".
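
              For example, the following commands are equivalent ways of
              watching for node "tux" going DOWN (the node name and script
              path are illustrative):

              > strigger --set --down -ntux \
                         --program=/usr/sbin/slurm_admin_notify
              > strigger --set --down --node=tux \
                         --program=/usr/sbin/slurm_admin_notify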

       -N, --noheader
              Do not print the header when displaying a list of triggers.

       -o, --offset=seconds
              The specified action should follow the event by this time
              interval.  Specify a negative value if the action should
              precede the event.  The default value is zero if no --offset
              option is specified.  The resolution of this time is about 20
              seconds, so to execute a script not less than five minutes
              prior to a job reaching its time limit, specify
              --offset=-320 (5 minutes plus 20 seconds).
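
              For example, a trigger that fires roughly five minutes before
              a job reaches its time limit might be set as shown below (the
              job ID and script path are illustrative):

              > strigger --set --jobid=1234 --time --offset=-320 \
                         --program=/home/joe/clean_up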

       -p, --program=path
              Execute the program at the specified fully qualified pathname
              when the event occurs.  You may quote the path and include
              extra program arguments if desired.  The program will be
              executed as the user who sets the trigger.  If the program
              fails to terminate within 5 minutes, it will be killed along
              with any spawned processes.
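
              For example, an extra argument can be passed to the program by
              quoting the whole value (the script path and the "urgent"
              argument are illustrative):

              > strigger --set --node --down \
                         --program="/usr/sbin/slurm_admin_notify urgent"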

       -Q, --quiet
              Do not report non-fatal errors.  This can be useful to clear
              triggers which may have already been purged.

       -r, --reconfig
              Trigger an event when the system configuration changes.  This
              is triggered when the slurmctld daemon reads its configuration
              file or when a node state changes.

       --set  Register an event trigger based upon the supplied options.
              NOTE: An event is only triggered once.  A new event trigger
              must be established for future events of the same type to be
              processed.  Triggers can only be set if the command is run by
              the user SlurmUser unless SlurmUser is configured as user
              root.

       -t, --time
              Trigger an event when the specified job's time limit is
              reached.  This must be used in conjunction with the --jobid
              option.

       -u, --up
              Trigger an event if the specified node is returned to service
              from a DOWN state.

       --user=user_name_or_id
              Clear or get triggers created by the specified user.  For
              example, a trigger created by user root for a job created by
              user adam could be cleared with an option of --user=root.
              Specify either a user name or user ID.

       -v, --verbose
              Print detailed event logging.  This includes time-stamps on
              data structures, record counts, etc.

       -V, --version
              Print version information and exit.

OUTPUT FIELD DESCRIPTIONS
       TRIG_ID
              Trigger ID number.

       RES_TYPE
              Resource type: job or node

       RES_ID Resource ID: job ID or host names or "*" for any host

       TYPE   Trigger type: time or fini (for jobs only), down or up (for
              jobs or nodes), or drained, idle or reconfig (for nodes only)

       OFFSET Time offset in seconds.  Negative numbers indicate the action
              should occur before the event (if possible)

       USER   Name of the user requesting the action

       PROGRAM
              Pathname of the program to execute when the event occurs

ENVIRONMENT VARIABLES
       Some strigger options may be set via environment variables.  These
       environment variables, along with their corresponding options, are
       listed below.  (Note: command line options will always override these
       settings.)

       SLURM_CONF          The location of the Slurm configuration file.
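
       For example, an alternate configuration file can be selected for a
       single invocation as shown below (the file path is illustrative):

       > SLURM_CONF=/etc/slurm/slurm.conf.test strigger --get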

EXAMPLES
       Execute the program "/usr/sbin/primary_slurmctld_failure" whenever
       the primary slurmctld fails.

       > cat /usr/sbin/primary_slurmctld_failure
       #!/bin/bash
       # Submit trigger for next primary slurmctld failure event
       strigger --set --primary_slurmctld_failure \
                --program=/usr/sbin/primary_slurmctld_failure
       # Notify the administrator of the failure by e-mail
       /bin/mail -s Primary_SLURMCTLD_FAILURE slurm_admin@site.com

       > strigger --set --primary_slurmctld_failure \
                  --program=/usr/sbin/primary_slurmctld_failure

       Execute the program "/usr/sbin/slurm_admin_notify" whenever any node
       in the cluster goes down.  The subject line will include the node
       names which have entered the down state (passed as an argument to the
       script by Slurm).

       > cat /usr/sbin/slurm_admin_notify
       #!/bin/bash
       # Submit trigger for next event
       strigger --set --node --down \
                --program=/usr/sbin/slurm_admin_notify
       # Notify the administrator by e-mail
       /bin/mail -s "NodesDown:$*" slurm_admin@site.com

       > strigger --set --node --down \
                  --program=/usr/sbin/slurm_admin_notify

       Execute the program "/usr/sbin/slurm_suspend_node" whenever any node
       in the cluster remains in the idle state for at least 600 seconds.

       > strigger --set --node --idle --offset=600 \
                  --program=/usr/sbin/slurm_suspend_node

       Execute the program "/home/joe/clean_up" when job 1234 is within 10
       minutes of reaching its time limit.

       > strigger --set --jobid=1234 --time --offset=-600 \
                  --program=/home/joe/clean_up

       Execute the program "/home/joe/node_died" when any node allocated to
       job 1234 enters the DOWN state.

       > strigger --set --jobid=1234 --down \
                  --program=/home/joe/node_died

       Show all triggers associated with job 1235.

       > strigger --get --jobid=1235
       TRIG_ID RES_TYPE RES_ID TYPE   OFFSET USER PROGRAM
           123      job   1235 time     -600  joe /home/bob/clean_up
           125      job   1235 down        0  joe /home/bob/node_died

       Delete event trigger 125.

       > strigger --clear --id=125

       Execute /home/joe/job_fini upon completion of job 1237.

       > strigger --set --jobid=1237 --fini --program=/home/joe/job_fini

COPYING
       Copyright (C) 2007 The Regents of the University of California.
       Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
       Copyright (C) 2008-2010 Lawrence Livermore National Security.
       Copyright (C) 2010-2013 SchedMD LLC.

       This file is part of Slurm, a resource management program.  For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.

SEE ALSO
       scontrol(1), sinfo(1), squeue(1)

August 2016                     Slurm Commands                    strigger(1)