strigger(1)

strigger(1)                     Slurm Commands                     strigger(1)

NAME

       strigger - Used to set, get or clear Slurm trigger information.

SYNOPSIS

       strigger --set   [OPTIONS...]
       strigger --get   [OPTIONS...]
       strigger --clear [OPTIONS...]

DESCRIPTION

       strigger is used to set, get or clear Slurm trigger information.
       Triggers include events such as a node failing, a job reaching its
       time limit or a job terminating.  These events can cause actions such
       as the execution of an arbitrary script.  Typical uses include
       notifying system administrators of node failures and gracefully
       terminating a job when its time limit is approaching.  A hostlist
       expression for the nodelist or job ID is passed as an argument to the
       program.

       Trigger events are not processed instantly.  Instead, a check for
       trigger events is performed periodically (currently every 15
       seconds).  Any trigger events which occur within that interval are
       compared against the trigger programs set at the end of the interval.
       The trigger program is executed once for any event occurring in that
       interval.  The record of those events (e.g. nodes which went DOWN in
       the previous 15 seconds) is then cleared.  To ensure that no trigger
       events are missed, either the trigger program must set a new trigger
       before the end of the next interval, or the trigger must be created
       with the argument "--flags=PERM".  If desired, multiple trigger
       programs can be set for the same event.
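
       As an illustration of the --flags=PERM approach, the sketch below
       prints a command line that registers a permanent node-DOWN trigger, so
       that the trigger program does not have to re-arm itself.  The helper
       name perm_down_trigger_cmd and the program path are hypothetical; only
       the strigger options are taken from this page.

```shell
#!/bin/bash
# Print (rather than run) a command that registers a permanent trigger for
# nodes entering the DOWN state. The program path passed in is a
# hypothetical placeholder; substitute a real script on your system.
perm_down_trigger_cmd() {
    local program="$1"
    printf 'strigger --set --node --down --flags=PERM --program=%s\n' "$program"
}

# Show the command for review before running it by hand.
perm_down_trigger_cmd /usr/local/sbin/node_down_notify
```

       Printing the command first makes it easy to review before it is run
       under the appropriate account.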

       IMPORTANT NOTE: This command can only set triggers if run by the user
       SlurmUser unless SlurmUser is configured as user root.  This is
       required for the slurmctld daemon to set the appropriate user and
       group IDs for the executed program.  Also note that the trigger
       program is executed on the same node that the slurmctld daemon uses
       rather than some allocated compute node.  To check the value of
       SlurmUser, run the command:

              scontrol show config | grep SlurmUser
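
       A script can make the same check before it attempts --set.  The sketch
       below assumes that the relevant line of scontrol output has the form
       "SlurmUser = name(uid)"; that layout and the helper name
       slurm_user_from_config are assumptions, not something this page
       specifies.

```shell
#!/bin/bash
# Extract the SlurmUser name from `scontrol show config` output on stdin.
# ASSUMPTION: the relevant line looks like "SlurmUser = slurm(64030)".
slurm_user_from_config() {
    awk -F'[ =(]+' '$1 == "SlurmUser" { print $2 }'
}

# Usage (requires a working Slurm installation):
#   slurm_user=$(scontrol show config | slurm_user_from_config)
#   [ "$(id -un)" = "$slurm_user" ] || echo "run strigger --set as $slurm_user" >&2
```

       Matching the first field exactly avoids picking up similarly named
       settings.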

ARGUMENTS

       -C, --backup_slurmctld_assumed_control
              Trigger an event when the backup slurmctld assumes control.

       -B, --backup_slurmctld_failure
              Trigger an event when the backup slurmctld fails.

       -c, --backup_slurmctld_resumed_operation
              Trigger an event when the backup slurmctld resumes operation
              after failure.

       --burst_buffer
              Trigger an event when a burst buffer error occurs.

       --clear
              Clear or delete a previously defined event trigger.  The --id,
              --jobid or --user option must be specified to identify the
              trigger(s) to be cleared.  Only user root or the trigger's
              creator can delete a trigger.

       -M, --clusters=<string>
              Clusters to issue commands to.  Note that the SlurmDBD must be
              up for this option to work properly.

       -d, --down
              Trigger an event if the specified node goes into a DOWN state.

       -D, --drained
              Trigger an event if the specified node goes into a DRAINED
              state.

       -F, --fail
              Trigger an event if the specified node goes into a FAILING
              state.

       -f, --fini
              Trigger an event when the specified job completes execution.

       --flags=<flag>
              Associate flags with the trigger.  Multiple flags should be
              comma separated.  Valid flags include:

              PERM   Make the trigger permanent.  Do not purge it after the
                     event occurs.

       --front_end
              Trigger events based upon changes in state of front end nodes
              rather than compute nodes.  Applies to Cray ALPS architectures
              only, where the slurmd daemon executes on front end nodes
              rather than the compute nodes.  Use this option with either
              the --up or --down option.

       --get  Show registered event triggers.  Options can be used for
              filtering purposes.

       -i, --id=<id>
              Trigger ID number.

       -I, --idle
              Trigger an event if the specified node remains in an IDLE state
              for at least the time period specified by the --offset option.
              This can be useful to hibernate a node that remains idle, thus
              reducing power consumption.

       -j, --jobid=<id>
              Job ID of interest.  NOTE: The --jobid option cannot be used in
              conjunction with the --node option.  When the --jobid option is
              used in conjunction with the --up or --down option, all nodes
              allocated to that job will be considered the nodes used as a
              trigger event.

       -n, --node[=host]
              Host name(s) of interest.  By default, all nodes associated
              with the job (if --jobid is specified) or on the system are
              considered for event triggers.  NOTE: The --node option cannot
              be used in conjunction with the --jobid option.  When the
              --jobid option is used in conjunction with the --up, --down or
              --drained option, all nodes allocated to that job will be
              considered the nodes used as a trigger event.  Since this
              option's argument is optional, for proper parsing the single
              letter option must be followed immediately by the value, with
              no space between them.  For example, "-ntux" and not "-n tux".

       -N, --noheader
              Do not print the header when displaying a list of triggers.

       -o, --offset=<seconds>
              The specified action should follow the event by this time
              interval.  Specify a negative value if the action should
              precede the event.  The default value is zero if no --offset
              option is specified.  The resolution of this time is about 20
              seconds, so to execute a script not less than five minutes
              prior to a job reaching its time limit, specify --offset=-320
              (5 minutes plus 20 seconds, negated so that the action precedes
              the event).
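
              The padding arithmetic can be scripted.  In this sketch the
              helper name offset_before_event is hypothetical, and the 20
              second padding is simply the resolution quoted above.  The
              result is negative because, as stated above, a negative value
              makes the action precede the event.

```shell
#!/bin/bash
# Compute an --offset value that fires at least LEAD seconds before an event.
# The 20-second padding mirrors the timing resolution quoted in the text.
offset_before_event() {
    local lead_seconds="$1"
    local resolution=20
    echo $(( -(lead_seconds + resolution) ))
}

# Five minutes of lead time.
offset_before_event 300   # prints -320
```

              The value can then be passed on the command line, for example
              as --offset="$(offset_before_event 300)".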

       -h, --primary_database_failure
              Trigger an event when the primary database fails.  This event
              is triggered when the accounting plugin tries to open a
              connection with mysql and it fails and the slurmctld needs the
              database for some operations.

       -H, --primary_database_resumed_operation
              Trigger an event when the primary database resumes operation
              after failure.  It happens when the connection to mysql from
              the accounting plugin is restored.

       -g, --primary_slurmdbd_failure
              Trigger an event when the primary slurmdbd fails.  The trigger
              is launched by slurmctld when it tries to connect to slurmdbd
              but receives no response on the socket.

       -G, --primary_slurmdbd_resumed_operation
              Trigger an event when the primary slurmdbd resumes operation
              after failure.  This event is triggered when opening the
              connection from slurmctld to slurmdbd results in a response.
              It can also happen in other situations: periodically every 15
              seconds when checking the connection status, when saving
              state, when the agent queue is filling, and so on.

       -e, --primary_slurmctld_acct_buffer_full
              Trigger an event when the primary slurmctld accounting buffer
              is full.

       -a, --primary_slurmctld_failure
              Trigger an event when the primary slurmctld fails.

       -b, --primary_slurmctld_resumed_control
              Trigger an event when the primary slurmctld resumes control.

       -A, --primary_slurmctld_resumed_operation
              Trigger an event when the primary slurmctld resumes operation
              after failure.

       -p, --program=<path>
              Execute the program at the specified fully qualified pathname
              when the event occurs.  You may quote the path and include
              extra program arguments if desired.  The program will be
              executed as the user who sets the trigger.  If the program
              fails to terminate within 5 minutes, it will be killed along
              with any spawned processes.

       -Q, --quiet
              Do not report non-fatal errors.  This can be useful to clear
              triggers which may have already been purged.

       -r, --reconfig
              Trigger an event when the system configuration changes.  This
              is triggered when the slurmctld daemon reads its configuration
              file or when a node state changes.

       --set  Register an event trigger based upon the supplied options.
              NOTE: An event is only triggered once.  A new event trigger
              must be established for future events of the same type to be
              processed.  Triggers can only be set if the command is run by
              the user SlurmUser unless SlurmUser is configured as user root.

       -t, --time
              Trigger an event when the specified job's time limit is
              reached.  This must be used in conjunction with the --jobid
              option.

       -u, --up
              Trigger an event if the specified node is returned to service
              from a DOWN state.

       --user=<user_name_or_id>
              Clear or get triggers created by the specified user.  For
              example, a trigger created by user root for a job created by
              user adam could be cleared with an option --user=root.  Specify
              either a user name or user ID.

       -v, --verbose
              Print detailed event logging.  This includes time-stamps on
              data structures, record counts, etc.

       -V, --version
              Print version information and exit.

OUTPUT FIELD DESCRIPTIONS

       TRIG_ID
              Trigger ID number.

       RES_TYPE
              Resource type: job or node

       RES_ID Resource ID: job ID or host names or "*" for any host

       TYPE   Trigger type: time or fini (for jobs only), down or up (for jobs
              or nodes), or drained, idle or reconfig (for nodes only)

       OFFSET Time offset in seconds.  Negative numbers indicate the action
              should occur before the event (if possible)

       USER   Name of the user requesting the action

       PROGRAM
              Pathname of the program to execute when the event occurs
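
       Because the fields appear in the order listed above, the listing can
       be filtered with standard text tools.  The sketch below assumes
       whitespace-separated columns, as in the sample listing in EXAMPLES;
       trigger_ids_by_type is a hypothetical helper name.

```shell
#!/bin/bash
# Print the TRIG_ID of every trigger whose TYPE column equals $1.
# Reads `strigger --get` output on stdin, with or without the header line.
# ASSUMPTION: columns are whitespace separated in the order
# TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM.
trigger_ids_by_type() {
    awk -v want="$1" '$1 != "TRIG_ID" && $4 == want { print $1 }'
}

# Usage (requires a working Slurm installation):
#   strigger --get --jobid=1235 | trigger_ids_by_type down
```

       Only the first and fourth columns are read, so a PROGRAM value that
       contains spaces does not disturb the result.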

PERFORMANCE

       Executing strigger sends a remote procedure call to slurmctld.  If
       enough calls from strigger or other Slurm client commands that send
       remote procedure calls to the slurmctld daemon come in at once, it can
       result in a degradation of performance of the slurmctld daemon,
       possibly resulting in a denial of service.

       Do not run strigger or other Slurm client commands that send remote
       procedure calls to slurmctld from loops in shell scripts or other
       programs.  Ensure that programs limit calls to strigger to the minimum
       necessary for the information you are trying to gather.
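
       One way to follow this advice is to fetch the trigger list once and
       answer several questions from the saved copy.  The sketch below counts
       triggers per job from a single listing; count_triggers_per_job is a
       hypothetical helper, and the column layout is assumed to match OUTPUT
       FIELD DESCRIPTIONS.

```shell
#!/bin/bash
# Count job triggers per job ID from a saved listing on stdin, so that
# strigger is called once instead of once per job.
# ASSUMPTION: column 2 is RES_TYPE and column 3 is RES_ID.
count_triggers_per_job() {
    awk '$2 == "job" { n[$3]++ } END { for (id in n) print id, n[id] }' | sort -n
}

# Usage (requires a working Slurm installation):
#   listing=$(strigger --get --noheader)     # the only call to slurmctld
#   printf '%s\n' "$listing" | count_triggers_per_job
```

       Any further filtering can reuse the saved listing without contacting
       slurmctld again.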

ENVIRONMENT VARIABLES

       Some strigger options may be set via environment variables.  These
       environment variables, along with their corresponding options, are
       listed below.  (Note: Command line options will always override these
       settings.)

       SLURM_CONF          The location of the Slurm configuration file.

EXAMPLES

       Execute the program "/usr/sbin/primary_slurmctld_failure" whenever the
       primary slurmctld fails.

              $ cat /usr/sbin/primary_slurmctld_failure
              #!/bin/bash
              # Submit trigger for next primary slurmctld failure event
              strigger --set --primary_slurmctld_failure \
                       --program=/usr/sbin/primary_slurmctld_failure
              # Notify the administrator of the failure using e-mail
              /bin/mail slurm_admin@site.com -s Primary_SLURMCTLD_FAILURE

              $ strigger --set --primary_slurmctld_failure \
                         --program=/usr/sbin/primary_slurmctld_failure

       Execute the program "/usr/sbin/slurm_admin_notify" whenever any node in
       the cluster goes down.  The subject line will include the node names
       which have entered the down state (passed as an argument to the script
       by Slurm).

              $ cat /usr/sbin/slurm_admin_notify
              #!/bin/bash
              # Submit trigger for next event
              strigger --set --node --down \
                       --program=/usr/sbin/slurm_admin_notify
              # Notify the administrator by e-mail
              /bin/mail slurm_admin@site.com -s "NodesDown:$*"

              $ strigger --set --node --down \
                         --program=/usr/sbin/slurm_admin_notify

       Execute the program "/usr/sbin/slurm_suspend_node" whenever any node in
       the cluster remains in the idle state for at least 600 seconds.

              $ strigger --set --node --idle --offset=600 \
                         --program=/usr/sbin/slurm_suspend_node

       Execute the program "/home/joe/clean_up" when job 1234 is within 10
       minutes of reaching its time limit.

              $ strigger --set --jobid=1234 --time --offset=-600 \
                         --program=/home/joe/clean_up

       Execute the program "/home/joe/node_died" when any node allocated to
       job 1234 enters the DOWN state.

              $ strigger --set --jobid=1234 --down \
                         --program=/home/joe/node_died

       Show all triggers associated with job 1235.

              $ strigger --get --jobid=1235
              TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM
                  123      job   1235 time   -600  joe /home/bob/clean_up
                  125      job   1235 down      0  joe /home/bob/node_died

       Delete event trigger 125.

              $ strigger --clear --id=125

       Execute /home/joe/job_fini upon completion of job 1237.

              $ strigger --set --jobid=1237 --fini --program=/home/joe/job_fini

COPYING

       Copyright (C) 2007 The Regents of the University of California.
       Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
       Copyright (C) 2008-2010 Lawrence Livermore National Security.
       Copyright (C) 2010-2022 SchedMD LLC.

       This file is part of Slurm, a resource management program.  For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it under
       the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.