sdiag(1)                        Slurm Commands                        sdiag(1)

NAME

       sdiag - Scheduling diagnostic tool for Slurm

SYNOPSIS

       sdiag

DESCRIPTION

       sdiag shows information related to slurmctld execution: threads,
       agents, jobs, and scheduling algorithms. The goal is to obtain data
       about slurmctld behaviour that helps in adjusting configuration
       parameters or queue policies. The main motivation is to understand
       how Slurm behaves on systems with a high throughput.

       It has two execution modes. The default mode, --all, shows several
       counters and statistics explained below. The other mode, --reset,
       resets those values.

       Values are reset at midnight UTC by default.

       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of currently active slurmctld threads. A high number
              means a high load processing events such as job submissions,
              job dispatching, job completions, etc. If this is often close
              to MAX_SERVER_THREADS it could point to a potential bottleneck.

       Agent queue size
              Slurm is designed with scalability in mind, and sending
              messages to thousands of nodes is not a trivial task. The agent
              mechanism helps to control communication between slurmctld and
              the slurmd daemons on a best-effort basis. This value denotes
              the count of enqueued outgoing RPC requests in an internal
              retry list.

       Agent count
              Number of agent threads. Each of these agent threads can in
              turn create a group of up to 2 + AGENT_THREAD_COUNT active
              threads at a time.

       Agent thread count
              Total count of active threads created by all the agent threads.

       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread. If the SlurmDBD, or the
              database, is down then this number will increase.

              The maximum queue size is configured in slurm.conf with
              MaxDBDMsgs. If this number grows to more than half of the
              maximum queue size, the slurmdbd and the database should be
              investigated immediately.

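       The half-of-maximum rule above can be expressed as a small check. This
       is only a sketch: the function name and the sample numbers are
       hypothetical, and the caller is assumed to have obtained the queue
       size from sdiag and the MaxDBDMsgs value from slurm.conf by some
       other means.

```python
def dbd_queue_needs_attention(queue_size: int, max_dbd_msgs: int) -> bool:
    """Return True when the DBD agent queue exceeds half of MaxDBDMsgs.

    Both arguments are supplied by the caller; this helper (a hypothetical
    name, not part of Slurm) only applies the threshold rule.
    """
    if max_dbd_msgs <= 0:
        raise ValueError("max_dbd_msgs must be positive")
    return queue_size > max_dbd_msgs / 2


# Illustrative values only, not taken from a real cluster.
print(dbd_queue_needs_attention(6000, 10000))
print(dbd_queue_needs_attention(5000, 10000))
```
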
       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset. This includes
              backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.

       Job states ts:
              Timestamp of when the following job state counts were gathered.

       Jobs pending:
              Number of jobs pending at the time of the timestamp above.

       Jobs running:
              Number of jobs running at the time of the timestamp above.

       Jobs running ts:
              Timestamp of when the running job count was taken.

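       As an illustration of how the counters in this first block might be
       consumed by a monitoring script, the sketch below parses lines of the
       form "Label: integer" into a dictionary. The sample text and the
       assumption that counters appear one per line in that form are
       hypothetical and should be checked against the output of the
       installed sdiag version.

```python
import re

# Hypothetical sample in an assumed "Label: value" layout.
SAMPLE = """\
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 12
Jobs submitted: 420
Jobs started:   395
"""

# A label (no colon inside), a colon, whitespace, then an integer value.
_COUNTER = re.compile(r"^\s*(?P<label>[^:]+?):\s+(?P<value>\d+)\s*$")


def parse_counters(text):
    """Collect integer counters keyed by label.

    Lines that do not end in a plain integer (timestamps, headings) are
    skipped. If a label occurs more than once, the last value wins.
    """
    counters = {}
    for line in text.splitlines():
        match = _COUNTER.match(line)
        if match:
            counters[match.group("label")] = int(match.group("value"))
    return counters


counters = parse_counters(SAMPLE)
print(counters["Jobs submitted"] - counters["Jobs started"])
```
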
       The next block of information is related to the main scheduling
       algorithm, which is based on job priorities. A scheduling cycle
       acquires the job_write_lock lock, then tries to get resources for
       pending jobs, starting from the highest-priority job and proceeding in
       descending order. Once a job cannot get the resources, the loop keeps
       going but only for jobs requesting other partitions. Jobs with
       dependencies or affected by account limits are not processed.

       Last cycle
              Time in microseconds for last scheduling cycle.

       Max cycle
              Maximum time in microseconds for any scheduling cycle since
              last reset.

       Total cycles
              Number of scheduling cycles since last reset. Scheduling is
              performed periodically and (depending upon configuration) when
              a job is submitted or a job is completed.

       Mean cycle
              Mean time in microseconds for all scheduling cycles since last
              reset.

       Mean depth cycle
              Mean of cycle depth. Depth means the number of jobs processed
              in a scheduling cycle.

       Cycles per minute
              Counter of scheduling executions per minute.

       Last queue length
              Length of the pending jobs queue.

       The next block of information is related to the backfilling scheduling
       algorithm. A backfilling scheduling cycle acquires locks for job, node
       and partition objects, then tries to get resources for pending jobs.
       Jobs are processed based on priorities. If a job cannot get resources,
       the algorithm calculates when it could get them, obtaining a future
       start time for the job. Then the next job is processed and the
       algorithm tries to get resources for that job without affecting the
       previous ones, and again it calculates a future start time if
       resources are not currently available. The backfilling algorithm takes
       more time for each new job it processes, since more higher-priority
       jobs must not be affected. The algorithm itself takes measures to
       avoid a long execution cycle and to avoid holding all the locks for
       too long.

       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since last Slurm
              start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset. By default these values are reset at
              midnight UTC.

       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since last Slurm start.

       Total cycles
              Number of backfill scheduling cycles since last reset.

       Last cycle when
              Time when last backfill scheduling cycle happened in the format
              "weekday Month MonthDay hour:minute.seconds year"

       Last cycle
              Time in microseconds of last backfill scheduling cycle. It
              counts only execution time, removing sleep time inside a
              scheduling cycle when it executes for an extended period of
              time. Note that locks are released during the sleep time so
              that other work can proceed.

       Max cycle
              Time in microseconds of maximum backfill scheduling cycle
              execution since last reset. It counts only execution time,
              removing sleep time inside a scheduling cycle when it executes
              for an extended period of time. Note that locks are released
              during the sleep time so that other work can proceed.

       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles
              since last reset.

       Last depth cycle
              Number of processed jobs during last backfilling scheduling
              cycle. It counts every job even if that job cannot be started
              due to dependencies or limits.

       Last depth cycle (try sched)
              Number of processed jobs during last backfilling scheduling
              cycle. It counts only jobs with a chance to start using
              available resources. These jobs consume more scheduling time
              than jobs which are found to be unable to start due to
              dependencies or limits.

       Depth Mean
              Mean count of jobs processed during all backfilling scheduling
              cycles since last reset. Jobs which are found to be ineligible
              to run when examined by the backfill scheduler are not counted
              (e.g. jobs submitted to multiple partitions and already
              started, jobs which have reached a QOS or account limit such as
              maximum running jobs for an account, etc).

       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler attempted
              to schedule.

       Last queue length
              Number of jobs pending to be processed by the backfilling
              algorithm. A job is counted once for each partition it is
              queued to use. A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).

       Queue length Mean
              Mean count of jobs pending to be processed by the backfilling
              algorithm. A job is counted once for each partition it
              requested. A pending job array will normally be counted as one
              job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).

       Last table size
              Count of different time slots tested by the backfill scheduler
              in its last iteration.

       Mean table size
              Mean count of different time slots tested by the backfill
              scheduler. Larger counts increase the time required for the
              backfill operation. The table size is influenced by many
              scheduling parameters, including: bf_min_age_reserve,
              bf_min_prio_reserve, bf_resolution, and bf_window.

       Latency for 1000 calls to gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall in
              microseconds, as measured at controller startup.

       The next blocks of information report the most frequently issued
       remote procedure calls (RPCs), calls made for the slurmctld daemon to
       perform some action. The fourth block reports the RPCs issued by
       message type. You will need to look up those RPC codes in the Slurm
       source code, in the file src/common/slurm_protocol_defs.h. The report
       includes the number of times each RPC is invoked, the total time
       consumed by all of those RPCs, and the average time consumed by each
       RPC in microseconds. The fifth block reports the RPCs issued by user
       ID: the total number of RPCs each user has issued, the total time
       consumed by all of those RPCs, and the average time consumed by each
       RPC in microseconds. RPC statistics are collected for the life of the
       slurmctld process unless explicitly reset with --reset.

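       The relationship between the three reported RPC figures (count, total
       time, average time) can be sketched as follows. The record type, the
       sample numbers, and the descending sort order are assumptions made
       for illustration. The two orderings are only loosely analogous to
       sorting by total run time and by average run time.

```python
from dataclasses import dataclass


@dataclass
class RpcStat:
    """One row of RPC statistics (hypothetical structure)."""
    name: str
    count: int
    total_time_us: int

    @property
    def ave_time_us(self) -> float:
        # Average time per call; zero calls yields 0.0 rather than an error.
        return self.total_time_us / self.count if self.count else 0.0


# Made-up numbers, not measurements from a real controller.
stats = [
    RpcStat("REQUEST_JOB_INFO", 1200, 360_000),
    RpcStat("REQUEST_NODE_INFO", 300, 240_000),
    RpcStat("REQUEST_PARTITION_INFO", 50, 5_000),
]

by_total = sorted(stats, key=lambda s: s.total_time_us, reverse=True)
by_average = sorted(stats, key=lambda s: s.ave_time_us, reverse=True)

for s in by_average:
    print(f"{s.name:<24} count:{s.count:<6} ave_time:{s.ave_time_us:.0f}")
```

       A frequently called but cheap RPC can dominate the total time while
       a rarer, slower RPC dominates the average, which is why both
       orderings are useful.
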
       The sixth block of information, labeled Pending RPC Statistics, shows
       information about pending outgoing RPCs on the slurmctld agent queue.
       The first section of this block shows the types of RPCs on the queue
       and the count of each. The second section shows up to the first 25
       individual RPCs pending on the agent queue, including the type and the
       destination host list. This information is cached and only refreshed
       at 30 second intervals.

OPTIONS

       -a, --all
              Get and report information. This is the default mode of
              operation.

       -h, --help
              Print description of options and exit.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.

       -M, --cluster=<string>
              The cluster to issue commands to. Only one cluster name may be
              specified. Note that the SlurmDBD must be up for this option to
              work properly.

       -r, --reset
              Reset scheduler and RPC counters to 0. Only supported for Slurm
              operators and administrators.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.

PERFORMANCE

       Executing sdiag sends a remote procedure call to slurmctld. If enough
       calls from sdiag or other Slurm client commands that send remote
       procedure calls to the slurmctld daemon come in at once, it can result
       in a degradation of performance of the slurmctld daemon, possibly
       resulting in a denial of service.

       Do not run sdiag or other Slurm client commands that send remote
       procedure calls to slurmctld from loops in shell scripts or other
       programs. Ensure that programs limit calls to sdiag to the minimum
       necessary for the information you are trying to gather.

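       One way a program might limit its calls is to cache the last sdiag
       output and reuse it until a minimum interval has elapsed. The sketch
       below is an assumption-laden illustration: the class name, the
       60 second default, and the injectable fetch and clock parameters are
       hypothetical choices, not part of Slurm.

```python
import subprocess
import time


class CachedSdiag:
    """Return cached sdiag output, refreshing at most once per interval."""

    def __init__(self, min_interval_s=60.0, fetch=None, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        # fetch and clock are injectable so the class can be tested
        # without a running slurmctld.
        self._fetch = fetch or self._run_sdiag
        self._clock = clock
        self._last_time = None
        self._last_output = None

    @staticmethod
    def _run_sdiag():
        # Runs plain "sdiag" (default mode) and returns its standard output.
        result = subprocess.run(
            ["sdiag"], capture_output=True, text=True, check=True
        )
        return result.stdout

    def get(self):
        now = self._clock()
        expired = (
            self._last_time is None
            or now - self._last_time >= self._min_interval_s
        )
        if expired:
            # Only here is an RPC actually sent to slurmctld.
            self._last_output = self._fetch()
            self._last_time = now
        return self._last_output
```

       Callers share one CachedSdiag instance and call get() as often as
       they like. Only the first call in each interval reaches slurmctld.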

ENVIRONMENT VARIABLES

       Some sdiag options may be set via environment variables. These
       environment variables, along with their corresponding options, are
       listed below. (Note: command-line options will always override these
       settings.)

       SLURM_CLUSTERS      Same as --cluster

       SLURM_CONF          The location of the Slurm configuration file.

COPYING

       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2019 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it under
       the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.

SEE ALSO

       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)

October 2019                    Slurm Commands                        sdiag(1)