sdiag(1)                        Slurm Commands                       sdiag(1)


NAME
       sdiag - Scheduling diagnostic tool for Slurm


SYNOPSIS
       sdiag


DESCRIPTION
       sdiag shows information related to slurmctld execution about:
       threads, agents, jobs, and scheduling algorithms. The goal is to
       obtain data on slurmctld behaviour that helps in adjusting
       configuration parameters or queue policies. It is mainly intended
       for understanding Slurm behaviour on systems with a high throughput.

       It has two execution modes. The default mode, --all, shows the
       counters and statistics explained below; the other mode, --reset,
       resets those values.

       Values are reset at midnight UTC time by default.

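       For example (illustrative invocations; as noted under OPTIONS,
       --reset is only available to Slurm operators and administrators):

              $ sdiag              # report all counters and statistics
              $ sdiag --reset      # reset those counters
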
       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of currently active slurmctld threads. A high
              number would mean a high load processing events like job
              submissions, job dispatching, job completion, etc. If this is
              often close to MAX_SERVER_THREADS it could point to a
              potential bottleneck.

       Agent queue size
              Slurm is designed with scalability in mind, and sending
              messages to thousands of nodes is not a trivial task. The
              agent mechanism helps to control the communication between
              the controller and the slurmd daemons on a best-effort basis.
              If this value is close to MAX_AGENT_CNT there could be some
              delays affecting job management.

       Agent count
              Number of active agent threads.

       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread. If the SlurmDBD, or the
              database, is down then this number will increase. The maximum
              queue size is calculated as:

              MAX(10000, ((max_job_cnt * 2) + (node_record_count * 4)))

              If this number grows beyond half of the maximum queue size,
              the slurmdbd and the database should be investigated
              immediately.

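              As a worked example, assuming illustrative values of
              max_job_cnt = 10000 and node_record_count = 1000, the maximum
              queue size would be:

              MAX(10000, ((10000 * 2) + (1000 * 4))) = MAX(10000, 24000)
                                                     = 24000
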
       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset. This includes
              backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.

       Job states ts:
              Timestamp of when the following job state counts were
              gathered.

       Jobs pending:
              Number of jobs pending at the time of the timestamp above.

       Jobs running:
              Number of jobs running at the time of the timestamp above.

       Jobs running ts:
              Timestamp of when the running job count was taken.

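       An illustrative excerpt of this first block (all values are examples
       only; exact formatting may vary between Slurm versions):

              Server thread count:  3
              Agent queue size:     0
              Agent count:          0
              DBD Agent queue size: 0

              Jobs submitted: 1463
              Jobs started:   1057
              Jobs completed: 1052
              Jobs canceled:  14
              Jobs failed:    0

              Job states ts:   Thu Jul 12 10:28:34 2018
              Jobs pending:    401
              Jobs running:    5
              Jobs running ts: Thu Jul 12 10:28:34 2018
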
       The second block of information is related to the main scheduling
       algorithm, which is based on job priorities. A scheduling cycle
       implies getting the job_write_lock lock, then trying to get
       resources for pending jobs, starting from the highest priority one
       and going in descending order. Once a job cannot get the resources,
       the loop keeps going, but only for jobs requesting other partitions.
       Jobs with dependencies or affected by account limits are not
       processed.

       Last cycle
              Time in microseconds for the last scheduling cycle.

       Max cycle
              Maximum time in microseconds of any scheduling cycle since
              last reset.

       Total cycles
              Number of scheduling cycles since last reset. Scheduling is
              done periodically and also when a job is submitted or
              completed.

       Mean cycle
              Mean time in microseconds of scheduling cycles since last
              reset.

       Mean depth cycle
              Mean depth of scheduling cycles. Depth means the number of
              jobs processed in a scheduling cycle.

       Cycles per minute
              Number of scheduling executions per minute.

       Last queue length
              Length of the pending jobs queue.

       Latency for gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall, in
              microseconds, as measured at controller startup.

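       An illustrative excerpt of this second block (all values are
       examples only):

              Main schedule statistics (microseconds):
                      Last cycle:   97209
                      Max cycle:    169687
                      Total cycles: 81
                      Mean cycle:   66534
                      Mean depth cycle:  57
                      Cycles per minute: 1
                      Last queue length: 400
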
       The third block of information is related to the backfilling
       scheduling algorithm. A backfilling scheduling cycle implies getting
       locks for the job, node and partition objects, then trying to get
       resources for pending jobs. Jobs are processed based on priorities.
       If a job cannot get resources, the algorithm calculates when it
       could get them, obtaining a future start time for the job. Then the
       next job is processed, and the algorithm tries to get resources for
       that job without affecting the previous ones, again calculating a
       future start time if the resources are not currently available. The
       backfilling algorithm takes more time for each new job it processes,
       since higher priority jobs cannot be affected. The algorithm itself
       takes measures to avoid a long execution cycle and to avoid holding
       all the locks for too long.

       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since last Slurm
              start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset. By default these values are reset at
              midnight UTC time.

       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since last Slurm start.

       Total cycles
              Number of backfilling scheduling cycles since last reset.

       Last cycle when
              Time when the last backfilling cycle happened, in the format
              "weekday Month MonthDay hour:minute.seconds year".

       Last cycle
              Time in microseconds of the last backfilling cycle. It counts
              only execution time, removing the sleep time inside a
              scheduling cycle when it takes too much time. Note that locks
              are released during the sleep time so that other work can
              proceed.

       Max cycle
              Maximum time in microseconds of any backfilling cycle since
              last reset. It counts only execution time, removing the sleep
              time inside a scheduling cycle when it takes too much time.
              Note that locks are released during the sleep time so that
              other work can proceed.

       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles
              since last reset.

       Last depth cycle
              Number of jobs processed during the last backfilling
              scheduling cycle. It counts every job, even those with no
              chance to execute due to dependencies or limits.

       Last depth cycle (try sched)
              Number of jobs processed during the last backfilling
              scheduling cycle. It counts only jobs with a chance to run
              while waiting for available resources. These are the jobs
              that make the backfilling algorithm heavier.

       Depth Mean
              Mean number of jobs processed during backfilling scheduling
              cycles since last reset. Jobs which are found to be
              ineligible to run when examined by the backfill scheduler are
              not counted (e.g. jobs submitted to multiple partitions and
              already started, jobs which have reached a QOS or account
              limit such as maximum running jobs for an account, etc).

       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler actually
              attempted to schedule.

       Last queue length
              Number of jobs pending to be processed by the backfilling
              algorithm. A job is counted once for each partition it
              requested. A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate
              job).

       Queue length Mean
              Mean number of jobs pending to be processed by the
              backfilling algorithm. A job is counted once for each
              partition it requested. A pending job array will normally be
              counted as one job (tasks of a job array which have already
              been started/requeued or individually modified will already
              have individual job records and are each counted as a
              separate job).

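       An illustrative excerpt of this third block (all values are examples
       only):

              Backfilling stats
                      Total backfilled jobs (since last slurm start): 211
                      Total backfilled jobs (since last stats cycle start): 211
                      Total backfilled heterogeneous job components: 0
                      Total cycles: 29
                      Last cycle when: Thu Jul 12 10:28:04 2018
                      Last cycle: 456549
                      Max cycle:  580889
                      Mean cycle: 454515
                      Last depth cycle: 126
                      Last depth cycle (try sched): 20
                      Depth Mean: 63
                      Depth Mean (try sched): 10
                      Last queue length: 126
                      Queue length Mean: 108
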
       The fourth and fifth blocks of information report the most
       frequently issued remote procedure calls (RPCs), the calls made to
       the slurmctld daemon to perform some action. The fourth block
       reports the RPCs issued by message type; the RPC codes can be looked
       up in the Slurm source code, in the file
       src/common/slurm_protocol_defs.h. The report includes the number of
       times each RPC was invoked, the total time consumed by all of those
       RPCs, plus the average time consumed by each RPC, in microseconds.
       The fifth block reports the RPCs issued by user ID: the total number
       of RPCs each user has issued, the total time consumed by all of
       those RPCs, plus the average time consumed by each RPC, in
       microseconds.

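       As an example, the symbolic name of a given RPC message type can be
       found by searching the Slurm source tree (illustrative command; the
       message type shown is just one entry of the enumeration):

              $ grep -n "REQUEST_PING" src/common/slurm_protocol_defs.h
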
       The sixth block of information, labeled Pending RPC Statistics,
       shows information about pending outgoing RPCs on the slurmctld agent
       queue. The first section of this block shows the types of RPCs on
       the queue and the count of each. The second section shows up to the
       first 25 individual RPCs pending on the agent queue, including the
       type and the destination host list. This information is cached and
       only refreshed at 30 second intervals.


OPTIONS
       -a, --all
              Get and report information. This is the default mode of
              operation.

       -h, --help
              Print description of options and exit.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.

       -r, --reset
              Reset counters. Only supported for Slurm operators and
              administrators.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.

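       For example, to report RPC statistics sorted by total run time
       (illustrative invocation):

              $ sdiag --sort-by-time
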

ENVIRONMENT VARIABLES
       Some sdiag options may be set via environment variables. These
       environment variables, along with their corresponding options, are
       listed below. (Note: command-line options will always override these
       settings.)

       SLURM_CONF          The location of the Slurm configuration file.

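       For example, to run sdiag against a non-default configuration file
       (the path shown is hypothetical):

              $ SLURM_CONF=/opt/slurm/etc/slurm.conf sdiag
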

COPYING
       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2017 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or
       (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
       or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
       License for more details.


SEE ALSO
       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)



July 2018                       Slurm Commands                       sdiag(1)