sdiag(1)                        Slurm Commands                        sdiag(1)

NAME

       sdiag - Scheduling diagnostic tool for Slurm

SYNOPSIS

       sdiag

DESCRIPTION

       sdiag shows information related to slurmctld execution: threads,
       agents, jobs, and scheduling algorithms. The goal is to obtain data
       about slurmctld behaviour that helps in adjusting configuration
       parameters or queue policies. It is primarily intended for
       understanding Slurm behaviour on systems with a high job throughput.

       It has two execution modes. The default mode, --all, shows the
       counters and statistics described below; the other mode, --reset,
       resets those values.

       Values are reset at midnight UTC time by default.

       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of currently active slurmctld threads. A high
              number indicates a heavy load processing events such as job
              submission, dispatch, and completion. If this value is often
              close to MAX_SERVER_THREADS it may point to a potential
              bottleneck.

       Agent queue size
              Slurm is designed with scalability in mind, and sending
              messages to thousands of nodes is not a trivial task. The
              agent mechanism controls communication between slurmctld and
              the slurmd daemons on a best-effort basis. This value is the
              number of enqueued outgoing RPC requests in an internal retry
              list.

       Agent count
              Number of agent threads. Each of these agent threads can in
              turn create a group of up to 2 + AGENT_THREAD_COUNT active
              threads at a time.

       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread. If the SlurmDBD, or the
              database, is down then this number will increase. The maximum
              queue size is calculated as:

              MAX(10000, ((max_job_cnt * 2) + (node_record_count * 4)))

              If this number grows to more than half of the maximum queue
              size, the slurmdbd and the database should be investigated
              immediately.

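       The formula above can be evaluated directly. A minimal sketch, with
       hypothetical values for max_job_cnt and node_record_count:

```python
def dbd_agent_max_queue_size(max_job_cnt: int, node_record_count: int) -> int:
    """Maximum DBD agent queue size, per the formula above."""
    return max(10000, (max_job_cnt * 2) + (node_record_count * 4))

# Hypothetical cluster: max_job_cnt=10000 and 2000 node records.
print(dbd_agent_max_queue_size(10000, 2000))  # 28000
# Small hypothetical cluster: the 10000 floor applies.
print(dbd_agent_max_queue_size(100, 50))      # 10000
```

       On the first hypothetical cluster, a DBD Agent queue size above
       14000 (half the maximum) would warrant investigation.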
       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset. This includes
              backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.

       Job states ts:
              Timestamp of when the following job state counts were
              gathered.

       Jobs pending:
              Number of jobs pending at the time of the above timestamp.

       Jobs running:
              Number of jobs running at the time of the above timestamp.

       Jobs running ts:
              Timestamp of when the running job count was taken.

       The next block of information is related to the main scheduling
       algorithm, which is based on job priorities. A scheduling cycle
       acquires the job_write_lock lock, then tries to allocate resources
       to pending jobs, starting from the highest priority job and
       proceeding in descending order. Once a job cannot get resources,
       the loop keeps going, but only for jobs requesting other
       partitions. Jobs with dependencies or affected by account limits
       are not processed.

       Last cycle
              Time in microseconds for the last scheduling cycle.

       Max cycle
              Maximum time in microseconds for any scheduling cycle since
              last reset.

       Total cycles
              Total run time in microseconds for all scheduling cycles
              since last reset. Scheduling is performed periodically and
              (depending upon configuration) when a job is submitted or
              completed.

       Mean cycle
              Mean time in microseconds for all scheduling cycles since
              last reset.

       Mean depth cycle
              Mean cycle depth. Depth is the number of jobs processed in a
              scheduling cycle.

       Cycles per minute
              Number of scheduling executions per minute.

       Last queue length
              Length of the pending job queue.

       The next block of information is related to the backfill scheduling
       algorithm. A backfill scheduling cycle acquires locks for the job,
       node and partition objects, then tries to allocate resources to
       pending jobs. Jobs are processed in priority order. If a job cannot
       get resources, the algorithm calculates when it could get them,
       obtaining a future start time for the job. The next job is then
       processed, and the algorithm tries to allocate resources to it
       without affecting the previously examined jobs, again calculating a
       future start time if no resources are currently available. The
       backfill algorithm takes more time for each additional job it
       processes, since higher priority jobs must not be affected. The
       algorithm takes measures to avoid overly long execution cycles and
       to avoid holding all the locks for too long.

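       The core idea can be illustrated with a toy model: a single
       homogeneous pool of nodes, jobs already sorted by priority, and
       each job given the earliest start that never delays a higher
       priority job. This is a deliberately simplified sketch, not Slurm's
       implementation:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int    # nodes requested
    runtime: int  # requested run time (arbitrary time units)

def free_at(planned, t, total_nodes):
    """Nodes free at time t, given planned (start, end, nodes) reservations."""
    return total_nodes - sum(n for s, e, n in planned if s <= t < e)

def backfill(jobs, total_nodes):
    """Toy backfill pass: jobs arrive sorted by priority, and each one is
    assigned the earliest start that leaves every previously planned
    (higher priority) job untouched."""
    planned, starts = [], {}
    for job in jobs:
        # Free capacity only grows when a reservation ends, so trying
        # time 0 and every reservation end is sufficient.
        for start in sorted({0} | {e for _, e, _ in planned}):
            # Within the window, capacity only shrinks at reservation
            # starts, so checking those instants (plus the window start)
            # covers the whole interval.
            checks = {start} | {s for s, _, _ in planned
                                if start < s < start + job.runtime}
            if all(free_at(planned, t, total_nodes) >= job.nodes
                   for t in checks):
                planned.append((start, start + job.runtime, job.nodes))
                starts[job.name] = start
                break
    return starts

jobs = [Job("high", nodes=8, runtime=10),  # highest priority
        Job("mid", nodes=8, runtime=5),    # must wait for "high"
        Job("low", nodes=2, runtime=5)]    # backfills alongside "high"
print(backfill(jobs, total_nodes=10))  # {'high': 0, 'mid': 10, 'low': 0}
```

       Note how "low" starts immediately in the gap left by "high", even
       though "mid" has higher priority: that is the backfill effect, and
       it is allowed only because starting "low" does not delay "mid".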
       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since last
              Slurm start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset. By default these values are reset at
              midnight UTC time.

       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since last Slurm start.

       Total cycles
              Number of backfill scheduling cycles since last reset.

       Last cycle when
              Time when the last backfill scheduling cycle happened, in
              the format "weekday Month MonthDay hour:minute.seconds year".

       Last cycle
              Time in microseconds of the last backfill scheduling cycle.
              It counts only execution time, removing sleep time inside a
              scheduling cycle when it executes for an extended period.
              Note that locks are released during the sleep time so that
              other work can proceed.

       Max cycle
              Maximum time in microseconds of any backfill scheduling
              cycle execution since last reset. It counts only execution
              time, removing sleep time inside a scheduling cycle when it
              executes for an extended period. Note that locks are
              released during the sleep time so that other work can
              proceed.

       Mean cycle
              Mean time in microseconds of backfill scheduling cycles
              since last reset.

       Last depth cycle
              Number of jobs processed during the last backfill scheduling
              cycle. It counts every job, even if that job cannot be
              started due to dependencies or limits.

       Last depth cycle (try sched)
              Number of jobs processed during the last backfill scheduling
              cycle. It counts only jobs with a chance to start using
              available resources. These jobs consume more scheduling time
              than jobs which are found to be unable to start due to
              dependencies or limits.

       Depth Mean
              Mean count of jobs processed during all backfill scheduling
              cycles since last reset. Jobs which are found to be
              ineligible to run when examined by the backfill scheduler
              are not counted (e.g. jobs submitted to multiple partitions
              and already started, jobs which have reached a QOS or
              account limit such as maximum running jobs for an account,
              etc.).

       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler
              attempted to schedule.

       Last queue length
              Number of jobs pending to be processed by the backfill
              algorithm. A job is counted once for each partition it is
              queued to use. A pending job array will normally be counted
              as one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate
              job).

       Queue length Mean
              Mean count of jobs pending to be processed by the backfill
              algorithm. A job is counted once for each partition it
              requested. A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate
              job).

       Latency for 1000 calls to gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall in
              microseconds, as measured at controller startup.

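       The measurement works like the following sketch, which times 1000
       clock reads. Python's time.time() stands in here for the
       gettimeofday() syscall; the real measurement is done in C inside
       slurmctld:

```python
import time

def clock_call_latency(calls: int = 1000) -> float:
    """Return the total latency, in microseconds, of `calls` clock reads."""
    start = time.perf_counter()
    for _ in range(calls):
        time.time()  # stand-in for the gettimeofday() syscall
    end = time.perf_counter()
    return (end - start) * 1_000_000  # seconds -> microseconds

print(f"Latency for 1000 calls: {clock_call_latency():.0f} microseconds")
```

       An unusually large value here suggests the controller host itself
       is overloaded, independent of any Slurm configuration.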
       The next blocks of information report the most frequently issued
       remote procedure calls (RPCs), the calls made to the slurmctld
       daemon to perform some action. The fourth block reports the RPCs
       issued by message type. You can look up those RPC codes in the
       Slurm source code, in the file src/common/slurm_protocol_defs.h.
       The report includes the number of times each RPC is invoked, the
       total time consumed by all of those RPCs, and the average time
       consumed by each RPC in microseconds. The fifth block reports the
       RPCs issued by user ID, the total number of RPCs they have issued,
       the total time consumed by all of those RPCs, and the average time
       consumed by each RPC in microseconds. RPC statistics are collected
       for the life of the slurmctld process unless explicitly reset with
       --reset.

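       Message type IDs are resolved against the C enum in
       src/common/slurm_protocol_defs.h, where entries without an
       explicit value take the previous value plus one. A minimal sketch
       of that lookup, run against an illustrative excerpt (the entry
       names and numeric values below are examples, not a copy of the
       real header):

```python
import re

# Illustrative excerpt in the style of a C message-type enum; the names
# and values are examples only.
ENUM_EXCERPT = """
    REQUEST_NODE_REGISTRATION_STATUS = 1001,
    MESSAGE_NODE_REGISTRATION_STATUS,
    REQUEST_RECONFIGURE,
"""

def parse_enum(text: str) -> dict:
    """Map numeric message type IDs to enum names, honoring C's implicit
    increment for entries without an explicit '= value'."""
    codes, next_value = {}, 0
    for name, value in re.findall(r"(\w+)\s*(?:=\s*(\d+))?\s*,", text):
        next_value = int(value) if value else next_value + 1
        codes[next_value] = name
    return codes

codes = parse_enum(ENUM_EXCERPT)
print(codes[1002])  # MESSAGE_NODE_REGISTRATION_STATUS
```

       With such a table, the numeric message types in the fourth block
       of sdiag output can be translated into readable names.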
       The sixth block of information, labeled Pending RPC Statistics,
       shows information about pending outgoing RPCs on the slurmctld
       agent queue. The first section of this block shows the types of
       RPCs on the queue and the count of each. The second section shows
       up to the first 25 individual RPCs pending on the agent queue,
       including the type and the destination host list. This information
       is cached and only refreshed at 30 second intervals.

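       sdiag prints its counters as "Label: value" lines, which makes its
       output easy to post-process. A minimal sketch, run here against a
       small hypothetical excerpt of sdiag output (the labels follow the
       entries above; the values are invented) rather than a live
       controller:

```python
SAMPLE = """\
Server thread count: 3
Agent queue size:    0
Jobs submitted: 1523
Jobs started:   1401
"""

def parse_sdiag(text: str) -> dict:
    """Parse 'Label: value' lines into a dict of integer counters,
    skipping lines whose value is not a plain integer."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[label.strip()] = int(value.strip())
    return stats

stats = parse_sdiag(SAMPLE)
print(stats["Jobs submitted"] - stats["Jobs started"])  # 122 jobs not yet started
```

       The same approach can feed sdiag counters into a monitoring system
       sampled at regular intervals.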

OPTIONS

       -a, --all
              Get and report information. This is the default mode of
              operation.

       -h, --help
              Print description of options and exit.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.

       -r, --reset
              Reset scheduler and RPC counters to 0. Only supported for
              Slurm operators and administrators.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.

ENVIRONMENT VARIABLES

       Some sdiag options may be set via environment variables. These
       environment variables, along with their corresponding options, are
       listed below. (Note: command-line options will always override
       these settings.)

       SLURM_CONF          The location of the Slurm configuration file.

COPYING

       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2019 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or
       (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

SEE ALSO

       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)
October 2019                    Slurm Commands                        sdiag(1)