sdiag(1)                         Slurm Commands                        sdiag(1)



NAME
       sdiag - Scheduling diagnostic tool for Slurm


SYNOPSIS
       sdiag


DESCRIPTION
       sdiag shows information related to slurmctld execution: threads,
       agents, jobs, and scheduling algorithms.  Its goal is to provide data
       on slurmctld behaviour that helps with tuning configuration parameters
       or queue policies, which is most useful on systems with a high job
       throughput.

       It has two execution modes.  The default mode, --all, reports the
       counters and statistics described below; the other mode, --reset,
       resets those values.

       Values are reset at midnight UTC by default.
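
       As a quick illustration of the two modes, the following Python sketch
       (not part of Slurm itself) drives sdiag from a script.  It assumes the
       sdiag binary is in PATH and, for the reset, that the caller has Slurm
       operator or administrator privileges.

          #!/usr/bin/env python3
          # Illustrative sketch: run sdiag in its two execution modes.
          import subprocess

          # Default mode (same as --all): capture the statistics report.
          report = subprocess.run(["sdiag"], capture_output=True,
                                  text=True, check=True)
          print(report.stdout)

          # Reset mode: zero the scheduler and RPC counters
          # (requires operator or administrator rights).
          subprocess.run(["sdiag", "--reset"], check=True)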

       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of current active slurmctld threads.  A high number
              indicates a heavy load processing events such as job
              submissions, job dispatching, and job completions.  If this is
              often close to MAX_SERVER_THREADS it could point to a potential
              bottleneck.

       Agent queue size
              Slurm's design has scalability in mind, and sending messages to
              thousands of nodes is not a trivial task.  The agent mechanism
              helps to control communication between slurmctld and the slurmd
              daemons on a best-effort basis.  This value denotes the count
              of enqueued outgoing RPC requests in an internal retry list.

       Agent count
              Number of agent threads.  Each of these agent threads can in
              turn create a group of up to 2 + AGENT_THREAD_COUNT active
              threads at a time.

       Agent thread count
              Total count of active threads created by all the agent threads.

       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread.  If the SlurmDBD, or the
              database, is down then this number will increase.

              The maximum queue size is configured in slurm.conf with
              MaxDBDMsgs.  If this number grows to more than half of the
              maximum queue size, the slurmdbd and the database should be
              investigated immediately.

       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset.  This includes
              backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.

       Job states ts:
              The timestamp of when the following job state counts were
              gathered.

       Jobs pending:
              Number of jobs pending at the time of the time stamp above.

       Jobs running:
              Number of jobs running at the time of the time stamp above.

       Jobs running ts:
              Time stamp of when the running job count was taken.

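       The counters above appear in the sdiag report as simple "label: value"
       lines.  As an illustration (not part of sdiag itself), the Python
       sketch below collects them into a dictionary; the exact labels and
       layout can vary between Slurm versions, so treat the parsing as a
       rough sketch only.

          #!/usr/bin/env python3
          # Rough sketch: collect the "label: value" counters printed by
          # sdiag (e.g. "Server thread count", "Jobs submitted") into a
          # dictionary for further processing.
          import subprocess

          out = subprocess.run(["sdiag"], capture_output=True, text=True,
                               check=True).stdout

          counters = {}
          for line in out.splitlines():
              label, sep, value = line.partition(":")
              label, value = label.strip(), value.strip()
              # Keep numeric counters only; later sections repeat some
              # labels, so the first occurrence of each one wins.
              if sep and value.isdigit() and label not in counters:
                  counters[label] = int(value)

          print("Server thread count:", counters.get("Server thread count"))
          print("Jobs submitted:", counters.get("Jobs submitted"))
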
       The next block of information is related to the main scheduling
       algorithm, which is based on job priorities.  A scheduling cycle
       acquires the job_write_lock lock and then tries to allocate resources
       to pending jobs, starting with the highest priority job and working in
       descending priority order.  Once a job cannot get its resources, the
       loop continues but only considers jobs requesting other partitions.
       Jobs with dependencies or affected by account limits are not
       processed.

       Last cycle
              Time in microseconds for the last scheduling cycle.

       Max cycle
              Maximum time in microseconds for any scheduling cycle since
              last reset.

       Total cycles
              Total run time in microseconds for all scheduling cycles since
              last reset.  Scheduling is performed periodically and
              (depending upon configuration) when a job is submitted or a job
              is completed.

       Mean cycle
              Mean time in microseconds for all scheduling cycles since last
              reset.

       Mean depth cycle
              Mean of cycle depth.  Depth means the number of jobs processed
              in a scheduling cycle.

       Cycles per minute
              Counter of scheduling executions per minute.

       Last queue length
              Length of the pending job queue.

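       The cycle times above are reported in microseconds.  As an
       illustration (not part of sdiag itself), the sketch below flags a
       scheduler that looks overloaded; the one-second threshold is an
       arbitrary example value, not a Slurm recommendation, and a real parser
       would need to account for the backfilling section further down
       repeating some of these labels.

          #!/usr/bin/env python3
          # Illustrative sketch: warn when the main scheduler's mean cycle
          # time looks high.  The threshold is an arbitrary example.
          import subprocess

          out = subprocess.run(["sdiag"], capture_output=True, text=True,
                               check=True).stdout

          stats = {}
          for line in out.splitlines():
              label, sep, value = line.partition(":")
              label, value = label.strip(), value.strip()
              # First occurrence wins: the backfilling block further down
              # repeats labels such as "Mean cycle" and "Last queue length".
              if sep and value.isdigit() and label not in stats:
                  stats[label] = int(value)

          mean_cycle_us = stats.get("Mean cycle", 0)
          queue_length = stats.get("Last queue length", 0)
          if mean_cycle_us > 1_000_000:  # more than one second per cycle
              print(f"mean scheduling cycle is {mean_cycle_us} us with "
                    f"{queue_length} pending jobs; this scheduler looks busy")
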
       The next block of information is related to the backfilling scheduling
       algorithm.  A backfilling scheduling cycle acquires locks for the job,
       node and partition objects and then tries to allocate resources to
       pending jobs.  Jobs are processed based on priorities.  If a job
       cannot get resources, the algorithm calculates when it could get them,
       obtaining a future start time for the job.  The next job is then
       processed, and the algorithm tries to get resources for it without
       affecting the previously planned jobs, again calculating a future
       start time if the resources are not currently available.  The
       backfilling algorithm takes more time for each new job it processes,
       since higher priority jobs must not be affected.  The algorithm itself
       takes measures to avoid a long execution cycle and to avoid holding
       all the locks for too long.

       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since the last
              Slurm start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset.  By default these values are reset at
              midnight UTC.

       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since the last Slurm start.

       Total cycles
              Number of backfill scheduling cycles since last reset.

       Last cycle when
              Time when the last backfill scheduling cycle happened, in the
              format "weekday Month MonthDay hour:minute.seconds year".

       Last cycle
              Time in microseconds of the last backfill scheduling cycle.  It
              counts only execution time, removing sleep time inside a
              scheduling cycle when it executes for an extended period of
              time.  Note that locks are released during the sleep time so
              that other work can proceed.

       Max cycle
              Maximum time in microseconds of any backfill scheduling cycle
              execution since last reset.  It counts only execution time,
              removing sleep time inside a scheduling cycle when it executes
              for an extended period of time.  Note that locks are released
              during the sleep time so that other work can proceed.

       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles
              since last reset.

       Last depth cycle
              Number of processed jobs during the last backfilling scheduling
              cycle.  It counts every job, even if that job cannot be started
              due to dependencies or limits.

       Last depth cycle (try sched)
              Number of processed jobs during the last backfilling scheduling
              cycle.  It counts only jobs with a chance to start using
              available resources.  These jobs consume more scheduling time
              than jobs which are found to be unable to start due to
              dependencies or limits.

       Depth Mean
              Mean count of jobs processed during all backfilling scheduling
              cycles since last reset.  Jobs which are found to be ineligible
              to run when examined by the backfill scheduler are not counted
              (e.g. jobs submitted to multiple partitions and already
              started, jobs which have reached a QOS or account limit such as
              maximum running jobs for an account, etc).

       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler attempted
              to schedule.

       Last queue length
              Number of jobs pending to be processed by the backfilling
              algorithm.  A job is counted once for each partition it is
              queued to use.  A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).

       Queue length Mean
              Mean count of jobs pending to be processed by the backfilling
              algorithm.  A job is counted once for each partition it
              requested.  A pending job array will normally be counted as one
              job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).

       Last table size
              Count of different time slots tested by the backfill scheduler
              in its last iteration.

       Mean table size
              Mean count of different time slots tested by the backfill
              scheduler.  Larger counts increase the time required for the
              backfill operation.  The table size is influenced by many
              scheduling parameters, including: bf_min_age_reserve,
              bf_min_prio_reserve, bf_resolution, and bf_window.

       Latency for 1000 calls to gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall in
              microseconds, as measured at controller startup.

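       The latency figure above is measured once by slurmctld when it starts.
       As a stand-alone illustration (not a reproduction of the controller's
       internal measurement), the sketch below times 1000 timestamp calls
       from Python; time.time() is typically backed by clock_gettime() or
       gettimeofday() on Linux.

          #!/usr/bin/env python3
          # Stand-alone sketch: measure the wall-clock cost of 1000
          # timestamp calls, analogous to the "Latency for 1000 calls to
          # gettimeofday()" line reported by sdiag.
          import time

          start = time.perf_counter()
          for _ in range(1000):
              time.time()
          elapsed_us = (time.perf_counter() - start) * 1_000_000

          print(f"1000 timestamp calls took {elapsed_us:.0f} microseconds")
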
       The next blocks of information report the most frequently issued
       remote procedure calls (RPCs), the calls made to the slurmctld daemon
       asking it to perform some action.  The fourth block reports the RPCs
       issued by message type.  The RPC codes can be looked up in the Slurm
       source file src/common/slurm_protocol_defs.h.  The report includes the
       number of times each RPC is invoked, the total time consumed by all of
       those RPCs, and the average time consumed by each RPC in microseconds.
       The fifth block reports the RPCs issued by user ID, the total number
       of RPCs they have issued, the total time consumed by all of those
       RPCs, and the average time consumed by each RPC in microseconds.  RPC
       statistics are collected for the life of the slurmctld process unless
       explicitly reset with --reset.

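       As an illustration (not part of sdiag itself), the sketch below sorts
       the per-message-type RPC lines by total time, similar in spirit to the
       --sort-by-time option described later.  It assumes each RPC line
       carries "count:", "ave_time:" and "total_time:" tokens; that matches
       recent Slurm output, but the format is not guaranteed across versions.

          #!/usr/bin/env python3
          # Rough sketch: extract RPC statistics from sdiag output and
          # print the ten entries with the largest total time.
          import re
          import subprocess

          out = subprocess.run(["sdiag"], capture_output=True, text=True,
                               check=True).stdout

          rpc_re = re.compile(
              r"^\s*(?P<name>\S+).*count:(?P<count>\d+)\s+"
              r"ave_time:(?P<ave>\d+)\s+total_time:(?P<total>\d+)")

          rpcs = [m.groupdict()
                  for m in map(rpc_re.match, out.splitlines()) if m]
          rpcs.sort(key=lambda r: int(r["total"]), reverse=True)

          for r in rpcs[:10]:
              print(f'{r["name"]:40} total {r["total"]:>12} us '
                    f'({r["count"]} calls)')
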
       The sixth block of information, labeled Pending RPC Statistics, shows
       information about pending outgoing RPCs on the slurmctld agent queue.
       The first section of this block shows the types of RPCs on the queue
       and the count of each.  The second section shows up to the first 25
       individual RPCs pending on the agent queue, including the type and the
       destination host list.  This information is cached and only refreshed
       at 30 second intervals.


OPTIONS
       -a, --all
              Get and report information.  This is the default mode of
              operation.

       -M, --cluster=<string>
              The cluster to issue commands to.  Only one cluster name may be
              specified.  Note that the SlurmDBD must be up for this option
              to work properly.

       -h, --help
              Print description of options and exit.

       -r, --reset
              Reset scheduler and RPC counters to 0.  Only supported for
              Slurm operators and administrators.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.


PERFORMANCE
       Executing sdiag sends a remote procedure call to slurmctld.  If enough
       calls from sdiag or other Slurm client commands that send remote
       procedure calls to the slurmctld daemon come in at once, it can result
       in a degradation of performance of the slurmctld daemon, possibly
       resulting in a denial of service.

       Do not run sdiag or other Slurm client commands that send remote
       procedure calls to slurmctld from loops in shell scripts or other
       programs.  Ensure that programs limit calls to sdiag to the minimum
       necessary for the information you are trying to gather.
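
       As an illustration of this advice (not part of Slurm), the sketch
       below wraps sdiag in a small cache so that a monitoring script cannot
       invoke it more than once per interval; the 60-second interval is an
       arbitrary example value.

          #!/usr/bin/env python3
          # Illustrative sketch: invoke sdiag at most once per interval and
          # reuse the cached output in between, to avoid flooding slurmctld
          # with RPCs from a monitoring loop.
          import subprocess
          import time

          _cache = {"when": 0.0, "output": ""}

          def cached_sdiag(min_interval=60.0):
              """Return sdiag output, reusing the previous result if it is
              younger than min_interval seconds."""
              now = time.monotonic()
              if not _cache["output"] or now - _cache["when"] >= min_interval:
                  _cache["output"] = subprocess.run(
                      ["sdiag"], capture_output=True, text=True,
                      check=True).stdout
                  _cache["when"] = now
              return _cache["output"]

          if __name__ == "__main__":
              # The second call reuses the cached report.
              print(cached_sdiag().splitlines()[0])
              print(cached_sdiag().splitlines()[0])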


ENVIRONMENT VARIABLES
       Some sdiag options may be set via environment variables.  These
       environment variables, along with their corresponding options, are
       listed below.  (Note: Command line options will always override these
       settings.)

       SLURM_CLUSTERS      Same as --cluster

       SLURM_CONF          The location of the Slurm configuration file.

COPYING
       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2022 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it under
       the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.


SEE ALSO
       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)



May 2021                         Slurm Commands                        sdiag(1)