sdiag(1)                        Slurm Commands                        sdiag(1)



NAME
       sdiag - Scheduling diagnostic tool for Slurm


SYNOPSIS
       sdiag


DESCRIPTION
       sdiag shows information related to slurmctld execution about: threads,
       agents, jobs, and scheduling algorithms. The goal is to obtain data on
       slurmctld behaviour that helps in adjusting configuration parameters
       or queue policies. The main motivation is to understand Slurm
       behaviour on systems with a high job throughput.

       It has two execution modes. The default mode --all shows the counters
       and statistics explained below, and the --reset option resets those
       values.

       Values are reset at midnight UTC time by default.
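
       For example, to view the full report and then clear the counters (the
       reset requires operator or administrator privileges, as noted under
       OPTIONS below):

              $ sdiag            # same as: sdiag --all
              $ sdiag --reset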

       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of currently active slurmctld threads. A high number
              would mean a high load processing events like job submissions,
              job dispatching, job completion, etc. If this is often close to
              MAX_SERVER_THREADS it could point to a potential bottleneck.


       Agent queue size
              Slurm design has scalability in mind and sending messages to
              thousands of nodes is not a trivial task. The agent mechanism
              helps to control communication between slurmctld and the slurmd
              daemons on a best effort basis. This value denotes the count of
              enqueued outgoing RPC requests in an internal retry list.


       Agent count
              Number of agent threads. Each of these agent threads can in
              turn create a group of up to 2 + AGENT_THREAD_COUNT active
              threads at a time.


       Agent thread count
              Total count of active threads created by all the agent threads.


       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread. If the SlurmDBD, or the
              database, is down then this number will increase.

              The maximum queue size is configured in slurm.conf with
              MaxDBDMsgs. If this number grows beyond half of the maximum
              queue size, the slurmdbd and the database should be
              investigated immediately (see the example following this list).


       Jobs submitted
              Number of jobs submitted since last reset.


       Jobs started
              Number of jobs started since last reset. This includes
              backfilled jobs.


       Jobs completed
              Number of jobs completed since last reset.


       Jobs canceled
              Number of jobs canceled since last reset.


       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.


       Job states ts:
              Lists the timestamp of when the following job state counts were
              gathered.


       Jobs pending:
              Number of jobs pending at the given time of the time stamp
              above.


       Jobs running:
              Number of jobs running at the given time of the time stamp
              above.


       Jobs running ts:
              Time stamp of when the running job count was taken.
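
       The MaxDBDMsgs limit referenced above is an ordinary slurm.conf
       parameter. A quick way to check the value in effect on a running
       controller (a sketch; whether the parameter appears in the output
       depends on the Slurm version):

              $ scontrol show config | grep -i MaxDBDMsgs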


       The next block of information is related to the main scheduling
       algorithm, which is based on job priorities. A scheduling cycle
       implies taking the job_write_lock lock, then trying to get resources
       for pending jobs, starting with the highest priority job and
       proceeding in descending priority order. Once a job cannot get the
       resources, the loop keeps going, but only for jobs requesting other
       partitions. Jobs with dependencies or affected by account limits are
       not processed.
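
       For example, to extract just the main scheduler statistics from the
       report (assuming the "Main schedule statistics" section label used by
       recent Slurm releases; the exact label may differ between versions):

              $ sdiag | grep -A 8 "Main schedule statistics"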


       Last cycle
              Time in microseconds for last scheduling cycle.


       Max cycle
              Maximum time in microseconds for any scheduling cycle since
              last reset.


       Total cycles
              Total run time in microseconds for all scheduling cycles since
              last reset. Scheduling is performed periodically and (depending
              upon configuration) when a job is submitted or a job is
              completed.


       Mean cycle
              Mean time in microseconds for all scheduling cycles since last
              reset.


       Mean depth cycle
              Mean of cycle depth. Depth means number of jobs processed in a
              scheduling cycle.


       Cycles per minute
              Counter of scheduling executions per minute.


       Last queue length
              Length of jobs pending queue.


       The next block of information is related to the backfilling scheduling
       algorithm. A backfilling scheduling cycle implies taking locks for the
       job, node and partition objects, then trying to get resources for
       pending jobs. Jobs are processed based on priorities. If a job cannot
       get resources, the algorithm calculates when it could get them,
       obtaining a future start time for the job. The next job is then
       processed, and the algorithm tries to get resources for it without
       affecting the previously considered jobs, again calculating a future
       start time if resources are not currently available. The backfilling
       algorithm takes more time for each new job processed, since higher
       priority jobs cannot be affected. The algorithm itself takes measures
       to avoid an excessively long execution cycle and to avoid holding all
       the locks for too long.


       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since last slurm
              start.


       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset. By default these values are reset at
              midnight UTC time.


       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since last Slurm start.


       Total cycles
              Number of backfill scheduling cycles since last reset.


       Last cycle when
              Time when last backfill scheduling cycle happened, in the
              format "weekday Month MonthDay hour:minute.seconds year".


       Last cycle
              Time in microseconds of last backfill scheduling cycle. It
              counts only execution time, removing sleep time inside a
              scheduling cycle when it executes for an extended period of
              time. Note that locks are released during the sleep time so
              that other work can proceed.


       Max cycle
              Maximum time in microseconds of any backfill scheduling cycle
              execution since last reset. It counts only execution time,
              removing sleep time inside a scheduling cycle when it executes
              for an extended period of time. Note that locks are released
              during the sleep time so that other work can proceed.


       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles
              since last reset.


       Last depth cycle
              Number of processed jobs during last backfilling scheduling
              cycle. It counts every job, even if that job cannot be started
              due to dependencies or limits.


       Last depth cycle (try sched)
              Number of processed jobs during last backfilling scheduling
              cycle. It counts only jobs with a chance to start using
              available resources. These jobs consume more scheduling time
              than jobs which are found unable to start due to dependencies
              or limits.


       Depth Mean
              Mean count of jobs processed during all backfilling scheduling
              cycles since last reset. Jobs which are found to be ineligible
              to run when examined by the backfill scheduler are not counted
              (e.g. jobs submitted to multiple partitions and already
              started, jobs which have reached a QOS or account limit such as
              maximum running jobs for an account, etc).


       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler attempted
              to schedule.


       Last queue length
              Number of jobs pending to be processed by the backfilling
              algorithm. A job is counted once for each partition it is
              queued to use. A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).


       Queue length Mean
              Mean count of jobs pending to be processed by the backfilling
              algorithm. A job is counted once for each partition it
              requested. A pending job array will normally be counted as one
              job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).


       Last table size
              Count of different time slots tested by the backfill scheduler
              in its last iteration.


       Mean table size
              Mean count of different time slots tested by the backfill
              scheduler. Larger counts increase the time required for the
              backfill operation. The table size is influenced by many
              scheduling parameters, including bf_min_age_reserve,
              bf_min_prio_reserve, bf_resolution, and bf_window (see the
              example following this list).


       Latency for 1000 calls to gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall in
              microseconds, as measured at controller startup.
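
       The backfill-related parameters mentioned above are set through
       SchedulerParameters in slurm.conf. A quick way to see the values
       currently in effect (a sketch; the output format may vary between
       Slurm versions):

              $ scontrol show config | grep SchedulerParameters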


       The next blocks of information report the most frequently issued
       remote procedure calls (RPCs), i.e. calls made to the slurmctld daemon
       asking it to perform some action. The fourth block reports the RPCs
       issued by message type. You will need to look up those RPC codes in
       the Slurm source code, in the file src/common/slurm_protocol_defs.h.
       The report includes the number of times each RPC is invoked, the total
       time consumed by all of those RPCs, plus the average time consumed by
       each RPC in microseconds. The fifth block reports the RPCs issued by
       user ID, the total number of RPCs they have issued, the total time
       consumed by all of those RPCs, plus the average time consumed by each
       RPC in microseconds. RPC statistics are collected for the life of the
       slurmctld process unless explicitly cleared with the --reset option.
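
       For example, to find the symbolic name behind an RPC message type
       reported by sdiag, search the header named above in the Slurm source
       tree (REQUEST_JOB_INFO is just an illustrative message type):

              $ grep -n "REQUEST_JOB_INFO" src/common/slurm_protocol_defs.h

       The per-type and per-user RPC tables can also be reordered with the
       --sort-by-time and --sort-by-time2 options described below.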


       The sixth block of information, labeled Pending RPC Statistics, shows
       information about pending outgoing RPCs on the slurmctld agent queue.
       The first section of this block shows the types of RPCs on the queue
       and the count of each. The second section shows up to the first 25
       individual RPCs pending on the agent queue, including the type and the
       destination host list. This information is cached and only refreshed
       at 30 second intervals.


OPTIONS
       -a, --all
              Get and report information. This is the default mode of
              operation.


       -h, --help
              Print description of options and exit.


       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.


       -M, --cluster=<string>
              The cluster to issue commands to. Only one cluster name may be
              specified. Note that the SlurmDBD must be up for this option to
              work properly (see the example following this list).


       -r, --reset
              Reset scheduler and RPC counters to 0. Only supported for Slurm
              operators and administrators.


       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.


       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.


       --usage
              Print list of options and exit.


       -V, --version
              Print current version number and exit.
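
       For example, to review RPC load on a particular cluster, sorted by
       total run time (the cluster name "cluster2" is hypothetical):

              $ sdiag --sort-by-time -M cluster2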


PERFORMANCE
       Executing sdiag sends a remote procedure call to slurmctld. If enough
       calls from sdiag or other Slurm client commands that send remote
       procedure calls to the slurmctld daemon come in at once, it can result
       in a degradation of performance of the slurmctld daemon, possibly
       resulting in a denial of service.

       Do not run sdiag or other Slurm client commands that send remote
       procedure calls to slurmctld from loops in shell scripts or other
       programs. Ensure that programs limit calls to sdiag to the minimum
       necessary for the information you are trying to gather.
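
       If these statistics need to be collected regularly, prefer a low,
       fixed frequency over a polling loop. As an illustrative sketch only
       (the interval and log destination are arbitrary), a cron entry taking
       a snapshot every 15 minutes:

              */15 * * * *  sdiag >> /var/log/sdiag.log 2>&1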


ENVIRONMENT VARIABLES
       Some sdiag options may be set via environment variables. These
       environment variables, along with their corresponding options, are
       listed below. (Note: command line options will always override these
       settings.)

       SLURM_CLUSTERS      Same as --cluster (see the example below).


       SLURM_CONF          The location of the Slurm configuration file.
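
       For example, to query a different cluster via the environment rather
       than the -M option (the cluster name "cluster2" is hypothetical):

              $ SLURM_CLUSTERS=cluster2 sdiag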


COPYING
       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2019 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it under
       the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.


SEE ALSO
       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)




October 2019                    Slurm Commands                        sdiag(1)