sdiag(1)                        Slurm Commands                       sdiag(1)


NAME
sdiag - Scheduling diagnostic tool for Slurm


SYNOPSIS
sdiag


DESCRIPTION
sdiag shows information related to slurmctld execution about: threads,
agents, jobs, and scheduling algorithms. The goal is to obtain data
about slurmctld behaviour to help adjust configuration parameters or
queue policies. The main motivation is to understand Slurm behaviour
on systems with high throughput.

It has two execution modes. The default mode, --all, shows the
counters and statistics explained below; the other mode, --reset,
resets those values.

Values are reset at midnight UTC time by default.
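
As a sketch, the two modes are invoked as follows (this assumes sdiag
is on PATH; resetting requires Slurm operator or administrator
privileges):

```shell
# Show all scheduling diagnostics (the default --all mode)
sdiag

# Reset the scheduler and RPC counters to 0
# (only permitted for Slurm operators and administrators)
sdiag --reset
```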

The first block of information is related to global slurmctld
execution:

Server thread count
The number of currently active slurmctld threads. A high number
would mean a high load processing events like job submissions,
jobs dispatching, jobs completing, etc. If this is often close
to MAX_SERVER_THREADS it could point to a potential bottleneck.


Agent queue size
Slurm's design has scalability in mind, and sending messages to
thousands of nodes is not a trivial task. The agent mechanism
helps control communication between slurmctld and the slurmd
daemons on a best-effort basis. This value denotes the count of
enqueued outgoing RPC requests in an internal retry list.


Agent count
Number of agent threads. Each of these agent threads can create
in turn a group of up to 2 + AGENT_THREAD_COUNT active threads
at a time.


DBD Agent queue size
Slurm queues up the messages intended for the SlurmDBD and
processes them in a separate thread. If the SlurmDBD, or the
database, is down then this number will increase. The max queue
size is calculated as:

MAX(10000, ((max_job_cnt * 2) + (node_record_count * 4)))

If this number begins to exceed half of the max queue size, the
slurmdbd and the database should be investigated immediately.
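
As a sketch with invented numbers (a hypothetical max_job_cnt of
10000 and node_record_count of 500, neither taken from a real
cluster), the formula evaluates as:

```shell
# Illustrative values only; slurmctld derives the real ones from its
# configuration and the number of node records.
max_job_cnt=10000
node_record_count=500

size=$(( max_job_cnt * 2 + node_record_count * 4 ))
if [ "$size" -lt 10000 ]; then
    size=10000    # enforce the MAX(10000, ...) floor
fi
echo "$size"      # 22000 for these values
```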


Jobs submitted
Number of jobs submitted since last reset.


Jobs started
Number of jobs started since last reset. This includes
backfilled jobs.


Jobs completed
Number of jobs completed since last reset.


Jobs canceled
Number of jobs canceled since last reset.


Jobs failed
Number of jobs failed due to slurmd or other internal issues
since last reset.


Job states ts:
Lists the timestamp of when the following job state counts were
gathered.


Jobs pending:
Number of jobs pending at the given time of the time stamp
above.


Jobs running:
Number of jobs running at the given time of the time stamp
above.


Jobs running ts:
Time stamp of when the running job count was taken.


The next block of information is related to the main scheduling
algorithm, which is based on job priorities. A scheduling cycle
acquires the job_write_lock, then tries to allocate resources to
pending jobs, starting with the highest-priority job and proceeding in
descending priority order. Once a job cannot get resources, the loop
keeps going, but only for jobs requesting other partitions. Jobs with
dependencies or affected by account limits are not processed.


Last cycle
Time in microseconds for last scheduling cycle.


Max cycle
Maximum time in microseconds for any scheduling cycle since last
reset.


Total cycles
Total run time in microseconds for all scheduling cycles since
last reset. Scheduling is performed periodically and (depending
upon configuration) when a job is submitted or a job is
completed.


Mean cycle
Mean time in microseconds for all scheduling cycles since last
reset.


Mean depth cycle
Mean of cycle depth. Depth means number of jobs processed in a
scheduling cycle.


Cycles per minute
Counter of scheduling executions per minute.


Last queue length
Length of jobs pending queue.

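One way to track these counters is to parse them out of captured sdiag
output. The sample below is illustrative (values invented), and the
exact label spacing in real output may differ:

```shell
# Extract the "Mean cycle" counter from a captured fragment of sdiag
# output; the sample text is an assumption, not real cluster data.
sample='Main schedule statistics (microseconds):
	Last cycle:   55
	Mean cycle:   61
	Cycles per minute: 3
	Last queue length: 120'

echo "$sample" | awk -F': *' '/Mean cycle/ {print $2}'   # prints 61
```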

The next block of information is related to the backfilling scheduling
algorithm. A backfilling scheduling cycle acquires locks for the job,
node and partition objects, then tries to allocate resources to
pending jobs. Jobs are processed based on priorities. If a job cannot
get resources, the algorithm calculates when it could get them,
obtaining a future start time for the job. The next job is then
processed, and the algorithm tries to allocate resources to it without
affecting the previous ones, again calculating a future start time if
no resources are currently available. The backfilling algorithm takes
more time for each new job it processes, since higher-priority jobs
cannot be affected. The algorithm itself takes measures to avoid a
long execution cycle and to avoid holding all the locks for too long.


Total backfilled jobs (since last slurm start)
Number of jobs started thanks to backfilling since last slurm
start.


Total backfilled jobs (since last stats cycle start)
Number of jobs started thanks to backfilling since the last time
stats were reset. By default these values are reset at midnight
UTC time.


Total backfilled heterogeneous job components
Number of heterogeneous job components started thanks to
backfilling since last Slurm start.


Total cycles
Number of backfill scheduling cycles since last reset.


Last cycle when
Time when the last backfill scheduling cycle happened, in the
format "weekday Month MonthDay hour:minute.seconds year"


Last cycle
Time in microseconds of last backfill scheduling cycle. It
counts only execution time, removing sleep time inside a
scheduling cycle when it executes for an extended period of
time. Note that locks are released during the sleep time so that
other work can proceed.


Max cycle
Time in microseconds of maximum backfill scheduling cycle
execution since last reset. It counts only execution time,
removing sleep time inside a scheduling cycle when it executes
for an extended period of time. Note that locks are released
during the sleep time so that other work can proceed.


Mean cycle
Mean time in microseconds of backfilling scheduling cycles since
last reset.


Last depth cycle
Number of processed jobs during last backfilling scheduling
cycle. It counts every job, even if that job cannot be started
due to dependencies or limits.


Last depth cycle (try sched)
Number of processed jobs during last backfilling scheduling
cycle. It counts only jobs with a chance to start using
available resources. These jobs consume more scheduling time
than jobs which are found to be unable to start due to
dependencies or limits.


Depth Mean
Mean count of jobs processed during all backfilling scheduling
cycles since last reset. Jobs which are found to be ineligible
to run when examined by the backfill scheduler are not counted
(e.g. jobs submitted to multiple partitions and already started,
jobs which have reached a QOS or account limit such as maximum
running jobs for an account, etc).


Depth Mean (try sched)
The subset of Depth Mean that the backfill scheduler attempted
to schedule.


Last queue length
Number of jobs pending to be processed by the backfilling
algorithm. A job is counted once for each partition it is queued
to use. A pending job array will normally be counted as one job
(tasks of a job array which have already been started/requeued
or individually modified will already have individual job
records and are each counted as a separate job).


Queue length Mean
Mean count of jobs pending to be processed by the backfilling
algorithm. A job is counted once for each partition it
requested. A pending job array will normally be counted as one
job (tasks of a job array which have already been
started/requeued or individually modified will already have
individual job records and are each counted as a separate job).


Latency for 1000 calls to gettimeofday()
Latency of 1000 calls to the gettimeofday() syscall in
microseconds, as measured at controller startup.


The next blocks of information report the most frequently issued
remote procedure calls (RPCs), the calls made to the slurmctld daemon
to perform some action. The fourth block reports the RPCs issued by
message type. You can look up those RPC codes in the Slurm source
code, in the file src/common/slurm_protocol_defs.h. The report
includes the number of times each RPC is invoked, the total time
consumed by all of those RPCs, plus the average time consumed by each
RPC in microseconds. The fifth block reports the RPCs issued by user
ID: the total number of RPCs they have issued, the total time consumed
by all of those RPCs, plus the average time consumed by each RPC in
microseconds. RPC statistics are collected for the life of the
slurmctld process unless explicitly reset with --reset.

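A per-type summary can be pulled out of captured output with a short
awk filter. The sample lines below are illustrative; their exact
layout is an assumption about the RPC block format:

```shell
# Print "message_type count" pairs from captured RPC statistics lines;
# the sample values are invented, not from a real cluster.
sample='REQUEST_JOB_INFO ( 2003) count:5856 ave_time:721 total_time:4226563
REQUEST_NODE_INFO ( 2007) count:1402 ave_time:335 total_time:469670'

echo "$sample" | awk '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^count:/) { sub(/^count:/, "", $i); print $1, $i }
}'
```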

The sixth block of information, labeled Pending RPC Statistics, shows
information about pending outgoing RPCs on the slurmctld agent queue.
The first section of this block shows the types of RPCs on the queue
and the count of each. The second section shows up to the first 25
individual RPCs pending on the agent queue, including the type and the
destination host list. This information is cached and only refreshed
at 30 second intervals.


OPTIONS
-a, --all
Get and report information. This is the default mode of
operation.


-h, --help
Print description of options and exit.


-i, --sort-by-id
Sort Remote Procedure Call (RPC) data by message type ID and
user ID.


-r, --reset
Reset scheduler and RPC counters to 0. Only supported for Slurm
operators and administrators.


-t, --sort-by-time
Sort Remote Procedure Call (RPC) data by total run time.


-T, --sort-by-time2
Sort Remote Procedure Call (RPC) data by average run time.


--usage
Print list of options and exit.


-V, --version
Print current version number and exit.


ENVIRONMENT VARIABLES
Some sdiag options may be set via environment variables. These
environment variables, along with their corresponding options, are
listed below. (Note: command line options will always override these
settings.)

SLURM_CONF The location of the Slurm configuration file.

COPYING
Copyright (C) 2010-2011 Barcelona Supercomputing Center.
Copyright (C) 2010-2019 SchedMD LLC.

Slurm is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.

Slurm is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.


SEE ALSO
sinfo(1), squeue(1), scontrol(1), slurm.conf(5)



October 2019                    Slurm Commands                       sdiag(1)