sdiag(1)                        Slurm Commands                       sdiag(1)


NAME
       sdiag - Scheduling diagnostic tool for Slurm


SYNOPSIS
       sdiag


DESCRIPTION
       sdiag shows information related to slurmctld execution about:
       threads, agents, jobs, and scheduling algorithms. The goal is to
       obtain data on slurmctld behaviour that helps in adjusting
       configuration parameters or queue policies. It is mainly intended
       for understanding Slurm behaviour on systems with a high throughput.

       It has two execution modes. The default mode, --all, shows the
       counters and statistics explained below; the other mode, --reset,
       resets those values.

       Values are reset at midnight UTC time by default.

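       For example (illustrative invocations; as noted under OPTIONS,
       --reset is only available to Slurm operators and administrators):

              $ sdiag              # report all counters and statistics
              $ sdiag --reset      # reset those counters
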
       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of currently active slurmctld threads. A high
              number would mean a high load processing events like job
              submissions, job dispatching, job completion, etc. If this is
              often close to MAX_SERVER_THREADS it could point to a
              potential bottleneck.

       Agent queue size
              Slurm is designed with scalability in mind, and sending
              messages to thousands of nodes is not a trivial task. The
              agent mechanism helps to control the communication between
              the controller and the slurmd daemons on a best-effort basis.
              If this value is close to MAX_AGENT_CNT there could be some
              delays affecting job management.

       Agent count
              Number of active agent threads.

       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread. If the SlurmDBD, or the
              database, is down then this number will increase. The maximum
              queue size is calculated as:

              MAX(10000, ((max_job_cnt * 2) + (node_record_count * 4)))

              If this number grows beyond half of the maximum queue size,
              the slurmdbd and the database should be investigated
              immediately.

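              As a worked example, assuming illustrative values of
              max_job_cnt = 10000 and node_record_count = 1000, the maximum
              queue size would be:

              MAX(10000, ((10000 * 2) + (1000 * 4))) = MAX(10000, 24000)
                                                     = 24000
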
       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset. This includes
              backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.

       Job states ts:
              Timestamp of when the following job state counts were
              gathered.

       Jobs pending:
              Number of jobs pending at the time of the timestamp above.

       Jobs running:
              Number of jobs running at the time of the timestamp above.

       Jobs running ts:
              Timestamp of when the running job count was taken.

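       An illustrative excerpt of this first block (all values are examples
       only; exact formatting may vary between Slurm versions):

              Server thread count:  3
              Agent queue size:     0
              Agent count:          0
              DBD Agent queue size: 0

              Jobs submitted: 1463
              Jobs started:   1057
              Jobs completed: 1052
              Jobs canceled:  14
              Jobs failed:    0

              Job states ts:   Thu Jul 12 10:28:34 2018
              Jobs pending:    401
              Jobs running:    5
              Jobs running ts: Thu Jul 12 10:28:34 2018
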
       The second block of information is related to the main scheduling
       algorithm, which is based on job priorities. A scheduling cycle
       implies getting the job_write_lock lock, then trying to get
       resources for pending jobs, starting from the highest priority one
       and going in descending order. Once a job cannot get the resources,
       the loop keeps going, but only for jobs requesting other partitions.
       Jobs with dependencies or affected by account limits are not
       processed.

       Last cycle
              Time in microseconds for the last scheduling cycle.

       Max cycle
              Maximum time in microseconds of any scheduling cycle since
              last reset.

       Total cycles
              Number of scheduling cycles since last reset. Scheduling is
              done periodically and also when a job is submitted or
              completed.

       Mean cycle
              Mean time in microseconds of scheduling cycles since last
              reset.

       Mean depth cycle
              Mean depth of scheduling cycles. Depth means the number of
              jobs processed in a scheduling cycle.

       Cycles per minute
              Number of scheduling executions per minute.

       Last queue length
              Length of the pending jobs queue.

       Latency for gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall, in
              microseconds, as measured at controller startup.

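       An illustrative excerpt of this second block (all values are
       examples only):

              Main schedule statistics (microseconds):
                      Last cycle:   97209
                      Max cycle:    169687
                      Total cycles: 81
                      Mean cycle:   66534
                      Mean depth cycle:  57
                      Cycles per minute: 1
                      Last queue length: 400
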
       The third block of information is related to the backfilling
       scheduling algorithm. A backfilling scheduling cycle implies getting
       locks for the job, node and partition objects, then trying to get
       resources for pending jobs. Jobs are processed based on priorities.
       If a job cannot get resources, the algorithm calculates when it
       could get them, obtaining a future start time for the job. Then the
       next job is processed, and the algorithm tries to get resources for
       that job without affecting the previous ones, again calculating a
       future start time if the resources are not currently available. The
       backfilling algorithm takes more time for each new job it processes,
       since higher priority jobs cannot be affected. The algorithm itself
       takes measures to avoid a long execution cycle and to avoid holding
       all the locks for too long.

       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since last Slurm
              start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset. By default these values are reset at
              midnight UTC time.

       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since last Slurm start.

       Total cycles
              Number of backfilling scheduling cycles since last reset.

       Last cycle when
              Time when the last backfilling cycle happened, in the format
              "weekday Month MonthDay hour:minute.seconds year".

       Last cycle
              Time in microseconds of the last backfilling cycle. It counts
              only execution time, removing the sleep time inside a
              scheduling cycle when it takes too much time. Note that locks
              are released during the sleep time so that other work can
              proceed.

       Max cycle
              Maximum time in microseconds of any backfilling cycle since
              last reset. It counts only execution time, removing the sleep
              time inside a scheduling cycle when it takes too much time.
              Note that locks are released during the sleep time so that
              other work can proceed.

       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles
              since last reset.

       Last depth cycle
              Number of jobs processed during the last backfilling
              scheduling cycle. It counts every job, even those with no
              chance to execute due to dependencies or limits.

       Last depth cycle (try sched)
              Number of jobs processed during the last backfilling
              scheduling cycle. It counts only jobs with a chance to run
              while waiting for available resources. These are the jobs
              that make the backfilling algorithm heavier.

       Depth Mean
              Mean number of jobs processed during backfilling scheduling
              cycles since last reset. Jobs which are found to be
              ineligible to run when examined by the backfill scheduler are
              not counted (e.g. jobs submitted to multiple partitions and
              already started, jobs which have reached a QOS or account
              limit such as maximum running jobs for an account, etc).

       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler actually
              attempted to schedule.

       Last queue length
              Number of jobs pending to be processed by the backfilling
              algorithm. A job is counted once for each partition it
              requested. A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate
              job).

       Queue length Mean
              Mean number of jobs pending to be processed by the
              backfilling algorithm. A job is counted once for each
              partition it requested. A pending job array will normally be
              counted as one job (tasks of a job array which have already
              been started/requeued or individually modified will already
              have individual job records and are each counted as a
              separate job).

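       An illustrative excerpt of this third block (all values are examples
       only):

              Backfilling stats
                      Total backfilled jobs (since last slurm start): 211
                      Total backfilled jobs (since last stats cycle start): 211
                      Total backfilled heterogeneous job components: 0
                      Total cycles: 29
                      Last cycle when: Thu Jul 12 10:28:04 2018
                      Last cycle: 456549
                      Max cycle:  580889
                      Mean cycle: 454515
                      Last depth cycle: 126
                      Last depth cycle (try sched): 20
                      Depth Mean: 63
                      Depth Mean (try sched): 10
                      Last queue length: 126
                      Queue length Mean: 108
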
       The fourth and fifth blocks of information report the most
       frequently issued remote procedure calls (RPCs), the calls made to
       the slurmctld daemon to perform some action. The fourth block
       reports the RPCs issued by message type; the RPC codes can be looked
       up in the Slurm source code, in the file
       src/common/slurm_protocol_defs.h. The report includes the number of
       times each RPC was invoked, the total time consumed by all of those
       RPCs, plus the average time consumed by each RPC, in microseconds.
       The fifth block reports the RPCs issued by user ID: the total number
       of RPCs each user has issued, the total time consumed by all of
       those RPCs, plus the average time consumed by each RPC, in
       microseconds.

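       As an example, the symbolic name of a given RPC message type can be
       found by searching the Slurm source tree (illustrative command; the
       message type shown is just one entry of the enumeration):

              $ grep -n "REQUEST_PING" src/common/slurm_protocol_defs.h
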
       The sixth block of information, labeled Pending RPC Statistics,
       shows information about pending outgoing RPCs on the slurmctld agent
       queue. The first section of this block shows the types of RPCs on
       the queue and the count of each. The second section shows up to the
       first 25 individual RPCs pending on the agent queue, including the
       type and the destination host list. This information is cached and
       only refreshed at 30 second intervals.


OPTIONS
       -a, --all
              Get and report information. This is the default mode of
              operation.

       -h, --help
              Print description of options and exit.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.

       -r, --reset
              Reset counters. Only supported for Slurm operators and
              administrators.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.

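       For example, to report RPC statistics sorted by total run time
       (illustrative invocation):

              $ sdiag --sort-by-time
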

ENVIRONMENT VARIABLES
       Some sdiag options may be set via environment variables. These
       environment variables, along with their corresponding options, are
       listed below. (Note: command-line options will always override these
       settings.)

       SLURM_CONF          The location of the Slurm configuration file.

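       For example, to run sdiag against a non-default configuration file
       (the path shown is hypothetical):

              $ SLURM_CONF=/opt/slurm/etc/slurm.conf sdiag
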

COPYING
       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2017 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by
       the Free Software Foundation; either version 2 of the License, or
       (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
       or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
       License for more details.


SEE ALSO
       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)



July 2018                       Slurm Commands                       sdiag(1)