sdiag(1)                         Slurm Commands                        sdiag(1)



NAME
       sdiag - Scheduling diagnostic tool for Slurm


SYNOPSIS
       sdiag


DESCRIPTION
       sdiag shows information related to slurmctld execution: threads,
       agents, jobs, and scheduling algorithms.  Its goal is to provide data
       on slurmctld behaviour that helps with tuning configuration parameters
       or queue policies, which is most useful on systems with a high job
       throughput.

       It has two execution modes.  The default mode, --all, reports the
       counters and statistics described below; the other mode, --reset,
       resets those values.

       Values are reset at midnight UTC by default.
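
       As a quick illustration of the two modes, the following Python sketch
       (not part of Slurm itself) drives sdiag from a script.  It assumes the
       sdiag binary is in PATH and, for the reset, that the caller has Slurm
       operator or administrator privileges.

          #!/usr/bin/env python3
          # Illustrative sketch: run sdiag in its two execution modes.
          import subprocess

          # Default mode (same as --all): capture the statistics report.
          report = subprocess.run(["sdiag"], capture_output=True,
                                  text=True, check=True)
          print(report.stdout)

          # Reset mode: zero the scheduler and RPC counters
          # (requires operator or administrator rights).
          subprocess.run(["sdiag", "--reset"], check=True)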

       The first block of information is related to global slurmctld
       execution:

       Server thread count
              The number of current active slurmctld threads.  A high number
              indicates a heavy load processing events such as job
              submissions, job dispatching, and job completions.  If this is
              often close to MAX_SERVER_THREADS it could point to a potential
              bottleneck.

       Agent queue size
              Slurm's design has scalability in mind, and sending messages to
              thousands of nodes is not a trivial task.  The agent mechanism
              helps to control communication between slurmctld and the slurmd
              daemons on a best-effort basis.  This value denotes the count
              of enqueued outgoing RPC requests in an internal retry list.

       Agent count
              Number of agent threads.  Each of these agent threads can in
              turn create a group of up to 2 + AGENT_THREAD_COUNT active
              threads at a time.

       Agent thread count
              Total count of active threads created by all the agent threads.

       DBD Agent queue size
              Slurm queues up the messages intended for the SlurmDBD and
              processes them in a separate thread.  If the SlurmDBD, or the
              database, is down then this number will increase.

              The maximum queue size is configured in slurm.conf with
              MaxDBDMsgs.  If this number grows to more than half of the
              maximum queue size, the slurmdbd and the database should be
              investigated immediately.

       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset.  This includes
              backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed due to slurmd or other internal issues
              since last reset.

       Job states ts:
              The timestamp of when the following job state counts were
              gathered.

       Jobs pending:
              Number of jobs pending at the time of the time stamp above.

       Jobs running:
              Number of jobs running at the time of the time stamp above.

       Jobs running ts:
              Time stamp of when the running job count was taken.

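       The counters above appear in the sdiag report as simple "label: value"
       lines.  As an illustration (not part of sdiag itself), the Python
       sketch below collects them into a dictionary; the exact labels and
       layout can vary between Slurm versions, so treat the parsing as a
       rough sketch only.

          #!/usr/bin/env python3
          # Rough sketch: collect the "label: value" counters printed by
          # sdiag (e.g. "Server thread count", "Jobs submitted") into a
          # dictionary for further processing.
          import subprocess

          out = subprocess.run(["sdiag"], capture_output=True, text=True,
                               check=True).stdout

          counters = {}
          for line in out.splitlines():
              label, sep, value = line.partition(":")
              label, value = label.strip(), value.strip()
              # Keep numeric counters only; later sections repeat some
              # labels, so the first occurrence of each one wins.
              if sep and value.isdigit() and label not in counters:
                  counters[label] = int(value)

          print("Server thread count:", counters.get("Server thread count"))
          print("Jobs submitted:", counters.get("Jobs submitted"))
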
       The next block of information is related to the main scheduling
       algorithm, which is based on job priorities.  A scheduling cycle
       acquires the job_write_lock lock and then tries to allocate resources
       to pending jobs, starting with the highest priority job and working in
       descending priority order.  Once a job cannot get its resources, the
       loop continues but only considers jobs requesting other partitions.
       Jobs with dependencies or affected by account limits are not
       processed.

       Last cycle
              Time in microseconds for the last scheduling cycle.

       Max cycle
              Maximum time in microseconds for any scheduling cycle since
              last reset.

       Total cycles
              Total run time in microseconds for all scheduling cycles since
              last reset.  Scheduling is performed periodically and
              (depending upon configuration) when a job is submitted or a job
              is completed.

       Mean cycle
              Mean time in microseconds for all scheduling cycles since last
              reset.

       Mean depth cycle
              Mean of cycle depth.  Depth means the number of jobs processed
              in a scheduling cycle.

       Cycles per minute
              Counter of scheduling executions per minute.

       Last queue length
              Length of the pending job queue.

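       The cycle times above are reported in microseconds.  As an
       illustration (not part of sdiag itself), the sketch below flags a
       scheduler that looks overloaded; the one-second threshold is an
       arbitrary example value, not a Slurm recommendation, and a real parser
       would need to account for the backfilling section further down
       repeating some of these labels.

          #!/usr/bin/env python3
          # Illustrative sketch: warn when the main scheduler's mean cycle
          # time looks high.  The threshold is an arbitrary example.
          import subprocess

          out = subprocess.run(["sdiag"], capture_output=True, text=True,
                               check=True).stdout

          stats = {}
          for line in out.splitlines():
              label, sep, value = line.partition(":")
              label, value = label.strip(), value.strip()
              # First occurrence wins: the backfilling block further down
              # repeats labels such as "Mean cycle" and "Last queue length".
              if sep and value.isdigit() and label not in stats:
                  stats[label] = int(value)

          mean_cycle_us = stats.get("Mean cycle", 0)
          queue_length = stats.get("Last queue length", 0)
          if mean_cycle_us > 1_000_000:  # more than one second per cycle
              print(f"mean scheduling cycle is {mean_cycle_us} us with "
                    f"{queue_length} pending jobs; this scheduler looks busy")
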
       The next block of information is related to the backfilling scheduling
       algorithm.  A backfilling scheduling cycle acquires locks for the job,
       node and partition objects and then tries to allocate resources to
       pending jobs.  Jobs are processed based on priorities.  If a job
       cannot get resources, the algorithm calculates when it could get them,
       obtaining a future start time for the job.  The next job is then
       processed, and the algorithm tries to get resources for it without
       affecting the previously planned jobs, again calculating a future
       start time if the resources are not currently available.  The
       backfilling algorithm takes more time for each new job it processes,
       since higher priority jobs must not be affected.  The algorithm itself
       takes measures to avoid a long execution cycle and to avoid holding
       all the locks for too long.

       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since the last
              Slurm start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last
              time stats were reset.  By default these values are reset at
              midnight UTC.

       Total backfilled heterogeneous job components
              Number of heterogeneous job components started thanks to
              backfilling since the last Slurm start.

       Total cycles
              Number of backfill scheduling cycles since last reset.

       Last cycle when
              Time when the last backfill scheduling cycle happened, in the
              format "weekday Month MonthDay hour:minute.seconds year".

       Last cycle
              Time in microseconds of the last backfill scheduling cycle.  It
              counts only execution time, removing sleep time inside a
              scheduling cycle when it executes for an extended period of
              time.  Note that locks are released during the sleep time so
              that other work can proceed.

       Max cycle
              Maximum time in microseconds of any backfill scheduling cycle
              execution since last reset.  It counts only execution time,
              removing sleep time inside a scheduling cycle when it executes
              for an extended period of time.  Note that locks are released
              during the sleep time so that other work can proceed.

       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles
              since last reset.

       Last depth cycle
              Number of processed jobs during the last backfilling scheduling
              cycle.  It counts every job, even if that job cannot be started
              due to dependencies or limits.

       Last depth cycle (try sched)
              Number of processed jobs during the last backfilling scheduling
              cycle.  It counts only jobs with a chance to start using
              available resources.  These jobs consume more scheduling time
              than jobs which are found to be unable to start due to
              dependencies or limits.

       Depth Mean
              Mean count of jobs processed during all backfilling scheduling
              cycles since last reset.  Jobs which are found to be ineligible
              to run when examined by the backfill scheduler are not counted
              (e.g. jobs submitted to multiple partitions and already
              started, jobs which have reached a QOS or account limit such as
              maximum running jobs for an account, etc).

       Depth Mean (try sched)
              The subset of Depth Mean that the backfill scheduler attempted
              to schedule.

       Last queue length
              Number of jobs pending to be processed by the backfilling
              algorithm.  A job is counted once for each partition it is
              queued to use.  A pending job array will normally be counted as
              one job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).

       Queue length Mean
              Mean count of jobs pending to be processed by the backfilling
              algorithm.  A job is counted once for each partition it
              requested.  A pending job array will normally be counted as one
              job (tasks of a job array which have already been
              started/requeued or individually modified will already have
              individual job records and are each counted as a separate job).

       Last table size
              Count of different time slots tested by the backfill scheduler
              in its last iteration.

       Mean table size
              Mean count of different time slots tested by the backfill
              scheduler.  Larger counts increase the time required for the
              backfill operation.  The table size is influenced by many
              scheduling parameters, including: bf_min_age_reserve,
              bf_min_prio_reserve, bf_resolution, and bf_window.

       Latency for 1000 calls to gettimeofday()
              Latency of 1000 calls to the gettimeofday() syscall in
              microseconds, as measured at controller startup.

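       The latency figure above is measured once by slurmctld when it starts.
       As a stand-alone illustration (not a reproduction of the controller's
       internal measurement), the sketch below times 1000 timestamp calls
       from Python; time.time() is typically backed by clock_gettime() or
       gettimeofday() on Linux.

          #!/usr/bin/env python3
          # Stand-alone sketch: measure the wall-clock cost of 1000
          # timestamp calls, analogous to the "Latency for 1000 calls to
          # gettimeofday()" line reported by sdiag.
          import time

          start = time.perf_counter()
          for _ in range(1000):
              time.time()
          elapsed_us = (time.perf_counter() - start) * 1_000_000

          print(f"1000 timestamp calls took {elapsed_us:.0f} microseconds")
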
       The next blocks of information report the most frequently issued
       remote procedure calls (RPCs), the calls made to the slurmctld daemon
       asking it to perform some action.  The fourth block reports the RPCs
       issued by message type.  The RPC codes can be looked up in the Slurm
       source file src/common/slurm_protocol_defs.h.  The report includes the
       number of times each RPC is invoked, the total time consumed by all of
       those RPCs, and the average time consumed by each RPC in microseconds.
       The fifth block reports the RPCs issued by user ID, the total number
       of RPCs they have issued, the total time consumed by all of those
       RPCs, and the average time consumed by each RPC in microseconds.  RPC
       statistics are collected for the life of the slurmctld process unless
       explicitly reset with --reset.

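       As an illustration (not part of sdiag itself), the sketch below sorts
       the per-message-type RPC lines by total time, similar in spirit to the
       --sort-by-time option described later.  It assumes each RPC line
       carries "count:", "ave_time:" and "total_time:" tokens; that matches
       recent Slurm output, but the format is not guaranteed across versions.

          #!/usr/bin/env python3
          # Rough sketch: extract RPC statistics from sdiag output and
          # print the ten entries with the largest total time.
          import re
          import subprocess

          out = subprocess.run(["sdiag"], capture_output=True, text=True,
                               check=True).stdout

          rpc_re = re.compile(
              r"^\s*(?P<name>\S+).*count:(?P<count>\d+)\s+"
              r"ave_time:(?P<ave>\d+)\s+total_time:(?P<total>\d+)")

          rpcs = [m.groupdict()
                  for m in map(rpc_re.match, out.splitlines()) if m]
          rpcs.sort(key=lambda r: int(r["total"]), reverse=True)

          for r in rpcs[:10]:
              print(f'{r["name"]:40} total {r["total"]:>12} us '
                    f'({r["count"]} calls)')
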
       The sixth block of information, labeled Pending RPC Statistics, shows
       information about pending outgoing RPCs on the slurmctld agent queue.
       The first section of this block shows the types of RPCs on the queue
       and the count of each.  The second section shows up to the first 25
       individual RPCs pending on the agent queue, including the type and the
       destination host list.  This information is cached and only refreshed
       at 30 second intervals.


OPTIONS
       -a, --all
              Get and report information.  This is the default mode of
              operation.

       -M, --cluster=<string>
              The cluster to issue commands to.  Only one cluster name may be
              specified.  Note that the SlurmDBD must be up for this option
              to work properly.

       -h, --help
              Print description of options and exit.

       -r, --reset
              Reset scheduler and RPC counters to 0.  Only supported for
              Slurm operators and administrators.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and
              user ID.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.


PERFORMANCE
       Executing sdiag sends a remote procedure call to slurmctld.  If enough
       calls from sdiag or other Slurm client commands that send remote
       procedure calls to the slurmctld daemon come in at once, it can result
       in a degradation of performance of the slurmctld daemon, possibly
       resulting in a denial of service.

       Do not run sdiag or other Slurm client commands that send remote
       procedure calls to slurmctld from loops in shell scripts or other
       programs.  Ensure that programs limit calls to sdiag to the minimum
       necessary for the information you are trying to gather.
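
       As an illustration of this advice (not part of Slurm), the sketch
       below wraps sdiag in a small cache so that a monitoring script cannot
       invoke it more than once per interval; the 60-second interval is an
       arbitrary example value.

          #!/usr/bin/env python3
          # Illustrative sketch: invoke sdiag at most once per interval and
          # reuse the cached output in between, to avoid flooding slurmctld
          # with RPCs from a monitoring loop.
          import subprocess
          import time

          _cache = {"when": 0.0, "output": ""}

          def cached_sdiag(min_interval=60.0):
              """Return sdiag output, reusing the previous result if it is
              younger than min_interval seconds."""
              now = time.monotonic()
              if not _cache["output"] or now - _cache["when"] >= min_interval:
                  _cache["output"] = subprocess.run(
                      ["sdiag"], capture_output=True, text=True,
                      check=True).stdout
                  _cache["when"] = now
              return _cache["output"]

          if __name__ == "__main__":
              # The second call reuses the cached report.
              print(cached_sdiag().splitlines()[0])
              print(cached_sdiag().splitlines()[0])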


ENVIRONMENT VARIABLES
       Some sdiag options may be set via environment variables.  These
       environment variables, along with their corresponding options, are
       listed below.  (Note: Command line options will always override these
       settings.)

       SLURM_CLUSTERS      Same as --cluster

       SLURM_CONF          The location of the Slurm configuration file.

COPYING
       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2022 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it under
       the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.


SEE ALSO
       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)



May 2021                         Slurm Commands                        sdiag(1)