1srun(1) Slurm Commands srun(1)
2
3
4
6 srun - Run parallel jobs
7
8
10 srun [OPTIONS(0)...] [ : [OPTIONS(N)...]] executable(0) [args(0)...]
11
12 Option(s) define multiple jobs in a co-scheduled heterogeneous job.
13 For more details about heterogeneous jobs see the document
14 https://slurm.schedmd.com/heterogeneous_jobs.html
15
16
18 Run a parallel job on cluster managed by Slurm. If necessary, srun
19 will first create a resource allocation in which to run the parallel
20 job.
21
22 The following document describes the influence of various options on
23 the allocation of cpus to jobs and tasks.
24 https://slurm.schedmd.com/cpu_management.html
25
26
28 srun will return the highest exit code of all tasks run or the highest
29 signal (with the high-order bit set in an 8-bit integer -- e.g. 128 +
30 signal) of any task that exited with a signal.
31
32
34 The executable is resolved in the following order:
35
36 1. If executable starts with ".", then path is constructed as: current
37 working directory / executable
38
39 2. If executable starts with a "/", then path is considered absolute.
40
41 3. If executable can be resolved through PATH. See path_resolution(7).
42
43 4. If executable is in current working directory.
44
45 Current working directory is the calling process working directory
46 unless the --chdir argument is passed, which will override the current
47 working directory.
48
49
51 --accel-bind=<options>
52 Control how tasks are bound to generic resources of type gpu,
53 mic and nic. Multiple options may be specified. Supported
54 options include:
55
56 g Bind each task to GPUs which are closest to the allocated
57 CPUs.
58
59 m Bind each task to MICs which are closest to the allocated
60 CPUs.
61
62 n Bind each task to NICs which are closest to the allocated
63 CPUs.
64
65 v Verbose mode. Log how tasks are bound to GPU and NIC
66 devices.
67
68 This option applies to job allocations.
69
70
71 -A, --account=<account>
72 Charge resources used by this job to specified account. The
73 account is an arbitrary string. The account name may be changed
74 after job submission using the scontrol command. This option
75 applies to job allocations.
76
77
78 --acctg-freq
79 Define the job accounting and profiling sampling intervals.
80 This can be used to override the JobAcctGatherFrequency parame‐
81 ter in Slurm's configuration file, slurm.conf. The supported
82 format is follows:
83
84 --acctg-freq=<datatype>=<interval>
85 where <datatype>=<interval> specifies the task sam‐
86 pling interval for the jobacct_gather plugin or a
87 sampling interval for a profiling type by the
88 acct_gather_profile plugin. Multiple, comma-sepa‐
89 rated <datatype>=<interval> intervals may be speci‐
90 fied. Supported datatypes are as follows:
91
92 task=<interval>
93 where <interval> is the task sampling inter‐
94 val in seconds for the jobacct_gather plugins
95 and for task profiling by the
96 acct_gather_profile plugin. NOTE: This fre‐
97 quency is used to monitor memory usage. If
98 memory limits are enforced the highest fre‐
99 quency a user can request is what is config‐
100 ured in the slurm.conf file. They can not
101 turn it off (=0) either.
102
103 energy=<interval>
104 where <interval> is the sampling interval in
105 seconds for energy profiling using the
106 acct_gather_energy plugin
107
108 network=<interval>
109 where <interval> is the sampling interval in
110 seconds for infiniband profiling using the
111 acct_gather_infiniband plugin.
112
113 filesystem=<interval>
114 where <interval> is the sampling interval in
115 seconds for filesystem profiling using the
116 acct_gather_filesystem plugin.
117
118 The default value for the task sampling
119 interval
120 is 30. The default value for all other intervals is 0. An
121 interval of 0 disables sampling of the specified type. If the
122 task sampling interval is 0, accounting information is collected
123 only at job termination (reducing Slurm interference with the
124 job).
125 Smaller (non-zero) values have a greater impact upon job perfor‐
126 mance, but a value of 30 seconds is not likely to be noticeable
127 for applications having less than 10,000 tasks. This option
128 applies job allocations.
129
130
131 -B --extra-node-info=<sockets[:cores[:threads]]>
132 Restrict node selection to nodes with at least the specified
133 number of sockets, cores per socket and/or threads per core.
134 NOTE: These options do not specify the resource allocation size.
135 Each value specified is considered a minimum. An asterisk (*)
136 can be used as a placeholder indicating that all available
137 resources of that type are to be utilized. Values can also be
138 specified as min-max. The individual levels can also be speci‐
139 fied in separate options if desired:
140 --sockets-per-node=<sockets>
141 --cores-per-socket=<cores>
142 --threads-per-core=<threads>
143 If task/affinity plugin is enabled, then specifying an alloca‐
144 tion in this manner also sets a default --cpu-bind option of
145 threads if the -B option specifies a thread count, otherwise an
146 option of cores if a core count is specified, otherwise an
147 option of sockets. If SelectType is configured to
148 select/cons_res, it must have a parameter of CR_Core,
149 CR_Core_Memory, CR_Socket, or CR_Socket_Memory for this option
150 to be honored. If not specified, the scontrol show job will
151 display 'ReqS:C:T=*:*:*'. This option applies to job alloca‐
152 tions.
153
154
155 --bb=<spec>
156 Burst buffer specification. The form of the specification is
157 system dependent. Also see --bbf. This option applies to job
158 allocations.
159
160
161 --bbf=<file_name>
162 Path of file containing burst buffer specification. The form of
163 the specification is system dependent. Also see --bb. This
164 option applies to job allocations.
165
166
167 --bcast[=<dest_path>]
168 Copy executable file to allocated compute nodes. If a file name
169 is specified, copy the executable to the specified destination
170 file path. If no path is specified, copy the file to a file
171 named "slurm_bcast_<job_id>.<step_id>" in the current working.
172 For example, "srun --bcast=/tmp/mine -N3 a.out" will copy the
173 file "a.out" from your current directory to the file "/tmp/mine"
174 on each of the three allocated compute nodes and execute that
175 file. This option applies to step allocations.
176
177
178 -b, --begin=<time>
179 Defer initiation of this job until the specified time. It
180 accepts times of the form HH:MM:SS to run a job at a specific
181 time of day (seconds are optional). (If that time is already
182 past, the next day is assumed.) You may also specify midnight,
183 noon, fika (3 PM) or teatime (4 PM) and you can have a
184 time-of-day suffixed with AM or PM for running in the morning or
185 the evening. You can also say what day the job will be run, by
186 specifying a date of the form MMDDYY or MM/DD/YY YYYY-MM-DD.
187 Combine date and time using the following format
188 YYYY-MM-DD[THH:MM[:SS]]. You can also give times like now +
189 count time-units, where the time-units can be seconds (default),
190 minutes, hours, days, or weeks and you can tell Slurm to run the
191 job today with the keyword today and to run the job tomorrow
192 with the keyword tomorrow. The value may be changed after job
193 submission using the scontrol command. For example:
194 --begin=16:00
195 --begin=now+1hour
196 --begin=now+60 (seconds by default)
197 --begin=2010-01-20T12:34:00
198
199
200 Notes on date/time specifications:
201 - Although the 'seconds' field of the HH:MM:SS time specifica‐
202 tion is allowed by the code, note that the poll time of the
203 Slurm scheduler is not precise enough to guarantee dispatch of
204 the job on the exact second. The job will be eligible to start
205 on the next poll following the specified time. The exact poll
206 interval depends on the Slurm scheduler (e.g., 60 seconds with
207 the default sched/builtin).
208 - If no time (HH:MM:SS) is specified, the default is
209 (00:00:00).
210 - If a date is specified without a year (e.g., MM/DD) then the
211 current year is assumed, unless the combination of MM/DD and
212 HH:MM:SS has already passed for that year, in which case the
213 next year is used.
214 This option applies to job allocations.
215
216
217 --checkpoint=<time>
218 Specifies the interval between creating checkpoints of the job
219 step. By default, the job step will have no checkpoints cre‐
220 ated. Acceptable time formats include "minutes", "minutes:sec‐
221 onds", "hours:minutes:seconds", "days-hours", "days-hours:min‐
222 utes" and "days-hours:minutes:seconds". This option applies to
223 job and step allocations.
224
225
226 --cluster-constraint=<list>
227 Specifies features that a federated cluster must have to have a
228 sibling job submitted to it. Slurm will attempt to submit a sib‐
229 ling job to a cluster if it has at least one of the specified
230 features.
231
232
233 --comment=<string>
234 An arbitrary comment. This option applies to job allocations.
235
236
237 --compress[=type]
238 Compress file before sending it to compute hosts. The optional
239 argument specifies the data compression library to be used.
240 Supported values are "lz4" (default) and "zlib". Some compres‐
241 sion libraries may be unavailable on some systems. For use with
242 the --bcast option. This option applies to step allocations.
243
244
245 -C, --constraint=<list>
246 Nodes can have features assigned to them by the Slurm adminis‐
247 trator. Users can specify which of these features are required
248 by their job using the constraint option. Only nodes having
249 features matching the job constraints will be used to satisfy
250 the request. Multiple constraints may be specified with AND,
251 OR, matching OR, resource counts, etc. (some operators are not
252 supported on all system types). Supported constraint options
253 include:
254
255 Single Name
256 Only nodes which have the specified feature will be used.
257 For example, --constraint="intel"
258
259 Node Count
260 A request can specify the number of nodes needed with
261 some feature by appending an asterisk and count after the
262 feature name. For example "--nodes=16 --con‐
263 straint=graphics*4 ..." indicates that the job requires
264 16 nodes and that at least four of those nodes must have
265 the feature "graphics."
266
267 AND If only nodes with all of specified features will be
268 used. The ampersand is used for an AND operator. For
269 example, --constraint="intel&gpu"
270
271 OR If only nodes with at least one of specified features
272 will be used. The vertical bar is used for an OR opera‐
273 tor. For example, --constraint="intel|amd"
274
275 Matching OR
276 If only one of a set of possible options should be used
277 for all allocated nodes, then use the OR operator and
278 enclose the options within square brackets. For example:
279 "--constraint=[rack1|rack2|rack3|rack4]" might be used to
280 specify that all nodes must be allocated on a single rack
281 of the cluster, but any of those four racks can be used.
282
283 Multiple Counts
284 Specific counts of multiple resources may be specified by
285 using the AND operator and enclosing the options within
286 square brackets. For example: "--con‐
287 straint=[rack1*2&rack2*4]" might be used to specify that
288 two nodes must be allocated from nodes with the feature
289 of "rack1" and four nodes must be allocated from nodes
290 with the feature "rack2".
291
292 NOTE: This construct does not support multiple Intel KNL
293 NUMA or MCDRAM modes. For example, while "--con‐
294 straint=[(knl&quad)*2&(knl&hemi)*4]" is not supported,
295 "--constraint=[haswell*2&(knl&hemi)*4]" is supported.
296 Specification of multiple KNL modes requires the use of a
297 heterogeneous job.
298
299
300 Parenthesis
301 Parenthesis can be used to group like node features
302 together. For example "--con‐
303 straint=[(knl&snc4&flat)*4&haswell*1]" might be used to
304 specify that four nodes with the features "knl", "snc4"
305 and "flat" plus one node with the feature "haswell" are
306 required. All options within parenthesis should be
307 grouped with AND (e.g. "&") operands.
308
309 WARNING: When srun is executed from within salloc or sbatch, the con‐
310 straint value can only contain a single feature name. None of the other
311 operators are currently supported for job steps.
312 This option applies to job and step allocations.
313
314
315 --contiguous
316 If set, then the allocated nodes must form a contiguous set.
317 Not honored with the topology/tree or topology/3d_torus plugins,
318 both of which can modify the node ordering. This option applies
319 to job allocations.
320
321
322 --cores-per-socket=<cores>
323 Restrict node selection to nodes with at least the specified
324 number of cores per socket. See additional information under -B
325 option above when task/affinity plugin is enabled. This option
326 applies to job allocations.
327
328
329 --cpu-bind=[{quiet,verbose},]type
330 Bind tasks to CPUs. Used only when the task/affinity or
331 task/cgroup plugin is enabled. NOTE: To have Slurm always
332 report on the selected CPU binding for all commands executed in
333 a shell, you can enable verbose mode by setting the
334 SLURM_CPU_BIND environment variable value to "verbose".
335
336 The following informational environment variables are set when
337 --cpu-bind is in use:
338 SLURM_CPU_BIND_VERBOSE
339 SLURM_CPU_BIND_TYPE
340 SLURM_CPU_BIND_LIST
341
342 See the ENVIRONMENT VARIABLES section for a more detailed
343 description of the individual SLURM_CPU_BIND variables. These
344 variable are available only if the task/affinity plugin is con‐
345 figured.
346
347 When using --cpus-per-task to run multithreaded tasks, be aware
348 that CPU binding is inherited from the parent of the process.
349 This means that the multithreaded task should either specify or
350 clear the CPU binding itself to avoid having all threads of the
351 multithreaded task use the same mask/CPU as the parent. Alter‐
352 natively, fat masks (masks which specify more than one allowed
353 CPU) could be used for the tasks in order to provide multiple
354 CPUs for the multithreaded tasks.
355
356 By default, a job step has access to every CPU allocated to the
357 job. To ensure that distinct CPUs are allocated to each job
358 step, use the --exclusive option.
359
360 Note that a job step can be allocated different numbers of CPUs
361 on each node or be allocated CPUs not starting at location zero.
362 Therefore one of the options which automatically generate the
363 task binding is recommended. Explicitly specified masks or
364 bindings are only honored when the job step has been allocated
365 every available CPU on the node.
366
367 Binding a task to a NUMA locality domain means to bind the task
368 to the set of CPUs that belong to the NUMA locality domain or
369 "NUMA node". If NUMA locality domain options are used on sys‐
370 tems with no NUMA support, then each socket is considered a
371 locality domain.
372
373 If the --cpu-bind option is not used, the default binding mode
374 will depend upon Slurm's configuration and the step's resource
375 allocation. If all allocated nodes have the same configured
376 CpuBind mode, that will be used. Otherwise if the job's Parti‐
377 tion has a configured CpuBind mode, that will be used. Other‐
378 wise if Slurm has a configured TaskPluginParam value, that mode
379 will be used. Otherwise automatic binding will be performed as
380 described below.
381
382
383 Auto Binding
384 Applies only when task/affinity is enabled. If the job
385 step allocation includes an allocation with a number of
386 sockets, cores, or threads equal to the number of tasks
387 times cpus-per-task, then the tasks will by default be
388 bound to the appropriate resources (auto binding). Dis‐
389 able this mode of operation by explicitly setting
390 "--cpu-bind=none". Use TaskPluginParam=auto‐
391 bind=[threads|cores|sockets] to set a default cpu binding
392 in case "auto binding" doesn't find a match.
393
394 Supported options include:
395
396 q[uiet]
397 Quietly bind before task runs (default)
398
399 v[erbose]
400 Verbosely report binding before task runs
401
402 no[ne] Do not bind tasks to CPUs (default unless auto
403 binding is applied)
404
405 rank Automatically bind by task rank. The lowest num‐
406 bered task on each node is bound to socket (or
407 core or thread) zero, etc. Not supported unless
408 the entire node is allocated to the job.
409
410 map_cpu:<list>
411 Bind by setting CPU masks on tasks (or ranks) as
412 specified where <list> is
413 <cpu_id_for_task_0>,<cpu_id_for_task_1>,... CPU
414 IDs are interpreted as decimal values unless they
415 are preceded with '0x' in which case they inter‐
416 preted as hexadecimal values. If the number of
417 tasks (or ranks) exceeds the number of elements in
418 this list, elements in the list will be reused as
419 needed starting from the beginning of the list.
420 To simplify support for large task counts, the
421 lists may follow a map with an asterisk and repe‐
422 tition count For example "map_cpu:0x0f*4,0xf0*4".
423 Not supported unless the entire node is allocated
424 to the job.
425
426 mask_cpu:<list>
427 Bind by setting CPU masks on tasks (or ranks) as
428 specified where <list> is
429 <cpu_mask_for_task_0>,<cpu_mask_for_task_1>,...
430 The mapping is specified for a node and identical
431 mapping is applied to the tasks on every node
432 (i.e. the lowest task ID on each node is mapped to
433 the first mask specified in the list, etc.). CPU
434 masks are always interpreted as hexadecimal values
435 but can be preceded with an optional '0x'. Not
436 supported unless the entire node is allocated to
437 the job. To simplify support for large task
438 counts, the lists may follow a map with an aster‐
439 isk and repetition count For example
440 "mask_cpu:0x0f*4,0xf0*4". Not supported unless
441 the entire node is allocated to the job.
442
443 rank_ldom
444 Bind to a NUMA locality domain by rank. Not sup‐
445 ported unless the entire node is allocated to the
446 job.
447
448 map_ldom:<list>
449 Bind by mapping NUMA locality domain IDs to tasks
450 as specified where <list> is
451 <ldom1>,<ldom2>,...<ldomN>. The locality domain
452 IDs are interpreted as decimal values unless they
453 are preceded with '0x' in which case they are
454 interpreted as hexadecimal values. Not supported
455 unless the entire node is allocated to the job.
456
457 mask_ldom:<list>
458 Bind by setting NUMA locality domain masks on
459 tasks as specified where <list> is
460 <mask1>,<mask2>,...<maskN>. NUMA locality domain
461 masks are always interpreted as hexadecimal values
462 but can be preceded with an optional '0x'. Not
463 supported unless the entire node is allocated to
464 the job.
465
466 sockets
467 Automatically generate masks binding tasks to
468 sockets. Only the CPUs on the socket which have
469 been allocated to the job will be used. If the
470 number of tasks differs from the number of allo‐
471 cated sockets this can result in sub-optimal bind‐
472 ing.
473
474 cores Automatically generate masks binding tasks to
475 cores. If the number of tasks differs from the
476 number of allocated cores this can result in
477 sub-optimal binding.
478
479 threads
480 Automatically generate masks binding tasks to
481 threads. If the number of tasks differs from the
482 number of allocated threads this can result in
483 sub-optimal binding.
484
485 ldoms Automatically generate masks binding tasks to NUMA
486 locality domains. If the number of tasks differs
487 from the number of allocated locality domains this
488 can result in sub-optimal binding.
489
490 boards Automatically generate masks binding tasks to
491 boards. If the number of tasks differs from the
492 number of allocated boards this can result in
493 sub-optimal binding. This option is supported by
494 the task/cgroup plugin only.
495
496 help Show help message for cpu-bind
497
498 This option applies to job and step allocations.
499
500
501 --cpu-freq =<p1[-p2[:p3]]>
502
503 Request that the job step initiated by this srun command be run
504 at some requested frequency if possible, on the CPUs selected
505 for the step on the compute node(s).
506
507 p1 can be [#### | low | medium | high | highm1] which will set
508 the frequency scaling_speed to the corresponding value, and set
509 the frequency scaling_governor to UserSpace. See below for defi‐
510 nition of the values.
511
512 p1 can be [Conservative | OnDemand | Performance | PowerSave]
513 which will set the scaling_governor to the corresponding value.
514 The governor has to be in the list set by the slurm.conf option
515 CpuFreqGovernors.
516
517 When p2 is present, p1 will be the minimum scaling frequency and
518 p2 will be the maximum scaling frequency.
519
520 p2 can be [#### | medium | high | highm1] p2 must be greater
521 than p1.
522
523 p3 can be [Conservative | OnDemand | Performance | PowerSave |
524 UserSpace] which will set the governor to the corresponding
525 value.
526
527 If p3 is UserSpace, the frequency scaling_speed will be set by a
528 power or energy aware scheduling strategy to a value between p1
529 and p2 that lets the job run within the site's power goal. The
530 job may be delayed if p1 is higher than a frequency that allows
531 the job to run within the goal.
532
533 If the current frequency is < min, it will be set to min. Like‐
534 wise, if the current frequency is > max, it will be set to max.
535
536 Acceptable values at present include:
537
538 #### frequency in kilohertz
539
540 Low the lowest available frequency
541
542 High the highest available frequency
543
544 HighM1 (high minus one) will select the next highest
545 available frequency
546
547 Medium attempts to set a frequency in the middle of the
548 available range
549
550 Conservative attempts to use the Conservative CPU governor
551
552 OnDemand attempts to use the OnDemand CPU governor (the
553 default value)
554
555 Performance attempts to use the Performance CPU governor
556
557 PowerSave attempts to use the PowerSave CPU governor
558
559 UserSpace attempts to use the UserSpace CPU governor
560
561
562 The following informational environment variable is set
563 in the job
564 step when --cpu-freq option is requested.
565 SLURM_CPU_FREQ_REQ
566
567 This environment variable can also be used to supply the value
568 for the CPU frequency request if it is set when the 'srun' com‐
569 mand is issued. The --cpu-freq on the command line will over‐
570 ride the environment variable value. The form on the environ‐
571 ment variable is the same as the command line. See the ENVIRON‐
572 MENT VARIABLES section for a description of the
573 SLURM_CPU_FREQ_REQ variable.
574
575 NOTE: This parameter is treated as a request, not a requirement.
576 If the job step's node does not support setting the CPU fre‐
577 quency, or the requested value is outside the bounds of the
578 legal frequencies, an error is logged, but the job step is
579 allowed to continue.
580
581 NOTE: Setting the frequency for just the CPUs of the job step
582 implies that the tasks are confined to those CPUs. If task con‐
583 finement (i.e., TaskPlugin=task/affinity or TaskPlu‐
584 gin=task/cgroup with the "ConstrainCores" option) is not config‐
585 ured, this parameter is ignored.
586
587 NOTE: When the step completes, the frequency and governor of
588 each selected CPU is reset to the previous values.
589
590 NOTE: When submitting jobs with the --cpu-freq option with lin‐
591 uxproc as the ProctrackType can cause jobs to run too quickly
592 before Accounting is able to poll for job information. As a
593 result not all of accounting information will be present.
594
595 This option applies to job and step allocations.
596
597
598 --cpus-per-gpu=<ncpus>
599 Advise Slurm that ensuing job steps will require ncpus proces‐
600 sors per allocated GPU. Requires the --gpus option. Not com‐
601 patible with the --cpus-per-task option.
602
603
604 -c, --cpus-per-task=<ncpus>
605 Request that ncpus be allocated per process. This may be useful
606 if the job is multithreaded and requires more than one CPU per
607 task for optimal performance. The default is one CPU per
608 process. If -c is specified without -n, as many tasks will be
609 allocated per node as possible while satisfying the -c restric‐
610 tion. For instance on a cluster with 8 CPUs per node, a job
611 request for 4 nodes and 3 CPUs per task may be allocated 3 or 6
612 CPUs per node (1 or 2 tasks per node) depending upon resource
613 consumption by other jobs. Such a job may be unable to execute
614 more than a total of 4 tasks. This option may also be useful to
615 spawn tasks without allocating resources to the job step from
616 the job's allocation when running multiple job steps with the
617 --exclusive option.
618
619 WARNING: There are configurations and options interpreted dif‐
620 ferently by job and job step requests which can result in incon‐
621 sistencies for this option. For example srun -c2
622 --threads-per-core=1 prog may allocate two cores for the job,
623 but if each of those cores contains two threads, the job alloca‐
624 tion will include four CPUs. The job step allocation will then
625 launch two threads per CPU for a total of two tasks.
626
627 WARNING: When srun is executed from within salloc or sbatch,
628 there are configurations and options which can result in incon‐
629 sistent allocations when -c has a value greater than -c on sal‐
630 loc or sbatch.
631
632 This option applies to job allocations.
633
634
635 --deadline=<OPT>
636 remove the job if no ending is possible before this deadline
637 (start > (deadline - time[-min])). Default is no deadline.
638 Valid time formats are:
639 HH:MM[:SS] [AM|PM]
640 MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
641 MM/DD[/YY]-HH:MM[:SS]
642 YYYY-MM-DD[THH:MM[:SS]]]
643
644 This option applies only to job allocations.
645
646
647 --delay-boot=<minutes>
648 Do not reboot nodes in order to satisfied this job's feature
649 specification if the job has been eligible to run for less than
650 this time period. If the job has waited for less than the spec‐
651 ified period, it will use only nodes which already have the
652 specified features. The argument is in units of minutes. A
653 default value may be set by a system administrator using the
654 delay_boot option of the SchedulerParameters configuration
655 parameter in the slurm.conf file, otherwise the default value is
656 zero (no delay).
657
658 This option applies only to job allocations.
659
660
661 -d, --dependency=<dependency_list>
662 Defer the start of this job until the specified dependencies
663 have been satisfied completed. This option does not apply to job
664 steps (executions of srun within an existing salloc or sbatch
665 allocation) only to job allocations. <dependency_list> is of
666 the form <type:job_id[:job_id][,type:job_id[:job_id]]> or
667 <type:job_id[:job_id][?type:job_id[:job_id]]>. All dependencies
668 must be satisfied if the "," separator is used. Any dependency
669 may be satisfied if the "?" separator is used. Many jobs can
670 share the same dependency and these jobs may even belong to dif‐
671 ferent users. The value may be changed after job submission
672 using the scontrol command. Once a job dependency fails due to
673 the termination state of a preceding job, the dependent job will
674 never be run, even if the preceding job is requeued and has a
675 different termination state in a subsequent execution. This
676 option applies to job allocations.
677
678 after:job_id[:jobid...]
679 This job can begin execution after the specified jobs
680 have begun execution.
681
682 afterany:job_id[:jobid...]
683 This job can begin execution after the specified jobs
684 have terminated.
685
686 afterburstbuffer:job_id[:jobid...]
687 This job can begin execution after the specified jobs
688 have terminated and any associated burst buffer stage out
689 operations have completed.
690
691 aftercorr:job_id[:jobid...]
692 A task of this job array can begin execution after the
693 corresponding task ID in the specified job has completed
694 successfully (ran to completion with an exit code of
695 zero).
696
697 afternotok:job_id[:jobid...]
698 This job can begin execution after the specified jobs
699 have terminated in some failed state (non-zero exit code,
700 node failure, timed out, etc).
701
702 afterok:job_id[:jobid...]
703 This job can begin execution after the specified jobs
704 have successfully executed (ran to completion with an
705 exit code of zero).
706
707 expand:job_id
708 Resources allocated to this job should be used to expand
709 the specified job. The job to expand must share the same
710 QOS (Quality of Service) and partition. Gang scheduling
711 of resources in the partition is also not supported.
712
713 singleton
714 This job can begin execution after any previously
715 launched jobs sharing the same job name and user have
716 terminated. In other words, only one job by that name
717 and owned by that user can be running or suspended at any
718 point in time.
719
720
721 -D, --chdir=<path>
722 Have the remote processes do a chdir to path before beginning
723 execution. The default is to chdir to the current working direc‐
724 tory of the srun process. The path can be specified as full path
725 or relative path to the directory where the command is executed.
726 This option applies to job allocations.
727
728
729 -e, --error=<filename pattern>
730 Specify how stderr is to be redirected. By default in interac‐
731 tive mode, srun redirects stderr to the same file as stdout, if
732 one is specified. The --error option is provided to allow stdout
733 and stderr to be redirected to different locations. See IO Re‐
734 direction below for more options. If the specified file already
735 exists, it will be overwritten. This option applies to job and
736 step allocations.
737
738
739 -E, --preserve-env
740 Pass the current values of environment variables SLURM_JOB_NODES
741 and SLURM_NTASKS through to the executable, rather than comput‐
742 ing them from commandline parameters. This option applies to job
743 allocations.
744
745
746 --epilog=<executable>
747 srun will run executable just after the job step completes. The
748 command line arguments for executable will be the command and
749 arguments of the job step. If executable is "none", then no
750 srun epilog will be run. This parameter overrides the SrunEpilog
751 parameter in slurm.conf. This parameter is completely indepen‐
752 dent from the Epilog parameter in slurm.conf. This option
753 applies to job allocations.
754
755
756
757 --exclusive[=user|mcs]
758 This option applies to job and job step allocations, and has two
759 slightly different meanings for each one. When used to initiate
760 a job, the job allocation cannot share nodes with other running
761 jobs (or just other users with the "=user" option or "=mcs"
762 option). The default shared/exclusive behavior depends on sys‐
763 tem configuration and the partition's OverSubscribe option takes
764 precedence over the job's option.
765
766 This option can also be used when initiating more than one job
767 step within an existing resource allocation, where you want sep‐
768 arate processors to be dedicated to each job step. If sufficient
769 processors are not available to initiate the job step, it will
770 be deferred. This can be thought of as providing a mechanism for
771 resource management to the job within it's allocation.
772
773 The exclusive allocation of CPUs only applies to job steps
774 explicitly invoked with the --exclusive option. For example, a
775 job might be allocated one node with four CPUs and a remote
776 shell invoked on the allocated node. If that shell is not
777 invoked with the --exclusive option, then it may create a job
778 step with four tasks using the --exclusive option and not con‐
779 flict with the remote shell's resource allocation. Use the
780 --exclusive option to invoke every job step to ensure distinct
781 resources for each step.
782
783 Note that all CPUs allocated to a job are available to each job
784 step unless the --exclusive option is used plus task affinity is
785 configured. Since resource management is provided by processor,
786 the --ntasks option must be specified, but the following options
787 should NOT be specified --relative, --distribution=arbitrary.
788 See EXAMPLE below.
789
790
791 --export=<environment variables [ALL] | NONE>
792 Identify which environment variables are propagated to the
793 launched application. By default, all are propagated. Multiple
794 environment variable names should be comma separated. Environ‐
795 ment variable names may be specified to propagate the current
796 value (e.g. "--export=EDITOR") or specific values may be
797 exported (e.g. "--export=EDITOR=/bin/emacs"). In these two exam‐
798 ples, the propagated environment will only contain the variable
799 EDITOR. If one desires to add to the environment instead of
800 replacing it, have the argument include ALL (e.g.
801 "--export=ALL,EDITOR=/bin/emacs"). This will propagate EDITOR
802 along with the current environment. Unlike sbatch, if ALL is
803 specified, any additional specified environment variables are
804 ignored. If one desires no environment variables be propagated,
805 use the argument NONE. Regardless of this setting, the appro‐
806 priate SLURM_* task environment variables are always exported to
807 the environment. srun may deviate from the above behavior if
808 the default launch plugin, launch/slurm, is not used.
809
810
811 -F, --nodefile=<node file>
812 Much like --nodelist, but the list is contained in a file of
813 name node file. The node names of the list may also span multi‐
814 ple lines in the file. Duplicate node names in the file will
815 be ignored. The order of the node names in the list is not
816 important; the node names will be sorted by Slurm.
817
818
819 --gid=<group>
820 If srun is run as root, and the --gid option is used, submit the
821 job with group's group access permissions. group may be the
822 group name or the numerical group ID. This option applies to job
823 allocations.
824
825
826 -G, --gpus=[<type>:]<number>
827 Specify the total number of GPUs required for the job. An
828 optional GPU type specification can be supplied. For example
829 "--gpus=volta:3". Multiple options can be requested in a comma
830 separated list, for example: "--gpus=volta:3,kepler:1". See
831 also the --gpus-per-node, --gpus-per-socket and --gpus-per-task
832 options.
833
834
835 --gpu-bind=<type>
836 Bind tasks to specific GPUs. By default every spawned task can
837 access every GPU allocated to the job.
838
839 Supported type options:
840
841 closest Bind each task to the GPU(s) which are closest. In a
842 NUMA environment, each task may be bound to more than
843 one GPU (i.e. all GPUs in that NUMA environment).
844
845 map_gpu:<list>
846 Bind by setting GPU masks on tasks (or ranks) as spec‐
847 ified where <list> is
848 <gpu_id_for_task_0>,<gpu_id_for_task_1>,... GPU IDs
849 are interpreted as decimal values unless they are pre‐
850 ceded with '0x' in which case they interpreted as
851 hexadecimal values. If the number of tasks (or ranks)
852 exceeds the number of elements in this list, elements
853 in the list will be reused as needed starting from the
854 beginning of the list. To simplify support for large
855 task counts, the lists may follow a map with an aster‐
856 isk and repetition count. For example
857 "map_cpu:0*4,1*4". Not supported unless the entire
858 node is allocated to the job.
859
860 mask_gpu:<list>
861 Bind by setting GPU masks on tasks (or ranks) as spec‐
862 ified where <list> is
863 <gpu_mask_for_task_0>,<gpu_mask_for_task_1>,... The
864 mapping is specified for a node and identical mapping
865 is applied to the tasks on every node (i.e. the lowest
866 task ID on each node is mapped to the first mask spec‐
867 ified in the list, etc.). GPU masks are always inter‐
868 preted as hexadecimal values but can be preceded with
869 an optional '0x'. Not supported unless the entire node
870 is allocated to the job. To simplify support for large
871 task counts, the lists may follow a map with an aster‐
872 isk and repetition count. For example
873 "mask_gpu:0x0f*4,0xf0*4". Not supported unless the
874 entire node is allocated to the job.
875
876
877 --gpu-freq=[<type]=value>[,<type=value>][,verbose]
878 Request that GPUs allocated to the job are configured with spe‐
879 cific frequency values. This option can be used to indepen‐
880 dently configure the GPU and its memory frequencies. After the
881 job is completed, the frequencies of all affected GPUs will be
882 reset to the highest possible values. In some cases, system
883 power caps may override the requested values. The field type
884 can be "memory". If type is not specified, the GPU frequency is
885 implied. The value field can either be "low", "medium", "high",
886 "highm1" or a numeric value in megahertz (MHz). If the speci‐
887 fied numeric value is not possible, a value as close as possible
888 will be used. See below for definition of the values. The ver‐
889 bose option causes current GPU frequency information to be
890 logged. Examples of use include "--gpu-freq=medium,memory=high"
891 and "--gpu-freq=450".
892
893 Supported value definitions:
894
895 low the lowest available frequency.
896
897 medium attempts to set a frequency in the middle of the
898 available range.
899
900 high the highest available frequency.
901
902 highm1 (high minus one) will select the next highest avail‐
903 able frequency.
904
905
906 --gpus-per-node=[<type>:]<number>
907 Specify the number of GPUs required for the job on each node
908 included in the job's resource allocation. An optional GPU type
909 specification can be supplied. For example
910 "--gpus-per-node=volta:3". Multiple options can be requested in
911 a comma separated list, for example:
912 "--gpus-per-node=volta:3,kepler:1". See also the --gpus,
913 --gpus-per-socket and --gpus-per-task options.
914
915
916 --gpus-per-socket=[<type>:]<number>
917 Specify the number of GPUs required for the job on each socket
918 included in the job's resource allocation. An optional GPU type
919 specification can be supplied. For example
920 "--gpus-per-socket=volta:3". Multiple options can be requested
921 in a comma separated list, for example:
922 "--gpus-per-socket=volta:3,kepler:1". Requires job to specify a
923 sockets per node count ( --sockets-per-node). See also the
924 --gpus, --gpus-per-node and --gpus-per-task options. This
925 option applies to job allocations.
926
927
928 --gpus-per-task=[<type>:]<number>
929 Specify the number of GPUs required for the job on each task to
930 be spawned in the job's resource allocation. An optional GPU
931 type specification can be supplied. This option requires the
932 specification of a task count. For example
933 "--gpus-per-task=volta:1". Multiple options can be requested in
934 a comma separated list, for example:
935 "--gpus-per-task=volta:3,kepler:1". Requires job to specify a
936 task count (--nodes). See also the --gpus, --gpus-per-socket
937 and --gpus-per-node options.
938
939
940 --gres=<list>
941 Specifies a comma delimited list of generic consumable
942 resources. The format of each entry on the list is
943 "name[[:type]:count]". The name is that of the consumable
944 resource. The count is the number of those resources with a
945 default value of 1. The count can have a suffix of "k" or "K"
946 (multiple of 1024), "m" or "M" (multiple of 1024 x 1024), "g" or
947 "G" (multiple of 1024 x 1024 x 1024), "t" or "T" (multiple of
948 1024 x 1024 x 1024 x 1024), "p" or "P" (multiple of 1024 x 1024
949 x 1024 x 1024 x 1024). The specified resources will be allo‐
950 cated to the job on each node. The available generic consumable
951 resources is configurable by the system administrator. A list
952 of available generic consumable resources will be printed and
953 the command will exit if the option argument is "help". Exam‐
954 ples of use include "--gres=gpu:2,mic:1", "--gres=gpu:kepler:2",
955 and "--gres=help". NOTE: This option applies to job and step
956 allocations. By default, a job step is allocated all of the
957 generic resources that have allocated to the job. To change the
958 behavior so that each job step is allocated no generic
959 resources, explicitly set the value of --gres to specify zero
960 counts for each generic resource OR set "--gres=none" OR set the
961 SLURM_STEP_GRES environment variable to "none".
962
963
964 --gres-flags=<type>
965 Specify generic resource task binding options. This option
966 applies to job allocations.
967
968 disable-binding
969 Disable filtering of CPUs with respect to generic
970 resource locality. This option is currently required to
971 use more CPUs than are bound to a GRES (i.e. if a GPU is
972 bound to the CPUs on one socket, but resources on more
973 than one socket are required to run the job). This
974 option may permit a job to be allocated resources sooner
975 than otherwise possible, but may result in lower job per‐
976 formance.
977
978 enforce-binding
979 The only CPUs available to the job will be those bound to
980 the selected GRES (i.e. the CPUs identified in the
981 gres.conf file will be strictly enforced). This option
982 may result in delayed initiation of a job. For example a
983 job requiring two GPUs and one CPU will be delayed until
984 both GPUs on a single socket are available rather than
985 using GPUs bound to separate sockets, however the appli‐
986 cation performance may be improved due to improved commu‐
987 nication speed. Requires the node to be configured with
988 more than one socket and resource filtering will be per‐
989 formed on a per-socket basis.
990
991
992 -H, --hold
993 Specify the job is to be submitted in a held state (priority of
994 zero). A held job can now be released using scontrol to reset
995 its priority (e.g. "scontrol release <job_id>"). This option
996 applies to job allocations.
997
998
999 -h, --help
1000 Display help information and exit.
1001
1002
1003 --hint=<type>
1004 Bind tasks according to application hints.
1005
1006 compute_bound
1007 Select settings for compute bound applications: use all
1008 cores in each socket, one thread per core.
1009
1010 memory_bound
1011 Select settings for memory bound applications: use only
1012 one core in each socket, one thread per core.
1013
1014 [no]multithread
1015 [don't] use extra threads with in-core multi-threading
1016 which can benefit communication intensive applications.
1017 Only supported with the task/affinity plugin.
1018
1019 help show this help message
1020
1021 This option applies to job allocations.
1022
1023
1024 -I, --immediate[=<seconds>]
1025 exit if resources are not available within the time period spec‐
1026 ified. If no argument is given, resources must be available
1027 immediately for the request to succeed. By default, --immediate
1028 is off, and the command will block until resources become avail‐
1029 able. Since this option's argument is optional, for proper pars‐
1030 ing the single letter option must be followed immediately with
1031 the value and not include a space between them. For example
1032 "-I60" and not "-I 60". This option applies to job and step
1033 allocations.
1034
1035
1036 -i, --input=<mode>
1037 Specify how stdin is to redirected. By default, srun redirects
1038 stdin from the terminal all tasks. See IO Redirection below for
1039 more options. For OS X, the poll() function does not support
1040 stdin, so input from a terminal is not possible. This option
1041 applies to job and step allocations.
1042
1043
1044 -J, --job-name=<jobname>
1045 Specify a name for the job. The specified name will appear along
1046 with the job id number when querying running jobs on the system.
1047 The default is the supplied executable program's name. NOTE:
1048 This information may be written to the slurm_jobacct.log file.
1049 This file is space delimited so if a space is used in the job‐
1050 name name it will cause problems in properly displaying the con‐
1051 tents of the slurm_jobacct.log file when the sacct command is
1052 used. This option applies to job and step allocations.
1053
1054
1055 --jobid=<jobid>
1056 Initiate a job step under an already allocated job with job id
1057 id. Using this option will cause srun to behave exactly as if
1058 the SLURM_JOB_ID environment variable was set. This option
1059 applies to step allocations.
1060
1061
1062 -K, --kill-on-bad-exit[=0|1]
1063 Controls whether or not to terminate a step if any task exits
1064 with a non-zero exit code. If this option is not specified, the
1065 default action will be based upon the Slurm configuration param‐
1066 eter of KillOnBadExit. If this option is specified, it will take
1067 precedence over KillOnBadExit. An option argument of zero will
1068 not terminate the job. A non-zero argument or no argument will
1069 terminate the job. Note: This option takes precedence over the
1070 -W, --wait option to terminate the job immediately if a task
1071 exits with a non-zero exit code. Since this option's argument
1072 is optional, for proper parsing the single letter option must be
1073 followed immediately with the value and not include a space
1074 between them. For example "-K1" and not "-K 1".
1075
1076
1077 -k, --no-kill [=off]
1078 Do not automatically terminate a job if one of the nodes it has
1079 been allocated fails. This option applies to job and step allo‐
1080 cations. The job will assume all responsibilities for
1081 fault-tolerance. Tasks launch using this option will not be
1082 considered terminated (e.g. -K, --kill-on-bad-exit and -W,
1083 --wait options will have no effect upon the job step). The
1084 active job step (MPI job) will likely suffer a fatal error, but
1085 subsequent job steps may be run if this option is specified.
1086
1087 Specify an optional argument of "off" disable the effect of the
1088 SLURM_NO_KILL environment variable.
1089
1090 The default action is to terminate the job upon node failure.
1091
1092
1093 -l, --label
1094 Prepend task number to lines of stdout/err. The --label option
1095 will prepend lines of output with the remote task id. This
1096 option applies to step allocations.
1097
1098
1099 -L, --licenses=<license>
1100 Specification of licenses (or other resources available on all
1101 nodes of the cluster) which must be allocated to this job.
1102 License names can be followed by a colon and count (the default
1103 count is one). Multiple license names should be comma separated
1104 (e.g. "--licenses=foo:4,bar"). This option applies to job allo‐
1105 cations.
1106
1107
1108 -M, --clusters=<string>
1109 Clusters to issue commands to. Multiple cluster names may be
1110 comma separated. The job will be submitted to the one cluster
1111 providing the earliest expected job initiation time. The default
1112 value is the current cluster. A value of 'all' will query to run
1113 on all clusters. Note the --export option to control environ‐
1114 ment variables exported between clusters. This option applies
1115 only to job allocations. Note that the SlurmDBD must be up for
1116 this option to work properly.
1117
1118
1119 -m, --distribution=
1120 *|block|cyclic|arbitrary|plane=<options>
1121 [:*|block|cyclic|fcyclic[:*|block|
1122 cyclic|fcyclic]][,Pack|NoPack]
1123
1124 Specify alternate distribution methods for remote processes.
1125 This option controls the distribution of tasks to the nodes on
1126 which resources have been allocated, and the distribution of
1127 those resources to tasks for binding (task affinity). The first
1128 distribution method (before the first ":") controls the distri‐
1129 bution of tasks to nodes. The second distribution method (after
1130 the first ":") controls the distribution of allocated CPUs
1131 across sockets for binding to tasks. The third distribution
1132 method (after the second ":") controls the distribution of allo‐
1133 cated CPUs across cores for binding to tasks. The second and
1134 third distributions apply only if task affinity is enabled. The
1135 third distribution is supported only if the task/cgroup plugin
1136 is configured. The default value for each distribution type is
1137 specified by *.
1138
1139 Note that with select/cons_res, the number of CPUs allocated on
1140 each socket and node may be different. Refer to
1141 https://slurm.schedmd.com/mc_support.html for more information
1142 on resource allocation, distribution of tasks to nodes, and
1143 binding of tasks to CPUs.
1144 First distribution method (distribution of tasks across nodes):
1145
1146
1147 * Use the default method for distributing tasks to nodes
1148 (block).
1149
1150 block The block distribution method will distribute tasks to a
1151 node such that consecutive tasks share a node. For exam‐
1152 ple, consider an allocation of three nodes each with two
1153 cpus. A four-task block distribution request will dis‐
1154 tribute those tasks to the nodes with tasks one and two
1155 on the first node, task three on the second node, and
1156 task four on the third node. Block distribution is the
1157 default behavior if the number of tasks exceeds the num‐
1158 ber of allocated nodes.
1159
1160 cyclic The cyclic distribution method will distribute tasks to a
1161 node such that consecutive tasks are distributed over
1162 consecutive nodes (in a round-robin fashion). For exam‐
1163 ple, consider an allocation of three nodes each with two
1164 cpus. A four-task cyclic distribution request will dis‐
1165 tribute those tasks to the nodes with tasks one and four
1166 on the first node, task two on the second node, and task
1167 three on the third node. Note that when SelectType is
1168 select/cons_res, the same number of CPUs may not be allo‐
1169 cated on each node. Task distribution will be round-robin
1170 among all the nodes with CPUs yet to be assigned to
1171 tasks. Cyclic distribution is the default behavior if
1172 the number of tasks is no larger than the number of allo‐
1173 cated nodes.
1174
1175 plane The tasks are distributed in blocks of a specified size.
1176 The options include a number representing the size of the
1177 task block. This is followed by an optional specifica‐
1178 tion of the task distribution scheme within a block of
1179 tasks and between the blocks of tasks. The number of
1180 tasks distributed to each node is the same as for cyclic
1181 distribution, but the taskids assigned to each node
1182 depend on the plane size. For more details (including
1183 examples and diagrams), please see
1184 https://slurm.schedmd.com/mc_support.html
1185 and
1186 https://slurm.schedmd.com/dist_plane.html
1187
1188 arbitrary
1189 The arbitrary method of distribution will allocate pro‐
1190 cesses in-order as listed in file designated by the envi‐
1191 ronment variable SLURM_HOSTFILE. If this variable is
1192 listed it will over ride any other method specified. If
1193 not set the method will default to block. Inside the
1194 hostfile must contain at minimum the number of hosts
1195 requested and be one per line or comma separated. If
1196 specifying a task count (-n, --ntasks=<number>), your
1197 tasks will be laid out on the nodes in the order of the
1198 file.
1199 NOTE: The arbitrary distribution option on a job alloca‐
1200 tion only controls the nodes to be allocated to the job
1201 and not the allocation of CPUs on those nodes. This
1202 option is meant primarily to control a job step's task
1203 layout in an existing job allocation for the srun com‐
1204 mand.
1205 NOTE: If number of tasks is given and a list of requested
1206 nodes is also given the number of nodes used from that
1207 list will be reduced to match that of the number of tasks
1208 if the number of nodes in the list is greater than the
1209 number of tasks.
1210
1211
1212 Second distribution method (distribution of CPUs across sockets
1213 for binding):
1214
1215
1216 * Use the default method for distributing CPUs across sock‐
1217 ets (cyclic).
1218
1219 block The block distribution method will distribute allocated
1220 CPUs consecutively from the same socket for binding to
1221 tasks, before using the next consecutive socket.
1222
1223 cyclic The cyclic distribution method will distribute allocated
1224 CPUs for binding to a given task consecutively from the
1225 same socket, and from the next consecutive socket for the
1226 next task, in a round-robin fashion across sockets.
1227
1228 fcyclic
1229 The fcyclic distribution method will distribute allocated
1230 CPUs for binding to tasks from consecutive sockets in a
1231 round-robin fashion across the sockets.
1232
1233
1234 Third distribution method (distribution of CPUs across cores for
1235 binding):
1236
1237
1238 * Use the default method for distributing CPUs across cores
1239 (inherited from second distribution method).
1240
1241 block The block distribution method will distribute allocated
1242 CPUs consecutively from the same core for binding to
1243 tasks, before using the next consecutive core.
1244
1245 cyclic The cyclic distribution method will distribute allocated
1246 CPUs for binding to a given task consecutively from the
1247 same core, and from the next consecutive core for the
1248 next task, in a round-robin fashion across cores.
1249
1250 fcyclic
1251 The fcyclic distribution method will distribute allocated
1252 CPUs for binding to tasks from consecutive cores in a
1253 round-robin fashion across the cores.
1254
1255
1256
1257 Optional control for task distribution over nodes:
1258
1259
1260 Pack Rather than evenly distributing a job step's tasks evenly
1261 across it's allocated nodes, pack them as tightly as pos‐
1262 sible on the nodes.
1263
1264 NoPack Rather than packing a job step's tasks as tightly as pos‐
1265 sible on the nodes, distribute them evenly. This user
1266 option will supersede the SelectTypeParameters
1267 CR_Pack_Nodes configuration parameter.
1268
1269 This option applies to job and step allocations.
1270
1271
1272 --mail-type=<type>
1273 Notify user by email when certain event types occur. Valid type
1274 values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (equivalent to
1275 BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buf‐
1276 fer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90
1277 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80
1278 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of
1279 time limit). Multiple type values may be specified in a comma
1280 separated list. The user to be notified is indicated with
1281 --mail-user. This option applies to job allocations.
1282
1283
1284 --mail-user=<user>
1285 User to receive email notification of state changes as defined
1286 by --mail-type. The default value is the submitting user. This
1287 option applies to job allocations.
1288
1289
1290 --mcs-label=<mcs>
1291 Used only when the mcs/group plugin is enabled. This parameter
1292 is a group among the groups of the user. Default value is cal‐
1293 culated by the Plugin mcs if it's enabled. This option applies
1294 to job allocations.
1295
1296
1297 --mem=<size[units]>
1298 Specify the real memory required per node. Default units are
1299 megabytes unless the SchedulerParameters configuration parameter
1300 includes the "default_gbytes" option for gigabytes. Different
1301 units can be specified using the suffix [K|M|G|T]. Default
1302 value is DefMemPerNode and the maximum value is MaxMemPerNode.
1303 If configured, both of parameters can be seen using the scontrol
1304 show config command. This parameter would generally be used if
1305 whole nodes are allocated to jobs (SelectType=select/linear).
1306 Specifying a memory limit of zero for a job step will restrict
1307 the job step to the amount of memory allocated to the job, but
1308 not remove any of the job's memory allocation from being avail‐
1309 able to other job steps. Also see --mem-per-cpu and
1310 --mem-per-gpu. The --mem, --mem-per-cpu and --mem-per-gpu
1311 options are mutually exclusive. If --mem, --mem-per-cpu or
1312 --mem-per-gpu are specified as command line arguments, then they
1313 will take precedence over the environment (potentially inherited
1314 from salloc or sbatch).
1315
1316 NOTE: A memory size specification of zero is treated as a spe‐
1317 cial case and grants the job access to all of the memory on each
1318 node for newly submitted jobs and all available job memory to a
1319 new job steps.
1320
1321 Specifying new memory limits for job steps are only advisory.
1322
1323 If the job is allocated multiple nodes in a heterogeneous clus‐
1324 ter, the memory limit on each node will be that of the node in
1325 the allocation with the smallest memory size (same limit will
1326 apply to every node in the job's allocation).
1327
1328 NOTE: Enforcement of memory limits currently relies upon the
1329 task/cgroup plugin or enabling of accounting, which samples mem‐
1330 ory use on a periodic basis (data need not be stored, just col‐
1331 lected). In both cases memory use is based upon the job's Resi‐
1332 dent Set Size (RSS). A task may exceed the memory limit until
1333 the next periodic accounting sample.
1334
1335 This option applies to job and step allocations.
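
              As a hedged illustration (node count, size and executable are
              placeholders), a job could request 16 gigabytes per node, and a
              later step inside that allocation could reuse the job's whole
              memory allocation by giving a zero limit:

                   $ srun -N 2 --mem=16G ./a.out
                   $ srun --mem=0 ./step_tool    # run from within the existing allocation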
1336
1337
1338 --mem-per-cpu=<size[units]>
1339 Minimum memory required per allocated CPU. Default units are
1340 megabytes unless the SchedulerParameters configuration parameter
1341 includes the "default_gbytes" option for gigabytes. Different
1342 units can be specified using the suffix [K|M|G|T]. Default
1343 value is DefMemPerCPU and the maximum value is MaxMemPerCPU (see
1344              exception below). If configured, both parameters can be seen
1345 using the scontrol show config command. Note that if the job's
1346 --mem-per-cpu value exceeds the configured MaxMemPerCPU, then
1347 the user's limit will be treated as a memory limit per task;
1348 --mem-per-cpu will be reduced to a value no larger than MaxMem‐
1349 PerCPU; --cpus-per-task will be set and the value of
1350 --cpus-per-task multiplied by the new --mem-per-cpu value will
1351 equal the original --mem-per-cpu value specified by the user.
1352 This parameter would generally be used if individual processors
1353 are allocated to jobs (SelectType=select/cons_res). If
1354              resources are allocated by the core, socket or whole nodes, the
1355 number of CPUs allocated to a job may be higher than the task
1356 count and the value of --mem-per-cpu should be adjusted accord‐
1357 ingly. Specifying a memory limit of zero for a job step will
1358 restrict the job step to the amount of memory allocated to the
1359 job, but not remove any of the job's memory allocation from
1360 being available to other job steps. Also see --mem and
1361 --mem-per-gpu. The --mem, --mem-per-cpu and --mem-per-gpu
1362 options are mutually exclusive.
1363
1364              NOTE: If the final amount of memory requested by a job (e.g. when
1365              --mem-per-cpu is used with the --exclusive option) cannot be
1366              satisfied by any of the nodes configured in the partition, the
1367              job will be rejected.
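
              As a sketch of the per-CPU accounting (all values are placehold‐
              ers), the request below asks for 8 tasks with 2 CPUs each and 2
              gigabytes per CPU, i.e. roughly 4 gigabytes per task and 32
              gigabytes across the job:

                   $ srun -n 8 --cpus-per-task=2 --mem-per-cpu=2G ./a.out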
1368
1369
1370 --mem-per-gpu=<size[units]>
1371 Minimum memory required per allocated GPU. Default units are
1372 megabytes unless the SchedulerParameters configuration parameter
1373 includes the "default_gbytes" option for gigabytes. Different
1374 units can be specified using the suffix [K|M|G|T]. Default
1375 value is DefMemPerGPU and is available on both a global and per
1376 partition basis. If configured, the parameters can be seen
1377 using the scontrol show config and scontrol show partition com‐
1378 mands. Also see --mem. The --mem, --mem-per-cpu and
1379 --mem-per-gpu options are mutually exclusive.
1380
1381
1382 --mem-bind=[{quiet,verbose},]type
1383 Bind tasks to memory. Used only when the task/affinity plugin is
1384 enabled and the NUMA memory functions are available. Note that
1385 the resolution of CPU and memory binding may differ on some
1386 architectures. For example, CPU binding may be performed at the
1387 level of the cores within a processor while memory binding will
1388 be performed at the level of nodes, where the definition of
1389 "nodes" may differ from system to system. By default no memory
1390 binding is performed; any task using any CPU can use any memory.
1391 This option is typically used to ensure that each task is bound
1392              to the memory closest to its assigned CPU. The use of any type
1393 other than "none" or "local" is not recommended. If you want
1394 greater control, try running a simple test code with the options
1395 "--cpu-bind=verbose,none --mem-bind=verbose,none" to determine
1396 the specific configuration.
1397
1398 NOTE: To have Slurm always report on the selected memory binding
1399 for all commands executed in a shell, you can enable verbose
1400 mode by setting the SLURM_MEM_BIND environment variable value to
1401 "verbose".
1402
1403 The following informational environment variables are set when
1404 --mem-bind is in use:
1405
1406 SLURM_MEM_BIND_LIST
1407 SLURM_MEM_BIND_PREFER
1408 SLURM_MEM_BIND_SORT
1409 SLURM_MEM_BIND_TYPE
1410 SLURM_MEM_BIND_VERBOSE
1411
1412 See the ENVIRONMENT VARIABLES section for a more detailed
1413 description of the individual SLURM_MEM_BIND* variables.
1414
1415 Supported options include:
1416
1417 help show this help message
1418
1419 local Use memory local to the processor in use
1420
1421 map_mem:<list>
1422 Bind by setting memory masks on tasks (or ranks) as spec‐
1423 ified where <list> is
1424 <numa_id_for_task_0>,<numa_id_for_task_1>,... The map‐
1425 ping is specified for a node and identical mapping is
1426 applied to the tasks on every node (i.e. the lowest task
1427 ID on each node is mapped to the first ID specified in
1428 the list, etc.). NUMA IDs are interpreted as decimal
1429 values unless they are preceded with '0x' in which case
1430                      they are interpreted as hexadecimal values. If the number of
1431 tasks (or ranks) exceeds the number of elements in this
1432 list, elements in the list will be reused as needed
1433 starting from the beginning of the list. To simplify
1434 support for large task counts, the lists may follow a map
1435                      with an asterisk and repetition count. For example
1436 "map_mem:0x0f*4,0xf0*4". Not supported unless the entire
1437 node is allocated to the job.
1438
1439 mask_mem:<list>
1440 Bind by setting memory masks on tasks (or ranks) as spec‐
1441 ified where <list> is
1442 <numa_mask_for_task_0>,<numa_mask_for_task_1>,... The
1443 mapping is specified for a node and identical mapping is
1444 applied to the tasks on every node (i.e. the lowest task
1445 ID on each node is mapped to the first mask specified in
1446 the list, etc.). NUMA masks are always interpreted as
1447 hexadecimal values. Note that masks must be preceded
1448 with a '0x' if they don't begin with [0-9] so they are
1449 seen as numerical values. If the number of tasks (or
1450 ranks) exceeds the number of elements in this list, ele‐
1451 ments in the list will be reused as needed starting from
1452 the beginning of the list. To simplify support for large
1453 task counts, the lists may follow a mask with an asterisk
1454                      and repetition count. For example "mask_mem:0*4,1*4". Not
1455 supported unless the entire node is allocated to the job.
1456
1457 no[ne] don't bind tasks to memory (default)
1458
1459 nosort avoid sorting free cache pages (default, LaunchParameters
1460 configuration parameter can override this default)
1461
1462 p[refer]
1463 Prefer use of first specified NUMA node, but permit
1464 use of other available NUMA nodes.
1465
1466 q[uiet]
1467 quietly bind before task runs (default)
1468
1469 rank bind by task rank (not recommended)
1470
1471 sort sort free cache pages (run zonesort on Intel KNL nodes)
1472
1473 v[erbose]
1474 verbosely report binding before task runs
1475
1476 This option applies to job and step allocations.
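
              A minimal, illustrative binding (NUMA IDs and executable are
              placeholders, and whole nodes are assumed to be allocated as
              noted above) that reports the chosen binding and maps task 0 to
              NUMA node 0 and task 1 to NUMA node 1:

                   $ srun -n 2 --mem-bind=verbose,map_mem:0,1 ./a.out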
1477
1478
1479 --mincpus=<n>
1480 Specify a minimum number of logical cpus/processors per node.
1481 This option applies to job allocations.
1482
1483
1484 --msg-timeout=<seconds>
1485 Modify the job launch message timeout. The default value is
1486 MessageTimeout in the Slurm configuration file slurm.conf.
1487 Changes to this are typically not recommended, but could be use‐
1488 ful to diagnose problems. This option applies to job alloca‐
1489 tions.
1490
1491
1492 --mpi=<mpi_type>
1493 Identify the type of MPI to be used. May result in unique initi‐
1494 ation procedures.
1495
1496 list Lists available mpi types to choose from.
1497
1498 openmpi
1499 For use with OpenMPI.
1500
1501 pmi2 To enable PMI2 support. The PMI2 support in Slurm works
1502 only if the MPI implementation supports it, in other
1503 words if the MPI has the PMI2 interface implemented. The
1504 --mpi=pmi2 will load the library lib/slurm/mpi_pmi2.so
1505 which provides the server side functionality but the
1506 client side must implement PMI2_Init() and the other
1507 interface calls.
1508
1509 pmix To enable PMIx support (http://pmix.github.io/master).
1510 The PMIx support in Slurm can be used to launch parallel
1511 applications (e.g. MPI) if it supports PMIx, PMI2 or
1512 PMI1. Slurm must be configured with pmix support by pass‐
1513 ing "--with-pmix=<PMIx installation path>" option to its
1514 "./configure" script.
1515
1516 At the time of writing PMIx is supported in Open MPI
1517 starting from version 2.0. PMIx also supports backward
1518 compatibility with PMI1 and PMI2 and can be used if MPI
1519 was configured with PMI2/PMI1 support pointing to the
1520 PMIx library ("libpmix"). If MPI supports PMI1/PMI2 but
1521 doesn't provide the way to point to a specific implemen‐
1522 tation, a hack'ish solution leveraging LD_PRELOAD can be
1523 used to force "libpmix" usage.
1524
1525
1526 none No special MPI processing. This is the default and works
1527 with many other versions of MPI.
1528
1529 This option applies to step allocations.
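
              For example (the MPI binary name is a placeholder), the available
              types can be listed and a PMI2-capable application launched with:

                   $ srun --mpi=list
                   $ srun --mpi=pmi2 -n 16 ./mpi_app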
1530
1531
1532 --multi-prog
1533 Run a job with different programs and different arguments for
1534 each task. In this case, the executable program specified is
1535 actually a configuration file specifying the executable and
1536 arguments for each task. See MULTIPLE PROGRAM CONFIGURATION
1537 below for details on the configuration file contents. This
1538 option applies to step allocations.
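
              A sketch using a hypothetical configuration file multi.conf (the
              exact file syntax is defined in MULTIPLE PROGRAM CONFIGURATION
              below; the program names are placeholders):

                   $ cat multi.conf
                   0      ./controller
                   1-3    ./worker
                   $ srun -n 4 --multi-prog multi.conf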
1539
1540
1541 -N, --nodes=<minnodes[-maxnodes]>
1542 Request that a minimum of minnodes nodes be allocated to this
1543 job. A maximum node count may also be specified with maxnodes.
1544 If only one number is specified, this is used as both the mini‐
1545 mum and maximum node count. The partition's node limits super‐
1546 sede those of the job. If a job's node limits are outside of
1547 the range permitted for its associated partition, the job will
1548 be left in a PENDING state. This permits possible execution at
1549 a later time, when the partition limit is changed. If a job
1550 node limit exceeds the number of nodes configured in the parti‐
1551 tion, the job will be rejected. Note that the environment vari‐
1552 able SLURM_JOB_NUM_NODES (and SLURM_NNODES for backwards compat‐
1553 ibility) will be set to the count of nodes actually allocated to
1554 the job. See the ENVIRONMENT VARIABLES section for more informa‐
1555 tion. If -N is not specified, the default behavior is to allo‐
1556 cate enough nodes to satisfy the requirements of the -n and -c
1557 options. The job will be allocated as many nodes as possible
1558 within the range specified and without delaying the initiation
1559              of the job. If the number of tasks is given and a number of
1560              requested nodes is also given, the number of nodes used from that
1561 request will be reduced to match that of the number of tasks if
1562 the number of nodes in the request is greater than the number of
1563 tasks. The node count specification may include a numeric value
1564 followed by a suffix of "k" (multiplies numeric value by 1,024)
1565 or "m" (multiplies numeric value by 1,048,576). This option
1566 applies to job and step allocations.
1567
1568
1569 -n, --ntasks=<number>
1570 Specify the number of tasks to run. Request that srun allocate
1571 resources for ntasks tasks. The default is one task per node,
1572 but note that the --cpus-per-task option will change this
1573 default. This option applies to job and step allocations.
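
              For example (the executable is a placeholder), the following
              requests 8 tasks on anywhere from 2 to 4 nodes; Slurm chooses the
              node count within that range:

                   $ srun -N 2-4 -n 8 ./a.out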
1574
1575
1576 --network=<type>
1577 Specify information pertaining to the switch or network. The
1578 interpretation of type is system dependent. This option is sup‐
1579 ported when running Slurm on a Cray natively. It is used to
1580 request using Network Performance Counters. Only one value per
1581              request is valid. All options are case-insensitive. In this
1582 configuration supported values include:
1583
1584 system
1585 Use the system-wide network performance counters. Only
1586 nodes requested will be marked in use for the job alloca‐
1587 tion. If the job does not fill up the entire system the
1588 rest of the nodes are not able to be used by other jobs
1589 using NPC, if idle their state will appear as PerfCnts.
1590 These nodes are still available for other jobs not using
1591 NPC.
1592
1593 blade Use the blade network performance counters. Only nodes
1594 requested will be marked in use for the job allocation.
1595 If the job does not fill up the entire blade(s) allocated
1596 to the job those blade(s) are not able to be used by other
1597 jobs using NPC, if idle their state will appear as PerfC‐
1598 nts. These nodes are still available for other jobs not
1599 using NPC.
1600
1601
1602              In all cases the job or step allocation request must
1603              specify the --exclusive option. Otherwise the request
1604              will be denied.
1605
1606 Also with any of these options steps are not allowed to share
1607 blades, so resources would remain idle inside an allocation if
1608 the step running on a blade does not take up all the nodes on
1609 the blade.
1610
1611 The network option is also supported on systems with IBM's Par‐
1612 allel Environment (PE). See IBM's LoadLeveler job command key‐
1613 word documentation about the keyword "network" for more informa‐
1614 tion. Multiple values may be specified in a comma separated
1615              list. All options are case-insensitive. Supported values
1616 include:
1617
1618 BULK_XFER[=<resources>]
1619 Enable bulk transfer of data using Remote Direct-
1620 Memory Access (RDMA). The optional resources speci‐
1621 fication is a numeric value which can have a suffix
1622 of "k", "K", "m", "M", "g" or "G" for kilobytes,
1623 megabytes or gigabytes. NOTE: The resources speci‐
1624 fication is not supported by the underlying IBM in‐
1625 frastructure as of Parallel Environment version 2.2
1626 and no value should be specified at this time. The
1627 devices allocated to a job must all be of the same
1628                            type. The default value depends upon what
1629                            hardware is available and, in order of pref‐
1630                            erence, is IPONLY (which is not considered in
1631                            User Space mode), HFI, IB, HPCE, and KMUX.
1632
1633 CAU=<count> Number of Collective Acceleration Units (CAU)
1634 required. Applies only to IBM Power7-IH processors.
1635 Default value is zero. Independent CAU will be
1636 allocated for each programming interface (MPI, LAPI,
1637 etc.)
1638
1639 DEVNAME=<name>
1640 Specify the device name to use for communications
1641 (e.g. "eth0" or "mlx4_0").
1642
1643 DEVTYPE=<type>
1644 Specify the device type to use for communications.
1645 The supported values of type are: "IB" (InfiniBand),
1646 "HFI" (P7 Host Fabric Interface), "IPONLY" (IP-Only
1647 interfaces), "HPCE" (HPC Ethernet), and "KMUX" (Ker‐
1648 nel Emulation of HPCE). The devices allocated to a
1649 job must all be of the same type. The default value
1650                            depends upon what hardware is available and,
1651                            in order of preference, is IPONLY (which is not
1652 considered in User Space mode), HFI, IB, HPCE, and
1653 KMUX.
1654
1655 IMMED =<count>
1656 Number of immediate send slots per window required.
1657 Applies only to IBM Power7-IH processors. Default
1658 value is zero.
1659
1660 INSTANCES =<count>
1661 Specify number of network connections for each task
1662 on each network connection. The default instance
1663 count is 1.
1664
1665 IPV4 Use Internet Protocol (IP) version 4 communications
1666 (default).
1667
1668 IPV6 Use Internet Protocol (IP) version 6 communications.
1669
1670 LAPI Use the LAPI programming interface.
1671
1672 MPI Use the MPI programming interface. MPI is the
1673 default interface.
1674
1675 PAMI Use the PAMI programming interface.
1676
1677 SHMEM Use the OpenSHMEM programming interface.
1678
1679 SN_ALL Use all available switch networks (default).
1680
1681 SN_SINGLE Use one available switch network.
1682
1683 UPC Use the UPC programming interface.
1684
1685 US Use User Space communications.
1686
1687
1688 Some examples of network specifications:
1689
1690 Instances=2,US,MPI,SN_ALL
1691 Create two user space connections for MPI communica‐
1692 tions on every switch network for each task.
1693
1694 US,MPI,Instances=3,Devtype=IB
1695 Create three user space connections for MPI communi‐
1696 cations on every InfiniBand network for each task.
1697
1698 IPV4,LAPI,SN_Single
1699                         Create an IP version 4 connection for LAPI communica‐
1700 tions on one switch network for each task.
1701
1702 Instances=2,US,LAPI,MPI
1703 Create two user space connections each for LAPI and
1704 MPI communications on every switch network for each
1705 task. Note that SN_ALL is the default option so
1706 every switch network is used. Also note that
1707 Instances=2 specifies that two connections are
1708 established for each protocol (LAPI and MPI) and
1709 each task. If there are two networks and four tasks
1710 on the node then a total of 32 connections are
1711 established (2 instances x 2 protocols x 2 networks
1712 x 4 tasks).
1713
1714 This option applies to job and step allocations.
1715
1716
1717 --nice[=adjustment]
1718 Run the job with an adjusted scheduling priority within Slurm.
1719 With no adjustment value the scheduling priority is decreased by
1720 100. A negative nice value increases the priority, otherwise
1721 decreases it. The adjustment range is +/- 2147483645. Only priv‐
1722 ileged users can specify a negative adjustment.
1723
1724
1725 --ntasks-per-core=<ntasks>
1726 Request the maximum ntasks be invoked on each core. This option
1727 applies to the job allocation, but not to step allocations.
1728 Meant to be used with the --ntasks option. Related to
1729 --ntasks-per-node except at the core level instead of the node
1730 level. Masks will automatically be generated to bind the tasks
1731              to specific cores unless --cpu-bind=none is specified. NOTE:
1732 This option is not supported unless SelectType=cons_res is con‐
1733 figured (either directly or indirectly on Cray systems) along
1734 with the node's core count.
1735
1736
1737 --ntasks-per-node=<ntasks>
1738 Request that ntasks be invoked on each node. If used with the
1739 --ntasks option, the --ntasks option will take precedence and
1740 the --ntasks-per-node will be treated as a maximum count of
1741 tasks per node. Meant to be used with the --nodes option. This
1742 is related to --cpus-per-task=ncpus, but does not require knowl‐
1743 edge of the actual number of cpus on each node. In some cases,
1744 it is more convenient to be able to request that no more than a
1745 specific number of tasks be invoked on each node. Examples of
1746 this include submitting a hybrid MPI/OpenMP app where only one
1747 MPI "task/rank" should be assigned to each node while allowing
1748 the OpenMP portion to utilize all of the parallelism present in
1749 the node, or submitting a single setup/cleanup/monitoring job to
1750 each node of a pre-existing allocation as one step in a larger
1751 job script. This option applies to job allocations.
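
              A hedged sketch of the hybrid MPI/OpenMP case mentioned above
              (node count, thread count and executable are placeholders):

                   $ export OMP_NUM_THREADS=16
                   $ srun -N 4 --ntasks-per-node=1 --cpus-per-task=16 ./hybrid_app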
1752
1753
1754 --ntasks-per-socket=<ntasks>
1755 Request the maximum ntasks be invoked on each socket. This
1756 option applies to the job allocation, but not to step alloca‐
1757 tions. Meant to be used with the --ntasks option. Related to
1758 --ntasks-per-node except at the socket level instead of the node
1759 level. Masks will automatically be generated to bind the tasks
1760 to specific sockets unless --cpu-bind=none is specified. NOTE:
1761 This option is not supported unless SelectType=cons_res is con‐
1762 figured (either directly or indirectly on Cray systems) along
1763 with the node's socket count.
1764
1765
1766 -O, --overcommit
1767 Overcommit resources. This option applies to job and step allo‐
1768 cations. When applied to job allocation, only one CPU is allo‐
1769 cated to the job per node and options used to specify the number
1770 of tasks per node, socket, core, etc. are ignored. When
1771 applied to job step allocations (the srun command when executed
1772 within an existing job allocation), this option can be used to
1773 launch more than one task per CPU. Normally, srun will not
1774 allocate more than one process per CPU. By specifying --over‐
1775 commit you are explicitly allowing more than one process per
1776 CPU. However no more than MAX_TASKS_PER_NODE tasks are permitted
1777 to execute per node. NOTE: MAX_TASKS_PER_NODE is defined in the
1778 file slurm.h and is not a variable, it is set at Slurm build
1779 time.
1780
1781
1782 -o, --output=<filename pattern>
1783 Specify the "filename pattern" for stdout redirection. By
1784 default in interactive mode, srun collects stdout from all tasks
1785 and sends this output via TCP/IP to the attached terminal. With
1786 --output stdout may be redirected to a file, to one file per
1787 task, or to /dev/null. See section IO Redirection below for the
1788 various forms of filename pattern. If the specified file
1789 already exists, it will be overwritten.
1790
1791 If --error is not also specified on the command line, both std‐
1792              out and stderr will be directed to the file specified by --output.
1793 This option applies to job and step allocations.
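
              As an illustration (the pattern and executable are placeholders),
              one output file per task can be requested with the format speci‐
              fiers described under IO Redirection below:

                   $ srun -n 4 -o job%j_task%t.out ./a.out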
1794
1795
1796 --open-mode=<append|truncate>
1797 Open the output and error files using append or truncate mode as
1798 specified. For heterogeneous job steps the default value is
1799 "append". Otherwise the default value is specified by the sys‐
1800 tem configuration parameter JobFileAppend. This option applies
1801 to job and step allocations.
1802
1803
1804 --pack-group=<expr>
1805 Identify each job in a heterogeneous job allocation for which a
1806 step is to be created. Applies only to srun commands issued
1807 inside a salloc allocation or sbatch script. <expr> is a set of
1808 integers corresponding to one or more options indexes on the
1809 salloc or sbatch command line. Examples: "--pack-group=2",
1810 "--pack-group=0,4", "--pack-group=1,3-5". The default value is
1811 --pack-group=0.
1812
1813
1814 -p, --partition=<partition_names>
1815 Request a specific partition for the resource allocation. If
1816 not specified, the default behavior is to allow the slurm con‐
1817 troller to select the default partition as designated by the
1818 system administrator. If the job can use more than one parti‐
1819              tion, specify their names in a comma separated list and the one
1820 offering earliest initiation will be used with no regard given
1821 to the partition name ordering (although higher priority parti‐
1822 tions will be considered first). When the job is initiated, the
1823 name of the partition used will be placed first in the job
1824 record partition string. This option applies to job allocations.
1825
1826
1827 --power=<flags>
1828 Comma separated list of power management plugin options. Cur‐
1829 rently available flags include: level (all nodes allocated to
1830 the job should have identical power caps, may be disabled by the
1831 Slurm configuration option PowerParameters=job_no_level). This
1832 option applies to job allocations.
1833
1834
1835 --priority=<value>
1836 Request a specific job priority. May be subject to configura‐
1837 tion specific constraints. value should either be a numeric
1838 value or "TOP" (for highest possible value). Only Slurm opera‐
1839 tors and administrators can set the priority of a job. This
1840 option applies to job allocations only.
1841
1842
1843 --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
1844              Enables detailed data collection by the acct_gather_profile
1845 plugin. Detailed data are typically time-series that are stored
1846 in an HDF5 file for the job or an InfluxDB database depending on
1847 the configured plugin.
1848
1849
1850 All All data types are collected. (Cannot be combined with
1851 other values.)
1852
1853
1854 None No data types are collected. This is the default.
1855 (Cannot be combined with other values.)
1856
1857
1858 Energy Energy data is collected.
1859
1860
1861 Task Task (I/O, Memory, ...) data is collected.
1862
1863
1864 Filesystem
1865 Filesystem data is collected.
1866
1867
1868 Network Network (InfiniBand) data is collected.
1869
1870
1871 This option applies to job and step allocations.
1872
1873
1874 --prolog=<executable>
1875 srun will run executable just before launching the job step.
1876 The command line arguments for executable will be the command
1877 and arguments of the job step. If executable is "none", then no
1878 srun prolog will be run. This parameter overrides the SrunProlog
1879 parameter in slurm.conf. This parameter is completely indepen‐
1880 dent from the Prolog parameter in slurm.conf. This option
1881 applies to job allocations.
1882
1883
1884 --propagate[=rlimit[,rlimit...]]
1885 Allows users to specify which of the modifiable (soft) resource
1886 limits to propagate to the compute nodes and apply to their
1887 jobs. If no rlimit is specified, then all resource limits will
1888 be propagated. The following rlimit names are supported by
1889 Slurm (although some options may not be supported on some sys‐
1890 tems):
1891
1892 ALL All limits listed below (default)
1893
1894 NONE No limits listed below
1895
1896 AS The maximum address space for a process
1897
1898 CORE The maximum size of core file
1899
1900 CPU The maximum amount of CPU time
1901
1902 DATA The maximum size of a process's data segment
1903
1904 FSIZE The maximum size of files created. Note that if the
1905 user sets FSIZE to less than the current size of the
1906 slurmd.log, job launches will fail with a 'File size
1907 limit exceeded' error.
1908
1909 MEMLOCK The maximum size that may be locked into memory
1910
1911 NOFILE The maximum number of open files
1912
1913 NPROC The maximum number of processes available
1914
1915 RSS The maximum resident set size
1916
1917 STACK The maximum stack size
1918
1919 This option applies to job allocations.
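
              For example (the choice of limits is illustrative only), just the
              locked-memory and stack limits could be propagated with:

                   $ srun --propagate=MEMLOCK,STACK -n 4 ./a.out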
1920
1921
1922 --pty Execute task zero in pseudo terminal mode. Implicitly sets
1923 --unbuffered. Implicitly sets --error and --output to /dev/null
1924 for all tasks except task zero, which may cause those tasks to
1925 exit immediately (e.g. shells will typically exit immediately in
1926 that situation). This option applies to step allocations.
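
              A common illustrative use (the shell and task count are examples
              only) is an interactive shell on the allocated resources:

                   $ srun -n 1 --pty bash -i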
1927
1928
1929 -q, --qos=<qos>
1930 Request a quality of service for the job. QOS values can be
1931 defined for each user/cluster/account association in the Slurm
1932 database. Users will be limited to their association's defined
1933 set of qos's when the Slurm configuration parameter, Account‐
1934              ingStorageEnforce, includes "qos" in its definition. This
1935 option applies to job allocations.
1936
1937
1938 -Q, --quiet
1939 Suppress informational messages from srun. Errors will still be
1940 displayed. This option applies to job and step allocations.
1941
1942
1943 --quit-on-interrupt
1944 Quit immediately on single SIGINT (Ctrl-C). Use of this option
1945 disables the status feature normally available when srun
1946 receives a single Ctrl-C and causes srun to instead immediately
1947 terminate the running job. This option applies to step alloca‐
1948 tions.
1949
1950
1951 -r, --relative=<n>
1952 Run a job step relative to node n of the current allocation.
1953 This option may be used to spread several job steps out among
1954 the nodes of the current job. If -r is used, the current job
1955 step will begin at node n of the allocated nodelist, where the
1956 first node is considered node 0. The -r option is not permitted
1957 with -w or -x option and will result in a fatal error when not
1958 running within a prior allocation (i.e. when SLURM_JOB_ID is not
1959 set). The default for n is 0. If the value of --nodes exceeds
1960 the number of nodes identified with the --relative option, a
1961 warning message will be printed and the --relative option will
1962 take precedence. This option applies to step allocations.
1963
1964
1965 --reboot
1966 Force the allocated nodes to reboot before starting the job.
1967 This is only supported with some system configurations and will
1968 otherwise be silently ignored. This option applies to job allo‐
1969 cations.
1970
1971
1972 --resv-ports[=count]
1973 Reserve communication ports for this job. Users can specify the
1974              number of ports they want to reserve. The parameter Mpi‐
1975              Params=ports=12000-12999 must be specified in slurm.conf. If not
1976              specified and Slurm's OpenMPI plugin is used, then by default
1977              the number of reserved ports equals the highest number of tasks
1978              on any node in the job step allocation. If the number of reserved
1979              ports is zero then no ports are reserved. Used for OpenMPI. This
1980 option applies to job and step allocations.
1981
1982
1983 --reservation=<name>
1984 Allocate resources for the job from the named reservation. This
1985 option applies to job allocations.
1986
1987
1988 -s, --oversubscribe
1989 The job allocation can over-subscribe resources with other run‐
1990 ning jobs. The resources to be over-subscribed can be nodes,
1991 sockets, cores, and/or hyperthreads depending upon configura‐
1992 tion. The default over-subscribe behavior depends on system
1993 configuration and the partition's OverSubscribe option takes
1994 precedence over the job's option. This option may result in the
1995 allocation being granted sooner than if the --oversubscribe
1996 option was not set and allow higher system utilization, but
1997 application performance will likely suffer due to competition
1998 for resources. Also see the --exclusive option. This option
1999 applies to step allocations.
2000
2001
2002 -S, --core-spec=<num>
2003 Count of specialized cores per node reserved by the job for sys‐
2004 tem operations and not used by the application. The application
2005 will not use these cores, but will be charged for their alloca‐
2006 tion. Default value is dependent upon the node's configured
2007 CoreSpecCount value. If a value of zero is designated and the
2008 Slurm configuration option AllowSpecResourcesUsage is enabled,
2009 the job will be allowed to override CoreSpecCount and use the
2010 specialized resources on nodes it is allocated. This option can
2011 not be used with the --thread-spec option. This option applies
2012 to job allocations.
2013
2014
2015 --signal=<sig_num>[@<sig_time>]
2016 When a job is within sig_time seconds of its end time, send it
2017 the signal sig_num. Due to the resolution of event handling by
2018 Slurm, the signal may be sent up to 60 seconds earlier than
2019 specified. sig_num may either be a signal number or name (e.g.
2020 "10" or "USR1"). sig_time must have an integer value between 0
2021 and 65535. By default, no signal is sent before the job's end
2022 time. If a sig_num is specified without any sig_time, the
2023 default time will be 60 seconds. This option applies to job
2024 allocations. To have the signal sent at preemption time see the
2025 preempt_send_user_signal SlurmctldParameter.
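
              For example (signal, lead time, time limit and executable are
              placeholders), SIGUSR1 could be requested roughly two minutes
              before the end of a 30 minute limit with:

                   $ srun --signal=USR1@120 -t 30 ./a.out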
2026
2027
2028 --slurmd-debug=<level>
2029 Specify a debug level for slurmd(8). The level may be specified
2030              either as an integer value between 0 [quiet, only errors are dis‐
2031              played] and 4 [verbose operation] or as a SlurmdDebug tag.
2032
2033 quiet Log nothing
2034
2035 fatal Log only fatal errors
2036
2037 error Log only errors
2038
2039 info Log errors and general informational messages
2040
2041 verbose Log errors and verbose informational messages
2042
2043
2044 The slurmd debug information is copied onto the stderr of
2045 the job. By default only errors are displayed. This option
2046 applies to job and step allocations.
2047
2048
2049 --sockets-per-node=<sockets>
2050 Restrict node selection to nodes with at least the specified
2051 number of sockets. See additional information under -B option
2052 above when task/affinity plugin is enabled. This option applies
2053 to job allocations.
2054
2055
2056 --spread-job
2057 Spread the job allocation over as many nodes as possible and
2058 attempt to evenly distribute tasks across the allocated nodes.
2059 This option disables the topology/tree plugin. This option
2060 applies to job allocations.
2061
2062
2063 --switches=<count>[@<max-time>]
2064 When a tree topology is used, this defines the maximum count of
2065 switches desired for the job allocation and optionally the maxi‐
2066 mum time to wait for that number of switches. If Slurm finds an
2067 allocation containing more switches than the count specified,
2068 the job remains pending until it either finds an allocation with
2069              desired switch count or the time limit expires. If there is no
2070 switch count limit, there is no delay in starting the job.
2071 Acceptable time formats include "minutes", "minutes:seconds",
2072 "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
2073 "days-hours:minutes:seconds". The job's maximum time delay may
2074 be limited by the system administrator using the SchedulerParam‐
2075 eters configuration parameter with the max_switch_wait parameter
2076 option. On a dragonfly network the only switch count supported
2077 is 1 since communication performance will be highest when a job
2078              is allocated resources on one leaf switch or more than 2 leaf
2079              switches. The default max-time is the max_switch_wait Sched‐
2080              ulerParameters value. This option applies to job allocations.
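
              As a sketch (the counts and wait time are placeholders), a job
              could ask to be placed on a single leaf switch, waiting at most
              60 minutes for such an allocation:

                   $ srun -N 16 --switches=1@60 ./a.out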
2081
2082
2083 -T, --threads=<nthreads>
2084 Allows limiting the number of concurrent threads used to send
2085 the job request from the srun process to the slurmd processes on
2086 the allocated nodes. Default is to use one thread per allocated
2087 node up to a maximum of 60 concurrent threads. Specifying this
2088 option limits the number of concurrent threads to nthreads (less
2089 than or equal to 60). This should only be used to set a low
2090 thread count for testing on very small memory computers. This
2091 option applies to job allocations.
2092
2093
2094 -t, --time=<time>
2095 Set a limit on the total run time of the job allocation. If the
2096 requested time limit exceeds the partition's time limit, the job
2097 will be left in a PENDING state (possibly indefinitely). The
2098 default time limit is the partition's default time limit. When
2099 the time limit is reached, each task in each job step is sent
2100 SIGTERM followed by SIGKILL. The interval between signals is
2101 specified by the Slurm configuration parameter KillWait. The
2102 OverTimeLimit configuration parameter may permit the job to run
2103 longer than scheduled. Time resolution is one minute and second
2104 values are rounded up to the next minute.
2105
2106 A time limit of zero requests that no time limit be imposed.
2107 Acceptable time formats include "minutes", "minutes:seconds",
2108 "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
2109 "days-hours:minutes:seconds". This option applies to job and
2110 step allocations.
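
              Time format example (the limit itself is arbitrary): a limit of
              one day, twelve hours and thirty minutes can be written as:

                   $ srun -t 1-12:30:00 -n 4 ./a.out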
2111
2112
2113 --task-epilog=<executable>
2114 The slurmstepd daemon will run executable just after each task
2115 terminates. This will be executed before any TaskEpilog parame‐
2116 ter in slurm.conf is executed. This is meant to be a very
2117 short-lived program. If it fails to terminate within a few sec‐
2118 onds, it will be killed along with any descendant processes.
2119 This option applies to step allocations.
2120
2121
2122 --task-prolog=<executable>
2123 The slurmstepd daemon will run executable just before launching
2124 each task. This will be executed after any TaskProlog parameter
2125 in slurm.conf is executed. Besides the normal environment vari‐
2126 ables, this has SLURM_TASK_PID available to identify the process
2127 ID of the task being started. Standard output from this program
2128 of the form "export NAME=value" will be used to set environment
2129 variables for the task being spawned. This option applies to
2130 step allocations.
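
              A minimal sketch of a task prolog (the script name and variable
              are hypothetical; the script must be executable) that exports a
              variable into every spawned task:

                   $ cat task_prolog.sh
                   #!/bin/sh
                   # Echoed "export NAME=value" lines become environment
                   # variables in the task being launched.
                   echo "export MY_TASK_PID=$SLURM_TASK_PID"
                   $ srun -n 2 --task-prolog=./task_prolog.sh ./a.out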
2131
2132
2133 --test-only
2134 Returns an estimate of when a job would be scheduled to run
2135 given the current job queue and all the other srun arguments
2136 specifying the job. This limits srun's behavior to just return
2137 information; no job is actually submitted. The program will be
2138 executed directly by the slurmd daemon. This option applies to
2139 job allocations.
2140
2141
2142 --thread-spec=<num>
2143 Count of specialized threads per node reserved by the job for
2144 system operations and not used by the application. The applica‐
2145 tion will not use these threads, but will be charged for their
2146 allocation. This option can not be used with the --core-spec
2147 option. This option applies to job allocations.
2148
2149
2150 --threads-per-core=<threads>
2151 Restrict node selection to nodes with at least the specified
2152 number of threads per core. NOTE: "Threads" refers to the num‐
2153 ber of processing units on each core rather than the number of
2154 application tasks to be launched per core. See additional
2155 information under -B option above when task/affinity plugin is
2156 enabled. This option applies to job allocations.
2157
2158
2159 --time-min=<time>
2160 Set a minimum time limit on the job allocation. If specified,
2161              the job may have its --time limit lowered to a value no lower
2162 than --time-min if doing so permits the job to begin execution
2163 earlier than otherwise possible. The job's time limit will not
2164 be changed after the job is allocated resources. This is per‐
2165 formed by a backfill scheduling algorithm to allocate resources
2166 otherwise reserved for higher priority jobs. Acceptable time
2167 formats include "minutes", "minutes:seconds", "hours:min‐
2168 utes:seconds", "days-hours", "days-hours:minutes" and
2169 "days-hours:minutes:seconds". This option applies to job alloca‐
2170 tions.
2171
2172
2173 --tmp=<size[units]>
2174 Specify a minimum amount of temporary disk space per node.
2175 Default units are megabytes unless the SchedulerParameters con‐
2176 figuration parameter includes the "default_gbytes" option for
2177 gigabytes. Different units can be specified using the suffix
2178 [K|M|G|T]. This option applies to job allocations.
2179
2180
2181 -u, --unbuffered
2182 By default the connection between slurmstepd and the user
2183 launched application is over a pipe. The stdio output written by
2184 the application is buffered by the glibc until it is flushed or
2185 the output is set as unbuffered. See setbuf(3). If this option
2186 is specified the tasks are executed with a pseudo terminal so
2187 that the application output is unbuffered. This option applies
2188 to step allocations.
2189
2190 --usage
2191 Display brief help message and exit.
2192
2193
2194 --uid=<user>
2195 Attempt to submit and/or run a job as user instead of the invok‐
2196 ing user id. The invoking user's credentials will be used to
2197 check access permissions for the target partition. User root may
2198 use this option to run jobs as a normal user in a RootOnly par‐
2199 tition for example. If run as root, srun will drop its permis‐
2200 sions to the uid specified after node allocation is successful.
2201 user may be the user name or numerical user ID. This option
2202 applies to job and step allocations.
2203
2204
2205 --use-min-nodes
2206 If a range of node counts is given, prefer the smaller count.
2207
2208
2209 -V, --version
2210 Display version information and exit.
2211
2212
2213 -v, --verbose
2214 Increase the verbosity of srun's informational messages. Multi‐
2215 ple -v's will further increase srun's verbosity. By default
2216 only errors will be displayed. This option applies to job and
2217 step allocations.
2218
2219
2220 -W, --wait=<seconds>
2221 Specify how long to wait after the first task terminates before
2222 terminating all remaining tasks. A value of 0 indicates an
2223 unlimited wait (a warning will be issued after 60 seconds). The
2224 default value is set by the WaitTime parameter in the slurm con‐
2225 figuration file (see slurm.conf(5)). This option can be useful
2226 to ensure that a job is terminated in a timely fashion in the
2227 event that one or more tasks terminate prematurely. Note: The
2228 -K, --kill-on-bad-exit option takes precedence over -W, --wait
2229 to terminate the job immediately if a task exits with a non-zero
2230 exit code. This option applies to job allocations.
2231
2232
2233 -w, --nodelist=<host1,host2,... or filename>
2234 Request a specific list of hosts. The job will contain all of
2235 these hosts and possibly additional hosts as needed to satisfy
2236 resource requirements. The list may be specified as a
2237 comma-separated list of hosts, a range of hosts (host[1-5,7,...]
2238 for example), or a filename. The host list will be assumed to
2239 be a filename if it contains a "/" character. If you specify a
2240 minimum node or processor count larger than can be satisfied by
2241 the supplied host list, additional resources will be allocated
2242 on other nodes as needed. Rather than repeating a host name
2243 multiple times, an asterisk and a repetition count may be
2244 appended to a host name. For example "host1,host1" and "host1*2"
2245              are equivalent. If the number of tasks is given and a list of
2246              requested nodes is also given, the number of nodes used from that
2247 list will be reduced to match that of the number of tasks if the
2248 number of nodes in the list is greater than the number of tasks.
2249 This option applies to job and step allocations.
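
              For illustration (host names are placeholders), a job could
              require four specific hosts with:

                   $ srun -n 4 -w host[1-3],host7 ./a.out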
2250
2251
2252 --wckey=<wckey>
2253 Specify wckey to be used with job. If TrackWCKey=no (default)
2254 in the slurm.conf this value is ignored. This option applies to
2255 job allocations.
2256
2257
2258 -X, --disable-status
2259 Disable the display of task status when srun receives a single
2260 SIGINT (Ctrl-C). Instead immediately forward the SIGINT to the
2261 running job. Without this option a second Ctrl-C in one second
2262 is required to forcibly terminate the job and srun will immedi‐
2263 ately exit. May also be set via the environment variable
2264 SLURM_DISABLE_STATUS. This option applies to job allocations.
2265
2266
2267 -x, --exclude=<host1,host2,... or filename>
2268 Request that a specific list of hosts not be included in the
2269 resources allocated to this job. The host list will be assumed
2270              to be a filename if it contains a "/" character. This option
2271 applies to job allocations.
2272
2273
2274 --x11[=<all|first|last>]
2275 Sets up X11 forwarding on all, first or last node(s) of the
2276 allocation. This option is only enabled if Slurm was compiled
2277 with X11 support and PrologFlags=x11 is defined in the
2278 slurm.conf. Default is all.
2279
2280
2281 -Z, --no-allocate
2282 Run the specified tasks on a set of nodes without creating a
2283 Slurm "job" in the Slurm queue structure, bypassing the normal
2284 resource allocation step. The list of nodes must be specified
2285 with the -w, --nodelist option. This is a privileged option
2286 only available for the users "SlurmUser" and "root". This option
2287 applies to job allocations.
2288
2289
2290       srun will submit the job request to the Slurm job controller, then ini‐
2291 tiate all processes on the remote nodes. If the request cannot be met
2292 immediately, srun will block until the resources are free to run the
2293 job. If the -I (--immediate) option is specified srun will terminate if
2294 resources are not immediately available.
2295
2296 When initiating remote processes srun will propagate the current work‐
2297 ing directory, unless --chdir=<path> is specified, in which case path
2298 will become the working directory for the remote processes.
2299
2300 The -n, -c, and -N options control how CPUs and nodes will be allo‐
2301 cated to the job. When specifying only the number of processes to run
2302 with -n, a default of one CPU per process is allocated. By specifying
2303 the number of CPUs required per task (-c), more than one CPU may be
2304 allocated per process. If the number of nodes is specified with -N,
2305 srun will attempt to allocate at least the number of nodes specified.
2306
2307 Combinations of the above three options may be used to change how pro‐
2308 cesses are distributed across nodes and cpus. For instance, by specify‐
2309 ing both the number of processes and number of nodes on which to run,
2310 the number of processes per node is implied. However, if the number of
2311       CPUs per process is more important, then the number of processes (-n) and
2312 the number of CPUs per process (-c) should be specified.
2313
2314 srun will refuse to allocate more than one process per CPU unless
2315 --overcommit (-O) is also specified.
2316
2317 srun will attempt to meet the above specifications "at a minimum." That
2318 is, if 16 nodes are requested for 32 processes, and some nodes do not
2319 have 2 CPUs, the allocation of nodes will be increased in order to meet
2320 the demand for CPUs. In other words, a minimum of 16 nodes are being
2321 requested. However, if 16 nodes are requested for 15 processes, srun
2322 will consider this an error, as 15 processes cannot run across 16
2323 nodes.
2324
2325
2326 IO Redirection
2327
2328 By default, stdout and stderr will be redirected from all tasks to the
2329 stdout and stderr of srun, and stdin will be redirected from the stan‐
2330 dard input of srun to all remote tasks. If stdin is only to be read by
2331 a subset of the spawned tasks, specifying a file to read from rather
2332 than forwarding stdin from the srun command may be preferable as it
2333 avoids moving and storing data that will never be read.
2334
2335 For OS X, the poll() function does not support stdin, so input from a
2336 terminal is not possible.
2337
2338 This behavior may be changed with the --output, --error, and --input
2339 (-o, -e, -i) options. Valid format specifications for these options are
2340
2341       all       stdout and stderr are redirected from all tasks to srun. stdin is
2342 broadcast to all remote tasks. (This is the default behav‐
2343 ior)
2344
2345       none      stdout and stderr are not received from any task. stdin is
2346 not sent to any task (stdin is closed).
2347
2348 taskid stdout and/or stderr are redirected from only the task with
2349                 relative id equal to taskid, where 0 <= taskid < ntasks,
2350 where ntasks is the total number of tasks in the current job
2351 step. stdin is redirected from the stdin of srun to this
2352 same task. This file will be written on the node executing
2353 the task.
2354
2355 filename srun will redirect stdout and/or stderr to the named file
2356 from all tasks. stdin will be redirected from the named file
2357 and broadcast to all tasks in the job. filename refers to a
2358 path on the host that runs srun. Depending on the cluster's
2359 file system layout, this may result in the output appearing
2360 in different places depending on whether the job is run in
2361 batch mode.
2362
2363 filename pattern
2364 srun allows for a filename pattern to be used to generate the
2365 named IO file described above. The following list of format
2366 specifiers may be used in the format string to generate a
2367 filename that will be unique to a given jobid, stepid, node,
2368 or task. In each case, the appropriate number of files are
2369 opened and associated with the corresponding tasks. Note that
2370 any format string containing %t, %n, and/or %N will be writ‐
2371 ten on the node executing the task rather than the node where
2372                 srun executes. These format specifiers are not supported on a
2373 BGQ system.
2374
2375 \\ Do not process any of the replacement symbols.
2376
2377 %% The character "%".
2378
2379 %A Job array's master job allocation number.
2380
2381 %a Job array ID (index) number.
2382
2383 %J jobid.stepid of the running job. (e.g. "128.0")
2384
2385 %j jobid of the running job.
2386
2387 %s stepid of the running job.
2388
2389 %N short hostname. This will create a separate IO file
2390 per node.
2391
2392 %n Node identifier relative to current job (e.g. "0" is
2393 the first node of the running job) This will create a
2394 separate IO file per node.
2395
2396 %t task identifier (rank) relative to current job. This
2397 will create a separate IO file per task.
2398
2399 %u User name.
2400
2401 %x Job name.
2402
2403 A number placed between the percent character and format
2404 specifier may be used to zero-pad the result in the IO file‐
2405 name. This number is ignored if the format specifier corre‐
2406 sponds to non-numeric data (%N for example).
2407
2408 Some examples of how the format string may be used for a 4
2409 task job step with a Job ID of 128 and step id of 0 are
2410 included below:
2411
2412 job%J.out job128.0.out
2413
2414 job%4j.out job0128.out
2415
2416 job%j-%2t.out job128-00.out, job128-01.out, ...
2417
2419 Some srun options may be set via environment variables. These environ‐
2420 ment variables, along with their corresponding options, are listed
2421 below. Note: Command line options will always override these settings.
2422
2423 PMI_FANOUT This is used exclusively with PMI (MPICH2 and
2424 MVAPICH2) and controls the fanout of data commu‐
2425 nications. The srun command sends messages to
2426 application programs (via the PMI library) and
2427 those applications may be called upon to forward
2428 that data to up to this number of additional
2429 tasks. Higher values offload work from the srun
2430 command to the applications and likely increase
2431 the vulnerability to failures. The default value
2432 is 32.
2433
2434 PMI_FANOUT_OFF_HOST This is used exclusively with PMI (MPICH2 and
2435 MVAPICH2) and controls the fanout of data commu‐
2436 nications. The srun command sends messages to
2437 application programs (via the PMI library) and
2438 those applications may be called upon to forward
2439 that data to additional tasks. By default, srun
2440 sends one message per host and one task on that
2441 host forwards the data to other tasks on that
2442 host up to PMI_FANOUT. If PMI_FANOUT_OFF_HOST is
2443 defined, the user task may be required to forward
2444 the data to tasks on other hosts. Setting
2445 PMI_FANOUT_OFF_HOST may increase performance.
2446 Since more work is performed by the PMI library
2447 loaded by the user application, failures also can
2448 be more common and more difficult to diagnose.
2449
2450 PMI_TIME This is used exclusively with PMI (MPICH2 and
2451 MVAPICH2) and controls how much the communica‐
2452 tions from the tasks to the srun are spread out
2453 in time in order to avoid overwhelming the srun
2454 command with work. The default value is 500
2455 (microseconds) per task. On relatively slow pro‐
2456 cessors or systems with very large processor
2457 counts (and large PMI data sets), higher values
2458 may be required.
2459
2460 SLURM_CONF The location of the Slurm configuration file.
2461
2462 SLURM_ACCOUNT Same as -A, --account
2463
2464 SLURM_ACCTG_FREQ Same as --acctg-freq
2465
2466 SLURM_BCAST Same as --bcast
2467
2468 SLURM_BURST_BUFFER Same as --bb
2469
2470 SLURM_CHECKPOINT Same as --checkpoint
2471
2472 SLURM_COMPRESS Same as --compress
2473
2474 SLURM_CONSTRAINT Same as -C, --constraint
2475
2476 SLURM_CORE_SPEC Same as --core-spec
2477
2478 SLURM_CPU_BIND Same as --cpu-bind
2479
2480 SLURM_CPU_FREQ_REQ Same as --cpu-freq.
2481
2482 SLURM_CPUS_PER_GPU Same as --cpus-per-gpu
2483
2484 SLURM_CPUS_PER_TASK Same as -c, --cpus-per-task
2485
2486 SLURM_DEBUG Same as -v, --verbose
2487
2488 SLURM_DELAY_BOOT Same as --delay-boot
2489
2490 SLURMD_DEBUG Same as -d, --slurmd-debug
2491
2492 SLURM_DEPENDENCY Same as -P, --dependency=<jobid>
2493
2494 SLURM_DISABLE_STATUS Same as -X, --disable-status
2495
2496 SLURM_DIST_PLANESIZE Same as -m plane
2497
2498 SLURM_DISTRIBUTION Same as -m, --distribution
2499
2500 SLURM_EPILOG Same as --epilog
2501
2502 SLURM_EXCLUSIVE Same as --exclusive
2503
2504 SLURM_EXIT_ERROR Specifies the exit code generated when a Slurm
2505 error occurs (e.g. invalid options). This can be
2506 used by a script to distinguish application exit
2507 codes from various Slurm error conditions. Also
2508 see SLURM_EXIT_IMMEDIATE.
2509
2510 SLURM_EXIT_IMMEDIATE Specifies the exit code generated when the
2511 --immediate option is used and resources are not
2512 currently available. This can be used by a
2513 script to distinguish application exit codes from
2514 various Slurm error conditions. Also see
2515 SLURM_EXIT_ERROR.
2516
2517 SLURM_EXPORT_ENV Same as --export
2518
2519 SLURM_GPUS Same as -G, --gpus
2520
2521 SLURM_GPU_BIND Same as --gpu-bind
2522
2523 SLURM_GPU_FREQ Same as --gpu-freq
2524
2525 SLURM_GPUS_PER_NODE Same as --gpus-per-node
2526
2527 SLURM_GPUS_PER_TASK Same as --gpus-per-task
2528
2529 SLURM_GRES_FLAGS Same as --gres-flags
2530
2531 SLURM_HINT Same as --hint
2532
2533 SLURM_GRES Same as --gres. Also see SLURM_STEP_GRES
2534
2535 SLURM_IMMEDIATE Same as -I, --immediate
2536
2537 SLURM_JOB_ID Same as --jobid
2538
2539 SLURM_JOB_NAME Same as -J, --job-name except within an existing
2540 allocation, in which case it is ignored to avoid
2541 using the batch job's name as the name of each
2542 job step.
2543
2544 SLURM_JOB_NUM_NODES (and SLURM_NNODES for backwards compatibility)
2545                              Same as -N, --nodes. Total number of nodes in the
2546                              job's resource allocation.
2547
2548 SLURM_KILL_BAD_EXIT Same as -K, --kill-on-bad-exit
2549
2550 SLURM_LABELIO Same as -l, --label
2551
2552 SLURM_MEM_BIND Same as --mem-bind
2553
2554 SLURM_MEM_PER_CPU Same as --mem-per-cpu
2555
2556 SLURM_MEM_PER_GPU Same as --mem-per-gpu
2557
2558 SLURM_MEM_PER_NODE Same as --mem
2559
2560 SLURM_MPI_TYPE Same as --mpi
2561
2562 SLURM_NETWORK Same as --network
2563
2564 SLURM_NO_KILL Same as -k, --no-kill
2565
2566 SLURM_NTASKS (and SLURM_NPROCS for backwards compatibility)
2567 Same as -n, --ntasks
2568
2569 SLURM_NTASKS_PER_CORE Same as --ntasks-per-core
2570
2571 SLURM_NTASKS_PER_NODE Same as --ntasks-per-node
2572
2573 SLURM_NTASKS_PER_SOCKET
2574 Same as --ntasks-per-socket
2575
2576 SLURM_OPEN_MODE Same as --open-mode
2577
2578 SLURM_OVERCOMMIT Same as -O, --overcommit
2579
2580 SLURM_PARTITION Same as -p, --partition
2581
2582 SLURM_PMI_KVS_NO_DUP_KEYS
2583 If set, then PMI key-pairs will contain no dupli‐
2584 cate keys. MPI can use this variable to inform
2585 the PMI library that it will not use duplicate
2586 keys so PMI can skip the check for duplicate
2587 keys. This is the case for MPICH2 and reduces
2588 overhead in testing for duplicates for improved
2589                              performance.
2590
2591 SLURM_POWER Same as --power
2592
2593 SLURM_PROFILE Same as --profile
2594
2595 SLURM_PROLOG Same as --prolog
2596
2597 SLURM_QOS Same as --qos
2598
2599 SLURM_REMOTE_CWD Same as -D, --chdir=
2600
2601 SLURM_REQ_SWITCH When a tree topology is used, this defines the
2602 maximum count of switches desired for the job
2603 allocation and optionally the maximum time to
2604 wait for that number of switches. See --switches
2605
2606 SLURM_RESERVATION Same as --reservation
2607
2608 SLURM_RESV_PORTS Same as --resv-ports
2609
2610 SLURM_SIGNAL Same as --signal
2611
2612 SLURM_STDERRMODE Same as -e, --error
2613
2614 SLURM_STDINMODE Same as -i, --input
2615
2616 SLURM_SPREAD_JOB Same as --spread-job
2617
2618 SLURM_SRUN_REDUCE_TASK_EXIT_MSG
2619 if set and non-zero, successive task exit mes‐
2620 sages with the same exit code will be printed
2621 only once.
2622
2623 SLURM_STEP_GRES Same as --gres (only applies to job steps, not to
2624 job allocations). Also see SLURM_GRES
2625
2626 SLURM_STEP_KILLED_MSG_NODE_ID=ID
2627 If set, only the specified node will log when the
2628 job or step are killed by a signal.
2629
2630 SLURM_STDOUTMODE Same as -o, --output
2631
2632 SLURM_TASK_EPILOG Same as --task-epilog
2633
2634 SLURM_TASK_PROLOG Same as --task-prolog
2635
2636 SLURM_TEST_EXEC If defined, srun will verify existence of the
2637 executable program along with user execute per‐
2638 mission on the node where srun was called before
2639 attempting to launch it on nodes in the step.
2640
2641 SLURM_THREAD_SPEC Same as --thread-spec
2642
2643 SLURM_THREADS Same as -T, --threads
2644
2645 SLURM_TIMELIMIT Same as -t, --time
2646
2647 SLURM_UNBUFFEREDIO Same as -u, --unbuffered
2648
2649 SLURM_USE_MIN_NODES Same as --use-min-nodes
2650
2651 SLURM_WAIT Same as -W, --wait
2652
2653 SLURM_WAIT4SWITCH Max time waiting for requested switches. See
2654 --switches
2655
2656       SLURM_WCKEY            Same as --wckey
2657
2658       SLURM_WORKING_DIR      Same as -D, --chdir
2659
2660 SRUN_EXPORT_ENV Same as --export, and will override any setting
2661                              for SLURM_EXPORT_ENV.
2662
2663
2664
2666 srun will set some environment variables in the environment of the exe‐
2667 cuting tasks on the remote compute nodes. These environment variables
2668 are:
2669
2670
2671 SLURM_*_PACK_GROUP_# For a heterogeneous job allocation, the environ‐
2672 ment variables are set separately for each compo‐
2673 nent.
2674
2675 SLURM_CLUSTER_NAME Name of the cluster on which the job is execut‐
2676 ing.
2677
2678 SLURM_CPU_BIND_VERBOSE
2679 --cpu-bind verbosity (quiet,verbose).
2680
2681 SLURM_CPU_BIND_TYPE --cpu-bind type (none,rank,map_cpu:,mask_cpu:).
2682
2683 SLURM_CPU_BIND_LIST --cpu-bind map or mask list (list of Slurm CPU
2684 IDs or masks for this node, CPU_ID = Board_ID x
2685 threads_per_board + Socket_ID x
2686 threads_per_socket + Core_ID x threads_per_core +
2687 Thread_ID).
2688
2689
2690 SLURM_CPU_FREQ_REQ Contains the value requested for cpu frequency on
2691 the srun command as a numerical frequency in
2692 kilohertz, or a coded value for a request of low,
2693 medium, highm1 or high for the frequency. See the
2694 description of the --cpu-freq option or the
2695 SLURM_CPU_FREQ_REQ input environment variable.
2696
2697 SLURM_CPUS_ON_NODE Count of processors available to the job on this
2698 node. Note the select/linear plugin allocates
2699 entire nodes to jobs, so the value indicates the
2700 total count of CPUs on the node. For the
2701 select/cons_res plugin, this number indicates the
2702 number of cores on this node allocated to the
2703 job.
2704
2705 SLURM_CPUS_PER_GPU Number of CPUs requested per allocated GPU. Only
2706 set if the --cpus-per-gpu option is specified.
2707
2708 SLURM_CPUS_PER_TASK Number of cpus requested per task. Only set if
2709 the --cpus-per-task option is specified.
2710
2711 SLURM_DISTRIBUTION Distribution type for the allocated jobs. Set the
2712 distribution with -m, --distribution.
2713
2714 SLURM_GPUS Number of GPUs requested. Only set if the -G,
2715 --gpus option is specified.
2716
2717 SLURM_GPU_BIND Requested binding of tasks to GPU. Only set if
2718 the --gpu-bind option is specified.
2719
2720 SLURM_GPU_FREQ Requested GPU frequency. Only set if the
2721 --gpu-freq option is specified.
2722
2723 SLURM_GPUS_PER_NODE Requested GPU count per allocated node. Only set
2724 if the --gpus-per-node option is specified.
2725
2726 SLURM_GPUS_PER_SOCKET Requested GPU count per allocated socket. Only
2727 set if the --gpus-per-socket option is specified.
2728
2729 SLURM_GPUS_PER_TASK Requested GPU count per allocated task. Only set
2730 if the --gpus-per-task option is specified.
2731
2732 SLURM_GTIDS Global task IDs running on this node. Zero ori‐
2733 gin and comma separated.
2734
2735 SLURM_JOB_ACCOUNT Account name associated with the job allocation.
2736
2737 SLURM_JOB_CPUS_PER_NODE
2738 Number of CPUs per node.
2739
2740 SLURM_JOB_DEPENDENCY Set to value of the --dependency option.
2741
2742 SLURM_JOB_ID (and SLURM_JOBID for backwards compatibility)
2743 Job id of the executing job.
2744
2745
2746 SLURM_JOB_NAME Set to the value of the --job-name option or the
2747 command name when srun is used to create a new
2748 job allocation. Not set when srun is used only to
2749 create a job step (i.e. within an existing job
2750 allocation).
2751
2752
2753 SLURM_JOB_PARTITION Name of the partition in which the job is run‐
2754 ning.
2755
2756
2757 SLURM_JOB_QOS Quality Of Service (QOS) of the job allocation.
2758
2759 SLURM_JOB_RESERVATION Advanced reservation containing the job alloca‐
2760 tion, if any.
2761
2762
2763 SLURM_LAUNCH_NODE_IPADDR
2764 IP address of the node from which the task launch
2765 was initiated (where the srun command ran from).
2766
2767 SLURM_LOCALID Node local task ID for the process within a job.
2768
2769
2770 SLURM_MEM_BIND_LIST --mem-bind map or mask list (<list of IDs or
2771 masks for this node>).
2772
2773 SLURM_MEM_BIND_PREFER --mem-bind prefer (prefer).
2774
2775 SLURM_MEM_BIND_SORT Sort free cache pages (run zonesort on Intel KNL
2776 nodes).
2777
2778 SLURM_MEM_BIND_TYPE --mem-bind type (none,rank,map_mem:,mask_mem:).
2779
2780 SLURM_MEM_BIND_VERBOSE
2781 --mem-bind verbosity (quiet,verbose).
2782
2783 SLURM_MEM_PER_GPU Requested memory per allocated GPU. Only set if
2784 the --mem-per-gpu option is specified.
2785
2786 SLURM_JOB_NODES Total number of nodes in the job's resource allo‐
2787 cation.
2788
2789 SLURM_NODE_ALIASES Sets of node name, communication address and
2790 hostname for nodes allocated to the job from the
2791 cloud. Each element in the set is colon separated
2792 and each set is comma separated. For example:
2793 SLURM_NODE_ALIASES=
2794 ec0:1.2.3.4:foo,ec1:1.2.3.5:bar
2795
2796 SLURM_NODEID The relative node ID of the current node.
2797
2798 SLURM_JOB_NODELIST List of nodes allocated to the job.
2799
2800 SLURM_NTASKS (and SLURM_NPROCS for backwards compatibility)
2801 Total number of processes in the current job or
2802 job step.
2803
2804 SLURM_PACK_SIZE Set to count of components in heterogeneous job.
2805
2806 SLURM_PRIO_PROCESS The scheduling priority (nice value) at the time
2807 of job submission. This value is propagated to
2808 the spawned processes.
2809
2810 SLURM_PROCID The MPI rank (or relative process ID) of the cur‐
2811 rent process.
2812
2813 SLURM_SRUN_COMM_HOST IP address of srun communication host.
2814
2815 SLURM_SRUN_COMM_PORT srun communication port.
2816
2817 SLURM_STEP_LAUNCHER_PORT
2818 Step launcher port.
2819
2820 SLURM_STEP_NODELIST List of nodes allocated to the step.
2821
2822 SLURM_STEP_NUM_NODES Number of nodes allocated to the step.
2823
2824 SLURM_STEP_NUM_TASKS Number of processes in the step.
2825
2826 SLURM_STEP_TASKS_PER_NODE
2827 Number of processes per node within the step.
2828
2829 SLURM_STEP_ID (and SLURM_STEPID for backwards compatibility)
2830 The step ID of the current job.
2831
2832 SLURM_SUBMIT_DIR The directory from which srun was invoked or, if
2833 applicable, the directory specified by the -D,
2834 --chdir option.
2835
2836 SLURM_SUBMIT_HOST The hostname of the computer from which srun
2837 was invoked.
2838
2839 SLURM_TASK_PID The process ID of the task being started.
2840
2841 SLURM_TASKS_PER_NODE Number of tasks to be initiated on each node.
2842 Values are comma separated and in the same order
2843 as SLURM_JOB_NODELIST. If two or more consecu‐
2844 tive nodes are to have the same task count, that
2845 count is followed by "(x#)" where "#" is the rep‐
2846 etition count. For example,
2847 "SLURM_TASKS_PER_NODE=2(x3),1" indicates that the
2848 first three nodes will each execute two tasks
2849 and the fourth node will execute one task.
2850
2851
2852 SLURM_TOPOLOGY_ADDR This is set only if the system has the topol‐
2853 ogy/tree plugin configured. The value will be
2854 set to the names of the network switches which
2855 may be involved in the job's communications, from
2856 the system's top level switch down to the leaf
2857 switch and ending with the node name. A period is
2858 used to separate each hardware component name.
2859
2860 SLURM_TOPOLOGY_ADDR_PATTERN
2861 This is set only if the system has the topol‐
2862 ogy/tree plugin configured. The value will be
2863 set to the component types listed in SLURM_TOPOL‐
2864 OGY_ADDR. Each component will be identified as
2865 either "switch" or "node". A period is used to
2866 separate each hardware component type.
2867
2868 SLURM_UMASK The umask in effect when the job was submitted.
2869
2870 SLURMD_NODENAME Name of the node running the task. In the case of
2871 a parallel job executing on multiple compute
2872 nodes, the various tasks will have this environ‐
2873 ment variable set to different values on each
2874 compute node.
2875
2876 SRUN_DEBUG Set to the logging level of the srun command.
2877 Default value is 3 (info level). The value is
2878 incremented or decremented based upon the --ver‐
2879 bose and --quiet options.
2880
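A quick way to inspect several of these variables is to print them from
within a job step. The sketch below is only illustrative (the step size
-N2 -n4 is arbitrary); it echoes the node name, global rank, task count
and node-local ID for every task:

     > srun -N2 -n4 bash -c \
         'echo "node=$SLURMD_NODENAME rank=$SLURM_PROCID/$SLURM_NTASKS localid=$SLURM_LOCALID"'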
2881
2882 SIGNALS AND ESCAPE SEQUENCES
2883 Signals sent to the srun command are automatically forwarded to the
2884 tasks it is controlling with a few exceptions. The escape sequence
2885 <control-c> will report the state of all tasks associated with the srun
2886 command. If <control-c> is entered twice within one second, then the
2887 associated SIGINT signal will be sent to all tasks and a termination
2888 sequence will be entered sending SIGCONT, SIGTERM, and SIGKILL to all
2889 spawned tasks. If a third <control-c> is received, the srun program
2890 will be terminated without waiting for remote tasks to exit or their
2891 I/O to complete.
2892
2893 The escape sequence <control-z> is presently ignored. Our intent is for
2894 this to put the srun command into a mode where various special actions
2895 may be invoked.
2896
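As an illustration of this forwarding, the sketch below runs tasks that
trap SIGTERM so they can clean up before exiting. The script name is
hypothetical; the signal is delivered to the srun process itself, which
then forwards it to every task:

     > cat graceful.sh
     #!/bin/sh
     # srun forwards the TERM it receives to each task, triggering this trap.
     trap 'echo "task $SLURM_PROCID exiting cleanly"; exit 0' TERM
     sleep 600 &
     wait

     > srun -n4 graceful.sh &
     > kill -TERM %1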
2897
2898 MPI SUPPORT
2899 MPI use depends upon the type of MPI being used. There are three fun‐
2900 damentally different modes of operation used by these various MPI
2901 implementations.
2902
2903 1. Slurm directly launches the tasks and performs initialization of
2904 communications through the PMI2 or PMIx APIs. For example: "srun -n16
2905 a.out".
2906
2907 2. Slurm creates a resource allocation for the job and then mpirun
2908 launches tasks using Slurm's infrastructure (OpenMPI).
2909
2910 3. Slurm creates a resource allocation for the job and then mpirun
2911 launches tasks using some mechanism other than Slurm, such as SSH or
2912 RSH. These tasks are initiated outside of Slurm's monitoring or con‐
2913 trol. Slurm's epilog should be configured to purge these tasks when the
2914 job's allocation is relinquished, or the use of pam_slurm_adopt is
2915 highly recommended.
2916
2917 See https://slurm.schedmd.com/mpi_guide.html for more information on
2918 use of these various MPI implementations with Slurm.
2919
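A rough sketch of the first two modes follows. The application name
a.out is assumed, and the MPI plugin names actually available depend on
how Slurm and the MPI library were built:

     # Mode 1: srun launches the tasks directly and initializes
     # communications through PMIx (or pmi2, if configured).
     > srun --mpi=pmix -n16 a.out

     # Mode 2: Slurm only provides the allocation; mpirun then
     # launches the tasks using Slurm's infrastructure.
     > salloc -n16
     > mpirun -np $SLURM_NTASKS a.out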
2920
2921 MULTIPLE PROGRAM CONFIGURATION
2922 Comments in the configuration file must have a "#" in column one. The
2923 configuration file contains the following fields separated by white
2924 space:
2925
2926 Task rank
2927 One or more task ranks to use this configuration. Multiple val‐
2928 ues may be comma separated. Ranges may be indicated with two
2929 numbers separated with a '-' with the smaller number first (e.g.
2930 "0-4" and not "4-0"). To indicate all tasks not otherwise spec‐
2931 ified, specify a rank of '*' as the last line of the file. If
2932 an attempt is made to initiate a task for which no executable
2933 program is defined, the following error message will be produced
2934 "No executable program specified for this task".
2935
2936 Executable
2937 The name of the program to execute. May be a fully qualified
2938 pathname if desired.
2939
2940 Arguments
2941 Program arguments. The expression "%t" will be replaced with
2942 the task's number. The expression "%o" will be replaced with
2943 the task's offset within this range (e.g. a configured task rank
2944 value of "1-5" would have offset values of "0-4"). Single
2945 quotes may be used to avoid having the enclosed values inter‐
2946 preted. This field is optional. Any arguments for the program
2947 entered on the command line will be added to the arguments spec‐
2948 ified in the configuration file.
2949
2950 For example:
2951 ###################################################################
2952 # srun multiple program configuration file
2953 #
2954 # srun -n8 -l --multi-prog silly.conf
2955 ###################################################################
2956 4-6 hostname
2957 1,7 echo task:%t
2958 0,2-3 echo offset:%o
2959
2960 > srun -n8 -l --multi-prog silly.conf
2961 0: offset:0
2962 1: task:1
2963 2: offset:1
2964 3: offset:2
2965 4: linux15.llnl.gov
2966 5: linux16.llnl.gov
2967 6: linux17.llnl.gov
2968 7: task:7
2969
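A catch-all rank of '*' may be given for any tasks not listed
explicitly, as described above. The program names in this second sketch
(controller and worker) and the file name catchall.conf are only
illustrative:

     ###################################################################
     # srun -n8 --multi-prog catchall.conf
     ###################################################################
     0   ./controller
     *   ./worker %t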
2970
2971
2972
2973 EXAMPLES
2974 This simple example demonstrates the execution of the command hostname
2975 in eight tasks. At least eight processors will be allocated to the job
2976 (the same as the task count) on however many nodes are required to sat‐
2977 isfy the request. The output of each task will be preceded by its
2978 task number. (The machine "dev" in the example below has a total of
2979 two CPUs per node)
2980
2981
2982 > srun -n8 -l hostname
2983 0: dev0
2984 1: dev0
2985 2: dev1
2986 3: dev1
2987 4: dev2
2988 5: dev2
2989 6: dev3
2990 7: dev3
2991
2992
2993 The srun -r option is used within a job script to run two job steps on
2994 disjoint nodes in the following example. The script is run using allo‐
2995 cate mode instead of as a batch job in this case.
2996
2997
2998 > cat test.sh
2999 #!/bin/sh
3000 echo $SLURM_JOB_NODELIST
3001 srun -lN2 -r2 hostname
3002 srun -lN2 hostname
3003
3004 > salloc -N4 test.sh
3005 dev[7-10]
3006 0: dev9
3007 1: dev10
3008 0: dev7
3009 1: dev8
3010
3011
3012 The following script runs two job steps in parallel within an allocated
3013 set of nodes.
3014
3015
3016 > cat test.sh
3017 #!/bin/bash
3018 srun -lN2 -n4 -r 2 sleep 60 &
3019 srun -lN2 -r 0 sleep 60 &
3020 sleep 1
3021 squeue
3022 squeue -s
3023 wait
3024
3025 > salloc -N4 test.sh
3026 JOBID PARTITION NAME USER ST TIME NODES NODELIST
3027 65641 batch test.sh grondo R 0:01 4 dev[7-10]
3028
3029 STEPID PARTITION USER TIME NODELIST
3030 65641.0 batch grondo 0:01 dev[7-8]
3031 65641.1 batch grondo 0:01 dev[9-10]
3032
3033
3034 This example demonstrates how one executes a simple MPI job. We use
3035 srun to build a list of machines (nodes) to be used by mpirun in its
3036 required format. A sample command line and the script to be executed
3037 follow.
3038
3039
3040 > cat test.sh
3041 #!/bin/sh
3042 MACHINEFILE="nodes.$SLURM_JOB_ID"
3043
3044 # Generate Machinefile for mpi such that hosts are in the same
3045 # order as if run via srun
3046 #
3047 srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
3048
3049 # Run using generated Machine file:
3050 mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE mpi-app
3051
3052 rm $MACHINEFILE
3053
3054 > salloc -N2 -n4 test.sh
3055
3056
3057 This simple example demonstrates the execution of different jobs on
3058 different nodes in the same srun. You can do this for any number of
3059 nodes or any number of jobs. The executables are placed on the nodes
3060 selected by the SLURM_NODEID environment variable, which ranges from
3061 0 up to one less than the node count given on the srun command line.
3062
3063
3064 > cat test.sh
#!/bin/sh
3065 case $SLURM_NODEID in
3066 0) echo "I am running on "
3067 hostname ;;
3068 1) hostname
3069 echo "is where I am running" ;;
3070 esac
3071
3072 > srun -N2 test.sh
3073 dev0
3074 is where I am running
3075 I am running on
3076 dev1
3077
3078
3079 This example demonstrates use of multi-core options to control layout
3080 of tasks. We request that four sockets per node and two cores per
3081 socket be dedicated to the job.
3082
3083
3084 > srun -N2 -B 4-4:2-2 a.out
3085
3086 This example shows a script in which Slurm is used to provide resource
3087 management for a job by executing the various job steps as processors
3088 become available for their dedicated use.
3089
3090
3091 > cat my.script
3092 #!/bin/bash
3093 srun --exclusive -n4 prog1 &
3094 srun --exclusive -n3 prog2 &
3095 srun --exclusive -n1 prog3 &
3096 srun --exclusive -n1 prog4 &
3097 wait
3098
3099
3100 This example shows how to launch an application called "master" with
3101 one task, 16 CPUs and 16 GB of memory (1 GB per CPU) plus another
3102 application called "slave" with 16 tasks, 1 CPU per task (the default)
3103 and 1 GB of memory per task.
3104
3105
3106 > srun -n1 -c16 --mem-per-cpu=1gb master : -n16 --mem-per-cpu=1gb slave
3107
3108
3109 COPYING
3110 Copyright (C) 2006-2007 The Regents of the University of California.
3111 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
3112 Copyright (C) 2008-2010 Lawrence Livermore National Security.
3113 Copyright (C) 2010-2015 SchedMD LLC.
3114
3115 This file is part of Slurm, a resource management program. For
3116 details, see <https://slurm.schedmd.com/>.
3117
3118 Slurm is free software; you can redistribute it and/or modify it under
3119 the terms of the GNU General Public License as published by the Free
3120 Software Foundation; either version 2 of the License, or (at your
3121 option) any later version.
3122
3123 Slurm is distributed in the hope that it will be useful, but WITHOUT
3124 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
3125 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
3126 for more details.
3127
3128
3129 SEE ALSO
3130 salloc(1), sattach(1), sbatch(1), sbcast(1), scancel(1), scontrol(1),
3131 squeue(1), slurm.conf(5), sched_setaffinity(2), numa(3), getrlimit(2)
3132
3133
3134
3135October 2019 Slurm Commands srun(1)