srun(1)                         Slurm Commands                         srun(1)


NAME
srun - Run parallel jobs


SYNOPSIS
srun [OPTIONS(0)... [executable(0) [args(0)...]]] [ : [OPTIONS(N)...]]
executable(N) [args(N)...]

Option(s) define multiple jobs in a co-scheduled heterogeneous job.
For more details about heterogeneous jobs see the document
https://slurm.schedmd.com/heterogeneous_jobs.html


DESCRIPTION
Run a parallel job on a cluster managed by Slurm. If necessary, srun
will first create a resource allocation in which to run the parallel
job.

The following document describes the influence of various options on
the allocation of cpus to jobs and tasks:
https://slurm.schedmd.com/cpu_management.html


RETURN VALUE
srun will return the highest exit code of all tasks run or the highest
signal (with the high-order bit set in an 8-bit integer -- e.g. 128 +
signal) of any task that exited with a signal.
The value 253 is reserved for out-of-memory errors.
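
For example (an illustrative sketch; ./my_app is a placeholder for your own executable): if one task of

    srun -n 4 ./my_app

is killed by SIGSEGV (signal 11) and no task returns a higher value, srun exits with 139 (128 + 11), which a calling script can test via $?.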


EXECUTABLE PATH RESOLUTION
The executable is resolved in the following order:

1. If executable starts with ".", then path is constructed as: current
   working directory / executable
2. If executable starts with a "/", then path is considered absolute.
3. If executable can be resolved through PATH. See path_resolution(7).
4. If executable is in current working directory.

The current working directory is the calling process's working directory,
unless the --chdir argument is passed, which overrides it.
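
For example (hypothetical names, shown only to illustrate the resolution order):

    srun -N1 ./my_app          # relative to the current working directory
    srun -N1 /opt/apps/my_app  # absolute path, used as given
    srun -N1 my_app            # searched for in PATH, then in the current directory

The name my_app and the directory /opt/apps are placeholders, not real installations.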


OPTIONS
50 --accel-bind=<options>
51 Control how tasks are bound to generic resources of type gpu,
52 mic and nic. Multiple options may be specified. Supported op‐
53 tions include:
54
55 g Bind each task to GPUs which are closest to the allocated
56 CPUs.
57
58 m Bind each task to MICs which are closest to the allocated
59 CPUs.
60
61 n Bind each task to NICs which are closest to the allocated
62 CPUs.
63
64 v Verbose mode. Log how tasks are bound to GPU and NIC de‐
65 vices.
66
67 This option applies to job allocations.
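
For example, assuming the single-letter flags are simply concatenated and my_app is a placeholder, the following binds each task to its closest GPU (g) and NIC (n) and logs the binding (v):

    srun --accel-bind=gnv -n 8 ./my_app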
68
69
70 -A, --account=<account>
71 Charge resources used by this job to specified account. The ac‐
72 count is an arbitrary string. The account name may be changed
73 after job submission using the scontrol command. This option ap‐
74 plies to job allocations.
75
76
77 --acctg-freq
78 Define the job accounting and profiling sampling intervals.
79 This can be used to override the JobAcctGatherFrequency parame‐
80 ter in Slurm's configuration file, slurm.conf. The supported
format is as follows:
82
83 --acctg-freq=<datatype>=<interval>
84 where <datatype>=<interval> specifies the task sam‐
85 pling interval for the jobacct_gather plugin or a
86 sampling interval for a profiling type by the
87 acct_gather_profile plugin. Multiple, comma-sepa‐
88 rated <datatype>=<interval> intervals may be speci‐
89 fied. Supported datatypes are as follows:
90
91 task=<interval>
92 where <interval> is the task sampling inter‐
93 val in seconds for the jobacct_gather plugins
94 and for task profiling by the
acct_gather_profile plugin. NOTE: This
frequency is used to monitor memory usage.
If memory limits are enforced, the highest
frequency a user can request is the one
configured in the slurm.conf file; sampling
cannot be turned off (=0) in that case.
101
102 energy=<interval>
103 where <interval> is the sampling interval in
104 seconds for energy profiling using the
105 acct_gather_energy plugin
106
107 network=<interval>
108 where <interval> is the sampling interval in
109 seconds for infiniband profiling using the
110 acct_gather_interconnect plugin.
111
112 filesystem=<interval>
113 where <interval> is the sampling interval in
114 seconds for filesystem profiling using the
115 acct_gather_filesystem plugin.
116
The default value for the task sampling interval is 30 seconds.
The default value for all other intervals is 0. An interval of 0
disables sampling of the specified type. If the task sampling
interval is 0, accounting information is collected only at job
termination (reducing Slurm interference with the job).
Smaller (non-zero) values have a greater impact upon job
performance, but a value of 30 seconds is not likely to be
noticeable for applications having less than 10,000 tasks. This
option applies to job allocations.
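
For example, to sample task (and memory) usage every 10 seconds and energy use every 60 seconds (the executable name is a placeholder):

    srun --acctg-freq=task=10,energy=60 -n 16 ./my_app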
128
129
-B, --extra-node-info=<sockets[:cores[:threads]]>
131 Restrict node selection to nodes with at least the specified
132 number of sockets, cores per socket and/or threads per core.
133 NOTE: These options do not specify the resource allocation size.
134 Each value specified is considered a minimum. An asterisk (*)
135 can be used as a placeholder indicating that all available re‐
136 sources of that type are to be utilized. Values can also be
137 specified as min-max. The individual levels can also be speci‐
138 fied in separate options if desired:
139 --sockets-per-node=<sockets>
140 --cores-per-socket=<cores>
141 --threads-per-core=<threads>
142 If task/affinity plugin is enabled, then specifying an alloca‐
143 tion in this manner also sets a default --cpu-bind option of
144 threads if the -B option specifies a thread count, otherwise an
145 option of cores if a core count is specified, otherwise an op‐
146 tion of sockets. If SelectType is configured to se‐
147 lect/cons_res, it must have a parameter of CR_Core, CR_Core_Mem‐
148 ory, CR_Socket, or CR_Socket_Memory for this option to be hon‐
149 ored. If not specified, the scontrol show job will display
150 'ReqS:C:T=*:*:*'. This option applies to job allocations. NOTE:
151 This option is mutually exclusive with --hint,
152 --threads-per-core and --ntasks-per-core.
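
For example, to restrict selection to nodes with at least 2 sockets, 8 cores per socket and 2 threads per core (values chosen purely for illustration):

    srun -B 2:8:2 -n 4 ./my_app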
153
154
155 --bb=<spec>
156 Burst buffer specification. The form of the specification is
157 system dependent. Also see --bbf. This option applies to job
158 allocations.
159
160
161 --bbf=<file_name>
162 Path of file containing burst buffer specification. The form of
163 the specification is system dependent. Also see --bb. This op‐
164 tion applies to job allocations.
165
166
167 --bcast[=<dest_path>]
168 Copy executable file to allocated compute nodes. If a file name
169 is specified, copy the executable to the specified destination
170 file path. If the path specified ends with '/' it is treated as
171 a target directory, and the destination file name will be
172 slurm_bcast_<job_id>.<step_id>_<nodename>. If no dest_path is
173 specified, then the current working directory is used, and the
174 filename follows the above pattern. For example, "srun
175 --bcast=/tmp/mine -N3 a.out" will copy the file "a.out" from
176 your current directory to the file "/tmp/mine" on each of the
177 three allocated compute nodes and execute that file. This option
178 applies to step allocations.
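
As a further illustration, a destination ending in '/' is treated as a directory; here /tmp/bcast/ is an assumed, already-existing directory on the compute nodes:

    srun --bcast=/tmp/bcast/ -N3 a.out

copies a.out to /tmp/bcast/slurm_bcast_<job_id>.<step_id>_<nodename> on each of the three allocated nodes and executes it.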
179
180
181 -b, --begin=<time>
182 Defer initiation of this job until the specified time. It ac‐
183 cepts times of the form HH:MM:SS to run a job at a specific time
184 of day (seconds are optional). (If that time is already past,
185 the next day is assumed.) You may also specify midnight, noon,
186 fika (3 PM) or teatime (4 PM) and you can have a time-of-day
187 suffixed with AM or PM for running in the morning or the
188 evening. You can also say what day the job will be run, by
specifying a date of the form MMDDYY or MM/DD/YY or YYYY-MM-DD.
190 Combine date and time using the following format
191 YYYY-MM-DD[THH:MM[:SS]]. You can also give times like now +
192 count time-units, where the time-units can be seconds (default),
193 minutes, hours, days, or weeks and you can tell Slurm to run the
194 job today with the keyword today and to run the job tomorrow
195 with the keyword tomorrow. The value may be changed after job
196 submission using the scontrol command. For example:
197 --begin=16:00
198 --begin=now+1hour
199 --begin=now+60 (seconds by default)
200 --begin=2010-01-20T12:34:00
201
202
203 Notes on date/time specifications:
204 - Although the 'seconds' field of the HH:MM:SS time specifica‐
205 tion is allowed by the code, note that the poll time of the
206 Slurm scheduler is not precise enough to guarantee dispatch of
207 the job on the exact second. The job will be eligible to start
208 on the next poll following the specified time. The exact poll
209 interval depends on the Slurm scheduler (e.g., 60 seconds with
210 the default sched/builtin).
211 - If no time (HH:MM:SS) is specified, the default is
212 (00:00:00).
213 - If a date is specified without a year (e.g., MM/DD) then the
214 current year is assumed, unless the combination of MM/DD and
215 HH:MM:SS has already passed for that year, in which case the
216 next year is used.
217 This option applies to job allocations.
218
219
220 --cluster-constraint=<list>
221 Specifies features that a federated cluster must have to have a
222 sibling job submitted to it. Slurm will attempt to submit a sib‐
223 ling job to a cluster if it has at least one of the specified
224 features.
225
226
227 --comment=<string>
228 An arbitrary comment. This option applies to job allocations.
229
230
231 --compress[=type]
232 Compress file before sending it to compute hosts. The optional
233 argument specifies the data compression library to be used.
234 Supported values are "lz4" (default) and "zlib". Some compres‐
235 sion libraries may be unavailable on some systems. For use with
236 the --bcast option. This option applies to step allocations.
237
238
239 -C, --constraint=<list>
240 Nodes can have features assigned to them by the Slurm adminis‐
241 trator. Users can specify which of these features are required
242 by their job using the constraint option. Only nodes having
243 features matching the job constraints will be used to satisfy
244 the request. Multiple constraints may be specified with AND,
245 OR, matching OR, resource counts, etc. (some operators are not
246 supported on all system types). Supported constraint options
247 include:
248
249 Single Name
250 Only nodes which have the specified feature will be used.
251 For example, --constraint="intel"
252
253 Node Count
254 A request can specify the number of nodes needed with
255 some feature by appending an asterisk and count after the
256 feature name. For example, --nodes=16 --con‐
257 straint="graphics*4 ..." indicates that the job requires
258 16 nodes and that at least four of those nodes must have
259 the feature "graphics."
260
AND    Only nodes with all of the specified features will be
used. The ampersand is used for an AND operator. For
example, --constraint="intel&gpu"
264
OR     Only nodes with at least one of the specified features
will be used. The vertical bar is used for an OR operator.
For example, --constraint="intel|amd"
268
269 Matching OR
270 If only one of a set of possible options should be used
271 for all allocated nodes, then use the OR operator and en‐
272 close the options within square brackets. For example,
273 --constraint="[rack1|rack2|rack3|rack4]" might be used to
274 specify that all nodes must be allocated on a single rack
275 of the cluster, but any of those four racks can be used.
276
277 Multiple Counts
278 Specific counts of multiple resources may be specified by
279 using the AND operator and enclosing the options within
280 square brackets. For example, --con‐
281 straint="[rack1*2&rack2*4]" might be used to specify that
282 two nodes must be allocated from nodes with the feature
283 of "rack1" and four nodes must be allocated from nodes
284 with the feature "rack2".
285
286 NOTE: This construct does not support multiple Intel KNL
287 NUMA or MCDRAM modes. For example, while --con‐
288 straint="[(knl&quad)*2&(knl&hemi)*4]" is not supported,
289 --constraint="[haswell*2&(knl&hemi)*4]" is supported.
290 Specification of multiple KNL modes requires the use of a
291 heterogeneous job.
292
293 Brackets
294 Brackets can be used to indicate that you are looking for
295 a set of nodes with the different requirements contained
296 within the brackets. For example, --con‐
297 straint="[(rack1|rack2)*1&(rack3)*2]" will get you one
298 node with either the "rack1" or "rack2" features and two
299 nodes with the "rack3" feature. The same request without
300 the brackets will try to find a single node that meets
301 those requirements.
302
Parentheses
Parentheses can be used to group like node features
together. For example,
--constraint="[(knl&snc4&flat)*4&haswell*1]" might be used
to specify that four nodes with the features "knl", "snc4"
and "flat" plus one node with the feature "haswell" are
required. All options within parentheses should be
grouped with AND (e.g. "&") operators.
311
312 WARNING: When srun is executed from within salloc or sbatch, the
313 constraint value can only contain a single feature name. None of
314 the other operators are currently supported for job steps.
315 This option applies to job and step allocations.
316
317
318 --contiguous
319 If set, then the allocated nodes must form a contiguous set.
320
321 NOTE: If SelectPlugin=cons_res this option won't be honored with
322 the topology/tree or topology/3d_torus plugins, both of which
323 can modify the node ordering. This option applies to job alloca‐
324 tions.
325
326
327 --cores-per-socket=<cores>
328 Restrict node selection to nodes with at least the specified
329 number of cores per socket. See additional information under -B
330 option above when task/affinity plugin is enabled. This option
331 applies to job allocations.
332
333
334 --cpu-bind=[{quiet,verbose},]type
335 Bind tasks to CPUs. Used only when the task/affinity or
336 task/cgroup plugin is enabled. NOTE: To have Slurm always re‐
337 port on the selected CPU binding for all commands executed in a
338 shell, you can enable verbose mode by setting the SLURM_CPU_BIND
339 environment variable value to "verbose".
340
341 The following informational environment variables are set when
342 --cpu-bind is in use:
343 SLURM_CPU_BIND_VERBOSE
344 SLURM_CPU_BIND_TYPE
345 SLURM_CPU_BIND_LIST
346
347 See the ENVIRONMENT VARIABLES section for a more detailed de‐
348 scription of the individual SLURM_CPU_BIND variables. These
variables are available only if the task/affinity plugin is con‐
350 figured.
351
352 When using --cpus-per-task to run multithreaded tasks, be aware
353 that CPU binding is inherited from the parent of the process.
354 This means that the multithreaded task should either specify or
355 clear the CPU binding itself to avoid having all threads of the
356 multithreaded task use the same mask/CPU as the parent. Alter‐
357 natively, fat masks (masks which specify more than one allowed
358 CPU) could be used for the tasks in order to provide multiple
359 CPUs for the multithreaded tasks.
360
361 Note that a job step can be allocated different numbers of CPUs
362 on each node or be allocated CPUs not starting at location zero.
363 Therefore one of the options which automatically generate the
364 task binding is recommended. Explicitly specified masks or
365 bindings are only honored when the job step has been allocated
366 every available CPU on the node.
367
368 Binding a task to a NUMA locality domain means to bind the task
369 to the set of CPUs that belong to the NUMA locality domain or
370 "NUMA node". If NUMA locality domain options are used on sys‐
371 tems with no NUMA support, then each socket is considered a lo‐
372 cality domain.
373
374 If the --cpu-bind option is not used, the default binding mode
375 will depend upon Slurm's configuration and the step's resource
376 allocation. If all allocated nodes have the same configured
377 CpuBind mode, that will be used. Otherwise if the job's Parti‐
378 tion has a configured CpuBind mode, that will be used. Other‐
379 wise if Slurm has a configured TaskPluginParam value, that mode
380 will be used. Otherwise automatic binding will be performed as
381 described below.
382
383
384 Auto Binding
385 Applies only when task/affinity is enabled. If the job
386 step allocation includes an allocation with a number of
387 sockets, cores, or threads equal to the number of tasks
388 times cpus-per-task, then the tasks will by default be
389 bound to the appropriate resources (auto binding). Dis‐
390 able this mode of operation by explicitly setting
391 "--cpu-bind=none". Use TaskPluginParam=auto‐
392 bind=[threads|cores|sockets] to set a default cpu binding
393 in case "auto binding" doesn't find a match.
394
395 Supported options include:
396
397 q[uiet]
398 Quietly bind before task runs (default)
399
400 v[erbose]
401 Verbosely report binding before task runs
402
403 no[ne] Do not bind tasks to CPUs (default unless auto
404 binding is applied)
405
406 rank Automatically bind by task rank. The lowest num‐
407 bered task on each node is bound to socket (or
408 core or thread) zero, etc. Not supported unless
409 the entire node is allocated to the job.
410
411 map_cpu:<list>
412 Bind by setting CPU masks on tasks (or ranks) as
413 specified where <list> is
414 <cpu_id_for_task_0>,<cpu_id_for_task_1>,... CPU
415 IDs are interpreted as decimal values unless they
are preceded with '0x' in which case they are
interpreted as hexadecimal values. If the number of
418 tasks (or ranks) exceeds the number of elements in
419 this list, elements in the list will be reused as
420 needed starting from the beginning of the list.
421 To simplify support for large task counts, the
422 lists may follow a map with an asterisk and repe‐
423 tition count. For example
424 "map_cpu:0x0f*4,0xf0*4". Not supported unless the
425 entire node is allocated to the job.
426
427 mask_cpu:<list>
428 Bind by setting CPU masks on tasks (or ranks) as
429 specified where <list> is
430 <cpu_mask_for_task_0>,<cpu_mask_for_task_1>,...
431 The mapping is specified for a node and identical
432 mapping is applied to the tasks on every node
433 (i.e. the lowest task ID on each node is mapped to
434 the first mask specified in the list, etc.). CPU
435 masks are always interpreted as hexadecimal values
436 but can be preceded with an optional '0x'. If the
437 number of tasks (or ranks) exceeds the number of
438 elements in this list, elements in the list will
439 be reused as needed starting from the beginning of
440 the list. To simplify support for large task
441 counts, the lists may follow a map with an aster‐
442 isk and repetition count. For example
443 "mask_cpu:0x0f*4,0xf0*4". Not supported unless
444 the entire node is allocated to the job.
445
446 rank_ldom
447 Bind to a NUMA locality domain by rank. Not sup‐
448 ported unless the entire node is allocated to the
449 job.
450
451 map_ldom:<list>
452 Bind by mapping NUMA locality domain IDs to tasks
453 as specified where <list> is
454 <ldom1>,<ldom2>,...<ldomN>. The locality domain
455 IDs are interpreted as decimal values unless they
456 are preceded with '0x' in which case they are in‐
457 terpreted as hexadecimal values. Not supported
458 unless the entire node is allocated to the job.
459
460 mask_ldom:<list>
461 Bind by setting NUMA locality domain masks on
462 tasks as specified where <list> is
463 <mask1>,<mask2>,...<maskN>. NUMA locality domain
464 masks are always interpreted as hexadecimal values
465 but can be preceded with an optional '0x'. Not
466 supported unless the entire node is allocated to
467 the job.
468
469 sockets
470 Automatically generate masks binding tasks to
471 sockets. Only the CPUs on the socket which have
472 been allocated to the job will be used. If the
473 number of tasks differs from the number of allo‐
474 cated sockets this can result in sub-optimal bind‐
475 ing.
476
477 cores Automatically generate masks binding tasks to
478 cores. If the number of tasks differs from the
479 number of allocated cores this can result in
480 sub-optimal binding.
481
482 threads
483 Automatically generate masks binding tasks to
484 threads. If the number of tasks differs from the
485 number of allocated threads this can result in
486 sub-optimal binding.
487
488 ldoms Automatically generate masks binding tasks to NUMA
489 locality domains. If the number of tasks differs
490 from the number of allocated locality domains this
491 can result in sub-optimal binding.
492
493 boards Automatically generate masks binding tasks to
494 boards. If the number of tasks differs from the
495 number of allocated boards this can result in
496 sub-optimal binding. This option is supported by
497 the task/cgroup plugin only.
498
499 help Show help message for cpu-bind
500
501 This option applies to job and step allocations.
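
For example, to bind one task per core and report the resulting binding (the executable name is a placeholder):

    srun --cpu-bind=verbose,cores -n 8 ./my_app

An explicit map is also possible; the CPU IDs below are purely illustrative and, as noted above, require the entire node to be allocated to the job:

    srun --cpu-bind=map_cpu:0,2,4,6 -n 4 ./my_app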
502
503
--cpu-freq=<p1[-p2[:p3]]>
505
506 Request that the job step initiated by this srun command be run
507 at some requested frequency if possible, on the CPUs selected
508 for the step on the compute node(s).
509
510 p1 can be [#### | low | medium | high | highm1] which will set
511 the frequency scaling_speed to the corresponding value, and set
512 the frequency scaling_governor to UserSpace. See below for defi‐
513 nition of the values.
514
515 p1 can be [Conservative | OnDemand | Performance | PowerSave]
516 which will set the scaling_governor to the corresponding value.
517 The governor has to be in the list set by the slurm.conf option
518 CpuFreqGovernors.
519
520 When p2 is present, p1 will be the minimum scaling frequency and
521 p2 will be the maximum scaling frequency.
522
p2 can be [#### | medium | high | highm1]. p2 must be greater
than p1.
525
526 p3 can be [Conservative | OnDemand | Performance | PowerSave |
527 UserSpace] which will set the governor to the corresponding
528 value.
529
530 If p3 is UserSpace, the frequency scaling_speed will be set by a
531 power or energy aware scheduling strategy to a value between p1
532 and p2 that lets the job run within the site's power goal. The
533 job may be delayed if p1 is higher than a frequency that allows
534 the job to run within the goal.
535
536 If the current frequency is < min, it will be set to min. Like‐
537 wise, if the current frequency is > max, it will be set to max.
538
539 Acceptable values at present include:
540
541 #### frequency in kilohertz
542
543 Low the lowest available frequency
544
545 High the highest available frequency
546
547 HighM1 (high minus one) will select the next highest
548 available frequency
549
550 Medium attempts to set a frequency in the middle of the
551 available range
552
553 Conservative attempts to use the Conservative CPU governor
554
555 OnDemand attempts to use the OnDemand CPU governor (the de‐
556 fault value)
557
558 Performance attempts to use the Performance CPU governor
559
560 PowerSave attempts to use the PowerSave CPU governor
561
562 UserSpace attempts to use the UserSpace CPU governor
563
564
The following informational environment variable is set in the
job step when the --cpu-freq option is requested.
      SLURM_CPU_FREQ_REQ
569
570 This environment variable can also be used to supply the value
571 for the CPU frequency request if it is set when the 'srun' com‐
572 mand is issued. The --cpu-freq on the command line will over‐
ride the environment variable value. The form of the environ‐
574 ment variable is the same as the command line. See the ENVIRON‐
575 MENT VARIABLES section for a description of the
576 SLURM_CPU_FREQ_REQ variable.
577
578 NOTE: This parameter is treated as a request, not a requirement.
579 If the job step's node does not support setting the CPU fre‐
580 quency, or the requested value is outside the bounds of the le‐
581 gal frequencies, an error is logged, but the job step is allowed
582 to continue.
583
584 NOTE: Setting the frequency for just the CPUs of the job step
585 implies that the tasks are confined to those CPUs. If task con‐
586 finement (i.e., TaskPlugin=task/affinity or TaskPlu‐
587 gin=task/cgroup with the "ConstrainCores" option) is not config‐
588 ured, this parameter is ignored.
589
590 NOTE: When the step completes, the frequency and governor of
591 each selected CPU is reset to the previous values.
592
NOTE: Submitting jobs with the --cpu-freq option when linuxproc
is configured as the ProctrackType can cause jobs to run too
quickly, before accounting is able to poll for job information.
As a result, not all of the accounting information will be
present.
597
598 This option applies to job and step allocations.
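
For example, to ask that the step run with its CPU frequency kept between 2.0 GHz and 2.4 GHz under the OnDemand governor (frequencies in kilohertz; the values and executable are illustrative):

    srun --cpu-freq=2000000-2400000:OnDemand -n 8 ./my_app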
599
600
601 --cpus-per-gpu=<ncpus>
602 Advise Slurm that ensuing job steps will require ncpus proces‐
603 sors per allocated GPU. Not compatible with the --cpus-per-task
604 option.
605
606
607 -c, --cpus-per-task=<ncpus>
608 Request that ncpus be allocated per process. This may be useful
609 if the job is multithreaded and requires more than one CPU per
610 task for optimal performance. The default is one CPU per
611 process. If -c is specified without -n, as many tasks will be
612 allocated per node as possible while satisfying the -c restric‐
613 tion. For instance on a cluster with 8 CPUs per node, a job re‐
614 quest for 4 nodes and 3 CPUs per task may be allocated 3 or 6
615 CPUs per node (1 or 2 tasks per node) depending upon resource
616 consumption by other jobs. Such a job may be unable to execute
617 more than a total of 4 tasks.
618
619 WARNING: There are configurations and options interpreted dif‐
620 ferently by job and job step requests which can result in incon‐
621 sistencies for this option. For example srun -c2
622 --threads-per-core=1 prog may allocate two cores for the job,
623 but if each of those cores contains two threads, the job alloca‐
624 tion will include four CPUs. The job step allocation will then
625 launch two threads per CPU for a total of two tasks.
626
627 WARNING: When srun is executed from within salloc or sbatch,
628 there are configurations and options which can result in incon‐
629 sistent allocations when -c has a value greater than -c on sal‐
630 loc or sbatch.
631
632 This option applies to job allocations.
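
For example, a 4-task job in which each task runs 8 threads (counts chosen only for illustration) would typically be launched as:

    srun -n 4 -c 8 ./my_threaded_app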
633
634
635 --deadline=<OPT>
Remove the job if no ending is possible before this deadline
637 (start > (deadline - time[-min])). Default is no deadline.
638 Valid time formats are:
639 HH:MM[:SS] [AM|PM]
640 MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
641 MM/DD[/YY]-HH:MM[:SS]
YYYY-MM-DD[THH:MM[:SS]]
643 now[+count[seconds(default)|minutes|hours|days|weeks]]
644
645 This option applies only to job allocations.
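
For example, to run a job with a 30-minute time limit only if it can finish before 6 PM on the given (illustrative) date:

    srun --time=30 --deadline=2010-01-20T18:00:00 -n 1 ./my_app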
646
647
648 --delay-boot=<minutes>
Do not reboot nodes in order to satisfy this job's feature
650 specification if the job has been eligible to run for less than
651 this time period. If the job has waited for less than the spec‐
652 ified period, it will use only nodes which already have the
653 specified features. The argument is in units of minutes. A de‐
654 fault value may be set by a system administrator using the de‐
655 lay_boot option of the SchedulerParameters configuration parame‐
656 ter in the slurm.conf file, otherwise the default value is zero
657 (no delay).
658
659 This option applies only to job allocations.
660
661
662 -d, --dependency=<dependency_list>
663 Defer the start of this job until the specified dependencies
have been satisfied. This option does not apply to job
steps (executions of srun within an existing salloc or sbatch
allocation), only to job allocations. <dependency_list> is of
667 the form <type:job_id[:job_id][,type:job_id[:job_id]]> or
668 <type:job_id[:job_id][?type:job_id[:job_id]]>. All dependencies
669 must be satisfied if the "," separator is used. Any dependency
670 may be satisfied if the "?" separator is used. Only one separa‐
671 tor may be used. Many jobs can share the same dependency and
672 these jobs may even belong to different users. The value may
673 be changed after job submission using the scontrol command. De‐
674 pendencies on remote jobs are allowed in a federation. Once a
675 job dependency fails due to the termination state of a preceding
676 job, the dependent job will never be run, even if the preceding
677 job is requeued and has a different termination state in a sub‐
678 sequent execution. This option applies to job allocations.
679
680 after:job_id[[+time][:jobid[+time]...]]
This job can begin execution after the specified jobs
start or are cancelled, and the given 'time' in minutes
from that start or cancellation has elapsed. If no 'time'
is given then there is no delay after start or
cancellation.
685
686 afterany:job_id[:jobid...]
687 This job can begin execution after the specified jobs
688 have terminated.
689
690 afterburstbuffer:job_id[:jobid...]
691 This job can begin execution after the specified jobs
692 have terminated and any associated burst buffer stage out
693 operations have completed.
694
695 aftercorr:job_id[:jobid...]
696 A task of this job array can begin execution after the
697 corresponding task ID in the specified job has completed
698 successfully (ran to completion with an exit code of
699 zero).
700
701 afternotok:job_id[:jobid...]
702 This job can begin execution after the specified jobs
703 have terminated in some failed state (non-zero exit code,
704 node failure, timed out, etc).
705
706 afterok:job_id[:jobid...]
707 This job can begin execution after the specified jobs
708 have successfully executed (ran to completion with an
709 exit code of zero).
710
711 expand:job_id
712 Resources allocated to this job should be used to expand
713 the specified job. The job to expand must share the same
714 QOS (Quality of Service) and partition. Gang scheduling
715 of resources in the partition is also not supported.
716 "expand" is not allowed for jobs that didn't originate on
717 the same cluster as the submitted job.
718
719 singleton
720 This job can begin execution after any previously
721 launched jobs sharing the same job name and user have
722 terminated. In other words, only one job by that name
723 and owned by that user can be running or suspended at any
724 point in time. In a federation, a singleton dependency
725 must be fulfilled on all clusters unless DependencyParam‐
726 eters=disable_remote_singleton is used in slurm.conf.
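
For example, assuming (hypothetical) job ids 12345 and 12346 and placeholder executables:

    srun --dependency=afterok:12345 -n 1 ./postprocess
    srun --dependency=afterany:12345:12346 -n 1 ./cleanup

The first step starts only if job 12345 completed successfully; the second starts after jobs 12345 and 12346 have terminated for any reason.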
727
728
729 -D, --chdir=<path>
730 Have the remote processes do a chdir to path before beginning
731 execution. The default is to chdir to the current working direc‐
732 tory of the srun process. The path can be specified as full path
733 or relative path to the directory where the command is executed.
734 This option applies to job allocations.
735
736
737 -e, --error=<filename pattern>
738 Specify how stderr is to be redirected. By default in interac‐
739 tive mode, srun redirects stderr to the same file as stdout, if
740 one is specified. The --error option is provided to allow stdout
741 and stderr to be redirected to different locations. See IO Re‐
742 direction below for more options. If the specified file already
743 exists, it will be overwritten. This option applies to job and
744 step allocations.
745
746
747 -E, --preserve-env
748 Pass the current values of environment variables
749 SLURM_JOB_NUM_NODES and SLURM_NTASKS through to the executable,
750 rather than computing them from commandline parameters. This op‐
751 tion applies to job allocations.
752
753
754 --exact
755 Allow a step access to only the resources requested for the
756 step. By default, all non-GRES resources on each node in the
757 step allocation will be used. Note that no other parallel step
758 will have access to those CPUs unless --overlap is specified.
759 This option applies to step allocations.
760
761
762 --epilog=<executable>
763 srun will run executable just after the job step completes. The
764 command line arguments for executable will be the command and
765 arguments of the job step. If executable is "none", then no
766 srun epilog will be run. This parameter overrides the SrunEpilog
767 parameter in slurm.conf. This parameter is completely indepen‐
768 dent from the Epilog parameter in slurm.conf. This option ap‐
769 plies to job allocations.
770
771
772
773 --exclusive[=user|mcs]
774 This option applies to job and job step allocations, and has two
775 slightly different meanings for each one. When used to initiate
776 a job, the job allocation cannot share nodes with other running
777 jobs (or just other users with the "=user" option or "=mcs" op‐
778 tion). The default shared/exclusive behavior depends on system
779 configuration and the partition's OverSubscribe option takes
780 precedence over the job's option.
781
782 This option can also be used when initiating more than one job
783 step within an existing resource allocation (default), where you
784 want separate processors to be dedicated to each job step. If
785 sufficient processors are not available to initiate the job
786 step, it will be deferred. This can be thought of as providing a
787 mechanism for resource management to the job within its alloca‐
788 tion (--exact implied).
789
790 The exclusive allocation of CPUs applies to job steps by de‐
791 fault. In order to share the resources use the --overlap option.
792
793 See EXAMPLE below.
794
795
796 --export=<[ALL,]environment variables|ALL|NONE>
797 Identify which environment variables from the submission envi‐
798 ronment are propagated to the launched application.
799
800 --export=ALL
801 Default mode if --export is not specified. All of the
user's environment will be loaded from the caller's environ‐
803 ment.
804
805 --export=NONE
None of the user environment will be defined. The user
must use an absolute path to the binary to be executed
that will define the environment. The user cannot
specify explicit environment variables with NONE.
This option is particularly important for jobs that
are submitted on one cluster and execute on a
different cluster (e.g. with different paths). To
avoid steps inheriting environment export settings
(e.g. NONE) from the sbatch command, either set
--export=ALL or set the environment variable
SLURM_EXPORT_ENV to ALL.
817
818 --export=<[ALL,]environment variables>
819 Exports all SLURM* environment variables along with
820 explicitly defined variables. Multiple environment
821 variable names should be comma separated. Environment
822 variable names may be specified to propagate the cur‐
823 rent value (e.g. "--export=EDITOR") or specific values
824 may be exported (e.g. "--export=EDITOR=/bin/emacs").
825 If ALL is specified, then all user environment vari‐
826 ables will be loaded and will take precedence over any
827 explicitly given environment variables.
828
829 Example: --export=EDITOR,ARG1=test
830 In this example, the propagated environment will only
831 contain the variable EDITOR from the user's environ‐
832 ment, SLURM_* environment variables, and ARG1=test.
833
834 Example: --export=ALL,EDITOR=/bin/emacs
835 There are two possible outcomes for this example. If
836 the caller has the EDITOR environment variable de‐
837 fined, then the job's environment will inherit the
838 variable from the caller's environment. If the caller
839 doesn't have an environment variable defined for EDI‐
840 TOR, then the job's environment will use the value
841 given by --export.
842
843
844 -F, --nodefile=<node file>
845 Much like --nodelist, but the list is contained in a file of
846 name node file. The node names of the list may also span multi‐
847 ple lines in the file. Duplicate node names in the file will
848 be ignored. The order of the node names in the list is not im‐
849 portant; the node names will be sorted by Slurm.
850
851
852 --gid=<group>
853 If srun is run as root, and the --gid option is used, submit the
854 job with group's group access permissions. group may be the
855 group name or the numerical group ID. This option applies to job
856 allocations.
857
858
859 -G, --gpus=[<type>:]<number>
860 Specify the total number of GPUs required for the job. An op‐
861 tional GPU type specification can be supplied. For example
862 "--gpus=volta:3". Multiple options can be requested in a comma
863 separated list, for example: "--gpus=volta:3,kepler:1". See
864 also the --gpus-per-node, --gpus-per-socket and --gpus-per-task
865 options.
866
867
868 --gpu-bind=[verbose,]<type>
869 Bind tasks to specific GPUs. By default every spawned task can
870 access every GPU allocated to the job. If "verbose," is speci‐
871 fied before <type>, then print out GPU binding information.
872
873 Supported type options:
874
875 closest Bind each task to the GPU(s) which are closest. In a
876 NUMA environment, each task may be bound to more than
877 one GPU (i.e. all GPUs in that NUMA environment).
878
879 map_gpu:<list>
880 Bind by setting GPU masks on tasks (or ranks) as spec‐
881 ified where <list> is
882 <gpu_id_for_task_0>,<gpu_id_for_task_1>,... GPU IDs
883 are interpreted as decimal values unless they are pre‐
ceded with '0x' in which case they are interpreted as
885 hexadecimal values. If the number of tasks (or ranks)
886 exceeds the number of elements in this list, elements
887 in the list will be reused as needed starting from the
888 beginning of the list. To simplify support for large
889 task counts, the lists may follow a map with an aster‐
890 isk and repetition count. For example
891 "map_gpu:0*4,1*4". If the task/cgroup plugin is used
892 and ConstrainDevices is set in cgroup.conf, then the
893 GPU IDs are zero-based indexes relative to the GPUs
894 allocated to the job (e.g. the first GPU is 0, even if
895 the global ID is 3). Otherwise, the GPU IDs are global
896 IDs, and all GPUs on each node in the job should be
897 allocated for predictable binding results.
898
899 mask_gpu:<list>
900 Bind by setting GPU masks on tasks (or ranks) as spec‐
901 ified where <list> is
902 <gpu_mask_for_task_0>,<gpu_mask_for_task_1>,... The
903 mapping is specified for a node and identical mapping
904 is applied to the tasks on every node (i.e. the lowest
905 task ID on each node is mapped to the first mask spec‐
906 ified in the list, etc.). GPU masks are always inter‐
907 preted as hexadecimal values but can be preceded with
908 an optional '0x'. To simplify support for large task
909 counts, the lists may follow a map with an asterisk
910 and repetition count. For example
911 "mask_gpu:0x0f*4,0xf0*4". If the task/cgroup plugin
912 is used and ConstrainDevices is set in cgroup.conf,
913 then the GPU IDs are zero-based indexes relative to
914 the GPUs allocated to the job (e.g. the first GPU is
915 0, even if the global ID is 3). Otherwise, the GPU IDs
916 are global IDs, and all GPUs on each node in the job
917 should be allocated for predictable binding results.
918
919 single:<tasks_per_gpu>
920 Like --gpu-bind=closest, except that each task can
921 only be bound to a single GPU, even when it can be
922 bound to multiple GPUs that are equally close. The
923 GPU to bind to is determined by <tasks_per_gpu>, where
924 the first <tasks_per_gpu> tasks are bound to the first
925 GPU available, the second <tasks_per_gpu> tasks are
926 bound to the second GPU available, etc. This is basi‐
927 cally a block distribution of tasks onto available
928 GPUs, where the available GPUs are determined by the
929 socket affinity of the task and the socket affinity of
930 the GPUs as specified in gres.conf's Cores parameter.
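
For example, with two GPUs allocated per node and four tasks per node (an illustrative layout; my_app is a placeholder), either of the following binds tasks to GPUs; the map form assumes all GPUs on each node are allocated to the job:

    srun --gpus-per-node=2 --ntasks-per-node=4 --gpu-bind=single:2 ./my_app
    srun --gpus-per-node=2 --ntasks-per-node=4 --gpu-bind=verbose,map_gpu:0,1 ./my_app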
931
932
--gpu-freq=[<type>=]<value>[,<type>=<value>][,verbose]
934 Request that GPUs allocated to the job are configured with spe‐
935 cific frequency values. This option can be used to indepen‐
936 dently configure the GPU and its memory frequencies. After the
937 job is completed, the frequencies of all affected GPUs will be
938 reset to the highest possible values. In some cases, system
939 power caps may override the requested values. The field type
940 can be "memory". If type is not specified, the GPU frequency is
941 implied. The value field can either be "low", "medium", "high",
942 "highm1" or a numeric value in megahertz (MHz). If the speci‐
943 fied numeric value is not possible, a value as close as possible
944 will be used. See below for definition of the values. The ver‐
945 bose option causes current GPU frequency information to be
946 logged. Examples of use include "--gpu-freq=medium,memory=high"
947 and "--gpu-freq=450".
948
949 Supported value definitions:
950
951 low the lowest available frequency.
952
953 medium attempts to set a frequency in the middle of the
954 available range.
955
956 high the highest available frequency.
957
958 highm1 (high minus one) will select the next highest avail‐
959 able frequency.
960
961
962 --gpus-per-node=[<type>:]<number>
963 Specify the number of GPUs required for the job on each node in‐
964 cluded in the job's resource allocation. An optional GPU type
965 specification can be supplied. For example
966 "--gpus-per-node=volta:3". Multiple options can be requested in
967 a comma separated list, for example:
968 "--gpus-per-node=volta:3,kepler:1". See also the --gpus,
969 --gpus-per-socket and --gpus-per-task options.
970
971
972 --gpus-per-socket=[<type>:]<number>
973 Specify the number of GPUs required for the job on each socket
974 included in the job's resource allocation. An optional GPU type
975 specification can be supplied. For example
976 "--gpus-per-socket=volta:3". Multiple options can be requested
977 in a comma separated list, for example:
978 "--gpus-per-socket=volta:3,kepler:1". Requires job to specify a
sockets per node count (--sockets-per-node). See also the
980 --gpus, --gpus-per-node and --gpus-per-task options. This op‐
981 tion applies to job allocations.
982
983
984 --gpus-per-task=[<type>:]<number>
Specify the number of GPUs required for each task to be
spawned in the job's resource allocation. An optional GPU
987 type specification can be supplied. For example
988 "--gpus-per-task=volta:1". Multiple options can be requested in
989 a comma separated list, for example:
990 "--gpus-per-task=volta:3,kepler:1". See also the --gpus,
991 --gpus-per-socket and --gpus-per-node options. This option re‐
992 quires an explicit task count, e.g. -n, --ntasks or "--gpus=X
993 --gpus-per-task=Y" rather than an ambiguous range of nodes with
994 -N, --nodes.
995 NOTE: This option will not have any impact on GPU binding,
996 specifically it won't limit the number of devices set for
997 CUDA_VISIBLE_DEVICES.
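
For example, four tasks each requiring one GPU of the (illustrative) type "volta", with the explicit task count this option requires:

    srun -n 4 --gpus-per-task=volta:1 ./my_app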
998
999
1000 --gres=<list>
1001 Specifies a comma delimited list of generic consumable re‐
1002 sources. The format of each entry on the list is
1003 "name[[:type]:count]". The name is that of the consumable re‐
1004 source. The count is the number of those resources with a de‐
1005 fault value of 1. The count can have a suffix of "k" or "K"
1006 (multiple of 1024), "m" or "M" (multiple of 1024 x 1024), "g" or
1007 "G" (multiple of 1024 x 1024 x 1024), "t" or "T" (multiple of
1008 1024 x 1024 x 1024 x 1024), "p" or "P" (multiple of 1024 x 1024
1009 x 1024 x 1024 x 1024). The specified resources will be allo‐
1010 cated to the job on each node. The available generic consumable
resources are configurable by the system administrator. A list
1012 of available generic consumable resources will be printed and
1013 the command will exit if the option argument is "help". Exam‐
1014 ples of use include "--gres=gpu:2,mic:1", "--gres=gpu:kepler:2",
1015 and "--gres=help". NOTE: This option applies to job and step
1016 allocations. By default, a job step is allocated all of the
1017 generic resources that have been allocated to the job. To
1018 change the behavior so that each job step is allocated no
1019 generic resources, explicitly set the value of --gres to specify
1020 zero counts for each generic resource OR set "--gres=none" OR
1021 set the SLURM_STEP_GRES environment variable to "none".
1022
1023
1024 --gres-flags=<type>
1025 Specify generic resource task binding options. This option ap‐
1026 plies to job allocations.
1027
1028 disable-binding
1029 Disable filtering of CPUs with respect to generic re‐
1030 source locality. This option is currently required to
1031 use more CPUs than are bound to a GRES (i.e. if a GPU is
1032 bound to the CPUs on one socket, but resources on more
1033 than one socket are required to run the job). This op‐
1034 tion may permit a job to be allocated resources sooner
1035 than otherwise possible, but may result in lower job per‐
1036 formance.
1037 NOTE: This option is specific to SelectType=cons_res.
1038
1039 enforce-binding
1040 The only CPUs available to the job will be those bound to
1041 the selected GRES (i.e. the CPUs identified in the
1042 gres.conf file will be strictly enforced). This option
1043 may result in delayed initiation of a job. For example a
1044 job requiring two GPUs and one CPU will be delayed until
1045 both GPUs on a single socket are available rather than
1046 using GPUs bound to separate sockets, however, the appli‐
1047 cation performance may be improved due to improved commu‐
1048 nication speed. Requires the node to be configured with
1049 more than one socket and resource filtering will be per‐
1050 formed on a per-socket basis.
1051 NOTE: This option is specific to SelectType=cons_tres.
1052
1053
1054 -H, --hold
1055 Specify the job is to be submitted in a held state (priority of
1056 zero). A held job can now be released using scontrol to reset
1057 its priority (e.g. "scontrol release <job_id>"). This option ap‐
1058 plies to job allocations.
1059
1060
1061 -h, --help
1062 Display help information and exit.
1063
1064
1065 --hint=<type>
1066 Bind tasks according to application hints.
1067 NOTE: This option cannot be used in conjunction with any of
1068 --ntasks-per-core, --threads-per-core, --cpu-bind (other than
1069 --cpu-bind=verbose) or -B. If --hint is specified as a command
1070 line argument, it will take precedence over the environment.
1071
1072 compute_bound
1073 Select settings for compute bound applications: use all
1074 cores in each socket, one thread per core.
1075
1076 memory_bound
1077 Select settings for memory bound applications: use only
1078 one core in each socket, one thread per core.
1079
1080 [no]multithread
1081 [don't] use extra threads with in-core multi-threading
1082 which can benefit communication intensive applications.
1083 Only supported with the task/affinity plugin.
1084
1085 help show this help message
1086
1087 This option applies to job allocations.
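
For example (the executable name is a placeholder):

    srun --hint=compute_bound -n 8 ./my_app     # all cores, one thread per core
    srun --hint=nomultithread -n 8 ./my_app     # do not use extra hardware threads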
1088
1089
1090 -I, --immediate[=<seconds>]
Exit if resources are not available within the time period spec‐
1092 ified. If no argument is given (seconds defaults to 1), re‐
1093 sources must be available immediately for the request to suc‐
1094 ceed. If defer is configured in SchedulerParameters and sec‐
1095 onds=1 the allocation request will fail immediately; defer con‐
1096 flicts and takes precedence over this option. By default, --im‐
1097 mediate is off, and the command will block until resources be‐
1098 come available. Since this option's argument is optional, for
1099 proper parsing the single letter option must be followed immedi‐
1100 ately with the value and not include a space between them. For
1101 example "-I60" and not "-I 60". This option applies to job and
1102 step allocations.
1103
1104
1105 -i, --input=<mode>
Specify how stdin is to be redirected. By default, srun redirects
stdin from the terminal to all tasks. See IO Redirection below for
1108 more options. For OS X, the poll() function does not support
1109 stdin, so input from a terminal is not possible. This option ap‐
1110 plies to job and step allocations.
1111
1112
1113 -J, --job-name=<jobname>
1114 Specify a name for the job. The specified name will appear along
1115 with the job id number when querying running jobs on the system.
1116 The default is the supplied executable program's name. NOTE:
1117 This information may be written to the slurm_jobacct.log file.
1118 This file is space delimited so if a space is used in the job‐
name it will cause problems in properly displaying the con‐
1120 tents of the slurm_jobacct.log file when the sacct command is
1121 used. This option applies to job and step allocations.
1122
1123
1124 --jobid=<jobid>
1125 Initiate a job step under an already allocated job with job id
<jobid>. Using this option will cause srun to behave exactly as if
1127 the SLURM_JOB_ID environment variable was set. This option ap‐
1128 plies to step allocations.
1129
1130
1131 -K, --kill-on-bad-exit[=0|1]
1132 Controls whether or not to terminate a step if any task exits
1133 with a non-zero exit code. If this option is not specified, the
1134 default action will be based upon the Slurm configuration param‐
1135 eter of KillOnBadExit. If this option is specified, it will take
1136 precedence over KillOnBadExit. An option argument of zero will
1137 not terminate the job. A non-zero argument or no argument will
1138 terminate the job. Note: This option takes precedence over the
1139 -W, --wait option to terminate the job immediately if a task ex‐
1140 its with a non-zero exit code. Since this option's argument is
1141 optional, for proper parsing the single letter option must be
1142 followed immediately with the value and not include a space be‐
1143 tween them. For example "-K1" and not "-K 1".
1144
1145
1146 -k, --no-kill [=off]
1147 Do not automatically terminate a job if one of the nodes it has
1148 been allocated fails. This option applies to job and step allo‐
1149 cations. The job will assume all responsibilities for
fault-tolerance. Tasks launched using this option will not be
1151 considered terminated (e.g. -K, --kill-on-bad-exit and -W,
1152 --wait options will have no effect upon the job step). The ac‐
1153 tive job step (MPI job) will likely suffer a fatal error, but
1154 subsequent job steps may be run if this option is specified.
1155
Specify an optional argument of "off" to disable the effect of
the SLURM_NO_KILL environment variable.
1158
1159 The default action is to terminate the job upon node failure.
1160
1161
1162 -l, --label
1163 Prepend task number to lines of stdout/err. The --label option
1164 will prepend lines of output with the remote task id. This op‐
1165 tion applies to step allocations.
1166
1167
1168 -L, --licenses=<license>
1169 Specification of licenses (or other resources available on all
1170 nodes of the cluster) which must be allocated to this job. Li‐
1171 cense names can be followed by a colon and count (the default
1172 count is one). Multiple license names should be comma separated
1173 (e.g. "--licenses=foo:4,bar"). This option applies to job allo‐
1174 cations.
1175
1176
1177 -M, --clusters=<string>
1178 Clusters to issue commands to. Multiple cluster names may be
1179 comma separated. The job will be submitted to the one cluster
1180 providing the earliest expected job initiation time. The default
1181 value is the current cluster. A value of 'all' will query to run
1182 on all clusters. Note the --export option to control environ‐
1183 ment variables exported between clusters. This option applies
1184 only to job allocations. Note that the SlurmDBD must be up for
1185 this option to work properly.
1186
1187
-m, --distribution=*|block|cyclic|arbitrary|plane=<options>
[:*|block|cyclic|fcyclic[:*|block|cyclic|fcyclic]][,Pack|NoPack]
1192
1193 Specify alternate distribution methods for remote processes.
1194 This option controls the distribution of tasks to the nodes on
1195 which resources have been allocated, and the distribution of
1196 those resources to tasks for binding (task affinity). The first
1197 distribution method (before the first ":") controls the distri‐
1198 bution of tasks to nodes. The second distribution method (after
1199 the first ":") controls the distribution of allocated CPUs
1200 across sockets for binding to tasks. The third distribution
1201 method (after the second ":") controls the distribution of allo‐
1202 cated CPUs across cores for binding to tasks. The second and
1203 third distributions apply only if task affinity is enabled. The
1204 third distribution is supported only if the task/cgroup plugin
1205 is configured. The default value for each distribution type is
1206 specified by *.
1207
1208 Note that with select/cons_res and select/cons_tres, the number
1209 of CPUs allocated to each socket and node may be different. Re‐
1210 fer to https://slurm.schedmd.com/mc_support.html for more infor‐
1211 mation on resource allocation, distribution of tasks to nodes,
1212 and binding of tasks to CPUs.
1213 First distribution method (distribution of tasks across nodes):
1214
1215
1216 * Use the default method for distributing tasks to nodes
1217 (block).
1218
1219 block The block distribution method will distribute tasks to a
1220 node such that consecutive tasks share a node. For exam‐
1221 ple, consider an allocation of three nodes each with two
1222 cpus. A four-task block distribution request will dis‐
1223 tribute those tasks to the nodes with tasks one and two
1224 on the first node, task three on the second node, and
1225 task four on the third node. Block distribution is the
1226 default behavior if the number of tasks exceeds the num‐
1227 ber of allocated nodes.
1228
1229 cyclic The cyclic distribution method will distribute tasks to a
1230 node such that consecutive tasks are distributed over
1231 consecutive nodes (in a round-robin fashion). For exam‐
1232 ple, consider an allocation of three nodes each with two
1233 cpus. A four-task cyclic distribution request will dis‐
1234 tribute those tasks to the nodes with tasks one and four
1235 on the first node, task two on the second node, and task
1236 three on the third node. Note that when SelectType is
1237 select/cons_res, the same number of CPUs may not be allo‐
1238 cated on each node. Task distribution will be round-robin
1239 among all the nodes with CPUs yet to be assigned to
1240 tasks. Cyclic distribution is the default behavior if
1241 the number of tasks is no larger than the number of allo‐
1242 cated nodes.
1243
1244 plane The tasks are distributed in blocks of a specified size.
1245 The number of tasks distributed to each node is the same
1246 as for cyclic distribution, but the taskids assigned to
1247 each node depend on the plane size. Additional distribu‐
1248 tion specifications cannot be combined with this option.
1249 For more details (including examples and diagrams),
1250 please see
1251 https://slurm.schedmd.com/mc_support.html
1252 and
1253 https://slurm.schedmd.com/dist_plane.html
1254
1255 arbitrary
1256 The arbitrary method of distribution will allocate pro‐
1257 cesses in-order as listed in file designated by the envi‐
1258 ronment variable SLURM_HOSTFILE. If this variable is
1259 listed it will over ride any other method specified. If
1260 not set the method will default to block. Inside the
1261 hostfile must contain at minimum the number of hosts re‐
1262 quested and be one per line or comma separated. If spec‐
1263 ifying a task count (-n, --ntasks=<number>), your tasks
1264 will be laid out on the nodes in the order of the file.
1265 NOTE: The arbitrary distribution option on a job alloca‐
1266 tion only controls the nodes to be allocated to the job
1267 and not the allocation of CPUs on those nodes. This op‐
1268 tion is meant primarily to control a job step's task lay‐
1269 out in an existing job allocation for the srun command.
1270 NOTE: If the number of tasks is given and a list of re‐
1271 quested nodes is also given, the number of nodes used
1272 from that list will be reduced to match that of the num‐
1273 ber of tasks if the number of nodes in the list is
1274 greater than the number of tasks.
1275
1276
1277 Second distribution method (distribution of CPUs across sockets
1278 for binding):
1279
1280
1281 * Use the default method for distributing CPUs across sock‐
1282 ets (cyclic).
1283
1284 block The block distribution method will distribute allocated
1285 CPUs consecutively from the same socket for binding to
1286 tasks, before using the next consecutive socket.
1287
1288 cyclic The cyclic distribution method will distribute allocated
1289 CPUs for binding to a given task consecutively from the
1290 same socket, and from the next consecutive socket for the
1291 next task, in a round-robin fashion across sockets.
1292
1293 fcyclic
1294 The fcyclic distribution method will distribute allocated
1295 CPUs for binding to tasks from consecutive sockets in a
1296 round-robin fashion across the sockets.
1297
1298
1299 Third distribution method (distribution of CPUs across cores for
1300 binding):
1301
1302
1303 * Use the default method for distributing CPUs across cores
1304 (inherited from second distribution method).
1305
1306 block The block distribution method will distribute allocated
1307 CPUs consecutively from the same core for binding to
1308 tasks, before using the next consecutive core.
1309
1310 cyclic The cyclic distribution method will distribute allocated
1311 CPUs for binding to a given task consecutively from the
1312 same core, and from the next consecutive core for the
1313 next task, in a round-robin fashion across cores.
1314
1315 fcyclic
1316 The fcyclic distribution method will distribute allocated
1317 CPUs for binding to tasks from consecutive cores in a
1318 round-robin fashion across the cores.
1319
1320
1321
1322 Optional control for task distribution over nodes:
1323
1324
1325 Pack Rather than distributing a job step's tasks evenly
1326 across its allocated nodes, pack them as tightly as pos‐
1327 sible on the nodes. This only applies when the "block"
1328 task distribution method is used.
1329
1330 NoPack Rather than packing a job step's tasks as tightly as pos‐
1331 sible on the nodes, distribute them evenly. This user
1332 option will supersede the SelectTypeParameters
1333 CR_Pack_Nodes configuration parameter.
1334
1335 This option applies to job and step allocations.
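
       For example, combining the first and second distribution methods, and
       using the arbitrary method described above with a hostfile, might look
       like the following sketch (the program name ./my_app and the host
       names are placeholders):

              srun -N2 -n8 -m block:cyclic ./my_app

              cat > hosts.txt <<EOF
              node01
              node01
              node02
              node03
              EOF
              SLURM_HOSTFILE=hosts.txt srun -n4 -m arbitrary ./my_app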
1336
1337
1338 --mail-type=<type>
1339 Notify user by email when certain event types occur. Valid type
1340 values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (equivalent to
1341 BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT), IN‐
1342 VALID_DEPEND (dependency never satisfied), STAGE_OUT (burst buf‐
1343 fer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90
1344 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80
1345 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of
1346 time limit). Multiple type values may be specified in a comma
1347 separated list. The user to be notified is indicated with
1348 --mail-user. This option applies to job allocations.
1349
1350
1351 --mail-user=<user>
1352 User to receive email notification of state changes as defined
1353 by --mail-type. The default value is the submitting user. This
1354 option applies to job allocations.
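
       For example, one possible way to request mail when the job ends or
       fails (the address user@example.com and the program ./my_app are
       placeholders):

              srun --mail-type=END,FAIL --mail-user=user@example.com ./my_app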
1355
1356
1357 --mcs-label=<mcs>
1358 Used only when the mcs/group plugin is enabled. This parameter
1359 is one of the groups to which the user belongs. The default value is cal‐
1360 culated by the mcs plugin if it is enabled. This option applies
1361 to job allocations.
1362
1363
1364 --mem=<size[units]>
1365 Specify the real memory required per node. Default units are
1366 megabytes. Different units can be specified using the suffix
1367 [K|M|G|T]. Default value is DefMemPerNode and the maximum value
1368 is MaxMemPerNode. If configured, both parameters can be seen
1369 using the scontrol show config command. This parameter would
1370 generally be used if whole nodes are allocated to jobs (Select‐
1371 Type=select/linear). Specifying a memory limit of zero for a
1372 job step will restrict the job step to the amount of memory al‐
1373 located to the job, but not remove any of the job's memory allo‐
1374 cation from being available to other job steps. Also see
1375 --mem-per-cpu and --mem-per-gpu. The --mem, --mem-per-cpu and
1376 --mem-per-gpu options are mutually exclusive. If --mem,
1377 --mem-per-cpu or --mem-per-gpu are specified as command line ar‐
1378 guments, then they will take precedence over the environment
1379 (potentially inherited from salloc or sbatch).
1380
1381 NOTE: A memory size specification of zero is treated as a spe‐
1382 cial case and grants the job access to all of the memory on each
1383 node for newly submitted jobs and all available job memory to
1384 new job steps.
1385
1386 Specifying new memory limits for job steps is only advisory.
1387
1388 If the job is allocated multiple nodes in a heterogeneous clus‐
1389 ter, the memory limit on each node will be that of the node in
1390 the allocation with the smallest memory size (same limit will
1391 apply to every node in the job's allocation).
1392
1393 NOTE: Enforcement of memory limits currently relies upon the
1394 task/cgroup plugin or enabling of accounting, which samples mem‐
1395 ory use on a periodic basis (data need not be stored, just col‐
1396 lected). In both cases memory use is based upon the job's Resi‐
1397 dent Set Size (RSS). A task may exceed the memory limit until
1398 the next periodic accounting sample.
1399
1400 This option applies to job and step allocations.
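
       For example, a sketch of a job that needs roughly 16 gigabytes of real
       memory on each of two nodes (./my_app is a placeholder):

              srun -N2 --mem=16G ./my_app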
1401
1402
1403 --mem-per-cpu=<size[units]>
1404 Minimum memory required per allocated CPU. Default units are
1405 megabytes. Different units can be specified using the suffix
1406 [K|M|G|T]. The default value is DefMemPerCPU and the maximum
1407 value is MaxMemPerCPU (see exception below). If configured, both
1408 parameters can be seen using the scontrol show config command.
1409 Note that if the job's --mem-per-cpu value exceeds the config‐
1410 ured MaxMemPerCPU, then the user's limit will be treated as a
1411 memory limit per task; --mem-per-cpu will be reduced to a value
1412 no larger than MaxMemPerCPU; --cpus-per-task will be set and the
1413 value of --cpus-per-task multiplied by the new --mem-per-cpu
1414 value will equal the original --mem-per-cpu value specified by
1415 the user. This parameter would generally be used if individual
1416 processors are allocated to jobs (SelectType=select/cons_res).
1417 If resources are allocated by core, socket, or whole nodes, then
1418 the number of CPUs allocated to a job may be higher than the
1419 task count and the value of --mem-per-cpu should be adjusted ac‐
1420 cordingly. Specifying a memory limit of zero for a job step
1421 will restrict the job step to the amount of memory allocated to
1422 the job, but not remove any of the job's memory allocation from
1423 being available to other job steps. Also see --mem and
1424 --mem-per-gpu. The --mem, --mem-per-cpu and --mem-per-gpu op‐
1425 tions are mutually exclusive.
1426
1427 NOTE: If the final amount of memory requested by a job can't be
1428 satisfied by any of the nodes configured in the partition, the
1429 job will be rejected. This could happen if --mem-per-cpu is
1430 used with the --exclusive option for a job allocation and
1431 --mem-per-cpu times the number of CPUs on a node is greater than
1432 the total memory of that node.
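
       For example, a sketch of a job with 8 tasks, 4 CPUs per task and 2
       gigabytes per allocated CPU, i.e. roughly 8 gigabytes per task
       (./my_app is a placeholder):

              srun -n8 -c4 --mem-per-cpu=2G ./my_app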
1433
1434
1435 --mem-per-gpu=<size[units]>
1436 Minimum memory required per allocated GPU. Default units are
1437 megabytes. Different units can be specified using the suffix
1438 [K|M|G|T]. Default value is DefMemPerGPU and is available on
1439 both a global and per partition basis. If configured, the pa‐
1440 rameters can be seen using the scontrol show config and scontrol
1441 show partition commands. Also see --mem. The --mem,
1442 --mem-per-cpu and --mem-per-gpu options are mutually exclusive.
1443
1444
1445 --mem-bind=[{quiet,verbose},]type
1446 Bind tasks to memory. Used only when the task/affinity plugin is
1447 enabled and the NUMA memory functions are available. Note that
1448 the resolution of CPU and memory binding may differ on some ar‐
1449 chitectures. For example, CPU binding may be performed at the
1450 level of the cores within a processor while memory binding will
1451 be performed at the level of nodes, where the definition of
1452 "nodes" may differ from system to system. By default no memory
1453 binding is performed; any task using any CPU can use any memory.
1454 This option is typically used to ensure that each task is bound
1455 to the memory closest to its assigned CPU. The use of any type
1456 other than "none" or "local" is not recommended. If you want
1457 greater control, try running a simple test code with the options
1458 "--cpu-bind=verbose,none --mem-bind=verbose,none" to determine
1459 the specific configuration.
1460
1461 NOTE: To have Slurm always report on the selected memory binding
1462 for all commands executed in a shell, you can enable verbose
1463 mode by setting the SLURM_MEM_BIND environment variable value to
1464 "verbose".
1465
1466 The following informational environment variables are set when
1467 --mem-bind is in use:
1468
1469 SLURM_MEM_BIND_LIST
1470 SLURM_MEM_BIND_PREFER
1471 SLURM_MEM_BIND_SORT
1472 SLURM_MEM_BIND_TYPE
1473 SLURM_MEM_BIND_VERBOSE
1474
1475 See the ENVIRONMENT VARIABLES section for a more detailed de‐
1476 scription of the individual SLURM_MEM_BIND* variables.
1477
1478 Supported options include:
1479
1480 help show this help message
1481
1482 local Use memory local to the processor in use
1483
1484 map_mem:<list>
1485 Bind by setting memory masks on tasks (or ranks) as spec‐
1486 ified where <list> is
1487 <numa_id_for_task_0>,<numa_id_for_task_1>,... The map‐
1488 ping is specified for a node and identical mapping is ap‐
1489 plied to the tasks on every node (i.e. the lowest task ID
1490 on each node is mapped to the first ID specified in the
1491 list, etc.). NUMA IDs are interpreted as decimal values
1492 unless they are preceded with '0x', in which case they are in‐
1493 terpreted as hexadecimal values. If the number of tasks
1494 (or ranks) exceeds the number of elements in this list,
1495 elements in the list will be reused as needed starting
1496 from the beginning of the list. To simplify support for
1497 large task counts, a map in the list may be followed by
1498 an asterisk and a repetition count. For example
1499 "map_mem:0x0f*4,0xf0*4". For predictable binding re‐
1500 sults, all CPUs for each node in the job should be allo‐
1501 cated to the job.
1502
1503 mask_mem:<list>
1504 Bind by setting memory masks on tasks (or ranks) as spec‐
1505 ified where <list> is
1506 <numa_mask_for_task_0>,<numa_mask_for_task_1>,... The
1507 mapping is specified for a node and identical mapping is
1508 applied to the tasks on every node (i.e. the lowest task
1509 ID on each node is mapped to the first mask specified in
1510 the list, etc.). NUMA masks are always interpreted as
1511 hexadecimal values. Note that masks must be preceded
1512 with a '0x' if they don't begin with [0-9] so they are
1513 seen as numerical values. If the number of tasks (or
1514 ranks) exceeds the number of elements in this list, ele‐
1515 ments in the list will be reused as needed starting from
1516 the beginning of the list. To simplify support for large
1517 task counts, a mask in the list may be followed by an asterisk
1518 and a repetition count. For example "mask_mem:0*4,1*4".
1519 For predictable binding results, all CPUs for each node
1520 in the job should be allocated to the job.
1521
1522 no[ne] don't bind tasks to memory (default)
1523
1524 nosort avoid sorting free cache pages (default, LaunchParameters
1525 configuration parameter can override this default)
1526
1527 p[refer]
1528 Prefer use of first specified NUMA node, but permit
1529 use of other available NUMA nodes.
1530
1531 q[uiet]
1532 quietly bind before task runs (default)
1533
1534 rank bind by task rank (not recommended)
1535
1536 sort sort free cache pages (run zonesort on Intel KNL nodes)
1537
1538 v[erbose]
1539 verbosely report binding before task runs
1540
1541 This option applies to job and step allocations.
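
       For example, a sketch that binds each task to the memory local to its
       assigned CPUs and reports the binding (./my_app is a placeholder):

              srun --cpu-bind=cores --mem-bind=verbose,local ./my_app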
1542
1543
1544 --mincpus=<n>
1545 Specify a minimum number of logical cpus/processors per node.
1546 This option applies to job allocations.
1547
1548
1549 --msg-timeout=<seconds>
1550 Modify the job launch message timeout. The default value is
1551 MessageTimeout in the Slurm configuration file slurm.conf.
1552 Changes to this are typically not recommended, but could be use‐
1553 ful to diagnose problems. This option applies to job alloca‐
1554 tions.
1555
1556
1557 --mpi=<mpi_type>
1558 Identify the type of MPI to be used. May result in unique initi‐
1559 ation procedures.
1560
1561 list Lists available mpi types to choose from.
1562
1563 pmi2 To enable PMI2 support. The PMI2 support in Slurm works
1564 only if the MPI implementation supports it, in other
1565 words if the MPI has the PMI2 interface implemented. The
1566 --mpi=pmi2 option will load the library lib/slurm/mpi_pmi2.so
1567 which provides the server side functionality but the
1568 client side must implement PMI2_Init() and the other in‐
1569 terface calls.
1570
1571 pmix To enable PMIx support (https://pmix.github.io). The PMIx
1572 support in Slurm can be used to launch parallel applica‐
1573 tions (e.g. MPI) if it supports PMIx, PMI2 or PMI1. Slurm
1574 must be configured with pmix support by passing "--with-
1575 pmix=<PMIx installation path>" option to its "./config‐
1576 ure" script.
1577
1578 At the time of writing PMIx is supported in Open MPI
1579 starting from version 2.0. PMIx also supports backward
1580 compatibility with PMI1 and PMI2 and can be used if MPI
1581 was configured with PMI2/PMI1 support pointing to the
1582 PMIx library ("libpmix"). If MPI supports PMI1/PMI2 but
1583 doesn't provide a way to point to a specific implemen‐
1584 tation, a hack'ish solution leveraging LD_PRELOAD can be
1585 used to force "libpmix" usage.
1586
1587
1588 none No special MPI processing. This is the default and works
1589 with many other versions of MPI.
1590
1591 This option applies to step allocations.
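
       For example, one might first list the available MPI plugins and then
       launch a PMIx-enabled application (./my_mpi_app is a placeholder):

              srun --mpi=list
              srun -n16 --mpi=pmix ./my_mpi_app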
1592
1593
1594 --multi-prog
1595 Run a job with different programs and different arguments for
1596 each task. In this case, the executable program specified is ac‐
1597 tually a configuration file specifying the executable and argu‐
1598 ments for each task. See MULTIPLE PROGRAM CONFIGURATION below
1599 for details on the configuration file contents. This option ap‐
1600 plies to step allocations.
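
       For example, a sketch of a configuration file and its use (the file
       name multi.conf and the commands are placeholders; see MULTIPLE
       PROGRAM CONFIGURATION below for the authoritative file format):

              cat > multi.conf <<EOF
              0      echo master
              1-3    hostname
              EOF
              srun -n4 --multi-prog multi.conf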
1601
1602
1603 -N, --nodes=<minnodes[-maxnodes]>
1604 Request that a minimum of minnodes nodes be allocated to this
1605 job. A maximum node count may also be specified with maxnodes.
1606 If only one number is specified, this is used as both the mini‐
1607 mum and maximum node count. The partition's node limits super‐
1608 sede those of the job. If a job's node limits are outside of
1609 the range permitted for its associated partition, the job will
1610 be left in a PENDING state. This permits possible execution at
1611 a later time, when the partition limit is changed. If a job
1612 node limit exceeds the number of nodes configured in the parti‐
1613 tion, the job will be rejected. Note that the environment vari‐
1614 able SLURM_JOB_NUM_NODES (and SLURM_NNODES for backwards compat‐
1615 ibility) will be set to the count of nodes actually allocated to
1616 the job. See the ENVIRONMENT VARIABLES section for more informa‐
1617 tion. If -N is not specified, the default behavior is to allo‐
1618 cate enough nodes to satisfy the requirements of the -n and -c
1619 options. The job will be allocated as many nodes as possible
1620 within the range specified and without delaying the initiation
1621 of the job. If the number of tasks is given and a number of re‐
1622 quested nodes is also given, the number of nodes used from that
1623 request will be reduced to match that of the number of tasks if
1624 the number of nodes in the request is greater than the number of
1625 tasks. The node count specification may include a numeric value
1626 followed by a suffix of "k" (multiplies numeric value by 1,024)
1627 or "m" (multiplies numeric value by 1,048,576). This option ap‐
1628 plies to job and step allocations.
1629
1630
1631 -n, --ntasks=<number>
1632 Specify the number of tasks to run. Request that srun allocate
1633 resources for ntasks tasks. The default is one task per node,
1634 but note that the --cpus-per-task option will change this de‐
1635 fault. This option applies to job and step allocations.
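
       For example, a sketch of running 8 tasks on a minimum of 2 and a
       maximum of 4 nodes (./my_app is a placeholder):

              srun -N2-4 -n8 ./my_app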
1636
1637
1638 --network=<type>
1639 Specify information pertaining to the switch or network. The
1640 interpretation of type is system dependent. This option is sup‐
1641 ported when running Slurm on a Cray natively. It is used to re‐
1642 quest using Network Performance Counters. Only one value per
1643 request is valid. All options are case-insensitive. In this
1644 configuration, supported values include:
1645
1646 system
1647 Use the system-wide network performance counters. Only
1648 nodes requested will be marked in use for the job alloca‐
1649 tion. If the job does not fill up the entire system the
1650 rest of the nodes are not able to be used by other jobs
1651 using NPC; if idle, their state will appear as PerfCnts.
1652 These nodes are still available for other jobs not using
1653 NPC.
1654
1655 blade Use the blade network performance counters. Only nodes re‐
1656 quested will be marked in use for the job allocation. If
1657 the job does not fill up the entire blade(s) allocated to
1658 the job those blade(s) are not able to be used by other
1659 jobs using NPC; if idle, their state will appear as PerfC‐
1660 nts. These nodes are still available for other jobs not
1661 using NPC.
1662
1663
1664 In all cases the job allocation request must specify the
1665 --exclusive option and the step cannot specify the --overlap op‐
1666 tion. Otherwise the request will be denied.
1667
1668 Also with any of these options steps are not allowed to share
1669 blades, so resources would remain idle inside an allocation if
1670 the step running on a blade does not take up all the nodes on
1671 the blade.
1672
1673 The network option is also supported on systems with IBM's Par‐
1674 allel Environment (PE). See IBM's LoadLeveler job command key‐
1675 word documentation about the keyword "network" for more informa‐
1676 tion. Multiple values may be specified in a comma separated
1677 list. All options are case-insensitive. Supported values in‐
1678 clude:
1679
1680 BULK_XFER[=<resources>]
1681 Enable bulk transfer of data using Remote Direct-
1682 Memory Access (RDMA). The optional resources speci‐
1683 fication is a numeric value which can have a suffix
1684 of "k", "K", "m", "M", "g" or "G" for kilobytes,
1685 megabytes or gigabytes. NOTE: The resources speci‐
1686 fication is not supported by the underlying IBM in‐
1687 frastructure as of Parallel Environment version 2.2
1688 and no value should be specified at this time. The
1689 devices allocated to a job must all be of the same
1690 type. The default value depends upon
1691 what hardware is available and, in order of prefer‐
1692 ence, is IPONLY (which is not considered in User
1693 Space mode), HFI, IB, HPCE, and KMUX.
1694
1695 CAU=<count> Number of Collective Acceleration Units (CAU) re‐
1696 quired. Applies only to IBM Power7-IH processors.
1697 Default value is zero. Independent CAU will be al‐
1698 located for each programming interface (MPI, LAPI,
1699 etc.)
1700
1701 DEVNAME=<name>
1702 Specify the device name to use for communications
1703 (e.g. "eth0" or "mlx4_0").
1704
1705 DEVTYPE=<type>
1706 Specify the device type to use for communications.
1707 The supported values of type are: "IB" (InfiniBand),
1708 "HFI" (P7 Host Fabric Interface), "IPONLY" (IP-Only
1709 interfaces), "HPCE" (HPC Ethernet), and "KMUX" (Ker‐
1710 nel Emulation of HPCE). The devices allocated to a
1711 job must all be of the same type. The default value
1712 depends upon what hardware is available
1713 and, in order of preference, is IPONLY (which is not
1714 considered in User Space mode), HFI, IB, HPCE, and
1715 KMUX.
1716
1717 IMMED=<count>
1718 Number of immediate send slots per window required.
1719 Applies only to IBM Power7-IH processors. Default
1720 value is zero.
1721
1722 INSTANCES=<count>
1723 Specify the number of network connections for each task
1724 on each network. The default instance
1725 count is 1.
1726
1727 IPV4 Use Internet Protocol (IP) version 4 communications
1728 (default).
1729
1730 IPV6 Use Internet Protocol (IP) version 6 communications.
1731
1732 LAPI Use the LAPI programming interface.
1733
1734 MPI Use the MPI programming interface. MPI is the de‐
1735 fault interface.
1736
1737 PAMI Use the PAMI programming interface.
1738
1739 SHMEM Use the OpenSHMEM programming interface.
1740
1741 SN_ALL Use all available switch networks (default).
1742
1743 SN_SINGLE Use one available switch network.
1744
1745 UPC Use the UPC programming interface.
1746
1747 US Use User Space communications.
1748
1749
1750 Some examples of network specifications:
1751
1752 Instances=2,US,MPI,SN_ALL
1753 Create two user space connections for MPI communica‐
1754 tions on every switch network for each task.
1755
1756 US,MPI,Instances=3,Devtype=IB
1757 Create three user space connections for MPI communi‐
1758 cations on every InfiniBand network for each task.
1759
1760 IPV4,LAPI,SN_Single
1761 Create an IP version 4 connection for LAPI communica‐
1762 tions on one switch network for each task.
1763
1764 Instances=2,US,LAPI,MPI
1765 Create two user space connections each for LAPI and
1766 MPI communications on every switch network for each
1767 task. Note that SN_ALL is the default option so ev‐
1768 ery switch network is used. Also note that In‐
1769 stances=2 specifies that two connections are estab‐
1770 lished for each protocol (LAPI and MPI) and each
1771 task. If there are two networks and four tasks on
1772 the node then a total of 32 connections are estab‐
1773 lished (2 instances x 2 protocols x 2 networks x 4
1774 tasks).
1775
1776 This option applies to job and step allocations.
1777
1778
1779 --nice[=adjustment]
1780 Run the job with an adjusted scheduling priority within Slurm.
1781 With no adjustment value the scheduling priority is decreased by
1782 100. A negative nice value increases the priority, otherwise de‐
1783 creases it. The adjustment range is +/- 2147483645. Only privi‐
1784 leged users can specify a negative adjustment.
1785
1786
1787 --ntasks-per-core=<ntasks>
1788 Request the maximum ntasks be invoked on each core. This option
1789 applies to the job allocation, but not to step allocations.
1790 Meant to be used with the --ntasks option. Related to
1791 --ntasks-per-node except at the core level instead of the node
1792 level. Masks will automatically be generated to bind the tasks
1793 to specific cores unless --cpu-bind=none is specified. NOTE:
1794 This option is not supported when using SelectType=select/lin‐
1795 ear.
1796
1797
1798 --ntasks-per-gpu=<ntasks>
1799 Request that there are ntasks tasks invoked for every GPU. This
1800 option can work in two ways: 1) either specify --ntasks in addi‐
1801 tion, in which case a type-less GPU specification will be auto‐
1802 matically determined to satisfy --ntasks-per-gpu, or 2) specify
1803 the GPUs wanted (e.g. via --gpus or --gres) without specifying
1804 --ntasks, and the total task count will be automatically deter‐
1805 mined. The number of CPUs needed will be automatically in‐
1806 creased if necessary to allow for any calculated task count.
1807 This option will implicitly set --gpu-bind=single:<ntasks>, but
1808 that can be overridden with an explicit --gpu-bind specifica‐
1809 tion. This option is not compatible with a node range (i.e.
1810 -N<minnodes-maxnodes>). This option is not compatible with
1811 --gpus-per-task, --gpus-per-socket, or --ntasks-per-node. This
1812 option is not supported unless SelectType=cons_tres is config‐
1813 ured (either directly or indirectly on Cray systems).
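
       For example, a sketch of running two tasks per GPU on four GPUs, i.e.
       8 tasks in total, assuming a cluster configured with
       SelectType=cons_tres (./my_app is a placeholder):

              srun --gpus=4 --ntasks-per-gpu=2 ./my_app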
1814
1815
1816 --ntasks-per-node=<ntasks>
1817 Request that ntasks be invoked on each node. If used with the
1818 --ntasks option, the --ntasks option will take precedence and
1819 the --ntasks-per-node will be treated as a maximum count of
1820 tasks per node. Meant to be used with the --nodes option. This
1821 is related to --cpus-per-task=ncpus, but does not require knowl‐
1822 edge of the actual number of cpus on each node. In some cases,
1823 it is more convenient to be able to request that no more than a
1824 specific number of tasks be invoked on each node. Examples of
1825 this include submitting a hybrid MPI/OpenMP app where only one
1826 MPI "task/rank" should be assigned to each node while allowing
1827 the OpenMP portion to utilize all of the parallelism present in
1828 the node, or submitting a single setup/cleanup/monitoring job to
1829 each node of a pre-existing allocation as one step in a larger
1830 job script. This option applies to job allocations.
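
       For example, a sketch of a hybrid MPI/OpenMP launch with one MPI rank
       per node and 16 threads per rank, assuming nodes with at least 16
       CPUs (./my_hybrid_app is a placeholder):

              srun -N4 --ntasks-per-node=1 --cpus-per-task=16 ./my_hybrid_app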
1831
1832
1833 --ntasks-per-socket=<ntasks>
1834 Request the maximum ntasks be invoked on each socket. This op‐
1835 tion applies to the job allocation, but not to step allocations.
1836 Meant to be used with the --ntasks option. Related to
1837 --ntasks-per-node except at the socket level instead of the node
1838 level. Masks will automatically be generated to bind the tasks
1839 to specific sockets unless --cpu-bind=none is specified. NOTE:
1840 This option is not supported when using SelectType=select/lin‐
1841 ear.
1842
1843
1844 -O, --overcommit
1845 Overcommit resources. This option applies to job and step allo‐
1846 cations. When applied to job allocation, only one CPU is allo‐
1847 cated to the job per node and options used to specify the number
1848 of tasks per node, socket, core, etc. are ignored. When ap‐
1849 plied to job step allocations (the srun command when executed
1850 within an existing job allocation), this option can be used to
1851 launch more than one task per CPU. Normally, srun will not al‐
1852 locate more than one process per CPU. By specifying --overcom‐
1853 mit you are explicitly allowing more than one process per CPU.
1854 However no more than MAX_TASKS_PER_NODE tasks are permitted to
1855 execute per node. NOTE: MAX_TASKS_PER_NODE is defined in the
1856 file slurm.h and is not a variable, it is set at Slurm build
1857 time.
1858
1859
1860 --overlap
1861 Allow steps to overlap each other on the CPUs. By default steps
1862 do not share CPUs with other parallel steps.
1863
1864
1865 -o, --output=<filename pattern>
1866 Specify the "filename pattern" for stdout redirection. By de‐
1867 fault in interactive mode, srun collects stdout from all tasks
1868 and sends this output via TCP/IP to the attached terminal. With
1869 --output stdout may be redirected to a file, to one file per
1870 task, or to /dev/null. See section IO Redirection below for the
1871 various forms of filename pattern. If the specified file al‐
1872 ready exists, it will be overwritten.
1873
1874 If --error is not also specified on the command line, both std‐
1875 out and stderr will be directed to the file specified by --output.
1876 This option applies to job and step allocations.
1877
1878
1879 --open-mode=<append|truncate>
1880 Open the output and error files using append or truncate mode as
1881 specified. For heterogeneous job steps the default value is
1882 "append". Otherwise the default value is specified by the sys‐
1883 tem configuration parameter JobFileAppend. This option applies
1884 to job and step allocations.
1885
1886
1887 --het-group=<expr>
1888 Identify each component in a heterogeneous job allocation for
1889 which a step is to be created. Applies only to srun commands is‐
1890 sued inside a salloc allocation or sbatch script. <expr> is a
1891 set of integers corresponding to one or more option offsets on
1892 the salloc or sbatch command line. Examples: "--het-group=2",
1893 "--het-group=0,4", "--het-group=1,3-5". The default value is
1894 --het-group=0.
1895
1896
1897 -p, --partition=<partition_names>
1898 Request a specific partition for the resource allocation. If
1899 not specified, the default behavior is to allow the slurm con‐
1900 troller to select the default partition as designated by the
1901 system administrator. If the job can use more than one parti‐
1902 tion, specify their names in a comma separated list and the one
1903 offering earliest initiation will be used with no regard given
1904 to the partition name ordering (although higher priority parti‐
1905 tions will be considered first). When the job is initiated, the
1906 name of the partition used will be placed first in the job
1907 record partition string. This option applies to job allocations.
1908
1909
1910 --power=<flags>
1911 Comma separated list of power management plugin options. Cur‐
1912 rently available flags include: level (all nodes allocated to
1913 the job should have identical power caps, may be disabled by the
1914 Slurm configuration option PowerParameters=job_no_level). This
1915 option applies to job allocations.
1916
1917
1918 --priority=<value>
1919 Request a specific job priority. May be subject to configura‐
1920 tion specific constraints. value should either be a numeric
1921 value or "TOP" (for highest possible value). Only Slurm opera‐
1922 tors and administrators can set the priority of a job. This op‐
1923 tion applies to job allocations only.
1924
1925
1926 --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
1927 Enables detailed data collection by the acct_gather_profile
1928 plugin. Detailed data are typically time-series that are stored
1929 in an HDF5 file for the job or an InfluxDB database depending on
1930 the configured plugin.
1931
1932
1933 All All data types are collected. (Cannot be combined with
1934 other values.)
1935
1936
1937 None No data types are collected. This is the default.
1938 (Cannot be combined with other values.)
1939
1940
1941 Energy Energy data is collected.
1942
1943
1944 Task Task (I/O, Memory, ...) data is collected.
1945
1946
1947 Filesystem
1948 Filesystem data is collected.
1949
1950
1951 Network Network (InfiniBand) data is collected.
1952
1953
1954 This option applies to job and step allocations.
1955
1956
1957 --prolog=<executable>
1958 srun will run executable just before launching the job step.
1959 The command line arguments for executable will be the command
1960 and arguments of the job step. If executable is "none", then no
1961 srun prolog will be run. This parameter overrides the SrunProlog
1962 parameter in slurm.conf. This parameter is completely indepen‐
1963 dent from the Prolog parameter in slurm.conf. This option ap‐
1964 plies to job allocations.
1965
1966
1967 --propagate[=rlimit[,rlimit...]]
1968 Allows users to specify which of the modifiable (soft) resource
1969 limits to propagate to the compute nodes and apply to their
1970 jobs. If no rlimit is specified, then all resource limits will
1971 be propagated. The following rlimit names are supported by
1972 Slurm (although some options may not be supported on some sys‐
1973 tems):
1974
1975 ALL All limits listed below (default)
1976
1977 NONE No limits listed below
1978
1979 AS The maximum address space for a process
1980
1981 CORE The maximum size of core file
1982
1983 CPU The maximum amount of CPU time
1984
1985 DATA The maximum size of a process's data segment
1986
1987 FSIZE The maximum size of files created. Note that if the
1988 user sets FSIZE to less than the current size of the
1989 slurmd.log, job launches will fail with a 'File size
1990 limit exceeded' error.
1991
1992 MEMLOCK The maximum size that may be locked into memory
1993
1994 NOFILE The maximum number of open files
1995
1996 NPROC The maximum number of processes available
1997
1998 RSS The maximum resident set size
1999
2000 STACK The maximum stack size
2001
2002 This option applies to job allocations.
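
       For example, a sketch that propagates only the core file size and
       open file limits to the job (./my_app is a placeholder):

              srun --propagate=CORE,NOFILE ./my_app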
2003
2004
2005 --pty Execute task zero in pseudo terminal mode. Implicitly sets
2006 --unbuffered. Implicitly sets --error and --output to /dev/null
2007 for all tasks except task zero, which may cause those tasks to
2008 exit immediately (e.g. shells will typically exit immediately in
2009 that situation). This option applies to step allocations.
2010
2011
2012 -q, --qos=<qos>
2013 Request a quality of service for the job. QOS values can be de‐
2014 fined for each user/cluster/account association in the Slurm
2015 database. Users will be limited to their association's defined
2016 set of qos's when the Slurm configuration parameter, Account‐
2017 ingStorageEnforce, includes "qos" in its definition. This option
2018 applies to job allocations.
2019
2020
2021 -Q, --quiet
2022 Suppress informational messages from srun. Errors will still be
2023 displayed. This option applies to job and step allocations.
2024
2025
2026 --quit-on-interrupt
2027 Quit immediately on single SIGINT (Ctrl-C). Use of this option
2028 disables the status feature normally available when srun re‐
2029 ceives a single Ctrl-C and causes srun to instead immediately
2030 terminate the running job. This option applies to step alloca‐
2031 tions.
2032
2033
2034 -r, --relative=<n>
2035 Run a job step relative to node n of the current allocation.
2036 This option may be used to spread several job steps out among
2037 the nodes of the current job. If -r is used, the current job
2038 step will begin at node n of the allocated nodelist, where the
2039 first node is considered node 0. The -r option is not permitted
2040 with the -w or -x options and will result in a fatal error when not
2041 running within a prior allocation (i.e. when SLURM_JOB_ID is not
2042 set). The default for n is 0. If the value of --nodes exceeds
2043 the number of nodes identified with the --relative option, a
2044 warning message will be printed and the --relative option will
2045 take precedence. This option applies to step allocations.
2046
2047
2048 --reboot
2049 Force the allocated nodes to reboot before starting the job.
2050 This is only supported with some system configurations and will
2051 otherwise be silently ignored. Only root, SlurmUser or admins
2052 can reboot nodes. This option applies to job allocations.
2053
2054
2055 --resv-ports[=count]
2056 Reserve communication ports for this job. Users can specify the
2057 number of ports they want to reserve. The parameter Mpi‐
2058 Params=ports=12000-12999 must be specified in slurm.conf. If not
2059 specified and Slurm's OpenMPI plugin is used, then by default
2060 the number of reserved ports will equal the highest number of
2061 tasks on any node in the job step allocation. If the number of
2062 reserved ports is zero then no ports are reserved. Used for OpenMPI. This
2063 option applies to job and step allocations.
2064
2065
2066 --reservation=<reservation_names>
2067 Allocate resources for the job from the named reservation. If
2068 the job can use more than one reservation, specify their names
2069 in a comma separated list and the one offering the earliest ini‐
2070 tiation will be used. Each reservation will be considered in the order it was
2071 requested. All reservations will be listed in scontrol/squeue
2072 through the life of the job. In accounting the first reserva‐
2073 tion will be seen and after the job starts the reservation used
2074 will replace it.
2075
2076
2077 -s, --oversubscribe
2078 The job allocation can over-subscribe resources with other run‐
2079 ning jobs. The resources to be over-subscribed can be nodes,
2080 sockets, cores, and/or hyperthreads depending upon configura‐
2081 tion. The default over-subscribe behavior depends on system
2082 configuration and the partition's OverSubscribe option takes
2083 precedence over the job's option. This option may result in the
2084 allocation being granted sooner than if the --oversubscribe op‐
2085 tion was not set and allow higher system utilization, but appli‐
2086 cation performance will likely suffer due to competition for re‐
2087 sources. This option applies to step allocations.
2088
2089
2090 -S, --core-spec=<num>
2091 Count of specialized cores per node reserved by the job for sys‐
2092 tem operations and not used by the application. The application
2093 will not use these cores, but will be charged for their alloca‐
2094 tion. Default value is dependent upon the node's configured
2095 CoreSpecCount value. If a value of zero is designated and the
2096 Slurm configuration option AllowSpecResourcesUsage is enabled,
2097 the job will be allowed to override CoreSpecCount and use the
2098 specialized resources on nodes it is allocated. This option can
2099 not be used with the --thread-spec option. This option applies
2100 to job allocations.
2101
2102
2103 --signal=[R:]<sig_num>[@<sig_time>]
2104 When a job is within sig_time seconds of its end time, send it
2105 the signal sig_num. Due to the resolution of event handling by
2106 Slurm, the signal may be sent up to 60 seconds earlier than
2107 specified. sig_num may either be a signal number or name (e.g.
2108 "10" or "USR1"). sig_time must have an integer value between 0
2109 and 65535. By default, no signal is sent before the job's end
2110 time. If a sig_num is specified without any sig_time, the de‐
2111 fault time will be 60 seconds. This option applies to job allo‐
2112 cations. Use the "R:" option to allow this job to overlap with
2113 a reservation with MaxStartDelay set. To have the signal sent
2114 at preemption time see the preempt_send_user_signal SlurmctldPa‐
2115 rameter.
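
       For example, a sketch that requests SIGUSR1 be sent roughly five
       minutes before the 30 minute time limit expires (./my_app is a
       placeholder):

              srun -t 30:00 --signal=USR1@300 ./my_app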
2116
2117
2118 --slurmd-debug=<level>
2119 Specify a debug level for slurmd(8). The level may be specified
2120 either as an integer value between 0 [quiet, only errors are dis‐
2121 played] and 4 [verbose operation] or as one of the SlurmdDebug tags.
2122
2123 quiet Log nothing
2124
2125 fatal Log only fatal errors
2126
2127 error Log only errors
2128
2129 info Log errors and general informational messages
2130
2131 verbose Log errors and verbose informational messages
2132
2133
2134 The slurmd debug information is copied onto the stderr of
2135 the job. By default only errors are displayed. This option ap‐
2136 plies to job and step allocations.
2137
2138
2139 --sockets-per-node=<sockets>
2140 Restrict node selection to nodes with at least the specified
2141 number of sockets. See additional information under -B option
2142 above when task/affinity plugin is enabled. This option applies
2143 to job allocations.
2144
2145
2146 --spread-job
2147 Spread the job allocation over as many nodes as possible and at‐
2148 tempt to evenly distribute tasks across the allocated nodes.
2149 This option disables the topology/tree plugin. This option ap‐
2150 plies to job allocations.
2151
2152
2153 --switches=<count>[@<max-time>]
2154 When a tree topology is used, this defines the maximum count of
2155 switches desired for the job allocation and optionally the maxi‐
2156 mum time to wait for that number of switches. If Slurm finds an
2157 allocation containing more switches than the count specified,
2158 the job remains pending until it either finds an allocation with
2159 desired switch count or the time limit expires. If there is no
2160 switch count limit, there is no delay in starting the job. Ac‐
2161 ceptable time formats include "minutes", "minutes:seconds",
2162 "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
2163 "days-hours:minutes:seconds". The job's maximum time delay may
2164 be limited by the system administrator using the SchedulerParam‐
2165 eters configuration parameter with the max_switch_wait parameter
2166 option. On a dragonfly network the only switch count supported
2167 is 1 since communication performance will be highest when a job
2168 is allocated resources on one leaf switch or more than 2 leaf
2169 switches. The default max-time is the max_switch_wait Sched‐
2170 ulerParameters value. This option applies to job allocations.
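
       For example, a sketch that asks for an allocation on at most one leaf
       switch, waiting up to 30 minutes for it (./my_app is a placeholder):

              srun -N8 --switches=1@30:00 ./my_app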
2171
2172
2173 -T, --threads=<nthreads>
2174 Allows limiting the number of concurrent threads used to send
2175 the job request from the srun process to the slurmd processes on
2176 the allocated nodes. Default is to use one thread per allocated
2177 node up to a maximum of 60 concurrent threads. Specifying this
2178 option limits the number of concurrent threads to nthreads (less
2179 than or equal to 60). This should only be used to set a low
2180 thread count for testing on very small memory computers. This
2181 option applies to job allocations.
2182
2183
2184 -t, --time=<time>
2185 Set a limit on the total run time of the job allocation. If the
2186 requested time limit exceeds the partition's time limit, the job
2187 will be left in a PENDING state (possibly indefinitely). The
2188 default time limit is the partition's default time limit. When
2189 the time limit is reached, each task in each job step is sent
2190 SIGTERM followed by SIGKILL. The interval between signals is
2191 specified by the Slurm configuration parameter KillWait. The
2192 OverTimeLimit configuration parameter may permit the job to run
2193 longer than scheduled. Time resolution is one minute and second
2194 values are rounded up to the next minute.
2195
2196 A time limit of zero requests that no time limit be imposed.
2197 Acceptable time formats include "minutes", "minutes:seconds",
2198 "hours:minutes:seconds", "days-hours", "days-hours:minutes" and
2199 "days-hours:minutes:seconds". This option applies to job and
2200 step allocations.
2201
2202
2203 --task-epilog=<executable>
2204 The slurmstepd daemon will run executable just after each task
2205 terminates. This will be executed before any TaskEpilog parame‐
2206 ter in slurm.conf is executed. This is meant to be a very
2207 short-lived program. If it fails to terminate within a few sec‐
2208 onds, it will be killed along with any descendant processes.
2209 This option applies to step allocations.
2210
2211
2212 --task-prolog=<executable>
2213 The slurmstepd daemon will run executable just before launching
2214 each task. This will be executed after any TaskProlog parameter
2215 in slurm.conf is executed. Besides the normal environment vari‐
2216 ables, this has SLURM_TASK_PID available to identify the process
2217 ID of the task being started. Standard output from this program
2218 of the form "export NAME=value" will be used to set environment
2219 variables for the task being spawned. This option applies to
2220 step allocations.
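
       For example, a minimal task prolog sketch that sets an environment
       variable for each spawned task (the script path, variable name and
       ./my_app are placeholders):

              cat > /tmp/task_prolog.sh <<'EOF'
              #!/bin/sh
              # Lines of the form "export NAME=value" printed on stdout
              # become environment variables of the task being spawned.
              echo "export MY_TASK_NOTE=started"
              EOF
              chmod +x /tmp/task_prolog.sh
              srun -n4 --task-prolog=/tmp/task_prolog.sh ./my_app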
2221
2222
2223 --test-only
2224 Returns an estimate of when a job would be scheduled to run
2225 given the current job queue and all the other srun arguments
2226 specifying the job. This limits srun's behavior to just return
2227 information; no job is actually submitted. The program will be
2228 executed directly by the slurmd daemon. This option applies to
2229 job allocations.
2230
2231
2232 --thread-spec=<num>
2233 Count of specialized threads per node reserved by the job for
2234 system operations and not used by the application. The applica‐
2235 tion will not use these threads, but will be charged for their
2236 allocation. This option can not be used with the --core-spec
2237 option. This option applies to job allocations.
2238
2239
2240 --threads-per-core=<threads>
2241 Restrict node selection to nodes with at least the specified
2242 number of threads per core. In task layout, use the specified
2243 maximum number of threads per core. Implies --cpu-bind=threads.
2244 NOTE: "Threads" refers to the number of processing units on each
2245 core rather than the number of application tasks to be launched
2246 per core. See additional information under -B option above when
2247 task/affinity plugin is enabled. This option applies to job and
2248 step allocations.
2249
2250
2251 --time-min=<time>
2252 Set a minimum time limit on the job allocation. If specified,
2253 the job may have its --time limit lowered to a value no lower
2254 than --time-min if doing so permits the job to begin execution
2255 earlier than otherwise possible. The job's time limit will not
2256 be changed after the job is allocated resources. This is per‐
2257 formed by a backfill scheduling algorithm to allocate resources
2258 otherwise reserved for higher priority jobs. Acceptable time
2259 formats include "minutes", "minutes:seconds", "hours:min‐
2260 utes:seconds", "days-hours", "days-hours:minutes" and
2261 "days-hours:minutes:seconds". This option applies to job alloca‐
2262 tions.
2263
2264
2265 --tmp=<size[units]>
2266 Specify a minimum amount of temporary disk space per node. De‐
2267 fault units are megabytes. Different units can be specified us‐
2268 ing the suffix [K|M|G|T]. This option applies to job alloca‐
2269 tions.
2270
2271
2272 -u, --unbuffered
2273 By default the connection between slurmstepd and the user
2274 launched application is over a pipe. The stdio output written by
2275 the application is buffered by the glibc until it is flushed or
2276 the output is set as unbuffered. See setbuf(3). If this option
2277 is specified the tasks are executed with a pseudo terminal so
2278 that the application output is unbuffered. This option applies
2279 to step allocations.
2280
2281 --usage
2282 Display brief help message and exit.
2283
2284
2285 --uid=<user>
2286 Attempt to submit and/or run a job as user instead of the invok‐
2287 ing user id. The invoking user's credentials will be used to
2288 check access permissions for the target partition. User root may
2289 use this option to run jobs as a normal user in a RootOnly par‐
2290 tition for example. If run as root, srun will drop its permis‐
2291 sions to the uid specified after node allocation is successful.
2292 user may be the user name or numerical user ID. This option ap‐
2293 plies to job and step allocations.
2294
2295
2296 --use-min-nodes
2297 If a range of node counts is given, prefer the smaller count.
2298
2299
2300 -V, --version
2301 Display version information and exit.
2302
2303
2304 -v, --verbose
2305 Increase the verbosity of srun's informational messages. Multi‐
2306 ple -v's will further increase srun's verbosity. By default
2307 only errors will be displayed. This option applies to job and
2308 step allocations.
2309
2310
2311 -W, --wait=<seconds>
2312 Specify how long to wait after the first task terminates before
2313 terminating all remaining tasks. A value of 0 indicates an un‐
2314 limited wait (a warning will be issued after 60 seconds). The
2315 default value is set by the WaitTime parameter in the slurm con‐
2316 figuration file (see slurm.conf(5)). This option can be useful
2317 to ensure that a job is terminated in a timely fashion in the
2318 event that one or more tasks terminate prematurely. Note: The
2319 -K, --kill-on-bad-exit option takes precedence over -W, --wait
2320 to terminate the job immediately if a task exits with a non-zero
2321 exit code. This option applies to job allocations.
2322
2323
2324 -w, --nodelist=<host1,host2,... or filename>
2325 Request a specific list of hosts. The job will contain all of
2326 these hosts and possibly additional hosts as needed to satisfy
2327 resource requirements. The list may be specified as a
2328 comma-separated list of hosts, a range of hosts (host[1-5,7,...]
2329 for example), or a filename. The host list will be assumed to
2330 be a filename if it contains a "/" character. If you specify a
2331 minimum node or processor count larger than can be satisfied by
2332 the supplied host list, additional resources will be allocated
2333 on other nodes as needed. Rather than repeating a host name
2334 multiple times, an asterisk and a repetition count may be ap‐
2335 pended to a host name. For example "host1,host1" and "host1*2"
2336 are equivalent. If the number of tasks is given and a list of
2337 requested nodes is also given, the number of nodes used from
2338 that list will be reduced to match that of the number of tasks
2339 if the number of nodes in the list is greater than the number of
2340 tasks. This option applies to job and step allocations.
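
       For example, a sketch that requests three specific hosts for a four
       task job (the host names and ./my_app are placeholders):

              srun -N3 -n4 -w "host[1-2],host5" ./my_app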
2341
2342
2343 --wckey=<wckey>
2344 Specify wckey to be used with job. If TrackWCKey=no (default)
2345 in the slurm.conf this value is ignored. This option applies to
2346 job allocations.
2347
2348
2349 -X, --disable-status
2350 Disable the display of task status when srun receives a single
2351 SIGINT (Ctrl-C). Instead immediately forward the SIGINT to the
2352 running job. Without this option a second Ctrl-C in one second
2353 is required to forcibly terminate the job and srun will immedi‐
2354 ately exit. May also be set via the environment variable
2355 SLURM_DISABLE_STATUS. This option applies to job allocations.
2356
2357
2358 -x, --exclude=<host1,host2,... or filename>
2359 Request that a specific list of hosts not be included in the re‐
2360 sources allocated to this job. The host list will be assumed to
2361 be a filename if it contains a "/" character. This option ap‐
2362 plies to job and step allocations.
2363
2364
2365 --x11[=<all|first|last>]
2366 Sets up X11 forwarding on all, first or last node(s) of the al‐
2367 location. This option is only enabled if Slurm was compiled with
2368 X11 support and PrologFlags=x11 is defined in the slurm.conf.
2369 Default is all.
2370
2371
2372 -Z, --no-allocate
2373 Run the specified tasks on a set of nodes without creating a
2374 Slurm "job" in the Slurm queue structure, bypassing the normal
2375 resource allocation step. The list of nodes must be specified
2376 with the -w, --nodelist option. This is a privileged option
2377 only available for the users "SlurmUser" and "root". This option
2378 applies to job allocations.
2379
2380
2381 srun will submit the job request to the slurm job controller, then ini‐
2382 tiate all processes on the remote nodes. If the request cannot be met
2383 immediately, srun will block until the resources are free to run the
2384 job. If the -I (--immediate) option is specified srun will terminate if
2385 resources are not immediately available.
2386
2387 When initiating remote processes srun will propagate the current work‐
2388 ing directory, unless --chdir=<path> is specified, in which case path
2389 will become the working directory for the remote processes.
2390
2391 The -n, -c, and -N options control how CPUs and nodes will be allo‐
2392 cated to the job. When specifying only the number of processes to run
2393 with -n, a default of one CPU per process is allocated. By specifying
2394 the number of CPUs required per task (-c), more than one CPU may be al‐
2395 located per process. If the number of nodes is specified with -N, srun
2396 will attempt to allocate at least the number of nodes specified.
2397
2398 Combinations of the above three options may be used to change how pro‐
2399 cesses are distributed across nodes and cpus. For instance, by specify‐
2400 ing both the number of processes and number of nodes on which to run,
2401 the number of processes per node is implied. However, if the number of
2402 CPUs per process is more important, then the number of processes (-n) and
2403 the number of CPUs per process (-c) should be specified.
2404
2405 srun will refuse to allocate more than one process per CPU unless
2406 --overcommit (-O) is also specified.
2407
2408 srun will attempt to meet the above specifications "at a minimum." That
2409 is, if 16 nodes are requested for 32 processes, and some nodes do not
2410 have 2 CPUs, the allocation of nodes will be increased in order to meet
2411 the demand for CPUs. In other words, a minimum of 16 nodes are being
2412 requested. However, if 16 nodes are requested for 15 processes, srun
2413 will consider this an error, as 15 processes cannot run across 16
2414 nodes.
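
       As an illustration of the above, one possible request for 32
       processes on a minimum of 16 nodes (./my_app is a placeholder):

              srun -N16 -n32 ./my_app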
2415
2416
2417 IO Redirection
2418
2419 By default, stdout and stderr will be redirected from all tasks to the
2420 stdout and stderr of srun, and stdin will be redirected from the stan‐
2421 dard input of srun to all remote tasks. If stdin is only to be read by
2422 a subset of the spawned tasks, specifying a file to read from rather
2423 than forwarding stdin from the srun command may be preferable as it
2424 avoids moving and storing data that will never be read.
2425
2426 For OS X, the poll() function does not support stdin, so input from a
2427 terminal is not possible.
2428
2429 This behavior may be changed with the --output, --error, and --input
2430 (-o, -e, -i) options. Valid format specifications for these options are
2431
2432 all stdout and stderr are redirected from all tasks to srun. stdin is
2433 broadcast to all remote tasks. (This is the default behav‐
2434 ior)
2435
2436 none stdout and stderr are not received from any task. stdin is
2437 not sent to any task (stdin is closed).
2438
2439 taskid stdout and/or stderr are redirected from only the task with
2440 relative id equal to taskid, where 0 <= taskid < ntasks,
2441 where ntasks is the total number of tasks in the current job
2442 step. stdin is redirected from the stdin of srun to this
2443 same task. This file will be written on the node executing
2444 the task.
2445
2446 filename srun will redirect stdout and/or stderr to the named file
2447 from all tasks. stdin will be redirected from the named file
2448 and broadcast to all tasks in the job. filename refers to a
2449 path on the host that runs srun. Depending on the cluster's
2450 file system layout, this may result in the output appearing
2451 in different places depending on whether the job is run in
2452 batch mode.
2453
2454 filename pattern
2455 srun allows for a filename pattern to be used to generate the
2456 named IO file described above. The following list of format
2457 specifiers may be used in the format string to generate a
2458 filename that will be unique to a given jobid, stepid, node,
2459 or task. In each case, the appropriate number of files are
2460 opened and associated with the corresponding tasks. Note that
2461 any format string containing %t, %n, and/or %N will be writ‐
2462 ten on the node executing the task rather than the node where
2463 srun executes; these format specifiers are not supported on a
2464 BGQ system.
2465
2466 \\ Do not process any of the replacement symbols.
2467
2468 %% The character "%".
2469
2470 %A Job array's master job allocation number.
2471
2472 %a Job array ID (index) number.
2473
2474 %J jobid.stepid of the running job. (e.g. "128.0")
2475
2476 %j jobid of the running job.
2477
2478 %s stepid of the running job.
2479
2480 %N short hostname. This will create a separate IO file
2481 per node.
2482
2483 %n Node identifier relative to current job (e.g. "0" is
2484 the first node of the running job) This will create a
2485 separate IO file per node.
2486
2487 %t task identifier (rank) relative to current job. This
2488 will create a separate IO file per task.
2489
2490 %u User name.
2491
2492 %x Job name.
2493
2494 A number placed between the percent character and format
2495 specifier may be used to zero-pad the result in the IO file‐
2496 name. This number is ignored if the format specifier corre‐
2497 sponds to non-numeric data (%N for example).
2498
2499 Some examples of how the format string may be used for a 4
2500 task job step with a Job ID of 128 and step id of 0 are in‐
2501 cluded below:
2502
2503 job%J.out job128.0.out
2504
2505 job%4j.out job0128.out
2506
2507 job%j-%2t.out job128-00.out, job128-01.out, ...
2508
2510 Executing srun sends a remote procedure call to slurmctld. If enough
2511 calls from srun or other Slurm client commands that send remote proce‐
2512 dure calls to the slurmctld daemon come in at once, it can result in a
2513 degradation of performance of the slurmctld daemon, possibly resulting
2514 in a denial of service.
2515
2516 Do not run srun or other Slurm client commands that send remote proce‐
2517 dure calls to slurmctld from loops in shell scripts or other programs.
2518 Ensure that programs limit calls to srun to the minimum necessary for
2519 the information you are trying to gather.
2520
2521
2523 Some srun options may be set via environment variables. These environ‐
2524 ment variables, along with their corresponding options, are listed be‐
2525 low. Note: Command line options will always override these settings.
2526
2527 PMI_FANOUT This is used exclusively with PMI (MPICH2 and
2528 MVAPICH2) and controls the fanout of data commu‐
2529 nications. The srun command sends messages to ap‐
2530 plication programs (via the PMI library) and
2531 those applications may be called upon to forward
2532 that data to up to this number of additional
2533 tasks. Higher values offload work from the srun
2534 command to the applications and likely increase
2535 the vulnerability to failures. The default value
2536 is 32.
2537
2538 PMI_FANOUT_OFF_HOST This is used exclusively with PMI (MPICH2 and
2539 MVAPICH2) and controls the fanout of data commu‐
2540 nications. The srun command sends messages to
2541 application programs (via the PMI library) and
2542 those applications may be called upon to forward
2543 that data to additional tasks. By default, srun
2544 sends one message per host and one task on that
2545 host forwards the data to other tasks on that
2546 host up to PMI_FANOUT. If PMI_FANOUT_OFF_HOST is
2547 defined, the user task may be required to forward
2548 the data to tasks on other hosts. Setting
2549 PMI_FANOUT_OFF_HOST may increase performance.
2550 Since more work is performed by the PMI library
2551 loaded by the user application, failures also can
2552 be more common and more difficult to diagnose.
2553
2554 PMI_TIME This is used exclusively with PMI (MPICH2 and
2555 MVAPICH2) and controls how much the communica‐
2556 tions from the tasks to the srun are spread out
2557 in time in order to avoid overwhelming the srun
2558 command with work. The default value is 500 (mi‐
2559 croseconds) per task. On relatively slow proces‐
2560 sors or systems with very large processor counts
2561 (and large PMI data sets), higher values may be
2562 required.
2563
2564 SLURM_ACCOUNT Same as -A, --account
2565
2566 SLURM_ACCTG_FREQ Same as --acctg-freq
2567
2568 SLURM_BCAST Same as --bcast
2569
2570 SLURM_BURST_BUFFER Same as --bb
2571
2572 SLURM_CLUSTERS Same as -M, --clusters
2573
2574 SLURM_COMPRESS Same as --compress
2575
2576 SLURM_CONF The location of the Slurm configuration file.
2577
2578 SLURM_CONSTRAINT Same as -C, --constraint
2579
2580 SLURM_CORE_SPEC Same as --core-spec
2581
2582 SLURM_CPU_BIND Same as --cpu-bind
2583
2584 SLURM_CPU_FREQ_REQ Same as --cpu-freq.
2585
2586 SLURM_CPUS_PER_GPU Same as --cpus-per-gpu
2587
2588 SLURM_CPUS_PER_TASK Same as -c, --cpus-per-task
2589
2590 SLURM_DEBUG Same as -v, --verbose
2591
2592 SLURM_DELAY_BOOT Same as --delay-boot
2593
2594 SLURM_DEPENDENCY Same as -P, --dependency=<jobid>
2595
2596 SLURM_DISABLE_STATUS Same as -X, --disable-status
2597
2598 SLURM_DIST_PLANESIZE Plane distribution size. Only used if --distribu‐
2599 tion=plane, without =<size>, is set.
2600
2601 SLURM_DISTRIBUTION Same as -m, --distribution
2602
2603 SLURM_EPILOG Same as --epilog
2604
2605 SLURM_EXACT Same as --exact
2606
2607 SLURM_EXCLUSIVE Same as --exclusive
2608
2609 SLURM_EXIT_ERROR Specifies the exit code generated when a Slurm
2610 error occurs (e.g. invalid options). This can be
2611 used by a script to distinguish application exit
2612 codes from various Slurm error conditions. Also
2613 see SLURM_EXIT_IMMEDIATE.
2614
2615 SLURM_EXIT_IMMEDIATE Specifies the exit code generated when the --im‐
2616 mediate option is used and resources are not cur‐
2617 rently available. This can be used by a script
2618 to distinguish application exit codes from vari‐
2619 ous Slurm error conditions. Also see
2620 SLURM_EXIT_ERROR.
2621
2622 SLURM_EXPORT_ENV Same as --export
2623
2624 SLURM_GPU_BIND Same as --gpu-bind
2625
2626 SLURM_GPU_FREQ Same as --gpu-freq
2627
2628 SLURM_GPUS Same as -G, --gpus
2629
2630 SLURM_GPUS_PER_NODE Same as --gpus-per-node
2631
2632 SLURM_GPUS_PER_TASK Same as --gpus-per-task
2633
2634 SLURM_GRES Same as --gres. Also see SLURM_STEP_GRES
2635
2636 SLURM_GRES_FLAGS Same as --gres-flags
2637
2638 SLURM_HINT Same as --hint
2639
2640 SLURM_IMMEDIATE Same as -I, --immediate
2641
2642 SLURM_JOB_ID Same as --jobid
2643
2644 SLURM_JOB_NAME Same as -J, --job-name except within an existing
2645 allocation, in which case it is ignored to avoid
2646 using the batch job's name as the name of each
2647 job step.
2648
       SLURM_JOB_NODELIST    Same as -w, --nodelist=<host1,host2,... or
                             filename>. If the job has been resized, ensure
                             that this nodelist is adjusted (or undefined) to
                             avoid job steps being rejected due to down
                             nodes.
2653
2654 SLURM_JOB_NUM_NODES Same as -N, --nodes. Total number of nodes in
2655 the job’s resource allocation.
2656
2657 SLURM_KILL_BAD_EXIT Same as -K, --kill-on-bad-exit
2658
2659 SLURM_LABELIO Same as -l, --label
2660
2661 SLURM_MEM_BIND Same as --mem-bind
2662
2663 SLURM_MEM_PER_CPU Same as --mem-per-cpu
2664
2665 SLURM_MEM_PER_GPU Same as --mem-per-gpu
2666
2667 SLURM_MEM_PER_NODE Same as --mem
2668
2669 SLURM_MPI_TYPE Same as --mpi
2670
2671 SLURM_NETWORK Same as --network
2672
2673 SLURM_NNODES Same as -N, --nodes. Total number of nodes in the
2674 job’s resource allocation. See
2675 SLURM_JOB_NUM_NODES. Included for backwards com‐
2676 patibility.
2677
2678 SLURM_NO_KILL Same as -k, --no-kill
2679
2680 SLURM_NPROCS Same as -n, --ntasks. See SLURM_NTASKS. Included
2681 for backwards compatibility.
2682
2683 SLURM_NTASKS Same as -n, --ntasks
2684
2685 SLURM_NTASKS_PER_CORE Same as --ntasks-per-core
2686
2687 SLURM_NTASKS_PER_GPU Same as --ntasks-per-gpu
2688
2689 SLURM_NTASKS_PER_NODE Same as --ntasks-per-node
2690
2691 SLURM_NTASKS_PER_SOCKET
2692 Same as --ntasks-per-socket
2693
2694 SLURM_OPEN_MODE Same as --open-mode
2695
2696 SLURM_OVERCOMMIT Same as -O, --overcommit
2697
2698 SLURM_OVERLAP Same as --overlap
2699
2700 SLURM_PARTITION Same as -p, --partition
2701
       SLURM_PMI_KVS_NO_DUP_KEYS
                             If set, then PMI key-pairs will contain no
                             duplicate keys. MPI can use this variable to
                             inform the PMI library that it will not use
                             duplicate keys, so PMI can skip the check for
                             duplicate keys. This is the case for MPICH2 and
                             reduces the overhead of testing for duplicates,
                             improving performance.
2710
2711 SLURM_POWER Same as --power
2712
2713 SLURM_PROFILE Same as --profile
2714
2715 SLURM_PROLOG Same as --prolog
2716
2717 SLURM_QOS Same as --qos
2718
2719 SLURM_REMOTE_CWD Same as -D, --chdir=
2720
2721 SLURM_REQ_SWITCH When a tree topology is used, this defines the
2722 maximum count of switches desired for the job al‐
2723 location and optionally the maximum time to wait
2724 for that number of switches. See --switches
2725
2726 SLURM_RESERVATION Same as --reservation
2727
2728 SLURM_RESV_PORTS Same as --resv-ports
2729
2730 SLURM_SIGNAL Same as --signal
2731
2732 SLURM_SPREAD_JOB Same as --spread-job
2733
       SLURM_SRUN_REDUCE_TASK_EXIT_MSG
                             If set and non-zero, successive task exit
                             messages with the same exit code will be printed
                             only once.
2738
2739 SLURM_STDERRMODE Same as -e, --error
2740
2741 SLURM_STDINMODE Same as -i, --input
2742
2743 SLURM_STDOUTMODE Same as -o, --output
2744
2745 SLURM_STEP_GRES Same as --gres (only applies to job steps, not to
2746 job allocations). Also see SLURM_GRES
2747
2748 SLURM_STEP_KILLED_MSG_NODE_ID=ID
2749 If set, only the specified node will log when the
2750 job or step are killed by a signal.
2751
2752 SLURM_TASK_EPILOG Same as --task-epilog
2753
2754 SLURM_TASK_PROLOG Same as --task-prolog
2755
2756 SLURM_TEST_EXEC If defined, srun will verify existence of the ex‐
2757 ecutable program along with user execute permis‐
2758 sion on the node where srun was called before at‐
2759 tempting to launch it on nodes in the step.
2760
2761 SLURM_THREAD_SPEC Same as --thread-spec
2762
2763 SLURM_THREADS Same as -T, --threads
2764
       SLURM_THREADS_PER_CORE
                             Same as --threads-per-core
2767
2768 SLURM_TIMELIMIT Same as -t, --time
2769
2770 SLURM_UNBUFFEREDIO Same as -u, --unbuffered
2771
2772 SLURM_USE_MIN_NODES Same as --use-min-nodes
2773
2774 SLURM_WAIT Same as -W, --wait
2775
2776 SLURM_WAIT4SWITCH Max time waiting for requested switches. See
2777 --switches
2778
2779 SLURM_WCKEY Same as -W, --wckey
2780
       SLURM_WORKING_DIR     Same as -D, --chdir
2782
2783 SLURMD_DEBUG Same as -d, --slurmd-debug
2784
2785 SRUN_EXPORT_ENV Same as --export, and will override any setting
2786 for SLURM_EXPORT_ENV.
2787
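       The following illustrative session (hypothetical; output not shown)
       sets options through the environment and then overrides one of them on
       the command line:

             $ export SLURM_NTASKS=4
             $ export SLURM_PARTITION=debug
             $ srun hostname      # runs 4 tasks in the "debug" partition
             $ srun -n2 hostname  # -n2 overrides SLURM_NTASKS for this run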
2788
2789
OUTPUT ENVIRONMENT VARIABLES
       srun will set some environment variables in the environment of the
       executing tasks on the remote compute nodes. These environment
       variables are listed below; a short example of inspecting them from
       within a job step follows the list.
2794
2795
2796 SLURM_*_HET_GROUP_# For a heterogeneous job allocation, the environ‐
2797 ment variables are set separately for each compo‐
2798 nent.
2799
2800 SLURM_CLUSTER_NAME Name of the cluster on which the job is execut‐
2801 ing.
2802
2803 SLURM_CPU_BIND_LIST --cpu-bind map or mask list (list of Slurm CPU
2804 IDs or masks for this node, CPU_ID = Board_ID x
2805 threads_per_board + Socket_ID x
2806 threads_per_socket + Core_ID x threads_per_core +
2807 Thread_ID).
2808
2809 SLURM_CPU_BIND_TYPE --cpu-bind type (none,rank,map_cpu:,mask_cpu:).
2810
2811 SLURM_CPU_BIND_VERBOSE
2812 --cpu-bind verbosity (quiet,verbose).
2813
       SLURM_CPU_FREQ_REQ    Contains the value requested for cpu frequency
                             on the srun command as a numerical frequency in
                             kilohertz, or a coded value for a request of
                             low, medium, highm1 or high for the frequency.
                             See the description of the --cpu-freq option or
                             the SLURM_CPU_FREQ_REQ input environment
                             variable.
2820
2821 SLURM_CPUS_ON_NODE Count of processors available to the job on this
2822 node. Note the select/linear plugin allocates
2823 entire nodes to jobs, so the value indicates the
2824 total count of CPUs on the node. For the se‐
2825 lect/cons_res plugin, this number indicates the
2826 number of cores on this node allocated to the
2827 job.
2828
2829 SLURM_CPUS_PER_TASK Number of cpus requested per task. Only set if
2830 the --cpus-per-task option is specified.
2831
2832 SLURM_DISTRIBUTION Distribution type for the allocated jobs. Set the
2833 distribution with -m, --distribution.
2834
2835 SLURM_GTIDS Global task IDs running on this node. Zero ori‐
2836 gin and comma separated.
2837
2838 SLURM_HET_SIZE Set to count of components in heterogeneous job.
2839
       SLURM_JOB_ACCOUNT     Account name associated with the job allocation.
2841
       SLURM_JOB_CPUS_PER_NODE
                             Number of CPUs per node.
2844
2845 SLURM_JOB_DEPENDENCY Set to value of the --dependency option.
2846
2847 SLURM_JOB_ID Job id of the executing job.
2848
2849 SLURM_JOB_NAME Set to the value of the --job-name option or the
2850 command name when srun is used to create a new
2851 job allocation. Not set when srun is used only to
2852 create a job step (i.e. within an existing job
2853 allocation).
2854
2855 SLURM_JOB_NODELIST List of nodes allocated to the job.
2856
2857 SLURM_JOB_NODES Total number of nodes in the job's resource allo‐
2858 cation.
2859
2860 SLURM_JOB_PARTITION Name of the partition in which the job is run‐
2861 ning.
2862
2863 SLURM_JOB_QOS Quality Of Service (QOS) of the job allocation.
2864
2865 SLURM_JOB_RESERVATION Advanced reservation containing the job alloca‐
2866 tion, if any.
2867
2868 SLURM_JOBID Job id of the executing job. See SLURM_JOB_ID.
2869 Included for backwards compatibility.
2870
2871 SLURM_LAUNCH_NODE_IPADDR
2872 IP address of the node from which the task launch
2873 was initiated (where the srun command ran from).
2874
2875 SLURM_LOCALID Node local task ID for the process within a job.
2876
2877 SLURM_MEM_BIND_LIST --mem-bind map or mask list (<list of IDs or
2878 masks for this node>).
2879
2880 SLURM_MEM_BIND_PREFER --mem-bind prefer (prefer).
2881
2882 SLURM_MEM_BIND_SORT Sort free cache pages (run zonesort on Intel KNL
2883 nodes).
2884
2885 SLURM_MEM_BIND_TYPE --mem-bind type (none,rank,map_mem:,mask_mem:).
2886
2887 SLURM_MEM_BIND_VERBOSE
2888 --mem-bind verbosity (quiet,verbose).
2889
       SLURM_NODE_ALIASES    Sets of node name, communication address and
                             hostname for nodes allocated to the job from the
                             cloud. Each element in the set is colon
                             separated and each set is comma separated. For
                             example:
                             SLURM_NODE_ALIASES=ec0:1.2.3.4:foo,ec1:1.2.3.5:bar
2896
2897 SLURM_NODEID The relative node ID of the current node.
2898
2899 SLURM_NPROCS Total number of processes in the current job or
2900 job step. See SLURM_NTASKS. Included for back‐
2901 wards compatibility.
2902
2903 SLURM_NTASKS Total number of processes in the current job or
2904 job step.
2905
2906 SLURM_OVERCOMMIT Set to 1 if --overcommit was specified.
2907
2908 SLURM_PRIO_PROCESS The scheduling priority (nice value) at the time
2909 of job submission. This value is propagated to
2910 the spawned processes.
2911
2912 SLURM_PROCID The MPI rank (or relative process ID) of the cur‐
2913 rent process.
2914
2915 SLURM_SRUN_COMM_HOST IP address of srun communication host.
2916
2917 SLURM_SRUN_COMM_PORT srun communication port.
2918
2919 SLURM_STEP_ID The step ID of the current job.
2920
2921 SLURM_STEP_LAUNCHER_PORT
2922 Step launcher port.
2923
2924 SLURM_STEP_NODELIST List of nodes allocated to the step.
2925
2926 SLURM_STEP_NUM_NODES Number of nodes allocated to the step.
2927
2928 SLURM_STEP_NUM_TASKS Number of processes in the job step or whole het‐
2929 erogeneous job step.
2930
2931 SLURM_STEP_TASKS_PER_NODE
2932 Number of processes per node within the step.
2933
2934 SLURM_STEPID The step ID of the current job. See
2935 SLURM_STEP_ID. Included for backwards compatibil‐
2936 ity.
2937
2938 SLURM_SUBMIT_DIR The directory from which srun was invoked.
2939
2940 SLURM_SUBMIT_HOST The hostname of the computer from which salloc
2941 was invoked.
2942
2943 SLURM_TASK_PID The process ID of the task being started.
2944
2945 SLURM_TASKS_PER_NODE Number of tasks to be initiated on each node.
2946 Values are comma separated and in the same order
2947 as SLURM_JOB_NODELIST. If two or more consecu‐
2948 tive nodes are to have the same task count, that
2949 count is followed by "(x#)" where "#" is the rep‐
2950 etition count. For example,
2951 "SLURM_TASKS_PER_NODE=2(x3),1" indicates that the
2952 first three nodes will each execute two tasks and
2953 the fourth node will execute one task.
2954
2955
       SLURM_TOPOLOGY_ADDR   This is set only if the system has the
                             topology/tree plugin configured. The value will
                             be set to the names of the network switches
                             which may be involved in the job's
                             communications, from the system's top level
                             switch down to the leaf switch, and ending with
                             the node name. A period is used to separate each
                             hardware component name.
2963
       SLURM_TOPOLOGY_ADDR_PATTERN
                             This is set only if the system has the
                             topology/tree plugin configured. The value will
                             be set to the component types listed in
                             SLURM_TOPOLOGY_ADDR. Each component will be
                             identified as either "switch" or "node". A
                             period is used to separate each hardware
                             component type.
2971
2972 SLURM_UMASK The umask in effect when the job was submitted.
2973
2974 SLURMD_NODENAME Name of the node running the task. In the case of
2975 a parallel job executing on multiple compute
2976 nodes, the various tasks will have this environ‐
2977 ment variable set to different values on each
2978 compute node.
2979
2980 SRUN_DEBUG Set to the logging level of the srun command.
2981 Default value is 3 (info level). The value is
2982 incremented or decremented based upon the --ver‐
2983 bose and --quiet options.
2984
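       As an illustrative sketch (show_env.sh is a hypothetical script and
       the exact output will vary), a few of these variables can be examined
       from within a job step:

             $ cat show_env.sh
             #!/bin/sh
             # Report which rank this task was assigned and where it runs
             echo "task $SLURM_PROCID of $SLURM_NTASKS (local ID $SLURM_LOCALID)"
             echo "  on node $SLURMD_NODENAME (node ID $SLURM_NODEID)"

             $ srun -N2 -n4 show_env.sh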
2985
SIGNALS AND ESCAPE SEQUENCES
2987 Signals sent to the srun command are automatically forwarded to the
2988 tasks it is controlling with a few exceptions. The escape sequence
2989 <control-c> will report the state of all tasks associated with the srun
2990 command. If <control-c> is entered twice within one second, then the
2991 associated SIGINT signal will be sent to all tasks and a termination
2992 sequence will be entered sending SIGCONT, SIGTERM, and SIGKILL to all
2993 spawned tasks. If a third <control-c> is received, the srun program
2994 will be terminated without waiting for remote tasks to exit or their
2995 I/O to complete.
2996
2997 The escape sequence <control-z> is presently ignored.
2998
2999
MPI SUPPORT
       MPI use depends upon the type of MPI being used. There are three
       fundamentally different modes of operation used by these various MPI
       implementations. Illustrative command sketches follow the list of
       modes below.
3004
3005 1. Slurm directly launches the tasks and performs initialization of
3006 communications through the PMI2 or PMIx APIs. For example: "srun -n16
3007 a.out".
3008
3009 2. Slurm creates a resource allocation for the job and then mpirun
3010 launches tasks using Slurm's infrastructure (OpenMPI).
3011
       3. Slurm creates a resource allocation for the job and then mpirun
       launches tasks using some mechanism other than Slurm, such as SSH or
       RSH. These tasks are initiated outside of Slurm's monitoring or
       control. Slurm's epilog should be configured to purge these tasks when
       the job's allocation is relinquished; alternatively, the use of
       pam_slurm_adopt is highly recommended.
3018
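       The commands below are illustrative sketches only: availability of the
       pmi2 and pmix plugins depends on how Slurm and the MPI library were
       built, and a.out stands in for an MPI application.

             # Mode 1: Slurm launches the tasks and initializes
             # communications through a PMI plugin (PMIx in this sketch).
             $ srun --mpi=pmix -n16 a.out

             # Mode 2: Slurm only creates the allocation; mpirun launches
             # the tasks within it using Slurm's infrastructure.
             $ salloc -N4
             $ mpirun a.out
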
       See https://slurm.schedmd.com/mpi_guide.html for more information on
       the use of these various MPI implementations with Slurm.
3021
3022
MULTIPLE PROGRAM CONFIGURATION
3024 Comments in the configuration file must have a "#" in column one. The
3025 configuration file contains the following fields separated by white
3026 space:
3027
3028 Task rank
3029 One or more task ranks to use this configuration. Multiple val‐
3030 ues may be comma separated. Ranges may be indicated with two
3031 numbers separated with a '-' with the smaller number first (e.g.
3032 "0-4" and not "4-0"). To indicate all tasks not otherwise spec‐
3033 ified, specify a rank of '*' as the last line of the file. If
3034 an attempt is made to initiate a task for which no executable
              program is defined, the following error message will be
              produced: "No executable program specified for this task".
3037
       Executable
              The name of the program to execute. May be a fully qualified
              pathname if desired.
3041
3042 Arguments
3043 Program arguments. The expression "%t" will be replaced with
3044 the task's number. The expression "%o" will be replaced with
3045 the task's offset within this range (e.g. a configured task rank
3046 value of "1-5" would have offset values of "0-4"). Single
3047 quotes may be used to avoid having the enclosed values inter‐
3048 preted. This field is optional. Any arguments for the program
3049 entered on the command line will be added to the arguments spec‐
3050 ified in the configuration file.
3051
3052 For example:
3053 $ cat silly.conf
3054 ###################################################################
3055 # srun multiple program configuration file
3056 #
3057 # srun -n8 -l --multi-prog silly.conf
3058 ###################################################################
3059 4-6 hostname
3060 1,7 echo task:%t
3061 0,2-3 echo offset:%o
3062
3063 $ srun -n8 -l --multi-prog silly.conf
3064 0: offset:0
3065 1: task:1
3066 2: offset:1
3067 3: offset:2
3068 4: linux15.llnl.gov
3069 5: linux16.llnl.gov
3070 6: linux17.llnl.gov
3071 7: task:7
3072
3073
EXAMPLES
       This simple example demonstrates the execution of the command hostname
       in eight tasks. At least eight processors will be allocated to the job
       (the same as the task count) on however many nodes are required to
       satisfy the request. The output of each task will be preceded by its
       task number. (The machine "dev" in the example below has a total of
       two CPUs per node.)
3081
3082 $ srun -n8 -l hostname
3083 0: dev0
3084 1: dev0
3085 2: dev1
3086 3: dev1
3087 4: dev2
3088 5: dev2
3089 6: dev3
3090 7: dev3
3091
3092
       The srun -r option is used within a job script to run two job steps on
       disjoint nodes in the following example. The script is run in an
       interactive allocation created by salloc rather than as a batch job in
       this case.
3096
3097 $ cat test.sh
3098 #!/bin/sh
3099 echo $SLURM_JOB_NODELIST
3100 srun -lN2 -r2 hostname
3101 srun -lN2 hostname
3102
3103 $ salloc -N4 test.sh
3104 dev[7-10]
3105 0: dev9
3106 1: dev10
3107 0: dev7
3108 1: dev8
3109
3110
3111 The following script runs two job steps in parallel within an allocated
3112 set of nodes.
3113
3114 $ cat test.sh
3115 #!/bin/bash
3116 srun -lN2 -n4 -r 2 sleep 60 &
3117 srun -lN2 -r 0 sleep 60 &
3118 sleep 1
3119 squeue
3120 squeue -s
3121 wait
3122
3123 $ salloc -N4 test.sh
3124 JOBID PARTITION NAME USER ST TIME NODES NODELIST
3125 65641 batch test.sh grondo R 0:01 4 dev[7-10]
3126
3127 STEPID PARTITION USER TIME NODELIST
3128 65641.0 batch grondo 0:01 dev[7-8]
3129 65641.1 batch grondo 0:01 dev[9-10]
3130
3131
3132 This example demonstrates how one executes a simple MPI job. We use
3133 srun to build a list of machines (nodes) to be used by mpirun in its
3134 required format. A sample command line and the script to be executed
3135 follow.
3136
3137 $ cat test.sh
3138 #!/bin/sh
3139 MACHINEFILE="nodes.$SLURM_JOB_ID"
3140
3141 # Generate Machinefile for mpi such that hosts are in the same
3142 # order as if run via srun
3143 #
3144 srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
3145
3146 # Run using generated Machine file:
3147 mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE mpi-app
3148
3149 rm $MACHINEFILE
3150
3151 $ salloc -N2 -n4 test.sh
3152
3153
       This simple example demonstrates the execution of different programs
       on different nodes in the same srun. You can do this for any number of
       nodes or any number of programs. The executable to run on each node is
       selected by the SLURM_NODEID environment variable, which starts at 0
       and ranges up to one less than the number of nodes specified on the
       srun command line.
3159
           $ cat test.sh
           #!/bin/sh
           case $SLURM_NODEID in
3162 0) echo "I am running on "
3163 hostname ;;
3164 1) hostname
3165 echo "is where I am running" ;;
3166 esac
3167
3168 $ srun -N2 test.sh
3169 dev0
3170 is where I am running
3171 I am running on
3172 dev1
3173
3174
3175 This example demonstrates use of multi-core options to control layout
3176 of tasks. We request that four sockets per node and two cores per
3177 socket be dedicated to the job.
3178
3179 $ srun -N2 -B 4-4:2-2 a.out
3180
3181
3182 This example shows a script in which Slurm is used to provide resource
3183 management for a job by executing the various job steps as processors
3184 become available for their dedicated use.
3185
3186 $ cat my.script
3187 #!/bin/bash
3188 srun -n4 prog1 &
3189 srun -n3 prog2 &
3190 srun -n1 prog3 &
3191 srun -n1 prog4 &
3192 wait
3193
3194
       This example shows how to launch an application called "server" with
       one task, 16 CPUs and 16 GB of memory (1 GB per CPU) plus another
       application called "client" with 16 tasks, 1 CPU per task (the
       default) and 1 GB of memory per task.
3199
3200 $ srun -n1 -c16 --mem-per-cpu=1gb server : -n16 --mem-per-cpu=1gb client
3201
3202
COPYING
3204 Copyright (C) 2006-2007 The Regents of the University of California.
3205 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
3206 Copyright (C) 2008-2010 Lawrence Livermore National Security.
3207 Copyright (C) 2010-2015 SchedMD LLC.
3208
3209 This file is part of Slurm, a resource management program. For de‐
3210 tails, see <https://slurm.schedmd.com/>.
3211
3212 Slurm is free software; you can redistribute it and/or modify it under
3213 the terms of the GNU General Public License as published by the Free
3214 Software Foundation; either version 2 of the License, or (at your op‐
3215 tion) any later version.
3216
3217 Slurm is distributed in the hope that it will be useful, but WITHOUT
3218 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
3219 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
3220 for more details.
3221
3222
SEE ALSO
       salloc(1), sattach(1), sbatch(1), sbcast(1), scancel(1), scontrol(1),
       squeue(1), slurm.conf(5), sched_setaffinity(2), numa(3), getrlimit(2)
3226
3227
3228
April 2021                      Slurm Commands                        srun(1)