1slurm.conf(5) Slurm Configuration File slurm.conf(5)
2
3
4
6 slurm.conf - Slurm configuration file
7
8
10 slurm.conf is an ASCII file which describes general Slurm configuration
11 information, the nodes to be managed, information about how those nodes
12 are grouped into partitions, and various scheduling parameters associ‐
13 ated with those partitions. This file should be consistent across all
14 nodes in the cluster.
15
16 The file location can be modified at system build time using the
17 DEFAULT_SLURM_CONF parameter or at execution time by setting the
18 SLURM_CONF environment variable. The Slurm daemons also allow you to
19 override both the built-in and environment-provided location using the
20 "-f" option on the command line.
21
22 The contents of the file are case insensitive except for the names of
23 nodes and partitions. Any text following a "#" in the configuration
24 file is treated as a comment through the end of that line. Changes to
25 the configuration file take effect upon restart of Slurm daemons, dae‐
26 mon receipt of the SIGHUP signal, or execution of the command "scontrol
27 reconfigure" unless otherwise noted.
28
29 If a line begins with the word "Include" followed by whitespace and
30 then a file name, that file will be included inline with the current
31 configuration file. For large or complex systems, multiple configura‐
32 tion files may prove easier to manage and enable reuse of some files
33 (See INCLUDE MODIFIERS for more details).
34
35 Note on file permissions:
36
37 The slurm.conf file must be readable by all users of Slurm, since it is
38 used by many of the Slurm commands. Other files that are defined in
39 the slurm.conf file, such as log files and job accounting files, may
40 need to be created/owned by the user "SlurmUser" to be successfully
41 accessed. Use the "chown" and "chmod" commands to set the ownership
42 and permissions appropriately. See the section FILE AND DIRECTORY PER‐
43 MISSIONS for information about the various files and directories used
44 by Slurm.
45
46
48 The overall configuration parameters available include:
49
50
51 AccountingStorageBackupHost
52 The name of the backup machine hosting the accounting storage
53 database. If used with the accounting_storage/slurmdbd plugin,
54 this is where the backup slurmdbd would be running. Only used
55 with systems using SlurmDBD, ignored otherwise.
56
57
58 AccountingStorageEnforce
59 This controls what level of association-based enforcement to
60 impose on job submissions. Valid options are any combination of
61 associations, limits, nojobs, nosteps, qos, safe, and wckeys, or
62 all for all things (expect nojobs and nosteps, they must be
63 requested as well).
64
65 If limits, qos, or wckeys are set, associations will automati‐
66 cally be set.
67
68 If wckeys is set, TrackWCKey will automatically be set.
69
70 If safe is set, limits and associations will automatically be
71 set.
72
73 If nojobs is set nosteps will automatically be set.
74
75 By enforcing Associations no new job is allowed to run unless a
76 corresponding association exists in the system. If limits are
77 enforced users can be limited by association to whatever job
78 size or run time limits are defined.
79
80 If nojobs is set Slurm will not account for any jobs or steps on
81 the system, like wise if nosteps is set Slurm will not account
82 for any steps ran limits will still be enforced.
83
84 If safe is enforced a job will only be launched against an asso‐
85 ciation or qos that has a GrpCPUMins limit set if the job will
86 be able to run to completion. Without this option set, jobs
87 will be launched as long as their usage hasn't reached the cpu-
88 minutes limit which can lead to jobs being launched but then
89 killed when the limit is reached.
90
91 With qos and/or wckeys enforced jobs will not be scheduled
92 unless a valid qos and/or workload characterization key is spec‐
93 ified.
94
95 When AccountingStorageEnforce is changed, a restart of the
96 slurmctld daemon is required (not just a "scontrol reconfig").
97
98
99 AccountingStorageHost
100 The name of the machine hosting the accounting storage database.
101 Only used with systems using SlurmDBD, ignored otherwise. Also
102 see DefaultStorageHost.
103
104
105 AccountingStorageLoc
106 The fully qualified file name where accounting records are writ‐
107 ten when the AccountingStorageType is "accounting_stor‐
108 age/filetxt". Also see DefaultStorageLoc.
109
110
111 AccountingStoragePass
112 The password used to gain access to the database to store the
113 accounting data. Only used for database type storage plugins,
114 ignored otherwise. In the case of Slurm DBD (Database Daemon)
115 with MUNGE authentication this can be configured to use a MUNGE
116 daemon specifically configured to provide authentication between
117 clusters while the default MUNGE daemon provides authentication
118 within a cluster. In that case, AccountingStoragePass should
119 specify the named port to be used for communications with the
120 alternate MUNGE daemon (e.g. "/var/run/munge/global.socket.2").
121 The default value is NULL. Also see DefaultStoragePass.
122
123
124 AccountingStoragePort
125 The listening port of the accounting storage database server.
126 Only used for database type storage plugins, ignored otherwise.
127 Also see DefaultStoragePort.
128
129
130 AccountingStorageTRES
131 Comma separated list of resources you wish to track on the clus‐
132 ter. These are the resources requested by the sbatch/srun job
133 when it is submitted. Currently this consists of any GRES, BB
134 (burst buffer) or license along with CPU, Memory, Node, Energy,
135 FS/[Disk|Lustre], IC/OFED, Pages, and VMem. By default Billing,
136 CPU, Energy, Memory, Node, FS/Disk, Pages and VMem are tracked.
137 These default TRES cannot be disabled, but only appended to.
138 AccountingStorageTRES=gres/craynetwork,license/iop1 will track
139 billing, cpu, energy, memory, nodes, fs/disk, pages and vmem
140 along with a gres called craynetwork as well as a license called
141 iop1. Whenever these resources are used on the cluster they are
142 recorded. The TRES are automatically set up in the database on
143 the start of the slurmctld.
144
145
146 AccountingStorageType
147 The accounting storage mechanism type. Acceptable values at
148 present include "accounting_storage/filetxt", "accounting_stor‐
149 age/none" and "accounting_storage/slurmdbd". The "account‐
150 ing_storage/filetxt" value indicates that accounting records
151 will be written to the file specified by the AccountingStorage‐
152 Loc parameter. The "accounting_storage/slurmdbd" value indi‐
153 cates that accounting records will be written to the Slurm DBD,
154 which manages an underlying MySQL database. See "man slurmdbd"
155 for more information. The default value is "accounting_stor‐
156 age/none" and indicates that account records are not maintained.
157 Note: The filetxt plugin records only a limited subset of
158 accounting information and will prevent some sacct options from
159 proper operation. Also see DefaultStorageType.
160
161
162 AccountingStorageUser
163 The user account for accessing the accounting storage database.
164 Only used for database type storage plugins, ignored otherwise.
165 Also see DefaultStorageUser.
166
167
168 AccountingStoreJobComment
169 If set to "YES" then include the job's comment field in the job
170 complete message sent to the Accounting Storage database. The
171 default is "YES". Note the AdminComment and SystemComment are
172 always recorded in the database.
173
174
175 AcctGatherNodeFreq
176 The AcctGather plugins sampling interval for node accounting.
177 For AcctGather plugin values of none, this parameter is ignored.
178 For all other values this parameter is the number of seconds
179 between node accounting samples. For the acct_gather_energy/rapl
180 plugin, set a value less than 300 because the counters may over‐
181 flow beyond this rate. The default value is zero. This value
182 disables accounting sampling for nodes. Note: The accounting
183 sampling interval for jobs is determined by the value of JobAc‐
184 ctGatherFrequency.
185
186
187 AcctGatherEnergyType
188 Identifies the plugin to be used for energy consumption account‐
189 ing. The jobacct_gather plugin and slurmd daemon call this
190 plugin to collect energy consumption data for jobs and nodes.
191 The collection of energy consumption data takes place on the
192 node level, hence only in case of exclusive job allocation the
193 energy consumption measurements will reflect the job's real con‐
194 sumption. In case of node sharing between jobs the reported con‐
195 sumed energy per job (through sstat or sacct) will not reflect
196 the real energy consumed by the jobs.
197
198 Configurable values at present are:
199
200 acct_gather_energy/none
201 No energy consumption data is collected.
202
203 acct_gather_energy/ipmi
204 Energy consumption data is collected from
205 the Baseboard Management Controller (BMC)
206 using the Intelligent Platform Management
207 Interface (IPMI).
208
209 acct_gather_energy/rapl
210 Energy consumption data is collected from
211 hardware sensors using the Running Average
212 Power Limit (RAPL) mechanism. Note that
213 enabling RAPL may require the execution of
214 the command "sudo modprobe msr".
215
216
217 AcctGatherInfinibandType
218 Identifies the plugin to be used for infiniband network traffic
219 accounting. The jobacct_gather plugin and slurmd daemon call
220 this plugin to collect network traffic data for jobs and nodes.
221 The collection of network traffic data takes place on the node
222 level, hence only in case of exclusive job allocation the col‐
223 lected values will reflect the job's real traffic. In case of
224 node sharing between jobs the reported network traffic per job
225 (through sstat or sacct) will not reflect the real network traf‐
226 fic by the jobs.
227
228 Configurable values at present are:
229
230 acct_gather_infiniband/none
231 No infiniband network data are collected.
232
233 acct_gather_infiniband/ofed
234 Infiniband network traffic data are col‐
235 lected from the hardware monitoring counters
236 of Infiniband devices through the OFED
237 library. In order to account for per job
238 network traffic, add the "ic/ofed" TRES to
239 AccountingStorageTRES.
240
241
242 AcctGatherFilesystemType
243 Identifies the plugin to be used for filesystem traffic account‐
244 ing. The jobacct_gather plugin and slurmd daemon call this
245 plugin to collect filesystem traffic data for jobs and nodes.
246 The collection of filesystem traffic data takes place on the
247 node level, hence only in case of exclusive job allocation the
248 collected values will reflect the job's real traffic. In case of
249 node sharing between jobs the reported filesystem traffic per
250 job (through sstat or sacct) will not reflect the real filesys‐
251 tem traffic by the jobs.
252
253
254 Configurable values at present are:
255
256 acct_gather_filesystem/none
257 No filesystem data are collected.
258
259 acct_gather_filesystem/lustre
260 Lustre filesystem traffic data are collected
261 from the counters found in /proc/fs/lustre/.
262 In order to account for per job lustre traf‐
263 fic, add the "fs/lustre" TRES to Account‐
264 ingStorageTRES.
265
266
267 AcctGatherProfileType
268 Identifies the plugin to be used for detailed job profiling.
269 The jobacct_gather plugin and slurmd daemon call this plugin to
270 collect detailed data such as I/O counts, memory usage, or
271 energy consumption for jobs and nodes. There are interfaces in
272 this plugin to collect data as step start and completion, task
273 start and completion, and at the account gather frequency. The
274 data collected at the node level is related to jobs only in case
275 of exclusive job allocation.
276
277 Configurable values at present are:
278
279 acct_gather_profile/none
280 No profile data is collected.
281
282 acct_gather_profile/hdf5
283 This enables the HDF5 plugin. The directory
284 where the profile files are stored and which
285 values are collected are configured in the
286 acct_gather.conf file.
287
288 acct_gather_profile/influxdb
289 This enables the influxdb plugin. The
290 influxdb instance host, port, database,
291 retention policy and which values are col‐
292 lected are configured in the
293 acct_gather.conf file.
294
295
296 AllowSpecResourcesUsage
297 If set to 1, Slurm allows individual jobs to override node's
298 configured CoreSpecCount value. For a job to take advantage of
299 this feature, a command line option of --core-spec must be spec‐
300 ified. The default value for this option is 1 for Cray systems
301 and 0 for other system types.
302
303
304 AuthInfo
305 Additional information to be used for authentication of communi‐
306 cations between the Slurm daemons (slurmctld and slurmd) and the
307 Slurm clients. The interpretation of this option is specific to
308 the configured AuthType. Multiple options may be specified in a
309 comma delimited list. If not specified, the default authentica‐
310 tion information will be used.
311
312 cred_expire Default job step credential lifetime, in seconds
313 (e.g. "cred_expire=1200"). It must be suffi‐
314 ciently long enough to load user environment, run
315 prolog, deal with the slurmd getting paged out of
316 memory, etc. This also controls how long a
317 requeued job must wait before starting again. The
318 default value is 120 seconds.
319
320 socket Path name to a MUNGE daemon socket to use (e.g.
321 "socket=/var/run/munge/munge.socket.2"). The
322 default value is "/var/run/munge/munge.socket.2".
323 Used by auth/munge and crypto/munge.
324
325 ttl Credential lifetime, in seconds (e.g. "ttl=300").
326 The default value is dependent upon the MUNGE
327 installation, but is typically 300 seconds.
328
329
330 AuthType
331 The authentication method for communications between Slurm com‐
332 ponents. Acceptable values at present include "auth/munge" and
333 "auth/none". The default value is "auth/munge". "auth/none"
334 includes the UID in each communication, but it is not verified.
335 This may be fine for testing purposes, but do not use
336 "auth/none" if you desire any security. "auth/munge" indicates
337 that MUNGE is to be used. (See "https://dun.github.io/munge/"
338 for more information). All Slurm daemons and commands must be
339 terminated prior to changing the value of AuthType and later
340 restarted.
341
342
343 BackupAddr
344 Defunct option, see SlurmctldHost.
345
346
347 BackupController
348 Defunct option, see SlurmctldHost.
349
350 The backup controller recovers state information from the State‐
351 SaveLocation directory, which must be readable and writable from
352 both the primary and backup controllers. While not essential,
353 it is recommended that you specify a backup controller. See
354 the RELOCATING CONTROLLERS section if you change this.
355
356
357 BatchStartTimeout
358 The maximum time (in seconds) that a batch job is permitted for
359 launching before being considered missing and releasing the
360 allocation. The default value is 10 (seconds). Larger values may
361 be required if more time is required to execute the Prolog, load
362 user environment variables (for Moab spawned jobs), or if the
363 slurmd daemon gets paged from memory.
364 Note: The test for a job being successfully launched is only
365 performed when the Slurm daemon on the compute node registers
366 state with the slurmctld daemon on the head node, which happens
367 fairly rarely. Therefore a job will not necessarily be termi‐
368 nated if its start time exceeds BatchStartTimeout. This config‐
369 uration parameter is also applied to launch tasks and avoid
370 aborting srun commands due to long running Prolog scripts.
371
372
373 BurstBufferType
374 The plugin used to manage burst buffers. Acceptable values at
375 present include "burst_buffer/none". More information later...
376
377
378 CheckpointType
379 The system-initiated checkpoint method to be used for user jobs.
380 The slurmctld daemon must be restarted for a change in Check‐
381 pointType to take effect. Supported values presently include:
382
383 checkpoint/blcr Berkeley Lab Checkpoint Restart (BLCR). NOTE:
384 If a file is found at sbin/scch (relative to
385 the Slurm installation location), it will be
386 executed upon completion of the checkpoint.
387 This can be a script used for managing the
388 checkpoint files. NOTE: Slurm's BLCR logic
389 only supports batch jobs.
390
391 checkpoint/none no checkpoint support (default)
392
393 checkpoint/ompi OpenMPI (version 1.3 or higher)
394
395
396 ClusterName
397 The name by which this Slurm managed cluster is known in the
398 accounting database. This is needed distinguish accounting
399 records when multiple clusters report to the same database.
400 Because of limitations in some databases, any upper case letters
401 in the name will be silently mapped to lower case. In order to
402 avoid confusion, it is recommended that the name be lower case.
403
404
405 CommunicationParameters
406 Comma separated options identifying communication options.
407
408 CheckGhalQuiesce
409 Used specifically on a Cray using an Aries Ghal
410 interconnect. This will check to see if the sys‐
411 tem is quiescing when sending a message, and if
412 so, we wait until it is done before sending.
413
414 NoAddrCache By default, Slurm will cache a node's network
415 address after
416 successfully establishing the node's network
417 address. This option disables the cache and Slurm
418 will look up the node's network address each time
419 a connection is made. This is useful, for exam‐
420 ple, in a cloud environment where the node
421 addresses come and go out of DNS.
422
423 NoCtldInAddrAny
424 Used to directly bind to the address of what the
425 node resolves to running the slurmctld instead of
426 binding messages to any address on the node,
427 which is the default.
428
429 NoInAddrAny Used to directly bind to the address of what the
430 node resolves to instead of binding messages to
431 any address on the node which is the default.
432 This option is for all daemons/clients except for
433 the slurmctld.
434
435
436 CompleteWait
437 The time, in seconds, given for a job to remain in COMPLETING
438 state before any additional jobs are scheduled. If set to zero,
439 pending jobs will be started as soon as possible. Since a COM‐
440 PLETING job's resources are released for use by other jobs as
441 soon as the Epilog completes on each individual node, this can
442 result in very fragmented resource allocations. To provide jobs
443 with the minimum response time, a value of zero is recommended
444 (no waiting). To minimize fragmentation of resources, a value
445 equal to KillWait plus two is recommended. In that case, set‐
446 ting KillWait to a small value may be beneficial. The default
447 value of CompleteWait is zero seconds. The value may not exceed
448 65533.
449
450
451 ControlAddr
452 Defunct option, see SlurmctldHost.
453
454
455 ControlMachine
456 Defunct option, see SlurmctldHost.
457
458
459 CoreSpecPlugin
460 Identifies the plugins to be used for enforcement of core spe‐
461 cialization. The slurmd daemon must be restarted for a change
462 in CoreSpecPlugin to take effect. Acceptable values at present
463 include:
464
465 core_spec/cray used only for Cray systems
466
467 core_spec/none used for all other system types
468
469
470 CpuFreqDef
471 Default CPU frequency value or frequency governor to use when
472 running a job step if it has not been explicitly set with the
473 --cpu-freq option. Acceptable values at present include a
474 numeric value (frequency in kilohertz) or one of the following
475 governors:
476
477 Conservative attempts to use the Conservative CPU governor
478
479 OnDemand attempts to use the OnDemand CPU governor
480
481 Performance attempts to use the Performance CPU governor
482
483 PowerSave attempts to use the PowerSave CPU governor
484 There is no default value. If unset, no attempt to set the governor is
485 made if the --cpu-freq option has not been set.
486
487
488 CpuFreqGovernors
489 List of CPU frequency governors allowed to be set with the sal‐
490 loc, sbatch, or srun option --cpu-freq. Acceptable values at
491 present include:
492
493 Conservative attempts to use the Conservative CPU governor
494
495 OnDemand attempts to use the OnDemand CPU governor (a
496 default value)
497
498 Performance attempts to use the Performance CPU governor (a
499 default value)
500
501 PowerSave attempts to use the PowerSave CPU governor
502
503 UserSpace attempts to use the UserSpace CPU governor (a
504 default value)
505 The default is OnDemand, Performance and UserSpace.
506
507 CryptoType
508 The cryptographic signature tool to be used in the creation of
509 job step credentials. The slurmctld daemon must be restarted
510 for a change in CryptoType to take effect. Acceptable values at
511 present include "crypto/munge". The default value is
512 "crypto/munge" and is the recommended.
513
514
515 DebugFlags
516 Defines specific subsystems which should provide more detailed
517 event logging. Multiple subsystems can be specified with comma
518 separators. Most DebugFlags will result in verbose logging for
519 the identified subsystems and could impact performance. Valid
520 subsystems available today (with more to come) include:
521
522 Backfill Backfill scheduler details
523
524 BackfillMap Backfill scheduler to log a very verbose map of
525 reserved resources through time. Combine with
526 Backfill for a verbose and complete view of the
527 backfill scheduler's work.
528
529 BurstBuffer Burst Buffer plugin
530
531 CPU_Bind CPU binding details for jobs and steps
532
533 CpuFrequency Cpu frequency details for jobs and steps using
534 the --cpu-freq option.
535
536 Elasticsearch Elasticsearch debug info
537
538 Energy AcctGatherEnergy debug info
539
540 ExtSensors External Sensors debug info
541
542 Federation Federation scheduling debug info
543
544 FrontEnd Front end node details
545
546 Gres Generic resource details
547
548 HeteroJobs Heterogeneous job details
549
550 Gang Gang scheduling details
551
552 JobContainer Job container plugin details
553
554 License License management details
555
556 NodeFeatures Node Features plugin debug info
557
558 NO_CONF_HASH Do not log when the slurm.conf files differs
559 between Slurm daemons
560
561 Power Power management plugin
562
563 Priority Job prioritization
564
565 Profile AcctGatherProfile plugins details
566
567 Protocol Communication protocol details
568
569 Reservation Advanced reservations
570
571 SelectType Resource selection plugin
572
573 Steps Slurmctld resource allocation for job steps
574
575 Switch Switch plugin
576
577 TimeCray Timing of Cray APIs
578
579 TraceJobs Trace jobs in slurmctld. It will print detailed
580 job information including state, job ids and
581 allocated nodes counter.
582
583 Triggers Slurmctld triggers
584
585
586 DefMemPerCPU
587 Default real memory size available per allocated CPU in
588 megabytes. Used to avoid over-subscribing memory and causing
589 paging. DefMemPerCPU would generally be used if individual pro‐
590 cessors are allocated to jobs (SelectType=select/cons_res). The
591 default value is 0 (unlimited). Also see DefMemPerNode and
592 MaxMemPerCPU. DefMemPerCPU and DefMemPerNode are mutually
593 exclusive.
594
595
596 DefMemPerNode
597 Default real memory size available per allocated node in
598 megabytes. Used to avoid over-subscribing memory and causing
599 paging. DefMemPerNode would generally be used if whole nodes
600 are allocated to jobs (SelectType=select/linear) and resources
601 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
602 The default value is 0 (unlimited). Also see DefMemPerCPU and
603 MaxMemPerNode. DefMemPerCPU and DefMemPerNode are mutually
604 exclusive.
605
606
607 DefaultStorageHost
608 The default name of the machine hosting the accounting storage
609 and job completion databases. Only used for database type stor‐
610 age plugins and when the AccountingStorageHost and JobCompHost
611 have not been defined.
612
613
614 DefaultStorageLoc
615 The fully qualified file name where accounting records and/or
616 job completion records are written when the DefaultStorageType
617 is "filetxt". Also see AccountingStorageLoc and JobCompLoc.
618
619
620 DefaultStoragePass
621 The password used to gain access to the database to store the
622 accounting and job completion data. Only used for database type
623 storage plugins, ignored otherwise. Also see AccountingStor‐
624 agePass and JobCompPass.
625
626
627 DefaultStoragePort
628 The listening port of the accounting storage and/or job comple‐
629 tion database server. Only used for database type storage plug‐
630 ins, ignored otherwise. Also see AccountingStoragePort and Job‐
631 CompPort.
632
633
634 DefaultStorageType
635 The accounting and job completion storage mechanism type.
636 Acceptable values at present include "filetxt", "mysql" and
637 "none". The value "filetxt" indicates that records will be
638 written to a file. The value "mysql" indicates that accounting
639 records will be written to a MySQL or MariaDB database. The
640 default value is "none", which means that records are not main‐
641 tained. Also see AccountingStorageType and JobCompType.
642
643
644 DefaultStorageUser
645 The user account for accessing the accounting storage and/or job
646 completion database. Only used for database type storage plug‐
647 ins, ignored otherwise. Also see AccountingStorageUser and Job‐
648 CompUser.
649
650
651 DisableRootJobs
652 If set to "YES" then user root will be prevented from running
653 any jobs. The default value is "NO", meaning user root will be
654 able to execute jobs. DisableRootJobs may also be set by parti‐
655 tion.
656
657
658 EioTimeout
659 The number of seconds srun waits for slurmstepd to close the
660 TCP/IP connection used to relay data between the user applica‐
661 tion and srun when the user application terminates. The default
662 value is 60 seconds. May not exceed 65533.
663
664
665 EnforcePartLimits
666 If set to "ALL" then jobs which exceed a partition's size and/or
667 time limits will be rejected at submission time. If job is sub‐
668 mitted to multiple partitions, the job must satisfy the limits
669 on all the requested partitions. If set to "NO" then the job
670 will be accepted and remain queued until the partition limits
671 are altered(Time and Node Limits). If set to "ANY" or "YES" a
672 job must satisfy any of the requested partitions to be submit‐
673 ted. The default value is "NO". NOTE: If set, then a job's QOS
674 can not be used to exceed partition limits. NOTE: The partition
675 limits being considered are it's configured MaxMemPerCPU,
676 MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAc‐
677 counts, AllowGroups, AllowQOS, and QOS usage threshold.
678
679
680 Epilog Fully qualified pathname of a script to execute as user root on
681 every node when a user's job completes (e.g.
682 "/usr/local/slurm/epilog"). A glob pattern (See glob [22m(7)) may
683 also be used to run more than one epilog script (e.g.
684 "/etc/slurm/epilog.d/*"). The Epilog script or scripts may be
685 used to purge files, disable user login, etc. By default there
686 is no epilog. See Prolog and Epilog Scripts for more informa‐
687 tion.
688
689
690 EpilogMsgTime
691 The number of microseconds that the slurmctld daemon requires to
692 process an epilog completion message from the slurmd daemons.
693 This parameter can be used to prevent a burst of epilog comple‐
694 tion messages from being sent at the same time which should help
695 prevent lost messages and improve throughput for large jobs.
696 The default value is 2000 microseconds. For a 1000 node job,
697 this spreads the epilog completion messages out over two sec‐
698 onds.
699
700
701 EpilogSlurmctld
702 Fully qualified pathname of a program for the slurmctld to exe‐
703 cute upon termination of a job allocation (e.g.
704 "/usr/local/slurm/epilog_controller"). The program executes as
705 SlurmUser, which gives it permission to drain nodes and requeue
706 the job if a failure occurs (See scontrol(1)). Exactly what the
707 program does and how it accomplishes this is completely at the
708 discretion of the system administrator. Information about the
709 job being initiated, it's allocated nodes, etc. are passed to
710 the program using environment variables. See Prolog and Epilog
711 Scripts for more information.
712
713
714 ExtSensorsFreq
715 The external sensors plugin sampling interval. If ExtSen‐
716 sorsType=ext_sensors/none, this parameter is ignored. For all
717 other values of ExtSensorsType, this parameter is the number of
718 seconds between external sensors samples for hardware components
719 (nodes, switches, etc.) The default value is zero. This value
720 disables external sensors sampling. Note: This parameter does
721 not affect external sensors data collection for jobs/steps.
722
723
724 ExtSensorsType
725 Identifies the plugin to be used for external sensors data col‐
726 lection. Slurmctld calls this plugin to collect external sen‐
727 sors data for jobs/steps and hardware components. In case of
728 node sharing between jobs the reported values per job/step
729 (through sstat or sacct) may not be accurate. See also "man
730 ext_sensors.conf".
731
732 Configurable values at present are:
733
734 ext_sensors/none No external sensors data is collected.
735
736 ext_sensors/rrd External sensors data is collected from the
737 RRD database.
738
739
740 FairShareDampeningFactor
741 Dampen the effect of exceeding a user or group's fair share of
742 allocated resources. Higher values will provides greater ability
743 to differentiate between exceeding the fair share at high levels
744 (e.g. a value of 1 results in almost no difference between over‐
745 consumption by a factor of 10 and 100, while a value of 5 will
746 result in a significant difference in priority). The default
747 value is 1.
748
749
750 FastSchedule
751 Controls how a node's configuration specifications in slurm.conf
752 are used. If the number of node configuration entries in the
753 configuration file is significantly lower than the number of
754 nodes, setting FastSchedule to 1 will permit much faster sched‐
755 uling decisions to be made. (The scheduler can just check the
756 values in a few configuration records instead of possibly thou‐
757 sands of node records.) Note that on systems with hyper-thread‐
758 ing, the processor count reported by the node will be twice the
759 actual processor count. Consider which value you want to be
760 used for scheduling purposes.
761
762 0 Base scheduling decisions upon the actual configuration of
763 each individual node except that the node's processor count
764 in Slurm's configuration must match the actual hardware
765 configuration if PreemptMode=suspend,gang or Select‐
766 Type=select/cons_res are configured (both of those plugins
767 maintain resource allocation information using bitmaps for
768 the cores in the system and must remain static, while the
769 node's memory and disk space can be established later).
770
771 1 (default)
772 Consider the configuration of each node to be that speci‐
773 fied in the slurm.conf configuration file and any node with
774 less than the configured resources will be set to DRAIN.
775
776 2 Consider the configuration of each node to be that speci‐
777 fied in the slurm.conf configuration file and any node with
778 less than the configured resources will not be set DRAIN.
779 This option is generally only useful for testing purposes.
780
781
782 FederationParameters
783 Used to define federation options. Multiple options may be comma
784 separated.
785
786
787 fed_display
788 If set, then the client status commands (e.g. squeue,
789 sinfo, sprio, etc.) will display information in a feder‐
790 ated view by default. This option is functionally equiva‐
791 lent to using the --federation options on each command.
792 Use the client's --local option to override the federated
793 view and get a local view of the given cluster.
794
795
796 FirstJobId
797 The job id to be used for the first submitted to Slurm without a
798 specific requested value. Job id values generated will incre‐
799 mented by 1 for each subsequent job. This may be used to provide
800 a meta-scheduler with a job id space which is disjoint from the
801 interactive jobs. The default value is 1. Also see MaxJobId
802
803
804 GetEnvTimeout
805 Used for Moab scheduled jobs only. Controls how long job should
806 wait in seconds for loading the user's environment before
807 attempting to load it from a cache file. Applies when the srun
808 or sbatch --get-user-env option is used. If set to 0 then always
809 load the user's environment from the cache file. The default
810 value is 2 seconds.
811
812
813 GresTypes
814 A comma delimited list of generic resources to be managed.
815 These generic resources may have an associated plugin available
816 to provide additional functionality. No generic resources are
817 managed by default. Ensure this parameter is consistent across
818 all nodes in the cluster for proper operation. The slurmctld
819 daemon must be restarted for changes to this parameter to become
820 effective.
821
822
823 GroupUpdateForce
824 If set to a non-zero value, then information about which users
825 are members of groups allowed to use a partition will be updated
826 periodically, even when there have been no changes to the
827 /etc/group file. If set to zero, group member information will
828 be updated only after the /etc/group file is updated. The
829 default value is 1. Also see the GroupUpdateTime parameter.
830
831
832 GroupUpdateTime
833 Controls how frequently information about which users are mem‐
834 bers of groups allowed to use a partition will be updated, and
835 how long user group membership lists will be cached. The time
836 interval is given in seconds with a default value of 600 sec‐
837 onds. A value of zero will prevent periodic updating of group
838 membership information. Also see the GroupUpdateForce parame‐
839 ter.
840
841
842 HealthCheckInterval
843 The interval in seconds between executions of HealthCheckPro‐
844 gram. The default value is zero, which disables execution.
845
846
847 HealthCheckNodeState
848 Identify what node states should execute the HealthCheckProgram.
849 Multiple state values may be specified with a comma separator.
850 The default value is ANY to execute on nodes in any state.
851
852 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
853 cated).
854
855 ANY Run on nodes in any state.
856
857 CYCLE Rather than running the health check program on all
858 nodes at the same time, cycle through running on all
859 compute nodes through the course of the HealthCheck‐
860 Interval. May be combined with the various node
861 state options.
862
863 IDLE Run on nodes in the IDLE state.
864
865 MIXED Run on nodes in the MIXED state (some CPUs idle and
866 other CPUs allocated).
867
868
869 HealthCheckProgram
870 Fully qualified pathname of a script to execute as user root
871 periodically on all compute nodes that are not in the
872 NOT_RESPONDING state. This program may be used to verify the
873 node is fully operational and DRAIN the node or send email if a
874 problem is detected. Any action to be taken must be explicitly
875 performed by the program (e.g. execute "scontrol update Node‐
876 Name=foo State=drain Reason=tmp_file_system_full" to drain a
877 node). The execution interval is controlled using the
878 HealthCheckInterval parameter. Note that the HealthCheckProgram
879 will be executed at the same time on all nodes to minimize its
880 impact upon parallel programs. This program is will be killed
881 if it does not terminate normally within 60 seconds. This pro‐
882 gram will also be executed when the slurmd daemon is first
883 started and before it registers with the slurmctld daemon. By
884 default, no program will be executed.
885
886
887 InactiveLimit
888 The interval, in seconds, after which a non-responsive job allo‐
889 cation command (e.g. srun or salloc) will result in the job
890 being terminated. If the node on which the command is executed
891 fails or the command abnormally terminates, this will terminate
892 its job allocation. This option has no effect upon batch jobs.
893 When setting a value, take into consideration that a debugger
894 using srun to launch an application may leave the srun command
895 in a stopped state for extended periods of time. This limit is
896 ignored for jobs running in partitions with the RootOnly flag
897 set (the scheduler running as root will be responsible for the
898 job). The default value is unlimited (zero) and may not exceed
899 65533 seconds.
900
901
902 JobAcctGatherType
903 The job accounting mechanism type. Acceptable values at present
904 include "jobacct_gather/linux" (for Linux systems) and is the
905 recommended one, "jobacct_gather/cgroup" and
906 "jobacct_gather/none" (no accounting data collected). The
907 default value is "jobacct_gather/none". "jobacct_gather/cgroup"
908 is a plugin for the Linux operating system that uses cgroups to
909 collect accounting statistics. The plugin collects the following
910 statistics: From the cgroup memory subsystem: mem‐
911 ory.usage_in_bytes (reported as 'pages') and rss from mem‐
912 ory.stat (reported as 'rss'). From the cgroup cpuacct subsystem:
913 user cpu time and system cpu time. No value is provided by
914 cgroups for virtual memory size ('vsize'). In order to use the
915 sstat tool "jobacct_gather/linux", or "jobacct_gather/cgroup"
916 must be configured.
917 NOTE: Changing this configuration parameter changes the contents
918 of the messages between Slurm daemons. Any previously running
919 job steps are managed by a slurmstepd daemon that will persist
920 through the lifetime of that job step and not change it's commu‐
921 nication protocol. Only change this configuration parameter when
922 there are no running job steps.
923
924
925 JobAcctGatherFrequency
926 The job accounting and profiling sampling intervals. The sup‐
927 ported format is follows:
928
929 JobAcctGatherFrequency=<datatype>=<interval>
930 where <datatype>=<interval> specifies the task sam‐
931 pling interval for the jobacct_gather plugin or a
932 sampling interval for a profiling type by the
933 acct_gather_profile plugin. Multiple, comma-sepa‐
934 rated <datatype>=<interval> intervals may be speci‐
935 fied. Supported datatypes are as follows:
936
937 task=<interval>
938 where <interval> is the task sampling inter‐
939 val in seconds for the jobacct_gather plugins
940 and for task profiling by the
941 acct_gather_profile plugin.
942
943 energy=<interval>
944 where <interval> is the sampling interval in
945 seconds for energy profiling using the
946 acct_gather_energy plugin
947
948 network=<interval>
949 where <interval> is the sampling interval in
950 seconds for infiniband profiling using the
951 acct_gather_infiniband plugin.
952
953 filesystem=<interval>
954 where <interval> is the sampling interval in
955 seconds for filesystem profiling using the
956 acct_gather_filesystem plugin.
957
958 The default value for task sampling interval
959 is 30 seconds. The default value for all other intervals is 0.
960 An interval of 0 disables sampling of the specified type. If
961 the task sampling interval is 0, accounting information is col‐
962 lected only at job termination (reducing Slurm interference with
963 the job).
964 Smaller (non-zero) values have a greater impact upon job perfor‐
965 mance, but a value of 30 seconds is not likely to be noticeable
966 for applications having less than 10,000 tasks.
967 Users can independently override each interval on a per job
968 basis using the --acctg-freq option when submitting the job.
969
970
971 JobAcctGatherParams
972 Arbitrary parameters for the job account gather plugin Accept‐
973 able values at present include:
974
975 NoShared Exclude shared memory from accounting.
976
977 UsePss Use PSS value instead of RSS to calculate
978 real usage of memory. The PSS value will be
979 saved as RSS.
980
981 OverMemoryKill Kill steps that are being detected to use
982 more memory than requested, every time
983 accounting information is gathered by JobAc‐
984 ctGather plugin. This parameter will not
985 kill a job directly, but only the step. See
986 MemLimitEnforce for that purpose. This
987 parameter should be used with caution as if
988 jobs exceeds its memory allocation it may
989 affect other processes and/or machine
990 health. NOTE: It is recommended to limit
991 memory by enabling task/cgroup in TaskPlugin
992 and making use of ConstrainRAMSpace=yes
993 cgroup.conf instead of using this JobAcct‐
994 Gather mechanism for memory enforcement,
995 since the former has a lower resolution
996 (JobAcctGatherFreq) and OOMs could happen at
997 some point.
998
999
1000 JobCheckpointDir
1001 Specifies the default directory for storing or reading job
1002 checkpoint information. The data stored here is only a few thou‐
1003 sand bytes per job and includes information needed to resubmit
1004 the job request, not job's memory image. The directory must be
1005 readable and writable by SlurmUser, but not writable by regular
1006 users. The job memory images may be in a different location as
1007 specified by --checkpoint-dir option at job submit time or scon‐
1008 trol's ImageDir option.
1009
1010
1011 JobCompHost
1012 The name of the machine hosting the job completion database.
1013 Only used for database type storage plugins, ignored otherwise.
1014 Also see DefaultStorageHost.
1015
1016
1017 JobCompLoc
1018 The fully qualified file name where job completion records are
1019 written when the JobCompType is "jobcomp/filetxt" or the data‐
1020 base where job completion records are stored when the JobComp‐
1021 Type is a database, or an url with format http://yourelastic‐
1022 server:port when JobCompType is "jobcomp/elasticsearch". NOTE:
1023 when you specify a URL for Elasticsearch, Slurm will remove any
1024 trailing slashes "/" from the configured URL and append
1025 "/slurm/jobcomp", which are the Elasticsearch index name (slurm)
1026 and mapping (jobcomp). NOTE: More information is available at
1027 the Slurm web site ( https://slurm.schedmd.com/elastic‐
1028 search.html ). Also see DefaultStorageLoc.
1029
1030
1031 JobCompPass
1032 The password used to gain access to the database to store the
1033 job completion data. Only used for database type storage plug‐
1034 ins, ignored otherwise. Also see DefaultStoragePass.
1035
1036
1037 JobCompPort
1038 The listening port of the job completion database server. Only
1039 used for database type storage plugins, ignored otherwise. Also
1040 see DefaultStoragePort.
1041
1042
1043 JobCompType
1044 The job completion logging mechanism type. Acceptable values at
1045 present include "jobcomp/none", "jobcomp/elasticsearch", "job‐
1046 comp/filetxt", "jobcomp/mysql" and "jobcomp/script". The
1047 default value is "jobcomp/none", which means that upon job com‐
1048 pletion the record of the job is purged from the system. If
1049 using the accounting infrastructure this plugin may not be of
1050 interest since the information here is redundant. The value
1051 "jobcomp/elasticsearch" indicates that a record of the job
1052 should be written to an Elasticsearch server specified by the
1053 JobCompLoc parameter. NOTE: More information is available at
1054 the Slurm web site ( https://slurm.schedmd.com/elastic‐
1055 search.html ). The value "jobcomp/filetxt" indicates that a
1056 record of the job should be written to a text file specified by
1057 the JobCompLoc parameter. The value "jobcomp/mysql" indicates
1058 that a record of the job should be written to a MySQL or MariaDB
1059 database specified by the JobCompLoc parameter. The value "job‐
1060 comp/script" indicates that a script specified by the JobCompLoc
1061 parameter is to be executed with environment variables indicat‐
1062 ing the job information.
1063
1064 JobCompUser
1065 The user account for accessing the job completion database.
1066 Only used for database type storage plugins, ignored otherwise.
1067 Also see DefaultStorageUser.
1068
1069
1070 JobContainerType
1071 Identifies the plugin to be used for job tracking. The slurmd
1072 daemon must be restarted for a change in JobContainerType to
1073 take effect. NOTE: The JobContainerType applies to a job allo‐
1074 cation, while ProctrackType applies to job steps. Acceptable
1075 values at present include:
1076
1077 job_container/cncu used only for Cray systems (CNCU = Compute
1078 Node Clean Up)
1079
1080 job_container/none used for all other system types
1081
1082
1083 JobCredentialPrivateKey
1084 Fully qualified pathname of a file containing a private key used
1085 for authentication by Slurm daemons. This parameter is ignored
1086 if CryptoType=crypto/munge.
1087
1088
1089 JobCredentialPublicCertificate
1090 Fully qualified pathname of a file containing a public key used
1091 for authentication by Slurm daemons. This parameter is ignored
1092 if CryptoType=crypto/munge.
1093
1094
1095 JobFileAppend
1096 This option controls what to do if a job's output or error file
1097 exist when the job is started. If JobFileAppend is set to a
1098 value of 1, then append to the existing file. By default, any
1099 existing file is truncated.
1100
1101
1102 JobRequeue
1103 This option controls the default ability for batch jobs to be
1104 requeued. Jobs may be requeued explicitly by a system adminis‐
1105 trator, after node failure, or upon preemption by a higher pri‐
1106 ority job. If JobRequeue is set to a value of 1, then batch job
1107 may be requeued unless explicitly disabled by the user. If
1108 JobRequeue is set to a value of 0, then batch job will not be
1109 requeued unless explicitly enabled by the user. Use the sbatch
1110 --no-requeue or --requeue option to change the default behavior
1111 for individual jobs. The default value is 1.
1112
1113
1114 JobSubmitPlugins
1115 A comma delimited list of job submission plugins to be used.
1116 The specified plugins will be executed in the order listed.
1117 These are intended to be site-specific plugins which can be used
1118 to set default job parameters and/or logging events. Sample
1119 plugins available in the distribution include "all_partitions",
1120 "defaults", "logging", "lua", and "partition". For examples of
1121 use, see the Slurm code in "src/plugins/job_submit" and "con‐
1122 tribs/lua/job_submit*.lua" then modify the code to satisfy your
1123 needs. Slurm can be configured to use multiple job_submit plug‐
1124 ins if desired, however the lua plugin will only execute one lua
1125 script named "job_submit.lua" located in the default script
1126 directory (typically the subdirectory "etc" of the installation
1127 directory). No job submission plugins are used by default.
1128
1129
1130 KeepAliveTime
1131 Specifies how long sockets communications used between the srun
1132 command and its slurmstepd process are kept alive after discon‐
1133 nect. Longer values can be used to improve reliability of com‐
1134 munications in the event of network failures. The default value
1135 leaves the system default value. The value may not exceed
1136 65533.
1137
1138
1139 KillOnBadExit
1140 If set to 1, a step will be terminated immediately if any task
1141 is crashed or aborted, as indicated by a non-zero exit code.
1142 With the default value of 0, if one of the processes is crashed
1143 or aborted the other processes will continue to run while the
1144 crashed or aborted process waits. The user can override this
1145 configuration parameter by using srun's -K, --kill-on-bad-exit.
1146
1147
1148 KillWait
1149 The interval, in seconds, given to a job's processes between the
1150 SIGTERM and SIGKILL signals upon reaching its time limit. If
1151 the job fails to terminate gracefully in the interval specified,
1152 it will be forcibly terminated. The default value is 30 sec‐
1153 onds. The value may not exceed 65533.
1154
1155
1156 NodeFeaturesPlugins
1157 Identifies the plugins to be used for support of node features
1158 which can change through time. For example, a node which might
1159 be booted with various BIOS setting. This is supported through
1160 the use of a node's active_features and available_features
1161 information. Acceptable values at present include:
1162
1163 node_features/knl_cray
1164 used only for Intel Knights Landing proces‐
1165 sors (KNL) on Cray systems
1166
1167 node_features/knl_generic
1168 used for Intel Knights Landing processors
1169 (KNL) on a generic Linux system
1170
1171
1172 LaunchParameters
1173 Identifies options to the job launch plugin. Acceptable values
1174 include:
1175
1176 batch_step_set_cpu_freq Set the cpu frequency for the batch step
1177 from given --cpu-freq, or slurm.conf
1178 CpuFreqDef, option. By default only
1179 steps started with srun will utilize the
1180 cpu freq setting options.
1181
1182 NOTE: If you are using srun to launch
1183 your steps inside a batch script
1184 (advised) this option will create a sit‐
1185 uation where you may have multiple
1186 agents setting the cpu_freq as the batch
1187 step usually runs on the same resources
1188 one or more steps the sruns in the
1189 script will create.
1190
1191 cray_net_exclusive Allow jobs on a Cray Native cluster
1192 exclusive access to network resources.
1193 This should only be set on clusters pro‐
1194 viding exclusive access to each node to
1195 a single job at once, and not using par‐
1196 allel steps within the job, otherwise
1197 resources on the node can be oversub‐
1198 scribed.
1199
1200 lustre_no_flush If set on a Cray Native cluster, then do
1201 not flush the Lustre cache on job step
1202 completion. This setting will only take
1203 effect after reconfiguring, and will
1204 only take effect for newly launched
1205 jobs.
1206
1207 mem_sort Sort NUMA memory at step start. User can
1208 override this default with
1209 SLURM_MEM_BIND environment variable or
1210 --mem-bind=nosort command line option.
1211
1212 send_gids Lookup and send the user_name and
1213 extended gids for a job within the
1214 slurmctld, rather than individual on
1215 each node as part of each task launch.
1216 Should avoid issues around name service
1217 scalability when launching jobs involv‐
1218 ing many nodes.
1219
1220 slurmstepd_memlock Lock the slurmstepd process's current
1221 memory in RAM.
1222
1223 slurmstepd_memlock_all Lock the slurmstepd process's current
1224 and future memory in RAM.
1225
1226 test_exec Have srun verify existence of the exe‐
1227 cutable program along with user execute
1228 permission on the node where srun was
1229 called before attempting to launch it on
1230 nodes in the step.
1231
1232
1233 LaunchType
1234 Identifies the mechanism to be used to launch application tasks.
1235 Acceptable values include:
1236
1237 launch/slurm
1238 The default value.
1239
1240
1241 Licenses
1242 Specification of licenses (or other resources available on all
1243 nodes of the cluster) which can be allocated to jobs. License
1244 names can optionally be followed by a colon and count with a
1245 default count of one. Multiple license names should be comma
1246 separated (e.g. "Licenses=foo:4,bar"). Note that Slurm pre‐
1247 vents jobs from being scheduled if their required license speci‐
1248 fication is not available. Slurm does not prevent jobs from
1249 using licenses that are not explicitly listed in the job submis‐
1250 sion specification.
1251
1252
1253 LogTimeFormat
1254 Format of the timestamp in slurmctld and slurmd log files.
1255 Accepted values are "iso8601", "iso8601_ms", "rfc5424",
1256 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1257 ing in "_ms" differ from the ones without in that fractional
1258 seconds with millisecond precision are printed. The default
1259 value is "iso8601_ms". The "rfc5424" formats are the same as the
1260 "iso8601" formats except that the timezone value is also shown.
1261 The "clock" format shows a timestamp in microseconds retrieved
1262 with the C standard clock() function. The "short" format is a
1263 short date and time format. The "thread_id" format shows the
1264 timestamp in the C standard ctime() function form without the
1265 year but including the microseconds, the daemon's process ID and
1266 the current thread name and ID.
1267
1268
1269 MailDomain
1270 Domain name to qualify usernames if email address is not explic‐
1271 itly given with the "--mail-user" option. If unset, the local
1272 MTA will need to qualify local address itself.
1273
1274
1275 MailProg
1276 Fully qualified pathname to the program used to send email per
1277 user request. The default value is "/bin/mail" (or
1278 "/usr/bin/mail" if "/bin/mail" does not exist but
1279 "/usr/bin/mail" does exist).
1280
1281
1282 MaxArraySize
1283 The maximum job array size. The maximum job array task index
1284 value will be one less than MaxArraySize to allow for an index
1285 value of zero. Configure MaxArraySize to 0 in order to disable
1286 job array use. The value may not exceed 4000001. The value of
1287 MaxJobCount should be much larger than MaxArraySize. The
1288 default value is 1001.
1289
1290
1291 MaxJobCount
1292 The maximum number of jobs Slurm can have in its active database
1293 at one time. Set the values of MaxJobCount and MinJobAge to
1294 ensure the slurmctld daemon does not exhaust its memory or other
1295 resources. Once this limit is reached, requests to submit addi‐
1296 tional jobs will fail. The default value is 10000 jobs. NOTE:
1297 Each task of a job array counts as one job even though they will
1298 not occupy separate job records until modified or initiated.
1299 Performance can suffer with more than a few hundred thousand
1300 jobs. Setting per MaxSubmitJobs per user is generally valuable
1301 to prevent a single user from filling the system with jobs.
1302 This is accomplished using Slurm's database and configuring
1303 enforcement of resource limits. This value may not be reset via
1304 "scontrol reconfig". It only takes effect upon restart of the
1305 slurmctld daemon.
1306
1307
1308 MaxJobId
1309 The maximum job id to be used for jobs submitted to Slurm with‐
1310 out a specific requested value. Job ids are unsigned 32bit inte‐
1311 gers with the first 26 bits reserved for local job ids and the
1312 remaining 6 bits reserved for a cluster id to identify a feder‐
1313 ated job's origin. The maximun allowed local job id is
1314 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1315 (0x03ff0000). MaxJobId only applies to the local job id and not
1316 the federated job id. Job id values generated will be incre‐
1317 mented by 1 for each subsequent job. Once MaxJobId is reached,
1318 the next job will be assigned FirstJobId. Federated jobs will
1319 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1320 bId.
1321
1322
1323 MaxMemPerCPU
1324 Maximum real memory size available per allocated CPU in
1325 megabytes. Used to avoid over-subscribing memory and causing
1326 paging. MaxMemPerCPU would generally be used if individual pro‐
1327 cessors are allocated to jobs (SelectType=select/cons_res). The
1328 default value is 0 (unlimited). Also see DefMemPerCPU and
1329 MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutually
1330 exclusive.
1331
1332 NOTE: If a job specifies a memory per CPU limit that exceeds
1333 this system limit, that job's count of CPUs per task will auto‐
1334 matically be increased. This may result in the job failing due
1335 to CPU count limits.
1336
1337
1338 MaxMemPerNode
1339 Maximum real memory size available per allocated node in
1340 megabytes. Used to avoid over-subscribing memory and causing
1341 paging. MaxMemPerNode would generally be used if whole nodes
1342 are allocated to jobs (SelectType=select/linear) and resources
1343 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1344 The default value is 0 (unlimited). Also see DefMemPerNode and
1345 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually
1346 exclusive.
1347
1348
1349 MaxStepCount
1350 The maximum number of steps that any job can initiate. This
1351 parameter is intended to limit the effect of bad batch scripts.
1352 The default value is 40000 steps.
1353
1354
1355 MaxTasksPerNode
1356 Maximum number of tasks Slurm will allow a job step to spawn on
1357 a single node. The default MaxTasksPerNode is 512. May not
1358 exceed 65533.
1359
1360
1361 MCSParameters
1362 MCS = Multi-Category Security MCS Plugin Parameters. The sup‐
1363 ported parameters are specific to the MCSPlugin. Changes to
1364 this value take effect when the Slurm daemons are reconfigured.
1365 More information about MCS is available here
1366 <https://slurm.schedmd.com/mcs.html>.
1367
1368
1369 MCSPlugin
1370 MCS = Multi-Category Security : associate a security label to
1371 jobs and ensure that nodes can only be shared among jobs using
1372 the same security label. Acceptable values include:
1373
1374 mcs/none is the default value. No security label associated
1375 with jobs, no particular security restriction when
1376 sharing nodes among jobs.
1377
1378 mcs/account only users with the same account can share the nodes
1379 (requires enabling of accounting).
1380
1381 mcs/group only users with the same group can share the nodes.
1382
1383 mcs/user a node cannot be shared with other users.
1384
1385
1386 MemLimitEnforce
1387 If set to yes then Slurm will terminate the job if it exceeds
1388 the value requested using the --mem-per-cpu option of sal‐
1389 loc/sbatch/srun. This is useful in combination with JobAcct‐
1390 GatherParams=OverMemoryKill. Used when jobs need to specify
1391 --mem-per-cpu for scheduling and they should be terminated if
1392 they exceed the estimated value. The default value is 'no',
1393 which disables this enforcing mechanism. NOTE: It is recom‐
1394 mended to limit memory by enabling task/cgroup in TaskPlugin and
1395 making use of ConstrainRAMSpace=yes cgroup.conf instead of using
1396 this JobAcctGather mechanism for memory enforcement, since the
1397 former has a lower resolution (JobAcctGatherFreq) and OOMs could
1398 happen at some point.
1399
1400
1401 MessageTimeout
1402 Time permitted for a round-trip communication to complete in
1403 seconds. Default value is 10 seconds. For systems with shared
1404 nodes, the slurmd daemon could be paged out and necessitate
1405 higher values.
1406
1407
1408 MinJobAge
1409 The minimum age of a completed job before its record is purged
1410 from Slurm's active database. Set the values of MaxJobCount and
1411 to ensure the slurmctld daemon does not exhaust its memory or
1412 other resources. The default value is 300 seconds. A value of
1413 zero prevents any job record purging. In order to eliminate
1414 some possible race conditions, the minimum non-zero value for
1415 MinJobAge recommended is 2.
1416
1417
1418 MpiDefault
1419 Identifies the default type of MPI to be used. Srun may over‐
1420 ride this configuration parameter in any case. Currently sup‐
1421 ported versions include: openmpi, pmi2, pmix, and none (default,
1422 which works for many other versions of MPI). More information
1423 about MPI use is available here
1424 <https://slurm.schedmd.com/mpi_guide.html>.
1425
1426
1427 MpiParams
1428 MPI parameters. Used to identify ports used by older versions
1429 of OpenMPI and native Cray systems. The input format is
1430 "ports=12000-12999" to identify a range of communication ports
1431 to be used. NOTE: This is not needed for modern versions of
1432 OpenMPI, taking it out can cause a small boost in scheduling
1433 performance. NOTE: This is require for Cray's PMI.
1434
1435 MsgAggregationParams
1436 Message aggregation parameters. Message aggregation is an
1437 optional feature that may improve system performance by reducing
1438 the number of separate messages passed between nodes. The fea‐
1439 ture works by routing messages through one or more message col‐
1440 lector nodes between their source and destination nodes. At each
1441 collector node, messages with the same destination received dur‐
1442 ing a defined message collection window are packaged into a sin‐
1443 gle composite message. When the window expires, the composite
1444 message is sent to the next collector node on the route to its
1445 destination. The route between each source and destination node
1446 is provided by the Route plugin. When a composite message is
1447 received at its destination node, the original messages are
1448 extracted and processed as if they had been sent directly.
1449 Currently, the only message types supported by message aggrega‐
1450 tion are the node registration, batch script completion, step
1451 completion, and epilog complete messages.
1452 The format for this parameter is as follows:
1453
1454 MsgAggregationParams=<option>=<value>
1455 where <option>=<value> specify a particular control
1456 variable. Multiple, comma-separated <option>=<value>
1457 pairs may be specified. Supported options are as
1458 follows:
1459
1460 WindowMsgs=<number>
1461 where <number> is the maximum number of mes‐
1462 sages in each message collection window.
1463
1464 WindowTime=<time>
1465 where <time> is the maximum elapsed time in
1466 milliseconds of each message collection win‐
1467 dow.
1468
1469 A window expires when either WindowMsgs or
1470 WindowTime is
1471 reached. By default, message aggregation is disabled. To enable
1472 the feature, set WindowMsgs to a value greater than 1. The
1473 default value for WindowTime is 100 milliseconds.
1474
1475
1476 OverTimeLimit
1477 Number of minutes by which a job can exceed its time limit
1478 before being canceled. Normally a job's time limit is treated
1479 as a hard limit and the job will be killed upon reaching that
1480 limit. Configuring OverTimeLimit will result in the job's time
1481 limit being treated like a soft limit. Adding the OverTimeLimit
1482 value to the soft time limit provides a hard time limit, at
1483 which point the job is canceled. This is particularly useful
1484 for backfill scheduling, which bases upon each job's soft time
1485 limit. The default value is zero. May not exceed exceed 65533
1486 minutes. A value of "UNLIMITED" is also supported.
1487
1488
1489 PluginDir
1490 Identifies the places in which to look for Slurm plugins. This
1491 is a colon-separated list of directories, like the PATH environ‐
1492 ment variable. The default value is "/usr/local/lib/slurm".
1493
1494
1495 PlugStackConfig
1496 Location of the config file for Slurm stackable plugins that use
1497 the Stackable Plugin Architecture for Node job (K)control
1498 (SPANK). This provides support for a highly configurable set of
1499 plugins to be called before and/or after execution of each task
1500 spawned as part of a user's job step. Default location is
1501 "plugstack.conf" in the same directory as the system slurm.conf.
1502 For more information on SPANK plugins, see the spank(8) manual.
1503
1504
1505 PowerParameters
1506 System power management parameters. The supported parameters
1507 are specific to the PowerPlugin. Changes to this value take
1508 effect when the Slurm daemons are reconfigured. More informa‐
1509 tion about system power management is available here
1510 <https://slurm.schedmd.com/power_mgmt.html>. Options current
1511 supported by any plugins are listed below.
1512
1513 balance_interval=#
1514 Specifies the time interval, in seconds, between attempts
1515 to rebalance power caps across the nodes. This also con‐
1516 trols the frequency at which Slurm attempts to collect
1517 current power consumption data (old data may be used
1518 until new data is available from the underlying infra‐
1519 structure and values below 10 seconds are not recommended
1520 for Cray systems). The default value is 30 seconds.
1521 Supported by the power/cray plugin.
1522
1523 capmc_path=
1524 Specifies the absolute path of the capmc command. The
1525 default value is "/opt/cray/capmc/default/bin/capmc".
1526 Supported by the power/cray plugin.
1527
1528 cap_watts=#
1529 Specifies the total power limit to be established across
1530 all compute nodes managed by Slurm. A value of 0 sets
1531 every compute node to have an unlimited cap. The default
1532 value is 0. Supported by the power/cray plugin.
1533
1534 decrease_rate=#
1535 Specifies the maximum rate of change in the power cap for
1536 a node where the actual power usage is below the power
1537 cap by an amount greater than lower_threshold (see
1538 below). Value represents a percentage of the difference
1539 between a node's minimum and maximum power consumption.
1540 The default value is 50 percent. Supported by the
1541 power/cray plugin.
1542
1543 get_timeout=#
1544 Amount of time allowed to get power state information in
1545 milliseconds. The default value is 5,000 milliseconds or
1546 5 seconds. Supported by the power/cray plugin and repre‐
1547 sents the time allowed for the capmc command to respond
1548 to various "get" options.
1549
1550 increase_rate=#
1551 Specifies the maximum rate of change in the power cap for
1552 a node where the actual power usage is within
1553 upper_threshold (see below) of the power cap. Value rep‐
1554 resents a percentage of the difference between a node's
1555 minimum and maximum power consumption. The default value
1556 is 20 percent. Supported by the power/cray plugin.
1557
1558 job_level
1559 All nodes associated with every job will have the same
1560 power cap, to the extent possible. Also see the
1561 --power=level option on the job submission commands.
1562
1563 job_no_level
1564 Disable the user's ability to set every node associated
1565 with a job to the same power cap. Each node will have
1566 it's power cap set independently. This disables the
1567 --power=level option on the job submission commands.
1568
1569 lower_threshold=#
1570 Specify a lower power consumption threshold. If a node's
1571 current power consumption is below this percentage of its
1572 current cap, then its power cap will be reduced. The
1573 default value is 90 percent. Supported by the power/cray
1574 plugin.
1575
1576 recent_job=#
1577 If a job has started or resumed execution (from suspend)
1578 on a compute node within this number of seconds from the
1579 current time, the node's power cap will be increased to
1580 the maximum. The default value is 300 seconds. Sup‐
1581 ported by the power/cray plugin.
1582
1583 set_timeout=#
1584 Amount of time allowed to set power state information in
1585 milliseconds. The default value is 30,000 milliseconds
1586 or 30 seconds. Supported by the power/cray plugin and
1587 represents the time allowed for the capmc command to
1588 respond to various "set" options.
1589
1590 set_watts=#
1591 Specifies the power limit to be set on every compute
1592 nodes managed by Slurm. Every node gets this same power
1593 cap and there is no variation through time based upon
1594 actual power usage on the node. Supported by the
1595 power/cray plugin.
1596
1597 upper_threshold=#
1598 Specify an upper power consumption threshold. If a
1599 node's current power consumption is above this percentage
1600 of its current cap, then its power cap will be increased
1601 to the extent possible. The default value is 95 percent.
1602 Supported by the power/cray plugin.
1603
1604
1605 PowerPlugin
1606 Identifies the plugin used for system power management. Cur‐
1607 rently supported plugins include: cray and none. Changes to
1608 this value require restarting Slurm daemons to take effect.
1609 More information about system power management is available here
1610 <https://slurm.schedmd.com/power_mgmt.html>. By default, no
1611 power plugin is loaded.
1612
1613
1614 PreemptMode
1615 Enables gang scheduling and/or controls the mechanism used to
1616 preempt jobs. When the PreemptType parameter is set to enable
1617 preemption, the PreemptMode selects the default mechanism used
1618 to preempt the lower priority jobs for the cluster. PreemptMode
1619 may be specified on a per partition basis to override this
1620 default value if PreemptType=preempt/partition_prio, but a valid
1621 default PreemptMode value must be specified for the cluster as a
1622 whole when preemption is enabled. The GANG option is used to
1623 enable gang scheduling independent of whether preemption is
1624 enabled (the PreemptType setting). The GANG option can be spec‐
1625 ified in addition to a PreemptMode setting with the two options
1626 comma separated. The SUSPEND option requires that gang schedul‐
1627 ing be enabled (i.e, "PreemptMode=SUSPEND,GANG"). NOTE: For
1628 performance reasons, the backfill scheduler reserves whole nodes
1629 for jobs, not partial nodes. If during backfill scheduling a job
1630 preempts one or more other jobs, the whole nodes for those pre‐
1631 empted jobs are reserved for the preemptor job, even if the pre‐
1632 emptor job requested fewer resources than that. These reserved
1633 nodes aren't available to other jobs during that backfill cycle,
1634 even if the other jobs could fit on the nodes. Therefore, jobs
1635 may preempt more resources during a single backfill iteration
1636 than they requested.
1637
1638 OFF is the default value and disables job preemption and
1639 gang scheduling.
1640
1641 CANCEL always cancel the job.
1642
1643 CHECKPOINT preempts jobs by checkpointing them (if possible) or
1644 canceling them.
1645
1646 GANG enables gang scheduling (time slicing) of jobs in
1647 the same partition. NOTE: Gang scheduling is per‐
1648 formed independently for each partition, so config‐
1649 uring partitions with overlapping nodes and gang
1650 scheduling is generally not recommended.
1651
1652 REQUEUE preempts jobs by requeuing them (if possible) or
1653 canceling them. For jobs to be requeued they must
1654 have the --requeue sbatch option set or the cluster
1655 wide JobRequeue parameter in slurm.conf must be set
1656 to one.
1657
1658 SUSPEND If PreemptType=preempt/partition_prio is configured
1659 then suspend and automatically resume the low prior‐
1660 ity jobs. If PreemptType=preempt/qos is configured,
1661 then the jobs sharing resources will always time
1662 slice rather than one job remaining suspended. The
1663 SUSPEND may only be used with the GANG option (the
1664 gang scheduler module performs the job resume opera‐
1665 tion).
1666
1667
1668 PreemptType
1669 This specifies the plugin used to identify which jobs can be
1670 preempted in order to start a pending job.
1671
1672 preempt/none
1673 Job preemption is disabled. This is the default.
1674
1675 preempt/partition_prio
1676 Job preemption is based upon partition priority tier.
1677 Jobs in higher priority partitions (queues) may preempt
1678 jobs from lower priority partitions. This is not compat‐
1679 ible with PreemptMode=OFF.
1680
1681 preempt/qos
1682 Job preemption rules are specified by Quality Of Service
1683 (QOS) specifications in the Slurm database. This option
1684 is not compatible with PreemptMode=OFF. A configuration
1685 of PreemptMode=SUSPEND is only supported by the
1686 select/cons_res plugin.
1687
1688
1689 PriorityDecayHalfLife
1690 This controls how long prior resource use is considered in
1691 determining how over- or under-serviced an association is (user,
1692 bank account and cluster) in determining job priority. The
1693 record of usage will be decayed over time, with half of the
1694 original value cleared at age PriorityDecayHalfLife. If set to
1695 0 no decay will be applied. This is helpful if you want to
1696 enforce hard time limits per association. If set to 0 Priori‐
1697 tyUsageResetPeriod must be set to some interval. Applicable
1698 only if PriorityType=priority/multifactor. The unit is a time
1699 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
1700 default value is 7-0 (7 days).
1701
1702
1703 PriorityCalcPeriod
1704 The period of time in minutes in which the half-life decay will
1705 be re-calculated. Applicable only if PriorityType=priority/mul‐
1706 tifactor. The default value is 5 (minutes).
1707
1708
1709 PriorityFavorSmall
1710 Specifies that small jobs should be given preferential schedul‐
1711 ing priority. Applicable only if PriorityType=priority/multi‐
1712 factor. Supported values are "YES" and "NO". The default value
1713 is "NO".
1714
1715
1716 PriorityFlags
1717 Flags to modify priority behavior Applicable only if Priority‐
1718 Type=priority/multifactor. The keywords below have no associ‐
1719 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
1720 TIVE_TO_TIME").
1721
1722 ACCRUE_ALWAYS If set, priority age factor will be increased
1723 despite job dependencies or holds.
1724
1725 CALCULATE_RUNNING
1726 If set, priorities will be recalculated not
1727 only for pending jobs, but also running and
1728 suspended jobs.
1729
1730 DEPTH_OBLIVIOUS If set, priority will be calculated based simi‐
1731 lar to the normal multifactor calculation, but
1732 depth of the associations in the tree do not
1733 adversely effect their priority. This option
1734 precludes the use of FAIR_TREE.
1735
1736 FAIR_TREE If set, priority will be calculated in such a
1737 way that if accounts A and B are siblings and A
1738 has a higher fairshare factor than B, all chil‐
1739 dren of A will have higher fairshare factors
1740 than all children of B.
1741
1742 INCR_ONLY If set, priority values will only increase in
1743 value. Job priority will never decrease in
1744 value.
1745
1746 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
1747 BillingWeights) is calculated as the MAX of
1748 individual TRES' on a node (e.g. cpus, mem,
1749 gres) plus the sum of all global TRES' (e.g.
1750 licenses).
1751
1752 SMALL_RELATIVE_TO_TIME
1753 If set, the job's size component will be based
1754 upon not the job size alone, but the job's size
1755 divided by it's time limit.
1756
1757
1758 PriorityParameters
1759 Arbitrary string used by the PriorityType plugin.
1760
1761
1762 PriorityMaxAge
1763 Specifies the job age which will be given the maximum age factor
1764 in computing priority. For example, a value of 30 minutes would
1765 result in all jobs over 30 minutes old would get the same
1766 age-based priority. Applicable only if PriorityType=prior‐
1767 ity/multifactor. The unit is a time string (i.e. min,
1768 hr:min:00, days-hr:min:00, or days-hr). The default value is
1769 7-0 (7 days).
1770
1771
1772 PriorityUsageResetPeriod
1773 At this interval the usage of associations will be reset to 0.
1774 This is used if you want to enforce hard limits of time usage
1775 per association. If PriorityDecayHalfLife is set to be 0 no
1776 decay will happen and this is the only way to reset the usage
1777 accumulated by running jobs. By default this is turned off and
1778 it is advised to use the PriorityDecayHalfLife option to avoid
1779 not having anything running on your cluster, but if your schema
1780 is set up to only allow certain amounts of time on your system
1781 this is the way to do it. Applicable only if PriorityType=pri‐
1782 ority/multifactor.
1783
1784 NONE Never clear historic usage. The default value.
1785
1786 NOW Clear the historic usage now. Executed at startup
1787 and reconfiguration time.
1788
1789 DAILY Cleared every day at midnight.
1790
1791 WEEKLY Cleared every week on Sunday at time 00:00.
1792
1793 MONTHLY Cleared on the first day of each month at time
1794 00:00.
1795
1796 QUARTERLY Cleared on the first day of each quarter at time
1797 00:00.
1798
1799 YEARLY Cleared on the first day of each year at time 00:00.
1800
1801
1802 PriorityType
1803 This specifies the plugin to be used in establishing a job's
1804 scheduling priority. Supported values are "priority/basic" (jobs
1805 are prioritized by order of arrival), "priority/multifactor"
1806 (jobs are prioritized based upon size, age, fair-share of allo‐
1807 cation, etc). Also see PriorityFlags for configuration options.
1808 The default value is "priority/basic".
1809
1810 When not FIFO scheduling, jobs are prioritized in the following
1811 order:
1812
1813 1. Jobs that can preempt
1814
1815 2. Jobs with an advanced reservation
1816
1817 3. Partition Priority Tier
1818
1819 4. Job Priority
1820
1821 5. Job Id
1822
1823
1824
1825 PriorityWeightAge
1826 An integer value that sets the degree to which the queue wait
1827 time component contributes to the job's priority. Applicable
1828 only if PriorityType=priority/multifactor. The default value is
1829 0.
1830
1831
1832 PriorityWeightFairshare
1833 An integer value that sets the degree to which the fair-share
1834 component contributes to the job's priority. Applicable only if
1835 PriorityType=priority/multifactor. The default value is 0.
1836
1837
1838 PriorityWeightJobSize
1839 An integer value that sets the degree to which the job size com‐
1840 ponent contributes to the job's priority. Applicable only if
1841 PriorityType=priority/multifactor. The default value is 0.
1842
1843
1844 PriorityWeightPartition
1845 Partition factor used by priority/multifactor plugin in calcu‐
1846 lating job priority. Applicable only if PriorityType=prior‐
1847 ity/multifactor. The default value is 0.
1848
1849
1850 PriorityWeightQOS
1851 An integer value that sets the degree to which the Quality Of
1852 Service component contributes to the job's priority. Applicable
1853 only if PriorityType=priority/multifactor. The default value is
1854 0.
1855
1856
1857 PriorityWeightTRES
1858 A comma separated list of TRES Types and weights that sets the
1859 degree that each TRES Type contributes to the job's priority.
1860
1861 e.g.
1862 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
1863
1864 Applicable only if PriorityType=priority/multifactor and if
1865 AccountingStorageTRES is configured with each TRES Type. Nega‐
1866 tive values are allowed. The default values are 0.
1867
1868
1869 PrivateData
1870 This controls what type of information is hidden from regular
1871 users. By default, all information is visible to all users.
1872 User SlurmUser and root can always view all information. Multi‐
1873 ple values may be specified with a comma separator. Acceptable
1874 values include:
1875
1876 accounts
1877 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
1878 ing any account definitions unless they are coordinators
1879 of them.
1880
1881 cloud Powered down nodes in the cloud are visible.
1882
1883 events prevents users from viewing event information unless they
1884 have operator status or above.
1885
1886 jobs Prevents users from viewing jobs or job steps belonging
1887 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
1888 users from viewing job records belonging to other users
1889 unless they are coordinators of the association running
1890 the job when using sacct.
1891
1892 nodes Prevents users from viewing node state information.
1893
1894 partitions
1895 Prevents users from viewing partition state information.
1896
1897 reservations
1898 Prevents regular users from viewing reservations which
1899 they can not use.
1900
1901 usage Prevents users from viewing usage of any other user, this
1902 applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
1903 vents users from viewing usage of any other user, this
1904 applies to sreport.
1905
1906 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
1907 ing information of any user other than themselves, this
1908 also makes it so users can only see associations they
1909 deal with. Coordinators can see associations of all
1910 users they are coordinator of, but can only see them‐
1911 selves when listing users.
1912
1913
1914 ProctrackType
1915 Identifies the plugin to be used for process tracking on a job
1916 step basis. The slurmd daemon uses this mechanism to identify
1917 all processes which are children of processes it spawns for a
1918 user job step. The slurmd daemon must be restarted for a change
1919 in ProctrackType to take effect. NOTE: "proctrack/linuxproc"
1920 and "proctrack/pgid" can fail to identify all processes associ‐
1921 ated with a job since processes can become a child of the init
1922 process (when the parent process terminates) or change their
1923 process group. To reliably track all processes, "proc‐
1924 track/cgroup" is highly recommended. NOTE: The JobContainerType
1925 applies to a job allocation, while ProctrackType applies to job
1926 steps. Acceptable values at present include:
1927
1928 proctrack/cgroup which uses linux cgroups to constrain and
1929 track processes, and is the default. NOTE:
1930 see "man cgroup.conf" for configuration
1931 details
1932
1933 proctrack/cray which uses Cray proprietary process tracking
1934
1935 proctrack/linuxproc which uses linux process tree using parent
1936 process IDs.
1937
1938 proctrack/lua which uses a site-specific LUA script to
1939 track processes
1940
1941 proctrack/sgi_job which uses SGI's Process Aggregates (PAGG)
1942 kernel module, see
1943 http://oss.sgi.com/projects/pagg/ for more
1944 information
1945
1946 proctrack/pgid which uses process group IDs
1947
1948
1949 Prolog Fully qualified pathname of a program for the slurmd to execute
1950 whenever it is asked to run a job step from a new job allocation
1951 (e.g. "/usr/local/slurm/prolog"). A glob pattern (See glob (7))
1952 may also be used to specify more than one program to run (e.g.
1953 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
1954 starting the first job step. The prolog script or scripts may
1955 be used to purge files, enable user login, etc. By default
1956 there is no prolog. Any configured script is expected to com‐
1957 plete execution quickly (in less time than MessageTimeout). If
1958 the prolog fails (returns a non-zero exit code), this will
1959 result in the node being set to a DRAIN state and the job being
1960 requeued in a held state, unless nohold_on_prolog_fail is con‐
1961 figured in SchedulerParameters. See Prolog and Epilog Scripts
1962 for more information.
1963
1964
1965 PrologEpilogTimeout
1966 The interval in seconds Slurms waits for Prolog and Epilog
1967 before terminating them. The default behavior is to wait indefi‐
1968 nitely. This interval applies to the Prolog and Epilog run by
1969 slurmd daemon before and after the job, the PrologSlurmctld and
1970 EpilogSlurmctld run by slurmctld daemon, and the SPANK plugins
1971 run by the slurmstepd daemon.
1972
1973
1974 PrologFlags
1975 Flags to control the Prolog behavior. By default no flags are
1976 set. Multiple flags may be specified in a comma-separated list.
1977 Currently supported options are:
1978
1979 Alloc If set, the Prolog script will be executed at job allo‐
1980 cation. By default, Prolog is executed just before the
1981 task is launched. Therefore, when salloc is started, no
1982 Prolog is executed. Alloc is useful for preparing things
1983 before a user starts to use any allocated resources. In
1984 particular, this flag is needed on a Cray system when
1985 cluster compatibility mode is enabled.
1986
1987 NOTE: Use of the Alloc flag will increase the time
1988 required to start jobs.
1989
1990 Contain At job allocation time, use the ProcTrack plugin to cre‐
1991 ate a job container on all allocated compute nodes.
1992 This container may be used for user processes not
1993 launched under Slurm control, for example the PAM module
1994 may place processes launch through a direct user login
1995 into this container. Setting the Contain implicitly
1996 sets the Alloc flag. You must set ProctrackType=proc‐
1997 track/cgroup when using the Contain flag.
1998
1999 NoHold If set, the Alloc flag should also be set. This will
2000 allow for salloc to not block until the prolog is fin‐
2001 ished on each node. The blocking will happen when steps
2002 reach the slurmd and before any execution has happened
2003 in the step. This is a much faster way to work and if
2004 using srun to launch your tasks you should use this
2005 flag. This flag cannot be combined with the Contain or
2006 X11 flags.
2007
2008 Serial By default, the Prolog and Epilog scripts run concur‐
2009 rently on each node. This flag forces those scripts to
2010 run serially within each node, but with a significant
2011 penalty to job throughput on each node.
2012
2013 X11 Enable Slurm's built-in X11 forwarding capabilities.
2014 Slurm must have been compiled with libssh2 support
2015 enabled, and either SSH hostkey authentication or per-
2016 users SSH key authentication must be enabled within the
2017 cluster. Only RSA keys are supported at this time. Set‐
2018 ting the X11 flag implicitly enables both Contain and
2019 Alloc flags as well.
2020
2021
2022 PrologSlurmctld
2023 Fully qualified pathname of a program for the slurmctld daemon
2024 to execute before granting a new job allocation (e.g.
2025 "/usr/local/slurm/prolog_controller"). The program executes as
2026 SlurmUser on the same node where the slurmctld daemon executes,
2027 giving it permission to drain nodes and requeue the job if a
2028 failure occurs or cancel the job if appropriate. The program
2029 can be used to reboot nodes or perform other work to prepare
2030 resources for use. Exactly what the program does and how it
2031 accomplishes this is completely at the discretion of the system
2032 administrator. Information about the job being initiated, it's
2033 allocated nodes, etc. are passed to the program using environ‐
2034 ment variables. While this program is running, the nodes asso‐
2035 ciated with the job will be have a POWER_UP/CONFIGURING flag set
2036 in their state, which can be readily viewed. The slurmctld dae‐
2037 mon will wait indefinitely for this program to complete. Once
2038 the program completes with an exit code of zero, the nodes will
2039 be considered ready for use and the program will be started. If
2040 some node can not be made available for use, the program should
2041 drain the node (typically using the scontrol command) and termi‐
2042 nate with a non-zero exit code. A non-zero exit code will
2043 result in the job being requeued (where possible) or killed.
2044 Note that only batch jobs can be requeued. See Prolog and Epi‐
2045 log Scripts for more information.
2046
2047
2048 PropagatePrioProcess
2049 Controls the scheduling priority (nice value) of user spawned
2050 tasks.
2051
2052 0 The tasks will inherit the scheduling priority from the
2053 slurm daemon. This is the default value.
2054
2055 1 The tasks will inherit the scheduling priority of the com‐
2056 mand used to submit them (e.g. srun or sbatch). Unless the
2057 job is submitted by user root, the tasks will have a sched‐
2058 uling priority no higher than the slurm daemon spawning
2059 them.
2060
2061 2 The tasks will inherit the scheduling priority of the com‐
2062 mand used to submit them (e.g. srun or sbatch) with the
2063 restriction that their nice value will always be one higher
2064 than the slurm daemon (i.e. the tasks scheduling priority
2065 will be lower than the slurm daemon).
2066
2067
2068 PropagateResourceLimits
2069 A list of comma separated resource limit names. The slurmd dae‐
2070 mon uses these names to obtain the associated (soft) limit val‐
2071 ues from the user's process environment on the submit node.
2072 These limits are then propagated and applied to the jobs that
2073 will run on the compute nodes. This parameter can be useful
2074 when system limits vary among nodes. Any resource limits that
2075 do not appear in the list are not propagated. However, the user
2076 can override this by specifying which resource limits to propa‐
2077 gate with the sbatch or srun "--propagate" option. If neither
2078 PropagateResourceLimits or PropagateResourceLimitsExcept are
2079 configured and the "--propagate" option is not specified, then
2080 the default action is to propagate all limits. Only one of the
2081 parameters, either PropagateResourceLimits or PropagateResource‐
2082 LimitsExcept, may be specified. The user limits can not exceed
2083 hard limits under which the slurmd daemon operates. If the user
2084 limits are not propagated, the limits from the slurmd daemon
2085 will be propagated to the user's job. The limits used for the
2086 Slurm daemons can be set in the /etc/sysconf/slurm file. For
2087 more information, see: https://slurm.schedmd.com/faq.html#mem‐
2088 lock The following limit names are supported by Slurm (although
2089 some options may not be supported on some systems):
2090
2091 ALL All limits listed below (default)
2092
2093 NONE No limits listed below
2094
2095 AS The maximum address space for a process
2096
2097 CORE The maximum size of core file
2098
2099 CPU The maximum amount of CPU time
2100
2101 DATA The maximum size of a process's data segment
2102
2103 FSIZE The maximum size of files created. Note that if the
2104 user sets FSIZE to less than the current size of the
2105 slurmd.log, job launches will fail with a 'File size
2106 limit exceeded' error.
2107
2108 MEMLOCK The maximum size that may be locked into memory
2109
2110 NOFILE The maximum number of open files
2111
2112 NPROC The maximum number of processes available
2113
2114 RSS The maximum resident set size
2115
2116 STACK The maximum stack size
2117
2118
2119 PropagateResourceLimitsExcept
2120 A list of comma separated resource limit names. By default, all
2121 resource limits will be propagated, (as described by the Propa‐
2122 gateResourceLimits parameter), except for the limits appearing
2123 in this list. The user can override this by specifying which
2124 resource limits to propagate with the sbatch or srun "--propa‐
2125 gate" option. See PropagateResourceLimits above for a list of
2126 valid limit names.
2127
2128
2129 RebootProgram
2130 Program to be executed on each compute node to reboot it.
2131 Invoked on each node once it becomes idle after the command
2132 "scontrol reboot_nodes" is executed by an authorized user or a
2133 job is submitted with the "--reboot" option. After rebooting,
2134 the node is returned to normal use. See ResumeTimeout to con‐
2135 figure the time you expect a reboot to finish in. A node will
2136 be marked DOWN if it doesn't reboot within ResumeTimeout.
2137
2138
2139 ReconfigFlags
2140 Flags to control various actions that may be taken when an
2141 "scontrol reconfig" command is issued. Currently the options
2142 are:
2143
2144 KeepPartInfo If set, an "scontrol reconfig" command will
2145 maintain the in-memory value of partition
2146 "state" and other parameters that may have been
2147 dynamically updated by "scontrol update". Par‐
2148 tition information in the slurm.conf file will
2149 be merged with in-memory data. This flag
2150 supersedes the KeepPartState flag.
2151
2152 KeepPartState If set, an "scontrol reconfig" command will
2153 preserve only the current "state" value of
2154 in-memory partitions and will reset all other
2155 parameters of the partitions that may have been
2156 dynamically updated by "scontrol update" to the
2157 values from the slurm.conf file. Partition
2158 information in the slurm.conf file will be
2159 merged with in-memory data.
2160 The default for the above flags is not set, and the "scontrol
2161 reconfig" will rebuild the partition information using only the
2162 definitions in the slurm.conf file.
2163
2164
2165 RequeueExit
2166 Enables automatic requeue for batch jobs which exit with the
2167 specified values. Separate multiple exit code by a comma and/or
2168 specify numeric ranges using a "-" separator (e.g. "Requeue‐
2169 Exit=1-9,18") Jobs will be put back in to pending state and
2170 later scheduled again. Restarted jobs will have the environment
2171 variable SLURM_RESTART_COUNT set to the number of times the job
2172 has been restarted.
2173
2174
2175 RequeueExitHold
2176 Enables automatic requeue for batch jobs which exit with the
2177 specified values, with these jobs being held until released man‐
2178 ually by the user. Separate multiple exit code by a comma
2179 and/or specify numeric ranges using a "-" separator (e.g.
2180 "RequeueExitHold=10-12,16") These jobs are put in the JOB_SPE‐
2181 CIAL_EXIT exit state. Restarted jobs will have the environment
2182 variable SLURM_RESTART_COUNT set to the number of times the job
2183 has been restarted.
2184
2185
2186 ResumeFailProgram
2187 The program that will be executed when nodes fail to resume to
2188 by ResumeTimeout. The argument to the program will be the names
2189 of the failed nodes (using Slurm's hostlist expression format).
2190
2191
2192 ResumeProgram
2193 Slurm supports a mechanism to reduce power consumption on nodes
2194 that remain idle for an extended period of time. This is typi‐
2195 cally accomplished by reducing voltage and frequency or powering
2196 the node down. ResumeProgram is the program that will be exe‐
2197 cuted when a node in power save mode is assigned work to per‐
2198 form. For reasons of reliability, ResumeProgram may execute
2199 more than once for a node when the slurmctld daemon crashes and
2200 is restarted. If ResumeProgram is unable to restore a node to
2201 service with a responding slurmd and an updated BootTime, it
2202 should requeue any job associated with the node and set the node
2203 state to DOWN. If the node isn't actually rebooted (i.e. when
2204 multiple-slurmd is configured) starting slurmd with "-b" option
2205 might be useful. The program executes as SlurmUser. The argu‐
2206 ment to the program will be the names of nodes to be removed
2207 from power savings mode (using Slurm's hostlist expression for‐
2208 mat). By default no program is run. Related configuration
2209 options include ResumeTimeout, ResumeRate, SuspendRate, Suspend‐
2210 Time, SuspendTimeout, SuspendProgram, SuspendExcNodes, and Sus‐
2211 pendExcParts. More information is available at the Slurm web
2212 site ( https://slurm.schedmd.com/power_save.html ).
2213
2214
2215 ResumeRate
2216 The rate at which nodes in power save mode are returned to nor‐
2217 mal operation by ResumeProgram. The value is number of nodes
2218 per minute and it can be used to prevent power surges if a large
2219 number of nodes in power save mode are assigned work at the same
2220 time (e.g. a large job starts). A value of zero results in no
2221 limits being imposed. The default value is 300 nodes per
2222 minute. Related configuration options include ResumeTimeout,
2223 ResumeProgram, SuspendRate, SuspendTime, SuspendTimeout, Sus‐
2224 pendProgram, SuspendExcNodes, and SuspendExcParts.
2225
2226
2227 ResumeTimeout
2228 Maximum time permitted (in seconds) between when a node resume
2229 request is issued and when the node is actually available for
2230 use. Nodes which fail to respond in this time frame will be
2231 marked DOWN and the jobs scheduled on the node requeued. Nodes
2232 which reboot after this time frame will be marked DOWN with a
2233 reason of "Node unexpectedly rebooted." The default value is 60
2234 seconds. Related configuration options include ResumeProgram,
2235 ResumeRate, SuspendRate, SuspendTime, SuspendTimeout, Suspend‐
2236 Program, SuspendExcNodes and SuspendExcParts. More information
2237 is available at the Slurm web site (
2238 https://slurm.schedmd.com/power_save.html ).
2239
2240
2241 ResvEpilog
2242 Fully qualified pathname of a program for the slurmctld to exe‐
2243 cute when a reservation ends. The program can be used to cancel
2244 jobs, modify partition configuration, etc. The reservation
2245 named will be passed as an argument to the program. By default
2246 there is no epilog.
2247
2248
2249 ResvOverRun
2250 Describes how long a job already running in a reservation should
2251 be permitted to execute after the end time of the reservation
2252 has been reached. The time period is specified in minutes and
2253 the default value is 0 (kill the job immediately). The value
2254 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2255 supported to permit a job to run indefinitely after its reserva‐
2256 tion is terminated.
2257
2258
2259 ResvProlog
2260 Fully qualified pathname of a program for the slurmctld to exe‐
2261 cute when a reservation begins. The program can be used to can‐
2262 cel jobs, modify partition configuration, etc. The reservation
2263 named will be passed as an argument to the program. By default
2264 there is no prolog.
2265
2266
2267 ReturnToService
2268 Controls when a DOWN node will be returned to service. The
2269 default value is 0. Supported values include
2270
2271 0 A node will remain in the DOWN state until a system adminis‐
2272 trator explicitly changes its state (even if the slurmd dae‐
2273 mon registers and resumes communications).
2274
2275 1 A DOWN node will become available for use upon registration
2276 with a valid configuration only if it was set DOWN due to
2277 being non-responsive. If the node was set DOWN for any
2278 other reason (low memory, unexpected reboot, etc.), its
2279 state will not automatically be changed. A node registers
2280 with a valid configuration if its memory, GRES, CPU count,
2281 etc. are equal to or greater than the values configured in
2282 slurm.conf.
2283
2284 2 A DOWN node will become available for use upon registration
2285 with a valid configuration. The node could have been set
2286 DOWN for any reason. A node registers with a valid configu‐
2287 ration if its memory, GRES, CPU count, etc. are equal to or
2288 greater than the values configured in slurm.conf. (Disabled
2289 on Cray ALPS systems.)
2290
2291
2292 RoutePlugin
2293 Identifies the plugin to be used for defining which nodes will
2294 be used for message forwarding and message aggregation.
2295
2296 route/default
2297 default, use TreeWidth.
2298
2299 route/topology
2300 use the switch hierarchy defined in a topology.conf file.
2301 TopologyPlugin=topology/tree is required.
2302
2303
2304 SallocDefaultCommand
2305 Normally, salloc(1) will run the user's default shell when a
2306 command to execute is not specified on the salloc command line.
2307 If SallocDefaultCommand is specified, salloc will instead run
2308 the configured command. The command is passed to '/bin/sh -c',
2309 so shell metacharacters are allowed, and commands with multiple
2310 arguments should be quoted. For instance:
2311
2312 SallocDefaultCommand = "$SHELL"
2313
2314 would run the shell in the user's $SHELL environment variable.
2315 and
2316
2317 SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
2318
2319 would run spawn the user's default shell on the allocated
2320 resources, but not consume any of the CPU or memory resources,
2321 configure it as a pseudo-terminal, and preserve all of the job's
2322 environment variables (i.e. and not over-write them with the job
2323 step's allocation information).
2324
2325 For systems with generic resources (GRES) defined, the SallocDe‐
2326 faultCommand value should explicitly specify a zero count for
2327 the configured GRES. Failure to do so will result in the
2328 launched shell consuming those GRES and preventing subsequent
2329 srun commands from using them. For example, on Cray systems add
2330 "--gres=craynetwork:0" as shown below:
2331 SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"
2332
2333 For systems with TaskPlugin set, adding an option of
2334 "--cpu-bind=no" is recommended if the default shell should have
2335 access to all of the CPUs allocated to the job on that node,
2336 otherwise the shell may be limited to a single cpu or core.
2337
2338
2339 SbcastParameters
2340 Controls sbcast command behavior. Multiple options can be speci‐
2341 fied in a comma separated list. Supported values include:
2342
2343 DestDir= Destination directory for file being broadcast to
2344 allocated compute nodes. Default value is cur‐
2345 rent working directory.
2346
2347 Compression= Specify default file compression library to be
2348 used. Supported values are "lz4", "none" and
2349 "zlib". The default value with the sbcast --com‐
2350 press option is "lz4" and "none" otherwise. Some
2351 compression libraries may be unavailable on some
2352 systems.
2353
2354
2355 SchedulerParameters
2356 The interpretation of this parameter varies by SchedulerType.
2357 Multiple options may be comma separated.
2358
2359 allow_zero_lic
2360 If set, then job submissions requesting more than config‐
2361 ured licenses won't be rejected.
2362
2363 assoc_limit_stop
2364 If set and a job cannot start due to association limits,
2365 then do not attempt to initiate any lower priority jobs
2366 in that partition. Setting this can decrease system
2367 throughput and utilization, but avoid potentially starv‐
2368 ing larger jobs by preventing them from launching indefi‐
2369 nitely.
2370
2371 batch_sched_delay=#
2372 How long, in seconds, the scheduling of batch jobs can be
2373 delayed. This can be useful in a high-throughput envi‐
2374 ronment in which batch jobs are submitted at a very high
2375 rate (i.e. using the sbatch command) and one wishes to
2376 reduce the overhead of attempting to schedule each job at
2377 submit time. The default value is 3 seconds.
2378
2379 bb_array_stage_cnt=#
2380 Number of tasks from a job array that should be available
2381 for burst buffer resource allocation. Higher values will
2382 increase the system overhead as each task from the job
2383 array will be moved to it's own job record in memory, so
2384 relatively small values are generally recommended. The
2385 default value is 10.
2386
2387 bf_busy_nodes
2388 When selecting resources for pending jobs to reserve for
2389 future execution (i.e. the job can not be started immedi‐
2390 ately), then preferentially select nodes that are in use.
2391 This will tend to leave currently idle resources avail‐
2392 able for backfilling longer running jobs, but may result
2393 in allocations having less than optimal network topology.
2394 This option is currently only supported by the
2395 select/cons_res plugin (or select/cray with SelectTypePa‐
2396 rameters set to "OTHER_CONS_RES", which layers the
2397 select/cray plugin over the select/cons_res plugin).
2398
2399 bf_continue
2400 The backfill scheduler periodically releases locks in
2401 order to permit other operations to proceed rather than
2402 blocking all activity for what could be an extended
2403 period of time. Setting this option will cause the back‐
2404 fill scheduler to continue processing pending jobs from
2405 its original job list after releasing locks even if job
2406 or node state changes. This can result in lower priority
2407 jobs being backfill scheduled instead of newly arrived
2408 higher priority jobs, but will permit more queued jobs to
2409 be considered for backfill scheduling.
2410
2411 bf_hetjob_immediate
2412 Instruct the backfill scheduler to attempt to start a
2413 heterogeneous job as soon as all of its components are
2414 determined able to do so. Otherwise, the backfill sched‐
2415 uler will delay heterogeneous jobs initiation attempts
2416 until after the rest of the queue has been processed.
2417 This delay may result in lower priority jobs being allo‐
2418 cated resources, which could delay the initiation of the
2419 heterogeneous job due to account and/or QOS limits being
2420 reached. This option is disabled by default. If enabled
2421 and bf_hetjob_prio=min is not set, then it would be auto‐
2422 matically set.
2423
2424 bf_hetjob_prio=[min|avg|max]
2425 At the beginning of each backfill scheduling cycle, a
2426 list of pending to be scheduled jobs is sorted according
2427 to the precedence order configured in PriorityType. This
2428 option instructs the scheduler to alter the sorting algo‐
2429 rithm to ensure that all components belonging to the same
2430 heterogeneous job will be attempted to be scheduled con‐
2431 secutively (thus not fragmented in the resulting list).
2432 More specifically, all components from the same heteroge‐
2433 neous job will be treated as if they all have the same
2434 priority (minimum, average or maximum depending upon this
2435 option's parameter) when compared with other jobs (or
2436 other heterogeneous job components). The original order
2437 will be preserved within the same heterogeneous job. Note
2438 that the operation is calculated for the PriorityTier
2439 layer and for the Priority resulting from the prior‐
2440 ity/multifactor plugin calculations. When enabled, if any
2441 heterogeneous job requested an advanced reservation, then
2442 all of that job's components will be treated as if they
2443 had requested an advanced reservation (and get preferen‐
2444 tial treatment in scheduling).
2445
2446 Note that this operation does not update the Priority
2447 values of the heterogeneous job components, only their
2448 order within the list, so the output of the sprio command
2449 will not be effected.
2450
2451 Heterogeneous jobs have special scheduling properties:
2452 they are only scheduled by the backfill scheduling plug‐
2453 in, each of their components is considered separately
2454 when reserving resources (and might have different Prior‐
2455 ityTier or different Priority values), and no heteroge‐
2456 neous job component is actually allocated resources until
2457 all if its components can be initiated. This may imply
2458 potential scheduling deadlock scenarios because compo‐
2459 nents from different heterogeneous jobs can start reserv‐
2460 ing resources in an interleaved fashion (not consecu‐
2461 tively), but none of the jobs can reserve resources for
2462 all components and start. Enabling this option can help
2463 to mitigate this problem. By default, this option is dis‐
2464 abled.
2465
2466 bf_ignore_newly_avail_nodes
2467 If set, then only resources available at the beginning of
2468 a backfill cycle will be considered for use. Otherwise
2469 resources made available during that backfill cycle (dur‐
2470 ing a yield with bf_continue set) may be used for lower
2471 priority jobs, delaying the initiation of higher priority
2472 jobs. Disabled by default.
2473
2474 bf_interval=#
2475 The number of seconds between backfill iterations.
2476 Higher values result in less overhead and better respon‐
2477 siveness. This option applies only to Scheduler‐
2478 Type=sched/backfill. The default value is 30 seconds.
2479
2480
2481 bf_job_part_count_reserve=#
2482 The backfill scheduling logic will reserve resources for
2483 the specified count of highest priority jobs in each par‐
2484 tition. For example, bf_job_part_count_reserve=10 will
2485 cause the backfill scheduler to reserve resources for the
2486 ten highest priority jobs in each partition. Any lower
2487 priority job that can be started using currently avail‐
2488 able resources and not adversely impact the expected
2489 start time of these higher priority jobs will be started
2490 by the backfill scheduler The default value is zero,
2491 which will reserve resources for any pending job and
2492 delay initiation of lower priority jobs. Also see
2493 bf_min_age_reserve and bf_min_prio_reserve.
2494
2495
2496 bf_max_job_array_resv=#
2497 The maximum number of tasks from a job array for which
2498 the backfill scheduler will reserve resources in the
2499 future. Since job arrays can potentially have millions
2500 of tasks, the overhead in reserving resources for all
2501 tasks can be prohibitive. In addition various limits may
2502 prevent all the jobs from starting at the expected times.
2503 This has no impact upon the number of tasks from a job
2504 array that can be started immediately, only those tasks
2505 expected to start at some future time. The default value
2506 is 20 tasks. NOTE: Jobs submitted to multiple partitions
2507 appear in the job queue once per partition. If different
2508 copies of a single job array record aren't consecutive in
2509 the job queue and another job array record is in between,
2510 then bf_max_job_array_resv tasks are considered per par‐
2511 tition that the job is submitted to.
2512
2513 bf_max_job_assoc=#
2514 The maximum number of jobs per user association to
2515 attempt starting with the backfill scheduler. This set‐
2516 ting is similar to bf_max_job_user but is handy if a user
2517 has multiple assocations equating to basically different
2518 users. One can set this limit to prevent users from
2519 flooding the backfill queue with jobs that cannot start
2520 and that prevent jobs from other users to start. The
2521 default value is 0, which means no limit. This option
2522 applies only to SchedulerType=sched/backfill. Also see
2523 the bf_max_job_user bf_max_job_part, bf_max_job_test and
2524 bf_max_job_user_part=# options. Set bf_max_job_test to a
2525 value much higher than bf_max_job_assoc.
2526
2527 bf_max_job_part=#
2528 The maximum number of jobs per partition to attempt
2529 starting with the backfill scheduler. This can be espe‐
2530 cially helpful for systems with large numbers of parti‐
2531 tions and jobs. The default value is 0, which means no
2532 limit. This option applies only to Scheduler‐
2533 Type=sched/backfill. Also see the partition_job_depth
2534 and bf_max_job_test options. Set bf_max_job_test to a
2535 value much higher than bf_max_job_part.
2536
2537 bf_max_job_start=#
2538 The maximum number of jobs which can be initiated in a
2539 single iteration of the backfill scheduler. The default
2540 value is 0, which means no limit. This option applies
2541 only to SchedulerType=sched/backfill.
2542
2543 bf_max_job_test=#
2544 The maximum number of jobs to attempt backfill scheduling
2545 for (i.e. the queue depth). Higher values result in more
2546 overhead and less responsiveness. Until an attempt is
2547 made to backfill schedule a job, its expected initiation
2548 time value will not be set. The default value is 100.
2549 In the case of large clusters, configuring a relatively
2550 small value may be desirable. This option applies only
2551 to SchedulerType=sched/backfill.
2552
2553 bf_max_job_user=#
2554 The maximum number of jobs per user to attempt starting
2555 with the backfill scheduler for ALL partitions. One can
2556 set this limit to prevent users from flooding the back‐
2557 fill queue with jobs that cannot start and that prevent
2558 jobs from other users to start. This is similar to the
2559 MAXIJOB limit in Maui. The default value is 0, which
2560 means no limit. This option applies only to Scheduler‐
2561 Type=sched/backfill. Also see the bf_max_job_part,
2562 bf_max_job_test and bf_max_job_user_part=# options. Set
2563 bf_max_job_test to a value much higher than
2564 bf_max_job_user.
2565
2566 bf_max_job_user_part=#
2567 The maximum number of jobs per user per partition to
2568 attempt starting with the backfill scheduler for any sin‐
2569 gle partition. The default value is 0, which means no
2570 limit. This option applies only to Scheduler‐
2571 Type=sched/backfill. Also see the bf_max_job_part,
2572 bf_max_job_test and bf_max_job_user=# options.
2573
2574 bf_max_time=#
2575 The maximum time the backfill scheduler can spend
2576 (including time spent sleeping when locks are released)
2577 before discontinuing, even if maximum job counts have not
2578 been reached. This option applies only to Scheduler‐
2579 Type=sched/backfill. The default value is the value of
2580 bf_interval (which defaults to 30 seconds). NOTE: This
2581 needs to be high enough that scheduling isn't always dis‐
2582 abled, and low enough that our interactive workload can
2583 get through in a reasonable period of time. Certainly
2584 needs to be below 256 (the default RPC thread limit).
2585 Running around the middle (150) may give you good
2586 results.
2587
2588 bf_min_age_reserve=#
2589 The backfill and main scheduling logic will not reserve
2590 resources for pending jobs until they have been pending
2591 and runnable for at least the specified number of sec‐
2592 onds. In addition, jobs waiting for less than the speci‐
2593 fied number of seconds will not prevent a newly submitted
2594 job from starting immediately, even if the newly submit‐
2595 ted job has a lower priority. This can be valuable if
2596 jobs lack time limits or all time limits have the same
2597 value. The default value is zero, which will reserve
2598 resources for any pending job and delay initiation of
2599 lower priority jobs. Also see bf_job_part_count_reserve
2600 and bf_min_prio_reserve.
2601
2602 bf_min_prio_reserve=#
2603 The backfill and main scheduling logic will not reserve
2604 resources for pending jobs unless they have a priority
2605 equal to or higher than the specified value. In addi‐
2606 tion, jobs with a lower priority will not prevent a newly
2607 submitted job from starting immediately, even if the
2608 newly submitted job has a lower priority. This can be
2609 valuable if one wished to maximum system utilization
2610 without regard for job priority below a certain thresh‐
2611 old. The default value is zero, which will reserve
2612 resources for any pending job and delay initiation of
2613 lower priority jobs. Also see bf_job_part_count_reserve
2614 and bf_min_age_reserve.
2615
2616 bf_resolution=#
2617 The number of seconds in the resolution of data main‐
2618 tained about when jobs begin and end. Higher values
2619 result in less overhead and better responsiveness. The
2620 default value is 60 seconds. This option applies only to
2621 SchedulerType=sched/backfill.
2622
2623 bf_window=#
2624 The number of minutes into the future to look when con‐
2625 sidering jobs to schedule. Higher values result in more
2626 overhead and less responsiveness. The default value is
2627 1440 minutes (one day). A value at least as long as the
2628 highest allowed time limit is generally advisable to pre‐
2629 vent job starvation. In order to limit the amount of
2630 data managed by the backfill scheduler, if the value of
2631 bf_window is increased, then it is generally advisable to
2632 also increase bf_resolution. This option applies only to
2633 SchedulerType=sched/backfill.
2634
2635 bf_window_linear=#
2636 For performance reasons, the backfill scheduler will
2637 decrease precision in calculation of job expected termi‐
2638 nation times. By default, the precision starts at 30 sec‐
2639 onds and that time interval doubles with each evaluation
2640 of currently executing jobs when trying to determine when
2641 a pending job can start. This algorithm can support an
2642 environment with many thousands of running jobs, but can
2643 result in the expected start time of pending jobs being
2644 gradually being deferred due to lack of precision. A
2645 value for bf_window_linear will cause the time interval
2646 to be increased by a constant amount on each iteration.
2647 The value is specified in units of seconds. For example,
2648 a value of 60 will cause the backfill scheduler on the
2649 first iteration to identify the job ending soonest and
2650 determine if the pending job can be started after that
2651 job plus all other jobs expected to end within 30 seconds
2652 (default initial value) of the first job. On the next
2653 iteration, the pending job will be evaluated for starting
2654 after the next job expected to end plus all jobs ending
2655 within 90 seconds of that time (30 second default, plus
2656 the 60 second option value). The third iteration will
2657 have a 150 second window and the fourth 210 seconds.
2658 Without this option, the time windows will double on each
2659 iteration and thus be 30, 60, 120, 240 seconds, etc. The
2660 use of bf_window_linear is not recommended with more than
2661 a few hundred simultaneously executing jobs.
2662
2663 bf_yield_interval=#
2664 The backfill scheduler will periodically relinquish locks
2665 in order for other pending operations to take place.
2666 This specifies the times when the locks are relinquish in
2667 microseconds. The default value is 2,000,000 microsec‐
2668 onds (2 seconds). Smaller values may be helpful for high
2669 throughput computing when used in conjunction with the
2670 bf_continue option. Also see the bf_yield_sleep option.
2671
2672 bf_yield_sleep=#
2673 The backfill scheduler will periodically relinquish locks
2674 in order for other pending operations to take place.
2675 This specifies the length of time for which the locks are
2676 relinquish in microseconds. The default value is 500,000
2677 microseconds (0.5 seconds). Also see the bf_yield_inter‐
2678 val option.
2679
2680 build_queue_timeout=#
2681 Defines the maximum time that can be devoted to building
2682 a queue of jobs to be tested for scheduling. If the sys‐
2683 tem has a huge number of jobs with dependencies, just
2684 building the job queue can take so much time as to
2685 adversely impact overall system performance and this
2686 parameter can be adjusted as needed. The default value
2687 is 2,000,000 microseconds (2 seconds).
2688
2689 default_queue_depth=#
2690 The default number of jobs to attempt scheduling (i.e.
2691 the queue depth) when a running job completes or other
2692 routine actions occur, however the frequency with which
2693 the scheduler is run may be limited by using the defer or
2694 sched_min_interval parameters described below. The full
2695 queue will be tested on a less frequent basis as defined
2696 by the sched_interval option described below. The default
2697 value is 100. See the partition_job_depth option to
2698 limit depth by partition.
2699
2700 defer Setting this option will avoid attempting to schedule
2701 each job individually at job submit time, but defer it
2702 until a later time when scheduling multiple jobs simulta‐
2703 neously may be possible. This option may improve system
2704 responsiveness when large numbers of jobs (many hundreds)
2705 are submitted at the same time, but it will delay the
2706 initiation time of individual jobs. Also see
2707 default_queue_depth above.
2708
2709 delay_boot=#
2710 Do not reboot nodes in order to satisfied this job's fea‐
2711 ture specification if the job has been eligible to run
2712 for less than this time period. If the job has waited
2713 for less than the specified period, it will use only
2714 nodes which already have the specified features. The
2715 argument is in units of minutes. Individual jobs may
2716 override this default value with the --delay-boot option.
2717
2718 default_gbytes
2719 The default units in job submission memory and temporary
2720 disk size specification will be gigabytes rather than
2721 megabytes. Users can override the default by using a
2722 suffix of "M" for megabytes.
2723
2724 disable_hetero_steps
2725 Disable job steps that span heterogeneous job alloca‐
2726 tions. The default value on Cray systems.
2727
2728 enable_hetero_steps
2729 Enable job steps that span heterogeneous job allocations.
2730 The default value except for Cray systems.
2731
2732 enable_user_top
2733 Enable use of the "scontrol top" command by non-privi‐
2734 leged users.
2735
2736 Ignore_NUMA
2737 Some processors (e.g. AMD Opteron 6000 series) contain
2738 multiple NUMA nodes per socket. This is a configuration
2739 which does not map into the hardware entities that Slurm
2740 optimizes resource allocation for (PU/thread, core,
2741 socket, baseboard, node and network switch). In order to
2742 optimize resource allocations on such hardware, Slurm
2743 will consider each NUMA node within the socket as a sepa‐
2744 rate socket by default. Use the Ignore_NUMA option to
2745 report the correct socket count, but not optimize
2746 resource allocations on the NUMA nodes.
2747
2748 inventory_interval=#
2749 On a Cray system using Slurm on top of ALPS this limits
2750 the number of times a Basil Inventory call is made. Nor‐
2751 mally this call happens every scheduling consideration to
2752 attempt to close a node state change window with respects
2753 to what ALPS has. This call is rather slow, so making it
2754 less frequently improves performance dramatically, but in
2755 the situation where a node changes state the window is as
2756 large as this setting. In an HTC environment this set‐
2757 ting is a must and we advise around 10 seconds.
2758
2759 kill_invalid_depend
2760 If a job has an invalid dependency and it can never run
2761 terminate it and set its state to be JOB_CANCELLED. By
2762 default the job stays pending with reason DependencyNev‐
2763 erSatisfied.
2764
2765 max_array_tasks
2766 Specify the maximum number of tasks that be included in a
2767 job array. The default limit is MaxArraySize, but this
2768 option can be used to set a lower limit. For example,
2769 max_array_tasks=1000 and MaxArraySize=100001 would permit
2770 a maximum task ID of 100000, but limit the number of
2771 tasks in any single job array to 1000.
2772
2773 max_depend_depth=#
2774 Maximum number of jobs to test for a circular job depen‐
2775 dency. Stop testing after this number of job dependencies
2776 have been tested. The default value is 10 jobs.
2777
2778 max_rpc_cnt=#
2779 If the number of active threads in the slurmctld daemon
2780 is equal to or larger than this value, defer scheduling
2781 of jobs. This can improve Slurm's ability to process
2782 requests at a cost of initiating new jobs less fre‐
2783 quently. The default value is zero, which disables this
2784 option. If a value is set, then a value of 10 or higher
2785 is recommended.
2786
2787 max_sched_time=#
2788 How long, in seconds, that the main scheduling loop will
2789 execute for before exiting. If a value is configured, be
2790 aware that all other Slurm operations will be deferred
2791 during this time period. Make certain the value is lower
2792 than MessageTimeout. If a value is not explicitly con‐
2793 figured, the default value is half of MessageTimeout with
2794 a minimum default value of 1 second and a maximum default
2795 value of 2 seconds. For example if MessageTimeout=10,
2796 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
2797
2798 max_script_size=#
2799 Specify the maximum size of a batch script, in bytes.
2800 The default value is 4 megabytes. Larger values may
2801 adversely impact system performance.
2802
2803 max_switch_wait=#
2804 Maximum number of seconds that a job can delay execution
2805 waiting for the specified desired switch count. The
2806 default value is 300 seconds.
2807
2808 no_backup_scheduling
2809 If used, the backup controller will not schedule jobs
2810 when it takes over. The backup controller will allow jobs
2811 to be submitted, modified and cancelled but won't sched‐
2812 ule new jobs. This is useful in Cray environments when
2813 the backup controller resides on an external Cray node.
2814 A restart is required to alter this option. This is
2815 explicitly set on a Cray/ALPS system.
2816
2817 no_env_cache
2818 If used, any job started on node that fails to load the
2819 env from a node will fail instead of using the cached
2820 env. This will also implicitly imply the requeue_set‐
2821 up_env_fail option as well.
2822
2823 pack_serial_at_end
2824 If used with the select/cons_res plugin then put serial
2825 jobs at the end of the available nodes rather than using
2826 a best fit algorithm. This may reduce resource fragmen‐
2827 tation for some workloads.
2828
2829 partition_job_depth=#
2830 The default number of jobs to attempt scheduling (i.e.
2831 the queue depth) from each partition/queue in Slurm's
2832 main scheduling logic. The functionality is similar to
2833 that provided by the bf_max_job_part option for the back‐
2834 fill scheduling logic. The default value is 0 (no
2835 limit). Job's excluded from attempted scheduling based
2836 upon partition will not be counted against the
2837 default_queue_depth limit. Also see the bf_max_job_part
2838 option.
2839
2840 preempt_reorder_count=#
2841 Specify how many attempts should be made in reording pre‐
2842 emptable jobs to minimize the count of jobs preempted.
2843 The default value is 1. High values may adversely impact
2844 performance. The logic to support this option is only
2845 available in the select/cons_res plugin.
2846
2847 preempt_strict_order
2848 If set, then execute extra logic in an attempt to preempt
2849 only the lowest priority jobs. It may be desirable to
2850 set this configuration parameter when there are multiple
2851 priorities of preemptable jobs. The logic to support
2852 this option is only available in the select/cons_res
2853 plugin.
2854
2855 preempt_youngest_first
2856 If set, then the preemption sorting algorithm will be
2857 changed to sort by the job start times to favor preempt‐
2858 ing younger jobs over older. (Requires preempt/parti‐
2859 tion_prio or preempt/qos plugins.)
2860
2861 nohold_on_prolog_fail
2862 By default if the Prolog exits with a non-zero value the
2863 job is requeued in held state. By specifying this parame‐
2864 ter the job will be requeued but not held so that the
2865 scheduler can dispatch it to another host.
2866
2867 reduce_completing_frag
2868 This option is used to control how scheduling of
2869 resources is performed when jobs are in completing state,
2870 which influences potential fragmentation. If the option
2871 is not set then no jobs will be started in any partition
2872 when any job is in completing state. If the option is
2873 set then no jobs will be started in any individual parti‐
2874 tion that has a job in completing state. In addition, no
2875 jobs will be started in any partition with nodes that
2876 overlap with any nodes in the partition of the completing
2877 job. This option is to be used in conjunction with Com‐
2878 pleteWait. NOTE: CompleteWait must be set for this to
2879 work.
2880
2881 requeue_setup_env_fail
2882 By default if a job environment setup fails the job keeps
2883 running with a limited environment. By specifying this
2884 parameter the job will be requeued in held state and the
2885 execution node drained.
2886
2887 salloc_wait_nodes
2888 If defined, the salloc command will wait until all allo‐
2889 cated nodes are ready for use (i.e. booted) before the
2890 command returns. By default, salloc will return as soon
2891 as the resource allocation has been made.
2892
2893 sbatch_wait_nodes
2894 If defined, the sbatch script will wait until all allo‐
2895 cated nodes are ready for use (i.e. booted) before the
2896 initiation. By default, the sbatch script will be initi‐
2897 ated as soon as the first node in the job allocation is
2898 ready. The sbatch command can use the --wait-all-nodes
2899 option to override this configuration parameter.
2900
2901 sched_interval=#
2902 How frequently, in seconds, the main scheduling loop will
2903 execute and test all pending jobs. The default value is
2904 60 seconds.
2905
2906 sched_max_job_start=#
2907 The maximum number of jobs that the main scheduling logic
2908 will start in any single execution. The default value is
2909 zero, which imposes no limit.
2910
2911 sched_min_interval=#
2912 How frequently, in microseconds, the main scheduling loop
2913 will execute and test any pending jobs. The scheduler
2914 runs in a limited fashion every time that any event hap‐
2915 pens which could enable a job to start (e.g. job submit,
2916 job terminate, etc.). If these events happen at a high
2917 frequency, the scheduler can run very frequently and con‐
2918 sume significant resources if not throttled by this
2919 option. This option specifies the minimum time between
2920 the end of one scheduling cycle and the beginning of the
2921 next scheduling cycle. A value of zero will disable
2922 throttling of the scheduling logic interval. The default
2923 value is 1,000,000 microseconds on Cray/ALPS systems and
2924 2 microseconds on other systems.
2925
2926 spec_cores_first
2927 Specialized cores will be selected from the first cores
2928 of the first sockets, cycling through the sockets on a
2929 round robin basis. By default, specialized cores will be
2930 selected from the last cores of the last sockets, cycling
2931 through the sockets on a round robin basis.
2932
2933 step_retry_count=#
2934 When a step completes and there are steps ending resource
2935 allocation, then retry step allocations for at least this
2936 number of pending steps. Also see step_retry_time. The
2937 default value is 8 steps.
2938
2939 step_retry_time=#
2940 When a step completes and there are steps ending resource
2941 allocation, then retry step allocations for all steps
2942 which have been pending for at least this number of sec‐
2943 onds. Also see step_retry_count. The default value is
2944 60 seconds.
2945
2946 whole_hetjob
2947 Requests to cancel, hold or release any component of a
2948 heterogeneous job will be applied to all components of
2949 the job.
2950
2951 NOTE: this option was previously named whole_pack and
2952 this is still supported for retrocompatibility.
2953
2954
2955 SchedulerTimeSlice
2956 Number of seconds in each time slice when gang scheduling is
2957 enabled (PreemptMode=SUSPEND,GANG). The value must be between 5
2958 seconds and 65533 seconds. The default value is 30 seconds.
2959
2960
2961 SchedulerType
2962 Identifies the type of scheduler to be used. Note the slurmctld
2963 daemon must be restarted for a change in scheduler type to
2964 become effective (reconfiguring a running daemon has no effect
2965 for this parameter). The scontrol command can be used to manu‐
2966 ally change job priorities if desired. Acceptable values
2967 include:
2968
2969 sched/backfill
2970 For a backfill scheduling module to augment the default
2971 FIFO scheduling. Backfill scheduling will initiate
2972 lower-priority jobs if doing so does not delay the
2973 expected initiation time of any higher priority job.
2974 Effectiveness of backfill scheduling is dependent upon
2975 users specifying job time limits, otherwise all jobs will
2976 have the same time limit and backfilling is impossible.
2977 Note documentation for the SchedulerParameters option
2978 above. This is the default configuration.
2979
2980 sched/builtin
2981 This is the FIFO scheduler which initiates jobs in prior‐
2982 ity order. If any job in the partition can not be sched‐
2983 uled, no lower priority job in that partition will be
2984 scheduled. An exception is made for jobs that can not
2985 run due to partition constraints (e.g. the time limit) or
2986 down/drained nodes. In that case, lower priority jobs
2987 can be initiated and not impact the higher priority job.
2988
2989 sched/hold
2990 To hold all newly arriving jobs if a file
2991 "/etc/slurm.hold" exists otherwise use the built-in FIFO
2992 scheduler
2993
2994
2995 SelectType
2996 Identifies the type of resource selection algorithm to be used.
2997 Changing this value can only be done by restarting the slurmctld
2998 daemon and will result in the loss of all job information (run‐
2999 ning and pending) since the job state save format used by each
3000 plugin is different. Acceptable values include
3001
3002 select/cons_res
3003 The resources (cores and memory) within a node are indi‐
3004 vidually allocated as consumable resources. Note that
3005 whole nodes can be allocated to jobs for selected parti‐
3006 tions by using the OverSubscribe=Exclusive option. See
3007 the partition OverSubscribe parameter for more informa‐
3008 tion.
3009
3010 select/cray
3011 for a Cray system. The default value is "select/cray"
3012 for all Cray systems.
3013
3014 select/linear
3015 for allocation of entire nodes assuming a one-dimensional
3016 array of nodes in which sequentially ordered nodes are
3017 preferable. For a heterogeneous cluster (e.g. different
3018 CPU counts on the various nodes), resource allocations
3019 will favor nodes with high CPU counts as needed based
3020 upon the job's node and CPU specification if TopologyPlu‐
3021 gin=topology/none is configured. Use of other topology
3022 plugins with select/linear and heterogeneous nodes is not
3023 recommended and may result in valid job allocation
3024 requests being rejected. This is the default value.
3025
3026 select/serial
3027 for allocating resources to single CPU jobs only. Highly
3028 optimized for maximum throughput. NOTE: SPANK environ‐
3029 ment variables are NOT propagated to the job's Epilog
3030 program.
3031
3032
3033 SelectTypeParameters
3034 The permitted values of SelectTypeParameters depend upon the
3035 configured value of SelectType. The only supported options for
3036 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3037 which treats memory as a consumable resource and prevents memory
3038 over subscription with job preemption or gang scheduling. By
3039 default SelectType=select/linear allocates whole nodes to jobs
3040 without considering their memory consumption. By default
3041 SelectType=select/cons_res, SelectType=select/cray, and Select‐
3042 Type=select/serial use CR_CPU, which allocates CPU (threads) to
3043 jobs without considering their memory consumption.
3044
3045 The following options are supported for SelectType=select/cray:
3046
3047 OTHER_CONS_RES
3048 Layer the select/cons_res plugin under the
3049 select/cray plugin, the default is to layer on
3050 select/linear. This also allows all the options
3051 available for SelectType=select/cons_res.
3052
3053 NHC_ABSOLUTELY_NO
3054 Never run the node health check. Implies NHC_NO
3055 and NHC_NO_STEPS as well.
3056
3057 NHC_NO_STEPS
3058 Do not run the node health check after each step.
3059 Default is to run after each step.
3060
3061 NHC_NO Do not run the node health check after each allo‐
3062 cation. Default is to run after each allocation.
3063 This also sets NHC_NO_STEPS, so the NHC will never
3064 run except when nodes have been left with unkill‐
3065 able steps.
3066
3067 The following options are supported by the Select‐
3068 Type=select/cons_res plugin:
3069
3070 CR_CPU CPUs are consumable resources. Configure the num‐
3071 ber of CPUs on each node, which may be equal to
3072 the count of cores or hyper-threads on the node
3073 depending upon the desired minimum resource allo‐
3074 cation. The node's Boards, Sockets, CoresPer‐
3075 Socket and ThreadsPerCore may optionally be con‐
3076 figured and result in job allocations which have
3077 improved locality; however doing so will prevent
3078 more than one job being from being allocated on
3079 each core.
3080
3081 CR_CPU_Memory
3082 CPUs and memory are consumable resources. Config‐
3083 ure the number of CPUs on each node, which may be
3084 equal to the count of cores or hyper-threads on
3085 the node depending upon the desired minimum
3086 resource allocation. The node's Boards, Sockets,
3087 CoresPerSocket and ThreadsPerCore may optionally
3088 be configured and result in job allocations which
3089 have improved locality; however doing so will pre‐
3090 vent more than one job being from being allocated
3091 on each core. Setting a value for DefMemPerCPU is
3092 strongly recommended.
3093
3094 CR_Core
3095 Cores are consumable resources. On nodes with
3096 hyper-threads, each thread is counted as a CPU to
3097 satisfy a job's resource requirement, but multiple
3098 jobs are not allocated threads on the same core.
3099 The count of CPUs allocated to a job may be
3100 rounded up to account for every CPU on an allo‐
3101 cated core.
3102
3103 CR_Core_Memory
3104 Cores and memory are consumable resources. On
3105 nodes with hyper-threads, each thread is counted
3106 as a CPU to satisfy a job's resource requirement,
3107 but multiple jobs are not allocated threads on the
3108 same core. The count of CPUs allocated to a job
3109 may be rounded up to account for every CPU on an
3110 allocated core. Setting a value for DefMemPerCPU
3111 is strongly recommended.
3112
3113 CR_ONE_TASK_PER_CORE
3114 Allocate one task per core by default. Without
3115 this option, by default one task will be allocated
3116 per thread on nodes with more than one ThreadsPer‐
3117 Core configured. NOTE: This option cannot be used
3118 with CR_CPU*.
3119
3120 CR_CORE_DEFAULT_DIST_BLOCK
3121 Allocate cores within a node using block distribu‐
3122 tion by default. This is a pseudo-best-fit algo‐
3123 rithm that minimizes the number of boards and min‐
3124 imizes the number of sockets (within minimum
3125 boards) used for the allocation. This default
3126 behavior can be overridden specifying a particular
3127 "-m" parameter with srun/salloc/sbatch. Without
3128 this option, cores will be allocated cyclicly
3129 across the sockets.
3130
3131 CR_LLN Schedule resources to jobs on the least loaded
3132 nodes (based upon the number of idle CPUs). This
3133 is generally only recommended for an environment
3134 with serial jobs as idle resources will tend to be
3135 highly fragmented, resulting in parallel jobs
3136 being distributed across many nodes. Note that
3137 node Weight takes precedence over how many idle
3138 resources are on each node. Also see the parti‐
3139 tion configuration parameter LLN use the least
3140 loaded nodes in selected partitions.
3141
3142 CR_Pack_Nodes
3143 If a job allocation contains more resources than
3144 will be used for launching tasks (e.g. if whole
3145 nodes are allocated to a job), then rather than
3146 distributing a job's tasks evenly across it's
3147 allocated nodes, pack them as tightly as possible
3148 on these nodes. For example, consider a job allo‐
3149 cation containing two entire nodes with eight CPUs
3150 each. If the job starts ten tasks across those
3151 two nodes without this option, it will start five
3152 tasks on each of the two nodes. With this option,
3153 eight tasks will be started on the first node and
3154 two tasks on the second node.
3155
3156 CR_Socket
3157 Sockets are consumable resources. On nodes with
3158 multiple cores, each core or thread is counted as
3159 a CPU to satisfy a job's resource requirement, but
3160 multiple jobs are not allocated resources on the
3161 same socket.
3162
3163 CR_Socket_Memory
3164 Memory and sockets are consumable resources. On
3165 nodes with multiple cores, each core or thread is
3166 counted as a CPU to satisfy a job's resource
3167 requirement, but multiple jobs are not allocated
3168 resources on the same socket. Setting a value for
3169 DefMemPerCPU is strongly recommended.
3170
3171 CR_Memory
3172 Memory is a consumable resource. NOTE: This
3173 implies OverSubscribe=YES or OverSubscribe=FORCE
3174 for all partitions. Setting a value for DefMem‐
3175 PerCPU is strongly recommended.
3176
3177
3178 SlurmUser
3179 The name of the user that the slurmctld daemon executes as. For
3180 security purposes, a user other than "root" is recommended.
3181 This user must exist on all nodes of the cluster for authentica‐
3182 tion of communications between Slurm components. The default
3183 value is "root".
3184
3185
3186 SlurmdParameters
3187 Parameters specific to the Slurmd. Multiple options may be
3188 comma separated.
3189
3190 shutdown_on_reboot
3191 If set, the Slurmd will shut itself down when a reboot
3192 request is received.
3193
3194
3195 SlurmdUser
3196 The name of the user that the slurmd daemon executes as. This
3197 user must exist on all nodes of the cluster for authentication
3198 of communications between Slurm components. The default value
3199 is "root".
3200
3201
3202 SlurmctldAddr
3203 An optional address to be used for communications to the cur‐
3204 rently active slurmctld daemon, normally used with Virtual IP
3205 addressing of the currently active server. If this parameter is
3206 not specified then each primary and backup server will have its
3207 own unique address used for communications as specified in the
3208 SlurmctldHost parameter. If this parameter is specified then
3209 the SlurmctldHost parameter will still be used for communica‐
3210 tions to specific slurmctld primary or backup servers, for exam‐
3211 ple to cause all of them to read the current configuration files
3212 or shutdown. Also see the SlurmctldPrimaryOffProg and Slurm‐
3213 ctldPrimaryOnProg configuration parameters to configure programs
3214 to manipulate virtual IP address manipulation.
3215
3216
3217 SlurmctldDebug
3218 The level of detail to provide slurmctld daemon's logs. The
3219 default value is info. If the slurmctld daemon is initiated
3220 with -v or --verbose options, that debug level will be preserve
3221 or restored upon reconfiguration.
3222
3223
3224 quiet Log nothing
3225
3226 fatal Log only fatal errors
3227
3228 error Log only errors
3229
3230 info Log errors and general informational messages
3231
3232 verbose Log errors and verbose informational messages
3233
3234 debug Log errors and verbose informational messages and
3235 debugging messages
3236
3237 debug2 Log errors and verbose informational messages and more
3238 debugging messages
3239
3240 debug3 Log errors and verbose informational messages and even
3241 more debugging messages
3242
3243 debug4 Log errors and verbose informational messages and even
3244 more debugging messages
3245
3246 debug5 Log errors and verbose informational messages and even
3247 more debugging messages
3248
3249
3250 SlurmctldHost
3251 The short, or long, hostname of the machine where Slurm control
3252 daemon is executed (i.e. the name returned by the command "host‐
3253 name -s"). This hostname is optionally followed by the address,
3254 either the IP address or a name by which the address can be
3255 identifed, enclosed in parentheses (e.g. SlurmctldHost=mas‐
3256 ter1(12.34.56.78)). This value must be specified at least once.
3257 If specified more than once, the first hostname named will be
3258 where the daemon runs. If the first specified host fails, the
3259 daemon will execute on the second host. If both the first and
3260 second specified host fails, the daemon will execute on the
3261 third host.
3262
3263
3264 SlurmctldLogFile
3265 Fully qualified pathname of a file into which the slurmctld dae‐
3266 mon's logs are written. The default value is none (performs
3267 logging via syslog).
3268 See the section LOGGING if a pathname is specified.
3269
3270
3271 SlurmctldParameters
3272 Multiple options may be comma-separated.
3273
3274
3275 allow_user_triggers
3276 Permit setting triggers from non-root/slurm_user users.
3277 SlurmUser must also be set to root to permit these trig‐
3278 gers to work. See the strigger man page for additional
3279 details.
3280
3281 cloud_dns
3282 By default, Slurm expects that the network addresses for
3283 cloud nodes won't won't be known until creation of the
3284 node and that Slurm will be notified of the node's
3285 address (e.g. scontrol update nodename=<name>
3286 nodeaddr=<addr>). Since Slurm communications rely on the
3287 node configuration found in the slurm.conf, Slurm will
3288 tell the client command, after waiting for all nodes to
3289 boot, each node's ip address. However, in environments
3290 where the nodes are in DNS, this step can be avoided by
3291 configuring this option.
3292
3293
3294 SlurmctldPidFile
3295 Fully qualified pathname of a file into which the slurmctld
3296 daemon may write its process id. This may be used for automated
3297 signal processing. The default value is "/var/run/slurm‐
3298 ctld.pid".
3299
3300
3301 SlurmctldPlugstack
3302 A comma delimited list of Slurm controller plugins to be started
3303 when the daemon begins and terminated when it ends. Only the
3304 plugin's init and fini functions are called.
3305
3306
3307 SlurmctldPort
3308 The port number that the Slurm controller, slurmctld, listens to
3309 for work. The default value is SLURMCTLD_PORT as established at
3310 system build time. If none is explicitly specified, it will be
3311 set to 6817. SlurmctldPort may also be configured to support a
3312 range of port numbers in order to accept larger bursts of incom‐
3313 ing messages by specifying two numbers separated by a dash (e.g.
3314 SlurmctldPort=6817-6818). NOTE: Either slurmctld and slurmd
3315 daemons must not execute on the same nodes or the values of
3316 SlurmctldPort and SlurmdPort must be different.
3317
3318 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3319 automatically try to interact with anything opened on ports
3320 8192-60000. Configure SlurmctldPort to use a port outside of
3321 the configured SrunPortRange and RSIP's port range.
3322
3323
3324 SlurmctldPrimaryOffProg
3325 This program is executed when a slurmctld daemon running as the
3326 primary server becomes a backup server. By default no program is
3327 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3328 ter.
3329
3330
3331 SlurmctldPrimaryOnProg
3332 This program is executed when a slurmctld daemon running as a
3333 backup server becomes the primary server. By default no program
3334 is executed. When using virtual IP addresses to manage High
3335 Available Slurm services, this program can be used to add the IP
3336 address to an interface (and optionally try to kill the unre‐
3337 sponsive slurmctld daemon and flush the ARP caches on nodes on
3338 the local ethernet fabric). See also the related "SlurmctldPri‐
3339 maryOffProg" parameter.
3340
3341 SlurmctldSyslogDebug
3342 The slurmctld daemon will log events to the syslog file at the
3343 specified level of detail. If not set, the slurmctld daemon will
3344 log to syslog at level fatal, unless there is no SlurmctldLog‐
3345 File and it is running in the background, in which case it will
3346 log to syslog at the level specified by SlurmctldDebug (at fatal
3347 in the case that SlurmctldDebug is set to quiet) or it is run in
3348 the foreground, when it will be set to quiet.
3349
3350
3351 quiet Log nothing
3352
3353 fatal Log only fatal errors
3354
3355 error Log only errors
3356
3357 info Log errors and general informational messages
3358
3359 verbose Log errors and verbose informational messages
3360
3361 debug Log errors and verbose informational messages and
3362 debugging messages
3363
3364 debug2 Log errors and verbose informational messages and more
3365 debugging messages
3366
3367 debug3 Log errors and verbose informational messages and even
3368 more debugging messages
3369
3370 debug4 Log errors and verbose informational messages and even
3371 more debugging messages
3372
3373 debug5 Log errors and verbose informational messages and even
3374 more debugging messages
3375
3376
3377
3378 SlurmctldTimeout
3379 The interval, in seconds, that the backup controller waits for
3380 the primary controller to respond before assuming control. The
3381 default value is 120 seconds. May not exceed 65533.
3382
3383
3384 SlurmdDebug
3385 The level of detail to provide slurmd daemon's logs. The
3386 default value is info.
3387
3388 quiet Log nothing
3389
3390 fatal Log only fatal errors
3391
3392 error Log only errors
3393
3394 info Log errors and general informational messages
3395
3396 verbose Log errors and verbose informational messages
3397
3398 debug Log errors and verbose informational messages and
3399 debugging messages
3400
3401 debug2 Log errors and verbose informational messages and more
3402 debugging messages
3403
3404 debug3 Log errors and verbose informational messages and even
3405 more debugging messages
3406
3407 debug4 Log errors and verbose informational messages and even
3408 more debugging messages
3409
3410 debug5 Log errors and verbose informational messages and even
3411 more debugging messages
3412
3413
3414 SlurmdLogFile
3415 Fully qualified pathname of a file into which the slurmd dae‐
3416 mon's logs are written. The default value is none (performs
3417 logging via syslog). Any "%h" within the name is replaced with
3418 the hostname on which the slurmd is running. Any "%n" within
3419 the name is replaced with the Slurm node name on which the
3420 slurmd is running.
3421 See the section LOGGING if a pathname is specified.
3422
3423
3424 SlurmdPidFile
3425 Fully qualified pathname of a file into which the slurmd daemon
3426 may write its process id. This may be used for automated signal
3427 processing. Any "%h" within the name is replaced with the host‐
3428 name on which the slurmd is running. Any "%n" within the name
3429 is replaced with the Slurm node name on which the slurmd is run‐
3430 ning. The default value is "/var/run/slurmd.pid".
3431
3432
3433 SlurmdPort
3434 The port number that the Slurm compute node daemon, slurmd, lis‐
3435 tens to for work. The default value is SLURMD_PORT as estab‐
3436 lished at system build time. If none is explicitly specified,
3437 its value will be 6818. NOTE: Either slurmctld and slurmd dae‐
3438 mons must not execute on the same nodes or the values of Slurm‐
3439 ctldPort and SlurmdPort must be different.
3440
3441 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3442 automatically try to interact with anything opened on ports
3443 8192-60000. Configure SlurmdPort to use a port outside of the
3444 configured SrunPortRange and RSIP's port range.
3445
3446
3447 SlurmdSpoolDir
3448 Fully qualified pathname of a directory into which the slurmd
3449 daemon's state information and batch job script information are
3450 written. This must be a common pathname for all nodes, but
3451 should represent a directory which is local to each node (refer‐
3452 ence a local file system). The default value is
3453 "/var/spool/slurmd". Any "%h" within the name is replaced with
3454 the hostname on which the slurmd is running. Any "%n" within
3455 the name is replaced with the Slurm node name on which the
3456 slurmd is running.
3457
3458
3459 SlurmdSyslogDebug
3460 The slurmd daemon will log events to the syslog file at the
3461 specified level of detail. If not set, the slurmd daemon will
3462 log to syslog at level fatal, unless there is no SlurmdLogFile
3463 and it is running in the background, in which case it will log
3464 to syslog at the level specified by SlurmdDebug (at fatal in
3465 the case that SlurmdDebug is set to quiet) or it is run in the
3466 foreground, when it will be set to quiet.
3467
3468
3469 quiet Log nothing
3470
3471 fatal Log only fatal errors
3472
3473 error Log only errors
3474
3475 info Log errors and general informational messages
3476
3477 verbose Log errors and verbose informational messages
3478
3479 debug Log errors and verbose informational messages and
3480 debugging messages
3481
3482 debug2 Log errors and verbose informational messages and more
3483 debugging messages
3484
3485 debug3 Log errors and verbose informational messages and even
3486 more debugging messages
3487
3488 debug4 Log errors and verbose informational messages and even
3489 more debugging messages
3490
3491 debug5 Log errors and verbose informational messages and even
3492 more debugging messages
3493
3494
3495 SlurmdTimeout
3496 The interval, in seconds, that the Slurm controller waits for
3497 slurmd to respond before configuring that node's state to DOWN.
3498 A value of zero indicates the node will not be tested by slurm‐
3499 ctld to confirm the state of slurmd, the node will not be auto‐
3500 matically set to a DOWN state indicating a non-responsive
3501 slurmd, and some other tool will take responsibility for moni‐
3502 toring the state of each compute node and its slurmd daemon.
3503 Slurm's hierarchical communication mechanism is used to ping the
3504 slurmd daemons in order to minimize system noise and overhead.
3505 The default value is 300 seconds. The value may not exceed
3506 65533 seconds.
3507
3508
3509 SlurmSchedLogFile
3510 Fully qualified pathname of the scheduling event logging file.
3511 The syntax of this parameter is the same as for SlurmctldLog‐
3512 File. In order to configure scheduler logging, set both the
3513 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3514
3515
3516 SlurmSchedLogLevel
3517 The initial level of scheduling event logging, similar to the
3518 SlurmctldDebug parameter used to control the initial level of
3519 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3520 (scheduler logging disabled) and "1" (scheduler logging
3521 enabled). If this parameter is omitted, the value defaults to
3522 "0" (disabled). In order to configure scheduler logging, set
3523 both the SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3524 The scheduler logging level can be changed dynamically using
3525 scontrol.
3526
3527
3528 SrunEpilog
3529 Fully qualified pathname of an executable to be run by srun fol‐
3530 lowing the completion of a job step. The command line arguments
3531 for the executable will be the command and arguments of the job
3532 step. This configuration parameter may be overridden by srun's
3533 --epilog parameter. Note that while the other "Epilog" executa‐
3534 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
3535 where the tasks are executed, the SrunEpilog runs on the node
3536 where the "srun" is executing.
3537
3538
3539 SrunPortRange
3540 The srun creates a set of listening ports to communicate with
3541 the controller, the slurmstepd and to handle the application
3542 I/O. By default these ports are ephemeral meaning the port num‐
3543 bers are selected by the kernel. Using this parameter allow
3544 sites to configure a range of ports from which srun ports will
3545 be selected. This is useful if sites want to allow only certain
3546 port range on their network.
3547
3548 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3549 automatically try to interact with anything opened on ports
3550 8192-60000. Configure SrunPortRange to use a range of ports
3551 above those used by RSIP, ideally 1000 or more ports, for exam‐
3552 ple "SrunPortRange=60001-63000".
3553
3554 Note: A sufficient number of ports must be configured based on
3555 the estimated number of srun on the submission nodes considering
3556 that srun opens 3 listening ports plus 2 more for every 48
3557 hosts. Example:
3558
3559 srun -N 48 will use 5 listening ports.
3560
3561
3562 srun -N 50 will use 7 listening ports.
3563
3564
3565 srun -N 200 will use 13 listening ports.
3566
3567
3568 SrunProlog
3569 Fully qualified pathname of an executable to be run by srun
3570 prior to the launch of a job step. The command line arguments
3571 for the executable will be the command and arguments of the job
3572 step. This configuration parameter may be overridden by srun's
3573 --prolog parameter. Note that while the other "Prolog" executa‐
3574 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
3575 where the tasks are executed, the SrunProlog runs on the node
3576 where the "srun" is executing.
3577
3578
3579 StateSaveLocation
3580 Fully qualified pathname of a directory into which the Slurm
3581 controller, slurmctld, saves its state (e.g.
3582 "/usr/local/slurm/checkpoint"). Slurm state will saved here to
3583 recover from system failures. SlurmUser must be able to create
3584 files in this directory. If you have a BackupController config‐
3585 ured, this location should be readable and writable by both sys‐
3586 tems. Since all running and pending job information is stored
3587 here, the use of a reliable file system (e.g. RAID) is recom‐
3588 mended. The default value is "/var/spool". If any slurm dae‐
3589 mons terminate abnormally, their core files will also be written
3590 into this directory.
3591
3592
3593 SuspendExcNodes
3594 Specifies the nodes which are to not be placed in power save
3595 mode, even if the node remains idle for an extended period of
3596 time. Use Slurm's hostlist expression to identify nodes with an
3597 optional ":" separator and count of nodes to exclude from the
3598 preceding range. For example "nid[10-20]:4" will prevent 4
3599 usable nodes (i.e IDLE and not DOWN, DRAINING or already powered
3600 down) in the set "nid[10-20]" from being powered down. Multiple
3601 sets of nodes can be specified with or without counts in a comma
3602 separated list (e.g "nid[10-20]:4,nid[80-90]:2"). If a node
3603 count specification is given, any list of nodes to NOT have a
3604 node count must be after the last specification with a count.
3605 For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes in
3606 the set "nid[10-20]:4" plus all nodes in the set "nid[60-70]"
3607 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
3608 "nid[1-3],nid[10-20]". By default no nodes are excluded.
3609 Related configuration options include ResumeTimeout, ResumePro‐
3610 gram, ResumeRate, SuspendProgram, SuspendRate, SuspendTime, Sus‐
3611 pendTimeout, and SuspendExcParts.
3612
3613
3614 SuspendExcParts
3615 Specifies the partitions whose nodes are to not be placed in
3616 power save mode, even if the node remains idle for an extended
3617 period of time. Multiple partitions can be identified and sepa‐
3618 rated by commas. By default no nodes are excluded. Related
3619 configuration options include ResumeTimeout, ResumeProgram,
3620 ResumeRate, SuspendProgram, SuspendRate, SuspendTime Suspend‐
3621 Timeout, and SuspendExcNodes.
3622
3623
3624 SuspendProgram
3625 SuspendProgram is the program that will be executed when a node
3626 remains idle for an extended period of time. This program is
3627 expected to place the node into some power save mode. This can
3628 be used to reduce the frequency and voltage of a node or com‐
3629 pletely power the node off. The program executes as SlurmUser.
3630 The argument to the program will be the names of nodes to be
3631 placed into power savings mode (using Slurm's hostlist expres‐
3632 sion format). By default, no program is run. Related configu‐
3633 ration options include ResumeTimeout, ResumeProgram, ResumeRate,
3634 SuspendRate, SuspendTime, SuspendTimeout, SuspendExcNodes, and
3635 SuspendExcParts.
3636
3637
3638 SuspendRate
3639 The rate at which nodes are placed into power save mode by Sus‐
3640 pendProgram. The value is number of nodes per minute and it can
3641 be used to prevent a large drop in power consumption (e.g. after
3642 a large job completes). A value of zero results in no limits
3643 being imposed. The default value is 60 nodes per minute.
3644 Related configuration options include ResumeTimeout, ResumePro‐
3645 gram, ResumeRate, SuspendProgram, SuspendTime, SuspendTimeout,
3646 SuspendExcNodes, and SuspendExcParts.
3647
3648
3649 SuspendTime
3650 Nodes which remain idle for this number of seconds will be
3651 placed into power save mode by SuspendProgram. For efficient
3652 system utilization, it is recommended that the value of Suspend‐
3653 Time be at least as large as the sum of SuspendTimeout plus
3654 ResumeTimeout. A value of -1 disables power save mode and is
3655 the default. Related configuration options include ResumeTime‐
3656 out, ResumeProgram, ResumeRate, SuspendProgram, SuspendRate,
3657 SuspendTimeout, SuspendExcNodes, and SuspendExcParts.
3658
3659
3660 SuspendTimeout
3661 Maximum time permitted (in seconds) between when a node suspend
3662 request is issued and when the node is shutdown. At that time
3663 the node must be ready for a resume request to be issued as
3664 needed for new work. The default value is 30 seconds. Related
3665 configuration options include ResumeProgram, ResumeRate, Resume‐
3666 Timeout, SuspendRate, SuspendTime, SuspendProgram, SuspendExcN‐
3667 odes and SuspendExcParts. More information is available at the
3668 Slurm web site ( https://slurm.schedmd.com/power_save.html ).
3669
3670
3671 SwitchType
3672 Identifies the type of switch or interconnect used for applica‐
3673 tion communications. Acceptable values include "switch/cray"
3674 for Cray systems, "switch/none" for switches not requiring spe‐
3675 cial processing for job launch or termination (Ethernet, and
3676 InfiniBand) and The default value is "switch/none". All Slurm
3677 daemons, commands and running jobs must be restarted for a
3678 change in SwitchType to take effect. If running jobs exist at
3679 the time slurmctld is restarted with a new value of SwitchType,
3680 records of all jobs in any state may be lost.
3681
3682
3683 TaskEpilog
3684 Fully qualified pathname of a program to be execute as the slurm
3685 job's owner after termination of each task. See TaskProlog for
3686 execution order details.
3687
3688
3689 TaskPlugin
3690 Identifies the type of task launch plugin, typically used to
3691 provide resource management within a node (e.g. pinning tasks to
3692 specific processors). More than one task plugin can be specified
3693 in a comma separated list. The prefix of "task/" is optional.
3694 Acceptable values include:
3695
3696 task/affinity enables resource containment using CPUSETs. This
3697 enables the --cpu-bind and/or --mem-bind srun
3698 options. If you use "task/affinity" and
3699 encounter problems, it may be due to the variety
3700 of system calls used to implement task affinity
3701 on different operating systems.
3702
3703 task/cgroup enables resource containment using Linux control
3704 cgroups. This enables the --cpu-bind and/or
3705 --mem-bind srun options. NOTE: see "man
3706 cgroup.conf" for configuration details.
3707
3708 task/none for systems requiring no special handling of user
3709 tasks. Lacks support for the --cpu-bind and/or
3710 --mem-bind srun options. The default value is
3711 "task/none".
3712
3713 NOTE: It is recommended to stack task/affinity,task/cgroup together
3714 when configuring TaskPlugin, and setting TaskAffinity=no and Constrain‐
3715 Cores=yes in cgroup.conf. This setup uses the task/affinity plugin for
3716 setting the affinity of the tasks (which is better and different than
3717 task/cgroup) and uses the task/cgroup plugin to fence tasks into the
3718 specified resources, thus combining the best of both pieces.
3719
3720 NOTE: For CRAY systems only: task/cgroup must be used with, and listed
3721 after task/cray in TaskPlugin. The task/affinity plugin can be listed
3722 everywhere, but the previous constrain must be satisfied. So for CRAY
3723 systems, a configuration like this is recommended:
3724
3725 TaskPlugin=task/affinity,task/cray,task/cgroup
3726
3727
3728 TaskPluginParam
3729 Optional parameters for the task plugin. Multiple options
3730 should be comma separated. If None, Boards, Sockets, Cores,
3731 Threads, and/or Verbose are specified, they will override the
3732 --cpu-bind option specified by the user in the srun command.
3733 None, Boards, Sockets, Cores and Threads are mutually exclusive
3734 and since they decrease scheduling flexibility are not generally
3735 recommended (select no more than one of them). Cpusets and
3736 Sched are mutually exclusive (select only one of them). All
3737 TaskPluginParam options are supported on FreeBSD except Cpusets.
3738 The Sched option uses cpuset_setaffinity() on FreeBSD, not
3739 sched_setaffinity().
3740
3741
3742 Boards Bind tasks to boards by default. Overrides automatic
3743 binding.
3744
3745 Cores Bind tasks to cores by default. Overrides automatic
3746 binding.
3747
3748 Cpusets Use cpusets to perform task affinity functions. By
3749 default, Sched task binding is performed.
3750
3751 None Perform no task binding by default. Overrides auto‐
3752 matic binding.
3753
3754 Sched Use sched_setaffinity (if available) to bind tasks to
3755 processors.
3756
3757 Sockets Bind to sockets by default. Overrides automatic bind‐
3758 ing.
3759
3760 Threads Bind to threads by default. Overrides automatic bind‐
3761 ing.
3762
3763 SlurmdOffSpec
3764 If specialized cores or CPUs are identified for the
3765 node (i.e. the CoreSpecCount or CpuSpecList are con‐
3766 figured for the node), then Slurm daemons running on
3767 the compute node (i.e. slurmd and slurmstepd) should
3768 run outside of those resources (i.e. specialized
3769 resources are completely unavailable to Slurm daemons
3770 and jobs spawned by Slurm). This option may not be
3771 used with the task/cray plugin.
3772
3773 Verbose Verbosely report binding before tasks run. Overrides
3774 user options.
3775
3776 Autobind Set a default binding in the event that "auto binding"
3777 doesn't find a match. Set to Threads, Cores or Sock‐
3778 ets (E.g. TaskPluginParam=autobind=threads).
3779
3780
3781 TaskProlog
3782 Fully qualified pathname of a program to be execute as the slurm
3783 job's owner prior to initiation of each task. Besides the nor‐
3784 mal environment variables, this has SLURM_TASK_PID available to
3785 identify the process ID of the task being started. Standard
3786 output from this program can be used to control the environment
3787 variables and output for the user program.
3788
3789 export NAME=value Will set environment variables for the task
3790 being spawned. Everything after the equal
3791 sign to the end of the line will be used as
3792 the value for the environment variable.
3793 Exporting of functions is not currently sup‐
3794 ported.
3795
3796 print ... Will cause that line (without the leading
3797 "print ") to be printed to the job's stan‐
3798 dard output.
3799
3800 unset NAME Will clear environment variables for the
3801 task being spawned.
3802
3803 The order of task prolog/epilog execution is as follows:
3804
3805 1. pre_launch_priv()
3806 Function in TaskPlugin
3807
3808 1. pre_launch() Function in TaskPlugin
3809
3810 2. TaskProlog System-wide per task program defined in
3811 slurm.conf
3812
3813 3. user prolog Job step specific task program defined using
3814 srun's --task-prolog option or
3815 SLURM_TASK_PROLOG environment variable
3816
3817 4. Execute the job step's task
3818
3819 5. user epilog Job step specific task program defined using
3820 srun's --task-epilog option or
3821 SLURM_TASK_EPILOG environment variable
3822
3823 6. TaskEpilog System-wide per task program defined in
3824 slurm.conf
3825
3826 7. post_term() Function in TaskPlugin
3827
3828
3829 TCPTimeout
3830 Time permitted for TCP connection to be established. Default
3831 value is 2 seconds.
3832
3833
3834 TmpFS Fully qualified pathname of the file system available to user
3835 jobs for temporary storage. This parameter is used in establish‐
3836 ing a node's TmpDisk space. The default value is "/tmp".
3837
3838
3839 TopologyParam
3840 Comma separated options identifying network topology options.
3841
3842 Dragonfly Optimize allocation for Dragonfly network. Valid
3843 when TopologyPlugin=topology/tree.
3844
3845 TopoOptional Only optimize allocation for network topology if
3846 the job includes a switch option. Since optimiz‐
3847 ing resource allocation for topology involves
3848 much higher system overhead, this option can be
3849 used to impose the extra overhead only on jobs
3850 which can take advantage of it. If most job allo‐
3851 cations are not optimized for network topology,
3852 they make fragment resources to the point that
3853 topology optimization for other jobs will be dif‐
3854 ficult to achieve. NOTE: Jobs may span across
3855 nodes without common parent switches with this
3856 enabled.
3857
3858
3859 TopologyPlugin
3860 Identifies the plugin to be used for determining the network
3861 topology and optimizing job allocations to minimize network con‐
3862 tention. See NETWORK TOPOLOGY below for details. Additional
3863 plugins may be provided in the future which gather topology
3864 information directly from the network. Acceptable values
3865 include:
3866
3867 topology/3d_torus best-fit logic over three-dimensional
3868 topology
3869
3870 topology/node_rank orders nodes based upon information a
3871 node_rank field in the node record as gen‐
3872 erated by a select plugin. Slurm performs a
3873 best-fit algorithm over those ordered nodes
3874
3875 topology/none default for other systems, best-fit logic
3876 over one-dimensional topology
3877
3878 topology/tree used for a hierarchical network as
3879 described in a topology.conf file
3880
3881
3882 TrackWCKey
3883 Boolean yes or no. Used to set display and track of the Work‐
3884 load Characterization Key. Must be set to track correct wckey
3885 usage. NOTE: You must also set TrackWCKey in your slurmdbd.conf
3886 file to create historical usage reports.
3887
3888
3889 TreeWidth
3890 Slurmd daemons use a virtual tree network for communications.
3891 TreeWidth specifies the width of the tree (i.e. the fanout). On
3892 architectures with a front end node running the slurmd daemon,
3893 the value must always be equal to or greater than the number of
3894 front end nodes which eliminates the need for message forwarding
3895 between the slurmd daemons. On other architectures the default
3896 value is 50, meaning each slurmd daemon can communicate with up
3897 to 50 other slurmd daemons and over 2500 nodes can be contacted
3898 with two message hops. The default value will work well for
3899 most clusters. Optimal system performance can typically be
3900 achieved if TreeWidth is set to the square root of the number of
3901 nodes in the cluster for systems having no more than 2500 nodes
3902 or the cube root for larger systems. The value may not exceed
3903 65533.
3904
3905
3906 UnkillableStepProgram
3907 If the processes in a job step are determined to be unkillable
3908 for a period of time specified by the UnkillableStepTimeout
3909 variable, the program specified by UnkillableStepProgram will be
3910 executed. This program can be used to take special actions to
3911 clean up the unkillable processes and/or notify computer admin‐
3912 istrators. The program will be run SlurmdUser (usually "root")
3913 on the compute node. By default no program is run.
3914
3915
3916 UnkillableStepTimeout
3917 The length of time, in seconds, that Slurm will wait before
3918 deciding that processes in a job step are unkillable (after they
3919 have been signaled with SIGKILL) and execute UnkillableStepPro‐
3920 gram as described above. The default timeout value is 60 sec‐
3921 onds. If exceeded, the compute node will be drained to prevent
3922 future jobs from being scheduled on the node.
3923
3924
3925 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
3926 will be enabled. PAM is used to establish the upper bounds for
3927 resource limits. With PAM support enabled, local system adminis‐
3928 trators can dynamically configure system resource limits. Chang‐
3929 ing the upper bound of a resource limit will not alter the lim‐
3930 its of running jobs, only jobs started after a change has been
3931 made will pick up the new limits. The default value is 0 (not
3932 to enable PAM support). Remember that PAM also needs to be con‐
3933 figured to support Slurm as a service. For sites using PAM's
3934 directory based configuration option, a configuration file named
3935 slurm should be created. The module-type, control-flags, and
3936 module-path names that should be included in the file are:
3937 auth required pam_localuser.so
3938 auth required pam_shells.so
3939 account required pam_unix.so
3940 account required pam_access.so
3941 session required pam_unix.so
3942 For sites configuring PAM with a general configuration file, the
3943 appropriate lines (see above), where slurm is the service-name,
3944 should be added.
3945
3946 NOTE: UsePAM option has nothing to do with the con‐
3947 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules. So
3948 these two modules can work independently of the value set for
3949 UsePAM.
3950
3951
3952 VSizeFactor
3953 Memory specifications in job requests apply to real memory size
3954 (also known as resident set size). It is possible to enforce
3955 virtual memory limits for both jobs and job steps by limiting
3956 their virtual memory to some percentage of their real memory
3957 allocation. The VSizeFactor parameter specifies the job's or job
3958 step's virtual memory limit as a percentage of its real memory
3959 limit. For example, if a job's real memory limit is 500MB and
3960 VSizeFactor is set to 101 then the job will be killed if its
3961 real memory exceeds 500MB or its virtual memory exceeds 505MB
3962 (101 percent of the real memory limit). The default value is 0,
3963 which disables enforcement of virtual memory limits. The value
3964 may not exceed 65533 percent.
3965
3966
3967 WaitTime
3968 Specifies how many seconds the srun command should by default
3969 wait after the first task terminates before terminating all
3970 remaining tasks. The "--wait" option on the srun command line
3971 overrides this value. The default value is 0, which disables
3972 this feature. May not exceed 65533 seconds.
3973
3974
3975 X11Parameters
3976 For use with Slurm's built-in X11 forwarding implementation.
3977
3978 local_xauthority
3979 If set, xauth data on the compute node will be placed in
3980 a temporary file (under TmpFS) rather than in ~/.Xau‐
3981 thority, and the XAUTHORITY environment variable will be
3982 injected into the job's environment (as well as any
3983 process captured by pam_slurm_adopt). This can help
3984 avoid file locking contention on the user's home direc‐
3985 tory.
3986
3987 use_raw_hostname
3988 If set, xauth hostname will use the raw value of geth‐
3989 ostname() instead of the local part-only (as is used
3990 elsewhere within Slurm).
3991
3992
3993 The configuration of nodes (or machines) to be managed by Slurm is also
3994 specified in /etc/slurm.conf. Changes in node configuration (e.g.
3995 adding nodes, changing their processor count, etc.) require restarting
3996 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
3997 must know each node in the system to forward messages in support of
3998 hierarchical communications. Only the NodeName must be supplied in the
3999 configuration file. All other node configuration information is
4000 optional. It is advisable to establish baseline node configurations,
4001 especially if the cluster is heterogeneous. Nodes which register to
4002 the system with less than the configured resources (e.g. too little
4003 memory), will be placed in the "DOWN" state to avoid scheduling jobs on
4004 them. Establishing baseline configurations will also speed Slurm's
4005 scheduling process by permitting it to compare job requirements against
4006 these (relatively few) configuration parameters and possibly avoid hav‐
4007 ing to check job requirements against every individual node's configu‐
4008 ration. The resources checked at node registration time are: CPUs,
4009 RealMemory and TmpDisk. While baseline values for each of these can be
4010 established in the configuration file, the actual values upon node reg‐
4011 istration are recorded and these actual values may be used for schedul‐
4012 ing purposes (depending upon the value of FastSchedule in the configu‐
4013 ration file.
4014
4015 Default values can be specified with a record in which NodeName is
4016 "DEFAULT". The default entry values will apply only to lines following
4017 it in the configuration file and the default values can be reset multi‐
4018 ple times in the configuration file with multiple entries where "Node‐
4019 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4020 add to previous default values and not a reinitialize the default val‐
4021 ues. The "NodeName=" specification must be placed on every line
4022 describing the configuration of nodes. A single node name can not
4023 appear as a NodeName value in more than one line (duplicate node name
4024 records will be ignored). In fact, it is generally possible and desir‐
4025 able to define the configurations of all nodes in only a few lines.
4026 This convention permits significant optimization in the scheduling of
4027 larger clusters. In order to support the concept of jobs requiring
4028 consecutive nodes on some architectures, node specifications should be
4029 place in this file in consecutive order. No single node name may be
4030 listed more than once in the configuration file. Use "DownNodes=" to
4031 record the state of nodes which are temporarily in a DOWN, DRAIN or
4032 FAILING state without altering permanent configuration information. A
4033 job step's tasks are allocated to nodes in order the nodes appear in
4034 the configuration file. There is presently no capability within Slurm
4035 to arbitrarily order a job step's tasks.
4036
4037 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4038 and/or a simple node range expression may optionally be used to specify
4039 numeric ranges of nodes to avoid building a configuration file with
4040 large numbers of entries. The node range expression can contain one
4041 pair of square brackets with a sequence of comma separated numbers
4042 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4043 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4044 more leading zeros to indicate the numeric portion has a fixed number
4045 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4046 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4047 more numeric expressions are included, one of them must be at the end
4048 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4049 always be used in a comma separated list.
4050
4051 The node configuration specified the following information:
4052
4053
4054 NodeName
4055 Name that Slurm uses to refer to a node. Typically this would
4056 be the string that "/bin/hostname -s" returns. It may also be
4057 the fully qualified domain name as returned by "/bin/hostname
4058 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4059 with the host through the host database (/etc/hosts) or DNS,
4060 depending on the resolver settings. Note that if the short form
4061 of the hostname is not used, it may prevent use of hostlist
4062 expressions (the numeric portion in brackets must be at the end
4063 of the string). It may also be an arbitrary string if NodeHost‐
4064 name is specified. If the NodeName is "DEFAULT", the values
4065 specified with that record will apply to subsequent node speci‐
4066 fications unless explicitly set to other values in that node
4067 record or replaced with a different set of default values. Each
4068 line where NodeName is "DEFAULT" will replace or add to previous
4069 default values and not a reinitialize the default values. For
4070 architectures in which the node order is significant, nodes will
4071 be considered consecutive in the order defined. For example, if
4072 the configuration for "NodeName=charlie" immediately follows the
4073 configuration for "NodeName=baker" they will be considered adja‐
4074 cent in the computer.
4075
4076
4077 NodeHostname
4078 Typically this would be the string that "/bin/hostname -s"
4079 returns. It may also be the fully qualified domain name as
4080 returned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any
4081 valid domain name associated with the host through the host
4082 database (/etc/hosts) or DNS, depending on the resolver set‐
4083 tings. Note that if the short form of the hostname is not used,
4084 it may prevent use of hostlist expressions (the numeric portion
4085 in brackets must be at the end of the string). A node range
4086 expression can be used to specify a set of nodes. If an expres‐
4087 sion is used, the number of nodes identified by NodeHostname on
4088 a line in the configuration file must be identical to the number
4089 of nodes identified by NodeName. By default, the NodeHostname
4090 will be identical in value to NodeName.
4091
4092
4093 NodeAddr
4094 Name that a node should be referred to in establishing a commu‐
4095 nications path. This name will be used as an argument to the
4096 gethostbyname() function for identification. If a node range
4097 expression is used to designate multiple nodes, they must
4098 exactly match the entries in the NodeName (e.g. "Node‐
4099 Name=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also contain IP
4100 addresses. By default, the NodeAddr will be identical in value
4101 to NodeHostname.
4102
4103
4104 Boards Number of Baseboards in nodes with a baseboard controller. Note
4105 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4106 and ThreadsPerCore should be specified. Boards and CPUs are
4107 mutually exclusive. The default value is 1.
4108
4109
4110 CoreSpecCount
4111 Number of cores reserved for system use. These cores will not
4112 be available for allocation to user jobs. Depending upon the
4113 TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e.
4114 slurmd and slurmstepd) may either be confined to these resources
4115 (the default) or prevented from using these resources. Isola‐
4116 tion of the Slurm daemons from user jobs may improve application
4117 performance. If this option and CpuSpecList are both designated
4118 for a node, an error is generated. For information on the algo‐
4119 rithm used by Slurm to select the cores refer to the core spe‐
4120 cialization documentation (
4121 https://slurm.schedmd.com/core_spec.html ).
4122
4123
4124 CoresPerSocket
4125 Number of cores in a single physical processor socket (e.g.
4126 "2"). The CoresPerSocket value describes physical cores, not
4127 the logical number of processors per socket. NOTE: If you have
4128 multi-core processors, you will likely need to specify this
4129 parameter in order to optimize scheduling. The default value is
4130 1.
4131
4132
4133 CpuBind
4134 If a job step request does not specify an option to control how
4135 tasks are bound to allocated CPUs (--cpu-bind) and all nodes
4136 allocated to the job have the same CpuBind option the node Cpu‐
4137 Bind option will control how tasks are bound to allocated
4138 resources. Supported values for CpuBind are "none", "board",
4139 "socket", "ldom" (NUMA), "core" and "thread".
4140
4141
4142 CPUs Number of logical processors on the node (e.g. "2"). CPUs and
4143 Boards are mutually exclusive. It can be set to the total number
4144 of sockets, cores or threads. This can be useful when you want
4145 to schedule only the cores on a hyper-threaded node. If CPUs is
4146 omitted, it will be set equal to the product of Sockets, Cores‐
4147 PerSocket, and ThreadsPerCore. The default value is 1.
4148
4149
4150 CpuSpecList
4151 A comma delimited list of Slurm abstract CPU IDs reserved for
4152 system use. The list will be expanded to include all other
4153 CPUs, if any, on the same cores. These cores will not be avail‐
4154 able for allocation to user jobs. Depending upon the TaskPlug‐
4155 inParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and
4156 slurmstepd) may either be confined to these resources (the
4157 default) or prevented from using these resources. Isolation of
4158 the Slurm daemons from user jobs may improve application perfor‐
4159 mance. If this option and CoreSpecCount are both designated for
4160 a node, an error is generated. This option has no effect unless
4161 cgroup job confinement is also configured (TaskPlu‐
4162 gin=task/cgroup with ConstrainCores=yes in cgroup.conf).
4163
4164
4165 Feature
4166 A comma delimited list of arbitrary strings indicative of some
4167 characteristic associated with the node. There is no value
4168 associated with a feature at this time, a node either has a fea‐
4169 ture or it does not. If desired a feature may contain a numeric
4170 component indicating, for example, processor speed. By default
4171 a node has no features. Also see Gres.
4172
4173
4174 Gres A comma delimited list of generic resources specifications for a
4175 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4176 ber>[K|M|G]". The first field is the resource name, which
4177 matches the GresType configuration parameter name. The optional
4178 type field might be used to identify a model of that generic
4179 resource. A generic resource can also be specified as non-con‐
4180 sumable (i.e. multiple jobs can use the same generic resource)
4181 with the optional field ":no_consume". The final field must
4182 specify a generic resources count. A suffix of "K", "M", "G",
4183 "T" or "P" may be used to multiply the number by 1024, 1048576,
4184 1073741824, etc. respectively.
4185 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4186 sume:4G"). By default a node has no generic resources and its
4187 maximum count is that of an unsigned 64bit integer. Also see
4188 Feature.
4189
4190
4191 MemSpecLimit
4192 Amount of memory, in megabytes, reserved for system use and not
4193 available for user allocations. If the task/cgroup plugin is
4194 configured and that plugin constrains memory allocations (i.e.
4195 TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes
4196 in cgroup.conf), then Slurm compute node daemons (slurmd plus
4197 slurmstepd) will be allocated the specified memory limit. Note
4198 that having the Memory set in SelectTypeParameters as any of the
4199 options that has it as a consumable resource is needed for this
4200 option to work. The daemons will not be killed if they exhaust
4201 the memory allocation (ie. the Out-Of-Memory Killer is disabled
4202 for the daemon's memory cgroup). If the task/cgroup plugin is
4203 not configured, the specified memory will only be unavailable
4204 for user allocations.
4205
4206
4207 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4208 tens to for work on this particular node. By default there is a
4209 single port number for all slurmd daemons on all compute nodes
4210 as defined by the SlurmdPort configuration parameter. Use of
4211 this option is not generally recommended except for development
4212 or testing purposes. If multiple slurmd daemons execute on a
4213 node this can specify a range of ports.
4214
4215 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4216 automatically try to interact with anything opened on ports
4217 8192-60000. Configure Port to use a port outside of the config‐
4218 ured SrunPortRange and RSIP's port range.
4219
4220
4221 Procs See CPUs.
4222
4223
4224 RealMemory
4225 Size of real memory on the node in megabytes (e.g. "2048"). The
4226 default value is 1. Lowering RealMemory with the goal of setting
4227 aside some amount for the OS and not available for job alloca‐
4228 tions will not work as intended if Memory is not set as a con‐
4229 sumable resource in SelectTypeParameters. So one of the *_Memory
4230 options need to be enabled for that goal to be accomplished.
4231 Also see MemSpecLimit.
4232
4233
4234 Reason Identifies the reason for a node being in state "DOWN",
4235 "DRAINED" "DRAINING", "FAIL" or "FAILING". Use quotes to
4236 enclose a reason having more than one word.
4237
4238
4239 Sockets
4240 Number of physical processor sockets/chips on the node (e.g.
4241 "2"). If Sockets is omitted, it will be inferred from CPUs,
4242 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4243 multi-core processors, you will likely need to specify these
4244 parameters. Sockets and SocketsPerBoard are mutually exclusive.
4245 If Sockets is specified when Boards is also used, Sockets is
4246 interpreted as SocketsPerBoard rather than total sockets. The
4247 default value is 1.
4248
4249
4250 SocketsPerBoard
4251 Number of physical processor sockets/chips on a baseboard.
4252 Sockets and SocketsPerBoard are mutually exclusive. The default
4253 value is 1.
4254
4255
4256 State State of the node with respect to the initiation of user jobs.
4257 Acceptable values are "CLOUD", "DOWN", "DRAIN", "FAIL", "FAIL‐
4258 ING", "FUTURE" and "UNKNOWN". Node states of "BUSY" and "IDLE"
4259 should not be specified in the node configuration, but set the
4260 node state to "UNKNOWN" instead. Setting the node state to
4261 "UNKNOWN" will result in the node state being set to "BUSY",
4262 "IDLE" or other appropriate state based upon recovered system
4263 state information. The default value is "UNKNOWN". Also see
4264 the DownNodes parameter below.
4265
4266 CLOUD Indicates the node exists in the cloud. It's initial
4267 state will be treated as powered down. The node will
4268 be available for use after it's state is recovered
4269 from Slurm's state save file or the slurmd daemon
4270 starts on the compute node.
4271
4272 DOWN Indicates the node failed and is unavailable to be
4273 allocated work.
4274
4275 DRAIN Indicates the node is unavailable to be allocated
4276 work.on.
4277
4278 FAIL Indicates the node is expected to fail soon, has no
4279 jobs allocated to it, and will not be allocated to any
4280 new jobs.
4281
4282 FAILING Indicates the node is expected to fail soon, has one
4283 or more jobs allocated to it, but will not be allo‐
4284 cated to any new jobs.
4285
4286 FUTURE Indicates the node is defined for future use and need
4287 not exist when the Slurm daemons are started. These
4288 nodes can be made available for use simply by updating
4289 the node state using the scontrol command rather than
4290 restarting the slurmctld daemon. After these nodes are
4291 made available, change their State in the slurm.conf
4292 file. Until these nodes are made available, they will
4293 not be seen using any Slurm commands or nor will any
4294 attempt be made to contact them.
4295
4296 UNKNOWN Indicates the node's state is undefined (BUSY or
4297 IDLE), but will be established when the slurmd daemon
4298 on that node registers. The default value is
4299 "UNKNOWN".
4300
4301
4302 ThreadsPerCore
4303 Number of logical threads in a single physical core (e.g. "2").
4304 Note that the Slurm can allocate resources to jobs down to the
4305 resolution of a core. If your system is configured with more
4306 than one thread per core, execution of a different job on each
4307 thread is not supported unless you configure SelectTypeParame‐
4308 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4309 or ThreadsPerCore. A job can execute a one task per thread from
4310 within one job step or execute a distinct job step on each of
4311 the threads. Note also if you are running with more than 1
4312 thread per core and running the select/cons_res plugin you will
4313 want to set the SelectTypeParameters variable to something other
4314 than CR_CPU to avoid unexpected results. The default value is
4315 1.
4316
4317
4318 TmpDisk
4319 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4320 "16384"). TmpFS (for "Temporary File System") identifies the
4321 location which jobs should use for temporary storage. Note this
4322 does not indicate the amount of free space available to the user
4323 on the node, only the total file system size. The system admin‐
4324 istration should ensure this file system is purged as needed so
4325 that user jobs have access to most of this space. The Prolog
4326 and/or Epilog programs (specified in the configuration file)
4327 might be used to ensure the file system is kept clean. The
4328 default value is 0.
4329
4330
4331 TRESWeights TRESWeights are used to calculate a value that represents
4332 how
4333 busy a node is. Currently only used in federation configura‐
4334 tions. TRESWeights are different from TRESBillingWeights --
4335 which is used for fairshare calcuations.
4336
4337 TRES weights are specified as a comma-separated list of <TRES
4338 Type>=<TRES Weight> pairs.
4339 e.g.
4340 NodeName=node1 ... TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
4341
4342 By default the weighted TRES value is calculated as the sum of
4343 all node TRES types multiplied by their corresponding TRES
4344 weight.
4345
4346 If PriorityFlags=MAX_TRES is configured, the weighted TRES value
4347 is calculated as the MAX of individual node TRES' (e.g. cpus,
4348 mem, gres).
4349
4350
4351 Weight The priority of the node for scheduling purposes. All things
4352 being equal, jobs will be allocated the nodes with the lowest
4353 weight which satisfies their requirements. For example, a het‐
4354 erogeneous collection of nodes might be placed into a single
4355 partition for greater system utilization, responsiveness and
4356 capability. It would be preferable to allocate smaller memory
4357 nodes rather than larger memory nodes if either will satisfy a
4358 job's requirements. The units of weight are arbitrary, but
4359 larger weights should be assigned to nodes with more processors,
4360 memory, disk space, higher processor speed, etc. Note that if a
4361 job allocation request can not be satisfied using the nodes with
4362 the lowest weight, the set of nodes with the next lowest weight
4363 is added to the set of nodes under consideration for use (repeat
4364 as needed for higher weight values). If you absolutely want to
4365 minimize the number of higher weight nodes allocated to a job
4366 (at a cost of higher scheduling overhead), give each node a dis‐
4367 tinct Weight value and they will be added to the pool of nodes
4368 being considered for scheduling individually. The default value
4369 is 1.
4370
4371
4372 The "DownNodes=" configuration permits you to mark certain nodes as in
4373 a DOWN, DRAIN, FAIL, or FAILING state without altering the permanent
4374 configuration information listed under a "NodeName=" specification.
4375
4376
4377 DownNodes
4378 Any node name, or list of node names, from the "NodeName=" spec‐
4379 ifications.
4380
4381
4382 Reason Identifies the reason for a node being in state "DOWN", "DRAIN",
4383 "FAIL" or "FAILING. Use quotes to enclose a reason having more
4384 than one word.
4385
4386
4387 State State of the node with respect to the initiation of user jobs.
4388 Acceptable values are "DOWN", "DRAIN", "FAIL", "FAILING" and
4389 "UNKNOWN". Node states of "BUSY" and "IDLE" should not be spec‐
4390 ified in the node configuration, but set the node state to
4391 "UNKNOWN" instead. Setting the node state to "UNKNOWN" will
4392 result in the node state being set to "BUSY", "IDLE" or other
4393 appropriate state based upon recovered system state information.
4394 The default value is "UNKNOWN".
4395
4396 DOWN Indicates the node failed and is unavailable to be
4397 allocated work.
4398
4399 DRAIN Indicates the node is unavailable to be allocated
4400 work.on.
4401
4402 FAIL Indicates the node is expected to fail soon, has no
4403 jobs allocated to it, and will not be allocated to any
4404 new jobs.
4405
4406 FAILING Indicates the node is expected to fail soon, has one
4407 or more jobs allocated to it, but will not be allo‐
4408 cated to any new jobs.
4409
4410 UNKNOWN Indicates the node's state is undefined (BUSY or
4411 IDLE), but will be established when the slurmd daemon
4412 on that node registers. The default value is
4413 "UNKNOWN".
4414
4415
4416 On computers where frontend nodes are used to execute batch scripts
4417 rather than compute nodes (Cray ALPS systems), one may configure one or
4418 more frontend nodes using the configuration parameters defined below.
4419 These options are very similar to those used in configuring compute
4420 nodes. These options may only be used on systems configured and built
4421 with the appropriate parameters (--have-front-end) or a system deter‐
4422 mined to have the appropriate architecture by the configure script
4423 (Cray ALPS systems). The front end configuration specifies the follow‐
4424 ing information:
4425
4426
4427 AllowGroups
4428 Comma separated list of group names which may execute jobs on
4429 this front end node. By default, all groups may use this front
4430 end node. If at least one group associated with the user
4431 attempting to execute the job is in AllowGroups, he will be per‐
4432 mitted to use this front end node. May not be used with the
4433 DenyGroups option.
4434
4435
4436 AllowUsers
4437 Comma separated list of user names which may execute jobs on
4438 this front end node. By default, all users may use this front
4439 end node. May not be used with the DenyUsers option.
4440
4441
4442 DenyGroups
4443 Comma separated list of group names which are prevented from
4444 executing jobs on this front end node. May not be used with the
4445 AllowGroups option.
4446
4447
4448 DenyUsers
4449 Comma separated list of user names which are prevented from exe‐
4450 cuting jobs on this front end node. May not be used with the
4451 AllowUsers option.
4452
4453
4454 FrontendName
4455 Name that Slurm uses to refer to a frontend node. Typically
4456 this would be the string that "/bin/hostname -s" returns. It
4457 may also be the fully qualified domain name as returned by
4458 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4459 name associated with the host through the host database
4460 (/etc/hosts) or DNS, depending on the resolver settings. Note
4461 that if the short form of the hostname is not used, it may pre‐
4462 vent use of hostlist expressions (the numeric portion in brack‐
4463 ets must be at the end of the string). If the FrontendName is
4464 "DEFAULT", the values specified with that record will apply to
4465 subsequent node specifications unless explicitly set to other
4466 values in that frontend node record or replaced with a different
4467 set of default values. Each line where FrontendName is
4468 "DEFAULT" will replace or add to previous default values and not
4469 a reinitialize the default values. Note that since the naming
4470 of front end nodes would typically not follow that of the com‐
4471 pute nodes (e.g. lacking X, Y and Z coordinates found in the
4472 compute node naming scheme), each front end node name should be
4473 listed separately and without a hostlist expression (i.e. fron‐
4474 tend00,frontend01" rather than "frontend[00-01]").</p>
4475
4476
4477 FrontendAddr
4478 Name that a frontend node should be referred to in establishing
4479 a communications path. This name will be used as an argument to
4480 the gethostbyname() function for identification. As with Fron‐
4481 tendName, list the individual node addresses rather than using a
4482 hostlist expression. The number of FrontendAddr records per
4483 line must equal the number of FrontendName records per line
4484 (i.e. you can't map to node names to one address). FrontendAddr
4485 may also contain IP addresses. By default, the FrontendAddr
4486 will be identical in value to FrontendName.
4487
4488
4489 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4490 tens to for work on this particular frontend node. By default
4491 there is a single port number for all slurmd daemons on all
4492 frontend nodes as defined by the SlurmdPort configuration param‐
4493 eter. Use of this option is not generally recommended except for
4494 development or testing purposes.
4495
4496 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4497 automatically try to interact with anything opened on ports
4498 8192-60000. Configure Port to use a port outside of the config‐
4499 ured SrunPortRange and RSIP's port range.
4500
4501
4502 Reason Identifies the reason for a frontend node being in state "DOWN",
4503 "DRAINED" "DRAINING", "FAIL" or "FAILING". Use quotes to
4504 enclose a reason having more than one word.
4505
4506
4507 State State of the frontend node with respect to the initiation of
4508 user jobs. Acceptable values are "DOWN", "DRAIN", "FAIL",
4509 "FAILING" and "UNKNOWN". "DOWN" indicates the frontend node has
4510 failed and is unavailable to be allocated work. "DRAIN" indi‐
4511 cates the frontend node is unavailable to be allocated work.
4512 "FAIL" indicates the frontend node is expected to fail soon, has
4513 no jobs allocated to it, and will not be allocated to any new
4514 jobs. "FAILING" indicates the frontend node is expected to fail
4515 soon, has one or more jobs allocated to it, but will not be
4516 allocated to any new jobs. "UNKNOWN" indicates the frontend
4517 node's state is undefined (BUSY or IDLE), but will be estab‐
4518 lished when the slurmd daemon on that node registers. The
4519 default value is "UNKNOWN". Also see the DownNodes parameter
4520 below.
4521
4522 For example: "FrontendName=frontend[00-03] FrontendAddr=efron‐
4523 tend[00-03] State=UNKNOWN" is used to define four front end
4524 nodes for running slurmd daemons.
4525
4526
4527 The partition configuration permits you to establish different job lim‐
4528 its or access controls for various groups (or partitions) of nodes.
4529 Nodes may be in more than one partition, making partitions serve as
4530 general purpose queues. For example one may put the same set of nodes
4531 into two different partitions, each with different constraints (time
4532 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4533 allocated resources within a single partition. Default values can be
4534 specified with a record in which PartitionName is "DEFAULT". The
4535 default entry values will apply only to lines following it in the con‐
4536 figuration file and the default values can be reset multiple times in
4537 the configuration file with multiple entries where "Partition‐
4538 Name=DEFAULT". The "PartitionName=" specification must be placed on
4539 every line describing the configuration of partitions. Each line where
4540 PartitionName is "DEFAULT" will replace or add to previous default val‐
4541 ues and not a reinitialize the default values. A single partition name
4542 can not appear as a PartitionName value in more than one line (dupli‐
4543 cate partition name records will be ignored). If a partition that is
4544 in use is deleted from the configuration and slurm is restarted or
4545 reconfigured (scontrol reconfigure), jobs using the partition are can‐
4546 celed. NOTE: Put all parameters for each partition on a single line.
4547 Each line of partition configuration information should represent a
4548 different partition. The partition configuration file contains the
4549 following information:
4550
4551
4552 AllocNodes
4553 Comma separated list of nodes from which users can submit jobs
4554 in the partition. Node names may be specified using the node
4555 range expression syntax described above. The default value is
4556 "ALL".
4557
4558
4559 AllowAccounts
4560 Comma separated list of accounts which may execute jobs in the
4561 partition. The default value is "ALL". NOTE: If AllowAccounts
4562 is used then DenyAccounts will not be enforced. Also refer to
4563 DenyAccounts.
4564
4565
4566 AllowGroups
4567 Comma separated list of group names which may execute jobs in
4568 the partition. If at least one group associated with the user
4569 attempting to execute the job is in AllowGroups, he will be per‐
4570 mitted to use this partition. Jobs executed as user root can
4571 use any partition without regard to the value of AllowGroups.
4572 If user root attempts to execute a job as another user (e.g.
4573 using srun's --uid option), this other user must be in one of
4574 groups identified by AllowGroups for the job to successfully
4575 execute. The default value is "ALL". When set, all partitions
4576 that a user does not have access will be hidden from display
4577 regardless of the settings used for PrivateData. NOTE: For per‐
4578 formance reasons, Slurm maintains a list of user IDs allowed to
4579 use each partition and this is checked at job submission time.
4580 This list of user IDs is updated when the slurmctld daemon is
4581 restarted, reconfigured (e.g. "scontrol reconfig") or the parti‐
4582 tion's AllowGroups value is reset, even if is value is unchanged
4583 (e.g. "scontrol update PartitionName=name AllowGroups=group").
4584 For a user's access to a partition to change, both his group
4585 membership must change and Slurm's internal user ID list must
4586 change using one of the methods described above.
4587
4588
4589 AllowQos
4590 Comma separated list of Qos which may execute jobs in the parti‐
4591 tion. Jobs executed as user root can use any partition without
4592 regard to the value of AllowQos. The default value is "ALL".
4593 NOTE: If AllowQos is used then DenyQos will not be enforced.
4594 Also refer to DenyQos.
4595
4596
4597 Alternate
4598 Partition name of alternate partition to be used if the state of
4599 this partition is "DRAIN" or "INACTIVE."
4600
4601
4602 CpuBind
4603 If a job step request does not specify an option to control how
4604 tasks are bound to allocated CPUs (--cpu-bind) and all nodes
4605 allocated to the job do not have the same CpuBind option the
4606 node. Then the partition's CpuBind option will control how tasks
4607 are bound to allocated resources. Supported values forCpuBind
4608 are "none", "board", "socket", "ldom" (NUMA), "core" and
4609 "thread".
4610
4611
4612 Default
4613 If this keyword is set, jobs submitted without a partition spec‐
4614 ification will utilize this partition. Possible values are
4615 "YES" and "NO". The default value is "NO".
4616
4617
4618 DefMemPerCPU
4619 Default real memory size available per allocated CPU in
4620 megabytes. Used to avoid over-subscribing memory and causing
4621 paging. DefMemPerCPU would generally be used if individual pro‐
4622 cessors are allocated to jobs (SelectType=select/cons_res). If
4623 not set, the DefMemPerCPU value for the entire cluster will be
4624 used. Also see DefMemPerNode and MaxMemPerCPU. DefMemPerCPU
4625 and DefMemPerNode are mutually exclusive.
4626
4627
4628 DefMemPerNode
4629 Default real memory size available per allocated node in
4630 megabytes. Used to avoid over-subscribing memory and causing
4631 paging. DefMemPerNode would generally be used if whole nodes
4632 are allocated to jobs (SelectType=select/linear) and resources
4633 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
4634 If not set, the DefMemPerNode value for the entire cluster will
4635 be used. Also see DefMemPerCPU and MaxMemPerNode. DefMemPerCPU
4636 and DefMemPerNode are mutually exclusive.
4637
4638
4639 DenyAccounts
4640 Comma separated list of accounts which may not execute jobs in
4641 the partition. By default, no accounts are denied access NOTE:
4642 If AllowAccounts is used then DenyAccounts will not be enforced.
4643 Also refer to AllowAccounts.
4644
4645
4646 DenyQos
4647 Comma separated list of Qos which may not execute jobs in the
4648 partition. By default, no QOS are denied access NOTE: If
4649 AllowQos is used then DenyQos will not be enforced. Also refer
4650 AllowQos.
4651
4652
4653 DefaultTime
4654 Run time limit used for jobs that don't specify a value. If not
4655 set then MaxTime will be used. Format is the same as for Max‐
4656 Time.
4657
4658
4659 DisableRootJobs
4660 If set to "YES" then user root will be prevented from running
4661 any jobs on this partition. The default value will be the value
4662 of DisableRootJobs set outside of a partition specification
4663 (which is "NO", allowing user root to execute jobs).
4664
4665
4666 ExclusiveUser
4667 If set to "YES" then nodes will be exclusively allocated to
4668 users. Multiple jobs may be run for the same user, but only one
4669 user can be active at a time. This capability is also available
4670 on a per-job basis by using the --exclusive=user option.
4671
4672
4673 GraceTime
4674 Specifies, in units of seconds, the preemption grace time to be
4675 extended to a job which has been selected for preemption. The
4676 default value is zero, no preemption grace time is allowed on
4677 this partition. Once a job has been selected for preemption,
4678 its end time is set to the current time plus GraceTime. The
4679 job's tasks are immediately sent SIGCONT and SIGTERM signals in
4680 order to provide notification of its imminent termination. This
4681 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
4682 upon reaching its new end time. This second set of signals is
4683 sent to both the tasks and the containing batch script, if
4684 applicable. Meaningful only for PreemptMode=CANCEL. See also
4685 the global KillWait configuration parameter.
4686
4687
4688 Hidden Specifies if the partition and its jobs are to be hidden by
4689 default. Hidden partitions will by default not be reported by
4690 the Slurm APIs or commands. Possible values are "YES" and "NO".
4691 The default value is "NO". Note that partitions that a user
4692 lacks access to by virtue of the AllowGroups parameter will also
4693 be hidden by default.
4694
4695
4696 LLN Schedule resources to jobs on the least loaded nodes (based upon
4697 the number of idle CPUs). This is generally only recommended for
4698 an environment with serial jobs as idle resources will tend to
4699 be highly fragmented, resulting in parallel jobs being distrib‐
4700 uted across many nodes. Note that node Weight takes precedence
4701 over how many idle resources are on each node. Also see the
4702 SelectParameters configuration parameter CR_LLN to use the least
4703 loaded nodes in every partition.
4704
4705
4706 MaxCPUsPerNode
4707 Maximum number of CPUs on any node available to all jobs from
4708 this partition. This can be especially useful to schedule GPUs.
4709 For example a node can be associated with two Slurm partitions
4710 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
4711 limited to only a subset of the node's CPUs, ensuring that one
4712 or more CPUs would be available to jobs in the "gpu" parti‐
4713 tion/queue.
4714
4715
4716 MaxMemPerCPU
4717 Maximum real memory size available per allocated CPU in
4718 megabytes. Used to avoid over-subscribing memory and causing
4719 paging. MaxMemPerCPU would generally be used if individual pro‐
4720 cessors are allocated to jobs (SelectType=select/cons_res). If
4721 not set, the MaxMemPerCPU value for the entire cluster will be
4722 used. Also see DefMemPerCPU and MaxMemPerNode. MaxMemPerCPU
4723 and MaxMemPerNode are mutually exclusive.
4724
4725
4726 MaxMemPerNode
4727 Maximum real memory size available per allocated node in
4728 megabytes. Used to avoid over-subscribing memory and causing
4729 paging. MaxMemPerNode would generally be used if whole nodes
4730 are allocated to jobs (SelectType=select/linear) and resources
4731 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
4732 If not set, the MaxMemPerNode value for the entire cluster will
4733 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
4734 and MaxMemPerNode are mutually exclusive.
4735
4736
4737 MaxNodes
4738 Maximum count of nodes which may be allocated to any single job.
4739 The default value is "UNLIMITED", which is represented inter‐
4740 nally as -1. This limit does not apply to jobs executed by
4741 SlurmUser or user root.
4742
4743
4744 MaxTime
4745 Maximum run time limit for jobs. Format is minutes, min‐
4746 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
4747 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
4748 tion is one minute and second values are rounded up to the next
4749 minute. This limit does not apply to jobs executed by SlurmUser
4750 or user root.
4751
4752
4753 MinNodes
4754 Minimum count of nodes which may be allocated to any single job.
4755 The default value is 0. This limit does not apply to jobs exe‐
4756 cuted by SlurmUser or user root.
4757
4758
4759 Nodes Comma separated list of nodes which are associated with this
4760 partition. Node names may be specified using the node range
4761 expression syntax described above. A blank list of nodes (i.e.
4762 "Nodes= ") can be used if one wants a partition to exist, but
4763 have no resources (possibly on a temporary basis). A value of
4764 "ALL" is mapped to all nodes configured in the cluster.
4765
4766
4767 OverSubscribe
4768 Controls the ability of the partition to execute more than one
4769 job at a time on each resource (node, socket or core depending
4770 upon the value of SelectTypeParameters). If resources are to be
4771 over-subscribed, avoiding memory over-subscription is very
4772 important. SelectTypeParameters should be configured to treat
4773 memory as a consumable resource and the --mem option should be
4774 used for job allocations. Sharing of resources is typically
4775 useful only when using gang scheduling (PreemptMode=sus‐
4776 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
4777 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
4778 can negatively impact performance for systems with many thou‐
4779 sands of running jobs. The default value is "NO". For more
4780 information see the following web pages:
4781 https://slurm.schedmd.com/cons_res.html,
4782 https://slurm.schedmd.com/cons_res_share.html,
4783 https://slurm.schedmd.com/gang_scheduling.html, and
4784 https://slurm.schedmd.com/preempt.html.
4785
4786
4787 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
4788 Type=select/cons_res configured. Jobs that run in
4789 partitions with "OverSubscribe=EXCLUSIVE" will have
4790 exclusive access to all allocated nodes.
4791
4792 FORCE Makes all resources in the partition available for
4793 oversubscription without any means for users to dis‐
4794 able it. May be followed with a colon and maximum
4795 number of jobs in running or suspended state. For
4796 example "OverSubscribe=FORCE:4" enables each node,
4797 socket or core to oversubscribe each resource four
4798 ways. Recommended only for systems running with
4799 gang scheduling (PreemptMode=suspend,gang). NOTE:
4800 PreemptType=QOS will permit one additional job to be
4801 run on the partition if started due to job preemp‐
4802 tion. For example, a configuration of OverSub‐
4803 scribe=FORCE:1 will only permit one job per
4804 resources normally, but a second job can be started
4805 if done so through preemption based upon QOS. The
4806 use of PreemptType=QOS and PreemptType=Suspend only
4807 applies with SelectType=select/cons_res.
4808
4809 YES Makes all resources in the partition available for
4810 sharing upon request by the job. Resources will
4811 only be over-subscribed when explicitly requested by
4812 the user using the "--oversubscribe" option on job
4813 submission. May be followed with a colon and maxi‐
4814 mum number of jobs in running or suspended state.
4815 For example "OverSubscribe=YES:4" enables each node,
4816 socket or core to execute up to four jobs at once.
4817 Recommended only for systems running with gang
4818 scheduling (PreemptMode=suspend,gang).
4819
4820 NO Selected resources are allocated to a single job. No
4821 resource will be allocated to more than one job.
4822
4823
4824 PartitionName
4825 Name by which the partition may be referenced (e.g. "Interac‐
4826 tive"). This name can be specified by users when submitting
4827 jobs. If the PartitionName is "DEFAULT", the values specified
4828 with that record will apply to subsequent partition specifica‐
4829 tions unless explicitly set to other values in that partition
4830 record or replaced with a different set of default values. Each
4831 line where PartitionName is "DEFAULT" will replace or add to
4832 previous default values and not a reinitialize the default val‐
4833 ues.
4834
4835
4836 PreemptMode
4837 Mechanism used to preempt jobs from this partition when Preempt‐
4838 Type=preempt/partition_prio is configured. This partition spe‐
4839 cific PreemptMode configuration parameter will override the Pre‐
4840 emptMode configuration parameter set for the cluster as a whole.
4841 The cluster-level PreemptMode must include the GANG option if
4842 PreemptMode is configured to SUSPEND for any partition. The
4843 cluster-level PreemptMode must not be OFF if PreemptMode is
4844 enabled for any partition. See the description of the clus‐
4845 ter-level PreemptMode configuration parameter above for further
4846 information.
4847
4848
4849 PriorityJobFactor
4850 Partition factor used by priority/multifactor plugin in calcu‐
4851 lating job priority. The value may not exceed 65533. Also see
4852 PriorityTier.
4853
4854
4855 PriorityTier
4856 Jobs submitted to a partition with a higher priority tier value
4857 will be dispatched before pending jobs in partition with lower
4858 priority tier value and, if possible, they will preempt running
4859 jobs from partitions with lower priority tier values. Note that
4860 a partition's priority tier takes precedence over a job's prior‐
4861 ity. The value may not exceed 65533. Also see PriorityJobFac‐
4862 tor.
4863
4864
4865 QOS Used to extend the limits available to a QOS on a partition.
4866 Jobs will not be associated to this QOS outside of being associ‐
4867 ated to the partition. They will still be associated to their
4868 requested QOS. By default, no QOS is used. NOTE: If a limit is
4869 set in both the Partition's QOS and the Job's QOS the Partition
4870 QOS will be honored unless the Job's QOS has the OverPartQOS
4871 flag set in which the Job's QOS will have priority.
4872
4873
4874 ReqResv
4875 Specifies users of this partition are required to designate a
4876 reservation when submitting a job. This option can be useful in
4877 restricting usage of a partition that may have higher priority
4878 or additional resources to be allowed only within a reservation.
4879 Possible values are "YES" and "NO". The default value is "NO".
4880
4881
4882 RootOnly
4883 Specifies if only user ID zero (i.e. user root) may allocate
4884 resources in this partition. User root may allocate resources
4885 for any other user, but the request must be initiated by user
4886 root. This option can be useful for a partition to be managed
4887 by some external entity (e.g. a higher-level job manager) and
4888 prevents users from directly using those resources. Possible
4889 values are "YES" and "NO". The default value is "NO".
4890
4891
4892 SelectTypeParameters
4893 Partition-specific resource allocation type. This option
4894 replaces the global SelectTypeParameters value. Supported val‐
4895 ues are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
4896 Use requires the system-wide SelectTypeParameters value be set
4897 to any of the four supported values previously listed; other‐
4898 wise, the partition-specific value will be ignored.
4899
4900
4901 Shared The Shared configuration parameter has been replaced by the
4902 OverSubscribe parameter described above.
4903
4904
4905 State State of partition or availability for use. Possible values are
4906 "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
4907 See also the related "Alternate" keyword.
4908
4909 UP Designates that new jobs may queued on the partition,
4910 and that jobs may be allocated nodes and run from the
4911 partition.
4912
4913 DOWN Designates that new jobs may be queued on the parti‐
4914 tion, but queued jobs may not be allocated nodes and
4915 run from the partition. Jobs already running on the
4916 partition continue to run. The jobs must be explicitly
4917 canceled to force their termination.
4918
4919 DRAIN Designates that no new jobs may be queued on the par‐
4920 tition (job submission requests will be denied with an
4921 error message), but jobs already queued on the parti‐
4922 tion may be allocated nodes and run. See also the
4923 "Alternate" partition specification.
4924
4925 INACTIVE Designates that no new jobs may be queued on the par‐
4926 tition, and jobs already queued may not be allocated
4927 nodes and run. See also the "Alternate" partition
4928 specification.
4929
4930
4931 TRESBillingWeights
4932 TRESBillingWeights is used to define the billing weights of each
4933 TRES type that will be used in calculating the usage of a job.
4934 The calculated usage is used when calculating fairshare and when
4935 enforcing the TRES billing limit on jobs.
4936
4937 Billing weights are specified as a comma-separated list of <TRES
4938 Type>=<TRES Billing Weight> pairs.
4939
4940 Any TRES Type is available for billing. Note that the base unit
4941 for memory and burst buffers is megabytes.
4942
4943 By default the billing of TRES is calculated as the sum of all
4944 TRES types multiplied by their corresponding billing weight.
4945
4946 The weighted amount of a resource can be adjusted by adding a
4947 suffix of K,M,G,T or P after the billing weight. For example, a
4948 memory weight of "mem=.25" on a job allocated 8GB will be billed
4949 2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
4950 same job will be billed 2 (8192MB * (.25/1024)) units.
4951
4952 Negative values are allowed.
4953
4954 When a job is allocated 1 CPU and 8 GB of memory on a partition
4955 configured with TRESBilling‐
4956 Weights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the billable TRES will
4957 be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
4958
4959 If PriorityFlags=MAX_TRES is configured, the billable TRES is
4960 calculated as the MAX of individual TRES' on a node (e.g. cpus,
4961 mem, gres) plus the sum of all global TRES' (e.g. licenses).
4962 Using the same example above the billable TRES will be
4963 MAX(1*1.0, 8*0.25) + (0*2.0) = 2.0.
4964
4965 If TRESBillingWeights is not defined then the job is billed
4966 against the total number of allocated CPUs.
4967
4968 NOTE: TRESBillingWeights doesn't affect job priority directly as
4969 it is currently not used for the size of the job. If you want
4970 TRES' to play a role in the job's priority then refer to the
4971 PriorityWeightTRES option.
4972
4973
4974
4976 There are a variety of prolog and epilog program options that execute
4977 with various permissions and at various times. The four options most
4978 likely to be used are: Prolog and Epilog (executed once on each compute
4979 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
4980 once on the ControlMachine for each job).
4981
4982 NOTE: Standard output and error messages are normally not preserved.
4983 Explicitly write output and error messages to an appropriate location
4984 if you wish to preserve that information.
4985
4986 NOTE: By default the Prolog script is ONLY run on any individual node
4987 when it first sees a job step from a new allocation; it does not run
4988 the Prolog immediately when an allocation is granted. If no job steps
4989 from an allocation are run on a node, it will never run the Prolog for
4990 that allocation. This Prolog behaviour can be changed by the Pro‐
4991 logFlags parameter. The Epilog, on the other hand, always runs on
4992 every node of an allocation when the allocation is released.
4993
4994 If the Epilog fails (returns a non-zero exit code), this will result in
4995 the node being set to a DRAIN state. If the EpilogSlurmctld fails
4996 (returns a non-zero exit code), this will only be logged. If the Pro‐
4997 log fails (returns a non-zero exit code), this will result in the node
4998 being set to a DRAIN state and the job being requeued in a held state
4999 unless nohold_on_prolog_fail is configured in SchedulerParameters. If
5000 the PrologSlurmctld fails (returns a non-zero exit code), this will
5001 result in the job requeued to executed on another node if possible.
5002 Only batch jobs can be requeued.
5003 Interactive jobs (salloc and srun) will be cancelled if the Pro‐
5004 logSlurmctld fails.
5005
5006
5007 Information about the job is passed to the script using environment
5008 variables. Unless otherwise specified, these environment variables are
5009 available to all of the programs.
5010
5011 BASIL_RESERVATION_ID
5012 Basil reservation ID. Available on Cray systems with ALPS only.
5013
5014 SLURM_ARRAY_JOB_ID
5015 If this job is part of a job array, this will be set to the job
5016 ID. Otherwise it will not be set. To reference this specific
5017 task of a job array, combine SLURM_ARRAY_JOB_ID with
5018 SLURM_ARRAY_TASK_ID (e.g. "scontrol update
5019 ${SLURM_ARRAY_JOB_ID}_{$SLURM_ARRAY_TASK_ID} ..."); Available in
5020 PrologSlurmctld and EpilogSlurmctld only.
5021
5022 SLURM_ARRAY_TASK_ID
5023 If this job is part of a job array, this will be set to the task
5024 ID. Otherwise it will not be set. To reference this specific
5025 task of a job array, combine SLURM_ARRAY_JOB_ID with
5026 SLURM_ARRAY_TASK_ID (e.g. "scontrol update
5027 ${SLURM_ARRAY_JOB_ID}_{$SLURM_ARRAY_TASK_ID} ..."); Available in
5028 PrologSlurmctld and EpilogSlurmctld only.
5029
5030 SLURM_ARRAY_TASK_MAX
5031 If this job is part of a job array, this will be set to the max‐
5032 imum task ID. Otherwise it will not be set. Available in Pro‐
5033 logSlurmctld and EpilogSlurmctld only.
5034
5035 SLURM_ARRAY_TASK_MIN
5036 If this job is part of a job array, this will be set to the min‐
5037 imum task ID. Otherwise it will not be set. Available in Pro‐
5038 logSlurmctld and EpilogSlurmctld only.
5039
5040 SLURM_ARRAY_TASK_STEP
5041 If this job is part of a job array, this will be set to the step
5042 size of task IDs. Otherwise it will not be set. Available in
5043 PrologSlurmctld and EpilogSlurmctld only.
5044
5045 SLURM_CLUSTER_NAME
5046 Name of the cluster executing the job.
5047
5048 SLURM_JOB_ACCOUNT
5049 Account name used for the job. Available in PrologSlurmctld and
5050 EpilogSlurmctld only.
5051
5052 SLURM_JOB_CONSTRAINTS
5053 Features required to run the job. Available in Prolog, Pro‐
5054 logSlurmctld and EpilogSlurmctld only.
5055
5056 SLURM_JOB_DERIVED_EC
5057 The highest exit code of all of the job steps. Available in
5058 EpilogSlurmctld only.
5059
5060 SLURM_JOB_EXIT_CODE
5061 The exit code of the job script (or salloc). The value is the
5062 status as returned by the wait() system call (See wait(2))
5063 Available in EpilogSlurmctld only.
5064
5065 SLURM_JOB_EXIT_CODE2
5066 The exit code of the job script (or salloc). The value has the
5067 format <exit>:<sig>. The first number is the exit code, typi‐
5068 cally as set by the exit() function. The second number of the
5069 signal that caused the process to terminate if it was terminated
5070 by a signal. Available in EpilogSlurmctld only.
5071
5072 SLURM_JOB_GID
5073 Group ID of the job's owner. Available in PrologSlurmctld, Epi‐
5074 logSlurmctld and TaskProlog only.
5075
5076 SLURM_JOB_GPUS
5077 GPU IDs allocated to the job (if any). Available in the Prolog
5078 only.
5079
5080 SLURM_JOB_GROUP
5081 Group name of the job's owner. Available in PrologSlurmctld and
5082 EpilogSlurmctld only.
5083
5084 SLURM_JOB_ID
5085 Job ID. CAUTION: If this job is the first task of a job array,
5086 then Slurm commands using this job ID will refer to the entire
5087 job array rather than this specific task of the job array.
5088
5089 SLURM_JOB_NAME
5090 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5091 ctld only.
5092
5093 SLURM_JOB_NODELIST
5094 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5095 show hostnames" can be used to convert this to a list of indi‐
5096 vidual host names. Available in PrologSlurmctld and Epi‐
5097 logSlurmctld only.
5098
5099 SLURM_JOB_PARTITION
5100 Partition that job runs in. Available in Prolog, PrologSlurm‐
5101 ctld and EpilogSlurmctld only.
5102
5103 SLURM_JOB_UID
5104 User ID of the job's owner.
5105
5106 SLURM_JOB_USER
5107 User name of the job's owner.
5108
5109
5111 Slurm is able to optimize job allocations to minimize network con‐
5112 tention. Special Slurm logic is used to optimize allocations on sys‐
5113 tems with a three-dimensional interconnect. and information about con‐
5114 figuring those systems are available on web pages available here:
5115 <https://slurm.schedmd.com/>. For a hierarchical network, Slurm needs
5116 to have detailed information about how nodes are configured on the net‐
5117 work switches.
5118
5119 Given network topology information, Slurm allocates all of a job's
5120 resources onto a single leaf of the network (if possible) using a
5121 best-fit algorithm. Otherwise it will allocate a job's resources onto
5122 multiple leaf switches so as to minimize the use of higher-level
5123 switches. The TopologyPlugin parameter controls which plugin is used
5124 to collect network topology information. The only values presently
5125 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5126 forms best-fit logic over three-dimensional topology), "topology/none"
5127 (default for other systems, best-fit logic over one-dimensional topol‐
5128 ogy), "topology/tree" (determine the network topology based upon infor‐
5129 mation contained in a topology.conf file, see "man topology.conf" for
5130 more information). Future plugins may gather topology information
5131 directly from the network. The topology information is optional. If
5132 not provided, Slurm will perform a best-fit algorithm assuming the
5133 nodes are in a one-dimensional array as configured and the communica‐
5134 tions cost is related to the node distance in this array.
5135
5136
5138 If the cluster's computers used for the primary or backup controller
5139 will be out of service for an extended period of time, it may be desir‐
5140 able to relocate them. In order to do so, follow this procedure:
5141
5142 1. Stop the Slurm daemons
5143 2. Modify the slurm.conf file appropriately
5144 3. Distribute the updated slurm.conf file to all nodes
5145 4. Restart the Slurm daemons
5146
5147 There should be no loss of any running or pending jobs. Ensure that
5148 any nodes added to the cluster have the current slurm.conf file
5149 installed.
5150
5151 CAUTION: If two nodes are simultaneously configured as the primary con‐
5152 troller (two nodes on which ControlMachine specify the local host and
5153 the slurmctld daemon is executing on each), system behavior will be
5154 destructive. If a compute node has an incorrect ControlMachine or
5155 BackupController parameter, that node may be rendered unusable, but no
5156 other harm will result.
5157
5158
5160 #
5161 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5162 # Author: John Doe
5163 # Date: 11/06/2001
5164 #
5165 SlurmctldHost=dev0(12.34.56.78) # Primary server
5166 SlurmctldHost=dev1(12.34.56.79) # Backup server
5167 #
5168 AuthType=auth/munge
5169 Epilog=/usr/local/slurm/epilog
5170 Prolog=/usr/local/slurm/prolog
5171 FastSchedule=1
5172 FirstJobId=65536
5173 InactiveLimit=120
5174 JobCompType=jobcomp/filetxt
5175 JobCompLoc=/var/log/slurm/jobcomp
5176 KillWait=30
5177 MaxJobCount=10000
5178 MinJobAge=3600
5179 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5180 ReturnToService=0
5181 SchedulerType=sched/backfill
5182 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5183 SlurmdLogFile=/var/log/slurm/slurmd.log
5184 SlurmctldPort=7002
5185 SlurmdPort=7003
5186 SlurmdSpoolDir=/var/spool/slurmd.spool
5187 StateSaveLocation=/var/spool/slurm.state
5188 SwitchType=switch/none
5189 TmpFS=/tmp
5190 WaitTime=30
5191 JobCredentialPrivateKey=/usr/local/slurm/private.key
5192 JobCredentialPublicCertificate=/usr/local/slurm/public.cert
5193 #
5194 # Node Configurations
5195 #
5196 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5197 NodeName=DEFAULT State=UNKNOWN
5198 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5199 # Update records for specific DOWN nodes
5200 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5201 #
5202 # Partition Configurations
5203 #
5204 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5205 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5206 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5207 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5208
5209
5211 The "include" key word can be used with modifiers within the specified
5212 pathname. These modifiers would be replaced with cluster name or other
5213 information depending on which modifier is specified. If the included
5214 file is not an absolute path name (i.e. it does not start with a
5215 slash), it will searched for in the same directory as the slurm.conf
5216 file.
5217
5218 %c Cluster name specified in the slurm.conf will be used.
5219
5220 EXAMPLE
5221 ClusterName=linux
5222 include /home/slurm/etc/%c_config
5223 # Above line interpreted as
5224 # "include /home/slurm/etc/linux_config"
5225
5226
5228 There are three classes of files: Files used by slurmctld must be
5229 accessible by user SlurmUser and accessible by the primary and backup
5230 control machines. Files used by slurmd must be accessible by user root
5231 and accessible from every compute node. A few files need to be acces‐
5232 sible by normal users on all login and compute nodes. While many files
5233 and directories are listed below, most of them will not be used with
5234 most configurations.
5235
5236 AccountingStorageLoc
5237 If this specifies a file, it must be writable by user SlurmUser.
5238 The file must be accessible by the primary and backup control
5239 machines. It is recommended that the file be readable by all
5240 users from login and compute nodes.
5241
5242 Epilog Must be executable by user root. It is recommended that the
5243 file be readable by all users. The file must exist on every
5244 compute node.
5245
5246 EpilogSlurmctld
5247 Must be executable by user SlurmUser. It is recommended that
5248 the file be readable by all users. The file must be accessible
5249 by the primary and backup control machines.
5250
5251 HealthCheckProgram
5252 Must be executable by user root. It is recommended that the
5253 file be readable by all users. The file must exist on every
5254 compute node.
5255
5256 JobCheckpointDir
5257 Must be writable by user SlurmUser and no other users. The file
5258 must be accessible by the primary and backup control machines.
5259
5260 JobCompLoc
5261 If this specifies a file, it must be writable by user SlurmUser.
5262 The file must be accessible by the primary and backup control
5263 machines.
5264
5265 JobCredentialPrivateKey
5266 Must be readable only by user SlurmUser and writable by no other
5267 users. The file must be accessible by the primary and backup
5268 control machines.
5269
5270 JobCredentialPublicCertificate
5271 Readable to all users on all nodes. Must not be writable by
5272 regular users.
5273
5274 MailProg
5275 Must be executable by user SlurmUser. Must not be writable by
5276 regular users. The file must be accessible by the primary and
5277 backup control machines.
5278
5279 Prolog Must be executable by user root. It is recommended that the
5280 file be readable by all users. The file must exist on every
5281 compute node.
5282
5283 PrologSlurmctld
5284 Must be executable by user SlurmUser. It is recommended that
5285 the file be readable by all users. The file must be accessible
5286 by the primary and backup control machines.
5287
5288 ResumeProgram
5289 Must be executable by user SlurmUser. The file must be accessi‐
5290 ble by the primary and backup control machines.
5291
5292 SallocDefaultCommand
5293 Must be executable by all users. The file must exist on every
5294 login and compute node.
5295
5296 slurm.conf
5297 Readable to all users on all nodes. Must not be writable by
5298 regular users.
5299
5300 SlurmctldLogFile
5301 Must be writable by user SlurmUser. The file must be accessible
5302 by the primary and backup control machines.
5303
5304 SlurmctldPidFile
5305 Must be writable by user root. Preferably writable and remov‐
5306 able by SlurmUser. The file must be accessible by the primary
5307 and backup control machines.
5308
5309 SlurmdLogFile
5310 Must be writable by user root. A distinct file must exist on
5311 each compute node.
5312
5313 SlurmdPidFile
5314 Must be writable by user root. A distinct file must exist on
5315 each compute node.
5316
5317 SlurmdSpoolDir
5318 Must be writable by user root. A distinct file must exist on
5319 each compute node.
5320
5321 SrunEpilog
5322 Must be executable by all users. The file must exist on every
5323 login and compute node.
5324
5325 SrunProlog
5326 Must be executable by all users. The file must exist on every
5327 login and compute node.
5328
5329 StateSaveLocation
5330 Must be writable by user SlurmUser. The file must be accessible
5331 by the primary and backup control machines.
5332
5333 SuspendProgram
5334 Must be executable by user SlurmUser. The file must be accessi‐
5335 ble by the primary and backup control machines.
5336
5337 TaskEpilog
5338 Must be executable by all users. The file must exist on every
5339 compute node.
5340
5341 TaskProlog
5342 Must be executable by all users. The file must exist on every
5343 compute node.
5344
5345 UnkillableStepProgram
5346 Must be executable by user SlurmUser. The file must be accessi‐
5347 ble by the primary and backup control machines.
5348
5349
5351 Note that while Slurm daemons create log files and other files as
5352 needed, it treats the lack of parent directories as a fatal error.
5353 This prevents the daemons from running if critical file systems are not
5354 mounted and will minimize the risk of cold-starting (starting without
5355 preserving jobs).
5356
5357 Log files and job accounting files, may need to be created/owned by the
5358 "SlurmUser" uid to be successfully accessed. Use the "chown" and
5359 "chmod" commands to set the ownership and permissions appropriately.
5360 See the section FILE AND DIRECTORY PERMISSIONS for information about
5361 the various files and directories used by Slurm.
5362
5363 It is recommended that the logrotate utility be used to ensure that
5364 various log files do not become too large. This also applies to text
5365 files used for accounting, process tracking, and the slurmdbd log if
5366 they are used.
5367
5368 Here is a sample logrotate configuration. Make appropriate site modifi‐
5369 cations and save as /etc/logrotate.d/slurm on all nodes. See the
5370 logrotate man page for more details.
5371
5372 ##
5373 # Slurm Logrotate Configuration
5374 ##
5375 /var/log/slurm/*.log {
5376 compress
5377 missingok
5378 nocopytruncate
5379 nodelaycompress
5380 nomail
5381 notifempty
5382 noolddir
5383 rotate 5
5384 sharedscripts
5385 size=5M
5386 create 640 slurm root
5387 postrotate
5388 for daemon in $(/usr/bin/scontrol show daemons)
5389 do
5390 killall -SIGUSR2 $daemon
5391 done
5392 endscript
5393 }
5394
5395 NOTE: slurmdbd daemon isn't listed in the output of 'scontrol show dae‐
5396 mons', so a separate logrotate config should be used to send a SIGUSR2
5397 signal to it.
5398
5400 Copyright (C) 2002-2007 The Regents of the University of California.
5401 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
5402 Copyright (C) 2008-2010 Lawrence Livermore National Security.
5403 Copyright (C) 2010-2017 SchedMD LLC.
5404
5405 This file is part of Slurm, a resource management program. For
5406 details, see <https://slurm.schedmd.com/>.
5407
5408 Slurm is free software; you can redistribute it and/or modify it under
5409 the terms of the GNU General Public License as published by the Free
5410 Software Foundation; either version 2 of the License, or (at your
5411 option) any later version.
5412
5413 Slurm is distributed in the hope that it will be useful, but WITHOUT
5414 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
5415 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
5416 for more details.
5417
5418
5420 /etc/slurm.conf
5421
5422
5424 cgroup.conf(5), gethostbyname [22m(3), getrlimit (2), gres.conf(5), group
5425 (5), hostname (1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8),
5426 slurmdbd.conf(5), srun(1), spank(8), syslog (2), topology.conf(5)
5427
5428
5429
5430February 2019 Slurm Configuration File slurm.conf(5)