1slurm.conf(5) Slurm Configuration File slurm.conf(5)
2
3
4
6 slurm.conf - Slurm configuration file
7
8
10 slurm.conf is an ASCII file which describes general Slurm configuration
11 information, the nodes to be managed, information about how those nodes
12 are grouped into partitions, and various scheduling parameters associ‐
13 ated with those partitions. This file should be consistent across all
14 nodes in the cluster.
15
16 The file location can be modified at execution time by setting the
17 SLURM_CONF environment variable. The Slurm daemons also allow you to
18 override both the built-in and environment-provided location using the
19 "-f" option on the command line.
20
21 The contents of the file are case insensitive except for the names of
22 nodes and partitions. Any text following a "#" in the configuration
23 file is treated as a comment through the end of that line. Changes to
24 the configuration file take effect upon restart of Slurm daemons, dae‐
25 mon receipt of the SIGHUP signal, or execution of the command "scontrol
26 reconfigure" unless otherwise noted.
27
28 If a line begins with the word "Include" followed by whitespace and
29 then a file name, that file will be included inline with the current
30 configuration file. For large or complex systems, multiple configura‐
31 tion files may prove easier to manage and enable reuse of some files
32 (See INCLUDE MODIFIERS for more details).
33
34 Note on file permissions:
35
36 The slurm.conf file must be readable by all users of Slurm, since it is
37 used by many of the Slurm commands. Other files that are defined in
38 the slurm.conf file, such as log files and job accounting files, may
39 need to be created/owned by the user "SlurmUser" to be successfully ac‐
40 cessed. Use the "chown" and "chmod" commands to set the ownership and
41 permissions appropriately. See the section FILE AND DIRECTORY PERMIS‐
42 SIONS for information about the various files and directories used by
43 Slurm.
44
45
47 The overall configuration parameters available include:
48
49
50 AccountingStorageBackupHost
51 The name of the backup machine hosting the accounting storage
52 database. If used with the accounting_storage/slurmdbd plugin,
53 this is where the backup slurmdbd would be running. Only used
54 with systems using SlurmDBD, ignored otherwise.
55
56 AccountingStorageEnforce
57 This controls what level of association-based enforcement to im‐
58 pose on job submissions. Valid options are any combination of
59 associations, limits, nojobs, nosteps, qos, safe, and wckeys, or
60 all for all things (except nojobs and nosteps, which must be re‐
61 quested as well).
62
63 If limits, qos, or wckeys are set, associations will automati‐
64 cally be set.
65
66 If wckeys is set, TrackWCKey will automatically be set.
67
68 If safe is set, limits and associations will automatically be
69 set.
70
71 If nojobs is set, nosteps will automatically be set.
72
73 By setting associations, no new job is allowed to run unless a
74 corresponding association exists in the system. If limits are
75 enforced, users can be limited by association to whatever job
76 size or run time limits are defined.
77
78 If nojobs is set, Slurm will not account for any jobs or steps
79 on the system. Likewise, if nosteps is set, Slurm will not ac‐
80 count for any steps that have run.
81
82 If safe is enforced, a job will only be launched against an as‐
83 sociation or qos that has a GrpTRESMins limit set, if the job
84 will be able to run to completion. Without this option set, jobs
85 will be launched as long as their usage hasn't reached the
86 cpu-minutes limit. This can lead to jobs being launched but then
87 killed when the limit is reached.
88
89 With qos and/or wckeys enforced jobs will not be scheduled un‐
90 less a valid qos and/or workload characterization key is speci‐
91 fied.
92
93 A restart of slurmctld is required for changes to this parameter
94 to take effect.
95
96 AccountingStorageExternalHost
97 A comma-separated list of external slurmdbds
98 (<host/ip>[:port][,...]) to register with. If no port is given,
99 the AccountingStoragePort will be used.
100
101 This allows clusters registered with the external slurmdbd to
102 communicate with each other using the --cluster/-M client com‐
103 mand options.
104
105 The cluster will add itself to the external slurmdbd if it
106 doesn't exist. If a non-external cluster already exists on the
107 external slurmdbd, the slurmctld will ignore registering to the
108 external slurmdbd.
109
110 AccountingStorageHost
111 The name of the machine hosting the accounting storage database.
112 Only used with systems using SlurmDBD, ignored otherwise.
113
114 AccountingStorageParameters
115 Comma-separated list of key-value pair parameters. Currently
116 supported values include options to establish a secure connec‐
117 tion to the database:
118
119 SSL_CERT
120 The path name of the client public key certificate file.
121
122 SSL_CA
123 The path name of the Certificate Authority (CA) certificate
124 file.
125
126 SSL_CAPATH
127 The path name of the directory that contains trusted SSL CA
128 certificate files.
129
130 SSL_KEY
131 The path name of the client private key file.
132
133 SSL_CIPHER
134 The list of permissible ciphers for SSL encryption.
135
136 AccountingStoragePass
137 The password used to gain access to the database to store the
138 accounting data. Only used for database type storage plugins,
139 ignored otherwise. In the case of Slurm DBD (Database Daemon)
140 with MUNGE authentication this can be configured to use a MUNGE
141 daemon specifically configured to provide authentication between
142 clusters while the default MUNGE daemon provides authentication
143 within a cluster. In that case, AccountingStoragePass should
144 specify the named port to be used for communications with the
145 alternate MUNGE daemon (e.g. "/var/run/munge/global.socket.2").
146 The default value is NULL.
147
148 AccountingStoragePort
149 The listening port of the accounting storage database server.
150 Only used for database type storage plugins, ignored otherwise.
151 The default value is SLURMDBD_PORT as established at system
152 build time. If no value is explicitly specified, it will be set
153 to 6819. This value must be equal to the DbdPort parameter in
154 the slurmdbd.conf file.
155
156 AccountingStorageTRES
157 Comma-separated list of resources you wish to track on the clus‐
158 ter. These are the resources requested by the sbatch/srun job
159 when it is submitted. Currently this consists of any GRES, BB
160 (burst buffer) or license along with CPU, Memory, Node, Energy,
161 FS/[Disk|Lustre], IC/OFED, Pages, and VMem. By default Billing,
162 CPU, Energy, Memory, Node, FS/Disk, Pages and VMem are tracked.
163 These default TRES cannot be disabled, but only appended to.
164 AccountingStorageTRES=gres/craynetwork,license/iop1 will track
165 billing, cpu, energy, memory, nodes, fs/disk, pages and vmem
166 along with a gres called craynetwork as well as a license called
167 iop1. Whenever these resources are used on the cluster they are
168 recorded. The TRES are automatically set up in the database on
169 the start of the slurmctld.
170
171 If multiple GRES of different types are tracked (e.g. GPUs of
172 different types), then job requests with matching type specifi‐
173 cations will be recorded. Given a configuration of "Account‐
174 ingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta" Then
175 "gres/gpu:tesla" and "gres/gpu:volta" will track only jobs that
176 explicitly request those two GPU types, while "gres/gpu" will
177 track allocated GPUs of any type ("tesla", "volta" or any other
178 GPU type).
179
180 Given a configuration of "AccountingStorage‐
181 TRES=gres/gpu:tesla,gres/gpu:volta" Then "gres/gpu:tesla" and
182 "gres/gpu:volta" will track jobs that explicitly request those
183 GPU types. If a job requests GPUs, but does not explicitly
184 specify the GPU type, then its resource allocation will be ac‐
185 counted for as either "gres/gpu:tesla" or "gres/gpu:volta", al‐
186 though the accounting may not match the actual GPU type allo‐
187 cated to the job and the GPUs allocated to the job could be het‐
188 erogeneous. In an environment containing various GPU types, use
189 of a job_submit plugin may be desired in order to force jobs to
190 explicitly specify some GPU type.
191
192 AccountingStorageType
193 The accounting storage mechanism type. Acceptable values at
194 present include "accounting_storage/none" and "accounting_stor‐
195 age/slurmdbd". The "accounting_storage/slurmdbd" value indi‐
196 cates that accounting records will be written to the Slurm DBD,
197 which manages an underlying MySQL database. See "man slurmdbd"
198 for more information. The default value is "accounting_stor‐
199 age/none" and indicates that account records are not maintained.
200
201 AccountingStorageUser
202 The user account for accessing the accounting storage database.
203 Only used for database type storage plugins, ignored otherwise.
204
205 AccountingStoreFlags
206 Comma separated list used to tell the slurmctld to store extra
207 fields that may be more heavy weight than the normal job infor‐
208 mation.
209
210 Current options are:
211
212 job_comment
213 Include the job's comment field in the job complete mes‐
214 sage sent to the Accounting Storage database. Note the
215 AdminComment and SystemComment are always recorded in the
216 database.
217
218 job_env
219 Include a batch job's environment variables used at job
220 submission in the job start message sent to the Account‐
221 ing Storage database.
222
223 job_script
224 Include the job's batch script in the job start message
225 sent to the Accounting Storage database.
226
227 AcctGatherNodeFreq
228 The AcctGather plugins sampling interval for node accounting.
229 For AcctGather plugin values of none, this parameter is ignored.
230 For all other values this parameter is the number of seconds be‐
231 tween node accounting samples. For the acct_gather_energy/rapl
232 plugin, set a value less than 300 because the counters may over‐
233 flow beyond this rate. The default value is zero. This value
234 disables accounting sampling for nodes. Note: The accounting
235 sampling interval for jobs is determined by the value of JobAc‐
236 ctGatherFrequency.
237
238 AcctGatherEnergyType
239 Identifies the plugin to be used for energy consumption account‐
240 ing. The jobacct_gather plugin and slurmd daemon call this
241 plugin to collect energy consumption data for jobs and nodes.
242 The collection of energy consumption data takes place on the
243 node level, hence only in case of exclusive job allocation the
244 energy consumption measurements will reflect the job's real con‐
245 sumption. In case of node sharing between jobs the reported con‐
246 sumed energy per job (through sstat or sacct) will not reflect
247 the real energy consumed by the jobs.
248
249 Configurable values at present are:
250
251 acct_gather_energy/none
252 No energy consumption data is collected.
253
254 acct_gather_energy/ipmi
255 Energy consumption data is collected from
256 the Baseboard Management Controller (BMC)
257 using the Intelligent Platform Management
258 Interface (IPMI).
259
260 acct_gather_energy/pm_counters
261 Energy consumption data is collected from
262 the Baseboard Management Controller (BMC)
263 for HPE Cray systems.
264
265 acct_gather_energy/rapl
266 Energy consumption data is collected from
267 hardware sensors using the Running Average
268 Power Limit (RAPL) mechanism. Note that en‐
269 abling RAPL may require the execution of the
270 command "sudo modprobe msr".
271
272 acct_gather_energy/xcc
273 Energy consumption data is collected from
274 the Lenovo SD650 XClarity Controller (XCC)
275 using IPMI OEM raw commands.
276
277 AcctGatherInterconnectType
278 Identifies the plugin to be used for interconnect network traf‐
279 fic accounting. The jobacct_gather plugin and slurmd daemon
280 call this plugin to collect network traffic data for jobs and
281 nodes. The collection of network traffic data takes place on
282 the node level, hence only in case of exclusive job allocation
283 the collected values will reflect the job's real traffic. In
284 case of node sharing between jobs the reported network traffic
285 per job (through sstat or sacct) will not reflect the real net‐
286 work traffic by the jobs.
287
288 Configurable values at present are:
289
290 acct_gather_interconnect/none
291 No infiniband network data are collected.
292
293 acct_gather_interconnect/ofed
294 Infiniband network traffic data are col‐
295 lected from the hardware monitoring counters
296 of Infiniband devices through the OFED li‐
297 brary. In order to account for per job net‐
298 work traffic, add the "ic/ofed" TRES to Ac‐
299 countingStorageTRES.
300
301 AcctGatherFilesystemType
302 Identifies the plugin to be used for filesystem traffic account‐
303 ing. The jobacct_gather plugin and slurmd daemon call this
304 plugin to collect filesystem traffic data for jobs and nodes.
305 The collection of filesystem traffic data takes place on the
306 node level, hence only in case of exclusive job allocation the
307 collected values will reflect the job's real traffic. In case of
308 node sharing between jobs the reported filesystem traffic per
309 job (through sstat or sacct) will not reflect the real filesys‐
310 tem traffic by the jobs.
311
312
313 Configurable values at present are:
314
315 acct_gather_filesystem/none
316 No filesystem data are collected.
317
318 acct_gather_filesystem/lustre
319 Lustre filesystem traffic data are collected
320 from the counters found in /proc/fs/lustre/.
321 In order to account for per job lustre traf‐
322 fic, add the "fs/lustre" TRES to Account‐
323 ingStorageTRES.
324
325 AcctGatherProfileType
326 Identifies the plugin to be used for detailed job profiling.
327 The jobacct_gather plugin and slurmd daemon call this plugin to
328 collect detailed data such as I/O counts, memory usage, or en‐
329 ergy consumption for jobs and nodes. There are interfaces in
330 this plugin to collect data as step start and completion, task
331 start and completion, and at the account gather frequency. The
332 data collected at the node level is related to jobs only in case
333 of exclusive job allocation.
334
335 Configurable values at present are:
336
337 acct_gather_profile/none
338 No profile data is collected.
339
340 acct_gather_profile/hdf5
341 This enables the HDF5 plugin. The directory
342 where the profile files are stored and which
343 values are collected are configured in the
344 acct_gather.conf file.
345
346 acct_gather_profile/influxdb
347 This enables the influxdb plugin. The in‐
348 fluxdb instance host, port, database, reten‐
349 tion policy and which values are collected
350 are configured in the acct_gather.conf file.
351
352 AllowSpecResourcesUsage
353 If set to "YES", Slurm allows individual jobs to override node's
354 configured CoreSpecCount value. For a job to take advantage of
355 this feature, a command line option of --core-spec must be spec‐
356 ified. The default value for this option is "YES" for Cray sys‐
357 tems and "NO" for other system types.
358
359 AuthAltTypes
360 Comma-separated list of alternative authentication plugins that
361 the slurmctld will permit for communication. Acceptable values
362 at present include auth/jwt.
363
364 NOTE: auth/jwt requires a jwt_hs256.key to be populated in the
365 StateSaveLocation directory for slurmctld only. The
366 jwt_hs256.key should only be visible to the SlurmUser and root.
367 It is not suggested to place the jwt_hs256.key on any nodes but
368 the controller running slurmctld. auth/jwt can be activated by
369 the presence of the SLURM_JWT environment variable. When acti‐
370 vated, it will override the default AuthType.
371
372 AuthAltParameters
373 Used to define alternative authentication plugins options. Mul‐
374 tiple options may be comma separated.
375
376 disable_token_creation
377 Disable "scontrol token" use by non-SlurmUser ac‐
378 counts.
379
380 jwks= Absolute path to JWKS file. Only RS256 keys are
381 supported, although other key types may be listed
382 in the file. If set, no HS256 key will be loaded
383 by default (and token generation is disabled),
384 although the jwt_key setting may be used to ex‐
385 plicitly re-enable HS256 key use (and token gen‐
386 eration).
387
388 jwt_key= Absolute path to JWT key file. Key must be HS256,
389 and should only be accessible by SlurmUser. If
390 not set, the default key file is jwt_hs256.key in
391 StateSaveLocation.
392
393 AuthInfo
394 Additional information to be used for authentication of communi‐
395 cations between the Slurm daemons (slurmctld and slurmd) and the
396 Slurm clients. The interpretation of this option is specific to
397 the configured AuthType. Multiple options may be specified in a
398 comma-delimited list. If not specified, the default authentica‐
399 tion information will be used.
400
401 cred_expire Default job step credential lifetime, in seconds
402 (e.g. "cred_expire=1200"). It must be suffi‐
403 ciently long enough to load user environment, run
404 prolog, deal with the slurmd getting paged out of
405 memory, etc. This also controls how long a re‐
406 queued job must wait before starting again. The
407 default value is 120 seconds.
408
409 socket Path name to a MUNGE daemon socket to use (e.g.
410 "socket=/var/run/munge/munge.socket.2"). The de‐
411 fault value is "/var/run/munge/munge.socket.2".
412 Used by auth/munge and cred/munge.
413
414 ttl Credential lifetime, in seconds (e.g. "ttl=300").
415 The default value is dependent upon the MUNGE in‐
416 stallation, but is typically 300 seconds.
417
418 AuthType
419 The authentication method for communications between Slurm com‐
420 ponents. Acceptable values at present include "auth/munge",
421 which is the default. "auth/munge" indicates that MUNGE is to
422 be used. (See "https://dun.github.io/munge/" for more informa‐
423 tion). All Slurm daemons and commands must be terminated prior
424 to changing the value of AuthType and later restarted.
425
426 BackupAddr
427 Deprecated option, see SlurmctldHost.
428
429 BackupController
430 Deprecated option, see SlurmctldHost.
431
432 The backup controller recovers state information from the State‐
433 SaveLocation directory, which must be readable and writable from
434 both the primary and backup controllers. While not essential,
435 it is recommended that you specify a backup controller. See
436 the RELOCATING CONTROLLERS section if you change this.
437
438 BatchStartTimeout
439 The maximum time (in seconds) that a batch job is permitted for
440 launching before being considered missing and releasing the al‐
441 location. The default value is 10 (seconds). Larger values may
442 be required if more time is required to execute the Prolog, load
443 user environment variables, or if the slurmd daemon gets paged
444 from memory.
445 Note: The test for a job being successfully launched is only
446 performed when the Slurm daemon on the compute node registers
447 state with the slurmctld daemon on the head node, which happens
448 fairly rarely. Therefore a job will not necessarily be termi‐
449 nated if its start time exceeds BatchStartTimeout. This config‐
450 uration parameter is also applied to launch tasks and avoid
451 aborting srun commands due to long running Prolog scripts.
452
453 BcastExclude
454 Comma-separated list of absolute directory paths to be excluded
455 when autodetecting and broadcasting executable shared object de‐
456 pendencies through sbcast or srun --bcast. The keyword "none"
457 can be used to indicate that no directory paths should be ex‐
458 cluded. The default value is "/lib,/usr/lib,/lib64,/usr/lib64".
459 This option can be overridden by sbcast --exclude and srun
460 --bcast-exclude.
461
462 BcastParameters
463 Controls sbcast and srun --bcast behavior. Multiple options can
464 be specified in a comma separated list. Supported values in‐
465 clude:
466
467 DestDir= Destination directory for file being broadcast to
468 allocated compute nodes. Default value is cur‐
469 rent working directory, or --chdir for srun if
470 set.
471
472 Compression= Specify default file compression library to be
473 used. Supported values are "lz4" and "none".
474 The default value with the sbcast --compress op‐
475 tion is "lz4" and "none" otherwise. Some com‐
476 pression libraries may be unavailable on some
477 systems.
478
479 send_libs If set, attempt to autodetect and broadcast the
480 executable's shared object dependencies to allo‐
481 cated compute nodes. The files are placed in a
482 directory alongside the executable. For srun
483 only, the LD_LIBRARY_PATH is automatically up‐
484 dated to include this cache directory as well.
485 This can be overridden with either sbcast or srun
486 --send-libs option. By default this is disabled.
487
488 BurstBufferType
489 The plugin used to manage burst buffers. Acceptable values at
490 present are:
491
492 burst_buffer/datawarp
493 Use Cray DataWarp API to provide burst buffer functional‐
494 ity.
495
496 burst_buffer/lua
497 This plugin provides hooks to an API that is defined by a
498 Lua script. This plugin was developed to provide system
499 administrators with a way to do any task (not only file
500 staging) at different points in a job’s life cycle.
501
502 burst_buffer/none
503
504 CliFilterPlugins
505 A comma-delimited list of command line interface option fil‐
506 ter/modification plugins. The specified plugins will be executed
507 in the order listed. No cli_filter plugins are used by default.
508 Acceptable values at present are:
509
510 cli_filter/lua
511 This plugin allows you to write your own implementation
512 of a cli_filter using lua.
513
514 cli_filter/syslog
515 This plugin enables logging of job submission activities
516 performed. All the salloc/sbatch/srun options are logged
517 to syslog together with environment variables in JSON
518 format. If the plugin is not the last one in the list it
519 may log values different than what was actually sent to
520 slurmctld.
521
522 cli_filter/user_defaults
523 This plugin looks for the file $HOME/.slurm/defaults and
524 reads every line of it as a key=value pair, where key is
525 any of the job submission options available to sal‐
526 loc/sbatch/srun and value is a default value defined by
527 the user. For instance:
528 time=1:30
529 mem=2048
530 The above will result in a user defined default for each
531 of their jobs of "-t 1:30" and "--mem=2048".
532
533 ClusterName
534 The name by which this Slurm managed cluster is known in the ac‐
535 counting database. This is needed distinguish accounting
536 records when multiple clusters report to the same database. Be‐
537 cause of limitations in some databases, any upper case letters
538 in the name will be silently mapped to lower case. In order to
539 avoid confusion, it is recommended that the name be lower case.
540
541 CommunicationParameters
542 Comma-separated options identifying communication options.
543
544 block_null_hash
545 Require all Slurm authentication tokens to in‐
546 clude a newer (20.11.9 and 21.08.8) payload that
547 provides an additional layer of security against
548 credential replay attacks. This option should
549 only be enabled once all Slurm daemons have been
550 upgraded to 20.11.9/21.08.8 or newer, and all
551 jobs that were started before the upgrade have
552 been completed.
553
554 CheckGhalQuiesce
555 Used specifically on a Cray using an Aries Ghal
556 interconnect. This will check to see if the sys‐
557 tem is quiescing when sending a message, and if
558 so, we wait until it is done before sending.
559
560 DisableIPv4 Disable IPv4 only operation for all slurm daemons
561 (except slurmdbd). This should also be set in
562 your slurmdbd.conf file.
563
564 EnableIPv6 Enable using IPv6 addresses for all slurm daemons
565 (except slurmdbd). When using both IPv4 and IPv6,
566 address family preferences will be based on your
567 /etc/gai.conf file. This should also be set in
568 your slurmdbd.conf file.
569
570 NoAddrCache By default, Slurm will cache a node's network ad‐
571 dress after successfully establishing the node's
572 network address. This option disables the cache
573 and Slurm will look up the node's network address
574 each time a connection is made. This is useful,
575 for example, in a cloud environment where the
576 node addresses come and go out of DNS.
577
578 NoCtldInAddrAny
579 Used to directly bind to the address of what the
580 node resolves to running the slurmctld instead of
581 binding messages to any address on the node,
582 which is the default.
583
584 NoInAddrAny Used to directly bind to the address of what the
585 node resolves to instead of binding messages to
586 any address on the node which is the default.
587 This option is for all daemons/clients except for
588 the slurmctld.
589
590 CompleteWait
591 The time to wait, in seconds, when any job is in the COMPLETING
592 state before any additional jobs are scheduled. This is to at‐
593 tempt to keep jobs on nodes that were recently in use, with the
594 goal of preventing fragmentation. If set to zero, pending jobs
595 will be started as soon as possible. Since a COMPLETING job's
596 resources are released for use by other jobs as soon as the Epi‐
597 log completes on each individual node, this can result in very
598 fragmented resource allocations. To provide jobs with the mini‐
599 mum response time, a value of zero is recommended (no waiting).
600 To minimize fragmentation of resources, a value equal to Kill‐
601 Wait plus two is recommended. In that case, setting KillWait to
602 a small value may be beneficial. The default value of Complete‐
603 Wait is zero seconds. The value may not exceed 65533.
604
605 NOTE: Setting reduce_completing_frag affects the behavior of
606 CompleteWait.
607
608 ControlAddr
609 Deprecated option, see SlurmctldHost.
610
611 ControlMachine
612 Deprecated option, see SlurmctldHost.
613
614 CoreSpecPlugin
615 Identifies the plugins to be used for enforcement of core spe‐
616 cialization. A restart of the slurmd daemons is required for
617 changes to this parameter to take effect. Acceptable values at
618 present include:
619
620 core_spec/cray_aries
621 used only for Cray systems
622
623 core_spec/none used for all other system types
624
625 CpuFreqDef
626 Default CPU frequency value or frequency governor to use when
627 running a job step if it has not been explicitly set with the
628 --cpu-freq option. Acceptable values at present include a nu‐
629 meric value (frequency in kilohertz) or one of the following
630 governors:
631
632 Conservative attempts to use the Conservative CPU governor
633
634 OnDemand attempts to use the OnDemand CPU governor
635
636 Performance attempts to use the Performance CPU governor
637
638 PowerSave attempts to use the PowerSave CPU governor
639 There is no default value. If unset, no attempt to set the governor is
640 made if the --cpu-freq option has not been set.
641
642 CpuFreqGovernors
643 List of CPU frequency governors allowed to be set with the sal‐
644 loc, sbatch, or srun option --cpu-freq. Acceptable values at
645 present include:
646
647 Conservative attempts to use the Conservative CPU governor
648
649 OnDemand attempts to use the OnDemand CPU governor (a de‐
650 fault value)
651
652 Performance attempts to use the Performance CPU governor (a
653 default value)
654
655 PowerSave attempts to use the PowerSave CPU governor
656
657 SchedUtil attempts to use the SchedUtil CPU governor
658
659 UserSpace attempts to use the UserSpace CPU governor (a de‐
660 fault value)
661 The default is OnDemand, Performance and UserSpace.
662
663 CredType
664 The cryptographic signature tool to be used in the creation of
665 job step credentials. A restart of slurmctld is required for
666 changes to this parameter to take effect. The default (and rec‐
667 ommended) value is "cred/munge".
668
669 DebugFlags
670 Defines specific subsystems which should provide more detailed
671 event logging. Multiple subsystems can be specified with comma
672 separators. Most DebugFlags will result in verbose-level log‐
673 ging for the identified subsystems, and could impact perfor‐
674 mance. Valid subsystems available include:
675
676 Accrue Accrue counters accounting details
677
678 Agent RPC agents (outgoing RPCs from Slurm daemons)
679
680 Backfill Backfill scheduler details
681
682 BackfillMap Backfill scheduler to log a very verbose map of
683 reserved resources through time. Combine with
684 Backfill for a verbose and complete view of the
685 backfill scheduler's work.
686
687 BurstBuffer Burst Buffer plugin
688
689 Cgroup Cgroup details
690
691 CPU_Bind CPU binding details for jobs and steps
692
693 CpuFrequency Cpu frequency details for jobs and steps using
694 the --cpu-freq option.
695
696 Data Generic data structure details.
697
698 Dependency Job dependency debug info
699
700 Elasticsearch Elasticsearch debug info
701
702 Energy AcctGatherEnergy debug info
703
704 ExtSensors External Sensors debug info
705
706 Federation Federation scheduling debug info
707
708 FrontEnd Front end node details
709
710 Gres Generic resource details
711
712 Hetjob Heterogeneous job details
713
714 Gang Gang scheduling details
715
716 JobAccountGather Common job account gathering details (not
717 plugin specific).
718
719 JobContainer Job container plugin details
720
721 License License management details
722
723 Network Network details. Warning: activating this flag
724 may cause logging of passwords, tokens or other
725 authentication credentials.
726
727 NetworkRaw Dump raw hex values of key Network communica‐
728 tions. Warning: This flag will cause very ver‐
729 bose logs and may cause logging of passwords,
730 tokens or other authentication credentials.
731
732 NodeFeatures Node Features plugin debug info
733
734 NO_CONF_HASH Do not log when the slurm.conf files differ be‐
735 tween Slurm daemons
736
737 Power Power management plugin and power save (sus‐
738 pend/resume programs) details
739
740 Priority Job prioritization
741
742 Profile AcctGatherProfile plugins details
743
744 Protocol Communication protocol details
745
746 Reservation Advanced reservations
747
748 Route Message forwarding debug info
749
750 Script Debug info regarding the process that runs
751 slurmctld scripts such as PrologSlurmctld and
752 EpilogSlurmctld
753
754 SelectType Resource selection plugin
755
756 Steps Slurmctld resource allocation for job steps
757
758 Switch Switch plugin
759
760 TimeCray Timing of Cray APIs
761
762 TraceJobs Trace jobs in slurmctld. It will print detailed
763 job information including state, job ids and
764 allocated nodes counter.
765
766 Triggers Slurmctld triggers
767
768 WorkQueue Work Queue details
769
770 DefCpuPerGPU
771 Default count of CPUs allocated per allocated GPU. This value is
772 used only if the job didn't specify --cpus-per-task and
773 --cpus-per-gpu.
774
775 DefMemPerCPU
776 Default real memory size available per allocated CPU in
777 megabytes. Used to avoid over-subscribing memory and causing
778 paging. DefMemPerCPU would generally be used if individual pro‐
779 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
780 lectType=select/cons_tres). The default value is 0 (unlimited).
781 Also see DefMemPerGPU, DefMemPerNode and MaxMemPerCPU. DefMem‐
782 PerCPU, DefMemPerGPU and DefMemPerNode are mutually exclusive.
783
784 DefMemPerGPU
785 Default real memory size available per allocated GPU in
786 megabytes. The default value is 0 (unlimited). Also see
787 DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU and
788 DefMemPerNode are mutually exclusive.
789
790 DefMemPerNode
791 Default real memory size available per allocated node in
792 megabytes. Used to avoid over-subscribing memory and causing
793 paging. DefMemPerNode would generally be used if whole nodes
794 are allocated to jobs (SelectType=select/linear) and resources
795 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
796 The default value is 0 (unlimited). Also see DefMemPerCPU,
797 DefMemPerGPU and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
798 DefMemPerNode are mutually exclusive.
799
800 DependencyParameters
801 Multiple options may be comma separated.
802
803 disable_remote_singleton
804 By default, when a federated job has a singleton depen‐
805 dency, each cluster in the federation must clear the sin‐
806 gleton dependency before the job's singleton dependency
807 is considered satisfied. Enabling this option means that
808 only the origin cluster must clear the singleton depen‐
809 dency. This option must be set in every cluster in the
810 federation.
811
812 kill_invalid_depend
813 If a job has an invalid dependency and it can never run
814 terminate it and set its state to be JOB_CANCELLED. By
815 default the job stays pending with reason DependencyNev‐
816 erSatisfied. max_depend_depth=# Maximum number of jobs
817 to test for a circular job dependency. Stop testing after
818 this number of job dependencies have been tested. The de‐
819 fault value is 10 jobs.
820
821 DisableRootJobs
822 If set to "YES" then user root will be prevented from running
823 any jobs. The default value is "NO", meaning user root will be
824 able to execute jobs. DisableRootJobs may also be set by parti‐
825 tion.
826
827 EioTimeout
828 The number of seconds srun waits for slurmstepd to close the
829 TCP/IP connection used to relay data between the user applica‐
830 tion and srun when the user application terminates. The default
831 value is 60 seconds. May not exceed 65533.
832
833 EnforcePartLimits
834 If set to "ALL" then jobs which exceed a partition's size and/or
835 time limits will be rejected at submission time. If job is sub‐
836 mitted to multiple partitions, the job must satisfy the limits
837 on all the requested partitions. If set to "NO" then the job
838 will be accepted and remain queued until the partition limits
839 are altered(Time and Node Limits). If set to "ANY" a job must
840 satisfy any of the requested partitions to be submitted. The de‐
841 fault value is "NO". NOTE: If set, then a job's QOS can not be
842 used to exceed partition limits. NOTE: The partition limits be‐
843 ing considered are its configured MaxMemPerCPU, MaxMemPerNode,
844 MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, Allow‐
845 Groups, AllowQOS, and QOS usage threshold.
846
847 Epilog Fully qualified pathname of a script to execute as user root on
848 every node when a user's job completes (e.g. "/usr/lo‐
849 cal/slurm/epilog"). A glob pattern (See glob (7)) may also be
850 used to run more than one epilog script (e.g. "/etc/slurm/epi‐
851 log.d/*"). The Epilog script or scripts may be used to purge
852 files, disable user login, etc. By default there is no epilog.
853 See Prolog and Epilog Scripts for more information.
854
855 EpilogMsgTime
856 The number of microseconds that the slurmctld daemon requires to
857 process an epilog completion message from the slurmd daemons.
858 This parameter can be used to prevent a burst of epilog comple‐
859 tion messages from being sent at the same time which should help
860 prevent lost messages and improve throughput for large jobs.
861 The default value is 2000 microseconds. For a 1000 node job,
862 this spreads the epilog completion messages out over two sec‐
863 onds.
864
865 EpilogSlurmctld
866 Fully qualified pathname of a program for the slurmctld to exe‐
867 cute upon termination of a job allocation (e.g. "/usr/lo‐
868 cal/slurm/epilog_controller"). The program executes as Slur‐
869 mUser, which gives it permission to drain nodes and requeue the
870 job if a failure occurs (See scontrol(1)). Exactly what the
871 program does and how it accomplishes this is completely at the
872 discretion of the system administrator. Information about the
873 job being initiated, its allocated nodes, etc. are passed to the
874 program using environment variables. See Prolog and Epilog
875 Scripts for more information.
876
877 ExtSensorsFreq
878 The external sensors plugin sampling interval. If ExtSen‐
879 sorsType=ext_sensors/none, this parameter is ignored. For all
880 other values of ExtSensorsType, this parameter is the number of
881 seconds between external sensors samples for hardware components
882 (nodes, switches, etc.) The default value is zero. This value
883 disables external sensors sampling. Note: This parameter does
884 not affect external sensors data collection for jobs/steps.
885
886 ExtSensorsType
887 Identifies the plugin to be used for external sensors data col‐
888 lection. Slurmctld calls this plugin to collect external sen‐
889 sors data for jobs/steps and hardware components. In case of
890 node sharing between jobs the reported values per job/step
891 (through sstat or sacct) may not be accurate. See also "man
892 ext_sensors.conf".
893
894 Configurable values at present are:
895
896 ext_sensors/none No external sensors data is collected.
897
898 ext_sensors/rrd External sensors data is collected from the
899 RRD database.
900
901 FairShareDampeningFactor
902 Dampen the effect of exceeding a user or group's fair share of
903 allocated resources. Higher values will provides greater ability
904 to differentiate between exceeding the fair share at high levels
905 (e.g. a value of 1 results in almost no difference between over‐
906 consumption by a factor of 10 and 100, while a value of 5 will
907 result in a significant difference in priority). The default
908 value is 1.
909
910 FederationParameters
911 Used to define federation options. Multiple options may be comma
912 separated.
913
914 fed_display
915 If set, then the client status commands (e.g. squeue,
916 sinfo, sprio, etc.) will display information in a feder‐
917 ated view by default. This option is functionally equiva‐
918 lent to using the --federation options on each command.
919 Use the client's --local option to override the federated
920 view and get a local view of the given cluster.
921
922 FirstJobId
923 The job id to be used for the first job submitted to Slurm. Job
924 id values generated will incremented by 1 for each subsequent
925 job. Value must be larger than 0. The default value is 1. Also
926 see MaxJobId
927
928 GetEnvTimeout
929 Controls how long the job should wait (in seconds) to load the
930 user's environment before attempting to load it from a cache
931 file. Applies when the salloc or sbatch --get-user-env option
932 is used. If set to 0 then always load the user's environment
933 from the cache file. The default value is 2 seconds.
934
935 GresTypes
936 A comma-delimited list of generic resources to be managed (e.g.
937 GresTypes=gpu,mps). These resources may have an associated GRES
938 plugin of the same name providing additional functionality. No
939 generic resources are managed by default. Ensure this parameter
940 is consistent across all nodes in the cluster for proper opera‐
941 tion. A restart of slurmctld and the slurmd daemons is required
942 for this to take effect.
943
944 GroupUpdateForce
945 If set to a non-zero value, then information about which users
946 are members of groups allowed to use a partition will be updated
947 periodically, even when there have been no changes to the
948 /etc/group file. If set to zero, group member information will
949 be updated only after the /etc/group file is updated. The de‐
950 fault value is 1. Also see the GroupUpdateTime parameter.
951
952 GroupUpdateTime
953 Controls how frequently information about which users are mem‐
954 bers of groups allowed to use a partition will be updated, and
955 how long user group membership lists will be cached. The time
956 interval is given in seconds with a default value of 600 sec‐
957 onds. A value of zero will prevent periodic updating of group
958 membership information. Also see the GroupUpdateForce parame‐
959 ter.
960
961 GpuFreqDef=[<type]=value>[,<type=value>]
962 Default GPU frequency to use when running a job step if it has
963 not been explicitly set using the --gpu-freq option. This op‐
964 tion can be used to independently configure the GPU and its mem‐
965 ory frequencies. Defaults to "high,memory=high". After the job
966 is completed, the frequencies of all affected GPUs will be reset
967 to the highest possible values. In some cases, system power
968 caps may override the requested values. The field type can be
969 "memory". If type is not specified, the GPU frequency is im‐
970 plied. The value field can either be "low", "medium", "high",
971 "highm1" or a numeric value in megahertz (MHz). If the speci‐
972 fied numeric value is not possible, a value as close as possible
973 will be used. See below for definition of the values. Examples
974 of use include "GpuFreqDef=medium,memory=high and "GpuFre‐
975 qDef=450".
976
977 Supported value definitions:
978
979 low the lowest available frequency.
980
981 medium attempts to set a frequency in the middle of the
982 available range.
983
984 high the highest available frequency.
985
986 highm1 (high minus one) will select the next highest avail‐
987 able frequency.
988
989 HealthCheckInterval
990 The interval in seconds between executions of HealthCheckPro‐
991 gram. The default value is zero, which disables execution.
992
993 HealthCheckNodeState
994 Identify what node states should execute the HealthCheckProgram.
995 Multiple state values may be specified with a comma separator.
996 The default value is ANY to execute on nodes in any state.
997
998 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
999 cated).
1000
1001 ANY Run on nodes in any state.
1002
1003 CYCLE Rather than running the health check program on all
1004 nodes at the same time, cycle through running on all
1005 compute nodes through the course of the HealthCheck‐
1006 Interval. May be combined with the various node
1007 state options.
1008
1009 IDLE Run on nodes in the IDLE state.
1010
1011 MIXED Run on nodes in the MIXED state (some CPUs idle and
1012 other CPUs allocated).
1013
1014 HealthCheckProgram
1015 Fully qualified pathname of a script to execute as user root pe‐
1016 riodically on all compute nodes that are not in the NOT_RESPOND‐
1017 ING state. This program may be used to verify the node is fully
1018 operational and DRAIN the node or send email if a problem is de‐
1019 tected. Any action to be taken must be explicitly performed by
1020 the program (e.g. execute "scontrol update NodeName=foo
1021 State=drain Reason=tmp_file_system_full" to drain a node). The
1022 execution interval is controlled using the HealthCheckInterval
1023 parameter. Note that the HealthCheckProgram will be executed at
1024 the same time on all nodes to minimize its impact upon parallel
1025 programs. This program is will be killed if it does not termi‐
1026 nate normally within 60 seconds. This program will also be exe‐
1027 cuted when the slurmd daemon is first started and before it reg‐
1028 isters with the slurmctld daemon. By default, no program will
1029 be executed.
1030
1031 InactiveLimit
1032 The interval, in seconds, after which a non-responsive job allo‐
1033 cation command (e.g. srun or salloc) will result in the job be‐
1034 ing terminated. If the node on which the command is executed
1035 fails or the command abnormally terminates, this will terminate
1036 its job allocation. This option has no effect upon batch jobs.
1037 When setting a value, take into consideration that a debugger
1038 using srun to launch an application may leave the srun command
1039 in a stopped state for extended periods of time. This limit is
1040 ignored for jobs running in partitions with the RootOnly flag
1041 set (the scheduler running as root will be responsible for the
1042 job). The default value is unlimited (zero) and may not exceed
1043 65533 seconds.
1044
1045 InteractiveStepOptions
1046 When LaunchParameters=use_interactive_step is enabled, launching
1047 salloc will automatically start an srun process with Interac‐
1048 tiveStepOptions to launch a terminal on a node in the job allo‐
1049 cation. The default value is "--interactive --preserve-env
1050 --pty $SHELL". The "--interactive" option is intentionally not
1051 documented in the srun man page. It is meant only to be used in
1052 InteractiveStepOptions in order to create an "interactive step"
1053 that will not consume resources so that other steps may run in
1054 parallel with the interactive step.
1055
1056 JobAcctGatherType
1057 The job accounting mechanism type. Acceptable values at present
1058 include "jobacct_gather/linux" (for Linux systems),
1059 "jobacct_gather/cgroup" and "jobacct_gather/none" (no accounting
1060 data collected). The default value is "jobacct_gather/none".
1061 "jobacct_gather/cgroup" is a plugin for the Linux operating sys‐
1062 tem that uses cgroups to collect accounting statistics. The
1063 plugin collects the following statistics: From the cgroup memory
1064 subsystem: memory.usage_in_bytes (reported as 'pages') and rss
1065 from memory.stat (reported as 'rss'). From the cgroup cpuacct
1066 subsystem: user cpu time and system cpu time. No value is pro‐
1067 vided by cgroups for virtual memory size ('vsize'). In order to
1068 use the sstat tool "jobacct_gather/linux", or
1069 "jobacct_gather/cgroup" must be configured.
1070 NOTE: Changing this configuration parameter changes the contents
1071 of the messages between Slurm daemons. Any previously running
1072 job steps are managed by a slurmstepd daemon that will persist
1073 through the lifetime of that job step and not change its commu‐
1074 nication protocol. Only change this configuration parameter when
1075 there are no running job steps.
1076
1077 JobAcctGatherFrequency
1078 The job accounting and profiling sampling intervals. The sup‐
1079 ported format is follows:
1080
1081 JobAcctGatherFrequency=<datatype>=<interval>
1082 where <datatype>=<interval> specifies the task sam‐
1083 pling interval for the jobacct_gather plugin or a
1084 sampling interval for a profiling type by the
1085 acct_gather_profile plugin. Multiple, comma-sepa‐
1086 rated <datatype>=<interval> intervals may be speci‐
1087 fied. Supported datatypes are as follows:
1088
1089 task=<interval>
1090 where <interval> is the task sampling inter‐
1091 val in seconds for the jobacct_gather plugins
1092 and for task profiling by the
1093 acct_gather_profile plugin.
1094
1095 energy=<interval>
1096 where <interval> is the sampling interval in
1097 seconds for energy profiling using the
1098 acct_gather_energy plugin
1099
1100 network=<interval>
1101 where <interval> is the sampling interval in
1102 seconds for infiniband profiling using the
1103 acct_gather_interconnect plugin.
1104
1105 filesystem=<interval>
1106 where <interval> is the sampling interval in
1107 seconds for filesystem profiling using the
1108 acct_gather_filesystem plugin.
1109
1110
1111 The default value for task sampling
1112 interval
1113 is 30 seconds. The default value for all other intervals is 0.
1114 An interval of 0 disables sampling of the specified type. If
1115 the task sampling interval is 0, accounting information is col‐
1116 lected only at job termination (reducing Slurm interference with
1117 the job).
1118 Smaller (non-zero) values have a greater impact upon job perfor‐
1119 mance, but a value of 30 seconds is not likely to be noticeable
1120 for applications having less than 10,000 tasks.
1121 Users can independently override each interval on a per job ba‐
1122 sis using the --acctg-freq option when submitting the job.
1123
1124 JobAcctGatherParams
1125 Arbitrary parameters for the job account gather plugin. Accept‐
1126 able values at present include:
1127
1128 NoShared Exclude shared memory from accounting.
1129
1130 UsePss Use PSS value instead of RSS to calculate
1131 real usage of memory. The PSS value will be
1132 saved as RSS.
1133
1134 OverMemoryKill Kill processes that are being detected to
1135 use more memory than requested by steps ev‐
1136 ery time accounting information is gathered
1137 by the JobAcctGather plugin. This parameter
1138 should be used with caution because a job
1139 exceeding its memory allocation may affect
1140 other processes and/or machine health.
1141
1142 NOTE: If available, it is recommended to
1143 limit memory by enabling task/cgroup as a
1144 TaskPlugin and making use of Constrain‐
1145 RAMSpace=yes in the cgroup.conf instead of
1146 using this JobAcctGather mechanism for mem‐
1147 ory enforcement. Using JobAcctGather is
1148 polling based and there is a delay before a
1149 job is killed, which could lead to system
1150 Out of Memory events.
1151
1152 NOTE: When using OverMemoryKill, if the com‐
1153 bined memory used by all the processes in a
1154 step exceeds the memory limit, the entire
1155 step will be killed/cancelled by the JobAc‐
1156 ctGather plugin. This differs from the be‐
1157 havior when using ConstrainRAMSpace, where
1158 processes in the step will be killed, but
1159 the step will be left active, possibly with
1160 other processes left running.
1161
1162 JobCompHost
1163 The name of the machine hosting the job completion database.
1164 Only used for database type storage plugins, ignored otherwise.
1165
1166 JobCompLoc
1167 The fully qualified file name where job completion records are
1168 written when the JobCompType is "jobcomp/filetxt" or the data‐
1169 base where job completion records are stored when the JobComp‐
1170 Type is a database, or a complete URL endpoint with format
1171 <host>:<port>/<target>/_doc when JobCompType is "jobcomp/elas‐
1172 ticsearch" like i.e. "localhost:9200/slurm/_doc". NOTE: More
1173 information is available at the Slurm web site
1174 <https://slurm.schedmd.com/elasticsearch.html>.
1175
1176 JobCompParams
1177 Pass arbitrary text string to job completion plugin. Also see
1178 JobCompType.
1179
1180 JobCompPass
1181 The password used to gain access to the database to store the
1182 job completion data. Only used for database type storage plug‐
1183 ins, ignored otherwise.
1184
1185 JobCompPort
1186 The listening port of the job completion database server. Only
1187 used for database type storage plugins, ignored otherwise.
1188
1189 JobCompType
1190 The job completion logging mechanism type. Acceptable values at
1191 present include:
1192
1193 jobcomp/none
1194 Upon job completion, a record of the job is purged from
1195 the system. If using the accounting infrastructure this
1196 plugin may not be of interest since some of the informa‐
1197 tion is redundant.
1198
1199 jobcomp/elasticsearch
1200 Upon job completion, a record of the job should be writ‐
1201 ten to an Elasticsearch server, specified by the JobCom‐
1202 pLoc parameter.
1203 NOTE: More information is available at the Slurm web site
1204 ( https://slurm.schedmd.com/elasticsearch.html ).
1205
1206 jobcomp/filetxt
1207 Upon job completion, a record of the job should be writ‐
1208 ten to a text file, specified by the JobCompLoc parame‐
1209 ter.
1210
1211 jobcomp/lua
1212 Upon job completion, a record of the job should be pro‐
1213 cessed by the jobcomp.lua script, located in the default
1214 script directory (typically the subdirectory etc of the
1215 installation directory.
1216
1217 jobcomp/mysql
1218 Upon job completion, a record of the job should be writ‐
1219 ten to a MySQL or MariaDB database, specified by the Job‐
1220 CompLoc parameter.
1221
1222 jobcomp/script
1223 Upon job completion, a script specified by the JobCompLoc
1224 parameter is to be executed with environment variables
1225 providing the job information.
1226
1227 JobCompUser
1228 The user account for accessing the job completion database.
1229 Only used for database type storage plugins, ignored otherwise.
1230
1231 JobContainerType
1232 Identifies the plugin to be used for job tracking. A restart of
1233 slurmctld is required for changes to this parameter to take ef‐
1234 fect. NOTE: The JobContainerType applies to a job allocation,
1235 while ProctrackType applies to job steps. Acceptable values at
1236 present include:
1237
1238 job_container/cncu Used only for Cray systems (CNCU = Compute
1239 Node Clean Up)
1240
1241 job_container/none Used for all other system types
1242
1243 job_container/tmpfs Used to create a private namespace on the
1244 filesystem for jobs, which houses temporary
1245 file systems (/tmp and /dev/shm) for each
1246 job. 'PrologFlags=Contain' must be set to
1247 use this plugin.
1248
1249 JobFileAppend
1250 This option controls what to do if a job's output or error file
1251 exist when the job is started. If JobFileAppend is set to a
1252 value of 1, then append to the existing file. By default, any
1253 existing file is truncated.
1254
1255 JobRequeue
1256 This option controls the default ability for batch jobs to be
1257 requeued. Jobs may be requeued explicitly by a system adminis‐
1258 trator, after node failure, or upon preemption by a higher pri‐
1259 ority job. If JobRequeue is set to a value of 1, then batch
1260 jobs may be requeued unless explicitly disabled by the user. If
1261 JobRequeue is set to a value of 0, then batch jobs will not be
1262 requeued unless explicitly enabled by the user. Use the sbatch
1263 --no-requeue or --requeue option to change the default behavior
1264 for individual jobs. The default value is 1.
1265
1266 JobSubmitPlugins
1267 A comma-delimited list of job submission plugins to be used.
1268 The specified plugins will be executed in the order listed.
1269 These are intended to be site-specific plugins which can be used
1270 to set default job parameters and/or logging events. Sample
1271 plugins available in the distribution include "all_partitions",
1272 "defaults", "logging", "lua", and "partition". For examples of
1273 use, see the Slurm code in "src/plugins/job_submit" and "con‐
1274 tribs/lua/job_submit*.lua" then modify the code to satisfy your
1275 needs. Slurm can be configured to use multiple job_submit plug‐
1276 ins if desired, however the lua plugin will only execute one lua
1277 script named "job_submit.lua" located in the default script di‐
1278 rectory (typically the subdirectory "etc" of the installation
1279 directory). No job submission plugins are used by default.
1280
1281 KeepAliveTime
1282 Specifies how long sockets communications used between the srun
1283 command and its slurmstepd process are kept alive after discon‐
1284 nect. Longer values can be used to improve reliability of com‐
1285 munications in the event of network failures. The default value
1286 leaves the system default value. The value may not exceed
1287 65533.
1288
1289 KillOnBadExit
1290 If set to 1, a step will be terminated immediately if any task
1291 is crashed or aborted, as indicated by a non-zero exit code.
1292 With the default value of 0, if one of the processes is crashed
1293 or aborted the other processes will continue to run while the
1294 crashed or aborted process waits. The user can override this
1295 configuration parameter by using srun's -K, --kill-on-bad-exit.
1296
1297 KillWait
1298 The interval, in seconds, given to a job's processes between the
1299 SIGTERM and SIGKILL signals upon reaching its time limit. If
1300 the job fails to terminate gracefully in the interval specified,
1301 it will be forcibly terminated. The default value is 30 sec‐
1302 onds. The value may not exceed 65533.
1303
1304 NodeFeaturesPlugins
1305 Identifies the plugins to be used for support of node features
1306 which can change through time. For example, a node which might
1307 be booted with various BIOS setting. This is supported through
1308 the use of a node's active_features and available_features in‐
1309 formation. Acceptable values at present include:
1310
1311 node_features/knl_cray
1312 Used only for Intel Knights Landing processors (KNL) on
1313 Cray systems.
1314
1315 node_features/knl_generic
1316 Used for Intel Knights Landing processors (KNL) on a
1317 generic Linux system.
1318
1319 node_features/helpers
1320 Used to report and modify features on nodes using arbi‐
1321 trary scripts or programs.
1322
1323 LaunchParameters
1324 Identifies options to the job launch plugin. Acceptable values
1325 include:
1326
1327 batch_step_set_cpu_freq Set the cpu frequency for the batch step
1328 from given --cpu-freq, or slurm.conf
1329 CpuFreqDef, option. By default only
1330 steps started with srun will utilize the
1331 cpu freq setting options.
1332
1333 NOTE: If you are using srun to launch
1334 your steps inside a batch script (ad‐
1335 vised) this option will create a situa‐
1336 tion where you may have multiple agents
1337 setting the cpu_freq as the batch step
1338 usually runs on the same resources one
1339 or more steps the sruns in the script
1340 will create.
1341
1342 cray_net_exclusive Allow jobs on a Cray Native cluster ex‐
1343 clusive access to network resources.
1344 This should only be set on clusters pro‐
1345 viding exclusive access to each node to
1346 a single job at once, and not using par‐
1347 allel steps within the job, otherwise
1348 resources on the node can be oversub‐
1349 scribed.
1350
1351 enable_nss_slurm Permits passwd and group resolution for
1352 a job to be serviced by slurmstepd
1353 rather than requiring a lookup from a
1354 network based service. See
1355 https://slurm.schedmd.com/nss_slurm.html
1356 for more information.
1357
1358 lustre_no_flush If set on a Cray Native cluster, then do
1359 not flush the Lustre cache on job step
1360 completion. This setting will only take
1361 effect after reconfiguring, and will
1362 only take effect for newly launched
1363 jobs.
1364
1365 mem_sort Sort NUMA memory at step start. User can
1366 override this default with
1367 SLURM_MEM_BIND environment variable or
1368 --mem-bind=nosort command line option.
1369
1370 mpir_use_nodeaddr When launching tasks Slurm creates en‐
1371 tries in MPIR_proctable that are used by
1372 parallel debuggers, profilers, and re‐
1373 lated tools to attach to running
1374 process. By default the MPIR_proctable
1375 entries contain MPIR_procdesc structures
1376 where the host_name is set to NodeName
1377 by default. If this option is specified,
1378 NodeAddr will be used in this context
1379 instead.
1380
1381 disable_send_gids By default, the slurmctld will look up
1382 and send the user_name and extended gids
1383 for a job, rather than independently on
1384 each node as part of each task launch.
1385 This helps mitigate issues around name
1386 service scalability when launching jobs
1387 involving many nodes. Using this option
1388 will disable this functionality. This
1389 option is ignored if enable_nss_slurm is
1390 specified.
1391
1392 slurmstepd_memlock Lock the slurmstepd process's current
1393 memory in RAM.
1394
1395 slurmstepd_memlock_all Lock the slurmstepd process's current
1396 and future memory in RAM.
1397
1398 test_exec Have srun verify existence of the exe‐
1399 cutable program along with user execute
1400 permission on the node where srun was
1401 called before attempting to launch it on
1402 nodes in the step.
1403
1404 use_interactive_step Have salloc use the Interactive Step to
1405 launch a shell on an allocated compute
1406 node rather than locally to wherever
1407 salloc was invoked. This is accomplished
1408 by launching the srun command with In‐
1409 teractiveStepOptions as options.
1410
1411 This does not affect salloc called with
1412 a command as an argument. These jobs
1413 will continue to be executed as the
1414 calling user on the calling host.
1415
1416 LaunchType
1417 Identifies the mechanism to be used to launch application tasks.
1418 Acceptable values include:
1419
1420 launch/slurm
1421 The default value.
1422
1423 Licenses
1424 Specification of licenses (or other resources available on all
1425 nodes of the cluster) which can be allocated to jobs. License
1426 names can optionally be followed by a colon and count with a de‐
1427 fault count of one. Multiple license names should be comma sep‐
1428 arated (e.g. "Licenses=foo:4,bar"). Note that Slurm prevents
1429 jobs from being scheduled if their required license specifica‐
1430 tion is not available. Slurm does not prevent jobs from using
1431 licenses that are not explicitly listed in the job submission
1432 specification.
1433
1434 LogTimeFormat
1435 Format of the timestamp in slurmctld and slurmd log files. Ac‐
1436 cepted values are "iso8601", "iso8601_ms", "rfc5424",
1437 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1438 ing in "_ms" differ from the ones without in that fractional
1439 seconds with millisecond precision are printed. The default
1440 value is "iso8601_ms". The "rfc5424" formats are the same as the
1441 "iso8601" formats except that the timezone value is also shown.
1442 The "clock" format shows a timestamp in microseconds retrieved
1443 with the C standard clock() function. The "short" format is a
1444 short date and time format. The "thread_id" format shows the
1445 timestamp in the C standard ctime() function form without the
1446 year but including the microseconds, the daemon's process ID and
1447 the current thread name and ID.
1448
1449 MailDomain
1450 Domain name to qualify usernames if email address is not explic‐
1451 itly given with the "--mail-user" option. If unset, the local
1452 MTA will need to qualify local address itself. Changes to Mail‐
1453 Domain will only affect new jobs.
1454
1455 MailProg
1456 Fully qualified pathname to the program used to send email per
1457 user request. The default value is "/bin/mail" (or
1458 "/usr/bin/mail" if "/bin/mail" does not exist but
1459 "/usr/bin/mail" does exist). The program is called with argu‐
1460 ments suitable for the default mail command, however additional
1461 information about the job is passed in the form of environment
1462 variables.
1463
1464 Additional variables are the same as those passed to Pro‐
1465 logSlurmctld and EpilogSlurmctld with additional variables in
1466 the following contexts:
1467
1468 ALL
1469
1470 SLURM_JOB_STATE
1471 The base state of the job when the MailProg is
1472 called.
1473
1474 SLURM_JOB_MAIL_TYPE
1475 The mail type triggering the mail.
1476
1477 BEGIN
1478
1479 SLURM_JOB_QEUEUED_TIME
1480 The amount of time the job was queued.
1481
1482 END, FAIL, REQUEUE, TIME_LIMIT_*
1483
1484 SLURM_JOB_RUN_TIME
1485 The amount of time the job ran for.
1486
1487 END, FAIL
1488
1489 SLURM_JOB_EXIT_CODE_MAX
1490 Job's exit code or highest exit code for an array
1491 job.
1492
1493 SLURM_JOB_EXIT_CODE_MIN
1494 Job's minimum exit code for an array job.
1495
1496 SLURM_JOB_TERM_SIGNAL_MAX
1497 Job's highest signal for an array job.
1498
1499 STAGE_OUT
1500
1501 SLURM_JOB_STAGE_OUT_TIME
1502 Job's staging out time.
1503
1504 MaxArraySize
1505 The maximum job array task index value will be one less than
1506 MaxArraySize to allow for an index value of zero. Configure
1507 MaxArraySize to 0 in order to disable job array use. The value
1508 may not exceed 4000001. The value of MaxJobCount should be much
1509 larger than MaxArraySize. The default value is 1001. See also
1510 max_array_tasks in SchedulerParameters.
1511
1512 MaxDBDMsgs
1513 When communication to the SlurmDBD is not possible the slurmctld
1514 will queue messages meant to processed when the SlurmDBD is
1515 available again. In order to avoid running out of memory the
1516 slurmctld will only queue so many messages. The default value is
1517 10000, or MaxJobCount * 2 + Node Count * 4, whichever is
1518 greater. The value can not be less than 10000.
1519
1520 MaxJobCount
1521 The maximum number of jobs slurmctld can have in memory at one
1522 time. Combine with MinJobAge to ensure the slurmctld daemon
1523 does not exhaust its memory or other resources. Once this limit
1524 is reached, requests to submit additional jobs will fail. The
1525 default value is 10000 jobs. NOTE: Each task of a job array
1526 counts as one job even though they will not occupy separate job
1527 records until modified or initiated. Performance can suffer
1528 with more than a few hundred thousand jobs. Setting per MaxSub‐
1529 mitJobs per user is generally valuable to prevent a single user
1530 from filling the system with jobs. This is accomplished using
1531 Slurm's database and configuring enforcement of resource limits.
1532 A restart of slurmctld is required for changes to this parameter
1533 to take effect.
1534
1535 MaxJobId
1536 The maximum job id to be used for jobs submitted to Slurm with‐
1537 out a specific requested value. Job ids are unsigned 32bit inte‐
1538 gers with the first 26 bits reserved for local job ids and the
1539 remaining 6 bits reserved for a cluster id to identify a feder‐
1540 ated job's origin. The maximum allowed local job id is
1541 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1542 (0x03ff0000). MaxJobId only applies to the local job id and not
1543 the federated job id. Job id values generated will be incre‐
1544 mented by 1 for each subsequent job. Once MaxJobId is reached,
1545 the next job will be assigned FirstJobId. Federated jobs will
1546 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1547 bId.
1548
1549 MaxMemPerCPU
1550 Maximum real memory size available per allocated CPU in
1551 megabytes. Used to avoid over-subscribing memory and causing
1552 paging. MaxMemPerCPU would generally be used if individual pro‐
1553 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
1554 lectType=select/cons_tres). The default value is 0 (unlimited).
1555 Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode. MaxMem‐
1556 PerCPU and MaxMemPerNode are mutually exclusive.
1557
1558 NOTE: If a job specifies a memory per CPU limit that exceeds
1559 this system limit, that job's count of CPUs per task will try to
1560 automatically increase. This may result in the job failing due
1561 to CPU count limits. This auto-adjustment feature is a best-ef‐
1562 fort one and optimal assignment is not guaranteed due to the
1563 possibility of having heterogeneous configurations and
1564 multi-partition/qos jobs. If this is a concern it is advised to
1565 use a job submit LUA plugin instead to enforce auto-adjustments
1566 to your specific needs.
1567
1568 MaxMemPerNode
1569 Maximum real memory size available per allocated node in
1570 megabytes. Used to avoid over-subscribing memory and causing
1571 paging. MaxMemPerNode would generally be used if whole nodes
1572 are allocated to jobs (SelectType=select/linear) and resources
1573 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1574 The default value is 0 (unlimited). Also see DefMemPerNode and
1575 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
1576 clusive.
1577
1578 MaxStepCount
1579 The maximum number of steps that any job can initiate. This pa‐
1580 rameter is intended to limit the effect of bad batch scripts.
1581 The default value is 40000 steps.
1582
1583 MaxTasksPerNode
1584 Maximum number of tasks Slurm will allow a job step to spawn on
1585 a single node. The default MaxTasksPerNode is 512. May not ex‐
1586 ceed 65533.
1587
1588 MCSParameters
1589 MCS = Multi-Category Security MCS Plugin Parameters. The sup‐
1590 ported parameters are specific to the MCSPlugin. Changes to
1591 this value take effect when the Slurm daemons are reconfigured.
1592 More information about MCS is available here
1593 <https://slurm.schedmd.com/mcs.html>.
1594
1595 MCSPlugin
1596 MCS = Multi-Category Security : associate a security label to
1597 jobs and ensure that nodes can only be shared among jobs using
1598 the same security label. Acceptable values include:
1599
1600 mcs/none is the default value. No security label associated
1601 with jobs, no particular security restriction when
1602 sharing nodes among jobs.
1603
1604 mcs/account only users with the same account can share the nodes
1605 (requires enabling of accounting).
1606
1607 mcs/group only users with the same group can share the nodes.
1608
1609 mcs/user a node cannot be shared with other users.
1610
1611 MessageTimeout
1612 Time permitted for a round-trip communication to complete in
1613 seconds. Default value is 10 seconds. For systems with shared
1614 nodes, the slurmd daemon could be paged out and necessitate
1615 higher values.
1616
1617 MinJobAge
1618 The minimum age of a completed job before its record is cleared
1619 from the list of jobs slurmctld keeps in memory. Combine with
1620 MaxJobCount to ensure the slurmctld daemon does not exhaust its
1621 memory or other resources. The default value is 300 seconds. A
1622 value of zero prevents any job record purging. Jobs are not
1623 purged during a backfill cycle, so it can take longer than Min‐
1624 JobAge seconds to purge a job if using the backfill scheduling
1625 plugin. In order to eliminate some possible race conditions,
1626 the minimum non-zero value for MinJobAge recommended is 2.
1627
1628 MpiDefault
1629 Identifies the default type of MPI to be used. Srun may over‐
1630 ride this configuration parameter in any case. Currently sup‐
1631 ported versions include: pmi2, pmix, and none (default, which
1632 works for many other versions of MPI). More information about
1633 MPI use is available here
1634 <https://slurm.schedmd.com/mpi_guide.html>.
1635
1636 MpiParams
1637 MPI parameters. Used to identify ports used by older versions
1638 of OpenMPI and native Cray systems. The input format is
1639 "ports=12000-12999" to identify a range of communication ports
1640 to be used. NOTE: This is not needed for modern versions of
1641 OpenMPI, taking it out can cause a small boost in scheduling
1642 performance. NOTE: This is require for Cray's PMI.
1643
1644 OverTimeLimit
1645 Number of minutes by which a job can exceed its time limit be‐
1646 fore being canceled. Normally a job's time limit is treated as
1647 a hard limit and the job will be killed upon reaching that
1648 limit. Configuring OverTimeLimit will result in the job's time
1649 limit being treated like a soft limit. Adding the OverTimeLimit
1650 value to the soft time limit provides a hard time limit, at
1651 which point the job is canceled. This is particularly useful
1652 for backfill scheduling, which bases upon each job's soft time
1653 limit. The default value is zero. May not exceed 65533 min‐
1654 utes. A value of "UNLIMITED" is also supported.
1655
1656 PluginDir
1657 Identifies the places in which to look for Slurm plugins. This
1658 is a colon-separated list of directories, like the PATH environ‐
1659 ment variable. The default value is the prefix given at config‐
1660 ure time + "/lib/slurm". A restart of slurmctld and the slurmd
1661 daemons is required for changes to this parameter to take ef‐
1662 fect.
1663
1664 PlugStackConfig
1665 Location of the config file for Slurm stackable plugins that use
1666 the Stackable Plugin Architecture for Node job (K)control
1667 (SPANK). This provides support for a highly configurable set of
1668 plugins to be called before and/or after execution of each task
1669 spawned as part of a user's job step. Default location is
1670 "plugstack.conf" in the same directory as the system slurm.conf.
1671 For more information on SPANK plugins, see the spank(8) manual.
1672
1673 PowerParameters
1674 System power management parameters. The supported parameters
1675 are specific to the PowerPlugin. Changes to this value take ef‐
1676 fect when the Slurm daemons are reconfigured. More information
1677 about system power management is available here
1678 <https://slurm.schedmd.com/power_mgmt.html>. Options current
1679 supported by any plugins are listed below.
1680
1681 balance_interval=#
1682 Specifies the time interval, in seconds, between attempts
1683 to rebalance power caps across the nodes. This also con‐
1684 trols the frequency at which Slurm attempts to collect
1685 current power consumption data (old data may be used un‐
1686 til new data is available from the underlying infrastruc‐
1687 ture and values below 10 seconds are not recommended for
1688 Cray systems). The default value is 30 seconds. Sup‐
1689 ported by the power/cray_aries plugin.
1690
1691 capmc_path=
1692 Specifies the absolute path of the capmc command. The
1693 default value is "/opt/cray/capmc/default/bin/capmc".
1694 Supported by the power/cray_aries plugin.
1695
1696 cap_watts=#
1697 Specifies the total power limit to be established across
1698 all compute nodes managed by Slurm. A value of 0 sets
1699 every compute node to have an unlimited cap. The default
1700 value is 0. Supported by the power/cray_aries plugin.
1701
1702 decrease_rate=#
1703 Specifies the maximum rate of change in the power cap for
1704 a node where the actual power usage is below the power
1705 cap by an amount greater than lower_threshold (see be‐
1706 low). Value represents a percentage of the difference
1707 between a node's minimum and maximum power consumption.
1708 The default value is 50 percent. Supported by the
1709 power/cray_aries plugin.
1710
1711 get_timeout=#
1712 Amount of time allowed to get power state information in
1713 milliseconds. The default value is 5,000 milliseconds or
1714 5 seconds. Supported by the power/cray_aries plugin and
1715 represents the time allowed for the capmc command to re‐
1716 spond to various "get" options.
1717
1718 increase_rate=#
1719 Specifies the maximum rate of change in the power cap for
1720 a node where the actual power usage is within up‐
1721 per_threshold (see below) of the power cap. Value repre‐
1722 sents a percentage of the difference between a node's
1723 minimum and maximum power consumption. The default value
1724 is 20 percent. Supported by the power/cray_aries plugin.
1725
1726 job_level
1727 All nodes associated with every job will have the same
1728 power cap, to the extent possible. Also see the
1729 --power=level option on the job submission commands.
1730
1731 job_no_level
1732 Disable the user's ability to set every node associated
1733 with a job to the same power cap. Each node will have
1734 its power cap set independently. This disables the
1735 --power=level option on the job submission commands.
1736
1737 lower_threshold=#
1738 Specify a lower power consumption threshold. If a node's
1739 current power consumption is below this percentage of its
1740 current cap, then its power cap will be reduced. The de‐
1741 fault value is 90 percent. Supported by the
1742 power/cray_aries plugin.
1743
1744 recent_job=#
1745 If a job has started or resumed execution (from suspend)
1746 on a compute node within this number of seconds from the
1747 current time, the node's power cap will be increased to
1748 the maximum. The default value is 300 seconds. Sup‐
1749 ported by the power/cray_aries plugin.
1750
1751
1752 set_timeout=#
1753 Amount of time allowed to set power state information in
1754 milliseconds. The default value is 30,000 milliseconds
1755 or 30 seconds. Supported by the power/cray plugin and
1756 represents the time allowed for the capmc command to re‐
1757 spond to various "set" options.
1758
1759 set_watts=#
1760 Specifies the power limit to be set on every compute
1761 nodes managed by Slurm. Every node gets this same power
1762 cap and there is no variation through time based upon ac‐
1763 tual power usage on the node. Supported by the
1764 power/cray_aries plugin.
1765
1766 upper_threshold=#
1767 Specify an upper power consumption threshold. If a
1768 node's current power consumption is above this percentage
1769 of its current cap, then its power cap will be increased
1770 to the extent possible. The default value is 95 percent.
1771 Supported by the power/cray_aries plugin.
1772
1773 PowerPlugin
1774 Identifies the plugin used for system power management. Cur‐
1775 rently supported plugins include: cray_aries and none. A
1776 restart of slurmctld is required for changes to this parameter
1777 to take effect. More information about system power management
1778 is available here <https://slurm.schedmd.com/power_mgmt.html>.
1779 By default, no power plugin is loaded.
1780
1781 PreemptMode
1782 Mechanism used to preempt jobs or enable gang scheduling. When
1783 the PreemptType parameter is set to enable preemption, the Pre‐
1784 emptMode selects the default mechanism used to preempt the eli‐
1785 gible jobs for the cluster.
1786 PreemptMode may be specified on a per partition basis to over‐
1787 ride this default value if PreemptType=preempt/partition_prio.
1788 Alternatively, it can be specified on a per QOS basis if Pre‐
1789 emptType=preempt/qos. In either case, a valid default Preempt‐
1790 Mode value must be specified for the cluster as a whole when
1791 preemption is enabled.
1792 The GANG option is used to enable gang scheduling independent of
1793 whether preemption is enabled (i.e. independent of the Preempt‐
1794 Type setting). It can be specified in addition to a PreemptMode
1795 setting with the two options comma separated (e.g. Preempt‐
1796 Mode=SUSPEND,GANG).
1797 See <https://slurm.schedmd.com/preempt.html> and
1798 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
1799 tails.
1800
1801 NOTE: For performance reasons, the backfill scheduler reserves
1802 whole nodes for jobs, not partial nodes. If during backfill
1803 scheduling a job preempts one or more other jobs, the whole
1804 nodes for those preempted jobs are reserved for the preemptor
1805 job, even if the preemptor job requested fewer resources than
1806 that. These reserved nodes aren't available to other jobs dur‐
1807 ing that backfill cycle, even if the other jobs could fit on the
1808 nodes. Therefore, jobs may preempt more resources during a sin‐
1809 gle backfill iteration than they requested.
1810 NOTE: For heterogeneous job to be considered for preemption all
1811 components must be eligible for preemption. When a heterogeneous
1812 job is to be preempted the first identified component of the job
1813 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
1814 CANCEL (lowest)) will be used to set the PreemptMode for all
1815 components. The GraceTime and user warning signal for each com‐
1816 ponent of the heterogeneous job remain unique. Heterogeneous
1817 jobs are excluded from GANG scheduling operations.
1818
1819 OFF Is the default value and disables job preemption and
1820 gang scheduling. It is only compatible with Pre‐
1821 emptType=preempt/none at a global level. A common
1822 use case for this parameter is to set it on a parti‐
1823 tion to disable preemption for that partition.
1824
1825 CANCEL The preempted job will be cancelled.
1826
1827 GANG Enables gang scheduling (time slicing) of jobs in
1828 the same partition, and allows the resuming of sus‐
1829 pended jobs.
1830
1831 NOTE: Gang scheduling is performed independently for
1832 each partition, so if you only want time-slicing by
1833 OverSubscribe, without any preemption, then config‐
1834 uring partitions with overlapping nodes is not rec‐
1835 ommended. On the other hand, if you want to use
1836 PreemptType=preempt/partition_prio to allow jobs
1837 from higher PriorityTier partitions to Suspend jobs
1838 from lower PriorityTier partitions you will need
1839 overlapping partitions, and PreemptMode=SUSPEND,GANG
1840 to use the Gang scheduler to resume the suspended
1841 jobs(s). In any case, time-slicing won't happen be‐
1842 tween jobs on different partitions.
1843
1844 NOTE: Heterogeneous jobs are excluded from GANG
1845 scheduling operations.
1846
1847 REQUEUE Preempts jobs by requeuing them (if possible) or
1848 canceling them. For jobs to be requeued they must
1849 have the --requeue sbatch option set or the cluster
1850 wide JobRequeue parameter in slurm.conf must be set
1851 to 1.
1852
1853 SUSPEND The preempted jobs will be suspended, and later the
1854 Gang scheduler will resume them. Therefore the SUS‐
1855 PEND preemption mode always needs the GANG option to
1856 be specified at the cluster level. Also, because the
1857 suspended jobs will still use memory on the allo‐
1858 cated nodes, Slurm needs to be able to track memory
1859 resources to be able to suspend jobs.
1860 If PreemptType=preempt/qos is configured and if the
1861 preempted job(s) and the preemptor job are on the
1862 same partition, then they will share resources with
1863 the Gang scheduler (time-slicing). If not (i.e. if
1864 the preemptees and preemptor are on different parti‐
1865 tions) then the preempted jobs will remain suspended
1866 until the preemptor ends.
1867
1868 NOTE: Because gang scheduling is performed indepen‐
1869 dently for each partition, if using PreemptType=pre‐
1870 empt/partition_prio then jobs in higher PriorityTier
1871 partitions will suspend jobs in lower PriorityTier
1872 partitions to run on the released resources. Only
1873 when the preemptor job ends will the suspended jobs
1874 will be resumed by the Gang scheduler.
1875 NOTE: Suspended jobs will not release GRES. Higher
1876 priority jobs will not be able to preempt to gain
1877 access to GRES.
1878
1879 PreemptType
1880 Specifies the plugin used to identify which jobs can be pre‐
1881 empted in order to start a pending job.
1882
1883 preempt/none
1884 Job preemption is disabled. This is the default.
1885
1886 preempt/partition_prio
1887 Job preemption is based upon partition PriorityTier.
1888 Jobs in higher PriorityTier partitions may preempt jobs
1889 from lower PriorityTier partitions. This is not compati‐
1890 ble with PreemptMode=OFF.
1891
1892 preempt/qos
1893 Job preemption rules are specified by Quality Of Service
1894 (QOS) specifications in the Slurm database. This option
1895 is not compatible with PreemptMode=OFF. A configuration
1896 of PreemptMode=SUSPEND is only supported by the Select‐
1897 Type=select/cons_res and SelectType=select/cons_tres
1898 plugins. See the sacctmgr man page to configure the op‐
1899 tions for preempt/qos.
1900
1901 PreemptExemptTime
1902 Global option for minimum run time for all jobs before they can
1903 be considered for preemption. Any QOS PreemptExemptTime takes
1904 precedence over the global option. This is only honored for Pre‐
1905 emptMode=REQUEUE and PreemptMode=CANCEL.
1906 A time of -1 disables the option, equivalent to 0. Acceptable
1907 time formats include "minutes", "minutes:seconds", "hours:min‐
1908 utes:seconds", "days-hours", "days-hours:minutes", and
1909 "days-hours:minutes:seconds".
1910
1911 PrEpParameters
1912 Parameters to be passed to the PrEpPlugins.
1913
1914 PrEpPlugins
1915 A resource for programmers wishing to write their own plugins
1916 for the Prolog and Epilog (PrEp) scripts. The default, and cur‐
1917 rently the only implemented plugin is prep/script. Additional
1918 plugins can be specified in a comma-separated list. For more in‐
1919 formation please see the PrEp Plugin API documentation page:
1920 <https://slurm.schedmd.com/prep_plugins.html>
1921
1922 PriorityCalcPeriod
1923 The period of time in minutes in which the half-life decay will
1924 be re-calculated. Applicable only if PriorityType=priority/mul‐
1925 tifactor. The default value is 5 (minutes).
1926
1927 PriorityDecayHalfLife
1928 This controls how long prior resource use is considered in de‐
1929 termining how over- or under-serviced an association is (user,
1930 bank account and cluster) in determining job priority. The
1931 record of usage will be decayed over time, with half of the
1932 original value cleared at age PriorityDecayHalfLife. If set to
1933 0 no decay will be applied. This is helpful if you want to en‐
1934 force hard time limits per association. If set to 0 Priori‐
1935 tyUsageResetPeriod must be set to some interval. Applicable
1936 only if PriorityType=priority/multifactor. The unit is a time
1937 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
1938 default value is 7-0 (7 days).
1939
1940 PriorityFavorSmall
1941 Specifies that small jobs should be given preferential schedul‐
1942 ing priority. Applicable only if PriorityType=priority/multi‐
1943 factor. Supported values are "YES" and "NO". The default value
1944 is "NO".
1945
1946 PriorityFlags
1947 Flags to modify priority behavior. Applicable only if Priority‐
1948 Type=priority/multifactor. The keywords below have no associ‐
1949 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
1950 TIVE_TO_TIME").
1951
1952 ACCRUE_ALWAYS If set, priority age factor will be increased
1953 despite job dependencies or holds.
1954
1955 CALCULATE_RUNNING
1956 If set, priorities will be recalculated not
1957 only for pending jobs, but also running and
1958 suspended jobs.
1959
1960 DEPTH_OBLIVIOUS If set, priority will be calculated based simi‐
1961 lar to the normal multifactor calculation, but
1962 depth of the associations in the tree does not
1963 adversely affect their priority. This option
1964 automatically enables NO_FAIR_TREE.
1965
1966 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
1967 to "classic" fair share priority scheduling.
1968
1969 INCR_ONLY If set, priority values will only increase in
1970 value. Job priority will never decrease in
1971 value.
1972
1973 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
1974 BillingWeights) is calculated as the MAX of in‐
1975 dividual TRES' on a node (e.g. cpus, mem, gres)
1976 plus the sum of all global TRES' (e.g. li‐
1977 censes).
1978
1979 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
1980
1981 NO_NORMAL_ASSOC If set, the association factor is not normal‐
1982 ized against the highest association priority.
1983
1984 NO_NORMAL_PART If set, the partition factor is not normalized
1985 against the highest partition PriorityJobFac‐
1986 tor.
1987
1988 NO_NORMAL_QOS If set, the QOS factor is not normalized
1989 against the highest qos priority.
1990
1991 NO_NORMAL_TRES If set, the TRES factor is not normalized
1992 against the job's partition TRES counts.
1993
1994 SMALL_RELATIVE_TO_TIME
1995 If set, the job's size component will be based
1996 upon not the job size alone, but the job's size
1997 divided by its time limit.
1998
1999 PriorityMaxAge
2000 Specifies the job age which will be given the maximum age factor
2001 in computing priority. For example, a value of 30 minutes would
2002 result in all jobs over 30 minutes old would get the same
2003 age-based priority. Applicable only if PriorityType=prior‐
2004 ity/multifactor. The unit is a time string (i.e. min,
2005 hr:min:00, days-hr:min:00, or days-hr). The default value is
2006 7-0 (7 days).
2007
2008 PriorityParameters
2009 Arbitrary string used by the PriorityType plugin.
2010
2011 PrioritySiteFactorParameters
2012 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
2013
2014 PrioritySiteFactorPlugin
2015 The specifies an optional plugin to be used alongside "prior‐
2016 ity/multifactor", which is meant to initially set and continu‐
2017 ously update the SiteFactor priority factor. The default value
2018 is "site_factor/none".
2019
2020 PriorityType
2021 This specifies the plugin to be used in establishing a job's
2022 scheduling priority. Also see PriorityFlags for configuration
2023 options. The default value is "priority/basic".
2024
2025 priority/basic
2026 Jobs are evaluated in a First In, First Out (FIFO) man‐
2027 ner.
2028
2029 priority/multifactor
2030 Jobs are assigned a priority based upon a variety of fac‐
2031 tors that include size, age, Fairshare, etc.
2032
2033 When not FIFO scheduling, jobs are prioritized in the following
2034 order:
2035
2036 1. Jobs that can preempt
2037 2. Jobs with an advanced reservation
2038 3. Partition PriorityTier
2039 4. Job priority
2040 5. Job submit time
2041 6. Job ID
2042
2043 PriorityUsageResetPeriod
2044 At this interval the usage of associations will be reset to 0.
2045 This is used if you want to enforce hard limits of time usage
2046 per association. If PriorityDecayHalfLife is set to be 0 no de‐
2047 cay will happen and this is the only way to reset the usage ac‐
2048 cumulated by running jobs. By default this is turned off and it
2049 is advised to use the PriorityDecayHalfLife option to avoid not
2050 having anything running on your cluster, but if your schema is
2051 set up to only allow certain amounts of time on your system this
2052 is the way to do it. Applicable only if PriorityType=prior‐
2053 ity/multifactor.
2054
2055 NONE Never clear historic usage. The default value.
2056
2057 NOW Clear the historic usage now. Executed at startup
2058 and reconfiguration time.
2059
2060 DAILY Cleared every day at midnight.
2061
2062 WEEKLY Cleared every week on Sunday at time 00:00.
2063
2064 MONTHLY Cleared on the first day of each month at time
2065 00:00.
2066
2067 QUARTERLY Cleared on the first day of each quarter at time
2068 00:00.
2069
2070 YEARLY Cleared on the first day of each year at time 00:00.
2071
2072 PriorityWeightAge
2073 An integer value that sets the degree to which the queue wait
2074 time component contributes to the job's priority. Applicable
2075 only if PriorityType=priority/multifactor. Requires Account‐
2076 ingStorageType=accounting_storage/slurmdbd. The default value
2077 is 0.
2078
2079 PriorityWeightAssoc
2080 An integer value that sets the degree to which the association
2081 component contributes to the job's priority. Applicable only if
2082 PriorityType=priority/multifactor. The default value is 0.
2083
2084 PriorityWeightFairshare
2085 An integer value that sets the degree to which the fair-share
2086 component contributes to the job's priority. Applicable only if
2087 PriorityType=priority/multifactor. Requires AccountingStor‐
2088 ageType=accounting_storage/slurmdbd. The default value is 0.
2089
2090 PriorityWeightJobSize
2091 An integer value that sets the degree to which the job size com‐
2092 ponent contributes to the job's priority. Applicable only if
2093 PriorityType=priority/multifactor. The default value is 0.
2094
2095 PriorityWeightPartition
2096 Partition factor used by priority/multifactor plugin in calcu‐
2097 lating job priority. Applicable only if PriorityType=prior‐
2098 ity/multifactor. The default value is 0.
2099
2100 PriorityWeightQOS
2101 An integer value that sets the degree to which the Quality Of
2102 Service component contributes to the job's priority. Applicable
2103 only if PriorityType=priority/multifactor. The default value is
2104 0.
2105
2106 PriorityWeightTRES
2107 A comma-separated list of TRES Types and weights that sets the
2108 degree that each TRES Type contributes to the job's priority.
2109
2110 e.g.
2111 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
2112
2113 Applicable only if PriorityType=priority/multifactor and if Ac‐
2114 countingStorageTRES is configured with each TRES Type. Negative
2115 values are allowed. The default values are 0.
2116
2117 PrivateData
2118 This controls what type of information is hidden from regular
2119 users. By default, all information is visible to all users.
2120 User SlurmUser and root can always view all information. Multi‐
2121 ple values may be specified with a comma separator. Acceptable
2122 values include:
2123
2124 accounts
2125 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2126 ing any account definitions unless they are coordinators
2127 of them.
2128
2129 cloud Powered down nodes in the cloud are visible.
2130
2131 events prevents users from viewing event information unless they
2132 have operator status or above.
2133
2134 jobs Prevents users from viewing jobs or job steps belonging
2135 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
2136 users from viewing job records belonging to other users
2137 unless they are coordinators of the association running
2138 the job when using sacct.
2139
2140 nodes Prevents users from viewing node state information.
2141
2142 partitions
2143 Prevents users from viewing partition state information.
2144
2145 reservations
2146 Prevents regular users from viewing reservations which
2147 they can not use.
2148
2149 usage Prevents users from viewing usage of any other user, this
2150 applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
2151 vents users from viewing usage of any other user, this
2152 applies to sreport.
2153
2154 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2155 ing information of any user other than themselves, this
2156 also makes it so users can only see associations they
2157 deal with. Coordinators can see associations of all
2158 users in the account they are coordinator of, but can
2159 only see themselves when listing users.
2160
2161 ProctrackType
2162 Identifies the plugin to be used for process tracking on a job
2163 step basis. The slurmd daemon uses this mechanism to identify
2164 all processes which are children of processes it spawns for a
2165 user job step. A restart of slurmctld is required for changes
2166 to this parameter to take effect. NOTE: "proctrack/linuxproc"
2167 and "proctrack/pgid" can fail to identify all processes associ‐
2168 ated with a job since processes can become a child of the init
2169 process (when the parent process terminates) or change their
2170 process group. To reliably track all processes, "proc‐
2171 track/cgroup" is highly recommended. NOTE: The JobContainerType
2172 applies to a job allocation, while ProctrackType applies to job
2173 steps. Acceptable values at present include:
2174
2175 proctrack/cgroup
2176 Uses linux cgroups to constrain and track processes, and
2177 is the default for systems with cgroup support.
2178 NOTE: see "man cgroup.conf" for configuration details.
2179
2180 proctrack/cray_aries
2181 Uses Cray proprietary process tracking.
2182
2183 proctrack/linuxproc
2184 Uses linux process tree using parent process IDs.
2185
2186 proctrack/pgid
2187 Uses Process Group IDs.
2188 NOTE: This is the default for the BSD family.
2189
2190 Prolog Fully qualified pathname of a program for the slurmd to execute
2191 whenever it is asked to run a job step from a new job allocation
2192 (e.g. "/usr/local/slurm/prolog"). A glob pattern (See glob (7))
2193 may also be used to specify more than one program to run (e.g.
2194 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2195 starting the first job step. The prolog script or scripts may
2196 be used to purge files, enable user login, etc. By default
2197 there is no prolog. Any configured script is expected to com‐
2198 plete execution quickly (in less time than MessageTimeout). If
2199 the prolog fails (returns a non-zero exit code), this will re‐
2200 sult in the node being set to a DRAIN state and the job being
2201 requeued in a held state, unless nohold_on_prolog_fail is con‐
2202 figured in SchedulerParameters. See Prolog and Epilog Scripts
2203 for more information.
2204
2205 PrologEpilogTimeout
2206 The interval in seconds Slurms waits for Prolog and Epilog be‐
2207 fore terminating them. The default behavior is to wait indefi‐
2208 nitely. This interval applies to the Prolog and Epilog run by
2209 slurmd daemon before and after the job, the PrologSlurmctld and
2210 EpilogSlurmctld run by slurmctld daemon, and the SPANK plugins
2211 run by the slurmstepd daemon.
2212
2213 PrologFlags
2214 Flags to control the Prolog behavior. By default no flags are
2215 set. Multiple flags may be specified in a comma-separated list.
2216 Currently supported options are:
2217
2218 Alloc If set, the Prolog script will be executed at job allo‐
2219 cation. By default, Prolog is executed just before the
2220 task is launched. Therefore, when salloc is started, no
2221 Prolog is executed. Alloc is useful for preparing things
2222 before a user starts to use any allocated resources. In
2223 particular, this flag is needed on a Cray system when
2224 cluster compatibility mode is enabled.
2225
2226 NOTE: Use of the Alloc flag will increase the time re‐
2227 quired to start jobs.
2228
2229 Contain At job allocation time, use the ProcTrack plugin to cre‐
2230 ate a job container on all allocated compute nodes.
2231 This container may be used for user processes not
2232 launched under Slurm control, for example
2233 pam_slurm_adopt may place processes launched through a
2234 direct user login into this container. If using
2235 pam_slurm_adopt, then ProcTrackType must be set to ei‐
2236 ther proctrack/cgroup or proctrack/cray_aries. Setting
2237 the Contain implicitly sets the Alloc flag.
2238
2239 NoHold If set, the Alloc flag should also be set. This will
2240 allow for salloc to not block until the prolog is fin‐
2241 ished on each node. The blocking will happen when steps
2242 reach the slurmd and before any execution has happened
2243 in the step. This is a much faster way to work and if
2244 using srun to launch your tasks you should use this
2245 flag. This flag cannot be combined with the Contain or
2246 X11 flags.
2247
2248 Serial By default, the Prolog and Epilog scripts run concur‐
2249 rently on each node. This flag forces those scripts to
2250 run serially within each node, but with a significant
2251 penalty to job throughput on each node.
2252
2253 X11 Enable Slurm's built-in X11 forwarding capabilities.
2254 This is incompatible with ProctrackType=proctrack/linux‐
2255 proc. Setting the X11 flag implicitly enables both Con‐
2256 tain and Alloc flags as well.
2257
2258 PrologSlurmctld
2259 Fully qualified pathname of a program for the slurmctld daemon
2260 to execute before granting a new job allocation (e.g. "/usr/lo‐
2261 cal/slurm/prolog_controller"). The program executes as Slur‐
2262 mUser on the same node where the slurmctld daemon executes, giv‐
2263 ing it permission to drain nodes and requeue the job if a fail‐
2264 ure occurs or cancel the job if appropriate. Exactly what the
2265 program does and how it accomplishes this is completely at the
2266 discretion of the system administrator. Information about the
2267 job being initiated, its allocated nodes, etc. are passed to the
2268 program using environment variables. While this program is run‐
2269 ning, the nodes associated with the job will be have a
2270 POWER_UP/CONFIGURING flag set in their state, which can be read‐
2271 ily viewed. The slurmctld daemon will wait indefinitely for
2272 this program to complete. Once the program completes with an
2273 exit code of zero, the nodes will be considered ready for use
2274 and the program will be started. If some node can not be made
2275 available for use, the program should drain the node (typically
2276 using the scontrol command) and terminate with a non-zero exit
2277 code. A non-zero exit code will result in the job being re‐
2278 queued (where possible) or killed. Note that only batch jobs can
2279 be requeued. See Prolog and Epilog Scripts for more informa‐
2280 tion.
2281
2282 PropagatePrioProcess
2283 Controls the scheduling priority (nice value) of user spawned
2284 tasks.
2285
2286 0 The tasks will inherit the scheduling priority from the
2287 slurm daemon. This is the default value.
2288
2289 1 The tasks will inherit the scheduling priority of the com‐
2290 mand used to submit them (e.g. srun or sbatch). Unless the
2291 job is submitted by user root, the tasks will have a sched‐
2292 uling priority no higher than the slurm daemon spawning
2293 them.
2294
2295 2 The tasks will inherit the scheduling priority of the com‐
2296 mand used to submit them (e.g. srun or sbatch) with the re‐
2297 striction that their nice value will always be one higher
2298 than the slurm daemon (i.e. the tasks scheduling priority
2299 will be lower than the slurm daemon).
2300
2301 PropagateResourceLimits
2302 A comma-separated list of resource limit names. The slurmd dae‐
2303 mon uses these names to obtain the associated (soft) limit val‐
2304 ues from the user's process environment on the submit node.
2305 These limits are then propagated and applied to the jobs that
2306 will run on the compute nodes. This parameter can be useful
2307 when system limits vary among nodes. Any resource limits that
2308 do not appear in the list are not propagated. However, the user
2309 can override this by specifying which resource limits to propa‐
2310 gate with the sbatch or srun "--propagate" option. If neither
2311 PropagateResourceLimits or PropagateResourceLimitsExcept are
2312 configured and the "--propagate" option is not specified, then
2313 the default action is to propagate all limits. Only one of the
2314 parameters, either PropagateResourceLimits or PropagateResource‐
2315 LimitsExcept, may be specified. The user limits can not exceed
2316 hard limits under which the slurmd daemon operates. If the user
2317 limits are not propagated, the limits from the slurmd daemon
2318 will be propagated to the user's job. The limits used for the
2319 Slurm daemons can be set in the /etc/sysconf/slurm file. For
2320 more information, see: https://slurm.schedmd.com/faq.html#mem‐
2321 lock The following limit names are supported by Slurm (although
2322 some options may not be supported on some systems):
2323
2324 ALL All limits listed below (default)
2325
2326 NONE No limits listed below
2327
2328 AS The maximum address space (virtual memory) for a
2329 process.
2330
2331 CORE The maximum size of core file
2332
2333 CPU The maximum amount of CPU time
2334
2335 DATA The maximum size of a process's data segment
2336
2337 FSIZE The maximum size of files created. Note that if the
2338 user sets FSIZE to less than the current size of the
2339 slurmd.log, job launches will fail with a 'File size
2340 limit exceeded' error.
2341
2342 MEMLOCK The maximum size that may be locked into memory
2343
2344 NOFILE The maximum number of open files
2345
2346 NPROC The maximum number of processes available
2347
2348 RSS The maximum resident set size. Note that this only
2349 has effect with Linux kernels 2.4.30 or older or BSD.
2350
2351 STACK The maximum stack size
2352
2353 PropagateResourceLimitsExcept
2354 A comma-separated list of resource limit names. By default, all
2355 resource limits will be propagated, (as described by the Propa‐
2356 gateResourceLimits parameter), except for the limits appearing
2357 in this list. The user can override this by specifying which
2358 resource limits to propagate with the sbatch or srun "--propa‐
2359 gate" option. See PropagateResourceLimits above for a list of
2360 valid limit names.
2361
2362 RebootProgram
2363 Program to be executed on each compute node to reboot it. In‐
2364 voked on each node once it becomes idle after the command "scon‐
2365 trol reboot" is executed by an authorized user or a job is sub‐
2366 mitted with the "--reboot" option. After rebooting, the node is
2367 returned to normal use. See ResumeTimeout to configure the time
2368 you expect a reboot to finish in. A node will be marked DOWN if
2369 it doesn't reboot within ResumeTimeout.
2370
2371 ReconfigFlags
2372 Flags to control various actions that may be taken when an
2373 "scontrol reconfig" command is issued. Currently the options
2374 are:
2375
2376 KeepPartInfo If set, an "scontrol reconfig" command will
2377 maintain the in-memory value of partition
2378 "state" and other parameters that may have been
2379 dynamically updated by "scontrol update". Par‐
2380 tition information in the slurm.conf file will
2381 be merged with in-memory data. This flag su‐
2382 persedes the KeepPartState flag.
2383
2384 KeepPartState If set, an "scontrol reconfig" command will
2385 preserve only the current "state" value of
2386 in-memory partitions and will reset all other
2387 parameters of the partitions that may have been
2388 dynamically updated by "scontrol update" to the
2389 values from the slurm.conf file. Partition in‐
2390 formation in the slurm.conf file will be merged
2391 with in-memory data.
2392
2393 The default for the above flags is not set, and the "scontrol
2394 reconfig" will rebuild the partition information using only the
2395 definitions in the slurm.conf file.
2396
2397 RequeueExit
2398 Enables automatic requeue for batch jobs which exit with the
2399 specified values. Separate multiple exit code by a comma and/or
2400 specify numeric ranges using a "-" separator (e.g. "Requeue‐
2401 Exit=1-9,18") Jobs will be put back in to pending state and
2402 later scheduled again. Restarted jobs will have the environment
2403 variable SLURM_RESTART_COUNT set to the number of times the job
2404 has been restarted.
2405
2406 RequeueExitHold
2407 Enables automatic requeue for batch jobs which exit with the
2408 specified values, with these jobs being held until released man‐
2409 ually by the user. Separate multiple exit code by a comma
2410 and/or specify numeric ranges using a "-" separator (e.g. "Re‐
2411 queueExitHold=10-12,16") These jobs are put in the JOB_SPE‐
2412 CIAL_EXIT exit state. Restarted jobs will have the environment
2413 variable SLURM_RESTART_COUNT set to the number of times the job
2414 has been restarted.
2415
2416 ResumeFailProgram
2417 The program that will be executed when nodes fail to resume to
2418 by ResumeTimeout. The argument to the program will be the names
2419 of the failed nodes (using Slurm's hostlist expression format).
2420
2421 ResumeProgram
2422 Slurm supports a mechanism to reduce power consumption on nodes
2423 that remain idle for an extended period of time. This is typi‐
2424 cally accomplished by reducing voltage and frequency or powering
2425 the node down. ResumeProgram is the program that will be exe‐
2426 cuted when a node in power save mode is assigned work to per‐
2427 form. For reasons of reliability, ResumeProgram may execute
2428 more than once for a node when the slurmctld daemon crashes and
2429 is restarted. If ResumeProgram is unable to restore a node to
2430 service with a responding slurmd and an updated BootTime, it
2431 should requeue any job associated with the node and set the node
2432 state to DOWN. If the node isn't actually rebooted (i.e. when
2433 multiple-slurmd is configured) starting slurmd with "-b" option
2434 might be useful. The program executes as SlurmUser. The argu‐
2435 ment to the program will be the names of nodes to be removed
2436 from power savings mode (using Slurm's hostlist expression for‐
2437 mat). A job to node mapping is available in JSON format by read‐
2438 ing the temporary file specified by the SLURM_RESUME_FILE envi‐
2439 ronment variable. By default no program is run.
2440
2441 ResumeRate
2442 The rate at which nodes in power save mode are returned to nor‐
2443 mal operation by ResumeProgram. The value is a number of nodes
2444 per minute and it can be used to prevent power surges if a large
2445 number of nodes in power save mode are assigned work at the same
2446 time (e.g. a large job starts). A value of zero results in no
2447 limits being imposed. The default value is 300 nodes per
2448 minute.
2449
2450 ResumeTimeout
2451 Maximum time permitted (in seconds) between when a node resume
2452 request is issued and when the node is actually available for
2453 use. Nodes which fail to respond in this time frame will be
2454 marked DOWN and the jobs scheduled on the node requeued. Nodes
2455 which reboot after this time frame will be marked DOWN with a
2456 reason of "Node unexpectedly rebooted." The default value is 60
2457 seconds.
2458
2459 ResvEpilog
2460 Fully qualified pathname of a program for the slurmctld to exe‐
2461 cute when a reservation ends. The program can be used to cancel
2462 jobs, modify partition configuration, etc. The reservation
2463 named will be passed as an argument to the program. By default
2464 there is no epilog.
2465
2466 ResvOverRun
2467 Describes how long a job already running in a reservation should
2468 be permitted to execute after the end time of the reservation
2469 has been reached. The time period is specified in minutes and
2470 the default value is 0 (kill the job immediately). The value
2471 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2472 supported to permit a job to run indefinitely after its reserva‐
2473 tion is terminated.
2474
2475 ResvProlog
2476 Fully qualified pathname of a program for the slurmctld to exe‐
2477 cute when a reservation begins. The program can be used to can‐
2478 cel jobs, modify partition configuration, etc. The reservation
2479 named will be passed as an argument to the program. By default
2480 there is no prolog.
2481
2482 ReturnToService
2483 Controls when a DOWN node will be returned to service. The de‐
2484 fault value is 0. Supported values include
2485
2486 0 A node will remain in the DOWN state until a system adminis‐
2487 trator explicitly changes its state (even if the slurmd dae‐
2488 mon registers and resumes communications).
2489
2490 1 A DOWN node will become available for use upon registration
2491 with a valid configuration only if it was set DOWN due to
2492 being non-responsive. If the node was set DOWN for any
2493 other reason (low memory, unexpected reboot, etc.), its
2494 state will not automatically be changed. A node registers
2495 with a valid configuration if its memory, GRES, CPU count,
2496 etc. are equal to or greater than the values configured in
2497 slurm.conf.
2498
2499 2 A DOWN node will become available for use upon registration
2500 with a valid configuration. The node could have been set
2501 DOWN for any reason. A node registers with a valid configu‐
2502 ration if its memory, GRES, CPU count, etc. are equal to or
2503 greater than the values configured in slurm.conf.
2504
2505 RoutePlugin
2506 Identifies the plugin to be used for defining which nodes will
2507 be used for message forwarding.
2508
2509 route/default
2510 default, use TreeWidth.
2511
2512 route/topology
2513 use the switch hierarchy defined in a topology.conf file.
2514 TopologyPlugin=topology/tree is required.
2515
2516 SchedulerParameters
2517 The interpretation of this parameter varies by SchedulerType.
2518 Multiple options may be comma separated.
2519
2520 allow_zero_lic
2521 If set, then job submissions requesting more than config‐
2522 ured licenses won't be rejected.
2523
2524 assoc_limit_stop
2525 If set and a job cannot start due to association limits,
2526 then do not attempt to initiate any lower priority jobs
2527 in that partition. Setting this can decrease system
2528 throughput and utilization, but avoid potentially starv‐
2529 ing larger jobs by preventing them from launching indefi‐
2530 nitely.
2531
2532 batch_sched_delay=#
2533 How long, in seconds, the scheduling of batch jobs can be
2534 delayed. This can be useful in a high-throughput envi‐
2535 ronment in which batch jobs are submitted at a very high
2536 rate (i.e. using the sbatch command) and one wishes to
2537 reduce the overhead of attempting to schedule each job at
2538 submit time. The default value is 3 seconds.
2539
2540 bb_array_stage_cnt=#
2541 Number of tasks from a job array that should be available
2542 for burst buffer resource allocation. Higher values will
2543 increase the system overhead as each task from the job
2544 array will be moved to its own job record in memory, so
2545 relatively small values are generally recommended. The
2546 default value is 10.
2547
2548 bf_busy_nodes
2549 When selecting resources for pending jobs to reserve for
2550 future execution (i.e. the job can not be started immedi‐
2551 ately), then preferentially select nodes that are in use.
2552 This will tend to leave currently idle resources avail‐
2553 able for backfilling longer running jobs, but may result
2554 in allocations having less than optimal network topology.
2555 This option is currently only supported by the se‐
2556 lect/cons_res and select/cons_tres plugins (or se‐
2557 lect/cray_aries with SelectTypeParameters set to
2558 "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
2559 select/cray_aries plugin over the select/cons_res or se‐
2560 lect/cons_tres plugin respectively).
2561
2562 bf_continue
2563 The backfill scheduler periodically releases locks in or‐
2564 der to permit other operations to proceed rather than
2565 blocking all activity for what could be an extended pe‐
2566 riod of time. Setting this option will cause the back‐
2567 fill scheduler to continue processing pending jobs from
2568 its original job list after releasing locks even if job
2569 or node state changes.
2570
2571 bf_hetjob_immediate
2572 Instruct the backfill scheduler to attempt to start a
2573 heterogeneous job as soon as all of its components are
2574 determined able to do so. Otherwise, the backfill sched‐
2575 uler will delay heterogeneous jobs initiation attempts
2576 until after the rest of the queue has been processed.
2577 This delay may result in lower priority jobs being allo‐
2578 cated resources, which could delay the initiation of the
2579 heterogeneous job due to account and/or QOS limits being
2580 reached. This option is disabled by default. If enabled
2581 and bf_hetjob_prio=min is not set, then it would be auto‐
2582 matically set.
2583
2584 bf_hetjob_prio=[min|avg|max]
2585 At the beginning of each backfill scheduling cycle, a
2586 list of pending to be scheduled jobs is sorted according
2587 to the precedence order configured in PriorityType. This
2588 option instructs the scheduler to alter the sorting algo‐
2589 rithm to ensure that all components belonging to the same
2590 heterogeneous job will be attempted to be scheduled con‐
2591 secutively (thus not fragmented in the resulting list).
2592 More specifically, all components from the same heteroge‐
2593 neous job will be treated as if they all have the same
2594 priority (minimum, average or maximum depending upon this
2595 option's parameter) when compared with other jobs (or
2596 other heterogeneous job components). The original order
2597 will be preserved within the same heterogeneous job. Note
2598 that the operation is calculated for the PriorityTier
2599 layer and for the Priority resulting from the prior‐
2600 ity/multifactor plugin calculations. When enabled, if any
2601 heterogeneous job requested an advanced reservation, then
2602 all of that job's components will be treated as if they
2603 had requested an advanced reservation (and get preferen‐
2604 tial treatment in scheduling).
2605
2606 Note that this operation does not update the Priority
2607 values of the heterogeneous job components, only their
2608 order within the list, so the output of the sprio command
2609 will not be effected.
2610
2611 Heterogeneous jobs have special scheduling properties:
2612 they are only scheduled by the backfill scheduling
2613 plugin, each of their components is considered separately
2614 when reserving resources (and might have different Prior‐
2615 ityTier or different Priority values), and no heteroge‐
2616 neous job component is actually allocated resources until
2617 all if its components can be initiated. This may imply
2618 potential scheduling deadlock scenarios because compo‐
2619 nents from different heterogeneous jobs can start reserv‐
2620 ing resources in an interleaved fashion (not consecu‐
2621 tively), but none of the jobs can reserve resources for
2622 all components and start. Enabling this option can help
2623 to mitigate this problem. By default, this option is dis‐
2624 abled.
2625
2626 bf_interval=#
2627 The number of seconds between backfill iterations.
2628 Higher values result in less overhead and better respon‐
2629 siveness. This option applies only to Scheduler‐
2630 Type=sched/backfill. Default: 30, Min: 1, Max: 10800
2631 (3h).
2632
2633
2634 bf_job_part_count_reserve=#
2635 The backfill scheduling logic will reserve resources for
2636 the specified count of highest priority jobs in each par‐
2637 tition. For example, bf_job_part_count_reserve=10 will
2638 cause the backfill scheduler to reserve resources for the
2639 ten highest priority jobs in each partition. Any lower
2640 priority job that can be started using currently avail‐
2641 able resources and not adversely impact the expected
2642 start time of these higher priority jobs will be started
2643 by the backfill scheduler The default value is zero,
2644 which will reserve resources for any pending job and de‐
2645 lay initiation of lower priority jobs. Also see
2646 bf_min_age_reserve and bf_min_prio_reserve. Default: 0,
2647 Min: 0, Max: 100000.
2648
2649 bf_max_job_array_resv=#
2650 The maximum number of tasks from a job array for which
2651 the backfill scheduler will reserve resources in the fu‐
2652 ture. Since job arrays can potentially have millions of
2653 tasks, the overhead in reserving resources for all tasks
2654 can be prohibitive. In addition various limits may pre‐
2655 vent all the jobs from starting at the expected times.
2656 This has no impact upon the number of tasks from a job
2657 array that can be started immediately, only those tasks
2658 expected to start at some future time. Default: 20, Min:
2659 0, Max: 1000. NOTE: Jobs submitted to multiple parti‐
2660 tions appear in the job queue once per partition. If dif‐
2661 ferent copies of a single job array record aren't consec‐
2662 utive in the job queue and another job array record is in
2663 between, then bf_max_job_array_resv tasks are considered
2664 per partition that the job is submitted to.
2665
2666 bf_max_job_assoc=#
2667 The maximum number of jobs per user association to at‐
2668 tempt starting with the backfill scheduler. This setting
2669 is similar to bf_max_job_user but is handy if a user has
2670 multiple associations equating to basically different
2671 users. One can set this limit to prevent users from
2672 flooding the backfill queue with jobs that cannot start
2673 and that prevent jobs from other users to start. This
2674 option applies only to SchedulerType=sched/backfill.
2675 Also see the bf_max_job_user bf_max_job_part,
2676 bf_max_job_test and bf_max_job_user_part=# options. Set
2677 bf_max_job_test to a value much higher than
2678 bf_max_job_assoc. Default: 0 (no limit), Min: 0, Max:
2679 bf_max_job_test.
2680
2681 bf_max_job_part=#
2682 The maximum number of jobs per partition to attempt
2683 starting with the backfill scheduler. This can be espe‐
2684 cially helpful for systems with large numbers of parti‐
2685 tions and jobs. This option applies only to Scheduler‐
2686 Type=sched/backfill. Also see the partition_job_depth
2687 and bf_max_job_test options. Set bf_max_job_test to a
2688 value much higher than bf_max_job_part. Default: 0 (no
2689 limit), Min: 0, Max: bf_max_job_test.
2690
2691 bf_max_job_start=#
2692 The maximum number of jobs which can be initiated in a
2693 single iteration of the backfill scheduler. This option
2694 applies only to SchedulerType=sched/backfill. Default: 0
2695 (no limit), Min: 0, Max: 10000.
2696
2697 bf_max_job_test=#
2698 The maximum number of jobs to attempt backfill scheduling
2699 for (i.e. the queue depth). Higher values result in more
2700 overhead and less responsiveness. Until an attempt is
2701 made to backfill schedule a job, its expected initiation
2702 time value will not be set. In the case of large clus‐
2703 ters, configuring a relatively small value may be desir‐
2704 able. This option applies only to Scheduler‐
2705 Type=sched/backfill. Default: 500, Min: 1, Max:
2706 1,000,000.
2707
2708 bf_max_job_user=#
2709 The maximum number of jobs per user to attempt starting
2710 with the backfill scheduler for ALL partitions. One can
2711 set this limit to prevent users from flooding the back‐
2712 fill queue with jobs that cannot start and that prevent
2713 jobs from other users to start. This is similar to the
2714 MAXIJOB limit in Maui. This option applies only to
2715 SchedulerType=sched/backfill. Also see the
2716 bf_max_job_part, bf_max_job_test and
2717 bf_max_job_user_part=# options. Set bf_max_job_test to a
2718 value much higher than bf_max_job_user. Default: 0 (no
2719 limit), Min: 0, Max: bf_max_job_test.
2720
2721 bf_max_job_user_part=#
2722 The maximum number of jobs per user per partition to at‐
2723 tempt starting with the backfill scheduler for any single
2724 partition. This option applies only to Scheduler‐
2725 Type=sched/backfill. Also see the bf_max_job_part,
2726 bf_max_job_test and bf_max_job_user=# options. Default:
2727 0 (no limit), Min: 0, Max: bf_max_job_test.
2728
2729 bf_max_time=#
2730 The maximum time in seconds the backfill scheduler can
2731 spend (including time spent sleeping when locks are re‐
2732 leased) before discontinuing, even if maximum job counts
2733 have not been reached. This option applies only to
2734 SchedulerType=sched/backfill. The default value is the
2735 value of bf_interval (which defaults to 30 seconds). De‐
2736 fault: bf_interval value (def. 30 sec), Min: 1, Max: 3600
2737 (1h). NOTE: If bf_interval is short and bf_max_time is
2738 large, this may cause locks to be acquired too frequently
2739 and starve out other serviced RPCs. It's advisable if us‐
2740 ing this parameter to set max_rpc_cnt high enough that
2741 scheduling isn't always disabled, and low enough that the
2742 interactive workload can get through in a reasonable pe‐
2743 riod of time. max_rpc_cnt needs to be below 256 (the de‐
2744 fault RPC thread limit). Running around the middle (150)
2745 may give you good results. NOTE: When increasing the
2746 amount of time spent in the backfill scheduling cycle,
2747 Slurm can be prevented from responding to client requests
2748 in a timely manner. To address this you can use
2749 max_rpc_cnt to specify a number of queued RPCs before the
2750 scheduler stops to respond to these requests.
2751
2752 bf_min_age_reserve=#
2753 The backfill and main scheduling logic will not reserve
2754 resources for pending jobs until they have been pending
2755 and runnable for at least the specified number of sec‐
2756 onds. In addition, jobs waiting for less than the speci‐
2757 fied number of seconds will not prevent a newly submitted
2758 job from starting immediately, even if the newly submit‐
2759 ted job has a lower priority. This can be valuable if
2760 jobs lack time limits or all time limits have the same
2761 value. The default value is zero, which will reserve re‐
2762 sources for any pending job and delay initiation of lower
2763 priority jobs. Also see bf_job_part_count_reserve and
2764 bf_min_prio_reserve. Default: 0, Min: 0, Max: 2592000
2765 (30 days).
2766
2767 bf_min_prio_reserve=#
2768 The backfill and main scheduling logic will not reserve
2769 resources for pending jobs unless they have a priority
2770 equal to or higher than the specified value. In addi‐
2771 tion, jobs with a lower priority will not prevent a newly
2772 submitted job from starting immediately, even if the
2773 newly submitted job has a lower priority. This can be
2774 valuable if one wished to maximize system utilization
2775 without regard for job priority below a certain thresh‐
2776 old. The default value is zero, which will reserve re‐
2777 sources for any pending job and delay initiation of lower
2778 priority jobs. Also see bf_job_part_count_reserve and
2779 bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.
2780
2781 bf_node_space_size=#
2782 Size of backfill node_space table. Adding a single job to
2783 backfill reservations in the worst case can consume two
2784 node_space records. In the case of large clusters, con‐
2785 figuring a relatively small value may be desirable. This
2786 option applies only to SchedulerType=sched/backfill.
2787 Also see bf_max_job_test and bf_running_job_reserve. De‐
2788 fault: bf_max_job_test, Min: 2, Max: 2,000,000.
2789
2790 bf_one_resv_per_job
2791 Disallow adding more than one backfill reservation per
2792 job. The scheduling logic builds a sorted list of job-
2793 partition pairs. Jobs submitted to multiple partitions
2794 have as many entries in the list as requested partitions.
2795 By default, the backfill scheduler may evaluate all the
2796 job-partition entries for a single job, potentially re‐
2797 serving resources for each pair, but only starting the
2798 job in the reservation offering the earliest start time.
2799 Having a single job reserving resources for multiple par‐
2800 titions could impede other jobs (or hetjob components)
2801 from reserving resources already reserved for the parti‐
2802 tions that don't offer the earliest start time. A single
2803 job that requests multiple partitions can also prevent
2804 itself from starting earlier in a lower priority parti‐
2805 tion if the partitions overlap nodes and a backfill
2806 reservation in the higher priority partition blocks nodes
2807 that are also in the lower priority partition. This op‐
2808 tion makes it so that a job submitted to multiple parti‐
2809 tions will stop reserving resources once the first job-
2810 partition pair has booked a backfill reservation. Subse‐
2811 quent pairs from the same job will only be tested to
2812 start now. This allows for other jobs to be able to book
2813 the other pairs resources at the cost of not guaranteeing
2814 that the multi partition job will start in the partition
2815 offering the earliest start time (unless it can start im‐
2816 mediately). This option is disabled by default.
2817
2818 bf_resolution=#
2819 The number of seconds in the resolution of data main‐
2820 tained about when jobs begin and end. Higher values re‐
2821 sult in better responsiveness and quicker backfill cycles
2822 by using larger blocks of time to determine node eligi‐
2823 bility. However, higher values lead to less efficient
2824 system planning, and may miss opportunities to improve
2825 system utilization. This option applies only to Sched‐
2826 ulerType=sched/backfill. Default: 60, Min: 1, Max: 3600
2827 (1 hour).
2828
2829 bf_running_job_reserve
2830 Add an extra step to backfill logic, which creates back‐
2831 fill reservations for jobs running on whole nodes. This
2832 option is disabled by default.
2833
2834 bf_window=#
2835 The number of minutes into the future to look when con‐
2836 sidering jobs to schedule. Higher values result in more
2837 overhead and less responsiveness. A value at least as
2838 long as the highest allowed time limit is generally ad‐
2839 visable to prevent job starvation. In order to limit the
2840 amount of data managed by the backfill scheduler, if the
2841 value of bf_window is increased, then it is generally ad‐
2842 visable to also increase bf_resolution. This option ap‐
2843 plies only to SchedulerType=sched/backfill. Default:
2844 1440 (1 day), Min: 1, Max: 43200 (30 days).
2845
2846 bf_window_linear=#
2847 For performance reasons, the backfill scheduler will de‐
2848 crease precision in calculation of job expected termina‐
2849 tion times. By default, the precision starts at 30 sec‐
2850 onds and that time interval doubles with each evaluation
2851 of currently executing jobs when trying to determine when
2852 a pending job can start. This algorithm can support an
2853 environment with many thousands of running jobs, but can
2854 result in the expected start time of pending jobs being
2855 gradually being deferred due to lack of precision. A
2856 value for bf_window_linear will cause the time interval
2857 to be increased by a constant amount on each iteration.
2858 The value is specified in units of seconds. For example,
2859 a value of 60 will cause the backfill scheduler on the
2860 first iteration to identify the job ending soonest and
2861 determine if the pending job can be started after that
2862 job plus all other jobs expected to end within 30 seconds
2863 (default initial value) of the first job. On the next it‐
2864 eration, the pending job will be evaluated for starting
2865 after the next job expected to end plus all jobs ending
2866 within 90 seconds of that time (30 second default, plus
2867 the 60 second option value). The third iteration will
2868 have a 150 second window and the fourth 210 seconds.
2869 Without this option, the time windows will double on each
2870 iteration and thus be 30, 60, 120, 240 seconds, etc. The
2871 use of bf_window_linear is not recommended with more than
2872 a few hundred simultaneously executing jobs.
2873
2874 bf_yield_interval=#
2875 The backfill scheduler will periodically relinquish locks
2876 in order for other pending operations to take place.
2877 This specifies the times when the locks are relinquished
2878 in microseconds. Smaller values may be helpful for high
2879 throughput computing when used in conjunction with the
2880 bf_continue option. Also see the bf_yield_sleep option.
2881 Default: 2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10
2882 sec).
2883
2884 bf_yield_sleep=#
2885 The backfill scheduler will periodically relinquish locks
2886 in order for other pending operations to take place.
2887 This specifies the length of time for which the locks are
2888 relinquished in microseconds. Also see the bf_yield_in‐
2889 terval option. Default: 500,000 (0.5 sec), Min: 1, Max:
2890 10,000,000 (10 sec).
2891
2892 build_queue_timeout=#
2893 Defines the maximum time that can be devoted to building
2894 a queue of jobs to be tested for scheduling. If the sys‐
2895 tem has a huge number of jobs with dependencies, just
2896 building the job queue can take so much time as to ad‐
2897 versely impact overall system performance and this param‐
2898 eter can be adjusted as needed. The default value is
2899 2,000,000 microseconds (2 seconds).
2900
2901 correspond_after_task_cnt=#
2902 Defines the number of array tasks that get split for po‐
2903 tential aftercorr dependency check. Low number may result
2904 in dependent task check failures when the job one depends
2905 on gets purged before the split. Default: 10.
2906
2907 default_queue_depth=#
2908 The default number of jobs to attempt scheduling (i.e.
2909 the queue depth) when a running job completes or other
2910 routine actions occur, however the frequency with which
2911 the scheduler is run may be limited by using the defer or
2912 sched_min_interval parameters described below. The full
2913 queue will be tested on a less frequent basis as defined
2914 by the sched_interval option described below. The default
2915 value is 100. See the partition_job_depth option to
2916 limit depth by partition.
2917
2918 defer Setting this option will avoid attempting to schedule
2919 each job individually at job submit time, but defer it
2920 until a later time when scheduling multiple jobs simulta‐
2921 neously may be possible. This option may improve system
2922 responsiveness when large numbers of jobs (many hundreds)
2923 are submitted at the same time, but it will delay the
2924 initiation time of individual jobs. Also see de‐
2925 fault_queue_depth above.
2926
2927 delay_boot=#
2928 Do not reboot nodes in order to satisfied this job's fea‐
2929 ture specification if the job has been eligible to run
2930 for less than this time period. If the job has waited
2931 for less than the specified period, it will use only
2932 nodes which already have the specified features. The ar‐
2933 gument is in units of minutes. Individual jobs may over‐
2934 ride this default value with the --delay-boot option.
2935
2936 disable_job_shrink
2937 Deny user requests to shrink the size of running jobs.
2938 (However, running jobs may still shrink due to node fail‐
2939 ure if the --no-kill option was set.)
2940
2941 disable_hetjob_steps
2942 Disable job steps that span heterogeneous job alloca‐
2943 tions.
2944
2945 enable_hetjob_steps
2946 Enable job steps that span heterogeneous job allocations.
2947 The default value.
2948
2949 enable_user_top
2950 Enable use of the "scontrol top" command by non-privi‐
2951 leged users.
2952
2953 Ignore_NUMA
2954 Some processors (e.g. AMD Opteron 6000 series) contain
2955 multiple NUMA nodes per socket. This is a configuration
2956 which does not map into the hardware entities that Slurm
2957 optimizes resource allocation for (PU/thread, core,
2958 socket, baseboard, node and network switch). In order to
2959 optimize resource allocations on such hardware, Slurm
2960 will consider each NUMA node within the socket as a sepa‐
2961 rate socket by default. Use the Ignore_NUMA option to re‐
2962 port the correct socket count, but not optimize resource
2963 allocations on the NUMA nodes.
2964
2965 max_array_tasks
2966 Specify the maximum number of tasks that can be included
2967 in a job array. The default limit is MaxArraySize, but
2968 this option can be used to set a lower limit. For exam‐
2969 ple, max_array_tasks=1000 and MaxArraySize=100001 would
2970 permit a maximum task ID of 100000, but limit the number
2971 of tasks in any single job array to 1000.
2972
2973 max_rpc_cnt=#
2974 If the number of active threads in the slurmctld daemon
2975 is equal to or larger than this value, defer scheduling
2976 of jobs. The scheduler will check this condition at cer‐
2977 tain points in code and yield locks if necessary. This
2978 can improve Slurm's ability to process requests at a cost
2979 of initiating new jobs less frequently. Default: 0 (op‐
2980 tion disabled), Min: 0, Max: 1000.
2981
2982 NOTE: The maximum number of threads (MAX_SERVER_THREADS)
2983 is internally set to 256 and defines the number of served
2984 RPCs at a given time. Setting max_rpc_cnt to more than
2985 256 will be only useful to let backfill continue schedul‐
2986 ing work after locks have been yielded (i.e. each 2 sec‐
2987 onds) if there are a maximum of MAX(max_rpc_cnt/10, 20)
2988 RPCs in the queue. i.e. max_rpc_cnt=1000, the scheduler
2989 will be allowed to continue after yielding locks only
2990 when there are less than or equal to 100 pending RPCs.
2991 If a value is set, then a value of 10 or higher is recom‐
2992 mended. It may require some tuning for each system, but
2993 needs to be high enough that scheduling isn't always dis‐
2994 abled, and low enough that requests can get through in a
2995 reasonable period of time.
2996
2997 max_sched_time=#
2998 How long, in seconds, that the main scheduling loop will
2999 execute for before exiting. If a value is configured, be
3000 aware that all other Slurm operations will be deferred
3001 during this time period. Make certain the value is lower
3002 than MessageTimeout. If a value is not explicitly con‐
3003 figured, the default value is half of MessageTimeout with
3004 a minimum default value of 1 second and a maximum default
3005 value of 2 seconds. For example if MessageTimeout=10,
3006 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
3007
3008 max_script_size=#
3009 Specify the maximum size of a batch script, in bytes.
3010 The default value is 4 megabytes. Larger values may ad‐
3011 versely impact system performance.
3012
3013 max_switch_wait=#
3014 Maximum number of seconds that a job can delay execution
3015 waiting for the specified desired switch count. The de‐
3016 fault value is 300 seconds.
3017
3018 no_backup_scheduling
3019 If used, the backup controller will not schedule jobs
3020 when it takes over. The backup controller will allow jobs
3021 to be submitted, modified and cancelled but won't sched‐
3022 ule new jobs. This is useful in Cray environments when
3023 the backup controller resides on an external Cray node.
3024 A restart of slurmctld is required for changes to this
3025 parameter to take effect.
3026
3027 no_env_cache
3028 If used, any job started on node that fails to load the
3029 env from a node will fail instead of using the cached
3030 env. This will also implicitly imply the re‐
3031 queue_setup_env_fail option as well.
3032
3033 nohold_on_prolog_fail
3034 By default, if the Prolog exits with a non-zero value the
3035 job is requeued in a held state. By specifying this pa‐
3036 rameter the job will be requeued but not held so that the
3037 scheduler can dispatch it to another host.
3038
3039 pack_serial_at_end
3040 If used with the select/cons_res or select/cons_tres
3041 plugin, then put serial jobs at the end of the available
3042 nodes rather than using a best fit algorithm. This may
3043 reduce resource fragmentation for some workloads.
3044
3045 partition_job_depth=#
3046 The default number of jobs to attempt scheduling (i.e.
3047 the queue depth) from each partition/queue in Slurm's
3048 main scheduling logic. The functionality is similar to
3049 that provided by the bf_max_job_part option for the back‐
3050 fill scheduling logic. The default value is 0 (no
3051 limit). Job's excluded from attempted scheduling based
3052 upon partition will not be counted against the de‐
3053 fault_queue_depth limit. Also see the bf_max_job_part
3054 option.
3055
3056 preempt_reorder_count=#
3057 Specify how many attempts should be made in reordering
3058 preemptable jobs to minimize the count of jobs preempted.
3059 The default value is 1. High values may adversely impact
3060 performance. The logic to support this option is only
3061 available in the select/cons_res and select/cons_tres
3062 plugins.
3063
3064 preempt_strict_order
3065 If set, then execute extra logic in an attempt to preempt
3066 only the lowest priority jobs. It may be desirable to
3067 set this configuration parameter when there are multiple
3068 priorities of preemptable jobs. The logic to support
3069 this option is only available in the select/cons_res and
3070 select/cons_tres plugins.
3071
3072 preempt_youngest_first
3073 If set, then the preemption sorting algorithm will be
3074 changed to sort by the job start times to favor preempt‐
3075 ing younger jobs over older. (Requires preempt/parti‐
3076 tion_prio or preempt/qos plugins.)
3077
3078 reduce_completing_frag
3079 This option is used to control how scheduling of re‐
3080 sources is performed when jobs are in the COMPLETING
3081 state, which influences potential fragmentation. If this
3082 option is not set then no jobs will be started in any
3083 partition when any job is in the COMPLETING state for
3084 less than CompleteWait seconds. If this option is set
3085 then no jobs will be started in any individual partition
3086 that has a job in COMPLETING state for less than Com‐
3087 pleteWait seconds. In addition, no jobs will be started
3088 in any partition with nodes that overlap with any nodes
3089 in the partition of the completing job. This option is
3090 to be used in conjunction with CompleteWait.
3091
3092 NOTE: CompleteWait must be set in order for this to work.
3093 If CompleteWait=0 then this option does nothing.
3094
3095 NOTE: reduce_completing_frag only affects the main sched‐
3096 uler, not the backfill scheduler.
3097
3098 requeue_setup_env_fail
3099 By default if a job environment setup fails the job keeps
3100 running with a limited environment. By specifying this
3101 parameter the job will be requeued in held state and the
3102 execution node drained.
3103
3104 salloc_wait_nodes
3105 If defined, the salloc command will wait until all allo‐
3106 cated nodes are ready for use (i.e. booted) before the
3107 command returns. By default, salloc will return as soon
3108 as the resource allocation has been made.
3109
3110 sbatch_wait_nodes
3111 If defined, the sbatch script will wait until all allo‐
3112 cated nodes are ready for use (i.e. booted) before the
3113 initiation. By default, the sbatch script will be initi‐
3114 ated as soon as the first node in the job allocation is
3115 ready. The sbatch command can use the --wait-all-nodes
3116 option to override this configuration parameter.
3117
3118 sched_interval=#
3119 How frequently, in seconds, the main scheduling loop will
3120 execute and test all pending jobs. The default value is
3121 60 seconds.
3122
3123 sched_max_job_start=#
3124 The maximum number of jobs that the main scheduling logic
3125 will start in any single execution. The default value is
3126 zero, which imposes no limit.
3127
3128 sched_min_interval=#
3129 How frequently, in microseconds, the main scheduling loop
3130 will execute and test any pending jobs. The scheduler
3131 runs in a limited fashion every time that any event hap‐
3132 pens which could enable a job to start (e.g. job submit,
3133 job terminate, etc.). If these events happen at a high
3134 frequency, the scheduler can run very frequently and con‐
3135 sume significant resources if not throttled by this op‐
3136 tion. This option specifies the minimum time between the
3137 end of one scheduling cycle and the beginning of the next
3138 scheduling cycle. A value of zero will disable throt‐
3139 tling of the scheduling logic interval. The default
3140 value is 2 microseconds.
3141
3142 spec_cores_first
3143 Specialized cores will be selected from the first cores
3144 of the first sockets, cycling through the sockets on a
3145 round robin basis. By default, specialized cores will be
3146 selected from the last cores of the last sockets, cycling
3147 through the sockets on a round robin basis.
3148
3149 step_retry_count=#
3150 When a step completes and there are steps ending resource
3151 allocation, then retry step allocations for at least this
3152 number of pending steps. Also see step_retry_time. The
3153 default value is 8 steps.
3154
3155 step_retry_time=#
3156 When a step completes and there are steps ending resource
3157 allocation, then retry step allocations for all steps
3158 which have been pending for at least this number of sec‐
3159 onds. Also see step_retry_count. The default value is
3160 60 seconds.
3161
3162 whole_hetjob
3163 Requests to cancel, hold or release any component of a
3164 heterogeneous job will be applied to all components of
3165 the job.
3166
3167 NOTE: this option was previously named whole_pack and
3168 this is still supported for retrocompatibility.
3169
3170 SchedulerTimeSlice
3171 Number of seconds in each time slice when gang scheduling is en‐
3172 abled (PreemptMode=SUSPEND,GANG). The value must be between 5
3173 seconds and 65533 seconds. The default value is 30 seconds.
3174
3175 SchedulerType
3176 Identifies the type of scheduler to be used. A restart of
3177 slurmctld is required for changes to this parameter to take ef‐
3178 fect. The scontrol command can be used to manually change job
3179 priorities if desired. Acceptable values include:
3180
3181 sched/backfill
3182 For a backfill scheduling module to augment the default
3183 FIFO scheduling. Backfill scheduling will initiate
3184 lower-priority jobs if doing so does not delay the ex‐
3185 pected initiation time of any higher priority job. Ef‐
3186 fectiveness of backfill scheduling is dependent upon
3187 users specifying job time limits, otherwise all jobs will
3188 have the same time limit and backfilling is impossible.
3189 Note documentation for the SchedulerParameters option
3190 above. This is the default configuration.
3191
3192 sched/builtin
3193 This is the FIFO scheduler which initiates jobs in prior‐
3194 ity order. If any job in the partition can not be sched‐
3195 uled, no lower priority job in that partition will be
3196 scheduled. An exception is made for jobs that can not
3197 run due to partition constraints (e.g. the time limit) or
3198 down/drained nodes. In that case, lower priority jobs
3199 can be initiated and not impact the higher priority job.
3200
3201 ScronParameters
3202 Multiple options may be comma separated.
3203
3204 enable Enable the use of scrontab to submit and manage periodic
3205 repeating jobs.
3206
3207 SelectType
3208 Identifies the type of resource selection algorithm to be used.
3209 A restart of slurmctld is required for changes to this parameter
3210 to take effect. When changed, all job information (running and
3211 pending) will be lost, since the job state save format used by
3212 each plugin is different. The only exception to this is when
3213 changing from cons_res to cons_tres or from cons_tres to
3214 cons_res. However, if a job contains cons_tres-specific features
3215 and then SelectType is changed to cons_res, the job will be can‐
3216 celed, since there is no way for cons_res to satisfy require‐
3217 ments specific to cons_tres.
3218
3219 Acceptable values include
3220
3221 select/cons_res
3222 The resources (cores and memory) within a node are indi‐
3223 vidually allocated as consumable resources. Note that
3224 whole nodes can be allocated to jobs for selected parti‐
3225 tions by using the OverSubscribe=Exclusive option. See
3226 the partition OverSubscribe parameter for more informa‐
3227 tion.
3228
3229 select/cons_tres
3230 The resources (cores, memory, GPUs and all other track‐
3231 able resources) within a node are individually allocated
3232 as consumable resources. Note that whole nodes can be
3233 allocated to jobs for selected partitions by using the
3234 OverSubscribe=Exclusive option. See the partition Over‐
3235 Subscribe parameter for more information.
3236
3237 select/cray_aries
3238 for a Cray system. The default value is "se‐
3239 lect/cray_aries" for all Cray systems.
3240
3241 select/linear
3242 for allocation of entire nodes assuming a one-dimensional
3243 array of nodes in which sequentially ordered nodes are
3244 preferable. For a heterogeneous cluster (e.g. different
3245 CPU counts on the various nodes), resource allocations
3246 will favor nodes with high CPU counts as needed based
3247 upon the job's node and CPU specification if TopologyPlu‐
3248 gin=topology/none is configured. Use of other topology
3249 plugins with select/linear and heterogeneous nodes is not
3250 recommended and may result in valid job allocation re‐
3251 quests being rejected. This is the default value.
3252
3253 SelectTypeParameters
3254 The permitted values of SelectTypeParameters depend upon the
3255 configured value of SelectType. The only supported options for
3256 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3257 which treats memory as a consumable resource and prevents memory
3258 over subscription with job preemption or gang scheduling. By
3259 default SelectType=select/linear allocates whole nodes to jobs
3260 without considering their memory consumption. By default Se‐
3261 lectType=select/cons_res, SelectType=select/cray_aries, and Se‐
3262 lectType=select/cons_tres, use CR_Core_Memory, which allocates
3263 Core to jobs with considering their memory consumption.
3264
3265 A restart of slurmctld is required for changes to this parameter
3266 to take effect.
3267
3268 The following options are supported for SelectType=se‐
3269 lect/cray_aries:
3270
3271 OTHER_CONS_RES
3272 Layer the select/cons_res plugin under the se‐
3273 lect/cray_aries plugin, the default is to layer on se‐
3274 lect/linear. This also allows all the options available
3275 for SelectType=select/cons_res.
3276
3277 OTHER_CONS_TRES
3278 Layer the select/cons_tres plugin under the se‐
3279 lect/cray_aries plugin, the default is to layer on se‐
3280 lect/linear. This also allows all the options available
3281 for SelectType=select/cons_tres.
3282
3283 The following options are supported by the SelectType=select/cons_res
3284 and SelectType=select/cons_tres plugins:
3285
3286 CR_CPU CPUs are consumable resources. Configure the number of
3287 CPUs on each node, which may be equal to the count of
3288 cores or hyper-threads on the node depending upon the de‐
3289 sired minimum resource allocation. The node's Boards,
3290 Sockets, CoresPerSocket and ThreadsPerCore may optionally
3291 be configured and result in job allocations which have
3292 improved locality; however doing so will prevent more
3293 than one job from being allocated on each core.
3294
3295 CR_CPU_Memory
3296 CPUs and memory are consumable resources. Configure the
3297 number of CPUs on each node, which may be equal to the
3298 count of cores or hyper-threads on the node depending
3299 upon the desired minimum resource allocation. The node's
3300 Boards, Sockets, CoresPerSocket and ThreadsPerCore may
3301 optionally be configured and result in job allocations
3302 which have improved locality; however doing so will pre‐
3303 vent more than one job from being allocated on each core.
3304 Setting a value for DefMemPerCPU is strongly recommended.
3305
3306 CR_Core
3307 Cores are consumable resources. On nodes with hy‐
3308 per-threads, each thread is counted as a CPU to satisfy a
3309 job's resource requirement, but multiple jobs are not al‐
3310 located threads on the same core. The count of CPUs al‐
3311 located to a job is rounded up to account for every CPU
3312 on an allocated core. This will also impact total allo‐
3313 cated memory when --mem-per-cpu is used to be multiply of
3314 total number of CPUs on allocated cores.
3315
3316 CR_Core_Memory
3317 Cores and memory are consumable resources. On nodes with
3318 hyper-threads, each thread is counted as a CPU to satisfy
3319 a job's resource requirement, but multiple jobs are not
3320 allocated threads on the same core. The count of CPUs
3321 allocated to a job may be rounded up to account for every
3322 CPU on an allocated core. Setting a value for DefMemPer‐
3323 CPU is strongly recommended.
3324
3325 CR_ONE_TASK_PER_CORE
3326 Allocate one task per core by default. Without this op‐
3327 tion, by default one task will be allocated per thread on
3328 nodes with more than one ThreadsPerCore configured.
3329 NOTE: This option cannot be used with CR_CPU*.
3330
3331 CR_CORE_DEFAULT_DIST_BLOCK
3332 Allocate cores within a node using block distribution by
3333 default. This is a pseudo-best-fit algorithm that mini‐
3334 mizes the number of boards and minimizes the number of
3335 sockets (within minimum boards) used for the allocation.
3336 This default behavior can be overridden specifying a par‐
3337 ticular "-m" parameter with srun/salloc/sbatch. Without
3338 this option, cores will be allocated cyclically across
3339 the sockets.
3340
3341 CR_LLN Schedule resources to jobs on the least loaded nodes
3342 (based upon the number of idle CPUs). This is generally
3343 only recommended for an environment with serial jobs as
3344 idle resources will tend to be highly fragmented, result‐
3345 ing in parallel jobs being distributed across many nodes.
3346 Note that node Weight takes precedence over how many idle
3347 resources are on each node. Also see the partition con‐
3348 figuration parameter LLN use the least loaded nodes in
3349 selected partitions.
3350
3351 CR_Pack_Nodes
3352 If a job allocation contains more resources than will be
3353 used for launching tasks (e.g. if whole nodes are allo‐
3354 cated to a job), then rather than distributing a job's
3355 tasks evenly across its allocated nodes, pack them as
3356 tightly as possible on these nodes. For example, con‐
3357 sider a job allocation containing two entire nodes with
3358 eight CPUs each. If the job starts ten tasks across
3359 those two nodes without this option, it will start five
3360 tasks on each of the two nodes. With this option, eight
3361 tasks will be started on the first node and two tasks on
3362 the second node. This can be superseded by "NoPack" in
3363 srun's "--distribution" option. CR_Pack_Nodes only ap‐
3364 plies when the "block" task distribution method is used.
3365
3366 CR_Socket
3367 Sockets are consumable resources. On nodes with multiple
3368 cores, each core or thread is counted as a CPU to satisfy
3369 a job's resource requirement, but multiple jobs are not
3370 allocated resources on the same socket.
3371
3372 CR_Socket_Memory
3373 Memory and sockets are consumable resources. On nodes
3374 with multiple cores, each core or thread is counted as a
3375 CPU to satisfy a job's resource requirement, but multiple
3376 jobs are not allocated resources on the same socket.
3377 Setting a value for DefMemPerCPU is strongly recommended.
3378
3379 CR_Memory
3380 Memory is a consumable resource. NOTE: This implies
3381 OverSubscribe=YES or OverSubscribe=FORCE for all parti‐
3382 tions. Setting a value for DefMemPerCPU is strongly rec‐
3383 ommended.
3384
3385 SlurmctldAddr
3386 An optional address to be used for communications to the cur‐
3387 rently active slurmctld daemon, normally used with Virtual IP
3388 addressing of the currently active server. If this parameter is
3389 not specified then each primary and backup server will have its
3390 own unique address used for communications as specified in the
3391 SlurmctldHost parameter. If this parameter is specified then
3392 the SlurmctldHost parameter will still be used for communica‐
3393 tions to specific slurmctld primary or backup servers, for exam‐
3394 ple to cause all of them to read the current configuration files
3395 or shutdown. Also see the SlurmctldPrimaryOffProg and Slurm‐
3396 ctldPrimaryOnProg configuration parameters to configure programs
3397 to manipulate virtual IP address manipulation.
3398
3399 SlurmctldDebug
3400 The level of detail to provide slurmctld daemon's logs. The de‐
3401 fault value is info. If the slurmctld daemon is initiated with
3402 -v or --verbose options, that debug level will be preserve or
3403 restored upon reconfiguration.
3404
3405 quiet Log nothing
3406
3407 fatal Log only fatal errors
3408
3409 error Log only errors
3410
3411 info Log errors and general informational messages
3412
3413 verbose Log errors and verbose informational messages
3414
3415 debug Log errors and verbose informational messages and de‐
3416 bugging messages
3417
3418 debug2 Log errors and verbose informational messages and more
3419 debugging messages
3420
3421 debug3 Log errors and verbose informational messages and even
3422 more debugging messages
3423
3424 debug4 Log errors and verbose informational messages and even
3425 more debugging messages
3426
3427 debug5 Log errors and verbose informational messages and even
3428 more debugging messages
3429
3430 SlurmctldHost
3431 The short, or long, hostname of the machine where Slurm control
3432 daemon is executed (i.e. the name returned by the command "host‐
3433 name -s"). This hostname is optionally followed by the address,
3434 either the IP address or a name by which the address can be
3435 identified, enclosed in parentheses (e.g. SlurmctldHost=slurm‐
3436 ctl-primary(12.34.56.78)). This value must be specified at least
3437 once. If specified more than once, the first hostname named will
3438 be where the daemon runs. If the first specified host fails,
3439 the daemon will execute on the second host. If both the first
3440 and second specified host fails, the daemon will execute on the
3441 third host. A restart of slurmctld is required for changes to
3442 this parameter to take effect.
3443
3444 SlurmctldLogFile
3445 Fully qualified pathname of a file into which the slurmctld dae‐
3446 mon's logs are written. The default value is none (performs
3447 logging via syslog).
3448 See the section LOGGING if a pathname is specified.
3449
3450 SlurmctldParameters
3451 Multiple options may be comma separated.
3452
3453 allow_user_triggers
3454 Permit setting triggers from non-root/slurm_user users.
3455 SlurmUser must also be set to root to permit these trig‐
3456 gers to work. See the strigger man page for additional
3457 details.
3458
3459 cloud_dns
3460 By default, Slurm expects that the network address for a
3461 cloud node won't be known until the creation of the node
3462 and that Slurm will be notified of the node's address
3463 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3464 Since Slurm communications rely on the node configuration
3465 found in the slurm.conf, Slurm will tell the client com‐
3466 mand, after waiting for all nodes to boot, each node's ip
3467 address. However, in environments where the nodes are in
3468 DNS, this step can be avoided by configuring this option.
3469
3470 cloud_reg_addrs
3471 When a cloud node registers, the node's NodeAddr and
3472 NodeHostName will automatically be set. They will be re‐
3473 set back to the nodename after powering off.
3474
3475 enable_configless
3476 Permit "configless" operation by the slurmd, slurmstepd,
3477 and user commands. When enabled the slurmd will be per‐
3478 mitted to retrieve config files from the slurmctld, and
3479 on any 'scontrol reconfigure' command new configs will be
3480 automatically pushed out and applied to nodes that are
3481 running in this "configless" mode. A restart of slurm‐
3482 ctld is required for changes to this parameter to take
3483 effect.
3484
3485 idle_on_node_suspend
3486 Mark nodes as idle, regardless of current state, when
3487 suspending nodes with SuspendProgram so that nodes will
3488 be eligible to be resumed at a later time.
3489
3490 node_reg_mem_percent=#
3491 Percentage of memory a node is allowed to register with
3492 without being marked as invalid with low memory. Default
3493 is 100. For State=CLOUD nodes, the default is 90. To dis‐
3494 able this for cloud nodes set it to 100. config_overrides
3495 takes precedence over this option.
3496
3497 It's recommended that task/cgroup with ConstrainRamSpace
3498 is configured. A memory cgroup limit won't be set more
3499 than the actual memory on the node. If needed, configure
3500 AllowedRamSpace in the cgroup.conf to add a buffer.
3501
3502 power_save_interval
3503 How often the power_save thread looks to resume and sus‐
3504 pend nodes. The power_save thread will do work sooner if
3505 there are node state changes. Default is 10 seconds.
3506
3507 power_save_min_interval
3508 How often the power_save thread, at a minimum, looks to
3509 resume and suspend nodes. Default is 0.
3510
3511 max_dbd_msg_action
3512 Action used once MaxDBDMsgs is reached, options are 'dis‐
3513 card' (default) and 'exit'.
3514
3515 When 'discard' is specified and MaxDBDMsgs is reached we
3516 start by purging pending messages of types Step start and
3517 complete, and it reaches MaxDBDMsgs again Job start mes‐
3518 sages are purged. Job completes and node state changes
3519 continue to consume the empty space created from the
3520 purgings until MaxDBDMsgs is reached again at which no
3521 new message is tracked creating data loss and potentially
3522 runaway jobs.
3523
3524 When 'exit' is specified and MaxDBDMsgs is reached the
3525 slurmctld will exit instead of discarding any messages.
3526 It will be impossible to start the slurmctld with this
3527 option where the slurmdbd is down and the slurmctld is
3528 tracking more than MaxDBDMsgs.
3529
3530 preempt_send_user_signal
3531 Send the user signal (e.g. --signal=<sig_num>) at preemp‐
3532 tion time even if the signal time hasn't been reached. In
3533 the case of a gracetime preemption the user signal will
3534 be sent if the user signal has been specified and not
3535 sent, otherwise a SIGTERM will be sent to the tasks.
3536
3537 reboot_from_controller
3538 Run the RebootProgram from the controller instead of on
3539 the slurmds. The RebootProgram will be passed a
3540 comma-separated list of nodes to reboot.
3541
3542 user_resv_delete
3543 Allow any user able to run in a reservation to delete it.
3544
3545 SlurmctldPidFile
3546 Fully qualified pathname of a file into which the slurmctld
3547 daemon may write its process id. This may be used for automated
3548 signal processing. The default value is "/var/run/slurm‐
3549 ctld.pid".
3550
3551 SlurmctldPlugstack
3552 A comma-delimited list of Slurm controller plugins to be started
3553 when the daemon begins and terminated when it ends. Only the
3554 plugin's init and fini functions are called.
3555
3556 SlurmctldPort
3557 The port number that the Slurm controller, slurmctld, listens to
3558 for work. The default value is SLURMCTLD_PORT as established at
3559 system build time. If none is explicitly specified, it will be
3560 set to 6817. SlurmctldPort may also be configured to support a
3561 range of port numbers in order to accept larger bursts of incom‐
3562 ing messages by specifying two numbers separated by a dash (e.g.
3563 SlurmctldPort=6817-6818). A restart of slurmctld is required
3564 for changes to this parameter to take effect. NOTE: Either
3565 slurmctld and slurmd daemons must not execute on the same nodes
3566 or the values of SlurmctldPort and SlurmdPort must be different.
3567
3568 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3569 automatically try to interact with anything opened on ports
3570 8192-60000. Configure SlurmctldPort to use a port outside of
3571 the configured SrunPortRange and RSIP's port range.
3572
3573 SlurmctldPrimaryOffProg
3574 This program is executed when a slurmctld daemon running as the
3575 primary server becomes a backup server. By default no program is
3576 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3577 ter.
3578
3579 SlurmctldPrimaryOnProg
3580 This program is executed when a slurmctld daemon running as a
3581 backup server becomes the primary server. By default no program
3582 is executed. When using virtual IP addresses to manage High
3583 Available Slurm services, this program can be used to add the IP
3584 address to an interface (and optionally try to kill the unre‐
3585 sponsive slurmctld daemon and flush the ARP caches on nodes on
3586 the local Ethernet fabric). See also the related "SlurmctldPri‐
3587 maryOffProg" parameter.
3588
3589 SlurmctldSyslogDebug
3590 The slurmctld daemon will log events to the syslog file at the
3591 specified level of detail. If not set, the slurmctld daemon will
3592 log to syslog at level fatal, unless there is no SlurmctldLog‐
3593 File and it is running in the background, in which case it will
3594 log to syslog at the level specified by SlurmctldDebug (at fatal
3595 in the case that SlurmctldDebug is set to quiet) or it is run in
3596 the foreground, when it will be set to quiet.
3597
3598 quiet Log nothing
3599
3600 fatal Log only fatal errors
3601
3602 error Log only errors
3603
3604 info Log errors and general informational messages
3605
3606 verbose Log errors and verbose informational messages
3607
3608 debug Log errors and verbose informational messages and de‐
3609 bugging messages
3610
3611 debug2 Log errors and verbose informational messages and more
3612 debugging messages
3613
3614 debug3 Log errors and verbose informational messages and even
3615 more debugging messages
3616
3617 debug4 Log errors and verbose informational messages and even
3618 more debugging messages
3619
3620 debug5 Log errors and verbose informational messages and even
3621 more debugging messages
3622
3623 NOTE: By default, Slurm's systemd service files start daemons in
3624 the foreground with the -D option. This means that systemd will
3625 capture stdout/stderr output and print that to syslog, indepen‐
3626 dent of Slurm printing to syslog directly. To prevent systemd
3627 from doing this, add "StandardOutput=null" and "StandardEr‐
3628 ror=null" to the respective service files or override files.
3629
3630 SlurmctldTimeout
3631 The interval, in seconds, that the backup controller waits for
3632 the primary controller to respond before assuming control. The
3633 default value is 120 seconds. May not exceed 65533.
3634
3635 SlurmdDebug
3636 The level of detail to provide slurmd daemon's logs. The de‐
3637 fault value is info.
3638
3639 quiet Log nothing
3640
3641 fatal Log only fatal errors
3642
3643 error Log only errors
3644
3645 info Log errors and general informational messages
3646
3647 verbose Log errors and verbose informational messages
3648
3649 debug Log errors and verbose informational messages and de‐
3650 bugging messages
3651
3652 debug2 Log errors and verbose informational messages and more
3653 debugging messages
3654
3655 debug3 Log errors and verbose informational messages and even
3656 more debugging messages
3657
3658 debug4 Log errors and verbose informational messages and even
3659 more debugging messages
3660
3661 debug5 Log errors and verbose informational messages and even
3662 more debugging messages
3663
3664 SlurmdLogFile
3665 Fully qualified pathname of a file into which the slurmd dae‐
3666 mon's logs are written. The default value is none (performs
3667 logging via syslog). Any "%h" within the name is replaced with
3668 the hostname on which the slurmd is running. Any "%n" within
3669 the name is replaced with the Slurm node name on which the
3670 slurmd is running.
3671 See the section LOGGING if a pathname is specified.
3672
3673 SlurmdParameters
3674 Parameters specific to the Slurmd. Multiple options may be
3675 comma separated.
3676
3677 config_overrides
3678 If set, consider the configuration of each node to be
3679 that specified in the slurm.conf configuration file and
3680 any node with less than the configured resources will not
3681 be set to INVAL/INVALID_REG. This option is generally
3682 only useful for testing purposes. Equivalent to the now
3683 deprecated FastSchedule=2 option.
3684
3685 l3cache_as_socket
3686 Use the hwloc l3cache as the socket count. Can be useful
3687 on certain processors where the socket level is too
3688 coarse, and the l3cache may provide better task distribu‐
3689 tion. (E.g., along CCX boundaries instead of socket
3690 boundaries.) Requires hwloc v2.
3691
3692 shutdown_on_reboot
3693 If set, the Slurmd will shut itself down when a reboot
3694 request is received.
3695
3696 SlurmdPidFile
3697 Fully qualified pathname of a file into which the slurmd daemon
3698 may write its process id. This may be used for automated signal
3699 processing. Any "%h" within the name is replaced with the host‐
3700 name on which the slurmd is running. Any "%n" within the name
3701 is replaced with the Slurm node name on which the slurmd is run‐
3702 ning. The default value is "/var/run/slurmd.pid".
3703
3704 SlurmdPort
3705 The port number that the Slurm compute node daemon, slurmd, lis‐
3706 tens to for work. The default value is SLURMD_PORT as estab‐
3707 lished at system build time. If none is explicitly specified,
3708 its value will be 6818. A restart of slurmctld is required for
3709 changes to this parameter to take effect. NOTE: Either slurm‐
3710 ctld and slurmd daemons must not execute on the same nodes or
3711 the values of SlurmctldPort and SlurmdPort must be different.
3712
3713 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3714 automatically try to interact with anything opened on ports
3715 8192-60000. Configure SlurmdPort to use a port outside of the
3716 configured SrunPortRange and RSIP's port range.
3717
3718 SlurmdSpoolDir
3719 Fully qualified pathname of a directory into which the slurmd
3720 daemon's state information and batch job script information are
3721 written. This must be a common pathname for all nodes, but
3722 should represent a directory which is local to each node (refer‐
3723 ence a local file system). The default value is
3724 "/var/spool/slurmd". Any "%h" within the name is replaced with
3725 the hostname on which the slurmd is running. Any "%n" within
3726 the name is replaced with the Slurm node name on which the
3727 slurmd is running.
3728
3729 SlurmdSyslogDebug
3730 The slurmd daemon will log events to the syslog file at the
3731 specified level of detail. If not set, the slurmd daemon will
3732 log to syslog at level fatal, unless there is no SlurmdLogFile
3733 and it is running in the background, in which case it will log
3734 to syslog at the level specified by SlurmdDebug (at fatal in
3735 the case that SlurmdDebug is set to quiet) or it is run in the
3736 foreground, when it will be set to quiet.
3737
3738 quiet Log nothing
3739
3740 fatal Log only fatal errors
3741
3742 error Log only errors
3743
3744 info Log errors and general informational messages
3745
3746 verbose Log errors and verbose informational messages
3747
3748 debug Log errors and verbose informational messages and de‐
3749 bugging messages
3750
3751 debug2 Log errors and verbose informational messages and more
3752 debugging messages
3753
3754 debug3 Log errors and verbose informational messages and even
3755 more debugging messages
3756
3757 debug4 Log errors and verbose informational messages and even
3758 more debugging messages
3759
3760 debug5 Log errors and verbose informational messages and even
3761 more debugging messages
3762
3763 NOTE: By default, Slurm's systemd service files start daemons in
3764 the foreground with the -D option. This means that systemd will
3765 capture stdout/stderr output and print that to syslog, indepen‐
3766 dent of Slurm printing to syslog directly. To prevent systemd
3767 from doing this, add "StandardOutput=null" and "StandardEr‐
3768 ror=null" to the respective service files or override files.
3769
3770 SlurmdTimeout
3771 The interval, in seconds, that the Slurm controller waits for
3772 slurmd to respond before configuring that node's state to DOWN.
3773 A value of zero indicates the node will not be tested by slurm‐
3774 ctld to confirm the state of slurmd, the node will not be auto‐
3775 matically set to a DOWN state indicating a non-responsive
3776 slurmd, and some other tool will take responsibility for moni‐
3777 toring the state of each compute node and its slurmd daemon.
3778 Slurm's hierarchical communication mechanism is used to ping the
3779 slurmd daemons in order to minimize system noise and overhead.
3780 The default value is 300 seconds. The value may not exceed
3781 65533 seconds.
3782
3783 SlurmdUser
3784 The name of the user that the slurmd daemon executes as. This
3785 user must exist on all nodes of the cluster for authentication
3786 of communications between Slurm components. The default value
3787 is "root".
3788
3789 SlurmSchedLogFile
3790 Fully qualified pathname of the scheduling event logging file.
3791 The syntax of this parameter is the same as for SlurmctldLog‐
3792 File. In order to configure scheduler logging, set both the
3793 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3794
3795 SlurmSchedLogLevel
3796 The initial level of scheduling event logging, similar to the
3797 SlurmctldDebug parameter used to control the initial level of
3798 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3799 (scheduler logging disabled) and "1" (scheduler logging en‐
3800 abled). If this parameter is omitted, the value defaults to "0"
3801 (disabled). In order to configure scheduler logging, set both
3802 the SlurmSchedLogFile and SlurmSchedLogLevel parameters. The
3803 scheduler logging level can be changed dynamically using scon‐
3804 trol.
3805
3806 SlurmUser
3807 The name of the user that the slurmctld daemon executes as. For
3808 security purposes, a user other than "root" is recommended.
3809 This user must exist on all nodes of the cluster for authentica‐
3810 tion of communications between Slurm components. The default
3811 value is "root".
3812
3813 SrunEpilog
3814 Fully qualified pathname of an executable to be run by srun fol‐
3815 lowing the completion of a job step. The command line arguments
3816 for the executable will be the command and arguments of the job
3817 step. This configuration parameter may be overridden by srun's
3818 --epilog parameter. Note that while the other "Epilog" executa‐
3819 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
3820 where the tasks are executed, the SrunEpilog runs on the node
3821 where the "srun" is executing.
3822
3823 SrunPortRange
3824 The srun creates a set of listening ports to communicate with
3825 the controller, the slurmstepd and to handle the application
3826 I/O. By default these ports are ephemeral meaning the port num‐
3827 bers are selected by the kernel. Using this parameter allow
3828 sites to configure a range of ports from which srun ports will
3829 be selected. This is useful if sites want to allow only certain
3830 port range on their network.
3831
3832 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3833 automatically try to interact with anything opened on ports
3834 8192-60000. Configure SrunPortRange to use a range of ports
3835 above those used by RSIP, ideally 1000 or more ports, for exam‐
3836 ple "SrunPortRange=60001-63000".
3837
3838 Note: SrunPortRange must be large enough to cover the expected
3839 number of srun ports created on a given submission node. A sin‐
3840 gle srun opens 3 listening ports plus 2 more for every 48 hosts.
3841 Example:
3842
3843 srun -N 48 will use 5 listening ports.
3844
3845 srun -N 50 will use 7 listening ports.
3846
3847 srun -N 200 will use 13 listening ports.
3848
3849 SrunProlog
3850 Fully qualified pathname of an executable to be run by srun
3851 prior to the launch of a job step. The command line arguments
3852 for the executable will be the command and arguments of the job
3853 step. This configuration parameter may be overridden by srun's
3854 --prolog parameter. Note that while the other "Prolog" executa‐
3855 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
3856 where the tasks are executed, the SrunProlog runs on the node
3857 where the "srun" is executing.
3858
3859 StateSaveLocation
3860 Fully qualified pathname of a directory into which the Slurm
3861 controller, slurmctld, saves its state (e.g. "/usr/lo‐
3862 cal/slurm/checkpoint"). Slurm state will saved here to recover
3863 from system failures. SlurmUser must be able to create files in
3864 this directory. If you have a secondary SlurmctldHost config‐
3865 ured, this location should be readable and writable by both sys‐
3866 tems. Since all running and pending job information is stored
3867 here, the use of a reliable file system (e.g. RAID) is recom‐
3868 mended. The default value is "/var/spool". A restart of slurm‐
3869 ctld is required for changes to this parameter to take effect.
3870 If any slurm daemons terminate abnormally, their core files will
3871 also be written into this directory.
3872
3873 SuspendExcNodes
3874 Specifies the nodes which are to not be placed in power save
3875 mode, even if the node remains idle for an extended period of
3876 time. Use Slurm's hostlist expression to identify nodes with an
3877 optional ":" separator and count of nodes to exclude from the
3878 preceding range. For example "nid[10-20]:4" will prevent 4 us‐
3879 able nodes (i.e IDLE and not DOWN, DRAINING or already powered
3880 down) in the set "nid[10-20]" from being powered down. Multiple
3881 sets of nodes can be specified with or without counts in a comma
3882 separated list (e.g "nid[10-20]:4,nid[80-90]:2"). If a node
3883 count specification is given, any list of nodes to NOT have a
3884 node count must be after the last specification with a count.
3885 For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes in
3886 the set "nid[10-20]:4" plus all nodes in the set "nid[60-70]"
3887 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
3888 "nid[1-3],nid[10-20]". By default no nodes are excluded.
3889
3890 SuspendExcParts
3891 Specifies the partitions whose nodes are to not be placed in
3892 power save mode, even if the node remains idle for an extended
3893 period of time. Multiple partitions can be identified and sepa‐
3894 rated by commas. By default no nodes are excluded.
3895
3896 SuspendProgram
3897 SuspendProgram is the program that will be executed when a node
3898 remains idle for an extended period of time. This program is
3899 expected to place the node into some power save mode. This can
3900 be used to reduce the frequency and voltage of a node or com‐
3901 pletely power the node off. The program executes as SlurmUser.
3902 The argument to the program will be the names of nodes to be
3903 placed into power savings mode (using Slurm's hostlist expres‐
3904 sion format). By default, no program is run.
3905
3906 SuspendRate
3907 The rate at which nodes are placed into power save mode by Sus‐
3908 pendProgram. The value is number of nodes per minute and it can
3909 be used to prevent a large drop in power consumption (e.g. after
3910 a large job completes). A value of zero results in no limits
3911 being imposed. The default value is 60 nodes per minute.
3912
3913 SuspendTime
3914 Nodes which remain idle or down for this number of seconds will
3915 be placed into power save mode by SuspendProgram. Setting Sus‐
3916 pendTime to anything but INFINITE (or -1) will enable power save
3917 mode. INFINITE is the default.
3918
3919 SuspendTimeout
3920 Maximum time permitted (in seconds) between when a node suspend
3921 request is issued and when the node is shutdown. At that time
3922 the node must be ready for a resume request to be issued as
3923 needed for new work. The default value is 30 seconds.
3924
3925 SwitchParameters
3926 Optional parameters for the switch plugin.
3927
3928 SwitchType
3929 Identifies the type of switch or interconnect used for applica‐
3930 tion communications. Acceptable values include
3931 "switch/cray_aries" for Cray systems, "switch/none" for switches
3932 not requiring special processing for job launch or termination
3933 (Ethernet, and InfiniBand) and The default value is
3934 "switch/none". All Slurm daemons, commands and running jobs
3935 must be restarted for a change in SwitchType to take effect. If
3936 running jobs exist at the time slurmctld is restarted with a new
3937 value of SwitchType, records of all jobs in any state may be
3938 lost.
3939
3940 TaskEpilog
3941 Fully qualified pathname of a program to be execute as the slurm
3942 job's owner after termination of each task. See TaskProlog for
3943 execution order details.
3944
3945 TaskPlugin
3946 Identifies the type of task launch plugin, typically used to
3947 provide resource management within a node (e.g. pinning tasks to
3948 specific processors). More than one task plugin can be specified
3949 in a comma-separated list. The prefix of "task/" is optional.
3950 Acceptable values include:
3951
3952 task/affinity enables resource containment using
3953 sched_setaffinity(). This enables the --cpu-bind
3954 and/or --mem-bind srun options.
3955
3956 task/cgroup enables resource containment using Linux control
3957 cgroups. This enables the --cpu-bind and/or
3958 --mem-bind srun options. NOTE: see "man
3959 cgroup.conf" for configuration details.
3960
3961 task/none for systems requiring no special handling of user
3962 tasks. Lacks support for the --cpu-bind and/or
3963 --mem-bind srun options. The default value is
3964 "task/none".
3965
3966 NOTE: It is recommended to stack task/affinity,task/cgroup to‐
3967 gether when configuring TaskPlugin, and setting Constrain‐
3968 Cores=yes in cgroup.conf. This setup uses the task/affinity
3969 plugin for setting the affinity of the tasks and uses the
3970 task/cgroup plugin to fence tasks into the specified resources.
3971
3972 NOTE: For CRAY systems only: task/cgroup must be used with, and
3973 listed after task/cray_aries in TaskPlugin. The task/affinity
3974 plugin can be listed anywhere, but the previous constraint must
3975 be satisfied. For CRAY systems, a configuration like this is
3976 recommended:
3977 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
3978
3979 TaskPluginParam
3980 Optional parameters for the task plugin. Multiple options
3981 should be comma separated. None, Sockets, Cores and Threads are
3982 mutually exclusive and treated as a last possible source of
3983 --cpu-bind default. See also Node and Partition CpuBind options.
3984
3985 Cores Bind tasks to cores by default. Overrides automatic
3986 binding.
3987
3988 None Perform no task binding by default. Overrides automatic
3989 binding.
3990
3991 Sockets
3992 Bind to sockets by default. Overrides automatic binding.
3993
3994 Threads
3995 Bind to threads by default. Overrides automatic binding.
3996
3997 SlurmdOffSpec
3998 If specialized cores or CPUs are identified for the node
3999 (i.e. the CoreSpecCount or CpuSpecList are configured for
4000 the node), then Slurm daemons running on the compute node
4001 (i.e. slurmd and slurmstepd) should run outside of those
4002 resources (i.e. specialized resources are completely un‐
4003 available to Slurm daemons and jobs spawned by Slurm).
4004 This option may not be used with the task/cray_aries
4005 plugin.
4006
4007 Verbose
4008 Verbosely report binding before tasks run by default.
4009
4010 Autobind
4011 Set a default binding in the event that "auto binding"
4012 doesn't find a match. Set to Threads, Cores or Sockets
4013 (E.g. TaskPluginParam=autobind=threads).
4014
4015 TaskProlog
4016 Fully qualified pathname of a program to be execute as the slurm
4017 job's owner prior to initiation of each task. Besides the nor‐
4018 mal environment variables, this has SLURM_TASK_PID available to
4019 identify the process ID of the task being started. Standard
4020 output from this program can be used to control the environment
4021 variables and output for the user program.
4022
4023 export NAME=value Will set environment variables for the task
4024 being spawned. Everything after the equal
4025 sign to the end of the line will be used as
4026 the value for the environment variable. Ex‐
4027 porting of functions is not currently sup‐
4028 ported.
4029
4030 print ... Will cause that line (without the leading
4031 "print ") to be printed to the job's stan‐
4032 dard output.
4033
4034 unset NAME Will clear environment variables for the
4035 task being spawned.
4036
4037 The order of task prolog/epilog execution is as follows:
4038
4039 1. pre_launch_priv()
4040 Function in TaskPlugin
4041
4042 1. pre_launch() Function in TaskPlugin
4043
4044 2. TaskProlog System-wide per task program defined in
4045 slurm.conf
4046
4047 3. User prolog Job-step-specific task program defined using
4048 srun's --task-prolog option or
4049 SLURM_TASK_PROLOG environment variable
4050
4051 4. Task Execute the job step's task
4052
4053 5. User epilog Job-step-specific task program defined using
4054 srun's --task-epilog option or
4055 SLURM_TASK_EPILOG environment variable
4056
4057 6. TaskEpilog System-wide per task program defined in
4058 slurm.conf
4059
4060 7. post_term() Function in TaskPlugin
4061
4062 TCPTimeout
4063 Time permitted for TCP connection to be established. Default
4064 value is 2 seconds.
4065
4066 TmpFS Fully qualified pathname of the file system available to user
4067 jobs for temporary storage. This parameter is used in establish‐
4068 ing a node's TmpDisk space. The default value is "/tmp".
4069
4070 TopologyParam
4071 Comma-separated options identifying network topology options.
4072
4073 Dragonfly Optimize allocation for Dragonfly network. Valid
4074 when TopologyPlugin=topology/tree.
4075
4076 TopoOptional Only optimize allocation for network topology if
4077 the job includes a switch option. Since optimiz‐
4078 ing resource allocation for topology involves
4079 much higher system overhead, this option can be
4080 used to impose the extra overhead only on jobs
4081 which can take advantage of it. If most job allo‐
4082 cations are not optimized for network topology,
4083 they may fragment resources to the point that
4084 topology optimization for other jobs will be dif‐
4085 ficult to achieve. NOTE: Jobs may span across
4086 nodes without common parent switches with this
4087 enabled.
4088
4089 TopologyPlugin
4090 Identifies the plugin to be used for determining the network
4091 topology and optimizing job allocations to minimize network con‐
4092 tention. See NETWORK TOPOLOGY below for details. Additional
4093 plugins may be provided in the future which gather topology in‐
4094 formation directly from the network. Acceptable values include:
4095
4096 topology/3d_torus best-fit logic over three-dimensional
4097 topology
4098
4099 topology/none default for other systems, best-fit logic
4100 over one-dimensional topology
4101
4102 topology/tree used for a hierarchical network as de‐
4103 scribed in a topology.conf file
4104
4105 TrackWCKey
4106 Boolean yes or no. Used to set display and track of the Work‐
4107 load Characterization Key. Must be set to track correct wckey
4108 usage. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4109 file to create historical usage reports.
4110
4111 TreeWidth
4112 Slurmd daemons use a virtual tree network for communications.
4113 TreeWidth specifies the width of the tree (i.e. the fanout). On
4114 architectures with a front end node running the slurmd daemon,
4115 the value must always be equal to or greater than the number of
4116 front end nodes which eliminates the need for message forwarding
4117 between the slurmd daemons. On other architectures the default
4118 value is 50, meaning each slurmd daemon can communicate with up
4119 to 50 other slurmd daemons and over 2500 nodes can be contacted
4120 with two message hops. The default value will work well for
4121 most clusters. Optimal system performance can typically be
4122 achieved if TreeWidth is set to the square root of the number of
4123 nodes in the cluster for systems having no more than 2500 nodes
4124 or the cube root for larger systems. The value may not exceed
4125 65533.
4126
4127 UnkillableStepProgram
4128 If the processes in a job step are determined to be unkillable
4129 for a period of time specified by the UnkillableStepTimeout
4130 variable, the program specified by UnkillableStepProgram will be
4131 executed. By default no program is run.
4132
4133 See section UNKILLABLE STEP PROGRAM SCRIPT for more information.
4134
4135 UnkillableStepTimeout
4136 The length of time, in seconds, that Slurm will wait before de‐
4137 ciding that processes in a job step are unkillable (after they
4138 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4139 gram. The default timeout value is 60 seconds. If exceeded,
4140 the compute node will be drained to prevent future jobs from be‐
4141 ing scheduled on the node.
4142
4143 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4144 will be enabled. PAM is used to establish the upper bounds for
4145 resource limits. With PAM support enabled, local system adminis‐
4146 trators can dynamically configure system resource limits. Chang‐
4147 ing the upper bound of a resource limit will not alter the lim‐
4148 its of running jobs, only jobs started after a change has been
4149 made will pick up the new limits. The default value is 0 (not
4150 to enable PAM support). Remember that PAM also needs to be con‐
4151 figured to support Slurm as a service. For sites using PAM's
4152 directory based configuration option, a configuration file named
4153 slurm should be created. The module-type, control-flags, and
4154 module-path names that should be included in the file are:
4155 auth required pam_localuser.so
4156 auth required pam_shells.so
4157 account required pam_unix.so
4158 account required pam_access.so
4159 session required pam_unix.so
4160 For sites configuring PAM with a general configuration file, the
4161 appropriate lines (see above), where slurm is the service-name,
4162 should be added.
4163
4164 NOTE: UsePAM option has nothing to do with the con‐
4165 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules. So
4166 these two modules can work independently of the value set for
4167 UsePAM.
4168
4169 VSizeFactor
4170 Memory specifications in job requests apply to real memory size
4171 (also known as resident set size). It is possible to enforce
4172 virtual memory limits for both jobs and job steps by limiting
4173 their virtual memory to some percentage of their real memory al‐
4174 location. The VSizeFactor parameter specifies the job's or job
4175 step's virtual memory limit as a percentage of its real memory
4176 limit. For example, if a job's real memory limit is 500MB and
4177 VSizeFactor is set to 101 then the job will be killed if its
4178 real memory exceeds 500MB or its virtual memory exceeds 505MB
4179 (101 percent of the real memory limit). The default value is 0,
4180 which disables enforcement of virtual memory limits. The value
4181 may not exceed 65533 percent.
4182
4183 NOTE: This parameter is dependent on OverMemoryKill being con‐
4184 figured in JobAcctGatherParams. It is also possible to configure
4185 the TaskPlugin to use task/cgroup for memory enforcement. VSize‐
4186 Factor will not have an effect on memory enforcement done
4187 through cgroups.
4188
4189 WaitTime
4190 Specifies how many seconds the srun command should by default
4191 wait after the first task terminates before terminating all re‐
4192 maining tasks. The "--wait" option on the srun command line
4193 overrides this value. The default value is 0, which disables
4194 this feature. May not exceed 65533 seconds.
4195
4196 X11Parameters
4197 For use with Slurm's built-in X11 forwarding implementation.
4198
4199 home_xauthority
4200 If set, xauth data on the compute node will be placed in
4201 ~/.Xauthority rather than in a temporary file under
4202 TmpFS.
4203
4205 The configuration of nodes (or machines) to be managed by Slurm is also
4206 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4207 adding nodes, changing their processor count, etc.) require restarting
4208 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4209 must know each node in the system to forward messages in support of hi‐
4210 erarchical communications. Only the NodeName must be supplied in the
4211 configuration file. All other node configuration information is op‐
4212 tional. It is advisable to establish baseline node configurations, es‐
4213 pecially if the cluster is heterogeneous. Nodes which register to the
4214 system with less than the configured resources (e.g. too little mem‐
4215 ory), will be placed in the "DOWN" state to avoid scheduling jobs on
4216 them. Establishing baseline configurations will also speed Slurm's
4217 scheduling process by permitting it to compare job requirements against
4218 these (relatively few) configuration parameters and possibly avoid hav‐
4219 ing to check job requirements against every individual node's configu‐
4220 ration. The resources checked at node registration time are: CPUs,
4221 RealMemory and TmpDisk.
4222
4223 Default values can be specified with a record in which NodeName is "DE‐
4224 FAULT". The default entry values will apply only to lines following it
4225 in the configuration file and the default values can be reset multiple
4226 times in the configuration file with multiple entries where "Node‐
4227 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4228 add to previous default values and will not reinitialize the default
4229 values. The "NodeName=" specification must be placed on every line de‐
4230 scribing the configuration of nodes. A single node name can not appear
4231 as a NodeName value in more than one line (duplicate node name records
4232 will be ignored). In fact, it is generally possible and desirable to
4233 define the configurations of all nodes in only a few lines. This con‐
4234 vention permits significant optimization in the scheduling of larger
4235 clusters. In order to support the concept of jobs requiring consecu‐
4236 tive nodes on some architectures, node specifications should be place
4237 in this file in consecutive order. No single node name may be listed
4238 more than once in the configuration file. Use "DownNodes=" to record
4239 the state of nodes which are temporarily in a DOWN, DRAIN or FAILING
4240 state without altering permanent configuration information. A job
4241 step's tasks are allocated to nodes in order the nodes appear in the
4242 configuration file. There is presently no capability within Slurm to
4243 arbitrarily order a job step's tasks.
4244
4245 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4246 and/or a simple node range expression may optionally be used to specify
4247 numeric ranges of nodes to avoid building a configuration file with
4248 large numbers of entries. The node range expression can contain one
4249 pair of square brackets with a sequence of comma-separated numbers
4250 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4251 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4252 more leading zeros to indicate the numeric portion has a fixed number
4253 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4254 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4255 more numeric expressions are included, one of them must be at the end
4256 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4257 always be used in a comma-separated list.
4258
4259 The node configuration specified the following information:
4260
4261
4262 NodeName
4263 Name that Slurm uses to refer to a node. Typically this would
4264 be the string that "/bin/hostname -s" returns. It may also be
4265 the fully qualified domain name as returned by "/bin/hostname
4266 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4267 with the host through the host database (/etc/hosts) or DNS, de‐
4268 pending on the resolver settings. Note that if the short form
4269 of the hostname is not used, it may prevent use of hostlist ex‐
4270 pressions (the numeric portion in brackets must be at the end of
4271 the string). It may also be an arbitrary string if NodeHostname
4272 is specified. If the NodeName is "DEFAULT", the values speci‐
4273 fied with that record will apply to subsequent node specifica‐
4274 tions unless explicitly set to other values in that node record
4275 or replaced with a different set of default values. Each line
4276 where NodeName is "DEFAULT" will replace or add to previous de‐
4277 fault values and not a reinitialize the default values. For ar‐
4278 chitectures in which the node order is significant, nodes will
4279 be considered consecutive in the order defined. For example, if
4280 the configuration for "NodeName=charlie" immediately follows the
4281 configuration for "NodeName=baker" they will be considered adja‐
4282 cent in the computer.
4283
4284 NodeHostname
4285 Typically this would be the string that "/bin/hostname -s" re‐
4286 turns. It may also be the fully qualified domain name as re‐
4287 turned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid
4288 domain name associated with the host through the host database
4289 (/etc/hosts) or DNS, depending on the resolver settings. Note
4290 that if the short form of the hostname is not used, it may pre‐
4291 vent use of hostlist expressions (the numeric portion in brack‐
4292 ets must be at the end of the string). A node range expression
4293 can be used to specify a set of nodes. If an expression is
4294 used, the number of nodes identified by NodeHostname on a line
4295 in the configuration file must be identical to the number of
4296 nodes identified by NodeName. By default, the NodeHostname will
4297 be identical in value to NodeName.
4298
4299 NodeAddr
4300 Name that a node should be referred to in establishing a commu‐
4301 nications path. This name will be used as an argument to the
4302 getaddrinfo() function for identification. If a node range ex‐
4303 pression is used to designate multiple nodes, they must exactly
4304 match the entries in the NodeName (e.g. "NodeName=lx[0-7]
4305 NodeAddr=elx[0-7]"). NodeAddr may also contain IP addresses.
4306 By default, the NodeAddr will be identical in value to NodeHost‐
4307 name.
4308
4309 BcastAddr
4310 Alternate network path to be used for sbcast network traffic to
4311 a given node. This name will be used as an argument to the
4312 getaddrinfo() function. If a node range expression is used to
4313 designate multiple nodes, they must exactly match the entries in
4314 the NodeName (e.g. "NodeName=lx[0-7] BcastAddr=elx[0-7]").
4315 BcastAddr may also contain IP addresses. By default, the Bcas‐
4316 tAddr is unset, and sbcast traffic will be routed to the
4317 NodeAddr for a given node. Note: cannot be used with Communica‐
4318 tionParameters=NoInAddrAny.
4319
4320 Boards Number of Baseboards in nodes with a baseboard controller. Note
4321 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4322 and ThreadsPerCore should be specified. The default value is 1.
4323
4324 CoreSpecCount
4325 Number of cores reserved for system use. These cores will not
4326 be available for allocation to user jobs. Depending upon the
4327 TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e.
4328 slurmd and slurmstepd) may either be confined to these resources
4329 (the default) or prevented from using these resources. Isola‐
4330 tion of the Slurm daemons from user jobs may improve application
4331 performance. If this option and CpuSpecList are both designated
4332 for a node, an error is generated. For information on the algo‐
4333 rithm used by Slurm to select the cores refer to the core spe‐
4334 cialization documentation (
4335 https://slurm.schedmd.com/core_spec.html ).
4336
4337 CoresPerSocket
4338 Number of cores in a single physical processor socket (e.g.
4339 "2"). The CoresPerSocket value describes physical cores, not
4340 the logical number of processors per socket. NOTE: If you have
4341 multi-core processors, you will likely need to specify this pa‐
4342 rameter in order to optimize scheduling. The default value is
4343 1.
4344
4345 CpuBind
4346 If a job step request does not specify an option to control how
4347 tasks are bound to allocated CPUs (--cpu-bind) and all nodes al‐
4348 located to the job have the same CpuBind option the node CpuBind
4349 option will control how tasks are bound to allocated resources.
4350 Supported values for CpuBind are "none", "socket", "ldom"
4351 (NUMA), "core" and "thread".
4352
4353 CPUs Number of logical processors on the node (e.g. "2"). It can be
4354 set to the total number of sockets(supported only by select/lin‐
4355 ear), cores or threads. This can be useful when you want to
4356 schedule only the cores on a hyper-threaded node. If CPUs is
4357 omitted, its default will be set equal to the product of Boards,
4358 Sockets, CoresPerSocket, and ThreadsPerCore.
4359
4360 CpuSpecList
4361 A comma-delimited list of Slurm abstract CPU IDs reserved for
4362 system use. The list will be expanded to include all other
4363 CPUs, if any, on the same cores. These cores will not be avail‐
4364 able for allocation to user jobs. Depending upon the TaskPlug‐
4365 inParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and
4366 slurmstepd) may either be confined to these resources (the de‐
4367 fault) or prevented from using these resources. Isolation of
4368 the Slurm daemons from user jobs may improve application perfor‐
4369 mance. If this option and CoreSpecCount are both designated for
4370 a node, an error is generated. This option has no effect unless
4371 cgroup job confinement is also configured (i.e. the task/cgroup
4372 TaskPlugin is enabled and ConstrainCores=yes is set in
4373 cgroup.conf).
4374
4375 Features
4376 A comma-delimited list of arbitrary strings indicative of some
4377 characteristic associated with the node. There is no value or
4378 count associated with a feature at this time, a node either has
4379 a feature or it does not. A desired feature may contain a nu‐
4380 meric component indicating, for example, processor speed but
4381 this numeric component will be considered to be part of the fea‐
4382 ture string. Features are intended to be used to filter nodes
4383 eligible to run jobs via the --constraint argument. By default
4384 a node has no features. Also see Gres for being able to have
4385 more control such as types and count. Using features is faster
4386 than scheduling against GRES but is limited to Boolean opera‐
4387 tions.
4388
4389 Gres A comma-delimited list of generic resources specifications for a
4390 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4391 ber>[K|M|G]". The first field is the resource name, which
4392 matches the GresType configuration parameter name. The optional
4393 type field might be used to identify a model of that generic re‐
4394 source. It is forbidden to specify both an untyped GRES and a
4395 typed GRES with the same <name>. The optional no_consume field
4396 allows you to specify that a generic resource does not have a
4397 finite number of that resource that gets consumed as it is re‐
4398 quested. The no_consume field is a GRES specific setting and ap‐
4399 plies to the GRES, regardless of the type specified. The final
4400 field must specify a generic resources count. A suffix of "K",
4401 "M", "G", "T" or "P" may be used to multiply the number by 1024,
4402 1048576, 1073741824, etc. respectively.
4403 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4404 sume:4G"). By default a node has no generic resources and its
4405 maximum count is that of an unsigned 64bit integer. Also see
4406 Features for Boolean flags to filter nodes using job con‐
4407 straints.
4408
4409 MemSpecLimit
4410 Amount of memory, in megabytes, reserved for system use and not
4411 available for user allocations. If the task/cgroup plugin is
4412 configured and that plugin constrains memory allocations (i.e.
4413 the task/cgroup TaskPlugin is enabled and ConstrainRAMSpace=yes
4414 is set in cgroup.conf), then Slurm compute node daemons (slurmd
4415 plus slurmstepd) will be allocated the specified memory limit.
4416 Note that having the Memory set in SelectTypeParameters as any
4417 of the options that has it as a consumable resource is needed
4418 for this option to work. The daemons will not be killed if they
4419 exhaust the memory allocation (ie. the Out-Of-Memory Killer is
4420 disabled for the daemon's memory cgroup). If the task/cgroup
4421 plugin is not configured, the specified memory will only be un‐
4422 available for user allocations.
4423
4424 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4425 tens to for work on this particular node. By default there is a
4426 single port number for all slurmd daemons on all compute nodes
4427 as defined by the SlurmdPort configuration parameter. Use of
4428 this option is not generally recommended except for development
4429 or testing purposes. If multiple slurmd daemons execute on a
4430 node this can specify a range of ports.
4431
4432 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4433 automatically try to interact with anything opened on ports
4434 8192-60000. Configure Port to use a port outside of the config‐
4435 ured SrunPortRange and RSIP's port range.
4436
4437 Procs See CPUs.
4438
4439 RealMemory
4440 Size of real memory on the node in megabytes (e.g. "2048"). The
4441 default value is 1. Lowering RealMemory with the goal of setting
4442 aside some amount for the OS and not available for job alloca‐
4443 tions will not work as intended if Memory is not set as a con‐
4444 sumable resource in SelectTypeParameters. So one of the *_Memory
4445 options need to be enabled for that goal to be accomplished.
4446 Also see MemSpecLimit.
4447
4448 Reason Identifies the reason for a node being in state "DOWN",
4449 "DRAINED" "DRAINING", "FAIL" or "FAILING". Use quotes to en‐
4450 close a reason having more than one word.
4451
4452 Sockets
4453 Number of physical processor sockets/chips on the node (e.g.
4454 "2"). If Sockets is omitted, it will be inferred from CPUs,
4455 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4456 multi-core processors, you will likely need to specify these pa‐
4457 rameters. Sockets and SocketsPerBoard are mutually exclusive.
4458 If Sockets is specified when Boards is also used, Sockets is in‐
4459 terpreted as SocketsPerBoard rather than total sockets. The de‐
4460 fault value is 1.
4461
4462 SocketsPerBoard
4463 Number of physical processor sockets/chips on a baseboard.
4464 Sockets and SocketsPerBoard are mutually exclusive. The default
4465 value is 1.
4466
4467 State State of the node with respect to the initiation of user jobs.
4468 Acceptable values are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE
4469 and UNKNOWN. Node states of BUSY and IDLE should not be speci‐
4470 fied in the node configuration, but set the node state to UN‐
4471 KNOWN instead. Setting the node state to UNKNOWN will result in
4472 the node state being set to BUSY, IDLE or other appropriate
4473 state based upon recovered system state information. The de‐
4474 fault value is UNKNOWN. Also see the DownNodes parameter below.
4475
4476 CLOUD Indicates the node exists in the cloud. Its initial
4477 state will be treated as powered down. The node will
4478 be available for use after its state is recovered from
4479 Slurm's state save file or the slurmd daemon starts on
4480 the compute node.
4481
4482 DOWN Indicates the node failed and is unavailable to be al‐
4483 located work.
4484
4485 DRAIN Indicates the node is unavailable to be allocated
4486 work.
4487
4488 FAIL Indicates the node is expected to fail soon, has no
4489 jobs allocated to it, and will not be allocated to any
4490 new jobs.
4491
4492 FAILING Indicates the node is expected to fail soon, has one
4493 or more jobs allocated to it, but will not be allo‐
4494 cated to any new jobs.
4495
4496 FUTURE Indicates the node is defined for future use and need
4497 not exist when the Slurm daemons are started. These
4498 nodes can be made available for use simply by updating
4499 the node state using the scontrol command rather than
4500 restarting the slurmctld daemon. After these nodes are
4501 made available, change their State in the slurm.conf
4502 file. Until these nodes are made available, they will
4503 not be seen using any Slurm commands or nor will any
4504 attempt be made to contact them.
4505
4506 Dynamic Future Nodes
4507 A slurmd started with -F[<feature>] will be as‐
4508 sociated with a FUTURE node that matches the
4509 same configuration (sockets, cores, threads) as
4510 reported by slurmd -C. The node's NodeAddr and
4511 NodeHostname will automatically be retrieved
4512 from the slurmd and will be cleared when set
4513 back to the FUTURE state. Dynamic FUTURE nodes
4514 retain non-FUTURE state on restart. Use scon‐
4515 trol to put node back into FUTURE state.
4516
4517 If the mapping of the NodeName to the slurmd
4518 HostName is not updated in DNS, Dynamic Future
4519 nodes won't know how to communicate with each
4520 other -- because NodeAddr and NodeHostName are
4521 not defined in the slurm.conf -- and the fanout
4522 communications need to be disabled by setting
4523 TreeWidth to a high number (e.g. 65533). If the
4524 DNS mapping is made, then the cloud_dns Slurm‐
4525 ctldParameter can be used.
4526
4527 UNKNOWN Indicates the node's state is undefined but will be
4528 established (set to BUSY or IDLE) when the slurmd dae‐
4529 mon on that node registers. UNKNOWN is the default
4530 state.
4531
4532 ThreadsPerCore
4533 Number of logical threads in a single physical core (e.g. "2").
4534 Note that the Slurm can allocate resources to jobs down to the
4535 resolution of a core. If your system is configured with more
4536 than one thread per core, execution of a different job on each
4537 thread is not supported unless you configure SelectTypeParame‐
4538 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4539 or ThreadsPerCore. A job can execute a one task per thread from
4540 within one job step or execute a distinct job step on each of
4541 the threads. Note also if you are running with more than 1
4542 thread per core and running the select/cons_res or se‐
4543 lect/cons_tres plugin then you will want to set the SelectType‐
4544 Parameters variable to something other than CR_CPU to avoid un‐
4545 expected results. The default value is 1.
4546
4547 TmpDisk
4548 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4549 "16384"). TmpFS (for "Temporary File System") identifies the lo‐
4550 cation which jobs should use for temporary storage. Note this
4551 does not indicate the amount of free space available to the user
4552 on the node, only the total file system size. The system admin‐
4553 istration should ensure this file system is purged as needed so
4554 that user jobs have access to most of this space. The Prolog
4555 and/or Epilog programs (specified in the configuration file)
4556 might be used to ensure the file system is kept clean. The de‐
4557 fault value is 0.
4558
4559 TRESWeights
4560 TRESWeights are used to calculate a value that represents how
4561 busy a node is. Currently only used in federation configura‐
4562 tions. TRESWeights are different from TRESBillingWeights --
4563 which is used for fairshare calculations.
4564
4565 TRES weights are specified as a comma-separated list of <TRES
4566 Type>=<TRES Weight> pairs.
4567
4568 e.g.
4569 NodeName=node1 ... TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
4570
4571 By default the weighted TRES value is calculated as the sum of
4572 all node TRES types multiplied by their corresponding TRES
4573 weight.
4574
4575 If PriorityFlags=MAX_TRES is configured, the weighted TRES value
4576 is calculated as the MAX of individual node TRES' (e.g. cpus,
4577 mem, gres).
4578
4579 Weight The priority of the node for scheduling purposes. All things
4580 being equal, jobs will be allocated the nodes with the lowest
4581 weight which satisfies their requirements. For example, a het‐
4582 erogeneous collection of nodes might be placed into a single
4583 partition for greater system utilization, responsiveness and ca‐
4584 pability. It would be preferable to allocate smaller memory
4585 nodes rather than larger memory nodes if either will satisfy a
4586 job's requirements. The units of weight are arbitrary, but
4587 larger weights should be assigned to nodes with more processors,
4588 memory, disk space, higher processor speed, etc. Note that if a
4589 job allocation request can not be satisfied using the nodes with
4590 the lowest weight, the set of nodes with the next lowest weight
4591 is added to the set of nodes under consideration for use (repeat
4592 as needed for higher weight values). If you absolutely want to
4593 minimize the number of higher weight nodes allocated to a job
4594 (at a cost of higher scheduling overhead), give each node a dis‐
4595 tinct Weight value and they will be added to the pool of nodes
4596 being considered for scheduling individually.
4597
4598 The default value is 1.
4599
4600 NOTE: Node weights are first considered among currently avail‐
4601 able nodes. For example, a POWERED_DOWN node with a lower weight
4602 will not be evaluated before an IDLE node.
4603
4605 The DownNodes= parameter permits you to mark certain nodes as in a
4606 DOWN, DRAIN, FAIL, FAILING or FUTURE state without altering the perma‐
4607 nent configuration information listed under a NodeName= specification.
4608
4609
4610 DownNodes
4611 Any node name, or list of node names, from the NodeName= speci‐
4612 fications.
4613
4614 Reason Identifies the reason for a node being in state DOWN, DRAIN,
4615 FAIL, FAILING or FUTURE. Use quotes to enclose a reason having
4616 more than one word.
4617
4618 State State of the node with respect to the initiation of user jobs.
4619 Acceptable values are DOWN, DRAIN, FAIL, FAILING and FUTURE.
4620 For more information about these states see the descriptions un‐
4621 der State in the NodeName= section above. The default value is
4622 DOWN.
4623
4625 On computers where frontend nodes are used to execute batch scripts
4626 rather than compute nodes, one may configure one or more frontend nodes
4627 using the configuration parameters defined below. These options are
4628 very similar to those used in configuring compute nodes. These options
4629 may only be used on systems configured and built with the appropriate
4630 parameters (--have-front-end). The front end configuration specifies
4631 the following information:
4632
4633
4634 AllowGroups
4635 Comma-separated list of group names which may execute jobs on
4636 this front end node. By default, all groups may use this front
4637 end node. A user will be permitted to use this front end node
4638 if AllowGroups has at least one group associated with the user.
4639 May not be used with the DenyGroups option.
4640
4641 AllowUsers
4642 Comma-separated list of user names which may execute jobs on
4643 this front end node. By default, all users may use this front
4644 end node. May not be used with the DenyUsers option.
4645
4646 DenyGroups
4647 Comma-separated list of group names which are prevented from ex‐
4648 ecuting jobs on this front end node. May not be used with the
4649 AllowGroups option.
4650
4651 DenyUsers
4652 Comma-separated list of user names which are prevented from exe‐
4653 cuting jobs on this front end node. May not be used with the
4654 AllowUsers option.
4655
4656 FrontendName
4657 Name that Slurm uses to refer to a frontend node. Typically
4658 this would be the string that "/bin/hostname -s" returns. It
4659 may also be the fully qualified domain name as returned by
4660 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4661 name associated with the host through the host database
4662 (/etc/hosts) or DNS, depending on the resolver settings. Note
4663 that if the short form of the hostname is not used, it may pre‐
4664 vent use of hostlist expressions (the numeric portion in brack‐
4665 ets must be at the end of the string). If the FrontendName is
4666 "DEFAULT", the values specified with that record will apply to
4667 subsequent node specifications unless explicitly set to other
4668 values in that frontend node record or replaced with a different
4669 set of default values. Each line where FrontendName is "DE‐
4670 FAULT" will replace or add to previous default values and not a
4671 reinitialize the default values.
4672
4673 FrontendAddr
4674 Name that a frontend node should be referred to in establishing
4675 a communications path. This name will be used as an argument to
4676 the getaddrinfo() function for identification. As with Fron‐
4677 tendName, list the individual node addresses rather than using a
4678 hostlist expression. The number of FrontendAddr records per
4679 line must equal the number of FrontendName records per line
4680 (i.e. you can't map to node names to one address). FrontendAddr
4681 may also contain IP addresses. By default, the FrontendAddr
4682 will be identical in value to FrontendName.
4683
4684 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4685 tens to for work on this particular frontend node. By default
4686 there is a single port number for all slurmd daemons on all
4687 frontend nodes as defined by the SlurmdPort configuration param‐
4688 eter. Use of this option is not generally recommended except for
4689 development or testing purposes.
4690
4691 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4692 automatically try to interact with anything opened on ports
4693 8192-60000. Configure Port to use a port outside of the config‐
4694 ured SrunPortRange and RSIP's port range.
4695
4696 Reason Identifies the reason for a frontend node being in state DOWN,
4697 DRAINED, DRAINING, FAIL or FAILING. Use quotes to enclose a
4698 reason having more than one word.
4699
4700 State State of the frontend node with respect to the initiation of
4701 user jobs. Acceptable values are DOWN, DRAIN, FAIL, FAILING and
4702 UNKNOWN. Node states of BUSY and IDLE should not be specified
4703 in the node configuration, but set the node state to UNKNOWN in‐
4704 stead. Setting the node state to UNKNOWN will result in the
4705 node state being set to BUSY, IDLE or other appropriate state
4706 based upon recovered system state information. For more infor‐
4707 mation about these states see the descriptions under State in
4708 the NodeName= section above. The default value is UNKNOWN.
4709
4710 As an example, you can do something similar to the following to define
4711 four front end nodes for running slurmd daemons.
4712 FrontendName=frontend[00-03] FrontendAddr=efrontend[00-03] State=UNKNOWN
4713
4714
4716 The nodeset configuration allows you to define a name for a specific
4717 set of nodes which can be used to simplify the partition configuration
4718 section, especially for heterogenous or condo-style systems. Each node‐
4719 set may be defined by an explicit list of nodes, and/or by filtering
4720 the nodes by a particular configured feature. If both Feature= and
4721 Nodes= are used the nodeset shall be the union of the two subsets.
4722 Note that the nodesets are only used to simplify the partition defini‐
4723 tions at present, and are not usable outside of the partition configu‐
4724 ration.
4725
4726
4727 Feature
4728 All nodes with this single feature will be included as part of
4729 this nodeset.
4730
4731 Nodes List of nodes in this set.
4732
4733 NodeSet
4734 Unique name for a set of nodes. Must not overlap with any Node‐
4735 Name definitions.
4736
4738 The partition configuration permits you to establish different job lim‐
4739 its or access controls for various groups (or partitions) of nodes.
4740 Nodes may be in more than one partition, making partitions serve as
4741 general purpose queues. For example one may put the same set of nodes
4742 into two different partitions, each with different constraints (time
4743 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4744 allocated resources within a single partition. Default values can be
4745 specified with a record in which PartitionName is "DEFAULT". The de‐
4746 fault entry values will apply only to lines following it in the config‐
4747 uration file and the default values can be reset multiple times in the
4748 configuration file with multiple entries where "PartitionName=DEFAULT".
4749 The "PartitionName=" specification must be placed on every line de‐
4750 scribing the configuration of partitions. Each line where Partition‐
4751 Name is "DEFAULT" will replace or add to previous default values and
4752 not a reinitialize the default values. A single partition name can not
4753 appear as a PartitionName value in more than one line (duplicate parti‐
4754 tion name records will be ignored). If a partition that is in use is
4755 deleted from the configuration and slurm is restarted or reconfigured
4756 (scontrol reconfigure), jobs using the partition are canceled. NOTE:
4757 Put all parameters for each partition on a single line. Each line of
4758 partition configuration information should represent a different parti‐
4759 tion. The partition configuration file contains the following informa‐
4760 tion:
4761
4762
4763 AllocNodes
4764 Comma-separated list of nodes from which users can submit jobs
4765 in the partition. Node names may be specified using the node
4766 range expression syntax described above. The default value is
4767 "ALL".
4768
4769 AllowAccounts
4770 Comma-separated list of accounts which may execute jobs in the
4771 partition. The default value is "ALL". NOTE: If AllowAccounts
4772 is used then DenyAccounts will not be enforced. Also refer to
4773 DenyAccounts.
4774
4775 AllowGroups
4776 Comma-separated list of group names which may execute jobs in
4777 this partition. A user will be permitted to submit a job to
4778 this partition if AllowGroups has at least one group associated
4779 with the user. Jobs executed as user root or as user SlurmUser
4780 will be allowed to use any partition, regardless of the value of
4781 AllowGroups. In addition, a Slurm Admin or Operator will be able
4782 to view any partition, regardless of the value of AllowGroups.
4783 If user root attempts to execute a job as another user (e.g. us‐
4784 ing srun's --uid option), then the job will be subject to Allow‐
4785 Groups as if it were submitted by that user. By default, Allow‐
4786 Groups is unset, meaning all groups are allowed to use this par‐
4787 tition. The special value 'ALL' is equivalent to this. Users
4788 who are not members of the specified group will not see informa‐
4789 tion about this partition by default. However, this should not
4790 be treated as a security mechanism, since job information will
4791 be returned if a user requests details about the partition or a
4792 specific job. See the PrivateData parameter to restrict access
4793 to job information. NOTE: For performance reasons, Slurm main‐
4794 tains a list of user IDs allowed to use each partition and this
4795 is checked at job submission time. This list of user IDs is up‐
4796 dated when the slurmctld daemon is restarted, reconfigured (e.g.
4797 "scontrol reconfig") or the partition's AllowGroups value is re‐
4798 set, even if is value is unchanged (e.g. "scontrol update Parti‐
4799 tionName=name AllowGroups=group"). For a user's access to a
4800 partition to change, both his group membership must change and
4801 Slurm's internal user ID list must change using one of the meth‐
4802 ods described above.
4803
4804 AllowQos
4805 Comma-separated list of Qos which may execute jobs in the parti‐
4806 tion. Jobs executed as user root can use any partition without
4807 regard to the value of AllowQos. The default value is "ALL".
4808 NOTE: If AllowQos is used then DenyQos will not be enforced.
4809 Also refer to DenyQos.
4810
4811 Alternate
4812 Partition name of alternate partition to be used if the state of
4813 this partition is "DRAIN" or "INACTIVE."
4814
4815 CpuBind
4816 If a job step request does not specify an option to control how
4817 tasks are bound to allocated CPUs (--cpu-bind) and all nodes al‐
4818 located to the job do not have the same CpuBind option the node.
4819 Then the partition's CpuBind option will control how tasks are
4820 bound to allocated resources. Supported values forCpuBind are
4821 "none", "socket", "ldom" (NUMA), "core" and "thread".
4822
4823 Default
4824 If this keyword is set, jobs submitted without a partition spec‐
4825 ification will utilize this partition. Possible values are
4826 "YES" and "NO". The default value is "NO".
4827
4828 DefaultTime
4829 Run time limit used for jobs that don't specify a value. If not
4830 set then MaxTime will be used. Format is the same as for Max‐
4831 Time.
4832
4833 DefCpuPerGPU
4834 Default count of CPUs allocated per allocated GPU. This value is
4835 used only if the job didn't specify --cpus-per-task and
4836 --cpus-per-gpu.
4837
4838 DefMemPerCPU
4839 Default real memory size available per allocated CPU in
4840 megabytes. Used to avoid over-subscribing memory and causing
4841 paging. DefMemPerCPU would generally be used if individual pro‐
4842 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
4843 lectType=select/cons_tres). If not set, the DefMemPerCPU value
4844 for the entire cluster will be used. Also see DefMemPerGPU,
4845 DefMemPerNode and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
4846 DefMemPerNode are mutually exclusive.
4847
4848 DefMemPerGPU
4849 Default real memory size available per allocated GPU in
4850 megabytes. Also see DefMemPerCPU, DefMemPerNode and MaxMemPer‐
4851 CPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
4852 exclusive.
4853
4854 DefMemPerNode
4855 Default real memory size available per allocated node in
4856 megabytes. Used to avoid over-subscribing memory and causing
4857 paging. DefMemPerNode would generally be used if whole nodes
4858 are allocated to jobs (SelectType=select/linear) and resources
4859 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
4860 If not set, the DefMemPerNode value for the entire cluster will
4861 be used. Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
4862 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
4863 sive.
4864
4865 DenyAccounts
4866 Comma-separated list of accounts which may not execute jobs in
4867 the partition. By default, no accounts are denied access NOTE:
4868 If AllowAccounts is used then DenyAccounts will not be enforced.
4869 Also refer to AllowAccounts.
4870
4871 DenyQos
4872 Comma-separated list of Qos which may not execute jobs in the
4873 partition. By default, no QOS are denied access NOTE: If Al‐
4874 lowQos is used then DenyQos will not be enforced. Also refer
4875 AllowQos.
4876
4877 DisableRootJobs
4878 If set to "YES" then user root will be prevented from running
4879 any jobs on this partition. The default value will be the value
4880 of DisableRootJobs set outside of a partition specification
4881 (which is "NO", allowing user root to execute jobs).
4882
4883 ExclusiveUser
4884 If set to "YES" then nodes will be exclusively allocated to
4885 users. Multiple jobs may be run for the same user, but only one
4886 user can be active at a time. This capability is also available
4887 on a per-job basis by using the --exclusive=user option.
4888
4889 GraceTime
4890 Specifies, in units of seconds, the preemption grace time to be
4891 extended to a job which has been selected for preemption. The
4892 default value is zero, no preemption grace time is allowed on
4893 this partition. Once a job has been selected for preemption,
4894 its end time is set to the current time plus GraceTime. The
4895 job's tasks are immediately sent SIGCONT and SIGTERM signals in
4896 order to provide notification of its imminent termination. This
4897 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
4898 upon reaching its new end time. This second set of signals is
4899 sent to both the tasks and the containing batch script, if ap‐
4900 plicable. See also the global KillWait configuration parameter.
4901
4902 Hidden Specifies if the partition and its jobs are to be hidden by de‐
4903 fault. Hidden partitions will by default not be reported by the
4904 Slurm APIs or commands. Possible values are "YES" and "NO".
4905 The default value is "NO". Note that partitions that a user
4906 lacks access to by virtue of the AllowGroups parameter will also
4907 be hidden by default.
4908
4909 LLN Schedule resources to jobs on the least loaded nodes (based upon
4910 the number of idle CPUs). This is generally only recommended for
4911 an environment with serial jobs as idle resources will tend to
4912 be highly fragmented, resulting in parallel jobs being distrib‐
4913 uted across many nodes. Note that node Weight takes precedence
4914 over how many idle resources are on each node. Also see the Se‐
4915 lectParameters configuration parameter CR_LLN to use the least
4916 loaded nodes in every partition.
4917
4918 MaxCPUsPerNode
4919 Maximum number of CPUs on any node available to all jobs from
4920 this partition. This can be especially useful to schedule GPUs.
4921 For example a node can be associated with two Slurm partitions
4922 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
4923 limited to only a subset of the node's CPUs, ensuring that one
4924 or more CPUs would be available to jobs in the "gpu" parti‐
4925 tion/queue.
4926
4927 MaxMemPerCPU
4928 Maximum real memory size available per allocated CPU in
4929 megabytes. Used to avoid over-subscribing memory and causing
4930 paging. MaxMemPerCPU would generally be used if individual pro‐
4931 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
4932 lectType=select/cons_tres). If not set, the MaxMemPerCPU value
4933 for the entire cluster will be used. Also see DefMemPerCPU and
4934 MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
4935 clusive.
4936
4937 MaxMemPerNode
4938 Maximum real memory size available per allocated node in
4939 megabytes. Used to avoid over-subscribing memory and causing
4940 paging. MaxMemPerNode would generally be used if whole nodes
4941 are allocated to jobs (SelectType=select/linear) and resources
4942 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
4943 If not set, the MaxMemPerNode value for the entire cluster will
4944 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
4945 and MaxMemPerNode are mutually exclusive.
4946
4947 MaxNodes
4948 Maximum count of nodes which may be allocated to any single job.
4949 The default value is "UNLIMITED", which is represented inter‐
4950 nally as -1.
4951
4952 MaxTime
4953 Maximum run time limit for jobs. Format is minutes, min‐
4954 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
4955 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
4956 tion is one minute and second values are rounded up to the next
4957 minute. The job TimeLimit may be updated by root, SlurmUser or
4958 an Operator to a value higher than the configured MaxTime after
4959 job submission.
4960
4961 MinNodes
4962 Minimum count of nodes which may be allocated to any single job.
4963 The default value is 0.
4964
4965 Nodes Comma-separated list of nodes or nodesets which are associated
4966 with this partition. Node names may be specified using the node
4967 range expression syntax described above. A blank list of nodes
4968 (i.e. "Nodes= ") can be used if one wants a partition to exist,
4969 but have no resources (possibly on a temporary basis). A value
4970 of "ALL" is mapped to all nodes configured in the cluster.
4971
4972 OverSubscribe
4973 Controls the ability of the partition to execute more than one
4974 job at a time on each resource (node, socket or core depending
4975 upon the value of SelectTypeParameters). If resources are to be
4976 over-subscribed, avoiding memory over-subscription is very im‐
4977 portant. SelectTypeParameters should be configured to treat
4978 memory as a consumable resource and the --mem option should be
4979 used for job allocations. Sharing of resources is typically
4980 useful only when using gang scheduling (PreemptMode=sus‐
4981 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
4982 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
4983 can negatively impact performance for systems with many thou‐
4984 sands of running jobs. The default value is "NO". For more in‐
4985 formation see the following web pages:
4986 https://slurm.schedmd.com/cons_res.html
4987 https://slurm.schedmd.com/cons_res_share.html
4988 https://slurm.schedmd.com/gang_scheduling.html
4989 https://slurm.schedmd.com/preempt.html
4990
4991 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
4992 Type=select/cons_res or SelectType=select/cons_tres
4993 configured. Jobs that run in partitions with Over‐
4994 Subscribe=EXCLUSIVE will have exclusive access to
4995 all allocated nodes. These jobs are allocated all
4996 CPUs and GRES on the nodes, but they are only allo‐
4997 cated as much memory as they ask for. This is by de‐
4998 sign to support gang scheduling, because suspended
4999 jobs still reside in memory. To request all the mem‐
5000 ory on a node, use --mem=0 at submit time.
5001
5002 FORCE Makes all resources (except GRES) in the partition
5003 available for oversubscription without any means for
5004 users to disable it. May be followed with a colon
5005 and maximum number of jobs in running or suspended
5006 state. For example OverSubscribe=FORCE:4 enables
5007 each node, socket or core to oversubscribe each re‐
5008 source four ways. Recommended only for systems us‐
5009 ing PreemptMode=suspend,gang.
5010
5011 NOTE: OverSubscribe=FORCE:1 is a special case that
5012 is not exactly equivalent to OverSubscribe=NO. Over‐
5013 Subscribe=FORCE:1 disables the regular oversubscrip‐
5014 tion of resources in the same partition but it will
5015 still allow oversubscription due to preemption. Set‐
5016 ting OverSubscribe=NO will prevent oversubscription
5017 from happening due to preemption as well.
5018
5019 NOTE: If using PreemptType=preempt/qos you can spec‐
5020 ify a value for FORCE that is greater than 1. For
5021 example, OverSubscribe=FORCE:2 will permit two jobs
5022 per resource normally, but a third job can be
5023 started only if done so through preemption based
5024 upon QOS.
5025
5026 NOTE: If OverSubscribe is configured to FORCE or YES
5027 in your slurm.conf and the system is not configured
5028 to use preemption (PreemptMode=OFF) accounting can
5029 easily grow to values greater than the actual uti‐
5030 lization. It may be common on such systems to get
5031 error messages in the slurmdbd log stating: "We have
5032 more allocated time than is possible."
5033
5034 YES Makes all resources (except GRES) in the partition
5035 available for sharing upon request by the job. Re‐
5036 sources will only be over-subscribed when explicitly
5037 requested by the user using the "--oversubscribe"
5038 option on job submission. May be followed with a
5039 colon and maximum number of jobs in running or sus‐
5040 pended state. For example "OverSubscribe=YES:4" en‐
5041 ables each node, socket or core to execute up to
5042 four jobs at once. Recommended only for systems
5043 running with gang scheduling (PreemptMode=sus‐
5044 pend,gang).
5045
5046 NO Selected resources are allocated to a single job. No
5047 resource will be allocated to more than one job.
5048
5049 NOTE: Even if you are using PreemptMode=sus‐
5050 pend,gang, setting OverSubscribe=NO will disable
5051 preemption on that partition. Use OverSub‐
5052 scribe=FORCE:1 if you want to disable normal over‐
5053 subscription but still allow suspension due to pre‐
5054 emption.
5055
5056 OverTimeLimit
5057 Number of minutes by which a job can exceed its time limit be‐
5058 fore being canceled. Normally a job's time limit is treated as
5059 a hard limit and the job will be killed upon reaching that
5060 limit. Configuring OverTimeLimit will result in the job's time
5061 limit being treated like a soft limit. Adding the OverTimeLimit
5062 value to the soft time limit provides a hard time limit, at
5063 which point the job is canceled. This is particularly useful
5064 for backfill scheduling, which bases upon each job's soft time
5065 limit. If not set, the OverTimeLimit value for the entire clus‐
5066 ter will be used. May not exceed 65533 minutes. A value of
5067 "UNLIMITED" is also supported.
5068
5069 PartitionName
5070 Name by which the partition may be referenced (e.g. "Interac‐
5071 tive"). This name can be specified by users when submitting
5072 jobs. If the PartitionName is "DEFAULT", the values specified
5073 with that record will apply to subsequent partition specifica‐
5074 tions unless explicitly set to other values in that partition
5075 record or replaced with a different set of default values. Each
5076 line where PartitionName is "DEFAULT" will replace or add to
5077 previous default values and not a reinitialize the default val‐
5078 ues.
5079
5080 PreemptMode
5081 Mechanism used to preempt jobs or enable gang scheduling for
5082 this partition when PreemptType=preempt/partition_prio is con‐
5083 figured. This partition-specific PreemptMode configuration pa‐
5084 rameter will override the cluster-wide PreemptMode for this par‐
5085 tition. It can be set to OFF to disable preemption and gang
5086 scheduling for this partition. See also PriorityTier and the
5087 above description of the cluster-wide PreemptMode parameter for
5088 further details.
5089 The GANG option is used to enable gang scheduling independent of
5090 whether preemption is enabled (i.e. independent of the Preempt‐
5091 Type setting). It can be specified in addition to a PreemptMode
5092 setting with the two options comma separated (e.g. Preempt‐
5093 Mode=SUSPEND,GANG).
5094 See <https://slurm.schedmd.com/preempt.html> and
5095 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
5096 tails.
5097
5098 NOTE: For performance reasons, the backfill scheduler reserves
5099 whole nodes for jobs, not partial nodes. If during backfill
5100 scheduling a job preempts one or more other jobs, the whole
5101 nodes for those preempted jobs are reserved for the preemptor
5102 job, even if the preemptor job requested fewer resources than
5103 that. These reserved nodes aren't available to other jobs dur‐
5104 ing that backfill cycle, even if the other jobs could fit on the
5105 nodes. Therefore, jobs may preempt more resources during a sin‐
5106 gle backfill iteration than they requested.
5107 NOTE: For heterogeneous job to be considered for preemption all
5108 components must be eligible for preemption. When a heterogeneous
5109 job is to be preempted the first identified component of the job
5110 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
5111 CANCEL (lowest)) will be used to set the PreemptMode for all
5112 components. The GraceTime and user warning signal for each com‐
5113 ponent of the heterogeneous job remain unique. Heterogeneous
5114 jobs are excluded from GANG scheduling operations.
5115
5116 OFF Is the default value and disables job preemption and
5117 gang scheduling. It is only compatible with Pre‐
5118 emptType=preempt/none at a global level. A common
5119 use case for this parameter is to set it on a parti‐
5120 tion to disable preemption for that partition.
5121
5122 CANCEL The preempted job will be cancelled.
5123
5124 GANG Enables gang scheduling (time slicing) of jobs in
5125 the same partition, and allows the resuming of sus‐
5126 pended jobs.
5127
5128 NOTE: Gang scheduling is performed independently for
5129 each partition, so if you only want time-slicing by
5130 OverSubscribe, without any preemption, then config‐
5131 uring partitions with overlapping nodes is not rec‐
5132 ommended. On the other hand, if you want to use
5133 PreemptType=preempt/partition_prio to allow jobs
5134 from higher PriorityTier partitions to Suspend jobs
5135 from lower PriorityTier partitions you will need
5136 overlapping partitions, and PreemptMode=SUSPEND,GANG
5137 to use the Gang scheduler to resume the suspended
5138 jobs(s). In any case, time-slicing won't happen be‐
5139 tween jobs on different partitions.
5140 NOTE: Heterogeneous jobs are excluded from GANG
5141 scheduling operations.
5142
5143 REQUEUE Preempts jobs by requeuing them (if possible) or
5144 canceling them. For jobs to be requeued they must
5145 have the --requeue sbatch option set or the cluster
5146 wide JobRequeue parameter in slurm.conf must be set
5147 to 1.
5148
5149 SUSPEND The preempted jobs will be suspended, and later the
5150 Gang scheduler will resume them. Therefore the SUS‐
5151 PEND preemption mode always needs the GANG option to
5152 be specified at the cluster level. Also, because the
5153 suspended jobs will still use memory on the allo‐
5154 cated nodes, Slurm needs to be able to track memory
5155 resources to be able to suspend jobs.
5156
5157 If the preemptees and preemptor are on different
5158 partitions then the preempted jobs will remain sus‐
5159 pended until the preemptor ends.
5160 NOTE: Because gang scheduling is performed indepen‐
5161 dently for each partition, if using PreemptType=pre‐
5162 empt/partition_prio then jobs in higher PriorityTier
5163 partitions will suspend jobs in lower PriorityTier
5164 partitions to run on the released resources. Only
5165 when the preemptor job ends will the suspended jobs
5166 will be resumed by the Gang scheduler.
5167 NOTE: Suspended jobs will not release GRES. Higher
5168 priority jobs will not be able to preempt to gain
5169 access to GRES.
5170
5171 PriorityJobFactor
5172 Partition factor used by priority/multifactor plugin in calcu‐
5173 lating job priority. The value may not exceed 65533. Also see
5174 PriorityTier.
5175
5176 PriorityTier
5177 Jobs submitted to a partition with a higher PriorityTier value
5178 will be evaluated by the scheduler before pending jobs in a par‐
5179 tition with a lower PriorityTier value. They will also be con‐
5180 sidered for preemption of running jobs in partition(s) with
5181 lower PriorityTier values if PreemptType=preempt/partition_prio.
5182 The value may not exceed 65533. Also see PriorityJobFactor.
5183
5184 QOS Used to extend the limits available to a QOS on a partition.
5185 Jobs will not be associated to this QOS outside of being associ‐
5186 ated to the partition. They will still be associated to their
5187 requested QOS. By default, no QOS is used. NOTE: If a limit is
5188 set in both the Partition's QOS and the Job's QOS the Partition
5189 QOS will be honored unless the Job's QOS has the OverPartQOS
5190 flag set in which the Job's QOS will have priority.
5191
5192 ReqResv
5193 Specifies users of this partition are required to designate a
5194 reservation when submitting a job. This option can be useful in
5195 restricting usage of a partition that may have higher priority
5196 or additional resources to be allowed only within a reservation.
5197 Possible values are "YES" and "NO". The default value is "NO".
5198
5199 ResumeTimeout
5200 Maximum time permitted (in seconds) between when a node resume
5201 request is issued and when the node is actually available for
5202 use. Nodes which fail to respond in this time frame will be
5203 marked DOWN and the jobs scheduled on the node requeued. Nodes
5204 which reboot after this time frame will be marked DOWN with a
5205 reason of "Node unexpectedly rebooted." For nodes that are in
5206 multiple partitions with this option set, the highest time will
5207 take effect. If not set on any partition, the node will use the
5208 ResumeTimeout value set for the entire cluster.
5209
5210 RootOnly
5211 Specifies if only user ID zero (i.e. user root) may allocate re‐
5212 sources in this partition. User root may allocate resources for
5213 any other user, but the request must be initiated by user root.
5214 This option can be useful for a partition to be managed by some
5215 external entity (e.g. a higher-level job manager) and prevents
5216 users from directly using those resources. Possible values are
5217 "YES" and "NO". The default value is "NO".
5218
5219 SelectTypeParameters
5220 Partition-specific resource allocation type. This option re‐
5221 places the global SelectTypeParameters value. Supported values
5222 are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
5223 Use requires the system-wide SelectTypeParameters value be set
5224 to any of the four supported values previously listed; other‐
5225 wise, the partition-specific value will be ignored.
5226
5227 Shared The Shared configuration parameter has been replaced by the
5228 OverSubscribe parameter described above.
5229
5230 State State of partition or availability for use. Possible values are
5231 "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
5232 See also the related "Alternate" keyword.
5233
5234 UP Designates that new jobs may be queued on the parti‐
5235 tion, and that jobs may be allocated nodes and run
5236 from the partition.
5237
5238 DOWN Designates that new jobs may be queued on the parti‐
5239 tion, but queued jobs may not be allocated nodes and
5240 run from the partition. Jobs already running on the
5241 partition continue to run. The jobs must be explicitly
5242 canceled to force their termination.
5243
5244 DRAIN Designates that no new jobs may be queued on the par‐
5245 tition (job submission requests will be denied with an
5246 error message), but jobs already queued on the parti‐
5247 tion may be allocated nodes and run. See also the
5248 "Alternate" partition specification.
5249
5250 INACTIVE Designates that no new jobs may be queued on the par‐
5251 tition, and jobs already queued may not be allocated
5252 nodes and run. See also the "Alternate" partition
5253 specification.
5254
5255 SuspendTime
5256 Nodes which remain idle or down for this number of seconds will
5257 be placed into power save mode by SuspendProgram. For efficient
5258 system utilization, it is recommended that the value of Suspend‐
5259 Time be at least as large as the sum of SuspendTimeout plus Re‐
5260 sumeTimeout. For nodes that are in multiple partitions with
5261 this option set, the highest time will take effect. If not set
5262 on any partition, the node will use the SuspendTime value set
5263 for the entire cluster. Setting SuspendTime to anything but
5264 "INFINITE" will enable power save mode.
5265
5266 SuspendTimeout
5267 Maximum time permitted (in seconds) between when a node suspend
5268 request is issued and when the node is shutdown. At that time
5269 the node must be ready for a resume request to be issued as
5270 needed for new work. For nodes that are in multiple partitions
5271 with this option set, the highest time will take effect. If not
5272 set on any partition, the node will use the SuspendTimeout value
5273 set for the entire cluster.
5274
5275 TRESBillingWeights
5276 TRESBillingWeights is used to define the billing weights of each
5277 TRES type that will be used in calculating the usage of a job.
5278 The calculated usage is used when calculating fairshare and when
5279 enforcing the TRES billing limit on jobs.
5280
5281 Billing weights are specified as a comma-separated list of <TRES
5282 Type>=<TRES Billing Weight> pairs.
5283
5284 Any TRES Type is available for billing. Note that the base unit
5285 for memory and burst buffers is megabytes.
5286
5287 By default the billing of TRES is calculated as the sum of all
5288 TRES types multiplied by their corresponding billing weight.
5289
5290 The weighted amount of a resource can be adjusted by adding a
5291 suffix of K,M,G,T or P after the billing weight. For example, a
5292 memory weight of "mem=.25" on a job allocated 8GB will be billed
5293 2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
5294 same job will be billed 2 (8192MB * (.25/1024)) units.
5295
5296 Negative values are allowed.
5297
5298 When a job is allocated 1 CPU and 8 GB of memory on a partition
5299 configured with TRESBilling‐
5300 Weights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the billable TRES will
5301 be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
5302
5303 If PriorityFlags=MAX_TRES is configured, the billable TRES is
5304 calculated as the MAX of individual TRES' on a node (e.g. cpus,
5305 mem, gres) plus the sum of all global TRES' (e.g. licenses). Us‐
5306 ing the same example above the billable TRES will be MAX(1*1.0,
5307 8*0.25) + (0*2.0) = 2.0.
5308
5309 If TRESBillingWeights is not defined then the job is billed
5310 against the total number of allocated CPUs.
5311
5312 NOTE: TRESBillingWeights doesn't affect job priority directly as
5313 it is currently not used for the size of the job. If you want
5314 TRES' to play a role in the job's priority then refer to the
5315 PriorityWeightTRES option.
5316
5318 There are a variety of prolog and epilog program options that execute
5319 with various permissions and at various times. The four options most
5320 likely to be used are: Prolog and Epilog (executed once on each compute
5321 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
5322 once on the ControlMachine for each job).
5323
5324 NOTE: Standard output and error messages are normally not preserved.
5325 Explicitly write output and error messages to an appropriate location
5326 if you wish to preserve that information.
5327
5328 NOTE: By default the Prolog script is ONLY run on any individual node
5329 when it first sees a job step from a new allocation. It does not run
5330 the Prolog immediately when an allocation is granted. If no job steps
5331 from an allocation are run on a node, it will never run the Prolog for
5332 that allocation. This Prolog behaviour can be changed by the Pro‐
5333 logFlags parameter. The Epilog, on the other hand, always runs on ev‐
5334 ery node of an allocation when the allocation is released.
5335
5336 If the Epilog fails (returns a non-zero exit code), this will result in
5337 the node being set to a DRAIN state. If the EpilogSlurmctld fails (re‐
5338 turns a non-zero exit code), this will only be logged. If the Prolog
5339 fails (returns a non-zero exit code), this will result in the node be‐
5340 ing set to a DRAIN state and the job being requeued in a held state un‐
5341 less nohold_on_prolog_fail is configured in SchedulerParameters. If
5342 the PrologSlurmctld fails (returns a non-zero exit code), this will re‐
5343 sult in the job being requeued to be executed on another node if possi‐
5344 ble. Only batch jobs can be requeued. Interactive jobs (salloc and
5345 srun) will be cancelled if the PrologSlurmctld fails. If slurmcltd is
5346 stopped while either PrologSlurmctld or EpilogSlurmctld is running, the
5347 script will be killed with SIGKILL. The script will restart when slurm‐
5348 ctld restarts.
5349
5350
5351 Information about the job is passed to the script using environment
5352 variables. Unless otherwise specified, these environment variables are
5353 available in each of the scripts mentioned above (Prolog, Epilog, Pro‐
5354 logSlurmctld and EpilogSlurmctld). For a full list of environment vari‐
5355 ables that includes those available in the SrunProlog, SrunEpilog,
5356 TaskProlog and TaskEpilog please see the Prolog and Epilog Guide
5357 <https://slurm.schedmd.com/prolog_epilog.html>.
5358
5359
5360 SLURM_ARRAY_JOB_ID
5361 If this job is part of a job array, this will be set to the job
5362 ID. Otherwise it will not be set. To reference this specific
5363 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5364 RAY_TASK_ID (e.g. "scontrol update ${SLURM_AR‐
5365 RAY_JOB_ID}_{$SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5366 logSlurmctld and EpilogSlurmctld.
5367
5368 SLURM_ARRAY_TASK_ID
5369 If this job is part of a job array, this will be set to the task
5370 ID. Otherwise it will not be set. To reference this specific
5371 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5372 RAY_TASK_ID (e.g. "scontrol update ${SLURM_AR‐
5373 RAY_JOB_ID}_{$SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5374 logSlurmctld and EpilogSlurmctld.
5375
5376 SLURM_ARRAY_TASK_MAX
5377 If this job is part of a job array, this will be set to the max‐
5378 imum task ID. Otherwise it will not be set. Available in Pro‐
5379 logSlurmctld and EpilogSlurmctld.
5380
5381 SLURM_ARRAY_TASK_MIN
5382 If this job is part of a job array, this will be set to the min‐
5383 imum task ID. Otherwise it will not be set. Available in Pro‐
5384 logSlurmctld and EpilogSlurmctld.
5385
5386 SLURM_ARRAY_TASK_STEP
5387 If this job is part of a job array, this will be set to the step
5388 size of task IDs. Otherwise it will not be set. Available in
5389 PrologSlurmctld and EpilogSlurmctld.
5390
5391 SLURM_CLUSTER_NAME
5392 Name of the cluster executing the job.
5393
5394 SLURM_CONF
5395 Location of the slurm.conf file. Available in Prolog and Epilog.
5396
5397 SLURMD_NODENAME
5398 Name of the node running the task. In the case of a parallel job
5399 executing on multiple compute nodes, the various tasks will have
5400 this environment variable set to different values on each com‐
5401 pute node. Available in Prolog and Epilog.
5402
5403 SLURM_JOB_ACCOUNT
5404 Account name used for the job. Available in PrologSlurmctld and
5405 EpilogSlurmctld.
5406
5407 SLURM_JOB_CONSTRAINTS
5408 Features required to run the job. Available in Prolog, Pro‐
5409 logSlurmctld and EpilogSlurmctld.
5410
5411 SLURM_JOB_DERIVED_EC
5412 The highest exit code of all of the job steps. Available in
5413 EpilogSlurmctld.
5414
5415 SLURM_JOB_EXIT_CODE
5416 The exit code of the job script (or salloc). The value is the
5417 status as returned by the wait() system call (See wait(2))
5418 Available in EpilogSlurmctld.
5419
5420 SLURM_JOB_EXIT_CODE2
5421 The exit code of the job script (or salloc). The value has the
5422 format <exit>:<sig>. The first number is the exit code, typi‐
5423 cally as set by the exit() function. The second number of the
5424 signal that caused the process to terminate if it was terminated
5425 by a signal. Available in EpilogSlurmctld.
5426
5427 SLURM_JOB_GID
5428 Group ID of the job's owner.
5429
5430 SLURM_JOB_GPUS
5431 The GPU IDs of GPUs in the job allocation (if any). Available
5432 in the Prolog and Epilog.
5433
5434 SLURM_JOB_GROUP
5435 Group name of the job's owner. Available in PrologSlurmctld and
5436 EpilogSlurmctld.
5437
5438 SLURM_JOB_ID
5439 Job ID.
5440
5441 SLURM_JOBID
5442 Job ID.
5443
5444 SLURM_JOB_NAME
5445 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5446 ctld.
5447
5448 SLURM_JOB_NODELIST
5449 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5450 show hostnames" can be used to convert this to a list of indi‐
5451 vidual host names. Available in PrologSlurmctld and Epi‐
5452 logSlurmctld.
5453
5454 SLURM_JOB_PARTITION
5455 Partition that job runs in. Available in Prolog, PrologSlurm‐
5456 ctld and EpilogSlurmctld.
5457
5458 SLURM_JOB_UID
5459 User ID of the job's owner.
5460
5461 SLURM_JOB_USER
5462 User name of the job's owner.
5463
5464 SLURM_SCRIPT_CONTEXT
5465 Identifies which epilog or prolog program is currently running.
5466
5468 This program can be used to take special actions to clean up the unkil‐
5469 lable processes and/or notify system administrators. The program will
5470 be run as SlurmdUser (usually "root") on the compute node where Unkill‐
5471 ableStepTimeout was triggered.
5472
5473 Information about the unkillable job step is passed to the script using
5474 environment variables.
5475
5476
5477 SLURM_JOB_ID
5478 Job ID.
5479
5480 SLURM_STEP_ID
5481 Job Step ID.
5482
5484 Slurm is able to optimize job allocations to minimize network con‐
5485 tention. Special Slurm logic is used to optimize allocations on sys‐
5486 tems with a three-dimensional interconnect. and information about con‐
5487 figuring those systems are available on web pages available here:
5488 <https://slurm.schedmd.com/>. For a hierarchical network, Slurm needs
5489 to have detailed information about how nodes are configured on the net‐
5490 work switches.
5491
5492 Given network topology information, Slurm allocates all of a job's re‐
5493 sources onto a single leaf of the network (if possible) using a
5494 best-fit algorithm. Otherwise it will allocate a job's resources onto
5495 multiple leaf switches so as to minimize the use of higher-level
5496 switches. The TopologyPlugin parameter controls which plugin is used
5497 to collect network topology information. The only values presently
5498 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5499 forms best-fit logic over three-dimensional topology), "topology/none"
5500 (default for other systems, best-fit logic over one-dimensional topol‐
5501 ogy), "topology/tree" (determine the network topology based upon infor‐
5502 mation contained in a topology.conf file, see "man topology.conf" for
5503 more information). Future plugins may gather topology information di‐
5504 rectly from the network. The topology information is optional. If not
5505 provided, Slurm will perform a best-fit algorithm assuming the nodes
5506 are in a one-dimensional array as configured and the communications
5507 cost is related to the node distance in this array.
5508
5509
5511 If the cluster's computers used for the primary or backup controller
5512 will be out of service for an extended period of time, it may be desir‐
5513 able to relocate them. In order to do so, follow this procedure:
5514
5515 1. Stop the Slurm daemons
5516 2. Modify the slurm.conf file appropriately
5517 3. Distribute the updated slurm.conf file to all nodes
5518 4. Restart the Slurm daemons
5519
5520 There should be no loss of any running or pending jobs. Ensure that
5521 any nodes added to the cluster have the current slurm.conf file in‐
5522 stalled.
5523
5524 CAUTION: If two nodes are simultaneously configured as the primary con‐
5525 troller (two nodes on which SlurmctldHost specify the local host and
5526 the slurmctld daemon is executing on each), system behavior will be de‐
5527 structive. If a compute node has an incorrect SlurmctldHost parameter,
5528 that node may be rendered unusable, but no other harm will result.
5529
5530
5532 #
5533 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5534 # Author: John Doe
5535 # Date: 11/06/2001
5536 #
5537 SlurmctldHost=dev0(12.34.56.78) # Primary server
5538 SlurmctldHost=dev1(12.34.56.79) # Backup server
5539 #
5540 AuthType=auth/munge
5541 Epilog=/usr/local/slurm/epilog
5542 Prolog=/usr/local/slurm/prolog
5543 FirstJobId=65536
5544 InactiveLimit=120
5545 JobCompType=jobcomp/filetxt
5546 JobCompLoc=/var/log/slurm/jobcomp
5547 KillWait=30
5548 MaxJobCount=10000
5549 MinJobAge=3600
5550 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5551 ReturnToService=0
5552 SchedulerType=sched/backfill
5553 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5554 SlurmdLogFile=/var/log/slurm/slurmd.log
5555 SlurmctldPort=7002
5556 SlurmdPort=7003
5557 SlurmdSpoolDir=/var/spool/slurmd.spool
5558 StateSaveLocation=/var/spool/slurm.state
5559 SwitchType=switch/none
5560 TmpFS=/tmp
5561 WaitTime=30
5562 JobCredentialPrivateKey=/usr/local/slurm/private.key
5563 JobCredentialPublicCertificate=/usr/local/slurm/public.cert
5564 #
5565 # Node Configurations
5566 #
5567 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5568 NodeName=DEFAULT State=UNKNOWN
5569 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5570 # Update records for specific DOWN nodes
5571 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5572 #
5573 # Partition Configurations
5574 #
5575 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5576 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5577 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5578 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5579
5580
5582 The "include" key word can be used with modifiers within the specified
5583 pathname. These modifiers would be replaced with cluster name or other
5584 information depending on which modifier is specified. If the included
5585 file is not an absolute path name (i.e. it does not start with a
5586 slash), it will searched for in the same directory as the slurm.conf
5587 file.
5588
5589
5590 %c Cluster name specified in the slurm.conf will be used.
5591
5592 EXAMPLE
5593 ClusterName=linux
5594 include /home/slurm/etc/%c_config
5595 # Above line interpreted as
5596 # "include /home/slurm/etc/linux_config"
5597
5598
5600 There are three classes of files: Files used by slurmctld must be ac‐
5601 cessible by user SlurmUser and accessible by the primary and backup
5602 control machines. Files used by slurmd must be accessible by user root
5603 and accessible from every compute node. A few files need to be acces‐
5604 sible by normal users on all login and compute nodes. While many files
5605 and directories are listed below, most of them will not be used with
5606 most configurations.
5607
5608
5609 Epilog Must be executable by user root. It is recommended that the
5610 file be readable by all users. The file must exist on every
5611 compute node.
5612
5613 EpilogSlurmctld
5614 Must be executable by user SlurmUser. It is recommended that
5615 the file be readable by all users. The file must be accessible
5616 by the primary and backup control machines.
5617
5618 HealthCheckProgram
5619 Must be executable by user root. It is recommended that the
5620 file be readable by all users. The file must exist on every
5621 compute node.
5622
5623 JobCompLoc
5624 If this specifies a file, it must be writable by user SlurmUser.
5625 The file must be accessible by the primary and backup control
5626 machines.
5627
5628 JobCredentialPrivateKey
5629 Must be readable only by user SlurmUser and writable by no other
5630 users. The file must be accessible by the primary and backup
5631 control machines.
5632
5633 JobCredentialPublicCertificate
5634 Readable to all users on all nodes. Must not be writable by
5635 regular users.
5636
5637 MailProg
5638 Must be executable by user SlurmUser. Must not be writable by
5639 regular users. The file must be accessible by the primary and
5640 backup control machines.
5641
5642 Prolog Must be executable by user root. It is recommended that the
5643 file be readable by all users. The file must exist on every
5644 compute node.
5645
5646 PrologSlurmctld
5647 Must be executable by user SlurmUser. It is recommended that
5648 the file be readable by all users. The file must be accessible
5649 by the primary and backup control machines.
5650
5651 ResumeProgram
5652 Must be executable by user SlurmUser. The file must be accessi‐
5653 ble by the primary and backup control machines.
5654
5655 slurm.conf
5656 Readable to all users on all nodes. Must not be writable by
5657 regular users.
5658
5659 SlurmctldLogFile
5660 Must be writable by user SlurmUser. The file must be accessible
5661 by the primary and backup control machines.
5662
5663 SlurmctldPidFile
5664 Must be writable by user root. Preferably writable and remov‐
5665 able by SlurmUser. The file must be accessible by the primary
5666 and backup control machines.
5667
5668 SlurmdLogFile
5669 Must be writable by user root. A distinct file must exist on
5670 each compute node.
5671
5672 SlurmdPidFile
5673 Must be writable by user root. A distinct file must exist on
5674 each compute node.
5675
5676 SlurmdSpoolDir
5677 Must be writable by user root. A distinct file must exist on
5678 each compute node.
5679
5680 SrunEpilog
5681 Must be executable by all users. The file must exist on every
5682 login and compute node.
5683
5684 SrunProlog
5685 Must be executable by all users. The file must exist on every
5686 login and compute node.
5687
5688 StateSaveLocation
5689 Must be writable by user SlurmUser. The file must be accessible
5690 by the primary and backup control machines.
5691
5692 SuspendProgram
5693 Must be executable by user SlurmUser. The file must be accessi‐
5694 ble by the primary and backup control machines.
5695
5696 TaskEpilog
5697 Must be executable by all users. The file must exist on every
5698 compute node.
5699
5700 TaskProlog
5701 Must be executable by all users. The file must exist on every
5702 compute node.
5703
5704 UnkillableStepProgram
5705 Must be executable by user SlurmUser. The file must be accessi‐
5706 ble by the primary and backup control machines.
5707
5709 Note that while Slurm daemons create log files and other files as
5710 needed, it treats the lack of parent directories as a fatal error.
5711 This prevents the daemons from running if critical file systems are not
5712 mounted and will minimize the risk of cold-starting (starting without
5713 preserving jobs).
5714
5715 Log files and job accounting files may need to be created/owned by the
5716 "SlurmUser" uid to be successfully accessed. Use the "chown" and
5717 "chmod" commands to set the ownership and permissions appropriately.
5718 See the section FILE AND DIRECTORY PERMISSIONS for information about
5719 the various files and directories used by Slurm.
5720
5721 It is recommended that the logrotate utility be used to ensure that
5722 various log files do not become too large. This also applies to text
5723 files used for accounting, process tracking, and the slurmdbd log if
5724 they are used.
5725
5726 Here is a sample logrotate configuration. Make appropriate site modifi‐
5727 cations and save as /etc/logrotate.d/slurm on all nodes. See the
5728 logrotate man page for more details.
5729
5730 ##
5731 # Slurm Logrotate Configuration
5732 ##
5733 /var/log/slurm/*.log {
5734 compress
5735 missingok
5736 nocopytruncate
5737 nodelaycompress
5738 nomail
5739 notifempty
5740 noolddir
5741 rotate 5
5742 sharedscripts
5743 size=5M
5744 create 640 slurm root
5745 postrotate
5746 pkill -x --signal SIGUSR2 slurmctld
5747 pkill -x --signal SIGUSR2 slurmd
5748 pkill -x --signal SIGUSR2 slurmdbd
5749 exit 0
5750 endscript
5751 }
5752
5753
5755 Copyright (C) 2002-2007 The Regents of the University of California.
5756 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
5757 Copyright (C) 2008-2010 Lawrence Livermore National Security.
5758 Copyright (C) 2010-2022 SchedMD LLC.
5759
5760 This file is part of Slurm, a resource management program. For de‐
5761 tails, see <https://slurm.schedmd.com/>.
5762
5763 Slurm is free software; you can redistribute it and/or modify it under
5764 the terms of the GNU General Public License as published by the Free
5765 Software Foundation; either version 2 of the License, or (at your op‐
5766 tion) any later version.
5767
5768 Slurm is distributed in the hope that it will be useful, but WITHOUT
5769 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
5770 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
5771 for more details.
5772
5773
5775 /etc/slurm.conf
5776
5777
5779 cgroup.conf(5), getaddrinfo(3), getrlimit(2), gres.conf(5), group(5),
5780 hostname(1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8), slur‐
5781 mdbd.conf(5), srun(1), spank(8), syslog(3), topology.conf(5)
5782
5783
5784
5785May 2022 Slurm Configuration File slurm.conf(5)