slurm.conf(5)                 Slurm Configuration File                 slurm.conf(5)

NAME
       slurm.conf - Slurm configuration file

DESCRIPTION
       slurm.conf is an ASCII file which describes general Slurm configuration information, the nodes to be managed, information about how those nodes are grouped into partitions, and various scheduling parameters associated with those partitions. This file should be consistent across all nodes in the cluster.

       The file location can be modified at execution time by setting the SLURM_CONF environment variable. The Slurm daemons also allow you to override both the built-in and environment-provided location using the "-f" option on the command line.

       The contents of the file are case insensitive except for the names of nodes and partitions. Any text following a "#" in the configuration file is treated as a comment through the end of that line. Changes to the configuration file take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure" unless otherwise noted.

       If a line begins with the word "Include" followed by whitespace and then a file name, that file will be included inline with the current configuration file. For large or complex systems, multiple configuration files may prove easier to manage and enable reuse of some files (See INCLUDE MODIFIERS for more details).
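
       For example, node and partition definitions can be kept in separate files and pulled into the main configuration (the file names here are hypothetical):

              Include /etc/slurm/nodes.conf
              Include /etc/slurm/partitions.conf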

       Note on file permissions:

       The slurm.conf file must be readable by all users of Slurm, since it is used by many of the Slurm commands. Other files that are defined in the slurm.conf file, such as log files and job accounting files, may need to be created/owned by the user "SlurmUser" to be successfully accessed. Use the "chown" and "chmod" commands to set the ownership and permissions appropriately. See the section FILE AND DIRECTORY PERMISSIONS for information about the various files and directories used by Slurm.

PARAMETERS
       The overall configuration parameters available include:

       AccountingStorageBackupHost
              The name of the backup machine hosting the accounting storage database. If used with the accounting_storage/slurmdbd plugin, this is where the backup slurmdbd would be running. Only used with systems using SlurmDBD, ignored otherwise.

       AccountingStorageEnforce
              This controls what level of association-based enforcement to impose on job submissions. Valid options are any combination of associations, limits, nojobs, nosteps, qos, safe, and wckeys, or all for all things (except nojobs and nosteps, which must be requested as well).

              If limits, qos, or wckeys are set, associations will automatically be set.

              If wckeys is set, TrackWCKey will automatically be set.

              If safe is set, limits and associations will automatically be set.

              If nojobs is set, nosteps will automatically be set.

              By setting associations, no new job is allowed to run unless a corresponding association exists in the system. If limits are enforced, users can be limited by association to whatever job size or run time limits are defined.

              If nojobs is set, Slurm will not account for any jobs or steps on the system. Likewise, if nosteps is set, Slurm will not account for any steps that have run.

              If safe is enforced, a job will only be launched against an association or qos that has a GrpTRESMins limit set if the job will be able to run to completion. Without this option set, jobs will be launched as long as their usage hasn't reached the cpu-minutes limit. This can lead to jobs being launched but then killed when the limit is reached.

              With qos and/or wckeys enforced, jobs will not be scheduled unless a valid qos and/or workload characterization key is specified.

              A restart of slurmctld is required for changes to this parameter to take effect.
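
              As an illustration, a site that wants jobs rejected unless they map to a known association, with limits enforced conservatively, might set:

                     AccountingStorageEnforce=associations,limits,safe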

       AccountingStorageExternalHost
              A comma-separated list of external slurmdbds (<host/ip>[:port][,...]) to register with. If no port is given, the AccountingStoragePort will be used.

              This allows clusters registered with the external slurmdbd to communicate with each other using the --cluster/-M client command options.

              The cluster will add itself to the external slurmdbd if it doesn't exist. If a non-external cluster already exists on the external slurmdbd, the slurmctld will ignore registering to the external slurmdbd.

       AccountingStorageHost
              The name of the machine hosting the accounting storage database. Only used with systems using SlurmDBD, ignored otherwise.

       AccountingStorageParameters
              Comma-separated list of key-value pair parameters. Currently supported values include options to establish a secure connection to the database:

              SSL_CERT
                     The path name of the client public key certificate file.

              SSL_CA
                     The path name of the Certificate Authority (CA) certificate file.

              SSL_CAPATH
                     The path name of the directory that contains trusted SSL CA certificate files.

              SSL_KEY
                     The path name of the client private key file.

              SSL_CIPHER
                     The list of permissible ciphers for SSL encryption.

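              As a sketch, a TLS-protected database connection could be configured as follows (the certificate paths are hypothetical):

                     AccountingStorageParameters=SSL_CERT=/etc/slurm/ssl/client-cert.pem,SSL_KEY=/etc/slurm/ssl/client-key.pem,SSL_CA=/etc/slurm/ssl/ca-cert.pem
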
       AccountingStoragePass
              The password used to gain access to the database to store the accounting data. Only used for database type storage plugins, ignored otherwise. In the case of Slurm DBD (Database Daemon) with MUNGE authentication, this can be configured to use a MUNGE daemon specifically configured to provide authentication between clusters while the default MUNGE daemon provides authentication within a cluster. In that case, AccountingStoragePass should specify the named port to be used for communications with the alternate MUNGE daemon (e.g. "/var/run/munge/global.socket.2"). The default value is NULL.

       AccountingStoragePort
              The listening port of the accounting storage database server. Only used for database type storage plugins, ignored otherwise. The default value is SLURMDBD_PORT as established at system build time. If no value is explicitly specified, it will be set to 6819. This value must be equal to the DbdPort parameter in the slurmdbd.conf file.

       AccountingStorageTRES
              Comma-separated list of resources you wish to track on the cluster. These are the resources requested by the sbatch/srun job when it is submitted. Currently this consists of any GRES, BB (burst buffer) or license along with CPU, Memory, Node, Energy, FS/[Disk|Lustre], IC/OFED, Pages, and VMem. By default Billing, CPU, Energy, Memory, Node, FS/Disk, Pages and VMem are tracked. These default TRES cannot be disabled, but only appended to. AccountingStorageTRES=gres/craynetwork,license/iop1 will track billing, cpu, energy, memory, nodes, fs/disk, pages and vmem along with a gres called craynetwork as well as a license called iop1. Whenever these resources are used on the cluster they are recorded. The TRES are automatically set up in the database on the start of the slurmctld.

              If multiple GRES of different types are tracked (e.g. GPUs of different types), then job requests with matching type specifications will be recorded. Given a configuration of "AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta", then "gres/gpu:tesla" and "gres/gpu:volta" will track only jobs that explicitly request those two GPU types, while "gres/gpu" will track allocated GPUs of any type ("tesla", "volta" or any other GPU type).

              Given a configuration of "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta", then "gres/gpu:tesla" and "gres/gpu:volta" will track jobs that explicitly request those GPU types. If a job requests GPUs, but does not explicitly specify the GPU type, then its resource allocation will be accounted for as either "gres/gpu:tesla" or "gres/gpu:volta", although the accounting may not match the actual GPU type allocated to the job and the GPUs allocated to the job could be heterogeneous. In an environment containing various GPU types, use of a job_submit plugin may be desired in order to force jobs to explicitly specify some GPU type.

       AccountingStorageType
              The accounting storage mechanism type. Acceptable values at present include "accounting_storage/none" and "accounting_storage/slurmdbd". The "accounting_storage/slurmdbd" value indicates that accounting records will be written to the Slurm DBD, which manages an underlying MySQL database. See "man slurmdbd" for more information. The default value is "accounting_storage/none" and indicates that account records are not maintained.

       AccountingStorageUser
              The user account for accessing the accounting storage database. Only used for database type storage plugins, ignored otherwise.

       AccountingStoreFlags
              Comma-separated list used to tell the slurmctld to store extra fields that may be more heavyweight than the normal job information.

              Current options are:

              job_comment
                     Include the job's comment field in the job complete message sent to the Accounting Storage database. Note the AdminComment and SystemComment are always recorded in the database.

              job_env
                     Include a batch job's environment variables used at job submission in the job start message sent to the Accounting Storage database.

              job_script
                     Include the job's batch script in the job start message sent to the Accounting Storage database.

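              For instance, to archive both the batch script and its submission environment with each job's accounting record:

                     AccountingStoreFlags=job_script,job_env
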
       AcctGatherNodeFreq
              The AcctGather plugins' sampling interval for node accounting. For AcctGather plugin values of none, this parameter is ignored. For all other values, this parameter is the number of seconds between node accounting samples. For the acct_gather_energy/rapl plugin, set a value less than 300 because the counters may overflow beyond this rate. The default value is zero, which disables accounting sampling for nodes. Note: The accounting sampling interval for jobs is determined by the value of JobAcctGatherFrequency.

       AcctGatherEnergyType
              Identifies the plugin to be used for energy consumption accounting. The jobacct_gather plugin and slurmd daemon call this plugin to collect energy consumption data for jobs and nodes. The collection of energy consumption data takes place at the node level, so the measurements will reflect a job's real consumption only in the case of an exclusive job allocation. When nodes are shared between jobs, the reported energy consumed per job (through sstat or sacct) will not reflect the energy actually consumed by the jobs.

              Configurable values at present are:

              acct_gather_energy/none
                     No energy consumption data is collected.

              acct_gather_energy/ipmi
                     Energy consumption data is collected from the Baseboard Management Controller (BMC) using the Intelligent Platform Management Interface (IPMI).

              acct_gather_energy/pm_counters
                     Energy consumption data is collected from the Baseboard Management Controller (BMC) for HPE Cray systems.

              acct_gather_energy/rapl
                     Energy consumption data is collected from hardware sensors using the Running Average Power Limit (RAPL) mechanism. Note that enabling RAPL may require the execution of the command "sudo modprobe msr".

              acct_gather_energy/xcc
                     Energy consumption data is collected from the Lenovo SD650 XClarity Controller (XCC) using IPMI OEM raw commands.

       AcctGatherInterconnectType
              Identifies the plugin to be used for interconnect network traffic accounting. The jobacct_gather plugin and slurmd daemon call this plugin to collect network traffic data for jobs and nodes. The collection of network traffic data takes place at the node level, so the collected values will reflect a job's real traffic only in the case of an exclusive job allocation. When nodes are shared between jobs, the reported network traffic per job (through sstat or sacct) will not reflect the network traffic actually generated by the jobs.

              Configurable values at present are:

              acct_gather_interconnect/none
                     No InfiniBand network data are collected.

              acct_gather_interconnect/ofed
                     InfiniBand network traffic data are collected from the hardware monitoring counters of InfiniBand devices through the OFED library. In order to account for per-job network traffic, add the "ic/ofed" TRES to AccountingStorageTRES.

              acct_gather_interconnect/sysfs
                     Network traffic statistics are collected from the Linux sysfs pseudo-filesystem for specific interfaces defined in acct_gather_interconnect.conf(5). In order to account for per-job network traffic, add the "ic/sysfs" TRES to AccountingStorageTRES.

       AcctGatherFilesystemType
              Identifies the plugin to be used for filesystem traffic accounting. The jobacct_gather plugin and slurmd daemon call this plugin to collect filesystem traffic data for jobs and nodes. The collection of filesystem traffic data takes place at the node level, so the collected values will reflect a job's real traffic only in the case of an exclusive job allocation. When nodes are shared between jobs, the reported filesystem traffic per job (through sstat or sacct) will not reflect the filesystem traffic actually generated by the jobs.

              Configurable values at present are:

              acct_gather_filesystem/none
                     No filesystem data are collected.

              acct_gather_filesystem/lustre
                     Lustre filesystem traffic data are collected from the counters found in /proc/fs/lustre/. In order to account for per-job Lustre traffic, add the "fs/lustre" TRES to AccountingStorageTRES.

       AcctGatherProfileType
              Identifies the plugin to be used for detailed job profiling. The jobacct_gather plugin and slurmd daemon call this plugin to collect detailed data such as I/O counts, memory usage, or energy consumption for jobs and nodes. There are interfaces in this plugin to collect data at step start and completion, task start and completion, and at the account gather frequency. The data collected at the node level is related to jobs only in the case of an exclusive job allocation.

              Configurable values at present are:

              acct_gather_profile/none
                     No profile data is collected.

              acct_gather_profile/hdf5
                     This enables the HDF5 plugin. The directory where the profile files are stored and which values are collected are configured in the acct_gather.conf file.

              acct_gather_profile/influxdb
                     This enables the influxdb plugin. The influxdb instance host, port, database, retention policy and which values are collected are configured in the acct_gather.conf file.

       AllowSpecResourcesUsage
              If set to "YES", Slurm allows individual jobs to override a node's configured CoreSpecCount value. For a job to take advantage of this feature, a command line option of --core-spec must be specified. The default value for this option is "YES" for Cray systems and "NO" for other system types.

       AuthAltTypes
              Comma-separated list of alternative authentication plugins that the slurmctld will permit for communication. Acceptable values at present include auth/jwt.

              NOTE: auth/jwt requires a jwt_hs256.key to be populated in the StateSaveLocation directory for slurmctld only. The jwt_hs256.key should only be visible to the SlurmUser and root. It is not suggested to place the jwt_hs256.key on any nodes but the controller running slurmctld. auth/jwt can be activated by the presence of the SLURM_JWT environment variable. When activated, it will override the default AuthType.

       AuthAltParameters
              Used to define alternative authentication plugins' options. Multiple options may be comma separated.

              disable_token_creation
                     Disable "scontrol token" use by non-SlurmUser accounts.

              max_token_lifespan=<seconds>
                     Set max lifespan (in seconds) for any token generated for user accounts. (This limit does not apply to SlurmUser.)

              jwks=  Absolute path to JWKS file. Only RS256 keys are supported, although other key types may be listed in the file. If set, no HS256 key will be loaded by default (and token generation is disabled), although the jwt_key setting may be used to explicitly re-enable HS256 key use (and token generation).

              jwt_key=
                     Absolute path to JWT key file. Key must be HS256, and should only be accessible by SlurmUser. If not set, the default key file is jwt_hs256.key in StateSaveLocation.
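
              A deployment enabling JWT tokens alongside the default MUNGE authentication might look like this (the key path assumes a StateSaveLocation of /var/spool/slurmctld):

                     AuthType=auth/munge
                     AuthAltTypes=auth/jwt
                     AuthAltParameters=jwt_key=/var/spool/slurmctld/jwt_hs256.key,max_token_lifespan=28800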

       AuthInfo
              Additional information to be used for authentication of communications between the Slurm daemons (slurmctld and slurmd) and the Slurm clients. The interpretation of this option is specific to the configured AuthType. Multiple options may be specified in a comma-delimited list. If not specified, the default authentication information will be used.

              cred_expire
                     Default job step credential lifetime, in seconds (e.g. "cred_expire=1200"). It must be sufficiently long to load the user environment, run the prolog, deal with the slurmd getting paged out of memory, etc. This also controls how long a requeued job must wait before starting again. The default value is 120 seconds.

              socket Path name to a MUNGE daemon socket to use (e.g. "socket=/var/run/munge/munge.socket.2"). The default value is "/var/run/munge/munge.socket.2". Used by auth/munge and cred/munge.

              ttl    Credential lifetime, in seconds (e.g. "ttl=300"). The default value is dependent upon the MUNGE installation, but is typically 300 seconds.
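
              For example, to select an alternate MUNGE socket and a longer credential lifetime (the values are illustrative):

                     AuthInfo=socket=/var/run/munge/munge.socket.2,ttl=600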

       AuthType
              The authentication method for communications between Slurm components. Acceptable values at present include "auth/munge", which is the default. "auth/munge" indicates that MUNGE is to be used. (See "https://dun.github.io/munge/" for more information). All Slurm daemons and commands must be terminated prior to changing the value of AuthType and later restarted.

       BackupAddr
              Deprecated option, see SlurmctldHost.

       BackupController
              Deprecated option, see SlurmctldHost.

              The backup controller recovers state information from the StateSaveLocation directory, which must be readable and writable from both the primary and backup controllers. While not essential, it is recommended that you specify a backup controller. See the RELOCATING CONTROLLERS section if you change this.

       BatchStartTimeout
              The maximum time (in seconds) that a batch job is permitted for launching before being considered missing and releasing the allocation. The default value is 10 (seconds). Larger values may be required if more time is needed to execute the Prolog, load user environment variables, or if the slurmd daemon gets paged from memory.
              Note: The test for a job being successfully launched is only performed when the Slurm daemon on the compute node registers state with the slurmctld daemon on the head node, which happens fairly rarely. Therefore a job will not necessarily be terminated if its start time exceeds BatchStartTimeout. This configuration parameter is also applied to the launching of tasks, to avoid aborting srun commands due to long-running Prolog scripts.

       BcastExclude
              Comma-separated list of absolute directory paths to be excluded when autodetecting and broadcasting executable shared object dependencies through sbcast or srun --bcast. The keyword "none" can be used to indicate that no directory paths should be excluded. The default value is "/lib,/usr/lib,/lib64,/usr/lib64". This option can be overridden by sbcast --exclude and srun --bcast-exclude.

       BcastParameters
              Controls sbcast and srun --bcast behavior. Multiple options can be specified in a comma-separated list. Supported values include:

              DestDir=
                     Destination directory for the file being broadcast to allocated compute nodes. Default value is the current working directory, or --chdir for srun if set.

              Compression=
                     Specify the default file compression library to be used. Supported values are "lz4" and "none". The default value with the sbcast --compress option is "lz4" and "none" otherwise. Some compression libraries may be unavailable on some systems.

              send_libs
                     If set, attempt to autodetect and broadcast the executable's shared object dependencies to allocated compute nodes. The files are placed in a directory alongside the executable. For srun only, the LD_LIBRARY_PATH is automatically updated to include this cache directory as well. This can be overridden with either the sbcast or srun --send-libs option. By default this is disabled.
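
              A site staging executables and their library dependencies to node-local storage with compression might set (the destination path is illustrative):

                     BcastParameters=DestDir=/tmp,Compression=lz4,send_libs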

       BurstBufferType
              The plugin used to manage burst buffers. Acceptable values at present are:

              burst_buffer/datawarp
                     Use Cray DataWarp API to provide burst buffer functionality.

              burst_buffer/lua
                     This plugin provides hooks to an API that is defined by a Lua script. This plugin was developed to provide system administrators with a way to do any task (not only file staging) at different points in a job's life cycle.

              burst_buffer/none

       CliFilterPlugins
              A comma-delimited list of command line interface option filter/modification plugins. The specified plugins will be executed in the order listed. No cli_filter plugins are used by default. Acceptable values at present are:

              cli_filter/lua
                     This plugin allows you to write your own implementation of a cli_filter using Lua.

              cli_filter/syslog
                     This plugin enables logging of the job submission activities performed. All the salloc/sbatch/srun options are logged to syslog together with environment variables in JSON format. If the plugin is not the last one in the list, it may log values different than what was actually sent to slurmctld.

              cli_filter/user_defaults
                     This plugin looks for the file $HOME/.slurm/defaults and reads every line of it as a key=value pair, where key is any of the job submission options available to salloc/sbatch/srun and value is a default value defined by the user. For instance:
                     time=1:30
                     mem=2048
                     The above will result in a user defined default for each of their jobs of "-t 1:30" and "--mem=2048".

       ClusterName
              The name by which this Slurm managed cluster is known in the accounting database. This is needed to distinguish accounting records when multiple clusters report to the same database. Because of limitations in some databases, any upper case letters in the name will be silently mapped to lower case. In order to avoid confusion, it is recommended that the name be lower case. The cluster name must be 40 characters or less in order to comply with the limit on the maximum length for table names in MySQL/MariaDB.

       CommunicationParameters
              Comma-separated list of communication options.

              block_null_hash
                     Require all Slurm authentication tokens to include a newer (20.11.9 and 21.08.8) payload that provides an additional layer of security against credential replay attacks. This option should only be enabled once all Slurm daemons have been upgraded to 20.11.9/21.08.8 or newer, and all jobs that were started before the upgrade have been completed.

              CheckGhalQuiesce
                     Used specifically on a Cray using an Aries GHAL interconnect. This will check to see if the system is quiescing when sending a message, and if so, wait until it is done before sending.

              DisableIPv4
                     Disable IPv4-only operation for all slurm daemons (except slurmdbd). This should also be set in your slurmdbd.conf file.

              EnableIPv6
                     Enable using IPv6 addresses for all slurm daemons (except slurmdbd). When using both IPv4 and IPv6, address family preferences will be based on your /etc/gai.conf file. This should also be set in your slurmdbd.conf file.

              keepaliveinterval=#
                     Specifies the interval between keepalive probes on the socket communications between srun and its slurmstepd process.

              keepaliveprobes=#
                     Specifies the number of keepalive probes sent on the socket communications between the srun command and its slurmstepd process before the connection is considered broken.

              keepalivetime=#
                     Specifies how long socket communications used between the srun command and its slurmstepd process are kept alive after disconnect. Longer values can be used to improve reliability of communications in the event of network failures.

              NoAddrCache
                     By default, Slurm will cache a node's network address after successfully establishing the node's network address. This option disables the cache and Slurm will look up the node's network address each time a connection is made. This is useful, for example, in a cloud environment where the node addresses come and go out of DNS.

              NoCtldInAddrAny
                     Used to directly bind to the address that the node running the slurmctld resolves to, instead of binding messages to any address on the node, which is the default.

              NoInAddrAny
                     Used to directly bind to the address that the node resolves to, instead of binding messages to any address on the node, which is the default. This option is for all daemons/clients except for the slurmctld.
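
              For example, a cloud deployment with IPv6 nodes whose addresses come and go in DNS might use:

                     CommunicationParameters=EnableIPv6,NoAddrCache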

       CompleteWait
              The time to wait, in seconds, when any job is in the COMPLETING state before any additional jobs are scheduled. This is to attempt to keep jobs on nodes that were recently in use, with the goal of preventing fragmentation. If set to zero, pending jobs will be started as soon as possible. Since a COMPLETING job's resources are released for use by other jobs as soon as the Epilog completes on each individual node, this can result in very fragmented resource allocations. To provide jobs with the minimum response time, a value of zero is recommended (no waiting). To minimize fragmentation of resources, a value equal to KillWait plus two is recommended. In that case, setting KillWait to a small value may be beneficial. The default value of CompleteWait is zero seconds. The value may not exceed 65533.

              NOTE: Setting reduce_completing_frag affects the behavior of CompleteWait.

       ControlAddr
              Deprecated option, see SlurmctldHost.

       ControlMachine
              Deprecated option, see SlurmctldHost.

       CoreSpecPlugin
              Identifies the plugin to be used for enforcement of core specialization. A restart of the slurmd daemons is required for changes to this parameter to take effect. Acceptable values at present include:

              core_spec/cray_aries
                     used only for Cray systems

              core_spec/none
                     used for all other system types

       CpuFreqDef
              Default CPU frequency value or frequency governor to use when running a job step if it has not been explicitly set with the --cpu-freq option. Acceptable values at present include a numeric value (frequency in kilohertz) or one of the following governors:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor

              Performance  attempts to use the Performance CPU governor

              PowerSave    attempts to use the PowerSave CPU governor

              There is no default value. If unset, no attempt to set the governor is made if the --cpu-freq option has not been set.

       CpuFreqGovernors
              List of CPU frequency governors allowed to be set with the salloc, sbatch, or srun option --cpu-freq. Acceptable values at present include:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor (a default value)

              Performance  attempts to use the Performance CPU governor (a default value)

              PowerSave    attempts to use the PowerSave CPU governor

              SchedUtil    attempts to use the SchedUtil CPU governor

              UserSpace    attempts to use the UserSpace CPU governor (a default value)

              The default is OnDemand, Performance and UserSpace.
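
              As an illustration, a site defaulting steps to the Performance governor while still allowing users to choose a power-saving one might set:

                     CpuFreqDef=Performance
                     CpuFreqGovernors=OnDemand,Performance,PowerSave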

       CredType
              The cryptographic signature tool to be used in the creation of job step credentials. A restart of slurmctld is required for changes to this parameter to take effect. The default (and recommended) value is "cred/munge".
703
704 DebugFlags
705 Defines specific subsystems which should provide more detailed
706 event logging. Multiple subsystems can be specified with comma
707 separators. Most DebugFlags will result in verbose-level log‐
708 ging for the identified subsystems, and could impact perfor‐
709 mance.
710
711 Valid subsystems available include:
712
713 Accrue Accrue counters accounting details
714
715 Agent RPC agents (outgoing RPCs from Slurm daemons)
716
717 Backfill Backfill scheduler details
718
719 BackfillMap Backfill scheduler to log a very verbose map of
720 reserved resources through time. Combine with
721 Backfill for a verbose and complete view of the
722 backfill scheduler's work.
723
724 BurstBuffer Burst Buffer plugin
725
726 Cgroup Cgroup details
727
728 CPU_Bind CPU binding details for jobs and steps
729
730 CpuFrequency Cpu frequency details for jobs and steps using
731 the --cpu-freq option.
732
733 Data Generic data structure details.
734
735 Dependency Job dependency debug info
736
737 Elasticsearch Elasticsearch debug info
738
739 Energy AcctGatherEnergy debug info
740
741 ExtSensors External Sensors debug info
742
743 Federation Federation scheduling debug info
744
745 FrontEnd Front end node details
746
747 Gres Generic resource details
748
749 Hetjob Heterogeneous job details
750
751 Gang Gang scheduling details
752
753 JobAccountGather Common job account gathering details (not
754 plugin specific).
755
756 JobContainer Job container plugin details
757
758 License License management details
759
760 Network Network details. Warning: activating this flag
761 may cause logging of passwords, tokens or other
762 authentication credentials.
763
764 NetworkRaw Dump raw hex values of key Network communica‐
765 tions. Warning: This flag will cause very ver‐
766 bose logs and may cause logging of passwords,
767 tokens or other authentication credentials.
768
769 NodeFeatures Node Features plugin debug info
770
771 NO_CONF_HASH Do not log when the slurm.conf files differ be‐
772 tween Slurm daemons
773
774 Power Power management plugin and power save (sus‐
775 pend/resume programs) details
776
777 Priority Job prioritization
778
779 Profile AcctGatherProfile plugins details
780
781 Protocol Communication protocol details
782
783 Reservation Advanced reservations
784
785 Route Message forwarding debug info
786
787 Script Debug info regarding the process that runs
788 slurmctld scripts such as PrologSlurmctld and
789 EpilogSlurmctld
790
791 SelectType Resource selection plugin
792
793 Steps Slurmctld resource allocation for job steps
794
795 Switch Switch plugin
796
797 TimeCray Timing of Cray APIs
798
799 TraceJobs Trace jobs in slurmctld. It will print detailed
800                        job information including state, job ids and
801                        allocated node counts.
802
803 Triggers Slurmctld triggers
804
805 WorkQueue Work Queue details
806
807 DefCpuPerGPU
808 Default count of CPUs allocated per allocated GPU. This value is
809 used only if the job didn't specify --cpus-per-task and
810 --cpus-per-gpu.
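811
812              As a sketch, a site that wants each allocated GPU to carry two
813              CPUs by default (the count of 2 is purely illustrative) might
814              set:
815
816                   DefCpuPerGPU=2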
811
812 DefMemPerCPU
813 Default real memory size available per usable allocated CPU in
814 megabytes. Used to avoid over-subscribing memory and causing
815 paging. DefMemPerCPU would generally be used if individual pro‐
816 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
817 lectType=select/cons_tres). The default value is 0 (unlimited).
818 Also see DefMemPerGPU, DefMemPerNode and MaxMemPerCPU. DefMem‐
819 PerCPU, DefMemPerGPU and DefMemPerNode are mutually exclusive.
820
821
822 NOTE: This applies to usable allocated CPUs in a job allocation.
823 This is important when more than one thread per core is config‐
824 ured. If a job requests --threads-per-core with fewer threads
825 on a core than exist on the core (or --hint=nomultithread which
826 implies --threads-per-core=1), the job will be unable to use
827 those extra threads on the core and those threads will not be
828 included in the memory per CPU calculation. But if the job has
829 access to all threads on the core, those threads will be in‐
830 cluded in the memory per CPU calculation even if the job did not
831 explicitly request those threads.
832
833 In the following examples, each core has two threads.
834
835 In this first example, two tasks can run on separate hyper‐
836 threads in the same core because --threads-per-core is not used.
837 The third task uses both threads of the second core. The allo‐
838 cated memory per cpu includes all threads:
839
840 $ salloc -n3 --mem-per-cpu=100
841 salloc: Granted job allocation 17199
842 $ sacct -j $SLURM_JOB_ID -X -o jobid%7,reqtres%35,alloctres%35
843 JobID ReqTRES AllocTRES
844 ------- ----------------------------------- -----------------------------------
845 17199 billing=3,cpu=3,mem=300M,node=1 billing=4,cpu=4,mem=400M,node=1
846
847 In this second example, because of --threads-per-core=1, each
848 task is allocated an entire core but is only able to use one
849 thread per core. Allocated CPUs includes all threads on each
850 core. However, allocated memory per cpu includes only the usable
851 thread in each core.
852
853 $ salloc -n3 --mem-per-cpu=100 --threads-per-core=1
854 salloc: Granted job allocation 17200
855 $ sacct -j $SLURM_JOB_ID -X -o jobid%7,reqtres%35,alloctres%35
856 JobID ReqTRES AllocTRES
857 ------- ----------------------------------- -----------------------------------
858 17200 billing=3,cpu=3,mem=300M,node=1 billing=6,cpu=6,mem=300M,node=1
859
860 DefMemPerGPU
861 Default real memory size available per allocated GPU in
862 megabytes. The default value is 0 (unlimited). Also see
863 DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU and
864 DefMemPerNode are mutually exclusive.
865
866 DefMemPerNode
867 Default real memory size available per allocated node in
868 megabytes. Used to avoid over-subscribing memory and causing
869 paging. DefMemPerNode would generally be used if whole nodes
870 are allocated to jobs (SelectType=select/linear) and resources
871 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
872 The default value is 0 (unlimited). Also see DefMemPerCPU,
873 DefMemPerGPU and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
874 DefMemPerNode are mutually exclusive.
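875
876              Because the three memory defaults are mutually exclusive, a
877              configuration sets at most one of them. A sketch for a
878              whole-node allocation setup (the 64000 MB figure is
879              illustrative):
880
881                   SelectType=select/linear
882                   DefMemPerNode=64000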
875
876 DependencyParameters
877 Multiple options may be comma separated.
878
879 disable_remote_singleton
880 By default, when a federated job has a singleton depen‐
881 dency, each cluster in the federation must clear the sin‐
882 gleton dependency before the job's singleton dependency
883 is considered satisfied. Enabling this option means that
884 only the origin cluster must clear the singleton depen‐
885 dency. This option must be set in every cluster in the
886 federation.
887
888 kill_invalid_depend
889                      If a job has an invalid dependency that can never be
890                      satisfied, terminate it and set its state to
891                      JOB_CANCELLED. By default the job stays pending with
892                      reason DependencyNeverSatisfied.
893
894 max_depend_depth=#
895 Maximum number of jobs to test for a circular job depen‐
896 dency. Stop testing after this number of job dependencies
897 have been tested. The default value is 10 jobs.
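898
899              As a sketch, a site that wants invalid dependencies cancelled
900              and a deeper circular-dependency search (the depth of 20 is
901              illustrative) could combine the options:
902
903                   DependencyParameters=kill_invalid_depend,max_depend_depth=20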
898
899 DisableRootJobs
900 If set to "YES" then user root will be prevented from running
901 any jobs. The default value is "NO", meaning user root will be
902 able to execute jobs. DisableRootJobs may also be set by parti‐
903 tion.
904
905 EioTimeout
906 The number of seconds srun waits for slurmstepd to close the
907 TCP/IP connection used to relay data between the user applica‐
908 tion and srun when the user application terminates. The default
909 value is 60 seconds. May not exceed 65533.
910
911 EnforcePartLimits
912 If set to "ALL" then jobs which exceed a partition's size and/or
913              time limits will be rejected at submission time. If a job is sub‐
914              mitted to multiple partitions, the job must satisfy the limits
915 on all the requested partitions. If set to "NO" then the job
916 will be accepted and remain queued until the partition limits
917              are altered (Time and Node Limits). If set to "ANY" a job must
918 satisfy any of the requested partitions to be submitted. The de‐
919 fault value is "NO". NOTE: If set, then a job's QOS can not be
920 used to exceed partition limits. NOTE: The partition limits be‐
921 ing considered are its configured MaxMemPerCPU, MaxMemPerNode,
922 MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, Allow‐
923 Groups, AllowQOS, and QOS usage threshold.
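924
925              For instance, to reject over-limit submissions against every
926              requested partition:
927
928                   EnforcePartLimits=ALL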
924
925 Epilog Fully qualified pathname of a script to execute as user root on
926 every node when a user's job completes (e.g. "/usr/lo‐
927 cal/slurm/epilog"). A glob pattern (See glob (7)) may also be
928 used to run more than one epilog script (e.g. "/etc/slurm/epi‐
929 log.d/*"). The Epilog script or scripts may be used to purge
930 files, disable user login, etc. By default there is no epilog.
931 See Prolog and Epilog Scripts for more information.
932
933 EpilogMsgTime
934 The number of microseconds that the slurmctld daemon requires to
935 process an epilog completion message from the slurmd daemons.
936 This parameter can be used to prevent a burst of epilog comple‐
937 tion messages from being sent at the same time which should help
938 prevent lost messages and improve throughput for large jobs.
939 The default value is 2000 microseconds. For a 1000 node job,
940 this spreads the epilog completion messages out over two sec‐
941 onds.
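942
943              The spread can be checked with shell arithmetic; at the default
944              2000 microseconds per message, a 1000 node job's epilog
945              completion messages span two seconds:

```shell
# EpilogMsgTime spreads epilog completion messages; the total spread in
# seconds is the per-message interval (microseconds) * node count / 1e6.
echo $(( 2000 * 1000 / 1000000 ))   # prints 2
```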
942
943 EpilogSlurmctld
944 Fully qualified pathname of a program for the slurmctld to exe‐
945 cute upon termination of a job allocation (e.g. "/usr/lo‐
946 cal/slurm/epilog_controller"). The program executes as Slur‐
947 mUser, which gives it permission to drain nodes and requeue the
948 job if a failure occurs (See scontrol(1)). Exactly what the
949 program does and how it accomplishes this is completely at the
950 discretion of the system administrator. Information about the
951 job being initiated, its allocated nodes, etc. are passed to the
952 program using environment variables. See Prolog and Epilog
953 Scripts for more information.
954
955 ExtSensorsFreq
956 The external sensors plugin sampling interval. If ExtSen‐
957 sorsType=ext_sensors/none, this parameter is ignored. For all
958 other values of ExtSensorsType, this parameter is the number of
959 seconds between external sensors samples for hardware components
960              (nodes, switches, etc.). The default value is zero, which
961              disables external sensors sampling. Note: This parameter does
962 not affect external sensors data collection for jobs/steps.
963
964 ExtSensorsType
965 Identifies the plugin to be used for external sensors data col‐
966 lection. Slurmctld calls this plugin to collect external sen‐
967 sors data for jobs/steps and hardware components. In case of
968 node sharing between jobs the reported values per job/step
969 (through sstat or sacct) may not be accurate. See also "man
970 ext_sensors.conf".
971
972 Configurable values at present are:
973
974 ext_sensors/none No external sensors data is collected.
975
976 ext_sensors/rrd External sensors data is collected from the
977 RRD database.
978
979 FairShareDampeningFactor
980 Dampen the effect of exceeding a user or group's fair share of
981              allocated resources. Higher values provide greater ability
982 to differentiate between exceeding the fair share at high levels
983 (e.g. a value of 1 results in almost no difference between over‐
984 consumption by a factor of 10 and 100, while a value of 5 will
985 result in a significant difference in priority). The default
986 value is 1.
987
988 FederationParameters
989 Used to define federation options. Multiple options may be comma
990 separated.
991
992 fed_display
993 If set, then the client status commands (e.g. squeue,
994 sinfo, sprio, etc.) will display information in a feder‐
995 ated view by default. This option is functionally equiva‐
996 lent to using the --federation options on each command.
997 Use the client's --local option to override the federated
998 view and get a local view of the given cluster.
999
1000 FirstJobId
1001 The job id to be used for the first job submitted to Slurm. Job
1002              id values generated will be incremented by 1 for each subsequent
1003              job. Value must be larger than 0. The default value is 1. Also
1004              see MaxJobId.
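1005
1006              For example, to reserve low job ids for administrative use by
1007              starting numbering at 1000 (an illustrative choice):
1008
1009                   FirstJobId=1000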
1005
1006 GetEnvTimeout
1007 Controls how long the job should wait (in seconds) to load the
1008 user's environment before attempting to load it from a cache
1009 file. Applies when the salloc or sbatch --get-user-env option
1010 is used. If set to 0 then always load the user's environment
1011 from the cache file. The default value is 2 seconds.
1012
1013 GresTypes
1014 A comma-delimited list of generic resources to be managed (e.g.
1015 GresTypes=gpu,mps). These resources may have an associated GRES
1016 plugin of the same name providing additional functionality. No
1017 generic resources are managed by default. Ensure this parameter
1018 is consistent across all nodes in the cluster for proper opera‐
1019 tion. A restart of slurmctld and the slurmd daemons is required
1020 for this to take effect.
1021
1022 GroupUpdateForce
1023 If set to a non-zero value, then information about which users
1024 are members of groups allowed to use a partition will be updated
1025 periodically, even when there have been no changes to the
1026 /etc/group file. If set to zero, group member information will
1027 be updated only after the /etc/group file is updated. The de‐
1028 fault value is 1. Also see the GroupUpdateTime parameter.
1029
1030 GroupUpdateTime
1031 Controls how frequently information about which users are mem‐
1032 bers of groups allowed to use a partition will be updated, and
1033 how long user group membership lists will be cached. The time
1034 interval is given in seconds with a default value of 600 sec‐
1035 onds. A value of zero will prevent periodic updating of group
1036 membership information. Also see the GroupUpdateForce parame‐
1037 ter.
1038
1039       GpuFreqDef=[<type>=]<value>[,<type>=<value>]
1040 Default GPU frequency to use when running a job step if it has
1041 not been explicitly set using the --gpu-freq option. This op‐
1042 tion can be used to independently configure the GPU and its mem‐
1043 ory frequencies. Defaults to "high,memory=high". After the job
1044 is completed, the frequencies of all affected GPUs will be reset
1045 to the highest possible values. In some cases, system power
1046 caps may override the requested values. The field type can be
1047 "memory". If type is not specified, the GPU frequency is im‐
1048 plied. The value field can either be "low", "medium", "high",
1049 "highm1" or a numeric value in megahertz (MHz). If the speci‐
1050 fied numeric value is not possible, a value as close as possible
1051 will be used. See below for definition of the values. Examples
1052              of use include "GpuFreqDef=medium,memory=high" and "GpuFre‐
1053              qDef=450".
1054
1055 Supported value definitions:
1056
1057 low the lowest available frequency.
1058
1059 medium attempts to set a frequency in the middle of the
1060 available range.
1061
1062 high the highest available frequency.
1063
1064 highm1 (high minus one) will select the next highest avail‐
1065 able frequency.
1066
1067 HealthCheckInterval
1068 The interval in seconds between executions of HealthCheckPro‐
1069 gram. The default value is zero, which disables execution.
1070
1071 HealthCheckNodeState
1072 Identify what node states should execute the HealthCheckProgram.
1073 Multiple state values may be specified with a comma separator.
1074 The default value is ANY to execute on nodes in any state.
1075
1076 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
1077 cated).
1078
1079 ANY Run on nodes in any state.
1080
1081 CYCLE Rather than running the health check program on all
1082 nodes at the same time, cycle through running on all
1083 compute nodes through the course of the HealthCheck‐
1084 Interval. May be combined with the various node
1085 state options.
1086
1087 IDLE Run on nodes in the IDLE state.
1088
1089 MIXED Run on nodes in the MIXED state (some CPUs idle and
1090 other CPUs allocated).
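1091
1092              A sketch combining these options (the five-minute interval and
1093              script path are illustrative):
1094
1095                   HealthCheckInterval=300
1096                   HealthCheckNodeState=IDLE,CYCLE
1097                   HealthCheckProgram=/usr/sbin/node_health.sh  # illustrative path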
1091
1092 HealthCheckProgram
1093 Fully qualified pathname of a script to execute as user root pe‐
1094 riodically on all compute nodes that are not in the NOT_RESPOND‐
1095 ING state. This program may be used to verify the node is fully
1096 operational and DRAIN the node or send email if a problem is de‐
1097 tected. Any action to be taken must be explicitly performed by
1098 the program (e.g. execute "scontrol update NodeName=foo
1099 State=drain Reason=tmp_file_system_full" to drain a node). The
1100 execution interval is controlled using the HealthCheckInterval
1101 parameter. Note that the HealthCheckProgram will be executed at
1102 the same time on all nodes to minimize its impact upon parallel
1103 programs. This program will be killed if it does not terminate
1104 normally within 60 seconds. This program will also be executed
1105 when the slurmd daemon is first started and before it registers
1106 with the slurmctld daemon. By default, no program will be exe‐
1107 cuted.
1108
1109 InactiveLimit
1110 The interval, in seconds, after which a non-responsive job allo‐
1111 cation command (e.g. srun or salloc) will result in the job be‐
1112 ing terminated. If the node on which the command is executed
1113 fails or the command abnormally terminates, this will terminate
1114 its job allocation. This option has no effect upon batch jobs.
1115 When setting a value, take into consideration that a debugger
1116 using srun to launch an application may leave the srun command
1117 in a stopped state for extended periods of time. This limit is
1118 ignored for jobs running in partitions with the RootOnly flag
1119 set (the scheduler running as root will be responsible for the
1120 job). The default value is unlimited (zero) and may not exceed
1121 65533 seconds.
1122
1123 InteractiveStepOptions
1124 When LaunchParameters=use_interactive_step is enabled, launching
1125 salloc will automatically start an srun process with Interac‐
1126 tiveStepOptions to launch a terminal on a node in the job allo‐
1127 cation. The default value is "--interactive --preserve-env
1128 --pty $SHELL". The "--interactive" option is intentionally not
1129 documented in the srun man page. It is meant only to be used in
1130 InteractiveStepOptions in order to create an "interactive step"
1131 that will not consume resources so that other steps may run in
1132 parallel with the interactive step.
1133
1134 JobAcctGatherType
1135 The job accounting mechanism type. Acceptable values at present
1136 include "jobacct_gather/linux" (for Linux systems),
1137 "jobacct_gather/cgroup" and "jobacct_gather/none" (no accounting
1138 data collected). The default value is "jobacct_gather/none".
1139 "jobacct_gather/cgroup" is a plugin for the Linux operating sys‐
1140 tem that uses cgroups to collect accounting statistics. The
1141 plugin collects the following statistics: From the cgroup memory
1142 subsystem: memory.usage_in_bytes (reported as 'pages') and rss
1143 from memory.stat (reported as 'rss'). From the cgroup cpuacct
1144 subsystem: user cpu time and system cpu time. No value is pro‐
1145 vided by cgroups for virtual memory size ('vsize'). In order to
1146              use the sstat tool, "jobacct_gather/linux" or
1147              "jobacct_gather/cgroup" must be configured.
1148 NOTE: Changing this configuration parameter changes the contents
1149 of the messages between Slurm daemons. Any previously running
1150 job steps are managed by a slurmstepd daemon that will persist
1151 through the lifetime of that job step and not change its commu‐
1152 nication protocol. Only change this configuration parameter when
1153 there are no running job steps.
1154
1155 JobAcctGatherFrequency
1156              The job accounting and profiling sampling intervals. The sup‐
1157              ported format is as follows:
1158
1159 JobAcctGatherFrequency=<datatype>=<interval>
1160 where <datatype>=<interval> specifies the task sam‐
1161 pling interval for the jobacct_gather plugin or a
1162 sampling interval for a profiling type by the
1163 acct_gather_profile plugin. Multiple, comma-sepa‐
1164 rated <datatype>=<interval> intervals may be speci‐
1165 fied. Supported datatypes are as follows:
1166
1167 task=<interval>
1168 where <interval> is the task sampling inter‐
1169 val in seconds for the jobacct_gather plugins
1170 and for task profiling by the
1171 acct_gather_profile plugin.
1172
1173 energy=<interval>
1174 where <interval> is the sampling interval in
1175 seconds for energy profiling using the
1176 acct_gather_energy plugin
1177
1178 network=<interval>
1179 where <interval> is the sampling interval in
1180 seconds for infiniband profiling using the
1181 acct_gather_interconnect plugin.
1182
1183 filesystem=<interval>
1184 where <interval> is the sampling interval in
1185 seconds for filesystem profiling using the
1186 acct_gather_filesystem plugin.
1187
1188
1189              The default value for the task sampling interval is 30
1190              seconds. The default value for all other intervals is 0.
1192 An interval of 0 disables sampling of the specified type. If
1193 the task sampling interval is 0, accounting information is col‐
1194 lected only at job termination, which reduces Slurm interference
1195 with the job, but also means that the statistics about a job
1196              don't reflect the average or maximum of several samples through‐
1197              out the life of the job, but just show the information collected
1198 in the single sample.
1199 Smaller (non-zero) values have a greater impact upon job perfor‐
1200 mance, but a value of 30 seconds is not likely to be noticeable
1201 for applications having less than 10,000 tasks.
1202 Users can independently override each interval on a per job ba‐
1203 sis using the --acctg-freq option when submitting the job.
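1204
1205              Putting the pieces together, a configuration that samples task
1206              statistics every 30 seconds and energy every 60, while leaving
1207              network and filesystem profiling off (their 0 default), would
1208              be:
1209
1210                   JobAcctGatherFrequency=task=30,energy=60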
1204
1205 JobAcctGatherParams
1206 Arbitrary parameters for the job account gather plugin. Accept‐
1207 able values at present include:
1208
1209 NoShared Exclude shared memory from RSS. This option
1210 cannot be used with UsePSS.
1211
1212 UsePss Use PSS value instead of RSS to calculate
1213 real usage of memory. The PSS value will be
1214 saved as RSS. This option cannot be used
1215 with NoShared.
1216
1217              OverMemoryKill   Kill processes that are detected using
1218                               more memory than requested by steps ev‐
1219                               ery time accounting information is gathered
1220 by the JobAcctGather plugin. This parameter
1221 should be used with caution because a job
1222 exceeding its memory allocation may affect
1223 other processes and/or machine health.
1224
1225 NOTE: If available, it is recommended to
1226 limit memory by enabling task/cgroup as a
1227 TaskPlugin and making use of Constrain‐
1228 RAMSpace=yes in the cgroup.conf instead of
1229 using this JobAcctGather mechanism for mem‐
1230 ory enforcement. Using JobAcctGather is
1231 polling based and there is a delay before a
1232 job is killed, which could lead to system
1233 Out of Memory events.
1234
1235 NOTE: When using OverMemoryKill, if the com‐
1236 bined memory used by all the processes in a
1237 step exceeds the memory limit, the entire
1238 step will be killed/cancelled by the JobAc‐
1239 ctGather plugin. This differs from the be‐
1240 havior when using ConstrainRAMSpace, where
1241 processes in the step will be killed, but
1242 the step will be left active, possibly with
1243 other processes left running.
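1244
1245              For example, to account shared memory via PSS rather than RSS:
1246
1247                   JobAcctGatherParams=UsePss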
1244
1245 JobCompHost
1246 The name of the machine hosting the job completion database.
1247 Only used for database type storage plugins, ignored otherwise.
1248
1249 JobCompLoc
1250 The fully qualified file name where job completion records are
1251 written when the JobCompType is "jobcomp/filetxt" or the data‐
1252 base where job completion records are stored when the JobComp‐
1253 Type is a database, or a complete URL endpoint with format
1254              <host>:<port>/<target>/_doc when JobCompType is "jobcomp/elas‐
1255              ticsearch" (e.g. "localhost:9200/slurm/_doc"). NOTE: More
1256 information is available at the Slurm web site
1257 <https://slurm.schedmd.com/elasticsearch.html>.
1258
1259 JobCompParams
1260 Pass arbitrary text string to job completion plugin. Also see
1261 JobCompType.
1262
1263 JobCompPass
1264 The password used to gain access to the database to store the
1265 job completion data. Only used for database type storage plug‐
1266 ins, ignored otherwise.
1267
1268 JobCompPort
1269 The listening port of the job completion database server. Only
1270 used for database type storage plugins, ignored otherwise.
1271
1272 JobCompType
1273 The job completion logging mechanism type. Acceptable values at
1274 present include:
1275
1276 jobcomp/none
1277 Upon job completion, a record of the job is purged from
1278 the system. If using the accounting infrastructure this
1279 plugin may not be of interest since some of the informa‐
1280 tion is redundant.
1281
1282 jobcomp/elasticsearch
1283 Upon job completion, a record of the job should be writ‐
1284 ten to an Elasticsearch server, specified by the JobCom‐
1285 pLoc parameter.
1286 NOTE: More information is available at the Slurm web site
1287 ( https://slurm.schedmd.com/elasticsearch.html ).
1288
1289 jobcomp/filetxt
1290 Upon job completion, a record of the job should be writ‐
1291 ten to a text file, specified by the JobCompLoc parame‐
1292 ter.
1293
1294 jobcomp/lua
1295 Upon job completion, a record of the job should be pro‐
1296 cessed by the jobcomp.lua script, located in the default
1297 script directory (typically the subdirectory etc of the
1298                     installation directory).
1299
1300 jobcomp/mysql
1301 Upon job completion, a record of the job should be writ‐
1302 ten to a MySQL or MariaDB database, specified by the Job‐
1303 CompLoc parameter.
1304
1305 jobcomp/script
1306 Upon job completion, a script specified by the JobCompLoc
1307 parameter is to be executed with environment variables
1308 providing the job information.
1309
1310 JobCompUser
1311 The user account for accessing the job completion database.
1312 Only used for database type storage plugins, ignored otherwise.
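1313
1314              A sketch of a database-backed job completion setup (the host,
1315              database and user names are illustrative):
1316
1317                   JobCompType=jobcomp/mysql
1318                   JobCompHost=dbserver
1319                   JobCompLoc=slurm_jobcomp_db
1320                   JobCompUser=slurm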
1313
1314 JobContainerType
1315 Identifies the plugin to be used for job tracking. A restart of
1316 slurmctld is required for changes to this parameter to take ef‐
1317 fect. NOTE: The JobContainerType applies to a job allocation,
1318 while ProctrackType applies to job steps. Acceptable values at
1319 present include:
1320
1321 job_container/cncu Used only for Cray systems (CNCU = Compute
1322 Node Clean Up)
1323
1324 job_container/none Used for all other system types
1325
1326 job_container/tmpfs Used to create a private namespace on the
1327 filesystem for jobs, which houses temporary
1328 file systems (/tmp and /dev/shm) for each
1329 job. 'PrologFlags=Contain' must be set to
1330 use this plugin.
1331
1332 JobFileAppend
1333 This option controls what to do if a job's output or error file
1334 exist when the job is started. If JobFileAppend is set to a
1335 value of 1, then append to the existing file. By default, any
1336 existing file is truncated.
1337
1338 JobRequeue
1339 This option controls the default ability for batch jobs to be
1340 requeued. Jobs may be requeued explicitly by a system adminis‐
1341 trator, after node failure, or upon preemption by a higher pri‐
1342 ority job. If JobRequeue is set to a value of 1, then batch
1343 jobs may be requeued unless explicitly disabled by the user. If
1344 JobRequeue is set to a value of 0, then batch jobs will not be
1345 requeued unless explicitly enabled by the user. Use the sbatch
1346 --no-requeue or --requeue option to change the default behavior
1347 for individual jobs. The default value is 1.
1348
1349 JobSubmitPlugins
1350 These are intended to be site-specific plugins which can be used
1351 to set default job parameters and/or logging events. Slurm can
1352 be configured to use multiple job_submit plugins if desired,
1353 which must be specified as a comma-delimited list and will be
1354 executed in the order listed.
1355 e.g. for multiple job_submit plugin configuration:
1356 JobSubmitPlugins=lua,require_timelimit
1357 Take a look at <https://slurm.schedmd.com/job_submit_plug‐
1358 ins.html> for further plugin implementation details. No job sub‐
1359 mission plugins are used by default. Currently available plug‐
1360 ins are:
1361
1362 all_partitions Set default partition to all partitions
1363 on the cluster.
1364
1365 defaults Set default values for job submission or
1366 modify requests.
1367
1368 logging Log select job submission and modifica‐
1369 tion parameters.
1370
1371 lua Execute a Lua script implementing site's
1372 own job_submit logic. Only one Lua
1373 script will be executed. It must be
1374 named "job_submit.lua" and must be lo‐
1375 cated in the default configuration di‐
1376 rectory (typically the subdirectory
1377 "etc" of the installation directory).
1378 Sample Lua scripts can be found with the
1379 Slurm distribution, in the directory
1380 contribs/lua. Slurmctld will fatal on
1381 startup if the configured lua script is
1382 invalid. Slurm will try to load the
1383 script for each job submission. If the
1384 script is broken or removed while slurm‐
1385 ctld is running, Slurm will fallback to
1386 the previous working version of the
1387 script.
1388
1389 partition Set a job's default partition based upon
1390 job submission parameters and available
1391 partitions.
1392
1393 pbs Translate PBS job submission options to
1394 Slurm equivalent (if possible).
1395
1396 require_timelimit Force job submissions to specify a time‐
1397 limit.
1398
1399 NOTE: For examples of use see the Slurm code in "src/plug‐
1400 ins/job_submit" and "contribs/lua/job_submit*.lua" then modify
1401 the code to satisfy your needs.
1402
1403 KillOnBadExit
1404              If set to 1, a step will be terminated immediately if any task
1405              crashes or aborts, as indicated by a non-zero exit code. With
1406              the default value of 0, if one of the processes crashes or
1407              aborts, the other processes will continue to run while the
1408              crashed or aborted process waits. The user can override this
1409 configuration parameter by using srun's -K, --kill-on-bad-exit.
1410
1411 KillWait
1412 The interval, in seconds, given to a job's processes between the
1413 SIGTERM and SIGKILL signals upon reaching its time limit. If
1414 the job fails to terminate gracefully in the interval specified,
1415 it will be forcibly terminated. The default value is 30 sec‐
1416 onds. The value may not exceed 65533.
1417
1418 NodeFeaturesPlugins
1419 Identifies the plugins to be used for support of node features
1420              which can change through time. For example, a node might be
1421              booted with various BIOS settings. This is supported through
1422 the use of a node's active_features and available_features in‐
1423 formation. Acceptable values at present include:
1424
1425 node_features/knl_cray
1426 Used only for Intel Knights Landing processors (KNL) on
1427 Cray systems.
1428
1429 node_features/knl_generic
1430 Used for Intel Knights Landing processors (KNL) on a
1431 generic Linux system.
1432
1433 node_features/helpers
1434 Used to report and modify features on nodes using arbi‐
1435 trary scripts or programs.
1436
1437 LaunchParameters
1438 Identifies options to the job launch plugin. Acceptable values
1439 include:
1440
1441 batch_step_set_cpu_freq Set the cpu frequency for the batch step
1442                                       from the given --cpu-freq option, or the
1443                                       slurm.conf CpuFreqDef setting. By default only
1444 steps started with srun will utilize the
1445 cpu freq setting options.
1446
1447 NOTE: If you are using srun to launch
1448                                       your steps inside a batch script (ad‐
1449                                       vised), this option will create a situa‐
1450                                       tion where you may have multiple agents
1451                                       setting the cpu_freq, as the batch step
1452                                       usually runs on the same resources as
1453                                       the one or more steps the sruns in the
1454                                       script will create.
1455
1456 cray_net_exclusive Allow jobs on a Cray Native cluster ex‐
1457 clusive access to network resources.
1458 This should only be set on clusters pro‐
1459 viding exclusive access to each node to
1460 a single job at once, and not using par‐
1461 allel steps within the job, otherwise
1462 resources on the node can be oversub‐
1463 scribed.
1464
1465 enable_nss_slurm Permits passwd and group resolution for
1466 a job to be serviced by slurmstepd
1467 rather than requiring a lookup from a
1468 network based service. See
1469 https://slurm.schedmd.com/nss_slurm.html
1470 for more information.
1471
1472 lustre_no_flush If set on a Cray Native cluster, then do
1473 not flush the Lustre cache on job step
1474 completion. This setting will only take
1475 effect after reconfiguring, and will
1476 only take effect for newly launched
1477 jobs.
1478
1479 mem_sort Sort NUMA memory at step start. User can
1480 override this default with
1481 SLURM_MEM_BIND environment variable or
1482 --mem-bind=nosort command line option.
1483
1484 mpir_use_nodeaddr When launching tasks Slurm creates en‐
1485 tries in MPIR_proctable that are used by
1486                                       parallel debuggers, profilers, and re‐
1487                                       lated tools to attach to running
1488                                       processes. By default the MPIR_proctable
1489                                       entries contain MPIR_procdesc structures
1490                                       where the host_name is set to NodeName.
1491                                       If this option is specified,
1492 NodeAddr will be used in this context
1493 instead.
1494
1495 disable_send_gids By default, the slurmctld will look up
1496 and send the user_name and extended gids
1497 for a job, rather than independently on
1498 each node as part of each task launch.
1499 This helps mitigate issues around name
1500 service scalability when launching jobs
1501 involving many nodes. Using this option
1502 will disable this functionality. This
1503 option is ignored if enable_nss_slurm is
1504 specified.
1505
1506 slurmstepd_memlock Lock the slurmstepd process's current
1507 memory in RAM.
1508
1509 slurmstepd_memlock_all Lock the slurmstepd process's current
1510 and future memory in RAM.
1511
1512 test_exec Have srun verify existence of the exe‐
1513 cutable program along with user execute
1514 permission on the node where srun was
1515 called before attempting to launch it on
1516 nodes in the step.
1517
1518 use_interactive_step Have salloc use the Interactive Step to
1519 launch a shell on an allocated compute
1520 node rather than locally to wherever
1521 salloc was invoked. This is accomplished
1522 by launching the srun command with In‐
1523 teractiveStepOptions as options.
1524
1525 This does not affect salloc called with
1526 a command as an argument. These jobs
1527 will continue to be executed as the
1528 calling user on the calling host.
1529
1530 LaunchType
1531 Identifies the mechanism to be used to launch application tasks.
1532 Acceptable values include:
1533
1534 launch/slurm
1535 The default value.
1536
1537 Licenses
1538 Specification of licenses (or other resources available on all
1539 nodes of the cluster) which can be allocated to jobs. License
1540 names can optionally be followed by a colon and count with a de‐
1541 fault count of one. Multiple license names should be comma sep‐
1542 arated (e.g. "Licenses=foo:4,bar"). Note that Slurm prevents
1543 jobs from being scheduled if their required license specifica‐
1544 tion is not available. Slurm does not prevent jobs from using
1545 licenses that are not explicitly listed in the job submission
1546 specification.
1547
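            As an illustration (hypothetical license names), a site with 30
            floating "fluent" seats and a single "matlab" license could
            configure:

                 Licenses=fluent:30,matlab

            Jobs would then request the licenses at submission time, e.g.
            "sbatch --licenses=fluent:2".
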
1548 LogTimeFormat
1549 Format of the timestamp in slurmctld and slurmd log files. Ac‐
1550 cepted values are "iso8601", "iso8601_ms", "rfc5424",
1551 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1552 ing in "_ms" differ from the ones without in that fractional
1553 seconds with millisecond precision are printed. The default
1554 value is "iso8601_ms". The "rfc5424" formats are the same as the
1555 "iso8601" formats except that the timezone value is also shown.
1556 The "clock" format shows a timestamp in microseconds retrieved
1557 with the C standard clock() function. The "short" format is a
1558 short date and time format. The "thread_id" format shows the
1559 timestamp in the C standard ctime() function form without the
1560 year but including the microseconds, the daemon's process ID and
1561 the current thread name and ID.
1562
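            For example, to log timestamps with millisecond precision and
            the timezone included:

                 LogTimeFormat=rfc5424_ms
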
1563 MailDomain
1564 Domain name to qualify usernames if email address is not explic‐
1565 itly given with the "--mail-user" option. If unset, the local
1566 MTA will need to qualify local address itself. Changes to Mail‐
1567 Domain will only affect new jobs.
1568
1569 MailProg
1570 Fully qualified pathname to the program used to send email per
1571 user request. The default value is "/bin/mail" (or
1572 "/usr/bin/mail" if "/bin/mail" does not exist but
1573 "/usr/bin/mail" does exist). The program is called with argu‐
1574 ments suitable for the default mail command, however additional
1575 information about the job is passed in the form of environment
1576 variables.
1577
1578            The variables passed are the same as those passed to Pro‐
1579            logSlurmctld and EpilogSlurmctld, with additional variables
1580            available in the following contexts:
1581
1582 ALL
1583
1584 SLURM_JOB_STATE
1585 The base state of the job when the MailProg is
1586 called.
1587
1588 SLURM_JOB_MAIL_TYPE
1589 The mail type triggering the mail.
1590
1591 BEGIN
1592
1593                     SLURM_JOB_QUEUED_TIME
1594 The amount of time the job was queued.
1595
1596 END, FAIL, REQUEUE, TIME_LIMIT_*
1597
1598 SLURM_JOB_RUN_TIME
1599 The amount of time the job ran for.
1600
1601 END, FAIL
1602
1603 SLURM_JOB_EXIT_CODE_MAX
1604 Job's exit code or highest exit code for an array
1605 job.
1606
1607 SLURM_JOB_EXIT_CODE_MIN
1608 Job's minimum exit code for an array job.
1609
1610 SLURM_JOB_TERM_SIGNAL_MAX
1611 Job's highest signal for an array job.
1612
1613 STAGE_OUT
1614
1615 SLURM_JOB_STAGE_OUT_TIME
1616 Job's staging out time.
1617
1618 MaxArraySize
1619 The maximum job array task index value will be one less than
1620 MaxArraySize to allow for an index value of zero. Configure
1621 MaxArraySize to 0 in order to disable job array use. The value
1622 may not exceed 4000001. The value of MaxJobCount should be much
1623 larger than MaxArraySize. The default value is 1001. See also
1624 max_array_tasks in SchedulerParameters.
1625
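            With the default of 1001, the highest usable task index is
            1000, so a submission such as "sbatch --array=0-1000" is
            accepted while "--array=0-1001" is rejected. A sketch raising
            the limit (example value):

                 MaxArraySize=10001
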
1626 MaxDBDMsgs
1627            When communication to the SlurmDBD is not possible, the slurmctld
1628            will queue messages meant to be processed when the SlurmDBD is
1629 available again. In order to avoid running out of memory the
1630 slurmctld will only queue so many messages. The default value is
1631 10000, or MaxJobCount * 2 + Node Count * 4, whichever is
1632 greater. The value can not be less than 10000.
1633
1634 MaxJobCount
1635 The maximum number of jobs slurmctld can have in memory at one
1636 time. Combine with MinJobAge to ensure the slurmctld daemon
1637 does not exhaust its memory or other resources. Once this limit
1638 is reached, requests to submit additional jobs will fail. The
1639 default value is 10000 jobs. NOTE: Each task of a job array
1640 counts as one job even though they will not occupy separate job
1641 records until modified or initiated. Performance can suffer
1642            with more than a few hundred thousand jobs. Setting a MaxSub‐
1643            mitJobs limit per user is generally valuable to prevent a single
1644            user from filling the system with jobs. This is accomplished using
1645 Slurm's database and configuring enforcement of resource limits.
1646 A restart of slurmctld is required for changes to this parameter
1647 to take effect.
1648
1649 MaxJobId
1650            The maximum job id to be used for jobs submitted to Slurm with‐
1651            out a specific requested value. Job ids are unsigned 32-bit
1652            integers with the first 26 bits reserved for local job ids and the
1653 remaining 6 bits reserved for a cluster id to identify a feder‐
1654 ated job's origin. The maximum allowed local job id is
1655 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1656 (0x03ff0000). MaxJobId only applies to the local job id and not
1657 the federated job id. Job id values generated will be incre‐
1658 mented by 1 for each subsequent job. Once MaxJobId is reached,
1659 the next job will be assigned FirstJobId. Federated jobs will
1660 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1661 bId.
1662
1663 MaxMemPerCPU
1664 Maximum real memory size available per allocated CPU in
1665 megabytes. Used to avoid over-subscribing memory and causing
1666 paging. MaxMemPerCPU would generally be used if individual pro‐
1667 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
1668 lectType=select/cons_tres). The default value is 0 (unlimited).
1669 Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode. MaxMem‐
1670 PerCPU and MaxMemPerNode are mutually exclusive.
1671
1672            NOTE: If a job specifies a memory per CPU limit that exceeds
1673            this system limit, that job's count of CPUs per task will be
1674            increased automatically. This may result in the job failing due
1675            to CPU count limits. This auto-adjustment is a best-effort
1676            feature and optimal assignment is not guaranteed due to the
1677            possibility of heterogeneous configurations and
1678            multi-partition/qos jobs. If this is a concern, it is advised
1679            to use a job submit Lua plugin instead to enforce
1680            auto-adjustments to your specific needs.
1681
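            As an illustration of the auto-adjustment described above
            (example values): with

                 MaxMemPerCPU=4096

            a job submitted with "--mem-per-cpu=8192" would have its CPUs
            per task increased to 2 so that the per-CPU limit is respected.
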
1682 MaxMemPerNode
1683 Maximum real memory size available per allocated node in
1684 megabytes. Used to avoid over-subscribing memory and causing
1685 paging. MaxMemPerNode would generally be used if whole nodes
1686 are allocated to jobs (SelectType=select/linear) and resources
1687 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1688 The default value is 0 (unlimited). Also see DefMemPerNode and
1689 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
1690 clusive.
1691
1692 MaxNodeCount
1693 Maximum count of nodes which may exist in the controller. By de‐
1694 fault MaxNodeCount will be set to the number of nodes found in
1695 the slurm.conf. MaxNodeCount will be ignored if less than the
1696 number of nodes found in the slurm.conf. Increase MaxNodeCount
1697 to accommodate dynamically created nodes with dynamic node reg‐
1698 istrations and nodes created with scontrol. The slurmctld daemon
1699 must be restarted for changes to this parameter to take effect.
1700
1701 MaxStepCount
1702 The maximum number of steps that any job can initiate. This pa‐
1703 rameter is intended to limit the effect of bad batch scripts.
1704 The default value is 40000 steps.
1705
1706 MaxTasksPerNode
1707 Maximum number of tasks Slurm will allow a job step to spawn on
1708 a single node. The default MaxTasksPerNode is 512. May not ex‐
1709 ceed 65533.
1710
1711 MCSParameters
1712 MCS = Multi-Category Security MCS Plugin Parameters. The sup‐
1713 ported parameters are specific to the MCSPlugin. Changes to
1714 this value take effect when the Slurm daemons are reconfigured.
1715 More information about MCS is available here
1716 <https://slurm.schedmd.com/mcs.html>.
1717
1718 MCSPlugin
1719 MCS = Multi-Category Security : associate a security label to
1720 jobs and ensure that nodes can only be shared among jobs using
1721 the same security label. Acceptable values include:
1722
1723 mcs/none is the default value. No security label associated
1724 with jobs, no particular security restriction when
1725 sharing nodes among jobs.
1726
1727 mcs/account only users with the same account can share the nodes
1728 (requires enabling of accounting).
1729
1730 mcs/group only users with the same group can share the nodes.
1731
1732 mcs/user a node cannot be shared with other users.
1733
1734 MessageTimeout
1735 Time permitted for a round-trip communication to complete in
1736 seconds. Default value is 10 seconds. For systems with shared
1737 nodes, the slurmd daemon could be paged out and necessitate
1738 higher values.
1739
1740 MinJobAge
1741 The minimum age of a completed job before its record is cleared
1742 from the list of jobs slurmctld keeps in memory. Combine with
1743 MaxJobCount to ensure the slurmctld daemon does not exhaust its
1744 memory or other resources. The default value is 300 seconds. A
1745 value of zero prevents any job record purging. Jobs are not
1746 purged during a backfill cycle, so it can take longer than Min‐
1747 JobAge seconds to purge a job if using the backfill scheduling
1748            plugin. In order to eliminate some possible race conditions,
1749            the recommended minimum non-zero value for MinJobAge is 2.
1750
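            MaxJobCount and MinJobAge are typically tuned together; a
            sketch with example values:

                 MaxJobCount=50000
                 MinJobAge=120
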
1751 MpiDefault
1752 Identifies the default type of MPI to be used. Srun may over‐
1753 ride this configuration parameter in any case. Currently sup‐
1754 ported versions include: pmi2, pmix, and none (default, which
1755 works for many other versions of MPI). More information about
1756 MPI use is available here
1757 <https://slurm.schedmd.com/mpi_guide.html>.
1758
1759 MpiParams
1760            MPI parameters. Used to identify the ports used by Cray's
1761            native PMI. The format to identify a range of communication
1762            ports is "ports=12000-12999".
1763
1764 OverTimeLimit
1765 Number of minutes by which a job can exceed its time limit be‐
1766 fore being canceled. Normally a job's time limit is treated as
1767 a hard limit and the job will be killed upon reaching that
1768 limit. Configuring OverTimeLimit will result in the job's time
1769 limit being treated like a soft limit. Adding the OverTimeLimit
1770 value to the soft time limit provides a hard time limit, at
1771            which point the job is canceled. This is particularly useful
1772            for backfill scheduling, which bases its decisions upon each
1773            job's soft time limit. The default value is zero. May not
1774            exceed 65533 minutes. A value of "UNLIMITED" is also supported.
1775
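            For example, to let jobs run up to ten minutes past their soft
            time limit before cancellation:

                 OverTimeLimit=10
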
1776 PluginDir
1777 Identifies the places in which to look for Slurm plugins. This
1778 is a colon-separated list of directories, like the PATH environ‐
1779 ment variable. The default value is the prefix given at config‐
1780 ure time + "/lib/slurm". A restart of slurmctld and the slurmd
1781 daemons is required for changes to this parameter to take ef‐
1782 fect.
1783
1784 PlugStackConfig
1785 Location of the config file for Slurm stackable plugins that use
1786 the Stackable Plugin Architecture for Node job (K)control
1787 (SPANK). This provides support for a highly configurable set of
1788 plugins to be called before and/or after execution of each task
1789 spawned as part of a user's job step. Default location is
1790 "plugstack.conf" in the same directory as the system slurm.conf.
1791 For more information on SPANK plugins, see the spank(8) manual.
1792
1793 PowerParameters
1794 System power management parameters. The supported parameters
1795 are specific to the PowerPlugin. Changes to this value take ef‐
1796 fect when the Slurm daemons are reconfigured. More information
1797 about system power management is available here
1798            <https://slurm.schedmd.com/power_mgmt.html>. Options currently
1799            supported by the available plugins are listed below.
1800
1801 balance_interval=#
1802 Specifies the time interval, in seconds, between attempts
1803 to rebalance power caps across the nodes. This also con‐
1804 trols the frequency at which Slurm attempts to collect
1805 current power consumption data (old data may be used un‐
1806 til new data is available from the underlying infrastruc‐
1807 ture and values below 10 seconds are not recommended for
1808 Cray systems). The default value is 30 seconds. Sup‐
1809 ported by the power/cray_aries plugin.
1810
1811 capmc_path=
1812 Specifies the absolute path of the capmc command. The
1813 default value is "/opt/cray/capmc/default/bin/capmc".
1814 Supported by the power/cray_aries plugin.
1815
1816 cap_watts=#
1817 Specifies the total power limit to be established across
1818 all compute nodes managed by Slurm. A value of 0 sets
1819 every compute node to have an unlimited cap. The default
1820 value is 0. Supported by the power/cray_aries plugin.
1821
1822 decrease_rate=#
1823 Specifies the maximum rate of change in the power cap for
1824 a node where the actual power usage is below the power
1825 cap by an amount greater than lower_threshold (see be‐
1826 low). Value represents a percentage of the difference
1827 between a node's minimum and maximum power consumption.
1828 The default value is 50 percent. Supported by the
1829 power/cray_aries plugin.
1830
1831 get_timeout=#
1832 Amount of time allowed to get power state information in
1833 milliseconds. The default value is 5,000 milliseconds or
1834 5 seconds. Supported by the power/cray_aries plugin and
1835 represents the time allowed for the capmc command to re‐
1836 spond to various "get" options.
1837
1838 increase_rate=#
1839 Specifies the maximum rate of change in the power cap for
1840 a node where the actual power usage is within up‐
1841 per_threshold (see below) of the power cap. Value repre‐
1842 sents a percentage of the difference between a node's
1843 minimum and maximum power consumption. The default value
1844 is 20 percent. Supported by the power/cray_aries plugin.
1845
1846 job_level
1847 All nodes associated with every job will have the same
1848 power cap, to the extent possible. Also see the
1849 --power=level option on the job submission commands.
1850
1851 job_no_level
1852 Disable the user's ability to set every node associated
1853 with a job to the same power cap. Each node will have
1854 its power cap set independently. This disables the
1855 --power=level option on the job submission commands.
1856
1857 lower_threshold=#
1858 Specify a lower power consumption threshold. If a node's
1859 current power consumption is below this percentage of its
1860 current cap, then its power cap will be reduced. The de‐
1861 fault value is 90 percent. Supported by the
1862 power/cray_aries plugin.
1863
1864 recent_job=#
1865 If a job has started or resumed execution (from suspend)
1866 on a compute node within this number of seconds from the
1867 current time, the node's power cap will be increased to
1868 the maximum. The default value is 300 seconds. Sup‐
1869 ported by the power/cray_aries plugin.
1870
1872 set_timeout=#
1873 Amount of time allowed to set power state information in
1874 milliseconds. The default value is 30,000 milliseconds
1875                   or 30 seconds. Supported by the power/cray_aries plugin and
1876 represents the time allowed for the capmc command to re‐
1877 spond to various "set" options.
1878
1879 set_watts=#
1880                   Specifies the power limit to be set on every compute
1881                   node managed by Slurm. Every node gets this same power
1882 cap and there is no variation through time based upon ac‐
1883 tual power usage on the node. Supported by the
1884 power/cray_aries plugin.
1885
1886 upper_threshold=#
1887 Specify an upper power consumption threshold. If a
1888 node's current power consumption is above this percentage
1889 of its current cap, then its power cap will be increased
1890 to the extent possible. The default value is 95 percent.
1891 Supported by the power/cray_aries plugin.
1892
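            A sketch of a power-capping configuration for the
            power/cray_aries plugin (example values):

                 PowerPlugin=power/cray_aries
                 PowerParameters=balance_interval=60,cap_watts=500000,lower_threshold=85,upper_threshold=95
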
1893 PowerPlugin
1894 Identifies the plugin used for system power management. Cur‐
1895 rently supported plugins include: cray_aries and none. A
1896 restart of slurmctld is required for changes to this parameter
1897 to take effect. More information about system power management
1898 is available here <https://slurm.schedmd.com/power_mgmt.html>.
1899 By default, no power plugin is loaded.
1900
1901 PreemptMode
1902 Mechanism used to preempt jobs or enable gang scheduling. When
1903 the PreemptType parameter is set to enable preemption, the Pre‐
1904 emptMode selects the default mechanism used to preempt the eli‐
1905 gible jobs for the cluster.
1906 PreemptMode may be specified on a per partition basis to over‐
1907 ride this default value if PreemptType=preempt/partition_prio.
1908 Alternatively, it can be specified on a per QOS basis if Pre‐
1909 emptType=preempt/qos. In either case, a valid default Preempt‐
1910 Mode value must be specified for the cluster as a whole when
1911 preemption is enabled.
1912 The GANG option is used to enable gang scheduling independent of
1913 whether preemption is enabled (i.e. independent of the Preempt‐
1914 Type setting). It can be specified in addition to a PreemptMode
1915 setting with the two options comma separated (e.g. Preempt‐
1916 Mode=SUSPEND,GANG).
1917 See <https://slurm.schedmd.com/preempt.html> and
1918 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
1919 tails.
1920
1921 NOTE: For performance reasons, the backfill scheduler reserves
1922 whole nodes for jobs, not partial nodes. If during backfill
1923 scheduling a job preempts one or more other jobs, the whole
1924 nodes for those preempted jobs are reserved for the preemptor
1925 job, even if the preemptor job requested fewer resources than
1926 that. These reserved nodes aren't available to other jobs dur‐
1927 ing that backfill cycle, even if the other jobs could fit on the
1928 nodes. Therefore, jobs may preempt more resources during a sin‐
1929 gle backfill iteration than they requested.
1930            NOTE: For a heterogeneous job to be considered for preemption all
1931 components must be eligible for preemption. When a heterogeneous
1932 job is to be preempted the first identified component of the job
1933 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
1934 CANCEL (lowest)) will be used to set the PreemptMode for all
1935 components. The GraceTime and user warning signal for each com‐
1936 ponent of the heterogeneous job remain unique. Heterogeneous
1937 jobs are excluded from GANG scheduling operations.
1938
1939 OFF Is the default value and disables job preemption and
1940 gang scheduling. It is only compatible with Pre‐
1941 emptType=preempt/none at a global level. A common
1942 use case for this parameter is to set it on a parti‐
1943 tion to disable preemption for that partition.
1944
1945 CANCEL The preempted job will be cancelled.
1946
1947 GANG Enables gang scheduling (time slicing) of jobs in
1948 the same partition, and allows the resuming of sus‐
1949 pended jobs.
1950
1951 NOTE: Gang scheduling is performed independently for
1952 each partition, so if you only want time-slicing by
1953 OverSubscribe, without any preemption, then config‐
1954 uring partitions with overlapping nodes is not rec‐
1955 ommended. On the other hand, if you want to use
1956 PreemptType=preempt/partition_prio to allow jobs
1957 from higher PriorityTier partitions to Suspend jobs
1958 from lower PriorityTier partitions you will need
1959 overlapping partitions, and PreemptMode=SUSPEND,GANG
1960 to use the Gang scheduler to resume the suspended
1961 jobs(s). In any case, time-slicing won't happen be‐
1962 tween jobs on different partitions.
1963
1964 NOTE: Heterogeneous jobs are excluded from GANG
1965 scheduling operations.
1966
1967 REQUEUE Preempts jobs by requeuing them (if possible) or
1968 canceling them. For jobs to be requeued they must
1969 have the --requeue sbatch option set or the cluster
1970 wide JobRequeue parameter in slurm.conf must be set
1971 to 1.
1972
1973 SUSPEND The preempted jobs will be suspended, and later the
1974 Gang scheduler will resume them. Therefore the SUS‐
1975 PEND preemption mode always needs the GANG option to
1976 be specified at the cluster level. Also, because the
1977 suspended jobs will still use memory on the allo‐
1978 cated nodes, Slurm needs to be able to track memory
1979 resources to be able to suspend jobs.
1980 If PreemptType=preempt/qos is configured and if the
1981 preempted job(s) and the preemptor job are on the
1982 same partition, then they will share resources with
1983 the Gang scheduler (time-slicing). If not (i.e. if
1984 the preemptees and preemptor are on different parti‐
1985 tions) then the preempted jobs will remain suspended
1986 until the preemptor ends.
1987
1988 NOTE: Because gang scheduling is performed indepen‐
1989 dently for each partition, if using PreemptType=pre‐
1990 empt/partition_prio then jobs in higher PriorityTier
1991 partitions will suspend jobs in lower PriorityTier
1992                       partitions to run on the released resources. Only
1993                       when the preemptor job ends will the suspended
1994                       jobs be resumed by the Gang scheduler.
1995 NOTE: Suspended jobs will not release GRES. Higher
1996 priority jobs will not be able to preempt to gain
1997 access to GRES.
1998
1999 WITHIN For PreemptType=preempt/qos, allow jobs within the
2000 same qos to preempt one another. While this can be
2001                       set globally here, it is recommended that this only be
2002 set directly on a relevant subset of the system qos
2003 values instead.
2004
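            The partition-priority suspend/resume scenario described above
            can be sketched as follows (hypothetical node and partition
            names):

                 PreemptType=preempt/partition_prio
                 PreemptMode=SUSPEND,GANG
                 PartitionName=high Nodes=n[1-10] PriorityTier=2
                 PartitionName=low  Nodes=n[1-10] PriorityTier=1 Default=YES
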
2005 PreemptType
2006 Specifies the plugin used to identify which jobs can be pre‐
2007 empted in order to start a pending job.
2008
2009 preempt/none
2010 Job preemption is disabled. This is the default.
2011
2012 preempt/partition_prio
2013 Job preemption is based upon partition PriorityTier.
2014 Jobs in higher PriorityTier partitions may preempt jobs
2015 from lower PriorityTier partitions. This is not compati‐
2016 ble with PreemptMode=OFF.
2017
2018 preempt/qos
2019 Job preemption rules are specified by Quality Of Service
2020 (QOS) specifications in the Slurm database. This option
2021 is not compatible with PreemptMode=OFF. A configuration
2022 of PreemptMode=SUSPEND is only supported by the Select‐
2023 Type=select/cons_res and SelectType=select/cons_tres
2024 plugins. See the sacctmgr man page to configure the op‐
2025 tions for preempt/qos.
2026
2027 PreemptExemptTime
2028 Global option for minimum run time for all jobs before they can
2029 be considered for preemption. Any QOS PreemptExemptTime takes
2030 precedence over the global option. This is only honored for Pre‐
2031 emptMode=REQUEUE and PreemptMode=CANCEL.
2032 A time of -1 disables the option, equivalent to 0. Acceptable
2033 time formats include "minutes", "minutes:seconds", "hours:min‐
2034 utes:seconds", "days-hours", "days-hours:minutes", and
2035 "days-hours:minutes:seconds".
2036
2037 PrEpParameters
2038 Parameters to be passed to the PrEpPlugins.
2039
2040 PrEpPlugins
2041 A resource for programmers wishing to write their own plugins
2042 for the Prolog and Epilog (PrEp) scripts. The default, and cur‐
2043 rently the only implemented plugin is prep/script. Additional
2044 plugins can be specified in a comma-separated list. For more in‐
2045 formation please see the PrEp Plugin API documentation page:
2046 <https://slurm.schedmd.com/prep_plugins.html>
2047
2048 PriorityCalcPeriod
2049 The period of time in minutes in which the half-life decay will
2050 be re-calculated. Applicable only if PriorityType=priority/mul‐
2051 tifactor. The default value is 5 (minutes).
2052
2053 PriorityDecayHalfLife
2054 This controls how long prior resource use is considered in de‐
2055 termining how over- or under-serviced an association is (user,
2056 bank account and cluster) in determining job priority. The
2057 record of usage will be decayed over time, with half of the
2058 original value cleared at age PriorityDecayHalfLife. If set to
2059 0 no decay will be applied. This is helpful if you want to en‐
2060 force hard time limits per association. If set to 0 Priori‐
2061 tyUsageResetPeriod must be set to some interval. Applicable
2062 only if PriorityType=priority/multifactor. The unit is a time
2063 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
2064 default value is 7-0 (7 days).
2065
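            For example, to halve recorded usage every 14 days while
            recalculating every 5 minutes:

                 PriorityType=priority/multifactor
                 PriorityDecayHalfLife=14-0
                 PriorityCalcPeriod=5
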
2066 PriorityFavorSmall
2067 Specifies that small jobs should be given preferential schedul‐
2068 ing priority. Applicable only if PriorityType=priority/multi‐
2069 factor. Supported values are "YES" and "NO". The default value
2070 is "NO".
2071
2072 PriorityFlags
2073 Flags to modify priority behavior. Applicable only if Priority‐
2074 Type=priority/multifactor. The keywords below have no associ‐
2075 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
2076 TIVE_TO_TIME").
2077
2078 ACCRUE_ALWAYS If set, priority age factor will be increased
2079 despite job ineligibility due to either depen‐
2080 dencies, holds or begin time in the future. Ac‐
2081 crue limits are ignored.
2082
2083 CALCULATE_RUNNING
2084 If set, priorities will be recalculated not
2085 only for pending jobs, but also running and
2086 suspended jobs.
2087
2088 DEPTH_OBLIVIOUS If set, priority will be calculated based simi‐
2089 lar to the normal multifactor calculation, but
2090 depth of the associations in the tree does not
2091 adversely affect their priority. This option
2092 automatically enables NO_FAIR_TREE.
2093
2094 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
2095 to "classic" fair share priority scheduling.
2096
2097 INCR_ONLY If set, priority values will only increase in
2098 value. Job priority will never decrease in
2099 value.
2100
2101 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
2102 BillingWeights) is calculated as the MAX of in‐
2103 dividual TRES' on a node (e.g. cpus, mem, gres)
2104 plus the sum of all global TRES' (e.g. li‐
2105 censes).
2106
2107 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
2108
2109 NO_NORMAL_ASSOC If set, the association factor is not normal‐
2110 ized against the highest association priority.
2111
2112 NO_NORMAL_PART If set, the partition factor is not normalized
2113 against the highest partition PriorityJobFac‐
2114 tor.
2115
2116 NO_NORMAL_QOS If set, the QOS factor is not normalized
2117 against the highest qos priority.
2118
2119 NO_NORMAL_TRES If set, the TRES factor is not normalized
2120 against the job's partition TRES counts.
2121
2122 SMALL_RELATIVE_TO_TIME
2123 If set, the job's size component will be based
2124 upon not the job size alone, but the job's size
2125 divided by its time limit.
2126
2127 PriorityMaxAge
2128 Specifies the job age which will be given the maximum age factor
2129 in computing priority. For example, a value of 30 minutes would
2130            result in all jobs over 30 minutes old getting the same
2131 age-based priority. Applicable only if PriorityType=prior‐
2132 ity/multifactor. The unit is a time string (i.e. min,
2133 hr:min:00, days-hr:min:00, or days-hr). The default value is
2134 7-0 (7 days).
2135
2136 PriorityParameters
2137 Arbitrary string used by the PriorityType plugin.
2138
2139 PrioritySiteFactorParameters
2140 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
2141
2142 PrioritySiteFactorPlugin
2143            This specifies an optional plugin to be used alongside "prior‐
2144 ity/multifactor", which is meant to initially set and continu‐
2145 ously update the SiteFactor priority factor. The default value
2146 is "site_factor/none".
2147
2148 PriorityType
2149 This specifies the plugin to be used in establishing a job's
2150 scheduling priority. Also see PriorityFlags for configuration
2151 options. The default value is "priority/basic".
2152
2153 priority/basic
2154 Jobs are evaluated in a First In, First Out (FIFO) man‐
2155 ner.
2156
2157 priority/multifactor
2158 Jobs are assigned a priority based upon a variety of fac‐
2159 tors that include size, age, Fairshare, etc.
2160
2161 When not FIFO scheduling, jobs are prioritized in the following
2162 order:
2163
2164 1. Jobs that can preempt
2165 2. Jobs with an advanced reservation
2166 3. Partition PriorityTier
2167 4. Job priority
2168 5. Job submit time
2169 6. Job ID
2170
2171 PriorityUsageResetPeriod
2172 At this interval the usage of associations will be reset to 0.
2173 This is used if you want to enforce hard limits of time usage
2174 per association. If PriorityDecayHalfLife is set to be 0 no de‐
2175 cay will happen and this is the only way to reset the usage ac‐
2176            cumulated by running jobs. By default this is turned off, and
2177            it is advised to use the PriorityDecayHalfLife option instead
2178            to avoid a situation where nothing can run on your cluster; but
2179            if your scheme is set up to only allow certain amounts of time
2180            on your system, this is the way to do it. Applicable only if
2181            PriorityType=priority/multifactor.
2182
2183 NONE Never clear historic usage. The default value.
2184
2185 NOW Clear the historic usage now. Executed at startup
2186 and reconfiguration time.
2187
2188 DAILY Cleared every day at midnight.
2189
2190 WEEKLY Cleared every week on Sunday at time 00:00.
2191
2192 MONTHLY Cleared on the first day of each month at time
2193 00:00.
2194
2195 QUARTERLY Cleared on the first day of each quarter at time
2196 00:00.
2197
2198 YEARLY Cleared on the first day of each year at time 00:00.
2199
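            A sketch of the hard-limit setup described above: disable
            decay entirely and reset accumulated usage monthly:

                 PriorityDecayHalfLife=0
                 PriorityUsageResetPeriod=MONTHLY
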
2200 PriorityWeightAge
2201 An integer value that sets the degree to which the queue wait
2202 time component contributes to the job's priority. Applicable
2203 only if PriorityType=priority/multifactor. Requires Account‐
2204 ingStorageType=accounting_storage/slurmdbd. The default value
2205 is 0.
2206
2207 PriorityWeightAssoc
2208 An integer value that sets the degree to which the association
2209 component contributes to the job's priority. Applicable only if
2210 PriorityType=priority/multifactor. The default value is 0.
2211
2212 PriorityWeightFairshare
2213 An integer value that sets the degree to which the fair-share
2214 component contributes to the job's priority. Applicable only if
2215 PriorityType=priority/multifactor. Requires AccountingStor‐
2216 ageType=accounting_storage/slurmdbd. The default value is 0.
2217
2218 PriorityWeightJobSize
2219 An integer value that sets the degree to which the job size com‐
2220 ponent contributes to the job's priority. Applicable only if
2221 PriorityType=priority/multifactor. The default value is 0.
2222
2223 PriorityWeightPartition
2224 Partition factor used by priority/multifactor plugin in calcu‐
2225 lating job priority. Applicable only if PriorityType=prior‐
2226 ity/multifactor. The default value is 0.
2227
2228 PriorityWeightQOS
2229 An integer value that sets the degree to which the Quality Of
2230 Service component contributes to the job's priority. Applicable
2231 only if PriorityType=priority/multifactor. The default value is
2232 0.
2233
2234 PriorityWeightTRES
2235 A comma-separated list of TRES Types and weights that sets the
2236 degree that each TRES Type contributes to the job's priority.
2237
2238 e.g.
2239 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
2240
2241 Applicable only if PriorityType=priority/multifactor and if Ac‐
2242 countingStorageTRES is configured with each TRES Type. Negative
2243 values are allowed. The default values are 0.
2244
2245 PrivateData
2246 This controls what type of information is hidden from regular
2247 users. By default, all information is visible to all users.
2248 User SlurmUser and root can always view all information. Multi‐
2249 ple values may be specified with a comma separator. Acceptable
2250 values include:
2251
2252 accounts
2253 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2254 ing any account definitions unless they are coordinators
2255 of them.
2256
2257 cloud Powered down nodes in the cloud are visible. Without
2258 this flag, cloud nodes will not appear in the output of
2259 commands like sinfo unless they are powered on, even for
2260 SlurmUser and root.
2261
2262 events Prevents users from viewing event information unless they
2263 have operator status or above.
2264
2265 jobs Prevents users from viewing jobs or job steps belonging
2266 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
2267 users from viewing job records belonging to other users
2268 unless they are coordinators of the association running
2269 the job when using sacct.
2270
2271 nodes Prevents users from viewing node state information.
2272
2273 partitions
2274 Prevents users from viewing partition state information.
2275
2276 reservations
2277 Prevents regular users from viewing reservations which
2278 they can not use.
2279
2280 usage Prevents users from viewing usage of any other user;
2281 this applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY)
2282 Prevents users from viewing usage of any other user;
2283 this applies to sreport.
2284
2285 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2286 ing information of any user other than themselves; this
2287 also means users can only see the associations they deal
2288 with. Coordinators can see associations of all
2289 users in the account they are coordinator of, but can
2290 only see themselves when listing users.
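
              e.g. to hide job, usage and user information from regular
              users (an illustrative combination only):

              PrivateData=jobs,usage,users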
2291
2292 ProctrackType
2293 Identifies the plugin to be used for process tracking on a job
2294 step basis. The slurmd daemon uses this mechanism to identify
2295 all processes which are children of processes it spawns for a
2296 user job step. A restart of slurmctld is required for changes
2297 to this parameter to take effect. NOTE: "proctrack/linuxproc"
2298 and "proctrack/pgid" can fail to identify all processes associ‐
2299 ated with a job since processes can become a child of the init
2300 process (when the parent process terminates) or change their
2301 process group. To reliably track all processes, "proc‐
2302 track/cgroup" is highly recommended. NOTE: The JobContainerType
2303 applies to a job allocation, while ProctrackType applies to job
2304 steps. Acceptable values at present include:
2305
2306 proctrack/cgroup
2307 Uses linux cgroups to constrain and track processes, and
2308 is the default for systems with cgroup support.
2309 NOTE: see "man cgroup.conf" for configuration details.
2310
2311 proctrack/cray_aries
2312 Uses Cray proprietary process tracking.
2313
2314 proctrack/linuxproc
2315 Uses linux process tree using parent process IDs.
2316
2317 proctrack/pgid
2318 Uses Process Group IDs.
2319 NOTE: This is the default for the BSD family.
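
              e.g. (the recommended setting on Linux systems with cgroup
              support):

              ProctrackType=proctrack/cgroup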
2320
2321 Prolog Fully qualified pathname of a program for the slurmd to execute
2322 whenever it is asked to run a job step from a new job allocation
2323 (e.g. "/usr/local/slurm/prolog"). A glob pattern (see glob(7))
2324 may also be used to specify more than one program to run (e.g.
2325 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2326 starting the first job step. The prolog script or scripts may
2327 be used to purge files, enable user login, etc. By default
2328 there is no prolog. Any configured script is expected to com‐
2329 plete execution quickly (in less time than MessageTimeout). If
2330 the prolog fails (returns a non-zero exit code), this will re‐
2331 sult in the node being set to a DRAIN state and the job being
2332 requeued in a held state, unless nohold_on_prolog_fail is con‐
2333 figured in SchedulerParameters. See Prolog and Epilog Scripts
2334 for more information.
2335
2336 PrologEpilogTimeout
2337 The interval in seconds Slurm waits for Prolog and Epilog before
2338 terminating them. The default behavior is to wait indefinitely.
2339 This interval applies to the Prolog and Epilog run by slurmd
2340 daemon before and after the job, the PrologSlurmctld and Epi‐
2341 logSlurmctld run by slurmctld daemon, and the SPANK plugin pro‐
2342 log/epilog calls: slurm_spank_job_prolog and
2343 slurm_spank_job_epilog.
2344 If the PrologSlurmctld times out, the job is requeued if possi‐
2345 ble. If the Prolog or slurm_spank_job_prolog time out, the job
2346 is requeued if possible and the node is drained. If the Epilog
2347 or slurm_spank_job_epilog time out, the node is drained. In all
2348 cases, errors are logged.
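
              e.g. to limit Prolog and Epilog execution to five minutes
              (an illustrative value):

              PrologEpilogTimeout=300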
2349
2350 PrologFlags
2351 Flags to control the Prolog behavior. By default no flags are
2352 set. Multiple flags may be specified in a comma-separated list.
2353 Currently supported options are:
2354
2355 Alloc If set, the Prolog script will be executed at job allo‐
2356 cation. By default, Prolog is executed just before the
2357 task is launched. Therefore, when salloc is started, no
2358 Prolog is executed. Alloc is useful for preparing things
2359 before a user starts to use any allocated resources. In
2360 particular, this flag is needed on a Cray system when
2361 cluster compatibility mode is enabled.
2362
2363 NOTE: Use of the Alloc flag will increase the time re‐
2364 quired to start jobs.
2365
2366 Contain At job allocation time, use the ProcTrack plugin to cre‐
2367 ate a job container on all allocated compute nodes.
2368 This container may be used for user processes not
2369 launched under Slurm control, for example
2370 pam_slurm_adopt may place processes launched through a
2371 direct user login into this container. If using
2372 pam_slurm_adopt, then ProcTrackType must be set to ei‐
2373 ther proctrack/cgroup or proctrack/cray_aries. Setting
2374 the Contain flag implicitly sets the Alloc flag.
2375
2376 DeferBatch
2377 If set, slurmctld will wait until the prolog completes
2378 on all allocated nodes before sending the batch job
2379 launch request. With just the Alloc flag, slurmctld will
2380 launch the batch step as soon as the first node in the
2381 job allocation completes the prolog.
2382
2383 NoHold If set, the Alloc flag should also be set. This allows
2384 salloc to return without blocking until the prolog fin‐
2385 ishes on each node; instead, blocking happens when steps
2386 reach the slurmd and before any execution has happened
2387 in the step. This is much faster, and this flag is rec‐
2388 ommended if srun is used to launch your tasks. This
2389 flag cannot be combined with the Contain or
2390 X11 flags.
2391
2392 Serial By default, the Prolog and Epilog scripts run concur‐
2393 rently on each node. This flag forces those scripts to
2394 run serially within each node, but with a significant
2395 penalty to job throughput on each node.
2396
2397 X11 Enable Slurm's built-in X11 forwarding capabilities.
2398 This is incompatible with ProctrackType=proctrack/linux‐
2399 proc. Setting the X11 flag implicitly enables both Con‐
2400 tain and Alloc flags as well.
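
              e.g. (an illustrative combination; setting X11 implicitly
              enables the Contain and Alloc flags as well):

              PrologFlags=X11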
2401
2402 PrologSlurmctld
2403 Fully qualified pathname of a program for the slurmctld daemon
2404 to execute before granting a new job allocation (e.g. "/usr/lo‐
2405 cal/slurm/prolog_controller"). The program executes as Slur‐
2406 mUser on the same node where the slurmctld daemon executes, giv‐
2407 ing it permission to drain nodes and requeue the job if a fail‐
2408 ure occurs or cancel the job if appropriate. Exactly what the
2409 program does and how it accomplishes this is completely at the
2410 discretion of the system administrator. Information about the
2411 job being initiated, its allocated nodes, etc. are passed to the
2412 program using environment variables. While this program is run‐
2413 ning, the nodes associated with the job will have a
2414 POWER_UP/CONFIGURING flag set in their state, which can be read‐
2415 ily viewed. The slurmctld daemon will wait indefinitely for
2416 this program to complete. Once the program completes with an
2417 exit code of zero, the nodes will be considered ready for use
2418 and the job will be started. If some node cannot be made
2419 available for use, the program should drain the node (typically
2420 using the scontrol command) and terminate with a non-zero exit
2421 code. A non-zero exit code will result in the job being re‐
2422 queued (where possible) or killed. Note that only batch jobs can
2423 be requeued. See Prolog and Epilog Scripts for more informa‐
2424 tion.
2425
2426 PropagatePrioProcess
2427 Controls the scheduling priority (nice value) of user-spawned
2428 tasks.
2429
2430 0 The tasks will inherit the scheduling priority from the
2431 slurm daemon. This is the default value.
2432
2433 1 The tasks will inherit the scheduling priority of the com‐
2434 mand used to submit them (e.g. srun or sbatch). Unless the
2435 job is submitted by user root, the tasks will have a sched‐
2436 uling priority no higher than the slurm daemon spawning
2437 them.
2438
2439 2 The tasks will inherit the scheduling priority of the com‐
2440 mand used to submit them (e.g. srun or sbatch) with the re‐
2441 striction that their nice value will always be one higher
2442 than the slurm daemon (i.e. the tasks' scheduling priority
2443 will be lower than the slurm daemon).
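
              e.g. to have tasks inherit the scheduling priority of the
              submitting command (an illustrative setting):

              PropagatePrioProcess=1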
2444
2445 PropagateResourceLimits
2446 A comma-separated list of resource limit names. The slurmd dae‐
2447 mon uses these names to obtain the associated (soft) limit val‐
2448 ues from the user's process environment on the submit node.
2449 These limits are then propagated and applied to the jobs that
2450 will run on the compute nodes. This parameter can be useful
2451 when system limits vary among nodes. Any resource limits that
2452 do not appear in the list are not propagated. However, the user
2453 can override this by specifying which resource limits to propa‐
2454 gate with the sbatch or srun "--propagate" option. If neither
2455 PropagateResourceLimits nor PropagateResourceLimitsExcept is
2456 configured and the "--propagate" option is not specified, then
2457 the default action is to propagate all limits. Only one of the
2458 parameters, either PropagateResourceLimits or PropagateResource‐
2459 LimitsExcept, may be specified. The user limits can not exceed
2460 hard limits under which the slurmd daemon operates. If the user
2461 limits are not propagated, the limits from the slurmd daemon
2462 will be propagated to the user's job. The limits used for the
2463 Slurm daemons can be set in the /etc/sysconfig/slurm file. For
2464 more information, see: https://slurm.schedmd.com/faq.html#mem‐
2465 lock. The following limit names are supported by Slurm (although
2466 some options may not be supported on some systems):
2467
2468 ALL All limits listed below (default)
2469
2470 NONE No limits listed below
2471
2472 AS The maximum address space (virtual memory) for a
2473 process.
2474
2475 CORE The maximum size of a core file
2476
2477 CPU The maximum amount of CPU time
2478
2479 DATA The maximum size of a process's data segment
2480
2481 FSIZE The maximum size of files created. Note that if the
2482 user sets FSIZE to less than the current size of the
2483 slurmd.log, job launches will fail with a 'File size
2484 limit exceeded' error.
2485
2486 MEMLOCK The maximum size that may be locked into memory
2487
2488 NOFILE The maximum number of open files
2489
2490 NPROC The maximum number of processes available
2491
2492 RSS The maximum resident set size. Note that this only
2493 has effect with Linux kernels 2.4.30 or older or BSD.
2494
2495 STACK The maximum stack size
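
              e.g. to propagate only the memory lock and open file limits
              (an illustrative selection):

              PropagateResourceLimits=MEMLOCK,NOFILE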
2496
2497 PropagateResourceLimitsExcept
2498 A comma-separated list of resource limit names. By default, all
2499 resource limits will be propagated, (as described by the Propa‐
2500 gateResourceLimits parameter), except for the limits appearing
2501 in this list. The user can override this by specifying which
2502 resource limits to propagate with the sbatch or srun "--propa‐
2503 gate" option. See PropagateResourceLimits above for a list of
2504 valid limit names.
2505
2506 RebootProgram
2507 Program to be executed on each compute node to reboot it. In‐
2508 voked on each node once it becomes idle after the command "scon‐
2509 trol reboot" is executed by an authorized user or a job is sub‐
2510 mitted with the "--reboot" option. After rebooting, the node is
2511 returned to normal use. See ResumeTimeout to configure the time
2512 you expect a reboot to finish in. A node will be marked DOWN if
2513 it doesn't reboot within ResumeTimeout.
2514
2515 ReconfigFlags
2516 Flags to control various actions that may be taken when an
2517 "scontrol reconfig" command is issued. Currently the options
2518 are:
2519
2520 KeepPartInfo If set, an "scontrol reconfig" command will
2521 maintain the in-memory value of partition
2522 "state" and other parameters that may have been
2523 dynamically updated by "scontrol update". Par‐
2524 tition information in the slurm.conf file will
2525 be merged with in-memory data. This flag su‐
2526 persedes the KeepPartState flag.
2527
2528 KeepPartState If set, an "scontrol reconfig" command will
2529 preserve only the current "state" value of
2530 in-memory partitions and will reset all other
2531 parameters of the partitions that may have been
2532 dynamically updated by "scontrol update" to the
2533 values from the slurm.conf file. Partition in‐
2534 formation in the slurm.conf file will be merged
2535 with in-memory data.
2536
2537 By default, neither flag is set, and an "scontrol
2538 reconfig" command will rebuild the partition information
2539 using only the definitions in the slurm.conf file.
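
              e.g. to preserve dynamically updated partition information
              across reconfiguration (illustrative):

              ReconfigFlags=KeepPartInfo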
2540
2541 RequeueExit
2542 Enables automatic requeue for batch jobs which exit with the
2543 specified values. Separate multiple exit codes with a comma
2544 and/or specify numeric ranges using a "-" separator (e.g.
2545 "RequeueExit=1-9,18"). Jobs will be put back into the pending
2546 state and later scheduled again. Restarted jobs will have the environment
2547 variable SLURM_RESTART_COUNT set to the number of times the job
2548 has been restarted.
2549
2550 RequeueExitHold
2551 Enables automatic requeue for batch jobs which exit with the
2552 specified values, with these jobs being held until released man‐
2553 ually by the user. Separate multiple exit codes with a comma
2554 and/or specify numeric ranges using a "-" separator (e.g. "Re‐
2555 queueExitHold=10-12,16"). These jobs are put in the JOB_SPE‐
2556 CIAL_EXIT exit state. Restarted jobs will have the environment
2557 variable SLURM_RESTART_COUNT set to the number of times the job
2558 has been restarted.
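
              e.g. to requeue jobs exiting with codes 1-9 or 18, while
              requeuing and holding jobs exiting with codes 10-12 or 16
              (the exit codes shown are illustrative only):

              RequeueExit=1-9,18
              RequeueExitHold=10-12,16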
2559
2560 ResumeFailProgram
2561 The program that will be executed when nodes fail to resume
2562 by ResumeTimeout. The argument to the program will be the names
2563 of the failed nodes (using Slurm's hostlist expression format).
2564
2565 ResumeProgram
2566 Slurm supports a mechanism to reduce power consumption on nodes
2567 that remain idle for an extended period of time. This is typi‐
2568 cally accomplished by reducing voltage and frequency or powering
2569 the node down. ResumeProgram is the program that will be exe‐
2570 cuted when a node in power save mode is assigned work to per‐
2571 form. For reasons of reliability, ResumeProgram may execute
2572 more than once for a node when the slurmctld daemon crashes and
2573 is restarted. If ResumeProgram is unable to restore a node to
2574 service with a responding slurmd and an updated BootTime, it
2575 should set the node state to DOWN, which will result in a re‐
2576 queue of any job associated with the node - this will happen au‐
2577 tomatically if the node doesn't register within ResumeTimeout.
2578 If the node isn't actually rebooted (i.e. when multiple-slurmd
2579 is configured), starting slurmd with the "-b" option might be useful.
2580 The program executes as SlurmUser. The argument to the program
2581 will be the names of nodes to be removed from power savings mode
2582 (using Slurm's hostlist expression format). A job to node map‐
2583 ping is available in JSON format by reading the temporary file
2584 specified by the SLURM_RESUME_FILE environment variable. By de‐
2585 fault no program is run.
2586
2587 ResumeRate
2588 The rate at which nodes in power save mode are returned to nor‐
2589 mal operation by ResumeProgram. The value is a number of nodes
2590 per minute and it can be used to prevent power surges if a large
2591 number of nodes in power save mode are assigned work at the same
2592 time (e.g. a large job starts). A value of zero results in no
2593 limits being imposed. The default value is 300 nodes per
2594 minute.
2595
2596 ResumeTimeout
2597 Maximum time permitted (in seconds) between when a node resume
2598 request is issued and when the node is actually available for
2599 use. Nodes which fail to respond in this time frame will be
2600 marked DOWN and the jobs scheduled on the node requeued. Nodes
2601 which reboot after this time frame will be marked DOWN with a
2602 reason of "Node unexpectedly rebooted." The default value is 60
2603 seconds.
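
              e.g. a power saving configuration might resemble the fol‐
              lowing (the script path and values are illustrative only):

              ResumeProgram=/usr/local/slurm/node_resume.sh
              ResumeRate=100
              ResumeTimeout=300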
2604
2605 ResvEpilog
2606 Fully qualified pathname of a program for the slurmctld to exe‐
2607 cute when a reservation ends. The program can be used to cancel
2608 jobs, modify partition configuration, etc. The reservation
2609 named will be passed as an argument to the program. By default
2610 there is no epilog.
2611
2612 ResvOverRun
2613 Describes how long a job already running in a reservation should
2614 be permitted to execute after the end time of the reservation
2615 has been reached. The time period is specified in minutes and
2616 the default value is 0 (kill the job immediately). The value
2617 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2618 supported to permit a job to run indefinitely after its reserva‐
2619 tion is terminated.
2620
2621 ResvProlog
2622 Fully qualified pathname of a program for the slurmctld to exe‐
2623 cute when a reservation begins. The program can be used to can‐
2624 cel jobs, modify partition configuration, etc. The reservation
2625 named will be passed as an argument to the program. By default
2626 there is no prolog.
2627
2628 ReturnToService
2629 Controls when a DOWN node will be returned to service. The de‐
2630 fault value is 0. Supported values include
2631
2632 0 A node will remain in the DOWN state until a system adminis‐
2633 trator explicitly changes its state (even if the slurmd dae‐
2634 mon registers and resumes communications).
2635
2636 1 A DOWN node will become available for use upon registration
2637 with a valid configuration only if it was set DOWN due to
2638 being non-responsive. If the node was set DOWN for any
2639 other reason (low memory, unexpected reboot, etc.), its
2640 state will not automatically be changed. A node registers
2641 with a valid configuration if its memory, GRES, CPU count,
2642 etc. are equal to or greater than the values configured in
2643 slurm.conf.
2644
2645 2 A DOWN node will become available for use upon registration
2646 with a valid configuration. The node could have been set
2647 DOWN for any reason. A node registers with a valid configu‐
2648 ration if its memory, GRES, CPU count, etc. are equal to or
2649 greater than the values configured in slurm.conf.
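
              e.g. to automatically return nodes that were set DOWN only
              because they were non-responsive (illustrative):

              ReturnToService=1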
2650
2651 RoutePlugin
2652 Identifies the plugin to be used for defining which nodes will
2653 be used for message forwarding.
2654
2655 route/default
2656 default, use TreeWidth.
2657
2658 route/topology
2659 use the switch hierarchy defined in a topology.conf file.
2660 TopologyPlugin=topology/tree is required.
2661
2662 SchedulerParameters
2663 The interpretation of this parameter varies by SchedulerType.
2664 Multiple options may be comma separated.
2665
2666 allow_zero_lic
2667 If set, then job submissions requesting more than config‐
2668 ured licenses won't be rejected.
2669
2670 assoc_limit_stop
2671 If set and a job cannot start due to association limits,
2672 then do not attempt to initiate any lower priority jobs
2673 in that partition. Setting this can decrease system
2674 throughput and utilization, but avoid potentially starv‐
2675 ing larger jobs by preventing them from launching indefi‐
2676 nitely.
2677
2678 batch_sched_delay=#
2679 How long, in seconds, the scheduling of batch jobs can be
2680 delayed. This can be useful in a high-throughput envi‐
2681 ronment in which batch jobs are submitted at a very high
2682 rate (i.e. using the sbatch command) and one wishes to
2683 reduce the overhead of attempting to schedule each job at
2684 submit time. The default value is 3 seconds.
2685
2686 bb_array_stage_cnt=#
2687 Number of tasks from a job array that should be available
2688 for burst buffer resource allocation. Higher values will
2689 increase the system overhead as each task from the job
2690 array will be moved to its own job record in memory, so
2691 relatively small values are generally recommended. The
2692 default value is 10.
2693
2694 bf_busy_nodes
2695 When selecting resources for pending jobs to reserve for
2696 future execution (i.e. the job can not be started immedi‐
2697 ately), then preferentially select nodes that are in use.
2698 This will tend to leave currently idle resources avail‐
2699 able for backfilling longer running jobs, but may result
2700 in allocations having less than optimal network topology.
2701 This option is currently only supported by the se‐
2702 lect/cons_res and select/cons_tres plugins (or se‐
2703 lect/cray_aries with SelectTypeParameters set to
2704 "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
2705 select/cray_aries plugin over the select/cons_res or se‐
2706 lect/cons_tres plugin respectively).
2707
2708 bf_continue
2709 The backfill scheduler periodically releases locks in or‐
2710 der to permit other operations to proceed rather than
2711 blocking all activity for what could be an extended pe‐
2712 riod of time. Setting this option will cause the back‐
2713 fill scheduler to continue processing pending jobs from
2714 its original job list after releasing locks even if job
2715 or node state changes.
2716
2717 bf_hetjob_immediate
2718 Instruct the backfill scheduler to attempt to start a
2719 heterogeneous job as soon as all of its components are
2720 determined able to do so. Otherwise, the backfill sched‐
2721 uler will delay heterogeneous jobs initiation attempts
2722 until after the rest of the queue has been processed.
2723 This delay may result in lower priority jobs being allo‐
2724 cated resources, which could delay the initiation of the
2725 heterogeneous job due to account and/or QOS limits being
2726 reached. This option is disabled by default. If enabled
2727 and bf_hetjob_prio=min is not set, then bf_hetjob_prio=min
2728 will be set automatically.
2729
2730 bf_hetjob_prio=[min|avg|max]
2731 At the beginning of each backfill scheduling cycle, a
2732 list of pending jobs to be scheduled is sorted according
2733 to the precedence order configured in PriorityType. This
2734 option instructs the scheduler to alter the sorting algo‐
2735 rithm to ensure that all components belonging to the same
2736 heterogeneous job will be attempted to be scheduled con‐
2737 secutively (thus not fragmented in the resulting list).
2738 More specifically, all components from the same heteroge‐
2739 neous job will be treated as if they all have the same
2740 priority (minimum, average or maximum depending upon this
2741 option's parameter) when compared with other jobs (or
2742 other heterogeneous job components). The original order
2743 will be preserved within the same heterogeneous job. Note
2744 that the operation is calculated for the PriorityTier
2745 layer and for the Priority resulting from the prior‐
2746 ity/multifactor plugin calculations. When enabled, if any
2747 heterogeneous job requested an advanced reservation, then
2748 all of that job's components will be treated as if they
2749 had requested an advanced reservation (and get preferen‐
2750 tial treatment in scheduling).
2751
2752 Note that this operation does not update the Priority
2753 values of the heterogeneous job components, only their
2754 order within the list, so the output of the sprio command
2755 will not be affected.
2756
2757 Heterogeneous jobs have special scheduling properties:
2758 they are only scheduled by the backfill scheduling
2759 plugin, each of their components is considered separately
2760 when reserving resources (and might have different Prior‐
2761 ityTier or different Priority values), and no heteroge‐
2762 neous job component is actually allocated resources until
2763 all of its components can be initiated. This may imply
2764 potential scheduling deadlock scenarios because compo‐
2765 nents from different heterogeneous jobs can start reserv‐
2766 ing resources in an interleaved fashion (not consecu‐
2767 tively), but none of the jobs can reserve resources for
2768 all components and start. Enabling this option can help
2769 to mitigate this problem. By default, this option is dis‐
2770 abled.
2771
2772 bf_interval=#
2773 The number of seconds between backfill iterations.
2774 Higher values result in less overhead and better respon‐
2775 siveness. This option applies only to Scheduler‐
2776 Type=sched/backfill. Default: 30, Min: 1, Max: 10800
2777 (3h). A setting of -1 will disable the backfill schedul‐
2778 ing loop.
2779
2780 bf_job_part_count_reserve=#
2781 The backfill scheduling logic will reserve resources for
2782 the specified count of highest priority jobs in each par‐
2783 tition. For example, bf_job_part_count_reserve=10 will
2784 cause the backfill scheduler to reserve resources for the
2785 ten highest priority jobs in each partition. Any lower
2786 priority job that can be started using currently avail‐
2787 able resources and not adversely impact the expected
2788 start time of these higher priority jobs will be started
2789 by the backfill scheduler. The default value is zero,
2790 which will reserve resources for any pending job and de‐
2791 lay initiation of lower priority jobs. Also see
2792 bf_min_age_reserve and bf_min_prio_reserve. Default: 0,
2793 Min: 0, Max: 100000.
2794
2795 bf_licenses
2796 Require the backfill scheduling logic to track and plan
2797 for license availability. By default, any job blocked on
2798 license availability will not have resources reserved
2799 which can lead to job starvation. This option implicitly
2800 enables bf_running_job_reserve.
2801
2802 bf_max_job_array_resv=#
2803 The maximum number of tasks from a job array for which
2804 the backfill scheduler will reserve resources in the fu‐
2805 ture. Since job arrays can potentially have millions of
2806 tasks, the overhead in reserving resources for all tasks
2807 can be prohibitive. In addition various limits may pre‐
2808 vent all the jobs from starting at the expected times.
2809 This has no impact upon the number of tasks from a job
2810 array that can be started immediately, only those tasks
2811 expected to start at some future time. Default: 20, Min:
2812 0, Max: 1000. NOTE: Jobs submitted to multiple parti‐
2813 tions appear in the job queue once per partition. If dif‐
2814 ferent copies of a single job array record aren't consec‐
2815 utive in the job queue and another job array record is in
2816 between, then bf_max_job_array_resv tasks are considered
2817 per partition that the job is submitted to.
2818
2819 bf_max_job_assoc=#
2820 The maximum number of jobs per user association to at‐
2821 tempt starting with the backfill scheduler. This setting
2822 is similar to bf_max_job_user but is handy if a user has
2823 multiple associations equating to basically different
2824 users. One can set this limit to prevent users from
2825 flooding the backfill queue with jobs that cannot start
2826 and that prevent jobs from other users from starting. This
2827 option applies only to SchedulerType=sched/backfill.
2828 Also see the bf_max_job_user, bf_max_job_part,
2829 bf_max_job_test and bf_max_job_user_part=# options. Set
2830 bf_max_job_test to a value much higher than
2831 bf_max_job_assoc. Default: 0 (no limit), Min: 0, Max:
2832 bf_max_job_test.
2833
2834 bf_max_job_part=#
2835 The maximum number of jobs per partition to attempt
2836 starting with the backfill scheduler. This can be espe‐
2837 cially helpful for systems with large numbers of parti‐
2838 tions and jobs. This option applies only to Scheduler‐
2839 Type=sched/backfill. Also see the partition_job_depth
2840 and bf_max_job_test options. Set bf_max_job_test to a
2841 value much higher than bf_max_job_part. Default: 0 (no
2842 limit), Min: 0, Max: bf_max_job_test.
2843
2844 bf_max_job_start=#
2845 The maximum number of jobs which can be initiated in a
2846 single iteration of the backfill scheduler. This option
2847 applies only to SchedulerType=sched/backfill. Default: 0
2848 (no limit), Min: 0, Max: 10000.
2849
2850 bf_max_job_test=#
2851 The maximum number of jobs to attempt backfill scheduling
2852 for (i.e. the queue depth). Higher values result in more
2853 overhead and less responsiveness. Until an attempt is
2854 made to backfill schedule a job, its expected initiation
2855 time value will not be set. In the case of large clus‐
2856 ters, configuring a relatively small value may be desir‐
2857 able. This option applies only to Scheduler‐
2858 Type=sched/backfill. Default: 500, Min: 1, Max:
2859 1,000,000.
2860
2861 bf_max_job_user=#
2862 The maximum number of jobs per user to attempt starting
2863 with the backfill scheduler for ALL partitions. One can
2864 set this limit to prevent users from flooding the back‐
2865 fill queue with jobs that cannot start and that prevent
2866 jobs from other users from starting. This is similar to the
2867 MAXIJOB limit in Maui. This option applies only to
2868 SchedulerType=sched/backfill. Also see the
2869 bf_max_job_part, bf_max_job_test and
2870 bf_max_job_user_part=# options. Set bf_max_job_test to a
2871 value much higher than bf_max_job_user. Default: 0 (no
2872 limit), Min: 0, Max: bf_max_job_test.
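
              e.g. a backfill tuning for a high-throughput system might
              combine several of the above options (the values are illus‐
              trative only and should be tuned for each site):

              SchedulerParameters=bf_continue,bf_interval=60,bf_max_job_test=1000,bf_max_job_user=20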
2873
2874 bf_max_job_user_part=#
2875 The maximum number of jobs per user per partition to at‐
2876 tempt starting with the backfill scheduler for any single
2877 partition. This option applies only to Scheduler‐
2878 Type=sched/backfill. Also see the bf_max_job_part,
2879 bf_max_job_test and bf_max_job_user=# options. Default:
2880 0 (no limit), Min: 0, Max: bf_max_job_test.
2881
2882 bf_max_time=#
2883 The maximum time in seconds the backfill scheduler can
2884 spend (including time spent sleeping when locks are re‐
2885 leased) before discontinuing, even if maximum job counts
2886 have not been reached. This option applies only to
2887 SchedulerType=sched/backfill. The default value is the
2888 value of bf_interval (which defaults to 30 seconds). De‐
2889 fault: bf_interval value (def. 30 sec), Min: 1, Max: 3600
2890 (1h). NOTE: If bf_interval is short and bf_max_time is
2891 large, this may cause locks to be acquired too frequently
2892 and starve out other serviced RPCs. It's advisable if us‐
2893 ing this parameter to set max_rpc_cnt high enough that
2894 scheduling isn't always disabled, and low enough that the
2895 interactive workload can get through in a reasonable pe‐
2896 riod of time. max_rpc_cnt needs to be below 256 (the de‐
2897 fault RPC thread limit). Running around the middle (150)
2898 may give you good results. NOTE: When increasing the
2899 amount of time spent in the backfill scheduling cycle,
2900 Slurm can be prevented from responding to client requests
2901 in a timely manner. To address this you can use
2902 max_rpc_cnt to specify a number of queued RPCs before the
2903 scheduler stops so that it can respond to these requests.
2904
2905 bf_min_age_reserve=#
2906 The backfill and main scheduling logic will not reserve
2907 resources for pending jobs until they have been pending
2908 and runnable for at least the specified number of sec‐
2909 onds. In addition, jobs waiting for less than the speci‐
2910 fied number of seconds will not prevent a newly submitted
2911 job from starting immediately, even if the newly submit‐
2912 ted job has a lower priority. This can be valuable if
2913 jobs lack time limits or all time limits have the same
2914 value. The default value is zero, which will reserve re‐
2915 sources for any pending job and delay initiation of lower
2916 priority jobs. Also see bf_job_part_count_reserve and
2917 bf_min_prio_reserve. Default: 0, Min: 0, Max: 2592000
2918 (30 days).
2919
2920 bf_min_prio_reserve=#
2921 The backfill and main scheduling logic will not reserve
2922 resources for pending jobs unless they have a priority
2923 equal to or higher than the specified value. In addi‐
2924 tion, jobs with a lower priority will not prevent a newly
2925 submitted job from starting immediately, even if the
2926 newly submitted job has a lower priority. This can be
2927 valuable if one wished to maximize system utilization
2928 without regard for job priority below a certain thresh‐
2929 old. The default value is zero, which will reserve re‐
2930 sources for any pending job and delay initiation of lower
2931 priority jobs. Also see bf_job_part_count_reserve and
2932 bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.
2933
2934 bf_node_space_size=#
2935 Size of backfill node_space table. Adding a single job to
2936 backfill reservations in the worst case can consume two
2937 node_space records. In the case of large clusters, con‐
2938 figuring a relatively small value may be desirable. This
2939 option applies only to SchedulerType=sched/backfill.
2940 Also see bf_max_job_test and bf_running_job_reserve. De‐
2941 fault: bf_max_job_test, Min: 2, Max: 2,000,000.
2942
2943 bf_one_resv_per_job
2944 Disallow adding more than one backfill reservation per
2945 job. The scheduling logic builds a sorted list of job-
2946 partition pairs. Jobs submitted to multiple partitions
2947 have as many entries in the list as requested partitions.
2948 By default, the backfill scheduler may evaluate all the
2949 job-partition entries for a single job, potentially re‐
2950 serving resources for each pair, but only starting the
2951 job in the reservation offering the earliest start time.
2952 Having a single job reserving resources for multiple par‐
2953 titions could impede other jobs (or hetjob components)
2954 from reserving resources already reserved for the parti‐
2955 tions that don't offer the earliest start time. A single
2956 job that requests multiple partitions can also prevent
2957 itself from starting earlier in a lower priority parti‐
2958 tion if the partitions overlap nodes and a backfill
2959 reservation in the higher priority partition blocks nodes
2960 that are also in the lower priority partition. This op‐
2961 tion makes it so that a job submitted to multiple parti‐
2962 tions will stop reserving resources once the first job-
2963 partition pair has booked a backfill reservation. Subse‐
2964 quent pairs from the same job will only be tested to
2965 start now. This allows for other jobs to be able to book
2966 the other pairs resources at the cost of not guaranteeing
2967 that the multi partition job will start in the partition
2968 offering the earliest start time (unless it can start im‐
2969 mediately). This option is disabled by default.
2970
2971 bf_resolution=#
2972 The number of seconds in the resolution of data main‐
2973 tained about when jobs begin and end. Higher values re‐
2974 sult in better responsiveness and quicker backfill cycles
2975 by using larger blocks of time to determine node eligi‐
2976 bility. However, higher values lead to less efficient
2977 system planning, and may miss opportunities to improve
2978 system utilization. This option applies only to Sched‐
2979 ulerType=sched/backfill. Default: 60, Min: 1, Max: 3600
2980 (1 hour).
2981
2982 bf_running_job_reserve
2983 Add an extra step to backfill logic, which creates back‐
2984 fill reservations for jobs running on whole nodes. This
2985 option is disabled by default.
2986
2987 bf_window=#
2988 The number of minutes into the future to look when con‐
2989 sidering jobs to schedule. Higher values result in more
2990 overhead and less responsiveness. A value at least as
2991 long as the highest allowed time limit is generally ad‐
2992 visable to prevent job starvation. In order to limit the
2993 amount of data managed by the backfill scheduler, if the
2994 value of bf_window is increased, then it is generally ad‐
2995 visable to also increase bf_resolution. This option ap‐
2996 plies only to SchedulerType=sched/backfill. Default:
2997 1440 (1 day), Min: 1, Max: 43200 (30 days).
2998
2999 bf_window_linear=#
3000 For performance reasons, the backfill scheduler will de‐
3001 crease precision in calculation of job expected termina‐
3002 tion times. By default, the precision starts at 30 sec‐
3003 onds and that time interval doubles with each evaluation
3004 of currently executing jobs when trying to determine when
3005 a pending job can start. This algorithm can support an
3006 environment with many thousands of running jobs, but can
3007 result in the expected start time of pending jobs being
3008 gradually being deferred due to lack of precision. A
3009 value for bf_window_linear will cause the time interval
3010 to be increased by a constant amount on each iteration.
3011 The value is specified in units of seconds. For example,
3012 a value of 60 will cause the backfill scheduler on the
3013 first iteration to identify the job ending soonest and
3014 determine if the pending job can be started after that
3015 job plus all other jobs expected to end within 30 seconds
3016 (default initial value) of the first job. On the next it‐
3017 eration, the pending job will be evaluated for starting
3018 after the next job expected to end plus all jobs ending
3019 within 90 seconds of that time (30 second default, plus
3020 the 60 second option value). The third iteration will
3021 have a 150 second window and the fourth 210 seconds.
3022 Without this option, the time windows will double on each
3023 iteration and thus be 30, 60, 120, 240 seconds, etc. The
3024 use of bf_window_linear is not recommended with more than
3025 a few hundred simultaneously executing jobs.
3026
3027 bf_yield_interval=#
3028 The backfill scheduler will periodically relinquish locks
3029 in order for other pending operations to take place.
3030 This specifies the times when the locks are relinquished
3031 in microseconds. Smaller values may be helpful for high
3032 throughput computing when used in conjunction with the
3033 bf_continue option. Also see the bf_yield_sleep option.
3034 Default: 2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10
3035 sec).
3036
3037 bf_yield_sleep=#
3038 The backfill scheduler will periodically relinquish locks
3039 in order for other pending operations to take place.
3040 This specifies the length of time for which the locks are
3041 relinquished in microseconds. Also see the bf_yield_in‐
3042 terval option. Default: 500,000 (0.5 sec), Min: 1, Max:
3043 10,000,000 (10 sec).
3044
3045 build_queue_timeout=#
3046 Defines the maximum time that can be devoted to building
3047 a queue of jobs to be tested for scheduling. If the sys‐
3048 tem has a huge number of jobs with dependencies, just
3049 building the job queue can take so much time as to ad‐
3050 versely impact overall system performance and this param‐
3051 eter can be adjusted as needed. The default value is
3052 2,000,000 microseconds (2 seconds).
3053
3054 correspond_after_task_cnt=#
3055                   Defines the number of array tasks that get split for a
3056                   potential aftercorr dependency check. A low number may
3057                   result in dependent task check failures when the job
3058                   depended upon is purged before the split. Default: 10.
3059
3060 default_queue_depth=#
3061 The default number of jobs to attempt scheduling (i.e.
3062 the queue depth) when a running job completes or other
3063 routine actions occur, however the frequency with which
3064 the scheduler is run may be limited by using the defer or
3065 sched_min_interval parameters described below. The full
3066 queue will be tested on a less frequent basis as defined
3067 by the sched_interval option described below. The default
3068 value is 100. See the partition_job_depth option to
3069 limit depth by partition.
3070
3071 defer Setting this option will avoid attempting to schedule
3072 each job individually at job submit time, but defer it
3073 until a later time when scheduling multiple jobs simulta‐
3074 neously may be possible. This option may improve system
3075 responsiveness when large numbers of jobs (many hundreds)
3076 are submitted at the same time, but it will delay the
3077 initiation time of individual jobs. Also see de‐
3078 fault_queue_depth above.
3079
3080 delay_boot=#
3081                   Do not reboot nodes in order to satisfy this job's fea‐
3082 ture specification if the job has been eligible to run
3083 for less than this time period. If the job has waited
3084 for less than the specified period, it will use only
3085 nodes which already have the specified features. The ar‐
3086 gument is in units of minutes. Individual jobs may over‐
3087 ride this default value with the --delay-boot option.
3088
3089 disable_job_shrink
3090 Deny user requests to shrink the size of running jobs.
3091 (However, running jobs may still shrink due to node fail‐
3092 ure if the --no-kill option was set.)
3093
3094 disable_hetjob_steps
3095 Disable job steps that span heterogeneous job alloca‐
3096 tions.
3097
3098 enable_hetjob_steps
3099 Enable job steps that span heterogeneous job allocations.
3100 The default value.
3101
3102 enable_user_top
3103 Enable use of the "scontrol top" command by non-privi‐
3104 leged users.
3105
3106 Ignore_NUMA
3107 Some processors (e.g. AMD Opteron 6000 series) contain
3108 multiple NUMA nodes per socket. This is a configuration
3109 which does not map into the hardware entities that Slurm
3110 optimizes resource allocation for (PU/thread, core,
3111 socket, baseboard, node and network switch). In order to
3112 optimize resource allocations on such hardware, Slurm
3113 will consider each NUMA node within the socket as a sepa‐
3114 rate socket by default. Use the Ignore_NUMA option to re‐
3115 port the correct socket count, but not optimize resource
3116 allocations on the NUMA nodes.
3117
3118                   NOTE: Since hwloc 2.0, NUMA nodes are not part of the
3119                   main/CPU topology tree. Because of that, if Slurm is
3120                   built with hwloc 2.0 or above, Slurm will treat
3121                   HWLOC_OBJ_PACKAGE as a socket. You can change this be‐
3122                   havior using SlurmdParameters=l3cache_as_socket.
3123
3124 ignore_prefer_validation
3125                   If set and a job requests --prefer, any features in the
3126                   request that would create an invalid request on the
3127                   current system will not generate an error. This is help‐
3128 ful for dynamic systems where nodes with features come
3129 and go. Please note using this option will not protect
3130 you from typos.
3131
3132 max_array_tasks
3133 Specify the maximum number of tasks that can be included
3134 in a job array. The default limit is MaxArraySize, but
3135 this option can be used to set a lower limit. For exam‐
3136 ple, max_array_tasks=1000 and MaxArraySize=100001 would
3137 permit a maximum task ID of 100000, but limit the number
3138 of tasks in any single job array to 1000.
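
                  The example above can be written in slurm.conf as (values
                  taken directly from the text):

```
# Task IDs up to 100000 are valid, but any single job array
# may contain at most 1000 tasks.
MaxArraySize=100001
SchedulerParameters=max_array_tasks=1000
```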
3139
3140 max_rpc_cnt=#
3141 If the number of active threads in the slurmctld daemon
3142 is equal to or larger than this value, defer scheduling
3143 of jobs. The scheduler will check this condition at cer‐
3144 tain points in code and yield locks if necessary. This
3145 can improve Slurm's ability to process requests at a cost
3146 of initiating new jobs less frequently. Default: 0 (op‐
3147 tion disabled), Min: 0, Max: 1000.
3148
3149 NOTE: The maximum number of threads (MAX_SERVER_THREADS)
3150 is internally set to 256 and defines the number of served
3151 RPCs at a given time. Setting max_rpc_cnt to more than
3152 256 will be only useful to let backfill continue schedul‐
3153 ing work after locks have been yielded (i.e. each 2 sec‐
3154 onds) if there are a maximum of MAX(max_rpc_cnt/10, 20)
3155                   RPCs in the queue. E.g. with max_rpc_cnt=1000, the
3156                   scheduler will be allowed to continue after yielding
3157                   locks only when there are 100 or fewer pending RPCs.
3158 If a value is set, then a value of 10 or higher is recom‐
3159 mended. It may require some tuning for each system, but
3160 needs to be high enough that scheduling isn't always dis‐
3161 abled, and low enough that requests can get through in a
3162 reasonable period of time.
3163
3164 max_sched_time=#
3165 How long, in seconds, that the main scheduling loop will
3166 execute for before exiting. If a value is configured, be
3167 aware that all other Slurm operations will be deferred
3168 during this time period. Make certain the value is lower
3169 than MessageTimeout. If a value is not explicitly con‐
3170 figured, the default value is half of MessageTimeout with
3171 a minimum default value of 1 second and a maximum default
3172 value of 2 seconds. For example if MessageTimeout=10,
3173 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
3174
3175 max_script_size=#
3176 Specify the maximum size of a batch script, in bytes.
3177 The default value is 4 megabytes. Larger values may ad‐
3178 versely impact system performance.
3179
3180 max_switch_wait=#
3181 Maximum number of seconds that a job can delay execution
3182 waiting for the specified desired switch count. The de‐
3183 fault value is 300 seconds.
3184
3185 no_backup_scheduling
3186 If used, the backup controller will not schedule jobs
3187 when it takes over. The backup controller will allow jobs
3188 to be submitted, modified and cancelled but won't sched‐
3189 ule new jobs. This is useful in Cray environments when
3190 the backup controller resides on an external Cray node.
3191 A restart of slurmctld is required for changes to this
3192 parameter to take effect.
3193
3194 no_env_cache
3195                   If used, any job started on a node that fails to load
3196                   the environment will fail instead of using the cached
3197                   environment. This also implicitly enables the re‐
3198                   queue_setup_env_fail option.
3199
3200 nohold_on_prolog_fail
3201 By default, if the Prolog exits with a non-zero value the
3202 job is requeued in a held state. By specifying this pa‐
3203 rameter the job will be requeued but not held so that the
3204 scheduler can dispatch it to another host.
3205
3206 pack_serial_at_end
3207 If used with the select/cons_res or select/cons_tres
3208 plugin, then put serial jobs at the end of the available
3209 nodes rather than using a best fit algorithm. This may
3210 reduce resource fragmentation for some workloads.
3211
3212 partition_job_depth=#
3213 The default number of jobs to attempt scheduling (i.e.
3214 the queue depth) from each partition/queue in Slurm's
3215 main scheduling logic. The functionality is similar to
3216 that provided by the bf_max_job_part option for the back‐
3217 fill scheduling logic. The default value is 0 (no
3218                   limit). Jobs excluded from attempted scheduling based
3219 upon partition will not be counted against the de‐
3220 fault_queue_depth limit. Also see the bf_max_job_part
3221 option.
3222
3223 preempt_reorder_count=#
3224 Specify how many attempts should be made in reordering
3225 preemptable jobs to minimize the count of jobs preempted.
3226 The default value is 1. High values may adversely impact
3227 performance. The logic to support this option is only
3228 available in the select/cons_res and select/cons_tres
3229 plugins.
3230
3231 preempt_strict_order
3232 If set, then execute extra logic in an attempt to preempt
3233 only the lowest priority jobs. It may be desirable to
3234 set this configuration parameter when there are multiple
3235 priorities of preemptable jobs. The logic to support
3236 this option is only available in the select/cons_res and
3237 select/cons_tres plugins.
3238
3239 preempt_youngest_first
3240 If set, then the preemption sorting algorithm will be
3241 changed to sort by the job start times to favor preempt‐
3242 ing younger jobs over older. (Requires preempt/parti‐
3243 tion_prio or preempt/qos plugins.)
3244
3245 reduce_completing_frag
3246 This option is used to control how scheduling of re‐
3247 sources is performed when jobs are in the COMPLETING
3248 state, which influences potential fragmentation. If this
3249 option is not set then no jobs will be started in any
3250 partition when any job is in the COMPLETING state for
3251 less than CompleteWait seconds. If this option is set
3252 then no jobs will be started in any individual partition
3253 that has a job in COMPLETING state for less than Com‐
3254 pleteWait seconds. In addition, no jobs will be started
3255 in any partition with nodes that overlap with any nodes
3256 in the partition of the completing job. This option is
3257 to be used in conjunction with CompleteWait.
3258
3259 NOTE: CompleteWait must be set in order for this to work.
3260 If CompleteWait=0 then this option does nothing.
3261
3262 NOTE: reduce_completing_frag only affects the main sched‐
3263 uler, not the backfill scheduler.
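
                  A minimal sketch combining the two parameters (the 32-sec‐
                  ond CompleteWait value is an illustrative assumption):

```
# Jobs in COMPLETING state only block scheduling in their own
# partition (and partitions with overlapping nodes) for up to
# CompleteWait seconds; CompleteWait must be non-zero.
CompleteWait=32
SchedulerParameters=reduce_completing_frag
```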
3264
3265 requeue_setup_env_fail
3266                   By default, if job environment setup fails, the job keeps
3267 running with a limited environment. By specifying this
3268 parameter the job will be requeued in held state and the
3269 execution node drained.
3270
3271 salloc_wait_nodes
3272 If defined, the salloc command will wait until all allo‐
3273 cated nodes are ready for use (i.e. booted) before the
3274 command returns. By default, salloc will return as soon
3275 as the resource allocation has been made.
3276
3277 sbatch_wait_nodes
3278 If defined, the sbatch script will wait until all allo‐
3279 cated nodes are ready for use (i.e. booted) before the
3280 initiation. By default, the sbatch script will be initi‐
3281 ated as soon as the first node in the job allocation is
3282 ready. The sbatch command can use the --wait-all-nodes
3283 option to override this configuration parameter.
3284
3285 sched_interval=#
3286 How frequently, in seconds, the main scheduling loop will
3287 execute and test all pending jobs. The default value is
3288 60 seconds. A setting of -1 will disable the main sched‐
3289 uling loop.
3290
3291 sched_max_job_start=#
3292 The maximum number of jobs that the main scheduling logic
3293 will start in any single execution. The default value is
3294 zero, which imposes no limit.
3295
3296 sched_min_interval=#
3297 How frequently, in microseconds, the main scheduling loop
3298 will execute and test any pending jobs. The scheduler
3299 runs in a limited fashion every time that any event hap‐
3300 pens which could enable a job to start (e.g. job submit,
3301 job terminate, etc.). If these events happen at a high
3302 frequency, the scheduler can run very frequently and con‐
3303 sume significant resources if not throttled by this op‐
3304 tion. This option specifies the minimum time between the
3305 end of one scheduling cycle and the beginning of the next
3306 scheduling cycle. A value of zero will disable throt‐
3307 tling of the scheduling logic interval. The default
3308 value is 2 microseconds.
3309
3310 spec_cores_first
3311 Specialized cores will be selected from the first cores
3312 of the first sockets, cycling through the sockets on a
3313 round robin basis. By default, specialized cores will be
3314 selected from the last cores of the last sockets, cycling
3315 through the sockets on a round robin basis.
3316
3317 step_retry_count=#
3318 When a step completes and there are steps ending resource
3319 allocation, then retry step allocations for at least this
3320 number of pending steps. Also see step_retry_time. The
3321 default value is 8 steps.
3322
3323 step_retry_time=#
3324 When a step completes and there are steps ending resource
3325 allocation, then retry step allocations for all steps
3326 which have been pending for at least this number of sec‐
3327 onds. Also see step_retry_count. The default value is
3328 60 seconds.
3329
3330 whole_hetjob
3331 Requests to cancel, hold or release any component of a
3332 heterogeneous job will be applied to all components of
3333 the job.
3334
3335 NOTE: this option was previously named whole_pack and
3336                   this is still supported for backward compatibility.
3337
3338 SchedulerTimeSlice
3339 Number of seconds in each time slice when gang scheduling is en‐
3340 abled (PreemptMode=SUSPEND,GANG). The value must be between 5
3341 seconds and 65533 seconds. The default value is 30 seconds.
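
       For example, gang scheduling with the default time slice could be con‐
       figured as follows (a sketch using the PreemptMode value named above):

```
# Gang scheduling: suspend and resume jobs in 30-second time slices.
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30
```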
3342
3343 SchedulerType
3344 Identifies the type of scheduler to be used. A restart of
3345 slurmctld is required for changes to this parameter to take ef‐
3346 fect. The scontrol command can be used to manually change job
3347 priorities if desired. Acceptable values include:
3348
3349 sched/backfill
3350 For a backfill scheduling module to augment the default
3351 FIFO scheduling. Backfill scheduling will initiate
3352 lower-priority jobs if doing so does not delay the ex‐
3353 pected initiation time of any higher priority job. Ef‐
3354 fectiveness of backfill scheduling is dependent upon
3355 users specifying job time limits, otherwise all jobs will
3356 have the same time limit and backfilling is impossible.
3357                   See the documentation for the SchedulerParameters option
3358 above. This is the default configuration.
3359
3360 sched/builtin
3361 This is the FIFO scheduler which initiates jobs in prior‐
3362 ity order. If any job in the partition can not be sched‐
3363 uled, no lower priority job in that partition will be
3364 scheduled. An exception is made for jobs that can not
3365 run due to partition constraints (e.g. the time limit) or
3366 down/drained nodes. In that case, lower priority jobs
3367 can be initiated and not impact the higher priority job.
3368
3369 ScronParameters
3370 Multiple options may be comma separated.
3371
3372 enable Enable the use of scrontab to submit and manage periodic
3373 repeating jobs.
3374
3375 SelectType
3376 Identifies the type of resource selection algorithm to be used.
3377 A restart of slurmctld and slurmd is required for changes to
3378 this parameter to take effect. When changed, all job information
3379 (running and pending) will be lost, since the job state save
3380 format used by each plugin is different. The only exception to
3381 this is when changing from cons_res to cons_tres or from
3382 cons_tres to cons_res. However, if a job contains cons_tres-spe‐
3383 cific features and then SelectType is changed to cons_res, the
3384 job will be canceled, since there is no way for cons_res to sat‐
3385 isfy requirements specific to cons_tres.
3386
3387 Acceptable values include
3388
3389 select/cons_res
3390 The resources (cores and memory) within a node are indi‐
3391 vidually allocated as consumable resources. Note that
3392 whole nodes can be allocated to jobs for selected parti‐
3393 tions by using the OverSubscribe=Exclusive option. See
3394 the partition OverSubscribe parameter for more informa‐
3395 tion.
3396
3397 select/cons_tres
3398 The resources (cores, memory, GPUs and all other track‐
3399 able resources) within a node are individually allocated
3400 as consumable resources. Note that whole nodes can be
3401 allocated to jobs for selected partitions by using the
3402 OverSubscribe=Exclusive option. See the partition Over‐
3403 Subscribe parameter for more information.
3404
3405 select/cray_aries
3406 for a Cray system. The default value is "se‐
3407 lect/cray_aries" for all Cray systems.
3408
3409 select/linear
3410 for allocation of entire nodes assuming a one-dimensional
3411 array of nodes in which sequentially ordered nodes are
3412 preferable. For a heterogeneous cluster (e.g. different
3413 CPU counts on the various nodes), resource allocations
3414 will favor nodes with high CPU counts as needed based
3415 upon the job's node and CPU specification if TopologyPlu‐
3416 gin=topology/none is configured. Use of other topology
3417 plugins with select/linear and heterogeneous nodes is not
3418 recommended and may result in valid job allocation re‐
3419 quests being rejected. The linear plugin is not designed
3420 to track generic resources on a node. In cases where
3421 generic resources (such as GPUs) need to be tracked, the
3422 cons_res or cons_tres plugins should be used instead.
3423 This is the default value.
3424
3425 SelectTypeParameters
3426 The permitted values of SelectTypeParameters depend upon the
3427 configured value of SelectType. The only supported options for
3428 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3429 which treats memory as a consumable resource and prevents memory
3430 over subscription with job preemption or gang scheduling. By
3431 default SelectType=select/linear allocates whole nodes to jobs
3432 without considering their memory consumption. By default Se‐
3433 lectType=select/cons_res, SelectType=select/cray_aries, and Se‐
3434 lectType=select/cons_tres, use CR_Core_Memory, which allocates
3435          cores to jobs while considering their memory consumption.
3436
3437 A restart of slurmctld is required for changes to this parameter
3438 to take effect.
3439
3440 The following options are supported for SelectType=se‐
3441 lect/cray_aries:
3442
3443 OTHER_CONS_RES
3444                   Layer the select/cons_res plugin under the se‐
3445                   lect/cray_aries plugin; the default is to layer on se‐
3446                   lect/linear. This also allows all the options available
3447 for SelectType=select/cons_res.
3448
3449 OTHER_CONS_TRES
3450                   Layer the select/cons_tres plugin under the se‐
3451                   lect/cray_aries plugin; the default is to layer on se‐
3452                   lect/linear. This also allows all the options available
3453 for SelectType=select/cons_tres.
3454
3455 The following options are supported by the SelectType=select/cons_res
3456 and SelectType=select/cons_tres plugins:
3457
3458 CR_CPU CPUs are consumable resources. Configure the number of
3459 CPUs on each node, which may be equal to the count of
3460 cores or hyper-threads on the node depending upon the de‐
3461 sired minimum resource allocation. The node's Boards,
3462 Sockets, CoresPerSocket and ThreadsPerCore may optionally
3463 be configured and result in job allocations which have
3464 improved locality; however doing so will prevent more
3465 than one job from being allocated on each core.
3466
3467 CR_CPU_Memory
3468 CPUs and memory are consumable resources. Configure the
3469 number of CPUs on each node, which may be equal to the
3470 count of cores or hyper-threads on the node depending
3471 upon the desired minimum resource allocation. The node's
3472 Boards, Sockets, CoresPerSocket and ThreadsPerCore may
3473 optionally be configured and result in job allocations
3474 which have improved locality; however doing so will pre‐
3475 vent more than one job from being allocated on each core.
3476 Setting a value for DefMemPerCPU is strongly recommended.
3477
3478 CR_Core
3479 Cores are consumable resources. On nodes with hy‐
3480 per-threads, each thread is counted as a CPU to satisfy a
3481 job's resource requirement, but multiple jobs are not al‐
3482 located threads on the same core. The count of CPUs al‐
3483 located to a job is rounded up to account for every CPU
3484                   on an allocated core. This also impacts total allocated
3485                   memory when --mem-per-cpu is used, making it a multiple
3486                   of the total number of CPUs on the allocated cores.
3487
3488 CR_Core_Memory
3489 Cores and memory are consumable resources. On nodes with
3490 hyper-threads, each thread is counted as a CPU to satisfy
3491 a job's resource requirement, but multiple jobs are not
3492 allocated threads on the same core. The count of CPUs
3493 allocated to a job may be rounded up to account for every
3494 CPU on an allocated core. Setting a value for DefMemPer‐
3495 CPU is strongly recommended.
3496
3497 CR_ONE_TASK_PER_CORE
3498 Allocate one task per core by default. Without this op‐
3499 tion, by default one task will be allocated per thread on
3500 nodes with more than one ThreadsPerCore configured.
3501 NOTE: This option cannot be used with CR_CPU*.
3502
3503 CR_CORE_DEFAULT_DIST_BLOCK
3504 Allocate cores within a node using block distribution by
3505 default. This is a pseudo-best-fit algorithm that mini‐
3506 mizes the number of boards and minimizes the number of
3507 sockets (within minimum boards) used for the allocation.
3508 This default behavior can be overridden specifying a par‐
3509 ticular "-m" parameter with srun/salloc/sbatch. Without
3510 this option, cores will be allocated cyclically across
3511 the sockets.
3512
3513 CR_LLN Schedule resources to jobs on the least loaded nodes
3514 (based upon the number of idle CPUs). This is generally
3515 only recommended for an environment with serial jobs as
3516 idle resources will tend to be highly fragmented, result‐
3517 ing in parallel jobs being distributed across many nodes.
3518 Note that node Weight takes precedence over how many idle
3519 resources are on each node. Also see the partition con‐
3520                   figuration parameter LLN to use the least loaded nodes
3521                   in selected partitions.
3522
3523 CR_Pack_Nodes
3524 If a job allocation contains more resources than will be
3525 used for launching tasks (e.g. if whole nodes are allo‐
3526 cated to a job), then rather than distributing a job's
3527 tasks evenly across its allocated nodes, pack them as
3528 tightly as possible on these nodes. For example, con‐
3529 sider a job allocation containing two entire nodes with
3530 eight CPUs each. If the job starts ten tasks across
3531 those two nodes without this option, it will start five
3532 tasks on each of the two nodes. With this option, eight
3533 tasks will be started on the first node and two tasks on
3534 the second node. This can be superseded by "NoPack" in
3535 srun's "--distribution" option. CR_Pack_Nodes only ap‐
3536 plies when the "block" task distribution method is used.
3537
3538 CR_Socket
3539 Sockets are consumable resources. On nodes with multiple
3540 cores, each core or thread is counted as a CPU to satisfy
3541 a job's resource requirement, but multiple jobs are not
3542 allocated resources on the same socket.
3543
3544 CR_Socket_Memory
3545 Memory and sockets are consumable resources. On nodes
3546 with multiple cores, each core or thread is counted as a
3547 CPU to satisfy a job's resource requirement, but multiple
3548 jobs are not allocated resources on the same socket.
3549 Setting a value for DefMemPerCPU is strongly recommended.
3550
3551 CR_Memory
3552 Memory is a consumable resource. NOTE: This implies
3553 OverSubscribe=YES or OverSubscribe=FORCE for all parti‐
3554 tions. Setting a value for DefMemPerCPU is strongly rec‐
3555 ommended.
3556
3557                   NOTE: If memory isn't configured as a consumable re‐
3558                   source (CR_CPU, CR_Core or CR_Socket without _Memory),
3559                   memory can be over‐
3560 subscribed. In this case the --mem option is only used to
3561 filter out nodes with lower configured memory and does
3562 not take running jobs into account. For instance, two
3563 jobs requesting all the memory of a node can run at the
3564 same time.
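
              Putting the pieces together, a common consumable-resource con‐
              figuration might look like the following sketch (the DefMemPer‐
              CPU value is an illustrative assumption):

```
# Allocate individual cores and memory; jobs that omit a memory
# request default to 2048 MB per allocated CPU.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=2048
```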
3565
3566 SlurmctldAddr
3567 An optional address to be used for communications to the cur‐
3568 rently active slurmctld daemon, normally used with Virtual IP
3569 addressing of the currently active server. If this parameter is
3570 not specified then each primary and backup server will have its
3571 own unique address used for communications as specified in the
3572 SlurmctldHost parameter. If this parameter is specified then
3573 the SlurmctldHost parameter will still be used for communica‐
3574 tions to specific slurmctld primary or backup servers, for exam‐
3575 ple to cause all of them to read the current configuration files
3576              or shutdown.  Also see the SlurmctldPrimaryOffProg and Slurm‐
3577              ctldPrimaryOnProg configuration parameters for configuring
3578              programs that manage the virtual IP address.
3579
3580 SlurmctldDebug
3581              The level of detail to provide in the slurmctld daemon's logs.
3582              The default value is info.  If the slurmctld daemon is initi‐
3583              ated with the -v or --verbose options, that debug level will be
3584              preserved or restored upon reconfiguration.
3585
3586 quiet Log nothing
3587
3588 fatal Log only fatal errors
3589
3590 error Log only errors
3591
3592 info Log errors and general informational messages
3593
3594 verbose Log errors and verbose informational messages
3595
3596 debug Log errors and verbose informational messages and de‐
3597 bugging messages
3598
3599 debug2 Log errors and verbose informational messages and more
3600 debugging messages
3601
3602 debug3 Log errors and verbose informational messages and even
3603 more debugging messages
3604
3605 debug4 Log errors and verbose informational messages and even
3606 more debugging messages
3607
3608 debug5 Log errors and verbose informational messages and even
3609 more debugging messages
3610
3611 SlurmctldHost
3612              The short, or long, hostname of the machine where the Slurm
3613              control daemon is executed (i.e. the name returned by the com‐
3614              mand "hostname -s"). This hostname is optionally followed by the address,
3615 either the IP address or a name by which the address can be
3616 identified, enclosed in parentheses (e.g. SlurmctldHost=slurm‐
3617 ctl-primary(12.34.56.78)). This value must be specified at least
3618 once. If specified more than once, the first hostname named will
3619 be where the daemon runs. If the first specified host fails,
3620              the daemon will execute on the second host.  If both the first
3621              and second specified hosts fail, the daemon will execute on the
3622 third host. A restart of slurmctld is required for changes to
3623 this parameter to take effect.
3624
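              For example, a primary controller with two backups, each with an
              explicit address, could be configured as follows (hostnames and
              addresses here are illustrative):

              ```conf
              # Failover order: primary first, then backups.
              SlurmctldHost=ctl-primary(10.0.0.1)
              SlurmctldHost=ctl-backup1(10.0.0.2)
              SlurmctldHost=ctl-backup2(10.0.0.3)
              ```
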
3625 SlurmctldLogFile
3626 Fully qualified pathname of a file into which the slurmctld dae‐
3627 mon's logs are written. The default value is none (performs
3628 logging via syslog).
3629 See the section LOGGING if a pathname is specified.
3630
3631 SlurmctldParameters
3632 Multiple options may be comma separated.
3633
3634 allow_user_triggers
3635 Permit setting triggers from non-root/slurm_user users.
3636 SlurmUser must also be set to root to permit these trig‐
3637 gers to work. See the strigger man page for additional
3638 details.
3639
3640 cloud_dns
3641 By default, Slurm expects that the network address for a
3642 cloud node won't be known until the creation of the node
3643 and that Slurm will be notified of the node's address
3644 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3645 Since Slurm communications rely on the node configuration
3646 found in the slurm.conf, Slurm will tell the client com‐
3647                     mand, after waiting for all nodes to boot, each node's IP
3648 address. However, in environments where the nodes are in
3649 DNS, this step can be avoided by configuring this option.
3650
3651 cloud_reg_addrs
3652 When a cloud node registers, the node's NodeAddr and
3653 NodeHostName will automatically be set. They will be re‐
3654 set back to the nodename after powering off.
3655
3656 enable_configless
3657 Permit "configless" operation by the slurmd, slurmstepd,
3658 and user commands. When enabled the slurmd will be per‐
3659 mitted to retrieve config files from the slurmctld, and
3660 on any 'scontrol reconfigure' command new configs will be
3661 automatically pushed out and applied to nodes that are
3662 running in this "configless" mode. A restart of slurm‐
3663 ctld is required for changes to this parameter to take
3664 effect. NOTE: Included files with the Include directive
3665 will only be pushed if the filename has no path separa‐
3666 tors and is located adjacent to slurm.conf.
3667
3668 idle_on_node_suspend
3669 Mark nodes as idle, regardless of current state, when
3670 suspending nodes with SuspendProgram so that nodes will
3671 be eligible to be resumed at a later time.
3672
3673 node_reg_mem_percent=#
3674 Percentage of memory a node is allowed to register with
3675 without being marked as invalid with low memory. Default
3676 is 100. For State=CLOUD nodes, the default is 90. To dis‐
3677 able this for cloud nodes set it to 100. config_overrides
3678 takes precedence over this option.
3679
3680                     It is recommended to configure task/cgroup with Con‐
3681                     strainRamSpace.  A memory cgroup limit won't be set
3682                     higher than the actual memory on the node. If needed,
3683                     configure AllowedRamSpace in the cgroup.conf to add a
3684                     buffer.
3684
3685 power_save_interval
3686 How often the power_save thread looks to resume and sus‐
3687 pend nodes. The power_save thread will do work sooner if
3688 there are node state changes. Default is 10 seconds.
3689
3690 power_save_min_interval
3691 How often the power_save thread, at a minimum, looks to
3692 resume and suspend nodes. Default is 0.
3693
3694 max_dbd_msg_action
3695 Action used once MaxDBDMsgs is reached, options are 'dis‐
3696 card' (default) and 'exit'.
3697
3698                     When 'discard' is specified and MaxDBDMsgs is reached,
3699                     we start by purging pending messages of types Step start
3700                     and complete; when MaxDBDMsgs is reached again, Job
3701                     start messages are purged. Job completes and node state
3702                     changes continue to consume the space freed by these
3703                     purges until MaxDBDMsgs is reached again, at which point
3704                     no new message is tracked, creating data loss and poten‐
3705                     tially runaway jobs.
3706
3707 When 'exit' is specified and MaxDBDMsgs is reached the
3708 slurmctld will exit instead of discarding any messages.
3709                     It will be impossible to start the slurmctld with this
3710                     option when the slurmdbd is down and the slurmctld is
3711                     tracking more than MaxDBDMsgs.
3712
3713 preempt_send_user_signal
3714 Send the user signal (e.g. --signal=<sig_num>) at preemp‐
3715 tion time even if the signal time hasn't been reached. In
3716 the case of a gracetime preemption the user signal will
3717 be sent if the user signal has been specified and not
3718 sent, otherwise a SIGTERM will be sent to the tasks.
3719
3720 reboot_from_controller
3721 Run the RebootProgram from the controller instead of on
3722 the slurmds. The RebootProgram will be passed a
3723 comma-separated list of nodes to reboot as the first ar‐
3724 gument and if applicable the required features needed for
3725 reboot as the second argument.
3726
3727 user_resv_delete
3728 Allow any user able to run in a reservation to delete it.
3729
3730 SlurmctldPidFile
3731 Fully qualified pathname of a file into which the slurmctld
3732 daemon may write its process id. This may be used for automated
3733 signal processing. The default value is "/var/run/slurm‐
3734 ctld.pid".
3735
3736 SlurmctldPlugstack
3737 A comma-delimited list of Slurm controller plugins to be started
3738 when the daemon begins and terminated when it ends. Only the
3739 plugin's init and fini functions are called.
3740
3741 SlurmctldPort
3742 The port number that the Slurm controller, slurmctld, listens to
3743 for work. The default value is SLURMCTLD_PORT as established at
3744 system build time. If none is explicitly specified, it will be
3745 set to 6817. SlurmctldPort may also be configured to support a
3746 range of port numbers in order to accept larger bursts of incom‐
3747 ing messages by specifying two numbers separated by a dash (e.g.
3748 SlurmctldPort=6817-6818). A restart of slurmctld is required
3749              for changes to this parameter to take effect.  NOTE: Either the
3750              slurmctld and slurmd daemons must not execute on the same
3751              nodes, or the values of SlurmctldPort and SlurmdPort must be
3752              different.
3752
3753 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3754 automatically try to interact with anything opened on ports
3755 8192-60000. Configure SlurmctldPort to use a port outside of
3756 the configured SrunPortRange and RSIP's port range.
3757
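              As a sketch, a controller accepting message bursts on a two-port
              range while slurmd listens on its own port could be configured as
              follows (port values illustrative):

              ```conf
              SlurmctldPort=6817-6818
              SlurmdPort=6819
              ```
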
3758 SlurmctldPrimaryOffProg
3759 This program is executed when a slurmctld daemon running as the
3760 primary server becomes a backup server. By default no program is
3761 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3762 ter.
3763
3764 SlurmctldPrimaryOnProg
3765 This program is executed when a slurmctld daemon running as a
3766 backup server becomes the primary server. By default no program
3767              is executed. When using virtual IP addresses to manage Highly
3768              Available Slurm services, this program can be used to add the IP
3769 address to an interface (and optionally try to kill the unre‐
3770 sponsive slurmctld daemon and flush the ARP caches on nodes on
3771 the local Ethernet fabric). See also the related "SlurmctldPri‐
3772 maryOffProg" parameter.
3773
3774 SlurmctldSyslogDebug
3775 The slurmctld daemon will log events to the syslog file at the
3776              specified level of detail. If not set, the slurmctld daemon will
3777              log to syslog at level fatal, unless there is no SlurmctldLog‐
3778              File and it is running in the background, in which case it will
3779              log to syslog at the level specified by SlurmctldDebug (at fatal
3780              if SlurmctldDebug is set to quiet); if it is run in the fore‐
3781              ground, the level will be set to quiet.
3782
3783 quiet Log nothing
3784
3785 fatal Log only fatal errors
3786
3787 error Log only errors
3788
3789 info Log errors and general informational messages
3790
3791 verbose Log errors and verbose informational messages
3792
3793 debug Log errors and verbose informational messages and de‐
3794 bugging messages
3795
3796 debug2 Log errors and verbose informational messages and more
3797 debugging messages
3798
3799 debug3 Log errors and verbose informational messages and even
3800 more debugging messages
3801
3802 debug4 Log errors and verbose informational messages and even
3803 more debugging messages
3804
3805 debug5 Log errors and verbose informational messages and even
3806 more debugging messages
3807
3808 NOTE: By default, Slurm's systemd service files start daemons in
3809 the foreground with the -D option. This means that systemd will
3810 capture stdout/stderr output and print that to syslog, indepen‐
3811 dent of Slurm printing to syslog directly. To prevent systemd
3812 from doing this, add "StandardOutput=null" and "StandardEr‐
3813 ror=null" to the respective service files or override files.
3814
3815 SlurmctldTimeout
3816 The interval, in seconds, that the backup controller waits for
3817 the primary controller to respond before assuming control. The
3818 default value is 120 seconds. May not exceed 65533.
3819
3820 SlurmdDebug
3821              The level of detail to provide in the slurmd daemon's logs.
3822              The default value is info.
3823
3824 quiet Log nothing
3825
3826 fatal Log only fatal errors
3827
3828 error Log only errors
3829
3830 info Log errors and general informational messages
3831
3832 verbose Log errors and verbose informational messages
3833
3834 debug Log errors and verbose informational messages and de‐
3835 bugging messages
3836
3837 debug2 Log errors and verbose informational messages and more
3838 debugging messages
3839
3840 debug3 Log errors and verbose informational messages and even
3841 more debugging messages
3842
3843 debug4 Log errors and verbose informational messages and even
3844 more debugging messages
3845
3846 debug5 Log errors and verbose informational messages and even
3847 more debugging messages
3848
3849 SlurmdLogFile
3850 Fully qualified pathname of a file into which the slurmd dae‐
3851 mon's logs are written. The default value is none (performs
3852 logging via syslog). The first "%h" within the name is replaced
3853 with the hostname on which the slurmd is running. The first
3854 "%n" within the name is replaced with the Slurm node name on
3855 which the slurmd is running.
3856 See the section LOGGING if a pathname is specified.
3857
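              For example, to give each host its own log file using the "%h"
              substitution (path illustrative):

              ```conf
              # "%h" expands to the hostname on which slurmd is running.
              SlurmdLogFile=/var/log/slurm/slurmd.%h.log
              ```
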
3858 SlurmdParameters
3859 Parameters specific to the Slurmd. Multiple options may be
3860 comma separated.
3861
3862 config_overrides
3863 If set, consider the configuration of each node to be
3864 that specified in the slurm.conf configuration file and
3865 any node with less than the configured resources will not
3866 be set to INVAL/INVALID_REG. This option is generally
3867 only useful for testing purposes. Equivalent to the now
3868 deprecated FastSchedule=2 option.
3869
3870 l3cache_as_socket
3871 Use the hwloc l3cache as the socket count. Can be useful
3872 on certain processors where the socket level is too
3873 coarse, and the l3cache may provide better task distribu‐
3874 tion. (E.g., along CCX boundaries instead of socket
3875 boundaries.) Mutually exclusive with
3876 numa_node_as_socket. Requires hwloc v2.
3877
3878 numa_node_as_socket
3879                     Use the hwloc NUMA node to determine the main hierarchy
3880                     object to be used as the socket. If the option is set,
3881                     Slurm will check the parent object of the NUMA node and
3882                     use it as the socket. This option may be useful for ar‐
3883                     chitectures like AMD Epyc, where the number of NUMA
3884                     nodes per socket may be configured.  Mutually exclusive
3885                     with l3cache_as_socket.  Requires hwloc v2.
3886
3887 shutdown_on_reboot
3888 If set, the Slurmd will shut itself down when a reboot
3889 request is received.
3890
3891 SlurmdPidFile
3892 Fully qualified pathname of a file into which the slurmd daemon
3893 may write its process id. This may be used for automated signal
3894 processing. The first "%h" within the name is replaced with the
3895 hostname on which the slurmd is running. The first "%n" within
3896 the name is replaced with the Slurm node name on which the
3897 slurmd is running. The default value is "/var/run/slurmd.pid".
3898
3899 SlurmdPort
3900 The port number that the Slurm compute node daemon, slurmd, lis‐
3901 tens to for work. The default value is SLURMD_PORT as estab‐
3902 lished at system build time. If none is explicitly specified,
3903 its value will be 6818. A restart of slurmctld is required for
3904              changes to this parameter to take effect.  NOTE: Either the
3905              slurmctld and slurmd daemons must not execute on the same
3906              nodes, or the values of SlurmctldPort and SlurmdPort must be
3907              different.
3907
3908 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3909 automatically try to interact with anything opened on ports
3910 8192-60000. Configure SlurmdPort to use a port outside of the
3911 configured SrunPortRange and RSIP's port range.
3912
3913 SlurmdSpoolDir
3914 Fully qualified pathname of a directory into which the slurmd
3915 daemon's state information and batch job script information are
3916 written. This must be a common pathname for all nodes, but
3917 should represent a directory which is local to each node (refer‐
3918 ence a local file system). The default value is
3919 "/var/spool/slurmd". The first "%h" within the name is replaced
3920 with the hostname on which the slurmd is running. The first
3921 "%n" within the name is replaced with the Slurm node name on
3922 which the slurmd is running.
3923
3924 SlurmdSyslogDebug
3925 The slurmd daemon will log events to the syslog file at the
3926              specified level of detail. If not set, the slurmd daemon will
3927              log to syslog at level fatal, unless there is no SlurmdLogFile
3928              and it is running in the background, in which case it will log
3929              to syslog at the level specified by SlurmdDebug (at fatal if
3930              SlurmdDebug is set to quiet); if it is run in the foreground,
3931              the level will be set to quiet.
3932
3933 quiet Log nothing
3934
3935 fatal Log only fatal errors
3936
3937 error Log only errors
3938
3939 info Log errors and general informational messages
3940
3941 verbose Log errors and verbose informational messages
3942
3943 debug Log errors and verbose informational messages and de‐
3944 bugging messages
3945
3946 debug2 Log errors and verbose informational messages and more
3947 debugging messages
3948
3949 debug3 Log errors and verbose informational messages and even
3950 more debugging messages
3951
3952 debug4 Log errors and verbose informational messages and even
3953 more debugging messages
3954
3955 debug5 Log errors and verbose informational messages and even
3956 more debugging messages
3957
3958 NOTE: By default, Slurm's systemd service files start daemons in
3959 the foreground with the -D option. This means that systemd will
3960 capture stdout/stderr output and print that to syslog, indepen‐
3961 dent of Slurm printing to syslog directly. To prevent systemd
3962 from doing this, add "StandardOutput=null" and "StandardEr‐
3963 ror=null" to the respective service files or override files.
3964
3965 SlurmdTimeout
3966 The interval, in seconds, that the Slurm controller waits for
3967 slurmd to respond before configuring that node's state to DOWN.
3968 A value of zero indicates the node will not be tested by slurm‐
3969 ctld to confirm the state of slurmd, the node will not be auto‐
3970 matically set to a DOWN state indicating a non-responsive
3971 slurmd, and some other tool will take responsibility for moni‐
3972 toring the state of each compute node and its slurmd daemon.
3973 Slurm's hierarchical communication mechanism is used to ping the
3974 slurmd daemons in order to minimize system noise and overhead.
3975 The default value is 300 seconds. The value may not exceed
3976 65533 seconds.
3977
3978 SlurmdUser
3979 The name of the user that the slurmd daemon executes as. This
3980 user must exist on all nodes of the cluster for authentication
3981 of communications between Slurm components. The default value
3982 is "root".
3983
3984 SlurmSchedLogFile
3985 Fully qualified pathname of the scheduling event logging file.
3986 The syntax of this parameter is the same as for SlurmctldLog‐
3987 File. In order to configure scheduler logging, set both the
3988 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3989
3990 SlurmSchedLogLevel
3991 The initial level of scheduling event logging, similar to the
3992 SlurmctldDebug parameter used to control the initial level of
3993 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3994 (scheduler logging disabled) and "1" (scheduler logging en‐
3995 abled). If this parameter is omitted, the value defaults to "0"
3996 (disabled). In order to configure scheduler logging, set both
3997 the SlurmSchedLogFile and SlurmSchedLogLevel parameters. The
3998 scheduler logging level can be changed dynamically using scon‐
3999 trol.
4000
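              To turn on scheduler event logging, both parameters are set
              together, for example (path illustrative):

              ```conf
              SlurmSchedLogFile=/var/log/slurm/sched.log
              SlurmSchedLogLevel=1
              ```
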
4001 SlurmUser
4002 The name of the user that the slurmctld daemon executes as. For
4003 security purposes, a user other than "root" is recommended.
4004 This user must exist on all nodes of the cluster for authentica‐
4005 tion of communications between Slurm components. The default
4006 value is "root".
4007
4008 SrunEpilog
4009 Fully qualified pathname of an executable to be run by srun fol‐
4010 lowing the completion of a job step. The command line arguments
4011 for the executable will be the command and arguments of the job
4012 step. This configuration parameter may be overridden by srun's
4013 --epilog parameter. Note that while the other "Epilog" executa‐
4014 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
4015 where the tasks are executed, the SrunEpilog runs on the node
4016 where the "srun" is executing.
4017
4018 SrunPortRange
4019              srun creates a set of listening ports to communicate with the
4020              controller and the slurmstepd, and to handle the application
4021              I/O. By default these ports are ephemeral, meaning the port num‐
4022              bers are selected by the kernel. Using this parameter allows
4023              sites to configure a range of ports from which srun ports will
4024              be selected. This is useful if sites want to allow only a cer‐
4025              tain port range on their network.
4026
4027 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4028 automatically try to interact with anything opened on ports
4029 8192-60000. Configure SrunPortRange to use a range of ports
4030 above those used by RSIP, ideally 1000 or more ports, for exam‐
4031 ple "SrunPortRange=60001-63000".
4032
4033 Note: SrunPortRange must be large enough to cover the expected
4034 number of srun ports created on a given submission node. A sin‐
4035 gle srun opens 3 listening ports plus 2 more for every 48 hosts.
4036 Example:
4037
4038 srun -N 48 will use 5 listening ports.
4039
4040 srun -N 50 will use 7 listening ports.
4041
4042 srun -N 200 will use 13 listening ports.
4043
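              The counts above follow a simple pattern: 3 base ports plus 2
              more for each whole or partial group of 48 hosts. The following
              Python sketch (the formula is inferred from the examples above,
              not taken from the Slurm source) reproduces them:

              ```python
              import math

              def srun_listen_ports(nhosts: int) -> int:
                  # 3 base listening ports, plus 2 for every (whole or
                  # partial) group of 48 hosts in the allocation.
                  return 3 + 2 * math.ceil(nhosts / 48)

              print(srun_listen_ports(48), srun_listen_ports(50), srun_listen_ports(200))
              # prints: 5 7 13
              ```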
4044 SrunProlog
4045 Fully qualified pathname of an executable to be run by srun
4046 prior to the launch of a job step. The command line arguments
4047 for the executable will be the command and arguments of the job
4048 step. This configuration parameter may be overridden by srun's
4049 --prolog parameter. Note that while the other "Prolog" executa‐
4050 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
4051 where the tasks are executed, the SrunProlog runs on the node
4052 where the "srun" is executing.
4053
4054 StateSaveLocation
4055 Fully qualified pathname of a directory into which the Slurm
4056 controller, slurmctld, saves its state (e.g. "/usr/lo‐
4057              cal/slurm/checkpoint"). Slurm state will be saved here to recover
4058 from system failures. SlurmUser must be able to create files in
4059 this directory. If you have a secondary SlurmctldHost config‐
4060 ured, this location should be readable and writable by both sys‐
4061 tems. Since all running and pending job information is stored
4062 here, the use of a reliable file system (e.g. RAID) is recom‐
4063 mended. The default value is "/var/spool". A restart of slurm‐
4064 ctld is required for changes to this parameter to take effect.
4065 If any slurm daemons terminate abnormally, their core files will
4066 also be written into this directory.
4067
4068 SuspendExcNodes
4069              Specifies the nodes which are not to be placed in power save
4070 mode, even if the node remains idle for an extended period of
4071 time. Use Slurm's hostlist expression to identify nodes with an
4072 optional ":" separator and count of nodes to exclude from the
4073 preceding range. For example "nid[10-20]:4" will prevent 4 us‐
4074              able nodes (i.e. IDLE and not DOWN, DRAINING or already powered
4075 down) in the set "nid[10-20]" from being powered down. Multiple
4076 sets of nodes can be specified with or without counts in a comma
4077 separated list (e.g "nid[10-20]:4,nid[80-90]:2"). If a node
4078 count specification is given, any list of nodes to NOT have a
4079 node count must be after the last specification with a count.
4080 For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes in
4081 the set "nid[10-20]:4" plus all nodes in the set "nid[60-70]"
4082 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
4083 "nid[1-3],nid[10-20]". By default no nodes are excluded.
4084
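              As a concrete sketch combining a counted and an uncounted set
              (node names illustrative; as described above, uncounted sets
              must follow the last counted one):

              ```conf
              # Keep 4 usable nodes of nid[10-20] powered up, plus all of nid[60-70].
              SuspendExcNodes=nid[10-20]:4,nid[60-70]
              ```
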
4085 SuspendExcParts
4086              Specifies the partitions whose nodes are not to be placed in
4087 power save mode, even if the node remains idle for an extended
4088 period of time. Multiple partitions can be identified and sepa‐
4089 rated by commas. By default no nodes are excluded.
4090
4091 SuspendProgram
4092 SuspendProgram is the program that will be executed when a node
4093 remains idle for an extended period of time. This program is
4094 expected to place the node into some power save mode. This can
4095 be used to reduce the frequency and voltage of a node or com‐
4096 pletely power the node off. The program executes as SlurmUser.
4097 The argument to the program will be the names of nodes to be
4098 placed into power savings mode (using Slurm's hostlist expres‐
4099 sion format). By default, no program is run.
4100
4101 SuspendRate
4102 The rate at which nodes are placed into power save mode by Sus‐
4103              pendProgram.  The value is the number of nodes per minute and
4104 be used to prevent a large drop in power consumption (e.g. after
4105 a large job completes). A value of zero results in no limits
4106 being imposed. The default value is 60 nodes per minute.
4107
4108 SuspendTime
4109 Nodes which remain idle or down for this number of seconds will
4110 be placed into power save mode by SuspendProgram. Setting Sus‐
4111 pendTime to anything but INFINITE (or -1) will enable power save
4112 mode. INFINITE is the default.
4113
4114 SuspendTimeout
4115 Maximum time permitted (in seconds) between when a node suspend
4116 request is issued and when the node is shutdown. At that time
4117 the node must be ready for a resume request to be issued as
4118 needed for new work. The default value is 30 seconds.
4119
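       Putting the suspend-related parameters together, a minimal power-saving
       sketch might look like this (paths and timings are illustrative;
       ResumeProgram is the companion parameter for powering nodes back up):

       ```conf
       SuspendProgram=/usr/local/sbin/slurm_suspend.sh
       ResumeProgram=/usr/local/sbin/slurm_resume.sh
       SuspendTime=600          # seconds idle before a node is suspended
       SuspendRate=20           # nodes per minute
       SuspendTimeout=30        # seconds allowed for a node to shut down
       SuspendExcParts=debug    # never power down the debug partition
       ```
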
4120 SwitchParameters
4121 Optional parameters for the switch plugin.
4122
4123 On HPE Slingshot systems configured with
4124 SwitchType=switch/hpe_slingshot, the following parameters are
4125 supported (separate multiple parameters with a comma):
4126
4127 vnis=<min>-<max>
4128 Range of VNIs to allocate for jobs and applications.
4129 This parameter is required.
4130
4131 tcs=<class1>[:<class2>]...
4132 Set of traffic classes to configure for applications.
4133 Supported traffic classes are DEDICATED_ACCESS, LOW_LA‐
4134 TENCY, BULK_DATA, and BEST_EFFORT.
4135
4136 single_node_vni
4137 Allocate a VNI for single node job steps.
4138
4139 job_vni
4140 Allocate an additional VNI for jobs, shared among all job
4141 steps.
4142
4143 def_<rsrc>=<val>
4144 Per-CPU reserved allocation for this resource.
4145
4146 res_<rsrc>=<val>
4147 Per-node reserved allocation for this resource. If set,
4148 overrides the per-CPU allocation.
4149
4150 max_<rsrc>=<val>
4151                     Maximum per-node allocation for this resource.
4152
4153 The resources that may be configured are:
4154
4155 txqs Transmit command queues. The default is 3 per-CPU, maxi‐
4156 mum 1024 per-node.
4157
4158 tgqs Target command queues. The default is 2 per-CPU, maximum
4159 512 per-node.
4160
4161 eqs Event queues. The default is 8 per-CPU, maximum 2048 per-
4162 node.
4163
4164 cts Counters. The default is 2 per-CPU, maximum 2048 per-
4165 node.
4166
4167 tles Trigger list entries. The default is 1 per-CPU, maximum
4168 2048 per-node.
4169
4170 ptes Portable table entries. The default is 8 per-CPU, maximum
4171 2048 per-node.
4172
4173 les List entries. The default is 134 per-CPU, maximum 65535
4174 per-node.
4175
4176 acs Addressing contexts. The default is 4 per-CPU, maximum
4177 1024 per-node.
4178
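              An illustrative SwitchParameters line for an HPE Slingshot
              system, combining the required VNI range with optional settings
              (all values are site-specific):

              ```conf
              SwitchType=switch/hpe_slingshot
              SwitchParameters=vnis=1024-65535,tcs=BEST_EFFORT:LOW_LATENCY,job_vni,def_eqs=16
              ```
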
4179 SwitchType
4180 Identifies the type of switch or interconnect used for applica‐
4181 tion communications. Acceptable values include
4182 "switch/cray_aries" for Cray systems, "switch/hpe_slingshot" for
4183 HPE Slingshot systems and "switch/none" for switches not requir‐
4184 ing special processing for job launch or termination (Ethernet,
4185 and InfiniBand). The default value is "switch/none". All Slurm
4186 daemons, commands and running jobs must be restarted for a
4187 change in SwitchType to take effect. If running jobs exist at
4188 the time slurmctld is restarted with a new value of SwitchType,
4189 records of all jobs in any state may be lost.
4190
4191 TaskEpilog
4192 Fully qualified pathname of a program to be executed as the
4193 slurm job's owner after termination of each task. See TaskPro‐
4194 log for execution order details.
4195
4196 TaskPlugin
4197 Identifies the type of task launch plugin, typically used to
4198 provide resource management within a node (e.g. pinning tasks to
4199 specific processors). More than one task plugin can be specified
4200 in a comma-separated list. The prefix of "task/" is optional.
4201 Acceptable values include:
4202
4203 task/affinity enables resource containment using
4204 sched_setaffinity(). This enables the --cpu-bind
4205 and/or --mem-bind srun options.
4206
4207 task/cgroup enables resource containment using Linux control
4208 cgroups. This enables the --cpu-bind and/or
4209 --mem-bind srun options. NOTE: see "man
4210 cgroup.conf" for configuration details.
4211
4212 task/none for systems requiring no special handling of user
4213 tasks. Lacks support for the --cpu-bind and/or
4214 --mem-bind srun options. The default value is
4215 "task/none".
4216
4217 NOTE: It is recommended to stack task/affinity,task/cgroup to‐
4218 gether when configuring TaskPlugin, and setting Constrain‐
4219 Cores=yes in cgroup.conf. This setup uses the task/affinity
4220 plugin for setting the affinity of the tasks and uses the
4221 task/cgroup plugin to fence tasks into the specified resources.
4222
4223 NOTE: For CRAY systems only: task/cgroup must be used with, and
4224 listed after task/cray_aries in TaskPlugin. The task/affinity
4225 plugin can be listed anywhere, but the previous constraint must
4226 be satisfied. For CRAY systems, a configuration like this is
4227 recommended:
4228 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
4229
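              On a typical non-Cray Linux cluster, the recommended stacking
              reads as follows (together with the matching setting in
              cgroup.conf):

              ```conf
              # slurm.conf
              TaskPlugin=task/affinity,task/cgroup
              # cgroup.conf (separate file)
              ConstrainCores=yes
              ```
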
4230 TaskPluginParam
4231 Optional parameters for the task plugin. Multiple options
4232 should be comma separated. None, Sockets, Cores and Threads are
4233 mutually exclusive and treated as a last possible source of
4234 --cpu-bind default. See also Node and Partition CpuBind options.
4235
4236 Cores Bind tasks to cores by default. Overrides automatic
4237 binding.
4238
4239 None Perform no task binding by default. Overrides automatic
4240 binding.
4241
4242 Sockets
4243 Bind to sockets by default. Overrides automatic binding.
4244
4245 Threads
4246 Bind to threads by default. Overrides automatic binding.
4247
4248 SlurmdOffSpec
4249 If specialized cores or CPUs are identified for the node
4250 (i.e. the CoreSpecCount or CpuSpecList are configured for
4251 the node), then Slurm daemons running on the compute node
4252 (i.e. slurmd and slurmstepd) should run outside of those
4253 resources (i.e. specialized resources are completely un‐
4254 available to Slurm daemons and jobs spawned by Slurm).
4255 This option may not be used with the task/cray_aries
4256 plugin.
4257
4258 Verbose
4259 Verbosely report binding before tasks run by default.
4260
4261 Autobind
4262 Set a default binding in the event that "auto binding"
4263 doesn't find a match. Set to Threads, Cores or Sockets
4264 (E.g. TaskPluginParam=autobind=threads).
4265
4266 TaskProlog
4267 Fully qualified pathname of a program to be executed as the
4268 slurm job's owner prior to initiation of each task. Besides the
4269 normal environment variables, this has SLURM_TASK_PID available
4270 to identify the process ID of the task being started. Standard
4271 output from this program can be used to control the environment
4272 variables and output for the user program.
4273
4274 export NAME=value Will set environment variables for the task
4275 being spawned. Everything after the equal
4276 sign to the end of the line will be used as
4277 the value for the environment variable. Ex‐
4278 porting of functions is not currently sup‐
4279 ported.
4280
4281 print ... Will cause that line (without the leading
4282 "print ") to be printed to the job's stan‐
4283 dard output.
4284
4285 unset NAME Will clear environment variables for the
4286 task being spawned.
4287
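              A minimal hypothetical TaskProlog script illustrating all three
              directives (the path, variable names, and values are
              illustrative; slurmd parses the script's standard output):

              ```shell
              #!/bin/sh
              # Hypothetical TaskProlog: every line echoed here is parsed by slurmd.
              task_prolog_output() {
                  # Set an environment variable for the task being spawned
                  echo "export SCRATCH_DIR=/tmp/job_scratch"
                  # Send a line to the job's standard output
                  echo "print task prolog ran for task PID ${SLURM_TASK_PID:-unknown}"
                  # Clear a variable inherited from the submission environment
                  echo "unset DISPLAY"
              }
              task_prolog_output
              ```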
4288 The order of task prolog/epilog execution is as follows:
4289
4290              1. pre_launch_priv()
4291                                Function in TaskPlugin
4292
4293              2. pre_launch()   Function in TaskPlugin
4294
4295              3. TaskProlog     System-wide per task program defined in
4296                                slurm.conf
4297
4298              4. User prolog    Job-step-specific task program defined using
4299                                srun's --task-prolog option or
4300                                SLURM_TASK_PROLOG environment variable
4301
4302              5. Task           Execute the job step's task
4303
4304              6. User epilog    Job-step-specific task program defined using
4305                                srun's --task-epilog option or
4306                                SLURM_TASK_EPILOG environment variable
4307
4308              7. TaskEpilog     System-wide per task program defined in
4309                                slurm.conf
4310
4311              8. post_term()    Function in TaskPlugin
4312
4313 TCPTimeout
4314 Time permitted for TCP connection to be established. Default
4315 value is 2 seconds.
4316
4317 TmpFS Fully qualified pathname of the file system available to user
4318 jobs for temporary storage. This parameter is used in establish‐
4319 ing a node's TmpDisk space. The default value is "/tmp".
4320
4321 TopologyParam
4322 Comma-separated options identifying network topology options.
4323
4324 Dragonfly Optimize allocation for Dragonfly network. Valid
4325 when TopologyPlugin=topology/tree.
4326
4327 TopoOptional Only optimize allocation for network topology if
4328 the job includes a switch option. Since optimiz‐
4329 ing resource allocation for topology involves
4330 much higher system overhead, this option can be
4331 used to impose the extra overhead only on jobs
4332 which can take advantage of it. If most job allo‐
4333 cations are not optimized for network topology,
4334 they may fragment resources to the point that
4335 topology optimization for other jobs will be dif‐
4336 ficult to achieve. NOTE: Jobs may span across
4337 nodes without common parent switches with this
4338 enabled.
4339
4340 TopologyPlugin
4341 Identifies the plugin to be used for determining the network
4342 topology and optimizing job allocations to minimize network con‐
4343 tention. See NETWORK TOPOLOGY below for details. Additional
4344 plugins may be provided in the future which gather topology in‐
4345 formation directly from the network. Acceptable values include:
4346
4347 topology/3d_torus best-fit logic over three-dimensional
4348 topology
4349
4350 topology/none default for other systems, best-fit logic
4351 over one-dimensional topology
4352
4353 topology/tree used for a hierarchical network as de‐
4354 scribed in a topology.conf file
4355
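For example, a hierarchical network where only jobs requesting a switch count pay the topology-aware scheduling overhead might be configured as follows (a sketch; the actual switch layout is described separately in topology.conf):

```
TopologyPlugin=topology/tree
TopologyParam=TopoOptional
```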
4356 TrackWCKey
4357              Boolean yes or no. Enables display and tracking of the Workload
4358              Characterization Key. Must be set to yes to track wckey usage
4359              correctly. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4360 file to create historical usage reports.
4361
4362 TreeWidth
4363 Slurmd daemons use a virtual tree network for communications.
4364 TreeWidth specifies the width of the tree (i.e. the fanout). On
4365 architectures with a front end node running the slurmd daemon,
4366 the value must always be equal to or greater than the number of
4367              front end nodes, which eliminates the need for message forwarding
4368 between the slurmd daemons. On other architectures the default
4369 value is 50, meaning each slurmd daemon can communicate with up
4370 to 50 other slurmd daemons and over 2500 nodes can be contacted
4371 with two message hops. The default value will work well for
4372 most clusters. Optimal system performance can typically be
4373 achieved if TreeWidth is set to the square root of the number of
4374 nodes in the cluster for systems having no more than 2500 nodes
4375 or the cube root for larger systems. The value may not exceed
4376 65533.
4377
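Following the sizing guidance above, a 2500-node cluster would set TreeWidth to the square root of the node count (an illustrative sketch):

```
# sqrt(2500) = 50, so all 2500 nodes are reachable in two message hops
TreeWidth=50
```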
4378 UnkillableStepProgram
4379 If the processes in a job step are determined to be unkillable
4380 for a period of time specified by the UnkillableStepTimeout
4381 variable, the program specified by UnkillableStepProgram will be
4382 executed. By default no program is run.
4383
4384 See section UNKILLABLE STEP PROGRAM SCRIPT for more information.
4385
4386 UnkillableStepTimeout
4387 The length of time, in seconds, that Slurm will wait before de‐
4388 ciding that processes in a job step are unkillable (after they
4389 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4390 gram. The default timeout value is 60 seconds. If exceeded,
4391 the compute node will be drained to prevent future jobs from be‐
4392 ing scheduled on the node.
4393
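A minimal configuration might look like the following sketch; the script path is a placeholder, not a Slurm default:

```
# Run a notification script if step processes survive SIGKILL for 120 seconds
UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh
UnkillableStepTimeout=120
```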
4394 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4395 will be enabled. PAM is used to establish the upper bounds for
4396 resource limits. With PAM support enabled, local system adminis‐
4397 trators can dynamically configure system resource limits. Chang‐
4398 ing the upper bound of a resource limit will not alter the lim‐
4399 its of running jobs, only jobs started after a change has been
4400 made will pick up the new limits. The default value is 0 (not
4401 to enable PAM support). Remember that PAM also needs to be con‐
4402 figured to support Slurm as a service. For sites using PAM's
4403 directory based configuration option, a configuration file named
4404 slurm should be created. The module-type, control-flags, and
4405 module-path names that should be included in the file are:
4406 auth required pam_localuser.so
4407 auth required pam_shells.so
4408 account required pam_unix.so
4409 account required pam_access.so
4410 session required pam_unix.so
4411 For sites configuring PAM with a general configuration file, the
4412 appropriate lines (see above), where slurm is the service-name,
4413 should be added.
4414
4415              NOTE: The UsePAM option has nothing to do with the
4416              contribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules,
4417              so these two modules work independently of the value set for
4418              UsePAM.
4419
4420 VSizeFactor
4421 Memory specifications in job requests apply to real memory size
4422 (also known as resident set size). It is possible to enforce
4423 virtual memory limits for both jobs and job steps by limiting
4424 their virtual memory to some percentage of their real memory al‐
4425 location. The VSizeFactor parameter specifies the job's or job
4426 step's virtual memory limit as a percentage of its real memory
4427 limit. For example, if a job's real memory limit is 500MB and
4428 VSizeFactor is set to 101 then the job will be killed if its
4429 real memory exceeds 500MB or its virtual memory exceeds 505MB
4430 (101 percent of the real memory limit). The default value is 0,
4431 which disables enforcement of virtual memory limits. The value
4432 may not exceed 65533 percent.
4433
4434 NOTE: This parameter is dependent on OverMemoryKill being con‐
4435 figured in JobAcctGatherParams. It is also possible to configure
4436 the TaskPlugin to use task/cgroup for memory enforcement. VSize‐
4437 Factor will not have an effect on memory enforcement done
4438 through cgroups.
4439
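For instance, to let each job's virtual memory exceed its real memory limit by ten percent (a sketch; as noted above, OverMemoryKill must be configured for VSizeFactor to take effect):

```
JobAcctGatherParams=OverMemoryKill
# A job with a 4000 MB real memory limit gets a 4400 MB virtual memory limit
VSizeFactor=110
```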
4440 WaitTime
4441 Specifies how many seconds the srun command should by default
4442 wait after the first task terminates before terminating all re‐
4443 maining tasks. The "--wait" option on the srun command line
4444 overrides this value. The default value is 0, which disables
4445 this feature. May not exceed 65533 seconds.
4446
4447 X11Parameters
4448 For use with Slurm's built-in X11 forwarding implementation.
4449
4450 home_xauthority
4451 If set, xauth data on the compute node will be placed in
4452 ~/.Xauthority rather than in a temporary file under
4453 TmpFS.
4454
4455 NODE CONFIGURATION
4456   The configuration of nodes (or machines) to be managed by Slurm is also
4457 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4458 adding nodes, changing their processor count, etc.) require restarting
4459 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4460 must know each node in the system to forward messages in support of hi‐
4461 erarchical communications. Only the NodeName must be supplied in the
4462 configuration file. All other node configuration information is op‐
4463 tional. It is advisable to establish baseline node configurations, es‐
4464 pecially if the cluster is heterogeneous. Nodes which register to the
4465   system with less than the configured resources (e.g. too little
4466   memory) will be placed in the "DOWN" state to avoid scheduling jobs on
4467 them. Establishing baseline configurations will also speed Slurm's
4468 scheduling process by permitting it to compare job requirements against
4469 these (relatively few) configuration parameters and possibly avoid hav‐
4470 ing to check job requirements against every individual node's configu‐
4471 ration. The resources checked at node registration time are: CPUs,
4472 RealMemory and TmpDisk.
4473
4474 Default values can be specified with a record in which NodeName is "DE‐
4475 FAULT". The default entry values will apply only to lines following it
4476 in the configuration file and the default values can be reset multiple
4477 times in the configuration file with multiple entries where "Node‐
4478 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4479 add to previous default values and will not reinitialize the default
4480 values. The "NodeName=" specification must be placed on every line de‐
4481 scribing the configuration of nodes. A single node name can not appear
4482 as a NodeName value in more than one line (duplicate node name records
4483 will be ignored). In fact, it is generally possible and desirable to
4484 define the configurations of all nodes in only a few lines. This con‐
4485 vention permits significant optimization in the scheduling of larger
4486 clusters. In order to support the concept of jobs requiring consecu‐
4487   tive nodes on some architectures, node specifications should be placed
4488 in this file in consecutive order. No single node name may be listed
4489 more than once in the configuration file. Use "DownNodes=" to record
4490 the state of nodes which are temporarily in a DOWN, DRAIN or FAILING
4491 state without altering permanent configuration information. A job
4492   step's tasks are allocated to nodes in the order the nodes appear in the
4493 configuration file. There is presently no capability within Slurm to
4494 arbitrarily order a job step's tasks.
4495
4496 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4497 and/or a simple node range expression may optionally be used to specify
4498 numeric ranges of nodes to avoid building a configuration file with
4499 large numbers of entries. The node range expression can contain one
4500 pair of square brackets with a sequence of comma-separated numbers
4501 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4502 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4503 more leading zeros to indicate the numeric portion has a fixed number
4504 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4505 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4506 more numeric expressions are included, one of them must be at the end
4507 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4508 always be used in a comma-separated list.
4509
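Combining a default record with a range expression, a homogeneous 64-node cluster can be described in two lines (node names and hardware sizes here are illustrative):

```
NodeName=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000
NodeName=linux[0001-0064] State=UNKNOWN
```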
4510   The node configuration specifies the following information:
4511
4512
4513 NodeName
4514 Name that Slurm uses to refer to a node. Typically this would
4515 be the string that "/bin/hostname -s" returns. It may also be
4516 the fully qualified domain name as returned by "/bin/hostname
4517 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4518 with the host through the host database (/etc/hosts) or DNS, de‐
4519 pending on the resolver settings. Note that if the short form
4520 of the hostname is not used, it may prevent use of hostlist ex‐
4521 pressions (the numeric portion in brackets must be at the end of
4522 the string). It may also be an arbitrary string if NodeHostname
4523 is specified. If the NodeName is "DEFAULT", the values speci‐
4524 fied with that record will apply to subsequent node specifica‐
4525 tions unless explicitly set to other values in that node record
4526 or replaced with a different set of default values. Each line
4527          where NodeName is "DEFAULT" will replace or add to previous
4528          default values and not reinitialize the default values. For
4529          architectures in which the node order is significant, nodes will
4530 be considered consecutive in the order defined. For example, if
4531 the configuration for "NodeName=charlie" immediately follows the
4532 configuration for "NodeName=baker" they will be considered adja‐
4533 cent in the computer. NOTE: If the NodeName is "ALL" the
4534 process parsing the configuration will exit immediately as it is
4535 an internally reserved word.
4536
4537 NodeHostname
4538 Typically this would be the string that "/bin/hostname -s" re‐
4539 turns. It may also be the fully qualified domain name as re‐
4540 turned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid
4541 domain name associated with the host through the host database
4542 (/etc/hosts) or DNS, depending on the resolver settings. Note
4543 that if the short form of the hostname is not used, it may pre‐
4544 vent use of hostlist expressions (the numeric portion in brack‐
4545 ets must be at the end of the string). A node range expression
4546 can be used to specify a set of nodes. If an expression is
4547 used, the number of nodes identified by NodeHostname on a line
4548 in the configuration file must be identical to the number of
4549 nodes identified by NodeName. By default, the NodeHostname will
4550 be identical in value to NodeName.
4551
4552 NodeAddr
4553          Name by which the node should be referred to when establishing a
4554          communications path. This name will be used as an argument to the
4555 getaddrinfo() function for identification. If a node range ex‐
4556 pression is used to designate multiple nodes, they must exactly
4557 match the entries in the NodeName (e.g. "NodeName=lx[0-7]
4558 NodeAddr=elx[0-7]"). NodeAddr may also contain IP addresses.
4559 By default, the NodeAddr will be identical in value to NodeHost‐
4560 name.
4561
4562 BcastAddr
4563 Alternate network path to be used for sbcast network traffic to
4564 a given node. This name will be used as an argument to the
4565 getaddrinfo() function. If a node range expression is used to
4566 designate multiple nodes, they must exactly match the entries in
4567 the NodeName (e.g. "NodeName=lx[0-7] BcastAddr=elx[0-7]").
4568 BcastAddr may also contain IP addresses. By default, the Bcas‐
4569 tAddr is unset, and sbcast traffic will be routed to the
4570 NodeAddr for a given node. Note: cannot be used with Communica‐
4571 tionParameters=NoInAddrAny.
4572
4573 Boards Number of Baseboards in nodes with a baseboard controller. Note
4574 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4575 and ThreadsPerCore should be specified. The default value is 1.
4576
4577 CoreSpecCount
4578 Number of cores reserved for system use. Depending upon the
4579 TaskPluginParam option of SlurmdOffSpec, the Slurm daemon slurmd
4580 may either be confined to these resources (the default) or pre‐
4581 vented from using these resources. Isolation of slurmd from
4582 user jobs may improve application performance. A job can use
4583 these cores if AllowSpecResourcesUsage=yes and the user explic‐
4584 itly requests less than the configured CoreSpecCount. If this
4585 option and CpuSpecList are both designated for a node, an error
4586 is generated. For information on the algorithm used by Slurm to
4587 select the cores refer to the core specialization documentation
4588 ( https://slurm.schedmd.com/core_spec.html ).
4589
4590 CoresPerSocket
4591 Number of cores in a single physical processor socket (e.g.
4592 "2"). The CoresPerSocket value describes physical cores, not
4593 the logical number of processors per socket. NOTE: If you have
4594 multi-core processors, you will likely need to specify this pa‐
4595 rameter in order to optimize scheduling. The default value is
4596 1.
4597
4598 CpuBind
4599 If a job step request does not specify an option to control how
4600 tasks are bound to allocated CPUs (--cpu-bind) and all nodes al‐
4601          located to the job have the same CpuBind option, the node CpuBind
4602 option will control how tasks are bound to allocated resources.
4603 Supported values for CpuBind are "none", "socket", "ldom"
4604 (NUMA), "core" and "thread".
4605
4606 CPUs Number of logical processors on the node (e.g. "2"). It can be
4607          set to the total number of sockets (supported only by select/linear),
4608          cores or threads. This can be useful when you want to
4609 schedule only the cores on a hyper-threaded node. If CPUs is
4610 omitted, its default will be set equal to the product of Boards,
4611 Sockets, CoresPerSocket, and ThreadsPerCore.
4612
4613 CpuSpecList
4614 A comma-delimited list of Slurm abstract CPU IDs reserved for
4615 system use. The list will be expanded to include all other
4616 CPUs, if any, on the same cores. Depending upon the TaskPlugin‐
4617 Param option of SlurmdOffSpec, the Slurm daemon slurmd may ei‐
4618 ther be confined to these resources (the default) or prevented
4619 from using these resources. Isolation of slurmd from user jobs
4620 may improve application performance. A job can use these cores
4621 if AllowSpecResourcesUsage=yes and the user explicitly requests
4622 less than the number of CPUs in this list. If this option and
4623 CoreSpecCount are both designated for a node, an error is gener‐
4624 ated. This option has no effect unless cgroup job confinement
4625 is also configured (i.e. the task/cgroup TaskPlugin is enabled
4626 and ConstrainCores=yes is set in cgroup.conf).
4627
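Reserving abstract CPUs 0 and 1 for slurmd on one node might look like this sketch (node name is illustrative); note the cgroup requirements stated above:

```
# slurm.conf
TaskPlugin=task/cgroup
NodeName=tux01 CPUs=32 CpuSpecList=0,1

# cgroup.conf must also contain:
ConstrainCores=yes
```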
4628 Features
4629 A comma-delimited list of arbitrary strings indicative of some
4630 characteristic associated with the node. There is no value or
4631          count associated with a feature at this time; a node either has
4632 a feature or it does not. A desired feature may contain a nu‐
4633 meric component indicating, for example, processor speed but
4634 this numeric component will be considered to be part of the fea‐
4635 ture string. Features are intended to be used to filter nodes
4636 eligible to run jobs via the --constraint argument. By default
4637 a node has no features. Also see Gres for being able to have
4638 more control such as types and count. Using features is faster
4639 than scheduling against GRES but is limited to Boolean opera‐
4640 tions.
4641
4642 Gres A comma-delimited list of generic resources specifications for a
4643 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4644 ber>[K|M|G]". The first field is the resource name, which
4645 matches the GresType configuration parameter name. The optional
4646 type field might be used to identify a model of that generic re‐
4647 source. It is forbidden to specify both an untyped GRES and a
4648 typed GRES with the same <name>. The optional no_consume field
4649 allows you to specify that a generic resource does not have a
4650 finite number of that resource that gets consumed as it is re‐
4651 quested. The no_consume field is a GRES specific setting and ap‐
4652 plies to the GRES, regardless of the type specified. It should
4653          not be used with a GRES that has a dedicated plugin; if you're
4654          looking for a way to overcommit GPUs to multiple processes at
4655          the same time, you may be interested in using the "shard" GRES instead.
4656 The final field must specify a generic resources count. A suf‐
4657 fix of "K", "M", "G", "T" or "P" may be used to multiply the
4658 number by 1024, 1048576, 1073741824, etc. respectively.
4659 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4660 sume:4G"). By default a node has no generic resources and its
4661 maximum count is that of an unsigned 64bit integer. Also see
4662 Features for Boolean flags to filter nodes using job con‐
4663 straints.
4664
4665 MemSpecLimit
4666 Amount of memory, in megabytes, reserved for system use and not
4667 available for user allocations. If the task/cgroup plugin is
4668 configured and that plugin constrains memory allocations (i.e.
4669 the task/cgroup TaskPlugin is enabled and ConstrainRAMSpace=yes
4670 is set in cgroup.conf), then Slurm compute node daemons (slurmd
4671 plus slurmstepd) will be allocated the specified memory limit.
4672          Note that Memory must be set as a consumable resource in
4673          SelectTypeParameters (one of the *_Memory options) for this
4674          option to work. The daemons will not be killed if they
4675          exhaust the memory allocation (i.e. the Out-Of-Memory Killer is
4676 disabled for the daemon's memory cgroup). If the task/cgroup
4677 plugin is not configured, the specified memory will only be un‐
4678 available for user allocations.
4679
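For example, to set aside 2 GB of a node's memory for the Slurm daemons (a sketch; memory must be a consumable resource as noted above, and the node name is illustrative):

```
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
NodeName=tux01 RealMemory=128000 MemSpecLimit=2048
```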
4680 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4681 tens to for work on this particular node. By default there is a
4682 single port number for all slurmd daemons on all compute nodes
4683 as defined by the SlurmdPort configuration parameter. Use of
4684 this option is not generally recommended except for development
4685 or testing purposes. If multiple slurmd daemons execute on a
4686 node this can specify a range of ports.
4687
4688 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4689 automatically try to interact with anything opened on ports
4690 8192-60000. Configure Port to use a port outside of the config‐
4691 ured SrunPortRange and RSIP's port range.
4692
4693 Procs See CPUs.
4694
4695 RealMemory
4696 Size of real memory on the node in megabytes (e.g. "2048"). The
4697 default value is 1. Lowering RealMemory with the goal of setting
4698 aside some amount for the OS and not available for job alloca‐
4699 tions will not work as intended if Memory is not set as a con‐
4700 sumable resource in SelectTypeParameters. So one of the *_Memory
4701          options needs to be enabled for that goal to be accomplished.
4702 Also see MemSpecLimit.
4703
4704 Reason Identifies the reason for a node being in state "DOWN",
4705          "DRAINED", "DRAINING", "FAIL" or "FAILING". Use quotes to en‐
4706 close a reason having more than one word.
4707
4708 Sockets
4709 Number of physical processor sockets/chips on the node (e.g.
4710 "2"). If Sockets is omitted, it will be inferred from CPUs,
4711 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4712 multi-core processors, you will likely need to specify these pa‐
4713 rameters. Sockets and SocketsPerBoard are mutually exclusive.
4714 If Sockets is specified when Boards is also used, Sockets is in‐
4715 terpreted as SocketsPerBoard rather than total sockets. The de‐
4716 fault value is 1.
4717
4718 SocketsPerBoard
4719 Number of physical processor sockets/chips on a baseboard.
4720 Sockets and SocketsPerBoard are mutually exclusive. The default
4721 value is 1.
4722
4723 State State of the node with respect to the initiation of user jobs.
4724 Acceptable values are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE
4725 and UNKNOWN. Node states of BUSY and IDLE should not be speci‐
4726 fied in the node configuration, but set the node state to UN‐
4727 KNOWN instead. Setting the node state to UNKNOWN will result in
4728 the node state being set to BUSY, IDLE or other appropriate
4729 state based upon recovered system state information. The de‐
4730 fault value is UNKNOWN. Also see the DownNodes parameter below.
4731
4732 CLOUD Indicates the node exists in the cloud. Its initial
4733 state will be treated as powered down. The node will
4734 be available for use after its state is recovered from
4735 Slurm's state save file or the slurmd daemon starts on
4736 the compute node.
4737
4738 DOWN Indicates the node failed and is unavailable to be al‐
4739 located work.
4740
4741 DRAIN Indicates the node is unavailable to be allocated
4742 work.
4743
4744 FAIL Indicates the node is expected to fail soon, has no
4745 jobs allocated to it, and will not be allocated to any
4746 new jobs.
4747
4748 FAILING Indicates the node is expected to fail soon, has one
4749 or more jobs allocated to it, but will not be allo‐
4750 cated to any new jobs.
4751
4752 FUTURE Indicates the node is defined for future use and need
4753 not exist when the Slurm daemons are started. These
4754 nodes can be made available for use simply by updating
4755 the node state using the scontrol command rather than
4756 restarting the slurmctld daemon. After these nodes are
4757 made available, change their State in the slurm.conf
4758 file. Until these nodes are made available, they will
4759                 not be seen using any Slurm commands, nor will any
4760 attempt be made to contact them.
4761
4762 Dynamic Future Nodes
4763 A slurmd started with -F[<feature>] will be as‐
4764 sociated with a FUTURE node that matches the
4765 same configuration (sockets, cores, threads) as
4766 reported by slurmd -C. The node's NodeAddr and
4767 NodeHostname will automatically be retrieved
4768 from the slurmd and will be cleared when set
4769 back to the FUTURE state. Dynamic FUTURE nodes
4770                        retain non-FUTURE state on restart. Use
4771                        scontrol to put the node back into the FUTURE state.
4772
4773 If the mapping of the NodeName to the slurmd
4774 HostName is not updated in DNS, Dynamic Future
4775 nodes won't know how to communicate with each
4776 other -- because NodeAddr and NodeHostName are
4777 not defined in the slurm.conf -- and the fanout
4778 communications need to be disabled by setting
4779 TreeWidth to a high number (e.g. 65533). If the
4780 DNS mapping is made, then the cloud_dns Slurm‐
4781 ctldParameter can be used.
4782
4783 UNKNOWN Indicates the node's state is undefined but will be
4784 established (set to BUSY or IDLE) when the slurmd dae‐
4785 mon on that node registers. UNKNOWN is the default
4786 state.
4787
4788 ThreadsPerCore
4789 Number of logical threads in a single physical core (e.g. "2").
4790          Note that Slurm can allocate resources to jobs down to the
4791 resolution of a core. If your system is configured with more
4792 than one thread per core, execution of a different job on each
4793 thread is not supported unless you configure SelectTypeParame‐
4794 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4795          or ThreadsPerCore. A job can execute one task per thread from
4796 within one job step or execute a distinct job step on each of
4797 the threads. Note also if you are running with more than 1
4798 thread per core and running the select/cons_res or se‐
4799 lect/cons_tres plugin then you will want to set the SelectType‐
4800 Parameters variable to something other than CR_CPU to avoid un‐
4801 expected results. The default value is 1.
4802
4803 TmpDisk
4804 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4805 "16384"). TmpFS (for "Temporary File System") identifies the lo‐
4806 cation which jobs should use for temporary storage. Note this
4807 does not indicate the amount of free space available to the user
4808          on the node, only the total file system size. The system
4809          administrator should ensure this file system is purged as needed so
4810 that user jobs have access to most of this space. The Prolog
4811 and/or Epilog programs (specified in the configuration file)
4812 might be used to ensure the file system is kept clean. The de‐
4813 fault value is 0.
4814
4815 Weight The priority of the node for scheduling purposes. All things
4816 being equal, jobs will be allocated the nodes with the lowest
4817 weight which satisfies their requirements. For example, a het‐
4818 erogeneous collection of nodes might be placed into a single
4819 partition for greater system utilization, responsiveness and ca‐
4820 pability. It would be preferable to allocate smaller memory
4821 nodes rather than larger memory nodes if either will satisfy a
4822 job's requirements. The units of weight are arbitrary, but
4823 larger weights should be assigned to nodes with more processors,
4824 memory, disk space, higher processor speed, etc. Note that if a
4825 job allocation request can not be satisfied using the nodes with
4826 the lowest weight, the set of nodes with the next lowest weight
4827 is added to the set of nodes under consideration for use (repeat
4828 as needed for higher weight values). If you absolutely want to
4829 minimize the number of higher weight nodes allocated to a job
4830 (at a cost of higher scheduling overhead), give each node a dis‐
4831 tinct Weight value and they will be added to the pool of nodes
4832 being considered for scheduling individually.
4833
4834 The default value is 1.
4835
4836 NOTE: Node weights are first considered among currently avail‐
4837 able nodes. For example, a POWERED_DOWN node with a lower weight
4838 will not be evaluated before an IDLE node.
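For example, to prefer filling small-memory nodes before large-memory ones (node names and values here are illustrative):

```
NodeName=small[01-16] RealMemory=64000 Weight=10
NodeName=big[01-04] RealMemory=512000 Weight=100
```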
4839
4840 DOWN NODE CONFIGURATION
4841   The DownNodes= parameter permits you to mark certain nodes as in a
4842 DOWN, DRAIN, FAIL, FAILING or FUTURE state without altering the perma‐
4843 nent configuration information listed under a NodeName= specification.
4844
4845
4846 DownNodes
4847 Any node name, or list of node names, from the NodeName= speci‐
4848 fications.
4849
4850 Reason Identifies the reason for a node being in state DOWN, DRAIN,
4851 FAIL, FAILING or FUTURE. Use quotes to enclose a reason having
4852 more than one word.
4853
4854 State State of the node with respect to the initiation of user jobs.
4855 Acceptable values are DOWN, DRAIN, FAIL, FAILING and FUTURE.
4856 For more information about these states see the descriptions un‐
4857 der State in the NodeName= section above. The default value is
4858 DOWN.
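For example (the node list and reason text are illustrative):

```
DownNodes=lx[10-12] State=DRAIN Reason="Scheduled power supply replacement"
```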
4859
4860 FRONTEND NODE CONFIGURATION
4861   On computers where frontend nodes are used to execute batch scripts
4862 rather than compute nodes, one may configure one or more frontend nodes
4863 using the configuration parameters defined below. These options are
4864 very similar to those used in configuring compute nodes. These options
4865 may only be used on systems configured and built with the appropriate
4866 parameters (--have-front-end). The front end configuration specifies
4867 the following information:
4868
4869
4870 AllowGroups
4871 Comma-separated list of group names which may execute jobs on
4872 this front end node. By default, all groups may use this front
4873 end node. A user will be permitted to use this front end node
4874 if AllowGroups has at least one group associated with the user.
4875 May not be used with the DenyGroups option.
4876
4877 AllowUsers
4878 Comma-separated list of user names which may execute jobs on
4879 this front end node. By default, all users may use this front
4880 end node. May not be used with the DenyUsers option.
4881
4882 DenyGroups
4883 Comma-separated list of group names which are prevented from ex‐
4884 ecuting jobs on this front end node. May not be used with the
4885 AllowGroups option.
4886
4887 DenyUsers
4888 Comma-separated list of user names which are prevented from exe‐
4889 cuting jobs on this front end node. May not be used with the
4890 AllowUsers option.
4891
4892 FrontendName
4893 Name that Slurm uses to refer to a frontend node. Typically
4894 this would be the string that "/bin/hostname -s" returns. It
4895 may also be the fully qualified domain name as returned by
4896 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4897 name associated with the host through the host database
4898 (/etc/hosts) or DNS, depending on the resolver settings. Note
4899 that if the short form of the hostname is not used, it may pre‐
4900 vent use of hostlist expressions (the numeric portion in brack‐
4901 ets must be at the end of the string). If the FrontendName is
4902 "DEFAULT", the values specified with that record will apply to
4903 subsequent node specifications unless explicitly set to other
4904 values in that frontend node record or replaced with a different
4905          set of default values. Each line where FrontendName is
4906          "DEFAULT" will replace or add to previous default values and not
4907          reinitialize the default values.
4908
4909 FrontendAddr
4910          Name by which a frontend node should be referred to when
4911          establishing a communications path. This name will be used as an argument to
4912 the getaddrinfo() function for identification. As with Fron‐
4913 tendName, list the individual node addresses rather than using a
4914 hostlist expression. The number of FrontendAddr records per
4915 line must equal the number of FrontendName records per line
4916          (i.e. you can't map two node names to one address). FrontendAddr
4917 may also contain IP addresses. By default, the FrontendAddr
4918 will be identical in value to FrontendName.
4919
       Port   The port number that the Slurm compute node daemon, slurmd,
              listens to for work on this particular frontend node. By
              default there is a single port number for all slurmd daemons
              on all frontend nodes as defined by the SlurmdPort
              configuration parameter. Use of this option is not generally
              recommended except for development or testing purposes.

              Note: On Cray systems, Realm-Specific IP Addressing (RSIP)
              will automatically try to interact with anything opened on
              ports 8192-60000. Configure Port to use a port outside of
              the configured SrunPortRange and RSIP's port range.

       Reason Identifies the reason for a frontend node being in state
              DOWN, DRAINED, DRAINING, FAIL or FAILING. Use quotes to
              enclose a reason having more than one word.

       State  State of the frontend node with respect to the initiation
              of user jobs. Acceptable values are DOWN, DRAIN, FAIL,
              FAILING and UNKNOWN. Node states of BUSY and IDLE should not
              be specified in the node configuration, but set the node
              state to UNKNOWN instead. Setting the node state to UNKNOWN
              will result in the node state being set to BUSY, IDLE or
              other appropriate state based upon recovered system state
              information. For more information about these states see
              the descriptions under State in the NodeName= section above.
              The default value is UNKNOWN.

       As an example, you can do something similar to the following to
       define four front end nodes for running slurmd daemons.
       FrontendName=frontend[00-03] FrontendAddr=efrontend[00-03] State=UNKNOWN

NODESET CONFIGURATION
       The nodeset configuration allows you to define a name for a
       specific set of nodes which can be used to simplify the partition
       configuration section, especially for heterogeneous or condo-style
       systems. Each nodeset may be defined by an explicit list of nodes,
       and/or by filtering the nodes by a particular configured feature.
       If both Feature= and Nodes= are used the nodeset shall be the union
       of the two subsets. Note that the nodesets are only used to
       simplify the partition definitions at present, and are not usable
       outside of the partition configuration.

       Feature
              All nodes with this single feature will be included as part
              of this nodeset.

       Nodes  List of nodes in this set.

       NodeSet
              Unique name for a set of nodes. Must not overlap with any
              NodeName definitions.

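       As an illustrative sketch (the node and feature names are
       hypothetical), a nodeset combining a feature filter with an
       explicit node list might look like:

       NodeSet=gpuset Feature=gpu Nodes=tux[100-103]

       A partition could then reference "Nodes=gpuset" instead of listing
       the nodes directly.
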
PARTITION CONFIGURATION
       The partition configuration permits you to establish different job
       limits or access controls for various groups (or partitions) of
       nodes. Nodes may be in more than one partition, making partitions
       serve as general purpose queues. For example one may put the same
       set of nodes into two different partitions, each with different
       constraints (time limit, job sizes, groups allowed to use the
       partition, etc.). Jobs are allocated resources within a single
       partition. Default values can be specified with a record in which
       PartitionName is "DEFAULT". The default entry values will apply
       only to lines following it in the configuration file and the
       default values can be reset multiple times in the configuration
       file with multiple entries where "PartitionName=DEFAULT". The
       "PartitionName=" specification must be placed on every line
       describing the configuration of partitions. Each line where
       PartitionName is "DEFAULT" will replace or add to previous default
       values and not reinitialize the default values. A single partition
       name can not appear as a PartitionName value in more than one line
       (duplicate partition name records will be ignored). If a partition
       that is in use is deleted from the configuration and slurm is
       restarted or reconfigured (scontrol reconfigure), jobs using the
       partition are canceled. NOTE: Put all parameters for each
       partition on a single line. Each line of partition configuration
       information should represent a different partition. The partition
       configuration file contains the following information:

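       As a hedged example (node names and limits are hypothetical), a
       minimal pair of partition definitions might look like:

       PartitionName=DEFAULT MaxTime=24:00:00 State=UP
       PartitionName=batch Nodes=tux[0-127] Default=YES
       PartitionName=debug Nodes=tux[0-15] MaxTime=30

       The DEFAULT record supplies MaxTime and State to both partitions
       that follow it; the "debug" partition then overrides MaxTime.
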
       AllocNodes
              Comma-separated list of nodes from which users can submit
              jobs in the partition. Node names may be specified using
              the node range expression syntax described above. The
              default value is "ALL".

       AllowAccounts
              Comma-separated list of accounts which may execute jobs in
              the partition. The default value is "ALL". NOTE: If
              AllowAccounts is used then DenyAccounts will not be
              enforced. Also refer to DenyAccounts.

       AllowGroups
              Comma-separated list of group names which may execute jobs
              in this partition. A user will be permitted to submit a job
              to this partition if AllowGroups has at least one group
              associated with the user. Jobs executed as user root or as
              user SlurmUser will be allowed to use any partition,
              regardless of the value of AllowGroups. In addition, a
              Slurm Admin or Operator will be able to view any partition,
              regardless of the value of AllowGroups. If user root
              attempts to execute a job as another user (e.g. using
              srun's --uid option), then the job will be subject to
              AllowGroups as if it were submitted by that user. By
              default, AllowGroups is unset, meaning all groups are
              allowed to use this partition. The special value 'ALL' is
              equivalent to this. Users who are not members of the
              specified group will not see information about this
              partition by default. However, this should not be treated
              as a security mechanism, since job information will be
              returned if a user requests details about the partition or
              a specific job. See the PrivateData parameter to restrict
              access to job information. NOTE: For performance reasons,
              Slurm maintains a list of user IDs allowed to use each
              partition and this is checked at job submission time. This
              list of user IDs is updated when the slurmctld daemon is
              restarted, reconfigured (e.g. "scontrol reconfig") or the
              partition's AllowGroups value is reset, even if its value
              is unchanged (e.g. "scontrol update PartitionName=name
              AllowGroups=group"). For a user's access to a partition to
              change, both the user's group membership and Slurm's
              internal user ID list must change using one of the methods
              described above.

       AllowQos
              Comma-separated list of Qos which may execute jobs in the
              partition. Jobs executed as user root can use any partition
              without regard to the value of AllowQos. The default value
              is "ALL". NOTE: If AllowQos is used then DenyQos will not
              be enforced. Also refer to DenyQos.

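       A sketch of partition access controls (the group and QOS names are
       hypothetical):

       PartitionName=restricted Nodes=tux[0-31] AllowGroups=hpcstaff AllowQos=high

       Since DenyAccounts is not enforced when AllowAccounts is used (and
       likewise for DenyQos with AllowQos), use either the Allow or the
       Deny form of each pair per partition, not both.
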
       Alternate
              Partition name of alternate partition to be used if the
              state of this partition is "DRAIN" or "INACTIVE".

       CpuBind
              If a job step request does not specify an option to control
              how tasks are bound to allocated CPUs (--cpu-bind), and the
              nodes allocated to the job do not all have the same node
              CpuBind option, then the partition's CpuBind option will
              control how tasks are bound to allocated resources.
              Supported values for CpuBind are "none", "socket", "ldom"
              (NUMA), "core" and "thread".

       Default
              If this keyword is set, jobs submitted without a partition
              specification will utilize this partition. Possible values
              are "YES" and "NO". The default value is "NO".

       DefaultTime
              Run time limit used for jobs that don't specify a value. If
              not set then MaxTime will be used. Format is the same as
              for MaxTime.

       DefCpuPerGPU
              Default count of CPUs allocated per allocated GPU. This
              value is used only if the job specifies neither
              --cpus-per-task nor --cpus-per-gpu.

       DefMemPerCPU
              Default real memory size available per allocated CPU in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. DefMemPerCPU would generally be used if
              individual processors are allocated to jobs
              (SelectType=select/cons_res or
              SelectType=select/cons_tres). If not set, the DefMemPerCPU
              value for the entire cluster will be used. Also see
              DefMemPerGPU, DefMemPerNode and MaxMemPerCPU. DefMemPerCPU,
              DefMemPerGPU and DefMemPerNode are mutually exclusive.

       DefMemPerGPU
              Default real memory size available per allocated GPU in
              megabytes. Also see DefMemPerCPU, DefMemPerNode and
              MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode
              are mutually exclusive.

       DefMemPerNode
              Default real memory size available per allocated node in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. DefMemPerNode would generally be used if
              whole nodes are allocated to jobs
              (SelectType=select/linear) and resources are
              over-subscribed (OverSubscribe=yes or OverSubscribe=force).
              If not set, the DefMemPerNode value for the entire cluster
              will be used. Also see DefMemPerCPU, DefMemPerGPU and
              MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode
              are mutually exclusive.

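       For example (the values are illustrative only), a partition on a
       cluster using SelectType=select/cons_tres might set per-CPU memory
       limits like this:

       PartitionName=normal Nodes=tux[0-63] DefMemPerCPU=2048 MaxMemPerCPU=4096

       Jobs that do not request memory receive 2048 MB per allocated CPU,
       and may use at most 4096 MB per CPU.
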
       DenyAccounts
              Comma-separated list of accounts which may not execute jobs
              in the partition. By default, no accounts are denied
              access. NOTE: If AllowAccounts is used then DenyAccounts
              will not be enforced. Also refer to AllowAccounts.

       DenyQos
              Comma-separated list of Qos which may not execute jobs in
              the partition. By default, no QOS are denied access. NOTE:
              If AllowQos is used then DenyQos will not be enforced.
              Also refer to AllowQos.

       DisableRootJobs
              If set to "YES" then user root will be prevented from
              running any jobs on this partition. The default value will
              be the value of DisableRootJobs set outside of a partition
              specification (which is "NO", allowing user root to execute
              jobs).

       ExclusiveUser
              If set to "YES" then nodes will be exclusively allocated to
              users. Multiple jobs may be run for the same user, but only
              one user can be active at a time. This capability is also
              available on a per-job basis by using the --exclusive=user
              option.

       GraceTime
              Specifies, in units of seconds, the preemption grace time
              to be extended to a job which has been selected for
              preemption. The default value is zero, no preemption grace
              time is allowed on this partition. Once a job has been
              selected for preemption, its end time is set to the current
              time plus GraceTime. The job's tasks are immediately sent
              SIGCONT and SIGTERM signals in order to provide
              notification of its imminent termination. This is followed
              by the SIGCONT, SIGTERM and SIGKILL signal sequence upon
              reaching its new end time. This second set of signals is
              sent to both the tasks and the containing batch script, if
              applicable. See also the global KillWait configuration
              parameter.

       Hidden Specifies if the partition and its jobs are to be hidden by
              default. Hidden partitions will by default not be reported
              by the Slurm APIs or commands. Possible values are "YES"
              and "NO". The default value is "NO". Note that partitions
              that a user lacks access to by virtue of the AllowGroups
              parameter will also be hidden by default.

       LLN    Schedule resources to jobs on the least loaded nodes (based
              upon the number of idle CPUs). This is generally only
              recommended for an environment with serial jobs as idle
              resources will tend to be highly fragmented, resulting in
              parallel jobs being distributed across many nodes. Note
              that node Weight takes precedence over how many idle
              resources are on each node. Also see the
              SelectTypeParameters configuration parameter CR_LLN to use
              the least loaded nodes in every partition.

       MaxCPUsPerNode
              Maximum number of CPUs on any node available to all jobs
              from this partition. This can be especially useful to
              schedule GPUs. For example a node can be associated with
              two Slurm partitions (e.g. "cpu" and "gpu") and the
              partition/queue "cpu" could be limited to only a subset of
              the node's CPUs, ensuring that one or more CPUs would be
              available to jobs in the "gpu" partition/queue.

       MaxMemPerCPU
              Maximum real memory size available per allocated CPU in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. MaxMemPerCPU would generally be used if
              individual processors are allocated to jobs
              (SelectType=select/cons_res or
              SelectType=select/cons_tres). If not set, the MaxMemPerCPU
              value for the entire cluster will be used. Also see
              DefMemPerCPU and MaxMemPerNode. MaxMemPerCPU and
              MaxMemPerNode are mutually exclusive.

       MaxMemPerNode
              Maximum real memory size available per allocated node in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. MaxMemPerNode would generally be used if
              whole nodes are allocated to jobs
              (SelectType=select/linear) and resources are
              over-subscribed (OverSubscribe=yes or OverSubscribe=force).
              If not set, the MaxMemPerNode value for the entire cluster
              will be used. Also see DefMemPerNode and MaxMemPerCPU.
              MaxMemPerCPU and MaxMemPerNode are mutually exclusive.

       MaxNodes
              Maximum count of nodes which may be allocated to any single
              job. The default value is "UNLIMITED", which is
              represented internally as -1.

       MaxTime
              Maximum run time limit for jobs. Format is minutes,
              minutes:seconds, hours:minutes:seconds, days-hours,
              days-hours:minutes, days-hours:minutes:seconds or
              "UNLIMITED". Time resolution is one minute and second
              values are rounded up to the next minute. The job
              TimeLimit may be updated by root, SlurmUser or an Operator
              to a value higher than the configured MaxTime after job
              submission.

       MinNodes
              Minimum count of nodes which may be allocated to any single
              job. The default value is 0.

       Nodes  Comma-separated list of nodes or nodesets which are
              associated with this partition. Node names may be
              specified using the node range expression syntax described
              above. A blank list of nodes (i.e. "Nodes= ") can be used
              if one wants a partition to exist, but have no resources
              (possibly on a temporary basis). A value of "ALL" is
              mapped to all nodes configured in the cluster.

       OverSubscribe
              Controls the ability of the partition to execute more than
              one job at a time on each resource (node, socket or core
              depending upon the value of SelectTypeParameters). If
              resources are to be over-subscribed, avoiding memory
              over-subscription is very important. SelectTypeParameters
              should be configured to treat memory as a consumable
              resource and the --mem option should be used for job
              allocations. Sharing of resources is typically useful only
              when using gang scheduling (PreemptMode=suspend,gang).
              Possible values for OverSubscribe are "EXCLUSIVE", "FORCE",
              "YES", and "NO". Note that a value of "YES" or "FORCE" can
              negatively impact performance for systems with many
              thousands of running jobs. The default value is "NO". For
              more information see the following web pages:
              https://slurm.schedmd.com/cons_res.html
              https://slurm.schedmd.com/cons_res_share.html
              https://slurm.schedmd.com/gang_scheduling.html
              https://slurm.schedmd.com/preempt.html

              EXCLUSIVE
                     Allocates entire nodes to jobs even with
                     SelectType=select/cons_res or
                     SelectType=select/cons_tres configured. Jobs that
                     run in partitions with OverSubscribe=EXCLUSIVE will
                     have exclusive access to all allocated nodes. These
                     jobs are allocated all CPUs and GRES on the nodes,
                     but they are only allocated as much memory as they
                     ask for. This is by design to support gang
                     scheduling, because suspended jobs still reside in
                     memory. To request all the memory on a node, use
                     --mem=0 at submit time.

              FORCE  Makes all resources (except GRES) in the partition
                     available for oversubscription without any means for
                     users to disable it. May be followed with a colon
                     and maximum number of jobs in running or suspended
                     state. For example OverSubscribe=FORCE:4 enables
                     each node, socket or core to oversubscribe each
                     resource four ways. Recommended only for systems
                     using PreemptMode=suspend,gang.

                     NOTE: OverSubscribe=FORCE:1 is a special case that
                     is not exactly equivalent to OverSubscribe=NO.
                     OverSubscribe=FORCE:1 disables the regular
                     oversubscription of resources in the same partition
                     but it will still allow oversubscription due to
                     preemption. Setting OverSubscribe=NO will prevent
                     oversubscription from happening due to preemption
                     as well.

                     NOTE: If using PreemptType=preempt/qos you can
                     specify a value for FORCE that is greater than 1.
                     For example, OverSubscribe=FORCE:2 will permit two
                     jobs per resource normally, but a third job can be
                     started only if done so through preemption based
                     upon QOS.

                     NOTE: If OverSubscribe is configured to FORCE or
                     YES in your slurm.conf and the system is not
                     configured to use preemption (PreemptMode=OFF)
                     accounting can easily grow to values greater than
                     the actual utilization. It may be common on such
                     systems to get error messages in the slurmdbd log
                     stating: "We have more allocated time than is
                     possible."

              YES    Makes all resources (except GRES) in the partition
                     available for sharing upon request by the job.
                     Resources will only be over-subscribed when
                     explicitly requested by the user using the
                     "--oversubscribe" option on job submission. May be
                     followed with a colon and maximum number of jobs in
                     running or suspended state. For example
                     "OverSubscribe=YES:4" enables each node, socket or
                     core to execute up to four jobs at once.
                     Recommended only for systems running with gang
                     scheduling (PreemptMode=suspend,gang).

              NO     Selected resources are allocated to a single job.
                     No resource will be allocated to more than one job.

                     NOTE: Even if you are using
                     PreemptMode=suspend,gang, setting OverSubscribe=NO
                     will disable preemption on that partition. Use
                     OverSubscribe=FORCE:1 if you want to disable normal
                     oversubscription but still allow suspension due to
                     preemption.

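       As a sketch (the node names are hypothetical), a time-sliced
       partition using gang scheduling might be configured with:

       PreemptMode=suspend,gang
       PartitionName=timeshare Nodes=tux[0-31] OverSubscribe=FORCE:4

       With FORCE:4, up to four jobs may share each resource and the gang
       scheduler time-slices among them.
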
       OverTimeLimit
              Number of minutes by which a job can exceed its time limit
              before being canceled. Normally a job's time limit is
              treated as a hard limit and the job will be killed upon
              reaching that limit. Configuring OverTimeLimit will result
              in the job's time limit being treated like a soft limit.
              Adding the OverTimeLimit value to the soft time limit
              provides a hard time limit, at which point the job is
              canceled. This is particularly useful for backfill
              scheduling, which is based upon each job's soft time
              limit. If not set, the OverTimeLimit value for the entire
              cluster will be used. May not exceed 65533 minutes. A
              value of "UNLIMITED" is also supported.

       PartitionName
              Name by which the partition may be referenced (e.g.
              "Interactive"). This name can be specified by users when
              submitting jobs. If the PartitionName is "DEFAULT", the
              values specified with that record will apply to subsequent
              partition specifications unless explicitly set to other
              values in that partition record or replaced with a
              different set of default values. Each line where
              PartitionName is "DEFAULT" will replace or add to previous
              default values and not reinitialize the default values.

       PreemptMode
              Mechanism used to preempt jobs or enable gang scheduling
              for this partition when
              PreemptType=preempt/partition_prio is configured. This
              partition-specific PreemptMode configuration parameter
              will override the cluster-wide PreemptMode for this
              partition. It can be set to OFF to disable preemption and
              gang scheduling for this partition. See also PriorityTier
              and the above description of the cluster-wide PreemptMode
              parameter for further details.
              The GANG option is used to enable gang scheduling
              independent of whether preemption is enabled (i.e.
              independent of the PreemptType setting). It can be
              specified in addition to a PreemptMode setting with the
              two options comma separated (e.g.
              PreemptMode=SUSPEND,GANG).
              See <https://slurm.schedmd.com/preempt.html> and
              <https://slurm.schedmd.com/gang_scheduling.html> for more
              details.

              NOTE: For performance reasons, the backfill scheduler
              reserves whole nodes for jobs, not partial nodes. If
              during backfill scheduling a job preempts one or more
              other jobs, the whole nodes for those preempted jobs are
              reserved for the preemptor job, even if the preemptor job
              requested fewer resources than that. These reserved nodes
              aren't available to other jobs during that backfill cycle,
              even if the other jobs could fit on the nodes. Therefore,
              jobs may preempt more resources during a single backfill
              iteration than they requested.
              NOTE: For a heterogeneous job to be considered for
              preemption all components must be eligible for preemption.
              When a heterogeneous job is to be preempted the first
              identified component of the job with the highest order
              PreemptMode (SUSPEND (highest), REQUEUE, CANCEL (lowest))
              will be used to set the PreemptMode for all components.
              The GraceTime and user warning signal for each component
              of the heterogeneous job remain unique. Heterogeneous
              jobs are excluded from GANG scheduling operations.

              OFF    Is the default value and disables job preemption
                     and gang scheduling. It is only compatible with
                     PreemptType=preempt/none at a global level. A
                     common use case for this parameter is to set it on
                     a partition to disable preemption for that
                     partition.

              CANCEL The preempted job will be cancelled.

              GANG   Enables gang scheduling (time slicing) of jobs in
                     the same partition, and allows the resuming of
                     suspended jobs.

                     NOTE: Gang scheduling is performed independently
                     for each partition, so if you only want
                     time-slicing by OverSubscribe, without any
                     preemption, then configuring partitions with
                     overlapping nodes is not recommended. On the other
                     hand, if you want to use
                     PreemptType=preempt/partition_prio to allow jobs
                     from higher PriorityTier partitions to Suspend jobs
                     from lower PriorityTier partitions you will need
                     overlapping partitions, and
                     PreemptMode=SUSPEND,GANG to use the Gang scheduler
                     to resume the suspended job(s). In any case,
                     time-slicing won't happen between jobs on different
                     partitions.
                     NOTE: Heterogeneous jobs are excluded from GANG
                     scheduling operations.

              REQUEUE
                     Preempts jobs by requeuing them (if possible) or
                     canceling them. For jobs to be requeued they must
                     have the --requeue sbatch option set or the cluster
                     wide JobRequeue parameter in slurm.conf must be set
                     to 1.

              SUSPEND
                     The preempted jobs will be suspended, and later the
                     Gang scheduler will resume them. Therefore the
                     SUSPEND preemption mode always needs the GANG
                     option to be specified at the cluster level. Also,
                     because the suspended jobs will still use memory on
                     the allocated nodes, Slurm needs to be able to
                     track memory resources to be able to suspend jobs.

                     If the preemptees and preemptor are on different
                     partitions then the preempted jobs will remain
                     suspended until the preemptor ends.
                     NOTE: Because gang scheduling is performed
                     independently for each partition, if using
                     PreemptType=preempt/partition_prio then jobs in
                     higher PriorityTier partitions will suspend jobs in
                     lower PriorityTier partitions to run on the
                     released resources. Only when the preemptor job
                     ends will the suspended jobs be resumed by the Gang
                     scheduler.
                     NOTE: Suspended jobs will not release GRES. Higher
                     priority jobs will not be able to preempt to gain
                     access to GRES.

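       A hedged sketch of partition-based preemption (the partition
       names, node names and tier values are hypothetical):

       PreemptType=preempt/partition_prio
       PreemptMode=SUSPEND,GANG
       PartitionName=low Nodes=tux[0-63] PriorityTier=10
       PartitionName=high Nodes=tux[0-63] PriorityTier=100

       Jobs in "high" may suspend jobs in "low" on the overlapping nodes,
       and the Gang scheduler resumes the suspended jobs once the
       preemptors finish.
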
       PriorityJobFactor
              Partition factor used by priority/multifactor plugin in
              calculating job priority. The value may not exceed 65533.
              Also see PriorityTier.

       PriorityTier
              Jobs submitted to a partition with a higher PriorityTier
              value will be evaluated by the scheduler before pending
              jobs in a partition with a lower PriorityTier value. They
              will also be considered for preemption of running jobs in
              partition(s) with lower PriorityTier values if
              PreemptType=preempt/partition_prio. The value may not
              exceed 65533. Also see PriorityJobFactor.

       QOS    Used to extend the limits available to a QOS on a
              partition. Jobs will not be associated to this QOS outside
              of being associated to the partition. They will still be
              associated to their requested QOS. By default, no QOS is
              used. NOTE: If a limit is set in both the Partition's QOS
              and the Job's QOS, the Partition QOS will be honored unless
              the Job's QOS has the OverPartQOS flag set, in which case
              the Job's QOS will have priority.

       ReqResv
              Specifies users of this partition are required to
              designate a reservation when submitting a job. This option
              can be useful in restricting usage of a partition that may
              have higher priority or additional resources to be allowed
              only within a reservation. Possible values are "YES" and
              "NO". The default value is "NO".

       ResumeTimeout
              Maximum time permitted (in seconds) between when a node
              resume request is issued and when the node is actually
              available for use. Nodes which fail to respond in this
              time frame will be marked DOWN and the jobs scheduled on
              the node requeued. Nodes which reboot after this time
              frame will be marked DOWN with a reason of "Node
              unexpectedly rebooted." For nodes that are in multiple
              partitions with this option set, the highest time will
              take effect. If not set on any partition, the node will
              use the ResumeTimeout value set for the entire cluster.

       RootOnly
              Specifies if only user ID zero (i.e. user root) may
              allocate resources in this partition. User root may
              allocate resources for any other user, but the request
              must be initiated by user root. This option can be useful
              for a partition to be managed by some external entity
              (e.g. a higher-level job manager) and prevents users from
              directly using those resources. Possible values are "YES"
              and "NO". The default value is "NO".

       SelectTypeParameters
              Partition-specific resource allocation type. This option
              replaces the global SelectTypeParameters value. Supported
              values are CR_Core, CR_Core_Memory, CR_Socket and
              CR_Socket_Memory. Use requires the system-wide
              SelectTypeParameters value be set to any of the four
              supported values previously listed; otherwise, the
              partition-specific value will be ignored.

       Shared The Shared configuration parameter has been replaced by
              the OverSubscribe parameter described above.

       State  State of partition or availability for use. Possible
              values are "UP", "DOWN", "DRAIN" and "INACTIVE". The
              default value is "UP". See also the related "Alternate"
              keyword.

              UP     Designates that new jobs may be queued on the
                     partition, and that jobs may be allocated nodes and
                     run from the partition.

              DOWN   Designates that new jobs may be queued on the
                     partition, but queued jobs may not be allocated
                     nodes and run from the partition. Jobs already
                     running on the partition continue to run. The jobs
                     must be explicitly canceled to force their
                     termination.

              DRAIN  Designates that no new jobs may be queued on the
                     partition (job submission requests will be denied
                     with an error message), but jobs already queued on
                     the partition may be allocated nodes and run. See
                     also the "Alternate" partition specification.

              INACTIVE
                     Designates that no new jobs may be queued on the
                     partition, and jobs already queued may not be
                     allocated nodes and run. See also the "Alternate"
                     partition specification.

       SuspendTime
              Nodes which remain idle or down for this number of seconds
              will be placed into power save mode by SuspendProgram.
              For nodes that are in multiple partitions with this option
              set, the highest time will take effect. If not set on any
              partition, the node will use the SuspendTime value set for
              the entire cluster. Setting SuspendTime to INFINITE will
              disable suspending of nodes in this partition. Setting
              SuspendTime to anything but INFINITE (or -1) will enable
              power save mode.

       SuspendTimeout
              Maximum time permitted (in seconds) between when a node
              suspend request is issued and when the node is shut down.
              At that time the node must be ready for a resume request
              to be issued as needed for new work. For nodes that are in
              multiple partitions with this option set, the highest time
              will take effect. If not set on any partition, the node
              will use the SuspendTimeout value set for the entire
              cluster.

       TRESBillingWeights
              TRESBillingWeights is used to define the billing weights
              of each TRES type that will be used in calculating the
              usage of a job. The calculated usage is used when
              calculating fairshare and when enforcing the TRES billing
              limit on jobs.

              Billing weights are specified as a comma-separated list of
              <TRES Type>=<TRES Billing Weight> pairs.

              Any TRES Type is available for billing. Note that the base
              unit for memory and burst buffers is megabytes.

              By default the billing of TRES is calculated as the sum of
              all TRES types multiplied by their corresponding billing
              weight.

              The weighted amount of a resource can be adjusted by
              adding a suffix of K, M, G, T or P after the billing
              weight. For example, a memory weight of "mem=.25" on a job
              allocated 8GB will be billed 2048 (8192MB * .25) units. A
              memory weight of "mem=.25G" on the same job will be billed
              2 (8192MB * (.25/1024)) units.

              Negative values are allowed.

              When a job is allocated 1 CPU and 8 GB of memory on a
              partition configured with
              TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the
              billable TRES will be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.

              If PriorityFlags=MAX_TRES is configured, the billable TRES
              is calculated as the MAX of individual TRES' on a node
              (e.g. cpus, mem, gres) plus the sum of all global TRES'
              (e.g. licenses). Using the same example above the billable
              TRES will be MAX(1*1.0, 8*0.25) + (0*2.0) = 2.0.

              If TRESBillingWeights is not defined then the job is
              billed against the total number of allocated CPUs.

              NOTE: TRESBillingWeights doesn't affect job priority
              directly as it is currently not used for the size of the
              job. If you want TRES' to play a role in the job's
              priority then refer to the PriorityWeightTRES option.

5553 There are a variety of prolog and epilog program options that execute
5554 with various permissions and at various times. The four options most
5555 likely to be used are: Prolog and Epilog (executed once on each compute
5556 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
5557 once on the ControlMachine for each job).
5558
5559 NOTE: Standard output and error messages are normally not preserved.
5560 Explicitly write output and error messages to an appropriate location
5561 if you wish to preserve that information.
5562
5563 NOTE: By default the Prolog script is ONLY run on any individual node
5564 when it first sees a job step from a new allocation. It does not run
5565 the Prolog immediately when an allocation is granted. If no job steps
5566 from an allocation are run on a node, it will never run the Prolog for
5567 that allocation.  This Prolog behavior can be changed by the Pro‐
5568 logFlags parameter. The Epilog, on the other hand, always runs on ev‐
5569 ery node of an allocation when the allocation is released.
5570
5571 If the Epilog fails (returns a non-zero exit code), this will result in
5572 the node being set to a DRAIN state. If the EpilogSlurmctld fails (re‐
5573 turns a non-zero exit code), this will only be logged. If the Prolog
5574 fails (returns a non-zero exit code), this will result in the node be‐
5575 ing set to a DRAIN state and the job being requeued in a held state un‐
5576 less nohold_on_prolog_fail is configured in SchedulerParameters. If
5577 the PrologSlurmctld fails (returns a non-zero exit code), this will re‐
5578 sult in the job being requeued to be executed on another node if possi‐
5579 ble. Only batch jobs can be requeued. Interactive jobs (salloc and
5580 srun) will be cancelled if the PrologSlurmctld fails. If slurmctld is
5581 stopped while either PrologSlurmctld or EpilogSlurmctld is running, the
5582 script will be killed with SIGKILL. The script will restart when slurm‐
5583 ctld restarts.
5584
5585
5586 Information about the job is passed to the script using environment
5587 variables. Unless otherwise specified, these environment variables are
5588 available in each of the scripts mentioned above (Prolog, Epilog, Pro‐
5589 logSlurmctld and EpilogSlurmctld). For a full list of environment vari‐
5590 ables that includes those available in the SrunProlog, SrunEpilog,
5591 TaskProlog and TaskEpilog please see the Prolog and Epilog Guide
5592 <https://slurm.schedmd.com/prolog_epilog.html>.
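       As a concrete sketch, a minimal Prolog might record each job start
       using these variables. Everything below is illustrative: the log path
       is an assumption, and the script is written to a temporary location
       only so it can be exercised by hand outside of slurmd.

       ```shell
       # Write a minimal Prolog sketch to a temporary path so it can be run by hand.
       cat > /tmp/slurm_prolog_demo.sh <<'EOF'
       #!/bin/sh
       # Hypothetical Prolog: record job start; slurmd normally sets these variables.
       # Stdout is not preserved, so write explicitly to a log file.
       LOG=/tmp/slurm_prolog_demo.log
       echo "start job=${SLURM_JOB_ID:-?} user=${SLURM_JOB_USER:-?} part=${SLURM_JOB_PARTITION:-?}" >> "$LOG"
       exit 0   # a non-zero exit here would DRAIN the node and requeue the job held
       EOF
       chmod +x /tmp/slurm_prolog_demo.sh

       # Exercise it the way slurmd would, with the environment described above.
       SLURM_JOB_ID=1234 SLURM_JOB_USER=alice SLURM_JOB_PARTITION=debug /tmp/slurm_prolog_demo.sh
       cat /tmp/slurm_prolog_demo.log
       ```

       Note the script writes its own log rather than relying on stdout,
       since standard output from these scripts is normally not preserved.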
5593
5594
5595 SLURM_ARRAY_JOB_ID
5596 If this job is part of a job array, this will be set to the job
5597 ID. Otherwise it will not be set. To reference this specific
5598 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5600 RAY_TASK_ID (e.g. "scontrol update
5601 ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5601 logSlurmctld and EpilogSlurmctld.
5602
5603 SLURM_ARRAY_TASK_ID
5604 If this job is part of a job array, this will be set to the task
5605 ID. Otherwise it will not be set. To reference this specific
5606 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5607 RAY_TASK_ID (e.g. "scontrol update
5608 ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5609 logSlurmctld and EpilogSlurmctld.
5610
5611 SLURM_ARRAY_TASK_MAX
5612 If this job is part of a job array, this will be set to the max‐
5613 imum task ID. Otherwise it will not be set. Available in Pro‐
5614 logSlurmctld and EpilogSlurmctld.
5615
5616 SLURM_ARRAY_TASK_MIN
5617 If this job is part of a job array, this will be set to the min‐
5618 imum task ID. Otherwise it will not be set. Available in Pro‐
5619 logSlurmctld and EpilogSlurmctld.
5620
5621 SLURM_ARRAY_TASK_STEP
5622 If this job is part of a job array, this will be set to the step
5623 size of task IDs. Otherwise it will not be set. Available in
5624 PrologSlurmctld and EpilogSlurmctld.
5625
5626 SLURM_CLUSTER_NAME
5627 Name of the cluster executing the job.
5628
5629 SLURM_CONF
5630 Location of the slurm.conf file. Available in Prolog and Epilog.
5631
5632 SLURMD_NODENAME
5633 Name of the node running the task. In the case of a parallel job
5634 executing on multiple compute nodes, the various tasks will have
5635 this environment variable set to different values on each com‐
5636 pute node. Available in Prolog and Epilog.
5637
5638 SLURM_JOB_ACCOUNT
5639 Account name used for the job.
5640
5641 SLURM_JOB_COMMENT
5642 Comment added to the job. Available in Prolog, PrologSlurmctld,
5643 Epilog and EpilogSlurmctld.
5644
5645 SLURM_JOB_CONSTRAINTS
5646 Features required to run the job. Available in Prolog, Pro‐
5647 logSlurmctld, Epilog and EpilogSlurmctld.
5648
5649 SLURM_JOB_DERIVED_EC
5650 The highest exit code of all of the job steps. Available in
5651 Epilog and EpilogSlurmctld.
5652
5653 SLURM_JOB_EXIT_CODE
5654 The exit code of the job script (or salloc). The value is the
5655 status as returned by the wait() system call (see wait(2)).
5656 Available in Epilog and EpilogSlurmctld.
5657
5658 SLURM_JOB_EXIT_CODE2
5659 The exit code of the job script (or salloc). The value has the
5660 format <exit>:<sig>. The first number is the exit code, typi‐
5661 cally as set by the exit() function.  The second number is the
5662 signal that caused the process to terminate, if it was terminated
5663 by a signal. Available in Epilog and EpilogSlurmctld.
5664
5665 SLURM_JOB_GID
5666 Group ID of the job's owner.
5667
5668 SLURM_JOB_GPUS
5669 The GPU IDs of GPUs in the job allocation (if any). Available
5670 in the Prolog and Epilog.
5671
5672 SLURM_JOB_GROUP
5673 Group name of the job's owner. Available in PrologSlurmctld and
5674 EpilogSlurmctld.
5675
5676 SLURM_JOB_ID
5677 Job ID.
5678
5679 SLURM_JOBID
5680 Job ID.
5681
5682 SLURM_JOB_NAME
5683 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5684 ctld.
5685
5686 SLURM_JOB_NODELIST
5687 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5688 show hostnames" can be used to convert this to a list of indi‐
5689 vidual host names. Available in PrologSlurmctld and Epi‐
5690 logSlurmctld.
5691
5692 SLURM_JOB_PARTITION
5693 Partition that job runs in. Available in Prolog, PrologSlurm‐
5694 ctld, Epilog and EpilogSlurmctld.
5695
5696 SLURM_JOB_UID
5697 User ID of the job's owner.
5698
5699 SLURM_JOB_USER
5700 User name of the job's owner.
5701
5702 SLURM_SCRIPT_CONTEXT
5703 Identifies which epilog or prolog program is currently running.
5704
UNKILLABLE STEP PROGRAM SCRIPT
5706 This program can be used to take special actions to clean up the unkil‐
5707 lable processes and/or notify system administrators. The program will
5708 be run as SlurmdUser (usually "root") on the compute node where Unkill‐
5709 ableStepTimeout was triggered.
5710
5711 Information about the unkillable job step is passed to the script using
5712 environment variables.
5713
5714
5715 SLURM_JOB_ID
5716 Job ID.
5717
5718 SLURM_STEP_ID
5719 Job Step ID.
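       A minimal UnkillableStepProgram sketch using the two variables above;
       the log path and message format are assumptions, and the script is
       staged in /tmp only so it can be exercised by hand:

       ```shell
       cat > /tmp/slurm_unkillable_demo.sh <<'EOF'
       #!/bin/sh
       # Hypothetical UnkillableStepProgram: record the stuck step for admins.
       MSG="unkillable step ${SLURM_JOB_ID:-?}.${SLURM_STEP_ID:-?} on $(hostname)"
       echo "$MSG" >> /tmp/slurm_unkillable_demo.log
       # A real site might also page an operator here, e.g. via mail(1).
       EOF
       chmod +x /tmp/slurm_unkillable_demo.sh

       # Exercise it the way slurmd would when UnkillableStepTimeout triggers.
       SLURM_JOB_ID=42 SLURM_STEP_ID=0 /tmp/slurm_unkillable_demo.sh
       ```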
5720
NETWORK TOPOLOGY
5722 Slurm is able to optimize job allocations to minimize network con‐
5723 tention. Special Slurm logic is used to optimize allocations on sys‐
5724 tems with a three-dimensional interconnect; information about con‐
5725 figuring those systems is available at:
5726 <https://slurm.schedmd.com/>. For a hierarchical network, Slurm needs
5727 to have detailed information about how nodes are configured on the net‐
5728 work switches.
5729
5730 Given network topology information, Slurm allocates all of a job's re‐
5731 sources onto a single leaf of the network (if possible) using a
5732 best-fit algorithm. Otherwise it will allocate a job's resources onto
5733 multiple leaf switches so as to minimize the use of higher-level
5734 switches. The TopologyPlugin parameter controls which plugin is used
5735 to collect network topology information. The only values presently
5736 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5737 forms best-fit logic over three-dimensional topology), "topology/none"
5738 (default for other systems, best-fit logic over one-dimensional topol‐
5739 ogy), "topology/tree" (determine the network topology based upon infor‐
5740 mation contained in a topology.conf file, see "man topology.conf" for
5741 more information). Future plugins may gather topology information di‐
5742 rectly from the network. The topology information is optional. If not
5743 provided, Slurm will perform a best-fit algorithm assuming the nodes
5744 are in a one-dimensional array as configured and the communications
5745 cost is related to the node distance in this array.
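       For example, a hierarchical network is described by setting
       topology/tree in slurm.conf and listing the switch hierarchy in
       topology.conf (see "man topology.conf"); the switch and node names
       below are hypothetical:

       ```
       # slurm.conf
       TopologyPlugin=topology/tree

       # topology.conf
       SwitchName=leaf0 Nodes=dev[0-12]
       SwitchName=leaf1 Nodes=dev[13-25]
       SwitchName=spine Switches=leaf[0-1]
       ```

       With this layout, Slurm prefers to place a whole job under leaf0 or
       leaf1 before spanning the spine switch.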
5746
5747
RELOCATING CONTROLLERS
5749 If the cluster's computers used for the primary or backup controller
5750 will be out of service for an extended period of time, it may be desir‐
5751 able to relocate them. In order to do so, follow this procedure:
5752
5753 1. Stop the Slurm daemons
5754 2. Modify the slurm.conf file appropriately
5755 3. Distribute the updated slurm.conf file to all nodes
5756 4. Restart the Slurm daemons
5757
5758 There should be no loss of any running or pending jobs. Ensure that
5759 any nodes added to the cluster have the current slurm.conf file in‐
5760 stalled.
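       Step 2 usually amounts to updating the SlurmctldHost lines; a
       hypothetical before/after, with dev2 standing in for the replacement
       host:

       ```
       # Before: primary controller on dev0
       SlurmctldHost=dev0(12.34.56.78)
       SlurmctldHost=dev1(12.34.56.79)

       # After: primary relocated to dev2 (hypothetical replacement host)
       SlurmctldHost=dev2(12.34.56.80)
       SlurmctldHost=dev1(12.34.56.79)
       ```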
5761
5762 CAUTION: If two nodes are simultaneously configured as the primary con‐
5763 troller (two nodes on which SlurmctldHost specifies the local host and
5764 on which the slurmctld daemon is executing), system behavior will be de‐
5765 structive. If a compute node has an incorrect SlurmctldHost parameter,
5766 that node may be rendered unusable, but no other harm will result.
5767
5768
EXAMPLE
5770 #
5771 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5772 # Author: John Doe
5773 # Date: 11/06/2001
5774 #
5775 SlurmctldHost=dev0(12.34.56.78) # Primary server
5776 SlurmctldHost=dev1(12.34.56.79) # Backup server
5777 #
5778 AuthType=auth/munge
5779 Epilog=/usr/local/slurm/epilog
5780 Prolog=/usr/local/slurm/prolog
5781 FirstJobId=65536
5782 InactiveLimit=120
5783 JobCompType=jobcomp/filetxt
5784 JobCompLoc=/var/log/slurm/jobcomp
5785 KillWait=30
5786 MaxJobCount=10000
5787 MinJobAge=3600
5788 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5789 ReturnToService=0
5790 SchedulerType=sched/backfill
5791 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5792 SlurmdLogFile=/var/log/slurm/slurmd.log
5793 SlurmctldPort=7002
5794 SlurmdPort=7003
5795 SlurmdSpoolDir=/var/spool/slurmd.spool
5796 StateSaveLocation=/var/spool/slurm.state
5797 SwitchType=switch/none
5798 TmpFS=/tmp
5799 WaitTime=30
5800 #
5801 # Node Configurations
5802 #
5803 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5804 NodeName=DEFAULT State=UNKNOWN
5805 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5806 # Update records for specific DOWN nodes
5807 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5808 #
5809 # Partition Configurations
5810 #
5811 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5812 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5813 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5814 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5815
5816
INCLUDE MODIFIERS
5818 The "include" key word can be used with modifiers within the specified
5819 pathname.  These modifiers are replaced with the cluster name or other
5820 information, depending on which modifier is specified.  If the included
5821 file is not an absolute path name (i.e. it does not start with a
5822 slash), it will be searched for in the same directory as the
5823 slurm.conf file.
5824
5825
5826 %c Cluster name specified in the slurm.conf will be used.
5827
5828 EXAMPLE
5829 ClusterName=linux
5830 include /home/slurm/etc/%c_config
5831 # Above line interpreted as
5832 # "include /home/slurm/etc/linux_config"
5833
5834
FILE AND DIRECTORY PERMISSIONS
5836 There are three classes of files: Files used by slurmctld must be ac‐
5837 cessible by user SlurmUser and accessible by the primary and backup
5838 control machines. Files used by slurmd must be accessible by user root
5839 and accessible from every compute node. A few files need to be acces‐
5840 sible by normal users on all login and compute nodes. While many files
5841 and directories are listed below, most of them will not be used with
5842 most configurations.
5843
5844
5845 Epilog Must be executable by user root. It is recommended that the
5846 file be readable by all users. The file must exist on every
5847 compute node.
5848
5849 EpilogSlurmctld
5850 Must be executable by user SlurmUser. It is recommended that
5851 the file be readable by all users. The file must be accessible
5852 by the primary and backup control machines.
5853
5854 HealthCheckProgram
5855 Must be executable by user root. It is recommended that the
5856 file be readable by all users. The file must exist on every
5857 compute node.
5858
5859 JobCompLoc
5860 If this specifies a file, it must be writable by user SlurmUser.
5861 The file must be accessible by the primary and backup control
5862 machines.
5863
5864 MailProg
5865 Must be executable by user SlurmUser. Must not be writable by
5866 regular users. The file must be accessible by the primary and
5867 backup control machines.
5868
5869 Prolog Must be executable by user root. It is recommended that the
5870 file be readable by all users. The file must exist on every
5871 compute node.
5872
5873 PrologSlurmctld
5874 Must be executable by user SlurmUser. It is recommended that
5875 the file be readable by all users. The file must be accessible
5876 by the primary and backup control machines.
5877
5878 ResumeProgram
5879 Must be executable by user SlurmUser. The file must be accessi‐
5880 ble by the primary and backup control machines.
5881
5882 slurm.conf
5883 Readable to all users on all nodes. Must not be writable by
5884 regular users.
5885
5886 SlurmctldLogFile
5887 Must be writable by user SlurmUser. The file must be accessible
5888 by the primary and backup control machines.
5889
5890 SlurmctldPidFile
5891 Must be writable by user root. Preferably writable and remov‐
5892 able by SlurmUser. The file must be accessible by the primary
5893 and backup control machines.
5894
5895 SlurmdLogFile
5896 Must be writable by user root. A distinct file must exist on
5897 each compute node.
5898
5899 SlurmdPidFile
5900 Must be writable by user root. A distinct file must exist on
5901 each compute node.
5902
5903 SlurmdSpoolDir
5904 Must be writable by user root. Permissions must be set to 755 so
5905 that job scripts can be executed from this directory.  A dis‐
5906 tinct directory must exist on each compute node.
5907
5908 SrunEpilog
5909 Must be executable by all users. The file must exist on every
5910 login and compute node.
5911
5912 SrunProlog
5913 Must be executable by all users. The file must exist on every
5914 login and compute node.
5915
5916 StateSaveLocation
5917 Must be writable by user SlurmUser. The file must be accessible
5918 by the primary and backup control machines.
5919
5920 SuspendProgram
5921 Must be executable by user SlurmUser. The file must be accessi‐
5922 ble by the primary and backup control machines.
5923
5924 TaskEpilog
5925 Must be executable by all users. The file must exist on every
5926 compute node.
5927
5928 TaskProlog
5929 Must be executable by all users. The file must exist on every
5930 compute node.
5931
5932 UnkillableStepProgram
5933 Must be executable by user SlurmdUser. The file must be acces‐
5934 sible by the primary and backup control machines.
5935
LOGGING
5937 Note that while Slurm daemons create log files and other files as
5938 needed, they treat the lack of parent directories as a fatal error.
5939 This prevents the daemons from running if critical file systems are not
5940 mounted and will minimize the risk of cold-starting (starting without
5941 preserving jobs).
5942
5943 Log files and job accounting files may need to be created/owned by the
5944 "SlurmUser" uid to be successfully accessed. Use the "chown" and
5945 "chmod" commands to set the ownership and permissions appropriately.
5946 See the section FILE AND DIRECTORY PERMISSIONS for information about
5947 the various files and directories used by Slurm.
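       The chown/chmod steps might look like the following; the paths are
       assumptions, and a /tmp directory stands in for /var/log/slurm purely
       so the commands can be tried without root privileges (a real site
       would operate on the actual log directory as root and also chown the
       files to SlurmUser):

       ```shell
       # Illustration only: /tmp/slurm_perm_demo stands in for /var/log/slurm.
       LOGDIR=/tmp/slurm_perm_demo
       install -d -m 755 "$LOGDIR"            # parent directory must already exist
       touch "$LOGDIR/slurmctld.log"
       chmod 640 "$LOGDIR/slurmctld.log"      # readable by the group, not the world
       # On a real system, additionally: chown slurm:slurm "$LOGDIR/slurmctld.log"
       ls -l "$LOGDIR/slurmctld.log"
       ```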
5948
5949 It is recommended that the logrotate utility be used to ensure that
5950 various log files do not become too large. This also applies to text
5951 files used for accounting, process tracking, and the slurmdbd log if
5952 they are used.
5953
5954 Here is a sample logrotate configuration. Make appropriate site modifi‐
5955 cations and save as /etc/logrotate.d/slurm on all nodes. See the
5956 logrotate man page for more details.
5957
5958 ##
5959 # Slurm Logrotate Configuration
5960 ##
5961 /var/log/slurm/*.log {
5962 compress
5963 missingok
5964 nocopytruncate
5965 nodelaycompress
5966 nomail
5967 notifempty
5968 noolddir
5969 rotate 5
5970 sharedscripts
5971 size=5M
5972 create 640 slurm root
5973 postrotate
5974 pkill -x --signal SIGUSR2 slurmctld
5975 pkill -x --signal SIGUSR2 slurmd
5976 pkill -x --signal SIGUSR2 slurmdbd
5977 exit 0
5978 endscript
5979 }
5980
5981
COPYING
5983 Copyright (C) 2002-2007 The Regents of the University of California.
5984 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
5985 Copyright (C) 2008-2010 Lawrence Livermore National Security.
5986 Copyright (C) 2010-2022 SchedMD LLC.
5987
5988 This file is part of Slurm, a resource management program. For de‐
5989 tails, see <https://slurm.schedmd.com/>.
5990
5991 Slurm is free software; you can redistribute it and/or modify it under
5992 the terms of the GNU General Public License as published by the Free
5993 Software Foundation; either version 2 of the License, or (at your op‐
5994 tion) any later version.
5995
5996 Slurm is distributed in the hope that it will be useful, but WITHOUT
5997 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
5998 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
5999 for more details.
6000
6001
FILES
6003 /etc/slurm.conf
6004
6005
SEE ALSO
6007 cgroup.conf(5), getaddrinfo(3), getrlimit(2), gres.conf(5), group(5),
6008 hostname(1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8), slur‐
6009 mdbd.conf(5), srun(1), spank(8), syslog(3), topology.conf(5)
6010
6011
6012
6013January 2023 Slurm Configuration File slurm.conf(5)