slurm.conf(5)              Slurm Configuration File             slurm.conf(5)


NAME
       slurm.conf - Slurm configuration file


DESCRIPTION
10 slurm.conf is an ASCII file which describes general Slurm configuration
11 information, the nodes to be managed, information about how those nodes
12 are grouped into partitions, and various scheduling parameters associ‐
13 ated with those partitions. This file should be consistent across all
14 nodes in the cluster.
15
16 The file location can be modified at execution time by setting the
17 SLURM_CONF environment variable. The Slurm daemons also allow you to
18 override both the built-in and environment-provided location using the
19 "-f" option on the command line.
20
21 The contents of the file are case insensitive except for the names of
22 nodes and partitions. Any text following a "#" in the configuration
23 file is treated as a comment through the end of that line. Changes to
24 the configuration file take effect upon restart of Slurm daemons, dae‐
25 mon receipt of the SIGHUP signal, or execution of the command "scontrol
26 reconfigure" unless otherwise noted.
27
28 If a line begins with the word "Include" followed by whitespace and
29 then a file name, that file will be included inline with the current
30 configuration file. For large or complex systems, multiple configura‐
31 tion files may prove easier to manage and enable reuse of some files
32 (See INCLUDE MODIFIERS for more details).
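
       For example, partition definitions could be kept in a separate file
       and pulled in with an Include directive (the file name below is a
       placeholder, not a default):

           Include /etc/slurm/partitions.conf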
33
34 Note on file permissions:
35
36 The slurm.conf file must be readable by all users of Slurm, since it is
37 used by many of the Slurm commands. Other files that are defined in
38 the slurm.conf file, such as log files and job accounting files, may
39 need to be created/owned by the user "SlurmUser" to be successfully ac‐
40 cessed. Use the "chown" and "chmod" commands to set the ownership and
41 permissions appropriately. See the section FILE AND DIRECTORY PERMIS‐
42 SIONS for information about the various files and directories used by
43 Slurm.
44
45
PARAMETERS
       The overall configuration parameters available include:
48
49
50 AccountingStorageBackupHost
51 The name of the backup machine hosting the accounting storage
52 database. If used with the accounting_storage/slurmdbd plugin,
53 this is where the backup slurmdbd would be running. Only used
54 with systems using SlurmDBD, ignored otherwise.
55
56 AccountingStorageEnforce
57 This controls what level of association-based enforcement to im‐
58 pose on job submissions. Valid options are any combination of
59 associations, limits, nojobs, nosteps, qos, safe, and wckeys, or
60 all for all things (except nojobs and nosteps, which must be re‐
61 quested as well).
62
63 If limits, qos, or wckeys are set, associations will automati‐
64 cally be set.
65
66 If wckeys is set, TrackWCKey will automatically be set.
67
68 If safe is set, limits and associations will automatically be
69 set.
70
71 If nojobs is set, nosteps will automatically be set.
72
73 By setting associations, no new job is allowed to run unless a
74 corresponding association exists in the system. If limits are
75 enforced, users can be limited by association to whatever job
76 size or run time limits are defined.
77
78 If nojobs is set, Slurm will not account for any jobs or steps
79 on the system. Likewise, if nosteps is set, Slurm will not ac‐
80 count for any steps that have run.
81
82 If safe is enforced, a job will only be launched against an as‐
83 sociation or qos that has a GrpTRESMins limit set, if the job
84 will be able to run to completion. Without this option set, jobs
85 will be launched as long as their usage hasn't reached the
86 cpu-minutes limit. This can lead to jobs being launched but then
87 killed when the limit is reached.
88
89 With qos and/or wckeys enforced jobs will not be scheduled un‐
90 less a valid qos and/or workload characterization key is speci‐
91 fied.
92
93 A restart of slurmctld is required for changes to this parameter
94 to take effect.
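
       For example, a site that wants association, limit, QOS and safe
       enforcement could use a line like the following (a minimal sketch;
       combine only the flags your site actually requires):

           AccountingStorageEnforce=associations,limits,qos,safe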
95
96 AccountingStorageExternalHost
97 A comma-separated list of external slurmdbds
98 (<host/ip>[:port][,...]) to register with. If no port is given,
99 the AccountingStoragePort will be used.
100
101 This allows clusters registered with the external slurmdbd to
102 communicate with each other using the --cluster/-M client com‐
103 mand options.
104
105 The cluster will add itself to the external slurmdbd if it
106 doesn't exist. If a non-external cluster already exists on the
107 external slurmdbd, the slurmctld will ignore registering to the
108 external slurmdbd.
109
110 AccountingStorageHost
111 The name of the machine hosting the accounting storage database.
112 Only used with systems using SlurmDBD, ignored otherwise.
113
114 AccountingStorageParameters
115 Comma-separated list of key-value pair parameters. Currently
116 supported values include options to establish a secure connec‐
117 tion to the database:
118
119 SSL_CERT
120 The path name of the client public key certificate file.
121
122 SSL_CA
123 The path name of the Certificate Authority (CA) certificate
124 file.
125
126 SSL_CAPATH
127 The path name of the directory that contains trusted SSL CA
128 certificate files.
129
130 SSL_KEY
131 The path name of the client private key file.
132
133 SSL_CIPHER
134 The list of permissible ciphers for SSL encryption.
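
       As an illustration, a TLS-protected connection to the database could
       be configured as follows (the certificate and key paths are
       placeholders, not defaults):

           AccountingStorageParameters=SSL_CA=/etc/slurm/ssl/ca-cert.pem,SSL_CERT=/etc/slurm/ssl/client-cert.pem,SSL_KEY=/etc/slurm/ssl/client-key.pem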
135
136 AccountingStoragePass
137 The password used to gain access to the database to store the
138 accounting data. Only used for database type storage plugins,
139 ignored otherwise. In the case of Slurm DBD (Database Daemon)
140 with MUNGE authentication this can be configured to use a MUNGE
141 daemon specifically configured to provide authentication between
142 clusters while the default MUNGE daemon provides authentication
143 within a cluster. In that case, AccountingStoragePass should
144 specify the named port to be used for communications with the
145 alternate MUNGE daemon (e.g. "/var/run/munge/global.socket.2").
146 The default value is NULL.
147
148 AccountingStoragePort
149 The listening port of the accounting storage database server.
150 Only used for database type storage plugins, ignored otherwise.
151 The default value is SLURMDBD_PORT as established at system
152 build time. If no value is explicitly specified, it will be set
153 to 6819. This value must be equal to the DbdPort parameter in
154 the slurmdbd.conf file.
155
156 AccountingStorageTRES
157 Comma-separated list of resources you wish to track on the clus‐
158 ter. These are the resources requested by the sbatch/srun job
159 when it is submitted. Currently this consists of any GRES, BB
160 (burst buffer) or license along with CPU, Memory, Node, Energy,
161 FS/[Disk|Lustre], IC/OFED, Pages, and VMem. By default Billing,
162 CPU, Energy, Memory, Node, FS/Disk, Pages and VMem are tracked.
163 These default TRES cannot be disabled, but only appended to.
164 AccountingStorageTRES=gres/craynetwork,license/iop1 will track
165 billing, cpu, energy, memory, nodes, fs/disk, pages and vmem
166 along with a gres called craynetwork as well as a license called
167 iop1. Whenever these resources are used on the cluster they are
168 recorded. The TRES are automatically set up in the database on
169 the start of the slurmctld.
170
171 If multiple GRES of different types are tracked (e.g. GPUs of
172 different types), then job requests with matching type specifi‐
173 cations will be recorded. Given a configuration of "Account‐
ingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta", then
175 "gres/gpu:tesla" and "gres/gpu:volta" will track only jobs that
176 explicitly request those two GPU types, while "gres/gpu" will
177 track allocated GPUs of any type ("tesla", "volta" or any other
178 GPU type).
179
180 Given a configuration of "AccountingStorage‐
TRES=gres/gpu:tesla,gres/gpu:volta", then "gres/gpu:tesla" and
182 "gres/gpu:volta" will track jobs that explicitly request those
183 GPU types. If a job requests GPUs, but does not explicitly
184 specify the GPU type, then its resource allocation will be ac‐
185 counted for as either "gres/gpu:tesla" or "gres/gpu:volta", al‐
186 though the accounting may not match the actual GPU type allo‐
187 cated to the job and the GPUs allocated to the job could be het‐
188 erogeneous. In an environment containing various GPU types, use
189 of a job_submit plugin may be desired in order to force jobs to
190 explicitly specify some GPU type.
191
192 AccountingStorageType
193 The accounting storage mechanism type. Acceptable values at
194 present include "accounting_storage/none" and "accounting_stor‐
195 age/slurmdbd". The "accounting_storage/slurmdbd" value indi‐
196 cates that accounting records will be written to the Slurm DBD,
197 which manages an underlying MySQL database. See "man slurmdbd"
198 for more information. The default value is "accounting_stor‐
199 age/none" and indicates that account records are not maintained.
200
201 AccountingStorageUser
202 The user account for accessing the accounting storage database.
203 Only used for database type storage plugins, ignored otherwise.
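
       A minimal SlurmDBD-based accounting configuration might look like the
       following sketch (the host name is a placeholder, and the port must
       match DbdPort in slurmdbd.conf):

           AccountingStorageType=accounting_storage/slurmdbd
           AccountingStorageHost=dbd.example.com
           AccountingStoragePort=6819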
204
205 AccountingStoreFlags
Comma-separated list used to tell the slurmctld to store extra
fields that may be more heavyweight than the normal job
information.
209
210 Current options are:
211
212 job_comment
213 Include the job's comment field in the job complete mes‐
214 sage sent to the Accounting Storage database. Note the
215 AdminComment and SystemComment are always recorded in the
216 database.
217
218 job_env
219 Include a batch job's environment variables used at job
220 submission in the job start message sent to the Account‐
221 ing Storage database.
222
223 job_script
224 Include the job's batch script in the job start message
225 sent to the Accounting Storage database.
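
       For example, to additionally store batch scripts and job comments in
       the database:

           AccountingStoreFlags=job_comment,job_script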
226
227 AcctGatherNodeFreq
228 The AcctGather plugins sampling interval for node accounting.
229 For AcctGather plugin values of none, this parameter is ignored.
230 For all other values this parameter is the number of seconds be‐
231 tween node accounting samples. For the acct_gather_energy/rapl
232 plugin, set a value less than 300 because the counters may over‐
233 flow beyond this rate. The default value is zero. This value
234 disables accounting sampling for nodes. Note: The accounting
235 sampling interval for jobs is determined by the value of JobAc‐
236 ctGatherFrequency.
237
238 AcctGatherEnergyType
239 Identifies the plugin to be used for energy consumption account‐
240 ing. The jobacct_gather plugin and slurmd daemon call this
241 plugin to collect energy consumption data for jobs and nodes.
242 The collection of energy consumption data takes place on the
243 node level, hence only in case of exclusive job allocation the
244 energy consumption measurements will reflect the job's real con‐
245 sumption. In case of node sharing between jobs the reported con‐
246 sumed energy per job (through sstat or sacct) will not reflect
247 the real energy consumed by the jobs.
248
249 Configurable values at present are:
250
251 acct_gather_energy/none
252 No energy consumption data is collected.
253
254 acct_gather_energy/ipmi
255 Energy consumption data is collected from
256 the Baseboard Management Controller (BMC)
257 using the Intelligent Platform Management
258 Interface (IPMI).
259
260 acct_gather_energy/pm_counters
261 Energy consumption data is collected from
262 the Baseboard Management Controller (BMC)
263 for HPE Cray systems.
264
265 acct_gather_energy/rapl
266 Energy consumption data is collected from
267 hardware sensors using the Running Average
268 Power Limit (RAPL) mechanism. Note that en‐
269 abling RAPL may require the execution of the
270 command "sudo modprobe msr".
271
272 acct_gather_energy/xcc
273 Energy consumption data is collected from
274 the Lenovo SD650 XClarity Controller (XCC)
275 using IPMI OEM raw commands.
276
277 AcctGatherInterconnectType
278 Identifies the plugin to be used for interconnect network traf‐
279 fic accounting. The jobacct_gather plugin and slurmd daemon
280 call this plugin to collect network traffic data for jobs and
281 nodes. The collection of network traffic data takes place on
282 the node level, hence only in case of exclusive job allocation
283 the collected values will reflect the job's real traffic. In
284 case of node sharing between jobs the reported network traffic
285 per job (through sstat or sacct) will not reflect the real net‐
286 work traffic by the jobs.
287
288 Configurable values at present are:
289
290 acct_gather_interconnect/none
291 No infiniband network data are collected.
292
293 acct_gather_interconnect/ofed
294 Infiniband network traffic data are col‐
295 lected from the hardware monitoring counters
296 of Infiniband devices through the OFED li‐
297 brary. In order to account for per job net‐
298 work traffic, add the "ic/ofed" TRES to Ac‐
299 countingStorageTRES.
300
301 acct_gather_interconnect/sysfs
302 Network traffic statistics are collected
303 from the Linux sysfs pseudo-filesystem for
304 specific interfaces defined in
305 acct_gather_interconnect.conf(5). In order
306 to account for per job network traffic, add
307 the "ic/sysfs" TRES to AccountingStorage‐
308 TRES.
309
310 AcctGatherFilesystemType
311 Identifies the plugin to be used for filesystem traffic account‐
312 ing. The jobacct_gather plugin and slurmd daemon call this
313 plugin to collect filesystem traffic data for jobs and nodes.
314 The collection of filesystem traffic data takes place on the
315 node level, hence only in case of exclusive job allocation the
316 collected values will reflect the job's real traffic. In case of
317 node sharing between jobs the reported filesystem traffic per
318 job (through sstat or sacct) will not reflect the real filesys‐
319 tem traffic by the jobs.
320
321
322 Configurable values at present are:
323
324 acct_gather_filesystem/none
325 No filesystem data are collected.
326
327 acct_gather_filesystem/lustre
328 Lustre filesystem traffic data are collected
329 from the counters found in /proc/fs/lustre/.
330 In order to account for per job lustre traf‐
331 fic, add the "fs/lustre" TRES to Account‐
332 ingStorageTRES.
333
334 AcctGatherProfileType
335 Identifies the plugin to be used for detailed job profiling.
336 The jobacct_gather plugin and slurmd daemon call this plugin to
337 collect detailed data such as I/O counts, memory usage, or en‐
338 ergy consumption for jobs and nodes. There are interfaces in
339 this plugin to collect data as step start and completion, task
340 start and completion, and at the account gather frequency. The
341 data collected at the node level is related to jobs only in case
342 of exclusive job allocation.
343
344 Configurable values at present are:
345
346 acct_gather_profile/none
347 No profile data is collected.
348
349 acct_gather_profile/hdf5
350 This enables the HDF5 plugin. The directory
351 where the profile files are stored and which
352 values are collected are configured in the
353 acct_gather.conf file.
354
355 acct_gather_profile/influxdb
356 This enables the influxdb plugin. The in‐
357 fluxdb instance host, port, database, reten‐
358 tion policy and which values are collected
359 are configured in the acct_gather.conf file.
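
       As an illustration, a node-level accounting and profiling setup using
       the plugins above might look like the following sketch (the plugin
       choices and the 30 second sampling interval are examples, not
       recommendations):

           AcctGatherEnergyType=acct_gather_energy/rapl
           AcctGatherNodeFreq=30
           AcctGatherInterconnectType=acct_gather_interconnect/ofed
           AcctGatherFilesystemType=acct_gather_filesystem/lustre
           AcctGatherProfileType=acct_gather_profile/hdf5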
360
361 AllowSpecResourcesUsage
362 If set to "YES", Slurm allows individual jobs to override node's
363 configured CoreSpecCount value. For a job to take advantage of
364 this feature, a command line option of --core-spec must be spec‐
365 ified. The default value for this option is "YES" for Cray sys‐
366 tems and "NO" for other system types.
367
368 AuthAltTypes
369 Comma-separated list of alternative authentication plugins that
370 the slurmctld will permit for communication. Acceptable values
371 at present include auth/jwt.
372
373 NOTE: auth/jwt requires a jwt_hs256.key to be populated in the
374 StateSaveLocation directory for slurmctld only. The
375 jwt_hs256.key should only be visible to the SlurmUser and root.
376 It is not suggested to place the jwt_hs256.key on any nodes but
377 the controller running slurmctld. auth/jwt can be activated by
378 the presence of the SLURM_JWT environment variable. When acti‐
379 vated, it will override the default AuthType.
380
381 AuthAltParameters
382 Used to define alternative authentication plugins options. Mul‐
383 tiple options may be comma separated.
384
385 disable_token_creation
386 Disable "scontrol token" use by non-SlurmUser ac‐
387 counts.
388
389 max_token_lifespan=<seconds>
390 Set max lifespan (in seconds) for any token gen‐
391 erated for user accounts. (This limit does not
392 apply to SlurmUser.)
393
394 jwks= Absolute path to JWKS file. Only RS256 keys are
395 supported, although other key types may be listed
396 in the file. If set, no HS256 key will be loaded
397 by default (and token generation is disabled),
398 although the jwt_key setting may be used to ex‐
399 plicitly re-enable HS256 key use (and token gen‐
400 eration).
401
402 jwt_key= Absolute path to JWT key file. Key must be HS256,
403 and should only be accessible by SlurmUser. If
404 not set, the default key file is jwt_hs256.key in
405 StateSaveLocation.
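
       For example, JWT authentication could be enabled alongside the
       default AuthType with something like the following sketch (the key
       path and token lifespan are placeholders):

           AuthAltTypes=auth/jwt
           AuthAltParameters=jwt_key=/var/spool/slurmctld/jwt_hs256.key,max_token_lifespan=7200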
406
407 AuthInfo
408 Additional information to be used for authentication of communi‐
409 cations between the Slurm daemons (slurmctld and slurmd) and the
410 Slurm clients. The interpretation of this option is specific to
411 the configured AuthType. Multiple options may be specified in a
412 comma-delimited list. If not specified, the default authentica‐
413 tion information will be used.
414
415 cred_expire Default job step credential lifetime, in seconds
416 (e.g. "cred_expire=1200"). It must be suffi‐
417 ciently long enough to load user environment, run
418 prolog, deal with the slurmd getting paged out of
419 memory, etc. This also controls how long a re‐
420 queued job must wait before starting again. The
421 default value is 120 seconds.
422
423 socket Path name to a MUNGE daemon socket to use (e.g.
424 "socket=/var/run/munge/munge.socket.2"). The de‐
425 fault value is "/var/run/munge/munge.socket.2".
426 Used by auth/munge and cred/munge.
427
428 ttl Credential lifetime, in seconds (e.g. "ttl=300").
429 The default value is dependent upon the MUNGE in‐
430 stallation, but is typically 300 seconds.
431
432 AuthType
433 The authentication method for communications between Slurm com‐
434 ponents. Acceptable values at present include "auth/munge",
435 which is the default. "auth/munge" indicates that MUNGE is to
436 be used. (See "https://dun.github.io/munge/" for more informa‐
437 tion). All Slurm daemons and commands must be terminated prior
438 to changing the value of AuthType and later restarted.
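
       For example, a MUNGE-based setup that also shortens the credential
       lifetime might look like this sketch (the values shown are
       illustrative, not defaults to copy):

           AuthType=auth/munge
           AuthInfo=socket=/var/run/munge/munge.socket.2,cred_expire=300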
439
440 BackupAddr
441 Deprecated option, see SlurmctldHost.
442
443 BackupController
444 Deprecated option, see SlurmctldHost.
445
446 The backup controller recovers state information from the State‐
447 SaveLocation directory, which must be readable and writable from
448 both the primary and backup controllers. While not essential,
449 it is recommended that you specify a backup controller. See
450 the RELOCATING CONTROLLERS section if you change this.
451
452 BatchStartTimeout
453 The maximum time (in seconds) that a batch job is permitted for
454 launching before being considered missing and releasing the al‐
455 location. The default value is 10 (seconds). Larger values may
456 be required if more time is required to execute the Prolog, load
457 user environment variables, or if the slurmd daemon gets paged
458 from memory.
459 Note: The test for a job being successfully launched is only
460 performed when the Slurm daemon on the compute node registers
461 state with the slurmctld daemon on the head node, which happens
462 fairly rarely. Therefore a job will not necessarily be termi‐
463 nated if its start time exceeds BatchStartTimeout. This config‐
464 uration parameter is also applied to launch tasks and avoid
465 aborting srun commands due to long running Prolog scripts.
466
467 BcastExclude
468 Comma-separated list of absolute directory paths to be excluded
469 when autodetecting and broadcasting executable shared object de‐
470 pendencies through sbcast or srun --bcast. The keyword "none"
471 can be used to indicate that no directory paths should be ex‐
472 cluded. The default value is "/lib,/usr/lib,/lib64,/usr/lib64".
473 This option can be overridden by sbcast --exclude and srun
474 --bcast-exclude.
475
476 BcastParameters
477 Controls sbcast and srun --bcast behavior. Multiple options can
478 be specified in a comma separated list. Supported values in‐
479 clude:
480
481 DestDir= Destination directory for file being broadcast to
482 allocated compute nodes. Default value is cur‐
483 rent working directory, or --chdir for srun if
484 set.
485
486 Compression= Specify default file compression library to be
487 used. Supported values are "lz4" and "none".
488 The default value with the sbcast --compress op‐
489 tion is "lz4" and "none" otherwise. Some com‐
490 pression libraries may be unavailable on some
491 systems.
492
493 send_libs If set, attempt to autodetect and broadcast the
494 executable's shared object dependencies to allo‐
495 cated compute nodes. The files are placed in a
496 directory alongside the executable. For srun
497 only, the LD_LIBRARY_PATH is automatically up‐
498 dated to include this cache directory as well.
499 This can be overridden with either sbcast or srun
500 --send-libs option. By default this is disabled.
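
       For example, to compress broadcast files and ship shared library
       dependencies by default (the destination directory is a
       placeholder):

           BcastParameters=DestDir=/tmp,Compression=lz4,send_libs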
501
502 BurstBufferType
503 The plugin used to manage burst buffers. Acceptable values at
504 present are:
505
506 burst_buffer/datawarp
507 Use Cray DataWarp API to provide burst buffer functional‐
508 ity.
509
510 burst_buffer/lua
511 This plugin provides hooks to an API that is defined by a
512 Lua script. This plugin was developed to provide system
513 administrators with a way to do any task (not only file
514 staging) at different points in a job’s life cycle.
515
516 burst_buffer/none
517
518 CliFilterPlugins
519 A comma-delimited list of command line interface option fil‐
520 ter/modification plugins. The specified plugins will be executed
521 in the order listed. No cli_filter plugins are used by default.
522 Acceptable values at present are:
523
524 cli_filter/lua
525 This plugin allows you to write your own implementation
526 of a cli_filter using lua.
527
528 cli_filter/syslog
529 This plugin enables logging of job submission activities
530 performed. All the salloc/sbatch/srun options are logged
531 to syslog together with environment variables in JSON
532 format. If the plugin is not the last one in the list it
533 may log values different than what was actually sent to
534 slurmctld.
535
536 cli_filter/user_defaults
537 This plugin looks for the file $HOME/.slurm/defaults and
538 reads every line of it as a key=value pair, where key is
539 any of the job submission options available to sal‐
540 loc/sbatch/srun and value is a default value defined by
541 the user. For instance:
542 time=1:30
543 mem=2048
544 The above will result in a user defined default for each
545 of their jobs of "-t 1:30" and "--mem=2048".
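
       For example, to apply a Lua filter and then log the resulting
       submissions (keeping the syslog plugin last so it records the final
       values sent to slurmctld):

           CliFilterPlugins=lua,syslog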
546
547 ClusterName
548 The name by which this Slurm managed cluster is known in the ac‐
counting database. This is needed to distinguish accounting
550 records when multiple clusters report to the same database. Be‐
551 cause of limitations in some databases, any upper case letters
552 in the name will be silently mapped to lower case. In order to
553 avoid confusion, it is recommended that the name be lower case.
554
555 CommunicationParameters
556 Comma-separated options identifying communication options.
557
558 block_null_hash
559 Require all Slurm authentication tokens to in‐
560 clude a newer (20.11.9 and 21.08.8) payload that
561 provides an additional layer of security against
562 credential replay attacks. This option should
563 only be enabled once all Slurm daemons have been
564 upgraded to 20.11.9/21.08.8 or newer, and all
565 jobs that were started before the upgrade have
566 been completed.
567
568 CheckGhalQuiesce
569 Used specifically on a Cray using an Aries Ghal
570 interconnect. This will check to see if the sys‐
571 tem is quiescing when sending a message, and if
572 so, we wait until it is done before sending.
573
574 DisableIPv4 Disable IPv4 only operation for all slurm daemons
575 (except slurmdbd). This should also be set in
576 your slurmdbd.conf file.
577
578 EnableIPv6 Enable using IPv6 addresses for all slurm daemons
579 (except slurmdbd). When using both IPv4 and IPv6,
580 address family preferences will be based on your
581 /etc/gai.conf file. This should also be set in
582 your slurmdbd.conf file.
583
584 keepaliveinterval=#
585 Specifies the interval between keepalive probes
586 on the socket communications between srun and its
587 slurmstepd process.
588
589 keepaliveprobes=#
590 Specifies the number of keepalive probes sent on
591 the socket communications between srun command
592 and its slurmstepd process before the connection
593 is considered broken.
594
595 keepalivetime=#
596 Specifies how long sockets communications used
597 between the srun command and its slurmstepd
598 process are kept alive after disconnect. Longer
599 values can be used to improve reliability of com‐
600 munications in the event of network failures.
601
602 NoAddrCache By default, Slurm will cache a node's network ad‐
603 dress after successfully establishing the node's
604 network address. This option disables the cache
605 and Slurm will look up the node's network address
606 each time a connection is made. This is useful,
607 for example, in a cloud environment where the
608 node addresses come and go out of DNS.
609
610 NoCtldInAddrAny
611 Used to directly bind to the address of what the
612 node resolves to running the slurmctld instead of
613 binding messages to any address on the node,
614 which is the default.
615
616 NoInAddrAny Used to directly bind to the address of what the
617 node resolves to instead of binding messages to
618 any address on the node which is the default.
619 This option is for all daemons/clients except for
620 the slurmctld.
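
       For example, a cloud-style cluster might disable the address cache
       and tune keepalives with something like the following (the numeric
       values are illustrative):

           CommunicationParameters=NoAddrCache,keepaliveinterval=30,keepaliveprobes=5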
621
622 CompleteWait
623 The time to wait, in seconds, when any job is in the COMPLETING
624 state before any additional jobs are scheduled. This is to at‐
625 tempt to keep jobs on nodes that were recently in use, with the
626 goal of preventing fragmentation. If set to zero, pending jobs
627 will be started as soon as possible. Since a COMPLETING job's
628 resources are released for use by other jobs as soon as the Epi‐
629 log completes on each individual node, this can result in very
630 fragmented resource allocations. To provide jobs with the mini‐
631 mum response time, a value of zero is recommended (no waiting).
632 To minimize fragmentation of resources, a value equal to Kill‐
633 Wait plus two is recommended. In that case, setting KillWait to
634 a small value may be beneficial. The default value of Complete‐
635 Wait is zero seconds. The value may not exceed 65533.
636
637 NOTE: Setting reduce_completing_frag affects the behavior of
638 CompleteWait.
639
640 ControlAddr
641 Deprecated option, see SlurmctldHost.
642
643 ControlMachine
644 Deprecated option, see SlurmctldHost.
645
646 CoreSpecPlugin
647 Identifies the plugins to be used for enforcement of core spe‐
648 cialization. A restart of the slurmd daemons is required for
649 changes to this parameter to take effect. Acceptable values at
650 present include:
651
652 core_spec/cray_aries
653 used only for Cray systems
654
655 core_spec/none used for all other system types
656
657 CpuFreqDef
658 Default CPU frequency value or frequency governor to use when
659 running a job step if it has not been explicitly set with the
660 --cpu-freq option. Acceptable values at present include a nu‐
661 meric value (frequency in kilohertz) or one of the following
662 governors:
663
664 Conservative attempts to use the Conservative CPU governor
665
666 OnDemand attempts to use the OnDemand CPU governor
667
668 Performance attempts to use the Performance CPU governor
669
670 PowerSave attempts to use the PowerSave CPU governor
671 There is no default value. If unset, no attempt to set the governor is
672 made if the --cpu-freq option has not been set.
673
674 CpuFreqGovernors
675 List of CPU frequency governors allowed to be set with the sal‐
676 loc, sbatch, or srun option --cpu-freq. Acceptable values at
677 present include:
678
679 Conservative attempts to use the Conservative CPU governor
680
681 OnDemand attempts to use the OnDemand CPU governor (a de‐
682 fault value)
683
684 Performance attempts to use the Performance CPU governor (a
685 default value)
686
687 PowerSave attempts to use the PowerSave CPU governor
688
689 SchedUtil attempts to use the SchedUtil CPU governor
690
691 UserSpace attempts to use the UserSpace CPU governor (a de‐
692 fault value)
693 The default is OnDemand, Performance and UserSpace.
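
       For example, to default job steps to the Performance governor while
       still allowing users to request OnDemand or UserSpace (an
       illustrative combination, not a recommendation):

           CpuFreqDef=Performance
           CpuFreqGovernors=OnDemand,Performance,UserSpace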
694
695 CredType
696 The cryptographic signature tool to be used in the creation of
697 job step credentials. A restart of slurmctld is required for
698 changes to this parameter to take effect. The default (and rec‐
699 ommended) value is "cred/munge".
700
701 DebugFlags
702 Defines specific subsystems which should provide more detailed
703 event logging. Multiple subsystems can be specified with comma
704 separators. Most DebugFlags will result in verbose-level log‐
705 ging for the identified subsystems, and could impact perfor‐
706 mance.
707
708 NOTE: You can also set debug flags by having the SLURM_DE‐
709 BUG_FLAGS environment variable defined with the desired flags
710 when the process (client command, daemon, etc.) is started. The
711 environment variable takes precedence over the setting in the
712 slurm.conf.
713
714 Valid subsystems available include:
715
716 Accrue Accrue counters accounting details
717
718 Agent RPC agents (outgoing RPCs from Slurm daemons)
719
720 Backfill Backfill scheduler details
721
722 BackfillMap Backfill scheduler to log a very verbose map of
723 reserved resources through time. Combine with
724 Backfill for a verbose and complete view of the
725 backfill scheduler's work.
726
727 BurstBuffer Burst Buffer plugin
728
729 Cgroup Cgroup details
730
731 CPU_Bind CPU binding details for jobs and steps
732
733 CpuFrequency Cpu frequency details for jobs and steps using
734 the --cpu-freq option.
735
736 Data Generic data structure details.
737
738 Dependency Job dependency debug info
739
740 Elasticsearch Elasticsearch debug info
741
742 Energy AcctGatherEnergy debug info
743
744 ExtSensors External Sensors debug info
745
746 Federation Federation scheduling debug info
747
748 FrontEnd Front end node details
749
750 Gres Generic resource details
751
752 Hetjob Heterogeneous job details
753
754 Gang Gang scheduling details
755
756 JobAccountGather Common job account gathering details (not
757 plugin specific).
758
759 JobContainer Job container plugin details
760
761 License License management details
762
763 Network Network details. Warning: activating this flag
764 may cause logging of passwords, tokens or other
765 authentication credentials.
766
767 NetworkRaw Dump raw hex values of key Network communica‐
768 tions. Warning: This flag will cause very ver‐
769 bose logs and may cause logging of passwords,
770 tokens or other authentication credentials.
771
772 NodeFeatures Node Features plugin debug info
773
774 NO_CONF_HASH Do not log when the slurm.conf files differ be‐
775 tween Slurm daemons
776
777 Power Power management plugin and power save (sus‐
778 pend/resume programs) details
779
780 Priority Job prioritization
781
782 Profile AcctGatherProfile plugins details
783
784 Protocol Communication protocol details
785
786 Reservation Advanced reservations
787
788 Route Message forwarding debug info
789
790 Script Debug info regarding the process that runs
791 slurmctld scripts such as PrologSlurmctld and
792 EpilogSlurmctld
793
794 SelectType Resource selection plugin
795
796 Steps Slurmctld resource allocation for job steps
797
798 Switch Switch plugin
799
800 TimeCray Timing of Cray APIs
801
802 TraceJobs Trace jobs in slurmctld. It will print detailed
803 job information including state, job ids and
allocated node counts.
805
806 Triggers Slurmctld triggers
807
808 WorkQueue Work Queue details
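
       For example, to troubleshoot backfill scheduling decisions one might
       temporarily set:

           DebugFlags=Backfill,BackfillMap,Priority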
809
810 DefCpuPerGPU
811 Default count of CPUs allocated per allocated GPU. This value is
812 used only if the job didn't specify --cpus-per-task and
813 --cpus-per-gpu.
814
815 DefMemPerCPU
816 Default real memory size available per usable allocated CPU in
817 megabytes. Used to avoid over-subscribing memory and causing
818 paging. DefMemPerCPU would generally be used if individual pro‐
819 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
820 lectType=select/cons_tres). The default value is 0 (unlimited).
821 Also see DefMemPerGPU, DefMemPerNode and MaxMemPerCPU. DefMem‐
822 PerCPU, DefMemPerGPU and DefMemPerNode are mutually exclusive.
823
824
825 NOTE: This applies to usable allocated CPUs in a job allocation.
826 This is important when more than one thread per core is config‐
827 ured. If a job requests --threads-per-core with fewer threads
828 on a core than exist on the core (or --hint=nomultithread which
829 implies --threads-per-core=1), the job will be unable to use
830 those extra threads on the core and those threads will not be
831 included in the memory per CPU calculation. But if the job has
832 access to all threads on the core, those threads will be in‐
833 cluded in the memory per CPU calculation even if the job did not
834 explicitly request those threads.
835
836 In the following examples, each core has two threads.
837
838 In this first example, two tasks can run on separate hyper‐
839 threads in the same core because --threads-per-core is not used.
840 The third task uses both threads of the second core. The allo‐
841 cated memory per cpu includes all threads:
842
843 $ salloc -n3 --mem-per-cpu=100
844 salloc: Granted job allocation 17199
845 $ sacct -j $SLURM_JOB_ID -X -o jobid%7,reqtres%35,alloctres%35
846 JobID ReqTRES AllocTRES
847 ------- ----------------------------------- -----------------------------------
848 17199 billing=3,cpu=3,mem=300M,node=1 billing=4,cpu=4,mem=400M,node=1
849
850 In this second example, because of --threads-per-core=1, each
851 task is allocated an entire core but is only able to use one
852 thread per core. Allocated CPUs includes all threads on each
853 core. However, allocated memory per cpu includes only the usable
854 thread in each core.
855
856 $ salloc -n3 --mem-per-cpu=100 --threads-per-core=1
857 salloc: Granted job allocation 17200
858 $ sacct -j $SLURM_JOB_ID -X -o jobid%7,reqtres%35,alloctres%35
859 JobID ReqTRES AllocTRES
860 ------- ----------------------------------- -----------------------------------
861 17200 billing=3,cpu=3,mem=300M,node=1 billing=6,cpu=6,mem=300M,node=1
862
863 DefMemPerGPU
864 Default real memory size available per allocated GPU in
865 megabytes. The default value is 0 (unlimited). Also see
866 DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU and
867 DefMemPerNode are mutually exclusive.
868
869 DefMemPerNode
870 Default real memory size available per allocated node in
871 megabytes. Used to avoid over-subscribing memory and causing
872 paging. DefMemPerNode would generally be used if whole nodes
873 are allocated to jobs (SelectType=select/linear) and resources
874 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
875 The default value is 0 (unlimited). Also see DefMemPerCPU,
876 DefMemPerGPU and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
877 DefMemPerNode are mutually exclusive.
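
       For example, a cluster allocating individual cores could set a
       per-CPU memory default and cap as in the following sketch (the
       2048/4096 MB values are placeholders for site-specific limits):

           SelectType=select/cons_tres
           DefMemPerCPU=2048
           MaxMemPerCPU=4096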
878
879 DependencyParameters
880 Multiple options may be comma separated.
881
882 disable_remote_singleton
883 By default, when a federated job has a singleton depen‐
884 dency, each cluster in the federation must clear the sin‐
885 gleton dependency before the job's singleton dependency
886 is considered satisfied. Enabling this option means that
887 only the origin cluster must clear the singleton depen‐
888 dency. This option must be set in every cluster in the
889 federation.
890
891 kill_invalid_depend
If a job has an invalid dependency and it can never run,
terminate it and set its state to JOB_CANCELLED. By
894 default the job stays pending with reason DependencyNev‐
895 erSatisfied.
896
897 max_depend_depth=#
898 Maximum number of jobs to test for a circular job depen‐
899 dency. Stop testing after this number of job dependencies
900 have been tested. The default value is 10 jobs.
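
       For example, to cancel jobs with unsatisfiable dependencies and
       deepen circular-dependency checking (the depth value is
       illustrative):

           DependencyParameters=kill_invalid_depend,max_depend_depth=20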
901
902 DisableRootJobs
903 If set to "YES" then user root will be prevented from running
904 any jobs. The default value is "NO", meaning user root will be
905 able to execute jobs. DisableRootJobs may also be set by parti‐
906 tion.
907
908 EioTimeout
909 The number of seconds srun waits for slurmstepd to close the
910 TCP/IP connection used to relay data between the user applica‐
911 tion and srun when the user application terminates. The default
912 value is 60 seconds. May not exceed 65533.
913
914 EnforcePartLimits
915 If set to "ALL" then jobs which exceed a partition's size and/or
916 time limits will be rejected at submission time. If job is sub‐
917 mitted to multiple partitions, the job must satisfy the limits
918 on all the requested partitions. If set to "NO" then the job
919 will be accepted and remain queued until the partition limits
are altered (Time and Node Limits). If set to "ANY" a job must
921 satisfy any of the requested partitions to be submitted. The de‐
922 fault value is "NO". NOTE: If set, then a job's QOS can not be
923 used to exceed partition limits. NOTE: The partition limits be‐
924 ing considered are its configured MaxMemPerCPU, MaxMemPerNode,
925 MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, Allow‐
926 Groups, AllowQOS, and QOS usage threshold.
927
928 Epilog Fully qualified pathname of a script to execute as user root on
929 every node when a user's job completes (e.g. "/usr/lo‐
930 cal/slurm/epilog"). A glob pattern (See glob (7)) may also be
931 used to run more than one epilog script (e.g. "/etc/slurm/epi‐
932 log.d/*"). The Epilog script or scripts may be used to purge
933 files, disable user login, etc. By default there is no epilog.
934 See Prolog and Epilog Scripts for more information.
935
936 EpilogMsgTime
937 The number of microseconds that the slurmctld daemon requires to
938 process an epilog completion message from the slurmd daemons.
939 This parameter can be used to prevent a burst of epilog comple‐
940 tion messages from being sent at the same time which should help
941 prevent lost messages and improve throughput for large jobs.
942 The default value is 2000 microseconds. For a 1000 node job,
943 this spreads the epilog completion messages out over two sec‐
944 onds.
945
946 EpilogSlurmctld
947 Fully qualified pathname of a program for the slurmctld to exe‐
948 cute upon termination of a job allocation (e.g. "/usr/lo‐
949 cal/slurm/epilog_controller"). The program executes as Slur‐
950 mUser, which gives it permission to drain nodes and requeue the
951 job if a failure occurs (See scontrol(1)). Exactly what the
952 program does and how it accomplishes this is completely at the
953 discretion of the system administrator. Information about the
954 job being initiated, its allocated nodes, etc. are passed to the
955 program using environment variables. See Prolog and Epilog
956 Scripts for more information.
957
958 ExtSensorsFreq
959 The external sensors plugin sampling interval. If ExtSen‐
960 sorsType=ext_sensors/none, this parameter is ignored. For all
961 other values of ExtSensorsType, this parameter is the number of
962 seconds between external sensors samples for hardware components
963 (nodes, switches, etc.) The default value is zero. This value
964 disables external sensors sampling. Note: This parameter does
965 not affect external sensors data collection for jobs/steps.
966
967 ExtSensorsType
968 Identifies the plugin to be used for external sensors data col‐
969 lection. Slurmctld calls this plugin to collect external sen‐
970 sors data for jobs/steps and hardware components. In case of
971 node sharing between jobs the reported values per job/step
972 (through sstat or sacct) may not be accurate. See also "man
973 ext_sensors.conf".
974
975 Configurable values at present are:
976
977 ext_sensors/none No external sensors data is collected.
978
979 ext_sensors/rrd External sensors data is collected from the
980 RRD database.
981
982 FairShareDampeningFactor
983 Dampen the effect of exceeding a user or group's fair share of
allocated resources. Higher values will provide greater ability
985 to differentiate between exceeding the fair share at high levels
986 (e.g. a value of 1 results in almost no difference between over‐
987 consumption by a factor of 10 and 100, while a value of 5 will
988 result in a significant difference in priority). The default
989 value is 1.
990
991 FederationParameters
992 Used to define federation options. Multiple options may be comma
993 separated.
994
995 fed_display
996 If set, then the client status commands (e.g. squeue,
997 sinfo, sprio, etc.) will display information in a feder‐
998 ated view by default. This option is functionally equiva‐
999 lent to using the --federation options on each command.
1000 Use the client's --local option to override the federated
1001 view and get a local view of the given cluster.
1002
1003 FirstJobId
1004 The job id to be used for the first job submitted to Slurm. Job
id values generated will be incremented by 1 for each subsequent
1006 job. Value must be larger than 0. The default value is 1. Also
1007 see MaxJobId
1008
1009 GetEnvTimeout
1010 Controls how long the job should wait (in seconds) to load the
1011 user's environment before attempting to load it from a cache
1012 file. Applies when the salloc or sbatch --get-user-env option
1013 is used. If set to 0 then always load the user's environment
1014 from the cache file. The default value is 2 seconds.
1015
1016 GresTypes
1017 A comma-delimited list of generic resources to be managed (e.g.
1018 GresTypes=gpu,mps). These resources may have an associated GRES
1019 plugin of the same name providing additional functionality. No
1020 generic resources are managed by default. Ensure this parameter
1021 is consistent across all nodes in the cluster for proper opera‐
1022 tion. A restart of slurmctld and the slurmd daemons is required
1023 for this to take effect.
1024
1025 GroupUpdateForce
1026 If set to a non-zero value, then information about which users
1027 are members of groups allowed to use a partition will be updated
1028 periodically, even when there have been no changes to the
1029 /etc/group file. If set to zero, group member information will
1030 be updated only after the /etc/group file is updated. The de‐
1031 fault value is 1. Also see the GroupUpdateTime parameter.
1032
1033 GroupUpdateTime
1034 Controls how frequently information about which users are mem‐
1035 bers of groups allowed to use a partition will be updated, and
1036 how long user group membership lists will be cached. The time
1037 interval is given in seconds with a default value of 600 sec‐
1038 onds. A value of zero will prevent periodic updating of group
1039 membership information. Also see the GroupUpdateForce parame‐
1040 ter.
1041
GpuFreqDef=[<type>=]<value>[,<type>=<value>]
1043 Default GPU frequency to use when running a job step if it has
1044 not been explicitly set using the --gpu-freq option. This op‐
1045 tion can be used to independently configure the GPU and its mem‐
1046 ory frequencies. Defaults to "high,memory=high". After the job
1047 is completed, the frequencies of all affected GPUs will be reset
1048 to the highest possible values. In some cases, system power
1049 caps may override the requested values. The field type can be
1050 "memory". If type is not specified, the GPU frequency is im‐
1051 plied. The value field can either be "low", "medium", "high",
1052 "highm1" or a numeric value in megahertz (MHz). If the speci‐
1053 fied numeric value is not possible, a value as close as possible
1054 will be used. See below for definition of the values. Examples
of use include "GpuFreqDef=medium,memory=high" and
"GpuFreqDef=450".
1057
1058 Supported value definitions:
1059
1060 low the lowest available frequency.
1061
1062 medium attempts to set a frequency in the middle of the
1063 available range.
1064
1065 high the highest available frequency.
1066
1067 highm1 (high minus one) will select the next highest avail‐
1068 able frequency.
1069
1070 HealthCheckInterval
1071 The interval in seconds between executions of HealthCheckPro‐
1072 gram. The default value is zero, which disables execution.
1073
1074 HealthCheckNodeState
1075 Identify what node states should execute the HealthCheckProgram.
1076 Multiple state values may be specified with a comma separator.
1077 The default value is ANY to execute on nodes in any state.
1078
1079 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
1080 cated).
1081
1082 ANY Run on nodes in any state.
1083
1084 CYCLE Rather than running the health check program on all
1085 nodes at the same time, cycle through running on all
1086 compute nodes through the course of the HealthCheck‐
1087 Interval. May be combined with the various node
1088 state options.
1089
1090 IDLE Run on nodes in the IDLE state.
1091
1092 MIXED Run on nodes in the MIXED state (some CPUs idle and
1093 other CPUs allocated).
1094
1095 HealthCheckProgram
1096 Fully qualified pathname of a script to execute as user root pe‐
1097 riodically on all compute nodes that are not in the NOT_RESPOND‐
1098 ING state. This program may be used to verify the node is fully
1099 operational and DRAIN the node or send email if a problem is de‐
1100 tected. Any action to be taken must be explicitly performed by
1101 the program (e.g. execute "scontrol update NodeName=foo
1102 State=drain Reason=tmp_file_system_full" to drain a node). The
1103 execution interval is controlled using the HealthCheckInterval
1104 parameter. Note that the HealthCheckProgram will be executed at
1105 the same time on all nodes to minimize its impact upon parallel
1106 programs. This program will be killed if it does not terminate
1107 normally within 60 seconds. This program will also be executed
1108 when the slurmd daemon is first started and before it registers
1109 with the slurmctld daemon. By default, no program will be exe‐
1110 cuted.
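
       For example, a site running a node health check script every five
       minutes on idle nodes only might use the following sketch (the
       script path is a placeholder):

           HealthCheckProgram=/usr/local/sbin/node_healthcheck.sh
           HealthCheckInterval=300
           HealthCheckNodeState=IDLE,CYCLE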
1111
1112 InactiveLimit
1113 The interval, in seconds, after which a non-responsive job allo‐
1114 cation command (e.g. srun or salloc) will result in the job be‐
1115 ing terminated. If the node on which the command is executed
1116 fails or the command abnormally terminates, this will terminate
1117 its job allocation. This option has no effect upon batch jobs.
1118 When setting a value, take into consideration that a debugger
1119 using srun to launch an application may leave the srun command
1120 in a stopped state for extended periods of time. This limit is
1121 ignored for jobs running in partitions with the RootOnly flag
1122 set (the scheduler running as root will be responsible for the
1123 job). The default value is unlimited (zero) and may not exceed
1124 65533 seconds.
1125
1126 InteractiveStepOptions
1127 When LaunchParameters=use_interactive_step is enabled, launching
1128 salloc will automatically start an srun process with Interac‐
1129 tiveStepOptions to launch a terminal on a node in the job allo‐
1130 cation. The default value is "--interactive --preserve-env
1131 --pty $SHELL". The "--interactive" option is intentionally not
1132 documented in the srun man page. It is meant only to be used in
1133 InteractiveStepOptions in order to create an "interactive step"
1134 that will not consume resources so that other steps may run in
1135 parallel with the interactive step.
1136
1137 JobAcctGatherType
1138 The job accounting mechanism type. Acceptable values at present
1139 include "jobacct_gather/linux" (for Linux systems),
1140 "jobacct_gather/cgroup" and "jobacct_gather/none" (no accounting
1141 data collected). The default value is "jobacct_gather/none".
1142 "jobacct_gather/cgroup" is a plugin for the Linux operating sys‐
1143 tem that uses cgroups to collect accounting statistics. The
1144 plugin collects the following statistics: From the cgroup memory
1145 subsystem: memory.usage_in_bytes (reported as 'pages') and rss
1146 from memory.stat (reported as 'rss'). From the cgroup cpuacct
1147 subsystem: user cpu time and system cpu time. No value is pro‐
1148 vided by cgroups for virtual memory size ('vsize'). In order to
use the sstat tool, "jobacct_gather/linux" or
"jobacct_gather/cgroup" must be configured.
1151 NOTE: Changing this configuration parameter changes the contents
1152 of the messages between Slurm daemons. Any previously running
1153 job steps are managed by a slurmstepd daemon that will persist
1154 through the lifetime of that job step and not change its commu‐
1155 nication protocol. Only change this configuration parameter when
1156 there are no running job steps.
1157
1158 JobAcctGatherFrequency
1159 The job accounting and profiling sampling intervals. The sup‐
ported format is as follows:
1161
1162 JobAcctGatherFrequency=<datatype>=<interval>
1163 where <datatype>=<interval> specifies the task sam‐
1164 pling interval for the jobacct_gather plugin or a
1165 sampling interval for a profiling type by the
1166 acct_gather_profile plugin. Multiple, comma-sepa‐
1167 rated <datatype>=<interval> intervals may be speci‐
1168 fied. Supported datatypes are as follows:
1169
1170 task=<interval>
1171 where <interval> is the task sampling inter‐
1172 val in seconds for the jobacct_gather plugins
1173 and for task profiling by the
1174 acct_gather_profile plugin.
1175
1176 energy=<interval>
1177 where <interval> is the sampling interval in
1178 seconds for energy profiling using the
1179 acct_gather_energy plugin
1180
1181 network=<interval>
1182 where <interval> is the sampling interval in
1183 seconds for infiniband profiling using the
1184 acct_gather_interconnect plugin.
1185
1186 filesystem=<interval>
1187 where <interval> is the sampling interval in
1188 seconds for filesystem profiling using the
1189 acct_gather_filesystem plugin.
1190
1191
The default value for the task sampling interval is 30 seconds.
The default value for all other intervals is 0.
1195 An interval of 0 disables sampling of the specified type. If
1196 the task sampling interval is 0, accounting information is col‐
1197 lected only at job termination, which reduces Slurm interference
1198 with the job, but also means that the statistics about a job
don't reflect the average or maximum of several samples
throughout the life of the job, but just show the information collected
1201 in the single sample.
1202 Smaller (non-zero) values have a greater impact upon job perfor‐
1203 mance, but a value of 30 seconds is not likely to be noticeable
1204 for applications having less than 10,000 tasks.
1205 Users can independently override each interval on a per job ba‐
1206 sis using the --acctg-freq option when submitting the job.
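
       For example, to sample task statistics every 30 seconds and energy
       and filesystem data every 60 seconds (illustrative intervals):

           JobAcctGatherType=jobacct_gather/cgroup
           JobAcctGatherFrequency=task=30,energy=60,filesystem=60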
1207
1208 JobAcctGatherParams
1209 Arbitrary parameters for the job account gather plugin. Accept‐
1210 able values at present include:
1211
1212 NoShared Exclude shared memory from RSS. This option
1213 cannot be used with UsePSS.
1214
1215 UsePss Use PSS value instead of RSS to calculate
1216 real usage of memory. The PSS value will be
1217 saved as RSS. This option cannot be used
1218 with NoShared.
1219
OverMemoryKill Kill processes that are detected to use
more memory than requested by steps, every
time accounting information is gathered
1223 by the JobAcctGather plugin. This parameter
1224 should be used with caution because a job
1225 exceeding its memory allocation may affect
1226 other processes and/or machine health.
1227
1228 NOTE: If available, it is recommended to
1229 limit memory by enabling task/cgroup as a
1230 TaskPlugin and making use of Constrain‐
1231 RAMSpace=yes in the cgroup.conf instead of
1232 using this JobAcctGather mechanism for mem‐
1233 ory enforcement. Using JobAcctGather is
1234 polling based and there is a delay before a
1235 job is killed, which could lead to system
1236 Out of Memory events.
1237
1238 NOTE: When using OverMemoryKill, if the com‐
1239 bined memory used by all the processes in a
1240 step exceeds the memory limit, the entire
1241 step will be killed/cancelled by the JobAc‐
1242 ctGather plugin. This differs from the be‐
1243 havior when using ConstrainRAMSpace, where
1244 processes in the step will be killed, but
1245 the step will be left active, possibly with
1246 other processes left running.
1247
1248 JobCompHost
1249 The name of the machine hosting the job completion database.
1250 Only used for database type storage plugins, ignored otherwise.
1251
1252 JobCompLoc
1253 The fully qualified file name where job completion records are
1254 written when the JobCompType is "jobcomp/filetxt" or the data‐
1255 base where job completion records are stored when the JobComp‐
1256 Type is a database, or a complete URL endpoint with format
1257 <host>:<port>/<target>/_doc when JobCompType is "jobcomp/elas‐
1258 ticsearch" like i.e. "localhost:9200/slurm/_doc". NOTE: More
1259 information is available at the Slurm web site
1260 <https://slurm.schedmd.com/elasticsearch.html>.
1261
1262 JobCompParams
1263 Pass arbitrary text string to job completion plugin. Also see
1264 JobCompType.
1265
1266 JobCompPass
1267 The password used to gain access to the database to store the
1268 job completion data. Only used for database type storage plug‐
1269 ins, ignored otherwise.
1270
1271 JobCompPort
1272 The listening port of the job completion database server. Only
1273 used for database type storage plugins, ignored otherwise.
1274
1275 JobCompType
1276 The job completion logging mechanism type. Acceptable values at
1277 present include:
1278
1279 jobcomp/none
1280 Upon job completion, a record of the job is purged from
1281 the system. If using the accounting infrastructure this
1282 plugin may not be of interest since some of the informa‐
1283 tion is redundant.
1284
1285 jobcomp/elasticsearch
1286 Upon job completion, a record of the job should be writ‐
1287 ten to an Elasticsearch server, specified by the JobCom‐
1288 pLoc parameter.
1289 NOTE: More information is available at the Slurm web site
1290 ( https://slurm.schedmd.com/elasticsearch.html ).
1291
1292 jobcomp/filetxt
1293 Upon job completion, a record of the job should be writ‐
1294 ten to a text file, specified by the JobCompLoc parame‐
1295 ter.
1296
1297 jobcomp/lua
1298 Upon job completion, a record of the job should be pro‐
1299 cessed by the jobcomp.lua script, located in the default
1300 script directory (typically the subdirectory etc of the
installation directory).
1302
1303 jobcomp/mysql
1304 Upon job completion, a record of the job should be writ‐
1305 ten to a MySQL or MariaDB database, specified by the Job‐
1306 CompLoc parameter.
1307
1308 jobcomp/script
1309 Upon job completion, a script specified by the JobCompLoc
1310 parameter is to be executed with environment variables
1311 providing the job information.
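
       For example, plain text job completion logging could be configured
       as in the following sketch (the log path is a placeholder):

           JobCompType=jobcomp/filetxt
           JobCompLoc=/var/log/slurm/job_completions.log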
1312
1313 JobCompUser
1314 The user account for accessing the job completion database.
1315 Only used for database type storage plugins, ignored otherwise.
1316
1317 JobContainerType
1318 Identifies the plugin to be used for job tracking. A restart of
1319 slurmctld is required for changes to this parameter to take ef‐
1320 fect. NOTE: The JobContainerType applies to a job allocation,
1321 while ProctrackType applies to job steps. Acceptable values at
1322 present include:
1323
1324 job_container/cncu Used only for Cray systems (CNCU = Compute
1325 Node Clean Up)
1326
1327 job_container/none Used for all other system types
1328
1329 job_container/tmpfs Used to create a private namespace on the
1330 filesystem for jobs, which houses temporary
1331 file systems (/tmp and /dev/shm) for each
1332 job. 'PrologFlags=Contain' must be set to
1333 use this plugin.
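
       For example, per-job private /tmp and /dev/shm could be enabled with
       the following sketch (the tmpfs plugin is additionally configured
       through job_container.conf):

           JobContainerType=job_container/tmpfs
           PrologFlags=Contain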
1334
1335 JobFileAppend
1336 This option controls what to do if a job's output or error file
exists when the job is started. If JobFileAppend is set to a
1338 value of 1, then append to the existing file. By default, any
1339 existing file is truncated.
1340
1341 JobRequeue
1342 This option controls the default ability for batch jobs to be
1343 requeued. Jobs may be requeued explicitly by a system adminis‐
1344 trator, after node failure, or upon preemption by a higher pri‐
1345 ority job. If JobRequeue is set to a value of 1, then batch
1346 jobs may be requeued unless explicitly disabled by the user. If
1347 JobRequeue is set to a value of 0, then batch jobs will not be
1348 requeued unless explicitly enabled by the user. Use the sbatch
1349 --no-requeue or --requeue option to change the default behavior
1350 for individual jobs. The default value is 1.
1351
1352 JobSubmitPlugins
1353 These are intended to be site-specific plugins which can be used
1354 to set default job parameters and/or logging events. Slurm can
1355 be configured to use multiple job_submit plugins if desired,
1356 which must be specified as a comma-delimited list and will be
1357 executed in the order listed.
1358 e.g. for multiple job_submit plugin configuration:
1359 JobSubmitPlugins=lua,require_timelimit
1360 Take a look at <https://slurm.schedmd.com/job_submit_plug‐
1361 ins.html> for further plugin implementation details. No job sub‐
1362 mission plugins are used by default. Currently available plug‐
1363 ins are:
1364
1365 all_partitions Set default partition to all partitions
1366 on the cluster.
1367
1368 defaults Set default values for job submission or
1369 modify requests.
1370
1371 logging Log select job submission and modifica‐
1372 tion parameters.
1373
1374 lua Execute a Lua script implementing site's
1375 own job_submit logic. Only one Lua
1376 script will be executed. It must be
1377 named "job_submit.lua" and must be lo‐
1378 cated in the default configuration di‐
1379 rectory (typically the subdirectory
1380 "etc" of the installation directory).
1381 Sample Lua scripts can be found with the
1382 Slurm distribution, in the directory
1383 contribs/lua. Slurmctld will fatal on
1384 startup if the configured lua script is
1385 invalid. Slurm will try to load the
1386 script for each job submission. If the
1387 script is broken or removed while slurm‐
1388                                  ctld is running, Slurm will fall back to
1389 the previous working version of the
1390 script.
1391
1392 partition Set a job's default partition based upon
1393 job submission parameters and available
1394 partitions.
1395
1396 pbs Translate PBS job submission options to
1397 Slurm equivalent (if possible).
1398
1399 require_timelimit Force job submissions to specify a time‐
1400 limit.
1401
1402 NOTE: For examples of use see the Slurm code in "src/plug‐
1403 ins/job_submit" and "contribs/lua/job_submit*.lua" then modify
1404 the code to satisfy your needs.
1405
1406 KillOnBadExit
1407              If set to 1, a step will be terminated immediately if any task
1408              crashes or aborts, as indicated by a non-zero exit code.  With
1409              the default value of 0, if one of the processes crashes or
1410              aborts, the other processes will continue to run while the
1411              crashed or aborted process waits. The user can override this
1412              configuration parameter by using srun's -K, --kill-on-bad-exit.
1413
1414 KillWait
1415 The interval, in seconds, given to a job's processes between the
1416 SIGTERM and SIGKILL signals upon reaching its time limit. If
1417 the job fails to terminate gracefully in the interval specified,
1418 it will be forcibly terminated. The default value is 30 sec‐
1419 onds. The value may not exceed 65533.
1420
1421 NodeFeaturesPlugins
1422 Identifies the plugins to be used for support of node features
1423              which can change through time. For example, a node might be
1424              booted with various BIOS settings. This is supported through
1425 the use of a node's active_features and available_features in‐
1426 formation. Acceptable values at present include:
1427
1428 node_features/knl_cray
1429 Used only for Intel Knights Landing processors (KNL) on
1430 Cray systems.
1431
1432 node_features/knl_generic
1433 Used for Intel Knights Landing processors (KNL) on a
1434 generic Linux system.
1435
1436 node_features/helpers
1437 Used to report and modify features on nodes using arbi‐
1438 trary scripts or programs.
1439
1440 LaunchParameters
1441 Identifies options to the job launch plugin. Acceptable values
1442 include:
1443
1444              batch_step_set_cpu_freq Set the cpu frequency for the batch step
1445                                      from the given --cpu-freq option, or
1446                                      from the slurm.conf CpuFreqDef setting.
1447                                      By default only steps started with srun
1448                                      will utilize the cpu frequency setting
1449                                      options.
1450
1451                                      NOTE: If you are using srun to launch
1452                                      your steps inside a batch script (as
1453                                      advised), this option can result in
1454                                      multiple agents setting the cpu_freq,
1455                                      because the batch step usually runs on
1456                                      the same resources as one or more of the
1457                                      steps that the sruns in the script create.
1458
1459 cray_net_exclusive Allow jobs on a Cray Native cluster ex‐
1460 clusive access to network resources.
1461 This should only be set on clusters pro‐
1462 viding exclusive access to each node to
1463 a single job at once, and not using par‐
1464 allel steps within the job, otherwise
1465 resources on the node can be oversub‐
1466 scribed.
1467
1468 enable_nss_slurm Permits passwd and group resolution for
1469 a job to be serviced by slurmstepd
1470 rather than requiring a lookup from a
1471 network based service. See
1472 https://slurm.schedmd.com/nss_slurm.html
1473 for more information.
1474
1475 lustre_no_flush If set on a Cray Native cluster, then do
1476 not flush the Lustre cache on job step
1477 completion. This setting will only take
1478 effect after reconfiguring, and will
1479 only take effect for newly launched
1480 jobs.
1481
1482 mem_sort Sort NUMA memory at step start. User can
1483 override this default with
1484 SLURM_MEM_BIND environment variable or
1485 --mem-bind=nosort command line option.
1486
1487              mpir_use_nodeaddr       When launching tasks, Slurm creates
1488                                      entries in MPIR_proctable that are used
1489                                      by parallel debuggers, profilers, and
1490                                      related tools to attach to running
1491                                      processes. By default the MPIR_proctable
1492                                      entries contain MPIR_procdesc structures
1493                                      where the host_name is set to NodeName.
1494                                      If this option is specified, NodeAddr
1495                                      will be used in this context
1496                                      instead.
1497
1498 disable_send_gids By default, the slurmctld will look up
1499 and send the user_name and extended gids
1500 for a job, rather than independently on
1501 each node as part of each task launch.
1502 This helps mitigate issues around name
1503 service scalability when launching jobs
1504 involving many nodes. Using this option
1505 will disable this functionality. This
1506 option is ignored if enable_nss_slurm is
1507 specified.
1508
1509 slurmstepd_memlock Lock the slurmstepd process's current
1510 memory in RAM.
1511
1512 slurmstepd_memlock_all Lock the slurmstepd process's current
1513 and future memory in RAM.
1514
1515 test_exec Have srun verify existence of the exe‐
1516 cutable program along with user execute
1517 permission on the node where srun was
1518 called before attempting to launch it on
1519 nodes in the step.
1520
1521 use_interactive_step Have salloc use the Interactive Step to
1522 launch a shell on an allocated compute
1523 node rather than locally to wherever
1524 salloc was invoked. This is accomplished
1525 by launching the srun command with In‐
1526 teractiveStepOptions as options.
1527
1528 This does not affect salloc called with
1529 a command as an argument. These jobs
1530 will continue to be executed as the
1531 calling user on the calling host.
1532
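       For illustration, several of the options above can be combined in
       a single comma-separated list, for example:

       # illustrative combination of launch options
       LaunchParameters=enable_nss_slurm,test_exec,use_interactive_step
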
1533 LaunchType
1534 Identifies the mechanism to be used to launch application tasks.
1535 Acceptable values include:
1536
1537 launch/slurm
1538 The default value.
1539
1540 Licenses
1541 Specification of licenses (or other resources available on all
1542 nodes of the cluster) which can be allocated to jobs. License
1543 names can optionally be followed by a colon and count with a de‐
1544 fault count of one. Multiple license names should be comma sep‐
1545 arated (e.g. "Licenses=foo:4,bar"). Note that Slurm prevents
1546 jobs from being scheduled if their required license specifica‐
1547 tion is not available. Slurm does not prevent jobs from using
1548 licenses that are not explicitly listed in the job submission
1549 specification.
1550
1551 LogTimeFormat
1552 Format of the timestamp in slurmctld and slurmd log files. Ac‐
1553 cepted values are "iso8601", "iso8601_ms", "rfc5424",
1554 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1555 ing in "_ms" differ from the ones without in that fractional
1556 seconds with millisecond precision are printed. The default
1557 value is "iso8601_ms". The "rfc5424" formats are the same as the
1558 "iso8601" formats except that the timezone value is also shown.
1559 The "clock" format shows a timestamp in microseconds retrieved
1560 with the C standard clock() function. The "short" format is a
1561 short date and time format. The "thread_id" format shows the
1562 timestamp in the C standard ctime() function form without the
1563 year but including the microseconds, the daemon's process ID and
1564 the current thread name and ID.
1565
1566 MailDomain
1567 Domain name to qualify usernames if email address is not explic‐
1568 itly given with the "--mail-user" option. If unset, the local
1569              MTA will need to qualify local addresses itself. Changes to Mail‐
1570 Domain will only affect new jobs.
1571
1572 MailProg
1573 Fully qualified pathname to the program used to send email per
1574 user request. The default value is "/bin/mail" (or
1575 "/usr/bin/mail" if "/bin/mail" does not exist but
1576 "/usr/bin/mail" does exist). The program is called with argu‐
1577 ments suitable for the default mail command, however additional
1578 information about the job is passed in the form of environment
1579 variables.
1580
1581 Additional variables are the same as those passed to Pro‐
1582 logSlurmctld and EpilogSlurmctld with additional variables in
1583 the following contexts:
1584
1585 ALL
1586
1587 SLURM_JOB_STATE
1588 The base state of the job when the MailProg is
1589 called.
1590
1591 SLURM_JOB_MAIL_TYPE
1592 The mail type triggering the mail.
1593
1594 BEGIN
1595
1596 SLURM_JOB_QEUEUED_TIME
1597 The amount of time the job was queued.
1598
1599 END, FAIL, REQUEUE, TIME_LIMIT_*
1600
1601 SLURM_JOB_RUN_TIME
1602 The amount of time the job ran for.
1603
1604 END, FAIL
1605
1606 SLURM_JOB_EXIT_CODE_MAX
1607 Job's exit code or highest exit code for an array
1608 job.
1609
1610 SLURM_JOB_EXIT_CODE_MIN
1611 Job's minimum exit code for an array job.
1612
1613 SLURM_JOB_TERM_SIGNAL_MAX
1614 Job's highest signal for an array job.
1615
1616 STAGE_OUT
1617
1618 SLURM_JOB_STAGE_OUT_TIME
1619 Job's staging out time.
1620
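       As a sketch, a site routing notification mail through a custom
       wrapper and qualifying bare usernames with its own domain might
       set MailProg and MailDomain roughly as follows (the path and
       domain are hypothetical):

       # hypothetical mail delivery settings
       MailProg=/usr/local/sbin/slurm_mail_wrapper
       MailDomain=cluster.example.com
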
1621 MaxArraySize
1622              The maximum job array size. The maximum task index value will
1623              be one less than MaxArraySize to allow for an index value of
1624              zero. Configure MaxArraySize to 0 in order to disable job array
1625              use. The value may not exceed 4000001. MaxJobCount should be much
1626 larger than MaxArraySize. The default value is 1001. See also
1627 max_array_tasks in SchedulerParameters.
1628
1629 MaxDBDMsgs
1630              When communication to the SlurmDBD is not possible, the slurmctld
1631              will queue messages meant to be processed when the SlurmDBD is
1632 available again. In order to avoid running out of memory the
1633 slurmctld will only queue so many messages. The default value is
1634 10000, or MaxJobCount * 2 + Node Count * 4, whichever is
1635 greater. The value can not be less than 10000.
1636
1637 MaxJobCount
1638 The maximum number of jobs slurmctld can have in memory at one
1639 time. Combine with MinJobAge to ensure the slurmctld daemon
1640 does not exhaust its memory or other resources. Once this limit
1641 is reached, requests to submit additional jobs will fail. The
1642 default value is 10000 jobs. NOTE: Each task of a job array
1643 counts as one job even though they will not occupy separate job
1644              records until modified or initiated. Performance can suffer
1645              with more than a few hundred thousand jobs. Setting a MaxSub‐
1646              mitJobs limit per user is generally valuable to prevent a single
1647              user from filling the system with jobs. This is accomplished using
1648 Slurm's database and configuring enforcement of resource limits.
1649 A restart of slurmctld is required for changes to this parameter
1650 to take effect.
1651
1652 MaxJobId
1653 The maximum job id to be used for jobs submitted to Slurm with‐
1654              out a specific requested value. Job ids are unsigned 32-bit inte‐
1655 gers with the first 26 bits reserved for local job ids and the
1656 remaining 6 bits reserved for a cluster id to identify a feder‐
1657 ated job's origin. The maximum allowed local job id is
1658 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1659 (0x03ff0000). MaxJobId only applies to the local job id and not
1660 the federated job id. Job id values generated will be incre‐
1661 mented by 1 for each subsequent job. Once MaxJobId is reached,
1662 the next job will be assigned FirstJobId. Federated jobs will
1663 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1664 bId.
1665
1666 MaxMemPerCPU
1667 Maximum real memory size available per allocated CPU in
1668 megabytes. Used to avoid over-subscribing memory and causing
1669 paging. MaxMemPerCPU would generally be used if individual pro‐
1670 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
1671 lectType=select/cons_tres). The default value is 0 (unlimited).
1672 Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode. MaxMem‐
1673 PerCPU and MaxMemPerNode are mutually exclusive.
1674
1675 NOTE: If a job specifies a memory per CPU limit that exceeds
1676 this system limit, that job's count of CPUs per task will try to
1677 automatically increase. This may result in the job failing due
1678 to CPU count limits. This auto-adjustment feature is a best-ef‐
1679 fort one and optimal assignment is not guaranteed due to the
1680 possibility of having heterogeneous configurations and
1681 multi-partition/qos jobs. If this is a concern it is advised to
1682 use a job submit LUA plugin instead to enforce auto-adjustments
1683 to your specific needs.
1684
1685 MaxMemPerNode
1686 Maximum real memory size available per allocated node in
1687 megabytes. Used to avoid over-subscribing memory and causing
1688 paging. MaxMemPerNode would generally be used if whole nodes
1689 are allocated to jobs (SelectType=select/linear) and resources
1690 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1691 The default value is 0 (unlimited). Also see DefMemPerNode and
1692 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
1693 clusive.
1694
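       A sketch of a per-CPU memory policy using DefMemPerCPU together
       with MaxMemPerCPU (the values are purely illustrative):

       # illustrative memory limits, in megabytes per allocated CPU
       DefMemPerCPU=2048
       MaxMemPerCPU=8192
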
1695 MaxNodeCount
1696 Maximum count of nodes which may exist in the controller. By de‐
1697 fault MaxNodeCount will be set to the number of nodes found in
1698 the slurm.conf. MaxNodeCount will be ignored if less than the
1699 number of nodes found in the slurm.conf. Increase MaxNodeCount
1700 to accommodate dynamically created nodes with dynamic node reg‐
1701 istrations and nodes created with scontrol. The slurmctld daemon
1702 must be restarted for changes to this parameter to take effect.
1703
1704 MaxStepCount
1705 The maximum number of steps that any job can initiate. This pa‐
1706 rameter is intended to limit the effect of bad batch scripts.
1707 The default value is 40000 steps.
1708
1709 MaxTasksPerNode
1710 Maximum number of tasks Slurm will allow a job step to spawn on
1711 a single node. The default MaxTasksPerNode is 512. May not ex‐
1712 ceed 65533.
1713
1714 MCSParameters
1715              Parameters for the MCS (Multi-Category Security) plugin. The sup‐
1716 ported parameters are specific to the MCSPlugin. Changes to
1717 this value take effect when the Slurm daemons are reconfigured.
1718 More information about MCS is available here
1719 <https://slurm.schedmd.com/mcs.html>.
1720
1721 MCSPlugin
1722 MCS = Multi-Category Security : associate a security label to
1723 jobs and ensure that nodes can only be shared among jobs using
1724 the same security label. Acceptable values include:
1725
1726 mcs/none is the default value. No security label associated
1727 with jobs, no particular security restriction when
1728 sharing nodes among jobs.
1729
1730 mcs/account only users with the same account can share the nodes
1731 (requires enabling of accounting).
1732
1733 mcs/group only users with the same group can share the nodes.
1734
1735 mcs/user a node cannot be shared with other users.
1736
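       For example, to restrict node sharing to jobs running under the
       same account (this requires accounting to be enabled):

       MCSPlugin=mcs/account
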
1737 MessageTimeout
1738 Time permitted for a round-trip communication to complete in
1739 seconds. Default value is 10 seconds. For systems with shared
1740 nodes, the slurmd daemon could be paged out and necessitate
1741 higher values.
1742
1743 MinJobAge
1744 The minimum age of a completed job before its record is cleared
1745 from the list of jobs slurmctld keeps in memory. Combine with
1746 MaxJobCount to ensure the slurmctld daemon does not exhaust its
1747 memory or other resources. The default value is 300 seconds. A
1748 value of zero prevents any job record purging. Jobs are not
1749 purged during a backfill cycle, so it can take longer than Min‐
1750 JobAge seconds to purge a job if using the backfill scheduling
1751 plugin. In order to eliminate some possible race conditions,
1752              the recommended minimum non-zero value for MinJobAge is 2.
1753
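       As a sketch, a site expecting a large job backlog might size the
       in-memory job records as follows (the values are illustrative and
       should be tuned to the available controller memory):

       # illustrative job record sizing
       MaxJobCount=200000
       MinJobAge=600
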
1754 MpiDefault
1755 Identifies the default type of MPI to be used. Srun may over‐
1756 ride this configuration parameter in any case. Currently sup‐
1757 ported versions include: pmi2, pmix, and none (default, which
1758 works for many other versions of MPI). More information about
1759 MPI use is available here
1760 <https://slurm.schedmd.com/mpi_guide.html>.
1761
1762 MpiParams
1763              MPI parameters.  Used to identify ports used by Cray's native
1764 PMI. The format to identify a range of communication ports is
1765 "ports=12000-12999".
1766
1767 OverTimeLimit
1768 Number of minutes by which a job can exceed its time limit be‐
1769 fore being canceled. Normally a job's time limit is treated as
1770 a hard limit and the job will be killed upon reaching that
1771 limit. Configuring OverTimeLimit will result in the job's time
1772 limit being treated like a soft limit. Adding the OverTimeLimit
1773 value to the soft time limit provides a hard time limit, at
1774              which point the job is canceled.  This is particularly useful
1775              for backfill scheduling, which bases its decisions upon each
1776              job's soft time limit. The default value is zero. May not
1777              exceed 65533 minutes.  A value of "UNLIMITED" is also supported.
1778
1779 PluginDir
1780 Identifies the places in which to look for Slurm plugins. This
1781 is a colon-separated list of directories, like the PATH environ‐
1782 ment variable. The default value is the prefix given at config‐
1783 ure time + "/lib/slurm". A restart of slurmctld and the slurmd
1784 daemons is required for changes to this parameter to take ef‐
1785 fect.
1786
1787 PlugStackConfig
1788 Location of the config file for Slurm stackable plugins that use
1789 the Stackable Plugin Architecture for Node job (K)control
1790 (SPANK). This provides support for a highly configurable set of
1791 plugins to be called before and/or after execution of each task
1792 spawned as part of a user's job step. Default location is
1793 "plugstack.conf" in the same directory as the system slurm.conf.
1794 For more information on SPANK plugins, see the spank(8) manual.
1795
1796 PowerParameters
1797 System power management parameters. The supported parameters
1798 are specific to the PowerPlugin. Changes to this value take ef‐
1799 fect when the Slurm daemons are reconfigured. More information
1800 about system power management is available here
1801              <https://slurm.schedmd.com/power_mgmt.html>.  Options currently
1802 supported by any plugins are listed below.
1803
1804 balance_interval=#
1805 Specifies the time interval, in seconds, between attempts
1806 to rebalance power caps across the nodes. This also con‐
1807 trols the frequency at which Slurm attempts to collect
1808 current power consumption data (old data may be used un‐
1809 til new data is available from the underlying infrastruc‐
1810 ture and values below 10 seconds are not recommended for
1811 Cray systems). The default value is 30 seconds. Sup‐
1812 ported by the power/cray_aries plugin.
1813
1814 capmc_path=
1815 Specifies the absolute path of the capmc command. The
1816 default value is "/opt/cray/capmc/default/bin/capmc".
1817 Supported by the power/cray_aries plugin.
1818
1819 cap_watts=#
1820 Specifies the total power limit to be established across
1821 all compute nodes managed by Slurm. A value of 0 sets
1822 every compute node to have an unlimited cap. The default
1823 value is 0. Supported by the power/cray_aries plugin.
1824
1825 decrease_rate=#
1826 Specifies the maximum rate of change in the power cap for
1827 a node where the actual power usage is below the power
1828 cap by an amount greater than lower_threshold (see be‐
1829 low). Value represents a percentage of the difference
1830 between a node's minimum and maximum power consumption.
1831 The default value is 50 percent. Supported by the
1832 power/cray_aries plugin.
1833
1834 get_timeout=#
1835 Amount of time allowed to get power state information in
1836 milliseconds. The default value is 5,000 milliseconds or
1837 5 seconds. Supported by the power/cray_aries plugin and
1838 represents the time allowed for the capmc command to re‐
1839 spond to various "get" options.
1840
1841 increase_rate=#
1842 Specifies the maximum rate of change in the power cap for
1843 a node where the actual power usage is within up‐
1844 per_threshold (see below) of the power cap. Value repre‐
1845 sents a percentage of the difference between a node's
1846 minimum and maximum power consumption. The default value
1847 is 20 percent. Supported by the power/cray_aries plugin.
1848
1849 job_level
1850 All nodes associated with every job will have the same
1851 power cap, to the extent possible. Also see the
1852 --power=level option on the job submission commands.
1853
1854 job_no_level
1855 Disable the user's ability to set every node associated
1856 with a job to the same power cap. Each node will have
1857 its power cap set independently. This disables the
1858 --power=level option on the job submission commands.
1859
1860 lower_threshold=#
1861 Specify a lower power consumption threshold. If a node's
1862 current power consumption is below this percentage of its
1863 current cap, then its power cap will be reduced. The de‐
1864 fault value is 90 percent. Supported by the
1865 power/cray_aries plugin.
1866
1867 recent_job=#
1868 If a job has started or resumed execution (from suspend)
1869 on a compute node within this number of seconds from the
1870 current time, the node's power cap will be increased to
1871 the maximum. The default value is 300 seconds. Sup‐
1872 ported by the power/cray_aries plugin.
1873
1875 set_timeout=#
1876 Amount of time allowed to set power state information in
1877 milliseconds. The default value is 30,000 milliseconds
1878                     or 30 seconds.  Supported by the power/cray_aries plugin and
1879 represents the time allowed for the capmc command to re‐
1880 spond to various "set" options.
1881
1882 set_watts=#
1883                     Specifies the power limit to be set on every compute
1884                     node managed by Slurm.  Every node gets this same power
1885 cap and there is no variation through time based upon ac‐
1886 tual power usage on the node. Supported by the
1887 power/cray_aries plugin.
1888
1889 upper_threshold=#
1890 Specify an upper power consumption threshold. If a
1891 node's current power consumption is above this percentage
1892 of its current cap, then its power cap will be increased
1893 to the extent possible. The default value is 95 percent.
1894 Supported by the power/cray_aries plugin.
1895
1896 PowerPlugin
1897 Identifies the plugin used for system power management. Cur‐
1898 rently supported plugins include: cray_aries and none. A
1899 restart of slurmctld is required for changes to this parameter
1900 to take effect. More information about system power management
1901 is available here <https://slurm.schedmd.com/power_mgmt.html>.
1902 By default, no power plugin is loaded.
1903
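       A sketch of a power capping configuration built from the options
       above, for a Cray system (the wattage and interval are
       illustrative):

       # illustrative power management for power/cray_aries
       PowerPlugin=power/cray_aries
       PowerParameters=cap_watts=200000,balance_interval=60,job_level
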
1904 PreemptMode
1905 Mechanism used to preempt jobs or enable gang scheduling. When
1906 the PreemptType parameter is set to enable preemption, the Pre‐
1907 emptMode selects the default mechanism used to preempt the eli‐
1908 gible jobs for the cluster.
1909 PreemptMode may be specified on a per partition basis to over‐
1910 ride this default value if PreemptType=preempt/partition_prio.
1911 Alternatively, it can be specified on a per QOS basis if Pre‐
1912 emptType=preempt/qos. In either case, a valid default Preempt‐
1913 Mode value must be specified for the cluster as a whole when
1914 preemption is enabled.
1915 The GANG option is used to enable gang scheduling independent of
1916 whether preemption is enabled (i.e. independent of the Preempt‐
1917 Type setting). It can be specified in addition to a PreemptMode
1918 setting with the two options comma separated (e.g. Preempt‐
1919 Mode=SUSPEND,GANG).
1920 See <https://slurm.schedmd.com/preempt.html> and
1921 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
1922 tails.
1923
1924 NOTE: For performance reasons, the backfill scheduler reserves
1925 whole nodes for jobs, not partial nodes. If during backfill
1926 scheduling a job preempts one or more other jobs, the whole
1927 nodes for those preempted jobs are reserved for the preemptor
1928 job, even if the preemptor job requested fewer resources than
1929 that. These reserved nodes aren't available to other jobs dur‐
1930 ing that backfill cycle, even if the other jobs could fit on the
1931 nodes. Therefore, jobs may preempt more resources during a sin‐
1932 gle backfill iteration than they requested.
1933              NOTE: For a heterogeneous job to be considered for preemption all
1934 components must be eligible for preemption. When a heterogeneous
1935 job is to be preempted the first identified component of the job
1936 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
1937 CANCEL (lowest)) will be used to set the PreemptMode for all
1938 components. The GraceTime and user warning signal for each com‐
1939 ponent of the heterogeneous job remain unique. Heterogeneous
1940 jobs are excluded from GANG scheduling operations.
1941
1942 OFF Is the default value and disables job preemption and
1943 gang scheduling. It is only compatible with Pre‐
1944 emptType=preempt/none at a global level. A common
1945 use case for this parameter is to set it on a parti‐
1946 tion to disable preemption for that partition.
1947
1948 CANCEL The preempted job will be cancelled.
1949
1950 GANG Enables gang scheduling (time slicing) of jobs in
1951 the same partition, and allows the resuming of sus‐
1952 pended jobs.
1953
1954 NOTE: Gang scheduling is performed independently for
1955 each partition, so if you only want time-slicing by
1956 OverSubscribe, without any preemption, then config‐
1957 uring partitions with overlapping nodes is not rec‐
1958 ommended. On the other hand, if you want to use
1959 PreemptType=preempt/partition_prio to allow jobs
1960 from higher PriorityTier partitions to Suspend jobs
1961 from lower PriorityTier partitions you will need
1962 overlapping partitions, and PreemptMode=SUSPEND,GANG
1963 to use the Gang scheduler to resume the suspended
1964 jobs(s). In any case, time-slicing won't happen be‐
1965 tween jobs on different partitions.
1966
1967 NOTE: Heterogeneous jobs are excluded from GANG
1968 scheduling operations.
1969
1970 REQUEUE Preempts jobs by requeuing them (if possible) or
1971 canceling them. For jobs to be requeued they must
1972 have the --requeue sbatch option set or the cluster
1973 wide JobRequeue parameter in slurm.conf must be set
1974 to 1.
1975
1976 SUSPEND The preempted jobs will be suspended, and later the
1977 Gang scheduler will resume them. Therefore the SUS‐
1978 PEND preemption mode always needs the GANG option to
1979 be specified at the cluster level. Also, because the
1980 suspended jobs will still use memory on the allo‐
1981 cated nodes, Slurm needs to be able to track memory
1982 resources to be able to suspend jobs.
1983 If PreemptType=preempt/qos is configured and if the
1984 preempted job(s) and the preemptor job are on the
1985 same partition, then they will share resources with
1986 the Gang scheduler (time-slicing). If not (i.e. if
1987 the preemptees and preemptor are on different parti‐
1988 tions) then the preempted jobs will remain suspended
1989 until the preemptor ends.
1990
1991 NOTE: Because gang scheduling is performed indepen‐
1992 dently for each partition, if using PreemptType=pre‐
1993 empt/partition_prio then jobs in higher PriorityTier
1994 partitions will suspend jobs in lower PriorityTier
1995 partitions to run on the released resources. Only
1996                         when the preemptor job ends will the suspended jobs
1997                         be resumed by the Gang scheduler.
1998 NOTE: Suspended jobs will not release GRES. Higher
1999 priority jobs will not be able to preempt to gain
2000 access to GRES.
2001
2002 WITHIN For PreemptType=preempt/qos, allow jobs within the
2003 same qos to preempt one another. While this can be
2004                         set globally here, it is recommended that this only be
2005 set directly on a relevant subset of the system qos
2006 values instead.
2007
2008 PreemptType
2009 Specifies the plugin used to identify which jobs can be pre‐
2010 empted in order to start a pending job.
2011
2012 preempt/none
2013 Job preemption is disabled. This is the default.
2014
2015 preempt/partition_prio
2016 Job preemption is based upon partition PriorityTier.
2017 Jobs in higher PriorityTier partitions may preempt jobs
2018 from lower PriorityTier partitions. This is not compati‐
2019 ble with PreemptMode=OFF.
2020
2021 preempt/qos
2022 Job preemption rules are specified by Quality Of Service
2023 (QOS) specifications in the Slurm database. This option
2024 is not compatible with PreemptMode=OFF. A configuration
2025 of PreemptMode=SUSPEND is only supported by the Select‐
2026 Type=select/cons_res and SelectType=select/cons_tres
2027 plugins. See the sacctmgr man page to configure the op‐
2028 tions for preempt/qos.
2029
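       For illustration, partition-based preemption that suspends lower
       PriorityTier jobs and resumes them with the gang scheduler could
       be configured as follows (partitions must also define
       PriorityTier):

       # illustrative preemption setup
       PreemptType=preempt/partition_prio
       PreemptMode=SUSPEND,GANG
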
2030 PreemptExemptTime
2031 Global option for minimum run time for all jobs before they can
2032 be considered for preemption. Any QOS PreemptExemptTime takes
2033 precedence over the global option. This is only honored for Pre‐
2034 emptMode=REQUEUE and PreemptMode=CANCEL.
2035 A time of -1 disables the option, equivalent to 0. Acceptable
2036 time formats include "minutes", "minutes:seconds", "hours:min‐
2037 utes:seconds", "days-hours", "days-hours:minutes", and
2038 "days-hours:minutes:seconds".
2039
2040 PrEpParameters
2041 Parameters to be passed to the PrEpPlugins.
2042
2043 PrEpPlugins
2044 A resource for programmers wishing to write their own plugins
2045 for the Prolog and Epilog (PrEp) scripts. The default, and cur‐
2046 rently the only implemented plugin is prep/script. Additional
2047 plugins can be specified in a comma-separated list. For more in‐
2048 formation please see the PrEp Plugin API documentation page:
2049 <https://slurm.schedmd.com/prep_plugins.html>
2050
2051 PriorityCalcPeriod
2052 The period of time in minutes in which the half-life decay will
2053 be re-calculated. Applicable only if PriorityType=priority/mul‐
2054 tifactor. The default value is 5 (minutes).
2055
2056 PriorityDecayHalfLife
2057 This controls how long prior resource use is considered in de‐
2058 termining how over- or under-serviced an association is (user,
2059              bank account and cluster) when computing job priority.  The
2060 record of usage will be decayed over time, with half of the
2061 original value cleared at age PriorityDecayHalfLife. If set to
2062 0 no decay will be applied. This is helpful if you want to en‐
2063 force hard time limits per association. If set to 0 Priori‐
2064 tyUsageResetPeriod must be set to some interval. Applicable
2065 only if PriorityType=priority/multifactor. The unit is a time
2066 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
2067 default value is 7-0 (7 days).
2068
2069 PriorityFavorSmall
2070 Specifies that small jobs should be given preferential schedul‐
2071 ing priority. Applicable only if PriorityType=priority/multi‐
2072 factor. Supported values are "YES" and "NO". The default value
2073 is "NO".
2074
2075 PriorityFlags
2076 Flags to modify priority behavior. Applicable only if Priority‐
2077 Type=priority/multifactor. The keywords below have no associ‐
2078 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
2079 TIVE_TO_TIME").
2080
2081 ACCRUE_ALWAYS If set, priority age factor will be increased
2082 despite job ineligibility due to either depen‐
2083 dencies, holds or begin time in the future. Ac‐
2084 crue limits are ignored.
2085
2086 CALCULATE_RUNNING
2087 If set, priorities will be recalculated not
2088 only for pending jobs, but also running and
2089 suspended jobs.
2090
2091              DEPTH_OBLIVIOUS  If set, priority will be calculated similarly
2092                               to the normal multifactor calculation, but the
2093 depth of the associations in the tree does not
2094 adversely affect their priority. This option
2095 automatically enables NO_FAIR_TREE.
2096
2097 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
2098 to "classic" fair share priority scheduling.
2099
2100 INCR_ONLY If set, priority values will only increase in
2101 value. Job priority will never decrease in
2102 value.
2103
2104 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
2105 BillingWeights) is calculated as the MAX of in‐
2106 dividual TRES' on a node (e.g. cpus, mem, gres)
2107 plus the sum of all global TRES' (e.g. li‐
2108 censes).
2109
2110 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
2111
2112 NO_NORMAL_ASSOC If set, the association factor is not normal‐
2113 ized against the highest association priority.
2114
2115 NO_NORMAL_PART If set, the partition factor is not normalized
2116 against the highest partition PriorityJobFac‐
2117 tor.
2118
2119 NO_NORMAL_QOS If set, the QOS factor is not normalized
2120 against the highest qos priority.
2121
2122 NO_NORMAL_TRES If set, the TRES factor is not normalized
2123 against the job's partition TRES counts.
2124
2125 SMALL_RELATIVE_TO_TIME
2126 If set, the job's size component will be based
2127 upon not the job size alone, but the job's size
2128 divided by its time limit.
2129
2130 PriorityMaxAge
2131 Specifies the job age which will be given the maximum age factor
2132 in computing priority. For example, a value of 30 minutes would
2133              result in all jobs over 30 minutes old getting the same
2134 age-based priority. Applicable only if PriorityType=prior‐
2135 ity/multifactor. The unit is a time string (i.e. min,
2136 hr:min:00, days-hr:min:00, or days-hr). The default value is
2137 7-0 (7 days).
2138
2139 PriorityParameters
2140 Arbitrary string used by the PriorityType plugin.
2141
2142 PrioritySiteFactorParameters
2143 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
2144
2145 PrioritySiteFactorPlugin
2146              This specifies an optional plugin to be used alongside "prior‐
2147 ity/multifactor", which is meant to initially set and continu‐
2148 ously update the SiteFactor priority factor. The default value
2149 is "site_factor/none".
2150
2151 PriorityType
2152 This specifies the plugin to be used in establishing a job's
2153 scheduling priority. Also see PriorityFlags for configuration
2154 options. The default value is "priority/basic".
2155
2156 priority/basic
2157 Jobs are evaluated in a First In, First Out (FIFO) man‐
2158 ner.
2159
2160 priority/multifactor
2161 Jobs are assigned a priority based upon a variety of fac‐
2162 tors that include size, age, Fairshare, etc.
2163
2164 When not FIFO scheduling, jobs are prioritized in the following
2165 order:
2166
2167 1. Jobs that can preempt
2168 2. Jobs with an advanced reservation
2169 3. Partition PriorityTier
2170 4. Job priority
2171 5. Job submit time
2172 6. Job ID
2173
2174 PriorityUsageResetPeriod
2175 At this interval the usage of associations will be reset to 0.
2176 This is used if you want to enforce hard limits of time usage
2177 per association. If PriorityDecayHalfLife is set to be 0 no de‐
2178 cay will happen and this is the only way to reset the usage ac‐
2179              cumulated by running jobs. By default this is turned off; using
2180              the PriorityDecayHalfLife option instead is advised to avoid a
2181              situation where nothing can run on your cluster. However, if
2182              your scheme is set up to only allow certain amounts of time on
2183              your system, this is the way to do it.  Applicable only if
2184              PriorityType=priority/multifactor.
2185
2186 NONE Never clear historic usage. The default value.
2187
2188 NOW Clear the historic usage now. Executed at startup
2189 and reconfiguration time.
2190
2191 DAILY Cleared every day at midnight.
2192
2193 WEEKLY Cleared every week on Sunday at time 00:00.
2194
2195 MONTHLY Cleared on the first day of each month at time
2196 00:00.
2197
2198 QUARTERLY Cleared on the first day of each quarter at time
2199 00:00.
2200
2201 YEARLY Cleared on the first day of each year at time 00:00.
2202
2203 PriorityWeightAge
2204 An integer value that sets the degree to which the queue wait
2205 time component contributes to the job's priority. Applicable
2206 only if PriorityType=priority/multifactor. Requires Account‐
2207 ingStorageType=accounting_storage/slurmdbd. The default value
2208 is 0.
2209
2210 PriorityWeightAssoc
2211 An integer value that sets the degree to which the association
2212 component contributes to the job's priority. Applicable only if
2213 PriorityType=priority/multifactor. The default value is 0.
2214
2215 PriorityWeightFairshare
2216 An integer value that sets the degree to which the fair-share
2217 component contributes to the job's priority. Applicable only if
2218 PriorityType=priority/multifactor. Requires AccountingStor‐
2219 ageType=accounting_storage/slurmdbd. The default value is 0.
2220
2221 PriorityWeightJobSize
2222 An integer value that sets the degree to which the job size com‐
2223 ponent contributes to the job's priority. Applicable only if
2224 PriorityType=priority/multifactor. The default value is 0.
2225
2226 PriorityWeightPartition
2227 Partition factor used by priority/multifactor plugin in calcu‐
2228 lating job priority. Applicable only if PriorityType=prior‐
2229 ity/multifactor. The default value is 0.
2230
2231 PriorityWeightQOS
2232 An integer value that sets the degree to which the Quality Of
2233 Service component contributes to the job's priority. Applicable
2234 only if PriorityType=priority/multifactor. The default value is
2235 0.
2236
2237 PriorityWeightTRES
2238 A comma-separated list of TRES Types and weights that sets the
2239              degree to which each TRES Type contributes to the job's priority.
2240
2241 e.g.
2242 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
2243
2244 Applicable only if PriorityType=priority/multifactor and if Ac‐
2245 countingStorageTRES is configured with each TRES Type. Negative
2246 values are allowed. The default values are 0.
2247
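       A sketch of a multifactor priority configuration drawing on the
       parameters above (the weights are arbitrary illustrative values
       and carry no recommendation):

       # illustrative multifactor priority configuration
       PriorityType=priority/multifactor
       PriorityDecayHalfLife=7-0
       PriorityMaxAge=7-0
       PriorityWeightAge=1000
       PriorityWeightFairshare=10000
       PriorityWeightJobSize=1000
       PriorityWeightPartition=1000
       PriorityWeightQOS=2000
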
2248 PrivateData
2249 This controls what type of information is hidden from regular
2250 users. By default, all information is visible to all users.
2251 User SlurmUser and root can always view all information. Multi‐
2252 ple values may be specified with a comma separator. Acceptable
2253 values include:
2254
2255 accounts
2256 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2257 ing any account definitions unless they are coordinators
2258 of them.
2259
2260 cloud Powered down nodes in the cloud are visible. Without
2261 this flag, cloud nodes will not appear in the output of
2262 commands like sinfo unless they are powered on, even for
2263 SlurmUser and root.
2264
2265              events Prevents users from viewing event information unless they
2266 have operator status or above.
2267
2268 jobs Prevents users from viewing jobs or job steps belonging
2269 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
2270 users from viewing job records belonging to other users
2271 unless they are coordinators of the association running
2272 the job when using sacct.
2273
2274 nodes Prevents users from viewing node state information.
2275
2276 partitions
2277 Prevents users from viewing partition state information.
2278
2279 reservations
2280 Prevents regular users from viewing reservations which
2281 they can not use.
2282
2283 usage Prevents users from viewing usage of any other user, this
2284 applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
2285 vents users from viewing usage of any other user, this
2286 applies to sreport.
2287
2288 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2289 ing information of any user other than themselves, this
2290 also makes it so users can only see associations they
2291 deal with. Coordinators can see associations of all
2292 users in the account they are coordinator of, but can
2293 only see themselves when listing users.
2294
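       For example, to hide job, usage and user information from regular
       users while leaving node and partition state visible:

       PrivateData=jobs,usage,users
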
2295 ProctrackType
2296 Identifies the plugin to be used for process tracking on a job
2297 step basis. The slurmd daemon uses this mechanism to identify
2298 all processes which are children of processes it spawns for a
2299 user job step. A restart of slurmctld is required for changes
2300 to this parameter to take effect. NOTE: "proctrack/linuxproc"
2301 and "proctrack/pgid" can fail to identify all processes associ‐
2302 ated with a job since processes can become a child of the init
2303 process (when the parent process terminates) or change their
2304 process group. To reliably track all processes, "proc‐
2305 track/cgroup" is highly recommended. NOTE: The JobContainerType
2306 applies to a job allocation, while ProctrackType applies to job
2307 steps. Acceptable values at present include:
2308
2309 proctrack/cgroup
2310 Uses linux cgroups to constrain and track processes, and
2311 is the default for systems with cgroup support.
2312 NOTE: see "man cgroup.conf" for configuration details.
2313
2314 proctrack/cray_aries
2315 Uses Cray proprietary process tracking.
2316
2317 proctrack/linuxproc
2318 Uses linux process tree using parent process IDs.
2319
2320 proctrack/pgid
2321 Uses Process Group IDs.
2322 NOTE: This is the default for the BSD family.
2323
2324 Prolog Fully qualified pathname of a program for the slurmd to execute
2325 whenever it is asked to run a job step from a new job allocation
2326              (e.g. "/usr/local/slurm/prolog"). A glob pattern (see glob(7))
2327 may also be used to specify more than one program to run (e.g.
2328 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2329 starting the first job step. The prolog script or scripts may
2330 be used to purge files, enable user login, etc. By default
2331 there is no prolog. Any configured script is expected to com‐
2332 plete execution quickly (in less time than MessageTimeout). If
2333 the prolog fails (returns a non-zero exit code), this will re‐
2334 sult in the node being set to a DRAIN state and the job being
2335 requeued in a held state, unless nohold_on_prolog_fail is con‐
2336 figured in SchedulerParameters. See Prolog and Epilog Scripts
2337 for more information.
2338
2339 PrologEpilogTimeout
2340 The interval in seconds Slurm waits for Prolog and Epilog before
2341 terminating them. The default behavior is to wait indefinitely.
2342 This interval applies to the Prolog and Epilog run by slurmd
2343 daemon before and after the job, the PrologSlurmctld and Epi‐
2344 logSlurmctld run by slurmctld daemon, and the SPANK plugin pro‐
2345 log/epilog calls: slurm_spank_job_prolog and
2346 slurm_spank_job_epilog.
2347 If the PrologSlurmctld times out, the job is requeued if possi‐
2348 ble. If the Prolog or slurm_spank_job_prolog time out, the job
2349 is requeued if possible and the node is drained. If the Epilog
2350 or slurm_spank_job_epilog time out, the node is drained. In all
2351 cases, errors are logged.
2352
2353 PrologFlags
2354 Flags to control the Prolog behavior. By default no flags are
2355 set. Multiple flags may be specified in a comma-separated list.
2356 Currently supported options are:
2357
2358 Alloc If set, the Prolog script will be executed at job allo‐
2359 cation. By default, Prolog is executed just before the
2360 task is launched. Therefore, when salloc is started, no
2361 Prolog is executed. Alloc is useful for preparing things
2362 before a user starts to use any allocated resources. In
2363 particular, this flag is needed on a Cray system when
2364 cluster compatibility mode is enabled.
2365
2366 NOTE: Use of the Alloc flag will increase the time re‐
2367 quired to start jobs.
2368
2369 Contain At job allocation time, use the ProcTrack plugin to cre‐
2370 ate a job container on all allocated compute nodes.
2371 This container may be used for user processes not
2372 launched under Slurm control, for example
2373 pam_slurm_adopt may place processes launched through a
2374 direct user login into this container. If using
2375 pam_slurm_adopt, then ProcTrackType must be set to ei‐
2376 ther proctrack/cgroup or proctrack/cray_aries. Setting
2377                      the Contain flag implicitly sets the Alloc flag.
2378
2379 DeferBatch
2380 If set, slurmctld will wait until the prolog completes
2381 on all allocated nodes before sending the batch job
2382 launch request. With just the Alloc flag, slurmctld will
2383 launch the batch step as soon as the first node in the
2384 job allocation completes the prolog.
2385
2386 NoHold If set, the Alloc flag should also be set. This will
2387 allow for salloc to not block until the prolog is fin‐
2388 ished on each node. The blocking will happen when steps
2389 reach the slurmd and before any execution has happened
2390 in the step. This is a much faster way to work and if
2391 using srun to launch your tasks you should use this
2392 flag. This flag cannot be combined with the Contain or
2393 X11 flags.
2394
2395 Serial By default, the Prolog and Epilog scripts run concur‐
2396 rently on each node. This flag forces those scripts to
2397 run serially within each node, but with a significant
2398 penalty to job throughput on each node.
2399
2400 X11 Enable Slurm's built-in X11 forwarding capabilities.
2401 This is incompatible with ProctrackType=proctrack/linux‐
2402 proc. Setting the X11 flag implicitly enables both Con‐
2403 tain and Alloc flags as well.
2404
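       As a sketch, a site using pam_slurm_adopt together with Slurm's
       X11 forwarding might set the flags as follows (recall that X11
       implicitly enables Contain and Alloc):

       # illustrative prolog flag combination
       PrologFlags=Contain,X11
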
2405 PrologSlurmctld
2406 Fully qualified pathname of a program for the slurmctld daemon
2407 to execute before granting a new job allocation (e.g. "/usr/lo‐
2408 cal/slurm/prolog_controller"). The program executes as Slur‐
2409 mUser on the same node where the slurmctld daemon executes, giv‐
2410 ing it permission to drain nodes and requeue the job if a fail‐
2411 ure occurs or cancel the job if appropriate. Exactly what the
2412 program does and how it accomplishes this is completely at the
2413 discretion of the system administrator. Information about the
2414 job being initiated, its allocated nodes, etc. are passed to the
2415 program using environment variables. While this program is run‐
2416              ning, the nodes associated with the job will have a
2417 POWER_UP/CONFIGURING flag set in their state, which can be read‐
2418 ily viewed. The slurmctld daemon will wait indefinitely for
2419 this program to complete. Once the program completes with an
2420 exit code of zero, the nodes will be considered ready for use
2421              and the job will be started.  If some node can not be made
2422 available for use, the program should drain the node (typically
2423 using the scontrol command) and terminate with a non-zero exit
2424 code. A non-zero exit code will result in the job being re‐
2425 queued (where possible) or killed. Note that only batch jobs can
2426 be requeued. See Prolog and Epilog Scripts for more informa‐
2427 tion.
2428
2429 PropagatePrioProcess
2430 Controls the scheduling priority (nice value) of user spawned
2431 tasks.
2432
2433 0 The tasks will inherit the scheduling priority from the
2434 slurm daemon. This is the default value.
2435
2436 1 The tasks will inherit the scheduling priority of the com‐
2437 mand used to submit them (e.g. srun or sbatch). Unless the
2438 job is submitted by user root, the tasks will have a sched‐
2439 uling priority no higher than the slurm daemon spawning
2440 them.
2441
2442 2 The tasks will inherit the scheduling priority of the com‐
2443 mand used to submit them (e.g. srun or sbatch) with the re‐
2444 striction that their nice value will always be one higher
2445 than the slurm daemon (i.e. the tasks scheduling priority
2446 will be lower than the slurm daemon).
2447
2448 PropagateResourceLimits
2449 A comma-separated list of resource limit names. The slurmd dae‐
2450 mon uses these names to obtain the associated (soft) limit val‐
2451 ues from the user's process environment on the submit node.
2452 These limits are then propagated and applied to the jobs that
2453 will run on the compute nodes. This parameter can be useful
2454 when system limits vary among nodes. Any resource limits that
2455 do not appear in the list are not propagated. However, the user
2456 can override this by specifying which resource limits to propa‐
2457 gate with the sbatch or srun "--propagate" option. If neither
2458              PropagateResourceLimits nor PropagateResourceLimitsExcept is
2459 configured and the "--propagate" option is not specified, then
2460 the default action is to propagate all limits. Only one of the
2461 parameters, either PropagateResourceLimits or PropagateResource‐
2462 LimitsExcept, may be specified. The user limits can not exceed
2463 hard limits under which the slurmd daemon operates. If the user
2464 limits are not propagated, the limits from the slurmd daemon
2465 will be propagated to the user's job. The limits used for the
2466              Slurm daemons can be set in the /etc/sysconfig/slurm file.  For
2467              more information, see: https://slurm.schedmd.com/faq.html#mem‐
2468              lock. The following limit names are supported by Slurm (although
2469 some options may not be supported on some systems):
2470
2471 ALL All limits listed below (default)
2472
2473 NONE No limits listed below
2474
2475 AS The maximum address space (virtual memory) for a
2476 process.
2477
2478 CORE The maximum size of core file
2479
2480 CPU The maximum amount of CPU time
2481
2482 DATA The maximum size of a process's data segment
2483
2484 FSIZE The maximum size of files created. Note that if the
2485 user sets FSIZE to less than the current size of the
2486 slurmd.log, job launches will fail with a 'File size
2487 limit exceeded' error.
2488
2489 MEMLOCK The maximum size that may be locked into memory
2490
2491 NOFILE The maximum number of open files
2492
2493 NPROC The maximum number of processes available
2494
2495 RSS The maximum resident set size. Note that this only
2496 has effect with Linux kernels 2.4.30 or older or BSD.
2497
2498 STACK The maximum stack size
2499
2500 PropagateResourceLimitsExcept
2501 A comma-separated list of resource limit names. By default, all
2502 resource limits will be propagated, (as described by the Propa‐
2503 gateResourceLimits parameter), except for the limits appearing
2504 in this list. The user can override this by specifying which
2505 resource limits to propagate with the sbatch or srun "--propa‐
2506 gate" option. See PropagateResourceLimits above for a list of
2507 valid limit names.
2508
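       As an example, to propagate every limit except the locked-memory
       limit (one common choice when compute nodes define their own
       MEMLOCK policy):

       PropagateResourceLimitsExcept=MEMLOCK
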
2509 RebootProgram
2510 Program to be executed on each compute node to reboot it. In‐
2511 voked on each node once it becomes idle after the command "scon‐
2512 trol reboot" is executed by an authorized user or a job is sub‐
2513 mitted with the "--reboot" option. After rebooting, the node is
2514 returned to normal use. See ResumeTimeout to configure the time
2515 you expect a reboot to finish in. A node will be marked DOWN if
2516 it doesn't reboot within ResumeTimeout.
2517
2518 ReconfigFlags
2519 Flags to control various actions that may be taken when an
2520 "scontrol reconfig" command is issued. Currently the options
2521 are:
2522
2523 KeepPartInfo If set, an "scontrol reconfig" command will
2524 maintain the in-memory value of partition
2525 "state" and other parameters that may have been
2526 dynamically updated by "scontrol update". Par‐
2527 tition information in the slurm.conf file will
2528 be merged with in-memory data. This flag su‐
2529 persedes the KeepPartState flag.
2530
2531 KeepPartState If set, an "scontrol reconfig" command will
2532 preserve only the current "state" value of
2533 in-memory partitions and will reset all other
2534 parameters of the partitions that may have been
2535 dynamically updated by "scontrol update" to the
2536 values from the slurm.conf file. Partition in‐
2537 formation in the slurm.conf file will be merged
2538 with in-memory data.
2539
2540              Neither of the above flags is set by default, and "scontrol
2541 reconfig" will rebuild the partition information using only the
2542 definitions in the slurm.conf file.
2543
2544 RequeueExit
2545 Enables automatic requeue for batch jobs which exit with the
2546              specified values.  Separate multiple exit codes with a comma and/or
2547              specify numeric ranges using a "-" separator (e.g. "Requeue‐
2548              Exit=1-9,18"). Jobs will be put back into pending state and
2549 later scheduled again. Restarted jobs will have the environment
2550 variable SLURM_RESTART_COUNT set to the number of times the job
2551 has been restarted.
2552
2553 RequeueExitHold
2554 Enables automatic requeue for batch jobs which exit with the
2555 specified values, with these jobs being held until released man‐
2556              ually by the user.  Separate multiple exit codes with a comma
2557              and/or specify numeric ranges using a "-" separator (e.g. "Re‐
2558              queueExitHold=10-12,16"). These jobs are put in the JOB_SPE‐
2559 CIAL_EXIT exit state. Restarted jobs will have the environment
2560 variable SLURM_RESTART_COUNT set to the number of times the job
2561 has been restarted.
2562
2563 ResumeFailProgram
2564              The program that will be executed when nodes fail to resume
2565              by ResumeTimeout. The argument to the program will be the names
2566 of the failed nodes (using Slurm's hostlist expression format).
2567
2568 ResumeProgram
2569 Slurm supports a mechanism to reduce power consumption on nodes
2570 that remain idle for an extended period of time. This is typi‐
2571 cally accomplished by reducing voltage and frequency or powering
2572 the node down. ResumeProgram is the program that will be exe‐
2573 cuted when a node in power save mode is assigned work to per‐
2574 form. For reasons of reliability, ResumeProgram may execute
2575 more than once for a node when the slurmctld daemon crashes and
2576 is restarted. If ResumeProgram is unable to restore a node to
2577 service with a responding slurmd and an updated BootTime, it
2578 should set the node state to DOWN, which will result in a re‐
2579 queue of any job associated with the node - this will happen au‐
2580 tomatically if the node doesn't register within ResumeTimeout.
2581 If the node isn't actually rebooted (i.e. when multiple-slurmd
2582              is configured), starting slurmd with the "-b" option might be useful.
2583 The program executes as SlurmUser. The argument to the program
2584 will be the names of nodes to be removed from power savings mode
2585 (using Slurm's hostlist expression format). A job to node map‐
2586 ping is available in JSON format by reading the temporary file
2587 specified by the SLURM_RESUME_FILE environment variable. By de‐
2588 fault no program is run.
2589
2590 ResumeRate
2591 The rate at which nodes in power save mode are returned to nor‐
2592 mal operation by ResumeProgram. The value is a number of nodes
2593 per minute and it can be used to prevent power surges if a large
2594 number of nodes in power save mode are assigned work at the same
2595 time (e.g. a large job starts). A value of zero results in no
2596 limits being imposed. The default value is 300 nodes per
2597 minute.
2598
2599 ResumeTimeout
2600 Maximum time permitted (in seconds) between when a node resume
2601 request is issued and when the node is actually available for
2602 use. Nodes which fail to respond in this time frame will be
2603 marked DOWN and the jobs scheduled on the node requeued. Nodes
2604 which reboot after this time frame will be marked DOWN with a
2605 reason of "Node unexpectedly rebooted." The default value is 60
2606 seconds.
2607
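       A sketch of the node power saving resume settings described above
       (the program paths are hypothetical):

       # hypothetical resume configuration for powered-down nodes
       ResumeProgram=/usr/local/sbin/slurm_resume
       ResumeFailProgram=/usr/local/sbin/slurm_resume_fail
       ResumeRate=100
       ResumeTimeout=600
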
2608 ResvEpilog
2609 Fully qualified pathname of a program for the slurmctld to exe‐
2610 cute when a reservation ends. The program can be used to cancel
2611 jobs, modify partition configuration, etc. The reservation
2612 named will be passed as an argument to the program. By default
2613 there is no epilog.
2614
2615 ResvOverRun
2616 Describes how long a job already running in a reservation should
2617 be permitted to execute after the end time of the reservation
2618 has been reached. The time period is specified in minutes and
2619 the default value is 0 (kill the job immediately). The value
2620 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2621 supported to permit a job to run indefinitely after its reserva‐
2622 tion is terminated.
2623
2624 ResvProlog
2625 Fully qualified pathname of a program for the slurmctld to exe‐
2626 cute when a reservation begins. The program can be used to can‐
2627 cel jobs, modify partition configuration, etc. The reservation
2628 named will be passed as an argument to the program. By default
2629 there is no prolog.
2630
2631 ReturnToService
2632 Controls when a DOWN node will be returned to service. The de‐
2633 fault value is 0. Supported values include
2634
2635 0 A node will remain in the DOWN state until a system adminis‐
2636 trator explicitly changes its state (even if the slurmd dae‐
2637 mon registers and resumes communications).
2638
2639 1 A DOWN node will become available for use upon registration
2640 with a valid configuration only if it was set DOWN due to
2641 being non-responsive. If the node was set DOWN for any
2642 other reason (low memory, unexpected reboot, etc.), its
2643 state will not automatically be changed. A node registers
2644 with a valid configuration if its memory, GRES, CPU count,
2645 etc. are equal to or greater than the values configured in
2646 slurm.conf.
2647
2648 2 A DOWN node will become available for use upon registration
2649 with a valid configuration. The node could have been set
2650 DOWN for any reason. A node registers with a valid configu‐
2651 ration if its memory, GRES, CPU count, etc. are equal to or
2652 greater than the values configured in slurm.conf.
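
       As an illustration only: on a cluster where nodes are expected to
       rejoin service automatically once they register with a valid
       configuration, regardless of why they were set DOWN, one might
       configure:

              ReturnToService=2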
2653
2654 RoutePlugin
2655 Identifies the plugin to be used for defining which nodes will
2656 be used for message forwarding.
2657
2658 route/default
2659 default, use TreeWidth.
2660
2661 route/topology
2662 use the switch hierarchy defined in a topology.conf file.
2663 TopologyPlugin=topology/tree is required.
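
              As a sketch, topology-aware message forwarding could be
              enabled as follows (a matching topology.conf, not shown
              here, is assumed to exist):

                     RoutePlugin=route/topology
                     TopologyPlugin=topology/tree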
2664
2665 SchedulerParameters
2666 The interpretation of this parameter varies by SchedulerType.
2667 Multiple options may be comma separated.
2668
2669 allow_zero_lic
2670 If set, then job submissions requesting more than config‐
2671 ured licenses won't be rejected.
2672
2673 assoc_limit_stop
2674 If set and a job cannot start due to association limits,
2675 then do not attempt to initiate any lower priority jobs
2676 in that partition. Setting this can decrease system
              throughput and utilization, but avoids potentially starv‐
2678 ing larger jobs by preventing them from launching indefi‐
2679 nitely.
2680
2681 batch_sched_delay=#
2682 How long, in seconds, the scheduling of batch jobs can be
2683 delayed. This can be useful in a high-throughput envi‐
2684 ronment in which batch jobs are submitted at a very high
2685 rate (i.e. using the sbatch command) and one wishes to
2686 reduce the overhead of attempting to schedule each job at
2687 submit time. The default value is 3 seconds.
2688
2689 bb_array_stage_cnt=#
2690 Number of tasks from a job array that should be available
2691 for burst buffer resource allocation. Higher values will
2692 increase the system overhead as each task from the job
2693 array will be moved to its own job record in memory, so
2694 relatively small values are generally recommended. The
2695 default value is 10.
2696
2697 bf_busy_nodes
2698 When selecting resources for pending jobs to reserve for
2699 future execution (i.e. the job can not be started immedi‐
2700 ately), then preferentially select nodes that are in use.
2701 This will tend to leave currently idle resources avail‐
2702 able for backfilling longer running jobs, but may result
2703 in allocations having less than optimal network topology.
2704 This option is currently only supported by the se‐
2705 lect/cons_res and select/cons_tres plugins (or se‐
2706 lect/cray_aries with SelectTypeParameters set to
2707 "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
2708 select/cray_aries plugin over the select/cons_res or se‐
2709 lect/cons_tres plugin respectively).
2710
2711 bf_continue
2712 The backfill scheduler periodically releases locks in or‐
2713 der to permit other operations to proceed rather than
2714 blocking all activity for what could be an extended pe‐
2715 riod of time. Setting this option will cause the back‐
2716 fill scheduler to continue processing pending jobs from
2717 its original job list after releasing locks even if job
2718 or node state changes.
2719
2720 bf_hetjob_immediate
2721 Instruct the backfill scheduler to attempt to start a
2722 heterogeneous job as soon as all of its components are
2723 determined able to do so. Otherwise, the backfill sched‐
2724 uler will delay heterogeneous jobs initiation attempts
2725 until after the rest of the queue has been processed.
2726 This delay may result in lower priority jobs being allo‐
2727 cated resources, which could delay the initiation of the
2728 heterogeneous job due to account and/or QOS limits being
2729 reached. This option is disabled by default. If enabled
2730 and bf_hetjob_prio=min is not set, then it would be auto‐
2731 matically set.
2732
2733 bf_hetjob_prio=[min|avg|max]
2734 At the beginning of each backfill scheduling cycle, a
              list of pending jobs to be scheduled is sorted according
2736 to the precedence order configured in PriorityType. This
2737 option instructs the scheduler to alter the sorting algo‐
2738 rithm to ensure that all components belonging to the same
2739 heterogeneous job will be attempted to be scheduled con‐
2740 secutively (thus not fragmented in the resulting list).
2741 More specifically, all components from the same heteroge‐
2742 neous job will be treated as if they all have the same
2743 priority (minimum, average or maximum depending upon this
2744 option's parameter) when compared with other jobs (or
2745 other heterogeneous job components). The original order
2746 will be preserved within the same heterogeneous job. Note
2747 that the operation is calculated for the PriorityTier
2748 layer and for the Priority resulting from the prior‐
2749 ity/multifactor plugin calculations. When enabled, if any
2750 heterogeneous job requested an advanced reservation, then
2751 all of that job's components will be treated as if they
2752 had requested an advanced reservation (and get preferen‐
2753 tial treatment in scheduling).
2754
2755 Note that this operation does not update the Priority
2756 values of the heterogeneous job components, only their
2757 order within the list, so the output of the sprio command
              will not be affected.
2759
2760 Heterogeneous jobs have special scheduling properties:
2761 they are only scheduled by the backfill scheduling
2762 plugin, each of their components is considered separately
2763 when reserving resources (and might have different Prior‐
2764 ityTier or different Priority values), and no heteroge‐
2765 neous job component is actually allocated resources until
              all of its components can be initiated.  This may imply
2767 potential scheduling deadlock scenarios because compo‐
2768 nents from different heterogeneous jobs can start reserv‐
2769 ing resources in an interleaved fashion (not consecu‐
2770 tively), but none of the jobs can reserve resources for
2771 all components and start. Enabling this option can help
2772 to mitigate this problem. By default, this option is dis‐
2773 abled.
2774
2775 bf_interval=#
2776 The number of seconds between backfill iterations.
2777 Higher values result in less overhead and better respon‐
2778 siveness. This option applies only to Scheduler‐
2779 Type=sched/backfill. Default: 30, Min: 1, Max: 10800
2780 (3h). A setting of -1 will disable the backfill schedul‐
2781 ing loop.
2782
2783 bf_job_part_count_reserve=#
2784 The backfill scheduling logic will reserve resources for
2785 the specified count of highest priority jobs in each par‐
2786 tition. For example, bf_job_part_count_reserve=10 will
2787 cause the backfill scheduler to reserve resources for the
2788 ten highest priority jobs in each partition. Any lower
2789 priority job that can be started using currently avail‐
2790 able resources and not adversely impact the expected
2791 start time of these higher priority jobs will be started
              by the backfill scheduler.  The default value is zero,
2793 which will reserve resources for any pending job and de‐
2794 lay initiation of lower priority jobs. Also see
2795 bf_min_age_reserve and bf_min_prio_reserve. Default: 0,
2796 Min: 0, Max: 100000.
2797
2798 bf_licenses
2799 Require the backfill scheduling logic to track and plan
2800 for license availability. By default, any job blocked on
2801 license availability will not have resources reserved
2802 which can lead to job starvation. This option implicitly
2803 enables bf_running_job_reserve.
2804
2805 bf_max_job_array_resv=#
2806 The maximum number of tasks from a job array for which
2807 the backfill scheduler will reserve resources in the fu‐
2808 ture. Since job arrays can potentially have millions of
2809 tasks, the overhead in reserving resources for all tasks
2810 can be prohibitive. In addition various limits may pre‐
2811 vent all the jobs from starting at the expected times.
2812 This has no impact upon the number of tasks from a job
2813 array that can be started immediately, only those tasks
2814 expected to start at some future time. Default: 20, Min:
2815 0, Max: 1000. NOTE: Jobs submitted to multiple parti‐
2816 tions appear in the job queue once per partition. If dif‐
2817 ferent copies of a single job array record aren't consec‐
2818 utive in the job queue and another job array record is in
2819 between, then bf_max_job_array_resv tasks are considered
2820 per partition that the job is submitted to.
2821
2822 bf_max_job_assoc=#
2823 The maximum number of jobs per user association to at‐
2824 tempt starting with the backfill scheduler. This setting
2825 is similar to bf_max_job_user but is handy if a user has
2826 multiple associations equating to basically different
2827 users. One can set this limit to prevent users from
2828 flooding the backfill queue with jobs that cannot start
              and that prevent jobs from other users from starting.  This
              option applies only to SchedulerType=sched/backfill.
              Also see the bf_max_job_user, bf_max_job_part,
2832 bf_max_job_test and bf_max_job_user_part=# options. Set
2833 bf_max_job_test to a value much higher than
2834 bf_max_job_assoc. Default: 0 (no limit), Min: 0, Max:
2835 bf_max_job_test.
2836
2837 bf_max_job_part=#
2838 The maximum number of jobs per partition to attempt
2839 starting with the backfill scheduler. This can be espe‐
2840 cially helpful for systems with large numbers of parti‐
2841 tions and jobs. This option applies only to Scheduler‐
2842 Type=sched/backfill. Also see the partition_job_depth
2843 and bf_max_job_test options. Set bf_max_job_test to a
2844 value much higher than bf_max_job_part. Default: 0 (no
2845 limit), Min: 0, Max: bf_max_job_test.
2846
2847 bf_max_job_start=#
2848 The maximum number of jobs which can be initiated in a
2849 single iteration of the backfill scheduler. This option
2850 applies only to SchedulerType=sched/backfill. Default: 0
2851 (no limit), Min: 0, Max: 10000.
2852
2853 bf_max_job_test=#
2854 The maximum number of jobs to attempt backfill scheduling
2855 for (i.e. the queue depth). Higher values result in more
2856 overhead and less responsiveness. Until an attempt is
2857 made to backfill schedule a job, its expected initiation
2858 time value will not be set. In the case of large clus‐
2859 ters, configuring a relatively small value may be desir‐
2860 able. This option applies only to Scheduler‐
2861 Type=sched/backfill. Default: 500, Min: 1, Max:
2862 1,000,000.
2863
2864 bf_max_job_user=#
2865 The maximum number of jobs per user to attempt starting
2866 with the backfill scheduler for ALL partitions. One can
2867 set this limit to prevent users from flooding the back‐
2868 fill queue with jobs that cannot start and that prevent
              jobs from other users from starting.  This is similar to the
2870 MAXIJOB limit in Maui. This option applies only to
2871 SchedulerType=sched/backfill. Also see the
2872 bf_max_job_part, bf_max_job_test and
2873 bf_max_job_user_part=# options. Set bf_max_job_test to a
2874 value much higher than bf_max_job_user. Default: 0 (no
2875 limit), Min: 0, Max: bf_max_job_test.
2876
2877 bf_max_job_user_part=#
2878 The maximum number of jobs per user per partition to at‐
2879 tempt starting with the backfill scheduler for any single
2880 partition. This option applies only to Scheduler‐
2881 Type=sched/backfill. Also see the bf_max_job_part,
2882 bf_max_job_test and bf_max_job_user=# options. Default:
2883 0 (no limit), Min: 0, Max: bf_max_job_test.
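
                     As an illustration of combining these per-user and
                     per-partition limits (the values below are place‐
                     holders, not recommendations):

                            SchedulerParameters=bf_max_job_test=5000,bf_max_job_part=200,bf_max_job_user=20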
2884
2885 bf_max_time=#
2886 The maximum time in seconds the backfill scheduler can
2887 spend (including time spent sleeping when locks are re‐
2888 leased) before discontinuing, even if maximum job counts
2889 have not been reached. This option applies only to
2890 SchedulerType=sched/backfill. The default value is the
2891 value of bf_interval (which defaults to 30 seconds). De‐
2892 fault: bf_interval value (def. 30 sec), Min: 1, Max: 3600
2893 (1h). NOTE: If bf_interval is short and bf_max_time is
2894 large, this may cause locks to be acquired too frequently
2895 and starve out other serviced RPCs. It's advisable if us‐
2896 ing this parameter to set max_rpc_cnt high enough that
2897 scheduling isn't always disabled, and low enough that the
2898 interactive workload can get through in a reasonable pe‐
2899 riod of time. max_rpc_cnt needs to be below 256 (the de‐
2900 fault RPC thread limit). Running around the middle (150)
2901 may give you good results. NOTE: When increasing the
2902 amount of time spent in the backfill scheduling cycle,
2903 Slurm can be prevented from responding to client requests
2904 in a timely manner. To address this you can use
2905 max_rpc_cnt to specify a number of queued RPCs before the
              scheduler stops in order to respond to these requests.
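
                     For example, following the guidance above, one
                     illustrative (not prescriptive) combination of
                     these options would be:

                            SchedulerParameters=bf_interval=30,bf_max_time=120,max_rpc_cnt=150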
2907
2908 bf_min_age_reserve=#
2909 The backfill and main scheduling logic will not reserve
2910 resources for pending jobs until they have been pending
2911 and runnable for at least the specified number of sec‐
2912 onds. In addition, jobs waiting for less than the speci‐
2913 fied number of seconds will not prevent a newly submitted
2914 job from starting immediately, even if the newly submit‐
2915 ted job has a lower priority. This can be valuable if
2916 jobs lack time limits or all time limits have the same
2917 value. The default value is zero, which will reserve re‐
2918 sources for any pending job and delay initiation of lower
2919 priority jobs. Also see bf_job_part_count_reserve and
2920 bf_min_prio_reserve. Default: 0, Min: 0, Max: 2592000
2921 (30 days).
2922
2923 bf_min_prio_reserve=#
2924 The backfill and main scheduling logic will not reserve
2925 resources for pending jobs unless they have a priority
2926 equal to or higher than the specified value. In addi‐
2927 tion, jobs with a lower priority will not prevent a newly
2928 submitted job from starting immediately, even if the
2929 newly submitted job has a lower priority. This can be
2930 valuable if one wished to maximize system utilization
2931 without regard for job priority below a certain thresh‐
2932 old. The default value is zero, which will reserve re‐
2933 sources for any pending job and delay initiation of lower
2934 priority jobs. Also see bf_job_part_count_reserve and
2935 bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.
2936
2937 bf_node_space_size=#
2938 Size of backfill node_space table. Adding a single job to
2939 backfill reservations in the worst case can consume two
2940 node_space records. In the case of large clusters, con‐
2941 figuring a relatively small value may be desirable. This
2942 option applies only to SchedulerType=sched/backfill.
2943 Also see bf_max_job_test and bf_running_job_reserve. De‐
2944 fault: bf_max_job_test, Min: 2, Max: 2,000,000.
2945
2946 bf_one_resv_per_job
2947 Disallow adding more than one backfill reservation per
2948 job. The scheduling logic builds a sorted list of job-
2949 partition pairs. Jobs submitted to multiple partitions
2950 have as many entries in the list as requested partitions.
2951 By default, the backfill scheduler may evaluate all the
2952 job-partition entries for a single job, potentially re‐
2953 serving resources for each pair, but only starting the
2954 job in the reservation offering the earliest start time.
2955 Having a single job reserving resources for multiple par‐
2956 titions could impede other jobs (or hetjob components)
2957 from reserving resources already reserved for the parti‐
2958 tions that don't offer the earliest start time. A single
2959 job that requests multiple partitions can also prevent
2960 itself from starting earlier in a lower priority parti‐
2961 tion if the partitions overlap nodes and a backfill
2962 reservation in the higher priority partition blocks nodes
2963 that are also in the lower priority partition. This op‐
2964 tion makes it so that a job submitted to multiple parti‐
2965 tions will stop reserving resources once the first job-
2966 partition pair has booked a backfill reservation. Subse‐
2967 quent pairs from the same job will only be tested to
2968 start now. This allows for other jobs to be able to book
              the other pairs' resources at the cost of not guaranteeing
              that the multi-partition job will start in the partition
2971 offering the earliest start time (unless it can start im‐
2972 mediately). This option is disabled by default.
2973
2974 bf_resolution=#
2975 The number of seconds in the resolution of data main‐
2976 tained about when jobs begin and end. Higher values re‐
2977 sult in better responsiveness and quicker backfill cycles
2978 by using larger blocks of time to determine node eligi‐
2979 bility. However, higher values lead to less efficient
2980 system planning, and may miss opportunities to improve
2981 system utilization. This option applies only to Sched‐
2982 ulerType=sched/backfill. Default: 60, Min: 1, Max: 3600
2983 (1 hour).
2984
2985 bf_running_job_reserve
2986 Add an extra step to backfill logic, which creates back‐
2987 fill reservations for jobs running on whole nodes. This
2988 option is disabled by default.
2989
2990 bf_window=#
2991 The number of minutes into the future to look when con‐
2992 sidering jobs to schedule. Higher values result in more
2993 overhead and less responsiveness. A value at least as
2994 long as the highest allowed time limit is generally ad‐
2995 visable to prevent job starvation. In order to limit the
2996 amount of data managed by the backfill scheduler, if the
2997 value of bf_window is increased, then it is generally ad‐
2998 visable to also increase bf_resolution. This option ap‐
2999 plies only to SchedulerType=sched/backfill. Default:
3000 1440 (1 day), Min: 1, Max: 43200 (30 days).
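
                     As a sketch, a site whose longest allowed time
                     limit is about a week might pair a larger window
                     with a coarser resolution (illustrative values
                     only):

                            SchedulerParameters=bf_window=10080,bf_resolution=600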
3001
3002 bf_window_linear=#
3003 For performance reasons, the backfill scheduler will de‐
3004 crease precision in calculation of job expected termina‐
3005 tion times. By default, the precision starts at 30 sec‐
3006 onds and that time interval doubles with each evaluation
3007 of currently executing jobs when trying to determine when
3008 a pending job can start. This algorithm can support an
3009 environment with many thousands of running jobs, but can
              result in the expected start time of pending jobs being
              gradually deferred due to lack of precision.  A
3012 value for bf_window_linear will cause the time interval
3013 to be increased by a constant amount on each iteration.
3014 The value is specified in units of seconds. For example,
3015 a value of 60 will cause the backfill scheduler on the
3016 first iteration to identify the job ending soonest and
3017 determine if the pending job can be started after that
3018 job plus all other jobs expected to end within 30 seconds
3019 (default initial value) of the first job. On the next it‐
3020 eration, the pending job will be evaluated for starting
3021 after the next job expected to end plus all jobs ending
3022 within 90 seconds of that time (30 second default, plus
3023 the 60 second option value). The third iteration will
3024 have a 150 second window and the fourth 210 seconds.
3025 Without this option, the time windows will double on each
3026 iteration and thus be 30, 60, 120, 240 seconds, etc. The
3027 use of bf_window_linear is not recommended with more than
3028 a few hundred simultaneously executing jobs.
3029
3030 bf_yield_interval=#
3031 The backfill scheduler will periodically relinquish locks
3032 in order for other pending operations to take place.
              This specifies the interval, in microseconds, at which the
              locks are relinquished.  Smaller values may be helpful for high
3035 throughput computing when used in conjunction with the
3036 bf_continue option. Also see the bf_yield_sleep option.
3037 Default: 2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10
3038 sec).
3039
3040 bf_yield_sleep=#
3041 The backfill scheduler will periodically relinquish locks
3042 in order for other pending operations to take place.
3043 This specifies the length of time for which the locks are
3044 relinquished in microseconds. Also see the bf_yield_in‐
3045 terval option. Default: 500,000 (0.5 sec), Min: 1, Max:
3046 10,000,000 (10 sec).
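
                     As an illustration for a high-throughput system
                     (the values below are placeholders, not recommen‐
                     dations):

                            SchedulerParameters=bf_continue,bf_yield_interval=1000000,bf_yield_sleep=200000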
3047
3048 build_queue_timeout=#
3049 Defines the maximum time that can be devoted to building
3050 a queue of jobs to be tested for scheduling. If the sys‐
3051 tem has a huge number of jobs with dependencies, just
3052 building the job queue can take so much time as to ad‐
3053 versely impact overall system performance and this param‐
3054 eter can be adjusted as needed. The default value is
3055 2,000,000 microseconds (2 seconds).
3056
3057 correspond_after_task_cnt=#
              Defines the number of array tasks that get split for a
              potential aftercorr dependency check.  A low number may result
3060 in dependent task check failures when the job one depends
3061 on gets purged before the split. Default: 10.
3062
3063 default_queue_depth=#
3064 The default number of jobs to attempt scheduling (i.e.
3065 the queue depth) when a running job completes or other
              routine actions occur; however, the frequency with which
3067 the scheduler is run may be limited by using the defer or
3068 sched_min_interval parameters described below. The full
3069 queue will be tested on a less frequent basis as defined
3070 by the sched_interval option described below. The default
3071 value is 100. See the partition_job_depth option to
3072 limit depth by partition.
3073
3074 defer Setting this option will avoid attempting to schedule
3075 each job individually at job submit time, but defer it
3076 until a later time when scheduling multiple jobs simulta‐
3077 neously may be possible. This option may improve system
3078 responsiveness when large numbers of jobs (many hundreds)
3079 are submitted at the same time, but it will delay the
3080 initiation time of individual jobs. Also see de‐
3081 fault_queue_depth above.
3082
3083 delay_boot=#
              Do not reboot nodes in order to satisfy this job's fea‐
3085 ture specification if the job has been eligible to run
3086 for less than this time period. If the job has waited
3087 for less than the specified period, it will use only
3088 nodes which already have the specified features. The ar‐
3089 gument is in units of minutes. Individual jobs may over‐
3090 ride this default value with the --delay-boot option.
3091
3092 disable_job_shrink
3093 Deny user requests to shrink the size of running jobs.
3094 (However, running jobs may still shrink due to node fail‐
3095 ure if the --no-kill option was set.)
3096
3097 disable_hetjob_steps
3098 Disable job steps that span heterogeneous job alloca‐
3099 tions.
3100
3101 enable_hetjob_steps
3102 Enable job steps that span heterogeneous job allocations.
3103 The default value.
3104
3105 enable_user_top
3106 Enable use of the "scontrol top" command by non-privi‐
3107 leged users.
3108
3109 Ignore_NUMA
3110 Some processors (e.g. AMD Opteron 6000 series) contain
3111 multiple NUMA nodes per socket. This is a configuration
3112 which does not map into the hardware entities that Slurm
3113 optimizes resource allocation for (PU/thread, core,
3114 socket, baseboard, node and network switch). In order to
3115 optimize resource allocations on such hardware, Slurm
3116 will consider each NUMA node within the socket as a sepa‐
3117 rate socket by default. Use the Ignore_NUMA option to re‐
3118 port the correct socket count, but not optimize resource
3119 allocations on the NUMA nodes.
3120
              NOTE: Since hwloc 2.0, NUMA nodes are not part of the
              main/CPU topology tree.  Because of that, if Slurm is built
              with hwloc 2.0 or above, Slurm will treat HWLOC_OBJ_PACKAGE
              as the socket.  You can change this behavior using
              SlurmdParameters=l3cache_as_socket.
3126
3127 ignore_prefer_validation
              If set and a job requests --prefer, any features in the
              request that would otherwise create an invalid request on
              the current system will not generate an error.  This is help‐
3131 ful for dynamic systems where nodes with features come
3132 and go. Please note using this option will not protect
3133 you from typos.
3134
3135 max_array_tasks
3136 Specify the maximum number of tasks that can be included
3137 in a job array. The default limit is MaxArraySize, but
3138 this option can be used to set a lower limit. For exam‐
3139 ple, max_array_tasks=1000 and MaxArraySize=100001 would
3140 permit a maximum task ID of 100000, but limit the number
3141 of tasks in any single job array to 1000.
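
                     Expressed as configuration lines, the example
                     above corresponds to:

                            MaxArraySize=100001
                            SchedulerParameters=max_array_tasks=1000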
3142
3143 max_rpc_cnt=#
3144 If the number of active threads in the slurmctld daemon
3145 is equal to or larger than this value, defer scheduling
3146 of jobs. The scheduler will check this condition at cer‐
3147 tain points in code and yield locks if necessary. This
3148 can improve Slurm's ability to process requests at a cost
3149 of initiating new jobs less frequently. Default: 0 (op‐
3150 tion disabled), Min: 0, Max: 1000.
3151
3152 NOTE: The maximum number of threads (MAX_SERVER_THREADS)
3153 is internally set to 256 and defines the number of served
              RPCs at a given time.  Setting max_rpc_cnt to more than
              256 will only be useful to let backfill continue schedul‐
              ing work after locks have been yielded (i.e. every 2 sec‐
              onds) if there are at most MAX(max_rpc_cnt/10, 20)
              RPCs in the queue.  For example, with max_rpc_cnt=1000,
              the scheduler will be allowed to continue after yielding
              locks only when there are 100 or fewer pending RPCs.
3161 If a value is set, then a value of 10 or higher is recom‐
3162 mended. It may require some tuning for each system, but
3163 needs to be high enough that scheduling isn't always dis‐
3164 abled, and low enough that requests can get through in a
3165 reasonable period of time.
3166
3167 max_sched_time=#
3168 How long, in seconds, that the main scheduling loop will
3169 execute for before exiting. If a value is configured, be
3170 aware that all other Slurm operations will be deferred
3171 during this time period. Make certain the value is lower
3172 than MessageTimeout. If a value is not explicitly con‐
3173 figured, the default value is half of MessageTimeout with
3174 a minimum default value of 1 second and a maximum default
3175 value of 2 seconds. For example if MessageTimeout=10,
3176 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
3177
3178 max_script_size=#
3179 Specify the maximum size of a batch script, in bytes.
3180 The default value is 4 megabytes. Larger values may ad‐
3181 versely impact system performance.
3182
3183 max_switch_wait=#
3184 Maximum number of seconds that a job can delay execution
3185 waiting for the specified desired switch count. The de‐
3186 fault value is 300 seconds.
3187
3188 no_backup_scheduling
3189 If used, the backup controller will not schedule jobs
3190 when it takes over. The backup controller will allow jobs
3191 to be submitted, modified and cancelled but won't sched‐
3192 ule new jobs. This is useful in Cray environments when
3193 the backup controller resides on an external Cray node.
3194 A restart of slurmctld is required for changes to this
3195 parameter to take effect.
3196
3197 no_env_cache
              If used, any job started on a node that fails to load the
              environment will fail instead of using the cached
              environment.  This also implicitly enables the re‐
              queue_setup_env_fail option.
3202
3203 nohold_on_prolog_fail
3204 By default, if the Prolog exits with a non-zero value the
3205 job is requeued in a held state. By specifying this pa‐
3206 rameter the job will be requeued but not held so that the
3207 scheduler can dispatch it to another host.
3208
3209 pack_serial_at_end
3210 If used with the select/cons_res or select/cons_tres
3211 plugin, then put serial jobs at the end of the available
3212 nodes rather than using a best fit algorithm. This may
3213 reduce resource fragmentation for some workloads.
3214
3215 partition_job_depth=#
3216 The default number of jobs to attempt scheduling (i.e.
3217 the queue depth) from each partition/queue in Slurm's
3218 main scheduling logic. The functionality is similar to
3219 that provided by the bf_max_job_part option for the back‐
3220 fill scheduling logic. The default value is 0 (no
              limit).  Jobs excluded from attempted scheduling based
3222 upon partition will not be counted against the de‐
3223 fault_queue_depth limit. Also see the bf_max_job_part
3224 option.
3225
3226 preempt_reorder_count=#
3227 Specify how many attempts should be made in reordering
3228 preemptable jobs to minimize the count of jobs preempted.
3229 The default value is 1. High values may adversely impact
3230 performance. The logic to support this option is only
3231 available in the select/cons_res and select/cons_tres
3232 plugins.
3233
3234 preempt_strict_order
3235 If set, then execute extra logic in an attempt to preempt
3236 only the lowest priority jobs. It may be desirable to
3237 set this configuration parameter when there are multiple
3238 priorities of preemptable jobs. The logic to support
3239 this option is only available in the select/cons_res and
3240 select/cons_tres plugins.
3241
3242 preempt_youngest_first
3243 If set, then the preemption sorting algorithm will be
3244 changed to sort by the job start times to favor preempt‐
3245 ing younger jobs over older. (Requires preempt/parti‐
3246 tion_prio or preempt/qos plugins.)
3247
3248 reduce_completing_frag
3249 This option is used to control how scheduling of re‐
3250 sources is performed when jobs are in the COMPLETING
3251 state, which influences potential fragmentation. If this
3252 option is not set then no jobs will be started in any
3253 partition when any job is in the COMPLETING state for
3254 less than CompleteWait seconds. If this option is set
3255 then no jobs will be started in any individual partition
3256 that has a job in COMPLETING state for less than Com‐
3257 pleteWait seconds. In addition, no jobs will be started
3258 in any partition with nodes that overlap with any nodes
3259 in the partition of the completing job. This option is
3260 to be used in conjunction with CompleteWait.
3261
3262 NOTE: CompleteWait must be set in order for this to work.
3263 If CompleteWait=0 then this option does nothing.
3264
3265 NOTE: reduce_completing_frag only affects the main sched‐
3266 uler, not the backfill scheduler.
3267
3268 requeue_setup_env_fail
              By default, if a job environment setup fails, the job keeps
3270 running with a limited environment. By specifying this
3271 parameter the job will be requeued in held state and the
3272 execution node drained.
3273
3274 salloc_wait_nodes
3275 If defined, the salloc command will wait until all allo‐
3276 cated nodes are ready for use (i.e. booted) before the
3277 command returns. By default, salloc will return as soon
3278 as the resource allocation has been made.
3279
3280 sbatch_wait_nodes
3281 If defined, the sbatch script will wait until all allo‐
3282 cated nodes are ready for use (i.e. booted) before the
3283 initiation. By default, the sbatch script will be initi‐
3284 ated as soon as the first node in the job allocation is
3285 ready. The sbatch command can use the --wait-all-nodes
3286 option to override this configuration parameter.
3287
3288 sched_interval=#
3289 How frequently, in seconds, the main scheduling loop will
3290 execute and test all pending jobs. The default value is
3291 60 seconds. A setting of -1 will disable the main sched‐
3292 uling loop.
3293
3294 sched_max_job_start=#
3295 The maximum number of jobs that the main scheduling logic
3296 will start in any single execution. The default value is
3297 zero, which imposes no limit.
3298
3299 sched_min_interval=#
3300 How frequently, in microseconds, the main scheduling loop
3301 will execute and test any pending jobs. The scheduler
3302 runs in a limited fashion every time that any event hap‐
3303 pens which could enable a job to start (e.g. job submit,
3304 job terminate, etc.). If these events happen at a high
3305 frequency, the scheduler can run very frequently and con‐
3306 sume significant resources if not throttled by this op‐
3307 tion. This option specifies the minimum time between the
3308 end of one scheduling cycle and the beginning of the next
3309 scheduling cycle. A value of zero will disable throt‐
3310 tling of the scheduling logic interval. The default
3311 value is 2 microseconds.
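
                     As a sketch for a high-throughput environment
                     (illustrative values only, combining options
                     described above):

                            SchedulerParameters=defer,default_queue_depth=50,sched_interval=120,sched_min_interval=100000,batch_sched_delay=10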
3312
3313 spec_cores_first
3314 Specialized cores will be selected from the first cores
3315 of the first sockets, cycling through the sockets on a
3316 round robin basis. By default, specialized cores will be
3317 selected from the last cores of the last sockets, cycling
3318 through the sockets on a round robin basis.
3319
3320 step_retry_count=#
3321 When a step completes and there are steps ending resource
3322 allocation, then retry step allocations for at least this
3323 number of pending steps. Also see step_retry_time. The
3324 default value is 8 steps.
3325
3326 step_retry_time=#
3327 When a step completes and there are steps ending resource
3328 allocation, then retry step allocations for all steps
3329 which have been pending for at least this number of sec‐
3330 onds. Also see step_retry_count. The default value is
3331 60 seconds.
3332
3333 whole_hetjob
3334 Requests to cancel, hold or release any component of a
3335 heterogeneous job will be applied to all components of
3336 the job.
3337
              NOTE: this option was previously named whole_pack, which
              is still supported for backward compatibility.
3340
3341 SchedulerTimeSlice
3342 Number of seconds in each time slice when gang scheduling is en‐
3343 abled (PreemptMode=SUSPEND,GANG). The value must be between 5
3344 seconds and 65533 seconds. The default value is 30 seconds.
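
       As an illustration, gang scheduling with two-minute time slices
       could be configured as follows (PreemptMode is described else‐
       where in this man page):

              PreemptMode=SUSPEND,GANG
              SchedulerTimeSlice=120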
3345
3346 SchedulerType
3347 Identifies the type of scheduler to be used. A restart of
3348 slurmctld is required for changes to this parameter to take ef‐
3349 fect. The scontrol command can be used to manually change job
3350 priorities if desired. Acceptable values include:
3351
3352 sched/backfill
3353 For a backfill scheduling module to augment the default
3354 FIFO scheduling. Backfill scheduling will initiate
3355 lower-priority jobs if doing so does not delay the ex‐
3356 pected initiation time of any higher priority job. Ef‐
3357 fectiveness of backfill scheduling is dependent upon
3358 users specifying job time limits, otherwise all jobs will
3359 have the same time limit and backfilling is impossible.
              Note the documentation for the SchedulerParameters option
3361 above. This is the default configuration.
3362
3363 sched/builtin
3364 This is the FIFO scheduler which initiates jobs in prior‐
3365 ity order. If any job in the partition can not be sched‐
3366 uled, no lower priority job in that partition will be
3367 scheduled. An exception is made for jobs that can not
3368 run due to partition constraints (e.g. the time limit) or
3369 down/drained nodes. In that case, lower priority jobs
3370 can be initiated and not impact the higher priority job.
3371
3372 ScronParameters
3373 Multiple options may be comma separated.
3374
3375 enable Enable the use of scrontab to submit and manage periodic
3376 repeating jobs.
3377
3378 SelectType
3379 Identifies the type of resource selection algorithm to be used.
3380 A restart of slurmctld is required for changes to this parameter
3381 to take effect. When changed, all job information (running and
3382 pending) will be lost, since the job state save format used by
3383 each plugin is different. The only exception to this is when
3384 changing from cons_res to cons_tres or from cons_tres to
3385 cons_res. However, if a job contains cons_tres-specific features
3386 and then SelectType is changed to cons_res, the job will be can‐
3387 celed, since there is no way for cons_res to satisfy require‐
3388 ments specific to cons_tres.
3389
3390 Acceptable values include
3391
3392 select/cons_res
3393 The resources (cores and memory) within a node are indi‐
3394 vidually allocated as consumable resources. Note that
3395 whole nodes can be allocated to jobs for selected parti‐
3396 tions by using the OverSubscribe=Exclusive option. See
3397 the partition OverSubscribe parameter for more informa‐
3398 tion.
3399
3400 select/cons_tres
3401 The resources (cores, memory, GPUs and all other track‐
3402 able resources) within a node are individually allocated
3403 as consumable resources. Note that whole nodes can be
3404 allocated to jobs for selected partitions by using the
3405 OverSubscribe=Exclusive option. See the partition Over‐
3406 Subscribe parameter for more information.
3407
3408 select/cray_aries
3409 for a Cray system. The default value is "se‐
3410 lect/cray_aries" for all Cray systems.
3411
3412 select/linear
3413 for allocation of entire nodes assuming a one-dimensional
3414 array of nodes in which sequentially ordered nodes are
3415 preferable. For a heterogeneous cluster (e.g. different
3416 CPU counts on the various nodes), resource allocations
3417 will favor nodes with high CPU counts as needed based
3418 upon the job's node and CPU specification if TopologyPlu‐
3419 gin=topology/none is configured. Use of other topology
3420 plugins with select/linear and heterogeneous nodes is not
3421 recommended and may result in valid job allocation re‐
3422 quests being rejected. The linear plugin is not designed
3423 to track generic resources on a node. In cases where
3424 generic resources (such as GPUs) need to be tracked, the
3425 cons_res or cons_tres plugins should be used instead.
3426 This is the default value.
3427
3428 SelectTypeParameters
3429 The permitted values of SelectTypeParameters depend upon the
3430 configured value of SelectType. The only supported options for
3431 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3432 which treats memory as a consumable resource and prevents memory
3433 over subscription with job preemption or gang scheduling. By
3434 default SelectType=select/linear allocates whole nodes to jobs
3435 without considering their memory consumption. By default Se‐
3436 lectType=select/cons_res, SelectType=select/cray_aries, and Se‐
3437 lectType=select/cons_tres, use CR_Core_Memory, which allocates
       cores to jobs while considering their memory consumption.
3439
3440 A restart of slurmctld is required for changes to this parameter
3441 to take effect.
3442
3443 The following options are supported for SelectType=se‐
3444 lect/cray_aries:
3445
3446 OTHER_CONS_RES
3447 Layer the select/cons_res plugin under the se‐
3448 lect/cray_aries plugin, the default is to layer on se‐
3449 lect/linear. This also allows all the options available
3450 for SelectType=select/cons_res.
3451
3452 OTHER_CONS_TRES
3453 Layer the select/cons_tres plugin under the se‐
3454 lect/cray_aries plugin, the default is to layer on se‐
3455 lect/linear. This also allows all the options available
3456 for SelectType=select/cons_tres.
3457
3458 The following options are supported by the SelectType=select/cons_res
3459 and SelectType=select/cons_tres plugins:
3460
3461 CR_CPU CPUs are consumable resources. Configure the number of
3462 CPUs on each node, which may be equal to the count of
3463 cores or hyper-threads on the node depending upon the de‐
3464 sired minimum resource allocation. The node's Boards,
3465 Sockets, CoresPerSocket and ThreadsPerCore may optionally
3466 be configured and result in job allocations which have
3467 improved locality; however doing so will prevent more
3468 than one job from being allocated on each core.
3469
3470 CR_CPU_Memory
3471 CPUs and memory are consumable resources. Configure the
3472 number of CPUs on each node, which may be equal to the
3473 count of cores or hyper-threads on the node depending
3474 upon the desired minimum resource allocation. The node's
3475 Boards, Sockets, CoresPerSocket and ThreadsPerCore may
3476 optionally be configured and result in job allocations
3477 which have improved locality; however doing so will pre‐
3478 vent more than one job from being allocated on each core.
3479 Setting a value for DefMemPerCPU is strongly recommended.
3480
3481 CR_Core
3482 Cores are consumable resources. On nodes with hy‐
3483 per-threads, each thread is counted as a CPU to satisfy a
3484 job's resource requirement, but multiple jobs are not al‐
3485 located threads on the same core. The count of CPUs al‐
3486 located to a job is rounded up to account for every CPU
              on an allocated core.  When --mem-per-cpu is used, this will
              also cause the total allocated memory to be a multiple of the
              total number of CPUs on the allocated cores.
3490
3491 CR_Core_Memory
3492 Cores and memory are consumable resources. On nodes with
3493 hyper-threads, each thread is counted as a CPU to satisfy
3494 a job's resource requirement, but multiple jobs are not
3495 allocated threads on the same core. The count of CPUs
3496 allocated to a job may be rounded up to account for every
3497 CPU on an allocated core. Setting a value for DefMemPer‐
3498 CPU is strongly recommended.
3499
3500 CR_ONE_TASK_PER_CORE
3501 Allocate one task per core by default. Without this op‐
3502 tion, by default one task will be allocated per thread on
3503 nodes with more than one ThreadsPerCore configured.
3504 NOTE: This option cannot be used with CR_CPU*.
3505
3506 CR_CORE_DEFAULT_DIST_BLOCK
3507 Allocate cores within a node using block distribution by
3508 default. This is a pseudo-best-fit algorithm that mini‐
3509 mizes the number of boards and minimizes the number of
3510 sockets (within minimum boards) used for the allocation.
3511 This default behavior can be overridden specifying a par‐
3512 ticular "-m" parameter with srun/salloc/sbatch. Without
3513 this option, cores will be allocated cyclically across
3514 the sockets.
3515
3516 CR_LLN Schedule resources to jobs on the least loaded nodes
3517 (based upon the number of idle CPUs). This is generally
3518 only recommended for an environment with serial jobs as
3519 idle resources will tend to be highly fragmented, result‐
3520 ing in parallel jobs being distributed across many nodes.
3521 Note that node Weight takes precedence over how many idle
              resources are on each node.  Also see the partition con‐
              figuration parameter LLN to use the least loaded nodes in
              selected partitions.
3525
3526 CR_Pack_Nodes
3527 If a job allocation contains more resources than will be
3528 used for launching tasks (e.g. if whole nodes are allo‐
3529 cated to a job), then rather than distributing a job's
3530 tasks evenly across its allocated nodes, pack them as
3531 tightly as possible on these nodes. For example, con‐
3532 sider a job allocation containing two entire nodes with
3533 eight CPUs each. If the job starts ten tasks across
3534 those two nodes without this option, it will start five
3535 tasks on each of the two nodes. With this option, eight
3536 tasks will be started on the first node and two tasks on
3537 the second node. This can be superseded by "NoPack" in
3538 srun's "--distribution" option. CR_Pack_Nodes only ap‐
3539 plies when the "block" task distribution method is used.
3540
3541 CR_Socket
3542 Sockets are consumable resources. On nodes with multiple
3543 cores, each core or thread is counted as a CPU to satisfy
3544 a job's resource requirement, but multiple jobs are not
3545 allocated resources on the same socket.
3546
3547 CR_Socket_Memory
3548 Memory and sockets are consumable resources. On nodes
3549 with multiple cores, each core or thread is counted as a
3550 CPU to satisfy a job's resource requirement, but multiple
3551 jobs are not allocated resources on the same socket.
3552 Setting a value for DefMemPerCPU is strongly recommended.
3553
3554 CR_Memory
3555 Memory is a consumable resource. NOTE: This implies
3556 OverSubscribe=YES or OverSubscribe=FORCE for all parti‐
3557 tions. Setting a value for DefMemPerCPU is strongly rec‐
3558 ommended.
3559
              NOTE: If memory isn't configured as a consumable resource
              (CR_CPU, CR_Core or CR_Socket without _Memory), memory
              can be over‐
3563 subscribed. In this case the --mem option is only used to
3564 filter out nodes with lower configured memory and does
3565 not take running jobs into account. For instance, two
3566 jobs requesting all the memory of a node can run at the
3567 same time.
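
       As an illustrative sketch, selecting the cons_tres plugin with
       cores and memory as consumable resources might look like the
       following (the DefMemPerCPU value is a placeholder, not a recom‐
       mendation; DefMemPerCPU is described elsewhere in this man page):

              SelectType=select/cons_tres
              SelectTypeParameters=CR_Core_Memory
              DefMemPerCPU=2048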
3568
3569 SlurmctldAddr
3570 An optional address to be used for communications to the cur‐
3571 rently active slurmctld daemon, normally used with Virtual IP
3572 addressing of the currently active server. If this parameter is
3573 not specified then each primary and backup server will have its
3574 own unique address used for communications as specified in the
3575 SlurmctldHost parameter. If this parameter is specified then
3576 the SlurmctldHost parameter will still be used for communica‐
3577 tions to specific slurmctld primary or backup servers, for exam‐
3578 ple to cause all of them to read the current configuration files
3579 or shutdown. Also see the SlurmctldPrimaryOffProg and Slurm‐
       ctldPrimaryOnProg configuration parameters to configure programs
       that manipulate the virtual IP address.
3582
3583 SlurmctldDebug
3584 The level of detail to provide slurmctld daemon's logs. The de‐
3585 fault value is info. If the slurmctld daemon is initiated with
       -v or --verbose options, that debug level will be preserved or
       restored upon reconfiguration.
3588
3589 quiet Log nothing
3590
3591 fatal Log only fatal errors
3592
3593 error Log only errors
3594
3595 info Log errors and general informational messages
3596
3597 verbose Log errors and verbose informational messages
3598
3599 debug Log errors and verbose informational messages and de‐
3600 bugging messages
3601
3602 debug2 Log errors and verbose informational messages and more
3603 debugging messages
3604
3605 debug3 Log errors and verbose informational messages and even
3606 more debugging messages
3607
3608 debug4 Log errors and verbose informational messages and even
3609 more debugging messages
3610
3611 debug5 Log errors and verbose informational messages and even
3612 more debugging messages
3613
3614 SlurmctldHost
3615 The short, or long, hostname of the machine where Slurm control
3616 daemon is executed (i.e. the name returned by the command "host‐
3617 name -s"). This hostname is optionally followed by the address,
3618 either the IP address or a name by which the address can be
3619 identified, enclosed in parentheses (e.g. SlurmctldHost=slurm‐
3620 ctl-primary(12.34.56.78)). This value must be specified at least
3621 once. If specified more than once, the first hostname named will
3622 be where the daemon runs. If the first specified host fails,
3623 the daemon will execute on the second host. If both the first
       and second specified hosts fail, the daemon will execute on the
3625 third host. A restart of slurmctld is required for changes to
3626 this parameter to take effect.
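
       For example, with hypothetical host names and addresses, a
       primary and a backup controller could be listed as:

              SlurmctldHost=ctl-primary(10.0.0.1)
              SlurmctldHost=ctl-backup(10.0.0.2)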
3627
3628 SlurmctldLogFile
3629 Fully qualified pathname of a file into which the slurmctld dae‐
3630 mon's logs are written. The default value is none (performs
3631 logging via syslog).
3632 See the section LOGGING if a pathname is specified.
3633
3634 SlurmctldParameters
3635 Multiple options may be comma separated.
3636
3637 allow_user_triggers
3638 Permit setting triggers from non-root/slurm_user users.
3639 SlurmUser must also be set to root to permit these trig‐
3640 gers to work. See the strigger man page for additional
3641 details.
3642
3643 cloud_dns
3644 By default, Slurm expects that the network address for a
3645 cloud node won't be known until the creation of the node
3646 and that Slurm will be notified of the node's address
3647 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3648 Since Slurm communications rely on the node configuration
3649 found in the slurm.conf, Slurm will tell the client com‐
              mand, after waiting for all nodes to boot, each node's IP
3651 address. However, in environments where the nodes are in
3652 DNS, this step can be avoided by configuring this option.
3653
3654 cloud_reg_addrs
3655 When a cloud node registers, the node's NodeAddr and
3656 NodeHostName will automatically be set. They will be re‐
3657 set back to the nodename after powering off.
3658
3659 enable_configless
3660 Permit "configless" operation by the slurmd, slurmstepd,
3661 and user commands. When enabled the slurmd will be per‐
3662 mitted to retrieve config files from the slurmctld, and
3663 on any 'scontrol reconfigure' command new configs will be
3664 automatically pushed out and applied to nodes that are
3665 running in this "configless" mode. A restart of slurm‐
3666 ctld is required for changes to this parameter to take
3667 effect. NOTE: Included files with the Include directive
3668 will only be pushed if the filename has no path separa‐
3669 tors and is located adjacent to slurm.conf.
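
              As a sketch, configless operation is enabled on the
              controller with:

                     SlurmctldParameters=enable_configless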
3670
3671 idle_on_node_suspend
3672 Mark nodes as idle, regardless of current state, when
3673 suspending nodes with SuspendProgram so that nodes will
3674 be eligible to be resumed at a later time.
3675
3676 node_reg_mem_percent=#
3677 Percentage of memory a node is allowed to register with
3678 without being marked as invalid with low memory. Default
3679 is 100. For State=CLOUD nodes, the default is 90. To dis‐
3680 able this for cloud nodes set it to 100. config_overrides
3681 takes precedence over this option.
3682
3683 It's recommended that task/cgroup with ConstrainRamSpace
              is configured.  A memory cgroup limit won't be set higher
              than the actual memory on the node.  If needed, configure
3686 AllowedRamSpace in the cgroup.conf to add a buffer.
3687
3688 power_save_interval
3689 How often the power_save thread looks to resume and sus‐
3690 pend nodes. The power_save thread will do work sooner if
3691 there are node state changes. Default is 10 seconds.
3692
3693 power_save_min_interval
3694 How often the power_save thread, at a minimum, looks to
3695 resume and suspend nodes. Default is 0.
3696
3697 max_dbd_msg_action
3698 Action used once MaxDBDMsgs is reached, options are 'dis‐
3699 card' (default) and 'exit'.
3700
              When 'discard' is specified and MaxDBDMsgs is reached, we
              start by purging pending messages of types Step start and
              complete; if MaxDBDMsgs is reached again, Job start mes‐
              sages are purged.  Job completes and node state changes
              continue to consume the empty space created from the
              purgings until MaxDBDMsgs is reached again, at which point
              no new message is tracked, creating data loss and
              potentially runaway jobs.
3709
3710 When 'exit' is specified and MaxDBDMsgs is reached the
3711 slurmctld will exit instead of discarding any messages.
3712 It will be impossible to start the slurmctld with this
              option when the slurmdbd is down and the slurmctld is
3714 tracking more than MaxDBDMsgs.
3715
3716 preempt_send_user_signal
3717 Send the user signal (e.g. --signal=<sig_num>) at preemp‐
3718 tion time even if the signal time hasn't been reached. In
3719 the case of a gracetime preemption the user signal will
3720 be sent if the user signal has been specified and not
3721 sent, otherwise a SIGTERM will be sent to the tasks.
3722
3723 reboot_from_controller
3724 Run the RebootProgram from the controller instead of on
3725 the slurmds. The RebootProgram will be passed a
3726 comma-separated list of nodes to reboot as the first ar‐
3727 gument and if applicable the required features needed for
3728 reboot as the second argument.
3729
3730 user_resv_delete
3731 Allow any user able to run in a reservation to delete it.
3732
3733 SlurmctldPidFile
3734 Fully qualified pathname of a file into which the slurmctld
3735 daemon may write its process id. This may be used for automated
3736 signal processing. The default value is "/var/run/slurm‐
3737 ctld.pid".
3738
3739 SlurmctldPlugstack
3740 A comma-delimited list of Slurm controller plugins to be started
3741 when the daemon begins and terminated when it ends. Only the
3742 plugin's init and fini functions are called.
3743
3744 SlurmctldPort
3745 The port number that the Slurm controller, slurmctld, listens to
3746 for work. The default value is SLURMCTLD_PORT as established at
3747 system build time. If none is explicitly specified, it will be
3748 set to 6817. SlurmctldPort may also be configured to support a
3749 range of port numbers in order to accept larger bursts of incom‐
3750 ing messages by specifying two numbers separated by a dash (e.g.
3751 SlurmctldPort=6817-6818). A restart of slurmctld is required
       for changes to this parameter to take effect.  NOTE: Either the
       slurmctld and slurmd daemons must not execute on the same nodes,
3754 or the values of SlurmctldPort and SlurmdPort must be different.
3755
3756 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3757 automatically try to interact with anything opened on ports
3758 8192-60000. Configure SlurmctldPort to use a port outside of
3759 the configured SrunPortRange and RSIP's port range.
3760
3761 SlurmctldPrimaryOffProg
3762 This program is executed when a slurmctld daemon running as the
3763 primary server becomes a backup server. By default no program is
3764 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3765 ter.
3766
3767 SlurmctldPrimaryOnProg
3768 This program is executed when a slurmctld daemon running as a
3769 backup server becomes the primary server. By default no program
       is executed.  When using virtual IP addresses to manage Highly
3771 Available Slurm services, this program can be used to add the IP
3772 address to an interface (and optionally try to kill the unre‐
3773 sponsive slurmctld daemon and flush the ARP caches on nodes on
3774 the local Ethernet fabric). See also the related "SlurmctldPri‐
3775 maryOffProg" parameter.
3776
3777 SlurmctldSyslogDebug
3778 The slurmctld daemon will log events to the syslog file at the
       specified level of detail.  If not set, the slurmctld daemon will
       log to syslog at level fatal, unless there is no SlurmctldLog‐
       File and it is running in the background, in which case it will
       log to syslog at the level specified by SlurmctldDebug (at fatal
       if SlurmctldDebug is set to quiet), or unless it is run in the
       foreground, in which case it will be set to quiet.
3785
3786 quiet Log nothing
3787
3788 fatal Log only fatal errors
3789
3790 error Log only errors
3791
3792 info Log errors and general informational messages
3793
3794 verbose Log errors and verbose informational messages
3795
3796 debug Log errors and verbose informational messages and de‐
3797 bugging messages
3798
3799 debug2 Log errors and verbose informational messages and more
3800 debugging messages
3801
3802 debug3 Log errors and verbose informational messages and even
3803 more debugging messages
3804
3805 debug4 Log errors and verbose informational messages and even
3806 more debugging messages
3807
3808 debug5 Log errors and verbose informational messages and even
3809 more debugging messages
3810
3811 NOTE: By default, Slurm's systemd service files start daemons in
3812 the foreground with the -D option. This means that systemd will
3813 capture stdout/stderr output and print that to syslog, indepen‐
3814 dent of Slurm printing to syslog directly. To prevent systemd
3815 from doing this, add "StandardOutput=null" and "StandardEr‐
3816 ror=null" to the respective service files or override files.
3817
3818 SlurmctldTimeout
3819 The interval, in seconds, that the backup controller waits for
3820 the primary controller to respond before assuming control. The
3821 default value is 120 seconds. May not exceed 65533.
3822
3823 SlurmdDebug
3824 The level of detail to provide in the slurmd daemon's logs.
3825 The default value is info.
3826
3827 quiet Log nothing
3828
3829 fatal Log only fatal errors
3830
3831 error Log only errors
3832
3833 info Log errors and general informational messages
3834
3835 verbose Log errors and verbose informational messages
3836
3837 debug Log errors and verbose informational messages and de‐
3838 bugging messages
3839
3840 debug2 Log errors and verbose informational messages and more
3841 debugging messages
3842
3843 debug3 Log errors and verbose informational messages and even
3844 more debugging messages
3845
3846 debug4 Log errors and verbose informational messages and even
3847 more debugging messages
3848
3849 debug5 Log errors and verbose informational messages and even
3850 more debugging messages
3851
3852 SlurmdLogFile
3853 Fully qualified pathname of a file into which the slurmd dae‐
3854 mon's logs are written. The default value is none (performs
3855 logging via syslog). The first "%h" within the name is replaced
3856 with the hostname on which the slurmd is running. The first
3857 "%n" within the name is replaced with the Slurm node name on
3858 which the slurmd is running.
3859 See the section LOGGING if a pathname is specified.
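
For example, to keep a separate log file per node by using the
node name substitution described above (the path is only
illustrative):

     SlurmdLogFile=/var/log/slurm/slurmd-%n.log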
3860
3861 SlurmdParameters
3862 Parameters specific to the Slurmd. Multiple options may be
3863 comma separated.
3864
3865 config_overrides
3866 If set, consider the configuration of each node to be
3867 that specified in the slurm.conf configuration file and
3868 any node with less than the configured resources will not
3869 be set to INVAL/INVALID_REG. This option is generally
3870 only useful for testing purposes. Equivalent to the now
3871 deprecated FastSchedule=2 option.
3872
3873 l3cache_as_socket
3874 Use the hwloc l3cache as the socket count. Can be useful
3875 on certain processors where the socket level is too
3876 coarse, and the l3cache may provide better task distribu‐
3877 tion. (E.g., along CCX boundaries instead of socket
3878 boundaries.) Mutually exclusive with
3879 numa_node_as_socket. Requires hwloc v2.
3880
3881 numa_node_as_socket
3882 Use the hwloc NUMA node to determine the main hierarchy
3883 object to be used as the socket. If this option is set,
3884 Slurm will check the parent object of the NUMA node and
3885 use it as the socket. This option may be useful for ar‐
3886 chitectures like AMD EPYC, where the number of NUMA nodes
3887 per socket may be configured. Mutually exclusive with
3888 l3cache_as_socket. Requires hwloc v2.
3889
3890 shutdown_on_reboot
3891 If set, the Slurmd will shut itself down when a reboot
3892 request is received.
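
As an illustration only, a site could combine several of the
options above, for instance:

     SlurmdParameters=l3cache_as_socket,shutdown_on_reboot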
3893
3894 SlurmdPidFile
3895 Fully qualified pathname of a file into which the slurmd daemon
3896 may write its process id. This may be used for automated signal
3897 processing. The first "%h" within the name is replaced with the
3898 hostname on which the slurmd is running. The first "%n" within
3899 the name is replaced with the Slurm node name on which the
3900 slurmd is running. The default value is "/var/run/slurmd.pid".
3901
3902 SlurmdPort
3903 The port number that the Slurm compute node daemon, slurmd, lis‐
3904 tens to for work. The default value is SLURMD_PORT as estab‐
3905 lished at system build time. If none is explicitly specified,
3906 its value will be 6818. A restart of slurmctld is required for
3907 changes to this parameter to take effect. NOTE: Either the
3908 slurmctld and slurmd daemons must not execute on the same nodes,
3909 or the values of SlurmctldPort and SlurmdPort must be different.
3910
3911 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3912 automatically try to interact with anything opened on ports
3913 8192-60000. Configure SlurmdPort to use a port outside of the
3914 configured SrunPortRange and RSIP's port range.
3915
3916 SlurmdSpoolDir
3917 Fully qualified pathname of a directory into which the slurmd
3918 daemon's state information and batch job script information are
3919 written. This must be a common pathname for all nodes, but
3920 should represent a directory which is local to each node (refer‐
3921 ence a local file system). The default value is
3922 "/var/spool/slurmd". The first "%h" within the name is replaced
3923 with the hostname on which the slurmd is running. The first
3924 "%n" within the name is replaced with the Slurm node name on
3925 which the slurmd is running.
3926
3927 SlurmdSyslogDebug
3928 The slurmd daemon will log events to the syslog file at the
3929 specified level of detail. If not set, the slurmd daemon will
3930 log to syslog at level fatal, unless there is no SlurmdLogFile
3931 and it is running in the background, in which case it will log
3932 to syslog at the level specified by SlurmdDebug (at fatal if
3933 SlurmdDebug is set to quiet); if it is run in the foreground,
3934 the level will be set to quiet.
3935
3936 quiet Log nothing
3937
3938 fatal Log only fatal errors
3939
3940 error Log only errors
3941
3942 info Log errors and general informational messages
3943
3944 verbose Log errors and verbose informational messages
3945
3946 debug Log errors and verbose informational messages and de‐
3947 bugging messages
3948
3949 debug2 Log errors and verbose informational messages and more
3950 debugging messages
3951
3952 debug3 Log errors and verbose informational messages and even
3953 more debugging messages
3954
3955 debug4 Log errors and verbose informational messages and even
3956 more debugging messages
3957
3958 debug5 Log errors and verbose informational messages and even
3959 more debugging messages
3960
3961 NOTE: By default, Slurm's systemd service files start daemons in
3962 the foreground with the -D option. This means that systemd will
3963 capture stdout/stderr output and print that to syslog, indepen‐
3964 dent of Slurm printing to syslog directly. To prevent systemd
3965 from doing this, add "StandardOutput=null" and "StandardEr‐
3966 ror=null" to the respective service files or override files.
3967
3968 SlurmdTimeout
3969 The interval, in seconds, that the Slurm controller waits for
3970 slurmd to respond before configuring that node's state to DOWN.
3971 A value of zero indicates the node will not be tested by slurm‐
3972 ctld to confirm the state of slurmd, the node will not be auto‐
3973 matically set to a DOWN state indicating a non-responsive
3974 slurmd, and some other tool will take responsibility for moni‐
3975 toring the state of each compute node and its slurmd daemon.
3976 Slurm's hierarchical communication mechanism is used to ping the
3977 slurmd daemons in order to minimize system noise and overhead.
3978 The default value is 300 seconds. The value may not exceed
3979 65533 seconds.
3980
3981 SlurmdUser
3982 The name of the user that the slurmd daemon executes as. This
3983 user must exist on all nodes of the cluster for authentication
3984 of communications between Slurm components. The default value
3985 is "root".
3986
3987 SlurmSchedLogFile
3988 Fully qualified pathname of the scheduling event logging file.
3989 The syntax of this parameter is the same as for SlurmctldLog‐
3990 File. In order to configure scheduler logging, set both the
3991 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3992
3993 SlurmSchedLogLevel
3994 The initial level of scheduling event logging, similar to the
3995 SlurmctldDebug parameter used to control the initial level of
3996 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3997 (scheduler logging disabled) and "1" (scheduler logging en‐
3998 abled). If this parameter is omitted, the value defaults to "0"
3999 (disabled). In order to configure scheduler logging, set both
4000 the SlurmSchedLogFile and SlurmSchedLogLevel parameters. The
4001 scheduler logging level can be changed dynamically using scon‐
4002 trol.
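
For example, a minimal scheduler logging setup might look like
the following (the log file path is only illustrative):

     SlurmSchedLogFile=/var/log/slurm/slurmctld_sched.log
     SlurmSchedLogLevel=1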
4003
4004 SlurmUser
4005 The name of the user that the slurmctld daemon executes as. For
4006 security purposes, a user other than "root" is recommended.
4007 This user must exist on all nodes of the cluster for authentica‐
4008 tion of communications between Slurm components. The default
4009 value is "root".
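
For example, many sites create a dedicated account for this
purpose (the account name is only illustrative):

     SlurmUser=slurm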
4010
4011 SrunEpilog
4012 Fully qualified pathname of an executable to be run by srun fol‐
4013 lowing the completion of a job step. The command line arguments
4014 for the executable will be the command and arguments of the job
4015 step. This configuration parameter may be overridden by srun's
4016 --epilog parameter. Note that while the other "Epilog" executa‐
4017 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
4018 where the tasks are executed, the SrunEpilog runs on the node
4019 where the "srun" is executing.
4020
4021 SrunPortRange
4022 srun creates a set of listening ports to communicate with the
4023 controller and the slurmstepd daemons and to handle application
4024 I/O. By default these ports are ephemeral, meaning the port
4025 numbers are selected by the kernel. This parameter allows
4026 sites to configure a range of ports from which srun ports will
4027 be selected. This is useful if sites want to allow only a cer‐
4028 tain port range on their network.
4029
4030 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4031 automatically try to interact with anything opened on ports
4032 8192-60000. Configure SrunPortRange to use a range of ports
4033 above those used by RSIP, ideally 1000 or more ports, for exam‐
4034 ple "SrunPortRange=60001-63000".
4035
4036 Note: SrunPortRange must be large enough to cover the expected
4037 number of srun ports created on a given submission node. A sin‐
4038 gle srun opens 3 listening ports plus 2 more for every 48 hosts.
4039 Example:
4040
4041 srun -N 48 will use 5 listening ports.
4042
4043 srun -N 50 will use 7 listening ports.
4044
4045 srun -N 200 will use 13 listening ports.
4046
4047 SrunProlog
4048 Fully qualified pathname of an executable to be run by srun
4049 prior to the launch of a job step. The command line arguments
4050 for the executable will be the command and arguments of the job
4051 step. This configuration parameter may be overridden by srun's
4052 --prolog parameter. Note that while the other "Prolog" executa‐
4053 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
4054 where the tasks are executed, the SrunProlog runs on the node
4055 where the "srun" is executing.
4056
4057 StateSaveLocation
4058 Fully qualified pathname of a directory into which the Slurm
4059 controller, slurmctld, saves its state (e.g. "/usr/lo‐
4060 cal/slurm/checkpoint"). Slurm state will be saved here to recover
4061 from system failures. SlurmUser must be able to create files in
4062 this directory. If you have a secondary SlurmctldHost config‐
4063 ured, this location should be readable and writable by both sys‐
4064 tems. Since all running and pending job information is stored
4065 here, the use of a reliable file system (e.g. RAID) is recom‐
4066 mended. The default value is "/var/spool". A restart of slurm‐
4067 ctld is required for changes to this parameter to take effect.
4068 If any slurm daemons terminate abnormally, their core files will
4069 also be written into this directory.
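
For example, a site with a backup SlurmctldHost might place this
directory on reliable shared storage that both controller hosts
can read and write (the path is only illustrative):

     StateSaveLocation=/shared/slurm/statesave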
4070
4071 SuspendExcNodes
4072 Specifies the nodes which are not to be placed in power save
4073 mode, even if the node remains idle for an extended period of
4074 time. Use Slurm's hostlist expression to identify nodes with an
4075 optional ":" separator and count of nodes to exclude from the
4076 preceding range. For example "nid[10-20]:4" will prevent 4 us‐
4077 able nodes (i.e. IDLE and not DOWN, DRAINING or already powered
4078 down) in the set "nid[10-20]" from being powered down. Multiple
4079 sets of nodes can be specified with or without counts in a comma
4080 separated list (e.g "nid[10-20]:4,nid[80-90]:2"). If a node
4081 count specification is given, any list of nodes to NOT have a
4082 node count must be after the last specification with a count.
4083 For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes in
4084 the set "nid[10-20]" plus all nodes in the set "nid[60-70]"
4085 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
4086 "nid[1-3],nid[10-20]". By default no nodes are excluded.
4087
4088 SuspendExcParts
4089 Specifies the partitions whose nodes are not to be placed in
4090 power save mode, even if the node remains idle for an extended
4091 period of time. Multiple partitions can be identified and sepa‐
4092 rated by commas. By default no nodes are excluded.
4093
4094 SuspendProgram
4095 SuspendProgram is the program that will be executed when a node
4096 remains idle for an extended period of time. This program is
4097 expected to place the node into some power save mode. This can
4098 be used to reduce the frequency and voltage of a node or com‐
4099 pletely power the node off. The program executes as SlurmUser.
4100 The argument to the program will be the names of nodes to be
4101 placed into power savings mode (using Slurm's hostlist expres‐
4102 sion format). By default, no program is run.
4103
4104 SuspendRate
4105 The rate at which nodes are placed into power save mode by Sus‐
4106 pendProgram. The value is the number of nodes per minute and it can
4107 be used to prevent a large drop in power consumption (e.g. after
4108 a large job completes). A value of zero results in no limits
4109 being imposed. The default value is 60 nodes per minute.
4110
4111 SuspendTime
4112 Nodes which remain idle or down for this number of seconds will
4113 be placed into power save mode by SuspendProgram. Setting Sus‐
4114 pendTime to anything but INFINITE (or -1) will enable power save
4115 mode. INFINITE is the default.
4116
4117 SuspendTimeout
4118 Maximum time permitted (in seconds) between when a node suspend
4119 request is issued and when the node is shut down. At that time
4120 the node must be ready for a resume request to be issued as
4121 needed for new work. The default value is 30 seconds.
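
Taken together, a minimal power saving configuration might look
like the following sketch (the script path is only illustrative,
and a corresponding ResumeProgram would normally be configured
as well):

     SuspendProgram=/etc/slurm/node_suspend.sh
     SuspendTime=600
     SuspendRate=20
     SuspendTimeout=60
     SuspendExcNodes=nid[10-20]:4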
4122
4123 SwitchParameters
4124 Optional parameters for the switch plugin.
4125
4126 On HPE Slingshot systems configured with
4127 SwitchType=switch/hpe_slingshot, the following parameters are
4128 supported (separate multiple parameters with a comma):
4129
4130 vnis=<min>-<max>
4131 Range of VNIs to allocate for jobs and applications.
4132 This parameter is required.
4133
4134 tcs=<class1>[:<class2>]...
4135 Set of traffic classes to configure for applications.
4136 Supported traffic classes are DEDICATED_ACCESS, LOW_LA‐
4137 TENCY, BULK_DATA, and BEST_EFFORT.
4138
4139 single_node_vni
4140 Allocate a VNI for single node job steps.
4141
4142 job_vni
4143 Allocate an additional VNI for jobs, shared among all job
4144 steps.
4145
4146 def_<rsrc>=<val>
4147 Per-CPU reserved allocation for this resource.
4148
4149 res_<rsrc>=<val>
4150 Per-node reserved allocation for this resource. If set,
4151 overrides the per-CPU allocation.
4152
4153 max_<rsrc>=<val>
4154 Maximum per-node application for this resource.
4155
4156 The resources that may be configured are:
4157
4158 txqs Transmit command queues. The default is 3 per-CPU, maxi‐
4159 mum 1024 per-node.
4160
4161 tgqs Target command queues. The default is 2 per-CPU, maximum
4162 512 per-node.
4163
4164 eqs Event queues. The default is 8 per-CPU, maximum 2048 per-
4165 node.
4166
4167 cts Counters. The default is 2 per-CPU, maximum 2048 per-
4168 node.
4169
4170 tles Trigger list entries. The default is 1 per-CPU, maximum
4171 2048 per-node.
4172
4173 ptes Portable table entries. The default is 8 per-CPU, maximum
4174 2048 per-node.
4175
4176 les List entries. The default is 134 per-CPU, maximum 65535
4177 per-node.
4178
4179 acs Addressing contexts. The default is 4 per-CPU, maximum
4180 1024 per-node.
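
As an illustrative sketch only, an HPE Slingshot system might
combine several of the options above (the VNI range shown is
arbitrary and must match the fabric configuration):

     SwitchType=switch/hpe_slingshot
     SwitchParameters=vnis=1024-2047,tcs=DEDICATED_ACCESS:BEST_EFFORT,single_node_vni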
4181
4182 SwitchType
4183 Identifies the type of switch or interconnect used for applica‐
4184 tion communications. Acceptable values include
4185 "switch/cray_aries" for Cray systems, "switch/hpe_slingshot" for
4186 HPE Slingshot systems and "switch/none" for switches not requir‐
4187 ing special processing for job launch or termination (Ethernet
4188 and InfiniBand). The default value is "switch/none". All Slurm
4189 daemons, commands and running jobs must be restarted for a
4190 change in SwitchType to take effect. If running jobs exist at
4191 the time slurmctld is restarted with a new value of SwitchType,
4192 records of all jobs in any state may be lost.
4193
4194 TaskEpilog
4195 Fully qualified pathname of a program to be executed as the
4196 slurm job's owner after termination of each task. See TaskPro‐
4197 log for execution order details.
4198
4199 TaskPlugin
4200 Identifies the type of task launch plugin, typically used to
4201 provide resource management within a node (e.g. pinning tasks to
4202 specific processors). More than one task plugin can be specified
4203 in a comma-separated list. The prefix of "task/" is optional.
4204 Acceptable values include:
4205
4206 task/affinity enables resource containment using
4207 sched_setaffinity(). This enables the --cpu-bind
4208 and/or --mem-bind srun options.
4209
4210 task/cgroup enables resource containment using Linux control
4211 cgroups. This enables the --cpu-bind and/or
4212 --mem-bind srun options. NOTE: see "man
4213 cgroup.conf" for configuration details.
4214
4215 task/none for systems requiring no special handling of user
4216 tasks. Lacks support for the --cpu-bind and/or
4217 --mem-bind srun options. The default value is
4218 "task/none".
4219
4220 NOTE: It is recommended to stack task/affinity,task/cgroup to‐
4221 gether when configuring TaskPlugin, and setting Constrain‐
4222 Cores=yes in cgroup.conf. This setup uses the task/affinity
4223 plugin for setting the affinity of the tasks and uses the
4224 task/cgroup plugin to fence tasks into the specified resources.
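
For example, on a typical Linux cluster this recommendation
corresponds to the following settings (the second line belongs
in cgroup.conf):

     TaskPlugin=task/affinity,task/cgroup
     ConstrainCores=yes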
4225
4226 NOTE: For CRAY systems only: task/cgroup must be used with, and
4227 listed after task/cray_aries in TaskPlugin. The task/affinity
4228 plugin can be listed anywhere, but the previous constraint must
4229 be satisfied. For CRAY systems, a configuration like this is
4230 recommended:
4231 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
4232
4233 TaskPluginParam
4234 Optional parameters for the task plugin. Multiple options
4235 should be comma separated. None, Sockets, Cores and Threads are
4236 mutually exclusive and treated as the last possible source of the
4237 --cpu-bind default. See also Node and Partition CpuBind options.
4238
4239 Cores Bind tasks to cores by default. Overrides automatic
4240 binding.
4241
4242 None Perform no task binding by default. Overrides automatic
4243 binding.
4244
4245 Sockets
4246 Bind to sockets by default. Overrides automatic binding.
4247
4248 Threads
4249 Bind to threads by default. Overrides automatic binding.
4250
4251 SlurmdOffSpec
4252 If specialized cores or CPUs are identified for the node
4253 (i.e. the CoreSpecCount or CpuSpecList are configured for
4254 the node), then Slurm daemons running on the compute node
4255 (i.e. slurmd and slurmstepd) should run outside of those
4256 resources (i.e. specialized resources are completely un‐
4257 available to Slurm daemons and jobs spawned by Slurm).
4258 This option may not be used with the task/cray_aries
4259 plugin.
4260
4261 Verbose
4262 Verbosely report binding before tasks run by default.
4263
4264 Autobind
4265 Set a default binding in the event that "auto binding"
4266 doesn't find a match. Set to Threads, Cores or Sockets
4267 (E.g. TaskPluginParam=autobind=threads).
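
For example, a site could combine one binding default with the
reporting option (shown only as an illustration):

     TaskPluginParam=Cores,Verbose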
4268
4269 TaskProlog
4270 Fully qualified pathname of a program to be executed as the
4271 slurm job's owner prior to initiation of each task. Besides the
4272 normal environment variables, this has SLURM_TASK_PID available
4273 to identify the process ID of the task being started. Standard
4274 output from this program can be used to control the environment
4275 variables and output for the user program.
4276
4277 export NAME=value Will set environment variables for the task
4278 being spawned. Everything after the equal
4279 sign to the end of the line will be used as
4280 the value for the environment variable. Ex‐
4281 porting of functions is not currently sup‐
4282 ported.
4283
4284 print ... Will cause that line (without the leading
4285 "print ") to be printed to the job's stan‐
4286 dard output.
4287
4288 unset NAME Will clear environment variables for the
4289 task being spawned.
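
For example, a TaskProlog whose standard output contains the
following lines would set one environment variable, print a
message to the job's standard output, and clear another variable
(the names and values are only illustrative):

     export MY_WORKDIR=/tmp/myjob
     print task prolog finished
     unset UNWANTED_VARIABLE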
4290
4291 The order of task prolog/epilog execution is as follows:
4292
4293 1. pre_launch_priv()
4294 Function in TaskPlugin
4295
4296 1. pre_launch() Function in TaskPlugin
4297
4298 2. TaskProlog System-wide per task program defined in
4299 slurm.conf
4300
4301 3. User prolog Job-step-specific task program defined using
4302 srun's --task-prolog option or
4303 SLURM_TASK_PROLOG environment variable
4304
4305 4. Task Execute the job step's task
4306
4307 5. User epilog Job-step-specific task program defined using
4308 srun's --task-epilog option or
4309 SLURM_TASK_EPILOG environment variable
4310
4311 6. TaskEpilog System-wide per task program defined in
4312 slurm.conf
4313
4314 7. post_term() Function in TaskPlugin
4315
4316 TCPTimeout
4317 Time permitted for TCP connection to be established. Default
4318 value is 2 seconds.
4319
4320 TmpFS Fully qualified pathname of the file system available to user
4321 jobs for temporary storage. This parameter is used in establish‐
4322 ing a node's TmpDisk space. The default value is "/tmp".
4323
4324 TopologyParam
4325 Comma-separated options identifying network topology options.
4326
4327 Dragonfly Optimize allocation for Dragonfly network. Valid
4328 when TopologyPlugin=topology/tree.
4329
4330 TopoOptional Only optimize allocation for network topology if
4331 the job includes a switch option. Since optimiz‐
4332 ing resource allocation for topology involves
4333 much higher system overhead, this option can be
4334 used to impose the extra overhead only on jobs
4335 which can take advantage of it. If most job allo‐
4336 cations are not optimized for network topology,
4337 they may fragment resources to the point that
4338 topology optimization for other jobs will be dif‐
4339 ficult to achieve. NOTE: Jobs may span across
4340 nodes without common parent switches with this
4341 enabled.
4342
4343 TopologyPlugin
4344 Identifies the plugin to be used for determining the network
4345 topology and optimizing job allocations to minimize network con‐
4346 tention. See NETWORK TOPOLOGY below for details. Additional
4347 plugins may be provided in the future which gather topology in‐
4348 formation directly from the network. Acceptable values include:
4349
4350 topology/3d_torus best-fit logic over three-dimensional
4351 topology
4352
4353 topology/none default for other systems, best-fit logic
4354 over one-dimensional topology
4355
4356 topology/tree used for a hierarchical network as de‐
4357 scribed in a topology.conf file
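
For example, a cluster with a hierarchical switch fabric
described in topology.conf, where only jobs that explicitly
request it should pay the extra scheduling cost, might use:

     TopologyPlugin=topology/tree
     TopologyParam=TopoOptional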
4358
4359 TrackWCKey
4360 Boolean yes or no. Used to enable the display and tracking of
4361 the Workload Characterization Key. Must be set to track correct wckey
4362 usage. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4363 file to create historical usage reports.
4364
4365 TreeWidth
4366 Slurmd daemons use a virtual tree network for communications.
4367 TreeWidth specifies the width of the tree (i.e. the fanout). On
4368 architectures with a front end node running the slurmd daemon,
4369 the value must always be equal to or greater than the number of
4370 front end nodes, which eliminates the need for message forwarding
4371 between the slurmd daemons. On other architectures the default
4372 value is 50, meaning each slurmd daemon can communicate with up
4373 to 50 other slurmd daemons and over 2500 nodes can be contacted
4374 with two message hops. The default value will work well for
4375 most clusters. Optimal system performance can typically be
4376 achieved if TreeWidth is set to the square root of the number of
4377 nodes in the cluster for systems having no more than 2500 nodes
4378 or the cube root for larger systems. The value may not exceed
4379 65533.
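
For example, following the square root guideline above, a
cluster of roughly 900 nodes might set:

     TreeWidth=30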
4380
4381 UnkillableStepProgram
4382 If the processes in a job step are determined to be unkillable
4383 for a period of time specified by the UnkillableStepTimeout
4384 variable, the program specified by UnkillableStepProgram will be
4385 executed. By default no program is run.
4386
4387 See section UNKILLABLE STEP PROGRAM SCRIPT for more information.
4388
4389 UnkillableStepTimeout
4390 The length of time, in seconds, that Slurm will wait before de‐
4391 ciding that processes in a job step are unkillable (after they
4392 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4393 gram. The default timeout value is 60 seconds. If exceeded,
4394 the compute node will be drained to prevent future jobs from be‐
4395 ing scheduled on the node.
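
For example (the script path is only illustrative; the script
itself must be provided by the site):

     UnkillableStepProgram=/etc/slurm/unkillable_step.sh
     UnkillableStepTimeout=120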
4396
4397 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4398 will be enabled. PAM is used to establish the upper bounds for
4399 resource limits. With PAM support enabled, local system adminis‐
4400 trators can dynamically configure system resource limits. Chang‐
4401 ing the upper bound of a resource limit will not alter the lim‐
4402 its of running jobs; only jobs started after a change has been
4403 made will pick up the new limits. The default value is 0 (not
4404 to enable PAM support). Remember that PAM also needs to be con‐
4405 figured to support Slurm as a service. For sites using PAM's
4406 directory based configuration option, a configuration file named
4407 slurm should be created. The module-type, control-flags, and
4408 module-path names that should be included in the file are:
4409 auth required pam_localuser.so
4410 auth required pam_shells.so
4411 account required pam_unix.so
4412 account required pam_access.so
4413 session required pam_unix.so
4414 For sites configuring PAM with a general configuration file, the
4415 appropriate lines (see above), where slurm is the service-name,
4416 should be added.
4417
4418 NOTE: The UsePAM option has nothing to do with the con‐
4419 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules. So
4420 these two modules can work independently of the value set for
4421 UsePAM.
4422
4423 VSizeFactor
4424 Memory specifications in job requests apply to real memory size
4425 (also known as resident set size). It is possible to enforce
4426 virtual memory limits for both jobs and job steps by limiting
4427 their virtual memory to some percentage of their real memory al‐
4428 location. The VSizeFactor parameter specifies the job's or job
4429 step's virtual memory limit as a percentage of its real memory
4430 limit. For example, if a job's real memory limit is 500MB and
4431 VSizeFactor is set to 101 then the job will be killed if its
4432 real memory exceeds 500MB or its virtual memory exceeds 505MB
4433 (101 percent of the real memory limit). The default value is 0,
4434 which disables enforcement of virtual memory limits. The value
4435 may not exceed 65533 percent.
4436
4437 NOTE: This parameter is dependent on OverMemoryKill being con‐
4438 figured in JobAcctGatherParams. It is also possible to configure
4439 the TaskPlugin to use task/cgroup for memory enforcement. VSize‐
4440 Factor will not have an effect on memory enforcement done
4441 through cgroups.
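
For example, to enforce a 10 percent allowance over the real
memory limit (a sketch only; OverMemoryKill must also be set as
noted above):

     JobAcctGatherParams=OverMemoryKill
     VSizeFactor=110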
4442
4443 WaitTime
4444 Specifies how many seconds the srun command should by default
4445 wait after the first task terminates before terminating all re‐
4446 maining tasks. The "--wait" option on the srun command line
4447 overrides this value. The default value is 0, which disables
4448 this feature. May not exceed 65533 seconds.
4449
4450 X11Parameters
4451 For use with Slurm's built-in X11 forwarding implementation.
4452
4453 home_xauthority
4454 If set, xauth data on the compute node will be placed in
4455 ~/.Xauthority rather than in a temporary file under
4456 TmpFS.
4457
4458 NODE CONFIGURATION
4459 The configuration of nodes (or machines) to be managed by Slurm is also
4460 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4461 adding nodes, changing their processor count, etc.) require restarting
4462 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4463 must know each node in the system to forward messages in support of hi‐
4464 erarchical communications. Only the NodeName must be supplied in the
4465 configuration file. All other node configuration information is op‐
4466 tional. It is advisable to establish baseline node configurations, es‐
4467 pecially if the cluster is heterogeneous. Nodes which register to the
4468 system with less than the configured resources (e.g. too little mem‐
4469 ory), will be placed in the "DOWN" state to avoid scheduling jobs on
4470 them. Establishing baseline configurations will also speed Slurm's
4471 scheduling process by permitting it to compare job requirements against
4472 these (relatively few) configuration parameters and possibly avoid hav‐
4473 ing to check job requirements against every individual node's configu‐
4474 ration. The resources checked at node registration time are: CPUs,
4475 RealMemory and TmpDisk.
4476
4477 Default values can be specified with a record in which NodeName is "DE‐
4478 FAULT". The default entry values will apply only to lines following it
4479 in the configuration file and the default values can be reset multiple
4480 times in the configuration file with multiple entries where "Node‐
4481 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4482 add to previous default values and will not reinitialize the default
4483 values. The "NodeName=" specification must be placed on every line de‐
4484 scribing the configuration of nodes. A single node name can not appear
4485 as a NodeName value in more than one line (duplicate node name records
4486 will be ignored). In fact, it is generally possible and desirable to
4487 define the configurations of all nodes in only a few lines. This con‐
4488 vention permits significant optimization in the scheduling of larger
4489 clusters. In order to support the concept of jobs requiring consecu‐
4490 tive nodes on some architectures, node specifications should be placed
4491 in this file in consecutive order. No single node name may be listed
4492 more than once in the configuration file. Use "DownNodes=" to record
4493 the state of nodes which are temporarily in a DOWN, DRAIN or FAILING
4494 state without altering permanent configuration information. A job
4495 step's tasks are allocated to nodes in the order they appear in the
4496 configuration file. There is presently no capability within Slurm to
4497 arbitrarily order a job step's tasks.
4498
4499 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4500 and/or a simple node range expression may optionally be used to specify
4501 numeric ranges of nodes to avoid building a configuration file with
4502 large numbers of entries. The node range expression can contain one
4503 pair of square brackets with a sequence of comma-separated numbers
4504 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4505 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4506 more leading zeros to indicate the numeric portion has a fixed number
4507 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4508 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4509 more numeric expressions are included, one of them must be at the end
4510 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4511 always be used in a comma-separated list.
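
For example, a small heterogeneous cluster might be described
with a DEFAULT record and node range expressions similar to the
following sketch (all names and sizes are only illustrative):

     NodeName=DEFAULT CPUs=32 RealMemory=128000 TmpDisk=16384 State=UNKNOWN
     NodeName=linux[0001-0128]
     NodeName=bigmem[01-04] RealMemory=512000 Features=bigmem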
4512
4513 The node configuration specifies the following information:
4514
4515
4516 NodeName
4517 Name that Slurm uses to refer to a node. Typically this would
4518 be the string that "/bin/hostname -s" returns. It may also be
4519 the fully qualified domain name as returned by "/bin/hostname
4520 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4521 with the host through the host database (/etc/hosts) or DNS, de‐
4522 pending on the resolver settings. Note that if the short form
4523 of the hostname is not used, it may prevent use of hostlist ex‐
4524 pressions (the numeric portion in brackets must be at the end of
4525 the string). It may also be an arbitrary string if NodeHostname
4526 is specified. If the NodeName is "DEFAULT", the values speci‐
4527 fied with that record will apply to subsequent node specifica‐
4528 tions unless explicitly set to other values in that node record
4529 or replaced with a different set of default values. Each line
4530 where NodeName is "DEFAULT" will replace or add to previous de‐
4531 fault values and not reinitialize the default values. For ar‐
4532 chitectures in which the node order is significant, nodes will
4533 be considered consecutive in the order defined. For example, if
4534 the configuration for "NodeName=charlie" immediately follows the
4535 configuration for "NodeName=baker" they will be considered adja‐
4536 cent in the computer. NOTE: If the NodeName is "ALL" the
4537 process parsing the configuration will exit immediately as it is
4538 an internally reserved word.
4539
4540 NodeHostname
4541 Typically this would be the string that "/bin/hostname -s" re‐
4542 turns. It may also be the fully qualified domain name as re‐
4543 turned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid
4544 domain name associated with the host through the host database
4545 (/etc/hosts) or DNS, depending on the resolver settings. Note
4546 that if the short form of the hostname is not used, it may pre‐
4547 vent use of hostlist expressions (the numeric portion in brack‐
4548 ets must be at the end of the string). A node range expression
4549 can be used to specify a set of nodes. If an expression is
4550 used, the number of nodes identified by NodeHostname on a line
4551 in the configuration file must be identical to the number of
4552 nodes identified by NodeName. By default, the NodeHostname will
4553 be identical in value to NodeName.
4554
4555 NodeAddr
4556 Name that a node should be referred to in establishing a commu‐
4557 nications path. This name will be used as an argument to the
4558 getaddrinfo() function for identification. If a node range ex‐
4559 pression is used to designate multiple nodes, they must exactly
4560 match the entries in the NodeName (e.g. "NodeName=lx[0-7]
4561 NodeAddr=elx[0-7]"). NodeAddr may also contain IP addresses.
4562 By default, the NodeAddr will be identical in value to NodeHost‐
4563 name.
4564
4565 BcastAddr
4566 Alternate network path to be used for sbcast network traffic to
4567 a given node. This name will be used as an argument to the
4568 getaddrinfo() function. If a node range expression is used to
4569 designate multiple nodes, they must exactly match the entries in
4570 the NodeName (e.g. "NodeName=lx[0-7] BcastAddr=elx[0-7]").
4571 BcastAddr may also contain IP addresses. By default, the Bcas‐
4572 tAddr is unset, and sbcast traffic will be routed to the
4573 NodeAddr for a given node. Note: cannot be used with Communica‐
4574 tionParameters=NoInAddrAny.
4575
4576 Boards Number of Baseboards in nodes with a baseboard controller. Note
4577 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4578 and ThreadsPerCore should be specified. The default value is 1.
4579
4580 CoreSpecCount
4581 Number of cores reserved for system use. Depending upon the
4582 TaskPluginParam option of SlurmdOffSpec, the Slurm daemon slurmd
4583 may either be confined to these resources (the default) or pre‐
4584 vented from using these resources. Isolation of slurmd from
4585 user jobs may improve application performance. A job can use
4586 these cores if AllowSpecResourcesUsage=yes and the user explic‐
4587 itly requests less than the configured CoreSpecCount. If this
4588 option and CpuSpecList are both designated for a node, an error
4589 is generated. For information on the algorithm used by Slurm to
4590 select the cores refer to the core specialization documentation
4591 ( https://slurm.schedmd.com/core_spec.html ).
4592
4593 CoresPerSocket
4594 Number of cores in a single physical processor socket (e.g.
4595 "2"). The CoresPerSocket value describes physical cores, not
4596 the logical number of processors per socket. NOTE: If you have
4597 multi-core processors, you will likely need to specify this pa‐
4598 rameter in order to optimize scheduling. The default value is
4599 1.
4600
4601 CpuBind
4602 If a job step request does not specify an option to control how
4603 tasks are bound to allocated CPUs (--cpu-bind) and all nodes al‐
4604 located to the job have the same CpuBind option, the node CpuBind
4605 option will control how tasks are bound to allocated resources.
4606 Supported values for CpuBind are "none", "socket", "ldom"
4607 (NUMA), "core" and "thread".
4608
4609 CPUs Number of logical processors on the node (e.g. "2"). It can be
4610 set to the total number of sockets (supported only by select/lin‐
4611 ear), cores or threads. This can be useful when you want to
4612 schedule only the cores on a hyper-threaded node. If CPUs is
4613 omitted, its default will be set equal to the product of Boards,
4614 Sockets, CoresPerSocket, and ThreadsPerCore.
4615
4616 CpuSpecList
4617 A comma-delimited list of Slurm abstract CPU IDs reserved for
4618 system use. The list will be expanded to include all other
4619 CPUs, if any, on the same cores. Depending upon the TaskPlugin‐
4620 Param option of SlurmdOffSpec, the Slurm daemon slurmd may ei‐
4621 ther be confined to these resources (the default) or prevented
4622 from using these resources. Isolation of slurmd from user jobs
4623 may improve application performance. A job can use these cores
4624 if AllowSpecResourcesUsage=yes and the user explicitly requests
4625 less than the number of CPUs in this list. If this option and
4626 CoreSpecCount are both designated for a node, an error is gener‐
4627 ated. This option has no effect unless cgroup job confinement
4628 is also configured (i.e. the task/cgroup TaskPlugin is enabled
4629 and ConstrainCores=yes is set in cgroup.conf).
4630
4631 Features
4632 A comma-delimited list of arbitrary strings indicative of some
4633 characteristic associated with the node. There is no value or
4634 count associated with a feature at this time; a node either has
4635 a feature or it does not. A desired feature may contain a nu‐
4636 meric component indicating, for example, processor speed but
4637 this numeric component will be considered to be part of the fea‐
4638 ture string. Features are intended to be used to filter nodes
4639 eligible to run jobs via the --constraint argument. By default
4640 a node has no features. Also see Gres for being able to have
4641 more control such as types and count. Using features is faster
4642 than scheduling against GRES but is limited to Boolean opera‐
4643 tions.
4644
4645 Gres A comma-delimited list of generic resources specifications for a
4646 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4647 ber>[K|M|G]". The first field is the resource name, which
4648 matches the GresType configuration parameter name. The optional
4649 type field might be used to identify a model of that generic re‐
4650 source. It is forbidden to specify both an untyped GRES and a
4651 typed GRES with the same <name>. The optional no_consume field
4652 allows you to specify that a generic resource does not have a
4653 finite number of that resource that gets consumed as it is re‐
4654 quested. The no_consume field is a GRES specific setting and ap‐
4655 plies to the GRES, regardless of the type specified. It should
4656 not be used with a GRES that has a dedicated plugin; if you're
4657 looking for a way to overcommit GPUs to multiple processes at
4658 the same time, you may be interested in using the "shard" GRES instead.
4659 The final field must specify a generic resources count. A suf‐
4660 fix of "K", "M", "G", "T" or "P" may be used to multiply the
4661 number by 1024, 1048576, 1073741824, etc. respectively.
4662 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4663 sume:4G"). By default a node has no generic resources and its
4664 maximum count is that of an unsigned 64bit integer. Also see
4665 Features for Boolean flags to filter nodes using job con‐
4666 straints.
4667
4668 MemSpecLimit
4669 Amount of memory, in megabytes, reserved for system use and not
4670 available for user allocations. If the task/cgroup plugin is
4671 configured and that plugin constrains memory allocations (i.e.
4672 the task/cgroup TaskPlugin is enabled and ConstrainRAMSpace=yes
4673 is set in cgroup.conf), then Slurm compute node daemons (slurmd
4674 plus slurmstepd) will be allocated the specified memory limit.
4675 Note that for this option to work, memory must be configured as
4676 a consumable resource through one of the SelectTypeParameters
4677 options. The daemons will not be killed if they exhaust the
4678 memory allocation (i.e. the Out-Of-Memory Killer is
4679 disabled for the daemon's memory cgroup). If the task/cgroup
4680 plugin is not configured, the specified memory will only be un‐
4681 available for user allocations.
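
For example, a node definition that reserves two cores and two
gigabytes of memory for the Slurm daemons might look like this
sketch (names and sizes are only illustrative):

     NodeName=node[01-16] CPUs=64 RealMemory=256000 CoreSpecCount=2 MemSpecLimit=2048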
4682
4683 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4684 tens to for work on this particular node. By default there is a
4685 single port number for all slurmd daemons on all compute nodes
4686 as defined by the SlurmdPort configuration parameter. Use of
4687 this option is not generally recommended except for development
4688 or testing purposes. If multiple slurmd daemons execute on a
4689 node this can specify a range of ports.
4690
4691 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4692 automatically try to interact with anything opened on ports
4693 8192-60000. Configure Port to use a port outside of the config‐
4694 ured SrunPortRange and RSIP's port range.
4695
4696 Procs See CPUs.
4697
4698 RealMemory
4699 Size of real memory on the node in megabytes (e.g. "2048"). The
4700 default value is 1. Lowering RealMemory with the goal of setting
4701 aside some amount for the OS, unavailable for job allocations,
4702 will not work as intended if Memory is not set as a consumable
4703 resource in SelectTypeParameters, so one of the *_Memory
4704 options needs to be enabled for that goal to be accomplished.
4705 Also see MemSpecLimit.
4706
4707 Reason Identifies the reason for a node being in state "DOWN",
4708 "DRAINED" "DRAINING", "FAIL" or "FAILING". Use quotes to en‐
4709 close a reason having more than one word.
4710
4711 Sockets
4712 Number of physical processor sockets/chips on the node (e.g.
4713 "2"). If Sockets is omitted, it will be inferred from CPUs,
4714 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4715 multi-core processors, you will likely need to specify these pa‐
4716 rameters. Sockets and SocketsPerBoard are mutually exclusive.
4717 If Sockets is specified when Boards is also used, Sockets is in‐
4718 terpreted as SocketsPerBoard rather than total sockets. The de‐
4719 fault value is 1.
4720
4721 SocketsPerBoard
4722 Number of physical processor sockets/chips on a baseboard.
4723 Sockets and SocketsPerBoard are mutually exclusive. The default
4724 value is 1.
4725
4726 State State of the node with respect to the initiation of user jobs.
4727 Acceptable values are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE
4728 and UNKNOWN. Node states of BUSY and IDLE should not be speci‐
4729 fied in the node configuration, but set the node state to UN‐
4730 KNOWN instead. Setting the node state to UNKNOWN will result in
4731 the node state being set to BUSY, IDLE or other appropriate
4732 state based upon recovered system state information. The de‐
4733 fault value is UNKNOWN. Also see the DownNodes parameter below.
4734
4735 CLOUD Indicates the node exists in the cloud. Its initial
4736 state will be treated as powered down. The node will
4737 be available for use after its state is recovered from
4738 Slurm's state save file or the slurmd daemon starts on
4739 the compute node.
4740
4741 DOWN Indicates the node failed and is unavailable to be al‐
4742 located work.
4743
4744 DRAIN Indicates the node is unavailable to be allocated
4745 work.
4746
4747 FAIL Indicates the node is expected to fail soon, has no
4748 jobs allocated to it, and will not be allocated to any
4749 new jobs.
4750
4751 FAILING Indicates the node is expected to fail soon, has one
4752 or more jobs allocated to it, but will not be allo‐
4753 cated to any new jobs.
4754
4755 FUTURE Indicates the node is defined for future use and need
4756 not exist when the Slurm daemons are started. These
4757 nodes can be made available for use simply by updating
4758 the node state using the scontrol command rather than
4759 restarting the slurmctld daemon. After these nodes are
4760 made available, change their State in the slurm.conf
4761 file. Until these nodes are made available, they will
4762 not be seen using any Slurm commands, nor will any attempt be
4763 attempt be made to contact them.
4764
4765 Dynamic Future Nodes
4766 A slurmd started with -F[<feature>] will be as‐
4767 sociated with a FUTURE node that matches the
4768 same configuration (sockets, cores, threads) as
4769 reported by slurmd -C. The node's NodeAddr and
4770 NodeHostname will automatically be retrieved
4771 from the slurmd and will be cleared when set
4772 back to the FUTURE state. Dynamic FUTURE nodes
4773 retain non-FUTURE state on restart. Use scon‐
4774 trol to put a node back into the FUTURE state.
4775
4776 If the mapping of the NodeName to the slurmd
4777 HostName is not updated in DNS, Dynamic Future
4778 nodes won't know how to communicate with each
4779 other -- because NodeAddr and NodeHostName are
4780 not defined in the slurm.conf -- and the fanout
4781 communications need to be disabled by setting
4782 TreeWidth to a high number (e.g. 65533). If the
4783 DNS mapping is made, then the cloud_dns Slurm‐
4784 ctldParameter can be used.
4785
4786 UNKNOWN Indicates the node's state is undefined but will be
4787 established (set to BUSY or IDLE) when the slurmd dae‐
4788 mon on that node registers. UNKNOWN is the default
4789 state.
4790
4791 ThreadsPerCore
4792 Number of logical threads in a single physical core (e.g. "2").
4793 Note that Slurm can allocate resources to jobs down to the
4794 resolution of a core. If your system is configured with more
4795 than one thread per core, execution of a different job on each
4796 thread is not supported unless you configure SelectTypeParame‐
4797 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4798 or ThreadsPerCore. A job can execute one task per thread from
4799 within one job step or execute a distinct job step on each of
4800 the threads. Note also if you are running with more than 1
4801 thread per core and running the select/cons_res or se‐
4802 lect/cons_tres plugin then you will want to set the SelectType‐
4803 Parameters variable to something other than CR_CPU to avoid un‐
4804 expected results. The default value is 1.
4805
4806 TmpDisk
4807 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4808 "16384"). TmpFS (for "Temporary File System") identifies the lo‐
4809 cation which jobs should use for temporary storage. Note this
4810 does not indicate the amount of free space available to the user
4811 on the node, only the total file system size. The system admin‐
4812 istrator should ensure this file system is purged as needed so
4813 that user jobs have access to most of this space. The Prolog
4814 and/or Epilog programs (specified in the configuration file)
4815 might be used to ensure the file system is kept clean. The de‐
4816 fault value is 0.
4817
4818 Weight The priority of the node for scheduling purposes. All things
4819 being equal, jobs will be allocated the nodes with the lowest
4820 weight which satisfies their requirements. For example, a het‐
4821 erogeneous collection of nodes might be placed into a single
4822 partition for greater system utilization, responsiveness and ca‐
4823 pability. It would be preferable to allocate smaller memory
4824 nodes rather than larger memory nodes if either will satisfy a
4825 job's requirements. The units of weight are arbitrary, but
4826 larger weights should be assigned to nodes with more processors,
4827 memory, disk space, higher processor speed, etc. Note that if a
4828 job allocation request can not be satisfied using the nodes with
4829 the lowest weight, the set of nodes with the next lowest weight
4830 is added to the set of nodes under consideration for use (repeat
4831 as needed for higher weight values). If you absolutely want to
4832 minimize the number of higher weight nodes allocated to a job
4833 (at a cost of higher scheduling overhead), give each node a dis‐
4834 tinct Weight value and they will be added to the pool of nodes
4835 being considered for scheduling individually.
4836
4837 The default value is 1.
4838
4839 NOTE: Node weights are first considered among currently avail‐
4840 able nodes. For example, a POWERED_DOWN node with a lower weight
4841 will not be evaluated before an IDLE node.
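
For example, to prefer the smaller nodes whenever they satisfy a
job's requirements (names and weights are only illustrative):

     NodeName=small[001-032] RealMemory=64000 Weight=10
     NodeName=large[01-04] RealMemory=512000 Weight=100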
4842
4843 DOWN NODE CONFIGURATION
4844 The DownNodes= parameter permits you to mark certain nodes as in a
4845 DOWN, DRAIN, FAIL, FAILING or FUTURE state without altering the perma‐
4846 nent configuration information listed under a NodeName= specification.
4847
4848
4849 DownNodes
4850 Any node name, or list of node names, from the NodeName= speci‐
4851 fications.
4852
4853 Reason Identifies the reason for a node being in state DOWN, DRAIN,
4854 FAIL, FAILING or FUTURE. Use quotes to enclose a reason having
4855 more than one word.
4856
4857 State State of the node with respect to the initiation of user jobs.
4858 Acceptable values are DOWN, DRAIN, FAIL, FAILING and FUTURE.
4859 For more information about these states see the descriptions un‐
4860 der State in the NodeName= section above. The default value is
4861 DOWN.
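
For example (the node names and reason are only illustrative):

     DownNodes=linux[0021-0024] State=DRAIN Reason="scheduled maintenance"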
4862
4863 FRONTEND NODE CONFIGURATION
4864 On computers where frontend nodes are used to execute batch scripts
4865 rather than compute nodes, one may configure one or more frontend nodes
4866 using the configuration parameters defined below. These options are
4867 very similar to those used in configuring compute nodes. These options
4868 may only be used on systems configured and built with the appropriate
4869 parameters (--have-front-end). The front end configuration specifies
4870 the following information:
4871
4872
4873 AllowGroups
4874 Comma-separated list of group names which may execute jobs on
4875 this front end node. By default, all groups may use this front
4876 end node. A user will be permitted to use this front end node
4877 if AllowGroups has at least one group associated with the user.
4878 May not be used with the DenyGroups option.
4879
4880 AllowUsers
4881 Comma-separated list of user names which may execute jobs on
4882 this front end node. By default, all users may use this front
4883 end node. May not be used with the DenyUsers option.
4884
4885 DenyGroups
4886 Comma-separated list of group names which are prevented from ex‐
4887 ecuting jobs on this front end node. May not be used with the
4888 AllowGroups option.
4889
4890 DenyUsers
4891 Comma-separated list of user names which are prevented from exe‐
4892 cuting jobs on this front end node. May not be used with the
4893 AllowUsers option.
4894
4895 FrontendName
4896 Name that Slurm uses to refer to a frontend node. Typically
4897 this would be the string that "/bin/hostname -s" returns. It
4898 may also be the fully qualified domain name as returned by
4899 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4900 name associated with the host through the host database
4901 (/etc/hosts) or DNS, depending on the resolver settings. Note
4902 that if the short form of the hostname is not used, it may pre‐
4903 vent use of hostlist expressions (the numeric portion in brack‐
4904 ets must be at the end of the string). If the FrontendName is
4905 "DEFAULT", the values specified with that record will apply to
4906 subsequent node specifications unless explicitly set to other
4907 values in that frontend node record or replaced with a different
4908 set of default values. Each line where FrontendName is "DE‐
4909 FAULT" will replace or add to previous default values and not a
4910 reinitialize the default values.
4911
4912 FrontendAddr
4913 Name that a frontend node should be referred to in establishing
4914 a communications path. This name will be used as an argument to
4915 the getaddrinfo() function for identification. As with Fron‐
4916 tendName, list the individual node addresses rather than using a
4917 hostlist expression. The number of FrontendAddr records per
4918 line must equal the number of FrontendName records per line
4919 (i.e. you can't map two node names to one address). FrontendAddr
4920 may also contain IP addresses. By default, the FrontendAddr
4921 will be identical in value to FrontendName.
4922
4923 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4924 tens to for work on this particular frontend node. By default
4925 there is a single port number for all slurmd daemons on all
4926 frontend nodes as defined by the SlurmdPort configuration param‐
4927 eter. Use of this option is not generally recommended except for
4928 development or testing purposes.
4929
4930 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4931 automatically try to interact with anything opened on ports
4932 8192-60000. Configure Port to use a port outside of the config‐
4933 ured SrunPortRange and RSIP's port range.
4934
4935 Reason Identifies the reason for a frontend node being in state DOWN,
4936 DRAINED, DRAINING, FAIL or FAILING. Use quotes to enclose a
4937 reason having more than one word.
4938
4939 State State of the frontend node with respect to the initiation of
4940 user jobs. Acceptable values are DOWN, DRAIN, FAIL, FAILING and
4941 UNKNOWN. Node states of BUSY and IDLE should not be specified
4942 in the node configuration, but set the node state to UNKNOWN in‐
4943 stead. Setting the node state to UNKNOWN will result in the
4944 node state being set to BUSY, IDLE or other appropriate state
4945 based upon recovered system state information. For more infor‐
4946 mation about these states see the descriptions under State in
4947 the NodeName= section above. The default value is UNKNOWN.
4948
4949 As an example, you can do something similar to the following to define
4950 four front end nodes for running slurmd daemons.
4951 FrontendName=frontend[00-03] FrontendAddr=efrontend[00-03] State=UNKNOWN
4952
4953
4954 NODESET CONFIGURATION
4955 The nodeset configuration allows you to define a name for a specific
4956 set of nodes which can be used to simplify the partition configuration
4957 section, especially for heterogeneous or condo-style systems. Each node‐
4958 set may be defined by an explicit list of nodes, and/or by filtering
4959 the nodes by a particular configured feature. If both Feature= and
4960 Nodes= are used the nodeset shall be the union of the two subsets.
4961 Note that the nodesets are only used to simplify the partition defini‐
4962 tions at present, and are not usable outside of the partition configu‐
4963 ration.
4964
4965
4966 Feature
4967 All nodes with this single feature will be included as part of
4968 this nodeset.
4969
4970 Nodes List of nodes in this set.
4971
4972 NodeSet
4973 Unique name for a set of nodes. Must not overlap with any Node‐
4974 Name definitions.
4975
4976 PARTITION CONFIGURATION
4977 The partition configuration permits you to establish different job lim‐
4978 its or access controls for various groups (or partitions) of nodes.
4979 Nodes may be in more than one partition, making partitions serve as
4980 general purpose queues. For example one may put the same set of nodes
4981 into two different partitions, each with different constraints (time
4982 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4983 allocated resources within a single partition. Default values can be
4984 specified with a record in which PartitionName is "DEFAULT". The de‐
4985 fault entry values will apply only to lines following it in the config‐
4986 uration file and the default values can be reset multiple times in the
4987 configuration file with multiple entries where "PartitionName=DEFAULT".
4988 The "PartitionName=" specification must be placed on every line de‐
4989 scribing the configuration of partitions. Each line where Partition‐
4990 Name is "DEFAULT" will replace or add to previous default values and
4991 not reinitialize the default values. A single partition name cannot
4992 appear as a PartitionName value in more than one line (duplicate parti‐
4993 tion name records will be ignored). If a partition that is in use is
4994 deleted from the configuration and slurm is restarted or reconfigured
4995 (scontrol reconfigure), jobs using the partition are canceled. NOTE:
4996 Put all parameters for each partition on a single line. Each line of
4997 partition configuration information should represent a different parti‐
4998 tion. The partition configuration file contains the following informa‐
4999 tion:
5000
5001
5002 AllocNodes
5003 Comma-separated list of nodes from which users can submit jobs
5004 in the partition. Node names may be specified using the node
5005 range expression syntax described above. The default value is
5006 "ALL".
5007
5008 AllowAccounts
5009 Comma-separated list of accounts which may execute jobs in the
5010 partition. The default value is "ALL". NOTE: If AllowAccounts
5011 is used then DenyAccounts will not be enforced. Also refer to
5012 DenyAccounts.
5013
5014 AllowGroups
5015 Comma-separated list of group names which may execute jobs in
5016 this partition. A user will be permitted to submit a job to
5017 this partition if AllowGroups has at least one group associated
5018 with the user. Jobs executed as user root or as user SlurmUser
5019 will be allowed to use any partition, regardless of the value of
5020 AllowGroups. In addition, a Slurm Admin or Operator will be able
5021 to view any partition, regardless of the value of AllowGroups.
5022 If user root attempts to execute a job as another user (e.g. us‐
5023 ing srun's --uid option), then the job will be subject to Allow‐
5024 Groups as if it were submitted by that user. By default, Allow‐
5025 Groups is unset, meaning all groups are allowed to use this par‐
5026 tition. The special value 'ALL' is equivalent to this. Users
5027 who are not members of the specified group will not see informa‐
5028 tion about this partition by default. However, this should not
5029 be treated as a security mechanism, since job information will
5030 be returned if a user requests details about the partition or a
5031 specific job. See the PrivateData parameter to restrict access
5032 to job information. NOTE: For performance reasons, Slurm main‐
5033 tains a list of user IDs allowed to use each partition and this
5034 is checked at job submission time. This list of user IDs is up‐
5035 dated when the slurmctld daemon is restarted, reconfigured (e.g.
5036 "scontrol reconfig") or the partition's AllowGroups value is re‐
5037 set, even if its value is unchanged (e.g. "scontrol update Parti‐
5038 tionName=name AllowGroups=group"). For a user's access to a
5039 partition to change, both the user's group membership must
5040 change and Slurm's internal user ID list must be updated using
5041 one of the methods described above.
5042
5043 AllowQos
5044 Comma-separated list of Qos which may execute jobs in the parti‐
5045 tion. Jobs executed as user root can use any partition without
5046 regard to the value of AllowQos. The default value is "ALL".
5047 NOTE: If AllowQos is used then DenyQos will not be enforced.
5048 Also refer to DenyQos.
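
              For illustration, a partition restricted to hypothetical
              accounts and QOS names might be configured as:
              PartitionName=priority Nodes=tux[0-31] AllowAccounts=acctA,acctB AllowQos=high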
5049
5050 Alternate
5051 Partition name of alternate partition to be used if the state of
5052 this partition is "DRAIN" or "INACTIVE."
5053
5054 CpuBind
5055 If a job step request does not specify an option to control how
5056 tasks are bound to allocated CPUs (--cpu-bind), and not all
5057 nodes allocated to the job have the same CpuBind option, then
5058 the partition's CpuBind option will control how tasks are bound
5059 to allocated resources. Supported values for CpuBind are
5060 "none", "socket", "ldom" (NUMA), "core" and "thread".
5061
5062 Default
5063 If this keyword is set, jobs submitted without a partition spec‐
5064 ification will utilize this partition. Possible values are
5065 "YES" and "NO". The default value is "NO".
5066
5067 DefaultTime
5068 Run time limit used for jobs that don't specify a value. If not
5069 set then MaxTime will be used. Format is the same as for Max‐
5070 Time.
5071
5072 DefCpuPerGPU
5073 Default count of CPUs allocated per allocated GPU. This value is
5074 used only if the job didn't specify --cpus-per-task and
5075 --cpus-per-gpu.
5076
5077 DefMemPerCPU
5078 Default real memory size available per allocated CPU in
5079 megabytes. Used to avoid over-subscribing memory and causing
5080 paging. DefMemPerCPU would generally be used if individual pro‐
5081 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
5082 lectType=select/cons_tres). If not set, the DefMemPerCPU value
5083 for the entire cluster will be used. Also see DefMemPerGPU,
5084 DefMemPerNode and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
5085 DefMemPerNode are mutually exclusive.
5086
5087 DefMemPerGPU
5088 Default real memory size available per allocated GPU in
5089 megabytes. Also see DefMemPerCPU, DefMemPerNode and MaxMemPer‐
5090 CPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
5091 exclusive.
5092
5093 DefMemPerNode
5094 Default real memory size available per allocated node in
5095 megabytes. Used to avoid over-subscribing memory and causing
5096 paging. DefMemPerNode would generally be used if whole nodes
5097 are allocated to jobs (SelectType=select/linear) and resources
5098 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5099 If not set, the DefMemPerNode value for the entire cluster will
5100 be used. Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
5101 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
5102 sive.
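
              As a sketch (partition and node names are hypothetical), the
              per-GPU defaults described above could be applied to a
              partition as:
              PartitionName=gpu Nodes=tux[0-15] DefCpuPerGPU=4 DefMemPerGPU=16384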
5103
5104 DenyAccounts
5105 Comma-separated list of accounts which may not execute jobs in
5106 the partition. By default, no accounts are denied access. NOTE:
5107 If AllowAccounts is used then DenyAccounts will not be enforced.
5108 Also refer to AllowAccounts.
5109
5110 DenyQos
5111 Comma-separated list of Qos which may not execute jobs in the
5112 partition. By default, no QOS are denied access. NOTE: If Al‐
5113 lowQos is used then DenyQos will not be enforced. Also refer to
5114 AllowQos.
5115
5116 DisableRootJobs
5117 If set to "YES" then user root will be prevented from running
5118 any jobs on this partition. The default value will be the value
5119 of DisableRootJobs set outside of a partition specification
5120 (which is "NO", allowing user root to execute jobs).
5121
5122 ExclusiveUser
5123 If set to "YES" then nodes will be exclusively allocated to
5124 users. Multiple jobs may be run for the same user, but only one
5125 user can be active at a time. This capability is also available
5126 on a per-job basis by using the --exclusive=user option.
5127
5128 GraceTime
5129 Specifies, in units of seconds, the preemption grace time to be
5130 extended to a job which has been selected for preemption. The
5131 default value is zero, no preemption grace time is allowed on
5132 this partition. Once a job has been selected for preemption,
5133 its end time is set to the current time plus GraceTime. The
5134 job's tasks are immediately sent SIGCONT and SIGTERM signals in
5135 order to provide notification of its imminent termination. This
5136 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
5137 upon reaching its new end time. This second set of signals is
5138 sent to both the tasks and the containing batch script, if ap‐
5139 plicable. See also the global KillWait configuration parameter.
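
              For example, a preemptable partition (hypothetical name)
              giving preempted jobs two minutes of warning could be
              defined as:
              PartitionName=scavenge Nodes=tux[0-31] GraceTime=120 PreemptMode=REQUEUE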
5140
5141 Hidden Specifies if the partition and its jobs are to be hidden by de‐
5142 fault. Hidden partitions will by default not be reported by the
5143 Slurm APIs or commands. Possible values are "YES" and "NO".
5144 The default value is "NO". Note that partitions that a user
5145 lacks access to by virtue of the AllowGroups parameter will also
5146 be hidden by default.
5147
5148 LLN Schedule resources to jobs on the least loaded nodes (based upon
5149 the number of idle CPUs). This is generally only recommended for
5150 an environment with serial jobs as idle resources will tend to
5151 be highly fragmented, resulting in parallel jobs being distrib‐
5152 uted across many nodes. Note that node Weight takes precedence
5153 over how many idle resources are on each node. Also see the Se‐
5154 lectTypeParameters configuration parameter CR_LLN to use the
5155 least loaded nodes in every partition.
5156
5157 MaxCPUsPerNode
5158 Maximum number of CPUs on any node available to all jobs from
5159 this partition. This can be especially useful to schedule GPUs.
5160 For example a node can be associated with two Slurm partitions
5161 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
5162 limited to only a subset of the node's CPUs, ensuring that one
5163 or more CPUs would be available to jobs in the "gpu" parti‐
5164 tion/queue.
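
              Continuing that example with hypothetical names, where each
              node has 32 CPUs and the "cpu" partition leaves four CPUs
              free for the "gpu" partition:
              PartitionName=cpu Nodes=tux[0-15] MaxCPUsPerNode=28
              PartitionName=gpu Nodes=tux[0-15]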
5165
5166 MaxMemPerCPU
5167 Maximum real memory size available per allocated CPU in
5168 megabytes. Used to avoid over-subscribing memory and causing
5169 paging. MaxMemPerCPU would generally be used if individual pro‐
5170 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
5171 lectType=select/cons_tres). If not set, the MaxMemPerCPU value
5172 for the entire cluster will be used. Also see DefMemPerCPU and
5173 MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
5174 clusive.
5175
5176 MaxMemPerNode
5177 Maximum real memory size available per allocated node in
5178 megabytes. Used to avoid over-subscribing memory and causing
5179 paging. MaxMemPerNode would generally be used if whole nodes
5180 are allocated to jobs (SelectType=select/linear) and resources
5181 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5182 If not set, the MaxMemPerNode value for the entire cluster will
5183 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
5184 and MaxMemPerNode are mutually exclusive.
5185
5186 MaxNodes
5187 Maximum count of nodes which may be allocated to any single job.
5188 The default value is "UNLIMITED", which is represented inter‐
5189 nally as -1.
5190
5191 MaxTime
5192 Maximum run time limit for jobs. Format is minutes, min‐
5193 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
5194 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
5195 tion is one minute and second values are rounded up to the next
5196 minute. The job TimeLimit may be updated by root, SlurmUser or
5197 an Operator to a value higher than the configured MaxTime after
5198 job submission.
5199
5200 MinNodes
5201 Minimum count of nodes which may be allocated to any single job.
5202 The default value is 0.
5203
5204 Nodes Comma-separated list of nodes or nodesets which are associated
5205 with this partition. Node names may be specified using the node
5206 range expression syntax described above. A blank list of nodes
5207 (i.e. "Nodes= ") can be used if one wants a partition to exist,
5208 but have no resources (possibly on a temporary basis). A value
5209 of "ALL" is mapped to all nodes configured in the cluster.
5210
5211 OverSubscribe
5212 Controls the ability of the partition to execute more than one
5213 job at a time on each resource (node, socket or core depending
5214 upon the value of SelectTypeParameters). If resources are to be
5215 over-subscribed, avoiding memory over-subscription is very im‐
5216 portant. SelectTypeParameters should be configured to treat
5217 memory as a consumable resource and the --mem option should be
5218 used for job allocations. Sharing of resources is typically
5219 useful only when using gang scheduling (PreemptMode=sus‐
5220 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
5221 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
5222 can negatively impact performance for systems with many thou‐
5223 sands of running jobs. The default value is "NO". For more in‐
5224 formation see the following web pages:
5225 https://slurm.schedmd.com/cons_res.html
5226 https://slurm.schedmd.com/cons_res_share.html
5227 https://slurm.schedmd.com/gang_scheduling.html
5228 https://slurm.schedmd.com/preempt.html
5229
5230 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
5231 Type=select/cons_res or SelectType=select/cons_tres
5232 configured. Jobs that run in partitions with Over‐
5233 Subscribe=EXCLUSIVE will have exclusive access to
5234 all allocated nodes. These jobs are allocated all
5235 CPUs and GRES on the nodes, but they are only allo‐
5236 cated as much memory as they ask for. This is by de‐
5237 sign to support gang scheduling, because suspended
5238 jobs still reside in memory. To request all the mem‐
5239 ory on a node, use --mem=0 at submit time.
5240
5241 FORCE Makes all resources (except GRES) in the partition
5242 available for oversubscription without any means for
5243 users to disable it. May be followed with a colon
5244 and maximum number of jobs in running or suspended
5245 state. For example OverSubscribe=FORCE:4 enables
5246 each node, socket or core to oversubscribe each re‐
5247 source four ways. Recommended only for systems us‐
5248 ing PreemptMode=suspend,gang.
5249
5250 NOTE: OverSubscribe=FORCE:1 is a special case that
5251 is not exactly equivalent to OverSubscribe=NO. Over‐
5252 Subscribe=FORCE:1 disables the regular oversubscrip‐
5253 tion of resources in the same partition but it will
5254 still allow oversubscription due to preemption. Set‐
5255 ting OverSubscribe=NO will prevent oversubscription
5256 from happening due to preemption as well.
5257
5258 NOTE: If using PreemptType=preempt/qos you can spec‐
5259 ify a value for FORCE that is greater than 1. For
5260 example, OverSubscribe=FORCE:2 will permit two jobs
5261 per resource normally, but a third job can be
5262 started only if done so through preemption based
5263 upon QOS.
5264
5265 NOTE: If OverSubscribe is configured to FORCE or YES
5266 in your slurm.conf and the system is not configured
5267 to use preemption (PreemptMode=OFF) accounting can
5268 easily grow to values greater than the actual uti‐
5269 lization. It may be common on such systems to get
5270 error messages in the slurmdbd log stating: "We have
5271 more allocated time than is possible."
5272
5273 YES Makes all resources (except GRES) in the partition
5274 available for sharing upon request by the job. Re‐
5275 sources will only be over-subscribed when explicitly
5276 requested by the user using the "--oversubscribe"
5277 option on job submission. May be followed with a
5278 colon and maximum number of jobs in running or sus‐
5279 pended state. For example "OverSubscribe=YES:4" en‐
5280 ables each node, socket or core to execute up to
5281 four jobs at once. Recommended only for systems
5282 running with gang scheduling (PreemptMode=sus‐
5283 pend,gang).
5284
5285 NO Selected resources are allocated to a single job. No
5286 resource will be allocated to more than one job.
5287
5288 NOTE: Even if you are using PreemptMode=sus‐
5289 pend,gang, setting OverSubscribe=NO will disable
5290 preemption on that partition. Use OverSub‐
5291 scribe=FORCE:1 if you want to disable normal over‐
5292 subscription but still allow suspension due to pre‐
5293 emption.
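
              As a sketch (partition and node names are hypothetical), a
              gang-scheduled time-sharing partition might be configured as
              follows, assuming cluster-wide PreemptMode=SUSPEND,GANG:
              PartitionName=timeshare Nodes=tux[0-31] OverSubscribe=FORCE:4 MaxTime=12:00:00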
5294
5295 OverTimeLimit
5296 Number of minutes by which a job can exceed its time limit be‐
5297 fore being canceled. Normally a job's time limit is treated as
5298 a hard limit and the job will be killed upon reaching that
5299 limit. Configuring OverTimeLimit will result in the job's time
5300 limit being treated like a soft limit. Adding the OverTimeLimit
5301 value to the soft time limit provides a hard time limit, at
5302 which point the job is canceled. This is particularly useful
5303 for backfill scheduling, which is based upon each job's soft time
5304 limit. If not set, the OverTimeLimit value for the entire clus‐
5305 ter will be used. May not exceed 65533 minutes. A value of
5306 "UNLIMITED" is also supported.
5307
5308 PartitionName
5309 Name by which the partition may be referenced (e.g. "Interac‐
5310 tive"). This name can be specified by users when submitting
5311 jobs. If the PartitionName is "DEFAULT", the values specified
5312 with that record will apply to subsequent partition specifica‐
5313 tions unless explicitly set to other values in that partition
5314 record or replaced with a different set of default values. Each
5315 line where PartitionName is "DEFAULT" will replace or add to
5316 previous default values and not reinitialize the default val‐
5317 ues.
5318
5319 PreemptMode
5320 Mechanism used to preempt jobs or enable gang scheduling for
5321 this partition when PreemptType=preempt/partition_prio is con‐
5322 figured. This partition-specific PreemptMode configuration pa‐
5323 rameter will override the cluster-wide PreemptMode for this par‐
5324 tition. It can be set to OFF to disable preemption and gang
5325 scheduling for this partition. See also PriorityTier and the
5326 above description of the cluster-wide PreemptMode parameter for
5327 further details.
5328 The GANG option is used to enable gang scheduling independent of
5329 whether preemption is enabled (i.e. independent of the Preempt‐
5330 Type setting). It can be specified in addition to a PreemptMode
5331 setting with the two options comma separated (e.g. Preempt‐
5332 Mode=SUSPEND,GANG).
5333 See <https://slurm.schedmd.com/preempt.html> and
5334 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
5335 tails.
5336
5337 NOTE: For performance reasons, the backfill scheduler reserves
5338 whole nodes for jobs, not partial nodes. If during backfill
5339 scheduling a job preempts one or more other jobs, the whole
5340 nodes for those preempted jobs are reserved for the preemptor
5341 job, even if the preemptor job requested fewer resources than
5342 that. These reserved nodes aren't available to other jobs dur‐
5343 ing that backfill cycle, even if the other jobs could fit on the
5344 nodes. Therefore, jobs may preempt more resources during a sin‐
5345 gle backfill iteration than they requested.
5346 NOTE: For a heterogeneous job to be considered for preemption all
5347 components must be eligible for preemption. When a heterogeneous
5348 job is to be preempted the first identified component of the job
5349 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
5350 CANCEL (lowest)) will be used to set the PreemptMode for all
5351 components. The GraceTime and user warning signal for each com‐
5352 ponent of the heterogeneous job remain unique. Heterogeneous
5353 jobs are excluded from GANG scheduling operations.
5354
5355 OFF Is the default value and disables job preemption and
5356 gang scheduling. It is only compatible with Pre‐
5357 emptType=preempt/none at a global level. A common
5358 use case for this parameter is to set it on a parti‐
5359 tion to disable preemption for that partition.
5360
5361 CANCEL The preempted job will be cancelled.
5362
5363 GANG Enables gang scheduling (time slicing) of jobs in
5364 the same partition, and allows the resuming of sus‐
5365 pended jobs.
5366
5367 NOTE: Gang scheduling is performed independently for
5368 each partition, so if you only want time-slicing by
5369 OverSubscribe, without any preemption, then config‐
5370 uring partitions with overlapping nodes is not rec‐
5371 ommended. On the other hand, if you want to use
5372 PreemptType=preempt/partition_prio to allow jobs
5373 from higher PriorityTier partitions to Suspend jobs
5374 from lower PriorityTier partitions you will need
5375 overlapping partitions, and PreemptMode=SUSPEND,GANG
5376 to use the Gang scheduler to resume the suspended
5377 jobs(s). In any case, time-slicing won't happen be‐
5378 tween jobs on different partitions.
5379 NOTE: Heterogeneous jobs are excluded from GANG
5380 scheduling operations.
5381
5382 REQUEUE Preempts jobs by requeuing them (if possible) or
5383 canceling them. For jobs to be requeued they must
5384 have the --requeue sbatch option set or the cluster
5385 wide JobRequeue parameter in slurm.conf must be set
5386 to 1.
5387
5388 SUSPEND The preempted jobs will be suspended, and later the
5389 Gang scheduler will resume them. Therefore the SUS‐
5390 PEND preemption mode always needs the GANG option to
5391 be specified at the cluster level. Also, because the
5392 suspended jobs will still use memory on the allo‐
5393 cated nodes, Slurm needs to be able to track memory
5394 resources to be able to suspend jobs.
5395
5396 If the preemptees and preemptor are on different
5397 partitions then the preempted jobs will remain sus‐
5398 pended until the preemptor ends.
5399 NOTE: Because gang scheduling is performed indepen‐
5400 dently for each partition, if using PreemptType=pre‐
5401 empt/partition_prio then jobs in higher PriorityTier
5402 partitions will suspend jobs in lower PriorityTier
5403 partitions to run on the released resources. Only
5404 when the preemptor job ends will the suspended jobs
5405 be resumed by the Gang scheduler.
5406 NOTE: Suspended jobs will not release GRES. Higher
5407 priority jobs will not be able to preempt to gain
5408 access to GRES.
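
              For illustration (all names hypothetical), overlapping
              partitions using partition-priority preemption could be
              configured as:
              # cluster-wide: PreemptType=preempt/partition_prio PreemptMode=SUSPEND,GANG
              PartitionName=low Nodes=tux[0-31] PriorityTier=1 MaxTime=UNLIMITED
              PartitionName=high Nodes=tux[0-31] PriorityTier=10 MaxTime=4:00:00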
5409
5410 PriorityJobFactor
5411 Partition factor used by priority/multifactor plugin in calcu‐
5412 lating job priority. The value may not exceed 65533. Also see
5413 PriorityTier.
5414
5415 PriorityTier
5416 Jobs submitted to a partition with a higher PriorityTier value
5417 will be evaluated by the scheduler before pending jobs in a par‐
5418 tition with a lower PriorityTier value. They will also be con‐
5419 sidered for preemption of running jobs in partition(s) with
5420 lower PriorityTier values if PreemptType=preempt/partition_prio.
5421 The value may not exceed 65533. Also see PriorityJobFactor.
5422
5423 QOS Used to extend the limits available to a QOS on a partition.
5424 Jobs will not be associated to this QOS outside of being associ‐
5425 ated to the partition. They will still be associated to their
5426 requested QOS. By default, no QOS is used. NOTE: If a limit is
5427 set in both the Partition's QOS and the Job's QOS the Partition
5428 QOS will be honored unless the Job's QOS has the OverPartQOS
5429 flag set, in which case the Job's QOS will take precedence.
5430
5431 ReqResv
5432 Specifies that users of this partition are required to designate a
5433 reservation when submitting a job. This option can be useful in
5434 restricting usage of a partition that may have higher priority
5435 or additional resources to be allowed only within a reservation.
5436 Possible values are "YES" and "NO". The default value is "NO".
5437
5438 ResumeTimeout
5439 Maximum time permitted (in seconds) between when a node resume
5440 request is issued and when the node is actually available for
5441 use. Nodes which fail to respond in this time frame will be
5442 marked DOWN and the jobs scheduled on the node requeued. Nodes
5443 which reboot after this time frame will be marked DOWN with a
5444 reason of "Node unexpectedly rebooted." For nodes that are in
5445 multiple partitions with this option set, the highest time will
5446 take effect. If not set on any partition, the node will use the
5447 ResumeTimeout value set for the entire cluster.
5448
5449 RootOnly
5450 Specifies if only user ID zero (i.e. user root) may allocate re‐
5451 sources in this partition. User root may allocate resources for
5452 any other user, but the request must be initiated by user root.
5453 This option can be useful for a partition to be managed by some
5454 external entity (e.g. a higher-level job manager) and prevents
5455 users from directly using those resources. Possible values are
5456 "YES" and "NO". The default value is "NO".
5457
5458 SelectTypeParameters
5459 Partition-specific resource allocation type. This option re‐
5460 places the global SelectTypeParameters value. Supported values
5461 are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
5462 Use requires the system-wide SelectTypeParameters value be set
5463 to any of the four supported values previously listed; other‐
5464 wise, the partition-specific value will be ignored.
5465
5466 Shared The Shared configuration parameter has been replaced by the
5467 OverSubscribe parameter described above.
5468
5469 State State of partition or availability for use. Possible values are
5470 "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
5471 See also the related "Alternate" keyword.
5472
5473 UP Designates that new jobs may be queued on the parti‐
5474 tion, and that jobs may be allocated nodes and run
5475 from the partition.
5476
5477 DOWN Designates that new jobs may be queued on the parti‐
5478 tion, but queued jobs may not be allocated nodes and
5479 run from the partition. Jobs already running on the
5480 partition continue to run. The jobs must be explicitly
5481 canceled to force their termination.
5482
5483 DRAIN Designates that no new jobs may be queued on the par‐
5484 tition (job submission requests will be denied with an
5485 error message), but jobs already queued on the parti‐
5486 tion may be allocated nodes and run. See also the
5487 "Alternate" partition specification.
5488
5489 INACTIVE Designates that no new jobs may be queued on the par‐
5490 tition, and jobs already queued may not be allocated
5491 nodes and run. See also the "Alternate" partition
5492 specification.
5493
5494 SuspendTime
5495 Nodes which remain idle or down for this number of seconds will
5496 be placed into power save mode by SuspendProgram. For nodes
5497 that are in multiple partitions with this option set, the high‐
5498 est time will take effect. If not set on any partition, the node
5499 will use the SuspendTime value set for the entire cluster. Set‐
5500 ting SuspendTime to anything but "INFINITE" will enable power
5501 save mode.
5502
5503 SuspendTimeout
5504 Maximum time permitted (in seconds) between when a node suspend
5505 request is issued and when the node is shutdown. At that time
5506 the node must be ready for a resume request to be issued as
5507 needed for new work. For nodes that are in multiple partitions
5508 with this option set, the highest time will take effect. If not
5509 set on any partition, the node will use the SuspendTimeout value
5510 set for the entire cluster.
5511
5512 TRESBillingWeights
5513 TRESBillingWeights is used to define the billing weights of each
5514 TRES type that will be used in calculating the usage of a job.
5515 The calculated usage is used when calculating fairshare and when
5516 enforcing the TRES billing limit on jobs.
5517
5518 Billing weights are specified as a comma-separated list of <TRES
5519 Type>=<TRES Billing Weight> pairs.
5520
5521 Any TRES Type is available for billing. Note that the base unit
5522 for memory and burst buffers is megabytes.
5523
5524 By default the billing of TRES is calculated as the sum of all
5525 TRES types multiplied by their corresponding billing weight.
5526
5527 The weighted amount of a resource can be adjusted by adding a
5528 suffix of K,M,G,T or P after the billing weight. For example, a
5529 memory weight of "mem=.25" on a job allocated 8GB will be billed
5530 2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
5531 same job will be billed 2 (8192MB * (.25/1024)) units.
5532
5533 Negative values are allowed.
5534
5535 When a job is allocated 1 CPU and 8 GB of memory on a partition
5536 configured with TRESBilling‐
5537 Weights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the billable TRES will
5538 be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
5539
5540 If PriorityFlags=MAX_TRES is configured, the billable TRES is
5541 calculated as the MAX of individual TRES' on a node (e.g. cpus,
5542 mem, gres) plus the sum of all global TRES' (e.g. licenses). Us‐
5543 ing the same example above the billable TRES will be MAX(1*1.0,
5544 8*0.25) + (0*2.0) = 2.0.
5545
5546 If TRESBillingWeights is not defined then the job is billed
5547 against the total number of allocated CPUs.
5548
5549 NOTE: TRESBillingWeights doesn't affect job priority directly as
5550 it is currently not used for the size of the job. If you want
5551 TRES' to play a role in the job's priority then refer to the
5552 PriorityWeightTRES option.
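
              Applied to a partition definition (partition and node names
              are hypothetical):
              PartitionName=batch Nodes=tux[0-31] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"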
5553
5555 There are a variety of prolog and epilog program options that execute
5556 with various permissions and at various times. The four options most
5557 likely to be used are: Prolog and Epilog (executed once on each compute
5558 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
5559 once on the ControlMachine for each job).
5560
5561 NOTE: Standard output and error messages are normally not preserved.
5562 Explicitly write output and error messages to an appropriate location
5563 if you wish to preserve that information.
5564
5565 NOTE: By default the Prolog script is ONLY run on any individual node
5566 when it first sees a job step from a new allocation. It does not run
5567 the Prolog immediately when an allocation is granted. If no job steps
5568 from an allocation are run on a node, it will never run the Prolog for
5569 that allocation. This Prolog behaviour can be changed by the Pro‐
5570 logFlags parameter. The Epilog, on the other hand, always runs on ev‐
5571 ery node of an allocation when the allocation is released.
5572
5573 If the Epilog fails (returns a non-zero exit code), this will result in
5574 the node being set to a DRAIN state. If the EpilogSlurmctld fails (re‐
5575 turns a non-zero exit code), this will only be logged. If the Prolog
5576 fails (returns a non-zero exit code), this will result in the node be‐
5577 ing set to a DRAIN state and the job being requeued in a held state un‐
5578 less nohold_on_prolog_fail is configured in SchedulerParameters. If
5579 the PrologSlurmctld fails (returns a non-zero exit code), this will re‐
5580 sult in the job being requeued to be executed on another node if possi‐
5581 ble. Only batch jobs can be requeued. Interactive jobs (salloc and
5582 srun) will be cancelled if the PrologSlurmctld fails. If slurmctld is
5583 stopped while either PrologSlurmctld or EpilogSlurmctld is running, the
5584 script will be killed with SIGKILL. The script will restart when slurm‐
5585 ctld restarts.
5586
5587
5588 Information about the job is passed to the script using environment
5589 variables. Unless otherwise specified, these environment variables are
5590 available in each of the scripts mentioned above (Prolog, Epilog, Pro‐
5591 logSlurmctld and EpilogSlurmctld). For a full list of environment vari‐
5592 ables that includes those available in the SrunProlog, SrunEpilog,
5593 TaskProlog and TaskEpilog please see the Prolog and Epilog Guide
5594 <https://slurm.schedmd.com/prolog_epilog.html>.
5595
5596
5597 SLURM_ARRAY_JOB_ID
5598 If this job is part of a job array, this will be set to the job
5599 ID. Otherwise it will not be set. To reference this specific
5600 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5601 RAY_TASK_ID (e.g. "scontrol update ${SLURM_AR‐
5602 RAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5603 logSlurmctld and EpilogSlurmctld.
5604
5605 SLURM_ARRAY_TASK_ID
5606 If this job is part of a job array, this will be set to the task
5607 ID. Otherwise it will not be set. To reference this specific
5608 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5609 RAY_TASK_ID (e.g. "scontrol update ${SLURM_AR‐
5610 RAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5611 logSlurmctld and EpilogSlurmctld.
5612
5613 SLURM_ARRAY_TASK_MAX
5614 If this job is part of a job array, this will be set to the max‐
5615 imum task ID. Otherwise it will not be set. Available in Pro‐
5616 logSlurmctld and EpilogSlurmctld.
5617
5618 SLURM_ARRAY_TASK_MIN
5619 If this job is part of a job array, this will be set to the min‐
5620 imum task ID. Otherwise it will not be set. Available in Pro‐
5621 logSlurmctld and EpilogSlurmctld.
5622
5623 SLURM_ARRAY_TASK_STEP
5624 If this job is part of a job array, this will be set to the step
5625 size of task IDs. Otherwise it will not be set. Available in
5626 PrologSlurmctld and EpilogSlurmctld.
5627
5628 SLURM_CLUSTER_NAME
5629 Name of the cluster executing the job.
5630
5631 SLURM_CONF
5632 Location of the slurm.conf file. Available in Prolog and Epilog.
5633
5634 SLURMD_NODENAME
5635 Name of the node running the task. In the case of a parallel job
5636 executing on multiple compute nodes, the various tasks will have
5637 this environment variable set to different values on each com‐
5638 pute node. Available in Prolog and Epilog.
5639
5640 SLURM_JOB_ACCOUNT
5641 Account name used for the job.
5642
5643 SLURM_JOB_COMMENT
5644 Comment added to the job. Available in Prolog, PrologSlurmctld,
5645 Epilog and EpilogSlurmctld.
5646
5647 SLURM_JOB_CONSTRAINTS
5648 Features required to run the job. Available in Prolog, Pro‐
5649 logSlurmctld, Epilog and EpilogSlurmctld.
5650
5651 SLURM_JOB_DERIVED_EC
5652 The highest exit code of all of the job steps. Available in
5653 Epilog and EpilogSlurmctld.
5654
5655 SLURM_JOB_EXIT_CODE
5656 The exit code of the job script (or salloc). The value is the
5657 status as returned by the wait() system call (see wait(2)).
5658 Available in Epilog and EpilogSlurmctld.
5659
5660 SLURM_JOB_EXIT_CODE2
5661 The exit code of the job script (or salloc). The value has the
5662 format <exit>:<sig>. The first number is the exit code, typi‐
5663 cally as set by the exit() function. The second number is the
5664 signal that caused the process to terminate if it was terminated
5665 by a signal. Available in Epilog and EpilogSlurmctld.
5666
5667 SLURM_JOB_GID
5668 Group ID of the job's owner.
5669
5670 SLURM_JOB_GPUS
5671 The GPU IDs of GPUs in the job allocation (if any). Available
5672 in the Prolog and Epilog.
5673
5674 SLURM_JOB_GROUP
5675 Group name of the job's owner. Available in PrologSlurmctld and
5676 EpilogSlurmctld.
5677
5678 SLURM_JOB_ID
5679 Job ID.
5680
5681 SLURM_JOBID
5682 Job ID.
5683
5684 SLURM_JOB_NAME
5685 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5686 ctld.
5687
5688 SLURM_JOB_NODELIST
5689 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5690 show hostnames" can be used to convert this to a list of indi‐
5691 vidual host names. Available in PrologSlurmctld and Epi‐
5692 logSlurmctld.
5693
5694 SLURM_JOB_PARTITION
5695 Partition that job runs in. Available in Prolog, PrologSlurm‐
5696 ctld, Epilog and EpilogSlurmctld.
5697
5698 SLURM_JOB_UID
5699 User ID of the job's owner.
5700
5701 SLURM_JOB_USER
5702 User name of the job's owner.
5703
5704 SLURM_SCRIPT_CONTEXT
5705 Identifies which epilog or prolog program is currently running.
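
       As a minimal sketch only (the log file path is an assumption, and note
       that a non-zero exit code would drain the node), an Epilog script
       might record per-job information using these variables:

       #!/bin/bash
       # Hypothetical Epilog sketch: append one record per completed job.
       LOG=/var/log/slurm/epilog_jobs.log
       printf '%s node=%s job=%s user=%s partition=%s exit=%s\n' \
           "$(date '+%F %T')" "$SLURMD_NODENAME" "$SLURM_JOB_ID" \
           "$SLURM_JOB_USER" "$SLURM_JOB_PARTITION" "$SLURM_JOB_EXIT_CODE2" >> "$LOG"
       exit 0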
5706
5708 The UnkillableStepProgram, if configured, can be used to take special
5709 actions to clean up unkillable processes and/or notify system adminis‐
5710 trators. The program will be run as SlurmdUser (usually "root") on the
5711 compute node where UnkillableStepTimeout was triggered.
5712
5713 Information about the unkillable job step is passed to the script using
5714 environment variables.
5715
5716
5717 SLURM_JOB_ID
5718 Job ID.
5719
5720 SLURM_STEP_ID
5721 Job Step ID.
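
       A minimal sketch of such a program (sending mail to root is a
       site-specific assumption; any notification mechanism could be used):

       #!/bin/bash
       # Hypothetical UnkillableStepProgram sketch: log the stuck step and
       # notify the administrators.
       MSG="Unkillable step ${SLURM_JOB_ID}.${SLURM_STEP_ID} on $(hostname)"
       logger -t slurm-unkillable "$MSG"
       echo "$MSG" | mail -s "Slurm unkillable step" root
       exit 0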
5722
5724 Slurm is able to optimize job allocations to minimize network con‐
5725 tention. Special Slurm logic is used to optimize allocations on sys‐
5726 tems with a three-dimensional interconnect, and information about con‐
5727 figuring those systems is available here:
5728 <https://slurm.schedmd.com/>. For a hierarchical network, Slurm needs
5729 to have detailed information about how nodes are configured on the net‐
5730 work switches.
5731
5732 Given network topology information, Slurm allocates all of a job's re‐
5733 sources onto a single leaf of the network (if possible) using a
5734 best-fit algorithm. Otherwise it will allocate a job's resources onto
5735 multiple leaf switches so as to minimize the use of higher-level
5736 switches. The TopologyPlugin parameter controls which plugin is used
5737 to collect network topology information. The only values presently
5738 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5739 forms best-fit logic over three-dimensional topology), "topology/none"
5740 (default for other systems, best-fit logic over one-dimensional topol‐
5741 ogy), "topology/tree" (determine the network topology based upon infor‐
5742 mation contained in a topology.conf file, see "man topology.conf" for
5743 more information). Future plugins may gather topology information di‐
5744 rectly from the network. The topology information is optional. If not
5745 provided, Slurm will perform a best-fit algorithm assuming the nodes
5746 are in a one-dimensional array as configured and the communications
5747 cost is related to the node distance in this array.
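
       As a sketch for a two-level tree network (switch and node names are
       hypothetical), the corresponding configuration would be:

       # slurm.conf
       TopologyPlugin=topology/tree
       # topology.conf (see topology.conf(5))
       SwitchName=leaf0 Nodes=tux[0-15]
       SwitchName=leaf1 Nodes=tux[16-31]
       SwitchName=spine Switches=leaf[0-1]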
5748
5749
5751 If the cluster's computers used for the primary or backup controller
5752 will be out of service for an extended period of time, it may be desir‐
5753 able to relocate them. In order to do so, follow this procedure:
5754
5755 1. Stop the Slurm daemons
5756 2. Modify the slurm.conf file appropriately
5757 3. Distribute the updated slurm.conf file to all nodes
5758 4. Restart the Slurm daemons
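
       A hedged sketch of these steps, assuming systemd service names and the
       pdsh/pdcp utilities (both assumptions; adapt to local tooling):

       systemctl stop slurmctld                  # on the old controller
       pdsh -a systemctl stop slurmd             # on all compute nodes
       $EDITOR /etc/slurm.conf                   # update SlurmctldHost entries
       pdcp -a /etc/slurm.conf /etc/slurm.conf   # distribute the new file
       systemctl start slurmctld                 # on the new controller
       pdsh -a systemctl start slurmd            # on all compute nodes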
5759
5760 There should be no loss of any running or pending jobs. Ensure that
5761 any nodes added to the cluster have the current slurm.conf file in‐
5762 stalled.
5763
5764 CAUTION: If two nodes are simultaneously configured as the primary con‐
5765 troller (two nodes on which SlurmctldHost specifies the local host and
5766 the slurmctld daemon is executing on each), system behavior will be de‐
5767 structive. If a compute node has an incorrect SlurmctldHost parameter,
5768 that node may be rendered unusable, but no other harm will result.
5769
5770
5772 #
5773 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5774 # Author: John Doe
5775 # Date: 11/06/2001
5776 #
5777 SlurmctldHost=dev0(12.34.56.78) # Primary server
5778 SlurmctldHost=dev1(12.34.56.79) # Backup server
5779 #
5780 AuthType=auth/munge
5781 Epilog=/usr/local/slurm/epilog
5782 Prolog=/usr/local/slurm/prolog
5783 FirstJobId=65536
5784 InactiveLimit=120
5785 JobCompType=jobcomp/filetxt
5786 JobCompLoc=/var/log/slurm/jobcomp
5787 KillWait=30
5788 MaxJobCount=10000
5789 MinJobAge=3600
5790 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5791 ReturnToService=0
5792 SchedulerType=sched/backfill
5793 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5794 SlurmdLogFile=/var/log/slurm/slurmd.log
5795 SlurmctldPort=7002
5796 SlurmdPort=7003
5797 SlurmdSpoolDir=/var/spool/slurmd.spool
5798 StateSaveLocation=/var/spool/slurm.state
5799 SwitchType=switch/none
5800 TmpFS=/tmp
5801 WaitTime=30
5802 #
5803 # Node Configurations
5804 #
5805 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5806 NodeName=DEFAULT State=UNKNOWN
5807 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5808 # Update records for specific DOWN nodes
5809 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5810 #
5811 # Partition Configurations
5812 #
5813 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5814 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5815 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5816 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5817
5818
5820 The "include" key word can be used with modifiers within the specified
5821 pathname. These modifiers would be replaced with cluster name or other
5822 information depending on which modifier is specified. If the included
5823 file is not an absolute path name (i.e. it does not start with a
5824 slash), it will be searched for in the same directory as the slurm.conf
5825 file.
5826
5827
5828 %c Cluster name specified in the slurm.conf will be used.
5829
5830 EXAMPLE
5831 ClusterName=linux
5832 include /home/slurm/etc/%c_config
5833 # Above line interpreted as
5834 # "include /home/slurm/etc/linux_config"
5835
5836
5838 There are three classes of files: Files used by slurmctld must be ac‐
5839 cessible by user SlurmUser and accessible by the primary and backup
5840 control machines. Files used by slurmd must be accessible by user root
5841 and accessible from every compute node. A few files need to be acces‐
5842 sible by normal users on all login and compute nodes. While many files
5843 and directories are listed below, most of them will not be used with
5844 most configurations.
5845
5846
5847 Epilog Must be executable by user root. It is recommended that the
5848 file be readable by all users. The file must exist on every
5849 compute node.
5850
5851 EpilogSlurmctld
5852 Must be executable by user SlurmUser. It is recommended that
5853 the file be readable by all users. The file must be accessible
5854 by the primary and backup control machines.
5855
5856 HealthCheckProgram
5857 Must be executable by user root. It is recommended that the
5858 file be readable by all users. The file must exist on every
5859 compute node.
5860
5861 JobCompLoc
5862 If this specifies a file, it must be writable by user SlurmUser.
5863 The file must be accessible by the primary and backup control
5864 machines.
5865
5866 MailProg
5867 Must be executable by user SlurmUser. Must not be writable by
5868 regular users. The file must be accessible by the primary and
5869 backup control machines.
5870
5871 Prolog Must be executable by user root. It is recommended that the
5872 file be readable by all users. The file must exist on every
5873 compute node.
5874
5875 PrologSlurmctld
5876 Must be executable by user SlurmUser. It is recommended that
5877 the file be readable by all users. The file must be accessible
5878 by the primary and backup control machines.
5879
5880 ResumeProgram
5881 Must be executable by user SlurmUser. The file must be accessi‐
5882 ble by the primary and backup control machines.
5883
5884 slurm.conf
5885 Readable to all users on all nodes. Must not be writable by
5886 regular users.
5887
5888 SlurmctldLogFile
5889 Must be writable by user SlurmUser. The file must be accessible
5890 by the primary and backup control machines.
5891
5892 SlurmctldPidFile
5893 Must be writable by user root. Preferably writable and remov‐
5894 able by SlurmUser. The file must be accessible by the primary
5895 and backup control machines.
5896
5897 SlurmdLogFile
5898 Must be writable by user root. A distinct file must exist on
5899 each compute node.
5900
5901 SlurmdPidFile
5902 Must be writable by user root. A distinct file must exist on
5903 each compute node.
5904
5905 SlurmdSpoolDir
5906 Must be writable by user root. A distinct directory must exist on
5907 each compute node.
5908
5909 SrunEpilog
5910 Must be executable by all users. The file must exist on every
5911 login and compute node.
5912
5913 SrunProlog
5914 Must be executable by all users. The file must exist on every
5915 login and compute node.
5916
5917 StateSaveLocation
5918 Must be writable by user SlurmUser. The file must be accessible
5919 by the primary and backup control machines.
5920
5921 SuspendProgram
5922 Must be executable by user SlurmUser. The file must be accessi‐
5923 ble by the primary and backup control machines.
5924
5925 TaskEpilog
5926 Must be executable by all users. The file must exist on every
5927 compute node.
5928
5929 TaskProlog
5930 Must be executable by all users. The file must exist on every
5931 compute node.
5932
5933 UnkillableStepProgram
5934 Must be executable by user SlurmUser. The file must be accessi‐
5935 ble by the primary and backup control machines.
5936
5938 Note that while Slurm daemons create log files and other files as
5939 needed, they treat the lack of parent directories as a fatal error.
5940 This prevents the daemons from running if critical file systems are not
5941 mounted and will minimize the risk of cold-starting (starting without
5942 preserving jobs).
5943
5944 Log files and job accounting files may need to be created/owned by the
5945 "SlurmUser" uid to be successfully accessed. Use the "chown" and
5946 "chmod" commands to set the ownership and permissions appropriately.
5947 See the section FILE AND DIRECTORY PERMISSIONS for information about
5948 the various files and directories used by Slurm.
5949
5950 It is recommended that the logrotate utility be used to ensure that
5951 various log files do not become too large. This also applies to text
5952 files used for accounting, process tracking, and the slurmdbd log if
5953 they are used.
5954
5955 Here is a sample logrotate configuration. Make appropriate site modifi‐
5956 cations and save as /etc/logrotate.d/slurm on all nodes. See the
5957 logrotate man page for more details.
5958
5959 ##
5960 # Slurm Logrotate Configuration
5961 ##
5962 /var/log/slurm/*.log {
5963 compress
5964 missingok
5965 nocopytruncate
5966 nodelaycompress
5967 nomail
5968 notifempty
5969 noolddir
5970 rotate 5
5971 sharedscripts
5972 size=5M
5973 create 640 slurm root
5974 postrotate
5975 pkill -x --signal SIGUSR2 slurmctld
5976 pkill -x --signal SIGUSR2 slurmd
5977 pkill -x --signal SIGUSR2 slurmdbd
5978 exit 0
5979 endscript
5980 }
5981
5982
5984 Copyright (C) 2002-2007 The Regents of the University of California.
5985 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
5986 Copyright (C) 2008-2010 Lawrence Livermore National Security.
5987 Copyright (C) 2010-2022 SchedMD LLC.
5988
5989 This file is part of Slurm, a resource management program. For de‐
5990 tails, see <https://slurm.schedmd.com/>.
5991
5992 Slurm is free software; you can redistribute it and/or modify it under
5993 the terms of the GNU General Public License as published by the Free
5994 Software Foundation; either version 2 of the License, or (at your op‐
5995 tion) any later version.
5996
5997 Slurm is distributed in the hope that it will be useful, but WITHOUT
5998 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
5999 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
6000 for more details.
6001
6002
6004 /etc/slurm.conf
6005
6006
6008 cgroup.conf(5), getaddrinfo(3), getrlimit(2), gres.conf(5), group(5),
6009 hostname(1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8), slur‐
6010 mdbd.conf(5), srun(1), spank(8), syslog(3), topology.conf(5)
6011
6012
6013
6014October 2022 Slurm Configuration File slurm.conf(5)