slurm.conf(5)              Slurm Configuration File              slurm.conf(5)

NAME
       slurm.conf - Slurm configuration file

       The file location can be modified at system build time using the
       DEFAULT_SLURM_CONF parameter or at execution time by setting the
       SLURM_CONF environment variable. The Slurm daemons also allow you
       to override both the built-in and environment-provided location
       using the "-f" option on the command line.

       The contents of the file are case insensitive except for the names
       of nodes and partitions. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of
       the command "scontrol reconfigure" unless otherwise noted.

       If a line begins with the word "Include" followed by whitespace and
       then a file name, that file will be included inline with the
       current configuration file. For large or complex systems, multiple
       configuration files may prove easier to manage and enable reuse of
       some files (see INCLUDE MODIFIERS for more details).

       Note on file permissions:

       The slurm.conf file must be readable by all users of Slurm, since
       it is used by many of the Slurm commands. Other files that are
       defined in the slurm.conf file, such as log files and job
       accounting files, may need to be created/owned by the user
       "SlurmUser" to be successfully accessed. Use the "chown" and
       "chmod" commands to set the ownership and permissions
       appropriately. See the section FILE AND DIRECTORY PERMISSIONS for
       information about the various files and directories used by Slurm.

PARAMETERS
       The overall configuration parameters available include:

       AccountingStorageBackupHost
              The name of the backup machine hosting the accounting
              storage database. If used with the
              accounting_storage/slurmdbd plugin, this is where the backup
              slurmdbd would be running. Only used with systems using
              SlurmDBD, ignored otherwise.

       AccountingStorageEnforce
              This controls what level of association-based enforcement to
              impose on job submissions. Valid options are any combination
              of associations, limits, nojobs, nosteps, qos, safe, and
              wckeys, or all for all things (except nojobs and nosteps,
              which must be requested as well).

              If limits, qos, or wckeys are set, associations will
              automatically be set.

              If wckeys is set, TrackWCKey will automatically be set.

              If safe is set, limits and associations will automatically
              be set.

              If nojobs is set, nosteps will automatically be set.

              By setting associations, no new job is allowed to run unless
              a corresponding association exists in the system. If limits
              are enforced, users can be limited by association to
              whatever job size or run time limits are defined.

              If nojobs is set, Slurm will not account for any jobs or
              steps on the system. Likewise, if nosteps is set, Slurm will
              not account for any steps that have run.

              If safe is enforced, a job will only be launched against an
              association or qos that has a GrpTRESMins limit set if the
              job will be able to run to completion. Without this option
              set, jobs will be launched as long as their usage hasn't
              reached the cpu-minutes limit. This can lead to jobs being
              launched but then killed when the limit is reached.

              With qos and/or wckeys enforced, jobs will not be scheduled
              unless a valid qos and/or workload characterization key is
              specified.

              When AccountingStorageEnforce is changed, a restart of the
              slurmctld daemon is required (not just a "scontrol
              reconfig").

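              For example, a site that wants associations, limits and QOS
              enforced, with jobs started only when they can run to
              completion, might set:

                     AccountingStorageEnforce=associations,limits,qos,safe

              Since safe implies limits and associations, the shorter form
              "AccountingStorageEnforce=qos,safe" is equivalent.
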
       AccountingStorageExternalHost
              A comma-separated list of external slurmdbds
              (<host/ip>[:port][,...]) to register with. If no port is
              given, the AccountingStoragePort will be used.

              This allows clusters registered with the external slurmdbd
              to communicate with each other using the --cluster/-M client
              command options.

              The cluster will add itself to the external slurmdbd if it
              doesn't exist. If a non-external cluster already exists on
              the external slurmdbd, the slurmctld will ignore registering
              to the external slurmdbd.

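              For example, to register with two external slurmdbds, one of
              them on a non-default port (hostnames are illustrative):

                     AccountingStorageExternalHost=dbd1.example.com,dbd2.example.com:7031
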
       AccountingStorageHost
              The name of the machine hosting the accounting storage
              database. Only used with systems using SlurmDBD, ignored
              otherwise.

       AccountingStorageParameters
              Comma-separated list of key-value pair parameters. Currently
              supported values include options to establish a secure
              connection to the database:

              SSL_CERT
                     The path name of the client public key certificate
                     file.

              SSL_CA The path name of the Certificate Authority (CA)
                     certificate file.

              SSL_CAPATH
                     The path name of the directory that contains trusted
                     SSL CA certificate files.

              SSL_KEY
                     The path name of the client private key file.

              SSL_CIPHER
                     The list of permissible ciphers for SSL encryption.

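              For example, a TLS-protected database connection might be
              configured as follows (the paths are illustrative):

                     AccountingStorageParameters=SSL_CERT=/etc/slurm/ssl/client-cert.pem,SSL_KEY=/etc/slurm/ssl/client-key.pem,SSL_CA=/etc/slurm/ssl/ca-cert.pem
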
       AccountingStoragePass
              The password used to gain access to the database to store
              the accounting data. Only used for database type storage
              plugins, ignored otherwise. In the case of Slurm DBD
              (Database Daemon) with MUNGE authentication, this can be
              configured to use a MUNGE daemon specifically configured to
              provide authentication between clusters while the default
              MUNGE daemon provides authentication within a cluster. In
              that case, AccountingStoragePass should specify the named
              port to be used for communications with the alternate MUNGE
              daemon (e.g. "/var/run/munge/global.socket.2"). The default
              value is NULL.

       AccountingStoragePort
              The listening port of the accounting storage database
              server. Only used for database type storage plugins, ignored
              otherwise. The default value is SLURMDBD_PORT as established
              at system build time. If no value is explicitly specified,
              it will be set to 6819. This value must be equal to the
              DbdPort parameter in the slurmdbd.conf file.

       AccountingStorageTRES
              Comma-separated list of resources you wish to track on the
              cluster. These are the resources requested by the
              sbatch/srun job when it is submitted. Currently this
              consists of any GRES, BB (burst buffer) or license along
              with CPU, Memory, Node, Energy, FS/[Disk|Lustre], IC/OFED,
              Pages, and VMem. By default Billing, CPU, Energy, Memory,
              Node, FS/Disk, Pages and VMem are tracked. These default
              TRES cannot be disabled, but only appended to.
              AccountingStorageTRES=gres/craynetwork,license/iop1 will
              track billing, cpu, energy, memory, nodes, fs/disk, pages
              and vmem along with a gres called craynetwork as well as a
              license called iop1. Whenever these resources are used on
              the cluster they are recorded. The TRES are automatically
              set up in the database on the start of the slurmctld.

              If multiple GRES of different types are tracked (e.g. GPUs
              of different types), then job requests with matching type
              specifications will be recorded. Given a configuration of
              "AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta",
              then "gres/gpu:tesla" and "gres/gpu:volta" will track only
              jobs that explicitly request those two GPU types, while
              "gres/gpu" will track allocated GPUs of any type ("tesla",
              "volta" or any other GPU type).

              Given a configuration of
              "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta", then
              "gres/gpu:tesla" and "gres/gpu:volta" will track jobs that
              explicitly request those GPU types. If a job requests GPUs,
              but does not explicitly specify the GPU type, then its
              resource allocation will be accounted for as either
              "gres/gpu:tesla" or "gres/gpu:volta", although the
              accounting may not match the actual GPU type allocated to
              the job and the GPUs allocated to the job could be
              heterogeneous. In an environment containing various GPU
              types, use of a job_submit plugin may be desired in order to
              force jobs to explicitly specify some GPU type.

       AccountingStorageType
              The accounting storage mechanism type. Acceptable values at
              present include "accounting_storage/none" and
              "accounting_storage/slurmdbd". The
              "accounting_storage/slurmdbd" value indicates that
              accounting records will be written to the Slurm DBD, which
              manages an underlying MySQL database. See "man slurmdbd" for
              more information. The default value is
              "accounting_storage/none" and indicates that account records
              are not maintained.

       AccountingStorageUser
              The user account for accessing the accounting storage
              database. Only used for database type storage plugins,
              ignored otherwise.

       AccountingStoreFlags
              Comma-separated list used to tell the slurmctld to store
              extra fields that may be more heavyweight than the normal
              job information.

              Current options are:

              job_comment
                     Include the job's comment field in the job complete
                     message sent to the Accounting Storage database. Note
                     that the AdminComment and SystemComment are always
                     recorded in the database.

              job_env
                     Include a batch job's environment variables used at
                     job submission in the job start message sent to the
                     Accounting Storage database.

              job_script
                     Include the job's batch script in the job start
                     message sent to the Accounting Storage database.

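              For example, to archive batch scripts and job comments along
              with the normal job record:

                     AccountingStoreFlags=job_script,job_comment
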
       AcctGatherNodeFreq
              The AcctGather plugins' sampling interval for node
              accounting. For AcctGather plugin values of none, this
              parameter is ignored. For all other values this parameter is
              the number of seconds between node accounting samples. For
              the acct_gather_energy/rapl plugin, set a value less than
              300 because the counters may overflow beyond this rate. The
              default value is zero, which disables accounting sampling
              for nodes. Note: The accounting sampling interval for jobs
              is determined by the value of JobAcctGatherFrequency.

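              For example, to sample node accounting data every 30 seconds
              (a value safely below the 300-second RAPL overflow limit):

                     AcctGatherNodeFreq=30
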
       AcctGatherEnergyType
              Identifies the plugin to be used for energy consumption
              accounting. The jobacct_gather plugin and slurmd daemon call
              this plugin to collect energy consumption data for jobs and
              nodes. The collection of energy consumption data takes place
              at the node level, hence only in the case of exclusive job
              allocation will the energy consumption measurements reflect
              the job's real consumption. In the case of node sharing
              between jobs, the reported consumed energy per job (through
              sstat or sacct) will not reflect the real energy consumed by
              the jobs.

              Configurable values at present are:

              acct_gather_energy/none
                     No energy consumption data is collected.

              acct_gather_energy/ipmi
                     Energy consumption data is collected from the
                     Baseboard Management Controller (BMC) using the
                     Intelligent Platform Management Interface (IPMI).

              acct_gather_energy/pm_counters
                     Energy consumption data is collected from the
                     Baseboard Management Controller (BMC) for HPE Cray
                     systems.

              acct_gather_energy/rapl
                     Energy consumption data is collected from hardware
                     sensors using the Running Average Power Limit (RAPL)
                     mechanism. Note that enabling RAPL may require the
                     execution of the command "sudo modprobe msr".

              acct_gather_energy/xcc
                     Energy consumption data is collected from the Lenovo
                     SD650 XClarity Controller (XCC) using IPMI OEM raw
                     commands.

       AcctGatherInterconnectType
              Identifies the plugin to be used for interconnect network
              traffic accounting. The jobacct_gather plugin and slurmd
              daemon call this plugin to collect network traffic data for
              jobs and nodes. The collection of network traffic data takes
              place at the node level, hence only in the case of exclusive
              job allocation will the collected values reflect the job's
              real traffic. In the case of node sharing between jobs, the
              reported network traffic per job (through sstat or sacct)
              will not reflect the real network traffic by the jobs.

              Configurable values at present are:

              acct_gather_interconnect/none
                     No InfiniBand network data are collected.

              acct_gather_interconnect/ofed
                     InfiniBand network traffic data are collected from
                     the hardware monitoring counters of InfiniBand
                     devices through the OFED library. In order to account
                     for per-job network traffic, add the "ic/ofed" TRES
                     to AccountingStorageTRES.

       AcctGatherFilesystemType
              Identifies the plugin to be used for filesystem traffic
              accounting. The jobacct_gather plugin and slurmd daemon call
              this plugin to collect filesystem traffic data for jobs and
              nodes. The collection of filesystem traffic data takes place
              at the node level, hence only in the case of exclusive job
              allocation will the collected values reflect the job's real
              traffic. In the case of node sharing between jobs, the
              reported filesystem traffic per job (through sstat or sacct)
              will not reflect the real filesystem traffic by the jobs.

              Configurable values at present are:

              acct_gather_filesystem/none
                     No filesystem data are collected.

              acct_gather_filesystem/lustre
                     Lustre filesystem traffic data are collected from the
                     counters found in /proc/fs/lustre/. In order to
                     account for per-job Lustre traffic, add the
                     "fs/lustre" TRES to AccountingStorageTRES.

       AcctGatherProfileType
              Identifies the plugin to be used for detailed job profiling.
              The jobacct_gather plugin and slurmd daemon call this plugin
              to collect detailed data such as I/O counts, memory usage,
              or energy consumption for jobs and nodes. There are
              interfaces in this plugin to collect data at step start and
              completion, task start and completion, and at the account
              gather frequency. The data collected at the node level is
              related to jobs only in the case of exclusive job
              allocation.

              Configurable values at present are:

              acct_gather_profile/none
                     No profile data is collected.

              acct_gather_profile/hdf5
                     This enables the HDF5 plugin. The directory where the
                     profile files are stored and which values are
                     collected are configured in the acct_gather.conf
                     file.

              acct_gather_profile/influxdb
                     This enables the InfluxDB plugin. The InfluxDB
                     instance host, port, database, retention policy and
                     which values are collected are configured in the
                     acct_gather.conf file.

       AllowSpecResourcesUsage
              If set to "YES", Slurm allows individual jobs to override a
              node's configured CoreSpecCount value. For a job to take
              advantage of this feature, a command line option of
              --core-spec must be specified. The default value for this
              option is "YES" for Cray systems and "NO" for other system
              types.

       AuthAltTypes
              Comma-separated list of alternative authentication plugins
              that the slurmctld will permit for communication. Acceptable
              values at present include auth/jwt.

              NOTE: auth/jwt requires a jwt_hs256.key to be populated in
              the StateSaveLocation directory for slurmctld only. The
              jwt_hs256.key should only be visible to the SlurmUser and
              root. It is not suggested to place the jwt_hs256.key on any
              nodes but the controller running slurmctld. auth/jwt can be
              activated by the presence of the SLURM_JWT environment
              variable. When activated, it will override the default
              AuthType.

       AuthAltParameters
              Used to define alternative authentication plugins options.
              Multiple options may be comma separated.

              disable_token_creation
                     Disable "scontrol token" use by non-SlurmUser
                     accounts.

              jwks=  Absolute path to JWKS file. Only RS256 keys are
                     supported, although other key types may be listed in
                     the file. If set, no HS256 key will be loaded by
                     default (and token generation is disabled), although
                     the jwt_key setting may be used to explicitly
                     re-enable HS256 key use (and token generation).

              jwt_key=
                     Absolute path to JWT key file. Key must be HS256, and
                     should only be accessible by SlurmUser. If not set,
                     the default key file is jwt_hs256.key in
                     StateSaveLocation.

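              For example, to permit JWT authentication with an HS256 key
              kept outside StateSaveLocation, while restricting token
              creation to the SlurmUser (the path is illustrative):

                     AuthAltTypes=auth/jwt
                     AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key,disable_token_creation
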
       AuthInfo
              Additional information to be used for authentication of
              communications between the Slurm daemons (slurmctld and
              slurmd) and the Slurm clients. The interpretation of this
              option is specific to the configured AuthType. Multiple
              options may be specified in a comma-delimited list. If not
              specified, the default authentication information will be
              used.

              cred_expire
                     Default job step credential lifetime, in seconds
                     (e.g. "cred_expire=1200"). It must be sufficiently
                     long to load the user environment, run the prolog,
                     deal with the slurmd getting paged out of memory,
                     etc. This also controls how long a requeued job must
                     wait before starting again. The default value is 120
                     seconds.

              socket Path name to a MUNGE daemon socket to use (e.g.
                     "socket=/var/run/munge/munge.socket.2"). The default
                     value is "/var/run/munge/munge.socket.2". Used by
                     auth/munge and cred/munge.

              ttl    Credential lifetime, in seconds (e.g. "ttl=300"). The
                     default value is dependent upon the MUNGE
                     installation, but is typically 300 seconds.

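              For example, to extend the credential lifetime on a system
              with slow Prolog scripts (the value is illustrative):

                     AuthInfo=cred_expire=300
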
       AuthType
              The authentication method for communications between Slurm
              components. Acceptable values at present include
              "auth/munge", which is the default. "auth/munge" indicates
              that MUNGE is to be used. (See "https://dun.github.io/munge/"
              for more information). All Slurm daemons and commands must
              be terminated prior to changing the value of AuthType and
              later restarted.

       BackupAddr
              Deprecated option, see SlurmctldHost.

       BackupController
              Deprecated option, see SlurmctldHost.

              The backup controller recovers state information from the
              StateSaveLocation directory, which must be readable and
              writable from both the primary and backup controllers.
              While not essential, it is recommended that you specify a
              backup controller. See the RELOCATING CONTROLLERS section if
              you change this.

       BatchStartTimeout
              The maximum time (in seconds) that a batch job is permitted
              to take when launching before being considered missing and
              releasing the allocation. The default value is 10 (seconds).
              Larger values may be required if more time is required to
              execute the Prolog, load user environment variables, or if
              the slurmd daemon gets paged from memory.
              Note: The test for a job being successfully launched is only
              performed when the Slurm daemon on the compute node
              registers state with the slurmctld daemon on the head node,
              which happens fairly rarely. Therefore a job will not
              necessarily be terminated if its start time exceeds
              BatchStartTimeout. This configuration parameter is also
              applied to the launch of tasks and avoids aborting srun
              commands due to long running Prolog scripts.

       BcastExclude
              Comma-separated list of absolute directory paths to be
              excluded when autodetecting and broadcasting executable
              shared object dependencies through sbcast or srun --bcast.
              The keyword "none" can be used to indicate that no directory
              paths should be excluded. The default value is
              "/lib,/usr/lib,/lib64,/usr/lib64". This option can be
              overridden by sbcast --exclude and srun --bcast-exclude.

       BcastParameters
              Controls sbcast and srun --bcast behavior. Multiple options
              can be specified in a comma-separated list. Supported values
              include:

              DestDir=
                     Destination directory for the file being broadcast to
                     allocated compute nodes. Default value is the current
                     working directory, or --chdir for srun if set.

              Compression=
                     Specify the default file compression library to be
                     used. Supported values are "lz4" and "none". The
                     default value with the sbcast --compress option is
                     "lz4" and "none" otherwise. Some compression
                     libraries may be unavailable on some systems.

              send_libs
                     If set, attempt to autodetect and broadcast the
                     executable's shared object dependencies to allocated
                     compute nodes. The files are placed in a directory
                     alongside the executable. For srun only, the
                     LD_LIBRARY_PATH is automatically updated to include
                     this cache directory as well. This can be overridden
                     with either the sbcast or srun --send-libs option. By
                     default this is disabled.

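              For example, to broadcast files into a node-local directory
              with compression and library autodetection enabled by
              default (the path is illustrative):

                     BcastParameters=DestDir=/tmp,Compression=lz4,send_libs
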
       BurstBufferType
              The plugin used to manage burst buffers. Acceptable values
              at present are:

              burst_buffer/datawarp
                     Use Cray DataWarp API to provide burst buffer
                     functionality.

              burst_buffer/lua
                     This plugin provides hooks to an API that is defined
                     by a Lua script. This plugin was developed to provide
                     system administrators with a way to do any task (not
                     only file staging) at different points in a job's
                     life cycle.

              burst_buffer/none

       CliFilterPlugins
              A comma-delimited list of command line interface option
              filter/modification plugins. The specified plugins will be
              executed in the order listed. These are intended to be
              site-specific plugins which can be used to set default job
              parameters and/or logging events. No cli_filter plugins are
              used by default.

       ClusterName
              The name by which this Slurm managed cluster is known in the
              accounting database. This is needed to distinguish
              accounting records when multiple clusters report to the same
              database. Because of limitations in some databases, any
              upper case letters in the name will be silently mapped to
              lower case. In order to avoid confusion, it is recommended
              that the name be lower case.

       CommunicationParameters
              Comma-separated options identifying communication options.

              CheckGhalQuiesce
                     Used specifically on a Cray using an Aries GHAL
                     interconnect. This will check to see if the system is
                     quiescing when sending a message, and if so, wait
                     until it is done before sending.

              DisableIPv4
                     Disable IPv4 only operation for all slurm daemons
                     (except slurmdbd). This should also be set in your
                     slurmdbd.conf file.

              EnableIPv6
                     Enable using IPv6 addresses for all slurm daemons
                     (except slurmdbd). When using both IPv4 and IPv6,
                     address family preferences will be based on your
                     /etc/gai.conf file. This should also be set in your
                     slurmdbd.conf file.

              NoAddrCache
                     By default, Slurm will cache a node's network address
                     after successfully establishing the node's network
                     address. This option disables the cache and Slurm
                     will look up the node's network address each time a
                     connection is made. This is useful, for example, in a
                     cloud environment where the node addresses come and
                     go out of DNS.

              NoCtldInAddrAny
                     Used to directly bind to the address that the node
                     running the slurmctld resolves to, instead of binding
                     messages to any address on the node, which is the
                     default.

              NoInAddrAny
                     Used to directly bind to the address that the node
                     resolves to, instead of binding messages to any
                     address on the node, which is the default. This
                     option is for all daemons/clients except for the
                     slurmctld.

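              For example, a dual-stack cloud cluster whose node addresses
              come and go in DNS might use:

                     CommunicationParameters=EnableIPv6,NoAddrCache
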
       CompleteWait
              The time to wait, in seconds, when any job is in the
              COMPLETING state before any additional jobs are scheduled.
              This is to attempt to keep jobs on nodes that were recently
              in use, with the goal of preventing fragmentation. If set to
              zero, pending jobs will be started as soon as possible.
              Since a COMPLETING job's resources are released for use by
              other jobs as soon as the Epilog completes on each
              individual node, this can result in very fragmented resource
              allocations. To provide jobs with the minimum response time,
              a value of zero is recommended (no waiting). To minimize
              fragmentation of resources, a value equal to KillWait plus
              two is recommended. In that case, setting KillWait to a
              small value may be beneficial. The default value of
              CompleteWait is zero seconds. The value may not exceed
              65533.

              NOTE: Setting reduce_completing_frag affects the behavior of
              CompleteWait.

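              For example, following the KillWait-plus-two recommendation
              above (the values are illustrative):

                     KillWait=10
                     CompleteWait=12
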
       ControlAddr
              Deprecated option, see SlurmctldHost.

       ControlMachine
              Deprecated option, see SlurmctldHost.

       CoreSpecPlugin
              Identifies the plugin to be used for enforcement of core
              specialization. The slurmd daemon must be restarted for a
              change in CoreSpecPlugin to take effect. Acceptable values
              at present include:

              core_spec/cray_aries
                     Used only for Cray systems.

              core_spec/none
                     Used for all other system types.

       CpuFreqDef
              Default CPU frequency value or frequency governor to use
              when running a job step if it has not been explicitly set
              with the --cpu-freq option. Acceptable values at present
              include a numeric value (frequency in kilohertz) or one of
              the following governors:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor

              Performance  attempts to use the Performance CPU governor

              PowerSave    attempts to use the PowerSave CPU governor

              There is no default value. If unset, no attempt to set the
              governor is made if the --cpu-freq option has not been set.

       CpuFreqGovernors
              List of CPU frequency governors allowed to be set with the
              salloc, sbatch, or srun option --cpu-freq. Acceptable values
              at present include:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor (a
                           default value)

              Performance  attempts to use the Performance CPU governor (a
                           default value)

              PowerSave    attempts to use the PowerSave CPU governor

              SchedUtil    attempts to use the SchedUtil CPU governor

              UserSpace    attempts to use the UserSpace CPU governor (a
                           default value)

              The default is OnDemand, Performance and UserSpace.

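              For example, to let users choose only the OnDemand and
              Performance governors:

                     CpuFreqGovernors=OnDemand,Performance
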
       CredType
              The cryptographic signature tool to be used in the creation
              of job step credentials. The slurmctld daemon must be
              restarted for a change in CredType to take effect. The
              default (and recommended) value is "cred/munge".

       DebugFlags
              Defines specific subsystems which should provide more
              detailed event logging. Multiple subsystems can be specified
              with comma separators. Most DebugFlags will result in
              verbose-level logging for the identified subsystems, and
              could impact performance. Valid subsystems available
              include:

              Accrue           Accrue counters accounting details

              Agent            RPC agents (outgoing RPCs from Slurm
                               daemons)

              Backfill         Backfill scheduler details

              BackfillMap      Backfill scheduler to log a very verbose
                               map of reserved resources through time.
                               Combine with Backfill for a verbose and
                               complete view of the backfill scheduler's
                               work.

              BurstBuffer      Burst Buffer plugin

              Cgroup           Cgroup details

              CPU_Bind         CPU binding details for jobs and steps

              CpuFrequency     Cpu frequency details for jobs and steps
                               using the --cpu-freq option.

              Data             Generic data structure details.

              Dependency       Job dependency debug info

              Elasticsearch    Elasticsearch debug info

              Energy           AcctGatherEnergy debug info

              ExtSensors       External Sensors debug info

              Federation       Federation scheduling debug info

              FrontEnd         Front end node details

              Gres             Generic resource details

              Hetjob           Heterogeneous job details

              Gang             Gang scheduling details

              JobAccountGather Common job account gathering details (not
                               plugin specific).

              JobContainer     Job container plugin details

              License          License management details

              Network          Network details

              NetworkRaw       Dump raw hex values of key Network
                               communications. Warning: very verbose.

              NodeFeatures     Node Features plugin debug info

              NO_CONF_HASH     Do not log when the slurm.conf files differ
                               between Slurm daemons

              Power            Power management plugin and power save
                               (suspend/resume programs) details

              Priority         Job prioritization

              Profile          AcctGatherProfile plugins details

              Protocol         Communication protocol details

              Reservation      Advanced reservations

              Route            Message forwarding debug info

              Script           Debug info regarding the process that runs
                               slurmctld scripts such as PrologSlurmctld
                               and EpilogSlurmctld

              SelectType       Resource selection plugin

              Steps            Slurmctld resource allocation for job steps

              Switch           Switch plugin

              TimeCray         Timing of Cray APIs

              TraceJobs        Trace jobs in slurmctld. It will print
                               detailed job information including state,
                               job ids and allocated node counts.

              Triggers         Slurmctld triggers

              WorkQueue        Work Queue details

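              For example, to get a complete picture of the backfill
              scheduler's decisions:

                     DebugFlags=Backfill,BackfillMap
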
       DefCpuPerGPU
              Default count of CPUs allocated per allocated GPU. This
              value is used only if the job didn't specify
              --cpus-per-task and --cpus-per-gpu.

       DefMemPerCPU
              Default real memory size available per allocated CPU in
              megabytes. Used to avoid over-subscribing memory and causing
              paging. DefMemPerCPU would generally be used if individual
              processors are allocated to jobs
              (SelectType=select/cons_res or
              SelectType=select/cons_tres). The default value is 0
              (unlimited). Also see DefMemPerGPU, DefMemPerNode and
              MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode
              are mutually exclusive.

       DefMemPerGPU
              Default real memory size available per allocated GPU in
              megabytes. The default value is 0 (unlimited). Also see
              DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU
              and DefMemPerNode are mutually exclusive.

       DefMemPerNode
              Default real memory size available per allocated node in
              megabytes. Used to avoid over-subscribing memory and causing
              paging. DefMemPerNode would generally be used if whole nodes
              are allocated to jobs (SelectType=select/linear) and
              resources are over-subscribed (OverSubscribe=yes or
              OverSubscribe=force). The default value is 0 (unlimited).
              Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
              DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
              exclusive.

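              For example, on nodes with roughly 4 GB of usable memory per
              core, a per-CPU default and cap might be set as follows (the
              values are illustrative):

                     DefMemPerCPU=2048
                     MaxMemPerCPU=4096
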
811 DependencyParameters
812 Multiple options may be comma separated.
813
814
815 disable_remote_singleton
816 By default, when a federated job has a singleton depen‐
817 dency, each cluster in the federation must clear the sin‐
818 gleton dependency before the job's singleton dependency
819 is considered satisfied. Enabling this option means that
820 only the origin cluster must clear the singleton depen‐
821 dency. This option must be set in every cluster in the
822 federation.
823
              kill_invalid_depend
                     If a job has an invalid dependency and it can never
                     run, terminate it and set its state to be
                     JOB_CANCELLED.  By default the job stays pending
                     with reason DependencyNeverSatisfied.

              max_depend_depth=#
                     Maximum number of jobs to test for a circular job
                     dependency.  Stop testing after this number of job
                     dependencies have been tested.  The default value is
                     10 jobs.
832
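              As an illustration, the options above may be combined on a
              single line; the depth value here is only an example:

                     DependencyParameters=kill_invalid_depend,max_depend_depth=20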
833
834 DisableRootJobs
835 If set to "YES" then user root will be prevented from running
836 any jobs. The default value is "NO", meaning user root will be
837 able to execute jobs. DisableRootJobs may also be set by parti‐
838 tion.
839
840
841 EioTimeout
842 The number of seconds srun waits for slurmstepd to close the
843 TCP/IP connection used to relay data between the user applica‐
844 tion and srun when the user application terminates. The default
845 value is 60 seconds. May not exceed 65533.
846
847
       EnforcePartLimits
              If set to "ALL" then jobs which exceed a partition's size
              and/or time limits will be rejected at submission time.  If
              a job is submitted to multiple partitions, the job must
              satisfy the limits on all the requested partitions.  If set
              to "NO" then the job will be accepted and remain queued
              until the partition limits are altered (Time and Node
              Limits).  If set to "ANY", a job must satisfy the limits of
              at least one of the requested partitions to be submitted.
              The default value is "NO".  NOTE: If set, then a job's QOS
              can not be used to exceed partition limits.  NOTE: The
              partition limits being considered are its configured
              MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime,
              AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS
              usage threshold.
861
862
863 Epilog Fully qualified pathname of a script to execute as user root on
864 every node when a user's job completes (e.g. "/usr/lo‐
865 cal/slurm/epilog"). A glob pattern (See glob (7)) may also be
866 used to run more than one epilog script (e.g. "/etc/slurm/epi‐
867 log.d/*"). The Epilog script or scripts may be used to purge
868 files, disable user login, etc. By default there is no epilog.
869 See Prolog and Epilog Scripts for more information.
870
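              A minimal sketch of the two forms described above (the
              paths are illustrative): a single script,

                     Epilog=/usr/local/slurm/epilog

              or a glob pattern running every script in a directory:

                     Epilog=/etc/slurm/epilog.d/*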
871
872 EpilogMsgTime
873 The number of microseconds that the slurmctld daemon requires to
874 process an epilog completion message from the slurmd daemons.
875 This parameter can be used to prevent a burst of epilog comple‐
876 tion messages from being sent at the same time which should help
877 prevent lost messages and improve throughput for large jobs.
878 The default value is 2000 microseconds. For a 1000 node job,
879 this spreads the epilog completion messages out over two sec‐
880 onds.
881
882
883 EpilogSlurmctld
884 Fully qualified pathname of a program for the slurmctld to exe‐
885 cute upon termination of a job allocation (e.g. "/usr/lo‐
886 cal/slurm/epilog_controller"). The program executes as Slur‐
887 mUser, which gives it permission to drain nodes and requeue the
888 job if a failure occurs (See scontrol(1)). Exactly what the
889 program does and how it accomplishes this is completely at the
890 discretion of the system administrator. Information about the
891 job being initiated, its allocated nodes, etc. are passed to the
892 program using environment variables. See Prolog and Epilog
893 Scripts for more information.
894
895
       ExtSensorsFreq
              The external sensors plugin sampling interval.  If
              ExtSensorsType=ext_sensors/none, this parameter is ignored.
              For all other values of ExtSensorsType, this parameter is
              the number of seconds between external sensors samples for
              hardware components (nodes, switches, etc.).  The default
              value is zero, which disables external sensors sampling.
              Note: This parameter does not affect external sensors data
              collection for jobs/steps.
904
905
906 ExtSensorsType
907 Identifies the plugin to be used for external sensors data col‐
908 lection. Slurmctld calls this plugin to collect external sen‐
909 sors data for jobs/steps and hardware components. In case of
910 node sharing between jobs the reported values per job/step
911 (through sstat or sacct) may not be accurate. See also "man
912 ext_sensors.conf".
913
914 Configurable values at present are:
915
916 ext_sensors/none No external sensors data is collected.
917
918 ext_sensors/rrd External sensors data is collected from the
919 RRD database.
920
921
       FairShareDampeningFactor
              Dampen the effect of exceeding a user or group's fair share
              of allocated resources.  Higher values provide a greater
              ability to differentiate between exceeding the fair share
              at high levels (e.g. a value of 1 results in almost no
              difference between overconsumption by a factor of 10 and
              100, while a value of 5 will result in a significant
              difference in priority).  The default value is 1.
930
931
932 FederationParameters
933 Used to define federation options. Multiple options may be comma
934 separated.
935
936
937 fed_display
938 If set, then the client status commands (e.g. squeue,
939 sinfo, sprio, etc.) will display information in a feder‐
940 ated view by default. This option is functionally equiva‐
941 lent to using the --federation options on each command.
942 Use the client's --local option to override the federated
943 view and get a local view of the given cluster.
944
945
       FirstJobId
              The job id to be used for the first job submitted to Slurm.
              Job id values generated will be incremented by 1 for each
              subsequent job.  Value must be larger than 0.  The default
              value is 1.  Also see MaxJobId.
951
952
953 GetEnvTimeout
954 Controls how long the job should wait (in seconds) to load the
955 user's environment before attempting to load it from a cache
956 file. Applies when the salloc or sbatch --get-user-env option
957 is used. If set to 0 then always load the user's environment
958 from the cache file. The default value is 2 seconds.
959
960
961 GresTypes
962 A comma-delimited list of generic resources to be managed (e.g.
963 GresTypes=gpu,mps). These resources may have an associated GRES
964 plugin of the same name providing additional functionality. No
965 generic resources are managed by default. Ensure this parameter
966 is consistent across all nodes in the cluster for proper opera‐
967 tion. The slurmctld and slurmd daemons must be restarted for
968 changes to this parameter to take effect.
969
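              For example, to manage both GPU and MPS resources (as in
              the example above), a site would list both types:

                     GresTypes=gpu,mps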
970
971 GroupUpdateForce
972 If set to a non-zero value, then information about which users
973 are members of groups allowed to use a partition will be updated
974 periodically, even when there have been no changes to the
975 /etc/group file. If set to zero, group member information will
976 be updated only after the /etc/group file is updated. The de‐
977 fault value is 1. Also see the GroupUpdateTime parameter.
978
979
980 GroupUpdateTime
981 Controls how frequently information about which users are mem‐
982 bers of groups allowed to use a partition will be updated, and
983 how long user group membership lists will be cached. The time
984 interval is given in seconds with a default value of 600 sec‐
985 onds. A value of zero will prevent periodic updating of group
986 membership information. Also see the GroupUpdateForce parame‐
987 ter.
988
989
       GpuFreqDef=[<type>=]<value>[,<type>=<value>]
              Default GPU frequency to use when running a job step if it
              has not been explicitly set using the --gpu-freq option.
              This option can be used to independently configure the GPU
              and its memory frequencies.  Defaults to "high,memory=high".
              After the job is completed, the frequencies of all affected
              GPUs will be reset to the highest possible values.  In some
              cases, system power caps may override the requested values.
              The field type can be "memory".  If type is not specified,
              the GPU frequency is implied.  The value field can either
              be "low", "medium", "high", "highm1" or a numeric value in
              megahertz (MHz).  If the specified numeric value is not
              possible, a value as close as possible will be used.  See
              below for definition of the values.  Examples of use
              include "GpuFreqDef=medium,memory=high" and
              "GpuFreqDef=450".
1005
1006 Supported value definitions:
1007
1008 low the lowest available frequency.
1009
1010 medium attempts to set a frequency in the middle of the
1011 available range.
1012
1013 high the highest available frequency.
1014
1015 highm1 (high minus one) will select the next highest avail‐
1016 able frequency.
1017
1018
1019 HealthCheckInterval
1020 The interval in seconds between executions of HealthCheckPro‐
1021 gram. The default value is zero, which disables execution.
1022
1023
1024 HealthCheckNodeState
1025 Identify what node states should execute the HealthCheckProgram.
1026 Multiple state values may be specified with a comma separator.
1027 The default value is ANY to execute on nodes in any state.
1028
1029 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
1030 cated).
1031
1032 ANY Run on nodes in any state.
1033
1034 CYCLE Rather than running the health check program on all
1035 nodes at the same time, cycle through running on all
1036 compute nodes through the course of the HealthCheck‐
1037 Interval. May be combined with the various node
1038 state options.
1039
1040 IDLE Run on nodes in the IDLE state.
1041
1042 MIXED Run on nodes in the MIXED state (some CPUs idle and
1043 other CPUs allocated).
1044
1045
1046 HealthCheckProgram
1047 Fully qualified pathname of a script to execute as user root pe‐
1048 riodically on all compute nodes that are not in the NOT_RESPOND‐
1049 ING state. This program may be used to verify the node is fully
1050 operational and DRAIN the node or send email if a problem is de‐
1051 tected. Any action to be taken must be explicitly performed by
1052 the program (e.g. execute "scontrol update NodeName=foo
1053 State=drain Reason=tmp_file_system_full" to drain a node). The
1054 execution interval is controlled using the HealthCheckInterval
              parameter.  Note that the HealthCheckProgram will be
              executed at the same time on all nodes to minimize its
              impact upon parallel programs.  This program will be killed
              if it does not terminate normally within 60 seconds.  This
              program will also be executed when the slurmd daemon is
              first started and before it registers with the slurmctld
              daemon.  By default, no program will be executed.
1062
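              Putting the three HealthCheck* parameters together, a site
              might run a node health check every five minutes on idle
              nodes only, staggered across the interval (the script path
              is illustrative):

                     HealthCheckProgram=/usr/local/sbin/node_health.sh
                     HealthCheckInterval=300
                     HealthCheckNodeState=IDLE,CYCLE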
1063
1064 InactiveLimit
1065 The interval, in seconds, after which a non-responsive job allo‐
1066 cation command (e.g. srun or salloc) will result in the job be‐
1067 ing terminated. If the node on which the command is executed
1068 fails or the command abnormally terminates, this will terminate
1069 its job allocation. This option has no effect upon batch jobs.
1070 When setting a value, take into consideration that a debugger
1071 using srun to launch an application may leave the srun command
1072 in a stopped state for extended periods of time. This limit is
1073 ignored for jobs running in partitions with the RootOnly flag
1074 set (the scheduler running as root will be responsible for the
1075 job). The default value is unlimited (zero) and may not exceed
1076 65533 seconds.
1077
1078
1079 InteractiveStepOptions
1080 When LaunchParameters=use_interactive_step is enabled, launching
1081 salloc will automatically start an srun process with Interac‐
1082 tiveStepOptions to launch a terminal on a node in the job allo‐
1083 cation. The default value is "--interactive --preserve-env
1084 --pty $SHELL". The "--interactive" option is intentionally not
1085 documented in the srun man page. It is meant only to be used in
1086 InteractiveStepOptions in order to create an "interactive step"
1087 that will not consume resources so that other steps may run in
1088 parallel with the interactive step.
1089
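              A sketch of enabling the interactive step with a custom
              shell (the option values are illustrative):

                     LaunchParameters=use_interactive_step
                     InteractiveStepOptions="--interactive --preserve-env --pty /bin/bash"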
1090
       JobAcctGatherType
              The job accounting mechanism type.  Acceptable values at
              present include "jobacct_gather/linux" (for Linux systems;
              the recommended choice), "jobacct_gather/cgroup" and
              "jobacct_gather/none" (no accounting data collected).  The
              default value is "jobacct_gather/none".
              "jobacct_gather/cgroup" is a plugin for the Linux operating
              system that uses cgroups to collect accounting statistics.
              The plugin collects the following statistics: From the
              cgroup memory subsystem: memory.usage_in_bytes (reported as
              'pages') and rss from memory.stat (reported as 'rss').
              From the cgroup cpuacct subsystem: user cpu time and system
              cpu time.  No value is provided by cgroups for virtual
              memory size ('vsize').  In order to use the sstat tool,
              either "jobacct_gather/linux" or "jobacct_gather/cgroup"
              must be configured.
1106 NOTE: Changing this configuration parameter changes the contents
1107 of the messages between Slurm daemons. Any previously running
1108 job steps are managed by a slurmstepd daemon that will persist
1109 through the lifetime of that job step and not change its commu‐
1110 nication protocol. Only change this configuration parameter when
1111 there are no running job steps.
1112
1113
       JobAcctGatherFrequency
              The job accounting and profiling sampling intervals.  The
              supported format is as follows:
1117
1118 JobAcctGatherFrequency=<datatype>=<interval>
1119 where <datatype>=<interval> specifies the task sam‐
1120 pling interval for the jobacct_gather plugin or a
1121 sampling interval for a profiling type by the
1122 acct_gather_profile plugin. Multiple, comma-sepa‐
1123 rated <datatype>=<interval> intervals may be speci‐
1124 fied. Supported datatypes are as follows:
1125
1126 task=<interval>
1127 where <interval> is the task sampling inter‐
1128 val in seconds for the jobacct_gather plugins
1129 and for task profiling by the
1130 acct_gather_profile plugin.
1131
1132 energy=<interval>
1133 where <interval> is the sampling interval in
1134 seconds for energy profiling using the
1135 acct_gather_energy plugin
1136
1137 network=<interval>
1138 where <interval> is the sampling interval in
1139 seconds for infiniband profiling using the
1140 acct_gather_interconnect plugin.
1141
1142 filesystem=<interval>
1143 where <interval> is the sampling interval in
1144 seconds for filesystem profiling using the
1145 acct_gather_filesystem plugin.
1146
1147 The default value for task sampling interval
1148 is 30 seconds. The default value for all other intervals is 0.
1149 An interval of 0 disables sampling of the specified type. If
1150 the task sampling interval is 0, accounting information is col‐
1151 lected only at job termination (reducing Slurm interference with
1152 the job).
1153 Smaller (non-zero) values have a greater impact upon job perfor‐
1154 mance, but a value of 30 seconds is not likely to be noticeable
1155 for applications having less than 10,000 tasks.
1156 Users can independently override each interval on a per job ba‐
1157 sis using the --acctg-freq option when submitting the job.
1158
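              For example, to sample task statistics every 15 seconds and
              energy data every 30 seconds, while leaving the other
              profiling types disabled (values illustrative):

                     JobAcctGatherType=jobacct_gather/linux
                     JobAcctGatherFrequency=task=15,energy=30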
1159
       JobAcctGatherParams
              Arbitrary parameters for the job account gather plugin.
              Acceptable values at present include:
1163
1164 NoShared Exclude shared memory from accounting.
1165
1166 UsePss Use PSS value instead of RSS to calculate
1167 real usage of memory. The PSS value will be
1168 saved as RSS.
1169
              OverMemoryKill Kill processes that are detected to be using
                             more memory than the step requested, every
                             time accounting information is gathered by
                             the JobAcctGather plugin.  This parameter
                             should be used with caution because a job
                             exceeding its memory allocation may affect
                             other processes and/or machine health.
1177
1178 NOTE: If available, it is recommended to
1179 limit memory by enabling task/cgroup as a
1180 TaskPlugin and making use of Constrain‐
1181 RAMSpace=yes in the cgroup.conf instead of
1182 using this JobAcctGather mechanism for mem‐
1183 ory enforcement. Using JobAcctGather is
1184 polling based and there is a delay before a
1185 job is killed, which could lead to system
1186 Out of Memory events.
1187
1188 NOTE: When using OverMemoryKill, if the mem‐
1189 ory usage of one of the processes in a step
1190 exceeds the memory limit, the entire step
1191 will be killed/cancelled by the JobAcct‐
1192 Gather plugin. This differs from the behav‐
1193 ior when using ConstrainRAMSpace, where pro‐
1194 cesses in the step will be killed, but the
1195 step will be left active, possibly with
                             other processes left running.  It also
                             differs in that the combined memory usage of
                             all the processes in the step is considered
                             when evaluating against the memory limit.
1200
1201
1202 JobCompHost
1203 The name of the machine hosting the job completion database.
1204 Only used for database type storage plugins, ignored otherwise.
1205
1206
       JobCompLoc
              The fully qualified file name where job completion records
              are written when the JobCompType is "jobcomp/filetxt", the
              database where job completion records are stored when the
              JobCompType is a database, or a complete URL endpoint with
              format <host>:<port>/<target>/_doc when JobCompType is
              "jobcomp/elasticsearch", e.g. "localhost:9200/slurm/_doc".
              NOTE: More information is available at the Slurm web site
              <https://slurm.schedmd.com/elasticsearch.html>.
1216
1217
1218 JobCompParams
1219 Pass arbitrary text string to job completion plugin. Also see
1220 JobCompType.
1221
1222
1223 JobCompPass
1224 The password used to gain access to the database to store the
1225 job completion data. Only used for database type storage plug‐
1226 ins, ignored otherwise.
1227
1228
1229 JobCompPort
1230 The listening port of the job completion database server. Only
1231 used for database type storage plugins, ignored otherwise.
1232
1233
1234 JobCompType
1235 The job completion logging mechanism type. Acceptable values at
1236 present include:
1237
1238 jobcomp/none
1239 Upon job completion, a record of the job is purged from
1240 the system. If using the accounting infrastructure this
1241 plugin may not be of interest since some of the informa‐
1242 tion is redundant.
1243
1244
1245 jobcomp/elasticsearch
1246 Upon job completion, a record of the job should be writ‐
1247 ten to an Elasticsearch server, specified by the JobCom‐
1248 pLoc parameter.
1249 NOTE: More information is available at the Slurm web site
1250 ( https://slurm.schedmd.com/elasticsearch.html ).
1251
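              A sketch of the Elasticsearch configuration described
              above, using the endpoint format documented under
              JobCompLoc (host and index are illustrative):

                     JobCompType=jobcomp/elasticsearch
                     JobCompLoc=localhost:9200/slurm/_doc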
1252
1253 jobcomp/filetxt
1254 Upon job completion, a record of the job should be writ‐
1255 ten to a text file, specified by the JobCompLoc parame‐
1256 ter.
1257
1258
              jobcomp/lua
                     Upon job completion, a record of the job should be
                     processed by the jobcomp.lua script, located in the
                     default script directory (typically the subdirectory
                     etc of the installation directory).
1264
1265
1266 jobcomp/mysql
1267 Upon job completion, a record of the job should be writ‐
1268 ten to a MySQL or MariaDB database, specified by the Job‐
1269 CompLoc parameter.
1270
1271
1272 jobcomp/script
1273 Upon job completion, a script specified by the JobCompLoc
1274 parameter is to be executed with environment variables
1275 providing the job information.
1276
1277
1278 JobCompUser
1279 The user account for accessing the job completion database.
1280 Only used for database type storage plugins, ignored otherwise.
1281
1282
1283 JobContainerType
1284 Identifies the plugin to be used for job tracking. The slurmd
1285 daemon must be restarted for a change in JobContainerType to
1286 take effect. NOTE: The JobContainerType applies to a job allo‐
1287 cation, while ProctrackType applies to job steps. Acceptable
1288 values at present include:
1289
1290 job_container/cncu Used only for Cray systems (CNCU = Compute
1291 Node Clean Up)
1292
1293 job_container/none Used for all other system types
1294
1295 job_container/tmpfs Used to create a private namespace on the
1296 filesystem for jobs, which houses temporary
1297 file systems (/tmp and /dev/shm) for each
1298 job. 'PrologFlags=Contain' must be set to
1299 use this plugin.
1300
1301
       JobFileAppend
              This option controls what to do if a job's output or error
              file exists when the job is started.  If JobFileAppend is
              set to a value of 1, then append to the existing file.  By
              default, any existing file is truncated.
1307
1308
       JobRequeue
              This option controls the default ability for batch jobs to
              be requeued.  Jobs may be requeued explicitly by a system
              administrator, after node failure, or upon preemption by a
              higher priority job.  If JobRequeue is set to a value of 1,
              then batch jobs may be requeued unless explicitly disabled
              by the user.  If JobRequeue is set to a value of 0, then
              batch jobs will not be requeued unless explicitly enabled
              by the user.  Use the sbatch --no-requeue or --requeue
              option to change the default behavior for individual jobs.
              The default value is 1.
1319
1320
1321 JobSubmitPlugins
1322 A comma-delimited list of job submission plugins to be used.
1323 The specified plugins will be executed in the order listed.
1324 These are intended to be site-specific plugins which can be used
1325 to set default job parameters and/or logging events. Sample
1326 plugins available in the distribution include "all_partitions",
1327 "defaults", "logging", "lua", and "partition". For examples of
1328 use, see the Slurm code in "src/plugins/job_submit" and "con‐
1329 tribs/lua/job_submit*.lua" then modify the code to satisfy your
1330 needs. Slurm can be configured to use multiple job_submit plug‐
1331 ins if desired, however the lua plugin will only execute one lua
1332 script named "job_submit.lua" located in the default script di‐
1333 rectory (typically the subdirectory "etc" of the installation
1334 directory). No job submission plugins are used by default.
1335
1336
       KeepAliveTime
              Specifies how long sockets communications used between the
              srun command and its slurmstepd process are kept alive
              after disconnect.  Longer values can be used to improve
              reliability of communications in the event of network
              failures.  The default value leaves the system default in
              place.  The value may not exceed 65533.
1344
1345
       KillOnBadExit
              If set to 1, a step will be terminated immediately if any
              task crashes or aborts, as indicated by a non-zero exit
              code.  With the default value of 0, if one of the processes
              crashes or aborts, the other processes will continue to run
              while the crashed or aborted process waits.  The user can
              override this configuration parameter by using srun's -K,
              --kill-on-bad-exit option.
1353
1354
1355 KillWait
1356 The interval, in seconds, given to a job's processes between the
1357 SIGTERM and SIGKILL signals upon reaching its time limit. If
1358 the job fails to terminate gracefully in the interval specified,
1359 it will be forcibly terminated. The default value is 30 sec‐
1360 onds. The value may not exceed 65533.
1361
1362
       NodeFeaturesPlugins
              Identifies the plugins to be used for support of node
              features which can change through time.  For example, a
              node might be booted with various BIOS settings.  This is
              supported through the use of a node's active_features and
              available_features information.  Acceptable values at
              present include:
1369
1370 node_features/knl_cray
1371 used only for Intel Knights Landing proces‐
1372 sors (KNL) on Cray systems
1373
1374 node_features/knl_generic
1375 used for Intel Knights Landing processors
1376 (KNL) on a generic Linux system
1377
1378
1379 LaunchParameters
1380 Identifies options to the job launch plugin. Acceptable values
1381 include:
1382
              batch_step_set_cpu_freq Set the cpu frequency for the batch
                                      step from the given --cpu-freq
                                      option, or from the slurm.conf
                                      CpuFreqDef setting.  By default
                                      only steps started with srun will
                                      utilize the cpu freq setting
                                      options.

                                      NOTE: If you are using srun to
                                      launch your steps inside a batch
                                      script (advised), this option will
                                      create a situation where you may
                                      have multiple agents setting the
                                      cpu_freq, as the batch step usually
                                      runs on the same resources as one
                                      or more of the steps that the sruns
                                      in the script will create.
1397
1398 cray_net_exclusive Allow jobs on a Cray Native cluster ex‐
1399 clusive access to network resources.
1400 This should only be set on clusters pro‐
1401 viding exclusive access to each node to
1402 a single job at once, and not using par‐
1403 allel steps within the job, otherwise
1404 resources on the node can be oversub‐
1405 scribed.
1406
1407 enable_nss_slurm Permits passwd and group resolution for
1408 a job to be serviced by slurmstepd
1409 rather than requiring a lookup from a
1410 network based service. See
1411 https://slurm.schedmd.com/nss_slurm.html
1412 for more information.
1413
1414 lustre_no_flush If set on a Cray Native cluster, then do
1415 not flush the Lustre cache on job step
1416 completion. This setting will only take
1417 effect after reconfiguring, and will
1418 only take effect for newly launched
1419 jobs.
1420
1421 mem_sort Sort NUMA memory at step start. User can
1422 override this default with
1423 SLURM_MEM_BIND environment variable or
1424 --mem-bind=nosort command line option.
1425
              mpir_use_nodeaddr      When launching tasks, Slurm creates
                                     entries in MPIR_proctable that are
                                     used by parallel debuggers,
                                     profilers, and related tools to
                                     attach to running processes.  By
                                     default the MPIR_proctable entries
                                     contain MPIR_procdesc structures
                                     where the host_name is set to
                                     NodeName.  If this option is
                                     specified, NodeAddr will be used in
                                     this context instead.
1436
              disable_send_gids      By default, the slurmctld will look
                                     up and send the user_name and
                                     extended gids for a job, rather than
                                     having each node look them up
                                     independently as part of each task
                                     launch.  This helps mitigate issues
                                     around name service scalability when
                                     launching jobs involving many nodes.
                                     Using this option will disable this
                                     functionality.  This option is
                                     ignored if enable_nss_slurm is
                                     specified.
1447
1448 slurmstepd_memlock Lock the slurmstepd process's current
1449 memory in RAM.
1450
1451 slurmstepd_memlock_all Lock the slurmstepd process's current
1452 and future memory in RAM.
1453
1454 test_exec Have srun verify existence of the exe‐
1455 cutable program along with user execute
1456 permission on the node where srun was
1457 called before attempting to launch it on
1458 nodes in the step.
1459
1460 use_interactive_step Have salloc use the Interactive Step to
1461 launch a shell on an allocated compute
1462 node rather than locally to wherever
1463 salloc was invoked. This is accomplished
1464 by launching the srun command with In‐
1465 teractiveStepOptions as options.
1466
1467 This does not affect salloc called with
1468 a command as an argument. These jobs
1469 will continue to be executed as the
1470 calling user on the calling host.
1471
1472
1473 LaunchType
1474 Identifies the mechanism to be used to launch application tasks.
1475 Acceptable values include:
1476
1477 launch/slurm
1478 The default value.
1479
1480
1481 Licenses
1482 Specification of licenses (or other resources available on all
1483 nodes of the cluster) which can be allocated to jobs. License
1484 names can optionally be followed by a colon and count with a de‐
1485 fault count of one. Multiple license names should be comma sep‐
1486 arated (e.g. "Licenses=foo:4,bar"). Note that Slurm prevents
1487 jobs from being scheduled if their required license specifica‐
1488 tion is not available. Slurm does not prevent jobs from using
1489 licenses that are not explicitly listed in the job submission
1490 specification.
1491
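              For example, a site might define two license pools as in
              the example above and let jobs request them with sbatch's
              -L/--licenses option (names and counts are illustrative):

                     Licenses=foo:4,bar

              A job would then request licenses with e.g.
              "sbatch -L foo:2 job.sh".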
1492
1493 LogTimeFormat
1494 Format of the timestamp in slurmctld and slurmd log files. Ac‐
1495 cepted values are "iso8601", "iso8601_ms", "rfc5424",
1496 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1497 ing in "_ms" differ from the ones without in that fractional
1498 seconds with millisecond precision are printed. The default
1499 value is "iso8601_ms". The "rfc5424" formats are the same as the
1500 "iso8601" formats except that the timezone value is also shown.
1501 The "clock" format shows a timestamp in microseconds retrieved
1502 with the C standard clock() function. The "short" format is a
1503 short date and time format. The "thread_id" format shows the
1504 timestamp in the C standard ctime() function form without the
1505 year but including the microseconds, the daemon's process ID and
1506 the current thread name and ID.
1507
1508
       MailDomain
              Domain name to qualify usernames if an email address is not
              explicitly given with the "--mail-user" option.  If unset,
              the local MTA will need to qualify local addresses itself.
              Changes to MailDomain will only affect new jobs.
1514
1515
1516 MailProg
1517 Fully qualified pathname to the program used to send email per
1518 user request. The default value is "/bin/mail" (or
1519 "/usr/bin/mail" if "/bin/mail" does not exist but
1520 "/usr/bin/mail" does exist). The program is called with argu‐
1521 ments suitable for the default mail command, however additional
1522 information about the job is passed in the form of environment
1523 variables.
1524
1525 Additional variables are the same as those passed to Pro‐
1526 logSlurmctld and EpilogSlurmctld with additional variables in
1527 the following contexts:
1528
1529
1530 ALL
1531
1532
1533 SLURM_JOB_STATE
1534 The base state of the job when the MailProg is
1535 called.
1536
1537
1538 SLURM_JOB_MAIL_TYPE
1539 The mail type triggering the mail.
1540
1541
1542 BEGIN
1543
1544 SLURM_JOB_QEUEUED_TIME
1545 The amount of time the job was queued.
1546
1547
1548 END, FAIL, REQUEUE, TIME_LIMIT_*
1549
1550 SLURM_JOB_RUN_TIME
1551 The amount of time the job ran for.
1552
1553
1554 END, FAIL
1555
1556 SLURM_JOB_EXIT_CODE_MAX
1557 Job's exit code or highest exit code for an array
1558 job.
1559
1560
              SLURM_JOB_EXIT_CODE_MIN
                     Job's minimum exit code for an array job.
1563
1564
1565 SLURM_JOB_TERM_SIGNAL_MAX
1566 Job's highest signal for an array job.
1567
1568
1569 STAGE_OUT
1570
1571
1572 SLURM_JOB_STAGE_OUT_TIME
1573 Job's staging out time.
1574
1575
       MaxArraySize
              The maximum job array size.  The maximum job array task
              index value will be one less than MaxArraySize to allow for
              an index value of zero.  Configure MaxArraySize to 0 in
              order to disable job array use.  The value may not exceed
              4000001.  The value of MaxJobCount should be much larger
              than MaxArraySize.  The default value is 1001.  See also
              max_array_tasks in SchedulerParameters.
1583
1584
       MaxDBDMsgs
              When communication to the SlurmDBD is not possible, the
              slurmctld will queue messages meant to be processed when
              the SlurmDBD is available again.  In order to avoid running
              out of memory, the slurmctld will only queue so many
              messages.  The default value is 10000, or MaxJobCount * 2 +
              Node Count * 4, whichever is greater.  The value can not be
              less than 10000.
1592
1593
1594 MaxJobCount
1595 The maximum number of jobs Slurm can have in its active database
1596 at one time. Set the values of MaxJobCount and MinJobAge to en‐
1597 sure the slurmctld daemon does not exhaust its memory or other
1598 resources. Once this limit is reached, requests to submit addi‐
1599 tional jobs will fail. The default value is 10000 jobs. NOTE:
1600 Each task of a job array counts as one job even though they will
1601 not occupy separate job records until modified or initiated.
              Performance can suffer with more than a few hundred
              thousand jobs.  Setting MaxSubmitJobs per user is generally
              valuable to prevent a single user from filling the system
              with jobs.
1605 This is accomplished using Slurm's database and configuring en‐
1606 forcement of resource limits. This value may not be reset via
1607 "scontrol reconfig". It only takes effect upon restart of the
1608 slurmctld daemon.
1609
1610
1611 MaxJobId
              The maximum job id to be used for jobs submitted to Slurm
              without a specific requested value.  Job ids are unsigned
              32-bit integers with the first 26 bits reserved for local
              job ids and the remaining 6 bits reserved for a cluster id
              to identify a federated job's origin.  The maximum allowed
              local job id is
1617 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1618 (0x03ff0000). MaxJobId only applies to the local job id and not
1619 the federated job id. Job id values generated will be incre‐
1620 mented by 1 for each subsequent job. Once MaxJobId is reached,
1621 the next job will be assigned FirstJobId. Federated jobs will
1622 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1623 bId.
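The bit layout described above can be checked with a little shell arithmetic:

```shell
# Job ids are unsigned 32-bit values: the low 26 bits hold the local job
# id and the high 6 bits hold the cluster id used by federation.
max_local=$(( (1 << 26) - 1 ))      # 67108863 (0x3FFFFFF)
default_max=$(( 0x03ff0000 ))       # 67043328, the default MaxJobId
min_federated=$(( (1 << 26) + 1 ))  # 67108865, lowest federated job id
echo "$max_local $default_max $min_federated"
```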
1624
1625
1626 MaxMemPerCPU
1627 Maximum real memory size available per allocated CPU in
1628 megabytes. Used to avoid over-subscribing memory and causing
1629 paging. MaxMemPerCPU would generally be used if individual pro‐
1630 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
1631 lectType=select/cons_tres). The default value is 0 (unlimited).
1632 Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode. MaxMem‐
1633 PerCPU and MaxMemPerNode are mutually exclusive.
1634
1635 NOTE: If a job specifies a memory per CPU limit that exceeds
1636 this system limit, that job's count of CPUs per task will auto‐
1637 matically be increased. This may result in the job failing due
1638 to CPU count limits. This auto-adjustment feature is a best-ef‐
1639 fort one and optimal assignment is not guaranteed due to the
1640 possibility of having heterogeneous configurations and
1641 multi-partition/qos jobs. If this is a concern it is advised to
1642 use a job submit LUA plugin instead to enforce auto-adjustments
1643 to your specific needs.
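A rough sketch of the adjustment (not Slurm's exact code, and with hypothetical values): a request exceeding MaxMemPerCPU is satisfied by scaling up cpus-per-task instead.

```shell
# Hypothetical values: MaxMemPerCPU=2000 and a job asking --mem-per-cpu=6000
max_mem_per_cpu=2000          # megabytes
requested=6000                # megabytes
cpus_per_task=1
# ceiling division: how many CPUs are needed to cover the memory request
factor=$(( (requested + max_mem_per_cpu - 1) / max_mem_per_cpu ))
cpus_per_task=$(( cpus_per_task * factor ))
echo "cpus-per-task adjusted to $cpus_per_task"
```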
1644
1645
1646 MaxMemPerNode
1647 Maximum real memory size available per allocated node in
1648 megabytes. Used to avoid over-subscribing memory and causing
1649 paging. MaxMemPerNode would generally be used if whole nodes
1650 are allocated to jobs (SelectType=select/linear) and resources
1651 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1652 The default value is 0 (unlimited). Also see DefMemPerNode and
1653 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
1654 clusive.
1655
1656
1657 MaxStepCount
1658 The maximum number of steps that any job can initiate. This pa‐
1659 rameter is intended to limit the effect of bad batch scripts.
1660 The default value is 40000 steps.
1661
1662
1663 MaxTasksPerNode
1664 Maximum number of tasks Slurm will allow a job step to spawn on
1665 a single node. The default MaxTasksPerNode is 512. May not ex‐
1666 ceed 65533.
1667
1668
1669 MCSParameters
1670 MCS = Multi-Category Security MCS Plugin Parameters. The sup‐
1671 ported parameters are specific to the MCSPlugin. Changes to
1672 this value take effect when the Slurm daemons are reconfigured.
1673 More information about MCS is available here
1674 <https://slurm.schedmd.com/mcs.html>.
1675
1676
1677 MCSPlugin
1678 MCS = Multi-Category Security : associate a security label to
1679 jobs and ensure that nodes can only be shared among jobs using
1680 the same security label. Acceptable values include:
1681
1682 mcs/none is the default value. No security label associated
1683 with jobs, no particular security restriction when
1684 sharing nodes among jobs.
1685
1686 mcs/account only users with the same account can share the nodes
1687 (requires enabling of accounting).
1688
1689 mcs/group only users with the same group can share the nodes.
1690
1691 mcs/user a node cannot be shared with other users.
1692
1693
1694 MessageTimeout
1695 Time permitted for a round-trip communication to complete in
1696 seconds. Default value is 10 seconds. For systems with shared
1697 nodes, the slurmd daemon could be paged out and necessitate
1698 higher values.
1699
1700
1701 MinJobAge
1702 The minimum age of a completed job before its record is purged
1703 from Slurm's active database. Set the values of MaxJobCount and
1704 MinJobAge to ensure the slurmctld daemon does not exhaust its memory or
1705 other resources. The default value is 300 seconds. A value of
1706 zero prevents any job record purging. Jobs are not purged dur‐
1707 ing a backfill cycle, so it can take longer than MinJobAge sec‐
1708 onds to purge a job if using the backfill scheduling plugin. In
1709 order to eliminate some possible race conditions, the minimum
1710 non-zero value for MinJobAge recommended is 2.
1711
1712
1713 MpiDefault
1714 Identifies the default type of MPI to be used. Srun may over‐
1715 ride this configuration parameter in any case. Currently sup‐
1716 ported versions include: pmi2, pmix, and none (default, which
1717 works for many other versions of MPI). More information about
1718 MPI use is available here
1719 <https://slurm.schedmd.com/mpi_guide.html>.
1720
1721
1722 MpiParams
1723 MPI parameters. Used to identify ports used by older versions
1724 of OpenMPI and native Cray systems. The input format is
1725 "ports=12000-12999" to identify a range of communication ports
1726 to be used. NOTE: This is not needed for modern versions of
1727 OpenMPI; removing it can give a small boost in scheduling
1728 performance. NOTE: This is required for Cray's PMI.
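Where the option is still required (e.g. Cray's PMI), the format looks like:

```
# slurm.conf excerpt: reserve a range of communication ports for MPI
MpiParams=ports=12000-12999
```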
1729
1730
1731 OverTimeLimit
1732 Number of minutes by which a job can exceed its time limit be‐
1733 fore being canceled. Normally a job's time limit is treated as
1734 a hard limit and the job will be killed upon reaching that
1735 limit. Configuring OverTimeLimit will result in the job's time
1736 limit being treated like a soft limit. Adding the OverTimeLimit
1737 value to the soft time limit provides a hard time limit, at
1738 which point the job is canceled. This is particularly useful
1739 for backfill scheduling, which bases its decisions upon each job's soft time
1740 limit. The default value is zero. May not exceed 65533 min‐
1741 utes. A value of "UNLIMITED" is also supported.
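A brief sketch with a hypothetical value:

```
# slurm.conf excerpt: a job submitted with --time=60 has a soft limit of
# 60 minutes and is canceled at the hard limit of 60 + 10 = 70 minutes.
OverTimeLimit=10
```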
1742
1743
1744 PluginDir
1745 Identifies the places in which to look for Slurm plugins. This
1746 is a colon-separated list of directories, like the PATH environ‐
1747 ment variable. The default value is the prefix given at config‐
1748 ure time + "/lib/slurm".
1749
1750
1751 PlugStackConfig
1752 Location of the config file for Slurm stackable plugins that use
1753 the Stackable Plugin Architecture for Node and job (K)control
1754 (SPANK). This provides support for a highly configurable set of
1755 plugins to be called before and/or after execution of each task
1756 spawned as part of a user's job step. Default location is
1757 "plugstack.conf" in the same directory as the system slurm.conf.
1758 For more information on SPANK plugins, see the spank(8) manual.
1759
1760
1761 PowerParameters
1762 System power management parameters. The supported parameters
1763 are specific to the PowerPlugin. Changes to this value take ef‐
1764 fect when the Slurm daemons are reconfigured. More information
1765 about system power management is available here
1766 <https://slurm.schedmd.com/power_mgmt.html>. Options currently
1767 supported by any plugins are listed below.
1768
1769 balance_interval=#
1770 Specifies the time interval, in seconds, between attempts
1771 to rebalance power caps across the nodes. This also con‐
1772 trols the frequency at which Slurm attempts to collect
1773 current power consumption data (old data may be used un‐
1774 til new data is available from the underlying infrastruc‐
1775 ture and values below 10 seconds are not recommended for
1776 Cray systems). The default value is 30 seconds. Sup‐
1777 ported by the power/cray_aries plugin.
1778
1779 capmc_path=
1780 Specifies the absolute path of the capmc command. The
1781 default value is "/opt/cray/capmc/default/bin/capmc".
1782 Supported by the power/cray_aries plugin.
1783
1784 cap_watts=#
1785 Specifies the total power limit to be established across
1786 all compute nodes managed by Slurm. A value of 0 sets
1787 every compute node to have an unlimited cap. The default
1788 value is 0. Supported by the power/cray_aries plugin.
1789
1790 decrease_rate=#
1791 Specifies the maximum rate of change in the power cap for
1792 a node where the actual power usage is below the power
1793 cap by an amount greater than lower_threshold (see be‐
1794 low). Value represents a percentage of the difference
1795 between a node's minimum and maximum power consumption.
1796 The default value is 50 percent. Supported by the
1797 power/cray_aries plugin.
1798
1799 get_timeout=#
1800 Amount of time allowed to get power state information in
1801 milliseconds. The default value is 5,000 milliseconds or
1802 5 seconds. Supported by the power/cray_aries plugin and
1803 represents the time allowed for the capmc command to re‐
1804 spond to various "get" options.
1805
1806 increase_rate=#
1807 Specifies the maximum rate of change in the power cap for
1808 a node where the actual power usage is within up‐
1809 per_threshold (see below) of the power cap. Value repre‐
1810 sents a percentage of the difference between a node's
1811 minimum and maximum power consumption. The default value
1812 is 20 percent. Supported by the power/cray_aries plugin.
1813
1814 job_level
1815 All nodes associated with every job will have the same
1816 power cap, to the extent possible. Also see the
1817 --power=level option on the job submission commands.
1818
1819 job_no_level
1820 Disable the user's ability to set every node associated
1821 with a job to the same power cap. Each node will have
1822 its power cap set independently. This disables the
1823 --power=level option on the job submission commands.
1824
1825 lower_threshold=#
1826 Specify a lower power consumption threshold. If a node's
1827 current power consumption is below this percentage of its
1828 current cap, then its power cap will be reduced. The de‐
1829 fault value is 90 percent. Supported by the
1830 power/cray_aries plugin.
1831
1832 recent_job=#
1833 If a job has started or resumed execution (from suspend)
1834 on a compute node within this number of seconds from the
1835 current time, the node's power cap will be increased to
1836 the maximum. The default value is 300 seconds. Sup‐
1837 ported by the power/cray_aries plugin.
1838
1839
1840 set_timeout=#
1841 Amount of time allowed to set power state information in
1842 milliseconds. The default value is 30,000 milliseconds
1843 or 30 seconds. Supported by the power/cray_aries plugin and
1844 represents the time allowed for the capmc command to re‐
1845 spond to various "set" options.
1846
1847 set_watts=#
1848 Specifies the power limit to be set on every compute
1849 node managed by Slurm. Every node gets this same power
1850 cap and there is no variation through time based upon ac‐
1851 tual power usage on the node. Supported by the
1852 power/cray_aries plugin.
1853
1854 upper_threshold=#
1855 Specify an upper power consumption threshold. If a
1856 node's current power consumption is above this percentage
1857 of its current cap, then its power cap will be increased
1858 to the extent possible. The default value is 95 percent.
1859 Supported by the power/cray_aries plugin.
1860
1861
1862 PowerPlugin
1863 Identifies the plugin used for system power management. Cur‐
1864 rently supported plugins include: cray_aries and none. Changes
1865 to this value require restarting Slurm daemons to take effect.
1866 More information about system power management is available here
1867 <https://slurm.schedmd.com/power_mgmt.html>. By default, no
1868 power plugin is loaded.
1869
1870
1871 PreemptMode
1872 Mechanism used to preempt jobs or enable gang scheduling. When
1873 the PreemptType parameter is set to enable preemption, the Pre‐
1874 emptMode selects the default mechanism used to preempt the eli‐
1875 gible jobs for the cluster.
1876 PreemptMode may be specified on a per partition basis to over‐
1877 ride this default value if PreemptType=preempt/partition_prio.
1878 Alternatively, it can be specified on a per QOS basis if Pre‐
1879 emptType=preempt/qos. In either case, a valid default Preempt‐
1880 Mode value must be specified for the cluster as a whole when
1881 preemption is enabled.
1882 The GANG option is used to enable gang scheduling independent of
1883 whether preemption is enabled (i.e. independent of the Preempt‐
1884 Type setting). It can be specified in addition to a PreemptMode
1885 setting with the two options comma separated (e.g. Preempt‐
1886 Mode=SUSPEND,GANG).
1887 See <https://slurm.schedmd.com/preempt.html> and
1888 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
1889 tails.
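A minimal sketch (node and partition names are hypothetical) combining suspension with gang scheduling:

```
# slurm.conf excerpt: jobs in the "high" partition may suspend jobs in
# "low"; the Gang scheduler resumes the suspended jobs afterwards.
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
PartitionName=high Nodes=node[1-8] PriorityTier=2
PartitionName=low  Nodes=node[1-8] PriorityTier=1 Default=YES
```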
1890
1891 NOTE: For performance reasons, the backfill scheduler reserves
1892 whole nodes for jobs, not partial nodes. If during backfill
1893 scheduling a job preempts one or more other jobs, the whole
1894 nodes for those preempted jobs are reserved for the preemptor
1895 job, even if the preemptor job requested fewer resources than
1896 that. These reserved nodes aren't available to other jobs dur‐
1897 ing that backfill cycle, even if the other jobs could fit on the
1898 nodes. Therefore, jobs may preempt more resources during a sin‐
1899 gle backfill iteration than they requested.
1900
1901 NOTE: For a heterogeneous job to be considered for preemption all
1902 components must be eligible for preemption. When a heterogeneous
1903 job is to be preempted the first identified component of the job
1904 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
1905 CANCEL (lowest)) will be used to set the PreemptMode for all
1906 components. The GraceTime and user warning signal for each com‐
1907 ponent of the heterogeneous job remain unique. Heterogeneous
1908 jobs are excluded from GANG scheduling operations.
1909
1910 OFF Is the default value and disables job preemption and
1911 gang scheduling. It is only compatible with Pre‐
1912 emptType=preempt/none at a global level. A common
1913 use case for this parameter is to set it on a parti‐
1914 tion to disable preemption for that partition.
1915
1916 CANCEL The preempted job will be cancelled.
1917
1918 GANG Enables gang scheduling (time slicing) of jobs in
1919 the same partition, and allows the resuming of sus‐
1920 pended jobs.
1921
1922 NOTE: Gang scheduling is performed independently for
1923 each partition, so if you only want time-slicing by
1924 OverSubscribe, without any preemption, then config‐
1925 uring partitions with overlapping nodes is not rec‐
1926 ommended. On the other hand, if you want to use
1927 PreemptType=preempt/partition_prio to allow jobs
1928 from higher PriorityTier partitions to Suspend jobs
1929 from lower PriorityTier partitions you will need
1930 overlapping partitions, and PreemptMode=SUSPEND,GANG
1931 to use the Gang scheduler to resume the suspended
1932 job(s). In any case, time-slicing won't happen be‐
1933 tween jobs on different partitions.
1934
1935 NOTE: Heterogeneous jobs are excluded from GANG
1936 scheduling operations.
1937
1938 REQUEUE Preempts jobs by requeuing them (if possible) or
1939 canceling them. For jobs to be requeued they must
1940 have the --requeue sbatch option set or the cluster
1941 wide JobRequeue parameter in slurm.conf must be set
1942 to one.
1943
1944 SUSPEND The preempted jobs will be suspended, and later the
1945 Gang scheduler will resume them. Therefore the SUS‐
1946 PEND preemption mode always needs the GANG option to
1947 be specified at the cluster level. Also, because the
1948 suspended jobs will still use memory on the allo‐
1949 cated nodes, Slurm needs to be able to track memory
1950 resources to be able to suspend jobs.
1951
1952 NOTE: Because gang scheduling is performed indepen‐
1953 dently for each partition, if using PreemptType=pre‐
1954 empt/partition_prio then jobs in higher PriorityTier
1955 partitions will suspend jobs in lower PriorityTier
1956 partitions to run on the released resources. Only
1957 when the preemptor job ends will the suspended jobs
1958 be resumed by the Gang scheduler.
1959
1960 NOTE: Suspended jobs will not release GRES. Higher
1961 priority jobs will not be able to preempt to gain
1962 access to GRES.
1963 If PreemptType=preempt/qos is configured and if the
1964 preempted job(s) and the preemptor job are on the
1965 same partition, then they will share resources with
1966 the Gang scheduler (time-slicing). If not (i.e. if
1967 the preemptees and preemptor are on different parti‐
1968 tions) then the preempted jobs will remain suspended
1969 until the preemptor ends.
1970
1971
1972 PreemptType
1973 Specifies the plugin used to identify which jobs can be pre‐
1974 empted in order to start a pending job.
1975
1976 preempt/none
1977 Job preemption is disabled. This is the default.
1978
1979 preempt/partition_prio
1980 Job preemption is based upon partition PriorityTier.
1981 Jobs in higher PriorityTier partitions may preempt jobs
1982 from lower PriorityTier partitions. This is not compati‐
1983 ble with PreemptMode=OFF.
1984
1985 preempt/qos
1986 Job preemption rules are specified by Quality Of Service
1987 (QOS) specifications in the Slurm database. This option
1988 is not compatible with PreemptMode=OFF. A configuration
1989 of PreemptMode=SUSPEND is only supported by the Select‐
1990 Type=select/cons_res and SelectType=select/cons_tres
1991 plugins. See the sacctmgr man page to configure the op‐
1992 tions for preempt/qos.
1993
1994
1995 PreemptExemptTime
1996 Global option for minimum run time for all jobs before they can
1997 be considered for preemption. Any QOS PreemptExemptTime takes
1998 precedence over the global option. A time of -1 disables the
1999 option, equivalent to 0. Acceptable time formats include "min‐
2000 utes", "minutes:seconds", "hours:minutes:seconds", "days-hours",
2001 "days-hours:minutes", and "days-hours:minutes:seconds".
2002
2003
2004 PrEpParameters
2005 Parameters to be passed to the PrEpPlugins.
2006
2007
2008 PrEpPlugins
2009 A resource for programmers wishing to write their own plugins
2010 for the Prolog and Epilog (PrEp) scripts. The default, and cur‐
2011 rently the only implemented plugin is prep/script. Additional
2012 plugins can be specified in a comma-separated list. For more in‐
2013 formation please see the PrEp Plugin API documentation page:
2014 <https://slurm.schedmd.com/prep_plugins.html>
2015
2016
2017 PriorityCalcPeriod
2018 The period of time in minutes in which the half-life decay will
2019 be re-calculated. Applicable only if PriorityType=priority/mul‐
2020 tifactor. The default value is 5 (minutes).
2021
2022
2023 PriorityDecayHalfLife
2024 This controls how long prior resource use is considered in de‐
2025 termining how over- or under-serviced an association is (user,
2026 bank account and cluster) in determining job priority. The
2027 record of usage will be decayed over time, with half of the
2028 original value cleared at age PriorityDecayHalfLife. If set to
2029 0 no decay will be applied. This is helpful if you want to en‐
2030 force hard time limits per association. If set to 0 Priori‐
2031 tyUsageResetPeriod must be set to some interval. Applicable
2032 only if PriorityType=priority/multifactor. The unit is a time
2033 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
2034 default value is 7-0 (7 days).
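The half-life behavior can be illustrated with a short calculation (the usage value is hypothetical):

```shell
# With PriorityDecayHalfLife=7-0, usage recorded 28 days ago retains
# 0.5^(28/7) = 1/16 of its original weight.
awk 'BEGIN {
    half_life_days = 7
    age_days = 28
    original_usage = 1000
    print original_usage * 0.5 ^ (age_days / half_life_days)
}'
```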
2035
2036
2037 PriorityFavorSmall
2038 Specifies that small jobs should be given preferential schedul‐
2039 ing priority. Applicable only if PriorityType=priority/multi‐
2040 factor. Supported values are "YES" and "NO". The default value
2041 is "NO".
2042
2043
2044 PriorityFlags
2045 Flags to modify priority behavior. Applicable only if Priority‐
2046 Type=priority/multifactor. The keywords below have no associ‐
2047 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
2048 TIVE_TO_TIME").
2049
2050 ACCRUE_ALWAYS If set, priority age factor will be increased
2051 despite job dependencies or holds.
2052
2053 CALCULATE_RUNNING
2054 If set, priorities will be recalculated not
2055 only for pending jobs, but also running and
2056 suspended jobs.
2057
2058 DEPTH_OBLIVIOUS If set, priority will be calculated simi‐
2059 larly to the normal multifactor calculation, but
2060 the depth of the associations in the tree does
2061 not adversely affect their priority. This option
2062 automatically enables NO_FAIR_TREE.
2063
2064 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
2065 to "classic" fair share priority scheduling.
2066
2067 INCR_ONLY If set, priority values will only increase in
2068 value. Job priority will never decrease in
2069 value.
2070
2071 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
2072 BillingWeights) is calculated as the MAX of in‐
2073 dividual TRES' on a node (e.g. cpus, mem, gres)
2074 plus the sum of all global TRES' (e.g. li‐
2075 censes).
2076
2077 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
2078
2079 NO_NORMAL_ASSOC If set, the association factor is not normal‐
2080 ized against the highest association priority.
2081
2082 NO_NORMAL_PART If set, the partition factor is not normalized
2083 against the highest partition PriorityJobFac‐
2084 tor.
2085
2086 NO_NORMAL_QOS If set, the QOS factor is not normalized
2087 against the highest qos priority.
2088
2089 NO_NORMAL_TRES If set, the QOS factor is not normalized
2090 against the job's partition TRES counts.
2091
2092 SMALL_RELATIVE_TO_TIME
2093 If set, the job's size component will be based
2094 upon not the job size alone, but the job's size
2095 divided by its time limit.
2096
2097
2098 PriorityMaxAge
2099 Specifies the job age which will be given the maximum age factor
2100 in computing priority. For example, a value of 30 minutes would
2101 result in all jobs over 30 minutes old getting the same
2102 age-based priority. Applicable only if PriorityType=prior‐
2103 ity/multifactor. The unit is a time string (i.e. min,
2104 hr:min:00, days-hr:min:00, or days-hr). The default value is
2105 7-0 (7 days).
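The saturation behavior can be sketched as follows (the wait times chosen are arbitrary):

```shell
# With PriorityMaxAge=7-0, the age factor grows linearly with queue wait
# time and saturates at 1.0 once the job has waited 7 days (10080 min).
awk 'BEGIN {
    max_age_min = 7 * 24 * 60
    for (wait = 0; wait <= 15120; wait += 5040) {
        f = wait / max_age_min
        if (f > 1) f = 1
        printf "wait=%d min  age_factor=%.2f\n", wait, f
    }
}'
```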
2106
2107
2108 PriorityParameters
2109 Arbitrary string used by the PriorityType plugin.
2110
2111
2112 PrioritySiteFactorParameters
2113 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
2114
2115
2116 PrioritySiteFactorPlugin
2117 This specifies an optional plugin to be used alongside "prior‐
2118 ity/multifactor", which is meant to initially set and continu‐
2119 ously update the SiteFactor priority factor. The default value
2120 is "site_factor/none".
2121
2122
2123 PriorityType
2124 This specifies the plugin to be used in establishing a job's
2125 scheduling priority. Also see PriorityFlags for configuration
2126 options. The default value is "priority/basic".
2127
2128 priority/basic
2129 Jobs are evaluated in a First In, First Out (FIFO) man‐
2130 ner.
2131
2132 priority/multifactor
2133 Jobs are assigned a priority based upon a variety of fac‐
2134 tors that include size, age, Fairshare, etc.
2135 When not FIFO scheduling, jobs are prioritized in the following
2136 order:
2137
2138 1. Jobs that can preempt
2139 2. Jobs with an advanced reservation
2140 3. Partition PriorityTier
2141 4. Job priority
2142 5. Job submit time
2143 6. Job ID
2144
2145
2146 PriorityUsageResetPeriod
2147 At this interval the usage of associations will be reset to 0.
2148 This is used if you want to enforce hard limits of time usage
2149 per association. If PriorityDecayHalfLife is set to be 0 no de‐
2150 cay will happen and this is the only way to reset the usage ac‐
2151 cumulated by running jobs. By default this is turned off and it
2152 is advised to use the PriorityDecayHalfLife option to avoid not
2153 having anything running on your cluster, but if your schema is
2154 set up to only allow certain amounts of time on your system this
2155 is the way to do it. Applicable only if PriorityType=prior‐
2156 ity/multifactor.
2157
2158 NONE Never clear historic usage. The default value.
2159
2160 NOW Clear the historic usage now. Executed at startup
2161 and reconfiguration time.
2162
2163 DAILY Cleared every day at midnight.
2164
2165 WEEKLY Cleared every week on Sunday at time 00:00.
2166
2167 MONTHLY Cleared on the first day of each month at time
2168 00:00.
2169
2170 QUARTERLY Cleared on the first day of each quarter at time
2171 00:00.
2172
2173 YEARLY Cleared on the first day of each year at time 00:00.
2174
2175
2176 PriorityWeightAge
2177 An integer value that sets the degree to which the queue wait
2178 time component contributes to the job's priority. Applicable
2179 only if PriorityType=priority/multifactor. Requires Account‐
2180 ingStorageType=accounting_storage/slurmdbd. The default value
2181 is 0.
2182
2183
2184 PriorityWeightAssoc
2185 An integer value that sets the degree to which the association
2186 component contributes to the job's priority. Applicable only if
2187 PriorityType=priority/multifactor. The default value is 0.
2188
2189
2190 PriorityWeightFairshare
2191 An integer value that sets the degree to which the fair-share
2192 component contributes to the job's priority. Applicable only if
2193 PriorityType=priority/multifactor. Requires AccountingStor‐
2194 ageType=accounting_storage/slurmdbd. The default value is 0.
2195
2196
2197 PriorityWeightJobSize
2198 An integer value that sets the degree to which the job size com‐
2199 ponent contributes to the job's priority. Applicable only if
2200 PriorityType=priority/multifactor. The default value is 0.
2201
2202
2203 PriorityWeightPartition
2204 Partition factor used by priority/multifactor plugin in calcu‐
2205 lating job priority. Applicable only if PriorityType=prior‐
2206 ity/multifactor. The default value is 0.
2207
2208
2209 PriorityWeightQOS
2210 An integer value that sets the degree to which the Quality Of
2211 Service component contributes to the job's priority. Applicable
2212 only if PriorityType=priority/multifactor. The default value is
2213 0.
2214
2215
2216 PriorityWeightTRES
2217 A comma-separated list of TRES Types and weights that sets the
2218 degree that each TRES Type contributes to the job's priority.
2219
2220 e.g.
2221 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
2222
2223 Applicable only if PriorityType=priority/multifactor and if Ac‐
2224 countingStorageTRES is configured with each TRES Type. Negative
2225 values are allowed. The default values are 0.
2226
2227
2228 PrivateData
2229 This controls what type of information is hidden from regular
2230 users. By default, all information is visible to all users.
2231 User SlurmUser and root can always view all information. Multi‐
2232 ple values may be specified with a comma separator. Acceptable
2233 values include:
2234
2235 accounts
2236 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2237 ing any account definitions unless they are coordinators
2238 of them.
2239
2240 cloud Powered down nodes in the cloud are visible.
2241
2242 events Prevents users from viewing event information unless they
2243 have operator status or above.
2244
2245 jobs Prevents users from viewing jobs or job steps belonging
2246 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
2247 users from viewing job records belonging to other users
2248 unless they are coordinators of the association running
2249 the job when using sacct.
2250
2251 nodes Prevents users from viewing node state information.
2252
2253 partitions
2254 Prevents users from viewing partition state information.
2255
2256 reservations
2257 Prevents regular users from viewing reservations which
2258 they can not use.
2259
2260 usage Prevents users from viewing usage of any other user; this
2261 applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
2262 vents users from viewing usage of any other user; this
2263 applies to sreport.
2264
2265 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2266 ing information of any user other than themselves; this
2267 also restricts them to viewing only associations they
2268 are involved with. Coordinators can see associations of all
2269 users in the account they are coordinator of, but can
2270 only see themselves when listing users.
2271
2272
2273 ProctrackType
2274 Identifies the plugin to be used for process tracking on a job
2275 step basis. The slurmd daemon uses this mechanism to identify
2276 all processes which are children of processes it spawns for a
2277 user job step. The slurmd daemon must be restarted for a change
2278 in ProctrackType to take effect. NOTE: "proctrack/linuxproc"
2279 and "proctrack/pgid" can fail to identify all processes associ‐
2280 ated with a job since processes can become a child of the init
2281 process (when the parent process terminates) or change their
2282 process group. To reliably track all processes, "proc‐
2283 track/cgroup" is highly recommended. NOTE: The JobContainerType
2284 applies to a job allocation, while ProctrackType applies to job
2285 steps. Acceptable values at present include:
2286
2287 proctrack/cgroup
2288 Uses linux cgroups to constrain and track processes, and
2289 is the default for systems with cgroup support.
2290 NOTE: see "man cgroup.conf" for configuration details.
2291
2292 proctrack/cray_aries
2293 Uses Cray proprietary process tracking.
2294
2295 proctrack/linuxproc
2296 Uses linux process tree using parent process IDs.
2297
2298 proctrack/pgid
2299 Uses Process Group IDs.
2300 NOTE: This is the default for the BSD family.
2301
2302
2303 Prolog Fully qualified pathname of a program for the slurmd to execute
2304 whenever it is asked to run a job step from a new job allocation
2305 (e.g. "/usr/local/slurm/prolog"). A glob pattern (see glob(7))
2306 may also be used to specify more than one program to run (e.g.
2307 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2308 starting the first job step. The prolog script or scripts may
2309 be used to purge files, enable user login, etc. By default
2310 there is no prolog. Any configured script is expected to com‐
2311 plete execution quickly (in less time than MessageTimeout). If
2312 the prolog fails (returns a non-zero exit code), this will re‐
2313 sult in the node being set to a DRAIN state and the job being
2314 requeued in a held state, unless nohold_on_prolog_fail is con‐
2315 figured in SchedulerParameters. See Prolog and Epilog Scripts
2316 for more information.
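A hypothetical prolog sketch, not a recommended production script: it creates a per-job scratch directory before the first step runs. Slurm exports SLURM_JOB_ID (among other variables) to the prolog environment; remember that a non-zero exit status would DRAIN the node.

```shell
#!/bin/sh
# Hypothetical prolog sketch: create a per-job scratch directory.
# SLURM_JOB_ID is set by slurmd when the prolog runs; a fallback is
# used here only so the sketch can run outside of Slurm.
SCRATCH="/tmp/scratch.${SLURM_JOB_ID:-test}"
mkdir -p "$SCRATCH" || exit 1   # non-zero exit drains the node
chmod 700 "$SCRATCH"
```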
2317
2318
2319 PrologEpilogTimeout
2320 The interval in seconds Slurm waits for Prolog and Epilog be‐
2321 fore terminating them. The default behavior is to wait indefi‐
2322 nitely. This interval applies to the Prolog and Epilog run by
2323 slurmd daemon before and after the job, the PrologSlurmctld and
2324 EpilogSlurmctld run by slurmctld daemon, and the SPANK plugins
2325 run by the slurmstepd daemon.
2326
2327
2328 PrologFlags
2329 Flags to control the Prolog behavior. By default no flags are
2330 set. Multiple flags may be specified in a comma-separated list.
2331 Currently supported options are:
2332
2333 Alloc If set, the Prolog script will be executed at job allo‐
2334 cation. By default, Prolog is executed just before the
2335 task is launched. Therefore, when salloc is started, no
2336 Prolog is executed. Alloc is useful for preparing things
2337 before a user starts to use any allocated resources. In
2338 particular, this flag is needed on a Cray system when
2339 cluster compatibility mode is enabled.
2340
2341 NOTE: Use of the Alloc flag will increase the time re‐
2342 quired to start jobs.
2343
2344 Contain At job allocation time, use the ProcTrack plugin to cre‐
2345 ate a job container on all allocated compute nodes.
2346 This container may be used for user processes not
2347 launched under Slurm control, for example
2348 pam_slurm_adopt may place processes launched through a
2349 direct user login into this container. If using
2350 pam_slurm_adopt, then ProcTrackType must be set to ei‐
2351 ther proctrack/cgroup or proctrack/cray_aries. Setting
2352 the Contain flag implicitly sets the Alloc flag.
2353
              NoHold  If set, the Alloc flag should also be set.  This
                      allows salloc to return without blocking until the
                      prolog has finished on each node.  Blocking
                      instead occurs when steps reach the slurmd, before
                      any execution has happened in the step.  This is
                      much faster, and is recommended when using srun to
                      launch tasks.  This flag cannot be combined with
                      the Contain or X11 flags.
2362
2363 Serial By default, the Prolog and Epilog scripts run concur‐
2364 rently on each node. This flag forces those scripts to
2365 run serially within each node, but with a significant
2366 penalty to job throughput on each node.
2367
2368 X11 Enable Slurm's built-in X11 forwarding capabilities.
2369 This is incompatible with ProctrackType=proctrack/linux‐
2370 proc. Setting the X11 flag implicitly enables both Con‐
2371 tain and Alloc flags as well.
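
              For example, a site using pam_slurm_adopt might enable job
              containers at allocation time with the following
              illustrative settings (Contain implies Alloc; the
              proctrack plugin choice depends on the system):

                     PrologFlags=Contain
                     ProctrackType=proctrack/cgroup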
2372
2373
       PrologSlurmctld
              Fully qualified pathname of a program for the slurmctld
              daemon to execute before granting a new job allocation
              (e.g. "/usr/local/slurm/prolog_controller").  The program
              executes as SlurmUser on the same node where the slurmctld
              daemon executes, giving it permission to drain nodes and
              requeue the job if a failure occurs, or cancel the job if
              appropriate.  Exactly what the program does and how it
              accomplishes this is completely at the discretion of the
              system administrator.  Information about the job being
              initiated, its allocated nodes, etc. are passed to the
              program using environment variables.  While this program
              is running, the nodes associated with the job will have a
              POWER_UP/CONFIGURING flag set in their state, which can be
              readily viewed.  The slurmctld daemon will wait
              indefinitely for this program to complete.  Once the
              program completes with an exit code of zero, the nodes
              will be considered ready for use and the job will be
              started.  If some node cannot be made available for use,
              the program should drain the node (typically using the
              scontrol command) and terminate with a non-zero exit code.
              A non-zero exit code will result in the job being requeued
              (where possible) or killed.  Note that only batch jobs can
              be requeued.  See Prolog and Epilog Scripts for more
              information.
2397
2398
2399 PropagatePrioProcess
2400 Controls the scheduling priority (nice value) of user spawned
2401 tasks.
2402
2403 0 The tasks will inherit the scheduling priority from the
2404 slurm daemon. This is the default value.
2405
2406 1 The tasks will inherit the scheduling priority of the com‐
2407 mand used to submit them (e.g. srun or sbatch). Unless the
2408 job is submitted by user root, the tasks will have a sched‐
2409 uling priority no higher than the slurm daemon spawning
2410 them.
2411
              2  The tasks will inherit the scheduling priority of the
                 command used to submit them (e.g. srun or sbatch), with
                 the restriction that their nice value will always be
                 one higher than the slurm daemon (i.e. the tasks'
                 scheduling priority will be lower than the slurm
                 daemon's).
2417
2418
       PropagateResourceLimits
              A comma-separated list of resource limit names.  The
              slurmd daemon uses these names to obtain the associated
              (soft) limit values from the user's process environment on
              the submit node.  These limits are then propagated and
              applied to the jobs that will run on the compute nodes.
              This parameter can be useful when system limits vary among
              nodes.  Any resource limits that do not appear in the list
              are not propagated.  However, the user can override this
              by specifying which resource limits to propagate with the
              sbatch or srun "--propagate" option.  If neither
              PropagateResourceLimits nor PropagateResourceLimitsExcept
              is configured and the "--propagate" option is not
              specified, then the default action is to propagate all
              limits.  Only one of the parameters, either
              PropagateResourceLimits or PropagateResourceLimitsExcept,
              may be specified.  The user limits cannot exceed hard
              limits under which the slurmd daemon operates.  If the
              user limits are not propagated, the limits from the slurmd
              daemon will be propagated to the user's job.  The limits
              used for the Slurm daemons can be set in the
              /etc/sysconfig/slurm file.  For more information, see:
              https://slurm.schedmd.com/faq.html#memlock
              The following limit names are supported by Slurm (although
              some options may not be supported on some systems):
2441
2442 ALL All limits listed below (default)
2443
2444 NONE No limits listed below
2445
2446 AS The maximum address space (virtual memory) for a
2447 process.
2448
2449 CORE The maximum size of core file
2450
2451 CPU The maximum amount of CPU time
2452
2453 DATA The maximum size of a process's data segment
2454
2455 FSIZE The maximum size of files created. Note that if the
2456 user sets FSIZE to less than the current size of the
2457 slurmd.log, job launches will fail with a 'File size
2458 limit exceeded' error.
2459
2460 MEMLOCK The maximum size that may be locked into memory
2461
2462 NOFILE The maximum number of open files
2463
2464 NPROC The maximum number of processes available
2465
2466 RSS The maximum resident set size. Note that this only
2467 has effect with Linux kernels 2.4.30 or older or BSD.
2468
2469 STACK The maximum stack size
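
              For example, to propagate only the locked-memory and
              open-file limits from the submit node (an illustrative
              configuration):

                     PropagateResourceLimits=MEMLOCK,NOFILE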
2470
2471
       PropagateResourceLimitsExcept
              A comma-separated list of resource limit names.  By
              default, all resource limits will be propagated (as
              described by the PropagateResourceLimits parameter),
              except for the limits appearing in this list.  The user
              can override this by specifying which resource limits to
              propagate with the sbatch or srun "--propagate" option.
              See PropagateResourceLimits above for a list of valid
              limit names.
2480
2481
       RebootProgram
              Program to be executed on each compute node to reboot it.
              Invoked on each node once it becomes idle after the
              command "scontrol reboot" is executed by an authorized
              user or a job is submitted with the "--reboot" option.
              After rebooting, the node is returned to normal use.  See
              ResumeTimeout to configure how long to wait for a reboot
              to complete.  A node will be marked DOWN if it does not
              reboot within ResumeTimeout.
2490
2491
2492 ReconfigFlags
2493 Flags to control various actions that may be taken when an
2494 "scontrol reconfig" command is issued. Currently the options
2495 are:
2496
2497 KeepPartInfo If set, an "scontrol reconfig" command will
2498 maintain the in-memory value of partition
2499 "state" and other parameters that may have been
2500 dynamically updated by "scontrol update". Par‐
2501 tition information in the slurm.conf file will
2502 be merged with in-memory data. This flag su‐
2503 persedes the KeepPartState flag.
2504
2505 KeepPartState If set, an "scontrol reconfig" command will
2506 preserve only the current "state" value of
2507 in-memory partitions and will reset all other
2508 parameters of the partitions that may have been
2509 dynamically updated by "scontrol update" to the
2510 values from the slurm.conf file. Partition in‐
2511 formation in the slurm.conf file will be merged
2512 with in-memory data.
              Neither flag is set by default, in which case "scontrol
              reconfig" will rebuild the partition information using
              only the definitions in the slurm.conf file.
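
              For example, to preserve dynamically updated partition
              parameters across a reconfiguration (an illustrative
              setting):

                     ReconfigFlags=KeepPartInfo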
2516
2517
       RequeueExit
              Enables automatic requeue for batch jobs which exit with
              the specified values.  Separate multiple exit codes with
              commas and/or specify numeric ranges using a "-" separator
              (e.g. "RequeueExit=1-9,18").  Jobs will be returned to the
              pending state and later scheduled again.  Restarted jobs
              will have the environment variable SLURM_RESTART_COUNT set
              to the number of times the job has been restarted.
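
              For example, to requeue batch jobs that exit with codes 1
              through 9 or 18 (an illustrative setting; a batch script
              can inspect SLURM_RESTART_COUNT to behave differently on a
              restart):

                     RequeueExit=1-9,18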
2526
2527
       RequeueExitHold
              Enables automatic requeue for batch jobs which exit with
              the specified values, with these jobs being held until
              released manually by the user.  Separate multiple exit
              codes with commas and/or specify numeric ranges using a
              "-" separator (e.g. "RequeueExitHold=10-12,16").  These
              jobs are put in the JOB_SPECIAL_EXIT exit state.
              Restarted jobs will have the environment variable
              SLURM_RESTART_COUNT set to the number of times the job has
              been restarted.
2537
2538
       ResumeFailProgram
              The program that will be executed when nodes fail to
              resume by ResumeTimeout.  The argument to the program will
              be the names of the failed nodes (using Slurm's hostlist
              expression format).
2543
2544
2545 ResumeProgram
2546 Slurm supports a mechanism to reduce power consumption on nodes
2547 that remain idle for an extended period of time. This is typi‐
2548 cally accomplished by reducing voltage and frequency or powering
2549 the node down. ResumeProgram is the program that will be exe‐
2550 cuted when a node in power save mode is assigned work to per‐
2551 form. For reasons of reliability, ResumeProgram may execute
2552 more than once for a node when the slurmctld daemon crashes and
2553 is restarted. If ResumeProgram is unable to restore a node to
2554 service with a responding slurmd and an updated BootTime, it
2555 should requeue any job associated with the node and set the node
              state to DOWN.  If the node isn't actually rebooted (i.e.
              when multiple-slurmd is configured), starting slurmd with
              the "-b" option might be useful.  The program executes as
              SlurmUser.  The argument to the program will be the names
              of nodes to be removed
2560 from power savings mode (using Slurm's hostlist expression for‐
2561 mat). A job to node mapping is available in JSON format by read‐
2562 ing the temporary file specified by the SLURM_RESUME_FILE envi‐
2563 ronment variable. By default no program is run.
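
              An illustrative power-saving configuration (the program
              path is hypothetical; the program receives a hostlist
              expression such as "node[01-08]" as its argument):

                     ResumeProgram=/usr/local/sbin/slurm_resume
                     ResumeTimeout=600
                     ResumeRate=100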
2564
2565
       ResumeRate
              The rate at which nodes in power save mode are returned to
              normal operation by ResumeProgram.  The value is the
              number of nodes per minute and it can be used to prevent
              power surges if a large number of nodes in power save mode
              are assigned work at the same time (e.g. when a large job
              starts).  A value of zero results in no limits being
              imposed.  The default value is 300 nodes per minute.
2574
2575
2576 ResumeTimeout
2577 Maximum time permitted (in seconds) between when a node resume
2578 request is issued and when the node is actually available for
2579 use. Nodes which fail to respond in this time frame will be
2580 marked DOWN and the jobs scheduled on the node requeued. Nodes
2581 which reboot after this time frame will be marked DOWN with a
2582 reason of "Node unexpectedly rebooted." The default value is 60
2583 seconds.
2584
2585
2586 ResvEpilog
2587 Fully qualified pathname of a program for the slurmctld to exe‐
2588 cute when a reservation ends. The program can be used to cancel
2589 jobs, modify partition configuration, etc. The reservation
2590 named will be passed as an argument to the program. By default
2591 there is no epilog.
2592
2593
2594 ResvOverRun
2595 Describes how long a job already running in a reservation should
2596 be permitted to execute after the end time of the reservation
2597 has been reached. The time period is specified in minutes and
2598 the default value is 0 (kill the job immediately). The value
2599 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2600 supported to permit a job to run indefinitely after its reserva‐
2601 tion is terminated.
2602
2603
2604 ResvProlog
2605 Fully qualified pathname of a program for the slurmctld to exe‐
2606 cute when a reservation begins. The program can be used to can‐
2607 cel jobs, modify partition configuration, etc. The reservation
2608 named will be passed as an argument to the program. By default
2609 there is no prolog.
2610
2611
2612 ReturnToService
2613 Controls when a DOWN node will be returned to service. The de‐
2614 fault value is 0. Supported values include
2615
2616 0 A node will remain in the DOWN state until a system adminis‐
2617 trator explicitly changes its state (even if the slurmd dae‐
2618 mon registers and resumes communications).
2619
2620 1 A DOWN node will become available for use upon registration
2621 with a valid configuration only if it was set DOWN due to
2622 being non-responsive. If the node was set DOWN for any
2623 other reason (low memory, unexpected reboot, etc.), its
2624 state will not automatically be changed. A node registers
2625 with a valid configuration if its memory, GRES, CPU count,
2626 etc. are equal to or greater than the values configured in
2627 slurm.conf.
2628
2629 2 A DOWN node will become available for use upon registration
2630 with a valid configuration. The node could have been set
2631 DOWN for any reason. A node registers with a valid configu‐
2632 ration if its memory, GRES, CPU count, etc. are equal to or
2633 greater than the values configured in slurm.conf.
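
              For example, to automatically return nodes that were set
              DOWN only because they were non-responsive (an
              illustrative setting):

                     ReturnToService=1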
2634
2635
2636 RoutePlugin
2637 Identifies the plugin to be used for defining which nodes will
2638 be used for message forwarding.
2639
2640 route/default
2641 default, use TreeWidth.
2642
2643 route/topology
2644 use the switch hierarchy defined in a topology.conf file.
2645 TopologyPlugin=topology/tree is required.
2646
2647
2648 SchedulerParameters
2649 The interpretation of this parameter varies by SchedulerType.
2650 Multiple options may be comma separated.
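
              For example, a backfill-oriented configuration might
              combine several of the options described below
              (illustrative values that should be tuned per site):

                     SchedulerType=sched/backfill
                     SchedulerParameters=bf_continue,bf_interval=60,bf_max_job_test=1000,bf_window=2880,bf_resolution=300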
2651
2652 allow_zero_lic
2653 If set, then job submissions requesting more than config‐
2654 ured licenses won't be rejected.
2655
              assoc_limit_stop
                     If set and a job cannot start due to association
                     limits, then do not attempt to initiate any lower
                     priority jobs in that partition.  Setting this can
                     decrease system throughput and utilization, but it
                     avoids potentially starving larger jobs whose
                     launch would otherwise be deferred indefinitely.
2663
2664 batch_sched_delay=#
2665 How long, in seconds, the scheduling of batch jobs can be
2666 delayed. This can be useful in a high-throughput envi‐
2667 ronment in which batch jobs are submitted at a very high
2668 rate (i.e. using the sbatch command) and one wishes to
2669 reduce the overhead of attempting to schedule each job at
2670 submit time. The default value is 3 seconds.
2671
2672 bb_array_stage_cnt=#
2673 Number of tasks from a job array that should be available
2674 for burst buffer resource allocation. Higher values will
2675 increase the system overhead as each task from the job
2676 array will be moved to its own job record in memory, so
2677 relatively small values are generally recommended. The
2678 default value is 10.
2679
2680 bf_busy_nodes
2681 When selecting resources for pending jobs to reserve for
2682 future execution (i.e. the job can not be started immedi‐
2683 ately), then preferentially select nodes that are in use.
2684 This will tend to leave currently idle resources avail‐
2685 able for backfilling longer running jobs, but may result
2686 in allocations having less than optimal network topology.
2687 This option is currently only supported by the se‐
2688 lect/cons_res and select/cons_tres plugins (or se‐
2689 lect/cray_aries with SelectTypeParameters set to
2690 "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
2691 select/cray_aries plugin over the select/cons_res or se‐
2692 lect/cons_tres plugin respectively).
2693
2694 bf_continue
2695 The backfill scheduler periodically releases locks in or‐
2696 der to permit other operations to proceed rather than
2697 blocking all activity for what could be an extended pe‐
2698 riod of time. Setting this option will cause the back‐
2699 fill scheduler to continue processing pending jobs from
2700 its original job list after releasing locks even if job
2701 or node state changes.
2702
              bf_hetjob_immediate
                     Instruct the backfill scheduler to attempt to start
                     a heterogeneous job as soon as all of its
                     components are determined able to do so.
                     Otherwise, the backfill scheduler will delay
                     heterogeneous job initiation attempts until after
                     the rest of the queue has been processed.  This
                     delay may result in lower priority jobs being
                     allocated resources, which could delay the
                     initiation of the heterogeneous job due to account
                     and/or QOS limits being reached.  This option is
                     disabled by default.  If enabled and
                     bf_hetjob_prio=min is not set, then it will be set
                     automatically.
2715
2716 bf_hetjob_prio=[min|avg|max]
                     At the beginning of each backfill scheduling cycle,
                     the list of pending jobs to be scheduled is sorted
                     according to the precedence order configured in
                     PriorityType.  This option instructs the scheduler
                     to alter the sorting algorithm so that all
                     components belonging to the same heterogeneous job
                     are attempted consecutively (thus not fragmented in
                     the resulting list).
2724 More specifically, all components from the same heteroge‐
2725 neous job will be treated as if they all have the same
2726 priority (minimum, average or maximum depending upon this
2727 option's parameter) when compared with other jobs (or
2728 other heterogeneous job components). The original order
2729 will be preserved within the same heterogeneous job. Note
2730 that the operation is calculated for the PriorityTier
2731 layer and for the Priority resulting from the prior‐
2732 ity/multifactor plugin calculations. When enabled, if any
2733 heterogeneous job requested an advanced reservation, then
2734 all of that job's components will be treated as if they
2735 had requested an advanced reservation (and get preferen‐
2736 tial treatment in scheduling).
2737
                     Note that this operation does not update the
                     Priority values of the heterogeneous job
                     components, only their order within the list, so
                     the output of the sprio command will not be
                     affected.
2742
2743 Heterogeneous jobs have special scheduling properties:
2744 they are only scheduled by the backfill scheduling
2745 plugin, each of their components is considered separately
                     when reserving resources (and might have different
                     PriorityTier or different Priority values), and no
                     heterogeneous job component is actually allocated
                     resources until all of its components can be
                     initiated.  This may imply
2750 potential scheduling deadlock scenarios because compo‐
2751 nents from different heterogeneous jobs can start reserv‐
2752 ing resources in an interleaved fashion (not consecu‐
2753 tively), but none of the jobs can reserve resources for
2754 all components and start. Enabling this option can help
2755 to mitigate this problem. By default, this option is dis‐
2756 abled.
2757
2758 bf_interval=#
2759 The number of seconds between backfill iterations.
2760 Higher values result in less overhead and better respon‐
2761 siveness. This option applies only to Scheduler‐
2762 Type=sched/backfill. Default: 30, Min: 1, Max: 10800
2763 (3h).
2764
2765
2766 bf_job_part_count_reserve=#
2767 The backfill scheduling logic will reserve resources for
2768 the specified count of highest priority jobs in each par‐
2769 tition. For example, bf_job_part_count_reserve=10 will
2770 cause the backfill scheduler to reserve resources for the
2771 ten highest priority jobs in each partition. Any lower
2772 priority job that can be started using currently avail‐
2773 able resources and not adversely impact the expected
                     start time of these higher priority jobs will be
                     started by the backfill scheduler.  The default
                     value is zero,
2776 which will reserve resources for any pending job and de‐
2777 lay initiation of lower priority jobs. Also see
2778 bf_min_age_reserve and bf_min_prio_reserve. Default: 0,
2779 Min: 0, Max: 100000.
2780
2781
2782 bf_max_job_array_resv=#
2783 The maximum number of tasks from a job array for which
2784 the backfill scheduler will reserve resources in the fu‐
2785 ture. Since job arrays can potentially have millions of
2786 tasks, the overhead in reserving resources for all tasks
2787 can be prohibitive. In addition various limits may pre‐
2788 vent all the jobs from starting at the expected times.
2789 This has no impact upon the number of tasks from a job
2790 array that can be started immediately, only those tasks
2791 expected to start at some future time. Default: 20, Min:
2792 0, Max: 1000. NOTE: Jobs submitted to multiple parti‐
2793 tions appear in the job queue once per partition. If dif‐
2794 ferent copies of a single job array record aren't consec‐
2795 utive in the job queue and another job array record is in
2796 between, then bf_max_job_array_resv tasks are considered
2797 per partition that the job is submitted to.
2798
              bf_max_job_assoc=#
                     The maximum number of jobs per user association to
                     attempt starting with the backfill scheduler.  This
                     setting is similar to bf_max_job_user but is handy
                     if a user has multiple associations equating to
                     basically different users.  One can set this limit
                     to prevent users from flooding the backfill queue
                     with jobs that cannot start and that prevent jobs
                     from other users from starting.  This option
                     applies only to SchedulerType=sched/backfill.  Also
                     see the bf_max_job_user, bf_max_job_part,
                     bf_max_job_test and bf_max_job_user_part=# options.
                     Set bf_max_job_test to a value much higher than
                     bf_max_job_assoc.  Default: 0 (no limit), Min: 0,
                     Max: bf_max_job_test.
2813
2814 bf_max_job_part=#
2815 The maximum number of jobs per partition to attempt
2816 starting with the backfill scheduler. This can be espe‐
2817 cially helpful for systems with large numbers of parti‐
2818 tions and jobs. This option applies only to Scheduler‐
2819 Type=sched/backfill. Also see the partition_job_depth
2820 and bf_max_job_test options. Set bf_max_job_test to a
2821 value much higher than bf_max_job_part. Default: 0 (no
2822 limit), Min: 0, Max: bf_max_job_test.
2823
2824 bf_max_job_start=#
2825 The maximum number of jobs which can be initiated in a
2826 single iteration of the backfill scheduler. This option
2827 applies only to SchedulerType=sched/backfill. Default: 0
2828 (no limit), Min: 0, Max: 10000.
2829
2830 bf_max_job_test=#
2831 The maximum number of jobs to attempt backfill scheduling
2832 for (i.e. the queue depth). Higher values result in more
2833 overhead and less responsiveness. Until an attempt is
2834 made to backfill schedule a job, its expected initiation
2835 time value will not be set. In the case of large clus‐
2836 ters, configuring a relatively small value may be desir‐
2837 able. This option applies only to Scheduler‐
2838 Type=sched/backfill. Default: 500, Min: 1, Max:
2839 1,000,000.
2840
              bf_max_job_user=#
                     The maximum number of jobs per user to attempt
                     starting with the backfill scheduler for ALL
                     partitions.  One can set this limit to prevent
                     users from flooding the backfill queue with jobs
                     that cannot start and that prevent jobs from other
                     users from starting.  This is similar to the
                     MAXIJOB limit in Maui.  This option applies only to
                     SchedulerType=sched/backfill.  Also see the
                     bf_max_job_part, bf_max_job_test and
                     bf_max_job_user_part=# options.  Set
                     bf_max_job_test to a value much higher than
                     bf_max_job_user.  Default: 0 (no limit), Min: 0,
                     Max: bf_max_job_test.
2853
2854 bf_max_job_user_part=#
2855 The maximum number of jobs per user per partition to at‐
2856 tempt starting with the backfill scheduler for any single
2857 partition. This option applies only to Scheduler‐
2858 Type=sched/backfill. Also see the bf_max_job_part,
2859 bf_max_job_test and bf_max_job_user=# options. Default:
2860 0 (no limit), Min: 0, Max: bf_max_job_test.
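
              For example, to bound how deeply the backfill scheduler
              examines the queue while limiting any single user's or
              partition's share of scheduling attempts (illustrative
              values):

                     SchedulerParameters=bf_max_job_test=5000,bf_max_job_part=500,bf_max_job_user=50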
2861
2862 bf_max_time=#
2863 The maximum time in seconds the backfill scheduler can
2864 spend (including time spent sleeping when locks are re‐
2865 leased) before discontinuing, even if maximum job counts
2866 have not been reached. This option applies only to
2867 SchedulerType=sched/backfill. The default value is the
2868 value of bf_interval (which defaults to 30 seconds). De‐
2869 fault: bf_interval value (def. 30 sec), Min: 1, Max: 3600
2870 (1h). NOTE: If bf_interval is short and bf_max_time is
2871 large, this may cause locks to be acquired too frequently
2872 and starve out other serviced RPCs. It's advisable if us‐
2873 ing this parameter to set max_rpc_cnt high enough that
2874 scheduling isn't always disabled, and low enough that the
2875 interactive workload can get through in a reasonable pe‐
2876 riod of time. max_rpc_cnt needs to be below 256 (the de‐
2877 fault RPC thread limit). Running around the middle (150)
2878 may give you good results. NOTE: When increasing the
2879 amount of time spent in the backfill scheduling cycle,
2880 Slurm can be prevented from responding to client requests
                     in a timely manner.  To address this, you can use
                     max_rpc_cnt to specify a number of queued RPCs at
                     which the scheduler stops in order to respond to
                     those requests.
2884
2885 bf_min_age_reserve=#
2886 The backfill and main scheduling logic will not reserve
2887 resources for pending jobs until they have been pending
2888 and runnable for at least the specified number of sec‐
2889 onds. In addition, jobs waiting for less than the speci‐
2890 fied number of seconds will not prevent a newly submitted
2891 job from starting immediately, even if the newly submit‐
2892 ted job has a lower priority. This can be valuable if
2893 jobs lack time limits or all time limits have the same
2894 value. The default value is zero, which will reserve re‐
2895 sources for any pending job and delay initiation of lower
2896 priority jobs. Also see bf_job_part_count_reserve and
2897 bf_min_prio_reserve. Default: 0, Min: 0, Max: 2592000
2898 (30 days).
2899
2900 bf_min_prio_reserve=#
2901 The backfill and main scheduling logic will not reserve
2902 resources for pending jobs unless they have a priority
2903 equal to or higher than the specified value. In addi‐
2904 tion, jobs with a lower priority will not prevent a newly
2905 submitted job from starting immediately, even if the
2906 newly submitted job has a lower priority. This can be
                     valuable if one wishes to maximize system
                     utilization without regard for job priority below a
                     certain threshold.  The default value is zero,
                     which will reserve resources for any pending job
                     and delay initiation of lower
2911 priority jobs. Also see bf_job_part_count_reserve and
2912 bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.
2913
2914 bf_node_space_size=#
2915 Size of backfill node_space table. Adding a single job to
2916 backfill reservations in the worst case can consume two
2917 node_space records. In the case of large clusters, con‐
2918 figuring a relatively small value may be desirable. This
2919 option applies only to SchedulerType=sched/backfill.
2920 Also see bf_max_job_test and bf_running_job_reserve. De‐
2921 fault: bf_max_job_test, Min: 2, Max: 2,000,000.
2922
2923 bf_one_resv_per_job
2924 Disallow adding more than one backfill reservation per
2925 job. The scheduling logic builds a sorted list of job-
2926 partition pairs. Jobs submitted to multiple partitions
2927 have as many entries in the list as requested partitions.
2928 By default, the backfill scheduler may evaluate all the
2929 job-partition entries for a single job, potentially re‐
2930 serving resources for each pair, but only starting the
2931 job in the reservation offering the earliest start time.
2932 Having a single job reserving resources for multiple par‐
2933 titions could impede other jobs (or hetjob components)
2934 from reserving resources already reserved for the parti‐
2935 tions that don't offer the earliest start time. A single
2936 job that requests multiple partitions can also prevent
2937 itself from starting earlier in a lower priority parti‐
2938 tion if the partitions overlap nodes and a backfill
2939 reservation in the higher priority partition blocks nodes
2940 that are also in the lower priority partition. This op‐
2941 tion makes it so that a job submitted to multiple parti‐
2942 tions will stop reserving resources once the first job-
2943 partition pair has booked a backfill reservation. Subse‐
2944 quent pairs from the same job will only be tested to
                     start now.  This allows other jobs to book the
                     other pairs' resources, at the cost of not
                     guaranteeing that the multi-partition job will
                     start in the partition offering the earliest start
                     time (unless it can start immediately).  This
                     option is disabled by default.
2950
2951
2952 bf_resolution=#
2953 The number of seconds in the resolution of data main‐
2954 tained about when jobs begin and end. Higher values re‐
2955 sult in better responsiveness and quicker backfill cycles
2956 by using larger blocks of time to determine node eligi‐
2957 bility. However, higher values lead to less efficient
2958 system planning, and may miss opportunities to improve
2959 system utilization. This option applies only to Sched‐
2960 ulerType=sched/backfill. Default: 60, Min: 1, Max: 3600
2961 (1 hour).
2962
2963 bf_running_job_reserve
2964 Add an extra step to backfill logic, which creates back‐
2965 fill reservations for jobs running on whole nodes. This
2966 option is disabled by default.
2967
2968 bf_window=#
2969 The number of minutes into the future to look when con‐
2970 sidering jobs to schedule. Higher values result in more
2971 overhead and less responsiveness. A value at least as
2972 long as the highest allowed time limit is generally ad‐
2973 visable to prevent job starvation. In order to limit the
2974 amount of data managed by the backfill scheduler, if the
2975 value of bf_window is increased, then it is generally ad‐
2976 visable to also increase bf_resolution. This option ap‐
2977 plies only to SchedulerType=sched/backfill. Default:
2978 1440 (1 day), Min: 1, Max: 43200 (30 days).
2979
2980 bf_window_linear=#
2981 For performance reasons, the backfill scheduler will de‐
2982 crease precision in calculation of job expected termina‐
2983 tion times. By default, the precision starts at 30 sec‐
2984 onds and that time interval doubles with each evaluation
2985 of currently executing jobs when trying to determine when
2986 a pending job can start. This algorithm can support an
2987 environment with many thousands of running jobs, but can
                     result in the expected start time of pending jobs
                     being gradually deferred due to lack of precision.
                     A
2990 value for bf_window_linear will cause the time interval
2991 to be increased by a constant amount on each iteration.
2992 The value is specified in units of seconds. For example,
2993 a value of 60 will cause the backfill scheduler on the
2994 first iteration to identify the job ending soonest and
2995 determine if the pending job can be started after that
2996 job plus all other jobs expected to end within 30 seconds
2997 (default initial value) of the first job. On the next it‐
2998 eration, the pending job will be evaluated for starting
2999 after the next job expected to end plus all jobs ending
3000 within 90 seconds of that time (30 second default, plus
3001 the 60 second option value). The third iteration will
3002 have a 150 second window and the fourth 210 seconds.
3003 Without this option, the time windows will double on each
3004 iteration and thus be 30, 60, 120, 240 seconds, etc. The
3005 use of bf_window_linear is not recommended with more than
3006 a few hundred simultaneously executing jobs.
3007
              bf_yield_interval=#
                     The backfill scheduler will periodically relinquish
                     locks in order for other pending operations to take
                     place.  This specifies the interval between lock
                     releases, in microseconds.  Smaller values may be
                     helpful for high throughput computing when used in
                     conjunction with the bf_continue option.  Also see
                     the bf_yield_sleep option.  Default: 2,000,000 (2
                     sec), Min: 1, Max: 10,000,000 (10 sec).
3017
3018 bf_yield_sleep=#
3019 The backfill scheduler will periodically relinquish locks
3020 in order for other pending operations to take place.
3021 This specifies the length of time for which the locks are
3022 relinquished in microseconds. Also see the bf_yield_in‐
3023 terval option. Default: 500,000 (0.5 sec), Min: 1, Max:
3024 10,000,000 (10 sec).
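For example (values are illustrative only, not recommendations), a high-throughput site might combine these options so the backfill scheduler yields locks more often and for shorter periods:

```ini
# Yield locks every 1 second (bf_yield_interval, in microseconds)
# for 0.2 seconds each time (bf_yield_sleep, in microseconds),
# and let backfill resume where it left off (bf_continue).
SchedulerParameters=bf_continue,bf_yield_interval=1000000,bf_yield_sleep=200000
```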
3025
3026 build_queue_timeout=#
3027 Defines the maximum time that can be devoted to building
3028 a queue of jobs to be tested for scheduling. If the sys‐
3029 tem has a huge number of jobs with dependencies, just
3030 building the job queue can take so much time as to ad‐
3031 versely impact overall system performance and this param‐
3032 eter can be adjusted as needed. The default value is
3033 2,000,000 microseconds (2 seconds).
3034
3035 correspond_after_task_cnt=#
3036                       Defines the number of array tasks that get split for a
3037                       potential aftercorr dependency check. A low value may
3038                       result in dependency check failures when the job that a
3039                       task depends on is purged before the split. Default: 10.
3040
3041 default_queue_depth=#
3042 The default number of jobs to attempt scheduling (i.e.
3043 the queue depth) when a running job completes or other
3044                       routine actions occur. However, the frequency with which
3045 the scheduler is run may be limited by using the defer or
3046 sched_min_interval parameters described below. The full
3047 queue will be tested on a less frequent basis as defined
3048 by the sched_interval option described below. The default
3049 value is 100. See the partition_job_depth option to
3050 limit depth by partition.
3051
3052 defer Setting this option will avoid attempting to schedule
3053 each job individually at job submit time, but defer it
3054 until a later time when scheduling multiple jobs simulta‐
3055 neously may be possible. This option may improve system
3056 responsiveness when large numbers of jobs (many hundreds)
3057 are submitted at the same time, but it will delay the
3058 initiation time of individual jobs. Also see de‐
3059 fault_queue_depth above.
3060
3061 delay_boot=#
3062                       Do not reboot nodes in order to satisfy this job's
3063                       feature specification if the job has been eligible to run
3064 for less than this time period. If the job has waited
3065 for less than the specified period, it will use only
3066 nodes which already have the specified features. The ar‐
3067 gument is in units of minutes. Individual jobs may over‐
3068 ride this default value with the --delay-boot option.
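A minimal sketch, using only the option described above (the 10-minute value is arbitrary):

```ini
# Jobs eligible for less than 10 minutes use only nodes that
# already have the requested features; after 10 minutes, nodes
# may be rebooted to satisfy the feature specification.
SchedulerParameters=delay_boot=10
```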
3069
3070 disable_job_shrink
3071                       Deny user requests to shrink the size of running jobs.
3072 (However, running jobs may still shrink due to node fail‐
3073 ure if the --no-kill option was set.)
3074
3075 disable_hetjob_steps
3076 Disable job steps that span heterogeneous job alloca‐
3077 tions.
3078
3079 enable_hetjob_steps
3080 Enable job steps that span heterogeneous job allocations.
3081 The default value.
3082
3083 enable_user_top
3084 Enable use of the "scontrol top" command by non-privi‐
3085 leged users.
3086
3087 Ignore_NUMA
3088 Some processors (e.g. AMD Opteron 6000 series) contain
3089 multiple NUMA nodes per socket. This is a configuration
3090 which does not map into the hardware entities that Slurm
3091 optimizes resource allocation for (PU/thread, core,
3092 socket, baseboard, node and network switch). In order to
3093 optimize resource allocations on such hardware, Slurm
3094 will consider each NUMA node within the socket as a sepa‐
3095 rate socket by default. Use the Ignore_NUMA option to re‐
3096 port the correct socket count, but not optimize resource
3097 allocations on the NUMA nodes.
3098
3099 max_array_tasks
3100 Specify the maximum number of tasks that can be included
3101 in a job array. The default limit is MaxArraySize, but
3102 this option can be used to set a lower limit. For exam‐
3103 ple, max_array_tasks=1000 and MaxArraySize=100001 would
3104 permit a maximum task ID of 100000, but limit the number
3105 of tasks in any single job array to 1000.
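The example in the text above can be written out directly:

```ini
# Permit array task IDs up to 100000, but limit any single
# job array to 1000 tasks.
MaxArraySize=100001
SchedulerParameters=max_array_tasks=1000
```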
3106
3107 max_rpc_cnt=#
3108 If the number of active threads in the slurmctld daemon
3109 is equal to or larger than this value, defer scheduling
3110 of jobs. The scheduler will check this condition at cer‐
3111 tain points in code and yield locks if necessary. This
3112 can improve Slurm's ability to process requests at a cost
3113 of initiating new jobs less frequently. Default: 0 (op‐
3114 tion disabled), Min: 0, Max: 1000.
3115
3116 NOTE: The maximum number of threads (MAX_SERVER_THREADS)
3117 is internally set to 256 and defines the number of served
3118 RPCs at a given time. Setting max_rpc_cnt to more than
3119 256 will be only useful to let backfill continue schedul‐
3120 ing work after locks have been yielded (i.e. each 2 sec‐
3121 onds) if there are a maximum of MAX(max_rpc_cnt/10, 20)
3122                       RPCs in the queue. For example, with max_rpc_cnt=1000,
3123                       the scheduler will be allowed to continue after yielding
3124                       locks only when there are 100 or fewer pending RPCs.
3125 If a value is set, then a value of 10 or higher is recom‐
3126 mended. It may require some tuning for each system, but
3127 needs to be high enough that scheduling isn't always dis‐
3128 abled, and low enough that requests can get through in a
3129 reasonable period of time.
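As a hedged illustration (the value 150 is arbitrary and, per the note above, requires per-system tuning):

```ini
# Defer scheduling when slurmctld has 150 or more active
# threads, yielding locks so requests can be processed.
SchedulerParameters=max_rpc_cnt=150
```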
3130
3131 max_sched_time=#
3132 How long, in seconds, that the main scheduling loop will
3133 execute for before exiting. If a value is configured, be
3134 aware that all other Slurm operations will be deferred
3135 during this time period. Make certain the value is lower
3136 than MessageTimeout. If a value is not explicitly con‐
3137 figured, the default value is half of MessageTimeout with
3138 a minimum default value of 1 second and a maximum default
3139 value of 2 seconds. For example if MessageTimeout=10,
3140 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
3141
3142 max_script_size=#
3143 Specify the maximum size of a batch script, in bytes.
3144 The default value is 4 megabytes. Larger values may ad‐
3145 versely impact system performance.
3146
3147 max_switch_wait=#
3148 Maximum number of seconds that a job can delay execution
3149 waiting for the specified desired switch count. The de‐
3150 fault value is 300 seconds.
3151
3152 no_backup_scheduling
3153 If used, the backup controller will not schedule jobs
3154 when it takes over. The backup controller will allow jobs
3155 to be submitted, modified and cancelled but won't sched‐
3156 ule new jobs. This is useful in Cray environments when
3157 the backup controller resides on an external Cray node.
3158 A restart is required to alter this option.
3159
3160 no_env_cache
3161                       If used, a job started on a node that fails to load the
3162                       job's environment will fail instead of using a cached
3163                       environment. This also implicitly sets the
3164                       requeue_setup_env_fail option.
3165
3166 nohold_on_prolog_fail
3167 By default, if the Prolog exits with a non-zero value the
3168 job is requeued in a held state. By specifying this pa‐
3169 rameter the job will be requeued but not held so that the
3170 scheduler can dispatch it to another host.
3171
3172 pack_serial_at_end
3173 If used with the select/cons_res or select/cons_tres
3174 plugin, then put serial jobs at the end of the available
3175 nodes rather than using a best fit algorithm. This may
3176 reduce resource fragmentation for some workloads.
3177
3178 partition_job_depth=#
3179 The default number of jobs to attempt scheduling (i.e.
3180 the queue depth) from each partition/queue in Slurm's
3181 main scheduling logic. The functionality is similar to
3182 that provided by the bf_max_job_part option for the back‐
3183 fill scheduling logic. The default value is 0 (no
3184                       limit). Jobs excluded from attempted scheduling based
3185 upon partition will not be counted against the de‐
3186 fault_queue_depth limit. Also see the bf_max_job_part
3187 option.
3188
3189 preempt_reorder_count=#
3190 Specify how many attempts should be made in reordering
3191 preemptable jobs to minimize the count of jobs preempted.
3192 The default value is 1. High values may adversely impact
3193 performance. The logic to support this option is only
3194 available in the select/cons_res and select/cons_tres
3195 plugins.
3196
3197 preempt_strict_order
3198 If set, then execute extra logic in an attempt to preempt
3199 only the lowest priority jobs. It may be desirable to
3200 set this configuration parameter when there are multiple
3201 priorities of preemptable jobs. The logic to support
3202 this option is only available in the select/cons_res and
3203 select/cons_tres plugins.
3204
3205 preempt_youngest_first
3206 If set, then the preemption sorting algorithm will be
3207 changed to sort by the job start times to favor preempt‐
3208 ing younger jobs over older. (Requires preempt/parti‐
3209 tion_prio or preempt/qos plugins.)
3210
3211 reduce_completing_frag
3212 This option is used to control how scheduling of re‐
3213 sources is performed when jobs are in the COMPLETING
3214 state, which influences potential fragmentation. If this
3215 option is not set then no jobs will be started in any
3216 partition when any job is in the COMPLETING state for
3217 less than CompleteWait seconds. If this option is set
3218 then no jobs will be started in any individual partition
3219 that has a job in COMPLETING state for less than Com‐
3220 pleteWait seconds. In addition, no jobs will be started
3221 in any partition with nodes that overlap with any nodes
3222 in the partition of the completing job. This option is
3223 to be used in conjunction with CompleteWait.
3224
3225 NOTE: CompleteWait must be set in order for this to work.
3226 If CompleteWait=0 then this option does nothing.
3227
3228 NOTE: reduce_completing_frag only affects the main sched‐
3229 uler, not the backfill scheduler.
3230
3231 requeue_setup_env_fail
3232                       By default, if a job's environment setup fails, the job keeps
3233 running with a limited environment. By specifying this
3234 parameter the job will be requeued in held state and the
3235 execution node drained.
3236
3237 salloc_wait_nodes
3238 If defined, the salloc command will wait until all allo‐
3239 cated nodes are ready for use (i.e. booted) before the
3240 command returns. By default, salloc will return as soon
3241 as the resource allocation has been made.
3242
3243 sbatch_wait_nodes
3244 If defined, the sbatch script will wait until all allo‐
3245 cated nodes are ready for use (i.e. booted) before the
3246 initiation. By default, the sbatch script will be initi‐
3247 ated as soon as the first node in the job allocation is
3248 ready. The sbatch command can use the --wait-all-nodes
3249 option to override this configuration parameter.
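A minimal sketch combining the two options above (both are described in this section; whether to set them is a site policy choice):

```ini
# Make both salloc and sbatch wait until all allocated nodes
# are booted before the allocation/script starts.
SchedulerParameters=salloc_wait_nodes,sbatch_wait_nodes
```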
3250
3251 sched_interval=#
3252 How frequently, in seconds, the main scheduling loop will
3253 execute and test all pending jobs. The default value is
3254 60 seconds.
3255
3256 sched_max_job_start=#
3257 The maximum number of jobs that the main scheduling logic
3258 will start in any single execution. The default value is
3259 zero, which imposes no limit.
3260
3261 sched_min_interval=#
3262 How frequently, in microseconds, the main scheduling loop
3263 will execute and test any pending jobs. The scheduler
3264 runs in a limited fashion every time that any event hap‐
3265 pens which could enable a job to start (e.g. job submit,
3266 job terminate, etc.). If these events happen at a high
3267 frequency, the scheduler can run very frequently and con‐
3268 sume significant resources if not throttled by this op‐
3269 tion. This option specifies the minimum time between the
3270 end of one scheduling cycle and the beginning of the next
3271 scheduling cycle. A value of zero will disable throt‐
3272 tling of the scheduling logic interval. The default
3273                       value is 2 microseconds.
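For example (values illustrative only), the full and event-driven scheduling passes described above might be tuned together:

```ini
# Run the full scheduling pass every 30 seconds, and throttle
# the event-driven scheduler to at most one run per 0.1 s
# (sched_min_interval is specified in microseconds).
SchedulerParameters=sched_interval=30,sched_min_interval=100000
```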
3274
3275 spec_cores_first
3276 Specialized cores will be selected from the first cores
3277 of the first sockets, cycling through the sockets on a
3278 round robin basis. By default, specialized cores will be
3279 selected from the last cores of the last sockets, cycling
3280 through the sockets on a round robin basis.
3281
3282 step_retry_count=#
3283                       When a step completes and there are steps pending resource
3284                       allocation, then retry step allocations for at least this
3285 number of pending steps. Also see step_retry_time. The
3286 default value is 8 steps.
3287
3288 step_retry_time=#
3289                       When a step completes and there are steps pending resource
3290 allocation, then retry step allocations for all steps
3291 which have been pending for at least this number of sec‐
3292 onds. Also see step_retry_count. The default value is
3293 60 seconds.
3294
3295 whole_hetjob
3296 Requests to cancel, hold or release any component of a
3297 heterogeneous job will be applied to all components of
3298 the job.
3299
3300                       NOTE: this option was previously named whole_pack, which
3301                       is still supported for backward compatibility.
3302
3303
3304 SchedulerTimeSlice
3305 Number of seconds in each time slice when gang scheduling is en‐
3306 abled (PreemptMode=SUSPEND,GANG). The value must be between 5
3307 seconds and 65533 seconds. The default value is 30 seconds.
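A hedged sketch using the preemption mode named above (the 60-second slice is arbitrary, within the documented 5-65533 range):

```ini
# Gang scheduling with 60-second time slices.
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=60
```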
3308
3309
3310 SchedulerType
3311 Identifies the type of scheduler to be used. Note the slurmctld
3312 daemon must be restarted for a change in scheduler type to be‐
3313 come effective (reconfiguring a running daemon has no effect for
3314 this parameter). The scontrol command can be used to manually
3315 change job priorities if desired. Acceptable values include:
3316
3317 sched/backfill
3318 For a backfill scheduling module to augment the default
3319 FIFO scheduling. Backfill scheduling will initiate
3320 lower-priority jobs if doing so does not delay the ex‐
3321 pected initiation time of any higher priority job. Ef‐
3322 fectiveness of backfill scheduling is dependent upon
3323 users specifying job time limits, otherwise all jobs will
3324 have the same time limit and backfilling is impossible.
3325 Note documentation for the SchedulerParameters option
3326 above. This is the default configuration.
3327
3328 sched/builtin
3329 This is the FIFO scheduler which initiates jobs in prior‐
3330 ity order. If any job in the partition can not be sched‐
3331 uled, no lower priority job in that partition will be
3332 scheduled. An exception is made for jobs that can not
3333 run due to partition constraints (e.g. the time limit) or
3334 down/drained nodes. In that case, lower priority jobs
3335 can be initiated and not impact the higher priority job.
3336
3337 sched/hold
3338                       To hold all newly arriving jobs if the file
3339                       "/etc/slurm.hold" exists; otherwise use the built-in
3340                       FIFO scheduler.
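A minimal sketch of the default configuration, combining SchedulerType with two of the SchedulerParameters options described earlier in this section (values illustrative):

```ini
# Backfill scheduling (the default), testing up to 100 jobs
# per routine scheduling pass and deferring per-job scheduling
# at submit time.
SchedulerType=sched/backfill
SchedulerParameters=default_queue_depth=100,defer
```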
3341
3342
3343 ScronParameters
3344 Multiple options may be comma separated.
3345
3346 enable Enable the use of scrontab to submit and manage periodic
3347 repeating jobs.
3348
3349
3350 SelectType
3351 Identifies the type of resource selection algorithm to be used.
3352 Changing this value can only be done by restarting the slurmctld
3353 daemon. When changed, all job information (running and pending)
3354 will be lost, since the job state save format used by each
3355 plugin is different. The only exception to this is when chang‐
3356 ing from cons_res to cons_tres or from cons_tres to cons_res.
3357 However, if a job contains cons_tres-specific features and then
3358 SelectType is changed to cons_res, the job will be canceled,
3359 since there is no way for cons_res to satisfy requirements spe‐
3360 cific to cons_tres.
3361
3362 Acceptable values include
3363
3364 select/cons_res
3365 The resources (cores and memory) within a node are indi‐
3366 vidually allocated as consumable resources. Note that
3367 whole nodes can be allocated to jobs for selected parti‐
3368 tions by using the OverSubscribe=Exclusive option. See
3369 the partition OverSubscribe parameter for more informa‐
3370 tion.
3371
3372 select/cons_tres
3373 The resources (cores, memory, GPUs and all other track‐
3374 able resources) within a node are individually allocated
3375 as consumable resources. Note that whole nodes can be
3376 allocated to jobs for selected partitions by using the
3377 OverSubscribe=Exclusive option. See the partition Over‐
3378 Subscribe parameter for more information.
3379
3380 select/cray_aries
3381 for a Cray system. The default value is "se‐
3382 lect/cray_aries" for all Cray systems.
3383
3384 select/linear
3385 for allocation of entire nodes assuming a one-dimensional
3386 array of nodes in which sequentially ordered nodes are
3387 preferable. For a heterogeneous cluster (e.g. different
3388 CPU counts on the various nodes), resource allocations
3389 will favor nodes with high CPU counts as needed based
3390 upon the job's node and CPU specification if TopologyPlu‐
3391 gin=topology/none is configured. Use of other topology
3392 plugins with select/linear and heterogeneous nodes is not
3393 recommended and may result in valid job allocation re‐
3394 quests being rejected. This is the default value.
3395
3396
3397 SelectTypeParameters
3398 The permitted values of SelectTypeParameters depend upon the
3399 configured value of SelectType. The only supported options for
3400 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3401 which treats memory as a consumable resource and prevents memory
3402 over subscription with job preemption or gang scheduling. By
3403 default SelectType=select/linear allocates whole nodes to jobs
3404 without considering their memory consumption. By default Se‐
3405 lectType=select/cons_res, SelectType=select/cray_aries, and Se‐
3406 lectType=select/cons_tres, use CR_Core_Memory, which allocates
3407        cores to jobs while considering their memory consumption.
3408
3409 The following options are supported for SelectType=se‐
3410 lect/cray_aries:
3411
3412 OTHER_CONS_RES
3413                   Layer the select/cons_res plugin under the
3414                   select/cray_aries plugin; the default is to layer on
3415 select/linear. This also allows all the options
3416 available for SelectType=select/cons_res.
3417
3418 OTHER_CONS_TRES
3419                   Layer the select/cons_tres plugin under the
3420                   select/cray_aries plugin; the default is to layer on
3421 select/linear. This also allows all the options
3422 available for SelectType=select/cons_tres.
3423
3424 The following options are supported by the SelectType=se‐
3425 lect/cons_res and SelectType=select/cons_tres plugins:
3426
3427 CR_CPU CPUs are consumable resources. Configure the num‐
3428 ber of CPUs on each node, which may be equal to
3429 the count of cores or hyper-threads on the node
3430 depending upon the desired minimum resource allo‐
3431 cation. The node's Boards, Sockets, CoresPer‐
3432 Socket and ThreadsPerCore may optionally be con‐
3433 figured and result in job allocations which have
3434 improved locality; however doing so will prevent
3435 more than one job from being allocated on each
3436 core.
3437
3438 CR_CPU_Memory
3439 CPUs and memory are consumable resources. Config‐
3440 ure the number of CPUs on each node, which may be
3441 equal to the count of cores or hyper-threads on
3442 the node depending upon the desired minimum re‐
3443 source allocation. The node's Boards, Sockets,
3444 CoresPerSocket and ThreadsPerCore may optionally
3445 be configured and result in job allocations which
3446 have improved locality; however doing so will pre‐
3447 vent more than one job from being allocated on
3448 each core. Setting a value for DefMemPerCPU is
3449 strongly recommended.
3450
3451 CR_Core
3452 Cores are consumable resources. On nodes with hy‐
3453 per-threads, each thread is counted as a CPU to
3454 satisfy a job's resource requirement, but multiple
3455 jobs are not allocated threads on the same core.
3456 The count of CPUs allocated to a job is rounded up
3457 to account for every CPU on an allocated core.
3458                        This also impacts the total allocated memory when
3459                        --mem-per-cpu is used: it is multiplied by the total
3460                        number of CPUs on the allocated cores.
3461
3462 CR_Core_Memory
3463 Cores and memory are consumable resources. On
3464 nodes with hyper-threads, each thread is counted
3465 as a CPU to satisfy a job's resource requirement,
3466 but multiple jobs are not allocated threads on the
3467 same core. The count of CPUs allocated to a job
3468 may be rounded up to account for every CPU on an
3469 allocated core. Setting a value for DefMemPerCPU
3470 is strongly recommended.
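A sketch of the combination recommended above (DefMemPerCPU is specified in megabytes; 2048 is an arbitrary illustrative value):

```ini
# Cores and memory as consumable resources, with a default
# per-CPU memory limit as strongly recommended above.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=2048
```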
3471
3472 CR_ONE_TASK_PER_CORE
3473 Allocate one task per core by default. Without
3474 this option, by default one task will be allocated
3475 per thread on nodes with more than one ThreadsPer‐
3476 Core configured. NOTE: This option cannot be used
3477 with CR_CPU*.
3478
3479 CR_CORE_DEFAULT_DIST_BLOCK
3480 Allocate cores within a node using block distribu‐
3481 tion by default. This is a pseudo-best-fit algo‐
3482 rithm that minimizes the number of boards and min‐
3483 imizes the number of sockets (within minimum
3484 boards) used for the allocation. This default be‐
3485 havior can be overridden specifying a particular
3486 "-m" parameter with srun/salloc/sbatch. Without
3487 this option, cores will be allocated cyclically
3488 across the sockets.
3489
3490 CR_LLN Schedule resources to jobs on the least loaded
3491 nodes (based upon the number of idle CPUs). This
3492 is generally only recommended for an environment
3493 with serial jobs as idle resources will tend to be
3494 highly fragmented, resulting in parallel jobs be‐
3495 ing distributed across many nodes. Note that node
3496 Weight takes precedence over how many idle re‐
3497 sources are on each node. Also see the partition
3498                        configuration parameter LLN to use the least loaded
3499 nodes in selected partitions.
3500
3501 CR_Pack_Nodes
3502 If a job allocation contains more resources than
3503 will be used for launching tasks (e.g. if whole
3504 nodes are allocated to a job), then rather than
3505 distributing a job's tasks evenly across its allo‐
3506 cated nodes, pack them as tightly as possible on
3507 these nodes. For example, consider a job alloca‐
3508 tion containing two entire nodes with eight CPUs
3509 each. If the job starts ten tasks across those
3510 two nodes without this option, it will start five
3511 tasks on each of the two nodes. With this option,
3512 eight tasks will be started on the first node and
3513 two tasks on the second node. This can be super‐
3514 seded by "NoPack" in srun's "--distribution" op‐
3515 tion. CR_Pack_Nodes only applies when the "block"
3516 task distribution method is used.
3517
3518 CR_Socket
3519 Sockets are consumable resources. On nodes with
3520 multiple cores, each core or thread is counted as
3521 a CPU to satisfy a job's resource requirement, but
3522 multiple jobs are not allocated resources on the
3523 same socket.
3524
3525 CR_Socket_Memory
3526 Memory and sockets are consumable resources. On
3527 nodes with multiple cores, each core or thread is
3528 counted as a CPU to satisfy a job's resource re‐
3529 quirement, but multiple jobs are not allocated re‐
3530 sources on the same socket. Setting a value for
3531 DefMemPerCPU is strongly recommended.
3532
3533 CR_Memory
3534 Memory is a consumable resource. NOTE: This im‐
3535 plies OverSubscribe=YES or OverSubscribe=FORCE for
3536 all partitions. Setting a value for DefMemPerCPU
3537 is strongly recommended.
3538
3539
3540 SlurmctldAddr
3541 An optional address to be used for communications to the cur‐
3542 rently active slurmctld daemon, normally used with Virtual IP
3543 addressing of the currently active server. If this parameter is
3544 not specified then each primary and backup server will have its
3545 own unique address used for communications as specified in the
3546 SlurmctldHost parameter. If this parameter is specified then
3547 the SlurmctldHost parameter will still be used for communica‐
3548 tions to specific slurmctld primary or backup servers, for exam‐
3549 ple to cause all of them to read the current configuration files
3550 or shutdown. Also see the SlurmctldPrimaryOffProg and Slurm‐
3551 ctldPrimaryOnProg configuration parameters to configure programs
3552        that manipulate the virtual IP address.
3553
3554
3555 SlurmctldDebug
3556        The level of detail to provide in the slurmctld daemon's logs.
3557        The default value is info. If the slurmctld daemon is initiated
3558        with the -v or --verbose options, that debug level will be
3559        preserved or restored upon reconfiguration.
3560
3561
3562 quiet Log nothing
3563
3564 fatal Log only fatal errors
3565
3566 error Log only errors
3567
3568 info Log errors and general informational messages
3569
3570 verbose Log errors and verbose informational messages
3571
3572 debug Log errors and verbose informational messages and de‐
3573 bugging messages
3574
3575 debug2 Log errors and verbose informational messages and more
3576 debugging messages
3577
3578 debug3 Log errors and verbose informational messages and even
3579 more debugging messages
3580
3581 debug4 Log errors and verbose informational messages and even
3582 more debugging messages
3583
3584 debug5 Log errors and verbose informational messages and even
3585 more debugging messages
3586
3587
3588 SlurmctldHost
3589        The short, or long, hostname of the machine where the Slurm
3590        control daemon is executed (i.e. the name returned by the
3591        command "hostname -s"). This hostname is optionally followed by the address,
3592 either the IP address or a name by which the address can be
3593 identified, enclosed in parentheses (e.g. SlurmctldHost=slurm‐
3594 ctl-primary(12.34.56.78)). This value must be specified at least
3595 once. If specified more than once, the first hostname named will
3596 be where the daemon runs. If the first specified host fails,
3597 the daemon will execute on the second host. If both the first
3598        and second specified hosts fail, the daemon will execute on the
3599 third host.
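A minimal sketch of a primary/backup pair in the syntax shown above (hostnames and addresses are hypothetical):

```ini
# The daemon runs on ctl1; ctl2 takes over if ctl1 fails.
SlurmctldHost=ctl1(10.0.0.1)
SlurmctldHost=ctl2(10.0.0.2)
```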
3600
3601
3602 SlurmctldLogFile
3603 Fully qualified pathname of a file into which the slurmctld dae‐
3604 mon's logs are written. The default value is none (performs
3605 logging via syslog).
3606 See the section LOGGING if a pathname is specified.
3607
3608
3609 SlurmctldParameters
3610 Multiple options may be comma separated.
3611
3612
3613 allow_user_triggers
3614 Permit setting triggers from non-root/slurm_user users.
3615 SlurmUser must also be set to root to permit these trig‐
3616 gers to work. See the strigger man page for additional
3617 details.
3618
3619 cloud_dns
3620 By default, Slurm expects that the network address for a
3621 cloud node won't be known until the creation of the node
3622 and that Slurm will be notified of the node's address
3623 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3624 Since Slurm communications rely on the node configuration
3625                      found in the slurm.conf, Slurm will tell the client
3626                      command, after waiting for all nodes to boot, each node's IP
3627 address. However, in environments where the nodes are in
3628 DNS, this step can be avoided by configuring this option.
3629
3630 cloud_reg_addrs
3631 When a cloud node registers, the node's NodeAddr and
3632 NodeHostName will automatically be set. They will be re‐
3633 set back to the nodename after powering off.
3634
3635 enable_configless
3636 Permit "configless" operation by the slurmd, slurmstepd,
3637 and user commands. When enabled the slurmd will be per‐
3638 mitted to retrieve config files from the slurmctld, and
3639 on any 'scontrol reconfigure' command new configs will be
3640 automatically pushed out and applied to nodes that are
3641 running in this "configless" mode. NOTE: a restart of
3642 the slurmctld is required for this to take effect.
3643
3644 idle_on_node_suspend
3645 Mark nodes as idle, regardless of current state, when
3646 suspending nodes with SuspendProgram so that nodes will
3647 be eligible to be resumed at a later time.
3648
3649 node_reg_mem_percent=#
3650 Percentage of memory a node is allowed to register with
3651 without being marked as invalid with low memory. Default
3652 is 100. For State=CLOUD nodes, the default is 90. To dis‐
3653 able this for cloud nodes set it to 100. config_overrides
3654                      takes precedence over this option.
3655
3656                      It is recommended to configure task/cgroup with
3657                      ConstrainRAMSpace. A memory cgroup limit won't be set
3658                      higher than the actual memory on the node. If needed,
3659                      configure AllowedRAMSpace in cgroup.conf to add a buffer.
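As a hedged illustration (the 95% value is arbitrary):

```ini
# Allow nodes to register with as little as 95% of their
# configured memory without being marked invalid.
SlurmctldParameters=node_reg_mem_percent=95
```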
3660
3661 power_save_interval
3662 How often the power_save thread looks to resume and sus‐
3663 pend nodes. The power_save thread will do work sooner if
3664 there are node state changes. Default is 10 seconds.
3665
3666 power_save_min_interval
3667 How often the power_save thread, at a minimum, looks to
3668 resume and suspend nodes. Default is 0.
3669
3670 max_dbd_msg_action
3671 Action used once MaxDBDMsgs is reached, options are 'dis‐
3672 card' (default) and 'exit'.
3673
3674                      When 'discard' is specified and MaxDBDMsgs is reached,
3675                      pending messages of type step start and step complete
3676                      are purged first; if MaxDBDMsgs is reached again, job
3677                      start messages are purged as well. Job completions and
3678                      node state changes continue to consume the space freed
3679                      by these purges until MaxDBDMsgs is reached once more,
3680                      at which point no new messages are tracked, creating
3681                      data loss and potentially runaway jobs.
3682
3683 When 'exit' is specified and MaxDBDMsgs is reached the
3684 slurmctld will exit instead of discarding any messages.
3685 It will be impossible to start the slurmctld with this
3686                      option when the slurmdbd is down and the slurmctld is
3687 tracking more than MaxDBDMsgs.
3688
3689
3690 preempt_send_user_signal
3691 Send the user signal (e.g. --signal=<sig_num>) at preemp‐
3692 tion time even if the signal time hasn't been reached. In
3693 the case of a gracetime preemption the user signal will
3694 be sent if the user signal has been specified and not
3695 sent, otherwise a SIGTERM will be sent to the tasks.
3696
3697 reboot_from_controller
3698 Run the RebootProgram from the controller instead of on
3699 the slurmds. The RebootProgram will be passed a
3700 comma-separated list of nodes to reboot.
3701
3702 user_resv_delete
3703 Allow any user able to run in a reservation to delete it.
3704
3705
3706 SlurmctldPidFile
3707 Fully qualified pathname of a file into which the slurmctld
3708 daemon may write its process id. This may be used for automated
3709 signal processing. The default value is "/var/run/slurm‐
3710 ctld.pid".
3711
3712
3713 SlurmctldPlugstack
3714 A comma-delimited list of Slurm controller plugins to be started
3715 when the daemon begins and terminated when it ends. Only the
3716 plugin's init and fini functions are called.
3717
3718
3719 SlurmctldPort
3720 The port number that the Slurm controller, slurmctld, listens to
3721 for work. The default value is SLURMCTLD_PORT as established at
3722 system build time. If none is explicitly specified, it will be
3723 set to 6817. SlurmctldPort may also be configured to support a
3724 range of port numbers in order to accept larger bursts of incom‐
3725 ing messages by specifying two numbers separated by a dash (e.g.
3726        SlurmctldPort=6817-6818). NOTE: Either the slurmctld and slurmd
3727        daemons must not execute on the same nodes, or the values of
3728 SlurmctldPort and SlurmdPort must be different.
3729
3730 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3731 automatically try to interact with anything opened on ports
3732 8192-60000. Configure SlurmctldPort to use a port outside of
3733 the configured SrunPortRange and RSIP's port range.
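A sketch of the port-range form described above (ports are illustrative; the slurmd port is kept outside the controller's range, as required when the daemons share nodes):

```ini
# Accept controller traffic on a two-port range for larger
# bursts of incoming messages.
SlurmctldPort=6817-6818
SlurmdPort=6819
```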
3734
3735
3736 SlurmctldPrimaryOffProg
3737 This program is executed when a slurmctld daemon running as the
3738 primary server becomes a backup server. By default no program is
3739 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3740 ter.
3741
3742
3743 SlurmctldPrimaryOnProg
3744 This program is executed when a slurmctld daemon running as a
3745 backup server becomes the primary server. By default no program
3746                is executed. When using virtual IP addresses to manage Highly
3747                Available Slurm services, this program can be used to add the IP
3748 address to an interface (and optionally try to kill the unre‐
3749 sponsive slurmctld daemon and flush the ARP caches on nodes on
3750 the local Ethernet fabric). See also the related "SlurmctldPri‐
3751 maryOffProg" parameter.
3752
3753 SlurmctldSyslogDebug
3754 The slurmctld daemon will log events to the syslog file at the
3755                specified level of detail. If not set, the slurmctld daemon will
3756                log to syslog at level fatal. However, if there is no SlurmctldLog‐
3757                File and the daemon is running in the background, it will log
3758                to syslog at the level specified by SlurmctldDebug (or at fatal
3759                if SlurmctldDebug is set to quiet); if it is running in the
3760                foreground, syslog logging will be set to quiet.
3761
3762
3763 quiet Log nothing
3764
3765 fatal Log only fatal errors
3766
3767 error Log only errors
3768
3769 info Log errors and general informational messages
3770
3771 verbose Log errors and verbose informational messages
3772
3773 debug Log errors and verbose informational messages and de‐
3774 bugging messages
3775
3776 debug2 Log errors and verbose informational messages and more
3777 debugging messages
3778
3779 debug3 Log errors and verbose informational messages and even
3780 more debugging messages
3781
3782 debug4 Log errors and verbose informational messages and even
3783 more debugging messages
3784
3785 debug5 Log errors and verbose informational messages and even
3786 more debugging messages
3787
3788
3789
3790 SlurmctldTimeout
3791 The interval, in seconds, that the backup controller waits for
3792 the primary controller to respond before assuming control. The
3793 default value is 120 seconds. May not exceed 65533.
3794
3795
3796 SlurmdDebug
3797                The level of detail to provide in the slurmd daemon's logs. The
3798                default value is info.
3799
3800 quiet Log nothing
3801
3802 fatal Log only fatal errors
3803
3804 error Log only errors
3805
3806 info Log errors and general informational messages
3807
3808 verbose Log errors and verbose informational messages
3809
3810 debug Log errors and verbose informational messages and de‐
3811 bugging messages
3812
3813 debug2 Log errors and verbose informational messages and more
3814 debugging messages
3815
3816 debug3 Log errors and verbose informational messages and even
3817 more debugging messages
3818
3819 debug4 Log errors and verbose informational messages and even
3820 more debugging messages
3821
3822 debug5 Log errors and verbose informational messages and even
3823 more debugging messages
3824
3825
3826 SlurmdLogFile
3827 Fully qualified pathname of a file into which the slurmd dae‐
3828 mon's logs are written. The default value is none (performs
3829 logging via syslog). Any "%h" within the name is replaced with
3830 the hostname on which the slurmd is running. Any "%n" within
3831 the name is replaced with the Slurm node name on which the
3832 slurmd is running.
3833 See the section LOGGING if a pathname is specified.
3834
3835
3836 SlurmdParameters
3837 Parameters specific to the Slurmd. Multiple options may be
3838 comma separated.
3839
3840 config_overrides
3841 If set, consider the configuration of each node to be
3842 that specified in the slurm.conf configuration file and
3843                       any node with less than the configured resources will not
3844                       be set to state DRAIN. This option is generally only useful for
3845 testing purposes. Equivalent to the now deprecated
3846 FastSchedule=2 option.
3847
3848 l3cache_as_socket
3849 Use the hwloc l3cache as the socket count. Can be useful
3850 on certain processors where the socket level is too
3851 coarse, and the l3cache may provide better task distribu‐
3852 tion. (E.g., along CCX boundaries instead of socket
3853 boundaries.) Requires hwloc v2.
3854
3855 shutdown_on_reboot
3856 If set, the Slurmd will shut itself down when a reboot
3857 request is received.
3858
3859
3860 SlurmdPidFile
3861 Fully qualified pathname of a file into which the slurmd daemon
3862 may write its process id. This may be used for automated signal
3863 processing. Any "%h" within the name is replaced with the host‐
3864 name on which the slurmd is running. Any "%n" within the name
3865 is replaced with the Slurm node name on which the slurmd is run‐
3866 ning. The default value is "/var/run/slurmd.pid".
3867
3868
3869 SlurmdPort
3870 The port number that the Slurm compute node daemon, slurmd, lis‐
3871 tens to for work. The default value is SLURMD_PORT as estab‐
3872 lished at system build time. If none is explicitly specified,
3873                its value will be 6818.  NOTE: Either the slurmctld and slurmd
3874                daemons must not execute on the same nodes, or the values of
3875                SlurmctldPort and SlurmdPort must be different.
3876
3877 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3878 automatically try to interact with anything opened on ports
3879 8192-60000. Configure SlurmdPort to use a port outside of the
3880 configured SrunPortRange and RSIP's port range.
3881
3882
3883 SlurmdSpoolDir
3884 Fully qualified pathname of a directory into which the slurmd
3885 daemon's state information and batch job script information are
3886 written. This must be a common pathname for all nodes, but
3887 should represent a directory which is local to each node (refer‐
3888 ence a local file system). The default value is
3889 "/var/spool/slurmd". Any "%h" within the name is replaced with
3890 the hostname on which the slurmd is running. Any "%n" within
3891 the name is replaced with the Slurm node name on which the
3892 slurmd is running.
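The "%h" and "%n" substitutions described above amount to simple pattern replacement. The following sketch illustrates the behavior (this is not Slurm's actual implementation, and the hostname and node name values are hypothetical examples):

```python
# Illustrative sketch of the "%h"/"%n" filename substitutions
# described above. Not Slurm's actual code; the example hostname
# and node name are hypothetical.

def expand_spool_dir(pattern, hostname, nodename):
    """Replace %h with the hostname and %n with the Slurm node name."""
    return pattern.replace("%h", hostname).replace("%n", nodename)

print(expand_spool_dir("/var/spool/slurmd.%n", "node01.cluster", "node01"))
# /var/spool/slurmd.node01
```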
3893
3894
3895 SlurmdSyslogDebug
3896 The slurmd daemon will log events to the syslog file at the
3897                specified level of detail. If not set, the slurmd daemon will
3898                log to syslog at level fatal. However, if there is no SlurmdLog‐
3899                File and the daemon is running in the background, it will log to
3900                syslog at the level specified by SlurmdDebug (or at fatal if
3901                SlurmdDebug is set to quiet); if it is running in the fore‐
3902                ground, syslog logging will be set to quiet.
3903
3904
3905 quiet Log nothing
3906
3907 fatal Log only fatal errors
3908
3909 error Log only errors
3910
3911 info Log errors and general informational messages
3912
3913 verbose Log errors and verbose informational messages
3914
3915 debug Log errors and verbose informational messages and de‐
3916 bugging messages
3917
3918 debug2 Log errors and verbose informational messages and more
3919 debugging messages
3920
3921 debug3 Log errors and verbose informational messages and even
3922 more debugging messages
3923
3924 debug4 Log errors and verbose informational messages and even
3925 more debugging messages
3926
3927 debug5 Log errors and verbose informational messages and even
3928 more debugging messages
3929
3930
3931 SlurmdTimeout
3932 The interval, in seconds, that the Slurm controller waits for
3933 slurmd to respond before configuring that node's state to DOWN.
3934                A value of zero indicates that the node will not be tested by
3935                slurmctld to confirm the state of slurmd, that the node will not
3936                be automatically set to a DOWN state indicating a non-responsive
3937 slurmd, and some other tool will take responsibility for moni‐
3938 toring the state of each compute node and its slurmd daemon.
3939 Slurm's hierarchical communication mechanism is used to ping the
3940 slurmd daemons in order to minimize system noise and overhead.
3941 The default value is 300 seconds. The value may not exceed
3942 65533 seconds.
3943
3944
3945 SlurmdUser
3946 The name of the user that the slurmd daemon executes as. This
3947 user must exist on all nodes of the cluster for authentication
3948 of communications between Slurm components. The default value
3949 is "root".
3950
3951
3952 SlurmSchedLogFile
3953 Fully qualified pathname of the scheduling event logging file.
3954 The syntax of this parameter is the same as for SlurmctldLog‐
3955 File. In order to configure scheduler logging, set both the
3956 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3957
3958
3959 SlurmSchedLogLevel
3960 The initial level of scheduling event logging, similar to the
3961 SlurmctldDebug parameter used to control the initial level of
3962 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3963 (scheduler logging disabled) and "1" (scheduler logging en‐
3964 abled). If this parameter is omitted, the value defaults to "0"
3965 (disabled). In order to configure scheduler logging, set both
3966 the SlurmSchedLogFile and SlurmSchedLogLevel parameters. The
3967 scheduler logging level can be changed dynamically using scon‐
3968 trol.
3969
3970
3971 SlurmUser
3972 The name of the user that the slurmctld daemon executes as. For
3973 security purposes, a user other than "root" is recommended.
3974 This user must exist on all nodes of the cluster for authentica‐
3975 tion of communications between Slurm components. The default
3976 value is "root".
3977
3978
3979 SrunEpilog
3980 Fully qualified pathname of an executable to be run by srun fol‐
3981 lowing the completion of a job step. The command line arguments
3982 for the executable will be the command and arguments of the job
3983 step. This configuration parameter may be overridden by srun's
3984 --epilog parameter. Note that while the other "Epilog" executa‐
3985 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
3986 where the tasks are executed, the SrunEpilog runs on the node
3987 where the "srun" is executing.
3988
3989
3990 SrunPortRange
3991 The srun creates a set of listening ports to communicate with
3992 the controller, the slurmstepd and to handle the application
3993 I/O. By default these ports are ephemeral meaning the port num‐
3994 bers are selected by the kernel. Using this parameter allow
3995 sites to configure a range of ports from which srun ports will
3996 be selected. This is useful if sites want to allow only certain
3997 port range on their network.
3998
3999 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4000 automatically try to interact with anything opened on ports
4001 8192-60000. Configure SrunPortRange to use a range of ports
4002 above those used by RSIP, ideally 1000 or more ports, for exam‐
4003 ple "SrunPortRange=60001-63000".
4004
4005 Note: SrunPortRange must be large enough to cover the expected
4006 number of srun ports created on a given submission node. A sin‐
4007 gle srun opens 3 listening ports plus 2 more for every 48 hosts.
4008 Example:
4009
4010 srun -N 48 will use 5 listening ports.
4011
4012
4013 srun -N 50 will use 7 listening ports.
4014
4015
4016 srun -N 200 will use 13 listening ports.
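The counts in the examples above follow from the stated rule: 3 base listening ports, plus 2 more for every group of 48 hosts (rounded up). A small sketch of that arithmetic:

```python
import math

def srun_listening_ports(nhosts):
    """3 base listening ports plus 2 more for every
    (started) group of 48 hosts, per the rule above."""
    return 3 + 2 * math.ceil(nhosts / 48)

for n in (48, 50, 200):
    print(n, srun_listening_ports(n))
# 48 5
# 50 7
# 200 13
```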
4017
4018
4019 SrunProlog
4020 Fully qualified pathname of an executable to be run by srun
4021 prior to the launch of a job step. The command line arguments
4022 for the executable will be the command and arguments of the job
4023 step. This configuration parameter may be overridden by srun's
4024 --prolog parameter. Note that while the other "Prolog" executa‐
4025 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
4026 where the tasks are executed, the SrunProlog runs on the node
4027 where the "srun" is executing.
4028
4029
4030 StateSaveLocation
4031 Fully qualified pathname of a directory into which the Slurm
4032 controller, slurmctld, saves its state (e.g. "/usr/lo‐
4033                cal/slurm/checkpoint").  Slurm state will be saved here to re‐
4034                cover from system failures. SlurmUser must be able to create files in
4035 this directory. If you have a secondary SlurmctldHost config‐
4036 ured, this location should be readable and writable by both sys‐
4037 tems. Since all running and pending job information is stored
4038 here, the use of a reliable file system (e.g. RAID) is recom‐
4039 mended. The default value is "/var/spool". If any slurm dae‐
4040 mons terminate abnormally, their core files will also be written
4041 into this directory.
4042
4043
4044 SuspendExcNodes
4045                Specifies the nodes which are not to be placed in power save
4046                mode, even if the node remains idle for an extended period of
4047                time. Use Slurm's hostlist expression to identify nodes, with an
4048                optional ":" separator and a count of nodes to exclude from the
4049                preceding range. For example "nid[10-20]:4" will prevent 4 us‐
4050                able nodes (i.e. IDLE and not DOWN, DRAINING or already powered
4051 down) in the set "nid[10-20]" from being powered down. Multiple
4052 sets of nodes can be specified with or without counts in a comma
4053                separated list (e.g. "nid[10-20]:4,nid[80-90]:2"). If a node
4054 count specification is given, any list of nodes to NOT have a
4055 node count must be after the last specification with a count.
4056                For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes in
4057                the set "nid[10-20]" plus all nodes in the set "nid[60-70]"
4058 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
4059 "nid[1-3],nid[10-20]". By default no nodes are excluded.
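The optional ":count" suffix described above can be separated from each hostlist expression with a small parser. The sketch below is only an illustration of the syntax (it does not expand Slurm hostlist ranges or consult node state, and is not Slurm's actual parser):

```python
# Hedged sketch: splitting a SuspendExcNodes value into
# (hostlist_expression, count) pairs. Commas inside brackets
# (e.g. "lx[15,18]") must not split entries.

def parse_suspend_exc(spec):
    """Return (hostlist_expression, count_or_None) pairs."""
    entries = []
    depth = 0
    token = ""
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            entries.append(token)
            token = ""
        else:
            token += ch
    entries.append(token)
    parsed = []
    for e in entries:
        if ":" in e and not e.endswith("]"):
            hosts, count = e.rsplit(":", 1)
            parsed.append((hosts, int(count)))
        else:
            parsed.append((e, None))
    return parsed

print(parse_suspend_exc("nid[10-20]:4,nid[60-70]"))
# [('nid[10-20]', 4), ('nid[60-70]', None)]
```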
4060
4061
4062 SuspendExcParts
4063                Specifies the partitions whose nodes are not to be placed in
4064 power save mode, even if the node remains idle for an extended
4065 period of time. Multiple partitions can be identified and sepa‐
4066 rated by commas. By default no nodes are excluded.
4067
4068
4069 SuspendProgram
4070 SuspendProgram is the program that will be executed when a node
4071 remains idle for an extended period of time. This program is
4072 expected to place the node into some power save mode. This can
4073 be used to reduce the frequency and voltage of a node or com‐
4074 pletely power the node off. The program executes as SlurmUser.
4075 The argument to the program will be the names of nodes to be
4076 placed into power savings mode (using Slurm's hostlist expres‐
4077 sion format). By default, no program is run.
4078
4079
4080 SuspendRate
4081 The rate at which nodes are placed into power save mode by Sus‐
4082                pendProgram. The value is the number of nodes per minute and can
4083 be used to prevent a large drop in power consumption (e.g. after
4084 a large job completes). A value of zero results in no limits
4085 being imposed. The default value is 60 nodes per minute.
4086
4087
4088 SuspendTime
4089 Nodes which remain idle or down for this number of seconds will
4090 be placed into power save mode by SuspendProgram. Setting Sus‐
4091 pendTime to anything but INFINITE (or -1) will enable power save
4092 mode. INFINITE is the default.
4093
4094
4095 SuspendTimeout
4096 Maximum time permitted (in seconds) between when a node suspend
4097                request is issued and when the node is shut down. At that time
4098 the node must be ready for a resume request to be issued as
4099 needed for new work. The default value is 30 seconds.
4100
4101
4102 SwitchParameters
4103 Optional parameters for the switch plugin.
4104
4105
4106 SwitchType
4107 Identifies the type of switch or interconnect used for applica‐
4108 tion communications. Acceptable values include
4109                "switch/cray_aries" for Cray systems and "switch/none" for
4110                switches not requiring special processing for job launch or ter‐
4111                mination (e.g. Ethernet and InfiniBand). The default value is
4112                "switch/none".  All Slurm daemons, commands and running jobs
4113 must be restarted for a change in SwitchType to take effect. If
4114 running jobs exist at the time slurmctld is restarted with a new
4115 value of SwitchType, records of all jobs in any state may be
4116 lost.
4117
4118
4119 TaskEpilog
4120                Fully qualified pathname of a program to be executed as the
4121                Slurm job's owner after termination of each task. See TaskProlog for
4122 execution order details.
4123
4124
4125 TaskPlugin
4126 Identifies the type of task launch plugin, typically used to
4127 provide resource management within a node (e.g. pinning tasks to
4128 specific processors). More than one task plugin can be specified
4129 in a comma-separated list. The prefix of "task/" is optional.
4130 Acceptable values include:
4131
4132 task/affinity enables resource containment using
4133 sched_setaffinity(). This enables the --cpu-bind
4134 and/or --mem-bind srun options.
4135
4136 task/cgroup enables resource containment using Linux control
4137 cgroups. This enables the --cpu-bind and/or
4138 --mem-bind srun options. NOTE: see "man
4139 cgroup.conf" for configuration details.
4140
4141 task/none for systems requiring no special handling of user
4142 tasks. Lacks support for the --cpu-bind and/or
4143 --mem-bind srun options. The default value is
4144 "task/none".
4145
4146 NOTE: It is recommended to stack task/affinity,task/cgroup to‐
4147 gether when configuring TaskPlugin, and setting Constrain‐
4148 Cores=yes in cgroup.conf. This setup uses the task/affinity
4149 plugin for setting the affinity of the tasks and uses the
4150 task/cgroup plugin to fence tasks into the specified resources.
4151
4152 NOTE: For CRAY systems only: task/cgroup must be used with, and
4153 listed after task/cray_aries in TaskPlugin. The task/affinity
4154 plugin can be listed anywhere, but the previous constraint must
4155 be satisfied. For CRAY systems, a configuration like this is
4156 recommended:
4157 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
4158
4159
4160 TaskPluginParam
4161 Optional parameters for the task plugin. Multiple options
4162 should be comma separated. None, Boards, Sockets, Cores and
4163 Threads are mutually exclusive and treated as a last possible
4164 source of --cpu-bind default. See also Node and Partition Cpu‐
4165 Bind options.
4166
4167
4168 Cores Bind tasks to cores by default. Overrides automatic
4169 binding.
4170
4171 None Perform no task binding by default. Overrides automatic
4172 binding.
4173
4174 Sockets
4175 Bind to sockets by default. Overrides automatic binding.
4176
4177 Threads
4178 Bind to threads by default. Overrides automatic binding.
4179
4180 SlurmdOffSpec
4181 If specialized cores or CPUs are identified for the node
4182 (i.e. the CoreSpecCount or CpuSpecList are configured for
4183 the node), then Slurm daemons running on the compute node
4184 (i.e. slurmd and slurmstepd) should run outside of those
4185 resources (i.e. specialized resources are completely un‐
4186 available to Slurm daemons and jobs spawned by Slurm).
4187 This option may not be used with the task/cray_aries
4188 plugin.
4189
4190 Verbose
4191 Verbosely report binding before tasks run by default.
4192
4193 Autobind
4194 Set a default binding in the event that "auto binding"
4195 doesn't find a match. Set to Threads, Cores or Sockets
4196 (E.g. TaskPluginParam=autobind=threads).
4197
4198
4199                Fully qualified pathname of a program to be executed as the
4200                Slurm job's owner prior to initiation of each task. Besides the nor‐
4201 job's owner prior to initiation of each task. Besides the nor‐
4202 mal environment variables, this has SLURM_TASK_PID available to
4203 identify the process ID of the task being started. Standard
4204 output from this program can be used to control the environment
4205 variables and output for the user program.
4206
4207 export NAME=value Will set environment variables for the task
4208 being spawned. Everything after the equal
4209 sign to the end of the line will be used as
4210 the value for the environment variable. Ex‐
4211 porting of functions is not currently sup‐
4212 ported.
4213
4214 print ... Will cause that line (without the leading
4215 "print ") to be printed to the job's stan‐
4216 dard output.
4217
4218 unset NAME Will clear environment variables for the
4219 task being spawned.
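A TaskProlog can be any executable that writes the directives above to standard output. The following minimal sketch emits all three; the variable names are hypothetical examples:

```python
#!/usr/bin/env python3
# Minimal TaskProlog sketch: emits the "export", "print" and
# "unset" directives described above on standard output. The
# variable names (MY_SCRATCH, MY_DEBUG_FLAG) are hypothetical.
import os

def make_directives(task_pid):
    # Set an environment variable for the task being spawned.
    yield "export MY_SCRATCH=/tmp/scratch"
    # Write a line (without the leading "print ") to the job's
    # standard output.
    yield "print task prolog ran for pid " + task_pid
    # Clear an environment variable for the task being spawned.
    yield "unset MY_DEBUG_FLAG"

if __name__ == "__main__":
    for line in make_directives(os.environ.get("SLURM_TASK_PID", "?")):
        print(line)
```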
4220
4221 The order of task prolog/epilog execution is as follows:
4222
4223                1. pre_launch_priv()
4224                                   Function in TaskPlugin
4225
4226                2. pre_launch()    Function in TaskPlugin
4227
4228                3. TaskProlog      System-wide per task program defined in
4229                                   slurm.conf
4230
4231                4. User prolog     Job-step-specific task program defined using
4232                                   srun's --task-prolog option or
4233                                   SLURM_TASK_PROLOG environment variable
4234
4235                5. Task            Execute the job step's task
4236
4237                6. User epilog     Job-step-specific task program defined using
4238                                   srun's --task-epilog option or
4239                                   SLURM_TASK_EPILOG environment variable
4240
4241                7. TaskEpilog      System-wide per task program defined in
4242                                   slurm.conf
4243
4244                8. post_term()     Function in TaskPlugin
4245
4246
4247 TCPTimeout
4248 Time permitted for TCP connection to be established. Default
4249 value is 2 seconds.
4250
4251
4252 TmpFS Fully qualified pathname of the file system available to user
4253 jobs for temporary storage. This parameter is used in establish‐
4254 ing a node's TmpDisk space. The default value is "/tmp".
4255
4256
4257 TopologyParam
4258 Comma-separated options identifying network topology options.
4259
4260 Dragonfly Optimize allocation for Dragonfly network. Valid
4261 when TopologyPlugin=topology/tree.
4262
4263 TopoOptional Only optimize allocation for network topology if
4264 the job includes a switch option. Since optimiz‐
4265 ing resource allocation for topology involves
4266 much higher system overhead, this option can be
4267 used to impose the extra overhead only on jobs
4268 which can take advantage of it. If most job allo‐
4269 cations are not optimized for network topology,
4270 they may fragment resources to the point that
4271 topology optimization for other jobs will be dif‐
4272 ficult to achieve. NOTE: Jobs may span across
4273 nodes without common parent switches with this
4274 enabled.
4275
4276
4277 TopologyPlugin
4278 Identifies the plugin to be used for determining the network
4279 topology and optimizing job allocations to minimize network con‐
4280 tention. See NETWORK TOPOLOGY below for details. Additional
4281 plugins may be provided in the future which gather topology in‐
4282 formation directly from the network. Acceptable values include:
4283
4284 topology/3d_torus best-fit logic over three-dimensional
4285 topology
4286
4287 topology/none default for other systems, best-fit logic
4288 over one-dimensional topology
4289
4290 topology/tree used for a hierarchical network as de‐
4291 scribed in a topology.conf file
4292
4293
4294 TrackWCKey
4295                Boolean yes or no. Used to enable display and tracking of the
4296                Workload Characterization Key. Must be set to track correct wckey
4297 usage. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4298 file to create historical usage reports.
4299
4300
4301 TreeWidth
4302 Slurmd daemons use a virtual tree network for communications.
4303 TreeWidth specifies the width of the tree (i.e. the fanout). On
4304 architectures with a front end node running the slurmd daemon,
4305 the value must always be equal to or greater than the number of
4306                front end nodes, which eliminates the need for message forwarding
4307 between the slurmd daemons. On other architectures the default
4308 value is 50, meaning each slurmd daemon can communicate with up
4309 to 50 other slurmd daemons and over 2500 nodes can be contacted
4310 with two message hops. The default value will work well for
4311 most clusters. Optimal system performance can typically be
4312 achieved if TreeWidth is set to the square root of the number of
4313 nodes in the cluster for systems having no more than 2500 nodes
4314 or the cube root for larger systems. The value may not exceed
4315 65533.
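The sizing heuristic above (square root of the node count up to 2500 nodes, cube root beyond) can be sketched as follows. The helper name is hypothetical; this is only the rule as stated in the text, capped at the documented 65533 maximum:

```python
import math

def suggested_tree_width(nnodes):
    """Square root of the node count for systems of up to 2500
    nodes, cube root for larger systems, capped at 65533."""
    if nnodes <= 2500:
        width = math.isqrt(nnodes)
    else:
        width = round(nnodes ** (1.0 / 3.0))
    return max(1, min(width, 65533))

print(suggested_tree_width(2500))   # 50
print(suggested_tree_width(27000))  # 30
```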
4316
4317
4318 UnkillableStepProgram
4319 If the processes in a job step are determined to be unkillable
4320 for a period of time specified by the UnkillableStepTimeout
4321 variable, the program specified by UnkillableStepProgram will be
4322 executed. By default no program is run.
4323
4324 See section UNKILLABLE STEP PROGRAM SCRIPT for more information.
4325
4326
4327 UnkillableStepTimeout
4328 The length of time, in seconds, that Slurm will wait before de‐
4329 ciding that processes in a job step are unkillable (after they
4330 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4331 gram. The default timeout value is 60 seconds. If exceeded,
4332 the compute node will be drained to prevent future jobs from be‐
4333 ing scheduled on the node.
4334
4335
4336 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4337 will be enabled. PAM is used to establish the upper bounds for
4338 resource limits. With PAM support enabled, local system adminis‐
4339 trators can dynamically configure system resource limits. Chang‐
4340 ing the upper bound of a resource limit will not alter the lim‐
4341 its of running jobs, only jobs started after a change has been
4342 made will pick up the new limits. The default value is 0 (not
4343 to enable PAM support). Remember that PAM also needs to be con‐
4344 figured to support Slurm as a service. For sites using PAM's
4345 directory based configuration option, a configuration file named
4346 slurm should be created. The module-type, control-flags, and
4347 module-path names that should be included in the file are:
4348 auth required pam_localuser.so
4349 auth required pam_shells.so
4350 account required pam_unix.so
4351 account required pam_access.so
4352 session required pam_unix.so
4353 For sites configuring PAM with a general configuration file, the
4354 appropriate lines (see above), where slurm is the service-name,
4355 should be added.
4356
4357                NOTE: The UsePAM option has nothing to do with the con‐
4358 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules. So
4359 these two modules can work independently of the value set for
4360 UsePAM.
4361
4362
4363 VSizeFactor
4364 Memory specifications in job requests apply to real memory size
4365 (also known as resident set size). It is possible to enforce
4366 virtual memory limits for both jobs and job steps by limiting
4367 their virtual memory to some percentage of their real memory al‐
4368 location. The VSizeFactor parameter specifies the job's or job
4369 step's virtual memory limit as a percentage of its real memory
4370 limit. For example, if a job's real memory limit is 500MB and
4371 VSizeFactor is set to 101 then the job will be killed if its
4372 real memory exceeds 500MB or its virtual memory exceeds 505MB
4373 (101 percent of the real memory limit). The default value is 0,
4374 which disables enforcement of virtual memory limits. The value
4375 may not exceed 65533 percent.
4376
4377 NOTE: This parameter is dependent on OverMemoryKill being con‐
4378 figured in JobAcctGatherParams. It is also possible to configure
4379 the TaskPlugin to use task/cgroup for memory enforcement. VSize‐
4380 Factor will not have an effect on memory enforcement done
4381 through cgroups.
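The worked example above (a 500MB real memory limit with VSizeFactor=101 yielding a 505MB virtual memory limit) is a straightforward percentage calculation:

```python
def vsize_limit_mb(real_limit_mb, vsize_factor):
    """Virtual memory limit as a percentage of the real memory
    limit. A factor of 0 disables virtual memory enforcement."""
    if vsize_factor == 0:
        return None
    return real_limit_mb * vsize_factor / 100

print(vsize_limit_mb(500, 101))
# 505.0
```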
4382
4383
4384 WaitTime
4385 Specifies how many seconds the srun command should by default
4386 wait after the first task terminates before terminating all re‐
4387 maining tasks. The "--wait" option on the srun command line
4388 overrides this value. The default value is 0, which disables
4389 this feature. May not exceed 65533 seconds.
4390
4391
4392 X11Parameters
4393 For use with Slurm's built-in X11 forwarding implementation.
4394
4395 home_xauthority
4396 If set, xauth data on the compute node will be placed in
4397 ~/.Xauthority rather than in a temporary file under
4398 TmpFS.
4399
4400
4401 NODE CONFIGURATION
4402    The configuration of nodes (or machines) to be managed by Slurm is also
4403 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4404 adding nodes, changing their processor count, etc.) require restarting
4405 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4406 must know each node in the system to forward messages in support of hi‐
4407 erarchical communications. Only the NodeName must be supplied in the
4408 configuration file. All other node configuration information is op‐
4409 tional. It is advisable to establish baseline node configurations, es‐
4410 pecially if the cluster is heterogeneous. Nodes which register to the
4411 system with less than the configured resources (e.g. too little mem‐
4412 ory), will be placed in the "DOWN" state to avoid scheduling jobs on
4413 them. Establishing baseline configurations will also speed Slurm's
4414 scheduling process by permitting it to compare job requirements against
4415 these (relatively few) configuration parameters and possibly avoid hav‐
4416 ing to check job requirements against every individual node's configu‐
4417 ration. The resources checked at node registration time are: CPUs,
4418 RealMemory and TmpDisk.
4419
4420 Default values can be specified with a record in which NodeName is "DE‐
4421 FAULT". The default entry values will apply only to lines following it
4422 in the configuration file and the default values can be reset multiple
4423 times in the configuration file with multiple entries where "Node‐
4424    Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4425    add to previous default values and not reinitialize the default val‐
4426 ues. The "NodeName=" specification must be placed on every line de‐
4427 scribing the configuration of nodes. A single node name can not appear
4428 as a NodeName value in more than one line (duplicate node name records
4429 will be ignored). In fact, it is generally possible and desirable to
4430 define the configurations of all nodes in only a few lines. This con‐
4431 vention permits significant optimization in the scheduling of larger
4432 clusters. In order to support the concept of jobs requiring consecu‐
4433    tive nodes on some architectures, node specifications should be placed
4434 in this file in consecutive order. No single node name may be listed
4435 more than once in the configuration file. Use "DownNodes=" to record
4436 the state of nodes which are temporarily in a DOWN, DRAIN or FAILING
4437 state without altering permanent configuration information. A job
4438    step's tasks are allocated to nodes in the order the nodes appear in the
4439 configuration file. There is presently no capability within Slurm to
4440 arbitrarily order a job step's tasks.
4441
4442 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4443 and/or a simple node range expression may optionally be used to specify
4444 numeric ranges of nodes to avoid building a configuration file with
4445 large numbers of entries. The node range expression can contain one
4446 pair of square brackets with a sequence of comma-separated numbers
4447 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4448 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4449 more leading zeros to indicate the numeric portion has a fixed number
4450 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4451 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4452 more numeric expressions are included, one of them must be at the end
4453 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4454 always be used in a comma-separated list.
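As an illustration of the bracketed range syntax above, here is a simplified expander for expressions with a single bracket pair. It is not Slurm's hostlist code (which also handles multiple bracket pairs such as "rack[0-63]_blade[0-41]" and comma-separated lists of expressions), but it shows the range and leading-zero semantics:

```python
import re

def expand_hostlist(expr):
    """Expand a single-bracket expression such as "lx[15,18,32-33]".
    Leading zeros fix the width, e.g. "linux[0000-0002]"."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if m is None:
        return [expr]  # plain name, nothing to expand
    prefix, ranges, suffix = m.groups()
    names = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        # A leading zero means the numeric field has fixed width.
        width = len(lo) if lo.startswith("0") and len(lo) > 1 else 0
        for i in range(int(lo), int(hi) + 1):
            names.append("%s%0*d%s" % (prefix, width, i, suffix))
    return names

print(expand_hostlist("lx[15,18,32-33]"))
# ['lx15', 'lx18', 'lx32', 'lx33']
```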
4455
4456    The node configuration specifies the following information:
4457
4458
4459 NodeName
4460 Name that Slurm uses to refer to a node. Typically this would
4461 be the string that "/bin/hostname -s" returns. It may also be
4462 the fully qualified domain name as returned by "/bin/hostname
4463 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4464 with the host through the host database (/etc/hosts) or DNS, de‐
4465 pending on the resolver settings. Note that if the short form
4466 of the hostname is not used, it may prevent use of hostlist ex‐
4467 pressions (the numeric portion in brackets must be at the end of
4468 the string). It may also be an arbitrary string if NodeHostname
4469 is specified. If the NodeName is "DEFAULT", the values speci‐
4470 fied with that record will apply to subsequent node specifica‐
4471 tions unless explicitly set to other values in that node record
4472 or replaced with a different set of default values. Each line
4473                where NodeName is "DEFAULT" will replace or add to previous de‐
4474                fault values and not reinitialize the default values. For ar‐
4475 chitectures in which the node order is significant, nodes will
4476 be considered consecutive in the order defined. For example, if
4477 the configuration for "NodeName=charlie" immediately follows the
4478 configuration for "NodeName=baker" they will be considered adja‐
4479 cent in the computer.
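
For example (with hypothetical node names and values), a DEFAULT record can supply values that the node records following it inherit:

NodeName=DEFAULT CPUs=16 RealMemory=32000 State=UNKNOWN
NodeName=baker
NodeName=charlie RealMemory=64000

Here "charlie" overrides RealMemory while inheriting the remaining defaults, and is considered adjacent to "baker" in the computer.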
4480
4481
4482 NodeHostname
4483 Typically this would be the string that "/bin/hostname -s" re‐
4484 turns. It may also be the fully qualified domain name as re‐
4485 turned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid
4486 domain name associated with the host through the host database
4487 (/etc/hosts) or DNS, depending on the resolver settings. Note
4488 that if the short form of the hostname is not used, it may pre‐
4489 vent use of hostlist expressions (the numeric portion in brack‐
4490 ets must be at the end of the string). A node range expression
4491 can be used to specify a set of nodes. If an expression is
4492 used, the number of nodes identified by NodeHostname on a line
4493 in the configuration file must be identical to the number of
4494 nodes identified by NodeName. By default, the NodeHostname will
4495 be identical in value to NodeName.
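
For example (hypothetical names), arbitrary NodeName values can be mapped onto the real hostnames of the same number of nodes:

NodeName=node[0-3] NodeHostname=achilles[0-3] CPUs=8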
4496
4497
4498 NodeAddr
4499 Name by which a node should be referred to in establishing a commu‐
4500 nications path. This name will be used as an argument to the
4501 getaddrinfo() function for identification. If a node range ex‐
4502 pression is used to designate multiple nodes, they must exactly
4503 match the entries in the NodeName (e.g. "NodeName=lx[0-7]
4504 NodeAddr=elx[0-7]"). NodeAddr may also contain IP addresses.
4505 By default, the NodeAddr will be identical in value to NodeHost‐
4506 name.
4507
4508
4509 BcastAddr
4510 Alternate network path to be used for sbcast network traffic to
4511 a given node. This name will be used as an argument to the
4512 getaddrinfo() function. If a node range expression is used to
4513 designate multiple nodes, they must exactly match the entries in
4514 the NodeName (e.g. "NodeName=lx[0-7] BcastAddr=elx[0-7]").
4515 BcastAddr may also contain IP addresses. By default, the Bcas‐
4516 tAddr is unset, and sbcast traffic will be routed to the
4517 NodeAddr for a given node. Note: cannot be used with Communica‐
4518 tionParameters=NoInAddrAny.
4519
4520
4521 Boards Number of Baseboards in nodes with a baseboard controller. Note
4522 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4523 and ThreadsPerCore should be specified. The default value is 1.
4524
4525
4526 CoreSpecCount
4527 Number of cores reserved for system use. These cores will not
4528 be available for allocation to user jobs. Depending upon the
4529 TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e.
4530 slurmd and slurmstepd) may either be confined to these resources
4531 (the default) or prevented from using these resources. Isola‐
4532 tion of the Slurm daemons from user jobs may improve application
4533 performance. If this option and CpuSpecList are both designated
4534 for a node, an error is generated. For information on the algo‐
4535 rithm used by Slurm to select the cores refer to the core spe‐
4536 cialization documentation (
4537 https://slurm.schedmd.com/core_spec.html ).
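
As a sketch (hypothetical node names and values), the following reserves two cores on each node for the Slurm daemons:

NodeName=node[01-16] CPUs=32 CoreSpecCount=2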
4538
4539
4540 CoresPerSocket
4541 Number of cores in a single physical processor socket (e.g.
4542 "2"). The CoresPerSocket value describes physical cores, not
4543 the logical number of processors per socket. NOTE: If you have
4544 multi-core processors, you will likely need to specify this pa‐
4545 rameter in order to optimize scheduling. The default value is
4546 1.
4547
4548
4549 CpuBind
4550 If a job step request does not specify an option to control how
4551 tasks are bound to allocated CPUs (--cpu-bind) and all nodes al‐
4552 located to the job have the same CpuBind option the node CpuBind
4553 option will control how tasks are bound to allocated resources.
4554 Supported values for CpuBind are "none", "board", "socket",
4555 "ldom" (NUMA), "core" and "thread".
4556
4557
4558 CPUs Number of logical processors on the node (e.g. "2"). It can be
4559 set to the total number of sockets (supported only by select/lin‐
4560 ear), cores or threads. This can be useful when you want to
4561 schedule only the cores on a hyper-threaded node. If CPUs is
4562 omitted, its default will be set equal to the product of Boards,
4563 Sockets, CoresPerSocket, and ThreadsPerCore.
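
For example, with the hypothetical configuration below, omitting CPUs is equivalent to setting CPUs=32 (1 board x 2 sockets x 8 cores x 2 threads):

NodeName=node1 Boards=1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2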
4564
4565
4566 CpuSpecList
4567 A comma-delimited list of Slurm abstract CPU IDs reserved for
4568 system use. The list will be expanded to include all other
4569 CPUs, if any, on the same cores. These cores will not be avail‐
4570 able for allocation to user jobs. Depending upon the TaskPlug‐
4571 inParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and
4572 slurmstepd) may either be confined to these resources (the de‐
4573 fault) or prevented from using these resources. Isolation of
4574 the Slurm daemons from user jobs may improve application perfor‐
4575 mance. If this option and CoreSpecCount are both designated for
4576 a node, an error is generated. This option has no effect unless
4577 cgroup job confinement is also configured (TaskPlu‐
4578 gin=task/cgroup with ConstrainCores=yes in cgroup.conf).
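
As a sketch (hypothetical node name and CPU IDs), the following reserves abstract CPUs 0 and 1 for system use, assuming cgroup job confinement is configured as described above:

NodeName=node1 CPUs=16 CpuSpecList=0,1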
4579
4580
4581 Features
4582 A comma-delimited list of arbitrary strings indicative of some
4583 characteristic associated with the node. There is no value or
4584 count associated with a feature at this time; a node either has
4585 a feature or it does not. A desired feature may contain a nu‐
4586 meric component indicating, for example, processor speed but
4587 this numeric component will be considered to be part of the fea‐
4588 ture string. Features are intended to be used to filter nodes
4589 eligible to run jobs via the --constraint argument. By default
4590 a node has no features. Also see Gres for being able to have
4591 more control such as types and count. Using features is faster
4592 than scheduling against GRES but is limited to Boolean opera‐
4593 tions.
4594
4595
4596 Gres A comma-delimited list of generic resources specifications for a
4597 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4598 ber>[K|M|G]". The first field is the resource name, which
4599 matches the GresType configuration parameter name. The optional
4600 type field might be used to identify a model of that generic re‐
4601 source. It is forbidden to specify both an untyped GRES and a
4602 typed GRES with the same <name>. The optional no_consume field
4603 allows you to specify that a generic resource does not have a
4604 finite number of that resource that gets consumed as it is re‐
4605 quested. The no_consume field is a GRES specific setting and ap‐
4606 plies to the GRES, regardless of the type specified. The final
4607 field must specify a generic resources count. A suffix of "K",
4608 "M", "G", "T" or "P" may be used to multiply the number by 1024,
4609 1048576, 1073741824, etc. respectively.
4610 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4611 sume:4G"). By default a node has no generic resources and its
4612 maximum count is that of an unsigned 64bit integer. Also see
4613 Features for Boolean flags to filter nodes using job con‐
4614 straints.
4615
4616
4617 MemSpecLimit
4618 Amount of memory, in megabytes, reserved for system use and not
4619 available for user allocations. If the task/cgroup plugin is
4620 configured and that plugin constrains memory allocations (i.e.
4621 TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes
4622 in cgroup.conf), then Slurm compute node daemons (slurmd plus
4623 slurmstepd) will be allocated the specified memory limit. Note
4624 that SelectTypeParameters must include one of the options that
4625 treats memory as a consumable resource for this option to work.
4626 The daemons will not be killed if they exhaust
4627 the memory allocation (i.e. the Out-Of-Memory Killer is disabled
4628 for the daemon's memory cgroup). If the task/cgroup plugin is
4629 not configured, the specified memory will only be unavailable
4630 for user allocations.
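
For example (hypothetical values), the following sets aside 2048 MB of a node's memory for system use; assuming memory is configured as a consumable resource, jobs on node1 could then be allocated at most 61952 MB in total:

NodeName=node1 RealMemory=64000 MemSpecLimit=2048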
4631
4632
4633 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4634 tens to for work on this particular node. By default there is a
4635 single port number for all slurmd daemons on all compute nodes
4636 as defined by the SlurmdPort configuration parameter. Use of
4637 this option is not generally recommended except for development
4638 or testing purposes. If multiple slurmd daemons execute on a
4639 node this can specify a range of ports.
4640
4641 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4642 automatically try to interact with anything opened on ports
4643 8192-60000. Configure Port to use a port outside of the config‐
4644 ured SrunPortRange and RSIP's port range.
4645
4646
4647 Procs See CPUs.
4648
4649
4650 RealMemory
4651 Size of real memory on the node in megabytes (e.g. "2048"). The
4652 default value is 1. Lowering RealMemory with the goal of setting
4653 aside some amount for the OS, unavailable for job allocations,
4654 will not work as intended if memory is not set as a consumable
4655 resource in SelectTypeParameters, so one of the *_Memory options
4656 needs to be enabled for that goal to be accomplished.
4657 Also see MemSpecLimit.
4658
4659
4660 Reason Identifies the reason for a node being in state "DOWN",
4661 "DRAINED", "DRAINING", "FAIL" or "FAILING". Use quotes to en‐
4662 close a reason having more than one word.
4663
4664
4665 Sockets
4666 Number of physical processor sockets/chips on the node (e.g.
4667 "2"). If Sockets is omitted, it will be inferred from CPUs,
4668 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4669 multi-core processors, you will likely need to specify these pa‐
4670 rameters. Sockets and SocketsPerBoard are mutually exclusive.
4671 If Sockets is specified when Boards is also used, Sockets is in‐
4672 terpreted as SocketsPerBoard rather than total sockets. The de‐
4673 fault value is 1.
4674
4675
4676 SocketsPerBoard
4677 Number of physical processor sockets/chips on a baseboard.
4678 Sockets and SocketsPerBoard are mutually exclusive. The default
4679 value is 1.
4680
4681
4682 State State of the node with respect to the initiation of user jobs.
4683 Acceptable values are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE
4684 and UNKNOWN. Node states of BUSY and IDLE should not be speci‐
4685 fied in the node configuration, but set the node state to UN‐
4686 KNOWN instead. Setting the node state to UNKNOWN will result in
4687 the node state being set to BUSY, IDLE or other appropriate
4688 state based upon recovered system state information. The de‐
4689 fault value is UNKNOWN. Also see the DownNodes parameter below.
4690
4691 CLOUD Indicates the node exists in the cloud. Its initial
4692 state will be treated as powered down. The node will
4693 be available for use after its state is recovered from
4694 Slurm's state save file or the slurmd daemon starts on
4695 the compute node.
4696
4697 DOWN Indicates the node failed and is unavailable to be al‐
4698 located work.
4699
4700 DRAIN Indicates the node is unavailable to be allocated
4701 work.
4702
4703 FAIL Indicates the node is expected to fail soon, has no
4704 jobs allocated to it, and will not be allocated to any
4705 new jobs.
4706
4707 FAILING Indicates the node is expected to fail soon, has one
4708 or more jobs allocated to it, but will not be allo‐
4709 cated to any new jobs.
4710
4711 FUTURE Indicates the node is defined for future use and need
4712 not exist when the Slurm daemons are started. These
4713 nodes can be made available for use simply by updating
4714 the node state using the scontrol command rather than
4715 restarting the slurmctld daemon. After these nodes are
4716 made available, change their State in the slurm.conf
4717 file. Until these nodes are made available, they will
4718 not be seen using any Slurm commands, nor will any
4719 attempt be made to contact them.
4720
4721
4722 Dynamic Future Nodes
4723 A slurmd started with -F[<feature>] will be as‐
4724 sociated with a FUTURE node that matches the
4725 same configuration (sockets, cores, threads) as
4726 reported by slurmd -C. The node's NodeAddr and
4727 NodeHostname will automatically be retrieved
4728 from the slurmd and will be cleared when set
4729 back to the FUTURE state. Dynamic FUTURE nodes
4730 retain non-FUTURE state on restart. Use scon‐
4731 trol to put node back into FUTURE state.
4732
4733 If the mapping of the NodeName to the slurmd
4734 HostName is not updated in DNS, Dynamic Future
4735 nodes won't know how to communicate with each
4736 other -- because NodeAddr and NodeHostName are
4737 not defined in the slurm.conf -- and the fanout
4738 communications need to be disabled by setting
4739 TreeWidth to a high number (e.g. 65533). If the
4740 DNS mapping is made, then the cloud_dns Slurm‐
4741 ctldParameter can be used.
4742
4743
4744 UNKNOWN Indicates the node's state is undefined but will be
4745 established (set to BUSY or IDLE) when the slurmd dae‐
4746 mon on that node registers. UNKNOWN is the default
4747 state.
4748
4749
4750 ThreadsPerCore
4751 Number of logical threads in a single physical core (e.g. "2").
4752 Note that Slurm can allocate resources to jobs down to the
4753 resolution of a core. If your system is configured with more
4754 than one thread per core, execution of a different job on each
4755 thread is not supported unless you configure SelectTypeParame‐
4756 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4757 or ThreadsPerCore. A job can execute one task per thread from
4758 within one job step or execute a distinct job step on each of
4759 the threads. Note also if you are running with more than 1
4760 thread per core and running the select/cons_res or se‐
4761 lect/cons_tres plugin then you will want to set the SelectType‐
4762 Parameters variable to something other than CR_CPU to avoid un‐
4763 expected results. The default value is 1.
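
As a sketch of the thread-level scheduling case described above (hypothetical values):

SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
NodeName=node1 CPUs=32

Here Sockets, CoresPerSocket and ThreadsPerCore are deliberately left unset so that each of the 32 threads can be scheduled individually.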
4764
4765
4766 TmpDisk
4767 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4768 "16384"). TmpFS (for "Temporary File System") identifies the lo‐
4769 cation which jobs should use for temporary storage. Note this
4770 does not indicate the amount of free space available to the user
4771 on the node, only the total file system size. The system admin‐
4772 istrator should ensure this file system is purged as needed so
4773 that user jobs have access to most of this space. The Prolog
4774 and/or Epilog programs (specified in the configuration file)
4775 might be used to ensure the file system is kept clean. The de‐
4776 fault value is 0.
4777
4778
4779 TRESWeights
4780 TRESWeights are used to calculate a value that represents how
4781 busy a node is. Currently only used in federation configura‐
4782 tions. TRESWeights are different from TRESBillingWeights --
4783 which is used for fairshare calculations.
4784
4785 TRES weights are specified as a comma-separated list of <TRES
4786 Type>=<TRES Weight> pairs.
4787 e.g.
4788 NodeName=node1 ... TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
4789
4790 By default the weighted TRES value is calculated as the sum of
4791 all node TRES types multiplied by their corresponding TRES
4792 weight.
4793
4794 If PriorityFlags=MAX_TRES is configured, the weighted TRES value
4795 is calculated as the MAX of individual node TRES' (e.g. cpus,
4796 mem, gres).
4797
4798
4799 Weight The priority of the node for scheduling purposes. All things
4800 being equal, jobs will be allocated the nodes with the lowest
4801 weight which satisfies their requirements. For example, a het‐
4802 erogeneous collection of nodes might be placed into a single
4803 partition for greater system utilization, responsiveness and ca‐
4804 pability. It would be preferable to allocate smaller memory
4805 nodes rather than larger memory nodes if either will satisfy a
4806 job's requirements. The units of weight are arbitrary, but
4807 larger weights should be assigned to nodes with more processors,
4808 memory, disk space, higher processor speed, etc. Note that if a
4809 job allocation request can not be satisfied using the nodes with
4810 the lowest weight, the set of nodes with the next lowest weight
4811 is added to the set of nodes under consideration for use (repeat
4812 as needed for higher weight values). If you absolutely want to
4813 minimize the number of higher weight nodes allocated to a job
4814 (at a cost of higher scheduling overhead), give each node a dis‐
4815 tinct Weight value and they will be added to the pool of nodes
4816 being considered for scheduling individually. The default value
4817 is 1.
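
For example (hypothetical nodes), the following prefers the small-memory nodes whenever they satisfy a job's requirements:

NodeName=small[01-16] RealMemory=32000 Weight=1
NodeName=big[01-04] RealMemory=256000 Weight=10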
4818
4819
4820DOWN NODE CONFIGURATION
4821 The DownNodes= parameter permits you to mark certain nodes as in a
4822 DOWN, DRAIN, FAIL, FAILING or FUTURE state without altering the perma‐
4823 nent configuration information listed under a NodeName= specification.
4824
4825
4826 DownNodes
4827 Any node name, or list of node names, from the NodeName= speci‐
4828 fications.
4829
4830
4831 Reason Identifies the reason for a node being in state DOWN, DRAIN,
4832 FAIL, FAILING or FUTURE. Use quotes to enclose a reason having
4833 more than one word.
4834
4835
4836 State State of the node with respect to the initiation of user jobs.
4837 Acceptable values are DOWN, DRAIN, FAIL, FAILING and FUTURE.
4838 For more information about these states see the descriptions un‐
4839 der State in the NodeName= section above. The default value is
4840 DOWN.
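
As an example (hypothetical node names and reason), two nodes can be marked down without altering their NodeName= records:

DownNodes=node[08-09] State=DOWN Reason="faulty power supply"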
4841
4842
4843FRONTEND NODE CONFIGURATION
4844 On computers where frontend nodes are used to execute batch scripts
4845 rather than compute nodes, one may configure one or more frontend nodes
4846 using the configuration parameters defined below. These options are
4847 very similar to those used in configuring compute nodes. These options
4848 may only be used on systems configured and built with the appropriate
4849 parameters (--have-front-end). The front end configuration specifies
4850 the following information:
4851
4852
4853 AllowGroups
4854 Comma-separated list of group names which may execute jobs on
4855 this front end node. By default, all groups may use this front
4856 end node. A user will be permitted to use this front end node
4857 if AllowGroups has at least one group associated with the user.
4858 May not be used with the DenyGroups option.
4859
4860
4861 AllowUsers
4862 Comma-separated list of user names which may execute jobs on
4863 this front end node. By default, all users may use this front
4864 end node. May not be used with the DenyUsers option.
4865
4866
4867 DenyGroups
4868 Comma-separated list of group names which are prevented from ex‐
4869 ecuting jobs on this front end node. May not be used with the
4870 AllowGroups option.
4871
4872
4873 DenyUsers
4874 Comma-separated list of user names which are prevented from exe‐
4875 cuting jobs on this front end node. May not be used with the
4876 AllowUsers option.
4877
4878
4879 FrontendName
4880 Name that Slurm uses to refer to a frontend node. Typically
4881 this would be the string that "/bin/hostname -s" returns. It
4882 may also be the fully qualified domain name as returned by
4883 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4884 name associated with the host through the host database
4885 (/etc/hosts) or DNS, depending on the resolver settings. Note
4886 that if the short form of the hostname is not used, it may pre‐
4887 vent use of hostlist expressions (the numeric portion in brack‐
4888 ets must be at the end of the string). If the FrontendName is
4889 "DEFAULT", the values specified with that record will apply to
4890 subsequent node specifications unless explicitly set to other
4891 values in that frontend node record or replaced with a different
4892 set of default values. Each line where FrontendName is "DE‐
4893 FAULT" will replace or add to previous default values and not
4894 reinitialize the default values.
4895
4896
4897 FrontendAddr
4898 Name by which a frontend node should be referred to in establishing
4899 a communications path. This name will be used as an argument to
4900 the getaddrinfo() function for identification. As with Fron‐
4901 tendName, list the individual node addresses rather than using a
4902 hostlist expression. The number of FrontendAddr records per
4903 line must equal the number of FrontendName records per line
4904 (i.e. you can't map two node names to one address). FrontendAddr
4905 may also contain IP addresses. By default, the FrontendAddr
4906 will be identical in value to FrontendName.
4907
4908
4909 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4910 tens to for work on this particular frontend node. By default
4911 there is a single port number for all slurmd daemons on all
4912 frontend nodes as defined by the SlurmdPort configuration param‐
4913 eter. Use of this option is not generally recommended except for
4914 development or testing purposes.
4915
4916 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4917 automatically try to interact with anything opened on ports
4918 8192-60000. Configure Port to use a port outside of the config‐
4919 ured SrunPortRange and RSIP's port range.
4920
4921
4922 Reason Identifies the reason for a frontend node being in state DOWN,
4923 DRAINED, DRAINING, FAIL or FAILING. Use quotes to enclose a
4924 reason having more than one word.
4925
4926
4927 State State of the frontend node with respect to the initiation of
4928 user jobs. Acceptable values are DOWN, DRAIN, FAIL, FAILING and
4929 UNKNOWN. Node states of BUSY and IDLE should not be specified
4930 in the node configuration, but set the node state to UNKNOWN in‐
4931 stead. Setting the node state to UNKNOWN will result in the
4932 node state being set to BUSY, IDLE or other appropriate state
4933 based upon recovered system state information. For more infor‐
4934 mation about these states see the descriptions under State in
4935 the NodeName= section above. The default value is UNKNOWN.
4936
4937
4938 As an example, you can do something similar to the following to define
4939 four front end nodes for running slurmd daemons.
4940 FrontendName=frontend[00-03] FrontendAddr=efrontend[00-03] State=UNKNOWN
4941
4942
4943NODESET CONFIGURATION
4944 The nodeset configuration allows you to define a name for a specific
4945 set of nodes which can be used to simplify the partition configuration
4946 section, especially for heterogeneous or condo-style systems. Each node‐
4947 set may be defined by an explicit list of nodes, and/or by filtering
4948 the nodes by a particular configured feature. If both Feature= and
4949 Nodes= are used the nodeset shall be the union of the two subsets.
4950 Note that the nodesets are only used to simplify the partition defini‐
4951 tions at present, and are not usable outside of the partition configu‐
4952 ration.
4953
4954 Feature
4955 All nodes with this single feature will be included as part of
4956 this nodeset.
4957
4958 Nodes List of nodes in this set.
4959
4960 NodeSet
4961 Unique name for a set of nodes. Must not overlap with any Node‐
4962 Name definitions.
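
For example (hypothetical names and feature), a nodeset can combine a feature filter with an explicit node list:

NodeSet=gpunodes Feature=gpu Nodes=node[17-20]

A partition definition can then use "Nodes=gpunodes" instead of repeating the node list.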
4963
4964
4966 The partition configuration permits you to establish different job lim‐
4967 its or access controls for various groups (or partitions) of nodes.
4968 Nodes may be in more than one partition, making partitions serve as
4969 general purpose queues. For example one may put the same set of nodes
4970 into two different partitions, each with different constraints (time
4971 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4972 allocated resources within a single partition. Default values can be
4973 specified with a record in which PartitionName is "DEFAULT". The de‐
4974 fault entry values will apply only to lines following it in the config‐
4975 uration file and the default values can be reset multiple times in the
4976 configuration file with multiple entries where "PartitionName=DEFAULT".
4977 The "PartitionName=" specification must be placed on every line de‐
4978 scribing the configuration of partitions. Each line where Partition‐
4979 Name is "DEFAULT" will replace or add to previous default values and
4980 not reinitialize the default values. A single partition name cannot
4981 appear as a PartitionName value in more than one line (duplicate parti‐
4982 tion name records will be ignored). If a partition that is in use is
4983 deleted from the configuration and slurm is restarted or reconfigured
4984 (scontrol reconfigure), jobs using the partition are canceled. NOTE:
4985 Put all parameters for each partition on a single line. Each line of
4986 partition configuration information should represent a different parti‐
4987 tion. The partition configuration file contains the following informa‐
4988 tion:
4989
4990
4991 AllocNodes
4992 Comma-separated list of nodes from which users can submit jobs
4993 in the partition. Node names may be specified using the node
4994 range expression syntax described above. The default value is
4995 "ALL".
4996
4997
4998 AllowAccounts
4999 Comma-separated list of accounts which may execute jobs in the
5000 partition. The default value is "ALL". NOTE: If AllowAccounts
5001 is used then DenyAccounts will not be enforced. Also refer to
5002 DenyAccounts.
5003
5004
5005 AllowGroups
5006 Comma-separated list of group names which may execute jobs in
5007 this partition. A user will be permitted to submit a job to
5008 this partition if AllowGroups has at least one group associated
5009 with the user. Jobs executed as user root or as user SlurmUser
5010 will be allowed to use any partition, regardless of the value of
5011 AllowGroups. In addition, a Slurm Admin or Operator will be able
5012 to view any partition, regardless of the value of AllowGroups.
5013 If user root attempts to execute a job as another user (e.g. us‐
5014 ing srun's --uid option), then the job will be subject to Allow‐
5015 Groups as if it were submitted by that user. By default, Allow‐
5016 Groups is unset, meaning all groups are allowed to use this par‐
5017 tition. The special value 'ALL' is equivalent to this. Users
5018 who are not members of the specified group will not see informa‐
5019 tion about this partition by default. However, this should not
5020 be treated as a security mechanism, since job information will
5021 be returned if a user requests details about the partition or a
5022 specific job. See the PrivateData parameter to restrict access
5023 to job information. NOTE: For performance reasons, Slurm main‐
5024 tains a list of user IDs allowed to use each partition and this
5025 is checked at job submission time. This list of user IDs is up‐
5026 dated when the slurmctld daemon is restarted, reconfigured (e.g.
5027 "scontrol reconfig") or the partition's AllowGroups value is re‐
5028 set, even if its value is unchanged (e.g. "scontrol update Parti‐
5029 tionName=name AllowGroups=group"). For a user's access to a
5030 partition to change, the user's group membership must change and
5031 Slurm's internal user ID list must be updated using one of the
5032 methods described above.
5033
5034
5035 AllowQos
5036 Comma-separated list of Qos which may execute jobs in the parti‐
5037 tion. Jobs executed as user root can use any partition without
5038 regard to the value of AllowQos. The default value is "ALL".
5039 NOTE: If AllowQos is used then DenyQos will not be enforced.
5040 Also refer to DenyQos.
5041
5042
5043 Alternate
5044 Partition name of alternate partition to be used if the state of
5045 this partition is "DRAIN" or "INACTIVE."
5046
5047
5048 CpuBind
5049 If a job step request does not specify an option to control how
5050 tasks are bound to allocated CPUs (--cpu-bind) and all nodes al‐
5051 located to the job do not have the same node CpuBind option,
5052 then the partition's CpuBind option will control how tasks are
5053 bound to allocated resources. Supported values for CpuBind are
5054 "none", "board", "socket", "ldom" (NUMA), "core" and "thread".
5055
5056
5057 Default
5058 If this keyword is set, jobs submitted without a partition spec‐
5059 ification will utilize this partition. Possible values are
5060 "YES" and "NO". The default value is "NO".
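
For example (hypothetical names), jobs submitted without a partition specification would run in "batch":

PartitionName=batch Nodes=node[01-64] Default=YES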
5061
5062
5063 DefaultTime
5064 Run time limit used for jobs that don't specify a value. If not
5065 set then MaxTime will be used. Format is the same as for Max‐
5066 Time.
5067
5068
5069 DefCpuPerGPU
5070 Default count of CPUs allocated per allocated GPU. This value is
5071 used only if the job specifies neither --cpus-per-task nor
5072 --cpus-per-gpu.
5073
5074
5075 DefMemPerCPU
5076 Default real memory size available per allocated CPU in
5077 megabytes. Used to avoid over-subscribing memory and causing
5078 paging. DefMemPerCPU would generally be used if individual pro‐
5079 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
5080 lectType=select/cons_tres). If not set, the DefMemPerCPU value
5081 for the entire cluster will be used. Also see DefMemPerGPU,
5082 DefMemPerNode and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
5083 DefMemPerNode are mutually exclusive.
5084
5085
5086 DefMemPerGPU
5087 Default real memory size available per allocated GPU in
5088 megabytes. Also see DefMemPerCPU, DefMemPerNode and MaxMemPer‐
5089 CPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
5090 exclusive.
5091
5092
5093 DefMemPerNode
5094 Default real memory size available per allocated node in
5095 megabytes. Used to avoid over-subscribing memory and causing
5096 paging. DefMemPerNode would generally be used if whole nodes
5097 are allocated to jobs (SelectType=select/linear) and resources
5098 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5099 If not set, the DefMemPerNode value for the entire cluster will
5100 be used. Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
5101 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
5102 sive.
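
As a sketch (hypothetical values), a per-CPU memory default for one partition can be set as follows:

PartitionName=batch Nodes=node[01-64] DefMemPerCPU=2048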
5103
5104
5105 DenyAccounts
5106 Comma-separated list of accounts which may not execute jobs in
5107 the partition. By default, no accounts are denied access. NOTE:
5108 If AllowAccounts is used then DenyAccounts will not be enforced.
5109 Also refer to AllowAccounts.
5110
5111
5112 DenyQos
5113 Comma-separated list of Qos which may not execute jobs in the
5114 partition. By default, no QOS are denied access. NOTE: If Al‐
5115 lowQos is used then DenyQos will not be enforced. Also refer to
5116 AllowQos.
5117
5118
5119 DisableRootJobs
5120 If set to "YES" then user root will be prevented from running
5121 any jobs on this partition. The default value will be the value
5122 of DisableRootJobs set outside of a partition specification
5123 (which is "NO", allowing user root to execute jobs).
5124
5125
5126 ExclusiveUser
5127 If set to "YES" then nodes will be exclusively allocated to
5128 users. Multiple jobs may be run for the same user, but only one
5129 user can be active at a time. This capability is also available
5130 on a per-job basis by using the --exclusive=user option.
5131
5132
5133 GraceTime
5134 Specifies, in units of seconds, the preemption grace time to be
5135 extended to a job which has been selected for preemption. The
5136 default value is zero, no preemption grace time is allowed on
5137 this partition. Once a job has been selected for preemption,
5138 its end time is set to the current time plus GraceTime. The
5139 job's tasks are immediately sent SIGCONT and SIGTERM signals in
5140 order to provide notification of its imminent termination. This
5141 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
5142 upon reaching its new end time. This second set of signals is
5143 sent to both the tasks and the containing batch script, if ap‐
5144 plicable. See also the global KillWait configuration parameter.
5145
5146
5147 Hidden Specifies if the partition and its jobs are to be hidden by de‐
5148 fault. Hidden partitions will by default not be reported by the
5149 Slurm APIs or commands. Possible values are "YES" and "NO".
5150 The default value is "NO". Note that partitions that a user
5151 lacks access to by virtue of the AllowGroups parameter will also
5152 be hidden by default.
5153
5154
5155 LLN Schedule resources to jobs on the least loaded nodes (based upon
5156 the number of idle CPUs). This is generally only recommended for
5157 an environment with serial jobs as idle resources will tend to
5158 be highly fragmented, resulting in parallel jobs being distrib‐
5159 uted across many nodes. Note that node Weight takes precedence
5160 over how many idle resources are on each node. Also see the Se‐
5161 lectParameters configuration parameter CR_LLN to use the least
5162 loaded nodes in every partition.
5163
5164
5165 MaxCPUsPerNode
5166 Maximum number of CPUs on any node available to all jobs from
5167 this partition. This can be especially useful to schedule GPUs.
5168 For example a node can be associated with two Slurm partitions
5169 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
5170 limited to only a subset of the node's CPUs, ensuring that one
5171 or more CPUs would be available to jobs in the "gpu" parti‐
5172 tion/queue.
5173
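     A minimal sketch of the GPU-scheduling case described above (node and
     partition names are hypothetical):

```
# slurm.conf fragment: 16-CPU nodes shared by two partitions; "cpu" jobs
# may use at most 12 CPUs per node, leaving 4 CPUs for "gpu" jobs.
NodeName=node[01-04] CPUs=16 Gres=gpu:2
PartitionName=cpu Nodes=node[01-04] MaxCPUsPerNode=12
PartitionName=gpu Nodes=node[01-04]
```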
5174
5175 MaxMemPerCPU
5176 Maximum real memory size available per allocated CPU in
5177 megabytes. Used to avoid over-subscribing memory and causing
5178 paging. MaxMemPerCPU would generally be used if individual pro‐
5179 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
5180 lectType=select/cons_tres). If not set, the MaxMemPerCPU value
5181 for the entire cluster will be used. Also see DefMemPerCPU and
5182 MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
5183 clusive.
5184
5185
5186 MaxMemPerNode
5187 Maximum real memory size available per allocated node in
5188 megabytes. Used to avoid over-subscribing memory and causing
5189 paging. MaxMemPerNode would generally be used if whole nodes
5190 are allocated to jobs (SelectType=select/linear) and resources
5191 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5192 If not set, the MaxMemPerNode value for the entire cluster will
5193 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
5194 and MaxMemPerNode are mutually exclusive.
5195
5196
5197 MaxNodes
5198 Maximum count of nodes which may be allocated to any single job.
5199 The default value is "UNLIMITED", which is represented inter‐
5200 nally as -1.
5201
5202
5203 MaxTime
5204 Maximum run time limit for jobs. Format is minutes, min‐
5205 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
5206 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
5207 tion is one minute and second values are rounded up to the next
5208 minute. The job TimeLimit may be updated by root, SlurmUser or
5209 an Operator to a value higher than the configured MaxTime after
5210 job submission.
5211
5212
5213 MinNodes
5214 Minimum count of nodes which may be allocated to any single job.
5215 The default value is 0.
5216
5217
5218 Nodes Comma-separated list of nodes or nodesets which are associated
5219 with this partition. Node names may be specified using the node
5220 range expression syntax described above. A blank list of nodes
5221 (i.e. "Nodes= ") can be used if one wants a partition to exist,
5222 but have no resources (possibly on a temporary basis). A value
5223 of "ALL" is mapped to all nodes configured in the cluster.
5224
5225
5226 OverSubscribe
5227 Controls the ability of the partition to execute more than one
5228 job at a time on each resource (node, socket or core depending
5229 upon the value of SelectTypeParameters). If resources are to be
5230 over-subscribed, avoiding memory over-subscription is very im‐
5231 portant. SelectTypeParameters should be configured to treat
5232 memory as a consumable resource and the --mem option should be
5233 used for job allocations. Sharing of resources is typically
5234 useful only when using gang scheduling (PreemptMode=sus‐
5235 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
5236 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
5237 can negatively impact performance for systems with many thou‐
5238 sands of running jobs. The default value is "NO". For more in‐
5239 formation see the following web pages:
5240 https://slurm.schedmd.com/cons_res.html
5241 https://slurm.schedmd.com/cons_res_share.html
5242 https://slurm.schedmd.com/gang_scheduling.html
5243 https://slurm.schedmd.com/preempt.html
5244
5245
5246 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
5247 Type=select/cons_res or SelectType=select/cons_tres
5248 configured. Jobs that run in partitions with Over‐
5249 Subscribe=EXCLUSIVE will have exclusive access to
5250 all allocated nodes. These jobs are allocated all
5251 CPUs and GRES on the nodes, but they are only allo‐
5252 cated as much memory as they ask for. This is by de‐
5253 sign to support gang scheduling, because suspended
5254 jobs still reside in memory. To request all the mem‐
5255 ory on a node, use --mem=0 at submit time.
5256
5257 FORCE Makes all resources (except GRES) in the partition
5258 available for oversubscription without any means for
5259 users to disable it. May be followed with a colon
5260 and maximum number of jobs in running or suspended
5261 state. For example OverSubscribe=FORCE:4 enables
5262 each node, socket or core to oversubscribe each re‐
5263 source four ways. Recommended only for systems us‐
5264 ing PreemptMode=suspend,gang.
5265
5266 NOTE: OverSubscribe=FORCE:1 is a special case that
5267 is not exactly equivalent to OverSubscribe=NO. Over‐
5268 Subscribe=FORCE:1 disables the regular oversubscrip‐
5269 tion of resources in the same partition but it will
5270 still allow oversubscription due to preemption. Set‐
5271 ting OverSubscribe=NO will prevent oversubscription
5272 from happening due to preemption as well.
5273
5274 NOTE: If using PreemptType=preempt/qos you can spec‐
5275 ify a value for FORCE that is greater than 1. For
5276 example, OverSubscribe=FORCE:2 will permit two jobs
5277 per resource normally, but a third job can be
5278 started only if done so through preemption based
5279 upon QOS.
5280
5281 NOTE: If OverSubscribe is configured to FORCE or YES
5282 in your slurm.conf and the system is not configured
5283 to use preemption (PreemptMode=OFF) accounting can
5284 easily grow to values greater than the actual uti‐
5285 lization. It may be common on such systems to get
5286 error messages in the slurmdbd log stating: "We have
5287 more allocated time than is possible."
5288
5289
5290 YES Makes all resources (except GRES) in the partition
5291 available for sharing upon request by the job. Re‐
5292 sources will only be over-subscribed when explicitly
5293 requested by the user using the "--oversubscribe"
5294 option on job submission. May be followed with a
5295 colon and maximum number of jobs in running or sus‐
5296 pended state. For example "OverSubscribe=YES:4" en‐
5297 ables each node, socket or core to execute up to
5298 four jobs at once. Recommended only for systems
5299 running with gang scheduling (PreemptMode=sus‐
5300 pend,gang).
5301
5302 NO Selected resources are allocated to a single job. No
5303 resource will be allocated to more than one job.
5304
5305 NOTE: Even if you are using PreemptMode=sus‐
5306 pend,gang, setting OverSubscribe=NO will disable
5307 preemption on that partition. Use OverSub‐
5308 scribe=FORCE:1 if you want to disable normal over‐
5309 subscription but still allow suspension due to pre‐
5310 emption.
5311
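     The OverSubscribe variants above can be sketched as follows (node and
     partition names are hypothetical; gang scheduling is assumed):

```
# slurm.conf fragment
PreemptMode=SUSPEND,GANG
# Two-way time-slicing of every allocated resource:
PartitionName=shared Nodes=dev[0-8]  OverSubscribe=FORCE:2
# No normal oversubscription, but suspension via preemption still allowed:
PartitionName=low    Nodes=dev[9-17] OverSubscribe=FORCE:1
```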
5312
5313 OverTimeLimit
5314 Number of minutes by which a job can exceed its time limit be‐
5315 fore being canceled. Normally a job's time limit is treated as
5316 a hard limit and the job will be killed upon reaching that
5317 limit. Configuring OverTimeLimit will result in the job's time
5318 limit being treated like a soft limit. Adding the OverTimeLimit
5319 value to the soft time limit provides a hard time limit, at
     which point the job is canceled.  This is particularly useful for
     backfill scheduling, which is based upon each job's soft time
     limit.  If not set, the OverTimeLimit value for the entire
     cluster will be used.  May not exceed 65533 minutes.  A value of
5324 "UNLIMITED" is also supported.
5325
5326
5327 PartitionName
5328 Name by which the partition may be referenced (e.g. "Interac‐
5329 tive"). This name can be specified by users when submitting
5330 jobs. If the PartitionName is "DEFAULT", the values specified
5331 with that record will apply to subsequent partition specifica‐
5332 tions unless explicitly set to other values in that partition
5333 record or replaced with a different set of default values. Each
     line where PartitionName is "DEFAULT" will replace or add to
     previous default values and not reinitialize the default
     values.
5337
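     The cumulative behavior of "DEFAULT" records can be sketched as
     (partition names are hypothetical):

```
# slurm.conf fragment: each DEFAULT record updates the running defaults
PartitionName=DEFAULT MaxTime=60 State=UP
PartitionName=short               # inherits MaxTime=60, State=UP
PartitionName=DEFAULT MaxTime=720 # changes MaxTime only; State=UP remains
PartitionName=long                # inherits MaxTime=720, State=UP
```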
5338
5339 PreemptMode
5340 Mechanism used to preempt jobs or enable gang scheduling for
5341 this partition when PreemptType=preempt/partition_prio is con‐
5342 figured. This partition-specific PreemptMode configuration pa‐
5343 rameter will override the cluster-wide PreemptMode for this par‐
5344 tition. It can be set to OFF to disable preemption and gang
5345 scheduling for this partition. See also PriorityTier and the
5346 above description of the cluster-wide PreemptMode parameter for
5347 further details.
5348
5349
5350 PriorityJobFactor
5351 Partition factor used by priority/multifactor plugin in calcu‐
5352 lating job priority. The value may not exceed 65533. Also see
5353 PriorityTier.
5354
5355
5356 PriorityTier
5357 Jobs submitted to a partition with a higher PriorityTier value
5358 will be evaluated by the scheduler before pending jobs in a par‐
5359 tition with a lower PriorityTier value. They will also be con‐
5360 sidered for preemption of running jobs in partition(s) with
5361 lower PriorityTier values if PreemptType=preempt/partition_prio.
5362 The value may not exceed 65533. Also see PriorityJobFactor.
5363
5364
5365 QOS Used to extend the limits available to a QOS on a partition.
5366 Jobs will not be associated to this QOS outside of being associ‐
5367 ated to the partition. They will still be associated to their
5368 requested QOS. By default, no QOS is used. NOTE: If a limit is
5369 set in both the Partition's QOS and the Job's QOS the Partition
5370 QOS will be honored unless the Job's QOS has the OverPartQOS
     flag set, in which case the Job's QOS will have priority.
5372
5373
5374 ReqResv
5375 Specifies users of this partition are required to designate a
5376 reservation when submitting a job. This option can be useful in
5377 restricting usage of a partition that may have higher priority
5378 or additional resources to be allowed only within a reservation.
5379 Possible values are "YES" and "NO". The default value is "NO".
5380
5381
5382 ResumeTimeout
5383 Maximum time permitted (in seconds) between when a node resume
5384 request is issued and when the node is actually available for
5385 use. Nodes which fail to respond in this time frame will be
5386 marked DOWN and the jobs scheduled on the node requeued. Nodes
5387 which reboot after this time frame will be marked DOWN with a
5388 reason of "Node unexpectedly rebooted." For nodes that are in
5389 multiple partitions with this option set, the highest time will
5390 take effect. If not set on any partition, the node will use the
5391 ResumeTimeout value set for the entire cluster.
5392
5393
5394 RootOnly
5395 Specifies if only user ID zero (i.e. user root) may allocate re‐
5396 sources in this partition. User root may allocate resources for
5397 any other user, but the request must be initiated by user root.
5398 This option can be useful for a partition to be managed by some
5399 external entity (e.g. a higher-level job manager) and prevents
5400 users from directly using those resources. Possible values are
5401 "YES" and "NO". The default value is "NO".
5402
5403
5404 SelectTypeParameters
5405 Partition-specific resource allocation type. This option re‐
5406 places the global SelectTypeParameters value. Supported values
5407 are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
5408 Use requires the system-wide SelectTypeParameters value be set
5409 to any of the four supported values previously listed; other‐
5410 wise, the partition-specific value will be ignored.
5411
5412
5413 Shared The Shared configuration parameter has been replaced by the
5414 OverSubscribe parameter described above.
5415
5416
5417 State State of partition or availability for use. Possible values are
5418 "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
5419 See also the related "Alternate" keyword.
5420
5421 UP Designates that new jobs may be queued on the parti‐
5422 tion, and that jobs may be allocated nodes and run
5423 from the partition.
5424
5425 DOWN Designates that new jobs may be queued on the parti‐
5426 tion, but queued jobs may not be allocated nodes and
5427 run from the partition. Jobs already running on the
5428 partition continue to run. The jobs must be explicitly
5429 canceled to force their termination.
5430
5431 DRAIN Designates that no new jobs may be queued on the par‐
5432 tition (job submission requests will be denied with an
5433 error message), but jobs already queued on the parti‐
5434 tion may be allocated nodes and run. See also the
5435 "Alternate" partition specification.
5436
5437 INACTIVE Designates that no new jobs may be queued on the par‐
5438 tition, and jobs already queued may not be allocated
5439 nodes and run. See also the "Alternate" partition
5440 specification.
5441
5442
5443 SuspendTime
5444 Nodes which remain idle or down for this number of seconds will
5445 be placed into power save mode by SuspendProgram. For efficient
5446 system utilization, it is recommended that the value of Suspend‐
5447 Time be at least as large as the sum of SuspendTimeout plus Re‐
5448 sumeTimeout. For nodes that are in multiple partitions with
5449 this option set, the highest time will take effect. If not set
5450 on any partition, the node will use the SuspendTime value set
5451 for the entire cluster. Setting SuspendTime to anything but
5452 "INFINITE" will enable power save mode.
5453
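     A power-save sketch consistent with the sizing advice above (program
     paths and names are site assumptions):

```
# slurm.conf fragment
SuspendProgram=/usr/local/sbin/node_power_off   # assumed site script
ResumeProgram=/usr/local/sbin/node_power_on     # assumed site script
SuspendTimeout=120
ResumeTimeout=600
# SuspendTime >= SuspendTimeout + ResumeTimeout (720), per the advice above:
PartitionName=cloud Nodes=cn[01-99] SuspendTime=900
```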
5454
5455 SuspendTimeout
5456 Maximum time permitted (in seconds) between when a node suspend
5457 request is issued and when the node is shutdown. At that time
5458 the node must be ready for a resume request to be issued as
5459 needed for new work. For nodes that are in multiple partitions
5460 with this option set, the highest time will take effect. If not
5461 set on any partition, the node will use the SuspendTimeout value
5462 set for the entire cluster.
5463
5464
5465 TRESBillingWeights
5466 TRESBillingWeights is used to define the billing weights of each
5467 TRES type that will be used in calculating the usage of a job.
5468 The calculated usage is used when calculating fairshare and when
5469 enforcing the TRES billing limit on jobs.
5470
5471 Billing weights are specified as a comma-separated list of <TRES
5472 Type>=<TRES Billing Weight> pairs.
5473
5474 Any TRES Type is available for billing. Note that the base unit
5475 for memory and burst buffers is megabytes.
5476
5477 By default the billing of TRES is calculated as the sum of all
5478 TRES types multiplied by their corresponding billing weight.
5479
5480 The weighted amount of a resource can be adjusted by adding a
5481 suffix of K,M,G,T or P after the billing weight. For example, a
5482 memory weight of "mem=.25" on a job allocated 8GB will be billed
5483 2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
5484 same job will be billed 2 (8192MB * (.25/1024)) units.
5485
5486 Negative values are allowed.
5487
5488 When a job is allocated 1 CPU and 8 GB of memory on a partition
5489 configured with TRESBilling‐
5490 Weights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the billable TRES will
5491 be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
5492
5493 If PriorityFlags=MAX_TRES is configured, the billable TRES is
5494 calculated as the MAX of individual TRES' on a node (e.g. cpus,
5495 mem, gres) plus the sum of all global TRES' (e.g. licenses). Us‐
5496 ing the same example above the billable TRES will be MAX(1*1.0,
5497 8*0.25) + (0*2.0) = 2.0.
5498
5499 If TRESBillingWeights is not defined then the job is billed
5500 against the total number of allocated CPUs.
5501
5502 NOTE: TRESBillingWeights doesn't affect job priority directly as
5503 it is currently not used for the size of the job. If you want
5504 TRES' to play a role in the job's priority then refer to the
5505 PriorityWeightTRES option.
5506
5507
5508
5510 There are a variety of prolog and epilog program options that execute
5511 with various permissions and at various times. The four options most
5512 likely to be used are: Prolog and Epilog (executed once on each compute
5513 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
5514 once on the ControlMachine for each job).
5515
5516 NOTE: Standard output and error messages are normally not preserved.
5517 Explicitly write output and error messages to an appropriate location
5518 if you wish to preserve that information.
5519
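     Since these scripts' stdout and stderr are normally discarded, an
     Epilog can write its own record.  A minimal sketch, assuming a
     site-chosen log directory (the demo default under /tmp is only for
     illustration; a real site would use a path such as /var/log/slurm):

```shell
#!/bin/sh
# Hypothetical Epilog sketch: append a one-line summary per job, since
# the script's own output is not preserved by Slurm.
LOGDIR="${LOGDIR:-/tmp/slurm-epilog-demo}"   # assumed site location
mkdir -p "$LOGDIR"
printf '%s job=%s user=%s node=%s\n' \
    "$(date '+%Y-%m-%d %H:%M:%S')" \
    "${SLURM_JOB_ID:-unknown}" \
    "${SLURM_JOB_USER:-unknown}" \
    "${SLURMD_NODENAME:-unknown}" >> "$LOGDIR/epilog.log"
# Exit status matters: a non-zero exit would DRAIN the node.
```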
5520 NOTE: By default the Prolog script is ONLY run on any individual node
5521 when it first sees a job step from a new allocation. It does not run
5522 the Prolog immediately when an allocation is granted. If no job steps
5523 from an allocation are run on a node, it will never run the Prolog for
     that allocation.  This Prolog behavior can be changed by the
     PrologFlags parameter.  The Epilog, on the other hand, always runs
     on every node of an allocation when the allocation is released.
5527
5528 If the Epilog fails (returns a non-zero exit code), this will result in
5529 the node being set to a DRAIN state. If the EpilogSlurmctld fails (re‐
5530 turns a non-zero exit code), this will only be logged. If the Prolog
5531 fails (returns a non-zero exit code), this will result in the node be‐
5532 ing set to a DRAIN state and the job being requeued in a held state un‐
5533 less nohold_on_prolog_fail is configured in SchedulerParameters. If
5534 the PrologSlurmctld fails (returns a non-zero exit code), this will re‐
5535 sult in the job being requeued to be executed on another node if possi‐
5536 ble. Only batch jobs can be requeued. Interactive jobs (salloc and
     srun) will be cancelled if the PrologSlurmctld fails.  If slurmctld is
5538 stopped while either PrologSlurmctld or EpilogSlurmctld is running, the
5539 script will be killed with SIGKILL. The script will restart when slurm‐
5540 ctld restarts.
5541
5542
5543 Information about the job is passed to the script using environment
5544 variables. Unless otherwise specified, these environment variables are
5545 available in each of the scripts mentioned above (Prolog, Epilog, Pro‐
5546 logSlurmctld and EpilogSlurmctld). For a full list of environment vari‐
5547 ables that includes those available in the SrunProlog, SrunEpilog,
5548 TaskProlog and TaskEpilog please see the Prolog and Epilog Guide
5549 <https://slurm.schedmd.com/prolog_epilog.html>.
5550
5551 SLURM_ARRAY_JOB_ID
5552 If this job is part of a job array, this will be set to the job
5553 ID. Otherwise it will not be set. To reference this specific
     task of a job array, combine SLURM_ARRAY_JOB_ID with
     SLURM_ARRAY_TASK_ID (e.g. "scontrol update
     ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ...").  Available in
     PrologSlurmctld and EpilogSlurmctld.
5558
5559 SLURM_ARRAY_TASK_ID
5560 If this job is part of a job array, this will be set to the task
5561 ID. Otherwise it will not be set. To reference this specific
     task of a job array, combine SLURM_ARRAY_JOB_ID with
     SLURM_ARRAY_TASK_ID (e.g. "scontrol update
     ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ...").  Available in
     PrologSlurmctld and EpilogSlurmctld.
5566
5567 SLURM_ARRAY_TASK_MAX
5568 If this job is part of a job array, this will be set to the max‐
5569 imum task ID. Otherwise it will not be set. Available in Pro‐
5570 logSlurmctld and EpilogSlurmctld.
5571
5572 SLURM_ARRAY_TASK_MIN
5573 If this job is part of a job array, this will be set to the min‐
5574 imum task ID. Otherwise it will not be set. Available in Pro‐
5575 logSlurmctld and EpilogSlurmctld.
5576
5577 SLURM_ARRAY_TASK_STEP
5578 If this job is part of a job array, this will be set to the step
5579 size of task IDs. Otherwise it will not be set. Available in
5580 PrologSlurmctld and EpilogSlurmctld.
5581
5582 SLURM_CLUSTER_NAME
5583 Name of the cluster executing the job.
5584
5585 SLURM_CONF
5586 Location of the slurm.conf file. Available in Prolog and Epilog.
5587
5588 SLURMD_NODENAME
5589 Name of the node running the task. In the case of a parallel job
5590 executing on multiple compute nodes, the various tasks will have
5591 this environment variable set to different values on each com‐
5592 pute node. Available in Prolog and Epilog.
5593
5594 SLURM_JOB_ACCOUNT
5595 Account name used for the job. Available in PrologSlurmctld and
5596 EpilogSlurmctld.
5597
5598 SLURM_JOB_CONSTRAINTS
5599 Features required to run the job. Available in Prolog, Pro‐
5600 logSlurmctld and EpilogSlurmctld.
5601
5602 SLURM_JOB_DERIVED_EC
5603 The highest exit code of all of the job steps. Available in
5604 EpilogSlurmctld.
5605
5606 SLURM_JOB_EXIT_CODE
5607 The exit code of the job script (or salloc). The value is the
     status as returned by the wait() system call (see wait(2)).
     Available in EpilogSlurmctld.
5610
5611 SLURM_JOB_EXIT_CODE2
5612 The exit code of the job script (or salloc). The value has the
     format <exit>:<sig>.  The first number is the exit code, typically
     as set by the exit() function.  The second number is the signal
     that caused the process to terminate, if it was terminated by a
     signal.  Available in EpilogSlurmctld.
5617
5618 SLURM_JOB_GID
5619 Group ID of the job's owner.
5620
5621 SLURM_JOB_GPUS
5622 The GPU IDs of GPUs in the job allocation (if any). Available
5623 in the Prolog and Epilog.
5624
5625 SLURM_JOB_GROUP
5626 Group name of the job's owner. Available in PrologSlurmctld and
5627 EpilogSlurmctld.
5628
5629 SLURM_JOB_ID
5630 Job ID.
5631
5632 SLURM_JOBID
5633 Job ID.
5634
5635 SLURM_JOB_NAME
5636 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5637 ctld.
5638
5639 SLURM_JOB_NODELIST
5640 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5641 show hostnames" can be used to convert this to a list of indi‐
5642 vidual host names. Available in PrologSlurmctld and Epi‐
5643 logSlurmctld.
5644
5645 SLURM_JOB_PARTITION
5646 Partition that job runs in. Available in Prolog, PrologSlurm‐
5647 ctld and EpilogSlurmctld.
5648
5649 SLURM_JOB_UID
5650 User ID of the job's owner.
5651
5652 SLURM_JOB_USER
5653 User name of the job's owner.
5654
5655 SLURM_SCRIPT_CONTEXT
5656 Identifies which epilog or prolog program is currently running.
5657
5658
5660 This program can be used to take special actions to clean up the unkil‐
5661 lable processes and/or notify system administrators. The program will
5662 be run as SlurmdUser (usually "root") on the compute node where Unkill‐
5663 ableStepTimeout was triggered.
5664
5665 Information about the unkillable job step is passed to the script using
5666 environment variables.
5667
5668 SLURM_JOB_ID
5669 Job ID.
5670
5671 SLURM_STEP_ID
5672 Job Step ID.
5673
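     A minimal UnkillableStepProgram sketch using only the two variables
     above (the log path and any notification mechanism are assumptions,
     not part of Slurm):

```shell
#!/bin/sh
# Hypothetical UnkillableStepProgram sketch: record the stuck step so an
# administrator can investigate.  Runs as SlurmdUser on the affected node.
LOG="${UNKILLABLE_LOG:-/tmp/unkillable-demo.log}"   # assumed path
printf 'unkillable step %s.%s on %s\n' \
    "${SLURM_JOB_ID:-?}" "${SLURM_STEP_ID:-?}" "$(hostname)" >> "$LOG"
# A real site might also page or mail an administrator here.
```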
5674
5676 Slurm is able to optimize job allocations to minimize network con‐
     tention.  Special Slurm logic is used to optimize allocations on
     systems with a three-dimensional interconnect, and information about
     configuring those systems is available here:
     <https://slurm.schedmd.com/>.  For a hierarchical network, Slurm needs
5681 to have detailed information about how nodes are configured on the net‐
5682 work switches.
5683
5684 Given network topology information, Slurm allocates all of a job's re‐
5685 sources onto a single leaf of the network (if possible) using a
5686 best-fit algorithm. Otherwise it will allocate a job's resources onto
5687 multiple leaf switches so as to minimize the use of higher-level
5688 switches. The TopologyPlugin parameter controls which plugin is used
5689 to collect network topology information. The only values presently
5690 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5691 forms best-fit logic over three-dimensional topology), "topology/none"
5692 (default for other systems, best-fit logic over one-dimensional topol‐
5693 ogy), "topology/tree" (determine the network topology based upon infor‐
5694 mation contained in a topology.conf file, see "man topology.conf" for
5695 more information). Future plugins may gather topology information di‐
5696 rectly from the network. The topology information is optional. If not
5697 provided, Slurm will perform a best-fit algorithm assuming the nodes
5698 are in a one-dimensional array as configured and the communications
5699 cost is related to the node distance in this array.
5700
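     For topology/tree, the switch hierarchy is described in topology.conf.
     A minimal two-level sketch (switch and node names are hypothetical;
     see "man topology.conf" for the full syntax):

```
# topology.conf fragment: two leaf switches under one root switch
SwitchName=leaf1 Nodes=dev[0-12]
SwitchName=leaf2 Nodes=dev[13-25]
SwitchName=root  Switches=leaf[1-2]
```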
5701
5703 If the cluster's computers used for the primary or backup controller
5704 will be out of service for an extended period of time, it may be desir‐
5705 able to relocate them. In order to do so, follow this procedure:
5706
5707 1. Stop the Slurm daemons
5708 2. Modify the slurm.conf file appropriately
5709 3. Distribute the updated slurm.conf file to all nodes
5710 4. Restart the Slurm daemons
5711
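     Step 2 typically amounts to editing the SlurmctldHost lines; for
     example (host names and addresses are hypothetical):

```
# slurm.conf change for step 2
# old:  SlurmctldHost=oldctl(10.0.0.1)
SlurmctldHost=newctl(10.0.0.9)
```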
5712 There should be no loss of any running or pending jobs. Ensure that
5713 any nodes added to the cluster have the current slurm.conf file in‐
5714 stalled.
5715
5716 CAUTION: If two nodes are simultaneously configured as the primary con‐
5717 troller (two nodes on which SlurmctldHost specify the local host and
5718 the slurmctld daemon is executing on each), system behavior will be de‐
5719 structive. If a compute node has an incorrect SlurmctldHost parameter,
5720 that node may be rendered unusable, but no other harm will result.
5721
5722
5724 #
5725 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5726 # Author: John Doe
5727 # Date: 11/06/2001
5728 #
5729 SlurmctldHost=dev0(12.34.56.78) # Primary server
5730 SlurmctldHost=dev1(12.34.56.79) # Backup server
5731 #
5732 AuthType=auth/munge
5733 Epilog=/usr/local/slurm/epilog
5734 Prolog=/usr/local/slurm/prolog
5735 FirstJobId=65536
5736 InactiveLimit=120
5737 JobCompType=jobcomp/filetxt
5738 JobCompLoc=/var/log/slurm/jobcomp
5739 KillWait=30
5740 MaxJobCount=10000
5741 MinJobAge=3600
5742 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5743 ReturnToService=0
5744 SchedulerType=sched/backfill
5745 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5746 SlurmdLogFile=/var/log/slurm/slurmd.log
5747 SlurmctldPort=7002
5748 SlurmdPort=7003
5749 SlurmdSpoolDir=/var/spool/slurmd.spool
5750 StateSaveLocation=/var/spool/slurm.state
5751 SwitchType=switch/none
5752 TmpFS=/tmp
5753 WaitTime=30
5754 JobCredentialPrivateKey=/usr/local/slurm/private.key
5755 JobCredentialPublicCertificate=/usr/local/slurm/public.cert
5756 #
5757 # Node Configurations
5758 #
5759 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5760 NodeName=DEFAULT State=UNKNOWN
5761 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5762 # Update records for specific DOWN nodes
5763 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5764 #
5765 # Partition Configurations
5766 #
5767 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5768 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5769 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5770 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5771
5772
     The "include" keyword can be used with modifiers within the specified
     pathname.  These modifiers are replaced with the cluster name or other
     information depending on which modifier is specified.  If the included
5777 file is not an absolute path name (i.e. it does not start with a
     slash), it will be searched for in the same directory as the slurm.conf
5779 file.
5780
5781 %c Cluster name specified in the slurm.conf will be used.
5782
5783 EXAMPLE
5784 ClusterName=linux
5785 include /home/slurm/etc/%c_config
5786 # Above line interpreted as
5787 # "include /home/slurm/etc/linux_config"
5788
5789
5791 There are three classes of files: Files used by slurmctld must be ac‐
5792 cessible by user SlurmUser and accessible by the primary and backup
5793 control machines. Files used by slurmd must be accessible by user root
5794 and accessible from every compute node. A few files need to be acces‐
5795 sible by normal users on all login and compute nodes. While many files
5796 and directories are listed below, most of them will not be used with
5797 most configurations.
5798
5799 Epilog Must be executable by user root. It is recommended that the
5800 file be readable by all users. The file must exist on every
5801 compute node.
5802
5803 EpilogSlurmctld
5804 Must be executable by user SlurmUser. It is recommended that
5805 the file be readable by all users. The file must be accessible
5806 by the primary and backup control machines.
5807
5808 HealthCheckProgram
5809 Must be executable by user root. It is recommended that the
5810 file be readable by all users. The file must exist on every
5811 compute node.
5812
5813 JobCompLoc
5814 If this specifies a file, it must be writable by user SlurmUser.
5815 The file must be accessible by the primary and backup control
5816 machines.
5817
5818 JobCredentialPrivateKey
5819 Must be readable only by user SlurmUser and writable by no other
5820 users. The file must be accessible by the primary and backup
5821 control machines.
5822
5823 JobCredentialPublicCertificate
5824 Readable to all users on all nodes. Must not be writable by
5825 regular users.
5826
5827 MailProg
5828 Must be executable by user SlurmUser. Must not be writable by
5829 regular users. The file must be accessible by the primary and
5830 backup control machines.
5831
5832 Prolog Must be executable by user root. It is recommended that the
5833 file be readable by all users. The file must exist on every
5834 compute node.
5835
5836 PrologSlurmctld
5837 Must be executable by user SlurmUser. It is recommended that
5838 the file be readable by all users. The file must be accessible
5839 by the primary and backup control machines.
5840
5841 ResumeProgram
5842 Must be executable by user SlurmUser. The file must be accessi‐
5843 ble by the primary and backup control machines.
5844
5845 slurm.conf
5846 Readable to all users on all nodes. Must not be writable by
5847 regular users.
5848
5849 SlurmctldLogFile
5850 Must be writable by user SlurmUser. The file must be accessible
5851 by the primary and backup control machines.
5852
5853 SlurmctldPidFile
5854 Must be writable by user root. Preferably writable and remov‐
5855 able by SlurmUser. The file must be accessible by the primary
5856 and backup control machines.
5857
5858 SlurmdLogFile
5859 Must be writable by user root. A distinct file must exist on
5860 each compute node.
5861
5862 SlurmdPidFile
5863 Must be writable by user root. A distinct file must exist on
5864 each compute node.
5865
5866 SlurmdSpoolDir
     Must be writable by user root.  A distinct directory must exist on
5868 each compute node.

       SrunEpilog
              Must be executable by all users.  The file must exist on every
              login and compute node.

       SrunProlog
              Must be executable by all users.  The file must exist on every
              login and compute node.

       StateSaveLocation
              Must be writable by user SlurmUser.  The file must be
              accessible by the primary and backup control machines.

       SuspendProgram
              Must be executable by user SlurmUser.  The file must be
              accessible by the primary and backup control machines.

       TaskEpilog
              Must be executable by all users.  The file must exist on every
              compute node.

       TaskProlog
              Must be executable by all users.  The file must exist on every
              compute node.

       UnkillableStepProgram
              Must be executable by user SlurmUser.  The file must be
              accessible by the primary and backup control machines.
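
       The "executable by all users" requirements above can be spot-checked
       mechanically.  The sketch below uses a scratch file as a stand-in for
       a script such as TaskProlog (the path in the comment is an assumption
       for illustration); the same test applies to the real paths named in
       your slurm.conf.

```shell
#!/bin/sh
# Sketch: check that a script is executable by all users, as required for
# TaskProlog/TaskEpilog/SrunProlog/SrunEpilog.  A scratch file stands in
# for the real script so the check runs without root.
set -e
demo=$(mktemp)        # stand-in for e.g. /etc/slurm/task_prolog (assumed path)
chmod 755 "$demo"     # rwxr-xr-x: executable by all users

mode=$(stat -c %a "$demo")     # octal mode bits, e.g. 755 (GNU stat)
world=$((0$mode & 01))         # world-execute bit from the last octal digit
if [ "$world" -eq 1 ]; then
    echo "executable by all users"
else
    echo "NOT executable by all users"
fi
```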


LOGGING
       Note that while Slurm daemons create log files and other files as
       needed, they treat the lack of parent directories as a fatal error.
       This prevents the daemons from running if critical file systems are
       not mounted and minimizes the risk of cold-starting (starting without
       preserving jobs).

       Log files and job accounting files may need to be created/owned by
       the "SlurmUser" uid to be successfully accessed.  Use the "chown" and
       "chmod" commands to set the ownership and permissions appropriately.
       See the section FILE AND DIRECTORY PERMISSIONS for information about
       the various files and directories used by Slurm.
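
       For example, the ownership and mode changes might look like the
       following sketch.  The paths and the "slurm" account name are
       assumptions (typical defaults), not values taken from this file; the
       commands operate on a scratch prefix so they are runnable without
       root, and the chown calls are shown as comments because they require
       root.

```shell
#!/bin/sh
# Sketch: apply the ownership/permission rules described above.
# PREFIX is a scratch directory; in production, operate on the real paths
# from slurm.conf and uncomment the chown lines (run as root).
set -e
PREFIX=$(mktemp -d)

# StateSaveLocation: writable by SlurmUser; keep it private to that account.
mkdir -p "$PREFIX/spool/slurmctld"
chmod 700 "$PREFIX/spool/slurmctld"
# chown slurm:slurm "$PREFIX/spool/slurmctld"   # "slurm" = assumed SlurmUser

# Directory holding SlurmctldLogFile / SlurmdLogFile.
mkdir -p "$PREFIX/log/slurm"
chmod 755 "$PREFIX/log/slurm"
# chown slurm:slurm "$PREFIX/log/slurm"

# slurm.conf itself: readable by all users, writable only by root.
touch "$PREFIX/slurm.conf"
chmod 644 "$PREFIX/slurm.conf"
```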

       It is recommended that the logrotate utility be used to ensure that
       various log files do not become too large.  This also applies to text
       files used for accounting, process tracking, and the slurmdbd log if
       they are used.

       Here is a sample logrotate configuration.  Make appropriate site
       modifications and save as /etc/logrotate.d/slurm on all nodes.  See
       the logrotate man page for more details.
5920
5921 ##
5922 # Slurm Logrotate Configuration
5923 ##
5924 /var/log/slurm/*.log {
5925 compress
5926 missingok
5927 nocopytruncate
5928 nodelaycompress
5929 nomail
5930 notifempty
5931 noolddir
5932 rotate 5
5933 sharedscripts
5934 size=5M
5935 create 640 slurm root
5936 postrotate
5937 pkill -x --signal SIGUSR2 slurmctld
5938 pkill -x --signal SIGUSR2 slurmd
5939 pkill -x --signal SIGUSR2 slurmdbd
5940 exit 0
5941 endscript
5942 }
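
       The postrotate script relies on the Slurm daemons reopening their log
       files when they receive SIGUSR2.  If logs are ever rotated by hand,
       outside of logrotate, the same signal can be sent directly; a minimal
       sketch:

```shell
#!/bin/sh
# Sketch: ask each Slurm daemon to close and reopen its log file -- the
# same mechanism the logrotate postrotate script above uses.  pkill exits
# non-zero when no matching process exists, which is harmless here.
for daemon in slurmctld slurmd slurmdbd; do
    if pkill -x --signal SIGUSR2 "$daemon"; then
        echo "signaled $daemon"
    else
        echo "$daemon not running"
    fi
done
```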

COPYING
       Copyright (C) 2002-2007 The Regents of the University of California.
       Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
       Copyright (C) 2008-2010 Lawrence Livermore National Security.
       Copyright (C) 2010-2021 SchedMD LLC.

       This file is part of Slurm, a resource management program.  For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.


FILES
       /etc/slurm.conf


SEE ALSO
       cgroup.conf(5), getaddrinfo(3), getrlimit(2), gres.conf(5), group(5),
       hostname(1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8),
       slurmdbd.conf(5), srun(1), spank(8), syslog(3), topology.conf(5)



November 2021              Slurm Configuration File              slurm.conf(5)