slurm.conf(5)              Slurm Configuration File              slurm.conf(5)

NAME
       slurm.conf - Slurm configuration file

DESCRIPTION
       slurm.conf is an ASCII file which describes general Slurm
       configuration information, the nodes to be managed, information
       about how those nodes are grouped into partitions, and various
       scheduling parameters associated with those partitions. This file
       should be consistent across all nodes in the cluster.

       The file location can be modified at system build time using the
       DEFAULT_SLURM_CONF parameter or at execution time by setting the
       SLURM_CONF environment variable. The Slurm daemons also allow you
       to override both the built-in and environment-provided location
       using the "-f" option on the command line.

       The contents of the file are case insensitive except for the names
       of nodes and partitions. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart
       of Slurm daemons, daemon receipt of the SIGHUP signal, or execution
       of the command "scontrol reconfigure" unless otherwise noted.

       If a line begins with the word "Include" followed by whitespace and
       then a file name, that file will be included inline with the
       current configuration file. For large or complex systems, multiple
       configuration files may prove easier to manage and enable reuse of
       some files (See INCLUDE MODIFIERS for more details).
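       For example, node and partition definitions could be kept in
       separate files and pulled in with Include directives (the file
       names below are illustrative, not defaults):

```
# In slurm.conf
Include /etc/slurm/nodes.conf
Include /etc/slurm/partitions.conf
```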

       Note on file permissions:

       The slurm.conf file must be readable by all users of Slurm, since
       it is used by many of the Slurm commands. Other files that are
       defined in the slurm.conf file, such as log files and job
       accounting files, may need to be created/owned by the user
       "SlurmUser" to be successfully accessed. Use the "chown" and
       "chmod" commands to set the ownership and permissions
       appropriately. See the section FILE AND DIRECTORY PERMISSIONS for
       information about the various files and directories used by Slurm.

PARAMETERS
       The overall configuration parameters available include:

       AccountingStorageBackupHost
              The name of the backup machine hosting the accounting
              storage database. If used with the
              accounting_storage/slurmdbd plugin, this is where the backup
              slurmdbd would be running. Only used with systems using
              SlurmDBD, ignored otherwise.

       AccountingStorageEnforce
              This controls what level of association-based enforcement to
              impose on job submissions. Valid options are any combination
              of associations, limits, nojobs, nosteps, qos, safe, and
              wckeys, or all for all things (except nojobs and nosteps,
              which must be requested as well).

              If limits, qos, or wckeys are set, associations will
              automatically be set.

              If wckeys is set, TrackWCKey will automatically be set.

              If safe is set, limits and associations will automatically
              be set.

              If nojobs is set, nosteps will automatically be set.

              By setting associations, no new job is allowed to run unless
              a corresponding association exists in the system. If limits
              are enforced, users can be limited by association to
              whatever job size or run time limits are defined.

              If nojobs is set, Slurm will not account for any jobs or
              steps on the system. Likewise, if nosteps is set, Slurm will
              not account for any steps that have run.

              If safe is enforced, a job will only be launched against an
              association or qos that has a GrpTRESMins limit set if the
              job will be able to run to completion. Without this option
              set, jobs will be launched as long as their usage hasn't
              reached the cpu-minutes limit. This can lead to jobs being
              launched but then killed when the limit is reached.

              With qos and/or wckeys enforced, jobs will not be scheduled
              unless a valid qos and/or workload characterization key is
              specified.

              When AccountingStorageEnforce is changed, a restart of the
              slurmctld daemon is required (not just a "scontrol
              reconfig").
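              As an illustration, a site that wants jobs held to
              association and QOS rules, and started only when they can
              run to completion under GrpTRESMins limits, might set the
              following (safe implicitly enables limits and associations):

```
AccountingStorageEnforce=safe,qos
```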

       AccountingStorageExternalHost
              A comma-separated list of external slurmdbds
              (<host/ip>[:port][,...]) to register with. If no port is
              given, the AccountingStoragePort will be used.

              This allows clusters registered with the external slurmdbd
              to communicate with each other using the --cluster/-M client
              command options.

              The cluster will add itself to the external slurmdbd if it
              doesn't exist. If a non-external cluster already exists on
              the external slurmdbd, the slurmctld will ignore registering
              to the external slurmdbd.
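              A hypothetical example registering with a single external
              slurmdbd on a non-default port (host name and port are
              placeholders):

```
AccountingStorageExternalHost=dbd-ext.example.com:7819
```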

       AccountingStorageHost
              The name of the machine hosting the accounting storage
              database. Only used with systems using SlurmDBD, ignored
              otherwise. Also see DefaultStorageHost.

       AccountingStorageParameters
              Comma-separated list of key-value pair parameters.
              Currently supported values include options to establish a
              secure connection to the database:

              SSL_CERT
                     The path name of the client public key certificate
                     file.

              SSL_CA
                     The path name of the Certificate Authority (CA)
                     certificate file.

              SSL_CAPATH
                     The path name of the directory that contains trusted
                     SSL CA certificate files.

              SSL_KEY
                     The path name of the client private key file.

              SSL_CIPHER
                     The list of permissible ciphers for SSL encryption.
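              A sketch of a TLS-secured database connection (the
              certificate paths are placeholders):

```
AccountingStorageParameters=SSL_CERT=/etc/slurm/certs/client-cert.pem,SSL_KEY=/etc/slurm/certs/client-key.pem,SSL_CA=/etc/slurm/certs/ca-cert.pem
```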

       AccountingStoragePass
              The password used to gain access to the database to store
              the accounting data. Only used for database type storage
              plugins, ignored otherwise. In the case of Slurm DBD
              (Database Daemon) with MUNGE authentication, this can be
              configured to use a MUNGE daemon specifically configured to
              provide authentication between clusters while the default
              MUNGE daemon provides authentication within a cluster. In
              that case, AccountingStoragePass should specify the named
              port to be used for communications with the alternate MUNGE
              daemon (e.g. "/var/run/munge/global.socket.2"). The default
              value is NULL. Also see DefaultStoragePass.

       AccountingStoragePort
              The listening port of the accounting storage database
              server. Only used for database type storage plugins,
              ignored otherwise. The default value is SLURMDBD_PORT as
              established at system build time. If no value is explicitly
              specified, it will be set to 6819. This value must be equal
              to the DbdPort parameter in the slurmdbd.conf file. Also
              see DefaultStoragePort.

       AccountingStorageTRES
              Comma-separated list of resources you wish to track on the
              cluster. These are the resources requested by the
              sbatch/srun job when it is submitted. Currently this
              consists of any GRES, BB (burst buffer) or license along
              with CPU, Memory, Node, Energy, FS/[Disk|Lustre], IC/OFED,
              Pages, and VMem. By default Billing, CPU, Energy, Memory,
              Node, FS/Disk, Pages and VMem are tracked. These default
              TRES cannot be disabled, but only appended to.
              AccountingStorageTRES=gres/craynetwork,license/iop1 will
              track billing, cpu, energy, memory, nodes, fs/disk, pages
              and vmem along with a gres called craynetwork as well as a
              license called iop1. Whenever these resources are used on
              the cluster they are recorded. The TRES are automatically
              set up in the database on the start of the slurmctld.

              If multiple GRES of different types are tracked (e.g. GPUs
              of different types), then job requests with matching type
              specifications will be recorded. Given a configuration of
              "AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta",
              "gres/gpu:tesla" and "gres/gpu:volta" will track only jobs
              that explicitly request those two GPU types, while
              "gres/gpu" will track allocated GPUs of any type ("tesla",
              "volta" or any other GPU type).

              Given a configuration of
              "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta",
              "gres/gpu:tesla" and "gres/gpu:volta" will track jobs that
              explicitly request those GPU types. If a job requests GPUs,
              but does not explicitly specify the GPU type, then its
              resource allocation will be accounted for as either
              "gres/gpu:tesla" or "gres/gpu:volta", although the
              accounting may not match the actual GPU type allocated to
              the job and the GPUs allocated to the job could be
              heterogeneous. In an environment containing various GPU
              types, use of a job_submit plugin may be desired in order to
              force jobs to explicitly specify some GPU type.
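              The typed-GPU configuration discussed above would appear in
              slurm.conf as:

```
AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta
```

              Here "gres/gpu" records GPU allocations of any type, while
              the typed entries record only jobs that explicitly request
              those types.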

       AccountingStorageType
              The accounting storage mechanism type. Acceptable values at
              present include "accounting_storage/none" and
              "accounting_storage/slurmdbd". The
              "accounting_storage/slurmdbd" value indicates that
              accounting records will be written to the Slurm DBD, which
              manages an underlying MySQL database. See "man slurmdbd"
              for more information. The default value is
              "accounting_storage/none" and indicates that account records
              are not maintained. Also see DefaultStorageType.

       AccountingStorageUser
              The user account for accessing the accounting storage
              database. Only used for database type storage plugins,
              ignored otherwise. Also see DefaultStorageUser.

       AccountingStoreJobComment
              If set to "YES" then include the job's comment field in the
              job complete message sent to the Accounting Storage
              database. The default is "YES". Note the AdminComment and
              SystemComment are always recorded in the database.

       AcctGatherNodeFreq
              The AcctGather plugins sampling interval for node
              accounting. For AcctGather plugin values of none, this
              parameter is ignored. For all other values this parameter
              is the number of seconds between node accounting samples.
              For the acct_gather_energy/rapl plugin, set a value less
              than 300 because the counters may overflow beyond this rate.
              The default value is zero, which disables accounting
              sampling for nodes. NOTE: The accounting sampling interval
              for jobs is determined by the value of
              JobAcctGatherFrequency.
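              For instance, a site using the RAPL energy plugin might
              sample node energy every 30 seconds, comfortably under the
              300-second counter-overflow bound noted above (the interval
              shown is illustrative):

```
AcctGatherEnergyType=acct_gather_energy/rapl
AcctGatherNodeFreq=30
```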

       AcctGatherEnergyType
              Identifies the plugin to be used for energy consumption
              accounting. The jobacct_gather plugin and slurmd daemon
              call this plugin to collect energy consumption data for jobs
              and nodes. The collection of energy consumption data takes
              place at the node level, hence only in case of exclusive job
              allocation will the energy consumption measurements reflect
              the job's real consumption. In case of node sharing between
              jobs the reported consumed energy per job (through sstat or
              sacct) will not reflect the real energy consumed by the
              jobs.

              Configurable values at present are:

              acct_gather_energy/none
                     No energy consumption data is collected.

              acct_gather_energy/ipmi
                     Energy consumption data is collected from the
                     Baseboard Management Controller (BMC) using the
                     Intelligent Platform Management Interface (IPMI).

              acct_gather_energy/pm_counters
                     Energy consumption data is collected from the
                     Baseboard Management Controller (BMC) for HPE Cray
                     systems.

              acct_gather_energy/rapl
                     Energy consumption data is collected from hardware
                     sensors using the Running Average Power Limit (RAPL)
                     mechanism. Note that enabling RAPL may require the
                     execution of the command "sudo modprobe msr".

              acct_gather_energy/xcc
                     Energy consumption data is collected from the Lenovo
                     SD650 XClarity Controller (XCC) using IPMI OEM raw
                     commands.

       AcctGatherInterconnectType
              Identifies the plugin to be used for interconnect network
              traffic accounting. The jobacct_gather plugin and slurmd
              daemon call this plugin to collect network traffic data for
              jobs and nodes. The collection of network traffic data
              takes place at the node level, hence only in case of
              exclusive job allocation will the collected values reflect
              the job's real traffic. In case of node sharing between
              jobs the reported network traffic per job (through sstat or
              sacct) will not reflect the real network traffic by the
              jobs.

              Configurable values at present are:

              acct_gather_interconnect/none
                     No InfiniBand network data are collected.

              acct_gather_interconnect/ofed
                     InfiniBand network traffic data are collected from
                     the hardware monitoring counters of InfiniBand
                     devices through the OFED library. In order to
                     account for per job network traffic, add the
                     "ic/ofed" TRES to AccountingStorageTRES.

       AcctGatherFilesystemType
              Identifies the plugin to be used for filesystem traffic
              accounting. The jobacct_gather plugin and slurmd daemon
              call this plugin to collect filesystem traffic data for jobs
              and nodes. The collection of filesystem traffic data takes
              place at the node level, hence only in case of exclusive job
              allocation will the collected values reflect the job's real
              traffic. In case of node sharing between jobs the reported
              filesystem traffic per job (through sstat or sacct) will not
              reflect the real filesystem traffic by the jobs.

              Configurable values at present are:

              acct_gather_filesystem/none
                     No filesystem data are collected.

              acct_gather_filesystem/lustre
                     Lustre filesystem traffic data are collected from the
                     counters found in /proc/fs/lustre/. In order to
                     account for per job Lustre traffic, add the
                     "fs/lustre" TRES to AccountingStorageTRES.

       AcctGatherProfileType
              Identifies the plugin to be used for detailed job profiling.
              The jobacct_gather plugin and slurmd daemon call this plugin
              to collect detailed data such as I/O counts, memory usage,
              or energy consumption for jobs and nodes. There are
              interfaces in this plugin to collect data at step start and
              completion, task start and completion, and at the account
              gather frequency. The data collected at the node level is
              related to jobs only in case of exclusive job allocation.

              Configurable values at present are:

              acct_gather_profile/none
                     No profile data is collected.

              acct_gather_profile/hdf5
                     This enables the HDF5 plugin. The directory where
                     the profile files are stored and which values are
                     collected are configured in the acct_gather.conf
                     file.

              acct_gather_profile/influxdb
                     This enables the influxdb plugin. The influxdb
                     instance host, port, database, retention policy and
                     which values are collected are configured in the
                     acct_gather.conf file.

       AllowSpecResourcesUsage
              If set to "YES", Slurm allows individual jobs to override a
              node's configured CoreSpecCount value. For a job to take
              advantage of this feature, a command line option of
              --core-spec must be specified. The default value for this
              option is "YES" for Cray systems and "NO" for other system
              types.

       AuthAltTypes
              Comma-separated list of alternative authentication plugins
              that the slurmctld will permit for communication.
              Acceptable values at present include auth/jwt.

              NOTE: auth/jwt requires a jwt_hs256.key to be populated in
              the StateSaveLocation directory for slurmctld only. The
              jwt_hs256.key should only be visible to the SlurmUser and
              root. It is not suggested to place the jwt_hs256.key on any
              nodes but the controller running slurmctld. auth/jwt can be
              activated by the presence of the SLURM_JWT environment
              variable. When activated, it will override the default
              AuthType.

       AuthAltParameters
              Used to define options for alternative authentication
              plugins. Multiple options may be comma separated.

              disable_token_creation
                     Disable "scontrol token" use by non-SlurmUser
                     accounts.

              jwt_key=
                     Absolute path to the JWT key file. The key must be
                     HS256, and should only be accessible by SlurmUser.
                     If not set, the default key file is jwt_hs256.key in
                     StateSaveLocation.
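              A hypothetical JWT setup (the key path shown is
              illustrative, not a default):

```
AuthAltTypes=auth/jwt
AuthAltParameters=disable_token_creation,jwt_key=/var/spool/slurmctld/jwt_hs256.key
```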

       AuthInfo
              Additional information to be used for authentication of
              communications between the Slurm daemons (slurmctld and
              slurmd) and the Slurm clients. The interpretation of this
              option is specific to the configured AuthType. Multiple
              options may be specified in a comma delimited list. If not
              specified, the default authentication information will be
              used.

              cred_expire
                     Default job step credential lifetime, in seconds
                     (e.g. "cred_expire=1200"). It must be sufficiently
                     long to load the user environment, run the prolog,
                     deal with the slurmd getting paged out of memory,
                     etc. This also controls how long a requeued job must
                     wait before starting again. The default value is 120
                     seconds.

              socket Path name of a MUNGE daemon socket to use (e.g.
                     "socket=/var/run/munge/munge.socket.2"). The default
                     value is "/var/run/munge/munge.socket.2". Used by
                     auth/munge and cred/munge.

              ttl    Credential lifetime, in seconds (e.g. "ttl=300").
                     The default value is dependent upon the MUNGE
                     installation, but is typically 300 seconds.
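              Putting these options together, a sketch using the default
              MUNGE socket with a longer credential lifetime to
              accommodate slow prologs (the lifetime value is
              illustrative):

```
AuthInfo=socket=/var/run/munge/munge.socket.2,cred_expire=300
```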

       AuthType
              The authentication method for communications between Slurm
              components. Acceptable values at present include
              "auth/munge" and "auth/none". The default value is
              "auth/munge". "auth/none" includes the UID in each
              communication, but it is not verified. This may be fine for
              testing purposes, but do not use "auth/none" if you desire
              any security. "auth/munge" indicates that MUNGE is to be
              used. (See "https://dun.github.io/munge/" for more
              information). All Slurm daemons and commands must be
              terminated prior to changing the value of AuthType and later
              restarted.

       BackupAddr
              Deprecated option, see SlurmctldHost.

       BackupController
              Deprecated option, see SlurmctldHost.

              The backup controller recovers state information from the
              StateSaveLocation directory, which must be readable and
              writable from both the primary and backup controllers.
              While not essential, it is recommended that you specify a
              backup controller. See the RELOCATING CONTROLLERS section
              if you change this.

       BatchStartTimeout
              The maximum time (in seconds) that a batch job is permitted
              for launching before being considered missing and releasing
              the allocation. The default value is 10 (seconds). Larger
              values may be required if more time is required to execute
              the Prolog, load user environment variables, or if the
              slurmd daemon gets paged from memory.
              NOTE: The test for a job being successfully launched is only
              performed when the Slurm daemon on the compute node
              registers state with the slurmctld daemon on the head node,
              which happens fairly rarely. Therefore a job will not
              necessarily be terminated if its start time exceeds
              BatchStartTimeout. This configuration parameter is also
              applied to launch tasks and avoid aborting srun commands due
              to long running Prolog scripts.

       BurstBufferType
              The plugin used to manage burst buffers. Acceptable values
              at present are:

              burst_buffer/datawarp
                     Use Cray DataWarp API to provide burst buffer
                     functionality.

              burst_buffer/none

       CliFilterPlugins
              A comma delimited list of command line interface option
              filter/modification plugins. The specified plugins will be
              executed in the order listed. These are intended to be
              site-specific plugins which can be used to set default job
              parameters and/or logging events. No cli_filter plugins are
              used by default.

       ClusterName
              The name by which this Slurm managed cluster is known in the
              accounting database. This is needed to distinguish
              accounting records when multiple clusters report to the same
              database. Because of limitations in some databases, any
              upper case letters in the name will be silently mapped to
              lower case. In order to avoid confusion, it is recommended
              that the name be lower case.

       CommunicationParameters
              Comma-separated options identifying communication options.

              CheckGhalQuiesce
                     Used specifically on a Cray using an Aries Ghal
                     interconnect. This will check to see if the system
                     is quiescing when sending a message, and if so, wait
                     until it is done before sending.

              DisableIPv4
                     Disable IPv4-only operation for all Slurm daemons
                     (except slurmdbd). This should also be set in your
                     slurmdbd.conf file.

              EnableIPv6
                     Enable using IPv6 addresses for all Slurm daemons
                     (except slurmdbd). When using both IPv4 and IPv6,
                     address family preferences will be based on your
                     /etc/gai.conf file. This should also be set in your
                     slurmdbd.conf file.

              NoAddrCache
                     By default, Slurm will cache a node's network address
                     after successfully establishing the node's network
                     address. This option disables the cache and Slurm
                     will look up the node's network address each time a
                     connection is made. This is useful, for example, in
                     a cloud environment where the node addresses come and
                     go out of DNS.

              NoCtldInAddrAny
                     Used to directly bind to the address that the node
                     running the slurmctld resolves to, instead of binding
                     messages to any address on the node, which is the
                     default.

              NoInAddrAny
                     Used to directly bind to the address that the node
                     resolves to, instead of binding messages to any
                     address on the node, which is the default. This
                     option is for all daemons/clients except for the
                     slurmctld.
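              For example, a cloud-style cluster whose node addresses come
              and go in DNS might enable IPv6 and disable the address
              cache:

```
CommunicationParameters=EnableIPv6,NoAddrCache
```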

       CompleteWait
              The time to wait, in seconds, when any job is in the
              COMPLETING state before any additional jobs are scheduled.
              This is to attempt to keep jobs on nodes that were recently
              in use, with the goal of preventing fragmentation. If set
              to zero, pending jobs will be started as soon as possible.
              Since a COMPLETING job's resources are released for use by
              other jobs as soon as the Epilog completes on each
              individual node, this can result in very fragmented resource
              allocations. To provide jobs with the minimum response
              time, a value of zero is recommended (no waiting). To
              minimize fragmentation of resources, a value equal to
              KillWait plus two is recommended. In that case, setting
              KillWait to a small value may be beneficial. The default
              value of CompleteWait is zero seconds. The value may not
              exceed 65533.

              NOTE: Setting reduce_completing_frag affects the behavior of
              CompleteWait.
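              Following the fragmentation-minimizing recommendation above,
              CompleteWait would be set to KillWait plus two (the KillWait
              value shown is illustrative):

```
KillWait=10
CompleteWait=12
```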

       ControlAddr
              Deprecated option, see SlurmctldHost.

       ControlMachine
              Deprecated option, see SlurmctldHost.

       CoreSpecPlugin
              Identifies the plugin to be used for enforcement of core
              specialization. The slurmd daemon must be restarted for a
              change in CoreSpecPlugin to take effect. Acceptable values
              at present include:

              core_spec/cray_aries
                     used only for Cray systems

              core_spec/none
                     used for all other system types

       CpuFreqDef
              Default CPU frequency value or frequency governor to use
              when running a job step if it has not been explicitly set
              with the --cpu-freq option. Acceptable values at present
              include a numeric value (frequency in kilohertz) or one of
              the following governors:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor

              Performance  attempts to use the Performance CPU governor

              PowerSave    attempts to use the PowerSave CPU governor

              There is no default value. If unset, no attempt to set the
              governor is made when the --cpu-freq option has not been
              set.
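              Either form is accepted: a governor name or a kilohertz
              value (shown here as two alternatives; only one would
              actually be set):

```
CpuFreqDef=Performance
#CpuFreqDef=2400000    # 2.4 GHz, expressed in kilohertz
```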

       CpuFreqGovernors
              List of CPU frequency governors allowed to be set with the
              salloc, sbatch, or srun option --cpu-freq. Acceptable
              values at present include:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor (a
                           default value)

              Performance  attempts to use the Performance CPU governor
                           (a default value)

              PowerSave    attempts to use the PowerSave CPU governor

              UserSpace    attempts to use the UserSpace CPU governor (a
                           default value)

              The default is OnDemand, Performance and UserSpace.
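              To restrict users to, say, only the Performance and
              PowerSave governors rather than the default set:

```
CpuFreqGovernors=Performance,PowerSave
```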

       CredType
              The cryptographic signature tool to be used in the creation
              of job step credentials. The slurmctld daemon must be
              restarted for a change in CredType to take effect.
              Acceptable values at present include "cred/munge" and
              "cred/none". The default value is "cred/munge", which is
              the recommended option.

       DebugFlags
              Defines specific subsystems which should provide more
              detailed event logging. Multiple subsystems can be
              specified with comma separators. Most DebugFlags will
              result in verbose-level logging for the identified
              subsystems, and could impact performance. Valid subsystems
              available include:

              Accrue        Accrue counters accounting details

              Agent         RPC agents (outgoing RPCs from Slurm daemons)

              Backfill      Backfill scheduler details

              BackfillMap   Backfill scheduler to log a very verbose map
                            of reserved resources through time. Combine
                            with Backfill for a verbose and complete view
                            of the backfill scheduler's work.

              BurstBuffer   Burst Buffer plugin

              CPU_Bind      CPU binding details for jobs and steps

              CpuFrequency  CPU frequency details for jobs and steps
                            using the --cpu-freq option.

              Data          Generic data structure details.

              Dependency    Job dependency debug info

              Elasticsearch Elasticsearch debug info

              Energy        AcctGatherEnergy debug info

              ExtSensors    External Sensors debug info

              Federation    Federation scheduling debug info

              FrontEnd      Front end node details

              Gres          Generic resource details

              Hetjob        Heterogeneous job details

              Gang          Gang scheduling details

              JobContainer  Job container plugin details

              License       License management details

              Network       Network details

              NetworkRaw    Dump raw hex values of key Network
                            communications. Warning: very verbose.

              NodeFeatures  Node Features plugin debug info

              NO_CONF_HASH  Do not log when the slurm.conf files differ
                            between Slurm daemons

              Power         Power management plugin

              PowerSave     Power save (suspend/resume programs) details

              Priority      Job prioritization

              Profile       AcctGatherProfile plugins details

              Protocol      Communication protocol details

              Reservation   Advanced reservations

              Route         Message forwarding debug info

              SelectType    Resource selection plugin

              Steps         Slurmctld resource allocation for job steps

              Switch        Switch plugin

              TimeCray      Timing of Cray APIs

              TRESNode      Limits dealing with TRES=Node

              TraceJobs     Trace jobs in slurmctld. It will print
                            detailed job information including state, job
                            ids and allocated node counts.

              Triggers      Slurmctld triggers

              WorkQueue     Work Queue details

       DefCpuPerGPU
              Default count of CPUs allocated per allocated GPU.

       DefMemPerCPU
              Default real memory size available per allocated CPU in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. DefMemPerCPU would generally be used if
              individual processors are allocated to jobs
              (SelectType=select/cons_res or
              SelectType=select/cons_tres). The default value is 0
              (unlimited). Also see DefMemPerGPU, DefMemPerNode and
              MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode
              are mutually exclusive.
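              For example, a per-CPU memory default of 2 GB under a
              consumable-resource selector (the values shown are
              illustrative):

```
SelectType=select/cons_tres
DefMemPerCPU=2048      # 2 GB per allocated CPU, in megabytes
```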

       DefMemPerGPU
              Default real memory size available per allocated GPU in
              megabytes. The default value is 0 (unlimited). Also see
              DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU
              and DefMemPerNode are mutually exclusive.

       DefMemPerNode
              Default real memory size available per allocated node in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. DefMemPerNode would generally be used if
              whole nodes are allocated to jobs (SelectType=select/linear)
              and resources are over-subscribed (OverSubscribe=yes or
              OverSubscribe=force). The default value is 0 (unlimited).
              Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
              DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
              exclusive.

       DefaultStorageHost
              The default name of the machine hosting the accounting
              storage and job completion databases. Only used for
              database type storage plugins and when the
              AccountingStorageHost and JobCompHost have not been defined.

       DefaultStorageLoc
              The fully qualified file name where job completion records
              are written when the DefaultStorageType is "filetxt". Also
              see JobCompLoc.

       DefaultStoragePass
              The password used to gain access to the database to store
              the accounting and job completion data. Only used for
              database type storage plugins, ignored otherwise. Also see
              AccountingStoragePass and JobCompPass.

       DefaultStoragePort
              The listening port of the accounting storage and/or job
              completion database server. Only used for database type
              storage plugins, ignored otherwise. Also see
              AccountingStoragePort and JobCompPort.

       DefaultStorageType
              The accounting and job completion storage mechanism type.
              Acceptable values at present include "filetxt", "mysql" and
              "none". The value "filetxt" indicates that records will be
              written to a file. The value "mysql" indicates that
              accounting records will be written to a MySQL or MariaDB
              database. The default value is "none", which means that
              records are not maintained. Also see AccountingStorageType
              and JobCompType.

       DefaultStorageUser
              The user account for accessing the accounting storage
              and/or job completion database. Only used for database type
              storage plugins, ignored otherwise. Also see
              AccountingStorageUser and JobCompUser.

       DependencyParameters
              Multiple options may be comma separated.

              disable_remote_singleton
                     By default, when a federated job has a singleton
                     dependency, each cluster in the federation must clear
                     the singleton dependency before the job's singleton
                     dependency is considered satisfied. Enabling this
                     option means that only the origin cluster must clear
                     the singleton dependency. This option must be set in
                     every cluster in the federation.

              kill_invalid_depend
                     If a job has an invalid dependency and it can never
                     run, terminate it and set its state to be
                     JOB_CANCELLED. By default the job stays pending with
                     reason DependencyNeverSatisfied.

              max_depend_depth=#
                     Maximum number of jobs to test for a circular job
                     dependency. Stop testing after this number of job
                     dependencies have been tested. The default value is
                     10 jobs.
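              For example, to cancel jobs whose dependencies can never be
              satisfied and deepen circular-dependency checking (the depth
              value is illustrative):

```
DependencyParameters=kill_invalid_depend,max_depend_depth=20
```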

       DisableRootJobs
              If set to "YES" then user root will be prevented from
              running any jobs. The default value is "NO", meaning user
              root will be able to execute jobs. DisableRootJobs may also
              be set by partition.

       EioTimeout
              The number of seconds srun waits for slurmstepd to close the
              TCP/IP connection used to relay data between the user
              application and srun when the user application terminates.
              The default value is 60 seconds. May not exceed 65533.
818
819 EnforcePartLimits
820 If set to "ALL" then jobs which exceed a partition's size and/or
821 time limits will be rejected at submission time. If a job is
822 submitted to multiple partitions, the job must satisfy the
823 limits on all the requested partitions. If set to "NO" then
824 the job will be accepted and remain queued until the partition
825 limits are altered (Time and Node Limits). If set to "ANY" a
826 job must satisfy the limits of at least one of the requested
827 partitions to be submitted. The default value is "NO". NOTE:
828 If set, then a job's QOS can not be
828 used to exceed partition limits. NOTE: The partition limits be‐
829 ing considered are its configured MaxMemPerCPU, MaxMemPerNode,
830 MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, Allow‐
831 Groups, AllowQOS, and QOS usage threshold.
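For example, to reject over-limit jobs at submission time rather than
leaving them queued, a site might set:

```
# Reject jobs that exceed the limits of every requested partition.
EnforcePartLimits=ALL
```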
832
833
834 Epilog Fully qualified pathname of a script to execute as user root on
835 every node when a user's job completes (e.g. "/usr/lo‐
836 cal/slurm/epilog"). A glob pattern (See glob (7)) may also be
837 used to run more than one epilog script (e.g. "/etc/slurm/epi‐
838 log.d/*"). The Epilog script or scripts may be used to purge
839 files, disable user login, etc. By default there is no epilog.
840 See Prolog and Epilog Scripts for more information.
841
842
843 EpilogMsgTime
844 The number of microseconds that the slurmctld daemon requires to
845 process an epilog completion message from the slurmd daemons.
846 This parameter can be used to prevent a burst of epilog comple‐
847 tion messages from being sent at the same time which should help
848 prevent lost messages and improve throughput for large jobs.
849 The default value is 2000 microseconds. For a 1000 node job,
850 this spreads the epilog completion messages out over two sec‐
851 onds.
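The spread described above is simply the per-node interval times the
node count: at the default 2000 microseconds, a 1000 node job spreads
its epilog completion messages over 2000 x 1000 = 2,000,000
microseconds (two seconds). A hypothetical setting doubling that
spread:

```
# 4000 us per node: a 1000 node job spreads epilog completion
# messages over roughly 4 seconds (value is illustrative).
EpilogMsgTime=4000
```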
852
853
854 EpilogSlurmctld
855 Fully qualified pathname of a program for the slurmctld to exe‐
856 cute upon termination of a job allocation (e.g. "/usr/lo‐
857 cal/slurm/epilog_controller"). The program executes as Slur‐
858 mUser, which gives it permission to drain nodes and requeue the
859 job if a failure occurs (See scontrol(1)). Exactly what the
860 program does and how it accomplishes this is completely at the
861 discretion of the system administrator. Information about the
862 job being initiated, its allocated nodes, etc. are passed to the
863 program using environment variables. See Prolog and Epilog
864 Scripts for more information.
865
866
867 ExtSensorsFreq
868 The external sensors plugin sampling interval. If ExtSen‐
869 sorsType=ext_sensors/none, this parameter is ignored. For all
870 other values of ExtSensorsType, this parameter is the number of
871 seconds between external sensors samples for hardware components
872 (nodes, switches, etc.). The default value is zero, which
873 disables external sensors sampling. Note: This parameter does
874 not affect external sensors data collection for jobs/steps.
875
876
877 ExtSensorsType
878 Identifies the plugin to be used for external sensors data col‐
879 lection. Slurmctld calls this plugin to collect external sen‐
880 sors data for jobs/steps and hardware components. In case of
881 node sharing between jobs the reported values per job/step
882 (through sstat or sacct) may not be accurate. See also "man
883 ext_sensors.conf".
884
885 Configurable values at present are:
886
887 ext_sensors/none No external sensors data is collected.
888
889 ext_sensors/rrd External sensors data is collected from the
890 RRD database.
891
892
893 FairShareDampeningFactor
894 Dampen the effect of exceeding a user or group's fair share of
895 allocated resources. Higher values provide greater ability
896 to differentiate between exceeding the fair share at high levels
897 (e.g. a value of 1 results in almost no difference between over‐
898 consumption by a factor of 10 and 100, while a value of 5 will
899 result in a significant difference in priority). The default
900 value is 1.
901
902
903 FederationParameters
904 Used to define federation options. Multiple options may be comma
905 separated.
906
907
908 fed_display
909 If set, then the client status commands (e.g. squeue,
910 sinfo, sprio, etc.) will display information in a feder‐
911 ated view by default. This option is functionally equiva‐
912 lent to using the --federation options on each command.
913 Use the client's --local option to override the federated
914 view and get a local view of the given cluster.
915
916
917 FirstJobId
918 The job id to be used for the first job submitted to Slurm
919 without a specific requested value. Job id values generated
920 will be incremented by 1 for each subsequent job. This may be
921 used to provide
921 a meta-scheduler with a job id space which is disjoint from the
922 interactive jobs. The default value is 1. Also see MaxJobId.
923
924
925 GetEnvTimeout
926 Controls how long the job should wait (in seconds) to load the
927 user's environment before attempting to load it from a cache
928 file. Applies when the salloc or sbatch --get-user-env option
929 is used. If set to 0 then always load the user's environment
930 from the cache file. The default value is 2 seconds.
931
932
933 GresTypes
934 A comma delimited list of generic resources to be managed (e.g.
935 GresTypes=gpu,mps). These resources may have an associated GRES
936 plugin of the same name providing additional functionality. No
937 generic resources are managed by default. Ensure this parameter
938 is consistent across all nodes in the cluster for proper opera‐
939 tion. The slurmctld daemon must be restarted for changes to
940 this parameter to become effective.
941
942
943 GroupUpdateForce
944 If set to a non-zero value, then information about which users
945 are members of groups allowed to use a partition will be updated
946 periodically, even when there have been no changes to the
947 /etc/group file. If set to zero, group member information will
948 be updated only after the /etc/group file is updated. The de‐
949 fault value is 1. Also see the GroupUpdateTime parameter.
950
951
952 GroupUpdateTime
953 Controls how frequently information about which users are mem‐
954 bers of groups allowed to use a partition will be updated, and
955 how long user group membership lists will be cached. The time
956 interval is given in seconds with a default value of 600 sec‐
957 onds. A value of zero will prevent periodic updating of group
958 membership information. Also see the GroupUpdateForce parame‐
959 ter.
960
961
962 GpuFreqDef=[<type=value>][,<type=value>]
963 Default GPU frequency to use when running a job step if it has
964 not been explicitly set using the --gpu-freq option. This op‐
965 tion can be used to independently configure the GPU and its mem‐
966 ory frequencies. Defaults to "high,memory=high". After the job
967 is completed, the frequencies of all affected GPUs will be reset
968 to the highest possible values. In some cases, system power
969 caps may override the requested values. The field type can be
970 "memory". If type is not specified, the GPU frequency is im‐
971 plied. The value field can either be "low", "medium", "high",
972 "highm1" or a numeric value in megahertz (MHz). If the speci‐
973 fied numeric value is not possible, a value as close as possible
974 will be used. See below for definition of the values. Examples
975 of use include "GpuFreqDef=medium,memory=high" and "GpuFre‐
976 qDef=450".
977
978 Supported value definitions:
979
980 low the lowest available frequency.
981
982 medium attempts to set a frequency in the middle of the
983 available range.
984
985 high the highest available frequency.
986
987 highm1 (high minus one) will select the next highest avail‐
988 able frequency.
989
990
991 HealthCheckInterval
992 The interval in seconds between executions of HealthCheckPro‐
993 gram. The default value is zero, which disables execution.
994
995
996 HealthCheckNodeState
997 Identify what node states should execute the HealthCheckProgram.
998 Multiple state values may be specified with a comma separator.
999 The default value is ANY to execute on nodes in any state.
1000
1001 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
1002 cated).
1003
1004 ANY Run on nodes in any state.
1005
1006 CYCLE Rather than running the health check program on all
1007 nodes at the same time, cycle through running on all
1008 compute nodes through the course of the HealthCheck‐
1009 Interval. May be combined with the various node
1010 state options.
1011
1012 IDLE Run on nodes in the IDLE state.
1013
1014 MIXED Run on nodes in the MIXED state (some CPUs idle and
1015 other CPUs allocated).
1016
1017
1018 HealthCheckProgram
1019 Fully qualified pathname of a script to execute as user root pe‐
1020 riodically on all compute nodes that are not in the NOT_RESPOND‐
1021 ING state. This program may be used to verify the node is fully
1022 operational and DRAIN the node or send email if a problem is de‐
1023 tected. Any action to be taken must be explicitly performed by
1024 the program (e.g. execute "scontrol update NodeName=foo
1025 State=drain Reason=tmp_file_system_full" to drain a node). The
1026 execution interval is controlled using the HealthCheckInterval
1027 parameter. Note that the HealthCheckProgram will be executed at
1028 the same time on all nodes to minimize its impact upon parallel
1029 programs. This program will be killed if it does not termi‐
1030 nate normally within 60 seconds. This program will also be exe‐
1031 cuted when the slurmd daemon is first started and before it reg‐
1032 isters with the slurmctld daemon. By default, no program will
1033 be executed.
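A sketch combining the three health check parameters; the script path
is illustrative, not a Slurm default:

```
# Run a (hypothetical) site check script every 5 minutes, cycling
# through idle nodes rather than hitting them all at once.
HealthCheckProgram=/usr/local/sbin/node_check.sh
HealthCheckInterval=300
HealthCheckNodeState=IDLE,CYCLE
```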
1034
1035
1036 InactiveLimit
1037 The interval, in seconds, after which a non-responsive job allo‐
1038 cation command (e.g. srun or salloc) will result in the job be‐
1039 ing terminated. If the node on which the command is executed
1040 fails or the command abnormally terminates, this will terminate
1041 its job allocation. This option has no effect upon batch jobs.
1042 When setting a value, take into consideration that a debugger
1043 using srun to launch an application may leave the srun command
1044 in a stopped state for extended periods of time. This limit is
1045 ignored for jobs running in partitions with the RootOnly flag
1046 set (the scheduler running as root will be responsible for the
1047 job). The default value is unlimited (zero) and may not exceed
1048 65533 seconds.
1049
1050
1051 InteractiveStepOptions
1052 When LaunchParameters=use_interactive_step is enabled, launching
1053 salloc will automatically start an srun process with Interac‐
1054 tiveStepOptions to launch a terminal on a node in the job allo‐
1055 cation. The default value is "--interactive --preserve-env
1056 --pty $SHELL".
1057
1058
1059 JobAcctGatherType
1060 The job accounting mechanism type. Acceptable values at
1061 present include "jobacct_gather/linux" (for Linux systems, and
1062 the recommended plugin), "jobacct_gather/cgroup" and
1063 "jobacct_gather/none" (no accounting data collected). The de‐
1064 fault value is "jobacct_gather/none". "jobacct_gather/cgroup"
1065 is a plugin for the Linux operating system that uses cgroups to
1066 collect accounting statistics. The plugin collects the following
1067 statistics: From the cgroup memory subsystem: memory.us‐
1068 age_in_bytes (reported as 'pages') and rss from memory.stat (re‐
1069 ported as 'rss'). From the cgroup cpuacct subsystem: user cpu
1070 time and system cpu time. No value is provided by cgroups for
1071 virtual memory size ('vsize'). In order to use the sstat tool,
1072 "jobacct_gather/linux" or "jobacct_gather/cgroup" must be con‐
1073 figured.
1074 NOTE: Changing this configuration parameter changes the contents
1075 of the messages between Slurm daemons. Any previously running
1076 job steps are managed by a slurmstepd daemon that will persist
1077 through the lifetime of that job step and not change its commu‐
1078 nication protocol. Only change this configuration parameter when
1079 there are no running job steps.
1080
1081
1082 JobAcctGatherFrequency
1083 The job accounting and profiling sampling intervals. The sup‐
1084 ported format is as follows:
1085
1086 JobAcctGatherFrequency=<datatype>=<interval>
1087 where <datatype>=<interval> specifies the task sam‐
1088 pling interval for the jobacct_gather plugin or a
1089 sampling interval for a profiling type by the
1090 acct_gather_profile plugin. Multiple, comma-sepa‐
1091 rated <datatype>=<interval> intervals may be speci‐
1092 fied. Supported datatypes are as follows:
1093
1094 task=<interval>
1095 where <interval> is the task sampling inter‐
1096 val in seconds for the jobacct_gather plugins
1097 and for task profiling by the
1098 acct_gather_profile plugin.
1099
1100 energy=<interval>
1101 where <interval> is the sampling interval in
1102 seconds for energy profiling using the
1103 acct_gather_energy plugin
1104
1105 network=<interval>
1106 where <interval> is the sampling interval in
1107 seconds for infiniband profiling using the
1108 acct_gather_interconnect plugin.
1109
1110 filesystem=<interval>
1111 where <interval> is the sampling interval in
1112 seconds for filesystem profiling using the
1113 acct_gather_filesystem plugin.
1114
1115 The default value for task sampling interval
1116 is 30 seconds. The default value for all other intervals is 0.
1117 An interval of 0 disables sampling of the specified type. If
1118 the task sampling interval is 0, accounting information is col‐
1119 lected only at job termination (reducing Slurm interference with
1120 the job).
1121 Smaller (non-zero) values have a greater impact upon job perfor‐
1122 mance, but a value of 30 seconds is not likely to be noticeable
1123 for applications having less than 10,000 tasks.
1124 Users can independently override each interval on a per job ba‐
1125 sis using the --acctg-freq option when submitting the job.
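For instance, sampling task statistics every 30 seconds and energy
data every 60 seconds (intervals illustrative) would look like:

```
# Task accounting every 30s, energy profiling every 60s;
# network and filesystem sampling remain disabled (0).
JobAcctGatherFrequency=task=30,energy=60
```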
1126
1127
1128 JobAcctGatherParams
1129 Arbitrary parameters for the job account gather plugin. Accept‐
1130 able values at present include:
1131
1132 NoShared Exclude shared memory from accounting.
1133
1134 UsePss Use PSS value instead of RSS to calculate
1135 real usage of memory. The PSS value will be
1136 saved as RSS.
1137
1138 OverMemoryKill Kill processes detected to be using more
1139 memory than requested by their steps,
1140 each time accounting information is
1141 gathered by the JobAcctGather plugin.
1142 This parameter
1142 should be used with caution because a job
1143 exceeding its memory allocation may affect
1144 other processes and/or machine health.
1145
1146 NOTE: If available, it is recommended to
1147 limit memory by enabling task/cgroup as a
1148 TaskPlugin and making use of Constrain‐
1149 RAMSpace=yes in the cgroup.conf instead of
1150 using this JobAcctGather mechanism for mem‐
1151 ory enforcement. With OverMemoryKill, memory
1152 limit is applied against each process indi‐
1153 vidually and is not applied to the step as a
1154 whole. This means that when jobs have a
1155 process that consumes too much memory, the
1156 process will be killed but the step will
1157 continue to run. When using cgroups with
1158 ConstrainRAMSpace=yes, a process that con‐
1159 sumes too much memory will result in the job
1160 step being killed. Using JobAcctGather is
1161 polling based and there is a delay before a
1162 job is killed, which could lead to system
1163 Out of Memory events.
1164
1165
1166 JobCompHost
1167 The name of the machine hosting the job completion database.
1168 Only used for database type storage plugins, ignored otherwise.
1169 Also see DefaultStorageHost.
1170
1171
1172 JobCompLoc
1173 The fully qualified file name where job completion records are
1174 written when the JobCompType is "jobcomp/filetxt" or the data‐
1175 base where job completion records are stored when the JobComp‐
1176 Type is a database, or a complete URL endpoint with format
1177 <host>:<port>/<target>/_doc when JobCompType is "jobcomp/elas‐
1178 ticsearch" like i.e. "localhost:9200/slurm/_doc". NOTE: More
1179 information is available at the Slurm web site
1180 <https://slurm.schedmd.com/elasticsearch.html>. Also see De‐
1181 faultStorageLoc.
1182
1183
1184 JobCompParams
1185 Pass arbitrary text string to job completion plugin. Also see
1186 JobCompType.
1187
1188
1189 JobCompPass
1190 The password used to gain access to the database to store the
1191 job completion data. Only used for database type storage plug‐
1192 ins, ignored otherwise. Also see DefaultStoragePass.
1193
1194
1195 JobCompPort
1196 The listening port of the job completion database server. Only
1197 used for database type storage plugins, ignored otherwise. Also
1198 see DefaultStoragePort.
1199
1200
1201 JobCompType
1202 The job completion logging mechanism type. Acceptable values at
1203 present include "jobcomp/none", "jobcomp/elasticsearch", "job‐
1204 comp/filetxt", "jobcomp/lua", "jobcomp/mysql" and "job‐
1205 comp/script". The default value is "jobcomp/none", which means
1206 that upon job completion the record of the job is purged from
1207 the system. If using the accounting infrastructure this plugin
1208 may not be of interest since the information here is redundant.
1209 The value "jobcomp/elasticsearch" indicates that a record of the
1210 job should be written to an Elasticsearch server specified by
1211 the JobCompLoc parameter. NOTE: More information is available
1212 at the Slurm web site ( https://slurm.schedmd.com/elastic‐
1213 search.html ). The value "jobcomp/filetxt" indicates that a
1214 record of the job should be written to a text file specified by
1215 the JobCompLoc parameter. The value "jobcomp/lua" indicates
1216 that a record of the job should be processed by the "jobcomp.lua"
1217 script located in the default script directory (typically the
1218 subdirectory "etc" of the installation directory). The value
1219 "jobcomp/mysql" indicates that a record of the job should be
1220 written to a MySQL or MariaDB database specified by the JobCom‐
1221 pLoc parameter. The value "jobcomp/script" indicates that a
1222 script specified by the JobCompLoc parameter is to be executed
1223 with environment variables indicating the job information.
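As a sketch, the Elasticsearch case described above pairs JobCompType
with a URL-style JobCompLoc (host and index taken from the example
given earlier under JobCompLoc):

```
JobCompType=jobcomp/elasticsearch
JobCompLoc=localhost:9200/slurm/_doc
```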
1224
1225 JobCompUser
1226 The user account for accessing the job completion database.
1227 Only used for database type storage plugins, ignored otherwise.
1228 Also see DefaultStorageUser.
1229
1230
1231 JobContainerType
1232 Identifies the plugin to be used for job tracking. The slurmd
1233 daemon must be restarted for a change in JobContainerType to
1234 take effect. NOTE: The JobContainerType applies to a job allo‐
1235 cation, while ProctrackType applies to job steps. Acceptable
1236 values at present include:
1237
1238 job_container/cncu Used only for Cray systems (CNCU = Compute
1239 Node Clean Up)
1240
1241 job_container/none Used for all other system types
1242
1243 job_container/tmpfs Used to create a private namespace on the
1244 filesystem for jobs, which houses temporary
1245 file systems (/tmp and /dev/shm) for each
1246 job.
1247
1248
1249 JobFileAppend
1250 This option controls what to do if a job's output or error file
1251 exist when the job is started. If JobFileAppend is set to a
1252 value of 1, then append to the existing file. By default, any
1253 existing file is truncated.
1254
1255
1256 JobRequeue
1257 This option controls the default ability for batch jobs to be
1258 requeued. Jobs may be requeued explicitly by a system adminis‐
1259 trator, after node failure, or upon preemption by a higher pri‐
1260 ority job. If JobRequeue is set to a value of 1, then batch
1261 jobs may be requeued unless explicitly disabled by the user.
1262 If JobRequeue is set to a value of 0, then batch jobs will not be re‐
1263 queued unless explicitly enabled by the user. Use the sbatch
1264 --no-requeue or --requeue option to change the default behavior
1265 for individual jobs. The default value is 1.
1266
1267
1268 JobSubmitPlugins
1269 A comma delimited list of job submission plugins to be used.
1270 The specified plugins will be executed in the order listed.
1271 These are intended to be site-specific plugins which can be used
1272 to set default job parameters and/or logging events. Sample
1273 plugins available in the distribution include "all_partitions",
1274 "defaults", "logging", "lua", and "partition". For examples of
1275 use, see the Slurm code in "src/plugins/job_submit" and "con‐
1276 tribs/lua/job_submit*.lua" then modify the code to satisfy your
1277 needs. Slurm can be configured to use multiple job_submit plug‐
1278 ins if desired, however the lua plugin will only execute one lua
1279 script named "job_submit.lua" located in the default script di‐
1280 rectory (typically the subdirectory "etc" of the installation
1281 directory). No job submission plugins are used by default.
1282
1283
1284 KeepAliveTime
1285 Specifies how long sockets communications used between the srun
1286 command and its slurmstepd process are kept alive after discon‐
1287 nect. Longer values can be used to improve reliability of com‐
1288 munications in the event of network failures. By default the
1289 system default value is used. The value may not exceed
1290 65533.
1291
1292
1293 KillOnBadExit
1294 If set to 1, a step will be terminated immediately if any task
1295 crashes or aborts, as indicated by a non-zero exit code. With
1296 the default value of 0, if one of the processes crashes or
1297 aborts, the other processes will continue to run while the
1298 crashed or aborted process waits. The user can override this
1299 configuration parameter by using srun's -K, --kill-on-bad-exit.
1300
1301
1302 KillWait
1303 The interval, in seconds, given to a job's processes between the
1304 SIGTERM and SIGKILL signals upon reaching its time limit. If
1305 the job fails to terminate gracefully in the interval specified,
1306 it will be forcibly terminated. The default value is 30 sec‐
1307 onds. The value may not exceed 65533.
1308
1309
1310 NodeFeaturesPlugins
1311 Identifies the plugins to be used for support of node features
1312 which can change through time. For example, a node which might
1313 be booted with various BIOS settings. This is supported through
1314 the use of a node's active_features and available_features in‐
1315 formation. Acceptable values at present include:
1316
1317 node_features/knl_cray
1318 used only for Intel Knights Landing proces‐
1319 sors (KNL) on Cray systems
1320
1321 node_features/knl_generic
1322 used for Intel Knights Landing processors
1323 (KNL) on a generic Linux system
1324
1325
1326 LaunchParameters
1327 Identifies options to the job launch plugin. Acceptable values
1328 include:
1329
1330 batch_step_set_cpu_freq Set the cpu frequency for the batch step
1331 from the given --cpu-freq option or the
1332 slurm.conf CpuFreqDef setting. By default only
1333 steps started with srun will utilize the
1334 cpu freq setting options.
1335
1336 NOTE: If you are using srun to launch
1337 your steps inside a batch script (ad‐
1338 vised) this option will create a situa‐
1339 tion where you may have multiple agents
1340 setting the cpu_freq as the batch step
1341 usually runs on the same resources as
1342 one or more of the steps that the sruns
1343 in the script will create.
1344
1345 cray_net_exclusive Allow jobs on a Cray Native cluster ex‐
1346 clusive access to network resources.
1347 This should only be set on clusters pro‐
1348 viding exclusive access to each node to
1349 a single job at once, and not using par‐
1350 allel steps within the job, otherwise
1351 resources on the node can be oversub‐
1352 scribed.
1353
1354 enable_nss_slurm Permits passwd and group resolution for
1355 a job to be serviced by slurmstepd
1356 rather than requiring a lookup from a
1357 network based service. See
1358 https://slurm.schedmd.com/nss_slurm.html
1359 for more information.
1360
1361 lustre_no_flush If set on a Cray Native cluster, then do
1362 not flush the Lustre cache on job step
1363 completion. This setting will only take
1364 effect after reconfiguring, and will
1365 only take effect for newly launched
1366 jobs.
1367
1368 mem_sort Sort NUMA memory at step start. User can
1369 override this default with
1370 SLURM_MEM_BIND environment variable or
1371 --mem-bind=nosort command line option.
1372
1373 mpir_use_nodeaddr When launching tasks Slurm creates en‐
1374 tries in MPIR_proctable that are used by
1375 parallel debuggers, profilers, and re‐
1376 lated tools to attach to running
1377 processes. By default the MPIR_proctable
1378 entries contain MPIR_procdesc structures
1379 where the host_name is set to NodeName
1380 by default. If this option is specified,
1381 NodeAddr will be used in this context
1382 instead.
1383
1384 disable_send_gids By default, the slurmctld will look up
1385 and send the user_name and extended gids
1386 for a job, rather than looking them up
1387 independently on each node as part of
1388 each task launch.
1388 This helps mitigate issues around name
1389 service scalability when launching jobs
1390 involving many nodes. Using this option
1391 will disable this functionality. This
1392 option is ignored if enable_nss_slurm is
1393 specified.
1394
1395 slurmstepd_memlock Lock the slurmstepd process's current
1396 memory in RAM.
1397
1398 slurmstepd_memlock_all Lock the slurmstepd process's current
1399 and future memory in RAM.
1400
1401 test_exec Have srun verify existence of the exe‐
1402 cutable program along with user execute
1403 permission on the node where srun was
1404 called before attempting to launch it on
1405 nodes in the step.
1406
1407 use_interactive_step Have salloc use the Interactive Step to
1408 launch a shell on an allocated compute
1409 node rather than locally to wherever
1410 salloc was invoked. This is accomplished
1411 by launching the srun command with In‐
1412 teractiveStepOptions as options.
1413
1414 This does not affect salloc called with
1415 a command as an argument. These jobs
1416 will continue to be executed as the
1417 calling user on the calling host.
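Multiple launch options are comma separated; for example, a site
wanting interactive-step shells plus pre-launch executable checks (an
illustrative combination) could set:

```
LaunchParameters=use_interactive_step,test_exec
```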
1418
1419
1420 LaunchType
1421 Identifies the mechanism to be used to launch application tasks.
1422 Acceptable values include:
1423
1424 launch/slurm
1425 The default value.
1426
1427
1428 Licenses
1429 Specification of licenses (or other resources available on all
1430 nodes of the cluster) which can be allocated to jobs. License
1431 names can optionally be followed by a colon and count with a de‐
1432 fault count of one. Multiple license names should be comma sep‐
1433 arated (e.g. "Licenses=foo:4,bar"). Note that Slurm prevents
1434 jobs from being scheduled if their required license specifica‐
1435 tion is not available. Slurm does not prevent jobs from using
1436 licenses that are not explicitly listed in the job submission
1437 specification.
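For example, with the configuration below, a job submitted with
"sbatch -L foo:2" remains pending until two of the four foo licenses
are free (names and counts illustrative):

```
# Four foo licenses and one bar license, cluster-wide.
Licenses=foo:4,bar
```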
1438
1439
1440 LogTimeFormat
1441 Format of the timestamp in slurmctld and slurmd log files. Ac‐
1442 cepted values are "iso8601", "iso8601_ms", "rfc5424",
1443 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1444 ing in "_ms" differ from the ones without in that fractional
1445 seconds with millisecond precision are printed. The default
1446 value is "iso8601_ms". The "rfc5424" formats are the same as the
1447 "iso8601" formats except that the timezone value is also shown.
1448 The "clock" format shows a timestamp in microseconds retrieved
1449 with the C standard clock() function. The "short" format is a
1450 short date and time format. The "thread_id" format shows the
1451 timestamp in the C standard ctime() function form without the
1452 year but including the microseconds, the daemon's process ID and
1453 the current thread name and ID.
1454
1455
1456 MailDomain
1457 Domain name to qualify usernames if email address is not explic‐
1458 itly given with the "--mail-user" option. If unset, the local
1459 MTA will need to qualify local address itself. Changes to Mail‐
1460 Domain will only affect new jobs.
1461
1462
1463 MailProg
1464 Fully qualified pathname to the program used to send email per
1465 user request. The default value is "/bin/mail" (or
1466 "/usr/bin/mail" if "/bin/mail" does not exist but
1467 "/usr/bin/mail" does exist).
1468
1469
1470 MaxArraySize
1471 The maximum job array size. The maximum job array task index
1472 value will be one less than MaxArraySize to allow for an index
1473 value of zero. Configure
1473 MaxArraySize to 0 in order to disable job array use. The value
1474 may not exceed 4000001. The value of MaxJobCount should be much
1475 larger than MaxArraySize. The default value is 1001. See also
1476 max_array_tasks in SchedulerParameters.
1477
1478
1479 MaxDBDMsgs
1480 When communication to the SlurmDBD is not possible the slurmctld
1481 will queue messages meant to be processed when the SlurmDBD is
1482 available again. In order to avoid running out of memory the
1483 slurmctld will only queue so many messages. The default value is
1484 10000, or MaxJobCount * 2 + Node Count * 4, whichever is
1485 greater. The value can not be less than 10000.
1486
1487
1488 MaxJobCount
1489 The maximum number of jobs Slurm can have in its active database
1490 at one time. Set the values of MaxJobCount and MinJobAge to en‐
1491 sure the slurmctld daemon does not exhaust its memory or other
1492 resources. Once this limit is reached, requests to submit addi‐
1493 tional jobs will fail. The default value is 10000 jobs. NOTE:
1494 Each task of a job array counts as one job even though they will
1495 not occupy separate job records until modified or initiated.
1496 Performance can suffer with more than a few hundred thousand
1497 jobs. Setting MaxSubmitJobs per user is generally valuable
1498 to prevent a single user from filling the system with jobs.
1499 This is accomplished using Slurm's database and configuring en‐
1500 forcement of resource limits. This value may not be reset via
1501 "scontrol reconfig". It only takes effect upon restart of the
1502 slurmctld daemon.
1503
1504
1505 MaxJobId
1506 The maximum job id to be used for jobs submitted to Slurm with‐
1507 out a specific requested value. Job ids are unsigned 32bit inte‐
1508 gers with the first 26 bits reserved for local job ids and the
1509 remaining 6 bits reserved for a cluster id to identify a feder‐
1510 ated job's origin. The maximum allowed local job id is
1511 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1512 (0x03ff0000). MaxJobId only applies to the local job id and not
1513 the federated job id. Job id values generated will be incre‐
1514 mented by 1 for each subsequent job. Once MaxJobId is reached,
1515 the next job will be assigned FirstJobId. Federated jobs will
1516 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1517 bId.
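The bit split above can be checked arithmetically: the low 26 bits
give a maximum local id of 2^26 - 1 = 67,108,863 (0x3FFFFFF), and the
default MaxJobId of 67,043,328 (0x03ff0000) sits just below that cap.
A sketch of raising it to the maximum:

```
# Use the full 26-bit local job id space (0x3FFFFFF = 67108863).
MaxJobId=67108863
```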
1518
1519
1520 MaxMemPerCPU
1521 Maximum real memory size available per allocated CPU in
1522 megabytes. Used to avoid over-subscribing memory and causing
1523 paging. MaxMemPerCPU would generally be used if individual pro‐
1524 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
1525 lectType=select/cons_tres). The default value is 0 (unlimited).
1526 Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode. MaxMem‐
1527 PerCPU and MaxMemPerNode are mutually exclusive.
1528
1529 NOTE: If a job specifies a memory per CPU limit that exceeds
1530 this system limit, that job's count of CPUs per task will try to
1531 automatically increase. This may result in the job failing due
1532 to CPU count limits. This auto-adjustment feature is a best-ef‐
1533 fort one and optimal assignment is not guaranteed due to the
1534 possibility of having heterogeneous configurations and multi-
1535 partition/qos jobs. If this is a concern it is advised to use a
1536 job submit Lua plugin instead to enforce auto-adjustments to
1537 your specific needs.
1538
1539
1540 MaxMemPerNode
1541 Maximum real memory size available per allocated node in
1542 megabytes. Used to avoid over-subscribing memory and causing
1543 paging. MaxMemPerNode would generally be used if whole nodes
1544 are allocated to jobs (SelectType=select/linear) and resources
1545 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1546 The default value is 0 (unlimited). Also see DefMemPerNode and
1547 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
1548 clusive.
1549
1550
1551 MaxStepCount
1552 The maximum number of steps that any job can initiate. This pa‐
1553 rameter is intended to limit the effect of bad batch scripts.
1554 The default value is 40000 steps.
1555
1556
1557 MaxTasksPerNode
1558 Maximum number of tasks Slurm will allow a job step to spawn on
1559 a single node. The default MaxTasksPerNode is 512. May not ex‐
1560 ceed 65533.
1561
1562
1563 MCSParameters
1564 MCS = Multi-Category Security MCS Plugin Parameters. The sup‐
1565 ported parameters are specific to the MCSPlugin. Changes to
1566 this value take effect when the Slurm daemons are reconfigured.
1567 More information about MCS is available here
1568 <https://slurm.schedmd.com/mcs.html>.
1569
1570
1571 MCSPlugin
1572 MCS = Multi-Category Security : associate a security label to
1573 jobs and ensure that nodes can only be shared among jobs using
1574 the same security label. Acceptable values include:
1575
1576 mcs/none is the default value. No security label associated
1577 with jobs, no particular security restriction when
1578 sharing nodes among jobs.
1579
1580 mcs/account only users with the same account can share the nodes
1581 (requires enabling of accounting).
1582
1583 mcs/group only users with the same group can share the nodes.
1584
1585 mcs/user a node cannot be shared with other users.
1586
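              For example, to restrict node sharing to jobs using the same
              account (accounting must be enabled for this plugin):

              MCSPlugin=mcs/account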
1587
1588 MessageTimeout
1589 Time permitted for a round-trip communication to complete in
1590 seconds. Default value is 10 seconds. For systems with shared
1591 nodes, the slurmd daemon could be paged out and necessitate
1592 higher values.
1593
1594
1595 MinJobAge
1596 The minimum age of a completed job before its record is purged
1597              from Slurm's active database.  Set the values of MaxJobCount
1598              and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or
1599 other resources. The default value is 300 seconds. A value of
1600 zero prevents any job record purging. Jobs are not purged dur‐
1601 ing a backfill cycle, so it can take longer than MinJobAge sec‐
1602 onds to purge a job if using the backfill scheduling plugin. In
1603              order to eliminate some possible race conditions, the recommended
1604              minimum non-zero value for MinJobAge is 2.
1605
1606
1607 MpiDefault
1608 Identifies the default type of MPI to be used. Srun may over‐
1609 ride this configuration parameter in any case. Currently sup‐
1610 ported versions include: pmi2, pmix, and none (default, which
1611 works for many other versions of MPI). More information about
1612 MPI use is available here
1613 <https://slurm.schedmd.com/mpi_guide.html>.
1614
1615
1616 MpiParams
1617 MPI parameters. Used to identify ports used by older versions
1618 of OpenMPI and native Cray systems. The input format is
1619 "ports=12000-12999" to identify a range of communication ports
1620              to be used.  NOTE: This is not needed for modern versions of
1621              OpenMPI; removing it can yield a small boost in scheduling
1622              performance.  NOTE: This is required for Cray's PMI.
1623
1624
1625 OverTimeLimit
1626 Number of minutes by which a job can exceed its time limit be‐
1627 fore being canceled. Normally a job's time limit is treated as
1628 a hard limit and the job will be killed upon reaching that
1629 limit. Configuring OverTimeLimit will result in the job's time
1630 limit being treated like a soft limit. Adding the OverTimeLimit
1631 value to the soft time limit provides a hard time limit, at
1632 which point the job is canceled. This is particularly useful
1633 for backfill scheduling, which bases upon each job's soft time
1634 limit. The default value is zero. May not exceed 65533 min‐
1635 utes. A value of "UNLIMITED" is also supported.
1636
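              For example, with the setting below a job submitted with a
              60-minute time limit would not be canceled until 75 minutes
              have elapsed:

              OverTimeLimit=15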
1637
1638 PluginDir
1639 Identifies the places in which to look for Slurm plugins. This
1640 is a colon-separated list of directories, like the PATH environ‐
1641 ment variable. The default value is the prefix given at config‐
1642 ure time + "/lib/slurm".
1643
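              For example, to search a site-specific directory before the
              stock plugin directory (the paths shown are illustrative):

              PluginDir=/opt/site/slurm/lib:/usr/lib64/slurm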
1644
1645 PlugStackConfig
1646 Location of the config file for Slurm stackable plugins that use
1647 the Stackable Plugin Architecture for Node job (K)control
1648 (SPANK). This provides support for a highly configurable set of
1649 plugins to be called before and/or after execution of each task
1650 spawned as part of a user's job step. Default location is
1651 "plugstack.conf" in the same directory as the system slurm.conf.
1652 For more information on SPANK plugins, see the spank(8) manual.
1653
1654
1655 PowerParameters
1656 System power management parameters. The supported parameters
1657 are specific to the PowerPlugin. Changes to this value take ef‐
1658 fect when the Slurm daemons are reconfigured. More information
1659 about system power management is available here
1660              <https://slurm.schedmd.com/power_mgmt.html>.  Options currently
1661              supported by any plugins are listed below.
1662
1663 balance_interval=#
1664 Specifies the time interval, in seconds, between attempts
1665 to rebalance power caps across the nodes. This also con‐
1666 trols the frequency at which Slurm attempts to collect
1667 current power consumption data (old data may be used un‐
1668 til new data is available from the underlying infrastruc‐
1669 ture and values below 10 seconds are not recommended for
1670 Cray systems). The default value is 30 seconds. Sup‐
1671 ported by the power/cray_aries plugin.
1672
1673 capmc_path=
1674 Specifies the absolute path of the capmc command. The
1675 default value is "/opt/cray/capmc/default/bin/capmc".
1676 Supported by the power/cray_aries plugin.
1677
1678 cap_watts=#
1679 Specifies the total power limit to be established across
1680 all compute nodes managed by Slurm. A value of 0 sets
1681 every compute node to have an unlimited cap. The default
1682 value is 0. Supported by the power/cray_aries plugin.
1683
1684 decrease_rate=#
1685 Specifies the maximum rate of change in the power cap for
1686 a node where the actual power usage is below the power
1687 cap by an amount greater than lower_threshold (see be‐
1688 low). Value represents a percentage of the difference
1689 between a node's minimum and maximum power consumption.
1690 The default value is 50 percent. Supported by the
1691 power/cray_aries plugin.
1692
1693 get_timeout=#
1694 Amount of time allowed to get power state information in
1695 milliseconds. The default value is 5,000 milliseconds or
1696 5 seconds. Supported by the power/cray_aries plugin and
1697 represents the time allowed for the capmc command to re‐
1698 spond to various "get" options.
1699
1700 increase_rate=#
1701 Specifies the maximum rate of change in the power cap for
1702 a node where the actual power usage is within up‐
1703 per_threshold (see below) of the power cap. Value repre‐
1704 sents a percentage of the difference between a node's
1705 minimum and maximum power consumption. The default value
1706 is 20 percent. Supported by the power/cray_aries plugin.
1707
1708 job_level
1709 All nodes associated with every job will have the same
1710 power cap, to the extent possible. Also see the
1711 --power=level option on the job submission commands.
1712
1713 job_no_level
1714 Disable the user's ability to set every node associated
1715 with a job to the same power cap. Each node will have
1716 its power cap set independently. This disables the
1717 --power=level option on the job submission commands.
1718
1719 lower_threshold=#
1720 Specify a lower power consumption threshold. If a node's
1721 current power consumption is below this percentage of its
1722 current cap, then its power cap will be reduced. The de‐
1723 fault value is 90 percent. Supported by the
1724 power/cray_aries plugin.
1725
1726 recent_job=#
1727 If a job has started or resumed execution (from suspend)
1728 on a compute node within this number of seconds from the
1729 current time, the node's power cap will be increased to
1730 the maximum. The default value is 300 seconds. Sup‐
1731 ported by the power/cray_aries plugin.
1732
1733
1734 set_timeout=#
1735 Amount of time allowed to set power state information in
1736 milliseconds. The default value is 30,000 milliseconds
1737                     or 30 seconds.  Supported by the power/cray_aries plugin and
1738 represents the time allowed for the capmc command to re‐
1739 spond to various "set" options.
1740
1741 set_watts=#
1742                     Specifies the power limit to be set on every compute
1743                     node managed by Slurm.  Every node gets this same power
1744 cap and there is no variation through time based upon ac‐
1745 tual power usage on the node. Supported by the
1746 power/cray_aries plugin.
1747
1748 upper_threshold=#
1749 Specify an upper power consumption threshold. If a
1750 node's current power consumption is above this percentage
1751 of its current cap, then its power cap will be increased
1752 to the extent possible. The default value is 95 percent.
1753 Supported by the power/cray_aries plugin.
1754
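              For example, a hypothetical Cray configuration capping total
              power across all compute nodes at 400 kW and rebalancing once
              per minute might look like:

              PowerPlugin=power/cray_aries
              PowerParameters=cap_watts=400000,balance_interval=60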
1755
1756 PowerPlugin
1757 Identifies the plugin used for system power management. Cur‐
1758 rently supported plugins include: cray_aries and none. Changes
1759 to this value require restarting Slurm daemons to take effect.
1760 More information about system power management is available here
1761 <https://slurm.schedmd.com/power_mgmt.html>. By default, no
1762 power plugin is loaded.
1763
1764
1765 PreemptMode
1766 Mechanism used to preempt jobs or enable gang scheduling. When
1767 the PreemptType parameter is set to enable preemption, the Pre‐
1768 emptMode selects the default mechanism used to preempt the eli‐
1769 gible jobs for the cluster.
1770 PreemptMode may be specified on a per partition basis to over‐
1771 ride this default value if PreemptType=preempt/partition_prio.
1772 Alternatively, it can be specified on a per QOS basis if Pre‐
1773 emptType=preempt/qos. In either case, a valid default Preempt‐
1774 Mode value must be specified for the cluster as a whole when
1775 preemption is enabled.
1776 The GANG option is used to enable gang scheduling independent of
1777 whether preemption is enabled (i.e. independent of the Preempt‐
1778 Type setting). It can be specified in addition to a PreemptMode
1779 setting with the two options comma separated (e.g. Preempt‐
1780 Mode=SUSPEND,GANG).
1781 See <https://slurm.schedmd.com/preempt.html> and
1782 <https://slurm.schedmd.com/gang_scheduling.html> for more de‐
1783 tails.
1784
1785 NOTE: For performance reasons, the backfill scheduler reserves
1786 whole nodes for jobs, not partial nodes. If during backfill
1787 scheduling a job preempts one or more other jobs, the whole
1788 nodes for those preempted jobs are reserved for the preemptor
1789 job, even if the preemptor job requested fewer resources than
1790 that. These reserved nodes aren't available to other jobs dur‐
1791 ing that backfill cycle, even if the other jobs could fit on the
1792 nodes. Therefore, jobs may preempt more resources during a sin‐
1793 gle backfill iteration than they requested.
1794
1795              NOTE: For a heterogeneous job to be considered for preemption, all
1796 components must be eligible for preemption. When a heterogeneous
1797 job is to be preempted the first identified component of the job
1798 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
1799 CANCEL (lowest)) will be used to set the PreemptMode for all
1800 components. The GraceTime and user warning signal for each com‐
1801 ponent of the heterogeneous job remain unique. Heterogeneous
1802 jobs are excluded from GANG scheduling operations.
1803
1804 OFF Is the default value and disables job preemption and
1805 gang scheduling. It is only compatible with Pre‐
1806 emptType=preempt/none at a global level. A common
1807 use case for this parameter is to set it on a parti‐
1808 tion to disable preemption for that partition.
1809
1810 CANCEL The preempted job will be cancelled.
1811
1812 GANG Enables gang scheduling (time slicing) of jobs in
1813 the same partition, and allows the resuming of sus‐
1814 pended jobs.
1815
1816 NOTE: Gang scheduling is performed independently for
1817 each partition, so if you only want time-slicing by
1818 OverSubscribe, without any preemption, then config‐
1819 uring partitions with overlapping nodes is not rec‐
1820 ommended. On the other hand, if you want to use
1821 PreemptType=preempt/partition_prio to allow jobs
1822 from higher PriorityTier partitions to Suspend jobs
1823 from lower PriorityTier partitions you will need
1824 overlapping partitions, and PreemptMode=SUSPEND,GANG
1825 to use the Gang scheduler to resume the suspended
1826 jobs(s). In any case, time-slicing won't happen be‐
1827 tween jobs on different partitions.
1828
1829 NOTE: Heterogeneous jobs are excluded from GANG
1830 scheduling operations.
1831
1832 REQUEUE Preempts jobs by requeuing them (if possible) or
1833 canceling them. For jobs to be requeued they must
1834 have the --requeue sbatch option set or the cluster
1835 wide JobRequeue parameter in slurm.conf must be set
1836 to one.
1837
1838 SUSPEND The preempted jobs will be suspended, and later the
1839 Gang scheduler will resume them. Therefore the SUS‐
1840 PEND preemption mode always needs the GANG option to
1841 be specified at the cluster level. Also, because the
1842 suspended jobs will still use memory on the allo‐
1843 cated nodes, Slurm needs to be able to track memory
1844 resources to be able to suspend jobs.
1845
1846 NOTE: Because gang scheduling is performed indepen‐
1847 dently for each partition, if using PreemptType=pre‐
1848 empt/partition_prio then jobs in higher PriorityTier
1849 partitions will suspend jobs in lower PriorityTier
1850 partitions to run on the released resources. Only
1851                         when the preemptor job ends will the suspended jobs
1852                         be resumed by the Gang scheduler.
1853 If PreemptType=preempt/qos is configured and if the
1854 preempted job(s) and the preemptor job are on the
1855 same partition, then they will share resources with
1856 the Gang scheduler (time-slicing). If not (i.e. if
1857 the preemptees and preemptor are on different parti‐
1858 tions) then the preempted jobs will remain suspended
1859 until the preemptor ends.
1860
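              For example, to preempt jobs in lower PriorityTier partitions
              by suspending them and have the Gang scheduler resume them
              later:

              PreemptType=preempt/partition_prio
              PreemptMode=SUSPEND,GANG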
1861
1862 PreemptType
1863 Specifies the plugin used to identify which jobs can be pre‐
1864 empted in order to start a pending job.
1865
1866 preempt/none
1867 Job preemption is disabled. This is the default.
1868
1869 preempt/partition_prio
1870 Job preemption is based upon partition PriorityTier.
1871 Jobs in higher PriorityTier partitions may preempt jobs
1872 from lower PriorityTier partitions. This is not compati‐
1873 ble with PreemptMode=OFF.
1874
1875 preempt/qos
1876 Job preemption rules are specified by Quality Of Service
1877 (QOS) specifications in the Slurm database. This option
1878 is not compatible with PreemptMode=OFF. A configuration
1879 of PreemptMode=SUSPEND is only supported by the Select‐
1880 Type=select/cons_res and SelectType=select/cons_tres
1881 plugins. See the sacctmgr man page to configure the op‐
1882 tions for preempt/qos.
1883
1884
1885 PreemptExemptTime
1886 Global option for minimum run time for all jobs before they can
1887 be considered for preemption. Any QOS PreemptExemptTime takes
1888 precedence over the global option. A time of -1 disables the
1889 option, equivalent to 0. Acceptable time formats include "min‐
1890 utes", "minutes:seconds", "hours:minutes:seconds", "days-hours",
1891 "days-hours:minutes", and "days-hours:minutes:seconds".
1892
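              For example, to guarantee every job at least five minutes of
              run time before it may be considered for preemption:

              PreemptExemptTime=00:05:00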
1893
1894 PrEpParameters
1895 Parameters to be passed to the PrEpPlugins.
1896
1897
1898 PrEpPlugins
1899 A resource for programmers wishing to write their own plugins
1900 for the Prolog and Epilog (PrEp) scripts. The default, and cur‐
1901 rently the only implemented plugin is prep/script. Additional
1902 plugins can be specified in a comma-separated list. For more in‐
1903 formation please see the PrEp Plugin API documentation page:
1904 <https://slurm.schedmd.com/prep_plugins.html>
1905
1906
1907 PriorityCalcPeriod
1908 The period of time in minutes in which the half-life decay will
1909 be re-calculated. Applicable only if PriorityType=priority/mul‐
1910 tifactor. The default value is 5 (minutes).
1911
1912
1913 PriorityDecayHalfLife
1914 This controls how long prior resource use is considered in de‐
1915 termining how over- or under-serviced an association is (user,
1916 bank account and cluster) in determining job priority. The
1917 record of usage will be decayed over time, with half of the
1918 original value cleared at age PriorityDecayHalfLife. If set to
1919 0 no decay will be applied. This is helpful if you want to en‐
1920 force hard time limits per association. If set to 0 Priori‐
1921 tyUsageResetPeriod must be set to some interval. Applicable
1922 only if PriorityType=priority/multifactor. The unit is a time
1923 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
1924 default value is 7-0 (7 days).
1925
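              For example, to halve recorded usage every 14 days instead of
              the default 7:

              PriorityDecayHalfLife=14-0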
1926
1927 PriorityFavorSmall
1928 Specifies that small jobs should be given preferential schedul‐
1929 ing priority. Applicable only if PriorityType=priority/multi‐
1930 factor. Supported values are "YES" and "NO". The default value
1931 is "NO".
1932
1933
1934 PriorityFlags
1935 Flags to modify priority behavior. Applicable only if Priority‐
1936 Type=priority/multifactor. The keywords below have no associ‐
1937 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
1938 TIVE_TO_TIME").
1939
1940 ACCRUE_ALWAYS If set, priority age factor will be increased
1941 despite job dependencies or holds.
1942
1943 CALCULATE_RUNNING
1944 If set, priorities will be recalculated not
1945 only for pending jobs, but also running and
1946 suspended jobs.
1947
1948              DEPTH_OBLIVIOUS  If set, priority will be calculated similarly
1949                               to the normal multifactor calculation, but the
1950                               depth of the associations in the tree does not
1951                               adversely affect their priority.  This option
1952 automatically enables NO_FAIR_TREE.
1953
1954 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
1955 to "classic" fair share priority scheduling.
1956
1957 INCR_ONLY If set, priority values will only increase in
1958 value. Job priority will never decrease in
1959 value.
1960
1961 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
1962 BillingWeights) is calculated as the MAX of in‐
1963 dividual TRES' on a node (e.g. cpus, mem, gres)
1964 plus the sum of all global TRES' (e.g. li‐
1965 censes).
1966
1967 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
1968
1969 NO_NORMAL_ASSOC If set, the association factor is not normal‐
1970 ized against the highest association priority.
1971
1972 NO_NORMAL_PART If set, the partition factor is not normalized
1973 against the highest partition PriorityJobFac‐
1974 tor.
1975
1976 NO_NORMAL_QOS If set, the QOS factor is not normalized
1977 against the highest qos priority.
1978
1979 NO_NORMAL_TRES If set, the QOS factor is not normalized
1980 against the job's partition TRES counts.
1981
1982 SMALL_RELATIVE_TO_TIME
1983 If set, the job's size component will be based
1984 upon not the job size alone, but the job's size
1985 divided by its time limit.
1986
1987
1988 PriorityMaxAge
1989 Specifies the job age which will be given the maximum age factor
1990              in computing priority.  For example, a value of 30 minutes would
1991              result in all jobs over 30 minutes old receiving the same
1992              age-based priority.  Applicable only if PriorityType=prior‐
1993 ity/multifactor. The unit is a time string (i.e. min,
1994 hr:min:00, days-hr:min:00, or days-hr). The default value is
1995 7-0 (7 days).
1996
1997
1998 PriorityParameters
1999 Arbitrary string used by the PriorityType plugin.
2000
2001
2002 PrioritySiteFactorParameters
2003 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
2004
2005
2006 PrioritySiteFactorPlugin
2007              This specifies an optional plugin to be used alongside "prior‐
2008 ity/multifactor", which is meant to initially set and continu‐
2009 ously update the SiteFactor priority factor. The default value
2010 is "site_factor/none".
2011
2012
2013 PriorityType
2014 This specifies the plugin to be used in establishing a job's
2015 scheduling priority. Supported values are "priority/basic" (jobs
2016 are prioritized by order of arrival), "priority/multifactor"
2017 (jobs are prioritized based upon size, age, fair-share of allo‐
2018 cation, etc). Also see PriorityFlags for configuration options.
2019 The default value is "priority/basic".
2020
2021              When not using FIFO scheduling, jobs are prioritized in the following
2022 order:
2023
2024 1. Jobs that can preempt
2025 2. Jobs with an advanced reservation
2026 3. Partition Priority Tier
2027 4. Job Priority
2028 5. Job Id
2029
2030
2031 PriorityUsageResetPeriod
2032 At this interval the usage of associations will be reset to 0.
2033 This is used if you want to enforce hard limits of time usage
2034 per association. If PriorityDecayHalfLife is set to be 0 no de‐
2035 cay will happen and this is the only way to reset the usage ac‐
2036              cumulated by running jobs.  By default this is turned off, and
2037              it is advised to use the PriorityDecayHalfLife option instead,
2038              to avoid periods with nothing running on your cluster; but if
2039              your scheme is set up to only allow certain amounts of time on
2040              your system, this is the way to enforce it.  Applicable only if PriorityType=prior‐
2041 ity/multifactor.
2042
2043 NONE Never clear historic usage. The default value.
2044
2045 NOW Clear the historic usage now. Executed at startup
2046 and reconfiguration time.
2047
2048 DAILY Cleared every day at midnight.
2049
2050 WEEKLY Cleared every week on Sunday at time 00:00.
2051
2052 MONTHLY Cleared on the first day of each month at time
2053 00:00.
2054
2055 QUARTERLY Cleared on the first day of each quarter at time
2056 00:00.
2057
2058 YEARLY Cleared on the first day of each year at time 00:00.
2059
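              For example, to disable decay and instead clear accumulated
              usage at the start of each month:

              PriorityDecayHalfLife=0
              PriorityUsageResetPeriod=MONTHLY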
2060
2061 PriorityWeightAge
2062 An integer value that sets the degree to which the queue wait
2063 time component contributes to the job's priority. Applicable
2064 only if PriorityType=priority/multifactor. Requires Account‐
2065 ingStorageType=accounting_storage/slurmdbd. The default value
2066 is 0.
2067
2068
2069 PriorityWeightAssoc
2070 An integer value that sets the degree to which the association
2071 component contributes to the job's priority. Applicable only if
2072 PriorityType=priority/multifactor. The default value is 0.
2073
2074
2075 PriorityWeightFairshare
2076 An integer value that sets the degree to which the fair-share
2077 component contributes to the job's priority. Applicable only if
2078 PriorityType=priority/multifactor. Requires AccountingStor‐
2079 ageType=accounting_storage/slurmdbd. The default value is 0.
2080
2081
2082 PriorityWeightJobSize
2083 An integer value that sets the degree to which the job size com‐
2084 ponent contributes to the job's priority. Applicable only if
2085 PriorityType=priority/multifactor. The default value is 0.
2086
2087
2088 PriorityWeightPartition
2089 Partition factor used by priority/multifactor plugin in calcu‐
2090 lating job priority. Applicable only if PriorityType=prior‐
2091 ity/multifactor. The default value is 0.
2092
2093
2094 PriorityWeightQOS
2095 An integer value that sets the degree to which the Quality Of
2096 Service component contributes to the job's priority. Applicable
2097 only if PriorityType=priority/multifactor. The default value is
2098 0.
2099
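              As an illustrative (not recommended) multifactor configuration,
              the weights below favor fair-share most heavily, followed by
              queue wait time and job size:

              PriorityType=priority/multifactor
              PriorityWeightFairshare=100000
              PriorityWeightAge=10000
              PriorityWeightJobSize=1000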
2100
2101 PriorityWeightTRES
2102 A comma-separated list of TRES Types and weights that sets the
2103 degree that each TRES Type contributes to the job's priority.
2104
2105 e.g.
2106 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
2107
2108 Applicable only if PriorityType=priority/multifactor and if Ac‐
2109 countingStorageTRES is configured with each TRES Type. Negative
2110 values are allowed. The default values are 0.
2111
2112
2113 PrivateData
2114 This controls what type of information is hidden from regular
2115 users. By default, all information is visible to all users.
2116 User SlurmUser and root can always view all information. Multi‐
2117 ple values may be specified with a comma separator. Acceptable
2118 values include:
2119
2120 accounts
2121 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2122 ing any account definitions unless they are coordinators
2123 of them.
2124
2125 cloud Powered down nodes in the cloud are visible.
2126
2127              events Prevents users from viewing event information unless they
2128 have operator status or above.
2129
2130 jobs Prevents users from viewing jobs or job steps belonging
2131 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
2132 users from viewing job records belonging to other users
2133 unless they are coordinators of the association running
2134 the job when using sacct.
2135
2136 nodes Prevents users from viewing node state information.
2137
2138 partitions
2139 Prevents users from viewing partition state information.
2140
2141 reservations
2142 Prevents regular users from viewing reservations which
2143 they can not use.
2144
2145              usage  Prevents users from viewing usage of any other user; this
2146                     applies to sshare.  (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
2147                     vents users from viewing usage of any other user; this
2148                     applies to sreport.
2149
2150 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2151                     ing information of any user other than themselves; this
2152                     also means users can only see associations they
2153 deal with. Coordinators can see associations of all
2154 users in the account they are coordinator of, but can
2155 only see themselves when listing users.
2156
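              For example, to hide other users' jobs, usage and user
              information from regular users:

              PrivateData=jobs,usage,users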
2157
2158 ProctrackType
2159 Identifies the plugin to be used for process tracking on a job
2160 step basis. The slurmd daemon uses this mechanism to identify
2161 all processes which are children of processes it spawns for a
2162 user job step. The slurmd daemon must be restarted for a change
2163 in ProctrackType to take effect. NOTE: "proctrack/linuxproc"
2164 and "proctrack/pgid" can fail to identify all processes associ‐
2165 ated with a job since processes can become a child of the init
2166 process (when the parent process terminates) or change their
2167 process group. To reliably track all processes, "proc‐
2168 track/cgroup" is highly recommended. NOTE: The JobContainerType
2169 applies to a job allocation, while ProctrackType applies to job
2170 steps. Acceptable values at present include:
2171
2172 proctrack/cgroup
2173 Uses linux cgroups to constrain and track processes, and
2174 is the default for systems with cgroup support.
2175 NOTE: see "man cgroup.conf" for configuration details.
2176
2177 proctrack/cray_aries
2178 Uses Cray proprietary process tracking.
2179
2180 proctrack/linuxproc
2181 Uses linux process tree using parent process IDs.
2182
2183 proctrack/pgid
2184 Uses Process Group IDs.
2185 NOTE: This is the default for the BSD family.
2186
2187
2188 Prolog Fully qualified pathname of a program for the slurmd to execute
2189 whenever it is asked to run a job step from a new job allocation
2190 (e.g. "/usr/local/slurm/prolog"). A glob pattern (See glob (7))
2191 may also be used to specify more than one program to run (e.g.
2192 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2193 starting the first job step. The prolog script or scripts may
2194 be used to purge files, enable user login, etc. By default
2195 there is no prolog. Any configured script is expected to com‐
2196 plete execution quickly (in less time than MessageTimeout). If
2197 the prolog fails (returns a non-zero exit code), this will re‐
2198 sult in the node being set to a DRAIN state and the job being
2199 requeued in a held state, unless nohold_on_prolog_fail is con‐
2200 figured in SchedulerParameters. See Prolog and Epilog Scripts
2201 for more information.
2202
2203
2204 PrologEpilogTimeout
2205              The interval in seconds Slurm waits for Prolog and Epilog be‐
2206 fore terminating them. The default behavior is to wait indefi‐
2207 nitely. This interval applies to the Prolog and Epilog run by
2208 slurmd daemon before and after the job, the PrologSlurmctld and
2209 EpilogSlurmctld run by slurmctld daemon, and the SPANK plugins
2210 run by the slurmstepd daemon.
2211
2212
2213 PrologFlags
2214 Flags to control the Prolog behavior. By default no flags are
2215 set. Multiple flags may be specified in a comma-separated list.
2216 Currently supported options are:
2217
2218 Alloc If set, the Prolog script will be executed at job allo‐
2219 cation. By default, Prolog is executed just before the
2220 task is launched. Therefore, when salloc is started, no
2221 Prolog is executed. Alloc is useful for preparing things
2222 before a user starts to use any allocated resources. In
2223 particular, this flag is needed on a Cray system when
2224 cluster compatibility mode is enabled.
2225
2226 NOTE: Use of the Alloc flag will increase the time re‐
2227 quired to start jobs.
2228
2229 Contain At job allocation time, use the ProcTrack plugin to cre‐
2230 ate a job container on all allocated compute nodes.
2231 This container may be used for user processes not
2232 launched under Slurm control, for example
2233 pam_slurm_adopt may place processes launched through a
2234 direct user login into this container. If using
2235 pam_slurm_adopt, then ProcTrackType must be set to ei‐
2236 ther proctrack/cgroup or proctrack/cray_aries. Setting
2237                      the Contain flag implicitly sets the Alloc flag.
2238
2239              NoHold  If set, the Alloc flag should also be set.  This
2240                      allows salloc to avoid blocking until the prolog is fin‐
2241 ished on each node. The blocking will happen when steps
2242 reach the slurmd and before any execution has happened
2243 in the step. This is a much faster way to work and if
2244 using srun to launch your tasks you should use this
2245 flag. This flag cannot be combined with the Contain or
2246 X11 flags.
2247
2248 Serial By default, the Prolog and Epilog scripts run concur‐
2249 rently on each node. This flag forces those scripts to
2250 run serially within each node, but with a significant
2251 penalty to job throughput on each node.
2252
2253 X11 Enable Slurm's built-in X11 forwarding capabilities.
2254 This is incompatible with ProctrackType=proctrack/linux‐
2255 proc. Setting the X11 flag implicitly enables both Con‐
2256 tain and Alloc flags as well.
2257
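              For example, to run the Prolog at allocation time and create a
              job container for use by pam_slurm_adopt (Contain implicitly
              sets Alloc):

              PrologFlags=Contain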
2258
2259 PrologSlurmctld
2260 Fully qualified pathname of a program for the slurmctld daemon
2261 to execute before granting a new job allocation (e.g. "/usr/lo‐
2262 cal/slurm/prolog_controller"). The program executes as Slur‐
2263 mUser on the same node where the slurmctld daemon executes, giv‐
2264 ing it permission to drain nodes and requeue the job if a fail‐
2265 ure occurs or cancel the job if appropriate. The program can be
2266 used to reboot nodes or perform other work to prepare resources
2267 for use. Exactly what the program does and how it accomplishes
2268 this is completely at the discretion of the system administra‐
2269 tor. Information about the job being initiated, its allocated
2270 nodes, etc. are passed to the program using environment vari‐
2271 ables. While this program is running, the nodes associated with
2272              the job will have a POWER_UP/CONFIGURING flag set in their
2273 state, which can be readily viewed. The slurmctld daemon will
2274 wait indefinitely for this program to complete. Once the pro‐
2275 gram completes with an exit code of zero, the nodes will be con‐
2276              sidered ready for use and the job will be started.  If some
2277 node can not be made available for use, the program should drain
2278 the node (typically using the scontrol command) and terminate
2279 with a non-zero exit code. A non-zero exit code will result in
2280 the job being requeued (where possible) or killed. Note that
2281 only batch jobs can be requeued. See Prolog and Epilog Scripts
2282 for more information.
2283
2284
2285 PropagatePrioProcess
2286 Controls the scheduling priority (nice value) of user spawned
2287 tasks.
2288
2289 0 The tasks will inherit the scheduling priority from the
2290 slurm daemon. This is the default value.
2291
2292 1 The tasks will inherit the scheduling priority of the com‐
2293 mand used to submit them (e.g. srun or sbatch). Unless the
2294 job is submitted by user root, the tasks will have a sched‐
2295 uling priority no higher than the slurm daemon spawning
2296 them.
2297
2298 2 The tasks will inherit the scheduling priority of the com‐
2299 mand used to submit them (e.g. srun or sbatch) with the re‐
2300 striction that their nice value will always be one higher
       than that of the slurm daemon (i.e. the tasks' scheduling
       priority will be lower than that of the slurm daemon).
2303
2304
2305 PropagateResourceLimits
2306 A comma-separated list of resource limit names. The slurmd dae‐
2307 mon uses these names to obtain the associated (soft) limit val‐
2308 ues from the user's process environment on the submit node.
2309 These limits are then propagated and applied to the jobs that
2310 will run on the compute nodes. This parameter can be useful
2311 when system limits vary among nodes. Any resource limits that
2312 do not appear in the list are not propagated. However, the user
2313 can override this by specifying which resource limits to propa‐
2314 gate with the sbatch or srun "--propagate" option. If neither
       PropagateResourceLimits nor PropagateResourceLimitsExcept is
       configured and the "--propagate" option is not specified, then
2317 the default action is to propagate all limits. Only one of the
2318 parameters, either PropagateResourceLimits or PropagateResource‐
2319 LimitsExcept, may be specified. The user limits can not exceed
2320 hard limits under which the slurmd daemon operates. If the user
2321 limits are not propagated, the limits from the slurmd daemon
2322 will be propagated to the user's job. The limits used for the
       Slurm daemons can be set in the /etc/sysconfig/slurm file.
       For more information, see
       https://slurm.schedmd.com/faq.html#memlock.
       The following limit names are supported by Slurm (although
2326 some options may not be supported on some systems):
2327
2328 ALL All limits listed below (default)
2329
2330 NONE No limits listed below
2331
2332 AS The maximum address space for a process
2333
2334 CORE The maximum size of core file
2335
2336 CPU The maximum amount of CPU time
2337
2338 DATA The maximum size of a process's data segment
2339
2340 FSIZE The maximum size of files created. Note that if the
2341 user sets FSIZE to less than the current size of the
2342 slurmd.log, job launches will fail with a 'File size
2343 limit exceeded' error.
2344
2345 MEMLOCK The maximum size that may be locked into memory
2346
2347 NOFILE The maximum number of open files
2348
2349 NPROC The maximum number of processes available
2350
2351 RSS The maximum resident set size
2352
2353 STACK The maximum stack size
2354
2355
2356 PropagateResourceLimitsExcept
2357 A comma-separated list of resource limit names. By default, all
2358 resource limits will be propagated, (as described by the Propa‐
2359 gateResourceLimits parameter), except for the limits appearing
2360 in this list. The user can override this by specifying which
2361 resource limits to propagate with the sbatch or srun "--propa‐
2362 gate" option. See PropagateResourceLimits above for a list of
2363 valid limit names.
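       As a sketch, either form might appear in slurm.conf (limit
       names taken from the list above; remember that only one of the
       two parameters may be configured at a time):

```
# Propagate only these soft limits from the submit node:
PropagateResourceLimits=MEMLOCK,NOFILE

# ...or instead propagate everything except CORE:
#PropagateResourceLimitsExcept=CORE
```

       A user could still override either setting for a single job,
       e.g. with "srun --propagate=STACK".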
2364
2365
2366 RebootProgram
2367 Program to be executed on each compute node to reboot it. In‐
2368 voked on each node once it becomes idle after the command "scon‐
2369 trol reboot" is executed by an authorized user or a job is sub‐
2370 mitted with the "--reboot" option. After rebooting, the node is
2371 returned to normal use. See ResumeTimeout to configure the time
2372 you expect a reboot to finish in. A node will be marked DOWN if
2373 it doesn't reboot within ResumeTimeout.
2374
2375
2376 ReconfigFlags
2377 Flags to control various actions that may be taken when an
2378 "scontrol reconfig" command is issued. Currently the options
2379 are:
2380
2381 KeepPartInfo If set, an "scontrol reconfig" command will
2382 maintain the in-memory value of partition
2383 "state" and other parameters that may have been
2384 dynamically updated by "scontrol update". Par‐
2385 tition information in the slurm.conf file will
2386 be merged with in-memory data. This flag su‐
2387 persedes the KeepPartState flag.
2388
2389 KeepPartState If set, an "scontrol reconfig" command will
2390 preserve only the current "state" value of
2391 in-memory partitions and will reset all other
2392 parameters of the partitions that may have been
2393 dynamically updated by "scontrol update" to the
2394 values from the slurm.conf file. Partition in‐
2395 formation in the slurm.conf file will be merged
2396 with in-memory data.
2397 The default for the above flags is not set, and the "scontrol
2398 reconfig" will rebuild the partition information using only the
2399 definitions in the slurm.conf file.
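       For example, to preserve dynamically updated partition
       parameters across a reconfiguration:

```
ReconfigFlags=KeepPartInfo
```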
2400
2401
2402 RequeueExit
2403 Enables automatic requeue for batch jobs which exit with the
       specified values.  Separate multiple exit codes with a comma
       and/or specify numeric ranges using a "-" separator (e.g.
       "RequeueExit=1-9,18").  Jobs will be put back into pending
       state and
2407 later scheduled again. Restarted jobs will have the environment
2408 variable SLURM_RESTART_COUNT set to the number of times the job
2409 has been restarted.
2410
2411
2412 RequeueExitHold
2413 Enables automatic requeue for batch jobs which exit with the
2414 specified values, with these jobs being held until released man‐
       ually by the user.  Separate multiple exit codes with a comma
       and/or specify numeric ranges using a "-" separator (e.g.
       "RequeueExitHold=10-12,16").  These jobs are put in the
       JOB_SPECIAL_EXIT exit state.  Restarted jobs will have the environment
2419 variable SLURM_RESTART_COUNT set to the number of times the job
2420 has been restarted.
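       The two parameters are often used together; the exit codes
       below are illustrative, matching the examples above:

```
# Requeue batch jobs exiting with 1-9 or 18; hold jobs exiting
# with 10-12 or 16 in JOB_SPECIAL_EXIT until released manually.
RequeueExit=1-9,18
RequeueExitHold=10-12,16
```

       A batch script can then consult SLURM_RESTART_COUNT to limit
       how many restarts it tolerates.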
2421
2422
2423 ResumeFailProgram
       The program that will be executed when nodes fail to resume
2425 by ResumeTimeout. The argument to the program will be the names
2426 of the failed nodes (using Slurm's hostlist expression format).
2427
2428
2429 ResumeProgram
2430 Slurm supports a mechanism to reduce power consumption on nodes
2431 that remain idle for an extended period of time. This is typi‐
2432 cally accomplished by reducing voltage and frequency or powering
2433 the node down. ResumeProgram is the program that will be exe‐
2434 cuted when a node in power save mode is assigned work to per‐
2435 form. For reasons of reliability, ResumeProgram may execute
2436 more than once for a node when the slurmctld daemon crashes and
2437 is restarted. If ResumeProgram is unable to restore a node to
2438 service with a responding slurmd and an updated BootTime, it
2439 should requeue any job associated with the node and set the node
2440 state to DOWN. If the node isn't actually rebooted (i.e. when
       multiple-slurmd is configured) starting slurmd with the "-b" option
2442 might be useful. The program executes as SlurmUser. The argu‐
2443 ment to the program will be the names of nodes to be removed
2444 from power savings mode (using Slurm's hostlist expression for‐
2445 mat). By default no program is run. Related configuration op‐
2446 tions include ResumeTimeout, ResumeRate, SuspendRate, Suspend‐
2447 Time, SuspendTimeout, SuspendProgram, SuspendExcNodes, and Sus‐
2448 pendExcParts. More information is available at the Slurm web
2449 site ( https://slurm.schedmd.com/power_save.html ).
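       Taken together with the related options, a power-saving
       configuration might look like the following sketch (the script
       paths and all values are illustrative, not recommendations):

```
# Power down nodes idle for 10 minutes; allow 5 minutes to resume.
SuspendProgram=/usr/local/slurm/node_suspend.sh
ResumeProgram=/usr/local/slurm/node_resume.sh
SuspendTime=600
SuspendTimeout=30
ResumeTimeout=300
SuspendRate=10
ResumeRate=100
```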
2450
2451
2452 ResumeRate
2453 The rate at which nodes in power save mode are returned to nor‐
       mal operation by ResumeProgram.  The value is the number of
       nodes per minute and can be used to prevent power surges if a
       large
2456 number of nodes in power save mode are assigned work at the same
2457 time (e.g. a large job starts). A value of zero results in no
2458 limits being imposed. The default value is 300 nodes per
2459 minute. Related configuration options include ResumeTimeout,
2460 ResumeProgram, SuspendRate, SuspendTime, SuspendTimeout, Sus‐
2461 pendProgram, SuspendExcNodes, and SuspendExcParts.
2462
2463
2464 ResumeTimeout
2465 Maximum time permitted (in seconds) between when a node resume
2466 request is issued and when the node is actually available for
2467 use. Nodes which fail to respond in this time frame will be
2468 marked DOWN and the jobs scheduled on the node requeued. Nodes
2469 which reboot after this time frame will be marked DOWN with a
2470 reason of "Node unexpectedly rebooted." The default value is 60
2471 seconds. Related configuration options include ResumeProgram,
2472 ResumeRate, SuspendRate, SuspendTime, SuspendTimeout, Suspend‐
2473 Program, SuspendExcNodes and SuspendExcParts. More information
2474 is available at the Slurm web site (
2475 https://slurm.schedmd.com/power_save.html ).
2476
2477
2478 ResvEpilog
2479 Fully qualified pathname of a program for the slurmctld to exe‐
2480 cute when a reservation ends. The program can be used to cancel
2481 jobs, modify partition configuration, etc. The reservation
2482 named will be passed as an argument to the program. By default
2483 there is no epilog.
2484
2485
2486 ResvOverRun
2487 Describes how long a job already running in a reservation should
2488 be permitted to execute after the end time of the reservation
2489 has been reached. The time period is specified in minutes and
2490 the default value is 0 (kill the job immediately). The value
2491 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2492 supported to permit a job to run indefinitely after its reserva‐
2493 tion is terminated.
2494
2495
2496 ResvProlog
2497 Fully qualified pathname of a program for the slurmctld to exe‐
2498 cute when a reservation begins. The program can be used to can‐
2499 cel jobs, modify partition configuration, etc. The reservation
2500 named will be passed as an argument to the program. By default
2501 there is no prolog.
2502
2503
2504 ReturnToService
2505 Controls when a DOWN node will be returned to service. The de‐
2506 fault value is 0. Supported values include
2507
2508 0 A node will remain in the DOWN state until a system adminis‐
2509 trator explicitly changes its state (even if the slurmd dae‐
2510 mon registers and resumes communications).
2511
2512 1 A DOWN node will become available for use upon registration
2513 with a valid configuration only if it was set DOWN due to
2514 being non-responsive. If the node was set DOWN for any
2515 other reason (low memory, unexpected reboot, etc.), its
2516 state will not automatically be changed. A node registers
2517 with a valid configuration if its memory, GRES, CPU count,
2518 etc. are equal to or greater than the values configured in
2519 slurm.conf.
2520
2521 2 A DOWN node will become available for use upon registration
2522 with a valid configuration. The node could have been set
2523 DOWN for any reason. A node registers with a valid configu‐
2524 ration if its memory, GRES, CPU count, etc. are equal to or
2525 greater than the values configured in slurm.conf. (Disabled
2526 on Cray ALPS systems.)
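       For example, to let nodes that were set DOWN only because they
       stopped responding return to service automatically when slurmd
       re-registers with a valid configuration:

```
ReturnToService=1
```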
2527
2528
2529 RoutePlugin
2530 Identifies the plugin to be used for defining which nodes will
2531 be used for message forwarding.
2532
2533 route/default
2534 default, use TreeWidth.
2535
2536 route/topology
2537 use the switch hierarchy defined in a topology.conf file.
2538 TopologyPlugin=topology/tree is required.
2539
2540
2541 SbcastParameters
2542 Controls sbcast command behavior. Multiple options can be speci‐
       fied in a comma-separated list.  Supported values include:
2544
2545 DestDir= Destination directory for file being broadcast to
2546 allocated compute nodes. Default value is cur‐
2547 rent working directory.
2548
2549 Compression= Specify default file compression library to be
2550 used. Supported values are "lz4", "none" and
2551 "zlib". The default value with the sbcast --com‐
2552 press option is "lz4" and "none" otherwise. Some
2553 compression libraries may be unavailable on some
2554 systems.
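       For example (the directory is illustrative):

```
# Stage broadcast files into /tmp and use lz4 when sbcast is
# invoked with --compress.
SbcastParameters=DestDir=/tmp,Compression=lz4
```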
2555
2556
2557 SchedulerParameters
2558 The interpretation of this parameter varies by SchedulerType.
2559 Multiple options may be comma separated.
2560
2561 allow_zero_lic
2562 If set, then job submissions requesting more than config‐
2563 ured licenses won't be rejected.
2564
2565 assoc_limit_stop
2566 If set and a job cannot start due to association limits,
2567 then do not attempt to initiate any lower priority jobs
2568 in that partition. Setting this can decrease system
            throughput and utilization, but avoids potentially
            starving larger jobs that might otherwise never be able
            to launch.
2572
2573 batch_sched_delay=#
2574 How long, in seconds, the scheduling of batch jobs can be
2575 delayed. This can be useful in a high-throughput envi‐
2576 ronment in which batch jobs are submitted at a very high
2577 rate (i.e. using the sbatch command) and one wishes to
2578 reduce the overhead of attempting to schedule each job at
2579 submit time. The default value is 3 seconds.
2580
2581 bb_array_stage_cnt=#
2582 Number of tasks from a job array that should be available
2583 for burst buffer resource allocation. Higher values will
2584 increase the system overhead as each task from the job
2585 array will be moved to its own job record in memory, so
2586 relatively small values are generally recommended. The
2587 default value is 10.
2588
2589 bf_busy_nodes
2590 When selecting resources for pending jobs to reserve for
2591 future execution (i.e. the job can not be started immedi‐
2592 ately), then preferentially select nodes that are in use.
2593 This will tend to leave currently idle resources avail‐
2594 able for backfilling longer running jobs, but may result
2595 in allocations having less than optimal network topology.
2596 This option is currently only supported by the se‐
2597 lect/cons_res and select/cons_tres plugins (or se‐
2598 lect/cray_aries with SelectTypeParameters set to
2599 "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
2600 select/cray_aries plugin over the select/cons_res or se‐
2601 lect/cons_tres plugin respectively).
2602
2603 bf_continue
2604 The backfill scheduler periodically releases locks in or‐
2605 der to permit other operations to proceed rather than
2606 blocking all activity for what could be an extended pe‐
2607 riod of time. Setting this option will cause the back‐
2608 fill scheduler to continue processing pending jobs from
2609 its original job list after releasing locks even if job
2610 or node state changes.
2611
2612 bf_hetjob_immediate
2613 Instruct the backfill scheduler to attempt to start a
2614 heterogeneous job as soon as all of its components are
2615 determined able to do so. Otherwise, the backfill sched‐
2616 uler will delay heterogeneous jobs initiation attempts
2617 until after the rest of the queue has been processed.
2618 This delay may result in lower priority jobs being allo‐
2619 cated resources, which could delay the initiation of the
2620 heterogeneous job due to account and/or QOS limits being
            reached.  This option is disabled by default.  If enabled
            and bf_hetjob_prio=min is not set, then bf_hetjob_prio=min
            will be set automatically.
2624
2625 bf_hetjob_prio=[min|avg|max]
2626 At the beginning of each backfill scheduling cycle, a
            list of pending jobs to be scheduled is sorted according
2628 to the precedence order configured in PriorityType. This
2629 option instructs the scheduler to alter the sorting algo‐
2630 rithm to ensure that all components belonging to the same
2631 heterogeneous job will be attempted to be scheduled con‐
2632 secutively (thus not fragmented in the resulting list).
2633 More specifically, all components from the same heteroge‐
2634 neous job will be treated as if they all have the same
2635 priority (minimum, average or maximum depending upon this
2636 option's parameter) when compared with other jobs (or
2637 other heterogeneous job components). The original order
2638 will be preserved within the same heterogeneous job. Note
2639 that the operation is calculated for the PriorityTier
2640 layer and for the Priority resulting from the prior‐
2641 ity/multifactor plugin calculations. When enabled, if any
2642 heterogeneous job requested an advanced reservation, then
2643 all of that job's components will be treated as if they
2644 had requested an advanced reservation (and get preferen‐
2645 tial treatment in scheduling).
2646
2647 Note that this operation does not update the Priority
2648 values of the heterogeneous job components, only their
2649 order within the list, so the output of the sprio command
            will not be affected.
2651
2652 Heterogeneous jobs have special scheduling properties:
2653 they are only scheduled by the backfill scheduling
2654 plugin, each of their components is considered separately
2655 when reserving resources (and might have different Prior‐
2656 ityTier or different Priority values), and no heteroge‐
2657 neous job component is actually allocated resources until
            all of its components can be initiated.  This may imply
2659 potential scheduling deadlock scenarios because compo‐
2660 nents from different heterogeneous jobs can start reserv‐
2661 ing resources in an interleaved fashion (not consecu‐
2662 tively), but none of the jobs can reserve resources for
2663 all components and start. Enabling this option can help
2664 to mitigate this problem. By default, this option is dis‐
2665 abled.
2666
2667 bf_interval=#
2668 The number of seconds between backfill iterations.
2669 Higher values result in less overhead and better respon‐
2670 siveness. This option applies only to Scheduler‐
2671 Type=sched/backfill. Default: 30, Min: 1, Max: 10800
2672 (3h).
2673
2674
2675 bf_job_part_count_reserve=#
2676 The backfill scheduling logic will reserve resources for
2677 the specified count of highest priority jobs in each par‐
2678 tition. For example, bf_job_part_count_reserve=10 will
2679 cause the backfill scheduler to reserve resources for the
2680 ten highest priority jobs in each partition. Any lower
2681 priority job that can be started using currently avail‐
2682 able resources and not adversely impact the expected
2683 start time of these higher priority jobs will be started
            by the backfill scheduler.  The default value is zero,
2685 which will reserve resources for any pending job and de‐
2686 lay initiation of lower priority jobs. Also see
2687 bf_min_age_reserve and bf_min_prio_reserve. Default: 0,
2688 Min: 0, Max: 100000.
2689
2690
2691 bf_max_job_array_resv=#
2692 The maximum number of tasks from a job array for which
2693 the backfill scheduler will reserve resources in the fu‐
2694 ture. Since job arrays can potentially have millions of
2695 tasks, the overhead in reserving resources for all tasks
2696 can be prohibitive. In addition various limits may pre‐
2697 vent all the jobs from starting at the expected times.
2698 This has no impact upon the number of tasks from a job
2699 array that can be started immediately, only those tasks
2700 expected to start at some future time. Default: 20, Min:
2701 0, Max: 1000. NOTE: Jobs submitted to multiple parti‐
2702 tions appear in the job queue once per partition. If dif‐
2703 ferent copies of a single job array record aren't consec‐
2704 utive in the job queue and another job array record is in
2705 between, then bf_max_job_array_resv tasks are considered
2706 per partition that the job is submitted to.
2707
2708 bf_max_job_assoc=#
2709 The maximum number of jobs per user association to at‐
2710 tempt starting with the backfill scheduler. This setting
2711 is similar to bf_max_job_user but is handy if a user has
2712 multiple associations equating to basically different
2713 users. One can set this limit to prevent users from
2714 flooding the backfill queue with jobs that cannot start
            and that prevent jobs from other users from starting.  This
2716 option applies only to SchedulerType=sched/backfill.
            Also see the bf_max_job_user, bf_max_job_part,
2718 bf_max_job_test and bf_max_job_user_part=# options. Set
2719 bf_max_job_test to a value much higher than
2720 bf_max_job_assoc. Default: 0 (no limit), Min: 0, Max:
2721 bf_max_job_test.
2722
2723 bf_max_job_part=#
2724 The maximum number of jobs per partition to attempt
2725 starting with the backfill scheduler. This can be espe‐
2726 cially helpful for systems with large numbers of parti‐
2727 tions and jobs. This option applies only to Scheduler‐
2728 Type=sched/backfill. Also see the partition_job_depth
2729 and bf_max_job_test options. Set bf_max_job_test to a
2730 value much higher than bf_max_job_part. Default: 0 (no
2731 limit), Min: 0, Max: bf_max_job_test.
2732
2733 bf_max_job_start=#
2734 The maximum number of jobs which can be initiated in a
2735 single iteration of the backfill scheduler. This option
2736 applies only to SchedulerType=sched/backfill. Default: 0
2737 (no limit), Min: 0, Max: 10000.
2738
2739 bf_max_job_test=#
2740 The maximum number of jobs to attempt backfill scheduling
2741 for (i.e. the queue depth). Higher values result in more
2742 overhead and less responsiveness. Until an attempt is
2743 made to backfill schedule a job, its expected initiation
2744 time value will not be set. In the case of large clus‐
2745 ters, configuring a relatively small value may be desir‐
2746 able. This option applies only to Scheduler‐
2747 Type=sched/backfill. Default: 100, Min: 1, Max:
2748 1,000,000.
2749
2750 bf_max_job_user=#
2751 The maximum number of jobs per user to attempt starting
2752 with the backfill scheduler for ALL partitions. One can
2753 set this limit to prevent users from flooding the back‐
2754 fill queue with jobs that cannot start and that prevent
            jobs from other users from starting.  This is similar to the
2756 MAXIJOB limit in Maui. This option applies only to
2757 SchedulerType=sched/backfill. Also see the
2758 bf_max_job_part, bf_max_job_test and
2759 bf_max_job_user_part=# options. Set bf_max_job_test to a
2760 value much higher than bf_max_job_user. Default: 0 (no
2761 limit), Min: 0, Max: bf_max_job_test.
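            The per-scope backfill limits are typically combined in a
            single SchedulerParameters line; the values below are
            illustrative only, keeping bf_max_job_test well above the
            per-partition and per-user limits as advised above:

```
SchedulerParameters=bf_continue,bf_max_job_test=5000,bf_max_job_part=100,bf_max_job_user=50
```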
2762
2763 bf_max_job_user_part=#
2764 The maximum number of jobs per user per partition to at‐
2765 tempt starting with the backfill scheduler for any single
2766 partition. This option applies only to Scheduler‐
2767 Type=sched/backfill. Also see the bf_max_job_part,
2768 bf_max_job_test and bf_max_job_user=# options. Default:
2769 0 (no limit), Min: 0, Max: bf_max_job_test.
2770
2771 bf_max_time=#
2772 The maximum time in seconds the backfill scheduler can
2773 spend (including time spent sleeping when locks are re‐
2774 leased) before discontinuing, even if maximum job counts
2775 have not been reached. This option applies only to
2776 SchedulerType=sched/backfill. The default value is the
2777 value of bf_interval (which defaults to 30 seconds). De‐
2778 fault: bf_interval value (def. 30 sec), Min: 1, Max: 3600
2779 (1h). NOTE: If bf_interval is short and bf_max_time is
2780 large, this may cause locks to be acquired too frequently
2781 and starve out other serviced RPCs. It's advisable if us‐
2782 ing this parameter to set max_rpc_cnt high enough that
2783 scheduling isn't always disabled, and low enough that the
2784 interactive workload can get through in a reasonable pe‐
2785 riod of time. max_rpc_cnt needs to be below 256 (the de‐
2786 fault RPC thread limit). Running around the middle (150)
2787 may give you good results. NOTE: When increasing the
2788 amount of time spent in the backfill scheduling cycle,
2789 Slurm can be prevented from responding to client requests
2790 in a timely manner. To address this you can use
2791 max_rpc_cnt to specify a number of queued RPCs before the
            scheduler stops in order to respond to these requests.
2793
2794 bf_min_age_reserve=#
2795 The backfill and main scheduling logic will not reserve
2796 resources for pending jobs until they have been pending
2797 and runnable for at least the specified number of sec‐
2798 onds. In addition, jobs waiting for less than the speci‐
2799 fied number of seconds will not prevent a newly submitted
2800 job from starting immediately, even if the newly submit‐
2801 ted job has a lower priority. This can be valuable if
2802 jobs lack time limits or all time limits have the same
2803 value. The default value is zero, which will reserve re‐
2804 sources for any pending job and delay initiation of lower
2805 priority jobs. Also see bf_job_part_count_reserve and
2806 bf_min_prio_reserve. Default: 0, Min: 0, Max: 2592000
2807 (30 days).
2808
2809 bf_min_prio_reserve=#
2810 The backfill and main scheduling logic will not reserve
2811 resources for pending jobs unless they have a priority
2812 equal to or higher than the specified value. In addi‐
2813 tion, jobs with a lower priority will not prevent a newly
2814 submitted job from starting immediately, even if the
2815 newly submitted job has a lower priority. This can be
            valuable if one wishes to maximize system utilization
2817 without regard for job priority below a certain thresh‐
2818 old. The default value is zero, which will reserve re‐
2819 sources for any pending job and delay initiation of lower
2820 priority jobs. Also see bf_job_part_count_reserve and
2821 bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.
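            For example, to reserve resources only for jobs that have
            been pending and runnable for at least five minutes
            (value illustrative):

```
SchedulerParameters=bf_min_age_reserve=300
```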
2822
2823 bf_one_resv_per_job
2824 Disallow adding more than one backfill reservation per
2825 job. The scheduling logic builds a sorted list of (job,
2826 partition) pairs. Jobs submitted to multiple partitions
2827 have as many entries in the list as requested partitions.
2828 By default, the backfill scheduler may evaluate all the
2829 (job, partition) entries for a single job, potentially
2830 reserving resources for each pair, but only starting the
2831 job in the reservation offering the earliest start time.
2832 Having a single job reserving resources for multiple par‐
2833 titions could impede other jobs (or hetjob components)
2834 from reserving resources already reserved for the reser‐
2835 vations related to the partitions that don't offer the
2836 earliest start time. This option makes it so that a job
2837 submitted to multiple partitions will stop reserving re‐
2838 sources once the first (job, partition) pair has booked a
2839 backfill reservation. Subsequent pairs from the same job
2840 will only be tested to start now. This allows for other
            jobs to be able to book the other pairs' resources at the
            cost of not guaranteeing that the multi-partition job
2843 will start in the partition offering the earliest start
2844 time (except if it can start now). This option is dis‐
2845 abled by default.
2846
2847
2848 bf_resolution=#
2849 The number of seconds in the resolution of data main‐
2850 tained about when jobs begin and end. Higher values re‐
2851 sult in better responsiveness and quicker backfill cycles
2852 by using larger blocks of time to determine node eligi‐
2853 bility. However, higher values lead to less efficient
2854 system planning, and may miss opportunities to improve
2855 system utilization. This option applies only to Sched‐
2856 ulerType=sched/backfill. Default: 60, Min: 1, Max: 3600
2857 (1 hour).
2858
2859 bf_running_job_reserve
2860 Add an extra step to backfill logic, which creates back‐
2861 fill reservations for jobs running on whole nodes. This
2862 option is disabled by default.
2863
2864 bf_window=#
2865 The number of minutes into the future to look when con‐
2866 sidering jobs to schedule. Higher values result in more
2867 overhead and less responsiveness. A value at least as
2868 long as the highest allowed time limit is generally ad‐
2869 visable to prevent job starvation. In order to limit the
2870 amount of data managed by the backfill scheduler, if the
2871 value of bf_window is increased, then it is generally ad‐
2872 visable to also increase bf_resolution. This option ap‐
2873 plies only to SchedulerType=sched/backfill. Default:
2874 1440 (1 day), Min: 1, Max: 43200 (30 days).
2875
2876 bf_window_linear=#
2877 For performance reasons, the backfill scheduler will de‐
2878 crease precision in calculation of job expected termina‐
2879 tion times. By default, the precision starts at 30 sec‐
2880 onds and that time interval doubles with each evaluation
2881 of currently executing jobs when trying to determine when
2882 a pending job can start. This algorithm can support an
2883 environment with many thousands of running jobs, but can
2884 result in the expected start time of pending jobs being
2885 gradually being deferred due to lack of precision. A
2886 value for bf_window_linear will cause the time interval
2887 to be increased by a constant amount on each iteration.
2888 The value is specified in units of seconds. For example,
2889 a value of 60 will cause the backfill scheduler on the
2890 first iteration to identify the job ending soonest and
2891 determine if the pending job can be started after that
2892 job plus all other jobs expected to end within 30 seconds
2893 (default initial value) of the first job. On the next it‐
2894 eration, the pending job will be evaluated for starting
2895 after the next job expected to end plus all jobs ending
2896 within 90 seconds of that time (30 second default, plus
2897 the 60 second option value). The third iteration will
2898 have a 150 second window and the fourth 210 seconds.
2899 Without this option, the time windows will double on each
2900 iteration and thus be 30, 60, 120, 240 seconds, etc. The
2901 use of bf_window_linear is not recommended with more than
2902 a few hundred simultaneously executing jobs.
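            The window growth described above can be sketched as a
            small helper (a hedged illustration of the arithmetic,
            assuming the 30-second initial window stated above):

```shell
# Print the window sizes (seconds) used on successive backfill
# evaluations: doubling by default, or growing by a constant
# bf_window_linear amount when one is given.
backfill_windows() {   # args: iterations [bf_window_linear]
    n=$1; linear=${2:-}; w=30; out=""
    i=0
    while [ "$i" -lt "$n" ]; do
        out="$out $w"
        if [ -n "$linear" ]; then
            w=$((w + linear))   # linear growth
        else
            w=$((w * 2))        # default doubling
        fi
        i=$((i + 1))
    done
    echo "$out"
}

backfill_windows 4        # default doubling: 30 60 120 240
backfill_windows 4 60     # bf_window_linear=60: 30 90 150 210
```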
2903
2904 bf_yield_interval=#
2905 The backfill scheduler will periodically relinquish locks
2906 in order for other pending operations to take place.
2907 This specifies the times when the locks are relinquished
2908 in microseconds. Smaller values may be helpful for high
2909 throughput computing when used in conjunction with the
2910 bf_continue option. Also see the bf_yield_sleep option.
2911 Default: 2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10
2912 sec).
2913
2914 bf_yield_sleep=#
2915 The backfill scheduler will periodically relinquish locks
2916 in order for other pending operations to take place.
2917 This specifies the length of time for which the locks are
2918 relinquished in microseconds. Also see the bf_yield_in‐
2919 terval option. Default: 500,000 (0.5 sec), Min: 1, Max:
2920 10,000,000 (10 sec).
2921
2922 build_queue_timeout=#
2923 Defines the maximum time that can be devoted to building
2924 a queue of jobs to be tested for scheduling. If the sys‐
2925 tem has a huge number of jobs with dependencies, just
2926 building the job queue can take so much time as to ad‐
2927 versely impact overall system performance and this param‐
2928 eter can be adjusted as needed. The default value is
2929 2,000,000 microseconds (2 seconds).
2930
2931 correspond_after_task_cnt=#
            Defines the number of array tasks that get split for a
            potential aftercorr dependency check.  A low number may
            result in dependent task check failures when the job one
            depends on gets purged before the split.  Default: 10.
2936
2937 default_queue_depth=#
2938 The default number of jobs to attempt scheduling (i.e.
2939 the queue depth) when a running job completes or other
            routine actions occur; however, the frequency with which
2941 the scheduler is run may be limited by using the defer or
2942 sched_min_interval parameters described below. The full
2943 queue will be tested on a less frequent basis as defined
2944 by the sched_interval option described below. The default
2945 value is 100. See the partition_job_depth option to
2946 limit depth by partition.
2947
2948 defer Setting this option will avoid attempting to schedule
2949 each job individually at job submit time, but defer it
2950 until a later time when scheduling multiple jobs simulta‐
2951 neously may be possible. This option may improve system
2952 responsiveness when large numbers of jobs (many hundreds)
2953 are submitted at the same time, but it will delay the
2954 initiation time of individual jobs. Also see de‐
2955 fault_queue_depth above.
2956
           delay_boot=#
                  Do not reboot nodes in order to satisfy this job's
                  feature specification if the job has been eligible to
                  run for less than this time period. If the job has
                  waited for less than the specified period, it will use
                  only nodes which already have the specified features.
                  The argument is in units of minutes. Individual jobs
                  may override this default value with the --delay-boot
                  option.
2965
           disable_job_shrink
                  Deny user requests to shrink the size of running jobs.
                  (However, running jobs may still shrink due to node
                  failure if the --no-kill option was set.)
2970
2971 disable_hetjob_steps
2972 Disable job steps that span heterogeneous job alloca‐
2973 tions. The default value on Cray systems.
2974
2975 enable_hetjob_steps
2976 Enable job steps that span heterogeneous job allocations.
2977 The default value except for Cray systems.
2978
2979 enable_user_top
2980 Enable use of the "scontrol top" command by non-privi‐
2981 leged users.
2982
2983 Ignore_NUMA
2984 Some processors (e.g. AMD Opteron 6000 series) contain
2985 multiple NUMA nodes per socket. This is a configuration
2986 which does not map into the hardware entities that Slurm
2987 optimizes resource allocation for (PU/thread, core,
2988 socket, baseboard, node and network switch). In order to
2989 optimize resource allocations on such hardware, Slurm
2990 will consider each NUMA node within the socket as a sepa‐
2991 rate socket by default. Use the Ignore_NUMA option to re‐
2992 port the correct socket count, but not optimize resource
2993 allocations on the NUMA nodes.
2994
           inventory_interval=#
                  On a Cray system using Slurm on top of ALPS this
                  limits the number of times a Basil Inventory call is
                  made. Normally this call happens at every scheduling
                  pass to attempt to close any node state change window
                  with respect to what ALPS has. This call is rather
                  slow, so making it less frequent improves performance
                  dramatically, but in the situation where a node
                  changes state the window is as large as this setting.
                  In an HTC environment this setting is a must and we
                  advise around 10 seconds.
3005
3006 max_array_tasks
3007 Specify the maximum number of tasks that can be included
3008 in a job array. The default limit is MaxArraySize, but
3009 this option can be used to set a lower limit. For exam‐
3010 ple, max_array_tasks=1000 and MaxArraySize=100001 would
3011 permit a maximum task ID of 100000, but limit the number
3012 of tasks in any single job array to 1000.
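
                  The example above corresponds to a slurm.conf fragment
                  such as:

                       MaxArraySize=100001
                       SchedulerParameters=max_array_tasks=1000

                  which permits a maximum task ID of 100000 while
                  limiting any single job array to 1000 tasks.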
3013
3014 max_rpc_cnt=#
3015 If the number of active threads in the slurmctld daemon
3016 is equal to or larger than this value, defer scheduling
3017 of jobs. The scheduler will check this condition at cer‐
3018 tain points in code and yield locks if necessary. This
3019 can improve Slurm's ability to process requests at a cost
3020 of initiating new jobs less frequently. Default: 0 (op‐
3021 tion disabled), Min: 0, Max: 1000.
3022
3023 NOTE: The maximum number of threads (MAX_SERVER_THREADS)
3024 is internally set to 256 and defines the number of served
                  RPCs at a given time. Setting max_rpc_cnt to more
                  than 256 will only be useful to let backfill continue
                  scheduling work after locks have been yielded (i.e.
                  every 2 seconds) if there are at most
                  MAX(max_rpc_cnt/10, 20) RPCs in the queue. For
                  example, with max_rpc_cnt=1000 the scheduler will be
                  allowed to continue after yielding locks only when
                  there are 100 or fewer pending RPCs.
3032 If a value is set, then a value of 10 or higher is recom‐
3033 mended. It may require some tuning for each system, but
3034 needs to be high enough that scheduling isn't always dis‐
3035 abled, and low enough that requests can get through in a
3036 reasonable period of time.
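
                  As a sketch only (the value 150 is illustrative, not a
                  recommendation):

                       SchedulerParameters=max_rpc_cnt=150

                  With this setting, scheduling of new jobs is deferred
                  whenever 150 or more slurmctld threads are active.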
3037
3038 max_sched_time=#
3039 How long, in seconds, that the main scheduling loop will
3040 execute for before exiting. If a value is configured, be
3041 aware that all other Slurm operations will be deferred
3042 during this time period. Make certain the value is lower
3043 than MessageTimeout. If a value is not explicitly con‐
3044 figured, the default value is half of MessageTimeout with
3045 a minimum default value of 1 second and a maximum default
3046 value of 2 seconds. For example if MessageTimeout=10,
3047 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
3048
3049 max_script_size=#
3050 Specify the maximum size of a batch script, in bytes.
3051 The default value is 4 megabytes. Larger values may ad‐
3052 versely impact system performance.
3053
3054 max_switch_wait=#
3055 Maximum number of seconds that a job can delay execution
3056 waiting for the specified desired switch count. The de‐
3057 fault value is 300 seconds.
3058
3059 no_backup_scheduling
3060 If used, the backup controller will not schedule jobs
3061 when it takes over. The backup controller will allow jobs
3062 to be submitted, modified and cancelled but won't sched‐
3063 ule new jobs. This is useful in Cray environments when
3064 the backup controller resides on an external Cray node.
3065 A restart is required to alter this option. This is ex‐
3066 plicitly set on a Cray/ALPS system.
3067
           no_env_cache
                  If used, any job started on a node that fails to load
                  the environment will fail instead of using the cached
                  environment. This also implicitly sets the
                  requeue_setup_env_fail option.
3073
3074 nohold_on_prolog_fail
3075 By default, if the Prolog exits with a non-zero value the
3076 job is requeued in a held state. By specifying this pa‐
3077 rameter the job will be requeued but not held so that the
3078 scheduler can dispatch it to another host.
3079
3080 pack_serial_at_end
3081 If used with the select/cons_res or select/cons_tres
3082 plugin, then put serial jobs at the end of the available
3083 nodes rather than using a best fit algorithm. This may
3084 reduce resource fragmentation for some workloads.
3085
3086 partition_job_depth=#
3087 The default number of jobs to attempt scheduling (i.e.
3088 the queue depth) from each partition/queue in Slurm's
3089 main scheduling logic. The functionality is similar to
3090 that provided by the bf_max_job_part option for the back‐
3091 fill scheduling logic. The default value is 0 (no
                  limit). Jobs excluded from attempted scheduling based
                  upon partition will not be counted against the
                  default_queue_depth limit. Also see the bf_max_job_part
3095 option.
3096
3097 permit_job_expansion
3098 Allow running jobs to request additional nodes be merged
3099 in with the current job allocation.
3100
3101 preempt_reorder_count=#
                  Specify how many attempts should be made in
                  reordering preemptable jobs to minimize the count of
                  jobs preempted.
3104 The default value is 1. High values may adversely impact
3105 performance. The logic to support this option is only
3106 available in the select/cons_res and select/cons_tres
3107 plugins.
3108
3109 preempt_strict_order
3110 If set, then execute extra logic in an attempt to preempt
3111 only the lowest priority jobs. It may be desirable to
3112 set this configuration parameter when there are multiple
3113 priorities of preemptable jobs. The logic to support
3114 this option is only available in the select/cons_res and
3115 select/cons_tres plugins.
3116
3117 preempt_youngest_first
3118 If set, then the preemption sorting algorithm will be
3119 changed to sort by the job start times to favor preempt‐
3120 ing younger jobs over older. (Requires preempt/parti‐
3121 tion_prio or preempt/qos plugins.)
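
                  For illustration only (the reorder count of 2 is an
                  arbitrary example), the preemption tuning options
                  above might be combined as:

                       SchedulerParameters=preempt_strict_order,preempt_reorder_count=2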
3122
3123 reduce_completing_frag
3124 This option is used to control how scheduling of re‐
3125 sources is performed when jobs are in the COMPLETING
3126 state, which influences potential fragmentation. If this
3127 option is not set then no jobs will be started in any
3128 partition when any job is in the COMPLETING state for
3129 less than CompleteWait seconds. If this option is set
3130 then no jobs will be started in any individual partition
3131 that has a job in COMPLETING state for less than Com‐
3132 pleteWait seconds. In addition, no jobs will be started
3133 in any partition with nodes that overlap with any nodes
3134 in the partition of the completing job. This option is
3135 to be used in conjunction with CompleteWait.
3136
3137 NOTE: CompleteWait must be set in order for this to work.
3138 If CompleteWait=0 then this option does nothing.
3139
3140 NOTE: reduce_completing_frag only affects the main sched‐
3141 uler, not the backfill scheduler.
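
                  A sketch of a configuration using this option (the
                  CompleteWait value is illustrative):

                       CompleteWait=32
                       SchedulerParameters=reduce_completing_frag

                  Remember that CompleteWait must be non-zero for
                  reduce_completing_frag to have any effect.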
3142
3143 requeue_setup_env_fail
3144 By default if a job environment setup fails the job keeps
3145 running with a limited environment. By specifying this
3146 parameter the job will be requeued in held state and the
3147 execution node drained.
3148
3149 salloc_wait_nodes
3150 If defined, the salloc command will wait until all allo‐
3151 cated nodes are ready for use (i.e. booted) before the
3152 command returns. By default, salloc will return as soon
3153 as the resource allocation has been made.
3154
3155 sbatch_wait_nodes
3156 If defined, the sbatch script will wait until all allo‐
3157 cated nodes are ready for use (i.e. booted) before the
3158 initiation. By default, the sbatch script will be initi‐
3159 ated as soon as the first node in the job allocation is
3160 ready. The sbatch command can use the --wait-all-nodes
3161 option to override this configuration parameter.
3162
3163 sched_interval=#
3164 How frequently, in seconds, the main scheduling loop will
3165 execute and test all pending jobs. The default value is
3166 60 seconds.
3167
3168 sched_max_job_start=#
3169 The maximum number of jobs that the main scheduling logic
3170 will start in any single execution. The default value is
3171 zero, which imposes no limit.
3172
3173 sched_min_interval=#
3174 How frequently, in microseconds, the main scheduling loop
3175 will execute and test any pending jobs. The scheduler
3176 runs in a limited fashion every time that any event hap‐
3177 pens which could enable a job to start (e.g. job submit,
3178 job terminate, etc.). If these events happen at a high
3179 frequency, the scheduler can run very frequently and con‐
3180 sume significant resources if not throttled by this op‐
3181 tion. This option specifies the minimum time between the
3182 end of one scheduling cycle and the beginning of the next
3183 scheduling cycle. A value of zero will disable throt‐
3184 tling of the scheduling logic interval. The default
3185 value is 1,000,000 microseconds on Cray/ALPS systems and
3186 2 microseconds on other systems.
3187
3188 spec_cores_first
3189 Specialized cores will be selected from the first cores
3190 of the first sockets, cycling through the sockets on a
3191 round robin basis. By default, specialized cores will be
3192 selected from the last cores of the last sockets, cycling
3193 through the sockets on a round robin basis.
3194
3195 step_retry_count=#
3196 When a step completes and there are steps ending resource
3197 allocation, then retry step allocations for at least this
3198 number of pending steps. Also see step_retry_time. The
3199 default value is 8 steps.
3200
3201 step_retry_time=#
3202 When a step completes and there are steps ending resource
3203 allocation, then retry step allocations for all steps
3204 which have been pending for at least this number of sec‐
3205 onds. Also see step_retry_count. The default value is
3206 60 seconds.
3207
3208 whole_hetjob
3209 Requests to cancel, hold or release any component of a
3210 heterogeneous job will be applied to all components of
3211 the job.
3212
                  NOTE: this option was previously named whole_pack and
                  that name is still supported for backward
                  compatibility.
3215
3216
3217 SchedulerTimeSlice
3218 Number of seconds in each time slice when gang scheduling is en‐
3219 abled (PreemptMode=SUSPEND,GANG). The value must be between 5
3220 seconds and 65533 seconds. The default value is 30 seconds.
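
       For illustration, gang scheduling with the default time slice
       corresponds to:

            PreemptMode=SUSPEND,GANG
            SchedulerTimeSlice=30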
3221
3222
3223 SchedulerType
3224 Identifies the type of scheduler to be used. Note the slurmctld
3225 daemon must be restarted for a change in scheduler type to be‐
3226 come effective (reconfiguring a running daemon has no effect for
3227 this parameter). The scontrol command can be used to manually
3228 change job priorities if desired. Acceptable values include:
3229
3230 sched/backfill
3231 For a backfill scheduling module to augment the default
3232 FIFO scheduling. Backfill scheduling will initiate
3233 lower-priority jobs if doing so does not delay the ex‐
3234 pected initiation time of any higher priority job. Ef‐
3235 fectiveness of backfill scheduling is dependent upon
3236 users specifying job time limits, otherwise all jobs will
3237 have the same time limit and backfilling is impossible.
                  See the documentation for the SchedulerParameters option
3239 above. This is the default configuration.
3240
3241 sched/builtin
3242 This is the FIFO scheduler which initiates jobs in prior‐
3243 ity order. If any job in the partition can not be sched‐
3244 uled, no lower priority job in that partition will be
3245 scheduled. An exception is made for jobs that can not
3246 run due to partition constraints (e.g. the time limit) or
3247 down/drained nodes. In that case, lower priority jobs
3248 can be initiated and not impact the higher priority job.
3249
3250 sched/hold
3251 To hold all newly arriving jobs if a file
3252 "/etc/slurm.hold" exists otherwise use the built-in FIFO
3253 scheduler
3254
3255
3256 ScronParameters
3257 Multiple options may be comma separated.
3258
3259 enable Enable the use of scrontab to submit and manage periodic
3260 repeating jobs.
3261
3262
3263 SelectType
3264 Identifies the type of resource selection algorithm to be used.
3265 Changing this value can only be done by restarting the slurmctld
3266 daemon. When changed, all job information (running and pending)
3267 will be lost, since the job state save format used by each
3268 plugin is different. The only exception to this is when chang‐
3269 ing from cons_res to cons_tres or from cons_tres to cons_res.
3270 However, if a job contains cons_tres-specific features and then
3271 SelectType is changed to cons_res, the job will be canceled,
3272 since there is no way for cons_res to satisfy requirements spe‐
3273 cific to cons_tres.
3274
3275 Acceptable values include
3276
3277 select/cons_res
3278 The resources (cores and memory) within a node are indi‐
3279 vidually allocated as consumable resources. Note that
3280 whole nodes can be allocated to jobs for selected parti‐
3281 tions by using the OverSubscribe=Exclusive option. See
3282 the partition OverSubscribe parameter for more informa‐
3283 tion.
3284
3285 select/cons_tres
3286 The resources (cores, memory, GPUs and all other track‐
3287 able resources) within a node are individually allocated
3288 as consumable resources. Note that whole nodes can be
3289 allocated to jobs for selected partitions by using the
3290 OverSubscribe=Exclusive option. See the partition Over‐
3291 Subscribe parameter for more information.
3292
3293 select/cray_aries
3294 for a Cray system. The default value is "se‐
3295 lect/cray_aries" for all Cray systems.
3296
3297 select/linear
3298 for allocation of entire nodes assuming a one-dimensional
3299 array of nodes in which sequentially ordered nodes are
3300 preferable. For a heterogeneous cluster (e.g. different
3301 CPU counts on the various nodes), resource allocations
3302 will favor nodes with high CPU counts as needed based
3303 upon the job's node and CPU specification if TopologyPlu‐
3304 gin=topology/none is configured. Use of other topology
3305 plugins with select/linear and heterogeneous nodes is not
3306 recommended and may result in valid job allocation re‐
3307 quests being rejected. This is the default value.
3308
3309
3310 SelectTypeParameters
3311 The permitted values of SelectTypeParameters depend upon the
3312 configured value of SelectType. The only supported options for
3313 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3314 which treats memory as a consumable resource and prevents memory
3315 over subscription with job preemption or gang scheduling. By
3316 default SelectType=select/linear allocates whole nodes to jobs
3317 without considering their memory consumption. By default Se‐
3318 lectType=select/cons_res, SelectType=select/cray_aries, and Se‐
3319 lectType=select/cons_tres, use CR_CPU, which allocates CPU
3320 (threads) to jobs without considering their memory consumption.
3321
3322 The following options are supported for SelectType=se‐
3323 lect/cray_aries:
3324
               OTHER_CONS_RES
                      Layer the select/cons_res plugin under the
                      select/cray_aries plugin; the default is to layer
                      on select/linear. This also allows all the
                      options available for SelectType=select/cons_res.
3330
               OTHER_CONS_TRES
                      Layer the select/cons_tres plugin under the
                      select/cray_aries plugin; the default is to layer
                      on select/linear. This also allows all the
                      options available for SelectType=select/cons_tres.
3336
3337 The following options are supported by the SelectType=se‐
3338 lect/cons_res and SelectType=select/cons_tres plugins:
3339
3340 CR_CPU CPUs are consumable resources. Configure the num‐
3341 ber of CPUs on each node, which may be equal to
3342 the count of cores or hyper-threads on the node
3343 depending upon the desired minimum resource allo‐
3344 cation. The node's Boards, Sockets, CoresPer‐
3345 Socket and ThreadsPerCore may optionally be con‐
3346 figured and result in job allocations which have
3347 improved locality; however doing so will prevent
3348 more than one job from being allocated on each
3349 core.
3350
3351 CR_CPU_Memory
3352 CPUs and memory are consumable resources. Config‐
3353 ure the number of CPUs on each node, which may be
3354 equal to the count of cores or hyper-threads on
3355 the node depending upon the desired minimum re‐
3356 source allocation. The node's Boards, Sockets,
3357 CoresPerSocket and ThreadsPerCore may optionally
3358 be configured and result in job allocations which
3359 have improved locality; however doing so will pre‐
3360 vent more than one job from being allocated on
3361 each core. Setting a value for DefMemPerCPU is
3362 strongly recommended.
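
                      A hypothetical configuration using CR_CPU_Memory
                      (node names, CPU and memory counts are invented
                      for illustration):

                           SelectType=select/cons_res
                           SelectTypeParameters=CR_CPU_Memory
                           DefMemPerCPU=2048
                           NodeName=node[01-16] CPUs=32 RealMemory=64000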
3363
3364 CR_Core
3365 Cores are consumable resources. On nodes with hy‐
3366 per-threads, each thread is counted as a CPU to
3367 satisfy a job's resource requirement, but multiple
3368 jobs are not allocated threads on the same core.
3369 The count of CPUs allocated to a job is rounded up
3370 to account for every CPU on an allocated core.
                      This will also cause the total allocated memory,
                      when --mem-per-cpu is used, to be a multiple of
                      the total number of CPUs on allocated cores.
3374
3375 CR_Core_Memory
3376 Cores and memory are consumable resources. On
3377 nodes with hyper-threads, each thread is counted
3378 as a CPU to satisfy a job's resource requirement,
3379 but multiple jobs are not allocated threads on the
3380 same core. The count of CPUs allocated to a job
3381 may be rounded up to account for every CPU on an
3382 allocated core. Setting a value for DefMemPerCPU
3383 is strongly recommended.
3384
3385 CR_ONE_TASK_PER_CORE
3386 Allocate one task per core by default. Without
3387 this option, by default one task will be allocated
3388 per thread on nodes with more than one ThreadsPer‐
3389 Core configured. NOTE: This option cannot be used
3390 with CR_CPU*.
3391
3392 CR_CORE_DEFAULT_DIST_BLOCK
3393 Allocate cores within a node using block distribu‐
3394 tion by default. This is a pseudo-best-fit algo‐
3395 rithm that minimizes the number of boards and min‐
3396 imizes the number of sockets (within minimum
3397 boards) used for the allocation. This default be‐
3398 havior can be overridden specifying a particular
3399 "-m" parameter with srun/salloc/sbatch. Without
                      this option, cores will be allocated cyclically
3401 across the sockets.
3402
3403 CR_LLN Schedule resources to jobs on the least loaded
3404 nodes (based upon the number of idle CPUs). This
3405 is generally only recommended for an environment
3406 with serial jobs as idle resources will tend to be
3407 highly fragmented, resulting in parallel jobs be‐
3408 ing distributed across many nodes. Note that node
                      Weight takes precedence over how many idle
                      resources are on each node. Also see the
                      partition configuration parameter LLN to use the
                      least loaded nodes in selected partitions.
3413
3414 CR_Pack_Nodes
3415 If a job allocation contains more resources than
3416 will be used for launching tasks (e.g. if whole
3417 nodes are allocated to a job), then rather than
3418 distributing a job's tasks evenly across its allo‐
3419 cated nodes, pack them as tightly as possible on
3420 these nodes. For example, consider a job alloca‐
3421 tion containing two entire nodes with eight CPUs
3422 each. If the job starts ten tasks across those
3423 two nodes without this option, it will start five
3424 tasks on each of the two nodes. With this option,
3425 eight tasks will be started on the first node and
3426 two tasks on the second node. This can be super‐
3427 seded by "NoPack" in srun's "--distribution" op‐
3428 tion. CR_Pack_Nodes only applies when the "block"
3429 task distribution method is used.
3430
3431 CR_Socket
3432 Sockets are consumable resources. On nodes with
3433 multiple cores, each core or thread is counted as
3434 a CPU to satisfy a job's resource requirement, but
3435 multiple jobs are not allocated resources on the
3436 same socket.
3437
3438 CR_Socket_Memory
3439 Memory and sockets are consumable resources. On
3440 nodes with multiple cores, each core or thread is
3441 counted as a CPU to satisfy a job's resource re‐
3442 quirement, but multiple jobs are not allocated re‐
3443 sources on the same socket. Setting a value for
3444 DefMemPerCPU is strongly recommended.
3445
3446 CR_Memory
3447 Memory is a consumable resource. NOTE: This im‐
3448 plies OverSubscribe=YES or OverSubscribe=FORCE for
3449 all partitions. Setting a value for DefMemPerCPU
3450 is strongly recommended.
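
                      As an illustration, memory-aware whole-node
                      allocation with select/linear might be configured
                      as (the DefMemPerCPU value is an example only):

                           SelectType=select/linear
                           SelectTypeParameters=CR_Memory
                           DefMemPerCPU=2048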
3451
3452
3453 SlurmctldAddr
3454 An optional address to be used for communications to the cur‐
3455 rently active slurmctld daemon, normally used with Virtual IP
3456 addressing of the currently active server. If this parameter is
3457 not specified then each primary and backup server will have its
3458 own unique address used for communications as specified in the
3459 SlurmctldHost parameter. If this parameter is specified then
3460 the SlurmctldHost parameter will still be used for communica‐
3461 tions to specific slurmctld primary or backup servers, for exam‐
3462 ple to cause all of them to read the current configuration files
              or shutdown. Also see the SlurmctldPrimaryOffProg and
              SlurmctldPrimaryOnProg configuration parameters to
              configure programs that manage the virtual IP address.
3466
3467
       SlurmctldDebug
              The level of detail to provide in the slurmctld daemon's
              logs. The default value is info. If the slurmctld daemon
              is initiated with -v or --verbose options, that debug
              level will be preserved or restored upon reconfiguration.
3473
3474
3475 quiet Log nothing
3476
3477 fatal Log only fatal errors
3478
3479 error Log only errors
3480
3481 info Log errors and general informational messages
3482
3483 verbose Log errors and verbose informational messages
3484
3485 debug Log errors and verbose informational messages and de‐
3486 bugging messages
3487
3488 debug2 Log errors and verbose informational messages and more
3489 debugging messages
3490
3491 debug3 Log errors and verbose informational messages and even
3492 more debugging messages
3493
3494 debug4 Log errors and verbose informational messages and even
3495 more debugging messages
3496
3497 debug5 Log errors and verbose informational messages and even
3498 more debugging messages
3499
3500
3501 SlurmctldHost
3502 The short, or long, hostname of the machine where Slurm control
3503 daemon is executed (i.e. the name returned by the command "host‐
3504 name -s"). This hostname is optionally followed by the address,
              either the IP address or a name by which the address can
              be identified, enclosed in parentheses (e.g.
              SlurmctldHost=slurmctl-primary(12.34.56.78)). This value
              must be specified at least once. If specified more than
              once, the first hostname named will be where the daemon
              runs. If the first specified host fails, the daemon will
              execute on the second host. If both the first and second
              specified hosts fail, the daemon will execute on the
              third host.
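
              For example, a primary controller with two backups, using
              the parenthesized address form shown above (host names
              and addresses are placeholders):

                   SlurmctldHost=slurmctl-primary(12.34.56.78)
                   SlurmctldHost=slurmctl-backup1(12.34.56.79)
                   SlurmctldHost=slurmctl-backup2(12.34.56.80)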
3513
3514
3515 SlurmctldLogFile
3516 Fully qualified pathname of a file into which the slurmctld dae‐
3517 mon's logs are written. The default value is none (performs
3518 logging via syslog).
3519 See the section LOGGING if a pathname is specified.
3520
3521
3522 SlurmctldParameters
3523 Multiple options may be comma separated.
3524
3525
3526 allow_user_triggers
3527 Permit setting triggers from non-root/slurm_user users.
3528 SlurmUser must also be set to root to permit these trig‐
3529 gers to work. See the strigger man page for additional
3530 details.
3531
3532 cloud_dns
3533 By default, Slurm expects that the network address for a
3534 cloud node won't be known until the creation of the node
3535 and that Slurm will be notified of the node's address
3536 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3537 Since Slurm communications rely on the node configuration
                  found in the slurm.conf, Slurm will tell the client
                  command, after waiting for all nodes to boot, each
                  node's IP address. However, in environments where the
                  nodes are in DNS, this step can be avoided by
                  configuring this option.
3542
3543 cloud_reg_addrs
3544 When a cloud node registers, the node's NodeAddr and
3545 NodeHostName will automatically be set. They will be re‐
3546 set back to the nodename after powering off.
3547
3548 enable_configless
3549 Permit "configless" operation by the slurmd, slurmstepd,
3550 and user commands. When enabled the slurmd will be per‐
3551 mitted to retrieve config files from the slurmctld, and
3552 on any 'scontrol reconfigure' command new configs will be
3553 automatically pushed out and applied to nodes that are
3554 running in this "configless" mode. NOTE: a restart of
3555 the slurmctld is required for this to take effect.
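
                  A minimal sketch of enabling configless operation
                  (restart the slurmctld afterwards, as noted above):

                       SlurmctldParameters=enable_configless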
3556
3557 idle_on_node_suspend
3558 Mark nodes as idle, regardless of current state, when
3559 suspending nodes with SuspendProgram so that nodes will
3560 be eligible to be resumed at a later time.
3561
3562 power_save_interval
3563 How often the power_save thread looks to resume and sus‐
3564 pend nodes. The power_save thread will do work sooner if
3565 there are node state changes. Default is 10 seconds.
3566
           power_save_min_interval
                  How often the power_save thread, at a minimum, looks
                  to resume and suspend nodes. Default is 0.
3570
3571 max_dbd_msg_action
3572 Action used once MaxDBDMsgs is reached, options are 'dis‐
3573 card' (default) and 'exit'.
3574
                  When 'discard' is specified and MaxDBDMsgs is
                  reached, pending messages of types Step start and
                  complete are purged first; when MaxDBDMsgs is reached
                  again, Job start messages are purged as well. Job
                  completes and node state changes continue to consume
                  the space freed by these purges until MaxDBDMsgs is
                  reached once more, at which point no new messages are
                  tracked, creating data loss and potentially runaway
                  jobs.
3583
3584 When 'exit' is specified and MaxDBDMsgs is reached the
3585 slurmctld will exit instead of discarding any messages.
3586 It will be impossible to start the slurmctld with this
3587 option where the slurmdbd is down and the slurmctld is
3588 tracking more than MaxDBDMsgs.
3589
3590
3591 preempt_send_user_signal
3592 Send the user signal (e.g. --signal=<sig_num>) at preemp‐
3593 tion time even if the signal time hasn't been reached. In
3594 the case of a gracetime preemption the user signal will
3595 be sent if the user signal has been specified and not
3596 sent, otherwise a SIGTERM will be sent to the tasks.
3597
3598 reboot_from_controller
3599 Run the RebootProgram from the controller instead of on
3600 the slurmds. The RebootProgram will be passed a comma-
3601 separated list of nodes to reboot.
3602
3603 user_resv_delete
3604 Allow any user able to run in a reservation to delete it.
3605
3606
3607 SlurmctldPidFile
3608 Fully qualified pathname of a file into which the slurmctld
3609 daemon may write its process id. This may be used for automated
3610 signal processing. The default value is "/var/run/slurm‐
3611 ctld.pid".
3612
3613
3614 SlurmctldPlugstack
3615 A comma delimited list of Slurm controller plugins to be started
3616 when the daemon begins and terminated when it ends. Only the
3617 plugin's init and fini functions are called.
3618
3619
3620 SlurmctldPort
3621 The port number that the Slurm controller, slurmctld, listens to
3622 for work. The default value is SLURMCTLD_PORT as established at
3623 system build time. If none is explicitly specified, it will be
3624 set to 6817. SlurmctldPort may also be configured to support a
3625 range of port numbers in order to accept larger bursts of incom‐
3626 ing messages by specifying two numbers separated by a dash (e.g.
              SlurmctldPort=6817-6818). NOTE: Either the slurmctld and
              slurmd daemons must not execute on the same nodes, or the
              values of SlurmctldPort and SlurmdPort must be different.
3630
3631 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3632 automatically try to interact with anything opened on ports
3633 8192-60000. Configure SlurmctldPort to use a port outside of
3634 the configured SrunPortRange and RSIP's port range.
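
              For example, a two-port listening range for the
              controller (port numbers are illustrative), with a
              distinct SlurmdPort in case both daemons share a node:

                   SlurmctldPort=6817-6818
                   SlurmdPort=6819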
3635
3636
3637 SlurmctldPrimaryOffProg
3638 This program is executed when a slurmctld daemon running as the
3639 primary server becomes a backup server. By default no program is
3640 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3641 ter.
3642
3643
3644 SlurmctldPrimaryOnProg
3645 This program is executed when a slurmctld daemon running as a
3646 backup server becomes the primary server. By default no program
              is executed. When using virtual IP addresses to manage
              Highly Available Slurm services, this program can be used
              to add the IP
3649 address to an interface (and optionally try to kill the unre‐
3650 sponsive slurmctld daemon and flush the ARP caches on nodes on
3651 the local ethernet fabric). See also the related "SlurmctldPri‐
3652 maryOffProg" parameter.
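
              A sketch of pairing these options for virtual IP failover
              (the script paths are hypothetical):

                   SlurmctldPrimaryOnProg=/usr/local/sbin/slurm_vip_add.sh
                   SlurmctldPrimaryOffProg=/usr/local/sbin/slurm_vip_remove.sh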
3653
3654 SlurmctldSyslogDebug
3655 The slurmctld daemon will log events to the syslog file at the
              specified level of detail. If not set, the slurmctld
              daemon will log to syslog at level fatal, unless there is
              no SlurmctldLogFile and it is running in the background,
              in which case it will log to syslog at the level
              specified by SlurmctldDebug (at fatal if SlurmctldDebug
              is set to quiet). If it is run in the foreground, the
              syslog level will be set to quiet.
3662
3663
3664 quiet Log nothing
3665
3666 fatal Log only fatal errors
3667
3668 error Log only errors
3669
3670 info Log errors and general informational messages
3671
3672 verbose Log errors and verbose informational messages
3673
3674 debug Log errors and verbose informational messages and de‐
3675 bugging messages
3676
3677 debug2 Log errors and verbose informational messages and more
3678 debugging messages
3679
3680 debug3 Log errors and verbose informational messages and even
3681 more debugging messages
3682
3683 debug4 Log errors and verbose informational messages and even
3684 more debugging messages
3685
3686 debug5 Log errors and verbose informational messages and even
3687 more debugging messages
3688
3689
3690
3691 SlurmctldTimeout
3692 The interval, in seconds, that the backup controller waits for
3693 the primary controller to respond before assuming control. The
3694 default value is 120 seconds. May not exceed 65533.
3695
3696
3697 SlurmdDebug
3698 The level of detail to provide in the slurmd daemon's logs.
3699 The default value is info.
3700
3701 quiet Log nothing
3702
3703 fatal Log only fatal errors
3704
3705 error Log only errors
3706
3707 info Log errors and general informational messages
3708
3709 verbose Log errors and verbose informational messages
3710
3711 debug Log errors and verbose informational messages and de‐
3712 bugging messages
3713
3714 debug2 Log errors and verbose informational messages and more
3715 debugging messages
3716
3717 debug3 Log errors and verbose informational messages and even
3718 more debugging messages
3719
3720 debug4 Log errors and verbose informational messages and even
3721 more debugging messages
3722
3723 debug5 Log errors and verbose informational messages and even
3724 more debugging messages
3725
3726
3727 SlurmdLogFile
3728 Fully qualified pathname of a file into which the slurmd dae‐
3729 mon's logs are written. The default value is none (performs
3730 logging via syslog). Any "%h" within the name is replaced with
3731 the hostname on which the slurmd is running. Any "%n" within
3732 the name is replaced with the Slurm node name on which the
3733 slurmd is running.
3734 See the section LOGGING if a pathname is specified.
3735
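For example, a hypothetical per-node log path using the "%n" substitution described above (the directory is illustrative):

```
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
```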
3736
3737 SlurmdParameters
3738 Parameters specific to the Slurmd. Multiple options may be
3739 comma separated.
3740
3741 config_overrides
3742 If set, consider the configuration of each node to be
3743 that specified in the slurm.conf configuration file and
3744 any node with less than the configured resources will not
3745 be set DRAIN. This option is generally only useful for
3746 testing purposes. Equivalent to the now deprecated
3747 FastSchedule=2 option.
3748
3749 shutdown_on_reboot
3750 If set, the Slurmd will shut itself down when a reboot
3751 request is received.
3752
3753
3754 SlurmdPidFile
3755 Fully qualified pathname of a file into which the slurmd daemon
3756 may write its process id. This may be used for automated signal
3757 processing. Any "%h" within the name is replaced with the host‐
3758 name on which the slurmd is running. Any "%n" within the name
3759 is replaced with the Slurm node name on which the slurmd is run‐
3760 ning. The default value is "/var/run/slurmd.pid".
3761
3762
3763 SlurmdPort
3764 The port number that the Slurm compute node daemon, slurmd, lis‐
3765 tens to for work. The default value is SLURMD_PORT as estab‐
3766 lished at system build time. If none is explicitly specified,
3767 its value will be 6818. NOTE: Either the slurmctld and slurmd
3768 daemons must not execute on the same nodes, or the values of
3769 SlurmctldPort and SlurmdPort must be different.
3770
3771 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3772 automatically try to interact with anything opened on ports
3773 8192-60000. Configure SlurmdPort to use a port outside of the
3774 configured SrunPortRange and RSIP's port range.
3775
3776
3777 SlurmdSpoolDir
3778 Fully qualified pathname of a directory into which the slurmd
3779 daemon's state information and batch job script information are
3780 written. This must be a common pathname for all nodes, but
3781 should represent a directory which is local to each node (refer‐
3782 ence a local file system). The default value is
3783 "/var/spool/slurmd". Any "%h" within the name is replaced with
3784 the hostname on which the slurmd is running. Any "%n" within
3785 the name is replaced with the Slurm node name on which the
3786 slurmd is running.
3787
3788
3789 SlurmdSyslogDebug
3790 The slurmd daemon will log events to syslog at the specified
3791 level of detail. If not set, the slurmd daemon will log to
3792 syslog at level fatal. However, if there is no SlurmdLogFile
3793 and the daemon is running in the background, it will instead
3794 log to syslog at the level specified by SlurmdDebug (at fatal
3795 if SlurmdDebug is set to quiet); if run in the foreground
3796 under those conditions, the syslog level will be quiet.
3797
3798
3799 quiet Log nothing
3800
3801 fatal Log only fatal errors
3802
3803 error Log only errors
3804
3805 info Log errors and general informational messages
3806
3807 verbose Log errors and verbose informational messages
3808
3809 debug Log errors and verbose informational messages and de‐
3810 bugging messages
3811
3812 debug2 Log errors and verbose informational messages and more
3813 debugging messages
3814
3815 debug3 Log errors and verbose informational messages and even
3816 more debugging messages
3817
3818 debug4 Log errors and verbose informational messages and even
3819 more debugging messages
3820
3821 debug5 Log errors and verbose informational messages and even
3822 more debugging messages
3823
3824
3825 SlurmdTimeout
3826 The interval, in seconds, that the Slurm controller waits for
3827 slurmd to respond before configuring that node's state to DOWN.
3828 A value of zero indicates the node will not be tested by slurm‐
3829 ctld to confirm the state of slurmd, the node will not be auto‐
3830 matically set to a DOWN state indicating a non-responsive
3831 slurmd, and some other tool will take responsibility for moni‐
3832 toring the state of each compute node and its slurmd daemon.
3833 Slurm's hierarchical communication mechanism is used to ping the
3834 slurmd daemons in order to minimize system noise and overhead.
3835 The default value is 300 seconds. The value may not exceed
3836 65533 seconds.
3837
3838
3839 SlurmdUser
3840 The name of the user that the slurmd daemon executes as. This
3841 user must exist on all nodes of the cluster for authentication
3842 of communications between Slurm components. The default value
3843 is "root".
3844
3845
3846 SlurmSchedLogFile
3847 Fully qualified pathname of the scheduling event logging file.
3848 The syntax of this parameter is the same as for SlurmctldLog‐
3849 File. In order to configure scheduler logging, set both the
3850 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3851
3852
3853 SlurmSchedLogLevel
3854 The initial level of scheduling event logging, similar to the
3855 SlurmctldDebug parameter used to control the initial level of
3856 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3857 (scheduler logging disabled) and "1" (scheduler logging en‐
3858 abled). If this parameter is omitted, the value defaults to "0"
3859 (disabled). In order to configure scheduler logging, set both
3860 the SlurmSchedLogFile and SlurmSchedLogLevel parameters. The
3861 scheduler logging level can be changed dynamically using scon‐
3862 trol.
3863
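For example, pairing the two scheduler-logging parameters described above (the log path is illustrative):

```
SlurmSchedLogFile=/var/log/slurm/sched.log
SlurmSchedLogLevel=1
```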
3864
3865 SlurmUser
3866 The name of the user that the slurmctld daemon executes as. For
3867 security purposes, a user other than "root" is recommended.
3868 This user must exist on all nodes of the cluster for authentica‐
3869 tion of communications between Slurm components. The default
3870 value is "root".
3871
3872
3873 SrunEpilog
3874 Fully qualified pathname of an executable to be run by srun fol‐
3875 lowing the completion of a job step. The command line arguments
3876 for the executable will be the command and arguments of the job
3877 step. This configuration parameter may be overridden by srun's
3878 --epilog parameter. Note that while the other "Epilog" executa‐
3879 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
3880 where the tasks are executed, the SrunEpilog runs on the node
3881 where the "srun" is executing.
3882
3883
3884 SrunPortRange
3885 srun creates a set of listening ports to communicate with the
3886 controller and the slurmstepd, and to handle the application
3887 I/O. By default these ports are ephemeral, meaning the port
3888 numbers are selected by the kernel. This parameter allows
3889 sites to configure a range of ports from which srun ports will
3890 be selected. This is useful if sites want to allow only a
3891 certain port range on their network.
3892
3893 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3894 automatically try to interact with anything opened on ports
3895 8192-60000. Configure SrunPortRange to use a range of ports
3896 above those used by RSIP, ideally 1000 or more ports, for exam‐
3897 ple "SrunPortRange=60001-63000".
3898
3899 Note: A sufficient number of ports must be configured based on
3900 the estimated number of concurrent srun commands on the submis‐
3901 sion nodes, considering that srun opens 3 listening ports plus
3902 2 more for every 48 hosts. Example:
3903
3904 srun -N 48 will use 5 listening ports.
3905
3906
3907 srun -N 50 will use 7 listening ports.
3908
3909
3910 srun -N 200 will use 13 listening ports.
3911
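The port-count rule above can be expressed as a small helper. This is an illustrative sketch of the documented formula (3 base ports plus 2 more for every started group of 48 hosts), not a Slurm tool:

```python
import math

def srun_listening_ports(num_hosts: int) -> int:
    """Estimate the listening ports one srun needs: 3 base ports
    plus 2 more for every (started) group of 48 hosts."""
    return 3 + 2 * math.ceil(num_hosts / 48)

# srun -N 48 -> 5 ports, srun -N 50 -> 7 ports, srun -N 200 -> 13 ports
```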
3912
3913 SrunProlog
3914 Fully qualified pathname of an executable to be run by srun
3915 prior to the launch of a job step. The command line arguments
3916 for the executable will be the command and arguments of the job
3917 step. This configuration parameter may be overridden by srun's
3918 --prolog parameter. Note that while the other "Prolog" executa‐
3919 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
3920 where the tasks are executed, the SrunProlog runs on the node
3921 where the "srun" is executing.
3922
3923
3924 StateSaveLocation
3925 Fully qualified pathname of a directory into which the Slurm
3926 controller, slurmctld, saves its state (e.g. "/usr/lo‐
3927 cal/slurm/checkpoint"). Slurm state will be saved here to recover
3928 from system failures. SlurmUser must be able to create files in
3929 this directory. If you have a secondary SlurmctldHost config‐
3930 ured, this location should be readable and writable by both sys‐
3931 tems. Since all running and pending job information is stored
3932 here, the use of a reliable file system (e.g. RAID) is recom‐
3933 mended. The default value is "/var/spool". If any slurm dae‐
3934 mons terminate abnormally, their core files will also be written
3935 into this directory.
3936
3937
3938 SuspendExcNodes
3939 Specifies the nodes which are not to be placed in power save
3940 mode, even if the node remains idle for an extended period of
3941 time. Use Slurm's hostlist expression to identify nodes with an
3942 optional ":" separator and count of nodes to exclude from the
3943 preceding range. For example "nid[10-20]:4" will prevent 4 us‐
3944 able nodes (i.e. IDLE and not DOWN, DRAINING or already powered
3945 down) in the set "nid[10-20]" from being powered down. Multiple
3946 sets of nodes can be specified with or without counts in a comma
3947 separated list (e.g. "nid[10-20]:4,nid[80-90]:2"). If a node
3948 count specification is given, any list of nodes to NOT have a
3949 node count must be after the last specification with a count.
3950 For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes from
3951 the set "nid[10-20]" plus all nodes in the set "nid[60-70]"
3952 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
3953 "nid[1-3],nid[10-20]". By default no nodes are excluded. Re‐
3954 lated configuration options include ResumeTimeout, ResumePro‐
3955 gram, ResumeRate, SuspendProgram, SuspendRate, SuspendTime, Sus‐
3956 pendTimeout, and SuspendExcParts.
3957
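As a hypothetical slurm.conf fragment illustrating the exclusion syntax described above (node and partition names are made up):

```
SuspendExcNodes=nid[10-20]:4,nid[60-70]
SuspendExcParts=debug,interactive
```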
3958
3959 SuspendExcParts
3960 Specifies the partitions whose nodes are not to be placed in
3961 power save mode, even if the node remains idle for an extended
3962 period of time. Multiple partitions can be identified and sepa‐
3963 rated by commas. By default no nodes are excluded. Related
3964 configuration options include ResumeTimeout, ResumeProgram, Re‐
3965 sumeRate, SuspendProgram, SuspendRate, SuspendTime, Suspend‐
3966 Timeout, and SuspendExcNodes.
3967
3968
3969 SuspendProgram
3970 SuspendProgram is the program that will be executed when a node
3971 remains idle for an extended period of time. This program is
3972 expected to place the node into some power save mode. This can
3973 be used to reduce the frequency and voltage of a node or com‐
3974 pletely power the node off. The program executes as SlurmUser.
3975 The argument to the program will be the names of nodes to be
3976 placed into power savings mode (using Slurm's hostlist expres‐
3977 sion format). By default, no program is run. Related configu‐
3978 ration options include ResumeTimeout, ResumeProgram, ResumeRate,
3979 SuspendRate, SuspendTime, SuspendTimeout, SuspendExcNodes, and
3980 SuspendExcParts.
3981
3982
3983 SuspendRate
3984 The rate at which nodes are placed into power save mode by Sus‐
3985 pendProgram. The value is the number of nodes per minute and it can
3986 be used to prevent a large drop in power consumption (e.g. after
3987 a large job completes). A value of zero results in no limits
3988 being imposed. The default value is 60 nodes per minute. Re‐
3989 lated configuration options include ResumeTimeout, ResumePro‐
3990 gram, ResumeRate, SuspendProgram, SuspendTime, SuspendTimeout,
3991 SuspendExcNodes, and SuspendExcParts.
3992
3993
3994 SuspendTime
3995 Nodes which remain idle or down for this number of seconds will
3996 be placed into power save mode by SuspendProgram. For efficient
3997 system utilization, it is recommended that the value of Suspend‐
3998 Time be at least as large as the sum of SuspendTimeout plus Re‐
3999 sumeTimeout. A value of -1 disables power save mode and is the
4000 default. Related configuration options include ResumeTimeout,
4001 ResumeProgram, ResumeRate, SuspendProgram, SuspendRate, Suspend‐
4002 Timeout, SuspendExcNodes, and SuspendExcParts.
4003
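Tying the power-save parameters together, a hypothetical configuration consistent with the guidance above (SuspendTime at least SuspendTimeout plus ResumeTimeout; the script paths are illustrative):

```
SuspendProgram=/usr/local/slurm/suspend_nodes.sh   # illustrative path
ResumeProgram=/usr/local/slurm/resume_nodes.sh     # illustrative path
SuspendTime=900          # >= SuspendTimeout + ResumeTimeout
SuspendTimeout=120
ResumeTimeout=600
SuspendRate=60
ResumeRate=300
SuspendExcNodes=login[1-2]
```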
4004
4005 SuspendTimeout
4006 Maximum time permitted (in seconds) between when a node suspend
4007 request is issued and when the node is shut down. At that time
4008 the node must be ready for a resume request to be issued as
4009 needed for new work. The default value is 30 seconds. Related
4010 configuration options include ResumeProgram, ResumeRate, Resume‐
4011 Timeout, SuspendRate, SuspendTime, SuspendProgram, SuspendExcN‐
4012 odes and SuspendExcParts. More information is available at the
4013 Slurm web site ( https://slurm.schedmd.com/power_save.html ).
4014
4015
4016 SwitchType
4017 Identifies the type of switch or interconnect used for applica‐
4018 tion communications. Acceptable values include
4019 "switch/cray_aries" for Cray systems, "switch/none" for switches
4020 not requiring special processing for job launch or termination
4021 (Ethernet and InfiniBand). The default value is
4022 "switch/none". All Slurm daemons, commands and running jobs
4023 must be restarted for a change in SwitchType to take effect. If
4024 running jobs exist at the time slurmctld is restarted with a new
4025 value of SwitchType, records of all jobs in any state may be
4026 lost.
4027
4028
4029 TaskEpilog
4030 Fully qualified pathname of a program to be executed as the Slurm
4031 job's owner after termination of each task. See TaskProlog for
4032 execution order details.
4033
4034
4035 TaskPlugin
4036 Identifies the type of task launch plugin, typically used to
4037 provide resource management within a node (e.g. pinning tasks to
4038 specific processors). More than one task plugin can be specified
4039 in a comma-separated list. The prefix of "task/" is optional.
4040 Acceptable values include:
4041
4042 task/affinity enables resource containment using
4043 sched_setaffinity(). This enables the --cpu-bind
4044 and/or --mem-bind srun options.
4045
4046 task/cgroup enables resource containment using Linux control
4047 cgroups. This enables the --cpu-bind and/or
4048 --mem-bind srun options. NOTE: see "man
4049 cgroup.conf" for configuration details.
4050
4051 task/none for systems requiring no special handling of user
4052 tasks. Lacks support for the --cpu-bind and/or
4053 --mem-bind srun options. The default value is
4054 "task/none".
4055
4056 NOTE: It is recommended to stack task/affinity,task/cgroup to‐
4057 gether when configuring TaskPlugin, and setting TaskAffinity=no
4058 and ConstrainCores=yes in cgroup.conf. This setup uses the
4059 task/affinity plugin for setting the affinity of the tasks
4060 (which is better and different than task/cgroup) and uses the
4061 task/cgroup plugin to fence tasks into the specified resources,
4062 thus combining the best of both pieces.
4063
4064 NOTE: For CRAY systems only: task/cgroup must be used with, and
4065 listed after task/cray_aries in TaskPlugin. The task/affinity
4066 plugin can be listed anywhere, but the previous constraint must
4067 be satisfied. For CRAY systems, a configuration like this is
4068 recommended:
4069 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
4070
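For a non-Cray cluster, the recommended stacking described above could be sketched as follows (cgroup.conf lines shown for context; adjust to your site):

```
# slurm.conf
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf
TaskAffinity=no
ConstrainCores=yes
```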
4071
4072 TaskPluginParam
4073 Optional parameters for the task plugin. Multiple options
4074 should be comma separated. If None, Boards, Sockets, Cores,
4075 Threads, and/or Verbose are specified, they will override the
4076 --cpu-bind option specified by the user in the srun command.
4077 None, Boards, Sockets, Cores and Threads are mutually exclusive
4078 and, since they decrease scheduling flexibility, are not gener‐
4079 ally recommended (select no more than one of them).
4080
4081
4082 Boards Bind tasks to boards by default. Overrides automatic
4083 binding.
4084
4085 Cores Bind tasks to cores by default. Overrides automatic
4086 binding.
4087
4088 None Perform no task binding by default. Overrides auto‐
4089 matic binding.
4090
4091 Sockets Bind to sockets by default. Overrides automatic bind‐
4092 ing.
4093
4094 Threads Bind to threads by default. Overrides automatic bind‐
4095 ing.
4096
4097 SlurmdOffSpec
4098 If specialized cores or CPUs are identified for the
4099 node (i.e. the CoreSpecCount or CpuSpecList are con‐
4100 figured for the node), then Slurm daemons running on
4101 the compute node (i.e. slurmd and slurmstepd) should
4102 run outside of those resources (i.e. specialized re‐
4103 sources are completely unavailable to Slurm daemons
4104 and jobs spawned by Slurm). This option may not be
4105 used with the task/cray_aries plugin.
4106
4107 Verbose Verbosely report binding before tasks run. Overrides
4108 user options.
4109
4110 Autobind Set a default binding in the event that "auto binding"
4111 doesn't find a match. Set to Threads, Cores or Sock‐
4112 ets (E.g. TaskPluginParam=autobind=threads).
4113
4114
4115 TaskProlog
4116 Fully qualified pathname of a program to be executed as the Slurm
4117 job's owner prior to initiation of each task. Besides the nor‐
4118 mal environment variables, this has SLURM_TASK_PID available to
4119 identify the process ID of the task being started. Standard
4120 output from this program can be used to control the environment
4121 variables and output for the user program.
4122
4123 export NAME=value Will set environment variables for the task
4124 being spawned. Everything after the equal
4125 sign to the end of the line will be used as
4126 the value for the environment variable. Ex‐
4127 porting of functions is not currently sup‐
4128 ported.
4129
4130 print ... Will cause that line (without the leading
4131 "print ") to be printed to the job's stan‐
4132 dard output.
4133
4134 unset NAME Will clear environment variables for the
4135 task being spawned.
4136
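A minimal TaskProlog script using the directives above might look like this (the variable names and message are illustrative, not required by Slurm):

```shell
#!/bin/sh
# A TaskProlog writes directives to standard output; slurmd applies
# them to the task about to start.
emit_directives() {
    echo "export EXAMPLE_SCRATCH=/tmp/scratch"            # set a variable for the task
    echo "print task prolog ran for task $SLURM_TASK_PID" # line sent to the job's stdout
    echo "unset EXAMPLE_STALE_VAR"                        # clear a variable
}
emit_directives
```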
4137 The order of task prolog/epilog execution is as follows:
4138
4139 1. pre_launch_priv()
4140 Function in TaskPlugin
4141
4142 2. pre_launch() Function in TaskPlugin
4143
4144 3. TaskProlog System-wide per task program defined in
4145 slurm.conf
4146
4147 4. User prolog Job-step-specific task program defined using
4148 srun's --task-prolog option or
4149 SLURM_TASK_PROLOG environment variable
4150
4151 5. Task Execute the job step's task
4152
4153 6. User epilog Job-step-specific task program defined using
4154 srun's --task-epilog option or
4155 SLURM_TASK_EPILOG environment variable
4156
4157 7. TaskEpilog System-wide per task program defined in
4158 slurm.conf
4159
4160 8. post_term() Function in TaskPlugin
4161
4162
4163 TCPTimeout
4164 Time permitted for TCP connection to be established. Default
4165 value is 2 seconds.
4166
4167
4168 TmpFS Fully qualified pathname of the file system available to user
4169 jobs for temporary storage. This parameter is used in establish‐
4170 ing a node's TmpDisk space. The default value is "/tmp".
4171
4172
4173 TopologyParam
4174 Comma-separated options identifying network topology options.
4175
4176 Dragonfly Optimize allocation for Dragonfly network. Valid
4177 when TopologyPlugin=topology/tree.
4178
4179 TopoOptional Only optimize allocation for network topology if
4180 the job includes a switch option. Since optimiz‐
4181 ing resource allocation for topology involves
4182 much higher system overhead, this option can be
4183 used to impose the extra overhead only on jobs
4184 which can take advantage of it. If most job allo‐
4185 cations are not optimized for network topology,
4186 they may fragment resources to the point that
4187 topology optimization for other jobs will be dif‐
4188 ficult to achieve. NOTE: Jobs may span across
4189 nodes without common parent switches with this
4190 enabled.
4191
4192
4193 TopologyPlugin
4194 Identifies the plugin to be used for determining the network
4195 topology and optimizing job allocations to minimize network con‐
4196 tention. See NETWORK TOPOLOGY below for details. Additional
4197 plugins may be provided in the future which gather topology in‐
4198 formation directly from the network. Acceptable values include:
4199
4200 topology/3d_torus best-fit logic over three-dimensional
4201 topology
4202
4203 topology/none default for other systems, best-fit logic
4204 over one-dimensional topology
4205
4206 topology/tree used for a hierarchical network as de‐
4207 scribed in a topology.conf file
4208
4209
4210 TrackWCKey
4211 Boolean yes or no. Used to enable display and tracking of
4212 the Workload Characterization Key. Must be set to track wckey
4213 usage correctly. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4214 file to create historical usage reports.
4215
4216
4217 TreeWidth
4218 Slurmd daemons use a virtual tree network for communications.
4219 TreeWidth specifies the width of the tree (i.e. the fanout). On
4220 architectures with a front end node running the slurmd daemon,
4221 the value must always be equal to or greater than the number of
4222 front end nodes, which eliminates the need for message forwarding
4223 between the slurmd daemons. On other architectures the default
4224 value is 50, meaning each slurmd daemon can communicate with up
4225 to 50 other slurmd daemons and over 2500 nodes can be contacted
4226 with two message hops. The default value will work well for
4227 most clusters. Optimal system performance can typically be
4228 achieved if TreeWidth is set to the square root of the number of
4229 nodes in the cluster for systems having no more than 2500 nodes
4230 or the cube root for larger systems. The value may not exceed
4231 65533.
4232
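The sizing guidance above can be sketched as a small helper (an illustrative calculation, not a Slurm tool):

```python
def recommended_tree_width(num_nodes: int) -> int:
    """Suggested TreeWidth: square root of the node count for
    clusters of up to 2500 nodes, cube root for larger systems."""
    exponent = 1 / 2 if num_nodes <= 2500 else 1 / 3
    return max(1, round(num_nodes ** exponent))
```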
4233
4234 UnkillableStepProgram
4235 If the processes in a job step are determined to be unkillable
4236 for a period of time specified by the UnkillableStepTimeout
4237 variable, the program specified by UnkillableStepProgram will be
4238 executed. By default no program is run.
4239
4240 See section UNKILLABLE STEP PROGRAM SCRIPT for more information.
4241
4242
4243 UnkillableStepTimeout
4244 The length of time, in seconds, that Slurm will wait before de‐
4245 ciding that processes in a job step are unkillable (after they
4246 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4247 gram. The default timeout value is 60 seconds. If exceeded,
4248 the compute node will be drained to prevent future jobs from be‐
4249 ing scheduled on the node.
4250
4251
4252 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4253 will be enabled. PAM is used to establish the upper bounds for
4254 resource limits. With PAM support enabled, local system adminis‐
4255 trators can dynamically configure system resource limits. Chang‐
4256 ing the upper bound of a resource limit will not alter the lim‐
4257 its of running jobs, only jobs started after a change has been
4258 made will pick up the new limits. The default value is 0 (not
4259 to enable PAM support). Remember that PAM also needs to be con‐
4260 figured to support Slurm as a service. For sites using PAM's
4261 directory based configuration option, a configuration file named
4262 slurm should be created. The module-type, control-flags, and
4263 module-path names that should be included in the file are:
4264 auth required pam_localuser.so
4265 auth required pam_shells.so
4266 account required pam_unix.so
4267 account required pam_access.so
4268 session required pam_unix.so
4269 For sites configuring PAM with a general configuration file, the
4270 appropriate lines (see above), where slurm is the service-name,
4271 should be added.
4272
4273 NOTE: The UsePAM option has nothing to do with the con‐
4274 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules;
4275 these two modules work independently of the value set for
4276 UsePAM.
4277
4278
4279 VSizeFactor
4280 Memory specifications in job requests apply to real memory size
4281 (also known as resident set size). It is possible to enforce
4282 virtual memory limits for both jobs and job steps by limiting
4283 their virtual memory to some percentage of their real memory al‐
4284 location. The VSizeFactor parameter specifies the job's or job
4285 step's virtual memory limit as a percentage of its real memory
4286 limit. For example, if a job's real memory limit is 500MB and
4287 VSizeFactor is set to 101 then the job will be killed if its
4288 real memory exceeds 500MB or its virtual memory exceeds 505MB
4289 (101 percent of the real memory limit). The default value is 0,
4290 which disables enforcement of virtual memory limits. The value
4291 may not exceed 65533 percent.
4292
4293 NOTE: This parameter is dependent on OverMemoryKill being con‐
4294 figured in JobAcctGatherParams. It is also possible to configure
4295 the TaskPlugin to use task/cgroup for memory enforcement. VSize‐
4296 Factor will not have an effect on memory enforcement done
4297 through cgroups.
4298
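The percentage arithmetic can be sketched as follows (illustrative only; the enforcement itself is done by Slurm when OverMemoryKill is configured):

```python
def vsize_limit_mb(real_mem_mb: float, vsize_factor: int) -> float:
    """Virtual memory limit as a percentage of the real memory
    limit; a VSizeFactor of 0 disables virtual memory enforcement."""
    if vsize_factor == 0:
        return float("inf")  # no virtual memory limit enforced
    return real_mem_mb * vsize_factor / 100

# real limit 500MB, VSizeFactor=101 -> virtual limit 505MB
```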
4299
4300 WaitTime
4301 Specifies how many seconds the srun command should by default
4302 wait after the first task terminates before terminating all re‐
4303 maining tasks. The "--wait" option on the srun command line
4304 overrides this value. The default value is 0, which disables
4305 this feature. May not exceed 65533 seconds.
4306
4307
4308 X11Parameters
4309 For use with Slurm's built-in X11 forwarding implementation.
4310
4311 home_xauthority
4312 If set, xauth data on the compute node will be placed in
4313 ~/.Xauthority rather than in a temporary file under
4314 TmpFS.
4315
4316
4317 NODE CONFIGURATION
4318 The configuration of nodes (or machines) to be managed by Slurm is also
4319 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4320 adding nodes, changing their processor count, etc.) require restarting
4321 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4322 must know each node in the system to forward messages in support of hi‐
4323 erarchical communications. Only the NodeName must be supplied in the
4324 configuration file. All other node configuration information is op‐
4325 tional. It is advisable to establish baseline node configurations, es‐
4326 pecially if the cluster is heterogeneous. Nodes which register to the
4327 system with less than the configured resources (e.g. too little mem‐
4328 ory), will be placed in the "DOWN" state to avoid scheduling jobs on
4329 them. Establishing baseline configurations will also speed Slurm's
4330 scheduling process by permitting it to compare job requirements against
4331 these (relatively few) configuration parameters and possibly avoid hav‐
4332 ing to check job requirements against every individual node's configu‐
4333 ration. The resources checked at node registration time are: CPUs,
4334 RealMemory and TmpDisk.
4335
4336 Default values can be specified with a record in which NodeName is "DE‐
4337 FAULT". The default entry values will apply only to lines following it
4338 in the configuration file and the default values can be reset multiple
4339 times in the configuration file with multiple entries where "Node‐
4340 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4341 add to previous default values and not reinitialize the default val‐
4342 ues. The "NodeName=" specification must be placed on every line de‐
4343 scribing the configuration of nodes. A single node name can not appear
4344 as a NodeName value in more than one line (duplicate node name records
4345 will be ignored). In fact, it is generally possible and desirable to
4346 define the configurations of all nodes in only a few lines. This con‐
4347 vention permits significant optimization in the scheduling of larger
4348 clusters. In order to support the concept of jobs requiring consecu‐
4349 tive nodes on some architectures, node specifications should be placed
4350 in this file in consecutive order. No single node name may be listed
4351 more than once in the configuration file. Use "DownNodes=" to record
4352 the state of nodes which are temporarily in a DOWN, DRAIN or FAILING
4353 state without altering permanent configuration information. A job
4354 step's tasks are allocated to nodes in the order the nodes appear in the
4355 configuration file. There is presently no capability within Slurm to
4356 arbitrarily order a job step's tasks.
4357
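A hypothetical fragment using DEFAULT records as described above (node names and resource figures are illustrative):

```
NodeName=DEFAULT CPUs=16 RealMemory=64000 TmpDisk=100000 State=UNKNOWN
NodeName=tux[0-127]
NodeName=DEFAULT CPUs=32 RealMemory=128000
NodeName=tux[128-255]
```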
4358 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4359 and/or a simple node range expression may optionally be used to specify
4360 numeric ranges of nodes to avoid building a configuration file with
4361 large numbers of entries. The node range expression can contain one
4362 pair of square brackets with a sequence of comma-separated numbers
4363 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4364 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4365 more leading zeros to indicate the numeric portion has a fixed number
4366 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4367 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4368 more numeric expressions are included, one of them must be at the end
4369 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4370 always be used in a comma-separated list.
4371
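For illustration, here is a simplified expander for the bracketed range syntax described above; it handles a single bracket pair with zero-padding, a subset of Slurm's full hostlist grammar:

```python
import re

def expand_hostlist(expr: str) -> list[str]:
    """Expand e.g. "linux[0-2,5]" -> linux0, linux1, linux2, linux5.
    Supports one [..] group; leading zeros fix the digit width."""
    m = re.fullmatch(r"(.*)\[([\d,\-]+)\](.*)", expr)
    if not m:
        return [expr]                      # plain name, nothing to expand
    prefix, body, suffix = m.groups()
    names = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        width = len(lo)                    # preserve zero padding
        for i in range(int(lo), int(hi or lo) + 1):
            names.append(f"{prefix}{str(i).zfill(width)}{suffix}")
    return names
```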
4372 The node configuration specifies the following information:
4373
4374
4375 NodeName
4376 Name that Slurm uses to refer to a node. Typically this would
4377 be the string that "/bin/hostname -s" returns. It may also be
4378 the fully qualified domain name as returned by "/bin/hostname
4379 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4380 with the host through the host database (/etc/hosts) or DNS, de‐
4381 pending on the resolver settings. Note that if the short form
4382 of the hostname is not used, it may prevent use of hostlist ex‐
4383 pressions (the numeric portion in brackets must be at the end of
4384 the string). It may also be an arbitrary string if NodeHostname
4385 is specified. If the NodeName is "DEFAULT", the values speci‐
4386 fied with that record will apply to subsequent node specifica‐
4387 tions unless explicitly set to other values in that node record
4388 or replaced with a different set of default values. Each line
4389 where NodeName is "DEFAULT" will replace or add to previous de‐
4390 fault values and not reinitialize the default values. For ar‐
4391 chitectures in which the node order is significant, nodes will
4392 be considered consecutive in the order defined. For example, if
4393 the configuration for "NodeName=charlie" immediately follows the
4394 configuration for "NodeName=baker" they will be considered adja‐
4395 cent in the computer.
4396
4397
4398 NodeHostname
4399 Typically this would be the string that "/bin/hostname -s" re‐
4400 turns. It may also be the fully qualified domain name as re‐
4401 turned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid
4402 domain name associated with the host through the host database
4403 (/etc/hosts) or DNS, depending on the resolver settings. Note
4404 that if the short form of the hostname is not used, it may pre‐
4405 vent use of hostlist expressions (the numeric portion in brack‐
4406 ets must be at the end of the string). A node range expression
4407 can be used to specify a set of nodes. If an expression is
4408 used, the number of nodes identified by NodeHostname on a line
4409 in the configuration file must be identical to the number of
4410 nodes identified by NodeName. By default, the NodeHostname will
4411 be identical in value to NodeName.
4412
4413
4414 NodeAddr
4415 Name by which a node should be referred to in establishing a
4416 communications path. This name will be used as an argument to the
4417 getaddrinfo() function for identification. If a node range ex‐
4418 pression is used to designate multiple nodes, they must exactly
4419 match the entries in the NodeName (e.g. "NodeName=lx[0-7]
4420 NodeAddr=elx[0-7]"). NodeAddr may also contain IP addresses.
4421 By default, the NodeAddr will be identical in value to NodeHost‐
4422 name.
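
              For example (hypothetical names), one line can map a set of
              Slurm node names onto distinct hostnames and communication
              addresses:

                 NodeName=lx[0-7] NodeHostname=host[0-7] NodeAddr=elx[0-7]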
4423
4424
4425 BcastAddr
4426 Alternate network path to be used for sbcast network traffic to
4427 a given node. This name will be used as an argument to the
4428 getaddrinfo() function. If a node range expression is used to
4429 designate multiple nodes, they must exactly match the entries in
4430 the NodeName (e.g. "NodeName=lx[0-7] BcastAddr=elx[0-7]").
4431 BcastAddr may also contain IP addresses. By default, the Bcas‐
4432 tAddr is unset, and sbcast traffic will be routed to the
4433 NodeAddr for a given node. Note: cannot be used with Communica‐
4434 tionParameters=NoInAddrAny.
4435
4436
4437 Boards Number of Baseboards in nodes with a baseboard controller. Note
4438 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4439 and ThreadsPerCore should be specified. Boards and CPUs are mu‐
4440 tually exclusive. The default value is 1.
4441
4442
4443 CoreSpecCount
4444 Number of cores reserved for system use. These cores will not
4445 be available for allocation to user jobs. Depending upon the
4446 TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e.
4447 slurmd and slurmstepd) may either be confined to these resources
4448 (the default) or prevented from using these resources. Isola‐
4449 tion of the Slurm daemons from user jobs may improve application
4450 performance. If this option and CpuSpecList are both designated
4451 for a node, an error is generated. For information on the algo‐
4452 rithm used by Slurm to select the cores refer to the core spe‐
4453 cialization documentation (
4454 https://slurm.schedmd.com/core_spec.html ).
4455
4456
4457 CoresPerSocket
4458 Number of cores in a single physical processor socket (e.g.
4459 "2"). The CoresPerSocket value describes physical cores, not
4460 the logical number of processors per socket. NOTE: If you have
4461 multi-core processors, you will likely need to specify this pa‐
4462 rameter in order to optimize scheduling. The default value is
4463 1.
4464
4465
4466 CpuBind
4467 If a job step request does not specify an option to control how
4468 tasks are bound to allocated CPUs (--cpu-bind) and all nodes
4469 allocated to the job have the same CpuBind option, the node CpuBind
4470 option will control how tasks are bound to allocated resources.
4471 Supported values for CpuBind are "none", "board", "socket",
4472 "ldom" (NUMA), "core" and "thread".
4473
4474
4475 CPUs Number of logical processors on the node (e.g. "2"). CPUs and
4476 Boards are mutually exclusive. It can be set to the total number
4477 of sockets (supported only by select/linear), cores or threads.
4478 This can be useful when you want to schedule only the cores on a
4479 hyper-threaded node. If CPUs is omitted, its default will be set
4480 equal to the product of Boards, Sockets, CoresPerSocket, and
4481 ThreadsPerCore.
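
              As an illustration (hypothetical node names), a dual-socket
              node with 8 cores per socket and 2 threads per core can either
              let CPUs default to 32, or set CPUs=16 so only one thread per
              core is scheduled:

                 NodeName=node[01-16] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
                 NodeName=node[17-32] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 CPUs=16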
4482
4483
4484 CpuSpecList
4485 A comma delimited list of Slurm abstract CPU IDs reserved for
4486 system use. The list will be expanded to include all other
4487 CPUs, if any, on the same cores. These cores will not be avail‐
4488 able for allocation to user jobs. Depending upon the TaskPlug‐
4489 inParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and
4490 slurmstepd) may either be confined to these resources (the de‐
4491 fault) or prevented from using these resources. Isolation of
4492 the Slurm daemons from user jobs may improve application perfor‐
4493 mance. If this option and CoreSpecCount are both designated for
4494 a node, an error is generated. This option has no effect unless
4495 cgroup job confinement is also configured (TaskPlu‐
4496 gin=task/cgroup with ConstrainCores=yes in cgroup.conf).
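
              A sketch (hypothetical node names) of both styles of core
              specialization; note the CpuSpecList form additionally requires
              TaskPlugin=task/cgroup with ConstrainCores=yes in cgroup.conf,
              and the two options may not be combined on one node:

                 NodeName=node01 CPUs=32 CoreSpecCount=2
                 NodeName=node02 CPUs=32 CpuSpecList=0,16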
4497
4498
4499 Features
4500 A comma delimited list of arbitrary strings indicative of some
4501 characteristic associated with the node. There is no value or
4502 count associated with a feature at this time, a node either has
4503 a feature or it does not. A desired feature may contain a nu‐
4504 meric component indicating, for example, processor speed but
4505 this numeric component will be considered to be part of the fea‐
4506 ture string. Features are intended to be used to filter nodes
4507 eligible to run jobs via the --constraint argument. By default
4508 a node has no features. Also see Gres for being able to have
4509 more control such as types and count. Using features is faster
4510 than scheduling against GRES but is limited to Boolean opera‐
4511 tions.
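
              For example (hypothetical feature names), nodes can be tagged
              and later selected by jobs using --constraint:

                 NodeName=tux[0-15]  Features=intel,ib
                 NodeName=tux[16-31] Features=amd,ib

              A job could then request, e.g., srun --constraint="intel&ib"
              to be filtered onto the first group of nodes.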
4512
4513
4514 Gres A comma delimited list of generic resources specifications for a
4515 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4516 ber>[K|M|G]". The first field is the resource name, which
4517 matches the GresType configuration parameter name. The optional
4518 type field might be used to identify a model of that generic re‐
4519 source. It is forbidden to specify both an untyped GRES and a
4520 typed GRES with the same <name>. The optional no_consume field
4521 allows you to specify that a generic resource does not have a
4522 finite number of that resource that gets consumed as it is re‐
4523 quested. The no_consume field is a GRES specific setting and ap‐
4524 plies to the GRES, regardless of the type specified. The final
4525 field must specify a generic resources count. A suffix of "K",
4526 "M", "G", "T" or "P" may be used to multiply the number by 1024,
4527 1048576, 1073741824, etc. respectively.
4528 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4529 sume:4G"). By default a node has no generic resources and its
4530 maximum count is that of an unsigned 64-bit integer. Also see
4531 Features for Boolean flags to filter nodes using job con‐
4532 straints.
4533
4534
4535 MemSpecLimit
4536 Amount of memory, in megabytes, reserved for system use and not
4537 available for user allocations. If the task/cgroup plugin is
4538 configured and that plugin constrains memory allocations (i.e.
4539 TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes
4540 in cgroup.conf), then Slurm compute node daemons (slurmd plus
4541 slurmstepd) will be allocated the specified memory limit. Note
4542 that memory must be configured as a consumable resource through
4543 one of the corresponding SelectTypeParameters options for this
4544 option to work. The daemons will not be killed if they exhaust
4545 the memory allocation (i.e. the Out-Of-Memory Killer is disabled
4546 for the daemon's memory cgroup). If the task/cgroup plugin is
4547 not configured, the specified memory will only be unavailable
4548 for user allocations.
4549
4550
4551 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4552 tens to for work on this particular node. By default there is a
4553 single port number for all slurmd daemons on all compute nodes
4554 as defined by the SlurmdPort configuration parameter. Use of
4555 this option is not generally recommended except for development
4556 or testing purposes. If multiple slurmd daemons execute on a
4557 node, this can specify a range of ports.
4558
4559 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4560 automatically try to interact with anything opened on ports
4561 8192-60000. Configure Port to use a port outside of the config‐
4562 ured SrunPortRange and RSIP's port range.
4563
4564
4565 Procs See CPUs.
4566
4567
4568 RealMemory
4569 Size of real memory on the node in megabytes (e.g. "2048"). The
4570 default value is 1. Lowering RealMemory in order to set aside
4571 some amount of memory for the OS, unavailable for job allocations,
4572 will not work as intended if Memory is not set as a consumable
4573 resource in SelectTypeParameters. One of the *_Memory
4574 options needs to be enabled for that goal to be accomplished.
4575 Also see MemSpecLimit.
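
              A sketch (hypothetical sizes) that reserves 2 GB of a 64 GB
              node for the OS and Slurm daemons; this assumes a
              memory-consuming option such as CR_Core_Memory is set in
              SelectTypeParameters:

                 SelectTypeParameters=CR_Core_Memory
                 NodeName=node[01-08] RealMemory=65536 MemSpecLimit=2048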
4576
4577
4578 Reason Identifies the reason for a node being in state "DOWN",
4579 "DRAINED", "DRAINING", "FAIL" or "FAILING". Use quotes to
4580 enclose a reason having more than one word.
4581
4582
4583 Sockets
4584 Number of physical processor sockets/chips on the node (e.g.
4585 "2"). If Sockets is omitted, it will be inferred from CPUs,
4586 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4587 multi-core processors, you will likely need to specify these pa‐
4588 rameters. Sockets and SocketsPerBoard are mutually exclusive.
4589 If Sockets is specified when Boards is also used, Sockets is in‐
4590 terpreted as SocketsPerBoard rather than total sockets. The de‐
4591 fault value is 1.
4592
4593
4594 SocketsPerBoard
4595 Number of physical processor sockets/chips on a baseboard.
4596 Sockets and SocketsPerBoard are mutually exclusive. The default
4597 value is 1.
4598
4599
4600 State State of the node with respect to the initiation of user jobs.
4601 Acceptable values are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE
4602 and UNKNOWN. Node states of BUSY and IDLE should not be speci‐
4603 fied in the node configuration, but set the node state to UN‐
4604 KNOWN instead. Setting the node state to UNKNOWN will result in
4605 the node state being set to BUSY, IDLE or other appropriate
4606 state based upon recovered system state information. The de‐
4607 fault value is UNKNOWN. Also see the DownNodes parameter below.
4608
4609 CLOUD Indicates the node exists in the cloud. Its initial
4610 state will be treated as powered down. The node will
4611 be available for use after its state is recovered from
4612 Slurm's state save file or the slurmd daemon starts on
4613 the compute node.
4614
4615 DOWN Indicates the node failed and is unavailable to be al‐
4616 located work.
4617
4618 DRAIN Indicates the node is unavailable to be allocated
4619 work.
4620
4621 FAIL Indicates the node is expected to fail soon, has no
4622 jobs allocated to it, and will not be allocated to any
4623 new jobs.
4624
4625 FAILING Indicates the node is expected to fail soon, has one
4626 or more jobs allocated to it, but will not be allo‐
4627 cated to any new jobs.
4628
4629 FUTURE Indicates the node is defined for future use and need
4630 not exist when the Slurm daemons are started. These
4631 nodes can be made available for use simply by updating
4632 the node state using the scontrol command rather than
4633 restarting the slurmctld daemon. After these nodes are
4634 made available, change their State in the slurm.conf
4635 file. Until these nodes are made available, they will
4636 not be seen using any Slurm commands, nor will any
4637 attempt be made to contact them.
4638
4639
4640 Dynamic Future Nodes
4641 A slurmd started with -F[<feature>] will be as‐
4642 sociated with a FUTURE node that matches the
4643 same configuration (sockets, cores, threads) as
4644 reported by slurmd -C. The node's NodeAddr and
4645 NodeHostname will automatically be retrieved
4646 from the slurmd and will be cleared when set
4647 back to the FUTURE state. Dynamic FUTURE nodes
4648 retain non-FUTURE state on restart. Use
4649 scontrol to put a node back into the FUTURE state.
4650
4651 If the mapping of the NodeName to the slurmd
4652 HostName is not updated in DNS, Dynamic Future
4653 nodes won't know how to communicate with each
4654 other -- because NodeAddr and NodeHostName are
4655 not defined in the slurm.conf -- and the fanout
4656 communications need to be disabled by setting
4657 TreeWidth to a high number (e.g. 65533). If the
4658 DNS mapping is made, then the cloud_dns Slurm‐
4659 ctldParameter can be used.
4660
4661
4662 UNKNOWN Indicates the node's state is undefined but will be
4663 established (set to BUSY or IDLE) when the slurmd dae‐
4664 mon on that node registers. UNKNOWN is the default
4665 state.
4666
4667
4668 ThreadsPerCore
4669 Number of logical threads in a single physical core (e.g. "2").
4670 Note that Slurm can allocate resources to jobs down to the
4671 resolution of a core. If your system is configured with more
4672 than one thread per core, execution of a different job on each
4673 thread is not supported unless you configure SelectTypeParame‐
4674 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4675 or ThreadsPerCore. A job can execute one task per thread from
4676 within one job step or execute a distinct job step on each of
4677 the threads. Note also if you are running with more than 1
4678 thread per core and running the select/cons_res or se‐
4679 lect/cons_tres plugin then you will want to set the SelectType‐
4680 Parameters variable to something other than CR_CPU to avoid un‐
4681 expected results. The default value is 1.
4682
4683
4684 TmpDisk
4685 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4686 "16384"). TmpFS (for "Temporary File System") identifies the lo‐
4687 cation which jobs should use for temporary storage. Note this
4688 does not indicate the amount of free space available to the user
4689 on the node, only the total file system size. The system
4690 administrator should ensure this file system is purged as needed so
4691 that user jobs have access to most of this space. The Prolog
4692 and/or Epilog programs (specified in the configuration file)
4693 might be used to ensure the file system is kept clean. The de‐
4694 fault value is 0.
4695
4696
4697 TRESWeights
4698 TRESWeights are used to calculate a value that represents how
4699 busy a node is. Currently only used in federation configura‐
4700 tions. TRESWeights are different from TRESBillingWeights --
4701 which is used for fairshare calculations.
4702
4703 TRES weights are specified as a comma-separated list of <TRES
4704 Type>=<TRES Weight> pairs.
4705 e.g.
4706 NodeName=node1 ... TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
4707
4708 By default the weighted TRES value is calculated as the sum of
4709 all node TRES types multiplied by their corresponding TRES
4710 weight.
4711
4712 If PriorityFlags=MAX_TRES is configured, the weighted TRES value
4713 is calculated as the MAX of individual node TRES' (e.g. cpus,
4714 mem, gres).
4715
4716
4717 Weight The priority of the node for scheduling purposes. All things
4718 being equal, jobs will be allocated the nodes with the lowest
4719 weight which satisfies their requirements. For example, a het‐
4720 erogeneous collection of nodes might be placed into a single
4721 partition for greater system utilization, responsiveness and ca‐
4722 pability. It would be preferable to allocate smaller memory
4723 nodes rather than larger memory nodes if either will satisfy a
4724 job's requirements. The units of weight are arbitrary, but
4725 larger weights should be assigned to nodes with more processors,
4726 memory, disk space, higher processor speed, etc. Note that if a
4727 job allocation request can not be satisfied using the nodes with
4728 the lowest weight, the set of nodes with the next lowest weight
4729 is added to the set of nodes under consideration for use (repeat
4730 as needed for higher weight values). If you absolutely want to
4731 minimize the number of higher weight nodes allocated to a job
4732 (at a cost of higher scheduling overhead), give each node a dis‐
4733 tinct Weight value and they will be added to the pool of nodes
4734 being considered for scheduling individually. The default value
4735 is 1.
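
              For example (hypothetical configuration), smaller-memory nodes
              can be given a lower Weight so the large-memory nodes remain
              free for jobs that actually need them:

                 NodeName=small[01-32] RealMemory=65536  Weight=1
                 NodeName=big[01-04]   RealMemory=524288 Weight=10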
4736
4737
4738 DOWN NODE SPECIFICATION
4739 The DownNodes= parameter permits you to mark certain nodes as in a
4740 DOWN, DRAIN, FAIL, FAILING or FUTURE state without altering the perma‐
4741 nent configuration information listed under a NodeName= specification.
4742
4743
4744 DownNodes
4745 Any node name, or list of node names, from the NodeName= speci‐
4746 fications.
4747
4748
4749 Reason Identifies the reason for a node being in state DOWN, DRAIN,
4750 FAIL, FAILING or FUTURE. Use quotes to enclose a reason having
4751 more than one word.
4752
4753
4754 State State of the node with respect to the initiation of user jobs.
4755 Acceptable values are DOWN, DRAIN, FAIL, FAILING and FUTURE.
4756 For more information about these states see the descriptions un‐
4757 der State in the NodeName= section above. The default value is
4758 DOWN.
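
       As a sketch (hypothetical node names and reason):

          DownNodes=lx[13-14] State=DOWN Reason="power supply failed"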
4759
4760
4761 FRONTEND NODE CONFIGURATION
4762 On computers where frontend nodes are used to execute batch scripts
4763 rather than compute nodes (Cray ALPS systems), one may configure one or
4764 more frontend nodes using the configuration parameters defined below.
4765 These options are very similar to those used in configuring compute
4766 nodes. These options may only be used on systems configured and built
4767 with the appropriate parameters (--have-front-end) or a system deter‐
4768 mined to have the appropriate architecture by the configure script
4769 (Cray ALPS systems). The front end configuration specifies the follow‐
4770 ing information:
4771
4772
4773 AllowGroups
4774 Comma-separated list of group names which may execute jobs on
4775 this front end node. By default, all groups may use this front
4776 end node. A user will be permitted to use this front end node
4777 if AllowGroups has at least one group associated with the user.
4778 May not be used with the DenyGroups option.
4779
4780
4781 AllowUsers
4782 Comma-separated list of user names which may execute jobs on
4783 this front end node. By default, all users may use this front
4784 end node. May not be used with the DenyUsers option.
4785
4786
4787 DenyGroups
4788 Comma-separated list of group names which are prevented from ex‐
4789 ecuting jobs on this front end node. May not be used with the
4790 AllowGroups option.
4791
4792
4793 DenyUsers
4794 Comma-separated list of user names which are prevented from exe‐
4795 cuting jobs on this front end node. May not be used with the
4796 AllowUsers option.
4797
4798
4799 FrontendName
4800 Name that Slurm uses to refer to a frontend node. Typically
4801 this would be the string that "/bin/hostname -s" returns. It
4802 may also be the fully qualified domain name as returned by
4803 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4804 name associated with the host through the host database
4805 (/etc/hosts) or DNS, depending on the resolver settings. Note
4806 that if the short form of the hostname is not used, it may pre‐
4807 vent use of hostlist expressions (the numeric portion in brack‐
4808 ets must be at the end of the string). If the FrontendName is
4809 "DEFAULT", the values specified with that record will apply to
4810 subsequent node specifications unless explicitly set to other
4811 values in that frontend node record or replaced with a different
4812 set of default values. Each line where FrontendName is
4813 "DEFAULT" will replace or add to previous default values and not
4814 reinitialize the default values.
4815
4816
4817 FrontendAddr
4818 Name by which a frontend node should be referred to in establishing
4819 a communications path. This name will be used as an argument to
4820 the getaddrinfo() function for identification. As with Fron‐
4821 tendName, list the individual node addresses rather than using a
4822 hostlist expression. The number of FrontendAddr records per
4823 line must equal the number of FrontendName records per line
4824 (i.e. you can't map two node names to one address). FrontendAddr
4825 may also contain IP addresses. By default, the FrontendAddr
4826 will be identical in value to FrontendName.
4827
4828
4829 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4830 tens to for work on this particular frontend node. By default
4831 there is a single port number for all slurmd daemons on all
4832 frontend nodes as defined by the SlurmdPort configuration param‐
4833 eter. Use of this option is not generally recommended except for
4834 development or testing purposes.
4835
4836 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4837 automatically try to interact with anything opened on ports
4838 8192-60000. Configure Port to use a port outside of the config‐
4839 ured SrunPortRange and RSIP's port range.
4840
4841
4842 Reason Identifies the reason for a frontend node being in state DOWN,
4843 DRAINED, DRAINING, FAIL or FAILING. Use quotes to enclose a
4844 reason having more than one word.
4845
4846
4847 State State of the frontend node with respect to the initiation of
4848 user jobs. Acceptable values are DOWN, DRAIN, FAIL, FAILING and
4849 UNKNOWN. Node states of BUSY and IDLE should not be specified
4850 in the node configuration, but set the node state to UNKNOWN in‐
4851 stead. Setting the node state to UNKNOWN will result in the
4852 node state being set to BUSY, IDLE or other appropriate state
4853 based upon recovered system state information. For more infor‐
4854 mation about these states see the descriptions under State in
4855 the NodeName= section above. The default value is UNKNOWN.
4856
4857
4858 As an example, you can do something similar to the following to define
4859 four front end nodes for running slurmd daemons.
4860 FrontendName=frontend[00-03] FrontendAddr=efrontend[00-03] State=UNKNOWN
4861
4862
4863 NODESET CONFIGURATION
4864 The nodeset configuration allows you to define a name for a specific
4865 set of nodes which can be used to simplify the partition configuration
4866 section, especially for heterogeneous or condo-style systems. Each
4867 nodeset may be defined by an explicit list of nodes, and/or by filtering
4868 the nodes by a particular configured feature. If both Feature= and
4869 Nodes= are used the nodeset shall be the union of the two subsets.
4870 Note that the nodesets are only used to simplify the partition defini‐
4871 tions at present, and are not usable outside of the partition configu‐
4872 ration.
4873
4874 Feature
4875 All nodes with this single feature will be included as part of
4876 this nodeset.
4877
4878 Nodes List of nodes in this set.
4879
4880 NodeSet
4881 Unique name for a set of nodes. Must not overlap with any Node‐
4882 Name definitions.
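
       For example (hypothetical names), a nodeset can combine a feature
       filter with an explicit node list and then stand in for a node list
       in a partition definition:

          NodeSet=gpuset Feature=gpu Nodes=extra[01-02]
          PartitionName=gpu Nodes=gpuset MaxTime=24:00:00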
4883
4884
4885 PARTITION CONFIGURATION
4886 The partition configuration permits you to establish different job lim‐
4887 its or access controls for various groups (or partitions) of nodes.
4888 Nodes may be in more than one partition, making partitions serve as
4889 general purpose queues. For example, one may put the same set of nodes
4890 into two different partitions, each with different constraints (time
4891 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4892 allocated resources within a single partition. Default values can be
4893 specified with a record in which PartitionName is "DEFAULT". The de‐
4894 fault entry values will apply only to lines following it in the config‐
4895 uration file and the default values can be reset multiple times in the
4896 configuration file with multiple entries where "PartitionName=DEFAULT".
4897 The "PartitionName=" specification must be placed on every line de‐
4898 scribing the configuration of partitions. Each line where Partition‐
4899 Name is "DEFAULT" will replace or add to previous default values and
4900 not reinitialize the default values. A single partition name cannot
4901 appear as a PartitionName value in more than one line (duplicate parti‐
4902 tion name records will be ignored). If a partition that is in use is
4903 deleted from the configuration and slurm is restarted or reconfigured
4904 (scontrol reconfigure), jobs using the partition are canceled. NOTE:
4905 Put all parameters for each partition on a single line. Each line of
4906 partition configuration information should represent a different parti‐
4907 tion. The partition configuration file contains the following informa‐
4908 tion:
4909
4910
4911 AllocNodes
4912 Comma-separated list of nodes from which users can submit jobs
4913 in the partition. Node names may be specified using the node
4914 range expression syntax described above. The default value is
4915 "ALL".
4916
4917
4918 AllowAccounts
4919 Comma-separated list of accounts which may execute jobs in the
4920 partition. The default value is "ALL". NOTE: If AllowAccounts
4921 is used then DenyAccounts will not be enforced. Also refer to
4922 DenyAccounts.
4923
4924
4925 AllowGroups
4926 Comma-separated list of group names which may view and execute
4927 jobs in this partition. A user will be permitted to view and
4928 submit a job to this partition if AllowGroups has at least one
4929 group associated with the user. Jobs executed as user root or
4930 as user SlurmUser will be allowed to view and use any partition,
4931 regardless of the value of AllowGroups. In addition, a Slurm Ad‐
4932 min or Operator will be able to view any partition, regardless
4933 of the value of AllowGroups. If user root attempts to execute a
4934 job as another user (e.g. using srun's --uid option), then the
4935 job will be subject to AllowGroups as if it were submitted by
4936 that user. By default, AllowGroups is unset, meaning all groups
4937 are allowed to use this partition. The special value 'ALL' is
4938 equivalent to this. Even when PrivateData does not hide parti‐
4939 tion information, AllowGroups will still hide partition informa‐
4940 tion accordingly. NOTE: For performance reasons, Slurm main‐
4941 tains a list of user IDs allowed to use each partition and this
4942 is checked at job submission time. This list of user IDs is up‐
4943 dated when the slurmctld daemon is restarted, reconfigured (e.g.
4944 "scontrol reconfig") or the partition's AllowGroups value is
4945 reset, even if its value is unchanged (e.g. "scontrol update
4946 PartitionName=name AllowGroups=group"). For a user's access to a
4947 partition to change, both the user's group membership and
4948 Slurm's internal user ID list must change using one of the
4949 methods described above.
4950
4951
4952 AllowQos
4953 Comma-separated list of Qos which may execute jobs in the parti‐
4954 tion. Jobs executed as user root can use any partition without
4955 regard to the value of AllowQos. The default value is "ALL".
4956 NOTE: If AllowQos is used then DenyQos will not be enforced.
4957 Also refer to DenyQos.
4958
4959
4960 Alternate
4961 Partition name of alternate partition to be used if the state of
4962 this partition is "DRAIN" or "INACTIVE."
4963
4964
4965 CpuBind
4966 If a job step request does not specify an option to control how
4967 tasks are bound to allocated CPUs (--cpu-bind), and the nodes
4968 allocated to the job do not all have the same node CpuBind
4969 option, then the partition's CpuBind option will control how tasks
4970 are bound to allocated resources. Supported values for CpuBind are
4971 "none", "board", "socket", "ldom" (NUMA), "core" and "thread".
4972
4973
4974 Default
4975 If this keyword is set, jobs submitted without a partition spec‐
4976 ification will utilize this partition. Possible values are
4977 "YES" and "NO". The default value is "NO".
4978
4979
4980 DefCpuPerGPU
4981 Default count of CPUs allocated per allocated GPU.
4982
4983
4984 DefMemPerCPU
4985 Default real memory size available per allocated CPU in
4986 megabytes. Used to avoid over-subscribing memory and causing
4987 paging. DefMemPerCPU would generally be used if individual pro‐
4988 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
4989 lectType=select/cons_tres). If not set, the DefMemPerCPU value
4990 for the entire cluster will be used. Also see DefMemPerGPU,
4991 DefMemPerNode and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
4992 DefMemPerNode are mutually exclusive.
4993
4994
4995 DefMemPerGPU
4996 Default real memory size available per allocated GPU in
4997 megabytes. Also see DefMemPerCPU, DefMemPerNode and MaxMemPer‐
4998 CPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
4999 exclusive.
5000
5001
5002 DefMemPerNode
5003 Default real memory size available per allocated node in
5004 megabytes. Used to avoid over-subscribing memory and causing
5005 paging. DefMemPerNode would generally be used if whole nodes
5006 are allocated to jobs (SelectType=select/linear) and resources
5007 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5008 If not set, the DefMemPerNode value for the entire cluster will
5009 be used. Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
5010 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
5011 sive.
5012
5013
5014 DenyAccounts
5015 Comma-separated list of accounts which may not execute jobs in
5016 the partition. By default, no accounts are denied access. NOTE:
5017 If AllowAccounts is used then DenyAccounts will not be enforced.
5018 Also refer to AllowAccounts.
5019
5020
5021 DenyQos
5022 Comma-separated list of Qos which may not execute jobs in the
5023 partition. By default, no QOS are denied access. NOTE: If
5024 AllowQos is used then DenyQos will not be enforced. Also refer to
5025 AllowQos.
5026
5027
5028 DefaultTime
5029 Run time limit used for jobs that don't specify a value. If not
5030 set then MaxTime will be used. Format is the same as for Max‐
5031 Time.
5032
5033
5034 DisableRootJobs
5035 If set to "YES" then user root will be prevented from running
5036 any jobs on this partition. The default value will be the value
5037 of DisableRootJobs set outside of a partition specification
5038 (which is "NO", allowing user root to execute jobs).
5039
5040
5041 ExclusiveUser
5042 If set to "YES" then nodes will be exclusively allocated to
5043 users. Multiple jobs may be run for the same user, but only one
5044 user can be active at a time. This capability is also available
5045 on a per-job basis by using the --exclusive=user option.
5046
5047
5048 GraceTime
5049 Specifies, in units of seconds, the preemption grace time to be
5050 extended to a job which has been selected for preemption. The
5051 default value is zero; no preemption grace time is allowed on
5052 this partition. Once a job has been selected for preemption,
5053 its end time is set to the current time plus GraceTime. The
5054 job's tasks are immediately sent SIGCONT and SIGTERM signals in
5055 order to provide notification of its imminent termination. This
5056 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
5057 upon reaching its new end time. This second set of signals is
5058 sent to both the tasks and the containing batch script, if ap‐
5059 plicable. See also the global KillWait configuration parameter.
5060
5061
5062 Hidden Specifies if the partition and its jobs are to be hidden by de‐
5063 fault. Hidden partitions will by default not be reported by the
5064 Slurm APIs or commands. Possible values are "YES" and "NO".
5065 The default value is "NO". Note that partitions that a user
5066 lacks access to by virtue of the AllowGroups parameter will also
5067 be hidden by default.
5068
5069
5070 LLN Schedule resources to jobs on the least loaded nodes (based upon
5071 the number of idle CPUs). This is generally only recommended for
5072 an environment with serial jobs as idle resources will tend to
5073 be highly fragmented, resulting in parallel jobs being distrib‐
5074 uted across many nodes. Note that node Weight takes precedence
5075 over how many idle resources are on each node. Also see the Se‐
5076 lectParameters configuration parameter CR_LLN to use the least
5077 loaded nodes in every partition.
5078
5079
5080 MaxCPUsPerNode
5081 Maximum number of CPUs on any node available to all jobs from
5082 this partition. This can be especially useful to schedule GPUs.
5083 For example a node can be associated with two Slurm partitions
5084 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
5085 limited to only a subset of the node's CPUs, ensuring that one
5086 or more CPUs would be available to jobs in the "gpu" parti‐
5087 tion/queue.
5088
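The GPU-reservation scenario above could be sketched as follows on a hypothetical 16-core node with two GPUs (all names and counts are illustrative):

```
NodeName=tux1 CPUs=16 Gres=gpu:2
# "cpu" jobs may use at most 12 of the 16 CPUs ...
PartitionName=cpu Nodes=tux1 MaxCPUsPerNode=12
# ... leaving at least 4 CPUs free for jobs in the "gpu" partition.
PartitionName=gpu Nodes=tux1
```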
5089
5090 MaxMemPerCPU
5091 Maximum real memory size available per allocated CPU in
5092 megabytes. Used to avoid over-subscribing memory and causing
5093 paging. MaxMemPerCPU would generally be used if individual pro‐
5094 cessors are allocated to jobs (SelectType=select/cons_res or Se‐
5095 lectType=select/cons_tres). If not set, the MaxMemPerCPU value
5096 for the entire cluster will be used. Also see DefMemPerCPU and
5097 MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutually ex‐
5098 clusive.
5099
5100
5101 MaxMemPerNode
5102 Maximum real memory size available per allocated node in
5103 megabytes. Used to avoid over-subscribing memory and causing
5104 paging. MaxMemPerNode would generally be used if whole nodes
5105 are allocated to jobs (SelectType=select/linear) and resources
5106 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5107 If not set, the MaxMemPerNode value for the entire cluster will
5108 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
5109 and MaxMemPerNode are mutually exclusive.
5110
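A sketch of the two (mutually exclusive) memory caps, one per partition, under the allocation modes each is intended for (values and names are illustrative):

```
# Per-CPU cap where individual CPUs are allocated (cons_res/cons_tres):
PartitionName=shared Nodes=tux[1-8] MaxMemPerCPU=4096
# Per-node cap where whole nodes are allocated and over-subscribed:
PartitionName=whole Nodes=tux[9-16] OverSubscribe=YES MaxMemPerNode=64000
```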
5111
5112 MaxNodes
5113 Maximum count of nodes which may be allocated to any single job.
5114 The default value is "UNLIMITED", which is represented inter‐
5115 nally as -1. This limit does not apply to jobs executed by
5116 SlurmUser or user root.
5117
5118
5119 MaxTime
5120 Maximum run time limit for jobs. Format is minutes, min‐
5121 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
5122 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
5123 tion is one minute and second values are rounded up to the next
5124 minute. This limit does not apply to jobs executed by SlurmUser
5125 or user root.
5126
5127
5128 MinNodes
5129 Minimum count of nodes which may be allocated to any single job.
5130 The default value is 0. This limit does not apply to jobs exe‐
5131 cuted by SlurmUser or user root.
5132
5133
5134 Nodes Comma-separated list of nodes or nodesets which are associated
5135 with this partition. Node names may be specified using the node
5136 range expression syntax described above. A blank list of nodes
5137 (i.e. "Nodes= ") can be used if one wants a partition to exist,
5138 but have no resources (possibly on a temporary basis). A value
5139 of "ALL" is mapped to all nodes configured in the cluster.
5140
5141
5142 OverSubscribe
5143 Controls the ability of the partition to execute more than one
5144 job at a time on each resource (node, socket or core depending
5145 upon the value of SelectTypeParameters). If resources are to be
5146 over-subscribed, avoiding memory over-subscription is very im‐
5147 portant. SelectTypeParameters should be configured to treat
5148 memory as a consumable resource and the --mem option should be
5149 used for job allocations. Sharing of resources is typically
5150 useful only when using gang scheduling (PreemptMode=sus‐
5151 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
5152 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
5153 can negatively impact performance for systems with many thou‐
5154 sands of running jobs. The default value is "NO". For more in‐
5155 formation see the following web pages:
5156 https://slurm.schedmd.com/cons_res.html
5157 https://slurm.schedmd.com/cons_res_share.html
5158 https://slurm.schedmd.com/gang_scheduling.html
5159 https://slurm.schedmd.com/preempt.html
5160
5161
5162 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
5163 Type=select/cons_res or SelectType=select/cons_tres
5164 configured. Jobs that run in partitions with Over‐
5165 Subscribe=EXCLUSIVE will have exclusive access to
5166 all allocated nodes.
5167
5168 FORCE Makes all resources in the partition available for
5169 oversubscription without any means for users to dis‐
5170 able it. May be followed with a colon and maximum
5171 number of jobs in running or suspended state. For
5172 example OverSubscribe=FORCE:4 enables each node,
5173 socket or core to oversubscribe each resource four
5174 ways. Recommended only for systems using Preempt‐
5175 Mode=suspend,gang.
5176
5177 NOTE: OverSubscribe=FORCE:1 is a special case that
5178 is not exactly equivalent to OverSubscribe=NO. Over‐
5179 Subscribe=FORCE:1 disables the regular oversubscrip‐
5180                    tion of resources in the same partition, but it will
5181 still allow oversubscription due to preemption. Set‐
5182 ting OverSubscribe=NO will prevent oversubscription
5183 from happening due to preemption as well.
5184
5185 NOTE: If using PreemptType=preempt/qos you can spec‐
5186 ify a value for FORCE that is greater than 1. For
5187 example, OverSubscribe=FORCE:2 will permit two jobs
5188 per resource normally, but a third job can be
5189 started only if done so through preemption based
5190 upon QOS.
5191
5192 NOTE: If OverSubscribe is configured to FORCE or YES
5193 in your slurm.conf and the system is not configured
5194 to use preemption (PreemptMode=OFF) accounting can
5195 easily grow to values greater than the actual uti‐
5196 lization. It may be common on such systems to get
5197 error messages in the slurmdbd log stating: "We have
5198 more allocated time than is possible."
5199
5200
5201 YES Makes all resources in the partition available for
5202 sharing upon request by the job. Resources will
5203 only be over-subscribed when explicitly requested by
5204 the user using the "--oversubscribe" option on job
5205 submission. May be followed with a colon and maxi‐
5206 mum number of jobs in running or suspended state.
5207 For example "OverSubscribe=YES:4" enables each node,
5208 socket or core to execute up to four jobs at once.
5209 Recommended only for systems running with gang
5210 scheduling (PreemptMode=suspend,gang).
5211
5212 NO Selected resources are allocated to a single job. No
5213 resource will be allocated to more than one job.
5214
5215 NOTE: Even if you are using PreemptMode=sus‐
5216 pend,gang, setting OverSubscribe=NO will disable
5217 preemption on that partition. Use OverSub‐
5218 scribe=FORCE:1 if you want to disable normal over‐
5219 subscription but still allow suspension due to pre‐
5220 emption.
5221
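For instance, a gang-scheduled configuration along the lines recommended above might look like this (node and partition names are illustrative):

```
# Global settings
PreemptMode=suspend,gang
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory   # treat memory as consumable
# Partition permitting four-way over-subscription per core
PartitionName=shared Nodes=tux[1-32] OverSubscribe=FORCE:4
```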
5222
5223 PartitionName
5224 Name by which the partition may be referenced (e.g. "Interac‐
5225 tive"). This name can be specified by users when submitting
5226 jobs. If the PartitionName is "DEFAULT", the values specified
5227 with that record will apply to subsequent partition specifica‐
5228 tions unless explicitly set to other values in that partition
5229 record or replaced with a different set of default values. Each
5230              line where PartitionName is "DEFAULT" will replace or add to
5231              previous default values and not reinitialize the default
5232              values.
5233
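The DEFAULT mechanics can be illustrated with a short fragment (partition names are invented):

```
PartitionName=DEFAULT MaxTime=60 State=UP
PartitionName=short Nodes=tux[1-8]        # inherits MaxTime=60
PartitionName=DEFAULT MaxTime=240         # updates the default; State=UP kept
PartitionName=long Nodes=tux[9-16]        # inherits MaxTime=240, State=UP
```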
5234
5235 PreemptMode
5236 Mechanism used to preempt jobs or enable gang scheduling for
5237 this partition when PreemptType=preempt/partition_prio is con‐
5238 figured. This partition-specific PreemptMode configuration pa‐
5239 rameter will override the cluster-wide PreemptMode for this par‐
5240 tition. It can be set to OFF to disable preemption and gang
5241 scheduling for this partition. See also PriorityTier and the
5242 above description of the cluster-wide PreemptMode parameter for
5243 further details.
5244
5245
5246 PriorityJobFactor
5247 Partition factor used by priority/multifactor plugin in calcu‐
5248 lating job priority. The value may not exceed 65533. Also see
5249 PriorityTier.
5250
5251
5252 PriorityTier
5253 Jobs submitted to a partition with a higher priority tier value
5254              will be dispatched before pending jobs in partitions with lower
5255 priority tier value and, if possible, they will preempt running
5256 jobs from partitions with lower priority tier values. Note that
5257 a partition's priority tier takes precedence over a job's prior‐
5258 ity. The value may not exceed 65533. Also see PriorityJobFac‐
5259 tor.
5260
5261
5262 QOS Used to extend the limits available to a QOS on a partition.
5263 Jobs will not be associated to this QOS outside of being associ‐
5264 ated to the partition. They will still be associated to their
5265 requested QOS. By default, no QOS is used. NOTE: If a limit is
5266 set in both the Partition's QOS and the Job's QOS the Partition
5267 QOS will be honored unless the Job's QOS has the OverPartQOS
5268              flag set, in which case the Job's QOS will take precedence.
5269
5270
5271 ReqResv
5272 Specifies users of this partition are required to designate a
5273 reservation when submitting a job. This option can be useful in
5274 restricting usage of a partition that may have higher priority
5275 or additional resources to be allowed only within a reservation.
5276 Possible values are "YES" and "NO". The default value is "NO".
5277
5278
5279 RootOnly
5280 Specifies if only user ID zero (i.e. user root) may allocate re‐
5281 sources in this partition. User root may allocate resources for
5282 any other user, but the request must be initiated by user root.
5283 This option can be useful for a partition to be managed by some
5284 external entity (e.g. a higher-level job manager) and prevents
5285 users from directly using those resources. Possible values are
5286 "YES" and "NO". The default value is "NO".
5287
5288
5289 SelectTypeParameters
5290 Partition-specific resource allocation type. This option re‐
5291 places the global SelectTypeParameters value. Supported values
5292 are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
5293 Use requires the system-wide SelectTypeParameters value be set
5294 to any of the four supported values previously listed; other‐
5295 wise, the partition-specific value will be ignored.
5296
5297
5298 Shared The Shared configuration parameter has been replaced by the
5299 OverSubscribe parameter described above.
5300
5301
5302 State State of partition or availability for use. Possible values are
5303 "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
5304 See also the related "Alternate" keyword.
5305
5306 UP Designates that new jobs may be queued on the parti‐
5307 tion, and that jobs may be allocated nodes and run
5308 from the partition.
5309
5310 DOWN Designates that new jobs may be queued on the parti‐
5311 tion, but queued jobs may not be allocated nodes and
5312 run from the partition. Jobs already running on the
5313 partition continue to run. The jobs must be explicitly
5314 canceled to force their termination.
5315
5316 DRAIN Designates that no new jobs may be queued on the par‐
5317 tition (job submission requests will be denied with an
5318 error message), but jobs already queued on the parti‐
5319 tion may be allocated nodes and run. See also the
5320 "Alternate" partition specification.
5321
5322 INACTIVE Designates that no new jobs may be queued on the par‐
5323 tition, and jobs already queued may not be allocated
5324 nodes and run. See also the "Alternate" partition
5325 specification.
5326
5327
5328 TRESBillingWeights
5329 TRESBillingWeights is used to define the billing weights of each
5330 TRES type that will be used in calculating the usage of a job.
5331 The calculated usage is used when calculating fairshare and when
5332 enforcing the TRES billing limit on jobs.
5333
5334 Billing weights are specified as a comma-separated list of <TRES
5335 Type>=<TRES Billing Weight> pairs.
5336
5337 Any TRES Type is available for billing. Note that the base unit
5338 for memory and burst buffers is megabytes.
5339
5340 By default the billing of TRES is calculated as the sum of all
5341 TRES types multiplied by their corresponding billing weight.
5342
5343 The weighted amount of a resource can be adjusted by adding a
5344 suffix of K,M,G,T or P after the billing weight. For example, a
5345 memory weight of "mem=.25" on a job allocated 8GB will be billed
5346 2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
5347 same job will be billed 2 (8192MB * (.25/1024)) units.
5348
5349 Negative values are allowed.
5350
5351 When a job is allocated 1 CPU and 8 GB of memory on a partition
5352 configured with TRESBilling‐
5353 Weights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the billable TRES will
5354 be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
5355
5356 If PriorityFlags=MAX_TRES is configured, the billable TRES is
5357 calculated as the MAX of individual TRES' on a node (e.g. cpus,
5358 mem, gres) plus the sum of all global TRES' (e.g. licenses). Us‐
5359 ing the same example above the billable TRES will be MAX(1*1.0,
5360 8*0.25) + (0*2.0) = 2.0.
5361
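The two worked calculations above can be checked mechanically; this throwaway shell snippet simply redoes the arithmetic for the 1-CPU, 8 GB, 0-GPU job:

```shell
# Billing weights from the example: CPU=1.0, Mem=0.25 per GB, GRES/gpu=2.0
cpus=1; mem_gb=8; gpus=0
# Default billing: sum of all weighted TRES
sum=$(awk -v c="$cpus" -v m="$mem_gb" -v g="$gpus" \
    'BEGIN { printf "%.1f", c*1.0 + m*0.25 + g*2.0 }')
# PriorityFlags=MAX_TRES: max of the per-node TRES, plus the remainder
max=$(awk -v c="$cpus" -v m="$mem_gb" -v g="$gpus" \
    'BEGIN { a=c*1.0; b=m*0.25; printf "%.1f", (a>b?a:b) + g*2.0 }')
echo "sum=$sum max=$max"   # sum=3.0 max=2.0
```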
5362 If TRESBillingWeights is not defined then the job is billed
5363 against the total number of allocated CPUs.
5364
5365 NOTE: TRESBillingWeights doesn't affect job priority directly as
5366 it is currently not used for the size of the job. If you want
5367 TRES' to play a role in the job's priority then refer to the
5368 PriorityWeightTRES option.
5369
5370
5371
5373 There are a variety of prolog and epilog program options that execute
5374 with various permissions and at various times. The four options most
5375 likely to be used are: Prolog and Epilog (executed once on each compute
5376 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
5377 once on the ControlMachine for each job).
5378
5379 NOTE: Standard output and error messages are normally not preserved.
5380 Explicitly write output and error messages to an appropriate location
5381 if you wish to preserve that information.
5382
5383 NOTE: By default the Prolog script is ONLY run on any individual node
5384 when it first sees a job step from a new allocation. It does not run
5385 the Prolog immediately when an allocation is granted. If no job steps
5386 from an allocation are run on a node, it will never run the Prolog for
5387 that allocation. This Prolog behaviour can be changed by the Pro‐
5388 logFlags parameter. The Epilog, on the other hand, always runs on ev‐
5389 ery node of an allocation when the allocation is released.
5390
5391 If the Epilog fails (returns a non-zero exit code), this will result in
5392 the node being set to a DRAIN state. If the EpilogSlurmctld fails (re‐
5393 turns a non-zero exit code), this will only be logged. If the Prolog
5394 fails (returns a non-zero exit code), this will result in the node be‐
5395 ing set to a DRAIN state and the job being requeued in a held state un‐
5396 less nohold_on_prolog_fail is configured in SchedulerParameters. If
5397 the PrologSlurmctld fails (returns a non-zero exit code), this will re‐
5398 sult in the job being requeued to be executed on another node if possi‐
5399 ble. Only batch jobs can be requeued. Interactive jobs (salloc and
5400 srun) will be cancelled if the PrologSlurmctld fails.
5401
5402
5403 Information about the job is passed to the script using environment
5404 variables. Unless otherwise specified, these environment variables are
5405 available in each of the scripts mentioned above (Prolog, Epilog, Pro‐
5406 logSlurmctld and EpilogSlurmctld). For a full list of environment vari‐
5407 ables that includes those available in the SrunProlog, SrunEpilog,
5408 TaskProlog and TaskEpilog please see the Prolog and Epilog Guide
5409 <https://slurm.schedmd.com/prolog_epilog.html>.
5410
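As a minimal, hypothetical example of consuming these variables, a Prolog might do no more than log the allocation (the log path is a placeholder; a real site would choose its own destination):

```shell
#!/bin/sh
# Hypothetical Prolog: record which job is starting on this node.
# slurmd exports SLURM_JOB_ID, SLURM_JOB_UID and SLURMD_NODENAME for us;
# the "?" fallbacks only matter when testing outside Slurm.
logfile=/tmp/slurm_prolog.log   # placeholder path
logline="start job=${SLURM_JOB_ID:-?} uid=${SLURM_JOB_UID:-?} node=${SLURMD_NODENAME:-?}"
echo "$logline" >> "$logfile"
# Exiting non-zero here would drain the node and requeue the job.
```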
5411 SLURM_ARRAY_JOB_ID
5412 If this job is part of a job array, this will be set to the job
5413 ID. Otherwise it will not be set. To reference this specific
5414 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5415 RAY_TASK_ID (e.g. "scontrol update ${SLURM_AR‐
5416 RAY_JOB_ID}_{$SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5417 logSlurmctld and EpilogSlurmctld.
5418
5419 SLURM_ARRAY_TASK_ID
5420 If this job is part of a job array, this will be set to the task
5421 ID. Otherwise it will not be set. To reference this specific
5422 task of a job array, combine SLURM_ARRAY_JOB_ID with SLURM_AR‐
5423 RAY_TASK_ID (e.g. "scontrol update ${SLURM_AR‐
5424 RAY_JOB_ID}_{$SLURM_ARRAY_TASK_ID} ..."); Available in Pro‐
5425 logSlurmctld and EpilogSlurmctld.
5426
5427 SLURM_ARRAY_TASK_MAX
5428 If this job is part of a job array, this will be set to the max‐
5429 imum task ID. Otherwise it will not be set. Available in Pro‐
5430 logSlurmctld and EpilogSlurmctld.
5431
5432 SLURM_ARRAY_TASK_MIN
5433 If this job is part of a job array, this will be set to the min‐
5434 imum task ID. Otherwise it will not be set. Available in Pro‐
5435 logSlurmctld and EpilogSlurmctld.
5436
5437 SLURM_ARRAY_TASK_STEP
5438 If this job is part of a job array, this will be set to the step
5439 size of task IDs. Otherwise it will not be set. Available in
5440 PrologSlurmctld and EpilogSlurmctld.
5441
5442 SLURM_CLUSTER_NAME
5443 Name of the cluster executing the job.
5444
5445 SLURM_CONF
5446 Location of the slurm.conf file. Available in Prolog and Epilog.
5447
5448 SLURMD_NODENAME
5449 Name of the node running the task. In the case of a parallel job
5450 executing on multiple compute nodes, the various tasks will have
5451 this environment variable set to different values on each com‐
5452              pute node. Available in Prolog and Epilog.
5453
5454 SLURM_JOB_ACCOUNT
5455 Account name used for the job. Available in PrologSlurmctld and
5456 EpilogSlurmctld.
5457
5458 SLURM_JOB_CONSTRAINTS
5459 Features required to run the job. Available in Prolog, Pro‐
5460 logSlurmctld and EpilogSlurmctld.
5461
5462 SLURM_JOB_DERIVED_EC
5463 The highest exit code of all of the job steps. Available in
5464 EpilogSlurmctld.
5465
5466 SLURM_JOB_EXIT_CODE
5467 The exit code of the job script (or salloc). The value is the
5468 status as returned by the wait() system call (See wait(2))
5469 Available in EpilogSlurmctld.
5470
5471 SLURM_JOB_EXIT_CODE2
5472 The exit code of the job script (or salloc). The value has the
5473 format <exit>:<sig>. The first number is the exit code, typi‐
5474              cally as set by the exit() function. The second number is the
5475              signal that caused the process to terminate, if it was termi‐
5476              nated by a signal. Available in EpilogSlurmctld.
5477
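Splitting that <exit>:<sig> pair inside an EpilogSlurmctld is a one-liner with shell parameter expansion; the value below is hard-coded purely for illustration (slurmctld exports the real one):

```shell
SLURM_JOB_EXIT_CODE2="137:9"            # illustrative value only
exit_part=${SLURM_JOB_EXIT_CODE2%%:*}   # text before the ':'
sig_part=${SLURM_JOB_EXIT_CODE2##*:}    # text after the ':'
echo "exit=$exit_part signal=$sig_part" # exit=137 signal=9
```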
5478 SLURM_JOB_GID
5479 Group ID of the job's owner.
5480
5481 SLURM_JOB_GPUS
5482 GPU IDs allocated to the job (if any). Available in the Prolog.
5483
5484 SLURM_JOB_GROUP
5485 Group name of the job's owner. Available in PrologSlurmctld and
5486 EpilogSlurmctld.
5487
5488 SLURM_JOB_ID
5489 Job ID.
5490
5491 SLURM_JOBID
5492 Job ID.
5493
5494 SLURM_JOB_NAME
5495 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5496 ctld.
5497
5498 SLURM_JOB_NODELIST
5499 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5500 show hostnames" can be used to convert this to a list of indi‐
5501 vidual host names. Available in PrologSlurmctld and Epi‐
5502 logSlurmctld.
5503
5504 SLURM_JOB_PARTITION
5505 Partition that job runs in. Available in Prolog, PrologSlurm‐
5506 ctld and EpilogSlurmctld.
5507
5508 SLURM_JOB_UID
5509 User ID of the job's owner.
5510
5511 SLURM_JOB_USER
5512 User name of the job's owner.
5513
5514 SLURM_SCRIPT_CONTEXT
5515 Identifies which epilog or prolog program is currently running.
5516
5517
5518UNKILLABLE STEP PROGRAM SCRIPT
5519 This program can be used to take special actions to clean up the unkil‐
5520 lable processes and/or notify system administrators. The program will
5521 be run as SlurmdUser (usually "root") on the compute node where Unkill‐
5522 ableStepTimeout was triggered.
5523
5524 Information about the unkillable job step is passed to the script using
5525 environment variables.
5526
5527 SLURM_JOB_ID
5528 Job ID.
5529
5530 SLURM_STEP_ID
5531 Job Step ID.
5532
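A bare-bones, hypothetical UnkillableStepProgram might just flag the stuck step for operators:

```shell
#!/bin/sh
# Hypothetical UnkillableStepProgram: note the stuck step and tell syslog.
# SLURM_JOB_ID and SLURM_STEP_ID are set by slurmd; "?" is a test fallback.
msg="unkillable step ${SLURM_JOB_ID:-?}.${SLURM_STEP_ID:-?} on $(hostname)"
logger -t slurm-unkillable "$msg" 2>/dev/null || true
echo "$msg"
```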
5533
5534NETWORK TOPOLOGY
5535 Slurm is able to optimize job allocations to minimize network con‐
5536 tention. Special Slurm logic is used to optimize allocations on sys‐
5537    tems with a three-dimensional interconnect, and information about con‐
5538    figuring those systems is available on web pages here:
5539 <https://slurm.schedmd.com/>. For a hierarchical network, Slurm needs
5540 to have detailed information about how nodes are configured on the net‐
5541 work switches.
5542
5543 Given network topology information, Slurm allocates all of a job's re‐
5544 sources onto a single leaf of the network (if possible) using a
5545 best-fit algorithm. Otherwise it will allocate a job's resources onto
5546 multiple leaf switches so as to minimize the use of higher-level
5547 switches. The TopologyPlugin parameter controls which plugin is used
5548 to collect network topology information. The only values presently
5549 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5550 forms best-fit logic over three-dimensional topology), "topology/none"
5551 (default for other systems, best-fit logic over one-dimensional topol‐
5552 ogy), "topology/tree" (determine the network topology based upon infor‐
5553 mation contained in a topology.conf file, see "man topology.conf" for
5554 more information). Future plugins may gather topology information di‐
5555 rectly from the network. The topology information is optional. If not
5556 provided, Slurm will perform a best-fit algorithm assuming the nodes
5557 are in a one-dimensional array as configured and the communications
5558 cost is related to the node distance in this array.
5559
5560
5561RELOCATING CONTROLLERS
5562 If the cluster's computers used for the primary or backup controller
5563 will be out of service for an extended period of time, it may be desir‐
5564 able to relocate them. In order to do so, follow this procedure:
5565
5566 1. Stop the Slurm daemons
5567 2. Modify the slurm.conf file appropriately
5568 3. Distribute the updated slurm.conf file to all nodes
5569 4. Restart the Slurm daemons
5570
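The four steps might look like this in practice (host names, paths, and the use of systemd are site assumptions; treat this as pseudocode, not copy-paste commands):

```
systemctl stop slurmctld slurmd        # 1. stop the Slurm daemons
vi /etc/slurm.conf                     # 2. update the SlurmctldHost lines
scp /etc/slurm.conf <each-node>:/etc/  # 3. distribute to every node
systemctl start slurmctld slurmd       # 4. restart the daemons
```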
5571 There should be no loss of any running or pending jobs. Ensure that
5572 any nodes added to the cluster have the current slurm.conf file in‐
5573 stalled.
5574
5575 CAUTION: If two nodes are simultaneously configured as the primary con‐
5576 troller (two nodes on which SlurmctldHost specify the local host and
5577 the slurmctld daemon is executing on each), system behavior will be de‐
5578 structive. If a compute node has an incorrect SlurmctldHost parameter,
5579 that node may be rendered unusable, but no other harm will result.
5580
5581
5582EXAMPLE
5583 #
5584 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5585 # Author: John Doe
5586 # Date: 11/06/2001
5587 #
5588 SlurmctldHost=dev0(12.34.56.78) # Primary server
5589 SlurmctldHost=dev1(12.34.56.79) # Backup server
5590 #
5591 AuthType=auth/munge
5592 Epilog=/usr/local/slurm/epilog
5593 Prolog=/usr/local/slurm/prolog
5594 FirstJobId=65536
5595 InactiveLimit=120
5596 JobCompType=jobcomp/filetxt
5597 JobCompLoc=/var/log/slurm/jobcomp
5598 KillWait=30
5599 MaxJobCount=10000
5600 MinJobAge=3600
5601 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5602 ReturnToService=0
5603 SchedulerType=sched/backfill
5604 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5605 SlurmdLogFile=/var/log/slurm/slurmd.log
5606 SlurmctldPort=7002
5607 SlurmdPort=7003
5608 SlurmdSpoolDir=/var/spool/slurmd.spool
5609 StateSaveLocation=/var/spool/slurm.state
5610 SwitchType=switch/none
5611 TmpFS=/tmp
5612 WaitTime=30
5613 JobCredentialPrivateKey=/usr/local/slurm/private.key
5614 JobCredentialPublicCertificate=/usr/local/slurm/public.cert
5615 #
5616 # Node Configurations
5617 #
5618 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5619 NodeName=DEFAULT State=UNKNOWN
5620 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5621 # Update records for specific DOWN nodes
5622 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5623 #
5624 # Partition Configurations
5625 #
5626 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5627 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5628 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5629 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5630
5631
5632INCLUDE MODIFIERS
5633    The "Include" keyword can be used with modifiers within the specified
5634    pathname. These modifiers are replaced with the cluster name or other
5635    information, depending on which modifier is specified. If the included
5636 file is not an absolute path name (i.e. it does not start with a
5637    slash), it will be searched for in the same directory as the slurm.conf
5638 file.
5639
5640 %c Cluster name specified in the slurm.conf will be used.
5641
5642 EXAMPLE
5643 ClusterName=linux
5644 include /home/slurm/etc/%c_config
5645 # Above line interpreted as
5646 # "include /home/slurm/etc/linux_config"
5647
5648
5649FILE AND DIRECTORY PERMISSIONS
5650 There are three classes of files: Files used by slurmctld must be ac‐
5651 cessible by user SlurmUser and accessible by the primary and backup
5652 control machines. Files used by slurmd must be accessible by user root
5653 and accessible from every compute node. A few files need to be acces‐
5654 sible by normal users on all login and compute nodes. While many files
5655 and directories are listed below, most of them will not be used with
5656 most configurations.
5657
5658 Epilog Must be executable by user root. It is recommended that the
5659 file be readable by all users. The file must exist on every
5660 compute node.
5661
5662 EpilogSlurmctld
5663 Must be executable by user SlurmUser. It is recommended that
5664 the file be readable by all users. The file must be accessible
5665 by the primary and backup control machines.
5666
5667 HealthCheckProgram
5668 Must be executable by user root. It is recommended that the
5669 file be readable by all users. The file must exist on every
5670 compute node.
5671
5672 JobCompLoc
5673 If this specifies a file, it must be writable by user SlurmUser.
5674 The file must be accessible by the primary and backup control
5675 machines.
5676
5677 JobCredentialPrivateKey
5678 Must be readable only by user SlurmUser and writable by no other
5679 users. The file must be accessible by the primary and backup
5680 control machines.
5681
5682 JobCredentialPublicCertificate
5683 Readable to all users on all nodes. Must not be writable by
5684 regular users.
5685
5686 MailProg
5687 Must be executable by user SlurmUser. Must not be writable by
5688 regular users. The file must be accessible by the primary and
5689 backup control machines.
5690
5691 Prolog Must be executable by user root. It is recommended that the
5692 file be readable by all users. The file must exist on every
5693 compute node.
5694
5695 PrologSlurmctld
5696 Must be executable by user SlurmUser. It is recommended that
5697 the file be readable by all users. The file must be accessible
5698 by the primary and backup control machines.
5699
5700 ResumeProgram
5701 Must be executable by user SlurmUser. The file must be accessi‐
5702 ble by the primary and backup control machines.
5703
5704 slurm.conf
5705 Readable to all users on all nodes. Must not be writable by
5706 regular users.
5707
5708 SlurmctldLogFile
5709 Must be writable by user SlurmUser. The file must be accessible
5710 by the primary and backup control machines.
5711
5712 SlurmctldPidFile
5713 Must be writable by user root. Preferably writable and remov‐
5714 able by SlurmUser. The file must be accessible by the primary
5715 and backup control machines.
5716
5717 SlurmdLogFile
5718 Must be writable by user root. A distinct file must exist on
5719 each compute node.
5720
5721 SlurmdPidFile
5722 Must be writable by user root. A distinct file must exist on
5723 each compute node.
5724
5725 SlurmdSpoolDir
5726           Must be writable by user root. A distinct directory must exist
5727           on each compute node.
5728
5729 SrunEpilog
5730 Must be executable by all users. The file must exist on every
5731 login and compute node.
5732
5733 SrunProlog
5734 Must be executable by all users. The file must exist on every
5735 login and compute node.
5736
5737 StateSaveLocation
5738 Must be writable by user SlurmUser. The file must be accessible
5739 by the primary and backup control machines.
5740
5741 SuspendProgram
5742 Must be executable by user SlurmUser. The file must be accessi‐
5743 ble by the primary and backup control machines.
5744
5745 TaskEpilog
5746 Must be executable by all users. The file must exist on every
5747 compute node.
5748
5749 TaskProlog
5750 Must be executable by all users. The file must exist on every
5751 compute node.
5752
5753 UnkillableStepProgram
5754 Must be executable by user SlurmUser. The file must be accessi‐
5755 ble by the primary and backup control machines.
5756
5757
5758LOGGING
5759    Note that while Slurm daemons create log files and other files as
5760    needed, they treat the lack of parent directories as a fatal error.
5761 This prevents the daemons from running if critical file systems are not
5762 mounted and will minimize the risk of cold-starting (starting without
5763 preserving jobs).
5764
5765    Log files and job accounting files may need to be created/owned by the
5766 "SlurmUser" uid to be successfully accessed. Use the "chown" and
5767 "chmod" commands to set the ownership and permissions appropriately.
5768 See the section FILE AND DIRECTORY PERMISSIONS for information about
5769 the various files and directories used by Slurm.
5770
5771 It is recommended that the logrotate utility be used to ensure that
5772 various log files do not become too large. This also applies to text
5773 files used for accounting, process tracking, and the slurmdbd log if
5774 they are used.
5775
5776 Here is a sample logrotate configuration. Make appropriate site modifi‐
5777 cations and save as /etc/logrotate.d/slurm on all nodes. See the
5778 logrotate man page for more details.
5779
5780 ##
5781 # Slurm Logrotate Configuration
5782 ##
5783 /var/log/slurm/*.log {
5784 compress
5785 missingok
5786 nocopytruncate
5787 nodelaycompress
5788 nomail
5789 notifempty
5790 noolddir
5791 rotate 5
5792 sharedscripts
5793 size=5M
5794 create 640 slurm root
5795 postrotate
5796 pkill -x --signal SIGUSR2 slurmctld
5797 pkill -x --signal SIGUSR2 slurmd
5798 pkill -x --signal SIGUSR2 slurmdbd
5799 exit 0
5800 endscript
5801 }
5802
5803COPYING
5804 Copyright (C) 2002-2007 The Regents of the University of California.
5805 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
5806 Copyright (C) 2008-2010 Lawrence Livermore National Security.
5807 Copyright (C) 2010-2017 SchedMD LLC.
5808
5809 This file is part of Slurm, a resource management program. For de‐
5810 tails, see <https://slurm.schedmd.com/>.
5811
5812 Slurm is free software; you can redistribute it and/or modify it under
5813 the terms of the GNU General Public License as published by the Free
5814 Software Foundation; either version 2 of the License, or (at your op‐
5815 tion) any later version.
5816
5817 Slurm is distributed in the hope that it will be useful, but WITHOUT
5818 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
5819 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
5820 for more details.
5821
5822
5823FILES
5824 /etc/slurm.conf
5825
5826
5827SEE ALSO
5828 cgroup.conf(5), getaddrinfo(3), getrlimit(2), gres.conf(5), group(5),
5829 hostname(1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8), slur‐
5830 mdbd.conf(5), srun(1), spank(8), syslog(3), topology.conf(5)
5831
5832
5833
5834May 2021 Slurm Configuration File slurm.conf(5)