slurm.conf(5)              Slurm Configuration File              slurm.conf(5)
2
3
4
NAME
       slurm.conf - Slurm configuration file
7
8
DESCRIPTION
       slurm.conf is an ASCII file which describes general Slurm configuration
11 information, the nodes to be managed, information about how those nodes
12 are grouped into partitions, and various scheduling parameters associ‐
13 ated with those partitions. This file should be consistent across all
14 nodes in the cluster.
15
16 The file location can be modified at system build time using the
17 DEFAULT_SLURM_CONF parameter or at execution time by setting the
18 SLURM_CONF environment variable. The Slurm daemons also allow you to
19 override both the built-in and environment-provided location using the
20 "-f" option on the command line.
21
22 The contents of the file are case insensitive except for the names of
23 nodes and partitions. Any text following a "#" in the configuration
24 file is treated as a comment through the end of that line. Changes to
25 the configuration file take effect upon restart of Slurm daemons, dae‐
26 mon receipt of the SIGHUP signal, or execution of the command "scontrol
27 reconfigure" unless otherwise noted.
28
29 If a line begins with the word "Include" followed by whitespace and
30 then a file name, that file will be included inline with the current
31 configuration file. For large or complex systems, multiple configura‐
32 tion files may prove easier to manage and enable reuse of some files
33 (See INCLUDE MODIFIERS for more details).
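
       For example, node and partition definitions can be kept in sepa‐
       rate files (the file names shown here are only illustrative):

            Include /etc/slurm/nodes.conf
            Include /etc/slurm/partitions.conf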
34
35 Note on file permissions:
36
37 The slurm.conf file must be readable by all users of Slurm, since it is
38 used by many of the Slurm commands. Other files that are defined in
39 the slurm.conf file, such as log files and job accounting files, may
40 need to be created/owned by the user "SlurmUser" to be successfully
41 accessed. Use the "chown" and "chmod" commands to set the ownership
42 and permissions appropriately. See the section FILE AND DIRECTORY PER‐
43 MISSIONS for information about the various files and directories used
44 by Slurm.
45
46
PARAMETERS
       The overall configuration parameters available include:
49
50
51 AccountingStorageBackupHost
52 The name of the backup machine hosting the accounting storage
53 database. If used with the accounting_storage/slurmdbd plugin,
54 this is where the backup slurmdbd would be running. Only used
55 with systems using SlurmDBD, ignored otherwise.
56
57
58 AccountingStorageEnforce
59 This controls what level of association-based enforcement to
60 impose on job submissions. Valid options are any combination of
61 associations, limits, nojobs, nosteps, qos, safe, and wckeys, or
62 all for all things (except nojobs and nosteps, which must be
63 requested as well).
64
65 If limits, qos, or wckeys are set, associations will automati‐
66 cally be set.
67
68 If wckeys is set, TrackWCKey will automatically be set.
69
70 If safe is set, limits and associations will automatically be
71 set.
72
73 If nojobs is set, nosteps will automatically be set.
74
75 By setting associations, no new job is allowed to run unless a
76 corresponding association exists in the system. If limits are
77 enforced, users can be limited by association to whatever job
78 size or run time limits are defined.
79
80 If nojobs is set, Slurm will not account for any jobs or steps
81 on the system. Likewise, if nosteps is set, Slurm will not
82 account for any steps that have run.
83
84 If safe is enforced, a job will only be launched against an
85 association or qos that has a GrpTRESMins limit set, if the job
86 will be able to run to completion. Without this option set, jobs
87 will be launched as long as their usage hasn't reached the cpu-
88 minutes limit. This can lead to jobs being launched but then
89 killed when the limit is reached.
90
91 With qos and/or wckeys enforced jobs will not be scheduled
92 unless a valid qos and/or workload characterization key is spec‐
93 ified.
94
95 When AccountingStorageEnforce is changed, a restart of the
96 slurmctld daemon is required (not just a "scontrol reconfig").
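
              A typical enforcing setup, shown only as an illustration,
              might combine several of these options:

                   AccountingStorageEnforce=associations,limits,qos,safe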
97
98
99 AccountingStorageExternalHost
100 A comma separated list of external slurmdbds
101 (<host/ip>[:port][,...]) to register with. If no port is given,
102 the AccountingStoragePort will be used.
103
104 This allows clusters registered with the external slurmdbd to
105 communicate with each other using the --cluster/-M client com‐
106 mand options.
107
108 The cluster will add itself to the external slurmdbd if it
109 doesn't exist. If a non-external cluster already exists on the
110 external slurmdbd, the slurmctld will ignore registering to the
111 external slurmdbd.
112
113
114 AccountingStorageHost
115 The name of the machine hosting the accounting storage database.
116 Only used with systems using SlurmDBD, ignored otherwise. Also
117 see DefaultStorageHost.
118
119
120 AccountingStorageParameters
121 Comma separated list of key-value pair parameters. Currently
122 supported values include options to establish a secure connec‐
123 tion to the database:
124
125 SSL_CERT
126 The path name of the client public key certificate file.
127
128 SSL_CA
129 The path name of the Certificate Authority (CA) certificate
130 file.
131
132 SSL_CAPATH
133 The path name of the directory that contains trusted SSL CA
134 certificate files.
135
136 SSL_KEY
137 The path name of the client private key file.
138
139 SSL_CIPHER
140 The list of permissible ciphers for SSL encryption.
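
              For example, a TLS-protected connection to the database
              could be configured as follows (the certificate and key
              paths are placeholders):

                   AccountingStorageParameters=SSL_CERT=/etc/slurm/ssl/client-cert.pem,SSL_KEY=/etc/slurm/ssl/client-key.pem,SSL_CA=/etc/slurm/ssl/ca-cert.pem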
141
142
143 AccountingStoragePass
144 The password used to gain access to the database to store the
145 accounting data. Only used for database type storage plugins,
146 ignored otherwise. In the case of Slurm DBD (Database Daemon)
147 with MUNGE authentication this can be configured to use a MUNGE
148 daemon specifically configured to provide authentication between
149 clusters while the default MUNGE daemon provides authentication
150 within a cluster. In that case, AccountingStoragePass should
151 specify the named port to be used for communications with the
152 alternate MUNGE daemon (e.g. "/var/run/munge/global.socket.2").
153 The default value is NULL. Also see DefaultStoragePass.
154
155
156 AccountingStoragePort
157 The listening port of the accounting storage database server.
158 Only used for database type storage plugins, ignored otherwise.
159 The default value is SLURMDBD_PORT as established at system
160 build time. If no value is explicitly specified, it will be set
161 to 6819. This value must be equal to the DbdPort parameter in
162 the slurmdbd.conf file. Also see DefaultStoragePort.
163
164
165 AccountingStorageTRES
166 Comma separated list of resources you wish to track on the clus‐
167 ter. These are the resources requested by the sbatch/srun job
168 when it is submitted. Currently this consists of any GRES, BB
169 (burst buffer) or license along with CPU, Memory, Node, Energy,
170 FS/[Disk|Lustre], IC/OFED, Pages, and VMem. By default Billing,
171 CPU, Energy, Memory, Node, FS/Disk, Pages and VMem are tracked.
172 These default TRES cannot be disabled, but only appended to.
173 AccountingStorageTRES=gres/craynetwork,license/iop1 will track
174 billing, cpu, energy, memory, nodes, fs/disk, pages and vmem
175 along with a gres called craynetwork as well as a license called
176 iop1. Whenever these resources are used on the cluster they are
177 recorded. The TRES are automatically set up in the database on
178 the start of the slurmctld.
179
180 If multiple GRES of different types are tracked (e.g. GPUs of
181 different types), then job requests with matching type specifi‐
182 cations will be recorded. Given a configuration of "Account‐
              ingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta", then
184 "gres/gpu:tesla" and "gres/gpu:volta" will track only jobs that
185 explicitly request those two GPU types, while "gres/gpu" will
186 track allocated GPUs of any type ("tesla", "volta" or any other
187 GPU type).
188
189 Given a configuration of "AccountingStorage‐
              TRES=gres/gpu:tesla,gres/gpu:volta", then "gres/gpu:tesla" and
191 "gres/gpu:volta" will track jobs that explicitly request those
192 GPU types. If a job requests GPUs, but does not explicitly
193 specify the GPU type, then its resource allocation will be
194 accounted for as either "gres/gpu:tesla" or "gres/gpu:volta",
195 although the accounting may not match the actual GPU type allo‐
196 cated to the job and the GPUs allocated to the job could be het‐
197 erogeneous. In an environment containing various GPU types, use
198 of a job_submit plugin may be desired in order to force jobs to
199 explicitly specify some GPU type.
200
201
202 AccountingStorageType
203 The accounting storage mechanism type. Acceptable values at
204 present include "accounting_storage/none" and "accounting_stor‐
205 age/slurmdbd". The "accounting_storage/slurmdbd" value indi‐
206 cates that accounting records will be written to the Slurm DBD,
207 which manages an underlying MySQL database. See "man slurmdbd"
208 for more information. The default value is "accounting_stor‐
209 age/none" and indicates that account records are not maintained.
210 Also see DefaultStorageType.
211
212
213 AccountingStorageUser
214 The user account for accessing the accounting storage database.
215 Only used for database type storage plugins, ignored otherwise.
216 Also see DefaultStorageUser.
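
              When SlurmDBD is used, the accounting storage options are
              typically combined along these lines (the host name is
              site-specific):

                   AccountingStorageType=accounting_storage/slurmdbd
                   AccountingStorageHost=dbd.example.com
                   AccountingStoragePort=6819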
217
218
219 AccountingStoreJobComment
220 If set to "YES" then include the job's comment field in the job
221 complete message sent to the Accounting Storage database. The
222 default is "YES". Note the AdminComment and SystemComment are
223 always recorded in the database.
224
225
226 AcctGatherNodeFreq
227 The AcctGather plugins sampling interval for node accounting.
228 For AcctGather plugin values of none, this parameter is ignored.
229 For all other values this parameter is the number of seconds
230 between node accounting samples. For the acct_gather_energy/rapl
231 plugin, set a value less than 300 because the counters may over‐
232 flow beyond this rate. The default value is zero. This value
233 disables accounting sampling for nodes. Note: The accounting
234 sampling interval for jobs is determined by the value of JobAc‐
235 ctGatherFrequency.
236
237
238 AcctGatherEnergyType
239 Identifies the plugin to be used for energy consumption account‐
240 ing. The jobacct_gather plugin and slurmd daemon call this
241 plugin to collect energy consumption data for jobs and nodes.
242 The collection of energy consumption data takes place on the
243 node level, hence only in case of exclusive job allocation the
244 energy consumption measurements will reflect the job's real con‐
245 sumption. In case of node sharing between jobs the reported con‐
246 sumed energy per job (through sstat or sacct) will not reflect
247 the real energy consumed by the jobs.
248
249 Configurable values at present are:
250
251 acct_gather_energy/none
252 No energy consumption data is collected.
253
254 acct_gather_energy/ipmi
255 Energy consumption data is collected from
256 the Baseboard Management Controller (BMC)
257 using the Intelligent Platform Management
258 Interface (IPMI).
259
260 acct_gather_energy/pm_counters
261 Energy consumption data is collected from
262 the Baseboard Management Controller (BMC)
263 for HPE Cray systems.
264
265 acct_gather_energy/rapl
266 Energy consumption data is collected from
267 hardware sensors using the Running Average
268 Power Limit (RAPL) mechanism. Note that
269 enabling RAPL may require the execution of
270 the command "sudo modprobe msr".
271
272 acct_gather_energy/xcc
273 Energy consumption data is collected from
274 the Lenovo SD650 XClarity Controller (XCC)
275 using IPMI OEM raw commands.
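
              For example, to sample RAPL energy counters every 30 sec‐
              onds (an illustrative pairing with AcctGatherNodeFreq):

                   AcctGatherEnergyType=acct_gather_energy/rapl
                   AcctGatherNodeFreq=30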
276
277
278 AcctGatherInterconnectType
279 Identifies the plugin to be used for interconnect network traf‐
280 fic accounting. The jobacct_gather plugin and slurmd daemon
281 call this plugin to collect network traffic data for jobs and
282 nodes. The collection of network traffic data takes place on
283 the node level, hence only in case of exclusive job allocation
284 the collected values will reflect the job's real traffic. In
285 case of node sharing between jobs the reported network traffic
286 per job (through sstat or sacct) will not reflect the real net‐
287 work traffic by the jobs.
288
289 Configurable values at present are:
290
291 acct_gather_interconnect/none
292 No infiniband network data are collected.
293
294 acct_gather_interconnect/ofed
295 Infiniband network traffic data are col‐
296 lected from the hardware monitoring counters
297 of Infiniband devices through the OFED
298 library. In order to account for per job
299 network traffic, add the "ic/ofed" TRES to
300 AccountingStorageTRES.
301
302
303 AcctGatherFilesystemType
304 Identifies the plugin to be used for filesystem traffic account‐
305 ing. The jobacct_gather plugin and slurmd daemon call this
306 plugin to collect filesystem traffic data for jobs and nodes.
307 The collection of filesystem traffic data takes place on the
308 node level, hence only in case of exclusive job allocation the
309 collected values will reflect the job's real traffic. In case of
310 node sharing between jobs the reported filesystem traffic per
311 job (through sstat or sacct) will not reflect the real filesys‐
312 tem traffic by the jobs.
313
314
315 Configurable values at present are:
316
317 acct_gather_filesystem/none
318 No filesystem data are collected.
319
320 acct_gather_filesystem/lustre
321 Lustre filesystem traffic data are collected
322 from the counters found in /proc/fs/lustre/.
323 In order to account for per job lustre traf‐
324 fic, add the "fs/lustre" TRES to Account‐
325 ingStorageTRES.
326
327
328 AcctGatherProfileType
329 Identifies the plugin to be used for detailed job profiling.
330 The jobacct_gather plugin and slurmd daemon call this plugin to
331 collect detailed data such as I/O counts, memory usage, or
332 energy consumption for jobs and nodes. There are interfaces in
333 this plugin to collect data as step start and completion, task
334 start and completion, and at the account gather frequency. The
335 data collected at the node level is related to jobs only in case
336 of exclusive job allocation.
337
338 Configurable values at present are:
339
340 acct_gather_profile/none
341 No profile data is collected.
342
343 acct_gather_profile/hdf5
344 This enables the HDF5 plugin. The directory
345 where the profile files are stored and which
346 values are collected are configured in the
347 acct_gather.conf file.
348
349 acct_gather_profile/influxdb
350 This enables the influxdb plugin. The
351 influxdb instance host, port, database,
352 retention policy and which values are col‐
353 lected are configured in the
354 acct_gather.conf file.
355
356
357 AllowSpecResourcesUsage
              If set to "YES", Slurm allows individual jobs to override a node's
359 configured CoreSpecCount value. For a job to take advantage of
360 this feature, a command line option of --core-spec must be spec‐
361 ified. The default value for this option is "YES" for Cray sys‐
362 tems and "NO" for other system types.
363
364
365 AuthAltTypes
366 Comma separated list of alternative authentication plugins that
367 the slurmctld will permit for communication. Acceptable values
368 at present include auth/jwt.
369
370 NOTE: auth/jwt requires a jwt_hs256.key to be populated in the
371 StateSaveLocation directory for slurmctld only. The
372 jwt_hs256.key should only be visible to the SlurmUser and root.
373 It is not suggested to place the jwt_hs256.key on any nodes but
374 the controller running slurmctld. auth/jwt can be activated by
375 the presence of the SLURM_JWT environment variable. When acti‐
376 vated, it will override the default AuthType.
377
378
379 AuthAltParameters
380 Used to define alternative authentication plugins options. Mul‐
381 tiple options may be comma separated.
382
383 disable_token_creation
384 Disable "scontrol token" use by non-SlurmUser
385 accounts.
386
387 jwt_key= Absolute path to JWT key file. Key must be HS256,
388 and should only be accessible by SlurmUser. If
389 not set, the default key file is jwt_hs256.key in
390 StateSaveLocation.
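
              For example, to permit JWT authentication in addition to
              the default AuthType (the key path below is only illus‐
              trative and assumes the default file name in StateSave‐
              Location):

                   AuthAltTypes=auth/jwt
                   AuthAltParameters=jwt_key=/var/spool/slurmctld/jwt_hs256.key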
391
392
393 AuthInfo
394 Additional information to be used for authentication of communi‐
395 cations between the Slurm daemons (slurmctld and slurmd) and the
396 Slurm clients. The interpretation of this option is specific to
397 the configured AuthType. Multiple options may be specified in a
398 comma delimited list. If not specified, the default authentica‐
399 tion information will be used.
400
401 cred_expire Default job step credential lifetime, in seconds
402 (e.g. "cred_expire=1200"). It must be suffi‐
                          ciently long to load the user environment, run
404 prolog, deal with the slurmd getting paged out of
405 memory, etc. This also controls how long a
406 requeued job must wait before starting again. The
407 default value is 120 seconds.
408
409 socket Path name to a MUNGE daemon socket to use (e.g.
410 "socket=/var/run/munge/munge.socket.2"). The
411 default value is "/var/run/munge/munge.socket.2".
412 Used by auth/munge and cred/munge.
413
414 ttl Credential lifetime, in seconds (e.g. "ttl=300").
415 The default value is dependent upon the MUNGE
416 installation, but is typically 300 seconds.
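
              For example (the values shown are illustrative only):

                   AuthInfo=socket=/var/run/munge/munge.socket.2,cred_expire=300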
417
418
419 AuthType
420 The authentication method for communications between Slurm com‐
421 ponents. Acceptable values at present include "auth/munge" and
422 "auth/none". The default value is "auth/munge". "auth/none"
423 includes the UID in each communication, but it is not verified.
424 This may be fine for testing purposes, but do not use
425 "auth/none" if you desire any security. "auth/munge" indicates
426 that MUNGE is to be used. (See "https://dun.github.io/munge/"
427 for more information). All Slurm daemons and commands must be
428 terminated prior to changing the value of AuthType and later
429 restarted.
430
431
432 BackupAddr
433 Deprecated option, see SlurmctldHost.
434
435
436 BackupController
437 Deprecated option, see SlurmctldHost.
438
439 The backup controller recovers state information from the State‐
440 SaveLocation directory, which must be readable and writable from
441 both the primary and backup controllers. While not essential,
442 it is recommended that you specify a backup controller. See
443 the RELOCATING CONTROLLERS section if you change this.
444
445
446 BatchStartTimeout
447 The maximum time (in seconds) that a batch job is permitted for
448 launching before being considered missing and releasing the
449 allocation. The default value is 10 (seconds). Larger values may
450 be required if more time is required to execute the Prolog, load
451 user environment variables, or if the slurmd daemon gets paged
452 from memory.
453 Note: The test for a job being successfully launched is only
454 performed when the Slurm daemon on the compute node registers
455 state with the slurmctld daemon on the head node, which happens
456 fairly rarely. Therefore a job will not necessarily be termi‐
457 nated if its start time exceeds BatchStartTimeout. This config‐
              uration parameter is also applied to task launch, to avoid
              aborting srun commands due to long-running Prolog scripts.
460
461
462 BurstBufferType
463 The plugin used to manage burst buffers. Acceptable values at
464 present are:
465
466 burst_buffer/datawarp
467 Use Cray DataWarp API to provide burst buffer functional‐
468 ity.
469
470 burst_buffer/none
471
472
473 CliFilterPlugins
474 A comma delimited list of command line interface option fil‐
475 ter/modification plugins. The specified plugins will be executed
476 in the order listed. These are intended to be site-specific
477 plugins which can be used to set default job parameters and/or
478 logging events. No cli_filter plugins are used by default.
479
480
481 ClusterName
482 The name by which this Slurm managed cluster is known in the
              accounting database. This is needed to distinguish accounting
484 records when multiple clusters report to the same database.
485 Because of limitations in some databases, any upper case letters
486 in the name will be silently mapped to lower case. In order to
487 avoid confusion, it is recommended that the name be lower case.
488
489
490 CommunicationParameters
491 Comma separated options identifying communication options.
492
493 CheckGhalQuiesce
494 Used specifically on a Cray using an Aries Ghal
495 interconnect. This will check to see if the sys‐
496 tem is quiescing when sending a message, and if
497 so, we wait until it is done before sending.
498
499 DisableIPv4 Disable IPv4 only operation for all slurm daemons
500 (except slurmdbd). This should also be set in
501 your slurmdbd.conf file.
502
503 EnableIPv6 Enable using IPv6 addresses for all slurm daemons
504 (except slurmdbd). When using both IPv4 and IPv6,
505 address family preferences will be based on your
506 /etc/gai.conf file. This should also be set in
507 your slurmdbd.conf file.
508
              NoAddrCache By default, Slurm will cache a node's network
                          address after it has been successfully resolved.
                          This option disables the
512 cache and Slurm will look up the node's network
513 address each time a connection is made. This is
514 useful, for example, in a cloud environment where
515 the node addresses come and go out of DNS.
516
517 NoCtldInAddrAny
518 Used to directly bind to the address of what the
519 node resolves to running the slurmctld instead of
520 binding messages to any address on the node,
521 which is the default.
522
523 NoInAddrAny Used to directly bind to the address of what the
524 node resolves to instead of binding messages to
525 any address on the node which is the default.
526 This option is for all daemons/clients except for
527 the slurmctld.
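
              For example, one possible combination of these options is:

                   CommunicationParameters=EnableIPv6,NoAddrCache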
528
529
530
531 CompleteWait
532 The time to wait, in seconds, when any job is in the COMPLETING
533 state before any additional jobs are scheduled. This is to
534 attempt to keep jobs on nodes that were recently in use, with
535 the goal of preventing fragmentation. If set to zero, pending
536 jobs will be started as soon as possible. Since a COMPLETING
537 job's resources are released for use by other jobs as soon as
538 the Epilog completes on each individual node, this can result in
539 very fragmented resource allocations. To provide jobs with the
540 minimum response time, a value of zero is recommended (no wait‐
541 ing). To minimize fragmentation of resources, a value equal to
542 KillWait plus two is recommended. In that case, setting Kill‐
543 Wait to a small value may be beneficial. The default value of
544 CompleteWait is zero seconds. The value may not exceed 65533.
545
546 NOTE: Setting reduce_completing_frag affects the behavior of
547 CompleteWait.
548
549
550 ControlAddr
551 Deprecated option, see SlurmctldHost.
552
553
554 ControlMachine
555 Deprecated option, see SlurmctldHost.
556
557
558 CoreSpecPlugin
559 Identifies the plugins to be used for enforcement of core spe‐
560 cialization. The slurmd daemon must be restarted for a change
561 in CoreSpecPlugin to take effect. Acceptable values at present
562 include:
563
564 core_spec/cray_aries
565 used only for Cray systems
566
567 core_spec/none used for all other system types
568
569
570 CpuFreqDef
571 Default CPU frequency value or frequency governor to use when
572 running a job step if it has not been explicitly set with the
573 --cpu-freq option. Acceptable values at present include a
574 numeric value (frequency in kilohertz) or one of the following
575 governors:
576
577 Conservative attempts to use the Conservative CPU governor
578
579 OnDemand attempts to use the OnDemand CPU governor
580
581 Performance attempts to use the Performance CPU governor
582
583 PowerSave attempts to use the PowerSave CPU governor
584 There is no default value. If unset, no attempt to set the governor is
585 made if the --cpu-freq option has not been set.
586
587
588 CpuFreqGovernors
589 List of CPU frequency governors allowed to be set with the sal‐
590 loc, sbatch, or srun option --cpu-freq. Acceptable values at
591 present include:
592
593 Conservative attempts to use the Conservative CPU governor
594
595 OnDemand attempts to use the OnDemand CPU governor (a
596 default value)
597
598 Performance attempts to use the Performance CPU governor (a
599 default value)
600
601 PowerSave attempts to use the PowerSave CPU governor
602
603 UserSpace attempts to use the UserSpace CPU governor (a
604 default value)
605 The default is OnDemand, Performance and UserSpace.
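
              For example, a site might restrict the allowed governors
              and set a default (shown for illustration only):

                   CpuFreqGovernors=OnDemand,Performance,UserSpace
                   CpuFreqDef=Performance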
606
607 CredType
608 The cryptographic signature tool to be used in the creation of
609 job step credentials. The slurmctld daemon must be restarted
610 for a change in CredType to take effect. Acceptable values at
611 present include "cred/munge" and "cred/none". The default value
              is "cred/munge", which is the recommended option.
613
614
615 DebugFlags
616 Defines specific subsystems which should provide more detailed
617 event logging. Multiple subsystems can be specified with comma
618 separators. Most DebugFlags will result in verbose-level log‐
619 ging for the identified subsystems, and could impact perfor‐
620 mance. Valid subsystems available include:
621
622 Accrue Accrue counters accounting details
623
624 Agent RPC agents (outgoing RPCs from Slurm daemons)
625
626 Backfill Backfill scheduler details
627
628 BackfillMap Backfill scheduler to log a very verbose map of
629 reserved resources through time. Combine with
630 Backfill for a verbose and complete view of the
631 backfill scheduler's work.
632
633 BurstBuffer Burst Buffer plugin
634
635 CPU_Bind CPU binding details for jobs and steps
636
637 CpuFrequency Cpu frequency details for jobs and steps using
638 the --cpu-freq option.
639
640 Data Generic data structure details.
641
642 Dependency Job dependency debug info
643
644 Elasticsearch Elasticsearch debug info
645
646 Energy AcctGatherEnergy debug info
647
648 ExtSensors External Sensors debug info
649
650 Federation Federation scheduling debug info
651
652 FrontEnd Front end node details
653
654 Gres Generic resource details
655
656 Hetjob Heterogeneous job details
657
658 Gang Gang scheduling details
659
660 JobContainer Job container plugin details
661
662 License License management details
663
664 Network Network details
665
666 NetworkRaw Dump raw hex values of key Network communica‐
667 tions. Warning: very verbose.
668
669 NodeFeatures Node Features plugin debug info
670
671 NO_CONF_HASH Do not log when the slurm.conf files differ
672 between Slurm daemons
673
674 Power Power management plugin
675
676 PowerSave Power save (suspend/resume programs) details
677
678 Priority Job prioritization
679
680 Profile AcctGatherProfile plugins details
681
682 Protocol Communication protocol details
683
684 Reservation Advanced reservations
685
686 Route Message forwarding debug info
687
688 SelectType Resource selection plugin
689
690 Steps Slurmctld resource allocation for job steps
691
692 Switch Switch plugin
693
694 TimeCray Timing of Cray APIs
695
696 TRESNode Limits dealing with TRES=Node
697
698 TraceJobs Trace jobs in slurmctld. It will print detailed
699 job information including state, job ids and
                            allocated node count.
701
702 Triggers Slurmctld triggers
703
704 WorkQueue Work Queue details
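
              For example, to obtain verbose logging from the backfill
              scheduler and job prioritization (illustration only;
              expect a performance impact):

                   DebugFlags=Backfill,BackfillMap,Priority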
705
706
707 DefCpuPerGPU
708 Default count of CPUs allocated per allocated GPU.
709
710
711 DefMemPerCPU
712 Default real memory size available per allocated CPU in
713 megabytes. Used to avoid over-subscribing memory and causing
714 paging. DefMemPerCPU would generally be used if individual pro‐
715 cessors are allocated to jobs (SelectType=select/cons_res or
716 SelectType=select/cons_tres). The default value is 0 (unlim‐
717 ited). Also see DefMemPerGPU, DefMemPerNode and MaxMemPerCPU.
718 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
719 sive.
720
721
722 DefMemPerGPU
723 Default real memory size available per allocated GPU in
724 megabytes. The default value is 0 (unlimited). Also see
725 DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU and
726 DefMemPerNode are mutually exclusive.
727
728
729 DefMemPerNode
730 Default real memory size available per allocated node in
731 megabytes. Used to avoid over-subscribing memory and causing
732 paging. DefMemPerNode would generally be used if whole nodes
733 are allocated to jobs (SelectType=select/linear) and resources
734 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
735 The default value is 0 (unlimited). Also see DefMemPerCPU,
736 DefMemPerGPU and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and
737 DefMemPerNode are mutually exclusive.
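
              Because these three options are mutually exclusive, only
              one of them would appear in a given configuration, for
              example:

                   DefMemPerCPU=2048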
738
739
740 DefaultStorageHost
741 The default name of the machine hosting the accounting storage
742 and job completion databases. Only used for database type stor‐
743 age plugins and when the AccountingStorageHost and JobCompHost
744 have not been defined.
745
746
747 DefaultStorageLoc
748 The fully qualified file name where job completion records are
749 written when the DefaultStorageType is "filetxt". Also see Job‐
750 CompLoc.
751
752
753 DefaultStoragePass
754 The password used to gain access to the database to store the
755 accounting and job completion data. Only used for database type
756 storage plugins, ignored otherwise. Also see AccountingStor‐
757 agePass and JobCompPass.
758
759
760 DefaultStoragePort
761 The listening port of the accounting storage and/or job comple‐
762 tion database server. Only used for database type storage plug‐
763 ins, ignored otherwise. Also see AccountingStoragePort and Job‐
764 CompPort.
765
766
767 DefaultStorageType
768 The accounting and job completion storage mechanism type.
769 Acceptable values at present include "filetxt", "mysql" and
770 "none". The value "filetxt" indicates that records will be
771 written to a file. The value "mysql" indicates that accounting
772 records will be written to a MySQL or MariaDB database. The
773 default value is "none", which means that records are not main‐
774 tained. Also see AccountingStorageType and JobCompType.
775
776
777 DefaultStorageUser
778 The user account for accessing the accounting storage and/or job
779 completion database. Only used for database type storage plug‐
780 ins, ignored otherwise. Also see AccountingStorageUser and Job‐
781 CompUser.
782
783
784 DependencyParameters
785 Multiple options may be comma-separated.
786
787
788 disable_remote_singleton
789 By default, when a federated job has a singleton depen‐
                     dency, each cluster in the federation must clear the sin‐
791 gleton dependency before the job's singleton dependency
792 is considered satisfied. Enabling this option means that
793 only the origin cluster must clear the singleton depen‐
794 dency. This option must be set in every cluster in the
795 federation.
796
797 kill_invalid_depend
                     If a job has an invalid dependency and it can never run,
                     terminate it and set its state to JOB_CANCELLED. By
                     default the job stays pending with reason
                     DependencyNeverSatisfied.

              max_depend_depth=#
                     Maximum number of jobs to test for a circular job
                     dependency. Stop testing after this number of job
                     dependencies have been tested. The default value is 10
                     jobs.
805
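              For example (the depth shown is arbitrary; the default is
              10):

                   DependencyParameters=kill_invalid_depend,max_depend_depth=20
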
806
807 DisableRootJobs
808 If set to "YES" then user root will be prevented from running
809 any jobs. The default value is "NO", meaning user root will be
810 able to execute jobs. DisableRootJobs may also be set by parti‐
811 tion.
812
813
814 EioTimeout
815 The number of seconds srun waits for slurmstepd to close the
816 TCP/IP connection used to relay data between the user applica‐
817 tion and srun when the user application terminates. The default
818 value is 60 seconds. May not exceed 65533.
819
820
821 EnforcePartLimits
822 If set to "ALL" then jobs which exceed a partition's size and/or
              time limits will be rejected at submission time. If a job is sub‐
824 mitted to multiple partitions, the job must satisfy the limits
825 on all the requested partitions. If set to "NO" then the job
826 will be accepted and remain queued until the partition limits
              are altered (Time and Node Limits). If set to "ANY", a job must
828 satisfy any of the requested partitions to be submitted. The
829 default value is "NO". NOTE: If set, then a job's QOS can not
830 be used to exceed partition limits. NOTE: The partition limits
831 being considered are its configured MaxMemPerCPU, MaxMemPerNode,
832 MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, Allow‐
833 Groups, AllowQOS, and QOS usage threshold.
834
835
836 Epilog Fully qualified pathname of a script to execute as user root on
837 every node when a user's job completes (e.g.
838 "/usr/local/slurm/epilog"). A glob pattern (See glob (7)) may
839 also be used to run more than one epilog script (e.g.
840 "/etc/slurm/epilog.d/*"). The Epilog script or scripts may be
841 used to purge files, disable user login, etc. By default there
842 is no epilog. See Prolog and Epilog Scripts for more informa‐
843 tion.
844
845
846 EpilogMsgTime
847 The number of microseconds that the slurmctld daemon requires to
848 process an epilog completion message from the slurmd daemons.
849 This parameter can be used to prevent a burst of epilog comple‐
850 tion messages from being sent at the same time which should help
851 prevent lost messages and improve throughput for large jobs.
852 The default value is 2000 microseconds. For a 1000 node job,
853 this spreads the epilog completion messages out over two sec‐
854 onds.
855
856
857 EpilogSlurmctld
858 Fully qualified pathname of a program for the slurmctld to exe‐
859 cute upon termination of a job allocation (e.g.
860 "/usr/local/slurm/epilog_controller"). The program executes as
861 SlurmUser, which gives it permission to drain nodes and requeue
862 the job if a failure occurs (See scontrol(1)). Exactly what the
863 program does and how it accomplishes this is completely at the
864 discretion of the system administrator. Information about the
865 job being initiated, its allocated nodes, etc. are passed to the
866 program using environment variables. See Prolog and Epilog
867 Scripts for more information.
868
869
870 ExtSensorsFreq
871 The external sensors plugin sampling interval. If ExtSen‐
872 sorsType=ext_sensors/none, this parameter is ignored. For all
873 other values of ExtSensorsType, this parameter is the number of
874 seconds between external sensors samples for hardware components
875 (nodes, switches, etc.) The default value is zero. This value
876 disables external sensors sampling. Note: This parameter does
877 not affect external sensors data collection for jobs/steps.
878
879
880 ExtSensorsType
881 Identifies the plugin to be used for external sensors data col‐
882 lection. Slurmctld calls this plugin to collect external sen‐
883 sors data for jobs/steps and hardware components. In case of
884 node sharing between jobs the reported values per job/step
885 (through sstat or sacct) may not be accurate. See also "man
886 ext_sensors.conf".
887
888 Configurable values at present are:
889
890 ext_sensors/none No external sensors data is collected.
891
892 ext_sensors/rrd External sensors data is collected from the
893 RRD database.
894
895
896 FairShareDampeningFactor
897 Dampen the effect of exceeding a user or group's fair share of
              allocated resources. Higher values will provide greater ability
899 to differentiate between exceeding the fair share at high levels
900 (e.g. a value of 1 results in almost no difference between over‐
901 consumption by a factor of 10 and 100, while a value of 5 will
902 result in a significant difference in priority). The default
903 value is 1.
904
905
906 FederationParameters
907 Used to define federation options. Multiple options may be comma
908 separated.
909
910
911 fed_display
912 If set, then the client status commands (e.g. squeue,
913 sinfo, sprio, etc.) will display information in a feder‐
914 ated view by default. This option is functionally equiva‐
915 lent to using the --federation options on each command.
916 Use the client's --local option to override the federated
917 view and get a local view of the given cluster.
918
919
920 FirstJobId
              The job id to be used for the first job submitted to Slurm
              without a specific requested value. Job id values generated
              will be incremented by 1 for each subsequent job. This may be
              used to provide a meta-scheduler with a job id space which is
              disjoint from the interactive jobs. The default value is 1.
              Also see MaxJobId.
926
927
928 GetEnvTimeout
929 Controls how long the job should wait (in seconds) to load the
930 user's environment before attempting to load it from a cache
931 file. Applies when the salloc or sbatch --get-user-env option
932 is used. If set to 0 then always load the user's environment
933 from the cache file. The default value is 2 seconds.
934
935
936 GresTypes
937 A comma delimited list of generic resources to be managed (e.g.
938 GresTypes=gpu,mps). These resources may have an associated GRES
939 plugin of the same name providing additional functionality. No
940 generic resources are managed by default. Ensure this parameter
941 is consistent across all nodes in the cluster for proper opera‐
942 tion. The slurmctld daemon must be restarted for changes to
943 this parameter to become effective.
944
945
946 GroupUpdateForce
947 If set to a non-zero value, then information about which users
948 are members of groups allowed to use a partition will be updated
949 periodically, even when there have been no changes to the
950 /etc/group file. If set to zero, group member information will
951 be updated only after the /etc/group file is updated. The
952 default value is 1. Also see the GroupUpdateTime parameter.
953
954
955 GroupUpdateTime
956 Controls how frequently information about which users are mem‐
957 bers of groups allowed to use a partition will be updated, and
958 how long user group membership lists will be cached. The time
959 interval is given in seconds with a default value of 600 sec‐
960 onds. A value of zero will prevent periodic updating of group
961 membership information. Also see the GroupUpdateForce parame‐
962 ter.
963
964
       GpuFreqDef=[<type>=]<value>[,<type>=<value>]
966 Default GPU frequency to use when running a job step if it has
967 not been explicitly set using the --gpu-freq option. This
968 option can be used to independently configure the GPU and its
969 memory frequencies. Defaults to "high,memory=high". After the
970 job is completed, the frequencies of all affected GPUs will be
971 reset to the highest possible values. In some cases, system
972 power caps may override the requested values. The field type
973 can be "memory". If type is not specified, the GPU frequency is
974 implied. The value field can either be "low", "medium", "high",
975 "highm1" or a numeric value in megahertz (MHz). If the speci‐
976 fied numeric value is not possible, a value as close as possible
977 will be used. See below for definition of the values. Examples
              of use include "GpuFreqDef=medium,memory=high" and "GpuFre‐
979 qDef=450".
980
981 Supported value definitions:
982
983 low the lowest available frequency.
984
985 medium attempts to set a frequency in the middle of the
986 available range.
987
988 high the highest available frequency.
989
990 highm1 (high minus one) will select the next highest avail‐
991 able frequency.
992
993
994 HealthCheckInterval
995 The interval in seconds between executions of HealthCheckPro‐
996 gram. The default value is zero, which disables execution.
997
998
999 HealthCheckNodeState
1000 Identify what node states should execute the HealthCheckProgram.
1001 Multiple state values may be specified with a comma separator.
1002 The default value is ANY to execute on nodes in any state.
1003
1004 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
1005 cated).
1006
1007 ANY Run on nodes in any state.
1008
1009 CYCLE Rather than running the health check program on all
1010 nodes at the same time, cycle through running on all
1011 compute nodes through the course of the HealthCheck‐
1012 Interval. May be combined with the various node
1013 state options.
1014
1015 IDLE Run on nodes in the IDLE state.
1016
1017 MIXED Run on nodes in the MIXED state (some CPUs idle and
1018 other CPUs allocated).
1019
1020
1021 HealthCheckProgram
1022 Fully qualified pathname of a script to execute as user root
1023 periodically on all compute nodes that are not in the
1024 NOT_RESPONDING state. This program may be used to verify the
1025 node is fully operational and DRAIN the node or send email if a
1026 problem is detected. Any action to be taken must be explicitly
1027 performed by the program (e.g. execute "scontrol update Node‐
1028 Name=foo State=drain Reason=tmp_file_system_full" to drain a
1029 node). The execution interval is controlled using the
1030 HealthCheckInterval parameter. Note that the HealthCheckProgram
1031 will be executed at the same time on all nodes to minimize its
              impact upon parallel programs. This program will be killed
1033 if it does not terminate normally within 60 seconds. This pro‐
1034 gram will also be executed when the slurmd daemon is first
1035 started and before it registers with the slurmctld daemon. By
1036 default, no program will be executed.
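
              For example, to run a site-provided check script every
              five minutes on nodes in any state (the script path is a
              placeholder):

                   HealthCheckProgram=/usr/sbin/node_health_check
                   HealthCheckInterval=300
                   HealthCheckNodeState=ANY,CYCLE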
1037
1038
1039 InactiveLimit
1040 The interval, in seconds, after which a non-responsive job allo‐
1041 cation command (e.g. srun or salloc) will result in the job
1042 being terminated. If the node on which the command is executed
1043 fails or the command abnormally terminates, this will terminate
1044 its job allocation. This option has no effect upon batch jobs.
1045 When setting a value, take into consideration that a debugger
1046 using srun to launch an application may leave the srun command
1047 in a stopped state for extended periods of time. This limit is
1048 ignored for jobs running in partitions with the RootOnly flag
1049 set (the scheduler running as root will be responsible for the
1050 job). The default value is unlimited (zero) and may not exceed
1051 65533 seconds.
1052
1053
1054 InteractiveStepOptions
1055 When LaunchParameters=use_interactive_step is enabled, launching
1056 salloc will automatically start an srun process with Interac‐
1057 tiveStepOptions to launch a terminal on a node in the job allo‐
1058 cation. The default value is "--interactive --preserve-env
1059 --pty $SHELL".
1060
1061
1062 JobAcctGatherType
1063 The job accounting mechanism type. Acceptable values at present
              include "jobacct_gather/linux" (for Linux systems, and the
              recommended option), "jobacct_gather/cgroup" and
1066 "jobacct_gather/none" (no accounting data collected). The
1067 default value is "jobacct_gather/none". "jobacct_gather/cgroup"
1068 is a plugin for the Linux operating system that uses cgroups to
1069 collect accounting statistics. The plugin collects the following
1070 statistics: From the cgroup memory subsystem: mem‐
1071 ory.usage_in_bytes (reported as 'pages') and rss from mem‐
1072 ory.stat (reported as 'rss'). From the cgroup cpuacct subsystem:
1073 user cpu time and system cpu time. No value is provided by
1074 cgroups for virtual memory size ('vsize'). In order to use the
1075 sstat tool "jobacct_gather/linux", or "jobacct_gather/cgroup"
1076 must be configured.
1077 NOTE: Changing this configuration parameter changes the contents
1078 of the messages between Slurm daemons. Any previously running
1079 job steps are managed by a slurmstepd daemon that will persist
1080 through the lifetime of that job step and not change its commu‐
1081 nication protocol. Only change this configuration parameter when
1082 there are no running job steps.
1083
1084
1085 JobAcctGatherFrequency
1086 The job accounting and profiling sampling intervals. The sup‐
              ported format is as follows:
1088
1089 JobAcctGatherFrequency=<datatype>=<interval>
1090 where <datatype>=<interval> specifies the task sam‐
1091 pling interval for the jobacct_gather plugin or a
1092 sampling interval for a profiling type by the
1093 acct_gather_profile plugin. Multiple, comma-sepa‐
1094 rated <datatype>=<interval> intervals may be speci‐
1095 fied. Supported datatypes are as follows:
1096
1097 task=<interval>
1098 where <interval> is the task sampling inter‐
1099 val in seconds for the jobacct_gather plugins
1100 and for task profiling by the
1101 acct_gather_profile plugin.
1102
1103 energy=<interval>
1104 where <interval> is the sampling interval in
1105 seconds for energy profiling using the
1106 acct_gather_energy plugin
1107
1108 network=<interval>
1109 where <interval> is the sampling interval in
1110 seconds for infiniband profiling using the
1111 acct_gather_interconnect plugin.
1112
1113 filesystem=<interval>
1114 where <interval> is the sampling interval in
1115 seconds for filesystem profiling using the
1116 acct_gather_filesystem plugin.
1117
1118 The default value for task sampling interval
1119 is 30 seconds. The default value for all other intervals is 0.
1120 An interval of 0 disables sampling of the specified type. If
1121 the task sampling interval is 0, accounting information is col‐
1122 lected only at job termination (reducing Slurm interference with
1123 the job).
1124 Smaller (non-zero) values have a greater impact upon job perfor‐
1125 mance, but a value of 30 seconds is not likely to be noticeable
1126 for applications having less than 10,000 tasks.
1127 Users can independently override each interval on a per job
1128 basis using the --acctg-freq option when submitting the job.
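
              For example (the intervals shown are illustrative only):

                   JobAcctGatherFrequency=task=30,energy=60,network=120,filesystem=120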
1129
1130
1131 JobAcctGatherParams
              Arbitrary parameters for the job account gather plugin. Accept‐
1133 able values at present include:
1134
1135 NoShared Exclude shared memory from accounting.
1136
1137 UsePss Use PSS value instead of RSS to calculate
1138 real usage of memory. The PSS value will be
1139 saved as RSS.
1140
1141 OverMemoryKill Kill processes that are being detected to
1142 use more memory than requested by steps
1143 every time accounting information is gath‐
1144 ered by the JobAcctGather plugin. This
1145 parameter should be used with caution
1146 because a job exceeding its memory alloca‐
1147 tion may affect other processes and/or
1148 machine health.
1149
1150 NOTE: If available, it is recommended to
1151 limit memory by enabling task/cgroup as a
1152 TaskPlugin and making use of Constrain‐
1153 RAMSpace=yes in the cgroup.conf instead of
1154 using this JobAcctGather mechanism for mem‐
1155 ory enforcement. With OverMemoryKill, memory
1156 limit is applied against each process indi‐
1157 vidually and is not applied to the step as a
1158 whole as it is with ConstrainRAMSpace=yes.
1159 Using JobAcctGather is polling based and
1160 there is a delay before a job is killed,
1161 which could lead to system Out of Memory
1162 events.
1163
1164
1165 JobCompHost
1166 The name of the machine hosting the job completion database.
1167 Only used for database type storage plugins, ignored otherwise.
1168 Also see DefaultStorageHost.
1169
1170
1171 JobCompLoc
1172 The fully qualified file name where job completion records are
1173 written when the JobCompType is "jobcomp/filetxt" or the data‐
1174 base where job completion records are stored when the JobComp‐
1175 Type is a database, or a complete URL endpoint with format
1176 <host>:<port>/<target>/_doc when JobCompType is "jobcomp/elas‐
              ticsearch" (e.g. "localhost:9200/slurm/_doc"). NOTE: More
1178 information is available at the Slurm web site
1179 <https://slurm.schedmd.com/elasticsearch.html>. Also see
1180 DefaultStorageLoc.
1181
1182
1183 JobCompParams
1184 Pass arbitrary text string to job completion plugin. Also see
1185 JobCompType.
1186
1187
1188 JobCompPass
1189 The password used to gain access to the database to store the
1190 job completion data. Only used for database type storage plug‐
1191 ins, ignored otherwise. Also see DefaultStoragePass.
1192
1193
1194 JobCompPort
1195 The listening port of the job completion database server. Only
1196 used for database type storage plugins, ignored otherwise. Also
1197 see DefaultStoragePort.
1198
1199
1200 JobCompType
1201 The job completion logging mechanism type. Acceptable values at
1202 present include "jobcomp/none", "jobcomp/elasticsearch", "job‐
1203 comp/filetxt", "jobcomp/lua", "jobcomp/mysql" and "job‐
1204 comp/script". The default value is "jobcomp/none", which means
1205 that upon job completion the record of the job is purged from
1206 the system. If using the accounting infrastructure this plugin
1207 may not be of interest since the information here is redundant.
1208 The value "jobcomp/elasticsearch" indicates that a record of the
1209 job should be written to an Elasticsearch server specified by
1210 the JobCompLoc parameter. NOTE: More information is available
1211 at the Slurm web site ( https://slurm.schedmd.com/elastic‐
1212 search.html ). The value "jobcomp/filetxt" indicates that a
1213 record of the job should be written to a text file specified by
1214 the JobCompLoc parameter. The value "jobcomp/lua" indicates
              that a record of the job should be processed by the "jobcomp.lua"
1216 script located in the default script directory (typically the
1217 subdirectory "etc" of the installation directory). The value
1218 "jobcomp/mysql" indicates that a record of the job should be
1219 written to a MySQL or MariaDB database specified by the JobCom‐
1220 pLoc parameter. The value "jobcomp/script" indicates that a
1221 script specified by the JobCompLoc parameter is to be executed
1222 with environment variables indicating the job information.
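
              For example, to write plain-text completion records to a
              local file (the path is a placeholder):

                   JobCompType=jobcomp/filetxt
                   JobCompLoc=/var/log/slurm/jobcomp.log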
1223
1224 JobCompUser
1225 The user account for accessing the job completion database.
1226 Only used for database type storage plugins, ignored otherwise.
1227 Also see DefaultStorageUser.
1228
1229
1230 JobContainerType
1231 Identifies the plugin to be used for job tracking. The slurmd
1232 daemon must be restarted for a change in JobContainerType to
1233 take effect. NOTE: The JobContainerType applies to a job allo‐
1234 cation, while ProctrackType applies to job steps. Acceptable
1235 values at present include:
1236
1237 job_container/cncu used only for Cray systems (CNCU = Compute
1238 Node Clean Up)
1239
1240 job_container/none used for all other system types
1241
1242
1243 JobFileAppend
1244 This option controls what to do if a job's output or error file
              exists when the job is started. If JobFileAppend is set to a
1246 value of 1, then append to the existing file. By default, any
1247 existing file is truncated.
1248
1249
1250 JobRequeue
1251 This option controls the default ability for batch jobs to be
1252 requeued. Jobs may be requeued explicitly by a system adminis‐
1253 trator, after node failure, or upon preemption by a higher pri‐
              ority job. If JobRequeue is set to a value of 1, then batch jobs
              may be requeued unless explicitly disabled by the user. If
              JobRequeue is set to a value of 0, then batch jobs will not be
1257 requeued unless explicitly enabled by the user. Use the sbatch
1258 --no-requeue or --requeue option to change the default behavior
1259 for individual jobs. The default value is 1.
1260
1261
1262 JobSubmitPlugins
1263 A comma delimited list of job submission plugins to be used.
1264 The specified plugins will be executed in the order listed.
1265 These are intended to be site-specific plugins which can be used
1266 to set default job parameters and/or logging events. Sample
1267 plugins available in the distribution include "all_partitions",
1268 "defaults", "logging", "lua", and "partition". For examples of
1269 use, see the Slurm code in "src/plugins/job_submit" and "con‐
1270 tribs/lua/job_submit*.lua" then modify the code to satisfy your
1271 needs. Slurm can be configured to use multiple job_submit plug‐
1272 ins if desired, however the lua plugin will only execute one lua
1273 script named "job_submit.lua" located in the default script
1274 directory (typically the subdirectory "etc" of the installation
1275 directory). No job submission plugins are used by default.
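
              For example, to process every submission with a site-pro‐
              vided job_submit.lua script:

                   JobSubmitPlugins=lua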
1276
1277
1278 KeepAliveTime
              Specifies how long socket communications used between the srun
1280 command and its slurmstepd process are kept alive after discon‐
1281 nect. Longer values can be used to improve reliability of com‐
1282 munications in the event of network failures. The default value
              leaves the system default in place. The value may not exceed
1284 65533.
1285
1286
1287 KillOnBadExit
1288 If set to 1, a step will be terminated immediately if any task
              crashes or is aborted, as indicated by a non-zero exit code.
              With the default value of 0, if one of the processes crashes
              or is aborted, the other processes will continue to run while the
1292 crashed or aborted process waits. The user can override this
1293 configuration parameter by using srun's -K, --kill-on-bad-exit.
1294
1295
1296 KillWait
1297 The interval, in seconds, given to a job's processes between the
1298 SIGTERM and SIGKILL signals upon reaching its time limit. If
1299 the job fails to terminate gracefully in the interval specified,
1300 it will be forcibly terminated. The default value is 30 sec‐
1301 onds. The value may not exceed 65533.
1302
1303
1304 NodeFeaturesPlugins
1305 Identifies the plugins to be used for support of node features
1306 which can change through time. For example, a node which might
              be booted with various BIOS settings. This is supported through
1308 the use of a node's active_features and available_features
1309 information. Acceptable values at present include:
1310
1311 node_features/knl_cray
1312 used only for Intel Knights Landing proces‐
1313 sors (KNL) on Cray systems
1314
1315 node_features/knl_generic
1316 used for Intel Knights Landing processors
1317 (KNL) on a generic Linux system
1318
1319
1320 LaunchParameters
1321 Identifies options to the job launch plugin. Acceptable values
1322 include:
1323
1324 batch_step_set_cpu_freq Set the cpu frequency for the batch step
1325 from given --cpu-freq, or slurm.conf
1326 CpuFreqDef, option. By default only
1327 steps started with srun will utilize the
1328 cpu freq setting options.
1329
1330 NOTE: If you are using srun to launch
1331 your steps inside a batch script
1332 (advised) this option will create a sit‐
1333 uation where you may have multiple
1334 agents setting the cpu_freq as the batch
                                      step usually runs on the same resources
                                      as one or more of the steps that the
                                      sruns in the script will create.
1338
1339 cray_net_exclusive Allow jobs on a Cray Native cluster
1340 exclusive access to network resources.
1341 This should only be set on clusters pro‐
1342 viding exclusive access to each node to
1343 a single job at once, and not using par‐
1344 allel steps within the job, otherwise
1345 resources on the node can be oversub‐
1346 scribed.
1347
1348 enable_nss_slurm Permits passwd and group resolution for
1349 a job to be serviced by slurmstepd
1350 rather than requiring a lookup from a
1351 network based service. See
1352 https://slurm.schedmd.com/nss_slurm.html
1353 for more information.
1354
1355 lustre_no_flush If set on a Cray Native cluster, then do
1356 not flush the Lustre cache on job step
1357 completion. This setting will only take
1358 effect after reconfiguring, and will
1359 only take effect for newly launched
1360 jobs.
1361
1362 mem_sort Sort NUMA memory at step start. User can
1363 override this default with
1364 SLURM_MEM_BIND environment variable or
1365 --mem-bind=nosort command line option.
1366
1367 mpir_use_nodeaddr When launching tasks Slurm creates
1368 entries in MPIR_proctable that are used
1369 by parallel debuggers, profilers, and
1370 related tools to attach to running
                                      processes. By default the MPIR_proctable
1372 entries contain MPIR_procdesc structures
1373 where the host_name is set to NodeName
1374 by default. If this option is specified,
1375 NodeAddr will be used in this context
1376 instead.
1377
1378 disable_send_gids By default, the slurmctld will look up
1379 and send the user_name and extended gids
1380 for a job, rather than independently on
1381 each node as part of each task launch.
1382 This helps mitigate issues around name
1383 service scalability when launching jobs
1384 involving many nodes. Using this option
1385 will disable this functionality. This
1386 option is ignored if enable_nss_slurm is
1387 specified.
1388
1389 slurmstepd_memlock Lock the slurmstepd process's current
1390 memory in RAM.
1391
1392 slurmstepd_memlock_all Lock the slurmstepd process's current
1393 and future memory in RAM.
1394
1395 test_exec Have srun verify existence of the exe‐
1396 cutable program along with user execute
1397 permission on the node where srun was
1398 called before attempting to launch it on
1399 nodes in the step.
1400
1401 use_interactive_step Have salloc use the Interactive Step to
1402 launch a shell on an allocated compute
1403 node rather than locally to wherever
1404 salloc was invoked. This is accomplished
1405 by launching the srun command with
1406 InteractiveStepOptions as options.
1407
1408 This does not affect salloc called with
1409 a command as an argument. These jobs
1410 will continue to be executed as the
1411 calling user on the calling host.
1412
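As an illustration only, several of the options above can be combined in a single comma-separated list; which ones are appropriate depends on the site:

        LaunchParameters=enable_nss_slurm,test_exec,use_interactive_step
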
1413
1414 LaunchType
1415 Identifies the mechanism to be used to launch application tasks.
1416 Acceptable values include:
1417
1418 launch/slurm
1419 The default value.
1420
1421
1422 Licenses
1423 Specification of licenses (or other resources available on all
1424 nodes of the cluster) which can be allocated to jobs. License
1425 names can optionally be followed by a colon and count with a
1426 default count of one. Multiple license names should be comma
1427 separated (e.g. "Licenses=foo:4,bar"). Note that Slurm pre‐
1428 vents jobs from being scheduled if their required license speci‐
1429 fication is not available. Slurm does not prevent jobs from
1430 using licenses that are not explicitly listed in the job submis‐
1431 sion specification.
1432
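As a further illustration (license names and counts are site-specific placeholders), a cluster sharing a pool of application licenses might configure:

        Licenses=matlab:20,ansys:4

Jobs would then request licenses at submission time, for example with "sbatch -L matlab:2".
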
1433
1434 LogTimeFormat
1435 Format of the timestamp in slurmctld and slurmd log files.
1436 Accepted values are "iso8601", "iso8601_ms", "rfc5424",
1437 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1438 ing in "_ms" differ from the ones without in that fractional
1439 seconds with millisecond precision are printed. The default
1440 value is "iso8601_ms". The "rfc5424" formats are the same as the
1441 "iso8601" formats except that the timezone value is also shown.
1442 The "clock" format shows a timestamp in microseconds retrieved
1443 with the C standard clock() function. The "short" format is a
1444 short date and time format. The "thread_id" format shows the
1445 timestamp in the C standard ctime() function form without the
1446 year but including the microseconds, the daemon's process ID and
1447 the current thread name and ID.
1448
1449
1450 MailDomain
1451 Domain name to qualify usernames if an email address is not
1452 explicitly given with the "--mail-user" option. If unset, the local
1453 MTA will need to qualify the local address itself. Changes to Mail‐
1454 Domain will only affect new jobs.
1455
1456
1457 MailProg
1458 Fully qualified pathname to the program used to send email per
1459 user request. The default value is "/bin/mail" (or
1460 "/usr/bin/mail" if "/bin/mail" does not exist but
1461 "/usr/bin/mail" does exist).
1462
1463
1464 MaxArraySize
1465 The maximum job array size. The maximum job array task index
1466 value will be one less than MaxArraySize to allow for an index
1467 value of zero. Configure MaxArraySize to 0 in order to disable
1468 job array use. The value may not exceed 4000001. The value of
1469 MaxJobCount should be much larger than MaxArraySize. The
1470 default value is 1001.
1471
1472
1473 MaxDBDMsgs
1474 When communication to the SlurmDBD is not possible the slurmctld
1475 will queue messages meant to be processed when the SlurmDBD is
1476 available again. In order to avoid running out of memory, the
1477 slurmctld will only queue a limited number of messages. The default value is
1478 10000, or MaxJobCount * 2 + Node Count * 4, whichever is
1479 greater. The value can not be less than 10000.
1480
1481
1482 MaxJobCount
1483 The maximum number of jobs Slurm can have in its active database
1484 at one time. Set the values of MaxJobCount and MinJobAge to
1485 ensure the slurmctld daemon does not exhaust its memory or other
1486 resources. Once this limit is reached, requests to submit addi‐
1487 tional jobs will fail. The default value is 10000 jobs. NOTE:
1488 Each task of a job array counts as one job even though they will
1489 not occupy separate job records until modified or initiated.
1490 Performance can suffer with more than a few hundred thousand
1491 jobs. Setting a MaxSubmitJobs limit per user is generally valuable
1492 to prevent a single user from filling the system with jobs.
1493 This is accomplished using Slurm's database and configuring
1494 enforcement of resource limits. This value may not be reset via
1495 "scontrol reconfig". It only takes effect upon restart of the
1496 slurmctld daemon.
1497
1498
1499 MaxJobId
1500 The maximum job id to be used for jobs submitted to Slurm with‐
1501 out a specific requested value. Job ids are unsigned 32bit inte‐
1502 gers with the first 26 bits reserved for local job ids and the
1503 remaining 6 bits reserved for a cluster id to identify a feder‐
1504 ated job's origin. The maximum allowed local job id is
1505 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1506 (0x03ff0000). MaxJobId only applies to the local job id and not
1507 the federated job id. Job id values generated will be incre‐
1508 mented by 1 for each subsequent job. Once MaxJobId is reached,
1509 the next job will be assigned FirstJobId. Federated jobs will
1510 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1511 bId.
1512
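As an arithmetic illustration (the values are arbitrary but within the allowed range), the following restricts automatically assigned job ids to a fixed window; once MaxJobId is reached the next job id wraps back to FirstJobId:

        FirstJobId=1000
        MaxJobId=9999999
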
1513
1514 MaxMemPerCPU
1515 Maximum real memory size available per allocated CPU in
1516 megabytes. Used to avoid over-subscribing memory and causing
1517 paging. MaxMemPerCPU would generally be used if individual pro‐
1518 cessors are allocated to jobs (SelectType=select/cons_res or
1519 SelectType=select/cons_tres). The default value is 0 (unlim‐
1520 ited). Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode.
1521 MaxMemPerCPU and MaxMemPerNode are mutually exclusive.
1522
1523 NOTE: If a job specifies a memory per CPU limit that exceeds
1524 this system limit, that job's count of CPUs per task will try to
1525 automatically increase. This may result in the job failing due
1526 to CPU count limits. This auto-adjustment feature is a best-
1527 effort one and optimal assignment is not guaranteed due to the
1528 possibility of having heterogeneous configurations and multi-
1529 partition/qos jobs. If this is a concern it is advised to use a
1530 job submit LUA plugin instead to enforce auto-adjustments to
1531 your specific needs.
1532
1533
1534 MaxMemPerNode
1535 Maximum real memory size available per allocated node in
1536 megabytes. Used to avoid over-subscribing memory and causing
1537 paging. MaxMemPerNode would generally be used if whole nodes
1538 are allocated to jobs (SelectType=select/linear) and resources
1539 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1540 The default value is 0 (unlimited). Also see DefMemPerNode and
1541 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually
1542 exclusive.
1543
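A sketch of typical memory caps (all values are placeholders); note that only one of the two limits may be configured since they are mutually exclusive:

        # Per-CPU cap, e.g. with SelectType=select/cons_tres
        MaxMemPerCPU=4096
        # Per-node cap, e.g. with SelectType=select/linear and oversubscribed nodes
        #MaxMemPerNode=128000
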
1544
1545 MaxStepCount
1546 The maximum number of steps that any job can initiate. This
1547 parameter is intended to limit the effect of bad batch scripts.
1548 The default value is 40000 steps.
1549
1550
1551 MaxTasksPerNode
1552 Maximum number of tasks Slurm will allow a job step to spawn on
1553 a single node. The default MaxTasksPerNode is 512. May not
1554 exceed 65533.
1555
1556
1557 MCSParameters
1558 MCS = Multi-Category Security MCS Plugin Parameters. The sup‐
1559 ported parameters are specific to the MCSPlugin. Changes to
1560 this value take effect when the Slurm daemons are reconfigured.
1561 More information about MCS is available here
1562 <https://slurm.schedmd.com/mcs.html>.
1563
1564
1565 MCSPlugin
1566 MCS = Multi-Category Security : associate a security label to
1567 jobs and ensure that nodes can only be shared among jobs using
1568 the same security label. Acceptable values include:
1569
1570 mcs/none is the default value. No security label associated
1571 with jobs, no particular security restriction when
1572 sharing nodes among jobs.
1573
1574 mcs/account only users with the same account can share the nodes
1575 (requires enabling of accounting).
1576
1577 mcs/group only users with the same group can share the nodes.
1578
1579 mcs/user a node cannot be shared with other users.
1580
1581
1582 MessageTimeout
1583 Time permitted for a round-trip communication to complete in
1584 seconds. Default value is 10 seconds. For systems with shared
1585 nodes, the slurmd daemon could be paged out and necessitate
1586 higher values.
1587
1588
1589 MinJobAge
1590 The minimum age of a completed job before its record is purged
1591 from Slurm's active database. Set the values of MaxJobCount and
1592 MinJobAge to ensure the slurmctld daemon does not exhaust its memory or
1593 other resources. The default value is 300 seconds. A value of
1594 zero prevents any job record purging. Jobs are not purged dur‐
1595 ing a backfill cycle, so it can take longer than MinJobAge sec‐
1596 onds to purge a job if using the backfill scheduling plugin. In
1597 order to eliminate some possible race conditions, the minimum
1598 non-zero value for MinJobAge recommended is 2.
1599
1600
1601 MpiDefault
1602 Identifies the default type of MPI to be used. Srun may over‐
1603 ride this configuration parameter in any case. Currently sup‐
1604 ported versions include: pmi2, pmix, and none (default, which
1605 works for many other versions of MPI). More information about
1606 MPI use is available here
1607 <https://slurm.schedmd.com/mpi_guide.html>.
1608
1609
1610 MpiParams
1611 MPI parameters. Used to identify ports used by older versions
1612 of OpenMPI and native Cray systems. The input format is
1613 "ports=12000-12999" to identify a range of communication ports
1614 to be used. NOTE: This is not needed for modern versions of
1615 OpenMPI; removing it can give a small boost in scheduling
1616 performance. NOTE: This is required for Cray's PMI.
1617
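For example, a system that still requires a fixed port range (the range shown is only illustrative) could use:

        MpiParams=ports=12000-12999
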
1618
1619 OverTimeLimit
1620 Number of minutes by which a job can exceed its time limit
1621 before being canceled. Normally a job's time limit is treated
1622 as a hard limit and the job will be killed upon reaching that
1623 limit. Configuring OverTimeLimit will result in the job's time
1624 limit being treated like a soft limit. Adding the OverTimeLimit
1625 value to the soft time limit provides a hard time limit, at
1626 which point the job is canceled. This is particularly useful
1627 for backfill scheduling, which relies upon each job's soft time
1628 limit. The default value is zero. May not exceed 65533 min‐
1629 utes. A value of "UNLIMITED" is also supported.
1630
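A brief worked example (the value is arbitrary): with

        OverTimeLimit=10

a job submitted with a one hour time limit is scheduled against the one hour soft limit, but is not cancelled until it has run for one hour and ten minutes.
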
1631
1632 PluginDir
1633 Identifies the places in which to look for Slurm plugins. This
1634 is a colon-separated list of directories, like the PATH environ‐
1635 ment variable. The default value is the prefix given at config‐
1636 ure time + "/lib/slurm".
1637
1638
1639 PlugStackConfig
1640 Location of the config file for Slurm stackable plugins that use
1641 the Stackable Plugin Architecture for Node job (K)control
1642 (SPANK). This provides support for a highly configurable set of
1643 plugins to be called before and/or after execution of each task
1644 spawned as part of a user's job step. Default location is
1645 "plugstack.conf" in the same directory as the system slurm.conf.
1646 For more information on SPANK plugins, see the spank(8) manual.
1647
1648
1649 PowerParameters
1650 System power management parameters. The supported parameters
1651 are specific to the PowerPlugin. Changes to this value take
1652 effect when the Slurm daemons are reconfigured. More informa‐
1653 tion about system power management is available here
1654 <https://slurm.schedmd.com/power_mgmt.html>. Options currently
1655 supported by any plugins are listed below.
1656
1657 balance_interval=#
1658 Specifies the time interval, in seconds, between attempts
1659 to rebalance power caps across the nodes. This also con‐
1660 trols the frequency at which Slurm attempts to collect
1661 current power consumption data (old data may be used
1662 until new data is available from the underlying infra‐
1663 structure and values below 10 seconds are not recommended
1664 for Cray systems). The default value is 30 seconds.
1665 Supported by the power/cray_aries plugin.
1666
1667 capmc_path=
1668 Specifies the absolute path of the capmc command. The
1669 default value is "/opt/cray/capmc/default/bin/capmc".
1670 Supported by the power/cray_aries plugin.
1671
1672 cap_watts=#
1673 Specifies the total power limit to be established across
1674 all compute nodes managed by Slurm. A value of 0 sets
1675 every compute node to have an unlimited cap. The default
1676 value is 0. Supported by the power/cray_aries plugin.
1677
1678 decrease_rate=#
1679 Specifies the maximum rate of change in the power cap for
1680 a node where the actual power usage is below the power
1681 cap by an amount greater than lower_threshold (see
1682 below). Value represents a percentage of the difference
1683 between a node's minimum and maximum power consumption.
1684 The default value is 50 percent. Supported by the
1685 power/cray_aries plugin.
1686
1687 get_timeout=#
1688 Amount of time allowed to get power state information in
1689 milliseconds. The default value is 5,000 milliseconds or
1690 5 seconds. Supported by the power/cray_aries plugin and
1691 represents the time allowed for the capmc command to
1692 respond to various "get" options.
1693
1694 increase_rate=#
1695 Specifies the maximum rate of change in the power cap for
1696 a node where the actual power usage is within
1697 upper_threshold (see below) of the power cap. Value rep‐
1698 resents a percentage of the difference between a node's
1699 minimum and maximum power consumption. The default value
1700 is 20 percent. Supported by the power/cray_aries plugin.
1701
1702 job_level
1703 All nodes associated with every job will have the same
1704 power cap, to the extent possible. Also see the
1705 --power=level option on the job submission commands.
1706
1707 job_no_level
1708 Disable the user's ability to set every node associated
1709 with a job to the same power cap. Each node will have
1710 its power cap set independently. This disables the
1711 --power=level option on the job submission commands.
1712
1713 lower_threshold=#
1714 Specify a lower power consumption threshold. If a node's
1715 current power consumption is below this percentage of its
1716 current cap, then its power cap will be reduced. The
1717 default value is 90 percent. Supported by the
1718 power/cray_aries plugin.
1719
1720 recent_job=#
1721 If a job has started or resumed execution (from suspend)
1722 on a compute node within this number of seconds from the
1723 current time, the node's power cap will be increased to
1724 the maximum. The default value is 300 seconds. Sup‐
1725 ported by the power/cray_aries plugin.
1726
1727
1728 set_timeout=#
1729 Amount of time allowed to set power state information in
1730 milliseconds. The default value is 30,000 milliseconds
1731 or 30 seconds. Supported by the power/cray_aries plugin and
1732 represents the time allowed for the capmc command to
1733 respond to various "set" options.
1734
1735 set_watts=#
1736 Specifies the power limit to be set on every compute
1737 node managed by Slurm. Every node gets this same power
1738 cap and there is no variation through time based upon
1739 actual power usage on the node. Supported by the
1740 power/cray_aries plugin.
1741
1742 upper_threshold=#
1743 Specify an upper power consumption threshold. If a
1744 node's current power consumption is above this percentage
1745 of its current cap, then its power cap will be increased
1746 to the extent possible. The default value is 95 percent.
1747 Supported by the power/cray_aries plugin.
1748
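A sketch combining several of the options above; it assumes the power/cray_aries plugin and all numbers are purely illustrative:

        PowerPlugin=power/cray_aries
        PowerParameters=balance_interval=60,cap_watts=500000,lower_threshold=85,upper_threshold=95
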
1749
1750 PowerPlugin
1751 Identifies the plugin used for system power management. Cur‐
1752 rently supported plugins include: cray_aries and none. Changes
1753 to this value require restarting Slurm daemons to take effect.
1754 More information about system power management is available here
1755 <https://slurm.schedmd.com/power_mgmt.html>. By default, no
1756 power plugin is loaded.
1757
1758
1759 PreemptMode
1760 Mechanism used to preempt jobs or enable gang scheduling. When
1761 the PreemptType parameter is set to enable preemption, the Pre‐
1762 emptMode selects the default mechanism used to preempt the eli‐
1763 gible jobs for the cluster.
1764 PreemptMode may be specified on a per partition basis to over‐
1765 ride this default value if PreemptType=preempt/partition_prio.
1766 Alternatively, it can be specified on a per QOS basis if Pre‐
1767 emptType=preempt/qos. In either case, a valid default Preempt‐
1768 Mode value must be specified for the cluster as a whole when
1769 preemption is enabled.
1770 The GANG option is used to enable gang scheduling independent of
1771 whether preemption is enabled (i.e. independent of the Preempt‐
1772 Type setting). It can be specified in addition to a PreemptMode
1773 setting with the two options comma separated (e.g. Preempt‐
1774 Mode=SUSPEND,GANG).
1775 See <https://slurm.schedmd.com/preempt.html> and
1776 <https://slurm.schedmd.com/gang_scheduling.html> for more
1777 details.
1778
1779 NOTE: For performance reasons, the backfill scheduler reserves
1780 whole nodes for jobs, not partial nodes. If during backfill
1781 scheduling a job preempts one or more other jobs, the whole
1782 nodes for those preempted jobs are reserved for the preemptor
1783 job, even if the preemptor job requested fewer resources than
1784 that. These reserved nodes aren't available to other jobs dur‐
1785 ing that backfill cycle, even if the other jobs could fit on the
1786 nodes. Therefore, jobs may preempt more resources during a sin‐
1787 gle backfill iteration than they requested.
1788
1789 NOTE: For a heterogeneous job to be considered for preemption all
1790 components must be eligible for preemption. When a heterogeneous
1791 job is to be preempted the first identified component of the job
1792 with the highest order PreemptMode (SUSPEND (highest), REQUEUE,
1793 CANCEL (lowest)) will be used to set the PreemptMode for all
1794 components. The GraceTime and user warning signal for each com‐
1795 ponent of the heterogeneous job remain unique. Heterogeneous
1796 jobs are excluded from GANG scheduling operations.
1797
1798 OFF Is the default value and disables job preemption and
1799 gang scheduling. It is only compatible with Pre‐
1800 emptType=preempt/none at a global level. A common
1801 use case for this parameter is to set it on a parti‐
1802 tion to disable preemption for that partition.
1803
1804 CANCEL The preempted job will be cancelled.
1805
1806 GANG Enables gang scheduling (time slicing) of jobs in
1807 the same partition, and allows the resuming of sus‐
1808 pended jobs.
1809
1810 NOTE: Gang scheduling is performed independently for
1811 each partition, so if you only want time-slicing by
1812 OverSubscribe, without any preemption, then config‐
1813 uring partitions with overlapping nodes is not rec‐
1814 ommended. On the other hand, if you want to use
1815 PreemptType=preempt/partition_prio to allow jobs
1816 from higher PriorityTier partitions to Suspend jobs
1817 from lower PriorityTier partitions you will need
1818 overlapping partitions, and PreemptMode=SUSPEND,GANG
1819 to use the Gang scheduler to resume the suspended
1820 jobs(s). In any case, time-slicing won't happen
1821 between jobs on different partitions.
1822
1823 NOTE: Heterogeneous jobs are excluded from GANG
1824 scheduling operations.
1825
1826 REQUEUE Preempts jobs by requeuing them (if possible) or
1827 canceling them. For jobs to be requeued they must
1828 have the --requeue sbatch option set or the cluster
1829 wide JobRequeue parameter in slurm.conf must be set
1830 to one.
1831
1832 SUSPEND The preempted jobs will be suspended, and later the
1833 Gang scheduler will resume them. Therefore the SUS‐
1834 PEND preemption mode always needs the GANG option to
1835 be specified at the cluster level. Also, because the
1836 suspended jobs will still use memory on the allo‐
1837 cated nodes, Slurm needs to be able to track memory
1838 resources to be able to suspend jobs.
1839
1840 NOTE: Because gang scheduling is performed indepen‐
1841 dently for each partition, if using PreemptType=pre‐
1842 empt/partition_prio then jobs in higher PriorityTier
1843 partitions will suspend jobs in lower PriorityTier
1844 partitions to run on the released resources. Only
1845 when the preemptor job ends will the suspended jobs
1846 be resumed by the Gang scheduler.
1847 If PreemptType=preempt/qos is configured and if the
1848 preempted job(s) and the preemptor job are on the
1849 same partition, then they will share resources with
1850 the Gang scheduler (time-slicing). If not (i.e. if
1851 the preemptees and preemptor are on different parti‐
1852 tions) then the preempted jobs will remain suspended
1853 until the preemptor ends.
1854
1855
1856 PreemptType
1857 Specifies the plugin used to identify which jobs can be pre‐
1858 empted in order to start a pending job.
1859
1860 preempt/none
1861 Job preemption is disabled. This is the default.
1862
1863 preempt/partition_prio
1864 Job preemption is based upon partition PriorityTier.
1865 Jobs in higher PriorityTier partitions may preempt jobs
1866 from lower PriorityTier partitions. This is not compati‐
1867 ble with PreemptMode=OFF.
1868
1869 preempt/qos
1870 Job preemption rules are specified by Quality Of Service
1871 (QOS) specifications in the Slurm database. This option
1872 is not compatible with PreemptMode=OFF. A configuration
1873 of PreemptMode=SUSPEND is only supported by the Select‐
1874 Type=select/cons_res and SelectType=select/cons_tres
1875 plugins. See the sacctmgr man page to configure the
1876 options for preempt/qos.
1877
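A hedged example of partition-based preemption with gang-scheduled suspend/resume; the partition and node names below are placeholders:

        PreemptType=preempt/partition_prio
        PreemptMode=SUSPEND,GANG
        PartitionName=low  Nodes=node[01-32] PriorityTier=1 Default=YES
        PartitionName=high Nodes=node[01-32] PriorityTier=2

With overlapping nodes, jobs in the "high" partition may suspend jobs in the "low" partition, and the Gang scheduler resumes the suspended jobs when the preemptor ends.
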
1878
1879 PreemptExemptTime
1880 Global option for minimum run time for all jobs before they can
1881 be considered for preemption. Any QOS PreemptExemptTime takes
1882 precedence over the global option. A time of -1 disables the
1883 option, equivalent to 0. Acceptable time formats include "min‐
1884 utes", "minutes:seconds", "hours:minutes:seconds", "days-hours",
1885 "days-hours:minutes", and "days-hours:minutes:seconds".
1886
1887
1888 PrEpParameters
1889 Parameters to be passed to the PrEpPlugins.
1890
1891
1892 PrEpPlugins
1893 A resource for programmers wishing to write their own plugins
1894 for the Prolog and Epilog (PrEp) scripts. The default, and cur‐
1895 rently the only implemented plugin, is prep/script. Additional
1896 plugins can be specified in a comma separated list. For more
1897 information please see the PrEp Plugin API documentation page:
1898 <https://slurm.schedmd.com/prep_plugins.html>
1899
1900
1901 PriorityCalcPeriod
1902 The period of time in minutes in which the half-life decay will
1903 be re-calculated. Applicable only if PriorityType=priority/mul‐
1904 tifactor. The default value is 5 (minutes).
1905
1906
1907 PriorityDecayHalfLife
1908 This controls how long prior resource use is considered in
1909 determining how over- or under-serviced an association is (user,
1910 bank account and cluster) in determining job priority. The
1911 record of usage will be decayed over time, with half of the
1912 original value cleared at age PriorityDecayHalfLife. If set to
1913 0 no decay will be applied. This is helpful if you want to
1914 enforce hard time limits per association. If set to 0 Priori‐
1915 tyUsageResetPeriod must be set to some interval. Applicable
1916 only if PriorityType=priority/multifactor. The unit is a time
1917 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
1918 default value is 7-0 (7 days).
1919
1920
1921 PriorityFavorSmall
1922 Specifies that small jobs should be given preferential schedul‐
1923 ing priority. Applicable only if PriorityType=priority/multi‐
1924 factor. Supported values are "YES" and "NO". The default value
1925 is "NO".
1926
1927
1928 PriorityFlags
1929 Flags to modify priority behavior. Applicable only if Priority‐
1930 Type=priority/multifactor. The keywords below have no associ‐
1931 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
1932 TIVE_TO_TIME").
1933
1934 ACCRUE_ALWAYS If set, priority age factor will be increased
1935 despite job dependencies or holds.
1936
1937 CALCULATE_RUNNING
1938 If set, priorities will be recalculated not
1939 only for pending jobs, but also running and
1940 suspended jobs.
1941
1942 DEPTH_OBLIVIOUS If set, priority will be calculated similarly
1943 to the normal multifactor calculation, but the
1944 depth of the associations in the tree does not
1945 adversely affect their priority. This option
1946 automatically enables NO_FAIR_TREE.
1947
1948 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
1949 to "classic" fair share priority scheduling.
1950
1951 INCR_ONLY If set, priority values will only increase in
1952 value. Job priority will never decrease in
1953 value.
1954
1955 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
1956 BillingWeights) is calculated as the MAX of
1957 individual TRES' on a node (e.g. cpus, mem,
1958 gres) plus the sum of all global TRES' (e.g.
1959 licenses).
1960
1961 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
1962
1963 NO_NORMAL_ASSOC If set, the association factor is not normal‐
1964 ized against the highest association priority.
1965
1966 NO_NORMAL_PART If set, the partition factor is not normalized
1967 against the highest partition PriorityJobFac‐
1968 tor.
1969
1970 NO_NORMAL_QOS If set, the QOS factor is not normalized
1971 against the highest qos priority.
1972
1973 NO_NORMAL_TRES If set, the TRES factor is not normalized
1974 against the job's partition TRES counts.
1975
1976 SMALL_RELATIVE_TO_TIME
1977 If set, the job's size component will be based
1978 upon not the job size alone, but the job's size
1979 divided by its time limit.
1980
1981
1982 PriorityMaxAge
1983 Specifies the job age which will be given the maximum age factor
1984 in computing priority. For example, a value of 30 minutes would
1985 result in all jobs over 30 minutes old getting the same
1986 age-based priority. Applicable only if PriorityType=prior‐
1987 ity/multifactor. The unit is a time string (i.e. min,
1988 hr:min:00, days-hr:min:00, or days-hr). The default value is
1989 7-0 (7 days).
1990
1991
1992 PriorityParameters
1993 Arbitrary string used by the PriorityType plugin.
1994
1995
1996 PrioritySiteFactorParameters
1997 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
1998
1999
2000 PrioritySiteFactorPlugin
2001 This specifies an optional plugin to be used alongside "prior‐
2002 ity/multifactor", which is meant to initially set and continu‐
2003 ously update the SiteFactor priority factor. The default value
2004 is "site_factor/none".
2005
2006
2007 PriorityType
2008 This specifies the plugin to be used in establishing a job's
2009 scheduling priority. Supported values are "priority/basic" (jobs
2010 are prioritized by order of arrival), "priority/multifactor"
2011 (jobs are prioritized based upon size, age, fair-share of allo‐
2012 cation, etc). Also see PriorityFlags for configuration options.
2013 The default value is "priority/basic".
2014
2015 When not FIFO scheduling, jobs are prioritized in the following
2016 order:
2017
2018 1. Jobs that can preempt
2019 2. Jobs with an advanced reservation
2020 3. Partition Priority Tier
2021 4. Job Priority
2022 5. Job Id
2023
2024
2025 PriorityUsageResetPeriod
2026 At this interval the usage of associations will be reset to 0.
2027 This is used if you want to enforce hard limits of time usage
2028 per association. If PriorityDecayHalfLife is set to be 0 no
2029 decay will happen and this is the only way to reset the usage
2030 accumulated by running jobs. By default this is turned off, and
2031 it is advised to use the PriorityDecayHalfLife option instead, to
2032 avoid reaching a point where nothing can run on your cluster.
2033 However, if your scheme only allows certain amounts of time on
2034 your system, this is the way to do it. Applicable only if PriorityType=pri‐
2035 ority/multifactor.
2036
2037 NONE Never clear historic usage. The default value.
2038
2039 NOW Clear the historic usage now. Executed at startup
2040 and reconfiguration time.
2041
2042 DAILY Cleared every day at midnight.
2043
2044 WEEKLY Cleared every week on Sunday at time 00:00.
2045
2046 MONTHLY Cleared on the first day of each month at time
2047 00:00.
2048
2049 QUARTERLY Cleared on the first day of each quarter at time
2050 00:00.
2051
2052 YEARLY Cleared on the first day of each year at time 00:00.
2053
2054
2055 PriorityWeightAge
2056 An integer value that sets the degree to which the queue wait
2057 time component contributes to the job's priority. Applicable
2058 only if PriorityType=priority/multifactor. Requires Account‐
2059 ingStorageType=accounting_storage/slurmdbd. The default value
2060 is 0.
2061
2062
2063 PriorityWeightAssoc
2064 An integer value that sets the degree to which the association
2065 component contributes to the job's priority. Applicable only if
2066 PriorityType=priority/multifactor. The default value is 0.
2067
2068
2069 PriorityWeightFairshare
2070 An integer value that sets the degree to which the fair-share
2071 component contributes to the job's priority. Applicable only if
2072 PriorityType=priority/multifactor. Requires AccountingStor‐
2073 ageType=accounting_storage/slurmdbd. The default value is 0.
2074
2075
2076 PriorityWeightJobSize
2077 An integer value that sets the degree to which the job size com‐
2078 ponent contributes to the job's priority. Applicable only if
2079 PriorityType=priority/multifactor. The default value is 0.
2080
2081
2082 PriorityWeightPartition
2083 Partition factor used by priority/multifactor plugin in calcu‐
2084 lating job priority. Applicable only if PriorityType=prior‐
2085 ity/multifactor. The default value is 0.
2086
2087
2088 PriorityWeightQOS
2089 An integer value that sets the degree to which the Quality Of
2090 Service component contributes to the job's priority. Applicable
2091 only if PriorityType=priority/multifactor. The default value is
2092 0.
2093
2094
2095 PriorityWeightTRES
2096 A comma separated list of TRES Types and weights that sets the
2097 degree that each TRES Type contributes to the job's priority.
2098
2099 e.g.
2100 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
2101
2102 Applicable only if PriorityType=priority/multifactor and if
2103 AccountingStorageTRES is configured with each TRES Type. Nega‐
2104 tive values are allowed. The default values are 0.
2105
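A sketch tying several of the preceding priority options together (all weights and times are arbitrary and assume accounting through slurmdbd is configured):

        PriorityType=priority/multifactor
        PriorityDecayHalfLife=7-0
        PriorityMaxAge=7-0
        PriorityWeightAge=1000
        PriorityWeightFairshare=10000
        PriorityWeightJobSize=1000
        PriorityWeightPartition=1000
        PriorityWeightQOS=2000
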
2106
2107 PrivateData
2108 This controls what type of information is hidden from regular
2109 users. By default, all information is visible to all users.
2110 User SlurmUser and root can always view all information. Multi‐
2111 ple values may be specified with a comma separator. Acceptable
2112 values include:
2113
2114 accounts
2115 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2116 ing any account definitions unless they are coordinators
2117 of them.
2118
2119 cloud Powered down nodes in the cloud are visible.
2120
2121 events Prevents users from viewing event information unless they
2122 have operator status or above.
2123
2124 jobs Prevents users from viewing jobs or job steps belonging
2125 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
2126 users from viewing job records belonging to other users
2127 unless they are coordinators of the association running
2128 the job when using sacct.
2129
2130 nodes Prevents users from viewing node state information.
2131
2132 partitions
2133 Prevents users from viewing partition state information.
2134
2135 reservations
2136 Prevents regular users from viewing reservations which
2137 they can not use.
2138
2139 usage Prevents users from viewing usage of any other user, this
2140 applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
2141 vents users from viewing usage of any other user, this
2142 applies to sreport.
2143
2144 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2145 ing information of any user other than themselves, this
2146 also makes it so users can only see associations they
2147 deal with. Coordinators can see associations of all
2148 users in the account they are coordinator of, but can
2149 only see themselves when listing users.
2150
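For instance, to hide job, usage and reservation details from regular users (an illustrative combination; adjust to site policy):

        PrivateData=jobs,usage,reservations
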
2151
2152 ProctrackType
2153 Identifies the plugin to be used for process tracking on a job
2154 step basis. The slurmd daemon uses this mechanism to identify
2155 all processes which are children of processes it spawns for a
2156 user job step. The slurmd daemon must be restarted for a change
2157 in ProctrackType to take effect. NOTE: "proctrack/linuxproc"
2158 and "proctrack/pgid" can fail to identify all processes associ‐
2159 ated with a job since processes can become a child of the init
2160 process (when the parent process terminates) or change their
2161 process group. To reliably track all processes, "proc‐
2162 track/cgroup" is highly recommended. NOTE: The JobContainerType
2163 applies to a job allocation, while ProctrackType applies to job
2164 steps. Acceptable values at present include:
2165
2166 proctrack/cgroup
2167 Uses linux cgroups to constrain and track processes, and
2168 is the default for systems with cgroup support.
2169 NOTE: see "man cgroup.conf" for configuration details.
2170
2171 proctrack/cray_aries
2172 Uses Cray proprietary process tracking.
2173
2174 proctrack/linuxproc
2175 Uses linux process tree using parent process IDs.
2176
2177 proctrack/pgid
2178 Uses Process Group IDs.
2179 NOTE: This is the default for the BSD family.
2180
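For example, on a Linux system with cgroup support (see cgroup.conf for the related settings):

        ProctrackType=proctrack/cgroup
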
2181
2182 Prolog Fully qualified pathname of a program for the slurmd to execute
2183 whenever it is asked to run a job step from a new job allocation
2184 (e.g. "/usr/local/slurm/prolog"). A glob pattern (See glob (7))
2185 may also be used to specify more than one program to run (e.g.
2186 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2187 starting the first job step. The prolog script or scripts may
2188 be used to purge files, enable user login, etc. By default
2189 there is no prolog. Any configured script is expected to com‐
2190 plete execution quickly (in less time than MessageTimeout). If
2191 the prolog fails (returns a non-zero exit code), this will
2192 result in the node being set to a DRAIN state and the job being
2193 requeued in a held state, unless nohold_on_prolog_fail is con‐
2194 figured in SchedulerParameters. See Prolog and Epilog Scripts
2195 for more information.
2196
2197
2198 PrologEpilogTimeout
2199 The interval in seconds Slurm waits for Prolog and Epilog
2200 before terminating them. The default behavior is to wait indefi‐
2201 nitely. This interval applies to the Prolog and Epilog run by
2202 slurmd daemon before and after the job, the PrologSlurmctld and
2203 EpilogSlurmctld run by slurmctld daemon, and the SPANK plugins
2204 run by the slurmstepd daemon.
2205
2206
2207 PrologFlags
2208 Flags to control the Prolog behavior. By default no flags are
2209 set. Multiple flags may be specified in a comma-separated list.
2210 Currently supported options are:
2211
2212 Alloc If set, the Prolog script will be executed at job allo‐
2213 cation. By default, Prolog is executed just before the
2214 task is launched. Therefore, when salloc is started, no
2215 Prolog is executed. Alloc is useful for preparing things
2216 before a user starts to use any allocated resources. In
2217 particular, this flag is needed on a Cray system when
2218 cluster compatibility mode is enabled.
2219
2220 NOTE: Use of the Alloc flag will increase the time
2221 required to start jobs.
2222
2223 Contain At job allocation time, use the ProcTrack plugin to cre‐
2224 ate a job container on all allocated compute nodes.
2225 This container may be used for user processes not
2226 launched under Slurm control, for example
2227 pam_slurm_adopt may place processes launched through a
2228 direct user login into this container. If using
2229 pam_slurm_adopt, then ProcTrackType must be set to
2230 either proctrack/cgroup or proctrack/cray_aries. Set‐
2231 ting the Contain flag implicitly sets the Alloc flag.
2232
2233 NoHold If set, the Alloc flag should also be set. This will
2234 allow for salloc to not block until the prolog is fin‐
2235 ished on each node. The blocking will happen when steps
2236 reach the slurmd and before any execution has happened
2237 in the step. This is a much faster way to work and if
2238 using srun to launch your tasks you should use this
2239 flag. This flag cannot be combined with the Contain or
2240 X11 flags.
2241
2242 Serial By default, the Prolog and Epilog scripts run concur‐
2243 rently on each node. This flag forces those scripts to
2244 run serially within each node, but with a significant
2245 penalty to job throughput on each node.
2246
2247 X11 Enable Slurm's built-in X11 forwarding capabilities.
2248 This is incompatible with ProctrackType=proctrack/linux‐
2249 proc. Setting the X11 flag implicitly enables both Con‐
2250 tain and Alloc flags as well.
2251
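Bringing the Prolog settings together, a sketch such as the following (the path is a placeholder) runs every script in a prolog.d directory and creates a job container suitable for pam_slurm_adopt:

        Prolog=/etc/slurm/prolog.d/*
        PrologFlags=Contain
        ProctrackType=proctrack/cgroup

Setting Contain implicitly enables Alloc, so the scripts run at job allocation time.
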
2252
2253 PrologSlurmctld
2254 Fully qualified pathname of a program for the slurmctld daemon
2255 to execute before granting a new job allocation (e.g.
2256 "/usr/local/slurm/prolog_controller"). The program executes as
2257 SlurmUser on the same node where the slurmctld daemon executes,
2258 giving it permission to drain nodes and requeue the job if a
2259 failure occurs or cancel the job if appropriate. The program
2260 can be used to reboot nodes or perform other work to prepare
2261 resources for use. Exactly what the program does and how it
2262 accomplishes this is completely at the discretion of the system
2263 administrator. Information about the job being initiated, its
2264 allocated nodes, etc. are passed to the program using environ‐
2265 ment variables. While this program is running, the nodes asso‐
2266 ciated with the job will have a POWER_UP/CONFIGURING flag set
2267 in their state, which can be readily viewed. The slurmctld dae‐
2268 mon will wait indefinitely for this program to complete. Once
2269 the program completes with an exit code of zero, the nodes will
2270 be considered ready for use and the job will be started. If
2271 some node can not be made available for use, the program should
2272 drain the node (typically using the scontrol command) and termi‐
2273 nate with a non-zero exit code. A non-zero exit code will
2274 result in the job being requeued (where possible) or killed.
2275 Note that only batch jobs can be requeued. See Prolog and Epi‐
2276 log Scripts for more information.
2277
2278
2279 PropagatePrioProcess
2280 Controls the scheduling priority (nice value) of user spawned
2281 tasks.
2282
2283 0 The tasks will inherit the scheduling priority from the
2284 slurm daemon. This is the default value.
2285
2286 1 The tasks will inherit the scheduling priority of the com‐
2287 mand used to submit them (e.g. srun or sbatch). Unless the
2288 job is submitted by user root, the tasks will have a sched‐
2289 uling priority no higher than the slurm daemon spawning
2290 them.
2291
2292 2 The tasks will inherit the scheduling priority of the com‐
2293 mand used to submit them (e.g. srun or sbatch) with the
2294 restriction that their nice value will always be one higher
2295 than the slurm daemon (i.e. the tasks scheduling priority
2296 will be lower than the slurm daemon).
2297
2298
2299 PropagateResourceLimits
2300 A list of comma separated resource limit names. The slurmd dae‐
2301 mon uses these names to obtain the associated (soft) limit val‐
2302 ues from the user's process environment on the submit node.
2303 These limits are then propagated and applied to the jobs that
2304 will run on the compute nodes. This parameter can be useful
2305 when system limits vary among nodes. Any resource limits that
2306 do not appear in the list are not propagated. However, the user
2307 can override this by specifying which resource limits to propa‐
2308 gate with the sbatch or srun "--propagate" option. If neither
2309 PropagateResourceLimits nor PropagateResourceLimitsExcept is
2310 configured and the "--propagate" option is not specified, then
2311 the default action is to propagate all limits. Only one of the
2312 parameters, either PropagateResourceLimits or PropagateResource‐
2313 LimitsExcept, may be specified. The user limits can not exceed
2314 hard limits under which the slurmd daemon operates. If the user
2315 limits are not propagated, the limits from the slurmd daemon
2316 will be propagated to the user's job. The limits used for the
2317 Slurm daemons can be set in the /etc/sysconfig/slurm file. For
2318 more information, see: https://slurm.schedmd.com/faq.html#mem‐
2319 lock. The following limit names are supported by Slurm (although
2320 some options may not be supported on some systems):
2321
2322 ALL All limits listed below (default)
2323
2324 NONE No limits listed below
2325
2326 AS The maximum address space for a process
2327
2328 CORE The maximum size of core file
2329
2330 CPU The maximum amount of CPU time
2331
2332 DATA The maximum size of a process's data segment
2333
2334 FSIZE The maximum size of files created. Note that if the
2335 user sets FSIZE to less than the current size of the
2336 slurmd.log, job launches will fail with a 'File size
2337 limit exceeded' error.
2338
2339 MEMLOCK The maximum size that may be locked into memory
2340
2341 NOFILE The maximum number of open files
2342
2343 NPROC The maximum number of processes available
2344
2345 RSS The maximum resident set size
2346
2347 STACK The maximum stack size
2348
2349
2350 PropagateResourceLimitsExcept
2351 A list of comma separated resource limit names. By default, all
2352 resource limits will be propagated, (as described by the Propa‐
2353 gateResourceLimits parameter), except for the limits appearing
2354 in this list. The user can override this by specifying which
2355 resource limits to propagate with the sbatch or srun "--propa‐
2356 gate" option. See PropagateResourceLimits above for a list of
2357 valid limit names.
2358
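For example, a site that wants the submit host's soft limits propagated except for the locked-memory limit (for example when compute nodes are configured with a larger MEMLOCK limit than the login nodes) might use:

        PropagateResourceLimitsExcept=MEMLOCK
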
2359
2360 RebootProgram
2361 Program to be executed on each compute node to reboot it.
2362 Invoked on each node once it becomes idle after the command
2363 "scontrol reboot" is executed by an authorized user or a job is
2364 submitted with the "--reboot" option. After rebooting, the node
2365 is returned to normal use. See ResumeTimeout to configure the
2366 time you expect a reboot to finish in. A node will be marked
2367 DOWN if it doesn't reboot within ResumeTimeout.
2368
2369
2370 ReconfigFlags
2371 Flags to control various actions that may be taken when an
2372 "scontrol reconfig" command is issued. Currently the options
2373 are:
2374
2375 KeepPartInfo If set, an "scontrol reconfig" command will
2376 maintain the in-memory value of partition
2377 "state" and other parameters that may have been
2378 dynamically updated by "scontrol update". Par‐
2379 tition information in the slurm.conf file will
2380 be merged with in-memory data. This flag
2381 supersedes the KeepPartState flag.
2382
2383 KeepPartState If set, an "scontrol reconfig" command will
2384 preserve only the current "state" value of
2385 in-memory partitions and will reset all other
2386 parameters of the partitions that may have been
2387 dynamically updated by "scontrol update" to the
2388 values from the slurm.conf file. Partition
2389 information in the slurm.conf file will be
2390 merged with in-memory data.
2391 The default for the above flags is not set, and the "scontrol
2392 reconfig" will rebuild the partition information using only the
2393 definitions in the slurm.conf file.
2394
2395
2396 RequeueExit
2397 Enables automatic requeue for batch jobs which exit with the
2398 specified values. Separate multiple exit codes by a comma and/or
2399 specify numeric ranges using a "-" separator (e.g. "Requeue‐
2400 Exit=1-9,18"). Jobs will be put back into pending state and
2401 later scheduled again. Restarted jobs will have the environment
2402 variable SLURM_RESTART_COUNT set to the number of times the job
2403 has been restarted.
2404
2405
2406 RequeueExitHold
2407 Enables automatic requeue for batch jobs which exit with the
2408 specified values, with these jobs being held until released man‐
2409 ually by the user. Separate multiple exit codes by a comma
2410 and/or specify numeric ranges using a "-" separator (e.g.
2411 "RequeueExitHold=10-12,16"). These jobs are put in the JOB_SPE‐
2412 CIAL_EXIT exit state. Restarted jobs will have the environment
2413 variable SLURM_RESTART_COUNT set to the number of times the job
2414 has been restarted.
2415
2416
2417 ResumeFailProgram
2418 The program that will be executed when nodes fail to resume
2419 by ResumeTimeout. The argument to the program will be the names
2420 of the failed nodes (using Slurm's hostlist expression format).
2421
2422
2423 ResumeProgram
2424 Slurm supports a mechanism to reduce power consumption on nodes
2425 that remain idle for an extended period of time. This is typi‐
2426 cally accomplished by reducing voltage and frequency or powering
2427 the node down. ResumeProgram is the program that will be exe‐
2428 cuted when a node in power save mode is assigned work to per‐
2429 form. For reasons of reliability, ResumeProgram may execute
2430 more than once for a node when the slurmctld daemon crashes and
2431 is restarted. If ResumeProgram is unable to restore a node to
2432 service with a responding slurmd and an updated BootTime, it
2433 should requeue any job associated with the node and set the node
2434 state to DOWN. If the node isn't actually rebooted (i.e. when
2435 multiple-slurmd is configured) starting slurmd with "-b" option
2436 might be useful. The program executes as SlurmUser. The argu‐
2437 ment to the program will be the names of nodes to be removed
2438 from power savings mode (using Slurm's hostlist expression for‐
2439 mat). By default no program is run. Related configuration
2440 options include ResumeTimeout, ResumeRate, SuspendRate, Suspend‐
2441 Time, SuspendTimeout, SuspendProgram, SuspendExcNodes, and Sus‐
2442 pendExcParts. More information is available at the Slurm web
2443 site ( https://slurm.schedmd.com/power_save.html ).
2444
2445
2446 ResumeRate
2447 The rate at which nodes in power save mode are returned to nor‐
2448 mal operation by ResumeProgram. The value is number of nodes
2449 per minute and it can be used to prevent power surges if a large
2450 number of nodes in power save mode are assigned work at the same
2451 time (e.g. a large job starts). A value of zero results in no
2452 limits being imposed. The default value is 300 nodes per
2453 minute. Related configuration options include ResumeTimeout,
2454 ResumeProgram, SuspendRate, SuspendTime, SuspendTimeout, Sus‐
2455 pendProgram, SuspendExcNodes, and SuspendExcParts.
2456
2457
2458 ResumeTimeout
2459 Maximum time permitted (in seconds) between when a node resume
2460 request is issued and when the node is actually available for
2461 use. Nodes which fail to respond in this time frame will be
2462 marked DOWN and the jobs scheduled on the node requeued. Nodes
2463 which reboot after this time frame will be marked DOWN with a
2464 reason of "Node unexpectedly rebooted." The default value is 60
2465 seconds. Related configuration options include ResumeProgram,
2466 ResumeRate, SuspendRate, SuspendTime, SuspendTimeout, Suspend‐
2467 Program, SuspendExcNodes and SuspendExcParts. More information
2468 is available at the Slurm web site (
2469 https://slurm.schedmd.com/power_save.html ).
2470
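A power-saving sketch pulling these options together; the script paths and timings are placeholders for site-specific values:

        SuspendProgram=/usr/local/slurm/node_suspend.sh
        ResumeProgram=/usr/local/slurm/node_resume.sh
        SuspendTime=1800
        SuspendTimeout=60
        ResumeTimeout=300
        SuspendRate=50
        ResumeRate=100
        SuspendExcNodes=node[001-004]   # keep a few nodes always powered up
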
2471
2472 ResvEpilog
2473 Fully qualified pathname of a program for the slurmctld to exe‐
2474 cute when a reservation ends. The program can be used to cancel
2475 jobs, modify partition configuration, etc. The reservation
2476 name will be passed as an argument to the program. By default
2477 there is no epilog.
2478
2479
2480 ResvOverRun
2481 Describes how long a job already running in a reservation should
2482 be permitted to execute after the end time of the reservation
2483 has been reached. The time period is specified in minutes and
2484 the default value is 0 (kill the job immediately). The value
2485 may not exceed 65533 minutes, although a value of "UNLIMITED" is
2486 supported to permit a job to run indefinitely after its reserva‐
2487 tion is terminated.
2488
2489
2490 ResvProlog
2491 Fully qualified pathname of a program for the slurmctld to exe‐
2492 cute when a reservation begins. The program can be used to can‐
2493 cel jobs, modify partition configuration, etc. The reservation
2494 name will be passed as an argument to the program. By default
2495 there is no prolog.
2496
2497
2498 ReturnToService
2499 Controls when a DOWN node will be returned to service. The
2500 default value is 0. Supported values include
2501
2502 0 A node will remain in the DOWN state until a system adminis‐
2503 trator explicitly changes its state (even if the slurmd dae‐
2504 mon registers and resumes communications).
2505
2506 1 A DOWN node will become available for use upon registration
2507 with a valid configuration only if it was set DOWN due to
2508 being non-responsive. If the node was set DOWN for any
2509 other reason (low memory, unexpected reboot, etc.), its
2510 state will not automatically be changed. A node registers
2511 with a valid configuration if its memory, GRES, CPU count,
2512 etc. are equal to or greater than the values configured in
2513 slurm.conf.
2514
2515 2 A DOWN node will become available for use upon registration
2516 with a valid configuration. The node could have been set
2517 DOWN for any reason. A node registers with a valid configu‐
2518 ration if its memory, GRES, CPU count, etc. are equal to or
2519 greater than the values configured in slurm.conf. (Disabled
2520 on Cray ALPS systems.)
2521
2522
2523 RoutePlugin
2524 Identifies the plugin to be used for defining which nodes will
2525 be used for message forwarding.
2526
2527 route/default
2528 default, use TreeWidth.
2529
2530 route/topology
2531 use the switch hierarchy defined in a topology.conf file.
2532 TopologyPlugin=topology/tree is required.
2533
2534
2535 SbcastParameters
2536 Controls sbcast command behavior. Multiple options can be speci‐
2537 fied in a comma separated list. Supported values include:
2538
2539 DestDir= Destination directory for the file being broadcast to
2540 allocated compute nodes. Default value is cur‐
2541 rent working directory.
2542
2543 Compression= Specify default file compression library to be
2544 used. Supported values are "lz4", "none" and
2545 "zlib". The default value with the sbcast --com‐
2546 press option is "lz4" and "none" otherwise. Some
2547 compression libraries may be unavailable on some
2548 systems.
2549
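For instance, to place broadcast files in a node-local scratch directory and compress them with lz4 by default (the directory is illustrative):

        SbcastParameters=DestDir=/tmp/scratch,Compression=lz4
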
2550
2551 SchedulerParameters
2552 The interpretation of this parameter varies by SchedulerType.
2553 Multiple options may be comma separated.
2554
2555 allow_zero_lic
2556 If set, then job submissions requesting more than config‐
2557 ured licenses won't be rejected.
2558
2559 assoc_limit_stop
2560 If set and a job cannot start due to association limits,
2561 then do not attempt to initiate any lower priority jobs
2562 in that partition. Setting this can decrease system
2563 throughput and utilization, but avoids potentially starv‐
2564 ing larger jobs by preventing them from launching indefi‐
2565 nitely.
2566
2567 batch_sched_delay=#
2568 How long, in seconds, the scheduling of batch jobs can be
2569 delayed. This can be useful in a high-throughput envi‐
2570 ronment in which batch jobs are submitted at a very high
2571 rate (i.e. using the sbatch command) and one wishes to
2572 reduce the overhead of attempting to schedule each job at
2573 submit time. The default value is 3 seconds.
2574
2575 bb_array_stage_cnt=#
2576 Number of tasks from a job array that should be available
2577 for burst buffer resource allocation. Higher values will
2578 increase the system overhead as each task from the job
2579 array will be moved to its own job record in memory, so
2580 relatively small values are generally recommended. The
2581 default value is 10.
2582
2583 bf_busy_nodes
2584 When selecting resources for pending jobs to reserve for
2585 future execution (i.e. the job can not be started immedi‐
2586 ately), then preferentially select nodes that are in use.
2587 This will tend to leave currently idle resources avail‐
2588 able for backfilling longer running jobs, but may result
2589 in allocations having less than optimal network topology.
2590 This option is currently only supported by the
2591 select/cons_res and select/cons_tres plugins (or
2592 select/cray_aries with SelectTypeParameters set to
2593 "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
2594 select/cray_aries plugin over the select/cons_res or
2595 select/cons_tres plugin respectively).
2596
2597 bf_continue
2598 The backfill scheduler periodically releases locks in
2599 order to permit other operations to proceed rather than
2600 blocking all activity for what could be an extended
2601 period of time. Setting this option will cause the back‐
2602 fill scheduler to continue processing pending jobs from
2603 its original job list after releasing locks even if job
2604 or node state changes.
2605
2606 bf_hetjob_immediate
2607 Instruct the backfill scheduler to attempt to start a
2608 heterogeneous job as soon as all of its components are
2609 determined able to do so. Otherwise, the backfill sched‐
2610 uler will delay heterogeneous jobs initiation attempts
2611 until after the rest of the queue has been processed.
2612 This delay may result in lower priority jobs being allo‐
2613 cated resources, which could delay the initiation of the
2614 heterogeneous job due to account and/or QOS limits being
2615 reached. This option is disabled by default. If enabled
2616 and bf_hetjob_prio=min is not set, then it would be auto‐
2617 matically set.
2618
2619 bf_hetjob_prio=[min|avg|max]
2620 At the beginning of each backfill scheduling cycle, a
2621 list of pending jobs to be scheduled is sorted according
2622 to the precedence order configured in PriorityType. This
2623 option instructs the scheduler to alter the sorting algo‐
2624 rithm to ensure that all components belonging to the same
2625 heterogeneous job will be attempted to be scheduled con‐
2626 secutively (thus not fragmented in the resulting list).
2627 More specifically, all components from the same heteroge‐
2628 neous job will be treated as if they all have the same
2629 priority (minimum, average or maximum depending upon this
2630 option's parameter) when compared with other jobs (or
2631 other heterogeneous job components). The original order
2632 will be preserved within the same heterogeneous job. Note
2633 that the operation is calculated for the PriorityTier
2634 layer and for the Priority resulting from the prior‐
2635 ity/multifactor plugin calculations. When enabled, if any
2636 heterogeneous job requested an advanced reservation, then
2637 all of that job's components will be treated as if they
2638 had requested an advanced reservation (and get preferen‐
2639 tial treatment in scheduling).
2640
2641 Note that this operation does not update the Priority
2642 values of the heterogeneous job components, only their
2643 order within the list, so the output of the sprio command
2644 will not be affected.
2645
2646 Heterogeneous jobs have special scheduling properties:
2647 they are only scheduled by the backfill scheduling plug‐
2648 in, each of their components is considered separately
2649 when reserving resources (and might have different Prior‐
2650 ityTier or different Priority values), and no heteroge‐
2651 neous job component is actually allocated resources until
2652 all of its components can be initiated. This may imply
2653 potential scheduling deadlock scenarios because compo‐
2654 nents from different heterogeneous jobs can start reserv‐
2655 ing resources in an interleaved fashion (not consecu‐
2656 tively), but none of the jobs can reserve resources for
2657 all components and start. Enabling this option can help
2658 to mitigate this problem. By default, this option is dis‐
2659 abled.
2660
2661 bf_interval=#
2662 The number of seconds between backfill iterations.
2663 Higher values result in less overhead and better respon‐
2664 siveness. This option applies only to Scheduler‐
2665 Type=sched/backfill. Default: 30, Min: 1, Max: 10800
2666 (3h).
2667
2668
2669 bf_job_part_count_reserve=#
2670 The backfill scheduling logic will reserve resources for
2671 the specified count of highest priority jobs in each par‐
2672 tition. For example, bf_job_part_count_reserve=10 will
2673 cause the backfill scheduler to reserve resources for the
2674 ten highest priority jobs in each partition. Any lower
2675 priority job that can be started using currently avail‐
2676 able resources and not adversely impact the expected
2677 start time of these higher priority jobs will be started
2678 by the backfill scheduler. The default value is zero,
2679 which will reserve resources for any pending job and
2680 delay initiation of lower priority jobs. Also see
2681 bf_min_age_reserve and bf_min_prio_reserve. Default: 0,
2682 Min: 0, Max: 100000.
2683
2684
2685 bf_max_job_array_resv=#
2686 The maximum number of tasks from a job array for which
2687 the backfill scheduler will reserve resources in the
2688 future. Since job arrays can potentially have millions
2689 of tasks, the overhead in reserving resources for all
2690 tasks can be prohibitive. In addition various limits may
2691 prevent all the jobs from starting at the expected times.
2692 This has no impact upon the number of tasks from a job
2693 array that can be started immediately, only those tasks
2694 expected to start at some future time. Default: 20, Min:
2695 0, Max: 1000. NOTE: Jobs submitted to multiple parti‐
2696 tions appear in the job queue once per partition. If dif‐
2697 ferent copies of a single job array record aren't consec‐
2698 utive in the job queue and another job array record is in
2699 between, then bf_max_job_array_resv tasks are considered
2700 per partition that the job is submitted to.
2701
2702 bf_max_job_assoc=#
2703 The maximum number of jobs per user association to
2704 attempt starting with the backfill scheduler. This set‐
2705 ting is similar to bf_max_job_user but is handy if a user
2706 has multiple associations equating to basically different
2707 users. One can set this limit to prevent users from
2708 flooding the backfill queue with jobs that cannot start
2709              and that prevent other users' jobs from starting. This
2710 option applies only to SchedulerType=sched/backfill.
2711              Also see the bf_max_job_user, bf_max_job_part,
2712 bf_max_job_test and bf_max_job_user_part=# options. Set
2713 bf_max_job_test to a value much higher than
2714 bf_max_job_assoc. Default: 0 (no limit), Min: 0, Max:
2715 bf_max_job_test.
2716
2717 bf_max_job_part=#
2718 The maximum number of jobs per partition to attempt
2719 starting with the backfill scheduler. This can be espe‐
2720 cially helpful for systems with large numbers of parti‐
2721 tions and jobs. This option applies only to Scheduler‐
2722 Type=sched/backfill. Also see the partition_job_depth
2723 and bf_max_job_test options. Set bf_max_job_test to a
2724 value much higher than bf_max_job_part. Default: 0 (no
2725 limit), Min: 0, Max: bf_max_job_test.
2726
2727 bf_max_job_start=#
2728 The maximum number of jobs which can be initiated in a
2729 single iteration of the backfill scheduler. This option
2730 applies only to SchedulerType=sched/backfill. Default: 0
2731 (no limit), Min: 0, Max: 10000.
2732
2733 bf_max_job_test=#
2734 The maximum number of jobs to attempt backfill scheduling
2735 for (i.e. the queue depth). Higher values result in more
2736 overhead and less responsiveness. Until an attempt is
2737 made to backfill schedule a job, its expected initiation
2738 time value will not be set. In the case of large clus‐
2739 ters, configuring a relatively small value may be desir‐
2740 able. This option applies only to Scheduler‐
2741 Type=sched/backfill. Default: 100, Min: 1, Max:
2742 1,000,000.
2743
2744 bf_max_job_user=#
2745 The maximum number of jobs per user to attempt starting
2746 with the backfill scheduler for ALL partitions. One can
2747 set this limit to prevent users from flooding the back‐
2748              fill queue with jobs that cannot start and that prevent
2749              other users' jobs from starting. This is similar to the
2750 MAXIJOB limit in Maui. This option applies only to
2751 SchedulerType=sched/backfill. Also see the
2752 bf_max_job_part, bf_max_job_test and
2753 bf_max_job_user_part=# options. Set bf_max_job_test to a
2754 value much higher than bf_max_job_user. Default: 0 (no
2755 limit), Min: 0, Max: bf_max_job_test.
2756
2757 bf_max_job_user_part=#
2758 The maximum number of jobs per user per partition to
2759 attempt starting with the backfill scheduler for any sin‐
2760 gle partition. This option applies only to Scheduler‐
2761 Type=sched/backfill. Also see the bf_max_job_part,
2762 bf_max_job_test and bf_max_job_user=# options. Default:
2763 0 (no limit), Min: 0, Max: bf_max_job_test.
2764
2765 bf_max_time=#
2766 The maximum time in seconds the backfill scheduler can
2767 spend (including time spent sleeping when locks are
2768 released) before discontinuing, even if maximum job
2769 counts have not been reached. This option applies only
2770 to SchedulerType=sched/backfill. The default value is
2771 the value of bf_interval (which defaults to 30 seconds).
2772 Default: bf_interval value (def. 30 sec), Min: 1, Max:
2773 3600 (1h). NOTE: If bf_interval is short and bf_max_time
2774 is large, this may cause locks to be acquired too fre‐
2775              quently and starve out other serviced RPCs. If using this
2776              parameter, it is advisable to set max_rpc_cnt high enough
2777              that scheduling isn't always disabled, and low enough that
2778              the interactive workload can get through in a reasonable
2779              period of time. max_rpc_cnt needs to be below
2780 256 (the default RPC thread limit). Running around the
2781 middle (150) may give you good results. NOTE: When
2782 increasing the amount of time spent in the backfill
2783 scheduling cycle, Slurm can be prevented from responding
2784 to client requests in a timely manner. To address this
2785              you can use max_rpc_cnt to specify the number of queued
2786              RPCs at which the scheduler will stop and respond to these
2787              requests.
2788
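              The following sketch ties the advice above together; all
              numbers are arbitrary examples (not recommendations) and
              assume a site willing to trade some responsiveness for
              longer backfill cycles.

                   # Start a backfill cycle every 60 seconds, let it run
                   # up to 120 seconds, and defer scheduling once about
                   # 150 RPC threads are active so client requests still
                   # get through.
                   SchedulerParameters=bf_interval=60,bf_max_time=120,max_rpc_cnt=150
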
2789 bf_min_age_reserve=#
2790 The backfill and main scheduling logic will not reserve
2791 resources for pending jobs until they have been pending
2792 and runnable for at least the specified number of sec‐
2793 onds. In addition, jobs waiting for less than the speci‐
2794 fied number of seconds will not prevent a newly submitted
2795 job from starting immediately, even if the newly submit‐
2796 ted job has a lower priority. This can be valuable if
2797 jobs lack time limits or all time limits have the same
2798 value. The default value is zero, which will reserve
2799 resources for any pending job and delay initiation of
2800 lower priority jobs. Also see bf_job_part_count_reserve
2801 and bf_min_prio_reserve. Default: 0, Min: 0, Max:
2802 2592000 (30 days).
2803
2804 bf_min_prio_reserve=#
2805 The backfill and main scheduling logic will not reserve
2806 resources for pending jobs unless they have a priority
2807 equal to or higher than the specified value. In addi‐
2808 tion, jobs with a lower priority will not prevent a newly
2809 submitted job from starting immediately, even if the
2810 newly submitted job has a lower priority. This can be
2811              valuable if one wishes to maximize system utilization
2812 without regard for job priority below a certain thresh‐
2813 old. The default value is zero, which will reserve
2814 resources for any pending job and delay initiation of
2815 lower priority jobs. Also see bf_job_part_count_reserve
2816 and bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.
2817
2818 bf_one_resv_per_job
2819 Disallow adding more than one backfill reservation per
2820 job. The scheduling logic builds a sorted list of (job,
2821 partition) pairs. Jobs submitted to multiple partitions
2822 have as many entries in the list as requested partitions.
2823 By default, the backfill scheduler may evaluate all the
2824 (job, partition) entries for a single job, potentially
2825 reserving resources for each pair, but only starting the
2826 job in the reservation offering the earliest start time.
2827 Having a single job reserving resources for multiple par‐
2828 titions could impede other jobs (or hetjob components)
2829 from reserving resources already reserved for the reser‐
2830 vations related to the partitions that don't offer the
2831 earliest start time. This option makes it so that a job
2832 submitted to multiple partitions will stop reserving
2833 resources once the first (job, partition) pair has booked
2834 a backfill reservation. Subsequent pairs from the same
2835              job will only be tested to start now. This allows other
2836              jobs to book the other pairs' resources, at the cost of
2837              not guaranteeing that the multi-partition job will start
2838              in the partition offering the earliest
2839 start time (except if it can start now). This option is
2840 disabled by default.
2841
2842
2843 bf_resolution=#
2844 The number of seconds in the resolution of data main‐
2845 tained about when jobs begin and end. Higher values
2846 result in better responsiveness and quicker backfill
2847 cycles by using larger blocks of time to determine node
2848 eligibility. However, higher values lead to less effi‐
2849 cient system planning, and may miss opportunities to
2850 improve system utilization. This option applies only to
2851 SchedulerType=sched/backfill. Default: 60, Min: 1, Max:
2852 3600 (1 hour).
2853
2854 bf_running_job_reserve
2855 Add an extra step to backfill logic, which creates back‐
2856 fill reservations for jobs running on whole nodes. This
2857 option is disabled by default.
2858
2859 bf_window=#
2860 The number of minutes into the future to look when con‐
2861 sidering jobs to schedule. Higher values result in more
2862 overhead and less responsiveness. A value at least as
2863 long as the highest allowed time limit is generally
2864 advisable to prevent job starvation. In order to limit
2865 the amount of data managed by the backfill scheduler, if
2866 the value of bf_window is increased, then it is generally
2867 advisable to also increase bf_resolution. This option
2868 applies only to SchedulerType=sched/backfill. Default:
2869 1440 (1 day), Min: 1, Max: 43200 (30 days).
2870
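              For illustration only, assuming a cluster whose longest
              partition time limit is 7 days (the values are examples,
              not recommendations):

                   # Plan 7 days (10080 minutes) ahead and coarsen the
                   # planning data to 10-minute blocks to bound the
                   # backfill scheduler's overhead.
                   SchedulerParameters=bf_window=10080,bf_resolution=600
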
2871 bf_window_linear=#
2872 For performance reasons, the backfill scheduler will
2873 decrease precision in calculation of job expected termi‐
2874 nation times. By default, the precision starts at 30 sec‐
2875 onds and that time interval doubles with each evaluation
2876 of currently executing jobs when trying to determine when
2877 a pending job can start. This algorithm can support an
2878 environment with many thousands of running jobs, but can
2879              result in the expected start time of pending jobs being
2880              gradually deferred due to lack of precision. A
2881 value for bf_window_linear will cause the time interval
2882 to be increased by a constant amount on each iteration.
2883 The value is specified in units of seconds. For example,
2884 a value of 60 will cause the backfill scheduler on the
2885 first iteration to identify the job ending soonest and
2886 determine if the pending job can be started after that
2887 job plus all other jobs expected to end within 30 seconds
2888 (default initial value) of the first job. On the next
2889 iteration, the pending job will be evaluated for starting
2890 after the next job expected to end plus all jobs ending
2891 within 90 seconds of that time (30 second default, plus
2892 the 60 second option value). The third iteration will
2893 have a 150 second window and the fourth 210 seconds.
2894 Without this option, the time windows will double on each
2895 iteration and thus be 30, 60, 120, 240 seconds, etc. The
2896 use of bf_window_linear is not recommended with more than
2897 a few hundred simultaneously executing jobs.
2898
2899 bf_yield_interval=#
2900 The backfill scheduler will periodically relinquish locks
2901 in order for other pending operations to take place.
2902              This specifies how often, in microseconds, the locks are
2903              relinquished. Smaller values may be helpful for high
2904 throughput computing when used in conjunction with the
2905 bf_continue option. Also see the bf_yield_sleep option.
2906 Default: 2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10
2907 sec).
2908
2909 bf_yield_sleep=#
2910 The backfill scheduler will periodically relinquish locks
2911 in order for other pending operations to take place.
2912 This specifies the length of time for which the locks are
2913 relinquished in microseconds. Also see the
2914 bf_yield_interval option. Default: 500,000 (0.5 sec),
2915 Min: 1, Max: 10,000,000 (10 sec).
2916
2917 build_queue_timeout=#
2918 Defines the maximum time that can be devoted to building
2919 a queue of jobs to be tested for scheduling. If the sys‐
2920 tem has a huge number of jobs with dependencies, just
2921 building the job queue can take so much time as to
2922 adversely impact overall system performance and this
2923 parameter can be adjusted as needed. The default value
2924 is 2,000,000 microseconds (2 seconds).
2925
2926 correspond_after_task_cnt=#
2927              Defines the number of array tasks that get split for the
2928              potential aftercorr dependency check. A low number may
2929              result in dependent task check failures when the job that
2930              one depends on gets purged before the split. Default: 10.
2931
2932 default_queue_depth=#
2933 The default number of jobs to attempt scheduling (i.e.
2934 the queue depth) when a running job completes or other
2935              routine actions occur; however, the frequency with which
2936 the scheduler is run may be limited by using the defer or
2937 sched_min_interval parameters described below. The full
2938 queue will be tested on a less frequent basis as defined
2939 by the sched_interval option described below. The default
2940 value is 100. See the partition_job_depth option to
2941 limit depth by partition.
2942
2943 defer Setting this option will avoid attempting to schedule
2944 each job individually at job submit time, but defer it
2945 until a later time when scheduling multiple jobs simulta‐
2946 neously may be possible. This option may improve system
2947 responsiveness when large numbers of jobs (many hundreds)
2948 are submitted at the same time, but it will delay the
2949 initiation time of individual jobs. Also see
2950 default_queue_depth above.
2951
2952 delay_boot=#
2953              Do not reboot nodes in order to satisfy this job's fea‐
2954 ture specification if the job has been eligible to run
2955 for less than this time period. If the job has waited
2956 for less than the specified period, it will use only
2957 nodes which already have the specified features. The
2958 argument is in units of minutes. Individual jobs may
2959 override this default value with the --delay-boot option.
2960
2961 disable_job_shrink
2962              Deny user requests to shrink the size of running jobs.
2963 (However, running jobs may still shrink due to node fail‐
2964 ure if the --no-kill option was set.)
2965
2966 disable_hetjob_steps
2967 Disable job steps that span heterogeneous job alloca‐
2968 tions. The default value on Cray systems.
2969
2970 enable_hetjob_steps
2971 Enable job steps that span heterogeneous job allocations.
2972 The default value except for Cray systems.
2973
2974 enable_user_top
2975 Enable use of the "scontrol top" command by non-privi‐
2976 leged users.
2977
2978 Ignore_NUMA
2979 Some processors (e.g. AMD Opteron 6000 series) contain
2980 multiple NUMA nodes per socket. This is a configuration
2981 which does not map into the hardware entities that Slurm
2982 optimizes resource allocation for (PU/thread, core,
2983 socket, baseboard, node and network switch). In order to
2984 optimize resource allocations on such hardware, Slurm
2985 will consider each NUMA node within the socket as a sepa‐
2986 rate socket by default. Use the Ignore_NUMA option to
2987 report the correct socket count, but not optimize
2988 resource allocations on the NUMA nodes.
2989
2990 inventory_interval=#
2991 On a Cray system using Slurm on top of ALPS this limits
2992 the number of times a Basil Inventory call is made. Nor‐
2993 mally this call happens every scheduling consideration to
2994              attempt to close a node state change window with respect
2995 to what ALPS has. This call is rather slow, so making it
2996 less frequently improves performance dramatically, but in
2997 the situation where a node changes state the window is as
2998 large as this setting. In an HTC environment this set‐
2999 ting is a must and we advise around 10 seconds.
3000
3001 max_array_tasks
3002              Specify the maximum number of tasks that can be included in a
3003 job array. The default limit is MaxArraySize, but this
3004 option can be used to set a lower limit. For example,
3005 max_array_tasks=1000 and MaxArraySize=100001 would permit
3006 a maximum task ID of 100000, but limit the number of
3007 tasks in any single job array to 1000.
3008
3009 max_rpc_cnt=#
3010 If the number of active threads in the slurmctld daemon
3011 is equal to or larger than this value, defer scheduling
3012 of jobs. The scheduler will check this condition at cer‐
3013 tain points in code and yield locks if necessary. This
3014 can improve Slurm's ability to process requests at a cost
3015 of initiating new jobs less frequently. Default: 0
3016 (option disabled), Min: 0, Max: 1000.
3017
3018 NOTE: The maximum number of threads (MAX_SERVER_THREADS)
3019 is internally set to 256 and defines the number of served
3020 RPCs at a given time. Setting max_rpc_cnt to more than
3021              256 will only be useful to let backfill continue schedul‐
3022 ing work after locks have been yielded (i.e. each 2 sec‐
3023 onds) if there are a maximum of MAX(max_rpc_cnt/10, 20)
3024              RPCs in the queue. For example, with max_rpc_cnt=1000,
3025              the scheduler will be allowed to continue after yielding
3026              locks only when there are no more than 100 pending RPCs.
3027 If a value is set, then a value of 10 or higher is recom‐
3028 mended. It may require some tuning for each system, but
3029 needs to be high enough that scheduling isn't always dis‐
3030 abled, and low enough that requests can get through in a
3031 reasonable period of time.
3032
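              As a small worked example (the value 150 is arbitrary):
              with max_rpc_cnt=150, job scheduling is deferred whenever
              150 or more slurmctld threads are active, and, if the same
              MAX(max_rpc_cnt/10, 20) rule applies, backfill resumes
              after yielding locks only once no more than
              MAX(15, 20) = 20 RPCs remain queued.

                   # Example only: defer scheduling at 150 active RPCs.
                   SchedulerParameters=max_rpc_cnt=150
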
3033 max_sched_time=#
3034 How long, in seconds, that the main scheduling loop will
3035 execute for before exiting. If a value is configured, be
3036 aware that all other Slurm operations will be deferred
3037 during this time period. Make certain the value is lower
3038 than MessageTimeout. If a value is not explicitly con‐
3039 figured, the default value is half of MessageTimeout with
3040 a minimum default value of 1 second and a maximum default
3041 value of 2 seconds. For example if MessageTimeout=10,
3042 the time limit will be 2 seconds (i.e. MIN(10/2, 2) = 2).
3043
3044 max_script_size=#
3045 Specify the maximum size of a batch script, in bytes.
3046 The default value is 4 megabytes. Larger values may
3047 adversely impact system performance.
3048
3049 max_switch_wait=#
3050 Maximum number of seconds that a job can delay execution
3051 waiting for the specified desired switch count. The
3052 default value is 300 seconds.
3053
3054 no_backup_scheduling
3055 If used, the backup controller will not schedule jobs
3056 when it takes over. The backup controller will allow jobs
3057 to be submitted, modified and cancelled but won't sched‐
3058 ule new jobs. This is useful in Cray environments when
3059 the backup controller resides on an external Cray node.
3060 A restart is required to alter this option. This is
3061 explicitly set on a Cray/ALPS system.
3062
3063 no_env_cache
3064              If used, any job started on a node that fails to load the
3065              environment from that node will fail instead of using the
3066              cached environment. This also implies the requeue_set‐
3067              up_env_fail option.
3068
3069 nohold_on_prolog_fail
3070 By default, if the Prolog exits with a non-zero value the
3071 job is requeued in a held state. By specifying this
3072 parameter the job will be requeued but not held so that
3073 the scheduler can dispatch it to another host.
3074
3075 pack_serial_at_end
3076 If used with the select/cons_res or select/cons_tres
3077 plugin, then put serial jobs at the end of the available
3078 nodes rather than using a best fit algorithm. This may
3079 reduce resource fragmentation for some workloads.
3080
3081 partition_job_depth=#
3082 The default number of jobs to attempt scheduling (i.e.
3083 the queue depth) from each partition/queue in Slurm's
3084 main scheduling logic. The functionality is similar to
3085 that provided by the bf_max_job_part option for the back‐
3086 fill scheduling logic. The default value is 0 (no
3087              limit). Jobs excluded from attempted scheduling based
3088 upon partition will not be counted against the
3089 default_queue_depth limit. Also see the bf_max_job_part
3090 option.
3091
3092 permit_job_expansion
3093 Allow running jobs to request additional nodes be merged
3094 in with the current job allocation.
3095
3096 preempt_reorder_count=#
3097              Specify how many attempts should be made in reordering pre‐
3098 emptable jobs to minimize the count of jobs preempted.
3099 The default value is 1. High values may adversely impact
3100 performance. The logic to support this option is only
3101 available in the select/cons_res and select/cons_tres
3102 plugins.
3103
3104 preempt_strict_order
3105 If set, then execute extra logic in an attempt to preempt
3106 only the lowest priority jobs. It may be desirable to
3107 set this configuration parameter when there are multiple
3108 priorities of preemptable jobs. The logic to support
3109 this option is only available in the select/cons_res and
3110 select/cons_tres plugins.
3111
3112 preempt_youngest_first
3113 If set, then the preemption sorting algorithm will be
3114 changed to sort by the job start times to favor preempt‐
3115 ing younger jobs over older. (Requires preempt/parti‐
3116 tion_prio or preempt/qos plugins.)
3117
3118 reduce_completing_frag
3119 This option is used to control how scheduling of
3120 resources is performed when jobs are in the COMPLETING
3121 state, which influences potential fragmentation. If this
3122 option is not set then no jobs will be started in any
3123 partition when any job is in the COMPLETING state for
3124 less than CompleteWait seconds. If this option is set
3125 then no jobs will be started in any individual partition
3126 that has a job in COMPLETING state for less than Com‐
3127 pleteWait seconds. In addition, no jobs will be started
3128 in any partition with nodes that overlap with any nodes
3129 in the partition of the completing job. This option is
3130 to be used in conjunction with CompleteWait.
3131
3132 NOTE: CompleteWait must be set in order for this to work.
3133 If CompleteWait=0 then this option does nothing.
3134
3135 NOTE: reduce_completing_frag only affects the main sched‐
3136 uler, not the backfill scheduler.
3137
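              A minimal illustration of pairing this option with
              CompleteWait, as the NOTE requires (the 32-second value is
              an arbitrary example):

                   # Only the partition(s) affected by COMPLETING jobs
                   # pause for up to CompleteWait seconds.
                   CompleteWait=32
                   SchedulerParameters=reduce_completing_frag
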
3138 requeue_setup_env_fail
3139 By default if a job environment setup fails the job keeps
3140 running with a limited environment. By specifying this
3141 parameter the job will be requeued in held state and the
3142 execution node drained.
3143
3144 salloc_wait_nodes
3145 If defined, the salloc command will wait until all allo‐
3146 cated nodes are ready for use (i.e. booted) before the
3147 command returns. By default, salloc will return as soon
3148 as the resource allocation has been made.
3149
3150 sbatch_wait_nodes
3151 If defined, the sbatch script will wait until all allo‐
3152 cated nodes are ready for use (i.e. booted) before the
3153 initiation. By default, the sbatch script will be initi‐
3154 ated as soon as the first node in the job allocation is
3155 ready. The sbatch command can use the --wait-all-nodes
3156 option to override this configuration parameter.
3157
3158 sched_interval=#
3159 How frequently, in seconds, the main scheduling loop will
3160 execute and test all pending jobs. The default value is
3161 60 seconds.
3162
3163 sched_max_job_start=#
3164 The maximum number of jobs that the main scheduling logic
3165 will start in any single execution. The default value is
3166 zero, which imposes no limit.
3167
3168 sched_min_interval=#
3169 How frequently, in microseconds, the main scheduling loop
3170 will execute and test any pending jobs. The scheduler
3171 runs in a limited fashion every time that any event hap‐
3172 pens which could enable a job to start (e.g. job submit,
3173 job terminate, etc.). If these events happen at a high
3174 frequency, the scheduler can run very frequently and con‐
3175 sume significant resources if not throttled by this
3176 option. This option specifies the minimum time between
3177 the end of one scheduling cycle and the beginning of the
3178 next scheduling cycle. A value of zero will disable
3179 throttling of the scheduling logic interval. The default
3180 value is 1,000,000 microseconds on Cray/ALPS systems and
3181 2 microseconds on other systems.
3182
3183 spec_cores_first
3184 Specialized cores will be selected from the first cores
3185 of the first sockets, cycling through the sockets on a
3186 round robin basis. By default, specialized cores will be
3187 selected from the last cores of the last sockets, cycling
3188 through the sockets on a round robin basis.
3189
3190 step_retry_count=#
3191 When a step completes and there are steps ending resource
3192 allocation, then retry step allocations for at least this
3193 number of pending steps. Also see step_retry_time. The
3194 default value is 8 steps.
3195
3196 step_retry_time=#
3197 When a step completes and there are steps ending resource
3198 allocation, then retry step allocations for all steps
3199 which have been pending for at least this number of sec‐
3200 onds. Also see step_retry_count. The default value is
3201 60 seconds.
3202
3203 whole_hetjob
3204 Requests to cancel, hold or release any component of a
3205 heterogeneous job will be applied to all components of
3206 the job.
3207
3208              NOTE: this option was previously named whole_pack, which
3209              is still supported for backward compatibility.
3210
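       Since all of the options above share a single SchedulerParame‐
       ters line, they are combined as a comma-separated list. The line
       below is an illustrative sketch only; every option and value
       shown is an arbitrary example for one hypothetical site:

            SchedulerParameters=bf_continue,bf_interval=60,bf_max_job_test=500,default_queue_depth=200,defer
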
3211
3212 SchedulerTimeSlice
3213 Number of seconds in each time slice when gang scheduling is
3214 enabled (PreemptMode=SUSPEND,GANG). The value must be between 5
3215 seconds and 65533 seconds. The default value is 30 seconds.
3216
3217
3218 SchedulerType
3219 Identifies the type of scheduler to be used. Note the slurmctld
3220 daemon must be restarted for a change in scheduler type to
3221 become effective (reconfiguring a running daemon has no effect
3222 for this parameter). The scontrol command can be used to manu‐
3223 ally change job priorities if desired. Acceptable values
3224 include:
3225
3226 sched/backfill
3227 For a backfill scheduling module to augment the default
3228 FIFO scheduling. Backfill scheduling will initiate
3229 lower-priority jobs if doing so does not delay the
3230 expected initiation time of any higher priority job.
3231 Effectiveness of backfill scheduling is dependent upon
3232 users specifying job time limits, otherwise all jobs will
3233 have the same time limit and backfilling is impossible.
3234 Note documentation for the SchedulerParameters option
3235 above. This is the default configuration.
3236
3237 sched/builtin
3238 This is the FIFO scheduler which initiates jobs in prior‐
3239 ity order. If any job in the partition can not be sched‐
3240 uled, no lower priority job in that partition will be
3241 scheduled. An exception is made for jobs that can not
3242 run due to partition constraints (e.g. the time limit) or
3243 down/drained nodes. In that case, lower priority jobs
3244 can be initiated and not impact the higher priority job.
3245
3246 sched/hold
3247              Hold all newly arriving jobs if the file
3248              "/etc/slurm.hold" exists; otherwise use the built-in FIFO
3249              scheduler.
3250
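       For example, the default backfill scheduler is selected explic‐
       itly with:

            SchedulerType=sched/backfill
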
3251
3252 ScronParameters
3253 Multiple options may be comma-separated.
3254
3255 enable Enable the use of scrontab to submit and manage periodic
3256 repeating jobs.
3257
3258
3259 SelectType
3260 Identifies the type of resource selection algorithm to be used.
3261 Changing this value can only be done by restarting the slurmctld
3262 daemon. When changed, all job information (running and pending)
3263 will be lost, since the job state save format used by each plug‐
3264 in is different. The only exception to this is when changing
3265 from cons_res to cons_tres or from cons_tres to cons_res. How‐
3266 ever, if a job contains cons_tres-specific features and then
3267 SelectType is changed to cons_res, the job will be canceled,
3268 since there is no way for cons_res to satisfy requirements spe‐
3269 cific to cons_tres.
3270
3271 Acceptable values include
3272
3273 select/cons_res
3274 The resources (cores and memory) within a node are indi‐
3275 vidually allocated as consumable resources. Note that
3276 whole nodes can be allocated to jobs for selected parti‐
3277 tions by using the OverSubscribe=Exclusive option. See
3278 the partition OverSubscribe parameter for more informa‐
3279 tion.
3280
3281 select/cons_tres
3282 The resources (cores, memory, GPUs and all other track‐
3283 able resources) within a node are individually allocated
3284 as consumable resources. Note that whole nodes can be
3285 allocated to jobs for selected partitions by using the
3286 OverSubscribe=Exclusive option. See the partition Over‐
3287 Subscribe parameter for more information.
3288
3289 select/cray_aries
3290 for a Cray system. The default value is
3291 "select/cray_aries" for all Cray systems.
3292
3293 select/linear
3294 for allocation of entire nodes assuming a one-dimensional
3295 array of nodes in which sequentially ordered nodes are
3296 preferable. For a heterogeneous cluster (e.g. different
3297 CPU counts on the various nodes), resource allocations
3298 will favor nodes with high CPU counts as needed based
3299 upon the job's node and CPU specification if TopologyPlu‐
3300 gin=topology/none is configured. Use of other topology
3301 plugins with select/linear and heterogeneous nodes is not
3302 recommended and may result in valid job allocation
3303 requests being rejected. This is the default value.
3304
3305
3306 SelectTypeParameters
3307 The permitted values of SelectTypeParameters depend upon the
3308 configured value of SelectType. The only supported options for
3309 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3310 which treats memory as a consumable resource and prevents memory
3311 over subscription with job preemption or gang scheduling. By
3312 default SelectType=select/linear allocates whole nodes to jobs
3313 without considering their memory consumption. By default
3314 SelectType=select/cons_res, SelectType=select/cray_aries, and
3315 SelectType=select/cons_tres, use CR_CPU, which allocates CPU
3316 (threads) to jobs without considering their memory consumption.
3317
3318 The following options are supported for Select‐
3319 Type=select/cray_aries:
3320
3321 OTHER_CONS_RES
3322 Layer the select/cons_res plugin under the
3323                    select/cray_aries plugin; the default is to layer
3324 on select/linear. This also allows all the
3325 options available for SelectType=select/cons_res.
3326
3327 OTHER_CONS_TRES
3328 Layer the select/cons_tres plugin under the
3329                    select/cray_aries plugin; the default is to layer
3330 on select/linear. This also allows all the
3331 options available for SelectType=select/cons_tres.
3332
3333 The following options are supported by the Select‐
3334 Type=select/cons_res and SelectType=select/cons_tres plugins:
3335
3336 CR_CPU CPUs are consumable resources. Configure the num‐
3337 ber of CPUs on each node, which may be equal to
3338 the count of cores or hyper-threads on the node
3339 depending upon the desired minimum resource allo‐
3340 cation. The node's Boards, Sockets, CoresPer‐
3341 Socket and ThreadsPerCore may optionally be con‐
3342 figured and result in job allocations which have
3343 improved locality; however doing so will prevent
3344 more than one job from being allocated on each
3345 core.
3346
3347 CR_CPU_Memory
3348 CPUs and memory are consumable resources. Config‐
3349 ure the number of CPUs on each node, which may be
3350 equal to the count of cores or hyper-threads on
3351 the node depending upon the desired minimum
3352 resource allocation. The node's Boards, Sockets,
3353 CoresPerSocket and ThreadsPerCore may optionally
3354 be configured and result in job allocations which
3355 have improved locality; however doing so will pre‐
3356 vent more than one job from being allocated on
3357 each core. Setting a value for DefMemPerCPU is
3358 strongly recommended.
3359
3360 CR_Core
3361 Cores are consumable resources. On nodes with
3362 hyper-threads, each thread is counted as a CPU to
3363 satisfy a job's resource requirement, but multiple
3364 jobs are not allocated threads on the same core.
3365 The count of CPUs allocated to a job is rounded up
3366 to account for every CPU on an allocated core.
3367                    This will also cause the total allocated memory,
3368                    when --mem-per-cpu is used, to be a multiple of the
3369                    total number of CPUs on allocated cores.
3370
3371 CR_Core_Memory
3372 Cores and memory are consumable resources. On
3373 nodes with hyper-threads, each thread is counted
3374 as a CPU to satisfy a job's resource requirement,
3375 but multiple jobs are not allocated threads on the
3376 same core. The count of CPUs allocated to a job
3377 may be rounded up to account for every CPU on an
3378 allocated core. Setting a value for DefMemPerCPU
3379 is strongly recommended.
3380
3381 CR_ONE_TASK_PER_CORE
3382 Allocate one task per core by default. Without
3383 this option, by default one task will be allocated
3384 per thread on nodes with more than one ThreadsPer‐
3385 Core configured. NOTE: This option cannot be used
3386 with CR_CPU*.
3387
3388 CR_CORE_DEFAULT_DIST_BLOCK
3389 Allocate cores within a node using block distribu‐
3390 tion by default. This is a pseudo-best-fit algo‐
3391 rithm that minimizes the number of boards and min‐
3392 imizes the number of sockets (within minimum
3393 boards) used for the allocation. This default
3394                    behavior can be overridden by specifying a particular
3395 "-m" parameter with srun/salloc/sbatch. Without
3396                    this option, cores will be allocated cyclically
3397 across the sockets.
3398
3399 CR_LLN Schedule resources to jobs on the least loaded
3400 nodes (based upon the number of idle CPUs). This
3401 is generally only recommended for an environment
3402 with serial jobs as idle resources will tend to be
3403 highly fragmented, resulting in parallel jobs
3404 being distributed across many nodes. Note that
3405 node Weight takes precedence over how many idle
3406                    resources are on each node. Also see the parti‐
3407                    tion configuration parameter LLN to use the least
3408                    loaded nodes in selected partitions.
3409
3410 CR_Pack_Nodes
3411 If a job allocation contains more resources than
3412 will be used for launching tasks (e.g. if whole
3413 nodes are allocated to a job), then rather than
3414 distributing a job's tasks evenly across its allo‐
3415 cated nodes, pack them as tightly as possible on
3416 these nodes. For example, consider a job alloca‐
3417 tion containing two entire nodes with eight CPUs
3418 each. If the job starts ten tasks across those
3419 two nodes without this option, it will start five
3420 tasks on each of the two nodes. With this option,
3421 eight tasks will be started on the first node and
3422 two tasks on the second node. This can be super‐
3423 seded by "NoPack" in srun's "--distribution"
3424 option. CR_Pack_Nodes only applies when the
3425 "block" task distribution method is used.
3426
3427 CR_Socket
3428 Sockets are consumable resources. On nodes with
3429 multiple cores, each core or thread is counted as
3430 a CPU to satisfy a job's resource requirement, but
3431 multiple jobs are not allocated resources on the
3432 same socket.
3433
3434 CR_Socket_Memory
3435 Memory and sockets are consumable resources. On
3436 nodes with multiple cores, each core or thread is
3437 counted as a CPU to satisfy a job's resource
3438 requirement, but multiple jobs are not allocated
3439 resources on the same socket. Setting a value for
3440 DefMemPerCPU is strongly recommended.
3441
3442 CR_Memory
3443 Memory is a consumable resource. NOTE: This
3444 implies OverSubscribe=YES or OverSubscribe=FORCE
3445 for all partitions. Setting a value for DefMem‐
3446 PerCPU is strongly recommended.
3447
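       For illustration, a cluster treating cores and memory as consum‐
       able resources might combine the parameters above as follows;
       the DefMemPerCPU value (in megabytes) is an arbitrary example:

            SelectType=select/cons_tres
            SelectTypeParameters=CR_Core_Memory
            # Strongly recommended with the *_Memory options above.
            DefMemPerCPU=2048
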
3448
3449 SlurmctldAddr
3450 An optional address to be used for communications to the cur‐
3451 rently active slurmctld daemon, normally used with Virtual IP
3452 addressing of the currently active server. If this parameter is
3453 not specified then each primary and backup server will have its
3454 own unique address used for communications as specified in the
3455 SlurmctldHost parameter. If this parameter is specified then
3456 the SlurmctldHost parameter will still be used for communica‐
3457 tions to specific slurmctld primary or backup servers, for exam‐
3458 ple to cause all of them to read the current configuration files
3459 or shutdown. Also see the SlurmctldPrimaryOffProg and Slurm‐
3460 ctldPrimaryOnProg configuration parameters to configure programs
3461          that manage the virtual IP address.
3462
3463
3464 SlurmctldDebug
3465          The level of detail to provide in the slurmctld daemon's logs. The
3466 default value is info. If the slurmctld daemon is initiated
3467          with -v or --verbose options, that debug level will be preserved
3468 or restored upon reconfiguration.
3469
3470
3471 quiet Log nothing
3472
3473 fatal Log only fatal errors
3474
3475 error Log only errors
3476
3477 info Log errors and general informational messages
3478
3479 verbose Log errors and verbose informational messages
3480
3481 debug Log errors and verbose informational messages and
3482 debugging messages
3483
3484 debug2 Log errors and verbose informational messages and more
3485 debugging messages
3486
3487 debug3 Log errors and verbose informational messages and even
3488 more debugging messages
3489
3490 debug4 Log errors and verbose informational messages and even
3491 more debugging messages
3492
3493 debug5 Log errors and verbose informational messages and even
3494 more debugging messages
3495
3496
3497 SlurmctldHost
3498 The short, or long, hostname of the machine where Slurm control
3499 daemon is executed (i.e. the name returned by the command "host‐
3500 name -s"). This hostname is optionally followed by the address,
3501 either the IP address or a name by which the address can be
3502          identified, enclosed in parentheses (e.g. SlurmctldHost=slurm‐
3503 ctl-primary(12.34.56.78)). This value must be specified at least
3504 once. If specified more than once, the first hostname named will
3505 be where the daemon runs. If the first specified host fails,
3506 the daemon will execute on the second host. If both the first
3507          and second specified hosts fail, the daemon will execute on the
3508 third host.
3509
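       A sketch of a primary controller with two backups, each with an
       explicit address (the hostnames and addresses are invented
       placeholders):

            SlurmctldHost=ctl-primary(10.0.0.1)
            SlurmctldHost=ctl-backup1(10.0.0.2)
            SlurmctldHost=ctl-backup2(10.0.0.3)
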
3510
3511 SlurmctldLogFile
3512 Fully qualified pathname of a file into which the slurmctld dae‐
3513 mon's logs are written. The default value is none (performs
3514 logging via syslog).
3515 See the section LOGGING if a pathname is specified.
3516
3517
3518 SlurmctldParameters
3519 Multiple options may be comma-separated.
3520
3521
3522 allow_user_triggers
3523 Permit setting triggers from non-root/slurm_user users.
3524 SlurmUser must also be set to root to permit these trig‐
3525 gers to work. See the strigger man page for additional
3526 details.
3527
3528 cloud_dns
3529 By default, Slurm expects that the network address for a
3530 cloud node won't be known until the creation of the node
3531 and that Slurm will be notified of the node's address
3532 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3533 Since Slurm communications rely on the node configuration
3534 found in the slurm.conf, Slurm will tell the client com‐
3535              mand, after waiting for all nodes to boot, each node's IP
3536 address. However, in environments where the nodes are in
3537 DNS, this step can be avoided by configuring this option.
3538
3539 cloud_reg_addrs
3540 When a cloud node registers, the node's NodeAddr and
3541 NodeHostName will automatically be set. They will be
3542 reset back to the nodename after powering off.
3543
3544 enable_configless
3545 Permit "configless" operation by the slurmd, slurmstepd,
3546 and user commands. When enabled the slurmd will be per‐
3547 mitted to retrieve config files from the slurmctld, and
3548 on any 'scontrol reconfigure' command new configs will be
3549 automatically pushed out and applied to nodes that are
3550 running in this "configless" mode. NOTE: a restart of
3551 the slurmctld is required for this to take effect.
3552
3553 idle_on_node_suspend
3554 Mark nodes as idle, regardless of current state, when
3555 suspending nodes with SuspendProgram so that nodes will
3556 be eligible to be resumed at a later time.
3557
3558 power_save_interval
3559 How often the power_save thread looks to resume and sus‐
3560 pend nodes. The power_save thread will do work sooner if
3561 there are node state changes. Default is 10 seconds.
3562
3563 power_save_min_interval
3564              How often the power_save thread, at a minimum, looks to
3565 resume and suspend nodes. Default is 0.
3566
3567 max_dbd_msg_action
3568 Action used once MaxDBDMsgs is reached, options are 'dis‐
3569 card' (default) and 'exit'.
3570
3571              When 'discard' is specified and MaxDBDMsgs is reached,
3572              Slurm starts by purging pending messages of the Step
3573              start and complete types. If MaxDBDMsgs is reached again,
3574              Job start messages are purged. Job complete and node
3575              state change messages continue to consume the space freed
3576              by these purges until MaxDBDMsgs is reached again, at
3577              which point no new messages are tracked, creating data
3578              loss and potentially runaway jobs.
3579
3580 When 'exit' is specified and MaxDBDMsgs is reached the
3581 slurmctld will exit instead of discarding any messages.
3582              With this option it will be impossible to start the
3583              slurmctld while the slurmdbd is down if the slurmctld is
3584              tracking more than MaxDBDMsgs.
3585
3586
3587 preempt_send_user_signal
3588 Send the user signal (e.g. --signal=<sig_num>) at preemp‐
3589 tion time even if the signal time hasn't been reached. In
3590 the case of a gracetime preemption the user signal will
3591 be sent if the user signal has been specified and not
3592 sent, otherwise a SIGTERM will be sent to the tasks.
3593
3594 reboot_from_controller
3595 Run the RebootProgram from the controller instead of on
3596 the slurmds. The RebootProgram will be passed a comma-
3597 separated list of nodes to reboot.
3598
3599 user_resv_delete
3600 Allow any user able to run in a reservation to delete it.
3601
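       An illustrative, not prescriptive, combination of the options
       above for a hypothetical cloud-enabled cluster:

            # enable_configless requires a slurmctld restart to take
            # effect.
            SlurmctldParameters=enable_configless,cloud_dns,idle_on_node_suspend
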
3602
3603 SlurmctldPidFile
3604 Fully qualified pathname of a file into which the slurmctld
3605 daemon may write its process id. This may be used for automated
3606 signal processing. The default value is "/var/run/slurm‐
3607 ctld.pid".
3608
3609
3610 SlurmctldPlugstack
3611 A comma delimited list of Slurm controller plugins to be started
3612 when the daemon begins and terminated when it ends. Only the
3613 plugin's init and fini functions are called.
3614
3615
3616 SlurmctldPort
3617 The port number that the Slurm controller, slurmctld, listens to
3618 for work. The default value is SLURMCTLD_PORT as established at
3619 system build time. If none is explicitly specified, it will be
3620 set to 6817. SlurmctldPort may also be configured to support a
3621 range of port numbers in order to accept larger bursts of incom‐
3622 ing messages by specifying two numbers separated by a dash (e.g.
3623          SlurmctldPort=6817-6818). NOTE: Either the slurmctld and slurmd
3624          daemons must not execute on the same nodes, or the values of
3625          SlurmctldPort and SlurmdPort must be different.
3626
3627 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3628 automatically try to interact with anything opened on ports
3629 8192-60000. Configure SlurmctldPort to use a port outside of
3630 the configured SrunPortRange and RSIP's port range.
3631
3632
3633 SlurmctldPrimaryOffProg
3634 This program is executed when a slurmctld daemon running as the
3635 primary server becomes a backup server. By default no program is
3636 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3637 ter.
3638
3639
3640 SlurmctldPrimaryOnProg
3641 This program is executed when a slurmctld daemon running as a
3642 backup server becomes the primary server. By default no program
3643          is executed. When using virtual IP addresses to manage Highly
3644 Available Slurm services, this program can be used to add the IP
3645 address to an interface (and optionally try to kill the unre‐
3646 sponsive slurmctld daemon and flush the ARP caches on nodes on
3647 the local ethernet fabric). See also the related "SlurmctldPri‐
3648 maryOffProg" parameter.
3649
3650 SlurmctldSyslogDebug
3651 The slurmctld daemon will log events to the syslog file at the
3652 specified level of detail. If not set, the slurmctld daemon will
3653 log to syslog at level fatal, unless there is no SlurmctldLog‐
3654 File and it is running in the background, in which case it will
3655 log to syslog at the level specified by SlurmctldDebug (at fatal
3656 in the case that SlurmctldDebug is set to quiet) or it is run in
3657 the foreground, when it will be set to quiet.
3658
3659
3660 quiet Log nothing
3661
3662 fatal Log only fatal errors
3663
3664 error Log only errors
3665
3666 info Log errors and general informational messages
3667
3668 verbose Log errors and verbose informational messages
3669
3670 debug Log errors and verbose informational messages and
3671 debugging messages
3672
3673 debug2 Log errors and verbose informational messages and more
3674 debugging messages
3675
3676 debug3 Log errors and verbose informational messages and even
3677 more debugging messages
3678
3679 debug4 Log errors and verbose informational messages and even
3680 more debugging messages
3681
3682 debug5 Log errors and verbose informational messages and even
3683 more debugging messages
3684
3685
3686
3687 SlurmctldTimeout
3688 The interval, in seconds, that the backup controller waits for
3689 the primary controller to respond before assuming control. The
3690 default value is 120 seconds. May not exceed 65533.
3691
3692
3693 SlurmdDebug
3694          The level of detail to provide in the slurmd daemon's logs. The
3695 default value is info.
3696
3697 quiet Log nothing
3698
3699 fatal Log only fatal errors
3700
3701 error Log only errors
3702
3703 info Log errors and general informational messages
3704
3705 verbose Log errors and verbose informational messages
3706
3707 debug Log errors and verbose informational messages and
3708 debugging messages
3709
3710 debug2 Log errors and verbose informational messages and more
3711 debugging messages
3712
3713 debug3 Log errors and verbose informational messages and even
3714 more debugging messages
3715
3716 debug4 Log errors and verbose informational messages and even
3717 more debugging messages
3718
3719 debug5 Log errors and verbose informational messages and even
3720 more debugging messages
3721
3722
3723 SlurmdLogFile
3724 Fully qualified pathname of a file into which the slurmd dae‐
3725 mon's logs are written. The default value is none (performs
3726 logging via syslog). Any "%h" within the name is replaced with
3727 the hostname on which the slurmd is running. Any "%n" within
3728 the name is replaced with the Slurm node name on which the
3729 slurmd is running.
3730 See the section LOGGING if a pathname is specified.
3731
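       For example, using the "%n" substitution so that each slurmd
       logs under its Slurm node name (the directory is an arbitrary
       choice):

            SlurmdLogFile=/var/log/slurm/slurmd.%n.log
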
3732
3733 SlurmdParameters
3734 Parameters specific to the Slurmd. Multiple options may be
3735 comma separated.
3736
3737 config_overrides
3738 If set, consider the configuration of each node to be
3739 that specified in the slurm.conf configuration file and
3740 any node with less than the configured resources will not
3741 be set DRAIN. This option is generally only useful for
3742 testing purposes. Equivalent to the now deprecated
3743 FastSchedule=2 option.
3744
3745 shutdown_on_reboot
3746 If set, the Slurmd will shut itself down when a reboot
3747 request is received.
3748
3749
3750 SlurmdPidFile
3751 Fully qualified pathname of a file into which the slurmd daemon
3752 may write its process id. This may be used for automated signal
3753 processing. Any "%h" within the name is replaced with the host‐
3754 name on which the slurmd is running. Any "%n" within the name
3755 is replaced with the Slurm node name on which the slurmd is run‐
3756 ning. The default value is "/var/run/slurmd.pid".
3757
3758
3759 SlurmdPort
3760 The port number that the Slurm compute node daemon, slurmd, lis‐
3761 tens to for work. The default value is SLURMD_PORT as estab‐
3762 lished at system build time. If none is explicitly specified,
3763          its value will be 6818. NOTE: Either the slurmctld and slurmd
3764          daemons must not execute on the same nodes, or the values of
3765          SlurmctldPort and SlurmdPort must be different.
3766
3767 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3768 automatically try to interact with anything opened on ports
3769 8192-60000. Configure SlurmdPort to use a port outside of the
3770 configured SrunPortRange and RSIP's port range.
3771
3772
3773 SlurmdSpoolDir
3774 Fully qualified pathname of a directory into which the slurmd
3775 daemon's state information and batch job script information are
3776 written. This must be a common pathname for all nodes, but
3777 should represent a directory which is local to each node (refer‐
3778 ence a local file system). The default value is
3779 "/var/spool/slurmd". Any "%h" within the name is replaced with
3780 the hostname on which the slurmd is running. Any "%n" within
3781 the name is replaced with the Slurm node name on which the
3782 slurmd is running.
3783
3784
3785 SlurmdSyslogDebug
3786 The slurmd daemon will log events to the syslog file at the
3787 specified level of detail. If not set, the slurmd daemon will
3788 log to syslog at level fatal, unless there is no SlurmdLogFile
3789 and it is running in the background, in which case it will log
3790 to syslog at the level specified by SlurmdDebug (at fatal in
3791 the case that SlurmdDebug is set to quiet) or it is run in the
3792 foreground, when it will be set to quiet.
3793
3794
3795 quiet Log nothing
3796
3797 fatal Log only fatal errors
3798
3799 error Log only errors
3800
3801 info Log errors and general informational messages
3802
3803 verbose Log errors and verbose informational messages
3804
3805 debug Log errors and verbose informational messages and
3806 debugging messages
3807
3808 debug2 Log errors and verbose informational messages and more
3809 debugging messages
3810
3811 debug3 Log errors and verbose informational messages and even
3812 more debugging messages
3813
3814 debug4 Log errors and verbose informational messages and even
3815 more debugging messages
3816
3817 debug5 Log errors and verbose informational messages and even
3818 more debugging messages
3819
3820
3821 SlurmdTimeout
3822 The interval, in seconds, that the Slurm controller waits for
3823 slurmd to respond before configuring that node's state to DOWN.
3824 A value of zero indicates the node will not be tested by slurm‐
3825 ctld to confirm the state of slurmd, the node will not be auto‐
3826 matically set to a DOWN state indicating a non-responsive
3827 slurmd, and some other tool will take responsibility for moni‐
3828 toring the state of each compute node and its slurmd daemon.
3829 Slurm's hierarchical communication mechanism is used to ping the
3830 slurmd daemons in order to minimize system noise and overhead.
3831 The default value is 300 seconds. The value may not exceed
3832 65533 seconds.
3833
3834
3835 SlurmdUser
3836 The name of the user that the slurmd daemon executes as. This
3837 user must exist on all nodes of the cluster for authentication
3838 of communications between Slurm components. The default value
3839 is "root".
3840
3841
3842 SlurmSchedLogFile
3843 Fully qualified pathname of the scheduling event logging file.
3844 The syntax of this parameter is the same as for SlurmctldLog‐
3845 File. In order to configure scheduler logging, set both the
3846 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3847
3848
3849 SlurmSchedLogLevel
3850 The initial level of scheduling event logging, similar to the
3851 SlurmctldDebug parameter used to control the initial level of
3852 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3853 (scheduler logging disabled) and "1" (scheduler logging
3854 enabled). If this parameter is omitted, the value defaults to
3855 "0" (disabled). In order to configure scheduler logging, set
3856 both the SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3857 The scheduler logging level can be changed dynamically using
3858 scontrol.
3859
3860
3861 SlurmUser
3862 The name of the user that the slurmctld daemon executes as. For
3863 security purposes, a user other than "root" is recommended.
3864 This user must exist on all nodes of the cluster for authentica‐
3865 tion of communications between Slurm components. The default
3866 value is "root".
3867
3868
3869 SrunEpilog
3870 Fully qualified pathname of an executable to be run by srun fol‐
3871 lowing the completion of a job step. The command line arguments
3872 for the executable will be the command and arguments of the job
3873 step. This configuration parameter may be overridden by srun's
3874 --epilog parameter. Note that while the other "Epilog" executa‐
3875 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
3876 where the tasks are executed, the SrunEpilog runs on the node
3877 where the "srun" is executing.
3878
3879
3880 SrunPortRange
3881 The srun creates a set of listening ports to communicate with
3882 the controller, the slurmstepd and to handle the application
3883 I/O. By default these ports are ephemeral meaning the port num‐
3884          bers are selected. Using this parameter allows sites to
3885          configure a range of ports from which srun ports will be
3886          selected. This is useful if sites want to allow only a
3887          certain port range on their network.
3888
3889 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3890 automatically try to interact with anything opened on ports
3891 8192-60000. Configure SrunPortRange to use a range of ports
3892 above those used by RSIP, ideally 1000 or more ports, for exam‐
3893 ple "SrunPortRange=60001-63000".
3894
3895 Note: A sufficient number of ports must be configured based on
3896          the estimated number of sruns on the submission nodes, considering
3897 that srun opens 3 listening ports plus 2 more for every 48
3898 hosts. Example:
3899
3900 srun -N 48 will use 5 listening ports.
3901
3902
3903 srun -N 50 will use 7 listening ports.
3904
3905
3906 srun -N 200 will use 13 listening ports.
3907
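       As a rough sizing sketch using the figures above (the workload
       estimate is an assumption, not a measurement): if a submission
       node may host up to 200 concurrent "srun -N 200" commands, each
       needing 13 listening ports, roughly 2600 ports are required, so
       a range such as the one below leaves some headroom:

            SrunPortRange=60001-63000
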
3908
3909 SrunProlog
3910 Fully qualified pathname of an executable to be run by srun
3911 prior to the launch of a job step. The command line arguments
3912 for the executable will be the command and arguments of the job
3913 step. This configuration parameter may be overridden by srun's
3914 --prolog parameter. Note that while the other "Prolog" executa‐
3915 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
3916 where the tasks are executed, the SrunProlog runs on the node
3917 where the "srun" is executing.
3918
3919
3920 StateSaveLocation
3921 Fully qualified pathname of a directory into which the Slurm
3922 controller, slurmctld, saves its state (e.g.
3923          "/usr/local/slurm/checkpoint"). Slurm state will be saved here to
3924 recover from system failures. SlurmUser must be able to create
3925 files in this directory. If you have a secondary SlurmctldHost
3926 configured, this location should be readable and writable by
3927 both systems. Since all running and pending job information is
3928 stored here, the use of a reliable file system (e.g. RAID) is
3929 recommended. The default value is "/var/spool". If any slurm
3930 daemons terminate abnormally, their core files will also be
3931 written into this directory.
3932
3933
3934 SuspendExcNodes
3935 Specifies the nodes which are to not be placed in power save
3936 mode, even if the node remains idle for an extended period of
3937 time. Use Slurm's hostlist expression to identify nodes with an
3938 optional ":" separator and count of nodes to exclude from the
3939 preceding range. For example "nid[10-20]:4" will prevent 4
3940          usable nodes (i.e. IDLE and not DOWN, DRAINING or already powered
3941 down) in the set "nid[10-20]" from being powered down. Multiple
3942 sets of nodes can be specified with or without counts in a comma
3943 separated list (e.g "nid[10-20]:4,nid[80-90]:2"). If a node
3944 count specification is given, any list of nodes to NOT have a
3945 node count must be after the last specification with a count.
3946 For example "nid[10-20]:4,nid[60-70]" will exclude 4 nodes in
3947 the set "nid[10-20]:4" plus all nodes in the set "nid[60-70]"
3948 while "nid[1-3],nid[10-20]:4" will exclude 4 nodes from the set
3949 "nid[1-3],nid[10-20]". By default no nodes are excluded.
3950 Related configuration options include ResumeTimeout, ResumePro‐
3951 gram, ResumeRate, SuspendProgram, SuspendRate, SuspendTime, Sus‐
3952 pendTimeout, and SuspendExcParts.
3953
3954
3955 SuspendExcParts
3956              Specifies the partitions whose nodes are not to be placed in
3957 power save mode, even if the node remains idle for an extended
3958 period of time. Multiple partitions can be identified and sepa‐
3959 rated by commas. By default no nodes are excluded. Related
3960 configuration options include ResumeTimeout, ResumeProgram,
3961              ResumeRate, SuspendProgram, SuspendRate, SuspendTime, Suspend‐
3962 Timeout, and SuspendExcNodes.
3963
3964
3965 SuspendProgram
3966 SuspendProgram is the program that will be executed when a node
3967 remains idle for an extended period of time. This program is
3968 expected to place the node into some power save mode. This can
3969 be used to reduce the frequency and voltage of a node or com‐
3970 pletely power the node off. The program executes as SlurmUser.
3971 The argument to the program will be the names of nodes to be
3972 placed into power savings mode (using Slurm's hostlist expres‐
3973 sion format). By default, no program is run. Related configu‐
3974 ration options include ResumeTimeout, ResumeProgram, ResumeRate,
3975 SuspendRate, SuspendTime, SuspendTimeout, SuspendExcNodes, and
3976 SuspendExcParts.
3977
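              A minimal sketch of a SuspendProgram (the power-off mechanism
              and log path are placeholders for whatever a site actually
              uses); the single argument is a hostlist expression that can
              be expanded with scontrol:

                   #!/bin/sh
                   # $1 is a hostlist expression such as "nid[10-20]"
                   for host in $(scontrol show hostnames "$1"); do
                       # site-specific power-off mechanism goes here (e.g. IPMI)
                       echo "$(date) suspending $host" >> /var/log/slurm_power.log
                   done
                   exit 0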
3978
3979 SuspendRate
3980 The rate at which nodes are placed into power save mode by Sus‐
3981              pendProgram. The value is the number of nodes per minute and it can
3982 be used to prevent a large drop in power consumption (e.g. after
3983 a large job completes). A value of zero results in no limits
3984 being imposed. The default value is 60 nodes per minute.
3985 Related configuration options include ResumeTimeout, ResumePro‐
3986 gram, ResumeRate, SuspendProgram, SuspendTime, SuspendTimeout,
3987 SuspendExcNodes, and SuspendExcParts.
3988
3989
3990 SuspendTime
3991 Nodes which remain idle or down for this number of seconds will
3992 be placed into power save mode by SuspendProgram. For efficient
3993 system utilization, it is recommended that the value of Suspend‐
3994 Time be at least as large as the sum of SuspendTimeout plus
3995 ResumeTimeout. A value of -1 disables power save mode and is
3996 the default. Related configuration options include ResumeTime‐
3997 out, ResumeProgram, ResumeRate, SuspendProgram, SuspendRate,
3998 SuspendTimeout, SuspendExcNodes, and SuspendExcParts.
3999
4000
4001 SuspendTimeout
4002 Maximum time permitted (in seconds) between when a node suspend
4003              request is issued and when the node is shut down. At that time
4004 the node must be ready for a resume request to be issued as
4005 needed for new work. The default value is 30 seconds. Related
4006 configuration options include ResumeProgram, ResumeRate, Resume‐
4007 Timeout, SuspendRate, SuspendTime, SuspendProgram, SuspendExcN‐
4008 odes and SuspendExcParts. More information is available at the
4009 Slurm web site ( https://slurm.schedmd.com/power_save.html ).
4010
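              Putting the power-saving options together, a configuration
              might look like the following sketch (all paths, times and
              node names are illustrative, not defaults); note that
              SuspendTime is at least SuspendTimeout plus ResumeTimeout,
              as recommended above:

                   SuspendProgram=/usr/local/sbin/slurm_suspend.sh
                   ResumeProgram=/usr/local/sbin/slurm_resume.sh
                   SuspendTime=1800
                   SuspendTimeout=120
                   ResumeTimeout=600
                   SuspendRate=60
                   ResumeRate=300
                   SuspendExcNodes=login[1-2]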
4011
4012 SwitchType
4013 Identifies the type of switch or interconnect used for applica‐
4014 tion communications. Acceptable values include
4015 "switch/cray_aries" for Cray systems, "switch/none" for switches
4016 not requiring special processing for job launch or termination
4017              (Ethernet and InfiniBand). The default value is
4018 "switch/none". All Slurm daemons, commands and running jobs
4019 must be restarted for a change in SwitchType to take effect. If
4020 running jobs exist at the time slurmctld is restarted with a new
4021 value of SwitchType, records of all jobs in any state may be
4022 lost.
4023
4024
4025 TaskEpilog
4026              Fully qualified pathname of a program to be executed as the slurm
4027 job's owner after termination of each task. See TaskProlog for
4028 execution order details.
4029
4030
4031 TaskPlugin
4032 Identifies the type of task launch plugin, typically used to
4033 provide resource management within a node (e.g. pinning tasks to
4034 specific processors). More than one task plugin can be specified
4035 in a comma separated list. The prefix of "task/" is optional.
4036 Acceptable values include:
4037
4038 task/affinity enables resource containment using
4039 sched_setaffinity(). This enables the --cpu-bind
4040 and/or --mem-bind srun options.
4041
4042 task/cgroup enables resource containment using Linux control
4043 cgroups. This enables the --cpu-bind and/or
4044 --mem-bind srun options. NOTE: see "man
4045 cgroup.conf" for configuration details.
4046
4047 task/none for systems requiring no special handling of user
4048 tasks. Lacks support for the --cpu-bind and/or
4049 --mem-bind srun options. The default value is
4050 "task/none".
4051
4052              NOTE: It is recommended to stack task/affinity,task/cgroup
4053              together when configuring TaskPlugin, and to set TaskAffin‐
4054              ity=no and ConstrainCores=yes in cgroup.conf. This setup uses
4055              the task/affinity plugin to set task affinity (which it does
4056              better than task/cgroup) and the task/cgroup plugin to fence
4057              tasks into the specified resources, thus combining the best of
4058              both pieces. See the example below.
4059
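              A sketch of the recommended stacking described above:

                   # slurm.conf
                   TaskPlugin=task/affinity,task/cgroup

                   # cgroup.conf
                   TaskAffinity=no
                   ConstrainCores=yes
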
4060 NOTE: For CRAY systems only: task/cgroup must be used with, and
4061 listed after task/cray_aries in TaskPlugin. The task/affinity
4062 plugin can be listed anywhere, but the previous constraint must
4063 be satisfied. For CRAY systems, a configuration like this is
4064 recommended:
4065 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
4066
4067
4068 TaskPluginParam
4069 Optional parameters for the task plugin. Multiple options
4070 should be comma separated. If None, Boards, Sockets, Cores,
4071 Threads, and/or Verbose are specified, they will override the
4072 --cpu-bind option specified by the user in the srun command.
4073 None, Boards, Sockets, Cores and Threads are mutually exclusive
4074 and since they decrease scheduling flexibility are not generally
4075 recommended (select no more than one of them).
4076
4077
4078 Boards Bind tasks to boards by default. Overrides automatic
4079 binding.
4080
4081 Cores Bind tasks to cores by default. Overrides automatic
4082 binding.
4083
4084 None Perform no task binding by default. Overrides auto‐
4085 matic binding.
4086
4087 Sockets Bind to sockets by default. Overrides automatic bind‐
4088 ing.
4089
4090 Threads Bind to threads by default. Overrides automatic bind‐
4091 ing.
4092
4093 SlurmdOffSpec
4094 If specialized cores or CPUs are identified for the
4095 node (i.e. the CoreSpecCount or CpuSpecList are con‐
4096 figured for the node), then Slurm daemons running on
4097 the compute node (i.e. slurmd and slurmstepd) should
4098 run outside of those resources (i.e. specialized
4099 resources are completely unavailable to Slurm daemons
4100 and jobs spawned by Slurm). This option may not be
4101 used with the task/cray_aries plugin.
4102
4103 Verbose Verbosely report binding before tasks run. Overrides
4104 user options.
4105
4106 Autobind Set a default binding in the event that "auto binding"
4107 doesn't find a match. Set to Threads, Cores or Sock‐
4108                       ets (e.g. TaskPluginParam=autobind=threads).
4109
4110
4111 TaskProlog
4112              Fully qualified pathname of a program to be executed as the slurm
4113 job's owner prior to initiation of each task. Besides the nor‐
4114 mal environment variables, this has SLURM_TASK_PID available to
4115 identify the process ID of the task being started. Standard
4116 output from this program can be used to control the environment
4117 variables and output for the user program.
4118
4119 export NAME=value Will set environment variables for the task
4120 being spawned. Everything after the equal
4121 sign to the end of the line will be used as
4122 the value for the environment variable.
4123 Exporting of functions is not currently sup‐
4124 ported.
4125
4126 print ... Will cause that line (without the leading
4127 "print ") to be printed to the job's stan‐
4128 dard output.
4129
4130 unset NAME Will clear environment variables for the
4131 task being spawned.
4132
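              A minimal sketch of a TaskProlog script using the directives
              above (the variable names are illustrative); the directives
              are emitted on the script's standard output:

                   #!/bin/sh
                   # Lines written to stdout are interpreted as directives:
                   echo "export OMP_NUM_THREADS=4"
                   echo "print task $SLURM_TASK_PID starting on $(hostname)"
                   echo "unset DISPLAY"
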
4133 The order of task prolog/epilog execution is as follows:
4134
4135                  1. pre_launch_priv()
4136                                  Function in TaskPlugin
4137
4138                  2. pre_launch() Function in TaskPlugin
4139
4140                  3. TaskProlog   System-wide per task program defined in
4141                                  slurm.conf
4142
4143                  4. user prolog  Job step specific task program defined using
4144                                  srun's --task-prolog option or
4145                                  SLURM_TASK_PROLOG environment variable
4146
4147                  5. Execute the job step's task
4148
4149                  6. user epilog  Job step specific task program defined using
4150                                  srun's --task-epilog option or
4151                                  SLURM_TASK_EPILOG environment variable
4152
4153                  7. TaskEpilog   System-wide per task program defined in
4154                                  slurm.conf
4155
4156                  8. post_term()  Function in TaskPlugin
4157
4158
4159 TCPTimeout
4160 Time permitted for TCP connection to be established. Default
4161 value is 2 seconds.
4162
4163
4164 TmpFS Fully qualified pathname of the file system available to user
4165 jobs for temporary storage. This parameter is used in establish‐
4166 ing a node's TmpDisk space. The default value is "/tmp".
4167
4168
4169 TopologyParam
4170 Comma separated options identifying network topology options.
4171
4172 Dragonfly Optimize allocation for Dragonfly network. Valid
4173 when TopologyPlugin=topology/tree.
4174
4175 TopoOptional Only optimize allocation for network topology if
4176 the job includes a switch option. Since optimiz‐
4177 ing resource allocation for topology involves
4178 much higher system overhead, this option can be
4179 used to impose the extra overhead only on jobs
4180 which can take advantage of it. If most job allo‐
4181 cations are not optimized for network topology,
4182 they may fragment resources to the point that
4183 topology optimization for other jobs will be dif‐
4184 ficult to achieve. NOTE: Jobs may span across
4185 nodes without common parent switches with this
4186 enabled.
4187
4188
4189 TopologyPlugin
4190 Identifies the plugin to be used for determining the network
4191 topology and optimizing job allocations to minimize network con‐
4192 tention. See NETWORK TOPOLOGY below for details. Additional
4193 plugins may be provided in the future which gather topology
4194 information directly from the network. Acceptable values
4195 include:
4196
4197 topology/3d_torus best-fit logic over three-dimensional
4198 topology
4199
4200 topology/none default for other systems, best-fit logic
4201 over one-dimensional topology
4202
4203 topology/tree used for a hierarchical network as
4204 described in a topology.conf file
4205
4206
4207 TrackWCKey
4208              Boolean yes or no. Used to enable display and tracking of the
4209              Workload Characterization Key. Must be set to track correct wckey
4210 usage. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4211 file to create historical usage reports.
4212
4213
4214 TreeWidth
4215 Slurmd daemons use a virtual tree network for communications.
4216 TreeWidth specifies the width of the tree (i.e. the fanout). On
4217 architectures with a front end node running the slurmd daemon,
4218 the value must always be equal to or greater than the number of
4219              front end nodes, which eliminates the need for message forwarding
4220 between the slurmd daemons. On other architectures the default
4221 value is 50, meaning each slurmd daemon can communicate with up
4222 to 50 other slurmd daemons and over 2500 nodes can be contacted
4223 with two message hops. The default value will work well for
4224 most clusters. Optimal system performance can typically be
4225 achieved if TreeWidth is set to the square root of the number of
4226 nodes in the cluster for systems having no more than 2500 nodes
4227 or the cube root for larger systems. The value may not exceed
4228 65533.
4229
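              As a worked example of the sizing guidance above, a 900-node
              cluster (fewer than 2500 nodes) would use the square root,
              sqrt(900) = 30:

                   TreeWidth=30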
4230
4231 UnkillableStepProgram
4232 If the processes in a job step are determined to be unkillable
4233 for a period of time specified by the UnkillableStepTimeout
4234 variable, the program specified by UnkillableStepProgram will be
4235 executed. This program can be used to take special actions to
4236 clean up the unkillable processes and/or notify computer admin‐
4237              istrators. The program will be run as SlurmdUser (usually "root")
4238 on the compute node. By default no program is run.
4239
4240
4241 UnkillableStepTimeout
4242 The length of time, in seconds, that Slurm will wait before
4243 deciding that processes in a job step are unkillable (after they
4244 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4245 gram as described above. The default timeout value is 60 sec‐
4246 onds. If exceeded, the compute node will be drained to prevent
4247 future jobs from being scheduled on the node.
4248
4249
4250 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4251 will be enabled. PAM is used to establish the upper bounds for
4252 resource limits. With PAM support enabled, local system adminis‐
4253 trators can dynamically configure system resource limits. Chang‐
4254 ing the upper bound of a resource limit will not alter the lim‐
4255 its of running jobs, only jobs started after a change has been
4256 made will pick up the new limits. The default value is 0 (not
4257 to enable PAM support). Remember that PAM also needs to be con‐
4258 figured to support Slurm as a service. For sites using PAM's
4259 directory based configuration option, a configuration file named
4260 slurm should be created. The module-type, control-flags, and
4261 module-path names that should be included in the file are:
4262 auth required pam_localuser.so
4263 auth required pam_shells.so
4264 account required pam_unix.so
4265 account required pam_access.so
4266 session required pam_unix.so
4267 For sites configuring PAM with a general configuration file, the
4268 appropriate lines (see above), where slurm is the service-name,
4269 should be added.
4270
4271              NOTE: The UsePAM option has nothing to do with the con‐
4272 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules. So
4273 these two modules can work independently of the value set for
4274 UsePAM.
4275
4276
4277 VSizeFactor
4278 Memory specifications in job requests apply to real memory size
4279 (also known as resident set size). It is possible to enforce
4280 virtual memory limits for both jobs and job steps by limiting
4281 their virtual memory to some percentage of their real memory
4282 allocation. The VSizeFactor parameter specifies the job's or job
4283 step's virtual memory limit as a percentage of its real memory
4284 limit. For example, if a job's real memory limit is 500MB and
4285 VSizeFactor is set to 101 then the job will be killed if its
4286 real memory exceeds 500MB or its virtual memory exceeds 505MB
4287 (101 percent of the real memory limit). The default value is 0,
4288 which disables enforcement of virtual memory limits. The value
4289 may not exceed 65533 percent.
4290
4291 NOTE: This parameter is dependent on OverMemoryKill being con‐
4292 figured in JobAcctGatherParams. It is also possible to configure
4293 the TaskPlugin to use task/cgroup for memory enforcement. VSize‐
4294 Factor will not have an effect on memory enforcement done
4295 through cgroups.
4296
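              A sketch combining the two settings mentioned in the note
              above; 110 is an illustrative value, not a default:

                   JobAcctGatherParams=OverMemoryKill
                   VSizeFactor=110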
4297
4298 WaitTime
4299 Specifies how many seconds the srun command should by default
4300 wait after the first task terminates before terminating all
4301 remaining tasks. The "--wait" option on the srun command line
4302 overrides this value. The default value is 0, which disables
4303 this feature. May not exceed 65533 seconds.
4304
4305
4306 X11Parameters
4307 For use with Slurm's built-in X11 forwarding implementation.
4308
4309 home_xauthority
4310 If set, xauth data on the compute node will be placed in
4311 ~/.Xauthority rather than in a temporary file under
4312 TmpFS.
4313
4314
4315NODE CONFIGURATION
4316       The configuration of nodes (or machines) to be managed by Slurm is also
4317 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4318 adding nodes, changing their processor count, etc.) require restarting
4319 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4320 must know each node in the system to forward messages in support of
4321 hierarchical communications. Only the NodeName must be supplied in the
4322 configuration file. All other node configuration information is
4323 optional. It is advisable to establish baseline node configurations,
4324 especially if the cluster is heterogeneous. Nodes which register to
4325 the system with less than the configured resources (e.g. too little
4326       memory) will be placed in the "DOWN" state to avoid scheduling jobs on
4327 them. Establishing baseline configurations will also speed Slurm's
4328 scheduling process by permitting it to compare job requirements against
4329 these (relatively few) configuration parameters and possibly avoid hav‐
4330 ing to check job requirements against every individual node's configu‐
4331 ration. The resources checked at node registration time are: CPUs,
4332 RealMemory and TmpDisk.
4333
4334 Default values can be specified with a record in which NodeName is
4335 "DEFAULT". The default entry values will apply only to lines following
4336 it in the configuration file and the default values can be reset multi‐
4337 ple times in the configuration file with multiple entries where "Node‐
4338 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4339       add to previous default values and not reinitialize the default val‐
4340 ues. The "NodeName=" specification must be placed on every line
4341 describing the configuration of nodes. A single node name can not
4342 appear as a NodeName value in more than one line (duplicate node name
4343 records will be ignored). In fact, it is generally possible and desir‐
4344 able to define the configurations of all nodes in only a few lines.
4345 This convention permits significant optimization in the scheduling of
4346 larger clusters. In order to support the concept of jobs requiring
4347 consecutive nodes on some architectures, node specifications should be
4348       placed in this file in consecutive order. No single node name may be
4349 listed more than once in the configuration file. Use "DownNodes=" to
4350 record the state of nodes which are temporarily in a DOWN, DRAIN or
4351 FAILING state without altering permanent configuration information. A
4352       job step's tasks are allocated to nodes in the order the nodes appear in
4353 the configuration file. There is presently no capability within Slurm
4354 to arbitrarily order a job step's tasks.
4355
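       For example, default values can be set, applied to a range of nodes, and
       then adjusted for later lines (node names and sizes are illustrative):

            NodeName=DEFAULT CPUs=32 RealMemory=128000 State=UNKNOWN
            NodeName=node[001-100]
            NodeName=DEFAULT RealMemory=256000
            NodeName=bigmem[01-04]
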
4356 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4357 and/or a simple node range expression may optionally be used to specify
4358 numeric ranges of nodes to avoid building a configuration file with
4359 large numbers of entries. The node range expression can contain one
4360 pair of square brackets with a sequence of comma separated numbers
4361 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4362 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4363 more leading zeros to indicate the numeric portion has a fixed number
4364 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4365 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4366 more numeric expressions are included, one of them must be at the end
4367 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4368 always be used in a comma separated list.
4369
4370       The node configuration specifies the following information:
4371
4372
4373 NodeName
4374 Name that Slurm uses to refer to a node. Typically this would
4375 be the string that "/bin/hostname -s" returns. It may also be
4376 the fully qualified domain name as returned by "/bin/hostname
4377 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4378 with the host through the host database (/etc/hosts) or DNS,
4379 depending on the resolver settings. Note that if the short form
4380 of the hostname is not used, it may prevent use of hostlist
4381 expressions (the numeric portion in brackets must be at the end
4382 of the string). It may also be an arbitrary string if NodeHost‐
4383 name is specified. If the NodeName is "DEFAULT", the values
4384 specified with that record will apply to subsequent node speci‐
4385 fications unless explicitly set to other values in that node
4386 record or replaced with a different set of default values. Each
4387 line where NodeName is "DEFAULT" will replace or add to previous
4388              default values and not reinitialize the default values. For
4389 architectures in which the node order is significant, nodes will
4390 be considered consecutive in the order defined. For example, if
4391 the configuration for "NodeName=charlie" immediately follows the
4392 configuration for "NodeName=baker" they will be considered adja‐
4393 cent in the computer.
4394
4395
4396 NodeHostname
4397 Typically this would be the string that "/bin/hostname -s"
4398 returns. It may also be the fully qualified domain name as
4399 returned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any
4400 valid domain name associated with the host through the host
4401 database (/etc/hosts) or DNS, depending on the resolver set‐
4402 tings. Note that if the short form of the hostname is not used,
4403 it may prevent use of hostlist expressions (the numeric portion
4404 in brackets must be at the end of the string). A node range
4405 expression can be used to specify a set of nodes. If an expres‐
4406 sion is used, the number of nodes identified by NodeHostname on
4407 a line in the configuration file must be identical to the number
4408 of nodes identified by NodeName. By default, the NodeHostname
4409 will be identical in value to NodeName.
4410
4411
4412 NodeAddr
4413 Name that a node should be referred to in establishing a commu‐
4414 nications path. This name will be used as an argument to the
4415 getaddrinfo() function for identification. If a node range
4416 expression is used to designate multiple nodes, they must
4417 exactly match the entries in the NodeName (e.g. "Node‐
4418 Name=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also contain IP
4419 addresses. By default, the NodeAddr will be identical in value
4420 to NodeHostname.
4421
4422
4423 BcastAddr
4424 Alternate network path to be used for sbcast network traffic to
4425 a given node. This name will be used as an argument to the
4426 getaddrinfo() function. If a node range expression is used to
4427 designate multiple nodes, they must exactly match the entries in
4428 the NodeName (e.g. "NodeName=lx[0-7] BcastAddr=elx[0-7]").
4429 BcastAddr may also contain IP addresses. By default, the Bcas‐
4430 tAddr is unset, and sbcast traffic will be routed to the
4431 NodeAddr for a given node. Note: cannot be used with Communica‐
4432 tionParameters=NoInAddrAny.
4433
4434
4435 Boards Number of Baseboards in nodes with a baseboard controller. Note
4436 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4437 and ThreadsPerCore should be specified. Boards and CPUs are
4438 mutually exclusive. The default value is 1.
4439
4440
4441 CoreSpecCount
4442 Number of cores reserved for system use. These cores will not
4443 be available for allocation to user jobs. Depending upon the
4444 TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e.
4445 slurmd and slurmstepd) may either be confined to these resources
4446 (the default) or prevented from using these resources. Isola‐
4447 tion of the Slurm daemons from user jobs may improve application
4448 performance. If this option and CpuSpecList are both designated
4449 for a node, an error is generated. For information on the algo‐
4450 rithm used by Slurm to select the cores refer to the core spe‐
4451 cialization documentation (
4452 https://slurm.schedmd.com/core_spec.html ).
4453
4454
4455 CoresPerSocket
4456 Number of cores in a single physical processor socket (e.g.
4457 "2"). The CoresPerSocket value describes physical cores, not
4458 the logical number of processors per socket. NOTE: If you have
4459 multi-core processors, you will likely need to specify this
4460 parameter in order to optimize scheduling. The default value is
4461 1.
4462
4463
4464 CpuBind
4465 If a job step request does not specify an option to control how
4466 tasks are bound to allocated CPUs (--cpu-bind) and all nodes
4467              allocated to the job have the same CpuBind option, the node Cpu‐
4468 Bind option will control how tasks are bound to allocated
4469 resources. Supported values for CpuBind are "none", "board",
4470 "socket", "ldom" (NUMA), "core" and "thread".
4471
4472
4473 CPUs Number of logical processors on the node (e.g. "2"). CPUs and
4474 Boards are mutually exclusive. It can be set to the total number
4475              of sockets (supported only by select/linear), cores or threads.
4476 This can be useful when you want to schedule only the cores on a
4477 hyper-threaded node. If CPUs is omitted, its default will be set
4478 equal to the product of Boards, Sockets, CoresPerSocket, and
4479 ThreadsPerCore.
4480
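              For example (illustrative hardware), a node with 2 sockets, 16
              cores per socket and 2 threads per core defaults to
              CPUs = 1 x 2 x 16 x 2 = 64:

                   NodeName=node01 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=192000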
4481
4482 CpuSpecList
4483 A comma delimited list of Slurm abstract CPU IDs reserved for
4484 system use. The list will be expanded to include all other
4485 CPUs, if any, on the same cores. These cores will not be avail‐
4486 able for allocation to user jobs. Depending upon the TaskPlug‐
4487 inParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and
4488 slurmstepd) may either be confined to these resources (the
4489 default) or prevented from using these resources. Isolation of
4490 the Slurm daemons from user jobs may improve application perfor‐
4491 mance. If this option and CoreSpecCount are both designated for
4492 a node, an error is generated. This option has no effect unless
4493 cgroup job confinement is also configured (TaskPlu‐
4494 gin=task/cgroup with ConstrainCores=yes in cgroup.conf).
4495
4496
4497 Features
4498 A comma delimited list of arbitrary strings indicative of some
4499 characteristic associated with the node. There is no value or
4500 count associated with a feature at this time, a node either has
4501 a feature or it does not. A desired feature may contain a
4502 numeric component indicating, for example, processor speed but
4503 this numeric component will be considered to be part of the fea‐
4504 ture string. Features are intended to be used to filter nodes
4505 eligible to run jobs via the --constraint argument. By default
4506 a node has no features. Also see Gres for being able to have
4507 more control such as types and count. Using features is faster
4508 than scheduling against GRES but is limited to Boolean opera‐
4509 tions.
4510
4511
4512 Gres A comma delimited list of generic resources specifications for a
4513 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4514 ber>[K|M|G]". The first field is the resource name, which
4515 matches the GresType configuration parameter name. The optional
4516 type field might be used to identify a model of that generic
4517 resource. It is forbidden to specify both an untyped GRES and a
4518 typed GRES with the same <name>. The optional no_consume field
4519 allows you to specify that a generic resource does not have a
4520 finite number of that resource that gets consumed as it is
4521 requested. The no_consume field is a GRES specific setting and
4522 applies to the GRES, regardless of the type specified. The
4523 final field must specify a generic resources count. A suffix of
4524 "K", "M", "G", "T" or "P" may be used to multiply the number by
4525 1024, 1048576, 1073741824, etc. respectively.
4526 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4527 sume:4G"). By default a node has no generic resources and its
4528 maximum count is that of an unsigned 64bit integer. Also see
4529 Features for Boolean flags to filter nodes using job con‐
4530 straints.
4531
4532
4533 MemSpecLimit
4534 Amount of memory, in megabytes, reserved for system use and not
4535 available for user allocations. If the task/cgroup plugin is
4536 configured and that plugin constrains memory allocations (i.e.
4537 TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes
4538 in cgroup.conf), then Slurm compute node daemons (slurmd plus
4539 slurmstepd) will be allocated the specified memory limit. Note
4540              that for this option to work, SelectTypeParameters must be set
4541              to an option that treats memory as a consumable resource. The
4542              daemons will not be killed if they exhaust the memory alloca‐
4543              tion (i.e. the Out-Of-Memory Killer is disabled
4544 for the daemon's memory cgroup). If the task/cgroup plugin is
4545 not configured, the specified memory will only be unavailable
4546 for user allocations.
4547
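              A sketch reserving cores and memory for the Slurm daemons (the
              node name and values are illustrative); as described above,
              MemSpecLimit additionally requires memory to be configured as
              a consumable resource and cgroup memory constraints:

                   NodeName=node01 CPUs=64 RealMemory=192000 CoreSpecCount=2 MemSpecLimit=2048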
4548
4549 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4550 tens to for work on this particular node. By default there is a
4551 single port number for all slurmd daemons on all compute nodes
4552 as defined by the SlurmdPort configuration parameter. Use of
4553 this option is not generally recommended except for development
4554 or testing purposes. If multiple slurmd daemons execute on a
4555 node this can specify a range of ports.
4556
4557 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4558 automatically try to interact with anything opened on ports
4559 8192-60000. Configure Port to use a port outside of the config‐
4560 ured SrunPortRange and RSIP's port range.
4561
4562
4563 Procs See CPUs.
4564
4565
4566 RealMemory
4567 Size of real memory on the node in megabytes (e.g. "2048"). The
4568 default value is 1. Lowering RealMemory with the goal of setting
4569 aside some amount for the OS and not available for job alloca‐
4570 tions will not work as intended if Memory is not set as a con‐
4571 sumable resource in SelectTypeParameters. So one of the *_Memory
4572              options needs to be enabled for that goal to be accomplished.
4573 Also see MemSpecLimit.
4574
4575
4576 Reason Identifies the reason for a node being in state "DOWN",
4577              "DRAINED", "DRAINING", "FAIL" or "FAILING". Use quotes to
4578 enclose a reason having more than one word.
4579
4580
4581 Sockets
4582 Number of physical processor sockets/chips on the node (e.g.
4583 "2"). If Sockets is omitted, it will be inferred from CPUs,
4584 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4585 multi-core processors, you will likely need to specify these
4586 parameters. Sockets and SocketsPerBoard are mutually exclusive.
4587 If Sockets is specified when Boards is also used, Sockets is
4588 interpreted as SocketsPerBoard rather than total sockets. The
4589 default value is 1.
4590
4591
4592 SocketsPerBoard
4593 Number of physical processor sockets/chips on a baseboard.
4594 Sockets and SocketsPerBoard are mutually exclusive. The default
4595 value is 1.
4596
4597
4598 State State of the node with respect to the initiation of user jobs.
4599 Acceptable values are CLOUD, DOWN, DRAIN, FAIL, FAILING, FUTURE
4600 and UNKNOWN. Node states of BUSY and IDLE should not be speci‐
4601 fied in the node configuration, but set the node state to
4602 UNKNOWN instead. Setting the node state to UNKNOWN will result
4603 in the node state being set to BUSY, IDLE or other appropriate
4604 state based upon recovered system state information. The
4605 default value is UNKNOWN. Also see the DownNodes parameter
4606 below.
4607
4608 CLOUD Indicates the node exists in the cloud. Its initial
4609 state will be treated as powered down. The node will
4610 be available for use after its state is recovered from
4611 Slurm's state save file or the slurmd daemon starts on
4612 the compute node.
4613
4614 DOWN Indicates the node failed and is unavailable to be
4615 allocated work.
4616
4617 DRAIN Indicates the node is unavailable to be allocated
4618 work.
4619
4620 FAIL Indicates the node is expected to fail soon, has no
4621 jobs allocated to it, and will not be allocated to any
4622 new jobs.
4623
4624 FAILING Indicates the node is expected to fail soon, has one
4625 or more jobs allocated to it, but will not be allo‐
4626 cated to any new jobs.
4627
4628 FUTURE Indicates the node is defined for future use and need
4629 not exist when the Slurm daemons are started. These
4630 nodes can be made available for use simply by updating
4631 the node state using the scontrol command rather than
4632 restarting the slurmctld daemon. After these nodes are
4633 made available, change their State in the slurm.conf
4634 file. Until these nodes are made available, they will
4635                   not be seen using any Slurm commands, nor will any
4636 attempt be made to contact them.
4637
4638
4639 Dynamic Future Nodes
4640 A slurmd started with -F[<feature>] will be
4641 associated with a FUTURE node that matches the
4642 same configuration (sockets, cores, threads) as
4643 reported by slurmd -C. The node's NodeAddr and
4644 NodeHostname will automatically be retrieved
4645 from the slurmd and will be cleared when set
4646 back to the FUTURE state. Dynamic FUTURE nodes
4647 retain non-FUTURE state on restart. Use scon‐
4648                          trol to put a node back into the FUTURE state.
4649
4650 If the mapping of the NodeName to the slurmd
4651 HostName is not updated in DNS, Dynamic Future
4652 nodes won't know how to communicate with each
4653 other -- because NodeAddr and NodeHostName are
4654 not defined in the slurm.conf -- and the fanout
4655 communications need to be disabled by setting
4656 TreeWidth to a high number (e.g. 65533). If the
4657 DNS mapping is made, then the cloud_dns Slurm‐
4658 ctldParameter can be used.
4659
4660
4661 UNKNOWN Indicates the node's state is undefined but will be
4662 established (set to BUSY or IDLE) when the slurmd dae‐
4663 mon on that node registers. UNKNOWN is the default
4664 state.
4665
4666
4667 ThreadsPerCore
4668 Number of logical threads in a single physical core (e.g. "2").
4669              Note that Slurm can allocate resources to jobs down to the
4670 resolution of a core. If your system is configured with more
4671 than one thread per core, execution of a different job on each
4672 thread is not supported unless you configure SelectTypeParame‐
4673 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4674              or ThreadsPerCore. A job can execute one task per thread from
4675 within one job step or execute a distinct job step on each of
4676 the threads. Note also if you are running with more than 1
4677 thread per core and running the select/cons_res or
4678 select/cons_tres plugin then you will want to set the Select‐
4679 TypeParameters variable to something other than CR_CPU to avoid
4680 unexpected results. The default value is 1.
4681
4682
4683 TmpDisk
4684 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4685 "16384"). TmpFS (for "Temporary File System") identifies the
4686 location which jobs should use for temporary storage. Note this
4687 does not indicate the amount of free space available to the user
4688 on the node, only the total file system size. The system admin‐
4689              istrator should ensure this file system is purged as needed so
4690 that user jobs have access to most of this space. The Prolog
4691 and/or Epilog programs (specified in the configuration file)
4692 might be used to ensure the file system is kept clean. The
4693 default value is 0.
4694
4695
4696       TRESWeights
4697              TRESWeights are used to calculate a value that represents how
4698              busy a node is. Currently only used in federation configura‐
4699 tions. TRESWeights are different from TRESBillingWeights --
4700 which is used for fairshare calculations.
4701
4702 TRES weights are specified as a comma-separated list of <TRES
4703 Type>=<TRES Weight> pairs.
4704 e.g.
4705 NodeName=node1 ... TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
4706
4707 By default the weighted TRES value is calculated as the sum of
4708 all node TRES types multiplied by their corresponding TRES
4709 weight.
4710
4711 If PriorityFlags=MAX_TRES is configured, the weighted TRES value
4712 is calculated as the MAX of individual node TRES' (e.g. cpus,
4713 mem, gres).
4714
4715
4716 Weight The priority of the node for scheduling purposes. All things
4717 being equal, jobs will be allocated the nodes with the lowest
4718 weight which satisfies their requirements. For example, a het‐
4719 erogeneous collection of nodes might be placed into a single
4720 partition for greater system utilization, responsiveness and
4721 capability. It would be preferable to allocate smaller memory
4722 nodes rather than larger memory nodes if either will satisfy a
4723 job's requirements. The units of weight are arbitrary, but
4724 larger weights should be assigned to nodes with more processors,
4725 memory, disk space, higher processor speed, etc. Note that if a
4726 job allocation request can not be satisfied using the nodes with
4727 the lowest weight, the set of nodes with the next lowest weight
4728 is added to the set of nodes under consideration for use (repeat
4729 as needed for higher weight values). If you absolutely want to
4730 minimize the number of higher weight nodes allocated to a job
4731 (at a cost of higher scheduling overhead), give each node a dis‐
4732 tinct Weight value and they will be added to the pool of nodes
4733 being considered for scheduling individually. The default value
4734 is 1.
4735
4736
4737DOWN NODE CONFIGURATION
4738       The DownNodes= parameter permits you to mark certain nodes as in a
4739 DOWN, DRAIN, FAIL, FAILING or FUTURE state without altering the perma‐
4740 nent configuration information listed under a NodeName= specification.
4741
4742
4743 DownNodes
4744 Any node name, or list of node names, from the NodeName= speci‐
4745 fications.
4746
4747
4748 Reason Identifies the reason for a node being in state DOWN, DRAIN,
4749 FAIL, FAILING or FUTURE. Use quotes to enclose a reason having
4750 more than one word.
4751
4752
4753 State State of the node with respect to the initiation of user jobs.
4754 Acceptable values are DOWN, DRAIN, FAIL, FAILING and FUTURE.
4755 For more information about these states see the descriptions
4756 under State in the NodeName= section above. The default value
4757 is DOWN.
4758
4759
4760FRONTEND NODE CONFIGURATION
4761       On computers where frontend nodes are used to execute batch scripts
4762 rather than compute nodes (Cray ALPS systems), one may configure one or
4763 more frontend nodes using the configuration parameters defined below.
4764 These options are very similar to those used in configuring compute
4765 nodes. These options may only be used on systems configured and built
4766 with the appropriate parameters (--have-front-end) or a system deter‐
4767 mined to have the appropriate architecture by the configure script
4768 (Cray ALPS systems). The front end configuration specifies the follow‐
4769 ing information:
4770
4771
4772 AllowGroups
4773 Comma separated list of group names which may execute jobs on
4774 this front end node. By default, all groups may use this front
4775 end node. If at least one group associated with the user
4776 attempting to execute the job is in AllowGroups, he will be per‐
4777 mitted to use this front end node. May not be used with the
4778 DenyGroups option.
4779
4780
4781 AllowUsers
4782 Comma separated list of user names which may execute jobs on
4783 this front end node. By default, all users may use this front
4784 end node. May not be used with the DenyUsers option.
4785
4786
4787 DenyGroups
4788 Comma separated list of group names which are prevented from
4789 executing jobs on this front end node. May not be used with the
4790 AllowGroups option.
4791
4792
4793 DenyUsers
4794 Comma separated list of user names which are prevented from exe‐
4795 cuting jobs on this front end node. May not be used with the
4796 AllowUsers option.
4797
4798
4799 FrontendName
4800 Name that Slurm uses to refer to a frontend node. Typically
4801 this would be the string that "/bin/hostname -s" returns. It
4802 may also be the fully qualified domain name as returned by
4803 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4804 name associated with the host through the host database
4805 (/etc/hosts) or DNS, depending on the resolver settings. Note
4806 that if the short form of the hostname is not used, it may pre‐
4807 vent use of hostlist expressions (the numeric portion in brack‐
4808 ets must be at the end of the string). If the FrontendName is
4809 "DEFAULT", the values specified with that record will apply to
4810 subsequent node specifications unless explicitly set to other
4811 values in that frontend node record or replaced with a different
4812 set of default values. Each line where FrontendName is
4813 "DEFAULT" will replace or add to previous default values and not
4814              reinitialize the default values.
4815
4816
4817 FrontendAddr
4818 Name that a frontend node should be referred to in establishing
4819 a communications path. This name will be used as an argument to
4820 the getaddrinfo() function for identification. As with Fron‐
4821 tendName, list the individual node addresses rather than using a
4822 hostlist expression. The number of FrontendAddr records per
4823 line must equal the number of FrontendName records per line
4824              (i.e. you can't map two node names to one address). FrontendAddr
4825 may also contain IP addresses. By default, the FrontendAddr
4826 will be identical in value to FrontendName.
4827
4828
4829 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4830 tens to for work on this particular frontend node. By default
4831 there is a single port number for all slurmd daemons on all
4832 frontend nodes as defined by the SlurmdPort configuration param‐
4833 eter. Use of this option is not generally recommended except for
4834 development or testing purposes.
4835
4836 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4837 automatically try to interact with anything opened on ports
4838 8192-60000. Configure Port to use a port outside of the config‐
4839 ured SrunPortRange and RSIP's port range.
4840
4841
4842 Reason Identifies the reason for a frontend node being in state DOWN,
4843 DRAINED, DRAINING, FAIL or FAILING. Use quotes to enclose a
4844 reason having more than one word.
4845
4846
4847 State State of the frontend node with respect to the initiation of
4848 user jobs. Acceptable values are DOWN, DRAIN, FAIL, FAILING and
4849 UNKNOWN. Node states of BUSY and IDLE should not be specified
4850 in the node configuration, but set the node state to UNKNOWN
4851 instead. Setting the node state to UNKNOWN will result in the
4852 node state being set to BUSY, IDLE or other appropriate state
4853 based upon recovered system state information. For more infor‐
4854 mation about these states see the descriptions under State in
4855 the NodeName= section above. The default value is UNKNOWN.
4856
4857
4858 As an example, you can do something similar to the following to define
4859 four front end nodes for running slurmd daemons.
4860 FrontendName=frontend[00-03] FrontendAddr=efrontend[00-03] State=UNKNOWN
4861
4862
4863NODESET CONFIGURATION
4864       The nodeset configuration allows you to define a name for a specific
4865 set of nodes which can be used to simplify the partition configuration
4866       section, especially for heterogeneous or condo-style systems. Each node‐
4867 set may be defined by an explicit list of nodes, and/or by filtering
4868 the nodes by a particular configured feature. If both Feature= and
4869 Nodes= are used the nodeset shall be the union of the two subsets.
4870 Note that the nodesets are only used to simplify the partition defini‐
4871 tions at present, and are not usable outside of the partition configu‐
4872 ration.
4873
4874 Feature
4875 All nodes with this single feature will be included as part of
4876 this nodeset.
4877
4878 Nodes List of nodes in this set.
4879
4880 NodeSet
4881 Unique name for a set of nodes. Must not overlap with any Node‐
4882 Name definitions.
4883
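       A sketch using a feature-based nodeset to simplify a partition
       definition (the node, feature and partition names are illustrative):

            NodeName=gpu[01-08] Gres=gpu:4 Feature=gpu
            NodeSet=gpunodes Feature=gpu
            PartitionName=gpu Nodes=gpunodes MaxTime=24:00:00 State=UP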
4884
4885PARTITION CONFIGURATION
4886       The partition configuration permits you to establish different job lim‐
4887 its or access controls for various groups (or partitions) of nodes.
4888 Nodes may be in more than one partition, making partitions serve as
4889 general purpose queues. For example one may put the same set of nodes
4890 into two different partitions, each with different constraints (time
4891 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4892 allocated resources within a single partition. Default values can be
4893 specified with a record in which PartitionName is "DEFAULT". The
4894 default entry values will apply only to lines following it in the con‐
4895 figuration file and the default values can be reset multiple times in
4896 the configuration file with multiple entries where "Partition‐
4897 Name=DEFAULT". The "PartitionName=" specification must be placed on
4898 every line describing the configuration of partitions. Each line where
4899 PartitionName is "DEFAULT" will replace or add to previous default val‐
4900       ues and not reinitialize the default values. A single partition name
4901 can not appear as a PartitionName value in more than one line (dupli‐
4902 cate partition name records will be ignored). If a partition that is
4903       in use is deleted from the configuration and Slurm is restarted or
4904 reconfigured (scontrol reconfigure), jobs using the partition are can‐
4905 celed. NOTE: Put all parameters for each partition on a single line.
4906 Each line of partition configuration information should represent a
4907 different partition. The partition configuration file contains the
4908 following information:
4909
4910
4911 AllocNodes
4912 Comma separated list of nodes from which users can submit jobs
4913 in the partition. Node names may be specified using the node
4914 range expression syntax described above. The default value is
4915 "ALL".
4916
4917
4918 AllowAccounts
4919 Comma separated list of accounts which may execute jobs in the
4920 partition. The default value is "ALL". NOTE: If AllowAccounts
4921 is used then DenyAccounts will not be enforced. Also refer to
4922 DenyAccounts.
4923
4924
4925 AllowGroups
4926 Comma separated list of group names which may execute jobs in
4927 the partition. If at least one group associated with the user
4928 attempting to execute the job is in AllowGroups, he will be per‐
4929 mitted to use this partition. Jobs executed as user root can
4930 use any partition without regard to the value of AllowGroups.
4931 If user root attempts to execute a job as another user (e.g.
4932 using srun's --uid option), this other user must be in one of
4933 groups identified by AllowGroups for the job to successfully
4934 execute. The default value is "ALL". When set, all partitions
4935              that a user does not have access to will be hidden from display
4936 regardless of the settings used for PrivateData. NOTE: For per‐
4937 formance reasons, Slurm maintains a list of user IDs allowed to
4938 use each partition and this is checked at job submission time.
4939 This list of user IDs is updated when the slurmctld daemon is
4940 restarted, reconfigured (e.g. "scontrol reconfig") or the parti‐
4941              tion's AllowGroups value is reset, even if its value is unchanged
4942 (e.g. "scontrol update PartitionName=name AllowGroups=group").
4943 For a user's access to a partition to change, both his group
4944 membership must change and Slurm's internal user ID list must
4945 change using one of the methods described above.
4946
4947
4948 AllowQos
4949 Comma separated list of Qos which may execute jobs in the parti‐
4950 tion. Jobs executed as user root can use any partition without
4951 regard to the value of AllowQos. The default value is "ALL".
4952 NOTE: If AllowQos is used then DenyQos will not be enforced.
4953 Also refer to DenyQos.
4954
4955
4956 Alternate
4957 Partition name of alternate partition to be used if the state of
4958 this partition is "DRAIN" or "INACTIVE."
4959
4960
4961 CpuBind
4962 If a job step request does not specify an option to control how
4963 tasks are bound to allocated CPUs (--cpu-bind) and all nodes
4964              allocated to the job do not have the same CpuBind option, then
4965              the partition's CpuBind option will control how tasks are
4966              bound to allocated resources. Supported values for CpuBind
4967 are "none", "board", "socket", "ldom" (NUMA), "core" and
4968 "thread".
4969
4970
4971 Default
4972 If this keyword is set, jobs submitted without a partition spec‐
4973 ification will utilize this partition. Possible values are
4974 "YES" and "NO". The default value is "NO".
4975
4976
4977 DefCpuPerGPU
4978 Default count of CPUs allocated per allocated GPU.
4979
4980
4981 DefMemPerCPU
4982 Default real memory size available per allocated CPU in
4983 megabytes. Used to avoid over-subscribing memory and causing
4984 paging. DefMemPerCPU would generally be used if individual pro‐
4985 cessors are allocated to jobs (SelectType=select/cons_res or
4986 SelectType=select/cons_tres). If not set, the DefMemPerCPU
4987 value for the entire cluster will be used. Also see DefMem‐
4988 PerGPU, DefMemPerNode and MaxMemPerCPU. DefMemPerCPU, DefMem‐
4989 PerGPU and DefMemPerNode are mutually exclusive.
4990
4991
4992 DefMemPerGPU
4993 Default real memory size available per allocated GPU in
4994 megabytes. Also see DefMemPerCPU, DefMemPerNode and MaxMemPer‐
4995 CPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
4996 exclusive.
4997
4998
4999 DefMemPerNode
5000 Default real memory size available per allocated node in
5001 megabytes. Used to avoid over-subscribing memory and causing
5002 paging. DefMemPerNode would generally be used if whole nodes
5003 are allocated to jobs (SelectType=select/linear) and resources
5004 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5005 If not set, the DefMemPerNode value for the entire cluster will
5006 be used. Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
5007 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
5008 sive.
5009
5010
5011 DenyAccounts
5012 Comma separated list of accounts which may not execute jobs in
5013              the partition. By default, no accounts are denied access. NOTE:
5014 If AllowAccounts is used then DenyAccounts will not be enforced.
5015 Also refer to AllowAccounts.
5016
5017
5018 DenyQos
5019 Comma separated list of Qos which may not execute jobs in the
5020              partition. By default, no QOS are denied access. NOTE: If
5021              AllowQos is used then DenyQos will not be enforced. Also refer
5022              to AllowQos.
5023
5024
5025 DefaultTime
5026 Run time limit used for jobs that don't specify a value. If not
5027 set then MaxTime will be used. Format is the same as for Max‐
5028 Time.
5029
5030
5031 DisableRootJobs
5032 If set to "YES" then user root will be prevented from running
5033 any jobs on this partition. The default value will be the value
5034 of DisableRootJobs set outside of a partition specification
5035 (which is "NO", allowing user root to execute jobs).
5036
5037
5038 ExclusiveUser
5039 If set to "YES" then nodes will be exclusively allocated to
5040 users. Multiple jobs may be run for the same user, but only one
5041 user can be active at a time. This capability is also available
5042 on a per-job basis by using the --exclusive=user option.
5043
5044
5045 GraceTime
5046 Specifies, in units of seconds, the preemption grace time to be
5047 extended to a job which has been selected for preemption. The
5048 default value is zero, no preemption grace time is allowed on
5049 this partition. Once a job has been selected for preemption,
5050 its end time is set to the current time plus GraceTime. The
5051 job's tasks are immediately sent SIGCONT and SIGTERM signals in
5052 order to provide notification of its imminent termination. This
5053 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
5054 upon reaching its new end time. This second set of signals is
5055 sent to both the tasks and the containing batch script, if
5056 applicable. See also the global KillWait configuration parame‐
5057 ter.
5058
5059
5060 Hidden Specifies if the partition and its jobs are to be hidden by
5061 default. Hidden partitions will by default not be reported by
5062 the Slurm APIs or commands. Possible values are "YES" and "NO".
5063 The default value is "NO". Note that partitions that a user
5064 lacks access to by virtue of the AllowGroups parameter will also
5065 be hidden by default.
5066
5067
5068 LLN Schedule resources to jobs on the least loaded nodes (based upon
5069 the number of idle CPUs). This is generally only recommended for
5070 an environment with serial jobs as idle resources will tend to
5071 be highly fragmented, resulting in parallel jobs being distrib‐
5072 uted across many nodes. Note that node Weight takes precedence
5073 over how many idle resources are on each node. Also see the
5074 SelectParameters configuration parameter CR_LLN to use the least
5075 loaded nodes in every partition.
5076
5077
5078 MaxCPUsPerNode
5079 Maximum number of CPUs on any node available to all jobs from
5080 this partition. This can be especially useful to schedule GPUs.
5081 For example a node can be associated with two Slurm partitions
5082 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
5083 limited to only a subset of the node's CPUs, ensuring that one
5084 or more CPUs would be available to jobs in the "gpu" parti‐
5085 tion/queue.
5086
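              A sketch of the scenario described above (the node and
              partition names, and the CPU split, are illustrative):

                   NodeName=node01 CPUs=64 Gres=gpu:4
                   PartitionName=cpu Nodes=node01 MaxCPUsPerNode=56
                   PartitionName=gpu Nodes=node01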
5087
5088 MaxMemPerCPU
5089 Maximum real memory size available per allocated CPU in
5090 megabytes. Used to avoid over-subscribing memory and causing
5091 paging. MaxMemPerCPU would generally be used if individual pro‐
5092 cessors are allocated to jobs (SelectType=select/cons_res or
5093 SelectType=select/cons_tres). If not set, the MaxMemPerCPU
5094 value for the entire cluster will be used. Also see DefMemPer‐
5095 CPU and MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutu‐
5096 ally exclusive.
5097
5098
5099 MaxMemPerNode
5100 Maximum real memory size available per allocated node in
5101 megabytes. Used to avoid over-subscribing memory and causing
5102 paging. MaxMemPerNode would generally be used if whole nodes
5103 are allocated to jobs (SelectType=select/linear) and resources
5104 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
5105 If not set, the MaxMemPerNode value for the entire cluster will
5106 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
5107 and MaxMemPerNode are mutually exclusive.
5108
5109
5110 MaxNodes
5111 Maximum count of nodes which may be allocated to any single job.
5112 The default value is "UNLIMITED", which is represented inter‐
5113 nally as -1. This limit does not apply to jobs executed by
5114 SlurmUser or user root.
5115
5116
5117 MaxTime
5118 Maximum run time limit for jobs. Format is minutes, min‐
5119 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
5120 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
5121 tion is one minute and second values are rounded up to the next
5122 minute. This limit does not apply to jobs executed by SlurmUser
5123 or user root.
5124
5125
5126 MinNodes
5127 Minimum count of nodes which may be allocated to any single job.
5128 The default value is 0. This limit does not apply to jobs exe‐
5129 cuted by SlurmUser or user root.
5130
5131
5132 Nodes Comma separated list of nodes or nodesets which are associated
5133 with this partition. Node names may be specified using the node
5134 range expression syntax described above. A blank list of nodes
5135 (i.e. "Nodes= ") can be used if one wants a partition to exist,
5136 but have no resources (possibly on a temporary basis). A value
5137 of "ALL" is mapped to all nodes configured in the cluster.
5138
5139
5140 OverSubscribe
5141 Controls the ability of the partition to execute more than one
5142 job at a time on each resource (node, socket or core depending
5143 upon the value of SelectTypeParameters). If resources are to be
5144 over-subscribed, avoiding memory over-subscription is very
5145 important. SelectTypeParameters should be configured to treat
5146 memory as a consumable resource and the --mem option should be
5147 used for job allocations. Sharing of resources is typically
5148 useful only when using gang scheduling (PreemptMode=sus‐
5149 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
5150 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
5151 can negatively impact performance for systems with many thou‐
5152 sands of running jobs. The default value is "NO". For more
5153 information see the following web pages:
5154 https://slurm.schedmd.com/cons_res.html
5155 https://slurm.schedmd.com/cons_res_share.html
5156 https://slurm.schedmd.com/gang_scheduling.html
5157 https://slurm.schedmd.com/preempt.html
5158
5159
5160 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
5161 Type=select/cons_res or SelectType=select/cons_tres
5162 configured. Jobs that run in partitions with Over‐
5163 Subscribe=EXCLUSIVE will have exclusive access to
5164 all allocated nodes.
5165
5166 FORCE Makes all resources in the partition available for
5167 oversubscription without any means for users to dis‐
5168 able it. May be followed with a colon and maximum
5169 number of jobs in running or suspended state. For
5170 example OverSubscribe=FORCE:4 enables each node,
5171 socket or core to oversubscribe each resource four
5172 ways. Recommended only for systems using Preempt‐
5173 Mode=suspend,gang.
5174
5175 NOTE: OverSubscribe=FORCE:1 is a special case that
5176 is not exactly equivalent to OverSubscribe=NO. Over‐
5177 Subscribe=FORCE:1 disables the regular oversubscrip‐
5178 tion of resources in the same partition but it will
5179 still allow oversubscription due to preemption. Set‐
5180 ting OverSubscribe=NO will prevent oversubscription
5181 from happening due to preemption as well.
5182
5183 NOTE: If using PreemptType=preempt/qos you can spec‐
5184 ify a value for FORCE that is greater than 1. For
5185 example, OverSubscribe=FORCE:2 will permit two jobs
5186 per resource normally, but a third job can be
5187 started only if done so through preemption based
5188 upon QOS.
5189
5190 NOTE: If OverSubscribe is configured to FORCE or YES
5191 in your slurm.conf and the system is not configured
5192 to use preemption (PreemptMode=OFF) accounting can
5193 easily grow to values greater than the actual uti‐
5194 lization. It may be common on such systems to get
5195 error messages in the slurmdbd log stating: "We have
5196 more allocated time than is possible."
5197
5198
5199 YES Makes all resources in the partition available for
5200 sharing upon request by the job. Resources will
5201 only be over-subscribed when explicitly requested by
5202 the user using the "--oversubscribe" option on job
5203 submission. May be followed with a colon and maxi‐
5204 mum number of jobs in running or suspended state.
5205 For example "OverSubscribe=YES:4" enables each node,
5206 socket or core to execute up to four jobs at once.
5207 Recommended only for systems running with gang
5208 scheduling (PreemptMode=suspend,gang).
5209
5210 NO Selected resources are allocated to a single job. No
5211 resource will be allocated to more than one job.
5212
5213 NOTE: Even if you are using PreemptMode=sus‐
5214 pend,gang, setting OverSubscribe=NO will disable
5215 preemption on that partition. Use OverSub‐
5216 scribe=FORCE:1 if you want to disable normal over‐
5217 subscription but still allow suspension due to pre‐
5218 emption.
5219
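              The following hypothetical fragment sketches the FORCE form combined
              with gang scheduling as recommended above (partition name, node range
              and the FORCE:2 value are illustrative assumptions):

                 PreemptMode=suspend,gang
                 SelectType=select/cons_tres
                 SelectTypeParameters=CR_Core_Memory    # treat memory as a consumable resource
                 PartitionName=shared Nodes=tux[0-31] OverSubscribe=FORCE:2 State=UP

              With this fragment, up to two jobs may be suspended/running on each
              core and are gang scheduled; jobs should request memory explicitly
              (e.g. with --mem) to avoid memory over-subscription.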
5220
5221 PartitionName
5222 Name by which the partition may be referenced (e.g. "Interac‐
5223 tive"). This name can be specified by users when submitting
5224 jobs. If the PartitionName is "DEFAULT", the values specified
5225 with that record will apply to subsequent partition specifica‐
5226 tions unless explicitly set to other values in that partition
5227 record or replaced with a different set of default values. Each
5228 line where PartitionName is "DEFAULT" will replace or add to
5229 previous default values and not reinitialize the default val‐
5230 ues.
5231
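              For example, in the following hypothetical fragment the second DEFAULT
              line adds to (rather than resets) the first, so "serial" inherits both
              values while "big" overrides one of them:

                 PartitionName=DEFAULT MaxTime=60
                 PartitionName=DEFAULT MaxNodes=2
                 PartitionName=serial Nodes=tux[0-3]              # MaxTime=60, MaxNodes=2
                 PartitionName=big    Nodes=tux[4-31] MaxNodes=16 # MaxTime=60, MaxNodes=16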
5232
5233 PreemptMode
5234 Mechanism used to preempt jobs or enable gang scheduling for
5235 this partition when PreemptType=preempt/partition_prio is con‐
5236 figured. This partition-specific PreemptMode configuration
5237 parameter will override the cluster-wide PreemptMode for this
5238 partition. It can be set to OFF to disable preemption and gang
5239 scheduling for this partition. See also PriorityTier and the
5240 above description of the cluster-wide PreemptMode parameter for
5241 further details.
5242
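              A hypothetical fragment overriding the cluster-wide PreemptMode for a
              single partition (names and node ranges are illustrative):

                 PreemptType=preempt/partition_prio
                 PreemptMode=suspend,gang                            # cluster-wide default
                 PartitionName=debug Nodes=tux[0-3]  PreemptMode=OFF # no preemption or gang scheduling here
                 PartitionName=batch Nodes=tux[4-31]                 # inherits suspend,gang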
5243
5244 PriorityJobFactor
5245 Partition factor used by priority/multifactor plugin in calcu‐
5246 lating job priority. The value may not exceed 65533. Also see
5247 PriorityTier.
5248
5249
5250 PriorityTier
5251 Jobs submitted to a partition with a higher priority tier value
5252 will be dispatched before pending jobs in partition with lower
5253 priority tier value and, if possible, they will preempt running
5254 jobs from partitions with lower priority tier values. Note that
5255 a partition's priority tier takes precedence over a job's prior‐
5256 ity. The value may not exceed 65533. Also see PriorityJobFac‐
5257 tor.
5258
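              For example (hypothetical partitions sharing the same nodes), pending
              jobs in "urgent" are dispatched ahead of pending jobs in "standard"
              and, assuming a suitable PreemptType/PreemptMode is configured, may
              preempt them:

                 PartitionName=standard Nodes=tux[0-31] PriorityTier=1  PriorityJobFactor=1
                 PartitionName=urgent   Nodes=tux[0-31] PriorityTier=10 PriorityJobFactor=100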
5259
5260 QOS Used to extend the limits available to a QOS on a partition.
5261 Jobs will not be associated to this QOS outside of being associ‐
5262 ated to the partition. They will still be associated to their
5263 requested QOS. By default, no QOS is used. NOTE: If a limit is
5264 set in both the Partition's QOS and the Job's QOS the Partition
5265 QOS will be honored unless the Job's QOS has the OverPartQOS
5266 flag set, in which case the Job's QOS will have priority.
5267
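              A hypothetical example attaching a partition QOS (the "part_gpu" QOS is
              an assumption and must already exist in the Slurm database, e.g. having
              been created with sacctmgr):

                 # Limits defined in the part_gpu QOS apply to every job in this partition
                 PartitionName=gpu Nodes=gpunode[0-3] QOS=part_gpu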
5268
5269 ReqResv
5270 Specifies users of this partition are required to designate a
5271 reservation when submitting a job. This option can be useful in
5272 restricting usage of a partition that may have higher priority
5273 or additional resources to be allowed only within a reservation.
5274 Possible values are "YES" and "NO". The default value is "NO".
5275
5276
5277 RootOnly
5278 Specifies if only user ID zero (i.e. user root) may allocate
5279 resources in this partition. User root may allocate resources
5280 for any other user, but the request must be initiated by user
5281 root. This option can be useful for a partition to be managed
5282 by some external entity (e.g. a higher-level job manager) and
5283 prevents users from directly using those resources. Possible
5284 values are "YES" and "NO". The default value is "NO".
5285
5286
5287 SelectTypeParameters
5288 Partition-specific resource allocation type. This option
5289 replaces the global SelectTypeParameters value. Supported val‐
5290 ues are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
5291 Use requires the system-wide SelectTypeParameters value be set
5292 to any of the four supported values previously listed; other‐
5293 wise, the partition-specific value will be ignored.
5294
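              For example (hypothetical names), a cluster that normally allocates by
              core could allocate whole sockets in one partition:

                 SelectType=select/cons_tres
                 SelectTypeParameters=CR_Core_Memory                  # cluster-wide default
                 PartitionName=hybrid Nodes=tux[0-31] SelectTypeParameters=CR_Socket_Memory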
5295
5296 Shared The Shared configuration parameter has been replaced by the
5297 OverSubscribe parameter described above.
5298
5299
5300 State State of partition or availability for use. Possible values are
5301 "UP", "DOWN", "DRAIN" and "INACTIVE". The default value is "UP".
5302 See also the related "Alternate" keyword.
5303
5304 UP Designates that new jobs may be queued on the parti‐
5305 tion, and that jobs may be allocated nodes and run
5306 from the partition.
5307
5308 DOWN Designates that new jobs may be queued on the parti‐
5309 tion, but queued jobs may not be allocated nodes and
5310 run from the partition. Jobs already running on the
5311 partition continue to run. The jobs must be explicitly
5312 canceled to force their termination.
5313
5314 DRAIN Designates that no new jobs may be queued on the par‐
5315 tition (job submission requests will be denied with an
5316 error message), but jobs already queued on the parti‐
5317 tion may be allocated nodes and run. See also the
5318 "Alternate" partition specification.
5319
5320 INACTIVE Designates that no new jobs may be queued on the par‐
5321 tition, and jobs already queued may not be allocated
5322 nodes and run. See also the "Alternate" partition
5323 specification.
5324
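              The state can also be changed at run time with scontrol; for example,
              to stop new submissions to a hypothetical "debug" partition before
              maintenance and later reopen it:

                 scontrol update PartitionName=debug State=DRAIN
                 scontrol update PartitionName=debug State=UP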
5325
5326 TRESBillingWeights
5327 TRESBillingWeights is used to define the billing weights of each
5328 TRES type that will be used in calculating the usage of a job.
5329 The calculated usage is used when calculating fairshare and when
5330 enforcing the TRES billing limit on jobs.
5331
5332 Billing weights are specified as a comma-separated list of <TRES
5333 Type>=<TRES Billing Weight> pairs.
5334
5335 Any TRES Type is available for billing. Note that the base unit
5336 for memory and burst buffers is megabytes.
5337
5338 By default the billing of TRES is calculated as the sum of all
5339 TRES types multiplied by their corresponding billing weight.
5340
5341 The weighted amount of a resource can be adjusted by adding a
5342 suffix of K,M,G,T or P after the billing weight. For example, a
5343 memory weight of "mem=.25" on a job allocated 8GB will be billed
5344 2048 (8192MB *.25) units. A memory weight of "mem=.25G" on the
5345 same job will be billed 2 (8192MB * (.25/1024)) units.
5346
5347 Negative values are allowed.
5348
5349 When a job is allocated 1 CPU and 8 GB of memory on a partition
5350 configured with TRESBilling‐
5351 Weights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the billable TRES will
5352 be: (1*1.0) + (8*0.25) + (0*2.0) = 3.0.
5353
5354 If PriorityFlags=MAX_TRES is configured, the billable TRES is
5355 calculated as the MAX of individual TRES' on a node (e.g. cpus,
5356 mem, gres) plus the sum of all global TRES' (e.g. licenses).
5357 Using the same example above the billable TRES will be
5358 MAX(1*1.0, 8*0.25) + (0*2.0) = 2.0.
5359
5360 If TRESBillingWeights is not defined then the job is billed
5361 against the total number of allocated CPUs.
5362
5363 NOTE: TRESBillingWeights doesn't affect job priority directly as
5364 it is currently not used for the size of the job. If you want
5365 TRES' to play a role in the job's priority then refer to the
5366 PriorityWeightTRES option.
5367
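              Putting the worked example above into a configuration fragment
              (partition, node names and the weights are illustrative; note that a
              GRES such as gres/gpu typically also has to be listed in
              AccountingStorageTRES for it to be tracked and billed):

                 AccountingStorageTRES=gres/gpu
                 PartitionName=compute Nodes=tux[0-31] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"

              A job allocated 1 CPU and 8 GB of memory in this partition is then
              billed 3.0 units, or 2.0 with PriorityFlags=MAX_TRES, as computed
              above.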
5368
5369
PROLOG AND EPILOG SCRIPTS
5371 There are a variety of prolog and epilog program options that execute
5372 with various permissions and at various times. The four options most
5373 likely to be used are: Prolog and Epilog (executed once on each compute
5374 node for each job) plus PrologSlurmctld and EpilogSlurmctld (executed
5375 once on the ControlMachine for each job).
5376
5377 NOTE: Standard output and error messages are normally not preserved.
5378 Explicitly write output and error messages to an appropriate location
5379 if you wish to preserve that information.
5380
5381 NOTE: By default the Prolog script is ONLY run on any individual node
5382 when it first sees a job step from a new allocation. It does not run
5383 the Prolog immediately when an allocation is granted. If no job steps
5384 from an allocation are run on a node, it will never run the Prolog for
5385 that allocation. This Prolog behavior can be changed by the Pro‐
5386 logFlags parameter. The Epilog, on the other hand, always runs on
5387 every node of an allocation when the allocation is released.
5388
5389 If the Epilog fails (returns a non-zero exit code), this will result in
5390 the node being set to a DRAIN state. If the EpilogSlurmctld fails
5391 (returns a non-zero exit code), this will only be logged. If the Pro‐
5392 log fails (returns a non-zero exit code), this will result in the node
5393 being set to a DRAIN state and the job being requeued in a held state
5394 unless nohold_on_prolog_fail is configured in SchedulerParameters. If
5395 the PrologSlurmctld fails (returns a non-zero exit code), this will
5396 result in the job being requeued to be executed on another node if pos‐
5397 sible. Only batch jobs can be requeued. Interactive jobs (salloc and
5398 srun) will be cancelled if the PrologSlurmctld fails.
5399
5400
5401 Information about the job is passed to the script using environment
5402 variables. Unless otherwise specified, these environment variables are
5403 available in each of the scripts mentioned above (Prolog, Epilog, Pro‐
5404 logSlurmctld and EpilogSlurmctld). For a full list of environment vari‐
5405 ables that includes those available in the SrunProlog, SrunEpilog,
5406 TaskProlog and TaskEpilog please see the Prolog and Epilog Guide
5407 <https://slurm.schedmd.com/prolog_epilog.html>.
5408
5409 SLURM_ARRAY_JOB_ID
5410 If this job is part of a job array, this will be set to the job
5411 ID. Otherwise it will not be set. To reference this specific
5412 task of a job array, combine SLURM_ARRAY_JOB_ID with
5413 SLURM_ARRAY_TASK_ID (e.g. "scontrol update
5414 ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."); Available in
5415 PrologSlurmctld and EpilogSlurmctld only.
5416
5417 SLURM_ARRAY_TASK_ID
5418 If this job is part of a job array, this will be set to the task
5419 ID. Otherwise it will not be set. To reference this specific
5420 task of a job array, combine SLURM_ARRAY_JOB_ID with
5421 SLURM_ARRAY_TASK_ID (e.g. "scontrol update
5422 ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ..."); Available in
5423 PrologSlurmctld and EpilogSlurmctld only.
5424
5425 SLURM_ARRAY_TASK_MAX
5426 If this job is part of a job array, this will be set to the max‐
5427 imum task ID. Otherwise it will not be set. Available in Pro‐
5428 logSlurmctld and EpilogSlurmctld only.
5429
5430 SLURM_ARRAY_TASK_MIN
5431 If this job is part of a job array, this will be set to the min‐
5432 imum task ID. Otherwise it will not be set. Available in Pro‐
5433 logSlurmctld and EpilogSlurmctld only.
5434
5435 SLURM_ARRAY_TASK_STEP
5436 If this job is part of a job array, this will be set to the step
5437 size of task IDs. Otherwise it will not be set. Available in
5438 PrologSlurmctld and EpilogSlurmctld only.
5439
5440 SLURM_CLUSTER_NAME
5441 Name of the cluster executing the job.
5442
5443 SLURM_CONF
5444 Location of the slurm.conf file. Available in Prolog and Epilog
5445 only.
5446
5447 SLURMD_NODENAME
5448 Name of the node running the task. In the case of a parallel job
5449 executing on multiple compute nodes, the various tasks will have
5450 this environment variable set to different values on each com‐
5451 pute node. Available in Prolog and Epilog only.
5452
5453 SLURM_JOB_ACCOUNT
5454 Account name used for the job. Available in PrologSlurmctld and
5455 EpilogSlurmctld only.
5456
5457 SLURM_JOB_CONSTRAINTS
5458 Features required to run the job. Available in Prolog, Pro‐
5459 logSlurmctld and EpilogSlurmctld only.
5460
5461 SLURM_JOB_DERIVED_EC
5462 The highest exit code of all of the job steps. Available in
5463 EpilogSlurmctld only.
5464
5465 SLURM_JOB_EXIT_CODE
5466 The exit code of the job script (or salloc). The value is the
5467 status as returned by the wait() system call (See wait(2)).
5468 Available in EpilogSlurmctld only.
5469
5470 SLURM_JOB_EXIT_CODE2
5471 The exit code of the job script (or salloc). The value has the
5472 format <exit>:<sig>. The first number is the exit code, typi‐
5473 cally as set by the exit() function. The second number is the
5474 signal that caused the process to terminate, if it was termi‐
5475 nated by a signal. Available in EpilogSlurmctld only.
5476
5477 SLURM_JOB_GID
5478 Group ID of the job's owner. Available in PrologSlurmctld and
5479 EpilogSlurmctld only.
5480
5481 SLURM_JOB_GPUS
5482 GPU IDs allocated to the job (if any). Available in the Prolog
5483 only.
5484
5485 SLURM_JOB_GROUP
5486 Group name of the job's owner. Available in PrologSlurmctld and
5487 EpilogSlurmctld only.
5488
5489 SLURM_JOB_ID
5490 Job ID.
5491
5492 SLURM_JOBID
5493 Job ID.
5494
5495 SLURM_JOB_NAME
5496 Name of the job. Available in PrologSlurmctld and EpilogSlurm‐
5497 ctld only.
5498
5499 SLURM_JOB_NODELIST
5500 Nodes assigned to job. A Slurm hostlist expression. "scontrol
5501 show hostnames" can be used to convert this to a list of indi‐
5502 vidual host names. Available in PrologSlurmctld and Epi‐
5503 logSlurmctld only.
5504
5505 SLURM_JOB_PARTITION
5506 Partition that job runs in. Available in Prolog, PrologSlurm‐
5507 ctld and EpilogSlurmctld only.
5508
5509 SLURM_JOB_UID
5510 User ID of the job's owner.
5511
5512 SLURM_JOB_USER
5513 User name of the job's owner.
5514
5515 SLURM_SCRIPT_CONTEXT
5516 Identifies which epilog or prolog program is currently running.
5517
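       As a minimal sketch only (the script path, the use of syslog and the
       logged fields are assumptions, not requirements), a Prolog script can
       use these variables to record per-job activity on each compute node;
       it must exit 0 on success, since a non-zero exit code drains the node:

          #!/bin/sh
          # Hypothetical Prolog sketch (e.g. Prolog=/etc/slurm/prolog.sh); runs as
          # root on the compute node. Stdout/stderr are not preserved, so log via
          # syslog instead.
          logger -t slurm-prolog "job=${SLURM_JOB_ID} user=${SLURM_JOB_USER} partition=${SLURM_JOB_PARTITION} node=${SLURMD_NODENAME} context=${SLURM_SCRIPT_CONTEXT}"
          # A non-zero exit code would drain the node and requeue the job held.
          exit 0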
5518
NETWORK TOPOLOGY
5520 Slurm is able to optimize job allocations to minimize network con‐
5521 tention. Special Slurm logic is used to optimize allocations on sys‐
5522 tems with a three-dimensional interconnect, and information about con‐
5523 figuring those systems is available on web pages here:
5524 <https://slurm.schedmd.com/>. For a hierarchical network, Slurm needs
5525 to have detailed information about how nodes are configured on the net‐
5526 work switches.
5527
5528 Given network topology information, Slurm allocates all of a job's
5529 resources onto a single leaf of the network (if possible) using a
5530 best-fit algorithm. Otherwise it will allocate a job's resources onto
5531 multiple leaf switches so as to minimize the use of higher-level
5532 switches. The TopologyPlugin parameter controls which plugin is used
5533 to collect network topology information. The only values presently
5534 supported are "topology/3d_torus" (default for Cray XT/XE systems, per‐
5535 forms best-fit logic over three-dimensional topology), "topology/none"
5536 (default for other systems, best-fit logic over one-dimensional topol‐
5537 ogy), "topology/tree" (determine the network topology based upon infor‐
5538 mation contained in a topology.conf file, see "man topology.conf" for
5539 more information). Future plugins may gather topology information
5540 directly from the network. The topology information is optional. If
5541 not provided, Slurm will perform a best-fit algorithm assuming the
5542 nodes are in a one-dimensional array as configured and the communica‐
5543 tions cost is related to the node distance in this array.
5544
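       For a hypothetical two-switch hierarchical network, the tree plugin
       would be enabled in slurm.conf and the wiring described in
       topology.conf (switch and node names below are illustrative; see
       topology.conf(5)):

          # slurm.conf
          TopologyPlugin=topology/tree

          # topology.conf
          SwitchName=leaf1 Nodes=tux[0-15]
          SwitchName=leaf2 Nodes=tux[16-31]
          SwitchName=spine Switches=leaf[1-2]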
5545
RELOCATING CONTROLLERS
5547 If the cluster's computers used for the primary or backup controller
5548 will be out of service for an extended period of time, it may be desir‐
5549 able to relocate them. In order to do so, follow this procedure:
5550
5551 1. Stop the Slurm daemons
5552 2. Modify the slurm.conf file appropriately
5553 3. Distribute the updated slurm.conf file to all nodes
5554 4. Restart the Slurm daemons
5555
5556 There should be no loss of any running or pending jobs. Ensure that
5557 any nodes added to the cluster have the current slurm.conf file
5558 installed.
5559
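       A hypothetical sketch of step 2 above, moving the primary controller
       from "dev0" to a new host "newctl" (host names and addresses are
       illustrative only):

          # Before
          SlurmctldHost=dev0(12.34.56.78)
          SlurmctldHost=dev1(12.34.56.79)
          # After
          SlurmctldHost=newctl(12.34.56.80)
          SlurmctldHost=dev1(12.34.56.79)

       The updated file is then distributed to all nodes and the daemons are
       restarted as described in the procedure above.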
5560 CAUTION: If two nodes are simultaneously configured as the primary con‐
5561 troller (two nodes on which SlurmctldHost specifies the local host and
5562 the slurmctld daemon is executing on each), system behavior will be
5563 destructive. If a compute node has an incorrect SlurmctldHost parame‐
5564 ter, that node may be rendered unusable, but no other harm will result.
5565
5566
EXAMPLE
5568 #
5569 # Sample /etc/slurm.conf for dev[0-25].llnl.gov
5570 # Author: John Doe
5571 # Date: 11/06/2001
5572 #
5573 SlurmctldHost=dev0(12.34.56.78) # Primary server
5574 SlurmctldHost=dev1(12.34.56.79) # Backup server
5575 #
5576 AuthType=auth/munge
5577 Epilog=/usr/local/slurm/epilog
5578 Prolog=/usr/local/slurm/prolog
5579 FirstJobId=65536
5580 InactiveLimit=120
5581 JobCompType=jobcomp/filetxt
5582 JobCompLoc=/var/log/slurm/jobcomp
5583 KillWait=30
5584 MaxJobCount=10000
5585 MinJobAge=3600
5586 PluginDir=/usr/local/lib:/usr/local/slurm/lib
5587 ReturnToService=0
5588 SchedulerType=sched/backfill
5589 SlurmctldLogFile=/var/log/slurm/slurmctld.log
5590 SlurmdLogFile=/var/log/slurm/slurmd.log
5591 SlurmctldPort=7002
5592 SlurmdPort=7003
5593 SlurmdSpoolDir=/var/spool/slurmd.spool
5594 StateSaveLocation=/var/spool/slurm.state
5595 SwitchType=switch/none
5596 TmpFS=/tmp
5597 WaitTime=30
5598 JobCredentialPrivateKey=/usr/local/slurm/private.key
5599 JobCredentialPublicCertificate=/usr/local/slurm/public.cert
5600 #
5601 # Node Configurations
5602 #
5603 NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
5604 NodeName=DEFAULT State=UNKNOWN
5605 NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
5606 # Update records for specific DOWN nodes
5607 DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
5608 #
5609 # Partition Configurations
5610 #
5611 PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
5612 PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
5613 PartitionName=batch Nodes=dev[9-17] MinNodes=4
5614 PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
5615
5616
INCLUDE MODIFIERS
5618 The "include" key word can be used with modifiers within the specified
5619 pathname. These modifiers would be replaced with cluster name or other
5620 information depending on which modifier is specified. If the included
5621 file is not an absolute path name (i.e. it does not start with a
5622 slash), it will be searched for in the same directory as the slurm.conf
5623 file.
5624
5625 %c Cluster name specified in the slurm.conf will be used.
5626
5627 EXAMPLE
5628 ClusterName=linux
5629 include /home/slurm/etc/%c_config
5630 # Above line interpreted as
5631 # "include /home/slurm/etc/linux_config"
5632
5633
FILE AND DIRECTORY PERMISSIONS
5635 There are three classes of files: Files used by slurmctld must be
5636 accessible by user SlurmUser and accessible by the primary and backup
5637 control machines. Files used by slurmd must be accessible by user root
5638 and accessible from every compute node. A few files need to be acces‐
5639 sible by normal users on all login and compute nodes. While many files
5640 and directories are listed below, most of them will not be used with
5641 most configurations.
5642
5643 Epilog Must be executable by user root. It is recommended that the
5644 file be readable by all users. The file must exist on every
5645 compute node.
5646
5647 EpilogSlurmctld
5648 Must be executable by user SlurmUser. It is recommended that
5649 the file be readable by all users. The file must be accessible
5650 by the primary and backup control machines.
5651
5652 HealthCheckProgram
5653 Must be executable by user root. It is recommended that the
5654 file be readable by all users. The file must exist on every
5655 compute node.
5656
5657 JobCompLoc
5658 If this specifies a file, it must be writable by user SlurmUser.
5659 The file must be accessible by the primary and backup control
5660 machines.
5661
5662 JobCredentialPrivateKey
5663 Must be readable only by user SlurmUser and writable by no other
5664 users. The file must be accessible by the primary and backup
5665 control machines.
5666
5667 JobCredentialPublicCertificate
5668 Readable to all users on all nodes. Must not be writable by
5669 regular users.
5670
5671 MailProg
5672 Must be executable by user SlurmUser. Must not be writable by
5673 regular users. The file must be accessible by the primary and
5674 backup control machines.
5675
5676 Prolog Must be executable by user root. It is recommended that the
5677 file be readable by all users. The file must exist on every
5678 compute node.
5679
5680 PrologSlurmctld
5681 Must be executable by user SlurmUser. It is recommended that
5682 the file be readable by all users. The file must be accessible
5683 by the primary and backup control machines.
5684
5685 ResumeProgram
5686 Must be executable by user SlurmUser. The file must be accessi‐
5687 ble by the primary and backup control machines.
5688
5689 slurm.conf
5690 Readable to all users on all nodes. Must not be writable by
5691 regular users.
5692
5693 SlurmctldLogFile
5694 Must be writable by user SlurmUser. The file must be accessible
5695 by the primary and backup control machines.
5696
5697 SlurmctldPidFile
5698 Must be writable by user root. Preferably writable and remov‐
5699 able by SlurmUser. The file must be accessible by the primary
5700 and backup control machines.
5701
5702 SlurmdLogFile
5703 Must be writable by user root. A distinct file must exist on
5704 each compute node.
5705
5706 SlurmdPidFile
5707 Must be writable by user root. A distinct file must exist on
5708 each compute node.
5709
5710 SlurmdSpoolDir
5711 Must be writable by user root. A distinct file must exist on
5712 each compute node.
5713
5714 SrunEpilog
5715 Must be executable by all users. The file must exist on every
5716 login and compute node.
5717
5718 SrunProlog
5719 Must be executable by all users. The file must exist on every
5720 login and compute node.
5721
5722 StateSaveLocation
5723 Must be writable by user SlurmUser. The file must be accessible
5724 by the primary and backup control machines.
5725
5726 SuspendProgram
5727 Must be executable by user SlurmUser. The file must be accessi‐
5728 ble by the primary and backup control machines.
5729
5730 TaskEpilog
5731 Must be executable by all users. The file must exist on every
5732 compute node.
5733
5734 TaskProlog
5735 Must be executable by all users. The file must exist on every
5736 compute node.
5737
5738 UnkillableStepProgram
5739 Must be executable by user SlurmUser. The file must be accessi‐
5740 ble by the primary and backup control machines.
5741
5742
LOGGING
5744 Note that while Slurm daemons create log files and other files as
5745 needed, they treat the lack of parent directories as a fatal error.
5746 This prevents the daemons from running if critical file systems are not
5747 mounted and will minimize the risk of cold-starting (starting without
5748 preserving jobs).
5749
5750 Log files and job accounting files may need to be created/owned by the
5751 "SlurmUser" uid to be successfully accessed. Use the "chown" and
5752 "chmod" commands to set the ownership and permissions appropriately.
5753 See the section FILE AND DIRECTORY PERMISSIONS for information about
5754 the various files and directories used by Slurm.
5755
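       As a minimal sketch (the directory paths and the account name "slurm"
       for SlurmUser are assumptions that must match the local configuration,
       e.g. the SlurmctldLogFile and StateSaveLocation values in the example
       above):

          mkdir -p /var/log/slurm /var/spool/slurm.state
          chown slurm:slurm /var/log/slurm /var/spool/slurm.state
          chmod 755 /var/log/slurm
          chmod 700 /var/spool/slurm.state       # state files readable by SlurmUser only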
5756 It is recommended that the logrotate utility be used to ensure that
5757 various log files do not become too large. This also applies to text
5758 files used for accounting, process tracking, and the slurmdbd log if
5759 they are used.
5760
5761 Here is a sample logrotate configuration. Make appropriate site modifi‐
5762 cations and save as /etc/logrotate.d/slurm on all nodes. See the
5763 logrotate man page for more details.
5764
5765 ##
5766 # Slurm Logrotate Configuration
5767 ##
5768 /var/log/slurm/*.log {
5769 compress
5770 missingok
5771 nocopytruncate
5772 nodelaycompress
5773 nomail
5774 notifempty
5775 noolddir
5776 rotate 5
5777 sharedscripts
5778 size=5M
5779 create 640 slurm root
5780 postrotate
5781 pkill -x --signal SIGUSR2 slurmctld
5782 pkill -x --signal SIGUSR2 slurmd
5783 pkill -x --signal SIGUSR2 slurmdbd
5784 exit 0
5785 endscript
5786 }
5787
COPYING
5789 Copyright (C) 2002-2007 The Regents of the University of California.
5790 Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
5791 Copyright (C) 2008-2010 Lawrence Livermore National Security.
5792 Copyright (C) 2010-2017 SchedMD LLC.
5793
5794 This file is part of Slurm, a resource management program. For
5795 details, see <https://slurm.schedmd.com/>.
5796
5797 Slurm is free software; you can redistribute it and/or modify it under
5798 the terms of the GNU General Public License as published by the Free
5799 Software Foundation; either version 2 of the License, or (at your
5800 option) any later version.
5801
5802 Slurm is distributed in the hope that it will be useful, but WITHOUT
5803 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
5804 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
5805 for more details.
5806
5807
FILES
5809 /etc/slurm.conf
5810
5811
SEE ALSO
5813 cgroup.conf(5), getaddrinfo(3), getrlimit(2), gres.conf(5), group(5),
5814 hostname(1), scontrol(1), slurmctld(8), slurmd(8), slurmdbd(8), slur‐
5815 mdbd.conf(5), srun(1), spank(8), syslog(3), topology.conf(5)
5816
5817
5818
5819January 2021 Slurm Configuration File slurm.conf(5)