slurm.conf(5)           Slurm Configuration File           slurm.conf(5)


NAME
       slurm.conf - Slurm configuration file


DESCRIPTION
       slurm.conf is an ASCII file which describes general Slurm
       configuration information, the nodes to be managed, information
       about how those nodes are grouped into partitions, and various
       scheduling parameters associated with those partitions. This file
       should be consistent across all nodes in the cluster.

       The file location can be modified at system build time using the
       DEFAULT_SLURM_CONF parameter or at execution time by setting the
       SLURM_CONF environment variable. The Slurm daemons also allow you
       to override both the built-in and environment-provided location
       using the "-f" option on the command line.

       The contents of the file are case insensitive except for the names
       of nodes and partitions. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart
       of Slurm daemons, daemon receipt of the SIGHUP signal, or
       execution of the command "scontrol reconfigure" unless otherwise
       noted.

       If a line begins with the word "Include" followed by whitespace
       and then a file name, that file will be included inline with the
       current configuration file. For large or complex systems, multiple
       configuration files may prove easier to manage and enable reuse of
       some files (see INCLUDE MODIFIERS for more details).

       Note on file permissions:

       The slurm.conf file must be readable by all users of Slurm, since
       it is used by many of the Slurm commands. Other files that are
       defined in the slurm.conf file, such as log files and job
       accounting files, may need to be created/owned by the user
       "SlurmUser" to be successfully accessed. Use the "chown" and
       "chmod" commands to set the ownership and permissions
       appropriately. See the section FILE AND DIRECTORY PERMISSIONS for
       information about the various files and directories used by Slurm.


PARAMETERS
       The overall configuration parameters available include:

       AccountingStorageBackupHost
              The name of the backup machine hosting the accounting
              storage database. If used with the
              accounting_storage/slurmdbd plugin, this is where the
              backup slurmdbd would be running. Only used with systems
              using SlurmDBD, ignored otherwise.


       AccountingStorageEnforce
              This controls what level of association-based enforcement
              to impose on job submissions. Valid options are any
              combination of associations, limits, nojobs, nosteps, qos,
              safe, and wckeys, or all for all things (except nojobs and
              nosteps, which must be requested as well).

              If limits, qos, or wckeys are set, associations will
              automatically be set.

              If wckeys is set, TrackWCKey will automatically be set.

              If safe is set, limits and associations will automatically
              be set.

              If nojobs is set, nosteps will automatically be set.

              By enforcing associations no new job is allowed to run
              unless a corresponding association exists in the system. If
              limits are enforced, users can be limited by association to
              whatever job size or run time limits are defined.

              If nojobs is set, Slurm will not account for any jobs or
              steps on the system; likewise, if nosteps is set, Slurm
              will not account for any steps that have run, but limits
              will still be enforced.

              If safe is enforced, a job will only be launched against an
              association or qos that has a GrpTRESMins limit set if the
              job will be able to run to completion. Without this option
              set, jobs will be launched as long as their usage hasn't
              reached the cpu-minutes limit, which can lead to jobs being
              launched but then killed when the limit is reached.

              With qos and/or wckeys enforced, jobs will not be scheduled
              unless a valid qos and/or workload characterization key is
              specified.

              When AccountingStorageEnforce is changed, a restart of the
              slurmctld daemon is required (not just a "scontrol
              reconfig").

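              As an illustrative sketch, a site that wants jobs rejected
              unless they match a valid association, with limits applied
              only when jobs can run to completion, might set (the
              combination shown is an assumption about site policy, not a
              recommended default):

```conf
AccountingStorageEnforce=associations,limits,qos,safe
```

              Since safe implies limits and associations, listing them
              explicitly is redundant but harmless.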

       AccountingStorageHost
              The name of the machine hosting the accounting storage
              database. Only used with systems using SlurmDBD, ignored
              otherwise. Also see DefaultStorageHost.

       AccountingStorageLoc
              The fully qualified file name where accounting records are
              written when the AccountingStorageType is
              "accounting_storage/filetxt". Also see DefaultStorageLoc.

       AccountingStoragePass
              The password used to gain access to the database to store
              the accounting data. Only used for database type storage
              plugins, ignored otherwise. In the case of Slurm DBD
              (Database Daemon) with MUNGE authentication this can be
              configured to use a MUNGE daemon specifically configured to
              provide authentication between clusters while the default
              MUNGE daemon provides authentication within a cluster. In
              that case, AccountingStoragePass should specify the named
              port to be used for communications with the alternate MUNGE
              daemon (e.g. "/var/run/munge/global.socket.2"). The default
              value is NULL. Also see DefaultStoragePass.

       AccountingStoragePort
              The listening port of the accounting storage database
              server. Only used for database type storage plugins,
              ignored otherwise. The default value is SLURMDBD_PORT as
              established at system build time. If no value is explicitly
              specified, it will be set to 6819. This value must be equal
              to the DbdPort parameter in the slurmdbd.conf file. Also
              see DefaultStoragePort.

       AccountingStorageTRES
              Comma separated list of resources you wish to track on the
              cluster. These are the resources requested by the
              sbatch/srun job when it is submitted. Currently this
              consists of any GRES, BB (burst buffer) or license along
              with CPU, Memory, Node, Energy, FS/[Disk|Lustre], IC/OFED,
              Pages, and VMem. By default Billing, CPU, Energy, Memory,
              Node, FS/Disk, Pages and VMem are tracked. These default
              TRES cannot be disabled, but only appended to.
              AccountingStorageTRES=gres/craynetwork,license/iop1 will
              track billing, cpu, energy, memory, nodes, fs/disk, pages
              and vmem along with a gres called craynetwork as well as a
              license called iop1. Whenever these resources are used on
              the cluster they are recorded. The TRES are automatically
              set up in the database on the start of the slurmctld.

              If multiple GRES of different types are tracked (e.g. GPUs
              of different types), then job requests with matching type
              specifications will be recorded. Given a configuration of
              "AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta",
              then "gres/gpu:tesla" and "gres/gpu:volta" will track only
              jobs that explicitly request those two GPU types, while
              "gres/gpu" will track allocated GPUs of any type ("tesla",
              "volta" or any other GPU type).

              Given a configuration of
              "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta",
              then "gres/gpu:tesla" and "gres/gpu:volta" will track jobs
              that explicitly request those GPU types. If a job requests
              GPUs, but does not explicitly specify the GPU type, then
              its resource allocation will be accounted for as either
              "gres/gpu:tesla" or "gres/gpu:volta", although the
              accounting may not match the actual GPU type allocated to
              the job and the GPUs allocated to the job could be
              heterogeneous. In an environment containing various GPU
              types, use of a job_submit plugin may be desired in order
              to force jobs to explicitly specify some GPU type.

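              The typed GPU tracking described above can be sketched as a
              single slurm.conf line (the GPU type names are examples,
              not defaults):

```conf
AccountingStorageTRES=gres/gpu,gres/gpu:tesla,gres/gpu:volta
```

              Here the untyped "gres/gpu" entry records GPU usage of any
              type, while the typed entries record only jobs that request
              those types explicitly.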

       AccountingStorageType
              The accounting storage mechanism type. Acceptable values at
              present include "accounting_storage/filetxt",
              "accounting_storage/none" and
              "accounting_storage/slurmdbd". The
              "accounting_storage/filetxt" value indicates that
              accounting records will be written to the file specified by
              the AccountingStorageLoc parameter. The
              "accounting_storage/slurmdbd" value indicates that
              accounting records will be written to the Slurm DBD, which
              manages an underlying MySQL database. See "man slurmdbd"
              for more information. The default value is
              "accounting_storage/none" and indicates that account
              records are not maintained. Note: The filetxt plugin
              records only a limited subset of accounting information and
              will prevent some sacct options from proper operation. Also
              see DefaultStorageType.

       AccountingStorageUser
              The user account for accessing the accounting storage
              database. Only used for database type storage plugins,
              ignored otherwise. Also see DefaultStorageUser.


       AccountingStoreJobComment
              If set to "YES" then include the job's comment field in the
              job complete message sent to the Accounting Storage
              database. The default is "YES". Note the AdminComment and
              SystemComment are always recorded in the database.

       AcctGatherNodeFreq
              The AcctGather plugins sampling interval for node
              accounting. For AcctGather plugin values of none, this
              parameter is ignored. For all other values this parameter
              is the number of seconds between node accounting samples.
              For the acct_gather_energy/rapl plugin, set a value less
              than 300 because the counters may overflow beyond this
              rate. The default value is zero, which disables accounting
              sampling for nodes. Note: The accounting sampling interval
              for jobs is determined by the value of
              JobAcctGatherFrequency.


       AcctGatherEnergyType
              Identifies the plugin to be used for energy consumption
              accounting. The jobacct_gather plugin and slurmd daemon
              call this plugin to collect energy consumption data for
              jobs and nodes. The collection of energy consumption data
              takes place at the node level, so the measurements will
              reflect a job's real consumption only in the case of an
              exclusive job allocation. In case of node sharing between
              jobs, the reported consumed energy per job (through sstat
              or sacct) will not reflect the real energy consumed by the
              jobs.

              Configurable values at present are:

              acct_gather_energy/none
                     No energy consumption data is collected.

              acct_gather_energy/ipmi
                     Energy consumption data is collected from the
                     Baseboard Management Controller (BMC) using the
                     Intelligent Platform Management Interface (IPMI).

              acct_gather_energy/rapl
                     Energy consumption data is collected from hardware
                     sensors using the Running Average Power Limit (RAPL)
                     mechanism. Note that enabling RAPL may require the
                     execution of the command "sudo modprobe msr".

       AcctGatherInfinibandType
              Identifies the plugin to be used for InfiniBand network
              traffic accounting. The jobacct_gather plugin and slurmd
              daemon call this plugin to collect network traffic data for
              jobs and nodes. The collection of network traffic data
              takes place at the node level, so the collected values will
              reflect a job's real traffic only in the case of an
              exclusive job allocation. In case of node sharing between
              jobs, the reported network traffic per job (through sstat
              or sacct) will not reflect the real network traffic
              generated by the jobs.

              Configurable values at present are:

              acct_gather_infiniband/none
                     No InfiniBand network data are collected.

              acct_gather_infiniband/ofed
                     InfiniBand network traffic data are collected from
                     the hardware monitoring counters of InfiniBand
                     devices through the OFED library. In order to
                     account for per job network traffic, add the
                     "ic/ofed" TRES to AccountingStorageTRES.

       AcctGatherFilesystemType
              Identifies the plugin to be used for filesystem traffic
              accounting. The jobacct_gather plugin and slurmd daemon
              call this plugin to collect filesystem traffic data for
              jobs and nodes. The collection of filesystem traffic data
              takes place at the node level, so the collected values will
              reflect a job's real traffic only in the case of an
              exclusive job allocation. In case of node sharing between
              jobs, the reported filesystem traffic per job (through
              sstat or sacct) will not reflect the real filesystem
              traffic generated by the jobs.

              Configurable values at present are:

              acct_gather_filesystem/none
                     No filesystem data are collected.

              acct_gather_filesystem/lustre
                     Lustre filesystem traffic data are collected from
                     the counters found in /proc/fs/lustre/. In order to
                     account for per job Lustre traffic, add the
                     "fs/lustre" TRES to AccountingStorageTRES.

       AcctGatherProfileType
              Identifies the plugin to be used for detailed job
              profiling. The jobacct_gather plugin and slurmd daemon call
              this plugin to collect detailed data such as I/O counts,
              memory usage, or energy consumption for jobs and nodes.
              There are interfaces in this plugin to collect data at step
              start and completion, task start and completion, and at the
              account gather frequency. The data collected at the node
              level is related to jobs only in case of exclusive job
              allocation.

              Configurable values at present are:

              acct_gather_profile/none
                     No profile data is collected.

              acct_gather_profile/hdf5
                     This enables the HDF5 plugin. The directory where
                     the profile files are stored and which values are
                     collected are configured in the acct_gather.conf
                     file.

              acct_gather_profile/influxdb
                     This enables the influxdb plugin. The influxdb
                     instance host, port, database, retention policy and
                     which values are collected are configured in the
                     acct_gather.conf file.

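              As a sketch, the gathering plugins above are typically
              enabled together, with plugin-specific details placed in
              acct_gather.conf (the plugin choices and the 30 second
              interval are site assumptions, not defaults):

```conf
AcctGatherEnergyType=acct_gather_energy/rapl
AcctGatherProfileType=acct_gather_profile/hdf5
AcctGatherNodeFreq=30
```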

       AllowSpecResourcesUsage
              If set to 1, Slurm allows individual jobs to override a
              node's configured CoreSpecCount value. For a job to take
              advantage of this feature, a command line option of
              --core-spec must be specified. The default value for this
              option is 1 for Cray systems and 0 for other system types.

       AuthAltTypes
              Comma separated list of alternative authentication plugins
              that the slurmctld will permit for communication.


       AuthInfo
              Additional information to be used for authentication of
              communications between the Slurm daemons (slurmctld and
              slurmd) and the Slurm clients. The interpretation of this
              option is specific to the configured AuthType. Multiple
              options may be specified in a comma delimited list. If not
              specified, the default authentication information will be
              used.

              cred_expire
                     Default job step credential lifetime, in seconds
                     (e.g. "cred_expire=1200"). It must be sufficiently
                     long to load the user environment, run the prolog,
                     deal with the slurmd getting paged out of memory,
                     etc. This also controls how long a requeued job must
                     wait before starting again. The default value is 120
                     seconds.

              socket Path name to a MUNGE daemon socket to use (e.g.
                     "socket=/var/run/munge/munge.socket.2"). The default
                     value is "/var/run/munge/munge.socket.2". Used by
                     auth/munge and cred/munge.

              ttl    Credential lifetime, in seconds (e.g. "ttl=300").
                     The default value is dependent upon the MUNGE
                     installation, but is typically 300 seconds.

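              Combining the sub-options above, a hedged example (the
              cred_expire value shown is illustrative, not a
              recommendation):

```conf
AuthType=auth/munge
AuthInfo=socket=/var/run/munge/munge.socket.2,cred_expire=240
```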

       AuthType
              The authentication method for communications between Slurm
              components. Acceptable values at present include
              "auth/munge" and "auth/none". The default value is
              "auth/munge". "auth/none" includes the UID in each
              communication, but it is not verified. This may be fine for
              testing purposes, but do not use "auth/none" if you desire
              any security. "auth/munge" indicates that MUNGE is to be
              used (see "https://dun.github.io/munge/" for more
              information). All Slurm daemons and commands must be
              terminated prior to changing the value of AuthType and
              later restarted.


       BackupAddr
              Defunct option, see SlurmctldHost.

       BackupController
              Defunct option, see SlurmctldHost.

              The backup controller recovers state information from the
              StateSaveLocation directory, which must be readable and
              writable from both the primary and backup controllers.
              While not essential, it is recommended that you specify a
              backup controller. See the RELOCATING CONTROLLERS section
              if you change this.

       BatchStartTimeout
              The maximum time (in seconds) that a batch job is permitted
              for launching before being considered missing and releasing
              the allocation. The default value is 10 (seconds). Larger
              values may be required if more time is required to execute
              the Prolog, to load user environment variables (for Moab
              spawned jobs), or if the slurmd daemon gets paged from
              memory.
              Note: The test for a job being successfully launched is
              only performed when the Slurm daemon on the compute node
              registers state with the slurmctld daemon on the head node,
              which happens fairly rarely. Therefore a job will not
              necessarily be terminated if its start time exceeds
              BatchStartTimeout. This configuration parameter is also
              applied to launch tasks and avoid aborting srun commands
              due to long running Prolog scripts.

       BurstBufferType
              The plugin used to manage burst buffers. Acceptable values
              at present are:

              burst_buffer/datawarp
                     Use Cray DataWarp API to provide burst buffer
                     functionality.

              burst_buffer/none

       CheckpointType
              The system-initiated checkpoint method to be used for user
              jobs. The slurmctld daemon must be restarted for a change
              in CheckpointType to take effect. Supported values
              presently include:

              checkpoint/none
                     No checkpoint support (default).

              checkpoint/ompi
                     OpenMPI (version 1.3 or higher).

       CliFilterPlugins
              A comma delimited list of command line interface option
              filter/modification plugins. The specified plugins will be
              executed in the order listed. These are intended to be
              site-specific plugins which can be used to set default job
              parameters and/or logging events. No cli_filter plugins are
              used by default.

       ClusterName
              The name by which this Slurm managed cluster is known in
              the accounting database. This is needed to distinguish
              accounting records when multiple clusters report to the
              same database. Because of limitations in some databases,
              any upper case letters in the name will be silently mapped
              to lower case. In order to avoid confusion, it is
              recommended that the name be lower case.


       CommunicationParameters
              Comma separated options identifying communication options.

              CheckGhalQuiesce
                     Used specifically on a Cray using an Aries Ghal
                     interconnect. This will check to see if the system
                     is quiescing when sending a message, and if so, wait
                     until it is done before sending.

              NoAddrCache
                     By default, Slurm will cache a node's network
                     address after successfully establishing a connection
                     to it. This option disables the cache and Slurm will
                     look up the node's network address each time a
                     connection is made. This is useful, for example, in
                     a cloud environment where the node addresses come
                     and go out of DNS.

              NoCtldInAddrAny
                     Used to directly bind to the address that the node
                     running the slurmctld resolves to, instead of
                     binding messages to any address on the node, which
                     is the default.

              NoInAddrAny
                     Used to directly bind to the address that the node
                     resolves to, instead of binding messages to any
                     address on the node, which is the default. This
                     option is for all daemons/clients except for the
                     slurmctld.


       CompleteWait
              The time, in seconds, given for a job to remain in
              COMPLETING state before any additional jobs are scheduled.
              If set to zero, pending jobs will be started as soon as
              possible. Since a COMPLETING job's resources are released
              for use by other jobs as soon as the Epilog completes on
              each individual node, this can result in very fragmented
              resource allocations. To provide jobs with the minimum
              response time, a value of zero is recommended (no waiting).
              To minimize fragmentation of resources, a value equal to
              KillWait plus two is recommended. In that case, setting
              KillWait to a small value may be beneficial. The default
              value of CompleteWait is zero seconds. The value may not
              exceed 65533.

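              For example, following the anti-fragmentation guidance
              above with an assumed KillWait of 30 seconds:

```conf
KillWait=30
CompleteWait=32
```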

       ControlAddr
              Defunct option, see SlurmctldHost.

       ControlMachine
              Defunct option, see SlurmctldHost.

       CoreSpecPlugin
              Identifies the plugins to be used for enforcement of core
              specialization. The slurmd daemon must be restarted for a
              change in CoreSpecPlugin to take effect. Acceptable values
              at present include:

              core_spec/cray_aries
                     Used only for Cray systems.

              core_spec/none
                     Used for all other system types.

       CpuFreqDef
              Default CPU frequency value or frequency governor to use
              when running a job step if it has not been explicitly set
              with the --cpu-freq option. Acceptable values at present
              include a numeric value (frequency in kilohertz) or one of
              the following governors:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor

              Performance  attempts to use the Performance CPU governor

              PowerSave    attempts to use the PowerSave CPU governor

              There is no default value. If unset, no attempt to set the
              governor is made if the --cpu-freq option has not been set.

       CpuFreqGovernors
              List of CPU frequency governors allowed to be set with the
              salloc, sbatch, or srun option --cpu-freq. Acceptable
              values at present include:

              Conservative attempts to use the Conservative CPU governor

              OnDemand     attempts to use the OnDemand CPU governor (a
                           default value)

              Performance  attempts to use the Performance CPU governor
                           (a default value)

              PowerSave    attempts to use the PowerSave CPU governor

              UserSpace    attempts to use the UserSpace CPU governor (a
                           default value)

              The default is OnDemand, Performance and UserSpace.

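              Tying the two frequency options together, a sketch (the
              governor choices are illustrative site policy, not
              defaults):

```conf
CpuFreqDef=Performance
CpuFreqGovernors=OnDemand,Performance,PowerSave
```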
       CredType
              The cryptographic signature tool to be used in the creation
              of job step credentials. The slurmctld daemon must be
              restarted for a change in CredType to take effect.
              Acceptable values at present include "cred/munge". The
              default value is "cred/munge" and is the recommended
              option.

       DebugFlags
              Defines specific subsystems which should provide more
              detailed event logging. Multiple subsystems can be
              specified with comma separators. Most DebugFlags will
              result in verbose logging for the identified subsystems and
              could impact performance. Valid subsystems available today
              (with more to come) include:

              Accrue        Accrue counters accounting details

              Agent         RPC agents (outgoing RPCs from Slurm daemons)

              Backfill      Backfill scheduler details

              BackfillMap   Backfill scheduler to log a very verbose map
                            of reserved resources through time. Combine
                            with Backfill for a verbose and complete view
                            of the backfill scheduler's work.

              BurstBuffer   Burst Buffer plugin

              CPU_Bind      CPU binding details for jobs and steps

              CpuFrequency  CPU frequency details for jobs and steps
                            using the --cpu-freq option

              Elasticsearch Elasticsearch debug info

              Energy        AcctGatherEnergy debug info

              ExtSensors    External Sensors debug info

              Federation    Federation scheduling debug info

              FrontEnd      Front end node details

              Gang          Gang scheduling details

              Gres          Generic resource details

              HeteroJobs    Heterogeneous job details

              JobContainer  Job container plugin details

              License       License management details

              NodeFeatures  Node Features plugin debug info

              NO_CONF_HASH  Do not log when the slurm.conf file differs
                            between Slurm daemons

              Power         Power management plugin

              PowerSave     Power save (suspend/resume programs) details

              Priority      Job prioritization

              Profile       AcctGatherProfile plugins details

              Protocol      Communication protocol details

              Reservation   Advanced reservations

              Route         Message forwarding and message aggregation
                            debug info

              SelectType    Resource selection plugin

              Steps         Slurmctld resource allocation for job steps

              Switch        Switch plugin

              TimeCray      Timing of Cray APIs

              TraceJobs     Trace jobs in slurmctld. It will print
                            detailed job information including state, job
                            ids and allocated node count.

              TRESNode      Limits dealing with TRES=Node

              Triggers      Slurmctld triggers

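              For instance, to obtain the verbose backfill map described
              above (the flag combination shown is illustrative):

```conf
DebugFlags=Backfill,BackfillMap
```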

       DefCpuPerGPU
              Default count of CPUs allocated per allocated GPU.

       DefMemPerCPU
              Default real memory size available per allocated CPU in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. DefMemPerCPU would generally be used if
              individual processors are allocated to jobs
              (SelectType=select/cons_res or
              SelectType=select/cons_tres). The default value is 0
              (unlimited). Also see DefMemPerGPU, DefMemPerNode and
              MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode
              are mutually exclusive.

       DefMemPerGPU
              Default real memory size available per allocated GPU in
              megabytes. The default value is 0 (unlimited). Also see
              DefMemPerCPU and DefMemPerNode. DefMemPerCPU, DefMemPerGPU
              and DefMemPerNode are mutually exclusive.

       DefMemPerNode
              Default real memory size available per allocated node in
              megabytes. Used to avoid over-subscribing memory and
              causing paging. DefMemPerNode would generally be used if
              whole nodes are allocated to jobs
              (SelectType=select/linear) and resources are
              over-subscribed (OverSubscribe=yes or OverSubscribe=force).
              The default value is 0 (unlimited). Also see DefMemPerCPU,
              DefMemPerGPU and MaxMemPerCPU. DefMemPerCPU, DefMemPerGPU
              and DefMemPerNode are mutually exclusive.

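              As a sketch for a cluster allocating individual processors
              to jobs (the 2048 MB figure is an assumed site value, not a
              default):

```conf
SelectType=select/cons_tres
DefMemPerCPU=2048
```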

       DefaultStorageHost
              The default name of the machine hosting the accounting
              storage and job completion databases. Only used for
              database type storage plugins and when the
              AccountingStorageHost and JobCompHost have not been
              defined.

       DefaultStorageLoc
              The fully qualified file name where accounting records
              and/or job completion records are written when the
              DefaultStorageType is "filetxt". Also see
              AccountingStorageLoc and JobCompLoc.

       DefaultStoragePass
              The password used to gain access to the database to store
              the accounting and job completion data. Only used for
              database type storage plugins, ignored otherwise. Also see
              AccountingStoragePass and JobCompPass.

       DefaultStoragePort
              The listening port of the accounting storage and/or job
              completion database server. Only used for database type
              storage plugins, ignored otherwise. Also see
              AccountingStoragePort and JobCompPort.

       DefaultStorageType
              The accounting and job completion storage mechanism type.
              Acceptable values at present include "filetxt", "mysql" and
              "none". The value "filetxt" indicates that records will be
              written to a file. The value "mysql" indicates that
              accounting records will be written to a MySQL or MariaDB
              database. The default value is "none", which means that
              records are not maintained. Also see AccountingStorageType
              and JobCompType.

       DefaultStorageUser
              The user account for accessing the accounting storage
              and/or job completion database. Only used for database type
              storage plugins, ignored otherwise. Also see
              AccountingStorageUser and JobCompUser.

       DisableRootJobs
              If set to "YES" then user root will be prevented from
              running any jobs. The default value is "NO", meaning user
              root will be able to execute jobs. DisableRootJobs may also
              be set by partition.

       EioTimeout
              The number of seconds srun waits for slurmstepd to close
              the TCP/IP connection used to relay data between the user
              application and srun when the user application terminates.
              The default value is 60 seconds. May not exceed 65533.


       EnforcePartLimits
              If set to "ALL" then jobs which exceed a partition's size
              and/or time limits will be rejected at submission time. If
              the job is submitted to multiple partitions, the job must
              satisfy the limits on all the requested partitions. If set
              to "NO" then the job will be accepted and remain queued
              until the partition limits are altered (time and node
              limits). If set to "ANY", a job must satisfy the limits of
              at least one of the requested partitions to be submitted.
              The default value is "NO". NOTE: If set, then a job's QOS
              can not be used to exceed partition limits. NOTE: The
              partition limits being considered are its configured
              MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime,
              AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS
              usage threshold.


       Epilog Fully qualified pathname of a script to execute as user
              root on every node when a user's job completes (e.g.
              "/usr/local/slurm/epilog"). A glob pattern (see glob(7))
              may also be used to run more than one epilog script (e.g.
              "/etc/slurm/epilog.d/*"). The Epilog script or scripts may
              be used to purge files, disable user login, etc. By default
              there is no epilog. See Prolog and Epilog Scripts for more
              information.

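              A minimal epilog sketch, assuming a per-job scratch
              directory convention (the /tmp path is a hypothetical site
              convention; SLURM_JOB_ID is provided by Slurm in the epilog
              environment):

```shell
#!/bin/sh
# Illustrative epilog: remove a hypothetical per-job scratch directory.
# SLURM_JOB_ID is set by Slurm when invoking the epilog.
SCRATCH="/tmp/job_${SLURM_JOB_ID:-unknown}"
if [ -d "$SCRATCH" ]; then
    rm -rf "$SCRATCH"
fi
exit 0
```

              Epilog scripts should exit 0 on success; a non-zero exit
              will drain the node.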

       EpilogMsgTime
              The number of microseconds that the slurmctld daemon
              requires to process an epilog completion message from the
              slurmd daemons. This parameter can be used to prevent a
              burst of epilog completion messages from being sent at the
              same time, which should help prevent lost messages and
              improve throughput for large jobs. The default value is
              2000 microseconds. For a 1000 node job, this spreads the
              epilog completion messages out over two seconds.


       EpilogSlurmctld
              Fully qualified pathname of a program for the slurmctld to
              execute upon termination of a job allocation (e.g.
              "/usr/local/slurm/epilog_controller"). The program executes
              as SlurmUser, which gives it permission to drain nodes and
              requeue the job if a failure occurs (see scontrol(1)).
              Exactly what the program does and how it accomplishes this
              is completely at the discretion of the system
              administrator. Information about the job being initiated,
              its allocated nodes, etc. are passed to the program using
              environment variables. See Prolog and Epilog Scripts for
              more information.


       ExtSensorsFreq
              The external sensors plugin sampling interval. If
              ExtSensorsType=ext_sensors/none, this parameter is ignored.
              For all other values of ExtSensorsType, this parameter is
              the number of seconds between external sensors samples for
              hardware components (nodes, switches, etc.). The default
              value is zero, which disables external sensors sampling.
              Note: This parameter does not affect external sensors data
              collection for jobs/steps.

       ExtSensorsType
              Identifies the plugin to be used for external sensors data
              collection. Slurmctld calls this plugin to collect external
              sensors data for jobs/steps and hardware components. In
              case of node sharing between jobs the reported values per
              job/step (through sstat or sacct) may not be accurate. See
              also "man ext_sensors.conf".

              Configurable values at present are:

              ext_sensors/none
                     No external sensors data is collected.

              ext_sensors/rrd
                     External sensors data is collected from the RRD
                     database.

800
801 FairShareDampeningFactor
802 Dampen the effect of exceeding a user or group's fair share of
803 allocated resources. Higher values will provide greater ability
804 to differentiate between exceeding the fair share at high levels
805 (e.g. a value of 1 results in almost no difference between over‐
806 consumption by a factor of 10 and 100, while a value of 5 will
807 result in a significant difference in priority). The default
808 value is 1.
809
810
811 FederationParameters
812 Used to define federation options. Multiple options may be comma
813 separated.
814
815
816 fed_display
817 If set, then the client status commands (e.g. squeue,
818 sinfo, sprio, etc.) will display information in a feder‐
819 ated view by default. This option is functionally equiva‐
820 lent to using the --federation options on each command.
821 Use the client's --local option to override the federated
822 view and get a local view of the given cluster.
823
824
825 FirstJobId
826 The job id to be used for the first job submitted to Slurm with‐
827 out a specific requested value. Job id values generated will be
828 incremented by 1 for each subsequent job. This may be used to
829 provide a meta-scheduler with a job id space which is disjoint
830 from that of interactive jobs. The default value is 1. Also see
831 MaxJobId.
831
832
833 GetEnvTimeout
834 Used for Moab scheduled jobs only. Controls how long, in sec‐
835 onds, a job should wait for the user's environment to load before
836 attempting to load it from a cache file. Applies when the srun
837 or sbatch --get-user-env option is used. If set to 0 then always
838 load the user's environment from the cache file. The default
839 value is 2 seconds.
840
841
842 GresTypes
843 A comma delimited list of generic resources to be managed (e.g.
844 GresTypes=gpu,mps). These resources may have an associated GRES
845 plugin of the same name providing additional functionality. No
846 generic resources are managed by default. Ensure this parameter
847 is consistent across all nodes in the cluster for proper opera‐
848 tion. The slurmctld daemon must be restarted for changes to
849 this parameter to become effective.
850
851
852 GroupUpdateForce
853 If set to a non-zero value, then information about which users
854 are members of groups allowed to use a partition will be updated
855 periodically, even when there have been no changes to the
856 /etc/group file. If set to zero, group member information will
857 be updated only after the /etc/group file is updated. The
858 default value is 1. Also see the GroupUpdateTime parameter.
859
860
861 GroupUpdateTime
862 Controls how frequently information about which users are mem‐
863 bers of groups allowed to use a partition will be updated, and
864 how long user group membership lists will be cached. The time
865 interval is given in seconds with a default value of 600 sec‐
866 onds. A value of zero will prevent periodic updating of group
867 membership information. Also see the GroupUpdateForce parame‐
868 ter.
869
870
871 GpuFreqDef=[<type>=]<value>[,<type>=<value>]
872 Default GPU frequency to use when running a job step if it has
873 not been explicitly set using the --gpu-freq option. This
874 option can be used to independently configure the GPU and its
875 memory frequencies. Defaults to "high,memory=high". After the
876 job is completed, the frequencies of all affected GPUs will be
877 reset to the highest possible values. In some cases, system
878 power caps may override the requested values. The field type
879 can be "memory". If type is not specified, the GPU frequency is
880 implied. The value field can either be "low", "medium", "high",
881 "highm1" or a numeric value in megahertz (MHz). If the speci‐
882 fied numeric value is not possible, a value as close as possible
883 will be used. See below for definition of the values. Examples
884 of use include "GpuFreqDef=medium,memory=high" and "GpuFre‐
885 qDef=450".
886
887 Supported value definitions:
888
889 low the lowest available frequency.
890
891 medium attempts to set a frequency in the middle of the
892 available range.
893
894 high the highest available frequency.
895
896 highm1 (high minus one) will select the next highest avail‐
897 able frequency.
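For instance, both forms described above could appear in slurm.conf as:

```
# Symbolic values for the GPU and its memory:
GpuFreqDef=medium,memory=high
# ...or, alternatively, a numeric GPU frequency in MHz:
GpuFreqDef=450
```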
898
899
900 HealthCheckInterval
901 The interval in seconds between executions of HealthCheckPro‐
902 gram. The default value is zero, which disables execution.
903
904
905 HealthCheckNodeState
906 Identify what node states should execute the HealthCheckProgram.
907 Multiple state values may be specified with a comma separator.
908 The default value is ANY to execute on nodes in any state.
909
910 ALLOC Run on nodes in the ALLOC state (all CPUs allo‐
911 cated).
912
913 ANY Run on nodes in any state.
914
915 CYCLE Rather than running the health check program on all
916 nodes at the same time, cycle through running on all
917 compute nodes through the course of the HealthCheck‐
918 Interval. May be combined with the various node
919 state options.
920
921 IDLE Run on nodes in the IDLE state.
922
923 MIXED Run on nodes in the MIXED state (some CPUs idle and
924 other CPUs allocated).
925
926
927 HealthCheckProgram
928 Fully qualified pathname of a script to execute as user root
929 periodically on all compute nodes that are not in the
930 NOT_RESPONDING state. This program may be used to verify the
931 node is fully operational and DRAIN the node or send email if a
932 problem is detected. Any action to be taken must be explicitly
933 performed by the program (e.g. execute "scontrol update Node‐
934 Name=foo State=drain Reason=tmp_file_system_full" to drain a
935 node). The execution interval is controlled using the
936 HealthCheckInterval parameter. Note that the HealthCheckProgram
937 will be executed at the same time on all nodes to minimize its
938 impact upon parallel programs. This program will be killed
939 if it does not terminate normally within 60 seconds. This pro‐
940 gram will also be executed when the slurmd daemon is first
941 started and before it registers with the slurmctld daemon. By
942 default, no program will be executed.
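A minimal sketch of such a script, assuming a hypothetical site policy of draining nodes whose /tmp is nearly full (the threshold, column parsing, and reason string are illustrative):

```shell
#!/bin/bash
# Hypothetical HealthCheckProgram: drain the node when /tmp is
# nearly full. Runs as root on each compute node.

# Return "drain" when usage (percent) exceeds the threshold, else "ok".
tmp_state() {
    local used_pct=$1 threshold=${2:-90}
    if [ "$used_pct" -gt "$threshold" ]; then echo drain; else echo ok; fi
}

# Extract the "Use%" column for /tmp (assumes typical df output).
used=$(df /tmp | awk 'END { sub("%", "", $5); print $5 }')
if [ "$(tmp_state "$used")" = drain ]; then
    # Any action must be explicit, as noted above:
    scontrol update NodeName="$(hostname -s)" State=drain \
        Reason=tmp_file_system_full
fi
```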
943
944
945 InactiveLimit
946 The interval, in seconds, after which a non-responsive job allo‐
947 cation command (e.g. srun or salloc) will result in the job
948 being terminated. If the node on which the command is executed
949 fails or the command abnormally terminates, this will terminate
950 its job allocation. This option has no effect upon batch jobs.
951 When setting a value, take into consideration that a debugger
952 using srun to launch an application may leave the srun command
953 in a stopped state for extended periods of time. This limit is
954 ignored for jobs running in partitions with the RootOnly flag
955 set (the scheduler running as root will be responsible for the
956 job). The default value is unlimited (zero) and may not exceed
957 65533 seconds.
958
959
960 JobAcctGatherType
961 The job accounting mechanism type. Acceptable values at present
962 include "jobacct_gather/linux" (for Linux systems, and the rec‐
963 ommended choice), "jobacct_gather/cgroup" and
964 "jobacct_gather/none" (no accounting data collected). The
965 default value is "jobacct_gather/none". "jobacct_gather/cgroup"
966 is a plugin for the Linux operating system that uses cgroups to
967 collect accounting statistics. The plugin collects the following
968 statistics: From the cgroup memory subsystem: mem‐
969 ory.usage_in_bytes (reported as 'pages') and rss from mem‐
970 ory.stat (reported as 'rss'). From the cgroup cpuacct subsystem:
971 user cpu time and system cpu time. No value is provided by
972 cgroups for virtual memory size ('vsize'). In order to use the
973 sstat tool, "jobacct_gather/linux" or "jobacct_gather/cgroup"
974 must be configured.
975 NOTE: Changing this configuration parameter changes the contents
976 of the messages between Slurm daemons. Any previously running
977 job steps are managed by a slurmstepd daemon that will persist
978 through the lifetime of that job step and not change its commu‐
979 nication protocol. Only change this configuration parameter when
980 there are no running job steps.
981
982
983 JobAcctGatherFrequency
984 The job accounting and profiling sampling intervals. The sup‐
985 ported format is as follows:
986
987 JobAcctGatherFrequency=<datatype>=<interval>
988 where <datatype>=<interval> specifies the task sam‐
989 pling interval for the jobacct_gather plugin or a
990 sampling interval for a profiling type by the
991 acct_gather_profile plugin. Multiple, comma-sepa‐
992 rated <datatype>=<interval> intervals may be speci‐
993 fied. Supported datatypes are as follows:
994
995 task=<interval>
996 where <interval> is the task sampling inter‐
997 val in seconds for the jobacct_gather plugins
998 and for task profiling by the
999 acct_gather_profile plugin.
1000
1001 energy=<interval>
1002 where <interval> is the sampling interval in
1003 seconds for energy profiling using the
1004 acct_gather_energy plugin
1005
1006 network=<interval>
1007 where <interval> is the sampling interval in
1008 seconds for infiniband profiling using the
1009 acct_gather_infiniband plugin.
1010
1011 filesystem=<interval>
1012 where <interval> is the sampling interval in
1013 seconds for filesystem profiling using the
1014 acct_gather_filesystem plugin.
1015
1016 The default value for task sampling interval
1017 is 30 seconds. The default value for all other intervals is 0.
1018 An interval of 0 disables sampling of the specified type. If
1019 the task sampling interval is 0, accounting information is col‐
1020 lected only at job termination (reducing Slurm interference with
1021 the job).
1022 Smaller (non-zero) values have a greater impact upon job perfor‐
1023 mance, but a value of 30 seconds is not likely to be noticeable
1024 for applications having less than 10,000 tasks.
1025 Users can independently override each interval on a per job
1026 basis using the --acctg-freq option when submitting the job.
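A hypothetical combined setting, following the format above:

```
# 30-second task sampling and 60-second energy profiling; network and
# filesystem profiling left at their default of 0 (disabled).
JobAcctGatherFrequency=task=30,energy=60
```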
1027
1028
1029 JobAcctGatherParams
1030 Arbitrary parameters for the job account gather plugin. Accept‐
1031 able values at present include:
1032
1033 NoShared Exclude shared memory from accounting.
1034
1035 UsePss Use PSS value instead of RSS to calculate
1036 real usage of memory. The PSS value will be
1037 saved as RSS.
1038
1039 OverMemoryKill Kill jobs or steps that are being detected
1040 to use more memory than requested every time
1041 accounting information is gathered by the
1042 JobAcctGather plugin. This parameter will
1043 not kill a job directly, but only the step.
1044 See MemLimitEnforce for that purpose. This
1045 parameter should be used with caution: if a
1046 job exceeds its memory allocation it may
1047 affect other processes and/or machine
1048 health. NOTE: It is recommended to limit
1049 memory by enabling task/cgroup in TaskPlugin
1050 and making use of ConstrainRAMSpace=yes in
1051 cgroup.conf instead of using this JobAcct‐
1052 Gather mechanism for memory enforcement,
1053 since the latter has a lower resolution
1054 (JobAcctGatherFreq) and OOMs could happen at
1055 some point.
1056
1057
1058 JobCheckpointDir
1059 Specifies the default directory for storing or reading job
1060 checkpoint information. The data stored here is only a few thou‐
1061 sand bytes per job and includes information needed to resubmit
1062 the job request, not the job's memory image. The directory must be
1063 readable and writable by SlurmUser, but not writable by regular
1064 users. The job memory images may be in a different location as
1065 specified by --checkpoint-dir option at job submit time or scon‐
1066 trol's ImageDir option.
1067
1068
1069 JobCompHost
1070 The name of the machine hosting the job completion database.
1071 Only used for database type storage plugins, ignored otherwise.
1072 Also see DefaultStorageHost.
1073
1074
1075 JobCompLoc
1076 The fully qualified file name where job completion records are
1077 written when the JobCompType is "jobcomp/filetxt" or the data‐
1078 base where job completion records are stored when the JobComp‐
1079 Type is a database, or a URL of the form http://yourelastic‐
1080 server:port when JobCompType is "jobcomp/elasticsearch". NOTE:
1081 when you specify a URL for Elasticsearch, Slurm will remove any
1082 trailing slashes "/" from the configured URL and append
1083 "/slurm/jobcomp", which are the Elasticsearch index name (slurm)
1084 and mapping (jobcomp). NOTE: More information is available at
1085 the Slurm web site ( https://slurm.schedmd.com/elastic‐
1086 search.html ). Also see DefaultStorageLoc.
1087
1088
1089 JobCompPass
1090 The password used to gain access to the database to store the
1091 job completion data. Only used for database type storage plug‐
1092 ins, ignored otherwise. Also see DefaultStoragePass.
1093
1094
1095 JobCompPort
1096 The listening port of the job completion database server. Only
1097 used for database type storage plugins, ignored otherwise. Also
1098 see DefaultStoragePort.
1099
1100
1101 JobCompType
1102 The job completion logging mechanism type. Acceptable values at
1103 present include "jobcomp/none", "jobcomp/elasticsearch", "job‐
1104 comp/filetxt", "jobcomp/mysql" and "jobcomp/script". The
1105 default value is "jobcomp/none", which means that upon job com‐
1106 pletion the record of the job is purged from the system. If
1107 using the accounting infrastructure this plugin may not be of
1108 interest since the information here is redundant. The value
1109 "jobcomp/elasticsearch" indicates that a record of the job
1110 should be written to an Elasticsearch server specified by the
1111 JobCompLoc parameter. NOTE: More information is available at
1112 the Slurm web site ( https://slurm.schedmd.com/elastic‐
1113 search.html ). The value "jobcomp/filetxt" indicates that a
1114 record of the job should be written to a text file specified by
1115 the JobCompLoc parameter. The value "jobcomp/mysql" indicates
1116 that a record of the job should be written to a MySQL or MariaDB
1117 database specified by the JobCompLoc parameter. The value "job‐
1118 comp/script" indicates that a script specified by the JobCompLoc
1119 parameter is to be executed with environment variables indicat‐
1120 ing the job information.
1121
1122 JobCompUser
1123 The user account for accessing the job completion database.
1124 Only used for database type storage plugins, ignored otherwise.
1125 Also see DefaultStorageUser.
1126
1127
1128 JobContainerType
1129 Identifies the plugin to be used for job tracking. The slurmd
1130 daemon must be restarted for a change in JobContainerType to
1131 take effect. NOTE: The JobContainerType applies to a job allo‐
1132 cation, while ProctrackType applies to job steps. Acceptable
1133 values at present include:
1134
1135 job_container/cncu used only for Cray systems (CNCU = Compute
1136 Node Clean Up)
1137
1138 job_container/none used for all other system types
1139
1140
1141 JobFileAppend
1142 This option controls what to do if a job's output or error file
1143 exist when the job is started. If JobFileAppend is set to a
1144 value of 1, then append to the existing file. By default, any
1145 existing file is truncated.
1146
1147
1148 JobRequeue
1149 This option controls the default ability for batch jobs to be
1150 requeued. Jobs may be requeued explicitly by a system adminis‐
1151 trator, after node failure, or upon preemption by a higher pri‐
1152 ority job. If JobRequeue is set to a value of 1, then batch jobs
1153 may be requeued unless explicitly disabled by the user. If
1154 JobRequeue is set to a value of 0, then batch jobs will not be
1155 requeued unless explicitly enabled by the user. Use the sbatch
1156 --no-requeue or --requeue option to change the default behavior
1157 for individual jobs. The default value is 1.
1158
1159
1160 JobSubmitPlugins
1161 A comma delimited list of job submission plugins to be used.
1162 The specified plugins will be executed in the order listed.
1163 These are intended to be site-specific plugins which can be used
1164 to set default job parameters and/or logging events. Sample
1165 plugins available in the distribution include "all_partitions",
1166 "defaults", "logging", "lua", and "partition". For examples of
1167 use, see the Slurm code in "src/plugins/job_submit" and "con‐
1168 tribs/lua/job_submit*.lua" then modify the code to satisfy your
1169 needs. Slurm can be configured to use multiple job_submit plug‐
1170 ins if desired, however the lua plugin will only execute one lua
1171 script named "job_submit.lua" located in the default script
1172 directory (typically the subdirectory "etc" of the installation
1173 directory). No job submission plugins are used by default.
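For example, a site using only the lua plugin might configure:

```
# Run the lua plugin; it executes the single script job_submit.lua
# from the default script directory described above.
JobSubmitPlugins=lua
```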
1174
1175
1176 KeepAliveTime
1177 Specifies how long socket communications between the srun
1178 command and its slurmstepd process are kept alive after discon‐
1179 nect. Longer values can be used to improve reliability of com‐
1180 munications in the event of network failures. By default, the
1181 system default value is used. The value may not exceed
1182 65533.
1183
1184
1185 KillOnBadExit
1186 If set to 1, a step will be terminated immediately if any task
1187 crashes or aborts, as indicated by a non-zero exit code.
1188 With the default value of 0, if one of the processes crashes
1189 or aborts, the other processes will continue to run while the
1190 crashed or aborted process waits.
1191 configuration parameter by using srun's -K, --kill-on-bad-exit.
1192
1193
1194 KillWait
1195 The interval, in seconds, given to a job's processes between the
1196 SIGTERM and SIGKILL signals upon reaching its time limit. If
1197 the job fails to terminate gracefully in the interval specified,
1198 it will be forcibly terminated. The default value is 30 sec‐
1199 onds. The value may not exceed 65533.
1200
1201
1202 NodeFeaturesPlugins
1203 Identifies the plugins to be used for support of node features
1204 which can change through time. For example, a node which might
1205 be booted with various BIOS settings. This is supported through
1206 the use of a node's active_features and available_features
1207 information. Acceptable values at present include:
1208
1209 node_features/knl_cray
1210 used only for Intel Knights Landing proces‐
1211 sors (KNL) on Cray systems
1212
1213 node_features/knl_generic
1214 used for Intel Knights Landing processors
1215 (KNL) on a generic Linux system
1216
1217
1218 LaunchParameters
1219 Identifies options to the job launch plugin. Acceptable values
1220 include:
1221
1222 batch_step_set_cpu_freq Set the cpu frequency for the batch step
1223 from the given --cpu-freq option, or the
1224 slurm.conf CpuFreqDef setting. By default only
1225 steps started with srun will utilize the
1226 cpu freq setting options.
1227
1228 NOTE: If you are using srun to launch
1229 your steps inside a batch script
1230 (advised) this option will create a sit‐
1231 uation where you may have multiple
1232 agents setting the cpu_freq as the batch
1233 step usually runs on the same resources
1234 as the one or more steps the sruns in the
1235 script will create.
1236
1237 cray_net_exclusive Allow jobs on a Cray Native cluster
1238 exclusive access to network resources.
1239 This should only be set on clusters pro‐
1240 viding exclusive access to each node to
1241 a single job at once, and not using par‐
1242 allel steps within the job, otherwise
1243 resources on the node can be oversub‐
1244 scribed.
1245
1246 enable_nss_slurm Permits passwd and group resolution for
1247 a job to be serviced by slurmstepd
1248 rather than requiring a lookup from a
1249 network based service. See
1250 https://slurm.schedmd.com/nss_slurm.html
1251 for more information.
1252
1253 lustre_no_flush If set on a Cray Native cluster, then do
1254 not flush the Lustre cache on job step
1255 completion. This setting will only take
1256 effect after reconfiguring, and will
1257 only take effect for newly launched
1258 jobs.
1259
1260 mem_sort Sort NUMA memory at step start. User can
1261 override this default with
1262 SLURM_MEM_BIND environment variable or
1263 --mem-bind=nosort command line option.
1264
1265 disable_send_gids By default the slurmctld will look up
1266 and send the user_name and extended gids
1267 for a job, rather than doing individual
1268 lookups on each node as part of each
1269 task launch. This avoids issues with
1270 name service scalability when launching
1271 jobs involving many nodes. Setting this
1272 option disables that behavior.
1273
1274 slurmstepd_memlock Lock the slurmstepd process's current
1275 memory in RAM.
1276
1277 slurmstepd_memlock_all Lock the slurmstepd process's current
1278 and future memory in RAM.
1279
1280 test_exec Have srun verify existence of the exe‐
1281 cutable program along with user execute
1282 permission on the node where srun was
1283 called before attempting to launch it on
1284 nodes in the step.
1285
1286
1287 LaunchType
1288 Identifies the mechanism to be used to launch application tasks.
1289 Acceptable values include:
1290
1291 launch/slurm
1292 The default value.
1293
1294
1295 Licenses
1296 Specification of licenses (or other resources available on all
1297 nodes of the cluster) which can be allocated to jobs. License
1298 names can optionally be followed by a colon and count with a
1299 default count of one. Multiple license names should be comma
1300 separated (e.g. "Licenses=foo:4,bar"). Note that Slurm pre‐
1301 vents jobs from being scheduled if their required license speci‐
1302 fication is not available. Slurm does not prevent jobs from
1303 using licenses that are not explicitly listed in the job submis‐
1304 sion specification.
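As an illustrative configuration, four "foo" licenses and one "bar" license available cluster-wide:

```
Licenses=foo:4,bar
```

A job would then request them at submission time, e.g. "sbatch -L foo:2 job.sh".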
1305
1306
1307 LogTimeFormat
1308 Format of the timestamp in slurmctld and slurmd log files.
1309 Accepted values are "iso8601", "iso8601_ms", "rfc5424",
1310 "rfc5424_ms", "clock", "short" and "thread_id". The values end‐
1311 ing in "_ms" differ from the ones without in that fractional
1312 seconds with millisecond precision are printed. The default
1313 value is "iso8601_ms". The "rfc5424" formats are the same as the
1314 "iso8601" formats except that the timezone value is also shown.
1315 The "clock" format shows a timestamp in microseconds retrieved
1316 with the C standard clock() function. The "short" format is a
1317 short date and time format. The "thread_id" format shows the
1318 timestamp in the C standard ctime() function form without the
1319 year but including the microseconds, the daemon's process ID and
1320 the current thread name and ID.
1321
1322
1323 MailDomain
1324 Domain name to qualify usernames if email address is not explic‐
1325 itly given with the "--mail-user" option. If unset, the local
1326 MTA will need to qualify local addresses itself.
1327
1328
1329 MailProg
1330 Fully qualified pathname to the program used to send email per
1331 user request. The default value is "/bin/mail" (or
1332 "/usr/bin/mail" if "/bin/mail" does not exist but
1333 "/usr/bin/mail" does exist).
1334
1335
1336 MaxArraySize
1337 The maximum job array size. The maximum job array task index
1338 value will be one less than MaxArraySize to allow for an index
1339 value of zero. Configure MaxArraySize to 0 in order to disable
1340 job array use. The value may not exceed 4000001. The value of
1341 MaxJobCount should be much larger than MaxArraySize. The
1342 default value is 1001.
1343
1344
1345 MaxJobCount
1346 The maximum number of jobs Slurm can have in its active database
1347 at one time. Set the values of MaxJobCount and MinJobAge to
1348 ensure the slurmctld daemon does not exhaust its memory or other
1349 resources. Once this limit is reached, requests to submit addi‐
1350 tional jobs will fail. The default value is 10000 jobs. NOTE:
1351 Each task of a job array counts as one job even though they will
1352 not occupy separate job records until modified or initiated.
1353 Performance can suffer with more than a few hundred thousand
1354 jobs. Setting MaxSubmitJobs per user is generally valuable
1355 to prevent a single user from filling the system with jobs.
1356 This is accomplished using Slurm's database and configuring
1357 enforcement of resource limits. This value may not be reset via
1358 "scontrol reconfig". It only takes effect upon restart of the
1359 slurmctld daemon.
1360
1361
1362 MaxJobId
1363 The maximum job id to be used for jobs submitted to Slurm with‐
1364 out a specific requested value. Job ids are unsigned 32bit inte‐
1365 gers with the first 26 bits reserved for local job ids and the
1366 remaining 6 bits reserved for a cluster id to identify a feder‐
1367 ated job's origin. The maximum allowed local job id is
1368 67,108,863 (0x3FFFFFF). The default value is 67,043,328
1369 (0x03ff0000). MaxJobId only applies to the local job id and not
1370 the federated job id. Job id values generated will be incre‐
1371 mented by 1 for each subsequent job. Once MaxJobId is reached,
1372 the next job will be assigned FirstJobId. Federated jobs will
1373 always have a job ID of 67,108,865 or higher. Also see FirstJo‐
1374 bId.
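The bit layout described above can be checked with ordinary shell arithmetic:

```shell
# 32-bit job ids: low 26 bits = local id, high 6 bits = cluster id.
printf '%d\n' $(( 0x3FFFFFF ))      # 67108863, the largest local job id
printf '%d\n' $(( 0x03ff0000 ))     # 67043328, the default MaxJobId
printf '%d\n' $(( (1 << 26) + 1 ))  # 67108865, smallest federated job id
```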
1375
1376
1377 MaxMemPerCPU
1378 Maximum real memory size available per allocated CPU in
1379 megabytes. Used to avoid over-subscribing memory and causing
1380 paging. MaxMemPerCPU would generally be used if individual pro‐
1381 cessors are allocated to jobs (SelectType=select/cons_res or
1382 SelectType=select/cons_tres). The default value is 0 (unlim‐
1383 ited). Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerNode.
1384 MaxMemPerCPU and MaxMemPerNode are mutually exclusive.
1385
1386 NOTE: If a job specifies a memory per CPU limit that exceeds
1387 this system limit, that job's count of CPUs per task will auto‐
1388 matically be increased. This may result in the job failing due
1389 to CPU count limits.
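A sketch of the automatic adjustment described in the NOTE, with illustrative numbers:

```shell
# With MaxMemPerCPU=4000 (MB), a request of --mem-per-cpu=9000 is
# satisfied by raising CPUs per task until the per-CPU share fits.
max_mem_per_cpu=4000
requested=9000
cpus_per_task=$(( (requested + max_mem_per_cpu - 1) / max_mem_per_cpu ))
echo "$cpus_per_task"   # 3
```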
1390
1391
1392 MaxMemPerNode
1393 Maximum real memory size available per allocated node in
1394 megabytes. Used to avoid over-subscribing memory and causing
1395 paging. MaxMemPerNode would generally be used if whole nodes
1396 are allocated to jobs (SelectType=select/linear) and resources
1397 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
1398 The default value is 0 (unlimited). Also see DefMemPerNode and
1399 MaxMemPerCPU. MaxMemPerCPU and MaxMemPerNode are mutually
1400 exclusive.
1401
1402
1403 MaxStepCount
1404 The maximum number of steps that any job can initiate. This
1405 parameter is intended to limit the effect of bad batch scripts.
1406 The default value is 40000 steps.
1407
1408
1409 MaxTasksPerNode
1410 Maximum number of tasks Slurm will allow a job step to spawn on
1411 a single node. The default MaxTasksPerNode is 512. May not
1412 exceed 65533.
1413
1414
1415 MCSParameters
1416 MCS (Multi-Category Security) plugin parameters. The sup‐
1417 ported parameters are specific to the MCSPlugin. Changes to
1418 this value take effect when the Slurm daemons are reconfigured.
1419 More information about MCS is available here
1420 <https://slurm.schedmd.com/mcs.html>.
1421
1422
1423 MCSPlugin
1424 MCS (Multi-Category Security): associate a security label with
1425 jobs and ensure that nodes can only be shared among jobs using
1426 the same security label. Acceptable values include:
1427
1428 mcs/none is the default value. No security label associated
1429 with jobs, no particular security restriction when
1430 sharing nodes among jobs.
1431
1432 mcs/account only users with the same account can share the nodes
1433 (requires enabling of accounting).
1434
1435 mcs/group only users with the same group can share the nodes.
1436
1437 mcs/user a node cannot be shared with other users.
1438
1439
1440 MemLimitEnforce
1441 If set to yes then Slurm will terminate the job if it exceeds
1442 the value requested using the --mem-per-cpu option of sal‐
1443 loc/sbatch/srun. This is useful in combination with JobAcct‐
1444 GatherParams=OverMemoryKill. Used when jobs need to specify
1445 --mem-per-cpu for scheduling and they should be terminated if
1446 they exceed the estimated value. The default value is 'no',
1447 which disables this enforcing mechanism. NOTE: It is recom‐
1448 mended to limit memory by enabling task/cgroup in TaskPlugin and
1449 making use of ConstrainRAMSpace=yes in cgroup.conf instead of
1450 using this JobAcctGather mechanism for memory enforcement, since
1451 the latter has a lower resolution (JobAcctGatherFreq) and OOMs
1452 could happen at some point.
1453
1454
1455 MessageTimeout
1456 Time permitted for a round-trip communication to complete in
1457 seconds. Default value is 10 seconds. For systems with shared
1458 nodes, the slurmd daemon could be paged out and necessitate
1459 higher values.
1460
1461
1462 MinJobAge
1463 The minimum age of a completed job before its record is purged
1464 from Slurm's active database. Set the values of MaxJobCount and
1465 MinJobAge to ensure the slurmctld daemon does not exhaust its
1466 memory or other resources. The default value is 300 seconds. A value of
1467 zero prevents any job record purging. Jobs are not purged dur‐
1468 ing a backfill cycle, so it can take longer than MinJobAge sec‐
1469 onds to purge a job if using the backfill scheduling plugin. In
1470 order to eliminate some possible race conditions, the minimum
1471 non-zero value for MinJobAge recommended is 2.
1472
1473
1474 MpiDefault
1475 Identifies the default type of MPI to be used. Srun may over‐
1476 ride this configuration parameter in any case. Currently sup‐
1477 ported versions include: openmpi, pmi2, pmix, and none (default,
1478 which works for many other versions of MPI). More information
1479 about MPI use is available here
1480 <https://slurm.schedmd.com/mpi_guide.html>.
1481
1482
1483 MpiParams
1484 MPI parameters. Used to identify ports used by older versions
1485 of OpenMPI and native Cray systems. The input format is
1486 "ports=12000-12999" to identify a range of communication ports
1487 to be used. NOTE: This is not needed for modern versions of
1488 OpenMPI; removing it can provide a small boost in scheduling
1489 performance. NOTE: This is required for Cray's PMI.
1490
1491 MsgAggregationParams
1492 Message aggregation parameters. Message aggregation is an
1493 optional feature that may improve system performance by reducing
1494 the number of separate messages passed between nodes. The fea‐
1495 ture works by routing messages through one or more message col‐
1496 lector nodes between their source and destination nodes. At each
1497 collector node, messages with the same destination received dur‐
1498 ing a defined message collection window are packaged into a sin‐
1499 gle composite message. When the window expires, the composite
1500 message is sent to the next collector node on the route to its
1501 destination. The route between each source and destination node
1502 is provided by the Route plugin. When a composite message is
1503 received at its destination node, the original messages are
1504 extracted and processed as if they had been sent directly.
1505 Currently, the only message types supported by message aggrega‐
1506 tion are the node registration, batch script completion, step
1507 completion, and epilog complete messages.
1508 Since the aggregation node address is set by resolving the host‐
1509 name at slurmd start on each node, using this feature in non-flat
1510 networks is not possible. For example, if slurmctld is in a
1511 different subnetwork than compute nodes and node addresses are
1512 resolved differently on the controller than on the compute nodes,
1513 you may face communication issues. In some cases it may be use‐
1514 ful to set CommunicationParameters=NoInAddrAny to make all dae‐
1515 mons communicate through the same network.
1516 The format for this parameter is as follows:
1517
1518 MsgAggregationParams=<option>=<value>
1519              where <option>=<value> specifies a particular control
1520              variable. Multiple, comma-separated <option>=<value>
1521              pairs may be specified. Supported options are as
1522              follows:
1523
1524 WindowMsgs=<number>
1525 where <number> is the maximum number of mes‐
1526 sages in each message collection window.
1527
1528 WindowTime=<time>
1529 where <time> is the maximum elapsed time in
1530 milliseconds of each message collection win‐
1531 dow.
1532
1533              A window expires when either WindowMsgs or WindowTime is
1534              reached. By default, message aggregation is disabled. To
1535              enable the feature, set WindowMsgs to a value greater than 1.
1536              The default value for WindowTime is 100 milliseconds.
1538
1539
1540 OverTimeLimit
1541 Number of minutes by which a job can exceed its time limit
1542 before being canceled. Normally a job's time limit is treated
1543 as a hard limit and the job will be killed upon reaching that
1544 limit. Configuring OverTimeLimit will result in the job's time
1545 limit being treated like a soft limit. Adding the OverTimeLimit
1546 value to the soft time limit provides a hard time limit, at
1547 which point the job is canceled. This is particularly useful
1548 for backfill scheduling, which bases upon each job's soft time
1549 limit. The default value is zero. May not exceed exceed 65533
1550 minutes. A value of "UNLIMITED" is also supported.
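
              As a sketch, granting every job a 5-minute grace period past
              its soft time limit (the value is illustrative):

              ```conf
              # Jobs may run up to 5 minutes past their time limit before
              # being canceled; backfill still plans against the soft limit.
              OverTimeLimit=5
              ```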
1551
1552
1553 PluginDir
1554 Identifies the places in which to look for Slurm plugins. This
1555 is a colon-separated list of directories, like the PATH environ‐
1556 ment variable. The default value is "/usr/local/lib/slurm".
1557
1558
1559 PlugStackConfig
1560 Location of the config file for Slurm stackable plugins that use
1561 the Stackable Plugin Architecture for Node job (K)control
1562 (SPANK). This provides support for a highly configurable set of
1563 plugins to be called before and/or after execution of each task
1564 spawned as part of a user's job step. Default location is
1565 "plugstack.conf" in the same directory as the system slurm.conf.
1566 For more information on SPANK plugins, see the spank(8) manual.
1567
1568
1569 PowerParameters
1570 System power management parameters. The supported parameters
1571 are specific to the PowerPlugin. Changes to this value take
1572 effect when the Slurm daemons are reconfigured. More informa‐
1573 tion about system power management is available here
1574              <https://slurm.schedmd.com/power_mgmt.html>. Options currently
1575              supported by any plugin are listed below.
1576
1577 balance_interval=#
1578 Specifies the time interval, in seconds, between attempts
1579 to rebalance power caps across the nodes. This also con‐
1580 trols the frequency at which Slurm attempts to collect
1581 current power consumption data (old data may be used
1582 until new data is available from the underlying infra‐
1583 structure and values below 10 seconds are not recommended
1584 for Cray systems). The default value is 30 seconds.
1585 Supported by the power/cray_aries plugin.
1586
1587 capmc_path=
1588 Specifies the absolute path of the capmc command. The
1589 default value is "/opt/cray/capmc/default/bin/capmc".
1590 Supported by the power/cray_aries plugin.
1591
1592 cap_watts=#
1593 Specifies the total power limit to be established across
1594 all compute nodes managed by Slurm. A value of 0 sets
1595 every compute node to have an unlimited cap. The default
1596 value is 0. Supported by the power/cray_aries plugin.
1597
1598 decrease_rate=#
1599 Specifies the maximum rate of change in the power cap for
1600 a node where the actual power usage is below the power
1601 cap by an amount greater than lower_threshold (see
1602 below). Value represents a percentage of the difference
1603 between a node's minimum and maximum power consumption.
1604 The default value is 50 percent. Supported by the
1605 power/cray_aries plugin.
1606
1607 get_timeout=#
1608 Amount of time allowed to get power state information in
1609 milliseconds. The default value is 5,000 milliseconds or
1610 5 seconds. Supported by the power/cray_aries plugin and
1611 represents the time allowed for the capmc command to
1612 respond to various "get" options.
1613
1614 increase_rate=#
1615 Specifies the maximum rate of change in the power cap for
1616 a node where the actual power usage is within
1617 upper_threshold (see below) of the power cap. Value rep‐
1618 resents a percentage of the difference between a node's
1619 minimum and maximum power consumption. The default value
1620 is 20 percent. Supported by the power/cray_aries plugin.
1621
1622 job_level
1623 All nodes associated with every job will have the same
1624 power cap, to the extent possible. Also see the
1625 --power=level option on the job submission commands.
1626
1627 job_no_level
1628 Disable the user's ability to set every node associated
1629 with a job to the same power cap. Each node will have
1630              its power cap set independently. This disables the
1631 --power=level option on the job submission commands.
1632
1633 lower_threshold=#
1634 Specify a lower power consumption threshold. If a node's
1635 current power consumption is below this percentage of its
1636 current cap, then its power cap will be reduced. The
1637 default value is 90 percent. Supported by the
1638 power/cray_aries plugin.
1639
1640 recent_job=#
1641 If a job has started or resumed execution (from suspend)
1642 on a compute node within this number of seconds from the
1643 current time, the node's power cap will be increased to
1644 the maximum. The default value is 300 seconds. Sup‐
1645 ported by the power/cray_aries plugin.
1646
1647
1648 set_timeout=#
1649 Amount of time allowed to set power state information in
1650 milliseconds. The default value is 30,000 milliseconds
1651              or 30 seconds. Supported by the power/cray_aries plugin and
1652 represents the time allowed for the capmc command to
1653 respond to various "set" options.
1654
1655 set_watts=#
1656 Specifies the power limit to be set on every compute
1657              node managed by Slurm. Every node gets the same power
1658 cap and there is no variation through time based upon
1659 actual power usage on the node. Supported by the
1660 power/cray_aries plugin.
1661
1662 upper_threshold=#
1663 Specify an upper power consumption threshold. If a
1664 node's current power consumption is above this percentage
1665 of its current cap, then its power cap will be increased
1666 to the extent possible. The default value is 95 percent.
1667 Supported by the power/cray_aries plugin.
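
              A hypothetical power-capping setup for the power/cray_aries
              plugin might combine several of these options; the values
              below are illustrative, not site recommendations:

              ```conf
              PowerPlugin=power/cray_aries
              # 200 kW total cap, rebalanced every 60 seconds; caps shrink
              # below 90% utilization and grow above 95%.
              PowerParameters=balance_interval=60,cap_watts=200000,lower_threshold=90,upper_threshold=95
              ```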
1668
1669
1670 PowerPlugin
1671 Identifies the plugin used for system power management. Cur‐
1672 rently supported plugins include: cray_aries and none. Changes
1673 to this value require restarting Slurm daemons to take effect.
1674 More information about system power management is available here
1675 <https://slurm.schedmd.com/power_mgmt.html>. By default, no
1676 power plugin is loaded.
1677
1678
1679 PreemptMode
1680 Enables gang scheduling and/or controls the mechanism used to
1681 preempt jobs. When the PreemptType parameter is set to enable
1682 preemption, the PreemptMode selects the default mechanism used
1683 to preempt the lower priority jobs for the cluster. PreemptMode
1684 may be specified on a per partition basis to override this
1685 default value if PreemptType=preempt/partition_prio, but a valid
1686 default PreemptMode value must be specified for the cluster as a
1687 whole when preemption is enabled. The GANG option is used to
1688 enable gang scheduling independent of whether preemption is
1689 enabled (the PreemptType setting). The GANG option can be spec‐
1690 ified in addition to a PreemptMode setting with the two options
1691 comma separated. The SUSPEND option requires that gang schedul‐
1692              ing be enabled (i.e., "PreemptMode=SUSPEND,GANG"). NOTE: For
1693 performance reasons, the backfill scheduler reserves whole nodes
1694 for jobs, not partial nodes. If during backfill scheduling a job
1695 preempts one or more other jobs, the whole nodes for those pre‐
1696 empted jobs are reserved for the preemptor job, even if the pre‐
1697 emptor job requested fewer resources than that. These reserved
1698 nodes aren't available to other jobs during that backfill cycle,
1699 even if the other jobs could fit on the nodes. Therefore, jobs
1700 may preempt more resources during a single backfill iteration
1701 than they requested.
1702
1703 OFF is the default value and disables job preemption and
1704 gang scheduling.
1705
1706              CANCEL      always cancels the job.
1707
1708 CHECKPOINT preempts jobs by checkpointing them (if possible) or
1709 canceling them.
1710
1711 GANG enables gang scheduling (time slicing) of jobs in
1712 the same partition. NOTE: Gang scheduling is per‐
1713 formed independently for each partition, so config‐
1714 uring partitions with overlapping nodes and gang
1715 scheduling is generally not recommended.
1716
1717 REQUEUE preempts jobs by requeuing them (if possible) or
1718 canceling them. For jobs to be requeued they must
1719 have the --requeue sbatch option set or the cluster
1720 wide JobRequeue parameter in slurm.conf must be set
1721 to one.
1722
1723 SUSPEND If PreemptType=preempt/partition_prio is configured
1724 then suspend and automatically resume the low prior‐
1725 ity jobs. If PreemptType=preempt/qos is configured,
1726 then the jobs sharing resources will always time
1727 slice rather than one job remaining suspended. The
1728 SUSPEND may only be used with the GANG option (the
1729 gang scheduler module performs the job resume opera‐
1730 tion).
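
              For instance, partition-priority preemption with suspended
              jobs resumed by the gang scheduler could be configured as
              follows (illustrative):

              ```conf
              # Higher-priority partitions preempt lower ones; preempted
              # jobs are suspended and gang-resumed rather than killed.
              PreemptType=preempt/partition_prio
              PreemptMode=SUSPEND,GANG
              ```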
1731
1732
1733 PreemptType
1734 This specifies the plugin used to identify which jobs can be
1735 preempted in order to start a pending job.
1736
1737 preempt/none
1738 Job preemption is disabled. This is the default.
1739
1740 preempt/partition_prio
1741 Job preemption is based upon partition priority tier.
1742 Jobs in higher priority partitions (queues) may preempt
1743 jobs from lower priority partitions. This is not compat‐
1744 ible with PreemptMode=OFF.
1745
1746 preempt/qos
1747 Job preemption rules are specified by Quality Of Service
1748 (QOS) specifications in the Slurm database. This option
1749 is not compatible with PreemptMode=OFF. A configuration
1750 of PreemptMode=SUSPEND is only supported by the Select‐
1751 Type=select/cons_res and SelectType=select/cons_tres
1752 plugins.
1753
1754
1755 PreemptExemptTime
1756 Global option for minimum run time for all jobs before they can
1757 be considered for preemption. Any QOS PreemptExemptTime takes
1758 precedence over the global option. A time of -1 disables the
1759 option, equivalent to 0. Acceptable time formats include "min‐
1760 utes", "minutes:seconds", "hours:minutes:seconds", "days-hours",
1761 "days-hours:minutes", and "days-hours:minutes:seconds".
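
              The accepted time formats, shown with illustrative values
              (only one such line would appear in a real configuration):

              ```conf
              PreemptExemptTime=30        # minutes
              PreemptExemptTime=30:00     # minutes:seconds
              PreemptExemptTime=1:00:00   # hours:minutes:seconds
              PreemptExemptTime=2-12      # days-hours (2 days, 12 hours)
              ```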
1762
1763
1764 PriorityCalcPeriod
1765 The period of time in minutes in which the half-life decay will
1766 be re-calculated. Applicable only if PriorityType=priority/mul‐
1767 tifactor. The default value is 5 (minutes).
1768
1769
1770 PriorityDecayHalfLife
1771 This controls how long prior resource use is considered in
1772 determining how over- or under-serviced an association is (user,
1773 bank account and cluster) in determining job priority. The
1774 record of usage will be decayed over time, with half of the
1775 original value cleared at age PriorityDecayHalfLife. If set to
1776 0 no decay will be applied. This is helpful if you want to
1777 enforce hard time limits per association. If set to 0 Priori‐
1778 tyUsageResetPeriod must be set to some interval. Applicable
1779 only if PriorityType=priority/multifactor. The unit is a time
1780 string (i.e. min, hr:min:00, days-hr:min:00, or days-hr). The
1781 default value is 7-0 (7 days).
1782
1783
1784 PriorityFavorSmall
1785 Specifies that small jobs should be given preferential schedul‐
1786 ing priority. Applicable only if PriorityType=priority/multi‐
1787 factor. Supported values are "YES" and "NO". The default value
1788 is "NO".
1789
1790
1791 PriorityFlags
1792 Flags to modify priority behavior. Applicable only if Priority‐
1793 Type=priority/multifactor. The keywords below have no associ‐
1794 ated value (e.g. "PriorityFlags=ACCRUE_ALWAYS,SMALL_RELA‐
1795 TIVE_TO_TIME").
1796
1797 ACCRUE_ALWAYS If set, priority age factor will be increased
1798 despite job dependencies or holds.
1799
1800 CALCULATE_RUNNING
1801 If set, priorities will be recalculated not
1802 only for pending jobs, but also running and
1803 suspended jobs.
1804
1805 DEPTH_OBLIVIOUS If set, priority will be calculated based simi‐
1806 lar to the normal multifactor calculation, but
1807                              depth of the associations in the tree does not
1808                              adversely affect their priority. This option
1809 automatically enables NO_FAIR_TREE.
1810
1811 NO_FAIR_TREE Disables the "fair tree" algorithm, and reverts
1812 to "classic" fair share priority scheduling.
1813
1814 INCR_ONLY If set, priority values will only increase in
1815 value. Job priority will never decrease in
1816 value.
1817
1818 MAX_TRES If set, the weighted TRES value (e.g. TRES‐
1819 BillingWeights) is calculated as the MAX of
1820 individual TRES' on a node (e.g. cpus, mem,
1821 gres) plus the sum of all global TRES' (e.g.
1822 licenses).
1823
1824 NO_NORMAL_ALL If set, all NO_NORMAL_* flags are set.
1825
1826 NO_NORMAL_ASSOC If set, the association factor is not normal‐
1827 ized against the highest association priority.
1828
1829 NO_NORMAL_PART If set, the partition factor is not normalized
1830 against the highest partition PriorityTier.
1831
1832 NO_NORMAL_QOS If set, the QOS factor is not normalized
1833 against the highest qos priority.
1834
1835              NO_NORMAL_TRES  If set, the TRES factor is not normalized
1836 against the job's partition TRES counts.
1837
1838 SMALL_RELATIVE_TO_TIME
1839 If set, the job's size component will be based
1840 upon not the job size alone, but the job's size
1841                              divided by its time limit.
1842
1843
1844 PriorityMaxAge
1845 Specifies the job age which will be given the maximum age factor
1846 in computing priority. For example, a value of 30 minutes would
1847              result in all jobs over 30 minutes old getting the same
1848 age-based priority. Applicable only if PriorityType=prior‐
1849 ity/multifactor. The unit is a time string (i.e. min,
1850 hr:min:00, days-hr:min:00, or days-hr). The default value is
1851 7-0 (7 days).
1852
1853
1854 PriorityParameters
1855 Arbitrary string used by the PriorityType plugin.
1856
1857
1858 PrioritySiteFactorParameters
1859 Arbitrary string used by the PrioritySiteFactorPlugin plugin.
1860
1861
1862 PrioritySiteFactorPlugin
1863              This specifies an optional plugin to be used alongside "prior‐
1864 ity/multifactor", which is meant to initially set and continu‐
1865 ously update the SiteFactor priority factor. The default value
1866 is "site_factor/none".
1867
1868
1869 PriorityType
1870 This specifies the plugin to be used in establishing a job's
1871 scheduling priority. Supported values are "priority/basic" (jobs
1872 are prioritized by order of arrival), "priority/multifactor"
1873 (jobs are prioritized based upon size, age, fair-share of allo‐
1874 cation, etc). Also see PriorityFlags for configuration options.
1875 The default value is "priority/basic".
1876
1877 When not FIFO scheduling, jobs are prioritized in the following
1878 order:
1879
1880 1. Jobs that can preempt
1881
1882 2. Jobs with an advanced reservation
1883
1884 3. Partition Priority Tier
1885
1886 4. Job Priority
1887
1888 5. Job Id
1889
1890
1891 PriorityUsageResetPeriod
1892 At this interval the usage of associations will be reset to 0.
1893 This is used if you want to enforce hard limits of time usage
1894 per association. If PriorityDecayHalfLife is set to be 0 no
1895 decay will happen and this is the only way to reset the usage
1896              accumulated by running jobs. By default this is turned off,
1897              and it is advised to use the PriorityDecayHalfLife option
1898              instead, to avoid a situation where nothing can run on your
1899              cluster. However, if your scheme is set up to only allow
1900              certain amounts of time on your system, this is the way to do
1901              it. Applicable only if PriorityType=priority/multifactor.
1902
1903 NONE Never clear historic usage. The default value.
1904
1905 NOW Clear the historic usage now. Executed at startup
1906 and reconfiguration time.
1907
1908 DAILY Cleared every day at midnight.
1909
1910 WEEKLY Cleared every week on Sunday at time 00:00.
1911
1912 MONTHLY Cleared on the first day of each month at time
1913 00:00.
1914
1915 QUARTERLY Cleared on the first day of each quarter at time
1916 00:00.
1917
1918 YEARLY Cleared on the first day of each year at time 00:00.
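
              For example, to enforce hard per-association time limits with
              no usage decay, resetting accumulated usage monthly (an
              illustrative combination, not a recommendation):

              ```conf
              PriorityType=priority/multifactor
              # With a half-life of 0, usage never decays, so a reset
              # period must be configured.
              PriorityDecayHalfLife=0
              PriorityUsageResetPeriod=MONTHLY
              ```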
1919
1920
1921 PriorityWeightAge
1922 An integer value that sets the degree to which the queue wait
1923 time component contributes to the job's priority. Applicable
1924 only if PriorityType=priority/multifactor. The default value is
1925 0.
1926
1927
1928 PriorityWeightAssoc
1929 An integer value that sets the degree to which the association
1930 component contributes to the job's priority. Applicable only if
1931 PriorityType=priority/multifactor. The default value is 0.
1932
1933
1934 PriorityWeightFairshare
1935 An integer value that sets the degree to which the fair-share
1936 component contributes to the job's priority. Applicable only if
1937 PriorityType=priority/multifactor. The default value is 0.
1938
1939
1940 PriorityWeightJobSize
1941 An integer value that sets the degree to which the job size com‐
1942 ponent contributes to the job's priority. Applicable only if
1943 PriorityType=priority/multifactor. The default value is 0.
1944
1945
1946 PriorityWeightPartition
1947 Partition factor used by priority/multifactor plugin in calcu‐
1948 lating job priority. Applicable only if PriorityType=prior‐
1949 ity/multifactor. The default value is 0.
1950
1951
1952 PriorityWeightQOS
1953 An integer value that sets the degree to which the Quality Of
1954 Service component contributes to the job's priority. Applicable
1955 only if PriorityType=priority/multifactor. The default value is
1956 0.
1957
1958
1959 PriorityWeightTRES
1960 A comma separated list of TRES Types and weights that sets the
1961 degree that each TRES Type contributes to the job's priority.
1962
1963 e.g.
1964 PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
1965
1966 Applicable only if PriorityType=priority/multifactor and if
1967 AccountingStorageTRES is configured with each TRES Type. Nega‐
1968 tive values are allowed. The default values are 0.
1969
1970
1971 PrivateData
1972 This controls what type of information is hidden from regular
1973 users. By default, all information is visible to all users.
1974 User SlurmUser and root can always view all information. Multi‐
1975 ple values may be specified with a comma separator. Acceptable
1976 values include:
1977
1978 accounts
1979 (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
1980 ing any account definitions unless they are coordinators
1981 of them.
1982
1983 cloud Powered down nodes in the cloud are visible.
1984
1985              events  Prevents users from viewing event information unless they
1986 have operator status or above.
1987
1988 jobs Prevents users from viewing jobs or job steps belonging
1989 to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
1990 users from viewing job records belonging to other users
1991 unless they are coordinators of the association running
1992 the job when using sacct.
1993
1994 nodes Prevents users from viewing node state information.
1995
1996 partitions
1997 Prevents users from viewing partition state information.
1998
1999 reservations
2000 Prevents regular users from viewing reservations which
2001 they can not use.
2002
2003 usage Prevents users from viewing usage of any other user, this
2004 applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Pre‐
2005 vents users from viewing usage of any other user, this
2006 applies to sreport.
2007
2008 users (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from view‐
2009 ing information of any user other than themselves, this
2010 also makes it so users can only see associations they
2011 deal with. Coordinators can see associations of all
2012 users they are coordinator of, but can only see them‐
2013 selves when listing users.
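
              For example, to hide other users' jobs, usage, and user
              records from regular users (an illustrative selection):

              ```conf
              # Regular users see only their own jobs, usage, and user info.
              PrivateData=jobs,usage,users
              ```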
2014
2015
2016 ProctrackType
2017 Identifies the plugin to be used for process tracking on a job
2018 step basis. The slurmd daemon uses this mechanism to identify
2019 all processes which are children of processes it spawns for a
2020 user job step. The slurmd daemon must be restarted for a change
2021 in ProctrackType to take effect. NOTE: "proctrack/linuxproc"
2022 and "proctrack/pgid" can fail to identify all processes associ‐
2023 ated with a job since processes can become a child of the init
2024 process (when the parent process terminates) or change their
2025 process group. To reliably track all processes, "proc‐
2026 track/cgroup" is highly recommended. NOTE: The JobContainerType
2027 applies to a job allocation, while ProctrackType applies to job
2028 steps. Acceptable values at present include:
2029
2030 proctrack/cgroup which uses linux cgroups to constrain and
2031 track processes, and is the default. NOTE:
2032 see "man cgroup.conf" for configuration
2033 details
2034
2035 proctrack/cray_aries
2036 which uses Cray proprietary process tracking
2037
2038 proctrack/linuxproc which uses linux process tree using parent
2039 process IDs.
2040
2041 proctrack/pgid which uses process group IDs
2042
2043
2044 Prolog Fully qualified pathname of a program for the slurmd to execute
2045 whenever it is asked to run a job step from a new job allocation
2046 (e.g. "/usr/local/slurm/prolog"). A glob pattern (See glob (7))
2047 may also be used to specify more than one program to run (e.g.
2048 "/etc/slurm/prolog.d/*"). The slurmd executes the prolog before
2049 starting the first job step. The prolog script or scripts may
2050 be used to purge files, enable user login, etc. By default
2051 there is no prolog. Any configured script is expected to com‐
2052 plete execution quickly (in less time than MessageTimeout). If
2053 the prolog fails (returns a non-zero exit code), this will
2054 result in the node being set to a DRAIN state and the job being
2055 requeued in a held state, unless nohold_on_prolog_fail is con‐
2056 figured in SchedulerParameters. See Prolog and Epilog Scripts
2057 for more information.
2058
2059
2060 PrologEpilogTimeout
2061              The interval in seconds Slurm waits for Prolog and Epilog
2062 before terminating them. The default behavior is to wait indefi‐
2063 nitely. This interval applies to the Prolog and Epilog run by
2064 slurmd daemon before and after the job, the PrologSlurmctld and
2065 EpilogSlurmctld run by slurmctld daemon, and the SPANK plugins
2066 run by the slurmstepd daemon.
2067
2068
2069 PrologFlags
2070 Flags to control the Prolog behavior. By default no flags are
2071 set. Multiple flags may be specified in a comma-separated list.
2072 Currently supported options are:
2073
2074 Alloc If set, the Prolog script will be executed at job allo‐
2075 cation. By default, Prolog is executed just before the
2076 task is launched. Therefore, when salloc is started, no
2077 Prolog is executed. Alloc is useful for preparing things
2078 before a user starts to use any allocated resources. In
2079 particular, this flag is needed on a Cray system when
2080 cluster compatibility mode is enabled.
2081
2082 NOTE: Use of the Alloc flag will increase the time
2083 required to start jobs.
2084
2085 Contain At job allocation time, use the ProcTrack plugin to cre‐
2086 ate a job container on all allocated compute nodes.
2087 This container may be used for user processes not
2088 launched under Slurm control, for example
2089 pam_slurm_adopt may place processes launched through a
2090 direct user login into this container. If using
2091 pam_slurm_adopt, then ProcTrackType must be set to
2092 either proctrack/cgroup or proctrack/cray_aries. Set‐
2093              ting the Contain flag implicitly sets the Alloc flag.
2094
2095 NoHold If set, the Alloc flag should also be set. This will
2096 allow for salloc to not block until the prolog is fin‐
2097 ished on each node. The blocking will happen when steps
2098 reach the slurmd and before any execution has happened
2099 in the step. This is a much faster way to work and if
2100 using srun to launch your tasks you should use this
2101 flag. This flag cannot be combined with the Contain or
2102 X11 flags.
2103
2104 Serial By default, the Prolog and Epilog scripts run concur‐
2105 rently on each node. This flag forces those scripts to
2106 run serially within each node, but with a significant
2107 penalty to job throughput on each node.
2108
2109 X11 Enable Slurm's built-in X11 forwarding capabilities.
2110 This is incompatible with ProctrackType=proctrack/linux‐
2111 proc. Setting the X11 flag implicitly enables both Con‐
2112 tain and Alloc flags as well.
2113
2114
2115 PrologSlurmctld
2116 Fully qualified pathname of a program for the slurmctld daemon
2117 to execute before granting a new job allocation (e.g.
2118 "/usr/local/slurm/prolog_controller"). The program executes as
2119 SlurmUser on the same node where the slurmctld daemon executes,
2120 giving it permission to drain nodes and requeue the job if a
2121 failure occurs or cancel the job if appropriate. The program
2122 can be used to reboot nodes or perform other work to prepare
2123 resources for use. Exactly what the program does and how it
2124 accomplishes this is completely at the discretion of the system
2125              administrator. Information about the job being initiated, its
2126 allocated nodes, etc. are passed to the program using environ‐
2127 ment variables. While this program is running, the nodes asso‐
2128              ciated with the job will have a POWER_UP/CONFIGURING flag set
2129 in their state, which can be readily viewed. The slurmctld dae‐
2130 mon will wait indefinitely for this program to complete. Once
2131 the program completes with an exit code of zero, the nodes will
2132              be considered ready for use and the job will be started. If
2133 some node can not be made available for use, the program should
2134 drain the node (typically using the scontrol command) and termi‐
2135 nate with a non-zero exit code. A non-zero exit code will
2136 result in the job being requeued (where possible) or killed.
2137 Note that only batch jobs can be requeued. See Prolog and Epi‐
2138 log Scripts for more information.
2139
2140
2141 PropagatePrioProcess
2142 Controls the scheduling priority (nice value) of user spawned
2143 tasks.
2144
2145 0 The tasks will inherit the scheduling priority from the
2146 slurm daemon. This is the default value.
2147
2148 1 The tasks will inherit the scheduling priority of the com‐
2149 mand used to submit them (e.g. srun or sbatch). Unless the
2150 job is submitted by user root, the tasks will have a sched‐
2151 uling priority no higher than the slurm daemon spawning
2152 them.
2153
2154 2 The tasks will inherit the scheduling priority of the com‐
2155 mand used to submit them (e.g. srun or sbatch) with the
2156 restriction that their nice value will always be one higher
2157              than the slurm daemon (i.e. the tasks' scheduling priority
2158 will be lower than the slurm daemon).
2159
2160
2161 PropagateResourceLimits
2162 A list of comma separated resource limit names. The slurmd dae‐
2163 mon uses these names to obtain the associated (soft) limit val‐
2164 ues from the user's process environment on the submit node.
2165 These limits are then propagated and applied to the jobs that
2166 will run on the compute nodes. This parameter can be useful
2167 when system limits vary among nodes. Any resource limits that
2168 do not appear in the list are not propagated. However, the user
2169 can override this by specifying which resource limits to propa‐
2170 gate with the sbatch or srun "--propagate" option. If neither
2171              PropagateResourceLimits nor PropagateResourceLimitsExcept is
2172 configured and the "--propagate" option is not specified, then
2173 the default action is to propagate all limits. Only one of the
2174 parameters, either PropagateResourceLimits or PropagateResource‐
2175 LimitsExcept, may be specified. The user limits can not exceed
2176 hard limits under which the slurmd daemon operates. If the user
2177 limits are not propagated, the limits from the slurmd daemon
2178 will be propagated to the user's job. The limits used for the
2179              Slurm daemons can be set in the /etc/sysconfig/slurm file. For
2180 more information, see: https://slurm.schedmd.com/faq.html#mem‐
2181 lock The following limit names are supported by Slurm (although
2182 some options may not be supported on some systems):
2183
2184 ALL All limits listed below (default)
2185
2186 NONE No limits listed below
2187
2188 AS The maximum address space for a process
2189
2190 CORE The maximum size of core file
2191
2192 CPU The maximum amount of CPU time
2193
2194 DATA The maximum size of a process's data segment
2195
2196 FSIZE The maximum size of files created. Note that if the
2197 user sets FSIZE to less than the current size of the
2198 slurmd.log, job launches will fail with a 'File size
2199 limit exceeded' error.
2200
2201 MEMLOCK The maximum size that may be locked into memory
2202
2203 NOFILE The maximum number of open files
2204
2205 NPROC The maximum number of processes available
2206
2207 RSS The maximum resident set size
2208
2209 STACK The maximum stack size
2210
2211
2212 PropagateResourceLimitsExcept
2213 A list of comma separated resource limit names. By default, all
2214 resource limits will be propagated, (as described by the Propa‐
2215 gateResourceLimits parameter), except for the limits appearing
2216 in this list. The user can override this by specifying which
2217 resource limits to propagate with the sbatch or srun "--propa‐
2218 gate" option. See PropagateResourceLimits above for a list of
2219 valid limit names.
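
              For example, to propagate all of the user's soft limits except
              MEMLOCK (illustrative):

              ```conf
              # All limits are propagated to compute nodes except MEMLOCK.
              PropagateResourceLimitsExcept=MEMLOCK
              ```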
2220
2221
2222 RebootProgram
2223 Program to be executed on each compute node to reboot it.
2224 Invoked on each node once it becomes idle after the command
2225 "scontrol reboot_nodes" is executed by an authorized user or a
2226 job is submitted with the "--reboot" option. After rebooting,
2227 the node is returned to normal use. See ResumeTimeout to con‐
2228 figure the time you expect a reboot to finish in. A node will
2229 be marked DOWN if it doesn't reboot within ResumeTimeout.
2230
2231
2232 ReconfigFlags
2233 Flags to control various actions that may be taken when an
2234 "scontrol reconfig" command is issued. Currently the options
2235 are:
2236
2237 KeepPartInfo If set, an "scontrol reconfig" command will
2238 maintain the in-memory value of partition
2239 "state" and other parameters that may have been
2240 dynamically updated by "scontrol update". Par‐
2241 tition information in the slurm.conf file will
2242 be merged with in-memory data. This flag
2243 supersedes the KeepPartState flag.
2244
2245 KeepPartState If set, an "scontrol reconfig" command will
2246 preserve only the current "state" value of
2247 in-memory partitions and will reset all other
2248 parameters of the partitions that may have been
2249 dynamically updated by "scontrol update" to the
2250 values from the slurm.conf file. Partition
2251 information in the slurm.conf file will be
2252 merged with in-memory data.
2253 The default for the above flags is not set, and the "scontrol
2254 reconfig" will rebuild the partition information using only the
2255 definitions in the slurm.conf file.
2256
2257
2258 RequeueExit
2259 Enables automatic requeue for batch jobs which exit with the
2260              specified values. Separate multiple exit codes with a comma
2261              and/or specify numeric ranges using a "-" separator (e.g.
2262              "RequeueExit=1-9,18"). Jobs will be put back into pending state and
2263 later scheduled again. Restarted jobs will have the environment
2264 variable SLURM_RESTART_COUNT set to the number of times the job
2265 has been restarted.

       RequeueExitHold
              Enables automatic requeue for batch jobs which exit with the
              specified values, with these jobs being held until released
              manually by the user. Separate multiple exit codes with
              commas and/or specify numeric ranges using a "-" separator
              (e.g. "RequeueExitHold=10-12,16"). These jobs are put in
              the JOB_SPECIAL_EXIT exit state. Restarted jobs will have
              the environment variable SLURM_RESTART_COUNT set to the
              number of times the job has been restarted.

       ResumeFailProgram
              The program that will be executed when nodes fail to resume
              by ResumeTimeout. The argument to the program will be the
              names of the failed nodes (using Slurm's hostlist expression
              format).

2284
2285 ResumeProgram
2286 Slurm supports a mechanism to reduce power consumption on nodes
2287 that remain idle for an extended period of time. This is typi‐
2288 cally accomplished by reducing voltage and frequency or powering
2289 the node down. ResumeProgram is the program that will be exe‐
2290 cuted when a node in power save mode is assigned work to per‐
2291 form. For reasons of reliability, ResumeProgram may execute
2292 more than once for a node when the slurmctld daemon crashes and
2293 is restarted. If ResumeProgram is unable to restore a node to
2294 service with a responding slurmd and an updated BootTime, it
2295 should requeue any job associated with the node and set the node
2296 state to DOWN. If the node isn't actually rebooted (i.e. when
2297 multiple-slurmd is configured) starting slurmd with "-b" option
2298 might be useful. The program executes as SlurmUser. The argu‐
2299 ment to the program will be the names of nodes to be removed
2300 from power savings mode (using Slurm's hostlist expression for‐
2301 mat). By default no program is run. Related configuration
2302 options include ResumeTimeout, ResumeRate, SuspendRate, Suspend‐
2303 Time, SuspendTimeout, SuspendProgram, SuspendExcNodes, and Sus‐
2304 pendExcParts. More information is available at the Slurm web
2305 site ( https://slurm.schedmd.com/power_save.html ).
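
              A minimal power-save sketch; the script paths below are
              hypothetical, and the actual power-on mechanism (IPMI, a
              cloud API, etc.) is site-specific:

              ```
              # slurm.conf (illustrative values)
              SuspendProgram=/usr/local/sbin/node_suspend.sh  # hypothetical path
              ResumeProgram=/usr/local/sbin/node_resume.sh    # hypothetical path
              SuspendTime=1800       # power down nodes idle for 30 minutes
              ResumeTimeout=300      # mark a node DOWN if not up in 5 minutes
              SuspendExcNodes=login[1-2]  # never power down these nodes
              ```

              The resume script receives a hostlist expression such as
              "node[01-04]"; it can expand the list with "scontrol show
              hostnames" and then power each node on.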

       ResumeRate
              The rate at which nodes in power save mode are returned to
              normal operation by ResumeProgram. The value is the number
              of nodes per minute and it can be used to prevent power
              surges if a large number of nodes in power save mode are
              assigned work at the same time (e.g. a large job starts). A
              value of zero results in no limits being imposed. The
              default value is 300 nodes per minute. Related
              configuration options include ResumeTimeout, ResumeProgram,
              SuspendRate, SuspendTime, SuspendTimeout, SuspendProgram,
              SuspendExcNodes, and SuspendExcParts.

       ResumeTimeout
              Maximum time permitted (in seconds) between when a node
              resume request is issued and when the node is actually
              available for use. Nodes which fail to respond in this time
              frame will be marked DOWN and the jobs scheduled on the node
              requeued. Nodes which reboot after this time frame will be
              marked DOWN with a reason of "Node unexpectedly rebooted."
              The default value is 60 seconds. Related configuration
              options include ResumeProgram, ResumeRate, SuspendRate,
              SuspendTime, SuspendTimeout, SuspendProgram, SuspendExcNodes
              and SuspendExcParts. More information is available at the
              Slurm web site (https://slurm.schedmd.com/power_save.html).

       ResvEpilog
              Fully qualified pathname of a program for the slurmctld to
              execute when a reservation ends. The program can be used to
              cancel jobs, modify partition configuration, etc. The name
              of the reservation will be passed as an argument to the
              program. By default there is no epilog.

       ResvOverRun
              Describes how long a job already running in a reservation
              should be permitted to execute after the end time of the
              reservation has been reached. The time period is specified
              in minutes and the default value is 0 (kill the job
              immediately). The value may not exceed 65533 minutes,
              although a value of "UNLIMITED" is supported to permit a job
              to run indefinitely after its reservation is terminated.

       ResvProlog
              Fully qualified pathname of a program for the slurmctld to
              execute when a reservation begins. The program can be used
              to cancel jobs, modify partition configuration, etc. The
              name of the reservation will be passed as an argument to the
              program. By default there is no prolog.
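
              For example, a pair of hypothetical reservation scripts
              could be configured as:

              ```
              ResvProlog=/usr/local/sbin/resv_start.sh  # hypothetical path
              ResvEpilog=/usr/local/sbin/resv_end.sh    # hypothetical path
              ```

              slurmctld invokes each script with the reservation name as
              its only argument, e.g. "resv_end.sh maint_window".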

       ReturnToService
              Controls when a DOWN node will be returned to service. The
              default value is 0. Supported values include:

              0   A node will remain in the DOWN state until a system
                  administrator explicitly changes its state (even if the
                  slurmd daemon registers and resumes communications).

              1   A DOWN node will become available for use upon
                  registration with a valid configuration only if it was
                  set DOWN due to being non-responsive. If the node was
                  set DOWN for any other reason (low memory, unexpected
                  reboot, etc.), its state will not automatically be
                  changed. A node registers with a valid configuration
                  if its memory, GRES, CPU count, etc. are equal to or
                  greater than the values configured in slurm.conf.

              2   A DOWN node will become available for use upon
                  registration with a valid configuration. The node
                  could have been set DOWN for any reason. A node
                  registers with a valid configuration if its memory,
                  GRES, CPU count, etc. are equal to or greater than the
                  values configured in slurm.conf. (Disabled on Cray
                  ALPS systems.)
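
              For example, a site that trusts node registrations,
              including after unexpected reboots, could set:

              ```
              ReturnToService=2
              ```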

       RoutePlugin
              Identifies the plugin to be used for defining which nodes
              will be used for message forwarding and message aggregation.

              route/default
                     default, use TreeWidth.

              route/topology
                     use the switch hierarchy defined in a topology.conf
                     file. TopologyPlugin=topology/tree is required.

       SallocDefaultCommand
              Normally, salloc(1) will run the user's default shell when a
              command to execute is not specified on the salloc command
              line. If SallocDefaultCommand is specified, salloc will
              instead run the configured command. The command is passed
              to '/bin/sh -c', so shell metacharacters are allowed, and
              commands with multiple arguments should be quoted. For
              instance:

                  SallocDefaultCommand = "$SHELL"

              would run the shell in the user's $SHELL environment
              variable, and

                  SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"

              would spawn the user's default shell on the allocated
              resources, but not consume any of the CPU or memory
              resources, configure it as a pseudo-terminal, and preserve
              all of the job's environment variables (i.e. not overwrite
              them with the job step's allocation information).

              For systems with generic resources (GRES) defined, the
              SallocDefaultCommand value should explicitly specify a zero
              count for the configured GRES. Failure to do so will result
              in the launched shell consuming those GRES and preventing
              subsequent srun commands from using them. For example, on
              Cray systems add "--gres=craynetwork:0" as shown below:

                  SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"

              For systems with TaskPlugin set, adding an option of
              "--cpu-bind=no" is recommended if the default shell should
              have access to all of the CPUs allocated to the job on that
              node, otherwise the shell may be limited to a single CPU or
              core.

       SbcastParameters
              Controls sbcast command behavior. Multiple options can be
              specified in a comma separated list. Supported values
              include:

              DestDir=      Destination directory for the file being
                            broadcast to allocated compute nodes.
                            Default value is the current working
                            directory.

              Compression=  Specify the default file compression library
                            to be used. Supported values are "lz4",
                            "none" and "zlib". The default value with
                            the sbcast --compress option is "lz4" and
                            "none" otherwise. Some compression libraries
                            may be unavailable on some systems.
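
              For example (the destination path is illustrative):

              ```
              SbcastParameters=DestDir=/tmp,Compression=lz4
              ```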

       SchedulerParameters
              The interpretation of this parameter varies by
              SchedulerType. Multiple options may be comma separated.
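
              The options described below are combined into a single
              comma-separated value; for example (all values
              illustrative):

              ```
              SchedulerParameters=bf_continue,bf_interval=60,bf_window=2880,default_queue_depth=200
              ```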

              allow_zero_lic
                     If set, then job submissions requesting more than the
                     configured licenses won't be rejected.

              assoc_limit_stop
                     If set and a job cannot start due to association
                     limits, then do not attempt to initiate any lower
                     priority jobs in that partition. Setting this can
                     decrease system throughput and utilization, but
                     avoid potentially starving larger jobs by preventing
                     them from launching indefinitely.

              batch_sched_delay=#
                     How long, in seconds, the scheduling of batch jobs
                     can be delayed. This can be useful in a
                     high-throughput environment in which batch jobs are
                     submitted at a very high rate (i.e. using the sbatch
                     command) and one wishes to reduce the overhead of
                     attempting to schedule each job at submit time. The
                     default value is 3 seconds.

              bb_array_stage_cnt=#
                     Number of tasks from a job array that should be
                     available for burst buffer resource allocation.
                     Higher values will increase the system overhead as
                     each task from the job array will be moved to its
                     own job record in memory, so relatively small values
                     are generally recommended. The default value is 10.

              bf_busy_nodes
                     When selecting resources for pending jobs to reserve
                     for future execution (i.e. the job can not be
                     started immediately), then preferentially select
                     nodes that are in use. This will tend to leave
                     currently idle resources available for backfilling
                     longer running jobs, but may result in allocations
                     having less than optimal network topology. This
                     option is currently only supported by the
                     select/cons_res and select/cons_tres plugins (or
                     select/cray_aries with SelectTypeParameters set to
                     "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers
                     the select/cray_aries plugin over the
                     select/cons_res or select/cons_tres plugin
                     respectively).

              bf_continue
                     The backfill scheduler periodically releases locks
                     in order to permit other operations to proceed
                     rather than blocking all activity for what could be
                     an extended period of time. Setting this option
                     will cause the backfill scheduler to continue
                     processing pending jobs from its original job list
                     after releasing locks even if job or node state
                     changes. This can result in lower priority jobs
                     being backfill scheduled instead of newly arrived
                     higher priority jobs, but will permit more queued
                     jobs to be considered for backfill scheduling.

              bf_hetjob_immediate
                     Instruct the backfill scheduler to attempt to start
                     a heterogeneous job as soon as all of its components
                     are determined able to do so. Otherwise, the
                     backfill scheduler will delay heterogeneous job
                     initiation attempts until after the rest of the
                     queue has been processed. This delay may result in
                     lower priority jobs being allocated resources, which
                     could delay the initiation of the heterogeneous job
                     due to account and/or QOS limits being reached.
                     This option is disabled by default. If enabled and
                     bf_hetjob_prio=min is not set, then it would be
                     automatically set.

              bf_hetjob_prio=[min|avg|max]
                     At the beginning of each backfill scheduling cycle,
                     a list of pending jobs to be scheduled is sorted
                     according to the precedence order configured in
                     PriorityType. This option instructs the scheduler
                     to alter the sorting algorithm to ensure that all
                     components belonging to the same heterogeneous job
                     will be attempted to be scheduled consecutively
                     (thus not fragmented in the resulting list). More
                     specifically, all components from the same
                     heterogeneous job will be treated as if they all
                     have the same priority (minimum, average or maximum
                     depending upon this option's parameter) when
                     compared with other jobs (or other heterogeneous job
                     components). The original order will be preserved
                     within the same heterogeneous job. Note that the
                     operation is calculated for the PriorityTier layer
                     and for the Priority resulting from the
                     priority/multifactor plugin calculations. When
                     enabled, if any heterogeneous job requested an
                     advanced reservation, then all of that job's
                     components will be treated as if they had requested
                     an advanced reservation (and get preferential
                     treatment in scheduling).

                     Note that this operation does not update the
                     Priority values of the heterogeneous job components,
                     only their order within the list, so the output of
                     the sprio command will not be affected.

                     Heterogeneous jobs have special scheduling
                     properties: they are only scheduled by the backfill
                     scheduling plugin, each of their components is
                     considered separately when reserving resources (and
                     might have different PriorityTier or different
                     Priority values), and no heterogeneous job component
                     is actually allocated resources until all of its
                     components can be initiated. This may imply
                     potential scheduling deadlock scenarios because
                     components from different heterogeneous jobs can
                     start reserving resources in an interleaved fashion
                     (not consecutively), but none of the jobs can
                     reserve resources for all components and start.
                     Enabling this option can help to mitigate this
                     problem. By default, this option is disabled.

              bf_ignore_newly_avail_nodes
                     If set, then only resources available at the
                     beginning of a backfill cycle will be considered for
                     use. Otherwise resources made available during that
                     backfill cycle (during a yield with bf_continue set)
                     may be used for lower priority jobs, delaying the
                     initiation of higher priority jobs. Disabled by
                     default.

              bf_interval=#
                     The number of seconds between backfill iterations.
                     Higher values result in less overhead and better
                     responsiveness. This option applies only to
                     SchedulerType=sched/backfill. Default: 30, Min: 1,
                     Max: 10800 (3h).

              bf_job_part_count_reserve=#
                     The backfill scheduling logic will reserve resources
                     for the specified count of highest priority jobs in
                     each partition. For example,
                     bf_job_part_count_reserve=10 will cause the backfill
                     scheduler to reserve resources for the ten highest
                     priority jobs in each partition. Any lower priority
                     job that can be started using currently available
                     resources without adversely impacting the expected
                     start time of these higher priority jobs will be
                     started by the backfill scheduler. The default
                     value is zero, which will reserve resources for any
                     pending job and delay initiation of lower priority
                     jobs. Also see bf_min_age_reserve and
                     bf_min_prio_reserve. Default: 0, Min: 0, Max:
                     100000.

              bf_max_job_array_resv=#
                     The maximum number of tasks from a job array for
                     which the backfill scheduler will reserve resources
                     in the future. Since job arrays can potentially
                     have millions of tasks, the overhead in reserving
                     resources for all tasks can be prohibitive. In
                     addition various limits may prevent all the jobs
                     from starting at the expected times. This has no
                     impact upon the number of tasks from a job array
                     that can be started immediately, only those tasks
                     expected to start at some future time. Default: 20,
                     Min: 0, Max: 1000. NOTE: Jobs submitted to multiple
                     partitions appear in the job queue once per
                     partition. If different copies of a single job
                     array record aren't consecutive in the job queue and
                     another job array record is in between, then
                     bf_max_job_array_resv tasks are considered per
                     partition that the job is submitted to.

              bf_max_job_assoc=#
                     The maximum number of jobs per user association to
                     attempt starting with the backfill scheduler. This
                     setting is similar to bf_max_job_user but is handy
                     if a user has multiple associations equating to
                     basically different users. One can set this limit
                     to prevent users from flooding the backfill queue
                     with jobs that cannot start and that prevent jobs
                     from other users from starting. This option applies
                     only to SchedulerType=sched/backfill. Also see the
                     bf_max_job_user, bf_max_job_part, bf_max_job_test
                     and bf_max_job_user_part=# options. Set
                     bf_max_job_test to a value much higher than
                     bf_max_job_assoc. Default: 0 (no limit), Min: 0,
                     Max: bf_max_job_test.

              bf_max_job_part=#
                     The maximum number of jobs per partition to attempt
                     starting with the backfill scheduler. This can be
                     especially helpful for systems with large numbers of
                     partitions and jobs. This option applies only to
                     SchedulerType=sched/backfill. Also see the
                     partition_job_depth and bf_max_job_test options.
                     Set bf_max_job_test to a value much higher than
                     bf_max_job_part. Default: 0 (no limit), Min: 0,
                     Max: bf_max_job_test.

              bf_max_job_start=#
                     The maximum number of jobs which can be initiated in
                     a single iteration of the backfill scheduler. This
                     option applies only to
                     SchedulerType=sched/backfill. Default: 0 (no
                     limit), Min: 0, Max: 10000.

              bf_max_job_test=#
                     The maximum number of jobs to attempt backfill
                     scheduling for (i.e. the queue depth). Higher
                     values result in more overhead and less
                     responsiveness. Until an attempt is made to
                     backfill schedule a job, its expected initiation
                     time value will not be set. In the case of large
                     clusters, configuring a relatively small value may
                     be desirable. This option applies only to
                     SchedulerType=sched/backfill. Default: 100, Min: 1,
                     Max: 1,000,000.

              bf_max_job_user=#
                     The maximum number of jobs per user to attempt
                     starting with the backfill scheduler for ALL
                     partitions. One can set this limit to prevent users
                     from flooding the backfill queue with jobs that
                     cannot start and that prevent jobs from other users
                     from starting. This is similar to the MAXIJOB limit
                     in Maui. This option applies only to
                     SchedulerType=sched/backfill. Also see the
                     bf_max_job_part, bf_max_job_test and
                     bf_max_job_user_part=# options. Set bf_max_job_test
                     to a value much higher than bf_max_job_user.
                     Default: 0 (no limit), Min: 0, Max: bf_max_job_test.

              bf_max_job_user_part=#
                     The maximum number of jobs per user per partition to
                     attempt starting with the backfill scheduler for any
                     single partition. This option applies only to
                     SchedulerType=sched/backfill. Also see the
                     bf_max_job_part, bf_max_job_test and
                     bf_max_job_user=# options. Default: 0 (no limit),
                     Min: 0, Max: bf_max_job_test.
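
                     An illustrative combination of these limits, keeping
                     bf_max_job_test well above the per-user and
                     per-partition caps as advised:

                     ```
                     SchedulerParameters=bf_max_job_test=5000,bf_max_job_user=50,bf_max_job_part=500
                     ```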

              bf_max_time=#
                     The maximum time in seconds the backfill scheduler
                     can spend (including time spent sleeping when locks
                     are released) before discontinuing, even if maximum
                     job counts have not been reached. This option
                     applies only to SchedulerType=sched/backfill. The
                     default value is the value of bf_interval (which
                     defaults to 30 seconds). Default: bf_interval value
                     (def. 30 sec), Min: 1, Max: 3600 (1h). NOTE: If
                     bf_interval is short and bf_max_time is large, this
                     may cause locks to be acquired too frequently and
                     starve out other serviced RPCs. If using this
                     parameter, it is advisable to set max_rpc_cnt high
                     enough that scheduling isn't always disabled, and
                     low enough that the interactive workload can get
                     through in a reasonable period of time. max_rpc_cnt
                     needs to be below 256 (the default RPC thread
                     limit). Running around the middle (150) may give
                     you good results. NOTE: When increasing the amount
                     of time spent in the backfill scheduling cycle,
                     Slurm can be prevented from responding to client
                     requests in a timely manner. To address this you
                     can use max_rpc_cnt to specify a number of queued
                     RPCs before the scheduler stops to respond to these
                     requests.

              bf_min_age_reserve=#
                     The backfill and main scheduling logic will not
                     reserve resources for pending jobs until they have
                     been pending and runnable for at least the specified
                     number of seconds. In addition, jobs waiting for
                     less than the specified number of seconds will not
                     prevent a newly submitted job from starting
                     immediately, even if the newly submitted job has a
                     lower priority. This can be valuable if jobs lack
                     time limits or all time limits have the same value.
                     The default value is zero, which will reserve
                     resources for any pending job and delay initiation
                     of lower priority jobs. Also see
                     bf_job_part_count_reserve and bf_min_prio_reserve.
                     Default: 0, Min: 0, Max: 2592000 (30 days).

              bf_min_prio_reserve=#
                     The backfill and main scheduling logic will not
                     reserve resources for pending jobs unless they have
                     a priority equal to or higher than the specified
                     value. In addition, jobs with a lower priority will
                     not prevent a newly submitted job from starting
                     immediately, even if the newly submitted job has a
                     lower priority. This can be valuable if one wishes
                     to maximize system utilization without regard for
                     job priority below a certain threshold. The default
                     value is zero, which will reserve resources for any
                     pending job and delay initiation of lower priority
                     jobs. Also see bf_job_part_count_reserve and
                     bf_min_age_reserve. Default: 0, Min: 0, Max: 2^63.

              bf_resolution=#
                     The number of seconds in the resolution of data
                     maintained about when jobs begin and end. Higher
                     values result in less overhead and better
                     responsiveness. This option applies only to
                     SchedulerType=sched/backfill. Default: 60, Min: 1,
                     Max: 3600 (1 hour).

              bf_window=#
                     The number of minutes into the future to look when
                     considering jobs to schedule. Higher values result
                     in more overhead and less responsiveness. A value
                     at least as long as the highest allowed time limit
                     is generally advisable to prevent job starvation.
                     In order to limit the amount of data managed by the
                     backfill scheduler, if the value of bf_window is
                     increased, then it is generally advisable to also
                     increase bf_resolution. This option applies only to
                     SchedulerType=sched/backfill. Default: 1440 (1
                     day), Min: 1, Max: 43200 (30 days).
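
                     For example, a cluster whose longest allowed time
                     limit is 7 days might scale both values together
                     (the values here are illustrative):

                     ```
                     # look 7 days ahead; coarsen the time map to
                     # 10-minute buckets to bound backfill bookkeeping
                     SchedulerParameters=bf_window=10080,bf_resolution=600
                     ```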

              bf_window_linear=#
                     For performance reasons, the backfill scheduler will
                     decrease precision in calculation of job expected
                     termination times. By default, the precision starts
                     at 30 seconds and that time interval doubles with
                     each evaluation of currently executing jobs when
                     trying to determine when a pending job can start.
                     This algorithm can support an environment with many
                     thousands of running jobs, but can result in the
                     expected start time of pending jobs being gradually
                     deferred due to lack of precision. A value for
                     bf_window_linear will cause the time interval to be
                     increased by a constant amount on each iteration.
                     The value is specified in units of seconds. For
                     example, a value of 60 will cause the backfill
                     scheduler on the first iteration to identify the job
                     ending soonest and determine if the pending job can
                     be started after that job plus all other jobs
                     expected to end within 30 seconds (default initial
                     value) of the first job. On the next iteration, the
                     pending job will be evaluated for starting after the
                     next job expected to end plus all jobs ending within
                     90 seconds of that time (30 second default, plus the
                     60 second option value). The third iteration will
                     have a 150 second window and the fourth 210 seconds.
                     Without this option, the time windows will double on
                     each iteration and thus be 30, 60, 120, 240 seconds,
                     etc. The use of bf_window_linear is not recommended
                     with more than a few hundred simultaneously
                     executing jobs.

              bf_yield_interval=#
                     The backfill scheduler will periodically relinquish
                     locks in order for other pending operations to take
                     place. This specifies the interval, in
                     microseconds, between lock releases. Smaller values
                     may be helpful for high throughput computing when
                     used in conjunction with the bf_continue option.
                     Also see the bf_yield_sleep option. Default:
                     2,000,000 (2 sec), Min: 1, Max: 10,000,000 (10 sec).

              bf_yield_sleep=#
                     The backfill scheduler will periodically relinquish
                     locks in order for other pending operations to take
                     place. This specifies the length of time, in
                     microseconds, for which the locks are relinquished.
                     Also see the bf_yield_interval option. Default:
                     500,000 (0.5 sec), Min: 1, Max: 10,000,000 (10 sec).

              build_queue_timeout=#
                     Defines the maximum time that can be devoted to
                     building a queue of jobs to be tested for
                     scheduling. If the system has a huge number of jobs
                     with dependencies, just building the job queue can
                     take so much time as to adversely impact overall
                     system performance and this parameter can be
                     adjusted as needed. The default value is 2,000,000
                     microseconds (2 seconds).

              default_queue_depth=#
                     The default number of jobs to attempt scheduling
                     (i.e. the queue depth) when a running job completes
                     or other routine actions occur; however, the
                     frequency with which the scheduler is run may be
                     limited by using the defer or sched_min_interval
                     parameters described below. The full queue will be
                     tested on a less frequent basis as defined by the
                     sched_interval option described below. The default
                     value is 100. See the partition_job_depth option to
                     limit depth by partition.

              defer  Setting this option will avoid attempting to
                     schedule each job individually at job submit time,
                     but defer it until a later time when scheduling
                     multiple jobs simultaneously may be possible. This
                     option may improve system responsiveness when large
                     numbers of jobs (many hundreds) are submitted at the
                     same time, but it will delay the initiation time of
                     individual jobs. Also see default_queue_depth
                     above.

              delay_boot=#
                     Do not reboot nodes in order to satisfy this job's
                     feature specification if the job has been eligible
                     to run for less than this time period. If the job
                     has waited for less than the specified period, it
                     will use only nodes which already have the specified
                     features. The argument is in units of minutes.
                     Individual jobs may override this default value with
                     the --delay-boot option.

              default_gbytes
                     The default units in job submission memory and
                     temporary disk size specifications will be gigabytes
                     rather than megabytes. Users can override the
                     default by using a suffix of "M" for megabytes.

              disable_job_shrink
                     Deny user requests to shrink the size of running
                     jobs. (However, running jobs may still shrink due
                     to node failure if the --no-kill option was set.)

              disable_hetero_steps
                     Disable job steps that span heterogeneous job
                     allocations. This is the default on Cray systems.

              enable_hetero_steps
                     Enable job steps that span heterogeneous job
                     allocations. This is the default except on Cray
                     systems.

              enable_user_top
                     Enable use of the "scontrol top" command by
                     non-privileged users.

              Ignore_NUMA
                     Some processors (e.g. AMD Opteron 6000 series)
                     contain multiple NUMA nodes per socket. This is a
                     configuration which does not map into the hardware
                     entities that Slurm optimizes resource allocation
                     for (PU/thread, core, socket, baseboard, node and
                     network switch). In order to optimize resource
                     allocations on such hardware, Slurm will consider
                     each NUMA node within the socket as a separate
                     socket by default. Use the Ignore_NUMA option to
                     report the correct socket count, but not optimize
                     resource allocations on the NUMA nodes.

              inventory_interval=#
                     On a Cray system using Slurm on top of ALPS this
                     limits the number of times a Basil Inventory call is
                     made. Normally this call happens on every
                     scheduling consideration to attempt to close a node
                     state change window with respect to what ALPS has.
                     This call is rather slow, so making it less
                     frequently improves performance dramatically, but in
                     the situation where a node changes state the window
                     is as large as this setting. In an HTC environment
                     this setting is a must and we advise around 10
                     seconds.

              kill_invalid_depend
                     If a job has an invalid dependency and can never
                     run, terminate it and set its state to
                     JOB_CANCELLED. By default the job stays pending
                     with reason DependencyNeverSatisfied.

              max_array_tasks
                     Specify the maximum number of tasks that can be
                     included in a job array. The default limit is
                     MaxArraySize, but this option can be used to set a
                     lower limit. For example, max_array_tasks=1000 and
                     MaxArraySize=100001 would permit a maximum task ID
                     of 100000, but limit the number of tasks in any
                     single job array to 1000.

              max_depend_depth=#
                     Maximum number of jobs to test for a circular job
                     dependency. Stop testing after this number of job
                     dependencies have been tested. The default value is
                     10 jobs.

              max_rpc_cnt=#
                     If the number of active threads in the slurmctld
                     daemon is equal to or larger than this value, defer
                     scheduling of jobs. The scheduler will check this
                     condition at certain points in code and yield locks
                     if necessary. This can improve Slurm's ability to
                     process requests at a cost of initiating new jobs
                     less frequently. NOTE: The maximum number of
                     threads (MAX_SERVER_THREADS) is internally set to
                     256 and defines the number of served RPCs at a given
                     time. Setting max_rpc_cnt to more than 256 will
                     only be useful to let backfill continue scheduling
                     work after locks have been yielded (i.e. every 2
                     seconds) if there are no more than
                     MAX(max_rpc_cnt/10, 20) RPCs in the queue. For
                     example, with max_rpc_cnt=1000 the scheduler will be
                     allowed to continue after yielding locks only when
                     there are 100 or fewer pending RPCs. Default: 0
                     (option disabled), Min: 0, Max: 1000. If a value is
                     set, then a value of 10 or higher is recommended.
                     It may require some tuning for each system, but
                     needs to be high enough that scheduling isn't always
                     disabled, and low enough that requests can get
                     through in a reasonable period of time.
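
                     For example, to defer scheduling once the RPC
                     backlog grows, using the mid-range value suggested
                     under bf_max_time above:

                     ```
                     SchedulerParameters=max_rpc_cnt=150
                     ```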

              max_sched_time=#
                     How long, in seconds, that the main scheduling loop
                     will execute for before exiting. If a value is
                     configured, be aware that all other Slurm operations
                     will be deferred during this time period. Make
                     certain the value is lower than MessageTimeout. If
                     a value is not explicitly configured, the default
                     value is half of MessageTimeout with a minimum
                     default value of 1 second and a maximum default
                     value of 2 seconds. For example if
                     MessageTimeout=10, the time limit will be 2 seconds
                     (i.e. MIN(10/2, 2) = 2).

              max_script_size=#
                     Specify the maximum size of a batch script, in
                     bytes. The default value is 4 megabytes. Larger
                     values may adversely impact system performance.

              max_switch_wait=#
                     Maximum number of seconds that a job can delay
                     execution waiting for the specified desired switch
                     count. The default value is 300 seconds.
2939
2940 no_backup_scheduling
2941 If used, the backup controller will not schedule jobs
2942 when it takes over. The backup controller will allow jobs
2943 to be submitted, modified and cancelled but won't sched‐
2944 ule new jobs. This is useful in Cray environments when
2945 the backup controller resides on an external Cray node.
2946 A restart is required to alter this option. This is
2947 explicitly set on a Cray/ALPS system.
2948
2949 no_env_cache
2950                     If used, a job started on a node that fails to load the
2951                     environment will fail instead of using the cached envi‐
2952                     ronment.  This also implicitly sets the requeue_set‐
2953                     up_env_fail option.
2954
2955 nohold_on_prolog_fail
2956 By default, if the Prolog exits with a non-zero value the
2957 job is requeued in a held state. By specifying this
2958 parameter the job will be requeued but not held so that
2959 the scheduler can dispatch it to another host.
2960
2961 pack_serial_at_end
2962 If used with the select/cons_res or select/cons_tres
2963 plugin, then put serial jobs at the end of the available
2964 nodes rather than using a best fit algorithm. This may
2965 reduce resource fragmentation for some workloads.
2966
2967 partition_job_depth=#
2968 The default number of jobs to attempt scheduling (i.e.
2969 the queue depth) from each partition/queue in Slurm's
2970 main scheduling logic. The functionality is similar to
2971 that provided by the bf_max_job_part option for the back‐
2972 fill scheduling logic. The default value is 0 (no
2973                     limit).  Jobs excluded from attempted scheduling based
2974 upon partition will not be counted against the
2975 default_queue_depth limit. Also see the bf_max_job_part
2976 option.
2977
2978 permit_job_expansion
2979 Allow running jobs to request additional nodes be merged
2980 in with the current job allocation.
2981
2982 preempt_reorder_count=#
2983                     Specify how many attempts should be made in reordering
2984                     preemptable jobs to minimize the count of jobs preempted.
2985 The default value is 1. High values may adversely impact
2986 performance. The logic to support this option is only
2987 available in the select/cons_res and select/cons_tres
2988 plugins.
2989
2990 preempt_strict_order
2991 If set, then execute extra logic in an attempt to preempt
2992 only the lowest priority jobs. It may be desirable to
2993 set this configuration parameter when there are multiple
2994 priorities of preemptable jobs. The logic to support
2995 this option is only available in the select/cons_res and
2996 select/cons_tres plugins.
2997
2998 preempt_youngest_first
2999 If set, then the preemption sorting algorithm will be
3000 changed to sort by the job start times to favor preempt‐
3001 ing younger jobs over older. (Requires preempt/parti‐
3002 tion_prio or preempt/qos plugins.)
3003
3004 reduce_completing_frag
3005 This option is used to control how scheduling of
3006 resources is performed when jobs are in completing state,
3007 which influences potential fragmentation. If the option
3008 is not set then no jobs will be started in any partition
3009 when any job is in completing state. If the option is
3010 set then no jobs will be started in any individual parti‐
3011 tion that has a job in completing state. In addition, no
3012 jobs will be started in any partition with nodes that
3013 overlap with any nodes in the partition of the completing
3014 job. This option is to be used in conjunction with Com‐
3015 pleteWait. NOTE: CompleteWait must be set for this to
3016 work.
3017
3018 requeue_setup_env_fail
3019                     By default, if a job's environment setup fails, the job keeps
3020 running with a limited environment. By specifying this
3021 parameter the job will be requeued in held state and the
3022 execution node drained.
3023
3024 salloc_wait_nodes
3025 If defined, the salloc command will wait until all allo‐
3026 cated nodes are ready for use (i.e. booted) before the
3027 command returns. By default, salloc will return as soon
3028 as the resource allocation has been made.
3029
3030 sbatch_wait_nodes
3031 If defined, the sbatch script will wait until all allo‐
3032 cated nodes are ready for use (i.e. booted) before the
3033 initiation. By default, the sbatch script will be initi‐
3034 ated as soon as the first node in the job allocation is
3035 ready. The sbatch command can use the --wait-all-nodes
3036 option to override this configuration parameter.
3037
3038 sched_interval=#
3039 How frequently, in seconds, the main scheduling loop will
3040 execute and test all pending jobs. The default value is
3041 60 seconds.
3042
3043 sched_max_job_start=#
3044 The maximum number of jobs that the main scheduling logic
3045 will start in any single execution. The default value is
3046 zero, which imposes no limit.
3047
3048 sched_min_interval=#
3049 How frequently, in microseconds, the main scheduling loop
3050 will execute and test any pending jobs. The scheduler
3051 runs in a limited fashion every time that any event hap‐
3052 pens which could enable a job to start (e.g. job submit,
3053 job terminate, etc.). If these events happen at a high
3054 frequency, the scheduler can run very frequently and con‐
3055 sume significant resources if not throttled by this
3056 option. This option specifies the minimum time between
3057 the end of one scheduling cycle and the beginning of the
3058 next scheduling cycle. A value of zero will disable
3059 throttling of the scheduling logic interval. The default
3060 value is 1,000,000 microseconds on Cray/ALPS systems and
3061 2 microseconds on other systems.
3062
3063 spec_cores_first
3064 Specialized cores will be selected from the first cores
3065 of the first sockets, cycling through the sockets on a
3066 round robin basis. By default, specialized cores will be
3067 selected from the last cores of the last sockets, cycling
3068 through the sockets on a round robin basis.
3069
3070 step_retry_count=#
3071 When a step completes and there are steps ending resource
3072 allocation, then retry step allocations for at least this
3073 number of pending steps. Also see step_retry_time. The
3074 default value is 8 steps.
3075
3076 step_retry_time=#
3077 When a step completes and there are steps ending resource
3078 allocation, then retry step allocations for all steps
3079 which have been pending for at least this number of sec‐
3080 onds. Also see step_retry_count. The default value is
3081 60 seconds.
3082
3083 whole_hetjob
3084 Requests to cancel, hold or release any component of a
3085 heterogeneous job will be applied to all components of
3086 the job.
3087
3088               NOTE: This option was previously named whole_pack, which
3089               is still supported for backward compatibility.
3090
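               The sub-options above are combined into a single comma-sepa‐
               rated SchedulerParameters value in slurm.conf.  A hypothetical
               fragment follows; the values shown are illustrative, not
               defaults, and should be tuned per site:

```
# Illustrative SchedulerParameters line combining several sub-options
SchedulerParameters=max_rpc_cnt=150,sched_interval=30,partition_job_depth=100,pack_serial_at_end
```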
3091
3092 SchedulerTimeSlice
3093 Number of seconds in each time slice when gang scheduling is
3094 enabled (PreemptMode=SUSPEND,GANG). The value must be between 5
3095 seconds and 65533 seconds. The default value is 30 seconds.
3096
3097
3098 SchedulerType
3099 Identifies the type of scheduler to be used. Note the slurmctld
3100 daemon must be restarted for a change in scheduler type to
3101 become effective (reconfiguring a running daemon has no effect
3102 for this parameter). The scontrol command can be used to manu‐
3103 ally change job priorities if desired. Acceptable values
3104 include:
3105
3106 sched/backfill
3107 For a backfill scheduling module to augment the default
3108 FIFO scheduling. Backfill scheduling will initiate
3109 lower-priority jobs if doing so does not delay the
3110 expected initiation time of any higher priority job.
3111 Effectiveness of backfill scheduling is dependent upon
3112 users specifying job time limits, otherwise all jobs will
3113 have the same time limit and backfilling is impossible.
3114                     See the documentation for the SchedulerParameters option
3115 above. This is the default configuration.
3116
3117 sched/builtin
3118 This is the FIFO scheduler which initiates jobs in prior‐
3119 ity order. If any job in the partition can not be sched‐
3120 uled, no lower priority job in that partition will be
3121 scheduled. An exception is made for jobs that can not
3122 run due to partition constraints (e.g. the time limit) or
3123 down/drained nodes. In that case, lower priority jobs
3124 can be initiated and not impact the higher priority job.
3125
3126 sched/hold
3127                     Hold all newly arriving jobs if the file
3128                     "/etc/slurm.hold" exists; otherwise use the built-in
3129                     FIFO scheduler.
3130
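               For example, a minimal configuration selecting the default
               backfill scheduler:

```
SchedulerType=sched/backfill
```

               As noted above, changing this value requires a restart of the
               slurmctld daemon.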
3131
3132 SelectType
3133 Identifies the type of resource selection algorithm to be used.
3134 Changing this value can only be done by restarting the slurmctld
3135 daemon and will result in the loss of all job information (run‐
3136 ning and pending) since the job state save format used by each
3137               plugin is different.  Acceptable values include:
3138
3139 select/cons_res
3140 The resources (cores and memory) within a node are indi‐
3141 vidually allocated as consumable resources. Note that
3142 whole nodes can be allocated to jobs for selected parti‐
3143 tions by using the OverSubscribe=Exclusive option. See
3144 the partition OverSubscribe parameter for more informa‐
3145 tion.
3146
3147 select/cray_aries
3148 for a Cray system. The default value is
3149 "select/cray_aries" for all Cray systems.
3150
3151 select/linear
3152 for allocation of entire nodes assuming a one-dimensional
3153 array of nodes in which sequentially ordered nodes are
3154 preferable. For a heterogeneous cluster (e.g. different
3155 CPU counts on the various nodes), resource allocations
3156 will favor nodes with high CPU counts as needed based
3157 upon the job's node and CPU specification if TopologyPlu‐
3158 gin=topology/none is configured. Use of other topology
3159 plugins with select/linear and heterogeneous nodes is not
3160 recommended and may result in valid job allocation
3161 requests being rejected. This is the default value.
3162
3163 select/cons_tres
3164 The resources (cores, memory, GPUs and all other track‐
3165 able resources) within a node are individually allocated
3166 as consumable resources. Note that whole nodes can be
3167 allocated to jobs for selected partitions by using the
3168 OverSubscribe=Exclusive option. See the partition Over‐
3169 Subscribe parameter for more information.
3170
3171
3172 SelectTypeParameters
3173 The permitted values of SelectTypeParameters depend upon the
3174 configured value of SelectType. The only supported options for
3175 SelectType=select/linear are CR_ONE_TASK_PER_CORE and CR_Memory,
3176 which treats memory as a consumable resource and prevents memory
3177 over subscription with job preemption or gang scheduling. By
3178 default SelectType=select/linear allocates whole nodes to jobs
3179 without considering their memory consumption. By default
3180 SelectType=select/cons_res, SelectType=select/cray_aries, and
3181 SelectType=select/cons_tres, use CR_CPU, which allocates CPU
3182 (threads) to jobs without considering their memory consumption.
3183
3184 The following options are supported for Select‐
3185 Type=select/cray_aries:
3186
3187 OTHER_CONS_RES
3188 Layer the select/cons_res plugin under the
3189 select/cray_aries plugin, the default is to layer
3190 on select/linear. This also allows all the
3191 options available for SelectType=select/cons_res.
3192
3193 OTHER_CONS_TRES
3194 Layer the select/cons_tres plugin under the
3195 select/cray_aries plugin, the default is to layer
3196 on select/linear. This also allows all the
3197 options available for SelectType=select/cons_tres.
3198
3199 The following options are supported by the Select‐
3200 Type=select/cons_res and SelectType=select/cons_tres plugins:
3201
3202 CR_CPU CPUs are consumable resources. Configure the num‐
3203 ber of CPUs on each node, which may be equal to
3204 the count of cores or hyper-threads on the node
3205 depending upon the desired minimum resource allo‐
3206 cation. The node's Boards, Sockets, CoresPer‐
3207 Socket and ThreadsPerCore may optionally be con‐
3208 figured and result in job allocations which have
3209                             improved locality; however, doing so will prevent
3210                             more than one job from being allocated on each
3211                             core.
3212
3213 CR_CPU_Memory
3214 CPUs and memory are consumable resources. Config‐
3215 ure the number of CPUs on each node, which may be
3216 equal to the count of cores or hyper-threads on
3217 the node depending upon the desired minimum
3218 resource allocation. The node's Boards, Sockets,
3219 CoresPerSocket and ThreadsPerCore may optionally
3220                             have improved locality; however, doing so will
3221                             prevent more than one job from being allocated
3222                             on each core.  Setting a value for DefMemPerCPU is
3223 on each core. Setting a value for DefMemPerCPU is
3224 strongly recommended.
3225
3226 CR_Core
3227 Cores are consumable resources. On nodes with
3228 hyper-threads, each thread is counted as a CPU to
3229 satisfy a job's resource requirement, but multiple
3230 jobs are not allocated threads on the same core.
3231 The count of CPUs allocated to a job may be
3232 rounded up to account for every CPU on an allo‐
3233 cated core.
3234
3235 CR_Core_Memory
3236 Cores and memory are consumable resources. On
3237 nodes with hyper-threads, each thread is counted
3238 as a CPU to satisfy a job's resource requirement,
3239 but multiple jobs are not allocated threads on the
3240 same core. The count of CPUs allocated to a job
3241 may be rounded up to account for every CPU on an
3242 allocated core. Setting a value for DefMemPerCPU
3243 is strongly recommended.
3244
3245 CR_ONE_TASK_PER_CORE
3246 Allocate one task per core by default. Without
3247 this option, by default one task will be allocated
3248 per thread on nodes with more than one ThreadsPer‐
3249 Core configured. NOTE: This option cannot be used
3250 with CR_CPU*.
3251
3252 CR_CORE_DEFAULT_DIST_BLOCK
3253 Allocate cores within a node using block distribu‐
3254 tion by default. This is a pseudo-best-fit algo‐
3255 rithm that minimizes the number of boards and min‐
3256 imizes the number of sockets (within minimum
3257 boards) used for the allocation. This default
3258                             behavior can be overridden by specifying a par‐
3259                             ticular "-m" parameter with srun/salloc/sbatch.
3260                             Without this option, cores will be allocated cyclically
3261 across the sockets.
3262
3263 CR_LLN Schedule resources to jobs on the least loaded
3264 nodes (based upon the number of idle CPUs). This
3265 is generally only recommended for an environment
3266 with serial jobs as idle resources will tend to be
3267 highly fragmented, resulting in parallel jobs
3268 being distributed across many nodes. Note that
3269 node Weight takes precedence over how many idle
3270 resources are on each node. Also see the parti‐
3271                             tion configuration parameter LLN to use the least
3272 loaded nodes in selected partitions.
3273
3274 CR_Pack_Nodes
3275 If a job allocation contains more resources than
3276 will be used for launching tasks (e.g. if whole
3277 nodes are allocated to a job), then rather than
3278                             distributing a job's tasks evenly across its
3279 allocated nodes, pack them as tightly as possible
3280 on these nodes. For example, consider a job allo‐
3281 cation containing two entire nodes with eight CPUs
3282 each. If the job starts ten tasks across those
3283 two nodes without this option, it will start five
3284 tasks on each of the two nodes. With this option,
3285 eight tasks will be started on the first node and
3286 two tasks on the second node.
3287
3288 CR_Socket
3289 Sockets are consumable resources. On nodes with
3290 multiple cores, each core or thread is counted as
3291 a CPU to satisfy a job's resource requirement, but
3292 multiple jobs are not allocated resources on the
3293 same socket.
3294
3295 CR_Socket_Memory
3296 Memory and sockets are consumable resources. On
3297 nodes with multiple cores, each core or thread is
3298 counted as a CPU to satisfy a job's resource
3299 requirement, but multiple jobs are not allocated
3300 resources on the same socket. Setting a value for
3301 DefMemPerCPU is strongly recommended.
3302
3303 CR_Memory
3304 Memory is a consumable resource. NOTE: This
3305 implies OverSubscribe=YES or OverSubscribe=FORCE
3306 for all partitions. Setting a value for DefMem‐
3307 PerCPU is strongly recommended.
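
               As a sketch, a cluster that allocates cores and memory indi‐
               vidually might combine the parameters above as follows (the
               DefMemPerCPU value is illustrative only):

```
# Hypothetical example: consumable cores + memory with a default memory per CPU
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=2048
```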
3308
3309
3310 SlurmUser
3311 The name of the user that the slurmctld daemon executes as. For
3312 security purposes, a user other than "root" is recommended.
3313 This user must exist on all nodes of the cluster for authentica‐
3314 tion of communications between Slurm components. The default
3315 value is "root".
3316
3317
3318 SlurmdParameters
3319 Parameters specific to the Slurmd. Multiple options may be
3320 comma separated.
3321
3322 config_overrides
3323 If set, consider the configuration of each node to be
3324 that specified in the slurm.conf configuration file and
3325                     any node with fewer resources than configured will not
3326                     be set to DRAIN.  This option is generally only useful for
3327 testing purposes. Equivalent to the now deprecated
3328 FastSchedule=2 option.
3329
3330 shutdown_on_reboot
3331 If set, the Slurmd will shut itself down when a reboot
3332 request is received.
3333
3334
3335 SlurmdUser
3336 The name of the user that the slurmd daemon executes as. This
3337 user must exist on all nodes of the cluster for authentication
3338 of communications between Slurm components. The default value
3339 is "root".
3340
3341
3342 SlurmctldAddr
3343 An optional address to be used for communications to the cur‐
3344 rently active slurmctld daemon, normally used with Virtual IP
3345 addressing of the currently active server. If this parameter is
3346 not specified then each primary and backup server will have its
3347 own unique address used for communications as specified in the
3348 SlurmctldHost parameter. If this parameter is specified then
3349 the SlurmctldHost parameter will still be used for communica‐
3350 tions to specific slurmctld primary or backup servers, for exam‐
3351 ple to cause all of them to read the current configuration files
3352 or shutdown. Also see the SlurmctldPrimaryOffProg and Slurm‐
3353 ctldPrimaryOnProg configuration parameters to configure programs
3354               that manage the virtual IP address.
3355
3356
3357 SlurmctldDebug
3358               The level of detail to provide in the slurmctld daemon's logs.
3359               The default value is info.  If the slurmctld daemon is initi‐
3360               ated with the -v or --verbose options, that debug level will be
3361               preserved or restored upon reconfiguration.
3362
3363
3364 quiet Log nothing
3365
3366 fatal Log only fatal errors
3367
3368 error Log only errors
3369
3370 info Log errors and general informational messages
3371
3372 verbose Log errors and verbose informational messages
3373
3374 debug Log errors and verbose informational messages and
3375 debugging messages
3376
3377 debug2 Log errors and verbose informational messages and more
3378 debugging messages
3379
3380 debug3 Log errors and verbose informational messages and even
3381 more debugging messages
3382
3383 debug4 Log errors and verbose informational messages and even
3384 more debugging messages
3385
3386 debug5 Log errors and verbose informational messages and even
3387 more debugging messages
3388
3389
3390 SlurmctldHost
3391 The short, or long, hostname of the machine where Slurm control
3392 daemon is executed (i.e. the name returned by the command "host‐
3393 name -s"). This hostname is optionally followed by the address,
3394 either the IP address or a name by which the address can be
3395               identified, enclosed in parentheses (e.g. SlurmctldHost=mas‐
3396 ter1(12.34.56.78)). This value must be specified at least once.
3397 If specified more than once, the first hostname named will be
3398 where the daemon runs. If the first specified host fails, the
3399 daemon will execute on the second host. If both the first and
3400               second specified hosts fail, the daemon will execute on the
3401 third host.
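
               A hypothetical primary/backup arrangement using the syntax
               described above (the hostnames and address are illustrative):

```
# First entry is the primary controller; later entries are backups
SlurmctldHost=master1(12.34.56.78)
SlurmctldHost=backup1
SlurmctldHost=backup2
```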
3402
3403
3404 SlurmctldLogFile
3405 Fully qualified pathname of a file into which the slurmctld dae‐
3406 mon's logs are written. The default value is none (performs
3407 logging via syslog).
3408 See the section LOGGING if a pathname is specified.
3409
3410
3411 SlurmctldParameters
3412 Multiple options may be comma-separated.
3413
3414
3415 allow_user_triggers
3416 Permit setting triggers from non-root/slurm_user users.
3417 SlurmUser must also be set to root to permit these trig‐
3418 gers to work. See the strigger man page for additional
3419 details.
3420
3421 cloud_dns
3422 By default, Slurm expects that the network address for a
3423 cloud node won't be known until the creation of the node
3424 and that Slurm will be notified of the node's address
3425 (e.g. scontrol update nodename=<name> nodeaddr=<addr>).
3426 Since Slurm communications rely on the node configuration
3427 found in the slurm.conf, Slurm will tell the client com‐
3428 mand, after waiting for all nodes to boot, each node's ip
3429 address. However, in environments where the nodes are in
3430 DNS, this step can be avoided by configuring this option.
3431
3432               idle_on_node_suspend
3433                     Mark nodes as idle, regardless of current state, when
3434                     suspending nodes with SuspendProgram so that nodes
3435                     will be eligible to be resumed at a later time.
3436
3437               preempt_send_user_signal
3438                     Send the user signal (e.g. --signal=<sig_num>) at pre‐
3439                     emption time even if the signal time hasn't been
3440                     reached.  In the case of a gracetime preemption, the
3441                     user signal will be sent if it has been specified and
3442                     not yet sent; otherwise a SIGTERM will be sent to the
3443                     tasks.
3444
3445               reboot_from_controller
3446                     Run the RebootProgram from the controller instead of
3447                     on the slurmds.  The RebootProgram will be passed a
3448                     comma-separated list of nodes to reboot.
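
               A hypothetical example combining two of the options above:

```
# Nodes are resolvable in DNS; suspended nodes are marked idle
SlurmctldParameters=cloud_dns,idle_on_node_suspend
```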
3448
3449
3450 SlurmctldPidFile
3451 Fully qualified pathname of a file into which the slurmctld
3452 daemon may write its process id. This may be used for automated
3453 signal processing. The default value is "/var/run/slurm‐
3454 ctld.pid".
3455
3456
3457 SlurmctldPlugstack
3458 A comma delimited list of Slurm controller plugins to be started
3459 when the daemon begins and terminated when it ends. Only the
3460 plugin's init and fini functions are called.
3461
3462
3463 SlurmctldPort
3464 The port number that the Slurm controller, slurmctld, listens to
3465 for work. The default value is SLURMCTLD_PORT as established at
3466 system build time. If none is explicitly specified, it will be
3467 set to 6817. SlurmctldPort may also be configured to support a
3468 range of port numbers in order to accept larger bursts of incom‐
3469 ing messages by specifying two numbers separated by a dash (e.g.
3470               SlurmctldPort=6817-6818).  NOTE: Either the slurmctld and
3471               slurmd daemons must not execute on the same nodes, or the
3472               values of SlurmctldPort and SlurmdPort must be different.
3473
3474 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3475 automatically try to interact with anything opened on ports
3476 8192-60000. Configure SlurmctldPort to use a port outside of
3477 the configured SrunPortRange and RSIP's port range.
3478
3479
3480 SlurmctldPrimaryOffProg
3481 This program is executed when a slurmctld daemon running as the
3482 primary server becomes a backup server. By default no program is
3483 executed. See also the related "SlurmctldPrimaryOnProg" parame‐
3484 ter.
3485
3486
3487 SlurmctldPrimaryOnProg
3488 This program is executed when a slurmctld daemon running as a
3489 backup server becomes the primary server. By default no program
is executed.  When using virtual IP addresses to manage Highly
3491 Available Slurm services, this program can be used to add the IP
3492 address to an interface (and optionally try to kill the unre‐
3493 sponsive slurmctld daemon and flush the ARP caches on nodes on
3494 the local ethernet fabric). See also the related "SlurmctldPri‐
3495 maryOffProg" parameter.
3496
3497 SlurmctldSyslogDebug
3498               The slurmctld daemon will log events to the syslog file at the
3499               specified level of detail.  If not set, the slurmctld daemon
3500               will log to syslog at level fatal.  There are two exceptions:
3501               if there is no SlurmctldLogFile and the daemon is running in
3502               the background, it will log to syslog at the level specified by
3503               SlurmctldDebug (at fatal if SlurmctldDebug is set to quiet);
3504               if it is run in the foreground, the level will be set to quiet.
3505
3506
3507 quiet Log nothing
3508
3509 fatal Log only fatal errors
3510
3511 error Log only errors
3512
3513 info Log errors and general informational messages
3514
3515 verbose Log errors and verbose informational messages
3516
3517 debug Log errors and verbose informational messages and
3518 debugging messages
3519
3520 debug2 Log errors and verbose informational messages and more
3521 debugging messages
3522
3523 debug3 Log errors and verbose informational messages and even
3524 more debugging messages
3525
3526 debug4 Log errors and verbose informational messages and even
3527 more debugging messages
3528
3529 debug5 Log errors and verbose informational messages and even
3530 more debugging messages
3531
3532
3533
3534 SlurmctldTimeout
3535 The interval, in seconds, that the backup controller waits for
3536 the primary controller to respond before assuming control. The
3537 default value is 120 seconds. May not exceed 65533.
3538
3539
3540 SlurmdDebug
3541               The level of detail to provide in the slurmd daemon's logs.
3542               The default value is info.
3543
3544 quiet Log nothing
3545
3546 fatal Log only fatal errors
3547
3548 error Log only errors
3549
3550 info Log errors and general informational messages
3551
3552 verbose Log errors and verbose informational messages
3553
3554 debug Log errors and verbose informational messages and
3555 debugging messages
3556
3557 debug2 Log errors and verbose informational messages and more
3558 debugging messages
3559
3560 debug3 Log errors and verbose informational messages and even
3561 more debugging messages
3562
3563 debug4 Log errors and verbose informational messages and even
3564 more debugging messages
3565
3566 debug5 Log errors and verbose informational messages and even
3567 more debugging messages
3568
3569
3570 SlurmdLogFile
3571 Fully qualified pathname of a file into which the slurmd dae‐
3572 mon's logs are written. The default value is none (performs
3573 logging via syslog). Any "%h" within the name is replaced with
3574 the hostname on which the slurmd is running. Any "%n" within
3575 the name is replaced with the Slurm node name on which the
3576 slurmd is running.
3577 See the section LOGGING if a pathname is specified.
3578
3579
3580 SlurmdPidFile
3581 Fully qualified pathname of a file into which the slurmd daemon
3582 may write its process id. This may be used for automated signal
3583 processing. Any "%h" within the name is replaced with the host‐
3584 name on which the slurmd is running. Any "%n" within the name
3585 is replaced with the Slurm node name on which the slurmd is run‐
3586 ning. The default value is "/var/run/slurmd.pid".
3587
3588
3589 SlurmdPort
3590 The port number that the Slurm compute node daemon, slurmd, lis‐
3591 tens to for work. The default value is SLURMD_PORT as estab‐
3592 lished at system build time. If none is explicitly specified,
3593 its value will be 6818. NOTE: Either slurmctld and slurmd dae‐
3594 mons must not execute on the same nodes or the values of Slurm‐
3595 ctldPort and SlurmdPort must be different.
3596
3597 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3598 automatically try to interact with anything opened on ports
3599 8192-60000. Configure SlurmdPort to use a port outside of the
3600 configured SrunPortRange and RSIP's port range.
3601
3602
3603 SlurmdSpoolDir
3604 Fully qualified pathname of a directory into which the slurmd
3605 daemon's state information and batch job script information are
3606 written. This must be a common pathname for all nodes, but
3607 should represent a directory which is local to each node (refer‐
3608 ence a local file system). The default value is
3609 "/var/spool/slurmd". Any "%h" within the name is replaced with
3610 the hostname on which the slurmd is running. Any "%n" within
3611 the name is replaced with the Slurm node name on which the
3612 slurmd is running.
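
               As an illustration of the "%n" substitution, a host running
               multiple slurmd daemons could give each node name its own
               local spool directory (the expanded names are hypothetical):

```
# Expands per node name, e.g. /var/spool/slurmd.node001, /var/spool/slurmd.node002
SlurmdSpoolDir=/var/spool/slurmd.%n
```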
3613
3614
3615 SlurmdSyslogDebug
3616 The slurmd daemon will log events to the syslog file at the
3617               specified level of detail.  If not set, the slurmd daemon will
3618               log to syslog at level fatal.  There are two exceptions: if
3619               there is no SlurmdLogFile and the daemon is running in the
3620               background, it will log to syslog at the level specified by
3621               SlurmdDebug (at fatal if SlurmdDebug is set to quiet); if it
3622               is run in the foreground, the level will be set to quiet.
3623
3624
3625 quiet Log nothing
3626
3627 fatal Log only fatal errors
3628
3629 error Log only errors
3630
3631 info Log errors and general informational messages
3632
3633 verbose Log errors and verbose informational messages
3634
3635 debug Log errors and verbose informational messages and
3636 debugging messages
3637
3638 debug2 Log errors and verbose informational messages and more
3639 debugging messages
3640
3641 debug3 Log errors and verbose informational messages and even
3642 more debugging messages
3643
3644 debug4 Log errors and verbose informational messages and even
3645 more debugging messages
3646
3647 debug5 Log errors and verbose informational messages and even
3648 more debugging messages
3649
3650
3651 SlurmdTimeout
3652 The interval, in seconds, that the Slurm controller waits for
3653 slurmd to respond before configuring that node's state to DOWN.
3654 A value of zero indicates the node will not be tested by slurm‐
3655 ctld to confirm the state of slurmd, the node will not be auto‐
3656 matically set to a DOWN state indicating a non-responsive
3657 slurmd, and some other tool will take responsibility for moni‐
3658 toring the state of each compute node and its slurmd daemon.
3659 Slurm's hierarchical communication mechanism is used to ping the
3660 slurmd daemons in order to minimize system noise and overhead.
3661 The default value is 300 seconds. The value may not exceed
3662 65533 seconds.
3663
3664
3665 SlurmSchedLogFile
3666 Fully qualified pathname of the scheduling event logging file.
3667 The syntax of this parameter is the same as for SlurmctldLog‐
3668 File. In order to configure scheduler logging, set both the
3669 SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3670
3671
3672 SlurmSchedLogLevel
3673 The initial level of scheduling event logging, similar to the
3674 SlurmctldDebug parameter used to control the initial level of
3675 slurmctld logging. Valid values for SlurmSchedLogLevel are "0"
3676 (scheduler logging disabled) and "1" (scheduler logging
3677 enabled). If this parameter is omitted, the value defaults to
3678 "0" (disabled). In order to configure scheduler logging, set
3679 both the SlurmSchedLogFile and SlurmSchedLogLevel parameters.
3680 The scheduler logging level can be changed dynamically using
3681 scontrol.
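
               As noted above, both parameters must be set to enable sched‐
               uler logging; a hypothetical fragment (the path is illustra‐
               tive):

```
SlurmSchedLogFile=/var/log/slurm/sched.log
SlurmSchedLogLevel=1
```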
3682
3683
3684 SrunEpilog
3685 Fully qualified pathname of an executable to be run by srun fol‐
3686 lowing the completion of a job step. The command line arguments
3687 for the executable will be the command and arguments of the job
3688 step. This configuration parameter may be overridden by srun's
3689 --epilog parameter. Note that while the other "Epilog" executa‐
3690 bles (e.g., TaskEpilog) are run by slurmd on the compute nodes
3691 where the tasks are executed, the SrunEpilog runs on the node
3692 where the "srun" is executing.
3693
3694
3695 SrunPortRange
3697              srun creates a set of listening ports to communicate with
3698              the controller and the slurmstepd daemons, and to handle
3699              application I/O.  By default these ports are ephemeral,
3700              meaning the port numbers are selected by the kernel.  This
3701              parameter allows sites to configure the range of ports from
3702              which srun ports will be selected.  This is useful if sites
3703              want to allow only a certain port range on their network.
3703
3704 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
3705 automatically try to interact with anything opened on ports
3706 8192-60000. Configure SrunPortRange to use a range of ports
3707 above those used by RSIP, ideally 1000 or more ports, for exam‐
3708 ple "SrunPortRange=60001-63000".
3709
3710              Note: A sufficient number of ports must be configured based
3711              on the estimated number of concurrent srun commands on the
3712              submission nodes, considering that each srun opens 3
3713              listening ports plus 2 more for every 48 hosts.  Example:
3714
3715 srun -N 48 will use 5 listening ports.
3716
3717
3718 srun -N 50 will use 7 listening ports.
3719
3720
3721 srun -N 200 will use 13 listening ports.
3722
3723
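
              The rule above can be expressed as a small calculation.  The
              helper below is an illustrative sketch (not part of Slurm) for
              estimating how wide an SrunPortRange needs to be:

              ```python
              import math

              def srun_listening_ports(nhosts):
                  """Ports used by one srun: 3 base listening ports plus 2
                  more for every (possibly partial) group of 48 hosts."""
                  return 3 + 2 * math.ceil(nhosts / 48)

              # Matches the examples given in the text above.
              for n in (48, 50, 200):
                  print(n, srun_listening_ports(n))
              ```

              Multiply this by the expected number of simultaneous srun
              commands per submission node when sizing the range.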
3724 SrunProlog
3725 Fully qualified pathname of an executable to be run by srun
3726 prior to the launch of a job step. The command line arguments
3727 for the executable will be the command and arguments of the job
3728 step. This configuration parameter may be overridden by srun's
3729 --prolog parameter. Note that while the other "Prolog" executa‐
3730 bles (e.g., TaskProlog) are run by slurmd on the compute nodes
3731 where the tasks are executed, the SrunProlog runs on the node
3732 where the "srun" is executing.
3733
3734
3735 StateSaveLocation
3736 Fully qualified pathname of a directory into which the Slurm
3737 controller, slurmctld, saves its state (e.g.
3738              "/usr/local/slurm/checkpoint").  Slurm state will be saved here to
3739 recover from system failures. SlurmUser must be able to create
3740 files in this directory. If you have a BackupController config‐
3741 ured, this location should be readable and writable by both sys‐
3742 tems. Since all running and pending job information is stored
3743 here, the use of a reliable file system (e.g. RAID) is recom‐
3744 mended. The default value is "/var/spool". If any slurm dae‐
3745 mons terminate abnormally, their core files will also be written
3746 into this directory.
3747
3748
3749 SuspendExcNodes
3750 Specifies the nodes which are to not be placed in power save
3751 mode, even if the node remains idle for an extended period of
3752 time. Use Slurm's hostlist expression to identify nodes with an
3753 optional ":" separator and count of nodes to exclude from the
3754              preceding range.  For example "nid[10-20]:4" will prevent 4
3755              usable nodes (i.e. IDLE and not DOWN, DRAINING or already
3756              powered down) in the set "nid[10-20]" from being powered down.
3757              Multiple sets of nodes can be specified with or without counts
3758              in a comma separated list (e.g. "nid[10-20]:4,nid[80-90]:2").
3759              If a node count specification is given, any list of nodes
3760              without a node count must come after the last specification
3761              with a count.  For example "nid[10-20]:4,nid[60-70]" will
3762              exclude 4 nodes in the set "nid[10-20]" plus all nodes in the
3763              set "nid[60-70]", while "nid[1-3],nid[10-20]:4" will exclude 4
3764              nodes from the set "nid[1-3],nid[10-20]".  By default no nodes are excluded.
3765 Related configuration options include ResumeTimeout, ResumePro‐
3766 gram, ResumeRate, SuspendProgram, SuspendRate, SuspendTime, Sus‐
3767 pendTimeout, and SuspendExcParts.
3768
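
              For example, a slurm.conf fragment (node names are examples
              only) excluding some nodes from power saving might read:

              ```
              # Keep 4 usable nodes of nid[10-20], and all of nid[60-70],
              # out of power save mode:
              SuspendExcNodes=nid[10-20]:4,nid[60-70]
              ```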
3769
3770 SuspendExcParts
3771 Specifies the partitions whose nodes are to not be placed in
3772 power save mode, even if the node remains idle for an extended
3773 period of time. Multiple partitions can be identified and sepa‐
3774 rated by commas. By default no nodes are excluded. Related
3775 configuration options include ResumeTimeout, ResumeProgram,
3776              ResumeRate, SuspendProgram, SuspendRate, SuspendTime, Suspend‐
3777 Timeout, and SuspendExcNodes.
3778
3779
3780 SuspendProgram
3781 SuspendProgram is the program that will be executed when a node
3782 remains idle for an extended period of time. This program is
3783 expected to place the node into some power save mode. This can
3784 be used to reduce the frequency and voltage of a node or com‐
3785 pletely power the node off. The program executes as SlurmUser.
3786 The argument to the program will be the names of nodes to be
3787 placed into power savings mode (using Slurm's hostlist expres‐
3788 sion format). By default, no program is run. Related configu‐
3789 ration options include ResumeTimeout, ResumeProgram, ResumeRate,
3790 SuspendRate, SuspendTime, SuspendTimeout, SuspendExcNodes, and
3791 SuspendExcParts.
3792
3793
3794 SuspendRate
3795 The rate at which nodes are placed into power save mode by Sus‐
3796 pendProgram. The value is number of nodes per minute and it can
3797 be used to prevent a large drop in power consumption (e.g. after
3798 a large job completes). A value of zero results in no limits
3799 being imposed. The default value is 60 nodes per minute.
3800 Related configuration options include ResumeTimeout, ResumePro‐
3801 gram, ResumeRate, SuspendProgram, SuspendTime, SuspendTimeout,
3802 SuspendExcNodes, and SuspendExcParts.
3803
3804
3805 SuspendTime
3806 Nodes which remain idle or down for this number of seconds will
3807 be placed into power save mode by SuspendProgram. For efficient
3808 system utilization, it is recommended that the value of Suspend‐
3809 Time be at least as large as the sum of SuspendTimeout plus
3810 ResumeTimeout. A value of -1 disables power save mode and is
3811 the default. Related configuration options include ResumeTime‐
3812 out, ResumeProgram, ResumeRate, SuspendProgram, SuspendRate,
3813 SuspendTimeout, SuspendExcNodes, and SuspendExcParts.
3814
3815
3816 SuspendTimeout
3817 Maximum time permitted (in seconds) between when a node suspend
3818              request is issued and when the node is shut down.  At that time
3819 the node must be ready for a resume request to be issued as
3820 needed for new work. The default value is 30 seconds. Related
3821 configuration options include ResumeProgram, ResumeRate, Resume‐
3822 Timeout, SuspendRate, SuspendTime, SuspendProgram, SuspendExcN‐
3823 odes and SuspendExcParts. More information is available at the
3824 Slurm web site ( https://slurm.schedmd.com/power_save.html ).
3825
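
              Taken together, a minimal power-saving configuration might look
              like the sketch below.  The script paths and timings are
              illustrative assumptions, not defaults:

              ```
              SuspendProgram=/usr/local/sbin/slurm_suspend.sh   # example path
              ResumeProgram=/usr/local/sbin/slurm_resume.sh     # example path
              SuspendTime=600        # suspend nodes idle for 10 minutes
              SuspendTimeout=30
              ResumeTimeout=300
              SuspendRate=60
              ResumeRate=300
              SuspendExcParts=debug
              ```

              Note that SuspendTime (600) exceeds SuspendTimeout plus
              ResumeTimeout (330), as recommended above.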
3826
3827 SwitchType
3828 Identifies the type of switch or interconnect used for applica‐
3829              tion communications.  Acceptable values include
3830              "switch/cray_aries" for Cray systems and "switch/none" for
3831              switches not requiring special processing for job launch or
3832              termination (e.g. Ethernet and InfiniBand).  The default value
3833              is "switch/none".  All Slurm daemons, commands and running jobs
3834 must be restarted for a change in SwitchType to take effect. If
3835 running jobs exist at the time slurmctld is restarted with a new
3836 value of SwitchType, records of all jobs in any state may be
3837 lost.
3838
3839
3840 TaskEpilog
3841              Fully qualified pathname of a program to be executed as the slurm
3842 job's owner after termination of each task. See TaskProlog for
3843 execution order details.
3844
3845
3846 TaskPlugin
3847 Identifies the type of task launch plugin, typically used to
3848 provide resource management within a node (e.g. pinning tasks to
3849 specific processors). More than one task plugin can be specified
3850 in a comma separated list. The prefix of "task/" is optional.
3851 Acceptable values include:
3852
3853 task/affinity enables resource containment using CPUSETs. This
3854 enables the --cpu-bind and/or --mem-bind srun
3855 options. If you use "task/affinity" and
3856 encounter problems, it may be due to the variety
3857 of system calls used to implement task affinity
3858 on different operating systems.
3859
3860 task/cgroup enables resource containment using Linux control
3861 cgroups. This enables the --cpu-bind and/or
3862 --mem-bind srun options. NOTE: see "man
3863 cgroup.conf" for configuration details.
3864
3865 task/none for systems requiring no special handling of user
3866 tasks. Lacks support for the --cpu-bind and/or
3867 --mem-bind srun options. The default value is
3868 "task/none".
3869
3870       NOTE: It is recommended to stack task/affinity,task/cgroup together
3871       when configuring TaskPlugin, and to set TaskAffinity=no and
3872       ConstrainCores=yes in cgroup.conf.  This setup uses the task/affinity
3873       plugin to set the affinity of the tasks (which it does better than
3874       task/cgroup) and uses the task/cgroup plugin to fence tasks into the
3875       specified resources, thus combining the strengths of both plugins.
3876
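       Concretely, the recommended non-Cray stacking described above
       corresponds to entries along these lines:

       ```
       # slurm.conf
       TaskPlugin=task/affinity,task/cgroup

       # cgroup.conf
       TaskAffinity=no
       ConstrainCores=yes
       ```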
3877 NOTE: For CRAY systems only: task/cgroup must be used with, and listed
3878 after task/cray_aries in TaskPlugin. The task/affinity plugin can be
3879 listed everywhere, but the previous constraint must be satisfied. So
3880 for CRAY systems, a configuration like this is recommended:
3881
3882 TaskPlugin=task/affinity,task/cray_aries,task/cgroup
3883
3884
3885 TaskPluginParam
3886 Optional parameters for the task plugin. Multiple options
3887 should be comma separated. If None, Boards, Sockets, Cores,
3888 Threads, and/or Verbose are specified, they will override the
3889 --cpu-bind option specified by the user in the srun command.
3890 None, Boards, Sockets, Cores and Threads are mutually exclusive
3891 and since they decrease scheduling flexibility are not generally
3892 recommended (select no more than one of them). Cpusets and
3893 Sched are mutually exclusive (select only one of them). All
3894 TaskPluginParam options are supported on FreeBSD except Cpusets.
3895 The Sched option uses cpuset_setaffinity() on FreeBSD, not
3896 sched_setaffinity().
3897
3898
3899 Boards Bind tasks to boards by default. Overrides automatic
3900 binding.
3901
3902 Cores Bind tasks to cores by default. Overrides automatic
3903 binding.
3904
3905 Cpusets Use cpusets to perform task affinity functions. By
3906 default, Sched task binding is performed.
3907
3908 None Perform no task binding by default. Overrides auto‐
3909 matic binding.
3910
3911 Sched Use sched_setaffinity (if available) to bind tasks to
3912 processors.
3913
3914 Sockets Bind to sockets by default. Overrides automatic bind‐
3915 ing.
3916
3917 Threads Bind to threads by default. Overrides automatic bind‐
3918 ing.
3919
3920 SlurmdOffSpec
3921 If specialized cores or CPUs are identified for the
3922 node (i.e. the CoreSpecCount or CpuSpecList are con‐
3923 figured for the node), then Slurm daemons running on
3924 the compute node (i.e. slurmd and slurmstepd) should
3925 run outside of those resources (i.e. specialized
3926 resources are completely unavailable to Slurm daemons
3927 and jobs spawned by Slurm). This option may not be
3928 used with the task/cray_aries plugin.
3929
3930 Verbose Verbosely report binding before tasks run. Overrides
3931 user options.
3932
3933 Autobind Set a default binding in the event that "auto binding"
3934 doesn't find a match. Set to Threads, Cores or Sock‐
3935                        ets (e.g. TaskPluginParam=autobind=threads).
3936
3937
3938 TaskProlog
3939              Fully qualified pathname of a program to be executed as the slurm
3940 job's owner prior to initiation of each task. Besides the nor‐
3941 mal environment variables, this has SLURM_TASK_PID available to
3942 identify the process ID of the task being started. Standard
3943 output from this program can be used to control the environment
3944 variables and output for the user program.
3945
3946 export NAME=value Will set environment variables for the task
3947 being spawned. Everything after the equal
3948 sign to the end of the line will be used as
3949 the value for the environment variable.
3950 Exporting of functions is not currently sup‐
3951 ported.
3952
3953 print ... Will cause that line (without the leading
3954 "print ") to be printed to the job's stan‐
3955 dard output.
3956
3957 unset NAME Will clear environment variables for the
3958 task being spawned.
3959
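
              A minimal TaskProlog script illustrating all three directives
              might look like the following sketch (the variable names are
              examples only):

              ```shell
              #!/bin/sh
              # Each line written to standard output is interpreted by
              # slurmd as a directive for the task being spawned.

              # Set an environment variable for the task.
              echo "export OMP_NUM_THREADS=4"

              # Emit a line (minus the leading "print ") to the job's
              # standard output.
              echo "print starting task with PID $SLURM_TASK_PID"

              # Clear a variable inherited from the submission environment.
              echo "unset DEBUG_LEVEL"
              ```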
3960 The order of task prolog/epilog execution is as follows:
3961
3962              1. pre_launch_priv()
3963                              Function in TaskPlugin
3964
3965              2. pre_launch()  Function in TaskPlugin
3966
3967              3. TaskProlog    System-wide per task program defined in
3968                              slurm.conf
3969
3970              4. user prolog   Job step specific task program defined using
3971                              srun's --task-prolog option or
3972                              SLURM_TASK_PROLOG environment variable
3973
3974              5. Execute the job step's task
3975
3976              6. user epilog   Job step specific task program defined using
3977                              srun's --task-epilog option or
3978                              SLURM_TASK_EPILOG environment variable
3979
3980              7. TaskEpilog    System-wide per task program defined in
3981                              slurm.conf
3982
3983              8. post_term()   Function in TaskPlugin
3984
3985
3986 TCPTimeout
3987 Time permitted for TCP connection to be established. Default
3988 value is 2 seconds.
3989
3990
3991 TmpFS Fully qualified pathname of the file system available to user
3992 jobs for temporary storage. This parameter is used in establish‐
3993 ing a node's TmpDisk space. The default value is "/tmp".
3994
3995
3996 TopologyParam
3997 Comma separated options identifying network topology options.
3998
3999 Dragonfly Optimize allocation for Dragonfly network. Valid
4000 when TopologyPlugin=topology/tree.
4001
4002 TopoOptional Only optimize allocation for network topology if
4003 the job includes a switch option. Since optimiz‐
4004 ing resource allocation for topology involves
4005 much higher system overhead, this option can be
4006 used to impose the extra overhead only on jobs
4007 which can take advantage of it. If most job allo‐
4008 cations are not optimized for network topology,
4009                            they may fragment resources to the point that
4010 topology optimization for other jobs will be dif‐
4011 ficult to achieve. NOTE: Jobs may span across
4012 nodes without common parent switches with this
4013 enabled.
4014
4015
4016 TopologyPlugin
4017 Identifies the plugin to be used for determining the network
4018 topology and optimizing job allocations to minimize network con‐
4019 tention. See NETWORK TOPOLOGY below for details. Additional
4020 plugins may be provided in the future which gather topology
4021 information directly from the network. Acceptable values
4022 include:
4023
4024 topology/3d_torus best-fit logic over three-dimensional
4025 topology
4026
4027              topology/node_rank orders nodes based upon information in a
4028 node_rank field in the node record as gen‐
4029 erated by a select plugin. Slurm performs a
4030 best-fit algorithm over those ordered nodes
4031
4032 topology/none default for other systems, best-fit logic
4033 over one-dimensional topology
4034
4035 topology/tree used for a hierarchical network as
4036 described in a topology.conf file
4037
4038
4039 TrackWCKey
4040 Boolean yes or no. Used to set display and track of the Work‐
4041 load Characterization Key. Must be set to track correct wckey
4042 usage. NOTE: You must also set TrackWCKey in your slurmdbd.conf
4043 file to create historical usage reports.
4044
4045
4046 TreeWidth
4047 Slurmd daemons use a virtual tree network for communications.
4048 TreeWidth specifies the width of the tree (i.e. the fanout). On
4049 architectures with a front end node running the slurmd daemon,
4050 the value must always be equal to or greater than the number of
4051       front end nodes, which eliminates the need for message forwarding
4052 between the slurmd daemons. On other architectures the default
4053 value is 50, meaning each slurmd daemon can communicate with up
4054 to 50 other slurmd daemons and over 2500 nodes can be contacted
4055 with two message hops. The default value will work well for
4056 most clusters. Optimal system performance can typically be
4057 achieved if TreeWidth is set to the square root of the number of
4058 nodes in the cluster for systems having no more than 2500 nodes
4059 or the cube root for larger systems. The value may not exceed
4060 65533.
4061
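
       The sizing guidance above can be sketched as a quick calculation
       (an illustrative helper, not part of Slurm):

       ```python
       import math

       def suggested_tree_width(node_count):
           """Square root of the node count for clusters of up to 2500
           nodes, cube root for larger systems, capped at the maximum
           permitted value of 65533."""
           if node_count <= 2500:
               width = math.ceil(math.sqrt(node_count))
           else:
               width = math.ceil(node_count ** (1.0 / 3.0))
           return min(width, 65533)

       # A 2500-node cluster: width 50, so two hops reach 50 * 50 nodes.
       print(suggested_tree_width(2500))
       ```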
4062
4063 UnkillableStepProgram
4064 If the processes in a job step are determined to be unkillable
4065 for a period of time specified by the UnkillableStepTimeout
4066 variable, the program specified by UnkillableStepProgram will be
4067 executed. This program can be used to take special actions to
4068              clean up the unkillable processes and/or notify computer admin‐
4069              istrators.  The program will be run as SlurmdUser (usually "root")
4070 on the compute node. By default no program is run.
4071
4072
4073 UnkillableStepTimeout
4074 The length of time, in seconds, that Slurm will wait before
4075 deciding that processes in a job step are unkillable (after they
4076 have been signaled with SIGKILL) and execute UnkillableStepPro‐
4077 gram as described above. The default timeout value is 60 sec‐
4078 onds. If exceeded, the compute node will be drained to prevent
4079 future jobs from being scheduled on the node.
4080
4081
4082 UsePAM If set to 1, PAM (Pluggable Authentication Modules for Linux)
4083 will be enabled. PAM is used to establish the upper bounds for
4084 resource limits. With PAM support enabled, local system adminis‐
4085 trators can dynamically configure system resource limits. Chang‐
4086 ing the upper bound of a resource limit will not alter the lim‐
4087 its of running jobs, only jobs started after a change has been
4088 made will pick up the new limits. The default value is 0 (not
4089 to enable PAM support). Remember that PAM also needs to be con‐
4090 figured to support Slurm as a service. For sites using PAM's
4091 directory based configuration option, a configuration file named
4092 slurm should be created. The module-type, control-flags, and
4093 module-path names that should be included in the file are:
4094 auth required pam_localuser.so
4095 auth required pam_shells.so
4096 account required pam_unix.so
4097 account required pam_access.so
4098 session required pam_unix.so
4099 For sites configuring PAM with a general configuration file, the
4100 appropriate lines (see above), where slurm is the service-name,
4101 should be added.
4102
4103              NOTE: The UsePAM option has nothing to do with the con‐
4104 tribs/pam/pam_slurm and/or contribs/pam_slurm_adopt modules. So
4105 these two modules can work independently of the value set for
4106 UsePAM.
4107
4108
4109 VSizeFactor
4110 Memory specifications in job requests apply to real memory size
4111 (also known as resident set size). It is possible to enforce
4112 virtual memory limits for both jobs and job steps by limiting
4113 their virtual memory to some percentage of their real memory
4114 allocation. The VSizeFactor parameter specifies the job's or job
4115 step's virtual memory limit as a percentage of its real memory
4116 limit. For example, if a job's real memory limit is 500MB and
4117 VSizeFactor is set to 101 then the job will be killed if its
4118 real memory exceeds 500MB or its virtual memory exceeds 505MB
4119 (101 percent of the real memory limit). The default value is 0,
4120 which disables enforcement of virtual memory limits. The value
4121 may not exceed 65533 percent.
4122
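
              The arithmetic in the example above is simply a percentage of
              the real memory limit:

              ```python
              def vsize_limit_mb(real_limit_mb, vsize_factor):
                  """Virtual memory limit derived from the real memory limit
                  and VSizeFactor (given as a percentage)."""
                  return real_limit_mb * vsize_factor // 100

              # The documented example: 500 MB real limit, VSizeFactor=101.
              print(vsize_limit_mb(500, 101))
              ```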
4123
4124 WaitTime
4125 Specifies how many seconds the srun command should by default
4126 wait after the first task terminates before terminating all
4127 remaining tasks. The "--wait" option on the srun command line
4128 overrides this value. The default value is 0, which disables
4129 this feature. May not exceed 65533 seconds.
4130
4131
4132 X11Parameters
4133 For use with Slurm's built-in X11 forwarding implementation.
4134
4135 home_xauthority
4136 If set, xauth data on the compute node will be placed in
4137 ~/.Xauthority rather than in a temporary file under
4138 TmpFS.
4139
4140
4141 The configuration of nodes (or machines) to be managed by Slurm is also
4142 specified in /etc/slurm.conf. Changes in node configuration (e.g.
4143 adding nodes, changing their processor count, etc.) require restarting
4144 both the slurmctld daemon and the slurmd daemons. All slurmd daemons
4145 must know each node in the system to forward messages in support of
4146 hierarchical communications. Only the NodeName must be supplied in the
4147 configuration file. All other node configuration information is
4148 optional. It is advisable to establish baseline node configurations,
4149 especially if the cluster is heterogeneous. Nodes which register to
4150 the system with less than the configured resources (e.g. too little
4151 memory), will be placed in the "DOWN" state to avoid scheduling jobs on
4152 them. Establishing baseline configurations will also speed Slurm's
4153 scheduling process by permitting it to compare job requirements against
4154 these (relatively few) configuration parameters and possibly avoid hav‐
4155 ing to check job requirements against every individual node's configu‐
4156 ration. The resources checked at node registration time are: CPUs,
4157 RealMemory and TmpDisk.
4158
4159 Default values can be specified with a record in which NodeName is
4160 "DEFAULT". The default entry values will apply only to lines following
4161 it in the configuration file and the default values can be reset multi‐
4162 ple times in the configuration file with multiple entries where "Node‐
4163 Name=DEFAULT". Each line where NodeName is "DEFAULT" will replace or
4164 add to previous default values and not reinitialize the default
4165 values.  The "NodeName=" specification must be placed on every line
4166 describing the configuration of nodes. A single node name can not
4167 appear as a NodeName value in more than one line (duplicate node name
4168 records will be ignored). In fact, it is generally possible and desir‐
4169 able to define the configurations of all nodes in only a few lines.
4170 This convention permits significant optimization in the scheduling of
4171 larger clusters. In order to support the concept of jobs requiring
4172 consecutive nodes on some architectures, node specifications should be
4173 place in this file in consecutive order. No single node name may be
4174 listed more than once in the configuration file. Use "DownNodes=" to
4175 record the state of nodes which are temporarily in a DOWN, DRAIN or
4176 FAILING state without altering permanent configuration information. A
4177 job step's tasks are allocated to nodes in the order the nodes appear in
4178 the configuration file. There is presently no capability within Slurm
4179 to arbitrarily order a job step's tasks.
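
       For instance, a baseline plus per-group overrides (all names and
       sizes below are examples only) could be written as:

       ```
       NodeName=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000
       NodeName=tux[0-127]                      # inherits all DEFAULT values
       NodeName=bigmem[0-3] RealMemory=256000   # overrides only the memory size
       ```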
4180
4181 Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
4182 and/or a simple node range expression may optionally be used to specify
4183 numeric ranges of nodes to avoid building a configuration file with
4184 large numbers of entries. The node range expression can contain one
4185 pair of square brackets with a sequence of comma separated numbers
4186 and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
4187 "lx[15,18,32-33]"). Note that the numeric ranges can include one or
4188 more leading zeros to indicate the numeric portion has a fixed number
4189 of digits (e.g. "linux[0000-1023]"). Multiple numeric ranges can be
4190 included in the expression (e.g. "rack[0-63]_blade[0-41]"). If one or
4191 more numeric expressions are included, one of them must be at the end
4192 of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
4193 always be used in a comma separated list.
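
       The bracketed range syntax can be illustrated with a simplified
       expander.  This sketch handles only a single bracket pair; Slurm's
       real parser is more general:

       ```python
       import re

       def expand_hostlist(expr):
           """Expand one bracketed range expression, e.g.
           "lx[15,18,32-33]" -> ["lx15", "lx18", "lx32", "lx33"]."""
           m = re.fullmatch(r"(.*)\[([\d,\-]+)\](.*)", expr)
           if not m:
               return [expr]          # plain name, nothing to expand
           prefix, ranges, suffix = m.groups()
           names = []
           for part in ranges.split(","):
               if "-" in part:
                   lo, hi = part.split("-")
                   width = len(lo)    # leading zeros imply fixed-width numbers
                   names += [f"{prefix}{i:0{width}d}{suffix}"
                             for i in range(int(lo), int(hi) + 1)]
               else:
                   names.append(f"{prefix}{part}{suffix}")
           return names

       print(expand_hostlist("lx[15,18,32-33]"))
       ```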
4194
4195 The node configuration specifies the following information:
4196
4197
4198 NodeName
4199 Name that Slurm uses to refer to a node. Typically this would
4200 be the string that "/bin/hostname -s" returns. It may also be
4201 the fully qualified domain name as returned by "/bin/hostname
4202 -f" (e.g. "foo1.bar.com"), or any valid domain name associated
4203 with the host through the host database (/etc/hosts) or DNS,
4204 depending on the resolver settings. Note that if the short form
4205 of the hostname is not used, it may prevent use of hostlist
4206 expressions (the numeric portion in brackets must be at the end
4207 of the string). It may also be an arbitrary string if NodeHost‐
4208 name is specified. If the NodeName is "DEFAULT", the values
4209 specified with that record will apply to subsequent node speci‐
4210 fications unless explicitly set to other values in that node
4211 record or replaced with a different set of default values. Each
4212 line where NodeName is "DEFAULT" will replace or add to previous
4213              default values and not reinitialize the default values.  For
4214 architectures in which the node order is significant, nodes will
4215 be considered consecutive in the order defined. For example, if
4216 the configuration for "NodeName=charlie" immediately follows the
4217 configuration for "NodeName=baker" they will be considered adja‐
4218 cent in the computer.
4219
4220
4221 NodeHostname
4222 Typically this would be the string that "/bin/hostname -s"
4223 returns. It may also be the fully qualified domain name as
4224 returned by "/bin/hostname -f" (e.g. "foo1.bar.com"), or any
4225 valid domain name associated with the host through the host
4226 database (/etc/hosts) or DNS, depending on the resolver set‐
4227 tings. Note that if the short form of the hostname is not used,
4228 it may prevent use of hostlist expressions (the numeric portion
4229 in brackets must be at the end of the string). A node range
4230 expression can be used to specify a set of nodes. If an expres‐
4231 sion is used, the number of nodes identified by NodeHostname on
4232 a line in the configuration file must be identical to the number
4233 of nodes identified by NodeName. By default, the NodeHostname
4234 will be identical in value to NodeName.
4235
4236
4237 NodeAddr
4238              Name by which the node should be referred to in establishing a
4239              communications path.  This name will be used as an argument to the
4240 gethostbyname() function for identification. If a node range
4241 expression is used to designate multiple nodes, they must
4242 exactly match the entries in the NodeName (e.g. "Node‐
4243 Name=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also contain IP
4244 addresses. By default, the NodeAddr will be identical in value
4245 to NodeHostname.
4246
4247
4248 Boards Number of Baseboards in nodes with a baseboard controller. Note
4249 that when Boards is specified, SocketsPerBoard, CoresPerSocket,
4250 and ThreadsPerCore should be specified. Boards and CPUs are
4251 mutually exclusive. The default value is 1.
4252
4253
4254 CoreSpecCount
4255 Number of cores reserved for system use. These cores will not
4256 be available for allocation to user jobs. Depending upon the
4257 TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e.
4258 slurmd and slurmstepd) may either be confined to these resources
4259 (the default) or prevented from using these resources. Isola‐
4260 tion of the Slurm daemons from user jobs may improve application
4261 performance. If this option and CpuSpecList are both designated
4262 for a node, an error is generated. For information on the algo‐
4263 rithm used by Slurm to select the cores refer to the core spe‐
4264 cialization documentation (
4265 https://slurm.schedmd.com/core_spec.html ).
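
              As a sketch (node names and sizes are examples), core
              specialization might be configured as:

              ```
              # Reserve 2 cores per node for system use; see also MemSpecLimit.
              NodeName=nid[00001-00100] CoreSpecCount=2 RealMemory=64000
              ```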
4266
4267
4268 CoresPerSocket
4269 Number of cores in a single physical processor socket (e.g.
4270 "2"). The CoresPerSocket value describes physical cores, not
4271 the logical number of processors per socket. NOTE: If you have
4272 multi-core processors, you will likely need to specify this
4273 parameter in order to optimize scheduling. The default value is
4274 1.
4275
4276
4277 CpuBind
4278 If a job step request does not specify an option to control how
4279 tasks are bound to allocated CPUs (--cpu-bind) and all nodes
4280 allocated to the job have the same CpuBind option the node Cpu‐
4281 Bind option will control how tasks are bound to allocated
4282 resources. Supported values for CpuBind are "none", "board",
4283 "socket", "ldom" (NUMA), "core" and "thread".
4284
4285
4286 CPUs Number of logical processors on the node (e.g. "2"). CPUs and
4287 Boards are mutually exclusive. It can be set to the total number
4288 of sockets, cores or threads. This can be useful when you want
4289 to schedule only the cores on a hyper-threaded node. If CPUs is
4290 omitted, it will be set equal to the product of Sockets, Cores‐
4291 PerSocket, and ThreadsPerCore. The default value is 1.
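
              For example, on a hyper-threaded node (names and counts below
              are illustrative), CPUs can be set to the core count so that
              only cores are scheduled:

              ```
              # 2 sockets x 8 cores x 2 threads = 32 logical CPUs;
              # CPUs=16 schedules only the cores of each node.
              NodeName=ht[0-15] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 CPUs=16 RealMemory=64000
              ```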
4292
4293
4294 CpuSpecList
4295 A comma delimited list of Slurm abstract CPU IDs reserved for
4296 system use. The list will be expanded to include all other
4297 CPUs, if any, on the same cores. These cores will not be avail‐
4298 able for allocation to user jobs. Depending upon the TaskPlug‐
4299 inParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and
4300 slurmstepd) may either be confined to these resources (the
4301 default) or prevented from using these resources. Isolation of
4302 the Slurm daemons from user jobs may improve application perfor‐
4303 mance. If this option and CoreSpecCount are both designated for
4304 a node, an error is generated. This option has no effect unless
4305 cgroup job confinement is also configured (TaskPlu‐
4306 gin=task/cgroup with ConstrainCores=yes in cgroup.conf).
4307
4308
4309 Feature
4310 A comma delimited list of arbitrary strings indicative of some
4311 characteristic associated with the node. There is no value
4312 associated with a feature at this time, a node either has a fea‐
4313 ture or it does not. If desired a feature may contain a numeric
4314 component indicating, for example, processor speed. By default
4315 a node has no features. Also see Gres.
4316
4317
4318       Gres   A comma delimited list of generic resource specifications for a
4319 node. The format is: "<name>[:<type>][:no_consume]:<num‐
4320 ber>[K|M|G]". The first field is the resource name, which
4321 matches the GresType configuration parameter name. The optional
4322 type field might be used to identify a model of that generic
4323 resource. It is forbidden to specify both an untyped GRES and a
4324 typed GRES with the same <name>. A generic resource can also be
4325 specified as non-consumable (i.e. multiple jobs can use the same
4326 generic resource) with the optional field ":no_consume". The
4327              final field must specify a generic resource count.  A suffix of
4328 "K", "M", "G", "T" or "P" may be used to multiply the number by
4329 1024, 1048576, 1073741824, etc. respectively.
4330 (e.g."Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_con‐
4331 sume:4G"). By default a node has no generic resources and its
4332 maximum count is that of an unsigned 64bit integer. Also see
4333 Feature.
4334
4335
4336 MemSpecLimit
4337 Amount of memory, in megabytes, reserved for system use and not
4338 available for user allocations. If the task/cgroup plugin is
4339 configured and that plugin constrains memory allocations (i.e.
4340 TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes
4341 in cgroup.conf), then Slurm compute node daemons (slurmd plus
4342 slurmstepd) will be allocated the specified memory limit. Note
4343 that having the Memory set in SelectTypeParameters as any of the
4344 options that has it as a consumable resource is needed for this
4345 option to work. The daemons will not be killed if they exhaust
4346              the memory allocation (i.e. the Out-Of-Memory Killer is disabled
4347 for the daemon's memory cgroup). If the task/cgroup plugin is
4348 not configured, the specified memory will only be unavailable
4349 for user allocations.
4350
4351
4352 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4353 tens to for work on this particular node. By default there is a
4354 single port number for all slurmd daemons on all compute nodes
4355 as defined by the SlurmdPort configuration parameter. Use of
4356 this option is not generally recommended except for development
4357 or testing purposes. If multiple slurmd daemons execute on a
4358 node this can specify a range of ports.
4359
4360 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4361 automatically try to interact with anything opened on ports
4362 8192-60000. Configure Port to use a port outside of the config‐
4363 ured SrunPortRange and RSIP's port range.
4364
4365
4366 Procs See CPUs.
4367
4368
4369 RealMemory
4370 Size of real memory on the node in megabytes (e.g. "2048"). The
4371                default value is 1.  Lowering RealMemory to set aside some
4372                memory for the OS, unavailable for job allocations, will not
4373                work as intended unless memory is configured as a consumable
4374                resource in SelectTypeParameters, so one of the *_Memory
4375                options needs to be enabled for that goal to be accomplished.
4376 Also see MemSpecLimit.
4377
4378
4379 Reason Identifies the reason for a node being in state "DOWN",
4380                "DRAINED", "DRAINING", "FAIL" or "FAILING".  Use quotes to
4381 enclose a reason having more than one word.
4382
4383
4384 Sockets
4385 Number of physical processor sockets/chips on the node (e.g.
4386 "2"). If Sockets is omitted, it will be inferred from CPUs,
4387 CoresPerSocket, and ThreadsPerCore. NOTE: If you have
4388 multi-core processors, you will likely need to specify these
4389 parameters. Sockets and SocketsPerBoard are mutually exclusive.
4390 If Sockets is specified when Boards is also used, Sockets is
4391 interpreted as SocketsPerBoard rather than total sockets. The
4392 default value is 1.
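As an illustration (hypothetical node name), a dual-socket node with 8 cores per socket and 2 threads per core could be described as:

```
NodeName=node02 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 CPUs=32
# CPUs = Sockets x CoresPerSocket x ThreadsPerCore = 2 x 8 x 2 = 32;
# CPUs may be omitted and inferred from the other three parameters.
```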
4393
4394
4395 SocketsPerBoard
4396 Number of physical processor sockets/chips on a baseboard.
4397 Sockets and SocketsPerBoard are mutually exclusive. The default
4398 value is 1.
4399
4400
4401 State State of the node with respect to the initiation of user jobs.
4402 Acceptable values are "CLOUD", "DOWN", "DRAIN", "FAIL", "FAIL‐
4403 ING", "FUTURE" and "UNKNOWN". Node states of "BUSY" and "IDLE"
4404 should not be specified in the node configuration, but set the
4405 node state to "UNKNOWN" instead. Setting the node state to
4406 "UNKNOWN" will result in the node state being set to "BUSY",
4407 "IDLE" or other appropriate state based upon recovered system
4408 state information. The default value is "UNKNOWN". Also see
4409 the DownNodes parameter below.
4410
4411                CLOUD     Indicates the node exists in the cloud.  Its initial
4412                          state will be treated as powered down.  The node will
4413                          be available for use after its state is recovered
4414                          from Slurm's state save file or the slurmd daemon
4415 starts on the compute node.
4416
4417 DOWN Indicates the node failed and is unavailable to be
4418 allocated work.
4419
4420                DRAIN     Indicates the node is unavailable to be allocated
4421                          work.
4422
4423 FAIL Indicates the node is expected to fail soon, has no
4424 jobs allocated to it, and will not be allocated to any
4425 new jobs.
4426
4427 FAILING Indicates the node is expected to fail soon, has one
4428 or more jobs allocated to it, but will not be allo‐
4429 cated to any new jobs.
4430
4431 FUTURE Indicates the node is defined for future use and need
4432 not exist when the Slurm daemons are started. These
4433 nodes can be made available for use simply by updating
4434 the node state using the scontrol command rather than
4435 restarting the slurmctld daemon. After these nodes are
4436 made available, change their State in the slurm.conf
4437 file. Until these nodes are made available, they will
4438                          not be seen by any Slurm commands, nor will any
4439                          attempt be made to contact them.
4440
4441 UNKNOWN Indicates the node's state is undefined (BUSY or
4442 IDLE), but will be established when the slurmd daemon
4443 on that node registers. The default value is
4444 "UNKNOWN".
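Bringing a FUTURE node into service might look like the following (hypothetical node name; the exact State value accepted can depend on the Slurm version, so treat this as a sketch):

```
# Make the node available without restarting slurmctld:
scontrol update NodeName=node48 State=RESUME
# Then update its State in slurm.conf so the change survives a restart.
```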
4445
4446
4447 ThreadsPerCore
4448 Number of logical threads in a single physical core (e.g. "2").
4449                Note that Slurm can allocate resources to jobs down to the
4450 resolution of a core. If your system is configured with more
4451 than one thread per core, execution of a different job on each
4452 thread is not supported unless you configure SelectTypeParame‐
4453 ters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket
4454                or ThreadsPerCore.  A job can execute one task per thread from
4455 within one job step or execute a distinct job step on each of
4456 the threads. Note also if you are running with more than 1
4457 thread per core and running the select/cons_res or
4458 select/cons_tres plugin then you will want to set the Select‐
4459 TypeParameters variable to something other than CR_CPU to avoid
4460 unexpected results. The default value is 1.
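To schedule individual hardware threads as CPUs, the relevant configuration sketch (hypothetical node name) would be:

```
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=node03 CPUs=32
# Per the text above, do not configure Sockets, CoresPerSocket or
# ThreadsPerCore when scheduling down to the thread level with CR_CPU.
```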
4461
4462
4463 TmpDisk
4464 Total size of temporary disk storage in TmpFS in megabytes (e.g.
4465 "16384"). TmpFS (for "Temporary File System") identifies the
4466 location which jobs should use for temporary storage. Note this
4467 does not indicate the amount of free space available to the user
4468                on the node, only the total file system size.  The system
4469                administrator should ensure this file system is purged as needed so
4470 that user jobs have access to most of this space. The Prolog
4471 and/or Epilog programs (specified in the configuration file)
4472 might be used to ensure the file system is kept clean. The
4473 default value is 0.
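A minimal sketch (hypothetical path and size):

```
TmpFS=/scratch
NodeName=node04 TmpDisk=16384     # 16 GB of local scratch space
# A Prolog/Epilog script can purge /scratch between jobs.
```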
4474
4475
4476         TRESWeights
4477                TRESWeights are used to calculate a value that represents how
4478                busy a node is.  Currently only used in federation configura‐
4479                tions.  TRESWeights are different from TRESBillingWeights,
4480                which are used for fairshare calculations.
4481
4482 TRES weights are specified as a comma-separated list of <TRES
4483 Type>=<TRES Weight> pairs.
4484 e.g.
4485 NodeName=node1 ... TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
4486
4487 By default the weighted TRES value is calculated as the sum of
4488 all node TRES types multiplied by their corresponding TRES
4489 weight.
4490
4491 If PriorityFlags=MAX_TRES is configured, the weighted TRES value
4492 is calculated as the MAX of individual node TRES' (e.g. cpus,
4493 mem, gres).
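The two calculation modes can be sketched as follows (an illustrative sketch, not Slurm's implementation; the node values and weights are hypothetical, with memory counted in GB to match the "0.25G" weight):

```python
# Sketch of the weighted TRES value used in federation scheduling.
# Default: sum of each TRES amount times its weight.
# With PriorityFlags=MAX_TRES: the max of the individual weighted terms.

def weighted_tres(tres, weights, max_tres=False):
    """tres and weights map TRES type -> amount / weight."""
    terms = [tres[t] * w for t, w in weights.items() if t in tres]
    return max(terms) if max_tres else sum(terms)

# Hypothetical node matching TRESWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
node = {"CPU": 16, "Mem": 64, "GRES/gpu": 2}   # Mem in GB
w = {"CPU": 1.0, "Mem": 0.25, "GRES/gpu": 2.0}

print(weighted_tres(node, w))                 # 16*1.0 + 64*0.25 + 2*2.0 = 36.0
print(weighted_tres(node, w, max_tres=True))  # max(16.0, 16.0, 4.0) = 16.0
```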
4494
4495
4496 Weight The priority of the node for scheduling purposes. All things
4497 being equal, jobs will be allocated the nodes with the lowest
4498 weight which satisfies their requirements. For example, a het‐
4499 erogeneous collection of nodes might be placed into a single
4500 partition for greater system utilization, responsiveness and
4501 capability. It would be preferable to allocate smaller memory
4502 nodes rather than larger memory nodes if either will satisfy a
4503 job's requirements. The units of weight are arbitrary, but
4504 larger weights should be assigned to nodes with more processors,
4505 memory, disk space, higher processor speed, etc. Note that if a
4506 job allocation request can not be satisfied using the nodes with
4507 the lowest weight, the set of nodes with the next lowest weight
4508 is added to the set of nodes under consideration for use (repeat
4509 as needed for higher weight values). If you absolutely want to
4510 minimize the number of higher weight nodes allocated to a job
4511 (at a cost of higher scheduling overhead), give each node a dis‐
4512 tinct Weight value and they will be added to the pool of nodes
4513 being considered for scheduling individually. The default value
4514 is 1.
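For example (hypothetical node names and sizes), to prefer the small-memory nodes:

```
NodeName=small[1-16]  RealMemory=64000   Weight=1
NodeName=large[1-4]   RealMemory=512000  Weight=10
# Jobs that fit on a small node are placed there first; large nodes
# are considered only when the lower-weight set cannot satisfy the job.
```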
4515
4516
4517 The "DownNodes=" configuration permits you to mark certain nodes as in
4518 a DOWN, DRAIN, FAIL, or FAILING state without altering the permanent
4519 configuration information listed under a "NodeName=" specification.
4520
4521
4522 DownNodes
4523 Any node name, or list of node names, from the "NodeName=" spec‐
4524 ifications.
4525
4526
4527 Reason Identifies the reason for a node being in state "DOWN", "DRAIN",
4528                "FAIL" or "FAILING".  Use quotes to enclose a reason having more
4529 than one word.
4530
4531
4532 State State of the node with respect to the initiation of user jobs.
4533 Acceptable values are "DOWN", "DRAIN", "FAIL", "FAILING" and
4534 "UNKNOWN". Node states of "BUSY" and "IDLE" should not be spec‐
4535 ified in the node configuration, but set the node state to
4536 "UNKNOWN" instead. Setting the node state to "UNKNOWN" will
4537 result in the node state being set to "BUSY", "IDLE" or other
4538 appropriate state based upon recovered system state information.
4539 The default value is "UNKNOWN".
4540
4541 DOWN Indicates the node failed and is unavailable to be
4542 allocated work.
4543
4544                DRAIN     Indicates the node is unavailable to be allocated
4545                          work.
4546
4547 FAIL Indicates the node is expected to fail soon, has no
4548 jobs allocated to it, and will not be allocated to any
4549 new jobs.
4550
4551 FAILING Indicates the node is expected to fail soon, has one
4552 or more jobs allocated to it, but will not be allo‐
4553 cated to any new jobs.
4554
4555 UNKNOWN Indicates the node's state is undefined (BUSY or
4556 IDLE), but will be established when the slurmd daemon
4557 on that node registers. The default value is
4558 "UNKNOWN".
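A DownNodes entry might look like this (hypothetical node names and reason):

```
DownNodes=node[21-24] State=DOWN Reason="power supply failure"
```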
4559
4560
4561 On computers where frontend nodes are used to execute batch scripts
4562 rather than compute nodes (Cray ALPS systems), one may configure one or
4563 more frontend nodes using the configuration parameters defined below.
4564 These options are very similar to those used in configuring compute
4565 nodes. These options may only be used on systems configured and built
4566 with the appropriate parameters (--have-front-end) or a system deter‐
4567 mined to have the appropriate architecture by the configure script
4568 (Cray ALPS systems). The front end configuration specifies the follow‐
4569 ing information:
4570
4571
4572 AllowGroups
4573 Comma separated list of group names which may execute jobs on
4574 this front end node. By default, all groups may use this front
4575 end node. If at least one group associated with the user
4576 attempting to execute the job is in AllowGroups, he will be per‐
4577 mitted to use this front end node. May not be used with the
4578 DenyGroups option.
4579
4580
4581 AllowUsers
4582 Comma separated list of user names which may execute jobs on
4583 this front end node. By default, all users may use this front
4584 end node. May not be used with the DenyUsers option.
4585
4586
4587 DenyGroups
4588 Comma separated list of group names which are prevented from
4589 executing jobs on this front end node. May not be used with the
4590 AllowGroups option.
4591
4592
4593 DenyUsers
4594 Comma separated list of user names which are prevented from exe‐
4595 cuting jobs on this front end node. May not be used with the
4596 AllowUsers option.
4597
4598
4599 FrontendName
4600 Name that Slurm uses to refer to a frontend node. Typically
4601 this would be the string that "/bin/hostname -s" returns. It
4602 may also be the fully qualified domain name as returned by
4603 "/bin/hostname -f" (e.g. "foo1.bar.com"), or any valid domain
4604 name associated with the host through the host database
4605 (/etc/hosts) or DNS, depending on the resolver settings. Note
4606 that if the short form of the hostname is not used, it may pre‐
4607 vent use of hostlist expressions (the numeric portion in brack‐
4608 ets must be at the end of the string). If the FrontendName is
4609 "DEFAULT", the values specified with that record will apply to
4610 subsequent node specifications unless explicitly set to other
4611 values in that frontend node record or replaced with a different
4612 set of default values. Each line where FrontendName is
4613                "DEFAULT" will replace or add to previous default values and
4614                not reinitialize the default values.  Note that since the naming
4615 of front end nodes would typically not follow that of the com‐
4616 pute nodes (e.g. lacking X, Y and Z coordinates found in the
4617 compute node naming scheme), each front end node name should be
4618                listed separately and without a hostlist expression (i.e.
4619                "frontend00,frontend01" rather than "frontend[00-01]").
4620
4621
4622 FrontendAddr
4623 Name that a frontend node should be referred to in establishing
4624 a communications path. This name will be used as an argument to
4625 the gethostbyname() function for identification. As with Fron‐
4626 tendName, list the individual node addresses rather than using a
4627 hostlist expression. The number of FrontendAddr records per
4628 line must equal the number of FrontendName records per line
4629                (i.e. you can't map two node names to one address).  FrontendAddr
4630 may also contain IP addresses. By default, the FrontendAddr
4631 will be identical in value to FrontendName.
4632
4633
4634 Port The port number that the Slurm compute node daemon, slurmd, lis‐
4635 tens to for work on this particular frontend node. By default
4636 there is a single port number for all slurmd daemons on all
4637 frontend nodes as defined by the SlurmdPort configuration param‐
4638 eter. Use of this option is not generally recommended except for
4639 development or testing purposes.
4640
4641 Note: On Cray systems, Realm-Specific IP Addressing (RSIP) will
4642 automatically try to interact with anything opened on ports
4643 8192-60000. Configure Port to use a port outside of the config‐
4644 ured SrunPortRange and RSIP's port range.
4645
4646
4647 Reason Identifies the reason for a frontend node being in state "DOWN",
4648                "DRAINED", "DRAINING", "FAIL" or "FAILING".  Use quotes to
4649 enclose a reason having more than one word.
4650
4651
4652 State State of the frontend node with respect to the initiation of
4653 user jobs. Acceptable values are "DOWN", "DRAIN", "FAIL",
4654 "FAILING" and "UNKNOWN". "DOWN" indicates the frontend node has
4655 failed and is unavailable to be allocated work. "DRAIN" indi‐
4656 cates the frontend node is unavailable to be allocated work.
4657 "FAIL" indicates the frontend node is expected to fail soon, has
4658 no jobs allocated to it, and will not be allocated to any new
4659 jobs. "FAILING" indicates the frontend node is expected to fail
4660 soon, has one or more jobs allocated to it, but will not be
4661 allocated to any new jobs. "UNKNOWN" indicates the frontend
4662 node's state is undefined (BUSY or IDLE), but will be estab‐
4663 lished when the slurmd daemon on that node registers. The
4664 default value is "UNKNOWN". Also see the DownNodes parameter
4665 below.
4666
4667 For example: "FrontendName=frontend[00-03] FrontendAddr=efron‐
4668 tend[00-03] State=UNKNOWN" is used to define four front end
4669 nodes for running slurmd daemons.
4670
4671
4672 The partition configuration permits you to establish different job lim‐
4673 its or access controls for various groups (or partitions) of nodes.
4674 Nodes may be in more than one partition, making partitions serve as
4675 general purpose queues. For example one may put the same set of nodes
4676 into two different partitions, each with different constraints (time
4677 limit, job sizes, groups allowed to use the partition, etc.). Jobs are
4678 allocated resources within a single partition. Default values can be
4679 specified with a record in which PartitionName is "DEFAULT". The
4680 default entry values will apply only to lines following it in the con‐
4681 figuration file and the default values can be reset multiple times in
4682 the configuration file with multiple entries where "Partition‐
4683 Name=DEFAULT". The "PartitionName=" specification must be placed on
4684 every line describing the configuration of partitions. Each line where
4685 PartitionName is "DEFAULT" will replace or add to previous default val‐
4686       ues and not reinitialize the default values.  A single partition name
4687 can not appear as a PartitionName value in more than one line (dupli‐
4688 cate partition name records will be ignored). If a partition that is
4689 in use is deleted from the configuration and slurm is restarted or
4690 reconfigured (scontrol reconfigure), jobs using the partition are can‐
4691 celed. NOTE: Put all parameters for each partition on a single line.
4692 Each line of partition configuration information should represent a
4693 different partition. The partition configuration file contains the
4694 following information:
4695
4696
4697 AllocNodes
4698 Comma separated list of nodes from which users can submit jobs
4699 in the partition. Node names may be specified using the node
4700 range expression syntax described above. The default value is
4701 "ALL".
4702
4703
4704 AllowAccounts
4705 Comma separated list of accounts which may execute jobs in the
4706 partition. The default value is "ALL". NOTE: If AllowAccounts
4707 is used then DenyAccounts will not be enforced. Also refer to
4708 DenyAccounts.
4709
4710
4711 AllowGroups
4712 Comma separated list of group names which may execute jobs in
4713 the partition. If at least one group associated with the user
4714 attempting to execute the job is in AllowGroups, he will be per‐
4715 mitted to use this partition. Jobs executed as user root can
4716 use any partition without regard to the value of AllowGroups.
4717 If user root attempts to execute a job as another user (e.g.
4718 using srun's --uid option), this other user must be in one of
4719 groups identified by AllowGroups for the job to successfully
4720 execute. The default value is "ALL". When set, all partitions
4721                that a user does not have access to will be hidden from display
4722 regardless of the settings used for PrivateData. NOTE: For per‐
4723 formance reasons, Slurm maintains a list of user IDs allowed to
4724 use each partition and this is checked at job submission time.
4725 This list of user IDs is updated when the slurmctld daemon is
4726 restarted, reconfigured (e.g. "scontrol reconfig") or the parti‐
4727                tion's AllowGroups value is reset, even if its value is unchanged
4728                (e.g. "scontrol update PartitionName=name AllowGroups=group").
4729                For a user's access to a partition to change, both the user's
4730                group membership and Slurm's internal user ID list must change,
4731                the latter using one of the methods described above.
4732
4733
4734 AllowQos
4735 Comma separated list of Qos which may execute jobs in the parti‐
4736 tion. Jobs executed as user root can use any partition without
4737 regard to the value of AllowQos. The default value is "ALL".
4738 NOTE: If AllowQos is used then DenyQos will not be enforced.
4739 Also refer to DenyQos.
4740
4741
4742 Alternate
4743 Partition name of alternate partition to be used if the state of
4744 this partition is "DRAIN" or "INACTIVE."
4745
4746
4747 CpuBind
4748                If a job step request does not specify an option to control how
4749                tasks are bound to allocated CPUs (--cpu-bind), and the nodes
4750                allocated to the job do not all have the same node-level CpuBind
4751                option, then the partition's CpuBind option will control how
4752                tasks are bound to allocated resources.  Supported values for
4753                CpuBind are "none", "board", "socket", "ldom" (NUMA), "core" and
4754 "thread".
4755
4756
4757 Default
4758 If this keyword is set, jobs submitted without a partition spec‐
4759 ification will utilize this partition. Possible values are
4760 "YES" and "NO". The default value is "NO".
4761
4762
4763 DefCpuPerGPU
4764 Default count of CPUs allocated per allocated GPU.
4765
4766
4767 DefMemPerCPU
4768 Default real memory size available per allocated CPU in
4769 megabytes. Used to avoid over-subscribing memory and causing
4770 paging. DefMemPerCPU would generally be used if individual pro‐
4771 cessors are allocated to jobs (SelectType=select/cons_res or
4772 SelectType=select/cons_tres). If not set, the DefMemPerCPU
4773 value for the entire cluster will be used. Also see DefMem‐
4774 PerGPU, DefMemPerNode and MaxMemPerCPU. DefMemPerCPU, DefMem‐
4775 PerGPU and DefMemPerNode are mutually exclusive.
4776
4777
4778 DefMemPerGPU
4779 Default real memory size available per allocated GPU in
4780 megabytes. Also see DefMemPerCPU, DefMemPerNode and MaxMemPer‐
4781 CPU. DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually
4782 exclusive.
4783
4784
4785 DefMemPerNode
4786 Default real memory size available per allocated node in
4787 megabytes. Used to avoid over-subscribing memory and causing
4788 paging. DefMemPerNode would generally be used if whole nodes
4789 are allocated to jobs (SelectType=select/linear) and resources
4790 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
4791 If not set, the DefMemPerNode value for the entire cluster will
4792 be used. Also see DefMemPerCPU, DefMemPerGPU and MaxMemPerCPU.
4793 DefMemPerCPU, DefMemPerGPU and DefMemPerNode are mutually exclu‐
4794 sive.
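For instance (hypothetical partition, nodes and values), to default jobs to 2 GB per allocated CPU with a 4 GB cap:

```
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PartitionName=batch Nodes=node[01-32] DefMemPerCPU=2048 MaxMemPerCPU=4096
```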
4795
4796
4797 DenyAccounts
4798 Comma separated list of accounts which may not execute jobs in
4799                the partition.  By default, no accounts are denied access.  NOTE:
4800 If AllowAccounts is used then DenyAccounts will not be enforced.
4801 Also refer to AllowAccounts.
4802
4803
4804 DenyQos
4805 Comma separated list of Qos which may not execute jobs in the
4806                partition.  By default, no QOS are denied access.  NOTE: If
4807                AllowQos is used then DenyQos will not be enforced.  Also refer
4808                to AllowQos.
4809
4810
4811 DefaultTime
4812 Run time limit used for jobs that don't specify a value. If not
4813 set then MaxTime will be used. Format is the same as for Max‐
4814 Time.
4815
4816
4817 DisableRootJobs
4818 If set to "YES" then user root will be prevented from running
4819 any jobs on this partition. The default value will be the value
4820 of DisableRootJobs set outside of a partition specification
4821 (which is "NO", allowing user root to execute jobs).
4822
4823
4824 ExclusiveUser
4825 If set to "YES" then nodes will be exclusively allocated to
4826 users. Multiple jobs may be run for the same user, but only one
4827 user can be active at a time. This capability is also available
4828 on a per-job basis by using the --exclusive=user option.
4829
4830
4831 GraceTime
4832 Specifies, in units of seconds, the preemption grace time to be
4833 extended to a job which has been selected for preemption. The
4834                default value is zero; no preemption grace time is allowed on
4835 this partition. Once a job has been selected for preemption,
4836 its end time is set to the current time plus GraceTime. The
4837 job's tasks are immediately sent SIGCONT and SIGTERM signals in
4838 order to provide notification of its imminent termination. This
4839 is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence
4840 upon reaching its new end time. This second set of signals is
4841 sent to both the tasks and the containing batch script, if
4842 applicable. Meaningful only for PreemptMode=CANCEL. See also
4843 the global KillWait configuration parameter.
4844
4845
4846 Hidden Specifies if the partition and its jobs are to be hidden by
4847 default. Hidden partitions will by default not be reported by
4848 the Slurm APIs or commands. Possible values are "YES" and "NO".
4849 The default value is "NO". Note that partitions that a user
4850 lacks access to by virtue of the AllowGroups parameter will also
4851 be hidden by default.
4852
4853
4854 LLN Schedule resources to jobs on the least loaded nodes (based upon
4855 the number of idle CPUs). This is generally only recommended for
4856 an environment with serial jobs as idle resources will tend to
4857 be highly fragmented, resulting in parallel jobs being distrib‐
4858 uted across many nodes. Note that node Weight takes precedence
4859 over how many idle resources are on each node. Also see the
4860 SelectParameters configuration parameter CR_LLN to use the least
4861 loaded nodes in every partition.
4862
4863
4864 MaxCPUsPerNode
4865 Maximum number of CPUs on any node available to all jobs from
4866 this partition. This can be especially useful to schedule GPUs.
4867 For example a node can be associated with two Slurm partitions
4868 (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be
4869 limited to only a subset of the node's CPUs, ensuring that one
4870 or more CPUs would be available to jobs in the "gpu" parti‐
4871 tion/queue.
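The cpu/gpu split described above might be sketched as follows (hypothetical names and counts; a node with 16 CPUs and 2 GPUs):

```
NodeName=node05 CPUs=16 Gres=gpu:2
PartitionName=cpu Nodes=node05 MaxCPUsPerNode=14
PartitionName=gpu Nodes=node05
# Jobs in "cpu" may use at most 14 of the 16 CPUs, leaving 2 CPUs
# free for jobs in the "gpu" partition to drive the GPUs.
```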
4872
4873
4874 MaxMemPerCPU
4875 Maximum real memory size available per allocated CPU in
4876 megabytes. Used to avoid over-subscribing memory and causing
4877 paging. MaxMemPerCPU would generally be used if individual pro‐
4878 cessors are allocated to jobs (SelectType=select/cons_res or
4879 SelectType=select/cons_tres). If not set, the MaxMemPerCPU
4880 value for the entire cluster will be used. Also see DefMemPer‐
4881 CPU and MaxMemPerNode. MaxMemPerCPU and MaxMemPerNode are mutu‐
4882 ally exclusive.
4883
4884
4885 MaxMemPerNode
4886 Maximum real memory size available per allocated node in
4887 megabytes. Used to avoid over-subscribing memory and causing
4888 paging. MaxMemPerNode would generally be used if whole nodes
4889 are allocated to jobs (SelectType=select/linear) and resources
4890 are over-subscribed (OverSubscribe=yes or OverSubscribe=force).
4891 If not set, the MaxMemPerNode value for the entire cluster will
4892 be used. Also see DefMemPerNode and MaxMemPerCPU. MaxMemPerCPU
4893 and MaxMemPerNode are mutually exclusive.
4894
4895
4896 MaxNodes
4897 Maximum count of nodes which may be allocated to any single job.
4898 The default value is "UNLIMITED", which is represented inter‐
4899 nally as -1. This limit does not apply to jobs executed by
4900 SlurmUser or user root.
4901
4902
4903 MaxTime
4904 Maximum run time limit for jobs. Format is minutes, min‐
4905 utes:seconds, hours:minutes:seconds, days-hours, days-hours:min‐
4906 utes, days-hours:minutes:seconds or "UNLIMITED". Time resolu‐
4907 tion is one minute and second values are rounded up to the next
4908 minute. This limit does not apply to jobs executed by SlurmUser
4909 or user root.
4910
4911
4912 MinNodes
4913 Minimum count of nodes which may be allocated to any single job.
4914 The default value is 0. This limit does not apply to jobs exe‐
4915 cuted by SlurmUser or user root.
4916
4917
4918 Nodes Comma separated list of nodes which are associated with this
4919 partition. Node names may be specified using the node range
4920 expression syntax described above. A blank list of nodes (i.e.
4921 "Nodes= ") can be used if one wants a partition to exist, but
4922 have no resources (possibly on a temporary basis). A value of
4923 "ALL" is mapped to all nodes configured in the cluster.
4924
4925
4926 OverSubscribe
4927 Controls the ability of the partition to execute more than one
4928 job at a time on each resource (node, socket or core depending
4929 upon the value of SelectTypeParameters). If resources are to be
4930 over-subscribed, avoiding memory over-subscription is very
4931 important. SelectTypeParameters should be configured to treat
4932 memory as a consumable resource and the --mem option should be
4933 used for job allocations. Sharing of resources is typically
4934 useful only when using gang scheduling (PreemptMode=sus‐
4935 pend,gang). Possible values for OverSubscribe are "EXCLUSIVE",
4936 "FORCE", "YES", and "NO". Note that a value of "YES" or "FORCE"
4937 can negatively impact performance for systems with many thou‐
4938 sands of running jobs. The default value is "NO". For more
4939 information see the following web pages:
4940 https://slurm.schedmd.com/cons_res.html,
4941 https://slurm.schedmd.com/cons_res_share.html,
4942 https://slurm.schedmd.com/gang_scheduling.html, and
4943 https://slurm.schedmd.com/preempt.html.
4944
4945
4946 EXCLUSIVE Allocates entire nodes to jobs even with Select‐
4947 Type=select/cons_res or SelectType=select/cons_tres
4948 configured. Jobs that run in partitions with "Over‐
4949 Subscribe=EXCLUSIVE" will have exclusive access to
4950 all allocated nodes.
4951
4952 FORCE Makes all resources in the partition available for
4953 oversubscription without any means for users to dis‐
4954 able it. May be followed with a colon and maximum
4955 number of jobs in running or suspended state. For
4956 example "OverSubscribe=FORCE:4" enables each node,
4957 socket or core to oversubscribe each resource four
4958 ways. Recommended only for systems running with
4959 gang scheduling (PreemptMode=suspend,gang). NOTE:
4960 PreemptType=QOS will permit one additional job to be
4961 run on the partition if started due to job preemp‐
4962 tion. For example, a configuration of OverSub‐
4963 scribe=FORCE:1 will only permit one job per
4964 resources normally, but a second job can be started
4965 if done so through preemption based upon QOS. The
4966 use of PreemptType=QOS and PreemptType=Suspend only
4967 applies with SelectType=select/cons_res or Select‐
4968 Type=select/cons_tres.
4969
4970 YES Makes all resources in the partition available for
4971 sharing upon request by the job. Resources will
4972 only be over-subscribed when explicitly requested by
4973 the user using the "--oversubscribe" option on job
4974 submission. May be followed with a colon and maxi‐
4975 mum number of jobs in running or suspended state.
4976 For example "OverSubscribe=YES:4" enables each node,
4977 socket or core to execute up to four jobs at once.
4978 Recommended only for systems running with gang
4979 scheduling (PreemptMode=suspend,gang).
4980
4981 NO Selected resources are allocated to a single job. No
4982 resource will be allocated to more than one job.
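A gang-scheduling sketch (hypothetical partition and nodes; values per the FORCE description above):

```
PreemptMode=suspend,gang
PartitionName=shared Nodes=node[01-08] OverSubscribe=FORCE:4
# Each scheduling resource (node, socket or core, depending on
# SelectTypeParameters) may hold up to four running/suspended jobs.
```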
4983
4984
4985 PartitionName
4986 Name by which the partition may be referenced (e.g. "Interac‐
4987 tive"). This name can be specified by users when submitting
4988 jobs. If the PartitionName is "DEFAULT", the values specified
4989 with that record will apply to subsequent partition specifica‐
4990 tions unless explicitly set to other values in that partition
4991 record or replaced with a different set of default values. Each
4992 line where PartitionName is "DEFAULT" will replace or add to
4993                previous default values and not reinitialize the default val‐
4994                ues.
4995
4996
4997 PreemptMode
4998 Mechanism used to preempt jobs from this partition when Preempt‐
4999 Type=preempt/partition_prio is configured. This partition spe‐
5000 cific PreemptMode configuration parameter will override the Pre‐
5001 emptMode configuration parameter set for the cluster as a whole.
5002 The cluster-level PreemptMode must include the GANG option if
5003 PreemptMode is configured to SUSPEND for any partition. The
5004 cluster-level PreemptMode must not be OFF if PreemptMode is
5005 enabled for any partition. See the description of the clus‐
5006 ter-level PreemptMode configuration parameter above for further
5007 information.
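A sketch of partition-level preemption (hypothetical partitions and nodes):

```
PreemptType=preempt/partition_prio
PreemptMode=suspend,gang                  # cluster level: must include GANG
PartitionName=low  Nodes=node[01-08] PriorityTier=1 PreemptMode=suspend
PartitionName=high Nodes=node[01-08] PriorityTier=2
```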
5008
5009
5010 PriorityJobFactor
5011 Partition factor used by priority/multifactor plugin in calcu‐
5012 lating job priority. The value may not exceed 65533. Also see
5013 PriorityTier.
5014
5015
5016 PriorityTier
5017 Jobs submitted to a partition with a higher priority tier value
5018                will be dispatched before pending jobs in partitions with lower
5019 priority tier value and, if possible, they will preempt running
5020 jobs from partitions with lower priority tier values. Note that
5021 a partition's priority tier takes precedence over a job's prior‐
5022 ity. The value may not exceed 65533. Also see PriorityJobFac‐
5023 tor.
5024
5025
5026 QOS Used to extend the limits available to a QOS on a partition.
5027 Jobs will not be associated to this QOS outside of being associ‐
5028 ated to the partition. They will still be associated to their
5029 requested QOS. By default, no QOS is used. NOTE: If a limit is
5030 set in both the Partition's QOS and the Job's QOS the Partition
5031 QOS will be honored unless the Job's QOS has the OverPartQOS
5032                flag set, in which case the Job's QOS will have priority.
5033
5034
5035 ReqResv
5036 Specifies users of this partition are required to designate a
5037 reservation when submitting a job. This option can be useful in
5038 restricting usage of a partition that may have higher priority
5039 or additional resources to be allowed only within a reservation.
5040 Possible values are "YES" and "NO". The default value is "NO".
5041
5042
5043 RootOnly
5044 Specifies if only user ID zero (i.e. user root) may allocate
5045 resources in this partition. User root may allocate resources
5046 for any other user, but the request must be initiated by user
5047 root. This option can be useful for a partition to be managed
5048 by some external entity (e.g. a higher-level job manager) and
5049 prevents users from directly using those resources. Possible
5050 values are "YES" and "NO". The default value is "NO".
5051
5052
5053 SelectTypeParameters
5054 Partition-specific resource allocation type. This option
5055 replaces the global SelectTypeParameters value. Supported val‐
5056 ues are CR_Core, CR_Core_Memory, CR_Socket and CR_Socket_Memory.
5057 Use requires the system-wide SelectTypeParameters value be set
5058 to any of the four supported values previously listed; other‐
5059 wise, the partition-specific value will be ignored.
5060
5061
5062 Shared The Shared configuration parameter has been replaced by the
5063 OverSubscribe parameter described above.
5064

       State  State of partition or availability for use. Possible
              values are "UP", "DOWN", "DRAIN" and "INACTIVE". The
              default value is "UP". See also the related "Alternate"
              keyword.

              UP        Designates that new jobs may be queued on the
                        partition, and that jobs may be allocated nodes
                        and run from the partition.

              DOWN      Designates that new jobs may be queued on the
                        partition, but queued jobs may not be allocated
                        nodes and run from the partition. Jobs already
                        running on the partition continue to run. The
                        jobs must be explicitly canceled to force their
                        termination.

              DRAIN     Designates that no new jobs may be queued on
                        the partition (job submission requests will be
                        denied with an error message), but jobs already
                        queued on the partition may be allocated nodes
                        and run. See also the "Alternate" partition
                        specification.

              INACTIVE  Designates that no new jobs may be queued on
                        the partition, and jobs already queued may not
                        be allocated nodes and run. See also the
                        "Alternate" partition specification.

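       For example, an administrator can change the state of a
       hypothetical partition named "debug" at run time with scontrol,
       draining it before maintenance and reopening it afterwards:

```
scontrol update PartitionName=debug State=DRAIN
scontrol update PartitionName=debug State=UP
```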

       TRESBillingWeights
              TRESBillingWeights is used to define the billing weights
              of each TRES type that will be used in calculating the
              usage of a job. The calculated usage is used when
              calculating fairshare and when enforcing the TRES billing
              limit on jobs.

              Billing weights are specified as a comma-separated list
              of <TRES Type>=<TRES Billing Weight> pairs.

              Any TRES Type is available for billing. Note that the
              base unit for memory and burst buffers is megabytes.

              By default the billing of TRES is calculated as the sum
              of all TRES types multiplied by their corresponding
              billing weight.

              The weighted amount of a resource can be adjusted by
              adding a suffix of K, M, G, T or P after the billing
              weight. For example, a memory weight of "mem=.25" on a
              job allocated 8GB will be billed 2048 (8192MB * .25)
              units. A memory weight of "mem=.25G" on the same job will
              be billed 2 (8192MB * (.25/1024)) units.

              Negative values are allowed.

              When a job is allocated 1 CPU and 8 GB of memory on a
              partition configured with
              TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0", the
              billable TRES will be: (1*1.0) + (8*0.25) + (0*2.0) =
              3.0.

              If PriorityFlags=MAX_TRES is configured, the billable
              TRES is calculated as the MAX of individual TRES' on a
              node (e.g. cpus, mem, gres) plus the sum of all global
              TRES' (e.g. licenses). Using the same example above the
              billable TRES will be MAX(1*1.0, 8*0.25) + (0*2.0) =
              2.0.

              If TRESBillingWeights is not defined then the job is
              billed against the total number of allocated CPUs.

              NOTE: TRESBillingWeights doesn't affect job priority
              directly as it is currently not used for the size of the
              job. If you want TRES' to play a role in the job's
              priority then refer to the PriorityWeightTRES option.

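       The weights from the example above would appear on a partition
       definition in slurm.conf; the partition and node names below are
       illustrative:

```
PartitionName=gpu Nodes=tux[0-15] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
```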

PROLOG AND EPILOG SCRIPTS
       There are a variety of prolog and epilog program options that
       execute with various permissions and at various times. The four
       options most likely to be used are: Prolog and Epilog (executed
       once on each compute node for each job) plus PrologSlurmctld and
       EpilogSlurmctld (executed once on the ControlMachine for each
       job).

       NOTE: Standard output and error messages are normally not
       preserved. Explicitly write output and error messages to an
       appropriate location if you wish to preserve that information.

       NOTE: By default the Prolog script is ONLY run on any individual
       node when it first sees a job step from a new allocation; it
       does not run the Prolog immediately when an allocation is
       granted. If no job steps from an allocation are run on a node,
       it will never run the Prolog for that allocation. This Prolog
       behaviour can be changed by the PrologFlags parameter. The
       Epilog, on the other hand, always runs on every node of an
       allocation when the allocation is released.

       If the Epilog fails (returns a non-zero exit code), this will
       result in the node being set to a DRAIN state. If the
       EpilogSlurmctld fails (returns a non-zero exit code), this will
       only be logged. If the Prolog fails (returns a non-zero exit
       code), this will result in the node being set to a DRAIN state
       and the job being requeued in a held state unless
       nohold_on_prolog_fail is configured in SchedulerParameters. If
       the PrologSlurmctld fails (returns a non-zero exit code), this
       will result in the job being requeued to execute on another node
       if possible. Only batch jobs can be requeued. Interactive jobs
       (salloc and srun) will be cancelled if the PrologSlurmctld
       fails.

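       As a concrete illustration, a minimal Prolog script might
       perform a node health check and log its own activity explicitly,
       since stdout/stderr are not preserved. The log path and the
       specific check below are assumptions of this sketch, not Slurm
       requirements:

```shell
#!/bin/sh
# Minimal Prolog sketch. Runs as root on each compute node before the
# first job step of an allocation. A non-zero exit drains the node and
# requeues the job in a held state (unless nohold_on_prolog_fail is
# set in SchedulerParameters).

# Stdout/stderr are discarded, so log explicitly. A real deployment
# would log somewhere like /var/log/slurm/; a relative default is used
# here for illustration.
LOG=${PROLOG_LOG:-./prolog.log}

echo "$(date '+%F %T') prolog start job=${SLURM_JOB_ID:-unknown} user=${SLURM_JOB_USER:-unknown}" >> "$LOG"

# Illustrative health check: fail the Prolog if local scratch is absent.
if [ ! -d /tmp ]; then
    echo "$(date '+%F %T') prolog FAIL job=${SLURM_JOB_ID:-unknown}: no /tmp" >> "$LOG"
    exit 1
fi
```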

       Information about the job is passed to the script using
       environment variables. Unless otherwise specified, these
       environment variables are available to all of the programs.

       SLURM_ARRAY_JOB_ID
              If this job is part of a job array, this will be set to
              the job ID. Otherwise it will not be set. To reference
              this specific task of a job array, combine
              SLURM_ARRAY_JOB_ID with SLURM_ARRAY_TASK_ID (e.g.
              "scontrol update ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ...").
              Available in PrologSlurmctld and EpilogSlurmctld only.

       SLURM_ARRAY_TASK_ID
              If this job is part of a job array, this will be set to
              the task ID. Otherwise it will not be set. To reference
              this specific task of a job array, combine
              SLURM_ARRAY_JOB_ID with SLURM_ARRAY_TASK_ID (e.g.
              "scontrol update ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID} ...").
              Available in PrologSlurmctld and EpilogSlurmctld only.

       SLURM_ARRAY_TASK_MAX
              If this job is part of a job array, this will be set to
              the maximum task ID. Otherwise it will not be set.
              Available in PrologSlurmctld and EpilogSlurmctld only.

       SLURM_ARRAY_TASK_MIN
              If this job is part of a job array, this will be set to
              the minimum task ID. Otherwise it will not be set.
              Available in PrologSlurmctld and EpilogSlurmctld only.

       SLURM_ARRAY_TASK_STEP
              If this job is part of a job array, this will be set to
              the step size of task IDs. Otherwise it will not be set.
              Available in PrologSlurmctld and EpilogSlurmctld only.

       SLURM_CLUSTER_NAME
              Name of the cluster executing the job.

       SLURM_JOB_ACCOUNT
              Account name used for the job. Available in
              PrologSlurmctld and EpilogSlurmctld only.

       SLURM_JOB_CONSTRAINTS
              Features required to run the job. Available in Prolog,
              PrologSlurmctld and EpilogSlurmctld only.

       SLURM_JOB_DERIVED_EC
              The highest exit code of all of the job steps. Available
              in EpilogSlurmctld only.

       SLURM_JOB_EXIT_CODE
              The exit code of the job script (or salloc). The value is
              the status as returned by the wait() system call (see
              wait(2)). Available in EpilogSlurmctld only.

       SLURM_JOB_EXIT_CODE2
              The exit code of the job script (or salloc). The value
              has the format <exit>:<sig>. The first number is the exit
              code, typically as set by the exit() function. The second
              number is the signal that caused the process to terminate
              if it was terminated by a signal. Available in
              EpilogSlurmctld only.

       SLURM_JOB_GID
              Group ID of the job's owner. Available in
              PrologSlurmctld, EpilogSlurmctld and TaskProlog only.

       SLURM_JOB_GPUS
              GPU IDs allocated to the job (if any). Available in the
              Prolog only.

       SLURM_JOB_GROUP
              Group name of the job's owner. Available in
              PrologSlurmctld and EpilogSlurmctld only.

       SLURM_JOB_ID
              Job ID. CAUTION: If this job is the first task of a job
              array, then Slurm commands using this job ID will refer
              to the entire job array rather than this specific task of
              the job array.

       SLURM_JOB_NAME
              Name of the job. Available in PrologSlurmctld and
              EpilogSlurmctld only.

       SLURM_JOB_NODELIST
              Nodes assigned to job. A Slurm hostlist expression.
              "scontrol show hostnames" can be used to convert this to
              a list of individual host names. Available in
              PrologSlurmctld and EpilogSlurmctld only.

       SLURM_JOB_PARTITION
              Partition that job runs in. Available in Prolog,
              PrologSlurmctld and EpilogSlurmctld only.

       SLURM_JOB_UID
              User ID of the job's owner.

       SLURM_JOB_USER
              User name of the job's owner.

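       For example, an EpilogSlurmctld script could use these variables
       to report job completion; the <exit>:<sig> value in
       SLURM_JOB_EXIT_CODE2 splits cleanly on the colon with POSIX
       parameter expansion. The reporting logic itself is illustrative:

```shell
#!/bin/sh
# EpilogSlurmctld sketch: runs on the controller as SlurmUser after
# each job. The environment variables used are those documented above.

code=${SLURM_JOB_EXIT_CODE2:-0:0}
rc=${code%%:*}    # exit code, as set by exit()
sig=${code##*:}   # signal that terminated the job, if any

echo "job ${SLURM_JOB_ID:-?} (${SLURM_JOB_NAME:-?}) for ${SLURM_JOB_USER:-?} ended: rc=$rc sig=$sig"

# SLURM_JOB_NODELIST is a hostlist expression; expand it to individual
# host names where the scontrol command is available.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    scontrol show hostnames "$SLURM_JOB_NODELIST"
fi
```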

NETWORK TOPOLOGY
       Slurm is able to optimize job allocations to minimize network
       contention. Special Slurm logic is used to optimize allocations
       on systems with a three-dimensional interconnect, and
       information about configuring those systems is available on web
       pages available here: <https://slurm.schedmd.com/>. For a
       hierarchical network, Slurm needs to have detailed information
       about how nodes are configured on the network switches.

       Given network topology information, Slurm allocates all of a
       job's resources onto a single leaf of the network (if possible)
       using a best-fit algorithm. Otherwise it will allocate a job's
       resources onto multiple leaf switches so as to minimize the use
       of higher-level switches. The TopologyPlugin parameter controls
       which plugin is used to collect network topology information.
       The only values presently supported are "topology/3d_torus"
       (default for Cray XT/XE systems, performs best-fit logic over
       three-dimensional topology), "topology/none" (default for other
       systems, best-fit logic over one-dimensional topology), and
       "topology/tree" (determine the network topology based upon
       information contained in a topology.conf file, see "man
       topology.conf" for more information). Future plugins may gather
       topology information directly from the network. The topology
       information is optional. If not provided, Slurm will perform a
       best-fit algorithm assuming the nodes are in a one-dimensional
       array as configured and the communications cost is related to
       the node distance in this array.

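       With TopologyPlugin=topology/tree, the switch hierarchy is
       described in topology.conf. For instance, a hypothetical
       two-leaf tree over nodes dev[0-25] might read:

```
SwitchName=leaf1 Nodes=dev[0-12]
SwitchName=leaf2 Nodes=dev[13-25]
SwitchName=core  Switches=leaf[1-2]
```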

RELOCATING CONTROLLERS
       If the cluster's computers used for the primary or backup
       controller will be out of service for an extended period of
       time, it may be desirable to relocate them. In order to do so,
       follow this procedure:

       1. Stop the Slurm daemons
       2. Modify the slurm.conf file appropriately
       3. Distribute the updated slurm.conf file to all nodes
       4. Restart the Slurm daemons

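       Assuming systemd-managed daemons and some site-specific means of
       distributing files (both assumptions of this sketch), the
       procedure might look like:

```
# 1. Stop the daemons
systemctl stop slurmctld     # on the controller(s)
systemctl stop slurmd        # on each compute node
# 2. Edit SlurmctldHost (or ControlMachine/BackupController) in slurm.conf
# 3. Copy the updated slurm.conf to every node (site-specific mechanism)
# 4. Restart the daemons
systemctl start slurmd       # on each compute node
systemctl start slurmctld    # on the controller(s)
```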
       There should be no loss of any running or pending jobs. Ensure
       that any nodes added to the cluster have the current slurm.conf
       file installed.

       CAUTION: If two nodes are simultaneously configured as the
       primary controller (two nodes on which ControlMachine specifies
       the local host and the slurmctld daemon is executing on each),
       system behavior will be destructive. If a compute node has an
       incorrect ControlMachine or BackupController parameter, that
       node may be rendered unusable, but no other harm will result.


EXAMPLE
       #
       # Sample /etc/slurm.conf for dev[0-25].llnl.gov
       # Author: John Doe
       # Date: 11/06/2001
       #
       SlurmctldHost=dev0(12.34.56.78)  # Primary server
       SlurmctldHost=dev1(12.34.56.79)  # Backup server
       #
       AuthType=auth/munge
       Epilog=/usr/local/slurm/epilog
       Prolog=/usr/local/slurm/prolog
       FirstJobId=65536
       InactiveLimit=120
       JobCompType=jobcomp/filetxt
       JobCompLoc=/var/log/slurm/jobcomp
       KillWait=30
       MaxJobCount=10000
       MinJobAge=3600
       PluginDir=/usr/local/lib:/usr/local/slurm/lib
       ReturnToService=0
       SchedulerType=sched/backfill
       SlurmctldLogFile=/var/log/slurm/slurmctld.log
       SlurmdLogFile=/var/log/slurm/slurmd.log
       SlurmctldPort=7002
       SlurmdPort=7003
       SlurmdSpoolDir=/var/spool/slurmd.spool
       StateSaveLocation=/var/spool/slurm.state
       SwitchType=switch/none
       TmpFS=/tmp
       WaitTime=30
       JobCredentialPrivateKey=/usr/local/slurm/private.key
       JobCredentialPublicCertificate=/usr/local/slurm/public.cert
       #
       # Node Configurations
       #
       NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=64000
       NodeName=DEFAULT State=UNKNOWN
       NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
       # Update records for specific DOWN nodes
       DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
       #
       # Partition Configurations
       #
       PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
       PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
       PartitionName=batch Nodes=dev[9-17] MinNodes=4
       PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin

INCLUDE MODIFIERS
       The "Include" key word can be used with modifiers within the
       specified pathname. These modifiers will be replaced with the
       cluster name or other information depending on which modifier is
       specified. If the included file is not an absolute path name
       (i.e. it does not start with a slash), it will be searched for
       in the same directory as the slurm.conf file.

       %c     Cluster name specified in the slurm.conf will be used.

       EXAMPLE
       ClusterName=linux
       include /home/slurm/etc/%c_config
       # Above line interpreted as
       # "include /home/slurm/etc/linux_config"

FILE AND DIRECTORY PERMISSIONS
       There are three classes of files: Files used by slurmctld must
       be accessible by user SlurmUser and accessible by the primary
       and backup control machines. Files used by slurmd must be
       accessible by user root and accessible from every compute node.
       A few files need to be accessible by normal users on all login
       and compute nodes. While many files and directories are listed
       below, most of them will not be used with most configurations.

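       Using paths from the sample configuration above and assuming
       SlurmUser=slurm, ownership and permissions could be set along
       these lines:

```
chown -R slurm /var/spool/slurm.state      # StateSaveLocation: SlurmUser only
chmod 700 /var/spool/slurm.state
chown slurm /var/log/slurm/slurmctld.log   # slurmctld log: writable by SlurmUser
chmod 644 /var/log/slurm/slurmctld.log     # readable by all users
```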
       AccountingStorageLoc
              If this specifies a file, it must be writable by user
              SlurmUser. The file must be accessible by the primary and
              backup control machines. It is recommended that the file
              be readable by all users from login and compute nodes.

       Epilog Must be executable by user root. It is recommended that
              the file be readable by all users. The file must exist on
              every compute node.

       EpilogSlurmctld
              Must be executable by user SlurmUser. It is recommended
              that the file be readable by all users. The file must be
              accessible by the primary and backup control machines.

       HealthCheckProgram
              Must be executable by user root. It is recommended that
              the file be readable by all users. The file must exist on
              every compute node.

       JobCheckpointDir
              Must be writable by user SlurmUser and no other users.
              The file must be accessible by the primary and backup
              control machines.

       JobCompLoc
              If this specifies a file, it must be writable by user
              SlurmUser. The file must be accessible by the primary and
              backup control machines.

       JobCredentialPrivateKey
              Must be readable only by user SlurmUser and writable by
              no other users. The file must be accessible by the
              primary and backup control machines.

       JobCredentialPublicCertificate
              Readable to all users on all nodes. Must not be writable
              by regular users.

       MailProg
              Must be executable by user SlurmUser. Must not be
              writable by regular users. The file must be accessible by
              the primary and backup control machines.

       Prolog Must be executable by user root. It is recommended that
              the file be readable by all users. The file must exist on
              every compute node.

       PrologSlurmctld
              Must be executable by user SlurmUser. It is recommended
              that the file be readable by all users. The file must be
              accessible by the primary and backup control machines.

       ResumeProgram
              Must be executable by user SlurmUser. The file must be
              accessible by the primary and backup control machines.

       SallocDefaultCommand
              Must be executable by all users. The file must exist on
              every login and compute node.

       slurm.conf
              Readable to all users on all nodes. Must not be writable
              by regular users.

       SlurmctldLogFile
              Must be writable by user SlurmUser. The file must be
              accessible by the primary and backup control machines.

       SlurmctldPidFile
              Must be writable by user root. Preferably writable and
              removable by SlurmUser. The file must be accessible by
              the primary and backup control machines.

       SlurmdLogFile
              Must be writable by user root. A distinct file must exist
              on each compute node.

       SlurmdPidFile
              Must be writable by user root. A distinct file must exist
              on each compute node.

       SlurmdSpoolDir
              Must be writable by user root. A distinct file must exist
              on each compute node.

       SrunEpilog
              Must be executable by all users. The file must exist on
              every login and compute node.

       SrunProlog
              Must be executable by all users. The file must exist on
              every login and compute node.

       StateSaveLocation
              Must be writable by user SlurmUser. The file must be
              accessible by the primary and backup control machines.

       SuspendProgram
              Must be executable by user SlurmUser. The file must be
              accessible by the primary and backup control machines.

       TaskEpilog
              Must be executable by all users. The file must exist on
              every compute node.

       TaskProlog
              Must be executable by all users. The file must exist on
              every compute node.

       UnkillableStepProgram
              Must be executable by user SlurmUser. The file must be
              accessible by the primary and backup control machines.

LOGGING
       Note that while Slurm daemons create log files and other files
       as needed, they treat the lack of parent directories as a fatal
       error. This prevents the daemons from running if critical file
       systems are not mounted and will minimize the risk of
       cold-starting (starting without preserving jobs).

       Log files and job accounting files may need to be created/owned
       by the "SlurmUser" uid to be successfully accessed. Use the
       "chown" and "chmod" commands to set the ownership and
       permissions appropriately. See the section FILE AND DIRECTORY
       PERMISSIONS for information about the various files and
       directories used by Slurm.

       It is recommended that the logrotate utility be used to ensure
       that various log files do not become too large. This also
       applies to text files used for accounting, process tracking, and
       the slurmdbd log if they are used.

       Here is a sample logrotate configuration. Make appropriate site
       modifications and save as /etc/logrotate.d/slurm on all nodes.
       See the logrotate man page for more details.

       ##
       # Slurm Logrotate Configuration
       ##
       /var/log/slurm/*.log {
            compress
            missingok
            nocopytruncate
            nodelaycompress
            nomail
            notifempty
            noolddir
            rotate 5
            sharedscripts
            size=5M
            create 640 slurm root
            postrotate
                 pkill -x --signal SIGUSR2 slurmctld
                 pkill -x --signal SIGUSR2 slurmd
                 pkill -x --signal SIGUSR2 slurmdbd
                 exit 0
            endscript
       }

COPYING
       Copyright (C) 2002-2007 The Regents of the University of
       California. Produced at Lawrence Livermore National Laboratory
       (cf, DISCLAIMER). Copyright (C) 2008-2010 Lawrence Livermore
       National Security. Copyright (C) 2010-2017 SchedMD LLC.

       This file is part of Slurm, a resource management program. For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published
       by the Free Software Foundation; either version 2 of the
       License, or (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

FILES
       /etc/slurm.conf

SEE ALSO
       cgroup.conf(5), gethostbyname(3), getrlimit(2), gres.conf(5),
       group(5), hostname(1), scontrol(1), slurmctld(8), slurmd(8),
       slurmdbd(8), slurmdbd.conf(5), srun(1), spank(8), syslog(2),
       topology.conf(5)


November 2019             Slurm Configuration File            slurm.conf(5)