nonstop.conf(5)            Slurm Configuration File            nonstop.conf(5)

NAME
       nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION
       nonstop.conf is an ASCII file which describes the configuration used
       for fault-tolerant computing with Slurm using the optional
       slurmctld/nonstop plugin. This plugin provides a means for users to
       notify Slurm of nodes they believe are suspect, replace a job's
       failing or failed nodes, and extend a job's time limit in response to
       failures. The file must be located in the same directory as
       slurm.conf.

       Parameter names are case insensitive. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of
       the command "scontrol reconfigure" unless otherwise noted. The
       configuration parameters available include:

       BackupAddr
              Communications address used for the backup slurmctld daemon.
              This can either be a hostname or IP address. This value would
              typically be the same as the secondary SlurmctldHost in the
              slurm.conf file, when applicable.

       ControlAddr
              Communications address used for the slurmctld daemon. This can
              either be a hostname or IP address. This value would typically
              be the same as the SlurmctldHost in the slurm.conf file.

       Debug  A number indicating the level of additional logging desired for
              the plugin. The default value is zero, which generates no
              additional logging.

       HotSpareCount
              This identifies how many nodes in each partition should be
              maintained as spare resources. When a job fails, this pool of
              resources will be depleted and then replenished when possible
              using idle resources. The value should be a comma-delimited
              list of partition name and node count pairs, with each
              partition name separated from its node count by a colon.
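
       The HotSpareCount value format can be illustrated with a short Python
       sketch. This is not Slurm's own parser; parse_hot_spare_count is a
       hypothetical helper, and its tolerance for stray commas and its
       rejection of negative counts are assumptions made for the example.

```python
def parse_hot_spare_count(value):
    """Parse 'partition:count,partition:count' into a {partition: count} dict."""
    spares = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # assumption: ignore empty items such as a trailing comma
        partition, sep, count = pair.partition(":")
        partition = partition.strip()
        if not sep or not partition:
            raise ValueError(f"expected 'partition:count', got {pair!r}")
        nodes = int(count)  # raises ValueError if the count is not a number
        if nodes < 0:
            raise ValueError(f"node count must not be negative: {pair!r}")
        spares[partition] = nodes
    return spares


if __name__ == "__main__":
    # prints {'batch': 6, 'interactive': 0}
    print(parse_hot_spare_count("batch:6,interactive:0"))
```

       A count of zero, as for the "interactive" partition here, is kept in
       the result so that a partition without spares stays distinguishable
       from one that was never listed.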

       MaxSpareNodeCount
              This identifies the maximum number of nodes any single job may
              replace through the job's entire lifetime. This could prevent a
              single job from causing all of the nodes in a cluster to fail.
              By default, there is no maximum node count.

       Port   Port used for communications. The default value is 6820.

       TimeLimitDelay
              If a job requires replacement resources and none are immediately
              available, then permit a job to extend its time limit by the
              length of time required to secure replacement resources, up to
              the number of minutes specified by TimeLimitDelay. This option
              will only take effect if no hot spare resources are available at
              the time replacement resources are requested. This time limit
              extension is in addition to the value calculated using
              TimeLimitExtend. The default value is zero (no time limit
              extension). The value may not exceed 65533 seconds.

       TimeLimitDrop
              Specifies the number of minutes that a job can extend its time
              limit for each failed or failing node removed from the job's
              allocation. The default value is zero (no time limit extension).
              The value may not exceed 65533 seconds.

       TimeLimitExtend
              Specifies the number of minutes that a job can extend its time
              limit for each replaced node. The default value is zero (no
              time limit extension). The value may not exceed 65533 seconds.
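
       The way the three TimeLimit parameters might combine can be worked
       through in Python. This is only a sketch: max_extension_minutes and
       its argument names are hypothetical, and summing the TimeLimitDrop and
       TimeLimitExtend allowances for a single job is an assumption; this
       page states only that the TimeLimitDelay extension is in addition to
       the one calculated from TimeLimitExtend.

```python
def max_extension_minutes(delay_cap, extend_per_node, drop_per_node,
                          waited=0, replaced=0, dropped=0):
    """Upper bound, in minutes, on a job's time limit extension.

    delay_cap       -- TimeLimitDelay
    extend_per_node -- TimeLimitExtend
    drop_per_node   -- TimeLimitDrop
    waited          -- minutes spent waiting for replacement resources
    replaced        -- number of nodes replaced
    dropped         -- number of failed or failing nodes removed
    """
    values = (delay_cap, extend_per_node, drop_per_node,
              waited, replaced, dropped)
    if any(v < 0 for v in values):
        raise ValueError("all arguments must be non-negative")
    # The waiting time counts only up to the TimeLimitDelay cap.
    delay_part = min(waited, delay_cap)
    # Assumption: the per-node allowances simply add up.
    return delay_part + extend_per_node * replaced + drop_per_node * dropped


if __name__ == "__main__":
    # 45 minutes waited (capped at 30), 2 nodes replaced, 1 node dropped:
    # 30 + 2*20 + 1*10, prints 80
    print(max_extension_minutes(30, 20, 10, waited=45, replaced=2, dropped=1))
```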

       UserDrainAllow
              This identifies a comma-delimited list of user names or user IDs
              of users who are authorized to drain nodes they believe are
              failing. Specify a value of "ALL" to permit any user to drain
              nodes. By default, no users may drain nodes using this
              interface.

       UserDrainDeny
              This identifies a comma-delimited list of user names or user IDs
              of users who are NOT authorized to drain nodes they believe are
              failing. Specifying a value for UserDrainDeny implicitly allows
              all other users to drain nodes (sets the value of UserDrainAllow
              to "ALL").
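
       The interaction between UserDrainAllow and UserDrainDeny can be
       sketched as a small Python predicate. The function may_drain is
       hypothetical and is not how the plugin is implemented. It assumes that
       entries are compared as plain strings (so a user ID matches only if
       the caller passes the ID), that "ALL" is matched exactly as written,
       and that a non-empty deny list takes precedence when both parameters
       are set; this page does not specify that last case.

```python
def _split(value):
    """Turn a comma-delimited string (or None) into a set of entries."""
    if not value:
        return set()
    return {item.strip() for item in value.split(",") if item.strip()}


def may_drain(user, allow=None, deny=None):
    """Return True if `user` may drain nodes under the given settings."""
    allowed = _split(allow)
    denied = _split(deny)
    if denied:
        # UserDrainDeny implicitly sets UserDrainAllow to "ALL":
        # everyone except the listed users may drain nodes.
        return user not in denied
    if "ALL" in allowed:
        return True
    # Default: with no allow list, nobody may drain nodes.
    return user in allowed


if __name__ == "__main__":
    print(may_drain("adam", allow="adam,brenda"))  # prints True
    print(may_drain("carl", allow="adam,brenda"))  # prints False
    print(may_drain("carl", deny="dave"))          # prints True
```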

EXAMPLE
       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodeCount=4
       TimeLimitDelay=30
       TimeLimitDrop=20
       TimeLimitExtend=10
       UserDrainAllow=adam,brenda
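
       The file syntax described on this page (case-insensitive parameter
       names, "#" comments running to the end of the line, Name=Value
       settings) can be read with a few lines of Python. This is an
       illustrative sketch, not Slurm's parser: parse_nonstop_conf is a
       hypothetical helper, and it assumes one Name=Value setting per line.

```python
def parse_nonstop_conf(text):
    """Return {lowercased parameter name: value string} for a config text."""
    params = {}
    for raw in text.splitlines():
        # Everything after a "#" is a comment through the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected Name=Value, got {raw!r}")
        # Parameter names are case insensitive, so store them lowercased.
        params[name.strip().lower()] = value.strip()
    return params


SAMPLE = """\
#
# Sample nonstop.conf file
#
ControlAddr=12.34.56.78
Port=1234
HotSpareCount=batch:6,interactive:0
UserDrainAllow=adam,brenda  # trailing comment
"""

if __name__ == "__main__":
    conf = parse_nonstop_conf(SAMPLE)
    print(conf["port"])            # prints 1234
    print(conf["userdrainallow"])  # prints adam,brenda
```

       Values are returned as strings; converting Port to an integer or
       splitting list-valued parameters is left to the caller.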

COPYING
       Copyright (C) 2013-2022 SchedMD LLC. All rights reserved.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.

SEE ALSO
       slurm.conf(5)

January 2022               Slurm Configuration File            nonstop.conf(5)