nonstop.conf(5) Slurm Configuration File nonstop.conf(5)



NAME
nonstop.conf - Slurm configuration file for fault-tolerant computing.


DESCRIPTION
nonstop.conf is an ASCII file which describes the configuration used
for fault-tolerant computing with Slurm using the optional
slurmctld/nonstop plugin. This plugin provides a means for users to
notify Slurm of nodes they believe are suspect, replace a job's failing
or failed nodes, and extend a job's time limit in response to failures.
The file location can be modified at system build time using the
DEFAULT_SLURM_CONF parameter or at execution time by setting the
SLURM_CONF environment variable. The file will always be located in the
same directory as the slurm.conf file.

Parameter names are case insensitive. Any text following a "#" in the
configuration file is treated as a comment through the end of that
line. Changes to the configuration file take effect upon restart of
Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
command "scontrol reconfigure" unless otherwise noted. The configuration
parameters available include:
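The syntax rules above (case-insensitive names, "#" comments running to
the end of the line) can be illustrated with a minimal Python sketch. It
is not Slurm's own parser: the function name parse_nonstop_conf and the
choice to store names in lower case are assumptions made for this
example only.

```python
def parse_nonstop_conf(text):
    """Return {lower-cased parameter name: value string}.

    Hypothetical helper for illustration; not part of Slurm.
    """
    params = {}
    for raw_line in text.splitlines():
        # Text after "#" is a comment through the end of the line.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected Name=Value, got {line!r}")
        # Parameter names are case insensitive, so normalize them.
        params[name.strip().lower()] = value.strip()
    return params


sample = """
# Illustrative values only
ControlAddr=12.34.56.78   # trailing comment
PORT=6820
"""
conf = parse_nonstop_conf(sample)
print(conf["controladdr"], conf["port"])  # prints: 12.34.56.78 6820
```

Values are kept as strings here; converting them to numbers or lists
is left to the code that consumes each parameter.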


BackupAddr
Communications address used for the backup slurmctld daemon. This can
be either a hostname or an IP address. This value would typically be
the same as the secondary SlurmctldHost in the slurm.conf file, when
applicable.


ControlAddr
Communications address used for the primary slurmctld daemon. This can
be either a hostname or an IP address. This value would typically be
the same as the SlurmctldHost in the slurm.conf file.


Debug A number indicating the level of additional logging desired for
the plugin. The default value is zero, which generates no additional
logging.


HotSpareCount
This identifies how many nodes in each partition should be maintained
as spare resources. When a job's nodes fail, replacements are drawn
from this pool, which is then replenished when possible using idle
resources. The value should be a comma-delimited list of
partition:count pairs, for example "batch:6,interactive:0".
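As an illustration of the HotSpareCount value format, the following
Python sketch splits a value into per-partition spare counts. The
function name parse_hot_spare_count and its error handling are
assumptions made for this example; Slurm's own parsing may differ.

```python
def parse_hot_spare_count(value):
    """Map each partition name to its requested number of spare nodes.

    Hypothetical helper for illustration; not part of Slurm.
    """
    spares = {}
    for pair in value.split(","):
        # Each pair has the form partition:count.
        partition, sep, count = pair.strip().partition(":")
        if not sep or not partition:
            raise ValueError(f"expected partition:count, got {pair!r}")
        try:
            nodes = int(count)
        except ValueError:
            raise ValueError(f"node count is not an integer: {pair!r}")
        if nodes < 0:
            raise ValueError(f"node count is negative: {pair!r}")
        spares[partition] = nodes
    return spares


print(parse_hot_spare_count("batch:6,interactive:0"))
# prints: {'batch': 6, 'interactive': 0}
```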


MaxSpareNodeCount
This identifies the maximum number of nodes any single job may replace
over the job's entire lifetime. This can prevent a single job from
causing all of the nodes in a cluster to fail. By default, there is no
maximum node count.


Port Port used for communications. The default value is 6820.


TimeLimitDelay
If a job requires replacement resources and none are immediately
available, then permit the job to extend its time limit by the length
of time required to secure replacement resources, up to the number of
minutes specified by TimeLimitDelay. This option only takes effect if
no hot spare resources are available at the time replacement resources
are requested. This time limit extension is in addition to the value
calculated using TimeLimitExtend. The default value is zero (no time
limit extension). The value may not exceed 65533.


TimeLimitDrop
Specifies the number of minutes that a job can extend its time limit
for each failed or failing node removed from the job's allocation. The
default value is zero (no time limit extension). The value may not
exceed 65533.


TimeLimitExtend
Specifies the number of minutes that a job can extend its time limit
for each replaced node. The default value is zero (no time limit
extension). The value may not exceed 65533.
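The three time limit parameters can be combined in a short worked
example. The Python sketch below is one reading of the descriptions
above, not the plugin's actual code: it assumes the per-node allowances
simply add up and that the waiting time is capped once by
TimeLimitDelay. The function name and argument names are hypothetical.

```python
def max_extension_minutes(replaced, dropped, wait_minutes,
                          extend=0, drop=0, delay=0):
    """Upper bound, in minutes, on a job's time limit extension.

    replaced     -- number of nodes the job replaced
    dropped      -- number of failed or failing nodes the job removed
    wait_minutes -- time spent waiting for replacement resources
    extend, drop, delay -- TimeLimitExtend, TimeLimitDrop, TimeLimitDelay
    """
    # Per-node allowances (assumed to be additive).
    per_node = replaced * extend + dropped * drop
    # Waiting time counts only up to TimeLimitDelay.
    waited = min(wait_minutes, delay)
    return per_node + waited


# Illustrative values: 2 nodes replaced, 1 dropped, 45 minutes waiting.
print(max_extension_minutes(2, 1, 45, extend=20, drop=10, delay=30))
# prints: 80
```

With all three parameters left at their default of zero, the function
returns zero, matching the "no time limit extension" default.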


UserDrainAllow
This identifies a comma-delimited list of user names or user IDs of
users who are authorized to drain nodes they believe are failing.
Specify a value of "ALL" to permit any user to drain nodes. By default,
no users may drain nodes using this interface.


UserDrainDeny
This identifies a comma-delimited list of user names or user IDs of
users who are NOT authorized to drain nodes they believe are failing.
Specifying a value for UserDrainDeny implicitly allows all other users
to drain nodes (sets the value of UserDrainAllow to "ALL").
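The interaction between UserDrainAllow and UserDrainDeny can be
sketched as a small decision function. This Python example is an
interpretation of the two descriptions above, not the plugin's code.
The function name may_drain is hypothetical, and for simplicity it
compares entries as plain strings rather than resolving user names to
user IDs.

```python
def may_drain(user, allow=None, deny=None):
    """Return True if user may drain nodes under the given settings.

    allow and deny are the raw UserDrainAllow and UserDrainDeny values,
    or None when the parameter is not set.
    """
    if deny is not None:
        # Setting UserDrainDeny implicitly allows everyone else.
        denied = {entry.strip() for entry in deny.split(",")}
        return user not in denied
    if allow is None:
        # By default, no users may drain nodes.
        return False
    if allow.strip() == "ALL":
        return True
    allowed = {entry.strip() for entry in allow.split(",")}
    return user in allowed


print(may_drain("adam", allow="adam,brenda"))  # prints: True
print(may_drain("carl", allow="adam,brenda"))  # prints: False
print(may_drain("carl", deny="adam"))          # prints: True
```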


EXAMPLE
#
# Sample nonstop.conf file
# Date: 12 Feb 2013
#
ControlAddr=12.34.56.78
BackupAddr=12.34.56.79
Port=1234
#
HotSpareCount=batch:6,interactive:0
MaxSpareNodeCount=4
TimeLimitDelay=30
TimeLimitExtend=20
TimeLimitDrop=10
UserDrainAllow=adam,brenda


COPYING
Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.

Slurm is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.


SEE ALSO
slurm.conf(5)



April 2015 Slurm Configuration File nonstop.conf(5)