nonstop.conf(5)            Slurm Configuration File            nonstop.conf(5)



NAME
       nonstop.conf - Slurm configuration file for fault-tolerant computing.


DESCRIPTION
       nonstop.conf is an ASCII file which describes the configuration used
       for fault-tolerant computing with Slurm using the optional
       slurmctld/nonstop plugin. This plugin provides a means for users to
       notify Slurm of nodes they believe are suspect, replace a job's failing
       or failed nodes, and extend a job's time limit in response to failures.
       The file location can be modified at system build time using the
       DEFAULT_SLURM_CONF parameter or at execution time by setting the
       SLURM_CONF environment variable. The file will always be located in the
       same directory as the slurm.conf file.

       Parameter names are case insensitive. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
       command "scontrol reconfigure" unless otherwise noted. The
       configuration parameters available include:


       BackupAddr
              Communications address used for the backup slurmctld daemon.
              This can either be a hostname or IP address. This value would
              typically be identical to the value of BackupAddr in the
              slurm.conf file.


       ControlAddr
              Communications address used for the primary slurmctld daemon.
              This can either be a hostname or IP address. This value would
              typically be identical to the value of ControlAddr in the
              slurm.conf file.


       Debug  A number indicating the level of additional logging desired for
              the plugin. The default value is zero, which generates no
              additional logging.


       HotSpareCount
              This identifies how many nodes in each partition should be
              maintained as spare resources. When a job's nodes fail and are
              replaced from this pool, the pool is depleted and then
              replenished when possible using idle resources. The value
              should be a comma-delimited list of partition:count pairs, with
              each partition name and its node count separated by a colon
              (for example, "batch:6,interactive:0").


       MaxSpareNodeCount
              This identifies the maximum number of nodes any single job may
              replace through the job's entire lifetime. This could prevent a
              single job from causing all of the nodes in a cluster to fail.
              By default, there is no maximum node count.


       Port   Port used for communications. The default value is 6820.


       TimeLimitDelay
              If a job requires replacement resources and none are immediately
              available, then permit the job to extend its time limit by the
              length of time required to secure replacement resources, up to
              the number of minutes specified by TimeLimitDelay. This option
              will only take effect if no hot spare resources are available at
              the time replacement resources are requested. This time limit
              extension is in addition to the value calculated using the
              TimeLimitExtend parameter. The default value is zero (no time
              limit extension). The value may not exceed 65533.


       TimeLimitDrop
              Specifies the number of minutes that a job can extend its time
              limit for each failed or failing node removed from the job's
              allocation. The default value is zero (no time limit
              extension). The value may not exceed 65533.


       TimeLimitExtend
              Specifies the number of minutes that a job can extend its time
              limit for each replaced node. The default value is zero (no
              time limit extension). The value may not exceed 65533.
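
       The three time limit parameters above combine additively. The Python
       sketch below is one reading of those descriptions, not the plugin's
       actual accounting: the function name and its arguments are
       hypothetical, and it assumes all values are in minutes and that
       wait_minutes is zero whenever a hot spare was immediately available.

```python
def max_time_limit_extension(replaced_nodes, dropped_nodes, wait_minutes,
                             time_limit_extend=0, time_limit_drop=0,
                             time_limit_delay=0):
    """Upper bound, in minutes, on a job's time limit extension.

    Hypothetical helper: the parameter defaults mirror the documented
    defaults of zero (no time limit extension).
    """
    # TimeLimitExtend applies once per replaced node,
    # TimeLimitDrop once per node removed from the allocation.
    per_node = (replaced_nodes * time_limit_extend
                + dropped_nodes * time_limit_drop)
    # Time spent waiting for replacement resources counts only up to
    # TimeLimitDelay, and is added on top of the per-node amounts.
    delay = min(wait_minutes, time_limit_delay)
    return per_node + delay


example = max_time_limit_extension(
    replaced_nodes=2, dropped_nodes=1, wait_minutes=45,
    time_limit_extend=10, time_limit_drop=20, time_limit_delay=30)
print(example)  # 2*10 + 1*20 + min(45, 30) = 70
```

       With every parameter left at its default of zero, the same call
       returns zero, which matches the documented "no time limit extension"
       behavior.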


       UserDrainAllow
              This identifies a comma-delimited list of user names or user IDs
              of users who are authorized to drain nodes they believe are
              failing. Specify a value of "ALL" to permit any user to drain
              nodes. By default, no users may drain nodes using this
              interface.


       UserDrainDeny
              This identifies a comma-delimited list of user names or user IDs
              of users who are NOT authorized to drain nodes they believe are
              failing. Specifying a value for UserDrainDeny implicitly allows
              all other users to drain nodes (sets the value of UserDrainAllow
              to "ALL").
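
       The interaction between UserDrainAllow and UserDrainDeny can be
       sketched in Python as follows. This is an illustration of the rules
       described above, not the plugin's code: may_drain is a hypothetical
       helper, it compares names as plain strings (it does not map user IDs
       to user names), it matches "ALL" exactly, and it assumes that
       UserDrainDeny governs when both parameters are set.

```python
def may_drain(user, allow=None, deny=None):
    """Return True if `user` may drain nodes.

    `allow` and `deny` are the raw UserDrainAllow / UserDrainDeny
    values, or None when the parameter is not set (hypothetical API).
    """
    def names(value):
        # Split a comma-delimited list, ignoring empty entries.
        return {n.strip() for n in value.split(",") if n.strip()}

    if deny is not None:
        # Setting UserDrainDeny implicitly allows all other users.
        # Assumption: the deny list wins if both parameters are set.
        return user not in names(deny)
    if allow is None:
        # Default: no users may drain nodes.
        return False
    allowed = names(allow)
    return "ALL" in allowed or user in allowed


print(may_drain("adam", allow="adam,brenda"))   # True
print(may_drain("carol", allow="adam,brenda"))  # False
print(may_drain("carol", deny="mallory"))       # True
```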


EXAMPLE
       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodeCount=4
       TimeLimitDelay=30
       TimeLimitDrop=20
       TimeLimitExtend=10
       UserDrainAllow=adam,brenda
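
       To illustrate the syntax rules (case-insensitive parameter names, "#"
       comments, and partition:count pairs in HotSpareCount), here is a
       minimal Python sketch that reads a file like the one above. It is not
       Slurm's parser: both function names are hypothetical, and it assumes
       one Name=Value pair per line with no quoting or line continuation.

```python
def parse_nonstop_conf(text):
    """Parse Name=Value lines into a dict keyed by lower-cased name."""
    params = {}
    for raw in text.splitlines():
        # Anything after "#" is a comment through the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected Name=Value, got: {raw!r}")
        # Parameter names are case insensitive, so normalize the key.
        params[key.strip().lower()] = value.strip()
    return params


def parse_hot_spare_count(value):
    """Turn 'batch:6,interactive:0' into {'batch': 6, 'interactive': 0}."""
    counts = {}
    for pair in value.split(","):
        partition, _, count = pair.partition(":")
        counts[partition.strip()] = int(count)
    return counts


SAMPLE = """\
#
# Sample nonstop.conf file
#
ControlAddr=12.34.56.78
Port=1234   # non-default port
HotSpareCount=batch:6,interactive:0
UserDrainAllow=adam,brenda
"""

conf = parse_nonstop_conf(SAMPLE)
spares = parse_hot_spare_count(conf["hotsparecount"])
print(conf["port"], spares)  # 1234 {'batch': 6, 'interactive': 0}
```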


COPYING
       Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.


SEE ALSO
       slurm.conf(5)



April 2015                 Slurm Configuration File            nonstop.conf(5)