nonstop.conf(5)            Slurm Configuration File            nonstop.conf(5)

NAME

       nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION

       nonstop.conf is an ASCII file which describes the configuration used
       for fault-tolerant computing with Slurm using the optional
       slurmctld/nonstop plugin.  This plugin provides a means for users to
       notify Slurm of nodes they believe are suspect, replace a job's
       failing or failed nodes, and extend a job's time limit in response to
       failures.  The file is always read from the same directory as
       slurm.conf.

       Parameter names are case insensitive.  Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line.  Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
       command "scontrol reconfigure" unless otherwise noted.  The
       configuration parameters available include:

       BackupAddr
              Communications address used for the backup slurmctld daemon.
              This can be either a hostname or an IP address.  This value
              would typically be the same as the secondary SlurmctldHost in
              the slurm.conf file, when applicable.

       ControlAddr
              Communications address used for the primary slurmctld daemon.
              This can be either a hostname or an IP address.  This value
              would typically be the same as the SlurmctldHost in the
              slurm.conf file.

       Debug  A number indicating the level of additional logging desired for
              the plugin.  The default value is zero, which generates no
              additional logging.

       HotSpareCount
              This identifies how many nodes in each partition should be
              maintained as spare resources.  When a job fails, this pool of
              resources will be depleted and then replenished when possible
              using idle resources.  The value should be a comma-delimited
              list of pairs, each giving a partition name and a node count
              separated by a colon (for example, batch:6,interactive:0).

       MaxSpareNodeCount
              This identifies the maximum number of nodes any single job may
              replace through the job's entire lifetime.  This can prevent a
              single job from causing all of the nodes in a cluster to fail.
              By default, there is no maximum node count.

       Port   Port used for communications.  The default value is 6820.

       TimeLimitDelay
              If a job requires replacement resources and none are immediately
              available, then permit a job to extend its time limit by the
              length of time required to secure replacement resources, up to
              the number of minutes specified by TimeLimitDelay.  This option
              will only take effect if no hot spare resources are available at
              the time replacement resources are requested.  This time limit
              extension is in addition to the value calculated using
              TimeLimitExtend.  The default value is zero (no time limit
              extension).  The value may not exceed 65533 seconds.

       TimeLimitDrop
              Specifies the number of minutes that a job can extend its time
              limit for each failed or failing node removed from the job's
              allocation.  The default value is zero (no time limit
              extension).  The value may not exceed 65533 seconds.

       TimeLimitExtend
              Specifies the number of minutes that a job can extend its time
              limit for each replaced node.  The default value is zero (no
              time limit extension).  The value may not exceed 65533 seconds.

       UserDrainAllow
              This identifies a comma-delimited list of user names or user IDs
              of users who are authorized to drain nodes they believe are
              failing.  Specify a value of "ALL" to permit any user to drain
              nodes.  By default, no users may drain nodes using this
              interface.

       UserDrainDeny
              This identifies a comma-delimited list of user names or user IDs
              of users who are NOT authorized to drain nodes they believe are
              failing.  Specifying a value for UserDrainDeny implicitly allows
              all other users to drain nodes (sets the value of UserDrainAllow
              to "ALL").
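
       The interaction between UserDrainAllow and UserDrainDeny can be
       sketched in Python as follows.  This is an illustration of the rules
       described above, not the plugin's code: the function name may_drain
       is hypothetical, users are compared as plain strings, "ALL" is
       matched exactly, and letting UserDrainDeny take precedence when both
       parameters are set is an assumption.

```python
def may_drain(user, allow=None, deny=None):
    """Return True if `user` may drain nodes (illustrative sketch only).

    `allow` and `deny` are the raw UserDrainAllow / UserDrainDeny strings,
    or None when the parameter is not set in nonstop.conf.
    """
    def names(value):
        # Split a comma-delimited list, ignoring blanks and empty items.
        return {item.strip() for item in value.split(",") if item.strip()}

    if deny is not None:
        # Setting UserDrainDeny implicitly allows all other users.
        # Assumption: it also takes precedence over any UserDrainAllow value.
        return user not in names(deny)
    if allow is None:
        # Documented default: no users may drain nodes.
        return False
    allowed = names(allow)
    return "ALL" in allowed or user in allowed
```

       With UserDrainAllow=adam,brenda this sketch permits adam and brenda
       only, while UserDrainDeny=dave alone permits every user except dave.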

EXAMPLE

       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodeCount=4
       TimeLimitDelay=30
       TimeLimitExtend=20
       TimeLimitDrop=10
       UserDrainAllow=adam,brenda
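
       As a rough illustration of the file syntax (case-insensitive
       parameter names, "#" comments, and the HotSpareCount pair list), a
       file of this form could be read with a few lines of Python.  The
       helper names parse_nonstop_conf and parse_hot_spare_count are
       hypothetical, and this sketch is not how Slurm itself parses the
       file.

```python
def parse_nonstop_conf(text):
    """Parse nonstop.conf-style text into a dict keyed by lowercase name."""
    params = {}
    for raw in text.splitlines():
        # Text following "#" is a comment through the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Parameter names are case insensitive, so normalize to lowercase.
        params[name.strip().lower()] = value.strip()
    return params


def parse_hot_spare_count(value):
    """Turn 'batch:6,interactive:0' into {'batch': 6, 'interactive': 0}."""
    spares = {}
    for pair in value.split(","):
        partition, count = pair.split(":", 1)
        spares[partition.strip()] = int(count)
    return spares


SAMPLE = """\
# Sample nonstop.conf file
ControlAddr=12.34.56.78
Port=1234   # trailing comment
HotSpareCount=batch:6,interactive:0
"""

conf = parse_nonstop_conf(SAMPLE)
port = int(conf.get("port", 6820))  # 6820 is the documented default
spares = parse_hot_spare_count(conf["hotsparecount"])
```

       Lowercasing the names on input means later lookups do not depend on
       how the administrator capitalized each parameter.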

COPYING

       Copyright (C) 2013-2022 SchedMD LLC. All rights reserved.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.

SEE ALSO

       slurm.conf(5)

January 2022               Slurm Configuration File            nonstop.conf(5)