nonstop.conf(5)            Slurm Configuration File            nonstop.conf(5)

NAME

       nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION

       nonstop.conf is an ASCII file which describes the configuration used
       for fault-tolerant computing with Slurm using the optional
       slurmctld/nonstop plugin. This plugin provides a means for users to
       notify Slurm of nodes they believe are suspect, replace a job's
       failing or failed nodes, and extend a job's time limit in response to
       failures. The file location can be modified at system build time
       using the DEFAULT_SLURM_CONF parameter or at execution time by
       setting the SLURM_CONF environment variable. The file will always be
       located in the same directory as the slurm.conf file.

       Parameter names are case insensitive. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of
       the command "scontrol reconfigure" unless otherwise noted. The
       configuration parameters available include:
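
       As a brief illustration of these syntax rules (the value shown is
       arbitrary), the two assignments below are alternative spellings of
       the same parameter, since names are case insensitive, and everything
       from "#" to the end of a line is ignored. A real file would contain
       only one of them:

              # A whole-line comment
              Debug=1     # trailing comment, ignored by the parser
              debug=1     # same parameter as the line above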


       BackupAddr
              Communications address used for the backup slurmctld daemon.
              This can be either a hostname or an IP address. This value
              would typically be the same as the secondary SlurmctldHost in
              the slurm.conf file, when applicable.


       ControlAddr
              Communications address used for the slurmctld daemon. This can
              be either a hostname or an IP address. This value would
              typically be the same as the SlurmctldHost in the slurm.conf
              file.


       Debug  A number indicating the level of additional logging desired
              for the plugin. The default value is zero, which generates no
              additional logging.


       HotSpareCount
              This identifies how many nodes in each partition should be
              maintained as spare resources. When a job's node fails, this
              pool of resources is drawn upon and then replenished from idle
              resources when possible. The value should be a comma-delimited
              list of partition:count pairs, each consisting of a partition
              name and a node count separated by a colon.
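
              For instance, assuming partitions named "batch" and
              "interactive" exist (hypothetical names, matching the EXAMPLE
              section), the following would request six spare nodes in the
              first partition and none in the second:

                     HotSpareCount=batch:6,interactive:0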


       MaxSpareNodeCount
              This identifies the maximum number of nodes any single job may
              replace through the job's entire lifetime. This could prevent
              a single job from causing all of the nodes in a cluster to
              fail. By default, there is no maximum node count.


       Port   Port used for communications. The default value is 6820.


       TimeLimitDelay
              If a job requires replacement resources and none are
              immediately available, then permit a job to extend its time
              limit by the length of time required to secure replacement
              resources, up to the number of minutes specified by
              TimeLimitDelay. This option will only take effect if no hot
              spare resources are available at the time replacement
              resources are requested. This time limit extension is in
              addition to the value calculated using TimeLimitExtend. The
              default value is zero (no time limit extension). The value
              may not exceed 65533 minutes.


       TimeLimitDrop
              Specifies the number of minutes that a job can extend its time
              limit for each failed or failing node removed from the job's
              allocation. The default value is zero (no time limit
              extension). The value may not exceed 65533 minutes.


       TimeLimitExtend
              Specifies the number of minutes that a job can extend its time
              limit for each replaced node. The default value is zero (no
              time limit extension). The value may not exceed 65533 minutes.
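
              As a worked illustration of how TimeLimitDelay, TimeLimitDrop
              and TimeLimitExtend might combine, here is a sketch with
              arbitrary values. It reflects one reading of the descriptions
              above and assumes the waiting time is simply capped at
              TimeLimitDelay:

                     TimeLimitDelay=30
                     TimeLimitExtend=20
                     TimeLimitDrop=10

                     # One node replaced after waiting 12 minutes for it:
                     #   extension of up to 20 + 12 = 32 minutes
                     # One node replaced after waiting 45 minutes for it:
                     #   the wait counts for at most TimeLimitDelay, so
                     #   extension of up to 20 + 30 = 50 minutes
                     # Two failed nodes dropped without replacement:
                     #   extension of up to 2 x 10 = 20 minutes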


       UserDrainAllow
              This identifies a comma-delimited list of user names or user
              IDs of users who are authorized to drain nodes they believe
              are failing. Specify a value of "ALL" to permit any user to
              drain nodes. By default, no users may drain nodes using this
              interface.


       UserDrainDeny
              This identifies a comma-delimited list of user names or user
              IDs of users who are NOT authorized to drain nodes they
              believe are failing. Specifying a value for UserDrainDeny
              implicitly allows all other users to drain nodes (sets the
              value of UserDrainAllow to "ALL").
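
              For example, with hypothetical user names and IDs, one of the
              two parameters might be set as follows:

                     # Only adam and the user with ID 1001 may drain nodes
                     UserDrainAllow=adam,1001

                     # Alternatively: every user except carl may drain nodes
                     UserDrainDeny=carl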

EXAMPLE

       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodeCount=4
       TimeLimitDelay=30
       TimeLimitExtend=20
       TimeLimitDrop=10
       UserDrainAllow=adam,brenda

COPYING

       Copyright (C) 2013-2021 SchedMD LLC. All rights reserved.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.

SEE ALSO

       slurm.conf(5)

June 2021                  Slurm Configuration File            nonstop.conf(5)