nonstop.conf(5)            Slurm Configuration File            nonstop.conf(5)

NAME

       nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION

       nonstop.conf is an ASCII file which describes the configuration used
       for fault-tolerant computing with Slurm using the optional
       slurmctld/nonstop plugin.  This plugin provides a means for users to
       notify Slurm of nodes they believe are suspect, replace a job's
       failing or failed nodes, and extend a job's time limit in response to
       failures.  The file location can be modified at system build time
       using the DEFAULT_SLURM_CONF parameter or at execution time by
       setting the SLURM_CONF environment variable.  The file will always be
       located in the same directory as the slurm.conf file.

       Parameter names are case insensitive.  Any text following a "#" in
       the configuration file is treated as a comment through the end of
       that line.  Changes to the configuration file take effect upon
       restart of Slurm daemons, daemon receipt of the SIGHUP signal, or
       execution of the command "scontrol reconfigure" unless otherwise
       noted.  The configuration parameters available include:


       BackupAddr
              Communications address used for the backup slurmctld daemon.
              This can be either a hostname or an IP address.  This value
              would typically be identical to the value of BackupAddr in the
              slurm.conf file.


       ControlAddr
              Communications address used for the primary slurmctld daemon.
              This can be either a hostname or an IP address.  This value
              would typically be identical to the value of ControlAddr in
              the slurm.conf file.


       Debug  A number indicating the level of additional logging desired
              for the plugin.  The default value is zero, which generates no
              additional logging.


       HotSpareCount
              This identifies how many nodes in each partition should be
              maintained as spare resources.  When a job fails, this pool of
              resources will be depleted and then replenished when possible
              using idle resources.  The value should be a comma-delimited
              list of partition:count pairs, each giving a partition name
              and a node count separated by a colon.
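
              As a sketch, using the partition names from the sample file
              in the EXAMPLE section ("batch" and "interactive", which are
              illustrative and not defaults), the following line would keep
              six spare nodes in batch and none in interactive:

                     HotSpareCount=batch:6,interactive:0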


       MaxSpareNodeCount
              This identifies the maximum number of nodes any single job may
              replace through the job's entire lifetime.  This could prevent
              a single job from causing all of the nodes in a cluster to
              fail.  By default, there is no maximum node count.


       Port   Port used for communications.  The default value is 6820.


       TimeLimitDelay
              If a job requires replacement resources and none are
              immediately available, then permit a job to extend its time
              limit by the length of time required to secure replacement
              resources, up to the number of minutes specified by
              TimeLimitDelay.  This option will only take effect if no hot
              spare resources are available at the time replacement
              resources are requested.  This time limit extension is in
              addition to the value calculated using TimeLimitExtend.  The
              default value is zero (no time limit extension).  The value
              may not exceed 65533 seconds.


       TimeLimitDrop
              Specifies the number of minutes that a job can extend its time
              limit for each failed or failing node removed from the job's
              allocation.  The default value is zero (no time limit
              extension).  The value may not exceed 65533 seconds.


       TimeLimitExtend
              Specifies the number of minutes that a job can extend its time
              limit for each replaced node.  The default value is zero (no
              time limit extension).  The value may not exceed 65533 seconds.
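
              As a worked illustration under assumed values (none of these
              are defaults), consider:

                     TimeLimitDelay=30
                     TimeLimitDrop=10
                     TimeLimitExtend=20

              Reading the descriptions above literally, a job that replaces
              two nodes could extend its time limit by up to 2 x 20 = 40
              minutes.  If no hot spares were available and the job waited
              12 minutes for the replacements, that wait (capped at 30
              minutes) could be added, for up to 52 minutes in total.  A
              job that instead drops one failed node without replacing it
              could extend its time limit by up to 10 minutes.  How the
              plugin accounts for these extensions internally is not
              specified here.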


       UserDrainAllow
              This identifies a comma-delimited list of user names or user
              IDs of users who are authorized to drain nodes they believe
              are failing.  Specify a value of "ALL" to permit any user to
              drain nodes.  By default, no users may drain nodes using this
              interface.


       UserDrainDeny
              This identifies a comma-delimited list of user names or user
              IDs of users who are NOT authorized to drain nodes they
              believe are failing.  Specifying a value for UserDrainDeny
              implicitly allows all other users to drain nodes (sets the
              value of UserDrainAllow to "ALL").
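
              For illustration, with hypothetical user names, the line

                     UserDrainDeny=carol,dave

              would, per the description above, permit every user except
              carol and dave to drain nodes, as if UserDrainAllow were set
              to "ALL".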

EXAMPLE

       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodeCount=4
       TimeLimitDelay=30
       TimeLimitExtend=20
       TimeLimitDrop=10
       UserDrainAllow=adam,brenda

COPYING

       Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
       for more details.

SEE ALSO

       slurm.conf(5)



April 2015                 Slurm Configuration File            nonstop.conf(5)