knl.conf(5)              Slurm Configuration File              knl.conf(5)

NAME
knl.conf - Slurm configuration file for Intel Knights Landing
processors.

DESCRIPTION
This ASCII file describes configuration information for Intel Knights
Landing processors. Its name may depend upon the NodeFeatures plugin
configured in Slurm. For example, on Cray systems NodeFeatures should
be configured as "knl_cray", and the configuration file will then be
read from "knl_cray.conf". The file location can be modified at system
build time using the DEFAULT_SLURM_CONF parameter or at execution time
by setting the SLURM_CONF environment variable. The file will always be
located in the same directory as the slurm.conf file. This file is
optional.

Parameter names are case-insensitive. Any text following a "#" in the
configuration file is treated as a comment through the end of that
line. Changes to the configuration file take effect upon restart of
Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
command "scontrol reconfigure" unless otherwise noted.

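A short sketch of the comment and case rules just described, using
parameters documented later in this page. The values are illustrative
only.

```conf
# A whole-line comment.
DefaultNUMA=a2a      # text after "#" is ignored through the end of the line
defaultmcdram=flat   # parameter names are case-insensitive
```
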
The overall configuration parameters available include:

AllowMCDRAM
Specify the MCDRAM modes which jobs are allowed to use. This may be a
subset of the MCDRAM modes supported by the node. If not specified, all
MCDRAM modes supported by the node are available for use. The
comma-separated list of allowed MCDRAM modes may include any of the
modes listed below.

cache   All of MCDRAM is used as cache.

equal   MCDRAM is used partly as cache and partly combined with
        primary memory.

flat    MCDRAM is combined with primary memory into a "flat" memory
        space.

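As an illustrative sketch, the line below restricts jobs to two of the
three MCDRAM modes. The choice of "cache" and "flat" is an assumption
made for the example, not a recommendation from this page.

```conf
# Jobs may request cache or flat mode; equal mode is not offered.
AllowMCDRAM=cache,flat
```
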
AllowNUMA
Specify the NUMA modes which jobs are allowed to use. This may be a
subset of the NUMA modes supported by the node. If not specified, all
NUMA modes supported by the node are available for use. The
comma-separated list of allowed NUMA modes may include any of the modes
listed below. Note that Slurm can only support homogeneous nodes (e.g.
the same number of cores per NUMA node). KNL snc4 and quad modes are
not homogeneous: each NUMA node will have either 16 or 18 cores. This
results in Slurm using the lower core count, finding a total of 256
threads rather than 272 threads, and setting the node to a DOWN state.
Therefore it is recommended that snc4 and quad modes not be allowed at
this time.

a2a     All to all

snc2    Sub-NUMA cluster 2

snc4    Sub-NUMA cluster 4

hemi    Hemisphere

quad    Quadrant

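Following the recommendation above, a configuration might list only
the remaining NUMA modes. This is a sketch; which of these modes are
actually usable depends on what the nodes themselves support.

```conf
# snc4 and quad are left out, per the note on non-homogeneous modes.
AllowNUMA=a2a,snc2,hemi
```
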
AllowUserBoot
A comma-delimited list of users allowed to modify a node's MCDRAM or
NUMA state. If not specified, then any user can change a node's state
and reboot it.

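A minimal sketch of such a list follows. The user names "alice" and
"bob" are hypothetical placeholders, not accounts defined anywhere in
this page.

```conf
# Only these two (hypothetical) users may change MCDRAM or NUMA state.
AllowUserBoot=alice,bob
```
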
BootTime
Estimated time to reboot a node in seconds. Used as a basis for
optimizing scheduling decisions. The default value is 300 seconds (5
minutes) for the "knl_generic" plugin and 2700 seconds (45 minutes) for
the "knl_cray" plugin.

CapmcPath
Fully qualified path to the capmc program. The default value is
"/opt/cray/capmc/default/bin/capmc". This parameter is used only by the
"knl_cray" plugin.

CapmcPollFreq
Time interval, in seconds, between polls of the capmc program for node
state changes. The default value is 45 seconds. This parameter is used
only by the "knl_cray" plugin.

CapmcRetries
Number of times to retry failed operations of the capmc program. The
default value is 4.

CapmcTimeout
Time limit for the capmc program to return status information, in
milliseconds. The default value is 60000 milliseconds and the minimum
value is 1000 milliseconds. This parameter is used by the "knl_cray"
plugin, plus the capmc_suspend and capmc_resume programs used for
suspending and resuming nodes.

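The capmc-related parameters can be set together, as in the sketch
below. The non-default values are illustrative assumptions chosen only
to show the units involved; they are not tuning advice.

```conf
# capmc settings for the "knl_cray" plugin (illustrative values)
CapmcPath=/opt/cray/capmc/default/bin/capmc   # documented default path
CapmcPollFreq=30       # seconds (default is 45)
CapmcRetries=6         # retry count (default is 4)
CapmcTimeout=90000     # milliseconds (default 60000, minimum 1000)
```
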
CnselectPath
Fully qualified path to the cnselect program. The default value is
"/opt/cray/sdb/default/bin/cnselect". This parameter is used only by
the "knl_cray" plugin.

DefaultMCDRAM
Specify the default MCDRAM mode for jobs which do not specify a value.
This is only used when a node is booted and the job which has been
allocated the node does not specify a desired MCDRAM mode. The value
can be one of the possible values identified with the AllowMCDRAM
configuration parameter above. The default value is "cache".

DefaultNUMA
Specify the default NUMA mode for jobs which do not specify a value.
This is only used when a node is booted and the job which has been
allocated the node does not specify a desired NUMA mode. The value can
be one of the possible values identified with the AllowNUMA
configuration parameter above. The default value is "a2a".

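The sketch below pairs the two default parameters with the
corresponding allow lists. Keeping each default inside its allow list
is this example's own choice; this page does not state whether that is
required.

```conf
AllowMCDRAM=cache,flat
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=flat   # used when a job requests no MCDRAM mode
DefaultNUMA=a2a      # used when a job requests no NUMA mode
```
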
Force
If set to a non-zero value, then load the node_features/generic plugin
even on non-KNL nodes. Used primarily for testing purposes.

LogFile
Fully qualified path to a log file. The default value is
SlurmctldLogFile from the slurm.conf configuration file. This option is
used only by the capmc_suspend and capmc_resume programs (which power
down and reboot nodes in the appropriate configuration).

McPath
Fully qualified path to the memory controller device file directory.
Children of this directory with names of the form
"mc#/csrow#/ue_count" (i.e. the count of uncorrectable memory errors)
will be monitored for non-zero values. If such errors are detected, the
node will be set to a DOWN state and the slurmd daemon will shut down.
The default value is "/sys/devices/system/edac/mc". See also
UmeCheckInterval.

NodeRebootWeight
If a compute node requires a reboot to be usable for a pending job,
then reset the node's weight to the specified value. The default value
is 4,294,967,294 (0xfffffffe). See also "Weight" in the node
configuration specification of slurm.conf.

NumaCpuBind
Contains pairs of NUMA modes and the CpuBind mode to set a node to for
that mode. Any compute node found with or set to the specified NUMA
mode will have that node's CpuBind field set to the configured value.
Each NUMA mode is followed by an equal sign and the desired CpuBind
mode for that NUMA mode. Multiple pairs of NUMA mode and CpuBind mode
should be given as a semicolon-separated list. By default, changes to a
node's NUMA mode will not affect that node's CpuBind mode. See the
example below.

SyscfgPath
Fully qualified path to Intel's syscfg program, which identifies the
current KNL configuration by viewing BIOS settings. If not defined, the
current BIOS setting will not be available. The default value is
"/usr/bin/syscfg". This parameter is used only by the "knl_generic"
plugin.

SyscfgTimeout
Timeout for the syscfg program in milliseconds. The default value is
1000 milliseconds. For Dell KNL systems, experience has shown that a
higher value of 10000 milliseconds is more appropriate.

SystemType
Identifies the type of KNL system in use. Possible options are "Dell"
and "Intel". The default value is "Intel". This parameter is used only
by the "knl_generic" plugin.

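For a Dell system, the syscfg-related parameters might be combined as
in the sketch below. The file name in the comment is an assumption made
by analogy with "knl_cray.conf"; the timeout follows the Dell
suggestion given under SyscfgTimeout.

```conf
# Assumed file name: knl_generic.conf
SystemType=Dell
SyscfgPath=/usr/bin/syscfg   # documented default
SyscfgTimeout=10000          # milliseconds; value suggested for Dell systems
```
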
UmeCheckInterval
Interval, in microseconds, between checks for Uncorrectable Memory
Errors (UME). If such errors are detected, the node will be set to a
DOWN state and the slurmd daemon will shut down. The default value is 0
(disabled). See also McPath.

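A minimal sketch that enables the check, using the interval unit stated
above. The one-second interval is an illustrative assumption, not a
recommended setting.

```conf
McPath=/sys/devices/system/edac/mc   # documented default
UmeCheckInterval=1000000             # microseconds, i.e. one check per second
```
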
ValidateMode
If set to 1, then validate, but do not modify, the node's configured
MCDRAM and NUMA modes from the slurm.conf file. If the actual modes do
not match the configured values, the node will be set to a DOWN state.
Every KNL node's MCDRAM and NUMA states must both be listed in the
slurm.conf file. This parameter is used only by the "knl_cray" plugin.

EXAMPLE
###################################################################
# knl_cray.conf
# Slurm configuration file for Intel Knights Landing on Cray system
###################################################################
CapmcPath=/opt/cray/capmc/default/bin/capmc
CapmcTimeout=6000
DefaultMCDRAM=flat
DefaultNUMA=a2a
NumaCpuBind=a2a=core;snc2=thread;snc4=thread
LogFile=/var/tmp/slurm_node_feature.log
SyscfgPath=/usr/sbin/syscfg

COPYING
Copyright (C) 2015-2021 SchedMD LLC.

This file is part of Slurm, a resource management program. For
details, see <https://slurm.schedmd.com/>.

Slurm is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.

Slurm is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

SEE ALSO
slurm.conf(5)

June 2021                Slurm Configuration File              knl.conf(5)