knl.conf(5)                Slurm Configuration File                knl.conf(5)

NAME

       knl.conf - Slurm configuration file for Intel Knights Landing processor.

DESCRIPTION

       This ASCII file describes configuration information for Intel Knights
       Landing processors. Its name may depend upon the NodeFeatures plugin
       configured in Slurm. For example, on Cray systems NodeFeatures should
       be set to "knl_cray" and the configuration file will be read from
       "knl_cray.conf". The file is always located in the same directory as
       slurm.conf. This file is optional.

       Parameter names are case insensitive. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
       command "scontrol reconfigure" unless otherwise noted.
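
       The case-insensitivity and comment rules above can be illustrated with
       a minimal Python sketch. It is not Slurm's own parser: the function
       name parse_knl_conf is hypothetical, and the sketch assumes one
       "Name=Value" pair per line.

```python
def parse_knl_conf(text):
    """Parse knl.conf-style text into a dict keyed by lower-cased parameter name."""
    params = {}
    for raw_line in text.splitlines():
        # Anything following "#" is a comment through the end of the line.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # Split on the first "=" only, so a value such as
        # "a2a=core;snc2=thread" keeps its own "=" signs.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a Name=Value line; ignored in this sketch
        # Parameter names are case insensitive, so normalise them.
        params[name.strip().lower()] = value.strip()
    return params


sample = """
# knl_cray.conf
CapmcTimeout=6000      # milliseconds
DEFAULTMCDRAM=flat
NumaCpuBind=a2a=core;snc2=thread
"""
conf = parse_knl_conf(sample)
print(conf["capmctimeout"])
print(conf["defaultmcdram"])
print(conf["numacpubind"])
```

       Lower-casing the names lets "DefaultMCDRAM" and "DEFAULTMCDRAM" resolve
       to the same entry. Values are left untouched.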

       The overall configuration parameters available include:

       AllowMCDRAM
              Specify the MCDRAM modes which jobs are allowed to use. This may
              be a subset of the MCDRAM modes supported by the node. If not
              specified, all MCDRAM modes supported by the node are available
              for use. The comma-separated list of allowed MCDRAM modes may
              include any of the modes listed below.

              cache            All of MCDRAM to be used as cache.

              equal            MCDRAM to be used partly as cache and partly
                               combined with primary memory.

              flat             MCDRAM to be combined with primary memory into
                               a "flat" memory space.

       AllowNUMA
              Specify the NUMA modes which jobs are allowed to use. This may
              be a subset of the NUMA modes supported by the node. If not
              specified, all NUMA modes supported by the node are available
              for use. The comma-separated list of allowed NUMA modes may
              include any of the modes listed below. Note that Slurm can only
              support homogeneous nodes (e.g. the same number of cores per
              NUMA node). KNL snc4 and quad modes are not homogeneous: each
              NUMA node will have either 16 or 18 cores. This will result in
              Slurm using the lower core count, finding a total of 256
              threads rather than 272 threads, and setting the node to a DOWN
              state. Therefore it is recommended that snc4 and quad modes not
              be allowed at this time.

              a2a              All to all

              snc2             Sub-NUMA cluster 2

              snc4             Sub-NUMA cluster 4

              hemi             Hemisphere

              quad             Quadrant
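
              The 256 versus 272 thread counts mentioned above can be
              reproduced with a short Python sketch. The layout of four NUMA
              nodes with 18, 18, 16 and 16 cores and the figure of 4 hardware
              threads per core are assumptions chosen to match those totals,
              and the function name thread_counts is hypothetical.

```python
THREADS_PER_CORE = 4  # assumption: 4 hardware threads per KNL core


def thread_counts(cores_per_numa_node):
    """Return (actual, assumed) thread totals for a list of per-NUMA-node core counts."""
    # Threads really present on the node.
    actual = sum(cores_per_numa_node) * THREADS_PER_CORE
    # A homogeneous model applies the lowest core count to every NUMA node.
    assumed = (min(cores_per_numa_node) * len(cores_per_numa_node)
               * THREADS_PER_CORE)
    return actual, assumed


# Assumed layout: two NUMA nodes with 18 cores and two with 16 cores.
actual, assumed = thread_counts([18, 18, 16, 16])
print(actual, assumed)  # 272 256
```

              The mismatch between the two totals is what leads to the node
              being set to a DOWN state.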

       AllowUserBoot
              A comma-delimited list of users allowed to modify a node's
              MCDRAM or NUMA state. If not specified, then any user can
              change a node's state and reboot it.

       BootTime
              Estimated time to reboot a node in seconds. Used as a basis for
              optimizing scheduling decisions. The default value is 300
              seconds (5 minutes) for the "knl_generic" plugin and 2700
              seconds (45 minutes) for the "knl_cray" plugin.

       CapmcPath
              Fully qualified path to the capmc program. The default value is
              "/opt/cray/capmc/default/bin/capmc". This parameter is used
              only by the "knl_cray" plugin.

       CapmcPollFreq
              Time interval, in seconds, between polls of the capmc program
              for node state changes. The default value is 45 seconds. This
              parameter is used only by the "knl_cray" plugin.

       CapmcRetries
              Number of times to retry failed operations of the capmc program.
              The default value is 4.

       CapmcTimeout
              Time limit, in milliseconds, for the capmc program to return
              status information. The default value is 60000 milliseconds and
              the minimum value is 1000 milliseconds. This parameter is used
              by the "knl_cray" plugin, plus the capmc_suspend and
              capmc_resume programs used for suspending and resuming nodes.

       CnselectPath
              Fully qualified path to the cnselect program. The default value
              is "/opt/cray/sdb/default/bin/cnselect". This parameter is used
              only by the "knl_cray" plugin.

       DefaultMCDRAM
              Specify the default MCDRAM mode for jobs which do not specify a
              value. This is only used when a node is booted and the job
              which has been allocated the node does not specify a desired
              MCDRAM mode. The value can be one of the possible values
              identified with the AllowMCDRAM configuration parameter above.
              The default value is "cache".

       DefaultNUMA
              Specify the default NUMA mode for jobs which do not specify a
              value. This is only used when a node is booted and the job
              which has been allocated the node does not specify a desired
              NUMA mode. The value can be one of the possible values
              identified with the AllowNUMA configuration parameter above.
              The default value is "a2a".

       Force  If set to a non-zero value then load the node_features/generic
              plugin even on non-KNL nodes. Used primarily for testing
              purposes.

       LogFile
              Fully qualified path to a log file. The default value is
              SlurmctldLogFile from the slurm.conf configuration file. This
              option is used only by the capmc_suspend and capmc_resume
              programs (which power down and reboot nodes in the appropriate
              configuration).

       McPath Fully qualified path to memory controller device file directory.
              Children of this directory with names of the form
              "mc#/csrow#/ue_count" (i.e. the count of unrecoverable memory
              errors) will be monitored for non-zero values. If such errors
              are detected, the node will be set to a DOWN state and the
              slurmd daemon will shut down. The default value is
              "/sys/devices/system/edac/mc". See also UmeCheckInterval.
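
              A minimal Python sketch of the kind of check described above
              follows. It is an illustration only, not the slurmd
              implementation. The function name find_memory_errors is
              hypothetical, and the "#" in "mc#/csrow#" is assumed to stand
              for a numeric suffix that a "*" wildcard can match.

```python
import glob
import os


def find_memory_errors(mc_path="/sys/devices/system/edac/mc"):
    """Return {file_path: count} for every ue_count file holding a non-zero value."""
    errors = {}
    # Match children of the form mc#/csrow#/ue_count under mc_path.
    pattern = os.path.join(mc_path, "mc*", "csrow*", "ue_count")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                count = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed file; skipped in this sketch
        if count != 0:
            errors[path] = count
    return errors
```

              A caller could run find_memory_errors() periodically and treat
              any non-empty result as a detected error. A missing directory
              simply yields an empty result.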

       NodeRebootWeight
              If a compute node requires a reboot to be usable for a pending
              job, then reset the node's weight to the specified value. The
              default value is 4,294,967,294 (0xfffffffe). See also "Weight"
              in the node configuration specification of slurm.conf.

       NumaCpuBind
              Contains pairs of NUMA modes and the CpuBind mode to set a node
              to for that mode. Any compute node found with or set to the
              specified NUMA mode will have that node's CpuBind field set to
              the configured value. Each NUMA mode is followed by an equal
              sign and the desired CpuBind mode for that NUMA mode. Multiple
              pairs of NUMA mode and CpuBind mode should be given as a
              semicolon-separated list. By default, changes to a node's NUMA
              mode will not affect that node's CpuBind mode. See the example
              below.
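
              The pair syntax can be illustrated with a small Python sketch
              that splits a NumaCpuBind value into a mapping. The function
              name parse_numa_cpu_bind is hypothetical and the error handling
              is an assumption of this sketch, not Slurm's behavior.

```python
def parse_numa_cpu_bind(value):
    """Split "mode=bind;mode=bind" into a {numa_mode: cpu_bind_mode} dict."""
    mapping = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input or a trailing semicolon
        numa_mode, sep, cpu_bind = pair.partition("=")
        if not sep or not numa_mode.strip() or not cpu_bind.strip():
            raise ValueError(f"malformed NumaCpuBind entry: {pair!r}")
        mapping[numa_mode.strip().lower()] = cpu_bind.strip().lower()
    return mapping


binds = parse_numa_cpu_bind("a2a=core;snc2=thread;snc4=thread")
print(binds)
```

              With this input, a node in a2a mode maps to the "core" CpuBind
              mode, while snc2 and snc4 map to "thread".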

       SyscfgPath
              Fully qualified path to Intel's syscfg program, which identifies
              current KNL configuration by viewing BIOS settings. If not
              defined, the current BIOS setting will not be available. The
              default value is "/usr/bin/syscfg". This parameter is used only
              by the "knl_generic" plugin.

       SyscfgTimeout
              Timeout for the syscfg program in milliseconds. The default
              value is 1000 milliseconds. For Dell KNL systems, experience has
              shown that a higher value of 10000 milliseconds is more
              appropriate.

       SystemType
              Used to distinguish the type of KNL system in use. Possible
              options are "Dell" and "Intel". The default value is "Intel".
              This parameter is used only by the "knl_generic" plugin.

       UmeCheckInterval
              Interval, in microseconds, between checks for Uncorrectable
              Memory Errors (UME). If such errors are detected, the node will
              be set to a DOWN state and the slurmd daemon will shut down.
              The default value is 0 (disabled). See also McPath.

       ValidateMode
              If set to 1 then validate, but do not modify, the node's
              configured MCDRAM and NUMA modes from the slurm.conf file. If
              the actual modes do not match the configured values, the node
              will be set to a DOWN state. Every KNL node's MCDRAM and NUMA
              states must both be listed in the slurm.conf file. This
              parameter is used only by the "knl_cray" plugin.

EXAMPLE

       ###################################################################
       # knl_cray.conf
       # Slurm configuration file for Intel Knights Landing on Cray system
       ###################################################################
       CapmcPath=/opt/cray/capmc/default/bin/capmc
       CapmcTimeout=6000
       DefaultMCDRAM=flat
       DefaultNUMA=a2a
       NumaCpuBind=a2a=core;snc2=thread;snc4=thread
       LogFile=/var/tmp/slurm_node_feature.log
       SyscfgPath=/usr/sbin/syscfg

COPYING

       Copyright (C) 2015-2022 SchedMD LLC.

       This file is part of Slurm, a resource management program. For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it under
       the terms of the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.

SEE ALSO

       slurm.conf(5)

January 2022               Slurm Configuration File                knl.conf(5)