gres.conf(5)               Slurm Configuration File               gres.conf(5)


NAME
       gres.conf - Slurm configuration file for Generic RESource (GRES)
       management.

DESCRIPTION
       gres.conf is an ASCII file which describes the configuration of
       Generic RESource(s) (GRES) on each compute node. If the GRES
       information in the slurm.conf file does not fully describe those
       resources, then a gres.conf file should be included on each compute
       node and the slurm controller. The file will always be located in the
       same directory as slurm.conf.

       If the GRES information in the slurm.conf file fully describes those
       resources (i.e. no "Cores", "File" or "Links" specification is
       required for that GRES type or that information is automatically
       detected), that information may be omitted from the gres.conf file
       and only the configuration information in the slurm.conf file will be
       used. The gres.conf file may be omitted completely if the
       configuration information in the slurm.conf file fully describes all
       GRES.

       If using the gres.conf file to describe the resources available to
       nodes, the first parameter on the line should be NodeName. If
       configuring Generic Resources without specifying nodes, the first
       parameter on the line should be Name.

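       For example (node names and device path are hypothetical), the same
       GPU could be described either way:

              # Describe resources on the node reading this file
              Name=gpu File=/dev/nvidia0

              # Describe resources for specific nodes in a shared gres.conf
              NodeName=tux[0-15] Name=gpu File=/dev/nvidia0
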
       Parameter names are case insensitive. Any text following a "#" in the
       configuration file is treated as a comment through the end of that
       line. Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of
       the command "scontrol reconfigure" unless otherwise noted.

       NOTE: Slurm support for gres/[mps|shard] requires the use of the
       select/cons_tres plugin. For more information on how to configure
       MPS, see https://slurm.schedmd.com/gres.html#MPS_Management. For more
       information on how to configure Sharding, see
       https://slurm.schedmd.com/gres.html#Sharding.

       For more information on GRES scheduling in general, see
       https://slurm.schedmd.com/gres.html.

       The overall configuration parameters available include:

       AutoDetect
              The hardware detection mechanisms to enable for automatic GRES
              configuration. Currently, the options are:

              nvml    Automatically detect NVIDIA GPUs. Requires the NVIDIA
                      Management Library (NVML).

              off     Do not automatically detect any GPUs. Used to override
                      other options.

              oneapi  Automatically detect Intel GPUs. Requires the Intel
                      oneAPI Level Zero library.

              rsmi    Automatically detect AMD GPUs. Requires the ROCm System
                      Management Interface (ROCm SMI) Library.

              AutoDetect can be on a line by itself, in which case it will
              globally apply to all lines in gres.conf by default. In
              addition, AutoDetect can be combined with NodeName to only
              apply to certain nodes. Node-specific AutoDetects will trump
              the global AutoDetect. A node-specific AutoDetect only needs
              to be specified once per node. If specified multiple times for
              the same node, they must all be the same value. To unset
              AutoDetect for a node when a global AutoDetect is set, simply
              set it to "off" in a node-specific GRES line, e.g.:
              "NodeName=tux3 AutoDetect=off Name=gpu File=/dev/nvidia[0-3]".
              AutoDetect cannot be used with cloud nodes.

              AutoDetect will automatically detect files, cores, links, and
              any other hardware. If a parameter such as File, Cores, or
              Links is specified when AutoDetect is used, then the specified
              values are used to sanity check the auto-detected values. If
              there is a mismatch, then the node's state is set to invalid
              and the node is drained.

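              For example (node names, type and device paths hypothetical),
              detection can be enabled globally while one set of nodes also
              lists its devices explicitly so that AutoDetect sanity checks
              them:

                 AutoDetect=nvml
                 NodeName=tux[0-3] Name=gpu Type=tesla File=/dev/nvidia[0-1]
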
       Count  Number of resources of this name/type available on this node.
              The default value is set to the number of File values specified
              (if any), otherwise the default value is one. A suffix of "K",
              "M", "G", "T" or "P" may be used to multiply the number by
              1024, 1048576, 1073741824, etc. respectively. For example:
              "Count=10G".

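              For instance, a count-only GRES (hypothetical name and type)
              could be defined with a suffix; Count=4G is equivalent to
              writing out 4294967296 (4 x 1073741824):

                 Name=bandwidth Type=lustre Count=4G Flags=CountOnly
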
       Cores  Optionally specify the core index numbers for the specific
              cores which can use this resource. For example, it may be
              strongly preferable to use specific cores with specific GRES
              devices (e.g. on a NUMA architecture). While Slurm can track
              and assign resources at the CPU or thread level, the
              scheduling algorithms used to co-allocate GRES devices with
              CPUs operate at a socket or NUMA level for job allocations.
              Therefore it is not possible to preferentially assign GRES to
              different specific CPUs on the same NUMA node or socket, and
              this option should generally be used to identify all cores on
              some socket. Job step allocations requested with --exact,
              however, do look at cores directly, so more specific core
              identification may be useful there.

              Multiple cores may be specified using a comma-delimited list
              or a range may be specified using a "-" separator (e.g.
              "0,1,2,3" or "0-3"). If a job specifies
              --gres-flags=enforce-binding, then only the identified cores
              can be allocated with each generic resource. This will tend to
              improve performance of jobs, but delay the allocation of
              resources to them. If specified and a job is not submitted
              with the --gres-flags=enforce-binding option, the identified
              cores will be preferred for scheduling with each generic
              resource.

              If --gres-flags=disable-binding is specified, then any core
              can be used with the resources, which also increases the speed
              of Slurm's scheduling algorithm but can degrade the
              application performance. The --gres-flags=disable-binding
              option is currently required to use more CPUs than are bound
              to a GRES (e.g. if a GPU is bound to the CPUs on one socket,
              but resources on more than one socket are required to run the
              job). If any core can be effectively used with the resources,
              then do not specify the Cores option, for improved speed in
              the Slurm scheduling logic. A restart of the slurmctld is
              needed for changes to the Cores option to take effect.

              NOTE: Since Slurm must be able to perform resource management
              on heterogeneous clusters having various processing unit
              numbering schemes, a logical core index must be specified
              instead of the physical core index. That logical core index
              might not correspond to your physical core index number. Core
              0 will be the first core on the first socket, while core 1
              will be the second core on the first socket. This numbering
              coincides with the logical core number (Core L#) seen in
              "lstopo -l" command output.

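              For example, on a hypothetical two-socket node where logical
              cores 0-7 are on the first socket and 8-15 on the second, GPUs
              attached to each socket could be described as:

                 Name=gpu Type=tesla File=/dev/nvidia[0-1] Cores=0-7
                 Name=gpu Type=tesla File=/dev/nvidia[2-3] Cores=8-15
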
       File   Fully qualified pathname of the device files associated with a
              resource. The name can include a numeric range suffix to be
              interpreted by Slurm (e.g. File=/dev/nvidia[0-3]).

              This field is generally required if enforcement of generic
              resource allocations is to be supported (i.e. it prevents
              users from making use of resources allocated to a different
              user). Enforcement of the file allocation relies upon Linux
              Control Groups (cgroups) and Slurm's task/cgroup plugin, which
              will place the allocated files into the job's cgroup and
              prevent use of other files. Please see Slurm's Cgroups Guide
              for more information: https://slurm.schedmd.com/cgroups.html.

              If File is specified then Count must be either set to the
              number of file names specified or not set (the default value
              is the number of files specified). The exception to this is
              MPS/Sharding. For either of these GRES, each GPU would be
              identified by device file using the File parameter and Count
              would specify the number of entries that would correspond to
              that GPU. For MPS, this is typically 100 or some multiple of
              100. For Sharding, it is typically the maximum number of jobs
              that could simultaneously share that GPU.

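              For example (device paths hypothetical), a node with two GPUs
              could offer 100 MPS entries and 8 shards per GPU:

                 Name=gpu   File=/dev/nvidia[0-1]
                 Name=mps   Count=100 File=/dev/nvidia0
                 Name=mps   Count=100 File=/dev/nvidia1
                 Name=shard Count=8   File=/dev/nvidia0
                 Name=shard Count=8   File=/dev/nvidia1
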
              If using a card with Multi-Instance GPU functionality, use
              MultipleFiles instead. File and MultipleFiles are mutually
              exclusive.

              NOTE: File is required for all gpu-typed GRES.

              NOTE: If you specify the File parameter for a resource on some
              node, the option must be specified on all nodes and Slurm will
              track the assignment of each specific resource on each node.
              Otherwise Slurm will only track a count of allocated resources
              rather than the state of each individual device file.

              NOTE: Drain a node before changing the count of records with
              File parameters (e.g. if you want to add or remove GPUs from a
              node's configuration). Failure to do so will result in any job
              using those GRES being aborted.

              NOTE: When specifying File, Count is limited in size
              (currently 1024) for each node.

       Flags  Optional flags that can be specified to change configured
              behavior of the GRES.

              Allowed values at present are:

              CountOnly        Do not attempt to load the plugin as this
                               GRES will only be used to track counts of
                               GRES used. This avoids attempting to load a
                               non-existent plugin, which can affect
                               filesystems with high latency metadata
                               operations for non-existent files.

              one_sharing      To be used on a shared gres. If using a
                               shared gres (mps) on top of a sharing gres
                               (gpu), only allow one of the sharing gres to
                               be used by the shared gres. This is the
                               default for MPS.

                               NOTE: If a gres has this flag configured it
                               is global, so all other nodes with that gres
                               will have this flag implied. This flag is not
                               compatible with all_sharing for a specific
                               gres.

              all_sharing      To be used on a shared gres. This is the
                               opposite of one_sharing and can be used to
                               allow all sharing gres (gpu) on a node to be
                               used for shared gres (mps).

                               NOTE: If a gres has this flag configured it
                               is global, so all other nodes with that gres
                               will have this flag implied. This flag is not
                               compatible with one_sharing for a specific
                               gres.

              nvidia_gpu_env   Set environment variable CUDA_VISIBLE_DEVICES
                               for all GPUs on the specified node(s).

              amd_gpu_env      Set environment variable ROCR_VISIBLE_DEVICES
                               for all GPUs on the specified node(s).

              intel_gpu_env    Set environment variable ZE_AFFINITY_MASK for
                               all GPUs on the specified node(s).

              opencl_env       Set environment variable GPU_DEVICE_ORDINAL
                               for all GPUs on the specified node(s).

              no_gpu_env       Set no GPU-specific environment variables.
                               This is mutually exclusive with all other
                               environment-related flags.

              If no environment-related flags are specified, then
              nvidia_gpu_env, amd_gpu_env, intel_gpu_env, and opencl_env
              will be implicitly set by default. If AutoDetect is used and
              environment-related flags are not specified, then
              AutoDetect=nvml will set nvidia_gpu_env, AutoDetect=rsmi will
              set amd_gpu_env, and AutoDetect=oneapi will set intel_gpu_env.
              Conversely, specified environment-related flags will always
              override AutoDetect.

              Environment-related flags set on one GRES line will be
              inherited by the GRES line directly below it if no
              environment-related flags are specified on that line and if it
              is of the same node, name, and type. Environment-related flags
              must be the same for GRES of the same node, name, and type.

              Note that there is a known issue with the AMD ROCm runtime
              where ROCR_VISIBLE_DEVICES is processed first, and then
              CUDA_VISIBLE_DEVICES is processed. To avoid the issues caused
              by this, set Flags=amd_gpu_env for AMD GPUs so only
              ROCR_VISIBLE_DEVICES is set.

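              As a sketch (node names, device paths and GPU type are
              hypothetical), a count-only GRES and AMD GPUs limited to
              ROCR_VISIBLE_DEVICES could be configured as:

                 NodeName=tux[0-7] Name=bandwidth Type=lustre Count=4G Flags=CountOnly
                 NodeName=tux[8-11] Name=gpu Type=mi100 File=/dev/dri/renderD[128-131] Flags=amd_gpu_env
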
       Links  A comma-delimited list of numbers identifying the number of
              connections between this device and other devices to allow
              coscheduling of better connected devices. This is an ordered
              list in which the number of connections this specific device
              has to device number 0 would be in the first position, the
              number of connections it has to device number 1 in the second
              position, etc. A -1 indicates the device itself and a 0
              indicates no connection. If specified, then this line can only
              contain a single GRES device (i.e. can only contain a single
              file via File).

              This is an optional value and is usually automatically
              determined if AutoDetect is enabled. A typical use case would
              be to identify GPUs having NVLink connectivity. Note that for
              GPUs, the minor number assigned by the OS and used in the
              device file (i.e. the X in /dev/nvidiaX) is not necessarily
              the same as the device number/index. The device number is
              created by sorting the GPUs by PCI bus ID and then numbering
              them starting from the smallest bus ID. See
              https://slurm.schedmd.com/gres.html#GPU_Management

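              A sketch for a hypothetical four-GPU node where devices 0 and
              1 are joined by two NVLink connections, devices 2 and 3
              likewise, and there is no direct link between the two pairs:

                 Name=gpu File=/dev/nvidia0 Links=-1,2,0,0
                 Name=gpu File=/dev/nvidia1 Links=2,-1,0,0
                 Name=gpu File=/dev/nvidia2 Links=0,0,-1,2
                 Name=gpu File=/dev/nvidia3 Links=0,0,2,-1
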
       MultipleFiles
              Fully qualified pathname of the device files associated with a
              resource. Graphics cards using Multi-Instance GPU (MIG)
              technology will present multiple device files that should be
              managed as a single generic resource. The file names can be a
              comma-separated list or can include a numeric range suffix
              (e.g. MultipleFiles=/dev/nvidia[0-3]).

              Drain a node before changing the count of records with the
              MultipleFiles parameter, such as when adding or removing GPUs
              from a node's configuration. Failure to do so will result in
              any job using those GRES being aborted.

              When not using GPUs with MIG functionality, use File instead.
              MultipleFiles and File are mutually exclusive.

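              For instance (paths and type are hypothetical; with
              AutoDetect=nvml these are normally discovered automatically),
              a single MIG instance spanning its parent device and two
              capability device files might be described as:

                 Name=gpu Type=a100_3g.20gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
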
       Name   Name of the generic resource. Any desired name may be used.
              The name must match a value in GresTypes in slurm.conf. Each
              generic resource has an optional plugin which can provide
              resource-specific functionality. Generic resources that
              currently include an optional plugin are:

              gpu    Graphics Processing Unit

              mps    CUDA Multi-Process Service (MPS)

              nic    Network Interface Card

              shard  Shards of a gpu

       NodeName
              An optional NodeName specification can be used to permit one
              gres.conf file to be used for all compute nodes in a cluster
              by specifying the node(s) that each line should apply to. The
              NodeName specification can use a Slurm hostlist specification
              as shown in the example below.

       Type   An optional arbitrary string identifying the type of generic
              resource. For example, this might be used to identify a
              specific model of GPU, which users can then specify in a job
              request. A restart of the slurmctld and slurmd daemons is
              required for changes to the Type option to take effect.

              NOTE: If using autodetect functionality and defining the Type
              in your gres.conf file, the Type specified should match or be
              a substring of the value that is detected, using an underscore
              in lieu of any spaces.
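
              For example, if detection reports a device whose name
              normalizes to "tesla_v100-sxm2-32gb", any one of the following
              hypothetical Type values would satisfy that rule:

                 Name=gpu Type=tesla_v100-sxm2-32gb File=/dev/nvidia0
                 Name=gpu Type=v100 File=/dev/nvidia0
                 Name=gpu Type=tesla File=/dev/nvidia0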

EXAMPLES
       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Define GPU devices with MPS support, with AutoDetect sanity checking
       ##################################################################
       AutoDetect=nvml
       Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
       Name=gpu Type=tesla File=/dev/nvidia1 COREs=2,3
       Name=mps Count=100 File=/dev/nvidia0 COREs=0,1
       Name=mps Count=100 File=/dev/nvidia1 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Overwrite system defaults and explicitly configure three GPUs
       ##################################################################
       Name=gpu Type=tesla File=/dev/nvidia[0-1] COREs=0,1
       # Name=gpu Type=tesla File=/dev/nvidia[2-3] COREs=2,3
       # NOTE: nvidia2 device is out of service
       Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use a single gres.conf file for all compute nodes - positive method
       ##################################################################
       ## Explicitly specify devices on nodes tux0-tux15
       # NodeName=tux[0-15] Name=gpu File=/dev/nvidia[0-3]
       # NOTE: tux3 nvidia1 device is out of service
       NodeName=tux[0-2] Name=gpu File=/dev/nvidia[0-3]
       NodeName=tux3 Name=gpu File=/dev/nvidia[0,2-3]
       NodeName=tux[4-15] Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use NVML to gather GPU configuration information
       # for all nodes except one
       ##################################################################
       AutoDetect=nvml
       NodeName=tux3 AutoDetect=off Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Specify some nodes with NVML, some with RSMI, and some with no
       # AutoDetect
       ##################################################################
       NodeName=tux[0-7] AutoDetect=nvml
       NodeName=tux[8-11] AutoDetect=rsmi
       NodeName=tux[12-15] Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Define 'bandwidth' GRES to use as a way to limit the
       # resource use on these nodes for workflow purposes
       ##################################################################
       NodeName=tux[0-7] Name=bandwidth Type=lustre Count=4G Flags=CountOnly

COPYING
       Copyright (C) 2010 The Regents of the University of California.
       Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
       Copyright (C) 2010-2022 SchedMD LLC.

       This file is part of Slurm, a resource management program. For
       details, see <https://slurm.schedmd.com/>.

       Slurm is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either version 2 of the License, or (at
       your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       for more details.

SEE ALSO
       slurm.conf(5)


January 2023               Slurm Configuration File               gres.conf(5)