CONDOR_GPU_DISCOVERY(1)         HTCondor Manual        CONDOR_GPU_DISCOVERY(1)

NAME

       condor_gpu_discovery - HTCondor Manual

       Output GPU-related ClassAd attributes

SYNOPSIS

       condor_gpu_discovery -help

       condor_gpu_discovery [<options>]

DESCRIPTION

       condor_gpu_discovery outputs ClassAd attributes corresponding to a
       host's GPU capabilities. It can presently report CUDA and OpenCL
       devices; which type(s) of device(s) it reports is determined by which
       libraries, if any, it can find when it runs; this reflects what GPU
       jobs will find on that host when they run. (Note that some HTCondor
       configuration settings may cause the environment to differ between
       jobs and the HTCondor daemons in ways that change library discovery.)

       If CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL is set in the environment
       when condor_gpu_discovery is run, it will report only devices present
       in those lists.

       This tool is not available for macOS platforms.

       With no command line options, the single ClassAd attribute DetectedGPUs
       is printed. If the value is 0, no GPUs were detected. If one or more
       GPUs were detected, the value is a string, presented as a comma and
       space separated list of the GPUs discovered, where each is given a name
       further used as the prefix string in other attribute names. Where there
       is more than one GPU of a particular type, the prefix string includes
       a GPU id value identifying the device; these can be integer values
       that monotonically increase from 0 when the -by-index option is used or
       globally unique identifiers when the -short-uuid or -uuid argument is
       used.

       For example, a discovery of two GPUs with -by-index may output

          DetectedGPUs="CUDA0, CUDA1"

       Further command line options use "CUDA" either with or without one of
       the integer values 0 or 1 as the name of the device properties ad for
       -nested properties, or as the prefix string in attribute names when
       -not-nested properties are chosen.

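       A script that consumes this output usually needs the device names as
       a list. The following is a minimal sketch, not part of HTCondor; the
       function name parse_detected_gpus is hypothetical, and it assumes the
       line has exactly the form shown in the example above.

```python
def parse_detected_gpus(line):
    """Return the GPU names from a DetectedGPUs line, or [] if the value is 0."""
    name, _, value = line.strip().partition("=")
    if name.strip() != "DetectedGPUs":
        raise ValueError(f"not a DetectedGPUs attribute: {line!r}")
    value = value.strip()
    if value == "0":
        # A value of 0 means no GPUs were detected.
        return []
    # Otherwise the value is a quoted, comma-and-space separated list.
    return [item.strip() for item in value.strip('"').split(",") if item.strip()]


print(parse_detected_gpus('DetectedGPUs="CUDA0, CUDA1"'))  # ['CUDA0', 'CUDA1']
```

       Each returned name is the prefix string that the other options use
       when naming per-device attributes.
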
       For machines with more than one or two NVIDIA devices, it is
       recommended that you also use the -short-uuid or -uuid option. The
       uuid value assigned by NVIDIA to each GPU is unique, so using this
       option provides stable device identifiers for your devices. The
       -short-uuid option uses only part of the uuid, but it is highly likely
       to still be unique for devices on a single machine. As of HTCondor 9.0
       -short-uuid is the default. When -short-uuid is used, discovery of two
       GPUs may look like this:

          DetectedGPUs="GPU-ddc1c098, GPU-9dc7c6d6"

       Any NVIDIA runtime library later than 9.0 will accept the above
       identifiers in the CUDA_VISIBLE_DEVICES environment variable.

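       As an illustration of passing those identifiers on to a job, the
       sketch below builds an environment with CUDA_VISIBLE_DEVICES set to a
       comma-separated list. The helper name env_for_gpus is hypothetical and
       is not something HTCondor provides.

```python
import os


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device identifiers.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_ids)
    return env


env = env_for_gpus(["GPU-ddc1c098", "GPU-9dc7c6d6"], base_env={})
print(env["CUDA_VISIBLE_DEVICES"])  # GPU-ddc1c098,GPU-9dc7c6d6
```

       The resulting mapping can be handed to a child process, for example
       through the env argument of subprocess.run.
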
       If the NVML library is available, and a multi-instance GPU (MIG)
       capable device is present, has MIG enabled, and has created compute
       instances for each MIG instance, condor_gpu_discovery will report
       those instances as distinct devices. Their names will be in the long
       UUID form unless the -short-uuid option is used, because they cannot
       be enumerated via CUDA. MIG instances don't have some of the
       properties reported by the -properties, -extra, and -dynamic options;
       these properties will be omitted. If MIG is enabled on any GPU in the
       system, some properties become unavailable for every GPU in the
       system; condor_gpu_discovery will report what it can.

OPTIONS

          -help  Print usage information and exit.

          -properties
                 In addition to the DetectedGPUs attribute, display some of
                 the attributes of the GPUs. Each of these attributes will be
                 in a nested ClassAd (-nested) or have a prefix string at the
                 beginning of its name (-not-nested). The displayed CUDA
                 attributes are Capability, DeviceName, DriverVersion,
                 ECCEnabled, GlobalMemoryMb, and RuntimeVersion. The displayed
                 OpenCL attributes are DeviceName, ECCEnabled, OpenCLVersion,
                 and GlobalMemoryMb.

          -nested
                 Default. Display properties that are common to all GPUs in a
                 Common nested ClassAd, and properties that are not common to
                 all in a nested ClassAd using the GPUid as the ClassAd name.
                 Use the -not-nested argument to disable nested ClassAds and
                 return to the older behavior of using a prefix string for
                 individual property attributes.

          -not-nested
                 Display properties that are common to all GPUs using CUDA or
                 OCL as the attribute prefix, and properties that are not
                 common to all using a GPUid prefix. Versions of
                 condor_gpu_discovery prior to 9.11.0 support only this mode.

          -extra Display more attributes of the GPUs. Each of these attributes
                 will be added to a nested property ClassAd (-nested) or have
                 a prefix string at the beginning of its name (-not-nested).
                 The additional CUDA attributes are ClockMhz, ComputeUnits,
                 and CoresPerCU. The additional OpenCL attributes are
                 ClockMhz and ComputeUnits.

          -dynamic
                 Display attributes of NVIDIA devices that change values as
                 the GPU is working. Each of these attributes will be added to
                 the nested property ClassAd (-nested) or have a prefix
                 string at the beginning of its name (-not-nested). These are
                 FanSpeedPct, BoardTempC, DieTempC, EccErrorsSingleBit, and
                 EccErrorsDoubleBit.

          -mixed When displaying attribute values, assume that the machine has
                 a heterogeneous set of GPUs, so always include the integer
                 value in the prefix string.

          -device <N>
                 Display properties only for GPU device <N>, where <N> is the
                 integer value defined for the prefix string. This option may
                 be specified more than once; additional <N> are listed along
                 with the first. This option adds to the device(s) specified
                 by the environment variables CUDA_VISIBLE_DEVICES and
                 GPU_DEVICE_ORDINAL, if any.

          -tag string
                 Set the resource tag portion of the intended machine ClassAd
                 attribute Detected<ResourceTag> to be string. If this option
                 is not specified, the resource tag is "GPUs", resulting in
                 attribute name DetectedGPUs.

          -prefix str
                 When naming -not-nested attributes, use str as the prefix
                 string. When this option is not specified, the prefix string
                 is either CUDA or OCL unless -uuid or -short-uuid is also
                 used.

          -by-index
                 Use the prefix and device index as the device identifier.

          -short-uuid
                 Use the first 8 characters of the NVIDIA uuid as the device
                 identifier. When this option is used, devices will be shown
                 as GPU-<xxxxxxxx> where <xxxxxxxx> is the first 8 hex digits
                 of the NVIDIA device uuid. Unlike device indices, the uuid
                 of a device will not change if other devices are taken
                 offline or drained.

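                 The shortening rule can be sketched as follows. This is an
                 illustration only, not the tool's own code: the function
                 name short_gpu_id is hypothetical, and the sketch assumes a
                 full uuid written as GPU- followed by hex digits.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def short_gpu_id(full_uuid):
    """Return GPU-<first 8 hex digits> for a uuid of the form GPU-xxxxxxxx-..."""
    prefix = "GPU-"
    if not full_uuid.startswith(prefix):
        raise ValueError(f"unexpected uuid format: {full_uuid!r}")
    digits = full_uuid[len(prefix):len(prefix) + 8]
    if len(digits) != 8 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"expected 8 hex digits after GPU-: {full_uuid!r}")
    return prefix + digits


# The trailing groups of this uuid are made up for illustration.
print(short_gpu_id("GPU-ddc1c098-0000-0000-0000-000000000000"))  # GPU-ddc1c098
```
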
          -uuid  Use the full NVIDIA uuid as the device identifier rather than
                 the device index.

          -simulate:[D,N[,D2,...]]
                 For testing purposes, assume that N devices of type D were
                 detected, and N2 devices of type D2, etc. No discovery
                 software is invoked. D can be a value from 0 to 6 which
                 selects a simulated GPU from the following table.

                 Simulated GPUs

                 ┌──┬─────────────────┬────────────┬────────────────┐
                 │  │ DeviceName      │ Capability │ GlobalMemoryMb │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │0 │ GeForce GT 330  │ 1.2        │ 1024           │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │1 │ GeForce GTX 480 │ 2.0        │ 1536           │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │2 │ Tesla           │ 7.0        │ 24220          │
                 │  │ V100-PCIE-16GB  │            │                │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │3 │ TITAN RTX       │ 7.5        │ 24220          │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │4 │ A100-SXM4-40GB  │ 8.0        │ 40536          │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │5 │ NVIDIA          │ 8.0        │ 20096          │
                 │  │ A100-SXM4-40GB  │            │                │
                 │  │ MIG 3g.20gb     │            │                │
                 ├──┼─────────────────┼────────────┼────────────────┤
                 │6 │ NVIDIA          │ 8.0        │ 4864           │
                 │  │ A100-SXM4-40GB  │            │                │
                 │  │ MIG 1g.5gb      │            │                │
                 └──┴─────────────────┴────────────┴────────────────┘

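                 A test harness might assemble the -simulate argument from
                 the table above. The sketch below is hypothetical helper
                 code: it assumes that further device types are appended as
                 additional D,N pairs, which is one reading of the
                 D,N[,D2,...] syntax and is not confirmed here.

```python
# Simulated device types, copied from the table above:
# type -> (DeviceName, Capability, GlobalMemoryMb)
SIMULATED_GPUS = {
    0: ("GeForce GT 330", "1.2", 1024),
    1: ("GeForce GTX 480", "2.0", 1536),
    2: ("Tesla V100-PCIE-16GB", "7.0", 24220),
    3: ("TITAN RTX", "7.5", 24220),
    4: ("A100-SXM4-40GB", "8.0", 40536),
    5: ("NVIDIA A100-SXM4-40GB MIG 3g.20gb", "8.0", 20096),
    6: ("NVIDIA A100-SXM4-40GB MIG 1g.5gb", "8.0", 4864),
}


def simulate_arg(pairs):
    """Build a -simulate argument from (device_type, count) pairs."""
    parts = []
    for device_type, count in pairs:
        if device_type not in SIMULATED_GPUS:
            raise ValueError(f"device type must be 0 to 6, got {device_type}")
        if count < 1:
            raise ValueError(f"count must be at least 1, got {count}")
        # Assumed layout: each pair is flattened to "D,N" in order.
        parts += [str(device_type), str(count)]
    return "-simulate:" + ",".join(parts)


print(simulate_arg([(4, 2)]))          # -simulate:4,2
print(simulate_arg([(4, 2), (6, 3)]))  # -simulate:4,2,6,3
```
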
          -opencl
                 Prefer detection via OpenCL rather than CUDA. Without this
                 option, CUDA detection software is invoked first, and no
                 further OpenCL software is invoked if CUDA devices are
                 detected.

          -cuda  Do only CUDA detection.

          -nvcuda
                 For Windows platforms only, use a CUDA driver rather than the
                 CUDA run time.

          -config
                 Output in the syntax of HTCondor configuration, instead of
                 ClassAd language. An additional attribute, NUM_DETECTED_GPUs,
                 is produced; it is set to the number of GPUs detected.

          -repeat [N]
                 Repeat listed GPUs N (default 2) times. This results in a
                 list that looks like CUDA0, CUDA1, CUDA0, CUDA1.

                 If used with -divide, the last one on the command-line wins,
                 but you must specify 2 if you want it; the default value only
                 applies to the first flag.

          -divide [N]
                 Like -repeat, except also divide the attribute GlobalMemoryMb
                 by N. This may help you avoid overcommitting your GPU's
                 memory.

                 If used with -repeat, the last one on the command-line wins,
                 but you must specify 2 if you want it; the default value only
                 applies to the first flag.

          -packed
                 When repeating GPUs, repeat each GPU N times, not the whole
                 list. This results in a list that looks like CUDA0, CUDA0,
                 CUDA1, CUDA1.

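                 The two orderings described for -repeat and -packed, and the
                 memory split described for -divide, can be modeled with a
                 short sketch. The function names are hypothetical, and the
                 use of integer division for GlobalMemoryMb is an assumption;
                 the tool's exact rounding is not stated here.

```python
def repeat_gpus(gpus, n=2, packed=False):
    """Model the device list produced by -repeat N, optionally with -packed."""
    if packed:
        # -packed: repeat each GPU N times before moving to the next one.
        return [gpu for gpu in gpus for _ in range(n)]
    # Default: repeat the whole list N times.
    return list(gpus) * n


def divided_memory_mb(global_memory_mb, n=2):
    """Model -divide N: each repeated entry gets a 1/N share of the memory."""
    # Assumption: integer division.
    return global_memory_mb // n


print(", ".join(repeat_gpus(["CUDA0", "CUDA1"])))               # CUDA0, CUDA1, CUDA0, CUDA1
print(", ".join(repeat_gpus(["CUDA0", "CUDA1"], packed=True)))  # CUDA0, CUDA0, CUDA1, CUDA1
print(divided_memory_mb(40536, 2))                              # 20268
```
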
          -cron  This option suppresses the DetectedGPUs attribute so that the
                 output is suitable for use with condor_startd cron. Combine
                 this option with the -dynamic option to periodically refresh
                 the dynamic GPU information such as temperature. For example,
                 to refresh GPU temperatures every 5 minutes:

                    use FEATURE : StartdCronPeriodic(DYNGPUS, 5*60, $(LIBEXEC)/condor_gpu_discovery, -dynamic -cron)

          -verbose
                 For interactive use of the tool, output extra information to
                 show detection while in progress.

          -diagnostic
                 Show diagnostic information, to aid in tool development.

EXIT STATUS

       condor_gpu_discovery will exit with a status value of 0 (zero) upon
       success, and it will exit with the value 1 (one) upon failure.

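       A wrapper script can rely on this exit status to separate success
       from failure. The sketch below is illustrative only; run_discovery is
       a hypothetical name, and a stand-in command replaces the real tool so
       that the example runs on hosts without HTCondor.

```python
import subprocess
import sys


def run_discovery(command):
    """Run a discovery command and return its output, raising on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # A non-zero exit status signals failure.
        raise RuntimeError(
            f"{command[0]} exited with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


# Stand-in command so the sketch runs without HTCondor installed; on a real
# host, pass ["condor_gpu_discovery", "-properties"] instead.
stand_in = [sys.executable, "-c", "print('DetectedGPUs=0')"]
print(run_discovery(stand_in), end="")
```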

AUTHOR

       HTCondor Team

COPYRIGHT

       1990-2023, Center for High Throughput Computing, Computer Sciences
       Department, University of Wisconsin-Madison, Madison, WI, US. Licensed
       under the Apache License, Version 2.0.

                                 Oct 02, 2023          CONDOR_GPU_DISCOVERY(1)