CONDOR_GPU_DISCOVERY(1)         HTCondor Manual        CONDOR_GPU_DISCOVERY(1)

NAME
       condor_gpu_discovery - HTCondor Manual

       Output GPU-related ClassAd attributes

SYNOPSIS
       condor_gpu_discovery -help

       condor_gpu_discovery [<options>]

DESCRIPTION
       condor_gpu_discovery outputs ClassAd attributes corresponding to a
       host's GPU capabilities. It can presently report CUDA and OpenCL
       devices; which type(s) of device(s) it reports is determined by which
       libraries, if any, it can find when it runs; this reflects what GPU
       jobs will find on that host when they run. (Note that some HTCondor
       configuration settings may cause the environment to differ between
       jobs and the HTCondor daemons in ways that change library discovery.)

       If CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL is set in the
       environment when condor_gpu_discovery is run, it will report only
       devices present in those lists.

       This tool is not available for macOS platforms.

       With no command line options, the single ClassAd attribute
       DetectedGPUs is printed. If the value is 0, no GPUs were detected. If
       one or more GPUs were detected, the value is a string, presented as a
       list of the discovered GPUs separated by a comma and a space, where
       each is given a name further used as the prefix string in other
       attribute names. Where there is more than one GPU of a particular
       type, the prefix string includes a GPU id value identifying the
       device; these can be integer values that monotonically increase from
       0 when the -by-index option is used, or globally unique identifiers
       when the -short-uuid or -uuid argument is used.

       For example, a discovery of two GPUs with -by-index may output

          DetectedGPUs="CUDA0, CUDA1"

       Further command line options use "CUDA" either with or without one of
       the integer values 0 or 1 as the name of the device properties ad for
       -nested properties, or as the prefix string in attribute names when
       -not-nested properties are chosen.
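
       As a minimal sketch of consuming this output, the Python function
       below splits a DetectedGPUs line into device names. The function name
       parse_detected_gpus is hypothetical and not part of HTCondor. The
       sketch assumes the default resource tag and the two value forms
       described above: 0, or a quoted list separated by a comma and a space.

```python
def parse_detected_gpus(line: str) -> list[str]:
    """Split one 'DetectedGPUs=...' line into a list of device names.

    Assumption: the line has one of the forms described in this manual
    page, either DetectedGPUs=0 or DetectedGPUs="name1, name2".
    """
    name, sep, value = line.strip().partition("=")
    if not sep or name.strip().lower() != "detectedgpus":
        raise ValueError(f"not a DetectedGPUs line: {line!r}")
    value = value.strip()
    if value == "0":
        # A value of 0 means that no GPUs were detected.
        return []
    # Drop the surrounding double quotes, then split on commas.
    return [item.strip() for item in value.strip('"').split(",") if item.strip()]


print(parse_detected_gpus('DetectedGPUs="CUDA0, CUDA1"'))  # ['CUDA0', 'CUDA1']
print(parse_detected_gpus("DetectedGPUs=0"))               # []
```

       If the -tag option changes the attribute name, the name check in this
       sketch would need to be adjusted accordingly.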

       For machines with more than one or two NVIDIA devices, it is
       recommended that you also use the -short-uuid or -uuid option. The
       uuid value assigned by NVIDIA to each GPU is unique, so using this
       option provides stable device identifiers for your devices. The
       -short-uuid option uses only part of the uuid, but it is highly likely
       to still be unique for devices on a single machine. As of HTCondor
       9.0, -short-uuid is the default. When -short-uuid is used, discovery
       of two GPUs may look like this:

          DetectedGPUs="GPU-ddc1c098, GPU-9dc7c6d6"

       Any NVIDIA runtime library later than 9.0 will accept the above
       identifiers in the CUDA_VISIBLE_DEVICES environment variable.
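
       The relationship between a full uuid and its short form can be
       sketched in Python as follows. This is an illustration only: the
       helper names short_uuid and visible_devices are hypothetical, the
       full uuid in the example is made up, and the sketch assumes a full
       uuid that starts with "GPU-" followed by hex digits. It does not
       handle MIG instance names.

```python
import string


def short_uuid(full_uuid: str) -> str:
    """Return 'GPU-' plus the first 8 hex digits of a full 'GPU-...' uuid."""
    prefix, sep, rest = full_uuid.partition("-")
    hex8 = rest[:8]
    if prefix != "GPU" or not sep or len(hex8) != 8:
        raise ValueError(f"unexpected uuid format: {full_uuid!r}")
    if any(ch not in string.hexdigits for ch in hex8):
        raise ValueError(f"unexpected uuid format: {full_uuid!r}")
    return f"GPU-{hex8}"


def visible_devices(identifiers: list[str]) -> str:
    """Join device identifiers with commas, as CUDA_VISIBLE_DEVICES expects."""
    return ",".join(identifiers)


# Made-up full uuid, used only for illustration.
example = "GPU-ddc1c098-1111-2222-3333-444444444444"
print(short_uuid(example))                                # GPU-ddc1c098
print(visible_devices(["GPU-ddc1c098", "GPU-9dc7c6d6"]))  # GPU-ddc1c098,GPU-9dc7c6d6
```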

       If the NVML library is available, and a multi-instance GPU
       (MIG)-capable device is present, has MIG enabled, and has created
       compute instances for each MIG instance, condor_gpu_discovery will
       report those instances as distinct devices. Their names will be in
       the long UUID form unless the -short-uuid option is used, because
       they cannot be enumerated via CUDA. MIG instances don't have some of
       the properties reported by the -properties, -extra, and -dynamic
       options; these properties will be omitted. If MIG is enabled on any
       GPU in the system, some properties become unavailable for every GPU
       in the system; condor_gpu_discovery will report what it can.

OPTIONS
       -help  Print usage information and exit.

       -properties
              In addition to the DetectedGPUs attribute, display some of the
              attributes of the GPUs. Each of these attributes will be in a
              nested ClassAd (-nested) or have a prefix string at the
              beginning of its name (-not-nested). The displayed CUDA
              attributes are Capability, DeviceName, DriverVersion,
              ECCEnabled, GlobalMemoryMb, and RuntimeVersion. The displayed
              OpenCL attributes are DeviceName, ECCEnabled, OpenCLVersion,
              and GlobalMemoryMb.

       -nested
              Default. Display properties that are common to all GPUs in a
              Common nested ClassAd, and properties that are not common to
              all in a nested ClassAd using the GPUid as the ClassAd name.
              Use the -not-nested argument to disable nested ClassAds and
              return to the older behavior of using a prefix string for
              individual property attributes.

       -not-nested
              Display properties that are common to all GPUs using CUDA or
              OCL as the attribute prefix, and properties that are not
              common to all using a GPUid prefix. Versions of
              condor_gpu_discovery prior to 9.11.0 support only this mode.

       -extra Display more attributes of the GPUs. Each of these attributes
              will be added to a nested property ClassAd (-nested) or have a
              prefix string at the beginning of its name (-not-nested). The
              additional CUDA attributes are ClockMhz, ComputeUnits, and
              CoresPerCU. The additional OpenCL attributes are ClockMhz and
              ComputeUnits.

       -dynamic
              Display attributes of NVIDIA devices that change values as the
              GPU is working. Each of these attributes will be added to the
              nested property ClassAd (-nested) or have a prefix string at
              the beginning of its name (-not-nested). These are
              FanSpeedPct, BoardTempC, DieTempC, EccErrorsSingleBit, and
              EccErrorsDoubleBit.

       -mixed When displaying attribute values, assume that the machine has
              a heterogeneous set of GPUs, so always include the integer
              value in the prefix string.

       -device <N>
              Display properties only for GPU device <N>, where <N> is the
              integer value defined for the prefix string. This option may
              be specified more than once; additional <N> are listed along
              with the first. This option adds to the device(s) specified by
              the environment variables CUDA_VISIBLE_DEVICES and
              GPU_DEVICE_ORDINAL, if any.

       -tag string
              Set the resource tag portion of the intended machine ClassAd
              attribute Detected<ResourceTag> to be string. If this option
              is not specified, the resource tag is "GPUs", resulting in
              attribute name DetectedGPUs.

       -prefix str
              When naming -not-nested attributes, use str as the prefix
              string. When this option is not specified, the prefix string
              is either CUDA or OCL unless -uuid or -short-uuid is also
              used.

       -by-index
              Use the prefix and device index as the device identifier.

       -short-uuid
              Use the first 8 characters of the NVIDIA uuid as the device
              identifier. When this option is used, devices will be shown as
              GPU-<xxxxxxxx> where <xxxxxxxx> is the first 8 hex digits of
              the NVIDIA device uuid. Unlike device indices, the uuid of a
              device will not change if other devices are taken offline or
              drained.

       -uuid  Use the full NVIDIA uuid as the device identifier rather than
              the device index.

       -simulate:[D,N[,D2,...]]
              For testing purposes, assume that N devices of type D were
              detected, and N2 devices of type D2, etc. No discovery
              software is invoked. D can be a value from 0 to 6 which
              selects a simulated GPU from the following table.

       ┌───┬───────────────────────────────────┬────────────┬────────────────┐
       │ D │ DeviceName                        │ Capability │ GlobalMemoryMB │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 0 │ GeForce GT 330                    │ 1.2        │ 1024           │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 1 │ GeForce GTX 480                   │ 2.0        │ 1536           │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 2 │ Tesla V100-PCIE-16GB              │ 7.0        │ 24220          │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 3 │ TITAN RTX                         │ 7.5        │ 24220          │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 4 │ A100-SXM4-40GB                    │ 8.0        │ 40536          │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 5 │ NVIDIA A100-SXM4-40GB MIG 3g.20gb │ 8.0        │ 20096          │
       ├───┼───────────────────────────────────┼────────────┼────────────────┤
       │ 6 │ NVIDIA A100-SXM4-40GB MIG 1g.5gb  │ 8.0        │ 4864           │
       └───┴───────────────────────────────────┴────────────┴────────────────┘

       -opencl
              Prefer detection via OpenCL rather than CUDA. Without this
              option, CUDA detection software is invoked first, and no
              further OpenCL software is invoked if CUDA devices are
              detected.

       -cuda  Do only CUDA detection.

       -nvcuda
              For Windows platforms only, use a CUDA driver rather than the
              CUDA runtime.

       -config
              Output in the syntax of HTCondor configuration, instead of
              ClassAd language. An additional attribute, NUM_DETECTED_GPUs,
              is produced and set to the number of GPUs detected.

       -repeat [N]
              Repeat listed GPUs N (default 2) times. This results in a list
              that looks like CUDA0, CUDA1, CUDA0, CUDA1.

              If used with -divide, the last one on the command line wins,
              but you must specify 2 if you want it; the default value only
              applies to the first flag.

       -divide [N]
              Like -repeat, except also divide the attribute GlobalMemoryMb
              by N. This may help you avoid overcommitting your GPU's
              memory.

              If used with -repeat, the last one on the command line wins,
              but you must specify 2 if you want it; the default value only
              applies to the first flag.

       -packed
              When repeating GPUs, repeat each GPU N times, not the whole
              list. This results in a list that looks like CUDA0, CUDA0,
              CUDA1, CUDA1.

       -cron  This option suppresses the DetectedGPUs attribute so that the
              output is suitable for use with condor_startd cron. Combine
              this option with the -dynamic option to periodically refresh
              the dynamic GPU information such as temperature. For example,
              to refresh GPU temperatures every 5 minutes:

                 use FEATURE : StartdCronPeriodic(DYNGPUS, 5*60, $(LIBEXEC)/condor_gpu_discovery, -dynamic -cron)

       -verbose
              For interactive use of the tool, output extra information to
              show detection while in progress.

       -diagnostic
              Show diagnostic information, to aid in tool development.
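
       The list orderings described for -repeat and -packed, and the memory
       division described for -divide, can be illustrated with a short
       Python sketch. It only models the documented results and is not the
       tool's implementation; the function names are hypothetical, and the
       use of integer division for GlobalMemoryMb is an assumption.

```python
def repeat_gpus(gpus: list[str], n: int = 2, packed: bool = False) -> list[str]:
    """Model the device list produced by -repeat N, optionally with -packed."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if packed:
        # -packed: repeat each GPU n times before moving to the next one.
        return [gpu for gpu in gpus for _ in range(n)]
    # Without -packed: repeat the whole list n times.
    return list(gpus) * n


def divided_memory_mb(global_memory_mb: int, n: int = 2) -> int:
    """Model -divide N: each repeated entry advertises 1/N of the memory.

    Assumption: the result is rounded down to a whole number of megabytes.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return global_memory_mb // n


gpus = ["CUDA0", "CUDA1"]
print(", ".join(repeat_gpus(gpus)))               # CUDA0, CUDA1, CUDA0, CUDA1
print(", ".join(repeat_gpus(gpus, packed=True)))  # CUDA0, CUDA0, CUDA1, CUDA1
print(divided_memory_mb(40536))                   # 20268
```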

EXIT STATUS
       condor_gpu_discovery will exit with a status value of 0 (zero) upon
       success, and it will exit with the value 1 (one) upon failure.
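
       A wrapper script might check this exit status before using the
       output. The following Python sketch shows one way to do so; the
       function name run_discovery is hypothetical, and the commented
       example call assumes that condor_gpu_discovery is on the PATH.

```python
import subprocess


def run_discovery(command: list[str]) -> str:
    """Run a discovery command and return its standard output.

    Raises RuntimeError if the command exits with a non-zero status.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed with exit status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


# Example call (assumes condor_gpu_discovery is installed and on the PATH):
# print(run_discovery(["condor_gpu_discovery", "-properties"]))
```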

AUTHOR
       HTCondor Team

COPYRIGHT
       1990-2023, Center for High Throughput Computing, Computer Sciences
       Department, University of Wisconsin-Madison, Madison, WI, US.
       Licensed under the Apache License, Version 2.0.

                                 Oct 02, 2023          CONDOR_GPU_DISCOVERY(1)