CONDOR_GPU_DISCOVERY(1) | HTCondor Manual | CONDOR_GPU_DISCOVERY(1) |
condor_gpu_discovery - HTCondor Manual
Output GPU-related ClassAd attributes
condor_gpu_discovery -help
condor_gpu_discovery [<options>]
condor_gpu_discovery outputs ClassAd attributes corresponding to a host's GPU capabilities. It can presently report CUDA and OpenCL devices; which type(s) of device(s) it reports is determined by which libraries, if any, it can find when it runs; this reflects what GPU jobs will find on that host when they run. (Note that some HTCondor configuration settings may cause the environment to differ between jobs and the HTCondor daemons in ways that change library discovery.)
If CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL is set in the environment when condor_gpu_discovery is run, it will report only the devices present in those lists.
This tool is not available for macOS platforms.
With no command line options, the single ClassAd attribute DetectedGPUs is printed. If the value is 0, no GPUs were detected. If one or more GPUs were detected, the value is a string containing a comma-and-space-separated list of the GPUs discovered, where each is given a name that is further used as the prefix string in other attribute names. Where there is more than one GPU of a particular type, the prefix string includes a GPU id value identifying the device; these can be integer values that monotonically increase from 0 when the -by-index option is used, or globally unique identifiers when the -short-uuid or -uuid argument is used.
For example, a discovery of two GPUs with -by-index may output:
DetectedGPUs="CUDA0, CUDA1"
Further command line options use "CUDA" either with or without one of the integer values 0 or 1 as the name of the device properties ad for -nested properties, or as the prefix string in attribute names when -not-nested properties are chosen.
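The default output described above is simple enough to parse by hand. The following is a minimal Python sketch, not part of HTCondor; the function name parse_detected_gpus is hypothetical, and it assumes the attribute arrives on a single line in exactly the form shown in the example (either the value 0 or a quoted, comma-and-space-separated string).

```python
def parse_detected_gpus(line: str) -> list[str]:
    """Turn one 'DetectedGPUs=...' line into a list of device names.

    Assumption: the line has the form shown in the manual, either
    DetectedGPUs=0 or DetectedGPUs="NAME1, NAME2".
    """
    name, sep, value = line.strip().partition("=")
    if not sep or name.strip() != "DetectedGPUs":
        raise ValueError(f"not a DetectedGPUs line: {line!r}")

    value = value.strip()
    if value == "0":
        # A value of 0 means no GPUs were detected.
        return []

    # Drop the surrounding double quotes, then split the list.
    value = value.strip('"')
    return [item.strip() for item in value.split(",") if item.strip()]
```

Each returned name is the prefix that other attribute names are built from, so the list can be used directly to look those attributes up.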
For machines with more than one or two NVIDIA devices, it is recommended that you also use the -short-uuid or -uuid option. The uuid value assigned by NVIDIA to each GPU is unique, so using this option provides stable device identifiers for your devices. The -short-uuid option uses only part of the uuid, but it is highly likely to still be unique for devices on a single machine. As of HTCondor 9.0, -short-uuid is the default. When -short-uuid is used, discovery of two GPUs may look like this:
DetectedGPUs="GPU-ddc1c098, GPU-9dc7c6d6"
Any NVIDIA runtime library later than 9.0 will accept the above identifiers in the CUDA_VISIBLE_DEVICES environment variable.
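As an illustration of how such identifiers might be handed to a program, here is a short Python sketch. It is not HTCondor's own mechanism; the helper names cuda_visible_devices and job_environment are hypothetical, and the sketch assumes the input is the value of DetectedGPUs exactly as printed (quoted, comma-and-space-separated).

```python
import os


def cuda_visible_devices(detected_value: str) -> str:
    """Convert a DetectedGPUs value into a CUDA_VISIBLE_DEVICES string.

    Assumption: detected_value looks like '"GPU-ddc1c098, GPU-9dc7c6d6"'.
    The result is the same identifiers joined by commas without spaces.
    """
    value = detected_value.strip().strip('"')
    ids = [part.strip() for part in value.split(",") if part.strip()]
    return ",".join(ids)


def job_environment(detected_value: str, base=None) -> dict:
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices(detected_value)
    return env
```

The dictionary returned by job_environment can be passed as the env argument of subprocess.run when launching a GPU program by hand.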
If the NVML library is available, and a multi-instance GPU (MIG)-capable device is present, has MIG enabled, and has created compute instances for each MIG instance, condor_gpu_discovery will report those instances as distinct devices. Their names will be in the long UUID form unless the -short-uuid option is used, because they cannot be enumerated via CUDA. MIG instances lack some of the properties reported by the -properties, -extra, and -dynamic options; these properties will be omitted. If MIG is enabled on any GPU in the system, some properties become unavailable for every GPU in the system; condor_gpu_discovery will report what it can.
# | DeviceName | Capability | GlobalMemoryMB |
0 | GeForce GT 330 | 1.2 | 1024 |
1 | GeForce GTX 480 | 2.0 | 1536 |
2 | Tesla V100-PCIE-16GB | 7.0 | 24220 |
3 | TITAN RTX | 7.5 | 24220 |
4 | A100-SXM4-40GB | 8.0 | 40536 |
5 | NVIDIA A100-SXM4-40GB MIG 3g.20gb | 8.0 | 20096 |
6 | NVIDIA A100-SXM4-40GB MIG 1g.5gb | 8.0 | 4864 |
When -repeat is used with -divide, the option given last on the command line wins. The default value of 2 applies only to the first of the two flags, so the later flag must specify 2 explicitly if that value is wanted.
When -divide is used with -repeat, the option given last on the command line wins. The default value of 2 applies only to the first of the two flags, so the later flag must specify 2 explicitly if that value is wanted.
use FEATURE : StartdCronPeriodic(DYNGPUS, 5*60, $(LIBEXEC)/condor_gpu_discovery, -dynamic -cron)
condor_gpu_discovery will exit with a status value of 0 (zero) upon success, and it will exit with the value 1 (one) upon failure.
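A wrapper script can rely on this exit status to tell success from failure. The following Python sketch shows one way to do so; the function names run_discovery and main are hypothetical, and it assumes condor_gpu_discovery is on the PATH (on many installations it lives under the HTCondor libexec directory instead, in which case pass the full path).

```python
import subprocess
import sys


def run_discovery(command=("condor_gpu_discovery",)) -> str:
    """Run the discovery tool and return its standard output.

    Raises RuntimeError if the tool exits with a non-zero status,
    which the manual documents as failure.
    """
    result = subprocess.run(list(command), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed with exit status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


def main() -> int:
    """Print the discovered attributes, mirroring the tool's exit status."""
    try:
        print(run_discovery(), end="")
    except (OSError, RuntimeError) as err:
        # OSError covers the case where the executable cannot be found.
        print(err, file=sys.stderr)
        return 1
    return 0
```

Calling sys.exit(main()) from a script propagates the same 0 or 1 convention to the caller.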
HTCondor Team
1990-2024, Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, US. Licensed under the Apache License, Version 2.0.
August 25, 2024