Introduction
Recently I encountered an issue with one of our NVIDIA cards: there was some kind of card failure. Since nvidia-smi depends heavily on all the cards being healthy, the command got stuck on the host running that specific card. We had a lot of other A10 cards on that host, so that was a shame, and since even our health checks depend on that command, all operations on the host ground to a halt. We needed a way to disable the NVIDIA GPU card on the host and get it working without that specific faulty card.
Procedure
nvidia-smi is a tool used for many different purposes in a GPU environment. This procedure showcases a side of it that we don't usually use: the drain subcommand, its -p option for targeting a card by PCI bus ID, and the -pm persistence-mode switch.
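Before touching anything, it helps to locate the NVIDIA driver daemonset pod scheduled on the faulty node, and to confirm that the drain subcommand exists on your driver version. The commands below are only a sketch: the nvidia-gpu-operator namespace and the <driver-pod> placeholder are assumptions, so adjust them to your cluster.
# Find the driver daemonset pod running on the faulty node
# (namespace is an assumption -- the GPU operator commonly installs into nvidia-gpu-operator):
$ oc get pods -n nvidia-gpu-operator -o wide | grep nvidia-driver-daemonset
# Confirm the drain subcommand and its flags on your driver version
# (if subcommand help isn't accepted, plain nvidia-smi -h lists drain as well):
$ oc exec -it <driver-pod> -- nvidia-smi drain -h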
When running the following command inside the driver daemonset pod on the faulty host:
$ oc exec -it nvidia-driver-daemonset-410.84.2022xxxxxx -- nvidia-smi
We get the following:
sh-4.4# nvidia-smi
Sun May 28 12:36:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  ERR!                 On  | 00000000:41:00.0 Off |                 ERR! |
| ERR! ERR! ERR!    ERR! / ERR! |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10           On  | 00000000:62:00.0 Off |                  N/A |
|  0%   36C    P8     7W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10           On  | 00000000:62:00.0 Off |                  N/A |
|  0%   36C    P8     8W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10           On  | 00000000:89:00.0 Off |                  N/A |
|  0%   37C    P8     9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A10           On  | 00000000:8A:00.0 Off |                  N/A |
|  0%   38C    P8     9W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
As you can see, there's a faulty card with ERR! all around its output. In our case, the faulty card's address is 00000000:41:00.0.
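If the full table is hard to read through all the ERR! fields, nvidia-smi's query mode (standard --query-gpu flags, assuming the tool still responds on your host) gives a more compact mapping of GPU index to PCI bus ID:
# Compact index-to-bus-ID mapping, useful for picking the right address:
$ oc exec -it nvidia-driver-daemonset-410.84.2022xxxxxx -- \
    nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv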
Disable the card using nvidia-smi
Let’s disable the card:
$ oc exec -it nvidia-driver-daemonset-410.84.2022xxxxxx -- bash
$ nvidia-smi -i 00000000:41:00.0 -pm 0
$ nvidia-smi drain -p 00000000:41:00.0 -m 1
$ nvidia-smi -i 00000000:41:00.0 -pm 1
I got some error output from the drain command, but it eventually disabled the card, and the card disappeared from the nvidia-smi list.
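To double-check that the drain really took effect, you can list the GPUs from the same driver pod; only the healthy A10s should show up (note that GPU indexes may shift once the faulty card is gone):
# List the GPUs that are still visible after the drain:
$ oc exec -it nvidia-driver-daemonset-410.84.2022xxxxxx -- nvidia-smi -L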
It took me a long time to figure this out, until I found it in the following forum thread: https://forums.developer.nvidia.com/t/how-to-turn-off-specific-gpu/107574
Thanks to NVIDIA for providing that forum. It helped a lot.
Summary
In this short article we saw how to disable an NVIDIA GPU card in OpenShift, along with some technical nuances of managing NVIDIA GPU resources within Kubernetes environments. This is part of our ongoing work at Octopus Computer Solutions with AI and GenAI technologies, where we optimize computing resources for advanced artificial intelligence applications. We ensure that your AI projects run smoothly, leveraging the best of NVIDIA's technology in a Kubernetes ecosystem.
Enjoy.