Introduction

Recently I encountered an issue with one of our NVIDIA cards: some kind of card failure. Since nvidia-smi depends heavily on all of the cards being responsive, the command got stuck on the host running that specific card. That host had plenty of other healthy A10 cards on it, which made it a real shame. Because even the health checks depend on that command, all operations on the host ground to a halt. We needed a way to disable the NVIDIA GPU card on the host and get it working again without that specific faulty card.
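
If you suspect the same failure mode, one quick way to confirm that nvidia-smi is hanging rather than just slow is to bound it with a timeout. This is only a sketch: the pod name is a placeholder for your driver daemonset pod, and it assumes the driver container ships the coreutils timeout command.

$ oc exec <nvidia-driver-daemonset-pod> -- timeout 30 nvidia-smi || echo "nvidia-smi did not return within 30 seconds"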

Procedure

nvidia-smi is a tool used for many different tasks in a GPU environment. This procedure showcases a side of the command that we don't usually use it for.

Enter nvidia-smi's drain subcommand, together with the -pm flag for persistence mode.
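
The exact flag set differs a bit between driver versions, so it is worth checking the built-in help first. Something along these lines (treat the drain -q query as optional, depending on your driver version):

$ nvidia-smi drain -h    # show the drain subcommand's usage and flags
$ nvidia-smi drain -q    # query the current drain state of the GPUs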

When running the following command against the driver container on the faulty host:

$ oc exec -it nvidia-driver-daemonset-410.84.2022xxxxxx -- nvidia-smi

We get the following:

sh-4.4# nvidia-smi
Sun May 28 12:36:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  ERR!                On   | 00000000:41:00.0 Off |                 ERR! |
| ERR! ERR! ERR!    ERR! / ERR! |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          On   | 00000000:62:00.0 Off |                  N/A |
|  0%   36C    P8    7W /  300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10          On   | 00000000:62:00.0 Off |                  N/A |
|  0%   36C    P8    8W /  300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10          On   | 00000000:89:00.0 Off |                  N/A |
|  0%   37C    P8    9W /  300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A10          On   | 00000000:8A:00.0 Off |                  N/A |
|  0%   38C    P8    9W /  300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

As you can see, there's a faulty card with ERR! all around its output.

In our case, the faulty card's address is 00000000:41:00.0.
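
If you want a more compact view of the bus IDs than the full table, nvidia-smi can print them as CSV. This is a standard query option; on a host with a faulty card, expect the broken GPU's fields to come back as errors or unknown values:

$ nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv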

Disable the card using nvidia-smi

Let’s disable the card:

$ oc exec -it nvidia-driver-daemonset-410.84.2022xxxxxx -- bash
$ nvidia-smi -i 00000000:41:00.0 -pm 0
$ nvidia-smi drain -p 00000000:41:00.0 -m 1
$ nvidia-smi -i 00000000:41:00.0 -pm 1

The drain command printed some errors, but it eventually disabled the card, and the card disappeared from the nvidia-smi list.
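
To confirm the result yourself, list the GPUs again after the drain; the drained card should no longer show up (and, on driver versions that support it, drain -q reports the drain state):

$ nvidia-smi -L
$ nvidia-smi drain -q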

It took me quite a while to figure this out, until I found the answer in the following forum thread: https://forums.developer.nvidia.com/t/how-to-turn-off-specific-gpu/107574

Thanks to NVIDIA for providing that forum. It helped a lot.

Summary

In this short article we saw how to disable an NVIDIA GPU card in OpenShift, along with some of the technical nuances of managing NVIDIA GPU resources in Kubernetes environments. This comes from our ongoing work at Octopus Computer Solutions with AI and GenAI technologies, where we optimize computing resources for advanced artificial intelligence applications. We make sure your AI projects run smoothly, leveraging the best of NVIDIA's technology in a Kubernetes ecosystem.

Enjoy.