Using CUDA in containers requires nvidia-container-toolkit in addition to a recent NVIDIA driver and an available container runtime (podman, docker, etc.).
Generate a CDI description file:
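A minimal sketch of this step, assuming the `nvidia-ctk` tool shipped with nvidia-container-toolkit is on the PATH. The function name `generate_cdi_spec` and the `NVIDIA_CTK` override variable are hypothetical names introduced only for illustration; writing under /etc/cdi/ normally requires root.

```shell
# Write a CDI spec describing the GPUs and driver files currently on the host.
# NVIDIA_CTK is a hypothetical override for the binary path (illustration only).
generate_cdi_spec() {
    "${NVIDIA_CTK:-nvidia-ctk}" cdi generate --output="${1:-/etc/cdi/nvidia.yaml}"
}
```

Called without arguments as root, this runs `nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml`. Running `nvidia-ctk cdi list` afterwards should print the device names found in the generated spec.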
You can then pull a CUDA image and test whether CUDA is available:
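A sketch of such a test, assuming podman and an existing CDI spec under /etc/cdi/. The function name, the `CONTAINER_RUNTIME` variable and the default `ubuntu` image are assumptions made for illustration; pass a CUDA image of your choice as the first argument. The `--security-opt=label=disable` option only matters on hosts with SELinux enabled.

```shell
# Run nvidia-smi in a throwaway container with every GPU listed in the CDI spec.
# CONTAINER_RUNTIME and the default image are assumptions for illustration.
test_gpu_in_container() {
    "${CONTAINER_RUNTIME:-podman}" run --rm \
        --device nvidia.com/gpu=all \
        --security-opt=label=disable \
        "${1:-ubuntu}" nvidia-smi -L
}
```

If the setup works, `nvidia-smi -L` lists the GPUs visible inside the container; an error at this point usually means the runtime could not apply the CDI spec.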
The installation steps outlined above are described in the Installation Guide of the NVIDIA Cloud Native Technologies documentation; please refer to it for details.
Current problem encountered
In short, after the nvidia driver was updated, the following error occurred.
The root cause of the error was that the container was started with --device nvidia.com/gpu=all (i.e. via CDI), which depends on the files under /etc/cdi/. However, the file referenced by a section of /etc/cdi/nvidia.yaml (part of which is shown below) no longer exists because nvidia was updated from 530.41.03-17 to 535.54.03-2, which eventually led to the above error.
In this case, regenerating the file will solve the problem (the following command is also mentioned in the Installation Guide - NVIDIA Cloud Native Technologies documentation).
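To confirm that stale host paths are the cause, the spec can be checked against the filesystem. The following is a rough sketch, not part of the NVIDIA tooling: `find_stale_cdi_paths` is a hypothetical helper that assumes each mount in the spec appears on a line of the form `hostPath: /some/path`, and it uses simple line matching rather than a real YAML parser.

```shell
# Print every hostPath in a CDI spec that no longer exists on the host.
# Returns 1 if at least one stale path was found, 0 otherwise.
find_stale_cdi_paths() {
    spec="${1:-/etc/cdi/nvidia.yaml}"
    # Line-based extraction of hostPath values; this is not a full YAML parser.
    missing=$(sed -n 's/^[[:space:]-]*hostPath:[[:space:]]*//p' "$spec" | sort -u |
        while IFS= read -r host_path; do
            [ -e "$host_path" ] || printf 'stale: %s\n' "$host_path"
        done)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
}
```

After a driver update, versioned library paths from the old driver would be expected to show up as stale; once the spec has been regenerated, the function should print nothing and return 0.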
Automation
To avoid a recurrence, the following hook was created to regenerate /etc/cdi/nvidia.yaml automatically whenever nvidia is updated. Since nvidia has only just been updated, the hook has not yet been triggered, so it is not known whether it works properly.
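Whatever hook mechanism the package manager offers, the command it runs could look roughly like the sketch below. This is an assumption-laden illustration rather than the hook described above: `refresh_cdi_spec` and `NVIDIA_CTK` are hypothetical names, and the sketch relies on `nvidia-ctk cdi generate` printing the spec to standard output when no `--output` is given. The spec is written to a temporary file first and only moved into place if generation succeeded, so a failed run does not leave a truncated /etc/cdi/nvidia.yaml behind.

```shell
# Regenerate the CDI spec into a temporary file and swap it in only on success.
# NVIDIA_CTK is a hypothetical override for the binary path (illustration only).
refresh_cdi_spec() {
    target="${1:-/etc/cdi/nvidia.yaml}"
    # Create the temporary file next to the target so that mv is a rename.
    tmp_spec=$(mktemp "${target}.XXXXXX") || return 1
    if "${NVIDIA_CTK:-nvidia-ctk}" cdi generate > "$tmp_spec" && [ -s "$tmp_spec" ]; then
        chmod 644 "$tmp_spec"
        mv -f "$tmp_spec" "$target"
    else
        # Generation failed or produced nothing: keep the old spec untouched.
        rm -f "$tmp_spec"
        return 1
    fi
}
```

A hook would call `refresh_cdi_spec` as root after the driver package has been upgraded; a non-zero return value signals that the old spec is still in place and needs attention.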