Package Details: nvidia-container-toolkit 1.5.0-2

Git Clone URL: https://aur.archlinux.org/nvidia-container-toolkit.git (read-only)
Package Base: nvidia-container-toolkit
Description: NVIDIA container runtime toolkit
Upstream URL: https://github.com/NVIDIA/nvidia-container-toolkit
Keywords: docker nvidia nvidia-docker runc
Licenses: Apache
Conflicts: nvidia-container-runtime<2.0.0, nvidia-container-runtime-hook
Replaces: nvidia-container-runtime-hook
Submitter: jshap
Maintainer: jshap (kiendang)
Last Packager: jshap
Votes: 17
Popularity: 1.38
First Submitted: 2019-07-28 01:19
Last Updated: 2021-05-20 17:02

Pinned Comments

jshap commented on 2019-07-28 01:43

see the release notes here for why this exists: https://github.com/NVIDIA/nvidia-container-runtime/releases/tag/3.1.0

tl;dr: nvidia-docker is deprecated because docker now has native gpu support, which this package is required to use. :)
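For example (image tag just taken from the comments below), the native syntax is simply:

$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi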

Latest Comments

lahwaacz commented on 2021-06-11 06:24

@jshap Yes, you can't make assumptions about every user's system or requirements, which is also a reason why you shouldn't set no-cgroups = true without even mentioning an alternative. People had to reboot to switch from cgroups v1 to v2 in systemd, so I don't think "It also does not require a reboot." is a valid argument to support this change. Ultimately, it is the user who should decide if they want cgroups v1 or no cgroups at all.

As for the development of this tool, AFAIK there is no roadmap except for a comment claiming that it will take "at least 9 months" (since January), so we shouldn't expect a proper solution before October. This does not sound like a short term to me...
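For anyone unsure which hierarchy their machine is currently running, a quick check (output is only an example; cgroup2fs indicates the unified v2 hierarchy, while tmpfs usually indicates the legacy/hybrid v1 layout):

$ stat -fc %T /sys/fs/cgroup
cgroup2fs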

jshap commented on 2021-06-10 18:44

@lahwaacz I chose to turn off cgroups in the toolkit rather than force the system onto v1 cgroups because I didn't want to make assumptions about every user's system or about the requirements of their other setups. It also does not require a reboot.

You're correct that I could probably add a note explaining the kernel option too; however, it's already sufficiently documented in the links I attached. Eventually the tool will be rewritten so that it doesn't require cgroup usage directly and instead just operates through runc, so both are short-term solutions anyway.

lahwaacz commented on 2021-06-06 10:33

What is the reason for this package to prefer the workaround using no-cgroups = true and thus forcing users to manually specify devices exposed to the container, rather than instructing users to set systemd.unified_cgroup_hierarchy=false on the kernel command line? At least the post-install message should mention both options.
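For reference, the two options boil down to roughly the following (the no-cgroups key normally sits in the nvidia-container-cli section; double-check your local config.toml):

# Option A: /etc/nvidia-container-runtime/config.toml
no-cgroups = true
# ...which then requires passing every /dev/nvidia* device to the container by hand.

# Option B: boot back into the legacy cgroup v1 hierarchy
systemd.unified_cgroup_hierarchy=false    # on the kernel command line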

jshap commented on 2021-04-09 23:34

Thanks for all the replies. I will be going through these soon to decide what the best option is for a fix in the package.

HedgehogCode commented on 2021-04-09 09:21

Note that you can add the nvidia devices to the container after using the no-cgroups=true hack, as mentioned in this comment [1]:

$ docker run --gpus all --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl nvidia/cuda:11.0-base nvidia-smi
Fri Apr  9 09:15:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.67       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0    N/A /  N/A |      0MiB /  4042MiB |      0%      Default |
[...]

[1] https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-757034464

adjama commented on 2021-04-07 15:02

@GeorgeRaven, @jshap

Unfortunately I have to confirm your issue: I also cannot run nvidia-smi in my containers, and I get the same error, Failed to initialize NVML: Unknown Error.

So maybe there is more to it than just the cgroups issue?

GeorgeRaven commented on 2021-04-07 14:47

hey @jshap and @adjama, I haven't tried setting the option systemd.unified_cgroup_hierarchy=0 yet as I want to be on-site for this change, but I just wanted to let you know that @adjama's suggestion does work in that the containers can now be built and run. However, trying to use nvidia-smi inside the container still fails.

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Failed to initialize NVML: Unknown Error
$ sudo docker run --gpus all -t nvidia/cuda:11.0-base nvidia-debugdump -l
Error: nvmlInit(): Unknown Error

But thanks for the help guys, that helps a lot. Based on the discussions, I have a feeling that turning the unified cgroup hierarchy off is probably the way to go once I can do it in person. If that fails then I will try this method and look for any relevant logs that could help diagnose this unknown error.
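In case it is useful later, the logs I plan to start with are roughly these (generic commands, nothing specific to this package):

$ journalctl -b -u docker.service --no-pager | tail -n 50
$ sudo nvidia-container-cli -k -d /dev/tty info    # debug output from the container CLI itself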

adjama commented on 2021-04-07 14:29

@jshap I came across an issue which might be related. When starting/running a docker container with nvidia support (--gpus all) I got this error message:

Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1,
stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown

Looks like this is already known to nvidia [1], this [2] also might be related.

However, as mentioned in [2], I could fix my issue by editing /etc/nvidia-container-runtime/config.toml and changing #no-cgroups=false to no-cgroups=true. After a restart of docker.service everything worked as usual. Hope this helps.

[1] https://github.com/NVIDIA/nvidia-docker/issues/1447
[2] https://github.com/NVIDIA/libnvidia-container/issues/111
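In case the exact steps help, what I did boils down to this (the commented-out line in the stock config may be formatted slightly differently, so edit it by hand rather than copying blindly):

$ sudoedit /etc/nvidia-container-runtime/config.toml    # change "#no-cgroups=false" to "no-cgroups = true"
$ sudo systemctl restart docker.service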

jshap commented on 2021-04-06 19:22

@GeorgeRaven glad you figured out the symbol issue. I think the cgroup problem might be addressed by the systemd.unified_cgroup_hierarchy=0 kernel command line option. I haven't had time to investigate it yet because I've been busy, but give it a shot.
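If you want to try it and you boot with GRUB, it is the usual kernel parameter dance (assuming GRUB; adapt for other boot loaders, and note that it does need a reboot):

# /etc/default/grub -- append to whatever is already set
GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=0"

$ sudo grub-mkconfig -o /boot/grub/grub.cfg
$ sudo reboot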

GeorgeRaven commented on 2021-04-06 10:45

hey @jshap, I'm having an issue with an undefined symbol; do you have any idea what could be the cause, or have you seen this before? At first I thought it was just something missing in this machine's LD_CONFIG, but comparing it to other machines (where everything works as expected), they all contain the same entries. Here is an example error via docker:

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 127, stdout: , stderr: /usr/bin/nvidia-container-cli: symbol lookup error: /usr/bin/nvidia-container-cli: undefined symbol: nvc_nvcaps_device_from_proc_path, version NVC_1.0: unknown.

Running nvidia-container-cli directly:

$ /usr/bin/nvidia-container-cli
/usr/bin/nvidia-container-cli: symbol lookup error: /usr/bin/nvidia-container-cli: undefined symbol: nvc_nvcaps_device_from_proc_path, version NVC_1.0

If not, don't worry; I just thought I'd ask before submitting an issue, since it could be packaging-related or you may know better about Arch specifics.

nvidia 460.67-5, nvidia-container-toolkit 1.4.2-1, docker 1:20.10.5-1

EDIT:

Upon further inspection I found that the issue was an outdated version of libnvidia-container on this particular machine, only to then run into "nvidia-container-cli: container error: cgroup subsystem devices not found: unknown".
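In case anyone else hits the same symbol lookup error, a quick sanity check is to compare the installed container package versions (output depends on how libnvidia-container was installed):

$ pacman -Q | grep nvidia-container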