Configuring the Nvidia Container Toolkit on RKE1/RKE2 Nodes

Installing the Nvidia Driver on GPU Nodes

Official documentation: NVIDIA CUDA Installation Guide for Linux

Pre-installation preparation

Check whether an NVIDIA GPU is present:

root@gpu-0:~# lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)

Install the build dependencies:

apt update
apt -y install gcc make

Nouveau is the open-source NVIDIA driver and conflicts with the official proprietary driver, so it must be disabled:

echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
update-initramfs -u
reboot
lsmod | grep nouveau  # should produce no output once nouveau is disabled

Disable Secure Boot validation (mokutil prompts you to set a password, which you will need at the MOK management screen on the next boot):

mokutil --disable-validation
mokutil --sb-state

Installing the driver with the official Nvidia .run installer

Download the matching installer from Nvidia Driver Downloads:

wget -c "https://cn.download.nvidia.com/tesla/570.86.15/NVIDIA-Linux-x86_64-570.86.15.run"

Run the installer:

chmod +x NVIDIA-Linux-x86_64-570.86.15.run
./NVIDIA-Linux-x86_64-570.86.15.run
reboot

If nvidia-smi then reports No devices were found or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running., the following commands may fix it:

apt -y install dkms
# derive the driver version from the /usr/src/nvidia-<version> directory name
VERSION=$(ls /usr/src | awk -F - '/nvidia/{ print $2 }')
dkms install -m nvidia -v $VERSION
reboot

On reboot the node shows the Perform MOK management screen; select Enroll MOK and enter the password set earlier, after which the node reboots again.

Installing from the official Nvidia local repo package

Download and set up the matching local repo package:

wget -c "https://cn.download.nvidia.com/tesla/570.86.15/nvidia-driver-local-repo-ubuntu2204-570.86.15_1.0-1_amd64.deb"
dpkg -i nvidia-driver-local-repo-ubuntu2204-570.86.15_1.0-1_amd64.deb
cp /var/nvidia-driver-local-repo-ubuntu2204-570.86.15/nvidia-driver-local-081EF1BD-keyring.gpg /usr/share/keyrings/
apt update

List the available driver packages:

apt list | grep nvidia-driver

Install the driver:

apt -y install nvidia-driver-570
reboot

On reboot the node shows the Perform MOK management screen; select Enroll MOK and enter the password set earlier, after which the node reboots again.

After the reboot completes, GPU information can be checked with the nvidia-smi command.
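
For a quick check of the driver version and GPU model, nvidia-smi can also print just a few fields; the field list below is only an illustration:

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv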

Installing the driver from the Ubuntu PPA

The driver can also be installed from the Ubuntu PPA, although the driver version recommended there tends to be older:

root@gpu-0:~# ubuntu-drivers devices
== /sys/devices/pci0000:03/0000:03:00.0 ==
modalias : pci:v000010DEd00001BB3sv000010DEsd000011D8bc03sc02i00
vendor : NVIDIA Corporation
model : GP104GL [Tesla P4]
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-470 - distro non-free recommended
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin

Install the recommended version from the output above:

apt -y install nvidia-driver-470
reboot

On reboot the node shows the Perform MOK management screen; select Enroll MOK and enter the password set earlier, after which the node reboots again.

After the reboot completes, GPU information can again be checked with the nvidia-smi command.

Configuring the Nvidia Container Runtime on RKE1 Nodes

The node needs Docker installed first:

curl https://releases.rancher.com/install-docker/20.10.sh | sh

Install the Nvidia Container Toolkit:

Official installation documentation: Installing the NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt-get update

apt-get install -y nvidia-container-toolkit
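
As an alternative to editing /etc/docker/daemon.json by hand in the next step, recent toolkit versions ship nvidia-ctk, which can register the runtime and set it as the default for you (note it does not add the insecure-registries entry used below):

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker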

Configure Docker to use the Nvidia Container Runtime as the default runtime:

cat <<EOF > /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "insecure-registries" : [ "0.0.0.0/0" ],
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

Restart Docker and confirm that the nvidia runtime is registered and set as the default:

root@gpu-0:~# systemctl restart docker
root@gpu-0:~# docker info | grep Runtime
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
Default Runtime: nvidia

Test that a container can access the GPU:

root@gpu-0:~# docker run --rm --runtime=nvidia --gpus all harbor.warnerchen.com/library/ubuntu:latest nvidia-smi
Wed Feb 19 09:43:47 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:03:00.0 Off |                    0 |
| N/A   49C    P0    23W /  75W |      0MiB /  7611MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Configuring the Nvidia Container Runtime on RKE2 Nodes

Install the Nvidia Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt-get update

apt-get install -y nvidia-container-toolkit

After the node has been registered to the RKE2 cluster, modify the containerd configuration:

# keep a backup copy of the current config
cp /var/lib/rancher/rke2/agent/etc/containerd/config.toml .

cp /var/lib/rancher/rke2/agent/etc/containerd/config.toml /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

vim /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
# append the following at the end of the file
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

reboot

rm /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
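
After the reboot, one way to confirm that the nvidia runtime entry made it into the regenerated config.toml is a quick grep such as:

grep -A 3 'runtimes."nvidia"' /var/lib/rancher/rke2/agent/etc/containerd/config.toml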

Add the Nvidia Helm chart repository:

Repository URL: https://helm.ngc.nvidia.com/nvidia
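
With the Helm CLI this might look like the following (the local repository alias nvidia is an arbitrary choice):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update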

Install the GPU Operator, setting the paths to the containerd socket and config.toml in values.yaml:

toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
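
A minimal install command under these assumptions might look like this (the release name and namespace gpu-operator are arbitrary choices):

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  -f values.yaml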

Installation succeeded; all GPU Operator pods should eventually reach the Running state.
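
One way to check (assuming the gpu-operator namespace from above):

kubectl -n gpu-operator get pods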

Create a Pod to verify that GPU resources can be scheduled:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: harbor.warnerchen.com/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF

Check the logs; the benchmark ran successfully:

root@rke2-cilium-01:~# kubectl logs nbody-gpu-benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [Tesla P4]
20480 bodies, total time for 10 iterations: 27.727 ms
= 151.272 billion interactions per second
= 3025.446 single-precision GFLOP/s at 20 flops per interaction