docker - "Could not load dynamic library 'libcuda.so.1'" error on Google AI Platform with a custom container

Question

I'm trying to launch a training job on Google AI Platform using a custom container. Because I want to train on GPUs, I used the following base image for the container:

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

I expected that this image (with tensorflow 2.4.1 installed on top of it) would let me use GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following:

W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.

Is this a good way to build an image that uses GPUs on Google AI Platform? Or should I instead start from a tensorflow image and manually install all the drivers needed to make use of the GPU?

EDIT: I read the following here ( https://cloud.google.com/ai-platform/training/docs/containers-overview ):

For training with GPUs, your custom container needs to meet a few
special requirements. You must build a different Docker image than     
what you'd use for training with CPUs.

Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the 
nvidia/cuda image as your base image is the recommended way to handle 
this. It has the matching versions of CUDA toolkit and cuDNN pre-
installed, and it helps you set up the related environment variables 
correctly.

Install your training application, along with your required ML     
framework and other dependencies in your Docker image.

The page also gives an example Dockerfile for training with GPUs, so what I did seems fine. Unfortunately, I still get the errors above, which may (or may not) explain why I can't use GPUs on Google AI Platform.

EDIT2: After reading this page ( https://www.tensorflow.org/install/gpu ), my Dockerfile now looks like this:

FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
 lsb-release \
 vim \
 curl \
 git \
 libgl1-mesa-dev \
 software-properties-common \
 wget && \
 rm -rf /var/lib/apt/lists/*

# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update

RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

RUN apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update

# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi

RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update

# Install development and runtime libraries (~4GB)
RUN apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0


# other stuff

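The problem is that the build freezes at what appears to be a keyboard-configuration step: the system asks me to select a country, and nothing happens when I enter a number.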
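For reference, a freeze like this during `docker build` is commonly caused by a package's debconf prompt waiting for input that the build cannot supply. The fragment below is a minimal sketch of the usual workaround, assuming the country prompt comes from debconf (for example, from a keyboard-configuration package pulled in as a dependency). It has not been verified against this particular build. The package names and versions are copied from the Dockerfile above. The `ARG` line must come after `FROM` and before the first install step that triggers the prompt.

```dockerfile
# Sketch only: tell debconf not to ask interactive questions during the build.
# ARG (rather than ENV) keeps the setting out of the final image's runtime environment.
# Place this after the FROM line so it is in scope for the RUN steps that follow.
ARG DEBIAN_FRONTEND=noninteractive

# -y answers apt's own confirmation prompt, which the install step above omits.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        cuda-11-0 \
        libcudnn8=8.0.4.30-1+cuda11.0 \
        libcudnn8-dev=8.0.4.30-1+cuda11.0 && \
    rm -rf /var/lib/apt/lists/*
```

This only addresses the interactive prompt. It says nothing about whether installing a driver inside the image is the right way to resolve the `libcuda.so.1` warning.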
