I installed CNTK version 2.0.beta7 on an Azure NC24 GPU VM running Ubuntu (Python 3.4). The machine has 4 NVIDIA K80 GPUs. Build info:
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda-8.0
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local
Build Branch: HEAD
Build SHA1: 8e8b5ff92eff4647be5d41a5a515956907567126
Built by Source/CNTK/buildinfo.h$$0 on bbdadbf3455d
Build Path: /home/philly/jenkins/workspace/CNTK-Build-Linux
I ran the CIFAR example in distributed mode:
mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 32
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.018s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.3 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.4 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.6 samples per second)
...
...
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6300.4 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.2 samples per second)
However, when I run it with 1-bit SGD, I get the following:
mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 1 -a 50000
...
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.055s (4939.1 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
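To put a number on the gap, here is a minimal sketch that parses one epoch-160 line from each run and compares the reported throughput. The regular expression and the helper name `parse_epoch_line` are my own and not part of CNTK. The sketch assumes the log format is exactly as printed above.

```python
import re

# Pattern for the "Finished Epoch" lines shown above (assumed format, not a CNTK API).
EPOCH_RE = re.compile(
    r"Finished Epoch \[(\d+)\]:.*?([\d.]+)s \(([\d.]+) samples per second\)"
)

def parse_epoch_line(line):
    """Return (epoch, seconds, samples_per_second), or None if the line does not match."""
    m = EPOCH_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), float(m.group(3))

# One epoch-160 line from each run, copied from the logs above.
baseline_line = ("Finished Epoch [160]: [Training] loss = 0.037745 * 49664, "
                 "metric = 1.2% * 49664 7.883s (6300.4 samples per second)")
one_bit_line = ("Finished Epoch [160]: [Training] loss = 0.059290 * 49664, "
                "metric = 2.1% * 49664 10.055s (4939.1 samples per second)")

baseline = parse_epoch_line(baseline_line)  # run with -q 32
one_bit = parse_epoch_line(one_bit_line)    # run with -q 1

throughput_ratio = one_bit[2] / baseline[2]  # samples/s, -q 1 relative to -q 32
time_ratio = one_bit[1] / baseline[1]        # seconds per epoch, -q 1 relative to -q 32

print(f"-q 1 throughput relative to -q 32: {throughput_ratio:.3f}")
print(f"-q 1 epoch time relative to -q 32: {time_ratio:.3f}")
```

In other words, the `-q 1` run reaches only about four fifths of the `-q 32` throughput at epoch 160.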
As explained here, 1-bit SGD is supposed to be faster than its regular counterpart, yet in my runs it is slower. Any help is much appreciated.