I am running into this problem on an Amazon instance with 4 GPUs, using a simple example script:
import skflow
import tensorflow as tf
from sklearn import datasets, cross_validation

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

def my_model(X, y):
    # Hidden layers are pinned to GPU 1; many neurons to see the impact on memory.
    with tf.device('/gpu:1'):
        layers = skflow.ops.dnn(X, [1000, 500, 150], keep_prob=0.5)
    # The logistic regression output layer is pinned to the CPU.
    with tf.device('/cpu:0'):
        return skflow.models.logistic_regression(layers, y)

classifier = skflow.TensorFlowEstimator(model_fn=my_model, n_classes=3)
classifier.fit(X_train, y_train)
Output of nvidia-smi before launching the script:
Fri Feb 19 11:30:22 2016
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   40C    P0    41W / 125W |   2247MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   36C    P0    40W / 125W |   2113MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   41C    P0    43W / 125W |     53MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   39C    P0    41W / 125W |   1816MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
While the script is running:
Fri Feb 19 11:30:53 2016
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   40C    P0    46W / 125W |   3926MiB /  4095MiB |     26%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   37C    P0    42W / 125W |   3926MiB /  4095MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   41C    P0    44W / 125W |     92MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   39C    P0    42W / 125W |   1856MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
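To make the change between the two snapshots easier to read, here is a small sketch that computes the per-GPU difference in used memory. The helper names (used_mib, memory_delta) are my own and purely illustrative; the numbers are the Memory-Usage values copied from the two nvidia-smi outputs above.

```python
import re

# Matches the "used / total" pair in an nvidia-smi Memory-Usage cell.
USAGE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")

def used_mib(smi_output):
    """Return the used memory (MiB) per GPU, in the order nvidia-smi lists them."""
    return [int(used) for used, _total in USAGE.findall(smi_output)]

def memory_delta(before, after):
    """Return the per-GPU increase in used memory (MiB) between two snapshots."""
    return [b - a for a, b in zip(used_mib(before), used_mib(after))]

# Memory-Usage cells copied from the snapshot taken before the script started.
before = """
2247MiB / 4095MiB
2113MiB / 4095MiB
53MiB / 4095MiB
1816MiB / 4095MiB
"""

# Memory-Usage cells copied from the snapshot taken while the script was running.
after = """
3926MiB / 4095MiB
3926MiB / 4095MiB
92MiB / 4095MiB
1856MiB / 4095MiB
"""

for gpu, delta in enumerate(memory_delta(before, after)):
    print("GPU%d: +%d MiB" % (gpu, delta))
```

With these values, GPU0 grows by 1679 MiB and GPU1 by 1813 MiB, while GPU2 and GPU3 grow by only about 40 MiB each.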
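So memory is allocated on GPU0 even though GPU0 is not mentioned anywhere in the code. Do you know where this behavior comes from? It is a problem because several users share this instance, and GPU0 ends up saturated even when nobody is explicitly using it.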
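One thing that might be worth trying is hiding the other GPUs from the process with the standard CUDA environment variable CUDA_VISIBLE_DEVICES. The sketch below is only an assumption on my part: I have not verified that it prevents the GPU0 allocation on this instance, and restrict_visible_gpus is a hypothetical helper name, not part of skflow or TensorFlow.

```python
import os

def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU ids to CUDA in this process (hypothetical helper)."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)

# This has to run before TensorFlow initialises CUDA, so call it before
# `import tensorflow`.
restrict_visible_gpus([1])

# import tensorflow as tf
# With only physical GPU 1 visible, CUDA renumbers it as device 0, so the
# model would need tf.device('/gpu:0') instead of tf.device('/gpu:1').
```

The same effect can be had without touching the script by setting the variable in the shell when launching it.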