
I'm trying to train a custom model with Magenta, which piggybacks on tensorflow-gpu. The problem is that no matter what I do, TensorFlow fails to allocate GPU memory properly and never gets to the point of actually training. For the record, this is the command I'm using:

t2t_trainer --data_dir="{folder}" --hparams="label_smoothing=0.0, max_length=0,max_target_seq_length=4096" --hparams_set=score2perf_transformer_base --model=transformer --output_dir="{folder}" --problem=score2perf_maestro_language_uncropped_aug --train_steps=2500

This works fine when the sequence length is set to 2048, and it only uses about 25% of my CPU and GPU. I have an i7-9600k and an RTX 2070 with 8 GB of VRAM. However, as soon as I increase it to 4096, it starts failing even the smallest GPU allocations. Here is a (summarized) version of the log:

2019-11-14 14:38:14.028064: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 7.60G (8160437760 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:14.028311: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 6.84G (7344393728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From c:\python\lib\site-packages\tensorflow_core\python\training\saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
W1114 14:38:14.551839  9104 deprecation.py:323] From c:\python\lib\site-packages\tensorflow_core\python\training\saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I1114 14:38:14.811158  9104 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1114 14:38:14.944813  9104 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\conspiracy2\Documents\music comp\out\11-14 set 2.0\checkpts\model.ckpt.
I1114 14:38:17.920329  9104 basic_session_run_hooks.py:606] Saving checkpoints for 0 into C:\Users\conspiracy2\Documents\music comp\out\11-14 set 2.0\checkpts\model.ckpt.
2019-11-14 14:38:21.598678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-14 14:38:22.574418: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:22.574642: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:32.575117: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:32.575322: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:32.575478: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 576.00MiB (rounded to 603979776).  Current allocation summary follows.
2019-11-14 14:38:32.575683: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):   Total Chunks: 75, Chunks in use: 69. 18.8KiB allocated for chunks. 17.3KiB in use in bin. 304B client-requested in use in bin.
2019-11-14 14:38:32.575871: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512):   Total Chunks: 1, Chunks in use: 0. 512B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.576033: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024):  Total Chunks: 2, Chunks in use: 2. 2.3KiB allocated for chunks. 2.3KiB in use in bin. 2.0KiB client-requested in use in bin.
2019-11-14 14:38:32.576206: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048):  Total Chunks: 96, Chunks in use: 96. 192.0KiB allocated for chunks. 192.0KiB in use in bin. 192.0KiB client-requested in use in bin.
2019-11-14 14:38:32.576406: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.576604: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192):  Total Chunks: 19, Chunks in use: 18. 158.8KiB allocated for chunks. 144.0KiB in use in bin. 144.0KiB client-requested in use in bin.
2019-11-14 14:38:32.576926: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384):         Total Chunks: 13, Chunks in use: 12. 208.0KiB allocated for chunks. 192.0KiB in use in bin. 192.0KiB client-requested in use in bin.
2019-11-14 14:38:32.577128: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768):         Total Chunks: 48, Chunks in use: 48. 1.82MiB allocated for chunks. 1.82MiB in use in bin. 1.82MiB client-requested in use in bin.
2019-11-14 14:38:32.577355: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.577566: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.577770: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.577973: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288):        Total Chunks: 1, Chunks in use: 1. 620.0KiB allocated for chunks. 620.0KiB in use in bin. 620.0KiB client-requested in use in bin.
2019-11-14 14:38:32.578238: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576):       Total Chunks: 72, Chunks in use: 72. 72.00MiB allocated for chunks. 72.00MiB in use in bin. 72.00MiB client-requested in use in bin.
2019-11-14 14:38:32.578395: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.578561: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304):       Total Chunks: 37, Chunks in use: 36. 151.84MiB allocated for chunks. 144.00MiB in use in bin. 144.00MiB client-requested in use in bin.
2019-11-14 14:38:32.578834: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608):       Total Chunks: 38, Chunks in use: 37. 304.00MiB allocated for chunks. 296.00MiB in use in bin. 296.00MiB client-requested in use in bin.
2019-11-14 14:38:32.579017: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.579203: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432):      Total Chunks: 6, Chunks in use: 6. 192.00MiB allocated for chunks. 192.00MiB in use in bin. 192.00MiB client-requested in use in bin.
2019-11-14 14:38:32.579489: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864):      Total Chunks: 2, Chunks in use: 1. 160.00MiB allocated for chunks. 64.00MiB in use in bin. 64.00MiB client-requested in use in bin.
2019-11-14 14:38:32.579704: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.579998: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 11, Chunks in use: 10. 5.29GiB allocated for chunks. 5.00GiB in use in bin. 5.00GiB client-requested in use in bin.
2019-11-14 14:38:32.580279: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 576.00MiB was 256.00MiB, Chunk State:
2019-11-14 14:38:32.580407: I tensorflow/core/common_runtime/bfc_allocator.cc:891]   Size: 300.92MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 512.00MiB | Requested Size: 512.00MiB | 
2019-11-14 14:38:32.643932: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 5.75GiB
2019-11-14 14:38:32.644132: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 6609954304 memory_limit_: 8160437862 available bytes: 1550483558 curr_region_allocation_bytes_: 16320876032
2019-11-14 14:38:32.644377: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                  8160437862
InUse:                  6177115648
MaxInUse:               6185504256
NumAllocs:                     611
MaxAllocSize:            603979776

2019-11-14 14:38:32.644686: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *************************************_**********************************************************____
2019-11-14 14:38:32.644868: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at pad_op.cc:122 : Resource exhausted: OOM when allocating tensor with shape[16777216,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "c:\python\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "c:\python\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "c:\python\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[16777216,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node transformer/parallel_0_4/transformer/transformer/body/decoder/layer_2/self_attention/multihead_attention/dot_product_attention/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

I've attached a pastebin with the "full" relevant log here: https://pastebin.com/CQpYdUC4
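One thing I noticed while staring at the log: the failing tensor shape[16777216, 9] is exactly 4096², which fits my assumption that the self-attention matrix grows quadratically with the sequence length (I can't account for the 9). A rough back-of-the-envelope check of how I'm reading it:

# My reading of the OOM line above, assuming the padded attention tensor
# has seq_len * seq_len rows (the factor of 9 is just taken from the log).
seq_len = 4096
rows = seq_len * seq_len             # 16_777_216 -> matches shape[16777216, 9]
print(rows * 9 * 4)                  # float32: 603_979_776 bytes (~576 MiB), the allocation that fails

seq_len = 2048
print(seq_len * seq_len * 9 * 4)     # 150_994_944 bytes (~144 MiB) -- 4x smaller, which fits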

To get the obvious questions out of the way: no, I am not running any other programs that use the GPU, and I am not running multiple instances. It can't even allocate 512 MB of GPU memory when it should have up to 8 GB available.

I've tried manually reducing memory_fraction all the way down to 0.2 in the t2t_trainer.py script, and I've also tried setting "allow_growth" (a sketch of what I changed is below). Neither seems to help; setting memory_fraction to 0.2 only reduced the available memory, so it just tried to allocate 1.44 GB up front instead of 7.
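For reference, this is roughly what I was experimenting with. It's a minimal sketch of the TF 1.x session config options, not the exact code from t2t_trainer.py (in tensor2tensor these end up inside the RunConfig's session_config):

import tensorflow as tf  # 1.14

# The two knobs I tried: cap the per-process memory fraction, and/or
# let the allocator grow on demand instead of grabbing everything up front.
gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=0.2,  # tried lowering this to 0.2
    allow_growth=True,                    # also tried enabling this
)
session_config = tf.ConfigProto(gpu_options=gpu_options)
# With a plain session this would be used as: sess = tf.Session(config=session_config)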

I'm at a loss. For the record, this is TensorFlow 1.14 with CUDA 10.0, since that's what the model requires.
