tensorflow - No speed-up with TensorRT FP16 or INT8 on NVIDIA V100

Question

I am trying to use trt.create_inference_graph to convert my Keras-converted TensorFlow saved model from FP32 to FP16 and INT8, and then save it in a format that can be used with TensorFlow Serving. Code here - https://colab.research.google.com/drive/16zUmIx0_KxRHLN751RCEBuZRKhWx6BsJ

However, when I run this with my test client, I see no change in the timing.

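I compared different models on an NVIDIA V100 32 GB and on the 8 GB GTX 1070 card in my laptop. I tried reducing and increasing the input shape to check for memory effects. Overall, apart from the advantage of 32 GB of memory (not just for loading the model, but for processing frames without running out of memory), the V100 does not seem to give any speed-up. I was expecting roughly a 2x speed-up, especially in FP16 mode. I am not sure whether the Keras-converted TF model, or the model's complexity or design, plays some part in this.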

Test details are here: https://docs.google.com/spreadsheets/d/1Sl7K6sa96wub1OXcneMk1txthQfh63b0H5mwygyVQlE/edit?usp=sharing

Model 4    Keras converted TF serving
Model 6    TF Graph simple optimisation
Model 7    TF Graph simple optimisation + Weight Quantization
Model 8    TF Graph simple optimisation + Weight + Model Quantization

Model 9    Based on Model 4 frozen; NVIDIA TensorRT optimisation FP32
Model 10   Based on Model 4 frozen; NVIDIA TensorRT optimisation FP16
Model 11   Based on Model 4 frozen; NVIDIA TensorRT optimisation INT8

No. of runs: 1
Model   NVIDIA GTX 1070   NVIDIA V100 32 GB
4       0.13              0.13
6       0.14              0.15
7       0.15              0.14
9       0.13              0.12
10      0.13              0.12
11      0.13              0.12

No. of runs: 10
Model   NVIDIA GTX 1070   NVIDIA V100 32 GB
4       1.15              0.81
6       1.34              1.16
7       1.15              1.27
9       1.23              0.82
10      1.22              0.83
11      1.22              0.85
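
To put the 10-run numbers in perspective, the ratios can be computed directly. This is only a minimal sketch: the timings are copied from the table above, and the helper names (`speedup`, `v100_vs_1070`, `precision_vs_fp32`) are made up for illustration.

```python
# Timings for 10 runs, copied from the table: model -> (GTX 1070, V100).
TIMINGS_10_RUNS = {
    4: (1.15, 0.81),
    6: (1.34, 1.16),
    7: (1.15, 1.27),
    9: (1.23, 0.82),
    10: (1.22, 0.83),
    11: (1.22, 0.85),
}


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


def v100_vs_1070(timings):
    """Per-model speed-up of the V100 over the GTX 1070."""
    return {m: round(speedup(gtx, v100), 2) for m, (gtx, v100) in timings.items()}


def precision_vs_fp32(timings, fp32_model=9, models=(10, 11), gpu_index=1):
    """Speed-up of the FP16/INT8 TensorRT models over the FP32 TensorRT model.

    gpu_index 0 selects the GTX 1070 column, 1 selects the V100 column.
    """
    base = timings[fp32_model][gpu_index]
    return {m: round(speedup(base, timings[m][gpu_index]), 2) for m in models}


if __name__ == "__main__":
    print("V100 vs GTX 1070:", v100_vs_1070(TIMINGS_10_RUNS))
    print("FP16/INT8 vs FP32 on V100:", precision_vs_fp32(TIMINGS_10_RUNS))
```

With these figures, Model 10 (FP16) comes out at about 0.99x and Model 11 (INT8) at about 0.96x of Model 9 (FP32) on the V100, nowhere near the 2x I was hoping for.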

FP32 - V100 - No optimization

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.968112)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.8355837)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234411)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.7228488922119141)

FP32 with TensorFlow-based optimization - TransformGraph

Without weight or model quantization

('Time for ', 10, ' is ', 0.6342859268188477)

FP ?? with TensorFlow-based optimization + weight quantization - TransformGraph

After weight quantization, the model size is 39 MB!! (down from ~149 MB), but the time is doubled: ('Time for ', 10, ' is ', 1.201113224029541)
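
The size drop is roughly what 8-bit weight storage would give. The sketch below is a plain-Python illustration of linear min/max quantization, not the TransformGraph implementation; it assumes float32 weights (4 bytes each), and the function names `quantize_weights` and `dequantize_weights` are hypothetical. It only illustrates the size arithmetic and does not reproduce the slowdown.

```python
import random

FLOAT32_BYTES = 4  # assumption: the original weights are stored as float32


def quantize_weights(weights):
    """Map floats linearly onto 0..255; return (quantized bytes, min, max)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid dividing by zero for constant weights
    codes = (int(round((w - lo) / scale)) for w in weights)
    return bytes(min(255, max(0, c)) for c in codes), lo, hi


def dequantize_weights(quantized, lo, hi):
    """Recover approximate floats; this step has to run before float math."""
    scale = (hi - lo) / 255.0
    return [lo + b * scale for b in quantized]


if __name__ == "__main__":
    random.seed(0)
    weights = [random.uniform(-1.0, 1.0) for _ in range(10000)]
    quantized, lo, hi = quantize_weights(weights)
    restored = dequantize_weights(quantized, lo, hi)

    print("float32 bytes:", FLOAT32_BYTES * len(weights))
    print("8-bit bytes:  ", len(quantized))
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

One byte per weight instead of four would take ~149 MB down to about 37 MB, which is in the same range as the 39 MB I see. If the weights are converted back to float at inference time, as in `dequantize_weights` here, smaller storage would not by itself make the computation faster.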

Model quantization - does not work (at least with TF Serving)

Using NVIDIA TensorRT optimization (colab notebook)

FP16 - V100

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8691568374633789)

INT8

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8551359176635742)

Optimization snippet: https://colab.research.google.com/drive/1u79vDN4MZuq6gYIOkPmWsbghjunbDq6m

Note: there are slight differences between runs.
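
Since the timings vary a little from run to run, a small harness that discards warm-up calls and reports the spread can make the comparison steadier. This is a sketch only: `benchmark` and `fake_request` are hypothetical names, and `fake_request` stands in for whatever call the test client makes to TF Serving.

```python
import statistics
import time


def benchmark(fn, warmup=2, repeats=10):
    """Time `fn` over `repeats` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: lets one-off start-up costs happen first

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


def fake_request():
    """Hypothetical stand-in for one inference request."""
    time.sleep(0.001)


if __name__ == "__main__":
    stats = benchmark(fake_request, warmup=2, repeats=10)
    for name, value in stats.items():
        print("{}: {:.6f} s".format(name, value))
```

Reporting the per-call mean together with the standard deviation makes it easier to judge whether a gap such as 0.82 vs 0.83 is a real difference or just run-to-run noise.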

