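I wrote a simple Python script (using Theano) that performs linear regression and is meant to run on the GPU. When the code starts it prints "Using gpu device", but according to the profiler all operations are CPU-specific (Elemwise instead of GpuElemwise, no GpuFromHost, and so on).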

I checked the THEANO_FLAGS variable and everything looks right, so I can't see the catch (especially since the Theano tutorials run correctly on the GPU with the same settings :)).

Here is the code:

# linear regression

import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()

  • THEANO_FLAGS=cuda.root=/usr/local/cuda
  • device=gpu
  • floatX=float32
  • lib.cnmem=.5
  • profile=True
  • CUDA_LAUNCH_BLOCKING=1

Output:

Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
  Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
  Time in 1000 calls to Function.__call__: 3.348637e-02s
  Time in Function.fn.__call__: 2.419019e-02s (72.239%)
  Time in thunks: 1.839781e-02s (54.941%)
  Total compile time: 1.350801e-01s
    Number of Apply nodes: 18
    Theano Optimizer time: 1.101730e-01s
       Theano validate time: 2.029657e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
       Import time 2.320528e-03s

Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000       2   theano.tensor.basic.Dot
  12.3%    83.9%       0.002s       3.22e-07s     C     7000       7   theano.tensor.elemwise.Elemwise
   5.7%    89.6%       0.001s       3.50e-07s     C     3000       3   theano.tensor.elemwise.DimShuffle
   4.0%    93.6%       0.001s       3.65e-07s     C     2000       2   theano.tensor.subtensor.Subtensor
   3.6%    97.2%       0.001s       3.31e-07s     C     2000       2   theano.compile.ops.Shape_i
   1.7%    98.9%       0.000s       3.06e-07s     C     1000       1   theano.tensor.opt.MakeVector
   1.1%   100.0%       0.000s       2.10e-07s     C     1000       1   theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000        2   dot
   4.0%    75.6%       0.001s       3.65e-07s     C     2000        2   Subtensor{int64}
   3.5%    79.1%       0.001s       6.35e-07s     C     1000        1   InplaceDimShuffle{1,0}
   3.3%    82.4%       0.001s       6.06e-07s     C     1000        1   Elemwise{mul,no_inplace}
   2.4%    84.8%       0.000s       4.38e-07s     C     1000        1   Shape_i{0}
   2.3%    87.1%       0.000s       4.29e-07s     C     1000        1   Elemwise{Composite{((i0 * i1) / i2)}}
   2.3%    89.3%       0.000s       2.08e-07s     C     2000        2   InplaceDimShuffle{x,x}
   1.8%    91.1%       0.000s       3.25e-07s     C     1000        1   Elemwise{Cast{float64}}
   1.7%    92.8%       0.000s       3.06e-07s     C     1000        1   MakeVector{dtype='int64'}
   1.5%    94.3%       0.000s       2.78e-07s     C     1000        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
   1.4%    95.7%       0.000s       2.53e-07s     C     1000        1   Elemwise{Sub}[(0, 1)]
   1.2%    96.9%       0.000s       2.24e-07s     C     1000        1   Shape_i{1}
   1.1%    98.0%       0.000s       2.10e-07s     C     1000        1   Sum{acc_dtype=float64}
   1.1%    99.1%       0.000s       1.98e-07s     C     1000        1   Elemwise{Sqr}[(0, 0)]
   0.9%   100.0%       0.000s       1.66e-07s     C     1000        1   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  37.8%    37.8%       0.007s       6.95e-06s   1000     3   dot(<TensorType(float64, matrix)>, training-set.T)
  33.9%    71.7%       0.006s       6.24e-06s   1000    14   dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
   3.5%    75.1%       0.001s       6.35e-07s   1000     0   InplaceDimShuffle{1,0}(training-set)
   3.3%    78.4%       0.001s       6.06e-07s   1000    11   Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
   3.0%    81.4%       0.001s       5.58e-07s   1000     8   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
   2.4%    83.8%       0.000s       4.38e-07s   1000     2   Shape_i{0}(expected)
   2.3%    86.2%       0.000s       4.29e-07s   1000    12   Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
   1.8%    87.9%       0.000s       3.25e-07s   1000     6   Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
   1.7%    89.6%       0.000s       3.06e-07s   1000     4   MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
   1.6%    91.2%       0.000s       3.03e-07s   1000    10   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   1.5%    92.7%       0.000s       2.78e-07s   1000    16   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
   1.4%    94.1%       0.000s       2.53e-07s   1000     5   Elemwise{Sub}[(0, 1)](expected, dot.0)
   1.2%    95.3%       0.000s       2.24e-07s   1000     1   Shape_i{1}(expected)
   1.1%    96.5%       0.000s       2.10e-07s   1000    15   Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
   1.1%    97.6%       0.000s       1.98e-07s   1000    13   Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
   0.9%    98.5%       0.000s       1.72e-07s   1000     7   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
   0.9%    99.4%       0.000s       1.66e-07s   1000    17   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
   0.6%   100.0%       0.000s       1.13e-07s   1000     9   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

1 Answer

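As mentioned in the comments, you have set the allow_input_downcast parameter to True, but you still need to make sure that all the data assigned to shared variables is float32. As of January 6, 2016, Theano cannot do computations on the GPU with any data type other than float32. See here for more details. So you have to cast your data to 'float32'.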
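A quick way to see why the original arrays do not qualify is to inspect their dtypes with NumPy alone. The sketch below uses the data from the question. The helper name `is_float32` is made up for illustration and is not part of NumPy or Theano.

```python
import numpy

# Same data as in the question; integer literals produce an integer matrix.
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
# numpy.zeros defaults to float64 when no dtype is given.
weights = numpy.zeros((1, 2))


def is_float32(array):
    """Hypothetical helper: True if the array's dtype is float32."""
    return bool(array.dtype == numpy.float32)


before = [is_float32(input_data), is_float32(weights)]

# Cast explicitly before the arrays are handed to theano.shared.
input_data32 = input_data.astype('float32')
weights32 = numpy.zeros((1, 2), dtype='float32')

after = [is_float32(input_data32), is_float32(weights32)]
print(before, after)
```

Neither of the original arrays is float32, which matches the float64 types that show up in the question's profile.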

So the code you need to use is:

import numpy
import theano
import theano.tensor as T


input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data.astype('float32'), "training-set")
E = theano.shared(output_data.astype('float32'), "expected")
W1 = theano.shared(numpy.zeros((1, 2), dtype = 'float32'))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile = True)

for i in range(1000):
    train()

train.profile.print_summary()

Here are the profiling results:

Message: LearnTheano.py:18
  Time in 1000 calls to Function.__call__: 2.642968e-01s
  Time in Function.fn.__call__: 2.460811e-01s (93.108%)
  Time in thunks: 1.877530e-01s (71.039%)
  Total compile time: 2.483290e+01s
    Number of Apply nodes: 17
    Theano Optimizer time: 2.818849e-01s
       Theano validate time: 3.435850e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
       Import time 1.241469e-02s

Time in all call to theano.grad() 1.206994e-02s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  34.8%    34.8%       0.065s       3.27e-05s     C     2000       2   theano.sandbox.cuda.blas.GpuGemm
  28.8%    63.5%       0.054s       1.80e-05s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuElemwise
  12.9%    76.4%       0.024s       2.42e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.3%    86.7%       0.019s       1.93e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuFromHost
   7.2%    93.9%       0.014s       1.36e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   1.8%    95.7%       0.003s       1.13e-06s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   1.5%    97.2%       0.003s       2.81e-06s     C     1000       1   theano.tensor.elemwise.Elemwise
   1.1%    98.4%       0.002s       1.08e-06s     C     2000       2   theano.compile.ops.Shape_i
   1.1%    99.5%       0.002s       1.02e-06s     C     2000       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%   100.0%       0.001s       9.96e-07s     C     1000       1   theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.3%    25.3%       0.047s       4.74e-05s     C     1000        1   GpuGemm{no_inplace}
  12.9%    38.1%       0.024s       2.42e-05s     C     1000        1   GpuCAReduce{pre=sqr,red=add}{1,1}
  12.8%    51.0%       0.024s       2.41e-05s     C     1000        1   GpuElemwise{mul,no_inplace}
  10.3%    61.3%       0.019s       1.93e-05s     C     1000        1   GpuFromHost
   9.5%    70.8%       0.018s       1.79e-05s     C     1000        1   GpuGemm{inplace}
   8.2%    79.0%       0.015s       1.55e-05s     C     1000        1   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   7.7%    86.7%       0.014s       1.44e-05s     C     1000        1   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
   7.2%    93.9%       0.014s       1.36e-05s     C     1000        1   HostFromGpu
   1.5%    95.4%       0.003s       2.81e-06s     C     1000        1   Elemwise{Cast{float32}}
   1.1%    96.5%       0.002s       1.02e-06s     C     2000        2   GpuSubtensor{int64}
   1.0%    97.5%       0.002s       9.00e-07s     C     2000        2   GpuDimShuffle{x,x}
   0.8%    98.3%       0.002s       1.59e-06s     C     1000        1   GpuDimShuffle{1,0}
   0.7%    99.1%       0.001s       1.38e-06s     C     1000        1   Shape_i{0}
   0.5%    99.6%       0.001s       9.96e-07s     C     1000        1   MakeVector
   0.4%   100.0%       0.001s       7.76e-07s     C     1000        1   Shape_i{1}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  25.3%    25.3%       0.047s       4.74e-05s   1000     3   GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
  12.9%    38.1%       0.024s       2.42e-05s   1000     5   GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
  12.8%    51.0%       0.024s       2.41e-05s   1000    13   GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
  10.3%    61.3%       0.019s       1.93e-05s   1000     7   GpuFromHost(Elemwise{Cast{float32}}.0)
   9.5%    70.8%       0.018s       1.79e-05s   1000    16   GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
   8.2%    79.0%       0.015s       1.55e-05s   1000    12   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
   7.7%    86.7%       0.014s       1.44e-05s   1000    15   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
   7.2%    93.9%       0.014s       1.36e-05s   1000    14   HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
   1.5%    95.4%       0.003s       2.81e-06s   1000     6   Elemwise{Cast{float32}}(MakeVector.0)
   0.8%    96.3%       0.002s       1.59e-06s   1000     0   GpuDimShuffle{1,0}(training-set)
   0.7%    97.0%       0.001s       1.38e-06s   1000     2   Shape_i{0}(expected)
   0.7%    97.7%       0.001s       1.30e-06s   1000     8   GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
   0.6%    98.3%       0.001s       1.08e-06s   1000    11   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   0.5%    98.8%       0.001s       9.96e-07s   1000     4   MakeVector(Shape_i{0}.0, Shape_i{1}.0)
   0.4%    99.2%       0.001s       7.76e-07s   1000     1   Shape_i{1}(expected)
   0.4%    99.6%       0.001s       7.40e-07s   1000     9   GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
   0.4%   100.0%       0.001s       7.25e-07s   1000    10   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
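To compare the two profiles at a glance, you can count how many of the listed op classes are GPU ones. This is only a rough sketch: the class names are copied from the "Class" sections of the two profiles above, the helper `split_gpu_cpu` is hypothetical, and treating any class whose short name contains "Gpu" as a GPU op is an assumption of this example, not a Theano rule.

```python
def split_gpu_cpu(class_names):
    """Hypothetical helper: split profiler class names into GPU and CPU lists.

    Assumption: a class counts as a GPU op if the last component of its
    dotted name contains "Gpu" (e.g. GpuGemm, GpuFromHost, HostFromGpu).
    """
    gpu, cpu = [], []
    for name in class_names:
        short = name.rsplit('.', 1)[-1]
        (gpu if 'Gpu' in short else cpu).append(short)
    return gpu, cpu


# Class names from the profile in the question (float64 data).
question_classes = [
    'theano.tensor.basic.Dot',
    'theano.tensor.elemwise.Elemwise',
    'theano.tensor.elemwise.DimShuffle',
    'theano.tensor.subtensor.Subtensor',
    'theano.compile.ops.Shape_i',
    'theano.tensor.opt.MakeVector',
    'theano.tensor.elemwise.Sum',
]

# Class names from the profile in this answer (float32 data).
answer_classes = [
    'theano.sandbox.cuda.blas.GpuGemm',
    'theano.sandbox.cuda.basic_ops.GpuElemwise',
    'theano.sandbox.cuda.basic_ops.GpuCAReduce',
    'theano.sandbox.cuda.basic_ops.GpuFromHost',
    'theano.sandbox.cuda.basic_ops.HostFromGpu',
    'theano.sandbox.cuda.basic_ops.GpuDimShuffle',
    'theano.tensor.elemwise.Elemwise',
    'theano.compile.ops.Shape_i',
    'theano.sandbox.cuda.basic_ops.GpuSubtensor',
    'theano.tensor.opt.MakeVector',
]

q_gpu, q_cpu = split_gpu_cpu(question_classes)
a_gpu, a_cpu = split_gpu_cpu(answer_classes)
print(len(q_gpu), len(q_cpu))
print(len(a_gpu), len(a_cpu))
```

With the float64 data none of the classes are GPU ops, while after the cast most of them are. Only a few small shape-related ops stay on the CPU.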
answered 2016-01-06T00:29:42.393