matlab - SVD speed on CPU and GPU

Question

I am testing svd in Matlab R2014a, and there seems to be no CPU vs GPU speedup. I am using a GTX 460 card and a Core 2 Duo E8500.

Here is my code:

%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc

Also, the run times vary from run to run, but the CPU and GPU versions are about the same. Why is there no speedup?

Here are some tests:

for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n);
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n);
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.


for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n,'single');
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n,'single');
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.

score 9 · Accepted Answer

Generally, SVD is a difficult routine to parallelize. You can check here that even with a high-end Tesla card, the speedup is not very impressive.

You have a GTX460 card - Fermi architecture. The card is optimized for gaming (single precision computations), not HPC (double precision computation). The Single Precision / Double Precision throughput ratio is 12. So the card has 873 GFLOPS SP / 72 GFLOPS DP. Check here.

So if the Md array uses double precision elements, then the computation on it would be rather slow. Also there's a high chance that when calling the CPU routine, all CPU cores will get utilized, reducing the possible gain of running the routine on the GPU. Plus, in the GPU run you pay time for transferring the buffer to the device.

Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can also try a bigger dataset to see if something changes. I don't expect too much gain for this routine on your GPU.
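One way to rerun the benchmark is sketched below. It is an illustration under assumptions, not the asker's script: the sizes are kept small so it finishes quickly, and the variable names are made up. It relies on timeit and gputimeit, which repeat the call several times; gputimeit also waits for the device to finish before it stops the clock, which a plain tic/toc pair does not necessarily do. The GPU branch is skipped when no device or no Parallel Computing Toolbox is available.

```matlab
m = 2000; n = 100;                 % assumed sizes, small enough to run quickly
Mh = rand(m, n, 'single');         % single precision, as suggested above

% CPU timing: timeit repeats the call and returns a typical run time
tCpu = timeit(@() svd(Mh));

% Check for a usable GPU; fall back gracefully if the toolbox is missing
try
    hasGpu = gpuDeviceCount > 0;
catch
    hasGpu = false;
end

if hasGpu
    Md = gpuArray(Mh);                  % host-to-device copy, kept out of the timing
    tGpu = gputimeit(@() svd(Md));      % synchronizes with the device before stopping
    fprintf('CPU %.4f s, GPU %.4f s\n', tCpu, tGpu);
else
    fprintf('CPU %.4f s (no GPU available)\n', tCpu);
end
```

Keeping the transfer outside the timed call separates the cost of the copy from the cost of the decomposition itself.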

Update 1:

After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because an SSE register holds half as many double values as single values. However, a ratio of only 2 on the GPU side means that the GPU code does not make the best use of the SM cores, because the theoretical ratio is 12. In other words, I would have expected much better SP performance than DP performance from optimized code. It seems that this is not the case.

score 5 · Accepted Answer

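As VAndrei has already stated, SVD is an algorithm that is difficult to parallelize.

Your main problem is the size of your matrix. The performance of SVD drops rapidly as the matrix grows. So your main goal should be to reduce the size of the matrix. This can be accomplished using the Gauss normal equations (basically, a reduction of an overdetermined linear system in the least-squares sense).

This can be done by simply multiplying the matrix by its transpose: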

MhReduced = Mh' * Mh;

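This reduces the matrix to a size of cols*cols (where cols is the number of columns of Mh). Then you just call [U,S,V] = svd(MhReduced);

Note: Using this method may yield singular vectors with the opposite sign (important if you are comparing the methods).

If your matrix is well-conditioned, this should work without problems. For an ill-conditioned matrix, however, this method can fail to produce a usable result, whereas applying SVD directly may still yield a usable result thanks to the robustness of SVD.

This should improve performance considerably, at least when the matrix is large enough. Another advantage is that you can work with much larger matrices. You probably won't need the GPU at all (either the matrix is so large that copying it to the GPU costs too much, or the matrix after the reduction is so small that the GPU speedup won't be big enough).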
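The relationship between the two decompositions can be checked numerically. The following is a minimal sketch, not part of the original answer; the matrix size and the variable names are assumptions chosen for illustration. Since A'*A = V*S^2*V', the singular values of the reduced matrix are the squares of those of A, so a square root is needed to recover them, and the right singular vectors agree only up to sign.

```matlab
A = rand(2000, 50);                 % assumed tall, well-conditioned test matrix

[~, S, V]   = svd(A, 0);            % economy-size SVD of A itself
[~, S2, V2] = svd(A' * A);          % SVD of the reduced 50-by-50 matrix

sDirect = diag(S);
sGram   = sqrt(diag(S2));           % undo the squaring introduced by A'*A

% Relative difference between the two sets of singular values
relErr = norm(sDirect - sGram) / norm(sDirect);

% Sign-insensitive comparison of V: |v_i' * v2_i| is 1 when the columns match
colAgreement = abs(sum(V .* V2, 1));

fprintf('relative error in singular values: %.2e\n', relErr);
fprintf('worst column agreement: %.12f\n', min(colAgreement));
```

Note that the U returned for the reduced matrix is not the U of A, so this shortcut only helps when S and V are what you need.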

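Also note that a large part of the performance is lost if you request return values. If you are only interested in the performance of the SVD calculation, don't take any return values. If you are only interested in the "solution vector", just get V (and access its last column): [~,~, V] = svd(Mh);.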
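As a rough illustration of the cost of return values, here is a sketch with assumed sizes and names. Requesting only the singular values is the cheapest call. If the vectors are needed, the economy-size form svd(A,'econ') returns an m-by-n U rather than the m-by-m U of the full decomposition, which for a 10000-by-100 input is a large difference in work and memory. Timings are machine dependent, so none are quoted here.

```matlab
m = 5000; n = 100;                  % assumed sizes
A = rand(m, n);

tic; s = svd(A); tValues = toc;                 % singular values only
tic; [U, S, V] = svd(A, 'econ'); tEcon = toc;   % economy size: U is m-by-n

fprintf('values only: %.4f s, economy size: %.4f s\n', tValues, tEcon);
fprintf('size(U) = %d-by-%d\n', size(U, 1), size(U, 2));
```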

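EDIT:

I've looked at your sample code, but I'm not sure what it is you are calculating. I also realized that it is rather hard to understand what I did with A'*A, so I will explain in detail.

Given a linear system A*x=b, where A is the coefficient matrix with m rows and n columns, x is the solution vector (n rows) and b is the constant vector (m rows), the solution can be computed as follows:

if A is square (m=n): x = A^-1 * b,
if A is not square (m!=n, m > n):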

A * x = b

A'* A * x = A' * b

x = (A' * A)^-1 * A'*b

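A" = (A'*A)^-1 * A' is usually called the pseudo-inverse. However, this calculation has a negative effect on the condition number of the matrix. A solution to this problem is to use a singular value decomposition (SVD). If [U,S,V] = svd(A) denotes the result of the SVD, so that A = U*S*V', the pseudo-inverse is given by VS"U', where S" is formed by taking the inverse of the non-zero elements of S. So A" = VS"U'.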

x = A"*b
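The construction of A" from the SVD can be written out as follows. This is a sketch under assumptions: the matrix size is arbitrary, the names are illustrative, and the cutoff tol for deciding which singular values count as non-zero is a common choice rather than something the answer specifies. The result is compared with the built-in pinv as a sanity check.

```matlab
A = rand(200, 5);                   % assumed overdetermined system
b = rand(200, 1);

[U, S, V] = svd(A, 0);              % economy-size SVD: A = U*S*V'
s = diag(S);

% Invert only the singular values that are safely above round-off level
tol  = max(size(A)) * eps(max(s));
sInv = zeros(size(s));
sInv(s > tol) = 1 ./ s(s > tol);

Apinv = V * diag(sInv) * U';        % pseudo-inverse A" = V*S"*U'
x = Apinv * b;                      % least-squares solution of A*x = b

fprintf('difference from pinv(A): %.2e\n', norm(Apinv - pinv(A)));
```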

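However, an SVD is rather costly, especially for large matrices. If matrix A is well-conditioned and very precise results are not strictly required (we are talking about 1e-13 or 1e-14), the much faster approach of computing the pseudo-inverse via (A'*A)^-1 * A' can be used.

If your case is actually A*x=0, just use an SVD and read the last column vector of V; that is the solution.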
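For the homogeneous case, a small sketch may help. The matrix below is constructed for illustration so that a non-trivial solution exists; it and the variable names are assumptions, not taken from the question. Because svd returns singular values in descending order, the last column of V belongs to the smallest one.

```matlab
A = rand(100, 4);
A(:, 5) = A(:, 1) - 2 * A(:, 3);    % make the fifth column linearly dependent

[~, ~, V] = svd(A, 0);
x = V(:, end);                      % right singular vector of the smallest singular value

residual = norm(A * x);             % close to zero when A is rank deficient
fprintf('norm(A*x) = %.2e, norm(x) = %.4f\n', residual, norm(x));
```

When A has full column rank, the same vector instead minimizes norm(A*x) over all x of unit length.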

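If you are using the SVD not to solve a linear system but for the results U and S (as your example suggests), I am not sure whether what I have posted will help you.

Sources: 1, 2, 3

Here is some sample code for testing. Test it with large matrices and you will see that using (A'*A)^-1 * A' is much faster than the alternatives.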

clear all

nbRows = 30000;
nbCols = 100;
% Matrix A
A = rand(nbRows,nbCols);

% Vector b
b = rand(nbRows,1);

% A*x=b

% Solve for x, using SVD
% [U,S,V]=svd(A,0);
% x= V*((U'*b)./diag(S))
tic
[U1,S1,V1]=svd(A,0);
x1= V1*((U1'*b)./diag(S1));
toc

tic
[U1,S1,V1]=svd(A,0);
x2 = V1*inv(S1)*U1'*b;
toc

% Solve for x, using manual pseudo-inverse
% A*x=b
% A'*A*x = A'*b
% x = (A'*A)^-1 * A'*b
tic
x3 = inv(A'*A) * A'*b;
toc

% Solve for x, let Matlab decide how (for a rectangular A, backslash uses a QR-based least-squares solver)
tic
x4 = A\b;
toc
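A possible variation on the benchmark, offered as a sketch rather than as the answer's method: solve the normal equations with the backslash operator instead of forming inv(A'*A) explicitly, and compare against A\b to see how much accuracy is lost. The sizes and names here are assumptions.

```matlab
nbRows = 3000; nbCols = 20;         % assumed sizes
A = rand(nbRows, nbCols);
b = rand(nbRows, 1);

xNormal = (A' * A) \ (A' * b);      % normal equations, solved without inv
xRef    = A \ b;                    % least-squares solve on A itself

relDiff = norm(xNormal - xRef) / norm(xRef);
fprintf('cond(A) = %.2e, relative difference = %.2e\n', cond(A), relDiff);
```

The difference should grow as cond(A) grows, which is the conditioning caveat mentioned earlier in the answer.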
