Questions tagged "cufft" - Stack Overflow Japanese site

0 votes

1 answer

515 views

cuda - Why does cufftPlanMany() take so long?

The first call to cufftPlanMany() takes about 0.7 seconds, but every subsequent call is fast. Is there a way to speed up the first call to cufftPlanMany()?

2015-09-18T21:40:23.483

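0 votes

0 answers

329 views

numpy - FFTW / CUFFT over a given axis of a multidimensional array

Is there an efficient way to perform an fft over a particular axis of a multidimensional array using FFTW / CUFFT (which have similar APIs)?

Suppose I have a 3D array of shape (2, 3, 4). The strides are (12, 4, 1), meaning that moving one unit along the last axis moves 1 unit in the flat array, whereas moving one unit along the first axis requires stepping over 3 * 4 = 12 units. (The array is a numpy ndarray, which can have other strides when its axes are transposed, but I would be happy with an answer that addresses only this particular 3D case with the given strides.)

Now suppose I want to compute a 1D fft along the middle axis. CUFFT exposes the following function:

To do the transform, I think the nembed, stride, and dist parameters are needed. They are documented here: http://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout

The documentation says that, for a 1D fft, the element at position x of batch b is taken from:

input[b * idist + x * istride]

However, the element at position [b][x][z] is stored at: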

input[b * 12 + x * 4 + z]

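So it is not clear how to make CUFFT loop over the third (z) axis.

If I set:

idist and odist to 3*4=12 (so that incrementing b moves along the first axis),
istride and ostride to 4 (so that incrementing x moves along the second axis, which is the axis I want to fft),
batch = 2,
inembed and onembed to 3 (although, according to the documentation, these are ignored for 1D transforms),

then it computes the correct fft for each of the 2 batches where the last-axis index is 0, but leaves the sub-arrays where the last index is 1, 2, or 3 untouched.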
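The coverage problem described above can be checked with plain index arithmetic. The following is a minimal Python sketch, not cuFFT code: `touched_offsets` and `flat_index` are hypothetical helper names, and the only rule used is the quoted one, input[b * idist + x * istride]. It also tries one alternative layout (idist = 1, batch = 4, plus a base offset per index of the first axis), which is an assumption for illustration and not something the question itself proposes.

```python
def touched_offsets(n, istride, idist, batch, base=0):
    """Flat offsets read by a batched 1D transform of length n.

    Applies the addressing rule quoted in the question:
    input[b * idist + x * istride], shifted by an optional base offset.
    """
    return sorted(base + b * idist + x * istride
                  for b in range(batch) for x in range(n))


def flat_index(b, x, z, strides=(12, 4, 1)):
    """Flat offset of element [b][x][z] in the (2, 3, 4) array."""
    return b * strides[0] + x * strides[1] + z * strides[2]


# Layout from the question: idist=12, istride=4, batch=2, length 3.
asked = touched_offsets(n=3, istride=4, idist=12, batch=2)
z0_only = sorted(flat_index(b, x, 0) for b in range(2) for x in range(3))
print(asked)             # [0, 4, 8, 12, 16, 20]
print(asked == z0_only)  # True: only the z = 0 elements are visited

# Assumed alternative: the batch index walks z (idist=1) inside one slab
# of the first axis, and a base offset of 12 selects the next slab.
slab0 = touched_offsets(n=3, istride=4, idist=1, batch=4, base=0)
slab1 = touched_offsets(n=3, istride=4, idist=1, batch=4, base=12)
print(slab0 + slab1 == list(range(24)))  # True: all 24 elements are visited
```

Under these assumptions the alternative layout would need one call per index of the first axis (2 here) rather than one per (b, z) pair. The sketch only verifies the addressing; it says nothing about performance.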

This seems like a common use case, but I cannot figure out how to do it with the given parameters without making multiple calls (which is costly on a GPU) or making a copy with a different memory layout.

numpy multidimensional-array cuda fftw cufft

2015-10-07T06:10:00.090

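0 votes

1 answer

420 views

cuda - Why is cuFFT "slow" on the K40?

I compared a simple 3D cuFFT program on a GTX 780 and on a Tesla K40 in double-precision mode.

I measured about 85 Gflops on the GTX 780 and about 160 Gflops on the K40. These results puzzled me: the GTX 780 has a theoretical peak performance of 166 Gflops, whereas the K40 has 1.4 Tflops.

The fact that the effective performance of cuFFT on the K40 is very far from the theoretical peak performance can also be seen in the graphs produced by Nvidia at this link.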
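To put the gap in numbers, here is a small Python sketch that uses only the four figures quoted in the question (85 and 160 Gflops measured, 166 Gflops and 1.4 Tflops peak). The helper name `fraction_of_peak` is made up for illustration.

```python
def fraction_of_peak(measured_gflops, peak_gflops):
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


gtx780 = fraction_of_peak(85.0, 166.0)    # figures from the question
k40 = fraction_of_peak(160.0, 1400.0)     # 1.4 Tflops = 1400 Gflops

print(f"GTX 780: {gtx780:.1%} of peak")   # GTX 780: 51.2% of peak
print(f"K40:     {k40:.1%} of peak")      # K40:     11.4% of peak
print(f"Measured speedup: {160.0 / 85.0:.2f}x")    # 1.88x
print(f"Peak ratio:       {1400.0 / 166.0:.2f}x")  # 8.43x
```

So the K40 is roughly 1.9 times faster in the measurement, while its peak is more than 8 times higher, which is the discrepancy the question asks about.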

Can someone explain to me why this happens? Is there a limitation in the cuFFT library? Maybe some cache-related reasons...

cuda fft cufft

2015-12-16T10:49:02.470

0 votes

1 answer

111 views

matlab - Recursive use of self-implemented cuIDCT.cu leads to changing output every time the code is re-run

I have implemented a CUDA version of the inverse discrete cosine transform (IDCT) by "translating" the MATLAB built-in function idct.m into CUDA:

My implementation, cuIDCT.cu, works when m = n and both m and n are even numbers.

cuIDCT.cu

Then I compared the result of my CUDA IDCT (i.e. cuIDCT.cu) against MATLAB idct.m using the following code:

a test main.cpp function, and
a MATLAB main function main.m to read result from CUDA and compare it against MATLAB.

main.cpp

main.m

I ran the code in Visual Studio 11 (i.e. VS2012) on Windows 7 with an Nvidia Tesla K20c GPU, using CUDA Toolkit version 7.5. My MATLAB version is R2015b.

My test steps:

For test case 1: uncomment test case 1 and comment out test case 2.
1. Run main.cpp.
2. Run main.m in MATLAB.
3. Repeat step 1 and step 2 (without any change, just re-run the code).

I repeated step 3 20 times. The output was unchanged, and the results in main.m are:

results of test case 1

The maximum error is 7.7152e-07.

For test case 2: uncomment test case 2 and comment out test case 1.
1. Run main.cpp.
2. Run main.m in MATLAB.
3. Repeat step 1 and step 2 (without any change, just re-run the code).

I repeated step 3 20 times. The output changed between runs, and the results in main.m are (not enough reputation to post all images, so only the wrong case is shown below):

one situation (the wrong one) of test case 2

The maximum error is 0.45341 (2 times), 0.44898 (1 time), 0.26186 (1 time), 0.26301 (1 time), and 9.5716e-07 (15 times).

From the test results, my conclusion is:

From test case 1: cuIDCT.cu agrees numerically with idct.m (error ~10^-7).
From test case 2: recursive use of cuIDCT.cu leads to unstable results (i.e. the output changes every time the code is re-run and is sometimes numerically wrong, error ~0.1).

My question:

From test case 1 we know cuIDCT.cu agrees numerically with idct.m. But why does recursive use of cuIDCT.cu lead to different output each time the code is re-run?

Any help or suggestions are highly appreciated.

matlab cuda gpu dct cufft

2016-01-04T18:02:20.997

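0 votes

1 answer

557 views

fft - cuFFT R2C batch output size does not match the input size

I am experimenting with batches using cuFFT, but I don't think I am getting the right output.

I am allocating two arrays on the GPU:

I am initializing the source array with a simple kernel like this:

Basically, each array has values from 0 to 15, and I have 16 of them.

I create my plan like this:

And then I execute my plan:

Finally, I transfer the contents of dst to the host. But when I print out the values, I get the following:

I was expecting repetitive output, but it repeats every 9 numbers instead of every 16.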
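The period of 9 matches the size of a real-to-complex transform: for a real input of length n, only n/2 + 1 complex outputs are non-redundant, and that is what cuFFT stores per batch for R2C. The Python sketch below illustrates this with a naive DFT; it is not the asker's code, and `r2c_output_length` and `dft` are hypothetical helper names.

```python
import cmath


def r2c_output_length(n):
    """Number of complex outputs kept for a real input of length n."""
    return n // 2 + 1


def dft(signal):
    """Naive O(n^2) DFT, for illustration only."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


n, batch = 16, 16
spectrum = dft([float(v) for v in range(n)])  # one array holding 0..15

# For real input, X[n - k] is the complex conjugate of X[k], so the
# bins above n/2 carry no new information and can be dropped.
redundant = all(abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9
                for k in range(1, n // 2))

print(r2c_output_length(n))          # 9 complex values per batch
print(batch * r2c_output_length(n))  # 144 complex values in total
print(redundant)                     # True
```

With n = 16 and 16 batches, the destination would therefore hold 16 blocks of 9 complex values, assuming the plan is a batched 1D R2C transform as the title suggests.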

Am I doing something wrong? Or is there something I don't understand?

fft cufft

2016-01-22T15:19:45.427
