cuda - Optimizing nested for loops using CUDA

Question

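I'm working on a project that uses OpenCV to detect the motion of moving objects. I'm trying to speed up the detection, and I have a nested for loop that I'd like to accelerate with CUDA. I have the CUDA integration fully set up in Visual Basic. This is the nested for loop in my .cpp file: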

    for (int i = 0; i < NumberOfFeatures; i++)
    {
        // Compute integral image.
        cvIntegral(mFeatureImgs[i], mFirstOrderIIs[i]);

        for (int j = 0; j < NumberOfFeatures; j++)
        {
            // Compute product feature image.
            cvMul(mFeatureImgs[i], mFeatureImgs[j], mWorker);

            // Compute integral image.
            cvIntegral(mWorker, mSecondOrderIIs[i][j]);
        }
    }

I'm relatively new to CUDA, so my question is: could someone show me an example of exactly how I would speed up this nested for loop using CUDA?

score 2 · Accepted Answer

As sgar91 pointed out, OpenCV includes a GPU module.

http://opencv.willowgarage.com/wiki/OpenCV_GPU

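That wiki also suggests how to ask GPU-related questions on the OpenCV help forum on Yahoo.

There is a GPU-accelerated image integral function. If you look around, you may also find an equivalent of cvMul.

You can't use exactly the same data types in the non-GPU code and the GPU version. Look at the "short sample" example on the wiki page I linked above. You'll see that, to transfer your existing data into a data structure the GPU can operate on, you need to do something like this: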

    cv::gpu::GpuMat dst, src;  // this is defining variables that can be accessed by the GPU
    src.upload(src_host);      // this is loading the src (GPU variable) with the image data

    cv::gpu::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);  //this is causing the GPU to act

You would need to do something similar, like this:

    cv::gpu::GpuMat dst, src;
    src.upload(src_data);

    cv::gpu::integral(src, dst);
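
When porting the loop to the GPU, it can help to keep a small CPU reference around to compare results against. The following is a minimal sketch in plain C++, assuming every feature image is a row-major `std::vector<float>` of the same size. `Image`, `integral`, and `second_order_integrals` are hypothetical names chosen for illustration, not OpenCV API. It mirrors the loop in the question: for every pair (i, j), an element-wise product (the role of `cvMul`) followed by an integral image (the role of `cvIntegral`).

```cpp
#include <cstddef>
#include <vector>

// Assumed layout: row-major image with w * h float pixels.
using Image = std::vector<float>;

// Integral image of size (w + 1) * (h + 1) with a zero first row and column.
// Each entry is built from its already-computed top, left and top-left
// neighbours.
Image integral(const Image& img, std::size_t w, std::size_t h) {
    const std::size_t W = w + 1;
    Image out(W * (h + 1), 0.0f);
    for (std::size_t y = 1; y <= h; ++y) {
        for (std::size_t x = 1; x <= w; ++x) {
            out[y * W + x] = img[(y - 1) * w + (x - 1)]
                           + out[(y - 1) * W + x]
                           + out[y * W + (x - 1)]
                           - out[(y - 1) * W + (x - 1)];
        }
    }
    return out;
}

// For every pair (i, j): multiply the two feature images element by element,
// then integrate the product. This is the CPU counterpart of the inner loop.
std::vector<std::vector<Image>> second_order_integrals(
        const std::vector<Image>& features, std::size_t w, std::size_t h) {
    const std::size_t n = features.size();
    std::vector<std::vector<Image>> result(n, std::vector<Image>(n));
    Image product(w * h);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < w * h; ++k) {
                product[k] = features[i][k] * features[j][k];
            }
            result[i][j] = integral(product, w, h);
        }
    }
    return result;
}
```

Because the element-wise product is commutative, `result[i][j]` and `result[j][i]` should be identical, which gives a cheap sanity check when comparing against GPU output.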

score 1 · Accepted Answer

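cv_integral basically sums the pixel values along both dimensions. This can be done using matrix operations alone, so, if you like, you could also try arrayfire for that. I've put together a small example of how to do image manipulation using matrices: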

    #include <arrayfire.h>
    #include <iostream>

    // computes the integral image of img
    af::array cv_integral(af::array img) {
        // create an integral image of size + 1 (row 0 and column 0 stay zero)
        int w = img.dims(0), h = img.dims(1);
        af::array integral = af::zeros(w + 1, h + 1, af::f32);

        // copy the image into the interior, leaving the zero border
        integral(af::seq(1, w), af::seq(1, h)) = img;

        // compute inclusive prefix sums along both dimensions
        integral = af::accum(integral, 0);
        integral = af::accum(integral, 1);

        std::cout << integral << "\n";

        return integral;
    }

    void af_test()
    {
        int w = 6, h = 5; // image size
        float img_host[] = {5,2,3,4,1,7,
                            1,5,4,2,3,4,
                            2,2,1,3,4,45,
                            3,5,6,4,5,2,
                            4,1,3,2,6,9};

        // create a GPU image (matrix) from the host data
        // NOTE: column-major order!!
        af::array img(w, h, img_host, af::afHost);

        // create a second image from random data
        af::array img2 = af::randu(w, h) * 10;
        // compute the integral image of the first image
        af::array integral = cv_integral(img);
        // elementwise product of the two images (both are w x h)
        af::array res = img * img2;
        // compute the integral image of the product
        res = cv_integral(res);
        af::eval(res);
        std::cout << res << "\n";
    }
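
For readers without ArrayFire at hand, the same idea (a zero border followed by inclusive prefix sums along each dimension) can be sketched in plain C++. This is a minimal CPU sketch under stated assumptions, not the code from the answer: the function name `integral_image` is hypothetical, and the image is assumed to be a row-major `std::vector<float>`, unlike the column-major ArrayFire array above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of OpenCV or ArrayFire): computes an integral
// image of size (w + 1) x (h + 1) from a row-major w x h image.
// Row 0 and column 0 of the result stay zero.
std::vector<float> integral_image(const std::vector<float>& img,
                                  std::size_t w, std::size_t h) {
    const std::size_t W = w + 1, H = h + 1;
    std::vector<float> out(W * H, 0.0f);

    // Copy the image into the interior, leaving the zero border.
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            out[(y + 1) * W + (x + 1)] = img[y * w + x];
        }
    }

    // Inclusive prefix sum along each row.
    for (std::size_t y = 0; y < H; ++y) {
        for (std::size_t x = 1; x < W; ++x) {
            out[y * W + x] += out[y * W + (x - 1)];
        }
    }

    // Inclusive prefix sum along each column.
    for (std::size_t y = 1; y < H; ++y) {
        for (std::size_t x = 0; x < W; ++x) {
            out[y * W + x] += out[(y - 1) * W + x];
        }
    }

    return out;
}
```

The two accumulation passes are independent of each other's direction, so running the column pass first would give the same result. Within a pass, each row (or column) is an independent prefix sum, which is what makes the operation a good fit for a GPU scan.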
