image - CUDA image average filter

Question

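An average filter is a windowed filter of the linear class that smooths a signal (image). It acts as a low-pass filter. The basic idea is that each element of the signal (image) is replaced by the average over its neighborhood.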

Given an m x n matrix and an average filter of size k, the new value of each point p:(i,j) in the matrix is the average of all points in the square around it.

[Figure: square kernel]

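The figure shows the square kernel for a filter of size 2: the yellow box is the pixel being averaged, the whole grid is the square of neighboring pixels, and the pixel's new value is the average of them.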
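To make the definition concrete, here is a minimal CPU sketch in plain C++ (host-only, so it also compiles inside a .cu file). The function name `box_filter_cpu`, the `radius` parameter (window side 2*radius+1) and the choice to count out-of-image neighbors as zero are assumptions made for illustration; the question does not specify them.

```cpp
#include <cstdint>
#include <vector>

// Reference CPU box filter for an 8-bit, single-channel, row-major image.
// Assumptions: the window is (2*radius+1) x (2*radius+1) and pixels that
// fall outside the image contribute zero to the sum.
std::vector<std::uint8_t> box_filter_cpu(const std::vector<std::uint8_t>& src,
                                         int width, int height, int radius)
{
    std::vector<std::uint8_t> dst(src.size());
    const int window = (2 * radius + 1) * (2 * radius + 1);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            // Visit every neighbor in the square centered on (x, y).
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int nx = x + dx;
                    const int ny = y + dy;
                    if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                        sum += src[ny * width + nx];
                }
            }
            dst[y * width + x] = static_cast<std::uint8_t>(sum / window);
        }
    }
    return dst;
}
```

Each output pixel reads (2*radius+1)^2 inputs, so the total work is width x height x (2*radius+1)^2. Every output pixel is independent of the others, which is why the problem suits a GPU.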

The problem is that this algorithm is very slow, especially on large images, so I thought about using GPGPU.

The question now is: how can this be implemented in CUDA, if that is possible at all?

score 22 · Accepted Answer

This is a classic embarrassingly parallel image processing problem, and it maps very easily onto the CUDA framework. In image processing, the averaging filter is known as a box filter.

The easiest approach is to use CUDA textures for the filtering, because textures make the boundary conditions very easy to handle.

Assuming the source and destination pointers are already allocated on the host, the procedure looks like this:

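1. Allocate enough device memory to hold the source and destination images.
2. Copy the source image from the host to the device.
3. Bind the source image device pointer to a texture.
4. Specify an appropriate block size and a grid large enough to cover every pixel of the image.
5. Launch the filtering kernel with that grid and block size.
6. Copy the result back to the host.
7. Unbind the texture.
8. Free the device pointers.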

Sample implementation of a box filter

Kernel:

//Legacy texture reference API: deprecated, and removed in CUDA 12 (texture objects replace it).
texture<unsigned char, cudaTextureType2D> tex8u;

//Box Filter Kernel For Gray scale image with 8bit depth
__global__ void box_filter_kernel_8u_c1(unsigned char* output,const int width, const int height, const size_t pitch, const int fWidth, const int fHeight)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    const int filter_offset_x = fWidth/2;
    const int filter_offset_y = fHeight/2;

    float output_value = 0.0f;

    //Make sure the current thread is inside the image bounds
    if(xIndex<width && yIndex<height)
    {
        //Sum the window pixels
        for(int i= -filter_offset_x; i<=filter_offset_x; i++)
        {
            for(int j=-filter_offset_y; j<=filter_offset_y; j++)
            {
                //No need to worry about Out-Of-Range access. tex2D automatically handles it.
                output_value += tex2D(tex8u,xIndex + i,yIndex + j);
            }
        }

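        //Average over the number of samples actually summed.
        //(2*offset+1 equals the filter size only when the size is odd.)
        output_value /= ((2*filter_offset_x + 1) * (2*filter_offset_y + 1));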

        //Write the averaged value to the output.
        //Transform 2D index to 1D index, because image is actually in linear memory
        int index = yIndex * pitch + xIndex;

        output[index] = static_cast<unsigned char>(output_value);
    }
}

Wrapper function:

void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight)
{

    /*
     * 2D memory is allocated as strided linear memory on GPU.
     * The terminologies "Pitch", "WidthStep", and "Stride" are exactly the same thing.
     * It is the size of a row in bytes.
     * It is not necessary that width = widthStep.
     * Total bytes occupied by the image = widthStep x height.
     */

    //Declare GPU pointer
    unsigned char *GPU_input, *GPU_output;

    //Allocate 2D memory on GPU. Also known as Pitch Linear Memory
    size_t gpu_image_pitch = 0;
    cudaMallocPitch<unsigned char>(&GPU_input,&gpu_image_pitch,width,height);
    cudaMallocPitch<unsigned char>(&GPU_output,&gpu_image_pitch,width,height);

    //Copy data from host to device.
    cudaMemcpy2D(GPU_input,gpu_image_pitch,CPUinput,widthStep,width,height,cudaMemcpyHostToDevice);

    /*
     * Set the behavior of tex2D for out-of-range image reads.
     * cudaAddressModeBorder = Read Zero
     * cudaAddressModeClamp  = Read the nearest border pixel
     * The default mode is Clamp, so this step is only needed for a different mode.
     * It is set before binding so that it is in effect when the texture is bound.
     */
    tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder;

    //Bind the image to the texture. Now the kernel will read the input image through the texture cache.
    //Use tex2D function to read the image
    cudaBindTexture2D(NULL,tex8u,GPU_input,width,height,gpu_image_pitch);

    /*
     * Specify a block size. 256 threads per block are sufficient.
     * It can be increased, but keep in mind the limitations of the GPU.
     * Older GPUs allow maximum 512 threads per block.
     * Current GPUs allow maximum 1024 threads per block
     */

    dim3 block_size(16,16);

    /*
     * Specify the grid size for the GPU.
     * Make it generalized, so that the size of grid changes according to the input image size
     */

    dim3 grid_size;
    grid_size.x = (width + block_size.x - 1)/block_size.x;  /*< Greater than or equal to image width */
    grid_size.y = (height + block_size.y - 1)/block_size.y; /*< Greater than or equal to image height */

    //Launch the kernel
    box_filter_kernel_8u_c1<<<grid_size,block_size>>>(GPU_output,width,height,gpu_image_pitch,filterWidth,filterHeight);

    //Copy the results back to CPU
    cudaMemcpy2D(CPUoutput,widthStep,GPU_output,gpu_image_pitch,width,height,cudaMemcpyDeviceToHost);

    //Release the texture
    cudaUnbindTexture(tex8u);

    //Free GPU memory
    cudaFree(GPU_input);
    cudaFree(GPU_output);
}
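The sample above relies on the texture reference API (`texture<...>`, `cudaBindTexture2D`, `cudaUnbindTexture`), which was removed in CUDA 12. The following is a sketch of the same filter using texture objects instead, not the answer's original code. The names `box_filter_texobj_8u_c1` and `make_texture_8u` are hypothetical, and the device buffer is assumed to come from `cudaMallocPitch` as in the wrapper above.

```cuda
#include <cuda_runtime.h>

// Same box filter, but the source image arrives as a texture object argument.
__global__ void box_filter_texobj_8u_c1(cudaTextureObject_t tex, unsigned char* output,
                                        int width, int height, size_t pitch,
                                        int fWidth, int fHeight)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    const int rx = fWidth / 2;
    const int ry = fHeight / 2;
    float sum = 0.0f;
    for (int j = -ry; j <= ry; ++j)
        for (int i = -rx; i <= rx; ++i)
            // +0.5f addresses the texel center; out-of-range reads return 0 (border mode).
            sum += tex2D<unsigned char>(tex, x + i + 0.5f, y + j + 0.5f);

    // Divide by the number of samples summed, then store through the byte pitch.
    output[y * pitch + x] =
        static_cast<unsigned char>(sum / ((2 * rx + 1) * (2 * ry + 1)));
}

// Hypothetical helper: wraps pitch-linear device memory in a texture object.
cudaTextureObject_t make_texture_8u(unsigned char* devPtr, size_t pitch, int width, int height)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = devPtr;
    res.res.pitch2D.desc = cudaCreateChannelDesc<unsigned char>();
    res.res.pitch2D.width = width;
    res.res.pitch2D.height = height;
    res.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeBorder;   // read zero outside the image
    td.addressMode[1] = cudaAddressModeBorder;
    td.filterMode = cudaFilterModePoint;         // no interpolation
    td.readMode = cudaReadModeElementType;       // return raw unsigned char values
    td.normalizedCoords = 0;                     // address texels by pixel index

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```

In the wrapper, this would replace the bind/unbind pair: create the object after the host-to-device copy, pass it as the first kernel argument, and call `cudaDestroyTextureObject(tex)` before freeing the device memory.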

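The good news is that you do not have to implement the filter yourself. The CUDA Toolkit ships with a free signal and image processing library from NVIDIA called NVIDIA Performance Primitives (NPP). NPP uses CUDA-capable GPUs to accelerate processing, and the averaging filter is already implemented in it. The version of NPP current at the time of writing (5.0) supports 8-bit images with 1 or 4 channels. The functions are: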

nppiFilterBox_8u_C1R for 1-channel images.
nppiFilterBox_8u_C4R for 4-channel images.
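As a rough sketch of what a call to the 1-channel function can look like: the wrapper name `box_filter_npp` and its parameters are hypothetical, and the image is assumed to already be on the device in pitched memory. Because the filter reads neighbors around each ROI pixel, this sketch restricts the ROI to the interior of the image so that every window stays inside the buffer. Newer toolkits also provide `_Ctx` variants that take a stream context; check the NPP documentation for your toolkit version.

```cuda
#include <cuda_runtime.h>
#include <npp.h>

// Sketch: box-filter the interior of an 8-bit, single-channel device image with NPP.
// d_src/d_dst and their pitches (in bytes) are assumed to come from cudaMallocPitch.
NppStatus box_filter_npp(const Npp8u* d_src, int srcPitch,
                         Npp8u* d_dst, int dstPitch,
                         int width, int height, int fWidth, int fHeight)
{
    const NppiSize  mask   = { fWidth, fHeight };
    const NppiPoint anchor = { fWidth / 2, fHeight / 2 };

    // Only pixels whose whole window lies inside the image are processed.
    const NppiSize roi = { width - fWidth + 1, height - fHeight + 1 };

    // Move both pointers to the first interior pixel.
    const Npp8u* src = d_src + anchor.y * srcPitch + anchor.x;
    Npp8u*       dst = d_dst + anchor.y * dstPitch + anchor.x;

    return nppiFilterBox_8u_C1R(src, srcPitch, dst, dstPitch, roi, mask, anchor);
}
```

The border pixels of the destination are left untouched by this sketch, so they need separate handling if the full image is required.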

score 5 · Answer

Some basic thoughts/steps:

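1. Copy the image data from the CPU to the GPU.
2. Call a kernel that builds the average of each row (horizontal) and stores it in shared memory (note that shared memory only lives for the duration of one kernel launch, so with two separate kernels the intermediate result has to go through global memory).
3. Call a kernel that builds the average of each column (vertical) and stores it in global memory.
4. Copy the data back to CPU memory.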

This should scale quite easily with 2D memory and multidimensional kernel launches.
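The row-then-column idea can be sketched as two kernels. This is one possible reading of the steps, with several assumptions: the kernel and function names are hypothetical, the images are densely packed (row stride equals `width`), out-of-image neighbors count as zero, and the intermediate row means are kept in a float buffer in global memory because shared memory does not persist between kernel launches.

```cuda
#include <cuda_runtime.h>

// Pass 1: horizontal mean over each pixel's row neighborhood, written to a float buffer.
__global__ void box_rows(const unsigned char* in, float* tmp,
                         int width, int height, int radius)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dx = -radius; dx <= radius; ++dx) {
        const int nx = x + dx;
        if (nx >= 0 && nx < width) sum += in[y * width + nx];  // outside counts as 0
    }
    tmp[y * width + x] = sum / (2 * radius + 1);
}

// Pass 2: vertical mean of the row means, written to the 8-bit output.
__global__ void box_cols(const float* tmp, unsigned char* out,
                         int width, int height, int radius)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        const int ny = y + dy;
        if (ny >= 0 && ny < height) sum += tmp[ny * width + x];  // outside counts as 0
    }
    out[y * width + x] = static_cast<unsigned char>(sum / (2 * radius + 1));
}

// Host-side launch; d_in, d_tmp and d_out are device buffers of width*height elements.
void box_filter_separable(const unsigned char* d_in, float* d_tmp, unsigned char* d_out,
                          int width, int height, int radius)
{
    const dim3 block(16, 16);
    const dim3 grid((width + block.x - 1) / block.x,
                    (height + block.y - 1) / block.y);
    // Both launches go to the same stream, so the second pass runs after the first.
    box_rows<<<grid, block>>>(d_in, d_tmp, width, height, radius);
    box_cols<<<grid, block>>>(d_tmp, d_out, width, height, radius);
}
```

The point of the split is the cost per pixel: two passes of 2*radius+1 reads each, instead of (2*radius+1)^2 reads for the direct square window. Results can differ from the direct version in the last bit because of float rounding.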

score 3 · Answer

If the filter size is normal rather than huge, the average filter is a very good fit for CUDA. You can set it up with square blocks, where each thread of a block computes the value of one pixel by summing and averaging its neighbors.

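If the image is kept in global memory, this is easy to program. One possible optimization is to load blocks of the image into each thread block's shared memory. By using phantom elements (so that you do not step outside the shared block when looking up neighboring pixels), you can compute the average of the pixels within a block.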

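The only thing you have to be careful about is how the "stitching" is done at the end: the shared memory blocks overlap (because of the extra "padding" pixels), and you do not want to compute those values twice.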
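A minimal sketch of the shared-memory variant, under stated assumptions: the names `box_filter_tiled`, `TILE` and `MAX_R` are hypothetical, the kernel must be launched with `TILE x TILE` threads per block, `radius` must not exceed `MAX_R`, the image is densely packed, and the phantom (halo) elements outside the image are filled with zero.

```cuda
#include <cuda_runtime.h>

#define TILE  16   // output pixels per block side; launch with dim3(TILE, TILE)
#define MAX_R 4    // largest radius this sketch supports

__global__ void box_filter_tiled(const unsigned char* in, unsigned char* out,
                                 int width, int height, int radius)
{
    // Tile of the image plus a halo of `radius` phantom pixels on every side.
    __shared__ unsigned char tile[TILE + 2 * MAX_R][TILE + 2 * MAX_R];

    const int span = TILE + 2 * radius;           // tile side including the halo
    const int x0 = blockIdx.x * TILE - radius;    // image coordinates of tile[0][0]
    const int y0 = blockIdx.y * TILE - radius;

    // Cooperative load: the block's threads stride over the padded tile.
    for (int ty = threadIdx.y; ty < span; ty += TILE) {
        for (int tx = threadIdx.x; tx < span; tx += TILE) {
            const int gx = x0 + tx;
            const int gy = y0 + ty;
            const bool inside = gx >= 0 && gx < width && gy >= 0 && gy < height;
            tile[ty][tx] = inside ? in[gy * width + gx] : 0;
        }
    }
    __syncthreads();

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // The window of pixel (x, y) starts at tile[threadIdx.y][threadIdx.x].
    int sum = 0;
    for (int dy = 0; dy <= 2 * radius; ++dy)
        for (int dx = 0; dx <= 2 * radius; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];

    const int window = (2 * radius + 1) * (2 * radius + 1);
    out[y * width + x] = static_cast<unsigned char>(sum / window);
}
```

In this layout the halo pixels are loaded by more than one block, but each block writes only its own `TILE x TILE` interior, so no output value is computed twice and no explicit stitching step is needed.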
