c++ - Halide targeting the GPU (OpenGL) - benchmarking and using HalideRuntimeOpenGL.h

Question

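I am new to Halide. I have been playing around with the tutorials to get a feel for the language. I am now writing a small demo app that runs from the command line on OSX.

My goal is to perform a pixel-wise operation on an image, schedule it on the GPU, and measure the performance. I have tried a couple of things that I want to share here, and I have a few questions about the next steps.

First approach

I scheduled the algorithm on the GPU with OpenGL as the Target. Because I could not access the GPU memory to write the result to a file, I copied the output to the CPU inside the Halide routine by creating a Func cpu_out, similar to the glsl sample app in the Halide repo.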

pixel_operation_cpu_out.cpp

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

const int _number_of_channels = 4;

int main(int argc, char** argv)
{
    ImageParam input8(UInt(8), 3);

    input8
        .set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels (interleaved)
        .set_stride(2, 1); // stride in dimension 2 (c) is one

    Var x("x"), y("y"), c("c");

    // algorithm
    Func input;
    // clamp() bounds are inclusive, so the last valid channel is _number_of_channels - 1
    input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
                                        clamp(y, input8.top(), input8.bottom()),
                                        clamp(c, 0, _number_of_channels - 1))) / 255.0f;

    Func pixel_operation;

    // Calculate the corresponding value for input(x, y, c) after doing a
    // pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
    // This operation is not location dependent, e.g. brighten.

    Func out;
    out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
    out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
    out.output_buffer().set_bounds(2, 0, _number_of_channels);

    // schedule

    out.compute_root();
    out.reorder(c, x, y)
        .bound(c, 0, _number_of_channels)
        .unroll(c);

    // Schedule for GLSL

    out.glsl(x, y, c);

    Target target = get_target_from_environment();
    target.set_feature(Target::OpenGL);

    // create a cpu_out Func to copy over the data in Func out from GPU to CPU
    std::vector<Argument> args = {input8};
    Func cpu_out;
    cpu_out(x, y, c) = out(x, y, c);
    cpu_out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    cpu_out.output_buffer().set_bounds(2, 0, _number_of_channels);
    cpu_out.compile_to_file("pixel_operation_cpu_out", args, target);

    return 0;
}
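
The stride settings in the listing above describe an interleaved layout: consecutive channels of one pixel are adjacent in memory, and stepping one pixel in x skips _number_of_channels bytes. As a minimal sketch of what that means for addressing, the helpers below compute the byte offset of (x, y, c). The names interleaved_offset and read_channel are hypothetical and not part of Halide, and the row stride of width * channels assumes tightly packed rows with no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Halide): byte offset of element (x, y, c)
// in a tightly packed interleaved 8-bit image, i.e. stride(x) = channels,
// stride(c) = 1, and (assumed) stride(y) = width * channels.
inline std::size_t interleaved_offset(int x, int y, int c, int width, int channels) {
    assert(x >= 0 && x < width && y >= 0 && c >= 0 && c < channels);
    return (static_cast<std::size_t>(y) * width + x) * channels + c;
}

// Reads one channel value from an interleaved buffer using the offset above.
inline std::uint8_t read_channel(const std::vector<std::uint8_t> &pixels,
                                 int x, int y, int c, int width, int channels) {
    return pixels[interleaved_offset(x, y, c, width, channels)];
}
```

With four channels, the red, green, blue and alpha bytes of pixel (x, y) sit at four consecutive offsets, which is why the channel loop is a natural candidate for unrolling in the schedule.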

Since I compile this ahead of time (AOT), I call the generated function from my main(). main() lives in another file.

main_file.cpp

Note: the Image class used here is the same as the one in this Halide sample app.

#include "pixel_operation_cpu_out.h" // header generated by compile_to_file

int main()
{
    char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
    unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);

    Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    input.buf.host = &pixelsRGBA[0];
    unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
    output.buf.host = &outputPixelsRGBA[0];

    double best = benchmark(100, 10, [&]() {
         pixel_operation_cpu_out(&input.buf, &output.buf);
    });

    char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
    write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
}

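This works just fine and gives me the output I expect. From what I understand, cpu_out makes the values in out available in CPU memory, which is why I can read them through output.buf.host in main_file.cpp.

Second approach:

The second thing I tried was to skip the device-to-host copy in the Halide schedule (that is, not create Func cpu_out) and instead use the copy_to_host function in main_file.cpp.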

pixel_operation_gpu_out.cpp

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

const int _number_of_channels = 4;

int main(int argc, char** argv)
{
    ImageParam input8(UInt(8), 3);

    input8
        .set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels (interleaved)
        .set_stride(2, 1); // stride in dimension 2 (c) is one

    Var x("x"), y("y"), c("c");

    // algorithm
    Func input;
    // clamp() bounds are inclusive, so the last valid channel is _number_of_channels - 1
    input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
                                        clamp(y, input8.top(), input8.bottom()),
                                        clamp(c, 0, _number_of_channels - 1))) / 255.0f;

    Func pixel_operation;

    // Calculate the corresponding value for input(x, y, c) after doing a
    // pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
    // This operation is not location dependent, e.g. brighten.

    Func out;
    out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
    out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
    out.output_buffer().set_bounds(2, 0, _number_of_channels);

    // schedule

    out.compute_root();
    out.reorder(c, x, y)
        .bound(c, 0, _number_of_channels)
        .unroll(c);

    // Schedule for GLSL

    out.glsl(x, y, c);

    Target target = get_target_from_environment();
    target.set_feature(Target::OpenGL);

    std::vector<Argument> args = {input8};
    out.compile_to_file("pixel_operation_gpu_out", args, target);

    return 0;
}

main_file.cpp

#include "pixel_operation_gpu_out.h"
#include "runtime/HalideRuntime.h"

int main()
{
    char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
    unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);

    Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    input.buf.host = &pixelsRGBA[0];
    unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
    output.buf.host = &outputPixelsRGBA[0];

    double best = benchmark(100, 10, [&]() {
         pixel_operation_gpu_out(&input.buf, &output.buf);
    });

    int status = halide_copy_to_host(NULL, &output.buf);

    char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
    write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);

    return 0;
}

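So what I think is happening now is that pixel_operation_gpu_out keeps output.buf on the GPU, and the memory is only copied over to the CPU when I call copy_to_host. This program also gives me the expected output.

Questions:

The second approach is much slower than the first. The slow part, though, is not in the benchmarked section. For example, with the first approach I get a benchmarked time of 17 ms for a 4k image. For the same image, the second approach gives a benchmarked time of 22 us, and copy_to_host then takes 10 s. I am not sure whether this behaviour is expected, since approaches 1 and 2 are essentially doing the same thing.

The next thing I tried was to use HalideRuntimeOpenGL.h to link textures to the input and output buffers, so that I could draw directly to an OpenGL context from main_file.cpp instead of saving to a jpeg file. However, I could not find any examples showing how to use the functions in HalideRuntimeOpenGL.h, and everything I tried on my own produced run-time errors that I could not figure out how to solve. If anyone can point me to some resources, that would be great.

Any feedback on the code above is welcome too. I know it works and does what I want, but it could be completely the wrong way of doing it and I would not know any better.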

score 0 · Accepted Answer

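The most likely reason it takes 10 s to copy the memory back is that the GPU API has queued all of the kernel invocations and then waits for them to finish when halide_copy_to_host is called. You can call halide_device_sync inside the benchmark timing, after issuing all of the compute calls, to get the compute time inside the loop without the copy-back time.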
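
A quick back-of-the-envelope check, written as a small sketch, is consistent with the queuing explanation. Two things here are assumptions and not stated in the question: that "4k" means 3840 x 2160, and that benchmark(100, 10, ...) issues 100 x 10 = 1000 calls. The function names are made up for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "4k" means 3840 x 2160; the question does not give dimensions.
constexpr std::int64_t kWidth = 3840;
constexpr std::int64_t kHeight = 2160;
constexpr std::int64_t kChannels = 4;  // _number_of_channels in the question

// Bytes in one interleaved 8-bit RGBA frame.
constexpr std::int64_t frame_bytes() { return kWidth * kHeight * kChannels; }

// Throughput in bytes per second implied by writing one frame in `seconds`.
inline double implied_bytes_per_second(double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(frame_bytes()) / seconds;
}

// Average seconds per call if `total_seconds` of queued work is spread
// evenly over `calls` invocations.
inline double seconds_per_call(double total_seconds, std::int64_t calls) {
    assert(calls > 0);
    return total_seconds / static_cast<double>(calls);
}
```

Under those assumptions one frame is 33,177,600 bytes, so finishing it in 22 us would mean writing more than 10^12 bytes per second for the output alone, which suggests the timed call only enqueued work. Spreading the 10 s wait over 1000 queued calls gives roughly 10 ms per call, the same order of magnitude as the 17 ms measured in the first approach.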

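I cannot tell from this code how many times the kernel is being run. (My guess is 100, but it may be that those arguments to benchmark set up some sort of parametrization where it tries running the kernel as many times as needed to get a significant result. If so, that is a problem, because the queuing call is really fast while the compute is of course asynchronous. If that is the case, you can do things like queue ten calls and then call halide_device_sync, and play with the number "10" to get a real picture of how long it takes.)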
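
A minimal sketch of that batching idea follows. It does not depend on Halide: seconds_per_call_with_sync is a hypothetical helper that takes the enqueue and sync steps as callables, and FakeDevice is a made-up stand-in for an asynchronous GPU queue so the helper can be exercised without a GPU. In the question's program, enqueue would presumably wrap pixel_operation_gpu_out(&input.buf, &output.buf) and sync would wrap halide_device_sync(NULL, &output.buf).

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Times `batch` enqueue calls followed by a single sync and returns the
// average wall-clock seconds per call. The sync is inside the timed region,
// so asynchronous work is counted instead of leaking into a later copy.
template <typename Enqueue, typename Sync>
double seconds_per_call_with_sync(int batch, Enqueue enqueue, Sync sync) {
    assert(batch > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < batch; ++i) {
        enqueue();  // returns quickly if the device API is asynchronous
    }
    sync();  // blocks until all queued work has finished
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / batch;
}

// Hypothetical stand-in for an asynchronous device: enqueue() only counts
// the request, and sync() "runs" everything that was queued.
struct FakeDevice {
    int pending = 0;
    std::chrono::milliseconds cost_per_call{2};
    void enqueue() { ++pending; }
    void sync() {
        std::this_thread::sleep_for(cost_per_call * pending);
        pending = 0;
    }
};
```

Calling seconds_per_call_with_sync(10, ...) with different batch sizes shows whether the per-call time is stable. If it is, that figure is a more honest compute time than one measured without a sync, where only the cost of queuing is visible.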
