opengl-es - Simple GLSL convolution shader is very slow

Question

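I'm trying to implement a 2D outline shader in OpenGL ES 2.0 for iOS, and it is insanely slow: around 5 fps. I've tracked the problem down to the texture2D() calls, but without them a convolution shader isn't possible at all. I tried using lowp instead of mediump. That gains another 5 fps, but everything renders black, and it's still unusable.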

Here is my fragment shader:

    varying mediump vec4 colorVarying;
    varying mediump vec2 texCoord;

    uniform bool enableTexture;
    uniform sampler2D texture;

    uniform mediump float k;

    void main() {

        const mediump float step_w = 3.0/128.0;
        const mediump float step_h = 3.0/128.0;
        const mediump vec4 b = vec4(0.0, 0.0, 0.0, 1.0);
        const mediump vec4 one = vec4(1.0, 1.0, 1.0, 1.0);

        mediump vec2 offset[9];
        mediump float kernel[9];
        offset[0] = vec2(-step_w, step_h);
        offset[1] = vec2(-step_w, 0.0);
        offset[2] = vec2(-step_w, -step_h);
        offset[3] = vec2(0.0, step_h);
        offset[4] = vec2(0.0, 0.0);
        offset[5] = vec2(0.0, -step_h);
        offset[6] = vec2(step_w, step_h);
        offset[7] = vec2(step_w, 0.0);
        offset[8] = vec2(step_w, -step_h);

        kernel[0] = kernel[2] = kernel[6] = kernel[8] = 1.0/k;
        kernel[1] = kernel[3] = kernel[5] = kernel[7] = 2.0/k;
        kernel[4] = -16.0/k;  

        if (enableTexture) {
            mediump vec4 sum = vec4(0.0);
            for (int i=0;i<9;i++) {
                mediump vec4 tmp = texture2D(texture, texCoord + offset[i]);
                sum += tmp * kernel[i];
            }

            gl_FragColor = (sum * b) + ((one-sum) * texture2D(texture, texCoord));
        } else {
            gl_FragColor = colorVarying;
        }
    }

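This isn't optimized or finalized yet, but I need to get the performance up before going any further. I tried replacing the texture2D() call in the loop with a constant vec4. With that change it runs without any problem, despite everything else going on.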

How can I optimize this? I know it's possible, because I've seen far more complex effects in 3D running without any problem. I can't see why this one is causing trouble at all.

score 49 · Accepted Answer

I've done exactly this myself, and there are several things here that can be optimized.

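First, I'd remove the enableTexture conditional and instead split your shader into two programs, one for the true case and one for the false case. Conditionals are very expensive in iOS fragment shaders, particularly ones that contain texture reads.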

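Second, you have nine dependent texture reads here. These are texture reads where the texture coordinates are calculated within the fragment shader. Dependent texture reads are very expensive on the PowerVR GPUs in iOS devices, because they prevent the hardware from optimizing texture reads using caching and the like. Because you are sampling at fixed offsets for the 8 surrounding pixels and the central one, these calculations should be moved up into the vertex shader. That also means they no longer have to be performed for every pixel: they run once per vertex, and hardware interpolation handles the rest.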
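Moving the offsets into the vertex shader works because the rasterizer interpolates a varying with weights that sum to 1. Interpolating `uv + d` therefore gives the same value as interpolating `uv` and adding `d` afterwards. The C sketch below checks that on the CPU. The struct and function names are hypothetical and only model the interpolation step; they are not a GL API.

```c
#include <assert.h>

/* A 2D texture coordinate. */
typedef struct { float u, v; } Vec2;

static Vec2 vec2_add(Vec2 p, Vec2 d) {
    Vec2 r = { p.u + d.u, p.v + d.v };
    return r;
}

/* Interpolates a per-vertex value inside a triangle using weights that sum
   to 1, which is how the rasterizer produces a varying for each fragment. */
static Vec2 interpolate(const Vec2 vtx[3], const float w[3]) {
    Vec2 r = { 0.0f, 0.0f };
    for (int i = 0; i < 3; i++) {
        r.u += w[i] * vtx[i].u;
        r.v += w[i] * vtx[i].v;
    }
    return r;
}

/* Question's approach: interpolate the coordinate, then add the offset in
   every fragment (this is what makes the texture read dependent). */
static Vec2 offset_per_fragment(const Vec2 vtx[3], const float w[3], Vec2 d) {
    return vec2_add(interpolate(vtx, w), d);
}

/* Answer's approach: add the offset once per vertex and let the hardware
   interpolate the already-offset coordinate into its own varying. */
static Vec2 offset_per_vertex(const Vec2 vtx[3], const float w[3], Vec2 d) {
    Vec2 shifted[3];
    for (int i = 0; i < 3; i++)
        shifted[i] = vec2_add(vtx[i], d);
    return interpolate(shifted, w);
}

/* Absolute difference, used to compare the two approaches. */
static float abs_diff(float a, float b) { return a > b ? a - b : b - a; }
```

The two results agree up to float rounding. The vertex-shader version needs one varying per sample position, which is why the shaders below declare nine of them.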

Third, for() loops have not been handled all that well by the iOS shader compiler to date, so I tend to avoid them where I can.

As I mentioned, I've written convolution shaders like this in my open source iOS GPUImage framework. For a generic convolution filter, I use the following vertex shader:

 attribute vec4 position;
 attribute vec4 inputTextureCoordinate;

 uniform highp float texelWidth; 
 uniform highp float texelHeight; 

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     gl_Position = position;

     vec2 widthStep = vec2(texelWidth, 0.0);
     vec2 heightStep = vec2(0.0, texelHeight);
     vec2 widthHeightStep = vec2(texelWidth, texelHeight);
     vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight);

     textureCoordinate = inputTextureCoordinate.xy;
     leftTextureCoordinate = inputTextureCoordinate.xy - widthStep;
     rightTextureCoordinate = inputTextureCoordinate.xy + widthStep;

     topTextureCoordinate = inputTextureCoordinate.xy - heightStep;
     topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep;
     topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep;

     bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep;
     bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep;
     bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep;
 }

and the following fragment shader:

 precision highp float;

 uniform sampler2D inputImageTexture;

 uniform mediump mat3 convolutionMatrix;

 varying vec2 textureCoordinate;
 varying vec2 leftTextureCoordinate;
 varying vec2 rightTextureCoordinate;

 varying vec2 topTextureCoordinate;
 varying vec2 topLeftTextureCoordinate;
 varying vec2 topRightTextureCoordinate;

 varying vec2 bottomTextureCoordinate;
 varying vec2 bottomLeftTextureCoordinate;
 varying vec2 bottomRightTextureCoordinate;

 void main()
 {
     mediump vec4 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate);
     mediump vec4 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate);
     mediump vec4 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate);
     mediump vec4 centerColor = texture2D(inputImageTexture, textureCoordinate);
     mediump vec4 leftColor = texture2D(inputImageTexture, leftTextureCoordinate);
     mediump vec4 rightColor = texture2D(inputImageTexture, rightTextureCoordinate);
     mediump vec4 topColor = texture2D(inputImageTexture, topTextureCoordinate);
     mediump vec4 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate);
     mediump vec4 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate);

     mediump vec4 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2];
     resultColor += leftColor * convolutionMatrix[1][0] + centerColor * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2];
     resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2];

     gl_FragColor = resultColor;
 }

The texelWidth and texelHeight uniforms are the inverse of the input image's width and height. The convolutionMatrix uniform specifies the weights of the various samples in the convolution.
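The following C sketch computes those uniform values on the host side. The helper names are hypothetical. The kernel is the question's outline kernel: corner weights 1/k, edge weights 2/k, and center weight -16/k. The comments spell out how a 9-float array maps onto GLSL's column-major mat3, so that each kernel row lines up with the fragment shader's `convolutionMatrix[i][j]` indexing.

```c
#include <assert.h>

/* Values for the texelWidth / texelHeight uniforms: the size of one texel
   in normalized texture coordinates. */
static void texel_size(int width, int height, float *texel_w, float *texel_h) {
    *texel_w = 1.0f / (float)width;
    *texel_h = 1.0f / (float)height;
}

/* Packs the question's outline kernel into 9 floats for a mat3 uniform.
   GL reads the array column-major, so element [3*i + j] becomes
   convolutionMatrix[i][j] in GLSL. In the fragment shader above, i picks
   the top / middle / bottom group of samples and j picks left / center /
   right, so the kernel can be written here row by row. */
static void outline_kernel(float k, float out[9]) {
    const float weights[9] = {
        1.0f,   2.0f, 1.0f,   /* topLeft,    top,    topRight    */
        2.0f, -16.0f, 2.0f,   /* left,       center, right       */
        1.0f,   2.0f, 1.0f    /* bottomLeft, bottom, bottomRight */
    };
    for (int i = 0; i < 9; i++)
        out[i] = weights[i] / k;
}
```

The array would typically be uploaded with `glUniformMatrix3fv(location, 1, GL_FALSE, m)`. In OpenGL ES 2.0 the transpose argument must be `GL_FALSE`, which is why the layout is arranged by hand here.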

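On an iPhone 4, this runs in 4-8 ms for a 640x480 frame of camera video, which is fast enough for 60 FPS rendering at that image size. If you only need something like edge detection, you can simplify the above: convert the image to luminance in a pre-pass, then sample from only one color channel. That is even faster, at roughly 2 ms per frame on the same device.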
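Below is a CPU-side sketch of the luminance shortcut, which can serve as a reference when checking shader output. It assumes Rec. 709 luma weights and clamp-to-edge sampling, since the answer specifies neither, and the function names are hypothetical. On the GPU, the pre-pass would be its own render pass that writes luminance into a texture. The convolution shader would then read a single channel from each of its nine samples.

```c
#include <assert.h>

/* Luma from RGB with Rec. 709 coefficients (an assumption; the answer does
   not say which weights its luminance pre-pass uses). */
static float luminance(float r, float g, float b) {
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

/* Pre-pass: convert an interleaved RGB buffer of n pixels to one channel. */
static void to_luminance(const float *rgb, float *lum, int n) {
    for (int i = 0; i < n; i++)
        lum[i] = luminance(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
}

/* Clamp-to-edge read of a single-channel w x h image. */
static float lum_at(const float *lum, int w, int h, int x, int y) {
    if (x < 0) x = 0;
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return lum[y * w + x];
}

/* 3x3 convolution on the luminance channel only: nine scalar reads instead
   of nine RGBA reads. kernel is row-major; kernel[0] is the (-1, -1) weight. */
static float convolve_lum(const float *lum, int w, int h, int x, int y,
                          const float kernel[9]) {
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            sum += kernel[(dy + 1) * 3 + (dx + 1)] * lum_at(lum, w, h, x + dx, y + dy);
    return sum;
}
```

The loop is fine on the CPU. A shader version would unroll it into explicit samples, as in the fragment shader above.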

score 6 · Accepted Answer

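The only way to reduce the time this shader takes is to reduce the number of texture fetches. Your shader samples the texture at equally spaced points around the center pixel and combines them linearly. You can therefore cut the number of fetches by using the GL_LINEAR filtering mode available for texture sampling.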

Basically, instead of sampling at every texel, you sample in between a pair of texels and get their linearly weighted sum directly.

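Let's call the samples at offsets (-stepw,-steph) and (-stepw,0) $x_0$ and $x_1$ respectively. Then your sum is

$$\mathrm{sum} = k_0 x_0 + k_1 x_1$$

Suppose you instead sample between these two texels at a distance of $k_1/(k_0+k_1)$ of a texel from $x_0$, and therefore $k_0/(k_0+k_1)$ from $x_1$. The texel nearer the sample point receives the larger weight. The GPU then performs the linear weighting during the fetch and gives you

$$y = \frac{k_0}{k_0+k_1}\,x_0 + \frac{k_1}{k_0+k_1}\,x_1$$

So the sum can be computed as

$$\mathrm{sum} = (k_0 + k_1)\,y$$

from just a single fetch!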
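Here is a quick numeric check of that derivation in C. `linear_fetch` models what GL_LINEAR does between two adjacent texels along one axis, and `pair_sum` recovers $k_0 x_0 + k_1 x_1$ from a single such fetch. The names are illustrative only. The sketch assumes $k_0$ and $k_1$ have the same sign, so that the sample point stays between the two texels. Real GPUs may also filter with limited sub-texel precision, so expect small differences compared with two exact fetches.

```c
#include <assert.h>

/* Models GL_LINEAR filtering between two adjacent texels: t is the sample
   position measured from x0's center toward x1's center, in texels. */
static float linear_fetch(float x0, float x1, float t) {
    return (1.0f - t) * x0 + t * x1;
}

/* Sample position that gives x0 the weight k0/(k0+k1): k1/(k0+k1) texels
   away from x0. Assumes k0 and k1 share a sign and k0 + k1 != 0. */
static float pair_offset(float k0, float k1) {
    return k1 / (k0 + k1);
}

/* k0*x0 + k1*x1 computed from a single filtered fetch. */
static float pair_sum(float x0, float x1, float k0, float k1) {
    return (k0 + k1) * linear_fetch(x0, x1, pair_offset(k0, k1));
}
```

With the question's corner weight (1/k) and edge weight (2/k), the sample point lands two thirds of a texel away from the corner texel, toward the edge texel.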

If you repeat this for the other neighboring pixels, you end up doing 4 texture fetches for the eight neighboring offsets plus one extra fetch for the center pixel: 5 fetches instead of 9.
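To extend the pair trick to the whole 3x3 neighborhood, the eight ring texels can be split into four adjacent pairs arranged in a pinwheel. Each filtered fetch then blends along only one axis. The C sketch below emulates GL_LINEAR on the CPU and compares the five-fetch result with a plain nine-tap convolution. It works in texel space, with texel centers at integer positions, and uses clamp-to-edge sampling. The pinwheel pairing and all function names are my own illustration, not taken from the linked article. In a real shader, the four sample positions would be computed in the vertex shader. A texel-space point `p` would be converted to normalized coordinates as `(p + 0.5) / size`.

```c
#include <assert.h>

/* Clamp-to-edge read of texel (x, y) from a w x h single-channel image. */
static float texel(const float *img, int w, int h, int x, int y) {
    if (x < 0) x = 0;
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return img[y * w + x];
}

/* floor() for floats without needing libm. */
static int floor_int(float p) {
    int i = (int)p;
    if ((float)i > p) i--;
    return i;
}

/* Emulates a GL_LINEAR (bilinear) fetch in texel space, where the center of
   texel (x, y) sits at the point (x, y). */
static float bilinear(const float *img, int w, int h, float px, float py) {
    int x0 = floor_int(px), y0 = floor_int(py);
    float fx = px - (float)x0, fy = py - (float)y0;
    float upper = (1.0f - fx) * texel(img, w, h, x0, y0) + fx * texel(img, w, h, x0 + 1, y0);
    float lower = (1.0f - fx) * texel(img, w, h, x0, y0 + 1) + fx * texel(img, w, h, x0 + 1, y0 + 1);
    return (1.0f - fy) * upper + fy * lower;
}

/* Reference: nine fetches. k is row-major; k[0] is the weight at (-1, -1). */
static float conv9(const float *img, int w, int h, int x, int y, const float k[9]) {
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            sum += k[(dy + 1) * 3 + (dx + 1)] * texel(img, w, h, x + dx, y + dy);
    return sum;
}

/* Five fetches: the center texel plus one filtered fetch per pinwheel pair.
   Assumes the two weights in each pair share a sign and do not sum to 0. */
static float conv5(const float *img, int w, int h, int x, int y, const float k[9]) {
    /* {ax, ay, bx, by}: offsets of the two adjacent texels in each pair. */
    static const int pairs[4][4] = {
        { -1, -1, -1,  0 },   /* (-1,-1) with (-1, 0) */
        { -1,  1,  0,  1 },   /* (-1, 1) with ( 0, 1) */
        {  1,  1,  1,  0 },   /* ( 1, 1) with ( 1, 0) */
        {  1, -1,  0, -1 },   /* ( 1,-1) with ( 0,-1) */
    };
    float sum = k[4] * texel(img, w, h, x, y);
    for (int i = 0; i < 4; i++) {
        int ax = pairs[i][0], ay = pairs[i][1], bx = pairs[i][2], by = pairs[i][3];
        float ka = k[(ay + 1) * 3 + (ax + 1)];
        float kb = k[(by + 1) * 3 + (bx + 1)];
        float t = kb / (ka + kb);  /* distance from texel a toward texel b */
        float px = (float)(x + ax) + t * (float)(bx - ax);
        float py = (float)(y + ay) + t * (float)(by - ay);
        sum += (ka + kb) * bilinear(img, w, h, px, py);
    }
    return sum;
}
```

Because each pair lies along a single row or column, one coordinate of every sample point is an exact texel center. Bilinear filtering therefore never mixes in texels from outside the pair.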

The link explains this better.
