image-processing - Fast pixel count on a binary image - ARM NEON intrinsics - iOS development

Question

Can anyone tell me a fast function to count the number of white pixels in a binary image? I need it for iOS app development. I am working directly on the image memory, defined as

  bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));

I am implementing the function as

             int whiteCount = 0;
             for (int q=i; q<i+windowHeight; q++)
             {
                 for (int w=j; w<j+windowWidth; w++)
                 { 
                     if (imageData[q*W + w] == 1)
                         whiteCount++;
                 }
             }

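This is obviously the slowest function possible. I heard that ARM NEON intrinsics on iOS can be used to perform several operations in one cycle. Maybe that is the way to go?

The problem is that I am not very familiar with them and do not have enough time to learn assembly language at the moment. So it would be great if anyone could post NEON intrinsics code for the problem above, or any other fast implementation in C/C++.

The only NEON intrinsics code I have been able to find online is the code for RGB to gray: http://computer-vision-talks.com/2011/02/a-very-fast-bgra-to-grayscale-conversion-on-iphone/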

score 3 · Accepted Answer

First, you can speed up the original code a little by factoring out the multiplication and getting rid of the branch:

 int whiteCount = 0;
 for (int q = i; q < i + windowHeight; q++)
 {
     const bool * const row = &imageData[q * W];

     for (int w = j; w < j + windowWidth; w++)
     { 
         whiteCount += row[w];
     }
 }

(This assumes that imageData[] is truly binary, i.e. each element can only be 0 or 1.)

Here is a simple NEON implementation:
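For comparison with any faster version, the branch-free scalar loop above can be wrapped in a function. This is only a sketch: the name count_white_scalar and its parameter list are hypothetical, and it assumes a row-major image with a row stride of W pixels whose elements are all 0 or 1.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): counts white pixels in
   the window whose top-left corner is (row i, column j).
   Assumes every element of imageData is exactly 0 or 1. */
static int count_white_scalar(const bool *imageData, int W,
                              int i, int j,
                              int windowHeight, int windowWidth)
{
    int whiteCount = 0;
    for (int q = i; q < i + windowHeight; q++)
    {
        /* Compute the row pointer once per row instead of once per pixel. */
        const bool * const row = &imageData[(size_t)q * W];

        for (int w = j; w < j + windowWidth; w++)
        {
            whiteCount += row[w];   /* adds 0 or 1, no branch */
        }
    }
    return whiteCount;
}
```

A function like this can serve as the reference result when checking a SIMD version.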

#include <arm_neon.h>

// ...

int q, w;
int whiteCount = 0;
uint32x4_t v_count = vdupq_n_u32(0);

for (q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];

    uint16x8_t vrow_count = vdupq_n_u16(0);

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        // load 16 x 8 bit pixels (the cast relies on sizeof(bool) == 1)
        uint8x16_t v = vld1q_u8((const uint8_t *)&row[w]);
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += row[w];
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount

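(This assumes that imageData[] is truly binary, that imageWidth < 2^19 so that the 16 bit row counts cannot overflow, and that sizeof(bool) == 1.)

Updated version for unsigned char image data, with values of 255 for white and 0 for black: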

#include <arm_neon.h>

// ...

int q, w;
int whiteCount = 0;
const uint8x16_t v_mask = vdupq_n_u8(1);
uint32x4_t v_count = vdupq_n_u32(0);

for (q = i; q < i + windowHeight; q++)
{
    const uint8_t * const row = &imageData[q * W];

    uint16x8_t vrow_count = vdupq_n_u16(0);

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[w]);           // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask);                    // mask out all but LS bit
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += (row[w] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] has values of 255 for white and 0 for black, and that imageWidth < 2^19.)

Note that all of the above code is untested and may need further work.
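Since the code is untested, one way to check it is to put the 0/255 variant into a function with a plain scalar fallback and compare the result against a naive count. The sketch below does that. The function name count_white_u8 and its parameters are hypothetical, the `__ARM_NEON` / `__ARM_NEON__` macros are assumed to be what the compiler defines when NEON is available, and the same width limit as above applies.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: counts 255-valued pixels in a window of a row-major
   8 bit image whose pixels are all 0 or 255. W is the row stride in pixels. */
static uint32_t count_white_u8(const uint8_t *imageData, int W,
                               int i, int j,
                               int windowHeight, int windowWidth)
{
    uint32_t whiteCount = 0;
#if HAVE_NEON
    const uint8x16_t v_mask = vdupq_n_u8(1);
    uint32x4_t v_count = vdupq_n_u32(0);
#endif
    for (int q = i; q < i + windowHeight; q++)
    {
        const uint8_t * const row = &imageData[(size_t)q * W];
        int w = j;
#if HAVE_NEON
        uint16x8_t vrow_count = vdupq_n_u16(0);
        for ( ; w <= j + windowWidth - 16; w += 16)  /* SIMD loop */
        {
            uint8x16_t v = vandq_u8(vld1q_u8(&row[w]), v_mask); /* 255 -> 1 */
            vrow_count = vpadalq_u8(vrow_count, v);  /* 16 bit row counts */
        }
        v_count = vpadalq_u16(v_count, vrow_count);  /* 32 bit image counts */
#endif
        for ( ; w < j + windowWidth; ++w)            /* scalar clean up loop */
        {
            whiteCount += (row[w] == 255);
        }
    }
#if HAVE_NEON
    /* add the 4 x 32 bit partial counts to the scalar total */
    whiteCount += vgetq_lane_u32(v_count, 0);
    whiteCount += vgetq_lane_u32(v_count, 1);
    whiteCount += vgetq_lane_u32(v_count, 2);
    whiteCount += vgetq_lane_u32(v_count, 3);
#endif
    return whiteCount;
}
```

On a target without NEON only the scalar loop runs, so the same source can be built and compared on both a desktop machine and the device.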

score 0 · Accepted Answer

http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

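Section 6.55.3.6

The vectorized algorithm will do the comparisons and put the results into a struct for you, but you would still need to go through each element of the struct and determine whether it is zero.

How fast does that loop currently run, and how fast do you need it to run? Also keep in mind that NEON works in the same registers as the floating-point unit, so using NEON here may force an FPU context switch.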
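To put a number on "how fast does that loop currently run", one option is to time many repetitions of the scalar loop with clock() from <time.h>. This is a rough sketch only: the helper name time_scalar_count and its parameters are hypothetical, clock() measures CPU time at coarse resolution, and an optimizing compiler may still move work out of the repetition loop, so the figure should be treated as an estimate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: runs the scalar window count `reps` times, stores the
   last count in *countOut and returns the elapsed CPU time in seconds. */
static double time_scalar_count(const bool *imageData, int W,
                                int i, int j,
                                int windowHeight, int windowWidth,
                                int reps, int *countOut)
{
    volatile int sink = 0;          /* the result of every repetition is used */
    const clock_t start = clock();

    for (int r = 0; r < reps; r++)
    {
        int whiteCount = 0;
        for (int q = i; q < i + windowHeight; q++)
        {
            const bool * const row = &imageData[(size_t)q * W];
            for (int w = j; w < j + windowWidth; w++)
            {
                whiteCount += row[w];
            }
        }
        sink = whiteCount;
    }

    const clock_t end = clock();
    *countOut = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the returned time by reps gives an approximate cost per window, which can then be compared with the time budget per frame.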
