c++ - 1bpp 画像のフィルタリング

Question

3x3 フィルターを使用して、1 ピクセルあたり 1 ビットの画像をフィルター処理しようとしています。各入力ピクセルについて、周囲のピクセルの加重合計 (フィルターによって決定された重み付き) がしきい値を超えると、対応する出力ピクセルが 1 に設定されます。 .

これは 8 bpp に変換してからフィルタリングするよりも効率的であることを期待していましたが、良い方法が思いつきません。単純な方法は、バイトへの 9 つのポインター (3 つの連続する行と、これらのバイトの最初と最後のビットの出力を計算するための、各行の現在のバイトの両側へのポインター) を追跡し、各入力ピクセルの計算を追跡することです。

sum = filter[0] * (lastRowPtr & aMask > 0) + filter[1] * (lastRowPtr & bMask > 0) + ... + filter[8] * (nextRowPtr & hMask > 0)、

バイトの端にあるビットの余分な faff を使用します。ただし、これは遅く、本当に醜いようです。各バイトに 8 つのピクセルがあるという事実から、並列処理が得られず、代わりに大量の余分な作業をマスキングする必要があります。

この種のことを最善の方法で行うための良い情報源はありますか? この特定の問題に対する解決策は驚くべきものですが、C/C++ で 1bpp 画像を効率的に処理する例を教えていただければ幸いです。画像の変換とコピーを避けるために、将来的には 8 bpp のものを 1 bpp アルゴリズムに置き換えたいと考えているので、これに関する一般的なリソースをいただければ幸いです。

score 2 · Accepted Answer

Look into separable filters. Among other things, they allow massive parallelism in the cases where they work.

For example, in your 3x3 sample-weight-and-filter case:

Sample 1x3 (horizontal) pixels into a buffer. This can be done in isolation for each pixel, so a 1024x1024 image can run 1024^2 simultaneous tasks, all of which perform 3 samples.
Sample 3x1 (vertical) pixels from the buffer. Again, this can be done on every pixel simultaneously.
Use the contents of the buffer to cull pixels from the original texture.

The advantage to this approach, mathematically, is that it cuts the number of sample operations from n^2 to 2n, although it requires a buffer of equal size to the source (if you're already performing a copy, that can be used as the buffer; you just can't modify the original source for step 2). In order to keep memory use at 2n, you can perform steps 2 and 3 together (this is a bit tricky and not entirely pleasant); if memory isn't an issue, you can spend 3n on two buffers (source, hblur, vblur).

Because each operation is working in complete isolation from an immutable source, you can perform the filter on every pixel simultaneously if you have enough cores. Or, in a more realistic scenario, you can take advantage of paging and caching to load and process a single column or row. This is convenient when working with odd strides, padding at the end of a row, etc. The second round of samples (vertical) may screw with your cache, but at the very worst, one round will be cache-friendly and you've cut processing from exponential to linear.

Now, I've yet to touch on the case of storing data in bits specifically. That does make things slightly more complicated, but not terribly much so. Assuming you can use a rolling window, something like:

d = s[x-1] + s[x] + s[x+1]

works. Interestingly, if you were to rotate the image 90 degrees during the output of step 1 (trivial, sample from (y,x) when reading), you can get away with loading at most two horizontally adjacent bytes for any sample, and only a single byte something like 75% of the time. This plays a little less friendly with cache during the read, but greatly simplifies the algorithm (enough that it may regain the loss).

Pseudo-code:

buffer source, dest, vbuf, hbuf;

for_each (y, x)   // Loop over each row, then each column. Generally works better wrt paging
{
    hbuf(x, y) = (source(y, x-1) + source(y, x) + source(y, x+1)) / 3   // swap x and y to spin 90 degrees
}
for_each (y, x)
{
    vbuf(x, 1-y) = (hbuf(y, x-1) + hbuf(y, x) + hbuf(y, x+1)) / 3    // 1-y to reverse the 90 degree spin
}
for_each (y, x)
{
    dest(x, y) = threshold(hbuf(x, y))
}

Accessing bits within the bytes (source(x, y) indicates access/sample) is relatively simple to do, but kind of a pain to write out here, so is left to the reader. The principle, particularly implemented in this fashion (with the 90 degree rotation), only requires 2 passes of n samples each, and always samples from immediately adjacent bits/bytes (never requiring you to calculate the position of the bit in the next row). All in all, it's massively faster and simpler than any alternative.

score 2 · Accepted Answer

I found a number of years ago that unpacking the bits to bytes, doing the filter, then packing the bytes back to bits was faster than working with the bits directly. It seems counter-intuitive because it's 3 loops instead of 1, but the simplicity of each loop more than made up for it.

I can't guarantee that it's still the fastest; compilers and especially processors are prone to change. However simplifying each loop not only makes it easier to optimize, it makes it easier to read. That's got to be worth something.

A further advantage to unpacking to a separate buffer is that it gives you flexibility for what you do at the edges. By making the buffer 2 bytes larger than the input, you unpack starting at byte 1 then set byte 0 and n to whatever you like and the filtering loop doesn't have to worry about boundary conditions at all.

score 1 · Accepted Answer

画像全体を1ビット/バイト（または、基本的には8bpp）に拡張するのではなく、現在のウィンドウを単純に拡張できます-最初の行の最初のバイトを読み取り、シフトしてマスクし、次に3ビットを読み取ります必要; 他の 2 つの行についても同じことを行います。次に、次のウィンドウでは、左側の列を破棄して、各行からもう 1 ビットをフェッチします。これを正しく行うためのロジックとコードは、単純に画像全体を拡大するほど簡単ではありませんが、必要なメモリは大幅に少なくなります。

妥協点として、現在作業中の 3 つの行を拡張することができます。おそらく、そのようにコーディングする方が簡単です。

c++ - 1bpp 画像のフィルタリング

3 に答える 3

Related