bit-manipulation - ビットのインターリーブを効率的に解除する方法 (逆モートン)

Question

この質問:ビットのインターリーブを解除する方法 (UnMortonizing?)には、モートン数の 2 つの半分 (奇数ビットのみ) の 1 つを抽出するための良い答えがありますが、両方の部分 (奇数ビットと偶数ビット) 可能な限り少ない操作で。

私の使用では、32ビットの整数を取り、2つの16ビットの整数を抽出する必要があります。1つは偶数ビットで、もう1つは1ビット右にシフトされた奇数ビットです。

input,  z: 11101101 01010111 11011011 01101110

output, x: 11100001 10110111 // odd bits shifted right by 1
        y: 10111111 11011010 // even bits

モートン数 (つまりインターリーブビット) を生成するためのマジックナンバーを使用したシフトとマスクを使用するソリューションはたくさんあるようです。 .

アップデート

完全なシャッフル/シャッフル解除に関する Hacker's Delight のセクションを読み直した後、次のように適応したいくつかの有用な例を見つけました。

// morton1 - extract even bits

uint32_t morton1(uint32_t x)
{
    x = x & 0x55555555;
    x = (x | (x >> 1)) & 0x33333333;
    x = (x | (x >> 2)) & 0x0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF;
    return x;
}

// morton2 - extract odd and even bits

void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
    *x = morton1(z);
    *y = morton1(z >> 1);
}

これは、現在のスカラー形式でも SIMD を利用することでも改善できると思うので、より良い解決策 (スカラーまたは SIMD) にまだ興味があります。

score 12 · Accepted Answer

プロセッサが 64 ビット int を効率的に処理する場合は、操作を組み合わせることができます...

int64 w = (z &0xAAAAAAAA)<<31 | (z &0x55555555 )
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F; 
...

score 5 · Accepted Answer

誰かが 3D でモートンコードを使用している場合、彼は 3 ごとに 1 ビットを読み取る必要があり、ここでは 64 ビットが私が使用した関数です。

uint64_t morton3(uint64_t x) {
    x = x & 0x9249249249249249;
    x = (x | (x >> 2))  & 0x30c30c30c30c30c3;
    x = (x | (x >> 4))  & 0xf00f00f00f00f00f;
    x = (x | (x >> 8))  & 0x00ff0000ff0000ff;
    x = (x | (x >> 16)) & 0xffff00000000ffff;
    x = (x | (x >> 32)) & 0x00000000ffffffff;
    return x;
}
uint64_t bits; 
uint64_t x = morton3(bits)
uint64_t y = morton3(bits>>1)
uint64_t z = morton3(bits>>2)

score 1 · Accepted Answer

速度が必要な場合は、一度に 1 バイトの変換に table-lookup を使用できます (2 バイトのテーブルは高速ですが、サイズが大きくなります)。手順は Delphi IDE で行いますが、アセンブラ/アルゴリズムは同じです。

const
  MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;

procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
  movzx   ecx, al                                     //Use 0th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 2th byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  shl     edx, 16
  movzx   ecx, al                                     //Use 1th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 3th byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  mov     ecx, edx  
  and     ecx, $F0F0F0F0
  mov     eax, ecx
  rol     eax, 12
  or      eax, ecx

  rol     edx, 4
  and     edx, $F0F0F0F0
  mov     ecx, edx
  rol     ecx, 12
  or      edx, ecx
end;

score 1 · Accepted Answer

固定サイズの整数に制限されたり、ハードコーディングされた定数を使用して同様のコマンドのリストを作成したりしたくなかったので、テンプレートメタプログラミングを利用して関数と定数を生成する C++11 ソリューションを開発しました。で生成されたアセンブリコードは、-O3BMI を使用せずに得られるのと同じくらいタイトに見えます。

andl    $0x55555555, %eax
movl    %eax, %ecx
shrl    %ecx
orl     %eax, %ecx
andl    $0x33333333, %ecx
movl    %ecx, %eax
shrl    $2, %eax
orl     %ecx, %eax
andl    $0xF0F0F0F, %eax
movl    %eax, %ecx
shrl    $4, %ecx
orl     %eax, %ecx
movzbl  %cl, %esi
shrl    $8, %ecx
andl    $0xFF00, %ecx
orl     %ecx, %esi

TL;DR ソースリポジトリとライブデモ.

実装

基本的に、関数のすべてのステップは、次のmorton1ような一連の定数をシフトして追加することによって機能します。

0b0101010101010101(1 と 0 を交互に)
0b0011001100110011(交互に 2x 1 と 0)
0b0000111100001111(交互に 4x 1 と 0)
0b0000000011111111(交互に 8x 1 と 0)

D次元を使用すると、 D-10 と11 のパターンになります。したがって、これらを生成するには、連続したものを生成し、ビットごとに適用するだけで十分です。

/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;

/// @brief Performs `@tparam input | (@tparam input << @tparam width` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
    public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
    (input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};

これで、次のようにして、コンパイル時に任意の次元の定数を生成できるようになりました。

template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;

同じタイプの再帰を使用して、アルゴリズムの各ステップに対して関数を生成できますx = (x | (x >> K)) & M。

template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
    static T work(T input) {
        input = deinterleave<T, step - 1, dimensions>::work(input);
        return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
    }
};
// Omitted specialization for step 0, where there is just a bitwise and

「いくつのステップが必要ですか?」という質問に答える必要があります。これは次元数にも依存します。一般に、ステップは出力ビットkを計算します。2^k - 1各次元の意味のあるビットの最大数はで与えられるz = sizeof(T) * 8 / dimensionsため、手順を踏むだけで十分1 + log_2 zです。問題は、これをconstexprテンプレートパラメーターとして使用するために必要なことです。これを回避するために私が見つけた最良の方法は、log2メタプログラミングを介して定義することです。

template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};

/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;

最後に、1 つの呼び出しを実行できます。

/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
    return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}

bit-manipulation - ビットのインターリーブを効率的に解除する方法 (逆モートン)

6 に答える 6

実装

Related