c++ - 64ビットの数値から特定の行/列/対角線を「分離」します

Question

OK、64ビットの数値を考えてみましょう。そのビットは8x8のテーブルを形成しています。

例えば

0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0

として書かれた

a b c d e f g h
----------------
0 1 1 0 1 0 1 0
0 1 1 0 1 0 1 1 
0 1 1 1 1 0 1 0 
0 1 1 0 1 0 1 0 
1 1 1 0 1 0 1 0 
0 1 1 0 1 0 1 0 
0 1 1 0 1 1 1 0 
0 1 1 0 1 0 1 0

さて、たとえば列d（00100000）（またはその問題の任意の行/対角線）などを分離したい場合はどうなりますか？

~~これはできますか？~~もしそうなら、どのように？

ヒント：

（a）ここでの私の主な目的は、最初は言及されていませんが、生の速度です。「検索」機能は1秒間に数百万回実行されているため、私は最速のアルゴリズムを探しています。
（b）これは私が意味するものに近づくものです：https ：//www.chessprogramming.org/Kindergarten_Bitboards

score 63 · Accepted Answer

主な手順は次の 4 つだけです。

const uint64_t column_mask = 0x8080808080808080ull;
const uint64_t magic = 0x2040810204081ull;

int get_col(uint64_t board, int col) {
    uint64_t column = (board << col) & column_mask;
    column *= magic;
    return (column >> 56) & 0xff;
}

それはこのように動作します：

ボードをずらして列を左側に揃えます
必要な列 (0..8) のみを含むようにマスクされています
これにマジックナンバーを掛けると、元のすべてのビットが左側にプッシュされます。
一番左のバイトが右にシフトされます

マジックナンバーは、必要なビットのみをコピーし、残りを未使用の場所に落としたり、数字をオーバーフローさせたりするために選択されます。プロセスは次のようになります (数字自体ではなく、ビット「ID」です)。

original column: ...1.......2.......3.......4.......5.......6.......7.......8....
aligned column:  1.......2.......3.......4.......5.......6.......7.......8.......
multiplied:      123456782345678.345678..45678...5678....678.....78......8.......
shifted to right:........................................................12345678

constキーワードを追加すると、アセンブリは実際には非常にうまくなります。

get_col:
.LFB7:
        .cfi_startproc
        movl    %esi, %ecx
        movabsq $-9187201950435737472, %rax
        salq    %cl, %rdi
        andq    %rax, %rdi
        movabsq $567382630219905, %rax
        imulq   %rax, %rdi
        shrq    $56, %rdi
        movl    %edi, %eax
        ret

分岐なし、外部データなし、計算あたり約 0.4ns。

編集：NPEのソリューションをベースラインとして使用すると、約6分の1の時間がかかります（次に速いもの）

score 7 · Accepted Answer

そうです、どちらが速い/遅いかなどの議論を「解決」するために、すべてのコードを 1 つのプログラムにまとめました [そして、適切なコードスニペットに対して適切な人物の功績を称えたいと思います] 。

コードを関数にしたときにコードを正しく解釈したことを確認するために、コードを以下に示します。私はそれを適切な出力で実行し、各関数が同じ結果を返すことを確認しました[場合によっては順序がわずかに異なることに注意してください-そのため、コードの他の方法を実行するバリエーションを作成しました。「正しい」結果]。それで、これ以上苦労することなく、結果は次のとおりです。

mats1 time in clocks per iteration 10.3457
mats2 time in clocks per iteration 10.4785
mats3 time in clocks per iteration 10.5538
viraptor time in clocks per iteration 6.24603
lemees time in clocks per iteration 14.4818
npe time in clocks per iteration 13.1455
alex time in clocks per iteration 24.8272

(コア i5、g++ 4.7 からの viraptor の結果)

mats1 time in clocks per iteration 7.62338
mats2 time in clocks per iteration 7.36226
mats3 time in clocks per iteration 7.45361
viraptor time in clocks per iteration 2.09582
lemees time in clocks per iteration 9.43744
npe time in clocks per iteration 7.51016
alex time in clocks per iteration 19.3554

(コア i5、clang++ 3.2 からの viraptor の結果)

mats1 time in clocks per iteration 12.956
mats2 time in clocks per iteration 13.4395
mats3 time in clocks per iteration 13.3178
viraptor time in clocks per iteration 2.12914
lemees time in clocks per iteration 13.9267
npe time in clocks per iteration 16.2102
alex time in clocks per iteration 13.8705

これは 3.4GHz AMD Athlon2 でのクロックサイクルです。私は最新の Intel マシンを持っていません。誰かがそのコードを実行したい場合は、それがどのように見えるか見てみたいと思います。おそらく、いくつかの値をフェッチしてチェックすることを除けば、すべてがキャッシュ内でうまく動作すると確信しています。

したがって、勝者は明らかに viraptor で、約 40% で、「私の」コードが 2 番目です。Alex のコードにはジャンプや分岐はありませんが、他の代替コードよりも実行速度が遅いようです。npe の結果が私の結果よりもずっと遅い理由がわからない - ほとんど同じことを行う (そして、g++ からのアセンブラー出力を見ると、コードは非常に似ている)。

#include <iostream>
#include <fstream>
#include <cstdint>

using namespace std;

const int SIZE = 1000000;

uint64_t g_val[SIZE];

ofstream nulloutput;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats1(uint64_t val, int col)
{
    return BITA_TO_B(val, 56+col, 7) |
    BITA_TO_B(val, 48+col, 6) |
    BITA_TO_B(val, 40+col, 5) |
    BITA_TO_B(val, 32+col, 4) |
    BITA_TO_B(val, 24+col, 3) |
    BITA_TO_B(val, 16+col, 2) |
    BITA_TO_B(val, 8+col, 1) |
    BITA_TO_B(val, 0+col, 0);
}

unsigned char get_col_mats2(uint64_t val, int col)
{
    return BITA_TO_B(val, 63-col, 7) |
    BITA_TO_B(val, 55-col, 6) |
    BITA_TO_B(val, 47-col, 5) |
    BITA_TO_B(val, 39-col, 4) |
    BITA_TO_B(val, 31-col, 3) |
    BITA_TO_B(val, 23-col, 2) |
    BITA_TO_B(val, 15-col, 1) |
    BITA_TO_B(val, 7-col, 0);
}


unsigned char get_col_viraptor(uint64_t board, int col) {
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic = 0x2040810204081ull ;
    uint64_t column = board & (column_mask >> col);
    column <<= col;
    column *= magic;
    return (column >> 56) & 0xff;
}


unsigned char get_col_alex(uint64_t bitboard, int col)
{
    unsigned char result;
    result |= (bitboard & (1ULL << 63-col)) ? 0x80 : 0;
    result |= (bitboard & (1ULL << 55-col)) ? 0x40 : 0;
    result |= (bitboard & (1ULL << 47-col)) ? 0x20 : 0;
    result |= (bitboard & (1ULL << 39-col)) ? 0x10 : 0;
    result |= (bitboard & (1ULL << 31-col)) ? 0x08 : 0;
    result |= (bitboard & (1ULL << 23-col)) ? 0x04 : 0;
    result |= (bitboard & (1ULL << 15-col)) ? 0x02 : 0;
    result |= (bitboard & (1ULL << 7-col))  ? 0x01 : 0;

    return result;
}

unsigned char get_col_lemees(uint64_t val, int column)
{
    int result = 0;
    int source_bitpos = 7 - column; // "point" to last entry in this column
    for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
    {
    bool bit = (val >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
    }
    return result;
}

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col_npe(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}



#define BITA_TO_B2(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats3(uint64_t val, int col)
{
    return BITA_TO_B2(val, 63-col, 7) |
    BITA_TO_B2(val, 55-col, 6) |
    BITA_TO_B2(val, 47-col, 5) |
    BITA_TO_B2(val, 39-col, 4) |
    BITA_TO_B2(val, 31-col, 3) |
    BITA_TO_B2(val, 23-col, 2) |
    BITA_TO_B2(val, 15-col, 1) |
    BITA_TO_B2(val, 7-col, 0);
}

template<unsigned char (*f)(uint64_t val, int col)>
void runbench(const char *name)
{
    unsigned char col[8]  = {0};
    uint64_t long t = rdtsc();
    for(int j = 0; j < SIZE; j++)
    {
    uint64_t val = g_val[j];
    for(int i = 0; i < 8; i++)
    {
        col[i] += f(val, i);
    }
//  __asm__ __volatile__("":::"memory");
    }
    t = rdtsc() - t;
    for(int i = 0; i < 8; i++)
    {
    nulloutput<< "col " << i << " has bits " << hex << (int)col[i] << endl;
    }
    cout << name << " time in clocks per iteration " << dec << t / (8.0 * SIZE) << endl;
}

#define BM(name) void bench_##name() { runbench<get_col_##name>(#name); }

BM(mats1);
BM(mats2);
BM(mats3);
BM(viraptor);
BM(lemees);
BM(npe);
BM(alex);

struct function
{
    void (*func)(void);
    const char *name;
};


#define FUNC(f) { bench_##f, #f }

function funcs[] = 
{
    FUNC(mats1),
    FUNC(mats2),
    FUNC(mats3),
    FUNC(viraptor),
    FUNC(lemees),
    FUNC(npe),
    FUNC(alex),
}; 


int main()
{
    unsigned long long a, b;
    int i;
    int sum = 0;

    nulloutput.open("/dev/nul");
    for(i = 0; i < SIZE; i++)
    {
    g_val[i] = rand() + ((long)rand() << 32L);
    }
    unsigned char col[8];

    for(i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
    {
    funcs[i].func();
    }
}

score 2 · Accepted Answer

単純なループでコーディングし、オプティマイザーにインライン化とループ展開を実行させます。

4.7.2を使用してコンパイルすると-O3、私のボックスでは、次のコマンドで1秒あたり約3億回のget_col()呼び出しを実行できます。

bitboard.cpp:

#include <cinttypes>
#include <iostream>

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}

extern uint64_t board;
extern int sum;

extern void f();

int main() {
  for (int i = 0; i < 40000000; ++i) {
    for (int j = 0; j < 8; ++j) {
      sum += get_col(board, j);
    }
    f();
  }
  std::cout << sum << std::endl;
}

bitboard_b.cpp:

#include <cinttypes>

uint64_t board = 0x1234567890ABCDEFull;
int sum = 0;

void f() {}

のアセンブリコードをget_col()見ると、ループがゼロであり、おそらく手作りする可能性のあるものと同じくらい効率的であることがわかります。

__Z7get_colyi:
LFB1248:
        movl    %esi, %ecx
        movq    %rdi, %rax
        movq    %rdi, %rdx
        shrq    %cl, %rax
        leal    8(%rsi), %ecx
        andl    $1, %eax
        shrq    %cl, %rdx
        leal    16(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    24(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    32(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    40(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    48(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    56(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %eax
        shrq    %cl, %rdi
        andl    $1, %edi
        leal    (%rdi,%rax,2), %eax
        ret

これは完全な実装を意味するものではなく、アイデアの大まかな説明にすぎません。特に、ビットの順序は、予想とは逆になる場合があります。

score 1 · Accepted Answer

数値を転置してから、必要に応じて、関連する列を選択するだけで、現在は行になっているため、連続するビットを選択できます。

私のテストでは、個別に選択された8ビットをOR演算するよりもはるかに優れていませんでしたが、複数の列を選択する場合ははるかに優れています（転置が制限要因であるため）。

score 1 · Accepted Answer

#define BITA_TO_B(x, a, b) (((x) >> (a)) & 1) << b;

unsigned long long x = 0x6A6B7A6AEA6E6A;

unsigned char col_d = BITA_TO_B(x, 60, 7) |
                      BITA_TO_B(x, 52, 6) |
                      BITA_TO_B(x, 44, 5) |
                      BITA_TO_B(x, 36, 4) |
                      BITA_TO_B(x, 28, 3) |
                      BITA_TO_B(x, 20, 2) |
                      BITA_TO_B(x, 12, 1) |
                      BITA_TO_B(x,  4, 0);

おそらく、もう少し最適化されたアイデア：

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b));

bが定数の場合、これはわずかにパフォーマンスが向上します。

別の方法は次のとおりです。

unsigned long xx = x & 0x10101010101010; 
col_d = (xx >> 53) | (xx >> 46) | (xx >> 39) ... (xx >> 4);

多くではなく1つの「and」を実行すると、速度が向上します。

score 1 · Accepted Answer

あなたの場合（8x8 = 64ビットテーブルに特化）、ビットシフトを実行して特定のビットを抽出し、ビットシフトを使用して新しい整数変数に再配置する必要があります。

uint64_t matrix = ... // input
int column = 3; // "d"-column

int result = 0;
int source_bitpos = 7 - column; // "point" to last entry in this column
for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
{
    bool bit = (matrix >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
}

ここを参照してください：http：//ideone.com/UlWAAO

score 1 · Accepted Answer

PEXTこれは、インテルの命令に組み込み関数を使用する意思がある場合 (およびビットボード操作を行っている場合)、サイクルに 1 回実行できるソリューションです (値とマスクがレジスターにある場合)。

int get_col(uint64_t board) {
    return _pext_u64(board, 0x8080808080808080ull);
}

これは 0 列目です。別の列が必要な場合は、マスクを適切にシフトしてください。もちろん、これはハードウェア固有の命令を使用した不正行為ですが、ビットボードはすべて不正行為です。

score 0 · Accepted Answer

これはどう...

uint64_t bitboard = ...;
uint8_t result = 0;

result |= (bitboard & (1ULL << 60)) ? 0x80 : 0;
result |= (bitboard & (1ULL << 52)) ? 0x40 : 0;
result |= (bitboard & (1ULL << 44)) ? 0x20 : 0;
result |= (bitboard & (1ULL << 36)) ? 0x10 : 0;
result |= (bitboard & (1ULL << 28)) ? 0x08 : 0;
result |= (bitboard & (1ULL << 20)) ? 0x04 : 0;
result |= (bitboard & (1ULL << 12)) ? 0x02 : 0;
result |= (bitboard & (1ULL <<  4)) ? 0x01 : 0;

score 0 · Accepted Answer

これはChess Programming Wikiからのものです。ボードを転置した後、単一の行を分離するのは簡単です。また、逆方向に戻ることもできます。

/**
 * Flip a bitboard about the antidiagonal a8-h1.
 * Square a1 is mapped to h8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about antidiagonal a8-h1
 */
U64 flipDiagA8H1(U64 x) {
   U64 t;
   const U64 k1 = C64(0xaa00aa00aa00aa00);
   const U64 k2 = C64(0xcccc0000cccc0000);
   const U64 k4 = C64(0xf0f0f0f00f0f0f0f);
   t  =       x ^ (x << 36) ;
   x ^= k4 & (t ^ (x >> 36));
   t  = k2 & (x ^ (x << 18));
   x ^=       t ^ (t >> 18) ;
   t  = k1 & (x ^ (x <<  9));
   x ^=       t ^ (t >>  9) ;
   return x;
}

c++ - 64ビットの数値から特定の行/列/対角線を「分離」します

9 に答える 9

Related

Reference