c++ - SSE 4.2 の CRC32C をソフトウェアに実装する

Question

そのため、データが破損していないことを確認するために CRC32C チェックサムを組み込んだ設計があります。ソフトウェアが実行されているコンピューターが SSE 4.2 をサポートしている場合、ソフトウェアバージョンとハードウェアアクセラレーションバージョンの両方を使用できるため、CRC32C を使用することにしました。

Intel の開発者マニュアル (vol 2A) を使用します。これは、crc32命令の背後にあるアルゴリズムを提供しているようです。しかし、私はほとんど運がありません。Intel の開発者ガイドには次のように書かれています。

BIT_REFLECT32: DEST[31-0] = SRC[0-31]
MOD2: Remainder from Polynomial division modulus 2

TEMP1[31-0] <- BIT_REFLECT(SRC[31-0])
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0])
TEMP3[63-0] <- TEMP1[31-0] << 32
TEMP4[63-0] <- TEMP2[31-0] << 32
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41
DEST[31-0]  <- BIT_REFLECT(TEMP6[31-0])

今、私が知る限り、行まではすべてTEMP6正しく開始しましたが、多項式除算を誤解しているか、間違って実装している可能性があると思います。私の理解が正しければ、1 / 1 mod 2 = 1、0 / 1 mod 2 = 0、および両方のゼロ除算は定義されていません。

私が理解していないのは、64 ビットと 33 ビットのオペランドを使用したバイナリ除算がどのように機能するかです。SRCisとis は0x00000000すべてセットされたビットになり、 whileはすべてセットされていないビットになります。DEST0xFFFFFFFFTEMP5[63-32]TEMP5[31-0]

からのビットをTEMP5分子として使用する場合、多項式の長さは 33 ビットしかないため、ゼロによる 30 除算が行われます11EDC6F41(したがって、64 ビットの符号なし整数に変換すると、上位 30 ビットが未設定のままになります)。分母は 30 ビットでは未設定です。

ただし、多項式を分子として使用する場合、の下位 32 ビットTEMP5が設定されていないため、そこでゼロ除算が行われ、分子の上位 30 ビットが次のようになるため、結果の上位 30 ビットはゼロになります。ゼロとして0 / 1 mod 2 = 0。

これがどのように機能するかを誤解していますか？単純に何かが足りない？あるいは、Intel はドキュメントの重要なステップをいくつか省略していますか?

彼らが使用したアルゴリズムと思われるものについて Intel の開発者ガイドに行った理由は、彼らが 33 ビットの多項式を使用していたためで、32 ビットの多項式を使用したときには起こらなかった出力を同一にしたかったためです1EDC6F41(ショー下）。

uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000;

for (n = 0; n < 256; n++) {
    sres = n;
    for (k = 0; k < 8; k++)
        sres = (sres & 1) == 1 ? poly ^ (sres >> 1) : (sres >> 1);
    crcTable[n] = sres;
}
sres = 0xFFFFFFFF;

for (n = 0; n < 4; n++) {
    sres = crcTable[(sres ^ data) & 0xFF] ^ (sres >> 8);
}

上記のコードは4138093821出力としてcrc32生成し、オペコード2346497208は入力を使用して生成します0x00000000。

これがひどく書かれているか、場所が理解できない場合は申し訳ありませんが、私にはかなり遅れています。

score 76 · Accepted Answer

ここでは、CRC-32C のソフトウェアバージョンとハードウェアバージョンの両方を示します。ソフトウェアバージョンは、一度に 8 バイトを処理するように最適化されています。ハードウェアバージョンはcrc32q、1 つのコアで 3 つの命令を効率的に並行して実行するように最適化されています。これは、その命令のスループットが 1 サイクルであるためです。ただし、レイテンシは 3 サイクルです。

crc32c.c:

/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
 * Copyright (C) 2013, 2021 Mark Adler
 * Version 1.2  5 Jun 2021  Mark Adler
 */

/*
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the author be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
 */

/* Version History:
 1.0  10 Feb 2013  First version
 1.1  31 May 2021  Correct register constraints on assembly instructions
                   Include pre-computed tables to avoid use of pthreads
                   Return zero for the CRC when buf is NULL, as initial value
 1.2   5 Jun 2021  Make tables constant
 */

// Use hardware CRC instruction on Intel SSE 4.2 processors.  This computes a
// CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc.  A software
// version is provided as a fall-back, as well as for speed comparisons.

#include <stddef.h>
#include <stdint.h>

// Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
// shifts by LONG and SHORT bytes.
#include "crc32c.h"

// Table-driven software version as a fall-back.  This is about 15 times slower
// than using the hardware instructions.  This assumes little-endian integers,
// as is the case on Intel processors that the assembler code here is for.
static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;
    unsigned char const *data = buf;
    while (len && ((uintptr_t)data & 7) != 0) {
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
        len--;
    }
    size_t n = len >> 3;
    for (size_t i = 0; i < n; i++) {
        uint64_t word = crc ^ ((uint64_t const *)data)[i];
        crc = crc32c_table[7][word & 0xff] ^
              crc32c_table[6][(word >> 8) & 0xff] ^
              crc32c_table[5][(word >> 16) & 0xff] ^
              crc32c_table[4][(word >> 24) & 0xff] ^
              crc32c_table[3][(word >> 32) & 0xff] ^
              crc32c_table[2][(word >> 40) & 0xff] ^
              crc32c_table[1][(word >> 48) & 0xff] ^
              crc32c_table[0][word >> 56];
    }
    data += n << 3;
    len &= 7;
    while (len) {
        len--;
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
    }
    return crc;
}

// Apply the zeros operator table to crc.
static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
    return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
           zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
}

// Compute CRC-32C using the Intel hardware instruction. Three crc32q
// instructions are run in parallel on a single core. This gives a
// factor-of-three speedup over a single crc32q instruction, since the
// throughput of that instruction is one cycle, but the latency is three
// cycles.
static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;

    // Pre-process the crc.
    uint64_t crc0 = crc ^ 0xffffffff;

    // Compute the crc for up to seven leading bytes, bringing the data pointer
    // to an eight-byte boundary.
    unsigned char const *next = buf;
    while (len && ((uintptr_t)next & 7) != 0) {
        __asm__("crc32b\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next++;
        len--;
    }

    // Compute the crc on sets of LONG*3 bytes, making use of three ALUs in
    // parallel on a single core.
    while (len >= LONG*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + LONG;
        do {
            __asm__("crc32q\t" "(%3), %0\n\t"
                    "crc32q\t" LONGx1 "(%3), %1\n\t"
                    "crc32q\t" LONGx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(next), "m"(*next));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
        next += LONG*2;
        len -= LONG*3;
    }

    // Do the same thing, but now on SHORT*3 blocks for the remaining data less
    // than a LONG*3 block.
    while (len >= SHORT*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + SHORT;
        do {
            __asm__("crc32q\t" "(%3), %0\n\t"
                    "crc32q\t" SHORTx1 "(%3), %1\n\t"
                    "crc32q\t" SHORTx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(next), "m"(*next));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
        next += SHORT*2;
        len -= SHORT*3;
    }

    // Compute the crc on the remaining eight-byte units less than a SHORT*3
    // block.
    unsigned char const *end = next + (len - (len & 7));
    while (next < end) {
        __asm__("crc32q\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next += 8;
    }
    len &= 7;

    // Compute the crc for up to seven trailing bytes.
    while (len) {
        __asm__("crc32b\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next++;
        len--;
    }

    // Return the crc, post-processed.
    return ~(uint32_t)crc0;
}

// Check for SSE 4.2.  SSE 4.2 was first supported in Nehalem processors
// introduced in November, 2008.  This does not check for the existence of the
// cpuid instruction itself, which was introduced on the 486SL in 1992, so this
// will fail on earlier x86 processors.  cpuid works on all Pentium and later
// processors.
#define SSE42(have) \
    do { \
        uint32_t eax, ecx; \
        eax = 1; \
        __asm__("cpuid" \
                : "=c"(ecx) \
                : "a"(eax) \
                : "%ebx", "%edx"); \
        (have) = (ecx >> 20) & 1; \
    } while (0)

// Compute a CRC-32C.  If the crc32 instruction is available, use the hardware
// version.  Otherwise, use the software version.
uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
    int sse42;
    SSE42(sse42);
    return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
}

crc32c.h を生成するコード (回答に 30,000 文字の制限があるため、stackoverflow ではテーブル自体を投稿できません):

// Generate crc32c.h for crc32c.c.

#include <stdio.h>
#include <stdint.h>

#define LONG 8192
#define SHORT 256

// Print a 2-D table of four-byte constants in hex.
static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
    printf("static uint32_t const %s[][%zu] = {\n", name, cols);
    size_t end = rows * cols;
    size_t k = 0;
    for (;;) {
        fputs("   {", stdout);
        size_t n = 0, j = 0;
        for (;;) {
            printf("0x%08x", tab[k + n]);
            if (++n == cols)
                break;
            putchar(',');
            if (++j == 6) {
                fputs("\n   ", stdout);
                j = 0;
            }
            putchar(' ');
        }
        k += cols;
        if (k == end)
            break;
        puts("},");
    }
    puts("}\n};");
}

/* CRC-32C (iSCSI) polynomial in reversed bit order. */
#define POLY 0x82f63b78

static void crc32c_word_table(void) {
    uint32_t table[8][256];

    // Generate byte-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~n;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
        table[0][n] = ~crc;
    }

    // Use byte-wise table to generate word-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~table[0][n];
        for (unsigned k = 1; k < 8; k++) {
            crc = table[0][crc & 0xff] ^ (crc >> 8);
            table[k][n] = ~crc;
        }
    }

    // Print table.
    print_table(table[0], 8, 256, "crc32c_table");
}

// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, this requires that a not be zero.
static uint32_t multmodp(uint32_t a, uint32_t b) {
    uint32_t prod = 0;
    for (;;) {
        if (a & 0x80000000) {
            prod ^= b;
            if ((a & 0x7fffffff) == 0)
                break;
        }
        a <<= 1;
        b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
    }
    return prod;
}

/* Take a length and build four lookup tables for applying the zeros operator
   for that length, byte-by-byte, on the operand. */
static void crc32c_zero_table(size_t len, char *name) {
    // Generate operator for len zeros.
    uint32_t op = 0x80000000;               // 1 (x^0)
    uint32_t sq = op >> 4;                  // x^4
    while (len) {
        sq = multmodp(sq, sq);              // x^2^(k+3), k == len bit position
        if (len & 1)
            op = multmodp(sq, op);
        len >>= 1;
    }

    // Generate table to update each byte of a CRC using op.
    uint32_t table[4][256];
    for (unsigned n = 0; n < 256; n++) {
        table[0][n] = multmodp(op, n);
        table[1][n] = multmodp(op, n << 8);
        table[2][n] = multmodp(op, n << 16);
        table[3][n] = multmodp(op, n << 24);
    }

    // Print the table to stdout.
    print_table(table[0], 4, 256, name);
}

int main(void) {
    puts(
"// crc32c.h\n"
"// Tables and constants for crc32c.c software and hardware calculations.\n"
"\n"
"// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
"// has built into it the pre and post bit inversion of the CRC."
    );
    crc32c_word_table();
    puts(
"\n// Block sizes for three-way parallel crc computation.  LONG and SHORT\n"
"// must both be powers of two.  The associated string constants must be set\n"
"// accordingly, for use in constructing the assembler instructions."
        );
    printf("#define LONG %d\n", LONG);
    printf("#define LONGx1 \"%d\"\n", LONG);
    printf("#define LONGx2 \"%d\"\n", 2 * LONG);
    printf("#define SHORT %d\n", SHORT);
    printf("#define SHORTx1 \"%d\"\n", SHORT);
    printf("#define SHORTx2 \"%d\"\n", 2 * SHORT);
    puts(
"\n// Table to shift a CRC-32C by LONG bytes."
    );
    crc32c_zero_table(8192, "crc32c_long");
    puts(
"\n// Table to shift a CRC-32C by SHORT bytes."
    );
    crc32c_zero_table(256, "crc32c_short");
    return 0;
}

score 17 · Accepted Answer

Mark Adler の答えは正確で完全ですが、アプリケーションに CRC-32C をすばやく簡単に統合する方法を探している人は、特に Windows と .NET を使用している場合、コードを適応させるのが少し難しいと感じるかもしれません。

利用可能なハードウェアに応じて、ハードウェアまたはソフトウェアのいずれかの方法を使用してCRC-32C を実装するライブラリを作成しました。C++ および .NET 用の NuGet パッケージとして利用できます。もちろんオープンソースです。

上記の Mark Adler のコードをパッケージ化する以外に、ソフトウェアフォールバックのスループットを 50% 向上させる簡単な方法を見つけました。私のコンピューターでは、ライブラリは現在、ソフトウェアで 2 GB/秒、ハードウェアで 20 GB/秒以上を達成しています。興味のある方のために、最適化されたソフトウェアの実装を次に示します。

static uint32_t append_table(uint32_t crci, buffer input, size_t length)
{
    buffer next = input;
#ifdef _M_X64
    uint64_t crc;
#else
    uint32_t crc;
#endif

    crc = crci ^ 0xffffffff;
#ifdef _M_X64
    while (length && ((uintptr_t)next & 7) != 0)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    while (length >= 16)
    {
        crc ^= *(uint64_t *)next;
        uint64_t high = *(uint64_t *)(next + 8);
        crc = table[15][crc & 0xff]
            ^ table[14][(crc >> 8) & 0xff]
            ^ table[13][(crc >> 16) & 0xff]
            ^ table[12][(crc >> 24) & 0xff]
            ^ table[11][(crc >> 32) & 0xff]
            ^ table[10][(crc >> 40) & 0xff]
            ^ table[9][(crc >> 48) & 0xff]
            ^ table[8][crc >> 56]
            ^ table[7][high & 0xff]
            ^ table[6][(high >> 8) & 0xff]
            ^ table[5][(high >> 16) & 0xff]
            ^ table[4][(high >> 24) & 0xff]
            ^ table[3][(high >> 32) & 0xff]
            ^ table[2][(high >> 40) & 0xff]
            ^ table[1][(high >> 48) & 0xff]
            ^ table[0][high >> 56];
        next += 16;
        length -= 16;
    }
#else
    while (length && ((uintptr_t)next & 3) != 0)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    while (length >= 12)
    {
        crc ^= *(uint32_t *)next;
        uint32_t high = *(uint32_t *)(next + 4);
        uint32_t high2 = *(uint32_t *)(next + 8);
        crc = table[11][crc & 0xff]
            ^ table[10][(crc >> 8) & 0xff]
            ^ table[9][(crc >> 16) & 0xff]
            ^ table[8][crc >> 24]
            ^ table[7][high & 0xff]
            ^ table[6][(high >> 8) & 0xff]
            ^ table[5][(high >> 16) & 0xff]
            ^ table[4][high >> 24]
            ^ table[3][high2 & 0xff]
            ^ table[2][(high2 >> 8) & 0xff]
            ^ table[1][(high2 >> 16) & 0xff]
            ^ table[0][high2 >> 24];
        next += 12;
        length -= 12;
    }
#endif
    while (length)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    return (uint32_t)crc ^ 0xffffffff;
}

ご覧のとおり、一度に大きなブロックをクランチするだけです。より大きなルックアップテーブルが必要ですが、それでもキャッシュフレンドリーです。テーブルは同じ方法で生成されますが、行数が増えるだけです。

私が調査したもう 1 つのことは、PCLMULQDQ 命令を使用して、AMD プロセッサでハードウェアアクセラレーションを取得することです。Intel の zlib 用 CRC パッチ( GitHubでも入手可能) を CRC-32C 多項式に移植することができました。~~魔法の定数 0x9db42487を除いて。解読できる人いたら教えてください~~. reddit に関する supersaw7 の優れた説明の後、とらえどころのない 0x9db42487 定数も移植しました。それを磨き、テストする時間を見つける必要があります。

score 15 · Accepted Answer

まず、インテルのCRC32命令が計算に役立ちますCRC-32C(つまり、通常の CRC32 とは異なる多項式を使用します。ウィキペディアの CRC32エントリを参照してください) 。

CRC32C に Intel のハードウェアアクセラレーションを使用gccするには、次のことができます。

asmステートメントを介した C コードのインラインアセンブリ言語
組み込み関数_mm_crc32_u8、_mm_crc32_u16、_mm_crc32_u32またはを使用し_mm_crc32_u64ます。Intel のコンパイラの説明と実装については、Intel Intrinsics Guide を参照してください。iccgcc

これは__mm_crc32_u8、一度に 1 バイトかかる方法です__mm_crc32_u64。一度に 8 バイトかかるため、を使用するとパフォーマンスがさらに向上します。

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
  uint32_t hash = 0;
  size_t i = 0;
  for (i=0;i<len;i++) {
    hash = _mm_crc32_u8(hash, bytes[i]);
  }

  return hash;
}

これをコンパイルするには-msse4.2、CFLAGS. それ以外の場合と同様にgcc -g -msse4.2 test.c、について文句を言いundefined reference to _mm_crc32_u8ます。

実行可能ファイルが実行されているプラットフォームで命令が使用できない場合にプレーン C 実装に戻したい場合は、GCC のifunc属性を使用できます。お気に入り

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
  /* use _mm_crc32_u* here */
}

uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
  /* pure C implementation */
}

/* this will be called at load time to decide which function really use */
/* sse42_crc32 if SSE 4.2 is supported */
/* default_crc32 if not */
static void * resolve_crc32(void) {
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse4.2")) return sse42_crc32;

  return default_crc32;
}

/* crc32() implementation will be resolved at load time to either */
/* sse42_crc32() or default_crc32() */
uint32_t crc32(const uint8_t *bytes, size_t len) __attribute__ ((ifunc ("resolve_crc32")));

score 8 · Accepted Answer

ここでさまざまなアルゴリズムを比較します: https://github.com/htot/crc32c

最速のアルゴリズムは、Intel の crc_iscsi_v_pcl.asm アセンブリコード (Linux カーネルで変更された形式で利用可能) から取得され、このプロジェクトに含まれる C ラッパー (crcinterasm.cc) を使用しています。

最初にこのコードを 32 ビットプラットフォームで実行できるようにするために、可能な場合は C (crc32intelc) に移植されています。少量のインラインアセンブリが必要です。コードの特定の部分はビット数に依存します。crc32q は 32 ビットでは使用できず、movq も使用できません。これらは、32 ビットプラットフォーム用の代替コードでマクロ (crc32intel.h) に入れられます。

c++ - SSE 4.2 の CRC32C をソフトウェアに実装する

4 に答える 4

Related

Reference