c - SSE code to set a float variable to 0.0f or 1.0f based on a comparison

Question

I have two arrays, char* c and float* f, and I need to perform this operation:

// Compute float mask
float* f;
char* c;
char c_thresh;
int n;

for ( int i = 0; i < n; ++i )
{
    if ( c[i] < c_thresh ) f[i] = 0.0f;
    else                   f[i] = 1.0f;
}

I am looking for a fast way to do this: without conditionals and, if possible, using SSE (4.2 or AVX).

If using float instead of char would make the code faster, I can change my code to use floats only:

// Compute float mask
float* f;
float* c;
float c_thresh;
int n;

for ( int i = 0; i < n; ++i )
{
    if ( c[i] < c_thresh ) f[i] = 0.0f;
    else                   f[i] = 1.0f;
}

Thanks

score 6 · Accepted Answer

Pretty simple: do the comparison, convert the bytes to dwords, and AND the result with 1.0f. (Untested. This is not meant to be copy-and-paste code anyway, just to show how to do it.)

movd xmm0, [c]          ; read 4 bytes from c
pcmpgtb xmm0, threshold ; compare (note: comparison is >, not >=, so adjust threshold)
pmovsxbd xmm0, xmm0     ; sign-extend mask bytes to dwords (0xFF -> 0xFFFFFFFF, so the AND keeps all of 1.0f)
pand xmm0, one          ; AND all four elements with 1.0f
movdqa [f], xmm0        ; save result

Converting this to intrinsics should be fairly straightforward.
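
As one possible translation of those steps into intrinsics, here is a minimal sketch that sticks to SSE2 so it builds without extra flags on x86-64. The names mask_sse2 and check_mask_sse2 and the use of signed char (the packed byte compare is signed) are my assumptions, not part of the answer. Instead of adjusting the threshold it computes c < thresh and inverts the mask with andnot, and it widens the byte mask by unpacking it with itself rather than using an SSE4.1 widening instruction.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define MASK_HAVE_SSE2 1
#endif

// Hypothetical helper: f[i] = (c[i] >= c_thresh) ? 1.0f : 0.0f
void mask_sse2(float* f, const signed char* c, std::size_t n, signed char c_thresh) {
    std::size_t i = 0;
#ifdef MASK_HAVE_SSE2
    const __m128i thresh = _mm_set1_epi8(c_thresh);
    const __m128i one = _mm_castps_si128(_mm_set1_ps(1.0f));
    for (; i + 16 <= n; i += 16) {
        const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + i));
        // 0xFF in every byte where c < thresh (signed compare), 0x00 elsewhere
        const __m128i lt = _mm_cmplt_epi8(bytes, thresh);
        // Unpacking a mask with itself widens 0x00/0xFF bytes to 0x0000/0xFFFF words...
        const __m128i w_lo = _mm_unpacklo_epi8(lt, lt);  // bytes 0..7
        const __m128i w_hi = _mm_unpackhi_epi8(lt, lt);  // bytes 8..15
        // ...and once more to all-zeros / all-ones dwords
        const __m128i d0 = _mm_unpacklo_epi16(w_lo, w_lo);  // bytes 0..3
        const __m128i d1 = _mm_unpackhi_epi16(w_lo, w_lo);  // bytes 4..7
        const __m128i d2 = _mm_unpacklo_epi16(w_hi, w_hi);  // bytes 8..11
        const __m128i d3 = _mm_unpackhi_epi16(w_hi, w_hi);  // bytes 12..15
        // andnot(m, one) = ~m & one: 1.0f where c >= thresh, 0.0f where c < thresh
        _mm_storeu_ps(f + i,      _mm_castsi128_ps(_mm_andnot_si128(d0, one)));
        _mm_storeu_ps(f + i + 4,  _mm_castsi128_ps(_mm_andnot_si128(d1, one)));
        _mm_storeu_ps(f + i + 8,  _mm_castsi128_ps(_mm_andnot_si128(d2, one)));
        _mm_storeu_ps(f + i + 12, _mm_castsi128_ps(_mm_andnot_si128(d3, one)));
    }
#endif
    // Scalar tail (and the whole loop when SSE2 is not available)
    for (; i < n; ++i) f[i] = (c[i] >= c_thresh) ? 1.0f : 0.0f;
}

// Usage example: compare against the plain loop on 37 elements
// (two full vectors plus a 5-element tail).
void check_mask_sse2() {
    signed char c[37];
    float f[37];
    for (int i = 0; i < 37; ++i) c[i] = static_cast<signed char>(i * 6 - 110);
    mask_sse2(f, c, 37, 3);
    for (int i = 0; i < 37; ++i) assert(f[i] == ((c[i] >= 3) ? 1.0f : 0.0f));
}
```

Unaligned loads and stores are used so the sketch makes no assumption about how c and f were allocated.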

score 4 · Accepted Answer

By switching to float, you can let GCC auto-vectorize the loop and not worry about intrinsics. The following code does what you want and auto-vectorizes.

void foo(float *f, float*c, float c_thresh, const int n) {
    for (int i = 0; i < n; ++i) {
        f[i] = (float)(c[i] >= c_thresh);
    }
}

Compile with

g++  -O3 -Wall  -pedantic -march=native main.cpp -ftree-vectorizer-verbose=1

You can see the results and edit/compile the code yourself at coliru. However, MSVC2013 did not vectorize the loop.
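
For intuition about what a vectorized version of this float loop computes, here is a hand-written sketch using plain SSE: compare four floats at once, then AND the all-ones/all-zeros result with the bit pattern of 1.0f. This is not a claim about what GCC actually emits, and mask_ps and check_mask_ps are names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics
#define MASK_HAVE_SSE 1
#endif

// Hypothetical helper: f[i] = (float)(c[i] >= c_thresh)
void mask_ps(float* f, const float* c, std::size_t n, float c_thresh) {
    std::size_t i = 0;
#ifdef MASK_HAVE_SSE
    const __m128 thresh = _mm_set1_ps(c_thresh);
    const __m128 one = _mm_set1_ps(1.0f);
    for (; i + 4 <= n; i += 4) {
        // Each lane becomes all ones where c >= thresh, all zeros otherwise
        const __m128 ge = _mm_cmpge_ps(_mm_loadu_ps(c + i), thresh);
        // ANDing with 1.0f leaves 1.0f in the selected lanes and 0.0f elsewhere
        _mm_storeu_ps(f + i, _mm_and_ps(ge, one));
    }
#endif
    // Scalar tail (and the whole loop when SSE is not available)
    for (; i < n; ++i) f[i] = static_cast<float>(c[i] >= c_thresh);
}

// Usage example: 11 elements, so two full vectors plus a 3-element tail.
void check_mask_ps() {
    float c[11];
    float f[11];
    for (int i = 0; i < 11; ++i) c[i] = i * 0.5f - 2.0f;
    mask_ps(f, c, 11, 1.0f);
    for (int i = 0; i < 11; ++i) assert(f[i] == ((c[i] >= 1.0f) ? 1.0f : 0.0f));
}
```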

score 2 · Accepted Answer

How about:

f[i] = (c[i] >= c_thresh);

At least this removes the conditional from the source.
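
Why this is equivalent: the comparison yields a bool, which converts to exactly 0 or 1 and then to 0.0f or 1.0f. A small sketch that checks the branch-free form against the original if/else for every pair of signed char values (all function names here are hypothetical):

```cpp
#include <cassert>

// The original branching form
float mask_branch(signed char c, signed char thresh) {
    if (c < thresh) return 0.0f;
    else            return 1.0f;
}

// The branch-free form: bool -> 0 or 1 -> 0.0f or 1.0f
float mask_branchless(signed char c, signed char thresh) {
    return static_cast<float>(c >= thresh);
}

// Exhaustively compares both forms over all 256 x 256 input pairs.
void check_forms_agree() {
    for (int c = -128; c <= 127; ++c) {
        for (int t = -128; t <= 127; ++t) {
            const signed char sc = static_cast<signed char>(c);
            const signed char st = static_cast<signed char>(t);
            assert(mask_branch(sc, st) == mask_branchless(sc, st));
        }
    }
}
```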

score 2 · Accepted Answer

AVX version:

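#include <cstddef>      // size_t
#include <immintrin.h>  // SSSE3 (_mm_shuffle_epi8) and AVX intrinsics

// Scalar reference; also used for the tail of the vectorized version.
void floatSelect(float* f, const char* c, size_t n, char c_thresh) {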
    for (size_t i = 0; i < n; ++i) {
        if (c[i] < c_thresh) f[i] = 0.0f;
        else f[i] = 1.0f;
    }
}

void vecFloatSelect(float* f, const char* c, size_t n, char c_thresh) {
    const auto thresh = _mm_set1_epi8(c_thresh);
    const auto zeros = _mm256_setzero_ps();
    const auto ones = _mm256_set1_ps(1.0f);
    const auto shuffle0 = _mm_set_epi8(3, -1, -1, -1, 2, -1, -1, -1, 1, -1, -1, -1, 0, -1, -1, -1);
    const auto shuffle1 = _mm_set_epi8(7, -1, -1, -1, 6, -1, -1, -1, 5, -1, -1, -1, 4, -1, -1, -1);
    const auto shuffle2 = _mm_set_epi8(11, -1, -1, -1, 10, -1, -1, -1, 9, -1, -1, -1, 8, -1, -1, -1);
    const auto shuffle3 = _mm_set_epi8(15, -1, -1, -1, 14, -1, -1, -1, 13, -1, -1, -1, 12, -1, -1, -1);

    const size_t nVec = (n / 16) * 16;
    for (size_t i = 0; i < nVec; i += 16) {
        const auto chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + i));
        const auto mask = _mm_cmplt_epi8(chars, thresh);
        const auto floatMask0 = _mm_shuffle_epi8(mask, shuffle0);
        const auto floatMask1 = _mm_shuffle_epi8(mask, shuffle1);
        const auto floatMask2 = _mm_shuffle_epi8(mask, shuffle2);
        const auto floatMask3 = _mm_shuffle_epi8(mask, shuffle3);
        const auto floatMask01 = _mm256_set_m128i(floatMask1, floatMask0);
        const auto floatMask23 = _mm256_set_m128i(floatMask3, floatMask2);
        const auto floats0 = _mm256_blendv_ps(ones, zeros, _mm256_castsi256_ps(floatMask01));
        const auto floats1 = _mm256_blendv_ps(ones, zeros, _mm256_castsi256_ps(floatMask23));
        _mm256_storeu_ps(f + i, floats0);
        _mm256_storeu_ps(f + i + 8, floats1);
    }
    floatSelect(f + nVec, c + nVec, n % 16, c_thresh);
}

score 1 · Accepted Answer

Converting to

f[i] = (float)(c[i] >= c_thresh);

also makes auto-vectorization possible with the Intel Compiler (others have mentioned that the same holds for gcc).

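In general, when you need to auto-vectorize a loop that contains branches, you can also try #pragma ivdep or pragma simd (the latter is part of Intel Cilk Plus and the OpenMP 4.0 standard). These pragmas auto-vectorize the given code in a portable way for SSE, AVX, and future vector extensions (such as AVX512). They are supported by the Intel compiler (all known versions), by the Cray and PGI compilers (ivdep only), probably by the upcoming GCC4.9 release, and partially by MSVC (ivdep only) starting with VS2012.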

For the given example I changed nothing (kept the if and the char*) and just added pragma ivdep:

void foo(float *f, char*c, char c_thresh, const int n) {
    #pragma ivdep
    for ( int i = 0; i < n; ++i )
    {
        if ( c[i] < c_thresh ) f[i] = 0.0f;
        else                   f[i] = 1.0f;
    }
}

On my Core i5 with no AVX support (SSE3 only), for n = 32K (32000000), with randomly generated c[i] and c_thresh set to 0 (using signed char), this code gives about a 5x speedup thanks to vectorization by ICL.

The full test (including correctness checks on additional test cases) is available here (this is coliru, i.e. gcc4.8 only, no ICL/Cray, which is why it does not vectorize in the coliru environment).

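It should be possible to optimize performance further by adding more pragmas/optimizations for prefetching, alignment, and type conversions. Also, in certain simple cases you can add the restrict keyword (or a compiler-specific spelling such as __restrict or __restrict__, depending on the compiler used) instead of ivdep/simd, but for more general cases the simd/ivdep pragmas are the most powerful.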
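
A minimal sketch of the restrict variant, for illustration only. C++ has no standard restrict keyword, so this uses the __restrict extension that GCC, Clang, MSVC and the Intel compiler accept; the MASK_RESTRICT macro and the function names are made up here. Whether a particular compiler then vectorizes the loop is something to confirm in its vectorization report.

```cpp
#include <cassert>

// Hypothetical portability macro: falls back to nothing on unknown compilers.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#define MASK_RESTRICT __restrict
#else
#define MASK_RESTRICT
#endif

// Same loop as before; the qualifiers promise the compiler that f and c
// never overlap, which removes the assumed dependency between iterations.
void mask_restrict(float* MASK_RESTRICT f, const signed char* MASK_RESTRICT c,
                   signed char c_thresh, int n) {
    for (int i = 0; i < n; ++i) {
        if (c[i] < c_thresh) f[i] = 0.0f;
        else                 f[i] = 1.0f;
    }
}

// Usage example: the caller must pass non-overlapping buffers.
void check_mask_restrict() {
    const signed char c[5] = {-3, -1, 0, 1, 3};
    float f[5];
    mask_restrict(f, c, 0, 5);
    for (int i = 0; i < 5; ++i) assert(f[i] == ((c[i] >= 0) ? 1.0f : 0.0f));
}
```

The promise is on the caller: passing overlapping buffers to a function declared this way is undefined behavior.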

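Note: strictly speaking, #pragma ivdep "instructs the compiler to ignore assumed cross-iteration dependencies" (roughly speaking, the ones that would lead to data races if you parallelized the same loop). For well-known reasons, compilers are very conservative about these assumptions. In this particular case there are clearly no read-after-write or write-after-read dependencies. If needed, you can check for the presence of such dependencies, at least on a given workload, with dynamic tools such as the Advisor XE Correctness analysis, as shown in my comments below.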
