
After reading a bit about SSE intrinsics, I tried my luck at implementing quaternion rotation using doubles. Below is the normal function I wrote, followed by the SSE version.


void quat_rot(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////
    REAL cross_temp[3],result[3];

    cross_temp[0]=a.el[2]*b[2]-a.el[3]*b[1]+a.el[0]*b[0];
    cross_temp[1]=a.el[3]*b[0]-a.el[1]*b[2]+a.el[0]*b[1];
    cross_temp[2]=a.el[1]*b[1]-a.el[2]*b[0]+a.el[0]*b[2];

    result[0]=b[0]+2.0*(a.el[2]*cross_temp[2]-a.el[3]*cross_temp[1]);
    result[1]=b[1]+2.0*(a.el[3]*cross_temp[0]-a.el[1]*cross_temp[2]);
    result[2]=b[2]+2.0*(a.el[1]*cross_temp[1]-a.el[2]*cross_temp[0]);

    b[0]=result[0];
    b[1]=result[1];
    b[2]=result[2];
}

And using SSE:


#include <emmintrin.h>  /* SSE2 intrinsics used below */

inline void cross_p(__m128d *a, __m128d *b, __m128d *c){
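    /* Data layout: a 3-vector v is split across two __m128d values,
       v[0] = {x, y} and v[1] = {z, unused}. This computes c = a x b in
       the same layout; the upper half of c[1] is a don't-care value. */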
    const __m128d SIGN_NP = _mm_set_pd(0.0, -0.0);


    __m128d l1 = _mm_mul_pd( _mm_unpacklo_pd(a[1], a[1]), b[0] );

    __m128d l2 = _mm_mul_pd( _mm_unpacklo_pd(b[1], b[1]), a[0] );
    __m128d m1 = _mm_sub_pd(l1, l2);
            m1 = _mm_shuffle_pd(m1, m1, 1);
            m1 = _mm_xor_pd(m1, SIGN_NP);

            l1 = _mm_mul_pd( a[0], _mm_shuffle_pd(b[0], b[0], 1) );

    __m128d m2 = _mm_sub_sd(l1, _mm_unpackhi_pd(l1, l1));

    c[0] = m1;
    c[1] = m2;
}

void quat_rotSSE(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////

    __m128d axb[2];
    __m128d aa[2];
            aa[0] = _mm_load_pd(a.el+1);
            aa[1] = _mm_load_sd(a.el+3);
    __m128d bb[2];
            bb[0] = _mm_load_pd(b);
            bb[1] = _mm_load_sd(b+2);

    cross_p(aa, bb, axb);

    __m128d w = _mm_set1_pd(a.el[0]);

    axb[0] = _mm_add_pd(axb[0], _mm_mul_pd(w, bb[0]));
    axb[1] = _mm_add_sd(axb[1], _mm_mul_sd(w, bb[1]));

    cross_p(aa, axb, axb);

    _mm_store_pd(b, _mm_add_pd(bb[0], _mm_add_pd(axb[0], axb[0])));
    _mm_store_sd(b+2, _mm_add_pd(bb[1], _mm_add_sd(axb[1], axb[1])));
}

The rotation being performed is essentially

    b' = b + 2 a_v × (a_v × b + a_w b)

where a_w = a.el[0] is the scalar part of the quaternion and a_v = (a.el[1], a.el[2], a.el[3]) is its vector part.

I then run the following test to see how long each function takes to perform a series of rotations.


#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]){

    REAL a[] __attribute__ ((aligned(16))) = {0.2, 1.3, 2.6};
    quat_t q = {{0.1, 0.7, -0.3, -3.2}};

    REAL sum = 0.0;
    for(int i = 0; i < 4; i++) sum += q.el[i] * q.el[i];
    sum = sqrt(sum);
    for(int i = 0; i < 4; i++) q.el[i] /= sum;

    int N = 1000000000;

    for(int i = 0; i < N; i++){
        quat_rotSSE(q, a);
    }

    printf("rot = ");
    for(int i = 0; i < 3; i++) printf("%f, ", a[i]);
    printf("\n");

    return 0;
}

I compiled with gcc 4.6.3 using -O3 -std=c99 -msse3.

Timing both with the UNIX time command, the normal function takes 18.841 s and the SSE version 21.689 s.

Am I missing something? Why is the SSE implementation about 15% slower than the normal one? In which cases would an SSE implementation be faster for double precision?


Edit: Following advice from the comments, I tried a few things:

  • The -O1 option gives very similar results.
  • I tried using restrict with the cross_p function and added an extra __m128d to hold the second cross product; this made no difference in the generated assembly.
  • From what I can tell, the assembly generated for the normal function contains only scalar instructions, apart from a few movapd.

The assembly code generated for the SSE function is only 4 lines shorter than that of the normal one.


Edit: Added links to the generated assembly:

quat_rot

quat_rotSSE


2 Answers


SSE (and SIMD in general) works really well when you're performing the same operations on a large number of elements, where there are no dependencies between operations. For example, if you had an array of double and needed to do array[i] = (array[i] * K + L)/M + N; for each element then SSE/SIMD would help.

If you're not performing the same operations on a large number of elements, then SSE doesn't help. For example, if you had one double and needed to do foo = (foo * K + L)/M + N; then SSE/SIMD isn't going to help.
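To make the first (array) case concrete, a minimal SSE2 sketch of such a loop might look like the following. The function and parameter names are invented for illustration, and it assumes the array is 16-byte aligned with an even length:

#include <emmintrin.h>   /* SSE2 intrinsics for double precision */

void transform_array(double *array, int n, double K, double L, double M, double N)
{
    const __m128d k  = _mm_set1_pd(K);
    const __m128d l  = _mm_set1_pd(L);
    const __m128d m  = _mm_set1_pd(M);
    const __m128d nn = _mm_set1_pd(N);

    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_load_pd(array + i);      /* array[i], array[i+1] */
        v = _mm_add_pd(_mm_mul_pd(v, k), l);     /* * K + L              */
        v = _mm_add_pd(_mm_div_pd(v, m), nn);    /* / M + N              */
        _mm_store_pd(array + i, v);
    }
}

Every intrinsic in the loop body processes two array elements at once, which is exactly the situation SSE was designed for.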

Basically, SSE is the wrong tool for the job. You need to change the job into something where SSE is the right tool. For example, rather than multiplying one vector by one quaternion; try multiplying an array of 1000 vectors by a quaternion, or perhaps multiplying an array of 1000 vectors by an array of 1000 quaternions.

EDIT: Added everything below here!

Note that this typically means modifying data structures to suit. For example, rather than having an array of structures it's often better to have a structure of arrays.
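In C, that change might look roughly like this (the types and names are just illustrative):

/* Array of structures (AoS): the components of consecutive vectors are
   interleaved, so one SSE load cannot fetch x[i] and x[i+1] together. */
typedef struct { double x, y, z; } vec3_t;
vec3_t vectors[1000];

/* Structure of arrays (SoA): each component is contiguous in memory, so
   two consecutive x values can be loaded into one SSE register. */
typedef struct {
    double x[1000];
    double y[1000];
    double z[1000];
} vec3_soa_t;
vec3_soa_t vectors_soa;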

For a better example, imagine your code uses an array of quaternions, like this:

for(i = 0; i < quaternionCount; i++) {
   cross_temp[i][0] = a[i][2] * b[i][2] - a[i][3] * b[i][1] + a[i][0] * b[i][0];
   cross_temp[i][1] = a[i][3] * b[i][0] - a[i][1] * b[i][2] + a[i][0] * b[i][1];
   cross_temp[i][2] = a[i][1] * b[i][1] - a[i][2] * b[i][0] + a[i][0] * b[i][2];

   b[i][0] = b[i][0] + 2.0 * (a[i][2] * cross_temp[i][2] - a[i][3] * cross_temp[i][1]);
   b[i][1] = b[i][1] + 2.0 * (a[i][3] * cross_temp[i][0] - a[i][1] * cross_temp[i][2]);
   b[i][2] = b[i][2] + 2.0 * (a[i][1] * cross_temp[i][1] - a[i][2] * cross_temp[i][0]);
}

The first step would be to transform it into a quaternion of arrays, and do this:

for(i = 0; i < quaternionCount; i++) {
   cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
   cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
   cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];

   b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
   b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
   b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
}

Then, because 2 adjacent doubles fit into a single SSE register you want to unroll the loop by 2:

for(i = 0; i < quaternionCount; i += 2) {
   cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1] - a[3][i+1] * b[1][i+1] + a[0][i+1] * b[0][i+1];
   cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
   cross_temp[1][i+1] = a[3][i+1] * b[0][i+1] - a[1][i+1] * b[2][i+1] + a[0][i+1] * b[1][i+1];
   cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
   cross_temp[2][i+1] = a[1][i+1] * b[1][i+1] - a[2][i+1] * b[0][i+1] + a[0][i+1] * b[2][i+1];

   b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
   b[0][i+1] = b[0][i+1] + 2.0 * (a[2][i+1] * cross_temp[2][i+1] - a[3][i+1] * cross_temp[1][i+1]);
   b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
   b[1][i+1] = b[1][i+1] + 2.0 * (a[3][i+1] * cross_temp[0][i+1] - a[1][i+1] * cross_temp[2][i+1]);
   b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
   b[2][i+1] = b[2][i+1] + 2.0 * (a[1][i+1] * cross_temp[1][i+1] - a[2][i+1] * cross_temp[0][i+1]);
}

Now, you want to break this down into individual operations. For example, the first 2 lines of the inner loop would become:

   cross_temp[0][i] = a[2][i] * b[2][i];
   cross_temp[0][i] -= a[3][i] * b[1][i];
   cross_temp[0][i] += a[0][i] * b[0][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
   cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
   cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];

Now re-order that:

   cross_temp[0][i] = a[2][i] * b[2][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];

   cross_temp[0][i] -= a[3][i] * b[1][i];
   cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];

   cross_temp[0][i] += a[0][i] * b[0][i];
   cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];

Once you've done all of this, think about converting to SSE. The first 2 lines of code become one load (loading both a[2][i] and a[2][i+1] into an SSE register) followed by one multiplication, rather than 2 separate loads and 2 separate multiplications. Those 6 lines might become (pseudo-code):

   load SSE_register1 with both a[2][i] and a[2][i+1]
   multiply SSE_register1 with both b[2][i] and b[2][i+1]

   load SSE_register2 with both a[3][i] and a[3][i+1]
   multiply SSE_register2 with both b[1][i] and b[1][i+1]

   load SSE_register3 with both a[0][i] and a[0][i+1]
   multiply SSE_register3 with both b[0][i] and b[0][i+1]

   SSE_register1 = SSE_register1 - SSE_register2
   SSE_register1 = SSE_register1 + SSE_register3

Each line of pseudo-code here is a single SSE instruction/intrinsic; and every SSE instruction/intrinsic is doing 2 operations in parallel.
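Written with actual SSE2 intrinsics, that pseudo-code could look like the sketch below (the variable names are invented, and it assumes the SoA arrays of double are 16-byte aligned):

__m128d r1 = _mm_load_pd(&a[2][i]);            /* a[2][i] and a[2][i+1]        */
r1 = _mm_mul_pd(r1, _mm_load_pd(&b[2][i]));    /* * b[2][i] and b[2][i+1]      */

__m128d r2 = _mm_load_pd(&a[3][i]);            /* a[3][i] and a[3][i+1]        */
r2 = _mm_mul_pd(r2, _mm_load_pd(&b[1][i]));    /* * b[1][i] and b[1][i+1]      */

__m128d r3 = _mm_load_pd(&a[0][i]);            /* a[0][i] and a[0][i+1]        */
r3 = _mm_mul_pd(r3, _mm_load_pd(&b[0][i]));    /* * b[0][i] and b[0][i+1]      */

r1 = _mm_sub_pd(r1, r2);                       /* two subtractions in parallel */
r1 = _mm_add_pd(r1, r3);                       /* two additions in parallel    */
_mm_store_pd(&cross_temp[0][i], r1);           /* cross_temp[0][i] and [i+1]   */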

If every instruction does 2 operations in parallel, then (in theory) it could be up to twice as fast as the original "one operation per instruction" code.

answered 2013-01-19 at 20:58

A few ideas that might let the compiler optimize the code fully:

  • The functions should be inlined.
  • You should add restrict qualifications to the parameters of cross_p so that the compiler does not reload them several times.
  • In that case you should introduce a fourth __m128d variable to receive the result of the second call to cross_p.

Then inspect the assembler output (gcc option -S) to see what all of this produces.
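A rough sketch of how those suggestions could be applied to the code from the question (untested; quat_t, REAL and the intrinsics header are as in the question):

static inline void cross_p(const __m128d * restrict a, const __m128d * restrict b,
                           __m128d * restrict c){
    const __m128d SIGN_NP = _mm_set_pd(0.0, -0.0);

    __m128d l1 = _mm_mul_pd( _mm_unpacklo_pd(a[1], a[1]), b[0] );
    __m128d l2 = _mm_mul_pd( _mm_unpacklo_pd(b[1], b[1]), a[0] );
    __m128d m1 = _mm_sub_pd(l1, l2);
            m1 = _mm_shuffle_pd(m1, m1, 1);
            m1 = _mm_xor_pd(m1, SIGN_NP);

            l1 = _mm_mul_pd( a[0], _mm_shuffle_pd(b[0], b[0], 1) );
    __m128d m2 = _mm_sub_sd(l1, _mm_unpackhi_pd(l1, l1));

    c[0] = m1;
    c[1] = m2;
}

void quat_rotSSE(quat_t a, REAL* restrict b){
    __m128d aa[2], bb[2], axb[2], axaxb[2];   /* axaxb receives the 2nd cross product */

    aa[0] = _mm_load_pd(a.el+1);
    aa[1] = _mm_load_sd(a.el+3);
    bb[0] = _mm_load_pd(b);
    bb[1] = _mm_load_sd(b+2);

    cross_p(aa, bb, axb);

    __m128d w = _mm_set1_pd(a.el[0]);

    axb[0] = _mm_add_pd(axb[0], _mm_mul_pd(w, bb[0]));
    axb[1] = _mm_add_sd(axb[1], _mm_mul_sd(w, bb[1]));

    cross_p(aa, axb, axaxb);   /* output no longer aliases an input */

    _mm_store_pd(b,   _mm_add_pd(bb[0], _mm_add_pd(axaxb[0], axaxb[0])));
    _mm_store_sd(b+2, _mm_add_pd(bb[1], _mm_add_sd(axaxb[1], axaxb[1])));
}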

answered 2013-01-19 at 17:12