c++ - SIMD or not SIMD - cross platform

Question

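I need some ideas on how to write a cross-platform C++ implementation of a few parallelizable problems in a way that takes advantage of SIMD (SSE, SPU, etc.) where it is available. I also want to be able to switch between SIMD and non-SIMD at run time.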

How would you suggest I approach this problem? (Of course, I don't want to implement the problem multiple times, once for every possible option.)

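I can see that this might not be an easy task in C++, but I believe I'm missing something. So far my idea looks like this... A class cStream will be an array of a single field. Using multiple cStreams I can achieve SoA (Structure of Arrays). Then, using a few functors, I can fake the lambda function that I need executed over the whole cStream.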

// just for example I'm not expecting this code to compile
cStream a; // something like float[1024]
cStream b;
cStream c;

void Foo()
{
    for_each(
        AssignSIMD(c, MulSIMD(AddSIMD(a, b), a)));
}

Here for_each would be responsible for incrementing the current pointer of each stream, as well as for inlining the functor's body in both its SIMD and non-SIMD forms.

Something like this:

// just for example I'm not expecting this code to compile
for_each(functor<T> f)
{
#ifdef USE_SIMD
    if (simdEnabled)
        real_for_each(f<true>()); // true means use SIMD
    else
#endif
        real_for_each(f<false>());
}

Note that whether SIMD is enabled is checked only once, and that the loop wraps the main functor.
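The dispatch idea above can be made to compile roughly as follows. This is only a sketch under stated assumptions: `MulAddKernel`, `run_kernel`, and `g_simdEnabled` are hypothetical names, the SSE path is compiled only when the compiler advertises SSE support, and unaligned loads are used so that plain arrays work without special allocation.

```cpp
#include <cstddef>

// Compile the SSE path only where the compiler says SSE is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define USE_SIMD 1
#endif

// Hypothetical run-time switch; a real program would set it from CPU detection.
static bool g_simdEnabled = true;

// Generic (scalar) kernel: c = (a + b) * a, one element at a time.
template <bool SIMD>
struct MulAddKernel
{
    static void run(const float* a, const float* b, float* c, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = (a[i] + b[i]) * a[i];
    }
};

#ifdef USE_SIMD
// SSE specialization: four elements per iteration, then a scalar tail.
template <>
struct MulAddKernel<true>
{
    static void run(const float* a, const float* b, float* c, std::size_t n)
    {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
        {
            const __m128 va = _mm_loadu_ps(a + i);
            const __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_mul_ps(_mm_add_ps(va, vb), va));
        }
        for (; i < n; ++i)
            c[i] = (a[i] + b[i]) * a[i];
    }
};
#endif

// The run-time check happens once, outside the loop.
void run_kernel(const float* a, const float* b, float* c, std::size_t n)
{
#ifdef USE_SIMD
    if (g_simdEnabled)
    {
        MulAddKernel<true>::run(a, b, c, n);
        return;
    }
#endif
    MulAddKernel<false>::run(a, b, c, n);
}
```

Calling `run_kernel(a, b, c, n)` with `g_simdEnabled` set to either value should give the same results; only the loop body that gets executed differs.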

score 3 · Accepted Answer

For ideas in this area, I recommend looking at the source of the MacSTL library: www.pixelglow.com/macstl/

score 3 · Accepted Answer

If anyone is interested, this is the dirty code I came up with to test a new idea I had while reading about the library Paul posted.

Thanks, Paul!

// This is just a conceptual test.
// I haven't profiled the code and I haven't verified that the result is correct.
#include <xmmintrin.h>


// This class is doing all the math
template <bool SIMD>
class cStreamF32
{
private:
    void*       m_data;
    void*       m_dataEnd;
    __m128*     m_current128;
    float*      m_current32;

public:
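    cStreamF32(int size)
        : m_data(0), m_dataEnd(0), m_current128(0), m_current32(0)
    {
        // __m128 loads/stores need 16-byte alignment
        // (size is assumed to be a multiple of 4 in the SIMD case)
        if (SIMD)
            m_data = _mm_malloc(sizeof(float) * size, 16);
        else
            m_data = new float[size];
        // Next() compares against this; it was never set before
        m_dataEnd = (char*)m_data + sizeof(float) * size;
    }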
    ~cStreamF32()
    {
        if (SIMD)
            _mm_free(m_data);
        else
            delete[] (float*)m_data;
    }

    inline void Begin()
    {
        if (SIMD)
            m_current128 = (__m128*)m_data;
        else
            m_current32 = (float*)m_data;
    }

    inline bool Next()
    {
        if (SIMD)
        {
            m_current128++;
            return m_current128 < m_dataEnd;
        }
        else
        {
            m_current32++;
            return m_current32 < m_dataEnd;
        }
    }

    inline void operator=(const __m128 x)
    {
        *m_current128 = x;
    }
    inline void operator=(const float x)
    {
        *m_current32 = x;
    }

    inline __m128 operator+(const cStreamF32<true>& x)
    {
        return _mm_add_ps(*m_current128, *x.m_current128); // packed: all four lanes
    }
    inline float operator+(const cStreamF32<false>& x)
    {
        return *m_current32 + *x.m_current32;
    }

    inline __m128 operator+(const __m128 x)
    {
        return _mm_add_ps(*m_current128, x); // packed: all four lanes
    }
    inline float operator+(const float x)
    {
        return *m_current32 + x;
    }

    inline __m128 operator*(const cStreamF32<true>& x)
    {
        return _mm_mul_ps(*m_current128, *x.m_current128); // packed: all four lanes
    }
    inline float operator*(const cStreamF32<false>& x)
    {
        return *m_current32 * *x.m_current32;
    }

    inline __m128 operator*(const __m128 x)
    {
        return _mm_mul_ps(*m_current128, x); // packed: all four lanes
    }
    inline float operator*(const float x)
    {
        return *m_current32 * x;
    }
};

// Executes both functors
template<class T1, class T2>
void Execute(T1& functor1, T2& functor2)
{
    functor1.Begin();
    do
    {
        functor1.Exec();
    }
    while (functor1.Next());

    functor2.Begin();
    do
    {
        functor2.Exec();
    }
    while (functor2.Next());
}

// This is the implementation of the problem
template <bool SIMD>
class cTestFunctor
{
private:
    cStreamF32<SIMD> a;
    cStreamF32<SIMD> b;
    cStreamF32<SIMD> c;

public:
    cTestFunctor() : a(1024), b(1024), c(1024) { }

    inline void Exec()
    {
        c = a + b * a;
    }

    inline void Begin()
    {
        a.Begin();
        b.Begin();
        c.Begin();
    }

    inline bool Next()
    {
        a.Next();
        b.Next();
        return c.Next();
    }
};


int main (int argc, char * const argv[]) 
{
    cTestFunctor<true> functor1;
    cTestFunctor<false> functor2;

    Execute(functor1, functor2);

    return 0;
}

score 2 · Accepted Answer

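The most impressive approach to SIMD scaling that I have seen is the RTFact ray-tracing framework: slides, paper. Well worth a look. The researchers are closely associated with Intel (Saarbrücken now hosts the Intel Visual Computing Institute), so you can be sure that forward scaling onto AVX and Larrabee was on their minds.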

Intel's Ct "data parallelism" template library looks quite promising, too.

score 2 · Accepted Answer

You might want to take a glance at my attempt at SIMD/non-SIMD:

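vrep, a templated base class with specializations for SIMD (note how it distinguishes between SSE, which is floats only, and SSE2, which introduced integer vectors).
More useful v4f, v4i, etc. classes (subclassed via an intermediate v4).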

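Of course, it is aimed far more at 4-element vectors for rgba/xyz-type calculations than at SoA, so it will completely run out of steam when 8-way AVX comes along, but the general principles might be useful.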
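The linked classes are not reproduced here, but the general principle (a representation template that is specialized when SIMD is available, wrapped by a more convenient 4-float type) might look something like the sketch below. `Rep4f` and `Vec4f` are hypothetical names invented for illustration; they are not the author's vrep or v4f.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#endif

// Generic representation: four plain floats, processed one by one.
template <bool SIMD>
struct Rep4f
{
    float v[4];
    static Rep4f load(const float* p)
    {
        Rep4f r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v[i]; }
    static Rep4f add(const Rep4f& a, const Rep4f& b)
    {
        Rep4f r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
    static Rep4f mul(const Rep4f& a, const Rep4f& b)
    {
        Rep4f r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }
};

#ifdef HAVE_SSE
// SSE specialization: the same interface backed by one __m128 register.
template <>
struct Rep4f<true>
{
    __m128 v;
    static Rep4f load(const float* p) { Rep4f r; r.v = _mm_loadu_ps(p); return r; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    static Rep4f add(const Rep4f& a, const Rep4f& b) { Rep4f r; r.v = _mm_add_ps(a.v, b.v); return r; }
    static Rep4f mul(const Rep4f& a, const Rep4f& b) { Rep4f r; r.v = _mm_mul_ps(a.v, b.v); return r; }
};
#endif

// Convenience wrapper: operators are written once and forward to the representation.
template <bool SIMD>
class Vec4f
{
    Rep4f<SIMD> rep;
    explicit Vec4f(const Rep4f<SIMD>& r) : rep(r) {}

public:
    explicit Vec4f(const float* p) : rep(Rep4f<SIMD>::load(p)) {}
    void store(float* p) const { rep.store(p); }

    friend Vec4f operator+(const Vec4f& a, const Vec4f& b)
    {
        return Vec4f(Rep4f<SIMD>::add(a.rep, b.rep));
    }
    friend Vec4f operator*(const Vec4f& a, const Vec4f& b)
    {
        return Vec4f(Rep4f<SIMD>::mul(a.rep, b.rep));
    }
};
```

With this layout, an expression such as `(Vec4f<true>(a) + Vec4f<true>(b)) * Vec4f<true>(a)` reads the same for both variants. On a compiler without SSE, `Vec4f<true>` simply falls back to the generic representation.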

score 1 · Accepted Answer

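Note that the given example decides what to execute at compile time (since you're using the preprocessor). In this case you can use more sophisticated techniques to decide what you actually want to execute, for example tag dispatching: http://cplusplus.co.il/2010/01/03/tag-dispatching/ Following the example shown there, you could have a fast implementation that uses SIMD and a slow one that does not.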
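As a rough illustration of tag dispatching applied to this problem, consider the sketch below. The tag types `scalar_tag` and `simd_tag` and the function `add_arrays` are hypothetical names, not taken from the linked article, and the SSE path is assumed to be available only when the compiler advertises it.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#endif

// Empty tag types: they carry no data and only select an overload.
struct scalar_tag {};
struct simd_tag {};

// Slow, portable overload.
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t n, scalar_tag)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Fast overload; falls back to the scalar one when SSE is not compiled in.
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t n, simd_tag)
{
#ifdef HAVE_SSE
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
#else
    add_arrays(a, b, out, n, scalar_tag());
#endif
}

// Run-time switch: one branch per call, then the overload chosen by the tag.
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t n, bool useSimd)
{
    if (useSimd)
        add_arrays(a, b, out, n, simd_tag());
    else
        add_arrays(a, b, out, n, scalar_tag());
}
```

The tag overloads are resolved at compile time, so code that already knows which variant it wants can pass `simd_tag()` directly and skip the branch entirely.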

score 0 · Accepted Answer

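Have you thought about using an existing solution such as liboil? It implements many common SIMD operations and can decide at run time whether to use SIMD or non-SIMD code (using function pointers assigned by an initialization function).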
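The mechanism described here, function pointers assigned by an initialization function, can be sketched without liboil itself. Nothing below is liboil's API: `init_kernels`, `scale_array`, and `cpu_has_sse` are hypothetical names, and the CPU check assumes the GCC/Clang builtin `__builtin_cpu_supports` on x86, falling back to the scalar path on other compilers.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#endif

typedef void (*scale_fn)(float* data, std::size_t n, float factor);

// Portable implementation.
static void scale_scalar(float* data, std::size_t n, float factor)
{
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}

#ifdef HAVE_SSE
// SSE implementation: four floats per iteration plus a scalar tail.
static void scale_sse(float* data, std::size_t n, float factor)
{
    const __m128 f = _mm_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f));
    for (; i < n; ++i)
        data[i] *= factor;
}
#endif

// Hypothetical CPU check (an assumption, not a portable API).
static bool cpu_has_sse()
{
#if defined(HAVE_SSE) && defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse") != 0;
#elif defined(HAVE_SSE) && defined(_M_X64)
    return true;  // every x86-64 CPU has SSE
#else
    return false;
#endif
}

// Callers go through this pointer; it is safe to use even before init_kernels().
static scale_fn scale_array = scale_scalar;

// Run once at startup: pick the implementation for the rest of the program.
void init_kernels(bool allowSimd)
{
    scale_array = scale_scalar;
#ifdef HAVE_SSE
    if (allowSimd && cpu_has_sse())
        scale_array = scale_sse;
#else
    (void)allowSimd;
#endif
}
```

After `init_kernels(true)`, every call to `scale_array(data, n, factor)` costs one indirect call and no per-call feature test. The trade-off is that the compiler cannot inline through the pointer, so this approach suits kernels that process whole arrays rather than single elements.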
