c++ - Why are 8 threads slower than 2 threads?

Question

First of all, I apologize for my poor English. I am currently learning hardware transactional memory, and I am using TBB's spin_rw_mutex.h to implement transaction blocks in C++. speculative_spin_rw_mutex is a class in spin_rw_mutex.h; it is a mutex that already implements the RTM interface of Intel TSX.

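The example I used to test RTM is very simple. I created an Account class and randomly transfer money from one account to another. All accounts are stored in an accounts array of size 100. The random functions come from Boost (I think the STL has equivalent ones). The transfer function is protected by a speculative_spin_rw_mutex. I used tbb::parallel_for and tbb::task_scheduler_init to control concurrency. All transfer calls are made inside the parallel_for lambda. The total number of transfers is 1 million. The strange thing is that the program is fastest (8 seconds) when task_scheduler_init is set to 2, even though my CPU is an i7 6700K with 8 hardware threads. For values from 8 to 50,000, the program's performance barely changes (11 to 12 seconds). When I increase task_scheduler_init to 100,000, the run time rises to about 18 seconds. I tried to analyze the program with a profiler and found that the hotspot function is the mutex. However, I don't think the transaction rollback rate is that high. I don't understand why the program is slow.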

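Someone said that false sharing degrades performance, so I tried using

std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE,Account(1000));

to replace the original array

Account* accounts[AccountsSIZE];

in order to avoid false sharing. Nothing seems to have changed. Here is my new code.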



#include <tbb/spin_rw_mutex.h>
#include <iostream>
#include "tbb/task_scheduler_init.h"  
#include "tbb/task.h"
#include "boost/random.hpp"
#include <ctime>
#include <stdexcept>  // std::invalid_argument, thrown in transfer()
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>
#include <tbb/cache_aligned_allocator.h>
#include <vector>
using namespace tbb;
tbb::speculative_spin_rw_mutex mu;

class Account {
private:
    int balance;
public:
    Account(int ba) {
        balance = ba;
    }
    int getBalance() {
        return balance;
    }
    void setBalance(int ba) {
        balance = ba;
    }
};

//Transfer function. Uses the global speculative_spin_rw_mutex to guard the critical section
void transfer(Account &from, Account &to, int amount) {
    speculative_spin_rw_mutex::scoped_lock lock(mu);
    if ((from.getBalance())<amount)
    {
        throw std::invalid_argument("Illegal amount!");
    }
    else {
        from.setBalance((from.getBalance()) - amount);
        to.setBalance((to.getBalance()) + amount);
    }
}

const int AccountsSIZE = 100;

//Random number generator and distributions
boost::random::mt19937 gener(time(0));
boost::random::uniform_int_distribution<> distIndex(0, AccountsSIZE - 1);
boost::random::uniform_int_distribution<> distAmount(1, 1000);
/*
Function of transfer money
*/
void all_transfer_task() {
    task_scheduler_init init(10000);//Set the maximum number of threads the TBB scheduler may use
    /*
    Initial accounts, using cache_aligned_allocator to avoid false sharing
    */
    std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE,Account(1000));

    const int TransferTIMES = 10000000;
    //All transfer tasks
    parallel_for(0, TransferTIMES, 1, [&](int i) {

        try {
            transfer(cache_aligned_accounts[distIndex(gener)], cache_aligned_accounts[distIndex(gener)], distAmount(gener));
        }
        catch (const std::exception& e)
        {
            //cerr << e.what() << endl;
        }
        //std::cout << distIndex(gener) << std::endl;
    });

    std::cout << cache_aligned_accounts[0].getBalance() << std::endl;

    int total_balance = 0;
    for (size_t i = 0; i < AccountsSIZE; i++)
    {
        total_balance += (cache_aligned_accounts[i].getBalance());
    }
    std::cout << total_balance << std::endl;
}

score 2 · Accepted Answer

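Since Intel TSX works at cache-line granularity, false sharing is definitely the first thing to look at. Unfortunately, cache_aligned_allocator probably does not do what you expect: it aligns the std::vector's storage as a whole, whereas you need each individual Account to occupy a whole cache line to prevent false sharing.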
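To illustrate the point about per-object padding, here is a minimal sketch of an account type that fills a whole cache line. The name PaddedAccount, the constant kCacheLineSize, and the helper stride_in_bytes are made up for this example, and the 64-byte line size is an assumption that holds for current Intel desktop CPUs but should be checked for your target. Note that std::allocator only honors over-aligned types since C++17; with an older standard, combining this type with tbb::cache_aligned_allocator from the question would be one way to get an aligned start address.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical padded variant of the question's Account class.
// alignas forces every object onto a cache-line boundary and rounds
// sizeof up to a multiple of the alignment, so two accounts in an
// array or vector never share a cache line.
class alignas(kCacheLineSize) PaddedAccount {
    int balance;
public:
    explicit PaddedAccount(int ba) : balance(ba) {}
    int getBalance() const { return balance; }
    void setBalance(int ba) { balance = ba; }
};

static_assert(alignof(PaddedAccount) == kCacheLineSize,
              "each account must start on a cache-line boundary");
static_assert(sizeof(PaddedAccount) == kCacheLineSize,
              "each account must fill exactly one cache line");

// Byte distance between two neighbouring elements of the vector.
inline std::size_t stride_in_bytes(const std::vector<PaddedAccount>& v) {
    assert(v.size() >= 2);
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&v[1]) -
        reinterpret_cast<const char*>(&v[0]));
}
```

With this type, std::vector<PaddedAccount> accounts(100, PaddedAccount(1000)); places consecutive accounts 64 bytes apart. The cost is memory: 100 accounts take 6400 bytes instead of 400.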

score 1 · Accepted Answer

I cannot reproduce your benchmark, but I see two potential causes for this behavior:

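"Too many cooks spoil the broth": you use a single spin_rw_mutex that is locked by every transfer on every thread. As far as I can tell, your transfers effectively execute sequentially. That would explain why the profiler sees a hotspot there. Intel's documentation warns about performance degradation in such cases.
Throughput vs. speed: on an i7, in a couple of benchmarks, I noticed that when more cores are in use, each core runs a little slower, so a fixed-size loop takes longer overall. However, if you count the overall throughput (i.e. the total number of transactions performed across all these parallel loops), it is much higher (although not fully proportional to the number of cores).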

I lean toward the first cause, but the second cannot be ruled out.
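If the single global mutex is indeed the bottleneck, one way to test that hypothesis is to give each account its own lock so that transfers touching different accounts no longer serialize. The sketch below deliberately swaps in plain std::mutex instead of TBB's speculative mutex, so it does not exercise TSX at all; the names Bank, transfer_locked, and total_balance are hypothetical and not part of the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical container: one balance and one mutex per account.
struct Bank {
    std::vector<int> balances;
    std::vector<std::mutex> locks;  // locks[i] guards balances[i]
    Bank(std::size_t n, int initial) : balances(n, initial), locks(n) {}
};

// Transfer that locks only the two accounts involved.
void transfer_locked(Bank& bank, std::size_t from, std::size_t to, int amount) {
    if (from == to) {
        return;  // nothing to move; locking the same mutex twice is undefined
    }
    // std::lock acquires both mutexes with a deadlock-avoidance algorithm,
    // so concurrent A->B and B->A transfers cannot deadlock.
    std::lock(bank.locks[from], bank.locks[to]);
    std::lock_guard<std::mutex> guard_from(bank.locks[from], std::adopt_lock);
    std::lock_guard<std::mutex> guard_to(bank.locks[to], std::adopt_lock);

    if (bank.balances[from] < amount) {
        throw std::invalid_argument("Illegal amount!");
    }
    bank.balances[from] -= amount;
    bank.balances[to] += amount;
}

// Sum of all balances; call only while no transfers are running.
int total_balance(const Bank& bank) {
    return std::accumulate(bank.balances.begin(), bank.balances.end(), 0);
}
```

With 100 accounts and random pairs, most concurrent transfers would touch disjoint accounts and could proceed in parallel. The balances in this sketch are still packed next to each other, so the false-sharing concern raised in the other answer applies independently of the locking scheme.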
