memory - CUDA デバイスのメモリトランザクションが必要

Question

グローバルメモリから共有メモリへの転送トランザクションを理解するために、小さなcudaコードを書きました。コードは次のとおりです。

#include <iostream>
using namespace std;

__global__ void readUChar4(uchar4* c, uchar4* o){
  extern __shared__ uchar4 gc[];
  int tid = threadIdx.x;
  gc[tid] = c[tid];
  o[tid] = gc[tid];
}

int main(){
  string a = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
  uchar4* c;
  cudaError_t e1 = cudaMalloc((void**)&c, 128*sizeof(uchar4));
  if(e1==cudaSuccess){
    uchar4* o;
    cudaError_t e11 = cudaMalloc((void**)&o, 128*sizeof(uchar4));

    if(e11 == cudaSuccess){
      cudaError_t e2 = cudaMemcpy(c, a.c_str(), 128*sizeof(uchar4), cudaMemcpyHostToDevice);
      if(e2 == cudaSuccess){
        readUChar4<<<1,128, 128*sizeof(uchar4)>>>(c, o);
        uchar4* oFromGPU = (uchar4*)malloc(128*sizeof(uchar4));
        cudaError_t e22 = cudaMemcpy(oFromGPU, o, 128*sizeof(uchar4), cudaMemcpyDeviceToHost);
        if(e22 == cudaSuccess){
          for(int i =0; i < 128; i++){
            cout << oFromGPU[i].x << " ";
            cout << oFromGPU[i].y << " ";
            cout << oFromGPU[i].z << " ";
            cout << oFromGPU[i].w << " " << endl;

          }
        }
        else{
          cout << "Failed to copy from GPU" << endl;
        }
      }
      else{
        cout << "Failed to copy" << endl;
      }
    }
    else{
      cout << "Failed to allocate output memory" << endl;
    }
  }
  else{
    cout << "Failed to allocate memory" << endl;
  }
  return 0;
}

このコードは、デバイスメモリから共有メモリにデータをコピーし、デバイスメモリに戻すだけです。次の 3 つの質問があります。

この場合、デバイスメモリから共有メモリへの転送は、4 つのメモリトランザクションを実行することが保証されていますか? cudaMalloc がメモリを割り当てる方法に依存すると思います。データがメモリ上に散らばるようにメモリが無計画に割り当てられた場合、4 つ以上のメモリトランザクションが必要になります。ただし、cudaMalloc が 128 バイトのチャンクでメモリを割り当てる場合、または連続してメモリを割り当てる場合、4 つを超えるメモリトランザクションは必要ありません。
上記のロジックは、共有メモリからデバイスメモリへのデータの書き込みにも当てはまりますか。つまり、転送は 4 つのメモリトランザクションで完了します。
このコードは銀行の競合を引き起こしますか? スレッドに順番に ID が割り当てられている場合、このコードによってバンクの競合が発生することはないと思います。ただし、スレッド 32 と 64 が同じワープで実行されるようにスケジュールされている場合、このコードによってバンクの競合が発生する可能性があります。

score 2 · Accepted Answer

In the code you provided (repeated here) the compiler will completely remove the shared memory store and load since they don't do anything necessary or beneficial for the code.

 __global__ void readUChar4(uchar4* c, uchar4* o){
  extern __shared__ uchar4 gc[];
  int tid = threadIdx.x;
  gc[tid] = c[tid];
  o[tid] = gc[tid];
}

Assuming you did something with the shared memory so it was not eliminated, then:

The loads and stores from and to global memory in this code would take ONE transaction per warp (assuming Fermi or later GPU), since they are only 32-bits (uchar4 = 4*8 bits) per thread (total 128 bytes per warp). cudaMalloc allocates memory contiguously.
The answer from 1. applies to stores also, yes.
There are no bank conflicts in this code. Threads in a warp are always contiguous, with the first thread a multiple of the warp size. So threads 32 and 64 will never be in the same warp. And since you are loading and storing a 32-bit data type, and the banks are 32 bits wide, there are no conflicts.

memory - CUDA デバイスのメモリ トランザクションが必要

1 に答える 1

Related

Reference

memory - CUDA デバイスのメモリトランザクションが必要