cuda - Error in the function cuMemcpyHtoD in jCUDA

Question

I am new to Java programming and am trying to code a matrix multiplication program in jCUDA.

To transfer data from the host to the device and back, I use:

cuMemcpyHtoD(devMatrixA, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);
cuMemcpyHtoD(devMatrixB, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);
cuMemcpyDtoH(Pointer.to(hostMatrixC), devMatrixC, numRows * numCols * Sizeof.FLOAT);

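Here, devMatrixA, devMatrixB, and devMatrixC are the matrices stored in device memory, and hostMatrixA, hostMatrixB, and hostMatrixC are the matrices stored in host memory.

When I call the functions above to transfer the data, I get the error "The method to(byte[]) in the type Pointer is not applicable for the arguments (float[][])". (The Pointer.to( call is underlined in red; I am using Eclipse.) The complete code is given below.

Please excuse my limited Java knowledge, and let me know if I am heading in the wrong direction.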

package JCudaMatrixAddition;
import static jcuda.driver.JCudaDriver.*;

import java.io.*;

import jcuda.*;
import jcuda.driver.*;
import jcuda.Pointer;
import jcuda.Sizeof;


public class JCudaMatrixAddition {
    public static void main(String[] args) throws IOException 
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);

        // Create the PTX file by calling the NVCC
        String ptxFilename = preparePtxFile("JCudaMatrixAdditionKernel.cu");

        //Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet (device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        //Load PTX file
        CUmodule module = new CUmodule();
        cuModuleLoad(module,ptxFilename);

        //Obtain a function pointer to the Add function
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "add");

        int numRows = 32;
        int numCols = 32;

        //Allocate and fill Host input Matrices:
        float hostMatrixA[][] = new float[numRows][numCols];
        float hostMatrixB[][] = new float[numRows][numCols];
        float hostMatrixC[][] = new float[numRows][numCols];


        for(int i = 0; i<numRows; i++)

        {
            for(int j = 0; j<numCols; j++)
            {
                hostMatrixA[i][j] = (float) 1.0;
                hostMatrixB[i][j] = (float) 1.0;
            }
        }
        // Allocate the device input data, and copy the
        // host input data to the device
        CUdeviceptr devMatrixA = new CUdeviceptr();
        cuMemAlloc(devMatrixA, numRows * numCols * Sizeof.FLOAT);

        //This is the part where it gives me the error
        cuMemcpyHtoD(devMatrixA, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);

        CUdeviceptr devMatrixB = new CUdeviceptr();
        cuMemAlloc(devMatrixB, numRows * numCols * Sizeof.FLOAT);

        //This is the part where it gives me the error
        cuMemcpyHtoD(devMatrixB, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);

        //Allocate device matrix C to store output
        CUdeviceptr devMatrixC = new CUdeviceptr();
        cuMemAlloc(devMatrixC, numRows * numCols * Sizeof.FLOAT);

        // Set up the kernel parameters: A pointer to an array
        // of pointers which point to the actual values.

        Pointer kernelParameters = Pointer.to(Pointer.to(new int[]{numRows}),
                                   Pointer.to(new int[]{numRows}), 
                                   Pointer.to(devMatrixA),
                                   Pointer.to(devMatrixB),
                                   Pointer.to(devMatrixC));

        //Kernel thread configuration
        int blockSize = 32;
        int gridSize = 1;

        cuLaunchKernel(function, 
                       gridSize, 1, 1,
                       blockSize, 32, 1,
                       0, null, kernelParameters, null);

        cuCtxSynchronize();
        // Allocate host output memory and copy the device output
        // to the host.

        //This is the part where it gives me the error
        cuMemcpyDtoH(Pointer.to(hostMatrixC), devMatrixC, numRows * numCols * Sizeof.FLOAT);

        //verify the result
        for (int i =0; i<numRows; i++)
        {
            for (int j =0; j<numRows; j++)
            {
                System.out.print("   "+ hostMatrixB[i][j]);
            }
            System.out.println("");
        }
        cuMemFree(devMatrixA);
        cuMemFree(devMatrixB);
        cuMemFree(devMatrixC);

    }

score 2 · Accepted Answer

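You cannot copy a float[][] array directly from the host to the device.

When you create a float[][] array, it is not one large array of float values. Instead, it is an array of arrays. Consider that you can even create an array like this: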

float array[][] = new float[3][];
array[0] = new float[42];
array[1] = null;
array[2] = new float[1234];

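This is simply not a contiguous block of memory, so such an array cannot be copied to the device.

When working with matrices in CUDA (not only in JCuda, but in CUDA in general), they are usually represented as one-dimensional arrays. So in this case, you could declare your matrix as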

float hostMatrixA[] = new float[numRows*numCols];

To access the matrix elements, you have to compute the appropriate index:

int row = ...;
int col = ...;
hostMatrixA[col+row*numCols] = 123.0f; // Row-major

// Or
hostMatrixA[row+col*numRows] = 123.0f; // Column-major

The difference between the two assignments is that one assumes row-major order and the other assumes column-major order. See the Wikipedia article on row-major order for details.
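As a minimal sketch of the idea above, an existing float[][] could be packed into one contiguous float[] before the copy and unpacked again afterwards. The class and method names (FlatMatrix, flattenRowMajor, unflattenRowMajor) are made up for illustration and are not part of JCuda. The sketch assumes a rectangular matrix stored in row-major order.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JCuda: packs a rectangular float[][]
// into one contiguous float[] (row-major) and unpacks it again.
final class FlatMatrix {

    // Copies each row into its slot of a single contiguous array.
    static float[] flattenRowMajor(float[][] matrix) {
        int numRows = matrix.length;
        int numCols = (numRows == 0 || matrix[0] == null) ? 0 : matrix[0].length;
        float[] flat = new float[numRows * numCols];
        for (int row = 0; row < numRows; row++) {
            // Null or ragged rows have no flat equivalent, so reject them.
            if (matrix[row] == null || matrix[row].length != numCols) {
                throw new IllegalArgumentException(
                        "Row " + row + " is null or has the wrong length");
            }
            System.arraycopy(matrix[row], 0, flat, row * numCols, numCols);
        }
        return flat;
    }

    // Rebuilds a float[][] from a row-major flat array, e.g. after copying
    // a result back from the device.
    static float[][] unflattenRowMajor(float[] flat, int numRows, int numCols) {
        if (flat.length != numRows * numCols) {
            throw new IllegalArgumentException("Length does not match numRows * numCols");
        }
        float[][] matrix = new float[numRows][numCols];
        for (int row = 0; row < numRows; row++) {
            System.arraycopy(flat, row * numCols, matrix[row], 0, numCols);
        }
        return matrix;
    }

    public static void main(String[] args) {
        float[][] hostMatrixA = {{1f, 2f, 3f}, {4f, 5f, 6f}};
        int numCols = 3;
        float[] flat = flattenRowMajor(hostMatrixA);
        System.out.println(Arrays.toString(flat));
        // Element (row 1, col 2) sits at index col + row * numCols.
        System.out.println(flat[2 + 1 * numCols]);
    }
}
```

A float[] produced this way is the kind of argument that Pointer.to(float[]) accepts, so it could stand in for the float[][] in the cuMemcpyHtoD and cuMemcpyDtoH calls from the question.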

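Some side notes:

CUDA matrix libraries such as CUBLAS use column-major ordering, so it may be a good idea to follow the same convention, particularly if you later want to use CUBLAS/JCublas functions. For example, the cublasSgeam function already offers the functionality to perform a matrix addition.

If you only want to do a matrix addition, you will not see a speedup from using CUDA/JCuda. I wrote a summary of this in this answer.

By the way: technically, it is possible to use "2D arrays". The JCudaDriverSample shows how this can be done. However, it is rather inconvenient and not recommended for matrix operations.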
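To illustrate the column-major convention mentioned in the side notes, here is a small sketch that stores a matrix column by column in a flat array. The names ColumnMajorMatrix, index, and fromRows are hypothetical and do not come from JCuda or JCublas. The input is assumed to be rectangular.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JCuda or JCublas: stores a matrix
// column by column, the layout that CUBLAS-style libraries expect.
final class ColumnMajorMatrix {

    // Flat index of element (row, col) in column-major order.
    static int index(int row, int col, int numRows) {
        return row + col * numRows;
    }

    // Converts a rectangular float[][] (assumed: all rows the same length)
    // into a flat column-major array.
    static float[] fromRows(float[][] rows) {
        int numRows = rows.length;
        int numCols = numRows == 0 ? 0 : rows[0].length;
        float[] flat = new float[numRows * numCols];
        for (int row = 0; row < numRows; row++) {
            for (int col = 0; col < numCols; col++) {
                flat[index(row, col, numRows)] = rows[row][col];
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        float[][] rows = {{1f, 2f, 3f}, {4f, 5f, 6f}};
        // The two elements of each column end up next to each other.
        System.out.println(Arrays.toString(fromRows(rows)));
    }
}
```

Compared with the row-major layout, only the index formula changes, so host code and kernel need to agree on one convention.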
