cuda - JCudaでネイティブポインタの構造体を作成するにはどうすればよいですか

Question

構造体のリストを受け取る CUDA カーネルがあります。

kernel<<<blockCount,blockSize>>>(MyStruct *structs);

各構造体には 3 つのポインターが含まれます。

typedef struct __align(16)__ {
    float* pointer1;
    float* pointer2;
    float* pointer3;
}

float を含む 3 つのデバイス配列があり、構造体内の各ポインターは、3 つのデバイス配列のいずれか内の float を指しています。

構造体のリストは、カーネルに送信される構造体のリストの順序に応じて、カーネルが再帰的な操作を実行できるようにするツリー/グラフ構造を表します。(このビットは C++ で動作するため、私の問題とは関係ありません)

私がやりたいことは、JCuda からポインターの構造体を送信できるようにすることです。この投稿のようにパディングされた配列にフラット化されない限り、これはネイティブに可能ではないことを理解しています。

構造体のリストを送信するときに発生する可能性のある配置とパディングに関するすべての問題を理解しています。これは本質的に、私が問題なく繰り返しているパディングされた配列です。

どうすればよいかわからないのは、フラット化された構造体バッファーにポインターを設定することです。たとえば、次のようなことができると思います。

Pointer A = ....(underlying device array1)
Pointer B = ....(underlying device array2)
Pointer C = ....(underlying device array3)

ByteBuffer structListBuffer = ByteBuffer.allocate(16*noSteps);
for(int x = 0; x<noSteps; x++) {
    // Get the underlying pointer values
    long pointer1 = A.withByteOffset(getStepOffsetA(x)).someGetUnderlyingPointerValueFunction();
    long pointer2 = B.withByteOffset(getStepOffsetB(x)).someGetUnderlyingPointerValueFunction();
    long pointer3 = C.withByteOffset(getStepOffsetC(x)).someGetUnderlyingPointerValueFunction();

    // Build the struct
    structListBuffer.asLongBuffer().append(pointer1);
    structListBuffer.asLongBuffer().append(pointer2);
    structListBuffer.asLongBuffer().append(pointer3);
    structListBuffer.asLongBuffer().append(0); //padding
}

structListBuffer次に、カーネルが期待する方法で構造体のリストを含めます。

someGetUnderlyingPointerValueFunction()では、ByteBuffer から実行する方法はありますか?

score 2 · Accepted Answer

私がすべてを正しく理解していれば、質問の主なポイントは、そのような魔法の機能があるかどうかです

long address = pointer.someGetUnderlyingPointerValueFunction();

これは、ネイティブポインターのアドレスを返します。

簡単な答え: いいえ、そのような機能はありません。

_{(補足: かなり前に同様の機能が既に要求されていましたが、まだ追加していません。主な理由は、そのような関数が Java 配列または (非直接の) バイトバッファーへのポインターに対して意味をなさないためです。さらに、手動でパディングとアラインメントを持つ構造体、32 ビットマシンと 64 ビットマシンで異なるサイズのポインター、およびビッグエンディアンまたはリトルエンディアンのバッファーを処理することは、頭の痛い問題です。getAddress()関数のようなものを追加する可能性が最も高い. おそらくCUdeviceptrクラスにのみ, それが間違いなく理にかなっている - 少なくともPointerクラスよりも. 人々はこのメソッドを使って奇妙なことをするだろう.VMの厄介なクラッシュを引き起こすことを行いますが、JCuda自体は非常に薄い抽象化レイヤーであるため、とにかくこの点に関するセーフティネットはありません...)}

とはいえ、次のような方法で現在の制限を回避できます。

private static long getPointerAddress(CUdeviceptr p)
{
    // WORKAROUND until a method like CUdeviceptr#getAddress exists
    class PointerWithAddress extends Pointer
    {
        PointerWithAddress(Pointer other)
        {
            super(other);
        }
        long getAddress()
        {
            return getNativePointer() + getByteOffset();
        }
    }
    return new PointerWithAddress(p).getAddress();
}

もちろん、これは見苦しく、getNativePointer()andgetByteOffset()メソッドを作成する意図と明らかに矛盾していますprotected。しかし、最終的には「公式」の方法に置き換えられる可能性があります。

private static long getPointerAddress(CUdeviceptr p)
{
    return p.getAddress();
}

これまでのところ、これはおそらく C 側でできることに最も近いソリューションです。

これをテストするために私が書いた例を次に示します。カーネルは、構造体を「識別可能な」値で埋めて (最終的に正しい場所にあるかどうかを確認するために) ダミーのカーネルにすぎず、1 つのスレッドのみで起動することになっています。

typedef struct __declspec(align(16)) {
    float* pointer1;
    float* pointer2;
    float* pointer3;
} MyStruct;

extern "C"
__global__ void kernel(MyStruct *structs)
{
    structs[0].pointer1[0] = 1.0f;
    structs[0].pointer1[1] = 1.1f;
    structs[0].pointer1[2] = 1.2f;

    structs[0].pointer2[0] = 2.0f;
    structs[0].pointer2[1] = 2.1f;
    structs[0].pointer2[2] = 2.2f;

    structs[0].pointer3[0] = 3.0f;
    structs[0].pointer3[1] = 3.1f;
    structs[0].pointer3[2] = 3.2f;

    structs[1].pointer1[0] = 11.0f;
    structs[1].pointer1[1] = 11.1f;
    structs[1].pointer1[2] = 11.2f;

    structs[1].pointer2[0] = 12.0f;
    structs[1].pointer2[1] = 12.1f;
    structs[1].pointer2[2] = 12.2f;

    structs[1].pointer3[0] = 13.0f;
    structs[1].pointer3[1] = 13.1f;
    structs[1].pointer3[2] = 13.2f;
}

このカーネルは、次のプログラムで起動されます (注: PTX ファイルのコンパイルは、アプリケーションのケースと一致しない可能性がある設定で、ここでオンザフライで行われます。疑わしい場合は、PTX ファイルを手動でコンパイルすることができます)。

各構造体のおよびポインタはpointer1、カーネルによって書き込まれた値を識別できるようにするオフセットを持つデバイスバッファおよびの連続する要素をそれぞれ指すように初期化されます。(これを 32 ビットまたは 64 ビットマシンで実行する 2 つの可能性のあるケースを処理しようとしたことに注意してください。これは、異なるポインターサイズを意味します。ただし、現在、32 ビットバージョンしかテストできません)。pointer2pointer3ABC

import static jcuda.driver.JCudaDriver.*;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.LongBuffer;
import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;


public class JCudaPointersInStruct 
{
    public static void main(String args[]) throws IOException
    {
        JCudaDriver.setExceptionsEnabled(true);
        String ptxFileName = preparePtxFile("JCudaPointersInStructKernel.cu");
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        CUmodule module = new CUmodule();
        cuModuleLoad(module, ptxFileName);
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "kernel");

        int numElements = 9;
        CUdeviceptr A = new CUdeviceptr();
        cuMemAlloc(A, numElements * Sizeof.FLOAT);
        cuMemsetD32(A, 0, numElements);
        CUdeviceptr B = new CUdeviceptr();
        cuMemAlloc(B, numElements * Sizeof.FLOAT);
        cuMemsetD32(B, 0, numElements);
        CUdeviceptr C = new CUdeviceptr();
        cuMemAlloc(C, numElements * Sizeof.FLOAT);
        cuMemsetD32(C, 0, numElements);

        int numSteps = 2;
        int sizeOfStruct = Sizeof.POINTER * 4;
        ByteBuffer hostStructsBuffer = 
            ByteBuffer.allocate(numSteps * sizeOfStruct);
        if (Sizeof.POINTER == 4)
        {
            IntBuffer b = hostStructsBuffer.order(
                ByteOrder.nativeOrder()).asIntBuffer();
            for(int x = 0; x<numSteps; x++) 
            {
                CUdeviceptr pointer1 = A.withByteOffset(getStepOffsetA(x));
                CUdeviceptr pointer2 = B.withByteOffset(getStepOffsetB(x));
                CUdeviceptr pointer3 = C.withByteOffset(getStepOffsetC(x));

                //System.out.println("Step "+x+" pointer1 is "+pointer1);
                //System.out.println("Step "+x+" pointer2 is "+pointer2);
                //System.out.println("Step "+x+" pointer3 is "+pointer3);

                b.put((int)getPointerAddress(pointer1));
                b.put((int)getPointerAddress(pointer2));
                b.put((int)getPointerAddress(pointer3));
                b.put(0);
            }
        }
        else
        {
            LongBuffer b = hostStructsBuffer.order(
                ByteOrder.nativeOrder()).asLongBuffer();
            for(int x = 0; x<numSteps; x++) 
            {
                CUdeviceptr pointer1 = A.withByteOffset(getStepOffsetA(x));
                CUdeviceptr pointer2 = B.withByteOffset(getStepOffsetB(x));
                CUdeviceptr pointer3 = C.withByteOffset(getStepOffsetC(x));

                //System.out.println("Step "+x+" pointer1 is "+pointer1);
                //System.out.println("Step "+x+" pointer2 is "+pointer2);
                //System.out.println("Step "+x+" pointer3 is "+pointer3);

                b.put(getPointerAddress(pointer1));
                b.put(getPointerAddress(pointer2));
                b.put(getPointerAddress(pointer3));
                b.put(0);
            }
        }

        CUdeviceptr structs = new CUdeviceptr();
        cuMemAlloc(structs, numSteps * sizeOfStruct);
        cuMemcpyHtoD(structs, Pointer.to(hostStructsBuffer), 
            numSteps * sizeOfStruct);

        Pointer kernelParameters = Pointer.to(
            Pointer.to(structs)
        );
        cuLaunchKernel(function, 
            1, 1, 1, 
            1, 1, 1, 
            0, null, kernelParameters, null);
        cuCtxSynchronize();


        float hostA[] = new float[numElements];
        cuMemcpyDtoH(Pointer.to(hostA), A, numElements * Sizeof.FLOAT);
        float hostB[] = new float[numElements];
        cuMemcpyDtoH(Pointer.to(hostB), B, numElements * Sizeof.FLOAT);
        float hostC[] = new float[numElements];
        cuMemcpyDtoH(Pointer.to(hostC), C, numElements * Sizeof.FLOAT);

        System.out.println("A "+Arrays.toString(hostA));
        System.out.println("B "+Arrays.toString(hostB));
        System.out.println("C "+Arrays.toString(hostC));
    }

    private static long getStepOffsetA(int x)
    {
        return x * Sizeof.FLOAT * 4 + 0 * Sizeof.FLOAT;
    }
    private static long getStepOffsetB(int x)
    {
        return x * Sizeof.FLOAT * 4 + 1 * Sizeof.FLOAT;
    }
    private static long getStepOffsetC(int x)
    {
        return x * Sizeof.FLOAT * 4 + 2 * Sizeof.FLOAT;
    }


    private static long getPointerAddress(CUdeviceptr p)
    {
        // WORKAROUND until a method like CUdeviceptr#getAddress exists
        class PointerWithAddress extends Pointer
        {
            PointerWithAddress(Pointer other)
            {
                super(other);
            }
            long getAddress()
            {
                return getNativePointer() + getByteOffset();
            }
        }
        return new PointerWithAddress(p).getAddress();
    }




    //-------------------------------------------------------------------------
    // Ignore this - in practice, you'll compile the PTX manually
    private static String preparePtxFile(String cuFileName) throws IOException
    {
        int endIndex = cuFileName.lastIndexOf('.');
        if (endIndex == -1)
        {
            endIndex = cuFileName.length()-1;
        }
        String ptxFileName = cuFileName.substring(0, endIndex+1)+"ptx";
        File cuFile = new File(cuFileName);
        if (!cuFile.exists())
        {
            throw new IOException("Input file not found: "+cuFileName);
        }
        String modelString = "-m"+System.getProperty("sun.arch.data.model");
        String command =
            "nvcc " + modelString + " -ptx -arch sm_11 -lineinfo "+
            cuFile.getPath()+" -o "+ptxFileName;
        System.out.println("Executing\n"+command);
        Process process = Runtime.getRuntime().exec(command);
        String errorMessage =
            new String(toByteArray(process.getErrorStream()));
        String outputMessage =
            new String(toByteArray(process.getInputStream()));
        int exitValue = 0;
        try
        {
            exitValue = process.waitFor();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IOException(
                "Interrupted while waiting for nvcc output", e);
        }

        if (exitValue != 0)
        {
            System.out.println("nvcc process exitValue "+exitValue);
            System.out.println("errorMessage:\n"+errorMessage);
            System.out.println("outputMessage:\n"+outputMessage);
            throw new IOException(
                "Could not create .ptx file: "+errorMessage);
        }
        System.out.println("Finished creating PTX file");
        return ptxFileName;
    }
    private static byte[] toByteArray(InputStream inputStream)
        throws IOException
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte buffer[] = new byte[8192];
        while (true)
        {
            int read = inputStream.read(buffer);
            if (read == -1)
            {
                break;
            }
            baos.write(buffer, 0, read);
        }
        return baos.toByteArray();
    }

}

結果は、期待/希望どおりです。

A [1.0, 1.1, 1.2, 0.0, 11.0, 11.1, 11.2, 0.0, 0.0]
B [0.0, 2.0, 2.1, 2.2, 0.0, 12.0, 12.1, 12.2, 0.0]
C [0.0, 0.0, 3.0, 3.1, 3.2, 0.0, 13.0, 13.1, 13.2]

cuda - JCudaでネイティブポインタの構造体を作成するにはどうすればよいですか

1 に答える 1

Related

Reference