java - Speeding up intensity-sum calculation with JOCL/OpenCL

Question

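Hello, I'm new to JOCL (OpenCL). I wrote this code to compute the sum of intensities for each image. The kernel takes a 1D array containing all pixels of all images, laid out one after another. Each image is 300x300, so there are 90000 pixels per image. At the moment it is slower than doing the same thing sequentially.

My code: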

package PAR;

/*
 * JOCL - Java bindings for OpenCL
 * 
 * Copyright 2009 Marco Hutter - http://www.jocl.org/
 */
import IMAGE_IO.ImageReader;
import IMAGE_IO.Input_Folder;
import static org.jocl.CL.*;

import org.jocl.*;

/**
 * A small JOCL sample.
 */
public class IPPARA {

    /**
     * The source code of the OpenCL program to execute
     */
    private static String programSource =
            "__kernel void "
            + "sampleKernel(__global uint *a,"
            + "             __global uint *c)"
            + "{"
            + "__private uint intensity_core=0;"
            + "      uint i = get_global_id(0);"
            + "      for(uint j=i*90000; j < (i+1)*90000; j++){ "
            + "              intensity_core += a[j];"
            + "     }"
            + "c[i]=intensity_core;" 
            + "}";

    /**
     * The entry point of this sample
     *
     * @param args Not used
     */
    public static void main(String args[]) {
        long numBytes[] = new long[1];

        ImageReader imagereader = new ImageReader() ;
        int srcArrayA[]  = imagereader.readImages();

        int size[] = new int[1];
        size[0] = srcArrayA.length;
        long before = System.nanoTime();
        int dstArray[] = new int[size[0]/90000];


        Pointer srcA = Pointer.to(srcArrayA);
        Pointer dst = Pointer.to(dstArray);


        // Obtain the platform IDs and initialize the context properties
        System.out.println("Obtaining platform...");
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context on a CPU device
        cl_context context = clCreateContextFromType(
                contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
        if (context == null) {
            // If no context could be created on the first attempt,
            // retry once with the same CPU device type.
            context = clCreateContextFromType(
                    contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);

            if (context == null) {
                System.out.println("Unable to create a context");
                return;
            }
        }

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Get the list of devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

        // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
                Pointer.to(devices), null);

        // Create a command-queue
        cl_command_queue commandQueue =
                clCreateCommandQueue(context, devices[0], 0, null);

        // Allocate the memory objects for the input- and output data
        cl_mem memObjects[] = new cl_mem[2];
        memObjects[0] = clCreateBuffer(context,
                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_uint * srcArrayA.length, srcA, null);
        memObjects[1] = clCreateBuffer(context,
                CL_MEM_READ_WRITE,
                Sizeof.cl_uint * (srcArrayA.length/90000), null, null);

        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context,
                1, new String[]{programSource}, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0,
                Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1,
                Sizeof.cl_mem, Pointer.to(memObjects[1]));

        // Set the work-item dimensions
        long local_work_size[] = new long[]{1};
        long global_work_size[] = new long[]{(srcArrayA.length/90000)*local_work_size[0]};


        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                global_work_size, local_work_size, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
                (srcArrayA.length/90000) * Sizeof.cl_uint, dst, 0, null, null);

        // Release kernel, program, and memory objects
        clReleaseMemObject(memObjects[0]);
        clReleaseMemObject(memObjects[1]);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);


        long after = System.nanoTime();

        System.out.println("Time: " + (after - before) / 1e9);

    }
}

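After applying the suggestion from the answer, the parallel code running on the CPU is about as fast as the sequential code. Are there any other improvements I can make?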
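For a fair comparison, it helps to pin down what the sequential baseline computes. The following is a minimal sketch, assuming the same layout as the kernel (all images concatenated in a single int[], 90000 pixels each). The class and method names are hypothetical and do not come from the original code.

```java
/** Hypothetical sequential baseline; names are illustrative only. */
final class SequentialIntensity {

    /** Sums each consecutive block of pixelsPerImage values in pixels. */
    static int[] sumPerImage(int[] pixels, int pixelsPerImage) {
        int numImages = pixels.length / pixelsPerImage;
        int[] sums = new int[numImages];
        for (int img = 0; img < numImages; img++) {
            int acc = 0; // local accumulator, same role as the kernel's private variable
            int start = img * pixelsPerImage;
            for (int j = start; j < start + pixelsPerImage; j++) {
                acc += pixels[j];
            }
            sums[img] = acc;
        }
        return sums;
    }

    public static void main(String[] args) {
        int pixelsPerImage = 300 * 300;
        int[] pixels = new int[4 * pixelsPerImage]; // four synthetic images (assumed data)
        java.util.Arrays.fill(pixels, 1);
        long before = System.nanoTime();
        int[] sums = sumPerImage(pixels, pixelsPerImage);
        long after = System.nanoTime();
        System.out.println("sum[0] = " + sums[0] + ", time: " + (after - before) / 1e9 + " s");
    }
}
```

Note that the timing in the question starts before the platform, context, and program are set up, so it measures one-time setup cost as well as the kernel run, whereas a loop like the one above has no such overhead.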

score 2 · Accepted Answer

        + "      for(uint j=i*90000; j < (i+1)*90000; j++){ "
        + "              c[i] += a[j];"

1) You are using global memory (c[]) to accumulate the sum, which is slow. Use a private variable to speed it up. Something like this:

          "__kernel void "
        + "sampleKernel(__global uint *a,"
        + "             __global uint *c)"
        + "{"
        + "__private uint intensity_core=0;" // private variable, one per work-item
        + "      uint i = get_global_id(0);"
        + "      for(uint j=i*90000; j < (i+1)*90000; j++){ "
        + "              intensity_core += a[j];" // a register is at least 100x faster than global memory,
                                                  // but the a[] reads remain, so kernel time cannot drop below 50%
        + "     }"
        + "c[i]=intensity_core;"
        + "}";  // expecting up to a 100% speedup

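Now you have an array c[number of images] of intensity sums.

With a local work size of 1, the computation uses every core as long as you have at least 160 images (the number of cores on your GPU).

You need 90000*num_images global reads and num_images global writes, plus 90000*num_images register reads/writes. Using registers should cut the kernel time in half.

2) You perform only one arithmetic operation per two memory accesses. To use even a small fraction of the GPU's peak Gflops (250 Gflops peak for the 6490M), you need at least 10 operations per memory access.

Your i7 CPU can easily reach 100 Gflops, but memory will be the bottleneck. This gets even worse when you send all the data over PCI Express (HD Graphics 3000 is rated at 125 GFLOPS).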
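The memory-bound argument can be made concrete with rough arithmetic. The sketch below is an estimate only: the bandwidth figures (8 GB/s for PCI Express, 20 GB/s for host memory) and the image count are assumed placeholder values, not measurements of the asker's hardware, and the class name is hypothetical.

```java
/** Back-of-the-envelope estimate; all hardware figures are assumptions. */
final class BandwidthEstimate {

    /** Seconds needed just to move the given bytes at the given bandwidth (bytes/s). */
    static double transferSeconds(long bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond;
    }

    /** Bytes occupied by numImages images of pixelsPerImage 4-byte values. */
    static long inputBytes(long numImages, long pixelsPerImage) {
        return numImages * pixelsPerImage * 4L;
    }

    public static void main(String[] args) {
        long numImages = 160;               // assumed image count
        long bytes = inputBytes(numImages, 90000);
        double pcie = 8e9;                  // assumed PCIe bandwidth, bytes/s
        double ram = 20e9;                  // assumed host memory bandwidth, bytes/s
        long additions = numImages * 90000; // one addition per pixel read
        System.out.printf("input: %d bytes, %d additions%n", bytes, additions);
        System.out.printf("PCIe transfer: %.4f s, RAM read: %.4f s%n",
                transferSeconds(bytes, pcie), transferSeconds(bytes, ram));
        System.out.printf("arithmetic intensity: %.2f ops/byte%n",
                (double) additions / bytes);
    }
}
```

With one addition per 4-byte read, the arithmetic intensity is 0.25 operations per byte regardless of the image count, which is why the data movement rather than the arithmetic is likely to dominate the run time.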

    // Obtain a device ID
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
    cl_device_id device = devices[deviceIndex];
    // One element of devices[] must be your HD3000.
    // Example: devices[0] -> GPU, devices[1] -> CPU, devices[2] -> HD3000

In your program:

 // Obtain the cl_device_id for the first device
    int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
            Pointer.to(devices), null);

The first device is probably the GPU.

score 0 · Accepted Answer

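You should use an entire work group for each 300x300 image. This helps saturate the GPU cores and lets you use local memory. The kernel should also be able to process as many images simultaneously as there are compute units on the device.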
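On the host side, one work group per image means the global work size is the number of images multiplied by the local work size. Here is a minimal sketch of that calculation, assuming a local size of 64. The helper name is hypothetical; the two resulting values are what would be passed as the global and local sizes to clEnqueueNDRangeKernel.

```java
/** Hypothetical helper for sizing the NDRange; not part of JOCL. */
final class WorkSizes {

    /** Returns {globalSize, localSize} for one work group per image. */
    static long[] forImages(int totalPixels, int pixelsPerImage, int localSize) {
        if (totalPixels % pixelsPerImage != 0) {
            throw new IllegalArgumentException(
                    "pixel count is not a whole number of images");
        }
        long numImages = totalPixels / pixelsPerImage;
        // Every image gets localSize work-items, so the global size scales with both.
        return new long[] { numImages * localSize, localSize };
    }

    public static void main(String[] args) {
        // 160 images is an assumed example count.
        long[] sizes = forImages(160 * 90000, 90000, 64);
        System.out.println("global = " + sizes[0] + ", local = " + sizes[1]);
    }
}
```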

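The kernel below performs the reduction in three steps.

1) Read values into one private variable per work-item.
2) Write the private variable to local memory (a very simple step, but an important one).
3) Reduce the values in local memory to get the final value. There are two ways to do this, as shown here.

WG_MAX_SIZE is defined because I don't like passing in variable-sized local memory blocks. The value is 64 because that is a good value to use on most platforms. Set it higher if you want to experiment with larger work groups. Work groups smaller than WG_MAX_SIZE will still work fine.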

#define WORK_SIZE 90000
#define WG_MAX_SIZE 64
__kernel void sampleKernel(__global uint *a, __global uint *c)
{
    // one slot per work-item in the group
    local uint intensity_core[WG_MAX_SIZE];
    private uint workItemIntensity = 0;

    int gid = get_group_id(0);   // one work group per image
    int lid = get_local_id(0);
    int wgsize = get_local_size(0);
    int i;

    // step 1: each work-item sums a strided slice of its group's image
    for(i=gid*WORK_SIZE+lid; i < (gid+1)*WORK_SIZE; i+=wgsize){ 
        workItemIntensity += a[i];
    }
    // step 2: publish the partial sum to local memory
    intensity_core[lid] = workItemIntensity;
    // barrier (not only a fence) so every work-item has written its slot
    barrier(CLK_LOCAL_MEM_FENCE);

    // step 3: use only one of the two options below

    //option #1
    //loop to reduce the final values O(n) time
    if(lid == 0){
        for(i=1;i<wgsize;i++){
            workItemIntensity += intensity_core[i];
        }
        c[gid]=workItemIntensity;
    }

    //option #2
    //O(logn) time reduction
    //assumes work group size is a power of 2
    int steps = 32 - clz(wgsize);
    for(i=1;i<steps;i++){
        if(lid % (1 << i) == 0){
            intensity_core[lid] += intensity_core[lid + (1 << (i-1))];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0){
        c[gid]=intensity_core[0];
    }
}
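To sanity-check a reduction kernel like this, the same three steps can be emulated on the host. This is a plain-Java sketch, not OpenCL: the loop over lid stands in for work-items that would run in parallel, the local array stands in for local memory, and all names are hypothetical. It assumes the work-group size is a power of two, as option #2 does.

```java
/** Host-side emulation of the work-group reduction; illustrative only. */
final class ReductionEmulation {

    /** Emulates one work group reducing image number gid. */
    static int reduceImage(int[] a, int gid, int workSize, int wgSize) {
        int[] local = new int[wgSize]; // stands in for local memory

        // Steps 1 and 2: each work-item sums a strided slice, then stores it.
        for (int lid = 0; lid < wgSize; lid++) {
            int acc = 0;
            for (int i = gid * workSize + lid; i < (gid + 1) * workSize; i += wgSize) {
                acc += a[i];
            }
            local[lid] = acc;
        }

        // Step 3: tree reduction; each outer iteration is one barrier-separated pass.
        for (int stride = 1; stride < wgSize; stride <<= 1) {
            for (int lid = 0; lid < wgSize; lid += 2 * stride) {
                local[lid] += local[lid + stride];
            }
        }
        return local[0];
    }

    public static void main(String[] args) {
        int workSize = 90000, wgSize = 64, numImages = 3;
        int[] a = new int[numImages * workSize];
        for (int i = 0; i < a.length; i++) a[i] = i % 256; // synthetic pixel data
        for (int gid = 0; gid < numImages; gid++) {
            long expected = 0;
            for (int i = gid * workSize; i < (gid + 1) * workSize; i++) expected += a[i];
            System.out.println("image " + gid + ": "
                    + reduceImage(a, gid, workSize, wgSize) + " (expected " + expected + ")");
        }
    }
}
```

Comparing the values read back from c[] against an emulation like this (or against a simple sequential sum) is one way to catch indexing mistakes in the strided loop or the reduction step.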
