cuda - Best way to process a big algorithm on Cuda

Question

I have a method that applies a lot of transforms to an image: several different operations, all on the same data. In my CPU code I do all of these transforms in the same loop, and I am wondering what the best way is to do this in CUDA.

On the CPU I have

loop 1
  loop 2
    loop 3
      DO A LOT OF SMALL BUT INDEPENDENT OPERATIONS
    end
  end
end

I thread the outermost loop with OpenMP, and the speedup is almost equal to the number of threads, so the algorithm is very parallelizable. Nonetheless, for very big images it can still take a long time, so I figured I could use CUDA.

I managed to get rid of the two outer loops, loop 1 and loop 2, by replacing each iteration with one CUDA thread, but now I'm not sure which design is better.

For example, I tried doing this:

cuda_kernel{

   loop 3
      DO A LOT OF SMALL BUT INDEPENDENT OPERATIONS
   end
}

Some of those operations have branching and others don't. My question is whether it would be better in CUDA to do this instead:

cuda_kernel 1{

   loop 3
      DO JUST FIRST OPERATION
   end
}

cuda_kernel 2{

   loop 3
      DO JUST SECOND OPERATION
   end
}


ETC

In this case each kernel is greatly simplified, but the kernels are called serially, one after the other, and loop 3 is repeated for each operation.

So what would you recommend: calculating everything at once, or running each operation in a separate kernel?

score 1 · Accepted Answer

Kernel calls are very expensive in terms of execution time. The more operations you pack into a single kernel call, the better the performance. What I would actually do is:

cuda_kernel {
 loop 2
   loop 3
    Do stuff here ...
   end
 end
}

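This is the fastest way to do everything. I used two loops here to show that even if you have nested loops, you should run them inside the kernel rather than putting the kernel call inside a loop. Hope this helps.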
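
As a rough illustration of the single-kernel approach, here is a minimal sketch in which each thread handles one pixel (standing in for loop 1 and loop 2) and runs loop 3 itself, applying every operation to a value it reads only once. Everything named here is an assumption made for the example and not taken from the question: `op_a` and `op_b` are hypothetical stand-ins for the real transforms, `fused_kernel` and `run_fused_demo` are made-up names, and the plane-by-plane memory layout and the sizes are arbitrary.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the "small but independent operations".
__device__ float op_a(float v) { return v * 2.0f; }               // no branching
__device__ float op_b(float v) { return v > 10.0f ? 10.0f : v; }  // branching

// One thread per pixel replaces loop 1 and loop 2; loop 3 runs inside the thread.
__global__ void fused_kernel(const float* in, float* out_a, float* out_b,
                             int width, int height, int depth)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard the partial blocks at the edges

    for (int k = 0; k < depth; ++k) {       // loop 3
        // Assumed layout: depth planes of height x width, x fastest, so
        // neighbouring threads touch neighbouring addresses.
        size_t i = (size_t(k) * height + y) * width + x;
        float v = in[i];                    // read once, reuse for every operation
        out_a[i] = op_a(v);
        out_b[i] = op_b(v);
    }
}

// Host-side driver: fills an input, launches the kernel once, checks the results.
bool run_fused_demo()
{
    const int width = 64, height = 48, depth = 3;  // arbitrary demo sizes
    const size_t n = size_t(width) * height * depth;
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = float(i % 20);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        fused_kernel<<<grid, block>>>(d_in, d_a, d_b, width, height, depth);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);  // freeing a null pointer is a no-op

    // Compare against the same operations computed on the host.
    for (size_t i = 0; ok && i < n; ++i) {
        float v = h_in[i];
        ok = (h_a[i] == v * 2.0f) && (h_b[i] == (v > 10.0f ? 10.0f : v));
    }
    return ok;
}
```

To try it, compile with `nvcc` and call `run_fused_demo()` from `main`. With separate kernels, every operation would reload `in[i]` from global memory and add another launch, which is the cost the fused version avoids. A very large fused kernel can use more registers per thread, though, so it is probably worth profiling both variants on the real operations.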
