I have a method that needs to apply many transforms to an image: several different operations, all applied to the same data. In my CPU code I do all of these transforms inside the same loop, and I am wondering what the best way to structure this in CUDA is.
On the CPU I have:
loop 1
    loop 2
        loop 3
            DO A LOT OF SMALL BUT INDEPENDENT OPERATIONS
        end
    end
end
I parallelize the outermost loop with OpenMP, and the speedup is almost linear in the number of threads, so the algorithm is highly parallelizable. Nonetheless, for very large images it can still take a long time, so I figured I could use CUDA.
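To make the structure concrete, here is a minimal sketch of what the CPU version looks like. The three operations (`op_scale`, `op_offset`, `op_threshold`), the `Outputs` struct, and the `[y][x][k]` memory layout are placeholders I made up for illustration, not my real code; the real operations are more numerous but have the same shape (each reads the same input value and writes its own output).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the "small but independent operations".
inline float op_scale(float v)     { return 2.0f * v; }
inline float op_offset(float v)    { return v + 1.0f; }
inline float op_threshold(float v) { return v > 0.5f ? 1.0f : 0.0f; }  // has a branch

// One output buffer per operation (assumed; names are illustrative only).
struct Outputs {
    std::vector<float> scaled, shifted, mask;
};

// Assumed layout: in[(y * width + x) * depth + k], with k as the "loop 3" index.
Outputs transform_cpu(const std::vector<float>& in, int height, int width, int depth) {
    const std::size_t n = static_cast<std::size_t>(height) * width * depth;
    Outputs out{std::vector<float>(n), std::vector<float>(n), std::vector<float>(n)};

    // Rows are independent, so the outermost loop is split across threads.
    // The pragma is ignored when compiling without OpenMP support.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {              // loop 1
        for (int x = 0; x < width; ++x) {           // loop 2
            for (int k = 0; k < depth; ++k) {       // loop 3
                const std::size_t i =
                    (static_cast<std::size_t>(y) * width + x) * depth + k;
                const float v = in[i];              // loaded once, used by every operation
                out.scaled[i]  = op_scale(v);
                out.shifted[i] = op_offset(v);
                out.mask[i]    = op_threshold(v);
            }
        }
    }
    return out;
}
```

Compiled with something like `g++ -O2 -fopenmp`, this is the pattern that scales well with the thread count for me.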
I managed to get rid of the two outermost loops (loop 1 and loop 2) by mapping each of their iterations to one CUDA thread, but now I'm not sure which design is better.
For example, I tried this:
cuda_kernel {
    loop 3
        DO A LOT OF SMALL BUT INDEPENDENT OPERATIONS
    end
}
Some of those operations involve branching and others don't. My question is whether it would be better in CUDA to do this instead:
cuda_kernel 1 {
    loop 3
        DO JUST FIRST OPERATION
    end
}
cuda_kernel 2 {
    loop 3
        DO JUST SECOND OPERATION
    end
}
ETC
In this case each kernel is greatly simplified, but the kernels are launched one after another, and loop 3 is repeated for each operation.
So which would you recommend: computing everything in a single kernel, or running each operation in its own kernel?
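The two designs can be sketched in CUDA as follows. This is only an illustration under assumptions: the two operations (`op_scale`, `op_threshold`), the helper `idx`, the plane-major layout, the 16x16 block size, and the host function `fused_matches_split` are all hypothetical names and choices of mine, and error handling is kept to a minimum.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for two of the "small but independent operations".
__device__ float op_scale(float v)     { return 2.0f * v; }
__device__ float op_threshold(float v) { return v > 0.5f ? 1.0f : 0.0f; }  // has a branch

// Assumed layout: plane-major, so threads with adjacent x touch adjacent addresses.
__device__ size_t idx(int x, int y, int k, int width, int height) {
    return ((size_t)k * height + y) * width + x;
}

// Design A: one fused kernel. One thread per (x, y); loop 3 runs inside the
// thread and every operation reuses the value loaded once from global memory.
__global__ void fused_kernel(const float* in, float* scaled, float* mask,
                             int width, int height, int depth) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    for (int k = 0; k < depth; ++k) {  // loop 3
        size_t i = idx(x, y, k, width, height);
        float v = in[i];
        scaled[i] = op_scale(v);
        mask[i]   = op_threshold(v);
    }
}

// Design B: one kernel per operation. Each repeats loop 3 and re-reads the input.
__global__ void scale_kernel(const float* in, float* scaled,
                             int width, int height, int depth) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    for (int k = 0; k < depth; ++k) {
        size_t i = idx(x, y, k, width, height);
        scaled[i] = op_scale(in[i]);
    }
}

__global__ void threshold_kernel(const float* in, float* mask,
                                 int width, int height, int depth) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    for (int k = 0; k < depth; ++k) {
        size_t i = idx(x, y, k, width, height);
        mask[i] = op_threshold(in[i]);
    }
}

// Host helper: runs both designs on synthetic data and checks that they agree.
bool fused_matches_split(int width, int height, int depth) {
    const size_t n = (size_t)width * height * depth;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 10) / 10.0f;

    // d[0] = input, d[1..2] = fused outputs, d[3..4] = split outputs.
    float* d[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};
    bool ok = true;
    for (int j = 0; j < 5; ++j) ok = ok && (cudaMalloc(&d[j], bytes) == cudaSuccess);
    ok = ok && (cudaMemcpy(d[0], h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess);

    if (ok) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        fused_kernel<<<grid, block>>>(d[0], d[1], d[2], width, height, depth);
        scale_kernel<<<grid, block>>>(d[0], d[3], width, height, depth);      // launched
        threshold_kernel<<<grid, block>>>(d[0], d[4], width, height, depth);  // back to back
        ok = (cudaGetLastError() == cudaSuccess) && (cudaDeviceSynchronize() == cudaSuccess);
    }

    std::vector<std::vector<float>> h(4, std::vector<float>(n));
    for (int j = 0; ok && j < 4; ++j)
        ok = (cudaMemcpy(h[j].data(), d[j + 1], bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    for (size_t i = 0; ok && i < n; ++i)
        ok = (h[0][i] == h[2][i]) && (h[1][i] == h[3][i]);

    for (int j = 0; j < 5; ++j) cudaFree(d[j]);  // cudaFree(nullptr) is a no-op
    return ok;
}
```

Calling `fused_matches_split(1920, 1080, 8)` from `main` and compiling with `nvcc` should confirm that both variants compute the same result. What I can't judge is which one is faster in practice: the fused kernel is larger and mixes branching and non-branching work, while the split version pays for extra launches and reads the input once per operation.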