First of all, I agree with mfa: it would be better if you had your data transposed. That way neighbouring work items read consecutive addresses in global memory (search for "memory coalescing"). But this is only one part of the problem.
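Just to illustrate the access pattern (a sketch only, with hypothetical names d_t for a transposed copy of the image and height for its height; this is not your kernel):

// one work item per row, as in your current code
__kernel void read_pattern(__global const float* d_in, __global const float* d_t,
                           const int rowLen, const int height, __global float* d_out)
{
    int i = get_global_id(0);
    float a = d_in[i * rowLen + 0]; // neighbouring work items hit addresses rowLen floats apart: not coalesced
    float b = d_t[0 * height + i];  // in the transposed buffer they hit neighbouring floats: coalesced
    d_out[i] = a + b;               // dummy use so the reads are not optimized away
}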
The other thing is to change your algorithm. The current drawbacks of your approach: 1) you have a small number of work items, equal to the height of your image; 2) each work item runs a long loop.
I'd suggest you rewrite your algorithm to actually make it parallel, in the following way:
Say you have to sum 512 items. Then you run a work group with 256 work items.
Each work item adds up 2 values, e.g. the 1st does v[1] = v[0] + v[1], the 2nd does v[3] = v[2] + v[3], and so on. After this first pass you have pair sums in the odd indices. In the next pass you do the same thing, but only 128 work items have work to do, because only 256 elements are left to process. The only difference is that now the 1st work item does v[3] = v[1] + v[3], the 2nd does v[7] = v[5] + v[7], and so on.
This way you get 1) O(log N) passes instead of O(N) iterations and 2) many work items that each do little work, which is exactly what benefits from parallelization.
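To make the index pattern concrete, here is a tiny single-threaded simulation of those passes for 8 values (plain C on the host, just to show which elements get written in each pass):

#include <stdio.h>

int main(void)
{
    float v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    /* width doubles every pass: 2, 4, 8; the loop over i plays the role of the work items */
    for (int width = 2; width <= 8; width <<= 1)
        for (int i = 0; i < 8 / width; ++i)
        {
            int idx = width * (i + 1) - 1;  /* pass 1: v[1],v[3],v[5],v[7]; pass 2: v[3],v[7]; pass 3: v[7] */
            v[idx] += v[idx - width / 2];
        }
    printf("%g\n", v[7]);                   /* prints 36, the total, which ends up in the last element */
    return 0;
}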
Of course you will need to call barrier(...) after every pass to synchronize the calculations between the work items in a work group.
To speed things up even more, each work group first copies its values from global memory to local memory and then performs the calculations in local memory.
One question you might want to ask: "What if I need to sum too many values (say 100,000) and I can't create a work group with that many work items?" In that case each work group computes a partial sum, and you run a kernel a second time to sum those partial sums (a sketch of such a second pass follows the kernel below), because you cannot synchronize between work groups while a kernel is executing.
To make this more concrete, here is the code. Each work group sums 2 * BLK_SIZ values of one row. Hope I did not make any mistakes (I did not actually compile this one):
// run with local size = (BLK_SIZ, 1) and global size = ( width, height );
// BLK_SIZ is expected to be defined at build time, e.g. with "-D BLK_SIZ=128"
__kernel void calc_sum(__global const float* d_in, __global float* d_sums, const int rowLen)
{
    int our_row = get_global_id(1);   // which row of the image we are summing
    int lx      = get_local_id(0);    // index inside the work group
    int gr      = get_group_id(0);    // which 2*BLK_SIZ-wide chunk of the row we handle

    __local float our_mem[2 * BLK_SIZ];

    // copy glob -> loc mem, padding with zeros past the end of the row
    int col  = gr * 2 * BLK_SIZ + 2 * lx;
    int base = our_row * rowLen;      // row offset into the flattened image
    our_mem[2 * lx + 0] = (col + 0 < rowLen) ? d_in[base + col + 0] : 0.0f;
    our_mem[2 * lx + 1] = (col + 1 < rowLen) ? d_in[base + col + 1] : 0.0f;

    // do the calculations: pairwise tree reduction in local memory
    int width  = 2;                   // distance between combined elements, doubles every pass
    int num_el = 2 * BLK_SIZ / width; // how many work items still have work to do
    int wby2   = width >> 1;
    while (num_el > 0)                // log2(2*BLK_SIZ) passes in total
    {
        barrier(CLK_LOCAL_MEM_FENCE); // wait until the previous pass is finished
        if (lx < num_el)
        {
            int idx = width * (lx + 1) - 1;
            our_mem[idx] += our_mem[idx - wby2];
        }
        width <<= 1;
        wby2 = width >> 1;
        num_el >>= 1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // store res: one partial sum per work group; the sum is in the last local element
    if (lx == 0) // choose one work item from the work group to actually write the sum
    {
        d_sums[our_row * get_num_groups(0) + gr] = our_mem[2 * BLK_SIZ - 1];
    }
}
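And for the "too many values" question above, the second pass can be very simple. A minimal sketch (also not compiled), assuming the first pass wrote numGroups partial sums per row laid out as d_sums[row * numGroups + group]; the names sum_partials and d_row_sums are made up:

// run with a 1D global size of ( height ): one work item per row
__kernel void sum_partials(__global const float* d_sums, __global float* d_row_sums,
                           const int numGroups)
{
    int row = get_global_id(0);
    float s = 0.0f;
    for (int g = 0; g < numGroups; ++g)  // only a few partial sums per row, so a plain loop is fine
        s += d_sums[row * numGroups + g];
    d_row_sums[row] = s;
}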
Also search the internet for the Blelloch and Hillis-Steele parallel prefix sum (scan) algorithms.
NVIDIA also has a nice example with good documentation for the parallel prefix sum algorithm. It does more than you need, but it uses the same approach I've described.
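For reference, the Hillis-Steele scan over one work group looks roughly like this. A minimal, untested sketch, assuming BLK_SIZ work items per group and using two local buffers to avoid read/write hazards between steps:

__kernel void scan_hillis_steele(__global const float* d_in, __global float* d_out)
{
    __local float buf[2][BLK_SIZ];
    int lx  = get_local_id(0);
    int pin = 0, pout = 1;                  // which buffer we read from / write to

    buf[pin][lx] = d_in[get_global_id(0)];  // one element per work item
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int offset = 1; offset < BLK_SIZ; offset <<= 1)
    {
        if (lx >= offset)
            buf[pout][lx] = buf[pin][lx] + buf[pin][lx - offset];
        else
            buf[pout][lx] = buf[pin][lx];
        barrier(CLK_LOCAL_MEM_FENCE);
        int t = pin; pin = pout; pout = t;  // swap buffers for the next step
    }

    d_out[get_global_id(0)] = buf[pin][lx]; // inclusive prefix sum within the work group
}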
Hope this helps.