There are a few questions similar to this, but this case is a bit weird: NVCC 3.1 doesn't like the following, but 3.2 and 4.0RC do:
float xtmp[MAT1];
for (i = 0; i < MAT1; i++) {
    xtmp[i] = x[p[i]]; // value that should be here
}
Here p is passed to the function as a pointer (int *p), and it comes from:
int p_pivot[MAT1],q_pivot[MAT1];
To add a bit of context: before the p's get to the 'top' function, they are populated as follows (I'm cutting out as much irrelevant code as I can for clarity):
...
for (i=0;i<MAT1;i++){
...
p_pivot[i]=q_pivot[i]=i;
...
}
...
Beyond that, the only operations on the pivot arrays are three-step swaps using an integer temporary.
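To make the initialisation and swap steps concrete, here is a minimal host-side sketch rather than the actual code. The helper names (init_pivots, swap_pivot, is_permutation) and the value of MAT1 are assumptions made for illustration only. The point is that if the pivots start as the identity and are only ever swapped, every entry should stay in the range 0..MAT1-1.

```c
#define MAT1 4  /* assumed size, for illustration only */

/* Hypothetical helper: fill both pivot arrays with the identity permutation. */
void init_pivots(int *p_pivot, int *q_pivot)
{
    int i;
    for (i = 0; i < MAT1; i++) {
        p_pivot[i] = q_pivot[i] = i;
    }
}

/* Hypothetical helper: three-step swap of two entries via an int temporary. */
void swap_pivot(int *pivot, int a, int b)
{
    int tmp = pivot[a];
    pivot[a] = pivot[b];
    pivot[b] = tmp;
}

/* Hypothetical check: returns 1 if every value 0..MAT1-1 appears exactly once,
   0 otherwise (out-of-range or duplicated entry). */
int is_permutation(const int *pivot)
{
    int seen[MAT1] = {0};
    int i;
    for (i = 0; i < MAT1; i++) {
        int v = pivot[i];
        if (v < 0 || v >= MAT1 || seen[v]) {
            return 0;
        }
        seen[v] = 1;
    }
    return 1;
}
```

Calling is_permutation(p_pivot) on the host just before the array is handed on would, under these assumptions, rule out a corrupted pivot array as the cause.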
After all that, p_pivot is passed to the 'top' function as &p_pivot[0].
For anyone looking for more detail, the code is here, and the only change that should be needed to switch from 3.2/4.0 to an earlier version is to replace cudaDeviceSynchronize(); with cudaThreadSynchronize();. This is my dirty, dirty experimental code, so please don't judge me! :D
As noted, all of the above works fine in later versions of NVCC, and I'm working to get those installed on the machine in question, but I'd be interested to see what I'm missing.
It must be the array-lookup indexing that's causing the issue, but I don't understand why.
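For reference, this is roughly how the lookup could be checked on the host. It is a sketch only: gather_checked is a hypothetical helper, MAT1 is given an arbitrary value, and x is assumed to hold exactly MAT1 elements, which the snippets above do not confirm.

```c
#define MAT1 4  /* assumed size, for illustration only */

/* Hypothetical helper: performs xtmp[i] = x[p[i]] with a bounds check on p[i].
   Assumes x has MAT1 elements. Returns -1 if every index was in range,
   otherwise the position i of the first out-of-range entry in p. */
int gather_checked(float *xtmp, const float *x, const int *p)
{
    int i;
    for (i = 0; i < MAT1; i++) {
        if (p[i] < 0 || p[i] >= MAT1) {
            return i;  /* p[i] would index outside x */
        }
        xtmp[i] = x[p[i]];
    }
    return -1;
}
```

If this returns -1 for the same p that misbehaves under NVCC 3.1, the index values themselves are probably not the problem, and the difference is more likely in how the compiler handles the indexed access.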