I have a 3 x N gpuArray, where N is very large and the rows (dimension 1) hold x, y, and z. These points are the result of a viewpoint transformation: a 3 x N matrix is multiplied with a 3 x 3 rotation matrix and a translation vector is added. I now need to convert the points to spherical coordinates, but to do this I have to slice the array into its x, y, and z components and pass them as separate inputs to cart2sph (or to sub-functions like atan2 if I write my own version).
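A minimal sketch of the pipeline described above, for concreteness. The names pts, R, t, xyz, and sph and all the values are placeholders rather than the actual code, and the rotation is assumed to be applied as R * pts. It runs on ordinary arrays as written; wrapping the inputs in gpuArray(...) gives the GPU case.

```matlab
% Placeholder inputs (hypothetical names and values).
N   = 5;
pts = randn(3, N, 'single');          % input points, rows are x, y, z
ang = single(pi / 6);
R   = [cos(ang) -sin(ang) 0;          % example rotation about the z axis
       sin(ang)  cos(ang) 0;
       0         0        1];
t   = single([1; 2; 3]);              % example translation (3 x 1)

% Viewpoint transformation: rotate, then translate.
% Adding the 3 x 1 vector to a 3 x N array relies on implicit expansion (R2016b+).
xyz = R * pts + t;

% cart2sph takes separate x, y, z inputs, so each row has to be sliced out.
[th, phi, r] = cart2sph(xyz(1, :), xyz(2, :), xyz(3, :));

% Reassemble into a 3 x N array of [azimuth; elevation; radius].
sph = [th; phi; r];
```

The three row slices passed to cart2sph are the step the timing question is about.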
The problem is that slicing the array to feed cart2sph takes much longer than the viewpoint transformation and the coordinate conversion combined. The only way I've found to speed this up is to replace each slice with an element-wise multiply (.*) followed by a sum, which for some reason is faster than simply slicing. Here's some example code:
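tst = randn(3, 6000000, 'single', 'gpuArray');

% Version 1: slice out each row directly
for n = 1:100
    tst1 = tst(1, :);
    tst2 = tst(2, :);
    tst3 = tst(3, :);
    [th, phi, r] = cart2sph(tst1, tst2, tst3);
end

% Version 2: pick out each row with an element-wise multiply and a sum over dimension 1
for n = 1:100
    tst1b = sum(tst .* [1; 0; 0]);
    tst2b = sum(tst .* [0; 1; 0]);
    tst3b = sum(tst .* [0; 0; 1]);
    [th, phi, r] = cart2sph(tst1b, tst2b, tst3b);
end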
On my computer the first loop takes ~1.5 seconds and the second ~0.5 seconds. So the multiply-and-sum is about 3x faster than simply slicing the array, and cart2sph itself takes only a trivial amount of time. My questions are:
1) Is there a faster way to get from a 3 x N xyz array to a 3 x N phi-theta-r array, preferably one that does not require a set of (very slow) slicing operations?
2) Why is a multiply and sum operation faster than a simple slicing operation?