Improve Performance of Small Matrix Problems on the GPU using pagefun
This example shows how to use pagefun
to improve the performance of applying a large number of independent rotations and translations to objects in a 3-D environment. This is typical of a range of problems which involve a large batch of calculations on small arrays.
GPUs are most effective when carrying out calculations on very large matrices. In MATLAB® this is usually achieved by vectorizing code to maximize the work done in each instruction. When you have a large data set but your calculations are divided into many small matrix operations, it can be challenging to maximize performance by running simultaneously on the many hundreds of GPU cores.
The arrayfun
function allow scalar operations to be carried out in parallel on the GPU. The pagefun
function adds the capability of carrying out matrix operations in batch in a similar way. The pagefun
function is available in Parallel Computing Toolbox™ for use with gpuArrays.
In this example, a robot is navigating a known map containing a large number of features that the robot can identify using its sensors. The robot locates itself in the map by measuring the relative position and orientation of those objects and comparing them to the map locations. Assuming the robot is not completely lost, it can use any difference between the two to correct its position, for instance by using a Kalman Filter. We will focus on the first part of the algorithm.
Set up the map
Let's create a map of objects with randomized positions and orientations in a large room.
numObjects = 1000;
roomDimensions = [50 50 5]; % Length * breadth * height in meters
We represent positions and orientations using 3-by-1 vectors T
and 3-by-3 rotation matrices R
. When we have N of these transforms we pack the translations into a 3-by-N matrix, and the rotations into a 3-by-3-by-N array. The supporting function randomTransforms
is provided at the end of this example and initializes N transforms with random values, providing a structure as output.
Use randomTransforms
to set up the map of object transforms, and a start location for the robot.
Map = randomTransforms(numObjects,roomDimensions); Robot = randomTransforms(1,roomDimensions);
Define the equations
To correctly identify map features the robot needs to transform the map to put its sensors at the origin. Then it can find map objects by comparing what it sees with what it expects to see.
For a map object we can find its position relative to the robot and orientation by transforming its global map location:
where and are the position and orientation of the robot, and and represent the map data. The equivalent MATLAB code looks like this:
Rrel(:,:,i) = Rbot' * Rmap(:,:,i) Trel(:,i) = Rbot' * (Tmap(:,i) - Tbot)
Perform many matrix transforms on the CPU using a for
loop
We need to transform every map object to its location relative to the robot. The supporting function loopingTransform
is provided at the end of this example and loops over all the transforms in turn. Note the 'like'
syntax for zeros
which will allow us to use the same code on the GPU in the next section.
To time the calculation we use the timeit
function, which will call loopingTransform
multiple times to get an average timing. Since it requires a function with no arguments, we use the @()
syntax to create an anonymous function of the right form.
cpuTime = timeit(@()loopingTransform(Robot,Map,numObjects)); fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ... cpuTime, numObjects);
It takes 0.0047 seconds on the CPU to execute 1000 transforms.
Try the same code on the GPU
To run this code on the GPU is merely a matter of copying the data into a gpuArray
. When MATLAB encounters data stored on the GPU it will run any code using it on the GPU as long as it is supported.
gMap.R = gpuArray(Map.R); gMap.T = gpuArray(Map.T); gRobot.R = gpuArray(Robot.R); gRobot.T = gpuArray(Robot.T);
Now we call gputimeit
, which is the equivalent of timeit
for code that includes GPU computation. It makes sure all GPU operations have finished before recording the time.
fprintf('Computing...\n');
Computing...
gpuTime = gputimeit(@()loopingTransform(gRobot,gMap,numObjects)); fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ... gpuTime, numObjects);
It takes 0.2745 seconds on the GPU to execute 1000 transforms.
fprintf(['Unvectorized GPU code is %3.2f times slower ',... 'than the CPU version.\n'], gpuTime/cpuTime);
Unvectorized GPU code is 58.19 times slower than the CPU version.
Batch process using pagefun
The GPU version above was very slow because, although all calculations were independent, they ran in series. Using pagefun
we can run all the computations in parallel. The supporting function pagefunTransform
is provided at the end of this example and applies the same transforms as the loopingTransform
function using pagefun
instead of a for
-loop.
gpuPagefunTime = gputimeit(@()pagefunTransform(gRobot, gMap)); fprintf(['It takes %3.4f seconds on the GPU using pagefun ',... 'to execute %d transforms.\n'], gpuPagefunTime, numObjects);
It takes 0.0003 seconds on the GPU using pagefun to execute 1000 transforms.
fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the CPU version.\n'], cpuTime/gpuPagefunTime);
Vectorized GPU code is 17.88 times faster than the CPU version.
fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);
Vectorized GPU code is 1040.61 times faster than the unvectorized GPU version.
Explanation
The first computation was the calculation of the rotations. This involved a matrix multiply, which translates to the function mtimes
(*
). We pass this to pagefun
along with the two sets of rotations to be multiplied:
Rel.R = pagefun(@mtimes, Robot.R', Map.R);
Robot.R'
is a 3-by-3 matrix, and Map.R
is a 3-by-3-by-N array. The pagefun
function matches each independent matrix from the map to the same robot rotation, and gives us the required 3-by-3-by-N output.
The translation calculation also involves a matrix multiply, but the normal rules of matrix multiplication allow this to come outside the loop without any changes:
Rel.T = Robot.R' * (Map.T - Robot.T);
More advanced GPU vectorization - Solving a "lost robot" problem
If our robot was in an unknown part of the map, it might use a global search algorithm to locate itself. The algorithm would test a number of possible locations by carrying out the above computation and looking for good correspondence between the objects seen by the robot's sensors and what it would expect to see at that position.
Now we have got multiple robots as well as multiple objects. N objects and M robots should give us N*M transforms out. To distinguish 'robot space' from 'object space' we use the 4th dimension for rotations and the 3rd for translations. That means our robot rotations will be 3-by-3-by-1-by-M, and the translations will be 3-by-1-by-M.
We initialize our search with random robot locations. A good search algorithm would use topological or other clues to seed the search more intelligently.
numRobots = 10; Robot = randomTransforms(numRobots,roomDimensions); Robot.R = reshape(Robot.R, 3, 3, 1, []); % Spread along the 4th dimension Robot.T = reshape(Robot.T, 3, 1, []); % Spread along the 3rd dimension gRobot.R = gpuArray(Robot.R); gRobot.T = gpuArray(Robot.T);
A supporting function loopingTransform2
is defined at the end of this example and performs a looping transform using two nested loops, to loop over the robots as well as over the objects.
cpuTime = timeit(@()loopingTransform2(Robot,Map,numObjects,numRobots)); fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ... cpuTime, numObjects*numRobots);
It takes 0.0852 seconds on the CPU to execute 10000 transforms.
For our GPU timings we use tic
and toc
this time, because otherwise the calculation would take too long. This will be precise enough for our purposes. To ensure any cost associated with creating the output data is included, we are calling loopingTransform2
with a single output variable, just as timeit
and gputimeit
do by default.
fprintf('Computing...\n');
Computing...
tic; gRel = loopingTransform2(gRobot,gMap,numObjects,numRobots); gpuTime = toc; fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ... gpuTime, numObjects*numRobots);
It takes 3.6775 seconds on the GPU to execute 10000 transforms.
fprintf(['Unvectorized GPU code is %3.2f times slower ',... 'than the CPU version.\n'], gpuTime/cpuTime);
Unvectorized GPU code is 43.17 times slower than the CPU version.
As before, the looping version runs much slower on the GPU because it is not doing calculations in parallel.
The new pagefun
version needs to incorporate the transpose
operator as well as mtimes
into a call to pagefun
. We also need to squeeze
the transposed robot orientations to put the spread over robots into the 3rd dimension, to match the translations. Despite this, the resulting code is considerably more compact. A supporting function pagefunTransform2
is provided at the end of this example and applies the same transforms as the loopingTransform2
function using two pagefun
calls instead of nested for
-loops.
Once again, pagefun
expands dimensions appropriately. So where we multiply 3-by-3-by-1-by-M matrix Rt
with 3-by-3-by-N-by-1 matrix Map.R
, we get a 3-by-3-by-N-by-M matrix out.
gpuPagefunTime = gputimeit(@()pagefunTransform2(gRobot,gMap)); fprintf(['It takes %3.4f seconds on the GPU using pagefun ',... 'to execute %d transforms.\n'], gpuPagefunTime, numObjects*numRobots);
It takes 0.0012 seconds on the GPU using pagefun to execute 10000 transforms.
fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the CPU version.\n'], cpuTime/gpuPagefunTime);
Vectorized GPU code is 72.28 times faster than the CPU version.
fprintf(['Vectorized GPU code is %3.2f times faster ',... 'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);
Vectorized GPU code is 3120.72 times faster than the unvectorized GPU version.
Conclusion
The pagefun
function supports a number of 2-D operations, as well as most of the scalar operations supported by arrayfun
. Together, these functions allow you to vectorize a range of computations involving matrix algebra and array manipulation, eliminating the need for loops and making huge performance gains.
Wherever you are doing small calculations on GPU data in a loop, you should consider converting to a batch implementation in this way. This can also be an opportunity to make use of the GPU to improve performance where previously it gave no performance gains.
Supporting Functions
Random Transform Function
The randomTransforms
function sets up a map of object transforms and a start location for the robot for a number of objects in a room of specified dimensions.
function Tform = randomTransforms(N,roomDimensions) Tform.T = zeros(3, N); Tform.R = zeros(3, 3, N); for i = 1:N Tform.T(:,i) = rand(3, 1) .* roomDimensions'; % To get a random orientation, we can extract an orthonormal % basis for a random 3-by-3 matrix. Tform.R(:,:,i) = orth(rand(3, 3)); end end
Looping Transform Function
The loopingTransform
function transforms every object to its location relative to the robot by looping over the transforms in turn.
function Rel = loopingTransform(Robot,Map,numObjects) Rel.R = zeros(size(Map.R), 'like', Map.R); % Initialize memory Rel.T = zeros(size(Map.T), 'like', Map.T); % Initialize memory for i = 1:numObjects Rel.R(:,:,i) = Robot.R' * Map.R(:,:,i); Rel.T(:,i) = Robot.R' * (Map.T(:,i) - Robot.T); end end
pagefun
Transform Function
The pagefunTransform
function transforms every object to its location relative to the robot by applying the transforms using the pagefun
function.
function Rel = pagefunTransform(Robot,Map) Rel.R = pagefun(@mtimes, Robot.R', Map.R); Rel.T = Robot.R' * (Map.T - Robot.T); end
Nested Looping Transform Function
The loopingTransform2
function performs a looping transform using two nested loops, to loop over the robots as well as over the objects. The transforms map every object to its location relative to every robot.
function Rel = loopingTransform2(Robot,Map,numObjects,numRobots) Rel.R = zeros(3, 3, numObjects, numRobots, 'like', Map.R); Rel.T = zeros(3, numObjects, numRobots, 'like', Map.T); for i = 1:numObjects for j = 1:numRobots Rel.R(:,:,i,j) = Robot.R(:,:,1,j)' * Map.R(:,:,i); Rel.T(:,i,j) = ... Robot.R(:,:,1,j)' * (Map.T(:,i) - Robot.T(:,1,j)); end end end
Two-call pagefun
Transform Function
The pagefunTransform2
function performs transforms to map every object to its location relative to every robot using two calls to the pagefun
function.
function Rel = pagefunTransform2(Robot, Map) Rt = pagefun(@transpose, Robot.R); Rel.R = pagefun(@mtimes, Rt, Map.R); Rel.T = pagefun(@mtimes, squeeze(Rt), ... (Map.T - Robot.T)); end
See Also
pagefun
| gpuArray
| arrayfun
| gputimeit