Loop through a list of vectors (i.e., the rows of a matrix), applying the same "simple" function to each one, on the GPU? Should I use arrayfun somehow?

I currently have an analysis which loops through a list of measurements, where each measurement itself is a vector (so it loops through the rows of a matrix).
To each measurement input (a 1-D vector) I apply a custom function (too involved to share here, but nonetheless straightforward), which gives an output of the same size as the input. The analysis function more or less just loops through each element of the vector and does something, but the result for each element also depends on the previous results (there is a clear dependency ordering).
I often have tens or hundreds of millions of measurements per experiment, and it currently takes about 6 hours to run about 10 million, but we'd really like to get this working in less than 15 minutes... And so we hope there is a way to get it to run on a GPU? The key thing is that each measurement is only 15-50 elements long, and so I'd like to parallelise along the number of measurements, rather than within the function which operates on each measurement (which seems straightforward).
I found this question which seemed similar: https://ch.mathworks.com/matlabcentral/answers/83839-can-arrayfun-take-multi-dimensional-arrays-as-individual-arguments but the staff answer seemed to just say to loop it, which doesn't seem fast enough for me. It seems like pagefun could work, but I'm struggling to come up with a way how.
For an example, an outline sketch of my code is in the attached files (run gpu_run_test).
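The rough shape of the problem is below (hypothetical names only, not the attached code; some_update stands in for the actual per-element rule):
results = zeros(size(measurements));       % preallocate outputs
for m = 1:size(measurements, 1)            % tens to hundreds of millions of these
    v = measurements(m, :);                % one measurement: 15-50 elements
    r = zeros(size(v));
    state = 0;
    for k = 1:numel(v)
        r(k) = some_update(v(k), state);   % each element depends on previous results
        state = state + r(k);              % e.g. a running dependency
    end
    results(m, :) = r;
end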
Update (again, after vectorising - see the most recent update below) - I've separated out the specific part of my code and created a minimum working example, which I have now attached to this post. The code runs without issue, but running it with GPU arrays is currently much slower:
Outdated info
For larger N, this gap only widens. After more thorough research while separating this code example out from everything else, I realised what I hope for might not be possible in the end... If anyone is able to help it would be incredibly amazing.
Update 2 - I've reworked my code again, to make it as clear as I can where the dependency lies. It is hopefully more readable now, as the function calls are far more concise and minimal.
In words: the wc at each step depends on the position at that step, but then the position result at each step depends (through a chain of things which themselves depend) on the wc at the previous step. In particular, this occurs when interp1 is used with arg_position and arg_density to assess the wc at the newly-determined position.
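As a sketch of that dependency only (illustrative names; next_position is a hypothetical stand-in for the actual update rule in the attached files):
position = starting_position;
wc = interp1(arg_position, arg_wc, position);     % wc at the current position
for step = 1:num_steps
    position = next_position(position, wc);       % new position needs the previous wc
    wc = interp1(arg_position, arg_wc, position); % wc at the newly-determined position
end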
Update 3 - OK, so it finally clicked what everyone was suggesting, and indeed it works extremely well now after vectorising across the measurements. The attached code (updated again) works a lot faster, but under one condition:
The arg_position and arg_wc used in the interp1 line must be ordinary 1-D vectors, the same for all measurements. This is not a dealbreaker, as in practice they only change about 2000 times rather than 100 million times (indeed, everything except the first argument, "invertible_group_delays", only changes 2000 times versus 100 million)...
Does anyone have a suggestion for how I can keep all the measurements parallel, instead of splitting them into 2000 chunks? The interp1 line seems to be the only issue now (line 45 replacing 47).
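One possible direction (a hedged sketch, not tested against the attached files): linear interpolation can be written out by hand so that each measurement carries its own grid, keeping everything in one batch. Here X and V are hypothetical M-by-G arrays holding each row's arg_position and arg_wc (each row sorted ascending), and q is the M-by-1 vector of query positions:
function vq = rowwise_interp1(X, V, q)
% Linear interpolation along each row, with a different grid per row.
[M, G] = size(X);
idx = sum(X <= q, 2);            % index of last grid point <= q (implicit expansion)
idx = min(max(idx, 1), G - 1);   % clamp so idx and idx+1 are both valid columns
lin = sub2ind([M, G], (1:M)', idx);
x0 = X(lin);  x1 = X(lin + M);   % the next column is +M in column-major indexing
v0 = V(lin);  v1 = V(lin + M);
vq = v0 + (q - x0)./(x1 - x0).*(v1 - v0);
end
Unlike interp1, this linearly extrapolates at the grid edges rather than returning NaN, and since it uses only elementwise arithmetic and indexing, it should run unchanged on gpuArray inputs.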
Here are the results as they stand now:
gpu_run_test
real scenarios have N = ~1e8
with only ~2000 equilibriums ==> N = ~ 50000 per equilibrium
all arguments except the first will only change each equilibrium
N = 50000
Now doing normal inversion.
Total time for normal inversion:
Elapsed time is 0.567713 seconds.
Maximum difference between expected and achieved result: 5.5202e-06
Now doing GPU.
Total time for GPU inversions:
Elapsed time is 0.502490 seconds.
Maximum difference between expected and achieved result: 5.5202e-06
Done.
gpu_run_test
N = 10000000 ( = 1e7)
Now doing normal inversion.
Total time for normal inversion:
Elapsed time is 93.092429 seconds.
Maximum difference between expected and achieved normal result: 5.5202e-09 mm
Now doing GPU.
Total time for GPU inversions:
Elapsed time is 101.491855 seconds.
Maximum difference between expected and achieved gpu result: 5.5202e-09 mm
Done.
For a typical experiment, I'd like to get through N = 1e8 in under 5 minutes; currently it takes about 15 minutes. Does anyone have advice on the best way to get there from where things stand? Perhaps I just have to buy more CPUs?

Accepted Answer

Matt J
Matt J on 25 Mar 2024
Edited: Matt J on 25 Mar 2024
Well, whether you use the GPU or not, you don't have to loop across measurements. The operations in your example are trivial to vectorize over the measurements, so, assuming that's true of your actual code as well, you can gain a speed-up just by doing that.
Of course, running the vectorized code on the GPU should speed things up still further, which you can do just by making the variables inside measurement_analysis() into gpuArray variables. All of this is illustrated for your minimized example below:
measurement_results = measurement_analysis(raw_measurements); % pass in the total measurement array

function output_result = measurement_analysis(raw_measurements)
    sum_of_previous = gpuArray(0);
    output_result = gpuArray.nan(size(raw_measurements));
    for k = 1:width(raw_measurements)
        output_result(:,k) = dependent_function(raw_measurements(:,k), sum_of_previous);
        sum_of_previous = sum_of_previous + output_result(:,k);
    end
end

function output = dependent_function(input_vector, sum_of_previous)
    output = sum_of_previous.*sqrt(1 - input_vector./measurement_length);
    % Not sure where "measurement_length" is created. The example doesn't
    % show it.
end
  8 Comments
mack
mack on 1 Apr 2024
Thanks again for your help - I've taken your advice as best I can, and both tried to vectorise the code as much as possible (reduced function calls) and made it as explicit as possible how the logic dependency works.
It's possible the "calculate_inversion_phi_integral" function could be vectorised yet further, but it wasn't very clear to me how that might be possible... I know I'm still not very good at that, so there is perhaps still room for improvement.
Nonetheless, hopefully it's clear now how the dependency works. I am happy for more advice if you have any to offer.
mack
mack on 1 Apr 2024
It finally clicked what you meant by vectorise, and I did so over the measurements - it works really great now, GPU or otherwise. Thanks so much for the help - sorry I can be a bit dense when it comes to understanding certain things. :-)
If you have any suggestion for the interp1 issue, to keep it vectorised over all the measurements instead of just all the measurements per equilibrium, that would be seriously incredible. But I think for the moment this will hopefully suffice (unless I come back here and edit this to cry something otherwise).


More Answers (2)

Catalytic
Catalytic on 25 Mar 2024
Edited: Catalytic on 25 Mar 2024
It's hard to know exactly what's going on in your code, because every variable and function name has the string "measurement" in it, making them difficult to distinguish. Including the string "index" in your indices likewise clutters the code and makes it hard to read. Readers can see when a variable is an index by the way it is used; it isn't necessary to label it so.
All that said, if you have a function of a very general form that you want to parallelize, you probably want to use parfor, rather than the GPU.
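As a minimal sketch of that approach (illustrative names; analyze_one is a hypothetical stand-in for the per-measurement function):
gcp;                                          % ensure a parallel pool is running
results = zeros(size(measurements));          % one row per measurement
parfor i = 1:size(measurements, 1)
    results(i, :) = analyze_one(measurements(i, :));
end
Here both measurements and results are sliced variables, so each worker touches only its own rows and parfor can distribute the iterations freely.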
  6 Comments
Damian Pietrus
Damian Pietrus on 26 Mar 2024
If you do parallelize with parfor, it would be interesting to see your scaling on your local machine as you increase the number of workers. If you find that the problem does improve with additional workers, you may wish to see if you have an HPC cluster with MATLAB Parallel Server available to you so you can scale across many machines.
mack
mack on 31 Mar 2024
I am working to parfor-ise my code as a separate issue, which I will raise soon, once I similarly make a minimum example and tidy it up...
Perhaps it is easy to parfor-ise the example I have written, but considering how simple the calculation is, I expect it will work better with larger chunks fed to each thread. There is also a spot earlier in my code, a calculation which takes ~5 minutes, which it would be nicer to have parallel too.
At the moment I think I have too many loops-inside-of-loops for parfor to work happily. :-( But I will let you know here once I have results to compare (or another issue to address). Getting this working is my highest priority, and I think my boss's too...



Joss Knight
Joss Knight on 29 Mar 2024

If your calculation is truly sequential then by definition you cannot parallelize along the sequence. But it sounds like you definitely can parallelize across measurements. Work sequentially down the columns but compute each element of each row at the same time.

Often, seemingly sequential operations, such as a cumulative sum, are in fact parallelizable, because they can be computed using a reduction tree in which groups of values are reduced with their neighbours, repeatedly, until the final result is found. If your algorithm has this structure and it cannot be implemented using the standard vectorized calculations available in MATLAB (like cumsum or accumarray), then you may be left with no alternative but to write your own CUDA algorithm and integrate it into MATLAB with mexcuda.
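To illustrate with the cumulative-sum case mentioned above: the recurrence below looks inherently sequential, yet it is exactly a prefix sum, so it collapses to one vectorized (and GPU-friendly) call:
x = rand(1, 50);          % some input vector
% Sequential form: s(k) depends on s(k-1)
s = zeros(size(x));
s(1) = x(1);
for k = 2:numel(x)
    s(k) = s(k-1) + x(k);
end
% Equivalent scan form, which parallelizes internally:
s2 = cumsum(x);           % matches s (up to round-off)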

  2 Comments
mack
mack on 31 Mar 2024
Hello,
Thank you for your reply. I've edited the original post with attachments of two files that give a concrete minimum working example. I will paste the comment I made elsewhere regarding the most difficult part of this algorithm:
In particular, this inversion procedure requires interpolating the "wc" variable at each newly-solved position in order to solve for the next position in that particular sweep. This dependency is why there is no simple invertible matrix we can use to directly get the positions, and why it must be done algorithmically (actually, there is another polarisation which is effectively the same as setting wc = 0 everywhere, and for that polarisation it is possible to directly solve for the answer by inverting a matrix...)
Nonetheless, all of the manipulations in this function are extremely straightforward (even the interp1 can be changed to something simpler if this function is not acceptable)... And so I'm still praying there is a chance I can get this to work (a lot) faster.
mack
mack on 2 Apr 2024
Edited: mack on 2 Apr 2024
Hello, I vectorised the code across the measurements, and it works much faster for both CPU and GPU computations, and can do N = ~1e7 measurements in about 100 seconds. (See the working example I've attached to my original post.)
If I try 1e8 at once the memory runs out, for both CPU and GPU computations. Nonetheless, the GPU computation is still a shade slower (110 s instead of 100 s), so I'm wondering if there is a way to speed this up?
100 seconds x 10 for a full experiment of N = 1e8 is still about 15 minutes, and I was hoping to reduce this to under 5. I just want to check I'm not missing anything else obvious before I suggest more CPUs or something.
Is the interp1 line a big issue? I could rewrite line 45 in gpuInvert with something that doesn't use interp1 (like a simple weighted average of the closest elements, or even just the closest element, as a small approximation here is not a concern).
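A hedged sketch of that nearest-element idea, assuming arg_position is uniformly spaced (query_positions is a hypothetical name for the positions being looked up):
dx  = arg_position(2) - arg_position(1);                  % assumes a uniform grid
idx = round((query_positions - arg_position(1))./dx) + 1; % nearest grid index
idx = min(max(idx, 1), numel(arg_wc));                    % clamp to the valid range
wc_at_query = arg_wc(idx);                                % vectorized gather, no interp1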


Release: R2022b
