severe speed degradation with batch function on cluster

I run a Matlab script called "runMe.m" containing "parpool('local',16)" through a PBS job sent to a worker node (method 1). I get the same runtime as running the script locally on the compute node (method 2, for testing only).
My problem is when I try to call "runMe.m" using the batch function in Matlab from the head node (method 3). The only change is that "runMe.m" has "parpool('local',16)" commented out, because the pool is set up from the head node. Here's some pseudocode of the batch:
pbs = parcluster('singleNodeJob')
job = pbs.batch(@runMe,0,{},'Pool',15)
'singleNodeJob' is configured with 16 workers and the PBS input "-l nodes=1:ppn=16", same as method 1 PBS script. I'm unsure how to set "threads" in 'singleNodeJob' because "parpool('local',16)" clearly runs with multiple threads on my worker node.
"runMe.m" has parfor loops which run great when on a local machine (method 1 & 2) - say 5 minutes. But when run via batch (method 3), my runtime falls apart - say 20 minutes with 'singleNodeJob' threads set to 8, and 30 minutes if threads set to 1. "runMe.m" contains both parforloops and multi-threaded functions (i.e. corr).
To add context, the reason I'm trying to run method 3 is that I have sufficient toolbox licenses via this method, whereas with method 1 I run out of toolbox licenses.
Thank you for your help!

Answers (1)

Raymond Norris on 7 Apr 2022
Let me see if I have this right
  • Method #1 finishes runMe in ~5 minutes, using local scheduler
PBS jobscript
#!/bin/sh
#PBS -l nodes=1:ppn=16
module load matlab
matlab -batch runMe
runMe.m
function runMe
parpool("local",16);
parfor idx = 1:160
...
end
  • Method #2 finishes runMe in ~5 minutes, using local scheduler
% # Run MATLAB on the compute node
% qsub -I -l nodes=1:ppn=16
% module load matlab
% matlab
runMe
  • Method #3 finishes runMe in 20-30 minutes, using the PBS scheduler. singleNodeJob is configured with 16 workers and -l nodes=1:ppn=16 (see the profile sketch after the runMe.m listing below)
% # Run MATLAB on the head node. Preferred choice because you have a
% # limited number of MATLAB and Toolbox licenses, and would like to use
% # MATLAB Parallel Server licenses instead.
% module load matlab
% matlab
Submit job
pbs = parcluster('singleNodeJob');
job = pbs.batch(@runMe,0,{},'Pool',15);
runMe.m
function runMe
% parpool("local",16);
parfor idx = 1:160
% Note, we're only running with 15 workers, not 16
...
end
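For reference, a 'singleNodeJob' profile of this kind is typically configured roughly as follows. This is only a sketch: the parallel.cluster.Torque type, the ResourceTemplate value, and the storage path are assumptions based on your PBS/Torque setup.
% Hedged sketch of the 'singleNodeJob' profile (property values are assumptions)
c = parallel.cluster.Torque;                   % Torque/PBS cluster object
c.NumWorkers         = 16;                     % matches -l nodes=1:ppn=16
c.NumThreads         = 1;                      % see the NumThreads discussion below
c.ResourceTemplate   = '-l nodes=1:ppn=16';    % resources requested per job
c.JobStorageLocation = '/shared/matlab_jobs';  % placeholder shared path
saveAsProfile(c, 'singleNodeJob');             % make it available to parcluster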
A couple of thoughts:
  1. How are you measuring 5, 20, 30 minutes etc.?
  2. When using the local scheduler, you're using the resources already available to you in the PBS job. Therefore, starting the parallel pool will happen quicker. If you're running a PBS pool, you're submitting an "inner" job. Who knows how busy the queue is. It might take 10-15 minutes for this inner job to start. This could be the crux of the matter (the timing sketch after this list separates pool start-up time from the parfor time).
  3. When you set NumThreads, for example
pbs.NumThreads = 8;
This doesn't increase the core count of your PBS job. Therefore, you'll have 16 workers running, each with access to 8 comp threads, all running on the same 16 cores of the single node. Conversely, if you set NumThreads to 1, then all the non-parallel code will only get a single comp thread.
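One way to keep the measurements in points 1 and 2 unambiguous is to time pool start-up and the parfor section separately inside runMe. This is only a sketch with a placeholder loop body, not your actual code:
% Sketch: separates pool start-up time from parfor time inside runMe
function runMe
tPool = tic;
pool = gcp('nocreate');                 % batch-created pool (method 3) is reused
if isempty(pool)
    pool = parpool('local',16);         % only needed for methods 1 and 2
end
fprintf('Pool of %d workers ready after %.1f s\n', pool.NumWorkers, toc(tPool));
tLoop = tic;
parfor idx = 1:160
    % ... placeholder for the real work ...
end
fprintf('parfor section took %.1f s\n', toc(tLoop));
end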
  2 Comments
David L. on 7 Apr 2022
Thanks very much for responding Raymond. Yes, you are exactly spot on in how you wrote up my 3 cases. Thanks for your thoroughness. (we spoke last week btw).
  • 1. How are you measuring 5, 20, 30 minutes etc.?
I measure using the Matlab profiler in runMe.m. There are numerous subfunctions within runMe.m, and I check to verify how each is impacted. Some functions are multi-threaded but not parallel, some are parallel with parfor. I have to check each to find the right balance of workers and threads for my code.
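When runMe runs non-interactively (methods 1 and 3), the profiler output has to be captured to disk rather than viewed live, roughly along these lines (a sketch; the output folder name is arbitrary):
% Wrap the timed section of runMe so the profile report survives a batch run
profile on
% ... calls to the subfunctions being measured ...
profile off
stats = profile('info');
profsave(stats, 'runMe_profile');   % writes an HTML report to ./runMe_profile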
  • 2. When using the local scheduler, you're using the resources already available to you in the PBS job. Therefore, starting the parallel pool will happen quicker. If you're running a PBS pool, you're submitting an "inner" job. Who knows how busy the queue is. It might take 10-15 minutes for this inner job to start. This could be the crux of the matter.
For cases 1 & 2, the Matlab profiler starts in runMe.m after "parpool('local',16)" has executed. For case 3, the head node starts up the pool via the batch call before the profiler starts in runMe.m. In all cases, the pool creation time is not included in the runtime. I'm not competing with other users for the cluster resources, so queue wait isn't a contributor to my timing discrepancy.
Regarding your 3rd comment about threads, my confusion surrounds the 'local' parallel config, which defaults to 1 thread but clearly runs multithreaded on both my cluster (methods 1/2) and on workstations (800% CPU utilization in top). When I set a PBS/Torque parallel config to 1 thread, all functions within runMe.m clearly go single-threaded (100% CPU utilization in top). This difference between how the 'local' and 'Torque' parallel profiles behave may be central to the performance difference.
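One way to confirm what each worker is actually allowed to use, under either profile, is to query maxNumCompThreads on all workers of the open pool (a minimal sketch, assuming a pool is already running):
% Report the computational thread limit on the client and on every pool worker
fprintf('Client maxNumCompThreads: %d\n', maxNumCompThreads);
f = parfevalOnAll(@maxNumCompThreads, 1);   % run on every worker in the pool
workerThreads = fetchOutputs(f);            % one value per worker
fprintf('Worker maxNumCompThreads: %s\n', mat2str(workerThreads(:).'));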
Is there a way to use the Matlab parallel server license but still use the 'local' parallel config on the compute nodes?
Raymond Norris on 7 Apr 2022
Good to see you again, David :)
One advantage PBS Pro has over TORQUE is that you can differentiate how the cores of a node (really a chunk) can be divided up. For example, for 10 cores, you could assign 5 for MPI and 2 for OMP. TORQUE lacks this granularity.
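For instance, a PBS Pro chunk request along these lines splits 10 cores between MPI ranks and OpenMP threads (illustrative values only):
# PBS Pro only: 1 chunk of 10 cores, 5 MPI ranks, 2 OpenMP threads per rank
#PBS -l select=1:ncpus=10:mpiprocs=5:ompthreads=2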
At our next call, let's discuss an idea that will address using your MATLAB Parallel Server license in lieu of MATLAB/Toolbox licenses.
