Why do workers keep aborting during parallel computation on a cluster?

I keep getting the warning
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)
when I run a simulation that has a parfor loop on the cluster. I noticed that the workers abort execution one after another, and that this seems to happen more often on the cluster than on my PC.
I would like to know the reason for this issue, and whether there is a way to avoid it.
Thanks.

19 Comments

Monte Carlo simulation in MATLAB, not Simulink.
According to this answer, it might be related to a worker crash.
matlab_crash_dump files might be stored in the JobStorageLocation of the parallel workers.
c=parcluster();
c.JobStorageLocation
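If a worker did crash, a dump file may have been left behind. A minimal sketch to search for them (the recursive `**` pattern and the `matlab_crash_dump` file-name prefix are assumptions; dumps can also land in the system temp directory):

```matlab
% Sketch: look for worker crash dump files under the job storage location.
c = parcluster();
dumps = dir(fullfile(c.JobStorageLocation, '**', 'matlab_crash_dump*'));
for k = 1:numel(dumps)
    fprintf('%s\n', fullfile(dumps(k).folder, dumps(k).name));
end
```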
Thanks Kojiro!
Based on the description in the linked answer, I checked the log files of the jobs (they appear as job#.log), but they are all empty. I am not sure whether setting the SpmdEnabled flag to false would help. I suspect it is a communication issue that forces the workers to abort.
Does your code have file I/O? For example, save.
Parallel workers might crash if multiple workers try to write to the same file.
Yes, I use save after the parfor loop ends. Isn't it only the body of the parfor loop that gets distributed?
Please correct me if I got this wrong.
No, I meant save inside the parfor loop. But since you're using save after the parfor loop, it's safe.
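For reference, save cannot be called directly inside a parfor body anyway (it violates parfor transparency); the usual workaround is to wrap it in a helper so each iteration writes to its own file and no two workers ever touch the same one. A sketch, where parsave is a local helper and heavyComputation is a placeholder for the real work:

```matlab
% Sketch: collision-free saving from a parfor loop. Each iteration
% writes its own .mat file, so workers never contend for one file.
parfor n = 1:100
    result = heavyComputation(n);                 % placeholder for the real work
    parsave(sprintf('result_%03d.mat', n), result);
end

function parsave(fname, result)
    % Wrapping save in a function keeps the parfor body transparent.
    save(fname, 'result');
end
```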
Did you try changing SpmdEnabled option to false?
parpool('SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Yes, it is false, but I still see the same warning.
OK. Does this occur if you request a smaller pool of workers?
Such as,
parpool(2, 'SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
I haven't tried a pool that small, but in some cases the loop keeps running with one remaining worker (e.g., 1 out of 12 or 24), and other times all workers abort the parfor execution.
Does your cluster have enough resources?
If you're on Linux, running this in a terminal
ulimit -a
shows the resource limits (max processes, etc.).
Yes, the resources are there, but it depends on availability, since many people at my university use the cluster. I can reserve multiple compute nodes on the Slurm scheduler for a single job, or for several jobs.
@Kojiro Saito could putting parfor at the outermost level be the reason?
for example
parfor i=1:100
    % do something
    for l=1:1000
        % do something
    end
    for j=1:100
        % do another thing
    end
end
I don't think so. That looks like a usual script.
Are you able to check SLURM's log files?
I found oom (out of memory) errors, but after increasing the allocated memory I did not find any other errors in the Slurm logs.
In the /.matlab/local_cluster_jobs directory, job.log is empty and the file job.metadata.text contains only 'concurrent'. I also want to add that my cluster profile is local only; that is, I don't have MATLAB Parallel Server, which would allow a cluster profile using Slurm or the MATLAB scheduler. So what I did was reserve the resources on the HPC via Slurm and, once they were granted, run MATLAB locally on those resources.
I understand. It was related to a memory error. As you mentioned, increasing the allocated memory, for example with "--mem-per-cpu=2G" as an sbatch option, should solve it.
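A sketch of such an sbatch script, requesting more memory per CPU so the MATLAB workers are less likely to be killed by the OOM killer. The job name, resource values, module name, and script entry point are examples, not taken from this thread:

```shell
#!/bin/bash
# Example sbatch script: the memory request is the key line here.
#SBATCH --job-name=parfor_sim
#SBATCH --nodes=1
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=4G
#SBATCH --time=04:00:00

module load matlab               # site-specific; adjust to your cluster
matlab -batch "fuction_pa1"      # run the simulation non-interactively
```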
That is not always the case. I found it effective at times, and at other times it was just the same. I wonder if that is related to the cluster being very heterogeneous.
A heterogeneous cluster could be a cause. This link is the system requirements page for Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval, spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
Interesting point! I think this is the reason in my case. Thank you @Kojiro Saito


 Accepted Answer

A heterogeneous environment could be the cause of this issue. This link is the system requirements page for Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval, spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."

2 Comments

I forgot to ask: are there ways to work around this issue? Would choosing compute nodes on the cluster that have the same architecture suffice? If so, how do I do that with Slurm?
If you know the names of the nodes that are homogeneous, you can specify them with sbatch. For example, if node0 to node4 are the same, you can use the nodelist option (or -w for short).
sbatch --nodelist node[0-4] yourscript.sh

