Why do workers keep aborting during parallel computation on a cluster?
I keep getting the warning
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)
when I run a simulation with a parfor loop on the cluster. I noticed that workers abort execution one after another, and this seems to happen more often on the cluster than on my PC.
I would like to know the reason for this issue, and whether there is a way to avoid it.
Thanks.
19 Comments
Mario Malic
on 7 Dec 2020
What kind of simulation?
Muh Alam
on 7 Dec 2020
Kojiro Saito
on 8 Dec 2020
matlab_crash_dump files might be stored in the JobStorageLocation of the parallel workers.
c=parcluster();
c.JobStorageLocation
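To search that location for crash dumps, a minimal sketch (the recursive pattern and variable names are assumptions, not part of the original comment):
results = dir(fullfile(c.JobStorageLocation, '**', 'matlab_crash_dump*'));
% Print the full path of each crash dump found under the job storage folder
for k = 1:numel(results)
    disp(fullfile(results(k).folder, results(k).name))
end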
Muh Alam
on 9 Dec 2020
Kojiro Saito
on 9 Dec 2020
Does your code have file I/O? For example, save.
Parallel workers might crash if multiple workers try to write to the same file.
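A sketch of the safe pattern, assuming a placeholder computation: collect per-iteration results in a sliced variable and call save once after the loop, rather than having every worker write to one shared file.
results = cell(1, 100);
parfor n = 1:100
    results{n} = n^2;   % placeholder computation
    % Unsafe: calling save from inside parfor so all workers hit one file
end
save('results.mat', 'results')   % safe: a single save after the loop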
Muh Alam
on 9 Dec 2020
Kojiro Saito
on 10 Dec 2020
No, I meant save inside the parfor loop. But since you're using save after the parfor loop, it's safe.
Did you try changing SpmdEnabled option to false?
parpool('SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
on 10 Dec 2020
Kojiro Saito
on 10 Dec 2020
OK. Does this occur if you request fewer workers?
Such as,
parpool(2, 'SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
on 10 Dec 2020
Kojiro Saito
on 11 Dec 2020
Does your cluster have enough resources?
On Linux, from a terminal,
ulimit -a
shows the resource limits (max processes, etc.).
Muh Alam
on 14 Dec 2020
Muh Alam
on 3 Feb 2021
Kojiro Saito
on 3 Feb 2021
I don't think so. I think it is a usual script.
Are you able to check the SLURM's log file?
Kojiro Saito
on 4 Feb 2021
I understood. It was related to a memory error. As you mentioned, increasing the allocated memory, for example with "--mem-per-cpu=2G" in the sbatch options, would solve it.
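A sketch of where that option goes in an sbatch submission script. The job name, CPU count, module name, and script name are assumptions (the script name is taken from the error stack above); only the --mem-per-cpu line comes from the comment.

```shell
#!/bin/bash
#SBATCH --job-name=matlab_parfor   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # hypothetical worker count
#SBATCH --mem-per-cpu=2G           # memory per CPU, as suggested above

module load matlab                 # module name is an assumption
matlab -batch "fuction_pa1"        # script name from the error stack
```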
Muh Alam
on 6 Feb 2021
Kojiro Saito
on 7 Feb 2021
Heterogeneous hardware would be a cause. This link is about the system requirements of MATLAB Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval, spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
Muh Alam
on 8 Feb 2021