Parfor loop just hangs, CPU usage goes to zero
Hi all. Here is a sample code of what I am attempting to run.
parfor i = 1:num
    answer(:,i) = someFunction(someData(:,i));
end
Key information: "someFunction" is a C++ mex file. "someData" is the data of a memmapfile (memmapfilename.data), used because the file is too large to be loaded onto each worker.
Oddly, the parfor loop just hangs, the CPU usage goes to zero, and when I CTRL+C, here is what I get:
Operation terminated by user during distcomp.remoteparfor/getCompleteIntervals (line
In parallel_function>distributed_execution (line 820)
[tags, out] = P.getCompleteIntervals(chunkSize);
In parallel_function (line 587)
R = distributed_execution(...
This isn't an issue if I replace the "parfor" with a simple "for" - everything works fine. What seems to happen is that some of the workers become unresponsive. After the above issue is encountered, even running a simple command such as
will return "2" on only some, but not all, workers.
Any help would be great. A fresh re-installation did not help. Validation for "parpool" passed.
Dave Behera on 24 Mar 2016
It seems that there is a deadlock when the workers try to access the file through the same object (the one you got from memmapfile). As a result, progress stalls with zero CPU usage and no abort message.
Can you try creating a separate memmapfile object within each parfor iteration and passing it to the someFunction function? This may make the file access thread-safe.
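A minimal sketch of that suggestion, assuming the file holds a plain double matrix. The temporary file, the sizes nRows and num, and the "2 * col" stand-in for someFunction are all hypothetical and only there to make the example self-contained:

```matlab
% Hypothetical setup: write a small double matrix to a temporary binary
% file. In the original post the data file already exists on disk.
nRows = 4;
num   = 8;
data  = rand(nRows, num);
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, data, 'double');
fclose(fid);

answer = zeros(nRows, num);
parfor i = 1:num
    % Each iteration creates its own read-only map, so no memmapfile
    % object is shared between the client and the workers.
    m = memmapfile(fname, ...
        'Format', {'double', [nRows num], 'data'}, ...
        'Writable', false);
    col = m.Data.data(:, i);
    % Stand-in for the poster's mex call someFunction(col).
    answer(:, i) = 2 * col;
end

% Clean up the temporary file.
delete(fname);
```

Only the file name is sent to the workers here, not the memmapfile object itself. Whether this removes the hang in the original setup is not confirmed by the thread.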
Also, could you try the same workflow with spmd?
arvid Martens on 9 Jan 2018
I noticed that the problem started to occur after I updated the drivers of the GPUs that are being used during the calculations. Rolling back the drivers resolved the problem. However, new GPU hardware is on its way, as the current ones are pretty old. So I hope the problem is resolved by then.
Is there a way to throw an error when this stalling occurs? I could then add error handling to reduce the time lost to the stall.
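One possible approach, offered only as a sketch: parfor itself has no timeout, but the same work can be submitted with parfeval and collected with fetchNext, which accepts a timeout in seconds and returns empty outputs if nothing arrives in time. The value of timeoutSec, the error identifier, and the "work" function (a stand-in for the real computation) are assumptions for illustration:

```matlab
% Assumes Parallel Computing Toolbox; gcp starts a pool if none is open.
pool       = gcp();
num        = 8;
timeoutSec = 60;           % hypothetical limit on waiting for the next result
work       = @(k) k^2;     % stand-in for someFunction(someData(:,k))

% Submit one asynchronous task per iteration.
futures(1:num) = parallel.FevalFuture;
for k = 1:num
    futures(k) = parfeval(pool, work, 1, k);
end

% Collect results as they finish; raise an error if nothing arrives in time.
results = nan(1, num);
for n = 1:num
    [idx, value] = fetchNext(futures, timeoutSec);
    if isempty(idx)
        cancel(futures);   % stop the tasks that are still queued or running
        error('stallCheck:timeout', ...
            'No result within %g s; %d of %d tasks finished.', ...
            timeoutSec, n - 1, num);
    end
    results(idx) = value;
end
```

The error can then be caught with try/catch around this block, for example to restart the pool and resubmit. This detects a stall but does not explain or fix its cause.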
DeepSea on 15 Aug 2021
I was stuck on this problem for a couple of weeks and fixed it by removing the "continue" statement inside an if block in a for-loop.
continue; % Avoid using "continue"