During a parfor-loop, I suddenly get the error "unable to read file"

parfor k = 1:100000
    % ... other work ...
    tmpStruct = load(filename);
    % ... other work ...
end
I have 3 scripts like the one above, running on 3 different nodes of a cluster.
After some iterations, one job gets the error "unable to read file, no such file or directory". This is confusing, since the file does exist and the other two jobs can read it.
I thought this might be caused by the limited number of open file handles on the Linux system, but I don't understand why the load() function would be related to file handles.
And if they are related, how can I avoid this "limited file handle" problem? I tried increasing the file-handle limit on Linux, but it seems the limit is always exceeded when I run several jobs together.
By the way, I am certain that I did not use fopen() anywhere in my script.

9 Comments

If workers are trying to read the same file at the same time, that will most likely cause an issue. Are you loading the same file every time, and does anything in it change so that you have to load it on every iteration?
Yes, I am loading the same file.
No, the content of the file does not change. The file is a look-up table: I pre-calculated some values and saved them, so in the parfor-loop I only need to supply an index and retrieve the result from the table.
I thought it would be safe to read the same file at the same time. Is that troublesome?
It can be troublesome, yes. I can't tell exactly how you should solve your issue; someone with more parallel-computing expertise may help. Try loading the file into a table (or variable) before the parfor call and adjusting your code accordingly. That will also save you a lot of computing time, because you won't have to load the file on every iteration. If that doesn't work, see parfeval and spmd.
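A minimal sketch of the suggestion above, assuming the MAT-file contains a variable named look_up_table and that each iteration only indexes into it (the variable names and the indexing are illustrative, not from the original post):

```matlab
% Load the file once on the client, before the parfor loop.
tmpStruct = load(filename);
look_up_table = tmpStruct.look_up_table;

result = zeros(1, 100000);
parfor k = 1:100000
    % look_up_table is a broadcast variable: MATLAB sends one copy
    % to each worker, so no worker touches the file during the loop.
    result(k) = look_up_table(k);
end
```

Because the table is broadcast once per worker rather than read 100,000 times, this removes both the file-handle pressure and the repeated disk I/O.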
Thanks for your suggestion on how to modify my code, but it won't help here. For brevity, I simplified my problem in the original question; the pseudo-code below is closer to the real situation.
parfor k = 1:100000
    lsqcurvefit(@my_obj_function, ...);
end

function F = my_obj_function(x, xdata)
    tmpStruct = load(filename);
    look_up_table = tmpStruct.look_up_table;
    % ... use look_up_table to compute F ...
end
So I cannot simply load this look-up table outside the parfor-loop and use it inside. Actually, there is one way to load outside and use inside: pass look_up_table as a parameter of my_obj_function. But I thought that was somewhat inelegant.
I have found another way to avoid using load(): I wrapped the look-up table in a function call. Although a function call is slightly slower than load(), I'll use it to avoid the annoying file-handle problem.
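One way to "wrap the look-up table in a function call" is to cache it in a persistent variable, so each worker reads the file at most once. The function name, file name, and variable name below are illustrative; the original post does not show this code:

```matlab
function T = get_look_up_table()
    % Cache the table in a persistent variable so load() runs
    % only on the first call within each worker process.
    persistent cachedTable
    if isempty(cachedTable)
        s = load('look_up_table.mat');   % assumed file name
        cachedTable = s.look_up_table;   % assumed variable name
    end
    T = cachedTable;
end
```

Inside the objective function, `look_up_table = get_look_up_table();` would then replace the per-iteration load() call. Each worker still opens the file once, but no longer on every iteration.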
I would appreciate it if you could tell me why using load() is troublesome. This is really confusing.
I'll take back my statement that passing look_up_table as a parameter of my_obj_function is inelegant. Passing it as a parameter is less error-prone and much easier to understand.
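The parameter-passing approach can be sketched with an anonymous function that captures the table; the objective-function signature, starting point, and data names here are illustrative placeholders, since the original post does not show them (see the lsqcurvefit documentation for the exact form it expects):

```matlab
% Load the table once, before the loop.
tmp = load('look_up_table.mat');      % assumed file name
look_up_table = tmp.look_up_table;

params = cell(1, 100000);
parfor k = 1:100000
    % The anonymous function captures look_up_table, so the
    % objective never needs to call load() itself.
    obj = @(x, xdata) my_obj_function(x, xdata, look_up_table);
    params{k} = lsqcurvefit(obj, p0, xdata, ydata);
end

function F = my_obj_function(x, xdata, look_up_table)
    % Use look_up_table directly; no file access inside the loop.
    F = ...;   % model evaluation, elided in the original post
end
```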
Thanks, Mario.
You could also look at Composite and parallel.pool.Constant. I think it depends on the file system and the file format. I still think you should avoid load entirely, because it is inefficient.
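A short sketch of the parallel.pool.Constant suggestion: the constant's build function runs once on each worker, and every iteration then reuses the cached value. File and variable names are assumptions, not from the original post:

```matlab
% Build the struct returned by load() once per worker.
C = parallel.pool.Constant(@() load('look_up_table.mat'));

parfor k = 1:100000
    % C.Value is the worker-local copy; no file access after the
    % first use on each worker.
    look_up_table = C.Value.look_up_table;
    % ... use look_up_table ...
end
```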
You mentioned that 3 nodes of the cluster are running this. The question is: how do they load the file? Do you supply a full or a relative path? Does each node have its own copy of the file?
The error message "unable to read file, no such file or directory" implies that the file is either corrupted, that MATLAB cannot read it (it lacks read access), or that the path to the file is incorrect.
I don't know how load works internally, but when a file is opened, access to it may or may not be locked for other processes. Even though loading the file takes almost no time, you have multiple workers that may access the same file at the same time (which might be problematic, I can't tell for sure) and cause such an error.
Yes, I have completely avoided using load(), and the error has not happened since.
The file is stored on a network drive (I am not sure whether it's called a network drive or something else; anyway, it's a shared drive). All 3 nodes have access to this drive; none of them has a local copy, so they all read the file from the drive.
I used a relative path and it worked. The file is always there; the "unable to read" error happened in the middle of a parfor-loop.
I suspect it is related to some unknown behavior of load() that only shows up under heavy parallelism.
Search this site for your issue; I have seen comments that network drives can cause read-access problems. You could try creating a copy of the file for each node, though that still won't completely solve the problem, since each node has multiple workers trying to access the file. That's why I recommend avoiding load, at least inside a parfor loop.
Yes, using load() here is time-consuming and error-prone.


Answers (0)

Release

R2018a

Asked:

on 12 Dec 2020

Commented:

on 14 Dec 2020
