parallel processing for readtable function

Question

sermet OGUTCU on 18 Aug 2021


Direct link to this question

https://se.mathworks.com/matlabcentral/answers/1436217-parallel-processing-for-readtable-function

Commented: Walter Roberson on 19 Aug 2021

ps = parallel.Settings;
ps.Pool.AutoCreate = false; % do not create a parpool automatically when a parfor is encountered
ps.Pool.IdleTimeout = Inf;  % never shut the parpool down for being idle
parfor j = 1:10
    % read one text file per iteration, skipping its header lines
    tCOD{j,:} = readtable(full_file_name(j,:), 'FileType', 'text', ...
        'HeaderLines', end_of_header_line(j), 'ReadVariableNames', false);
    tCOD{j,:} = [];  % clear the entry again after reading
end

With 10 files of about 100 MB each, the run times of for and parfor differ by only a couple of seconds. Is there something wrong with the parfor steps above? Is there a more suitable parallel processing approach that can be applied?

If the multiple "full_file_name(j,:)" files could be read simultaneously (using multiple cores) rather than sequentially, the speed of readtable could be increased significantly.


Answer 1

Edric Ellis on 19 Aug 2021


Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/1436217-parallel-processing-for-readtable-function#answer_770404


You've manually disabled the AutoCreate for parallel pools - I presume you're manually creating a pool with a separate parpool statement.
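For reference, a minimal sketch of what creating the pool by hand might look like when AutoCreate is off. The default cluster profile is an assumption here; the question does not say which profile or pool size is in use.

```matlab
% Sketch only: open a pool explicitly, because AutoCreate is disabled.
% Assumption: the default cluster profile is acceptable.
pool = gcp('nocreate');      % returns empty if no pool is currently running
if isempty(pool)
    pool = parpool();        % start a pool with the default profile
end
fprintf('Pool is running with %d workers\n', pool.NumWorkers);
```

If no pool is open and AutoCreate is false, a parfor loop simply runs serially in the client, which by itself would explain identical for and parfor timings.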

Whether or not parfor can go faster than a serial for loop depends on your underlying hardware for this case. It might simply be that the limiting factor is your disk drive, and trying to read from it multiple times simultaneously does not allow any speedup.
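One way to check this on a given machine is to time both loops on the same files. The following is only a sketch: the file names, file count, and file sizes are hypothetical stand-ins generated in a temporary folder, not the files from the question.

```matlab
% Sketch: compare serial and parallel read times on hypothetical test files.
nFiles = 4;                                   % assumption: small test set
files = cell(nFiles, 1);
for k = 1:nFiles
    files{k} = [tempname '.txt'];             % hypothetical temporary file
    writetable(array2table(rand(2000, 5)), files{k});
end

tSerial = tic;
for k = 1:nFiles
    T = readtable(files{k}, 'FileType', 'text');
end
serialTime = toc(tSerial);

tParallel = tic;
parfor k = 1:nFiles                           % runs serially if no pool is open
    T = readtable(files{k}, 'FileType', 'text');
end
parallelTime = toc(tParallel);

fprintf('for: %.2f s, parfor: %.2f s\n', serialTime, parallelTime);
cellfun(@delete, files);                      % remove the temporary files
```

Files this small may be served from the operating system's cache, so the comparison is only meaningful when repeated with files of realistic size.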

You could try manually running multiple copies of MATLAB and calling readtable from each of them simultaneously to see if they slow down when there is contention for disk access. I.e. run something like this in multiple copies of MATLAB:

while true
    t = tic();
    readtable(args..);
    toc(t)
end

I suspect that you will see that as you start more copies of MATLAB, the toc time will increase.
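A bounded variant of that loop, which records the timings instead of printing them forever, could look like the sketch below. The test file and the repetition count are hypothetical; in practice one of the real 100 MB files would be substituted.

```matlab
% Sketch: time repeated reads of one file; run this in several MATLAB sessions at once.
testFile = [tempname '.txt'];                 % hypothetical stand-in for a real file
writetable(array2table(rand(5000, 5)), testFile);

nReps = 5;                                    % assumption: a handful of repetitions
elapsed = zeros(nReps, 1);
for r = 1:nReps
    t = tic();
    T = readtable(testFile, 'FileType', 'text');
    elapsed(r) = toc(t);
end
fprintf('median read time: %.3f s\n', median(elapsed));
delete(testFile);                             % clean up the temporary file
```

Comparing the median from a single session with the medians obtained while two, three, or four sessions run at once shows whether the reads are contending for the disk.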

1 Comment

Walter Roberson on 19 Aug 2021

As a generalization: if you are not using server-quality hardware components, then optimal utilization is typically two processes per controller, not per drive. Any more and you are typically filling the memory channel between the controller and the rest of the hardware.

If you have only one drive per controller, then optimal utilization is typically two processes. Controllers can often be moving drive heads to position for the next read at the same time that the previous data is being transferred, so you want commands from multiple processes to be queued... but not too many processes.

The considerations are notably different for SSD.
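To try the two-processes guideline without resizing the pool, the optional second argument of parfor can cap the number of workers used by one loop. This is a sketch under the assumption of a single spinning drive on one controller; the files are hypothetical temporary files.

```matlab
% Sketch: cap a parfor loop at two workers (assumed: one drive, one controller).
maxWorkers = 2;
nFiles = 4;                                   % assumption: small test set
files = cell(nFiles, 1);
for k = 1:nFiles
    files{k} = [tempname '.txt'];             % hypothetical temporary file
    writetable(array2table(rand(100, 3)), files{k});
end

nRows = zeros(nFiles, 1);
parfor (k = 1:nFiles, maxWorkers)             % at most maxWorkers workers are used
    T = readtable(files{k}, 'FileType', 'text');
    nRows(k) = height(T);                     % keep only a small summary per file
end
cellfun(@delete, files);                      % remove the temporary files
```

Timing this loop for maxWorkers values of 1, 2, and 4 on the real files would indicate where the drive or controller saturates on a particular machine.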

