efficient way to process millions of files individually

I'd like to process 10 million images on my hard drive, generate a measurement for each, and store the results in a matrix. I wonder what the most efficient way to do this is?
Doing it in a for loop is straightforward, but I guess it will take forever to finish...
  2 Comments
John D'Errico on 6 Dec 2018
Edited: John D'Errico on 6 Dec 2018
Big problems take big time, but computers are not infinitely fast nor infinitely powerful. So the sooner you start the task off, the sooner you will finish.
You can probably gain some throughput with parallel processing, depending on how many cores you have available. But remember that may then leave you with a possible bottleneck in your disk access speed. So you need to make that as fast as possible too, which suggests you should use SSD storage for those files.
You can tell the ultimate limits of what you can do easily enough.
  1. Measure the time needed to read each file.
  2. Measure the time needed to process that file.
Multiply each by 1e7, then add. That is the fundamental limit you are faced with. So unless you can speed up one or the other of those operations, you can do no better.
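For example, a minimal timing sketch (the file name and computeMeasurement are placeholders for one of your actual images and your own measurement function):

fname = 'someImageFile.jpg';                  % placeholder: one representative image
tRead = timeit(@() imread(fname));            % seconds to read one file
img   = imread(fname);
tProc = timeit(@() computeMeasurement(img));  % seconds to process one file
totalSeconds = 1e7 * (tRead + tProc);         % the fundamental limit
fprintf('Estimated total: %.1f hours\n', totalSeconds/3600);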
Walter Roberson on 6 Dec 2018
Adding the two might not be optimal when the computation is inherently serial, or when it can be parallelized onto (number of cores) minus (number of disk controllers) or fewer cores, especially if it can be done with (cores divided by controllers) minus one, or fewer. In such cases you might be able to gain substantially from overlapping reads with computation. I would have to think more about the formulae to work out the total time.
But time per file times number of files divided by number of controllers would be a lower bound on total time.
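As a sketch of that bound, with hypothetical numbers (nControllers is an assumption; tRead and tProc come from measurements like those above):

tPerFile     = tRead + tProc;   % per-file time, measured as above
nFiles       = 1e7;
nControllers = 2;               % hypothetical: files split across two controllers
lowerBound   = tPerFile * nFiles / nControllers;   % seconds, best case with perfect overlap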


Answers (1)

Walter Roberson on 6 Dec 2018
Typically the biggest barrier is the speed of reading the files.
For any one drive, it is typically more efficient to read large files than many short files (unless you have a lot of fragmentation and a mediocre controller).
Drive speed and interface speeds can make significant differences. Even for SSD drives, quality matters. For non-SSD drives that are not targeted at enterprise use, you would prefer something designed for USB 3.1, with appropriate cables and controllers.
You would prefer to have the files split among multiple drives, preferably on multiple controllers.
You might think of using parallel processing. In the single-drive case that will not help, unless the calculation to process each file takes longer than reading the file and the calculation is not being automatically multithreaded.
If you have multiple drives on different controllers, then parallel processing can potentially increase performance (provided each worker is fetching from a different controller).
Let me emphasise again: if you have only one drive and your processing is I/O bound, then parallel processing tends to make things worse due to increased contention on the bottleneck resources.
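A minimal sketch of the multi-drive case, assuming the files are split between two drives on separate controllers (the D:\images and E:\images paths and computeMeasurement are placeholders, and note that parfor does not guarantee each worker stays on one drive):

% Gather files from both drives (placeholder paths)
files = [dir('D:\images\*.jpg'); dir('E:\images\*.jpg')];
n = numel(files);
measurements = zeros(n, 1);
if isempty(gcp('nocreate'))
    parpool(2);                 % one worker per drive/controller
end
parfor k = 1:n
    img = imread(fullfile(files(k).folder, files(k).name));
    measurements(k) = computeMeasurement(img);   % placeholder measurement function
end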
