efficient way to process millions of files individually

I'd like to process 10 million images on my hard drive, generate a measurement for each, and store the results in a matrix. I wonder what the most efficient way to do this is?
Doing it in a for loop is straightforward, but I guess it will take forever to finish...
  2 Comments
John D'Errico on 6 Dec 2018
Edited: John D'Errico on 6 Dec 2018
Big problems take big time, but computers are not infinitely fast nor infinitely powerful. So the sooner you start the task off, the sooner you will finish.
You can probably gain some throughput with parallel processing, depending on how many cores you have available. But remember that may then leave you with a possible bottleneck in your disk access speed. So you need to make that as fast as possible too, which suggests you should use SSD storage for those files.
You can tell the ultimate limits of what you can do easily enough.
  1. Measure the time needed to read each file.
  2. Measure the time needed to process that file.
Multiply each by 1e7, then add. That is the fundamental limit you are faced with. So unless you can speed up one or the other of those operations, you can do no better.
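For example, a minimal timing sketch (the file name and computeMeasurement are placeholders for one of your actual images and your own measurement function):

fname = 'someImageFile.jpg';                  % placeholder: one representative image
tRead = timeit(@() imread(fname));            % seconds to read one file
img   = imread(fname);
tProc = timeit(@() computeMeasurement(img));  % seconds to process one file
totalSeconds = 1e7 * (tRead + tProc);         % the fundamental limit
fprintf('Estimated total: %.1f hours\n', totalSeconds/3600);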
Walter Roberson on 6 Dec 2018
Adding the two might not be optimal when the computation is inherently serial, or when it can be parallelized onto (number of cores) minus (number of disk controllers) or fewer cores, especially if it can be done with (cores divided by controllers) minus one, or fewer. In such cases you might be able to gain substantially from overlapping reads with computation. I would have to think more about the formulae to work out the total time.
But time per file times number of files divided by number of controllers would be a lower bound on total time.
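As a sketch of that bound, with hypothetical numbers (nControllers is an assumption; tRead and tProc come from measurements like those above):

tPerFile     = tRead + tProc;   % per-file time, measured as above
nFiles       = 1e7;
nControllers = 2;               % hypothetical: files split across two controllers
lowerBound   = tPerFile * nFiles / nControllers;   % seconds, best case with perfect overlap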


Answers (1)

Walter Roberson on 6 Dec 2018
Typically the biggest barrier is the speed of reading the files.
For any one drive, it is typically more efficient to read large files than many short files (unless you have a lot of fragmentation and a mediocre controller).
Drive speed and interface speeds can make significant differences. Even for SSD drives, quality matters. For non-SSD drives that are not targeted at enterprise use, you would prefer something designed for USB 3.1, with appropriate cables and controllers.
You would prefer to have the files split among multiple drives, preferably on multiple controllers.
You might think of using parallel processing. In the single-drive case that will not help, unless the calculation to process each file takes longer than reading the file and the calculation is not being automatically multithreaded.
If you have multiple drives on different controllers, then parallel processing can potentially increase performance (provided each worker is fetching from a different controller).
Let me emphasise again: if you have only one drive and your processing is I/O bound, then parallel processing tends to make things worse due to increased contention on the bottleneck resources.
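A minimal sketch of the multi-drive case, assuming the files are split between two drives on separate controllers (the D:\images and E:\images paths and computeMeasurement are placeholders, and note that parfor does not guarantee each worker stays on one drive):

% Gather files from both drives (placeholder paths)
files = [dir('D:\images\*.jpg'); dir('E:\images\*.jpg')];
n = numel(files);
measurements = zeros(n, 1);
if isempty(gcp('nocreate'))
    parpool(2);                 % one worker per drive/controller
end
parfor k = 1:n
    img = imread(fullfile(files(k).folder, files(k).name));
    measurements(k) = computeMeasurement(img);   % placeholder measurement function
end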
