Process Large Set of Images Using MapReduce Framework and Hadoop

This example shows how to execute a cell counting algorithm on a large number of images using Image Processing Toolbox™ with MATLAB® MapReduce and Datastore. MapReduce is a programming technique for analyzing data sets that do not fit in memory. The example also uses MATLAB Parallel Server™ to run parallel MapReduce programs on Hadoop® clusters. The example shows how to create a subset of the images so that you can test your algorithm on a local system before moving it to the Hadoop cluster.

Download Sample Data

This part of the example shows how to download the BBBC005v1 data set from the Broad Bioimage Benchmark Collection. This data set is an annotated biological image set designed for testing and validation. The image set provides examples of in- and out-of-focus synthetic images, which can be used for validation of focus metrics. The data set contains almost 20,000 files. For more information, see this introduction to the data set.

At the system prompt on a Linux system, use the wget command to download the zip file containing the BBBC data set. Before running this command, make sure that your target location has enough space to hold the zip file (1.8 GB) and the extracted images (2.6 GB).

wget http://www.broadinstitute.org/bbbc/BBBC005/BBBC005_v1_images.zip

At the system prompt on a Linux system, extract the files from the zip file.

unzip BBBC005_v1_images.zip

Examine the image file names in this data set. They are constructed in a specific format to contain useful information about each image. For example, the file name BBBC005_v1_images/SIMCEPImages_A05_C18_F1_s16_w1.TIF indicates that the image contains 18 cells (C18) and was filtered with a Gaussian low-pass filter with diameter 1 and a sigma of 0.25x diameter to simulate focus blur (F1). The w1 identifies the stain used.
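As a quick sketch of how to recover this metadata programmatically, you can split the name on underscores. The field positions here are assumptions based on the naming pattern described above; the file name is one from the data set.

```matlab
% Parse the metadata encoded in a BBBC005 file name (sketch; field
% positions assumed from the naming pattern described above).
[~, name] = fileparts('SIMCEPImages_A05_C18_F1_s16_w1.TIF');
parts = strsplit(name, '_');
cellCount = str2double(parts{3}(2:end));   % C number: 18 cells
focusBlur = str2double(parts{4}(2:end));   % F number: blur level 1
stain     = parts{6};                      % stain identifier: 'w1'
```

The same extraction reappears later in the map function, which compares the encoded cell count against the count returned by the segmentation algorithm.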

View Image Files and Test Algorithm

This example shows how to view the files in the BBBC data set and test an algorithm on a small subset of the files using the Image Batch Processor app. The example tests a simple algorithm that segments the cells in the images. (The example uses a modified version of this cell segmentation algorithm to create the cell counting algorithm used in the MapReduce implementation.)

Open the Image Batch Processor app. From the MATLAB Toolstrip, on the Apps tab, in the Image Processing and Computer Vision group, click Image Batch Processor. You can also open the app from the command line using the imageBatchProcessor command.

Load the cell image data set into the app. In the Image Batch Processor app, click Load Images and navigate to the folder in which you stored the downloaded data set.

Specify the name of the function that implements your cell segmentation algorithm. To specify an existing function, type its name in the Function name field or click the folder icon to browse and select the function. To create a new batch processing function, click New and enter your code in the template file opened in the MATLAB editor. For this example, create a new function containing the following image segmentation code.

function imout = cellSegmenter(im)
% A simple cell segmenter

% Otsu thresholding
t  = graythresh(im);
bw = im2bw(im,t);

% Show thresholding result in app
imout = imfuse(im,bw);

% Find area of blobs
stats = regionprops('table',bw,{'Area'});

% Average cell diameter is about 33 pixels (based on random inspection)
cellArea = pi*(33/2)^2;

% Estimate cell count based on area of blobs
cellsPerBlob = stats.Area/cellArea;
cellCount = sum(round(cellsPerBlob));
disp(cellCount);
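To sanity-check the segmenter outside the app, you can call it directly on a single image at the command line. This is a sketch; the file name is one from the data set, and it assumes your current folder contains the extracted images.

```matlab
% Run the segmenter on one image and view the fused result (sketch).
im = imread('SIMCEPImages_A05_C18_F1_s05_w1.TIF');
imout = cellSegmenter(im);   % displays the cell count at the command line
imshow(imout)
```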

Use the mouse to select a few of the images displayed in the app, and then click Run to execute a test run of your algorithm.

Note

For this example, choose only images with the “w1” stain. The segmentation algorithm works best with these images.

Examine the results of running your algorithm to verify that your segmentation algorithm found the correct number of cells in the image. The names of the images contain the cell count in the C number. For example, the image named SIMCEPImages_A05_C18_F1_s05_w1.TIF contains 18 cells. Compare this number to the results returned at the command line for both images.

18

18

Test Your MapReduce Framework Locally: Prepare Data

This example shows how to set up, on your local system, a small-scale test of the large-scale processing you want to perform. Test your processing framework before running it on thousands of files. To do this, first create an image datastore to contain the images. MapReduce uses a datastore to process data in small chunks that individually fit into memory. For local testing, select a subset of the images in the datastore to enable a quicker, iterative development process. Once you have created the datastore, convert the sample subset of images into Hadoop sequence files, the format used by the Hadoop cluster.

Create an ImageDatastore. Because the cell segmentation algorithm implemented in cellSegmenter.m works best with the cell body stain, select only the files with the indicator w1 in their file names.

localDataFolder = '/your_data/broad_data/BBBC005_v1_images/';
w1FilesOnly     = fullfile(localDataFolder,'*w1*');
localimds       = imageDatastore(w1FilesOnly);
disp(localimds);
 ImageDatastore with properties:

      Files: {
             ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s01_w1.TIF';
             ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s02_w1.TIF';
             ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s03_w1.TIF'
              ... and 9597 more
             }
    ReadFcn: @readDatastoreImage

Note that there are still over 9000 files in the subset.

Subset this sample further, selecting every 100th file from the thousands of files in the data set. This brings the number of files down to a more manageable number.

localimds.Files = localimds.Files(1:100:end);
disp(localimds);
ImageDatastore with properties:

      Files: {
             ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s01_w1.TIF';
             ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A05_C18_F1_s01_w1.TIF';
             ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A09_C35_F1_s01_w1.TIF'
              ... and 93 more
             }
    ReadFcn: @readDatastoreImage

Repackage the subset of images into Hadoop sequence files. Note that this step simply changes the data from one storage format to another without changing the data value. For more information about sequence files, see Getting Started with MapReduce (MATLAB).

You can use the MATLAB mapreduce function to do this conversion. Create a map function and a reduce function, which you pass to the mapreduce function. To convert the image files to Hadoop sequence files, the map function should be a simple pass-through. For this example, the map function saves the image data as-is, using its file name as a key.

function identityMap(data, info, intermKVStore)
    add(intermKVStore, info.Filename, data);
end

Create a reduce function that converts the image files into a key-value datastore backed by sequence files.

function identityReduce(key, intermValueIter, outKVStore)
    while hasnext(intermValueIter)
        add(outKVStore, key, getnext(intermValueIter));
    end
end

Call mapreduce, passing your map and reduce functions. Before doing so, call the mapreducer function to specify where the processing takes place. To test your setup and perform the processing on your local system, specify 0. (When run locally, mapreduce creates a key-value datastore backed by MAT-files.) In the following code, the mapreducer function specifies that the operation should take place on your local system.

mapreducer(0);

matFolder = 'BBBC005_v1_subset_images_w1_mat';
localmatds = mapreduce(localimds, ...
    @identityMap, @identityReduce,...
    'OutputFolder', matFolder);
disp(localmatds);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  60% Reduce   0%
Map  70% Reduce   0%
Map  80% Reduce   0%
Map  90% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  10%
Map 100% Reduce  20%
Map 100% Reduce  30%
Map 100% Reduce  40%
Map 100% Reduce  50%
Map 100% Reduce  60%
Map 100% Reduce  70%
Map 100% Reduce  80%
Map 100% Reduce  90%
Map 100% Reduce 100%
  KeyValueDatastore with properties:

       Files: {
              ' .../results_1_12-Jun-2015_10-41-25_187.mat';
              ' .../results_2_12-Jun-2015_10-41-25_187.mat'
              }
    ReadSize: 1 key-value pairs
    FileType: 'mat'

Test Your MapReduce Framework Locally: Run Your Algorithm

This example shows how to test your MapReduce framework on your local system. After creating the subset of image files for testing, and converting them to a key-value datastore, you are ready to test the algorithm. Modify your original cell segmentation algorithm to return the cell count. (The Image Batch Processor app, where this example first tested the algorithm, can only return processed images, not values such as the cell count.)

Modify the cell segmentation function to return a cell count.

function cellCount = cellCounter(im)
% A simple cell counter

% Otsu thresholding
t  = graythresh(im);
bw = im2bw(im,t);

% Show thresholding result in app
% imout = imfuse(im,bw);

stats = regionprops('table',bw,{'Area'});

% Average cell diameter is about 33 pixels (based on random inspection)
cellArea = pi*(33/2)^2;

% Estimate cell count based on area of blobs
cellsPerBlob = stats.Area/cellArea;
cellCount = sum(round(cellsPerBlob));

Create map and reduce functions that perform the desired processing. For this example, create a map function that calculates the cell count error for a specific image. This function gets the actual cell count for an image from the file name coding (the C number) and compares it to the cell count returned by the segmentation algorithm.

function mapImageToMisCountError(data, ~, intermKVStore)

% Extract the image
im = data.Value{1};

% Call the cell counting algorithm
actCount = cellCounter(im);

% The original file name is available as the key
fileName = data.Key{1};
[~, name] = fileparts(fileName);
% Extract expected cell count and focus blur from the file name
strs = strsplit(name, '_');
expCount  = str2double(strs{3}(2:end));
focusBlur = str2double(strs{4}(2:end));

diffCount = abs(actCount-expCount);

% Note: focus blur is the key
add(intermKVStore, focusBlur, diffCount); 
end

Create a reduce function that computes the average error in cell count for each focus value.

function reduceErrorCount(key, intermValueIter, outKVStore)

focusBlur = key;

% Compute the sum of all differences in cell count for this value of focus
% blur
count = 0;
totalDiff = 0;
while hasnext(intermValueIter)
    diffCount = getnext(intermValueIter);
    count = count +1;
    totalDiff = totalDiff+diffCount;
end

% Average
meanDiff = totalDiff/count;

add(outKVStore, focusBlur, meanDiff);
end

Run the mapreduce job on your local system.

focusErrords = mapreduce(localmatds, @mapImageToMisCountError, @reduceErrorCount);

% Gather the result
focusErrorTbl = readall(focusErrords);
disp(focusErrorTbl);
averageErrors = cell2mat(focusErrorTbl.Value);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  75% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  13%
Map 100% Reduce  25%
Map 100% Reduce  38%
Map 100% Reduce  50%
Map 100% Reduce  63%
Map 100% Reduce  75%
Map 100% Reduce  88%
Map 100% Reduce 100%
    Key      Value  
    ___    _________

     1     [ 0.8333]
     4     [ 0.6667]
     7     [ 0.5000]
    10     [ 0.5000]
    14     [ 0.5000]
    17     [ 0.5000]
    20     [ 0.5000]
    23     [ 0.3333]
    26     [ 0.8333]
    29     [ 0.8333]
    32     [      3]
    35     [      7]
    39     [12.3333]
    42     [17.3333]
    45     [18.3333]
    48     [23.8333]

Inspect the results on the subset. The simple cell counting algorithm used here relies on the average area of a cell or group of cells. Increasing focus blur diffuses cell boundaries, increasing the apparent area of each blob. The expected result is for the error to grow with increasing focus blur. Plot the results.

bar(focusErrorTbl.Key, averageErrors);
ha = gca;
ha.XTick = sort(focusErrorTbl.Key);
ha.XLim  = [min(focusErrorTbl.Key)-2 max(focusErrorTbl.Key)+2];
title('Cell counting result on a test data set');
xlabel('Focus blur');
ylabel('Average error in cell count');

Run Your MapReduce Framework on a Hadoop Cluster

This example shows how to load all the image data into the Hadoop file system and run your MapReduce framework on a Hadoop cluster.

Load the image data into the Hadoop file system using the following shell commands. To run these commands, replace /your_data with the location of the downloaded data on your computer.

hadoop fs -mkdir /user/broad_data/
hadoop fs -copyFromLocal /your_data/broad_data/BBBC005_v1_images /user/broad_data/BBBC005_v1_images

Set up access to the MATLAB Parallel Server cluster. To run this command, replace 'your/hadoop/install' with the location on your computer.

setenv('HADOOP_HOME','/your/hadoop/install');
cluster = parallel.cluster.Hadoop;
cluster.HadoopProperties('mapred.job.tracker') = 'hadoop01glnxa64:54311';
cluster.HadoopProperties('fs.default.name') = 'hdfs://hadoop01glnxa64:54310';
disp(cluster);
% Change mapreduce execution environment to point to the remote cluster
mapreducer(cluster);

Convert all the image data into Hadoop sequence files. This is similar to what you did on your local system when you converted a subset of the images for prototyping. You can reuse the map and reduce functions you used previously.

% Use the internal Hadoop cluster into which the Broad Institute files
% were ingested
broadFolder = 'hdfs://hadoop01glnxa64:54310/user/broad_data/BBBC005_v1_images';

% Pick only the 'cell body stain' (w1) files for processing
w1Files = fullfile(broadFolder, '*w1*.TIF');

% Create an ImageDatastore representing all these files
imageDS = imageDatastore(w1Files);

% Specify the output folder.
seqFolder = 'hdfs://hadoop01glnxa64:54310/user/datasets/images/broad_data/broad_sequence';

% Convert the images to a key-value datastore.
seqds = mapreduce(imageDS, @identityMap, @identityReduce,'OutputFolder', seqFolder); 

Run the cell counting algorithm on the entire data set stored in the Hadoop file system using the MapReduce framework. The only change from the local version now is that the input and output locations are on the Hadoop file system.

% Output location for error count.
output = 'hdfs://hadoop01glnxa64:54310/user/broad_data/BBBC005_focus_vs_errorCount';
tic;
focusErrords = mapreduce(seqds, @mapImageToMisCountError, @reduceErrorCount,...
    'OutputFolder',output);
toc
% Gather result
focusErrorTbl = readall(focusErrords);
disp(focusErrorTbl);
averageErrors = cell2mat(focusErrorTbl.Value);

% Plot
bar(focusErrorTbl.Key, averageErrors);
ha = gca;
ha.XTick = sort(focusErrorTbl.Key);
ha.XLim  = [min(focusErrorTbl.Key)-2 max(focusErrorTbl.Key)+2];
title('Cell counting result on the entire data set');
xlabel('Focus blur');
ylabel('Average error in cell count');
Parallel mapreduce execution on the Hadoop cluster:
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map   7% Reduce   0%
Map  10% Reduce   0%
Map  12% Reduce   0%
Map  20% Reduce   0%
Map  23% Reduce   0%
Map  25% Reduce   0%
Map  28% Reduce   0%
Map  30% Reduce   0%
Map  32% Reduce   0%
Map  33% Reduce   0%
Map  38% Reduce   0%
Map  41% Reduce   0%
Map  43% Reduce   0%
Map  48% Reduce   0%
Map  50% Reduce   0%
Map  51% Reduce   5%
Map  53% Reduce   7%
Map  55% Reduce  10%
Map  56% Reduce  11%
Map  58% Reduce  11%
Map  62% Reduce  12%
Map  64% Reduce  12%
Map  65% Reduce  12%
Map  67% Reduce  16%
Map  69% Reduce  16%
Map  71% Reduce  16%
Map  74% Reduce  17%
Map  75% Reduce  17%
Map  76% Reduce  17%
Map  79% Reduce  20%
Map  83% Reduce  23%
Map  85% Reduce  24%
Map  88% Reduce  24%
Map  92% Reduce  24%
Map  94% Reduce  25%
Map  96% Reduce  28%
Map  97% Reduce  29%
Map 100% Reduce  66%
Map 100% Reduce  69%
Map 100% Reduce  78%
Map 100% Reduce  87%
Map 100% Reduce  96%
Map 100% Reduce 100%
Elapsed time is 227.508109 seconds.
    Key      Value  
    ___    _________

     4     [ 1.1117]
     7     [ 1.0983]
    10     [ 1.0500]
    14     [ 0.9317]
    17     [ 0.8650]
    20     [ 0.7583]
    23     [ 0.6050]
    26     [ 1.0600]
    29     [ 1.5750]
    32     [ 4.0633]
    42     [18.9267]
    48     [26.2417]
     1     [ 1.0083]
    35     [ 7.5650]
    39     [13.2383]
    45     [20.5500]
