How to make a MAT-file that can be used to create a Datastore for MapReduce?

This page tells you how to Read and Analyze Data in KeyValueDatastore for MAT-File. However, it only "shows how to create a datastore for key-value pair data in a MAT-file that is the output of mapreduce." The question is: how can you make a MAT-file from which a datastore can be created?
I found the following reply by Rick Amos in another thread useful: Currently, the one very specific form of MAT-file that can be read by datastore is the output of another mapreduce call. An unofficial shortcut that creates such a MAT-file is the following code:
data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
This is nice to know, and it works well with one key-value pair. In the general case, how do you save multiple key-value pairs for datastore (such that readall(ds) would produce multiple rows)? I have tried two alternatives with no success: saving two same-sized cell arrays for keys and values, and saving one struct array of key-value pairs. Thank you!

Accepted Answer

Rick Amos on 24 Nov 2014
In R2014b there is currently no direct way of creating a MAT-file datastore. However, there are several indirect ways to create one in R2014b.
The first method is to use the output of a mapreduce operation. That is, create an input file 'input.txt' that has the following contents:
Filename
myMatFile.mat
mySecondMatFile.mat
Then create a 'myMapper.m' with the following contents:
function myMapper(data, ~, intermediateOutput)
    % Pass each filename through as both key and value.
    filenames = data.Filename;
    addmulti(intermediateOutput, filenames, filenames);
end
And a 'myReducer.m' with the following contents:
function myReducer(filename, ~, finalOutput)
    % This should be changed depending on the input data.
    % This purely converts a struct array into a cell array of structs for addmulti.
    data = load(filename);
    values = num2cell(data.myStructArrayVariable);
    keys = repmat({'SomeKey'}, size(values));
    addmulti(finalOutput, keys, values);
end
With all of this in place, do:
ds = datastore('input.txt');
mapFunction = @myMapper;
reduceFunction = @myReducer;
outputFolder = '/my/output/folder';
resultDS = mapreduce(ds, mapFunction, reduceFunction, 'OutputFolder', outputFolder)
This will create a collection of MAT-files in the given output folder that contain the original data and that can be used with datastore.
The second method is an unofficial shortcut to this. That is, do the following:
% Suppose keys and values are two arrays with the same number of elements, such as:
keys = {'TestKey1'; 'TestKey2'};
values = struct('Foo', {1,2}, 'Bar', {3,4});
% Then this will store the data in such a way that it can likely be read by datastore:
if ~iscell(keys)
    keys = num2cell(keys);
end
if ~iscell(values)
    values = num2cell(values);
end
% Key and Value must be column cell arrays of equal length.
data.Key = keys(:);
data.Value = values(:);
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
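For repeated use, the shortcut above could be wrapped in a small function. This is only a sketch: the name saveKeyValueMat is made up for illustration, and it relies on the same unofficial behavior (a MAT-file holding column cell arrays named Key and Value), so it may not work in every release.

```matlab
function ds = saveKeyValueMat(filename, keys, values)
% saveKeyValueMat  Hypothetical helper around the unofficial shortcut.
%   Assumption: datastore can read a MAT-file that contains two
%   equal-length column cell arrays named Key and Value.

    if ~iscell(keys)
        keys = num2cell(keys);
    end
    if ~iscell(values)
        values = num2cell(values);
    end

    % One value per key is required for the rows to line up.
    assert(numel(keys) == numel(values), ...
        'saveKeyValueMat:sizeMismatch', ...
        'keys and values must have the same number of elements.');

    data.Key = keys(:);
    data.Value = values(:);
    save(filename, '-struct', 'data');

    % Undocumented usage: this may fail if datastore rejects the file.
    ds = datastore(filename);
end
```

With the example data above, ds = saveKeyValueMat('myMatFile.mat', keys, values) followed by readall(ds) should give one row per key-value pair, provided the shortcut is accepted by your release.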
  2 Comments
Oleg Komarov on 4 Dec 2014
I find this useful (thus +1), since it provides a workaround to store multiple values. However, it is not a scalable option since each tuple of values is saved in a scalar structure, which is then repeated for each row.
I have not tested it, but I think the benefit of the compression given by the MAT-file will be outweighed by the overhead of the structs.
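One way to check this would be to save the same numbers in both layouts and compare the file sizes. The sketch below does only that; the row count n and the field names Foo and Bar are arbitrary choices for illustration, and the result will depend on the data and the MAT-file version.

```matlab
% Sketch: compare on-disk size of the one-struct-per-row layout
% with the same numbers stored as plain arrays.
n = 1e4;                          % arbitrary number of rows
foo = rand(n, 1);
bar = rand(n, 1);

% Layout used by the shortcut: one scalar struct per row, each in its own cell.
kv.Key = repmat({'SomeKey'}, n, 1);
kv.Value = num2cell(struct('Foo', num2cell(foo), 'Bar', num2cell(bar)));
kvFile = [tempname '.mat'];
save(kvFile, '-struct', 'kv');

% Baseline: the same values as two plain numeric vectors.
plain.Foo = foo;
plain.Bar = bar;
plainFile = [tempname '.mat'];
save(plainFile, '-struct', 'plain');

kvInfo = dir(kvFile);
plainInfo = dir(plainFile);
fprintf('Key-value layout: %d bytes\n', kvInfo.bytes);
fprintf('Plain arrays:     %d bytes\n', plainInfo.bytes);

% Clean up the temporary files.
delete(kvFile);
delete(plainFile);
```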
