filedatastore; read M mat files of size (1xL) into single tall array that is of size (1, L*M)

I have an extremely large collection of data spread across multiple MAT-files, and I want to compute statistics on it as a single entity, so I need to load the files through a fileDatastore. The dimensions of the resulting tall array are stopping me from getting the exact stats I am after.
Before the question is inevitably asked: I can't read all the files normally because they won't all fit in RAM at the same time. If they did, I would simply append the array from each file with ordinary matrix operations.
Mat file format:
Each file is identical in size; there is a single variable in each file of size [1 x L].
In the directory consisting of the mat files for the datastore, the number of files is M.
Attempted datastore code:
So far I've managed to get a datastore created of my desired data format, however the dimensions of the tall array are preventing me from getting the exact stats I'm after.
%find files in directory with mat file extension (all matching formats)
Datastore_Files = Search_Files(Datastore_Directory, ".mat");
%create absolute file path to each individual file, make string array for datastore input
Datastore_File_List = string(fullfile({Datastore_Files.folder}, {Datastore_Files.name}));
%create single tall datastore from array of mat files
File_Data_Store = tall(fileDatastore(Datastore_File_List, 'ReadFcn', @(x)table2array(struct2table(load(x))), 'UniformRead', true));
%get data store size
File_Data_Store_Size = gather(size(File_Data_Store));
disp(File_Data_Store_Size)
M L
Stats issue:
My issue due to the datastore dimensions is that for example, if I then perform a function such as mean(), I end up with a mean value per-file, rather than getting a single value representing the mean value for the entire datastore as a whole.
%Returns mean value per-file; not a single value for the whole dataset.
Test1 = gather(mean(File_Data_Store))
disp(size(Test1))
1 M
%Also returns mean value per-file; not a single value for the whole dataset.
Test2 = gather(mean(File_Data_Store(:,:)))
disp(size(Test2))
1 M
Attempted workarounds:
As above, I can't seem to use the usual trick for a standard matrix: if you had a multidimensional array A and wanted a single mean() value representing the entire array, you could just use mean(A(:)).
I also can't use reshape, as you can't change the size of the first dimension of a tall array.
T = reshape(File_Data_Store, 1, File_Data_Store_Size(1)*File_Data_Store_Size(2))
Error using tall/reshape (line 17)
Reshaping the first dimension of tall arrays is not supported.
Question:
Is there a way for me to concatenate the output from each file during datastore creation so that I end up with a single tall array of dimensions [M*L, 1] instead of [M, L]?
Alternatively, is there a way I am unaware of to perform operations on a tall array as a whole, rather than on each column independently (each file)?
  1 Comment
dpb on 13 May 2022
See mapreduce and an example of mean(): <Compute-mean-value-with-mapreduce>
There are also tall arrays and gather.
I've not used any of the above "in anger" so can only point at the doc...


Accepted Answer

Jeremy Hughes on 13 May 2022
In general, you'll have better luck identifying where the problem lies by looking at each piece of the code separately.
fcn = @(x)table2array(struct2table(load(x))); % Issue may be here
ds = fileDatastore(Datastore_File_List, 'ReadFcn', fcn, 'UniformRead', true);
A = tall(ds)
If fcn returns a 1-by-L vector, then tall will eventually build an M-by-L array from that data. Calling mean on that array returns the mean of each column, i.e. a 1-by-L array.
A = rand(3,10)
A = 3×10
    0.9645    0.9308    0.0643    0.4233    0.2423    0.7365    0.1519    0.0225    0.2807    0.4034
    0.7564    0.6492    0.9459    0.3899    0.9306    0.3297    0.2200    0.2077    0.3861    0.1871
    0.2675    0.1926    0.5899    0.9457    0.9813    0.3425    0.3606    0.6717    0.0494    0.0663
m = mean(A)
m = 1×10
0.6628 0.5909 0.5334 0.5863 0.7181 0.4696 0.2442 0.3006 0.2387 0.2189
I think what you are asking for is the mean of the whole array. For in-memory arrays, I would do this:
m = mean(A(:))
m = 0.4563
But tall probably won't like that.
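One way to get a whole-array mean without reshaping is to reduce along each dimension in turn. This is valid here only because every row has the same length L, so the mean of the row means equals the overall mean. The following is a minimal sketch on a small in-memory array wrapped with tall(); the names mAll, mTwo and mTall are illustrative, and how well this scales to a very large datastore has not been checked here.

```matlab
% Compare the in-memory mean(A(:)) trick with a two-step reduction.
A = rand(3, 10);

mAll = mean(A(:));               % in-memory: flatten, then average
mTwo = mean(mean(A, 2), 1);      % mean of each row, then mean of those

% Same two-step reduction on a tall array built from the in-memory data.
tA    = tall(A);
mTall = gather(mean(mean(tA, 2), 1));

fprintf('mean(A(:)) = %.4f, two-step = %.4f, tall two-step = %.4f\n', ...
    mAll, mTwo, mTall);
```

The first dimension of the tall array is never reshaped, only reduced, which is the operation tall arrays are built for.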
If you modify that fcn to return the transpose,
fcn = @(x)table2array(struct2table(load(x)))'; % Note the added ' transpose character.
Now each read returns an L-by-1 vector instead, and the tall array should represent an (M*L)-by-1 array, on which you can call mean and get a single value.
m = gather(mean(A))
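To try the transposed read function end to end, a small self-contained sketch follows. It assumes a scratch folder from tempname, made-up file names, and a single variable named Data in each file; none of these come from the original post.

```matlab
% Write M small MAT-files, each holding one 1-by-L row vector.
M = 4;  L = 5;
demoDir = tempname;                 % hypothetical scratch folder
mkdir(demoDir);
allData = zeros(M, L);              % in-memory copy, kept only for checking
for k = 1:M
    Data = rand(1, L);              % 'Data' is an assumed variable name
    allData(k, :) = Data;
    save(fullfile(demoDir, sprintf('file_%02d.mat', k)), 'Data');
end

% Transpose AFTER table2array so that each read returns an L-by-1 column.
fcn = @(x) table2array(struct2table(load(x)))';
ds  = fileDatastore(demoDir, 'ReadFcn', fcn, ...
    'FileExtensions', '.mat', 'UniformRead', true);
tA  = tall(ds);

sz = gather(size(tA));              % should be [M*L, 1]
m  = gather(mean(tA));              % one mean over every sample

disp(sz)
fprintf('tall mean = %.6f, in-memory mean = %.6f\n', m, mean(allData(:)));

rmdir(demoDir, 's');                % remove the scratch folder
```

With 'UniformRead' set to true, the L-by-1 outputs are concatenated vertically, which is why the transpose has to be applied to the matrix returned by table2array and not to the table inside it.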
  1 Comment
Alex Hogg on 16 May 2022
This works perfectly; I attempted this but must have popped the transpose on the internal table read, rather than the matrix post-conversion from table.
Thank you :)

Release

R2021a
