MATLAB Answers

big data 2d matrix percentile calculation using tall

David Santos on 12 Aug 2019
Edited: David Santos on 14 Aug 2019
I'm trying to calculate a percentile over a lot of files (25000 or even more), each containing a 4x1 cell array representing 4 maps, i.e. 1483x2824 matrices.
I'm using tall arrays, following the indications of Percentiles of Tall Matrix Along Different Dimensions:
tic
% start a local pool for multithreading
c = parcluster('local');
c.NumWorkers = 20;
parpool(c, c.NumWorkers);
folder = '/home/temporal2/dsantos/mat/*.mat'; % more than 25000 files
A = ones(1483, 2824, 2); % aux matrix to establish the prctile data type
y = tall(A);
% datastore of files containing a 4x1 cell of 1483x2824 maps
ds = fileDatastore(folder, 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true);
t = tall(ds);
% fill the aux tall array with each map in the correct format
for i = 1:25000
    y(:,:,i) = t(1+(i-1)*1483 : 1483*i, :);
end
% calculate the percentile
p90_1 = prctile(y, 90, 3)
P90_1 = gather(p90_1);
save('/home/temporal2/dsantos/p90_1.mat', 'P90_1', '-v7.3');
toc
But it seems that tall arrays won't work for this because I get the error:
Warning: Error encountered during preview of tall array 'p90_1'. Attempting to
gather 'p90_1' will probably result in an error. The error encountered was:
Requested 500025x500025 (1862.8GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause
MATLAB to become unresponsive. See array size limit or the preference panel
for more information.
> In tall/display (line 21)
p90_1 =
MxNx... tall array
? ? ? ...
? ? ? ...
? ? ? ...
: : :
: : :
>> Error using digraph/distances (line 72)
Internal problem while evaluating tall expression. The problem was:
Requested 500028x500028 (1862.9GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause
MATLAB to become unresponsive. See array size limit or the preference panel
for more information.
Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadata (line 756)
allDistances = distances(cg.Graph);
Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadataFillingPartitionedArrays (line 739)
[metadatas, partitionedArrays] = iGenerateMetadata(inputArrays, executorToConsider);
Error in ...
Error in tall/gather (line 50)
[varargout{:}] = iGather(varargin{:});
Caused by:
Error using matlab.internal.graph.MLDigraph/bfsAllShortestPaths
Requested 500028x500028 (1862.9GB) array exceeds maximum array size preference.
Any clue on how to solve this problem?
All the best


Answers (2)

Edric Ellis on 13 Aug 2019
That particular error is an internal error, essentially because your tall array expression is too large: it contains too many operations. Tall arrays operate by building up a symbolic representation of all the expressions you've evaluated, and then running them all together when you call gather. Because you've got a for loop over 25000 elements, this symbolic representation is too large to be evaluated. Tall arrays are basically not designed to be looped over in this way. Instead, you need to express your program in terms of a smaller number of vectorised operations.
I would proceed in the following manner (I can't be more specific since your problem statement isn't executable - see this page for tips on making a minimal reproduction):
  1. Have your loadPrc return a 4 × 1483 × 2824 numeric matrix (rather than a cell array)
  2. Your corresponding tall array t will then be 25000 × 1483 × 2824
  3. Instead of the for loop, simply call prctile in dimension 1
ds = fileDatastore(folder, 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true);
t = tall(ds);
p90_1 = prctile(t, 90, 1);
P90_1 = gather(p90_1);
% and then perhaps
P90_1 = shiftdim(P90_1, 1)
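For point 1, one way to modify loadPrc might be the following sketch (an assumption, not tested against the real files: each .mat file holds a single variable that is a 4x1 cell of 1483x2824 maps, as described in the question):

```matlab
function dataOut = loadPrc(filename)
% Return a 4x1483x2824 numeric array (frequency first) instead of a cell.
data = load(filename);          % struct with one field: a 4x1 cell of maps
fn = fieldnames(data);
c = data.(fn{1});               % 4x1 cell of 1483x2824 matrices
dataOut = permute(cat(3, c{:}), [3 1 2]);  % 1483x2824x4 -> 4x1483x2824
end
```

Putting the frequency dimension first means vertical concatenation across files stacks along the tall dimension, which is what 'UniformRead' expects.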



David Santos on 13 Aug 2019
Thanks a lot for your answer Edric!
I'm not sure how to solve point 1. Here's my simplified loadPrc:
function dataOut = loadPrc(filename)
data = load(filename); % struct whose single field is a 4x1 cell: 4 frequency maps of 1483x2824 points
fn = fieldnames(data);
c = data.(fn{1});
dataOut = c{1}; % let's solve just the first frequency map for the moment: a 1483x2824 matrix
end
how can I modify this to reach to your proposal?
I've tried this on my server now, and because it has the 2017a version, 'UniformRead', true is not working, so dataOut is always a cell. Can I get a numeric matrix somehow?
On the other hand, if I just calculate the percentile of one frequency map (as stated in loadPrc), dataOut is going to be a 2-D, not a 3-D, matrix. I'm doing this because if I join the 4 frequencies, dataOut = 4x1483x2824; so, how can I calculate each frequency's percentile? Maybe I can do:
p90_1 = prctile(t(1:4:end,:,:), 90, 1);
P90_1 = gather(p90_1);
p90_2 = prctile(t(2:4:end,:,:), 90, 1);
P90_2 = gather(p90_2);
?
All the best

  4 Comments

Edric Ellis on 14 Aug 2019
Ah, sorry, I hadn't realised that prctile in the tall dimension supports only vectors. Hm, this might turn out to be trickier than I thought. In fact, I'm not sure I know how to do this using tall arrays.
Let me just confirm that I got the basics of your problem correct - you do want to compute percentiles individually for each 1483x2824 element - so 4187992 percentiles down vectors of length 25000.
It may be that tall arrays aren't the right tool in this case - at the very least, I think it will be necessary to "transpose" the data so that you can load a handful of 25000-element vectors in memory at a time and call prctile on those in sequence (perhaps even in parallel if you have Parallel Computing Toolbox).
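A rough sketch of that "transpose" step (assumptions: loadPrc returns one 1483x2824 map per file as in the question; the block*.mat names and the blockCols value are placeholders, and blockCols must be chosen so that an nFiles x 1483 x blockCols double array fits in RAM):

```matlab
files = dir('/home/temporal2/dsantos/mat/*.mat');
nFiles = numel(files);
blockCols = 8;                 % ~2.4 GB per block at 25000 files (doubles)
nBlocks = ceil(2824 / blockCols);
% Phase 1: one pass over the files, scattering each map's columns into
% per-block matfiles shaped nFiles x 1483 x blockCols.
mats = cell(1, nBlocks);
for b = 1:nBlocks
    T = zeros(nFiles, 1483, blockCols);
    save(sprintf('block%d.mat', b), 'T', '-v7.3');
    mats{b} = matfile(sprintf('block%d.mat', b), 'Writable', true);
end
for k = 1:nFiles
    m = loadPrc(fullfile(files(k).folder, files(k).name)); % 1483x2824 map
    for b = 1:nBlocks
        cols = (b-1)*blockCols+1 : min(b*blockCols, 2824);
        mats{b}.T(k, :, 1:numel(cols)) = reshape(m(:, cols), 1, 1483, []);
    end
end
% Phase 2: each block now fits in memory; prctile down the file dimension.
P90 = zeros(1483, 2824);
for b = 1:nBlocks
    S = load(sprintf('block%d.mat', b));
    cols = (b-1)*blockCols+1 : min(b*blockCols, 2824);
    P90(:, cols) = squeeze(prctile(S.T(:, :, 1:numel(cols)), 90, 1));
end
```

Saving each block first and then reopening it as a matfile preallocates it on disk, so the per-file writes in phase 1 are partial writes rather than whole-file rewrites. This trades a lot of disk I/O for bounded memory; single precision would halve both costs.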
David Santos on 14 Aug 2019
Thanks for your answer!
- In the way I was working at the beginning of my question (with that tricky aux tall array to put the data into the 3-D format prctile likes), I was able to calculate prctile over the 3rd dim of a tall array of size 1483x2824x25000:
p90_1 = prctile(t, 90, 3);
P90_1 = gather(p90_1);
The problem was that at the end, when I used gather, MATLAB needed to load the entire array into memory, and it's always too big. I think tall arrays won't work because of this. It would be great to be able to load into memory only the p90_1 variable instead of the entire (400 GB) t matrix.
- Yes, you got it right: I want to compute percentiles individually for each 1483x2824 matrix/map. What you propose could be a solution, but even with parallel processing (40 cores) it would imply a lot of file loading, wouldn't it? I will try a mini-test and see what happens.
- What about other ways? Mapreduce? Using big matfiles on disk? Approximations to the percentile such as the P2 algorithm?
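On the approximation idea: if an approximate answer is acceptable and the data range is roughly known, a per-pixel histogram gives the percentile in a single pass over the files. A sketch (vmin, vmax, and nBins are assumptions, not from the question, and the result is only accurate to one bin width):

```matlab
files = dir('/home/temporal2/dsantos/mat/*.mat');
nFiles = numel(files);
vmin = 0; vmax = 1; nBins = 256;              % assumed data range and resolution
edges = linspace(vmin, vmax, nBins+1);
counts = zeros(1483, 2824, nBins, 'uint16');  % ~2.1 GB in memory
[R, C] = ndgrid(1:1483, 1:2824);              % per-pixel row/col indices
for k = 1:nFiles
    m = loadPrc(fullfile(files(k).folder, files(k).name)); % 1483x2824 map
    m = min(max(m, vmin), vmax);              % clamp so every value gets a bin
    bin = discretize(m, edges);               % bin index per pixel
    idx = sub2ind(size(counts), R, C, bin);
    counts(idx) = counts(idx) + 1;
end
% Approximate 90th percentile: first bin where the per-pixel CDF reaches 0.9.
cdf = cumsum(double(counts), 3) / nFiles;
[~, b90] = max(cdf >= 0.9, [], 3);            % index of first true along dim 3
P90 = edges(b90);                             % lower edge of that bin (approx.)
```

The cumsum over bins gives each pixel's empirical CDF, and the first bin reaching 0.9 locates the 90th percentile. The true P2 algorithm would avoid storing the histogram, but it still needs its own five-marker state per pixel.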
David Santos on 14 Aug 2019
SOME TESTING
I did some testing using just 4 maps/files (1483x2824 matrices) with your "slicing" percentile calculation proposal. The first two options (using matfile and tall arrays) only calculate the first 10 rows.
%% Using matfile option
tic
folder = dir('matBorrame/*.mat'); % folder with 4 files
P = zeros(1483, 2824, 2);
save('P.mat', 'P', '-v7.3');
m = matfile('P.mat', 'Writable', true);
for i = 1:4
    fprintf('%d\n', i);
    v = load(strcat('matBorrame/', folder(i).name));
    id = strcat('l', folder(i).name(1:end-4-7));
    m.P(:,:,i) = v.(id){1};
end
p90_1 = ones(1483, 2824);
for r = 1:10 % first 10 rows only
    fprintf('ROW:%d\n', r);
    for c = 1:2824
        p90_1(r,c) = prctile(m.P(r,c,:), 90);
    end
end
save('p5_90_4.mat', 'p90_1', '-v7.3');
toc
% Elapsed time is 190.559574 seconds.
%% Tall arrays option
tic
ds = fileDatastore('matBorrame', 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true);
t = tall(ds);
A = ones(1483, 2824, 2); % aux matrix to establish the prctile data format
y = tall(A);
for i = 1:4
    y(:,:,i) = t(1+(i-1)*1483 : 1483*i, :);
end
p90_1 = ones(1483, 2824);
for r = 1:10
    fprintf('ROW:%d\n', r);
    for c = 1:2824
        aux = squeeze(y(r,c,:));
        p90_1(r,c) = gather(prctile(aux, 90));
    end
end
save('p1_90_4.mat', 'p90_1', '-v7.3');
toc
% Elapsed time is >5000 s. I stopped it before it finished...
%% In-memory option; processing all rows!
tic
p90_1 = prctile(m.P, 90, 3);
toc
% Elapsed time is 1.335489 seconds.
My conclusions:
- Tall arrays are a bad solution for this proposal; it would take forever...
- Using matfile could work, but it is around 4000 times slower than the standard in-memory solution.
- What about other ways? Mapreduce? Approximations to the percentile such as the P2 algorithm? Black magic?
All the best

