MATLAB Answers

big data 2d matrix percentile calculation using tall

Asked by David Santos on 12 Aug 2019
Latest activity Edited by David Santos on 14 Aug 2019
I'm trying to calculate percentiles over a large number of files (25000 or more), each containing a 4x1 cell that holds 4 maps, i.e. 1483x2824 matrices.
I'm using tall arrays, following the example in Percentiles of Tall Matrix Along Different Dimensions:
tic
% start local pool for multithreading
c = parcluster('local');
c.NumWorkers = 20;
parpool(c, c.NumWorkers);
folder = '/home/temporal2/dsantos/mat/*.mat'; % more than 25000 files
A = ones(1483, 2824, 2); % aux matrix to establish the prctile data type
y = tall(A);
% datastore of files containing a 4x1 cell of 1483x2824 maps
ds = fileDatastore(folder, 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true)
t = tall(ds);
% fill the aux tall array with each map in the correct format
for i = 1:25000
    y(:,:,i) = t(1+(i-1)*1483:1483*i, :);
end
% calculate the percentile
p90_1 = prctile(y, 90, 3)
P90_1 = gather(p90_1);
save('/home/temporal2/dsantos/p90_1.mat', 'P90_1', '-v7.3');
toc
But it seems that tall arrays won't work for this because I get the error:
Warning: Error encountered during preview of tall array 'p90_1'. Attempting to
gather 'p90_1' will probably result in an error. The error encountered was:
Requested 500025x500025 (1862.8GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause
MATLAB to become unresponsive. See array size limit or preference panel for
more information.
> In tall/display (line 21)

p90_1 =

  MxNx... tall array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

>> Error using digraph/distances (line 72)
Internal problem while evaluating tall expression. The problem was:
Requested 500028x500028 (1862.9GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause
MATLAB to become unresponsive. See array size limit or preference panel for
more information.

Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadata (line 756)
    allDistances = distances(cg.Graph);

Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadataFillingPartitionedArrays (line 739)
    [metadatas, partitionedArrays] = iGenerateMetadata(inputArrays, executorToConsider);

Error in ...

Error in tall/gather (line 50)
    [varargout{:}] = iGather(varargin{:});

Caused by:
    Error using matlab.internal.graph.MLDigraph/bfsAllShortestPaths
    Requested 500028x500028 (1862.9GB) array exceeds maximum array size
    preference. Creation of arrays greater than this limit may take a long time
    and cause MATLAB to become unresponsive. See array size limit or preference
    panel for more information.
Any clue on how to solve this problem?
All the best


2 Answers

Answer by Edric Ellis on 13 Aug 2019

That particular error is an internal error, essentially because your tall array expression is simply too large - it contains too many operations. tall arrays operate by building up a symbolic representation of all the expressions you've requested, and then running them all together when you call gather. Because you've got a for loop over 25000 elements, this symbolic representation is huge - too large to be evaluated. tall arrays are basically not designed to be looped over in this way. Instead, you need to express your program in terms of a smaller number of vectorised operations.
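For context, here is a minimal, generic illustration of that deferred evaluation (the variable names are just for demonstration and are not part of your problem):
tt = tall(rand(1e6,1)); % in practice this would come from a datastore
m = mean(tt);           % nothing is computed yet; the operation is only recorded
s = m + 1;              % still deferred
result = gather(s);     % the whole recorded expression graph is evaluated here
Every deferred operation (including each indexed assignment in your 25000-iteration loop) adds to that graph, which is why it grows too large to evaluate.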
I would proceed in the following manner (I can't be more specific since your problem statement isn't executable - see this page on tips regarding making a minimal reproduction):
  1. Have your loadPrc return a 4 × 1483 × 2824 numeric matrix (rather than a cell array)
  2. Your corresponding tall array t will then be 25000 × 1483 × 2824
  3. Instead of the for loop, simply call prctile in dimension 1
ds = fileDatastore(folder, 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true); % as in your question
t = tall(ds);
p90_1 = prctile(t, 90, 1);
P90_1 = gather(p90_1);
% and then perhaps
P90_1 = shiftdim(P90_1, 1)
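As a rough sketch of point 1 above, loadPrc could stack the 4 maps into a numeric array instead of returning the cell. This assumes each .mat file holds a single variable containing the 4x1 cell; adjust the field lookup to your actual file layout:
function dataOut = loadPrc(filename)
s = load(filename);                           % struct with one field holding the 4x1 cell
f = fieldnames(s);
maps = s.(f{1});                              % the 4x1 cell of 1483x2824 maps
dataOut = permute(cat(3, maps{:}), [3 1 2]);  % 1483x2824x4 -> 4x1483x2824 numeric array
end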



Answer by David Santos on 13 Aug 2019

Thanks a lot for your answer Edric!
I'm not sure how to solve point 1. Here's my simplified loadPrc:
function dataOut = loadPrc(filename)
s = load(filename);   % the .mat file contains a 4x1 cell: 4 frequency maps of 1483x2824 points
f = fieldnames(s);
data = s.(f{1});      % the 4x1 cell
dataOut = data{1};    % let's solve just the first frequency map for the moment: a 1483x2824 matrix
end
How can I modify this to match your proposal?
I've tried this now on my server, and because it has the R2017a version, "'UniformRead', true" does not work, so dataOut is always a cell. Can I get a numeric matrix somehow?
On the other hand, if I just calculate the percentile of one frequency map (as in loadPrc above), dataOut is going to be a 2-D, not a 3-D, matrix. I'm doing this because if I join the 4 frequencies, dataOut = 4x1483x2824, so how can I calculate each frequency's percentile? Maybe I can do:
p90_1 = prctile(t(1:4:end,:,:), 90, 1);
P90_1 = gather(p90_1);
p90_2 = prctile(t(2:4:end,:,:), 90, 1);
P90_2 = gather(p90_2);
?
All the best

  4 Comments

Ah, sorry, I hadn't realised that prctile in the tall dimension supports only vectors. Hm, this might turn out to be trickier than I thought. In fact, I'm not sure I know how to do this using tall arrays.
Let me just confirm that I got the basics of your problem correct - you do want to compute percentiles individually for each of the 1483x2824 elements, so 4187992 percentiles down vectors of length 25000.
It may be that tall arrays aren't the right tool in this case - at the very least, I think it will be necessary to "transpose" the data so that you can load a handful of 25000-element vectors in memory at a time and call prctile on those in sequence (perhaps even in parallel if you have Parallel Computing Toolbox).
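A rough sketch of that "transpose" idea (purely illustrative; the folder path and variable handling are assumptions based on the code in the question): process a small block of rows at a time, make one pass over the files to assemble the 25000-long vectors for that block, then call prctile in memory:
files = dir('/home/temporal2/dsantos/mat/*.mat');
nFiles = numel(files);
rowBlock = 1:10;                             % a handful of rows per pass; 10x2824x25000 doubles is ~5.6 GB
vals = zeros(numel(rowBlock), 2824, nFiles);
for k = 1:nFiles                             % one pass over the files per row block
    s = load(fullfile(files(k).folder, files(k).name));
    f = fieldnames(s);
    map = s.(f{1}){1};                       % first frequency map, 1483x2824
    vals(:,:,k) = map(rowBlock, :);
end
p90_block = prctile(vals, 90, 3);            % 10x2824 percentiles for this row block
The loop over files (or an outer loop over row blocks) could then be a parfor if Parallel Computing Toolbox is available.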
Thanks for your answer!
- In the way I was working at the beginning of my question (with that tricky aux tall array to format the data into the 3-D tall array that prctile likes), I was able to calculate prctile over the 3rd dimension of the tall array of size 1483x2824x25000 like this:
p90_1=prctile(t,90,3);
P90_1=gather(p90_1);
The problem was that at the end, when I used gather, MATLAB needed to load the entire array into memory, and it's always too big. I think tall arrays won't work because of this. It would be great to be able to load into memory only the p90_1 variable instead of the entire (400 GB) matrix.
- Yes, you got it right, I want to compute percentiles individually for each element of the 1483x2824 matrix/map. What you propose could be a solution, but even with parallel processing (40 cores) it would imply a lot of file loading, wouldn't it? I will try a mini test and see what happens.
- What about other ways? mapreduce? Using big MAT-files on disk? Approximations to the percentile such as the P2 algorithm?
SOME TESTING
I did some testing using just 4 maps/files (1483x2824 matrices) with your "slicing" percentile calculation proposal. The first 2 options (using matfile and tall arrays) only calculate the first 10 rows:
%% Using the matfile option
tic
folder = dir('matBorrame/*.mat'); % folder with 4 files
P = zeros(1483, 2824, 2);
save('P.mat', 'P', '-v7.3');
m = matfile('P.mat', 'Writable', true);
for i = 1:4
    fprintf('%d\n', i);
    v = load(strcat('matBorrame/', folder(i).name));
    id = strcat('l', folder(i).name(1:end-4-7));
    m.P(:,:,i) = v.(id){1};
end
p90_1 = ones(1483, 2824);
for r = 1:10 % only the first 10 rows
    fprintf('ROW:%d\n', r);
    for c = 1:2824
        p90_1(r,c) = prctile(m.P(r,c,:), 90);
    end
end
save('p5_90_4.mat', 'p90_1', '-v7.3');
toc
% Elapsed time is 190.559574 seconds.
%% Tall arrays option
tic
ds = fileDatastore('matBorrame', 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true)
t = tall(ds);
A = ones(1483, 2824, 2); % aux matrix to establish the prctile data format
y = tall(A);
for i = 1:4
    y(:,:,i) = t(1+(i-1)*1483:1483*i, :);
end
p90_1 = ones(1483, 2824);
for r = 1:10
    fprintf('ROW:%d\n', r);
    for c = 1:2824
        aux = squeeze(y(r,c,:));
        p90_1(r,c) = gather(prctile(aux, 90));
    end
end
save('p1_90_4.mat', 'p90_1', '-v7.3');
toc
% Elapsed time is >5000 s; I stopped it before it finished.
%% In-memory option; processing all rows!
tic
p90_1 = prctile(m.P, 90, 3);
toc
% Elapsed time is 1.335489 seconds.
My conclusions:
- Tall arrays are a bad solution for your proposal; it would take forever...
- Using matfile could work, but it is around 4000 times slower than the standard in-memory solution
- What about other ways? mapreduce? Approximations to the percentile such as the P2 algorithm? Black magic?
All the best
