How many times does each number appear?
    11 views (last 30 days)
  
       Show older comments
    
Let a
 a = 
     1 4
     3 5
     4 7
     6 8
be integer intervals
    [1 2 3 4]
    [3 4 5]
    [4 5 6 7]
    [6 7 8]
I'm looking for an output b to count the number of times each element appears. It is clear that 1 appears one time, 2 appears one time, 3 appears twice, and so on...
So the output b would list the element and how many times it appears:
 b = 
     1 1
     2 1
     3 2
     4 3
     5 2
     6 2
     7 2
     8 1
The challenge is, a is an extremely large nx2 array, so using find even with parallel looping is inefficient. Is there a fast way to yield b given a? If it helps, we can be sure that the elements down the left-hand column of a are sorted, but not necessarily so for the right-hand column. Thanks in advance!
2 Comments
  Jan
      
      
 on 2 Apr 2013
				Some contributors spend a lot of time here to solve the problem. What about voting for this question to get the interst of others?
Accepted Answer
  Teja Muppirala
    
 on 2 Apr 2013
        Assuming that the second column of a is always >= the first column of a, then this is an efficient solution.
 mA = max(a(:));
 b = (1:mA)';
 b = [b cumsum(accumarray(a(:,1),1,[mA 1]) + ...
               accumarray(min(a(:,2)+1,mA),-1,[mA 1]))];
 b(end) = b(end) +1;
1 Comment
More Answers (6)
  Brian B
      
 on 2 Apr 2013
        
      Edited: Brian B
      
 on 2 Apr 2013
  
      Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector b is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.
Here I generate intervals of up to 1000, so the loop runs about 10 times.
N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
d = diff(a, 1, 2) + 1 ; % interval length
s = a(:,1); % interval start point
L = ceil(max(d)/2);
b = 0*idx;
tic
while L>0
    iL = (d>=L);
    u = hist(s(iL),idx);
    %b = b + conv(u, ones(1,L),'same');
    b = b + filter(ones(1,L),1,u);
    s(iL) = s(iL)+L;
    d(iL) = d(iL)-L;    
    L = ceil(max(d)/2);
    toc
end
This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....
EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the central part of the convolution, not the initial part, as we need.
  Jan
      
      
 on 2 Apr 2013
        
      Edited: Jan
      
      
 on 2 Apr 2013
  
      Are the values in the first and second column of a distinct? If so:
c = zeros(1, max(a(:, 2)) + 1);
c(a(:, 1)) = 1;
a2     = a(:, 2) + 1;  % [EDITED: +1 inserted]
c(a2)  = c(a2) - 1; 
c      = cumsum(c);
c(end) = [];
If the values are not unique, instead of 1 and -1 histc determines the values:
q = 1:max(a(:, 2));
c = histc(a(:, 1), q);
c = c - histc(a(:, 2) + 1, q);  % [EDITED: +1 inserted]
c = cumsum(c);
How efficient is simple FOR loop?
n = max(a(:, 2));
c = zeros(1, n + 1);
for ia = 1:length(a)
  a1    = a(ia, 1);
  c(a1) = c(a1) + 1;
  a2    = a(ia, 2) + 1;  % [EDITED: +1 inserted]
  c(a2) = c(a2) - 1;
end
c(end) = [];
c      = cumsum(c);
Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.
2 Comments
  Jan
      
      
 on 2 Apr 2013
				
      Edited: Jan
      
      
 on 2 Apr 2013
  
			Interesting timings:
N = 1e7; M = 1e6;
a = [randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
FOR-loop:   1.90 seconds
HISTC:      9.64 seconds (see also Brian B's HISTC solution)
CUMSUM:     1.71 seconds (columns of [a] must be unique!)
CONV:      41.00 seconds (Brian B's loop with HIS and CONV)
ACCUMARRAY: 1.83 seconds (Teja's CUMSUM/ACCUMARRAY)
The relation my change for the real data. But for the test data, the simple FOR loop can compete with the vectorized methods (R2009a/64, Win7).
  Image Analyst
      
      
 on 1 Apr 2013
        
      Edited: Image Analyst
      
      
 on 1 Apr 2013
  
      Are they known to be integers? I'd use histc(). You could do this in about 4 lines of code -- 1 (or maybe a 3 line for loop) to construct a 1D list of all numbers you gave in the seconds array you showed, 1 to construct the hist bin edges, one to call histc() and one to construct "b" from the result of histc(). Give it a try and come back if you have trouble.
  Cedric
      
      
 on 1 Apr 2013
        
      Edited: Cedric
      
      
 on 1 Apr 2013
  
      Hi Hamad,
Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:
 >> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ;
 >> buffer = [buffer{:}].' ;
 >> counts = accumarray(buffer, ones(numel(buffer),1))
 counts =
     1
     1
     2
     3
     2
     2
     2
     1
There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.
7 Comments
  Brian B
      
 on 2 Apr 2013
        
      Edited: Brian B
      
 on 2 Apr 2013
  
      Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:
N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
b = hist(a(:,1),idx);
b = b - hist(a(:,2)+1,idx);
b = cumsum(b);
This works even if the values are not distinct.
2 Comments
  Jan
      
      
 on 2 Apr 2013
				I've posted a very similar solution, but without the "+1" in the 2nd histc() call - I've fixed my bug now.
See Also
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!




