How to quickly group numerical data without giving bin sizes

I am trying to find an efficient and quick way to group numerical data. In short, I have several paths towards a particular pixel, and these paths consist of rays of slightly different lengths (as any ray that crosses the pixel anywhere is valid for a path). These paths can therefore be considered groups of rays. I want to differentiate the paths by their (average) length and select the path that contains the largest number of rays, or, in other words, identify the groups and select the largest group.
Importantly though, I do not just need the length, but also an index to identify one ray, e.g. the "middle" one of the group. (Say I have an array of size 10, and the first 7 and last 3 elements form 2 groups. I would like to identify the groups, then, out of the 7 elements of the larger group, I would like to get the index of the 4th element as the "middle".)
My current solution is to round the ray lengths (to the third decimal, as the pixel size is on the millimeter scale) and use the "mode" function. However, this is both inefficient (because I want to do this column-wise for a matrix that also contains NaN values that I would like to ignore) and in some cases inaccurate. For example:
array = [0.2248 0.2249 0.2250 0.2251 0.2399 0.2400 0.2401];
array2 = round(array,2);
mode(array2)
ans = 0.2400
Of course it would be logical to group the first four entries and the last three, but the rounding operation is ill-suited when the values vary around the rounding boundary (the .5). I have used the histogram function to plot examples in my code and it groups the entries in a satisfactory way. However, I actively do not want the plot itself, I just need the grouping, and the histogram function seems to have a rather large overhead for this purpose (as this operation has to be performed thousands of times for a proper run of the program). The discretize function unfortunately needs me to give it an explicit number of bins, i.e. I would need to have an a priori idea of the groups.
Is there any function that can efficiently do this, or are there suggestions for a better way to do it myself than "mode"?

Accepted Answer

Star Strider on 16 Aug 2023
I am not certain that there is a robust approach to this sort of problem. For multivariable problems (each point is a vector determined by more than one value), there are built-in clustering functions. This problem is somewhat unusual.
The data should ideally be sorted (although that may not be an absolute requirement), because the differences are easier to calculate if they are. This approach may be more than this particular problem needs, but I decided to make it a bit more robust so that it is appropriate for other problems as well. I cannot be certain it will be robust for all such problems, and it may need tweaking in some instances.
Try this —
array = [0.2248 0.2249 0.2250 0.2251 0.2399 0.2400 0.2401];
% array = [array array+0.51] % Test Vector
DifMtx = abs(array(:)-array) % Difference Matrix
DifMtx = 7×7
         0    0.0001    0.0002    0.0003    0.0151    0.0152    0.0153
    0.0001         0    0.0001    0.0002    0.0150    0.0151    0.0152
    0.0002    0.0001         0    0.0001    0.0149    0.0150    0.0151
    0.0003    0.0002    0.0001         0    0.0148    0.0149    0.0150
    0.0151    0.0150    0.0149    0.0148         0    0.0001    0.0002
    0.0152    0.0151    0.0150    0.0149    0.0001         0    0.0001
    0.0153    0.0152    0.0151    0.0150    0.0002    0.0001         0
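[Col1,ixs] = sort(DifMtx(:,1)); % First Column & Indices
Col1Dif = diff([0; Col1]); % Ordered Column Differences
BP = [1; find(Col1Dif >= 5*min(Col1Dif(Col1Dif>0))); numel(Col1)+1]; % Break Points (gap at least 5x the smallest nonzero gap)
for k = 1:numel(BP)-1
    idxrng = BP(k) : BP(k+1)-1; % Positions of cluster k in the sorted order
    Cluster{k} = array(ixs(idxrng)); % Map back through 'ixs' (same as array(idxrng) when 'array' is already sorted)
end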
figure
hold on
for k = 1:numel(Cluster)
    stem(Cluster{k}, ones(size(Cluster{k})), '.', 'filled', 'DisplayName',["Cluster #"+k])
end
hold off
grid
xlim([0.22 0.245]) % Optional
ylim([0 2])
legend('Location','best')
xlabel('Array')
title('Clusters')
The ‘ixs’ vector indexes into the original ‘Col1’ vector (and the original ‘array’ vector) if that information is needed.
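For the original goal (the largest group and the index of its "middle" ray), the same break-point idea can be taken one step further. The following is only a minimal sketch under stated assumptions, not part of the code above: it sorts the values themselves rather than a column of the difference matrix, drops NaN entries before sorting, and assumes the vector contains at least two distinct non-NaN values. The 5x threshold is the same heuristic as above and may need tuning. The names `valid`, `order`, `gaps`, `tol`, `starts`, `stops`, `members`, `midIdx` and `midLen` are introduced here for illustration only.

```matlab
array = [0.2248 0.2249 0.2250 0.2251 NaN 0.2399 0.2400 0.2401];

valid = find(~isnan(array));               % positions of the non-NaN entries
[vals, order] = sort(array(valid));        % sorted values and their sort order
ixs = valid(order);                        % sorted position -> index into 'array'

gaps = diff(vals(:));                      % gaps between sorted neighbours
tol = 5*min(gaps(gaps > 0));               % heuristic threshold (assumption)
starts = [1; find(gaps >= tol) + 1];       % first sorted position of each group
stops = [starts(2:end) - 1; numel(vals)];  % last sorted position of each group

[~, big] = max(stops - starts + 1);        % largest group (first one on ties)
members = ixs(starts(big):stops(big));     % indices into 'array' of that group
midIdx = members(ceil(numel(members)/2));  % "middle" element of the group
midLen = array(midIdx);                    % its ray length
```

For this test vector the larger group is the first four entries, so `members` is `[1 2 3 4]` and `midIdx` is 2. With a group of 7 elements, `ceil(7/2)` selects the 4th one, which matches the "middle" described in the question. To process a matrix column-wise, the same lines could be wrapped in a function and called in a loop over the columns.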
  2 Comments
Dominik Rhiem on 16 Aug 2023
Thank you, this seems like a really good approach. I will look into it and see whether I can implement this as-is. I have found a workaround myself in the meantime, but this seems more robust.
Star Strider on 16 Aug 2023
As always, my pleasure!
I did my best to make it as robust as I could, however if you encounter a vector in which it has problems, post back and I will see if I can improve it to make it work with the new vector.
