How to quickly group numerical data without giving bin sizes

Question

Dominik Rhiem on 15 Aug 2023


https://se.mathworks.com/matlabcentral/answers/2008882-how-to-quickly-group-numerical-data-without-giving-bin-sizes

Commented: Star Strider on 16 Aug 2023

I am trying to find an efficient and quick way to group numerical data. In short, I have several paths towards a particular pixel, and these paths consist of rays of slightly different lengths (as any ray that crosses the pixel anywhere is valid for a path). These paths can therefore be considered groups of rays. I want to differentiate the paths by their (average) length and select the path that contains the largest number of rays, or, in other words, identify the groups and select the largest group.

Importantly though, I do not just need the length, but also an index to identify one ray, e.g. the "middle" one of the group. (Say I have an array of size 10, and the first 7 and last 3 elements form 2 groups. I would like to identify the groups, then, out of the 7 elements of the larger group, I would like to get the index of the 4th element as the "middle".)

My current solution is to round the ray lengths (to the third decimal, as the pixel size is on the millimeter scale) and use the "mode" function. However, this is both inefficient (because I want to do this column-wise for a matrix that also contains NaN values that I would like to ignore) and in some cases inaccurate. For example:

array = [0.2248 0.2249 0.2250 0.2251 0.2399 0.2400 0.2401];
array2 = round(array,2);
mode(array2)
ans = 0.2400

Of course it would be logical to group the first four entries and the last three, but the rounding operation is ill-suited when the values vary around the .5. I have used the histogram function to plot examples in my code and it groups the entries in a satisfactory way. However, I actively do not want the plot itself, I just need the grouping, and the histogram function seems to have a rather large overhead for this purpose (as this operation has to be performed thousands of times for a proper run of the program). The discretize function unfortunately needs me to give it an explicit number of bins, i.e. I would need to have an a priori idea of the groups.
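To make the rounding problem concrete, here is a small sketch that counts how many entries land on each rounded value. The variable names `rounded`, `vals`, `grp` and `counts` are only illustrative.

```matlab
array = [0.2248 0.2249 0.2250 0.2251 0.2399 0.2400 0.2401];
rounded = round(array, 2);           % round to two decimals, as in the example above
[vals, ~, grp] = unique(rounded);    % distinct rounded values and a group label per entry
counts = accumarray(grp(:), 1).';    % number of entries per rounded value
disp([vals; counts])                 % the first four entries are split over two rounded values
```

Because the first four entries straddle the rounding boundary, they end up on two different rounded values, so the three entries near 0.24 win the "mode" even though they form the smaller group.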

Is there any function that can efficiently do this, or are there suggestions for a better way to do it myself than "mode"?


Answer 1

Star Strider on 16 Aug 2023


https://se.mathworks.com/matlabcentral/answers/2008882-how-to-quickly-group-numerical-data-without-giving-bin-sizes#answer_1288442


I am not certain that there is a robust approach to this sort of problem. For multivariable problems (each point is a vector determined by more than one value), there are built-in clustering functions. This one-dimensional case is a bit unusual.

The data ideally need to be ordered (although that may not be an absolute requirement), because the differences are easier to calculate if they are. This approach may be more than this particular problem needs; however, I decided to make it a bit more robust so that it would be appropriate for other problems as well. I cannot be certain it will be robust for all such problems, and it may need tweaking in some instances.

Try this —

array = [0.2248 0.2249 0.2250 0.2251 0.2399 0.2400 0.2401];
% array = [array array+0.51] % Test Vector
DifMtx = abs(array(:)-array) % Difference Matrix

DifMtx = 7×7
         0    0.0001    0.0002    0.0003    0.0151    0.0152    0.0153
    0.0001         0    0.0001    0.0002    0.0150    0.0151    0.0152
    0.0002    0.0001         0    0.0001    0.0149    0.0150    0.0151
    0.0003    0.0002    0.0001         0    0.0148    0.0149    0.0150
    0.0151    0.0150    0.0149    0.0148         0    0.0001    0.0002
    0.0152    0.0151    0.0150    0.0149    0.0001         0    0.0001
    0.0153    0.0152    0.0151    0.0150    0.0002    0.0001         0

[Col1,ixs] = sort(DifMtx(:,1)); % First Column & Indices
Col1Dif = diff([0; Col1]); % Ordered Column Differences
BP = [1; find(Col1Dif >= 5*min(Col1Dif(Col1Dif>0))); numel(Col1)+1]; % Break Points
for k = 1:numel(BP)-1
    idxrng = BP(k) : BP(k+1)-1;
    Cluster{k} = array(ixs(idxrng)); % Map Sorted Positions To Original Indices
end

figure
hold on
for k = 1:numel(Cluster)
    stem(Cluster{k}, ones(size(Cluster{k})), '.', 'filled', 'DisplayName',["Cluster #"+k])
end
hold off
grid
xlim([0.22 0.245]) % Optional
ylim([0 2])
legend('Location','best')
xlabel('Array')
title('Clusters')

The ‘ixs’ vector indexes into the original ‘Col1’ vector (and the original ‘array’ vector) if that information is needed.
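The question also asks for the index of the "middle" ray of the largest group, in a column that may contain NaN. A minimal sketch of that step follows, using the same gap idea on the sorted values directly rather than on the difference matrix. The names `col`, `gapFactor`, `grpIdx` and `midIdx` are assumptions for illustration, the factor of 5 simply mirrors the break-point rule above, and the column is assumed to hold at least one non-NaN value.

```matlab
col = [NaN; 0.2399; 0.2248; 0.2400; NaN; 0.2401; 0.2249];  % one column, NaN allowed
gapFactor = 5;                           % assumed threshold, as in the break-point rule

valid = find(~isnan(col(:)));            % original indices of the non-NaN entries
[sorted, order] = sort(col(valid));      % valid values in ascending order
gaps = diff(sorted(:));                  % gaps between sorted neighbours
posGaps = gaps(gaps > 0);
if isempty(posGaps)
    breaks = zeros(0, 1);                % all valid values identical: one group
else
    breaks = find(gaps >= gapFactor * min(posGaps));
end
starts = [1; breaks + 1];                % first sorted position of each group
stops  = [breaks; numel(sorted)];        % last sorted position of each group
[~, big] = max(stops - starts + 1);      % largest group (first one on ties)
pos    = starts(big):stops(big);
grpIdx = valid(order(pos));              % original indices of the largest group
midIdx = grpIdx(ceil(numel(pos) / 2));   % index of its middle element
```

For this example column, the largest group is the three values near 0.24, so grpIdx is [2; 4; 6] and midIdx is 4. For a matrix, the same steps could be run in a loop over the columns. The threshold still depends on the smallest positive gap, so it shares the limitations of the break-point rule above.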


Dominik Rhiem on 16 Aug 2023

Thank you, this seems like a really good approach. I will look into it and see whether I can implement this as-is. I have found a workaround myself in the meantime, but this seems more robust.

Star Strider on 16 Aug 2023

As always, my pleasure!

I did my best to make it as robust as I could, however if you encounter a vector in which it has problems, post back and I will see if I can improve it to make it work with the new vector.
