Compute group statistics when the group definition is an OR of values found in several columns of a table

7 views (last 30 days)
Good morning,
I would like to compute group statistics when the definition of the group spans several columns which are not mutually exclusive. Think of the following example (the code corresponding to that is found below): a professor wants to compute the individual mean for 4 assignments of three students: Bob, Emma, and John. There is no restriction on who they can work with, so they can work in groups of 1, 2, or 3 for each assignment.
I have written a brute force solution below (3 versions). The reason I am unhappy with the best solution so far (v2) is that it is not easily scalable because I have to input the OR operator and the column number manually.
Additionally, I would like to match the assignment number to the student and have a table by Assignment Number, Student and their corresponding grade.
I wonder if there exists an OR condition for grpstats or similar functions. Or is there a way to easily scale/adapt v2 to accommodate more dynamic conditions? I was thinking of eval, but I know it should be avoided whenever possible.
% Mean per person when groups are not the same
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})
T = 6×5 table
Assignment Member1 Member2 Member3 Grade __________ ________ __________ __________ _____ 1 {'John'} {'Emma' } {0×0 char} 10 1 {'Bob' } {0×0 char} {0×0 char} 9 2 {'Emma'} {'Bob' } {'John' } 8 3 {'Emma'} {0×0 char} {0×0 char} 8 3 {'Bob' } {'John' } {0×0 char} 7 4 {'Bob' } {0×0 char} {0×0 char} 10
allnames=[T.Member1 ; T.Member2 ; T.Member3];
uniquenames=unique(allnames);
timespentv1=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
% Brute force
% allnames=[T.Member1 ; T.Member2 ; T.Member3];
% uniquenames=unique(allnames);
tic
for i=2:numel(uniquenames)
gradesforthatperson=[];
for j=1:3
personfound=ismember(T.(strcat('Member',num2str(j))),uniquenames{i});
gradesforthatperson=[gradesforthatperson ; T.Grade(personfound,1)];
end
averageforthatperson(i)=nanmean(gradesforthatperson);
end
timespentv1=[timespentv1 toc];
end
uniquenames'
ans = 1×4 cell array
{0×0 char} {'Bob'} {'Emma'} {'John'}
averageforthatperson
averageforthatperson = 1×4
0 8.5000 8.6667 8.3333
% v2
timespentv2=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
% Brute force v2
% allnames=[T.Member1 ; T.Member2 ; T.Member3];
% uniquenames=unique(allnames);
tic
for i=2:numel(uniquenames)
gradesforthatperson=[];
personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;
gradesforthatperson=[T.Grade(personfound,1)];
averageforthatperson(i)=nanmean(gradesforthatperson);
end
timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments
end
% v3
timespentv3=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
% Brute force v3
% allnames=[T.Member1 ; T.Member2 ; T.Member3];
% uniquenames=unique(allnames);
tic
for i=2:numel(uniquenames)
gradesforthatperson=[];
personfound= max(ismember([T.(strcat('Member',num2str(1))),T.(strcat('Member',num2str(2))),T.(strcat('Member',num2str(3)))],uniquenames{i}),[],2);
gradesforthatperson=[T.Grade(personfound,1)];
averageforthatperson(i)=nanmean(gradesforthatperson);
end
timespentv3=[timespentv3 toc];
end
figure;plot([timespentv1' timespentv2' timespentv3']);legend('v1','v2','v3','location','NorthWest');
title('Time required for each version');xlabel('Size table x100.000');ylabel('seconds');
Thank you in advance.
Best regards,
P.S.: I am using R2015b.

Accepted Answer

Bruno Luong
Bruno Luong on 28 Sep 2023
Edited: Bruno Luong on 28 Sep 2023
I submit a method without loop. I expected to be faster but to my surprise it is slower than v2. But at least it is scalable.
% Mean per person when groups are not the same
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})
allnames=[T.Member1 ; T.Member2 ; T.Member3];
uniquenames=unique(allnames);
% v2
timespentv2=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
tic
for i=2:numel(uniquenames)
gradesforthatperson=[];
personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;
gradesforthatperson=[T.Grade(personfound,1)];
averageforthatperson(i)=mean(gradesforthatperson,'omitnan');
end
timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments
end
averageforthatperson
% v4
timespentv4=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
[tf, membercol]=ismember({'Member1' 'Member2' 'Member3'}, T.Properties.VariableNames);
membercol = membercol(tf);
tic
averageforthatperson = zeros(1,numel(uniquenames));
i=2:numel(uniquenames);
Tmembercol = T{:,membercol};
[tf,loc] = ismember(Tmembercol,uniquenames(i));
g=repmat(T.Grade,1,size(loc,2));
averageforthatperson(i) = accumarray(loc(tf),g(tf))./accumarray(loc(tf),1);
timespentv4=[timespentv4 toc]; % advantage is that it keeps the order of the assignments
end
averageforthatperson
figure;
plot([timespentv2; timespentv4]');
legend('v2','v4','location','NorthWest');
title('Time required for each version');
xlabel('Size table x100.000');
ylabel('seconds');
  3 Comments
Bruno Luong
Bruno Luong on 28 Sep 2023
You don't need EVAL, here is the v6 that is as fast as v2
% Mean per person when groups are not the same
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})
T = 6×5 table
Assignment Member1 Member2 Member3 Grade __________ ________ __________ __________ _____ 1 {'John'} {'Emma' } {0×0 char} 10 1 {'Bob' } {0×0 char} {0×0 char} 9 2 {'Emma'} {'Bob' } {'John' } 8 3 {'Emma'} {0×0 char} {0×0 char} 8 3 {'Bob' } {'John' } {0×0 char} 7 4 {'Bob' } {0×0 char} {0×0 char} 10
allnames=[T.Member1 ; T.Member2 ; T.Member3];
uniquenames=unique(allnames);
% v2
timespentv2=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
tic
for i=2:numel(uniquenames)
gradesforthatperson=[];
personfound= ismember(T.(strcat('Member',num2str(1))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(2))),uniquenames{i}) | ismember(T.(strcat('Member',num2str(3))),uniquenames{i}) ;
gradesforthatperson=[T.Grade(personfound,1)];
averageforthatperson(i)=mean(gradesforthatperson,'omitnan');
end
timespentv2=[timespentv2 toc]; % advantage is that it keeps the order of the assignments
end
averageforthatperson
averageforthatperson = 1×4
0 8.5000 8.6667 8.3333
% v6
timespentv6=[];
for k=1:100000:1000000
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'});
T=repmat(T,k,1);
tic
for i=2:numel(uniquenames)
personfound = false;
for j=1:3
personfound = personfound | ismember(T.(sprintf('Member%d',j)),uniquenames{i});
end
gradesforthatperson=[T.Grade(personfound,1)];
averageforthatperson(i)=mean(gradesforthatperson,'omitnan');
end
timespentv6=[timespentv6 toc]; % advantage is that it keeps the order of the assignments
end
averageforthatperson
averageforthatperson = 1×4
0 8.5000 8.6667 8.3333
figure;
plot([timespentv2; timespentv6]');
legend('v2','v6','location','NorthWest');
title('Time required for each version');
xlabel('Size table x100.000');
ylabel('seconds');
O.Hubert
O.Hubert on 29 Sep 2023
Great! I didn't think of sprintf instead of eval, and the looping of the OR condition is very neat. It also allows to have other types of conditions (AND, NOT, etc) in a similar way.
I'll accept your answer.
Thanks again.

Sign in to comment.

More Answers (1)

Jeff Miller
Jeff Miller on 29 Sep 2023
Maybe reformat the table with stack:
T=table([1 1 2 3 3 4]',{'John','Bob','Emma','Emma','Bob','Bob'}', ...
{'Emma','','Bob','' ,'John','' }', ...
{'' ,'','John','' ,'','' }' ...
,[10 9 8 8 7 10]','VariableNames',{'Assignment','Member1','Member2','Member3','Grade'})
T2 = stack(T,{'Member1','Member2','Member3'},'NewDataVariableName','Student')
tblstats = grpstats(T2,'Student','mean','DataVars','Grade')
to get
T =
6×5 table
Assignment Member1 Member2 Member3 Grade
__________ ________ __________ __________ _____
1 {'John'} {'Emma' } {0×0 char} 10
1 {'Bob' } {0×0 char} {0×0 char} 9
2 {'Emma'} {'Bob' } {'John' } 8
3 {'Emma'} {0×0 char} {0×0 char} 8
3 {'Bob' } {'John' } {0×0 char} 7
4 {'Bob' } {0×0 char} {0×0 char} 10
T2 =
18×4 table
Assignment Grade Student_Indicator Student
__________ _____ _________________ __________
1 10 Member1 {'John' }
1 10 Member2 {'Emma' }
1 10 Member3 {0×0 char}
1 9 Member1 {'Bob' }
1 9 Member2 {0×0 char}
1 9 Member3 {0×0 char}
2 8 Member1 {'Emma' }
2 8 Member2 {'Bob' }
2 8 Member3 {'John' }
3 8 Member1 {'Emma' }
3 8 Member2 {0×0 char}
3 8 Member3 {0×0 char}
3 7 Member1 {'Bob' }
3 7 Member2 {'John' }
3 7 Member3 {0×0 char}
4 10 Member1 {'Bob' }
4 10 Member2 {0×0 char}
4 10 Member3 {0×0 char}
tblstats =
3×3 table
Student GroupCount mean_Grade
________ __________ __________
John {'John'} 3 8.3333
Emma {'Emma'} 3 8.6667
Bob {'Bob' } 4 8.5
>>
  1 Comment
O.Hubert
O.Hubert on 29 Sep 2023
Edited: O.Hubert on 29 Sep 2023
Thank you Jeff for the loopless and very short code. I did not know about the stack function.
It does what I want and stays within the table environment. However, this is painfully slow (we are talking about 15-20x slower than the looped version.) Nevertheless, it has the advantage to be transparent. I may prefer that solution if time is not important.
Thanks again for the suggestion.

Sign in to comment.

Categories

Find more on Data Type Identification in Help Center and File Exchange

Products


Release

R2015b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!