Conditional average (need help with speed)

I have a table that looks like this:
country_id year M T average_T
1 2000 10 76 NaN
1 2001 5 39 Mean of 76 and 62
1 2002 NaN 37 Mean of 39 =39
1 2003 15 5 NaN
1 2004 10 28 Mean of 5 and 2
1 2005 10 8 Mean of 8=8
2 1999 15 1 NaN
2 2000 10 62 Mean of 1=1
2 2001 20 32 Mean of 76 and 62
2 2002 10 72 Mean of 32=32
2 2003 15 2 Mean of 5 and 2
I want to calculate the column average_T which is last year's average of the T values for the cases that have the same year and M value. (First entry for each id is NaN because we don't know past year's T for those entries)
I have written a code that can do this but it is impossible to run with my big data set:
mytable.average_T=NaN(N,1);
for k=2:N
if mytable{k,'country_id'} == mytable{k-1,'country_id'}
mytable.average_T(k,1)= mean(T(mytable.M==mytable.M(k-1)& ...
mytable.year==mytable.year(k-1)), 'omitNaN');
end
end

 Accepted Answer

dpb
dpb on 16 Jan 2021
Edited: dpb on 17 Jan 2021
Grouping variables and rowfun to the rescue...
tMeans=rowfun(@(x),mean(x,'omitnan'),mytable,'InputVariables','T','GroupingVariables',{'year','M'});

11 Comments

Thank you for your reply! The problem that I'm having with rowfun is that it sorts the result according to the grouping variables (year and M) but I want to get results ordered according to country_id and year. It is not possible to re-order the variables after getting the results with rowfun as I lose the country_id variable in the results.
Thank you!
tMeans=rowfun(@(x),mean(x,'omitnan'),mytable,'InputVariables','T','GroupingVariables',{'country_id','year','M'});
then.
Altho I now notice you had separated by country in the code snippet, I had just read the text of the question that doesn't say by country...
Mia Dier
Mia Dier on 17 Jan 2021
Edited: Mia Dier on 17 Jan 2021
Hi again!
This again unfortunately doesn't solve the problem. I want the grouping variables to be year and M but I want the outcome to be ordered by country id first and then by year which is the original ordering of my dataset. I hope it is more clear now.
Thank you!
There is no such thing as a country order any longer once you've averaged over all country id, too.
I know that we average over the country id too but I need to know which average_T belonged to which country_id. Please check the example table in my question. It should be easier to understand from the table.
Thank you again!
dpb
dpb on 17 Jan 2021
Edited: dpb on 17 Jan 2021
It makes no sense, no. You either compute average over each country ID as a group as well or you don't group countries -- if you keep the country id then that is the ID of the group; if you don't use countries as a grouping variable then there is no way to associate any given order of the contributing elements that made up that average to the average itself; that is gone.
As noted in the other Q? of the same subject, you could keep a set of which countries were include in the averaging, but that's all that is, there's no order to associate with the mean.
Or in a similar vein as in the other Q? comment you could assign an auxiliary variable that is the row in the table that is passed through the function and kept with the group that would identify the members of the group but again while that could be sorted, other than it is the identification of who is in the group, there's no meaning in the order in the computed mean.
BTW, this last id would just be the grouping index you could get from findgroups; it may be that the information contained from it is what you're actually looking for here, but the request as couched just doesn't make sense.
As for the previous year thing, you can simply associate the computed average of the year with the previous year after the fact or create another year variable that is the actual year+1 to use as the grouping variable instead.
I'm not claiming that there is an order to associate with the mean. I'm just trying to sort the variables as they were in the first place. And keeping the 'original' order makes perfect sense if I have many other variables in that particular order and if I use them all together in a model later on.
Anyways, I found the solution to my problem my merging the rowfun results with the original data set.
Thank you.
"...merging the rowfun results with the original data set."
Well, yes, that's in line with what I was saying -- the result of rowfun with a merging/grouping operation are a totally separate output in number of elements/height of the output table.
If you replicate those back into the original set, then you can associate the mean with from whence it came, agreed; that's the same set of data as findgroups will tell you -- which group each element/row belongs to, and is in the original order.
tMeans=rowfun(@(x),mean(x,'omitnan'),mytable, ...
'InputVariables','T', ...
'GroupingVariables',{'country_id',year','M'}, ...
'OutputVariableNames',{'GroupMean'} );
mytable.GroupMeans=tMeans.GroupMean(findgroups(mytable.country_id,mytable.year,mytable.M));
should just populate the new column without any need to merge anything.
Now that we've defined the actual requirement, it's easier to solve the problem... :)
Amazing thank you! :)
NB: You could do the same thing with findgroups and splitapply without building the output table from rowfun, too.

Sign in to comment.

More Answers (0)

Categories

Asked:

on 16 Jan 2021

Commented:

dpb
on 17 Jan 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!