# Large datasets: Any way to perform regression analyses on select variables within a large table based on row name?

Josiah on 4 Jan 2013
I have a large dataset of soil profiles. I am trying to calculate regressions of organic carbon against profile depth. The dataset is a CSV file with columns for 'profile_name', 'top_depth', 'bottom_depth' and 'organic_carbon'. There are other columns for spatial data that aren't relevant here.
The data is organized so that each profile spans multiple rows: the 'profile_name' value is the same for anywhere from 2 to 10 rows, while 'top_depth' and 'bottom_depth' change to reflect the sample interval within the soil profile, and 'organic_carbon' gives how much carbon is in the soil over that interval.
What I want to do is write a script that will run linear and/or logarithmic regressions of 'organic_carbon' on 'bottom_depth' within each distinct 'profile_name'. I might want to go further with some calculations, but I think that would be the best start. The hurdle for me is binning the profile data by 'profile_name'. Any clues would be greatly appreciated!

Tom Lane on 5 Jan 2013
If you have the Statistics Toolbox, you might find it handy to use "dataset" to read in the csv file and create a dataset array from it. Then I recommend that you convert the profile_name variable to a nominal variable. The following illustrates how you can operate on different subsets based on values of a nominal variable:
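load carsmall % sample data (e.g. carsmall) that defines Origin, Displacement, Weight
d = dataset(Origin,Displacement,Weight); % you would read this from a file
d.Origin = nominal(d.Origin); % convert text to nominal
org = unique(d.Origin); % distinct group labels
for j = 1:length(org)
    t = d.Origin == org(j); % logical index of the rows in group j
    p = polyfit(d.Weight(t),d.Displacement(t),1); % p = [slope intercept]
    fprintf('%s: %s\n',char(org(j)),num2str(p))
end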
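
The same pattern could be adapted to the soil data roughly as follows. This is only a sketch: the values are made up to stand in for the CSV, the units are assumptions, and the variable names linfit and logfit are illustrative. It uses the third output of unique to bin rows by 'profile_name', so it should not need the Statistics Toolbox.

```matlab
% Hypothetical stand-in for the CSV columns (made-up values, two profiles).
profile_name   = {'P1';'P1';'P1';'P2';'P2';'P2';'P2'};
bottom_depth   = [10; 30; 60; 15; 40; 80; 120];        % cm (assumed units)
organic_carbon = [4.0; 2.5; 1.0; 3.2; 2.0; 1.1; 0.6];  % percent (assumed units)

% idx maps every row to an integer group index, one per distinct profile.
[names, ~, idx] = unique(profile_name);

linfit = zeros(numel(names), 2);  % [slope intercept] of carbon vs depth
logfit = zeros(numel(names), 2);  % [slope intercept] of carbon vs log(depth)
for k = 1:numel(names)
    rows = (idx == k);            % logical index of the rows in profile k
    x = bottom_depth(rows);
    y = organic_carbon(rows);
    linfit(k,:) = polyfit(x, y, 1);
    logfit(k,:) = polyfit(log(x), y, 1);  % only valid for depths > 0
    fprintf('%s: linear slope %.4f, log slope %.4f\n', names{k}, linfit(k,1), logfit(k,1));
end
```

With real data, a profile that has only two rows gets an exact straight-line fit, so those slopes carry no information about goodness of fit.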

Sam on 4 Apr 2013
Edited: Sam on 5 Apr 2013
I've compiled a similar soils dataset and find grpstats() to be very useful for sorting and indexing my data, especially when the data have multiple z values from horizon-based or depth-based profile sampling.
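
No call was posted with this answer, so the following is only a guess at how grpstats might be used here, assuming the Statistics Toolbox is installed. The data are made up, and note that grpstats returns per-group summary statistics rather than regression coefficients.

```matlab
% Hypothetical stand-in for the CSV columns (made-up values, two profiles).
profile_name   = {'P1';'P1';'P1';'P2';'P2';'P2';'P2'};
organic_carbon = [4.0; 2.5; 1.0; 3.2; 2.0; 1.1; 0.6];  % percent (assumed units)

% Mean carbon, row count and group label for each distinct profile.
[mu, n, g] = grpstats(organic_carbon, profile_name, {'mean', 'numel', 'gname'});

for k = 1:numel(g)
    fprintf('%s: mean carbon %.3f over %d samples\n', g{k}, mu(k), n(k));
end
```

The row counts in n could be used to skip profiles with too few samples before fitting a regression.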