manually subsetting data for training and testing purposes

I have a dataset containing locations, rows for each observation (month) and various climate data in the columns. Something that looks like this:
site_number month datacol1 datacol2 datacol3 etc...
1 Jan data1 data2 data3 ....
1 Feb data1 data2 data3 ...
....
1 Dec data1 data2 data3
...
2 Jan data1 data2 data3 ....
2 Feb data1 data2 data3 ....
.....
2 Dec data1 data2 data3
....
....
etc...
I want to create training and testing datasets in which observations stay grouped by site number, so that each site's 12 monthly rows are kept together in one block.
To put this into context: I have 28 sites altogether, and I want to cross-validate using a testing dataset containing 3 sites (36 rows in total) and a training dataset containing the other 25 sites, grouped by site number.
Can anybody advise on how I can do this, please?
  2 Comments
Chad Greene on 7 Oct 2017
Hi Roisin,
I'm not sure what you mean by training and testing data, or what you mean by cross validating. These seem like very general terms, but it sounds like they mean something specific to you. (This likely reflects my own ignorance of the methods you are trying to implement.)
If you want to create some array x containing all the data1 values corresponding to site 2, you could do so with
x = datacol1(site_number==2);
Does that answer your question?
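If the data are stored in a table rather than in separate column vectors, the same logical indexing can select whole rows, so every column comes along. The lines below are only a sketch under that assumption: the table T is synthetic stand-in data, and the names T, site2 and blocks are illustrative rather than taken from the question.

```matlab
% Synthetic stand-in table (assumed layout: 28 sites x 12 monthly rows)
site_number = repelem((1:28)', 12);
datacol1    = rand(336, 1);
datacol2    = rand(336, 1);
T = table(site_number, datacol1, datacol2);

% All columns for site 2: index rows with the logical mask, keep every column
site2 = T(T.site_number == 2, :);

% One block (sub-table) per site, collected in a cell array
sites  = unique(T.site_number);
blocks = arrayfun(@(s) T(T.site_number == s, :), sites, 'UniformOutput', false);
```

Here blocks{k} holds the rows for sites(k), which may be convenient if each site needs to be handled separately.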
Roisin Loughnane on 8 Oct 2017
Hi Chad,
Thank you for your reply, but that's not exactly what I am looking for. I want to separate the dataset into blocks based on the site numbers (1st column). I want all data to be included in this block, not just datacol1. Help please! I'm sure this is easy but what I've found (unique/accumarray/splitapply) has not helped me so far.
The testing and training data is really just data partitioning/subsetting in order to fit models to a proportion of the dataset, and test on the remaining data for boosted regression trees.
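One way to keep each site's 12 rows together is to split the list of site numbers rather than the rows themselves. The following is a minimal sketch, assuming the data are in a table with a numeric site_number column; the table T is synthetic stand-in data, and the names T, testSites and isTest are illustrative, not from the original post.

```matlab
% Synthetic stand-in for the real table: 28 sites x 12 months (assumed layout)
nSites = 28; nMonths = 12;
site_number = repelem((1:nSites)', nMonths);   % 1,1,...,1,2,2,...
month       = repmat((1:nMonths)', nSites, 1); % 1..12 repeated per site
datacol1    = rand(nSites*nMonths, 1);         % placeholder climate variable
T = table(site_number, month, datacol1);

% Split the SITES, not the rows: draw 3 distinct sites for the test set
rng(1);                                        % fixed seed, only for repeatability
sites     = unique(T.site_number);
testSites = sites(randperm(numel(sites), 3));

% Logical row mask: true for every row belonging to a test site
isTest    = ismember(T.site_number, testSites);
testData  = T(isTest, :);                      % 3 sites x 12 months = 36 rows
trainData = T(~isTest, :);                     % remaining 25 sites = 300 rows
```

Repeating the draw with different test sites would give a site-wise cross-validation. The two tables can then be passed to whichever fitting function is in use (for boosted regression trees, fitrensemble is one possibility).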


Answers (2)

Kian Azami on 8 Oct 2017
To make training and test datasets, you can use the following commands:
% First, create a cross-validation partition of your data.
% y is a vector containing the class label of each observation;
% cvpartition stratifies the split by these labels.
% 'HoldOut' requests a single training/test split, and
% p is the fraction of the data reserved for the test set.
p = 0.3; % example value
c = cvpartition(y,'HoldOut',p);
% Get the logical indices of the training and test sets
trainingIdx = training(c);
testIdx = test(c);
% Extract the training and test data (Your_Data is a table)
trainingData = Your_Data(trainingIdx,:);
testData = Your_Data(testIdx,:);
% Train a model on the training data;
% 'y' is the name of the response variable in the table
Trained_Model = fitcknn(trainingData,'y');
% Predict the test data with the trained model
Pre_Test = predict(Trained_Model,testData);
% Finally, calculate the error of the model on the test data
testErr = loss(Trained_Model,testData,'y')
  9 Comments
Roisin Loughnane on 8 Oct 2017
I have all the data in a table. The problem is that it does not divide the data by site. It takes the given class (site, in my case) and partitions within each class, putting a proportion of each site in both the training and test data. I want all of a site's observations to stay together, in either the training or the test data.
Kian Azami on 8 Oct 2017
Roisin, I think you are mixing up two different things, and I do not fully understand what you want.
You want to divide your data into test data and training data; for example, if you have 100 observations, you keep 30 for testing and 70 for training. But on the other hand, you want to keep all the observations together! Maybe you have something else in mind, but it is not clear from your explanation.
Maybe you need to create another, larger category (another Excel column) and say that the first 10 sites are group1, the next 10 sites are group2, and so on. Then you do the learning process using those larger groups.
Sorry, but this is all I have to offer as ideas and suggestions ;)



Drew on 13 Apr 2024 at 0:06
In R2023b, "custom" partition functionality was added to cvpartition. It can be used to create a cvpartition that keeps each site's observations together. See the "Version History" section of the cvpartition doc page.
R2023b: Create custom cross-validation partitions
The cvpartition function supports the creation of custom cross-validation partitions. Use the CustomPartition name-value argument to specify the test set observations. For example, cvpartition("CustomPartition",testSets) specifies to partition the data based on the test sets in testSets. The IsCustom property of the resulting cvpartition object is set to 1 (true).
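As a rough sketch of how this could look for the site-grouped case (assuming R2023b or later and a numeric site_number column): each site is assigned to a fold, and every row inherits its site's fold. The synthetic data and the names foldOfSite, siteIdx and testSets are illustrative, and the choice of 3 sites per fold simply mirrors the question.

```matlab
% Requires R2023b or later (CustomPartition). Synthetic stand-in data:
nSites = 28; nMonths = 12;
site_number = repelem((1:nSites)', nMonths);

% Assign each SITE to a fold, 3 sites per fold (the 10th fold gets 1 site)
rng(1);                                  % fixed seed, only for repeatability
sites      = unique(site_number);
foldOfSite = zeros(numel(sites), 1);
foldOfSite(randperm(numel(sites))) = ceil((1:numel(sites))' / 3);

% Map site-level folds to rows: every row inherits its site's fold
[~, siteIdx] = ismember(site_number, sites);
testSets     = foldOfSite(siteIdx);

% Build the custom partition; fold i tests on the rows with testSets == i
c = cvpartition("CustomPartition", testSets);
trainIdx = training(c, 1);   % logical row index, training rows of fold 1
testIdx  = test(c, 1);       % logical row index, test rows of fold 1
```

Passing a logical vector instead of an integer vector (true for the rows of the chosen test sites) should give a single hold-out split rather than a k-fold partition.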
