How to take a random sample of each column?

I have data from a file with 25 columns and 9000 rows. I would like to keep everything in .m format rather than .mat, but with this many rows that is not possible: when I try to save the file, I am told that files that are too big will be saved as .mat. How can I get just a random sample of each column (around 150-200 rows)? Thank you.

Accepted Answer

Image Analyst on 13 Oct 2015
If you have the Statistics and Machine Learning Toolbox, you can use the randsample() function:
y = randsample(n,k) returns a k-by-1 vector y of values sampled uniformly at random, without replacement, from the integers 1 to n.
y = randsample(population,k) returns a vector of k values sampled uniformly at random, without replacement, from the values in the vector population. The orientation of y (row or column) is the same as population.
y = randsample(n,k,replacement) or y = randsample(population,k,replacement) returns a sample taken with replacement if replacement is true, or without replacement if replacement is false. The default is false.
y = randsample(n,k,true,w) or y = randsample(population,k,true,w) returns a weighted sample taken with replacement, using a vector of positive weights w, whose length is n. The probability that the integer i is selected for an entry of y is w(i)/sum(w). Usually, w is a vector of probabilities. randsample does not support weighted sampling without replacement.
y = randsample(s,...) uses the stream s for random number generation. s is a member of the RandStream class. Default is the MATLAB® default random number stream.
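
As a rough illustration of the call forms above, here is a minimal sketch that draws row indices for a 9000-row data set. The variable names and the weight vector are made up for the example, and randsample requires the Statistics and Machine Learning Toolbox.

```matlab
n = 9000;                                  % Number of rows in the data (from the question).
k = 150;                                   % Number of rows to sample.

idxNoRepeat = randsample(n, k);            % Default: without replacement, so no index repeats.
idxRepeat   = randsample(n, k, true);      % With replacement: indices may repeat.

w = ones(1, n);                            % Hypothetical weights, one per row.
w(1:100) = 10;                             % Make the first 100 rows 10 times more likely.
idxWeighted = randsample(n, k, true, w);   % Weighted sampling (with replacement only).
```

Any of these index vectors can then be used as data(idx, :) to pull the matching rows from every column.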

More Answers (2)

Thorsten on 14 Oct 2015
randomlySelected=data(randsample(size(data,1), nRows), :);
If you don't have randsample, use randperm instead:
ind = randperm(size(data,1));
ind = ind(1:nRows);
randomlySelected=data(ind, :);
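
For completeness, a self-contained variant of the randperm approach might look like the sketch below. The placeholder data, the guard on nRows, and the sort that keeps the selected rows in their original order are additions for illustration, not part of the answer above.

```matlab
data = rand(9000, 25);                 % Placeholder data with the question's dimensions.
nRows = min(150, size(data, 1));       % Never ask for more rows than exist.
ind = randperm(size(data, 1));         % Random ordering of all row indices.
ind = sort(ind(1:nRows));              % Keep nRows of them, in original row order.
randomlySelected = data(ind, :);       % Same rows taken from every column.
```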

Image Analyst on 15 Oct 2015
If you want to do it all in one line, and if you have the Statistics and Machine Learning Toolbox, use datasample:
randomlySelectedRows = datasample(data, 150);
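This returns a 150-row by 25-column matrix. Note that datasample samples with replacement by default, so a row can appear more than once; pass 'Replace', false as extra arguments to avoid that.
Otherwise, you can use randperm() to make sure you don't select any row twice. Its second argument limits the result to 150 of the row numbers: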
data = rand(9000, 25); % Sample data.
nRows=150; % However many rows you want to extract in the subset.
rowsToExtract = randperm(size(data, 1), nRows); % Get list of the rows to use.
randomlySelectedRows = data(rowsToExtract, :); % Do the extraction.
Image Analyst on 29 Nov 2023
@Noam A Not sure what you mean (despite reading it several times). My code picks each selected row only once and doesn't repeat any row for any column. Every column uses the same set of unique row numbers. Of course, not every row is chosen unless your n is chosen precisely to make that happen. If n is small, then only some of the rows are chosen.
Not sure if your code calls randperm again for each column (sounds like it), but that could give two columns the same row. If you don't call randperm for each column, then of course all columns will have the same set of rows extracted.
Noam A on 29 Nov 2023
Edited: Noam A on 30 Nov 2023
@Image Analyst yes, I guess I wasn't clear. I wanted a random set of samples from each column. It's fine if rows get repeated, but what I didn't want is to pick n random row indices and then use those same indices for each column. This is what previous answers in this original thread were doing, as far as I could tell. Basically, I am treating each column as its own dataset and picking n random samples from that dataset. I found a solution that worked (see code I posted above), though I am sure it could be optimized. Thanks again!
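
One way to sample each column independently, as described in this comment, is to call randperm once per column. This is only a sketch of that idea with placeholder data and made-up variable names; it is not the code referred to above, which is not shown in this thread.

```matlab
data = rand(9000, 25);                  % Placeholder data with the question's dimensions.
n = 150;                                % Number of samples to take from each column.
[nRowsTotal, nCols] = size(data);
sampled = zeros(n, nCols);              % Preallocate the n-by-nCols result.
for c = 1:nCols
    rowIdx = randperm(nRowsTotal, n);   % Fresh row indices for this column only.
    sampled(:, c) = data(rowIdx, c);    % Take those rows from column c.
end
```

Within a column no row is picked twice, but different columns generally draw from different rows, so a row of sampled no longer corresponds to a single row of data.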
