
datasample

Randomly sample from data, with or without replacement

Description

y = datasample(data,k) returns k observations sampled uniformly at random, with replacement, from the data in data.

y = datasample(data,k,dim) returns a sample taken along dimension dim of data.

y = datasample(___,Name,Value) returns a sample using any of the input argument combinations in the previous syntaxes, with additional options specified by one or more name-value pair arguments. For example, 'Replace',false specifies sampling without replacement.

y = datasample(s,___) uses the random number stream s to generate random numbers. The option s can precede any of the input arguments in the previous syntaxes.

[y,idx] = datasample(___) also returns an index vector idx indicating which values datasample sampled from data, using any of the input argument combinations in the previous syntaxes.

Examples

Create the random number stream for reproducibility.

s = RandStream('mlfg6331_64'); 

Draw five unique values from the integers 1 to 10.

y = datasample(s,1:10,5,'Replace',false)
y = 1×5

     9     8     3     6     2

Create the random number stream for reproducibility.

s = RandStream('mlfg6331_64');

Generate 48 random characters from the sequence ACGT per specified probabilities.

seq = datasample(s,'ACGT',48,'Weights',[0.15 0.35 0.35 0.15])
seq = 
'GGCGGCGCAAGGCGCCGGACCTGGCTGCACGCCGTTCCCTGCTACTCG'
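
As a rough check on how the weights act, the following sketch draws a much larger sample and tabulates the empirical frequency of each character. The sample size n and the variable names w, letters, and freq are illustrative choices, not part of the original example.

```matlab
% Draw a large weighted sample and compare the empirical frequencies
% with the requested weights (sizes and names are illustrative).
s = RandStream('mlfg6331_64');
w = [0.15 0.35 0.35 0.15];
letters = 'ACGT';
n = 1e5;
seq = datasample(s,letters,n,'Weights',w);

% Fraction of the sample taken up by each letter.
freq = zeros(1,numel(letters));
for i = 1:numel(letters)
    freq(i) = sum(seq == letters(i))/n;
end
disp([w; freq])   % second row should be close to the first
```

With a sample this large, each entry of freq should differ from the corresponding weight only in the second or third decimal place.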

Set the random seed for reproducibility of the results.

rng(10,'twister') 

Generate a matrix with 10 rows and 1000 columns.

X = randn(10,1000);

Create the random number stream for reproducibility within datasample.

s = RandStream('mlfg6331_64');

Randomly select five unique columns from X.

Y = datasample(s,X,5,2,'Replace',false)
Y = 10×5

    0.4317   -0.3327    0.9112   -2.3244    0.9559
    0.6977   -0.7422    0.4578   -1.3745   -0.8634
   -0.8543   -0.3105    0.9836   -0.6434   -0.4457
    0.1686    0.6609   -0.0553   -0.1202   -1.3699
   -1.7649   -1.1607   -0.3513   -1.5533    0.0597
   -0.3821    0.5696   -1.6264   -0.2104   -1.5486
   -1.6844    0.7148   -0.6876   -0.4447   -1.4615
   -0.4170    1.3696    1.1874   -0.9901    0.5875
   -0.2410    1.4703   -2.5003   -1.1321   -1.8451
    0.6212    1.4118   -0.4518    0.8697    0.8093

Resample observations from a dataset array to create a bootstrap replicate data set. See Bootstrap Resampling for more information about bootstrapping.

Load the sample data set.

load hospital

Create a data set that has the same size as the hospital data set and contains random samples chosen with replacement from the hospital data set.

y = datasample(hospital,size(hospital,1));
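
A bootstrap analysis usually repeats this resampling step many times. The following is a minimal sketch, assuming a plain numeric vector x as stand-in data (not the hospital data set) and an arbitrary replicate count nBoot, that estimates the standard error of the sample mean.

```matlab
% Bootstrap the mean of a numeric vector (x is hypothetical stand-in data).
rng(1,'twister')                   % arbitrary seed, for reproducibility only
x = randn(50,1) + 3;
nBoot = 1000;                      % number of bootstrap replicates (illustrative)
bootMeans = zeros(nBoot,1);
for b = 1:nBoot
    xb = datasample(x,numel(x));   % resample with replacement, same size as x
    bootMeans(b) = mean(xb);
end
se = std(bootMeans);               % bootstrap estimate of the standard error
```

The value of se should be near std(x)/sqrt(numel(x)), which is the usual analytic standard error of the mean.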

Select samples from data based on indices of a sample chosen from another vector.

Generate two random vectors.

x1 = randn(100,1);
x2 = randn(100,1);

Select a sample of 10 elements from vector x1, and return the indices of the sample in vector idx.

[y1,idx] = datasample(x1,10);

Select a sample of 10 elements from vector x2 using the indices in vector idx.

y2 = x2(idx);

Input Arguments

Input data from which to sample, specified as a vector, matrix, multidimensional array, table, or dataset array. By default, datasample samples from the first nonsingleton dimension of data. For example, if data is a matrix, then datasample samples from the rows. Change this behavior with the dim input argument.

Data Types: single | double | logical | char | string | table

Number of samples, specified as a positive integer.

Example: datasample(data,100) returns 100 observations sampled uniformly at random, with replacement, from the data in data.

Data Types: single | double

Dimension to sample, specified as a positive integer. For example, if data is a matrix and dim is 2, y contains a selection of columns in data. If data is a table or dataset array and dim is 2, y contains a selection of variables in data. Use dim to ensure sampling along a specific dimension regardless of whether data is a vector, matrix, or N-dimensional array.

Data Types: single | double
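
The effect of dim on a vector can be surprising, so here is a small sketch using an illustrative row vector v. Without dim, datasample samples along the first nonsingleton dimension (the columns of v). With dim set to 1, it samples rows, and v has only one row to choose from.

```matlab
v = 1:10;                  % 1-by-10 row vector (illustrative data)
a = datasample(v,3);       % samples along dimension 2, the first nonsingleton one
b = datasample(v,3,2);     % same dimension, stated explicitly
c = datasample(v,3,1);     % samples rows; v has one row, so c repeats it 3 times

disp(size(a))              % 1 3
disp(size(b))              % 1 3
disp(size(c))              % 3 10
```

Passing dim explicitly is the safer choice in code that must behave the same way for row vectors, column vectors, and matrices.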

Random number stream, specified as the global stream or RandStream. For example, s = RandStream('mlfg6331_64') creates a random number stream that uses the multiplicative lagged Fibonacci generator algorithm. For details, see Creating and Controlling a Random Number Stream.

The rng function provides a simple way to control the global stream. For example, rng(seed) seeds the random number generator using the nonnegative integer seed. For details, see Managing the Global Stream Using RandStream.
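
The two approaches to reproducibility can be sketched side by side. The seeds 42 and 7 and the variable names below are arbitrary illustrative choices.

```matlab
% Option 1: seed the global stream with rng before each call.
rng(42,'twister');
a1 = datasample(1:100,5);
rng(42,'twister');
a2 = datasample(1:100,5);
sameGlobal = isequal(a1,a2);     % true: same seed, same sample

% Option 2: pass a dedicated stream s, leaving the global stream alone.
s = RandStream('mlfg6331_64','Seed',7);
b1 = datasample(s,1:100,5);
reset(s);                        % return s to its initial state
b2 = datasample(s,1:100,5);
sameStream = isequal(b1,b2);     % true: the stream was reset before the second draw
```

A dedicated stream is convenient when other code also draws from the global stream and you do not want the two to interfere.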

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Replace',false,'Weights',ones(datasize,1) samples without replacement and with probability proportional to the elements of Weights, where datasize is the size of the dimension being sampled.

Indicator for sampling with replacement, specified as the comma-separated pair consisting of 'Replace' and either true or false.

Sample with replacement if 'Replace' is true, or without replacement if 'Replace' is false. If 'Replace' is false, then k must not be larger than the size of the dimension being sampled. For example, if data = [1 3 Inf; 2 4 5] and y = datasample(data,k,'Replace',false), then k cannot be larger than 2.

Data Types: logical
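
The size limit on k can be sketched with the matrix from the description above. The try/catch block is only there to show that the oversized request fails; the flag name tooLarge is illustrative.

```matlab
data = [1 3 Inf; 2 4 5];                      % 2 rows, 3 columns

% k = 2 is allowed: both rows are returned, in random order.
y = datasample(data,2,'Replace',false);

% k = 3 exceeds the 2 available rows, so datasample raises an error.
tooLarge = false;
try
    datasample(data,3,'Replace',false);
catch
    tooLarge = true;
end

% Sampling along dimension 2 instead allows k up to 3 (the number of columns).
yc = datasample(data,3,2,'Replace',false);
```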

Sampling weights, specified as the comma-separated pair consisting of 'Weights' and a vector of nonnegative numeric values. The vector is of size datasize, where datasize is the size of the dimension being sampled. The vector must have at least one positive value and cannot contain NaN values. The datasample function samples with probability proportional to the elements of 'Weights'.

Example: 'Weights',[0.1 0.5 0.35 0.46]

Data Types: single | double
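
Because sampling is proportional to the weights, they do not need to sum to 1, and a zero weight excludes an element entirely. A minimal sketch, with illustrative data and an arbitrary seed:

```matlab
rng(3,'twister')                          % arbitrary seed
vals = [10 20 30 40];
w = [0 2 0 6];                            % unnormalized; zeros exclude 10 and 30
y = datasample(vals,1000,'Weights',w);

disp(unique(y))                           % only 20 and 40 can appear
frac40 = mean(y == 40);                   % should be roughly 6/(2+6) = 0.75
```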

Output Arguments

Sample, returned as a vector, matrix, multidimensional array, table, or dataset array.

  • If data is a vector, then y is a vector containing k elements selected from data.

  • If data is a matrix and dim = 1, then y is a matrix containing k rows selected from data. Or, if dim = 2, then y is a matrix containing k columns selected from data.

  • If data is an N-dimensional array and you do not specify dim, then y is an N-dimensional array of samples taken along the first nonsingleton dimension of data. Or, if you specify a value for the dim input argument, datasample samples along the dimension dim.

  • If data is a table and dim = 1, then y is a table containing k rows selected from data. Or, if dim = 2, then y is a table containing k variables selected from data.

  • If data is a dataset array and dim = 1, then y is a dataset array containing k rows selected from data. Or, if dim = 2, then y is a dataset array containing k variables selected from data.

If the input data contains missing observations that are represented as NaN values, datasample samples from the entire input, including the NaN values. For example, y = datasample([NaN 6 14],2) can return y = [NaN 14].

When the sample is taken with replacement (default), y can contain repeated observations from data. Set the Replace name-value pair argument to false to sample without replacement.
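
If NaN values should not appear in the sample, one possible workaround (a sketch, not a built-in option of datasample) is to filter them out before sampling:

```matlab
x = [NaN 6 14];
xValid = x(~isnan(x));         % keep only the non-missing observations
y = datasample(xValid,2);      % y can contain only 6 and 14
```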

Indices, returned as a vector indicating which elements datasample chooses from data to create y. For example:

  • If data is a vector, then y = data(idx).

  • If data is a matrix and dim = 1, then y = data(idx,:).

  • If data is a matrix and dim = 2, then y = data(:,idx).
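
These relationships can be checked directly. The matrix A and the seed below are illustrative.

```matlab
rng(5,'twister')                   % arbitrary seed
A = magic(4);

[yr,ir] = datasample(A,3);         % sample 3 rows (dim defaults to 1 for a matrix)
[yc,ic] = datasample(A,2,2);       % sample 2 columns

rowsMatch = isequal(yr,A(ir,:));   % true
colsMatch = isequal(yc,A(:,ic));   % true
```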

Tips

  • To sample random integers with replacement from a range, use randi.

  • To sample random integers without replacement, use randperm or datasample.

  • To randomly sample from data, with or without replacement, use datasample.

Algorithms

datasample uses randperm, rand, or randi to generate random values. Therefore, unless you specify a random number stream s, datasample changes the state of the MATLAB® global random number generator. Control the global random number generator using rng.

For selecting weighted samples without replacement, datasample uses the algorithm of Wong and Easton [1].

Alternative Functionality

You can use randi or randperm to generate indices for random sampling with or without replacement, respectively. However, datasample can be more convenient to use because it samples directly from your data. datasample also allows weighted sampling.
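
For comparison, a minimal sketch of the index-based approach, using an illustrative column vector data and sample size k:

```matlab
data = (1:20)'.^2;                 % illustrative data, 20 observations
n = size(data,1);
k = 5;

% With replacement: draw k indices from 1..n, repeats allowed.
idxWith = randi(n,k,1);
yWith = data(idxWith,:);

% Without replacement: draw k distinct indices from 1..n.
idxWithout = randperm(n,k);
yWithout = data(idxWithout,:);
```

The equivalent datasample calls are datasample(data,k) and datasample(data,k,'Replace',false), which avoid the explicit indexing step.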

References

[1] Wong, C. K. and M. C. Easton. "An Efficient Method for Weighted Sampling Without Replacement." SIAM Journal on Computing 9(1), pp. 111–113, 1980.

Version History

Introduced in R2011b
