datasample

Randomly sample from data, with or without replacement

Syntax

y = datasample(data,k)
y = datasample(data,k,dim)
y = datasample(___,Name,Value)
y = datasample(s,___)
[y,idx] = datasample(___)

Description

y = datasample(data,k) returns k observations sampled uniformly at random, with replacement, from the data in data.

y = datasample(data,k,dim) returns a sample taken along dimension dim of data.

y = datasample(___,Name,Value) uses any of the input arguments in the previous syntaxes followed by one or more Name,Value pair arguments.

y = datasample(s,___) uses the random number stream s to generate random numbers. The option s can precede any of the input arguments in the previous syntaxes.

[y,idx] = datasample(___) also returns an index vector indicating which values datasample sampled from data using any of the input arguments in the previous syntaxes.

Input Arguments

data

Vector, matrix, N-dimensional array, table, or dataset array representing the data from which to sample. By default, datasample samples from the first nonsingleton dimension of the data array. For example, if data is a matrix, then datasample samples from the rows. Change this behavior with the dim input argument.

k

Positive integer, the number of samples.

dim

Integer specifying the dimension to sample. For example, if data is a matrix and dim is 2, y contains a selection of columns in data. If data is a table or dataset array and dim is 2, y contains a selection of variables in data. Use dim to ensure sampling along a specific dimension regardless of whether data is a vector, matrix, or N-dimensional array.

Default: The first nonsingleton dimension of data (1 for a matrix, table, or dataset array)
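The effect of dim can be checked from the size of the output. The following is a minimal sketch; the 4-by-6 matrix is a hypothetical input chosen only so that rows and columns are easy to tell apart.

```matlab
% Hypothetical 4-by-6 matrix, used only for illustration.
data = reshape(1:24,4,6);

% No dim given: a matrix is sampled along its rows.
rowsSample = datasample(data,3);

% dim = 2: sample columns instead.
colsSample = datasample(data,3,2);

disp(size(rowsSample))   % 3 rows, all 6 columns
disp(size(colsSample))   % all 4 rows, 3 columns
```

Sampling along a dimension keeps every other dimension intact, so only the size of the sampled dimension changes to k.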

s

Random number stream. Create s using RandStream. For example, s = RandStream('mlfg6331_64') creates a random number stream that uses the multiplicative lagged Fibonacci generator algorithm. For details, see Creating and Controlling a Random Number Stream (MATLAB).

Default: The global stream. The rng function provides a simple way to control the global stream. For example, rng(seed) seeds the random number generator using the nonnegative integer seed. For details, see Managing the Global Stream (MATLAB).
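As a sketch of how the two ways of controlling randomness behave, the code below draws the same sample twice, once from two identically seeded streams and once after reseeding the global stream. The seed values and the 1:100 input are arbitrary choices for illustration.

```matlab
% Two separate streams created with the same generator and seed
% start in the same state, so they produce the same sample.
s1 = RandStream('mlfg6331_64','Seed',1);
s2 = RandStream('mlfg6331_64','Seed',1);
y1 = datasample(s1,1:100,5);
y2 = datasample(s2,1:100,5);
sameFromStreams = isequal(y1,y2);

% Without s, datasample draws from the global stream,
% which rng can reset to a known state.
rng(42);
g1 = datasample(1:100,5);
rng(42);
g2 = datasample(1:100,5);
sameFromGlobal = isequal(g1,g2);
```

Passing an explicit stream leaves the global stream untouched, which can be useful when other code depends on the global generator state.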

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'Replace'

Sample with replacement if Replace is true, or without replacement if Replace is false. If Replace is false, then k must not be larger than the size of the dimension being sampled. For example, if data = [1 3 Inf; 2 4 5] and y = datasample(data,k,'Replace',false), then k cannot be larger than 2.

Default: true
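The limit on k can be seen with the matrix from the description above. This sketch assumes only that datasample raises an error when k exceeds the size of the sampled dimension; the specific error message is not relied on.

```matlab
data = [1 3 Inf; 2 4 5];

% k = 2 equals the number of rows, so without replacement
% y contains both rows of data in random order.
y = datasample(data,2,'Replace',false);

% k = 3 exceeds the number of rows, so the call fails.
try
    datasample(data,3,'Replace',false);
    errored = false;
catch
    errored = true;
end
```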

'Weights'

Vector with the same number of elements as the size of the dimension being sampled. The vector must have nonnegative elements and at least one positive value (NaN values are not allowed). The datasample function samples with probability proportional to the elements of Weights.

Default: ones(datasize,1), where datasize is the size of the dimension being sampled
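Weights do not need to sum to 1, because sampling is proportional to their values. In the following sketch, the values and weights are hypothetical; the element with zero weight is never drawn, and the element with weight 3 is drawn about three times as often as the element with weight 1.

```matlab
values = [10 20 30];   % hypothetical data
w      = [0 1 3];      % unnormalized weights, one per element

y = datasample(values,1000,'Weights',w);

neverDrawn = ~any(y == 10);   % zero weight excludes the value 10
frac30     = mean(y == 30);   % close to 3/(1+3) = 0.75, varies by run
```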

Output Arguments

y

  • If data is a vector, y is a vector containing k elements selected from data.

  • If data is a matrix, y is a matrix containing k rows selected from data. Or, if dim = 2, y is a matrix containing k columns selected from data.

  • If data is an N-dimensional array, datasample samples along its first nonsingleton dimension. Or, if you specify a value for the dim input argument, datasample samples along the dimension dim.

If the input data contains missing observations that are represented as NaN values, datasample samples from the entire input, including the NaN values. For example, y = datasample([NaN 6 14],2) can return y = NaN 14.

When the sample is taken with replacement (default), y can contain repeated observations from data. Set the Replace name-value pair argument to false to sample without replacement.

idx

Vector of indices indicating which elements datasample chose from data to create y. For example:

  • If data is a vector, then y = data(idx).

  • If data is a matrix and dim = 1, then y = data(idx,:).

  • If data is a matrix and dim = 2, then y = data(:,idx).
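The relationships listed above between y and idx can be checked directly. A minimal sketch, using a hypothetical 4-by-3 matrix:

```matlab
data = reshape(1:12,4,3);   % hypothetical 4-by-3 matrix

% Sample rows (the default for a matrix) and columns (dim = 2).
[yRows,idxRows] = datasample(data,2);
[yCols,idxCols] = datasample(data,2,2);

rowsMatch = isequal(yRows,data(idxRows,:));
colsMatch = isequal(yCols,data(:,idxCols));
```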

Examples

Draw five unique values from the integers 1:10.

s = RandStream('mlfg6331_64'); % For reproducibility
y = datasample(s,1:10,5,'Replace',false)

y =

     9     8     3     6     2

Generate a random sequence of the characters ACGT, with replacement, according to specified probabilities.

s = RandStream('mlfg6331_64'); % For reproducibility
seq = datasample(s,'ACGT',48,'Weights',[0.15 0.35 0.35 0.15])

seq =

    'GGCGGCGCAAGGCGCCGGACCTGGCTGCACGCCGTTCCCTGCTACTCG'

Select a random subset of columns from a data matrix.

rng(10,'twister')              % For reproducibility
X = randn(10,1000);
s = RandStream('mlfg6331_64'); % For reproducibility
Y = datasample(s,X,5,2,'Replace',false)

Y =

    0.4317   -0.3327    0.9112   -2.3244    0.9559
    0.6977   -0.7422    0.4578   -1.3745   -0.8634
   -0.8543   -0.3105    0.9836   -0.6434   -0.4457
    0.1686    0.6609   -0.0553   -0.1202   -1.3699
   -1.7649   -1.1607   -0.3513   -1.5533    0.0597
   -0.3821    0.5696   -1.6264   -0.2104   -1.5486
   -1.6844    0.7148   -0.6876   -0.4447   -1.4615
   -0.4170    1.3696    1.1874   -0.9901    0.5875
   -0.2410    1.4703   -2.5003   -1.1321   -1.8451
    0.6212    1.4118   -0.4518    0.8697    0.8093

Resample observations from a dataset array to create a bootstrap replicate dataset. See Bootstrap Resampling for more information about bootstrapping.

load hospital
y = datasample(hospital,size(hospital,1));
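The same resampling pattern applies to plain numeric data. The sketch below estimates the standard error of a sample mean from bootstrap replicates; the simulated vector x, the replicate count nBoot, and the variable names are assumptions made for illustration and are not part of the hospital example.

```matlab
rng(0,'twister');        % For reproducibility
x = randn(50,1);         % hypothetical sample of 50 observations
nBoot = 200;             % number of bootstrap replicates (arbitrary)

bootMeans = zeros(nBoot,1);
for b = 1:nBoot
    % Each replicate has the same size as x and is drawn
    % with replacement (the default).
    xb = datasample(x,numel(x));
    bootMeans(b) = mean(xb);
end

% Spread of the replicate means estimates the standard error of mean(x).
seMean = std(bootMeans);
```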

Use the second output to sample "in parallel" from two data vectors.

x1 = randn(100,1);
x2 = randn(100,1);
[y1,idx] = datasample(x1,10);
y2 = x2(idx);

Tips

  • To sample random integers with replacement from a range, use randi.

  • To sample random integers without replacement, use randperm or datasample.

  • To randomly sample from data, with or without replacement, use datasample.

Algorithms

datasample uses randperm, rand, or randi to generate random values. Therefore, unless you specify a random number stream s, datasample changes the state of the MATLAB® global random number generator. Control the global random number generator using rng.

For selecting weighted samples without replacement, datasample uses the algorithm of Wong and Easton [1].
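To illustrate what weighted sampling without replacement means, the following sketch draws indices one at a time, each with probability proportional to its remaining weight, and sets the weight of each drawn index to zero. This is a simple sequential scheme written for clarity; it is not the Wong and Easton method and is not a description of how datasample is implemented. The weights, k, and variable names are hypothetical.

```matlab
w = [0.1 0.2 0.3 0.4];   % hypothetical weights
k = 3;                   % number of draws, at most nnz(w)

idx = zeros(1,k);
remaining = w;
for i = 1:k
    c = cumsum(remaining);            % running totals of remaining weights
    u = rand * c(end);                % uniform point in (0, total weight)
    j = find(c > u,1,'first');        % first index whose running total exceeds u
    idx(i) = j;
    remaining(j) = 0;                 % drawn index cannot be drawn again
end
```

Because a drawn index has zero remaining weight, its running total never rises above the preceding one, so it cannot be selected a second time.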

Alternatives

You can use randi or randperm to generate indices for random sampling with or without replacement, respectively. However, datasample can be more convenient because it samples directly from your data. datasample also allows weighted sampling.
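A minimal sketch of the index-based alternative follows. The data vector and variable names are hypothetical; randi supplies indices that can repeat, and randperm supplies indices that cannot.

```matlab
data = (1:20)'.^2;       % hypothetical column of 20 observations
n = size(data,1);
k = 5;

% With replacement: k indices drawn uniformly from 1:n, repeats allowed.
idxWith = randi(n,k,1);
yWith = data(idxWith);

% Without replacement: k unique indices drawn from 1:n.
idxWithout = randperm(n,k);
yWithout = data(idxWithout);
```

With datasample, the same results take one call each and the index vector is available as the second output.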

References

[1] Wong, C. K. and M. C. Easton. An Efficient Method for Weighted Sampling Without Replacement. SIAM Journal on Computing 9(1), pp. 111–113, 1980.

Introduced in R2011b
