Main Content

nbintest

Unpaired hypothesis test for count data with small sample sizes

Description

example

test = nbintest(X,Y) performs a hypothesis test that two independent samples of short-read count data, in each row of X and Y, come from distributions with equal means under the assumptions that:

  • Short-read counts are modeled using the negative binomial distribution.

  • Variance and mean of data in each row are linked through a regression function along all the rows.

X and Y must have the same number of rows and at least 2 columns, but not necessarily the same number of columns. Rows of X and Y correspond to variables, features, or genes, such as measurements of gene expression for different genes. Columns are usually time points or patients.

test is a NegativeBinomialTest object with two-sided p-values stored in the pValue property.

Use this function when you want to perform an unpaired hypothesis test for short-read count data (from high-throughput assays such as RNA-Seq or ChIP-Seq) with small sample sizes (in the order of tens at most). For instance, use this function to decide if observed differences in read counts between two conditions are significant for given genes.

test = nbintest(X,Y,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

Note

It is recommended that you use the diagnostic plots of the NegativeBinomialTest object returned by nbintest before interpreting the p-values. These plots allow you to see if the model assumption is correct, and the variance link used is appropriate for the data.

Examples

collapse all

This example shows how to perform an unpaired hypothesis test for synthetic short-read count data from two different biological conditions.

The data in this example contains synthetic gene count data for 5000 genes, representing two different biological conditions, such as diseased and normal cells. For each condition, there are five samples. Only 10% of the genes (500 genes) are differentially expressed. Specifically, half of them (250 genes) are exactly 3-fold overexpressed. The other 250 genes are 3-fold underexpressed. The rest of the gene expression data is generated from the same negative binomial distribution for both conditions. Each sample also has a different size factor (that is, the coverage or sampling depth).

Load the data.

load('nbintest_data.mat','K','H0');

The variable K contains gene count data. The rows represent genes, and the columns represent samples. In this case, the first five columns represent samples from the first condition. The other five columns represent samples from the second condition. Display the first few rows of K.

K(1:5,:)
ans = 5×10

       13683       14140        8281       14309       12208        8045        9446       11317       14597       14592
       16028       16805        9813       16486       14076        9901       10927       13348       16999       17036
         814         862         492         910         758         521         573         753         870         936
       15870       16453        9857       16454       14267        9671       10997       13624       17151       17205
        9422        9393        5734        9598        8174        5381        6315        7752        9869        9795

In this example, the null hypothesis is true when the gene is not differentially expressed. The variable H0 contains boolean indicators that indicate for which genes the null hypothesis is true (marked as 1). In other words, H0 contains known labels that you will use later to compare with predicted results.

sum(H0)
ans = 4500

Out of 5000 genes, 4500 are not differentially expressed in this synthetic data.

Run an unpaired hypothesis test for samples from two conditions using nbintest. The assumption is that the data came from a negative binomial distribution, where the variance is linked to the mean via a locally-regressed smooth function of the mean as described in [1] by setting 'VarianceLink' to 'LocalRegression'.

tLocal = nbintest(K(:,1:5),K(:,6:10),'VarianceLink','LocalRegression');

Use plotVarianceLink to plot a scatter plot for each experimental condition (for X and Y conditions), with the sample variance on the common scale versus the estimate of the condition-dependent mean. Use a linear scale for both axes. Include curves for all other linkage options by setting 'Compare' to true.

plotVarianceLink(tLocal,'Scale','linear','Compare',true)

The Identity line represents the Poisson model, where the variance is identical to the mean as described in [3]. Observe that the data seems to be overdispersed (that is, most points are above the Identity line). The Constant line represents the negative binomial model, where the variance is the sum of the shot noise term (mean) and a constant multiplied by the squared mean as described in [2]. The Local Regression and Constant linkage options appear to fit better with the overdispersed data.

Use plotChiSquaredFit to assess the goodness-of-fit for variance regression. It plots the empirical CDF (ecdf) of the chi-squared probabilities. The probabilities are the ratio between the observed and the estimated variance stratified by short-read count levels into five equal-sized bins.

plotChiSquaredFit(tLocal)

Each figure shows five ecdf curves. Each curve represents one of the five short-read count levels. For instance, the blue line represents the ecdf curve for a low short-read counts between 0 and 1264. The red line represents high counts (more than 11438).

One way to interpret the curves is to check if the ecdf curves are above the diagonal line. If they are above the line, then the variance is overestimated. If they are below the line, then the variance is underestimated. In both figures, the variance seems to be correctly estimated for higher counts (that is, the red line follows the diagonal line), but slightly overestimated for lower count levels.

To assess the performance of the hypothesis test, construct a confusion matrix using the known labels and the predicted p-values.

confusionmat(H0,(tLocal.pValue > .001))
ans = 2×2

         493           7
           5        4495

Out of 500 differentially expressed genes, 493 are correctly predicted (true positives) and 7 of them are incorrectly predicted as not-differentially expressed genes (false negatives). Out of 4500 genes that are not differentially expressed, 4495 are correctly predicted (true negatives) and 5 of them are incorrectly predicted as differentially expressed genes (false positives).

For a comparison, run the hypothesis test again assuming that counts are modeled by the Poisson distribution, where the variance is identical to the mean.

tPoisson = nbintest(K(:,1:5),K(:,6:10),'VarianceLink','Identity');

Plot the ecdf curves. Observe that all the curves are below the diagonal line, implying that the variance is underestimated. Therefore, the negative binomial model fits the data better.

plotChiSquaredFit(tPoisson)

Input Arguments

collapse all

Gene expression values from the first experimental condition, specified as a matrix or table. For instance, X can represent gene expression values from cancer cells.

Note

X and Y must have the same number of rows and at least 2 columns, but not necessarily the same number of columns. Rows of X and Y correspond to genes (or features), such as measurements of gene expression for different genes. Columns are usually time points or patients.

Gene expression values from the second experimental condition, specified as a matrix or table. For instance, Y can represent gene expression values from normal cells.

Note

X and Y must have the same number of rows and at least 2 columns, but not necessarily the same number of columns. Rows of X and Y correspond to genes (or features), such as measurements of gene expression for different genes. Columns are time points or patients.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'VarianceLink','Identity' specifies that the variance is equal to the mean when defining the linkage between the two.

Logical flag to pool variance across both conditions, specified as true or false. By default, the variance is estimated separately for each condition.

Example: 'PooledVariance',true

Size (scaling) factor of each column in X and Y, specified as a cell array of two vectors such as {SX,SY}. SX and SY are numeric vectors with sizes equal to size(X,2) and size(Y,2). SX , SY, or both can be a scalar indicating that all columns share the same size factor.

In a high-throughput sequencing library, the size factor is an estimation of the coverage or the sampling depth. The default is an empty array [], meaning the size factor is estimated as the median of the ratio of the sample’s counts to the geometric mean of each row in X or Y. Rows with zero geometric mean are ignored.

Example: 'SizeFactor',{[1.2,0.5,0.8],[0.8,1.1,1.5]}

Output Arguments

collapse all

Hypothesis test results, returned as a NegativeBinomialTest object. Use this object to create diagnostic plots and access p-values.

References

[1] Anders, S., and Huber, W. (2010). Differential Expression Analysis for Sequence Count Data. Genome Biology, 11(10):R106.

[2] Robinson, M.D., and Smyth, G.K. (2008). Small-sample Estimation of Negative Binomial Dispersion, with Applications to SAGE data. Biostatistics, 9:321-332.

[3] Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008). RNA-seq: an Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays. Genome Research, 16:1509-1517.

Version History

Introduced in R2014b