Rank key features by class separability criteria
IDX = rankfeatures(X,GROUP) ranks the features in X using an independent evaluation criterion for binary classification. X is a matrix where every column is an observed vector and the number of rows corresponds to the original number of features. GROUP contains the class labels.
IDX is a list of indices to the rows of X with the most significant features.
Find a reduced set of genes to differentiate breast cancer cells
Find a reduced set of genes that is sufficient for differentiating breast cancer cells from all other types of cancer in the t-matrix NCI60 data set.
Load sample data.
Get a logical index vector to the breast cancer cells.
BC = GROUP == 8;
I = rankfeatures(X,BC,NumberOfIndices=12);
Test features with a linear discriminant classifier.
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
ans = 1
Use cross-correlation weighting to further reduce the required number of genes.
I = rankfeatures(X,BC,'CCWeighting',0.7,'NumberOfIndices',8);
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
ans = 1
Find discriminant peaks of two groups of signals
Find the discriminant peaks of two groups of signals with Gaussian pulses modulated by two different sources.
Specify the regional information used to outweigh the Z-value of features as a function handle. Set the number of output indices to 5.
f = rankfeatures(y',grp,NWeighting=@(x) x/10+5,NumberOfIndices=5);
plot(t,y(grp==1,:),'b',t,y(grp==2,:),'g',t(f),1.35,'vr');
X — Sample data
Sample data, specified as a numeric matrix. Each column is an observed vector, and each row is a feature.
GROUP — Class labels
numeric vector | string vector | cell array of character vectors
Class labels, specified as a numeric vector, string vector, or cell array of character vectors. numel(GROUP) is the same as the number of columns in X.
GROUP must have only two unique values. If it contains any NaN values, the function ignores the corresponding observation vector in X.
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Example: rankfeatures(x,groups,Criterion="entropy",NWeighting=0.2) specifies to use the relative entropy as the criterion to assess the feature significance and a regional information value of 0.2 to outweigh the Z-value of potential features.
Before R2021a, use commas to separate each name and value, and enclose
Name in quotes.
Criterion — Criterion to assess significance of feature
"ttest" (default) | "entropy" | "bhattacharyya" | "roc" | "wilcoxon"
Criterion to assess the significance of each feature for separating two labeled groups, specified as one of the following:
"ttest" — Absolute value two-sample t-test with pooled variance estimate.
"entropy" — Relative entropy, also known as Kullback-Leibler distance or divergence.
"bhattacharyya" — Minimum attainable classification error or Chernoff bound.
"roc" — Area between the empirical receiver operating characteristic (ROC) curve and the random classifier slope.
"wilcoxon" — Absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney.
"ttest", "entropy", and "bhattacharyya" assume normally distributed classes, while "roc" and "wilcoxon" are nonparametric tests. All tests are feature independent.
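To make the idea of a feature-independent criterion concrete, the following sketch computes an absolute two-sample t-statistic with a pooled variance estimate for each row of a small matrix, using only base MATLAB. The data values and the variable names (A, B, sp2, z, idx) are made up for illustration, and the sketch is not claimed to reproduce the internal computation of rankfeatures exactly.

```matlab
% Toy data (hypothetical): 3 features (rows) x 6 observations (columns).
X = [1 2 3 7 8 9;
     5 5 6 5 6 5;
     0 1 0 4 5 4];
grp = logical([0 0 0 1 1 1]);    % two classes, one label per column

A = X(:,~grp);                   % observations of the first class
B = X(:,grp);                    % observations of the second class
nA = size(A,2);
nB = size(B,2);

% Pooled variance estimate, computed independently for each feature (row).
sp2 = ((nA-1)*var(A,0,2) + (nB-1)*var(B,0,2)) / (nA+nB-2);

% Absolute value of the two-sample t-statistic for each feature.
z = abs(mean(A,2) - mean(B,2)) ./ sqrt(sp2*(1/nA + 1/nB));

% Rank the features from most to least separable.
[~,idx] = sort(z,'descend');
```

Each row is scored without reference to any other row, which is what "feature independent" means here. In this toy data the second row has identical class means, so it scores zero and ranks last.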
CCWeighting — Correlation information to outweigh Z-value of features
0 (default) | numeric scalar between 0 and 1
Correlation information to outweigh the Z-value of potential features, specified as a numeric scalar between 0 and 1.
The function uses Z × (1 − α × ρ) to calculate the weight, where ρ is the average of the absolute values of the cross-correlation coefficient between the candidate feature and all previously selected features. α is the CCWeighting value that sets the weighting factor.
By default, α is 0, and the function does not weight the potential features. A large value of ρ (close to 1) outweighs the significance statistic, meaning that features that are highly correlated with the features already picked are less likely to be included in the output list.
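The effect of correlation weighting can be sketched in base MATLAB. The example assumes the weighted score has the form Z × (1 − α × ρ), with ρ the average absolute correlation between a candidate and the features already picked. The scores in z, the data in X, and the variable names are hypothetical; this is an illustration of the idea, not the function's actual code.

```matlab
% Hypothetical data: rows are features, columns are observations.
X = [1 2 3 4 5 6;        % feature 1
     2 4 6 8 10 12;      % feature 2, a scaled copy of feature 1
     3 1 4 1 5 9];       % feature 3, only moderately correlated with feature 1
z = [10; 9; 6];          % hypothetical significance scores
alpha = 0.7;             % plays the role of the CCWeighting value

R = abs(corrcoef(X'));   % absolute correlation between every pair of features
selected = 1;            % assume feature 1 (highest score) is already picked
candidates = [2 3];

% Average absolute correlation of each candidate with the picked features.
rho = mean(R(candidates,selected),2);

% Weighted scores: highly correlated candidates are penalized.
w = z(candidates) .* (1 - alpha*rho);
[~,k] = max(w);
next = candidates(k);
```

Although feature 2 has the higher raw score, it is almost perfectly correlated with feature 1, so its weighted score drops to about 2.7 and feature 3 is chosen next.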
NWeighting — Regional information to outweigh Z-value of features
0 (default) | nonnegative scalar | function handle
Regional information to outweigh the Z-value of potential features, specified as a nonnegative scalar or function handle.
The function uses Z × (1 − exp(−(D/β)²)) to calculate the weight, where D is the distance (in rows) between the candidate feature and previously selected features. β is the NWeighting value that sets the weighting factor. β must be greater than or equal to 0.
By default, β is 0, and the function does not weight the potential features. A small value of D (close to 0) outweighs the significance statistics of only close features. This means that features that are close to already picked features are less likely to be included in the output list. This option is useful for extracting features from time series with temporal correlation.
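The regional weighting can be sketched the same way. The example assumes the weighted score has the form Z × (1 − exp(−(D/β)²)) and considers a single previously picked row, so that D is unambiguous. The scores and variable names are hypothetical, and the sketch is not the function's actual implementation.

```matlab
z = [5 9 10 9 5 1 8]';   % hypothetical scores for 7 neighboring features
beta = 2;                % plays the role of the NWeighting value
picked = 3;              % row with the highest score, assumed already picked

% Distance in rows between every feature and the picked feature.
D = abs((1:numel(z))' - picked);

% Weighted scores: features near the picked row are suppressed.
w = z .* (1 - exp(-(D/beta).^2));
w(picked) = -Inf;        % never select the same row twice

[~,next] = max(w);
```

Rows 2 and 4 have high raw scores but sit next to the picked row, so their weighted scores fall to about 2. Row 7 is far enough away to keep almost all of its score and is chosen next.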
β can also be a function of the feature location, specified using @ or an anonymous function. In both cases, rankfeatures passes the row position of the feature to the specified function and expects back a value greater than or equal to 0.
You can use NWeighting and CCWeighting together.
NumberOfIndices — Number of output indices
Number of output indices in IDX, specified as a positive scalar. By default, the number of indices is the same as the number of features when α and β are 0. Otherwise, the number of indices is set to 20.
CrossNorm — Method for independent normalization across observations
"none" (default) |
Method for independent normalization across observations for every feature, specified as one of the following:
"none"(default) — No normalization.
In these equations,
Xmin = min(X), and
Cross-normalization ensures comparability among different features although it is not always necessary because the selected criterion might already account for this.
IDX — List of indices
List of indices to the rows of X with the most significant features, returned as a numeric vector.
Z — List of absolute values of criterion for features
List of absolute values of the
Criterion used for the features,
returned as a numeric vector.
 Theodoridis, Sergios, and Konstantinos Koutroumbas. Pattern Recognition. San Diego: Academic Press, 1999: 341-342.
 Liu, Huan, and Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer International Series in Engineering and Computer Science 454. Boston: Kluwer Academic Publishers, 1998.
 Ross, Douglas T., Uwe Scherf, Michael B. Eisen, Charles M. Perou, Christian Rees, Paul Spellman, Vishwanath Iyer, et al. “Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines.” Nature Genetics 24, no. 3 (March 2000): 227–35.