sequentialfs with "dummified" input feature matrix
Context
All my features are categorical and thus I converted each feature into dummy variables using:
dummyvar(nominal(featureVector))
So an M x 1 feature vector with N categories (held in a table column) is converted to an M x N matrix in which each row is a 0/1 indicator vector (e.g. [0 0 1]) representing a given category (still held within a table column). This allowed me to convert my table of now-dummified features into a matrix, using the code below, in order to train an SVM (via fitcsvm, for example; I'm actually using libsvm's svmtrain). Each feature is no longer represented by a single column in the matrix; I was told this is OK and that training performs as intended.
XMatrix = cell2mat(table2cell(dummifiedXTable));
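To make the expansion concrete, here is a minimal sketch with a made-up toy feature (the variable name is hypothetical):
color = {'red'; 'green'; 'blue'; 'red'};   % hypothetical 4 x 1 feature with 3 categories
D = dummyvar(nominal(color))
% D is 4 x 3, one indicator column per category; nominal orders the
% categories alphabetically, so the columns are [blue green red]:
%      0     0     1     <- 'red'
%      0     1     0     <- 'green'
%      1     0     0     <- 'blue'
%      0     0     1     <- 'red'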
I am now pursuing feature selection with sequentialfs(), which iteratively finds the most predictive features, like so:
c = cvpartition(Y,'k',5);
opts = statset('display','iter');
inmodel = sequentialfs(@my_fun_lib,dumXMat,dumY,'cv',c,'options',opts);
where my_fun_lib is:
function [ criterion ] = my_fun_lib(trainX,trainY,testX,testY)
% Train a linear C-SVM with libsvm and count misclassified test points
bestc = '1';
model = svmtrain(trainY, trainX, ['-s 0 -t 0 -c ' bestc]);  % libsvm: labels first, data second
criterion = sum(svmpredict(testY, testX, model) ~= testY);  % errors on the held-out fold
end
Problem
My issue is that my input feature matrix X is in a form where each column doesn't represent a feature by itself, which is what sequentialfs() expects. I've tried feeding sequentialfs() the feature table before converting it into a matrix (when each column still represents a feature) and handling the conversion inside my_fun_lib, but it appears that sequentialfs() only accepts a matrix.
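To illustrate with made-up sizes: three categorical features with 3, 2, and 4 levels respectively dummify into 3+2+4 = 9 columns, so sequentialfs() treats the problem as having 9 candidate features rather than 3.
% Hypothetical bookkeeping: which original feature owns each dummy column
featureOfColumn = [1 1 1 2 2 3 3 3 3];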
How can I resolve this issue? Many many thanks for your help!
Accepted Answer
Ilya
on 17 Aug 2015
To answer this particular question:
"Is there any way instead of selecting category choices like this to select initial features? For example, I have features Weight, Model_Year, Color, Owner_Name, and Owner_Country and I want to know which of these are most predictive, not which of their category choices are."
I don't know why you would want to do this, because selecting categories is in principle more useful than selecting whole predictors. But if you insist on doing that, just dummify the variables inside your my_fun_lib. This works as long as all your features are categorical. Here is an example:
%% Form a table with two categorical variables
load fisheriris
T = table(categorical(meas(:,1)),categorical(meas(:,2)));
%% Convert the table into a matrix of category indices (one column per feature)
X = zeros(size(T));
for p = 1:size(X,2)
    X(:,p) = double(T{:,p});
end
%% Turn into a binary classification problem
y = strcmp(species,'setosa');
%% Run sequentialfs
c = cvpartition(y,'k',5);
opts = statset('display','iter');
inmodel = sequentialfs(@mycrit,X,y,'cv',c,'options',opts);
where mycrit looks like this:
function val = mycrit(Xtrain,Ytrain,Xtest,Ytest)
% Dummify the training and test folds together so both get the same dummy columns
Ntrain = size(Xtrain,1);
X = dummyvar([Xtrain; Xtest]);
Xtrain = X(1:Ntrain,:);
Xtest = X(Ntrain+1:end,:);
obj = fitcsvm(Xtrain,Ytrain);
val = loss(obj,Xtest,Ytest);  % classification loss on the held-out fold
end
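A possible follow-up once sequentialfs returns (a sketch, assuming you still have the table T from above): map the logical mask back to the table's variable names to see which original predictors survived.
% inmodel is a logical mask over the original (pre-dummified) columns,
% so it can index the table's variable names directly
selectedVars = T.Properties.VariableNames(inmodel)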
3 Comments
Ilya
on 21 Aug 2015
Suppose a question can be answered by "yes", "no" or "not sure". If your feature selection procedure picks just the "yes" category, you know that you only need to distinguish between "yes" and "not yes". If you pick the whole predictor, you will also be distinguishing between "no" and "not sure", although this has no effect on accuracy.
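As a concrete sketch of that point (toy data, made up for illustration):
answers = nominal({'yes'; 'no'; 'not sure'; 'yes'});
D = dummyvar(answers);  % columns in alphabetical category order: 'no', 'not sure', 'yes'
isYes = D(:,3);         % keeping only this column encodes "yes" vs "not yes"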
More Answers (1)
Madhav Rajan
on 17 Aug 2015
I understand that you want to perform feature selection using 'sequentialfs' with a 'dummified' input feature matrix. Assuming that you have the Statistics and Machine Learning Toolbox, you can refer to the following example, where I have used the sample 'carsmall' dataset available in MATLAB.
The example uses two features. One is the numerical 'Weight' variable, the weight of the cars. The other is an Mx1 'Model_Year' variable, which is dummified into an Mx3 matrix since 'Model_Year' takes three categorical values. The class variable is 'Origin', which is '0' if the country is 'USA' and '1' otherwise. 'svmtrain' and 'svmclassify' are used in the my_fun file to model and classify the data.
The deviance formula in the example just sums up the misclassified points and is returned by 'my_fun'.
example.m
%% load the cars data set
load carsmall;
%% define y, x1, x2
y = categorical(cellstr(Origin),{'USA','France','Germany','Italy','Sweden','Japan'},{'0','1','1','1','1','1'});
x1 = Weight;
x2 = dummyvar(nominal(Model_Year));
%% call sequentialfs with the correct parameters
c = cvpartition(y,'k',10);
opts = statset('display','iter');
X = [x1 x2];
[fs,history] = sequentialfs(@my_fun,X,y,'cv',c,'nullmodel',false,...
    'options',opts,...
    'direction','forward');
%% display the outputs
fs
history
my_fun.m
function [ dev ] = my_fun( XTRAIN,ytrain,XTEST,ytest)
%MY_FUN Criterion function for sequentialfs: train an SVM on the training
% fold and return the number of misclassified test points.
%% train the model and test it
SVMModel = svmtrain(XTRAIN,ytrain);   % Statistics Toolbox svmtrain: data first
group = svmclassify(SVMModel, XTEST);
%% compute deviance (misclassification count)
dev = sum(group ~= ytest);            % ~= works for categorical labels
end
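To read the selection back at the feature level (a sketch; in this example column 1 of X is Weight and columns 2 to 4 are the Model_Year dummies):
weightSelected    = fs(1);         % was the Weight column kept?
modelYearSelected = any(fs(2:4));  % was any Model_Year dummy column kept?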
Looking at your script, the 'svmtrain' function appears to be called with incorrect parameters. You can refer to the documentation for more details. The input to 'my_fun.m' has to be a matrix, and hence it is necessary to convert any table data that represents the feature variables to a matrix. You can refer to the documentation of the 'sequentialfs' function for more details on defining the criterion function 'my_fun.m'.
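For reference, the two functions named svmtrain take their arguments in opposite order, which may be the source of confusion here (a sketch, assuming both libsvm and the Statistics Toolbox function are on the path):
model    = svmtrain(double(trainY), trainX, '-s 0 -t 0 -c 1');  % libsvm: labels first, data second
SVMModel = svmtrain(trainX, trainY);                            % Statistics Toolbox: data first, group second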
Hope this helps