How to select the components that show the most variance in PCA

Faraz on 27 Feb 2016
Commented: Darren Lim on 3 Feb 2021
I have a huge data set that I need for training (32000*2500). This seems to be too much for my classifier, so I decided to do some reading on dimensionality reduction, and specifically on PCA.
From my understanding, PCA projects the data onto a new set of axes. The new coordinates don't mean anything physically, but the data are rearranged so that the first axis captures the maximum variation. Once I have these new coefficients, I can drop the ones with minimum variation.
Now I am trying to implement this in MATLAB, and am having trouble with the output provided. MATLAB always treats rows as observations and columns as variables, so my input to the pca function would be my matrix of size (32000*2500). This returns the PCA coefficients in an output matrix of size 2500*2500.
The help for pca states:
Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.
In this output, which dimension corresponds to the observations of my data? I mean, if I have to give this to the classifier, do the rows of coeff represent my data's observations, or is it now the columns of coeff?
And how do I remove the coefficients having the least variation, and thus effectively reduce the dimension of my data?
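To check which dimension is which, I ran a small sketch like this (toy sizes, not my 32000*2500 matrix):
% Toy example: 100 observations (rows) of 8 variables (columns)
X = rand(100,8);
[coeff,score] = pca(X);
size(coeff)  % 8x8: rows are original variables, columns are principal components
size(score)  % 100x8: observations stay in rows, now expressed in PC coordinates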


Accepted Answer

the cyclist on 27 Feb 2016
Edited: the cyclist on 25 Oct 2018
Here is some code I wrote to help myself understand the MATLAB syntax for PCA.
rng 'default'
M = 7; % Number of observations
N = 5; % Number of variables observed
% Made-up data
X = rand(M,N);
% De-mean (MATLAB will de-mean inside of PCA, but I want the de-meaned values later)
X = X - mean(X); % Use X = bsxfun(@minus,X,mean(X)) if you have an older version of MATLAB
% Do the PCA
[coeff,score,latent,~,explained] = pca(X);
% Calculate eigenvalues and eigenvectors of the covariance matrix
covarianceMatrix = cov(X);
[V,D] = eig(covarianceMatrix);
% "coeff" are the principal component vectors. These are the eigenvectors of the covariance matrix. Compare ...
coeff
V
% Multiply the original data by the principal component vectors to get the projections of the original data on the
% principal component vector space. This is also the output "score". Compare ...
dataInPrincipalComponentSpace = X*coeff
score
% The columns of X*coeff are orthogonal to each other. This is shown with ...
corrcoef(dataInPrincipalComponentSpace)
% The variances of these vectors are the eigenvalues of the covariance matrix, and are also the output "latent". Compare
% these three outputs
var(dataInPrincipalComponentSpace)'
latent
sort(diag(D),'descend')
The first figure on the Wikipedia page for PCA is really helpful in understanding what is going on. There is variation along the original (x,y) axes. The superimposed arrows show the principal axes. The long arrow is the axis that has the most variation; the short arrow captures the rest of the variation.
Before thinking about dimension reduction, the first step is to redefine a coordinate system (x',y'), such that x' is along the first principal component, and y' along the second component (and so on, if there are more variables).
In my code above, those new variables are dataInPrincipalComponentSpace. As in the original data, each row is an observation, and each column is a dimension.
These data are just like your original data, except it is as if you measured them in a different coordinate system -- the principal axes.
Now you can think about dimension reduction. Take a look at the variable explained. It tells you how much of the variation is captured by each column of dataInPrincipalComponentSpace. Here is where you have to make a judgement call. How much of the total variation are you willing to ignore? One guideline is that if you plot explained, there will often be an "elbow" in the plot, where each additional variable explains very little additional variation. Keep only the components that add a lot more explanatory power, and ignore the rest.
In my code, notice that the first 3 components together explain 87% of the variation; suppose you decide that that's good enough. Then, for your later analysis, you would only keep those 3 dimensions -- the first three columns of dataInPrincipalComponentSpace. You will have 7 observations in 3 dimensions (variables) instead of 5.
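In code, that last step is just keeping the first few columns (a minimal sketch continuing from my example above; the cutoff k = 3 is the judgement call I described):
% Dimension reduction: keep only the first k principal components
k = 3;                                               % chosen by inspecting "explained"
reducedData = dataInPrincipalComponentSpace(:,1:k);  % 7 observations x 3 variables
% Equivalently, using the pca outputs directly:
reducedData = score(:,1:k);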
I hope that helps!

  26 Comments

NN on 4 Dec 2020
Dear Cyclist,
I have gone through the discussion to understand PCA. I am doing a forecasting problem with a neural network, and used the syntax below to find the PCA components for reducing the dimension of my training and testing data.
coeff = pca(X)
Can I use the output of this command (the coeff matrix) as new training and testing data for the neural network, and use it for forecasting?
I have 9 input features for forecasting. How can I plot the contribution rates of each feature and the principal components against the variance, to know the contribution of each feature and how much to reduce the dimension?
Kindly help.
the cyclist on 4 Dec 2020
I have several comments here.
No, you should not use coeff as the new training data. coeff is not data -- it is the transformation matrix from the original coordinate system to the PCA coordinate system.
You can use the data from the new coordinate system for your neural network. These data are the score output from pca(). [These are equivalent to the variable I called dataInPrincipalComponentSpace.] Note that if you use all columns of dataInPrincipalComponentSpace, then you have not done dimensional reduction -- you will simply be in a new coordinate system where the vectors are orthogonal to each other. The dimensional reduction step is when you choose to drop columns from dataInPrincipalComponentSpace.
I haven't thought deeply about this, but I'm pretty sure you should only do the PCA on the training set to determine coeff. Otherwise you are leaking information from your test set back to the training set. (But you'll want to apply coeff to the test set, before putting it into your neural network.)
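A minimal sketch of that workflow (Xtrain, Xtest, and the cutoff k are placeholder names of mine; the mean mu is pca's sixth output):
% Fit the PCA on the training set only, so no test-set information leaks in
[coeff,scoreTrain,~,~,explained,mu] = pca(Xtrain);
k = find(cumsum(explained) >= 90,1);  % e.g. keep enough components for 90% of variance
trainFeatures = scoreTrain(:,1:k);
% Apply the SAME mean and coefficients to the test set
testFeatures = (Xtest - mu)*coeff(:,1:k);  % use bsxfun(@minus,Xtest,mu) on older MATLAB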
You can use the output explained to make a scree plot of the amount of explained variance in each principal component.
% Scree plot
figure
h = plot(explained,'.-');
set(h,'LineWidth',3,'MarkerSize',36)
ylim([0 100])
set(gca,'XTick',1:N)
title('Explained variance by principal components')
xlabel('Principal component number')
ylabel('Variation explained [%]')
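A cumulative version of the same plot (my addition, using the same explained output) often makes the "how many components?" question easier to read off:
% Cumulative scree plot
figure
plot(cumsum(explained),'.-','LineWidth',3,'MarkerSize',36)
ylim([0 100])
xlabel('Number of principal components kept')
ylabel('Cumulative variation explained [%]')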
NN on 7 Dec 2020
Thank you very much for the explanation. I will check this.


More Answers (3)

naghmeh moradpoor on 1 Jul 2017
Dear Cyclist,
I used your code and successfully found all the principal components for my dataset. Thank you! On my dataset, PC1, PC2, and PC3 explained more than 90% of the variance. I would like to know how to find which variables from my dataset are related to PC1, PC2, and PC3.
Please could you help me with this? Regards, Ngh

  1 Comment

Abdul Haleem Butt on 3 Nov 2017
In dataInPrincipalComponentSpace, as in the original data, each row is an observation and each column is a dimension.



Sahil Bajaj on 12 Feb 2019
Dear Cyclist,
Thanks a lot for your helpful explanation. I used your code and found 4 principal components explaining 97% of the variance for my dataset, which originally had 14 variables. I was just wondering how to find which variables from my dataset are related to PC1, PC2, PC3, and PC4, so that I can ignore the others and know which parameters to use for further analysis.
Thanks !
Sahil

  9 Comments

Darren Lim on 2 Feb 2021
@the cyclist, thanks for answering this post; you wouldn't imagine how much time I have saved by studying your answer, so thank you!
I just picked up PCA a few days ago to solve a financial trading problem, so I am very new to it. Just to confirm my understanding, in the coeff example you provided:
coeff =
   -0.5173    0.7366   -0.1131    0.4106    0.0919
    0.6256    0.1345    0.1202    0.6628   -0.3699
   -0.3033   -0.6208   -0.1037    0.6252    0.3479
    0.4829    0.1901   -0.5536   -0.0308    0.6506
    0.1262    0.1334    0.8097    0.0179    0.5571
Can I clarify that, in column 1, the variable with coefficient 0.6256 has the largest "weight" in PC1? So if Variable(2,1) is, say, the Mathematics score (0.6256) of my 7 sample students (observations), can I say that Mathematics accounts for the largest variance among all 7 students in the whole data set (since PC1 has the highest variance, accounting for 42.2% of the entire data set)?
And if Variable(1,1) is English (-0.5173), does it mean that English tends to anti-correlate with Mathematics?
And for PC2, does Variable(1,2), English (0.7366), describe the differences among the sample students the most?
In essence, I think I roughly understand PCA at a high level. What I am not so sure about is how to interpret the output; PCA is powerful, but it won't be useful if the output is misinterpreted. Any help interpreting coeff will be appreciated :) (My challenge is to find out which variables are useful for my trading, and to eliminate unnecessary variables, so that I can optimise a trading strategy.)
Thanks in advance!
the cyclist on 2 Feb 2021
I'm happy to hear you have found my answer to be helpful.
The way you are trying to interpret the results is a little confusing to me. Using your example of school subjects, I'll try to explain how I would interpret them.
Let's suppose that the original dataset variables (X) are scores on a standardized exam:
  1. Math (column 1)
  2. Writing
  3. History
  4. Art
  5. Science
[Sorry I changed up your subject ordering.]
Each row is one student's scores. Row 3 is the 3rd student's scores, and X(3,4) is the 3rd student's Art score.
Now we do the PCA, to see what combination of variables explains the variation among observations (i.e. students).
coeff contains the coefficients of the linear combinations of the original variables. coeff(:,1) holds the coefficients that take you from the original variables to the first new variable (the one that explains the most variation between observations):
-0.5173*Math + 0.6256*Writing - 0.3033*History + 0.4829*Art + 0.1262*Science
At this point, the researcher might try to interpret these coefficients. For example, because Writing and Art are very positively weighted, maybe this variable -- which is NOT directly measured! -- is something like "Creativity".
Similarly, maybe the coefficients coeff(:,2), which weight Math very heavily, correspond to "Logic".
And so on.
So, interpreting that single value of 0.6256, I think you can say, "Writing is the most highly weighted original variable in the new variable that explains the most variation."
But, it also seems to me that to answer a couple of your questions, you actually want to look at the original variables, and not the PCA-transformed data. If you want to know which school subject had the largest variance -- just calculate that on the original data. Similarly for the correlation between subjects.
PCA is (potentially) helpful for determining if there is some underlying variable that explains the variation among multiple variables. (For example, "Creativity" explaining variation in both Writing and Art.) But factor analysis and other techniques are more explicitly designed to find those latent factors.
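As a rough programmatic version of that inspection (my own sketch, reusing the coeff and explained outputs from my answer above):
% For each principal component, find the original variable with the
% largest absolute loading -- a crude first look at what each component "is"
subjects = {'Math','Writing','History','Art','Science'};
[~,heaviest] = max(abs(coeff),[],1);
for j = 1:numel(heaviest)
    fprintf('PC%d (%.1f%% of variance): dominated by %s\n', j, explained(j), subjects{heaviest(j)})
end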
Darren Lim on 3 Feb 2021
Thanks @the cyclist!
Crystal clear! I think many others will find this answer helpful as well. Thanks again for your insights and time!
Darren



Salma Hassan on 18 Sep 2019
I still do not understand.
I need an answer to my question: how many eigenvectors do I have to use?
From these figures:
[Attached screenshots: explained.PNG, latent.PNG, coeff.PNG]

  3 Comments

the cyclist on 19 Sep 2019
It is not a simple answer. The first value of the explained variable is about 30. That means that the first principal component explains about 30% of the total variance of all your variables. The next value of explained is 14. So, together, the first two components explain about 44% of the total variation. Is that enough? It depends on what you are trying to do. It is difficult to give generic advice on this point.
You can plot the values of explained or latent to see how the explained variance accumulates as you add each additional component. See, for example, the Wikipedia article on scree plots.
Salma Hassan on 19 Sep 2019
If we say that the first two components, which explain about 44%, are enough for me, what does this mean for latent and coeff? How does this lead me to the number of eigenvectors?
Thanks for your interest in replying. I appreciate it.
the cyclist on 20 Sep 2019
It means that the first two columns of coeff are the coefficients you want to use.
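In code (a minimal sketch with my own variable names, assuming the usual pca outputs):
% Keep the first two eigenvectors (principal component coefficients)
k = 2;
reducedCoeff = coeff(:,1:k);  % the two eigenvectors
reducedData  = score(:,1:k);  % the data expressed in those two dimensions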

