Why does repeating PCA on randomized data switch the order of the points? Compare the test case and the Monte-Carlo case.
Dear Help,
I would like to repeat a PCA analysis many times on slightly randomized data. The idea is to produce a "cloud" of points in PCA space that surrounds the real data. I use the simple MATLAB code below to do this, but I find that the red and black points intersperse. Am I misunderstanding something about the pca output?
-Dodgeball
% Note: The first three columns of X correspond to the "black" data category, whereas the last three columns correspond to "red" category.
X= [21.3082, 20.4909, 21.7057, 27.0204, 26.8216, 26.8253;
24.7031, 24.7590, 24.7928, 18.7839, 18.6898, 21.7545;
27.1607, 27.2170, 26.7247, 25.1205, 20.8257, 21.5048;
26.8413, 26.6575, 26.8508, 21.2030, 20.9727, 23.0522;
26.2660, 26.0913, 26.4202, 20.9011, 21.6699, 20.8864;
26.7285, 26.5326, 26.7244, 20.9167, 22.2356, 22.0653;
26.2849, 26.3539, 26.4534, 21.3777, 21.7901, 21.4655;
27.1331, 27.1494, 27.1535, 21.3922, 22.9945, 22.9521;
26.3820, 26.4615, 26.6303, 21.2554, 22.3799, 21.8397;
21.7944, 26.0630, 26.1678, 28.0560, 28.8596, 28.0929;
21.3088, 21.1267, 21.8997, 26.0820, 25.3791, 25.9918;
21.1613, 22.0565, 21.2904, 25.7505, 25.8322, 25.7237;
20.7289, 21.1625, 21.1503, 24.6710, 24.8744, 24.9068;
24.7395, 24.7855, 24.8222, 21.3117, 20.7364, 21.6248;
23.2372, 23.4656, 21.3302, 25.8519, 25.9230, 25.9140];
% test case
[coeff,score,latent,tsquared,explained] = pca(X');
plot(score(1:3,1), score(1:3,2),'.k');
hold on;
plot(score(4:6,1), score(4:6,2),'.r');
% end test case
% begin Monte-Carlo case
HOBART = X;
HOBART_black = HOBART(:,1:3); % columns 1-3 are the "black" category (matches the comment on X and the test-case plot)
HOBART_red = HOBART(:,4:6); % columns 4-6 are the "red" category
cov_red = cov(HOBART_red'); % covariance matrix 15x15
cov_black = cov(HOBART_black'); % covariance matrix 15x15
mu_red = mean(HOBART_red'); % mu 1x15
mu_black = mean(HOBART_black'); % mu 1x15
nDRAWS_MV = 1000;
XDATA_red_MV = zeros(3*nDRAWS_MV,1);
YDATA_red_MV = zeros(3*nDRAWS_MV,1);
XDATA_black_MV = zeros(3*nDRAWS_MV,1);
YDATA_black_MV = zeros(3*nDRAWS_MV,1);
cankick = 0;
avgexplained1 = zeros(nDRAWS_MV,1);
avgexplained2 = zeros(nDRAWS_MV,1);
for i = 1:nDRAWS_MV
sample_HUGEmatrixC_MV = [ mvnrnd(mu_red,cov_red,3)', mvnrnd(mu_black,cov_black,3)'];
% [coeff,score,latent,tsquared,explained] = pca(sample_HUGEmatrixC_MV','VariableWeights','variance');
[coeff,score,latent,tsquared,explained] = pca(sample_HUGEmatrixC_MV');
old_cankick = cankick;
cankick = cankick+3;
XDATA_red_MV((old_cankick+1):1:cankick,1)= score(1:3,1);
YDATA_red_MV((old_cankick+1):1:cankick,1)= score(1:3,2);
XDATA_black_MV((old_cankick+1):1:cankick,1)= score(4:6,1);
YDATA_black_MV((old_cankick+1):1:cankick,1)= score(4:6,2);
avgexplained1(i,1) = explained(1);
avgexplained2(i,1) = explained(2);
end
figure; % scatter plot of the Monte-Carlo draws
hold on;
plot(XDATA_red_MV,YDATA_red_MV,'.r');
plot(XDATA_black_MV,YDATA_black_MV,'.k');
xlabel(sprintf('PC1: %d%%', round(mean(avgexplained1))))
ylabel(sprintf('PC2: %d%%', round(mean(avgexplained2))))
% end Monte-Carlo case
Accepted Answer
the cyclist on 12 Nov 2021 (edited 12 Nov 2021)
I have to admit that I did not spend the time to come to a complete understanding of your code.
However, I suspect your unexpected result is explained by the fact that the principal component vectors are eigenvectors of the covariance matrix, and the negative of an eigenvector is also an eigenvector. Flipping the sign of any principal component (a column of coeff, together with the corresponding column of score) therefore gives an equally valid decomposition.
My guess is that the randomness you are inserting is enough that the PCA algorithm sometimes lands on the opposite-signed eigenvectors from those of the original X, which mirrors that run's cloud through the origin and makes the red and black points appear interspersed.
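The sign ambiguity can be seen directly from the eigenvalue equation (a minimal sketch; the matrix here is random, and eig's ordering is implementation-dependent):

```matlab
% Both v and -v satisfy A*v = lambda*v, so any PCA routine is
% free to return either sign for each component.
A = cov(randn(100,3));        % a small sample covariance matrix
[V, D] = eig(A);
v = V(:,1);  lambda = D(1,1);
norm(A*v - lambda*v)          % ~ 0
norm(A*(-v) - lambda*(-v))    % ~ 0 as well: -v is an equally valid eigenvector
```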
I am not certain about the following, and you should think it through carefully yourself, but note that the sign of each component can flip independently, so fixing a single overall sign is not enough. If you insert something like
score = score .* sign(score(1,:)); % flip each column of score so its first element is positive (implicit expansion, R2016b+)
after you calculate the PCA, then all your MC black/red distinctions should be retained consistently, and still be valid. (To make the clouds consistent with the original data, force each column's sign to agree with the corresponding component of the original PCA instead, rather than simply making the first score positive.)