Why does repeating pca on randomized data switch the order of the points? Compare the test case and the Monte-Carlo case.

Dear Help,
I would like to repeat a PCA many times on slightly randomized data. The idea is to produce a "cloud" of points in PCA space that surrounds the real data. I use the simple MATLAB functions below to code this. But I find that the red and black points are interspersed. Am I misunderstanding something about the pca output?
-Dodgeball
% Note: The first three columns of X correspond to the "black" data category, whereas the last three columns correspond to "red" category.
X= [21.3082, 20.4909, 21.7057, 27.0204, 26.8216, 26.8253;
24.7031, 24.7590, 24.7928, 18.7839, 18.6898, 21.7545;
27.1607, 27.2170, 26.7247, 25.1205, 20.8257, 21.5048;
26.8413, 26.6575, 26.8508, 21.2030, 20.9727, 23.0522;
26.2660, 26.0913, 26.4202, 20.9011, 21.6699, 20.8864;
26.7285, 26.5326, 26.7244, 20.9167, 22.2356, 22.0653;
26.2849, 26.3539, 26.4534, 21.3777, 21.7901, 21.4655;
27.1331, 27.1494, 27.1535, 21.3922, 22.9945, 22.9521;
26.3820, 26.4615, 26.6303, 21.2554, 22.3799, 21.8397;
21.7944, 26.0630, 26.1678, 28.0560, 28.8596, 28.0929;
21.3088, 21.1267, 21.8997, 26.0820, 25.3791, 25.9918;
21.1613, 22.0565, 21.2904, 25.7505, 25.8322, 25.7237;
20.7289, 21.1625, 21.1503, 24.6710, 24.8744, 24.9068;
24.7395, 24.7855, 24.8222, 21.3117, 20.7364, 21.6248;
23.2372, 23.4656, 21.3302, 25.8519, 25.9230, 25.9140];
% test case
[coeff,score,latent,tsquared,explained] = pca(X');
plot(score(1:3,1), score(1:3,2),'.k');
hold on;
plot(score(4:6,1), score(4:6,2),'.r');
% end test case
% begin Monte-Carlo case
HOBART = X;
HOBART_red = HOBART(:,1:3);
HOBART_black = HOBART(:,4:6);
cov_red = cov(HOBART_red'); % covariance matrix 15x15
cov_black = cov(HOBART_black'); % covariance matrix 15x15
mu_red = mean(HOBART_red'); % mu 1x15
mu_black = mean(HOBART_black'); % mu 1x15
nDRAWS_MV = 1000;
XDATA_red_MV = zeros(3*nDRAWS_MV,1);
YDATA_red_MV = zeros(3*nDRAWS_MV,1);
XDATA_black_MV = zeros(3*nDRAWS_MV,1);
YDATA_black_MV = zeros(3*nDRAWS_MV,1);
cankick = 0;
avgexplained1 = zeros(nDRAWS_MV,1);
avgexplained2 = zeros(nDRAWS_MV,1);
for i = 1:nDRAWS_MV
    sample_HUGEmatrixC_MV = [mvnrnd(mu_red,cov_red,3)', mvnrnd(mu_black,cov_black,3)'];
    % [coeff,score,latent,tsquared,explained] = pca(sample_HUGEmatrixC_MV','VariableWeights','variance');
    [coeff,score,latent,tsquared,explained] = pca(sample_HUGEmatrixC_MV');
    old_cankick = cankick;
    cankick = cankick+3;
    XDATA_red_MV((old_cankick+1):cankick,1) = score(1:3,1);
    YDATA_red_MV((old_cankick+1):cankick,1) = score(1:3,2);
    XDATA_black_MV((old_cankick+1):cankick,1) = score(4:6,1);
    YDATA_black_MV((old_cankick+1):cankick,1) = score(4:6,2);
    avgexplained1(i,1) = explained(1);
    avgexplained2(i,1) = explained(2);
end
figure; % plot figure of the scatter of the Monte Carlo
hold on;
plot(XDATA_red_MV,YDATA_red_MV,'.r');
plot(XDATA_black_MV,YDATA_black_MV,'.k');
xlabel(strcat('PCA1:',num2str(round(mean(avgexplained1))),'%'))
ylabel(strcat('PCA2:',num2str(round(mean(avgexplained2))),'%'))
% end Monte-Carlo case

Accepted Answer

the cyclist on 12 Nov 2021
Edited: the cyclist on 12 Nov 2021
I have to admit that I did not spend the time to come to a complete understanding of your code.
However, I'm guessing that your unexpected result is explained by the fact that the principal component vectors are the eigenvectors of the covariance matrix, and the negative of an eigenvector is also an eigenvector. Therefore, flipping the sign of a principal component (its coefficients and its scores together) gives an equally valid PCA result.
I'm guessing that the randomness you are inserting is enough that the PCA algorithm sometimes lands on eigenvectors with the opposite sign to those of the original X.
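To make the sign ambiguity concrete, here is a minimal sketch. It assumes the Statistics and Machine Learning Toolbox (for pca) and uses made-up random data; the names A, s, coeff_flipped and score_flipped are only illustrative. It flips the sign of one component and checks that the centered data are reconstructed equally well.

```matlab
rng(0);                          % fixed seed, only so the sketch is repeatable
A = randn(6, 4);                 % hypothetical data: 6 observations, 4 variables
[coeff, score] = pca(A);         % pca centers A, so centered data = score*coeff'
centered = A - mean(A);          % subtract the column means

s = [1 -1 1 1];                  % flip only the second component
coeff_flipped = coeff .* s;      % negate that column of the coefficients
score_flipped = score .* s;      % negate the matching column of the scores

% Both decompositions reproduce the centered data to round-off.
err_original = norm(centered - score*coeff');
err_flipped  = norm(centered - score_flipped*coeff_flipped');
fprintf('reconstruction error: %.2e (original), %.2e (flipped)\n', ...
    err_original, err_flipped);
```

Since each component can be flipped on its own, the sign of PC1 and the sign of PC2 can change independently from one Monte-Carlo draw to the next.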
I am not certain about the following, and you should carefully think about this yourself, but I think if you insert the line
score = score*sign(score(1)); % Flip the signs of all components in PC space if needed, so that score(1,1) is positive
after you calculate the PCA, then all your MC black/red distinctions will be retained consistently, and still be valid. (To make it consistent with the original data, force the MC score(1,1) to have the same sign as the original PCA.)
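One possible way to do that alignment per component, offered as a sketch rather than a tested fix for the code in the question: compare each column of the Monte-Carlo coeff with the matching column of the reference coeff, and flip the component when their dot product is negative. The data below are random stand-ins (X_ref plays the role of X', X_mc the role of one randomized draw), and all variable names are hypothetical.

```matlab
rng(1);                                  % fixed seed, only so the sketch is repeatable
X_ref = randn(6, 15);                    % stand-in for X' (6 observations, 15 variables)
X_mc  = X_ref + 0.05*randn(6, 15);       % stand-in for one randomized draw

[coeff_ref, ~]       = pca(X_ref);       % reference decomposition
[coeff_mc, score_mc] = pca(X_mc);        % decomposition of the randomized data

% Column-wise dot products: negative means the component points the other way.
flip = sign(sum(coeff_ref .* coeff_mc, 1));
flip(flip == 0) = 1;                     % leave exactly orthogonal pairs untouched

coeff_mc = coeff_mc .* flip;             % flip coefficients and scores together
score_mc = score_mc .* flip;

% Check: every aligned component now has a non-negative dot product
% with its reference component.
aligned = all(sum(coeff_ref .* coeff_mc, 1) >= 0);
```

This only handles signs. If the randomization is large enough that two components swap order (for example when their explained variances are close), the columns would need to be matched up first, which this sketch does not attempt.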
  1 Comment
dodgeball on 12 Nov 2021
Thank you so much for your insight! This seems a very plausible explanation. I would be surprised if there is an argument against your approach, but I will think about it further. If you or anyone else on this thread cares to weigh in on this or to comment further on the theory, please feel free to do so!
-Dodgeball

Release

R2020b