# Is it suitable to include the dependent variable when conducting PCA?

Xiaoxiao Du on 5 Jun 2022
Commented: Xiaoxiao Du on 8 Jun 2022
The question emerges from the following situation:
Suppose there are N stationary time-series variables (X1~XN), each with 1,000,000 observations, and all of them are correlated with one another to some degree (say, correlation coefficients of 0.5-0.95). We run PCA on the first 800,000 observations as in-sample data, transforming the original coordinate system into a new set of PC coordinates, then run a regression against the PC scores and use it to check whether there are anomalies in any of the variables in the out-of-sample observations (800,001-1,000,000). Which methodology should I adopt? (Assume PC1 and PC2 explain 95% of the total variance.)
1. Run PCA on X1-XN using the 800,000 in-sample observations. For a given Xy, we get predict_Xy = regress(Xy, [PC1 PC2]), where Xy itself contributes to PC1 and PC2. If this is valid, how should I interpret the result?
2. Run PCA on X1-XN excluding Xy, using the 800,000 in-sample observations. For a given Xy, we get predict_Xy = regress(Xy, [PC1 PC2]), where PC1 and PC2 contain no Xy terms. This would be easy to interpret, but if an anomaly may occur in any variable, do we have to run a separate PCA and regression for each variable to detect the anomalies?
I've done this analysis on a data set of 10,000,000 observations x 4 variables. The estimated error variance from Method 2 is 3 times that of Method 1 (which is understandable, since Xy itself enters the prediction function in Method 1). Method 1 seems a lot more appealing, but it is indeed counter-intuitive.
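As a rough illustration of the two methods compared above, here is a minimal sketch in Python with NumPy (used here instead of MATLAB's `pca` and `regress`). Everything in it is an assumption made for illustration: the data are synthetic (two hypothetical common factors plus independent noise, 4 variables, 10,000 observations with an 80/20 split), and the helper names `pca_fit` and `fit_predict` are made up. It is not the original data set or analysis.

```python
import numpy as np

def pca_fit(X, k):
    """Return the column means and top-k principal directions of X (rows = observations)."""
    mu = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T

def fit_predict(X_in, X_out, y_idx, k, include_target):
    """Regress column y_idx on k in-sample PC scores; return out-of-sample residuals."""
    cols = list(range(X_in.shape[1]))
    if not include_target:
        cols.remove(y_idx)              # Method 2: leave Xy out of the PCA
    mu, W = pca_fit(X_in[:, cols], k)
    S_in = (X_in[:, cols] - mu) @ W     # in-sample PC scores
    S_out = (X_out[:, cols] - mu) @ W   # out-of-sample scores, same projection
    A_in = np.column_stack([np.ones(len(S_in)), S_in])
    beta, *_ = np.linalg.lstsq(A_in, X_in[:, y_idx], rcond=None)
    A_out = np.column_stack([np.ones(len(S_out)), S_out])
    return X_out[:, y_idx] - A_out @ beta

# Synthetic data (assumption): 2 common factors, 4 observed variables, i.i.d. noise.
rng = np.random.default_rng(0)
n, n_in = 10_000, 8_000
factors = rng.standard_normal((n, 2))
loadings = np.array([[1.0, 0.2], [0.8, 0.6], [0.6, -0.7], [0.9, -0.3]])
X = factors @ loadings.T + 0.3 * rng.standard_normal((n, 4))
X_in, X_out = X[:n_in], X[n_in:]

res1 = fit_predict(X_in, X_out, y_idx=0, k=2, include_target=True)   # Method 1
res2 = fit_predict(X_in, X_out, y_idx=0, k=2, include_target=False)  # Method 2
var1, var2 = res1.var(), res2.var()
print("Method 1 residual variance:", var1)
print("Method 2 residual variance:", var2)
```

In this toy setup the Method 1 residual variance comes out smaller, because part of Xy's own noise enters PC1 and PC2 and is then used to predict Xy. The same leakage would apply to an anomaly in Xy, so a smaller residual under Method 1 does not by itself indicate a better anomaly detector.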
Or perhaps I should not use PCA in the first place? If so, which methodology should I pick?
##### 2 Comments
the cyclist on 5 Jun 2022
Yes, it worked. I got a notification.

the cyclist on 5 Jun 2022
I've never used PCA on time series data, and I had never heard of anyone doing so, but googling "PCA time series" does turn up some links.
Regardless, I think you definitely do not want to do the version where you include Xy in the PCA. I believe that when people commingle the explanatory and response variables, they use canonical correlation analysis. But that is also typically not for time series data.
I think it is possible that a vector autoregression model might be closer to the right model for your system. But, it is not clear to me how to identify the "anomalies" you are talking about.
Looks like something called "dynamic factor analysis" might also be useful, but I have no experience with it.
I hope that is at least a little helpful.
Xiaoxiao Du on 8 Jun 2022
Thanks! You drew quite a clear conclusion, especially with the 'N-dimensional space' term.
I'll be reading papers in the areas you mentioned. May I leave the question open, so that I can post updates on my progress and others can add new thoughts?
Thanks again!