Is it suitable to include the dependent variable when conducting PCA?

Question

Xiaoxiao Du on 5 Jun 2022

Commented: Xiaoxiao Du on 8 Jun 2022

The question arises from the following situation:

Suppose there are N stationary time-series variables (X1 to XN), each with 1,000,000 observations, and all of them are correlated to some degree (say, 0.5 to 0.95). We run PCA on the first 800,000 observations as in-sample data, transforming the original coordinate system into a new set of PC coordinates, run a regression against the PC variables, and use it to check for anomalies in any of the variables in the out-of-sample observations (800,001 to 1,000,000). Which methodology should I adopt? (Assume PC1 and PC2 explain 95% of the total variance.)

1. Run PCA on X1-XN using the 800,000 observations. For a given Xy, we get predict_Xy = regress(Xy, [PC1 PC2]), where Xy itself is contained in PC1 and PC2. If this is valid, how should I interpret the result?
2. Run PCA on X1-XN excluding Xy using the 800,000 observations. For a given Xy, we get predict_Xy = regress(Xy, [PC1 PC2]), where PC1 and PC2 contain no Xy terms. This would be easy to interpret, but if an anomaly may occur in any variable, do we have to run a separate regression for each variable to detect the anomalies?

I've done this analysis on a 10,000,000 observation x 4 variable dataset. The estimated error variance from Method 2 is 3 times that of Method 1 (which is understandable, since Xy itself is inside the prediction function). Method 1 seems much more appealing but is indeed counterintuitive.

Or perhaps I should not use PCA in the first place? If so, which methodology should I pick?

2 Comments

Xiaoxiao Du on 5 Jun 2022

@the cyclist am I using @ right?

the cyclist on 5 Jun 2022

Yes, it worked. I got a notification.

Answer 1

the cyclist on 5 Jun 2022

I've never used PCA on time series data, and I had never heard of anyone doing so, but googling PCA time series does turn up some links.

Regardless, I think you definitely do not want to do the version where you include Xy in the PCA. I believe that when people commingle the explanatory and response variables, they use canonical correlation analysis. But that is also typically not for time series data.

I think it is possible that a vector autoregression model might be closer to the right model for your system. But, it is not clear to me how to identify the "anomalies" you are talking about.

Looks like something called "dynamic factor analysis" might also be useful, but I have no experience with it.

I hope that that is at least a little helpful.

7 Comments

Xiaoxiao Du on 6 Jun 2022

Apologies for the confusion caused by putting the phrase 'time series' in the description. The data originated from time series, but the aim is not to investigate time-series behavior or to forecast the future (as a VAR model does).

The idea is to find patterns in the in-sample data, treat the pattern as static, and apply it to the out-of-sample data to check for anomalies (thus, it is a cross-sectional data analysis).

An elementary-school-style illustration would be:

Q: Please find the anomaly below and mark it in bold and underline:

[1 2 3 4 5]

[2 4 6 8 10]

[3 6 9 10 15]

[4 8 12 16 20]

(As you can see, in this question there is no difference between x's and y's, and we wouldn't call it a time-series analysis.)

For the human mind it is quite easy to find the pattern, but I believe our brains don't run linear regressions one by one (in this case 4C2 = 6 regressions) to eventually spot the 'anomaly'. The human brain's process might be a bit like a 'Convolutional Neural Network' mechanism?
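As a rough illustration of finding the odd entry with a single decomposition rather than pairwise regressions, the sketch below takes the best rank-1 approximation of the four example rows (via svd, without the mean-centering that pca would apply) and reports the entry with the largest residual. The variable names are made up for this illustration, and the rank-1 pattern is an assumption that happens to fit this toy example; it is not the method applied to the real data.

```matlab
% Toy matrix from the illustration above; each row is a multiple of [1 2 3 4 5].
M = [1 2  3  4  5
     2 4  6  8 10
     3 6  9 10 15
     4 8 12 16 20];

% Best rank-1 approximation from the leading singular triplet.
[U, S, V] = svd(M, 'econ');
M1 = S(1,1) * U(:,1) * V(:,1)';

% Residual of each entry with respect to the shared pattern.
R = M - M1;

% Locate the entry that deviates most from the pattern.
[~, idx] = max(abs(R(:)));
[r, c] = ind2sub(size(R), idx);
fprintf('Largest residual at row %d, column %d (value %g)\n', r, c, M(r, c));
```

Note that the fitted pattern is itself pulled slightly toward the anomalous entry, which is one motivation for robust variants of such decompositions.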

I re-checked VAR and DFA; in my understanding those are for time series data and mainly focus on forecasting. As for CCA, I think PCA may be better for this case (as in the illustration above)?

the cyclist on 6 Jun 2022

I am having a difficult time connecting your original question with this simpler "elementary school illustration" you made.

1. Is each of those vectors (e.g. [1 2 3 4 5]) supposed to represent one of the XN?
2. There is "no difference between x's and y's" in the simple problem, but that seems to be a very important aspect of the original problem, so I am confused.
3. This definitely does not seem like a cross-sectional dataset to me, since it looks like you have multiple measurements of each variable. Maybe it is not a time series, but it is also not a cross-section.
4. I don't see how the 80/20 split you mention in your original post is relevant here.

But, trying to be productive ...

Are you saying that the essence of your problem is captured by the following?

You have N vectors:

X1 = [1 2 3 4 5]; % but much, much longer
X2 = [2 4 6 8 10];
X3 = [3 6 9 10 15];
...
XN = [4 8 12 16 20];

The majority of these vectors seem to follow the same "pattern", but some do not. Find the "anomalous ones" (where it seems like you can't necessarily state precisely what you mean by "anomalous").

Xiaoxiao Du on 6 Jun 2022

test.mat

Sorry... Expressing myself is one of my weaknesses...

Let me answer the questions in order:

1. Yes. In the example above, X1 = [1 2 3 4 5], X2 = [2 4 6 8 10], ...
2. Each variable in the system can be an independent variable. An 'anomaly' can pop up in any observation of any variable. I wish to establish a 'framework' that includes all variables for anomaly detection (like... whac-a-mole?)
3. Sorry for such a rough illustration. The real data for all X's are approximately normally distributed, centered almost at 0 (on the order of 10^-8), with slightly different standard deviations but all on the same scale (on the order of 10^-4).
4. The full methodology is intended to have an 80/20 split, but for now we are still discussing how to build a model using the 80% data. (There are (extreme) anomalies inside the 80% data, so I might adopt Robust PCA instead of ordinary PCA.)

Please allow me to use a small portion of the original data to illustrate what I mean:

test is a 289778 observation x 4 variable matrix; call the four variables 'A', 'B', 'C' and 'D'.

Methodology 1 (run PCA on all 4 variables):

load('test.mat')
[coeff,score,latent,tsquare,explained,mu] = pca(test);

From 'explained' we can see that PC1 and PC2 explain 94.9% of the total variance, so we choose PC1 and PC2:

[b,bint,r,rint,stats] = regress(test(:,1),score(:,1:2));

which gives

pred_A = 0.6141*PC1 - 0.1712*PC2

Substituting PC1 and PC2 with their original 'meaning' from coeff:

pred_A = 0.6141*(0.6141*A+0.6060*B+0.4421*C+0.2453*D)-0.1712*(-0.1712*A-0.3335*B+0.1915*C+0.9071*D)
pred_A = 0.4064*A+0.4292*B+0.2387*C-0.0046*D
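The substitution above is just a matrix product: the coefficients on A to D are the first two columns of coeff multiplied by the regression coefficients b. A small check using the rounded values quoted in this post (so the last digit may differ slightly from the numbers above):

```matlab
% Rounded values copied from this post. Rows are A, B, C, D and
% columns are PC1, PC2 (i.e. the first two columns of coeff).
W = [ 0.6141 -0.1712
      0.6060 -0.3335
      0.4421  0.1915
      0.2453  0.9071 ];
b = [0.6141; -0.1712];     % regression coefficients on PC1 and PC2

% Coefficients of pred_A expressed in the original variables.
w = W * b;
fprintf('pred_A = %.4f*A + %.4f*B + %.4f*C + %.4f*D\n', w);
```

This ignores the centering term mu, which is negligible here because the data are centered almost at 0.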

The stats look quite good, I think (R^2, F statistic, p-value, estimated error variance):

0.9525 5.811e+6 0 1.467e-10

Methodology 2 (run PCA on only 'independent variables' B C D):

load('test.mat')
[coeff,score,latent,tsquare,explained,mu] = pca(test(:,2:4));

From 'explained' we can see that PC1 and PC2 explain 96.4% of the total variance, so we choose PC1 and PC2:

[b,bint,r,rint,stats] = regress(test(:,1),score(:,1:2));

which gives

pred_A = 0.7347*PC1 - 0.1901*PC2

Substituting PC1 and PC2 with their original 'meaning' from coeff:

pred_A = 0.7347*(0.7642*B+0.5560*C+0.3269*D)-0.1901*(-0.4696*B+0.1321*C+0.8730*D)
pred_A = 0.6507*B+0.3834*C+0.0743*D

The stats are not as good (since the predictors do not contain A itself):

0.8695 1.931e+6 0 4.034e-10

In the end, I would like to know whether I may use Methodology 1. If so, I'm happy, since the residual between observation and model is smaller. If not, why not?

As for the definition of an anomaly, using Methodology 1 as an example:

tmp_r = test(:,1)-(b(1)*score(:,1)+b(2)*score(:,2));

Plotting a histogram shows that the distribution is good enough (the sharp peak at x=0 is something in the original data; I will look into it and try to refine it).

We define anomalies as values below the 5th or above the 95th percentile:

prctile(tmp_r,[1 5 50 95 99])
-3.607e-05	-1.811e-05	-1.353e-07	1.777e-05    3.580e-05

So I define an anomaly in A as a deviation from its 'expected value' of at least 3.6e-05 (roughly the 1st and 99th percentile values above).

We do the same for B, C and D, so we can detect anomalies in the whole system at once?

Although it seems a little weird, am I doing the whole process correctly?
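A minimal sketch of the per-variable loop described above, using Methodology 2 (PCA on the other variables only) and synthetic data in place of test.mat. The factor loadings, the split sizes, the injected anomaly and the 1st/99th percentile cutoffs are all assumptions for illustration. The point it tries to show is that the centering (mu), loadings (coeff), regression coefficients and cutoffs are estimated on the in-sample rows only and then applied unchanged to the out-of-sample rows.

```matlab
rng(1);                                  % reproducible synthetic data
n = 10000;  nIn = 8000;                  % hypothetical 80/20 split
f = randn(n, 2);                         % two made-up latent factors
L = [1.0 0.9 0.7 0.4; -0.2 -0.4 0.3 1.0];
X = f * L + 0.2 * randn(n, 4);           % four correlated variables
X(9000, 3) = X(9000, 3) + 3;             % injected out-of-sample anomaly

inS = 1:nIn;  outS = nIn+1:n;
flags = false(numel(outS), 4);
for j = 1:4
    others = setdiff(1:4, j);
    % PCA on the in-sample rows of the remaining variables only.
    [coeff, score, ~, ~, ~, mu] = pca(X(inS, others));
    muY = mean(X(inS, j));
    % Least-squares fit of variable j on PC1 and PC2 (backslash stands in for regress).
    b = score(:, 1:2) \ (X(inS, j) - muY);
    resIn = X(inS, j) - muY - score(:, 1:2) * b;
    lim = prctile(resIn, [1 99]);        % in-sample residual cutoffs
    % Project out-of-sample rows with the in-sample mu and coeff
    % (implicit expansion, R2016b or later).
    scoreOut = (X(outS, others) - mu) * coeff(:, 1:2);
    resOut = X(outS, j) - muY - scoreOut * b;
    flags(:, j) = resOut < lim(1) | resOut > lim(2);
end
fprintf('Out-of-sample rows flagged per variable: %s\n', mat2str(sum(flags, 1)));
```

Because the anomalous value also enters the predictors of the other three regressions, those columns may be flagged for the same row as well, so attributing an anomaly to a single variable takes some care.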

the cyclist on 7 Jun 2022

Edited: the cyclist on 7 Jun 2022

I have not been able to spend enough time to deeply understand your problem, but I have a few thoughts.

First, I'm pretty sure that Methodology 1 is not a good approach. Although the PCA obscures this a little bit, all you are doing in the regression is predicting test(:,1) from test(:,1). (Did you notice that your regression coefficients are identical to the first row of the PCA coefficients?) If you had used all the PCs in the regression, you would have gotten a perfect fit. That doesn't seem useful. In fact, I think it makes it more difficult to spot anomalies.
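Both points can be checked numerically. The sketch below uses synthetic correlated data (the loadings and noise level are assumptions, not the poster's data): regressing column 1 on the first two PC scores returns the first row of coeff, and regressing on all the scores reproduces the centered column 1 up to round-off. This follows because the centered data equal score*coeff' and the score columns are mutually orthogonal.

```matlab
rng(0);                                  % reproducible synthetic data
n = 5000;
f = randn(n, 2);                         % two made-up latent factors
L = [1.0 0.9 0.7 0.4; -0.2 -0.4 0.3 1.0];
X = f * L + 0.2 * randn(n, 4);           % four correlated variables

[coeff, score] = pca(X);                 % pca centers X by default
Xc = X - mean(X);                        % same centering (R2016b or later)

% Regress column 1 on the first two PC scores (backslash = least squares).
b2 = score(:, 1:2) \ Xc(:, 1);
disp([b2, coeff(1, 1:2)']);              % the two columns agree

% Regress column 1 on all four PC scores: the fit is exact.
b4 = score \ Xc(:, 1);
maxErr = max(abs(score * b4 - Xc(:, 1)));
fprintf('Max abs error using all PCs: %g\n', maxErr);
```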

My understanding now is that "anomaly" means "prediction of test(:,1) from test(:,2:n)" is an outlier. But I think your overall method -- PCA and then regression -- is not well suited to what you actually want to do. Specifically, if you plot a histogram of the residuals from one of the regressions, you will see that they are not normally distributed. (Very very not normally distributed.) So, my concern would be that you are violating the assumptions of the regression, and therefore not really doing what you want.
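One quick way to quantify the departure from normality is the sample kurtosis of the residuals (3 for a normal distribution). The sketch below uses made-up heavy-tailed residuals (Laplace-distributed, an assumption standing in for the real ones) and compares Gaussian-based cutoffs with empirical percentiles; with heavy tails the two can disagree noticeably.

```matlab
rng(3);                                  % reproducible synthetic residuals
n = 1e5;
% Difference of two exponential variates gives a Laplace(0,1) sample.
r = log(rand(n, 1)) - log(rand(n, 1));

% Sample kurtosis from standardized fourth moment (normal = 3).
z = (r - mean(r)) / std(r);
k = mean(z.^4);

% 1%/99% cutoffs if r were normal (2.3263 is the 99% normal quantile)...
gaussCut = mean(r) + 2.3263 * std(r) * [-1 1];
% ...versus the empirical 1st and 99th percentiles.
empCut = prctile(r, [1 99]);

fprintf('kurtosis = %.2f\n', k);
fprintf('Gaussian cutoffs:  [%.3f %.3f]\n', gaussCut);
fprintf('Empirical cutoffs: [%.3f %.3f]\n', empCut);
```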

I do think I understand your problem a lot better now, after your patient and more detailed explanations. Your approach to finding anomalies seems to be "For each variable in turn, regress it against all other variables, and find the values that are poorly predicted". (Right?) But, it seems to me now that none of your variables are actually a response variable, with the others being explanatory. (Right?) This is just an ad hoc scheme.

My advice -- and I will also try to think about this -- would be to try to get really specific about the definition of anomaly. It can't be "a value that doesn't quite fit the pattern", because that's too vague. I think that if you can get a tight definition of anomaly, the method for identifying them will be easier.

I hope that is helpful.

Xiaoxiao Du on 7 Jun 2022

Thank you very much for the time and effort you have spent on this question! I'm delighted to have someone to discuss this model with.

Yes, I noticed that the PCA coefficients of test(:,1) are the same as its regression coefficients b, and also that using all 4 PCs turns the regression into exactly "predict_A = A" (a property of the SVD mechanism).

I do think my overall method is juvenile, especially the linear regression part, but I failed to find a better method, or a name, for this kind of problem. However, I was quite satisfied with the residual histogram. I wouldn't call it a normal distribution, but it seems good? (figure below) Do the residuals necessarily need to follow a normal distribution for the approach to be 'valid'?

(histogram plot of "A - pred_A")

Yes, my model is now "For each variable in turn, regress it against all other variables in PC coordinates, and find the values that lie outside its 95% prediction range", but my initial purpose was to design a model whose input is the whole dataset matrix and whose output is the 'anomaly' (a "spider web" system instead of a system combining N regressions for N variables). (I have no clue how I could design that system at the moment.)

Oh! A better metaphor for the question just popped up! Have you watched the movie "Battleship"? There is a scene where a monster is attacking warships, but the warships cannot locate it by radar or sonar. They eventually predict the monster's location by analyzing the altitude data matrix of the climate buoys on the sea surface, using anomalies in it to trace the monster. In that case, each buoy gives a time series of its altitude, and taken in parallel the altitude surface should fluctuate within a reasonable range, so summarizing the 'surface pattern' is similar to what we do here.

(Similarities between my case and the Battleship case: 1. the original data is time series data (but the focus is not forecasting); 2. no response variable; 3. only correlated independent variables with the same or similar underlying meaning; 4. the aim is to detect anomalies among the 'independent variables'.)

the cyclist on 7 Jun 2022

If I had to put a name on the thing you are trying to do, it would be "multivariate outlier detection". Googling that phrase does turn up some hits that seem quite related to what you are trying to do. This paper discusses a few techniques.

It seems that if you are primarily interested in "these data points are far away from the bulk of the data points, in N-dimensional space", then there is some good information for you there.
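A minimal sketch of that idea, treating each row as a point in N-dimensional space and scoring it by its squared Mahalanobis distance from the in-sample mean. The synthetic data, the injected anomaly and the 99.9th-percentile cutoff are assumptions for illustration; this is one standard technique in the multivariate outlier detection family, not necessarily the one discussed in the linked paper.

```matlab
rng(2);                                  % reproducible synthetic data
n = 10000;  nIn = 8000;                  % hypothetical 80/20 split
f = randn(n, 2);                         % two made-up latent factors
L = [1.0 0.9 0.7 0.4; -0.2 -0.4 0.3 1.0];
X = f * L + 0.2 * randn(n, 4);           % four correlated variables
X(9000, 3) = X(9000, 3) + 3;             % breaks the correlation pattern

% Estimate location and covariance from the in-sample rows only.
mu = mean(X(1:nIn, :));
C  = cov(X(1:nIn, :));

% Squared Mahalanobis distance of every row (implicit expansion, R2016b+).
Xc = X - mu;
d2 = sum((Xc / C) .* Xc, 2);

% Empirical in-sample cutoff, then flag out-of-sample rows beyond it.
cut = prctile(d2(1:nIn), 99.9);
outIdx = nIn + find(d2(nIn+1:end) > cut);
fprintf('Flagged %d out-of-sample rows\n', numel(outIdx));
```

All variables are treated symmetrically here, so no column has to be singled out as the response. The plain mean and covariance are themselves sensitive to anomalies in the in-sample rows, so robust estimates could be substituted.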

Xiaoxiao Du on 8 Jun 2022

Thanks! That is a very clear conclusion, especially the 'N-dimensional space' term.

I'll be reading papers in the realm you mentioned. May I leave the question open so I can update some progress or perhaps there will be new thoughts from others?

Thanks again!
