How to take PCA of large sparse matrices without losing a row

I am working on a project where I need to take the PCA of a sparse matrix that, when converted to dense, is 553 * 26315. The matrix was formed by taking the TF-IDF of 553 * 25. Running PCA as
[coeff,score,latent,~,explained] = pca(X)
returns coeff = 26315 * 552, score = 553 * 552, and latent and explained as 552 * 1. I want to confirm whether I am supposed to transpose X before applying the function, or coeff afterwards. I am to use the output as features in my work, so if I have a target array of 553 values, I would certainly need 553 rows. Also, if I set Economy to false, MATLAB fails to run it because the matrix is too large; is there a way to make sparse matrices smaller without losing information before applying PCA?
Not exactly on topic, but could someone comment on whether using PSO (before or without PCA) makes sense?

Answers (1)

arushi on 14 Aug 2024
Hi Amigo,
When working with Principal Component Analysis (PCA) on large sparse matrices, it's important to ensure that the data is oriented correctly and that the PCA implementation is efficient. Here are some key points and steps to address your concerns:
1. Matrix Orientation - For PCA, the rows of the input matrix X typically represent observations (samples), and the columns represent variables (features). Given your data dimensions:
  • X is 553 (samples) by 26315 (features).
The output dimensions you mentioned:
  • coeff: 26315 x 552
  • score: 553 x 552
  • latent and explained: 552 x 1
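These dimensions show that the function is working correctly and that no transpose is needed: PCA reduces the number of features while retaining the number of samples. score has 553 rows, one per observation, so it can be used directly as the feature matrix alongside your 553 target values. Only 552 components are returned because centering the data removes one degree of freedom, so 553 observations have at most 552 directions of nonzero variance.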
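To see the same behaviour on a small scale, here is a minimal sketch using random data as a stand-in for your matrix (the sizes 20 and 100 are arbitrary assumptions, and pca needs the Statistics and Machine Learning Toolbox). It shows that score keeps one row per observation and that it equals the centered data projected onto coeff:

```matlab
% Toy stand-in for the real data: 20 observations, 100 variables.
rng(0);
X = rand(20, 100);

% Default call: rows are observations, columns are variables.
[coeff, score, latent] = pca(X);

size(coeff)   % 100 19  (variables by components)
size(score)   % 20 19   (observations by components)

% score is the centered data projected onto the principal axes.
Xc = X - mean(X, 1);                          % implicit expansion, R2016b or later
maxErr = max(max(abs(Xc * coeff - score)));   % should be at rounding-error level
```

The rows of score line up with the rows of X, so a target vector with one entry per observation can be paired with score as is.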
2. Sparse Matrix Handling - MATLAB's pca function may not handle sparse matrices efficiently for large datasets. To address this, consider a routine that operates on sparse input directly, such as svds, rather than converting the matrix to dense first.
3. Economy Mode - With 'Economy' set to false, pca returns all 26315 components, so coeff alone is a dense 26315 x 26315 matrix (about 5.5 GB in double precision), which is likely why the call fails. The components beyond the 552nd have zero variance, so they add no information. Keeping 'Economy' at its default of true is advisable.
4. Feature Reduction Before PCA - To reduce the size of the matrix without losing significant information, you can use techniques such as:
  • Truncated Singular Value Decomposition (SVD): This can be applied directly to sparse matrices and is often used for large-scale PCA.
  • Feature Selection: Select a subset of features based on some criteria (e.g., variance threshold).
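Both options can be sketched roughly as follows. This is an illustration on a random sparse matrix, not a drop-in solution: the sizes, k, and nKeep are assumed values to tune for your data. Note that svds on the raw matrix skips the mean-centering that pca performs, so the result is a truncated SVD of the uncentered data and will not match pca output exactly.

```matlab
% Toy sparse matrix standing in for the 553-by-26315 TF-IDF matrix.
rng(0);
X = sprand(50, 400, 0.05);
n = size(X, 1);

% Option 1: truncated SVD computed directly on the sparse matrix.
k = 10;                       % number of components to keep (assumed value)
[U, S, V] = svds(X, k);       % k largest singular values and vectors
scoreSvd = U * S;             % 50-by-10, one row per observation

% Option 2: keep only the highest-variance columns before any PCA.
% The variance is computed from sums so that X stays sparse.
mu = full(mean(X, 1));
sq = full(sum(X.^2, 1));
v  = (sq - n * mu.^2) / (n - 1);   % sample variance of each column
nKeep = 100;                       % number of features to keep (assumed value)
[~, order] = sort(v, 'descend');
Xsmall = X(:, order(1:nKeep));     % 50-by-100, still sparse
```

Either scoreSvd or the output of pca(full(Xsmall)) has one row per observation, so it can serve as the feature matrix.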
Hope this helps.
