How to identify data set characteristics which influence the success of a model using those data sets as input.

I am studying the effect of hurricanes on coral reefs and have developed a damage prediction model which uses as inputs the fragility and distribution of different coral species at 150 post-storm survey sites. I can also create multiple simulated reefs by randomly assigning species, colonies and damage from the measured probability distribution functions of those parameters for each species. When I run 1000 simulated reef experiments, the results of my damage prediction are widely distributed, from terrible to great. I need to mine the 1000 simulated reefs to identify patterns which are influencing the success of the model. I expect this is a common scenario and would appreciate any guidance on which tools to use and how to proceed. I have the Statistics and Machine Learning Toolbox.

Accepted Answer

Yatharth on 3 May 2024
Hello Wayne,
To identify the data characteristics which influence the success of your model, you can start with some basic Exploratory Data Analysis (EDA) to understand the distributions of your parameters and outcomes, identify outliers, and see if there are any obvious patterns or correlations.
  1. Use "histogram", "boxplot", or "scatter" functions to visualize the distributions of your parameters and outcomes.
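  2. Use "corrplot" to visualize correlations between parameters and between parameters and outcomes. Note that "corrplot" is part of Econometrics Toolbox; with only Statistics and Machine Learning Toolbox, "corr" returns the correlation coefficients themselves.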
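As a rough illustration of the EDA steps above, here is a minimal sketch. It assumes each simulated reef has been summarized as one row of a numeric matrix "X" and that model success is a single score per reef, "skill". The data, the variable names and the four descriptor names are all hypothetical placeholders, not something from your model. It uses "corr" rather than "corrplot" so that it needs only Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical stand-in data: one row per simulated reef.
rng(1);                                   % reproducible random numbers
nReefs = 1000;
X = randn(nReefs, 4);                     % placeholder reef descriptors
featureNames = {'meanFragility'; 'speciesCount'; 'colonyDensity'; 'meanDepth'};

% Placeholder "model success" score; replace with your own skill metric.
skill = 0.8*X(:,1) - 0.3*X(:,3) + 0.5*randn(nReefs, 1);

% Distribution of the outcome across the simulated reefs.
figure;
histogram(skill);
xlabel('Model skill');
ylabel('Number of simulated reefs');

% One scatter plot per descriptor against the outcome.
figure;
for k = 1:size(X, 2)
    subplot(2, 2, k);
    scatter(X(:,k), skill, 8, 'filled');
    xlabel(featureNames{k});
    ylabel('Model skill');
end

% Spearman rank correlation of each descriptor with the outcome
% (one row per descriptor, one column for the outcome).
rho = corr(X, skill, 'Type', 'Spearman');
disp(table(featureNames, rho, 'VariableNames', {'Feature', 'SpearmanRho'}));
```

Spearman correlation is used here only because it does not assume a linear relationship; Pearson (the default of "corr") may be just as suitable for your data.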
With many input parameters, it's crucial to identify which ones significantly impact the model's outcome. Feature selection techniques can help reduce dimensionality and focus on the most influential variables.
  1. Use "sequentialfs" (sequential feature selection) to identify the most important features. This function can help you find a subset of the input variables that most effectively predict the outcome.
  2. Consider using principal component analysis (PCA) with "pca" to reduce dimensionality and possibly uncover underlying patterns in your data.
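The two steps above could look something like the following sketch. It assumes model success is a continuous score, so the selection criterion is the held-out squared error of a simple linear fit; if you label each simulated reef as success/failure instead, a misclassification count would be the analogous criterion. The data and all variable names are hypothetical placeholders.

```matlab
% Hypothetical stand-in data: one row per simulated reef.
rng(1);                                   % reproducible random numbers
nReefs = 1000;
X = randn(nReefs, 5);                     % placeholder reef descriptors
% Placeholder skill score that depends only on descriptors 1 and 3.
skill = 0.8*X(:,1) - 0.3*X(:,3) + 0.5*randn(nReefs, 1);

% Criterion for sequentialfs: sum of squared prediction errors on the
% held-out fold, using a linear model fitted on the training fold.
critFun = @(Xtrain, ytrain, Xtest, ytest) ...
    sum((ytest - predict(fitlm(Xtrain, ytrain), Xtest)).^2);

% Forward sequential feature selection with 5-fold cross-validation.
cvp = cvpartition(nReefs, 'KFold', 5);
[inmodel, history] = sequentialfs(critFun, X, skill, 'cv', cvp);
disp('Selected descriptor columns:');
disp(find(inmodel));

% PCA on standardized descriptors; "explained" holds the percentage of
% total variance captured by each principal component.
[coeff, ~, ~, ~, explained] = pca(zscore(X));
disp('Variance explained by each component (%):');
disp(explained');
```

Keep in mind that "pca" looks only at the inputs and never at the outcome, so a component that explains a lot of variance is not necessarily one that influences model success; "sequentialfs" is the step that ties the descriptors to the outcome.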
Here are the links for some of the mentioned functions:
  1. scatter: https://www.mathworks.com/help/matlab/ref/scatter.html
  2. corrplot: https://www.mathworks.com/help/econ/corrplot.html
  3. sequentialfs: https://www.mathworks.com/help/stats/sequentialfs.html
  4. pca: https://in.mathworks.com/help/stats/pca.html
Here are some examples that might be useful in your case:
  1. For feature selection: https://www.mathworks.com/help/stats/selecting-features-for-classifying-high-dimensional-data.html
  2. For classification: https://www.mathworks.com/help/stats/classification-example.html
  3. For cross validation: https://www.mathworks.com/help/stats/crossval.html

Release

R2022b