How can I improve pre-processing for machine learning?

Hi! I have a data set with several features (size 1950x22). I have to develop a machine learning algorithm that predicts one of the columns (the 22nd) when it is given new data for the other features. To summarize: the output feature (the 22nd column), which the model must learn to predict from the other 21 columns, takes one of three categories: 1, 2, 3. The problem is that after pre-processing the category counts have changed from
1: 1655
2: 295
3: 176
to:
1: 1337
2: 135
3: 24
Is there a way to oversample the data in category 3 of the output? Or to make sure that, when training in the Classification Learner app, all the data belonging to category 3 of the output feature is used?

Answers (1)

Avadhoot on 9 Oct 2023
Edited: Avadhoot on 9 Oct 2023
Hi,
I understand that you have a class imbalance in your dataset. The imbalance is further worsened by the preprocessing performed. There are several ways to deal with class imbalance. Some of them are listed below:
1. Undersampling:
You can remove some of the samples of the larger classes (i.e., classes 1 and 2) by randomly discarding them using the “datasample” function. Assuming “data” holds the rows of the class you want to shrink, you can use the function by adding the following line to your code:
dataSampled = datasample(data,k,'Replace',false);
Here’s what the code does:
  • “k” is the number of samples you want to select.
  • Setting the “Replace” input argument to false ensures that the sampling is done without replacement.
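As a minimal sketch of how this could look end to end, the snippet below undersamples only class 1 and leaves classes 2 and 3 untouched. The random feature matrix, the target size “k = 300”, and all variable names are assumptions made for illustration; the class counts are the post-processing counts from the question.

```matlab
% Illustrative data only: 21 random feature columns plus the label column (22nd).
rng(0);                                          % reproducible sampling
labels = [ones(1337,1); 2*ones(135,1); 3*ones(24,1)];
data   = [randn(numel(labels),21), labels];

% Split the rows into class 1 and everything else.
majority = data(data(:,22) == 1, :);
rest     = data(data(:,22) ~= 1, :);

% Keep k randomly chosen rows of class 1, sampled without replacement.
k = 300;                                         % hypothetical target size
majoritySampled = datasample(majority, k, 'Replace', false);

% Recombine: classes 2 and 3 keep every row.
dataUndersampled = [majoritySampled; rest];
```

Undersampling discards information from class 1, so it is worth comparing the result against a model trained on the full data.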
Refer to the documentation of the “datasample” function for details about data sampling.
2. Oversampling:
You can use an oversampling technique like Synthetic Minority Over-sampling Technique (SMOTE) to create synthetic samples of class 3 so that the class imbalance is reduced. SMOTE is not built into MATLAB; for details on how to use it, please refer to a SMOTE submission on the MATLAB File Exchange.
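If you prefer to stay with built-in functions, a simpler alternative is plain random oversampling, which duplicates existing class 3 rows instead of synthesizing new ones. Note that this is not SMOTE. The sketch below assumes illustrative random features and a hypothetical target of matching the size of class 2.

```matlab
% Illustrative data only: 21 random feature columns plus the label column (22nd).
rng(0);                                          % reproducible sampling
labels = [ones(1337,1); 2*ones(135,1); 3*ones(24,1)];
data   = [randn(numel(labels),21), labels];

% Rows of the minority class.
minority = data(data(:,22) == 3, :);

% Number of extra rows needed to reach the size of class 2 (hypothetical target).
nExtra = 135 - size(minority,1);

% Draw duplicates of class 3 rows, sampling with replacement.
extra = datasample(minority, nExtra, 'Replace', true);

% Append the duplicates to the original data.
dataOversampled = [data; extra];
```

Oversample only the training portion of the data. If duplicated rows end up in both the training and the validation sets, the validation score will look better than it really is.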
3. Class weighting:
You can assign different weights to each class such that class 3 is given more importance. If you train a neural network with Deep Learning Toolbox, you can do it by using the “ClassWeights” option in your classification layer as follows:
classificationLayer(Classes=classes,ClassWeights=classWeights)
Here is what the parameters mean:
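  • “classes” contains all class names, given as a string array, categorical vector, or cell array of character vectors, e.g. ["1" "2" "3"].
  • “ClassWeights” is a vector containing the weights for each class, in the same order as “classes”. Here you can specify more weight to class 3.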
  • A sample “ClassWeights” vector would be: [1,2,4].
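One common way to choose the weights, shown here only as a sketch, is to make them inversely proportional to the class frequencies. The counts are the post-processing counts from the question; the scaling so that class 1 has weight 1 is an assumption for readability, and the call requires Deep Learning Toolbox.

```matlab
% Class counts after pre-processing, in the order of the classes 1, 2, 3.
counts = [1337 135 24];

% Inverse-frequency weights: rarer classes get larger weights.
classWeights = sum(counts) ./ (numel(counts) * counts);

% Rescale so the most frequent class has weight 1 (roughly [1 9.9 55.7]).
classWeights = classWeights / min(classWeights);

% Class names must be strings, categoricals, or character vectors.
classes = ["1" "2" "3"];

% Output layer that uses weighted cross-entropy loss.
layer = classificationLayer('Classes', classes, 'ClassWeights', classWeights);
```

Weights this large can push the model toward over-predicting class 3, so treat them as a starting point and tune them against validation results.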
Refer to the documentation of the “classificationLayer” function for more details on class weighting.
4. Evaluation metrics:
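Use evaluation metrics like precision, recall, F1 score and AUC-ROC instead of overall accuracy. With imbalanced classes, accuracy can stay high even when the model almost never predicts class 3, whereas per-class metrics show how well each class is actually recognized.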
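As a small sketch, per-class precision, recall, and F1 score can be computed from the confusion matrix returned by “confusionmat”. The label vectors below are made up for illustration; in practice you would use the true and predicted labels of your validation set.

```matlab
% Made-up true and predicted labels, for illustration only.
yTrue = [1 1 1 1 2 2 3 3]';
yPred = [1 1 1 2 2 1 3 1]';

% Confusion matrix: rows are true classes, columns are predicted classes.
C = confusionmat(yTrue, yPred, 'Order', [1 2 3]);

% Per-class metrics, one entry per class in the order 1, 2, 3.
precision = diag(C) ./ sum(C,1)';    % correct predictions / all predictions of the class
recall    = diag(C) ./ sum(C,2);     % correct predictions / all true members of the class
f1        = 2 * precision .* recall ./ (precision + recall);

% Overall accuracy, for comparison.
accuracy = sum(diag(C)) / sum(C(:));
```

If a class is never predicted, its precision is 0/0 and evaluates to NaN, which is itself a useful warning sign for the minority class.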
Hope this helps.

Release

R2021a
