[vc_row][vc_column][vc_custom_heading text=”Introduction”][vc_column_text]Multiclass classification is a classification problem where more than two classes are present. It is a fundamental machine learning task which aims to classify each instance into one of a predefined set of classes. For instance, classifying a set of images of animals as dogs, cats or rabbits. Each sample is assigned to only one label, i.e., an image can be classified as either dog or rabbit but not both at the same time.

When working on a classification problem, there are instances when one class label has lower number of observations than other class labels. So, this type of dataset is known as imbalanced dataset. The problem is common and can lead to biased classification by the model.

This article will talk about handling imbalanced dataset. It will also provide a step-by-step approach to perform multiclass classification using machine learning algorithms.

[/vc_column_text][vc_row_inner][vc_column_inner][vc_custom_heading text=”Dataset”][vc_column_text]The dataset used is a numeric dataset. It consists of 11k+ rows and 10 columns. The columns 1-9 are the feature vectors and column 10 is the target vector. There are total 20 classes that need to be predicted and the classes are labelled as digits from 0-19.

The head() method is used to return top n (5 by default) rows of a data frame.[/vc_column_text][vc_column_text css=”.vc_custom_1646041771505{background-color: #e8e8e8 !important;border-radius: 3px !important;}”]

df = pd.read_csv("./training.csv")
df.head()

[/vc_column_text][vc_single_image image=”4808″ img_size=”large”][vc_column_text]The dataset is highly imbalanced. Count plot of the target can be visualized as following[/vc_column_text][vc_column_text css=”.vc_custom_1646042088555{background-color: #e8e8e8 !important;border-radius: 3px !important;}”]

import seaborn as sns
sns.countplot(df['target'])

[/vc_column_text][vc_single_image image=”4800″ img_size=”large”][vc_custom_heading text=”Removal of outliers”][vc_custom_heading text=”Boxplot” font_container=”tag:h4|text_align:left”][vc_column_text]In this part, we are going to learn about outliers. Outliers in a dataset are the values which lie far away from majority of points. Boxplots can be plotted which show the median, minimum, maximum, first quartile and third quartile. Using boxplots, we can visually check for the outliers present in the dataset.[/vc_column_text][vc_single_image image=”4826″ img_size=”large”][vc_column_text]The first quartile(Q1) is the median of the lower half of the set. This means that about 25% of the values are less than Q1.

The third quartile(Q3) is the median of the upper half of the set. This means that about 75% of the values are less than Q3.[/vc_column_text][vc_custom_heading text=”IQR” font_container=”tag:h4|text_align:left”][vc_column_text]IQR (Inter-Quartile Range) is the difference between Q3 and Q1. An interval is calculated using the given equation:

[ Q1 – 1.5 x IQR, Q3 + 1.5 x IQR ]

Therefore, if an observation lies outside this interval, it is considered as an outlier.

[/vc_column_text][vc_column_text css=”.vc_custom_1646043034333{background-color: #e8e8e8 !important;border-radius: 3px !important;}”]

#finding outliers for feature3
#finding the 1st quartile
q1 = np.quantile(df.feature3, 0.25)

# finding the 3rd quartile
q3 = np.quantile(df.feature3, 0.75)

med = np.median(df.feature3)

# finding the iqr region
iqr = q3-q1

# finding upper and lower whiskers
upper_bound = q3+(1.5*iqr)
lower_bound = q1-(1.5*iqr)
outliers = df.feature3[(df.feature3 <= lower_bound) | (df.feature3 >= upper_bound)]

[/vc_column_text][vc_custom_heading text=”Feature Selection using Correlation Heatmap”][vc_column_text]Correlation is a statistical score which tells how close two variables are to having a linear relationship with each other. Higher correlation between two variables will have very similar effect on dependent variables. Therefore, we prefer dropping one of the two features.

We can generate a correlation matrix using:

[/vc_column_text][vc_column_text css=”.vc_custom_1646043118287{background-color: #e8e8e8 !important;border-radius: 3px !important;}”]

df.corr()

[/vc_column_text][vc_column_text]And correlation heatmap using:[/vc_column_text][vc_column_text css=”.vc_custom_1646043165239{background-color: #e8e8e8 !important;border-radius: 3px !important;}”]

sns.heatmap(corr)

[/vc_column_text][vc_column_text]The correlation heatmap generated is shown below[/vc_column_text][vc_single_image image=”4810″ img_size=”large”][vc_column_text]The feature1 and feature3 have high correlation with other features and dropping them gave us better results.[/vc_column_text][vc_column_text css=”.vc_custom_1646043390068{background-color: #e8e8e8 !important;border-radius: 3px !important;}”]

X = df.drop('target',axis=1)
y=df['target']
X = X.drop(['feature1','feature3'],axis = 1)

[/vc_column_text][vc_custom_heading text=”Handling Imbalanced Dataset”][vc_custom_heading text=”SMOTE (Synthetic Minority Oversampling Technique)” font_container=”tag:h4|text_align:left”][vc_column_text]SMOTE helps in oversampling the examples in the minority class. The process involved in oversampling is given as-

Select random data from the minority class.
Calculate the Euclidean distance between the random data and its k nearest neighbors.
Multiply the difference with a random number between 0 and 1. Then, add the result to the minority class as a synthetic sample.
Repeat the procedure until the expected proportion of minority class is met.

[/vc_column_text][vc_custom_heading text=”SMOTE-Tomek Links” font_container=”tag:h4|text_align:left”][vc_column_text]This is a modified version of SMOTE. It combines the ability of both SMOTE and Tomek Links. SMOTE is capable of generating synthetic data for minority class. And, Tomek Links is able to remove the data that are identified as Tomek links from the majority class.[/vc_column_text][vc_column_text]

from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks

# Define SMOTE-Tomek Links
resample=SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))
X, y = resample.fit_resample(X, y)

[/vc_column_text][vc_column_text]We can now visualize count plot to have an equal number of samples for each class in the target.[/vc_column_text][vc_single_image image=”4811″ img_size=”large”][vc_custom_heading text=”Splitting Data into Train and Test Data”][vc_column_text]The dataset can be split into training set and validation set. We will use the training set for training our model. Validation set will be used to check whether our model can perform well on new, unseen data.[/vc_column_text][vc_column_text]

from sklearn.model_selection import train_test_split

train_x, val_x,train_y, val_y = train_test_split(X,y,test_size=0.2)

[/vc_column_text][vc_custom_heading text=”Multiclass Classification using Random Forest Classifier”][vc_column_text]Random forest consists of a large number of single decision trees that work as an ensemble. Each individual tree in the random forest outputs a class prediction. Each class gets some votes and the class with the most votes becomes the model’s prediction.[/vc_column_text][vc_single_image image=”4822″ img_size=”large” add_caption=”yes”][vc_column_text]

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(train_x, train_y)

rfc_predict = rfc.predict(val_x)
print('Accuracy score:',accuracy_score(val_y, rfc_predict))

[/vc_column_text][vc_column_text]Random Forest Classifier worked very well on the test dataset and gave an accuracy score of 88.62%. However, only knowing the accuracy score is not enough. In the next and final part, we will look at various evaluation metrics and calculate them using python in-built functions.[/vc_column_text][vc_custom_heading text=”Evaluation Metrics”][vc_column_text]

Accuracy: the proportion of the total number of predictions that were correct.
Precision: the proportion of positive cases that were correctly identified.
Sensitivity or Recall : the proportion of actual positive cases which are correctly identified.
F1 Score: The F1 score can be interpreted as a harmonic mean of the precision and recall

F1 Score = 2 * (precision * recall) / (precision + recall)

[/vc_column_text][vc_column_text]

from sklearn.metrics import classification_report

print(classification_report(val_y,rfc_predict))

[/vc_column_text][vc_single_image image=”4813″ img_size=”large”][/vc_column_inner][/vc_row_inner][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row]

Sandeep Kumar

Sandeep Kumar is the Founder & CEO of Aitude, a leading AI tools, research, and tutorial platform dedicated to empowering learners, researchers, and innovators. Under his leadership, Aitude has become a go-to resource for those seeking the latest in artificial intelligence, machine learning, computer vision, and development strategies.