Imbalanced datasets are a common occurrence in machine learning. It would be great if every problem we tried to solve came with a perfectly balanced dataset, but that is rarely the case.
If you come across an imbalanced dataset and handle it as if it were a normal one, your models will end up biased towards the majority class.
Today, we will explore some of the best techniques to address this issue by utilizing a dataset from Kaggle, which can be accessed here. The dataset represents a real-world scenario where legitimate credit card transactions far outnumber fraudulent ones. This class imbalance poses a unique problem:
Problem Statement: How can we develop a robust credit card fraud detection system in the presence of a highly imbalanced dataset, where legitimate transactions significantly outweigh fraudulent ones?
In this tutorial, we will delve into strategies and methods to tackle this challenge. By the end, you’ll gain insights into effectively handling imbalanced datasets and building a more reliable credit card fraud detection system. To get started, download the dataset from Kaggle and follow along with the provided code and instructions.
Imports and first steps
We will begin by importing all the required functions and models. To address the issue of handling imbalanced datasets, we will make extensive use of the imblearn library.
#pip install -U imbalanced-learn ###how to install imblearn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.model_selection import cross_validate,train_test_split
from sklearn.metrics import RocCurveDisplay
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.under_sampling import NearMiss
Now let’s read the csv file and see how many instances we have.
df = pd.read_csv('creditcard.csv')
len(df)
Now, let’s specify our target column and our feature columns. We won’t delve into an extensive analysis of the meaning of each column, as our primary objective is to address the imbalance in the dataset. As outlined on Kaggle, our dataset comprises a total of 284,807 rows, with each row representing a transaction. Among these transactions, only 492 are classified as fraudulent. If we were to approach this dataset as a typical one, we could easily end up with a model that boasts a 99% accuracy rate simply by consistently predicting that a transaction is not fraudulent. We will see this in action in a bit.
y = df["Class"]
x = df.drop(["Class"],axis=1)
unique_classes, class_counts = np.unique(y, return_counts=True)
unique_classes
class_counts
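To put the imbalance into perspective, we can turn these counts into class proportions (a quick sanity check, nothing more):
# Roughly 99.8% of transactions are legitimate and about 0.2% are fraudulent
print(dict(zip(unique_classes, class_counts)))
print(class_counts / class_counts.sum())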
In the cells below we create two lists with the names of the classifiers and two lists with the classifiers themselves. The difference between clf1 and clf2 is that the classifiers in the second list have the class_weight parameter set to "balanced" to help counteract the dataset imbalance.
names = ["RandomForestClassifier","LinearSVC","GaussianNB"]
names_weight_aware = ["RandomForestClassifier","LinearSVC"]
clf1 = [RandomForestClassifier(),LinearSVC(),GaussianNB()]
clf2 = [RandomForestClassifier(class_weight="balanced"),LinearSVC(class_weight="balanced")]
Before attempting to train our models on the dataset, let’s consider how we can assess their performance more effectively. Would it be wise to rely solely on accuracy? As explained previously, doing so could result in artificially high accuracy, even if the classifier consistently predicts class 0. Therefore, we will also employ the “balanced_accuracy” metric. Balanced accuracy takes into account the number of instances in each class, in contrast to the standard “accuracy” metric.
scoring = ['accuracy', 'balanced_accuracy']
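As a quick illustration of why plain accuracy is misleading here, we can cross-validate a trivial baseline that always predicts the majority class. This is a minimal sketch using scikit-learn's DummyClassifier (the only import not already in our list); the accuracy should land near 99.8% while the balanced accuracy stays at 0.5.
from sklearn.dummy import DummyClassifier

# A baseline that always predicts the majority class (non-fraud)
dummy = DummyClassifier(strategy="most_frequent")
dummy_scores = cross_validate(dummy, x, y, scoring=scoring, cv=10, n_jobs=-1)
for score in scoring:
    print(f"{score} : {dummy_scores['test_' + score].mean()}")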
Let’s now define our train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
Default models’ performance
Let’s proceed to train and evaluate the standard classifiers we defined earlier. To streamline the process, we’ll zip the list of classifier names with the list of default models. To get a more reliable picture of each model’s performance, we will use cross-validation; by reporting the mean and standard deviation of each metric we can better understand how the models behave. Note that with cv=10 each model is trained 10 times, each time holding out a different tenth of the dataset as the test set. After cross-validation we also fit each model on the training split, plot its ROC and precision-recall curves, and store its test-set predictions in a dictionary for later comparison.
store_predictions = {}
for name, clf in zip(names, clf1):
    print(f"{name} - default without specialized class weights parameter")
    # 10-fold cross-validation on the full dataset
    scores = cross_validate(clf, x, y, scoring=scoring, cv=10, return_train_score=False, verbose=1, n_jobs=-1)
    for score in scoring:
        print(f"{score} : {scores['test_' + score].mean()} , {scores['test_' + score].std()}")
    # Fit on the training split and keep the test-set predictions for later comparison
    clf.fit(x_train, y_train)
    key = name + "_default"
    store_predictions[key] = clf.predict(x_test)
    RocCurveDisplay.from_estimator(clf, x_test, y_test)
    PrecisionRecallDisplay.from_estimator(clf, x_test, y_test)
LINEAR SVC RESULTS & GAUSSIAN NB RESULTS
The first results shown below are the LinearSVC results. Due to a printed warning above, I had to crop the image. Sorry for that.
ROC & precision-recall curves for our models (best and worst performing model)
Results discussion
Let’s discuss the results. Comparing the classifiers on their respective metrics, Random Forest and GaussianNB clearly performed best. However, their near-100% plain accuracy (as opposed to balanced accuracy) is misleading, which is precisely why we also track balanced accuracy.
On the other hand, LinearSVC is the worst-performing model with regard to the balanced metrics, which are the most important ones to consider here. Being a linear model, it struggles to create effective boundaries to separate the classes. Its balanced accuracy is close to 50%, indicating that its predictions for the minority class are almost random.
As for the ROC curves, the Random Forest curve looks very good, hugging the top-left corner. Still, we will see how to turn this already good model into a great one. In contrast, LinearSVC’s ROC curve is nearly a diagonal line, which is the worst-case scenario for a classifier.
Looking at Average Precision (AP), Random Forest achieves a score of 0.87, suggesting that it performs well even without the class_weight parameter. LinearSVC has a low AP of 0.22, indicating that it is not a good fit for this dataset. The question is whether this seemingly weak model is truly a poor choice. We’ll soon demonstrate how we can significantly improve the performance of these weaker models, to levels that may surprise you.
Models’ performance with weighted classes
Now, let’s apply the same analysis to models configured to compensate for the class imbalance. Specifically, Random Forest and LinearSVC can account for the imbalance by weighting the classes. This adjustment was made in the lists we defined earlier: in the second list, we set the class_weight parameter of these models to "balanced".
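For intuition, class_weight="balanced" simply reweights each class inversely to its frequency, i.e. n_samples / (n_classes * class_count). Here is a small sketch of what scikit-learn computes under the hood; the compute_class_weight helper is used purely for illustration and plays no role in the rest of the tutorial.
from sklearn.utils.class_weight import compute_class_weight

# With roughly 284,315 legitimate vs 492 fraudulent transactions, the fraud
# class gets a weight of about 289 while the legitimate class gets about 0.5
weights = compute_class_weight(class_weight="balanced", classes=unique_classes, y=y)
print(dict(zip(unique_classes, weights)))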
for name, clf in zip(names_weight_aware, clf2):
    print(f"{name} - imbalance aware")
    scores = cross_validate(clf, x, y, scoring=scoring, cv=10, return_train_score=False, verbose=1, n_jobs=-1)
    for score in scoring:
        print(f"{score} : {scores['test_' + score].mean()} , {scores['test_' + score].std()}")
    clf.fit(x_train, y_train)
    key = name + "_imbalance_aware"
    store_predictions[key] = clf.predict(x_test)
    RocCurveDisplay.from_estimator(clf, x_test, y_test)
    PrecisionRecallDisplay.from_estimator(clf, x_test, y_test)
Both models increased their balanced accuracy by only 2-3%. This shows that class weighting alone is not enough to mitigate the negative effects of such a heavily imbalanced dataset.
Let’s continue with some specialized techniques that are ideal for exactly this scenario.
Models’ performance when SMOTE is applied
Next, let’s delve into another technique commonly used to address class imbalance problems, namely SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is a method that leverages the existing instances of the minority class to generate synthetic instances, effectively reducing the gap in numbers. Let’s break down the steps of how the SMOTE algorithm works:
- SMOTE begins by identifying the minority class in the dataset, which has fewer examples than the majority class.
- For each instance in the minority class, SMOTE selects its k-nearest neighbors in the feature space. The value of ‘k’ can be specified by us within the code. You can visualize this by imagining all instances in a 2-D space, where proximity between instances is based on the similarity of their feature values. If you want a detailed understanding of how SMOTE works, I recommend watching a video tutorial, as visualizing the procedure can be very helpful. If you’d like more theory-based tutorials in the future, please let me know. However, our current focus is on writing the actual code, so we won’t delve extensively into the theory of SMOTE here.
- SMOTE then creates synthetic instances for each selected minority class example by interpolating between the feature vectors of the chosen instance and one or more of its neighbours. By combining the instance’s feature vector with a neighbour’s, SMOTE creates a new, third instance that resembles the two real ones (a minimal sketch of this interpolation follows this list).
- The synthetic instances are added to the minority class, increasing its size and balancing the class distribution.
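To make the interpolation step concrete, here is a minimal NumPy sketch of how a single synthetic sample is generated from one minority instance and one of its neighbours. This illustrates the idea only; it is not imblearn's internal implementation.
# Minimal illustration of SMOTE's interpolation step
rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])         # a minority class instance
x_neighbor = np.array([3.0, 1.0])  # one of its k nearest minority neighbours

lam = rng.uniform(0, 1)            # random interpolation factor in [0, 1]
x_synthetic = x_i + lam * (x_neighbor - x_i)
print(x_synthetic)                 # lies on the segment between the two real points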
To assess the effect of SMOTE on a model’s performance, we will combine it with our best and worst models, namely Random Forest and LinearSVC, and see how it affects their results.
To incorporate SMOTE into the process, we will create a pipeline where SMOTE is the first step and the model is the second. We will then call the fit method on the pipeline itself. This means that our training data is first resampled by SMOTE and then passed to the next step of the pipeline, our main classifier. To evaluate the models’ performance, we will use the classification_report_imbalanced function from the imblearn library, which provides an alternative way of assessing models on imbalanced datasets. We will then directly compare the results with those from our previous experiments (which we already stored in our dictionary named store_predictions).
print("Random Forest")
# default model's results we stored earlier in our dictionary
key = names[0] + "_default"
print(classification_report_imbalanced(y_test, store_predictions[key]))
print("Same model but with class weighting")
key = names[0] + "_imbalance_aware"
print(classification_report_imbalanced(y_test, store_predictions[key]))
print("Same model but with SMOTE implemented")
# Create a pipeline
pipeline = make_pipeline(SMOTE(random_state=3, k_neighbors=6),
clf1[0])
pipeline.fit(x_train, y_train)
# Classify and report the results
print(classification_report_imbalanced(y_test, pipeline.predict(x_test)))
Results discussion
In the case of Random Forest, we observe that SMOTE gives slightly worse precision for the minority class (0.86) compared to the plain Random Forest and the one with the class_weight parameter (0.99 and 0.99 respectively). This could lead us to believe that our model has become worse. But let’s think for a moment: is precision the right metric to focus on for the minority class? The short answer is no. But why?
Remember that precision equals TP / (TP + FP), whereas recall equals TP / (TP + FN).
Here TP = true positives, FP = false positives, FN = false negatives. (If you are not familiar with these metrics, I would highly recommend reading an article about them, since knowing when to use each one is crucial.)
In other words, precision answers the question: “Of all the transactions the model predicted as fraud, how many were actually fraudulent?”
Recall answers the question: “Of all the actual fraudulent transactions, how many did our model correctly identify?”
Simply stated, recall expresses our model’s performance on the minority class by taking into account all of its instances, whereas precision does not.
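A tiny worked example with made-up numbers: suppose there are 100 actual fraud transactions, the model catches 80 of them (TP = 80, FN = 20), and it also wrongly flags 120 legitimate transactions (FP = 120). The labels below are purely hypothetical and only reproduce those counts.
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels giving TP=80, FN=20, FP=120 for the fraud class (1)
y_true_demo = [1] * 100 + [0] * 120
y_pred_demo = [1] * 80 + [0] * 20 + [1] * 120

print(precision_score(y_true_demo, y_pred_demo))  # 80 / (80 + 120) = 0.4
print(recall_score(y_true_demo, y_pred_demo))     # 80 / (80 + 20) = 0.8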
When we examine the recall achieved by using SMOTE, it becomes evident that our model’s performance has improved, especially concerning the minority class, which is the most critical in our case. It is preferable to correctly identify more fraudulent transactions, even if it means making some mistakes along the way. The mistakes we make are the false positives, which are instances that are not fraudulent but are identified as such. These false positives are in the denominator of the precision calculation, which is why precision tends to decrease as recall increases.
But Random Forest was already a good enough model. Let’s see how a weak model like LinearSVC benefits from SMOTE. The results will surprise you.
print("LINEAR SVC")
key = names[1] + "_default"
print(classification_report_imbalanced(y_test, store_predictions[key]))
print("Same model but with class weighting")
key = names[1] + "_imbalance_aware"
print(classification_report_imbalanced(y_test, store_predictions[key]))
print("Same model but with SMOTE implemented")
# Create a pipeline
pipeline = make_pipeline(SMOTE(random_state=3, k_neighbors=6),
clf1[1])
pipeline.fit(x_train, y_train)
# Classify and report the results
print(classification_report_imbalanced(y_test, pipeline.predict(x_test)))
The results with SMOTE are remarkably better than without it. The recall for the minority class has skyrocketed: 76% of the fraudulent instances are now correctly classified. Precision for class 1 is still low, meaning that many non-fraud instances are flagged as fraud, which is acceptable in our case since the fraud class is the one we care about most.
Next, we will try another technique, or rather a family of techniques.
Models’ performance with Near-miss
Last but not least, let’s explore the NearMiss technique. In contrast to SMOTE, NearMiss is a family of techniques that addresses class imbalance by undersampling the majority class, removing some of its instances to balance the dataset. There are three versions of NearMiss:
- NearMiss-1: keeps the majority class instances whose average distance to their k nearest minority class instances is smallest, i.e. the majority samples that lie closest to the minority class.
- NearMiss-2: keeps the majority class instances whose average distance to the k farthest minority class instances is smallest.
- NearMiss-3: works in two steps: for each minority instance it first keeps a set of its nearest majority neighbours, and then selects from those the samples that are farthest on average from the minority class, so that the retained majority samples surround the decision boundary.
For a deeper understanding of these techniques, it is highly recommended to watch relevant videos or lectures. Feature space techniques can be more easily grasped when combined with visual explanations.
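Before plugging NearMiss into a pipeline, it can help to see what the undersampling actually does to the class counts. The following is a standalone sketch using the training split defined earlier; the exact counts depend on your random split.
from collections import Counter

# NearMiss-1 keeps only as many majority samples as there are minority samples
nm = NearMiss(version=1)
x_res, y_res = nm.fit_resample(x_train, y_train)
print(Counter(y_train))  # heavily imbalanced: roughly 227,000 vs a few hundred
print(Counter(y_res))    # balanced: both classes reduced to the minority count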
To assess the impact of the Near-Miss techniques on model performance, we will use the Gaussian Naive Bayes (Gaussian NB) classifier in combination with each of the three Near-Miss approaches: Near Miss-1, Near Miss-2, and Near Miss-3. We will then analyze their results, as presented in the classification report.
from imblearn.under_sampling import NearMiss
print("Gaussian NB")
key = names[2] + "_default"
# Classify and report the results
print(classification_report_imbalanced(y_test, store_predictions[key]))
# Create a pipeline
pipeline = make_pipeline(NearMiss(version=1),
clf1[2])
pipeline.fit(x_train, y_train)
# Classify and report the results
print("Gaussian NB with near miss 1")
print(classification_report_imbalanced(y_test, pipeline.predict(x_test)))
# Create a pipeline
pipeline = make_pipeline(NearMiss(version=2),
clf1[2])
pipeline.fit(x_train, y_train)
# Classify and report the results
print("Gaussian NB with near miss 2")
print(classification_report_imbalanced(y_test, pipeline.predict(x_test)))
# Create a pipeline
print("Gaussian NB with near miss 3")
pipeline = make_pipeline(NearMiss(version=3, n_neighbors_ver3=3),
clf1[2])
pipeline.fit(x_train, y_train)
# Classify and report the results
print(classification_report_imbalanced(y_test, pipeline.predict(x_test)))
Before delving into the results, let’s clarify what the “IBA” metric in the second-to-last column means. “IBA” stands for “Index of Balanced Accuracy”, a performance metric designed specifically for imbalanced classification. Instead of plain accuracy, it combines the recall of each class (through their geometric mean) with a weighting term, the so-called dominance, that captures how the two per-class recalls compare to one another.
In practice this means a model cannot obtain a high IBA by doing well on the majority class alone: if the minority class is handled poorly, the score drops. That is exactly the behaviour we want from a metric in class imbalance scenarios like ours, where the minority class is the one that matters most.
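Under the hood, imblearn exposes this computation through make_index_balanced_accuracy, which wraps a base metric (here the geometric mean of the per-class recalls). The sketch below is only meant to show how IBA is derived; it returns a single overall value rather than the per-class column of the report, and it reuses the stored Random Forest predictions from earlier.
from imblearn.metrics import geometric_mean_score, make_index_balanced_accuracy

# Weight the geometric mean by the dominance (sensitivity - specificity)
iba_score = make_index_balanced_accuracy(alpha=0.1, squared=True)(geometric_mean_score)

key = names[0] + "_default"
print(iba_score(y_test, store_predictions[key]))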
Upon reviewing the results, it becomes apparent that our model combined with NearMiss-3 performs exceptionally well, boasting the best IBA metrics and a slightly improved recall for the minority class when compared to the default model, which was already performing adequately.
In contrast, NearMiss-2 stands out as by far the worst model for this use case, primarily due to its extremely low recall of 0.01 for class 0 (non-fraud). This indicates that only 1% of class 0 instances were correctly identified, while most were wrongly classified as class 1 (fraud). It’s important to remember that while we prioritize the minority class, a model that flags every transaction as fraudulent would not serve our objective.
NearMiss-1 achieves a remarkable recall of 0.8 for class 1, indicating that it effectively identifies the majority of fraudulent transactions. This is particularly impressive because, in the context of fraud detection, discovering fraudulent transactions is the most critical objective. However, it comes at the cost of lower precision for the fraud class, as it is willing to trade off precision to enhance recall for fraud detection.
The very low precision for the minority class (near 0) is misleading because it’s mainly due to the overwhelmingly larger number of legitimate transactions. Even an occasional false identification of a legitimate transaction as fraudulent can significantly affect precision, making it appear close to 0. This is primarily a result of class imbalance.
It’s essential to consider that the significantly lower precision for the minority class means that many legitimate transactions may be falsely flagged as fraud, potentially leading to frustration for customers. Nevertheless, the trade-off is justified by the potential savings from preventing a larger number of actual fraudulent transactions.
Conclusions
In this tutorial, we’ve explored techniques for dealing with class imbalance in the context of fraud detection. However, it’s crucial to recognize that there are instances of class imbalance datasets where the stakes are even higher, such as in cancer classification problems.
We’ve delved into well-known techniques, shedding light on their strengths and limitations. It’s essential to understand that addressing class imbalance often involves a trade-off, which can lead to a substantial number of majority class instances (non-fraudulent transactions in our case) being misclassified as instances of the minority class, primarily due to the overwhelming class imbalance. This trade-off is inherent in imbalanced datasets and should be carefully considered.
The importance of the minority class and the level of tolerance for false positive classifications are the key factors that lead us to the best model for a specific use case. For example, in scenarios like cancer classification, our foremost objective would be to achieve the highest possible recall for the minority class, because we would want to correctly identify the vast majority of minority class instances. In that context, the cost of false positives takes a backseat, as the potential cost of false negatives, in the form of missed cancer diagnoses, is immeasurable; it is a matter of life and death. Each problem presents its unique considerations, underscoring the necessity of approaching model development meticulously and with careful coding practices.
See you in the next tutorial, Codelanders!