Consider
a problem where you are working on a machine learning classification problem.
You get an accuracy of 98% and you are very happy. But that happiness doesn’t last
long when you look at the confusion matrix and realize that majority class is 98%
of the total data and all examples are classified as majority class. Welcome to
the real world of imbalanced datasets!!
Some
of the well-known examples of imbalanced datasets are :
1 - Fraud detection: where number of fraud cases could be much
smaller than non-fraudulent transactions.
2- Prediction of disputed / delayed invoices:
where the problem is to predict default / disputed invoices.
3- Predictive maintenance datasets, etc
In
all the above examples, the cost of misclassifying minority class could very
high. That means, if I am not able to identify the fraud cases correctly, the
model won’t be useful.
There
are multiple ways of handling unbalanced datasets. Some of them are :
collecting more data, trying out different ML algorithms, modifying class
weights, penalizing the models, using anomaly detection techniques,
oversampling and under sampling techniques etc.
I am focusing mainly on SMOTE
based oversampling techniques in this article. As a data scientist you might
not have direct control over collection of more data which might need various
approvals from client, top management and could also take more time etc. Also,
applying class weights or too much parameter tuning can lead to overfitting. Undersampling
technique can lead to loss of important information. Even with oversampling. But
that might not be the case with oversampling techniques. Oversampling methods
can be easily tried and embedded in your framework.
Over
Sampling Algorithms based on SMOTE
1- SMOTE:
Synthetic Minority Over sampling Technique (SMOTE) algorithm applies KNN
approach where it selects K nearest neighbors, joins them and creates the
synthetic samples in the space. The algorithm takes the feature vectors and its
nearest neighbors, computes the distance between these vectors. The difference
is multiplied by random number between (0, 1) and it is added back to feature.
SMOTE algorithm is a pioneer algorithm and many other algorithms are derived
from SMOTE.
Reference: Smote Paper
R Implementation: smotefamily, unbalanced, DMwR
Python Implementation: imblearn
2- ADASYN:
ADAptive SYNthetic (ADASYN) is based on
the idea of adaptively generating minority data samples according to their
distributions using K nearest neighbor. The algorithm adaptively updates the
distribution and there are no assumptions made for the underlying distribution
of the data. The algorithm uses
Euclidean distance for KNN Algorithm. The
key difference between ADASYN and SMOTE is that the former uses a density
distribution, as a criterion to
automatically decide the number of synthetic samples that must be generated for
each minority sample by adaptively changing the weights of the different
minority samples to compensate for the skewed distributions. The latter
generates the same number of synthetic samples for each original minority
sample.
Paper reference:ADASYN Paper
R Implementation: smotefamily
Python Implementation: imblearn
3- ANS: Adaptive
Neighbor Synthetic (ANS) dynamically adapts the number of neighbors needed for
oversampling around different minority regions. This algorithm eliminates the
parameter K of SMOTE for a dataset and assign different number of neighbors for
each positive instance. Every parameter for this technique is automatically set
within the algorithm making it become parameter free.
Reference: ANS Paper
R Implementation: smotefamily
4-
Border SMOTE:
Borderline-SMOTE generates the synthetic sample along the borderline of minority
and majority classes. This also helps in separating out the minority and
majority classes.
Reference - Border SMOTE
R Implementation: smotefamily
5- Safe
Level SMOTE: Safe level is defined as the number of a positive instances in
k nearest neighbors. If the safe level of an instance is close to 0, the
instance is nearly noise. If it is close to k, the instance is considered safe.
Each synthetic instance is generated in safe position by considering the safe
level ratio of instances. In contrast, SMOTE and Borderline-SMOTE may generate
synthetic instances in unsuitable locations, such as overlapping regions and
noise regions.
Reference:SLS Paper
R Implementation: smotefamily
6- DBSMOTE:
Density-Based Synthetic Minority Over-sampling Technique is based on clustering
algorithm DBSCAN. The clusters are discovered by DBSCAN Algorithm. DBSMOTE
generates synthetic instances along a shortest path from each positive instance
to a pseudo-centroid of a minority-class cluster
Reference: DBSMOTE
R Implementation: smotefamily
Combining SMOTE and Undersampling Algorithms :
1. SMOTETomek: Tomek links can be used as
an under-sampling method or as a data cleaning method. Tomek links to the
over-sampled training set as a data cleaning method. Thus, instead of removing
only the majority class examples that form Tomek links, examples from both classes are removed
Reference:SMOTE Tomek
Python Implementation: imblearn
2. 2- MOTEENN: Just like Tomek, Edited
Nearest Neighbor removes any example whose class label differs from the class
of at least two of its three nearest neighbors. The ENN method removes the
instances of the majority class whose prediction made by KNN method is
different from the majority class. ENN method can remove both the noisy
examples as borderline examples, providing a smoother decision surface. ENN
tends to remove more examples than the Tomek links does, so it is expected that
it will provide a more in depth data cleaning
b.
Python Implementation: imblearn



No comments:
Post a Comment