A Comprehensive Evaluation of Oversampling Techniques for Enhancing Text Classification Performance

Date

2025

Publisher

Nature Portfolio

Open Access Color

HYBRID

Green Open Access

Yes

Publicly Funded

No

Impulse : Top 10%

Influence : Average

Popularity : Top 10%

Abstract

Class imbalance is a common and critical challenge in text classification tasks, where the underrepresentation of certain classes often impairs the ability of classifiers to learn minority class patterns effectively. According to the "garbage in, garbage out" principle, even high-performing models may fail when trained on skewed distributions. To address this issue, this study investigates the impact of oversampling techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE) and thirty of its variants, on two benchmark text classification datasets: TREC and Emotions. Each dataset was vectorized using the MiniLMv2 transformer model to obtain semantically rich representations, and classification was performed using six machine learning algorithms. The balanced and imbalanced scenarios were compared in terms of F1-Score and Balanced Accuracy. This work constitutes, to the best of our knowledge, the first large-scale, systematic benchmarking of SMOTE-based oversampling methods in the context of transformer-embedded text classification. Furthermore, statistical significance of the observed performance differences was validated using the Friedman test. The results provide practical insights into the selection of oversampling techniques tailored to dataset characteristics and classifier sensitivity, supporting more robust and fair learning in imbalanced natural language processing tasks.
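
The core idea behind SMOTE, and the Balanced Accuracy metric named in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote_oversample` and `balanced_accuracy`, the parameters `k` and `seed`, and the toy 2-D vectors (standing in for MiniLMv2 embeddings) are all assumptions made for illustration, and the thirty SMOTE variants evaluated in the study are not covered.

```python
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Grow each minority class to the majority size by SMOTE-style interpolation.

    Hypothetical helper: X is a list of feature vectors, y a list of labels.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(members) < 2:
            continue  # majority class, or too few points to interpolate
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours by squared Euclidean distance
            others = [m for m in members if m is not base]
            others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
            neighbour = rng.choice(others[:k])
            gap = rng.random()  # position on the segment from base to neighbour
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: four majority points (label 0), two minority points (label 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [5.0, 5.0], [5.2, 5.1]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = smote_oversample(X, y, k=1)
print(Counter(y_bal))
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest same-class neighbours, which is why the method needs at least two minority samples per class. Balanced Accuracy averages recall over classes, so a classifier that always predicts the majority class scores 0.5 on a two-class problem regardless of how skewed the labels are.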

Description

Asuroglu, Tunc/0000-0003-4153-0764

Keywords

Imbalanced datasets, Text classification, Synthetic minority over-sampling technique (SMOTE)

WoS Q

Q1

Scopus Q

Q1

Source

Scientific Reports

Volume

15

Issue

1

PlumX Metrics
Citations

Scopus : 6

PubMed : 1

Captures

Mendeley Readers : 99

SCOPUS™ Citations

4

checked on Feb 03, 2026

Web of Science™ Citations

3

checked on Feb 03, 2026

OpenAlex FWCI
19.27898059
