A Comprehensive Evaluation of Oversampling Techniques for Enhancing Text Classification Performance

dc.contributor.author Taskiran, Salimkan Fatma
dc.contributor.author Turkoglu, Bahaeddin
dc.contributor.author Kaya, Ersin
dc.contributor.author Asuroglu, Tunc
dc.date.accessioned 2025-08-10T17:19:58Z
dc.date.available 2025-08-10T17:19:58Z
dc.date.issued 2025
dc.description Asuroglu, Tunc/0000-0003-4153-0764 en_US
dc.description.abstract Class imbalance is a common and critical challenge in text classification tasks, where the underrepresentation of certain classes often impairs the ability of classifiers to learn minority class patterns effectively. According to the "garbage in, garbage out" principle, even high-performing models may fail when trained on skewed distributions. To address this issue, this study investigates the impact of oversampling techniques, specifically the Synthetic Minority Over-sampling Technique (SMOTE) and thirty of its variants, on two benchmark text classification datasets: TREC and Emotions. Each dataset was vectorized using the MiniLMv2 transformer model to obtain semantically rich representations, and classification was performed using six machine learning algorithms. The balanced and imbalanced scenarios were compared in terms of F1-Score and Balanced Accuracy. This work constitutes, to the best of our knowledge, the first large-scale, systematic benchmarking of SMOTE-based oversampling methods in the context of transformer-embedded text classification. Furthermore, statistical significance of the observed performance differences was validated using the Friedman test. The results provide practical insights into the selection of oversampling techniques tailored to dataset characteristics and classifier sensitivity, supporting more robust and fair learning in imbalanced natural language processing tasks. en_US
dc.identifier.doi 10.1038/s41598-025-05791-7
dc.identifier.issn 2045-2322
dc.identifier.scopus 2-s2.0-105009545193
dc.identifier.uri https://doi.org/10.1038/s41598-025-05791-7
dc.identifier.uri https://hdl.handle.net/20.500.13091/10581
dc.language.iso en en_US
dc.publisher Nature Portfolio en_US
dc.relation.ispartof Scientific Reports
dc.rights info:eu-repo/semantics/closedAccess en_US
dc.subject Imbalanced Datasets en_US
dc.subject Text Classification en_US
dc.subject Synthetic Minority Over-Sampling Technique (SMOTE) en_US
dc.title A Comprehensive Evaluation of Oversampling Techniques for Enhancing Text Classification Performance en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.id Asuroglu, Tunc/0000-0003-4153-0764
gdc.author.scopusid 59971302100
gdc.author.scopusid 57218160917
gdc.author.scopusid 36348487700
gdc.author.scopusid 56780249800
gdc.author.wosid Turkoglu, Bahaeddin/Kma-4950-2024
gdc.author.wosid Asuroglu, Tunc/Itv-2441-2023
gdc.bip.impulseclass C4
gdc.bip.influenceclass C5
gdc.bip.popularityclass C4
gdc.coar.access metadata only access
gdc.coar.type text::journal::journal article
gdc.description.department Konya Technical University en_US
gdc.description.departmenttemp [Taskiran, Salimkan Fatma; Kaya, Ersin] Konya Tech Univ, Dept Comp Engn, TR-42250 Konya, Turkiye; [Turkoglu, Bahaeddin] Ankara Univ, Dept Artificial Intelligence & Data Engn, TR-06830 Ankara, Turkiye; [Asuroglu, Tunc] Tampere Univ, Fac Med & Hlth Technol, Tampere 33720, Finland; [Asuroglu, Tunc] VTT Tech Res Ctr Finland, Tampere 33101, Finland en_US
gdc.description.issue 1 en_US
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q1
gdc.description.volume 15 en_US
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q1
gdc.identifier.openalex W4411864770
gdc.identifier.pmid 40594759
gdc.identifier.wos WOS:001522986600014
gdc.index.type WoS
gdc.index.type Scopus
gdc.index.type PubMed
gdc.oaire.accesstype HYBRID
gdc.oaire.diamondjournal false
gdc.oaire.impulse 9.0
gdc.oaire.influence 3.0360314E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Imbalanced datasets
gdc.oaire.keywords Text classification
gdc.oaire.keywords Synthetic minority over-sampling technique (SMOTE)
gdc.oaire.keywords Article
gdc.oaire.popularity 9.585713E-9
gdc.oaire.publicfunded false
gdc.openalex.collaboration International
gdc.openalex.fwci 19.27898059
gdc.openalex.normalizedpercentile 0.99
gdc.openalex.toppercent TOP 1%
gdc.opencitations.count 0
gdc.plumx.mendeley 99
gdc.plumx.pubmedcites 1
gdc.plumx.scopuscites 6
gdc.scopus.citedcount 4
gdc.virtual.author Kaya, Ersin
gdc.wos.citedcount 3
relation.isAuthorOfPublication 6b459b99-eed9-45fb-b42f-50fbb4ee7090
relation.isAuthorOfPublication.latestForDiscovery 6b459b99-eed9-45fb-b42f-50fbb4ee7090
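
Illustrative note on the abstract: the study oversamples minority classes with SMOTE and its variants, then compares classifiers by F1-Score and Balanced Accuracy. The following is a minimal sketch of those two ideas only, not the authors' implementation. It assumes tiny hand-made 2-D vectors in place of the MiniLMv2 embeddings, implements only the basic SMOTE interpolation step (none of the thirty variants), and uses hypothetical helper names (`smote_sample`, `balanced_accuracy`) chosen for illustration.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Basic SMOTE idea: build n_new synthetic points, each interpolated
    between a minority point and one of its k nearest minority neighbours.
    (Hypothetical helper for illustration; assumes len(minority) >= 2.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy stand-in for minority-class embeddings (assumption: 2-D, not MiniLMv2).
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority_vectors, k=2, n_new=4, seed=0)

# A classifier that always predicts the majority class scores 0.75 plain
# accuracy on this toy label set, but only 0.5 balanced accuracy.
score = balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0])
```

Because each synthetic point is a convex combination of two existing minority points, it stays inside the region already occupied by the minority class. The second helper shows why a class-averaged metric is reported alongside F1-Score: it penalizes a model that ignores the minority class.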