Comparison of Textual Data Augmentation Methods on SST-2 Dataset


Date

2024

Authors

Çataltaş, M.
Baykan, N.A.

Publisher

Springer Science and Business Media Deutschland GmbH

Open Access Color

Green

Green Open Access

No

Publicly Funded

No

Impulse

Average

Influence

Average

Popularity

Average

Abstract

Since the advent of advanced deep learning models, increasingly successful techniques have been proposed, significantly enhancing performance on nearly all natural language processing tasks. While these models achieve state-of-the-art results, they require large datasets to do so. However, collecting data in large quantities is challenging and is not feasible for every task. Data augmentation can therefore help satisfy the need for large datasets by generating synthetic samples from the original data. This study aims to guide future work in the field by comparing the performance obtained from using a large dataset as a whole against that obtained from augmenting smaller subsets at different rates. To this end, it compares three textual data augmentation techniques, examining their efficacy according to the augmentation mechanism. In empirical evaluations on the Stanford Sentiment Treebank (SST-2) dataset, the sampling-based method LAMBADA showed superior performance in low-data regimes and, moreover, outperformed the other methods as the augmentation ratio increased, offering significant improvements in model robustness and accuracy. These findings offer researchers insights into augmentation strategies, thereby enhancing generalization in future work. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
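The abstract does not detail the implementations of the compared techniques. Purely as a general illustration of what rule-based textual augmentation looks like (a random-swap baseline in the spirit of EDA, not one of the specific methods evaluated in the paper), a minimal sketch might be:

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Return a copy of `tokens` with `n_swaps` random position swaps.

    A classic rule-based augmentation baseline; NOT one of the specific
    methods compared in the paper, just an illustration of generating
    label-preserving synthetic variants from an original sample.
    """
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

def augment(sentence, ratio, rng=None):
    """Generate `ratio` synthetic variants of one labeled sentence."""
    rng = rng or random.Random(0)
    tokens = sentence.split()
    return [" ".join(random_swap(tokens, 1, rng)) for _ in range(ratio)]
```

The augmentation ratio here corresponds to the number of synthetic samples produced per original sample, mirroring the "augmentation ratio" varied in the study. Sampling-based methods such as LAMBADA instead generate entirely new sentences with a fine-tuned language model, which is why they behave differently as the ratio grows.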

Description

2nd International Congress of Electrical and Computer Engineering (ICECENG 2023), 22-25 November 2023

Keywords

Data augmentation, Natural language processing, Text generation, Deep learning, Learning algorithms, Learning systems, Natural language processing systems, Data sample, Language processing, Large datasets, Learning models, Natural languages, Performance, Text generations, Textual data


WoS Q

N/A

Scopus Q

Q3

OpenCitations Citation Count

N/A

Source

EAI/Springer Innovations in Communication and Computing

Start Page

189

End Page

201

PlumX Metrics

Citations

Scopus: 0

Captures

Mendeley Readers: 2

OpenAlex FWCI

0.0

Sustainable Development Goals

3: GOOD HEALTH AND WELL-BEING

4: QUALITY EDUCATION

6: CLEAN WATER AND SANITATION

9: INDUSTRY, INNOVATION AND INFRASTRUCTURE