Authors: Cataltas, Mustafa; Cicekli, Ilyas; Akhan Baykan, Nurdan
Date: 2025-10-10
ISSN: 2169-3536
DOI: https://doi.org/10.1109/ACCESS.2025.3610157
Handle: https://hdl.handle.net/20.500.13091/10845

Abstract: Deep learning models have greatly improved various natural language processing tasks. However, their effectiveness depends on large datasets, which can be difficult to acquire. To mitigate this challenge, data augmentation techniques artificially expand the training data by generating synthetic samples. By enriching the dataset, data augmentation enhances model generalization, reduces overfitting, and improves model performance. This paper investigates the effectiveness of autoencoders for text data augmentation to enhance the performance of text classification models. The research compares four types of autoencoders: the traditional Autoencoder (AE), the Adversarial Autoencoder (AAE), the Denoising Adversarial Autoencoder (DAAE), and the Variational Autoencoder (VAE). Basic text preprocessing techniques (lowercasing, removal of non-alphanumeric characters, and removal of stop words) are applied to all documents. Additionally, label-based filtering is applied: autoencoder outputs that contradict the predictions of BERT are discarded. The experiments are conducted on the SST-2 sentiment classification dataset, which consists of 7,791 training instances and 1,821 test instances. To better analyze the impact of the data augmentation methods, experiments are also performed on smaller subsets of 100, 200, 400, and 1,000 instances, with augmentation applied at ratios of 1:1, 1:2, and 1:4. The results demonstrate that AE-based data augmentation, particularly at a 1:1 ratio, achieves better accuracy than the baseline models. This underscores the potential of autoencoders for improving text classification outcomes in NLP tasks.

Language: en
Access: info:eu-repo/semantics/openAccess
Keywords: Autoencoders; Data Augmentation; Deep Learning; Text Classification
Title: Data Augmentation for Text Classification Using Autoencoders
Type: Article
Scopus ID: 2-s2.0-105016663127
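The preprocessing and filtering steps named in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the stop-word set is an illustrative subset (the record does not list the inventory used), and `predict` is a hypothetical stand-in for a fine-tuned BERT classifier that returns a label for a piece of text.

```python
import re

# Illustrative subset only; the paper does not specify its stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "and", "of"}

def preprocess(text: str) -> str:
    """Apply the abstract's three steps: lowercase, strip
    non-alphanumeric characters, remove stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit,
    # or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def label_filter(augmented, source_labels, predict):
    """Label-based filtering: keep only synthetic samples whose
    predicted label (e.g. from a BERT classifier, passed in as
    `predict`) agrees with the label of the source instance."""
    return [
        (text, label)
        for text, label in zip(augmented, source_labels)
        if predict(text) == label
    ]
```

In this sketch the classifier is injected as a plain callable, so the filtering logic can be exercised without loading a real BERT model; in practice `predict` would wrap a sentiment classifier fine-tuned on the (original) SST-2 training set.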