Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID-19

Erol, Gizemnur; Uzbaş, Betül; Yücelbaş, Cüneyt; Yücelbaş, Sule

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.13091/3135

Full metadata record

DC Field	Value	Language
dc.contributor.author	Erol, Gizemnur	-
dc.contributor.author	Uzbaş, Betül	-
dc.contributor.author	Yücelbaş, Cüneyt	-
dc.contributor.author	Yücelbaş, Sule	-
dc.date.accessioned	2022-11-28T16:54:40Z	-
dc.date.available	2022-11-28T16:54:40Z	-
dc.date.issued	2022	-
dc.identifier.issn	1532-0626	-
dc.identifier.issn	1532-0634	-
dc.identifier.uri	https://doi.org/10.1002/cpe.7393	-
dc.identifier.uri	https://hdl.handle.net/20.500.13091/3135	-
dc.description.abstract	Real-time polymerase chain reaction (RT-PCR), known as the swab test, is a laboratory diagnostic test that detects COVID-19 from respiratory samples. Because of the rapid worldwide spread of the coronavirus, RT-PCR testing alone could not deliver results fast enough. This created a need for complementary diagnostic methods, and machine learning studies in this area began. Medical data, however, are challenging to work with because they are often inconsistent, incomplete, difficult to scale, and very large. In addition, poor clinical decisions, irrelevant parameters, and limited medical data adversely affect the accuracy of such studies. Because datasets containing COVID-19 blood parameters are fewer in number than other medical datasets, this study aims to improve the existing ones. To obtain more consistent results in COVID-19 machine learning studies, the effect of data preprocessing techniques on the classification of COVID-19 data was investigated. First, categorical feature encoding and feature scaling were applied to a dataset with 15 features containing blood data of 279 patients, including gender and age. Missing values were then filled using both K-nearest neighbor (KNN) imputation and multivariate imputation by chained equations (MICE). The classes were balanced with the synthetic minority oversampling technique (SMOTE). The effect of these preprocessing techniques was analyzed on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN, support vector machine, logistic regression, artificial neural network, and decision tree. With SMOTE, the highest accuracies obtained with the bagging classifier were 83.42% and 83.74% for KNN and MICE imputation, respectively. Without SMOTE, the highest accuracy reached with the same classifier was 83.91%, for KNN imputation. In conclusion, selected data preprocessing techniques were compared, their effect on classification success was presented, and the experiments demonstrated the importance of choosing the right combination of preprocessing steps.	en_US
dc.language.iso	en	en_US
dc.publisher	Wiley	en_US
dc.relation.ispartof	Concurrency and Computation-Practice & Experience	en_US
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	COVID-19	en_US
dc.subject	KNN imputation	en_US
dc.subject	machine learning	en_US
dc.subject	multivariate imputation by chained equation	en_US
dc.subject	synthetic minority oversampling technique	en_US
dc.title	Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID-19	en_US
dc.type	Article	en_US
dc.identifier.doi	10.1002/cpe.7393	-
dc.identifier.scopus	2-s2.0-85140036970	en_US
dc.department	Faculties, Faculty of Engineering and Natural Sciences, Department of Software Engineering	en_US
dc.department	Faculties, Faculty of Engineering and Natural Sciences, Department of Computer Engineering	en_US
dc.identifier.wos	WOS:000869547800001	en_US
dc.institutionauthor	Erol, Gizemnur	-
dc.institutionauthor	Uzbaş, Betül	-
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.authorscopusid	57931549700	-
dc.authorscopusid	57201915831	-
dc.authorscopusid	55913650300	-
dc.authorscopusid	55913641100	-
dc.identifier.scopusquality	Q3	-
item.cerifentitytype	Publications	-
item.fulltext	With Fulltext	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.openairetype	Article	-
item.grantfulltext	embargo_20300101	-
item.languageiso639-1	en	-
crisitem.author.dept	02.03. Department of Computer Engineering	-
Appears in Collections:	Faculty of Engineering and Natural Sciences Collection; Scopus Indexed Publications Collections; WoS Indexed Publications Collections
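
The abstract names KNN imputation and SMOTE as two of the preprocessing steps. This record does not include the authors' code, so the following is only a minimal pure-Python sketch of the general idea behind both techniques, not the paper's implementation. The function names `knn_impute` and `smote`, the parameter defaults, and the toy values are all illustrative assumptions; MICE, feature scaling, and the classifiers are not shown.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Simplification (assumption): distance is the root mean squared difference
    over the features that are observed in both rows.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        # Rows with no shared observed feature are treated as infinitely far.
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]  # copy, so the input stays untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards minority neighbours.

    Expects at least two complete (no None) minority-class points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Toy data (invented): impute first, then oversample the minority class.
rows = [[1.0, 2.0], [1.2, None], [0.9, 2.2], [5.0, 9.0]]
complete = knn_impute(rows, k=2)
extra = smote(complete[:3], n_new=2, k=2)
```

Imputing before oversampling matters in a sketch like this because the interpolation step needs complete feature vectors. In practice, established library implementations would normally be preferred over hand-written versions.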

Files in This Item:

File	Description	Size	Format
Concurrency and Computation - 2022 - Erol - Analyzing the effect of data preprocessing techniques using machine learning.pdf	Until 2030-01-01	1.2 MB	Adobe PDF

Web of Science™ citations: 2 (checked on Oct 12, 2024)
Page view(s): 770 (checked on Oct 14, 2024)
Download(s): 10 (checked on Oct 14, 2024)