Tviter verileri üzerinde sınıflandırma algoritmaları kullanarak hisse senedi değerleri için yön tahmini

Türkalp, Mustafa Vehbi

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.13091/1435

Title:	Tviter verileri üzerinde sınıflandırma algoritmaları kullanarak hisse senedi değerleri için yön tahmini
Other Titles:	Direction estimation for stock values by using classification algorithms on twitter data
Authors:	Türkalp, Mustafa Vehbi
Advisors:	Koçer, Barış
Keywords:	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol; Computer Engineering and Computer Science and Control; Borsa; Stock exchange; Fiyat tahmini; Price forecasting; Sınıflandırma; Classification; Twitter
Publisher:	Konya Teknik Üniversitesi
Abstract:	The stock market has always been a favored investment vehicle because shares are easy to buy and sell, returns can be attractive, and market data are easy to access. To earn more on this platform, investors have long been interested in forecasting the future direction of stock prices, and many different techniques have been developed for this purpose. In this study, starting from the supply-demand relationship that we regard as the main criterion in stock pricing, we classified messages from Twitter, one of the most widely used social media platforms, in order to predict in advance whether demand for a stock would rise or fall. We analyzed tweets about the stocks of global companies traded on the American Dow Jones (DJIA) stock exchange, such as Apple, Facebook, General Electric, General Motors, The Coca-Cola Company, McDonald's, Microsoft, Netflix, Pfizer Corporation, and Tesla Motors. We performed the classification with the naive Bayes, random forest, support vector machine, decision tree, k-nearest neighbor, and artificial neural network algorithms and compared their performance.
Two months of data for these stocks, covering April 2019 to May 2019, were collected through the Twitter web interface. The files containing the stock values used to test the success of the direction estimation were obtained from www.eoddata.com. The tweets were labeled by 75 participants with differing levels of stock market knowledge, who read the tweets one by one and manually marked each as positive, negative, or neutral. Besides tweets expressing no sentiment, tweets that were unintelligible, consisted only of an advertisement, or contained only a link were all assigned to the neutral class. Because it contained a large number of junk tweets, the neutral class was ignored when calculating the success of the stock direction estimation, but not when measuring the classification success of the algorithms.
For classification, we first cleaned the tweets by removing punctuation marks, hyperlinks and web addresses, "tab" characters, hashtags, retweets, duplicate copies of identical tweets, and similar noise. We then classified the cleaned data set using machine learning techniques. Using the TF-IDF method, we computed the frequency weights of all words occurring in each tweet of each data set and converted each tweet into a numeric vector. We obtained performance results by applying the 6 classification algorithms to these vectors. To keep the evaluation homogeneous, we divided the data sets into 10 parts using cross-validation. Each part was used in turn for validation while the remaining parts were used for training, and once all parts had been used, the overall prediction success was obtained by averaging the results of the 10 parts.
On the classified data set, the random forest algorithm gave the best result with 77.37% and the support vector machine gave the worst with 61.41%. Among the stocks whose tweets we classified, the best classification success was 83.3% for GM and the worst was 62.15% for GE. For stock direction estimation, the most successful prediction was 96.5% for KO and the least successful was 66.7% for TSLA. Only tweets labeled positive or negative were taken into account in these direction results; neutral tweets were ignored.
A tweet labeled positive was counted as a successful prediction unless the stock moved in a negative direction the following day; likewise, a tweet labeled negative was counted as successful if the stock did not rise the following day. The findings indicate that Twitter data can be used to estimate the direction of stock prices and that quite successful results can be obtained.
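The success rule for direction estimation described at the end of the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `LabeledTweet`, `is_successful`, and `direction_success_rate` are hypothetical, and it assumes that the next-day move is available as a single signed number per tweet (for example a close-to-close price change), which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LabeledTweet:
    """Hypothetical record: a tweet's label plus the stock's next-day move."""
    label: str              # "positive", "negative", or "neutral"
    next_day_change: float  # assumed signed price change on the following day


def is_successful(tweet: LabeledTweet) -> Optional[bool]:
    """Apply the rule from the abstract; neutral tweets yield None."""
    if tweet.label == "positive":
        # Successful unless the stock moved down the following day.
        return tweet.next_day_change >= 0
    if tweet.label == "negative":
        # Successful if the stock did not rise the following day.
        return tweet.next_day_change <= 0
    return None  # neutral tweets are ignored in the direction results


def direction_success_rate(tweets: Iterable[LabeledTweet]) -> Optional[float]:
    """Share of successful predictions among positive and negative tweets."""
    scored = [r for r in map(is_successful, tweets) if r is not None]
    if not scored:
        return None  # nothing to score
    return sum(scored) / len(scored)


# Made-up sample values for illustration only, not data from the thesis.
sample = [
    LabeledTweet("positive", 1.2),   # stock rose: success
    LabeledTweet("positive", -0.5),  # stock fell: failure
    LabeledTweet("negative", 0.0),   # stock flat: success under the rule
    LabeledTweet("neutral", 3.0),    # ignored
    LabeledTweet("negative", 0.8),   # stock rose: failure
]
print(direction_success_rate(sample))
```

Note that under this rule a flat day counts as a success for both positive and negative tweets, which tends to raise the reported success rate compared with a strict up/down match.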
URI:	https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=vjszP7PzV0HebcjFEvDfwAOQCkWxnZwgwiDT6JXKyAmDejavxHgl_0AidWmUMG16 https://hdl.handle.net/20.500.13091/1435
Appears in Collections:	Tez Koleksiyonu

Files in This Item:

File	Size	Format
587271.pdf	2.72 MB	Adobe PDF


Page view(s):	434 (checked on Apr 22, 2024)
Download(s):	266 (checked on Apr 22, 2024)
