http://dx.doi.org/10.13088/jiis.2017.23.2.123

Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach  

Lee, Minsik (Department of Information and Industrial Engineering, Yonsei University)
Lee, Hong Joo (Department of Business Administration, Catholic University of Korea)
Publication Information
Journal of Intelligence and Information Systems / v.23, no.2, 2017, pp. 123-138
Abstract
Because the stock market is driven by traders' expectations, many studies have tried to predict stock price movements by analyzing various sources of text data. This research has examined not only the relationship between text data and stock price fluctuations but also stock trading based on news articles and social media responses. Like other text mining approaches, studies that predict stock price movements construct a term-document matrix and apply classification algorithms to it. Because documents contain many words, it is better to build the term-document matrix from the words that contribute most. Typically, words with too low a frequency or importance are removed, or words are selected by measuring how much each one contributes to classifying a document correctly. The conventional way to construct a term-document matrix has been to collect all the documents to be analyzed and to select and use the words that influence the classification. In this study, we instead analyze the documents for each individual stock and select as neutral words those that are irrelevant to every category. We then extract the words surrounding each selected neutral word and use them to generate the term-document matrix. The approach starts from the idea that the presence of a neutral word itself has little relation to stock movement, whereas the words surrounding it are more likely to affect stock price movements. The generated term-document matrix is then passed to an algorithm that classifies stock price fluctuations. We first removed stop words and selected neutral words for each stock, and then excluded from the selected words those that also appear in news articles about other stocks.
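The word selection steps described above can be sketched in Python. This is only an illustrative sketch, not the authors' implementation: the abstract does not state how "irrelevant for all categories" is measured or how many surrounding words are taken, so the document-frequency gap rule (`max_gap`), the window size (`window`), and all function names below are assumptions made for illustration.

```python
from collections import Counter


def doc_freq(docs):
    """Fraction of documents in which each term appears."""
    if not docs:
        return {}
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # count each term once per document
    return {term: n / len(docs) for term, n in counts.items()}


def neutral_terms(up_docs, down_docs, other_stock_vocab=frozenset(), max_gap=0.1):
    """Terms about equally common in 'up' and 'down' articles for one stock.

    Assumption: a term is neutral when its document frequencies in the two
    classes differ by at most max_gap. Terms that also occur in news about
    other stocks are then excluded, as the abstract describes.
    """
    up, down = doc_freq(up_docs), doc_freq(down_docs)
    shared = up.keys() & down.keys()
    neutral = {t for t in shared if abs(up[t] - down[t]) <= max_gap}
    return neutral - set(other_stock_vocab)


def context_terms(tokens, neutral, window=2):
    """Non-neutral tokens within `window` positions of any neutral token."""
    kept = []
    for i, tok in enumerate(tokens):
        if tok in neutral:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        if any(t in neutral for t in nearby):
            kept.append(tok)
    return kept


# Toy, already tokenized articles with stop words removed (hypothetical data).
up_docs = [["acme", "said", "profit", "surged"], ["acme", "said", "record", "sales"]]
down_docs = [["acme", "said", "profit", "fell"], ["acme", "said", "recall", "costs"]]

neutral = neutral_terms(up_docs, down_docs, other_stock_vocab={"said"})
features = context_terms(["acme", "said", "profit", "surged", "today"], neutral, window=1)
```

In this toy run, `features` holds the tokens that would become columns of the term-document matrix for that article; tokens far from any neutral word are dropped.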
From an online news portal, we collected four months of news articles on the top 10 stocks by market capitalization. We used the first three months of articles as training data and applied the model to the remaining month of articles to predict the next day's stock price movements. We built the models with SVM, boosting, and random forest. The stock market was open for a total of 80 days during the four months (2016/02/01 to 2016/05/31); the first 60 days served as the training set and the remaining 20 days as the test set. The neutral-word-based algorithm proposed in this study showed better classification performance than the sparsity-based word selection method. This study predicted stock price fluctuations by collecting and analyzing news articles on the top 10 stocks by market capitalization. We used a classification model based on the term-document matrix to estimate stock price fluctuations, and compared the existing sparsity-based word extraction method with the suggested method of removing words from the term-document matrix. The suggested method differs from the existing word extraction method in that it uses not only the news articles on the stock in question but also news on other stocks to determine which words to extract. In other words, it removed not only the words that appeared in articles for both rises and falls but also the words that appeared commonly in the news on other stocks. When prediction accuracy was compared, the suggested method was more accurate. This study has limitations: stock price prediction was framed as classifying rises and falls, and the experiment covered only the top ten stocks, which do not represent the entire stock market. In addition, investment performance is difficult to demonstrate because the direction of a price movement and the rate of return may differ. Further research is therefore needed using more stocks, along with return prediction through trading simulation.
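The evaluation setup (80 trading days, the first 60 for training and the last 20 for testing, with next-day movement as the target) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the rule that an unchanged price counts as "down" is an assumption, and the classifier itself (SVM, boosting, or random forest in the study) is left out.

```python
def label_next_day(closes):
    """Label each day by the next day's movement relative to that day.

    Assumption: an unchanged closing price is labeled 'down'.
    The last day has no next day, so it receives no label.
    """
    return ["up" if nxt > cur else "down" for cur, nxt in zip(closes, closes[1:])]


def chronological_split(items, n_train=60):
    """Split time-ordered items without shuffling: earliest n_train for training."""
    return items[:n_train], items[n_train:]


def accuracy(predicted, actual):
    """Share of days on which the predicted direction matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical closing prices for five consecutive trading days.
labels = label_next_day([100, 101, 99, 99, 105])

# 80 trading days, indexed 0..79, split as in the abstract.
train_days, test_days = chronological_split(list(range(80)), n_train=60)
```

Splitting by date rather than at random keeps every test day later than every training day, which matches the abstract's description of training on the earlier period and predicting the later one.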
Keywords
Stock Price; Neutral Terms; Text Mining; Online News;