http://dx.doi.org/10.1633/JISTaP.2014.2.3.3

Study of Machine-Learning Classifier and Feature Set Selection for Intent Classification of Korean Tweets about Food Safety

Yeom, Ha-Neul (Korea University of Science and Technology (UST); Korea Institute of Science and Technology Information (KISTI))
Hwang, Myunggwon (Korea Institute of Science and Technology Information (KISTI))
Hwang, Mi-Nyeong (Korea Institute of Science and Technology Information (KISTI))
Jung, Hanmin (Korea University of Science and Technology (UST); Korea Institute of Science and Technology Information (KISTI))

Publication Information

Journal of Information Science Theory and Practice / v.2, no.3, 2014, pp. 29-39

Abstract

In recent years, several studies have proposed using the Twitter micro-blogging service to track trends in online media and discussion. In this study, we examine the use of Twitter to track discussions of food safety in the Korean language. Because keyword use in most tweets is irregular, we focus on selecting the optimal machine-learning classifier and feature set for classifying collected tweets. We build classifier models using the Naive Bayes, Naive Bayes Multinomial, Support Vector Machine, and Decision Tree algorithms, all of which show good performance. To select an optimal feature set, we first construct a basic feature set as a baseline for performance comparison, against which further test feature sets are evaluated. Experiments show that precision and F-measure are best when using a Naive Bayes Multinomial classifier model with a test feature set defined by extracting the Substantive, Predicate, Modifier, and Interjection parts of speech.
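
As a rough illustration of the pipeline the abstract describes (filter tokens by part of speech, train a multinomial Naive Bayes model, then score it with precision and F-measure), the following is a minimal Python sketch. It is not the authors' implementation: the tag names, the `select_features` helper, the `MultinomialNB` class, the `precision_recall_f1` function, and the toy English tokens are all assumptions made for illustration. A real pipeline would obtain the tags from a Korean morphological analyzer and use a far larger labeled corpus.

```python
import math
from collections import Counter, defaultdict

# Assumed coarse POS tags kept by the feature set (names are hypothetical).
KEPT_TAGS = {"substantive", "predicate", "modifier", "interjection"}


def select_features(tagged_tokens, kept_tags=KEPT_TAGS):
    """Keep only the tokens whose POS tag belongs to the chosen feature set."""
    return [token for token, tag in tagged_tokens if tag in kept_tags]


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F-measure for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy, hand-tagged training data (invented for illustration only).
train = [
    ([("spoiled", "predicate"), ("milk", "substantive"), ("the", "particle")], "safety"),
    ([("recall", "substantive"), ("contaminated", "modifier"), ("snack", "substantive")], "safety"),
    ([("delicious", "modifier"), ("lunch", "substantive"), ("today", "substantive")], "other"),
    ([("wow", "interjection"), ("tasty", "modifier"), ("snack", "substantive")], "other"),
]
train_docs = [select_features(tagged) for tagged, _ in train]
train_labels = [label for _, label in train]
model = MultinomialNB().fit(train_docs, train_labels)

test_docs = [["contaminated", "milk"], ["tasty", "lunch"]]
gold = ["safety", "other"]
predicted = [model.predict(doc) for doc in test_docs]
print(predicted, precision_recall_f1(gold, predicted, positive="safety"))
```

Comparing feature sets in this sketch amounts to changing `KEPT_TAGS`, retraining, and comparing the resulting scores against those of a baseline set.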

Keywords

Twitter; Tweets; Machine-learning Feature; Text Classification
