http://dx.doi.org/10.13088/jiis.2019.25.4.105

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity  

Lee, Min Seok (Department of Business Administration, The Catholic University of Korea)
Yang, Seok Woo (Department of Psychology, The Catholic University of Korea)
Lee, Hong Joo (Department of Business Administration, The Catholic University of Korea)
Publication Information
Journal of Intelligence and Information Systems / v.25, no.4, 2019, pp. 105-122
Abstract
Dimensionality reduction is one of the methods used to handle big data in text mining. When reducing dimensionality, we should consider the density of the data, which has a significant influence on the performance of sentence classification. High-dimensional data requires a large amount of computation and can lead to high computational cost and overfitting. A dimension reduction step is therefore necessary to improve model performance. Diverse methods have been proposed, ranging from simply reducing noise in the data, such as misspellings and informal text, to incorporating semantic and syntactic information. Moreover, how text features are represented and selected affects the performance of classifiers for sentence classification, one of the tasks of Natural Language Processing. The common goal of dimension reduction is to find a latent space that represents the raw data in the observation space. Existing methods rely on algorithms such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector representations of words that capture semantic and syntactic information, are also widely used. To improve performance further, recent studies have proposed modifying the word dictionary according to the positive and negative scores of predefined words.
The basic idea of this study is that similar words have similar vector representations. Once a feature selection algorithm marks certain words as unimportant, we assume that words similar to them also have little impact on sentence classification. This study proposes two ways to achieve more accurate classification: selectively eliminating words under specific rules and constructing word embeddings based on Word2Vec. To select words of low importance from the text, we use information gain to measure importance and cosine similarity to search for similar words. In the first method, we eliminate words with comparatively low information gain values from the raw text and then build the word embedding. In the second method, we additionally remove words that are similar to the low-information-gain words and then build the word embedding. The filtered text and word embeddings are then fed into two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM.
This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset with the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were labeled as helpful reviews. Because Yelp shows only the number of helpful votes, we randomly sampled 100,000 reviews that received more than five helpful votes from 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, we compared their performance with Word2Vec and GloVe embeddings that used all the words. One of the proposed methods outperformed the embeddings built with all the words: removing unimportant words improved performance, whereas removing too many words lowered it.
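To make the selection procedure concrete, the sketch below (not the authors' implementation) removes low-information-gain words and, in the second variant, their nearest neighbors by cosine similarity, before the filtered corpus is used to train the final embedding. The use of mutual information with the class label as the information gain score, the thresholds, and all function names are illustrative assumptions.

# Minimal sketch of the selective word elimination described above; details
# such as drop_ratio, topn, and the mutual-information proxy for information
# gain are assumptions, not the paper's exact settings.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from gensim.models import Word2Vec

def low_information_words(texts, labels, drop_ratio=0.1):
    """Return words whose information gain falls in the bottom drop_ratio."""
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(texts)
    ig = mutual_info_classif(X, labels, discrete_features=True)
    vocab = np.array(vectorizer.get_feature_names_out())
    cutoff = np.quantile(ig, drop_ratio)
    return set(vocab[ig <= cutoff])

def expand_by_similarity(low_ig_words, tokenized_sentences, topn=3):
    """Second variant: also drop words whose Word2Vec vectors are close
    (cosine similarity) to a low-information-gain word."""
    w2v = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=5)
    expanded = set(low_ig_words)
    for word in low_ig_words:
        if word in w2v.wv:
            expanded.update(w for w, _ in w2v.wv.most_similar(word, topn=topn))
    return expanded

def filter_corpus(tokenized_sentences, words_to_drop):
    """Remove the selected words; the filtered text is then used to train the
    final embedding and to feed the CNN / Attention-BiLSTM classifiers."""
    return [[t for t in sent if t not in words_to_drop]
            for sent in tokenized_sentences]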
For future research, diverse preprocessing methods and an in-depth analysis of word co-occurrence should be considered when measuring similarity values between words. In addition, we applied the proposed method only with Word2Vec; other embedding methods, such as GloVe, fastText, and ELMo, can be combined with the proposed methods, making it possible to identify effective combinations of word embedding methods and elimination methods.
Keywords
Sentence Classification; Feature Selection; Information Gain; Word Similarity; Word Embedding;
References
1 Barkan, O., "Bayesian Neural Word Embedding," Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), (2017)
2 Barkan, O. and N. Koenigstein, "Item2Vec: Neural Item Embedding for Collaborative Filtering," arXiv Preprint arXiv:1603.04259, (2016).
3 Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," CoRR abs/1607.04606, (2016)
4 Deerwester, S., S.T. Dumais, T.K. Landauer, G.W. Furnas, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society of Information Science, Vol.41, No.6(1990), 391-407.
5 Duda, R.O., P.E. Hart, and D.G. Stork. Pattern classification, Wiley, 2000.
6 Frome, A., G. Corrado, and J. Shlens, "DeViSE: A Deep Visual-Semantic Embedding Model," Advances in Neural Information Processing Systems, 26(2013), 1-11.
7 Joachims, T., "Text categorization with support vector machines," Technical report, University of Dortmund, (1997).
8 Jolliffe, I.T., Principal Component Analysis, Springer-Verlag New York, Secaucus, NJ, (1989)
9 Kim, Y., "Convolutional neural networks for sentence classification," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, 1746-1751.
10 Lee, M. and H. J. Lee, "Increasing Accuracy of Classifying Useful Reviews by Removing Neutral Terms," Journal of Intelligent Information Systems, Vol.22, No.3(2016), 129-142.
11 Lee, M. and H. J. Lee, "Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach," Journal of Intelligent Information Systems, Vol.23, No.2(2017), 123-138.
12 Lewis, D.D., "Naive (Bayes) at forty: The independence assumption in information retrieval," Proceedings of ECML-98, 10th European Conference on Machine Learning, (1998), 4-15.
13 Lewis, D.D., "Feature selection and feature extraction for text categorization," Proceedings of the Speech and Natural Language Workshop, San Francisco, (1992), 212-217.
14 Li, J., K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, "Feature Selection: a data perspective," ACM Computing Surveys (CSUR), Vol.50, No.6(2017), 94:1-94:45.
15 Landauer, T.K., P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis," Discourse Processes, Vol.25(1998), 259-284.
16 Mika, S., G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller, "Fisher discriminant analysis with kernels," Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, (1999).
17 Mohan, P. and I. Paramasivam, "A study on impact of dimensionality reduction on Naive Bayes classifier," Indian Journal of Science and Technology, Vol.10, No.20(2017).
18 Peng, H., F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.27, No.8(2005).
19 Peters, M., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. "Deep contextualized word representations", NAACL, (2018).
20 Pennington, J., R. Socher, and C. D. Manning. "Glove: Global vectors for word representation", EMNLP, (2014).
21 Sahami, M., "Learning limited dependence Bayesian classifiers," Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, (1996), 334-338.
22 Rapp, M., F.-J. Lübken, P. Hoffmann, R. Latteck, G. Baumgarten, and T. A. Blix, "PMSE dependence on aerosol charge, number density and aerosol size," Journal of Geophysical Research, Vol.108, No.D8(2003), 1-11.
23 Roweis, S.T. and L.K. Saul, "Nonlinear dimensionality reduction by Locally Linear Embedding," Science, Vol.290, No.5500(2000), 2323-2326.
25 Sahlgren, M., "The distributional hypothesis," Italian Journal of Linguistics, Vol.20, No.1 (2008), 33-53.
26 Mikolov, T., K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," ICLR Workshop, (2013).
27 Yu, L.C., J. Wang, K. R. Lai, and X. Zhang, "Refining word embeddings for sentiment analysis", Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, (2017), 545-550.
28 Zhang, R. and T. Tran, "An information gain-based approach for recommending useful product reviews," Knowledge and Information Systems, Vol.26, No.3(2011), 419-434.
29 Zhou, P., W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, "Attention-based bidirectional long short-term memory networks for relation classification," Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, (2016), 207-213.
30 Zhu, L., G. Wang, and X. Zou, "Improved information gain feature selection method for Chinese text classification based on word embedding," Proceedings of the 6th International Conference on Software and Computer Applications, (2017), 72-76.
31 Azhagusundari, B. and A.S. Thanamani, "Feature Selection based on Information Gain," International Journal of Innovative Technology and Exploring Engineering (IJITEE), Vol.2, No.2(2013), 18-21.