A Study of Efficiency Information Filtering System using One-Hot Long Short-Term Memory

  • Kim, Hee sook (Department of Computer Information, Inchon Campus of Korea Polytechnic)
  • Lee, Min Hi (Department of Architecture, Howon University)
  • Received : 2017.02.03
  • Reviewed : 2017.03.04
  • Published : 2017.03.31

Abstract

In this paper, we propose an extension of one-hot Long Short-Term Memory (LSTM) and evaluate its performance on a spam filtering task. Most traditional spam filtering methods represent spam and non-spam messages by word occurrences and ignore all syntactic and semantic information. A major issue arises when spam and non-spam messages share many common words and noise words, which makes it difficult for the system to assign the correct label to a message. Unlike previous studies on information filtering, which use only word occurrence and word context as in probabilistic models, we apply a neural network-based approach to train the filter for better performance. In addition to the one-hot representation, using term weights with an attention mechanism allows the classifier to focus on the words that are most likely to appear in the spam and non-spam collections. As a result, we obtained some improvement over the performance of previous methods. We find that using region embedding and pooling features on top of the LSTM, together with the attention mechanism, allows the system to obtain a better document representation for filtering tasks in general.
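The abstract does not spell out how the term weights enter the attention mechanism, so the following is only a minimal sketch under stated assumptions, not the authors' model. It assumes a smoothed IDF score as the term weight, a softmax over those scores as the attention distribution, and a weighted sum of one-hot vectors as the document representation; the LSTM, region embedding, and pooling layers are left out entirely. All function names and the toy messages are hypothetical.

```python
import math

def build_vocab(docs):
    """Map each distinct token to an integer index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def idf_weights(docs, vocab):
    """Assumed term weight: smoothed IDF, log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    weights = {}
    for tok in vocab:
        df = sum(1 for doc in docs if tok in doc)
        weights[tok] = math.log((1 + n_docs) / (1 + df)) + 1.0
    return weights

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(doc, vocab, weights):
    """Attention-weighted sum of one-hot vectors for one tokenized message."""
    attention = softmax([weights[tok] for tok in doc])
    rep = [0.0] * len(vocab)
    for tok, a in zip(doc, attention):
        vec = one_hot(vocab[tok], len(vocab))
        rep = [r + a * v for r, v in zip(rep, vec)]
    return attention, rep

# Toy corpus (hypothetical): one spam-like message and two ordinary ones.
docs = [
    ["win", "free", "prize", "now"],
    ["meeting", "moved", "to", "now"],
    ["free", "lunch", "at", "meeting"],
]
vocab = build_vocab(docs)
weights = idf_weights(docs, vocab)
attention, rep = attend(docs[0], vocab, weights)
```

In this toy corpus, "win" and "prize" occur in one message each while "free" and "now" occur in two, so the first message's attention mass shifts toward the rarer words. In a full model the attention scores would more plausibly be computed from LSTM hidden states than from raw term weights alone.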
