http://dx.doi.org/10.30693/SMJ.2018.7.4.24

SMS Text Messages Filtering using Word Embedding and Deep Learning Techniques  

Lee, Hyun Young (Department of Computer Engineering, Kookmin University)
Kang, Seung Shik (School of Software, Kookmin University)
Publication Information
Smart Media Journal / v.7, no.4, 2018, pp. 24-29
Abstract
In deep learning, text analysis techniques for natural language processing represent words as vectors through word embedding. In this paper, we propose a method that constructs a document vector using word embedding and classifies a text message as spam or normal with a deep learning model. Automatic spacing applied during preprocessing ensures that words with similar contexts are represented close together in the vector space. Additionally, spam messages contain intentional word formation errors, written with non-alphabetic or unusual characters, that are designed to avoid being blocked by spam message filters. Two embedding algorithms, CBOW and skip-gram, are used to produce the sentence vector, and the performance and accuracy of the deep learning based spam filter model are measured by comparison with those of SVM Light.
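The pipeline described in the abstract (word vectors, then a document vector, then a binary spam/normal decision) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tiny `EMBEDDINGS` table stands in for vectors that a trained CBOW or skip-gram model would supply, averaging is one common way to pool word vectors into a document vector, and a logistic-regression classifier stands in for the deep learning model.

```python
import math

# Hypothetical toy embedding table (assumption). In practice these vectors
# would come from a trained CBOW or skip-gram model.
EMBEDDINGS = {
    "free":    [0.9, 0.1, 0.0],
    "prize":   [0.8, 0.2, 0.1],
    "click":   [0.7, 0.0, 0.2],
    "meeting": [0.1, 0.9, 0.3],
    "lunch":   [0.0, 0.8, 0.4],
    "today":   [0.2, 0.7, 0.5],
}
DIM = 3


def document_vector(tokens):
    """Average the embeddings of known tokens (an assumed pooling choice)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Probability of the spam class for document vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic-regression classifier with batch gradient descent."""
    w = [0.0] * DIM
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * DIM
        grad_b = 0.0
        for x, label in zip(X, y):
            err = score(x, w, b) - label
            for i in range(DIM):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def is_spam(tokens, w, b):
    """Classify a tokenized message: True for spam, False for normal."""
    return score(document_vector(tokens), w, b) > 0.5


# Toy labeled messages (1 = spam, 0 = normal), already tokenized.
train_docs = [
    (["free", "prize"], 1),
    (["click", "free"], 1),
    (["prize", "click"], 1),
    (["meeting", "today"], 0),
    (["lunch", "today"], 0),
    (["meeting", "lunch"], 0),
]
X = [document_vector(tokens) for tokens, _ in train_docs]
y = [label for _, label in train_docs]
w, b = train_logistic(X, y)
```

With these toy vectors, `is_spam(["free", "click", "prize"], w, b)` evaluates to true and `is_spam(["lunch", "meeting", "today"], w, b)` to false. Tokens missing from the embedding table are skipped, which is one reason preprocessing steps such as automatic spacing matter: a word that is split or joined incorrectly will not match any trained vector.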
Keywords
spam text message; word embedding; text vector; deep learning; binary classification