http://dx.doi.org/10.3837/tiis.2018.02.004

A New Fine-grain SMS Corpus and Its Corresponding Classifier Using Probabilistic Topic Model

Ma, Jialin (College of Computer and Information, Hohai University)
Zhang, Yongjun (College of Computer and Information, Hohai University)
Wang, Zhijian (College of Computer and Information, Hohai University)
Chen, Bolun (Huaiyin Institute of Technology)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.12, no.2, 2018, pp. 604-625

Abstract

SMS spam is now widespread in many countries, and the standards for filtering it differ from country to country. However, current technologies and research on SMS spam filtering all focus on dividing SMS messages into two classes: legitimate and illegitimate. This does not match actual conditions and needs. Furthermore, existing approaches face several difficulties: (1) High-quality, large-scale SMS spam corpora are very scarce, and finely categorized SMS spam corpora do not exist at all, which seriously hinders research. (2) The limited length of SMS messages leads to a lack of sufficient features. These factors seriously degrade the performance of traditional classifiers (such as SVM, K-NN, and Bayes). In this paper, we present a new finely categorized SMS spam corpus, which is unique and, as far as we know, the largest of its kind. In addition, we propose a classifier based on a probabilistic topic model. The classifier can alleviate the feature sparsity problem in SMS spam filtering. Moreover, we compare the approach with three typical classifiers on the new SMS spam corpus. The experimental results show that the proposed approach is more effective for SMS spam filtering.

Keywords

Spam SMS corpus; Topic Model; LDA; SMTM
