http://dx.doi.org/10.9717/kmms.2020.23.9.1181

Emerging Topic Detection Using Text Embedding and Anomaly Pattern Detection in Text Streaming Data  

Choi, Semok (Division of Computer Convergence, Chungnam National University)
Park, Cheong Hee (Division of Computer Convergence, Chungnam National University)
Abstract
Detecting an anomaly pattern that deviates from the normal data distribution in streaming data is an important technique in many application areas. In this paper, a method is proposed for detecting a newly emerging pattern in text streaming data, which is an ordered sequence of texts, based on text embedding and anomaly pattern detection. The detection performance of the proposed method is compared across text embedding methods such as BOW (Bag of Words), Word2Vec, and BERT. Experimental results show that anomaly pattern detection using BERT embedding achieved an average F1 value of 0.85, with an F1 value of 1 in three of the five test cases.
Keywords
Anomaly Pattern Detection; Text Streaming Data; Text Embedding; Word Embedding
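
The general idea in the abstract, embedding each incoming text and then watching for a run of texts that deviate from the normal distribution, can be illustrated with a minimal sketch. This is not the authors' algorithm: the reference-centroid scoring, the sliding-window rule, the function names, and the `window`, `threshold`, and `min_ratio` parameters are all assumptions made for illustration, and only a simple BOW embedding is shown.

```python
import math
from collections import Counter, deque


def bow_vector(text, vocab):
    """Bag-of-words count vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]


def cosine_distance(a, b):
    """1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def detect_emerging_pattern(reference, stream, window=3, threshold=0.5, min_ratio=0.6):
    """Return the stream index at which an anomaly pattern is first flagged, else None.

    Assumed rule (illustrative only): a text is an outlier when its cosine
    distance to the centroid of the reference texts exceeds `threshold`; an
    anomaly pattern is flagged when at least `min_ratio` of the last `window`
    texts are outliers.
    """
    # The vocabulary comes from the reference texts only, so no look-ahead
    # into the stream is needed.
    vocab = sorted({w for t in reference for w in t.lower().split()})
    ref_vecs = [bow_vector(t, vocab) for t in reference]
    centroid = [sum(col) / len(ref_vecs) for col in zip(*ref_vecs)]

    recent = deque(maxlen=window)  # outlier flags of the most recent texts
    for i, text in enumerate(stream):
        distance = cosine_distance(bow_vector(text, vocab), centroid)
        recent.append(distance > threshold)
        if len(recent) == window and sum(recent) / window >= min_ratio:
            return i
    return None


# Toy data (made up for this sketch): a finance topic followed by a sports topic.
reference = [
    "stock market shares rise",
    "market shares fall today",
    "stock prices rise today",
    "shares and stock prices fall",
]
stream = [
    "stock market prices rise",
    "market shares fall",
    "football team wins match",
    "team scores goal in match",
    "football match goal today",
    "team wins football cup",
]

print(detect_emerging_pattern(reference, stream))
```

Requiring several outliers within a window, rather than reacting to a single one, reflects the distinction between an isolated outlier and an anomaly pattern. Replacing `bow_vector` with a Word2Vec or BERT sentence embedding would leave the rest of the sketch unchanged.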