http://dx.doi.org/10.3745/KTSDE.2016.5.11.521

A Design on Informal Big Data Topic Extraction System Based on Spark Framework  

Park, Kiejin (Department of Integrative Systems Engineering, Ajou University)
Publication Information
KIPS Transactions on Software and Data Engineering / v.5, no.11, 2016, pp. 521-526
Abstract
Online informal text data are massive in volume and unstructured in nature, which limits the applicability of traditional relational data model technologies to data storage and data analysis. Moreover, because massive social data are generated dynamically, analyzing social users' reactions in real time is difficult. In this paper, to capture the semantics of massive, informal online documents easily with an unsupervised learning mechanism, we design and implement an automatic topic extraction system that works according to the mass of the words that make up a document. The input data set for the proposed system is generated first, using an N-gram algorithm that builds multi-word terms to capture the meaning of sentences more precisely, and Hadoop and Spark (an in-memory distributed computing framework) are adopted to run the topic model. In the experiment phase, TB-level input data are preprocessed and the proposed topic extraction steps are applied. We conclude that the proposed system performs well in extracting meaningful topics in a timely manner, because intermediate results are read directly from main memory instead of from an HDD.
Keywords
Topic Model; N-gram; Spark; Hadoop; Machine Learning;
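The N-gram preprocessing step mentioned in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not specify the tokenizer, the value of N, or how multi-word terms are encoded, so whitespace tokenization, N up to 2, and underscore-joined terms are assumptions made here for illustration only; the function names `word_ngrams`, `document_terms`, and `term_counts` are hypothetical and do not come from the paper, which runs its pipeline on Hadoop and Spark.

```python
from collections import Counter
from typing import Iterable, List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return contiguous word n-grams, each joined with underscores."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_terms(text: str, max_n: int = 2) -> List[str]:
    """Build the term list for one document: unigrams up to max_n-grams.

    Assumption: lowercasing plus whitespace splitting stands in for
    whatever tokenizer the real system uses.
    """
    tokens = text.lower().split()
    terms: List[str] = []
    for n in range(1, max_n + 1):
        terms.extend(word_ngrams(tokens, n))
    return terms


def term_counts(documents: Iterable[str], max_n: int = 2) -> List[Counter]:
    """Count terms per document, the usual input shape for a topic model."""
    return [Counter(document_terms(doc, max_n)) for doc in documents]
```

For example, `document_terms("Spark runs fast")` yields the three single words followed by the two bigrams `spark_runs` and `runs_fast`. Per-document term counts of this kind are what a topic model typically consumes; in a distributed setting the same per-document function would be applied across partitions rather than in a local loop.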