A Development of LDA Topic Association Systems Based on Spark-Hadoop Framework

  • Park, Kiejin (Dept. of Integrative Systems Engineering, Ajou University)
  • Peng, Limei (Dept. of Industrial Engineering, Ajou University)
  • Received : 2017.05.08
  • Accepted : 2017.07.30
  • Published : 2018.02.28

Abstract

Social data such as user comments are unstructured, and current techniques for analyzing them are constrained by available storage space and processing time when fast storage and processing are required. It is also difficult to analyze user characteristics at high speed from a huge volume of dynamically generated social data. To address this problem, we design and implement a topic association analysis system based on the latent Dirichlet allocation (LDA) model. Because LDA is unsupervised and needs no separate training process on labeled data, it can readily analyze social users' hourly interests in different topics. The proposed system is built on the Spark framework running on top of a Hadoop cluster. It achieves high-speed processing because access to the hard disk is minimized and all intermediate data are processed in main memory. In the performance evaluation, the system required about 5 hours to analyze the topics of about 1 TB of test social data (SNS comments). Moreover, by analyzing the associations among topics, we can track hourly changes in social users' interests in different topics.
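The topic association step described above can be illustrated with a small sketch. It assumes that association between two topics is measured as the cosine similarity between their columns in a document-topic weight matrix. That is the quantity Spark MLlib's `RowMatrix.columnSimilarities` computes, and the caption of Table 2 suggests something similar, but the exact method used in the paper is not confirmed here. The function names, the hour keys, and the weights are all hypothetical, and the example uses plain Python rather than Spark so that it runs on its own.

```python
import math


def column_cosine_similarity(matrix, i, j):
    """Cosine similarity between columns i and j of a row-major matrix.

    Each row is one document; each column holds the weight of one topic.
    """
    dot = sum(row[i] * row[j] for row in matrix)
    norm_i = math.sqrt(sum(row[i] ** 2 for row in matrix))
    norm_j = math.sqrt(sum(row[j] ** 2 for row in matrix))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # a topic with no weight is treated as unrelated
    return dot / (norm_i * norm_j)


def hourly_association(doc_topics_by_hour, i, j):
    """Map each hour to the similarity between topics i and j in that hour."""
    return {
        hour: column_cosine_similarity(matrix, i, j)
        for hour, matrix in sorted(doc_topics_by_hour.items())
    }


# Hypothetical document-topic weights (3 documents x 3 topics) for two hours.
doc_topics_by_hour = {
    9: [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
    10: [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.1, 0.1, 0.8]],
}

assoc = hourly_association(doc_topics_by_hour, 0, 1)
for hour, similarity in assoc.items():
    print(f"hour {hour}: similarity(topic 0, topic 1) = {similarity:.3f}")
```

In this toy data, topics 0 and 1 co-occur more evenly in the documents from hour 10 than in those from hour 9, so their similarity rises from one hour to the next. Tracking that value over time is the kind of hourly change in association the abstract refers to.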

Fig. 1. The inference process of LDA topic modeling parameters.

Fig. 2. Association analysis cluster structure based on the Spark-Hadoop framework.

Fig. 3. Structure for generating summarized documents.

Fig. 4. Topic association analysis process.

Fig. 5. Sample of source code for topic association analysis.

Fig. 6. The number of topics versus the change in log-likelihood.

Fig. 7. Change of association between Topic7 and Topic12.

Table 1. Topic analysis result for five topics out of a total of 14

Table 2. Topic similarity using column similarities
