S-PARAFAC: Distributed Tensor Decomposition using Apache Spark

  • Hye-Kyung Yang (Dept. of Computer Science and Engineering, Ewha Womans University)
  • Hwan-Seung Yong (Dept. of Computer Science and Engineering, Ewha Womans University)
  • Received : 2017.07.25
  • Accepted : 2017.12.19
  • Published : 2018.03.15

Abstract


Recently, the use of high-dimensional tensors in recommendation systems and data analysis has been increasing, because analyzing tensors makes it possible to extract more latent factors and latent patterns. However, high-dimensional tensors are very large and computationally complex, so they must be analyzed through tensor decomposition. Existing tensor tools such as rTensor, pyTensor, and MATLAB run on a single machine and therefore cannot handle large volumes of data, and although Hadoop-based distributed tensor decomposition tools can scale to such tensors, their processing time is long. In this paper, we propose S-PARAFAC, a tensor decomposition tool built on Apache Spark, an in-memory big data system. S-PARAFAC focuses on the PARAFAC decomposition and adapts it to Apache Spark so that the decomposition can be processed quickly and in a distributed manner. We compared the performance of S-PARAFAC with a Hadoop-based tensor decomposition tool: S-PARAFAC is approximately 4 to 25 times faster.
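
The paper itself does not reproduce source code, so the following is only a minimal sketch, under stated assumptions, of how one alternating-least-squares (ALS) update of a PARAFAC (CP) factor matrix can be distributed on Spark: the sparse tensor is held as an RDD of ((i, j, k), value) entries, the small factor matrices B and C are broadcast to the workers, and the matricized-tensor-times-Khatri-Rao product (MTTKRP) is accumulated with map and reduceByKey. The PySpark/NumPy usage and all names below are illustrative assumptions, not the actual S-PARAFAC implementation.

    # Minimal PySpark sketch of one CP-ALS factor update (illustrative, not S-PARAFAC itself).
    import numpy as np
    from pyspark import SparkContext

    def mttkrp_mode1(tensor_rdd, B_bc, C_bc):
        # Mode-1 MTTKRP: row i of the result is the sum over nonzeros of
        # X[i, j, k] * (B[j, :] * C[k, :]), an elementwise product of rank-R rows.
        def contribution(entry):
            (i, j, k), v = entry
            return i, v * (B_bc.value[j, :] * C_bc.value[k, :])
        return tensor_rdd.map(contribution).reduceByKey(lambda x, y: x + y)

    def update_mode1(tensor_rdd, B_bc, C_bc, num_i, rank):
        # One ALS step for the mode-1 factor: A = MTTKRP(X; B, C) @ pinv((B^T B) * (C^T C)).
        gram = (B_bc.value.T @ B_bc.value) * (C_bc.value.T @ C_bc.value)
        pinv = np.linalg.pinv(gram)
        rows = (mttkrp_mode1(tensor_rdd, B_bc, C_bc)
                .mapValues(lambda m: m @ pinv)
                .collectAsMap())
        A = np.zeros((num_i, rank))
        for i, row in rows.items():
            A[i, :] = row
        return A

    if __name__ == "__main__":
        sc = SparkContext(appName="cp-als-sketch")
        rng = np.random.default_rng(0)
        num_i, num_j, num_k, rank = 60, 50, 40, 3
        # Toy sparse 3-way tensor as ((i, j, k), value) entries.
        entries = [((int(rng.integers(num_i)), int(rng.integers(num_j)), int(rng.integers(num_k))),
                    float(rng.random())) for _ in range(1000)]
        X = sc.parallelize(entries).cache()
        B = sc.broadcast(rng.random((num_j, rank)))
        C = sc.broadcast(rng.random((num_k, rank)))
        A = update_mode1(X, B, C, num_i, rank)
        print("updated mode-1 factor:", A.shape)
        sc.stop()

Broadcasting the small dense factors and keeping only the sparse tensor as a cached RDD avoids re-shuffling the tensor at every iteration; this kind of in-memory reuse is what the abstract credits for the speedup over Hadoop-based MapReduce pipelines, where each job rewrites intermediate data to HDFS.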

Keywords

Funding Information

Funding agency : National Research Foundation of Korea (NRF)

References

  1. Apache Hadoop [Online]. Available: https://hadoop.apache.org/
  2. Apache Spark [Online]. Available: https://spark.apache.org/
  3. Apache Mahout [Online]. Available: http://mahout.apache.org/
  4. MATLAB Tensor Toolbox [Online]. Available: http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.6.html
  5. PyTensor: A Python based Tensor Library [Online]. Available: http://www.cs.cmu.edu/~cjl/papers/CMUCS-10-102.pdf
  6. rTensor [Online]. Available: https://cran.r-project.org/web/packages/rTensor/rTensor.pdf
  7. U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries," Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), pp. 316-324, 2012.
  8. I. Jeon, E. E. Papalexakis, C. Faloutsos, L. Sael, and U. Kang, "Mining billion-scale tensors: algorithms and discoveries," The VLDB Journal, Vol. 25, No. 4, pp. 519-544, Aug. 2016. https://doi.org/10.1007/s00778-016-0427-4
  9. I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, "HaTen2: Billion-scale tensor decompositions," Proc. of the 31st IEEE International Conference on Data Engineering (ICDE), pp. 1047-1058, 2015.
  10. N. Park, B. Jeon, J. Lee, and U. Kang, "BIGtensor: Mining Billion-Scale Tensor Made Easy," Proc. of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, United States, pp. 2457-2460, 2016.
  11. BigTensor Download Site [Online]. Available: https://datalab.snu.ac.kr/bigtensor/index.php (downloaded Nov. 11, 2016)
  12. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing," Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), 2012.
  13. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," Proc. of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10), 2010.
  14. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Fast and Interactive Analytics over Hadoop Data with Spark," USENIX ;login:, Vol. 37, No. 4, pp. 45-51, 2012.
  15. J. H. Choi and S. V. N. Vishwanathan, "DFacTo: distributed factorization of tensors," Proc. of the 27th International Conference on Neural Information Processing Systems (NIPS '14), pp. 1296-1304, 2014.
  16. K. S. Aggour and B. Yener, "Adapting to Data Sparsity for Efficient Parallel PARAFAC Tensor Decomposition in Hadoop," Proc. of the IEEE International Conference on Big Data, pp. 294-301, 2016.
  17. YELP Tensor Dataset [Online]. Available: https://datalab.snu.ac.kr/bigtensor/datasets.php (downloaded Nov. 11, 2016)
  18. MovieLens Tensor Dataset [Online]. Available: https://datalab.snu.ac.kr/bigtensor/datasets.php (downloaded Nov. 11, 2016)