Document recommendation using large citation data

  • Received : 2013.06.30
  • Reviewed : 2013.08.05
  • Published : 2013.09.30


In this research, we propose a method that uses the citation information of documents such as papers and patents to recommend documents that are both highly related to a given document and important. The key idea is to tune the Neumann kernel, which combines the co-citation count (a measure of relatedness between documents) with HITS (a measure of importance), so as to control how strongly each of the two kinds of information is reflected. The proposed method selects the tuning parameter ${\gamma}$ of the Neumann kernel by minimizing the prediction error for future citations. We also discuss in detail the computational techniques needed to analyze large citation data. Finally, we present an empirical analysis of 4 million patents registered in the United States.
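
As a minimal sketch of the kernel described above, the following assumes the usual link-analysis definition of the Neumann kernel, $N_{\gamma} = C(I - {\gamma}C)^{-1}$ with co-citation matrix $C = A^{\top}A$; the abstract itself does not state the formula. Under that definition, ${\gamma} = 0$ reproduces the co-citation counts, while ${\gamma}$ close to $1/\lambda_{\max}(C)$ gives rankings dominated by the principal eigenvector of $C$, that is, the HITS authority scores. The function names, the convention `A[i, j] = 1` when document `i` cites document `j`, and the toy data are illustrative assumptions, and the dense solve is only practical for small matrices, not for the large citation data discussed here.

```python
import numpy as np


def neumann_kernel(A, gamma):
    """Neumann kernel C (I - gamma C)^{-1} of the co-citation matrix C = A^T A.

    Assumed convention: A[i, j] = 1 if document i cites document j.
    """
    A = np.asarray(A, dtype=float)
    C = A.T @ A  # C[j, k] = number of documents citing both j and k
    lam_max = np.linalg.eigvalsh(C)[-1]  # largest eigenvalue of symmetric C
    if gamma < 0 or gamma * lam_max >= 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1 / lambda_max(C)")
    identity = np.eye(C.shape[0])
    # C commutes with (I - gamma C)^{-1}, so solving (I - gamma C) X = C
    # yields the kernel without forming an explicit inverse.
    return np.linalg.solve(identity - gamma * C, C)


def recommend(A, target, gamma, top=3):
    """Indices of the `top` documents with the highest kernel score for `target`."""
    scores = neumann_kernel(A, gamma)[target].copy()
    scores[target] = -np.inf  # do not recommend the document itself
    return np.argsort(-scores)[:top]


# Toy citation matrix (hypothetical data): documents 3 and 4 cite documents 0, 1, 2.
A = np.zeros((5, 5))
A[3, [0, 1]] = 1
A[4, [0, 1, 2]] = 1
print(recommend(A, target=2, gamma=0.1, top=2))
```

In this sketch ${\gamma}$ is passed in by hand; choosing it by minimizing the prediction error for future citations, as the abstract proposes, would require a held-out set of later citations and is not shown.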



