Generating and Controlling an Interlinking Network of Technical Terms to Enhance Data Utilization

  • Do-Heon Jeong (Department of Library and Information Science, Duksung Women's University)
  • Received : 2018.02.18
  • Accepted : 2018.03.26
  • Published : 2018.03.30

Abstract

With the rapid development of data storage and processing technologies in the era of big data, long-tail data that was overlooked in the past has drawn the interest of many companies and researchers. This study proposes text-mining-based methods for generating and controlling a network of technical terms in order to raise the utilization of data that lies in the long-tail region of the distribution. In particular, it uses the edit distance technique from text mining to automatically create an interlinking network of technical terms used in scholarly fields. Experimental data were gathered from a linked open data (LOD) environment, and in the process the study proposes techniques for making effective use of LOD system data together with an algorithm for processing term patterns. Finally, a performance evaluation of the generated term network confirmed that the proposed methods were effective in raising the utilization rate of long-tail data.
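The abstract names edit distance as the basis for linking technical terms but does not give the procedure itself. The following is only a minimal sketch of the general idea, not the study's actual algorithm: it computes the Levenshtein distance between normalized term strings and links any pair whose distance falls under a threshold. The function names, the `normalize` step, the `max_distance` value, and the sample terms are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(term: str) -> str:
    """Hypothetical normalization: lowercase, hyphens to spaces,
    collapse repeated whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def build_term_network(terms, max_distance=2):
    """Link every pair of terms whose normalized forms are within
    max_distance edits (threshold chosen arbitrarily for this sketch)."""
    network = {term: set() for term in terms}
    for left, right in combinations(terms, 2):
        if edit_distance(normalize(left), normalize(right)) <= max_distance:
            network[left].add(right)
            network[right].add(left)
    return network


# Made-up sample terms, not data from the study.
terms = [
    "X-ray diffraction",
    "x ray diffraction",
    "X-ray diffractions",
    "liquid chromatography",
]
network = build_term_network(terms)
for term, neighbours in network.items():
    print(term, "->", sorted(neighbours))
```

In this sketch the three spelling variants of "X-ray diffraction" end up linked to one another, while "liquid chromatography" stays isolated. The pairwise comparison is quadratic in the number of terms, so a real vocabulary would likely need blocking or indexing before the distance step.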
