Study on Extraction of Keywords Using TF-IDF and Text Structure of Novels

TF-IDF와 소설 텍스트의 구조를 이용한 주제어 추출 연구

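  • Eun-Soon You (유은순; Media Contents Research Institute, Dankook University) ;
  • Gun-Hee Choi (최건희; Department of Software, Dankook University) ;
  • Seung-Hoon Kim (김승훈; Department of Applied Computer Engineering, Dankook University)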
  • Received : 2015.02.05
  • Accepted : 2015.02.20
  • Published : 2015.02.28

Abstract

As the volume of information about books grows explosively, more customers find it difficult to choose a book. This raises the importance of book recommendation systems, which present suitable book information and thereby encourage a purchase. However, existing recommendation systems that rely on bibliographic information or user data have shown problems with the reliability of their recommendations, so semantic information extracted from the body text of a book needs to be reflected in the recommendation system. As preliminary work toward that goal, this paper proposes a method for extracting keywords from the text of novels using TF-IDF together with the external structure of the novel. The texts of 100 novels were collected, each novel was divided into four structural elements (preface, dialogue, non-dialogue, and closing), and the TF-IDF weight of each keyword was calculated. In the experiments, keyword extraction accuracy improved by 42.1% when the preface and closing were included and dialogue was given a higher weight, compared with using the main body text alone.
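The idea of combining TF-IDF with per-section weights can be illustrated with a minimal sketch. It is not the authors' implementation: the weight values in `SECTION_WEIGHTS`, the function names, and the plain `log(N / df)` form of IDF are all assumptions made for illustration, since the abstract only states that dialogue was weighted more heavily. Each novel is assumed to be already tokenized (for example into nouns by a morphological analyzer) and split into the four structural elements.

```python
import math
from collections import Counter

# Hypothetical section weights. The abstract says dialogue received a higher
# weight but does not give the values, so these numbers are assumptions.
SECTION_WEIGHTS = {
    "preface": 1.0,
    "dialogue": 2.0,
    "non_dialogue": 1.0,
    "closing": 1.0,
}


def weighted_tf(sections, weights=SECTION_WEIGHTS):
    """Sum term counts over a novel's sections, scaling each count by the section weight."""
    tf = Counter()
    for name, tokens in sections.items():
        w = weights.get(name, 1.0)
        if w == 0:
            continue  # a zero weight excludes the section entirely
        for tok in tokens:
            tf[tok] += w
    return tf


def tfidf_keywords(novels, top_k=3, weights=SECTION_WEIGHTS):
    """Return the top_k terms of each novel ranked by weighted TF times IDF."""
    n_docs = len(novels)
    tfs = [weighted_tf(sections, weights) for sections in novels]

    # Document frequency: in how many novels each term appears.
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    ranked = []
    for tf in tfs:
        scores = {term: freq * idf[term] for term, freq in tf.items()}
        # Sort by descending score, breaking ties alphabetically for stable output.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return ranked
```

Setting the `preface` and `closing` weights to 0 and the `dialogue` weight to 1.0 gives a "main body only, unweighted" baseline of the kind the abstract compares against, so the two configurations can be evaluated with the same function.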
