Automatic Classification of Blog Posts using Various Term Weighting

다양한 어휘 가중치를 이용한 블로그 포스트의 자동 분류

  • Kim, Su-Ah (Department of Computer Software Engineering, Kumoh National Institute of Technology) ;
  • Jho, Hee-Sun (Department of Computer Software Engineering, Kumoh National Institute of Technology) ;
  • Lee, Hyun Ah (Department of Computer Software Engineering, Kumoh National Institute of Technology)
  • Received : 2014.07.28
  • Accepted : 2014.12.19
  • Published : 2015.01.31

Abstract

Most blog sites provide predefined classes based on content or topic, but few bloggers assign a class to their posts because choosing one manually is cumbersome. This paper proposes an automatic blog post classification method that combines term frequency, document frequency, and class frequency computed for each class in various ways to find an appropriate term weighting scheme. In experiments, the combination of term frequency, category term frequency, and inverse excepted-category document frequency achieves a classification precision of 77.02%.

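Most blog sites provide a content-based classification environment built on a predefined category scheme, but most bloggers do not assign a category to their posts because selecting one manually is cumbersome. To classify blog posts automatically, this paper collects documents for each category from blog sites and combines various term weights, such as the term frequency, document frequency, and per-category frequency of the collected documents, to find a weighting scheme suited to the characteristics of blog posts. In the experiments, the classification model that uses the proposed TF-CTF-IECDF as its term weight achieved a classification precision of 77.02%.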
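The abstract names the three factors of TF-CTF-IECDF but does not give their formulas, so the following is only a minimal sketch under assumed definitions, not the authors' implementation. It assumes that TF is the term count in the post being classified, CTF is the term's relative frequency within a category, and IECDF is a smoothed logarithmic inverse document frequency computed over the documents of all other categories. The function names (`train`, `iecdf`, `classify`), the smoothing, and the sum-of-weights scoring rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs_by_category):
    """Collect per-category statistics.

    docs_by_category maps a category name to a list of tokenized documents.
    """
    ctf = {}                    # category -> term counts over the whole category
    df = defaultdict(Counter)   # category -> number of documents containing each term
    n_docs = {}                 # category -> number of documents
    for cat, docs in docs_by_category.items():
        ctf[cat] = Counter(t for d in docs for t in d)
        for d in docs:
            df[cat].update(set(d))
        n_docs[cat] = len(docs)
    return ctf, df, n_docs


def iecdf(term, cat, df, n_docs):
    # Assumed form: smoothed log ratio of the number of documents outside
    # `cat` to the number of those documents that contain `term`.
    other_docs = sum(n for c, n in n_docs.items() if c != cat)
    other_df = sum(df[c][term] for c in n_docs if c != cat)
    return math.log((1 + other_docs) / (1 + other_df))


def classify(tokens, model):
    """Score every category by the summed TF * CTF * IECDF of the post's terms."""
    ctf, df, n_docs = model
    tf = Counter(tokens)
    scores = {}
    for cat in n_docs:
        total = sum(ctf[cat].values()) or 1  # avoid division by zero
        scores[cat] = sum(
            count * (ctf[cat][term] / total) * iecdf(term, cat, df, n_docs)
            for term, count in tf.items()
        )
    return max(scores, key=scores.get), scores


# Toy data (invented for illustration, not from the paper).
toy_docs = {
    "travel": [["beach", "hotel", "flight"], ["hotel", "tour"]],
    "food": [["recipe", "pasta", "sauce"], ["pasta", "restaurant"]],
}
model = train(toy_docs)
label, scores = classify(["hotel", "beach", "pasta"], model)
```

Under these assumptions, a term that is frequent in one category and rare in the documents of every other category contributes the most to that category's score, which is the intuition behind excluding the term's own category from the document frequency.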
