Semantic Topic Selection Method of Document for Classification

Ko, Kwang-Sup; Kim, Pan-Koo; Lee, Chang-Hoon; Hwang, Myung-Gwon

doi:10.6109/JKIICE.2007.11.1.163

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 11 Issue 1 / Pages 163-172 / 2007 / 2234-4772 (pISSN) / 2288-4165 (eISSN)

The Korea Institute of Information and Communication Engineering (한국정보통신학회)



문서분류를 위한 의미적 주제선정방법

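Ko, Kwang-Sup (Department of Computer Engineering, Konkuk University);
Kim, Pan-Koo (School of Computer Engineering, Chosun University);
Lee, Chang-Hoon (Department of Computer Engineering, Konkuk University);
Hwang, Myung-Gwon (School of Computer Engineering, Chosun University)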

Published : 2007.01.31

https://doi.org/10.6109/JKIICE.2007.11.1.163


Abstract

The web, a global network, contains text documents, video, sound, and other media, and connects distributed information through links. As the web has developed, it has accumulated abundant information, most of it in text-based documents. Most users use the web to retrieve the information they want, so numerous studies have addressed text document retrieval using methods based on probability, statistics, vector similarity, Bayesian classification, and so on. These studies, however, could not consider both the subject and the semantics of documents, and as a result users have to search again by hand. Korean documents are especially hard to find because research on Korean document classification is insufficient. To overcome these problems, we propose a Korean document classification method for semantic retrieval. The method first extracts the TF (term frequency) value and RV (relation value) of the concepts contained in a document, then maps them onto U-WIN, a Korean lexical dictionary, to select the topic of the document. The method makes it possible to classify documents semantically, and its efficiency was shown through experiments.

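The web is a worldwide network that manages media such as text, images, and audio in units of pages and connects distributed information through links. Through its continued development the web has accumulated a vast amount of information, most of which consists of text documents. Users turn to the web to find the specific information they want among all of this, and the web has therefore advanced through continued attempts and much research aimed at retrieving information that fits user needs. Existing approaches, such as methods based on probability, statistical techniques, vector similarity, and Bayesian automatic document classification, cannot accurately handle the semantic topic or characteristics of a document, so users have to search again. In particular, little research has been done on classifying Korean documents, which makes retrieval even harder. To address these problems, this paper extracts the TF (Term Frequency) of the concepts that appear in a document and the degree to which each is related to its neighboring concepts (RV: Relation Value), for efficient and semantic classification of Korean documents. The extracted keywords are then mapped onto U-WIN, a Korean lexical dictionary, to select the topic of the document, and web documents are classified by the classification method presented in the paper. This approach uses the relations among the concepts in a document to select its topic and makes semantic classification of documents possible.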
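
The pipeline outlined in the abstract (count concept frequencies, measure how related each concept is to the others in the document, then pick a topic) can be illustrated with a small sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption: `TOY_NETWORK` is a hypothetical stand-in for U-WIN, RV is approximated as the number of other document concepts directly related to a concept, and the score `TF * (1 + RV)` is an illustrative combination rather than the authors' definition.

```python
from collections import Counter

# Hypothetical stand-in for a lexical network such as U-WIN:
# each concept maps to the set of concepts it is directly related to.
TOY_NETWORK = {
    "bank": {"money", "loan", "river"},
    "money": {"bank", "loan"},
    "loan": {"bank", "money"},
    "river": {"bank", "water"},
    "water": {"river"},
}


def term_frequency(concepts):
    """Count how often each concept occurs in the document."""
    return Counter(concepts)


def relation_value(concepts, network):
    """Assumed RV: number of other distinct document concepts related to each concept."""
    present = set(concepts)
    return {c: len(network.get(c, set()) & (present - {c})) for c in present}


def select_topic(concepts, network):
    """Rank concepts by the assumed score TF * (1 + RV) and return the best one."""
    tf = term_frequency(concepts)
    rv = relation_value(concepts, network)
    scores = {c: tf[c] * (1 + rv[c]) for c in tf}
    # Iterate in sorted order so ties are broken alphabetically.
    topic = max(sorted(scores), key=lambda c: scores[c])
    return topic, scores


doc = ["bank", "money", "loan", "bank", "water", "bank"]
topic, scores = select_topic(doc, TOY_NETWORK)
print(topic, scores[topic])
```

In this toy document, "water" occurs as often as "money" or "loan" but has no related concept present, so it scores lower. That is the intuition behind combining frequency with relatedness, although the weighting actually used in the paper may differ.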

Keywords

TF
