• Title/Summary/Keyword: document classification

An Analytical Study on Performance Factors of Automatic Classification based on Machine Learning (기계학습에 기초한 자동분류의 성능 요소에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for Information Management / v.33 no.2 / pp.33-59 / 2016
  • This study examined the factors affecting the performance of machine-learning-based automatic classification of domestic conference papers. In particular, using the Rocchio algorithm to automatically assign class labels to papers in the Proceedings of the Conference of the Korean Society for Information Management, I investigated the characteristics of the key factors (classifier formation methods, training set size, weighting schemes, and label assignment methods) through a range of experiments. The results show that it is more effective to apply parameters ($\beta$, $\lambda$) and a training set size (more than 5 years) suited to the classification environment and the properties of the document set. When performance is equivalent, simpler methods (single weighting schemes) are more efficient. Also, because classifying domestic papers is a multi-label task in which more than one label is assigned to an article, an optimal classification model should be developed from the characteristics of these key factors with this environment in mind.
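
As a rough illustration of the Rocchio-style, multi-label setup described in this abstract, the sketch below builds one prototype vector per class and assigns every label whose prototype is similar enough to the document. The abstract does not define $\beta$ and $\lambda$, so this assumes the textbook formulation: `beta` weights the centroid of in-class documents and `gamma` (a hypothetical name) weights the centroid of out-of-class documents. The default weights and the similarity `threshold` are illustrative values, not the paper's.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors) or 1
    return {term: w / n for term, w in total.items()}


def rocchio_prototype(positives, negatives, beta=16.0, gamma=4.0):
    """Class prototype: beta * mean(positives) - gamma * mean(negatives)."""
    pos, neg = centroid(positives), centroid(negatives)
    proto = {}
    for term in set(pos) | set(neg):
        value = beta * pos.get(term, 0.0) - gamma * neg.get(term, 0.0)
        if value > 0:  # assumption: negative weights are clipped to zero
            proto[term] = value
    return proto


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def assign_labels(doc, prototypes, threshold=0.2):
    """Multi-label assignment: keep every class above the threshold."""
    return sorted(
        label for label, proto in prototypes.items()
        if cosine(doc, proto) >= threshold
    )
```

Here `prototypes` is a dict mapping each class label to the output of `rocchio_prototype`, built from that class's training vectors and the vectors of all other classes. Thresholding is only one possible label assignment method; assigning the top-ranked classes is another.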

A Research on Utilization of KDC Based on Literary Warrant (문헌적 근거에 기반한 한국십진분류법(KDC) 활용현황에 대한 연구)

  • Kim, Sungwon
    • Journal of the Korean Society for Library and Information Science / v.55 no.2 / pp.25-50 / 2021
  • A general-purpose classification scheme encompasses all subject areas. While the scheme as a whole is constructed by library studies experts, the structure and preparation of each subject area's classification should draw on that subject itself. For the whole system to be a practical and useful classification scheme, rather than a simple collection of per-subject schemes, rules are needed for properly distributing classification items and the collections assigned to them. The rule that sets the distribution of items based on the volume of the document collection is called 'literary warrant'. This study examines how classification items are actually assigned to information resources when the Korean Decimal Classification (KDC) is applied, and then suggests ways to improve these practices.

Reputation Analysis of Document Using Probabilistic Latent Semantic Analysis Based on Weighting Distinctions (가중치 기반 PLSA를 이용한 문서 평가 분석)

  • Cho, Shi-Won;Lee, Dong-Wook
    • The Transactions of The Korean Institute of Electrical Engineers / v.58 no.3 / pp.632-638 / 2009
  • Probabilistic Latent Semantic Analysis (PLSA) has many applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. In this paper, we propose an algorithm that uses a weighted PLSA model to find contextual phrases and opinions in documents. Traditional keyword search cannot find the semantic relations between phrases; overcoming this limitation requires techniques for automatically classifying those relations. Through experiments, we show that the proposed algorithm discovers semantic relations between phrases well and presents them in the vector-space model. The proposed algorithm can support a variety of analyses, including document classification, online reputation analysis, and collaborative recommendation.

Development of A Web Mining System Based On Document Similarity (문서 유사도 기반의 웹 마이닝 시스템 개발)

  • 이강찬;민재홍;박기식;임동순;우훈식
    • The Journal of Society for e-Business Studies / v.7 no.1 / pp.75-86 / 2002
  • In this study, drawing on our development experience, we propose design issues and a structure for a web mining system and develop such a system for knowledge integration in World Wide Web environments. The developed system has three main functions: 1) gathering documents using a search agent; 2) determining similarity coefficients between any two documents from term frequencies; and 3) clustering documents based on the similarity coefficients. We believe the system can be used for knowledge discovery in relatively narrow domains, such as news classification and index term generation in knowledge management.
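
Functions 2) and 3) of this system can be sketched as follows. The abstract does not say which similarity coefficient or clustering procedure the authors used, so this assumes cosine similarity over raw term frequencies and a simple single-link grouping with a fixed `threshold`; whitespace tokenization is also an assumption.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text):
    """Count whitespace-separated, lower-cased terms."""
    return Counter(text.lower().split())


def similarity(tf_a, tf_b):
    """Cosine similarity coefficient between two term-frequency vectors."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm = sqrt(sum(c * c for c in tf_a.values())) * sqrt(sum(c * c for c in tf_b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.5):
    """Single-link grouping: join documents linked by any pair at or above the threshold."""
    tfs = [term_frequencies(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if similarity(tfs[i], tfs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

`cluster` returns lists of document indices, one list per group. Comparing every pair of documents is quadratic in the collection size, which suits the relatively narrow domains the abstract mentions.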

Classification of Form-based Documents by Partitioned Feature Extraction (분할 특징 추출에 의한 양식 문서의 분류)

  • 정현철;이종현;최영우;김재희
    • Proceedings of the IEEK Conference / 1999.06a / pp.520-523 / 1999
  • Form-based documents in particular are easily understood and quickly processed, and are therefore used more than general documents. In this paper, we propose a method that classifies documents with a minimal set of features, unlike earlier methods that use all possible features. To exploit this property, a document is first partitioned into areas of a certain shape and size, and features are then extracted from the partitioned areas. The partitioned areas can also be sorted, because each one differs in significance as a source of features. In conclusion, extracting features from a partitioned document reduces processing time by shrinking the search area.

Efficient Document Classification for Web Document Collection (웹 문서 수집을 위한 효율적인 문서 분류)

  • Lee, Jung-Hun;Cheon, Suh-Hyun;Kim, Sun-Hee
    • Proceedings of the Korean Information Science Society Conference / 2006.10b / pp.397-401 / 2006
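  • To retrieve only the information users want from web documents of diverse formats, it has become essential to classify, collect, and manage web documents by topic. That is, for accurate and fast information retrieval, web documents should be classified according to their format as they are collected. Accordingly, if the formats that make up documents in the web environment are divided into text and image data and a classification technique suited to each format is used, accurate information retrieval becomes possible. This paper proposes a topic-centered hybrid web document classification method that uses text and URLs. The method applies topic-based document classification to text content and, when the text information is of low utility, classifies and collects documents using the topic distribution of the URL. This makes it possible to classify web documents of various formats and improves the accuracy of topic-based document classification.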

Comparison of Document Features Extraction Methods for Automatic Classification of Real World FAQ Mails (실세계의 FAQ 메일 자동분류를 위한 문서 특징추출 방법의 성능 비교)

  • 홍진혁;류중원;조성배
    • Proceedings of the Korean Information Science Society Conference / 2001.04b / pp.271-273 / 2001
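  • The importance of automatic document classification has recently become widely recognized, and a variety of studies are under way. This paper implements various feature extraction methods for effective automatic classification of Korean documents and presents an efficient feature extraction method for real query mails. The experiments used seven feature extraction methods, including document frequency, information gain, mutual information, and $\chi^2$. When applied to 463 real test query mails, the $\chi^2$ method performed best, with a recognition rate of 74.7%. In contrast, information gain, one of the most frequently used methods along with $\chi^2$, reached a recognition rate of at most 40.6%.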
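
For reference, the $\chi^2$ score of a term for one class can be computed from a 2x2 contingency table of document counts. The sketch below uses the standard formula and is not taken from the paper; the function names and the representation of each document as a set of terms are assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_terms(docs, labels, target):
    """Score every term against one target class, highest score first."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        pairs = list(zip(docs, labels))
        n11 = sum(1 for d, y in pairs if term in d and y == target)
        n10 = sum(1 for d, y in pairs if term in d and y != target)
        n01 = sum(1 for d, y in pairs if term not in d and y == target)
        n00 = sum(1 for d, y in pairs if term not in d and y != target)
        scores[term] = chi_square(n11, n10, n01, n00)
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A term that appears in every document of the class and nowhere else gets the maximum score; for example, `chi_square(2, 0, 0, 2)` is `4.0`. The statistic is symmetric, so terms that reliably indicate absence from the class also score high.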

Empirical Analysis & Comparisons of Web Document Classification Methods (문서분류 기법을 이용한 웹 문서 분류의 실험적 비교)

  • Lee, Sang-Soon;Choi, Jung-Min;Jang, Geun;Lee, Byung-Soo
    • Proceedings of the Korean Information Science Society Conference / 2002.10d / pp.154-156 / 2002
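  • With the growth of the Internet, we can obtain a great deal of information and knowledge online, where it exists as web documents such as HTML pages, newsgroup articles, and e-mail. These web documents need to be classified for various purposes, and systems that apply such classification include Personal WebWatcher, InfoFinder, Webby, and NewT. Web document classification systems use document classification techniques to determine the class a web document belongs to; representative algorithms are Naive Bayesian, k-NN (k-Nearest Neighbor), and TFIDF (Term Frequency Inverse Document Frequency). This paper compares and evaluates the performance of each of these document classification algorithms on web documents.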
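
To make two of the named techniques concrete, the sketch below combines TF-IDF term weighting with a k-NN majority vote. It is a generic illustration, not the experimental setup of the paper: the `tf * log(N / df)` weighting, cosine similarity, and the default `k=3` are all assumptions.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(token_lists):
    """Weight each term by tf * log(N / df); return the vectors and the idf table."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: log(n / count) for term, count in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(query_tokens, train_tokens, train_labels, k=3):
    """Label a query by majority vote among its k most similar training documents."""
    vectors, idf = tfidf_vectors(train_tokens)
    # Terms unseen in training have no idf and are ignored.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    ranked = sorted(
        zip(vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Each training document is passed as a list of tokens with a parallel list of labels. A Naive Bayesian classifier would replace the similarity ranking with per-class term probabilities.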

Automatic Classification of Web Documents Using the Document Fingerprint Method (문서지문기법을 이용한 웹 문서의 자동 분류)

  • Kim Jin-Hwa
    • Proceedings of the Korean Operations and Management Science Society Conference / 2004.10a / pp.407-429 / 2004
  • As web documents increase explosively with the rapid development of electronic documents, an efficient system for classifying documents automatically is required. In this study, a new document classification method, called the Document Fingerprint Method, is proposed to classify web documents automatically and efficiently. The performance of the proposed method is evaluated along with existing methods such as a keyword-based method, a weighted keyword-based method, neural networks, and decision trees. An experiment is designed with 10 document categories and 59 randomly selected words. The results show that the proposed algorithm classifies better than the other methods. Its most important advantage is that it works well without limits on the number of words in a document.

Machine Printed and Handwritten Text Discrimination in Korean Document Images

  • Trieu, Son Tung;Lee, Guee Sang
    • Smart Media Journal / v.5 no.3 / pp.30-34 / 2016
  • Many Korean documents today need to be identified as either printed or handwritten text. Early identification methods use structural features, which are simple and easy to apply to text in a specific font, but their performance depends on the font type and the characteristics of the text. Recently, the bag-of-words model has been used for identification; it can be invariant to changes in font size and to distortions or modifications of the text. The bag-of-words method comprises three steps: word segmentation using connected component grouping, feature extraction, and finally classification using an SVM (Support Vector Machine). In this paper, a bag-of-words method using SURF (Speeded Up Robust Features) is proposed for identifying machine-printed and handwritten text in Korean documents. The experiment shows that the proposed method outperforms methods based on structural features.