• Title/Summary/Keyword: 자동범주화

Search Result 59, Processing Time 0.023 seconds

An Experimental Study on the Performance Improvement of Automatic Classification for the Articles of Korean Journals Based on Controlled Keywords in International Database (해외 데이터베이스의 통제키워드에 기초한 국내 학술지 논문의 자동분류 성능 향상에 관한 실험적 연구)

  • Kim, Pan Jun;Lee, Jae Yun
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.48 no.3
    • /
    • pp.491-510
    • /
    • 2014
  • As a major factor for efficient management and retrieval of the articles in databases, keywords are classified into uncontrolled keywords and controlled keywords. Most of Korean scholarly databases fail to provide controlled vocabularies to indexing research articles which help users to retrieve relevant papers exhaustively. In this paper, we carried out automatic descriptor assignment experiments to Korean articles using automatic classifiers learned with descriptors in international database. The results of the experiments show that the classifier learning with descriptors in international database can potentially offer controlled vocabularies to Korean scholarly articles having English s. Also, we sought to improve the performance of automatic descriptor assignment using various classifiers and combination of them.

An Empirical Study on Improving the Performance of Text Categorization Considering the Relationships between Feature Selection Criteria and Weighting Methods (자질 선정 기준과 가중치 할당 방식간의 관계를 고려한 문서 자동분류의 개선에 대한 연구)

  • Lee Jae-Yun
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.39 no.2
    • /
    • pp.123-146
    • /
    • 2005
  • This study aims to find consistent strategies for feature selection and feature weighting methods, which can improve the effectiveness and efficiency of kNN text classifier. Feature selection criteria and feature weighting methods are as important factor as classification algorithms to achieve good performance of text categorization systems. Most of the former studies chose conflicting strategies for feature selection criteria and weighting methods. In this study, the performance of several feature selection criteria are measured considering the storage space for inverted index records and the classification time. The classification experiments in this study are conducted to examine the performance of IDF as feature selection criteria and the performance of conventional feature selection criteria, e.g. mutual information, as feature weighting methods. The results of these experiments suggest that using those measures which prefer low-frequency features as feature selection criterion and also as feature weighting method. we can increase the classification speed up to three or five times without loosing classification accuracy.

Design of Automatic Records Classification System Using Contextual Information (맥락정보를 이용한 기록 자동분류시스템 설계)

  • Jang, Ji-Sook;Rieh, Hae-Young
    • Journal of Korean Society of Archives and Records Management
    • /
    • v.9 no.1
    • /
    • pp.151-173
    • /
    • 2009
  • The classification in the Records and Archives Sciences focuses on the contextual information in producing and utilizing records rather than their contents. This study aimed at designing an automatic records classification system to enable an automatic classification focusing on the aggregation of the context of records rather than the contents of individual record in the classification scheme, structured on the basis of business activities analyses for records reflecting the business activities. The automatic records classification system was designed to have mutual supplements by constructing the classification scheme and thesaurus together as the classification reference, as well as the aggregation of records that have been already classified. Additionally included are plans to apply the classified contextual information of records to the classification reference on the real-time base right after the category assignment of records to be classified. Although there are limitations as the designed system depends on the quality of the contextual information, it is considered that the system could lead to ensure that the contextual information of records should be more substantial.

An Experimental Study on the Automatic Classification of Korean Journal Articles through Feature Selection (자질선정을 통한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.39 no.1
    • /
    • pp.69-90
    • /
    • 2022
  • As basic data that can systematically support and evaluate R&D activities as well as set current and future research directions by grasping specific trends in domestic academic research, I sought efficient ways to assign standardized subject categories (control keywords) to individual journal papers. To this end, I conducted various experiments on major factors affecting the performance of automatic classification, focusing on feature selection techniques, for the purpose of automatically allocating the classification categories on the National Research Foundation of Korea's Academic Research Classification Scheme to domestic journal papers. As a result, the automatic classification of domestic journal papers, which are imbalanced datasets of the real environment, showed that a fairly good level of performance can be expected using more simple classifiers, feature selection techniques, and relatively small training sets.

An Analytical Study on Automatic Classification of Domestic Journal articles Based on Machine Learning (기계학습에 기초한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.2
    • /
    • pp.37-62
    • /
    • 2018
  • This study examined the factors affecting the performance of automatic classification based on machine learning for domestic journal articles in the field of LIS. In particular, In view of the classification performance that assigning automatically the class labels to the articles in "Journal of the Korean Society for Information Management", I investigated the characteristics of the key factors(weighting schemes, training set size, classification algorithms, label assigning methods) through the diversified experiments. Consequently, It is effective to apply each element appropriately according to the classification environment and the characteristics of the document set, and a fairly good performance can be obtained by using a simpler model. In addition, the classification of domestic journals can be considered as a multi-label classification that assigns more than one category to a specific article. Therefore, I proposed an optimal classification model using simple and fast classification algorithm and small learning set considering this environment.

An Analytical Study on Performance Factors of Automatic Classification based on Machine Learning (기계학습에 기초한 자동분류의 성능 요소에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.33 no.2
    • /
    • pp.33-59
    • /
    • 2016
  • This study examined the factors affecting the performance of automatic classification for the domestic conference papers based on machine learning techniques. In particular, In view of the classification performance that assigning automatically the class labels to the papers in Proceedings of the Conference of Korean Society for Information Management using Rocchio algorithm, I investigated the characteristics of the key factors (classifier formation methods, training set size, weighting schemes, label assigning methods) through the diversified experiments. Consequently, It is more effective that apply proper parameters (${\beta}$, ${\lambda}$) and training set size (more than 5 years) according to the classification environments and properties of the document set. and If the performance is equivalent, I discovered that the use of the more simple methods (single weighting schemes) is very efficient. Also, because the classification of domestic papers is corresponding with multi-label classification which assigning more than one label to an article, it is necessary to develop the optimum classification model based on the characteristics of the key factors in consideration of this environment.

A Study on Book Categorization in Social Sciences Using kNN Classifiers and Table of Contents Text (목차 정보와 kNN 분류기를 이용한 사회과학 분야 도서 자동 분류에 관한 연구)

  • Lee, Yong-Gu
    • Journal of the Korean Society for information Management
    • /
    • v.37 no.1
    • /
    • pp.1-21
    • /
    • 2020
  • This study applied automatic classification using table of contents (TOC) text for 6,253 social science books from a newly arrived list collected by a university library. The k-nearest neighbors (kNN) algorithm was used as a classifier, and the ten divisions on the second level of the DDC's main class 300 given to books by the library were used as classes (labels). The features used in this study were keywords extracted from titles and TOCs of the books. The TOCs were obtained through the OpenAPI from an Internet bookstore. As a result, it was found that the TOC features were good for improving both classification recall and precision. The TOC was shown to reduce the overfitting problem of imbalanced data with its rich features. Law and education have high topic specificity in the field of social sciences, so the only title features can bring good classification performance in these fields.

An Automatic Classification System of Korean Documents Using Weight for Keywords of Document and Word Cluster (문서의 주제어별 가중치 부여와 단어 군집을 이용한 한국어 문서 자동 분류 시스템)

  • Hur, Jun-Hui;Choi, Jun-Hyeog;Lee, Jung-Hyun;Kim, Joong-Bae;Rim, Kee-Wook
    • The KIPS Transactions:PartB
    • /
    • v.8B no.5
    • /
    • pp.447-454
    • /
    • 2001
  • The automatic document classification is a method that assigns unlabeled documents to the existing classes. The automatic document classification can be applied to a classification of news group articles, a classification of web documents, showing more precise results of Information Retrieval using a learning of users. In this paper, we use the weighted Bayesian classifier that weights with keywords of a document to improve the classification accuracy. If the system cant classify a document properly because of the lack of the number of words as the feature of a document, it uses relevance word cluster to supplement the feature of a document. The clusters are made by the automatic word clustering from the corpus. As the result, the proposed system outperformed existing classification system in the classification accuracy on Korean documents.

  • PDF

A preliminary Study on Text Categorization of Book using Table of Contents and Book Description (목차, 책 소개를 이용한 단행본 문서 범주화에 관한 기초연구)

  • Do, Hyun-Ho;Lee, Yong-Gu
    • Proceedings of the Korean Society for Information Management Conference
    • /
    • 2014.08a
    • /
    • pp.127-130
    • /
    • 2014
  • 이 연구에서는 도서관의 주요 장서에 해당하는 단행본 도서에 대한 자동 분류를 적용가능한지 알아보고자 하였다. 분류자질로 메타데이터인 서명, 목차, 책 소개를 사용하였으며, 다양한 자질 가중치를 적용하여 581건의 단행본 도서를 통해 kNN 분류기의 분류성능을 파악하였다. 실험 결과 이들 메타데이터를 모두 사용하였을 때 가장 좋은 분류성능을 가져왔으며, 실험문헌집단의 규모가 작은 한계가 있지만 로그 TF를 취한 가중치 방법이 좋은 성능을 가져왔다.

  • PDF

Improving the Performance of SVM Text Categorization with Inter-document Similarities (문헌간 유사도를 이용한 SVM 분류기의 문헌분류성능 향상에 관한 연구)

  • Lee, Jae-Yun
    • Journal of the Korean Society for information Management
    • /
    • v.22 no.3 s.57
    • /
    • pp.261-287
    • /
    • 2005
  • The purpose of this paper is to explore the ways to improve the performance of SVM (Support Vector Machines) text classifier using inter-document similarities. SVMs are powerful machine learning systems, which are considered as the state-of-the-art technique for automatic document classification. In this paper text categorization via SVMs approach based on feature representation with document vectors is suggested. In this approach, document vectors instead of index terms are used as features, and vector similarities instead of term weights are used as feature values. Experiments show that SVM classifier with document vector features can improve the document classification performance. For the sake of run-time efficiency, two methods are developed: One is to select document vector features, and the other is to use category centroid vector features instead. Experiments on these two methods show that we can get improved performance with small vector feature set than the performance of conventional methods with index term features.