• Title/Summary/Keyword: document classification

Classification Analysis in Information Retrieval by Using Gauss Patterns

  • Lee, Jung-Jin;Kim, Soo-Kwan
    • Communications for Statistical Applications and Methods / v.9 no.1 / pp.1-11 / 2002
  • This paper discusses problems of the Poisson Mixture model, which is widely used to identify effective words for judging whether a document is relevant. The Gamma Distribution model and the Gauss Patterns model are studied as alternatives to the Poisson Mixture model. Classification experiments using the TREC sub-collection WSJ[1,2] with MGQUERY and the AidSearch3.0 system are discussed.

A Methodology for Automatic Hierarchy Definition of Sentences in Engineering Documents (엔지니어링 문서의 문장 자동 계층정의 방법론)

  • Park, Sang-Il;Kim, Bong-Geun;Kim, Kyeong-Hwan;Lee, Sang-Ho
    • Journal of the Computational Structural Engineering Institute of Korea / v.22 no.4 / pp.323-330 / 2009
  • This paper proposes a methodology for automatically classifying the hierarchy of subtitles in an engineering document, based on the fact that the heading symbols of subtitles represent the hierarchical structure of the document. The proposed methodology is composed of two methods: extracting subtitles from a plain text document and determining the hierarchical structure of the subtitles. The subtitles in a document are extracted by comparing heading symbol patterns with predefined heading symbol groups, and the depth levels of the subtitles are determined by analyzing the relative location of subtitles according to changes in the heading symbol patterns. A prototype module, which transforms a plain text document into a structured XML document in accordance with the hierarchical structure of its subtitles, is developed based on the proposed methodology, and the performance of the module is analyzed with 20 engineering documents.
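
The depth-assignment idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the heading symbol groups (`1.`, `1)`, `(1)`, `a.`), the regular expressions, and the names `HEADING_PATTERNS`, `match_pattern`, and `assign_depths` are all hypothetical. The sketch assumes that a symbol pattern seen for the first time opens a deeper level, and that a pattern seen earlier returns to its level.

```python
import re

# Hypothetical heading symbol groups; the paper's predefined groups are not listed here.
HEADING_PATTERNS = [
    ("num_dot", re.compile(r"^\d+\.\s+")),      # "1. "
    ("num_paren", re.compile(r"^\d+\)\s+")),    # "1) "
    ("paren_num", re.compile(r"^\(\d+\)\s+")),  # "(1) "
    ("alpha_dot", re.compile(r"^[a-z]\.\s+")),  # "a. "
]

def match_pattern(line):
    """Return the name of the heading symbol group a line starts with, or None."""
    for name, rx in HEADING_PATTERNS:
        if rx.match(line):
            return name
    return None

def assign_depths(lines):
    """Assign a depth to each subtitle from changes in heading symbol pattern.

    A pattern not yet on the stack opens a deeper level; a pattern already
    on the stack returns to the level where it first appeared.
    """
    stack = []   # pattern names from the top level down to the current level
    result = []
    for line in lines:
        text = line.strip()
        name = match_pattern(text)
        if name is None:
            continue  # body text, not a subtitle
        if name in stack:
            del stack[stack.index(name) + 1:]  # climb back to that level
        else:
            stack.append(name)                 # new pattern: one level deeper
        result.append((len(stack), text))
    return result

doc = ["1. Scope", "body text", "1) Loads", "(1) Dead load", "2) Materials", "2. Design"]
for depth, title in assign_depths(doc):
    print("  " * (depth - 1) + title)
```

The resulting `(depth, title)` pairs could then be serialized as nested XML elements, which is the output format the abstract mentions.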

Combined Feature Set and Hybrid Feature Selection Method for Effective Document Classification (효율적인 문서 분류를 위한 혼합 특징 집합과 하이브리드 특징 선택 기법)

  • In, Joo-Ho;Kim, Jung-Ho;Chae, Soo-Hoan
    • Journal of Internet Computing and Services / v.14 no.5 / pp.49-57 / 2013
  • A novel approach to feature selection, an important preprocessing task in on-line document classification, is proposed. In previous research, features were selected using information from a single population. In this paper, a mixed feature set is constructed by selecting features from multiple populations as well as from a single population, based on various kinds of information. The mixed feature set consists of two feature sets: the original feature set, made up of the words in the documents, and the transformed feature set, made up of features generated by LSA. A hybrid feature selection method that uses both filter and wrapper methods is applied to obtain an optimal feature set from the mixed feature set. We performed classification experiments using the optimal feature sets obtained. The experiments confirm our expectation that the approach improves classification performance, with accuracy above 90%. In particular, the approach achieves recall and precision above 90% with low deviation between categories.

Local Similarity based Document Layout Analysis using Improved ARLSA

  • Kim, Gwangbok;Kim, SooHyung;Na, InSeop
    • International Journal of Contents / v.11 no.2 / pp.15-19 / 2015
  • In this paper, we propose an efficient document layout analysis algorithm that includes table detection. Typical methods of document layout analysis use the height of, and the gaps between, words or columns. To handle the various styles and sizes of documents, we propose an algorithm that uses the mean value of the distance transform, which represents stroke thickness, and compares it with that of components in the local area. We combine this algorithm with a table detection algorithm that uses the same feature as the text classifier. Table candidates, separators, and big components are isolated from the image using Connected Component Analysis (CCA) and the distance transform. The key idea of text classification is that text components resemble neighboring components in thickness and height. To estimate local similarity, we detect a text region using an adaptive search window size. An improved adaptive run-length smoothing algorithm (ARLSA) is proposed to create proper boundaries between text zones and non-text zones. Results from experiments on the ICDAR2009 page segmentation competition test set and on our own dataset demonstrate the superiority of our method through f-measure comparison with other algorithms.
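
For readers unfamiliar with run-length smoothing, the sketch below shows the basic, non-adaptive form on a single binary row. It is not the improved ARLSA proposed in the paper, whose adaptive thresholds are not described in this abstract; the function name `rlsa_row` and the fixed `threshold` parameter are assumptions made for illustration.

```python
def rlsa_row(row, threshold):
    """Horizontal run-length smoothing on one binary row (1 = ink, 0 = background).

    Background runs of length <= threshold that lie between two ink pixels
    are filled with ink, which merges nearby characters into one block.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out

row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(rlsa_row(row, threshold=2))
```

With `threshold=2` the two-pixel gap is filled and the four-pixel gap is kept. An adaptive variant would choose the threshold per component rather than globally.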

The Block Segmentation and Extraction of Layout Information In Document (문서의 영역분리와 레이아웃 정보의 추출)

  • 조용주;남궁재찬
    • The Journal of Korean Institute of Communications and Information Sciences / v.17 no.10 / pp.1131-1146 / 1992
  • In this paper, we propose a new algorithm for segmenting published documents to obtain their constituent and layout information. First, we perform blocking and labeling on a document scanned at 300 dpi. Second, we classify the blocked document into individual sub-regions. Third, we group the sub-regions into graphic areas and text areas. Finally, we extract information for layout recognition from these data. In an experiment on academic society papers, we obtained a region classification rate and a layout-information extraction rate above 98%.

Designing an expert system for library classification (문헌분류 전문가시스팀의 설계에 대한 연구)

  • 김정현
    • Journal of Korean Library and Information Science Society / v.21 / pp.459-483 / 1994
  • The purpose of the study is to design and implement a prototype expert system for library classification in the literature field of DDC 20. The system consists mainly of a knowledge base, an inference engine, a knowledge acquisition facility, an explanation facility, and a user interface facility. The knowledge base is represented by inference rules and frames. The name file for authors and titles was designed separately. Forward chaining was chosen for the inference engine, and a menu-driven dialog was adopted for the user interface. The conclusions of the study can be summarized as follows: 1) The difficulty of document classification work is due to complex and stringent classification rules. Such problems can be considerably alleviated by using the present system. 2) Even a novice with some knowledge of DDC 20 can easily use the system, and librarians other than professional classifiers can easily become accustomed to classification work. 3) The system can be used as an online classification scheme. 4) By adding any local language other than English or Hangeul to the menu screen, the language problem in classification can be overcome. 5) The system can be employed as a tool for reinforcing classification education as well as for library automation.
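
Forward chaining, the inference technique named in this abstract, can be sketched in a few lines. The rules and fact strings below are hypothetical and are not taken from the system's knowledge base; the function name `forward_chain` is likewise an assumption. Starting from known facts, the loop fires any rule whose premises are all satisfied until no new conclusion can be derived.

```python
def forward_chain(facts, rules):
    """Apply rules until no new fact can be derived.

    Each rule is (premises, conclusion): when every premise is a known fact,
    the conclusion is added to the fact set.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical rules for illustration only; not the system's actual rule base.
rules = [
    (["language:english", "form:poetry"], "class:821"),
    (["language:english", "form:drama"], "class:822"),
    (["class:821", "period:victorian"], "class:821.8"),
]
facts = ["language:english", "form:poetry", "period:victorian"]
print(sorted(forward_chain(facts, rules)))
```

In a menu-driven system of the kind described, the initial facts would come from the user's menu selections rather than from a hard-coded list.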

A study on the expert system for classification of books (분류전문가시스팀에 관한 연구)

  • 김정현
    • Journal of Korean Library and Information Science Society / v.19 / pp.35-57 / 1992
  • This study attempts to provide helpful data for the design and implementation of an expert system for book classification, based on an analysis of various classification expert system models. Following the introduction, the second chapter gives an overview of the concepts and features of an expert system, and the third chapter introduces and analyzes the following concrete cases: (1) ACN System for NC, (2) Expert System for NDC, (3) Expert System for UDC, (4) Herba Medica System, (5) Expert System for IPC, (6) Stratcyclode Project, (7) Expert System for Classification of INIS Database, (8) AutoBC System, etc. In conclusion, for the development of a classification expert system, constructing a new system with an AI language such as Prolog or LISP turned out to be more desirable than employing an expert system shell. In addition, the following requirements must be met: (1) The subject concept elicited from a document should be accurate. (2) Not only domain knowledge but also knowledge covering all subjects should be represented in the knowledge bases. (3) The knowledge bases should be organized so that the characteristics of classification knowledge are well defined. (4) A rule base consisting of accurate classification rules should be built. (5) The desired classification code should be generated immediately.

Comparison of Performance Factors for Automatic Classification of Records Utilizing Metadata (메타데이터를 활용한 기록물 자동분류 성능 요소 비교)

  • Young Bum Gim;Woo Kwon Chang
    • Journal of the Korean Society for Information Management / v.40 no.3 / pp.99-118 / 2023
  • The objective of this study is to identify performance factors in the automatic classification of records by utilizing metadata that contains the contextual information of records. For this study, we collected 97,064 records of original textual information from Korean central administrative agencies in 2022. Various classification algorithms, data selection methods, and feature extraction techniques were applied and compared to identify the best-performing techniques. The results showed that, among classification algorithms, Random Forest displayed higher performance, and among feature extraction techniques, the TF method proved to be the most effective. The minimum data quantity of unit tasks had minimal influence on performance. Adding features improved performance, while removing them had a discernible negative impact.

Movie Review Classification Based on a Multiple Classifier

  • Tsutsumi, Kimitaka;Shimada, Kazutaka;Endo, Tsutomu
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.481-488 / 2007
  • In this paper, we propose a method for classifying movie review documents as positive or negative opinions. There are several approaches to classifying documents. Previous studies, however, used only a single classifier for the task. We describe a multiple classifier for review document classification. The method consists of three classifiers based on SVMs, maximum entropy (ME), and score calculation. We apply two voting methods and SVMs to integrate the single classifiers. The integrated methods improved accuracy compared with each of the three single classifiers. The experimental results show the effectiveness of our method.
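
The integration step can be illustrated with two common voting schemes. The abstract does not say which two voting methods the authors used, so simple majority voting and weighted voting are assumptions here, as are the function names, the sample labels, and the weights.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the classifiers' outputs."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Stand-in outputs for one review from three hypothetical classifiers
# (for example an SVM, a maximum entropy model, and a score-based classifier).
outputs = ["positive", "negative", "positive"]
print(majority_vote(outputs))
print(weighted_vote(outputs, [0.3, 0.9, 0.4]))
```

The two schemes can disagree: here the majority says positive, while the weighted vote follows the single high-confidence negative classifier.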

One-Class Document Classification using Pseudo Negative Examples (One-class 문서 분류를 위한 가상 부정 예제의 사용)

  • Song Ho-Jin;Kang In-Su;Na Seung-Hoon;Lee Jong-Hyeok
    • Proceedings of the Korean Information Science Society Conference / 2005.07b / pp.469-471 / 2005
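  • The one-class classification problem in document classification is to build a single category and, given a new document, decide whether it belongs to that category. Unlike conventional classification over multiple categories, one-class classification trains only on documents related to the single predefined category, so determining the category boundary is very difficult and is also a critical factor in classifier performance. Previous work addressed the problem by treating some of the examples of interest as negative examples, which turns the one-class problem into a two-class problem for training. This paper goes a step further by additionally constructing new pseudo negative examples for training, and evaluates categorization performance with an SVM.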
