• Title/Summary/Keyword: document classifier (문서 분류기)

Search Result 191, Processing Time 0.029 seconds

Semantic-based Genetic Algorithm for Feature Selection (의미 기반 유전 알고리즘을 사용한 특징 선택)

  • Kim, Jung-Ho;In, Joo-Ho;Chae, Soo-Hoan
    • Journal of Internet Computing and Services / v.13 no.4 / pp.1-10 / 2012
  • This paper proposes an optimal feature selection method that takes the semantics of features into account, as a preprocessing step for document classification. Feature selection, which consists of removing redundant features and selecting essential ones, is a very important part of classification. LSA (Latent Semantic Analysis) is adopted to capture the meaning of the features. Because basic LSA is not specialized for feature selection, a supervised LSA suited to classification problems is used instead. A GA (Genetic Algorithm) is then applied to the features obtained from supervised LSA to select a better feature subset. Finally, documents are projected onto the selected feature subset and classified with an SVM (Support Vector Machine). Selecting an optimal feature subset with this hybrid of supervised LSA and GA is expected to yield high classification performance and efficiency. Its effectiveness is demonstrated through experiments on internet news classification with a small number of features.
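
The GA stage of the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the supervised LSA step and the SVM are omitted, and `toy_fitness`, `USEFUL`, and all parameter values are hypothetical stand-ins for "classification accuracy on the selected features".

```python
import random

def ga_select(n_features, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve 0/1 feature masks and return the fittest mask seen."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Truncation selection: the fitter half become parents.
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)  # keep the best mask ever seen
    return best

# Hypothetical stand-in for classifier accuracy: features 0, 3 and 5 are
# "useful", and every selected feature costs a small penalty.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)

best_mask = ga_select(n_features=8, fitness=toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

In a real pipeline the fitness function would train and validate a classifier on the masked features, which is far more expensive than this toy score.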

Modified Naïve Bayes Classifier for Categorizing Questions in Question-Answering Community (확장된 나이브 베이즈 분류기를 활용한 질문-답변 커뮤니티의 질문 분류)

  • Yeon, Jong-Heum;Shim, Jun-Ho;Lee, Sang-Goo
    • Journal of KIISE:Computing Practices and Letters / v.16 no.1 / pp.95-99 / 2010
  • Social media refers to content created by users, such as blogs, social networks, and wikis. Recently, question-answering (QA) communities, in which users share information through questions and answers, have come to be regarded as a kind of social media, and over the past decade they have become a huge source of information. However, as the number of question-answer pairs in QA communities grows, it becomes hard for users to find the pair that exactly matches their needs. This paper proposes an approach that classifies a question into one of three categories (information, opinion, and suggestion) according to its purpose, to support more accurate information retrieval. Specifically, the approach is based on a modified Naïve Bayes classifier that uses structural characteristics of QA documents to improve classification accuracy. In our experiments, it achieved a classification accuracy of about 71.2%.
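
For orientation, here is a minimal sketch of a plain multinomial Naïve Bayes classifier with add-one smoothing over the three categories named above. It does not include the structural modification the paper proposes, which the abstract does not detail, and the training questions are invented toy data.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, label). Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy questions, two per category.
examples = [
    ("where can i find the manual".split(), "information"),
    ("what time does the store open".split(), "information"),
    ("what do you think about this phone".split(), "opinion"),
    ("which one do you think is better".split(), "opinion"),
    ("can you recommend a good laptop".split(), "suggestion"),
    ("please recommend a place to visit".split(), "suggestion"),
]
model = train_nb(examples)
print(predict_nb(model, "recommend a laptop".split()))  # suggestion
print(predict_nb(model, "what do you think".split()))   # opinion
```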

Record management system and Registry System in the Gabo Reform (갑오개혁기 기록관리제도와 등기실체제(Registry System))

  • Lee, Seung-Hwi
    • The Korean Journal of Archival Studies / no.17 / pp.85-114 / 2008
  • One feature of record management during the Gabo Reform is that the documents office controlled the production and distribution of records. Records whose business had been completed were sent to the record office, where they were classified and arranged. Previous research held that this record management system was introduced from Japan. This article clarifies that the new record management system established in Japan through the Meiji Restoration was itself introduced from the German (Prussian) registry system of the time. However, the German registry system managed current records and rested on a modern record management system that opened records to the public through archives. Japan adopted only the registry system, the German system for managing current records, and did not establish archives under the Meiji regime. The same was true of the Joseon Dynasty during the Gabo Reform. Therefore, the record-related regulations of the Gabo Reform cannot be judged a modern system: they contained no provisions on people's rights or on opening records to the public, which are the defining features of modern record regulations. The significance of the record system during the Gabo Reform is that the value of records and the names of organizations coincided with the record life cycle. The documents office managed current records, and the record office classified and filed closed records. The concept "current record = document = documents office, non-current record = record = record office" has not carried over to the present; today the term 'record' is used for current and non-current records without distinction.

Cluster-Based Selection of Multiple Query Examples for Active Learning (능동적 학습을 위한 군집화 기반 복수 문의 예제 선정)

  • Gang, Jae-Ho;Ryu, Gwang-Ryeol;Gwon, Hyeok-Cheol
    • Proceedings of the Korea Inteligent Information System Society Conference / 2005.05a / pp.240-249 / 2005
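  • When the goal is to identify a user's areas of interest online in order to provide personalized services, active learning is appropriate because it can learn efficiently from a small number of training examples. To apply active learning effectively it is important to select examples that are highly worth querying the user about, but for the user's convenience the number of queries should also be kept as small as possible. To obtain many training examples while reducing the number of queries, it is effective to present multiple query examples to the user at once and have the user indicate whether each is of interest. This paper proposes applying weighted clustering to select multiple query examples that are highly worth asking about in active learning. The proposed method first estimates each example's value as a query example and, using that value as a weight, performs clustering to form sets of relatively similar examples. It then selects the most typical example from each resulting cluster as a query example. Each selected query example is thus highly worth asking about, while the examples queried together are sufficiently different from one another, yielding training examples that are more useful for learning. In experiments on a document classification problem, the proposed method achieved better learning performance than simply selecting the multiple examples with the highest query value, and the classifier's performance degraded little even when the number of examples queried at once was increased.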

Recognition of Word-level Attributes in Machine-printed Document Images (인쇄 문서 영상의 단어 단위 속성 인식)

  • Gwak, Hui-Gyu;Kim, Su-Hyeong
    • Journal of KIISE:Software and Applications / v.28 no.5 / pp.412-421 / 2001
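  • This paper proposes a method for extracting attribute information for individual words in document images. Word-level attribute recognition has a range of applications, including improving the accuracy and speed of word image matching, raising recognition rates in OCR systems, and document reproduction; through the extraction of meta-information it can also be used for image retrieval and summary generation. The proposed system considers five attributes of a word image: language (Korean, English), style (bold, italic, regular, underlined), character size (10, 12, 14 points), number of characters (Korean: 2, 3, 4, 5; English: 4, 5, 6, 7, 8, 9, 10), and typeface (Myeongjo, Gothic). For attribute recognition it uses two features for language, three for style, one each for character size and character count, one for Korean typeface, and two for English typeface. The classifier is a hierarchical arrangement of neural networks, quadratic discriminant functions (QDF), and linear discriminant functions (LDF). Experiments on 26,400 word images covering combinations of the five attributes demonstrate that the proposed method achieves good attribute recognition performance with only a small number of features.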

Cluster-Based Selection of Diverse Query Examples for Active Learning (능동적 학습을 위한 군집화 기반의 다양한 복수 문의 예제 선정 방법)

  • Kang, Jae-Ho;Ryu, Kwang-Ryel;Kwon, Hyuk-Chul
    • Journal of Intelligence and Information Systems / v.11 no.1 / pp.169-189 / 2005
  • In order to derive a better classifier with a limited number of training examples, active learning alternately repeats the querying stage for category labeling and the subsequent learning stage for rebuilding the classifier with the newly expanded training set. To relieve the user from the burden of labeling, especially in an on-line environment, it is important to minimize the number of querying steps as well as the total number of query examples. We can derive a good classifier in a small number of querying steps, using only a small number of examples, if we can select multiple diverse, representative, and ambiguous examples to present to the user at each querying step. In this paper, we propose a cluster-based batch query selection method which can select diverse, representative, and highly ambiguous examples for efficient active learning. Experiments with various text data sets have shown that our method can derive a better classifier than other methods which take into account only ambiguity as the criterion for selecting multiple query examples.
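
One way to combine the three criteria named above (ambiguous, diverse, representative) is sketched below. This is an illustrative reading, not the authors' algorithm: the plain k-means routine, the `n_candidates` cut-off, and the toy points and ambiguity scores are all assumptions.

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(i)
        for j, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it was
                dims = range(len(points[0]))
                centroids[j] = [sum(points[i][d] for i in members) / len(members)
                                for d in dims]
    return centroids, clusters

def select_batch(points, ambiguity, batch_size, n_candidates):
    # 1. Ambiguous: keep the examples the classifier is least sure about.
    ranked = sorted(range(len(points)), key=lambda i: ambiguity[i], reverse=True)
    cand = ranked[:n_candidates]
    # 2. Diverse: split the candidates into batch_size clusters.
    centroids, clusters = kmeans([points[i] for i in cand], batch_size)
    # 3. Representative: take the candidate closest to each centroid.
    batch = []
    for centroid, members in zip(centroids, clusters):
        if members:
            m = min(members, key=lambda j: dist2(points[cand[j]], centroid))
            batch.append(cand[m])
    return batch

# Toy data: two tight groups of ambiguous points plus one confident point.
points = [(0, 0), (0.2, 0), (0.1, 0.1), (5, 5), (5.2, 5), (5.1, 5.1), (2.5, 2.5)]
ambiguity = [0.9, 0.8, 0.95, 0.9, 0.85, 0.9, 0.1]
batch = select_batch(points, ambiguity, batch_size=2, n_candidates=6)
print(batch)  # one index from each group
```

Selecting by ambiguity alone would tend to return near-duplicates from a single group; clustering first spreads the batch across groups.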

XML Document Keyword Weight Analysis based Paragraph Extraction Model (XML 문서 키워드 가중치 분석 기반 문단 추출 모델)

  • Lee, Jongwon;Kang, Inshik;Jung, Hoekyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.11 / pp.2133-2138 / 2017
  • Existing analysis of XML and other documents has centered on words. Such analysis can be implemented with a morpheme analyzer, but it only classifies the many words in a document and cannot grasp the document's core content. For a user to understand a document efficiently, the paragraphs containing a main word must be extracted and presented. The proposed system searches for a keyword in a normalized XML document. It then extracts the paragraphs containing the keyword the user entered and displays them. In addition, it reports the frequency and weight of the search keyword, and it orders the extracted paragraphs and eliminates redundant ones so that the user can understand the document. The proposed system can thus minimize the time and effort needed to understand a document, since the user does not have to read it in full.
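
The extraction step described above might look roughly like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The `<p>` tag name, the ordering by keyword count, and the sample document are assumptions; the abstract does not specify the XML schema or how the keyword weight is computed, so no weight is calculated here.

```python
import xml.etree.ElementTree as ET

def extract_paragraphs(xml_text, keyword, tag="p"):
    """Return (keyword frequency, matching paragraphs) for one XML document."""
    root = ET.fromstring(xml_text)
    needle = keyword.lower()
    seen, hits, frequency = set(), [], 0
    for elem in root.iter(tag):
        # Flatten nested markup and normalise whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text in seen:
            continue  # redundancy elimination: skip verbatim duplicates
        seen.add(text)
        count = text.lower().count(needle)
        if count:
            frequency += count
            hits.append((count, text))
    # Paragraphs that mention the keyword most often come first.
    hits.sort(key=lambda h: h[0], reverse=True)
    return frequency, [text for _, text in hits]

# Hypothetical sample document.
doc = (
    "<doc>"
    "<p>Naive Bayes is a classifier.</p>"
    "<p>SVM is another classifier; a classifier maps inputs to labels.</p>"
    "<p>Naive Bayes is a classifier.</p>"
    "<p>Unrelated text.</p>"
    "</doc>"
)
frequency, paragraphs = extract_paragraphs(doc, "classifier")
print(frequency, paragraphs)
```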

Time-Series based Dataset Selection Method for Effective Text Classification (효율적인 문헌 분류를 위한 시계열 기반 데이터 집합 선정 기법)

  • Chae, Yeonghun;Jeong, Do-Heon
    • The Journal of the Korea Contents Association / v.17 no.1 / pp.39-49 / 2017
  • As Internet technology advances, the amount of data on the web is increasing sharply. Many studies have examined incremental learning for classifying effectively as data grows. Web documents contain time-series information such as the publication date, and reflecting this information in classification should make it more effective. In this study, we analyze the time-series variation of words and propose an efficient classification approach that divides the dataset based on that analysis. For the experiments, we collected 1 million online news articles with time-series information. We divide the dataset and classify it using SVM and Naïve Bayes. Classification performance improved for each model, showing that reflecting time-series information can improve classification performance.
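
As a rough illustration of dividing a dataset by publication date, the sketch below groups articles into fixed-length windows. The fixed six-month window is an assumption made for simplicity; the paper derives its split from an analysis of how word usage varies over time, which the abstract does not detail. The sample articles are invented.

```python
from collections import defaultdict
from datetime import date

def split_by_period(articles, months_per_bin=6):
    """Group (pub_date, text, label) records into consecutive time windows."""
    bins = defaultdict(list)
    for pub_date, text, label in articles:
        month_index = pub_date.year * 12 + (pub_date.month - 1)
        bins[month_index // months_per_bin].append((text, label))
    return [bins[key] for key in sorted(bins)]  # oldest window first

# Invented sample articles.
articles = [
    (date(2016, 1, 5), "election results announced", "politics"),
    (date(2016, 3, 9), "new phone released", "tech"),
    (date(2016, 8, 21), "olympic final tonight", "sports"),
    (date(2017, 1, 2), "chip maker posts earnings", "tech"),
]
subsets = split_by_period(articles)
print([len(s) for s in subsets])  # [2, 1, 1]
```

Each subset could then be used to train or evaluate a separate classifier.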

A preliminary Study on Text Categorization of Book using Table of Contents and Book Description (목차, 책 소개를 이용한 단행본 문서 범주화에 관한 기초연구)

  • Do, Hyun-Ho;Lee, Yong-Gu
    • Proceedings of the Korean Society for Information Management Conference / 2014.08a / pp.127-130 / 2014
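  • This study examined whether automatic classification can be applied to monographs (books), which make up the main part of library collections. The title, table of contents, and book description, all of them metadata, were used as classification features, and the performance of a kNN classifier was measured on 581 monographs under various feature weighting schemes. The experiments showed that using all of these metadata fields together gave the best classification performance and, although the small size of the test collection is a limitation, that the log TF weighting method performed well.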
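
A minimal sketch of kNN classification with log TF weighting follows. The weighting `1 + log(tf)`, the cosine similarity, and the majority vote are common choices assumed here, since the abstract gives no formulas; the tiny training set is invented.

```python
import math
from collections import Counter

def log_tf_vector(tokens):
    """Weight each term by 1 + ln(tf), one common form of log TF."""
    return {t: 1.0 + math.log(c) for t, c in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(train, tokens, k=3):
    """train: list of (tokens, label). Majority vote over the k nearest."""
    query = log_tf_vector(tokens)
    ranked = sorted(train, key=lambda ex: cosine(query, log_tf_vector(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented records: tokens drawn from title, table of contents, description.
train = [
    ("python programming language code".split(), "computing"),
    ("machine learning algorithm code".split(), "computing"),
    ("world war history europe".split(), "history"),
    ("ancient rome history empire".split(), "history"),
]
print(knn_classify(train, "history of europe war".split()))  # history
```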

Web Mining Using Fuzzy Integration of Multiple Structure Adaptive Self-Organizing Maps (다중 구조적응 자기구성지도의 퍼지결합을 이용한 웹 마이닝)

  • 김경중;조성배
    • Journal of KIISE:Software and Applications / v.31 no.1 / pp.61-70 / 2004
  • It is difficult to find an appropriate web site because the exponentially growing web contains millions of documents. Web search can be personalized by recommending suitable web sites based on a user profile, but a more efficient method is needed for estimating preference, because a user's evaluation of web content reflects many aspects of his or her characteristics. Because a user profile is non-linear, it must be estimated with a classifier, and a combination of classifiers is needed to capture its diverse properties. The structure adaptive self-organizing map (SASOM), an enhanced model of the SOM that is suitable for pattern classification and visualization, may be useful for web mining. The fuzzy integral is a combination method that uses subjectively defined relevance values for the classifiers. In this paper, user profiles are estimated with an ensemble of independently trained SASOMs combined by the fuzzy integral, and the method is evaluated on the Syskill & Webert UCI benchmark data. Experimental results show that the proposed method performs better than a previous naive Bayes classifier as well as voting among SASOMs.
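
To make the combination step concrete, the sketch below fuses per-class supports from several classifiers with a Sugeno fuzzy integral. Several things are assumed: the abstract does not say which fuzzy integral is used, the relevance values (`densities`) are taken to sum to 1 so that the fuzzy measure is simply additive, and the class labels and support values are invented.

```python
def sugeno_integral(supports, densities):
    """Sugeno fuzzy integral of classifier supports w.r.t. an additive measure."""
    pairs = sorted(zip(supports, densities), reverse=True)  # strongest support first
    best, measure = 0.0, 0.0
    for h, g in pairs:
        measure = min(1.0, measure + g)  # measure of the top-i classifiers
        best = max(best, min(h, measure))
    return best

def fuse(class_supports, densities):
    """class_supports: {label: [support from each classifier]}."""
    scores = {c: sugeno_integral(h, densities) for c, h in class_supports.items()}
    return max(scores, key=scores.get), scores

# Hypothetical relevance of three classifiers and their supports per class.
densities = [0.5, 0.3, 0.2]
class_supports = {"hot": [0.9, 0.6, 0.3], "cold": [0.1, 0.4, 0.7]}
label, scores = fuse(class_supports, densities)
print(label, scores)
```

Unlike simple voting, the result depends on how much relevance is assigned to the classifiers that agree, not only on how many of them agree.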