• Title/Summary/Keyword: Document categorization

Search Result 73, Processing Time 0.024 seconds

Guiding Practical Text Classification Framework to Optimal State in Multiple Domains

  • Choi, Sung-Pil;Myaeng, Sung-Hyon;Cho, Hyun-Yang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.3 no.3
    • /
    • pp.285-307
    • /
    • 2009
  • This paper introduces DICE, a Domain-Independent text Classification Engine. DICE is robust, efficient, and domain-independent in terms of software and architecture. Each module of the system is clearly modularized and encapsulated for extensibility. The clear modular architecture allows for simple and continuous verification and facilitates changes in multiple cycles, even after its major development period is complete. Those who want to make use of DICE can easily implement their ideas on this test bed and optimize it for a particular domain by simply adjusting the configuration file. Unlike other publically available tool kits or development environments targeted at general purpose classification models, DICE specializes in text classification with a number of useful functions specific to it. This paper focuses on the ways to locate the optimal states of a practical text classification framework by using various adaptation methods provided by the system such as feature selection, lemmatization, and classification models.

A Web-Document Categorization System Using the Hierarchical Information of the Concept (의미의 상하위 정보를 이용한 웹문서 분류시스템)

  • Kang, Won-Seog;Hwang, Do-Sam;Choi, Key-Sun
    • Annual Conference on Human and Language Technology
    • /
    • 1999.10e
    • /
    • pp.36-39
    • /
    • 1999
  • 본 논문에서는 다양성을 가진 웹문서의 범주를 결정짓는 웹문서 분류 시스템을 설계, 구축한다. 웹문서는 일관된 형식과 내용이 없이 만들어지기 때문에 문서의 범주를 결정하는 시스템을 구축하기는 쉬운 일이 아니다. 제안한 웹문서 분류 시스템은 잡음 처리에 적합한 신경망 방식을 적용하여 다양한 내용의 웹문서의 범주를 결정짓는다. 본 시스템은 한국어 문장을 분석하는 한국어 형태소 해석기, 단어의 의미를 획득하는 개념 획득기, 단어의 사용된 의미를 고르는 애매성 해소기, 그리고 문서의 범주를 결정하는 신경망 범주 결정기로 구성된다. 본 시스템은 단어의 의미를 이용하여 문서를 표현하고 분석하는 개념 중심의 문서 분류 시스템이다.

  • PDF

Comparison of Feature Selection Methods using the Statistics of Words in Text Categorization (문서 분류에서 단어의 통계 정보를 이용한 특징 선택 기법의 비교)

  • 임윤택;윤충화
    • Proceedings of the Safety Management and Science Conference
    • /
    • 1999.11a
    • /
    • pp.209-216
    • /
    • 1999
  • 정보 검색 분야의 문서 분류에 기계 학습 기법을 적용할 때 발생하는 가장 큰 문제는 문서를 패턴으로 표현할 때, 하나의 패턴이 가지는 특징의 수가 기계 학습 기법에서 처리할 수 있는 범위를 넘어서는 것이다. 이러한 문제를 해결하기 위하여 특징 선택 기법은 패턴을 구성하고 있는 특징 중에서 실제 문서 분류에 많은 영향을 주는 특징만을 선택하여, 기계 학습 기법에서 쉽게 처리할 수 있을 정도의 패턴을 구성하게 한다. 본 논문에서는 이러한 특징 선택 기법 중에서 IG(Information Gain), Gini index, Relief-F, DF(Document Frequency)를 비교하였다. 실험 결과 문서들에 포함된 모든 고유 단어를 특징의 길이로 하여 패턴을 구성했을 때보다 특징 선택 기법을 적용하여 고유 단어 중 일부를 특징으로 패턴을 구성할 때 기계학습에서 더 향상된 분류 성능을 보였다

  • PDF

An Experimental Study on Feature Ranking Schemes for Text Classification (텍스트 분류를 위한 자질 순위화 기법에 관한 연구)

  • Pan Jun Kim
    • Journal of the Korean Society for information Management
    • /
    • v.40 no.1
    • /
    • pp.1-21
    • /
    • 2023
  • This study specifically reviewed the performance of the ranking schemes as an efficient feature selection method for text classification. Until now, feature ranking schemes are mostly based on document frequency, and relatively few cases have used the term frequency. Therefore, the performance of single ranking metrics using term frequency and document frequency individually was examined as a feature selection method for text classification, and then the performance of combination ranking schemes using both was reviewed. Specifically, a classification experiment was conducted in an environment using two data sets (Reuters-21578, 20NG) and five classifiers (SVM, NB, ROC, TRA, RNN), and to secure the reliability of the results, 5-Fold cross-validation and t-test were applied. As a result, as a single ranking scheme, the document frequency-based single ranking metric (chi) showed good performance overall. In addition, it was found that there was no significant difference between the highest-performance single ranking and the combination ranking schemes. Therefore, in an environment where sufficient learning documents can be secured in text classification, it is more efficient to use a single ranking metric (chi) based on document frequency as a feature selection method.

Developing a Test Collection for Korean Text Categorization (한국어 문서분류 테스트컬렉션 개발)

  • Ra, Dong-Yul;Kim, Yunsik;Shin, Hyun-Joo;Lee, Kyu-Hee;Kim, Tae-Kyu;Kang, Hyun-Kyu;Choe, Ho-Seop;Yoon, Hwa-Mook
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2007.11a
    • /
    • pp.435-439
    • /
    • 2007
  • Document categorization system is important in the internet age in which huge number of documents are created and need to be dealt with. By this reason a lot of research has been done in this field. For the development of the system, a supervised learning method is widely used. This approach needs a test collection as a prerequisite. For the case of English, several test collections are available which provide a lot of help for developing systems and doing research. But no public test collections have been reported and are not available in the case of Korean. To improve the situation for Korean we are undergoing the construction of a Korean test collection. In this paper the approaches being used and current stage of the collection will be described.

  • PDF

Academic Conference Categorization According to Subjects Using Topical Information Extraction from Conference Websites (학회 웹사이트의 토픽 정보추출을 이용한 주제에 따른 학회 자동분류 기법)

  • Lee, Sue Kyoung;Kim, Kwanho
    • The Journal of Society for e-Business Studies
    • /
    • v.22 no.2
    • /
    • pp.61-77
    • /
    • 2017
  • Recently, the number of academic conference information on the Internet has rapidly increased, the automatic classification of academic conference information according to research subjects enables researchers to find the related academic conference efficiently. Information provided by most conference listing services is limited to title, date, location, and website URL. However, among these features, the only feature containing topical words is title, which causes information insufficiency problem. Therefore, we propose methods that aim to resolve information insufficiency problem by utilizing web contents. Specifically, the proposed methods the extract main contents from a HTML document collected by using a website URL. Based on the similarity between the title of a conference and its main contents, the topical keywords are selected to enforce the important keywords among the main contents. The experiment results conducted by using a real-world dataset showed that the use of additional information extracted from the conference websites is successful in improving the conference classification performances. We plan to further improve the accuracy of conference classification by considering the structure of websites.

A Study on Analysis of Topic Modeling using Customer Reviews based on Sharing Economy: Focusing on Sharing Parking (공유경제 기반의 고객리뷰를 이용한 토픽모델링 분석: 공유주차를 중심으로)

  • Lee, Taewon
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.25 no.3
    • /
    • pp.39-51
    • /
    • 2020
  • This study will examine the social issues and consumer awareness of sharing parking through the method text mining. In this experiment, the topic by keyword was extracted and analyzed using TFIDF (Term frequency inverse document frequency) and LDA (Latent dirichlet allocation) technique. As a result of categorization by topic, citizens' complaints such as local government agreements, parking space negotiations, parking culture improvement, citizen participation, etc., played an important role in implementing shared parking services. The contribution of this study highly differentiated from previous studies that conducted exploratory studies using corporate and regional cases, and can be said to have a high academic contribution. In addition, based on the results obtained by utilizing the LDA analysis in this study, there is a practical contribution that it can be applied or utilized in establishing a sharing economy policy for revitalizing the local economy.

Comparison Between Optimal Features of Korean and Chinese for Text Classification (한중 자동 문서분류를 위한 최적 자질어 비교)

  • Ren, Mei-Ying;Kang, Sinjae
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.25 no.4
    • /
    • pp.386-391
    • /
    • 2015
  • This paper proposed the optimal attributes for text classification based on Korean and Chinese linguistic features. The experiments committed to discover which is the best feature among n-grams which is known as language independent, morphemes that have language dependency and some other feature sets consisted with n-grams and morphemes showed best results. This paper used SVM classifier and Internet news for text classification. As a result, bi-gram was the best feature in Korean text categorization with the highest F1-Measure of 87.07%, and for Chinese document classification, 'uni-gram+noun+verb+adjective+idiom', which is the combined feature set, showed the best performance with the highest F1-Measure of 82.79%.

A Comparative Analysis of Healthcare-Associated Infection Policy in South Korea and Its Implications in Coronavirus Disease 2019

  • Jeong, Yoolwon;Kim, Kinam
    • Health Policy and Management
    • /
    • v.31 no.3
    • /
    • pp.312-327
    • /
    • 2021
  • Background: Infection prevention and control (IPC) to manage healthcare-associated infection (HCAI) has emerged as one of the most significant public health issues in Korea. The purpose of this study is to draw implications in IPC policies by analyzing the context, process, and major actors in policy development and comparatively analyzing IPC policy contents of Korea with three other countries. Additionally, IPC policies were analyzed in the context of coronavirus disease 2019 (COVID-19) to provide implications for future pandemics and HCAI events. Methods: This study incorporates a qualitative approach based on document and content analysis, applying codes and thematic categorization. IPC policy contents are comparatively analyzed by adopting the concept model, developed by the World Health Organization, which consists of core components of IPC structure at the national and facility level. Results: National IPC policies were developed within a complex social and political context, through the involvement of various stakeholders. IPC policies in Korea place a high emphasis on establishing IPC programs and built environments in healthcare facilities, whereas there were potentials for improvement in policies involving patients and promoting a safety culture. IPC policies, which currently focus on general hospitals and certain functions of hospitals, should further be expanded to target all healthcare facilities and functions, to ensure more efficient and sustainable IPC responses in the current and future disease outbreaks. Conclusion: IPC is a complex policy arena and lessons learned from the analysis of existing policies in the context of COVID-19 should provide valuable strategic implications for future policies.

(A Question Type Classifier based on a Support Vector Machine for a Korean Question-Answering System) (한국어 질의응답시스템을 위한 지지 벡터기계 기반의 질의유형분류기)

  • 김학수;안영훈;서정연
    • Journal of KIISE:Software and Applications
    • /
    • v.30 no.5_6
    • /
    • pp.466-475
    • /
    • 2003
  • To build an efficient Question-Answering (QA) system, a question type classifier is needed. It can classify user's queries into predefined categories regardless of the surface form of a question. In this paper, we propose a question type classifier using a Support Vector Machine (SVM). The question type classifier first extracts features like lexical forms, part of speech and semantic markers from a user's question. The system uses $X^2$ statistic to select important features. Selected features are represented as a vector. Finally, a SVM categorizes questions into predefined categories according to the extracted features. In the experiment, the proposed system accomplished 86.4% accuracy The system precisely classifies question type without using any rules like lexico-syntactic patterns. Therefore, the system is robust and easily portable to other domains.