• Title/Summary/Keyword: document classification

Search Result 449, Processing Time 0.029 seconds

A Study on Collaboration in Classification System Development Practice (분류시스템 개발과정에서의 협력에 대한 연구)

  • Park, Ok-Nam
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.42 no.4
    • /
    • pp.181-199
    • /
    • 2008
  • This study presents an empirical study of classification system design focused upon an image design team within an organizational setting. It aims to understand collaboration during design practice. Data was collected through on-site interviews, observations, and document and email reviews. This study uses social process model as a conceptual framework. The study revealed type of collaboration, factors influencing collaboration, influences of collaboration on design practice.

Improving the Performance of Document Clustering with Distributional Similarities (분포유사도를 이용한 문헌클러스터링의 성능향상에 대한 연구)

  • Lee, Jae-Yun
    • Journal of the Korean Society for information Management
    • /
    • v.24 no.4
    • /
    • pp.267-283
    • /
    • 2007
  • In this study, measures of distributional similarity such as KL-divergence are applied to cluster documents instead of traditional cosine measure, which is the most prevalent vector similarity measure for document clustering. Three variations of KL-divergence are investigated; Jansen-Shannon divergence, symmetric skew divergence, and minimum skew divergence. In order to verify the contribution of distributional similarities to document clustering, two experiments are designed and carried out on three test collections. In the first experiment the clustering performances of the three divergence measures are compared to that of cosine measure. The result showed that minimum skew divergence outperformed the other divergence measures as well as cosine measure. In the second experiment second-order distributional similarities are calculated with Pearson correlation coefficient from the first-order similarity matrixes. From the result of the second experiment, secondorder distributional similarities were found to improve the overall performance of document clustering. These results suggest that minimum skew divergence must be selected as document vector similarity measure when considering both time and accuracy, and second-order similarity is a good choice for considering clustering accuracy only.

Improving Multinomial Naive Bayes Text Classifier (다항시행접근 단순 베이지안 문서분류기의 개선)

  • 김상범;임해창
    • Journal of KIISE:Software and Applications
    • /
    • v.30 no.3_4
    • /
    • pp.259-267
    • /
    • 2003
  • Though naive Bayes text classifiers are widely used because of its simplicity, the techniques for improving performances of these classifiers have been rarely studied. In this paper, we propose and evaluate some general and effective techniques for improving performance of the naive Bayes text classifier. We suggest document model based parameter estimation and document length normalization to alleviate the Problems in the traditional multinomial approach for text classification. In addition, Mutual-Information-weighted naive Bayes text classifier is proposed to increase the effect of highly informative words. Our techniques are evaluated on the Reuters21578 and 20 Newsgroups collections, and significant improvements are obtained over the existing multinomial naive Bayes approach.

An Opinion Document Clustering Technique for Product Characterization (제품 특징화를 위한 오피니언 문서의 클러스터링 기법)

  • Chang, Jae-Young
    • The Journal of Society for e-Business Studies
    • /
    • v.19 no.2
    • /
    • pp.95-108
    • /
    • 2014
  • Opinion Mining is one of the application domains of text mining which extracting opinions from documents, and much researches are currently underway. Most of related researches focused on the sentiment classification which classifies the documents into positive/negative opinions. However, there is a little interest in extracting the features characterizing the individual product. In this paper, we propose the technique classifying the opinion documents according to the product features, and selecting the those features characterizing each product. In the proposed method, we utilize the document clustering technique and develope a new algorithm for evaluating the similarity between documents. In addition, through experiments, we prove the usefulness of proposed method.

An Automatic Classification System of Official Documents in Middle Schools Using Term Weighting of Titles (제목의 단어 가중치를 이용한 중등학교 공문서 자동분류시스템)

  • Kang, Hyun-Hee;Jin, Min
    • Journal of The Korean Association of Information Education
    • /
    • v.7 no.2
    • /
    • pp.219-226
    • /
    • 2003
  • It takes a lot of time to classify official documents in schools and educational institutions. In order to reduce the overhead, we propose an automatic document classification method using word information of the titles of documents in this paper. At first, meaningful words are extracted from titles of existing documents and Inverse Document Frequency(IDF) weights of words are calculated against each category. Then we build a word weight dictionary. Documents are automatically classified into the appropriate category of which the sum of weights of words of the title is the highest by using the word weight dictionary. We also evaluate the performance of the proposed method using a real dataset of a middle school.

  • PDF

사용자 의도 정보를 사용한 웹문서 분류

  • Jang, Yeong-Cheol
    • Proceedings of the Korea Society for Industrial Systems Conference
    • /
    • 2008.10b
    • /
    • pp.292-297
    • /
    • 2008
  • 복잡한 시맨틱을 포함한 웹 문서를 정확히 범주화하고 이 과정을 자동화하기 위해서는 인간의 지식체계를 수용할 수 있는 표준화, 지능화, 자동화된 문서표현 및 분류기술이 필요하다. 이를 위해 키워드 빈도수, 문서내 키워드들의 관련성, 시소러스의 활용, 확률기법 적용 등에 사용자의도(intention) 정보를 활용한 범주화와 조정 프로세스를 도입하였다. 웹 문서 분류과정에서 시소러스 등을 사용하는 지식베이스 문서분류와 비 감독 학습을 하는 사전 지식체계(a priori)가 없는 유사성 문서분류 방법에 의도정보를 사용할 수 있도록 기반체계를 설계하였고 다시 이 두 방법의 차이는 Hybrid조정프로세스에서 조정하였다. 본 연구에서 설계된 HDCI(Hybrid Document Classification with Intention) 모델은 위의 웹 문서 분류과정과 이를 제어 및 보조하는 사용자 의도 분석과정으로 구성되어 있다. 의도분석과정에 키워드와 함께 제공된 사용자 의도는 도메인 지식(domain Knowledge)을 이용하여 의도간 계층트리(intention hierarchy tree)를 구성하고 이는 문서 분류시 제약(constraint) 또는 가이드의 역할로 사용자 의도 프로파일(profile) 또는 문서 특성 대표 키워드를 추출하게 된다. HDCI는 문서간 유사성에 근거한 상향식(bottom-up)의 확률적인 접근에서 통제 및 안내의 역할을 수행하고 지식베이스(시소러스) 접근 방식에서 다양성에 한계가 있는 키워들 간 관계설정의 정확도를 높인다.

  • PDF

Reliability Analysis and Utilization of BIM-based Highway Construction Output Volume (BIM기반 고속도로 공사 물량산출 신뢰성 검토 및 활용)

  • Jung, Guk-Young;Woo, Jeong-Won;Kang, Kyeong-Don;Shin, Jae-Choul
    • Journal of KIBIM
    • /
    • v.3 no.3
    • /
    • pp.9-18
    • /
    • 2013
  • In case of applying the BIM method in the civil engineering of irregularly shaped structure, BIM method began to be introduced in the current building engineering area compared with the expected effects of the relatively high construction productivity has been recognized. In this paper, I have developed quantity calculation algorithms applying it to earthwork and bridge construction, tunnel construction, retaining wall construction, culvert construction and implemented BIM based 3D-BIM Modeling quantity calculation. Structure work in which errors occurred in range between -6.28% ~ 5.17%. Especially, understanding of the problem and improvement of the existing 2D-CAD based of quantity calculation through rock type quantity calculation error in range of -14.36% ~ 13.07% of earthwork quantity calculation. It's benefit and applicability of BIM method in civil engineering. In addition, routine method for quantity of earthwork has the same error tolerance negligible for that of structure work. But, rock type's quantity calculated as the error appears significantly to the reliability of 2D-based volume calculation shows that the problem could be. Through the estimating quantity of earthwork based 3D-BIM, proposed method has better reliability than routine method. BIM, as well as the design, construction, maintenance levels of information when you consider the benefits of integration, the introduction of BIM design in civil engineering and the possibility of applying for the effectiveness was confirmed. In addition, as the beginning phase of information integration, quantity document automation program has been developed for activation of BIM. And automatically enter the program code number, linkage and manual volume calculation program, quantity document automation programs, such as the development is now underway, and step-by-step procedures and methods are presented.

Recommendation System using Associative Web Document Classification by Word Frequency and α-Cut (단어 빈도와 α-cut에 의한 연관 웹문서 분류를 이용한 추천 시스템)

  • Jung, Kyung-Yong;Ha, Won-Shik
    • The Journal of the Korea Contents Association
    • /
    • v.8 no.1
    • /
    • pp.282-289
    • /
    • 2008
  • Although there were some technological developments in improving the collaborative filtering, they have yet to fully reflect the actual relation of the items. In this paper, we propose the recommendation system using associative web document classification by word frequency and ${\alpha}$-cut to address the short comings of the collaborative filtering. The proposed method extracts words from web documents through the morpheme analysis and accumulates the weight of term frequency. It makes associative rules and applies the weight of term frequency to its confidence by using Apriori algorithm. And it calculates the similarity among the words using the hypergraph partition. Lastly, it classifies related web document by using ${\alpha}$-cut and calculates similarity by using adjusted cosine similarity. The results show that the proposed method significantly outperforms the existing methods.

Comparison Between Optimal Features of Korean and Chinese for Text Classification (한중 자동 문서분류를 위한 최적 자질어 비교)

  • Ren, Mei-Ying;Kang, Sinjae
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.25 no.4
    • /
    • pp.386-391
    • /
    • 2015
  • This paper proposed the optimal attributes for text classification based on Korean and Chinese linguistic features. The experiments committed to discover which is the best feature among n-grams which is known as language independent, morphemes that have language dependency and some other feature sets consisted with n-grams and morphemes showed best results. This paper used SVM classifier and Internet news for text classification. As a result, bi-gram was the best feature in Korean text categorization with the highest F1-Measure of 87.07%, and for Chinese document classification, 'uni-gram+noun+verb+adjective+idiom', which is the combined feature set, showed the best performance with the highest F1-Measure of 82.79%.

Development of XML based HACCP Diet Automatic Classification System (XML 기반 HACCP 식단 자동 분류 시스템 개발)

  • Cha, Kyung-Ae;Yeo, Sun-Dong;Hong, Won-Kee
    • Journal of Korea Multimedia Society
    • /
    • v.19 no.1
    • /
    • pp.86-95
    • /
    • 2016
  • The main objective of HACCP(Hazard analysis and critical control points) system is to provide a systematic preventive approach how to control the risks in food production process. Practically, the diet classification process performed at the one of the beginning steps of the HACCP system, makes an important role of determining food safety risks and how to control them in every control point according to the different risk level of the diet. In this paper, we propose an automatic diet classification method for HACCP system using XML(eXtensible Markup Language). In order to guarantee the diet classification accuracy, we design the XML schema and attributes represents the relationship of every diet and ingredients analysing the HACCP diet classification rules. Based on the XML schema and document generation method, we develope the proposed system as client and server model that implemented XML based HACCP diet information generation module and integrated HACCP information management modules, respectively. Moreover, we show the efficiency of the proposed system with experiment results describing the school food diet information as XML documents and parsing the diet information.