• Title/Summary/Keyword: categorization

Text Document Categorization using FP-Tree (FP-Tree를 이용한 문서 분류 방법)

  • Park, Yong-Ki;Kim, Hwang-Soo
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.11
    • /
    • pp.984-990
    • /
    • 2007
  • As the number of electronic documents grows explosively, automatic text categorization methods are needed to identify those of interest. Most methods use machine learning techniques based on a word set. This paper introduces a new method called FPTC (FP-Tree based Text Classifier). The FP-Tree is a data structure used in data mining. The paper presents a method that stores text sentence patterns in an FP-Tree and classifies text using those patterns. In the experiments, the algorithm is combined with a Mutual Information and Entropy approach to improve performance. We also present an analysis of the algorithm via an ordinary differential categorization method.
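
To illustrate how sentence patterns can be stored in an FP-Tree, the following is a minimal sketch and not the FPTC algorithm itself, whose details the abstract does not give. It assumes each sentence is a list of words, orders the distinct words of a sentence by descending frequency (ties broken alphabetically), and inserts them as a path in a prefix tree with counts. The names `FPNode`, `build_fp_tree`, and `path_count` are hypothetical.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item, a count, and child nodes keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(sentences, min_support=1):
    """Insert every sentence as a frequency-ordered path (assumed ordering rule)."""
    # Sentence-level frequency: in how many sentences each word occurs.
    freq = Counter(w for s in sentences for w in set(s))
    root = FPNode()
    for s in sentences:
        items = [w for w in set(s) if freq[w] >= min_support]
        items.sort(key=lambda w: (-freq[w], w))
        node = root
        for w in items:
            node = node.children.setdefault(w, FPNode(w))
            node.count += 1
    return root


def path_count(root, path):
    """Return how many sentences share the given prefix path, or 0 if absent."""
    node = root
    for w in path:
        node = node.children.get(w)
        if node is None:
            return 0
    return node.count


sentences = [
    ["stock", "market", "falls"],
    ["stock", "market", "rises"],
    ["stock", "price"],
]
tree = build_fp_tree(sentences)
```

Sentences that share frequent words share a prefix in the tree, so one tree per category could in principle be compared against the word pattern of a new document.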

Development of Categorization System for Efficient Calculation of Damage Cost according to Strong Wind (강풍 피해에 따른 피해비용의 효율적인 산정을 위한 분류체계 개발)

  • Song, Chang Young;Lee, Jong Hoon
    • Journal of the Korean Society of Safety
    • /
    • v.31 no.2
    • /
    • pp.127-132
    • /
    • 2016
  • This study proposes a plan for building a disaster information categorization system that can be applied objectively and efficiently, so that disaster management tasks can be performed systematically. Damage from natural disasters has recently been growing larger and faster, increasing economic losses. In domestic storm damage in particular, the damage from strong wind was found to be greater than the damage from torrential rain. Strong wind was also found to inflict great damage on human life, property, and agricultural crops, so the need to study restoration of strong-wind damage is increasing. Nevertheless, the damage items categorized in the domestic disaster yearbook are often too broad or have unclear criteria, and thus fail to reflect the items actually damaged in a disaster. Damage is difficult to aggregate accurately: some of it does not correspond to the national compensation scope, or the damage amount is calculated according to the subjective judgment of the investigator in charge. If disaster information is managed inadequately because accurate categorization criteria are not applied from the damage calculation stage, fairness issues can arise when damage support aid is paid. This study therefore suggests a categorization plan for the objective and efficient execution of disaster information management tasks in order to resolve such issues. Constructing a systematic categorization system for disaster information is expected to enable quick and efficient disaster information management and task procedures domestically.

An Experimental Study on Feature Selection Using Wikipedia for Text Categorization (위키피디아를 이용한 분류자질 선정에 관한 연구)

  • Kim, Yong-Hwan;Chung, Young-Mee
    • Journal of the Korean Society for information Management
    • /
    • v.29 no.2
    • /
    • pp.155-171
    • /
    • 2012
  • In text categorization, core terms of an input document are rarely selected as classification features if they do not occur in the training document set. In addition, synonymous terms with the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by using Wikipedia to replace input terms that are not in the training document set with the most similar term occurring in the training documents. For the selection of classification features, experiments were performed in various settings composed of three conditions: the use of category information of non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measure. The categorization performance of a kNN classifier improved by 0.35~1.85% in $F_1$ value in all the experimental settings when non-training terms were replaced by the training term with the highest similarity above the threshold value. Although the improvement ratio is not as high as expected, several semantic as well as structural devices of Wikipedia could be used for selecting more effective classification features.
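
The replacement step described above can be sketched as follows. This is only an illustration under stated assumptions: the function `replace_unseen_terms`, the toy similarity table, and the choice to drop unseen terms that have no sufficiently similar training term are all hypothetical, and the Wikipedia-based similarity measures used in the study are stood in for by a plain lookup function.

```python
def replace_unseen_terms(terms, training_vocab, similarity, threshold=0.5):
    """Map each term not in the training vocabulary to its most similar training term.

    Unseen terms whose best similarity is not above the threshold are dropped
    (an assumption made for this sketch).
    """
    result = []
    for term in terms:
        if term in training_vocab:
            result.append(term)
            continue
        # Pick the training term with the highest similarity; ties broken by name.
        best = max(training_vocab, key=lambda v: (similarity(term, v), v), default=None)
        if best is not None and similarity(term, best) > threshold:
            result.append(best)
    return result


# Toy similarity values standing in for a Wikipedia-derived measure.
SIM_TABLE = {
    ("automobile", "car"): 0.9,
    ("automobile", "road"): 0.4,
    ("zebra", "car"): 0.1,
}


def toy_similarity(a, b):
    return SIM_TABLE.get((a, b), 0.0)


features = replace_unseen_terms(
    ["automobile", "road", "zebra"], {"car", "road"}, toy_similarity, threshold=0.5
)
```

Here "automobile" is mapped to "car", "road" is kept because it already occurs in the training vocabulary, and "zebra" is dropped because no training term is similar enough.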

Document Clustering based on Level-wise Stop-word Removing for an Efficient Document Searching (효율적인 문서검색을 위한 레벨별 불용어 제거에 기반한 문서 클러스터링)

  • Joo, Kil Hong;Lee, Won Suk
    • The Journal of Korean Association of Computer Education
    • /
    • v.11 no.3
    • /
    • pp.67-80
    • /
    • 2008
  • Various document categorization methods have been studied to provide users with an effective way of browsing a large collection of documents. They automatically partition a set of documents into groups of semantically similar documents. However, automatic categorization suffers from low accuracy. This thesis proposes a semi-automatic document categorization method based on the domains of documents. Each document belongs to its initial domain. All the documents in each domain are recursively clustered in a level-wise manner, so that a category tree of the documents can be built. To find the clusters of documents, the stop-words of each document are removed based on the document frequency of a word in the domain. For each cluster, its cluster keywords are extracted from the keywords common to its documents and are used as the category of the domain. Recursively, each cluster is regarded as a more specific domain and the same procedure is repeated until the user terminates it. At each level of clustering, the user can adjust any incorrectly clustered documents to improve the accuracy of the document categorization.
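
As a rough illustration of removing stop-words by document frequency within a domain, the sketch below treats any word that occurs in at least a given fraction of the domain's documents as a stop-word for that level. The threshold `max_df_ratio` and the function names are assumptions made for this example; the abstract does not state the actual criterion.

```python
from collections import Counter


def domain_stop_words(docs, max_df_ratio=0.8):
    """Words whose document frequency ratio in this domain reaches the threshold."""
    # Count each word once per document.
    df = Counter(w for doc in docs for w in set(doc))
    return {w for w, n in df.items() if n / len(docs) >= max_df_ratio}


def remove_stop_words(docs, max_df_ratio=0.8):
    """Strip the domain-level stop-words from every tokenized document."""
    stops = domain_stop_words(docs, max_df_ratio)
    return [[w for w in doc if w not in stops] for doc in docs]


legal_domain = [
    ["court", "ruling", "appeal"],
    ["court", "contract", "dispute"],
    ["court", "ruling", "patent"],
]
filtered = remove_stop_words(legal_domain)
```

In this toy domain "court" appears in every document, so it carries no information for splitting the domain further and is removed, while "ruling" stays. Applying the same function again inside each resulting cluster gives the level-wise behavior.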

Automatic Text Categorization based on Semi-Supervised Learning (준지도 학습 기반의 자동 문서 범주화)

  • Ko, Young-Joong;Seo, Jung-Yun
    • Journal of KIISE:Software and Applications
    • /
    • v.35 no.5
    • /
    • pp.325-334
    • /
    • 2008
  • The goal of text categorization is to classify documents into a certain number of pre-defined categories. Previous studies in this area have used a large number of labeled training documents for supervised learning. One problem is that labeled training documents are difficult to create: while unlabeled documents are easy to collect, manually categorizing them to create training documents is not. In this paper, we propose a new text categorization method based on semi-supervised learning. The proposed method uses only unlabeled documents and keywords of each category, and it automatically constructs training data from them. A text classifier then learns from this data and classifies text documents. The proposed method shows a similar degree of performance to traditional supervised learning methods. Therefore, this method can be used in areas where low-cost text categorization is needed. It can also be used for creating labeled training documents.
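
One way to picture the idea of constructing training data from unlabeled documents and category keywords is the sketch below. It is not the authors' method: the keyword-hit rule in `pseudo_label`, the use of a multinomial Naive Bayes classifier with add-one smoothing, and all names and sample texts are assumptions chosen to keep the example small.

```python
import math
from collections import Counter, defaultdict


def pseudo_label(docs, keywords):
    """Label a document with the category whose keywords it hits most (no ties)."""
    labeled = []
    for doc in docs:
        tokens = doc.lower().split()
        hits = {c: sum(t in kws for t in tokens) for c, kws in keywords.items()}
        ranked = sorted(hits.values(), reverse=True)
        # Skip documents with no keyword hit or with a tie between categories.
        if ranked[0] > 0 and (len(ranked) == 1 or ranked[0] > ranked[1]):
            labeled.append((tokens, max(hits, key=hits.get)))
    return labeled


def train_nb(labeled):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, c in labeled:
        class_counts[c] += 1
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def classify(tokens, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())

    def score(c):
        denom = sum(word_counts[c].values()) + len(vocab)
        s = math.log(class_counts[c] / total)
        for t in tokens:
            if t in vocab:
                s += math.log((word_counts[c][t] + 1) / denom)
        return s

    return max(class_counts, key=score)


keywords = {"sports": {"football", "match"}, "finance": {"stock", "bank"}}
unlabeled = [
    "the football match ended with a late goal",
    "the bank raised its stock forecast and interest rates",
    "a quiet day",
]
model = train_nb(pseudo_label(unlabeled, keywords))
```

After training, `classify("a late goal".split(), model)` picks the sports class even though the query contains none of the seed keywords, which is the point of learning from the automatically labeled documents.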

The Origin and Changes of True-cold Damage(正傷寒) in Introduction to Medicine(醫學入門) (『의학입문(醫學入門)·상한편(傷寒篇)』 편제(編制) 중 정상한(正傷寒)의 명칭, 병명분류의 기원과 그 후 변화)

  • Jo, Hak-jun
    • Journal of Korean Medical classics
    • /
    • v.29 no.2
    • /
    • pp.55-78
    • /
    • 2016
  • Objectives : This paper investigates where the name and concept of true-cold damage in Introduction to Medicine originated, and traces the origin of its categorization and the changes to it after the book. Methods : As many books concerning true-cold damage as possible were collected, in addition to those that Introduction to Medicine referred to, and the name, concept and categorization of true-cold damage in them were searched and analysed. Results : The concept of true-cold damage in Introduction to Medicine, which came from Lei Zheong Huo Ren Shu(類證活人書) of the Song dynasty, was closer to that of cold damage in a broad sense. The name that Li Chan appreciated was derived not from Shang Han Zhi Ge(傷寒直格) but from Shang Han Zheng Zhi Ming Tiao(傷寒證治明條) of the Song dynasty. On the other hand, after Tao Hua(陶華) began to go into the details of cold damage in a narrow sense, most books followed that approach. Whereas 11 of the 24 diseases of true-cold damage in Introduction to Medicine came indirectly from Lei Zheong Huo Ren Shu(12 diseases), 14 of them were directly derived from Shang Han Zheng Zhi Ming Tiao(16 diseases), and 10 diseases were added, including syndromes of retained fluid and jaundice. The categorization in Introduction to Medicine was scarcely adopted except in Donguibogam(東醫寶鑑) and Uimunbogam(醫門寶鑑), while the categorization of true-cold damage in a narrow sense was mostly composed of 2 diseases, namely cold damage(傷寒) and wind damage(傷風). Conclusions : Li Chan set out the concept, cause, symptoms, prescriptions and prognosis of the 24 diseases of true-cold damage in full, in order to build up its system and categorization. Regrettably, his scientific outcome was hardly referred to after his book.

A Suggestion of Criteria for Categorizing Libraries into Types: Linking between Library and Information (도서관 관종구분의 기준에 대한 고찰)

  • Kim, Gi-Yeong;Choi, Yoon-Hee
    • Journal of the Korean Society for information Management
    • /
    • v.29 no.1
    • /
    • pp.395-404
    • /
    • 2012
  • The categorization of libraries into several types supports an understanding of the concept of library and also provides a framework for the practice of library management, such as planning and management. Although a 4-type categorization with public, academic, special, and school libraries is the most traditional and general approach to categorization, the definition of each type has been set enumeratively and inductively, so that it has weaknesses in its clarity between categories and in its applicability to a new environment. In this conceptual paper, deductive and analytical criteria for the 4-type categorization are suggested based on characteristics of information needs. Implications of the suggestions about library management, and especially, the meaning and impact of stakeholders on library management are discussed. Additionally, this paper attempts to put forth a conceptual link between library and information.

A Study on the Categorization of Citizens' Information Needs (시민 정보요구 범주화 연구)

  • Lee, Jiyoung;Kim, Giyeong;Park, Young-Sook
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.50 no.2
    • /
    • pp.245-269
    • /
    • 2016
  • In this study, we develop a categorization of citizens' everyday information problems based on the characteristics of their information seeking behavior, in order to develop information services that help solve those problems in practice. First, we extracted keywords related to the everyday life problems citizens faced from the transcripts of open-ended interviews with citizens who had diverse characteristics. The keywords were categorized into 6 groups (hobby/recreation, legal problems, current affairs, education, health, and economic matters) based on the characteristics of the related information seeking behaviors. The 6-group categorization was then tested statistically with questionnaire survey data on respondents' preferred information sources. The statistical test showed the 6-group categorization to be valid. Based on the results, we suggest reconsidering the current information services in public libraries, and discuss the possibility of shifting them to problem-based information services.

Text Categorization Using TextRank Algorithm (TextRank 알고리즘을 이용한 문서 범주화)

  • Bae, Won-Sik;Cha, Jeong-Won
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.1
    • /
    • pp.110-114
    • /
    • 2010
  • We describe a new method for text categorization using the TextRank algorithm. Text categorization is the problem of assigning one or more pre-defined categories to a text document. TextRank is a graph-based ranking algorithm. If we treat each word as a vertex and the co-occurrence of two adjacent words as an edge, we can obtain a graph from a document. We then find important words in the graph using the TextRank algorithm and build features that are pairs of words, each consisting of an important word and a word adjacent to it. We use four classifiers: SVM, Naïve Bayesian classifier, Maximum Entropy Model, and k-NN classifier. We use the non-cross-posted version of the 20 Newsgroups data set. As a result, performance improved for all classifiers, and the result suggests that the TextRank algorithm has potential in text categorization.
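
The graph construction and feature extraction described above might be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes an unweighted, undirected graph, the standard unnormalized PageRank-style update with damping 0.85, a fixed number of iterations, and a hypothetical `top_k` cutoff for what counts as an "important" word.

```python
from collections import defaultdict


def cooccurrence_graph(words):
    """Each word is a vertex; two adjacent words in the text share an edge."""
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Iterate the TextRank score update on an unweighted, undirected graph."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new_scores = {}
        for v in graph:
            # Each neighbor passes on its score split evenly over its own edges.
            rank = sum(scores[u] / len(graph[u]) for u in graph[v])
            new_scores[v] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


def important_word_pairs(words, top_k=2):
    """Pair each of the top_k ranked words with every word adjacent to it."""
    graph = cooccurrence_graph(words)
    scores = textrank(graph)
    top = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return [(w, n) for w in top for n in sorted(graph[w])]


tokens = "a b a c a d".split()
scores = textrank(cooccurrence_graph(tokens))
pairs = important_word_pairs(tokens, top_k=1)
```

In the toy token sequence, the word "a" is adjacent to every other word, so it receives the highest score and the extracted features are its pairs with "b", "c", and "d". Such pairs could then be fed to any of the classifiers listed in the abstract.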

A Study on the Product Categorization Model for Efficient Search in On-line Chartering

  • Choi, Hyung-Rim;Park, Nam-kyu;Park, Young-Jae;Park, Yong-Sung;Kang, Si-Hyeob
    • Journal of Navigation and Port Research
    • /
    • v.27 no.3
    • /
    • pp.307-313
    • /
    • 2003
  • Off-line ship chartering is done almost entirely through brokers. Because of the international scale of the chartering market, brokers spend too much time and money searching for the most appropriate product for their customers. In this research, we propose an on-line Charter Product Categorization Model for searching products efficiently in the Cyber Chartering System. This model enables the parties concerned in ship chartering to obtain unified product information efficiently and to select the most appropriate product. We classified ship chartering products into categories of cargo, ship type, and sea route, defined the mutual relations among the products, and verified through a product search experiment that this classification is necessary for searching the products.