• Title/Summary/Keyword: text categorization (텍스트 범주화)


An Experimental Study on Feature Selection Using Wikipedia for Text Categorization (위키피디아를 이용한 분류자질 선정에 관한 연구)

  • Kim, Yong-Hwan; Chung, Young-Mee
    • Journal of the Korean Society for Information Management / v.29 no.2 / pp.155-171 / 2012
  • In text categorization, core terms of an input document are unlikely to be selected as classification features if they do not occur in the training document set. In addition, synonymous terms expressing the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by using Wikipedia to replace input terms that are absent from the training document set with the most similar term occurring in the training documents. For the selection of classification features, experiments were performed in various settings composed of three conditions: the use of category information of non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measure. The categorization performance of a kNN classifier improved by 0.35~1.85% in $F_1$ value in all the experimental settings when non-training terms were replaced by the training term with the highest similarity above the threshold value. Although the improvement is not as large as expected, several semantic and structural devices of Wikipedia could be used for selecting more effective classification features.
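
The replacement step in this abstract can be illustrated with a minimal sketch. It assumes a caller-supplied `similarity` function; the abstract derives similarity from Wikipedia but does not give the measure, so the lookup table below is purely hypothetical. Dropping terms that have no sufficiently similar training term is also an assumption of this sketch, not something the abstract states.

```python
from typing import Callable, Iterable, List, Set


def replace_unseen_terms(
    doc_terms: Iterable[str],
    training_vocab: Set[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Map each term missing from the training vocabulary to its most similar training term."""
    result = []
    candidates = sorted(training_vocab)  # sorted so that ties are broken deterministically
    for term in doc_terms:
        if term in training_vocab:
            result.append(term)
            continue
        # Pick the training term with the highest similarity to the unseen term.
        best = max(candidates, key=lambda t: similarity(term, t), default=None)
        if best is not None and similarity(term, best) >= threshold:
            result.append(best)
        # Assumption: unseen terms with no match above the threshold are dropped.
    return result


# Hypothetical similarity scores standing in for a Wikipedia-based measure.
toy_scores = {
    ("automobile", "car"): 0.9,
    ("automobile", "road"): 0.4,
    ("zebra", "car"): 0.1,
    ("zebra", "road"): 0.2,
}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores.get((a, b), 0.0)


print(replace_unseen_terms(["automobile", "road", "zebra"], {"car", "road"}, toy_similarity, 0.5))
# prints ['car', 'road']
```

Here "automobile" is mapped to "car", "road" is kept because it is already a training term, and "zebra" is dropped because its best score (0.2) is below the threshold.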

An Experimental Study on Feature Ranking Schemes for Text Classification (텍스트 분류를 위한 자질 순위화 기법에 관한 연구)

  • Pan Jun Kim
    • Journal of the Korean Society for Information Management / v.40 no.1 / pp.1-21 / 2023
  • This study reviewed the performance of ranking schemes as an efficient feature selection method for text classification. Until now, feature ranking schemes have mostly been based on document frequency, and relatively few have used term frequency. Therefore, the performance of single ranking metrics using term frequency and document frequency individually was examined as a feature selection method for text classification, and then the performance of combination ranking schemes using both was reviewed. Specifically, a classification experiment was conducted using two data sets (Reuters-21578, 20NG) and five classifiers (SVM, NB, ROC, TRA, RNN), and 5-fold cross-validation and t-tests were applied to secure the reliability of the results. As a result, among single ranking schemes, the document frequency-based metric (chi) showed good performance overall. In addition, no significant difference was found between the highest-performing single ranking scheme and the combination ranking schemes. Therefore, in an environment where sufficient training documents can be secured for text classification, it is more efficient to use a single ranking metric (chi) based on document frequency as a feature selection method.
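
As an illustration of a document frequency-based ranking metric, the sketch below computes the standard chi-square statistic from a 2x2 term/category contingency table and ranks terms by it. It assumes that "chi" in the abstract refers to this common formulation. The function names and the toy corpus are hypothetical.

```python
from typing import List, Sequence, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square of a term/category pair from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs: Sequence[Set[str]], labels: Sequence[str], category: str) -> List[str]:
    """Rank all terms by chi-square with respect to one category, highest first."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == category)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != category)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == category)
        d = len(docs) - a - b - c
        scores[term] = chi_square(a, b, c, d)
    # Ties are broken alphabetically so that the ranking is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))


toy_docs = [{"oil", "price"}, {"oil", "export"}, {"game", "price"}, {"game", "team"}]
toy_labels = ["energy", "energy", "sports", "sports"]
print(rank_terms(toy_docs, toy_labels, "energy"))
# prints ['game', 'oil', 'export', 'team', 'price']
```

Note that chi-square is symmetric in direction: "game" scores as high as "oil" for the "energy" category because its absence is just as informative as the presence of "oil".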

Methodology for Applying Text Mining Techniques to Analyzing Online Customer Reviews for Market Segmentation (온라인 고객리뷰 분석을 통한 시장세분화에 텍스트마이닝 기술을 적용하기 위한 방법론)

  • Kim, Keun-Hyung; Oh, Sung-Ryoel
    • The Journal of the Korea Contents Association / v.9 no.8 / pp.272-284 / 2009
  • In this paper, we propose a methodology for analyzing online customer reviews using text mining technologies. We introduce market segmentation into the methodology because it is efficient and effective to analyze online customers by grouping them into segments of similar customers who are likely to share similar opinions and experiences. That is, the methodology uses the categorization and information extraction functions of text mining, matched with the concept of market segmentation. In particular, the methodology also uses cross-tabulation analysis, a traditional statistical analysis function, to derive rigorous results. To confirm the validity of the methodology, we used it to analyze online customer reviews related to tourism.

The Effect of the Quality of Pre-Assigned Subject Categories on the Text Categorization Performance (학습문헌집합에 기 부여된 범주의 정확성과 문헌 범주화 성능)

  • Shim, Kyung; Chung, Young-Mee
    • Journal of the Korean Society for Information Management / v.23 no.2 / pp.265-285 / 2006
  • In text categorization, a certain level of correctness of the labels assigned to training documents is assumed, without solid knowledge of the correctness found in real-world collections. Our research explores the quality of pre-assigned subject categories in a real-world collection and identifies the relationship between the quality of category assignment in the training set and text categorization performance. In particular, we are interested in the extent to which performance can be improved by enhancing the quality (i.e., correctness) of category assignment in training documents. A collection of 1,150 abstracts in computer science is re-classified by an expert group and divided into 907 training documents and 227 test documents (15 duplicates are removed). The performance before and after re-classification, on the Initial set and the Recat-1/Recat-2 sets respectively, is compared using a kNN classifier. The average correctness of subject categories in the Initial set is 16%, and the categorization performance with the Initial set is 17% in $F_1$ value. In contrast, the Recat-1 set scores an $F_1$ value of 61%, which is 3.6 times that of the Initial set.
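
For readers unfamiliar with the $F_1$ value reported here, the sketch below shows how it is computed from true positives, false positives, and false negatives, and checks the ratio between the two reported scores. The abstract does not say whether the scores are micro- or macro-averaged, so the function covers only the single-category case. The counts in the example are made up.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts: precision = 6/8 = 0.75, recall = 6/10 = 0.6.
example_f1 = f1_score(tp=6, fp=2, fn=4)

# Ratio of the two F1 values reported in the abstract (61% vs. 17%).
reported_ratio = 0.61 / 0.17
print(round(reported_ratio, 1))
# prints 3.6
```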

Optimization of Number of Training Documents in Text Categorization (문헌범주화에서 학습문헌수 최적화에 관한 연구)

  • Shim, Kyung
    • Journal of the Korean Society for Information Management / v.23 no.4 s.62 / pp.277-294 / 2006
  • This paper examines the level of categorization performance in a real-life collection of article abstracts in the fields of science and technology, and tests the optimal number of documents per category in a training set using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents, and then randomly selecting 2,556 documents per category. It is further divided into eight subsets with different numbers of training documents: each set is randomly selected to build training sets ranging from 20 documents (Tr-20) to 2,000 documents (Tr-2000) per category. The categorization performances of the eight subsets are compared. The average performance of the eight subsets is 30% in $F_1$ measure, which is relatively poor compared to the findings of previous studies. The experimental results suggest that, among the eight subsets, Tr-100 is the optimal size for training a kNN classifier. In addition, the correctness of subject categories assigned to the training sets is probed by manually reclassifying the training sets, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
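
A kNN text classifier of the kind used in this study can be sketched as follows. The abstract does not specify the similarity measure, the value of k, or the voting scheme. This sketch assumes cosine similarity over raw term frequencies and an unweighted majority vote, and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query: Sequence[str], training: List[Tuple[Sequence[str], str]], k: int = 3) -> str:
    """Return the majority category among the k training documents most similar to the query."""
    if not training:
        raise ValueError("training set must not be empty")
    q = Counter(query)
    scored = sorted(
        ((cosine(q, Counter(tokens)), label) for tokens, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


toy_training = [
    (["neural", "network", "learning"], "AI"),
    (["network", "protocol", "routing"], "Networks"),
    (["learning", "model", "training"], "AI"),
    (["routing", "packet", "switch"], "Networks"),
]
print(knn_classify(["learning", "model", "network"], toy_training, k=3))
# prints AI
```

Under this sketch, a subset such as Tr-20 or Tr-100 would correspond to passing a `training` list with that many documents per category.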

Regarding the illegal transaction of overseas direct purchase Monitoring service design and analysis (해외직구 물품 불법 거래에 관한 모니터링 서비스 설계와 해석)

  • Shin, Yong-Hun; Kim, Jeong-Ho; Jo, Jin-Pyo
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.508-511 / 2021
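  • The Customs Act exempts overseas direct-purchase goods from duties and taxes when their value is at or below a set amount (USD 150, or USD 200 for goods from the United States) or when they are recognized as goods for personal use; violating these provisions constitutes the offense of smuggling by non-declaration under the Customs Act. Because resellers of overseas direct-purchase goods are increasing and the matter has become a social issue, this paper designs and analyzes a monitoring system for illegal transactions of such goods. Transaction listings on online second-hand (e-commerce) sites were collected by crawling and structured through preprocessing, and relationship prediction was analyzed using data cleaning, text categorization, and text mining.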

A Study on the Categorization of Reading Strategies for Reading Instruction in School Library (학교도서관 중심의 독서교육을 위한 독서전략 범주화에 관한 연구)

  • Lee, Byeong-Ki
    • Journal of Korean Library and Information Science Society / v.39 no.3 / pp.139-159 / 2008
  • Much of the current literature on reading instruction supports the idea of teaching students a series of reading strategies instead of isolated reading skills. Reading strategies are plans or methods that can be used or taught to facilitate reading proficiency. Meanwhile, the reading instruction programs of school libraries have been limited to reading promotion events. Therefore, the reading instruction programs of school libraries need to focus on reading strategy-oriented instruction rather than reading skills. This study categorizes reading strategies by text type, text structure, reading process, and cognitive strategies.


Automatic Korean Text Categorization by Subject Thesaurus (분야별 관련어사전에 의한 한글 웹문서 자동분류)

  • Kim, Young; Chae, Soo-Hoan
    • Proceedings of the Korea Information Processing Society Conference / 2005.05a / pp.771-774 / 2005
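  • As the Internet has become widespread and the amount of text information available online has grown rapidly, effective information management and retrieval for scattered documents is required. Automatic document classification is the task of automatically assigning documents to predefined categories based on their content, and it enables efficient information management and retrieval. In particular, despite the importance of Korean information processing, there are many difficulties in collecting and classifying materials in related fields. Therefore, this paper builds a dictionary for each subject field as one step in the automatic categorization of Korean web documents, and proposes more effective subject-specific document classification through the removal of duplicate words.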
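
The dictionary-based approach can be sketched as below. The abstract does not define "duplicate words". This sketch assumes they are terms that appear in more than one subject dictionary and therefore do not discriminate between subjects. The scoring rule (a count of dictionary hits) and the toy dictionaries are likewise assumptions.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Set


def remove_shared_terms(dictionaries: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop every term that occurs in more than one subject dictionary."""
    counts = Counter(term for terms in dictionaries.values() for term in terms)
    return {
        subject: {term for term in terms if counts[term] == 1}
        for subject, terms in dictionaries.items()
    }


def classify(tokens: Sequence[str], dictionaries: Dict[str, Set[str]]) -> Optional[str]:
    """Assign the subject whose dictionary matches the most tokens, or None if nothing matches."""
    scores = {
        subject: sum(1 for token in tokens if token in terms)
        for subject, terms in dictionaries.items()
    }
    if not scores:
        return None
    # Iterating over sorted subjects makes ties resolve alphabetically.
    best = max(sorted(scores), key=lambda subject: scores[subject])
    return best if scores[best] > 0 else None


toy_dictionaries = {
    "sports": {"game", "team", "score", "news"},
    "economy": {"market", "stock", "price", "news"},
}
cleaned = remove_shared_terms(toy_dictionaries)
print(classify(["stock", "price", "news", "game"], cleaned))
# prints economy
```

After cleaning, "news" no longer contributes to either subject, so a document containing only "news" is left unclassified instead of being assigned arbitrarily.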


A Case Study on Text Analysis Using Meal Kit Product Review Data (밀키트 제품 리뷰 데이터를 이용한 텍스트 분석 사례 연구)

  • Choi, Hyeseon; Yeon, Kyupil
    • The Journal of the Korea Contents Association / v.22 no.5 / pp.1-15 / 2022
  • In this study, text analysis was performed on meal kit product review data to identify factors affecting the evaluation of meal kit products. The data used for the analysis were collected by scraping 334,498 reviews of meal kit products from the Naver shopping site. After preprocessing the text data, word clouds and sentiment analyses based on word frequency and normalized TF-IDF were performed. A logistic regression model was applied to predict the polarity of reviews of meal kit products. From the logistic regression models derived for each product category, the main factors that caused positive and negative emotions were identified. As a result, it was verified that text analysis can be a useful tool that provides a basis for maximizing positive factors for a specific category, menu, and material and for removing negative risk factors when developing a meal kit product.
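
The normalized TF-IDF weighting mentioned in this abstract can be sketched as follows. The abstract does not state which TF-IDF variant or normalization was used. This sketch assumes raw term frequency, `log(N / df)` as the inverse document frequency, and L2 normalization per review. The tokenized reviews are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Return one L2-normalized TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: tf[term] * math.log(n / df[term]) for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: (w / norm if norm else 0.0) for term, w in weights.items()})
    return vectors


toy_reviews = [
    ["fresh", "tasty", "fast"],
    ["fresh", "bland"],
    ["fresh", "tasty", "late"],
]
review_vectors = tfidf_vectors(toy_reviews)
for vector in review_vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

A term that occurs in every review ("fresh" here) gets weight 0 under this variant, while rarer terms such as "fast" or "bland" dominate their vectors. Weights like these could then serve as inputs to a polarity classifier such as logistic regression.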

An Analytical Study on Automatic Classification of Domestic Journal articles Based on Machine Learning (기계학습에 기초한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for Information Management / v.35 no.2 / pp.37-62 / 2018
  • This study examined the factors affecting the performance of machine learning-based automatic classification of domestic journal articles in the field of LIS. In particular, with respect to the performance of automatically assigning class labels to articles in the "Journal of the Korean Society for Information Management", I investigated the characteristics of the key factors (weighting schemes, training set size, classification algorithms, label assignment methods) through diversified experiments. Consequently, it is effective to apply each element appropriately according to the classification environment and the characteristics of the document set, and fairly good performance can be obtained with a simpler model. In addition, the classification of domestic journal articles can be considered a multi-label classification that assigns more than one category to a specific article. Therefore, I proposed an optimal classification model that uses a simple and fast classification algorithm and a small training set in consideration of this environment.