• Title/Summary/Keyword: Text categorization

Keyword Extraction from News Corpus using Modified TF-IDF (TF-IDF의 변형을 이용한 전자뉴스에서의 키워드 추출 기법)

  • Lee, Sung-Jick;Kim, Han-Joon
    • The Journal of Society for e-Business Studies / v.14 no.4 / pp.59-73 / 2009
  • Keyword extraction is an essential technique for text mining applications such as information retrieval, text categorization, summarization, and topic detection. Keywords extracted from large-scale electronic document collections serve as significant features for text mining algorithms and help improve the performance of document browsing, topic detection, and automated text classification. This paper presents a keyword extraction technique that can be used to detect topics for each news domain in a large document collection from internet news portal sites. We use six variants of the traditional TF-IDF weighting model. On top of the TF-IDF model, we propose a word filtering technique called 'cross-domain comparison filtering'. To demonstrate the effectiveness of our method, we analyze the usefulness of keywords extracted from Korean news articles and show how the keywords of each news domain change over time.
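
A minimal sketch of the general idea in the abstract above, not the authors' method: the abstract does not specify the six TF-IDF variants or the exact filtering rule. The version below assumes one common TF-IDF form (domain-level term frequency times `log(N/df) + 1`) and assumes the filter drops any term that reaches the top-k list of more than `max_domains` domains. The function names, `k`, and `max_domains` are hypothetical.

```python
import math
from collections import Counter


def top_terms(docs, k):
    """Rank the terms of one domain by TF-IDF (assumed variant: tf * (log(N/df) + 1))."""
    n_docs = len(docs)
    tf = Counter(t for doc in docs for t in doc)        # term frequency over the whole domain
    df = Counter(t for doc in docs for t in set(doc))   # number of articles containing the term
    score = {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf}
    # Sort by descending score, then alphabetically so that ties are deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def cross_domain_filter(domains, k=3, max_domains=1):
    """Hypothetical filter: drop terms that are top-k in more than max_domains domains."""
    tops = {name: top_terms(docs, k) for name, docs in domains.items()}
    spread = Counter(t for terms in tops.values() for t in terms)
    return {name: [t for t in terms if spread[t] <= max_domains]
            for name, terms in tops.items()}


# Toy corpus: each domain maps to a list of tokenized articles.
domains = {
    "sports": [["game", "team", "today"], ["team", "score", "today"]],
    "economy": [["market", "stock", "today"], ["stock", "price", "today"]],
}
keywords = cross_domain_filter(domains)
```

In this toy corpus "today" scores highly in both domains, so the assumed filter removes it and leaves domain-specific terms such as "team" and "stock".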

A Study on The Usability Evaluation Based on Text Analysis for The Development of Comfort-Shoes for Middle-Aged

  • KIM, Ji Ho;YOON, Sang Hoon;KWON, Ki Hyun;SEO, Jeong Kwon;HAN, Seung Jin
    • Journal of Sport and Applied Science / v.3 no.2 / pp.17-27 / 2019
  • Purpose: This study conducts usability evaluations for the development of comfort shoes for middle-aged and elderly users, in order to identify key factors and derive implications for optimal comfort-shoe production. Research design, data, and methodology: Ten middle-aged and elderly women in their 50s and 60s were selected as participants. Data were collected through a gang survey consisting of a preliminary briefing, a shoe test, and interviews. The collected data were analyzed in four steps. In step 1, the interview contents were transcribed, organized, and analyzed systematically. In step 2, unnecessary vocabulary, sentences, and overlapping opinions were eliminated. In step 3, the remaining text was grouped by key function and categorized. Finally, in step 4, the results and implications were derived by classifying the text for each evaluated shoe as positive or negative within the categorized data. Results: Seven factors were identified for comfort-shoe usability evaluation: cushion, fitting, stability, flexibility, lightweight, comfort, and pressure. Across the four products, the positive and negative factors formed positive-centered, negative-centered, and mixed patterns. The positive-centered product was VA, with seven times as many positive factors as negative ones. The negative-centered products were CL and SA, with five times as many negative factors as positive ones. The mixed product was CA, with a positive-to-negative ratio of 1:1. Text-based usability evaluation allows analysis based on more systematic data than simply listening to opinions and judging by comments. Conclusions: The study discusses implications for developing comfort shoes for middle-aged consumers and future research directions.

A Study on the Categorization of Reading Strategies for Reading Instruction in School Library (학교도서관 중심의 독서교육을 위한 독서전략 범주화에 관한 연구)

  • Lee, Byeong-Ki
    • Journal of Korean Library and Information Science Society / v.39 no.3 / pp.139-159 / 2008
  • Much of the current literature on reading instruction supports the idea of teaching students a series of reading strategies instead of isolated reading skills. Reading strategies are plans or methods that can be used or taught to facilitate reading proficiency. Meanwhile, reading instruction programs in school libraries have been limited to reading promotion events. Therefore, school library reading instruction programs need to focus on reading-strategy-oriented instruction rather than on reading skills. This study categorizes reading strategies by text type, text structure, reading process, and cognitive strategies.

Text-mining Based Graph Model for Keyword Extraction from Patent Documents (특허 문서로부터 키워드 추출을 위한 위한 텍스트 마이닝 기반 그래프 모델)

  • Lee, Soon Geun;Leem, Young Moon;Um, Wan Sup
    • Journal of the Korea Safety Management & Science / v.17 no.4 / pp.335-342 / 2015
  • Growing interest in patents has led many individuals and companies to apply for patents in various areas. Patent applications are stored as electronic documents, and searching and categorizing these documents are major issues in data mining. In particular, keyword extraction, which retrieves the representative keywords of a document, is important. Most keyword extraction techniques are based on the vector space model. This model relies only on the frequency of terms in documents: it weights terms by their frequency and selects keywords in order of weight. Its limitation is that it cannot reflect the relations between keywords. This paper proposes an improved way to extract more representative keywords by overcoming this limitation. The proposed model first prepares a candidate set using the vector space model, then builds a graph that represents the relations between pairs of candidate keywords in the set, and finally selects the keywords based on this relationship graph.
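
The three stages described in the abstract above (candidate set, relation graph, graph-based selection) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how relations are measured or how the graph is scored, so the sketch assumes raw frequency for candidate selection, within-document co-occurrence as the relation, and weighted degree as the ranking score. All names and parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def graph_keywords(docs, n_candidates=5, n_keywords=3):
    """Pick keywords by weighted degree in a co-occurrence graph of frequent terms."""
    # Stage 1: candidate set from plain term frequency (stand-in for the vector model).
    tf = Counter(t for doc in docs for t in doc)
    ranked_by_tf = sorted(tf, key=lambda t: (-tf[t], t))
    candidates = set(ranked_by_tf[:n_candidates])

    # Stage 2: edge weight = number of documents in which both candidates occur.
    edges = Counter()
    for doc in docs:
        present = sorted(candidates & set(doc))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1

    # Stage 3: score each candidate by the total weight of its edges.
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return sorted(candidates, key=lambda t: (-degree[t], t))[:n_keywords]


docs = [
    ["battery", "cell", "anode"],
    ["battery", "cell", "cathode"],
    ["battery", "anode", "cathode"],
    ["patent"], ["patent"], ["patent"],
]
selected = graph_keywords(docs)
```

Here "patent" is as frequent as "battery" but never co-occurs with another candidate, so a purely frequency-based ranking would keep it while the graph-based ranking does not.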

A One-Size-Fits-All Indexing Method Does Not Exist: Automatic Selection Based on Meta-Learning

  • Jimeno-Yepes, Antonio;Mork, James G.;Demner-Fushman, Dina;Aronson, Alan R.
    • Journal of Computing Science and Engineering / v.6 no.2 / pp.151-160 / 2012
  • We present a methodology that automatically selects indexing algorithms for each heading in Medical Subject Headings (MeSH), the National Library of Medicine's vocabulary for indexing MEDLINE. While manually comparing indexing methods is manageable for a limited number of MeSH headings, the large number of headings makes automating this selection desirable. Results show that this process can be automated, based on previously indexed MEDLINE citations. We find that AdaBoostM1 is better suited to index a group of MeSH headings named Check Tags, and helps improve the micro F-measure from 0.5385 to 0.7157, and the macro F-measure from 0.4123 to 0.5387 (both p < 0.01).
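
The abstract above reports both micro and macro F-measures. As a reminder of how the two averages differ, here is a small sketch using the standard definitions; the labels and counts are made up for illustration and are not taken from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each label to a (tp, fp, fn) tuple."""
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent labels dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Illustrative counts: one frequent, well-indexed label and one rare, poorly indexed one.
counts = {"frequent_label": (90, 10, 10), "rare_label": (1, 4, 4)}
micro, macro = micro_macro_f1(counts)
```

With these counts the micro average is dominated by the frequent label, while the macro average is pulled down by the rare one, which is why the two are usually reported together.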

An Evaluation of Category Features in Text Categorization Using Nearest Neighbor Method (Nearest Neighbor 방법을 이용한 문서 범주화에서 범주 자질의 평가)

  • Kwon, Oh-Woog;Lee, Jong-Hyeok;Lee, Geun-Bae
    • Annual Conference on Human and Language Technology / 1997.10a / pp.7-14 / 1997
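  • In text categorization, finding the appropriate kinds and number of categories for a document according to its content is best addressed by a model that performs well when a single category is assigned per document. This paper therefore uses the k-nearest neighbor method, which gives good results in the one-category-per-document setting. To improve the performance of k-nearest neighbor text categorization, we propose restricting the words used for document representation to words that behave as category features. The proposed method achieved a breakeven point of 82% on the Reuters-21578 test collection, which consists of one year of Reuters news.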
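
A minimal sketch of k-nearest neighbor categorization over a restricted vocabulary, assuming cosine similarity on raw term counts and similarity-weighted voting. The abstract does not describe how category-feature words are chosen, so the `features` set here is simply given by hand; all names are hypothetical.

```python
import math
from collections import Counter


def vectorize(tokens, features):
    """Bag-of-words vector that keeps only the allowed feature words."""
    return Counter(t for t in tokens if t in features)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, tokens, features, k=3):
    """train: list of (tokens, label). Returns the label with the largest similarity mass."""
    query = vectorize(tokens, features)
    sims = [(cosine(query, vectorize(doc, features)), label) for doc, label in train]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, label in sims[:k]:
        votes[label] += sim   # each neighbor votes with its similarity
    return votes.most_common(1)[0][0]


train = [
    (["wheat", "grain", "the"], "grain"),
    (["grain", "export", "the"], "grain"),
    (["oil", "crude", "the"], "crude"),
    (["crude", "barrel", "the"], "crude"),
]
features = {"wheat", "grain", "oil", "crude", "barrel", "export"}  # hand-picked for the example
predicted = knn_predict(train, ["the", "grain", "wheat"], features)
```

Because "the" is not in the feature set, it contributes nothing to the similarity, so the crude-oil documents do not look like neighbors merely because they share a function word.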

Automatic Text Categorization by Term Weighting and Inverted Category Frequency (용어 가중치와 역범주 빈도에 의한 자동문서 범주화)

  • Lee, Kyung-Chan;Kang, Seung-Shik
    • Annual Conference on Human and Language Technology / 2003.10d / pp.14-17 / 2003
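  • The naive Bayesian probabilistic model is a representative text categorization technique that classifies documents automatically using document probabilities. In its basic form, the method computes probabilities from the terms that occur in a document. In practice, however, terms that do not occur can also strongly affect performance, and a text categorization system can be improved by applying inverted category frequency or term weights in addition to the frequency of occurring terms. In this paper, we experiment with smoothing techniques for both occurring and non-occurring terms in the naive Bayesian model. Newsgroup documents were used for evaluation, and applying inverted category frequency and term weights improved performance by about 7% over the plain naive Bayesian model.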
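
A rough sketch of a multinomial naive Bayes classifier with add-one smoothing and an optional inverted category frequency (ICF) weight. The abstract does not give the formulas used in the paper, so this assumes ICF(t) = log(|C| / cf(t)) + 1, where cf(t) is the number of categories containing term t, and applies it by scaling term counts before smoothing. Function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, use_icf=False):
    """docs: list of (tokens, category). Returns (model, vocabulary)."""
    counts = defaultdict(Counter)
    priors = Counter()
    for tokens, cat in docs:
        counts[cat].update(tokens)
        priors[cat] += 1
    vocab = {t for c in counts.values() for t in c}
    n_cats = len(counts)
    # Assumed ICF: terms concentrated in few categories get a larger weight.
    icf = {t: math.log(n_cats / sum(1 for c in counts.values() if t in c)) + 1.0
           for t in vocab}
    model = {}
    for cat, c in counts.items():
        weighted = {t: c[t] * (icf[t] if use_icf else 1.0) for t in vocab}
        total = sum(weighted.values())
        model[cat] = {
            "prior": math.log(priors[cat] / len(docs)),
            # Add-one smoothing gives non-occurring terms a small, nonzero probability.
            "loglik": {t: math.log((weighted[t] + 1.0) / (total + len(vocab)))
                       for t in vocab},
        }
    return model, vocab


def classify(model, vocab, tokens):
    def score(cat):
        m = model[cat]
        return m["prior"] + sum(m["loglik"][t] for t in tokens if t in vocab)
    return max(sorted(model), key=score)


docs = [
    (["goal", "match", "win"], "sports"),
    (["match", "team"], "sports"),
    (["stock", "market", "win"], "finance"),
    (["market", "price"], "finance"),
]
model, vocab = train_nb(docs, use_icf=True)
label = classify(model, vocab, ["team", "match"])
```

Words outside the training vocabulary are ignored at classification time in this sketch; other treatments are possible.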

Hierarchical Text Categorization using Support Vector Machine (지지 벡터 기계를 이용한 계층적 문서 분류)

  • Yoon, Yong-Wook;Lee, Chang-Ki;Lee, Gary Geun-Bae
    • Annual Conference on Human and Language Technology / 2003.10d / pp.7-13 / 2003
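  • As the volume of documents created and transmitted over the Internet grows rapidly, automatic document classification is urgently needed to make information easier to access. The Support Vector Machine (SVM) has recently been widely used for document classification and performs well compared with other classifiers. So far, however, SVMs have mainly been applied to non-hierarchical (flat) classification. In contrast, this paper shows that hierarchical classification, which groups classes with similar properties into a hierarchical structure, is easier for people to understand, more convenient to use, and more effective than flat classification, which outputs the final class in a single step. Through experiments, we develop an effective SVM classifier for hierarchical classification and confirm that it can achieve better classification performance than flat classification.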

Fuzzy-based Trust Measurement for CoPs in Knowledge Management Systems (실행공동체를 위한 지식관리시스템에서의 퍼지기반 신뢰도 측정)

  • Yang, Kun-Woo
    • The Journal of Information Systems / v.19 no.4 / pp.65-85 / 2010
  • The importance of communities of practice (CoPs) as informal organizational units for fostering knowledge transfer and sharing has gained considerable attention from KM researchers and practitioners. Since most CoPs are formed online these days, the credibility or trustworthiness of the knowledge contents circulated within a CoP should be considered thoroughly so that they can be used fully and safely. Hence the need for an appropriate trust measurement methodology to determine the true value of knowledge provided by unknown people through an online channel. In this paper, an improved trust measurement method is proposed using new trust variables, such as the level of degrees derived from the relationships among community users. In addition, the activeness, relevance, and usefulness of the knowledge contents themselves, which are calculated automatically using a text categorization technique, are also used for trust measurement. The proposed framework incorporates fuzzy set and fuzzy calculation concepts to help build trust matrices and models, which are used to measure the level of trust in specific knowledge artifacts.

Automatic Text Categorization by using Normalized Term Frequency Weighting (정규화 용어빈도가중치에 의한 자동문서분류)

  • 김수진;김민수;백장선;박혁로
    • Proceedings of the Korean Information Science Society Conference / 2003.04c / pp.510-512 / 2003
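  • In this paper, we define a normalized term frequency weight that applies the Box-Cox transformation to the computation of term frequency weights for automatic document classification, and we apply it to document classification. The Box-Cox transformation is a statistical transformation used to bring data closer to a normal distribution; we adapt it to propose a new term frequency weighting method. A term that occurs too often or too rarely in a document becomes less important, which means that term importance can be modeled as a normal distribution over frequency. Moreover, unlike existing term frequency weighting formulas, the normalized weighting is computed differently for each term, so it is more general than fixed weighting schemes such as the logarithm or the square root. In experiments on 8,000 newspaper articles divided into four groups, the normalized term frequency weighting achieved higher classification accuracy in every case, which supports the validity of the proposed method.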
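
For reference, the standard one-parameter Box-Cox transformation is (x^λ − 1)/λ for λ ≠ 0 and ln x for λ = 0. The sketch below shows only this textbook transform applied to a raw term frequency; how the paper chooses λ per term and turns the result into a weight is not stated in the abstract, so that part is not modeled here. The function name is hypothetical.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive values only")
    if lam == 0:
        return math.log(x)            # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam


# The same raw frequency under three settings of lam.
tf = 4
log_like = box_cox(tf, 0)      # logarithmic damping
root_like = box_cox(tf, 0.5)   # equals 2 * (sqrt(tf) - 1), a shifted square root
linear = box_cox(tf, 1)        # equals tf - 1, no damping
```

Setting λ to 0 or 0.5 recovers log-like and root-like damping, which is one way to see why a per-term λ is more general than a fixed logarithm or square root.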
