• Title/Summary/Keyword: 통계용어

Search Result 107, Processing Time 0.025 seconds

Comparison of term weighting schemes for document classification (문서 분류를 위한 용어 가중치 기법 비교)

  • Jeong, Ho Young;Shin, Sang Min;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.2
    • /
    • pp.265-276
    • /
    • 2019
  • The document-term frequency matrix is a general data of objects in text mining. In this study, we introduce a traditional term weighting scheme TF-IDF (term frequency-inverse document frequency) which is applied in the document-term frequency matrix and used for text classifications. In addition, we introduce and compare TF-IDF-ICSDF and TF-IGM schemes which are well known recently. This study also provides a method to extract keyword enhancing the quality of text classifications. Based on the keywords extracted, we applied support vector machine for the text classification. In this study, to compare the performance term weighting schemes, we used some performance metrics such as precision, recall, and F1-score. Therefore, we know that TF-IGM scheme provided high performance metrics and was optimal for text classification.

Detecting and classification ADRs using Named Entity Recognition on social media (개체명 인식을 이용한 소셜 미디어에서의 약물 부작용 표현 추출 및 분류)

  • Jeong, Hyeon-jeong;Kim, Hyon Hee
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2021.05a
    • /
    • pp.443-446
    • /
    • 2021
  • 의약품에 대한 안전성 정보 수집과 관리는 온라인, 오프라인을 통해 약물 이상 사례를 보고받는 형태로 진행되고 있다. 하지만 소비자들의 자발적인 참여로 이루어지므로 실제 발생하는 약물 부작용보다 데이터가 현저히 적다는 단점이 존재한다. 본 논문에서는 약물 이상 데이터 희소성 문제를 해결 할 수 있도록 소셜 미디어에서 약물 부작용 표현을 찾을 수 있도록 하였다. 소셜 미디어의 경우에는 표준 약물 부작용 용어를 사용하기보다는 일반인들이 자연어로 표현한 경우가 많으므로 개체명 인식 기법을 이용해 부작용을 추출할 수 있는 모델을 개발하였다. 또한 추출된 부작용 표현을 표준용어로 분류할 수 있는 모델을 제시하였다. 실험 결과 제안한 두 가지 모델은 0.9 이상의 정확도를 얻을 수 있었으며, 일반 사용자들이 자연어로 표현한 약물 부작용 표현을 효과적으로 찾아내고 표준 부작용 용어로 매핑할 수 있음을 보여준다.

Automatic term-network construction for Oral Documents (구술문서에 기초한 자동 용어 네트워크 구축)

  • Park, Soon-Cheol
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.12 no.4
    • /
    • pp.25-31
    • /
    • 2007
  • An automatic term-network construction system is proposed in this paper. This system uses the statistical values of the terms appeared in a document corpus. The 186 oral history documents collected from the Saemangeum area of Chollapuk-do, Korea, are used for the research. The term relationships presented in the term-network are decided by the cosine similarities of the term vectors. The number of the terms extracted from the documents is about 1700. The system is able to show the term relationships from the term-network as quickly as like a real-time system. The way of this term-network construction is expected as one of the methods to construct the ontology system and to support the semantic retrieval system in the near future.

  • PDF

Terms Based Sentiment Classification for Online Review Using Support Vector Machine (Support Vector Machine을 이용한 온라인 리뷰의 용어기반 감성분류모형)

  • Lee, Taewon;Hong, Taeho
    • Information Systems Review
    • /
    • v.17 no.1
    • /
    • pp.49-64
    • /
    • 2015
  • Customer reviews which include subjective opinions for the product or service in online store have been generated rapidly and their influence on customers has become immense due to the widespread usage of SNS. In addition, a number of studies have focused on opinion mining to analyze the positive and negative opinions and get a better solution for customer support and sales. It is very important to select the key terms which reflected the customers' sentiment on the reviews for opinion mining. We proposed a document-level terms-based sentiment classification model by select in the optimal terms with part of speech tag. SVMs (Support vector machines) are utilized to build a predictor for opinion mining and we used the combination of POS tag and four terms extraction methods for the feature selection of SVM. To validate the proposed opinion mining model, we applied it to the customer reviews on Amazon. We eliminated the unmeaning terms known as the stopwords and extracted the useful terms by using part of speech tagging approach after crawling 80,000 reviews. The extracted terms gained from document frequency, TF-IDF, information gain, chi-squared statistic were ranked and 20 ranked terms were used to the feature of SVM model. Our experimental results show that the performance of SVM model with four POS tags is superior to the benchmarked model, which are built by extracting only adjective terms. In addition, the SVM model based on Chi-squared statistic for opinion mining shows the most superior performance among SVM models with 4 different kinds of terms extraction method. Our proposed opinion mining model is expected to improve customer service and gain competitive advantage in online store.

Terminology Recognition System based on Machine Learning for Scientific Document Analysis (과학 기술 문헌 분석을 위한 기계학습 기반 범용 전문용어 인식 시스템)

  • Choi, Yun-Soo;Song, Sa-Kwang;Chun, Hong-Woo;Jeong, Chang-Hoo;Choi, Sung-Pil
    • The KIPS Transactions:PartD
    • /
    • v.18D no.5
    • /
    • pp.329-338
    • /
    • 2011
  • Terminology recognition system which is a preceding research for text mining, information extraction, information retrieval, semantic web, and question-answering has been intensively studied in limited range of domains, especially in bio-medical domain. We propose a domain independent terminology recognition system based on machine learning method using dictionary, syntactic features, and Web search results, since the previous works revealed limitation on applying their approaches to general domain because their resources were domain specific. We achieved F-score 80.8 and 6.5% improvement after comparing the proposed approach with the related approach, C-value, which has been widely used and is based on local domain frequencies. In the second experiment with various combinations of unithood features, the method combined with NGD(Normalized Google Distance) showed the best performance of 81.8 on F-score. We applied three machine learning methods such as Logistic regression, C4.5, and SVMs, and got the best score from the decision tree method, C4.5.

Creation and clustering of proximity data for text data analysis (텍스트 데이터 분석을 위한 근접성 데이터의 생성과 군집화)

  • Jung, Min-Ji;Shin, Sang Min;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.3
    • /
    • pp.451-462
    • /
    • 2019
  • Document-term frequency matrix is a type of data used in text mining. This matrix is often based on various documents provided by the objects to be analyzed. When analyzing objects using this matrix, researchers generally select only terms that are common in documents belonging to one object as keywords. Keywords are used to analyze the object. However, this method misses the unique information of the individual document as well as causes a problem of removing potential keywords that occur frequently in a specific document. In this study, we define data that can overcome this problem as proximity data. We introduce twelve methods that generate proximity data and cluster the objects through two clustering methods of multidimensional scaling and k-means cluster analysis. Finally, we choose the best method to be optimized for clustering the object.

일본의 결핵 - `97결핵의 통계

  • 대한결핵협회
    • 보건세계
    • /
    • v.45 no.2 s.498
    • /
    • pp.12-15
    • /
    • 1998
  • 일본 후생성과 일본결핵예방회에서 '결핵의 통계 1997년판' 책자를 발간했다. 동지에는 1996년 일본 전국의 결핵발생 현황과 대책에 관한 제반 역학 통계치들을 수록하였는데 주로 전국 보건소의 결핵발생동향조사(서베이런스)정보 시스템에 의한 자료를 정리$\cdot$분석하였다. 몇 개의 주요한 과제에 대해서는 최근 국내$\cdot$외의 학회, 학술지에 수록된 정보, 인구통계, 보건소, 운영보고 등의 자료를 취합하여 이해하기 쉽게 도표를 사용하여 설명했다. 특히 종전의 '서베이런스'에서 '발생동향조사'로 용어를 바꾼 것과 환자발생 동향뿐만 아니라 대책 및 조사결과의 해석과 현장으로의 환원이라고 하는 광의의 의미를 함께 담고 있다. 본문에서는 결핵의 통계에서 가장 중요한 지표인 발생상황(이환율) 및 치료상황(치료성공률-발견한 환자를 치유시키고 있는가)을 중심으로 전개하고자 한다.

  • PDF

A Leveling and Similarity Measure using Extended AHP of Fuzzy Term in Information System (정보시스템에서 퍼지용어의 확장된 AHP를 사용한 레벨화와 유사성 측정)

  • Ryu, Kyung-Hyun;Chung, Hwan-Mook
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.19 no.2
    • /
    • pp.212-217
    • /
    • 2009
  • There are rule-based learning method and statistic based learning method and so on which represent learning method for hierarchy relation between domain term. In this paper, we propose to leveling and similarity measure using the extended AHP of fuzzy term in Information system. In the proposed method, we extract fuzzy term in document and categorize ontology structure about it and level priority of fuzzy term using the extended AHP for specificity of fuzzy term. the extended AHP integrates multiple decision-maker for weighted value and relative importance of fuzzy term. and compute semantic similarity of fuzzy term using min operation of fuzzy set, dice's coefficient and Min+dice's coefficient method. and determine final alternative fuzzy term. after that compare with three similarity measure. we can see the fact that the proposed method is more definite than classification performance of the conventional methods and will apply in Natural language processing field.

Alleviating Syntactic Term Mismatches in Korean Information Retrieval (한국어정보검색에서 구문적 용어불일치 완화방안)

  • Yun, Bo-Hyun;Kim, Sang-Bum;Rim, Hae-Chang
    • Annual Conference on Human and Language Technology
    • /
    • 1998.10c
    • /
    • pp.143-149
    • /
    • 1998
  • 한국어 정보검색에서 복합명사와 명사구로 발생하는 색인어와 질의어간의 구문적 용어 불일치는 많은 문제를 일으켜왔다. 본 논문에서는 복합명사 분해와 명사구 정규화를 함께 수행하여 유사도 측정값을 적당히 유지함으로써 재현율을 저하시키지 않고서 정확률을 향상시킬 수 있는 구문적 용어불일치 완화방안을 제시하고자 한다 색인모듈에서는 통계정보를 이용하여 복합명사를 분해하고, 의존관계를 이용하여 명사구를 정규화한다. 분해되고 정규화된 키워드에 경계정보 '/'가 할당되고, 가중치가 계산된다. 검색모듈에서는 경계정보를 이용하여 부분일치를 고려하는 유사도 계산을 수행한다. KTSET 2.0으로 실험한 결과, 제안한 방법은 구문적 용어불일치를 완화할 수 있으며, 재현율을 저하시키지 않고서 정확률을 향상시킬 수 있음을 보인다.

  • PDF

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

  • Lee, Bo-Hui;Lee, Su-Jin;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.33 no.5
    • /
    • pp.615-625
    • /
    • 2020
  • The document-term frequency matrix is a term extracted from documents in which the group information exists in text mining. In this study, we generated the document-term frequency matrix for document classification according to research field. We applied the traditional term weighting function term frequency-inverse document frequency (TF-IDF) to the generated document-term frequency matrix. In addition, we applied term frequency-inverse gravity moment (TF-IGM). We also generated a document-keyword weighted matrix by extracting keywords to improve the document classification accuracy. Based on the keywords matrix extracted, we classify documents using a deep neural network. In order to find the optimal model in the deep neural network, the accuracy of document classification was verified by changing the number of hidden layers and hidden nodes. Consequently, the model with eight hidden layers showed the highest accuracy and all TF-IGM document classification accuracy (according to parameter changes) were higher than TF-IDF. In addition, the deep neural network was confirmed to have better accuracy than the support vector machine. Therefore, we propose a method to apply TF-IGM and a deep neural network in the document classification.