• Title/Summary/Keyword: Naïve Bayesian classifier

Extending Data Model of Naïve-Bayesian Classifier in e-Catalog Classification

  • Kim Sung-hwan;Kim Hyun-chul;Lee Tae-hee;Lee Sang-goo
    • Proceedings of the Korean Information Science Society Conference / 2005.07b / pp.100-102 / 2005
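  • The emergence of B2B marketplaces on the Internet has enabled multilateral transactions between sellers and buyers. On this foundation, the use of electronic catalogs containing product information is growing daily. However, because classification schemes and criteria differ for the same product, reclassifying electronic catalogs has remained an unavoidable and costly problem. To address this problem, this study uses a naïve Bayesian classifier model based on machine learning. The naïve Bayesian algorithm requires training data, and an electronic catalog carries relatively less training information than an ordinary document, so we compensate for this shortcoming by generating training information through an extension of the data model. Experiments confirm that generating effective and abundant training data has an important effect on improving classification accuracy in automatic e-catalog classification.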

A Novel Method for a Reliable Classifier using Gradients

  • Han, Euihwan;Cha, Hyungtai
    • IEIE Transactions on Smart Processing and Computing / v.6 no.1 / pp.18-20 / 2017
  • In this paper, we propose a new classification method to complement the naïve Bayesian classifier. The naïve Bayesian classifier assumes that the data are Gaussian-distributed, finds the discriminant function, and derives the decision curve. However, it does not investigate how to find the decision curve in much detail, and some minor problems arise in finding an accurate discriminant function. Our findings also show that it can produce errors when finding the decision curve. The aim of this study is therefore to investigate the existing problems and suggest a more reliable classification method. To do this, we use the gradient to find the decision curve. We then compare our algorithm with the naïve Bayesian method. Performance evaluation indicates that the average accuracy of our classification method is about 10% higher than that of naïve Bayes.
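
For orientation, the baseline that this abstract builds on can be sketched as follows. This is a minimal Gaussian naïve Bayes discriminant, not the gradient-based method the paper proposes; the class parameters in `params` are hypothetical values chosen only for illustration.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def discriminant(x, means, variances, prior):
    """g_c(x) = log P(c) + sum_j log N(x_j | mean_j, var_j).

    The sum over features encodes the naive independence assumption.
    """
    return math.log(prior) + sum(
        gaussian_log_pdf(xj, m, v) for xj, m, v in zip(x, means, variances)
    )

def classify(x, params):
    """Return the class whose discriminant is largest at x."""
    return max(params, key=lambda c: discriminant(x, *params[c]))

# Hypothetical two-class, two-feature model: (means, variances, prior).
params = {
    "a": ([0.0, 0.0], [1.0, 1.0], 0.5),
    "b": ([2.0, 2.0], [1.0, 1.0], 0.5),
}

print(classify([0.5, 0.5], params))  # a
print(classify([1.5, 1.5], params))  # b
```

The decision curve is the set of points where the two discriminants are equal. With the equal variances and priors assumed here it is the line x1 + x2 = 2, so the point (1, 1) lies exactly on it.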

A novel nomogram of naïve Bayesian model for prevalence of cardiovascular disease

  • Kang, Eun Jin;Kim, Hyun Ji;Lee, Jea Young
    • Communications for Statistical Applications and Methods / v.25 no.3 / pp.297-306 / 2018
  • Cardiovascular disease (CVD) is the leading cause of death worldwide and has a high mortality rate after onset; therefore, CVD management requires the development of treatment plans and the prediction of prevalence rates. In our study, age, income, education level, marriage status, diabetes, and obesity were identified as risk factors for CVD. Using these 6 factors, we proposed a nomogram based on a naïve Bayesian classifier model for CVD. The attributes of each factor were assigned point values between -100 and 100 by Bayes' theorem, so that the sign of a value shows whether the attribute is negatively or positively associated with CVD. Additionally, the prevalence rate can be calculated even in cases with some missing attribute values. A receiver operating characteristic (ROC) curve and a calibration plot verified the nomogram. Consequently, when the attribute values for these risk factors are known, the prevalence rate for CVD can be predicted using the proposed nomogram based on a naïve Bayesian classifier model.
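
One common way to turn a naïve Bayesian model into nomogram points is to take the log odds ratio of each attribute value and rescale so that the largest magnitude maps to 100. The sketch below assumes that construction; the abstract does not spell out the paper's exact scaling, and the likelihoods and prior are hypothetical numbers, not results from the study.

```python
import math

def attribute_points(likelihoods):
    """Map each (factor, value) to a point score in [-100, 100].

    likelihoods: {(factor, value): (P(value | CVD), P(value | no CVD))}
    Points are log odds ratios, rescaled so the largest magnitude is 100.
    """
    log_or = {k: math.log(p1 / p0) for k, (p1, p0) in likelihoods.items()}
    scale = 100.0 / max(abs(v) for v in log_or.values())
    return {k: v * scale for k, v in log_or.items()}, scale

def predicted_probability(observed, points, scale, prior):
    """Sum the points of the observed attributes and convert to a probability.

    Missing attributes are simply left out of `observed`; they add no points.
    """
    log_odds = math.log(prior / (1 - prior))
    log_odds += sum(points[k] for k in observed) / scale
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical likelihoods for a single factor.
likelihoods = {
    ("diabetes", "yes"): (0.4, 0.1),
    ("diabetes", "no"): (0.6, 0.9),
}
points, scale = attribute_points(likelihoods)
p = predicted_probability([("diabetes", "yes")], points, scale, prior=0.2)
print(round(points[("diabetes", "yes")]), round(p, 3))  # 100 0.5
```

Because each attribute contributes an additive term, an unknown attribute can be skipped and the remaining points still yield a prediction, which matches the abstract's remark about missing values.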

Extending Naïve Bayesian Classifier for Catalog Classification Systems

  • 서광훈;이경종;김현철;이태희;이상구
    • Proceedings of the Korean Information Science Society Conference / 2004.04b / pp.91-93 / 2004
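  • Transactions on a B2B marketplace have two main characteristics: a large variety and volume of goods are traded in n:n relationships, and each trader wants to trade using its own electronic catalog for smooth transactions and in-house management. However, the use of individual electronic catalogs and inadequate standards hinder the interlinking of electronic catalogs and, in turn, the formation of the market. B2B marketplaces try to meet this need by, for example, reclassifying traded products around a standard classification scheme to support mutual mapping of traded items between the transacting parties. The problem is that this is a costly task, because classification must be carried out each time for the large volume of items requested. To overcome this, this paper focuses on modeling and implementing an automatic product classifier for electronic catalogs using machine learning. Based on the observation that each product attribute influences classification to a different degree, we remodeled the electronic catalog at the product level, and to overcome the sparseness of per-attribute information we defined an extended model that adds data in which attribute values are split into words. As the algorithm for training this model, we devised a naïve Bayesian classifier extended so that a different weight can be assigned to each attribute. We validated the proposed model by applying it to real data from a B2B marketplace.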
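
The idea of splitting attribute values into words and weighting attributes differently can be sketched as below. The scoring rule (a weight multiplying each attribute's log-likelihood), the class name `WeightedNaiveBayes`, the weights, and the catalog items are all assumptions for illustration; the abstract does not give the paper's exact formula.

```python
import math
from collections import Counter, defaultdict

class WeightedNaiveBayes:
    """Naive Bayes over tokenized attribute values with per-attribute weights."""

    def __init__(self, weights, alpha=1.0):
        self.weights = weights                     # hypothetical attribute weights
        self.alpha = alpha                         # Laplace smoothing constant
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)   # (class, attribute) -> token counts
        self.vocab = defaultdict(set)              # attribute -> tokens seen in training

    def fit(self, items, labels):
        for item, label in zip(items, labels):
            self.class_counts[label] += 1
            for attr, value in item.items():
                for tok in value.lower().split():  # split attribute value into words
                    self.token_counts[(label, attr)][tok] += 1
                    self.vocab[attr].add(tok)
        return self

    def score(self, item, label):
        total = sum(self.class_counts.values())
        s = math.log(self.class_counts[label] / total)
        for attr, value in item.items():
            counts = self.token_counts[(label, attr)]
            # +1 reserves probability mass for tokens never seen in training.
            denom = sum(counts.values()) + self.alpha * (len(self.vocab[attr]) + 1)
            w = self.weights.get(attr, 1.0)
            for tok in value.lower().split():
                s += w * math.log((counts[tok] + self.alpha) / denom)
        return s

    def predict(self, item):
        return max(self.class_counts, key=lambda c: self.score(item, c))

# Hypothetical catalog items and categories.
items = [
    {"name": "steel bolt", "maker": "acme"},
    {"name": "laser printer", "maker": "inkco"},
]
labels = ["hardware", "office"]
clf = WeightedNaiveBayes({"name": 2.0, "maker": 0.5}).fit(items, labels)
print(clf.predict({"name": "steel bolt m8", "maker": "acme"}))  # hardware
```

Setting every weight to 1.0 recovers an ordinary multinomial naïve Bayes over the tokens, which makes the effect of the weights easy to compare.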

Text Categorization Using TextRank Algorithm

  • Bae, Won-Sik;Cha, Jeong-Won
    • Journal of KIISE:Computing Practices and Letters / v.16 no.1 / pp.110-114 / 2010
  • We describe a new method for text categorization using the TextRank algorithm. Text categorization is the problem of assigning one or more predefined categories to a text document. TextRank is a graph-based ranking algorithm. If we treat each word as a vertex and the co-occurrence of two adjacent words as an edge, we can build a graph from a document. We then use TextRank to find the important words in the graph and construct features as word pairs, each consisting of an important word and a word adjacent to it. We use four classifiers: SVM, naïve Bayesian classifier, maximum entropy model, and k-NN. We use the non-cross-posted version of the 20 Newsgroups data set. As a result, performance improved for all classifiers, which suggests that the TextRank algorithm is a promising approach to text categorization.
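
The feature construction described in this abstract can be sketched roughly as follows. It assumes an undirected, unweighted graph over adjacent words, a damping factor of 0.85, and a fixed number of iterations; the helper names and the sample sentence are hypothetical, and the paper's own preprocessing is not given in the abstract.

```python
from collections import defaultdict

def build_graph(words):
    """Each word is a vertex; adjacent words in the text share an edge."""
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def textrank(graph, damping=0.85, iterations=50):
    """PageRank-style score update on the undirected word graph."""
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    return scores

def important_word_pairs(words, top_k=2):
    """Features: (important word, adjacent word) pairs for the top_k words."""
    graph = build_graph(words)
    scores = textrank(graph)
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return sorted({(w, n) for w in top for n in graph[w]})

words = "graph ranking algorithm ranks graph vertices by graph links".split()
scores = textrank(build_graph(words))
print(max(scores, key=scores.get))  # graph
```

The resulting pairs can then be fed to any of the classifiers listed in the abstract as bag-of-features input.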

Software Quality Classification using Bayesian Classifier

  • Hong, Euy-Seok
    • Journal of Information Technology Services / v.11 no.1 / pp.211-221 / 2012
  • Many metric-based classification models have been proposed to predict the fault-proneness of software modules. This paper presents two prediction models using the Bayesian classifier, one of the most popular modern classification algorithms. A Bayesian model, based on Bayesian probability theory, is a promising technique for software quality prediction because it can represent uncertainty using probabilities and can partly incorporate expert knowledge into the training data. The two models, Naïve Bayes (NB) and Bayesian Belief Network (BBN), are constructed, and dimensionality reduction is performed on the training and test data before model evaluation. The prediction accuracy of each model is evaluated using two prediction error measures, Type I error and Type II error, and compared with two well-known prediction models, a backpropagation neural network and a support vector machine. The results show that the prediction performance of the BBN model is slightly better than that of NB. For the data set with ambiguity, the BBN model's prediction accuracy is not as good as that of the compared models, but for the data set without ambiguity it outperforms them.
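
The two error measures can be computed as in the sketch below. It assumes one common convention: Type I error is the share of non-fault-prone modules flagged as fault-prone, and Type II error is the share of fault-prone modules that are missed. The abstract does not state which denominators the paper uses, and the labels here are made up for illustration.

```python
def type_errors(actual, predicted):
    """Return (Type I rate, Type II rate) for binary fault-proneness labels.

    actual, predicted: sequences of 1 (fault-prone) and 0 (not fault-prone).
    Type I  = false positives / actual negatives
    Type II = false negatives / actual positives
    """
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    type1 = fp / negatives if negatives else 0.0
    type2 = fn / positives if positives else 0.0
    return type1, type2

# Hypothetical labels for five modules.
actual = [1, 1, 0, 0, 0]
predicted = [1, 0, 1, 0, 0]
t1, t2 = type_errors(actual, predicted)
print(round(t1, 3), t2)  # 0.333 0.5
```

In fault prediction a Type II error is usually the costlier one, since a missed fault-prone module goes untested.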

A Study on Differences of Contents and Tones of Arguments among Newspapers Using Text Mining Analysis

  • Kam, Miah;Song, Min
    • Journal of Intelligence and Information Systems / v.18 no.3 / pp.53-77 / 2012
  • This study analyzes the differences in content and tone of argument among three major Korean newspapers, the Kyunghyang Shinmun, the HanKyoreh, and the Dong-A Ilbo. It is commonly accepted that newspapers in Korea explicitly convey their own tone of argument when they cover sensitive issues and topics. This can be problematic if readers are unaware of a newspaper's tone, because the content and tone of argument can easily influence them. It is therefore desirable to have a tool that informs readers of the tone of argument a newspaper takes. This study presents the results of clustering and classification techniques as part of a text mining analysis. We focus on six main subjects in newspapers (Culture, Politics, International, Editorial-opinion, Eco-business, and National issues) and attempt to identify differences and similarities among the newspapers. The basic unit of the text mining analysis is a paragraph of a news article. This study uses a keyword-network analysis tool and visualizes the relationships among keywords to make the differences easier to see. Newspaper articles were gathered from KINDS, the Korean integrated news database system, which preserves articles from the three newspapers and makes them open to the public. About 3,030 articles from 2008 to 2012 were used. The International, National issues, and Politics sections were gathered around specific issues: the International section was collected with the keyword 'Nuclear weapon of North Korea,' the National issues section with the keyword '4-major-river,' and the Politics section with the keyword 'Tonghap-Jinbo Dang.' All articles from April 2012 to May 2012 in the Eco-business, Culture, and Editorial-opinion sections were also collected.
All of the collected data were split and edited into paragraphs. We removed stop-words using the Lucene Korean Module. We calculated keyword co-occurrence counts from the list of keyword pairs that co-occur in a paragraph and built a co-occurrence matrix from that list. Once the co-occurrence matrix was built, we used the cosine coefficient matrix as input for PFNet (Pathfinder Network). To analyze the three newspapers and find the significant keywords in each, we examined the 10 highest-frequency keywords and the keyword networks of the 20 highest-frequency keywords, which show the relationships among keywords in detail. We used NodeXL software to visualize the PFNet. After drawing all the networks, we compared them with the classification results. Classification was first carried out to identify how the tone of argument of one newspaper differs from the others. Then, to analyze tones of argument, all paragraphs were divided into two types, Positive tone and Negative tone. A supervised learning technique was used to classify the tones of all collected paragraphs and articles: the naïve Bayesian classifier algorithm provided in the MALLET package classified all the paragraphs in the articles. After classification, precision, recall, and F-value were used to evaluate the results. Three subjects (Culture, Eco-business, and Politics) showed some differences in content and tone of argument among the three newspapers. In addition, for National issues, the tones of argument on the 4-major-rivers project differed from one newspaper to another. The three newspapers appear to have their own specific tone of argument in those sections. The keyword networks also had different shapes for the same period and the same section.
This means that the frequently appearing keywords differ across the newspapers' articles and that their contents are composed of different keywords. The Positive-Negative classification also showed that a newspaper's tone of argument can be classified relative to the others. These results indicate that the approach in this study is promising as a new tool for identifying the different tones of argument of newspapers.
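
The co-occurrence and cosine steps described in this abstract can be sketched as follows. The sketch assumes that each paragraph has already been reduced to a keyword list and that the cosine coefficient is taken between rows of the co-occurrence matrix; the keyword lists are hypothetical, and the Pathfinder Network and NodeXL steps are not reproduced.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence(paragraphs):
    """Count how many paragraphs contain each unordered keyword pair."""
    counts = Counter()
    for keywords in paragraphs:
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts

def cosine_matrix(counts):
    """Build the symmetric co-occurrence matrix and its row-wise cosine matrix."""
    vocab = sorted({w for pair in counts for w in pair})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0] * n for _ in range(n)]
    for (a, b), c in counts.items():
        m[index[a]][index[b]] = c
        m[index[b]][index[a]] = c
    norms = [math.sqrt(sum(v * v for v in row)) for row in m]
    cos = [
        [
            sum(x * y for x, y in zip(m[i], m[j])) / (norms[i] * norms[j])
            if norms[i] and norms[j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    return vocab, cos

# Hypothetical keyword lists, one per paragraph.
paragraphs = [["river", "dam", "budget"], ["river", "dam"], ["budget", "tax"]]
counts = cooccurrence(paragraphs)
vocab, cos = cosine_matrix(counts)
print(vocab)                # ['budget', 'dam', 'river', 'tax']
print(counts[("dam", "river")])  # 2
```

The cosine matrix, rather than the raw counts, is what the abstract says is passed on to the network-scaling step.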