• Title/Summary/Keyword: naive Bayesian

Investigating Opinion Mining Performance by Combining Feature Selection Methods with Word Embedding and BOW (Bag-of-Words) (속성선택방법과 워드임베딩 및 BOW (Bag-of-Words)를 결합한 오피니언 마이닝 성과에 관한 연구)

  • Eo, Kyun Sun;Lee, Kun Chang
    • Journal of Digital Convergence / v.17 no.2 / pp.163-170 / 2019
  • Over the past decade, the growth of the Web has explosively increased the volume of data. Feature selection is an important step in extracting valuable information from large amounts of data. This study proposes a novel opinion mining model that combines feature selection (FS) methods with Word2vec word embeddings and BOW (bag-of-words) features. The FS methods adopted are CFS (correlation-based FS) and IG (information gain). To select an optimal FS method, a number of classifiers were evaluated: LR (logistic regression), NN (neural network), NBN (naive Bayesian network), RF (random forest), RS (random subspace), and ST (stacking). Empirical results on the electronics and kitchen datasets showed that the LR and ST classifiers combined with IG applied to BOW features yielded the best opinion mining performance. Results on the laptop and restaurant datasets showed that the RF classifier using IG applied to Word2vec features performed best.
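
As a rough illustration of the IG (information gain) step named in this abstract, the sketch below scores bag-of-words terms by how much knowing "term present / term absent" reduces the entropy of the sentiment label. The toy documents, the labels, and the function names `entropy` and `information_gain` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    # Weighted entropy that remains after splitting on the term.
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - remainder


# Hypothetical toy corpus: each document is a set of words (a binary BOW).
docs = [{"great", "battery"}, {"poor", "battery"}, {"great", "screen"}, {"poor", "screen"}]
labels = ["pos", "neg", "pos", "neg"]

vocabulary = {term for doc in docs for term in doc}
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
print(ranked[:2])  # the two highest-scoring terms would be kept as features
```

Here "great" and "poor" separate the classes perfectly, so they rank above "battery" and "screen", which carry no label information in this toy corpus.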

Comparison of GEE Estimators Using Imputation Methods (대체방법별 GEE추정량 비교)

  • 김동욱;노영화
    • The Korean Journal of Applied Statistics / v.16 no.2 / pp.407-426 / 2003
  • We consider the missing covariates problem in generalized estimating equations (GEE) models. If a covariate is partially missing, the GEE estimator cannot be calculated. In this paper, we study the performance of seven imputation methods for handling missing covariates in GEE models, and we investigate the properties of the GEE estimators after the missing covariates are imputed for ordinal repeated-measures data. The seven imputation methods are i) naive deletion, ii) sample average imputation, iii) row average imputation, iv) cross-wave regression imputation, v) carry-over imputation, vi) Bayesian bootstrap, and vii) approximate Bayesian bootstrap. A Monte Carlo simulation is used to compare the performance of these methods. For the mechanism generating the missing data, we assume ignorable nonresponse. Furthermore, we generate missing covariates with and without considering wave nonresponse patterns.
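
To make three of the simpler imputation methods concrete, here is a minimal sketch on a subjects-by-waves table. The abstract does not define the methods, so the readings used here are assumptions: sample average fills with the mean of the same wave across subjects, row average fills with the subject's own mean across waves, and carry-over carries the subject's most recent observed value forward. The function names and the data are hypothetical.

```python
def sample_average_impute(data):
    """Fill a missing value with the mean of that wave over all subjects.

    Assumes every wave has at least one observed value.
    """
    n_waves = len(data[0])
    col_means = []
    for j in range(n_waves):
        observed = [row[j] for row in data if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(row)] for row in data]


def row_average_impute(data):
    """Fill a missing value with the subject's own mean over observed waves.

    Assumes every subject has at least one observed value.
    """
    out = []
    for row in data:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        out.append([mean if v is None else v for v in row])
    return out


def carry_over_impute(data):
    """Fill a missing value with the subject's most recent observed value.

    A value missing at the first wave stays None because nothing precedes it.
    """
    out = []
    for row in data:
        filled, last = [], None
        for v in row:
            if v is not None:
                last = v
            filled.append(last)
        out.append(filled)
    return out


# Hypothetical covariate: rows are subjects, columns are waves, None is missing.
data = [[1.0, 2.0, None], [3.0, None, 5.0], [5.0, 6.0, 7.0]]
print(sample_average_impute(data))
print(row_average_impute(data))
print(carry_over_impute(data))
```

The imputed tables could then be passed to any GEE fitting routine; the regression-based and bootstrap-based methods in the list need a model and are not sketched here.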

Comparison of nomograms designed to predict hypertension with a complex sample (고혈압 예측을 위한 노모그램 구축 및 비교)

  • Kim, Min Ho;Shin, Min Seok;Lee, Jea Young
    • The Korean Journal of Applied Statistics / v.33 no.5 / pp.555-567 / 2020
  • Hypertension has a steadily increasing incidence rate and is a risk factor for secondary diseases such as cardiovascular disease. Therefore, it is important to predict the incidence rate of the disease. In this study, we constructed nomograms that can predict the incidence rate of hypertension. We used data from the Korean National Health and Nutrition Examination Survey (KNHANES) for 2013-2016. Because the data come from a complex sample, a Rao-Scott chi-squared test was used to identify 10 risk factors for hypertension. The smoking and exercise variables were not statistically significant in the logistic regression; therefore, eight effects were selected as risk factors for hypertension. Logistic and Bayesian nomograms constructed from the selected risk factors were proposed and compared. The constructed nomograms were then verified using a receiver operating characteristic (ROC) curve and a calibration plot.

Bayesian Network based Event Recognition in Multi-Camera Environment (멀티카메라 환경에서의 베이지안 네트워크 기반 이벤트 인식)

  • Lim, Soo-Jung;Min, Jun-Ki;Park, Han-Saem;Cho, Sung-Bae
    • Proceedings of the Korean Information Science Society Conference / 2007.06c / pp.248-251 / 2007
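  • Existing multi-camera systems have mainly been used to cover wide areas or to track moving objects. However, such systems have the drawback that information is lost when the view covered by a single camera is occluded. A multi-camera system can overcome this drawback by having several cameras cover the same area. In addition, because images collected from cameras at different viewpoints contain different information, combining the inputs of several cameras can yield more information. This paper exploits these advantages to address event recognition in a multi-camera environment. To this end, eight cameras were installed in an office environment and video was collected according to scenarios. The collected video was annotated by experts and then used to train the recognition model; the structure and parameters of the learned Bayesian network were then revised on the basis of domain knowledge to design the final event recognition model. In experiments, the proposed event recognition model achieved an average recognition rate of 87.0%, outperforming Naive Bayes.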

Bayesian Approach to Users' Perspective on Movie Genres

  • Lenskiy, Artem A.;Makita, Eric
    • Journal of information and communication convergence engineering / v.15 no.1 / pp.43-48 / 2017
  • Movie ratings are crucial for recommendation engines that track the behavior of all users and utilize the information to suggest items the users might like. It is intuitively appealing that information about viewing preferences in terms of movie genres is sufficient for predicting the genre of an unlabeled movie. In order to predict movie genres, we treat ratings as a feature vector, apply a Bernoulli event model to estimate the likelihood of a movie being assigned a certain genre, and evaluate the posterior probability of the genre of a given movie by using Bayes' rule. The goal of the proposed technique is to efficiently use movie ratings for the task of predicting movie genres. In our approach, we attempted to answer the question: "Given the set of users who watched a movie, is it possible to predict the genre of a movie on the basis of its ratings?" The simulation results with MovieLens 1M data demonstrated the efficiency and accuracy of the proposed technique, achieving an 83.8% prediction rate for exact prediction and 84.8% when including correlated genres.
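
The Bernoulli event model with Bayes' rule described in this abstract can be sketched as a small naive Bayes classifier. This is an illustration under stated assumptions, not the authors' implementation: each movie is encoded as a binary vector with one entry per user (1 if that user rated the movie), Laplace smoothing with `alpha=1.0` is added, and the data and function names are hypothetical.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and per-class Bernoulli parameters.

    X is a list of binary feature vectors, y the matching class labels.
    alpha is a Laplace smoothing constant (an assumption, not from the paper).
    """
    n_features = len(X[0])
    log_prior, theta = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed probability that feature j equals 1 within class c.
        theta[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return log_prior, theta


def predict(x, log_prior, theta):
    """Return the class with the highest (unnormalized) log posterior."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for xj, p in zip(x, theta[c]):
            # Bernoulli likelihood: p if the feature is on, 1 - p if it is off.
            score += math.log(p) if xj else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical data: 4 movies x 4 users, 1 means the user rated the movie.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
y = ["action", "action", "drama", "drama"]

log_prior, theta = train_bernoulli_nb(X, y)
print(predict([1, 0, 0, 0], log_prior, theta))
print(predict([0, 0, 1, 1], log_prior, theta))
```

Working in log space avoids underflow when the number of users (features) is large, which would be the case for a data set the size of MovieLens 1M.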

Features Reduction using Logistic Regression for Spam Filtering (로지스틱 회귀 분석을 이용한 스펨 필터링의 특징 축소)

  • Jung, Yong-Gyu;Lee, Bum-Joon
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.10 no.2 / pp.13-18 / 2010
  • Today, the large volume of spam that occupies mail servers and network storage causes problems such as overload, and users must spend time and resources deleting it. Automatic spam filtering is therefore essential, and many spam filters have emerged that try to solve the problem automatically. Unlike traditional methods such as naive Bayesian filtering, this study uses PCA to reduce the high-dimensional spam data set to a few principal axes, which lowers the computational burden, and then applies logistic regression analysis to classify the messages and filter the spam. The approach gave positive results in both speed and performance.

Development of Customized Strategy for Enhancing Automobile Repurchase Using Data Mining Techniques (자동차 재구매 증진을 위한 데이터 마이닝 기반의 맞춤형 전략 개발)

  • Lee, Dong-Wook;Choi, Keun-Ho;Yoo, Dong-Hee
    • The Journal of Information Systems / v.26 no.3 / pp.47-61 / 2017
  • Purpose: Although automobile production has increased since the development of the Korean automobile industry, the number of customers who can purchase automobiles has decreased in relative terms. Therefore, automobile companies need to develop strategies to attract customers and promote their repurchase behaviors. To this end, this paper analyzed customer data from a Korean automobile company using data mining techniques to derive repurchase strategies. Design/methodology/approach: We conducted under-sampling to balance the collected data and generated 10 datasets. We then implemented prediction models by applying decision tree, naive Bayesian, and artificial neural network algorithms to each of the datasets. As a result, we derived 10 patterns consisting of 11 variables affecting customers' decisions about repurchases from the decision tree algorithm, which yielded the best accuracy. Using the derived patterns, we proposed helpful strategies for improving repurchase rates. Findings: From the top 10 repurchase patterns, we found that 1) repurchases in January are associated with a specific residential region, 2) repurchases in spring or autumn are associated with whether it is a weekend or not, 3) repurchases in summer are associated with whether the automobile is equipped with a sunroof or not, and 4) a customized promotion for a specific occupation increases the number of repurchases.
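
The under-sampling step in this abstract can be sketched as follows. The abstract does not say how the 10 datasets were drawn, so this sketch assumes random under-sampling: every class is cut down to the size of the smallest class, and the draw is repeated with fresh random samples. The function name `undersample`, the record layout, and the class sizes are hypothetical.

```python
import random


def undersample(records, label_of, n_datasets=10, seed=0):
    """Build n_datasets balanced samples by randomly under-sampling each class.

    label_of maps a record to its class label. Every class is reduced to the
    size of the smallest class, so the majority class is sampled differently
    in each dataset.
    """
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    minority_size = min(len(group) for group in by_class.values())

    datasets = []
    for _ in range(n_datasets):
        sample = []
        for group in by_class.values():
            sample.extend(rng.sample(group, minority_size))
        rng.shuffle(sample)
        datasets.append(sample)
    return datasets


# Hypothetical customer records: (customer id, class label).
records = [(i, "repurchase" if i < 20 else "no_repurchase") for i in range(100)]
datasets = undersample(records, label_of=lambda r: r[1])
print(len(datasets), len(datasets[0]))
```

Each balanced dataset could then be used to train and evaluate one model per algorithm, so that accuracy is compared across 10 different draws of the majority class rather than a single one.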

A Study of using Emotional Features for Information Retrieval Systems (감정요소를 사용한 정보검색에 관한 연구)

  • Kim, Myung-Gwan;Park, Young-Tack
    • The KIPS Transactions:PartB / v.10B no.6 / pp.579-586 / 2003
  • In this paper, we propose a novel approach that employs emotional features in document retrieval systems. Fine emotional features, such as HAPPY, SAD, ANGRY, FEAR, and DISGUST, are used to represent Korean documents, and users can use these features to retrieve documents. The retrieved documents are then learned by classification methods such as cohesion factor, naive Bayesian, and k-nearest neighbor approaches. A voting method is used to combine the various approaches. In addition, k-means clustering was used in our experiments. Our approach proved more accurate than the other methods, and it performed better on short texts than on long documents.
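
The voting step that combines the classifiers can be illustrated with a plurality vote. The abstract does not specify the voting scheme, so this is an assumption: each classifier casts one vote, the most frequent label wins, and ties go to the classifier listed first. The function name and the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine one predicted label per classifier into a single label.

    predictions is ordered by classifier priority; on a tie the label
    predicted by the earliest classifier wins.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three classifiers for one document, for example
# cohesion factor, naive Bayesian, and k-nearest neighbor, in that order.
print(majority_vote(["HAPPY", "SAD", "HAPPY"]))
print(majority_vote(["SAD", "HAPPY", "ANGRY"]))
```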

Identifying Optimum Features for Abbreviation Disambiguation in Biomedical Domain (생의학 도메인에서 약어 중의성 해결을 위한 최적 자질의 규명)

  • Lim, Ho-Gun;Seo, Hee-Cheol;Kim, Seon-Ho;Rim, Hae-Chang
    • Annual Conference on Human and Language Technology / 2004.10d / pp.173-180 / 2004
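  • Abbreviation disambiguation in the biomedical domain is the task of determining the original form (long form) of an abbreviation that appears in a biomedical document. This paper aims to explore experimentally which features are suitable for abbreviation disambiguation in the biomedical domain. To this end, the context used for disambiguation is divided into topical context and local context, and within each context a variety of features are considered through processes such as stemming, stopword removal, and part-of-speech tagging. To address the shortage of experimental data for abbreviation disambiguation in the biomedical domain, training and evaluation data were constructed automatically, and two abbreviation lists used in previous studies served as the abbreviations for evaluation. A naive Bayesian model was used to evaluate the usefulness of each feature. In the experiments, topical context outperformed local context; within topical context, removing only stopwords gave the best results, 94.2% and 96.2% on the two evaluation sets, and using topical and local context together improved performance by 1.8% and 0.3% on the respective evaluation sets.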

Smart IoT Hardware Control System using Secure Mobile Messenger (모바일 메신저를 이용한 스마트 IoT 하드웨어 제어 시스템)

  • Lee, Sang-Hyeong;Kim, Dong-Hyun;Lee, Hae-Yeoun
    • The Transactions of The Korean Institute of Electrical Engineers / v.65 no.12 / pp.2232-2239 / 2016
  • The IoT industry has attracted attention both domestically and abroad. Since most IoT systems operate separate servers on the Internet to control IoT hardware, security problems are possible. Also, IoT systems on the market use their own hardware controllers and devices. As a result, there are many limitations in adding new sensors or devices and in using applications to access hardware controllers. To solve these problems, we have developed a novel IoT hardware control system based on a mobile messenger. For security, we have adopted a secure mobile messenger, Telegram, which has its own security protection. It also improves ease of use because no specific application needs to be installed. To enhance system accessibility, the proposed IoT system supports various network protocols. As a result, a wide range of functions can be included in the system. Finally, our IoT system can analyze the information collected from sensors to provide useful information to users. Through experiments, we show that the proposed IoT system performs well.