• Title/Summary/Keyword: Naive Bayes Algorithm

Comparison Thai Word Sense Disambiguation Method

  • Modhiran, Teerapong;Kruatrachue, Boontee;Supnithi, Thepchai
    • Institute of Control, Robotics and Systems: Conference Proceedings
    • /
    • 2004.08a
    • /
    • pp.1307-1312
    • /
    • 2004
  • Word sense disambiguation is one of the most important problems in natural language processing, underlying applications such as information retrieval and machine translation. Many approaches can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are knowledge-based, corpus-based, and hybrid. This paper focuses on the corpus-based strategy. Its purpose is to compare three well-known machine learning techniques, SNoW, SVM, and Naive Bayes, for word sense disambiguation (WSD) in Thai. Ten ambiguous words are selected and tested with word and POS features. The results show that the SVM algorithm gives the best results for Thai WSD, with an accuracy of approximately 83-96%.

Improving Text Categorization with High Quality Bigrams (고품질 바이그램을 이용한 문서 범주화 성능 향상)

  • Lee, Chan-Do;Tan, Chade-Meng;Wang, Yuan-Fang
    • The KIPS Transactions:PartB
    • /
    • v.9B no.4
    • /
    • pp.415-420
    • /
    • 2002
  • This paper presents an efficient text categorization algorithm that generates high quality bigrams by using the information gain metric, combined with various frequency thresholds. The bigrams, along with unigrams, are then given as features to a Naive Bayes classifier. The experimental results suggest that the bigrams, while small in number, can substantially contribute to improving text categorization. Upon close examination of the results, we conclude that the algorithm is most successful in correctly classifying more positive documents, but may cause more negative documents to be classified incorrectly.
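
The selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the function names, the `min_df` and `min_ig` thresholds, and the toy data are all assumptions, and information gain is computed for the binary event "the bigram occurs in the document".

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def bigrams(tokens):
    """Adjacent token pairs joined with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def information_gain(doc_sets, labels, feature):
    """Information gain of 'feature occurs in the document' about the label."""
    with_f = [y for s, y in zip(doc_sets, labels) if feature in s]
    without_f = [y for s, y in zip(doc_sets, labels) if feature not in s]
    n = len(labels)
    conditional = 0.0
    for part in (with_f, without_f):
        if part:
            conditional += len(part) / n * entropy(list(Counter(part).values()))
    return entropy(list(Counter(labels).values())) - conditional


def select_bigrams(token_docs, labels, min_df=2, min_ig=0.1):
    """Keep bigrams that pass a document-frequency and an information-gain threshold.

    min_df and min_ig are hypothetical values chosen for illustration only.
    """
    doc_sets = [set(bigrams(tokens)) for tokens in token_docs]
    doc_freq = Counter(b for s in doc_sets for b in s)
    return sorted(
        b for b, count in doc_freq.items()
        if count >= min_df and information_gain(doc_sets, labels, b) >= min_ig
    )
```

The selected bigrams would then be appended to each document's unigram features before training a Naive Bayes classifier.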

Study on Automatic Bug Triage using Deep Learning (딥 러닝을 이용한 버그 담당자 자동 배정 연구)

  • Lee, Sun-Ro;Kim, Hye-Min;Lee, Chan-Gun;Lee, Ki-Seong
    • Journal of KIISE
    • /
    • v.44 no.11
    • /
    • pp.1156-1164
    • /
    • 2017
  • Existing studies on automatic bug triage have mostly designed their prediction systems around machine learning algorithms. Applying a high-performance machine learning model is therefore central to the performance of an automatic bug triage system. Related research has mainly used high-performing machine learning models such as SVM and Naïve Bayes. In this paper, we apply Deep Learning, which has recently shown good performance in the field of machine learning, to automatic bug triage and evaluate its performance. Experimental results show that the Deep Learning based bug triage system achieves 48% accuracy in active developer experiments, an improvement of up to 69% over conventional machine learning techniques.

Natural Object Recognition for Augmented Reality Applications (증강현실 응용을 위한 자연 물체 인식)

  • Anjan, Kumar Paul;Mohammad, Khairul Islam;Min, Jae-Hong;Kim, Young-Bum;Baek, Joong-Hwan
    • Journal of the Institute of Convergence Signal Processing
    • /
    • v.11 no.2
    • /
    • pp.143-150
    • /
    • 2010
  • A markerless augmented reality system must be able to recognize and match natural objects in both indoor and outdoor environments. In this paper, a novel approach is proposed for extracting features and recognizing natural objects using visual descriptors and codebooks. Since augmented reality applications are sensitive to operating speed and real-time performance, our work focuses mainly on recognizing multi-class natural objects and reducing the computing time for classification and feature extraction. SIFT (scale-invariant feature transform) and SURF (speeded-up robust features) are used to extract features from natural objects during training and testing, and their performance is compared. We then form a visual codebook from the high-dimensional feature vectors using a clustering algorithm and recognize the objects using a naive Bayes classifier.

Performance Analysis of Perturbation-based Privacy Preserving Techniques: An Experimental Perspective

  • Ritu Ratra;Preeti Gulia;Nasib Singh Gill
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.10
    • /
    • pp.81-88
    • /
    • 2023
  • In the present scenario, enormous amounts of data are produced every second. These data also contain private information from sources including media platforms, the banking sector, finance, healthcare, and criminal histories. Data mining is a method for looking through and analyzing massive volumes of data to find usable information. Preserving personal data during data mining has become difficult, so privacy-preserving data mining (PPDM) is used to do so. Data perturbation is one of the several tactics used by the PPDM data privacy protection mechanism. In perturbation, datasets are perturbed in order to preserve personal information, addressing both data accuracy and data privacy. This paper explores and compares several perturbation strategies that may be used to protect data privacy. For this experiment, two perturbation techniques based on random projection and principal component analysis were used: Improved Random Projection Perturbation (IRPP) and Enhanced Principal Component Analysis based Technique (EPCAT). The Naive Bayes classification algorithm is used for the data mining step. These methods are employed to assess the precision, run time, and accuracy of the experimental results. The random projection-based technique (IRPP) is determined to be the best perturbation method under Naive Bayes classification for both the cardiovascular and hypothyroid datasets.
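
To make the idea of projection-based perturbation concrete, here is a minimal sketch of plain Gaussian random projection. It is not IRPP or EPCAT, whose details the abstract does not give; the function name, the `seed` parameter, and the choice of a Gaussian matrix are assumptions made for illustration.

```python
import math
import random


def random_projection(rows, k, seed=0):
    """Project each d-dimensional row onto k random Gaussian directions.

    Generic random projection for illustration only; not the IRPP method.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries are drawn from N(0, 1/k), which keeps squared distances
    # unchanged in expectation while hiding the original attribute values.
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [
        [sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        for row in rows
    ]
```

A classifier such as Naive Bayes would then be trained on the projected rows instead of the original records.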

An Efficient kNN Algorithm (효율적인 kNN 알고리즘)

  • Lee Jae Moon
    • The KIPS Transactions:PartB
    • /
    • v.11B no.7 s.96
    • /
    • pp.849-854
    • /
    • 2004
  • This paper proposes an algorithm to improve the execution time of kNN in document classification. The proposed algorithm improves the execution time by minimizing the cost of computing the similarity between two documents by using the list of pairs, while the conventional kNN uses the list of pairs. The list of pairs can be obtained by applying matrix transposition to the list of pairs at the training phase of document classification. This paper analyzes the time complexity of the proposed algorithm and compares it with the conventional kNN. It also compares the proposed algorithm with the conventional kNN experimentally using the Reuters-21578 data. The experimental results show that the proposed algorithm outperforms kNN by about 90% in terms of execution time.

Implementation of a bio-inspired two-mode structural health monitoring system

  • Lin, Tzu-Kang;Yu, Li-Chen;Ku, Chang-Hung;Chang, Kuo-Chun;Kiremidjian, Anne
    • Smart Structures and Systems
    • /
    • v.8 no.1
    • /
    • pp.119-137
    • /
    • 2011
  • A bio-inspired two-mode structural health monitoring (SHM) system based on the Naïve Bayes (NB) classification method is discussed in this paper. To implement the molecular biology based deoxyribonucleic acid (DNA) array concept in structural health monitoring, which has been demonstrated to be superior in disease detection, two types of array expression data have been proposed for the development of the SHM algorithm. For the micro-vibration mode, a two-tier auto-regression with exogenous (AR-ARX) process is used to extract the expression array from the recorded structural time history, while an ARX process is applied for the analysis of the earthquake mode. The health condition of the structure is then determined using the NB classification method. In addition, the union concept in probability is used to improve the accuracy of the system. To verify the performance and reliability of the SHM algorithm, a downscaled eight-storey steel building located at the shaking table of the National Center for Research on Earthquake Engineering (NCREE) was used as the benchmark structure. The structural response from different damage levels and locations was collected and incorporated in the database to aid the structural health monitoring process. Preliminary verification has demonstrated that the structure health condition can be precisely detected by the proposed algorithm. To implement the developed SHM system in a practical application, an SHM prototype consisting of the input sensing module, the transmission module, and the SHM platform was developed. The vibration data were first measured by the deployed sensor, and subsequently the SHM mode corresponding to the desired excitation is chosen automatically to quickly evaluate the health condition of the structure. Test results from the ambient vibration and shaking table test showed that the condition and location of the benchmark structure damage can be successfully detected by the proposed SHM prototype system, and the information is instantaneously transmitted to a remote server to facilitate real-time monitoring. Practical implementation of the bio-inspired two-mode SHM system has thus been successfully demonstrated.

Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities (문헌간 유사도를 이용한 자동분류에서 미분류 문헌의 활용에 관한 연구)

  • Kim, Pan-Jun;Lee, Jae-Yun
    • Journal of the Korean Society for Information Management
    • /
    • v.24 no.1 s.63
    • /
    • pp.251-271
    • /
    • 2007
  • This paper studies the problem of classifying documents with labeled and unlabeled learning data, especially with regard to using document similarity features. The problem of using unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier using the available labeled documents and classifies the unlabeled documents. Then, it trains a new classifier using all the training documents, which were labeled either manually or automatically. We suggest two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as learning examples as well as for similarity features. Experimental results, obtained using support vector machines and a naive Bayes classifier, show that a small set of labeled documents combined with a large set of unlabeled documents yields better performance than supervised learning that uses labeled data only. When considering the efficiency of a classifier system, the one-step semi-supervised learning algorithm suggested in this study could be a good solution for improving classification performance with unlabeled documents.
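
The general two-step procedure described in this abstract (train on labeled documents, label the unlabeled ones, retrain on everything) can be sketched with a small multinomial Naive Bayes model. This is a minimal illustration under stated assumptions: the function names and the add-one smoothing are choices made here, and the sketch does not reproduce the document similarity features the paper proposes.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class and per-class word counts for multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y in sorted(class_counts):
        total = sum(word_counts[y].values())
        score = math.log(class_counts[y] / n)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best


def self_train(labeled_docs, labels, unlabeled_docs):
    """Step 1: label the unlabeled documents; step 2: retrain on all documents."""
    first = train_nb(labeled_docs, labels)
    pseudo_labels = [predict_nb(first, d) for d in unlabeled_docs]
    return train_nb(labeled_docs + unlabeled_docs, labels + pseudo_labels)
```

After retraining, words that appeared only in unlabeled documents contribute to classification, which is where the benefit of the second step comes from.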

Analysis of high school students' views on science-technology-society (HS-VOSTS) questionnaire results (고등학생을 위한 과학-기술-사회에 대한 시각 (HS-VOST) 설문조사 결과 분석)

  • Kang, Dae-Ki
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2011.10a
    • /
    • pp.201-203
    • /
    • 2011
  • We report an experimental result of applying a data mining algorithm to analyze the results of the high school students' views on science-technology-society (HS-VOSTS) questionnaire. The preliminary empirical result of a Naive Bayes classifier on the HS-VOSTS questionnaire, collected from students at one South Korean university, indicates that data mining algorithms can be effectively applied to automated knowledge discovery from students' survey data.

Fast Conditional Independence-based Bayesian Classifier

  • Junior, Estevam R. Hruschka;Galvao, Sebastian D. C. de O.
    • Journal of Computing Science and Engineering
    • /
    • v.1 no.2
    • /
    • pp.162-176
    • /
    • 2007
  • Machine Learning (ML) has become very popular within Data Mining (KDD) and Artificial Intelligence (AI) research and their applications. In the ML and KDD contexts, two main approaches can be used for inducing a Bayesian Network (BN) from data, namely Conditional Independence (CI) and Heuristic Search (HS). When a BN is induced for classification purposes (a Bayesian Classifier, BC), it is possible to impose some specific constraints aimed at increasing computational efficiency. In this paper, a new CI-based approach to inducing BCs from data is proposed and two algorithms are presented. The approach uses the Markov Blanket concept to impose constraints and optimize the traditional PC learning algorithm. Experiments performed with ALARM, as well as six other UCI domains and three artificial domains, revealed that the proposed approach tends to execute fewer comparison tests than the traditional PC. The experiments also show that the proposed algorithms produce competitive classification rates when compared with both PC and Naive Bayes.