• Title/Summary/Keyword: Bigram Algorithm


Prediction of Protein-Protein Interactions from Sequences using a Correlation Matrix of the Physicochemical Properties of Amino Acids

  • Kopoin, Charlemagne N'Diffon; Atiampo, Armand Kodjo; N'Guessan, Behou Gerard; Babri, Michel
    • International Journal of Computer Science & Network Security / v.21 no.3 / pp.41-47 / 2021
  • Detection of protein-protein interactions (PPIs) remains essential for the development of therapies against diseases. Experimental methods for detecting PPIs are time-consuming and expensive. Today, with the availability of PPI data, several computational models for predicting PPIs have been proposed. One of the big challenges in this task is feature extraction, and the relevance of the information extracted by some existing techniques remains limited. In this work, we first propose an extraction method based on correlation relationships between the physicochemical properties of amino acids. The proposed method uses a correlation matrix obtained from the hydrophobicity and hydrophilicity properties, which it then integrates into the calculation of the bigram features. We then use the SVM algorithm to detect the presence of an interaction between two given proteins. Experimental results show that the proposed method obtains better performance than approaches in the literature, reaching 94.75% accuracy, 95.12% precision, and 96% sensitivity on human HPRD protein data.
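
For illustration, a minimal Python sketch of the kind of correlation-weighted bigram feature the abstract describes is given below. The property table, the similarity construction, and the toy sequences are all assumptions made for demonstration, not the paper's exact formulation.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Placeholder property table: one (hydrophobicity, hydrophilicity) pair per
# residue; real work would use published scales (e.g. Kyte-Doolittle, Hopp-Woods).
rng = np.random.default_rng(0)
props = rng.normal(size=(20, 2))

# Residue-residue similarity built from z-scored properties: one plausible
# way to realize "a correlation matrix of physicochemical properties".
z = (props - props.mean(axis=0)) / props.std(axis=0)
corr = z @ z.T / z.shape[1]

def bigram_features(seq: str) -> np.ndarray:
    """Correlation-weighted bigram counts, flattened to a 400-dim vector."""
    feat = np.zeros((20, 20))
    for a, b in zip(seq, seq[1:]):
        feat[IDX[a], IDX[b]] += corr[IDX[a], IDX[b]]
    return feat.ravel() / max(len(seq) - 1, 1)

# A protein pair becomes one vector, which an SVM (e.g. sklearn.svm.SVC)
# would classify as interacting or not. The sequences here are toy strings.
pair_vec = np.concatenate([bigram_features("MKTAYIAKQR"), bigram_features("GAVLIMCFYW")])
print(pair_vec.shape)  # (800,)
```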

On the Development of a Large-Vocabulary Continuous Speech Recognition System for the Korean Language (대용량 한국어 연속음성인식 시스템 개발)

  • Choi, In-Jeong; Kwon, Oh-Wook; Park, Jong-Ryeal; Park, Yong-Kyu; Kim, Do-Yeong; Jeong, Ho-Young; Un, Chong-Kwan
    • The Journal of the Acoustical Society of Korea / v.14 no.5 / pp.44-50 / 1995
  • This paper describes a large-vocabulary continuous speech recognition system for the Korean language based on continuous hidden Markov models. To improve the performance of the system, we study the selection of speech modeling units, inter-word modeling, the search algorithm, and grammars. Triphones are used as the basic speech modeling units; generalized triphones and function-word-dependent phones improve the trainability of the speech units and reduce errors in function words. Silence between words is optionally inserted using a silence model and a null transition. A word-pair grammar and a bigram model based on word classes are used. We also implement a search algorithm that finds the N-best candidate sentences. A postprocessor reorders the N-best sentences using a word-triple grammar, selects the most likely sentence as the final recognition result, and finally corrects trivial errors related to postpositions. In recognition tests on a 3,000-word continuous speech database, the system attained 93.1% word recognition accuracy and 73.8% sentence recognition accuracy using the word-triple grammar in postprocessing.
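
The "bigram model based on word classes" can be sketched as the class-based factorization P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i). The toy class map and probability tables below are placeholders, not the system's trained models.

```python
from collections import defaultdict

# Toy word-class map and smoothed probability tables (illustrative only).
word_class = {"나는": "PRON", "학교에": "NOUN", "간다": "VERB"}
class_bigram = defaultdict(lambda: 1e-6, {("PRON", "NOUN"): 0.4, ("NOUN", "VERB"): 0.5})
word_given_class = defaultdict(lambda: 1e-6, {"나는": 0.3, "학교에": 0.2, "간다": 0.25})

def class_bigram_score(words):
    """P(w1..wn) ≈ P(w1|c1) * prod over i>1 of P(c_i|c_{i-1}) * P(w_i|c_i)."""
    score = word_given_class[words[0]]
    for prev, cur in zip(words, words[1:]):
        score *= class_bigram[(word_class[prev], word_class[cur])]
        score *= word_given_class[cur]
    return score

print(class_bigram_score(["나는", "학교에", "간다"]))
```

Grouping words into classes shrinks the transition table and makes rare word pairs easier to estimate, which is the usual motivation for class-based bigrams.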


Two-Phase Clustering Method Considering Mobile App Trends (모바일 앱 트렌드를 고려한 2단계 군집화 방법)

  • Heo, Jeong-Man; Park, So-Young
    • Journal of the Korea Society of Computer and Information / v.20 no.4 / pp.17-23 / 2015
  • In this paper, we propose a mobile app clustering method using word clusters. Considering how quickly mobile app trends change, the proposed method divides mobile apps into semantically similar groups by applying a clustering algorithm to the app set, rather than relying on a predefined category system. To alleviate the data sparseness problem in short mobile app description texts, the method additionally utilizes the unigram, the bigram, the trigram, and the cluster of each word. To cluster mobile apps accurately, the method uses the word clusters to avoid exceedingly small or large app clusters. Experimental results show that using the word clusters improves overall accuracy by 22.18 percentage points, from 57.48% to 79.66%.
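
A hedged sketch of the feature idea, unigram-to-trigram features augmented with word-cluster tokens before clustering, follows. The toy descriptions, word-cluster map, and cluster count are invented for illustration, and the second phase (re-splitting oversized clusters) is not shown.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "photo editor with filters",
    "photo collage and sticker maker",
    "step counter and fitness tracker",
    "calorie and fitness diary",
]
# Word clusters would normally come from a method such as Brown clustering;
# this toy map just groups obviously related words.
word_cluster = {"photo": "_C_IMG", "collage": "_C_IMG",
                "fitness": "_C_FIT", "calorie": "_C_FIT"}

def augment(text: str) -> str:
    """Append each word's cluster token to fight sparseness in short texts."""
    tokens = text.split()
    return " ".join(tokens + [word_cluster[t] for t in tokens if t in word_cluster])

X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform([augment(d) for d in descriptions])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # image-editing apps vs. fitness apps, in this toy setup
```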

Design of a Korean Speech Recognition Platform (한국어 음성인식 플랫폼의 설계)

  • Kwon Oh-Wook; Kim Hoi-Rin; Yoo Changdong; Kim Bong-Wan; Lee Yong-Ju
    • MALSORI / no.51 / pp.151-165 / 2004
  • For educational and research purposes, a Korean speech recognition platform is designed. It is based on an object-oriented architecture and can be easily modified, so researchers can readily evaluate the performance of a recognition algorithm of interest. This platform will save development time for many who are interested in speech recognition. The platform includes the following modules: noise reduction, end-point detection, mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP)-based feature extraction, hidden Markov model (HMM)-based acoustic modeling, n-gram language modeling, n-best search, and Korean language processing. The decoder of the platform can handle both lexical search trees for large-vocabulary speech recognition and finite-state networks for small-to-medium-vocabulary speech recognition. It performs a word-dependent n-best search with a bigram language model in the first forward search stage, then extracts a word lattice and rescores each lattice path with a trigram language model in the second stage.
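
The two-stage decoding idea, a bigram language model in the first pass followed by trigram rescoring of the surviving hypotheses, can be sketched as below. The log-probability tables and hypotheses are placeholders, and a real decoder would rescore paths through a word lattice rather than a short hypothesis list.

```python
# Toy log-probability tables standing in for trained bigram/trigram LMs.
bigram_lp = {("<s>", "I"): -0.5, ("I", "saw"): -0.7, ("saw", "her"): -0.9,
             ("I", "sawed"): -2.0, ("sawed", "her"): -1.5}
trigram_lp = {("<s>", "I", "saw"): -0.6, ("I", "saw", "her"): -0.4,
              ("<s>", "I", "sawed"): -2.5, ("I", "sawed", "her"): -2.0}

def bigram_score(words):
    w = ["<s>"] + words
    return sum(bigram_lp.get(p, -5.0) for p in zip(w, w[1:]))

def trigram_score(words):
    w = ["<s>"] + words
    return sum(trigram_lp.get(t, -6.0) for t in zip(w, w[1:], w[2:]))

hypotheses = [["I", "saw", "her"], ["I", "sawed", "her"]]
nbest = sorted(hypotheses, key=bigram_score, reverse=True)  # first pass: bigram LM
best = max(nbest, key=trigram_score)                        # second pass: trigram rescoring
print(best)  # ['I', 'saw', 'her']
```

The cheap bigram model keeps the first-pass search tractable; the stronger trigram model is applied only to the small set of survivors, which is the standard multi-pass trade-off.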


A Study on the Effectiveness of Bigrams in Text Categorization (바이그램이 문서범주화 성능에 미치는 영향에 관한 연구)

  • Lee, Chan-Do; Choi, Joon-Young
    • Journal of Information Technology Applications and Management / v.12 no.2 / pp.15-27 / 2005
  • Text categorization systems generally use single words (unigrams) as features. Here we investigate a deceptively simple algorithm for improving text categorization, an idea previously shown not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams found, while small in number, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that the algorithm was successful in extracting high-quality bigrams and increased the quality of the overall feature set. To examine the role of bigrams, we trained Naïve Bayes classifiers using both unigrams and bigrams as features. Recall values were higher than with unigrams alone. Break-even points and F1 values improved for most documents, especially when documents were classified into the large classes. On Reuters-21578, break-even points increased by 2.1%, with the highest at 18.8%, and F1 improved by 1.5%, with the highest at 3.2%. On Korea-web, break-even points increased by 1.0%, with the highest at 4.5%, and F1 improved by 0.4%, with the highest at 4.2%. We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
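
A minimal sketch of the general recipe, unigrams plus adjacent bigrams with feature selection feeding a Naïve Bayes classifier, is shown below. The chi-square criterion and the toy corpus are assumptions, not necessarily the paper's exact quality measure or data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for a labeled corpus such as Reuters-21578.
docs = ["interest rates rise", "crude oil prices fall",
        "rates fall on oil news", "interest in crude futures"]
labels = ["money", "energy", "energy", "energy"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigrams and adjacent bigrams
    SelectKBest(chi2, k=10),              # keep only the highest-scoring features
    MultinomialNB(),
)
model.fit(docs, labels)
print(model.predict(["oil rates"]))
```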


A Semi-supervised Learning of HMM to Build a POS Tagger for a Low Resourced Language

  • Pattnaik, Sagarika; Nayak, Ajit Kumar; Patnaik, Srikanta
    • Journal of Information and Communication Convergence Engineering / v.18 no.4 / pp.207-215 / 2020
  • Part-of-speech (POS) tagging is an indispensable part of major NLP models. Progress can be seen for a number of languages around the globe, especially European languages, but Indian languages have not had a major breakthrough due to a lack of supporting tools and resources, and for the Odia language in particular little has been established. With the aim of making Odia usable in different NLP operations, this paper develops a POS tagger for the language on an HMM (hidden Markov model) platform. The tagger uses a bigram HMM decoded with the Viterbi dynamic-programming algorithm to produce annotated output text with maximum accuracy. The model is evaluated on a tourism-domain corpus of approximately 0.2 million tokens. With a 3:1 split between training and testing data, the proposed model exhibits satisfactory results despite the limited training size.
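
A bigram HMM decoded with the Viterbi algorithm can be sketched as follows. The English toy tags, transition table, and emission table are illustrative stand-ins, not the Odia model's parameters.

```python
import math

TAGS = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.4, ("NOUN", "VERB"): 0.6,
         ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2}
emit = {("NOUN", "dog"): 0.5, ("NOUN", "walks"): 0.1,
        ("VERB", "dog"): 0.05, ("VERB", "walks"): 0.6}

def viterbi(words):
    # Each chart column maps tag -> (best log-prob ending in tag, backpointer).
    chart = [{t: (math.log(trans[("<s>", t)]) + math.log(emit.get((t, words[0]), 1e-6)), None)
              for t in TAGS}]
    for w in words[1:]:
        prev, col = chart[-1], {}
        for t in TAGS:
            p = max(TAGS, key=lambda q: prev[q][0] + math.log(trans[(q, t)]))
            col[t] = (prev[p][0] + math.log(trans[(p, t)]) + math.log(emit.get((t, w), 1e-6)), p)
        chart.append(col)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda t: chart[-1][t][0])
    out = [tag]
    for col in reversed(chart[1:]):
        tag = col[tag][1]
        out.append(tag)
    return out[::-1]

print(viterbi(["dog", "walks"]))  # ['NOUN', 'VERB']
```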

A Study on Performance Evaluation of Hidden Markov Network Speech Recognition System (Hidden Markov Network 음성인식 시스템의 성능평가에 관한 연구)

  • 오세진; 김광동; 노덕규; 위석오; 송민규; 정현열
    • Journal of the Institute of Convergence Signal Processing / v.4 no.4 / pp.30-39 / 2003
  • In this paper, we carried out a performance evaluation of an HM-Net (Hidden Markov Network) speech recognition system on Korean speech databases. Acoustic models are constructed using HM-Nets, a modification of the HMMs (hidden Markov models) that are widely used as statistical modeling methods. HM-Nets perform state splitting in the contextual and temporal domains using the PDT-SSS (Phonetic Decision Tree-based Successive State Splitting) algorithm, a modification of the original SSS algorithm. In particular, the phonetic decision tree effectively expresses context information that does not appear in the training speech data during contextual-domain state splitting. In temporal-domain state splitting, states are split to effectively represent the durational information of each phoneme, and an optimal network of triphone-type models is then constructed. Speech recognition was performed using a one-pass Viterbi beam search algorithm with a phone-pair/word-pair grammar for phoneme/word recognition, respectively, and a multi-pass search algorithm with n-gram language models for sentence recognition. A tree-structured lexicon was used to decrease the number of nodes by sharing common prefixes among words. The performance of the HM-Net speech recognition system is evaluated under various recognition conditions. Through the experiments, we verified that it achieves far superior recognition performance compared with previously introduced recognition systems.
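
The tree-structured lexicon mentioned above can be illustrated with a small prefix-sharing trie. The romanized phone strings are assumptions for illustration, not the system's phone set.

```python
class LexNode:
    def __init__(self):
        self.children = {}  # phone -> LexNode
        self.word = None    # word label stored at a word-end node

def build_tree(lexicon):
    """Build a trie so that words sharing a phone prefix share search nodes."""
    root = LexNode()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, LexNode())
        node.word = word
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

# Toy romanized entries; the shared prefix h-a-k is stored only once.
lex = {"hakkyo": ["h", "a", "k", "ky", "o"],
       "haksaeng": ["h", "a", "k", "s", "ae", "ng"]}
root = build_tree(lex)
print(count_nodes(root) - 1)  # 8 nodes instead of 11 separate phone arcs
```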
