• Title/Summary/Keyword: bigram

Search Results: 71

Wine Label Character Recognition in Mobile Phone Images using a Lexicon-Driven Post-Processing (사전기반 후처리를 이용한 모바일 폰 영상에서 와인 라벨 문자 인식)

  • Lim, Jun-Sik;Kim, Soo-Hyung;Lee, Chil-Woo;Lee, Guee-Sang;Yang, Hyung-Jung;Lee, Myung-Eun
    • Journal of KIISE:Computing Practices and Letters / v.16 no.5 / pp.546-550 / 2010
  • In this paper, we propose a post-processing method for cursive script recognition in wine label images. The proposed method consists of three steps: combination matrix generation, character combination filtering, and string matching. First, the combination matrix generation step detects all possible combinations from the recognition result for each of the pieces. Second, unnecessary information in the combination matrix is removed by comparing it with the bigrams of words in the lexicon. Finally, the string matching step identifies the result as the best-matched word in the lexicon based on the Levenshtein distance. Experimental results show a recognition accuracy of 85.8%.
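The filtering-then-matching pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: candidates whose character bigrams never occur in the lexicon are discarded, and the survivor closest in Levenshtein distance to a lexicon word is chosen. The toy lexicon and candidate strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lexicon_bigrams(lexicon):
    """All adjacent character pairs occurring in any lexicon word."""
    return {w[i:i + 2] for w in lexicon for i in range(len(w) - 1)}

def plausible(candidate: str, bigrams) -> bool:
    """Keep a candidate only if every character pair occurs in the lexicon."""
    return all(candidate[i:i + 2] in bigrams for i in range(len(candidate) - 1))

def best_match(candidates, lexicon):
    """Filter candidates by lexicon bigrams, then pick the closest lexicon word."""
    bigrams = lexicon_bigrams(lexicon)
    kept = [c for c in candidates if plausible(c, bigrams)] or list(candidates)
    return min(((levenshtein(c, w), w) for c in kept for w in lexicon))[1]

lexicon = ["MERLOT", "MALBEC", "RIESLING"]
print(best_match(["MALOT", "MXQOT"], lexicon))  # → MERLOT
```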

Improvement and Evaluation of the Korean Large Vocabulary Continuous Speech Recognition Platform (ECHOS) (한국어 음성인식 플랫폼(ECHOS)의 개선 및 평가)

  • Kwon, Suk-Bong;Yun, Sung-Rack;Jang, Gyu-Cheol;Kim, Yong-Rae;Kim, Bong-Wan;Kim, Hoi-Rin;Yoo, Chang-Dong;Lee, Yong-Ju;Kwon, Oh-Wook
    • MALSORI / no.59 / pp.53-68 / 2006
  • We report the evaluation results of the Korean speech recognition platform called ECHOS. The platform has an object-oriented, reusable architecture so that researchers can easily evaluate their own algorithms. It includes all the modules needed to build a large-vocabulary speech recognizer: noise reduction, end-point detection, feature extraction, hidden Markov model (HMM)-based acoustic modeling, cross-word modeling, n-gram language modeling, n-best search, word graph generation, and Korean-specific language processing. The platform supports both lexical search trees and finite-state networks. It performs word-dependent n-best search with a bigram in the forward search stage and rescores the lattice with a trigram in the backward stage. In an 8,000-word continuous speech recognition task, the platform with a lexical tree produces 40% more word errors but requires 50% less recognition time than the HTK platform with a flat lexicon. ECHOS reduces recognition errors by 40% through the incorporation of cross-word modeling. With the number of Gaussian mixtures increased to 16, it yields word accuracy comparable to the previous lexical tree-based platform, Julius.
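The two-pass strategy described in the abstract can be illustrated on strings. This toy sketch (the tiny maximum-likelihood models and sentences are invented, and a real decoder applies the models inside the search, not to finished hypotheses) ranks an n-best list with a bigram language model, then rescores the survivors with a trigram model.

```python
import math
from collections import Counter

# Training corpus for the toy language models (illustrative only)
corpus = ["<s> open the door </s>", "<s> open the window </s>",
          "<s> shut the door </s>"]
tokens = [s.split() for s in corpus]

def ngram_counts(n):
    return Counter(tuple(t[i:i + n]) for t in tokens for i in range(len(t) - n + 1))

uni, bi, tri = ngram_counts(1), ngram_counts(2), ngram_counts(3)
vocab = len(uni)

def lm_score(words, n, counts, hist_counts):
    """Add-k smoothed n-gram log probability of a word sequence."""
    k, lp = 0.1, 0.0
    for i in range(n - 1, len(words)):
        num = counts[tuple(words[i - n + 1:i + 1])] + k
        den = hist_counts[tuple(words[i - n + 1:i])] + k * vocab
        lp += math.log(num / den)
    return lp

def recognize(nbest):
    """Forward pass: prune the list with the bigram model.
    Backward pass: rescore the survivors with the trigram model."""
    forward = sorted(nbest, key=lambda h: lm_score(h.split(), 2, bi, uni),
                     reverse=True)[:2]
    return max(forward, key=lambda h: lm_score(h.split(), 3, tri, bi))

nbest = ["<s> open the door </s>", "<s> open the the door </s>",
         "<s> door open the </s>"]
print(recognize(nbest))  # → <s> open the door </s>
```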

Determinants of Online Review Helpfulness for Korean Skincare Products in Online Retailing

  • OH, Yun-Kyung
    • Journal of Distribution Science / v.18 no.10 / pp.65-75 / 2020
  • Purpose: This study aims to examine how the review content of experiential and utilitarian products (e.g., skincare products) affects review helpfulness, by applying natural language processing techniques. Research design, data, and methodology: This study uses 69,633 online reviews generated for the products registered at Amazon.com by 13 Korean cosmetic firms. The authors apply bigram analysis to identify key topics that emerge about consumers' use of skincare products, such as skin type and skin trouble. The review content variables are included in the review helpfulness model alongside other important determinants. Results: The estimation results support a positive effect of review extremity and content on helpfulness. In particular, the reviewer's skin type information was recognized as highly useful when presented as a basis for high-rated reviews. Moreover, content related to skin issues positively affects review helpfulness. Conclusions: The positive relationship between extreme reviews and review helpfulness challenges the findings of prior literature. This result implies that an in-depth study of the effect of product type on review helpfulness is needed. Furthermore, the positive effect of review content on helpfulness suggests that applying big data analytics can provide meaningful customer insights in the online retail industry.
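The bigram analysis mentioned above amounts to counting adjacent word pairs across reviews to surface recurring topics such as "skin type" or "dry skin". A minimal sketch, with invented sample reviews standing in for the Amazon data:

```python
import re
from collections import Counter

# Invented sample reviews (illustrative stand-ins for the real dataset)
reviews = [
    "great for dry skin and sensitive skin type",
    "my dry skin cleared up, no skin trouble since",
    "works on my skin type without any skin trouble",
]

def bigrams(text):
    """Adjacent word pairs after lowercasing and stripping punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return zip(words, words[1:])

counts = Counter(b for r in reviews for b in bigrams(r))
print(counts.most_common(3))
```

Topic bigrams such as ("skin", "trouble") rise to the top even in this tiny sample; at scale, such counts become the review-content variables of the helpfulness model.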

A Comparative Study of Feature Extraction Methods for Authorship Attribution in the Text of Traditional East Asian Medicine with a Focus on Function Words (한의학 고문헌 텍스트에서의 저자 판별 - 기능어의 역할을 중심으로 -)

  • Oh, Junho
    • Journal of Korean Medical classics / v.33 no.2 / pp.51-59 / 2020
  • Objectives: We investigate which "features" are most appropriate for effective authorship attribution of Traditional East Asian Medicine texts. Methods: The authorship attribution performance of a Support Vector Machine (SVM) was compared by cross-validation, depending on whether function words or content words, single words or collocations, and IDF weights were applied, using the 'Variorum of the Nanjing' as an experimental corpus. Results: The combination of 'function words/uni-bigram/TF' performed best, with an accuracy of 0.732, while the combination of 'content words/unigram/TFIDF' showed the lowest accuracy, 0.351. Conclusions: The authorship attribution experiments on Traditional East Asian Medicine texts show the following. First, function words play a more important role than content words. Second, collocations were relatively important among content words, but single words carry more weight among function words. Third, unlike in general text analysis, IDF weighting worsened performance.
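A small sketch of the feature construction compared above: unigram-plus-bigram ("uni-bigram") term frequencies per document, with optional IDF weighting. The toy English documents are invented stand-ins for chapters of the corpus. The sketch also hints at why IDF hurt here: IDF drives ubiquitous function words, exactly the features that carry the authorial signal, toward zero.

```python
import math
from collections import Counter

docs = ["the patient is treated", "the pulse is weak", "treat the weak pulse"]

def uni_bigram_tf(doc):
    """Term frequencies over unigrams plus adjacent-word bigrams."""
    w = doc.split()
    return Counter(w + [" ".join(p) for p in zip(w, w[1:])])

tfs = [uni_bigram_tf(d) for d in docs]

def idf(term):
    """Inverse document frequency over the toy collection."""
    df = sum(term in tf for tf in tfs)
    return math.log(len(docs) / df)

def weight(tf, use_idf):
    """TF or TF-IDF feature vector for one document."""
    return {t: c * (idf(t) if use_idf else 1.0) for t, c in tf.items()}

# "the" occurs in every document, so IDF zeroes it out entirely
print(weight(tfs[0], use_idf=True)["the"])   # → 0.0
print(weight(tfs[0], use_idf=False)["the"])  # → 1.0
```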

A Study on the Effectiveness of Bigrams in Text Categorization (바이그램이 문서범주화 성능에 미치는 영향에 관한 연구)

  • Lee, Chan-Do;Choi, Joon-Young
    • Journal of Information Technology Applications and Management / v.12 no.2 / pp.15-27 / 2005
  • Text categorization systems generally use single words (unigrams) as features. A deceptively simple algorithm for improving text categorization is investigated here, an idea previously shown not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams found, while small in number, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that the algorithm successfully extracted high-quality bigrams and improved the overall feature set. To examine the role of bigrams, we trained Naïve Bayes classifiers using both unigrams and bigrams as features. Recall values were higher than with unigrams alone. Break-even points and F1 values improved for most documents, especially when documents were classified into the large classes. On Reuters-21578, break-even points increased by 2.1% (at most 18.8%) and F1 improved by 1.5% (at most 3.2%). On Korea-web, break-even points increased by 1.0% (at most 4.5%) and F1 improved by 0.4% (at most 4.2%). We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
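A compact Naive Bayes classifier over combined unigram-and-bigram features, in the spirit of the experiment above. The corpus, labels, and class names are invented for illustration; the paper's bigram-selection step is omitted, and every adjacent pair is used as a feature.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """Unigrams plus adjacent-word bigrams."""
    w = text.lower().split()
    return w + [" ".join(p) for p in zip(w, w[1:])]

class NaiveBayes:
    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # per-class feature counts
        self.priors = Counter(labels)
        for t, y in zip(texts, labels):
            self.counts[y].update(features(t))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, text):
        def logp(y):
            c, total = self.counts[y], sum(self.counts[y].values())
            lp = math.log(self.priors[y] / sum(self.priors.values()))
            for f in features(text):
                lp += math.log((c[f] + 1) / (total + len(self.vocab)))  # Laplace
            return lp
        return max(self.priors, key=logp)

nb = NaiveBayes().fit(
    ["interest rates rise", "stock market falls", "team wins the match"],
    ["econ", "econ", "sport"])
print(nb.predict("market rates rise"))  # → econ
```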

Compound Noun Decomposition by using Syllable-based Embedding and Deep Learning (음절 단위 임베딩과 딥러닝 기법을 이용한 복합명사 분해)

  • Lee, Hyun Young;Kang, Seung Shik
    • Smart Media Journal / v.8 no.2 / pp.74-79 / 2019
  • Traditional compound noun decomposition algorithms often fail to decompose a compound noun into its constituent nouns when an unregistered unit noun is included. It is impossible to register every unit noun, such as proper nouns, coined words, and foreign words, in a dictionary in advance, so traditional approaches struggle with such cases. In this paper, to solve this problem, compound noun decomposition is defined as a tag-sequence labeling problem, and a decomposition method using syllable-unit embedding and deep learning is proposed. To recognize unregistered unit nouns without constructing a unit noun dictionary, compound nouns are decomposed into unit nouns by an LSTM with a linear-chain CRF, representing each syllable of the compound noun in a continuous vector space.
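The abstract frames decomposition as tagging each syllable. A minimal illustration of that formulation follows: 'B' marks the first syllable of a unit noun and 'I' a continuation, and decoding the tag sequence recovers the unit nouns. The LSTM-CRF that would predict the tags is replaced here by a given tag sequence, since the point is the problem encoding, not the model.

```python
def decode(syllables, tags):
    """Split a compound noun according to B/I syllable tags."""
    nouns = []
    for s, t in zip(syllables, tags):
        if t == "B":
            nouns.append(s)      # start a new unit noun
        else:
            nouns[-1] += s       # continue the current unit noun
    return nouns

# 개인정보보호 ("personal-information protection") → 개인 + 정보 + 보호
print(decode(list("개인정보보호"), ["B", "I", "B", "I", "B", "I"]))
```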

A Word Spacing System based on Syllable Patterns for Memory-constrained Devices (메모리 제약적 기기를 위한 음절 패턴 기반 띄어쓰기 시스템)

  • Kim, Shin-Il;Yang, Seon;Ko, Young-Joong
    • Journal of KIISE:Software and Applications / v.37 no.8 / pp.653-658 / 2010
  • In this paper, we propose a word spacing system that can run with only a small amount of memory. We focus on significant memory reduction while maintaining performance comparable to the latest studies. The proposed method is based on the hidden Markov model and uses only probability information, without any rule information. Two types of features are employed: 1) spacing patterns dependent on each individual syllable and 2) transition probabilities between two syllable patterns. In an experiment using only the first type of features, we achieved an accuracy of more than 91% while reducing memory by 53% compared with other systems developed for mobile applications. Using both types of features, we achieved an outstanding accuracy of more than 94% while reducing memory by 76% compared with a system that employs syllable bigrams as its features.
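The first feature type above can be sketched as a per-syllable spacing probability with greedy decoding: insert a space wherever the probability crosses a threshold. The probabilities below are invented; the paper estimates them from a corpus and additionally combines them with pattern-transition probabilities, which this sketch omits.

```python
def insert_spaces(syllables, p_space, threshold=0.5):
    """Greedy decoding using only per-syllable spacing probabilities."""
    out = []
    for s, p in zip(syllables, p_space):
        out.append(s)
        if p > threshold:      # probability that a space follows this syllable
            out.append(" ")
    return "".join(out).strip()

# "아버지가방에들어가신다" → "아버지가 방에 들어가신다" under these toy probabilities
probs = [0.1, 0.1, 0.2, 0.9, 0.1, 0.8, 0.1, 0.1, 0.2, 0.1, 0.0]
print(insert_spaces(list("아버지가방에들어가신다"), probs))
```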

Modification Distance Model using Headible Path Contexts for Korean Dependency Parsing (지배가능 경로 문맥을 이용한 의존 구문 분석의 수식 거리 모델)

  • Woo, Yeon-Moon;Song, Young-In;Park, So-Young;Rim, Hae-Chang
    • Journal of KIISE:Software and Applications / v.34 no.2 / pp.140-149 / 2007
  • This paper presents a statistical model for Korean dependency-based parsing. Although Korean is a free word order language, certain word orders are preferred in local contexts. Earlier works proposed parsing models using modification lengths based on this property. Our model uses headible path contexts for modification length probabilities. Using the headible path of a dependent is effective for long-distance relations, because the large surface context of the dependent is abbreviated as its headible path. Combined with lexical bigram dependencies, our probabilistic model achieves 86.9% accuracy in eojeol analysis on the KAIST corpus, with particular improvement on long-distance dependencies.

A Comparative Study on Optimal Feature Identification and Combination for Korean Dialogue Act Classification (한국어 화행 분류를 위한 최적의 자질 인식 및 조합의 비교 연구)

  • Kim, Min-Jeong;Park, Jae-Hyun;Kim, Sang-Bum;Rim, Hae-Chang;Lee, Do-Gil
    • Journal of KIISE:Software and Applications / v.35 no.11 / pp.681-691 / 2008
  • In this paper, we evaluate and compare the features and feature combinations necessary for statistical Korean dialogue act classification. We implemented a Korean dialogue act classification system using the Support Vector Machine method. The experimental results show that the POS bigram does not work well, and that the morpheme-POS pair and other features can complement each other. In addition, a small number of features, selected by a feature selection technique such as chi-square, are enough to give steady dialogue act classification performance. We also found that the last eojeol plays an important role in classifying an entire sentence, and that Korean characteristics such as free word order and frequent subject ellipsis can affect classification performance.
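Chi-square feature selection, as mentioned above, scores each feature by its chi-square statistic against the class labels and keeps the top-k. A minimal binary-class sketch with invented documents (the function names and data are illustrative, not from the paper):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 feature/class contingency table:
    n11 = present & positive, n10 = present & negative,
    n01 = absent & positive,  n00 = absent & negative."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

def select(docs, labels, k):
    """docs: list of feature sets; labels: binary class labels (0/1)."""
    feats = {f for d in docs for f in d}
    pos = sum(labels)
    neg = len(docs) - pos
    scores = {}
    for f in feats:
        n11 = sum(1 for d, y in zip(docs, labels) if y and f in d)
        n10 = sum(1 for d, y in zip(docs, labels) if not y and f in d)
        scores[f] = chi_square(n11, n10, pos - n11, neg - n10)
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [{"rate", "rise"}, {"rate"}, {"ball"}, {"ball", "goal"}]
labels = [1, 1, 0, 0]
print(sorted(select(docs, labels, 2)))  # → ['ball', 'rate']
```

Features perfectly correlated with a class ("rate", "ball") score highest, while partially informative ones ("rise", "goal") fall behind.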

Korean Compound Noun Decomposition and Semantic Tagging System using User-Word Intelligent Network (U-WIN을 이용한 한국어 복합명사 분해 및 의미태깅 시스템)

  • Lee, Yong-Hoon;Ock, Cheol-Young;Lee, Eung-Bong
    • The KIPS Transactions:PartB / v.19B no.1 / pp.63-76 / 2012
  • We propose a Korean compound noun semantic tagging system using statistical compound noun decomposition and semantic relation information extracted from a lexical semantic network (U-WIN) and dictionary definitions. The system consists of three phases: compound noun decomposition, semantic constraint, and semantic tagging. In compound noun decomposition, the best candidates are selected using noun location frequencies extracted from the Sejong corpus; nouns are then re-decomposed for semantic constraints, and foreign nouns are restored. The semantic constraint phase finds possible semantic combinations using origin information from the dictionary and a Naive Bayes classifier, in order to decrease computation time and increase the accuracy of semantic tagging. The semantic tagging phase calculates the semantic similarity between decomposed nouns and decides the semantic tags. We constructed an experimental data set of 40,717 compound nouns from the Standard Korean Language Dictionary, each of three or more characters and semantically tagged. In experiments, the accuracy of compound noun decomposition is 99.26%, and the accuracy of semantic tagging is 95.38%.
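The frequency-based decomposition step described above can be sketched as follows: among all two-way splits of a compound noun, pick the one whose parts are most frequent as unit nouns. The frequency table is invented; the paper derives position-dependent frequencies from the Sejong corpus and additionally handles multi-way splits, re-decomposition, and foreign-noun restoration.

```python
# Invented unit-noun frequencies (illustrative stand-in for corpus counts)
freq = {"사과": 120, "나무": 200, "사": 5, "과나무": 1, "과": 8, "사과나": 0}

def best_split(compound):
    """Pick the two-way split whose parts are most frequent as unit nouns."""
    candidates = []
    for i in range(1, len(compound)):
        left, right = compound[:i], compound[i:]
        score = freq.get(left, 0) * freq.get(right, 0)
        candidates.append((score, [left, right]))
    return max(candidates)[1]

print(best_split("사과나무"))  # 사과 (apple) + 나무 (tree)
```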