• Title/Summary/Keyword: unknown word


Real-time Unknown Word Identification Using Support Vector Machine For Chinese Text-to-Speech (중국어 음성합성을 위한 지진 벡터 기반 실시간 미등록어 처리)

  • Ha, Ju-Hong;Zheng, Yu;Lee, Gary G.
    • Annual Conference on Human and Language Technology / 2003.10d / pp.267-272 / 2003
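  • In building a speech synthesis system, converting the input text into an accurate phonetic transcription is very important. Chinese contains polyphonic characters (polyphony): a single character that is pronounced differently depending on its meaning or usage. Because handling polyphonic characters is a fairly complex problem, this paper addresses unknown word processing for person names and place names, the factors that most affect pronunciation. Above all, a real-time speech synthesis system requires faster processing. This study therefore proceeds in two stages: it first selects candidate spans for unknown words and then estimates the selected candidates. Candidate span selection uses the probabilities of single-character words (monosyllable words) and simple patterns. The final candidates are then estimated as unknown words using an SVM (Support Vector Machine).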
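
The two-stage procedure in the abstract above (cheap candidate span selection, then classification of the surviving candidates) could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the probability table, the threshold, and the function names `select_candidates` and `identify_unknown_words` are all assumptions, and the SVM is replaced by an arbitrary `classify` callback so the sketch stays self-contained.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical probabilities that a character stands alone as a one-character word.
# Real values would be estimated from a word-segmented corpus.
MONOSYLLABLE_PROB: Dict[str, float] = {
    "的": 0.95, "在": 0.90, "是": 0.92, "王": 0.05, "小": 0.30, "明": 0.04,
}


def select_candidates(tokens: List[str], threshold: float = 0.5,
                      min_len: int = 2) -> List[Tuple[int, int]]:
    """Stage 1: return (start, end) spans of consecutive single-character
    tokens that are unlikely to be standalone words.

    Characters missing from the table are treated as probability 0.0,
    i.e. always suspicious (an assumption of this sketch).
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a trailing run
        suspicious = len(tok) == 1 and MONOSYLLABLE_PROB.get(tok, 0.0) < threshold
        if suspicious:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans


def identify_unknown_words(tokens: List[str],
                           classify: Callable[[str], bool]) -> List[str]:
    """Stage 2: run the (more expensive) classifier only on stage-1 candidates."""
    found = []
    for start, end in select_candidates(tokens):
        candidate = "".join(tokens[start:end])
        if classify(candidate):
            found.append(candidate)
    return found
```

Running the classifier only on the spans that stage 1 flags is what keeps the per-sentence cost low, which matches the real-time motivation stated in the abstract.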


Influence Maximization Scheme against Various Social Adversaries

  • Noh, Giseop;Oh, Hayoung;Lee, Jaehoon
    • Journal of information and communication convergence engineering / v.16 no.4 / pp.213-220 / 2018
  • With the exponential growth of social networks, their fundamental role as a medium for spreading information, ideas, and influence has gained importance. This role can be expressed through the relationships and interactions within a group of individuals. Accordingly, models and studies from various domains have addressed the influence maximization problem for the "word of mouth" effect of new products. In reality, more than two related social groups, such as commercial companies and service providers, exist within the same market. In such a scenario, these groups, called social adversaries, compete to expand their market influence against each other. To address the influence maximization (IM) problem between them, we propose a novel IM problem for social adversarial players (IM-SA), who exploit social network attributes to infer the unknown adversary's network configuration. We define a mathematical closed form to demonstrate that the proposed scheme can achieve a near-optimal solution for a player.

Feature Generation of Dictionary for Named-Entity Recognition based on Machine Learning (기계학습 기반 개체명 인식을 위한 사전 자질 생성)

  • Kim, Jae-Hoon;Kim, Hyung-Chul;Choi, Yun-Soo
    • Journal of Information Management / v.41 no.2 / pp.31-46 / 2010
  • Named-entity recognition (NER), a part of information extraction, is now used in information retrieval as well as in question-answering systems. Unlike ordinary words, named entities (NEs) are continually created and changed in documents on the Web, in newspapers, and so on. This NE generation causes an unknown word problem and creates difficulties for many application systems that rely on NER. To alleviate this problem, this paper proposes a new feature generation method for machine learning-based NER. In general, features in machine learning-based NER are related to words, whereas entries in named-entity dictionaries are phrases, so the entries cannot be used directly as features of NER systems. This paper proposes an encoding scheme as a feature generation method that converts phrase entities into word-unit features. Furthermore, with this scheme, entities with semantic information in WordNet can be converted into features of NER systems. Our experiments show that the F1 score increases by about 6% and errors are reduced by about 38%.
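
One common way to turn phrase-level dictionary entries into per-word features is a B/I/O-style encoding. The sketch below illustrates that general idea and is not necessarily the encoding scheme the paper proposes; the function name `dictionary_features`, the gazetteer contents, and the tag names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def dictionary_features(tokens: List[str],
                        gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Give each token a word-level feature derived from phrase-level entries.

    A token gets "B-<type>" if it starts a dictionary phrase, "I-<type>" if it
    continues one, and "O" otherwise. Matching is left to right and the
    longest phrase wins.
    """
    max_len = max((len(phrase) for phrase in gazetteer), default=0)
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest possible phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label is not None:
                feats[i] = "B-" + label
                for j in range(i + 1, i + n):
                    feats[j] = "I-" + label
                matched = n
                break
        i += matched if matched else 1
    return feats
```

The resulting per-token strings can be fed to a word-based learner alongside the usual lexical features, which is the gap between phrase entries and word features that the abstract describes.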

Wordnet Extension for IT terminology Using Web Search (웹 검색을 활용한 워드넷에서의 IT 전문 용어 확장)

  • Park, Kyeong-Kook;Lee, Kwang-Mo;Kim, Yu-Seop
    • Annual Conference on Human and Language Technology / 2007.10a / pp.189-193 / 2007
  • In this paper, we designed a methodology for expanding WordNet. We added unknown terms, such as IT technical terms, to the existing WordNet by using web search. WordNet is an online taxonomy that represents the relationships among terms, but it has been limited in its coverage of new technical terminology, which is why we set out to expand it. First, when we encountered terms not registered in WordNet, we built a web search query from those terms. Given the web search results, we looked for terms that are highly related to the unregistered terms. We used the Korean Morphological Analyzer to score the relatedness between terms and placed each unregistered term as a hyponym of the terms with high relatedness scores.


Recognizing Unknown Words and Correcting Spelling errors as Preprocessing for Korean Information Processing System (한국어 정보처리 시스템의 전처리를 위한 미등록어 추정 및 철자 오류의 자동 교정)

  • Park, Bong-Rae;Rim, Hae-Chang
    • The Transactions of the Korea Information Processing Society / v.5 no.10 / pp.2591-2599 / 1998
  • In this paper, we propose a method of recognizing unknown words and correcting spelling errors (including spacing errors) to increase the performance of Korean information processing systems. Unknown words are recognized through comparative analysis of two or more morphologically similar eojeols (spacing units in Korean) that contain the same unknown word candidate. Spacing errors and spelling errors are corrected by using lexicalized rules that are automatically extracted from a very large raw corpus. The extraction of the lexicalized rules is based on morphological and contextual similarities between error eojeols and their correction eojeols, which are confirmed to be used in the corpus. The experimental results show that our system recognizes unknown words with an accuracy of 98.9% and corrects spacing errors and spelling errors with accuracies of 98.1% and 97.1%, respectively.


Korean Noun Extractor using Occurrence Patterns of Nouns and Post-noun Morpheme Sequences (한국어 명사 출현 특성과 후절어를 이용한 명사추출기)

  • Park, Yong-Hyun;Hwang, Jae-Won;Ko, Young-Joong
    • Journal of KIISE:Software and Applications / v.37 no.12 / pp.919-927 / 2010
  • As the performance of mobile devices has recently improved, the demand for information retrieval has increased on mobile devices as well as on PCs. If a mobile device with limited memory uses a traditional language analysis tool to extract nouns from Korean texts, the analysis imposes a heavy burden on the device. As a result, the need for language analysis tools suited to mobile devices is increasing. Therefore, this paper proposes a new method for noun extraction using post-noun morpheme sequences and noun patterns from a large corpus. The proposed noun extractor requires a dictionary of only 146KB and achieves an $F_1$-measure of 0.86; its noun dictionary is only 4% of the size of that of an existing noun extractor with a POS tagger. In addition, it easily extracts nouns from unknown words because its dependence on noun dictionaries is low.
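
As a rough illustration of why post-noun morpheme sequences reduce the need for a noun dictionary, a noun candidate can be obtained by stripping a known trailing sequence from an eojeol. The sketch below is a simplification under stated assumptions: the suffix list is a tiny hypothetical sample, the function name `extract_noun` is invented for the example, and the noun occurrence patterns that the paper also uses are omitted.

```python
from typing import List, Optional

# Hypothetical, very small sample of post-noun morpheme sequences (particles).
# A real list would be collected from a large tagged corpus.
POST_NOUN_SEQUENCES: List[str] = [
    "에서는", "에서", "에게", "으로",
    "은", "는", "이", "가", "을", "를", "의", "에", "로",
]


def extract_noun(eojeol: str,
                 suffixes: List[str] = POST_NOUN_SEQUENCES) -> Optional[str]:
    """Return the stem left after stripping the longest matching post-noun
    sequence, or None if no sequence matches.

    No noun dictionary is consulted, so the stem may be a word the system
    has never seen before.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if eojeol.endswith(suffix) and len(eojeol) > len(suffix):
            return eojeol[: -len(suffix)]
    return None
```

Suffix stripping alone over-generates (a noun that happens to end in a particle-like syllable would be cut short), which is presumably one reason the abstract pairs the sequences with noun patterns.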

A study on Gaussian mixture model deep neural network hybrid-based feature compensation for robust speech recognition in noisy environments (잡음 환경에 효과적인 음성 인식을 위한 Gaussian mixture model deep neural network 하이브리드 기반의 특징 보상)

  • Yoon, Ki-mu;Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.37 no.6 / pp.506-511 / 2018
  • This paper proposes a GMM (Gaussian Mixture Model)-DNN (Deep Neural Network) hybrid-based feature compensation method for effective speech recognition in noisy environments. In the proposed algorithm, the posterior probability for the conventional GMM-based feature compensation method is calculated using a DNN. Experimental results using the Aurora 2.0 framework and database demonstrate that the proposed GMM-DNN hybrid-based feature compensation method is more effective than the GMM-based method in both Known and Unknown noisy environments. In particular, the experiments in the Unknown environments show a 9.13 % relative improvement in average WER (Word Error Rate) and considerable improvements in lower SNR (Signal to Noise Ratio) conditions such as 0 and 5 dB SNR.
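
For readers unfamiliar with the metric, a relative WER improvement is usually computed as the WER reduction divided by the baseline WER. The helper below shows that arithmetic; the function name is an assumption, and the abstract does not give the underlying WER values, so the example uses made-up numbers.

```python
def relative_improvement(baseline_wer: float, proposed_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical values, not taken from the paper: a drop from 20.0 % to 18.0 % WER
# is a 2-point absolute reduction but a 10 % relative improvement.
example = relative_improvement(20.0, 18.0)
```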

Distributed Representation of Words with Semantic Hierarchical Information (의미적 계층정보를 반영한 단어의 분산 표현)

  • Kim, Minho;Choi, Sungki;Kwon, Hyuk-Chul
    • Proceedings of the Korea Information Processing Society Conference / 2017.04a / pp.941-944 / 2017
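  • In statistical language models based on deep learning, the most important task is the distributed representation of words. A distributed representation expresses the meaning of a word as a vector in a multidimensional space and is also called a word embedding. Deep learning-based statistical language models that use word embeddings are known to outperform traditional statistical language models. However, word embeddings cannot escape the data sparseness problem either. In particular, it is important to handle words that do not appear in the training data (unknown words). This paper proposes a word embedding method that uses the semantic hierarchical information of words in order to obtain high-quality Korean word embeddings. The method reuses the word embedding method proposed in previous work, but modifies the objective function in the training stage so that it is computed with the hyponyms and synonyms of the input word, which allows the semantic hierarchical information of words to be reflected. In an analogical reasoning test of the word vectors generated by the proposed method, we achieved 47.90%, an increase of 5% over the existing method.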

Practical Development and Application of a Korean Morphological Analyzer for Automatic Indexing (자동 색인을 위한 한국어 형태소 분석기의 실제적인 구현 및 적용)

  • Choi, Sung-Pil;Seo, Jerry;Chae, Young-Suk
    • The KIPS Transactions:PartB / v.9B no.5 / pp.689-700 / 2002
  • In this paper, we developed a Korean morphological analyzer for automatic indexing, which is essential for information retrieval. Since it is important to index large-scale document sets efficiently, we concentrated on maximizing the speed of word analysis and on the modularization and structuring of the system rather than on introducing new concepts or ideas. In this respect, our system is characterized by its software engineering aspects, aimed at real-world use, rather than by theoretical issues. First, a dictionary of words was structured. Then, modules that analyze substantive words and inflected words were introduced. Furthermore, a numeral analyzer was developed, and we introduced an unknown word analyzer that uses morpheme patterns. The whole system was integrated into K-2000, an information retrieval system.

Investigating the Role of Memorable Tourism Experience towards Revisit Intention and Electronic Word of Mouth: A Study on Beach Tourists

  • Van Vien VU;Van Hao HOANG;Lan Huong VU
    • Journal of Distribution Science / v.22 no.2 / pp.83-93 / 2024
  • Purpose: Although many studies have addressed destination marketing concepts, the relationship between beach tourists' memorable tourism experience (MTE), revisit intention, and electronic word of mouth (eWOM) remains unknown. To address this issue, the authors established a model to investigate the effects of MTE's dimensions on revisit intention and eWOM. Research design, data and methodology: Drawing on 581 questionnaires from domestic beach tourists in Vietnam, a quantitative approach was used to empirically analyze a partial least squares path model in PLS-SEM. Results: The findings revealed that four dimensions of MTE (hedonism, local culture, meaningfulness, and involvement) have a positive influence on beach tourists' revisit intention. In addition, meaningfulness and knowledge directly affect eWOM. Notably, beach tourists' revisit intention significantly and directly influences their eWOM. The findings also confirm the indirect effects of hedonism, local culture, meaningfulness, and involvement on eWOM through the mediating role of revisit intention. Conclusions: This study helps explain beach tourists' behavior through each dimension of MTE. It also emphasizes the direct effect of beach tourists' revisit intention on eWOM and confirms its mediating role in the relationship between MTE and eWOM. The findings will assist policymakers and destination marketers in developing strategies and effective future actions.