• Title/Summary/Keyword: Corpus expansion

A Method of Chinese and Thai Cross-Lingual Query Expansion Based on Comparable Corpus

  • Tang, Peili;Zhao, Jing;Yu, Zhengtao;Wang, Zhuo;Xian, Yantuan
    • Journal of Information Processing Systems / v.13 no.4 / pp.805-817 / 2017
  • Cross-lingual query expansion is usually based on relationships among monolingual words. A bilingual comparable corpus, however, contains relationships among bilingual words. This paper therefore proposes a query expansion method based on these relationships. First, word vectors that characterize the bilingual words are trained on a Chinese-Thai bilingual comparable corpus. Then, the correlation between Chinese query words and Thai words is computed from these word vectors, and Thai candidate expansion terms are selected by their correlation values. Next, multiple groups of Thai query expansion sentences are built from the Thai candidate expansion words based on the Chinese query sentence. Finally, the optimal sentence is selected using the Chinese and Thai query expansion method and used to perform the Thai query expansion. Experimental results show that the proposed cross-lingual query expansion method can effectively improve the accuracy of Chinese-Thai cross-language information retrieval.
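
The candidate-selection step in this abstract can be illustrated with a minimal sketch. It assumes that Chinese and Thai word vectors live in one shared space and that "correlation" is measured by cosine similarity; the abstract does not name the measure, so this is an assumption. The function names, the toy vectors, and the placeholder term labels are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_u * norm_v)


def top_expansion_terms(query_vec, candidate_vecs, k=2):
    """Rank candidate terms by similarity to the query vector, keep the top k."""
    scored = [(term, cosine(query_vec, vec)) for term, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (assumption): one Chinese query-word vector and three Thai candidates.
zh_query = [1.0, 0.0, 0.0]
thai_candidates = {
    "thai_term_a": [0.9, 0.1, 0.0],
    "thai_term_b": [0.0, 1.0, 0.0],
    "thai_term_c": [0.7, 0.7, 0.0],
}

ranked = top_expansion_terms(zh_query, thai_candidates, k=2)
for term, score in ranked:
    print(term, round(score, 3))
```

In a real setting the vectors would come from a model trained on the comparable corpus, and the selected terms would feed the sentence-building step the abstract describes next.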

Sentence-Chain Based Seq2seq Model for Corpus Expansion

  • Chung, Euisok;Park, Jeon Gue
    • ETRI Journal / v.39 no.4 / pp.455-466 / 2017
  • This study focuses on a method for sequential data augmentation in order to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied for addressing language generation issues; it has the ability to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain based seq2seq model. For training the seq2seq model, sentence chains are used as triples. The first two sentences in a triple are used for the encoder of the seq2seq model, while the last sentence becomes a target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% relative perplexity over a baseline language model of Korean text. Additionally, from a comparison with a previous study, the sentence chain approach reduces the size of the training data by 38.4% while generating 1.4-times the number of n-grams with superior performance for English text.
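
The triple construction described in this abstract can be sketched as follows. The sketch assumes that a sentence chain is simply a run of three consecutive sentences taken with a sliding window; the abstract does not say how chains are selected, so that choice and the function name `chain_triples` are illustrative assumptions.

```python
def chain_triples(sentences):
    """Build (encoder_input, decoder_target) examples from consecutive sentences.

    Assumption: each chain is three consecutive sentences. The first two
    form the encoder input and the third is the decoder target.
    """
    examples = []
    for i in range(len(sentences) - 2):
        first, second, third = sentences[i:i + 3]
        examples.append(((first, second), third))
    return examples


# Toy document (assumption): four placeholder sentences.
document = ["s1", "s2", "s3", "s4"]
examples = chain_triples(document)
for encoder_input, target in examples:
    print(encoder_input, "->", target)
```

A document with n sentences yields n - 2 examples under this windowing, so documents shorter than three sentences contribute none.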

An Automatic Expansion of Sentiment Lexicon by Using Corpus (코퍼스를 이용한 감성 사전 자동 확장)

  • Lee, Kong Joo;Seo, Hyung-Won;Kim, Jae-Hoon
    • Annual Conference on Human and Language Technology / 2010.10a / pp.158-161 / 2010
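  • This study proposes a method that uses a base sentiment lexicon and a large corpus to automatically extract the extended sentiment expressions used in a target corpus. The target corpus consists of posts from viewer bulletin boards operated by broadcasters. With this method, the specific sentiment patterns used in the target corpus could be extracted.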

A Wikipedia-based Query Expansion Method for In-depth Blog Distillation (주제를 깊이 있게 다루는 블로그 피드 검색을 위한 위키피디아 기반 질의 확장 방법)

  • Song, Woo-Sang;Lee, Ye-Ha;Lee, Jong-Hyeok;Yang, Gi-Joo
    • Journal of KIISE:Computing Practices and Letters / v.16 no.11 / pp.1121-1125 / 2010
  • This paper proposes a Wikipedia-based feedback method for in-depth blog distillation, whose goal is to find blogs that present in-depth thoughts or analysis on a given query. The proposed method uses Wikipedia articles that are relevant to the query. The TREC Blogs08 collection, a large-scale blog corpus, and an English Wikipedia dump were used for the experiments. The proposed method significantly improved retrieval performance, including MAP, over the conventional post-based feedback method.

An Analysis of the process acting as a driver of the expansion of meanings in the synonym-antonym net: the meanings of '틀리다' ranging from "be wrong" to "be different" ([다름]의 '틀리다'를 형성하는 유의-반의 관계망 분석)

  • Shin, Jung-Jin
    • Korean Linguistics / v.78 / pp.31-54 / 2018
  • '맞다 (be right)', the antonym of '틀리다 (teullida, be wrong)', is synonymous with '같다 (be the same)' in some of its senses. '같다', as a symmetric verb, is in turn usually the antonym of '다르다 (be different)'. The meaning of '다르다' thus comes to be expressed by 'teullida', and these words form a closely connected network of meaning relations. In other words, the process driving the expansion of meaning starts from the antonym relation (1) '틀리다 ↔ 맞다' and the synonym relation (2) '맞다 = 같다', which together form a network; combining them with the opposite sense gives (3) '같다 = 맞다 ↔ 다르다'. As a result, many of today's speakers use (4) 'teullida' in the sense of [difference]. That is, a synonymic analogy is applied first and an antonymic analogy follows, and the resulting word is 'teullida' in the sense of [difference]. This constitutes another type of meaning extension.

Acoustic correlates of prosodic prominence in conversational speech of American English, as perceived by ordinary listeners

  • Mo, Yoon-Sook
    • Phonetics and Speech Sciences / v.3 no.3 / pp.19-26 / 2011
  • Previous laboratory studies have shown that prosodic structures are encoded in the modulations of phonetic patterns of speech, including suprasegmental as well as segmental features. Drawing on prosodically annotated large-scale speech data from the Buckeye corpus of conversational American English, the current study first evaluated the reliability of prosody annotation by a large number of ordinary listeners and then examined whether and how prosodic prominence influences the phonetic realization of multiple acoustic parameters in everyday conversational speech. The results showed that all the measures of acoustic parameters, including pitch, loudness, duration, and spectral balance, increase when a word is heard as prominent. These findings suggest that prosodic prominence enhances the phonetic characteristics of the acoustic parameters. The results also showed that the degree of phonetic enhancement varies depending on the type of acoustic parameter. With respect to formant structure, the findings more consistently support the Sonority Expansion Hypothesis than the Hyperarticulation Hypothesis, showing that lexically stressed vowels are hyperarticulated only when hyperarticulation does not interfere with sonority expansion. Taken together, the present study showed that prosodic prominence modulates the phonetic realization of the acoustic parameters in the direction of phonetic strengthening in everyday conversational speech, and that ordinary listeners are attentive to such prosody-related phonetic variation in speech perception. However, the present study also showed that in everyday conversational speech there is no single dominant acoustic measure signaling prosodic prominence, and that listeners must attend to such small acoustic variation or integrate acoustic information from multiple acoustic parameters in prosody perception.

Research about SMT Performance Improvement Through Automatic Corpus Expansion (말뭉치 자동 확장을 통한 SMT 성능 향상에 대한 연구)

  • Choi, Gyu-Hyun;Shin, Jong-Hun;Kim, Young-Kil
    • Annual Conference on Human and Language Technology / 2016.10a / pp.296-299 / 2016
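  • Statistical machine translation (SMT) systems are now widely used for automatic translation, but the large parallel corpora used as training data are difficult to build manually. To improve SMT performance, this study proposes a method that automatically expands the SMT parallel corpus of a target language pair by using existing corpora of other language pairs together with SMT technology. To obtain a parallel corpus for two different languages B and C, the proposed method builds an A-to-B SMT system, translates the A side of an existing A-C corpus into B with that system, and thereby automatically expands the B-C corpus. Experiments confirm that the expanded parallel corpus can improve the performance of an SMT system.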
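
The pivot procedure in this abstract (translate the A side of an existing A-C corpus into B, then pair each translation with its C sentence) can be sketched in a few lines. The translator here is a stand-in lookup table, not an SMT system; `expand_parallel_corpus`, `toy_table`, and the placeholder sentences are hypothetical names introduced only for illustration.

```python
def expand_parallel_corpus(ac_pairs, translate_a_to_b):
    """Turn (A, C) sentence pairs into (B, C) pairs via an A-to-B translator.

    translate_a_to_b is any callable that returns a B-language string,
    or None when it cannot translate the input.
    """
    bc_pairs = []
    for a_sentence, c_sentence in ac_pairs:
        b_sentence = translate_a_to_b(a_sentence)
        if b_sentence:  # skip sentences the translator could not handle
            bc_pairs.append((b_sentence, c_sentence))
    return bc_pairs


# Toy stand-in for an A-to-B SMT system (assumption): a lookup table.
toy_table = {"a1": "b1", "a3": "b3"}
ac_corpus = [("a1", "c1"), ("a2", "c2"), ("a3", "c3")]

bc_corpus = expand_parallel_corpus(ac_corpus, toy_table.get)
print(bc_corpus)
```

Because the B side is machine output, its quality bounds the quality of the expanded corpus; dropping untranslatable sentences, as done here, is one simple design choice among several.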

Semi-automatic Expansion for a Chatting Corpus Based on Similarity Measure Using Utterance Embedding by CNN (합성곱 신경망에 의한 발화 임베딩을 사용한 유사도 측정 기반의 채팅 말뭉치 반자동 확장 방법)

  • An, Jaehyun;Ko, Youngjoong
    • Annual Conference on Human and Language Technology / 2018.10a / pp.95-100 / 2018
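  • A large, high-quality chatting corpus is very important for building a good chatting system, but constructing one is costly. This paper therefore proposes a method for semi-automatically expanding a chatting corpus using large amounts of utterance data such as movie subtitles and drama scripts. To expand the corpus, a chat similarity is computed using a previously built chatting corpus and a similarity measure, and a pair is judged to be a valid chat pair if its chat similarity exceeds a threshold obtained through experiments. To compute the chat similarity of very short chat-like utterances effectively, the paper proposes generating utterance-level representations using morpheme-level embedding vectors and a convolutional neural network model. In experiments, compared with TF, the baseline method for generating utterance representations, precision, recall, and F1 rose by 5.16%p, 6.09%p, and 5.73%p respectively, yielding a semi-automatic chatting corpus construction model with scores of 61.28%, 53.19%, and 56.94%.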

Experimental Analysis of Correct Answer Characteristics in Question Answering Systems (질의응답시스템에서 정답 특징에 관한 실험적 분석)

  • Han, Kyoung-Soo
    • Journal of Digital Contents Society / v.19 no.5 / pp.927-933 / 2018
  • One of the factors with the greatest influence on the errors of a question answering system, which finds and provides answers to natural language questions, is the step that searches for documents or passages containing correct answers. To improve retrieval performance, it is necessary to understand the characteristics of documents and passages that contain correct answers. Using a corpus composed of questions, documents with correct answers, and documents without correct answers, this paper experimentally analyzes how many question words appear in the correct-answer documents, how the locations of the question words are distributed, and how similar the topics of the question and the correct-answer document are. The study explains the causes of results reported in previous search research for question answering systems and discusses the elements necessary for an effective search step.