• Title/Summary/Keyword: 한국어 코퍼스

Search Result 245, Processing Time 0.021 seconds

Personalized Chit-chat Based on Language Models (언어 모델 기반 페르소나 대화 모델)

  • Jang, Yoonna;Oh, Dongsuk;Lim, Jungwoo;Lim, Heuiseok
    • Annual Conference on Human and Language Technology
    • /
    • 2020.10a
    • /
    • pp.491-494
    • /
    • 2020
  • 최근 언어 모델(Language model)의 기술이 발전함에 따라, 자연어처리 분야의 많은 연구들이 좋은 성능을 내고 있다. 정해진 주제 없이 인간과 잡담을 나눌 수 있는 오픈 도메인 대화 시스템(Open-domain dialogue system) 분야에서 역시 이전보다 더 자연스러운 발화를 생성할 수 있게 되었다. 언어 모델의 발전은 응답 선택(Response selection) 분야에서도 모델이 맥락에 알맞은 답변을 선택하도록 하는 데 기여를 했다. 하지만, 대화 모델이 답변을 생성할 때 일관성 없는 답변을 만들거나, 구체적이지 않고 일반적인 답변만을 하는 문제가 대두되었다. 이를 해결하기 위하여 화자의 개인화된 정보에 기반한 대화인 페르소나(Persona) 대화 데이터 및 태스크가 연구되고 있다. 페르소나 대화 태스크에서는 화자마다 주어진 페르소나가 있고, 대화를 할 때 주어진 페르소나와 일관성이 있는 답변을 선택하거나 생성해야 한다. 이에 우리는 대용량의 코퍼스(Corpus)에 사전 학습(Pre-trained) 된 언어 모델을 활용하여 더 적절한 답변을 선택하는 페르소나 대화 시스템에 대하여 논의한다. 언어 모델 중 자기 회귀(Auto-regressive) 방식으로 모델링을 하는 GPT-2, DialoGPT와 오토인코더(Auto-encoder)를 이용한 BERT, 두 모델이 결합되어 있는 구조인 BART가 실험에 활용되었다. 이와 같이 본 논문에서는 여러 종류의 언어 모델을 페르소나 대화 태스크에 대해 비교 실험을 진행했고, 그 결과 Hits@1 점수에서 BERT가 가장 우수한 성능을 보이는 것을 확인할 수 있었다.

  • PDF

Alleviating Semantic Term Mismatches in Korean Information Retrieval (한국어 정보 검색에서 의미적 용어 불일치 완화 방안)

  • Yun, Bo-Hyun;Park, Sung-Jin;Kang, Hyun-Kyu
    • The Transactions of the Korea Information Processing Society
    • /
    • v.7 no.12
    • /
    • pp.3874-3884
    • /
    • 2000
  • An information retrieval system has to retrieve all and only documents which are relevant to a user query, even if index terms and query terms are not matched exactly. However, term mismatches between index terms and qucry terms have been a serious obstacle to the enhancement of retrieval performance. In this paper, we discuss automatic term normalization between words in text corpora and their application to a Korean information retrieval system. We perform two types of term normalizations to alleviate semantic term mismatches: equivalence class and co-occurrence cluster. First, transliterations, spelling errors, and synonyms are normalized into equivalence classes bv using contextual similarity. Second, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using K-means algorithm and co-occurrence clusters are identified. In this paper, these normalized term products are used in the query expansion to alleviate semantic tem1 mismatches. In other words, we utilize two kinds of tcrm normalizations, equivalence class and co-occurrence cluster, to expand user's queries with new tcrms, in an attempt to make user's queries more comprehensive (adding transliterations) or more specific (adding spc'Cializationsl. For query expansion, we employ two complementary methods: term suggestion and term relevance feedback. The experimental results show that our proposed system can alleviatl' semantic term mismatches and can also provide the appropriate similarity measurements. As a result, we know that our system can improve the rctrieval efficiency of the information retrieval system.

  • PDF

A Study of Keyword Spotting System Based on the Weight of Non-Keyword Model (비핵심어 모델의 가중치 기반 핵심어 검출 성능 향상에 관한 연구)

  • Kim, Hack-Jin;Kim, Soon-Hyub
    • The KIPS Transactions:PartB
    • /
    • v.10B no.4
    • /
    • pp.381-388
    • /
    • 2003
  • This paper presents a method of giving weights to garbage class clustering and Filler model to improve performance of keyword spotting system and a time-saving method of dialogue speech processing system for keyword spotting by calculating keyword transition probability through speech analysis of task domain users. The point of the method is grouping phonemes with phonetic similarities, which is effective in sensing similar phoneme groups rather than individual phonemes, and the paper aims to suggest five groups of phonemes obtained from the analysis of speech sentences in use in Korean morphology and in stock-trading speech processing system. Besides, task-subject Filler model weights are added to the phoneme groups, and keyword transition probability included in consecutive speech sentences is calculated and applied to the system in order to save time for system processing. To evaluate performance of the suggested system, corpus of 4,970 sentences was built to be used in task domains and a test was conducted with subjects of five people in their twenties and thirties. As a result, FOM with the weights on proposed five phoneme groups accounts for 85%, which has better performance than seven phoneme groups of Yapanel [1] with 88.5% and a little bit poorer performance than LVCSR with 89.8%. Even in calculation time, FOM reaches 0.70 seconds than 0.72 of seven phoneme groups. Lastly, it is also confirmed in a time-saving test that time is saved by 0.04 to 0.07 seconds when keyword transition probability is applied.

A realization of pauses in utterance across speech style, gender, and generation (과제, 성별, 세대에 따른 휴지의 실현 양상 연구)

  • Yoo, Doyoung;Shin, Jiyoung
    • Phonetics and Speech Sciences
    • /
    • v.11 no.2
    • /
    • pp.33-44
    • /
    • 2019
  • This paper dealt with how realization of pauses in utterance is affected by speech style, gender, and generation. For this purpose, we analyzed the frequency and duration of pauses. Pauses were categorized into four types: pause with breath, pause with no breath, utterance medial pause, and utterance final pause. Forty-eight subjects living in Seoul were chosen from the Korean Standard Speech Database. All subjects engaged in reading and spontaneous speech, through which we could also compare the realization between the two speech styles. The results showed that utterance final pauses had longer durations than utterance medial pauses. It means that utterance final pause has a function that signals the end of an utterance to the audience. For difference between tasks, spontaneous speech had longer and more frequent pauses because of cognitive reasons. With regard to gender variables, women produced shorter and less frequent pauses. For male speakers, the duration of pauses with breath was significantly longer. Finally, for generation variable, older speakers produced more frequent pauses. In addition, the results showed several interaction effects. Male speakers produced longer pauses, but this gender effect was more prominent at the utterance final position.

Comparison of vowel lengths of articles and monosyllabic nouns in Korean EFL learners' noun phrase production in relation to their English proficiency (한국인 영어학습자의 명사구 발화에서 영어 능숙도에 따른 관사와 단음절 명사 모음 길이 비교)

  • Park, Woojim;Mo, Ranm;Rhee, Seok-Chae
    • Phonetics and Speech Sciences
    • /
    • v.12 no.3
    • /
    • pp.33-40
    • /
    • 2020
  • The purpose of this research was to find out the relation between Korean learners' English proficiency and the ratio of the length of the stressed vowel in a monosyllabic noun to that of the unstressed vowel in an article of the noun phrases (e.g., "a cup", "the bus", etcs.). Generally, the vowels in monosyllabic content words are phonetically more prominent than the ones in monosyllabic function words as the former have phrasal stress, making the vowels in content words longer in length, higher in pitch, and louder in amplitude. This study, based on the speech samples from Korean-Spoken English Corpus (K-SEC) and Rated Korean-Spoken English Corpus (Rated K-SEC), examined 879 English noun phrases, which are composed of an article and a monosyllabic noun, from sentences which are rated on 4 levels of proficiency. The lengths of the vowels in these 879 target NPs were measured and the ratio of the vowel lengths in nouns to those in articles was calculated. It turned out that the higher the proficiency level, the greater the mean ratio of the vowels in nouns to the vowels in articles, confirming the research's hypothesis. This research thus concluded that for the Korean English learners, the higher the English proficiency level, the better they could produce the stressed and unstressed vowels with more conspicuous length differences between them.