• Title/Summary/Keyword: speaker attention

Speaker verification system combining attention-long short term memory based speaker embedding and I-vector in far-field and noisy environments (Attention-long short term memory 기반의 화자 임베딩과 I-vector를 결합한 원거리 및 잡음 환경에서의 화자 검증 알고리즘)

  • Bae, Ara; Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.39 no.2 / pp.137-142 / 2020
  • Many I-vector based studies have been conducted in a variety of settings, from text-dependent short utterances to text-independent long utterances. In this paper, we propose a speaker verification system for far-field and noisy environments that combines I-vector with Probabilistic Linear Discriminant Analysis (PLDA) and a speaker embedding from a Long Short Term Memory (LSTM) network with an attention mechanism. The LSTM model yields an Equal Error Rate (EER) of 15.52 % and the attention-LSTM model yields 8.46 %, an absolute improvement of 7.06 percentage points. We show that the proposed method addresses a weakness of the existing extraction process, in which the embedding is defined heuristically. The I-vector/PLDA system on its own, the best-performing single system, has an EER of 6.18 %. Combining it with the attention-LSTM based embedding lowers the EER to 2.57 %, which is 3.61 percentage points below the baseline, a relative improvement of 58.41 %.
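The EER figures quoted in this abstract mix absolute and relative reductions. The sketch below simply reproduces that arithmetic; the function name `eer_gain` is an illustrative assumption and does not come from the paper.

```python
def eer_gain(baseline_eer, new_eer):
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute = baseline_eer - new_eer
    relative = 100.0 * absolute / baseline_eer
    return absolute, relative


# EER values as quoted in the abstract (in percent).
lstm_abs, _ = eer_gain(15.52, 8.46)            # LSTM -> attention-LSTM
fused_abs, fused_rel = eer_gain(6.18, 2.57)    # I-vector/PLDA -> combined system

print(round(lstm_abs, 2), round(fused_abs, 2), round(fused_rel, 2))
# prints: 7.06 3.61 58.41
```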

Masked cross self-attentive encoding based speaker embedding for speaker verification (화자 검증을 위한 마스킹된 교차 자기주의 인코딩 기반 화자 임베딩)

  • Seo, Soonshin; Kim, Ji-Hwan
    • The Journal of the Acoustical Society of Korea / v.39 no.5 / pp.497-504 / 2020
  • Constructing speaker embeddings is an important issue in speaker verification. In general, a self-attention mechanism has been applied to speaker embedding encoding. Previous studies focused on training the self-attention in a high-level layer, such as the last pooling layer. In that case, the effect of low-level layers is not well represented in the speaker embedding encoding. In this study, we propose Masked Cross Self-Attentive Encoding (MCSAE) using ResNet, which focuses on training the features of both high-level and low-level layers. Based on multi-layer aggregation, the output features of each residual layer are used for the MCSAE. In the MCSAE, the interdependence of the input features is learned by a cross self-attention module, and a random masking regularization module is applied to prevent overfitting. The MCSAE increases the weight of frames that carry speaker information. The output features are then concatenated and encoded into the speaker embedding, which makes the embedding more informative. The experiments showed an equal error rate of 2.63 % on the VoxCeleb1 evaluation dataset, an improvement over the previous self-attentive encoding and state-of-the-art methods.
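The idea of weighting frames that carry speaker information can be illustrated with generic self-attentive pooling. This is a minimal sketch of that general technique, not the MCSAE itself: the cross self-attention, masking, and multi-layer aggregation described in the abstract are omitted, and the function name, scoring vector `w`, and toy frames are all assumptions.

```python
import math


def attentive_pool(frames, w):
    """Pool T frame vectors (each of length D) into one vector of length D.

    Each frame gets a scalar score (dot product with w); a softmax over the
    scores gives per-frame weights, and the output is the weighted sum.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames))
              for d in range(dim)]
    return weights, pooled


# Toy input: three frames with two features each (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
weights, pooled = attentive_pool(frames, w=[1.0, 0.0])
```

In this toy case the third frame has the highest score, so it receives the largest weight and dominates the pooled vector.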

On the Role of the Phatic Function of Intonation in Russian (러시아어 발화시 억양의 역할)

  • Park, Kun-Woo
    • Speech Sciences / v.4 no.1 / pp.81-89 / 1998
  • This paper investigates the phatic function of intonation in Russian by recording and analysing the speech of 11 female native speakers of standard Moscow Russian. It shows that differences in the intonation pattern of a sentence are associated with differences in the degree of the listener's involvement in the speech. The intonation pattern of an utterance with a phatic function appears to be determined by 1) the speaker's readiness to talk in order to evoke the listener's attention; 2) the speaker's intention to continue the communication. Some emphasis is placed on the relationship between the intonation pattern of an utterance and speaker-listener interaction.

A study on User Experience of Artificial Intelligence speaker (인공지능 스피커(AI speaker) 사례 분석을 통한 고찰)

  • Jo, Gyu-Eun; Kim, Seung-In
    • Journal of the Korea Convergence Society / v.9 no.8 / pp.127-133 / 2018
  • The purpose of this study is to analyze technology trends in artificial intelligence speakers (AI speakers) and, through a case study of AI speakers, to suggest a direction for domestic AI speakers. As a research method, the technical background was studied through the literature, and then cases of AI speakers were investigated. The results show attempts to extend AI speakers to a visual interface; one such attempt is the attention given to AI speakers with a built-in screen. AI speakers should be a platform through which humans and computers interact, not just a convenience. We hope the implications presented in this study can serve as a reference for predicting the direction of service development for domestic AI speakers.

Selective Attentive Learning for Fast Speaker Adaptation in Multilayer Perceptron (다층 퍼셉트론에서의 빠른 화자 적응을 위한 선택적 주의 학습)

  • 김인철; 진성일
    • The Journal of the Acoustical Society of Korea / v.20 no.4 / pp.48-53 / 2001
  • In this paper, a selectively attentive learning method is proposed to improve the learning speed of a multilayer perceptron trained with the error backpropagation algorithm. Three attention criteria are introduced to determine which set of input patterns, or which portion of the network, is attended to for effective learning. These criteria are based on the mean square error function of the output layer and the class-selective relevance of the hidden nodes. Learning is accelerated by lowering the computational cost per iteration. The effectiveness of the proposed method is demonstrated in a speaker adaptation task for an isolated word recognition system. The experimental results show that the proposed selective attention technique can reduce the learning time by more than 60 % on average.
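One way to picture attending to a subset of input patterns is to update only on patterns whose output-layer mean square error is still large. The abstract does not give the exact criteria, so the sketch below is an assumption: the threshold rule, the function names, and the toy values are hypothetical.

```python
def mean_square_error(output, target):
    """Mean of squared differences between one output vector and its target."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


def select_attended(outputs, targets, threshold):
    """Return indices of patterns whose output-layer MSE exceeds the threshold.

    Only these patterns would take part in the next backpropagation update,
    which lowers the cost per iteration (assumed selection rule).
    """
    return [i for i, (o, t) in enumerate(zip(outputs, targets))
            if mean_square_error(o, t) > threshold]


# Toy network outputs and one-hot targets (hypothetical values).
outputs = [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]]
targets = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attended = select_attended(outputs, targets, threshold=0.05)
print(attended)  # prints: [1]
```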

Spatial Speaker Localization for a Humanoid Robot Using TDOA-based Feature Matrix (도착시간지연 특성행렬을 이용한 휴머노이드 로봇의 공간 화자 위치측정)

  • Kim, Jin-Sung; Kim, Ui-Hyun; Kim, Do-Ik; You, Bum-Jae
    • The Journal of Korea Robotics Society / v.3 no.3 / pp.237-244 / 2008
  • Research on human-robot interaction has been receiving increasing attention, and speech signal processing in particular attracts much interest in this field. In this paper, we report a speaker localization system with six microphones for MAHRU, a humanoid robot from KIST, and propose a time delay of arrival (TDOA)-based feature matrix, together with an algorithm based on the minimum sum of absolute errors (MSAE), for sound source localization. The TDOA-based feature matrix is defined as a simple database matrix calculated from pairs of microphones installed on the humanoid robot. The proposed method, using the TDOA-based feature matrix and the MSAE-based algorithm, localizes a sound source without having to solve approximate nonlinear equations. To verify the performance of our speaker localization system, we present experimental results for speech sources in all directions within a distance of 5 m, with the height divided into three ranges.
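The lookup idea in this abstract can be sketched as follows: precompute the expected TDOAs for each candidate source position, then pick the candidate whose row has the minimum sum of absolute errors against the measured TDOAs. Only the six microphones, the feature matrix, and the MSAE criterion come from the abstract. The microphone geometry, the candidate grid, the speed of sound, and all function names are assumptions.

```python
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def tdoa_vector(source, mics):
    """Expected TDOA (seconds) for every microphone pair, given a source position."""
    dists = [math.dist(source, m) for m in mics]
    return [(dists[i] - dists[j]) / SPEED_OF_SOUND
            for i, j in combinations(range(len(mics)), 2)]


def build_feature_matrix(candidates, mics):
    """One row of expected TDOAs per candidate source position."""
    return [tdoa_vector(c, mics) for c in candidates]


def localize(measured, feature_matrix, candidates):
    """Return the candidate whose row has the minimum sum of absolute errors."""
    errors = [sum(abs(m - e) for m, e in zip(measured, row))
              for row in feature_matrix]
    best = min(range(len(errors)), key=errors.__getitem__)
    return candidates[best]


# Hypothetical geometry: six microphones on a 10 cm circle, eight candidate
# directions at 1 m distance in the same plane.
mics = [(0.1 * math.cos(k * math.pi / 3), 0.1 * math.sin(k * math.pi / 3), 0.0)
        for k in range(6)]
candidates = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4), 0.0)
              for k in range(8)]
feature_matrix = build_feature_matrix(candidates, mics)

measured = tdoa_vector(candidates[3], mics)  # simulated measurement
estimate = localize(measured, feature_matrix, candidates)
```

Because localization reduces to a table search, no nonlinear system has to be solved at run time; the resolution is limited by how densely the candidate positions are sampled.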

Symbolic Violence of the Native Speaker Fallacy: A Qualitative Case Study of an NNES Teacher

  • Choi, Soo-Joung
    • English Language & Literature Teaching / v.15 no.3 / pp.33-57 / 2009
  • Taking the issues of inequity and power between NES and NNES teachers as a starting point, this qualitative study explores the way the widespread belief of the native speaker fallacy manifests itself in one NNES teacher's teaching life and is linked to the teacher's understanding of herself as an English teacher. Guided by critical applied linguistics (Pennycook, 2001) and using Bourdieu's (1991) theorization of symbolic violence, I conducted an instrumental case study (Stake, 1995) in an ESL writing class at a US university. I collected data through classroom observations and interviews over a nine-month period and analyzed the data using the constant comparison method (Glaser and Strauss, 1967). The findings illustrate the ways the dominant ideology of the native speaker fallacy works to maintain and reproduce the status quo unequal relation between NES and NNES teachers by making all parties involved believe in the artificial sociocultural arrangements that favor NES teachers as legitimate. The findings direct our attention to the importance of critical teacher education that will enable future TESOL professionals to engage in critical reflection on diverse issues and envision transformative change. The findings, in particular, point to the need for language support for NNES teachers in TESOL teacher education.

Multimodal depression detection system based on attention mechanism using AI speaker (AI 스피커를 활용한 어텐션 메커니즘 기반 멀티모달 우울증 감지 시스템)

  • Park, Junhee; Moon, Nammee
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2021.06a / pp.28-31 / 2021
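  • Depression is a growing mental health problem worldwide, and research on detecting depression in daily life is under way to address it. In this paper, we therefore propose a multimodal depression detection system based on an attention mechanism, using AI speakers, which are closely tied to everyday life. The proposed method collects the speech and text data available from an AI speaker and trains on each type of data with a CNN (Convolutional Neural Network) and a BiLSTM (Bidirectional Long Short-Term Memory network). During training, it uses an attention mechanism that applies self-attention to assign additional weights to the feature vectors. Finally, the attention-weighted features from the speech and text data are summed, and a depression score is predicted through SoftMax.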

The Effect of Perceived Anthropomorphic Characteristics on Continuous Usage Intention of Artificial Intelligence Voice Speaker : Based on the Integrated Adoption Model (인공지능 음성 스피커의 의인화 특성 지각 정도가 지속적 이용 의향에 미치는 영향: 통합 수용 모델을 기반으로)

  • Lee, Sungjoon
    • The Journal of the Korea Contents Association / v.21 no.11 / pp.41-55 / 2021
  • AI voice speakers have played an important role in forming and developing an early market for AI-based goods and services, attracting growing attention from many people. In this context, this research examined factors affecting the intention to continue using AI voice speakers, based on an integrated adoption model that combines two factors, perceived playfulness and innovation resistance, with the extended technology acceptance model. It also examined whether three perceived anthropomorphic features (i.e., perceived rational support, perceived intimacy, and perceived cognitive openness) influence the intention to continue using AI voice speakers. The data were collected through an online survey of respondents in their 20s and 30s who had experience using an AI voice speaker, and were analyzed using SEM (Structural Equation Modeling). The results showed that perceived ease of use, perceived usefulness, perceived playfulness, and innovation resistance all had significant influences on continuous usage intention. In addition, perceived rational support, perceived intimacy, and perceived cognitive openness all had significant influences on perceived ease of use, perceived usefulness, and perceived playfulness. The implications of these results are also discussed.

$Gei^3ta^1$ in Taiwan Mandarin--- A Particular Construction

  • Lee, Chia-Chun
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.268-274 / 2007
  • The present paper investigates a particular structure in Taiwan Mandarin, "(NP) + (intensifier) + $gei^3ta^1$ 'give him/it' + adjective", in terms of construction grammar. The structure is mostly observed in the utterances of the younger generation. Though it is not regarded as a grammatical or standard structure, it is still a register of the language. The structure emphasizes the speaker's attitude toward an undesired, unpleasant event; in most cases, the attitude tends to be negative. The events or propositions must have existed or been completed. The adjectives compatible with this structure belong to the category of higher degree. This grammatical usage illustrates the semantic bleaching of $gei^3ta^1$, and the change from a verb of giving to a grammatical particle denoting subjective belief is a kind of subjectification. Moreover, $ta^1$ can refer to events or situations expressed by a more complicated grammatical structure, or can denote nothing, as a dummy word. Though many previous studies paid attention to this newly developed structure resulting from language contact, an adequate account has not been provided. It is hoped that this investigation will lead to a better understanding of this particular structure.
