• Title/Summary/Keyword: Speaker representation

Speech Recognition Using Recurrent Neural Prediction Models (회귀신경예측 모델을 이용한 음성인식)

  • 류제관;나경민;임재열;성경모;안성길
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.32B no.11
    • /
    • pp.1489-1495
    • /
    • 1995
  • In this paper, we propose recurrent neural prediction models (RNPM), recurrent neural networks trained as nonlinear predictors of speech, as a new connectionist model for speech recognition. RNPM modulates its mapping effectively through its internal representation, and it requires no time alignment algorithm. Therefore, the computational load at the recognition stage is reduced substantially compared with the well-known predictive neural networks (PNN), and the required memory is much smaller. RNPM also does not suffer from the problem of deciding the time-varying target function. In speaker-dependent and speaker-independent speech recognition experiments under various conditions, the proposed model was comparable in recognition performance to the PNN while retaining the above merits, which the PNN lacks.
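As a rough illustration of the idea in the abstract above (one recurrent predictor per word, with no separate time alignment step), the sketch below scores an utterance by its accumulated next-frame prediction error and picks the word model with the smallest error. The minimum-error decision rule, the one-unit network, the weights, and the word labels are all assumptions made for illustration; the abstract does not specify the architecture or the decision rule.

```python
import math


class ToyRecurrentPredictor:
    """Hypothetical one-unit recurrent predictor (not the paper's network).

    h_t = tanh(w_in * x_t + w_rec * h_{t-1}),  x_hat_{t+1} = w_out * h_t
    """

    def __init__(self, w_in, w_rec, w_out):
        self.w_in, self.w_rec, self.w_out = w_in, w_rec, w_out

    def prediction_error(self, frames):
        """Sum of squared next-frame prediction errors over the utterance."""
        h, err = 0.0, 0.0
        for x, x_next in zip(frames, frames[1:]):
            # The hidden state carries context forward, so frames are
            # processed in order without any explicit time alignment.
            h = math.tanh(self.w_in * x + self.w_rec * h)
            err += (x_next - self.w_out * h) ** 2
        return err


def recognize(frames, models):
    """Return the label of the model that predicts the frames best."""
    return min(models, key=lambda word: models[word].prediction_error(frames))


# Toy scalar "feature frames" and two hand-set word models (assumed values).
frames = [0.1, 0.2, 0.3, 0.4]
models = {
    "word_a": ToyRecurrentPredictor(w_in=1.0, w_rec=0.0, w_out=1.5),
    "word_b": ToyRecurrentPredictor(w_in=1.0, w_rec=0.0, w_out=-1.5),
}
best = recognize(frames, models)
```

In a real system the frames would be feature vectors and each predictor would be trained on utterances of its own word; only the scoring loop is shown here.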

Which Agent is More Captivating for Winning the Users' Hearts?: Focusing on Paralanguage Voice and Human-like Face Agent

  • SeoYoung Lee
    • Asia pacific journal of information systems
    • /
    • v.34 no.2
    • /
    • pp.585-619
    • /
    • 2024
  • This paper presents a comparative analysis of human interactions with AI agents based on the presence or absence of a facial representation, combined with the presence or absence of paralanguage voice elements. The "CASA (Computers Are Social Actors)" paradigm posits that people perceive computers as social actors, not tools, unconsciously applying human norms and behaviors to computers. Paralanguage refers to vocal elements of speech such as pitch, tone, stress, pause, duration, and speed that help to convey what a speaker is trying to communicate. The focus is on understanding how these elements collectively contribute to the generation of flow, intimacy, trust, and interactional enjoyment within the user experience. This study uses PLS analysis to explore the connections among all variables within the research framework. This paper has academic and practical implications.

Plan-based Ellipsis Resolution for Utterances in Noun-Phrase-Form in Restricted Domain Dialogues (제한된 영역의 대화에서 체언구 형태의 발화 이해를 위한 계획기반 생략 처리)

  • 윤철진;서정연
    • Korean Journal of Cognitive Science
    • /
    • v.11 no.1
    • /
    • pp.81-92
    • /
    • 2000
  • Elliptical fragments are common in natural language dialogues between humans. Since most elliptical fragments must be interpreted within the context, it is not easy for computers to recognize the speaker's intention from them. In this paper, we propose a model to recognize the speaker's intention from elliptical fragments in Korean by expanding the tripartite plan-based model proposed by Lambert. We add new discourse recipes to define the user's discourse actions through elliptical fragments. In order to use the plan inference process, we must represent utterances as actions, e.g., elliptical fragments are represented as surface speech acts. In the surface speech act representation, we include the information of 'Josa' (case markers in Korean), because this information plays a very important role in analysing speakers' intentions in Korean. Finally, by using an object and discourse focus theory, the system can recognize the intention that a user is trying to compare two plans by uttering elliptical fragments.

Improvement of Character-net via Detection of Conversation Participant (대화 참여자 결정을 통한 Character-net의 개선)

  • Kim, Won-Taek;Park, Seung-Bo;Jo, Geun-Sik
    • Journal of the Korea Society of Computer and Information
    • /
    • v.14 no.10
    • /
    • pp.241-249
    • /
    • 2009
  • Recently, a number of studies on video annotation and representation have been proposed to analyze video for search and abstraction. In this paper, we present a method to provide the picture elements of conversational participants in video and an enhanced representation of the characters using those elements, collectively called Character-net. Because conversational participants are decided as the characters detected during a script holding time, the previous Character-net suffers from the serious limitation that some listeners cannot be detected as participants. The participants who complete the story in a video are a very important factor in understanding the context of the conversation. The picture elements for detecting the conversational participants consist of six elements: subtitle, scene, the order of appearance, characters' eyes, patterns, and lip motion. We present how to use those elements to detect conversational participants and how to improve the representation of the Character-net. We can detect the conversational participants accurately when the proposed elements are combined and satisfy the special conditions. The experimental evaluation shows that the proposed method brings significant advantages in terms of both improving the detection of the conversational participants and enhancing the representation of Character-net.

Phonetic Realization of Aspiration of Stops in English /Cr/ and /sCr/ Clusters and their Syllable Structure at the Phonetic Level: a Comparison between Two Speaker Groups (영어의 /Cr/과 /sCr/ 자음군 내 폐쇄음의 기식성 실현과 음성 단위의 음절구조: 두 화자집단 간 비교)

  • Sohn, Hyang-Sook
    • Phonetics and Speech Sciences
    • /
    • v.6 no.3
    • /
    • pp.121-130
    • /
    • 2014
  • This study investigates the acoustic property of aspiration realized in English voiceless stops of /Cr/ and /sCr/ clusters. VOT is measured from stops in these clusters produced by two groups; one from native speakers of English and the other from Korean native speakers. Aspiration of stops in different types of clusters is compared to various phonological factors such as location of stress, syllable type, and position in word. Pursuing the idea that phonetic realization is correlated with phonological representation, attempts are made to account for the gradient nature of aspiration of stops on the basis of syllable structure at the phonetic level, which may vary in the wake of resyllabification. Voiceless stops in /Cr/ and /sCr/ clusters are further compared to results obtained in the previous study on /sC/ cluster. Variations in aspiration are also characterized in terms of segmental precedence relation of stops in the clusters, namely, post-[s], pre-[r], or both.

Speech Rhythm and the Three Aspects of Speech Timing: Articulatory, Acoustic and Auditory

  • Yun, Il-Sung
    • Speech Sciences
    • /
    • v.8 no.1
    • /
    • pp.67-76
    • /
    • 2001
  • This study is targeted at introducing the three aspects of speech timing (articulatory, acoustic and auditory) and discussing their strong and weak points in describing speech timing. Traditional (extrinsic) articulatory timing theories exclude timing representation in the speaker's articulatory plan for his utterance, while the (intrinsic) articulatory timing theories headed by Fowler incorporate time into the plan for an utterance. As compared with articulatory timing studies with crucial constraints in data collection, acoustic timing studies can deal with even several hours of speech relatively easily. This enables us to perform suprasegmental timing studies as well as segmental timing studies. On the other hand, perception of speech timing is related to psychology rather than physiology and physics. Therefore, auditory timing studies contribute to enhancing our understanding of speech timing from the psychological point of view. Traditionally, some theories of speech timing (e.g. typology of speech rhythm: stress-timing; syllable-timing or mora-timing) have been based on our perception. However, it is problematic that auditory timing can be subjective despite some validity. Many questions as to speech timing are expected to be answered more objectively. Acoustic and articulatory description of timing will be the method of solving such problems of auditory timing.

A Study of Subjective Speech Quality Measurement in VoIP (VoIP 음질의 주관적 평가에 관한 연구)

  • 강영도;강진석;최연성;김장형
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.5 no.2
    • /
    • pp.279-287
    • /
    • 2001
  • In this paper, we discuss the scale of subjective speech quality measurement over a VoIP (Voice over IP) network, which is a component of broadband networks. Objective parameters of multimedia services such as PSNR or jitter can easily be measured and defined, but these factors do not easily match the user's perceptual recognition. We suggest a speech quality measurement scale through subjective measurement of end-to-end speech quality, composed of sender-side quality, transmission quality, and receiver-side quality, which provide the degree of correctness of the representation of the speaker, the degree of impairment caused by various factors, and the degree of recognition of processed speech, respectively. We also examined the proposed method and verified its availability.

CONTINUOUS DIGIT RECOGNITION FOR A REAL-TIME VOICE DIALING SYSTEM USING DISCRETE HIDDEN MARKOV MODELS

  • Choi, S.H.;Hong, H.J.;Lee, S.W.;Kim, H.K.;Oh, K.C.;Kim, K.C.;Lee, H.S.
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • 1994.06a
    • /
    • pp.1027-1032
    • /
    • 1994
  • This paper introduces interword modeling and a Viterbi search method for continuous speech recognition. We also describe the development of a real-time voice dialing system which can recognize around one hundred words and continuous digits in speaker-independent mode. For continuous digit recognition, between-word units have been proposed to provide a more precise representation of word junctures. The best path in the HMM is found by the Viterbi search algorithm, from which digit sequences are recognized. The simulation results show that interword modeling using the context-dependent between-word units provides better recognition rates than pause modeling using the context-independent pause unit. The voice dialing system is implemented on a DSP board with a telephone interface plugged into an IBM PC AT/486.
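The best-path search mentioned in the abstract above can be sketched with a generic Viterbi decoder for a discrete-observation HMM. This is a minimal textbook version under assumed toy parameters, not the authors' system: the two states, the symbols "a" and "b", and all probabilities are hypothetical, and the paper's between-word units and digit models are not reproduced.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete-observation HMM."""
    # delta[s]: best log score of any state path ending in s at the current frame
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this frame
            prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        delta = new_delta
        backptrs.append(ptr)
    # Backtrack from the best final state to recover the full path
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path


def logs(table):
    """Convert a dict of probabilities (possibly nested) to log probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical two-state model over the discrete symbols "a" and "b".
states = ["one", "two"]
log_start = logs({"one": 0.5, "two": 0.5})
log_trans = logs({"one": {"one": 0.8, "two": 0.2},
                  "two": {"one": 0.2, "two": 0.8}})
log_emit = logs({"one": {"a": 0.9, "b": 0.1},
                 "two": {"a": 0.1, "b": 0.9}})

score, path = viterbi(["a", "a", "b", "b"], states, log_start, log_trans, log_emit)
```

Log probabilities are used so that long observation sequences do not underflow. In a recognizer, the decoded state path would then be mapped back to the word or digit models the states belong to.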

Experiences of Military Prostitute and Im/Possibility of Representation: Re-writing History from a Postcolonial Feminist Perspective (기지촌 여성의 경험과 윤리적 재현의 불/가능성: 탈식민주의 페미니스트 역사 쓰기)

  • Lee, Na-Young
    • Women's Studies Review
    • /
    • v.28 no.1
    • /
    • pp.79-120
    • /
    • 2011
  • The purpose of this paper is to illuminate the implication of feminist oral history from a postcolonial feminist perspective by critically reexamining the relationship between hearer and speaker, representer and narrator, the said and the unsaid, and secrecy and silence. Based upon the oral (life) history of a U.S. military prostitute (yanggongju), I tried to show the experiences of a historically-excluded and marginalized 'Other,' and then critically reevaluate the meaning of encountering the 'Other', not just through the research process but also in the post/colonial society in Korea. The narrative of an old woman in the "kijichon" (a formal prostitute in a U.S. military base) shows how the woman has navigated the boundaries between inevitability/coincidence, the enforced/the voluntary, prostitution/intimacy, and military prostitute/military bride while continually negotiating as well as having conflict with various myths and ideologies of the 'normative woman,' 'nationhood,' and 'normal family.' In addition, her narrative, which causes the rupture of our own stereotypical images of a military prostitute, not only proves the possibility of reconstructing the self-identity of a subaltern woman, but also redirects the research focus from the research object to the research subject (ourselves). Consequently, the implication in feminist oral history is that feminist researchers who wish to represent the experiences of others should first inquire 'what/how we can hear,' 'why we want to know others,' and 'who we are,' while simultaneously asking if the subaltern woman can speak.

Automatic Recognition of Pitch Accent Using Distributed Time-Delay Recursive Neural Network (분산 시간지연 회귀신경망을 이용한 피치 악센트 자동 인식)

  • Kim Sung-Suk
    • The Journal of the Acoustical Society of Korea
    • /
    • v.25 no.6
    • /
    • pp.277-281
    • /
    • 2006
  • This paper presents a method for the automatic recognition of pitch accents over syllables. The method that we propose is based on the time-delay recursive neural network (TDRNN), which is a neural network classifier with two different representations of dynamic context: the delayed input nodes allow the representation of an explicit trajectory F0(t) along time, while the recursive nodes provide long-term context information that reflects the characteristics of pitch accentuation in spoken English. We apply the TDRNN to pitch accent recognition in two forms: in the normal TDRNN, all of the prosodic features (pitch, energy, duration) are used as an entire set in a single TDRNN, while in the distributed TDRNN, the network consists of several TDRNNs, each taking a single prosodic feature as the input. The final output of the distributed TDRNN is the weighted sum of the outputs of the individual TDRNNs. We used the Boston Radio News Corpus (BRNC) for the experiments on speaker-independent pitch accent recognition. The experimental results show that the distributed TDRNN exhibits an average recognition accuracy of 83.64% over both pitch events and non-events.
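The combination step described in the abstract above (a weighted sum of the outputs of per-feature networks) can be sketched as follows. This is only an illustration: the output scores, the weights, and the two-class layout (accent, no accent) are assumed values, and the abstract does not say how the weights are chosen.

```python
def combine_outputs(outputs, weights):
    """Weighted sum of per-feature network outputs.

    outputs: feature name -> list of class scores from that feature's network
    weights: feature name -> scalar weight for that network
    """
    n_classes = len(next(iter(outputs.values())))
    combined = [0.0] * n_classes
    for name, scores in outputs.items():
        for k, score in enumerate(scores):
            combined[k] += weights[name] * score
    return combined


# Hypothetical scores for the classes [accent, no accent] from three
# single-feature networks, with assumed weights that sum to 1.
outputs = {
    "pitch": [0.7, 0.3],
    "energy": [0.4, 0.6],
    "duration": [0.6, 0.4],
}
weights = {"pitch": 0.5, "energy": 0.25, "duration": 0.25}

combined = combine_outputs(outputs, weights)
decision = max(range(len(combined)), key=lambda k: combined[k])
```

Here the pitch network is given the largest weight, so its vote dominates unless the other two networks agree strongly against it.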