• Title/Summary/Keyword: Voice Script

Lip and Voice Synchronization Using Visual Attention (시각적 어텐션을 활용한 입술과 목소리의 동기화 연구)

  • Dongryun Yoon;Hyeonjoong Cho
    • The Transactions of the Korea Information Processing Society / v.13 no.4 / pp.166-173 / 2024
  • This study explores lip-sync detection, focusing on the synchronization between lip movements and voices in videos. Typically, lip-sync detection techniques crop the facial area of a given video and use the lower half of the cropped box as input to a visual encoder that extracts visual features. To place greater emphasis on the articulatory region of the lips and thereby improve lip-sync detection accuracy, we propose using a pre-trained visual attention-based encoder. The Visual Transformer Pooling (VTP) module, originally designed for the lip-reading task of predicting the script from visual information alone, without audio, is employed as the visual encoder. Our experimental results demonstrate that, despite having fewer learnable parameters, the proposed method outperforms the latest model, VocaLiST, on the LRS2 dataset, achieving a lip-sync detection accuracy of 94.5% with five context frames. Moreover, our approach outperforms VocaLiST by approximately 8% in lip-sync detection accuracy even on Acappella, a dataset not used for training.

An Acoustical Analysis of English Stops at the Initial and After-initial-/s/ Positions by Korean and American Speakers (한국인과 미국인의 초성 및 초성 /s/ 다음에 오는 영어 파열음 음향 분석)

  • Yang, Byunggon
    • Phonetics and Speech Sciences / v.5 no.3 / pp.11-20 / 2013
  • The purpose of this study is to compare the acoustic parameters of English stop consonants at the initial and after-initial-/s/ positions in a message produced by 47 Korean and American speakers, in order to help Korean learners improve their pronunciation of English stops. A Praat script was developed to obtain voice onset time (VOT), maximum consonant intensity (maxCi), and rate of rise (ROR) from six target words with stops at these positions in the message. Results show that VOT and maxCi differed significantly between the two language groups, whereas ROR did not. The Korean speakers generally produced the stop consonants with longer VOTs and higher consonant intensity. A comparison of the consonant groups at the two positions showed that the Korean participants did not distinguish them as clearly as the American participants did at the after-initial-/s/ position. Finally, a comparison of each language and sex group revealed that the major difference was attributable to stop consonants in the after-/s/ position. The author concluded that Korean speakers should be careful not to produce all the stops with longer VOTs and higher intensity. Further studies would be desirable to examine how Americans evaluate Korean speakers' English proficiency when the acoustic values of English stops are modified.

Acoustic Analysis and Auditory-Perceptual Assessment for Diagnosis of Functional Dysphonia (기능성 음성장애의 진단을 위한 음향학적, 청지각적 평가)

  • Kim, Geun-Hyo;Lee, Yeon-Yoo;Bae, In-Ho;Lee, Jae-Seok;Lee, Chang-Yoon;Park, Hee-June;Lee, Byung-Joo;Kwon, Soon-Bok
    • Journal of Clinical Otolaryngology Head and Neck Surgery / v.29 no.2 / pp.212-222 / 2018
  • Background and Objectives: The purpose of this study was to compare the measured values of acoustic and auditory-perceptual assessments between normal and functional dysphonia (FD) groups. Materials and Methods: A total of 102 subjects with FD and 59 subjects with normal voices participated in this study. The mid-vowel portion of the sustained vowel /a/ and two sentences of 'Sanchaek' were edited, concatenated, and analyzed with a Praat script. Auditory-perceptual (AP) rating was then completed by three listeners. Results: The FD group showed higher values than the normal group for the Acoustic Voice Quality Index version 2.02 and version 3.01 (AVQIv2 and AVQIv3), slope, Hammarberg index (HAM), grade (G), and overall severity (OS). Additionally, smoothed cepstral peak prominence in Praat (PraatCPPS), tilt, the ratio of low-to-high spectral band energies (L/H ratio), and long-term average spectrum (LTAS) were lower in the FD group than in the normal voice group. The correlations among the measured values ranged from -0.250 to 0.960. In ROC curve analysis, the cutoff values of AVQIv2, AVQIv3, PraatCPPS, slope, tilt, L/H ratio, HAM, and LTAS were 3.270, 2.013, 13.838, -22.286, -9.754, 369.043, 27.912, and 34.523, respectively. The AUC was over 0.890 for AVQIv2, AVQIv3, and PraatCPPS; over 0.731 for HAM, tilt, and slope; and over 0.605 for LTAS and L/H ratio. Conclusions: AVQI and CPPS showed the highest predictive power for distinguishing between the normal and FD groups. Acoustic analyses and AP rating, as noninvasive examinations, can reinforce the screening capability for FD and help to establish an efficient diagnosis and treatment plan for FD.

Automatic Speech Style Recognition Through Sentence Sequencing for Speaker Recognition in Bilateral Dialogue Situations (양자 간 대화 상황에서의 화자인식을 위한 문장 시퀀싱 방법을 통한 자동 말투 인식)

  • Kang, Garam;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.27 no.2 / pp.17-32 / 2021
  • Speaker recognition is generally divided into speaker identification and speaker verification. Speaker recognition plays an important role in automatic voice systems, and its importance is growing as portable devices, voice technology, and audio content continue to expand. Previous speaker recognition studies have aimed to determine automatically who the speaker is from voice files and to improve accuracy. Speech style is an important sociolinguistic subject: it carries useful information about the speaker's attitude, conversational intention, and personality, and this can be an important clue for speaker recognition. The sentence-final ending used in a speaker's utterance determines the type of sentence and conveys information such as the speaker's intention, psychological attitude, or relationship to the listener. The probability of using a given sentence-final ending varies with the characteristics of the speaker, so the type and distribution of the sentence-final endings of an unidentified speaker are helpful in recognizing that speaker. However, few studies of existing text-based speaker recognition have considered speech style, and adding speech style information to speech signal-based speaker recognition techniques could further improve accuracy. Hence, the purpose of this paper is to propose a novel method that uses speech style, expressed as sentence-final endings, to improve the accuracy of Korean speaker recognition. To this end, we propose a method called sentence sequencing, which generates vector values from the type and frequency of the sentence-final endings appearing in the utterances of a specific person. To evaluate the performance of the proposed method, training and performance evaluation were conducted with an actual drama script. The method proposed in this study can be used as a means to improve the performance of Korean speech recognition services.
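
The idea of turning the type and frequency of sentence-final endings into a vector can be illustrated with a minimal sketch. This is not the authors' implementation: the romanized ending inventory `ENDINGS`, the suffix-matching rule, and the function names are all hypothetical, and a real system for Korean would likely rely on morphological analysis rather than plain string suffixes.

```python
from collections import Counter

# Hypothetical inventory of romanized sentence-final endings (an assumption,
# not taken from the paper).
ENDINGS = ["seumnida", "eoyo", "ni", "ja", "da"]


def final_ending(sentence):
    """Return the longest inventory ending that the sentence ends with, or None."""
    stripped = sentence.rstrip(" .?!")
    # Try longer endings first so that "seumnida" is not mistaken for "da".
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if stripped.endswith(ending):
            return ending
    return None


def ending_vector(sentences):
    """Map one speaker's sentences to relative frequencies over ENDINGS."""
    counts = Counter(e for e in map(final_ending, sentences) if e is not None)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ENDINGS)
    return [counts[e] / total for e in ENDINGS]


speaker_a = ["gapseumnida.", "meogeoyo", "meogeoyo?", "ganda"]
vector_a = ending_vector(speaker_a)
```

Vectors built this way for different speakers could then be compared or passed to a classifier; how the paper itself uses them is not specified in the abstract.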

A study on the voiceless plosives from the English and Korean spontaneous speech corpus (영어와 한국어 자연발화 음성 코퍼스에서의 무성 파열음 연구)

  • Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.11 no.4 / pp.45-53 / 2019
  • The purpose of this work was to examine the factors affecting the identities of the voiceless plosives, i.e., English [p, t, k] and Korean [pʰ, tʰ, kʰ], in spontaneous speech corpora. The factors were automatically extracted by a Praat script, and the percent correctness of the discriminant analyses was assessed incrementally by increasing the number of factors used to predict the identities of the plosives. The factors included the spectral moments and tilts of the plosive release bursts, the post-burst aspirations, and the vowel onsets; durations such as the closure durations and the voice onset times (VOTs); the locations within words and utterances; and the identities of the following vowels. The results showed that as the number of factors increased up to five, so did the percent correctness of the analyses, reaching 74.6% for English and 66.4% for Korean. However, the optimal number of factors for the maximum percent correctness was four, i.e., the spectral moments and tilts of the release bursts and the following vowels, the closure durations, and the VOTs. This suggests that the identities of the voiceless plosives are mostly determined by their internal and vowel onset cues.

Construction of Cham Identity in Cambodia

  • Maunati, Yekti;Sari, Betti Rosita
    • SUVANNABHUMI / v.6 no.1 / pp.107-135 / 2014
  • Cham identities, which are socially constructed and multilayered, display their markers in a variety of elements, including homeland attachment to the former Kingdom of Champa, religion, language, and cultural traditions, to mention a few. However, unlike other contemporary diasporic experiences that bind the homeland and the host country, the Cham diaspora in Cambodia has a unique pattern, as it seems to have no voice in the political and economic spheres in Vietnam, its homeland. The relations between the Cham in Cambodia and Vietnam seem to be limited to cultural heritage such as Cham musical traditions, traditional clothing, and architectural heritage. Many Cham people have established networks outside Cambodia with areas of the Muslim world, such as Malaysia, Indonesia, southern Thailand, and the Middle Eastern countries. Pursuing education or training in Islam, as well as working in those countries, especially Malaysia, has become a way for the Cham to widen their networks and increase their knowledge, particularly of Islam. Returning to Cambodia, these people become religious teachers or ustadz (Islamic teachers in the pondok [Islamic boarding school]). This has developed slowly, side by side with the formation of their identity as Cham Muslims. Among certain Cham, the absent ancient cultural heritage has been replaced as an identity marker by Islamic culture, which has become the important element of identity. However, being Cham is not a single identity; it is fluid and contested. Many scholars argue that the Cham in Cambodia constitute three groups: the Cham Chvea, Cham, and Cham Bani (Cham Jahed). The so-called Cham Jahed have a unique practice of Islam. Unlike other Cham, who pray five times a day, Cham Jahed people pray once a week, on Fridays. They also have a different ritual for the wedding ceremony, which they regard as the authentic tradition of the Cham. Indeed, they consider themselves pure descendants of the Cham in Vietnam, retaining Cham traditions and tending to maintain their relationship with their fellow Cham in Central Vietnam. In terms of language, another marker of identity, the Cham and the Cham Jahed share the same language, but the Cham Jahed preserve the written Cham script more often than the Cham. In addition, the Cham Jahed teach the language intensively to the young generation. This paper, based on fieldwork in Cambodia in 2010 and 2011, focuses on the process of the formation of Cham identity, especially among those called Cham and Cham Jahed.

Interface Application of a Virtual Assistant Agent in an Immersive Virtual Environment (몰입형 가상환경에서 가상 보조 에이전트의 인터페이스 응용)

  • Giri Na;Jinmo Kim
    • Journal of the Korea Computer Graphics Society / v.30 no.1 / pp.1-10 / 2024
  • In immersive virtual environments, including mixed reality (MR) and virtual reality (VR), avatars and agents, which are virtual humans, are being studied and applied in various ways as factors that increase users' social presence. Recently, studies have applied generative AI as an agent to improve user learning effects or to provide a collaborative environment in an immersive virtual environment. This study proposes a novel method for the interface application of a virtual assistant agent (VAA) using OpenAI's ChatGPT in an immersive virtual environment including VR and MR. The proposed method consists of an information agent that responds to user queries and a control agent that controls virtual objects and environments according to user needs. We set up a development environment that integrates the Unity 3D engine, OpenAI, and packages and development tools for user participation in MR and VR. Additionally, we set up a workflow that leads from voice input to the creation of a question query and then an answer query, or of a control request query and then a control script. Based on this, MR and VR experience environments were produced, and experiments to confirm the performance of the VAA measured the response time of the information agent and the accuracy of the control agent. The results confirmed that the interface application of the proposed VAA can increase efficiency in simple and repetitive tasks while offering user-friendly features. We present a novel direction for the interface application of an immersive virtual environment through the proposed VAA and clarify the problems and limitations identified so far.