• Title/Abstract/Keywords: Speech level

이중채널 잡음음성인식을 위한 공간정보를 이용한 통계모델 기반 음성구간 검출 (Statistical Model-Based Voice Activity Detection Using Spatial Cues for Dual-Channel Noisy Speech Recognition)

  • 신민화;박지훈;김홍국;이연우;이성로
    • 말소리와 음성과학 / Vol. 2, No. 3 / pp.141-148 / 2010
  • In this paper, a voice activity detection (VAD) method that employs spatial cues is proposed for dual-channel noisy speech recognition. In the proposed method, a probability model for speech presence/absence is constructed using spatial cues obtained from the dual-channel input signal, and speech activity intervals are detected through this probability model. In particular, the spatial cues are composed of the interaural time differences and interaural level differences of the dual-channel speech signals, and the probability model for speech presence/absence is based on a Gaussian kernel density. In order to evaluate the performance of the proposed VAD method, speech recognition is performed on speech segments that include only the speech intervals detected by the proposed VAD method. The performance of the proposed method is compared with that of several methods: an SNR-based method, a direction of arrival (DOA) based method, and a phase vector based method. The speech recognition experiments show that the proposed method outperforms the conventional methods, providing relative word error rate reductions of 11.68%, 41.92%, and 10.15% compared with the SNR-based, DOA-based, and phase vector based methods, respectively.
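
The abstract above names two spatial cues (interaural time and level differences) and a Gaussian kernel density model for speech presence/absence. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the cross-correlation ITD estimate, the isotropic kernel, the bandwidth, and the log-likelihood-ratio threshold are all assumptions made for illustration.

```python
import numpy as np

def spatial_cues(left, right, max_lag=16):
    """Return (ITD in samples, ILD in dB) for one frame of a two-channel signal."""
    eps = 1e-12
    # ILD: energy ratio between the two channels, in dB.
    ild = 10.0 * np.log10((np.sum(left**2) + eps) / (np.sum(right**2) + eps))
    n = len(left)
    lags = np.arange(-max_lag, max_lag + 1)
    # Cross-correlation sum_n left[n] * right[n + k] over plausible lags k;
    # the lag of the peak is taken as the ITD estimate (assumed estimator).
    xcorr = [
        np.sum(left[max(0, -k):n - max(0, k)] * right[max(0, k):n - max(0, -k)])
        for k in lags
    ]
    return float(lags[int(np.argmax(xcorr))]), float(ild)

def kde_log_density(points, samples, bandwidth):
    """Log of an isotropic Gaussian kernel density built on `samples`."""
    diff = points[:, None, :] - samples[None, :, :]
    neg_sq = -np.sum(diff**2, axis=-1) / (2.0 * bandwidth**2)
    dim = samples.shape[1]
    log_norm = np.log(len(samples)) + 0.5 * dim * np.log(2.0 * np.pi * bandwidth**2)
    # log-sum-exp keeps the computation stable for distant points.
    peak = np.max(neg_sq, axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.sum(np.exp(neg_sq - peak), axis=1)) - log_norm

def detect_speech(cues, speech_cues, noise_cues, bandwidth=1.0, threshold=0.0):
    """Mark a frame as speech when the speech/noise log-likelihood ratio exceeds threshold."""
    llr = (kde_log_density(cues, speech_cues, bandwidth)
           - kde_log_density(cues, noise_cues, bandwidth))
    return llr > threshold
```

Here `speech_cues` and `noise_cues` stand for hypothetical labelled training cues of shape (frames, 2); how the paper obtains or adapts its model is not stated in the abstract.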

How Different are Vowel Epentheses in Learner Speech and Loanword Phonology?

  • Park, Mi-Sun;Kim, Jong-Mi
    • 음성과학 / Vol. 15, No. 2 / pp.33-51 / 2008
  • The difference between learner speech and loanword phonology is investigated in terms of Korean learners' speech and their loanword adaptation of English words with a post-vocalic word-final stop. When we compared the speech of 12 Korean learners at the mid-intermediate level with that of eight English speakers, the learner speech did not reflect the loanword phonology of vowel insertion after a voiced word-final stop (e.g., rib[ɨ], bad[ɨ], gag[ɨ] vs. tip[=], cat[=], book[=]) but, instead, the target phonology of vowel lengthening before a voiced word-final stop (e.g., rib[r.I:b], CAD[kæ:d], bag[bæ:g] vs. rip[rI.p], cat[kæt], back[bæk]). A longitudinal study of learner speech before and after instruction showed some development toward the acquisition of target phonology. The results indicate that learner speech departs from loanword phonology and approaches target speech at a rate faster than a direct ratio. Thus, native phonology predicts loanword phonology but lends little support to learner speech. Our results also indicate that loanword phonology is constant, while learner speech changes toward the acquisition of target phonology.

일반 영유아의 초기 발성 발달 연구 (Vocal Development of Typically Developing Infants)

  • 하승희;설아영;배소영
    • 말소리와 음성과학 / Vol. 6, No. 4 / pp.161-169 / 2014
  • This study investigated changes in the prelinguistic vocal production of typically developing infants aged 5-20 months based on the Stark Assessment of Early Vocal Development-Revised (SAEVD-R). Fifty-eight typically developing infants participated in the study and were divided into four age groups: 5-8 months, 9-12 months, 13-16 months, and 17-20 months. Vocalization samples were collected from the infants' play activities and were classified into 5 levels and 23 types using the SAEVD-R. The results revealed that the four age groups differed significantly in the production proportions of the vocalization levels. The proportions of Level 1, 2, 4, and 5 vocalizations differed significantly across the four age groups. Level 3 was predominantly produced in every age group, so its proportion did not differ significantly across the four age groups. In particular, vowels in Level 3 vocalizations were predominantly produced at all ages over a long period. Also, significant increases in the proportions of Levels 4 and 5 occurred after 9 months, which suggests that the production of canonical syllables is a key indicator of advancement in prelinguistic vocal development. The results have clinical implications for early identification and speech-language intervention for young children with or at risk of speech delays.

음성인식을 위한 잡음하의 음성왜곡제거 (The suppression of noise-induced speech distortions for speech recognition)

  • 지상문;오영환
    • 전자공학회논문지S / Vol. 35S, No. 12 / pp.93-102 / 1998
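  • This paper describes a method that improves the performance of a speech recognizer by removing speech distortions caused by noise. In noisy environments, speakers change the way they produce speech (the Lombard effect) and noise is added to the speech signal, both of which degrade recognizer performance. Because the Lombard effect is a nonlinear transformation that depends on the level and type of the ambient noise, the characteristics of the speaker, the phoneme, and other factors, no method for measuring it has been known. In this study, we present a method for measuring the magnitude of the Lombard effect and propose a compensation method that depends on that magnitude. Noise-induced speech distortion is removed through the following steps. First, spectral subtraction is used to remove the noise contained in the speech, and band-pass filtering is applied to emphasize the dynamic characteristics of speech. Second, an energy normalization step removes the variation in vocal intensity caused by the Lombard effect. Finally, the proposed measure of the magnitude of the Lombard effect is used in a transformation that removes the distortion present in the cepstrum of Lombard speech. When the proposed method was applied to speech recognition, recognition rates of 46.3%, 75.5%, and 87.4% at SNRs (signal-to-noise ratios) of 0, 10, and 20 dB improved to 82.6%, 95.7%, and 97.6%, respectively.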
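
The first two steps of the procedure described in this abstract (spectral subtraction and energy normalization) can be sketched as below. This is an illustration under assumptions, not the paper's method: the abstract does not give the subtraction variant, so a plain magnitude subtraction with a spectral floor is assumed, and peak normalization of log energy is assumed for the normalization step. The function names and the `floor` parameter are hypothetical.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from every frame.

    noisy_mag: array of shape (frames, bins) with magnitude spectra.
    noise_mag: array of shape (bins,) with the noise estimate.
    floor: assumed spectral floor, as a fraction of the noisy magnitude,
           that keeps the result from going negative.
    """
    noisy_mag = np.asarray(noisy_mag, dtype=float)
    noise_mag = np.asarray(noise_mag, dtype=float)
    cleaned = noisy_mag - noise_mag[None, :]
    return np.maximum(cleaned, floor * noisy_mag)

def normalize_energy(log_energy):
    """Shift per-frame log energies so the utterance peak sits at 0.

    This removes an overall vocal-intensity offset (assumed normalization).
    """
    log_energy = np.asarray(log_energy, dtype=float)
    return log_energy - np.max(log_energy)
```

In practice the noise estimate would come from frames judged to contain no speech; that choice, like the band-pass filtering and cepstral compensation steps, is outside this sketch.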

조음중증도에 따른 인공와우이식 아동들의 말명료도와 이해가능도의 상관연구 (The Relationship Between Speech Intelligibility and Comprehensibility for Children with Cochlear Implants)

  • 허현숙;하승희
    • 말소리와 음성과학 / Vol. 2, No. 3 / pp.171-178 / 2010
  • This study examined the relationship between speech intelligibility and comprehensibility for hearing-impaired children with cochlear implants. Speech intelligibility was measured by orthographic transcription of the acoustic signal at the word and sentence levels. Comprehensibility was evaluated by examining listeners' ability to answer questions about the contents of a narrative. Speech samples were collected from 12 speakers (aged 6-15 years) with cochlear implants. For each speaker, 4 different listeners (48 listeners in total) completed 2 tasks: one task involved making orthographic transcriptions and the other involved answering comprehension questions. The results of the study were as follows: (1) Speech intelligibility and comprehensibility scores tended to increase as severity decreased. (2) Across all speakers, the relationship between speech intelligibility and comprehensibility scores was significant when severity was not considered. However, within severity groups, there was a significant relationship between comprehensibility and speech intelligibility only for the moderate-severe group. These results suggest that speech intelligibility scores measured by orthographic transcription may not accurately reflect how well listeners comprehend the speech of children with cochlear implants and that, therefore, measures of both speech intelligibility and listener comprehension should be considered in evaluating speech ability and information-bearing capability in speakers with cochlear implants.

말소리 변조 스크립트를 이용한 호감도 청취평가 특징 (Characteristics of the auditory evaluation of good impression using speech manipulation scripts)

  • 권순복
    • 말소리와 음성과학 / Vol. 8, No. 4 / pp.131-138 / 2016
  • This study analyzes the characteristics of good impression using speech manipulation scripts and investigates the characteristics of a preferred speaking voice. Forty male and female college students participated in this study. They had been exposed to the Gyeongsang dialect spoken by their friends and family for more than 15 years. Two sample voices (1 male and 1 female), considered to give a good impression, were subjected to voice analysis. Two students were asked to read the sample paragraph 'Walking', and their voice samples were analyzed with Praat. The collected speech data were manipulated into 4 different sets by changing pitch level, degree of loudness, and speech rate. First, both men and women received a better impression from the pitch-lowered sound than from the original one. Second, men tended to receive a better impression from a slightly louder voice than from the natural-pitched one. Third, men often felt more drawn to a voice at a slightly faster speech rate than at the original speech rate. Overall, both male and female listeners favored a lower pitch over the original pitch. Men tended to prefer a louder voice, while women preferred a less loud one. Men received a better impression at a lower speech rate, but women at a faster speech rate.

An Encrypted Speech Retrieval Scheme Based on Long Short-Term Memory Neural Network and Deep Hashing

  • Zhang, Qiu-yu;Li, Yu-zhou;Hu, Ying-jie
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 14, No. 6 / pp.2612-2633 / 2020
  • Due to the explosive growth of multimedia speech data, protecting the privacy of speech data and retrieving speech data efficiently have become active research topics in recent years. In this paper, we propose an encrypted speech retrieval scheme based on a long short-term memory (LSTM) neural network and deep hashing. This scheme not only achieves efficient retrieval of massive speech data in a cloud environment, but also effectively avoids the risk of sensitive information leakage. Firstly, a novel speech encryption algorithm based on a 4D quadratic autonomous hyperchaotic system is proposed to ensure the privacy and security of speech data in the cloud. Secondly, an integrated LSTM network model and deep hashing algorithm are used to extract high-level features of the speech data. This addresses the high dimensionality and temporality of speech data and increases the retrieval efficiency and retrieval accuracy of the proposed scheme. Finally, the normalized Hamming distance algorithm is used for matching. Compared with existing algorithms, the proposed scheme has good discrimination and robustness, and it has high recall, precision, and retrieval efficiency under various content-preserving operations. Meanwhile, the proposed speech encryption algorithm has a large key space and can effectively resist exhaustive attacks.
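
The matching step mentioned in this abstract, normalized Hamming distance between binary hash codes, can be illustrated with a short sketch. The retrieval helper, its name, and the `threshold` value are assumptions for illustration; the abstract does not state how candidates are ranked or what threshold the scheme uses.

```python
import numpy as np

def normalized_hamming(hash_a, hash_b):
    """Fraction of bit positions at which two equal-length binary hashes differ."""
    a = np.asarray(hash_a, dtype=bool)
    b = np.asarray(hash_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("hash codes must have the same length")
    return float(np.count_nonzero(a != b)) / a.size

def retrieve(query_hash, database_hashes, threshold=0.3):
    """Return indices of database hashes within `threshold`, nearest first.

    `threshold` is a hypothetical matching threshold, not a value from the paper.
    """
    distances = [normalized_hamming(query_hash, h) for h in database_hashes]
    matches = [i for i, d in enumerate(distances) if d <= threshold]
    return sorted(matches, key=lambda i: distances[i])
```

A distance of 0 means identical codes and 1 means every bit differs, so the threshold trades recall against precision.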

Syllable-timing Interferes with Korean Learners' Speech of Stress-timed English

  • Lee, Ok-Hwa;Kim, Jong-Mi
    • 음성과학 / Vol. 12, No. 4 / pp.95-112 / 2005
  • We investigate Korean learners' speech-timing of English before and after instruction in comparison with native speech, in an attempt to resolve disagreements in the literature as to whether speech-timing is measurable (Lehiste, 1977; Roach, 1982; Dauer, 1983 vs. Low et al., 2000; Yun, 2002; Jian, 2004). We measured the pair-wise variability between adjacent stressed and unstressed syllables within a foot, as well as that among adjacent feet, in approximately 555 English sentences read by 29 native speakers and 41 Korean learners at the intermediate proficiency level. The results show that, in comparison with native American English, Korean learner speech before instruction has significantly (p<.001) smaller pair-wise variability between adjacent stressed and unstressed syllables within a foot and significantly (p=.01) bigger variability among adjacent feet within the utterance. The learner speech after instruction showed significant (p=.01) improvement in the pair-wise variability of the syllable sequence toward native speech values. The variability among adjacent feet was progressively smaller for learner speech before instruction, learner speech after instruction, and native speech (p=.03). We thus conclude that the speech timing difference between Korean English and American English is measurable in terms of the duration of stressed and unstressed syllables, and that the latter is stress-timed while the former shows interference from syllable-timing.
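
The pair-wise variability measure referred to in this abstract can be illustrated with the commonly used normalized pairwise variability index (nPVI). The abstract does not give the exact formula the authors used, so the normalized form below is an assumption, and the function name is hypothetical.

```python
def normalized_pvi(durations):
    """Normalized pairwise variability index over successive durations.

    For each adjacent pair, the absolute duration difference is divided by
    the pair's mean duration; the result is the mean of those ratios times 100.
    """
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    ratios = [
        abs(d1 - d2) / ((d1 + d2) / 2.0)
        for d1, d2 in zip(durations[:-1], durations[1:])
    ]
    return 100.0 * sum(ratios) / len(ratios)
```

Equal successive durations give an index of 0, while a strict long-short alternation gives a larger value, which is why a smaller index in learner speech is read as more syllable-timed rhythm.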

A Novel Two-Level Pitch Detection Approach for Speaker Tracking in Robot Control

  • Hejazi, Mahmoud R.;Oh, Han;Kim, Hong-Kook;Ho, Yo-Sung
    • 제어로봇시스템학회:학술대회논문집 / 제어로봇시스템학회 ICCAS 2005 / pp.89-92 / 2005
  • Using natural speech commands to control a human-robot is an interesting topic in the field of robotics. In this paper, our main focus is on verifying the speaker who gives a command, in order to decide whether he/she is authorized to issue commands. Among the possible dynamic features of natural speech, pitch period is one of the most important for characterizing speech signals, and it usually differs from person to person. However, current pitch detection techniques have not yet reached the desired level of accuracy and robustness. When the signal is noisy or there are multiple pitch streams, the performance of most techniques degrades. In this paper, we propose a two-level approach to pitch detection which, compared with standard pitch detection algorithms, not only increases accuracy but also makes the performance more robust to noise. In the first level of the proposed approach, we discriminate voiced from unvoiced signals using a neural classifier that takes cepstrum sequences of speech as its input feature set. Voiced signals are then further processed in the second level using a modified standard AMDF-based pitch detection algorithm to determine their pitch periods precisely. The experimental results show that the accuracy of the proposed system is better than that of conventional pitch detection algorithms for speech signals in clean and noisy environments.
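
The second level described in this abstract builds on the average magnitude difference function (AMDF). The sketch below shows the plain, unmodified AMDF pitch estimate for a single voiced frame; the authors' modification and their neural voiced/unvoiced classifier are not described in the abstract and are not reproduced here. The function name and the pitch search range are assumptions.

```python
import numpy as np

def amdf_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the pitch (Hz) of a voiced frame with the standard AMDF.

    The AMDF at lag k is the mean of |x[n] - x[n + k]|; for a periodic
    signal it dips toward zero when k equals the pitch period.
    f_min and f_max bound the search range (assumed values).
    """
    frame = np.asarray(frame, dtype=float)
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    if len(frame) <= max_lag:
        raise ValueError("frame is too short for the requested pitch range")
    lags = np.arange(min_lag, max_lag + 1)
    amdf = np.array([np.mean(np.abs(frame[:-lag] - frame[lag:])) for lag in lags])
    best_lag = lags[int(np.argmin(amdf))]
    return sample_rate / best_lag
```

Restricting the lag range is a simple guard against picking a multiple of the true period; noisy signals generally need more than this, which is the gap the two-level approach targets.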

자동 음성 분할을 위한 음향 모델링 및 에너지 기반 후처리 (Acoustic Modeling and Energy-Based Postprocessing for Automatic Speech Segmentation)

  • 박혜영;김형순
    • 대한음성학회지:말소리 / No. 43 / pp.137-150 / 2002
  • Speech segmentation at the phoneme level is important for corpus-based text-to-speech synthesis. In this paper, we examine acoustic modeling methods to improve the performance of an automatic speech segmentation system based on hidden Markov models (HMMs). We compare monophone and triphone models and evaluate several model training approaches. In addition, we employ an energy-based postprocessing scheme to correct frequent boundary location errors between silence and speech sounds. Experimental results show that our system places 71.3% and 84.2% of boundaries correctly given tolerances of 10 ms and 20 ms, respectively.
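
The evaluation figures in this abstract are percentages of boundaries that fall within a time tolerance of the reference. A minimal sketch of such a metric follows, assuming predicted and reference boundaries are already paired one-to-one; the function name and that pairing are assumptions, since the abstract does not describe the scoring procedure.

```python
def boundary_accuracy(predicted_ms, reference_ms, tolerance_ms):
    """Percentage of predicted boundaries within tolerance of their reference.

    Assumes both lists hold boundary times in milliseconds, in the same order,
    with one predicted boundary per reference boundary.
    """
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("boundary lists must have the same length")
    if not reference_ms:
        raise ValueError("need at least one boundary")
    hits = sum(
        1 for p, r in zip(predicted_ms, reference_ms) if abs(p - r) <= tolerance_ms
    )
    return 100.0 * hits / len(reference_ms)
```

Reporting the same system at two tolerances, as the abstract does, shows how many of the misses at 10 ms are near misses.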
