
An acoustic Doppler-based silent speech interface technology using generative adversarial networks

  • Lee, Ki-Seung (Department of Electronic Engineering, Konkuk University)
  • Received : 2021.01.18
  • Accepted : 2021.03.02
  • Published : 2021.03.31

Abstract

In this paper, a Silent Speech Interface (SSI) technology is proposed in which the Doppler frequency shifts of a reflected 40 kHz ultrasonic signal, incident on the speaker's mouth region, are used to synthesize speech signals. In an SSI, mapping rules from features derived from non-speech signals to the parameters of audible speech are constructed, and speech signals are then synthesized from the non-speech signals using these rules. Conventional SSI methods build the mapping rules by minimizing the overall error between the estimated and true speech parameters. In the present study, the mapping rules were instead constructed using Generative Adversarial Networks (GANs), so that the distribution of the estimated parameters resembles that of the true parameters. Experiments on 60 Korean words showed that, both objectively and subjectively, the proposed method outperformed conventional neural-network-based methods.
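The sensing principle summarized above can be sketched with the standard two-way Doppler relation. The 40 kHz carrier frequency is taken from the abstract; the articulator velocity and the speed of sound used below are illustrative assumptions, not values from the paper.

```python
def doppler_shift(velocity_mps, carrier_hz=40_000.0, sound_speed_mps=343.0):
    """Two-way Doppler shift of a carrier reflected off a moving surface.

    For a stationary transmitter/receiver and a reflector moving toward the
    sensor at velocity_mps, the reflected carrier is shifted by approximately
    2 * v * f0 / c (valid for v much smaller than c).
    """
    return 2.0 * velocity_mps * carrier_hz / sound_speed_mps

# Assuming lip/jaw surfaces move on the order of 0.1 m/s during speech
# (an illustrative figure), the 40 kHz carrier shifts by a few tens of hertz:
shift = doppler_shift(0.1)  # ≈ 23.3 Hz
```

Shifts of this magnitude are what the mapping rules must relate to the corresponding speech parameters; motion away from the sensor simply yields a negative shift.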
