Feature Selection-based Voice Transformation

단위 선택 기반의 음성 변환 (Unit Selection-based Voice Transformation)

  • Lee, Ki-Seung (Department of Electronic Engineering, Konkuk University)
  • Received : 2011.11.29
  • Accepted : 2011.12.23
  • Published : 2012.01.31

Abstract

A voice transformation (VT) method that makes the utterance of a source speaker mimic that of a target speaker is described. Speaker individuality transformation is achieved by altering three feature parameters: the LPC cepstrum, pitch period, and gain. The main objective of this study is to construct an optimal sequence of features selected from a target speaker's database, maximizing both the correlation probabilities between the transformed and the source features and the likelihood of the transformed features with respect to the target model. A set of two-pass conversion rules is proposed: the feature parameters are first selected from the database, and the optimal sequence of the feature parameters is then constructed in the second pass. The conversion rules were developed using a statistical approach that employs a maximum likelihood criterion. In constructing the optimal sequence of features, a hidden Markov model (HMM) was employed to find the most likely combination of features with respect to the target speaker's model. The effectiveness of the proposed transformation method was evaluated using objective tests and informal listening tests. We confirmed that the proposed method leads to perceptually more preferred results than conventional methods.
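The two-pass scheme in the abstract can be sketched in code. This is not the authors' implementation; it is a minimal illustration in which pass 1 shortlists candidate target-speaker feature vectors for each source frame, and pass 2 runs a Viterbi-style search that trades off likelihood under a target model against a continuity (concatenation) cost between consecutive selections. The k-nearest-neighbor candidate step, the diagonal-Gaussian target model, and all names are illustrative assumptions.

```python
import numpy as np

def select_candidates(source_frames, target_db, k=5):
    """Pass 1 (assumed form): for each source frame, keep the k
    nearest feature vectors from the target speaker's database."""
    cands = []
    for f in source_frames:
        d = np.linalg.norm(target_db - f, axis=1)  # distance to every DB entry
        cands.append(target_db[np.argsort(d)[:k]])
    return cands

def viterbi_sequence(cands, mean, inv_var, w=1.0):
    """Pass 2: pick one candidate per frame maximizing the sum of
    log-likelihoods under a diagonal-Gaussian target model minus
    w times the squared jump between consecutive selections."""
    def loglik(x):
        return -0.5 * np.sum((x - mean) ** 2 * inv_var)

    T = len(cands)
    score = np.array([loglik(c) for c in cands[0]])  # scores at frame 0
    back = []                                        # backpointers per frame
    for t in range(1, T):
        cur = np.empty(len(cands[t]))
        ptr = np.empty(len(cands[t]), dtype=int)
        for j, c in enumerate(cands[t]):
            jump = np.sum((cands[t - 1] - c) ** 2, axis=1)  # concat cost
            tot = score - w * jump
            ptr[j] = int(np.argmax(tot))             # best predecessor
            cur[j] = tot[ptr[j]] + loglik(c)
        score, back = cur, back + [ptr]

    # Backtrack from the best final candidate to recover the sequence.
    j = int(np.argmax(score))
    path = [j]
    for ptr in reversed(back):
        j = int(ptr[j])
        path.append(j)
    path.reverse()
    return np.array([cands[t][path[t]] for t in range(T)])
```

In the paper the "features" are LPC cepstrum, pitch period, and gain, and the target model is an HMM rather than a single Gaussian; the sketch only shows the shape of the search, where increasing `w` favors smoother transitions at the expense of per-frame likelihood.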
