Multi-band Approach to Deep Learning-Based Artificial Stereo Extension

  • Jeon, Kwang Myung (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology)
  • Park, Su Yeon (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology)
  • Chun, Chan Jun (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology)
  • Park, Nam In (Digital Technology and Biometry Division, National Forensic Service)
  • Kim, Hong Kook (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology)
  • Received : 2016.10.28
  • Accepted : 2017.03.28
  • Published : 2017.06.01

Abstract

This paper proposes an artificial stereo extension method that creates stereophonic sound from a mono sound source. The proposed method first trains deep neural networks (DNNs) that model the nonlinear relationship between the dominant and residual signals of the stereo channel. In the training stage, the band-wise log spectral magnitude and unwrapped phase of both the dominant and residual signals are used to model the nonlinearities of each sub-band with a deep architecture. Stereo extension is then performed in the sub-band domain by using the trained DNN model to estimate the residual signal that corresponds to the input mono channel signal. The performance of the proposed method was evaluated using a log spectral distortion (LSD) measure and a multiple stimuli with a hidden reference and anchor (MUSHRA) test. The results showed that the proposed method provided a lower LSD and a higher MUSHRA score than conventional methods that use hidden Markov models or a DNN with full-band processing.
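As a rough illustration of two quantities named in the abstract, the following Python sketch computes band-wise log spectral magnitude and unwrapped phase for one frame, together with a common form of the log spectral distortion. It is not the authors' implementation. The Hann window, the equal-width split into `n_bands` sub-bands, the function names, and the per-frame RMS definition of LSD in dB are all assumptions made for this example, since the abstract does not specify them.

```python
import numpy as np


def band_features(frame, n_bands=4, eps=1e-10):
    """Return a list of (log_magnitude, unwrapped_phase) pairs, one per sub-band.

    Assumption: equal-width sub-bands over the one-sided FFT bins.
    """
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log10(np.abs(spectrum) + eps)   # log spectral magnitude
    phase = np.unwrap(np.angle(spectrum))        # unwrapped along frequency
    # array_split tolerates bin counts that are not divisible by n_bands
    return list(zip(np.array_split(log_mag, n_bands),
                    np.array_split(phase, n_bands)))


def log_spectral_distortion(ref_frames, est_frames, eps=1e-10):
    """Mean over frames of the RMS difference between log power spectra (dB).

    Assumption: both inputs are arrays of shape (n_frames, frame_length).
    """
    ref_db = 10.0 * np.log10(np.abs(np.fft.rfft(ref_frames, axis=-1)) ** 2 + eps)
    est_db = 10.0 * np.log10(np.abs(np.fft.rfft(est_frames, axis=-1)) ** 2 + eps)
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=-1))
    return float(np.mean(per_frame))
```

Under these assumptions, a lower LSD means the estimated spectrum is closer to the reference, which is the sense in which the abstract reports a lower LSD for the proposed method.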
