Acknowledgement
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2019R1F1A106299513).
References
- J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Trans. on Acoustics, Speech, and Signal Process. 26, 197-210 (1978). https://doi.org/10.1109/TASSP.1978.1163086
- S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. on Acoustics, Speech and Signal Process. 27, 113-120 (1979). https://doi.org/10.1109/TASSP.1979.1163209
- Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. on Acoustics, Speech and Signal Process. 32, 1109-1121 (1984). https://doi.org/10.1109/TASSP.1984.1164453
- R. Martin, "Spectral subtraction based on minimum statistics," Proc. EUSIPCO. 1182-1185 (1994).
- P. J. Moreno, B. Raj, and R. M. Stern, "Data-driven environmental compensation for speech recognition: a unified approach," Speech Communication, 24, 267-285 (1998). https://doi.org/10.1016/S0167-6393(98)00025-9
- W. Kim and J. H. L. Hansen, "Feature compensation in the cepstral domain employing model combination," Speech Communication, 51, 83-96 (2009). https://doi.org/10.1016/j.specom.2008.06.004
- J. Du, L.-R. Dai, and Q. Huo, "Synthesized stereo mapping via deep neural networks for noisy speech recognition," Proc. ICASSP. 1764-1768 (2014).
- K. Han, Y. He, D. Bagchi, E. Fosler-Lussier, and D. L. Wang, "Deep neural network based spectral feature mapping for robust speech recognition," Proc. Interspeech. 2484-2488 (2015).
- K. Tan and D. L. Wang, "A convolutional recurrent neural network for real-time speech enhancement," Proc. Interspeech. 3229-3233 (2018).
- Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 23, 7-19 (2015).
- Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, "Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking," arXiv preprint arXiv:1810.04826 (2018).
- X. Hao, X. Su, S. Wen, Z. Wang, Y. Pan, F. Bao, and W. Chen, "Masking and inpainting: A two-stage speech enhancement approach for low SNR and non-stationary noise," Proc. ICASSP. 6959-6963 (2020).
- Z. Xu, S. Elshamy, and T. Fingscheidt, "Using separate losses for speech and noise in mask-based speech enhancement," Proc. ICASSP. 7519-7523 (2020).
- F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," Proc. IEEE Conference on CVPR. 815-823 (2015).
- J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Tech. Rep. (1993).
- A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, 12, 247-251 (1993). https://doi.org/10.1016/0167-6393(93)90095-3
- E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. on Audio, Speech, and Lang. Process. 14, 1462-1469 (2006). https://doi.org/10.1109/TSA.2005.858005
- A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," Proc. IEEE ICASSP. 749-752 (2001).
- C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," Proc. IEEE ICASSP. 4214-4217 (2010).