Speech Quality Estimation Algorithm Using Harmonic Modeling of Reverberant Signals

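  • Jae-Mo Yang (School of Electrical and Electronic Engineering, Yonsei University);
  • Hong-Goo Kang (School of Electrical and Electronic Engineering, Yonsei University)
  • Received: 2013.10.07
  • Reviewed: 2013.11.20
  • Published: 2013.11.30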

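Abstract

In indoor environments, a speech signal contains reverberation introduced by the acoustic transfer function. Estimating the degree of reverberation, or the change in speech quality caused by it, provides important information for applications such as dereverberation algorithms. This paper proposes an automatic speech quality estimation method for reverberant environments based on harmonic modeling of the speech signal. The proposed method shows that harmonic modeling is applicable to reverberant speech and estimates the statistical ratio between the modeled harmonic component and the remaining component. The estimated ratio is compared with a standard parameter for measuring sound quality in typical room environments. Experimental results demonstrate that the proposed method accurately estimates the standard quality parameter across a range of reverberant conditions (reverberation times of 0.2 to 1.0 s).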

The acoustic signal from a distant sound source in an enclosed space is often reverberant, and the reverberation varies with the room impulse response. Estimating the level of reverberation, or the quality of the observed signal, is important because it provides valuable information on the condition of the system's operating environment. It is also useful for designing a dereverberation system. This paper proposes a speech quality estimation method based on the harmonicity of the received signal, a characteristic of voiced speech. First, we show that harmonic modeling of a reverberant signal is reasonable. Then, the ratio between the harmonically modeled signal and the estimated non-harmonic signal is used as a measure of a standard room acoustical parameter related to speech clarity. Experimental results show that the proposed method successfully estimates speech quality when the reverberation time varies from 0.2s to 1.0s. Finally, we confirm the performance of the proposed method in environments with both background noise and reverberation.

Ⅰ. Introduction

In typical speech communication systems such as hands-free telephones, voice-controlled systems, and hearing aids, the received sensor signal is degraded by room reverberation and background noise. This degradation reduces the intelligibility of the target speech and lowers the performance of automatic speech recognition (ASR) [1][2]. Reverberation is modeled as a multi-path propagation process of an acoustic sound from source to sensor in an enclosed space. Generally, the received signal can be decomposed into two components: early reverberation (including the direct path) and late reverberation. The early reverberation, which arrives shortly after the direct sound and reinforces it, is a useful component for speech intelligibility [2]. Because the early reflections vary with the speaker and sensor positions, they also carry information on the volume of the space and the distance of the speaker. The late reverberation results from reflections that arrive with longer delays after the direct sound; it impairs speech intelligibility and is the principal obstacle for ASR systems. These detrimental effects generally increase with the distance between the source and the sensor.

The ISO 3382 standard defines room acoustical parameters and specifies how to measure them from a known room impulse response (RIR) [3]. However, indirect methods for estimating room acoustical parameters have been preferred because blind estimation of the RIR in a practical system is still an open problem. Lebart et al. proposed a reverberation time (T60) estimation method in which a segmentation procedure detects gaps in sounds so that the sound decay curve can be tracked [4]. Ratnam et al. assumed that the diffuse tail of the reverberation can be modeled as exponentially decaying Gaussian white noise. Based on this model, they proposed a statistical T60 estimation method using maximum-likelihood (ML) estimation of the room decay time constant [5]. Falk et al. proposed a method that simultaneously estimates both T60 and the direct-to-reverberation ratio (DRR) using short- and long-term temporal dynamics information [6]. However, it is not a blind method because it relies on a support vector regressor (SVR), which must be trained, to estimate the parameters. Recently, Georganti et al. proposed a single-microphone speaker distance detection method based on a pattern recognizer, using spectral and temporal features that depend on the reverberation condition [7].

One of the principal objectives of room acoustical parameter estimation is to assess the quality of the received speech signal. This quality, which depends on the room reverberation conditions and the speaker-to-sensor distance, is useful information for the user and for post-processing systems such as ASR or dereverberation. This paper proposes a blind single-channel speech quality estimation method based on speech harmonicity. We verify that the early reverberation signal can be approximated by a harmonically modeled signal. The modeled signal is used to estimate the ratio between the harmonic and the non-harmonic components, which can substitute for the room acoustical parameter related to speech clarity. The performance of the proposed algorithm is confirmed for random positions of speaker and sensor in various room environments.

 

Ⅱ. Signal model

1. Reverberant signal model

The RIR h (t), which represents the acoustical properties between sensor and speaker in a room, can be divided into two parts: early reverberation (including the direct path) and late reverberation [2].
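Assuming a hard split of the RIR at time T1, the decomposition (1) can be written in the following form. The unit-step notation u (t) is an assumption introduced here for compactness.

```latex
% u(t) is the unit step: u(t) = 1 for t >= 0 and 0 otherwise (assumed notation)
h(t) = h_e(t) + h_l(t), \qquad
h_e(t) = h(t)\,\bigl[u(t) - u(t - T_1)\bigr], \qquad
h_l(t) = h(t)\,u(t - T_1)
\tag{1}
```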

where he (t) and hl (t) are the early and the late reverberation of the RIR, respectively. The parameter T1 can be adjusted depending on the application or subjective preference; usually, T1 ranges from 50ms to 80ms. The reverberant signal x (t), obtained by convolving the anechoic speech signal s (t) with h (t), can be represented as:
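Because convolution is linear, splitting h (t) into its early and late parts yields a two-term form of (2) that matches the terms named in the next paragraph. The underbrace labels are an assumed layout.

```latex
x(t) = s(t) * h(t)
     = \underbrace{s(t) * h_e(t)}_{x_e(t)} + \underbrace{s(t) * h_l(t)}_{x_l(t)}
\tag{2}
```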

The first term in (2), the early reverberation xe (t), is composed of the sounds reflected off one or more surfaces up to time T1. The early reverberation carries information on the room size and the positions of speaker and sensor. The remaining sound, which results from reflections with long delays, is the late reverberation xl (t), which impairs speech intelligibility. The late reverberation can be regarded as a white Gaussian process because it is composed of a large number of random paths in a room (Polack's statistical RIR model) [2]. Therefore, it is reasonable to assume that the early and the late reverberation are uncorrelated.
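The signal model of this subsection can be illustrated with a small standard-library sketch. It draws a Polack-style RIR (exponentially decaying Gaussian noise), splits it at T1, and checks numerically that the reverberant signal equals the sum of its early and late parts. The function names, the 8 kHz rate, T60 = 0.4s and the 200 Hz test tone are all illustrative assumptions, not values from the paper.

```python
import math
import random

def polack_rir(t60, fs, duration, seed=0):
    """Exponentially decaying white Gaussian noise, a Polack-style RIR sketch."""
    rng = random.Random(seed)
    decay = 3.0 * math.log(10.0) / t60  # energy drops 60 dB after t60 seconds
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * n / fs)
            for n in range(int(duration * fs))]

def split_rir(h, fs, t1):
    """Return (h_e, h_l): h before t1 seconds and from t1 on, zero elsewhere."""
    n1 = int(round(t1 * fs))
    h_e = [v if n < n1 else 0.0 for n, v in enumerate(h)]
    h_l = [v if n >= n1 else 0.0 for n, v in enumerate(h)]
    return h_e, h_l

def convolve(h, s):
    """Direct-form linear convolution; zero taps of h are skipped."""
    out = [0.0] * (len(h) + len(s) - 1)
    for i, hv in enumerate(h):
        if hv != 0.0:
            for j, sv in enumerate(s):
                out[i + j] += hv * sv
    return out

fs = 8000                                  # assumed sampling rate for the demo
h = polack_rir(t60=0.4, fs=fs, duration=0.1)
h_e, h_l = split_rir(h, fs, t1=0.05)       # T1 = 50 ms
s = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(400)]
x, x_e, x_l = convolve(h, s), convolve(h_e, s), convolve(h_l, s)

# x(t) should equal x_e(t) + x_l(t) up to floating-point rounding
max_err = max(abs(a - (b + c)) for a, b, c in zip(x, x_e, x_l))
print(max_err)
```

The split is a plain time gate, so the two parts add back to h (t) exactly and the decomposition of x (t) follows from linearity alone.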

2. Harmonic signal model

A speech signal can be modeled as the sum of a harmonic signal sh (t) and a non-harmonic signal sn (t) as follows [8]: s (t) = sh (t) + sn (t). (3)

The harmonic part accounts for the quasi-periodic component of the speech signal, such as voiced sounds, while the non-harmonic part accounts for its non-periodic components, such as fricative or aspiration noise and period-to-period variations of the glottal excitation. The quasi-periodicity of the harmonic signal sh (t) is approximately modeled as the sum of K sinusoidal components whose frequencies correspond to integer multiples of the fundamental frequency F0 [8]. Assuming that Ak (t) and θk (t) are the amplitude and phase of the k-th harmonic component, it can be represented as
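Under this model, with the instantaneous frequency of each component tied to F0, a form of (4) consistent with the sinusoidal representation of [8] is given below. The cosine convention and the symbol F_0(t) are assumptions.

```latex
s_h(t) = \sum_{k=1}^{K} A_k(t)\,\cos\bigl(\theta_k(t)\bigr),
\qquad \dot{\theta}_k(t) = 2\pi k\,F_0(t)
\tag{4}
```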

where the time derivative of the phase of the k-th harmonic component equals 2πk times the fundamental frequency F0. Without loss of generality, Ak (t) and θk (t) can be derived from the short-time Fourier transform S (f) of the signal around time index n0, as described in [8].

Here, the analysis window is chosen short enough to capture the time-varying features of the harmonic signal.
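The harmonic analysis and synthesis steps can be sketched as follows. The sketch assumes that F0 and the amplitudes are constant within one frame, uses a Hann window, and evaluates the windowed Fourier sum directly at each harmonic frequency instead of picking FFT bins. The function names and test values are hypothetical and this is not claimed to be the authors' exact estimator.

```python
import cmath
import math

def hann(n_len):
    """Periodic Hann window of length n_len."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]

def analyze_harmonics(frame, f0, fs, num_harmonics):
    """Estimate (amplitude, phase) at each harmonic k*f0 of one frame."""
    w = hann(len(frame))
    w_sum = sum(w)
    params = []
    for k in range(1, num_harmonics + 1):
        omega = 2.0 * math.pi * k * f0 / fs
        # Windowed Fourier sum evaluated exactly at the harmonic frequency
        proj = sum(wn * xn * cmath.exp(-1j * omega * n)
                   for n, (wn, xn) in enumerate(zip(w, frame)))
        # A real sinusoid A*cos(omega*n + theta) projects to about (A/2)*exp(j*theta)*w_sum
        params.append((2.0 * abs(proj) / w_sum, cmath.phase(proj)))
    return params

def synthesize_harmonics(params, f0, fs, n_len):
    """Sum of cosines at k*f0 with the given amplitudes and phases."""
    return [sum(a * math.cos(2.0 * math.pi * k * f0 * n / fs + th)
                for k, (a, th) in enumerate(params, start=1))
            for n in range(n_len)]

fs, f0, n_len = 16000, 200.0, 512          # 32 ms frame at 16 kHz; F0 is assumed known
true_params = [(1.0, 0.3), (0.5, -1.0), (0.25, 2.0)]
frame = synthesize_harmonics(true_params, f0, fs, n_len)
est_params = analyze_harmonics(frame, f0, fs, 3)
resynth = synthesize_harmonics(est_params, f0, fs, n_len)
for (a, th), (ea, eth) in zip(true_params, est_params):
    print(round(a - ea, 4), round(th - eth, 4))
```

Evaluating the sum at the harmonic frequencies avoids the bin-rounding error of an FFT grid, at the cost of one pass over the frame per harmonic. For real speech the remaining error comes mainly from F0 estimation and from amplitude variation within the frame.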

 

Ⅲ. Harmonic to non-harmonic ratio estimation

In this section, we propose a single-channel speech quality estimation method using the ratio between the harmonic and the non-harmonic components of the observed signal. After defining the harmonic to non-harmonic ratio (HnHR), we show that the ideal HnHR corresponds to the standard room acoustical parameter.

1. Room acoustic parameters

The ISO 3382 standard defines several room acoustical parameters [3]. Among them, the reverberation time (T60) and the clarity (C50, C80) are considered in this paper because together they represent not only the room condition but also the distance between speaker and sensor: speech quality can vary with the speaker-to-sensor distance even within the same room. The clarity parameter is defined as the logarithmic energy ratio between the early and late parts of an impulse response as follows [3]:
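In the notation of Section Ⅱ, with T1 as the boundary between early and late reverberation, this definition can be written as below. The integral form follows the usual definition of clarity, and taking t = 0 at the arrival of the direct sound is an assumption.

```latex
C_{T_1} = 10 \log_{10}
  \frac{\int_{0}^{T_1} h^2(t)\,dt}{\int_{T_1}^{\infty} h^2(t)\,dt}\ \text{dB},
\qquad T_1 = 50\,\text{ms for } C_{50},\quad T_1 = 80\,\text{ms for } C_{80}
\tag{6}
```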

where C50 is used to express the clarity of speech and C80 is better suited for music. If T1 is assumed to be very small (less than 4ms), the clarity parameter becomes a good approximation of the DRR, which carries information on the distance from speaker to sensor. The clarity index is in fact closely related to this distance.
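As a worked illustration of the clarity definition, the sketch below computes C50 and C80 for a sampled impulse response. The test response is a deterministic exponential envelope with a chosen T60, which is an idealization used only so that the result can be checked against the closed form 10 log10(10^(6 T1 / T60) - 1). The function names are hypothetical.

```python
import math

def clarity_db(h, fs, t1):
    """Early-to-late energy ratio of an impulse response in dB (t1=0.05 gives C50)."""
    n1 = int(round(t1 * fs))
    early = sum(v * v for v in h[:n1])
    late = sum(v * v for v in h[n1:])
    return 10.0 * math.log10(early / late)

def decaying_envelope(t60, fs, duration=2.0):
    """Deterministic envelope whose energy decays by 60 dB in t60 seconds."""
    decay = 3.0 * math.log(10.0) / t60
    return [math.exp(-decay * n / fs) for n in range(int(duration * fs))]

fs = 16000
for t60 in (0.2, 0.5, 1.0):
    h = decaying_envelope(t60, fs)
    c50 = clarity_db(h, fs, 0.05)
    c80 = clarity_db(h, fs, 0.08)
    closed_form = 10.0 * math.log10(10.0 ** (6.0 * 0.05 / t60) - 1.0)
    print(t60, round(c50, 2), round(c80, 2), round(closed_form, 2))
```

For this idealized decay, clarity falls as T60 grows and C80 is never smaller than C50. A measured RIR additionally contains a direct-path peak whose relative energy depends on distance, which is what links clarity to the speaker-to-sensor distance.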

2. Harmonic component of reverberant signal

In a practical system, h (t) is unknown and it is very hard to blindly estimate an accurate RIR. We will verify that the ratio between the harmonic and the non-harmonic component of the observed signal gives us useful information on speech quality. Using (1), (2) and (3), the observed signal can be decomposed into the following harmonic xeh (t) and non-harmonic xnh (t) components [1]:
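Substituting the harmonic decomposition of s (t) and the early/late split of h (t) into the convolution gives a grouping consistent with the terms defined in the next paragraph. The exact layout of (7) is assumed.

```latex
x(t) = h(t) * \bigl[s_h(t) + s_n(t)\bigr]
     = \underbrace{h_e(t) * s_h(t)}_{x_{eh}(t)}
     + \underbrace{h_l(t) * s_h(t) + h(t) * s_n(t)}_{x_{nh}(t)\,=\,x_{lh}(t) + x_n(t)}
\tag{7}
```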

where * represents the convolution operation. xeh (t) is the early reverberation of the harmonic signal, composed of the sum of several reflections with small delays. Since he (t) is essentially short, xeh (t) can be regarded as a harmonic signal in the low frequency band. Therefore, it is possible to model xeh (t) as a harmonic signal similar to (4). xlh (t) and xn (t) are the late reverberation of the harmonic signal and the reverberation of the non-harmonic signal sn (t), respectively.

3. HnHR estimation

The early-to-late signal ratio (ELR) can be regarded as one of the room acoustical parameters relating to speech clarity. Ideally, if we assume that h (t) and s (t) are independent, the ELR can be represented as follows:
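Taking the ratio of the expected energies of the two components, (8) can be written in the following form. The time-domain expectation is an assumption, and a spectral-domain average over the analysis band would play the same role.

```latex
\mathrm{ELR} = 10 \log_{10} \frac{E\{x_e^2(t)\}}{E\{x_l^2(t)\}}
\tag{8}
```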

where E {•} represents the expectation operator. In fact, (8) reduces to C50 when T1 = 50ms, but xe (t) and xl (t) are unknown in practice. From (2) and (7), it is reasonable to assume that xeh (t) and xnh (t) follow xe (t) and xl (t), respectively, because sn (t) has much smaller energy than sh (t). Therefore, the harmonic to non-harmonic ratio (HnHR) given in (9) can be regarded as a replacement for the ELR value.
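By analogy with the ELR, a form of (9) consistent with this substitution is the following, where the time-domain expectation is again an assumption:

```latex
\mathrm{HnHR} = 10 \log_{10} \frac{E\{x_{eh}^2(t)\}}{E\{x_{nh}^2(t)\}}
\tag{9}
```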

Figure 1 depicts the flow graph of the proposed HnHR estimation algorithm. Details of the operations in each module are described as follows:

Fig. 1. Flow graph of the proposed HnHR estimation algorithm.

Pitch estimation: F0 is an important factor in the proposed method because pitch estimation errors directly affect the HnHR value. We adopt the subharmonic-to-harmonic ratio (SHR) metric given in (10) [9], which shows robust performance in noisy and reverberant environments.

Weighted harmonic modeling: Using the estimated F0, the amplitude and phase at each harmonic frequency are used to synthesize the harmonic component xeh (t). In the reverberation tail interval, however, the synthesized harmonic signal can end up tracking the reverberation itself, because the harmonicity of the signal decays only gradually after the speech offset. The tail interval could be discarded by voice activity detection or a binary decision per processing frame, but doing so would bias the HnHR because a large portion of the late reverberation lies in this interval. Therefore, we apply a frame-based amplitude weighting that gradually decreases the energy of the synthesized signal in the reverberation tail interval as follows:

where ϵ is set to 5 based on extensive experiments. The weighting function, depicted in Fig. 2, maintains the original harmonic model when the SHR is larger than 7dB and gradually decreases the amplitude of the harmonically modeled signal when the SHR is low.

Fig. 2. Amplitude weighting function based on the SHR in a frame.

Non-harmonic component estimation: We assume that xeh (t) and xnh (t) are uncorrelated. Therefore, the spectral variance of the non-harmonic part is derived by spectral subtraction as follows [10]:

Here, the subtracted harmonic term is the spectrum of the signal synthesized by the weighted harmonic modeling.

HnHR estimation: Finally, the HnHR is estimated using (9), where the expectation is calculated by first-order recursive averaging with a forgetting factor of 0.95.
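The last two modules can be sketched together as follows. The sketch assumes per-frame power spectra of the observed signal and of the synthesized harmonic signal are already available, subtracts them bin by bin with a floor at zero, and smooths the band energies with a forgetting factor of 0.95. Summing over bins, the zero floor, initializing the averages with the first frame, and the function names are all assumptions made for illustration.

```python
import math

def nonharmonic_power(obs_power, harm_power):
    """Per-bin power spectral subtraction, floored at zero."""
    return [max(px - ph, 0.0) for px, ph in zip(obs_power, harm_power)]

def hnhr_track_db(frames_obs, frames_harm, forgetting=0.95, eps=1e-12):
    """Frame-wise HnHR in dB with first-order recursive averaging."""
    avg_h = avg_n = None
    track = []
    for obs, harm in zip(frames_obs, frames_harm):
        p_h = sum(harm)                          # harmonic energy in the band
        p_n = sum(nonharmonic_power(obs, harm))  # non-harmonic energy in the band
        if avg_h is None:
            avg_h, avg_n = p_h, p_n              # start the averages at the first frame
        else:
            avg_h = forgetting * avg_h + (1.0 - forgetting) * p_h
            avg_n = forgetting * avg_n + (1.0 - forgetting) * p_n
        track.append(10.0 * math.log10((avg_h + eps) / (avg_n + eps)))
    return track

# Toy input: one bin per frame, harmonic power drops from 10 to 1 after frame 3
frames_harm = [[10.0]] * 3 + [[1.0]] * 3
frames_obs = [[11.0]] * 3 + [[2.0]] * 3          # non-harmonic power stays at 1
track = hnhr_track_db(frames_obs, frames_harm)
print([round(v, 2) for v in track])
```

With a forgetting factor of 0.95 the estimate reacts slowly to a sudden change, which trades responsiveness for a smoother HnHR trajectory.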

 

Ⅳ. Experimental results

We implemented the proposed algorithm depicted in Fig. 1 to verify that it can distinguish the quality of the observed speech signal and follow the ideal ELR value, C50, in a room. In this experiment, we compared the HnHR in various T60 environments. The speaker and the sensor are located at random positions, with their distance varied from 0m to 5.0m. We tested 30 random positions for each distance and each T60, which varies from 0.2s to 1.0s. The image source model (ISM) is used to generate RIRs for the random positions of speaker and sensor [11][12]. Conversational male and female speech signals were recorded at a sampling frequency of 16kHz. The analysis frame length is set to 32ms with a Hanning window sliding in 10ms steps. The DFT size is four times the analysis frame length, and only the 50~5000Hz frequency band is used to estimate the HnHR. The high frequency band is disregarded because its harmonicity is relatively low and the estimated harmonic frequencies are more error-prone there than in the low frequency band. Figure 3 shows the accuracy of the harmonic modeling for a clean speech signal. The ratio of the original signal to the modeling error ranges from 15dB to 20dB, with an average of around 16dB. Therefore, the estimated HnHR is upper bounded by 16dB in the proposed implementation, although the upper bound may rise if the performance of the harmonic modeling module is further enhanced.

Fig. 3. Accuracy of the harmonic modeling: (a) clean and harmonically modeled signals, (b) modeling error.
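The analysis settings above translate into sample and bin counts as in the short sketch below. It reads the 10ms figure as the frame hop, which is an interpretation, and the helper `harmonics_in_band` is a hypothetical name used only to show how many harmonics fall inside the 50~5000Hz band for a given F0.

```python
import math

fs = 16000                          # sampling frequency in Hz
frame_len = round(0.032 * fs)       # 32 ms frame -> 512 samples
hop = round(0.010 * fs)             # 10 ms hop   -> 160 samples
nfft = 4 * frame_len                # DFT size    -> 2048 points
bin_hz = fs / nfft                  # bin spacing -> 7.8125 Hz

k_lo = math.ceil(50.0 / bin_hz)     # first bin at or above 50 Hz   -> 7
k_hi = math.floor(5000.0 / bin_hz)  # last bin at or below 5000 Hz  -> 640

def harmonics_in_band(f0, f_lo=50.0, f_hi=5000.0):
    """Harmonic numbers k such that f_lo <= k * f0 <= f_hi."""
    first = max(1, math.ceil(f_lo / f0))
    return list(range(first, math.floor(f_hi / f0) + 1))

print(frame_len, hop, nfft, bin_hz, k_lo, k_hi)
print(len(harmonics_in_band(200.0)))   # 25 harmonics for F0 = 200 Hz
```

Zero-padding the 512-sample frame to 2048 points does not add resolution, but it samples the spectrum more densely, which makes reading amplitudes near k·F0 less sensitive to bin rounding.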

The results of the proposed HnHR estimation method are depicted in Fig. 4. Fig. 4(a) and Fig. 4(b) show the results without background noise. The solid lines in Fig. 4(a) represent the C50 (ideal ELR) curves as a function of the sensor-to-speaker distance in various reverberation conditions. The dotted lines are the ELR estimated from the known xe (t) and xl (t). The estimated ELR tracks C50 well in highly reverberant cases, while it is erroneous for T60=0.2s and 0.4s because the relatively small energy of the late reverberation can inflate the ELR value in a frame-by-frame estimation scheme.

Figure 4(b) shows the results of the proposed HnHR estimation in noise-free environments. The maximum HnHR is 13dB in the experiment (upper bounded by 16dB). Under low reverberation, the estimated HnHR exceeds 12dB even for a distant speaker, while it decreases rapidly under high reverberation. The minimum HnHR is 4dB for the highest reverberation level although the ideal value is 2dB. This is because, as explained before, the reverberation signal in the reverberation tail can be captured by the proposed harmonic model. The maximum range of the HnHR is 9dB, which may be enough to distinguish speech quality in reverberant environments.

The results with background noise are depicted in Fig. 4(c) and 4(d). White Gaussian noise is used in this simulation. In Fig. 4(c), the range of the estimated ELR is reduced because the noise energy is added to both the numerator and the denominator in the ELR estimation of (8). The estimated HnHR in noisy and reverberant environments in Fig. 4(d) shows smaller values than the result in Fig. 4(b). The maximum HnHR value is 6dB and the maximum range is 5dB because most of the noise components are included in the non-harmonic part. The estimated HnHR shows an ability to distinguish the quality of speech in reverberant environments over a range of 9dB (6dB for the noisy condition).

In both cases, the HnHR generally follows the trend of C50, which is the desired result. Furthermore, an informal listening test confirms that large HnHR values (such as 10dB) indeed correspond to good speech quality in high SNR conditions. In low SNR conditions, however, the HnHR is capped by an upper bound due to errors in the pitch estimation and harmonic modeling, which could be improved if more sophisticated methods are developed and used in future research.

Fig. 4. Results of the proposed HnHR estimation: (a) C50 (solid line) and estimated ELR (dotted line) in 50dB SNR, (b) HnHR in 50dB SNR, (c) C50 and estimated ELR in 15dB SNR, (d) HnHR in 15dB SNR.

 

Ⅴ. Conclusion

A harmonicity-based speech quality estimation method for reverberant signals has been proposed. The harmonically modeled signal serves as a substitute for the early reverberation component of the observed signal. We verified that the proposed harmonic to non-harmonic ratio (HnHR) can be used in place of the standard room acoustical parameter related to speech clarity. Experimental results for random locations of the speaker and a single sensor showed that the proposed method successfully measures the degree of speech quality in various reverberation environments. Future work involves improving the accuracy of the harmonic modeling for reverberant signals and unifying the proposed method with a single-channel dereverberation algorithm.

References

  1. T. Nakatani, K. Kinoshita, and M. Miyoshi, "Harmonicity-based blind dereverberation for single-channel speech signal," IEEE Trans. ASLP, vol. 15, pp. 80-95, January 2007.
  2. E. A. P. Habets, "Single- and multi-microphone speech dereverberation using spectral enhancement," Ph.D. thesis, 2007.
  3. ISO 3382, "Measurement of the reverberation time of rooms with reference to other acoustical parameters," International Organization for Standardization, Geneva, Switzerland.
  4. K. Lebart, J. M. Boucher, and P. N. Denbigh, "A new method based on spectral subtraction for speech dereverberation," Acta Acustica, vol. 87, pp. 359-366, June 2001.
  5. R. Ratnam, D. L. Jones, B. C. Wheeler, W. D. O'Brien, C. R. Lansing, and A. S. Feng, "Blind estimation of reverberation time," J. Acoust. Soc. Am., vol. 114, pp. 2877-2892, 2003. https://doi.org/10.1121/1.1616578
  6. T. H. Falk and W.-Y. Chan, "Temporal dynamics for blind measurement of room acoustical parameters," IEEE Trans. Instrum. Meas., vol. 59, pp. 978-989, 2010. https://doi.org/10.1109/TIM.2009.2024697
  7. E. Georganti, T. May, S. van de Par, A. Härmä, and J. Mourjopoulos, "Speaker distance detection using a single microphone," IEEE Trans. ASLP, vol. 19, pp. 1949-1961, September 2011.
  8. R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. ASSP, vol. ASSP-34, pp. 744-754, April 1986.
  9. X. Sun, "Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio," ICASSP, pp. I333-I336, May 2002.
  10. S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. ASSP, vol. ASSP-27, pp. 113-120, April 1979.
  11. J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small room acoustics," J. Acoust. Soc. Am., vol. 65, pp. 943-950, April 1979. https://doi.org/10.1121/1.382599
  12. E. A. Lehmann and A. M. Johansson, "Prediction of energy decay in room impulse responses simulated with an image-source model," J. Acoust. Soc. Am., vol. 124, pp. 269-277, July 2008. https://doi.org/10.1121/1.2936367