• Title/Summary/Keyword: acoustic information (음향 정보)


Integrated receptive field diversification method for improving speaker verification performance for variable-length utterances (가변 길이 입력 발성에서의 화자 인증 성능 향상을 위한 통합된 수용 영역 다양화 기법)

  • Shin, Hyun-seo;Kim, Ju-ho;Heo, Jungwoo;Shim, Hye-jin;Yu, Ha-Jin
    • The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.319-325 / 2022
  • The variation of utterance length is a representative factor that can degrade the performance of speaker verification systems. To handle this issue, previous studies attempted to extract speaker features from multiple branches or to use convolution layers with different receptive fields. Combining the advantages of these two approaches for variable-length input, this paper proposes integrated receptive field diversification, which extracts speaker features through more diverse receptive fields. The proposed method processes the input features with convolutional layers of different receptive fields at multiple time-axis branches and extracts the speaker embedding by dynamically aggregating the processed features according to the length of the input utterance. The deep neural networks in this study were trained on the VoxCeleb2 dataset and tested on the VoxCeleb1 evaluation dataset divided into 1 s, 2 s, 5 s, and full-length segments. Experimental results demonstrate that the proposed method reduces the equal error rate by 19.7 % compared to the baseline.
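
The branch-and-aggregate idea can be sketched in a few lines. The following is a minimal numpy illustration, not the authors' network: the averaging kernels, the kernel sizes, and the length-dependent softmax gating are all stand-in assumptions used only to show the structure (parallel branches with different receptive fields, aggregated by weights that depend on the utterance length).

```python
import numpy as np

def branch_conv(x, kernel_size):
    """One branch: 1-D convolution along the time axis (a simple
    averaging kernel stands in for a learned filter)."""
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(x, kernel, mode="same")

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def diversified_features(x, kernel_sizes=(3, 5, 9)):
    """Run branches with different receptive fields, then aggregate
    them with length-dependent weights so that longer inputs lean more
    on the wide-receptive-field branches."""
    T = len(x)
    branches = np.stack([branch_conv(x, k) for k in kernel_sizes])
    weights = softmax(np.array(kernel_sizes, float) * np.log1p(T) * 0.05)
    return (weights[:, None] * branches).sum(axis=0), weights

rng = np.random.default_rng(0)
feat, w = diversified_features(rng.standard_normal(200))
```

In the real system the aggregation weights would be learned; here the monotone dependence on `T` merely illustrates the dynamic, length-aware pooling the abstract describes.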

A study on deep neural speech enhancement in drone noise environment (드론 소음 환경에서 심층 신경망 기반 음성 향상 기법 적용에 관한 연구)

  • Kim, Jimin;Jung, Jaehee;Yeo, Chaneun;Kim, Wooil
    • The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.342-350 / 2022
  • In this paper, actual drone noise samples are collected for speech processing in disaster environments to build a noise-corrupted speech database, and speech enhancement performance is evaluated by applying spectral subtraction and mask-based speech enhancement techniques. To improve the performance of VoiceFilter (VF), an existing deep neural network-based speech enhancement model, we apply a Self-Attention operation and use the estimated noise information as input to the attention model. Compared to the existing VF model, the experimental results show improvements of 3.77 %, 1.66 %, and 0.32 % in Source-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), respectively. When trained with a 75 % mix of speech data with drone sounds collected from the Internet, the relative performance drops for SDR, PESQ, and STOI are 3.18 %, 2.79 %, and 0.96 %, respectively, compared to using only actual drone noise. This confirms that data similar to real data can be collected and used effectively for training speech enhancement models in environments where real data are difficult to obtain.
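
The spectral-subtraction baseline mentioned above is simple enough to sketch. This assumes one magnitude-spectrum frame and a pre-computed noise estimate; the over-subtraction factor and spectral-floor values are illustrative defaults, not the paper's settings.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, beta=0.02):
    """Subtract the estimated noise magnitude spectrum from the noisy
    one, with a spectral floor to suppress musical-noise artifacts."""
    cleaned = noisy_mag - alpha * noise_mag
    return np.maximum(cleaned, beta * noisy_mag)

# Toy usage: one STFT frame of "speech" buried in broadband noise
rng = np.random.default_rng(1)
speech = np.abs(rng.standard_normal(257))
noise = np.full(257, 0.5)
enhanced = spectral_subtraction(speech + noise, noise)
```

The mask-based VF approach replaces this fixed rule with a learned time-frequency mask, which is why noise information fed to the attention module can help it.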

A Korean menu-ordering sentence text-to-speech system using conformer-based FastSpeech2 (콘포머 기반 FastSpeech2를 이용한 한국어 음식 주문 문장 음성합성기)

  • Choi, Yerin;Jang, JaeHoo;Koo, Myoung-Wan
    • The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.359-366 / 2022
  • In this paper, we present a Korean menu-ordering sentence Text-to-Speech (TTS) system using Conformer-based FastSpeech2. The Conformer is a convolution-augmented Transformer, originally proposed for speech recognition. By combining the two structures, the Conformer extracts better local and global features. It comprises two half feed-forward modules at the front and the end, sandwiching the multi-head self-attention module and the convolution module. We introduce the Conformer into Korean TTS, as it is known to work well in Korean speech recognition. To compare a Transformer-based TTS model with a Conformer-based one, we trained FastSpeech2 and Conformer-based FastSpeech2. We collected a phoneme-balanced dataset and used it for training our models. This corpus comprises not only general conversation but also menu-ordering conversation consisting mainly of loanwords, addressing current Korean TTS models' degradation on loanwords. When synthesized speech was generated using Parallel WaveGAN, the Conformer-based FastSpeech2 achieved a superior Mean Opinion Score (MOS) of 4.04. We confirm that model performance improved when the same structure was changed from Transformer to Conformer in Korean TTS.
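
The "macaron" module ordering described above can be shown structurally. This is a drastically reduced numpy sketch (single attention head, no projections, gating, or dropout, and an averaging kernel in place of the gated depthwise convolution); it only illustrates the half-FF / attention / convolution / half-FF sandwich, not a faithful Conformer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out            # ReLU MLP

def self_attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])             # single head, no projections
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

def depthwise_conv(x, kernel_size=3):
    k = np.ones(kernel_size) / kernel_size
    return np.stack([np.convolve(x[:, d], k, mode="same")
                     for d in range(x.shape[1])], axis=1)

def conformer_block(x, w1, w2, w3, w4):
    x = x + 0.5 * feed_forward(layer_norm(x), w1, w2)   # half feed-forward (front)
    x = x + self_attention(layer_norm(x))               # self-attention module
    x = x + depthwise_conv(layer_norm(x))               # convolution module
    x = x + 0.5 * feed_forward(layer_norm(x), w3, w4)   # half feed-forward (back)
    return layer_norm(x)

rng = np.random.default_rng(2)
T, d, h = 12, 8, 16
weights = [rng.standard_normal(s) * 0.1
           for s in ((d, h), (h, d), (d, h), (h, d))]
out = conformer_block(rng.standard_normal((T, d)), *weights)
```

The half-weighted residual feed-forward modules are what distinguish this layout from a plain Transformer encoder block.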

Comparative analysis of the soundscape evaluation depending on the listening experiment methods (청감실험방식에 따른 음풍경 평가결과 비교분석)

  • Jo, A-Hyeon;Haan, Chan-Hoon
    • The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.287-301 / 2022
  • The present study investigates the difference between soundscape evaluation results from on-site field tests and laboratory tests, the two methods commonly used in soundscape surveys. To do this, both field and laboratory tests were carried out in four different areas of Cheongju city. On-site questionnaire surveys were administered to 65 people at 13 points, and laboratory listening tests were conducted with 48 adults using recorded sounds and video. The laboratory tests were given to two groups, one with prior field-survey experience and one without, and used two different sound reproduction tools, headphones and speakers. As a result, a very close correlation between sound loudness and annoyance was found in both the field and laboratory tests. However, figure sounds are recognized differently in the field and in the laboratory, since the on-site situation is hard to apprehend using only the visual and aural information provided in laboratory tests. In the laboratory tests, the figure sounds perceived as loudest differed somewhat between the headphone and speaker groups. It was also observed that field-experienced participants tended to recognize the figure sounds from memory of their experience, while non-experienced participants could not perceive them.

Shear-wave elasticity imaging with axial sub-Nyquist sampling (축방향 서브 나이퀴스트 샘플링 기반의 횡탄성 영상 기법)

  • Woojin Oh;Heechul Yoon
    • The Journal of the Acoustical Society of Korea / v.42 no.5 / pp.403-411 / 2023
  • Functional ultrasound imaging, such as elasticity imaging and micro-blood-flow Doppler imaging, enhances diagnostic capability by providing useful mechanical and functional information about tissue. However, implementing functional ultrasound imaging poses limitations, such as the storage of vast amounts of Radio Frequency (RF) data during acquisition and processing. In this paper, we propose a sub-Nyquist approach that reduces the number of acquired axial samples for efficient shear-wave elasticity imaging. The proposed method acquires data at one-third of the conventional Nyquist sampling rate and tracks shear-wave signals through RF signals reconstructed using band-pass filtering-based interpolation, assuming the RF signal has a fractional bandwidth of 67 %. To validate the approach, we reconstruct shear-wave velocity images using shear-wave tracking data obtained by the conventional and proposed approaches, and compare the group velocity, contrast-to-noise ratio, and structural similarity index measure. We qualitatively and quantitatively demonstrate the potential of sub-Nyquist-sampling-based shear-wave elasticity imaging, indicating that our approach could be practically useful in three-dimensional shear-wave elasticity imaging, where a massive amount of ultrasound data is required.
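
Band-pass filtering-based interpolation of a band-limited RF signal can be sketched as zero-insertion upsampling followed by an ideal band-pass filter around the carrier. The sampling rate, 5 MHz centre frequency, and decimation factor below are illustrative assumptions (the paper does not specify them in the abstract), and a frequency-domain brick-wall filter stands in for a practical filter design.

```python
import numpy as np

def bandpass_interpolate(decimated, factor, fs, f_center, bandwidth):
    """Reconstruct an RF line from axially decimated samples:
    zero-insertion upsampling followed by an ideal band-pass filter
    around the transducer centre frequency, which rejects the spectral
    images introduced by the zero insertion."""
    n = len(decimated) * factor
    upsampled = np.zeros(n)
    upsampled[::factor] = decimated * factor       # restore amplitude
    spectrum = np.fft.rfft(upsampled)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[np.abs(freqs - f_center) > bandwidth / 2] = 0.0
    return np.fft.irfft(spectrum, n=n)

# Toy RF pulse: 5 MHz carrier with a 67 % fractional-bandwidth filter,
# nominally sampled at 40 MHz and axially decimated by 3
fs, f0 = 40e6, 5e6
t = np.arange(1200) / fs
pulse = np.exp(-((t - t.mean()) / 0.5e-6) ** 2) * np.sin(2 * np.pi * f0 * t)
reconstructed = bandpass_interpolate(pulse[::3], 3, fs, f0, 0.67 * f0)
```

Because the pulse energy is confined to the pass band, the reconstruction closely matches the original line even though only every third axial sample was kept.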

Reducing latency of neural automatic piano transcription models (인공신경망 기반 저지연 피아노 채보 모델)

  • Dasol Lee;Dasaem Jeong
    • The Journal of the Acoustical Society of Korea / v.42 no.2 / pp.102-111 / 2023
  • Automatic Music Transcription (AMT) is the task of detecting and recognizing musical note events in a given audio recording. In this paper, we focus on reducing the latency of real-time AMT systems for piano music. Although neural AMT models have been adapted for real-time piano transcription, they suffer from high latency, which hinders their usefulness in interactive scenarios. To tackle this issue, we explore several techniques for reducing the intrinsic latency of a neural network for piano transcription: reducing the window and hop sizes of the Fast Fourier Transform (FFT), modifying the convolutional layers' kernel sizes, and shifting the labels along the time axis to train the model to predict onsets earlier. Our experiments demonstrate that combining these approaches can lower latency while maintaining high transcription accuracy. Specifically, our modified models achieved note F1 scores of 92.67 % and 90.51 % with latencies of 96 ms and 64 ms, respectively, compared to the baseline model's note F1 score of 93.43 % at a latency of 160 ms. This methodology has potential for training AMT models for various interactive scenarios, including real-time feedback for piano education.
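
The trade-off between window size, label shift, and latency is simple arithmetic. The helper below is a sketch of that accounting under an assumed 16 kHz sample rate and hypothetical window/hop values; it is not the paper's exact configuration.

```python
def intrinsic_latency_ms(window_size, hop_size, sample_rate, shift_frames=0):
    """Latency contributed by the STFT front end: a frame's prediction
    is aligned with its window centre, so the system must wait half a
    window of audio past an onset before it can report it; shifting the
    training labels earlier by `shift_frames` hops buys that time back."""
    half_window_ms = window_size / 2 / sample_rate * 1000.0
    shift_ms = shift_frames * hop_size / sample_rate * 1000.0
    return half_window_ms - shift_ms

# e.g. halving the window and shifting labels by one hop both cut latency
baseline = intrinsic_latency_ms(4096, 512, 16000)                 # about 128 ms
smaller = intrinsic_latency_ms(2048, 512, 16000)                  # about 64 ms
shifted = intrinsic_latency_ms(2048, 512, 16000, shift_frames=1)  # about 32 ms
```

Shrinking the window reduces latency directly but also reduces frequency resolution, which is why the paper balances these changes against transcription accuracy.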

Development of a Listener Position Adaptive Real-Time Sound Reproduction System (청취자 위치 적응 실시간 사운드 재생 시스템의 개발)

  • Lee, Ki-Seung;Lee, Seok-Pil
    • The Journal of the Acoustical Society of Korea / v.29 no.7 / pp.458-467 / 2010
  • In this paper, a new audio reproduction system was developed in which cross-talk signals are reasonably cancelled at an arbitrary listener position. To remove the cross-talk signals adaptively according to the listener's position, a listener-tracking method was employed: two microphones were used, and the listener direction was estimated from the time delay between their two signals. Room reverberation effects were also taken into consideration by means of linear prediction analysis. To remove the cross-talk signals at the left and right ears, the paths between the sources and the ears were represented using KEMAR head-related transfer functions (HRTFs) measured from an artificial dummy head. To evaluate the usefulness of the proposed listener-tracking system, cross-talk cancellation performance was evaluated at the estimated listener positions in terms of the Channel Separation Ratio (CSR); a CSR of -10 dB was achieved experimentally even when the listener position deviated somewhat. A real-time system was implemented on a floating-point Digital Signal Processor (DSP). The average error of the estimated listener direction was 5 degrees, and the subjects indicated that 80 % of the stimuli were perceived as coming from the correct direction.
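
Estimating direction from a two-microphone time delay is a classic cross-correlation computation. This sketch uses a far-field model and illustrative values (48 kHz sampling, 20 cm microphone spacing); the paper's actual tracker and reverberation handling are not reproduced.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def estimate_direction_deg(left, right, fs, mic_distance):
    """Estimate the listener direction from the time delay between two
    microphone signals, found at the peak of their cross-correlation.
    Far-field model: delay = d * sin(theta) / c; positive lag here
    means `left` is delayed relative to `right`."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    tdoa = lag / fs
    s = np.clip(tdoa * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Toy check: delay one channel by 5 samples at 48 kHz, 20 cm spacing
rng = np.random.default_rng(3)
sig = rng.standard_normal(2000)
delayed = np.concatenate([np.zeros(5), sig[:-5]])
angle = estimate_direction_deg(sig, delayed, 48000, 0.2)
```

With the cross-talk canceller, the estimated angle selects which HRTF pair to invert, so tracking accuracy directly bounds the achievable channel separation.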

Noise-Biased Compensation of Minimum Statistics Method using a Nonlinear Function and A Priori Speech Absence Probability for Speech Enhancement (음질향상을 위해 비선형 함수와 사전 음성부재확률을 이용한 최소통계법의 잡음전력편의 보상방법)

  • Lee, Soo-Jeong;Lee, Gang-Seong;Kim, Sun-Hyob
    • The Journal of the Acoustical Society of Korea / v.28 no.1 / pp.77-83 / 2009
  • This paper proposes a new noise-bias compensation method for the Minimum Statistics (MS) noise estimator, using a nonlinear function and the a priori Speech Absence Probability (SAP), for speech enhancement in non-stationary noisy environments. The MS method is a well-known technique for noise power estimation in non-stationary noisy environments, but it tends to bias the noise estimate below the true noise level. The proposed method combines an adaptive parameter based on a sigmoid function with the a priori SAP for bias compensation. Specifically, we adjust the adaptive parameter according to the a posteriori SNR. In addition, when the a priori SAP equals unity, the adaptive bias-compensation factor is increased to $\delta_{max}$ in each frequency bin separately, and vice versa. We evaluate the noise power estimation capability in highly non-stationary and various noise environments, as well as the improvement in segmental Signal-to-Noise Ratio (SNR) and the Itakura-Saito Distortion Measure (ISDM) when the estimator is integrated into a Spectral Subtraction (SS) scheme. The results show that the proposed method is superior to the conventional MS approach.
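
The core mechanism, tracking a sliding per-bin minimum that is biased low and scaling it back up with an SNR-driven sigmoid factor, can be sketched as follows. The window length, $\delta_{max}$, and the shape of the sigmoid coupling are illustrative assumptions; the paper's exact compensation factor and its use of the a priori SAP are not reproduced.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ms_noise_estimate(power, window=8, delta_max=2.5):
    """Track the per-bin minimum of the noisy power over a sliding
    window (biased below the true noise level), then scale it by a
    sigmoid-shaped compensation factor driven by the a posteriori SNR:
    close to delta_max when the frame looks noise-only, close to 1
    when speech dominates."""
    T, F = power.shape
    estimate = np.empty_like(power)
    for t in range(T):
        noise_min = power[max(0, t - window + 1):t + 1].min(axis=0)
        post_snr = power[t] / np.maximum(noise_min, 1e-12)
        bias = 1.0 + (delta_max - 1.0) * sigmoid(2.0 - post_snr)
        estimate[t] = bias * noise_min
    return estimate

rng = np.random.default_rng(4)
noisy_power = rng.chisquare(2, size=(50, 16))   # noise-only toy spectrogram
est = ms_noise_estimate(noisy_power)
```

The point of the nonlinear factor is that the compensation adapts per bin: noise-dominated bins get the full correction, while speech-dominated bins are left nearly untouched so speech is not subtracted away.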

A Pre-Selection of Candidate Units Using Accentual Characteristic In a Unit Selection Based Japanese TTS System (일본어 악센트 특징을 이용한 합성단위 선택 기반 일본어 TTS의 후보 합성단위의 사전선택 방법)

  • Na, Deok-Su;Min, So-Yeon;Lee, Kwang-Hyoung;Lee, Jong-Seok;Bae, Myung-Jin
    • The Journal of the Acoustical Society of Korea / v.26 no.4 / pp.159-165 / 2007
  • In this paper, we propose a new pre-selection of candidate units suited to a unit-selection-based Japanese TTS system. General pre-selection methods calculate a context-dependent cost within an Intonation Phrase (IP). Unlike other languages, however, Japanese has an accent represented by the height of a relative pitch, several words form a single accentual phrase, and the prosody of Japanese changes in accentual-phrase units. By reflecting such prosodic change in pre-selection, the quality of synthesized speech can be improved. Furthermore, calculating the context-dependent cost within the accentual phrase rather than the intonation phrase improves synthesis speed. The proposed method defines the Accentual Phrase (AP), analyzes APs in context, and performs pre-selection using accentual-phrase matching, which calculates the Connected Context Length (CCL) of the candidate units of each phoneme to be synthesized in each accentual phrase. The baseline system used in the proposed method is VoiceText, a synthesizer by Voiceware. Evaluations were made on perceptual errors (intonation errors and concatenation mismatch errors) and synthesis time. Experimental results showed that the proposed method improved the quality of synthesized speech and shortened the synthesis time.
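
CCL-based pre-selection can be illustrated with a toy model in which units are phoneme tuples; the paper's actual matching also uses the accent analysis of the AP, so this is only the skeleton of the idea.

```python
def connected_context_length(target, candidate):
    """CCL: how many consecutive phonemes, from the start of the span
    being synthesized, the candidate unit's recorded context shares
    with the target accentual phrase."""
    n = 0
    for t, c in zip(target, candidate):
        if t != c:
            break
        n += 1
    return n

def preselect(target_ap, candidates, keep=2):
    """Keep only the candidates with the longest matching context for
    the later (expensive) unit-selection search within this accentual
    phrase."""
    return sorted(candidates,
                  key=lambda c: connected_context_length(target_ap, c),
                  reverse=True)[:keep]

best = preselect(("k", "a", "i"),
                 [("t", "a", "i"), ("k", "a", "o"), ("k", "a", "i")])
```

Pruning the candidate set this way is what shortens the synthesis time: the Viterbi search over join and target costs then runs over far fewer units per phoneme.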

Effects of Ultrasonic Scanner Setting Parameters on the Quality of Ultrasonic Images (초음파 진단기의 설정 파라미터가 영상의 질에 미치는 효과)

  • Yang, Jeong-Hwa;Lee, Kyung-Sung;Kang, Gwan-Suk;Paeng, Dong-Guk;Choi, Min-Joo
    • The Journal of the Acoustical Society of Korea / v.27 no.2 / pp.57-65 / 2008
  • The setting parameters of ultrasonic scanners influence the quality of ultrasonic images, so sonographers need to understand their effects in order to obtain optimized images. The present study considered four typical parameters: Time Gain Control (TGC), Gain, Frequency, and Dynamic Range (DR). Low Contrast Sensitivity (LCS) was chosen to compare image quality quantitatively. In the experiment, LCS targets of a standard ultrasonic test phantom (539, ATS, USA) were imaged using a clinical ultrasonic scanner (SA-9000 PRIME, Medison, Korea). Altering the scanner's parameter settings, six LCS target images (+15 dB, +6 dB, +3 dB, -3 dB, -6 dB, -15 dB) were obtained for each setting, and their LCS values were calculated. The results show that the mean pixel value (LCS) is highest at the maximum setting of TGC, at mid-to-maximum Gain, in Pen mode for Frequency, and at 40 dB to 66 dB for DR. Among all images, the highest LCS was obtained at the DR 40 dB setting. These results are expected to be of use in setting the parameters when ultrasonically examining masses often found clinically, in either solid lesions (similar to the +15 dB, +6 dB, +3 dB targets) or cystic lesions (similar to the -15 dB, -6 dB, -3 dB targets).
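
One common way to quantify such low-contrast targets is the dB ratio of mean pixel values inside the target and in the surrounding background; the paper's exact LCS definition is not given in the abstract, so the metric below is only a plausible stand-in.

```python
import numpy as np

def target_contrast_db(image, target_mask, background_mask):
    """Contrast of a low-contrast target: ratio of the mean pixel value
    inside the target to that of the background, expressed in dB."""
    mu_t = image[target_mask].mean()
    mu_b = image[background_mask].mean()
    return 20.0 * np.log10(mu_t / mu_b)

# Toy image: a bright disc (the target) on a uniform background
img = np.ones((64, 64))
yy, xx = np.ogrid[:64, :64]
disc = (yy - 32) ** 2 + (xx - 32) ** 2 < 10 ** 2
img[disc] = 2.0
contrast = target_contrast_db(img, disc, ~disc)
```

A pixel ratio of 2 corresponds to about +6 dB, matching the scale of the phantom's +6 dB target.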