• Title/Summary/Keyword: 음질평가 (sound quality evaluation)


A Study on the Perception of Foreign Undergraduates on Online Lecture

  • Kim, Yoon-Hee;Lim, Eun-jin
    • Journal of the Korea Society of Computer and Information / v.25 no.9 / pp.203-212 / 2020
  • The purpose of this study is to analyze how foreign learners perceive non-face-to-face online undergraduate lectures, to identify problems with online lectures, and to suggest improvements. To this end, the perceptions of foreign undergraduate students who took online lectures at universities A and B were surveyed and analyzed. On this basis, I explored the design direction and complementary measures for online lectures to be held at Korean universities in the future. The results show that non-real-time lectures delivered through the E campus were seen as advantageous because students could study repeatedly and listen to lectures at home, while real-time lectures using Zoom were seen as advantageous because professors and learners could communicate with each other. Both types of online lecture involved many tasks, and students had difficulty focusing on the lecture until the end. The findings indicate that the amount of lecture content and the number of tasks should be reduced and that the picture condition and sound quality of the lecture videos should be improved. As for the evaluation method, students preferred online evaluation to offline evaluation and relative evaluation to absolute evaluation. The results of this study give a close view of how learners perceive online lectures and of the points that need to be improved when conducting them. They are expected to contribute to the design direction of online lectures and the development of online content at each university.

One-shot multi-speaker text-to-speech using RawNet3 speaker representation (RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템)

  • Sohee Han;Jisub Um;Hoirin Kim
    • Phonetics and Speech Sciences / v.16 no.1 / pp.67-76 / 2024
  • Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, reaching a level where it can closely imitate natural human speech. In particular, TTS models that offer various voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutors, advertising, and video dubbing. Accordingly, in this paper, we propose a one-shot multi-speaker TTS system that can ensure acoustic diversity and synthesize personalized voices by generating speech from the utterances of unseen target speakers. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. Furthermore, the proposed approach includes not only an English one-shot multi-speaker TTS but also a Korean one. We evaluate the naturalness and speaker similarity of the generated speech using objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. The objective evaluation of the proposed English and Korean one-shot multi-speaker TTS showed a prediction MOS (P-MOS) of 2.54 and 3.74, respectively. These results indicate that the proposed model improves on the baseline models in terms of both naturalness and speaker similarity.

Salience of Envelope Interaural Time Difference of High Frequency as Spatial Feature (공간감 인자로서의 고주파 대역 포락선 양이 시간차의 유효성)

  • Seo, Jeong-Hun;Chon, Sang-Bae;Sung, Koeng-Mo
    • The Journal of the Acoustical Society of Korea / v.29 no.6 / pp.381-387 / 2010
  • Both timbral features and spatial features are important in the assessment of multichannel audio coding systems. Choi et al. proposed a prediction model that extends ITU-R Rec. BS.1387-1 to multichannel audio coding systems by using spatial features such as ITDDist (Interaural Time Difference Distortion), ILDDist (Interaural Level Difference Distortion), and IACCDist (InterAural Cross-correlation Coefficient Distortion). In that model, ITDDists were computed only for low frequency bands (below 1500 Hz), and ILDDists were computed only for high frequency bands (over 2500 Hz), in accordance with classical duplex theory. However, in the high frequency range, information in the temporal envelope is also important in spatial perception, especially in sound localization. This paper introduces a new model that computes the ITD distortions of temporal envelopes in high frequency components in order to investigate the role of such ITDs in spatial perception quantitatively. The computed ITD distortions of temporal envelopes in high frequency components were highly correlated with the perceived sound quality of multichannel audio.

A Fast Normalized Cross-Correlation Computation for WSOLA-based Speech Time-Scale Modification (WSOLA 기반의 음성 시간축 변환을 위한 고속의 정규상호상관도 계산)

  • Lim, Sangjun;Kim, Hyung Soon
    • The Journal of the Acoustical Society of Korea / v.31 no.7 / pp.427-434 / 2012
  • The waveform similarity based overlap-add (WSOLA) method is known to be an efficient, high-quality algorithm for time scaling of speech signals. The computational load of WSOLA is concentrated in the repeated normalized cross-correlation (NCC) calculation that evaluates the similarity between two signal waveforms. To reduce the computational complexity of WSOLA, this paper proposes a fast NCC computation method in which NCC is obtained through pre-calculated sum tables, eliminating the redundancy of repeated NCC calculations in adjacent regions. While the denominator of NCC has much redundancy irrespective of the time-scale factor, the numerator has less redundancy, and the amount of redundancy depends on both the time-scale factor and the optimal shift value, so it requires a more sophisticated algorithm for fast computation. The simulation results show that the proposed method reduces the WSOLA execution time by about 40%, 47%, and 52% for time-scale compression, 2-times time-scale expansion, and 3-times time-scale expansion, respectively, while maintaining exactly the same speech quality as conventional WSOLA.
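
The denominator redundancy mentioned in the abstract above can be illustrated with a small sketch. It assumes the usual NCC definition (inner product divided by the square root of the product of the two segment energies) and uses a prefix-sum table of squared samples so that each candidate segment's energy costs one subtraction. The function names (`energy_table`, `best_shift`, `ncc_direct`) are hypothetical, and the paper's more elaborate numerator tables are not reproduced here; the numerator is still computed directly.

```python
from itertools import accumulate
from math import sqrt


def energy_table(x):
    """Prefix sums of squared samples: table[i] == sum(s*s for s in x[:i])."""
    return [0.0] + list(accumulate(s * s for s in x))


def ncc_direct(ref, x, k):
    """Reference NCC between ref and x[k:k+len(ref)], with no tables."""
    seg = x[k:k + len(ref)]
    num = sum(a * b for a, b in zip(ref, seg))
    den = sqrt(sum(a * a for a in ref) * sum(b * b for b in seg))
    return num / den if den > 0 else 0.0


def best_shift(ref, x, max_shift):
    """Return (shift, ncc) maximizing NCC over shifts 0..max_shift.

    Only the denominator uses the precomputed table (an illustration of
    the idea, not the paper's full method).
    """
    n = len(ref)
    table = energy_table(x)
    ref_energy = sum(a * a for a in ref)
    best_k, best_val = 0, float("-inf")
    for k in range(max_shift + 1):
        seg_energy = table[k + n] - table[k]  # O(1) sliding energy
        num = sum(ref[i] * x[k + i] for i in range(n))
        den = sqrt(ref_energy * seg_energy)
        val = num / den if den > 0 else 0.0
        if val > best_val:
            best_k, best_val = k, val
    return best_k, best_val
```

The caller must keep `max_shift + len(ref)` within `len(x)`. Because the table depends only on the input signal, it can be built once and reused for every search region.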

Implementation of the High-Quality Audio System with the Separately Processed Musical Instrument Channels (악기별 분리처리를 통한 고음질 오디오 시스템 구현)

  • Kim, Tae-Hoon;Lee, Sang-Hak;Kim, Dae-Kyung;Lee, Sang-Chan
    • The Journal of the Acoustical Society of Korea / v.32 no.4 / pp.346-353 / 2013
  • This paper deals with the implementation of a high-quality audio system for karaoke. To improve the performance of key/tempo changes, we separated the audio into many musical instrument channels. Separating the instrument channels makes high-quality key/tempo changes possible, which we confirmed using the cross-correlation distribution and MOS evaluation. The improved audio system was implemented on the TMS320C6747 DSP with fixed/floating-point operations. The implemented audio system can perform multi-channel WMA decoding, MP3 encoding/decoding, wav playback, EQ, and key/tempo changes in real time. The WMA channels are used for processing the separated instrument channels. The audio system includes the MP3 encoding/decoding function for playing and recording and the wav channel for sound effects.

Transcoding Algorithm for AMR and EVRC Vocoders Via Direct Parameter Transformation (AMR과 EVRC 음성부호화기를 위한 파라미터 직접 변환 방식의 상호부호화 알고리듬)

  • Lee, Sun-Il;Yu, Chang-Dong
    • Journal of the Institute of Electronics Engineers of Korea SP / v.39 no.6 / pp.696-708 / 2002
  • In this paper, a novel transcoding algorithm for the Adaptive Multi Rate (AMR) and the Enhanced Variable Rate Codec (EVRC) vocoders via direct parameter transformation is proposed. In contrast to the conventional tandem transcoding algorithm, the proposed algorithm converts the parameters of one coder to those of the other without going through the decoding and encoding processes. The proposed algorithm consists of parameter decoding, frame classification, mode decision, and transcoders for two frame types. The transcoders convert parameters such as LSP, frame energy, pitch delay for the adaptive codebook, fixed codebook vector, and codebook gains. Evaluation results show that the proposed algorithm has better computational and delay characteristics while producing speech quality equivalent to that of the tandem transcoding algorithm.

Real-time implementation of the 2.4kbps EHSX Speech Coder Using a TMS320C6701™ DSP Core (TMS320C6701™을 이용한 2.4kbps EHSX 음성 부호화기의 실시간 구현)

  • 양용호;이인성;권오주
    • The Journal of Korean Institute of Communications and Information Sciences / v.29 no.7C / pp.962-970 / 2004
  • This paper presents an efficient implementation of the 2.4 kbps EHSX (Enhanced Harmonic Stochastic Excitation) speech coder on a TMS320C6701™ floating-point digital signal processor. The EHSX speech codec is based on harmonic and CELP (Code Excited Linear Prediction) modeling of the excitation signal for voiced and unvoiced speech frames, respectively. In this paper, we present optimization methods that reduce the complexity for real-time implementation. The complexity of the filtering in the CELP algorithm, which accounts for the main part of the EHSX algorithm's complexity, can be reduced by converting the program from floating-point variables to fixed-point variables. We also present efficient optimization methods including code allocation that takes the DSP architecture into account and a low-complexity algorithm for the harmonic/pitch search in the encoder. Finally, we obtained a subjective quality of MOS 3.28 in a speech quality test using PESQ (perceptual evaluation of speech quality, ITU-T Recommendation P.862) and achieved the goal of real-time operation of the EHSX codec.

A Study on 8kbps FBD-MPC Method Considering Low Bit Rate (Low Bit Rate을 고려한 8kbps FBD-MPC 방식에 관한 연구)

  • Lee, See-Woo
    • Journal of Digital Convergence / v.12 no.6 / pp.271-276 / 2014
  • In a speech coding system that uses voiced and unvoiced excitation sources, speech quality is distorted when voiced sounds and unvoiced consonants coexist in a frame. In this paper, I propose an 8 kbps multi-pulse speech coding method (FBD-MPC: Frequency Band Division MPC) that uses TSIUVC (Transition Segment Including Unvoiced Consonant) searching, extraction, and approximation-synthesis in the frequency domain. I evaluate the 8 kbps MPC and FBD-MPC. As a result, the SNRseg of FBD-MPC improved by 0.5 dB for female voice and 0.2 dB for male voice. Because the SNRseg of FBD-MPC improved over that of MPC, I was able to control the distortion of the speech waveform. I therefore expect this method to be applicable to cellular phones and smartphones that use low-bit-rate excitation sources.
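
As a rough illustration of the SNRseg figure quoted above, the sketch below computes a segmental SNR under the common textbook definition: the per-frame SNR in dB, averaged over frames. The frame length of 160 samples, the skipping of silent frames, and the name `snr_seg` are assumptions made for this example; the abstract does not state the exact settings used, and per-frame clamping, which some definitions apply, is left out.

```python
from math import log10


def snr_seg(clean, coded, frame_len=160, eps=1e-12):
    """Average of per-frame SNRs in dB between a clean and a coded signal.

    Frames whose clean-signal energy is (near) zero are skipped, which is
    an assumption of this sketch rather than a setting from the paper.
    """
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        y = coded[start:start + frame_len]
        signal_energy = sum(a * a for a in s)
        error_energy = sum((a - b) ** 2 for a, b in zip(s, y))
        if signal_energy <= eps:
            continue  # silent frame: SNR is undefined
        values.append(10.0 * log10(signal_energy / max(error_energy, eps)))
    return sum(values) / len(values) if values else float("nan")
```

For instance, a constant signal of 1.0 coded as 0.9 has an error energy of 1% of the signal energy in every frame, so `snr_seg` returns about 20 dB.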

Robust Speech Enhancement Based on Soft Decision Employing Spectral Deviation (스펙트럼 변이를 이용한 Soft Decision 기반의 음성향상 기법)

  • Choi, Jae-Hun;Chang, Joon-Hyuk;Kim, Nam-Soo
    • Journal of the Institute of Electronics Engineers of Korea SP / v.47 no.5 / pp.222-228 / 2010
  • In this paper, we propose a new approach to noise estimation that incorporates spectral deviation into a soft decision scheme to enhance the intelligibility of degraded speech signals in non-stationary noisy environments. The conventional noise estimation technique based on soft decision estimates and updates the noise power spectrum using a fixed smoothing parameter that assumes a stationary noisy environment. It is therefore difficult to obtain robust estimates of the noise power spectrum in non-stationary noisy environments, such as a restaurant, where the spectral characteristics of the noise change constantly. In the proposed approach, we first classify the environment as stationary or non-stationary noise based on an analysis of the spectral deviation of the noise signal, and then adaptively estimate and update the noise power spectrum according to the classified noise type. The performance of the proposed algorithm is evaluated with ITU-T P.862 perceptual evaluation of speech quality (PESQ) under various ambient noise environments and is better than that of the conventional method.
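
The idea of switching the smoothing parameter by noise type can be sketched as follows. This is only an illustration under stated assumptions: the update is the widely used soft-decision recursion, in which the noise power estimate is smoothed toward a mix of the observed power and the previous estimate weighted by the speech presence probability. The deviation measure, the threshold, and the two smoothing values are hypothetical placeholders and are not taken from the paper.

```python
def spectral_deviation(frames):
    """Mean absolute deviation of each bin's power from its time average.

    An assumed definition for illustration; `frames` is a list of
    per-frame power spectra (lists of equal length).
    """
    n_frames, n_bins = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_bins)]
    total = sum(abs(f[k] - means[k]) for f in frames for k in range(n_bins))
    return total / (n_frames * n_bins)


def pick_smoothing(deviation, threshold=1.0, stationary=0.98, nonstationary=0.8):
    """Choose a slow update for stationary noise, a fast one otherwise.

    All three default values are hypothetical.
    """
    return stationary if deviation < threshold else nonstationary


def update_noise(prev_noise, power, p_speech, smoothing):
    """One soft-decision update of the per-bin noise power estimate."""
    updated = []
    for lam, y2, p in zip(prev_noise, power, p_speech):
        # Expected noise power given the observation: the observed power
        # if speech is absent, the previous estimate if speech is present.
        expected = (1.0 - p) * y2 + p * lam
        updated.append(smoothing * lam + (1.0 - smoothing) * expected)
    return updated
```

With a smaller smoothing value the estimate follows the observed power more quickly, which is the behavior one would want when the noise spectrum keeps changing.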

A New MPEG Reference Model for Unified Speech and Audio Coding (통합 음성/오디오 부호화를 위한 새로운 MPEG 참조 모델)

  • Song, Jeong-Ook;Oh, Hyen-O;Kang, Hong-Goo
    • Journal of the Institute of Electronics Engineers of Korea SP / v.47 no.5 / pp.74-80 / 2010
  • Speech and audio codecs have been developed on the basis of different types of coding technologies because their signals and applications have different characteristics. In step with the convergence of broadcasting and telecommunication systems, international standardization organizations such as 3GPP and ISO/IEC MPEG have tried to compress and transmit multimedia signals using unified codecs. MPEG recently initiated an activity to standardize USAC (Unified Speech and Audio Coding). However, the USAC RM (Reference Model) software has been problematic because it has a complex hierarchy, much unused source code, and a poor-quality encoder. To solve these problems, this paper introduces new RM software designed with an open source paradigm. It was presented at the MPEG meeting in April 2010, and the source code was released in June.