Search | Korea Science

Zero-shot voice conversion with HuBERT

Hyelee Chung;Hosung Nam
- Phonetics and Speech Sciences / v.15 no.3 / pp.69-74 / 2023
This study introduces an innovative model for zero-shot voice conversion that utilizes the capabilities of HuBERT. Zero-shot voice conversion models can transform the speech of one speaker to mimic that of another, even when the model has not been exposed to the target speaker's voice during the training phase. Comprising five main components (HuBERT, feature encoder, flow, speaker encoder, and vocoder), the model offers remarkable performance across a range of scenarios. Notably, it excels in the challenging unseen-to-unseen voice-conversion tasks. The effectiveness of the model was assessed based on the mean opinion scores and similarity scores, reflecting high voice quality and similarity to the target speakers. This model demonstrates considerable promise for a range of real-world applications demanding high-quality voice conversion. This study sets a precedent in the exploration of HuBERT-based models for voice conversion, and presents new directions for future research in this domain. Despite its complexities, the robust performance of this model underscores the viability of HuBERT in advancing voice conversion technology, making it a significant contributor to the field.
https://doi.org/10.13064/KSSS.2023.15.3.069

A CELP Coder using the Band-Divided Long Term Prediction (대역 분할 장구간 예측을 이용한 CELP 부호화기)

Choi, Young-Soo;Kang, Hong-Goo;Lim, Myoung-Seob;Ahn, Dong-Soon;Youn, Dae-Hee
- The Journal of the Acoustical Society of Korea / v.14 no.4 / pp.38-45 / 1995
This paper proposes a way to improve long-term prediction by combining the Multi-band Excitation (MBE) method with the Code-Excited Linear Prediction (CELP) method at low bit rates below 4.8 kbps. In the proposed method, multiband long-term prediction is performed on the periodic components that still remain after the long-term prediction of the conventional CELP method. The whole frequency range is divided into subbands whose width equals the spacing between the harmonics of the fundamental frequency, and the periodic multiband excitation signals are represented as a sum of sine waves approximately as large as the spectrum of the excitation signals, so that the actual characteristics of the excitation signals are better taken into account. To evaluate the proposed method, computer simulation was performed at 4.8 kbps. The 4.8 kbps DoD CELP and the 4.4 kbps IMBE coders were chosen as the reference vocoders for the speech quality measure. The perceptual speech quality measure showed that the proposed method performs better than the 4.8 kbps DoD CELP vocoder and similarly to the 4.4 kbps IMBE vocoder.

Real-Time Implementation of Acoustic Echo Canceller for Mobile Handset Using TeakLite DSP Core (Teaklite DSP Core 를 이용한 이동통신 단말기용 음향반향제거기의 실시간 구현)

Gwon, Hong-Seok;Kim, Si-Ho;Jang, Byeong-Uk;Bae, Geon-Seong
- Journal of the Institute of Electronics Engineers of Korea SP / v.39 no.2 / pp.128-136 / 2002
In this paper, we develop a real-time acoustic echo canceller on the TeakLite DSP Core, intended for the vocoder chip of a mobile handset. Considering the limited computational capacity allotted to the acoustic echo canceller in a vocoder chip, we employ an FIR adaptive filter with the conventional NLMS algorithm. We first designed and implemented the echo canceller in floating-point C source code, then converted it to fixed-point format through integer simulation, and finally programmed and optimized it at the assembly level so that it runs in real time. After optimization, the implemented echo canceller occupies approximately 624 words of program memory and 811 words of data memory. At an 8 kHz sampling rate with 256 filter taps, which corresponds to 32 ms of echo delay, it requires 14.12 MIPS of computational capacity. Covering a 16 ms echo delay, i.e., 128 filter taps, requires 9 MIPS.
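The FIR adaptive filter with NLMS described in this abstract can be sketched in a few lines. This is a minimal floating-point illustration, not the authors' fixed-point assembly implementation; the function names, the step size `mu`, and the regularizer `eps` are assumptions chosen for the example.

```python
def nlms_echo_canceller(far_end, mic, num_taps=256, mu=0.5, eps=1e-6):
    """Cancel the echo of `far_end` contained in `mic` with an NLMS FIR filter.

    Returns (error_signal, weights). All parameter values are illustrative.
    """
    w = [0.0] * num_taps        # adaptive FIR weights (echo path estimate)
    x_buf = [0.0] * num_taps    # recent far-end samples, newest first
    out = []
    for x, d in zip(far_end, mic):
        # Shift the newest far-end sample into the delay line.
        x_buf.insert(0, x)
        x_buf.pop()
        # Estimated echo = FIR filter output.
        y = sum(wi * xi for wi, xi in zip(w, x_buf))
        e = d - y               # residual after echo removal
        # NLMS update: step size normalized by the input energy.
        norm = sum(xi * xi for xi in x_buf) + eps
        g = mu * e / norm
        for i in range(num_taps):
            w[i] += g * x_buf[i]
        out.append(e)
    return out, w


def echo_span_ms(num_taps, sample_rate_hz):
    """Echo delay covered by an FIR filter of `num_taps` taps, in milliseconds."""
    return 1000.0 * num_taps / sample_rate_hz
```

With the figures quoted in the abstract, `echo_span_ms(256, 8000)` gives 32.0 and `echo_span_ms(128, 8000)` gives 16.0, matching the stated 32 ms and 16 ms coverage.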

Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning (딥러닝 기반 한국어 실시간 TTS 기술 비교)

Kwon, Chul Hong
- The Journal of the Convergence on Culture Technology / v.7 no.1 / pp.640-645 / 2021
A deep-learning-based end-to-end TTS system consists of a Text2Mel module, which generates a spectrogram from text, and a vocoder module, which synthesizes the speech signal from the spectrogram. Applying deep learning to TTS has recently improved the intelligibility and naturalness of synthesized speech to a level close to human speech. Its disadvantage is that inference is very slow compared to conventional methods. Inference speed can be improved with non-autoregressive methods, which generate speech samples in parallel, independently of previously generated samples. In this paper, we introduce FastSpeech, FastSpeech 2, and FastPitch as Text2Mel technologies, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as non-autoregressive vocoder technologies, and we implement them to verify whether they can run in real time. The measured real-time factor (RTF) shows that all the presented methods are capable of real-time processing. Except for WaveGlow, the trained models are tens to hundreds of megabytes in size and can be deployed in memory-limited embedded environments.
https://doi.org/10.17703/JCCT.2021.7.1.640
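The real-time factor mentioned in the abstract above is commonly computed as processing time divided by the duration of the audio produced, with values below 1 meaning faster than real time. The sketch below assumes that convention; the paper's exact measurement procedure is not given here, and `synthesize` is a hypothetical callable standing in for any Text2Mel plus vocoder pipeline.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate_hz):
    """Time a hypothetical `synthesize(text)` that returns a list of samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate_hz)
```

For instance, a system that takes 0.5 s to generate 5 s of speech has an RTF of 0.1 under this definition.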

The Research of Improving The Performance of the G.723.1 MP-MLQ Vocoder (G.723.1 MP-MLQ 부호화기의 성능개선에 관한 연구)

Min SoYeon;Na DuckSn;Kim JeongJin;BAE MyungJin
- Proceedings of the Acoustical Society of Korea Conference / autumn / pp.49-52 / 1999
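Among the CELP-family speech coders that provide good speech quality at bit rates around 4.8 kbps, G.723.1, developed for Internet telephony and video conferencing, includes two coders: 5.3 kbps ACELP (Algebraic CELP) and 6.3 kbps MP-MLQ (Multi-Pulse Maximum Likelihood Quantization) [1]. Of these, MP-MLQ is difficult to implement in real time because of the heavy computation required by the fixed codebook search. To address this problem, this paper proposes a method that separates voiced and unvoiced speech and then determines the grid bit before searching the codebook. After voiced/unvoiced separation using the distribution characteristics of the LSP parameters, only spectral information is transmitted for unvoiced speech, and the codebook search is performed only for voiced speech. The codebook search is carried out with the grid bit determined first. The grid bit was determined by comparing the speech synthesized from all the even- or odd-numbered pulses with the original speech after removal of its DC component. In experiments, the total processing time decreased by about 20.55% on average, and subjective quality evaluation showed almost no degradation in speech quality.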

The Research about Voice Transmission between CDMA Network and PSTN Network Using CDMA Circuit Data Service (CDMA 회선 데이터 서비스를 이용한 CDMA망과 PSTN 망간의 음성 전송에 관한 연구)

Park, Yong-Seok;Ahn, Jae-Hwan;Ryou, Jae-Cheol
- The KIPS Transactions:PartC / v.15C no.5 / pp.367-374 / 2008
To provide voice privacy between a CDMA mobile phone and a PSTN terminal, voice frames must be transmitted transparently across the heterogeneous networks. To satisfy this requirement, we propose a method that transmits voice frames in real time over the CDMA circuit data channel. We analyze the causes of the voice delay that occurs during transmission over the circuit data channel and, to overcome it, propose a technique for controlling the TCP control flag and a variable audio block construction algorithm that adapts to the vocoder output rate. Experiments with the proposed method confirmed that the transit delay improved by about 70% on average.
https://doi.org/10.3745/KIPSTC.2008.15-C.5.367

A Study on Delta Pitch Searching of CELP Vocoder using the Symmetry of Correlation (상관관계 대칭성을 이용한 CELP 보코더의 델타피치 검색에 관한 연구)

Jung Hyun Uk;Min So Yeon;Bae Myung Jin
- Proceedings of the Acoustical Society of Korea Conference / autumn / pp.119-122 / 2004
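G.723.1 provides high speech quality at low bit rates, but the analysis-by-synthesis structure of CELP-type coders demands considerable processing time and computation. This paper proposes a method that applies the NAMDF function to G.723.1 to reduce the computation of the delta pitch search and thereby the overall computation of the coder. The autocorrelation function used for pitch detection in conventional algorithms produces a large bit dynamic range in its multiplications, so the division that follows also requires excessive computation. To reduce this load, the NAMDF method was applied in place of the autocorrelation function, together with an additional skipping technique. Computation was reduced by about 64%; the pitch contours of both the conventional and the proposed methods were similar to that of the original speech, and the quality evaluation gave results similar to the synthesized speech of the conventional G.723.1 coder.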
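To illustrate why a magnitude-difference function is cheaper than autocorrelation, the sketch below implements the plain average magnitude difference function (AMDF), which needs only subtractions and absolute values per sample. It is not the paper's NAMDF or its skipping scheme, whose details are not given in the abstract; the `step` parameter is an assumed, illustrative way of skipping samples to trade accuracy for fewer operations.

```python
def amdf(frame, lag, step=1):
    """Average magnitude difference at `lag`, using every `step`-th sample."""
    n = len(frame) - lag
    idx = range(0, n, step)
    # No multiplications inside the sum: only subtract and absolute value.
    return sum(abs(frame[i] - frame[i + lag]) for i in idx) / len(idx)


def estimate_pitch_lag(frame, min_lag, max_lag, step=1):
    """Return the lag in [min_lag, max_lag] that minimizes the AMDF."""
    return min(range(min_lag, max_lag + 1), key=lambda k: amdf(frame, k, step))
```

For a periodic frame the AMDF dips toward zero at the pitch period, so the minimizing lag serves as the pitch estimate. The search range should exclude multiples of the true period, which produce equally deep dips.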

A Study on a Robust Voice Activity Detector Under the Noise Environment in the G.723.1 Vocoder (G.723.1 보코더에서 잡음환경에 강인한 음성활동구간 검출기에 관한 연구)

이희원;장경아;배명진
- The Journal of the Acoustical Society of Korea / v.21 no.2 / pp.173-181 / 2002
A serious problem in Voice Activity Detection (VAD) is detecting speech regions in noisy environments. This paper therefore proposes a new method that uses energy and LSP variation. In terms of processing time and speech quality, the proposed algorithm reduces processing time through accurate detection of inactive periods, and the subjective quality test shows almost no difference. In terms of bit rate, counting the frames with VAD=1 shows a marked reduction in bit rate when the SNR of the noisy speech is low (about 5 to 10 dB).

Erlang Capacity Calculation for the Mixed Traffic of 3G1x CDMA Wireless Networks Integration for Voice over Internet Protocol (음성 및 데이터를 포함하는 이동통신 혼합 트래픽의 Erlang 용량 산출방법)

Chung, H.K.
- Electronics and Telecommunications Trends / v.17 no.5 s.77 / pp.37-46 / 2002
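In mobile communications, voice transmission using variable-rate vocoders and VoX techniques is the general trend for efficient use of radio resources, and for bursty packet traffic, statistical multiplexing is used to maximize use of the radio channel. Erlang capacity, which expresses traffic density, is based on the number of dedicated circuits that constant-rate circuit-switched traffic can occupy simultaneously; for data packets handled by statistical multiplexing, data throughput based on a queuing model is therefore the more realistic measure of traffic density. However, when a mobile system simultaneously offers mixed circuit-type and packet-type services with different traffic characteristics, dimensioning the component systems for network planning requires a unified expression of traffic density. Converting between Erlang capacity and data throughput makes it possible to choose the capacity expression appropriate for dimensioning each network element. This article defines a general teletraffic system model and its parameters to describe a communication system as a traffic processor. It also introduces the conversion relationship between Erlang capacity and data throughput for computing traffic density in a mixed voice and non-voice traffic environment. Finally, it describes a method for calculating the mixed-traffic Erlang capacity needed for base-station CE dimensioning when voice and HSPD services coexist in a 3G1x radio access environment.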
https://doi.org/10.22648/ETRI.2002.J.170504
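For circuit-switched traffic, the Erlang capacity of a fixed number of dedicated circuits is usually obtained from the Erlang B loss formula. The sketch below assumes that model; the abstract does not say which formula the article uses, and it does not cover the conversion to data throughput. The function names and the bisection tolerance are illustrative.

```python
def erlang_b(offered_erlangs, channels):
    """Blocking probability for `offered_erlangs` of traffic on `channels` circuits."""
    b = 1.0  # B(E, 0) = 1
    for m in range(1, channels + 1):
        # Standard recursion: B(E, m) = E*B(E, m-1) / (m + E*B(E, m-1))
        b = offered_erlangs * b / (m + offered_erlangs * b)
    return b


def erlang_capacity(channels, target_blocking, tol=1e-9):
    """Largest offered load (Erlangs) whose blocking stays below the target."""
    if channels < 1 or not 0.0 < target_blocking < 1.0:
        raise ValueError("need channels >= 1 and 0 < target_blocking < 1")
    lo, hi = 0.0, 1.0
    # Grow the upper bound until it exceeds the blocking target.
    while erlang_b(hi, channels) < target_blocking:
        hi *= 2.0
    # Bisection works because blocking increases monotonically with load.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if erlang_b(mid, channels) < target_blocking:
            lo = mid
        else:
            hi = mid
    return lo
```

As a usage example, `erlang_capacity(10, 0.02)` returns the load that ten circuits can carry at 2% blocking, which is a little over 5 Erlangs.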

A Study on a Reduction of the Transmission Bit Rate by the U/V Decision Using LSP in the CELP Vocoder (LSP를 이용한 음성신호의 성분분리에 의한 CELP 보코더의 전송률 감소에 관한 연구)

Na DuckSu;Park YoungHo;Jeong Chan Jung;Bae MyungJin
- Proceedings of the Acoustical Society of Korea Conference / spring / pp.61-64 / 1999
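Conventional CELP vocoders process unvoiced speech in the same way as voiced speech, with no separate handling. Although the two differ in the speech production model (an impulse train for voiced speech and random noise for unvoiced speech), treating them identically degrades the synthesized speech and wastes computation and bit rate. When a U/V (unvoiced/voiced) classifier is used, the degree of degradation in the synthesized speech varies widely with the classifier's performance. This paper proposes a method for reducing the bit rate of a CELP vocoder using a U/V classifier that minimizes both the error rate and the preprocessing computation. A CELP vocoder extracts spectral information as LPC parameters and then converts them into LSP (Line Spectrum Frequency) parameters for transmission. The new U/V classifier uses these LSP parameters, making the U/V decision from their distribution in the frequency domain and their spacing. The proposed method was applied to the 5.3 kbps ACELP coder for performance evaluation. In experiments, the bit rate was reduced by 5.6% (280 bps) with no degradation in speech quality.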
