Comparison of Korean Speech De-identification Performance of Speech De-identification Model and Broadcast Voice Modulation

Seung Min Kim; Dae Eol Park; Dae Seon Choi

doi:10.30693/SMJ.2023.12.2.56

Smart Media Journal (스마트미디어저널)

Volume 12, Issue 2 / Pages 56-65 / 2023 / 2287-1322 (pISSN) / 2288-9671 (eISSN)

THE KOREAN INSTITUTE OF SMART MEDIA (한국스마트미디어학회)



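Seung Min Kim (School of Software, Soongsil University);
Dae Eol Park (School of Software, Soongsil University);
Dae Seon Choi (School of Software, Soongsil University)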

Received : 2023.03.02
Accepted : 2023.03.22
Published : 2023.03.31

https://doi.org/10.30693/SMJ.2023.12.2.56

Abstract

In broadcasts such as news and investigative programs, an informant's voice is modulated to protect their identity. The most commonly used modulation method is pitch adjustment, but a pitch-shifted voice can easily be restored to something close to the original simply by adjusting the pitch again. Broadcast voice modulation therefore fails to protect the speaker's identity properly and is vulnerable from a security standpoint, so a new voice modulation method is needed to replace it. In this paper, we take the Lightweight speech de-identification model, whose de-identification performance was verified in the Voice Privacy Challenge, as the comparison model and evaluate its speech de-identification performance against the broadcast voice modulation method based on pitch adjustment. Of the six modulation methods in the Lightweight model, we use the three with good de-identification performance: McAdams, Resampling, and Vocal Tract Length Normalization (VTLN). To compare de-identification performance on Korean speech, we conduct a human test and an Equal Error Rate (EER) test. In both tests, the VTLN modulation method showed higher de-identification performance than broadcast modulation. We conclude that the modulation methods of the Lightweight model provide sufficient de-identification performance for Korean speech and can replace the security-vulnerable broadcast voice modulation.
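The abstract does not describe how the EER was computed, so the following is only a minimal sketch of the general idea, assuming a speaker verification system that outputs one similarity score per trial (higher means "more likely the same speaker"). The function name `equal_error_rate` and the sample scores are illustrative assumptions and do not come from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for two lists of similarity scores.

    genuine  -- scores of trials where both utterances share a speaker
    impostor -- scores of trials where the speakers differ
    Assumption: a trial is accepted when its score is >= threshold.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one value above
    # the maximum so that "reject every trial" is also considered.
    candidates = sorted(set(genuine) | set(impostor))
    candidates.append(candidates[-1] + 1.0)

    best = None
    for t in candidates:
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.65, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)  # prints 0.25 0.65
```

In a de-identification setting such as the one compared here, the genuine trials would pair a modulated utterance with the same speaker's enrolled voice. A higher EER then means the verification system has more difficulty linking the modulated voice to its speaker, which is read as stronger de-identification.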

Acknowledgement

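This research was supported in 2023 by the Institute of Information & Communications Technology Planning & Evaluation (IITP) with funding from the Korean government (Ministry of Science and ICT) (No. 2021-0-00511, Development of Robust AI and Distributed Attack Detection Technology for Edge AI Security), and in 2023 by the National Research Foundation of Korea (NRF) with funding from the Korean government (Ministry of Science and ICT) (No. 2020R1A2C1014813).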

