CoNSIST : Consist of New methodologies on AASIST, leveraging Squeeze-and-Excitation, Positional Encoding, and Re-formulated HS-GAL

Jae-Hoon Ha; Joo-Won Mun; Sang-Yup Lee

doi:10.3745/PKIPS.y2024m05a.692

Proceedings of the Korea Information Processing Society Conference (한국정보처리학회:학술대회논문집)

2024.05a / Pages 692-695 / 2024 / 2005-0011 (pISSN) / 2671-7298 (eISSN)

Korea Information Processing Society (한국정보처리학회)

Jae-Hoon Ha (Dept. of Digital Analytics, Yonsei University);
Joo-Won Mun (Dept. of Digital Analytics, Yonsei University);
Sang-Yup Lee (Dept. of Communication, Yonsei University)

Published : 2024.05.23

Abstract

With recent advances in artificial intelligence (AI), deep learning-based audio deepfake technology has improved significantly. The technology has been exploited for criminal activity, resulting in a number of victims. To counter such misuse, this paper proposes CoNSIST, an improved deep learning-based audio deepfake detection model that adds three components to the graph-based end-to-end model AASIST: (i) Squeeze-and-Excitation, (ii) Positional Encoding, and (iii) a Re-formulated HS-GAL. These additions are expected to enable more effective feature extraction, eliminate unnecessary operations, and take more diverse information into account, thereby improving on the performance of the original AASIST. Results from multiple experiments indicate that CoNSIST improves audio deepfake detection performance compared with existing models.
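
As a rough illustration of two of the building blocks named in the abstract, the sketch below shows a generic Squeeze-and-Excitation gate and a fixed sinusoidal positional encoding in plain Python. It is not the authors' implementation: the abstract does not say which positional encoding variant CoNSIST uses or where either component attaches to AASIST, so the sinusoidal form, the function names, the omission of bias terms, and the toy weights are all assumptions made for illustration.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim):
    """Fixed sinusoidal encoding: sin on even indices, cos on odd indices.

    Assumption: the classic Transformer formulation; the abstract does not
    state which variant the model actually uses.
    """
    pe = [[0.0] * dim for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def squeeze_excite(feature_map, w1, w2):
    """Generic Squeeze-and-Excitation gate over channels (biases omitted).

    feature_map: list of C channels, each a flat list of activations.
    w1: (C // r) x C weight rows for the bottleneck layer (hypothetical values).
    w2: C x (C // r) weight rows for the expansion layer (hypothetical values).
    """
    # Squeeze: global average of each channel.
    z = [sum(channel) / len(channel) for channel in feature_map]
    # Excitation: linear -> ReLU -> linear -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every channel by its gate.
    return [[g * v for v in channel] for g, channel in zip(gates, feature_map)]


# Toy usage: two channels, reduction to a single hidden unit.
features = [[1.0, 3.0], [2.0, 2.0]]
reweighted = squeeze_excite(features, w1=[[0.5, -0.5]], w2=[[1.0], [-1.0]])
positions = sinusoidal_positional_encoding(num_positions=4, dim=6)
```

In a real model the weights would be learned and the operations would run on tensors in a deep learning framework; the plain-list version only makes the squeeze, excitation, and scale steps explicit.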
