CoNSIST : Consist of New methodologies on AASIST, leveraging Squeeze-and-Excitation, Positional Encoding, and Re-formulated HS-GAL

  • Jae-Hoon Ha (Dept. of Digital Analytics, Yonsei University)
  • Joo-Won Mun (Dept. of Digital Analytics, Yonsei University)
  • Sang-Yup Lee (Dept. of Communication, Yonsei University)
  • Published: 2024.05.23

Abstract

With recent advances in artificial intelligence (AI), deep learning-based audio deepfake technology has improved significantly. This technology has been exploited for criminal purposes, resulting in numerous victims. To counter such misuse, this paper proposes CoNSIST, an improved deep learning-based audio deepfake detection model that adds three components to the graph-based end-to-end model AASIST: (i) Squeeze-and-Excitation, (ii) Positional Encoding, and (iii) a Reformulated HS-GAL. These additions are intended to enable more effective feature extraction, eliminate unnecessary operations, and take more diverse information into account, thereby improving on the original AASIST. Results from multiple experiments indicate that CoNSIST improves audio deepfake detection performance over existing models.
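For readers unfamiliar with the first two components named in the abstract, the following is a minimal, dependency-free sketch of a generic Squeeze-and-Excitation block (in the sense of Hu et al.) and a fixed sinusoidal positional encoding (in the sense of Vaswani et al.). It is not the CoNSIST implementation: the abstract does not state where these components sit inside AASIST or whether the positional encoding is sinusoidal or learned, so the function names, weight shapes, and the sinusoidal form are all assumptions made for illustration.

```python
import math

def squeeze_excite(feature_map, w1, w2):
    """Generic SE block (illustrative, not the paper's code).

    feature_map: list of channels, each a list of floats.
    w1: bottleneck weights (hidden x channels); w2: expansion weights (channels x hidden).
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    z = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: FC + ReLU, then FC + sigmoid, giving a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_map)]

def sinusoidal_positional_encoding(num_positions, dim):
    """Fixed sinusoidal encoding: sine on even indices, cosine on odd indices."""
    pe = [[0.0] * dim for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Toy inputs (hypothetical values): 2 channels with 3 values each.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w1 = [[0.5, 0.5]]       # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channel gates
y = squeeze_excite(x, w1, w2)
pe = sinusoidal_positional_encoding(4, 6)
```

In this toy run the first channel receives a gate close to 1 and the second a gate close to 0, which illustrates how channel-wise reweighting can emphasize some feature maps over others. The positional encoding would typically be added element-wise to a sequence of feature vectors of the same shape.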
