Enhancing Speech Recognition with Whisper-tiny Model: A Scalable Keyword Spotting Approach

  • Shivani Sanjay Kolekar (Dept. of Artificial Intelligence Convergence, Chonnam National University Gwangju) ;
  • Hyeonseok Jin (Dept. of Artificial Intelligence Convergence, Chonnam National University Gwangju) ;
  • Kyungbaek Kim (Dept. of Artificial Intelligence Convergence, Chonnam National University Gwangju)
  • Published: 2024.05.23

Abstract

The effective implementation of automatic speech recognition (ASR) systems requires keyword spotting models that are both responsive and resource-efficient. Detecting user interactions locally is crucial because it allows audio data to be transmitted to cloud services only selectively, thereby reducing operational costs and mitigating the privacy risks of continuous data streaming. In this paper, we address these needs by fine-tuning the Whisper-Tiny model to recognize keywords from the Google Speech Commands dataset, which comprises 65,000 audio clips of spoken keyword commands. By adapting the model's encoder and appending a lightweight classification head, we ensure that it operates within the limited resource constraints of local devices. The proposed model achieves a test accuracy of 92.94%. This architecture demonstrates its efficiency as an on-device model under stringent resource constraints, enhancing accessibility in everyday speech recognition applications.
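The encoder-plus-classification-head design described above can be sketched in plain PyTorch. The sketch below mirrors Whisper-tiny's encoder dimensions (80 mel bins, d_model=384, 4 Transformer layers, 6 heads, and a two-conv frontend with stride 2); the mean-pooling strategy, the 35-way output (the label count in Speech Commands v2), and all layer choices in the head are assumptions for illustration, not details taken from the paper. In practice the encoder weights would be initialized from the pretrained Whisper-tiny checkpoint rather than trained from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeywordSpotter(nn.Module):
    """Hypothetical sketch: a Whisper-tiny-style encoder with a
    lightweight linear classification head appended for keyword spotting."""

    def __init__(self, num_classes=35, n_mels=80, d_model=384,
                 n_layers=4, n_heads=6):
        super().__init__()
        # Whisper-style conv frontend: two 1-D convs, the second
        # halving the time resolution with stride 2.
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3,
                               stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Lightweight classification head over the pooled encoder output.
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, mel):                 # mel: (batch, n_mels, frames)
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        x = x.transpose(1, 2)               # (batch, frames // 2, d_model)
        x = self.encoder(x)
        x = x.mean(dim=1)                   # mean-pool over time
        return self.head(x)                 # (batch, num_classes) logits


model = KeywordSpotter(num_classes=35)
logits = model(torch.randn(2, 80, 100))     # 2 clips of 100 mel frames
print(logits.shape)                         # torch.Size([2, 35])
```

For fine-tuning, the encoder could be kept frozen (or partially unfrozen) while the head is trained with cross-entropy loss on the keyword labels, which keeps the trainable parameter count small enough for on-device deployment.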

Keywords

Acknowledgement

This work was supported by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2024-00156287, 50%). This work was also supported by the IITP under the Artificial Intelligence Convergence Innovation Human Resources Development grant funded by the Korea government (MSIT) (IITP-2023-RS-2023-00256629, 50%).
