Search | Korea Science

Building robust Korean speech recognition model by fine-tuning large pretrained model (대형 사전훈련 모델의 파인튜닝을 통한 강건한 한국어 음성인식 모델 구축)

Changhan Oh;Cheongbin Kim;Kiyoung Park
- Phonetics and Speech Sciences / v.15 no.3 / pp.75-82 / 2023
Automatic speech recognition (ASR) has been revolutionized by deep learning-based approaches, among which self-supervised learning methods have proven particularly effective. In this study, we aim to enhance the performance of OpenAI's Whisper model, a multilingual ASR system, on the Korean language. Whisper was pretrained on a large corpus (around 680,000 hours) of web speech data and has demonstrated strong recognition performance for major languages. However, it faces challenges in recognizing languages such as Korean, which was not a major language in its training data. We address this issue by fine-tuning the Whisper model with an additional dataset comprising about 1,000 hours of Korean speech. We also compare its performance against a Transformer model trained from scratch on the same dataset. Our results indicate that fine-tuning the Whisper model significantly improved its Korean speech recognition capabilities in terms of character error rate (CER). Specifically, the performance improved with increasing model size. However, the Whisper model's performance on English deteriorated after fine-tuning, emphasizing the need for further research to develop robust multilingual models. Our study demonstrates the potential of utilizing a fine-tuned Whisper model for Korean ASR applications. Future work will focus on multilingual recognition and optimization for real-time inference.
https://doi.org/10.13064/KSSS.2023.15.3.075
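
The abstract above reports results as character error rate (CER). As a rough illustration of how such a metric is commonly computed, the sketch below divides the Levenshtein edit distance by the reference length. The function names and the choice to ignore spaces are assumptions made here for illustration; the paper's own scoring rules are not given in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped first (an assumption)."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters -> 0.5
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.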

Study on the Vulnerabilities of Automatic Speech Recognition Models in Military Environments (군사적 환경에서 음성인식 모델의 취약성에 관한 연구)

Elim Won;Seongjung Na;Youngjin Ko
- Convergence Security Journal / v.24 no.2 / pp.201-207 / 2024
Voice is a critical element of human communication, and the development of speech recognition models is one of the significant achievements in artificial intelligence, which has recently been applied in various aspects of human life. The application of speech recognition models in the military field is also inevitable. However, before artificial intelligence models can be applied in the military, it is necessary to research their vulnerabilities. In this study, we evaluate the military applicability of the multilingual speech recognition model "Whisper" by examining its vulnerabilities to battlefield noise, white noise, and adversarial attacks. In experiments involving battlefield noise, Whisper showed significant performance degradation, with an average Character Error Rate (CER) of 72.4%, indicating difficulties in military applications. In experiments with white noise, Whisper was robust to low-intensity noise but showed performance degradation under high-intensity noise. Adversarial attack experiments revealed vulnerabilities at specific epsilon values. Therefore, the Whisper model requires improvements through fine-tuning, adversarial training, and other methods.
https://doi.org/10.33778/kcsa.2024.24.2.201
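
The white-noise experiment described above varies noise intensity. One common way to control that intensity is to mix Gaussian noise into a clip at a chosen signal-to-noise ratio (SNR). The sketch below is a minimal illustration under that assumption; the abstract does not state how intensity was parameterized, and the function names, the SNR values, and the synthetic tone standing in for speech are all hypothetical.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return samples plus zero-mean Gaussian noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(samples, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB recovered from a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# A 440 Hz tone sampled at 16 kHz stands in for a speech clip.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy_low = add_white_noise(tone, snr_db=30)   # low-intensity noise
noisy_high = add_white_noise(tone, snr_db=0)   # noise as strong as the signal
```

Feeding clips such as `noisy_low` and `noisy_high` to a recognizer and comparing error rates would be one way to trace how recognition degrades as the noise grows.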

Madness, the Smile, and Transnational Connections in "A Whisper in the Dark"

Jin, Seongeun
- American Studies / v.44 no.1 / pp.137-154 / 2021
Due to her successful novel Little Women (1869), Louisa May Alcott has generally become known as a writer of sentimental fiction. However, her thrillers demonstrate her keen insights into domestic and international issues. Alcott's so-called "left hand" shows her stances on political and historical issues in America as well as in Europe and Asia. Particularly, Alcott's supporting voice for women against social prejudices is metaphorically portrayed in "A Whisper in the Dark" (1861). Interestingly, in the story Alcott displays her knowledge of the drug trades and the cultural effects of white male colonizers exploiting other peoples and countries around the globe, which were issues that she had learned about from neighboring intellectuals and newspapers. In the paper, I examine Alcott's radical views on gender equality, chauvinistic attitudes, and transnational politics in the mid-nineteenth century.

Development of Learning Program using Chinese Whispers Game(Broken Telephone Game) for Systematic Assessment and Reporting of Patients and Exploration on Learners' Experiences (속삭임게임을 활용한 체계적 환자사정 및 보고 교육프로그램의 개발 및 학습자 경험탐색)

Jung, Hyun-Jung
- Journal of Korea Entertainment Industry Association / v.13 no.6 / pp.143-153 / 2019
To save lives by recognizing deterioration in patients, patient assessment and reporting should be foundational, yet this task is mainly delegated to nursing students or inexperienced nurses. A whisper game is a game in which the first person selects a word, phrase, or sentence and whispers it to a team member, and the team finally confirms how much the original message changed during transmission. The purpose of this study was to develop a whisper game program for transmitting information about the children shown in the DVD used in the pediatric advanced life support course. After four rounds of the game, the experiences of 31 fourth-year nursing students were explored by analyzing their reflective journals. The results showed three themes: learning motivation, metacognitive ability, and situated contextual learning. Repeated practice through a whisper game is expected to be widely used because it was identified as a fresh and interesting learning method that enables nursing students to reflect metacognitively on the process of assessing patients and conveying information in context.
https://doi.org/10.21184/jkeia.2019.8.13.6.143

Exploring the feasibility of fine-tuning large-scale speech recognition models for domain-specific applications: A case study on Whisper model and KsponSpeech dataset

Jungwon Chang;Hosung Nam
- Phonetics and Speech Sciences / v.15 no.3 / pp.83-88 / 2023
This study investigates the fine-tuning of large-scale Automatic Speech Recognition (ASR) models, specifically OpenAI's Whisper model, for domain-specific applications using the KsponSpeech dataset. The primary research questions address the effectiveness of targeted lexical item emphasis during fine-tuning, its impact on domain-specific performance, and whether the fine-tuned model can maintain generalization capabilities across different languages and environments. Experiments were conducted using two fine-tuning datasets: Set A, a small subset emphasizing specific lexical items, and Set B, consisting of the entire KsponSpeech dataset. Results showed that fine-tuning with targeted lexical items increased recognition accuracy and improved domain-specific performance, with generalization capabilities maintained when fine-tuned with a smaller dataset. For noisier environments, a trade-off between specificity and generalization capabilities was observed. This study highlights the potential of fine-tuning using minimal domain-specific data to achieve satisfactory results, emphasizing the importance of balancing specialization and generalization for ASR models. Future research could explore different fine-tuning strategies and novel technologies such as prompting to further enhance large-scale ASR models' domain-specific performance.
https://doi.org/10.13064/KSSS.2023.15.3.083

Enhancing Speech Recognition with Whisper-tiny Model: A Scalable Keyword Spotting Approach (Whisper-tiny 모델을 활용한 음성 분류 개선: 확장 가능한 키워드 스팟팅 접근법)

Shivani Sanjay Kolekar;Hyeonseok Jin;Kyungbaek Kim
- Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.774-776 / 2024
The effective implementation of automatic speech recognition (ASR) systems necessitates the deployment of sophisticated keyword spotting models that are both responsive and resource-efficient. The initial local detection of user interactions is crucial, as it allows for the selective transmission of audio data to cloud services, thereby reducing operational costs and mitigating the privacy risks associated with continuous data streaming. In this paper, we address these needs and propose fine-tuning the Whisper-Tiny model to recognize keywords from the Google speech dataset, which includes 65,000 audio clips of keyword commands. By adapting the model's encoder and appending a lightweight classification head, we ensure that it operates within the limited resource constraints of local devices. The proposed model achieves a notable test accuracy of 92.94%. This architecture demonstrates efficiency as an on-device model under stringent resource constraints, leading to enhanced accessibility in everyday speech recognition applications.
https://doi.org/10.3745/PKIPS.y2024m05a.774
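
The abstract above mentions appending a lightweight classification head to an encoder. The sketch below shows one plausible shape for such a head in plain Python: mean pooling over encoder frames followed by a single linear layer and softmax. The pooling choice, the function names, and the toy two-dimensional "encoder outputs" are assumptions for illustration only; the paper's actual head design is not described in the abstract.

```python
import math


def mean_pool(frames):
    """Average a sequence of frame vectors into one fixed-size vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def linear(vector, weights, bias):
    """One dense layer; `weights` holds one row per keyword class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(frames, weights, bias, labels):
    """Pool encoder frames, apply the head, and return (label, probability)."""
    probs = softmax(linear(mean_pool(frames), weights, bias))
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy stand-in for encoder output: two frames of two features each.
frames = [[1.0, 0.0], [3.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical trained head weights
bias = [0.0, 0.0]
label, prob = classify(frames, weights, bias, ["yes", "no"])
```

A head of this kind adds only `classes x features` weights on top of the encoder, which is what makes it attractive for on-device use.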

The Effects of the Methods of Disguised Voice on the Aural Decision (위장 발화 방법의 차이가 청취 판단에 미치는 영향)

Song Min-Chang;Shin Jiyoung;Kang SunMee
- MALSORI / no.46 / pp.25-35 / 2003
This study deals with disguised voice (or voice disguise) in the field of forensic phonetics. In particular, we studied the effects of voice disguise methods on aural decisions. Within the area of nonelectronic, deliberate voice disguise, the methods include lowered pitch, pinched nostrils, falsetto, and whisper. Ten Seoul speakers (five male, five female) recorded 16 sentences. In the aural test, 30 subjects listened to normal and disguised voices and were asked to decide whether the speakers were identified or not. The result is as follows: speaker verification was more difficult for falsetto and whisper than for lowered pitch and pinched nostrils.

Digital enhancement of pronunciation assessment: Automated speech recognition and human raters

Miran Kim
- Phonetics and Speech Sciences / v.15 no.2 / pp.13-20 / 2023
This study explores the potential of automated speech recognition (ASR) in assessing English learners' pronunciation. We employed ASR technology, acknowledged for its impartiality and consistent results, to analyze speech audio files, including synthesized speech, both native-like English and Korean-accented English, and speech recordings from a native English speaker. Through this analysis, we establish baseline values for the word error rate (WER). These were then compared with those obtained for human raters in perception experiments that assessed the speech productions of 30 first-year college students before and after taking a pronunciation course. Our sub-group analyses revealed positive training effects for Whisper, an ASR tool, and human raters, and identified distinct human rater strategies in different assessment aspects, such as proficiency, intelligibility, accuracy, and comprehensibility, that were not observed in ASR. Despite such challenges as recognizing accented speech traits, our findings suggest that digital tools such as ASR can streamline the pronunciation assessment process. With ongoing advancements in ASR technology, its potential as not only an assessment aid but also a self-directed learning tool for pronunciation feedback merits further exploration.
https://doi.org/10.13064/KSSS.2023.15.2.013
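
The study above uses word error rate (WER) as its baseline measure. As a hedged illustration, WER is usually defined as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function below is a minimal sketch with whitespace tokenization, which is an assumption; the study's text normalization is not given in the abstract.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deleted word
                             dist[i][j - 1] + 1,         # inserted word
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[-1][-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

For pronunciation assessment, a lower WER on a learner's recording after training than before would be read as the recognizer finding the speech easier to transcribe.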

A Study of Data Augmentation and Auto Speech Recognition for the Elderly (한국어 노인 음성 데이터 증강 및 인식 연구 )

Keon Hee Kim;Seoyoon Park;Hansaem Kim
- Annual Conference on Human and Language Technology / 2023.10a / pp.56-60 / 2023
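Speech recognition has traditionally focused on young and middle-aged adults, but as population aging has accelerated, the need for research on elderly speech is growing. However, elderly speech datasets are still not as readily available as datasets of young and middle-aged adult speech. To help address this shortage, this study investigated a methodology for augmenting scarce elderly speech data. To this end, we analyzed the features of elderly speech and augmented the data by synthesizing the 'frequency' and 'speech rate' features onto general adult speech. We then fine-tuned the Whisper small model and measured the character error rate (CER) on elderly speech, finding that using the augmented dataset together with the existing elderly dataset was the most effective.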
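
The abstract above augments data by altering frequency and speech rate. A widely used and much simpler relative of that idea is speed perturbation by resampling, which changes tempo and pitch together. The sketch below illustrates that technique with linear interpolation; it is not the authors' synthesis method, and the function name and factor values are hypothetical.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster.

    factor < 1 slows the speech and lowers its frequencies;
    factor > 1 speeds it up and raises them.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    out_len = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(out_len):
        pos = i * factor  # fractional position in the original signal
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


slower = change_speed([0, 1, 2, 3], 0.5)         # twice as long
faster = change_speed([0, 2, 4, 6, 8, 10], 2.0)  # half as long
```

Changing speech rate without shifting pitch, or the reverse, needs a time-stretching or pitch-shifting algorithm rather than plain resampling.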

Development and Utilization of Speech Recognition Service for Ship Radio Communication (선박무선통신 음성인식 서비스 개발 및 활용)

Kwang-Il Kim;Sang-Lok Yoo
- Proceedings of the Korean Institute of Navigation and Port Research Conference / 2023.11a / pp.236-237 / 2023
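Ship radio communication equipment is essential for exchanging the safety information needed for navigation, vessel traffic monitoring and control information, and port entry and departure information, so ship navigators must always listen carefully to radio communications. In this study, we collected and trained on 500 hours of actual ship voice communication data, and used the Wav2Vec and Whisper models to develop and deploy a speech recognition model for Korean and English (maritime English). The model's performance was 94.5% on a CER (Character Error Rate) basis, and it is expected to be applicable to various fields related to ship operation.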

