http://dx.doi.org/10.7236/IJASC.2021.10.2.79

Comparative Analysis of Speech Recognition Open API Error Rate  

Kim, Juyoung (Graduate School of Smart Convergence Kwangwoon University)
Yun, Dai Yeol (Department of Plasma Bioscience and Display, Kwangwoon University)
Kwon, Oh Seok (Department of Plasma Bioscience and Display, Kwangwoon University Graduate School)
Moon, Seok-Jae (Department of Computer Science, Kwangwoon University)
Hwang, Chi-gon (Department of Computer Engineering, Institute of Information Technology, Kwangwoon University)
Publication Information
International Journal of Advanced Smart Convergence / v.10, no.2, 2021, pp. 79-85
Abstract
Speech recognition technology enables a computer to interpret spoken language and convert its content into text data. Recently combined with artificial intelligence, it is used in a variety of products such as smartphones, set-top boxes, and smart TVs; examples include Google Assistant, Google Home, Samsung's Bixby, Apple's Siri, and SK's NUGU. Google and Daum Kakao offer free open APIs for speech recognition. This paper selects three APIs that ordinary users can use free of charge and compares their recognition rates on three types of input: first, the recognition rate on numbers; second, the recognition rate on the Hangul syllables "Ga Na Da"; and finally, complete sentences that the authors use most often. All experiments use real voice input through a computer microphone. Through these three experiments and their results, we hope the general public will be able to identify differences in recognition rates among currently available applications, helping them select an API suited to a specific application purpose.
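The comparison described above scores each API by its recognition (error) rate against a known reference transcript. The paper does not give its scoring code, but a standard way to quantify this is the word error rate (WER): the word-level Levenshtein edit distance between the reference and the API's hypothesis, divided by the reference length. A minimal sketch, assuming the reference and hypothesis transcripts are available as plain strings (the example sentences are illustrative, not from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("too") and one deletion ("four") over 5 reference words:
print(word_error_rate("one two three four five", "one too three five"))  # 0.4
```

The recognition rate reported per API can then be taken as 1 - WER, averaged over the number, syllable, and sentence test sets.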
Keywords
Open API; Speech Recognition Technology; Recognition Rate; Artificial Intelligence