http://dx.doi.org/10.22156/CS4SMB.2021.11.06.033

A Study on a Non-Voice Section Detection Model among Speech Signals using CNN Algorithm  

Lee, Hoo-Young (IIR-TECH AI Lab.)
Publication Information
Journal of Convergence for Information Technology / v.11, no.6, 2021, pp. 33-39
Abstract
Speech recognition technology, combined with deep learning, is advancing at a rapid pace. Voice recognition services are now connected to devices such as artificial intelligence speakers, in-vehicle voice recognition systems, and smartphones, so the technology is used in everyday settings rather than being confined to specific industrial domains. Research aimed at meeting the resulting high expectations is also being actively conducted. In particular, in the field of natural language processing (NLP), there is a need for methods that remove ambient noise and unnecessary voice signals, which strongly affect the speech recognition rate. Many domestic and foreign companies already apply the latest AI technology to this problem, and research using convolutional neural network (CNN) algorithms is especially active. The purpose of this study is to distinguish non-voice sections from the user's speech sections using a convolutional neural network. Voice files (WAV) from five speakers were collected to generate training data, and a CNN-based classification model was built to discriminate between speech and non-speech sections. An experiment was then conducted to detect non-speech sections with the generated model, and an accuracy of 94% was obtained.
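The abstract describes the pipeline only at a high level, so the following minimal sketch (not the paper's implementation) illustrates one way a CNN-based speech/non-speech frame classifier of this kind could be assembled with librosa and TensorFlow. The directory layout (data/speech, data/nonspeech), the 1-second analysis window, the 64 mel bands, and the layer sizes are illustrative assumptions, not details taken from the paper, and no claim is made that this sketch reproduces the reported 94% accuracy.

# Minimal sketch (assumptions noted above): classify fixed-length audio clips
# as speech (1) vs. non-speech (0) from log-mel spectrograms with a small CNN.
import pathlib
import numpy as np
import librosa
import tensorflow as tf

SR = 16000          # sampling rate (assumption)
FRAME_SEC = 1.0     # classify 1-second windows (assumption)
N_MELS = 64         # mel-spectrogram height (assumption)

def wav_to_melspec(path):
    """Load one WAV clip and convert it to a log-mel spectrogram 'image'."""
    y, _ = librosa.load(str(path), sr=SR, duration=FRAME_SEC)
    y = np.pad(y, (0, max(0, int(SR * FRAME_SEC) - len(y))))  # pad short clips
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    return librosa.power_to_db(mel)[..., np.newaxis]           # (n_mels, frames, 1)

def load_dataset(root="data"):
    """Build (X, y) from data/speech/*.wav (label 1) and data/nonspeech/*.wav (label 0)."""
    X, y = [], []
    for label, sub in [(1, "speech"), (0, "nonspeech")]:
        for wav in pathlib.Path(root, sub).glob("*.wav"):
            X.append(wav_to_melspec(wav))
            y.append(label)
    return np.stack(X), np.array(y, dtype=np.float32)

def build_model(input_shape):
    """Small Conv2D stack ending in a sigmoid unit for the binary decision."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

if __name__ == "__main__":
    X, y = load_dataset()
    model = build_model(X.shape[1:])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)

At inference time, a longer recording would be split into consecutive 1-second windows, each window scored by the model, and windows below a chosen threshold treated as non-speech sections.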
Keywords
Speech Recognition; Deep-Learning; CNN; Artificial-Intelligence; NLP;
References
1 S. S. Jo & Y. G. Kim. (2017). AI (Artificial Intelligence) Voice Assistant Evolving to Platform. IITP, pp. 1-25.
2 B. S. Kim & H. J. Woo. (2019). A Study on the Intention to Use AI Speakers: Focusing on extended technology acceptance model. The Korea Contents Association, 19(9), 1-10. DOI: 10.5392/JKCA.2019.19.09.001
3 L. H. Meng & J. S. Han. (2017). The Impact of Relational Benefits on Positive Affect, Perceived Value, and Behavior Intention in Social Commerce: Focused on Chinese Tourist having the Hotel Service of Social Commerce environment. Journal of Tourism and Leisure Research, 29(10), 69-88.
4 J. H. Seo & Y. T. Kim. (2013). Effects of Service Convenience on Customer Satisfaction and Reuse Intention by Korail Talk App Users among Korail Passengers. Journal of the Korean Society for Railway, 16(5), 410-417.
5 H. Zhou et al. (2017). Using deep convolutional neural network to classify urban sounds. In TENCON 2017 - 2017 IEEE Region 10 Conference (pp. 3089-3092). IEEE. DOI: 10.1109/TENCON.2017.8228392
6 X. Zha, H. Peng, X. Qin, G. Li & S. Yang. (2019). A deep learning framework for signal detection and modulation classification. Sensors, 19(18), 4042. DOI: 10.3390/s19184042
7 Y. LeCun, Y. Bengio & G. Hinton. (2015). Deep learning. Nature, 521(7553), 436-444. DOI: 10.1038/nature14539
8 L. Xiaojun et al. (2017). Feature extraction and fusion using deep convolutional neural networks for face detection. Mathematical Problems in Engineering, 1-9. DOI: 10.1155/2017/1376726
9 Y. LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. DOI: 10.1109/5.726791
10 Library of Congress. (2008). WAVE Audio File Format.
11 Microsoft Corporation. (1998). WAVE and AVI codec Registries-RFC 2361, IETF.
12 IBM & Microsoft. (1991). Multimedia Programming Interface and Data Specifications 1.0.
13 R. Branson. (2015). What Makes WAV Better than MP3, Online Video Converter.
14 T. Bluche et al. (2013, August). Feature extraction with convolutional neural networks for handwritten word recognition. In 2013 12th International Conference on Document Analysis and Recognition (pp. 285-289). IEEE. DOI: 10.1109/ICDAR.2013.64
15 A. van den Oord et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499, 1-15.
16 J. G. van Velden & G. F. Smoorenburg. (1991). Vowel recognition in noise for male, female and child voices. In Acoustics, Speech, and Signal Processing, IEEE International Conference on (pp. 989-992). IEEE Computer Society. DOI: 10.1109/ICASSP.1991.150507
17 D. S. Park. (2018). A Study on the Gender and Age Classification of Speech Data Using CNN. Journal of KIIT, 16(11), 11-21. DOI: 10.14801/jkiit.2018.16.11.11
18 F. Akopyan et al. (2015). TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10), 1537-1557. DOI: 10.1109/TCAD.2015.2474396
19 Y. E. Yuan. (2019). DeepMorse: A deep convolutional learning method for blind Morse signal detection in wideband wireless spectrum. IEEE Access, 7, 80577-80587. DOI: 10.1109/ACCESS.2019.2923084