Search | Korea Science

Mun Seop Yun;Sang Hyuk Yoon;Ki Won Lee;Se Hoon Kim;Min Woo Lee;Ho-Young Kwak;Won Joo Lee
- Journal of the Korea Society of Computer and Information
- /
- v.29 no.4
- /
- pp.23-30
- /
- 2024
In this paper, we propose a deep learning-based personalized senior care service application. The proposed application uses Speech to Text technology to convert the user's speech into text and uses it as input to Autogen, an interactive multi-agent large-scale language model developed by Microsoft, for user convenience. Autogen uses data from previous conversations between the senior and ChatBot to understand the other user's intent and respond to the response, and then uses a back-end agent to create a wish list, a shared calendar, and a greeting message with the other user's voice through a deep learning model for voice cloning. Additionally, the application can perform home IoT services with SKT's AI speaker (NUGU). The proposed application is expected to contribute to future AI-based senior care technology.
https://doi.org/10.9708/jksci.2024.29.04.023 인용 PDF HTML

Kim, Kwang Hyeon;Kwon, Chul Hong
- The Journal of the Convergence on Culture Technology
- /
- v.8 no.3
- /
- pp.469-475
- /
- 2022
To train the model of the deep learning-based single-speaker TTS system, a speech DB of tens of hours and a lot of training time are required. This is an inefficient method in terms of time and cost to train multi-speaker or personalized TTS models. The voice cloning method uses a speaker encoder model to make the TTS model of a new speaker. Through the trained speaker encoder model, a speaker embedding vector representing the timbre of the new speaker is created from the small speech data of the new speaker that is not used for training. In this paper, we propose a multi-speaker TTS system to which voice cloning is applied. The proposed TTS system consists of a speaker encoder, synthesizer and vocoder. The speaker encoder applies the d-vector technique used in the speaker recognition field. The timbre of the new speaker is expressed by adding the d-vector derived from the trained speaker encoder as an input to the synthesizer. It can be seen that the performance of the proposed TTS system is excellent from the experimental results derived by the MOS and timbre similarity listening tests.
https://doi.org/10.17703/JCCT.2022.8.3.469 인용 PDF KSCI