• Title/Summary/Keyword: Voice and Image Recognition


Gendered innovation for algorithm through case studies (음성·영상 신호 처리 알고리즘 사례를 통해 본 젠더혁신의 필요성)

  • Lee, JiYeoun;Lee, Heisook
    • Journal of Digital Convergence / v.16 no.12 / pp.459-466 / 2018
  • Gendered innovation is a term used by policy makers and academics to refer to the process of creating better research and development (R&D) for both men and women. In this paper, we analyze the literature on image and speech signal processing applicable to ICT and examine the importance of gendered innovation through case studies. To this end, recent domestic and foreign literature on gender-informed image and speech signal processing was searched, and a total of nine papers were selected. In terms of gender analysis, the research subjects, research environment, and research design were examined separately. In particular, case analyses of algorithms for elderly voice signal processing, machine learning, machine translation, and facial gender recognition revealed gender bias in existing algorithms, which shows that gender analysis is required. We also propose a gendered-innovation method that integrates sex and gender analysis into algorithm development. Gendered innovation in ICT can contribute to the creation of new markets by developing products and services that reflect the needs of both men and women.
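The gender-bias finding above amounts to comparing a model's error rates across sex/gender groups. A minimal sketch of such a disaggregated evaluation (the labels, predictions, and group tags below are invented for illustration) might look like this:

```python
# Sketch of a sex/gender-disaggregated accuracy check, in the spirit of the
# gendered-innovation method the paper proposes. All data are illustrative.

def group_accuracy(y_true, y_pred, groups):
    """Return per-group accuracy, e.g. {'F': 0.75, 'M': 0.5}."""
    stats = {}
    for t, p, g in zip(y_true, y_pred, groups):
        correct, total = stats.get(g, (0, 0))
        stats[g] = (correct + (t == p), total + 1)
    return {g: c / n for g, (c, n) in stats.items()}

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]
groups = ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M']

acc = group_accuracy(y_true, y_pred, groups)
gap = max(acc.values()) - min(acc.values())   # accuracy disparity between groups
print(acc, gap)
```

A large `gap` on real data would be the kind of evidence of algorithmic gender bias the paper's case studies report.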

Searching Effective Network Parameters to Construct Convolutional Neural Networks for Object Detection (물체 검출 컨벌루션 신경망 설계를 위한 효과적인 네트워크 파라미터 추출)

  • Kim, Nuri;Lee, Donghoon;Oh, Songhwai
    • Journal of KIISE / v.44 no.7 / pp.668-673 / 2017
  • Deep neural networks have shown remarkable performance in various fields of pattern recognition, such as voice recognition, image recognition, and object detection. However, the underlying mechanisms of such networks have not been fully revealed. In this paper, we focus on an empirical analysis of the network parameters. The Faster R-CNN (region-based convolutional neural network) was used as the baseline network, and three important parameters were analyzed: the dropout ratio, which prevents overfitting of the neural network; the size of the anchor boxes; and the activation function. We also compared the performance of dropout and batch normalization. The network performed best when the dropout ratio was 0.3, while the size of the anchor boxes showed no notable relation to performance. The results showed that batch normalization cannot entirely substitute for dropout. A leaky ReLU (rectified linear unit) with a negative-domain slope of 0.02 also showed comparably good performance.
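The two settings the paper reports — a leaky ReLU with a 0.02 negative slope and a dropout ratio of 0.3 — can be sketched in NumPy. This is an illustrative re-implementation of the standard operations, not the authors' code:

```python
import numpy as np

def leaky_relu(x, slope=0.02):
    """Leaky ReLU with the 0.02 negative-domain slope reported in the paper."""
    return np.where(x > 0, x, slope * x)

def dropout(x, ratio=0.3, train=True, rng=None):
    """Inverted dropout at the 0.3 ratio the paper found to work best.
    At inference (train=False) the input passes through unchanged."""
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= ratio          # drop ~30% of activations
    return x * mask / (1.0 - ratio)              # rescale to keep expectation

x = np.array([-1.0, 0.5, 2.0])
print(leaky_relu(x))   # [-0.02  0.5   2.  ]
```

Inverted dropout rescales the surviving activations during training, which is why no extra scaling is needed at inference time.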

Multi-Modal Emotion Recognition in Videos Based on Pre-Trained Models (사전학습 모델 기반 발화 동영상 멀티 모달 감정 인식)

  • Eun Hee Kim;Ju Hyun Shin
    • Smart Media Journal / v.13 no.10 / pp.19-27 / 2024
  • Recently, as demand for non-face-to-face counseling has rapidly increased, the need for emotion recognition technology that combines modalities such as text, voice, and facial expressions has been emphasized. In this paper, we address issues such as the dominance of non-Korean data and the imbalance of emotion labels in existing datasets such as FER-2013, CK+, and AFEW by using Korean video data. We propose methods to enhance multimodal emotion recognition performance in videos by integrating the strengths of the image modality with the text modality. Pre-trained models are used to overcome the limitations of small training data: a GPT-4-based LLM is applied to the text, and a pre-trained model based on the VGG-19 architecture is fine-tuned on the facial expression images. Representative emotions are then extracted by combining the per-modality results as follows. Emotion information extracted from the text is combined with facial expression changes in the video; if there is an emotion mismatch between the text and the image, a threshold is applied that prioritizes the text-based emotion when it is deemed trustworthy. Additionally, by adjusting representative emotions using per-frame emotion distribution information, performance improved by 19% in F1-score over the existing method that used average emotion values for each frame.
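The fusion rule described above — aggregate frame-level facial emotions, then let a confident text-based emotion override the visual one — might be sketched as follows. The labels, confidence scores, and threshold value are illustrative assumptions, not the paper's exact settings:

```python
from collections import Counter

def fuse_emotions(text_emotion, text_conf, frame_emotions, threshold=0.7):
    """Combine a text-based emotion with per-frame facial emotions.
    On a mismatch, the text emotion wins only if its confidence clears
    the trust threshold; otherwise the visual majority vote is kept."""
    visual = Counter(frame_emotions).most_common(1)[0][0]  # majority over frames
    if text_emotion != visual and text_conf >= threshold:
        return text_emotion
    return visual

print(fuse_emotions("sad", 0.9, ["neutral", "sad", "neutral"]))    # sad
print(fuse_emotions("happy", 0.4, ["neutral", "neutral", "sad"]))  # neutral
```

Using the per-frame distribution (rather than one averaged emotion per video) is what the paper credits for its F1-score gain.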

Spontaneous Speech Emotion Recognition Based On Spectrogram With Convolutional Neural Network (CNN 기반 스펙트로그램을 이용한 자유발화 음성감정인식)

  • Guiyoung Son;Soonil Kwon
    • The Transactions of the Korea Information Processing Society / v.13 no.6 / pp.284-290 / 2024
  • Speech emotion recognition (SER) is a technique used to analyze a speaker's voice patterns, including vibration, intensity, and tone, to determine their emotional state. Interest in artificial intelligence (AI) techniques has increased, and they are now widely used in medicine, education, industry, and the military. Nevertheless, existing research has attained impressive results mainly by utilizing acted speech from skilled actors recorded in controlled environments for various scenarios. There is a mismatch between acted and spontaneous speech, since acted speech contains more explicit emotional expressions; for this reason, spontaneous speech emotion recognition remains a challenging task. This paper aims to conduct emotion recognition and improve performance using spontaneous speech data. To this end, we implement deep-learning-based speech emotion recognition using a VGG (Visual Geometry Group) network after converting 1-dimensional audio signals into 2-dimensional spectrogram images. Experimental evaluations are performed on the Korean spontaneous emotional speech database from AI-Hub, consisting of 7 emotions: joy, love, anger, fear, sadness, surprise, and neutral. Using the time-frequency 2-dimensional spectrogram, we achieved average accuracies of 83.5% for adults and 73.0% for young people. In conclusion, our findings demonstrate that the suggested framework outperforms current state-of-the-art techniques for spontaneous speech and shows promising performance despite the difficulty of quantifying spontaneous emotional expression.
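The preprocessing step above — turning a 1-D audio signal into a 2-D time-frequency spectrogram before feeding a CNN — can be sketched with a short-time Fourier transform in NumPy. The frame length, hop size, and window choice here are illustrative defaults, not the paper's settings:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Convert a 1-D audio signal into a 2-D magnitude spectrogram via a
    Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time_frames)

# A 440 Hz tone sampled at 16 kHz: energy should concentrate near one bin.
t = np.arange(16000) / 16000
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (129, 124)
```

With a 256-sample frame at 16 kHz each frequency bin spans 62.5 Hz, so the 440 Hz tone peaks near bin 7; the resulting image is what a VGG-style CNN consumes.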

An Exploratory Approach to Designer Models for Pattern Design Using ChatGPT (챗 GPT를 활용한 패턴 디자인의 디자이너 모델에 대한 탐색적 접근)

  • Hua-Qian Xie;Seung-Keun Song
    • The Journal of the Convergence on Culture Technology / v.10 no.6 / pp.799-805 / 2024
  • Recently, generative artificial intelligence (AI) technology has been developing rapidly, and its applications are expanding beyond text, voice, image, object recognition, time-series forecasting, and natural language processing into the creative design field, which AI was once thought incapable of. We take an exploratory approach to studying the cognitive model of pattern designers who use generative AI. To this end, we used GPT-4o, the best-known generative AI, and applied the protocol analysis method, a cognitive-science research method, to the pattern design process. Four design graduate students were selected as subjects, and a pilot and a main experiment were conducted; voice recording and video capture were performed to collect data. We applied the concurrent protocol method, in which subjects verbalize what comes to mind while performing the task. The collected verbal data were segmented into words, and a coding scheme was developed to classify the design process and establish an analysis framework. As a result, five stages were identified: analysis, selection, visualization, evaluation, and optimization. We expect these results to provide design guidelines for pattern design practice.
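The coding step above — segmenting verbal data and tallying it against a coding scheme — can be sketched with a keyword-based coder. The five category names follow the stages the study identified, but the keywords and transcript segments below are invented for illustration:

```python
from collections import Counter

# Hypothetical keyword scheme; a real study would code segments manually.
CODING_SCHEME = {
    "analysis": ["requirement", "reference"],
    "selection": ["choose", "pick"],
    "visualization": ["sketch", "draw"],
    "evaluation": ["compare", "judge"],
    "optimization": ["refine", "adjust"],
}

def code_segment(segment):
    """Assign a verbal-protocol segment to the first matching category."""
    words = segment.lower().split()
    for category, keywords in CODING_SCHEME.items():
        if any(k in words for k in keywords):
            return category
    return "uncoded"

segments = ["first I draw a rough motif",
            "then judge whether the colors match",
            "and refine the repeat spacing"]
counts = Counter(code_segment(s) for s in segments)
print(counts)
```

The resulting frequency table is the kind of evidence used to argue that the five stages structure the design process.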

A Study on Interactive Talking Companion Doll Robot System Using Big Data for the Elderly Living Alone (빅데이터를 이용한 독거노인 돌봄 AI 대화형 말동무 아가야(AGAYA) 로봇 시스템에 관한 연구)

  • Song, Moon-Sun
    • The Journal of the Korea Contents Association / v.22 no.5 / pp.305-318 / 2022
  • We focused on the care effectiveness of interactive AI robots and developed an AI toy robot called 'Agaya' to contribute to more human-centered, personalized care. First, by applying P-TTS technology, users can maximize intimacy by selecting the voice of the person they want to hear. Second, a memory-storage and memory-recall function allows users to find healing in their own way. Third, the robot seeks to provide better personalized services through five senses that play the roles of eyes, nose, mouth, ears, and hands. Fourth, we attempted to develop features such as warm temperature maintenance, aroma, sterilization and fine-dust removal, and a convenient charging method. These capabilities will expand the effective use of interactive robots by elderly people and contribute to building a positive image of the elderly, who can plan their remaining years productively and independently.

Development of Data Fusion Human Identification System Based on Finger-Vein Pattern-Matching Method and photoplethysmography Identification

  • Ko, Kuk Won;Lee, Jiyeon;Moon, Hongsuk;Lee, Sangjoon
    • International Journal of Internet, Broadcasting and Communication / v.7 no.2 / pp.149-154 / 2015
  • Biometric techniques for authentication using body parts such as the fingerprint, face, iris, voice, and finger vein, as well as photoplethysmography, have become increasingly important in the personal security field, including door access control, financial security, electronic passports, and mobile devices. Finger-vein images are now used for human identification; however, recognizing them is made difficult by capture under varying conditions, such as different temperatures and illumination, and by noise in the acquisition camera. The photoplethysmography signal is also an important signal for human identification. In this paper, to increase the recognition rate, we develop a camera-based identification method that combines the finger-vein image with the photoplethysmography signal. We use a compact CMOS camera with a penetrating infrared LED light source to acquire finger-vein images and the photoplethysmography signal. In addition, we suggest a simple pattern-matching method to reduce the calculation time in embedded environments. The experimental results show that our simple system achieves good speed and accuracy for personal identification compared to using finger-vein images alone.
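The score-level idea — matching binarized vein patterns cheaply and fusing the result with a PPG-based score — might be sketched as below. The Hamming-distance matcher and the fusion weights are assumptions for illustration, not the paper's exact method:

```python
import numpy as np

def hamming_score(a, b):
    """Similarity of two binarized finger-vein patterns (1.0 = identical).
    A bitwise comparison like this is cheap enough for embedded hardware."""
    return 1.0 - float(np.mean(a != b))

def fused_score(vein_score, ppg_score, w=0.7):
    """Weighted score-level fusion of the two modalities (weights assumed)."""
    return w * vein_score + (1 - w) * ppg_score

enrolled = np.array([[1, 0, 1, 1], [0, 1, 0, 1]])
probe    = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])  # one bit differs
v = hamming_score(enrolled, probe)
f = fused_score(v, 0.9)          # 0.9 = hypothetical PPG similarity
print(v, f)
```

A decision threshold on the fused score then accepts or rejects the probe; combining modalities raises accuracy when either signal alone is noisy.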

Structural live load surveys by deep learning

  • Li, Yang;Chen, Jun
    • Smart Structures and Systems / v.30 no.2 / pp.145-157 / 2022
  • The design of safe and economical structures depends on reliable live loads obtained from load surveys. Live load surveys are traditionally conducted by randomly selecting rooms and weighing each item on-site, a method with problems of low efficiency, high cost, and long cycle time. This paper proposes a deep-learning-based method combined with Internet big data to perform live load surveys. The proposed method utilizes multi-source heterogeneous data, such as images, voice, and product identification, to obtain the live load without weighing each item, through object detection, web crawling, and speech recognition. Indoor-object and face detection models are first developed by fine-tuning the YOLOv3 algorithm to detect target objects and count the people in a room, respectively. Each detection model is evaluated on an independent testing set. Web crawler frameworks with keyword and image retrieval are then established to extract the weight information of detected objects from Internet big data. The live load in a room is derived by combining the weights and numbers of items and people. To verify the feasibility of the proposed method, a live load survey was carried out for a meeting room. The results show that, compared with the traditional sampling-and-weighing method, the proposed method performs live load surveys efficiently and conveniently and represents a new load-research paradigm.
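The final combination step — object counts from detection plus unit weights from crawling yield a room's live load — reduces to a weighted sum. All item names, weights, and the per-person weight below are illustrative, not survey values:

```python
def live_load(counts, unit_weights, person_weight=0.6, n_people=0):
    """Total live load (kN) from detected item counts, crawled unit
    weights (kN), and the number of people detected in the room."""
    load = sum(counts[item] * unit_weights[item] for item in counts)
    return load + n_people * person_weight

# Hypothetical meeting-room inventory from the detector and crawler.
counts  = {"desk": 4, "chair": 10, "cabinet": 2}
weights = {"desk": 0.30, "chair": 0.08, "cabinet": 0.45}
total = live_load(counts, weights, n_people=6)
print(total)
```

Dividing `total` by the floor area would give the live load intensity used in design codes.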

Speech sound and personality impression (말소리와 성격 이미지)

  • Lee, Eunyung;Yuh, Heaok
    • Phonetics and Speech Sciences / v.9 no.4 / pp.59-67 / 2017
  • Regardless of their intention, listeners tend to assess speakers' personalities based on the speech sounds they hear. The assessment criteria, however, have not been fully investigated to indicate whether there is any relationship between the acoustic cues of produced speech and the perceived personality impression. If properly investigated, this relationship would provide crucial insights into human communication and, further, human-computer interaction. Since human communication is characterized by simultaneity and complexity, this investigation seeks the minimum essential factors linking speech sounds and perceived personality impression. The purpose of this study, therefore, is to identify significant associations between speech sounds and the personality impression of the speaker as perceived by listeners. Twenty-eight subjects participated in the experiment, and eight acoustic parameters were extracted from the recorded speech using Praat. The subjects also completed the NEO Five-Factor Inventory so that their personality traits could be measured. The results show that four major factors (duration average, pitch difference value, pitch average, and intensity average) play crucial roles in defining the significant relationship.
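The association analysis above boils down to correlating an acoustic parameter with a personality score across speakers. A minimal sketch with a Pearson correlation follows; the pitch values and trait scores are invented, not the study's data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

pitch_avg    = [180, 210, 150, 230, 195, 170]  # Hz, one value per speaker
extraversion = [42, 55, 35, 60, 50, 40]        # hypothetical NEO-FFI scores

r = pearson_r(pitch_avg, extraversion)
print(round(r, 3))
```

In practice each of the eight Praat-extracted parameters would be tested against each trait, with significance testing to pick out the four factors the study reports.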

Optimal Algorithm and Number of Neurons in Deep Learning (딥러닝 학습에서 최적의 알고리즘과 뉴론수 탐색)

  • Jang, Ha-Young;You, Eun-Kyung;Kim, Hyeock-Jin
    • Journal of Digital Convergence / v.20 no.4 / pp.389-396 / 2022
  • Deep learning is based on the perceptron and is currently used in various fields such as image recognition, voice recognition, object detection, and drug development. Accordingly, a variety of learning algorithms have been proposed, and the number of neurons constituting a neural network varies greatly among researchers. This study analyzed the learning characteristics, according to the number of neurons, of the currently used SGD, momentum, AdaGrad, RMSProp, and Adam methods. To this end, a neural network was constructed with one input layer, three hidden layers, and one output layer. ReLU was applied as the activation function, cross-entropy error (CEE) as the loss function, and MNIST was used as the experimental dataset. As a result, it was concluded that 100-300 neurons, the Adam algorithm, and 200 learning iterations would be the most efficient for deep learning. This study provides implications for algorithm development and a reference value for the number of neurons given new training data in the future.
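Adam, the optimizer the study found most efficient, combines the momentum and RMSProp ideas it was compared against. A minimal NumPy sketch of the standard Adam update rule, applied to a toy quadratic loss f(w) = w² (not the study's MNIST setup), is:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum-like first moment, RMSProp-like second
    moment, and bias correction for the early steps."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)  # grad of w^2 is 2w
print(w)  # close to the minimum at 0
```

Because the update is normalized by the gradient's running magnitude, the effective step size stays near `lr`, which is part of why Adam behaves robustly across the neuron counts the study tested.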