• Title/Summary/Keyword: visual-audio


The Influence of Topic Exploration and Topic Relevance On Amplitudes of Endogenous ERP Components in Real-Time Video Watching (실시간 동영상 시청시 주제탐색조건과 주제관련성이 내재적 유발전위 활성에 미치는 영향)

  • Kim, Yong Ho;Kim, Hyun Hee
    • Journal of Korea Multimedia Society
    • /
    • v.22 no.8
    • /
    • pp.874-886
    • /
    • 2019
  • To delve into the semantic gap problem of automatic video summarization, we focused on endogenous ERP responses at around 400 ms and 600 ms after the onset of an audio-visual stimulus. Our experiment included two factors: the topic exploration condition (Topic Given vs. Topic Exploring) as a between-subject factor and the topic relevance of the shots (Topic-Relevant vs. Topic-Irrelevant) as a within-subject factor. In the Topic Given condition, 22 subjects watched 6 short historical documentaries together with their titles and written summaries, while the 25 subjects in the Topic Exploring condition were asked to infer the topics of the same videos with no such information. EEG data were gathered while they watched the videos in real time. It was hypothesized that the cognitive activity of exploring a video's topic while watching individual shots increases the amplitude of the endogenous ERP at around 600 ms after the onset of topic-relevant shots, and that the amplitude of the endogenous ERP at around 400 ms after the onset of topic-irrelevant shots is lower in the Topic Given condition than in the Topic Exploring condition. A repeated-measures MANOVA supported both hypotheses.
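The ~400 ms and ~600 ms component amplitudes discussed above are, in essence, mean voltages within fixed latency windows of the epoched EEG. A minimal sketch of that measurement step, assuming epochs in microvolts and purely illustrative window bounds (the paper's exact channels and windows are not given here):

```python
import numpy as np

def mean_amplitude(epochs, times, window):
    """Mean ERP amplitude per trial within a latency window.

    epochs: (n_trials, n_samples) EEG epochs time-locked to shot onset, in microvolts.
    times:  (n_samples,) latencies in ms relative to shot onset.
    window: (start_ms, end_ms), e.g. (350, 450) for the ~400 ms component.
    """
    mask = (times >= window[0]) & (times <= window[1])
    return epochs[:, mask].mean(axis=1)

# Illustrative use: per-condition amplitudes that would then enter the repeated-measures MANOVA.
# amp_400 = mean_amplitude(irrelevant_epochs, times, (350, 450))
# amp_600 = mean_amplitude(relevant_epochs, times, (550, 650))
```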

Incomplete Cholesky Decomposition based Kernel Cross Modal Factor Analysis for Audiovisual Continuous Dimensional Emotion Recognition

  • Li, Xia;Lu, Guanming;Yan, Jingjie;Li, Haibo;Zhang, Zhengyan;Sun, Ning;Xie, Shipeng
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.13 no.2
    • /
    • pp.810-831
    • /
    • 2019
  • Recently, continuous dimensional emotion recognition from audiovisual cues has attracted increasing attention in both theory and practice. The large amount of data involved in the recognition process decreases the efficiency of most bimodal information fusion algorithms. In this paper, a novel algorithm, incomplete Cholesky decomposition based kernel cross-modal factor analysis (ICDKCFA), is presented and employed for continuous dimensional audiovisual emotion recognition. After the ICDKCFA feature transformation, two basic fusion strategies, feature-level fusion and decision-level fusion, are explored to combine the transformed visual and audio features for emotion recognition. Finally, extensive experiments are conducted to evaluate the ICDKCFA approach on the AVEC 2016 Multimodal Affect Recognition Sub-Challenge dataset. The experimental results show that ICDKCFA runs faster than the original kernel cross-modal factor analysis with comparable performance, and that it outperforms other common information fusion methods such as canonical correlation analysis, kernel canonical correlation analysis, and cross-modal factor analysis based fusion.
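A minimal numpy sketch of the incomplete (pivoted, low-rank) Cholesky factorization of a kernel matrix, the building block the method's name refers to; the tolerance and rank cap are illustrative, and the subsequent kernel cross-modal factor analysis and fusion steps are not shown:

```python
import numpy as np

def incomplete_cholesky(K, tol=1e-6, max_rank=None):
    """Greedy pivoted factorization K ~= G @ G.T with G of shape (n, rank)."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal
    G = np.zeros((n, 0))
    pivots = []
    while d.max() > tol and (max_rank is None or len(pivots) < max_rank):
        i = int(np.argmax(d))              # pivot = largest residual diagonal entry
        pivots.append(i)
        # new column: residual of K[:, i] after removing the part already explained by G
        g = (K[:, i] - G @ G[i, :]) / np.sqrt(d[i])
        G = np.column_stack([G, g])
        d = np.maximum(d - g ** 2, 0.0)
    return G, pivots
```

The low-rank factor G then acts as an explicit feature map that replaces the full n-by-n kernel matrix, which is where the reported speed-up over the original kernel cross-modal factor analysis comes from.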

CNN-based Visual/Auditory Feature Fusion Method with Frame Selection for Classifying Video Events

  • Choe, Giseok;Lee, Seungbin;Nang, Jongho
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.13 no.3
    • /
    • pp.1689-1701
    • /
    • 2019
  • In recent years, personal videos have been widely shared online owing to the popularity of portable devices such as smartphones and action cameras. A recent report predicted that 80% of Internet traffic would be video content by the year 2021. Several studies have been conducted on detecting the main events in videos in order to manage large collections of them, and they show fairly good performance in certain genres. However, these methods have difficulty detecting events in personal videos, because the characteristics and genres of personal videos vary widely. In our study, we found that adding a dataset with an appropriate perspective improved performance, and that performance also depends on how keyframes are extracted from the video. We therefore selected frame segments that can represent a video, taking the characteristics of personal videos into account. From each frame segment, object, location, food, and audio features were extracted, and representative vectors were generated through a CNN-based recurrent model and a fusion module. In experiments on the LSVC dataset, the proposed method achieved an mAP of 78.4%.
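An illustrative sketch of the two generic ideas combined above, frame-segment selection and modality fusion; it is not the paper's CNN-based recurrent model, and the segment count, segment length, and fusion weight are assumptions:

```python
import numpy as np

def select_segments(n_frames, n_segments=10, seg_len=16):
    """Evenly spaced frame segments as a simple stand-in for the selection step."""
    starts = np.linspace(0, max(n_frames - seg_len, 0), n_segments).astype(int)
    return [(int(s), int(s) + seg_len) for s in starts]

def feature_level_fusion(visual_feat, audio_feat):
    """Concatenate modality features before a joint event classifier."""
    return np.concatenate([visual_feat, audio_feat], axis=-1)

def decision_level_fusion(visual_scores, audio_scores, w=0.6):
    """Weighted average of per-class event scores produced by each modality."""
    return w * np.asarray(visual_scores) + (1 - w) * np.asarray(audio_scores)
```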

Design and Implementation of Smart Pen based User Interface System for U-learning (U-Learning 을 위한 스마트펜 인터페이스 시스템 디자인 및 개발)

  • Shim, Jae-Youen;Kim, Seong-Whan
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2010.11a
    • /
    • pp.1388-1391
    • /
    • 2010
  • In this paper, we present the design and implementation of a U-learning system using a pen-based augmented reality approach. Each student is given a smart pen and a smart study book, which looks like the printed material already in service; however, we print the study book with CMY inks and embed perceptually invisible dot patterns using K ink. The smart pen includes (1) an IR LED for illumination, (2) an IR pass filter for extracting the dot patterns, and (3) a camera for image capture. From the image sequences, we perform a topology analysis that determines the topological distance between dot pixels, and then perform error correction decoding using four position symbols and five CRC symbols. When a student touches the smart study book with the smart pen, we show him/her multimedia (visual/audio) content directly related to the selected region. Our scheme can embed 16 bits of information, more than twice the 7 or 8 bits supported by the previous scheme.
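For illustration, a generic bit-level CRC encode/check of the kind the decoding stage relies on; the generator polynomial, CRC width, and bit layout here are hypothetical and are not the paper's four-position/five-CRC symbol code:

```python
def crc_remainder(bits, poly=0b100101, width=5):
    """Long-division remainder of a bit sequence (MSB first) by a generator polynomial.

    poly includes the leading 1; 0b100101 is x^5 + x^2 + 1, a common CRC-5 generator
    chosen only for illustration.
    """
    reg = 0
    for bit in bits:
        reg = (reg << 1) | bit
        if reg & (1 << width):
            reg ^= poly
    return reg

def append_crc(data_bits, width=5):
    """Append `width` CRC bits so the whole codeword divides evenly by the generator."""
    r = crc_remainder(list(data_bits) + [0] * width)
    return list(data_bits) + [(r >> i) & 1 for i in range(width - 1, -1, -1)]

def crc_ok(codeword_bits):
    """A received codeword is consistent iff its remainder is zero."""
    return crc_remainder(codeword_bits) == 0

# e.g. protect a hypothetical 16-bit region code read from the dot pattern
payload = [int(b) for b in format(0xBEEF, "016b")]
assert crc_ok(append_crc(payload))
```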

The Use of Graphic Novels for Developing Multiliteracies (그래픽노블을 통한 다중문식성의 발달)

  • Yun, Eunja
    • Journal of English Language & Literature
    • /
    • v.56 no.4
    • /
    • pp.575-596
    • /
    • 2010
  • The modes of narrative and communication have expanded owing to social and cultural changes and technological development. Texts have thus become multimodal, and media hybridity and media crossover have been increasing as well. Multimodality requires a new literacy to understand and interpret such multimodal texts, beyond existing traditional literacy approaches. The New London Group (2000) argues that multiliteracies are needed to serve today's changing multimodal texts. Kress (2003) likewise argues that visual texts have become prevalent, mingled with other modes such as linguistic, audio, gestural, and spatial modes. Literary texts are no exception to this trend of multimodality, and the recent renaissance of comics, in particular the new attention to graphic novels, can be interpreted in this historical vein. Compared with comics, no consensus has been reached on defining graphic novels; nevertheless, many studies have recently examined their potential for building multiliteracies. In this paper, the graphic novel as a literary genre is explored from a historical perspective and a definition of graphic novels is attempted. In the light of multiliteracies, the paper presents cases that show how graphic novels can be utilized to build multiliteracies. Lastly, the use of graphic novels in teaching English as a foreign language is introduced as well. The author hopes that, in this age of multimodality, the potential of graphic novels for language and literacy education will be taken into account by language teachers and students in expanding their territory of literacy.

Audio-Visual Scene Aware Dialogue System Utilizing Action From Vision and Language Features (이미지-텍스트 자질을 이용한 행동 포착 비디오 기반 대화시스템)

  • Jungwoo Lim;Yoonna Jang;Junyoung Son;Seungyoon Lee;Kinam Park;Heuiseok Lim
    • Annual Conference on Human and Language Technology
    • /
    • 2023.10a
    • /
    • pp.253-257
    • /
    • 2023
  • Recently, various dialogue systems have been applied to real-world human-machine interfaces such as smartphone assistants, car navigation, voice-controlled speakers, and human-centered robots. However, most dialogue systems operate on text alone and cannot handle multimodal input. Solving this problem requires a dialogue system that integrates multimodal scene understanding, such as video. Existing video-grounded dialogue systems mostly focus on fusing diverse features (visual, image, audio) or on aligning images and text through pre-training, and consequently miss important action and sound cues. This paper improves a video-grounded dialogue system by exploiting pre-trained image-text alignment embeddings together with action and sound cues. The proposed model encodes text, image, and audio embeddings, extracts the relevant frames and action cues from them, and then generates a response. Experiments on the AVSD dataset show that the proposed model outperforms existing models, and representative image-text features are comparatively analyzed within the video-grounded dialogue setting.
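A minimal sketch of one step described above, picking the video frames most relevant to the current dialogue context by cosine similarity between pre-trained image-text embeddings; the embedding source and the value of k are assumptions:

```python
import numpy as np

def top_k_frames(frame_embs, query_emb, k=5):
    """Rank frames by cosine similarity to the dialogue-query embedding.

    frame_embs: (n_frames, dim) image embeddings from a pre-trained image-text encoder.
    query_emb:  (dim,) embedding of the dialogue history / current question.
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = f @ q
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]   # indices of the k most relevant frames and their scores
```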


A Research of User Experience on Multi-Modal Interactive Digital Art

  • Qianqian Jiang;Jeanhun Chung
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.16 no.1
    • /
    • pp.80-85
    • /
    • 2024
  • The concept of single-modal digital art originated in the 20th century and has evolved through three key stages. Over time, digital art has transformed into multi-modal interaction, representing a new era in art forms. Based on multi-modal theory, this paper aims to explore the characteristics of interactive digital art as an innovative art form and its impact on user experience. Through an analysis of practical applications of multi-modal interactive digital art, this study summarises the impact of creative models of digital art on the physical and mental aspects of user experience. In creating audio-visual-based art, multi-modal digital art should seamlessly incorporate sensory elements and leverage computer image processing technology. Focusing on user perception, emotional expression, and cultural communication, it strives to establish an immersive environment with user experience at its core. Future research, particularly with emerging technologies such as Augmented Reality (AR) and Virtual Reality (VR), should not merely prioritize technology but aim for meaningful interaction. Through multi-modal interaction, digital art is poised to continue innovating, offering new possibilities and expanding the realm of interactive digital art.

Design and Implementation of a Real-Time Lipreading System Using PCA & HMM (PCA와 HMM을 이용한 실시간 립리딩 시스템의 설계 및 구현)

  • Lee, Chi-Geun;Lee, Eun-Suk;Jung, Sung-Tae;Lee, Sang-Seol
    • Journal of Korea Multimedia Society
    • /
    • v.7 no.11
    • /
    • pp.1597-1609
    • /
    • 2004
  • Many lipreading systems have been proposed to compensate for the drop in speech recognition rates in noisy environments. Previous lipreading systems work only under specific conditions, such as artificial lighting and a predefined background color. In this paper, we propose a real-time lipreading system that allows speaker motion and relaxes the restrictions on color and lighting conditions. The proposed system extracts the face and lip regions, together with the essential visual information, in real time from an input video sequence captured with a common PC camera, and recognizes uttered words from this visual information in real time. It uses a hue histogram model to extract the face and lip regions, the mean shift algorithm to track the face of a moving speaker, PCA (Principal Component Analysis) to extract the visual features for training and testing, and an HMM (Hidden Markov Model) as the recognition algorithm. The experimental results show that the system achieves a recognition rate of 90% for speaker-dependent lipreading and, when combined with audio speech recognition, raises the speech recognition rate to 40~85% depending on the noise level.
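A minimal numpy sketch of the PCA ("eigenlip") feature extraction stage; the image size, number of components, and downstream per-word HMM scoring are assumptions indicated only in comments:

```python
import numpy as np

def fit_pca(lip_images, n_components=20):
    """lip_images: (n_samples, h*w) flattened grayscale lip-region crops."""
    mean = lip_images.mean(axis=0)
    X = lip_images - mean
    # rows of Vt are the principal directions ("eigenlips")
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return mean, Vt[:n_components]

def project(frames, mean, components):
    """Per-frame visual feature vectors; in the system these sequences would be
    scored by one HMM per vocabulary word, and the highest-likelihood word wins."""
    return (frames - mean) @ components.T
```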


A Study about the Users's Preferred Playing Speeds on Categorized Video Content using WSOLA method (WSOLA를 이용한 동영상 미세배속 재생 서비스에 대한 콘텐츠별 배속 선호도 분석 연구)

  • Kim, I-Gil
    • Journal of Digital Contents Society
    • /
    • v.16 no.2
    • /
    • pp.291-298
    • /
    • 2015
  • In a fast-paced information technology environment, consumption of video content is shifting from one-way television viewing to VOD (Video on Demand) playback anywhere, anytime, on any device. This viewing trend makes fine speed control an increasingly important feature of digital video playback. Currently, many video players provide a fine-speed-control function that can speed up a video to skip a boring part or slow it down to focus on an exciting scene. For understanding the content of speed-controlled video, the audio is just as important as the visual information, so a number of algorithms have been proposed in the audio processing area to remove the pitch distortion that speed-controlled playback introduces. In this study, WSOLA (Waveform-Similarity-based Overlap-Add), a well-known technique for prosodic modification of speech signals, is applied in order to analyze users' needs for fine-speed-controlled video playback. By surveying users' preferred playback speeds for categorized video content and analyzing the results, this paper argues that a range of fine speed adjustments is needed to accommodate users' preferred modes of video consumption.
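A compact sketch of WSOLA time-scale modification, the technique named above: each synthesis frame is copied from the analysis position (within a tolerance) whose waveform best matches the natural continuation of the previously copied frame, then overlap-added with a window, so playback speed changes without shifting pitch. The frame length, hop, and tolerance values are illustrative:

```python
import numpy as np

def wsola(x, speed, frame_len=1024, hop_out=256, tolerance=256):
    """Time-scale modify a mono signal x; speed > 1 plays faster, speed < 1 slower."""
    hop_in = int(round(hop_out * speed))
    window = np.hanning(frame_len)
    # zero-pad so candidate and continuation slices never run off the end
    xp = np.concatenate([x.astype(float), np.zeros(frame_len + 2 * tolerance + hop_out)])
    n_frames = max((len(x) - frame_len - 2 * tolerance) // hop_in, 1)

    y = np.zeros(n_frames * hop_out + frame_len)
    norm = np.zeros_like(y)
    prev_cont = 0   # where the natural continuation of the last copied segment starts

    for k in range(n_frames):
        center = k * hop_in + tolerance        # nominal analysis position
        if k == 0:
            best = center
        else:
            target = xp[prev_cont:prev_cont + frame_len]
            # choose, within +/- tolerance, the segment most similar to that continuation
            best = max(range(center - tolerance, center + tolerance),
                       key=lambda c: np.dot(xp[c:c + frame_len], target))
        y[k * hop_out:k * hop_out + frame_len] += xp[best:best + frame_len] * window
        norm[k * hop_out:k * hop_out + frame_len] += window
        prev_cont = best + hop_out

    norm[norm < 1e-8] = 1.0
    return y / norm   # roughly len(x) / speed samples long
```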

A Study on Audio-Visual Expression of Biometric Data Based on the Polysomnography Test (수면다원검사에 기반한 생체데이터 시청각화 연구)

  • Kim, Hee Soo;Oh, Na Yea;Park, Jin Wan
    • Korea Science and Art Forum
    • /
    • v.35
    • /
    • pp.145-155
    • /
    • 2018
  • The goal of this study is to provide a new type of audio-visualization method, through case analysis and artwork production, based on polysomnography (PSG) data that are difficult to interpret and unfamiliar to the public. Most artworks are produced through conscious actions during waking hours, whereas during sleep we enter the world of the unconscious. Through this experiment, I therefore wanted to discover whether something new could be drawn from that unconscious state and, if so, what kind of art could be made from it. The study first considers the definition of sleep and the nature of sleep data. The sleep data were classified into a normal group and narcolepsy, insomnia, and sleep apnea groups, focusing on the sleep disorder graphs measured by the polysomnograph. The acquired biometric data were then refined and converted into a text-based script, and the sleep stages described in the script were rendered as 3D animated images using Maya. In addition, the heart rate data were transformed into MIDI and turned into audio in GarageBand. The images and sound were combined in After Effects to create four single-channel videos of 3 minutes and 20 seconds each. As a result, the work gives anyone an easy way to grasp, through art rather than difficult medical terminology, how the results differ from normal data. It also demonstrated the possibility of artistic expression even when no conscious action occurs. Based on these results, I expect the artistic audio-visual expression of biometric data to expand and diversify.
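As a programmatic analogue of the heart-rate-to-MIDI step (the paper itself used a GarageBand workflow), a small sketch that quantizes a per-epoch heart-rate series onto a musical scale and writes a MIDI file; the scale, note duration, and the choice of the mido library are assumptions:

```python
import numpy as np
import mido  # generic MIDI writer; any MIDI library would serve

def heart_rate_to_midi(bpm_series, out_path="heart_rate.mid"):
    """Map heart-rate values (beats per minute) onto an illustrative pentatonic-style scale."""
    scale = np.array([60, 62, 64, 67, 69, 72, 74, 76])   # MIDI note numbers
    lo, hi = float(np.min(bpm_series)), float(np.max(bpm_series))
    idx = np.round((np.asarray(bpm_series) - lo) / max(hi - lo, 1e-9) * (len(scale) - 1)).astype(int)

    mid = mido.MidiFile()
    track = mido.MidiTrack()
    mid.tracks.append(track)
    for note in scale[idx]:
        track.append(mido.Message("note_on", note=int(note), velocity=64, time=0))
        track.append(mido.Message("note_off", note=int(note), velocity=64, time=240))
    mid.save(out_path)
```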