• Title/Summary/Keyword: video synthesis (비디오 합성)

Search results: 172

One-shot multi-speaker text-to-speech using RawNet3 speaker representation (RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템)

  • Sohee Han;Jisub Um;Hoirin Kim
    • Phonetics and Speech Sciences
    • /
    • v.16 no.1
    • /
    • pp.67-76
    • /
    • 2024
  • Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, reaching a level where it can closely imitate natural human speech. In particular, TTS models that offer varied voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutoring, advertising, and video dubbing. Accordingly, in this paper we propose a one-shot multi-speaker TTS system that ensures acoustic diversity and synthesizes personalized voices by generating speech from utterances of unseen target speakers. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. Furthermore, the proposed approach covers not only English but also Korean one-shot multi-speaker TTS. We evaluate the naturalness and speaker similarity of the generated speech using objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. In the objective evaluation, the proposed English and Korean one-shot multi-speaker TTS showed predicted MOS (P-MOS) values of 2.54 and 3.74, respectively. These results indicate that our proposed model improves on the baseline models in terms of both naturalness and speaker similarity.
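A common objective proxy for the speaker-similarity scores discussed above is the cosine similarity between speaker embeddings extracted from the target and synthesized utterances. A minimal NumPy sketch, using random vectors as stand-ins for RawNet3 embeddings (the dimensionality and data here are illustrative, not taken from the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for speaker-encoder outputs.
rng = np.random.default_rng(0)
target = rng.standard_normal(256)
synthesized_same = target + 0.1 * rng.standard_normal(256)   # close to target
synthesized_other = rng.standard_normal(256)                 # unrelated speaker

sim_same = cosine_similarity(target, synthesized_same)
sim_other = cosine_similarity(target, synthesized_other)
# A higher score indicates the synthesized voice is closer to the target speaker.
```

A speech synthesized from the target speaker's reference utterance should score markedly higher than one conditioned on a different speaker.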

Automatic Arm Region Segmentation and Background Image Composition (자동 팔 영역 분할과 배경 이미지 합성)

  • Kim, Dong Hyun;Park, Se Hun;Seo, Yeong Geon
    • Journal of Digital Contents Society
    • /
    • v.18 no.8
    • /
    • pp.1509-1516
    • /
    • 2017
  • In a first-person-perspective training system, users need a realistic experience. To provide it, the system should offer users virtual and real images at the same time. We propose a method for automatically segmenting a person's arm and compositing it with a background image. It consists of an arm segmentation part and an image composition part. Arm segmentation takes an arbitrary image as input and outputs an arm segment or an alpha matte; because this part uses a fully convolutional network (FCN), it enables end-to-end learning. The image composition part combines the result of arm segmentation with another image, such as a road or a building. To train the segmentation network, we used arm images obtained by splitting videos that we recorded ourselves into training frames.
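The image composition step described above amounts to standard alpha-matte blending of the segmented arm over a virtual background. A minimal NumPy sketch (the toy images and matte are illustrative):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Alpha-matte composition: out = alpha*fg + (1-alpha)*bg, per pixel."""
    alpha = alpha[..., None]  # broadcast the matte over color channels
    return alpha * foreground + (1.0 - alpha) * background

# Toy 2x2 RGB images; the matte keeps the left column (arm) and drops the right.
fg = np.ones((2, 2, 3)) * 200.0    # segmented arm region
bg = np.ones((2, 2, 3)) * 50.0     # virtual background (road, building, ...)
alpha = np.array([[1.0, 0.0],
                  [1.0, 0.0]])

out = composite(fg, bg, alpha)
# Left column comes from the arm image, right column from the background.
```

With a soft matte (values between 0 and 1 near the arm boundary), the same formula blends edge pixels smoothly instead of cutting them hard.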

Efficient Layered Depth Image Representation of Multi-view Image with Color and Depth Information (컬러와 깊이 정보를 포함하는 다시점 영상의 효율적 계층적 깊이 영상 표현)

  • Lim, Joong-Hee;Kim, Min-Tae;Shin, Jong-Hong;Jee, Inn-Ho
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.9 no.1
    • /
    • pp.53-59
    • /
    • 2009
  • Because multi-view video involves a huge amount of data, a new compression encoding technique is necessary for storage and transmission. The layered depth image is an efficient representation of multi-view video data: it builds a single data structure by synthesizing multi-view color and depth images. This paper proposes an enhanced compression method based on an efficient layered depth image representation that uses real-distance comparison, a solution to the overlap problem, and interpolation. Experimental results confirmed high compression performance.
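A layered depth image stores, at each pixel position, a depth-ordered list of color samples gathered from the warped views; the real-distance comparison mentioned above merges samples that belong to the same surface so they are not stored twice. A minimal sketch of that structure (the field names and merge tolerance are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class LayeredPixel:
    """One (x, y) location of a layered depth image: a depth-ordered
    list of (depth, color) samples gathered from warped views."""
    layers: list = field(default_factory=list)  # [(depth, color), ...]

    def insert(self, depth: float, color: tuple, tol: float = 1e-3):
        # Samples whose real distance matches an existing layer within
        # `tol` are treated as the same surface (the overlap problem).
        for d, _ in self.layers:
            if abs(d - depth) < tol:
                return  # duplicate surface; keep the first sample
        self.layers.append((depth, color))
        self.layers.sort(key=lambda t: t[0])  # front-to-back order

px = LayeredPixel()
px.insert(1.0, (255, 0, 0))      # front surface seen by view A
px.insert(1.0005, (250, 5, 0))   # same surface re-seen by view B -> merged
px.insert(3.0, (0, 0, 255))      # occluded surface behind it
```

Keeping occluded layers is what lets a single LDI reconstruct views from nearby camera positions without holes.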


Effective Compression Technique of Multi-view Image expressed by Layered Depth Image (계층적 깊이 영상으로 표현된 다시점 영상의 효과적인 압축 기술)

  • Jee, Inn-Ho
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.14 no.4
    • /
    • pp.29-37
    • /
    • 2014
  • Since multi-view video comprises color and depth images from a number of cameras, it carries a huge amount of data, and a new compression technique is indispensable for reducing it. Recently, an effective compression encoding technique for multi-view video based on layered depth image concepts has drawn attention. This method uses depth information and a warping function from several viewpoints to synthesize multi-view color and depth images into a single data structure. In this paper we use actual distance to resolve overlaps in the layered depth image, which reduces the data required for reconstruction under a color-based transform. Experimental results confirmed high compression performance and good quality of the reconstructed images.

Adaptive Residual DPCM using Weighted Linear Combination of Adjacent Residues in Screen Content Video Coding (스크린 콘텐츠 비디오의 압축을 위한 인접 화소의 가중 합을 이용한 적응적 Residual DPCM 기법)

  • Kang, Je-Won
    • Journal of Broadcast Engineering
    • /
    • v.20 no.5
    • /
    • pp.782-785
    • /
    • 2015
  • In this paper, we propose a novel residual differential pulse-code modulation (RDPCM) coding technique to improve the coding efficiency of screen content video. The proposed method uses a weighted combination of adjacent residues to provide an accurate estimate in RDPCM. The weights are trained on previously coded samples by solving an L1-regularized optimization problem with the least absolute shrinkage and selection operator (LASSO). The proposed method achieves a BD-rate saving of about 3.1% in all-intra coding.
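The core of the method above is predicting each residue as a weighted combination of its neighbours, with weights fit by L1-regularized least squares on previously coded samples. A minimal NumPy sketch using ISTA (proximal gradient) as the LASSO solver; the solver choice, neighbour set, and hyperparameters are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def lasso_weights(X, y, lam=0.1, lr=0.01, iters=500):
    """L1-regularized least squares via ISTA (proximal gradient)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y)      # gradient of 1/(2n)||Xw - y||^2
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

# Previously coded residues: predict each sample from its left and upper
# neighbours (a stand-in for the adjacent residues used in the paper).
rng = np.random.default_rng(1)
left = rng.standard_normal(200)
up = rng.standard_normal(200)
target = 0.7 * left + 0.3 * up + 0.01 * rng.standard_normal(200)

X = np.column_stack([left, up])
w = lasso_weights(X, target)
pred = X @ w  # weighted combination used as the RDPCM estimate
```

The L1 penalty shrinks unhelpful neighbour weights toward zero, so the trained predictor adapts to the strongly directional structure typical of screen content.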

Efficient Compression Technique of Multi-view Image with Color and Depth Information by Layered Depth Image Representation (계층적 깊이 영상 표현에 의한 컬러와 깊이 정보를 포함하는 다시점 영상에 대한 효율적인 압축기술)

  • Lim, Joong-Hee;Shin, Jong-Hong;Jee, Inn-Ho
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.34 no.2C
    • /
    • pp.186-193
    • /
    • 2009
  • Because multi-view video involves a huge amount of data, a new compression encoding technique is necessary for storage and transmission. The layered depth image is an efficient representation of multi-view video data: it builds a single data structure by synthesizing multi-view color and depth images. This paper proposes an enhanced compression method based on an efficient layered depth image representation that uses real-distance comparison, a solution to the overlap problem, and a YCrCb color transformation. Experimental results confirmed high compression performance and good reconstructed images.

DisplayPort 1.1a Standard Based Multiple Video Streaming Controller Design (디스플레이포트1.1a 표준 기반 멀티플 비디오 스트리밍 컨트롤러 설계)

  • Jang, Ji-Hoon;Im, Sang-Soon;Song, Byung-Cheol;Kang, Jin-Ku
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.48 no.11
    • /
    • pp.27-33
    • /
    • 2011
  • As the display market grows, many display devices support digital display interfaces. DisplayPort is a next-generation display interface whose connection solutions are increasingly used in PCs, projectors, and high-definition content applications. This paper implements multiple streams over the main link in conformance with the DisplayPort v1.1a standard, and also addresses a limitation of DisplayPort by implementing the interface between Sink Devices. Two or more differential image streams can be output to two or more display devices through the four lanes specified in DisplayPort v1.1a, without adding a separate lane. The multiple video streaming controller was synthesized with Quartus II on an Altera Audio/Video development board (Stratix II GX FPGA), using 6,222 ALUTs, 6,686 registers, and 999,424 block memory bits.

2D Adjacency Matrix Generation using DCT for UWV Contents (DCT를 통한 UWV 콘텐츠의 2D 인접도 행렬 생성)

  • Xiaorui, Li;Kim, Kyuheon
    • Journal of Broadcast Engineering
    • /
    • v.22 no.3
    • /
    • pp.366-374
    • /
    • 2017
  • As display devices such as TVs and digital signage grow larger, media types are shifting toward wider views such as UHD, panoramic, and jigsaw-like media. Panoramic and jigsaw-like media in particular are realized by stitching video clips captured by different cameras or devices. However, the stitching process takes a long time and is difficult to apply in real time. This paper therefore suggests finding a 2D adjacency matrix, which describes the spatial relationships among the video clips, in order to decrease stitching time. Using the discrete cosine transform (DCT), we convert each frame of a video source from the spatial domain into the frequency domain. Based on these frequency-domain features, the 2D adjacency matrix of the images can be found, allowing an efficient spatial map of the images to be built. This paper proposes a new method of generating a 2D adjacency matrix using the DCT for producing panoramic and jigsaw-like media from various individual video clips.
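One way to realize the idea above is to compare low-frequency DCT signatures of the border strips of two clips: if A's right edge matches B's left edge in the frequency domain, B likely sits to the right of A in the adjacency matrix. A NumPy sketch (the strip width, number of coefficients kept, and scoring by negative distance are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II built from explicit basis matrices."""
    n, m = block.shape
    def dct_mat(k):
        c = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * k))
                       for j in range(k)] for i in range(k)])
        c[0] *= np.sqrt(1 / k)
        c[1:] *= np.sqrt(2 / k)
        return c
    return dct_mat(n) @ block @ dct_mat(m).T

def edge_signature(frame, side, width=2, keep=4):
    """Low-frequency DCT signature of a frame's border strip."""
    strip = frame[:, :width] if side == "left" else frame[:, -width:]
    return dct2(strip)[:keep, :].ravel()

def adjacency_score(frame_a, frame_b):
    """Similarity between A's right edge and B's left edge; a high score
    suggests B sits to the right of A in the 2-D adjacency matrix."""
    a = edge_signature(frame_a, "right")
    b = edge_signature(frame_b, "left")
    return -np.linalg.norm(a - b)

# Toy scene split into two horizontally adjacent clips.
scene = np.outer(np.arange(8), np.arange(16)).astype(float)
clip_a, clip_b = scene[:, :8], scene[:, 8:]
clip_c = np.flipud(clip_b)  # a non-adjacent clip for comparison
score_ab = adjacency_score(clip_a, clip_b)
score_ac = adjacency_score(clip_a, clip_c)
```

Scoring every ordered pair of clips this way fills the adjacency matrix without running the full pixel-level stitching pipeline.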

A Study on Architecture of Motion Compensator for H.264/AVC Encoder (H.264/AVC부호화기용 움직임 보상기의 아키텍처 연구)

  • Kim, Won-Sam;Sonh, Seung-Il;Kang, Min-Goo
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.3
    • /
    • pp.527-533
    • /
    • 2008
  • Motion compensation is the principal bottleneck in real-time, high-quality video applications, so fast dedicated hardware is needed to perform it. In many video encoding methods, frames are partitioned into blocks of pixels, and motion compensation predicts the current block by estimating motion from the previous frame. Higher sub-pixel accuracy yields better performance, but the computational complexity increases. In this paper, we study a motion compensator architecture suitable for an H.264/AVC encoder that supports quarter-pixel accuracy. The designed motion compensator increases throughput using a transpose array and three 6-tap luma filters, and efficiently reduces memory accesses. The motion compensator is described in VHDL, synthesized with Xilinx ISE, and verified using ModelSim 6.1i. Our motion compensator uses 6-tap filters only and processes a macroblock in 640 clock cycles. The proposed motion compensator is suitable for areas that require real-time video processing.
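The quarter-pixel accuracy discussed above builds on the H.264/AVC luma half-pixel interpolation: a 6-tap FIR filter with taps (1, -5, 20, 20, -5, 1), followed by rounding and division by 32 (quarter-pixel values are then obtained by averaging). A scalar NumPy sketch of the half-pixel filter; edge handling by replication is a simplification of the standard's behavior:

```python
import numpy as np

def halfpel_interp(row):
    """H.264 luma half-pixel interpolation along one row: 6-tap filter
    (1, -5, 20, 20, -5, 1), round, divide by 32, clamp to [0, 255]."""
    taps = np.array([1, -5, 20, 20, -5, 1])
    padded = np.pad(row, 2, mode="edge")     # replicate border samples
    out = np.empty(len(row) - 1, dtype=int)  # one half-pel between each pair
    for i in range(len(out)):
        acc = int(np.dot(taps, padded[i:i + 6]))
        out[i] = np.clip((acc + 16) >> 5, 0, 255)  # (+16) rounds before /32
    return out

# Half-pel samples across a step edge in a toy row of luma pixels.
row = np.array([10, 10, 10, 100, 100, 100])
half = halfpel_interp(row)
flat = halfpel_interp(np.full(6, 10))  # flat input stays flat
```

A separable hardware design applies this filter horizontally and then vertically, which is where the transpose array mentioned in the abstract pays off.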

Super Metric: Quality Assessment Methods for Immersive Video (몰입형 비디오 품질 평가를 위한 슈퍼 메트릭)

  • Jeong, Jong-Beom;Kim, Seunghwan;Lee, Soonbin;Kim, Inae;Ryu, Eun-Seok
    • Journal of Internet Computing and Services
    • /
    • v.22 no.2
    • /
    • pp.51-58
    • /
    • 2021
  • Three-degrees-of-freedom-plus (3DoF+) and six-degrees-of-freedom (6DoF) systems, which support a user's movement in graphics-based and natural-scene-based virtual reality, require multiple high-quality, high-resolution videos to provide immersive media. Previous video quality assessment methods are not appropriate for assessing 3DoF+ and 6DoF systems, because these systems exhibit types of artifacts not seen in traditional video compression. This paper assesses the performance of several quality assessment methods in a 3DoF+ system. Furthermore, it presents a super metric that combines multiple quality assessment methods and thereby shows a higher correlation coefficient with subjective quality assessment than the previous methods. Experimental results on 3DoF+ immersive video showed a 0.4513 gain in correlation coefficient with subjective quality assessment compared to peak signal-to-noise ratio (PSNR).
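PSNR, the baseline the super metric is compared against above, and a weighted combination of per-metric scores can be sketched as follows (the combination weights are illustrative; the paper's actual super-metric weighting is not reproduced here):

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio in dB between reference and distorted frames."""
    mse = np.mean((ref.astype(float) - dist.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def super_metric(scores, weights):
    """Weighted combination of per-metric quality scores; a combined metric
    can correlate better with subjective MOS than any single component."""
    return sum(w * s for w, s in zip(weights, scores))

# Toy frames: a uniform error of 4 gives MSE = 16.
ref = np.full((4, 4), 128.0)
dist = ref + 4.0
p = psnr(ref, dist)  # 10*log10(255^2 / 16) dB

combined = super_metric([0.8, 0.6], [0.5, 0.5])  # two normalized metric scores
```

In practice the component scores would come from metrics sensitive to the view-synthesis artifacts of 3DoF+ content, and the weights would be fit against subjective scores.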