• Title/Summary/Keyword: parallel architecture

All Phase Discrete Sine Biorthogonal Transform and Its Application in JPEG-like Image Coding Using GPU

  • Shan, Rongyang;Zhou, Xiao;Wang, Chengyou;Jiang, Baochen
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.9 / pp.4467-4486 / 2016
  • The discrete cosine transform (DCT) based JPEG standard significantly improves the coding efficiency of image compression, but it suffers from serious blocking artifacts at low bit rates and from low efficiency on high-definition images. In light of all phase digital filtering theory, this paper proposes a novel transform based on the discrete sine transform (DST), called the all phase discrete sine biorthogonal transform (APDSBT). Applying APDSBT to the JPEG scheme reduces the blocking artifacts significantly. The reconstructed image of APDSBT-JPEG is better than that of DCT-JPEG in terms of objective quality and subjective effect. To improve the efficiency of JPEG coding, the structure of JPEG is analyzed. We analyze key factors in the design and evaluation of JPEG compression on massively parallel graphics processing units (GPUs) using the compute unified device architecture (CUDA) programming model. Experimental results show that the maximum speedup ratio of the parallel APDSBT-JPEG algorithm can exceed 100 times even on a very low-end GPU. Some new parallel strategies are illustrated in this paper for improving the performance of the parallel algorithm. With the optimal strategy, the efficiency can be improved by over 10%.

Design of a Single-chip Multiprocessor with On-chip Learning for Large Scale Neural Network Simulation

  • 김종문;송윤선;김명원
    • Journal of the Korean Institute of Telematics and Electronics B / v.33B no.2 / pp.149-158 / 1996
  • In this paper we describe the design and implementation of a digital neural chip and a parallel neural machine for simulating large scale neural networks. The chip is a single-chip multiprocessor that has four digital neural processors (DNP-II) of the same architecture. Each DNP-II has program memory and data memory, and the chip operates as a MIMD (multiple-instruction, multiple-data) parallel processor. The DNP-II has an instruction set tailored to neural computation, which can be used to effectively simulate various neural network models, including on-chip learning. The DNP-II facilitates four-way data-driven communication, supporting the extensibility of parallel systems. The parallel neural machine consists of a host computer, processor boards, a buffer board and an interface board. Each processor board consists of an 8*8 array of DNP-II (equivalently 2*2 neural chips). Each processor board can be built in configurations including linear array, 2-D mesh and 2-D torus. This flexibility supports efficient mapping from neural network models onto the parallel structure. The neural system achieves a maximum performance of 40 GCPS (giga connections per second) with 16 processor boards.

Task Creation and Assignment based on Object Caching for Parallel Spatial Join

  • 서영덕;김진덕;홍봉희
    • Journal of KIISE:Software and Applications / v.26 no.10 / pp.1178-1178 / 1999
  • A spatial join has the property that its execution time increases exponentially with the number of spatial objects. Recently, there have been many attempts to improve the performance of the spatial join by using parallel processing schemes. When a parallel spatial join is executed on a parallel machine with a shared disk architecture, the disk bottleneck is worse than in a sequential spatial join. This paper presents algorithms for task creation and assignment that reduce the disk bottleneck caused by simultaneous access to the shared disk and minimize message passing between processors. This paper proposes object caching, which is a higher level of abstraction than page caching, and uses it to create and assign tasks according to temporal and spatial localities in order to minimize disk access time. Object caching shows a performance improvement of 50%. Task creation and assignment using localities give gains of 30% and 20%. The overall performance evaluation of the proposed algorithms shows a 7.2 times speedup over sequential execution of spatial joins.

An Improved Hybrid Approach to Parallel Connected Component Labeling using CUDA

  • Soh, Young-Sung;Ashraf, Hadi;Kim, In-Taek
    • Journal of the Institute of Convergence Signal Processing / v.16 no.1 / pp.1-8 / 2015
  • In many image processing tasks, connected component labeling (CCL) is performed to extract regions of interest. CCL was usually done in a sequential fashion when image resolution was relatively low and there were a small number of input channels. As image resolution rises to HD or Full HD and as the number of input channels increases, sequential CCL becomes too time-consuming to be used in real-time applications. To cope with this situation, a parallel CCL framework was introduced in which multiple cores are utilized simultaneously. Several parallel CCL methods have been proposed in the literature. Among them are the NSZ label equivalence (NSZ-LE) method[1], the modified 8 directional label selection (M8DLS) method[2], and the HYBRID1 method[3]. Soh [3] showed that HYBRID1 outperforms NSZ-LE and M8DLS, and argued that HYBRID1 is by far the best. In this paper we propose an improved hybrid parallel CCL algorithm, termed HYBRID2, that hybridizes M8DLS with label backtracking (LB), and show that it runs around 20% faster than HYBRID1 for various kinds of images.
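
For readers unfamiliar with CCL, the general label-equivalence idea can be sketched as follows. This is a minimal sequential Python illustration, not the HYBRID2, HYBRID1, M8DLS, or NSZ-LE algorithm and not CUDA code; the function name `label_components`, the 8-connectivity choice, and the union-find bookkeeping are assumptions made for illustration only.

```python
from typing import Dict, List


def label_components(image: List[List[int]]) -> List[List[int]]:
    """Two-pass, 8-connected labeling of a binary image (hypothetical helper)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[i] is the union-find parent of label i; 0 is background

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        # Record that labels a and b belong to the same component.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            neighbors = []
            # Already-visited neighbors: upper-left, up, upper-right, left.
            for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr and 0 <= nc < cols and labels[nr][nc]:
                    neighbors.append(labels[nr][nc])
            if not neighbors:
                parent.append(len(parent))  # create a fresh label
                labels[r][c] = len(parent) - 1
            else:
                smallest = min(neighbors)
                labels[r][c] = smallest
                for n in neighbors:
                    union(smallest, n)

    # Second pass: replace each label by its root, renumbered consecutively.
    remap: Dict[int, int] = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels
```

Parallel label-equivalence methods typically distribute the scanning and merging phases across GPU threads; the sketch keeps everything sequential so the equivalence bookkeeping is easy to follow.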

Neural network controller design with a performance evaluation level

  • 이현철;조원철;전기준
    • Institute of Control, Robotics and Systems: Conference Proceedings / 1992.10a / pp.613-618 / 1992
  • We propose a new control architecture which consists of a PI controller and a neural network (NN) controller connected in parallel. This architecture adapts well to a wide range of uncertainties and variations of systems. The NN controller is trained through the weights of the emulator, which identifies the dynamic characteristics of the system. A performance evaluation level of the two NNs automatically decides which of the two controllers will be used mainly. The PI controller operates mainly during the learning phase of the NN controller, whereas good performance is obtained from the NN controller alone once the NN controller has been trained sufficiently.

A Review of the Development of Spatial Structures in China

  • Shen, S.Z.;Lan, T.T.
    • Journal of Korean Association for Spatial Structures / v.1 no.1 s.1 / pp.34-42 / 2001
  • The development of contemporary spatial structures for long-span roofs in China was initiated in the 1950s. Space trusses, reticulated shells and cable suspended structures have been developing rapidly since the 1980s. Recently there has been a growing interest in tensile membrane structures. Comprehensive theoretical study has been carried out in parallel with the engineering application, which provided the necessary theoretical support for the development of different types of spatial structures.

Hardware Design and Implementation of a Parallel Processor for High-Performance Multimedia Processing

  • Kim, Yong-Min;Hwang, Chul-Hee;Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.16 no.5 / pp.1-11 / 2011
  • As the use of mobile multimedia devices has increased in recent years, the need for high-performance multimedia processors has grown. In this regard, we propose a SIMD (Single Instruction Multiple Data) based parallel processor that supports high-performance multimedia applications with low energy consumption. The proposed parallel processor consists of 16 processing elements (PEs) and operates with a 3-stage pipeline. Experimental results indicate that the proposed parallel processor outperforms conventional parallel processors in terms of performance. In addition, the proposed parallel processor outperforms the commercial high-performance TI C6416 DSP in terms of performance (1.4-31.4x better) and energy efficiency (5.9-8.1x better) with the same 130nm technology and 720 MHz clock frequency. The proposed parallel processor was developed in Verilog HDL and verified with an FPGA prototype system.

A Service System Architecture of a Large Parallel Information Retrieval System Based on ODYSSEUS/Parallel-OOSQL

  • 성경복;이재길;황규영
    • Proceedings of the Korean Information Science Society Conference / 2004.10b / pp.109-111 / 2004
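  • As the amount of data on the Internet grows exponentially, parallel information retrieval systems have been developed to efficiently support information retrieval over large volumes of data. However, little material has been published on operating large-scale parallel information retrieval service systems, which makes it difficult to build and operate such systems in practice. This paper proposes an architecture for a large-scale parallel information retrieval service system. To this end, we 1) present the physical machine configuration for building a parallel information retrieval service system, 2) devise a data insertion method that allows fast data addition even while the search service is running, and 3) devise a database rebuilding method that allows continuous service even while the database is being rebuilt.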

Efficient Implementation of a Pseudorandom Sequence Generator for High-Speed Data Communications

  • Hwang, Soo-Yun;Park, Gi-Yoon;Kim, Dae-Ho;Jhang, Kyoung-Son
    • ETRI Journal / v.32 no.2 / pp.222-229 / 2010
  • A conventional pseudorandom sequence generator creates only 1 bit of data per clock cycle. Therefore, it may cause a delay in data communications. In this paper, we propose an efficient implementation method for a pseudorandom sequence generator with parallel outputs. Using simple matrix multiplications, we derive a well-organized recursive formula and realize a pseudorandom sequence generator with multiple outputs. Experimental results show that, although the total area of the proposed scheme is 3% to 13% larger than that of the existing scheme, our parallel architecture improves the throughput by 2, 4, and 6 times compared with the existing scheme based on a single output. In addition, we apply our approach to a $2 \times 2$ multiple input/multiple output (MIMO) detector targeting the 3rd Generation Partnership Project Long Term Evolution (3GPP LTE) system. As a result, the throughput of the MIMO detector is significantly enhanced by parallel processing of data communications.
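
The matrix idea behind a multi-output generator can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's recursive formula or hardware design: it models a generic Fibonacci linear feedback shift register (LFSR) over GF(2), and the tap positions, seed, and function names are hypothetical. If one clock of the serial generator maps the state `s` to `M s`, then `k` output bits can be read from the first rows of `M^0 ... M^(k-1)` and the state can jump directly by `M^k`.

```python
from typing import List

Matrix = List[List[int]]


def companion_matrix(taps: List[int], n: int) -> Matrix:
    """Transition matrix M of an n-bit Fibonacci LFSR over GF(2).

    One clock maps state s to s' with s'[i] = s[i+1] for i < n-1 and
    s'[n-1] = XOR of s[t] for t in taps.
    """
    m = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        m[i][i + 1] = 1
    for t in taps:
        m[n - 1][t] = 1
    return m


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product modulo 2."""
    n = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(n)) & 1 for j in range(n)]
            for i in range(n)]


def mat_vec(m: Matrix, v: List[int]) -> List[int]:
    """Matrix-vector product modulo 2."""
    return [sum(r & x for r, x in zip(row, v)) & 1 for row in m]


def mat_pow(m: Matrix, k: int) -> Matrix:
    """M raised to the k-th power modulo 2 (repeated multiplication)."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def serial_bits(taps: List[int], state: List[int], count: int) -> List[int]:
    """Reference generator: one output bit (state[0]) per clock."""
    m = companion_matrix(taps, len(state))
    out = []
    for _ in range(count):
        out.append(state[0])
        state = mat_vec(m, state)
    return out


def parallel_bits(taps: List[int], state: List[int], count: int, k: int) -> List[int]:
    """Generator producing k output bits per clock."""
    m = companion_matrix(taps, len(state))
    # Row 0 of M^j selects the bit the serial generator would emit j clocks later.
    output_rows = [mat_pow(m, j)[0] for j in range(k)]
    step = mat_pow(m, k)  # advances the state by k serial clocks at once
    out: List[int] = []
    while len(out) < count:
        out.extend(sum(r & x for r, x in zip(row, state)) & 1 for row in output_rows)
        state = mat_vec(step, state)
    return out[:count]
```

For example, with the hypothetical taps `[0, 1]` and seed `[1, 0, 0, 0]`, `parallel_bits(taps, seed, 30, k)` yields the same 30 bits as `serial_bits(taps, seed, 30)` for `k` = 2, 4, or 6 while using proportionally fewer state updates. In hardware, the precomputed rows would correspond to fixed XOR networks, which is consistent with an area increase accompanying the throughput gain.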

Investigation of two parallel lengthwise cracks in an inhomogeneous beam of varying thickness

  • Rizov, Victor I.
    • Coupled systems mechanics / v.9 no.4 / pp.381-396 / 2020
  • An analytical investigation of the fracture of an inhomogeneous beam with two parallel lengthwise cracks is performed. The thickness of the beam varies continuously along the beam length. The beam is loaded in three-point bending. Two beam configurations with different crack lengths are analyzed. The two cracks are located arbitrarily along the thickness of the beam. Solutions for the strain energy release rate are derived assuming that the material has non-linear elastic mechanical behavior. In addition, the beam exhibits continuous material inhomogeneity along its thickness. The balance of energy is analyzed in order to derive the strain energy release rate. The solutions are verified by considering the complementary strain energy stored in the beam configurations. The influence of the continuous variation of the thickness along the beam length on the lengthwise fracture behavior is investigated. The dependence of the lengthwise fracture on the lengths of the two parallel cracks is also studied.