• Title/Summary/Keyword: fixed-point representation (고정 소수점 표현)

9 search results

Efficient Fixed-Point Representation for ResNet-50 Convolutional Neural Network (ResNet-50 합성곱 신경망을 위한 고정 소수점 표현 방법)

  • Kang, Hyeong-Ju
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.1 / pp.1-8 / 2018
  • Convolutional neural networks have recently shown high performance in many computer vision tasks. However, they require an enormous number of operations, which makes them difficult to adopt in embedded environments. To address this problem, many studies have targeted ASIC or FPGA implementations, which require an efficient number representation. Fixed-point representation suits ASIC and FPGA implementation but causes performance degradation. This paper proposes optimizing the representations for the convolutional layers and the batch normalization layers separately. With the proposed method, the bit width required for the convolutional layers of the ResNet-50 neural network is reduced from 16 bits to 10 bits. Since the convolutional layers account for most of the total computation, reducing their bit width enables efficient implementation of convolutional neural networks.
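The effect of narrowing the bit width from 16 to 10 bits can be illustrated with a minimal sketch of signed fixed-point quantization. The function name `quantize_fixed`, the sample weights, and the split between integer and fractional bits are assumptions made for illustration; the abstract does not state the formats the paper actually uses.

```python
def quantize_fixed(x, total_bits, frac_bits):
    """Round x to a signed fixed-point code and return the real value it represents.

    total_bits counts the sign bit; frac_bits is the number of fractional bits.
    Values outside the representable range are saturated.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(x * scale)                    # nearest integer code
    code = max(lowest, min(highest, code))     # saturate instead of wrapping
    return code / scale


# Hypothetical weights, not taken from ResNet-50.
weights = [0.7312, -0.0541, 0.2504]

for total_bits, frac_bits in [(16, 14), (10, 8)]:
    quantized = [quantize_fixed(w, total_bits, frac_bits) for w in weights]
    worst = max(abs(w - q) for w, q in zip(weights, quantized))
    print(total_bits, "bits:", quantized, "max error", worst)
```

Within the representable range the rounding error is bounded by half of the least significant bit, so dropping six fractional bits raises that bound by a factor of 64 in this sketch.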

An efficient fixed-point implementation of the IMDCT for audio compression (오디오 압축을 위한 IMDCT의 최적 DSP 근사구현 기법 연구)

  • Jeong, J.H.;Chang, T.G.;Son, Y.K.;Lee, J.W.
    • Proceedings of the KIEE Conference / 2001.07d / pp.2513-2515 / 2001
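  • This paper analyzes the effect of the errors that arise when the DCT is implemented with fixed-point arithmetic through finite-bit approximation. Fixed-point arithmetic requires finite-bit approximation, which introduces errors because of the limited numeric representation range; in particular, implementing algorithms with a recursive computation structure, such as the DCT, causes a sharp drop in performance. This paper analyzes the errors that arise when the recursive equations are implemented with finite-bit approximation and derives an analytical expression for them.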


Floating Point Converter Design Supporting Double/Single Precision of IEEE754 (IEEE754 단정도 배정도를 지원하는 부동 소수점 변환기 설계)

  • Park, Sang-Su;Kim, Hyun-Pil;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD / v.48 no.10 / pp.72-81 / 2011
  • In this paper, we propose and design a novel floating-point converter that supports the single and double precisions of the IEEE754 standard. The proposed converter supports conversions between single/double-precision floating-point numbers and signed fixed-point numbers (32 bits/64 bits), conversions between signed integers (32 bits/64 bits) and single/double-precision floating-point numbers, and conversions between single- and double-precision floating-point numbers. We define a new internal format that converts the various input types into one type, so that overflow checking can be performed easily according to the range of the output type. The internal format is similar to the extended double-precision floating-point format defined in the IEEE754 2008 standard. The standard specifies a minimum exponent width of 15 bits for the extended double-precision format, but 11 bits are enough to implement the proposed converter. We also optimized the rounding stage of the converter so that it can perform rounding and represent negative numbers correctly using an incrementer instead of an adder. We designed a single-cycle data path and a 5-cycle data path. After describing HDL models of the two data paths, we synthesized them with a TSMC 180nm technology library using Synopsys Design Compiler. The synthesized cell area is 12,886 gates (2-input NAND gate equivalent), and the maximum operating frequency is 411MHz.

OpenGL ES 1.1 Implementation Using OpenGL (OpenGL을 이용한 OpenGL ES 1.1 구현)

  • Lee, Hwan-Yong;Baek, Nak-Hoon
    • The KIPS Transactions:PartA / v.16A no.3 / pp.159-168 / 2009
  • In this paper, we present an efficient way of implementing the OpenGL ES 1.1 standard for environments with a hardware-supported OpenGL API, such as desktop PCs. Although OpenGL ES started from existing OpenGL features, it has become a new three-dimensional graphics library customized for embedded systems by introducing fixed-point arithmetic operations, buffer management with support for fixed-point data types, completely new texture mapping functionality, and other features. Currently, it is the official three-dimensional graphics library for Google Android, Apple iPhone, PlayStation3, etc. We improved the arithmetic operations on the fixed-point number representation, which is the most characteristic data type of OpenGL ES. For converting fixed-point data types to the floating-point representations of the underlying OpenGL, we show efficient conversion processes that still satisfy the requirements of the OpenGL ES standard. We also introduce a simple memory management scheme to manage the converted data for buffers containing fixed-point numbers. For texture processing, the requirements of the two standards are quite different, so we used completely new software implementations. Our final implementation of the OpenGL ES library provides all of the more than 200 functions in the OpenGL ES 1.1 standard and completely passed its conformance test, showing its compliance with the standard. In terms of efficiency, we measured execution times for several OpenGL ES-specific application programs and achieved speedups of up to 33.147 times, making it the fastest among the OpenGL ES implementations in the same category.
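The fixed-point data type of OpenGL ES 1.x (`GLfixed`) is a signed 32-bit value with 16 fractional bits. The following minimal sketch shows what converting between that format and floating point involves. The helper names are hypothetical, and this is not the optimized conversion routine from the paper, whose details the abstract does not give.

```python
FIXED_ONE = 1 << 16          # 1.0 in signed 16.16 fixed point
FIXED_MIN = -(1 << 31)       # smallest 32-bit signed code
FIXED_MAX = (1 << 31) - 1    # largest 32-bit signed code


def float_to_fixed(x):
    """Convert a float to a 16.16 code, rounding to nearest and saturating."""
    code = round(x * FIXED_ONE)
    return max(FIXED_MIN, min(FIXED_MAX, code))


def fixed_to_float(code):
    """Convert a 16.16 code back to a float (exact for every 32-bit code)."""
    return code / FIXED_ONE


def fixed_mul(a, b):
    """Multiply two 16.16 codes; the 64-bit product is shifted back by 16 bits."""
    return (a * b) >> 16


a = float_to_fixed(1.5)
b = float_to_fixed(2.0)
print(hex(a), fixed_to_float(fixed_mul(a, b)))
```

Because every 16.16 code fits in the 53-bit significand of a double, `fixed_to_float` loses no information. A conversion to 32-bit floats, as a desktop OpenGL back end might use, can round large values.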

Realization of Block LMS Algorithm based on Block Floating Point (BFP 기반의 블록 LMS 알고리즘 구현)

  • Lee Kwang-Jae;Chakraborty Mriatyunjoy;Park Ju-Yong;Lee Moon-Ho
    • Journal of the Institute of Electronics Engineers of Korea SP / v.43 no.1 s.307 / pp.91-100 / 2006
  • A scheme is proposed for implementing the block LMS algorithm in a block floating point framework that permits processing of data over a wide dynamic range at a processor complexity and cost as low as that of a fixed-point processor. The proposed scheme adopts appropriate formats for representing the filter coefficients and the data. Using these and a new upper bound on the step size, update relations for the filter weight mantissas and exponent are developed, taking care that no overflow occurs and that quantities which are already very small are not multiplied directly. It is further shown how the mantissas of the filter coefficients and the filter output can be evaluated faster by suitably modifying the approach of the fast block LMS algorithm.
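The block floating point idea underlying this scheme is that all values in a block share one exponent, and each value keeps only an integer mantissa. A minimal sketch follows. The function names and the 8-bit mantissa width are assumptions made for illustration, and the sketch covers only the data format, not the paper's LMS update relations.

```python
import math


def to_bfp(block, mant_bits):
    """Encode a block as signed integer mantissas plus one shared exponent."""
    peak = max(abs(x) for x in block)
    if peak == 0:
        return [0] * len(block), 0
    # frexp gives peak = m * 2**exp with 0.5 <= m < 1, so every |x| < 2**exp.
    exp = math.frexp(peak)[1]
    scale = 2.0 ** (mant_bits - 1 - exp)
    limit = (1 << (mant_bits - 1)) - 1
    # Rounding can push the peak one step past the limit, so clamp it.
    mants = [max(-limit - 1, min(limit, round(x * scale))) for x in block]
    return mants, exp


def from_bfp(mants, exp, mant_bits):
    """Decode mantissas and the shared exponent back to floats."""
    step = 2.0 ** (exp - (mant_bits - 1))
    return [m * step for m in mants]


mants, exp = to_bfp([0.5, -0.125, 0.03125], mant_bits=8)
print(mants, exp, from_bfp(mants, exp, mant_bits=8))
```

The largest value in the block sets the exponent, so small values in the same block lose relative precision. This is one reason the abstract stresses care with quantities that are already very small.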

Implementation of Optimized MPEG-4 BSAC Audio based on the embedded system (임베디드 시스템 기반 MPEG-4 BSAC 오디오 최적화 구현)

  • Hwang, Jin-Yong;Park, Jong-Soon;Oh, Hwa-Yong;Kim, Byoung-Ii;Chang, Tae-Gyu
    • Proceedings of the KIEE Conference / 2005.10b / pp.361-363 / 2005
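  • In this paper, we developed an MPEG-4 BSAC audio decoder, based on the MPEG-4 Version 2 Audio standard, that applies our own algorithm with a low computational load. The developed BSAC decoder was optimized, implemented, and evaluated on a system based on the Intel XScale processor, which has a 32-bit RISC architecture. To increase execution speed and improve computational precision, we studied the function and implementation principle of each functional block as well as the 32-bit arithmetic structure, and improved performance by implementing the decoder with fixed-point arithmetic. To minimize the effect of finite-bit errors, we studied the representation range of the data so as to minimize approximation errors and improve computational precision. Functional blocks with a relatively high computational load, such as the nonlinear quantizer and the filter bank, were optimized by applying table look-up, interpolation, elimination of exponentiation, and pre/post scrambling techniques. Finally, the developed BSAC decoder was ported to a target system consisting of a development board equipped with a 32-bit XScale processor and the Windows CE OS, and its performance was evaluated; we confirmed high computational precision and fast execution speed. A subjective listening evaluation also confirmed that there was almost no audible difference from the MPEG-4 reference decoder.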


Analytic derivation of the finite wordlength errors in fixed-point implementation of SDFT (SDFT 고정소수점 연산에 대한 유한 비트 오차영향 해석)

  • Chang, Tae-Gyu;Kim, Jae-Hwa
    • Journal of the Institute of Electronics Engineers of Korea SP / v.37 no.4 / pp.65-71 / 2000
  • The finite wordlength effect of the recursive implementation of the SDFT (sliding DFT) is analytically derived in this paper. Representation errors of the twiddle coefficients and the data registers are the two major causes of spectral errors in the recursive implementation. The noise-to-signal ratio (NSR) is analytically derived in terms of the coefficient wordlength, the data register wordlength, and the DFT block length used in the computation. An error dynamic equation is obtained from the recursive DFT, and probabilistic models for the coefficient error and the round-off error are introduced for the NSR derivation. The result of the NSR derivation is verified with simulation data.
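The recursive structure analyzed here can be sketched with the standard sliding-DFT update S_k(n) = W (S_k(n-1) + x(n) - x(n-N)), where W = exp(j2πk/N). The code below is a minimal illustration, not the paper's error model: the function names, the test signal, and the 8-bit coefficient rounding are assumptions, and only the twiddle coefficient is quantized (the data registers stay in double precision).

```python
import cmath
import math


def sliding_dft_bin(samples, n_points, k, coeff_frac_bits=None):
    """Track DFT bin k over a sliding window of n_points samples.

    If coeff_frac_bits is given, the twiddle coefficient is rounded to that
    many fractional bits to mimic a finite coefficient wordlength.
    """
    twiddle = cmath.exp(2j * cmath.pi * k / n_points)
    if coeff_frac_bits is not None:
        scale = 1 << coeff_frac_bits
        twiddle = complex(round(twiddle.real * scale) / scale,
                          round(twiddle.imag * scale) / scale)
    state = 0j
    history = [0.0] * n_points            # circular buffer, window starts empty
    for n, x in enumerate(samples):
        oldest = history[n % n_points]    # sample leaving the window
        history[n % n_points] = x
        state = twiddle * (state + x - oldest)
    return state


def direct_dft_bin(window, k):
    """Reference: DFT bin k of one window, computed directly."""
    n_points = len(window)
    return sum(x * cmath.exp(-2j * cmath.pi * k * m / n_points)
               for m, x in enumerate(window))


signal = [math.sin(0.3 * n) + 0.5 for n in range(64)]
exact = direct_dft_bin(signal[-16:], 3)
print(abs(sliding_dft_bin(signal, 16, 3) - exact))
print(abs(sliding_dft_bin(signal, 16, 3, coeff_frac_bits=8) - exact))
```

Rounding the coefficient moves it slightly off the unit circle, so the error in the recursion persists from sample to sample instead of cancelling. This is the kind of spectral error the abstract attributes to twiddle coefficient representation.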


Linear Regression-Based Precision Enhancement of Summed Area Table (선형 회귀분석 기반 합산영역테이블 정밀도 향상 기법)

  • Jeong, Juhyeon;Lee, Sungkil
    • KIPS Transactions on Software and Data Engineering / v.2 no.11 / pp.809-814 / 2013
  • A summed area table (SAT) is a data structure in which the sum of pixel values in an arbitrary rectangular area can be represented by a linear combination of four pixel values. Since the SAT serially accumulates pixel values from one image corner to the opposite corner, a high-resolution image can yield overflow in a floating-point representation. In this paper, we present a new SAT construction technique that accumulates only the residuals from a linearly regressed representation of an image and thereby significantly reduces the accumulation errors. We also propose a method to find the integral of the linear regression in constant time using a double integral. We performed experiments on image reconstruction, and the results showed that our approach reduces the accumulation errors more than the conventional fixed-offset SAT.
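The four-value lookup that defines a SAT can be shown with a minimal sketch of the conventional table. This is not the regression-based construction the paper proposes; the function names and the padded layout with an extra zero row and column are choices made for illustration.

```python
def build_sat(image):
    """Build a summed area table with one extra zero row and column.

    sat[y][x] holds the sum of image[0..y-1][0..x-1].
    """
    height, width = len(image), len(image[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 from four entries."""
    return (sat[bottom][right] - sat[top][right]
            - sat[bottom][left] + sat[top][left])


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = build_sat(image)
print(rect_sum(sat, 1, 1, 3, 3))
```

The entries grow toward the far corner of the table (the last entry is the sum of the whole image), which is why accumulation over a high-resolution image strains a limited-precision representation.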

Quantization Method for Normalization of JPEG Pleno Hologram (JPEG Pleno 홀로그램 데이터의 정규화를 위한 양자화)

  • Kim, Kyung-Jin;Kim, Jin-Kyum;Oh, Kwan-Jung;Kim, Jin-Woong;Kim, Dong-Wook;Seo, Young-Ho
    • Journal of Broadcast Engineering / v.25 no.4 / pp.587-597 / 2020
  • In this paper, we analyze the normalization that occurs when processing digital holograms and propose an optimized quantization method. In JPEG Pleno, which standardizes the compression of holograms, full complex holograms are defined as complex numbers with 32-bit or 64-bit precision, and the range of values varies greatly depending on the hologram generation method and the object type. Such data, with high precision and wide dynamic range, are converted to fixed-point or integer numbers with lower precision for signal processing and compression. In addition, to reconstruct the hologram on an SLM (spatial light modulator), it is approximated to the precision that the pixels of the SLM can express. This process can be referred to as normalization using quantization. We introduce a method for normalizing high-precision, wide-range holograms using a quantization technique and propose an optimized method.