Search | Korea Science

Effective Partitioning of Static Global Buses for Small Processor Arrays

Matsumae, Susumu
- Journal of Information Processing Systems
- /
- v.7 no.1
- /
- pp.85-92
- /
- 2011
This paper shows an effective partitioning of static global row/column buses for tightly coupled 2D mesh-connected small processor arrays ("mesh", for short). With additional O(n/m (n/m + log m)) time slowdown, it enables the mesh of size $m{\times}m$ with static row/column buses to simulate the mesh of the larger size $n{\times}n$ with reconfigurable row/column buses ($m{\leq}n$). This means that if a problem can be solved in O(T) time by the mesh of size $n{\times}n$ with reconfigurable buses, then the same problem can be solved in O(Tn/m (n/m + log m)) time on the mesh of a smaller size $m{\times}m$ without a reconfigurable function. This time-cost is optimal when the relation $n{\geq}m$ log m holds (e.g., m = $n^{1-\varepsilon}$ for $\varepsilon$ > 0).
https://doi.org/10.3745/JIPS.2011.7.1.085 인용 PDF KSCI

Multiple Signature Comparison of LogTM-SE for Fast Conflict Detection (다중 시그니처 비교를 통한 트랜잭셔널 메모리의 충돌해소 정책의 성능향상)

Kim, Deok-Ho;Oh, Doo-Hwan;Ro, Won-W.
- The KIPS Transactions:PartA
- /
- v.18A no.1
- /
- pp.19-24
- /
- 2011
As era of multi-core processors has arrived, transactional memory has been considered as an effective method to achieve easy and fast multi-threaded programming. Various hardware transactional memory systems such as UTM, VTM, FastTM, LogTM, and LogTM-SE, have been introduced in order to implement high-performance multi-core processors. Especially, LogTM-SE has provided study performance with an efficient memory management policy and a practical thread scheduling method through conflict detection based on signatures. However, increasing number of cores on a processor imposes the hardware complexity for signature processing. This causes overall performance degradation due to the heavy workload on signature comparison. In this paper, we propose a new architecture of multiple signature comparison to improve conflict detection of signature based transactional memory systems.
https://doi.org/10.3745/KIPSTA.2011.18A.1.019 인용 PDF KSCI

Hardware Design of Special-Purpose Arithmetic Unit for 3-Dimensional Graphics Processor (3차원 그래픽프로세서용 특수 목적 연산장치의 하드웨어 설계)

Choi, Byeong-Yoon
- Proceedings of the Korean Institute of Information and Commucation Sciences Conference
- /
- 2011.05a
- /
- pp.140-142
- /
- 2011
In this paper, special purpose arithmetic unit for mobile graphics accelerator is designed. The designed processor supports six operations, such as $1/{\chi}$, $\frac{1}{{\sqrt{x}}$, $log_2x$, $2^x$, $sin(x)$, $cos(x)$. The processor adopts 2nd-order polynomial minimax approximation scheme based on IEEE floating point data format to satisfy accuracy conditions and has 5-stage pipeline structure to meet high operational rates. The SFAU processor consists of 23,000 gates and its estimated operating frequency is about 400 Mhz at operating condition of 65nm CMOS technology. Because the processor can execute all operations with 5-stage pipeline scheme, it has about 400 MOPS(million operations per second) execution rate. Thus, it can be applicable to the 3D mobile graphics processors.
PDF

Page Logging System for Web Mining Systems (웹마이닝 시스템을 위한 페이지 로깅 시스템)

Yun, Seon-Hui;O, Hae-Seok
- The KIPS Transactions:PartC
- /
- v.8C no.6
- /
- pp.847-854
- /
- 2001
The Web continues to grow fast rate in both a large aclae volume of traffic and the size and complexity of Web sites. Along with growth, the complexity of tasks such as Web site design Web server design and of navigating simply through a Web site have increased. An important input to these design tasks is the analysis of how a web site is being used. The is paper proposes a Page logging System(PLS) identifying reliably user sessions required in Web mining system PLS consists of Page Logger acquiring all the page accesses of the user Log processor producing user session from these data, and statements to incorporate a call to page logger applet. Proposed PLS abbreviates several preprocessing tasks which spends a log of time and efforts that must be performed in Web mining systems. In particular, it simplifies the complexity of transaction identification phase through acquiring directly the amount of time a user stays on a page. Also PLS solves local cache hits and proxy IPs that create problems with identifying user sessions from Web sever log.
PDF

An Efficient DVS Algorithm for Pinwheel Task Schedules

Chen, Da-Ren;Chen, You-Shyang
- Journal of Information Processing Systems
- /
- v.7 no.4
- /
- pp.613-626
- /
- 2011
In this paper, we focus on the pinwheel task model with a variable voltage processor with d discrete voltage/speed levels. We propose an intra-task DVS algorithm, which constructs a minimum energy schedule for k tasks in O(d+k log k) time We also give an inter-task DVS algorithm with O(d+n log n) time, where n denotes the number of jobs. Previous approaches solve this problem by generating a canonical schedule beforehand and adjusting the tasks' speed in O(dn log n) or O($n^3$) time. However, the length of a canonical schedule depends on the hyper period of those task periods and is of exponential length in general. In our approach, the tasks with arbitrary periods are first transformed into harmonic periods and then profile their key features. Afterward, an optimal discrete voltage schedule can be computed directly from those features.
https://doi.org/10.3745/JIPS.2011.7.4.613 인용 PDF KSCI

Designing a Bitonic Sorting Algorithm for Shared-Memory Parallel Computers and an Efficient Implementation of its Communication (공유 메모리 병렬 컴퓨터 환경에서 Bitonic Sorting 알고리즘 설계와 효율적인 통신의 구현)

Lee, Jae-Dong;Kwon, Kyung-Hee;Park, Yong-Beom
- The Transactions of the Korea Information Processing Society
- /
- v.4 no.11
- /
- pp.2690-2700
- /
- 1997
This paper presents parallel sorting algorithm, SHARED-MEMORY-BS and REDUCED-BS, which are implemented on shared-memory parallel computers. These algorithm sort N keys in $O(log^2N)$ time. REDUCED-BS users a parity strategy which gives an idea for the efficient usage of the local memory associated with each processor. By taking advantage of the local memory associated with each processor, the communication of REDUCED-BS is decreased by approximately half that of SHARED-MEMORY-BS. On the basis of alleviating the communication, the algorithm REDUCED-BS results in a significant improvement of performance.
PDF

Using Cache Access History for Reducing False Conflicts in Signature-Based Eager Hardware Transactional Memory (시그니처 기반 이거 하드웨어 트랜잭셔널 메모리에서의 캐시 접근 이력을 이용한 거짓 충돌 감소)

Kang, Jinku;Lee, Inhwan
- Journal of KIISE
- /
- v.42 no.4
- /
- pp.442-450
- /
- 2015
This paper proposes a method for reducing false conflicts in signature-based eager hardware transactional memory (HTM). The method tracks the information on all cache blocks that are accessed by a transaction. If the information provides evidence that there are no conflicts for a given transactional request from another core, the method prevents the occurrence of a false conflict by forcing the HTM to ignore the decision based on the signature. The method is very effective in reducing false conflicts and the associated unnecessary transaction stalls and aborts, and can be used to improve the performance of the multicore processor that implements the signature-based eager HTM. When running the STAMP benchmark on a 16-core processor that implements the LogTM-SE, the increased speed (decrease in execution time) achieved with the use of the method is 20.6% on average.
https://doi.org/10.5626/JOK.2015.42.4.442 인용 KSCI

A Novel Arithmetic Unit Over GF(2$^{m}$) for Reconfigurable Hardware Implementation of the Elliptic Curve Cryptographic Processor (타원곡선 암호프로세서의 재구성형 하드웨어 구현을 위한 GF(2$^{m}$)상의 새로운 연산기)

김창훈;권순학;홍춘표;유기영
- Journal of KIISE:Computer Systems and Theory
- /
- v.31 no.8
- /
- pp.453-464
- /
- 2004
In order to solve the well-known drawback of reduced flexibility that is associate with ASIC implementations, this paper proposes a novel arithmetic unit over GF(2$^{m}$ ) for field programmable gate arrays (FPGAs) implementations of elliptic curve cryptographic processor. The proposed arithmetic unit is based on the binary extended GCD algorithm and the MSB-first multiplication scheme, and designed as systolic architecture to remove global signals broadcasting. The proposed architecture can perform both division and multiplication in GF(2$^{m}$ ). In other word, when input data come in continuously, it produces division results at a rate of one per m clock cycles after an initial delay of 5m-2 in division mode and multiplication results at a rate of one per m clock cycles after an initial delay of 3m in multiplication mode respectively. Analysis shows that while previously proposed dividers have area complexity of Ο(m$^2$) or Ο(mㆍ(log$_2$$^{m}$ )), the Proposed architecture has area complexity of Ο(m), In addition, the proposed architecture has significantly less computational delay time compared with the divider which has area complexity of Ο(mㆍ(log$_2$$^{m}$ )). FPGA implementation results of the proposed arithmetic unit, in which Altera's EP2A70F1508C-7 was used as the target device, show that it ran at maximum 121MHz and utilized 52% of the chip area in GF(2$^{571}$ ). Therefore, when elliptic curve cryptographic processor is implemented on FPGAs, the proposed arithmetic unit is well suited for both division and multiplication circuit.
PDF KSCI

A Parallel Algorithm for Merging Heaps on MasPar Machine (MasPar 머쉰상의 병렬 힙 병합 알고리즘)

Min, Yong-Sik
- The Transactions of the Korea Information Processing Society
- /
- v.2 no.4
- /
- pp.554-560
- /
- 1995
In this paper, we suggest a parallel algorithm to merge priority queues organized in two heaps, kheap and nheap of sizes k and n, correspondingly. Employing max(2$^{-1}$, $\ulcorner$(m＋1)/4$\lrcorner$'s processors, this algorithm requires O(log(n/k)＊log(n)) on an EREW-PRAM, where i is the height of the heap and m is the summation of sizes n and k. Also, when we run it on the MasPar machine, this method achieves a 33.934-fold speedup with 64 processors to merge 8 million data items which consist of two heaps of different sizes. So our parallel algorithm's EPU is close to 1, which is considered as an optimal speedup ratio.eedup ratio.
PDF

Algorithm and Design of Double-base Log Encoder for Flash A/D Converters

Son, Nguyen-Minh;Kim, In-Soo;Choi, Jae-Ha;Kim, Jong-Soo
- Journal of the Institute of Convergence Signal Processing
- /
- v.10 no.4
- /
- pp.289-293
- /
- 2009
This study proposes a novel double-base log encoder (DBLE) for flash Analog-to-Digital converters (ADCs). Analog inputs of flash ADCs are represented in logarithmic number systems with bases of 2 and 3 at the outputs of DBLE. A look up table stores the sets of exponents of base 2 and 3 values. This algorithm improves the performance of a DSP (Digital Signal Processor) system that takes outputs of a flash ADC, since the double-base log number representation does multiplication operation easily within negligible error range in ADC. We have designed and implemented 6 bits DBLE implemented with ROM (Read-Only Memory) architecture in a $0.18\;{\mu}m$ CMOS technology. The power consumption and speed of DBLE are better than the FAT tree and binary ROM encoders at the cost of more chip area. The DBLE can be implemented into SoC architecture with DSP to improve the processing speed.
PDF

Search Result 30, Processing Time 0.027 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)