Search | Korea Science

Parsa, Saeed;Hamzei, Mohammad
- ETRI Journal / v.36 no.1 / pp.124-133 / 2014
To speed up data-intensive programs, two complementary techniques should be considered: nested-loop parallelization and data locality optimization. Effective parallelization distributes the computation and the necessary data across different processors, whereas data locality optimization keeps data on the same processor. Locality and parallelization may therefore demand different loop transformations, and an integrated approach that combines the two can generate much better results than either approach alone. This paper proposes a unified approach that integrates these two techniques to obtain an appropriate loop transformation. Applying this transformation yields coarse-grained parallelism by exploiting the largest possible groups of outer permutable loops, in addition to data locality through dependence satisfaction at the inner loops. These groups can be further tiled to improve data locality by exploiting data reuse in multiple dimensions.
https://doi.org/10.4218/etrij.14.0113.0266
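
The tiling of permutable loops mentioned in this abstract can be illustrated with a minimal sketch. It is not the transformation proposed in the paper; it is the textbook case of matrix multiplication, whose three loops are fully permutable and can therefore be tiled. The function names and the `tile` parameter are assumptions made for illustration.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A x B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with all three loops tiled.

    The outer ii/jj loops walk over independent blocks of C, so they are
    the natural candidates for coarse-grained parallelism; the inner loops
    reuse a small block of A and B, which is the locality benefit.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # Work on one tile; min() handles sizes not divisible by tile.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

In pure Python the tiled version is not faster; the sketch only shows the loop structure that a compiler or a lower-level implementation would exploit.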

Jeong, Sam-Jin
- Journal of the Korea Convergence Society / v.6 no.3 / pp.51-57 / 2015
This paper proposes an efficient method, the Extended Three Region Partitioning Method, for maximizing parallelism in nested loops with irregular dependences. Our approach is based on convex hull theory, and also on minimum dependence distance tiling, unique-set-oriented partitioning, and three region partitioning methods. In the proposed method, we eliminate anti-dependences from the nested loop by variable renaming. After variable renaming, we present an algorithm to select one or more appropriate lines among four given lines: LMLH, RMLH, LMLT, and RMLT. If only one line is selected, the method divides the iteration space into two parallel regions along the selected line. Otherwise, we present another algorithm to find a serial region. The selected lines divide the iteration space into two parallel regions that are as large as possible and at most one serial region that is as small as possible. Our proposed method gives much better speedup and extracts more parallelism than other existing three region partitioning methods.
https://doi.org/10.15207/JKCS.2015.6.3.051

Kim, Youngchan;Yun, Youngsun;Kim, Hansol;Chang, Hanbyul;Park, Jaedeok;Choe, Yunjin;Na, Jeongkyun;Yi, Joohan;Kang, Hyungu;Yeo, Minsu;Choi, Kyuhong;Noh, Young-Chul;Jeong, Yoonchan;Lee, Hyuk-Jae;Yu, Bong-Ahn;Yeom, Dong-Il;Jun, Changsu
- Korean Journal of Optics and Photonics / v.32 no.1 / pp.1-8 / 2021
We have studied a tiled-aperture coherent-beam-combining system based on constructive interference, as a way to overcome the power limitation of a single laser. A 1-watt-level, 3-channel coherent fiber laser and a 3-channel fiber array of triangular tiling with a tip-tilt function were developed. A monitoring system, a phase controller, and a 3-channel phase modulator formed a closed-loop control system, to which the SPGD algorithm was applied. Eventually, phase-locking with a rate of 5-67 kHz and a peak-intensity efficiency comparable to the ideal case of 53.3% were successfully realized. We were able to develop the essential elements for a tiled-aperture coherent-beam-combining system, which has the potential for the highest output power without any beam-combining components, and a multichannel coherent-beam-combining system with higher output power and high speed is anticipated in the future.
https://doi.org/10.3807/KJOP.2021.32.1.001
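
The closed-loop phase control described in this abstract relies on SPGD (stochastic parallel gradient descent). The following is a minimal simulation sketch of the generic two-sided SPGD update, not the authors' controller. The metric (normalized on-axis intensity of equal-amplitude beams), the gain, the perturbation amplitude, and all function names are assumptions chosen for illustration.

```python
import cmath
import random


def combined_intensity(phases):
    """Normalized intensity of equal-amplitude beams (1.0 = all in phase)."""
    field = sum(cmath.exp(1j * p) for p in phases)
    return abs(field) ** 2 / len(phases) ** 2


def spgd_phase_lock(phases, gain=10.0, amplitude=0.1, steps=1000, seed=0):
    """Return control phases that SPGD finds for the given channel phases.

    Each step applies a random +/- perturbation to every channel at once,
    measures the metric for both signs, and moves the control phases along
    the perturbation, scaled by the measured metric difference.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    control = [0.0] * len(phases)
    for _ in range(steps):
        delta = [rng.choice((-amplitude, amplitude)) for _ in control]
        j_plus = combined_intensity(
            [p + c + d for p, c, d in zip(phases, control, delta)])
        j_minus = combined_intensity(
            [p + c - d for p, c, d in zip(phases, control, delta)])
        dj = j_plus - j_minus
        # Gradient-ascent style update: maximize the measured intensity.
        control = [c + gain * dj * d for c, d in zip(control, delta)]
    return control
```

A possible usage is to start from arbitrary channel phases such as `[0.0, 2.0, -2.5]`, add the returned control phases to them, and compare `combined_intensity` before and after. A real system would read the metric from a photodetector rather than compute it.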

Cho, Seok-Jae;Park, Sungkyung;Park, Chester Sungchung
- Journal of IKEEE / v.24 no.2 / pp.559-569 / 2020
Artificial intelligence applications based on the CNN, one of the deep-learning algorithms, use off-chip memory to store the data of the convolution layer. DMA can reduce the processor load at every data transfer. It can also reduce application performance degradation by varying the order in which data from the convolution layer is transmitted to the global buffer of the accelerator. For basic layouts with continuous memory addresses, SG-DMA showed about a 3.4-fold performance improvement in DMA pre-setting compared to ordinary DMA, and for ideal layouts with discontinuous memory addresses, ordinary DMA was about 1396 cycles faster than SG-DMA. Experiments have shown that a combination of memory data layout and DMA can reduce the DMA pre-setting load by about 86 percent.
https://doi.org/10.7471/ikeee.2020.24.2.559