Performance Comparison of Synchronization Methods for CC-NUMA Systems

  • Published: 2000.04.15

Abstract

The main goal of synchronization is to guarantee exclusive access to shared data and critical sections so that parallel programs execute correctly and reliably. Because exclusive access restricts the parallelism of a program, efficient synchronization is essential for achieving high performance in shared-memory parallel programs. Many techniques have therefore been devised that exploit the characteristics of systems and applications to make synchronization more efficient. This paper presents simulation results showing that existing synchronization methods are inefficient on CC-NUMA (Cache-Coherent Non-Uniform Memory Access) systems, and compares their performance with that of Freeze&Melt synchronization, which can remove these inefficiencies. The results show that Test-and-Test&Set synchronization is inefficient because of the broadcast operations that occur during synchronization, and that Queue-On-Lock-Bit (QOLB) synchronization is inefficient because the order in which processors execute a critical section is fixed in advance. Freeze&Melt synchronization, proposed to overcome these drawbacks, improves performance by reducing both the time spent waiting to enter a critical section and the time spent executing it, and by reducing the traffic between clusters.
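
For readers unfamiliar with Test-and-Test&Set, the following is a minimal sketch of the general technique as it is commonly described, not code from the paper. It is a software analogue only: Python exposes no hardware atomics, so `threading.Lock.acquire(blocking=False)` is assumed here to stand in for the atomic Test&Set instruction and `Lock.locked()` for the plain read of the locally cached lock variable. The names `TTASLock` and `run_demo` are hypothetical. The sketch does not model QOLB or Freeze&Melt, whose mechanics the abstract does not describe.

```python
import threading
import time


class TTASLock:
    """Test-and-Test&Set spin lock (illustrative software analogue).

    Assumption: threading.Lock stands in for the hardware lock variable;
    locked() plays the role of the read-only "test" and a non-blocking
    acquire() plays the role of the atomic "Test&Set".
    """

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while True:
            # "Test": spin on a read-only check while the lock is held.
            while self._flag.locked():
                time.sleep(0)  # yield so the lock holder can make progress
            # "Test&Set": attempt the atomic update only once the lock
            # looks free; another waiter may still win the race.
            if self._flag.acquire(blocking=False):
                return

    def release(self):
        # After a release, every spinning waiter can observe the lock as
        # free and race for it with its own Test&Set attempt.
        self._flag.release()


def run_demo(num_threads=4, iterations=200):
    """Increment a shared counter under the lock and return the final value."""
    lock = TTASLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            try:
                counter[0] += 1  # critical section
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

If mutual exclusion holds, `run_demo(4, 200)` returns `num_threads * iterations`. On real cache-coherent hardware, the release step is where all waiters are notified and contend at once, which is the kind of broadcast cost the abstract attributes to this method.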
