Performance Analyses of Instruction Fetch Models Considering Cache Miss and Branch Misprediction

캐쉬 미스와 분기예측 실패를 고려한 명령어 페치 모델의 성능분석

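  • 김선모 (Researcher, LG Electronics Inc.)
  • 정진하 (Department of Electronic Engineering, Inha University)
  • 최상방 (School of Electronics, Electrical and Computer Engineering, Inha University)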
  • Published : 2001.12.01

Abstract

Cache memories are small, fast memories that temporarily hold the contents of main memory likely to be referenced by the processor, so as to reduce instruction and data access time. In this paper, we present analytical models of the instruction fetch process for four types of instruction cache structures that can be used in superscalar processors. In the models, we define a range of architectural parameters and take both cache misses and branch mispredictions into account. To validate the proposed models, we performed extensive simulations and compared the results with the analytical predictions. The comparison shows that the proposed models estimate the instruction fetch rate to within 10% error in most cases. Both the analytical models and the simulations show that an increase in cache misses reduces the instruction fetch rate more severely than an increase in branch mispredictions does. Unlike simulation alone, however, the analytical models can explain the causes of the performance degradation. They also give the exact relationship between cache misses and branch mispredictions for instruction fetch analysis.

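Cache memory is a small, fast memory that temporarily stores the contents of main memory most likely to be referenced by the processor, in order to reduce instruction and data access time. In this paper, we propose analytical models that account for cache misses and branch mispredictions for four instruction cache structures applicable to superscalar processors, and we analyze their performance. We model instruction fetch by defining various parameters of the superscalar architecture, and we validate the analytical models by comparing them with simulation results. In terms of instruction fetch rate, the performance degradation caused by cache misses was found to be greater than that caused by branch mispredictions. The analytical models obtained in this study make it possible to identify clearly the causes of performance limits that do not show up in simulation, and to analyze precisely the relationship between cache misses and branch mispredictions in cache performance.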
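
As a rough illustration of how cache misses and branch mispredictions can jointly lower an instruction fetch rate, the following minimal sketch uses a generic additive stall model. It is not the model proposed in the paper. The function name, the parameters, and the numeric values are all hypothetical, and the relative impact of the two effects depends entirely on the assumed rates and penalties.

```python
def effective_fetch_rate(fetch_width, miss_rate, miss_penalty,
                         mispredict_rate, mispredict_penalty):
    """Average instructions fetched per cycle under a simple stall model.

    Assumptions (illustrative only, not taken from the paper):
    - fetch_width instructions are delivered in every useful fetch cycle;
    - miss_rate and mispredict_rate are probabilities per fetch cycle;
    - each event adds a fixed number of stall cycles (its penalty);
    - stalls from the two sources simply add up.
    """
    for rate in (miss_rate, mispredict_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie in [0, 1]")
    # Expected stall cycles added to each useful fetch cycle.
    stall_cycles = miss_rate * miss_penalty + mispredict_rate * mispredict_penalty
    return fetch_width / (1.0 + stall_cycles)


if __name__ == "__main__":
    # Hypothetical 4-wide fetch unit; penalties of 10 and 3 cycles are assumed.
    ideal = effective_fetch_rate(4, 0.0, 10, 0.0, 3)
    with_misses = effective_fetch_rate(4, 0.05, 10, 0.0, 3)
    with_mispredicts = effective_fetch_rate(4, 0.0, 10, 0.05, 3)
    print(ideal, with_misses, with_mispredicts)
```

In this sketch each event type contributes its rate multiplied by its penalty, so a cache miss costs more than a misprediction only because a larger penalty was assumed for it. The models in the paper are derived from architectural parameters of the four cache structures, which this sketch does not attempt to capture.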
