http://dx.doi.org/10.7471/ikeee.2017.21.4.353

Analysis and solution of memory failure phenomenon in Server systems  

Shin, Hyunsung (Dept. of Computer Science Engineering, Seoul National University)
Yoo, Sungjoo (Dept. of Computer Science Engineering, Seoul National University)
Publication Information
Journal of IKEEE, v.21, no.4, 2017, pp. 353-357
Abstract
Keeping the many server systems in enterprise and data center environments running depends above all on preventing uncorrectable errors (UEs) in each server. With the recent growth of cloud services, more memory modules are in use than ever before; at the same time, server operating frequencies have increased and memory process technology has continued to shrink, making memory more likely to fail. Memory defects can be repaired directly in the server system, but no guideline is currently available for using this capability effectively. In this paper, we propose a method to effectively prevent memory failures in a server system, based on the observation and analysis of memory failure phenomena in existing systems.
Keywords
Datacenter; Server; Memory; Failure; Memory repair; UE (Uncorrectable Error)