Hardware Error Analysis and Severity Characterization in Linux-Based Server Systems

Authors: N. Georgoulopoulos, A. Hatzopoulos, K. Karamitsios, K. Kotrotsios, A. I. Metsai

Abstract:

Current server systems run critical applications across different infrastructures, such as the cloud, physical machines, and virtual machines. A common challenge for these systems is the range of hardware faults that may occur, due to high load among other reasons; such faults manifest as errors that cause malfunctions or even server downtime. The hardware components responsible for most of these errors are the CPU, the RAM, and the hard disk drive (HDD). In this work, we investigate selected CPU, RAM, and HDD errors, observed or simulated in kernel ring buffer log files from GNU/Linux servers. Moreover, a severity characterization is given for each error type. Understanding these errors is crucial for the efficient analysis of kernel logs, which are commonly used for monitoring servers and diagnosing faults. In addition, to support this analysis, we present possible ways of simulating hardware errors in RAM and HDD, aiming to facilitate the testing of methods for detecting and handling such errors on a server running GNU/Linux.

Keywords: hardware errors, kernel logs, GNU/Linux servers, RAM, HDD, CPU
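
As a rough illustration of the kind of kernel log analysis the abstract describes, the following minimal Python sketch groups kernel ring buffer lines by the hardware component they most likely concern. The keyword patterns, the function names `classify` and `summarize`, and the sample lines are all assumptions made for this example; the abstract does not list the specific error messages or severity levels studied in the paper, so none are reproduced here.

```python
import re

# Hypothetical keyword patterns (an assumption for illustration only).
# They are loosely modeled on common Linux kernel message prefixes and
# are not the error list or the method used in the paper.
PATTERNS = [
    ("CPU", re.compile(r"mce:|machine check|temperature above threshold", re.I)),
    ("RAM", re.compile(r"\bEDAC\b", re.I)),
    ("HDD", re.compile(r"I/O error|medium error", re.I)),
]


def classify(line):
    """Return the component label of the first matching pattern, or None."""
    for component, pattern in PATTERNS:
        if pattern.search(line):
            return component
    return None


def summarize(lines):
    """Count matching lines per component, skipping unmatched lines."""
    counts = {}
    for line in lines:
        component = classify(line)
        if component is not None:
            counts[component] = counts.get(component, 0) + 1
    return counts


# Synthetic sample lines written for this example, not taken from the paper.
SAMPLE = [
    "[  102.334] mce: [Hardware Error]: Machine check events logged",
    "[  230.118] EDAC MC0: 1 CE memory read error on DIMM",
    "[  311.905] blk_update_request: I/O error, dev sda, sector 2048",
    "[  400.000] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Keyword matching of this kind is only a first-pass filter. A machine check event, for example, can report a memory error rather than a CPU fault, so a practical tool would need to inspect the message details before assigning a component or a severity.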

