Flagging Critical Components to Prevent Transient Faults in Real-Time Systems
Authors: Muhammad Sheikh Sadi, D. G. Myers, Cesar Ortega Sanchez
Abstract:
This paper proposes metrics for design space exploration that highlight where in the structure of a model, and at what point in its behaviour, prevention is needed against transient faults. Previous approaches to transient faults have focused on recovery after detection; almost no research has been directed towards preventive measures. In real-time systems, however, hard deadlines are performance requirements that absolutely must be met, and a missed deadline constitutes an erroneous action and a possible system failure. The proposed metrics assess the system design and flag where transient faults may have significant impact. They allow the design to be changed to minimize that impact, and they also flag where particular design techniques, such as coding of communications or memories, need to be applied in later stages of design.
Keywords: Criticality, Metrics, Real-Time Systems, Transient Faults.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1085487