Error Tolerant Computing
Participants
- Current Students
  - Doochul Shin
  - Hye-Yeon Cheong
  - In Suk Chong
  - Jordan Melzer
  - Yeung On
  - Shahidi
  - Zhaolian Pan
- Graduate Students
  - Hyukjune Chung
  - Zhigang Jiang
Summary
One of the key enablers of the information technology (IT) revolution has been VLSI scaling, which has provided continuous increases in scale and speed of digital systems, accompanied by decreases in cost. However, with the VLSI feature sizes and separations approaching the physical limits of the devices and the fabrication technology, this trend is slowing. The increase in circuit speed from one fabrication process generation to the next has already decreased. The 2001 International Technology Roadmap for Semiconductors (ITRS) clearly states that in the near future, VLSI scaling will be inhibited by high variations in the values of key parameters, higher defect densities, and higher susceptibility to external noise. This is expected to occur in spite of continuous, widespread, and extensive research into improving fabrication processes. If ignored, this slowdown in VLSI scaling will significantly inhibit the continuation of the IT revolution.
We propose two new notions, namely error-tolerance and acceptable operation (as distinct from fault- or defect-tolerance and error-free operation). These notions systematically capture, for the first time, the fact that an increasingly large class of digital systems can be useful even if they do not perfectly conform to a deterministic design specification. This class of applications spans many "grand challenge" problems in scientific computing that involve stochastic optimization; audio, video, and speech systems; and others such as information extraction and fusion, real-time machine translation of natural languages, speech processing, and biometrics. For example, we have shown that in the largest module within an MPEG encoder, a large percentage of defects cause errors at the output of the decoder that are not only of low significance by a traditional measure (PSNR) but also imperceptible to the viewer.

The proposed research will consider, in a vertically integrated manner, such applications, their users, their hardware implementations, and the details of VLSI devices and the fabrication process. Hence, the technical focus area of the proposed research is the integration of computing, networking, and human-computer interfaces. All existing methodologies for the design and test of digital systems, including state-of-the-art defect-tolerance (DT) and fault-tolerance (FT) methodologies, focus on obtaining systems that perfectly conform to a deterministic design specification. In other words, they discard any system that does not provide error-free operation, i.e., any system containing a defect that causes an error (that cannot be masked) at the system outputs.

We demonstrate that the proposed notion of error-tolerance will dramatically increase yields, and hence decrease costs, for the above class of systems. We show that these improvements will occur even when a state-of-the-art process of the future is used to fabricate a circuit designed using state-of-the-art DT and FT principles. We propose to develop the first systematic methodology for the design and test of this class of digital systems, one that exploits the notion of error tolerance to enable dramatic improvements in scale, speed, and cost. In the proposed methodology, the system specification will include a description of the types of errors at system outputs, and the thresholds on their severities, that are tolerable. The design methodology will exploit this information to obtain designs that provide higher performance and/or lower cost.
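To make the idea of an error-severity threshold concrete, the minimal sketch below (ours, not taken from the publications; the function names and the 35 dB threshold are illustrative assumptions) shows how acceptance of a defective chip could be decided by comparing the PSNR of its output against an application-specified tolerance, rather than by requiring bit-exact, error-free operation.

```python
import numpy as np

def psnr(reference, observed, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a fault-free and a faulty output."""
    mse = np.mean((reference.astype(np.float64) - observed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # outputs are identical
    return 10.0 * np.log10(peak ** 2 / mse)

def is_acceptable(reference, observed, psnr_threshold_db=35.0):
    """Illustrative acceptance test: a defective chip is kept rather than discarded
    if the error it introduces stays within the application's PSNR tolerance
    (the 35 dB default is an assumed, application-specific value)."""
    return psnr(reference, observed) >= psnr_threshold_db

# Hypothetical usage: compare the frame produced by a defective encoder/decoder
# pair against the fault-free reference frame.
reference_frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
faulty_frame = reference_frame.copy()
faulty_frame[::64, ::64] ^= 4  # small, sparse perturbation standing in for a defect's effect
print(psnr(reference_frame, faulty_frame), is_acceptable(reference_frame, faulty_frame))
```

In an error-tolerant test flow, a check of this kind would replace the conventional DT/FT pass/fail criterion, under which any unmasked error at the system outputs causes the part to be discarded.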
Over the next 15 years, the proposed approach will provide dramatic improvements in scale, speed, and cost for a wide class of digital systems. These increases in scale and speed will enable the development of devices with advanced capabilities in areas such as speech processing, real-time translation of spoken natural languages, and biometrics.
Publications
- M. A. Breuer, S. K. Gupta, and T. M. Mak, "Defect and error tolerance in the presence of massive numbers of defects". IEEE Design & Test of Computers, 21:216–227, May–June 2004.
- Z. Jiang and S. Gupta, "An ATPG for threshold testing: Obtaining acceptable yield in future processes". In Proc. of International Test Conference, 2002.
- H. Chung and A. Ortega, "System level fault tolerance for motion estimation". Technical Report USC-SIPI354, Signal and Image Processing Institute, Univ. of Southern California, 2002.
- H. Chung and A. Ortega, "Analysis and testing for error tolerant motion estimation". In Proc. of IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), Monterey, October 2005.
- I. Chong and A. Ortega, "Hardware Testing for Error Tolerant Multimedia Compression Based on Linear Transforms". In Proc. of IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT), Monterey, October 2005.
- I. Chong, H. Cheong, and A. Ortega, "New Quality Metric for Multimedia Compression Using Faulty Hardware". In Proc. of International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Arizona, January 2006.
- H. Cheong, I. Chong, and A. Ortega, "Computation Error Tolerance in Motion Estimation Algorithms". In Proc. of International Conference on Image Processing (ICIP), Atlanta, October 2006.
This work was supported in part by the National Science Foundation under Grant No. 0428940.