Low-cost Methods for Error Detection in Multi-core Systems
There is broad consensus among academic and industrial researchers in computer architecture that hardware faults, both transient and permanent, will become significantly more frequent as CMOS feature sizes continue to shrink. Circuit-level techniques alone are insufficient to overcome this problem, and therefore system designers have begun to add fault tolerance features to processor micro-architectures and memory systems. Many of the techniques used today were developed in a time when fault coverage was the primary optimization target; hardware, power, and performance costs were only secondary concerns. These priorities do not accurately reflect the needs of today's commodity systems, which are very sensitive to manufacturing and performance costs and can trade off some amount of fault coverage to reduce these costs. In my dissertation work I have developed novel error detection techniques with significantly lower area and performance costs than those traditionally used in high-availability designs. These savings were made possible by a guiding principle of verifying high-level system tasks rather than checking correct operation of specific low-level components. This high-level, end-to-end approach to error detection has distinct advantages over checking low-level components in terms of applicability to a wide range of systems, coverage of complex component interactions, and implementation cost. The major challenge in developing end-to-end checkers is to find high-level tasks that are both relevant and verifiable at runtime. I approached this problem by decomposing system-level tasks into sub-tasks that are more easily verifiable and, when combined, are sufficient to ensure correctness of a high-level task. Such a decomposition is a step back from a full end-to-end design and requires additional assumptions about the underlying system, but I found that the resulting cost and complexity benefits outweigh the loss in flexibility that comes with these assumptions.
I have applied the ideas of task decomposition and high-level checking to processor cores, memory systems, and the I/O system in order to develop low-cost checkers for each of these subsystems. The checking mechanisms resulting from this work are highly effective in detecting errors and incur lower hardware and performance costs than previously proposed mechanisms with comparable error coverage.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations