There is broad consensus among academic and industrial researchers in computer architecture
that hardware faults, both transient and permanent, will become significantly more
frequent as CMOS feature sizes continue to shrink. Circuit-level techniques alone
are insufficient to overcome this problem, and therefore system designers have begun
to add fault tolerance features to processor micro-architectures and memory systems.
Many of the techniques used today were developed in a time when fault coverage was
the primary optimization target; hardware, power, and performance costs were only
secondary concerns. These priorities do not accurately reflect the needs of today's
commodity systems, which are very sensitive to manufacturing and performance costs
and can trade off some fault coverage to reduce these costs.
In my dissertation work, I have developed novel error detection techniques with significantly
lower area and performance costs than those traditionally used in high-availability
designs. These savings were made possible by a guiding principle of verifying high-level
system tasks rather than checking correct operation of specific low-level components.
This high-level, end-to-end approach to error detection has distinct advantages over
checking low-level components in terms of applicability to a wide range of systems,
coverage of complex component interactions, and implementation cost. The major challenge
in developing end-to-end checkers is to find high-level tasks that are both relevant
and verifiable at runtime. I approached this problem by decomposing system-level tasks
into sub-tasks that are more easily verifiable and, when combined, are sufficient
to ensure correctness of a high-level task. Such a decomposition is a step back from
a full end-to-end design and requires additional assumptions about the underlying
system, but I found that the resulting reductions in cost and complexity outweigh the
accompanying loss in flexibility.
I have applied the ideas of task decomposition and high-level checking to processor
cores, memory systems, and the I/O system to develop low-cost checkers for
each of these subsystems. The checking mechanisms resulting from this work are highly
effective in detecting errors and incur lower hardware and performance costs than
previously proposed mechanisms with comparable error coverage.