Low-cost Methods for Error Detection in Multi-core Systems

Meixner, Albert

Date

2008-04-10

Authors

Meixner, Albert

Advisors

Sorin, Daniel J

Abstract

There is broad consensus among academic and industrial researchers in computer architecture that hardware faults, both transient and permanent, will become significantly more frequent as CMOS feature sizes continue to shrink. Circuit-level techniques alone are insufficient to overcome this problem, and therefore system designers have begun to add fault tolerance features to processor micro-architectures and memory systems. Many of the techniques used today were developed at a time when fault coverage was the primary optimization target; hardware, power, and performance costs were only secondary concerns. These priorities do not accurately reflect the needs of today's commodity systems, which are very sensitive to manufacturing and performance costs and can trade off some amount of fault coverage to reduce these costs.

In my dissertation work, I have developed novel error detection techniques with significantly lower area and performance costs than those traditionally used in high-availability designs. These savings were made possible by a guiding principle of verifying high-level system tasks rather than checking correct operation of specific low-level components. This high-level, end-to-end approach to error detection has distinct advantages over checking low-level components in terms of applicability to a wide range of systems, coverage of complex component interactions, and implementation cost. The major challenge in developing end-to-end checkers is to find high-level tasks that are both relevant and verifiable at runtime. I approached this problem by decomposing system-level tasks into sub-tasks that are more easily verifiable and, when combined, are sufficient to ensure correctness of a high-level task. Such a decomposition is a step back from a full end-to-end design and requires additional assumptions about the underlying system, but I found the resulting cost and complexity benefits to outweigh the loss in flexibility that comes with them.

I have applied the ideas of task decomposition and high-level checking to processor cores, memory systems, and the I/O system in order to develop low-cost checkers for each of these subsystems. The checking mechanisms resulting from this work are highly effective in detecting errors and incur lower hardware and performance costs than previously proposed mechanisms with comparable error coverage.

Type

Dissertation

Department

Computer Science

Subjects

Computer science

Permalink

https://hdl.handle.net/10161/599

Rights

http://rightsstatements.org/vocab/InC/1.0/

Citation

Meixner, Albert (2008). Low-cost Methods for Error Detection in Multi-core Systems. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/599.

Collections

Dissertations

Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.
