FAULT MODELING, DESIGN-FOR-TEST, AND FAULT TOLERANCE FOR MACHINE LEARNING HARDWARE
Date
2022
Authors
Chaudhuri, Arjun
Abstract
The ubiquitous application of deep neural networks (DNNs) has led to a rise in demand for custom artificial intelligence (AI) accelerators. Domain-specific AI accelerators for machine-learning inferencing applications are homogeneous designs composed of thousands of identical compute cores, or processing elements (PEs), that interface with on-chip memory (such as local and global buffers). Accelerators can be classified on the basis of two major use cases: training and inferencing. Inferencing is carried out using AI accelerators both on edge devices and in datacenters; they are being deployed for inferencing in autonomous driving, manufacturing automation, and navigation. Many such use cases require high reliability. However, DNN inferencing applications are inherently fault-tolerant with respect to structural faults in the hardware; it has been shown that many faults are not functionally critical, i.e., they do not lead to any significant error in inferencing. As a result, testing for all faults in an accelerator chip is overkill, and methods of functional criticality assessment need to be devised for low-cost testing of large AI chips. Moreover, testing homogeneous array-based AI accelerators by running automatic test pattern generation (ATPG) at the array level results in high CPU time and a large pattern count; current test methods do not fully exploit the regular dataflow in the accelerators. Hence, we develop a "constant-testable" solution wherein a small test-pattern set is generated for one PE and reused for testing all other PEs.
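To make the constant-testable idea concrete, the sketch below broadcasts one small pattern set to every PE of a homogeneous array and compares each PE's responses against the fault-free responses of a reference PE. The multiply-accumulate PE model, the pattern set, and the injected fault are hypothetical placeholders, not the accelerator's actual design or ATPG output.

```python
# Minimal sketch of "constant-testable" reuse of one PE's pattern set across a
# homogeneous PE array. The PE model (a multiply-accumulate unit), the pattern
# set, and the injected fault are illustrative placeholders only.

def pe_response(a, b, acc, fault=None):
    """Multiply-accumulate PE; `fault` optionally forces one output bit, e.g. ("sa0", 0)."""
    out = (a * b + acc) & 0xFFFF
    if fault is not None:
        kind, bit = fault
        out = out & ~(1 << bit) if kind == "sa0" else out | (1 << bit)
    return out

# Small pattern set generated once (e.g., by ATPG on a single PE's netlist).
patterns = [(0xFF, 0x01, 0x0000), (0xAA, 0x55, 0x1234), (0x00, 0xFF, 0xFFFF)]
golden = [pe_response(a, b, acc) for (a, b, acc) in patterns]

# Reuse the same patterns on every PE of a hypothetical 4x4 array; a PE fails
# if any of its responses differs from the reference PE's fault-free response.
injected = {(2, 3): ("sa0", 0)}   # one stuck-at fault injected for demonstration
for row in range(4):
    for col in range(4):
        responses = [pe_response(a, b, acc, injected.get((row, col)))
                     for (a, b, acc) in patterns]
        print(f"PE[{row}][{col}]: {'PASS' if responses == golden else 'FAIL'}")
```

Because every PE sees the same stimuli, the pattern count and pattern-generation effort stay constant as the array scales, which is the property the reuse-based test strategy relies on.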
This dissertation proposes low-cost structural and functional test methods for AI accelerators that exploit the inherent fault tolerance of DNN-driven inferencing applications such as image classification. Incorporating knowledge of fault criticality into testing enables dedicated test effort to be applied to functionally critical faults. The dissertation utilizes supervised learning-driven DNNs, graph convolutional networks (GCNs), and neural twins of digital logic circuits to evaluate the functional criticality of faults in the gate-level netlist of an inferencing accelerator, thereby bypassing the need for computationally expensive brute-force fault simulations.
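At a high level, the GCN-based criticality assessment treats fault sites in the gate-level netlist as graph nodes and lets message passing over the netlist connectivity produce per-node criticality scores. The toy graph, node features, and untrained weights below are placeholders; the dissertation's actual GCN architecture and feature set are not reproduced here.

```python
import numpy as np

# Toy netlist graph: nodes are gates/fault sites, edges follow signal connectivity.
# Adjacency, features, and weights are illustrative stand-ins only.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.rand(4, 5)           # per-node features (e.g., gate type, fan-out, testability)

# One GCN layer with symmetric normalization: H = ReLU(D^-1/2 (A + I) D^-1/2 X W).
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
W1 = np.random.rand(5, 8)
H = np.maximum(A_hat @ X @ W1, 0)

# Per-node binary head: probability that a fault at this site is functionally critical.
W2 = np.random.rand(8, 1)
p_critical = 1.0 / (1.0 + np.exp(-(H @ W2)))
print(p_critical.ravel())          # untrained scores; in practice the network is trained
                                   # on criticality labels from a limited set of fault simulations
```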
The generation of labeled data for supervised learning introduces prohibitive computation costs if the labeling process involves time-consuming simulations. For criticality analysis, a large number of fault simulations are needed to collect sufficient information about critical and benign faults. High runtime requirements for collecting sufficient labeled data become the bottleneck in supervised learning-driven fault-criticality analysis. This dissertation presents methodologies that reduce the amount of labeled and balanced data required for accurate classifier training.
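One generic way to contain the labeling cost, shown here only as an illustrative sketch and not as the specific methodology developed in the dissertation, is uncertainty-driven sampling: train on a small labeled seed set and spend fault-simulation effort only on the faults the current classifier is least sure about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Uncertainty-sampling loop on synthetic data. Features, labels, and budget sizes
# are fabricated placeholders; real labels would come from fault simulations.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 6))                              # per-fault features
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)     # stand-in criticality labels

labeled = list(rng.choice(len(X_pool), size=20, replace=False))  # small seed set
for _ in range(5):                                               # five labeling rounds
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    uncertainty = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)
    candidates = [i for i in np.argsort(uncertainty) if i not in labeled]
    labeled.extend(candidates[:20])                              # "simulate" 20 more faults
print(f"trained with {len(labeled)} labels out of a pool of {len(X_pool)} faults")
```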
Resistive random-access memory (RRAM) devices constitute a promising technology for building neuromorphic accelerator hardware due to their processing-in-memory capabilities. The fundamental matrix-multiply operations in AI accelerators can be executed with reduced latency and power consumption by RRAM cells; however, RRAM cells are known to suffer from high defect rates that lead to faulty behavior. It is therefore important to analyze RRAM fault models and understand the root causes of defects and variations. In this dissertation, we present a physics-based classification of RRAM fault origins for dense RRAM crossbars; high density is a requirement for the training and inferencing of large neural networks with high throughput. These insights into RRAM fault origins provide valuable feedback for the fabrication and design of RRAM-based accelerators. In addition to fault analysis, faulty RRAM cells in a crossbar must be tolerated to ensure intended system operation, especially when crossbars suffer from low-to-medium defect densities and it is not economically viable to discard the entire crossbar. Although software-based fault-tolerance schemes have been proposed in the literature, more efficient fault tolerance for RRAM crossbars can be achieved through innovations in hardware design. The dissertation presents the architecture of a novel processing element that tolerates faults in binary RRAM-based crossbars for in-memory computing.
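The sketch below gives a minimal numerical picture of why defective cells matter in crossbar-based in-memory computing: stuck cells directly perturb the accumulated currents that realize the matrix-vector product. The crossbar size, binary weights, and fault locations are illustrative, not measured RRAM data.

```python
import numpy as np

# Illustrative binary crossbar computing y = W @ x; cell conductances encode
# binary weights, and stuck-at-ON/OFF cells corrupt the accumulated currents.
rng = np.random.default_rng(1)
W = rng.integers(0, 2, size=(4, 6))      # intended binary weights
x = rng.integers(0, 2, size=6)           # binary input vector (word-line activations)

faults = {(1, 2): 1, (3, 4): 0}          # (row, col) -> stuck-at-ON (1) / stuck-at-OFF (0)
W_faulty = W.copy()
for (r, c), stuck in faults.items():
    W_faulty[r, c] = stuck

y_golden = W @ x                         # fault-free result
y_faulty = W_faulty @ x                  # result observed on the defective crossbar
print("error per output:", y_faulty - y_golden)   # outputs touched by faults may deviate;
                                                  # fault tolerance must mask or correct this
```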
Monolithic 3D (M3D) ICs have emerged as suitable platforms for high-density vertical integration of large systems-on-chip (SoCs) such as domain-specific and neuromorphic inferencing accelerators, with significant improvements in power, performance, and area (PPA) over 2D and conventional 3D-stacked ICs. However, the immature M3D fabrication process is prone to defects, especially in the inter-layer vias (ILVs), and to inter-tier process variations. In this dissertation, we present state-of-the-art low-cost built-in self-test (BIST) solutions for detecting and localizing both hard and resistive (small-delay) defects in ILVs. In addition to testing ILVs in high-density, realistic M3D layouts, tier-level fault localization is needed for yield ramp-up prior to high-volume production of M3D accelerator ICs. Because of overhead concerns, only a limited number of observation points can be inserted on the outgoing ILVs of an M3D tier for fault localization. This dissertation introduces NodeRank, an intelligent graph-theoretic algorithm for observation-point insertion on an optimal set of outgoing ILVs in an M3D tier, which increases the diagnosability of detected faults in the M3D design.
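Since the abstract does not spell out NodeRank's internals, the sketch below only illustrates the general flavor of budgeted observation-point insertion: rank candidate outgoing ILVs by how many internal fault sites they would make observable and pick greedily under an overhead budget. The graph, fan-in cones, and budget are hypothetical placeholders, not the NodeRank algorithm itself.

```python
# Greedy, budget-constrained selection of observation points on outgoing ILVs.
# Each ILV is mapped to the set of fault sites in its fan-in cone; we repeatedly
# pick the ILV that newly covers the most sites. Data and budget are fabricated.
ilv_cone = {
    "ilv0": {0, 1, 2, 3},
    "ilv1": {2, 3, 4},
    "ilv2": {5, 6},
    "ilv3": {0, 6, 7, 8},
}
budget = 2                    # observation points allowed by the DfT overhead budget

covered, chosen = set(), []
for _ in range(budget):
    best = max((v for v in ilv_cone if v not in chosen),
               key=lambda v: len(ilv_cone[v] - covered))
    chosen.append(best)
    covered |= ilv_cone[best]

print("observation points on:", chosen)
print("observable fault sites:", sorted(covered))
```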
In summary, the dissertation addresses important problems related to the functional impact of hardware faults in machine learning applications, low-cost test and diagnosis of accelerator faults, technology bring-up and fault tolerance for RRAM-based neuromorphic engines, and design-for-testability (DfT) for high-density M3D ICs. The insights and findings resulting from this dissertation are anticipated to lead to the fabrication of reliable accelerator ICs supported by low-cost DfT infrastructure.
Type
Dissertation
Permalink
https://hdl.handle.net/10161/26881
Citation
Chaudhuri, Arjun (2022). FAULT MODELING, DESIGN-FOR-TEST, AND FAULT TOLERANCE FOR MACHINE LEARNING HARDWARE. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/26881.