Technology Impacts of CMOS Scaling on Microprocessor Core Design for Hard-Fault Tolerance in Single-Core Applications and Optimized Throughput in Throughput-Oriented Chip Multiprocessors
The continued march of technological progress, epitomized by Moore’s Law provides the microarchitect with increasing numbers of transistors to employ as we continue to shrink feature geometries. Physical limitations impose new constraints upon designers in the areas of overall power and localized power density. Techniques to scale threshold and supply voltages to lower values in order to reduce power consumption of the part have also run into physical limitations, exacerbating power and cooling problems in deep sub-micron CMOS process generations. Smaller device geometries are also subject to increased sensitivity to common failure modes as well as manufacturing process variability.
In the face of these added challenges, we observe a shift in the focus of the industry, away from building ever–larger single–core chips, whose focus is on reducing single–threaded latency toward a design approach that employs multiple cores on a single chip to improve throughput. While the early multicore era utilized the existing single–core designs of the previous generation in small numbers, subsequent generations have introduced cores tailored to multicore use. These cores seek to achieve power-efficient throughput and have led to a new emphasis on throughput-oriented computing, particularly for Internet workloads, where the end-to-end computational task is dominated by long–latency network operations. The ubiquity of these workloads makes a compelling argument for throughput–oriented designs, but does not free the microarchitect fully from latency demands of common workloads in enterprise and desktop application spaces.
We believe that a continued need for both throughput–oriented and latency–sensitive processors will exist in coming generations of technology. We further opine that making effective use of the additional transistors that will be available may require different techniques for latency–sensitive designs than for throughput–oriented ones, since we may trade latency or throughput for the desired attribute of a core in each of the respective paradigms.
We make three major contributions with this thesis. Our first contribution is a fine–grained fault diagnosis and deconfiguration technique for array structures, such as the ROB, within the microprocessor core. We present and evaluate two variants of this technique. The first variant uses an existing fault detection and correction technique whose scope is the processor core execution pipeline to ensure correct processor operation. The second variant integrates fault detection and correction into the array structure itself to provide a self–contained, fine–grained, fault detection, diagnosis, and repair technique.
In our second contribution, we develop a lightweight, fine–grained fault diagnosis mechanism for the processor core. In this work, we leverage the first contribution's methods to provide deconfiguration of faulty array elements. We additionally extend the scope of that work to include all pipeline circuitry from instruction issue to retirement.
In our third and final contribution, we focus on throughput–oriented core data cache design. In this work, we study the demands of the throughput–oriented core running a representative workload and then propose and evaluate an alternative data cache implementation that more closely matches the demands of the core. We then show that a better–matched cache design can be exploited to provide improved throughput under a fixed power budget.
Our results show that typical latency–sensitive cores have sufficient redundancy to make finegrained hard–fault tolerance an affordable alternative for hardening complex designs. Our designs suffer little or no performance loss when no faults are present and retain nearly the same performance characteristics in the presence of small numbers of hard faults in protected structures. In our study of the latency–sensitive core, we have shown that SRAM–based designs have low latencies that end up providing less benefit to a throughput–oriented core and workload than a better–fitted data cache composed of DRAM. The move from a high–power, fast technology to a lower–power, slower technology allows us to increase L1 data cache capacity, which is a net benefit for the throughput–oriented core.
Engineering, Electronics and Electrical
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations