Browsing by Author "Roth, A"
Item Open Access
BOLT: Energy-efficient out-of-order latency-tolerant execution
(Proceedings - International Symposium on High-Performance Computer Architecture, 2010-05-27) Hilton, A; Roth, A
LT (latency tolerant) execution is an attractive candidate technique for future out-of-order cores. LT defers the forward slices of LLC (last-level cache) misses to a slice buffer and re-executes them when the misses return. An LT core increases ILP without physically scaling the issue queue and register file and increases MLP without additional software threads that can reduce cache performance. Unfortunately, proposed LT designs are not energy efficient. They require too many additional structures and they defer and re-execute too many instructions to justify their performance gains. In this paper, we address these inefficiencies. We introduce a microarchitecture called BOLT (Better Out-of-Order Latency-Tolerance) that implements LT as an alternative use of SMT (Simultaneous Multi-Threading). We also present a new slice buffer organization and traversal scheme that increases performance and reduces overhead by pruning instances of useless and redundant LT. Collectively, these modifications turn out-of-order LT into a technique that improves performance in an energy-efficient way. ©2009 IEEE.

Item Open Access
CPROB: Checkpoint processing with opportunistic minimal recovery
(Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 2009-11-23) Hilton, A; Eswaran, N; Roth, A
CPR (Checkpoint Processing and Recovery) is a physical register management scheme that supports a larger instruction window and higher average IPC than conventional ROB-style register management. It does so by restricting mis-speculation recovery to checkpoints created at rename, and leveraging this restriction to aggressively reclaim registers that don't appear in checkpoints.
The cost of CPR is checkpoint overhead, which is incurred when a mis-speculation occurs on an instruction for which a checkpoint was not created a priori. Here, CPR must recover to the immediately older checkpoint, squashing instructions older than the mis-speculation itself. In contrast, a ROB processor performs minimal recovery and only squashes instructions younger than the mis-speculation. CPROB is a hybrid register management scheme that preserves CPR's aggressive reclamation while opportunistically minimizing checkpoint overhead. CPROB extends CPR to track and hold the registers needed to perform minimal recovery to un-executed branches within each checkpoint. Recovery registers are held on a best-effort basis only. A checkpoint's recovery registers can be freed spontaneously when all branches in the checkpoint execute. They can also be aggressively victimized if dispatch needs registers to proceed. CPROB naturally adapts the register reclamation policy to dynamic branch behavior. When branch mis-predictions are infrequent and registers are needed to support a large window, CPROB victimizes registers and behaves like CPR. When mis-predictions are frequent and the window is small, CPROB holds on to registers and behaves like ROB. As a result, it out-performs both CPR and ROB for a given program. This performance improvement, combined with reduced checkpoint overhead, makes CPROB more energy-efficient than either ROB or CPR.

Item Open Access
Decoupled store completion/silent deterministic replay: Enabling scalable data memory for CPR/CFP processors
(Proceedings - International Symposium on Computer Architecture, 2009-11-30) Hilton, A; Roth, A
CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) support an adaptive instruction window that scales to tolerate last-level cache misses. CPR/CFP scale the register file by aggressively reclaiming the destination registers of many in-flight instructions.
However, an analogous mechanism does not exist for stores and loads. As the window expands, CPR/CFP processors must track all in-flight stores and loads to support forwarding and detect memory ordering violations. The previously-described SVW (Store Vulnerability Window) and SQIP (Store Queue Index Prediction) schemes provide scalable, non-associative load and store queues, respectively. However, they don't work smoothly in a CPR/CFP context. SVW/SQIP rely on the ability to dynamically stall some loads until a specific older store writes to the cache. Enforcing this serialization in CPR/CFP is expensive if the load and store are in the same checkpoint. We introduce two complementary procedures that implement this serialization efficiently. Decoupled Store Completion (DSC) allows stores to write to the cache before the enclosing checkpoint completes execution. Silent Deterministic Replay (SDR) supports mis-speculation recovery in the presence of DSC by replaying loads older than completed stores using values from the load queue. The combination of DSC and SDR enables an SVW/SQIP based CPR/CFP memory system that outperforms previous designs while occupying less area. Copyright 2009 ACM.

Item Open Access
Flexible register management using reference counting
(Proceedings - International Symposium on High-Performance Computer Architecture, 2012-05-03) Battle, S; Hilton, AD; Hempstead, M; Roth, A
Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme: reference counting. We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders.
We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables. © 2012 IEEE.

Item Open Access
Ginger: Control independence using tag rewriting
(Proceedings - International Symposium on Computer Architecture, 2007-10-22) Hilton, AD; Roth, A
The negative performance impact of branch mis-predictions can be reduced by exploiting control independence (CI). When a branch mis-predicts, the wrong-path instructions up to the point where control converges with the correct path are selectively squashed and replaced with correct-path instructions. Instructions beyond the convergence point, the branch's control-independent (CI) instructions, are spared from squashing. Exploiting CI requires updating the input data dependences of CI instructions to reflect the selective removal and insertion of logically older instructions and transitively re-dispatching those CI instructions whose inputs have changed. This capability is generally called out-of-order renaming. Previously proposed CI designs use out-of-order renaming schemes that either consume excessive rename/dispatch bandwidth, can only be applied in limited cases, or incur a cost even when the branch would be correctly predicted. Ginger is a CI design that is both general and bandwidth efficient. Ginger implements out-of-order renaming using tag rewriting, re-linking the input dependences of CI instructions as they sit in the window. To do this, Ginger halts the pipeline and uses the idle map table read and write ports and the issue queue match lines and write lines to perform a register-tag "search-and-replace" operation. After a few cycles, the pipeline restarts and execution resumes with correct data dependences.
Cycle-level simulation shows that Ginger out-performs previous CI designs, yielding geometric mean speedups over an aggressive non-CI processor of 5%, 12%, and 11% on SPECint2000, MediaBench, and CommBench, with speedups of 15% or greater on 11 of 46 programs. Copyright 2007 ACM.

Item Open Access
iCFP: Tolerating all-level cache misses in in-order processors
(Proceedings - International Symposium on High-Performance Computer Architecture, 2009-04-24) Hilton, A; Nagarakatte, S; Roth, A
Growing concerns about power have revived interest in in-order pipelines. In-order pipelines sacrifice single-thread performance. Specifically, they do not allow execution to flow freely around data cache misses. As a result, they have difficulties overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem. In this paper, we go a step further and introduce iCFP (in-order Continual Flow Pipeline), an adaptation of the CFP concept to an in-order processor. When iCFP encounters a primary data cache or L2 miss, it checkpoints the register file and transitions into an "advance" execution mode. Miss-independent instructions execute as usual and even update register state. Miss-dependent instructions are diverted into a slice buffer, un-blocking the pipeline latches. When the miss returns, iCFP "rallies" and executes the contents of the slice buffer, merging miss-dependent state with miss-independent state along the way. An enhanced register dependence tracking scheme and a novel store buffer design facilitate the merging process. Cycle-level simulations show that iCFP out-performs Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design.
© 2008 IEEE.

Item Metadata only
iCFP: Tolerating all-level cache misses in in-order processors
(IEEE Micro, 2010-01-01) Hilton, A; Nagarakatte, S; Roth, A
In-order continual flow pipeline (iCFP) is an in-order pipeline that allows execution to flow around data cache misses. When a cache miss occurs, iCFP executes and speculatively retires miss-independent instructions. It saves miss-dependent instructions in a slice buffer. When the miss returns, iCFP re-executes the contents of the slice buffer and merges the results into working state. iCFP exploits existing support for multithreading and several novel components. © 2006 IEEE.