Browsing by Author "Hilton, A"
Item (Open Access): BOLT: Energy-efficient out-of-order latency-tolerant execution (Proceedings - International Symposium on High-Performance Computer Architecture, 2010-05-27). Hilton, A; Roth, A.
LT (latency tolerant) execution is an attractive candidate technique for future out-of-order cores. LT defers the forward slices of LLC (last-level cache) misses to a slice buffer and re-executes them when the misses return. An LT core increases ILP without physically scaling the issue queue and register file and increases MLP without additional software threads that can reduce cache performance. Unfortunately, proposed LT designs are not energy efficient. They require too many additional structures and they defer and re-execute too many instructions to justify their performance gains. In this paper, we address these inefficiencies. We introduce a microarchitecture called BOLT (Better Out-of-Order Latency-Tolerance) that implements LT as an alternative use of SMT (Simultaneous Multi-Threading). We also present a new slice buffer organization and traversal scheme that increases performance and reduces overhead by pruning instances of useless and redundant LT. Collectively, these modifications turn out-of-order LT into a technique that improves performance in an energy-efficient way. ©2009 IEEE.

Item (Open Access): CPROB: Checkpoint processing with opportunistic minimal recovery (Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 2009-11-23). Hilton, A; Eswaran, N; Roth, A.
CPR (Checkpoint Processing and Recovery) is a physical register management scheme that supports a larger instruction window and higher average IPC than conventional ROB-style register management. It does so by restricting mis-speculation recovery to checkpoints created at rename, and leveraging this restriction to aggressively reclaim registers that don't appear in checkpoints.
The cost of CPR is checkpoint overhead, which is incurred when a mis-speculation occurs on an instruction for which a checkpoint was not created a priori. Here, CPR must recover to the immediately older checkpoint, squashing instructions older than the mis-speculation itself. In contrast, a ROB processor performs minimal recovery and only squashes instructions younger than the mis-speculation. CPROB is a hybrid register management scheme that preserves CPR's aggressive reclamation while opportunistically minimizing checkpoint overhead. CPROB extends CPR to track and hold the registers needed to perform minimal recovery to un-executed branches within each checkpoint. Recovery registers are held on a best-effort basis only. A checkpoint's recovery registers can be freed spontaneously when all branches in the checkpoint execute. They can also be aggressively victimized if dispatch needs registers to proceed. CPROB naturally adapts the register reclamation policy to dynamic branch behavior. When branch mis-predictions are infrequent and registers are needed to support a large window, CPROB victimizes registers and behaves like CPR. When mis-predictions are frequent and the window is small, CPROB holds on to registers and behaves like ROB. As a result, it out-performs both CPR and ROB for a given program. This performance improvement, combined with reduced checkpoint overhead, makes CPROB more energy-efficient than either ROB or CPR.

Item (Open Access): Decoupled store completion/silent deterministic replay: Enabling scalable data memory for CPR/CFP processors (Proceedings - International Symposium on Computer Architecture, 2009-11-30). Hilton, A; Roth, A.
CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) support an adaptive instruction window that scales to tolerate last-level cache misses. CPR/CFP scale the register file by aggressively reclaiming the destination registers of many in-flight instructions.
However, an analogous mechanism does not exist for stores and loads. As the window expands, CPR/CFP processors must track all in-flight stores and loads to support forwarding and detect memory ordering violations. The previously-described SVW (Store Vulnerability Window) and SQIP (Store Queue Index Prediction) schemes provide scalable, non-associative load and store queues, respectively. However, they don't work smoothly in a CPR/CFP context. SVW/SQIP rely on the ability to dynamically stall some loads until a specific older store writes to the cache. Enforcing this serialization in CPR/CFP is expensive if the load and store are in the same checkpoint. We introduce two complementary procedures that implement this serialization efficiently. Decoupled Store Completion (DSC) allows stores to write to the cache before the enclosing checkpoint completes execution. Silent Deterministic Replay (SDR) supports mis-speculation recovery in the presence of DSC by replaying loads older than completed stores using values from the load queue. The combination of DSC and SDR enables an SVW/SQIP based CPR/CFP memory system that outperforms previous designs while occupying less area. Copyright 2009 ACM.

Item (Open Access): iCFP: tolerating all-level cache misses in in-order processors (Proceedings - International Symposium on High-Performance Computer Architecture, 2009-04-24). Hilton, A; Nagarakatte, S; Roth, A.
Growing concerns about power have revived interest in in-order pipelines. In-order pipelines sacrifice single-thread performance. Specifically, they do not allow execution to flow freely around data cache misses. As a result, they have difficulties overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem. In this paper, we go a step further and introduce iCFP (in-order Continual Flow Pipeline), an adaptation of the CFP concept to an in-order processor.
When iCFP encounters a primary data cache or L2 miss, it checkpoints the register file and transitions into an "advance" execution mode. Miss-independent instructions execute as usual and even update register state. Miss-dependent instructions are diverted into a slice buffer, un-blocking the pipeline latches. When the miss returns, iCFP "rallies" and executes the contents of the slice buffer, merging miss-dependent state with miss-independent state along the way. An enhanced register dependence tracking scheme and a novel store buffer design facilitate the merging process. Cycle-level simulations show that iCFP out-performs Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design. © 2008 IEEE.

Item (Metadata only): iCFP: Tolerating all-level cache misses in in-order processors (IEEE Micro, 2010-01-01). Hilton, A; Nagarakatte, S; Roth, A.
In-order continual flow pipeline (iCFP) is an in-order pipeline that allows execution to flow around data cache misses. When a cache miss occurs, iCFP executes and speculatively retires miss-independent instructions. It saves miss-dependent instructions in a slice buffer. When the miss returns, iCFP re-executes the contents of the slice buffer and merges the results into working state. iCFP exploits existing support for multithreading and several novel components. © 2006 IEEE.
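
The advance/rally behavior described in the two iCFP abstracts can be illustrated with a toy register-level model. This is only a sketch under stated assumptions, not the authors' design: the names `Instr`, `advance_and_rally`, `in_order`, and `demo_program` are hypothetical, there is a single outstanding miss, and checkpoints, stores, branches, and timing are left out. Deferred instructions capture their miss-independent operand values when they enter the slice buffer, and a last-writer index stands in for the "enhanced register dependence tracking" so that a rallied result does not overwrite a newer miss-independent value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Instr:
    dest: str                                  # destination register name
    srcs: Tuple[str, ...] = ()                 # source register names
    fn: Optional[Callable[..., int]] = None    # computes dest from the sources
    is_missing_load: bool = False              # hypothetical flag: load that misses


def in_order(program: List[Instr], regs: Dict[str, int], miss_value: int):
    """Reference result: a blocking pipeline that waits for the miss."""
    regs = dict(regs)
    for ins in program:
        if ins.is_missing_load:
            regs[ins.dest] = miss_value
        else:
            regs[ins.dest] = ins.fn(*(regs[s] for s in ins.srcs))
    return regs


def advance_and_rally(program: List[Instr], regs: Dict[str, int], miss_value: int):
    """Toy advance/rally model. Returns (final registers, deferred indices)."""
    regs = dict(regs)
    producer: Dict[str, int] = {}     # poisoned register -> slice-buffer slot
    last_writer: Dict[str, int] = {}  # register -> index of its newest writer
    slice_buffer = []                 # (index, instr, captured operands)

    # Advance mode: run miss-independent work, defer the miss-dependent slice.
    for seq, ins in enumerate(program):
        last_writer[ins.dest] = seq
        if ins.is_missing_load or any(s in producer for s in ins.srcs):
            # Capture ready operands now; point poisoned ones at their producer.
            operands = [("slice", producer[s]) if s in producer
                        else ("val", regs[s]) for s in ins.srcs]
            producer[ins.dest] = len(slice_buffer)
            slice_buffer.append((seq, ins, operands))
        else:
            regs[ins.dest] = ins.fn(*(regs[s] for s in ins.srcs))
            producer.pop(ins.dest, None)   # dest no longer depends on the miss

    # Rally: the miss has returned, so execute the slice buffer in order.
    results: List[int] = []
    for seq, ins, operands in slice_buffer:
        if ins.is_missing_load:
            value = miss_value
        else:
            value = ins.fn(*(results[x] if kind == "slice" else x
                             for kind, x in operands))
        results.append(value)
        # Merge only if no younger instruction has redefined the register.
        if last_writer[ins.dest] == seq:
            regs[ins.dest] = value
    return regs, [seq for seq, _, _ in slice_buffer]


def demo_program() -> List[Instr]:
    return [
        Instr("r1", is_missing_load=True),            # 0: load that misses
        Instr("r2", ("r1",), lambda a: a + 1),        # 1: miss-dependent
        Instr("r3", ("r4",), lambda a: a * 2),        # 2: miss-independent
        Instr("r4", ("r3",), lambda a: a + 1),        # 3: miss-independent
        Instr("r5", ("r2", "r4"), lambda a, b: a + b),  # 4: miss-dependent
        Instr("r2", ("r3",), lambda a: a),            # 5: overwrites r2
    ]


if __name__ == "__main__":
    final, deferred = advance_and_rally(demo_program(), {"r4": 5}, miss_value=10)
    print(deferred, final == in_order(demo_program(), {"r4": 5}, miss_value=10))
```

In this sketch, instructions 0, 1, and 4 are deferred while 2, 3, and 5 execute during advance mode, and the final register state matches the blocking in-order reference. A real pipeline would also need the store buffer and checkpoint machinery that the abstracts mention.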