Coordinating Software and Hardware for Performance Under Power Constraints

For more than 50 years since its introduction in 1965, Moore's Law has been a self-fulfilling prophecy that drives computing forward. However, with the end of Dennard scaling, chip power density presents a challenge that becomes more severe with every process generation. Consequently, a growing fraction of the transistors on a chip must be powered off to operate within a sustainable thermal envelope, a design strategy commonly referred to as "dark silicon".

Although dark silicon poses a major challenge for consistently delivering higher performance, it also inspires researchers to rethink how a chip should be designed and managed in a power- and thermally constrained environment. The historical way of extracting performance, whether single-threaded or multi-threaded, by adding complicated and power-hungry hardware is no longer viable. Instead, we need to rely more on software, with the hardware providing new mechanisms to support it. In this thesis, we present three pieces of work on software and hardware codesign that demonstrate how coordinating software, such as compilers and runtimes, with underlying hardware support can help boost performance in a power-efficient way.

First, out-of-order (OoO) processors achieve higher performance than in-order (IO) processors by aggressively scheduling instructions out of program order during execution. However, dynamic scheduling requires sophisticated control and numerous bookkeeping structures (e.g., reorder buffer, load-store queue, register alias table) that increase complexity, area, and, most importantly, power. Observing that a compiler produces better static schedules when the instruction set defines simple operations, we propose an ISA extension that decouples the data access and register write operations in a load instruction. We show that, with modest system and hardware support, we can improve compilers' instruction scheduling by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance with a geometric mean speedup of 8.4% for a wide range of applications, bringing IO designs a step closer to OoO performance.

Second, sprinting is a class of computational mechanisms that provide a short but significant performance boost while temporarily exceeding the thermal design point. Using phase change material to buffer heat, sprinting is a promising way to deliver high performance in future chip designs that are likely to be power- and thermally constrained. However, because sprints cannot be sustained, the system needs a mechanism to decide when to start and stop a sprint. We propose UTAR, a software runtime framework that manages sprints by dynamically predicting utility and modeling thermal headroom. Moreover, we propose a new sprint mechanism for caches that briefly increases capacity for enhanced performance. For a system that extends last-level cache capacity from 2 MB to 4 MB per core and can absorb 10 J of heat with phase change material, UTAR-guided cache sprinting improves performance by 17% on average and by up to 40% over a non-sprinting system. These performance outcomes, within 95% of an oracular policy, are possible because UTAR accurately predicts phase behavior and sprint utility.
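The start/stop decision described above can be illustrated with a minimal sketch. It is not UTAR's actual policy: the names `ThermalBudget` and `should_sprint`, the utility threshold, and the power figures used with it are hypothetical, and only the 10 J heat capacity comes from the text. The sketch assumes a sprint is worthwhile when the predicted speedup clears a threshold and the heat it would add over the next interval fits in the remaining headroom.

```python
from dataclasses import dataclass


@dataclass
class ThermalBudget:
    """Toy model of heat buffered by phase change material (hypothetical)."""
    capacity_j: float = 10.0  # heat the material can absorb (10 J, from the text)
    stored_j: float = 0.0     # heat currently buffered

    def headroom_j(self) -> float:
        # Heat that can still be absorbed before the sprint must stop.
        return self.capacity_j - self.stored_j

    def step(self, net_power_w: float, dt_s: float) -> None:
        # Positive net power stores heat, negative net power drains it.
        # The stored heat is clamped to the range [0, capacity].
        updated = self.stored_j + net_power_w * dt_s
        self.stored_j = min(self.capacity_j, max(0.0, updated))


def should_sprint(predicted_speedup: float, budget: ThermalBudget,
                  extra_power_w: float, dt_s: float,
                  min_speedup: float = 1.05) -> bool:
    """Sprint only if the predicted utility is high enough and the extra
    heat for one interval fits in the remaining thermal headroom.
    The 1.05 threshold is an assumed value for illustration."""
    fits = extra_power_w * dt_s <= budget.headroom_j()
    return fits and predicted_speedup >= min_speedup
```

In this sketch a runtime would call `should_sprint` once per interval with the phase predictor's speedup estimate, then call `step` with the net power actually dissipated so that the headroom recovers while the system is not sprinting.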

Finally, applications often exhibit phase behaviors that demand different types of system resources. As a result, management frameworks need to coordinate between different sprinting mechanisms to realize the full performance potential. We propose UTAR+, an extended version of UTAR that determines not only when to sprint, but also which type of resource to sprint and at what intensity. Building upon UTAR's phase predictor and its utility- and thermal-aware policy, UTAR+ quickly identifies the most profitable sprinting option to maximize performance/watt. For a system that offers multiple sprinting options, UTAR+-guided multi-resource sprinting improves performance by 22% on average and by up to 83% over a non-sprinting system, outperforming UTAR+-guided single-resource sprinting for a variety of applications.
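As a rough illustration of choosing among sprinting options, the sketch below picks the option with the best performance/watt among those that fit the thermal headroom. It is a simplified stand-in, not the UTAR+ algorithm: `SprintOption`, `pick_option`, and all parameters are hypothetical names introduced here, and performance/watt is assumed to be predicted speedup divided by total power.

```python
from typing import NamedTuple, Optional, Sequence


class SprintOption(NamedTuple):
    """One candidate sprint (all fields are illustrative assumptions)."""
    resource: str             # e.g. "cache" or "core"
    intensity: int            # sprint level for that resource
    predicted_speedup: float  # relative to not sprinting; 1.0 means no gain
    extra_power_w: float      # power drawn above the non-sprinting baseline


def pick_option(options: Sequence[SprintOption], headroom_j: float,
                interval_s: float,
                baseline_power_w: float) -> Optional[SprintOption]:
    """Return the thermally feasible option with the highest
    performance/watt, or None if no option beats not sprinting."""

    def perf_per_watt(option: SprintOption) -> float:
        return option.predicted_speedup / (baseline_power_w + option.extra_power_w)

    # Keep only options whose extra heat for one interval fits the headroom.
    feasible = [o for o in options if o.extra_power_w * interval_s <= headroom_j]
    best = max(feasible, key=perf_per_watt, default=None)
    # Not sprinting has speedup 1.0 at baseline power; sprint only if better.
    if best is None or perf_per_watt(best) <= 1.0 / baseline_power_w:
        return None
    return best
```

A runtime built along these lines would re-evaluate the choice whenever the predicted phase changes, since the speedup estimate for each resource depends on the phase.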





Huang, Ziqiang (2019). Coordinating Software and Hardware for Performance Under Power Constraints. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.