Methodologies for the Design of High-Performance, Scalable Hardware Accelerators
Date
2025
Abstract
The past decade of computing was expected to be marked by the end of Moore's Law and Dennard scaling, and with it a tapering of gains in computing performance and energy efficiency. Instead, the community has witnessed continued increases in transistor density and computing efficiency. This trend has been fueled in no small part by the proliferation of accelerator architectures. In contrast to the general-purpose architectures that characterized earlier eras, accelerators are circuits specialized for particular applications. But as computing capabilities have increased, so has the size of workloads, particularly in machine learning.
To meet these demands, this thesis investigates methodologies for improving the scalability of accelerator architectures. We begin by exploring tiled Markov chain Monte Carlo (MCMC) accelerators: we identify inefficiencies in prior work, resolve them, and present the BigLittleMCA architecture, which leverages burn-in to improve power efficiency by 47% relative to prior work.
While BigLittleMCA provides application-specific optimizations to improve scalability, we next turn to facilitating scalability for large SoC architectures. To this end, we develop Beethoven, a general-purpose programming framework for multi-core, platform-agnostic accelerator development. We demonstrate how Beethoven can be used to deploy accelerators across a range of approaches, from microkernels to application-scale accelerators. We begin by demonstrating the shortcomings of prior work on a 4KB memcpy benchmark. Then, we compare against High-Level Synthesis (HLS) approaches using the MachSuite benchmark suite. Next, we show how Beethoven can scale a single-core accelerator from prior work into a multi-core FPGA system, achieving 3.3 times higher throughput and 34 times lower energy per operation than a GPU on our chosen workload, attention. Finally, using a boosted decision-tree accelerator design from prior work, we show that existing multi-core accelerator systems exhibit structural patterns that Beethoven's abstractions can target.
As the final work, we present TinyProSE, an accelerator architecture for decoder-only large language models (LLMs), and its tape-out in TSMC 16N silicon, built using Beethoven. In this work, we identify inefficiencies in computing non-linear functions and develop TinyAct, an area-efficient methodology for computing them.
Citation
Kjellqvist, Christopher Mattias (2025). Methodologies for the Design of High-Performance, Scalable Hardware Accelerators. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/34149.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.
