Browsing by Subject "Accelerators"
Item Open Access
Accelerated Motion Planning Through Hardware/Software Co-Design (2019), Murray, Sean
Robotics has the potential to dramatically change society over the next decade. Technology has matured such that modern robots can execute complex motions with sub-millimeter precision. Advances in sensing technology have driven down the price of depth cameras and increased their performance. However, the planning algorithms used in currently deployed systems are too slow to react to changing environments; this has restricted the use of high degree-of-freedom (DOF) robots to tightly controlled environments where planning in real time is not necessary.
Our work focuses on overcoming this challenge through careful hardware/software co-design. We leverage aggressive precomputation and parallelism to design accelerators for several components of the motion planning problem. We present architectures for accelerating collision detection as well as path search. We show how we can maintain flexibility even with custom hardware, and describe microarchitectures that we have implemented at the register-transfer level. We also show how to generate effective planning roadmaps for use with our designs.
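As a hedged illustration of the general approach (not the dissertation's actual microarchitecture), the software-level idea of roadmap planning with precomputed swept volumes can be sketched as follows; the voxel sets, roadmap structure, and function names here are illustrative placeholders.

```python
# Sketch: roadmap collision checking with precomputed swept volumes.
# Each roadmap edge stores the set of workspace voxels its motion sweeps
# through (computed offline); at query time an edge collides iff that set
# intersects the currently occupied voxels.

from collections import deque

def plan(roadmap, swept_voxels, occupied, start, goal):
    """roadmap: {node: [neighbor, ...]}; swept_voxels: {(u, v): frozenset of voxels};
    occupied: set of voxel ids from the depth sensor; returns a node path or None."""
    # 1) Collision stage: drop every edge whose precomputed swept volume
    #    touches an occupied voxel. In hardware this check can be done for
    #    all edges in parallel, one wide logical reduction per edge.
    free = {
        (u, v)
        for (u, v), voxels in swept_voxels.items()
        if not (voxels & occupied)
    }
    # 2) Search stage: breadth-first search over the surviving edges.
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in roadmap.get(u, []):
            if v not in parent and ((u, v) in free or (v, u) in free):
                parent[v] = u
                frontier.append(v)
    return None
```

The design choice this sketch reflects is the one named in the abstract: shift expensive geometry work offline (the swept-volume precomputation) so that the online query reduces to set intersections and a cheap graph search.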
Our accelerators bring the total planning latency to less than 3 microseconds, several orders of magnitude faster than the state of the art. This capability makes it possible to deploy systems that plan under uncertainty, use complex decision making algorithms, or plan for multiple robots in a workspace. We hope this technology will push robotics into domains and applications that were previously infeasible.
Item Open Access
Accelerating Probabilistic Computing with a Stochastic Processing Unit (2020), Zhang, Xiangyu
Statistical machine learning has become a more important workload for computing systems than ever before. Probabilistic computing is a popular approach in statistical machine learning that solves problems by iteratively generating samples from parameterized distributions. As an alternative to Deep Neural Networks, probabilistic computing provides conceptually simple, compositional, and interpretable models. However, probabilistic algorithms are often considered too slow on conventional processors due to the sampling overhead of 1) computing the parameters of a distribution and 2) generating samples from the parameterized distribution. A specialized architecture is needed to address both aspects.
In this dissertation, we claim a specialized architecture is necessary and feasible to efficiently support various probabilistic computing problems in statistical machine learning, while providing high-quality and robust results.
We start by exploring a probabilistic architecture that accelerates Markov Random Field (MRF) Gibbs Sampling by utilizing the quantum randomness of optical-molecular devices: Resonance Energy Transfer (RET) networks. We provide a macro-scale prototype, the first such system to our knowledge, to experimentally demonstrate the capability of RET devices to parameterize a distribution and run a real application. Through a quantitative result-quality analysis, we further reveal design issues in an existing RET-based probabilistic computing unit (1st-gen RSU-G) that lead to unsatisfactory result quality in some applications. By exploring the design space, we propose a new RSU-G microarchitecture that empirically achieves the same result quality as 64-bit floating-point software, with the same area and modest power overheads compared with the 1st-gen RSU-G. An efficient stochastic probabilistic unit can thus be realized using RET devices.
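For context, a minimal software sketch of the workload being accelerated, Gibbs sampling on a small Ising-style MRF, is shown below. This is the textbook algorithm, not the RSU-G design; the grid size, coupling constant, and function name are illustrative.

```python
# Sketch: Gibbs sampling on a small Ising-style Markov Random Field.
# Each step parameterizes a Bernoulli from a node's neighbors and then
# samples from it -- the two per-sample costs the accelerator targets.

import math
import random

def gibbs_ising(n, coupling=0.5, sweeps=100, rng=random.Random(0)):
    """n x n grid of +/-1 spins; returns the state after `sweeps` full sweeps."""
    spins = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                # 1) Parameterize: local field from the 4-neighborhood.
                field = coupling * sum(
                    spins[x][y]
                    for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= x < n and 0 <= y < n
                )
                # 2) Sample: P(spin = +1 | neighbors) is logistic in the field.
                p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
                spins[i][j] = 1 if rng.random() < p_up else -1
    return spins
```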
The RSU-G provides high-quality true Random Number Generation (RNG). We further explore how the quality of an RNG relates to application end-point result quality. Unexpectedly, we discover that the target applications do not necessarily require high-quality RNGs: a simple 19-bit Linear-Feedback Shift Register (LFSR) does not degrade end-point result quality in the tested applications. Therefore, we propose a Stochastic Processing Unit (SPU) with a simple pseudo-RNG that is functionally equivalent to RSU-G while retaining the benefits of a CMOS digital circuit.
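A 19-bit Fibonacci LFSR of the kind the abstract mentions can be sketched as follows. The tap set (19, 6, 2, 1) is one published maximal-length choice; the dissertation's exact taps and seed are not specified here, so treat both as assumptions.

```python
# Sketch: a 19-bit Fibonacci LFSR pseudo-RNG. Taps (19, 6, 2, 1) are one
# commonly tabulated maximal-length tap set; other choices also work.

def lfsr19(seed=0x1, taps=(19, 6, 2, 1)):
    """Yields an endless stream of pseudo-random bits; seed must be nonzero."""
    state = seed & 0x7FFFF            # keep 19 bits
    assert state != 0, "all-zero state locks up an XOR LFSR"
    while True:
        bit = 0
        for t in taps:                # XOR the tapped bit positions (1-indexed)
            bit ^= (state >> (t - 1)) & 1
        state = ((state << 1) | bit) & 0x7FFFF
        yield state & 1

# Usage: draw one 19-bit pseudo-random sample by concatenating output bits.
stream = lfsr19(seed=0x5A5A5)
sample = 0
for _ in range(19):
    sample = (sample << 1) | next(stream)
```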
The above results raise a subsequent question: can we confidently use a probabilistic accelerator with various approximation techniques, even though the end-point result quality ("accuracy") is good on the tested benchmarks? We find that current methodologies for evaluating the correctness of probabilistic accelerators are often incomplete, mostly focusing on end-point result quality while omitting other important statistical properties. Therefore, we claim that a probabilistic architecture should provide some measure (or guarantee) of statistical robustness. We take a first step toward defining metrics and a methodology for quantitatively evaluating the correctness of probabilistic accelerators. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator (SPU) and surface design issues that cannot be exposed using application end-point result quality alone. Finally, we demonstrate the benefits of this framework in a case study guiding design space exploration, showing that statistical robustness comparable to floating-point software can be achieved with limited precision, avoiding floating-point hardware overheads.
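As one concrete instance of the "convergence diagnostic" pillar, here is a sketch of the standard Gelman-Rubin R-hat statistic computed across independent sampler chains; this is the textbook formulation, and the dissertation's exact metric suite may differ.

```python
# Sketch: Gelman-Rubin R-hat over independent chains from the sampler under
# test. Values near 1.0 suggest the chains have mixed (converged).

def gelman_rubin(chains):
    """chains: list of equal-length lists of scalar samples; returns R-hat."""
    m = len(chains)                     # number of chains
    n = len(chains[0])                  # samples per chain
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1)
        for c, mu in zip(chains, means)
    ) / m
    var_hat = (n - 1) / n * W + B / n   # pooled variance estimate
    return (var_hat / W) ** 0.5

# Usage: r_hat = gelman_rubin([chain_a, chain_b, chain_c])  # expect ~1.0
```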
Item Open Access
Accelerator Architectures for Deep Learning and Graph Processing (2020), Song, Linghao
Deep learning and graph processing are two big-data applications that are widely used across many domains. Deep learning training, which is essential for inference, has not yet been fully studied: with data forward, error backward, and gradient calculation, training is a more complicated process than inference, with higher computation and communication intensity. Distributing computation across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. In this dissertation, I present AccPar, a principled and systematic method of determining the tensor partition among multiple heterogeneous accelerators for efficient training acceleration. Emerging resistive random access memory (ReRAM) is promising for processing in memory (PIM). For high-throughput training acceleration in ReRAM-based PIM accelerators, I present PipeLayer, an architecture for layer-wise pipelined parallelism. Graph processing is well known for poor locality and high memory bandwidth demand; in conventional architectures, it incurs a significant amount of data movement and energy consumption. I present GraphR, the first ReRAM-based graph processing accelerator, which follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations at low hardware and energy cost. Sparse matrix-vector multiplication (SpMV), a core kernel of graph processing, is the key computation in iterative solvers for scientific computing, yet efficiently accelerating floating-point processing in ReRAM remains a challenge. For this, I present ReFloat, a data format and a supporting accelerator architecture for low-cost floating-point processing in ReRAM for scientific computing.
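For reference, the SpMV kernel named above is sketched here in standard CSR form; this is the generic kernel that iterative solvers call each iteration, not ReFloat's ReRAM mapping, and the example matrix is made up.

```python
# Sketch: SpMV (y = A x) over a CSR-format sparse matrix. A ReRAM crossbar
# design would evaluate the per-row dot products in analog instead.

def spmv_csr(values, col_idx, row_ptr, x):
    """CSR SpMV: values/col_idx hold nonzeros; row_ptr[i]:row_ptr[i+1] spans row i."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# Usage: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]], x = [1, 1, 1]
# values=[4,1,3,2,5], col_idx=[0,2,1,0,2], row_ptr=[0,2,3,5] -> y=[5.0, 3.0, 7.0]
```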
Item Open Access
Dynamic Deep Learning Acceleration with Co-Designed Hardware Architecture (2023), Hanson, Edward Thor
Recent advancements in Deep Learning (DL) hardware target the training and inference of static DL models, simultaneously achieving high runtime performance and efficiency. However, dynamic DL models are seen as the next step in further pushing the accuracy-performance tradeoff of DL inference and training in our favor: by reshaping the model's parameters or structure based on the input, dynamic DL models have the potential to boost accuracy while introducing marginal computation cost. As the field of DL progresses toward dynamic models, much of the advancement in DL accelerator design is eclipsed by data movement bottlenecks introduced by unpredictable memory access patterns and computation flow. Additionally, designing hardware for every niche task is inefficient due to the high cost of developing new hardware. Therefore, we must carefully design the DL hardware and software stack to support future, dynamic DL models by emphasizing flexibility and generality without sacrificing end-to-end performance and efficiency.
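To make "dynamic" concrete, here is a minimal sketch of one common dynamic-model form, an early-exit classifier whose computation depends on the input; it is illustrative only (single-sample inference assumed), not the dissertation's models or hardware mechanisms.

```python
# Sketch: an early-exit network. Confident inputs stop at the cheap head;
# hard inputs continue. This input-dependent control flow is exactly what
# makes memory access patterns unpredictable for accelerators.

import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    def __init__(self, dim=64, classes=10, threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.exit1 = nn.Linear(dim, classes)   # cheap intermediate head
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.exit2 = nn.Linear(dim, classes)   # final head
        self.threshold = threshold

    def forward(self, x):
        # Assumes a single unbatched sample of shape (dim,).
        h = self.block1(x)
        logits = self.exit1(h)
        if logits.softmax(-1).max() >= self.threshold:
            return logits                      # early exit: skip block2
        return self.exit2(self.block2(h))
```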
This dissertation targets algorithmic-, hardware-, and software-level optimizations of DL systems. Starting at the algorithm level, the robust nature of DNNs is exploited to reduce computation and data movement demand. At the hardware level, dynamic hardware mechanisms are investigated to better serve a broad range of impactful future DL workloads. At the software level, statistical patterns of dynamic models are leveraged to enhance the performance of offline and online scheduling strategies. The success of this research is measured against all key metrics associated with DL and DL acceleration: inference latency and accuracy, training throughput, peak memory occupancy, area efficiency, and energy efficiency.
Item Open Access
Experimental Study of Storage Ring Free-Electron Laser with Novel Capabilities (2016), Yan, Jun
The Duke Free-Electron Laser (FEL) system, driven by the Duke electron storage ring, has been at the forefront of developing new light source capabilities over the past two decades. In 1999, the Duke FEL demonstrated the first lasing of a storage ring FEL in the vacuum ultraviolet (VUV) region, at $194$ nm, using two planar OK-4 undulators. With two helical undulators added to the outboard sides of the planar undulators, in 2005 the highest gain of any storage ring FEL ($47.8\%$) was achieved using the Duke FEL in a four-undulator configuration. In addition, the Duke FEL has been used as the photon source to drive the High Intensity $\gamma$-ray Source (HIGS) via Compton scattering of the FEL beam and the electron beam inside the FEL cavity. Taking advantage of the FEL's wavelength tunability as well as the adjustability of the electron beam energy in the storage ring, nearly monochromatic $\gamma$-ray beams have been produced at the HIGS over a wide energy range, from $1$ to $100$ MeV. To further push the FEL's short-wavelength limit and enhance the FEL gain in the VUV regime for high-energy $\gamma$-ray production, two additional helical undulators were installed in 2012, using an undulator switchyard system that allows switching between the two planar and two helical undulators in the middle section of the FEL system. Using the different undulator configurations made possible by the switchyard, a number of novel capabilities of the storage ring FEL have been developed and exploited over a wide wavelength range, from infrared (IR) to VUV. These new capabilities will eventually be made available to the $\gamma$-ray operation, which will greatly enhance the $\gamma$-ray user research program, creating new opportunities for certain types of nuclear physics research.
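As background (a standard inverse-Compton relation, not a result of this dissertation), the back-scattered $\gamma$-ray energy for a head-on collision is approximately
$$E_\gamma \approx \frac{4\gamma_e^2 E_p}{1 + 4\gamma_e E_p/(m_e c^2) + \gamma_e^2\theta^2},$$
where $\gamma_e$ is the electron Lorentz factor, $E_p$ the FEL photon energy, and $\theta$ the observation angle. On axis this reduces to $E_\gamma \approx 4\gamma_e^2 E_p$, which is why tuning the electron beam energy together with the FEL wavelength sweeps the $\gamma$-ray energy over the $1$ to $100$ MeV range.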
With its wide wavelength tuning range, the FEL is intrinsically well suited to producing lasing at multiple colors. Taking advantage of an undulator system with multiple undulators, we have demonstrated the first two-color lasing of a storage ring FEL. Using either a three- or four-undulator configuration with a pair of dual-band high-reflectivity mirrors, we have achieved simultaneous lasing in the IR and UV spectral regions. Owing to the low gain of the storage ring FEL, the power generated at the two wavelengths can be built up equally and precisely balanced to reach FEL saturation. A systematic experimental program to characterize this two-color FEL has been carried out, covering precise power control, the power stability of two-color lasing, wavelength tuning, and the impact of FEL mirror degradation. Using this two-color laser, we have begun developing a new two-color $\gamma$-ray beam for scientific research at the HIGS.
Using the undulator switchyard, the four helical undulators installed in the beamline can be configured not only to enhance the FEL gain in the VUV regime, but also to allow full polarization control of the FEL beams. For accelerator operation, the use of helical undulators is essential to extend the FEL mirror lifetime by reducing radiation damage from harmonic undulator radiation. Using a pair of helical undulators with opposite helicities, we have realized (1) fast helicity switching between left- and right-circular polarizations, and (2) the generation of fully controllable linear polarization. To extend these new polarization-control capabilities to $\gamma$-ray operation over a wide energy range at the HIGS, a set of FEL polarization diagnostic systems needs to be developed to cover the entire FEL wavelength range. Preliminary development of the polarization diagnostics for the wavelength range from IR to UV has been carried out.
Item Open Access
Female-focused Business Incubation in the Triangle (2015-05-17), Jaffee, Valerie
This Duke University Master's thesis was completed as a pro bono research project for the Women's Business Center of North Carolina (WBC of NC), a Durham-based nonprofit that helps women start and grow businesses throughout the state. To accelerate the growth of women-owned businesses in North Carolina, the WBC of NC is considering designing a female-focused business incubator in the Raleigh-Durham-Chapel Hill "Triangle" area. This report examines the current incubation landscape for women entrepreneurs in the Triangle, explores the need for a female-focused incubator, and provides guidance on designing an incubation program targeted at women.

Item Open Access
In-Memory Computing Architecture for Deep Learning Acceleration (2020), Chen, Fan
The ever-increasing demands of deep learning applications, especially the more powerful but intensive unsupervised models, overwhelm the computation, communication, and storage capabilities of modern general-purpose CPUs and GPUs. To accommodate these memory and computing requirements, multi-core systems that make intensive use of accelerators are the future of computing. Such novel computing systems incur new challenges, including architectural support for model training in the accelerators, large cache demands for multi-core processors, and system performance, energy, and efficiency. In this thesis, I present my research addressing these challenges by leveraging emerging memory and logic devices as well as advanced integration technologies. In the first work, I present ReGAN, the first training accelerator architecture for unsupervised deep learning. ReGAN follows a processing-in-memory strategy, leveraging the energy efficiency of resistive memory arrays for in-situ deep learning execution, and I propose an efficient pipelined training procedure to reduce on-chip memory accesses. In the second work, I present ZARA, which addresses the resource underutilization caused by a new operator, transposed convolution, used in unsupervised learning models; ZARA improves system efficiency through a novel computation deformation technique. In the third work, I present MARVEL, which targets improved power efficiency over previous resistive accelerators. MARVEL leverages monolithic 3D integration by stacking multiple layers of low-power analog/digital conversion circuits implemented with carbon nanotube field-effect transistors, and the area-consuming eDRAM buffers are replaced by dense cross-point Spin Transfer Torque Magnetic RAM. I explore the design space and demonstrate that MARVEL provides further improved power efficiency as the number of integration layers increases. In the last piece of work, I propose the first holistic solution for employing skyrmion racetrack memory as the last-level cache in future high-capacity cache designs. I first present a cache architecture and a physical-to-logical mapping scheme based on a comprehensive analysis of the working mechanism of skyrmion racetrack memory; I then model the impact of process variations and propose a process-variation-aware data management technique to minimize the performance degradation they incur.
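To ground the in-situ execution idea, here is an idealized sketch of the analog matrix-vector multiply a resistive crossbar performs in one step; it models only the ideal physics (no device noise or ADC effects), and the array sizes and values are made up.

```python
# Sketch: idealized ReRAM crossbar matrix-vector multiply. Weights are
# programmed as cell conductances; input voltages drive the rows, and each
# column current sums v * g by Kirchhoff's current law.

def crossbar_mvm(conductances, voltages):
    """conductances: rows x cols matrix of cell conductances (siemens);
    voltages: per-row input vector (volts); returns per-column currents (amps)."""
    cols = len(conductances[0])
    currents = [0.0] * cols
    for g_row, v in zip(conductances, voltages):
        for j in range(cols):
            currents[j] += v * g_row[j]   # I = V * G, summed down the column
    return currents

# Usage: weights [[1e-6, 2e-6], [3e-6, 4e-6]] S, inputs [0.5, 1.0] V
# -> column currents [3.5e-6, 5.0e-6] A, i.e. the dot products of W^T with v.
```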