Browsing by Subject "Computer architecture"
Item Open Access: Accelerated Motion Planning Through Hardware/Software Co-Design (2019). Murray, Sean.
Robotics has the potential to dramatically change society over the next decade. Technology has matured such that modern robots can execute complex motions with sub-millimeter precision. Advances in sensing technology have driven down the price of depth cameras and increased their performance. However, the planning algorithms used in currently-deployed systems are too slow to react to changing environments; this has restricted the use of high degree-of-freedom (DOF) robots to tightly-controlled environments where planning in real time is not necessary.
Our work focuses on overcoming this challenge through careful hardware/software co-design. We leverage aggressive precomputation and parallelism to design accelerators for several components of the motion planning problem. We present architectures for accelerating collision detection as well as path search. We show how we can maintain flexibility even with custom hardware, and describe microarchitectures that we have implemented at the register-transfer level. We also show how to generate effective planning roadmaps for use with our designs.
Our accelerators bring the total planning latency to less than 3 microseconds, several orders of magnitude faster than the state of the art. This capability makes it possible to deploy systems that plan under uncertainty, use complex decision making algorithms, or plan for multiple robots in a workspace. We hope this technology will push robotics into domains and applications that were previously infeasible.
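The flow described in this abstract can be made concrete with a small software sketch: a roadmap whose edges carry precomputed swept-volume voxels, a collision-check pass over all edges against the currently perceived obstacles (done in parallel in the hardware, sequentially here), and a shortest-path search over the surviving edges. All names and data structures below are illustrative stand-ins, not the dissertation's actual interfaces or microarchitecture.

```python
# Minimal software sketch of roadmap-based planning with precomputed
# per-edge collision data. Illustrative only; not the accelerator design.
import heapq

def plan(edges, edge_swept_voxels, obstacle_voxels, start, goal):
    """edges: dict node -> list of (neighbor, cost);
    edge_swept_voxels: dict (u, v) -> set of voxel ids swept by that edge
    (precomputed offline, as the hardware assumes);
    obstacle_voxels: set of voxel ids currently occupied by obstacles."""
    # 1) Collision-check every roadmap edge against perceived obstacles.
    #    The accelerator does this for all edges in parallel; here it is a loop.
    blocked = {e for e, voxels in edge_swept_voxels.items()
               if voxels & obstacle_voxels}

    # 2) Shortest-path search (Dijkstra) over the surviving edges.
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges.get(u, []):
            if (u, v) in blocked or (v, u) in blocked:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))

    # 3) Reconstruct the path, if one exists.
    if goal not in dist:
        return None
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path))
```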
Item Open Access: Accelerator Architectures for Deep Learning and Graph Processing (2020). Song, Linghao.
Deep learning and graph processing are two big-data applications that are widely used across many domains. The training of deep learning models is essential for inference but has not yet been fully studied. With forward data propagation, backward error propagation, and gradient calculation, deep learning training is a more complicated process with higher computation and communication intensity. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. In this dissertation, I present AccPar, a principled and systematic method for determining the tensor partition across multiple heterogeneous accelerators for efficient training acceleration. Emerging resistive random access memory (ReRAM) is promising for processing in memory (PIM). For high-throughput training acceleration in ReRAM-based PIM accelerators, I present PipeLayer, an architecture for layer-wise pipelined parallelism. Graph processing is well known for poor locality and high memory bandwidth demand. In conventional architectures, graph processing incurs a significant amount of data movement and energy consumption. I present GraphR, the first ReRAM-based graph processing accelerator, which follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations at low hardware and energy cost. Sparse matrix-vector multiplication (SpMV), a subset of graph processing, is the key computation in iterative solvers for scientific computing. Efficiently accelerating floating-point processing in ReRAM remains a challenge. In this dissertation, I present ReFloat, a data format and supporting accelerator architecture for low-cost floating-point processing in ReRAM for scientific computing.
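As a rough intuition for the heterogeneous-partitioning problem that AccPar addresses, the toy sketch below splits one layer's batch across accelerators in proportion to their assumed throughput so they finish at about the same time. It is only a back-of-the-envelope load-balancing illustration with hypothetical device names and numbers; the actual AccPar method partitions tensors in a principled, per-layer fashion that this sketch does not capture.

```python
# Toy illustration of proportional work partitioning across heterogeneous
# accelerators. Not the AccPar algorithm; a load-balancing intuition only.

def partition_batch(batch_size, device_tflops):
    """Return how many samples each device processes.
    device_tflops: dict device_name -> assumed sustained throughput."""
    total = sum(device_tflops.values())
    shares = {d: int(round(batch_size * t / total)) for d, t in device_tflops.items()}
    # Fix rounding so the shares sum exactly to batch_size.
    drift = batch_size - sum(shares.values())
    largest = max(shares, key=shares.get)
    shares[largest] += drift
    return shares

if __name__ == "__main__":
    # Hypothetical numbers: one fast and two slower accelerators.
    print(partition_batch(512, {"acc0": 100.0, "acc1": 60.0, "acc2": 40.0}))
    # -> {'acc0': 256, 'acc1': 154, 'acc2': 102}
```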
Item Open Access: Coordinating the Design and Management of Heterogeneous Datacenter Resources (2014). Guevara, Marisabel Alejandra.
Heterogeneous design presents an opportunity to improve energy efficiency but raises a challenge in management. Whereas prior work separates the two, we coordinate heterogeneous design and management. We present a market-based resource allocation mechanism that navigates the performance and power trade-offs of heterogeneous architectures. Given this management framework, we explore a design space of heterogeneous processors and show a 12x reduction in response time violations when equipping a datacenter with three processor types over a homogeneous system that consumes the same power. To better understand trade-offs in large heterogeneous design spaces, we explore dozens of design strategies and present a risk taxonomy that classifies the reasons why a deployed system may underperform relative to design targets. We propose design strategies that explicitly mitigate risk, such as a strategy that minimizes the coefficient of variation in performance. In our experiments, we find that risk-aware design accounts for more than 70% of the strategies that produce systems with the best service quality. We also present a new datacenter management mechanism that fairly allocates processors to latency-sensitive applications. Tasks express value for performance using sophisticated piecewise-linear utility functions. With fairness in market allocations, we show how datacenters can mitigate envy amongst latency-sensitive users. We quantify the price of fairness and detail efficiency-fairness trade-offs. Finally, we extend the market to fairly allocate heterogeneous processors.
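To illustrate the kind of bid such a market might accept, the sketch below evaluates a piecewise-linear utility function over a processor allocation and then hands out processors greedily by marginal value. The function names and utility curves are hypothetical; this is not the allocation mechanism evaluated in the dissertation, only a toy with the same ingredients.

```python
# Sketch of a piecewise-linear utility bid plus a greedy allocation by
# marginal value. Illustrative only; not the dissertation's market mechanism.

def utility(breakpoints, x):
    """breakpoints: sorted list of (allocation, value) pairs defining a
    piecewise-linear utility; x: proposed allocation."""
    if x <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return breakpoints[-1][1]

def greedy_allocate(bids, capacity):
    """Hand out processors one at a time to whichever task gains the most
    utility from one more unit (works well for concave utilities)."""
    alloc = {t: 0 for t in bids}
    for _ in range(capacity):
        gains = {t: utility(b, alloc[t] + 1) - utility(b, alloc[t])
                 for t, b in bids.items()}
        winner = max(gains, key=gains.get)
        if gains[winner] <= 0:
            break
        alloc[winner] += 1
    return alloc

if __name__ == "__main__":
    bids = {"web": [(0, 0), (4, 80), (8, 90)],   # steep early, flat later
            "batch": [(0, 0), (8, 60)]}          # linear value
    print(greedy_allocate(bids, capacity=10))    # -> {'web': 4, 'batch': 6}
```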
Item Open Access: Design Strategies for Efficient and Secure Memory (2019). Lehman, Tamara Silbergleit.
Recent computing trends force users to relinquish physical control to unknown parties, making the system vulnerable to physical attacks. Software alone is not well equipped to protect against physical attacks; instead, software and hardware have to enforce security in collaboration to defend against them. Many secure processor implementations have surfaced over the last two decades (e.g., Intel SGX, ARM TrustZone), but inefficiencies are hindering their adoption.
Secure processors use secure memory to detect and guard against physical attacks. Secure memory assumes that everything within the chip boundary is trusted and provides confidentiality and integrity verification for data in memory. Both of these features, confidentiality and integrity, require large metadata structures which are stored in memory. When a system equipped with secure memory misses at the last-level cache (LLC), the memory controller has to issue additional memory requests to fetch the corresponding metadata from memory. These additional memory requests increase delay and energy. The main goal of this dissertation is to reduce the overheads of secure memory in two dimensions: delay and energy.
First, to reduce the delay overhead we propose the first safe speculative integrity verification mechanism, PoisonIvy, that effectively hides the integrity verification latency while maintaining security guarantees. Secure memory has high delay overheads due to the long integrity verification latency. Speculation allows the system to return decrypted data back to the processor before the integrity verification completes, effectively removing the integrity verification latency from the critical path of a memory access. However, speculation without any other mechanism to safeguard security is unsafe. PoisonIvy safeguards security guarantees by preventing any effect of unverified data from leaving the trusted boundary. PoisonIvy is able to realize all the benefits of speculative integrity verification while maintaining the same security guarantees as the non-speculative system.
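The poison-tracking idea can be illustrated with a small conceptual model, shown below: speculatively returned data is tagged as poisoned, anything computed from it inherits the tag, and no poisoned value may cross the trusted boundary until its integrity check resolves. This is a simplification for intuition only, not the PoisonIvy microarchitecture.

```python
# Conceptual model of speculative integrity verification with poison
# tracking. A simplification for intuition, not the PoisonIvy hardware.

class SpeculativeValue:
    def __init__(self, data, verified=False):
        self.data = data
        self.poisoned = not verified   # poison is set until verification completes

def combine(a, b, op):
    """Any value computed from a poisoned input is itself poisoned."""
    out = SpeculativeValue(op(a.data, b.data), verified=True)
    out.poisoned = a.poisoned or b.poisoned
    return out

def commit_off_chip(value):
    """Effects that cross the trusted boundary must wait for verification."""
    if value.poisoned:
        raise RuntimeError("stall: unverified (poisoned) data may not leave the chip")
    return value.data

def verification_complete(value, ok):
    """Called when the integrity check for the original load finishes."""
    if not ok:
        raise RuntimeError("integrity violation: squash speculative state")
    value.poisoned = False
```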
Speculation is effective in reducing the delay overhead, but it has no effect on reducing the number of additional memory accesses, which cause a large energy overhead. Secure memory metadata has unique memory access patterns that are not compatible with traditional cache designs. In the second part of this work, we provide the first in-depth study of metadata access patterns, MAPS, to help guide architects in designing more efficient cache architectures customized for secure memory metadata. Based on the unique characteristics of secure memory metadata observed in this in-depth analysis, in the third part of this work we explore the design space of efficient cache designs. We describe one possible design, Metadata Cache eXtension (MCX), which exploits the bimodal reuse distance distribution of metadata blocks to improve cache efficiency, thereby reducing the number of additional memory accesses. We also explore an LLC eviction policy suited to handling multiple types of blocks, to further improve the efficiency of caching metadata blocks on-chip.
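The reuse-distance analysis that motivates such a design can be sketched in a few lines: for each access in a metadata block trace, count how many distinct blocks were touched since the previous access to the same block. The code below is a generic stack-distance calculation over a hypothetical trace, not the MAPS methodology itself; it only shows the kind of measurement from which a bimodal distribution would be observed.

```python
# Generic reuse (stack) distance histogram over a metadata access trace.
# Illustrative of the analysis style only; the trace and names are made up.
from collections import OrderedDict, Counter

def reuse_distance_histogram(trace):
    """trace: iterable of metadata block addresses, in access order."""
    stack = OrderedDict()        # most recently used block is last
    histogram = Counter()
    for addr in trace:
        if addr in stack:
            # Distance = number of distinct blocks touched since the last use.
            keys = list(stack.keys())
            histogram[len(keys) - 1 - keys.index(addr)] += 1
            del stack[addr]
        else:
            histogram["cold"] += 1
        stack[addr] = True       # move (or insert) at the MRU position
    return histogram

if __name__ == "__main__":
    # Hypothetical trace: counter blocks reused quickly, tree root reused late.
    trace = ["ctr0", "ctr1", "ctr0", "root", "ctr1", "ctr0", "root"]
    print(reuse_distance_histogram(trace))
```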
Item Open Access: In-Memory Computing Architecture for Deep Learning Acceleration (2020). Chen, Fan.
The ever-increasing demands of deep learning applications, especially the more powerful but intensive unsupervised deep learning models, overwhelm the computation, communication, and storage capabilities of modern general-purpose CPUs and GPUs. To accommodate the memory and computing requirements, multi-core systems that make intensive use of accelerators are the future of computing. Such novel computing systems incur new challenges, including architectural support for model training in the accelerators, large cache demands for multi-core processors, and system performance, energy, and efficiency. In this thesis, I present my research addressing these challenges by leveraging emerging memory and logic devices, as well as advanced integration technologies. In the first work, I present ReGAN, the first training accelerator architecture for unsupervised deep learning. ReGAN follows the processing-in-memory strategy by leveraging the energy efficiency of resistive memory arrays for in-situ deep learning execution. I propose an efficient pipelined training procedure to reduce on-chip memory accesses. In the second work, I present ZARA to address the resource underutilization caused by a new operator, transposed convolution, used in unsupervised learning models. ZARA improves system efficiency through a novel computation deformation technique. In the third work, I present MARVEL, which targets improved power efficiency over previous resistive accelerators. MARVEL leverages monolithic 3D integration technology by stacking multiple layers of low-power analog/digital conversion circuits implemented with carbon nanotube field-effect transistors. The area-consuming eDRAM buffers are replaced by dense cross-point spin-transfer torque magnetic RAM. I explore the design space and demonstrate that MARVEL provides further improved power efficiency as the number of integration layers increases. In the last piece of work, I propose the first holistic solution for employing skyrmion racetrack memory as the last-level cache in future high-capacity cache designs. I first present a cache architecture and a physical-to-logical mapping scheme based on a comprehensive analysis of the working mechanism of skyrmion racetrack memory. I then model the impact of process variations and propose a process-variation-aware data management technique to minimize the performance degradation they incur.
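A functional sketch helps show why ReRAM crossbars suit these workloads: with weights stored as cell conductances and inputs applied as word-line voltages, each bit line accumulates a dot product in the analog domain, so a matrix-vector product takes a single read. The model below captures only that mapping; it ignores ADC resolution, device non-idealities, and the pipelining and scheduling contributions described in the abstract.

```python
# Functional model of an analog crossbar matrix-vector multiply:
# I_j = sum_i G[i][j] * V[i]. Digital emulation for intuition only.

def crossbar_mvm(conductance, voltages):
    """conductance: rows x cols matrix of cell conductances (weights);
    voltages: one input per row (activations). Returns per-column currents."""
    rows, cols = len(conductance), len(conductance[0])
    assert len(voltages) == rows
    return [sum(conductance[i][j] * voltages[i] for i in range(rows))
            for j in range(cols)]

if __name__ == "__main__":
    weights = [[0.2, 0.5],
               [0.1, 0.4],
               [0.3, 0.0]]
    inputs = [1.0, 0.5, 2.0]
    print(crossbar_mvm(weights, inputs))   # -> [0.85, 0.7]
```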
Item Open Access: Integrating Computer Architecture and Coding Theory to Advance Emerging Memory Technologies (2020). Mappouras, Georgios.
New memory technologies constantly emerge, promising higher density and bandwidth, lower latency, and better power efficiency compared to traditional solutions. However, these technologies often suffer from substantial drawbacks, like limited lifetime or low fault tolerance. These drawbacks prevent the integration of these technologies into modern computer systems and increase their cost of implementation. In this work, we draw on solutions from both computer architecture and coding theory to address the drawbacks of emerging memory technologies. By integrating computer architecture and coding theory we can design more optimized solutions, paving the way for emerging memory technologies to become viable and reliable options for modern computer systems.
More specifically, we design MinWear codes to increase the lifetime of Flash memory, providing larger lifetime gains at smaller capacity costs compared to prior work. We also design GreenFlag codes to address shift errors in 3D racetrack memory, providing double shift-error detection and correction. Additionally, we enhance the fault tolerance of 3D-stacked DRAM with a two-level coding technique called Jenga. Jenga provides fault tolerance at all granularity levels of 3D-stacked DRAM with minimal performance overhead while outperforming prior solutions.
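As a loose illustration of why shift errors need dedicated codes, the toy sketch below frames each stored word with a known delimiter pattern and, on a read, locates the delimiters to infer how far the bits slid under the access port. This is a deliberately simplified stand-in for intuition only; it is not the GreenFlag construction (nor MinWear or Jenga), and real codes must also guard against data that mimics the delimiter.

```python
# Toy framing-based shift detection for a racetrack-style memory.
# Illustrative only; not the GreenFlag code.

DELIM = [1, 1, 0, 1]   # hypothetical delimiter pattern framing each word

def encode(word_bits):
    return DELIM + word_bits + DELIM

def decode(tape, word_len):
    """Scan for the framing delimiters; their observed position relative to
    the expected position (0) reveals the shift."""
    n = len(DELIM)
    for pos in range(len(tape) - (word_len + 2 * n) + 1):
        if tape[pos:pos + n] == DELIM and \
           tape[pos + n + word_len:pos + 2 * n + word_len] == DELIM:
            return tape[pos + n:pos + n + word_len], pos
    raise ValueError("delimiters not found: uncorrectable shift or corruption")

if __name__ == "__main__":
    word = [1, 0, 0, 1, 0, 1, 1, 0]
    tape = encode(word)
    over_shifted = [0, 0] + tape            # two extra domains slid under the port
    print(decode(over_shifted, len(word)))  # -> ([1, 0, 0, 1, 0, 1, 1, 0], 2)
```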
Item Open Access: Structures, Circuits and Architectures for Molecular Scale Integrated Sensing and Computing (2009). Pistol, Constantin.
Nanoscale devices offer the technological advances to enable a new era in computing. Device sizes at the molecular scale have the potential to expand the domain of conventional computer systems to reach into environments and application domains that are otherwise impractical, such as single-cell sensing or micro-environmental monitoring.
New potential application domains, like biological scale computing, require processing elements that can function inside nanoscale volumes (e.g. single biological cells) and are thus subject to extreme size and resource constraints. In this thesis we address these critical new domain challenges through a synergistic approach that matches manufacturing techniques, circuit technology, and architectural design with application requirements. We explore and vertically integrate these three fronts: a) assembly methods that can cost-effectively provide nanometer feature sizes, b) device technologies for molecular-scale computing and sensing, and c) architectural design techniques for nanoscale processors, with the goal of mapping a potential path toward achieving molecular-scale computing.
We make four primary contributions in this thesis. First, we develop and experimentally demonstrate a scalable, cost-effective DNA self-assembly-based fabrication technique for molecular circuits. Second, we propose and evaluate Resonance Energy Transfer (RET) logic, a novel nanoscale technology for computing based on single-molecule optical devices. Third, we design and experimentally demonstrate selective sensing of several biomolecules using RET-logic elements. Fourth, we explore the architectural implications of integrating computation and molecular sensors to form nanoscale sensor processors (nSP), nanoscale-sized systems that can sense, process, store and communicate molecular information. Through the use of self-assembly manufacturing, RET molecular logic, and novel architectural techniques, the smallest nSP design is about the size of the largest known virus.
Item Open Access: Verification-Aware Processor Design (2009). Lungu, Anita.
As technological advances enable computers to permeate many of our society's critical application domains (such as medicine, finance, and transportation), the requirement for computers to always behave correctly becomes critical as well. Currently, ensuring that processor designs are correct represents a major challenge for the computing industry, consuming the majority (up to 70%) of the resources allocated for the creation of a new processor. Looking towards the future, we see that with each new processor generation even more transistors fit on the same chip area and more complex designs become possible, which makes it unlikely that the difficulty of the design verification problem will decrease by itself.
We believe that the difficulty of the design verification problem is compounded by the current processor design flow. In most design cycles, a design's verifiability is not explicitly considered at an early stage, when decisions are most influential, because the initial focus is exclusively on improving the design along more traditional metrics like performance, power, and area. It is thus possible for the resulting design to be very difficult to verify in the end, precisely because its verifiability was not ranked high on the priority list in the beginning.
In this thesis we propose to view verifiability as a critical design constraint to be considered, together with other established metrics like performance and power, from the initial stages of design. Our high-level goal is for this approach to make designs more verifiable, which would both decrease the resources invested in the verification step and lead to more robust designs.
More specifically, we make five main contributions in this thesis. The first is our proposal for a change in design perspective towards considering verifiability as a first-class constraint. Second, we use formal verification (through a combination of theorem proving, model checking, and probabilistic model checking) to quantitatively evaluate the impact on verifiability of various design choices, such as the organization of caches, TLBs, the pipeline, the operand bypass network, and dynamic power management mechanisms. Our third contribution is to evaluate design trade-offs between verifiability and other established metrics, like performance and power, in the context of multi-core dynamic power management schemes. Fourth, we re-design several components to increase their verifiability. Finally, we propose design guidelines for increasing verifiability. In the context of single-core processors, our guidelines address the organization of caches and translation lookaside buffers (TLBs), the depth of the core's pipeline, and the type of ALUs used, while for multi-core processors they address dynamic power management schemes (DPMs) for power capping.
Our results confirm that making design choices with verifiability as a first class design constraint has the capacity to decrease the verification effort. Furthermore, making explicit trade-offs between verifiability, performance and power helps identify better design points for given verification, performance, and power goals.
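For readers unfamiliar with the formal-verification side of this methodology, the sketch below shows the flavor of an explicit-state analysis: exhaustively enumerate the reachable states of a toy dynamic power management controller and check that a power cap is never exceeded, with the size of the reachable state space serving as a crude proxy for verification effort. The controller and its parameters are hypothetical, not one of the DPM schemes studied in the dissertation.

```python
# Explicit-state exploration of a toy power-capping DPM controller.
# Hypothetical model for intuition only; not the dissertation's schemes.
from collections import deque
from itertools import product

POWER = {"low": 1, "mid": 2, "high": 4}   # per-core power by P-state
CAP = 6
CORES = 3

def successors(state):
    """Each core may step its P-state up or down by one level, but the
    controller rejects moves that would exceed the power cap."""
    levels = ["low", "mid", "high"]
    for core, delta in product(range(CORES), (-1, +1)):
        idx = levels.index(state[core]) + delta
        if 0 <= idx < len(levels):
            nxt = list(state)
            nxt[core] = levels[idx]
            if sum(POWER[s] for s in nxt) <= CAP:
                yield tuple(nxt)

def explore(initial):
    """Breadth-first enumeration of all reachable states, checking the
    power-cap invariant in every state visited."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        state = frontier.popleft()
        assert sum(POWER[s] for s in state) <= CAP, "cap violated"
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

if __name__ == "__main__":
    reachable = explore(("low", "low", "low"))
    print(len(reachable), "reachable states, cap never exceeded")
```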