Browsing by Author "Lebeck, Alvin R"
Item Open Access: A Molecular-scale Programmable Stochastic Process Based On Resonance Energy Transfer Networks: Modeling And Applications (2016). Wang, Siyang.

While molecular and cellular processes are often modeled as stochastic processes, such as Brownian motion, chemical reaction networks, and gene regulatory networks, there have been few attempts to program a molecular-scale process to physically implement stochastic processes. DNA has been used as a substrate for programming molecular interactions, but its applications are restricted to deterministic functions, and unfavorable properties such as slow processing, thermal annealing, aqueous solvents, and difficult readout limit them to proof-of-concept demonstrations. To date, whether there exists a molecular process that can be programmed to implement stochastic processes for practical applications remains unknown.
In this dissertation, a fully specified Resonance Energy Transfer (RET) network between chromophores is accurately fabricated via DNA self-assembly, and the exciton dynamics in the RET network physically implement a stochastic process, specifically a continuous-time Markov chain (CTMC), which has a direct mapping to the physical geometry of the chromophore network. Excited by a light source, a RET network generates random samples in the temporal domain in the form of fluorescence photons, which can be detected by a photon detector. The intrinsic sampling distribution of a RET network is derived as a phase-type distribution configured by its CTMC model. The conclusion is that the exciton dynamics in a RET network implement a general and important class of stochastic processes that can be directly and accurately programmed and used for practical applications in photonics and optoelectronics. RET networks can be used in several ways, with a broad range of potential applications. As an entropy source that can directly generate samples from virtually arbitrary distributions, RET networks can benefit applications that rely on generating random samples, such as 1) fluorescent taggants and 2) stochastic computing.
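As a rough illustration of the CTMC view described above, the following sketch simulates exciton hops between chromophore sites until a photon is emitted; the resulting dwell times follow a phase-type distribution. The site count and transition rates here are hypothetical placeholders, not values from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CTMC over three chromophore sites plus an absorbing
# "photon emitted" state. Entry [i, j] is the transition rate i -> j (1/ns);
# real rates would be determined by the RET network geometry.
rates = np.array([
    [0.0, 2.0, 0.5, 0.1],   # site 0
    [1.0, 0.0, 3.0, 0.2],   # site 1
    [0.2, 0.8, 0.0, 4.0],   # site 2
    [0.0, 0.0, 0.0, 0.0],   # absorbing emission state
])

def sample_emission_time(start=0):
    """Simulate one exciton: hop between sites until absorption,
    returning the total dwell time (a phase-type sample)."""
    state, t = start, 0.0
    while rates[state].sum() > 0.0:          # not yet absorbed
        total = rates[state].sum()
        t += rng.exponential(1.0 / total)    # exponential dwell time in this state
        state = rng.choice(len(rates), p=rates[state] / total)
    return t

samples = [sample_emission_time() for _ in range(10_000)]
print(f"mean emission time ~ {np.mean(samples):.3f} ns")
```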
Using RET networks between chromophores to implement fluorescent taggants with temporally coded signatures frees the taggant design from the constraint of spectrally resolvable dyes and yields a significantly larger coding capacity than spectrally or lifetime coded fluorescent taggants. Meanwhile, the taggant detection process becomes highly efficient, and Maximum Likelihood Estimation (MLE) based taggant identification guarantees high accuracy even with only a few hundred detected photons.
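The MLE-based identification step can be pictured as scoring the detected photon arrival times against each candidate taggant's emission-time model and picking the best scorer. The sketch below uses simple exponential decay models as stand-ins for the actual temporal signatures; the taggant names and rates are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical library: each taggant is summarized by an exponential
# decay rate (1/ns); a real system would use the full phase-type density.
taggant_rates = {"taggant_A": 0.5, "taggant_B": 1.0, "taggant_C": 2.0}

def log_likelihood(photon_times, rate):
    # Exponential density: f(t) = rate * exp(-rate * t)
    return np.sum(np.log(rate) - rate * photon_times)

# A few hundred detected photons from the (unknown) true taggant B.
observed = rng.exponential(1.0 / taggant_rates["taggant_B"], size=300)

best = max(taggant_rates, key=lambda name: log_likelihood(observed, taggant_rates[name]))
print("identified:", best)
```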
In addition, RET-based sampling units (RSUs) can be constructed to accelerate probabilistic algorithms with wide applications in machine learning and data analytics. Because probabilistic algorithms often rely on iteratively sampling from parameterized distributions, they can be inefficient in practice on the deterministic hardware of traditional computers, especially for high-dimensional and complex problems. As an efficient universal sampling unit, the proposed RSU can be integrated into a CPU or GPU as a specialized functional unit, or organized as a discrete accelerator, to deliver substantial speedups and power savings.
Item Open Access: Accelerating Data Parallel Applications via Hardware and Software Techniques (2020). Bashizade, Ramin.

The unprecedented amount of data available today opens the door to many new applications in areas such as finance, scientific simulation, and machine learning. Many of these applications perform the same computations on different data and are therefore called data-parallel. However, processing this enormous amount of data is challenging, especially in the post-Moore's-law era. Specialized accelerators are a promising way to meet the performance requirements of data-parallel applications. Among them are graphics processing units (GPUs), as well as more application-specific solutions.
One area with high performance requirements is statistical machine learning, which has widespread applications in various domains. These methods include probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), which rely on generating random numbers from probability distributions. These algorithms are computationally expensive on conventional processors, yet their statistical properties, namely interpretability and uncertainty quantification, make them an attractive alternative to deep learning. Hardware specialization can therefore be adopted to address the shortcomings of conventional processors in running these applications.
In addition to hardware techniques, probabilistic algorithms can benefit from algorithmic optimizations that avoid performing unnecessary work. Specifically, we can skip a random variable (RV) whose probability distribution function (PDF) is concentrated on only one value, i.e., there is only one value to choose, as long as the values of its neighboring RVs have not changed. In other words, if an RV has a concentrated PDF, its PDF will remain concentrated until at least one of its neighbors changes. Due to their high throughput and centralized scheduling mechanism, GPUs are a suitable target for this optimization.
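A minimal sketch of this skip rule follows, assuming a user-supplied conditional-distribution function and neighbor lists (both hypothetical interfaces, not the dissertation's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_with_skips(conditional, neighbors, init, sweeps=100, tol=0.999):
    """Sketch of the skip rule: a variable is skipped when its conditional PDF
    put (nearly) all mass on one value at its last update and none of its
    neighbors have changed since then.
    `conditional(v, x)` must return a NumPy array of probabilities over v's support."""
    x = dict(init)
    concentrated = {v: False for v in x}   # was the PDF concentrated last time?
    changed = {v: True for v in x}         # did the variable change last sweep?
    for _ in range(sweeps):
        new_changed = {v: False for v in x}
        for v in x:
            if concentrated[v] and not any(changed[n] for n in neighbors[v]):
                continue                    # skip: nothing can have moved its PDF
            pdf = conditional(v, x)
            new_val = rng.choice(len(pdf), p=pdf)
            concentrated[v] = pdf.max() >= tol
            if new_val != x[v]:
                x[v], new_changed[v] = new_val, True
        changed = new_changed
    return x
```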
Beyond probabilistic algorithms, GPUs can be used to accelerate a variety of applications. With their Single-Instruction Multiple-Thread (SIMT) execution model, GPUs offer massive parallelism combined with relative ease of programming. The large amount and diversity of resources on a GPU are intended to ensure that applications with different characteristics can achieve high performance, but they also mean that some of these resources remain under-utilized, which is inefficient in a multi-tenant environment.
In this dissertation, we propose and evaluate solutions to the challenges mentioned above, namely i) accelerating probabilistic algorithms with uncertainty quantification, ii) optimizing probabilistic algorithms on GPUs to avoid unnecessary work, and iii) increasing resource utilization of GPUs in multi-tenant environments.
Item Open Access: Accelerating Probabilistic Computing with a Stochastic Processing Unit (2020). Zhang, Xiangyu.

Statistical machine learning has become a more important workload for computing systems than ever before. Probabilistic computing is a popular approach in statistical machine learning that solves problems by iteratively generating samples from parameterized distributions. As an alternative to deep neural networks, probabilistic computing provides conceptually simple, compositional, and interpretable models. However, probabilistic algorithms are often considered too slow on conventional processors due to the sampling overhead of 1) computing the parameters of a distribution and 2) generating samples from the parameterized distribution. A specialized architecture is needed to address both aspects.
In this dissertation, we claim a specialized architecture is necessary and feasible to efficiently support various probabilistic computing problems in statistical machine learning, while providing high-quality and robust results.
We start by exploring a probabilistic architecture that accelerates Markov Random Field (MRF) Gibbs sampling by exploiting the quantum randomness of optical-molecular devices: Resonance Energy Transfer (RET) networks. We provide a macro-scale prototype, the first such system to our knowledge, to experimentally demonstrate the capability of RET devices to parameterize a distribution and run a real application. Through a quantitative analysis of result quality, we further reveal design issues in an existing RET-based probabilistic computing unit (1st-gen RSU-G) that lead to unsatisfactory result quality in some applications. By exploring the design space, we propose a new RSU-G microarchitecture that empirically achieves the same result quality as 64-bit floating-point software, with the same area and modest power overheads compared with the 1st-gen RSU-G. An efficient stochastic sampling unit can thus be realized using RET devices.
The RSU-G provides high-quality true random number generation (RNG). We further explore how the quality of an RNG relates to application end-point result quality. Unexpectedly, we discover that the target applications do not necessarily require high-quality RNGs: a simple 19-bit Linear-Feedback Shift Register (LFSR) does not degrade end-point result quality in the tested applications. Therefore, we propose a Stochastic Processing Unit (SPU) with a simple pseudo-RNG that is functionally equivalent to RSU-G while retaining the benefits of a CMOS digital circuit.
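For illustration, a 19-bit Fibonacci LFSR of the kind referenced above can be sketched as follows; the tap positions shown are one commonly used maximal-length choice and are not necessarily the ones used in the SPU.

```python
def lfsr19(seed, taps=(19, 18, 17, 14)):
    """Fibonacci-style 19-bit LFSR. The tap set here is one commonly used
    maximal-length choice (x^19 + x^18 + x^17 + x^14 + 1); the dissertation's
    exact taps are not specified in this abstract."""
    state = seed & ((1 << 19) - 1)
    assert state != 0, "LFSR state must be nonzero"
    while True:
        # XOR the tapped bits to form the feedback bit.
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & ((1 << 19) - 1)
        yield state

gen = lfsr19(seed=0x1ACE5)
print([hex(next(gen)) for _ in range(5)])
```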
The above results raise a subsequent question: can we be confident in using a probabilistic accelerator with various approximation techniques, even though the end-point result quality ("accuracy") is good on the tested benchmarks? We find that current methodologies for evaluating the correctness of probabilistic accelerators are often incomplete, mostly focusing on end-point result quality ("accuracy") while omitting other important statistical properties. Therefore, we claim that a probabilistic architecture should provide some measure (or guarantee) of statistical robustness. We take a first step toward defining metrics and a methodology for quantitatively evaluating the correctness of probabilistic accelerators. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostics, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator (the SPU) and surface design issues that cannot be exposed using application end-point result quality alone. Finally, we demonstrate the benefits of this framework in guiding design space exploration with a case study showing that statistical robustness comparable to floating-point software can be achieved with limited precision, avoiding floating-point hardware overheads.
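As a rough illustration of the last two pillars, the sketch below computes a generic Gelman-Rubin style convergence diagnostic and a chi-square goodness-of-fit test; these are standard statistical tools standing in for, not reproducing, the dissertation's metrics.

```python
import numpy as np
from scipy import stats

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) across parallel chains;
    values near 1 suggest convergence."""
    chains = np.asarray(chains)            # shape: (m chains, n samples)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

def goodness_of_fit(samples, expected_probs):
    """Chi-square test of sampled frequencies against the target distribution."""
    counts = np.bincount(samples, minlength=len(expected_probs))
    expected = expected_probs * len(samples)
    return stats.chisquare(counts, expected)

rng = np.random.default_rng(3)
chains = rng.normal(size=(4, 1000))
print("R-hat:", round(gelman_rubin(chains), 3))
print(goodness_of_fit(rng.integers(0, 4, 1000), np.full(4, 0.25)))
```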
Item Open Access: Architectures for Memristor-based Storage Structures (2011). Liu, Yang.

Rapid data growth makes it increasingly critical to reduce search time in order to improve the performance of search-intensive applications. However, huge data sizes make it harder to perform search operations efficiently. Representative conventional approaches to reducing search time, such as content-addressable memory (CAM) and in-memory databases, are no longer efficient in the face of this data explosion: CMOS-based CAM has low capacity that cannot be increased through CMOS scaling, and in-memory databases degrade in performance as data size increases. As a result, we have to exploit emerging nanotechnologies to accelerate search.
Among emerging nanotechnologies, memristors have become promising candidates for building storage structures because of their high capacity, short switching time, and low power consumption. However, the benefit we can obtain from these storage structures is limited by the low endurance of memristors. In order to utilize the computation ability of memristors and deal with the endurance problem, we explore the design space of memristor-based storage structures.
We first propose MemCAM/MemTCAM, a configurable memristor-based CAM/TCAM design in which we use memristors as both memory latches and logic gates. The computation ability of memristors makes it possible to perform range searches, and their high density provides an opportunity to build MemCAM/MemTCAM with large capacity in a small area. We use SPICE to model the memristor and analyze power and performance at different temperatures. The results show that it is feasible to build MemCAM and MemTCAM structures that have high capacity and reduce total search time and energy consumption for search-intensive applications with large data sets.
We then propose four hybrid memristor-based storage structures, Hash-CAM, T-tree-CAM, TB+-tree, and TB+-tree-CAM, to address the endurance problem. We use an analytical model to evaluate and compare the performance and lifetime of two software-implemented memory-based T-trees and these four hybrid storage structures. The results show that the hybrid storage structures can exploit range search abilities, achieve better performance than memory-based T-trees, and improve lifetime from minutes to longer than 60 years. Furthermore, TB+-tree-CAM, a hybrid memristor-based storage structure combining a T-tree, a B+-tree, and CAM, balances performance and lifetime and can outperform the other storage structures when both are taken into consideration.
Item Open Access: Devices and Circuit Design Strategies for Building Scalable Integrated Molecular Circuits (2017). LaBoda, Craig.

Resonance energy transfer (RET) logic provides a method for building integrated molecular circuits that use self-assembled networks of fluorescent molecules to perform computation. The unique operating principles and materials of these circuits make them suitable candidates for introducing computation into domains that are incompatible with conventional silicon-based systems. To realize the full potential of this technology, however, a variety of technical challenges that currently prevent the design of larger, more complex systems must be overcome. Two primary challenges are energy loss and exciton loss. Energy loss forces the outputs of RET devices to be red-shifted from their inputs, which prevents most independently designed RET components from being cascaded with one another. Exciton loss weakens the output signals of RET devices, making it difficult to observe computational results. Together, these forms of signal degradation constrain both the size and topology of RET systems.
This work explores new RET devices and circuit design strategies that address the above limitations. The primary contributions of this dissertation are threefold. First, a RET device capable of restoring energy in these systems is designed and experimentally demonstrated. This device enables cascading and feedback in RET logic, two circuit design concepts commonly used in large-scale digital systems. Second, a new style of RET logic design, called Pre-Charge Logic (PCL), is introduced. PCL addresses both forms of signal loss while simultaneously providing a library of cascadable RET logic gates, many of which cannot be implemented using previous methods of RET logic design. Third, the design methods of PCL are explored and validated by simulation and experimental demonstration. Continuous-time Markov chain modeling confirms that the proposed PCL devices perform their intended Boolean operations, while an experimental demonstration of a PCL PASS gate substantiates the underlying operating principles of these devices. Collectively, these contributions pave the way for developing larger, more complex RET systems in the future.
Item Open Access: Enhancing Transactional Key-Value Storage Systems in Datacenters using Precise Clocks and Software-Defined Storage (2019). Misra, Pulkit Ambikanandan.

Transactional key-value storage is an important service offered by cloud providers for building applications (e.g., Amazon DynamoDB, Microsoft CosmosDB, Google Spanner). This type of service is popular because it provides high-level guarantees such as consistency, scalability, and fault tolerance that ease application development and deployment on the cloud. Unfortunately, providing high performance without high complexity is difficult for transactional key-value storage systems in datacenters, because of the sophisticated protocols that provide these high-level guarantees (e.g., transactions and replication) and the overheads incurred by traversing multiple abstraction layers.
We leverage two emerging datacenter capabilities --- precise synchronized clocks and software-defined storage --- to address the performance and complexity challenges with transactional key-value storage systems in datacenters. To this end, we use a cross-layer approach that investigates all levels of the storage stack, from developer APIs to underlying hardware. We show that this methodology opens avenues for synergistic interactions between software and the underlying hardware, and leads to simpler system designs with better performance.
This dissertation presents four systems: Semel, Milana, Kairos, and SkimpyFTL. Semel is a multi-version key-value storage system that exploits the remap-on-write property of flash-based solid-state drives for device-integrated multi-versioning and uses a simplified, unordered (inconsistent) replication protocol for fault tolerance. Milana supports serializable ACID transactions over Semel using an enhanced Optimistic Concurrency Control protocol that leverages precisely synchronized intra-datacenter clocks to reduce the transaction abort rate and enable local validation of read-only transactions. Kairos builds on Milana and adds support for inter-transaction caching and sharded transaction validation; cache consistency in Kairos is based on a simple, stateless, time-to-live protocol with leases, without tracking sharers or sending invalidations as directory-based cache consistency protocols do. Finally, SkimpyFTL builds on Semel and adds support for memory-efficient data indexing in flash-based key-value storage systems.
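The lease-based consistency idea in Kairos can be pictured with a small sketch: a cached entry is served only while its lease (measured against the synchronized clock) is still valid, so no sharer lists or invalidation messages are needed. The class names, lease length, and store interface below are hypothetical.

```python
import time
from dataclasses import dataclass

LEASE_SECONDS = 0.010   # hypothetical lease length; real values depend on clock error bounds

@dataclass
class CachedEntry:
    value: bytes
    lease_expiry: float   # absolute time from the synchronized clock

class LeaseCache:
    """Sketch of stateless TTL/lease-based consistency: readers use a cached
    value only while its lease is valid; no sharer tracking or invalidations."""
    def __init__(self, store):
        self.store = store            # authoritative key-value store (assumed interface)
        self.cache = {}

    def get(self, key):
        entry = self.cache.get(key)
        now = time.time()             # stand-in for the precise synchronized clock
        if entry is not None and now < entry.lease_expiry:
            return entry.value        # lease still valid: safe to serve locally
        value = self.store.get(key)   # lease expired or missing: refresh from the store
        self.cache[key] = CachedEntry(value, now + LEASE_SECONDS)
        return value
```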
Item Open Access: Harnessing Data Parallel Hardware for Server Workloads (2015). Agrawal, Sandeep R.

Increasing web traffic demands higher server throughput while preserving energy efficiency and total cost of ownership. Present work on optimizing data center efficiency focuses primarily on general-purpose processors; however, these might not be the most efficient platforms for server workloads. Data parallel hardware achieves high energy efficiency by amortizing instruction costs across multiple data streams, and high throughput by enabling massive parallelism across independent threads. These benefits are traditionally considered applicable to scientific workloads, and common server tasks like page serving or search are considered unsuitable for a data parallel execution model.
Our work builds on the observation that server workload execution patterns are not completely unique across multiple requests. For a high enough arrival rate, a server has the opportunity to launch cohorts of similar requests on data parallel hardware, improving server performance and power/energy efficiency. We present Rhythm, a framework for high-throughput servers that exploits similarity across requests by launching data parallel executions for request cohorts. An implementation of the SPECWeb Banking workload using Rhythm on NVIDIA GPUs provides a basis for evaluation.
Similarity search is another ubiquitous server workload that involves identifying the nearest neighbors to a given query across a large number of points. We explore the performance, power and dollar benefits of using accelerators to perform similarity search for query cohorts in very high dimensions under tight deadlines, and demonstrate an implementation on GPUs that searches across a corpus of billions of documents and is significantly cheaper than commercial deployments. We show that with software and system modifications, data parallel designs can greatly outperform common task parallel implementations.
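The benefit of query cohorts can be illustrated with a small sketch: batching queries turns many independent scans into one dense matrix product, exactly the shape of work data parallel hardware is built for. NumPy stands in here for the GPU kernels; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical corpus and query cohort; on a GPU the same matrix product
# would be a single batched kernel launch.
corpus = rng.normal(size=(100_000, 128)).astype(np.float32)   # documents x dims
queries = rng.normal(size=(64, 128)).astype(np.float32)       # one cohort of queries

def cohort_topk(queries, corpus, k=10):
    """Batched nearest-neighbor search by inner-product similarity."""
    scores = queries @ corpus.T                          # (num_queries, num_docs)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]    # unordered top-k candidates
    # Sort the k candidates per query by descending score.
    order = np.take_along_axis(-scores, topk, axis=1).argsort(axis=1)
    return np.take_along_axis(topk, order, axis=1)

print(cohort_topk(queries, corpus).shape)   # (64, 10)
```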
Item Open Access: Improving Congestion Control Convergence in RDMA Networks (2022). Snyder, John.

Remote Direct Memory Access (RDMA) networks are becoming a popular interconnect technology for high-performance communication in distributed systems. While RDMA hardware enables high-bandwidth, low-latency networking, networks require congestion control algorithms to operate efficiently. Ideally, a congestion control algorithm allows computers in the network to inject enough traffic to keep the network fully utilized, stops computers from causing congestion, and allocates bandwidth fairly to all computers in the network. This lets the network operate at peak performance and be fair to all users. While many protocols eventually converge to this ideal network state, they often take too long, reducing performance. We develop mechanisms and protocols that improve convergence time.
In this thesis, we identify several ways in which slow convergence to the ideal network state harms performance and leaves RDMA networks susceptible to performance isolation attacks. We show that slow convergence to fair injection rates at end-hosts in RDMA networks greatly increases the communication time for long flows, which leads to sub-optimal application performance. We identify why unfairness occurs and measure its impact on application performance. We then show that because RDMA networks are lossless and flows start sending packets at line rate, users can unfairly gain more bandwidth and sometimes ignore congestion control altogether. This allows misbehaving users to harm the performance of other users sharing the network.
To improve long flow performance in RDMA networks, we propose two new mechanisms for Additive-Increase Multiplicative-Decrease (AIMD) protocols: 1) Variable Additive Increase and 2) Sampling Frequency. These mechanisms reduce the time it takes for the network to allocate bandwidth fairly between end-hosts, so long flows are not starved of bandwidth. To create these mechanisms, we determine when unfairness occurs and how end-hosts can infer it without any additional information from switches.
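The flavor of a variable additive-increase rule can be sketched as follows; the specific step scaling and the congestion and fair-share signals are illustrative assumptions, not the mechanisms defined in the thesis.

```python
# Hedged sketch of an AIMD rate update with a variable additive-increase step.
LINE_RATE_GBPS = 100.0
BASE_STEP_GBPS = 0.1
BETA = 0.5   # multiplicative-decrease factor

def update_rate(rate, congested, below_fair_share_estimate):
    if congested:
        return max(rate * BETA, BASE_STEP_GBPS)     # multiplicative decrease
    step = BASE_STEP_GBPS
    if below_fair_share_estimate:
        step *= 4                                   # grow faster when starved (illustrative)
    return min(rate + step, LINE_RATE_GBPS)         # additive increase

rate = 10.0
for congested, starved in [(False, True), (False, True), (True, False), (False, False)]:
    rate = update_rate(rate, congested, starved)
    print(round(rate, 2))
```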
We then introduce One Round Trip Time Convergence (1RC) and a new method of setting flow weights, which improve performance and isolation in RDMA networks. 1RC enables a network to converge to fair rates during the first RTT by dropping packets when a flow uses too much bandwidth. We do this while maintaining packet ordering and fairness between flows that start sending packets at the same time. We then use a new weighting scheme, which decreases the bandwidth allocation to users opening too many connections. A lower weight for misbehaving users mitigates the impact of a user trying to gain more bandwidth than is fair.
Finally, we introduce the Collective Congestion Control Protocol (3CPO), which improves convergence in multicast networks designed for collective communication. Multicast operations send a single packet to several destinations at once and can quickly cause severe congestion in the network. 3CPO manages multicast congestion by inferring global congestion state through multicast operations and then tuning each end-host's injection rate to be near optimal. 3CPO requires no extra information from switches and works entirely on end-hosts.
Item Open Access: mNoC: Large Nanophotonic Network-on-Chip Crossbars with Molecular Scale Devices (ACM Journal on Emerging Technologies in Computing Systems, 2015-07). Pang, Jun; Dwyer, Christopher; Lebeck, Alvin R.

Item Open Access: Multi-version Indexing in Flash-based Key-Value Stores (2019-12-02). Misra, Pulkit A; Chase, Jeffrey S; Gehrke, Johannes; Lebeck, Alvin R.

Item Open Access: Practical Dynamic Information-Flow Tracking on Mobile Devices (2014). Pistol, Ion Valentin.

Today's consumer mobile platforms such as Android and iOS manage large ecosystems of untrusted third-party applications. It is common for an application to request one or more types of sensitive data. Unfortunately, users have no insight into how their data is used. Given the sensitivity of the data accessible by these applications, it is paramount that mobile operating systems prevent apps from leaking it.
This dissertation shows that it is possible to improve the soundness of dynamic information-flow tracking on a mobile device without sacrificing precision, performance, or transparency. We extend the state of the art in dynamic information-flow tracking on Android and address two major limitations: quantifying implicit flow leaks in Dalvik bytecode and tracking explicit flows in native code. Our goal is to deliver seamless end-to-end taint tracking across Dalvik bytecode and native code.
We propose SpanDex, a system that quantifies implicit flow leaks in Dalvik bytecode for apps handling password data. SpanDex computes a bound on the tainted data revealed by recording control-flow dependencies and, for each password character, tracking the set of values that could have been inferred. We also propose TaintTrap, a taint tracking system for native code in third-party apps. We explore native taint tracking performance bottlenecks and hardware acceleration techniques to improve instrumentation performance.
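The per-character value-set idea can be illustrated with a small sketch: each tainted password character keeps the set of values still consistent with the branches taken on it, which bounds how many bits an implicit flow has revealed. The class and method names below are hypothetical, not SpanDex's API.

```python
import math
import string

class CharTaint:
    """Hypothetical sketch of implicit-flow accounting: each tainted password
    character carries the set of values still consistent with the branches
    the program has taken on it."""
    def __init__(self, alphabet=string.printable):
        self.candidates = set(alphabet)

    def observe_branch(self, predicate, taken):
        # Narrow the candidate set to values consistent with the branch outcome.
        self.candidates = {c for c in self.candidates if predicate(c) == taken}

    def bits_revealed(self, alphabet_size=len(string.printable)):
        # Entropy lost relative to a uniform prior over the alphabet.
        return math.log2(alphabet_size) - math.log2(max(len(self.candidates), 1))

# Example: the app branches on "is the character a digit?" and takes the branch.
t = CharTaint()
t.observe_branch(str.isdigit, taken=True)
print(f"{t.bits_revealed():.2f} bits revealed")   # log2(100/10) ~ 3.32
```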
Item Open Access: Resonance Energy Transfer-Based Molecular Switch Designed Using a Systematic Design Process Based on Monte Carlo Methods and Markov Chains (2016). Rallapalli, Arjun.

A RET network consists of a network of photo-active molecules called chromophores that can participate in inter-molecular energy transfer called resonance energy transfer (RET). RET networks are used in a variety of applications including cryptographic devices, storage systems, light harvesting complexes, biological sensors, and molecular rulers. In this dissertation, we focus on creating a RET device called the closed-diffusive exciton valve (C-DEV), in which the input-to-output transfer function is controlled by an external energy source, much as in a semiconductor transistor such as the MOSFET. Because of their biocompatibility, molecular devices like the C-DEV can be used to introduce computing power into biological, organic, and aqueous environments such as living cells. Furthermore, the underlying physics of RET devices is stochastic in nature, making them suitable for stochastic computing, in which true random distribution generation is critical.
In order to determine a valid configuration of chromophores for the C-DEV, we developed a systematic process based on user-guided design space pruning techniques and built-in simulation tools. We show that our C-DEV is 15x better than C-DEVs designed using ad hoc methods that rely on limited data from prior experiments. We also show ways in which the C-DEV can be improved further and how different varieties of C-DEVs can be combined to form more complex logic circuits. Moreover, the systematic design process can be used to search for valid chromophore network configurations for a variety of RET applications.
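A Monte Carlo design-space pruning loop of the general kind described above might look like the following sketch, where the configuration sampler and simulator are toy stand-ins for the dissertation's chromophore models and built-in tools.

```python
import numpy as np

rng = np.random.default_rng(5)

def prune_design_space(sample_config, simulate, n_candidates=1000, keep_frac=0.05):
    """Hedged sketch of Monte Carlo design-space pruning: randomly sample
    candidate chromophore configurations, score each with a simulator, and
    keep only the best fraction for further (user-guided) refinement."""
    candidates = [sample_config(rng) for _ in range(n_candidates)]
    scores = np.array([simulate(c) for c in candidates])
    keep = scores.argsort()[::-1][: int(n_candidates * keep_frac)]
    return [candidates[i] for i in keep], scores[keep]

# Toy stand-ins: a "configuration" is a set of chromophore spacings, and the
# "simulator" favors spacings near a hypothetical optimum of 5 nm.
toy_sample = lambda rng: rng.uniform(2.0, 10.0, size=4)
toy_score = lambda cfg: -np.abs(cfg - 5.0).sum()
best, best_scores = prune_design_space(toy_sample, toy_score)
print(len(best), round(best_scores[0], 3))
```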
We also describe a feasibility study for a technique used to control the orientation of chromophores attached to DNA. Being able to control the orientation can expand the design space for RET networks because it provides another parameter to tune their collective behavior. While results showed limited control over orientation, the analysis required the development of a mathematical model that can be used to determine the distribution of dipoles in a given sample of chromophore constructs. The model can be used to evaluate the feasibility of other potential orientation control techniques.
Item Open Access: Structures, Circuits and Architectures for Molecular Scale Integrated Sensing and Computing (2009). Pistol, Constantin.

Nanoscale devices offer technological advances that could enable a new era in computing. Molecular-scale device sizes have the potential to expand the domain of conventional computer systems into environments and application domains that are otherwise impractical, such as single-cell sensing or micro-environmental monitoring.
New potential application domains, such as biological-scale computing, require processing elements that can function inside nanoscale volumes (e.g., single biological cells) and are thus subject to extreme size and resource constraints. In this thesis, we address these critical new domain challenges through a synergistic approach that matches manufacturing techniques, circuit technology, and architectural design with application requirements. We explore and vertically integrate three fronts: a) assembly methods that can cost-effectively provide nanometer feature sizes, b) device technologies for molecular-scale computing and sensing, and c) architectural design techniques for nanoscale processors, with the goal of mapping a potential path toward molecular-scale computing.
We make four primary contributions in this thesis. First, we develop and experimentally demonstrate a scalable, cost-effective DNA self-assembly-based fabrication technique for molecular circuits. Second, we propose and evaluate Resonance Energy Transfer (RET) logic, a novel nanoscale technology for computing based on single-molecule optical devices. Third, we design and experimentally demonstrate selective sensing of several biomolecules using RET-logic elements. Fourth, we explore the architectural implications of integrating computation and molecular sensors to form nanoscale sensor processors (nSP), nanoscale-sized systems that can sense, process, store and communicate molecular information. Through the use of self-assembly manufacturing, RET molecular logic, and novel architectural techniques, the smallest nSP design is about the size of the largest known virus.