Browsing by Author "Chen, Yiran"
Item Open Access Accelerator Architectures for Deep Learning and Graph Processing (2020) Song, Linghao
Deep learning and graph processing are two big-data applications that are widely applied in many domains. The training of deep learning models is essential for inference and has not yet been fully studied. With data forward, error backward, and gradient calculation, deep learning training is a more complicated process with higher computation and communication intensity. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. In this dissertation, I present AccPar, a principled and systematic method for determining the tensor partition across multiple heterogeneous accelerators for efficient training acceleration. Emerging resistive random access memory (ReRAM) is promising for processing in memory (PIM). For high-throughput training acceleration in ReRAM-based PIM accelerators, I present PipeLayer, an architecture for layer-wise pipelined parallelism. Graph processing is well known for poor locality and high memory bandwidth demand. In conventional architectures, graph processing incurs a significant amount of data movement and energy consumption. I present GraphR, the first ReRAM-based graph processing accelerator, which follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Sparse matrix-vector multiplication (SpMV), a subset of graph processing, is the key computation in iterative solvers for scientific computing. Efficiently accelerating floating-point processing in ReRAM, however, remains a challenge. In this dissertation, I present ReFloat, a data format and a supporting accelerator architecture for low-cost floating-point processing in ReRAM for scientific computing.
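The tensor-partitioning idea can be illustrated with a small sketch (illustrative only; AccPar's actual algorithm also models communication cost and chooses among partition dimensions): split a layer's output-channel dimension across heterogeneous accelerators in proportion to an assumed per-device throughput.

    import numpy as np

    def partition_output_channels(weight, throughputs):
        """Toy partition: split the output-channel axis of a layer's weight
        tensor across devices in proportion to each device's throughput."""
        c_out = weight.shape[0]
        shares = np.array(throughputs, dtype=float)
        shares = shares / shares.sum()
        cuts = np.round(np.cumsum(shares) * c_out).astype(int)[:-1]  # split points in channels
        return np.split(weight, cuts, axis=0)

    # A 64x32x3x3 convolution weight split across one fast and two slower accelerators.
    w = np.random.randn(64, 32, 3, 3)
    parts = partition_output_channels(w, throughputs=[4.0, 1.0, 1.0])
    print([p.shape[0] for p in parts])   # [43, 10, 11] output channels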
Item Open Access Advancing the Design and Utility of Adversarial Machine Learning Methods (2021) Inkawhich, Nathan Albert
While significant progress has been made to craft Deep Neural Networks (DNNs) with super-human recognition performance, their reliability and robustness in challenging operating conditions are still a major concern. In this work, we study multiple facets of the DNN robustness problem by pursuing two main threads of research. The key methodological linkage throughout our investigations is the consistent design, development, utilization, and deployment of Adversarial Machine Learning techniques, which have remarkable abilities to both degrade and enhance model performance. Our ultimate goal is to help construct the safer and more reliable models of the future.
In the first thread of research, we take the perspective of an adversary who wishes to find novel and increasingly potent ways to fool current DNN models. Our approach is centered around the development of a feature space attack and the construction of novel adversarial threat models that work to reduce required knowledge assumptions. Interestingly, we find that a transfer-based blackbox adversary can be significantly more powerful than previously believed, and can reliably cause targeted misclassifications with imperceptible noise. Further, we find that the attacker does not necessarily require access to the target model's training distribution to create transferable attacks, which is a more practically concerning scenario due to the reduction of required attacker knowledge.
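A minimal sketch of a feature space attack in this spirit, assuming torchvision's ResNet-18 as a stand-in white-box surrogate and hypothetical tensors x_src/x_tgt (not the dissertation's exact attack): the perturbation is optimized so the input's intermediate features move toward those of a target-class image while staying inside a small L-infinity ball.

    import torch
    import torchvision.models as models

    surrogate = models.resnet18(weights=None).eval()   # stand-in surrogate model
    feats = {}
    surrogate.layer3.register_forward_hook(lambda m, i, o: feats.update(out=o))

    def feature_space_attack(x_src, x_tgt, eps=8/255, steps=50, lr=0.01):
        """Perturb x_src so its layer3 features approach those of x_tgt."""
        with torch.no_grad():
            surrogate(x_tgt)
            f_tgt = feats['out'].detach()
        delta = torch.zeros_like(x_src, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)
        for _ in range(steps):
            surrogate(torch.clamp(x_src + delta, 0, 1))
            loss = (feats['out'] - f_tgt).pow(2).mean()   # feature-space distance
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)                   # keep the noise imperceptible
        return torch.clamp(x_src + delta.detach(), 0, 1)

    x_src = torch.rand(1, 3, 224, 224)   # benign input (toy data)
    x_tgt = torch.rand(1, 3, 224, 224)   # image from the attacker's target class
    x_adv = feature_space_attack(x_src, x_tgt)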
Along the second thread of research, we take the perspective of a DNN model designer whose job is to create systems capable of robust operation in "open-world" environments, where both known and unknown target types may be encountered. Our approach is to establish a classifier + out-of-distribution (OOD) detector system co-design that is centered around an adversarial training procedure and an outlier exposure-based learning objective. Through various experiments, we find that our systems can achieve high accuracy in extended operating conditions, while reliably detecting and rejecting fine-grained OOD target types. We also develop a method for efficiently improving OOD detection by learning from the deployment environment. Overall, by exposing novel vulnerabilities of current DNNs while also improving the robustness of existing models against known vulnerabilities, our work makes significant progress towards creating the next generation of more trustworthy models.
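As a hedged illustration of an outlier exposure-style objective (a generic formulation, not necessarily the exact objective used here): the classifier is trained with cross-entropy on in-distribution data plus a term that pushes its predictions on exposed outliers toward the uniform distribution.

    import torch
    import torch.nn.functional as F

    def oe_style_loss(logits_in, labels_in, logits_out, lam=0.5):
        """Cross-entropy on in-distribution samples, plus a term that pushes
        predictions on exposed outlier samples toward the uniform distribution."""
        ce = F.cross_entropy(logits_in, labels_in)
        # Cross-entropy to the uniform distribution == mean negative log-softmax.
        uniform_ce = -logits_out.log_softmax(dim=1).mean(dim=1).mean()
        return ce + lam * uniform_ce

    logits_in = torch.randn(8, 10)
    labels_in = torch.randint(0, 10, (8,))
    logits_out = torch.randn(8, 10)   # logits for outlier-exposure samples
    print(oe_style_loss(logits_in, labels_in, logits_out))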
Item Open Access Algorithm-hardware co-optimization for neural network efficiency improvement (2020) Yang, Qing
Deep neural networks (DNNs) are widely applied across the artificial intelligence field. While the performance of DNNs is continuously improved by more complicated and deeper structures, the feasibility of deployment on edge devices remains a critical problem. In this thesis, we present algorithm-hardware co-optimization approaches that address the challenges of efficient DNN deployment from three aspects: 1) saving computational cost, 2) saving memory cost, and 3) reducing data movement.
First, we present a joint regularization technique to advance compression beyond the weights to neuron activations. By distinguishing and leveraging the significant differences among neuron responses and connections during learning, the jointly pruned network, namely JPnet, optimizes the sparsity of activations and weights. Second, to structurally regulate the dynamic activation sparsity (DAS), we propose a generic low-cost approach based on a winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level, which can be easily utilized to achieve acceleration on conventional embedded systems. The effectiveness of JPnet and DASNet has been thoroughly evaluated through various network models with different activation functions and on different datasets. Third, we propose BitSystolic, a neural processing unit based on a systolic array structure, to fully support mixed-precision inference. In BitSystolic, the numerical precision of both weights and activations can be configured in the range of 2b to 8b, fulfilling different requirements across mixed-precision models and tasks. Moreover, the design can support the various data flows present in different types of neural layers and adaptively optimize data reuse by switching between the matrix-matrix mode and the vector-matrix mode. We designed and fabricated the proposed BitSystolic in a 65nm process. Our measurement results show that BitSystolic features a unified power efficiency of up to 26.7 TOPS/W with 17.8 mW peak power consumption across various layer types. In the end, we take a glance at computing-in-memory architectures based on resistive random-access memory (ReRAM), which realize in-place storage and computation. A quantized training method is proposed to enhance the accuracy of ReRAM-based neuromorphic systems by alleviating the impact of limited parameter precision.
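A minimal sketch of the winners-take-all idea behind DASNet (hypothetical keep_ratio parameter; the actual method selects winners in a structured, per-layer fashion): only the largest activations in each sample are kept and the rest are zeroed, yielding dynamic activation sparsity.

    import torch

    def wta_dropout(x, keep_ratio=0.3):
        """Winners-take-all masking: keep the top `keep_ratio` activations of
        each sample (by magnitude) and zero out the rest."""
        flat = x.flatten(start_dim=1)
        k = max(1, int(keep_ratio * flat.size(1)))
        thresh = flat.abs().topk(k, dim=1).values[:, -1:]   # per-sample threshold
        mask = (flat.abs() >= thresh).float()
        return (flat * mask).view_as(x)

    acts = torch.relu(torch.randn(4, 16, 8, 8))     # toy post-ReLU activations
    sparse_acts = wta_dropout(acts, keep_ratio=0.25)
    print((sparse_acts != 0).float().mean())         # roughly equal to keep_ratio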
Item Open Access Boosting the Sensing Granularity of Acoustic Signals by Exploiting Hardware Non-linearity (2023) Chen, Xiangru
Acoustic sensing is a new sensing modality that senses the contexts of human targets and our surroundings using acoustic signals. It has become a hot topic in both academia and industry owing to its fine sensing granularity and the wide availability of microphones and speakers on commodity devices. While prior studies focused on addressing well-known challenges such as increasing the limited sensing range and enabling multi-target sensing, we propose a novel scheme that leverages the non-linear distortion of microphones to further boost the sensing granularity. Specifically, we observe the existence of a non-linear signal generated from the direct-path signal and the target reflection signal. We mathematically show that the non-linear chirp signal amplifies the phase variations, and this property can be utilized to improve the granularity of acoustic sensing. Experimental results show that, by properly leveraging the hardware non-linearity, the amplitude estimation error for sub-millimeter-level vibration can be reduced from 0.137 mm to 0.029 mm.
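A toy simulation of the underlying observation, with assumed chirp and non-linearity parameters (fs, f0, k, alpha are illustrative): a microphone with a second-order non-linear term produces a cross product of the direct-path signal and the target reflection, which is the extra signal component exploited for finer-grained sensing.

    import numpy as np

    fs = 48000                                  # sample rate (assumed)
    t = np.arange(0, 0.01, 1 / fs)              # one 10 ms frame
    f0, k = 17000, 4e5                          # chirp start frequency and slope (assumed)
    chirp = lambda tau: np.cos(2 * np.pi * (f0 * (t - tau) + 0.5 * k * (t - tau) ** 2))

    direct = chirp(0.0)                         # speaker-to-microphone direct path
    reflection = 0.2 * chirp(1.5e-3)            # weaker, delayed echo from the target
    x = direct + reflection
    alpha = 0.1                                 # strength of the second-order non-linearity (assumed)
    y = x + alpha * x ** 2                      # non-linear microphone model

    # The quadratic term expands to direct^2 + 2*direct*reflection + reflection^2;
    # the cross term couples the direct path with the target reflection and is the
    # additional component exploited for finer-grained sensing.
    cross_term = 2 * alpha * direct * reflection
    print(np.sum(cross_term ** 2) / np.sum(reflection ** 2))   # relative energy of the cross term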
Item Open Access Dynamic Deep Learning Acceleration with Co-Designed Hardware Architecture (2023) Hanson, Edward Thor
Recent advancements in Deep Learning (DL) hardware target the training and inference of static DL models, thus simultaneously achieving high runtime performance and efficiency. However, dynamic DL models are seen as the next step in further pushing the accuracy-performance tradeoff of DL inference and training in our favor; by reshaping the model's parameters or structure based on the input, dynamic DL models have the potential to boost accuracy while introducing marginal computation cost. As the field of DL progresses towards dynamic models, many of the advancements in DL accelerator design are eclipsed by data movement-related bottlenecks introduced by unpredictable memory access patterns and computation flow. Additionally, designing hardware for every niche task is inefficient due to the high cost of developing new hardware. Therefore, we must carefully design the DL hardware and software stack to support future, dynamic DL models by emphasizing flexibility and generality without sacrificing end-to-end performance and efficiency.
This dissertation targets algorithmic-, hardware-, and software-level optimizations for DL systems. Starting at the algorithm level, the robust nature of DNNs is exploited to reduce computational and data movement demand. At the hardware level, dynamic hardware mechanisms are investigated to better serve a broad range of impactful future DL workloads. At the software level, statistical patterns of dynamic models are leveraged to enhance the performance of offline and online scheduling strategies. The success of this research is measured by considering all key metrics associated with DL and DL acceleration: inference latency and accuracy, training throughput, peak memory occupancy, area efficiency, and energy efficiency.
Item Open Access Efficient and Generalizable Neural Architecture Search for Visual Recognition (2021) Cheng, Hsin-Pai
Neural Architecture Search (NAS) can achieve accuracy superior to human-designed neural networks because of its automated design process and searching techniques. While automatically designed neural architectures can achieve new state-of-the-art performance with less human crafting effort, three obstacles hinder building the next generation of NAS algorithms: (1) the search space is constrained, which limits its representation ability; (2) searching a large search space is time-costly, which slows down the model crafting process; and (3) inference of complicated neural architectures is slow, which limits deployability on different devices. Regarding the search space, previous NAS works rely on existing block motifs; specifically, previous search spaces seek the best combination of MobileNetV2 blocks without exploring more sophisticated cell connections. To accelerate the searching process, a more accurate description of neural architectures is necessary. To deploy neural architectures to hardware, better adaptability is required. This dissertation proposes ScaleNAS to expand the search space so that it is adaptable to multiple vision-based tasks. The dissertation will show that NASGEM overcomes the limited representation ability of neural architectures to accelerate searching. Finally, we show how to integrate neural architecture search with structural pruning and mixed-precision quantization to further improve hardware deployment.
Item Open Access Efficient and Scalable Deep Learning (2019) Wen, Wei
Deep Neural Networks (DNNs) can achieve accuracy superior to traditional machine learning models because of their large learning capacity and the availability of large amounts of labeled data. In general, larger DNNs can obtain higher accuracy. However, two obstacles hinder us from building larger DNNs: (1) inference of large DNNs is slow, which limits their deployment to small devices; (2) training large DNNs is also slow, which slows down research exploration. To remove those obstacles, this dissertation focuses on acceleration of DNN inference and training. To accelerate DNN inference, original DNNs are compressed while keeping the original accuracy. More specifically, Structurally Sparse Deep Neural Networks (SSDNNs) are proposed to remove neural components. In Convolutional Neural Networks (CNNs), neurons, filters, channels and layers can be removed; in Recurrent Neural Networks (RNNs), hidden sizes can be reduced. The study shows that SSDNNs can achieve higher speedup than sparse DNNs with non-structured sparsity. Besides SSDNNs, a Force Regularization is proposed to enforce DNNs into a lower-rank space, such that DNNs can be decomposed into lower-rank architectures with fewer ranks than traditional methods. The dissertation also demonstrates that SSDNNs and Force Regularization are orthogonal and can be combined for higher speedup. To accelerate DNN training, distributed deep learning is required. However, two problems hinder us from using more compute nodes for higher training speed: the Communication Bottleneck and the Generalization Gap. The Communication Bottleneck arises because communication time increases and comes to dominate when distributed systems scale to many compute nodes. To reduce gradient communication in Stochastic Gradient Descent (SGD), SGD with low-precision gradients (TernGrad) is proposed. Moreover, in distributed deep learning, a large batch size is required to exploit system computing power; unfortunately, accuracy decreases when the batch size is very large, which is referred to as the Generalization Gap. One hypothesis to explain the Generalization Gap is that large-batch SGD gets stuck at sharp minima. The dissertation proposes a stochastic smoothing method (SmoothOut) to escape sharp minima. The dissertation will show that TernGrad overcomes the Communication Bottleneck and SmoothOut helps to close the Generalization Gap.
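The core of TernGrad is an unbiased stochastic ternarization of each gradient tensor; a simplified numpy sketch of that formulation follows (per-tensor scaling only, omitting the paper's additional variants).

    import numpy as np

    def ternarize(grad, rng=None):
        """TernGrad-style stochastic ternarization: each gradient entry becomes
        {-s, 0, +s} with s = max|grad|, keeping the expectation unbiased."""
        rng = rng or np.random.default_rng()
        s = np.abs(grad).max()
        if s == 0:
            return np.zeros_like(grad)
        prob = np.abs(grad) / s                    # probability of keeping a value
        keep = rng.random(grad.shape) < prob       # Bernoulli(|g| / s)
        return s * np.sign(grad) * keep

    g = np.random.randn(1000) * 0.01
    g_tern = ternarize(g)
    print(np.unique(np.round(g_tern, 6)).size)     # at most 3 distinct values
    print(g.mean(), g_tern.mean())                 # close in expectation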
Item Open Access Efficient Neural Network Based Systems on Mobile and Cloud Platforms (2020) Mao, Jiachen
In recent years, machine learning, especially neural networks, has exerted unprecedented influence in both academia and industry.
The reason lies in the state-of-the-art performance of neural networks on many critical applications such as object detection, translation, and games. However, the deployment of neural network models on resource-constrained devices (e.g., edge devices) is challenged by their heavy memory and computing cost during execution. Many efforts have been made in the literature toward efficient execution of neural networks, from the perspectives of hardware, software, and algorithms.
My research focus during my Ph.D. study is mainly on software and algorithms targeting mobile platforms. More specifically, we emphasize the system design, system optimization, and model compression of neural networks for a better mobile user experience. From the system design perspective, we first propose MoDNN, a local distributed mobile computing system for DNN testing. MoDNN can partition already-trained DNN models onto several mobile devices to accelerate DNN computations by alleviating device-level computing cost and memory usage. Two model partition schemes are also designed to minimize non-parallel data delivery time, including both wakeup time and transmission time. Then, we propose AdaLearner, an adaptive local distributed mobile computing system for DNN training. To exploit the potential of our system, we adapt the neural network training phase to mobile device-wise resources and drastically decrease the transmission overhead for better system scalability. From the system optimization perspective, we propose MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is based on a state-of-the-art video detection framework called Deep Feature Flow (DFF). MobiEye optimizes DFF with three system-level optimization methods. From the model compression perspective, we propose TPrune, a model analyzing and pruning framework for the Transformer. In TPrune, we first propose Block-wise Structured Sparsity Learning (BSSL) to analyze Transformer model properties. Then, based on the characteristics derived from BSSL, we apply Structured Hoyer Square (SHS) to derive the final compressed models. The projects realized during my Ph.D. study contribute to current research on efficient neural network execution and thus lead to more user-friendly and smart applications on edge devices for more users.
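As a hedged sketch of the Structured Hoyer Square idea (the exact grouping used in TPrune may differ), the regularizer is the squared L1 norm divided by the squared L2 norm computed over structured groups of weights, which drives entire groups toward zero.

    import torch

    def structured_hoyer_square(weight, group_dim=0, eps=1e-8):
        """Hoyer-Square regularizer over structured groups: the ratio
        (sum of group norms)^2 / (sum of squared group norms). Minimizing it
        pushes whole groups (e.g., rows of a projection matrix) to zero."""
        dims = [d for d in range(weight.dim()) if d != group_dim]
        group_norms = weight.pow(2).sum(dim=dims).sqrt()      # one norm per group
        return group_norms.sum().pow(2) / (group_norms.pow(2).sum() + eps)

    w = torch.randn(64, 512, requires_grad=True)              # toy projection matrix
    task_loss = w.pow(2).mean()                               # placeholder for the real task loss
    total = task_loss + 1e-3 * structured_hoyer_square(w, group_dim=0)
    total.backward()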
Item Open Access Enable Intelligence on Billion Devices with Deep Learning (2022) Li, Ang
With the proliferation of edge computing and the Internet of Things (IoT), billions of edge devices (e.g., smartphones, AR/VR headsets, autonomous cars, etc.) are deployed in our daily life and constantly generate a gigantic amount of data at the network edge. Bringing deep learning to such huge volumes of data will boost many novel applications and services in the edge ecosystem and fuel the continuous boom of artificial intelligence (AI). Driven by this motivation, there is an urgent need to push the AI frontier to the network edge in order to fully exploit the big data residing on edge devices.
However, empowering edge intelligence with AI, especially deep learning, is technically challenging due to several critical challenges, including privacy, efficiency, and performance. Conventional wisdom requires edge devices to transmit the data to cloud datacenters for training and inference. But moving a huge amount of data is prohibited by cost, high transmission delay, and privacy leakage. The emerging federated learning (FL) is a promising distributed learning paradigm that enables massive devices to collaboratively learn a machine learning model (e.g., a deep neural network) without explicitly sharing data, and hence the privacy concerns caused by data sharing in centralized learning can be mitigated. But FL faces some critical challenges that hinder its deployment on edge devices, such as communication cost and data heterogeneity.
Once we obtain a learned machine learning model, the next step is to deploy the model to serve applications and services. One straightforward approach is to deploy the model on device and perform the inference locally. Unfortunately, on-device AI often suffers from poor performance because most AI applications require high computational power, which is technically unaffordable for resource-constrained edge devices. Edge computing pushes cloud services from the network core to the network edge, and hence bridging devices with edge servers can alleviate the computational cost of running AI models on the device alone. However, such a collaborative deployment scheme will inevitably incur transmission delay and raise privacy concerns due to data movement between devices and edge servers. For example, the device can send the features extracted from raw data (e.g., images) to the cloud where a pre-trained machine learning model is deployed, but these extracted features can still be exploited by attackers to recover the raw data and to infer embedded private attributes (e.g., age, gender, etc.).
In this dissertation, I start by presenting a privacy-respecting data crowdsourcing framework for deep learning to address the privacy issue in centralized training. Then, I shift the setting from the centralized one to the decentralized environment, where three novel FL frameworks are proposed to jointly improve communication and computation efficiency while handling the heterogeneous data across devices. In addition to improving learning on large-scale edge devices, I also design an efficient edge-assisted photorealistic video style transfer system for mobile phones by leveraging the collaboration between smartphones and the edge server. Besides, in order to mitigate the privacy concern caused by data movement in the collaborative system, an adversarial training framework is proposed to prevent the adversary from reconstructing the raw data and inferring private attributes.
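The FL setting referenced above can be grounded with a minimal FedAvg-style round (a generic baseline with hypothetical names such as client_loaders; the three proposed frameworks add communication- and heterogeneity-oriented improvements on top of such a baseline).

    import copy
    import torch
    import torch.nn.functional as F

    def fedavg_round(global_model, client_loaders, local_steps=5, lr=0.01):
        """One communication round: every client trains locally from the current
        global weights, then the server averages client weights by dataset size."""
        client_states, sizes = [], []
        for loader in client_loaders:
            model = copy.deepcopy(global_model)
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            for _, (x, y) in zip(range(local_steps), loader):
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
            client_states.append(model.state_dict())
            sizes.append(len(loader.dataset))
        total = float(sum(sizes))
        avg = {k: sum(s[k].float() * (n / total) for s, n in zip(client_states, sizes))
               for k in client_states[0]}
        global_model.load_state_dict(avg)
        return global_model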
Item Open Access From Adversaries to Anomalies: Addressing Real-World Vulnerabilities of Deep Learning-based Vision Models (2024) Inkawhich, Matthew Joseph
Deep Neural Networks (DNNs) have driven the performance of computer vision to new heights, which has led to them being rapidly integrated into many of our real-world systems. Meanwhile, the majority of research on DNNs remains focused on enhancing accuracy and efficiency. Furthermore, the evaluation protocols used to quantify performance generally assume idealistic operating conditions that do not emulate realistic environments well. For example, modern benchmarks typically have balanced class distributions, ample training data, consistent object scale, minimal noise, and only test on inputs that lie within the training distribution. As a result, we are currently integrating these naive and under-tested models into our trusted systems! In this work, we focus on the robustness of DNN-based vision models, seeking to understand their vulnerabilities to non-ideal deployment data. The rallying cry of our research is that before these models are deployed into our safety-critical applications (e.g., autonomous vehicles, defense technologies), we must attempt to anticipate, understand, and address all possible vulnerabilities. We begin by investigating a class of malignant inputs that are specifically designed to fool DNN models. We conduct this investigation by taking on the perspective of an adversary who wishes to attack a pretrained DNN by adding (nearly) imperceptible noise to a benign input to fool a downstream model. While most adversarial literature focuses on image classifiers, we seek to understand the feasibility of attacks on other tasks such as video recognition models and deep reinforcement learning agents. Sticking to the theme of realistic vulnerabilities, we primarily focus on black-box attacks in which the adversary does not assume knowledge of the target model's architecture and parameters. Our novel attack algorithms achieve surprisingly strong effectiveness, thus uncovering new serious potential security risks.
While malignant adversarial inputs represent a critical vulnerability, they are still a fairly niche issue in the context of all problematic inputs for a DNN. In the second phase of our work, we turn our attention to the open-set vulnerability. Here, we acknowledge that during deployment, models may encounter novel classes from outside of their training distribution. Again, the majority of works in this area only consider image classifiers for their simplicity. This motivates us to study the more complex and practically useful open-set object detection problem. We address this problem in two phases. First, we create a tunable class-agnostic object proposal network that can be easily adapted to suit a variety of open-set applications. Next, we define a new Open-Set Object Detection and Discovery (OSODD) task that emphasizes both known and unknown object detection with class-wise separation. We then devise a novel framework that combines our tunable proposal network with a powerful transformer-based foundational model, which achieves state-of-the-art performance on this challenging task.
We conclude with a feasibility study of inference-time dynamic Convolutional Neural Networks (CNNs). We argue that this may be an exciting potential solution for improving robustness to natural variations such as changing object scale, aspect ratio, and surrounding contextual information. Our preliminary results indicate that different inputs have a strong preference for different convolutional kernel configurations. We show that by allowing just four layers of common off-the-shelf CNN models to have dynamic convolutional stride, dilation, and size, we can achieve remarkably high levels of accuracy on classification tasks.
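A hedged sketch of the inference-time dynamic convolution idea (a toy gate choosing only the stride; the feasibility study above covers stride, dilation, and kernel size across several layers): a lightweight gate inspects the input and selects the configuration used by a shared convolution kernel.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicStrideConv(nn.Module):
        """Shared 3x3 kernel whose stride (1 or 2) is chosen per input by a
        lightweight gate; a toy version of input-dependent layer configuration."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.05)
            self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, 2))
        def forward(self, x):
            stride = 1 if self.gate(x).mean(0).argmax().item() == 0 else 2
            return F.conv2d(x, self.weight, stride=stride, padding=1)

    layer = DynamicStrideConv(3, 16)
    print(layer(torch.randn(1, 3, 32, 32)).shape)   # 32x32 or 16x16 depending on the gate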
Item Open Access From Labeled to Unlabeled Data: Understand Deep Visual Representations under Causal Lens (2023) Yang, Yuewei
Deep vision models have been highly successful in various computer vision applications such as image classification, segmentation, and object detection. These models encode visual data into low-dimensional representations, which are then utilized in downstream tasks. Typically, the most accurate models are fine-tuned using fully labeled data, but this approach may not generalize well to different applications. Self-supervised learning has emerged as a potential solution to this issue, where the deep vision encoder is pretrained with unlabeled data to learn more generalized representations. However, the underlying mechanism governing the generalization and specificity of representations requires further understanding. Causality is an important concept in visual representation learning, as it can help improve the generalization of models by providing a deeper understanding of the underlying relationships between features and objects in the visual world.
Through the works presented in this dissertation, we provide a causal interpretation of the mechanism underlying deep vision models' ability to learn representations in both labeled and unlabeled environments, and we improve the generalization and the specificity of extracted representations through the interpreted causal factors. Specifically, we tackle the problem from four aspects: causally interpreting supervised deep vision models; supervised learning with underlabeled data; self-supervised learning with unlabeled data; and causally understanding unsupervised visual representation learning.
Firstly, we interpret the prediction of a deep vision model by identifying causal pixels in the input images via 'inverting' the model weights. Secondly, we recognize the challenge of learning an accurate object detection model with missing labels in the dataset, and we address this underlabeled-data issue by adopting a positive-unlabeled learning approach instead of the positive-negative approach. Thirdly, we focus on improving both the generalization and the specificity of unsupervised representations based on prior causal relations. Finally, we enhance the stability of the unsupervised representations during inference by intervening on data variables under a well-constructed causal framework.
We establish a causal relationship between deep vision models and their input/output for different applications with (partially) labeled data, and strengthen generalized representations through extensive analytical understanding of unsupervised representation learning under various hypothesized causal frameworks.
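The positive-unlabeled direction mentioned above can be illustrated with the standard non-negative PU risk estimator; this is a generic sketch with an assumed class prior, not necessarily the exact estimator adapted in the dissertation.

    import torch

    def nn_pu_risk(scores_pos, scores_unl, prior=0.3):
        """Non-negative positive-unlabeled risk with the sigmoid loss:
        R = prior * R_p^+ + max(0, R_u^- - prior * R_p^-)."""
        loss = lambda z, label: torch.sigmoid(-label * z).mean()   # sigmoid surrogate loss
        r_pos_plus = loss(scores_pos, +1.0)     # positives treated as positive
        r_pos_minus = loss(scores_pos, -1.0)    # positives treated as negative
        r_unl_minus = loss(scores_unl, -1.0)    # unlabeled treated as negative
        return prior * r_pos_plus + torch.clamp(r_unl_minus - prior * r_pos_minus, min=0)

    scores_pos = torch.randn(16)    # detector scores on labeled positives
    scores_unl = torch.randn(64)    # detector scores on unlabeled examples
    print(nn_pu_risk(scores_pos, scores_unl))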
Item Open Access Highly Efficient Neuromorphic Computing Systems With Emerging Nonvolatile Memories (2020) Yan, Bonan
Emerging nonvolatile memory based hardware neuromorphic computing systems have enabled the implementation of general vector-matrix multiplication in a manner that fuses computation and memory at the same physical location. However, there remain three major challenges in designing such neuromorphic computing systems for high efficiency in large-scale integration: (a) the analog/digital interface circuits dominate the power and area in such mixed-signal designs; (b) the systems are highly customized and can only compute one class of neural network models once developed; (c) non-ideal device properties largely forfeit the benefit in terms of computational efficiency.
Designs of mixed-signal interface circuitry have been extensively studied, but a holistic design approach for very-large-scale integration, spanning circuit design, microarchitecture, and hardware/software co-simulation, has been overlooked for emerging nonvolatile memory based neuromorphic computing systems. The realization of such neuromorphic computing platforms requires: (a) efficient interface circuits as well as execution models; (b) appropriate reconfigurability at runtime for different neural network architectures; and (c) reliability enhancement methods to resist imperfect fabrication and harsh working environments.
Motivated by these demands, this dissertation first introduces an implementation scheme for a neuromorphic computing system that uses emerging nonvolatile memory as synapses and CMOS integrated circuits as neurons. To save the energy consumed by data communication, the neuron circuits improve upon conventional integrate-and-fire neuron circuits for better current-to-spike conversion efficiency. Trade-offs between throughput and latency are investigated and validated by a prototype 64Kb Resistive Random Access Memory based in-memory computing processing engine.
Next, this dissertation proposes a fully-memristive neuromorphic computing system architecture that incorporates the Mott memristor as the neuron circuit. The small footprint and intrinsic bionic dynamics of emerging memory-based neuron circuits significantly reduce design complexity. This dissertation investigates and models the randomness that Mott memristors introduce. By suppressing it during inference and exploiting it during learning, the proposed system is optimized for a balance between inference accuracy and training efficiency.
Moreover, this dissertation advances the reconfigurability of emerging memory based neuromorphic computing systems by presenting a paradigm that supports post-fabrication switching between spiking and non-spiking neural network model execution. An improved version of time-to-first-spike temporal encoding is proposed that uses single spikes to accelerate execution.
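A minimal sketch of time-to-first-spike encoding as referenced above (assuming a simple linear code; the proposed improved variant differs): stronger inputs fire earlier, so a single spike per neuron carries the value.

    import numpy as np

    def time_to_first_spike(values, t_max=100):
        """Encode values in [0, 1] as spike times: larger value -> earlier spike.
        Returns integer time steps in [0, t_max]; zeros never fire (t_max + 1)."""
        values = np.clip(values, 0.0, 1.0)
        times = np.where(values > 0, np.round((1.0 - values) * t_max), t_max + 1)
        return times.astype(int)

    pixels = np.array([0.95, 0.5, 0.1, 0.0])
    print(time_to_first_spike(pixels))   # [  5  50  90 101]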
Finally, this dissertation presents hardware/software co-design techniques for the implementation of neuromorphic computing systems with emerging nonvolatile memories. A hardware/software co-simulation flow is developed, and based on it, this dissertation also proposes a closed-loop design that enhances weight stability against read disturbance.
In summary, this dissertation tackles important problems in designing neuromorphic computing systems with emerging nonvolatile memories. The outcome of this research is expected not only to pave the way for realizing highly efficient artificial intelligence hardware, but also to shorten the product development cycle.
Item Open Access Improving the Efficiency and Robustness of In-Memory Computing in Emerging Technologies (2023) Yang, Xiaoxuan
Emerging technologies, such as resistive random-access memory (ReRAM), have proven their potential in in-memory computing for deep learning applications. My dissertation work focuses on improving the efficiency and robustness of in-memory computing in emerging technologies.
Existing ReRAM-based processing-in-memory (PIM) designs can support the inference and training of neural networks, such as convolutional neural networks and recurrent neural networks. However, these designs suffer from the re-writing procedure required for the self-attention calculation. Therefore, I propose an architecture that enables an efficient self-attention mechanism in PIM designs. The optimized calculation procedure and finer-granularity pipeline design improve efficiency. The contributions lie in enabling feasible and efficient ReRAM-based PIM designs for attention-based models.
Inference with ReRAM-based designs has one severe problem: inference accuracy can be degraded by non-idealities in the hardware devices. The robustness of previous methods has not been validated under combined device stochastic noise. With the proposed hardware-aware training method, the robustness of inference accuracy can be improved. In addition, targeting both hardware efficiency and inference robustness, a multi-objective optimization method is developed to explore the design space and generate high-quality Pareto-optimal design configurations with minimal cost. This work integrates attributes from the design space and the evaluation space and develops efficient hardware-software co-design methods.
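A hedged sketch of hardware-aware training in this spirit (a generic noise-injection recipe with an assumed multiplicative Gaussian noise model, not the dissertation's exact method): each training step perturbs the weights with a device-variation model, backpropagates through the noisy forward pass, and applies the update to the clean weights.

    import torch
    import torch.nn as nn

    def noise_aware_step(model, x, y, optimizer, rel_sigma=0.05):
        """One hardware-aware training step: perturb weights with multiplicative
        Gaussian noise (a crude ReRAM-variation stand-in), backprop through the
        noisy forward pass, then restore the clean weights before the update."""
        backup = {n: p.detach().clone() for n, p in model.named_parameters()}
        with torch.no_grad():
            for p in model.parameters():
                p.mul_(1.0 + rel_sigma * torch.randn_like(p))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                      # gradients computed at the noisy weights
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.copy_(backup[n])           # updates are applied to the clean weights
        optimizer.step()
        return loss.item()

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    print(noise_aware_step(model, torch.randn(32, 20), torch.randint(0, 5, (32,)), opt))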
Training with ReRAM-based designs faces a challenging endurance problem due to the frequent weight updates in neural network training. The expectation for endurance management is to decrease the number of weight updates and balance the write accesses. The proposed endurance-aware training method utilizes gradient structure pruning and dynamically adjusts the write probabilities in a structured manner. This method can extend the life cycle of ReRAM during the training process.
In summary, the research above targets realizing efficient self-attention mechanisms and solving the accuracy degradation and endurance problems in the inference and training processes. The effort also lies in identifying the challenging parts of each topic and developing hardware-software co-designs that consider both efficiency and robustness. The developed designs are potential solutions to the challenging problems of in-memory computing in emerging technologies.
Item Open Access In-Memory Computing Architecture for Deep Learning Acceleration (2020) Chen, Fan
The ever-increasing demands of deep learning applications, especially the more powerful but intensive unsupervised deep learning models, overwhelm the computation capability, communication capability, and storage capability of modern general-purpose CPUs and GPUs. To accommodate the memory and computing requirements, multi-core systems that make intensive use of accelerators have become the future of computing. Such novel computing systems incur new challenges, including architectural support for model training in the accelerators, large cache demands for multi-core processors, and system performance, energy, and efficiency. In this thesis, I present my research that addresses these challenges by leveraging emerging memory and logic devices, as well as advanced integration technologies. In the first work, I present the first training accelerator architecture, ReGAN, for unsupervised deep learning. ReGAN follows the process-in-memory strategy by leveraging the energy efficiency of resistive memory arrays for in-situ deep learning execution. I propose an efficient pipelined training procedure to reduce on-chip memory access. In the second work, I present ZARA to address the resource underutilization caused by a new operator, namely transposed convolution, used in unsupervised learning models. ZARA improves system efficiency through a novel computation deformation technique. In the third work, I present MARVEL, which targets improving power efficiency over previous resistive accelerators. MARVEL leverages monolithic 3D integration technology by stacking multiple layers of low-power analog/digital conversion circuits implemented with carbon nanotube field-effect transistors. The area-consuming eDRAM buffers are replaced by dense cross-point Spin Transfer Torque Magnetic RAM. I explored the design space and demonstrated that MARVEL can provide further improved power efficiency with an increased number of integration layers. In the last piece of work, I propose the first holistic solution for employing skyrmion racetrack memory as last-level caches for future high-capacity cache design. I first present a cache architecture and a physical-to-logic mapping scheme based on a comprehensive analysis of the working mechanism of skyrmion racetrack memory. Then I model the impact of process variations and propose a process-variation-aware data management technique to minimize the performance degradation incurred by process variations.
Item Open Access Intelligent Circuit Design and Implementation with Machine Learning (2022) Xie, Zhiyao
Electronic design automation (EDA) technology has achieved remarkable progress over the past decades. However, modern chip design is still not fully automatic in general, and the gap is not easily surmountable. For example, the chip design flow is still largely restricted to individual point tools with limited interplay across tools and design steps. Tools applied at early steps cannot reliably judge whether their solutions will eventually lead to satisfactory designs, inevitably leading to over-pessimistic design or significantly longer turnaround time. While these challenges have long been unsolved, the ever-increasing complexity of integrated circuits (ICs) leads to even more stringent design requirements. Therefore, there is a compelling need for essential improvement over existing EDA techniques.
The stagnation of EDA technologies stems from insufficient knowledge reuse. In practice, very similar simulation or optimization results may need to be repeatedly constructed from scratch. This motivates my research on introducing more "intelligence" to EDA with machine learning (ML), which explores complex correlations in design flows based on prior data. Besides design time, I also propose ML solutions to boost IC performance by assisting circuit management at runtime.
In this dissertation, I present multiple fast yet accurate ML models covering a wide range of chip design stages from the register-transfer level (RTL) to sign-off, solving primary chip-design problems in power, timing, interconnect, IR drop, routability, and design flow tuning. Targeting the RTL stage, I present APOLLO, a fully automated power modeling framework. It constructs an accurate per-cycle power model by extracting the most power-correlated signals. The model can be further implemented on chip for runtime power management with unprecedentedly low hardware costs. Targeting gate-level netlists, I present Net2 for early estimation of post-placement wirelength. It further enables more accurate timing analysis without actual physical design information. Targeting circuit layout, I present RouteNet for early routability prediction. As the first deep learning-based routability estimator, some of the feature-extraction and model-design principles proposed in it are widely adopted by later works. I also present PowerNet for fast IR drop estimation. It captures spatial and temporal information about power distribution with a customized CNN architecture. Last, beyond targeting a single design step, I present FIST to efficiently tune design flow parameters during both logic synthesis and physical design.
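A toy sketch of the per-cycle power-modeling idea behind APOLLO, using synthetic toggle data and a simple correlation ranking as a stand-in for its signal selection (the actual framework uses a more principled pruning-based selection and a hardware-friendly model).

    import numpy as np

    rng = np.random.default_rng(0)
    n_cycles, n_signals, n_keep = 2000, 500, 20

    toggles = rng.integers(0, 2, size=(n_cycles, n_signals)).astype(float)  # per-cycle signal activity
    true_w = np.zeros(n_signals); true_w[:40] = rng.uniform(0.5, 2.0, 40)
    power = toggles @ true_w + rng.normal(0, 0.5, n_cycles)                  # synthetic per-cycle power

    # Rank signals by |correlation| with power, keep the strongest few,
    # then fit a tiny linear per-cycle power model on just those signals.
    corr = np.abs([np.corrcoef(toggles[:, j], power)[0, 1] for j in range(n_signals)])
    selected = np.argsort(corr)[-n_keep:]
    coef, *_ = np.linalg.lstsq(toggles[:, selected], power, rcond=None)
    pred = toggles[:, selected] @ coef
    print("per-cycle MAE:", np.mean(np.abs(pred - power)))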
Item Open Access Joint Optimization of Algorithms, Hardware, and Systems for Efficient Deep Neural Networks (2024) Li, Shiyu
Deep learning has enabled remarkable performance breakthroughs across various domains, including computer vision, natural language processing, and recommender systems. However, the typical deep neural network (DNN) models employed in these applications require millions of parameters and billions of operations, leading to substantial computational and memory requirements. While researchers have proposed compression methods, optimized frameworks, and specialized accelerators to improve efficiency, outstanding challenges persist, limiting the achievable gains.
A fundamental challenge lies in the inherent irregularity and sparsity of DNNs. Although these models exhibit significant sparsity, with a considerable fraction of weights and activations being zero or near-zero values, exploiting this sparsity efficiently on modern hardware is problematic due to the irregular distribution of non-zero elements. This irregularity leads to substantial overhead in indexing, gathering, and processing sparse data, resulting in poor utilization of computational and memory resources. Furthermore, recent research has identified a significant gap between the theoretical and practical improvements achieved by compression methods. Additionally, emerging DNN architectures with novel operators often nullify previous optimization efforts in software frameworks and hardware accelerators, necessitating continuous adaptation.
To address these critical challenges, this dissertation targets building a holistic approach that jointly optimizes algorithms, hardware architectures, and system designs to enable efficient deployment of DNNs in the presence of irregularity and sparsity. On the algorithm level, a novel hardware-friendly compression method based on matrix decomposition is proposed. The original convolutional kernels are decomposed into common basis kernels and a series of coefficients, with conventional pruning applied to the coefficients. This compressed DNN forms a hardware-friendly structure where the sparsity pattern is shared across input feature map pixels, alleviating sparse pattern processing costs.
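A minimal sketch of the decomposition idea, using a truncated SVD as one concrete way to obtain shared basis kernels and per-filter coefficients (the dissertation's decomposition and pruning criterion may differ).

    import numpy as np

    def decompose_kernels(weight, n_basis=4, prune_ratio=0.5):
        """Decompose conv kernels [C_out, C_in, k, k] into n_basis shared basis
        kernels plus per-(filter, channel) coefficients, then prune the smallest
        coefficients so the sparsity lives only in the cheap coefficient tensor."""
        c_out, c_in, k, _ = weight.shape
        mat = weight.reshape(c_out * c_in, k * k)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        basis = vt[:n_basis].reshape(n_basis, k, k)              # shared basis kernels
        coeff = (u[:, :n_basis] * s[:n_basis]).reshape(c_out, c_in, n_basis)
        thresh = np.quantile(np.abs(coeff), prune_ratio)
        coeff[np.abs(coeff) < thresh] = 0.0                       # prune small coefficients
        return basis, coeff

    w = np.random.randn(64, 32, 3, 3)
    basis, coeff = decompose_kernels(w)
    w_approx = np.einsum('oin,nkl->oikl', coeff, basis)           # reconstructed kernels
    print(np.linalg.norm(w - w_approx) / np.linalg.norm(w))       # relative reconstruction error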
On the hardware level, a novel sparse DNN accelerator is introduced to support the inference of the compressed DNN. Low-precision quantization is applied to sparse coefficients, and high-precision to basis kernels. By involving only low-precision coefficients in sparse processing, the hardware efficiently matches non-zero weights and activations using inverted butterfly networks. The shared basis kernels and sparse coefficients significantly reduce buffer size and bandwidth requirements, boosting performance and energy efficiency.
At the system level, a near-data processing framework is proposed to address the challenge of training large DNN-based recommendation models. This framework adopts computational storage devices and coherent system interconnects to partition the model into subtasks. Data-intensive embedding operations run on computational storage devices with customized memory hierarchies, while compute-intensive feature processing and aggregation operations are assigned to GPUs for maximum efficiency. This framework enables training large DNN-based recommendation models without expensive hardware investments.
Through joint optimization across algorithms, hardware architectures, and system designs, this research aims to overcome the limitations imposed by irregularity and sparsity, enabling efficient deployment of DNNs in a broad range of applications and resource-constrained environments. By addressing these critical issues, this work paves the way for fully harnessing the potential of deep learning technologies in practical settings.
Item Open Access On Impact of Network Architecture for Deep Learning (2023) Fu, Hao
The architecture of neural networks is a crucial factor in the success of deep learning models across a range of fields, including computer vision and natural language processing (NLP). Specific architectures are tailored to address particular tasks, and the selection of architecture can significantly affect the training process, model performance, and robustness.
In the field of NLP, we address the training deficiency of text VAEs with autoregressive decoders through two approaches. First, we introduce a cyclical annealing schedule that enables progressive learning of meaningful latent codes by leveraging informative representations from previous cycles as warm restarts. Second, we propose semi-implicit (SI) representations for the latent distributions of natural languages, which extend the commonly used Gaussian distribution family by mixing the variational parameter with a flexible implicit distribution. Our proposed methods are demonstrated to be effective in text generation tasks such as dialog response generation, with significant performance improvements compared to other training techniques.
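A small sketch of a cyclical annealing schedule of the kind described, under an assumed parameterization (a fixed number of cycles over the training run, with the KL weight ramping linearly during the first half of each cycle and then held at 1).

    import numpy as np

    def cyclical_kl_weight(step, total_steps, n_cycles=4, ramp_ratio=0.5):
        """KL weight beta for a cyclical annealing schedule: within each cycle,
        beta rises linearly from 0 to 1 during the first `ramp_ratio` of the
        cycle, then stays at 1 until the cycle restarts."""
        cycle_len = total_steps / n_cycles
        pos = (step % cycle_len) / cycle_len          # position within the cycle, in [0, 1)
        return min(1.0, pos / ramp_ratio)

    betas = [cyclical_kl_weight(t, total_steps=10000) for t in range(10000)]
    print(betas[0], betas[1249], betas[2499], betas[2500])   # 0.0, ~1.0, 1.0, 0.0 (new cycle)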
In the field of computer vision, we investigate the intrinsic influence of network structure on a model’s robustness in addressing data distribution shifts. We propose a novel paradigm, Dense Connectivity Search of Outlier Detector (DCSOD), that automatically explores the dense connectivity of CNN architectures on Out-of-Distribution (OOD) detection tasks using Neural Architecture Search (NAS). To improve the quality of evaluation on OOD detection during the search, we propose evolving distillation based on our multi-view feature learning explanation. Experimental results show that DCSOD achieves remarkable performance over widely used architectures and previous NAS baselines.
Item Open Access Practical Solutions to Neural Architecture Search on Applied Machine Learning (2024) Zhang, Tunhou
The advent of Artificial Intelligence (AI) propels the real world into a new era characterized by remarkable design innovations and groundbreaking design automation, primarily fueled by Deep Neural Networks (DNN). At the heart of this transformation is the progress in Automated Machine Learning (AutoML), notably Neural Architecture Search (NAS). NAS lays a robust foundation for developing algorithms capable of automating design processes to determine the optimal architecture for academic benchmarks. However, the real challenge emerges when adapting NAS for Applied Machine Learning (AML) scenarios: navigating the complex terrain of design space exploration and exploitation. This complexity arises due to the heterogeneity of data and architectures required by real-world AML problems, an aspect that traditional NAS approaches struggle to address fully.
To bridge this gap, our research emphasizes creating a flexible search space that reduces reliance on human-derived architectural assumptions. We introduce innovative techniques aimed at refining search algorithms to accommodate greater flexibility. By carefully examining and enhancing search spaces and methodologies, we empower NAS solutions to cater to practical AML problems. This enables the exploration of broader search spaces, better performance potential, and lower search process costs.
We start by challenging homogeneous search space design for multi-modality 3D representations, proposing "PIDS" to enable joint dimension and interaction search for 3D point cloud segmentation. We consider two axes for adapting point cloud operators toward multi-modality data with density, geometry, and order varieties, achieving significant mIoU improvement on segmentation benchmarks over state-of-the-art 3D models. To implement our approach efficiently in recommendation systems, we develop "NASRec" to support heterogeneous building operators and propose practical solutions to improve the quality of NAS for Click-Through Rate (CTR) prediction. We propose an end-to-end full architecture search with minimal human priors. We provide practical solutions to tackle scalability and heterogeneity challenges in NAS, outperforming manually designed models and existing NAS models on various CTR benchmarks. Finally, we pioneer our effort on industry-scale CTR benchmarks and propose DistDNAS to optimize search and serving efficiency, producing smaller and better recommendation models on a large-scale CTR benchmark. Inspired by the discoveries in NAS, we additionally uncover the underlying theoretical foundations of residual learning in computer vision foundation research and envision the prospects of our research on Artificial Intelligence, including Large Language Models, Generative AI, and beyond.
Item Embargo Processing-in-Memory Accelerators Toward Energy-Efficient Real-World Machine Learning (2024) Kim, Bokyung
Artificial intelligence (AI) has permeated the real world, reaping unprecedented success. Countless applications exploit machine learning (ML) technologies built on big data and compute-intensive algorithms. Moreover, the aspiration toward authentic machine intelligence moves computing to the edge to handle complex tasks conventionally tailored for human beings. Along with this rapid development, the gap between the increasing resource requirements of ML and the restricted environments of the edge draws urgent attention to the challenges in efficiency. To resolve the gap, solutions across different disciplines of hardware are necessary beyond algorithm development.
Unfortunately, hardware development falls far behind because of heterogeneity. While the sensational advance of ML algorithms is a game-changer for computing paradigms, conventional hardware is ill-suited to the new paradigms due to fundamental limitations in its architecture and technology. The traditional architecture separating storage and computation is dreadfully inefficient for the enormous data processing and computation in these algorithms, showing high power consumption and low performance. The realization of these fundamental limitations motivates efficient and non-conventional hardware accelerators.
As a new hardware paradigm, processing-in-memory (PIM) accelerators have raised high expectations because of their intuitive effectiveness against the limitations of traditional hardware. PIM merges computing and memory units and saves the resources spent on data movement and computation, pursuing non-heterogeneity and ultimately improving efficiency. Previous PIM accelerators have shown promising outcomes in high-performance computing, particularly thanks to emerging memories under the name of memristors.
Despite its motivation toward non-heterogeneity, PIM-based designs could not fully escape from heterogeneity, which causes inefficiency and high costs. While emerging memories provide revolutions at the device and circuit levels, PIM at higher levels struggles with various components in systems (horizontal heterogeneity). Furthermore, PIM is holistically designed across hierarchical levels of heterogeneity (vertical heterogeneity), which complicates efficient design. Even robustness can be significantly influenced by heterogeneity.
Confronting the challenges in heterogeneity, efficiency, and robustness, my research has cultivated PIM hardware through cross-layer designs for practically efficient ML acceleration. Specifically, focusing on architecture/system-level innovations, I have pioneered novel 3D architectures and systemic paradigms, which provide a strong foundation for future computing. For ML acceleration, I have proposed new methodologies to efficiently operate 3D architectures, as well as a novel dataflow with a new 3D design for energy efficiency in pursuit of non-heterogeneity. The innovations have been examined through rigorous hardware experiments, and their practical efficiency has been proven with a fabricated chip for seizure classification, a real-world application. In line with the needs of future ML, my research is evolving to achieve robustness in hardware ML platforms. In this dissertation, I summarize the research impacts based on my diverse design experiences, spanning architecture and system design to chip fabrication.
Item Open Access Secure and Power-Efficient Computing on Mobile Platforms (2019) Nixon, Kent Windsor
Mobile devices have been the driving force behind the electronics industry for over a decade. Compared to more traditional computing systems such as desktop or laptop computers, these devices prioritize ease of use and portability over raw compute power or extensible input methodologies. This change in focus results in devices that are generally small in size, regularly transported (and forgotten), and that use greatly simplified user interfaces. The main challenges with such devices become 1) securing the data produced by and stored on them, and 2) minimizing power consumption during operation in order to prolong limited battery life.
With respect to the first of these two challenges, the first research goal of this dissertation is to identify and develop robust and transparent methodologies both for authenticating a user to a device and for securing data stored on or generated by these devices. For securing data produced by and stored on mobile devices, consideration must be given to both user authentication and data integrity. For this dissertation, a novel means of user authentication based on device interaction is examined. The detailed gesture-based authentication scheme is shown to have high accuracy while requiring no additional input from the user beyond utilizing the device. Additionally, for securing data stored on the device post-authentication, this dissertation explores alternate methodologies for the detection of adversarial noise added to user images. The discussed methodology is shown to have high attack-detection accuracy while remaining computationally efficient.
With respect to the second challenge, the second research goal of this dissertation is to examine alternative, more computationally- and power-efficient methodologies for accomplishing existing tasks, tailored around the unique capabilities and limitations of mobile devices. For this dissertation, a general-case power-saving technique of dynamic framerate and resolution scaling is investigated. It is shown that significant power savings can be achieved with little to no impact on user experience. For saving power in a more specialized task, this dissertation investigates the use of GPS in route reconstruction apps for wearable devices. The demonstrated scheduler greatly reduces power consumption while still allowing route reconstruction.