Browsing by Author "Chen, Yiran"
Item Open Access: Accelerator Architectures for Deep Learning and Graph Processing (2020), Song, Linghao.
Deep learning and graph processing are two big-data applications that are widely applied in many domains. The training of deep learning models is essential for inference and has not yet been fully studied. With forward data propagation, error backpropagation, and gradient calculation, deep learning training is a more complicated process with higher computation and communication intensity. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. In this dissertation, I present AccPar, a principled and systematic method for determining the tensor partition across multiple heterogeneous accelerators for efficient training acceleration. Emerging resistive random access memory (ReRAM) is promising for processing in memory (PIM). For high-throughput training acceleration in ReRAM-based PIM accelerators, I present PipeLayer, an architecture for layer-wise pipelined parallelism. Graph processing is well known for poor locality and high memory bandwidth demand. In conventional architectures, graph processing incurs a significant amount of data movement and energy consumption. I present GraphR, the first ReRAM-based graph processing accelerator, which follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations at low hardware and energy cost. Sparse matrix-vector multiplication (SpMV), a subset of graph processing, is the key computation in iterative solvers for scientific computing. Efficiently accelerating floating-point processing in ReRAM remains a challenge. In this dissertation, I present ReFloat, a data format and supporting accelerator architecture for low-cost floating-point processing in ReRAM for scientific computing.
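For readers unfamiliar with the kernel, the sketch below shows sparse matrix-vector multiplication in compressed sparse row (CSR) form, the SpMV computation named above as the core of iterative solvers. It is an illustration only and is not drawn from the dissertation or from ReFloat.

```python
# Illustrative sketch: the SpMV kernel y = A @ x in CSR form, the core
# computation of the iterative solvers that ReFloat targets.
def spmv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR sparse matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Accumulate only the nonzeros stored for row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# Example: 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values  = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 7.0]
```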
Item Open Access: Advancing the Design and Utility of Adversarial Machine Learning Methods (2021), Inkawhich, Nathan Albert.
While significant progress has been made to craft Deep Neural Networks (DNNs) with super-human recognition performance, their reliability and robustness in challenging operating conditions are still a major concern. In this work, we study multiple facets of the DNN robustness problem by pursuing two main threads of research. The key methodological linkage throughout our investigations is the consistent design, development, utilization, and deployment of Adversarial Machine Learning techniques, which have remarkable abilities to both degrade and enhance model performance. Our ultimate goal is to help construct the safer and more reliable models of the future.
In the first thread of research, we take the perspective of an adversary who wishes to find novel and increasingly potent ways to fool current DNN models. Our approach is centered around the development of a feature-space attack and the construction of novel adversarial threat models that work to reduce required knowledge assumptions. Interestingly, we find that a transfer-based blackbox adversary can be significantly more powerful than previously believed, and can reliably cause targeted misclassifications with imperceptible noise. Further, we find that the attacker does not necessarily require access to the target model's training distribution to create transferable attacks, which is a more practically concerning scenario due to the reduction of required attacker knowledge.
Along the second thread of research, we take the perspective of a DNN model designer whose job is to create systems capable of robust operation in "open-world" environments, where both known and unknown target types may be encountered. Our approach is to establish a classifier + out-of-distribution (OOD) detector system co-design that is centered around an adversarial training procedure and an outlier exposure-based learning objective. Through various experiments, we find that our systems can achieve high accuracy in extended operating conditions, while reliably detecting and rejecting fine-grained OOD target types. We also develop a method for efficiently improving OOD detection by learning from the deployment environment. Overall, by exposing novel vulnerabilities of current DNNs while also improving the reliability of existing models to known vulnerabilities, our work makes significant progress towards creating the next generation of more trustworthy models.
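As background for the outlier exposure-based objective mentioned above, the sketch below shows a common formulation from the literature: standard cross-entropy on in-distribution data plus a term that pushes predictions on outlier data toward the uniform distribution. The weighting `lam` and the function name are placeholders, and the dissertation's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_in, labels_in, logits_out, lam=0.5):
    """Sketch of an outlier-exposure-style objective: cross-entropy on
    in-distribution samples plus a term encouraging a uniform (maximum-entropy)
    prediction on outlier samples."""
    ce = F.cross_entropy(logits_in, labels_in)
    # Cross-entropy to the uniform distribution over the classes.
    log_probs_out = F.log_softmax(logits_out, dim=1)
    uniform_term = -log_probs_out.mean()
    return ce + lam * uniform_term
```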
Item Open Access: Algorithm-hardware co-optimization for neural network efficiency improvement (2020), Yang, Qing.
Deep neural networks (DNNs) are widely applied across the artificial intelligence field. While the performance of DNNs is continuously improved by more complicated and deeper structures, the feasibility of deployment on edge devices remains a critical problem. In this thesis, we present algorithm-hardware co-optimization approaches that address the challenges of efficient DNN deployment from three aspects: 1) saving computational cost, 2) saving memory cost, and 3) saving data movement.
First, we present a joint regularization technique to advance the compression beyond the weights to neuron activations. By distinguishing and leveraging the significant difference among neuron responses and connections during learning, the jointly pruned network, namely JPnet, optimizes the sparsity of activations and weights. Second, to structurally regulate the dynamic activation sparsity (DAS), we propose a generic low-cost approach based on a winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level, which can be easily utilized to achieve acceleration on conventional embedded systems. The effectiveness of JPnet and DASNet has been thoroughly evaluated on various network models with different activation functions and on different datasets. Third, we propose BitSystolic, a neural processing unit based on a systolic array structure, to fully support mixed-precision inference. In BitSystolic, the numerical precision of both weights and activations can be configured in the range of 2b~8b, fulfilling different requirements across mixed-precision models and tasks. Moreover, the design can support the various data flows present in different types of neural layers and adaptively optimize data reuse by switching between the matrix-matrix mode and the vector-matrix mode. We designed and fabricated the proposed BitSystolic in a 65nm process. Our measurement results show that BitSystolic achieves a unified power efficiency of up to 26.7 TOPS/W with 17.8 mW peak power consumption across various layer types. Finally, we take a look at computing-in-memory architectures based on resistive random-access memory (ReRAM), which realize in-place storage and computation. A quantized training method is proposed to enhance the accuracy of ReRAM-based neuromorphic systems by alleviating the impact of limited parameter precision.
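As an illustration of the winners-take-all idea behind DASNet, the sketch below keeps only the top-k activations per sample and zeroes the rest. It is a minimal, element-wise version; the granularity and mechanism used in DASNet itself may differ.

```python
import torch

def wta_dropout(activations, keep_ratio=0.5):
    """Keep only the top-k activations per sample (winners take all); zero the rest.
    The resulting structured activation sparsity can be exploited for acceleration."""
    flat = activations.flatten(start_dim=1)          # (N, D)
    k = max(1, int(keep_ratio * flat.size(1)))
    topk_vals, _ = flat.topk(k, dim=1)
    thresh = topk_vals[:, -1:]                       # per-sample k-th largest value
    mask = (flat >= thresh).to(flat.dtype)
    return (flat * mask).reshape(activations.shape)
```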
Item Open Access: Boosting the Sensing Granularity of Acoustic Signals by Exploiting Hardware Non-linearity (2023), Chen, Xiangru.
Acoustic sensing is a new sensing modality that senses the contexts of human targets and our surroundings using acoustic signals. It has become a hot topic in both academia and industry owing to its finer sensing granularity and the wide availability of microphones and speakers on commodity devices. While prior studies focused on addressing well-known challenges such as increasing the limited sensing range and enabling multi-target sensing, we propose a novel scheme that leverages the non-linearity distortion of microphones to further boost the sensing granularity. Specifically, we observe the existence of a non-linear signal generated by the direct-path signal and the target reflection signal. We mathematically show that the non-linear chirp signal amplifies the phase variations, and this property can be utilized to improve the granularity of acoustic sensing. Experiment results show that, by properly leveraging the hardware non-linearity, the amplitude estimation error for sub-millimeter-level vibration can be reduced from 0.137 mm to 0.029 mm.
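A simplified second-order model helps make the claimed effect concrete: if the microphone response contains a quadratic term, the received mixture of the direct-path chirp and the target reflection produces a cross term that combines both signals. The derivation below is illustrative only and does not reproduce the dissertation's analysis; the coefficients a_1 and a_2 are assumed.

```latex
% Illustrative second-order non-linearity model (assumed coefficients a_1, a_2)
\begin{align*}
x(t) &= s_d(t) + s_r(t) \quad \text{(direct-path chirp + target reflection)} \\
y(t) &= a_1\, x(t) + a_2\, x(t)^2
      = a_1\,(s_d + s_r) + a_2\,(s_d^2 + s_r^2) + 2 a_2\, s_d(t)\, s_r(t)
\end{align*}
```

The cross term 2 a_2 s_d(t) s_r(t) is the non-linear signal generated by the direct path and the target reflection; its phase depends on both chirps, which is the property the abstract says can be exploited to amplify phase variations.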
Item Open Access: Dynamic Deep Learning Acceleration with Co-Designed Hardware Architecture (2023), Hanson, Edward Thor.
Recent advancements in Deep Learning (DL) hardware target the training and inference of static DL models, simultaneously achieving high runtime performance and efficiency. However, dynamic DL models are seen as the next step in further pushing the accuracy-performance tradeoff of DL inference and training in our favor; by reshaping the model's parameters or structure based on the input, dynamic DL models have the potential to boost accuracy while introducing marginal computation cost. As the field of DL progresses towards dynamic models, many of the advancements in DL accelerator design are eclipsed by data movement-related bottlenecks introduced by unpredictable memory access patterns and computation flow. Additionally, designing hardware for every niche task is inefficient due to the high cost of developing new hardware. Therefore, we must carefully design the DL hardware and software stack to support future, dynamic DL models by emphasizing flexibility and generality without sacrificing end-to-end performance and efficiency.
This dissertation targets algorithmic-, hardware-, and software-level optimizations to improve DL systems. Starting from the algorithm level, the robust nature of DNNs is exploited to reduce computational and data movement demand. At the hardware level, dynamic hardware mechanisms are investigated to better serve a broad range of impactful future DL workloads. At the software level, statistical patterns of dynamic models are leveraged to enhance the performance of offline and online scheduling strategies. Success of this research is measured by considering all key metrics associated with DL and DL acceleration: inference latency and accuracy, training throughput, peak memory occupancy, area efficiency, and energy efficiency.
Item Open Access: Efficient and Generalizable Neural Architecture Search for Visual Recognition (2021), Cheng, Hsin-Pai.
Neural Architecture Search (NAS) can achieve accuracy superior to human-designed neural networks because of its automated design process and search techniques. While automatically designed neural architectures can reach new state-of-the-art performance with less human crafting effort, three obstacles hinder us from building the next generation of NAS algorithms: (1) the search space is constrained, which limits its representation ability; (2) searching a large search space is time costly, which slows down the model crafting process; (3) inference of complicated neural architectures is slow, which limits deployability on different devices. To improve the search space, previous NAS works rely on existing block motifs. Specifically, previous search spaces seek the best combination of MobileNetV2 blocks without exploring sophisticated cell connections. To accelerate the searching process, a more accurate description of neural architectures is necessary. To deploy neural architectures to hardware, better adaptability is required. The dissertation proposes ScaleNAS to expand the search space so that it is adaptable to multiple vision-based tasks. The dissertation will show that NASGEM improves neural architecture representation ability to accelerate searching. Finally, we show how to integrate neural architecture search with structural pruning and mixed-precision quantization to further improve hardware deployment.
Item Open Access: Efficient and Scalable Deep Learning (2019), Wen, Wei.
Deep Neural Networks (DNNs) can achieve accuracy superior to traditional machine learning models because of their large learning capacity and the availability of large amounts of labeled data. In general, larger DNNs obtain higher accuracy. However, two obstacles hinder us from building larger DNNs: (1) inference of large DNNs is slow, which limits their deployment to small devices; (2) training large DNNs is also slow, which slows down research exploration. To remove these obstacles, this dissertation focuses on acceleration of DNN inference and training. To accelerate DNN inference, original DNNs are compressed while keeping the original accuracy. More specifically, Structurally Sparse Deep Neural Networks (SSDNNs) are proposed to remove neural components. In Convolutional Neural Networks (CNNs), neurons, filters, channels, and layers can be removed; in Recurrent Neural Networks (RNNs), hidden sizes can be reduced. The study shows that SSDNNs can achieve higher speedup than sparse DNNs with non-structured sparsity. Besides SSDNNs, Force Regularization is proposed to push DNNs toward a lower-rank space, such that DNNs can be decomposed into lower-rank architectures with fewer ranks than traditional methods. The dissertation also demonstrates that SSDNNs and Force Regularization are orthogonal and can be combined for higher speedup. To accelerate DNN training, distributed deep learning is required. However, two problems hinder us from using more compute nodes for higher training speed: the Communication Bottleneck and the Generalization Gap. The Communication Bottleneck arises because communication time increases and dominates when distributed systems scale to many compute nodes. To reduce gradient communication in Stochastic Gradient Descent (SGD), SGD with low-precision gradients (TernGrad) is proposed. Moreover, in distributed deep learning, a large batch size is required to exploit system computing power; unfortunately, accuracy decreases when the batch size is very large, which is referred to as the Generalization Gap. One hypothesis explaining the Generalization Gap is that large-batch SGD gets stuck at sharp minima. The dissertation proposes a stochastic smoothing method (SmoothOut) to escape sharp minima. The dissertation will show that TernGrad overcomes the Communication Bottleneck and SmoothOut helps to close the Generalization Gap.
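As background on the TernGrad idea summarized above (low-precision gradients for reduced communication), the sketch below shows the commonly described stochastic ternarization step, which maps each gradient element to one of three values while remaining unbiased in expectation. Layer-wise scaling and gradient clipping details of the actual method are omitted.

```python
import torch

def ternarize(grad):
    """TernGrad-style stochastic ternarization (sketch): map each gradient element
    to {-s, 0, +s}, where s = max|g|, keeping element i with probability |g_i| / s
    so the quantized gradient is an unbiased estimate of the original."""
    s = grad.abs().max()
    if s == 0:
        return torch.zeros_like(grad)
    prob = grad.abs() / s              # keep probability per element
    keep = torch.bernoulli(prob)       # stochastic mask in {0, 1}
    return s * grad.sign() * keep
```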
Item Open Access: Efficient Neural Network Based Systems on Mobile and Cloud Platforms (2020), Mao, Jiachen.
In recent years, machine learning, especially neural networks, has gained unprecedented influence in both academia and industry.
The reason lies in the state-of-the-art performance of neural networks on many critical applications such as object detection, translation, and games. However, the deployment of neural network models on resource-constrained devices (e.g., edge devices) is challenged by their heavy memory and computing cost during execution. Many efforts have been made in the literature toward efficient execution of neural networks, spanning the perspectives of hardware, software, and algorithms.
My research during my Ph.D. study focuses mainly on software and algorithms targeting mobile platforms. More specifically, we emphasize the system design, system optimization, and model compression of neural networks for a better mobile user experience. From the system design perspective, we first propose MoDNN, a local distributed mobile computing system for DNN testing. MoDNN can partition already-trained DNN models onto several mobile devices to accelerate DNN computations by alleviating device-level computing cost and memory usage. Two model partition schemes are also designed to minimize non-parallel data delivery time, including both wakeup time and transmission time. Then, we propose AdaLearner, an adaptive local distributed mobile computing system for DNN training. To exploit the potential of our system, we adapt the neural network training phase to device-wise resources and drastically decrease the transmission overhead for better system scalability. From the system optimization perspective, we propose MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is based on a state-of-the-art video detection framework called Deep Feature Flow (DFF). MobiEye optimizes DFF with three system-level optimization methods. From the model compression perspective, we propose TPrune, a model analysis and pruning framework for the Transformer. In TPrune, we first propose Block-wise Structured Sparsity Learning (BSSL) to analyze Transformer model properties. Then, based on the characteristics derived from BSSL, we apply Structured Hoyer Square (SHS) to derive the final compressed models. The projects realized during my Ph.D. study contribute to current research on efficient neural network execution and thus lead to more user-friendly and smart applications on edge devices for more users.
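For reference, the Hoyer-Square regularizer underlying SHS has a compact form: the squared L1 norm divided by the squared L2 norm, which is scale-invariant and favors sparse solutions. The sketch below is a generic rendering; how TPrune groups Transformer weights into blocks and weights the penalty is not reproduced here.

```python
import torch

def hoyer_square(w, eps=1e-8):
    """Hoyer-Square regularizer (sum|w|)^2 / (sum w^2): scale-invariant and
    minimized by sparse tensors."""
    return w.abs().sum() ** 2 / (w.pow(2).sum() + eps)

def structured_hoyer_square(weight_blocks, eps=1e-8):
    """Sketch of a structured variant: compute one L2 norm per weight block
    (e.g., per attention head), then apply Hoyer-Square over the block norms
    so whole blocks are driven toward zero."""
    norms = torch.stack([b.norm() for b in weight_blocks])
    return norms.sum() ** 2 / (norms.pow(2).sum() + eps)
```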
Item Open Access: Enable Intelligence on Billion Devices with Deep Learning (2022), Li, Ang.
With the proliferation of edge computing and the Internet of Things (IoT), billions of edge devices (e.g., smartphones, AR/VR headsets, autonomous cars, etc.) are deployed in our daily life and constantly generate gigantic amounts of data at the network edge. Bringing deep learning to such huge volumes of data will boost many novel applications and services in the edge ecosystem and fuel the continued boom of artificial intelligence (AI). Driven by this motivation, there is an urgent need to push the AI frontier to the network edge in order to fully exploit the big data residing on edge devices.
However, empowering edge intelligence with AI, especially deep learning, is technically challenging due to several critical challenges, including privacy, efficiency, and performance. Conventional wisdom requires edge devices to transmit data to cloud datacenters for training and inference, but moving a huge amount of data is prohibitive because of cost, high transmission delay, and privacy leakage. The emerging federated learning (FL) paradigm is a promising distributed learning approach that enables massive numbers of devices to collaboratively learn a machine learning model (e.g., a deep neural network) without explicitly sharing data, and hence the privacy concerns caused by data sharing in centralized learning can be mitigated. However, FL faces critical challenges that hinder its deployment on edge devices, such as communication cost and data heterogeneity.
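For context, the sketch below shows the classic federated averaging (FedAvg) aggregation step that FL systems build on: clients train locally and the server combines their parameters weighted by local data size, so raw data never leaves the devices. The frameworks proposed in the dissertation add communication/computation efficiency and heterogeneity handling beyond this baseline.

```python
def fedavg(client_states, client_sizes):
    """Federated averaging (sketch): aggregate client model parameters into a
    global model, weighting each client by its local dataset size. Only model
    parameters are communicated; raw data stays on the devices."""
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```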
Once we obtain a learned machine learning model, the next step is to deploy the model for serving applications and services. One straightforward approach is to deploy the model on the device and perform inference locally. Unfortunately, on-device AI often suffers from poor performance because most AI applications require high computational power, which is technically unaffordable for resource-constrained edge devices. Edge computing pushes cloud services from the network core to the network edge, and hence bridging devices with edge servers can alleviate the computational cost of running AI models on the device alone. However, such a collaborative deployment scheme inevitably incurs transmission delay and raises privacy concerns due to data movement between devices and edge servers. For example, the device can send the features extracted from raw data (e.g., images) to the cloud where a pre-trained machine learning model is deployed, but these extracted features can still be exploited by attackers to recover the raw data and to infer embedded private attributes (e.g., age, gender, etc.).
In this dissertation, I start by presenting a privacy-respecting data crowdsourcing framework for deep learning to address the privacy issue in centralized training. Then, I shift the setting from the centralized one to the decentralized environment, where three novel FL frameworks are proposed to jointly improve communication and computation efficiency while handling the heterogeneous data across devices. In addition to improving learning on large-scale edge devices, I also design an efficient edge-assisted photorealistic video style transfer system for mobile phones by leveraging the collaboration between smartphones and the edge server. Finally, in order to mitigate the privacy concerns caused by the data movement in the collaborative system, an adversarial training framework is proposed to prevent the adversary from reconstructing the raw data and inferring private attributes.
Item Open Access: From Labeled to Unlabeled Data: Understand Deep Visual Representations under Causal Lens (2023), Yang, Yuewei.
Deep vision models have been highly successful in various computer vision applications such as image classification, segmentation, and object detection. These models encode visual data into low-dimensional representations, which are then utilized in downstream tasks. Typically, the most accurate models are fine-tuned using fully labeled data, but this approach may not generalize well to different applications. Self-supervised learning has emerged as a potential solution to this issue, where the deep vision encoder is pretrained with unlabeled data to learn more generalized representations. However, the underlying mechanism governing the generalization and specificity of representations calls for deeper understanding. Causality is an important concept in visual representation learning, as it can help improve the generalization of models by providing a deeper understanding of the underlying relationships between features and objects in the visual world.
Through the works presented in this dissertation, we provide a causal interpretation of the mechanism underlying deep vision models' ability to learn representations in both labeled and unlabeled environments, and we improve the generalization and specificity of extracted representations through the interpreted causal factors. Specifically, we tackle the problem from four aspects: Causally Interpret Supervised Deep Vision Models; Supervised Learning with Underlabeled Data; Self-supervised Learning with Unlabeled Data; and Causally Understand Unsupervised Visual Representation Learning.
Firstly, we interpret the prediction of a deep vision model by identifying causal pixels in the input images via 'inverting' the model weights. Secondly, we recognize the challenges of learning an accurate object detection model with missing labels in the dataset, and we address this underlabeled-data issue by adopting a positive-unlabeled learning approach instead of the positive-negative approach. Thirdly, we focus on improving both the generalization and specificity of unsupervised representations based on prior causal relations. Finally, we enhance the stability of the unsupervised representations during inference by intervening on data variables under a well-constructed causal framework.
We establish a causal relationship between deep vision models and their input/output for different applications with (partially) labeled data, and strengthen generalized representations through extensive analytical understanding of unsupervised representation learning under various hypothesized causal frameworks.
Item Open Access: Highly Efficient Neuromorphic Computing Systems With Emerging Nonvolatile Memories (2020), Yan, Bonan.
Emerging nonvolatile memory based hardware neuromorphic computing systems have enabled the implementation of general vector-matrix multiplication in a manner that fuses computation and memory at the same physical location. However, there remain three major challenges in designing such neuromorphic computing systems for high efficiency in large-scale integration: (a) the analog/digital interface circuits dominate the power and area in such mixed-signal designs; (b) the systems are highly customized and, once developed, can only compute one class of neural network models; (c) non-ideal device properties largely forfeit the benefit in terms of computational efficiency.
Designs of mixed-signal interface circuitry have been extensively studied, but a holistic design approach regarding very-large-scale integration, involving circuit design, microarchitecture, and hardware/software co-simulation, is overlooked for emerging nonvolatile memory based neuromorphic computing systems. The realization of such neuromorphic computing platforms requires: (a) efficient interface circuits as well as execution models; (b) appropriate reconfigurability at runtime for different neural network architectures; and (c) reliability enhancement methods to withstand imperfect fabrication and harsh working environments.
Motivated by these demands, this dissertation first introduces an implementation scheme for neuromorphic computing systems that use emerging nonvolatile memory as synapses and CMOS integrated circuits as neurons. To save the energy consumption of data communication, the neuron circuits improve upon conventional integrate-and-fire neuron circuits for better current-to-spike conversion efficiency. Trade-offs between throughput and latency are investigated and validated by a prototype 64Kb Resistive Random Access Memory based in-memory computing processing engine.
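To make the fused compute-and-memory operation concrete, the sketch below gives an idealized model of analog vector-matrix multiplication on a crossbar: weights are mapped to a differential pair of cell conductances, inputs are applied as voltages, currents sum along the columns, and an ADC digitizes the result. All parameters (g_max, adc_bits) are illustrative placeholders, not values from the prototype.

```python
import numpy as np

def crossbar_vmm(weights, x, g_max=1e-4, adc_bits=8):
    """Idealized sketch of analog vector-matrix multiplication on a crossbar:
    I = G^T V, with signed weights mapped to a differential conductance pair and
    column currents digitized by a uniform ADC (the costly interface circuit)."""
    # Map signed weights onto two conductance arrays in [0, g_max].
    w_max = np.abs(weights).max() + 1e-12
    g_pos = np.clip(weights, 0, None) / w_max * g_max
    g_neg = np.clip(-weights, 0, None) / w_max * g_max
    i_out = x @ g_pos - x @ g_neg          # ideal analog current summation
    # Uniform ADC quantization of the column currents.
    i_range = np.abs(i_out).max() + 1e-12
    levels = 2 ** adc_bits
    return np.round(i_out / i_range * (levels / 2)) / (levels / 2) * i_range

x = np.array([0.2, 0.8, 0.5])
w = np.array([[0.1, -0.4], [0.3, 0.2], [-0.5, 0.6]])
print(crossbar_vmm(w, x))
```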
Next, this dissertation proposes a fully memristive neuromorphic computing system architecture that incorporates the Mott memristor as the neuron circuit. The small footprint and intrinsic bionic dynamics of emerging memory-based neuron circuits significantly reduce design complexity. This dissertation investigates and models the randomness that Mott memristors exhibit. By suppressing this randomness during inference and exploiting it during learning, the proposed system is optimized to balance inference accuracy and training efficiency.
Moreover, this dissertation advances the reconfigurability of emerging memory based neuromorphic computing systems by presenting a paradigm that supports post-fabrication switching between spiking and non-spiking neural network model execution. An improved version of time-to-first-spike temporal encoding is proposed that uses single spikes to accelerate execution.
Finally, this dissertation presents hardware/software co-design techniques for the implementation of neuromorphic computing systems with emerging nonvolatile memories. A hardware/software co-simulation flow is developed, and based on it, this dissertation also proposes a closed-loop design that enhances weight stability against read disturbance.
In summary, this dissertation tackles important problems in designing neuromorphic computing systems with emerging nonvolatile memories. The outcome of this research is expected not only to pave the way for realizing highly efficient artificial intelligence hardware, but also to shorten the product development cycle.
Item Open Access: Improving the Efficiency and Robustness of In-Memory Computing in Emerging Technologies (2023), Yang, Xiaoxuan.
Emerging technologies, such as resistive random-access memory (ReRAM), have proven their potential in in-memory computing for deep learning applications. My dissertation work focuses on improving the efficiency and robustness of in-memory computing in emerging technologies.
Existing ReRAM-based processing-in-memory (PIM) designs can support the inference and training of neural networks such as convolutional neural networks and recurrent neural networks. However, these designs suffer from the re-writing procedure required for the self-attention calculation. Therefore, I propose an architecture that enables an efficient self-attention mechanism in PIM designs. The optimized calculation procedure and finer-granularity pipeline design improve efficiency. The contributions lie in enabling feasible and efficient ReRAM-based PIM designs for attention-based models.
Inference with ReRAM-based designs has one severe problem: inference accuracy can be degraded by non-idealities in the hardware devices. The robustness of previous methods has not been validated under combinations of stochastic device noise. With the proposed hardware-aware training method, the robustness of inference accuracy can be improved. Besides, targeting both hardware efficiency and inference robustness, a multi-objective optimization method is developed to explore the design space and generate high-quality Pareto-optimal design configurations with minimal cost. This work integrates attributes from the design space and the evaluation space and develops efficient hardware-software co-design methods.
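A common formulation of hardware-aware training, shown below as a sketch, injects multiplicative noise into the weights during the forward pass so the network learns to tolerate device variation. The noise model and training procedure used in the dissertation may differ; rel_sigma is a placeholder.

```python
import torch

def noisy_forward(layer, x, rel_sigma=0.05):
    """Hardware-aware training sketch: perturb a linear layer's weights with
    multiplicative Gaussian noise during the forward pass to mimic ReRAM device
    variation, so the trained network becomes robust to it. The clean weights
    are what the optimizer updates."""
    noise = 1.0 + rel_sigma * torch.randn_like(layer.weight)
    return torch.nn.functional.linear(x, layer.weight * noise, layer.bias)
```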
Training with ReRAM-based designs faces a challenging endurance problem due to the frequent weight updates required for neural network training. The goal of endurance management is to decrease the number of weight updates and balance the write accesses. The proposed endurance-aware training method utilizes gradient structure pruning and dynamically adjusts the write probabilities in a structured manner. This method can extend the lifetime of ReRAM during the training process.
In summary, the research above targets realizing efficient self-attention mechanisms and solving the accuracy degradation and endurance problems of the inference and training processes. The effort also lies in identifying the challenging parts of each topic and developing hardware-software co-designs that consider both efficiency and robustness. The developed designs are potential solutions to the challenging problems of in-memory computing in emerging technologies.
Item Open Access: In-Memory Computing Architecture for Deep Learning Acceleration (2020), Chen, Fan.
The ever-increasing demands of deep learning applications, especially the more powerful but intensive unsupervised deep learning models, overwhelm the computation capability, communication capability, and storage capability of modern general-purpose CPUs and GPUs. To accommodate the memory and computing requirements, multi-core systems that make intensive use of accelerators have become the future of computing. Such novel computing systems incur new challenges, including architectural support for model training in the accelerators, large cache demands for multi-core processors, and system performance, energy, and efficiency. In this thesis, I present my research that addresses these challenges by leveraging emerging memory and logic devices, as well as advanced integration technologies. In the first work, I present the first training accelerator architecture, ReGAN, for unsupervised deep learning. ReGAN follows the processing-in-memory strategy by leveraging the energy efficiency of resistive memory arrays for in-situ deep learning execution. I propose an efficient pipelined training procedure to reduce on-chip memory access. In the second work, I present ZARA to address the resource underutilization due to a new operator, namely transposed convolution, used in unsupervised learning models. ZARA improves system efficiency through a novel computation deformation technique. In the third work, I present MARVEL, which targets improved power efficiency over previous resistive accelerators. MARVEL leverages monolithic 3D integration technology by stacking multiple layers of low-power analog/digital conversion circuits implemented with carbon nanotube field-effect transistors. The area-consuming eDRAM buffers are replaced by dense cross-point Spin Transfer Torque Magnetic RAM. I explored the design space and demonstrated that MARVEL can provide further improved power efficiency with an increased number of integration layers. In the last piece of work, I propose the first holistic solution for employing skyrmion racetrack memory as last-level caches for future high-capacity cache design. I first present a cache architecture and a physical-to-logical mapping scheme based on a comprehensive analysis of the working mechanism of skyrmion racetrack memory. Then I model the impact of process variations and propose a process variation aware data management technique to minimize the performance degradation incurred by process variations.
Item Open Access: Intelligent Circuit Design and Implementation with Machine Learning (2022), Xie, Zhiyao.
Electronic design automation (EDA) technology has achieved remarkable progress over the past decades. However, modern chip design is still not completely automatic in general, and the gap is not easily surmountable. For example, the chip design flow is still largely restricted to individual point tools with limited interplay across tools and design steps. Tools applied at early steps cannot easily judge whether their solutions will eventually lead to satisfactory designs, inevitably leading to over-pessimistic design or significantly longer turnaround time. While these challenges have long been unsolved, the ever-increasing complexity of integrated circuits (ICs) leads to even more stringent design requirements. Therefore, there is a compelling need for essential improvement in existing EDA techniques.
The stagnation of EDA technologies stems from insufficient knowledge reuse. In practice, very similar simulation or optimization results may need to be repeatedly constructed from scratch. This motivates my research on introducing more "intelligence" to EDA with machine learning (ML), which explores complex correlations in design flows based on prior data. Besides design time, I also propose ML solutions that boost IC performance by assisting circuit management at runtime.
In this dissertation, I present multiple fast yet accurate ML models covering a wide range of chip design stages from the register-transfer level (RTL) to sign-off, solving primary chip-design problems concerning power, timing, interconnect, IR drop, routability, and design flow tuning. Targeting the RTL stage, I present APOLLO, a fully automated power modeling framework. It constructs an accurate per-cycle power model by extracting the most power-correlated signals. The model can further be implemented on chip for runtime power management with unprecedentedly low hardware cost. Targeting gate-level netlists, I present Net2 for early estimation of post-placement wirelength. It further enables more accurate timing analysis without actual physical design information. Targeting circuit layout, I present RouteNet for early routability prediction. As the first deep learning-based routability estimator, some of its feature-extraction and model-design principles have been widely adopted by later works. I also present PowerNet for fast IR drop estimation. It captures spatial and temporal information about power distribution with a customized CNN architecture. Lastly, beyond targeting a single design step, I present FIST to efficiently tune design flow parameters during both logic synthesis and physical design.
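As a rough illustration of per-cycle power modeling from RTL signals, the sketch below ranks signals by the correlation of their per-cycle toggle counts with measured power and fits a linear model on the top few. This is a simple stand-in for the general idea, not APOLLO's actual signal-selection or training algorithm.

```python
import numpy as np

def fit_power_proxy(toggle_matrix, measured_power, num_signals=10):
    """Sketch of a per-cycle power proxy: pick the RTL signals whose per-cycle
    toggle activity correlates most with measured power, then fit a linear
    model on just those signals. toggle_matrix has shape (cycles, signals)."""
    # Rank signals by absolute correlation with per-cycle power.
    corr = np.nan_to_num(np.array([
        np.corrcoef(toggle_matrix[:, j], measured_power)[0, 1]
        for j in range(toggle_matrix.shape[1])
    ]))
    top = np.argsort(-np.abs(corr))[:num_signals]
    # Least-squares fit: power ~= toggles[:, top] @ coeffs + bias.
    A = np.hstack([toggle_matrix[:, top], np.ones((toggle_matrix.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(A, measured_power, rcond=None)
    return top, coeffs
```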
Item Open Access: Joint Optimization of Algorithms, Hardware, and Systems for Efficient Deep Neural Networks (2024), Li, Shiyu.
Deep learning has enabled remarkable performance breakthroughs across various domains, including computer vision, natural language processing, and recommender systems. However, the typical deep neural network (DNN) models employed in these applications require millions of parameters and billions of operations, leading to substantial computational and memory requirements. While researchers have proposed compression methods, optimized frameworks, and specialized accelerators to improve efficiency, outstanding challenges persist, limiting the achievable gains.
A fundamental challenge lies in the inherent irregularity and sparsity of DNNs. Although these models exhibit significant sparsity, with a considerable fraction of weights and activations being zero or near-zero values, exploiting this sparsity efficiently on modern hardware is problematic due to the irregular distribution of non-zero elements. This irregularity leads to substantial overhead in indexing, gathering, and processing sparse data, resulting in poor utilization of computational and memory resources. Furthermore, recent research has identified a significant gap between the theoretical and practical improvements achieved by compression methods. Additionally, emerging DNN architectures with novel operators often nullify previous optimization efforts in software frameworks and hardware accelerators, necessitating continuous adaptation.
To address these critical challenges, this dissertation builds a holistic approach that jointly optimizes algorithms, hardware architectures, and system designs to enable efficient deployment of DNNs in the presence of irregularity and sparsity. On the algorithm level, a novel hardware-friendly compression method based on matrix decomposition is proposed. The original convolutional kernels are decomposed into common basis kernels and a series of coefficients, with conventional pruning applied to the coefficients. The compressed DNN forms a hardware-friendly structure in which the sparsity pattern is shared across input feature map pixels, alleviating the cost of sparse pattern processing.
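The sketch below illustrates the general decomposition idea: approximate a layer's kernels as sparse combinations of a small set of shared basis kernels, here obtained with an off-the-shelf SVD followed by magnitude pruning. The dissertation's decomposition and pruning criteria are not reproduced; num_basis and prune_ratio are placeholders.

```python
import numpy as np

def decompose_kernels(kernels, num_basis, prune_ratio=0.7):
    """Sketch of the compression idea: approximate convolutional kernels as
    sparse combinations of shared basis kernels. SVD supplies the basis and
    coefficients here; the smallest coefficients are pruned away."""
    n, k = kernels.shape[0], kernels.shape[1] * kernels.shape[2]  # (N, kh, kw)
    flat = kernels.reshape(n, k)
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    basis = vt[:num_basis]                       # shared basis kernels (num_basis, kh*kw)
    coeffs = u[:, :num_basis] * s[:num_basis]    # per-kernel coefficients (N, num_basis)
    # Prune the smallest-magnitude coefficients to get the sparse pattern.
    thresh = np.quantile(np.abs(coeffs), prune_ratio)
    coeffs[np.abs(coeffs) < thresh] = 0.0
    return basis, coeffs                         # reconstruct with coeffs @ basis
```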
On the hardware level, a novel sparse DNN accelerator is introduced to support the inference of the compressed DNN. Low-precision quantization is applied to sparse coefficients, and high-precision to basis kernels. By involving only low-precision coefficients in sparse processing, the hardware efficiently matches non-zero weights and activations using inverted butterfly networks. The shared basis kernels and sparse coefficients significantly reduce buffer size and bandwidth requirements, boosting performance and energy efficiency.
At the system level, a near-data processing framework is proposed to address the challenge of training large DNN-based recommendation models. This framework adopts computational storage devices and coherent system interconnects to partition the model into subtasks. Data-intensive embedding operations run on computational storage devices with customized memory hierarchies, while compute-intensive feature processing and aggregation operations are assigned to GPUs for maximum efficiency. This framework enables training large DNN-based recommendation models without expensive hardware investments.
Through joint optimization across algorithms, hardware architectures, and system designs, this research aims to overcome the limitations imposed by irregularity and sparsity, enabling efficient deployment of DNNs in a broad range of applications and resource-constrained environments. By addressing these critical issues, this work paves the way for fully harnessing the potential of deep learning technologies in practical settings.
Item Open Access: On Impact of Network Architecture for Deep Learning (2023), Fu, Hao.
The architecture of neural networks is a crucial factor in the success of deep learning models across a range of fields, including computer vision and natural language processing (NLP). Specific architectures are tailored to address particular tasks, and the selection of architecture can significantly affect the training process, model performance, and robustness.
In the field of NLP, we address the training deficiency of text VAEs with autoregressive decoders through two approaches. First, we introduce a cyclical annealing schedule that enables progressive learning of meaningful latent codes by leveraging informative representations from previous cycles as warm restarts. Second, we propose semi-implicit (SI) representations for the latent distributions of natural languages, which extend the commonly used Gaussian distribution family by mixing the variational parameter with a flexible implicit distribution. Our proposed methods are demonstrated to be effective in text generation tasks such as dialog response generation, with significant performance improvements compared to other training techniques.
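A minimal sketch of a cyclical annealing schedule of the kind described above is shown below: the KL weight ramps from 0 to 1 within each cycle and then restarts, giving the warm-restart effect. The exact schedule shape and hyperparameters used in the work may differ; the loss names in the usage comment are hypothetical.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_fraction=0.5):
    """Cyclical annealing for the KL weight beta in a text VAE (sketch).
    Within each cycle, beta ramps linearly from 0 to 1 over the first
    `ramp_fraction` of the cycle, then stays at 1; it restarts at 0 each cycle."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len      # position within current cycle, in [0, 1)
    return min(1.0, pos / ramp_fraction)

# Example usage inside a training loop (hypothetical names):
# loss = reconstruction_loss + cyclical_beta(step, total_steps) * kl_divergence
```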
In the field of computer vision, we investigate the intrinsic influence of network structure on a model’s robustness in addressing data distribution shifts. We propose a novel paradigm, Dense Connectivity Search of Outlier Detector (DCSOD), that automatically explores the dense connectivity of CNN architectures on Out-of-Distribution (OOD) detection tasks using Neural Architecture Search (NAS). To improve the quality of evaluation on OOD detection during the search, we propose evolving distillation based on our multi-view feature learning explanation. Experimental results show that DCSOD achieves remarkable performance over widely used architectures and previous NAS baselines.
Item Open Access: Practical Solutions to Neural Architecture Search on Applied Machine Learning (2024), Zhang, Tunhou.
The advent of Artificial Intelligence (AI) propels the real world into a new era characterized by remarkable design innovations and groundbreaking design automation, primarily fueled by Deep Neural Networks (DNN). At the heart of this transformation is the progress in Automated Machine Learning (AutoML), notably Neural Architecture Search (NAS). NAS lays a robust foundation for developing algorithms capable of automating design processes to determine the optimal architecture for academic benchmarks. However, the real challenge emerges when adapting NAS for Applied Machine Learning (AML) scenarios: navigating the complex terrain of design space exploration and exploitation. This complexity arises due to the heterogeneity of data and architectures required by real-world AML problems, an aspect that traditional NAS approaches struggle to address fully.
To bridge this gap, our research emphasizes creating a flexible search space that reduces reliance on human-derived architectural assumptions. We introduce innovative techniques aimed at refining search algorithms to accommodate greater flexibility. By carefully examining and enhancing search spaces and methodologies, we empower NAS solutions to cater to practical AML problems. This enables the exploration of broader search spaces, better performance potential, and lower search process costs.
We start by challenging homogeneous search space design for multi-modality 3D representations, proposing "PIDS" to enable joint dimension and interaction search for 3D point cloud segmentation. We consider two axes for adapting point cloud operators toward multi-modality data with density, geometry, and order varieties, achieving significant mIOU improvement on segmentation benchmarks over state-of-the-art 3D models. To implement our approach efficiently in recommendation systems, we develop "NASRec" to support heterogeneous building operators and propose practical solutions to improve the quality of NAS for Click-Through Rate (CTR) prediction. We propose an end-to-end full architecture search with minimal human priors. We provide practical solutions to tackle the scalability and heterogeneity challenges in NAS, outperforming manually designed models and existing NAS models on various CTR benchmarks. Finally, we pioneer an effort on industry-scale CTR benchmarks and propose DistDNAS to optimize search and serving efficiency, producing smaller and better recommendation models on a large-scale CTR benchmark. Inspired by the discoveries in NAS, we additionally uncover the underlying theoretical foundations of residual learning in computer vision foundation research and envision the prospects of our research in Artificial Intelligence, including Large Language Models, Generative AI, and beyond.
Item Open Access: Secure and Power-Efficient Computing on Mobile Platforms (2019), Nixon, Kent Windsor.
Mobile devices have been the driving force behind the electronics industry for over a decade. Compared to more traditional computing systems such as desktop or laptop computers, these devices prioritize ease of use and portability over raw compute power or extensible input methodologies. This change in focus results in devices that are generally small in size, regularly transported (and forgotten), and using greatly simplified user interfaces. The main challenges with such devices become 1) securing the data produced by and stored on them, and 2) minimizing power consumption during operation in order to prolong limited battery life.
With respect to the first of these two challenges, the first research goal of this dissertation is to identify and develop robust and transparent methodologies both for authenticating a user to a device and for securing data stored on or generated by these devices. For securing data produced by and stored on mobile devices, consideration must be given to both user authentication and data integrity. For this dissertation, a novel means of user authentication based on device interaction is examined. The detailed gesture-based authentication scheme is shown to have high accuracy while requiring no additional input from the user beyond utilizing the device. Additionally, for securing data stored on the device post-authentication, this dissertation explores alternate methodologies for detecting adversarial noise added to user images. The discussed methodology is shown to have high attack-detection accuracy while remaining computationally efficient.
With respect to the second challenge, the second research goal of this dissertation is to examine alternative, more computationally and power-efficient methodologies for accomplishing existing tasks, tailored around the unique capabilities and limitations of mobile devices. For this dissertation, a general-case power-saving technique of dynamic framerate and resolution scaling is investigated. It is shown that significant power savings can be achieved with little to no impact on user experience. For saving power in a more specialized task, this dissertation investigates the use of the GPS in route reconstruction apps for wearable devices. The demonstrated scheduler greatly reduces power consumption while still allowing route reconstruction.
Item Open Access: Security and Robustness in Neuromorphic Computing and Deep Learning (2020), Yang, Chaofei.
Machine learning (ML) has been advancing rapidly over the recent decade. Among many ML algorithms, neural networks (NNs) and neuromorphic computing systems (NCSs), inspired by biological neural systems, achieve state-of-the-art performance. With the development of computing resources and big data, deep neural networks (DNNs), also known as deep learning (DL), are applied in various applications such as image recognition and detection, feature extraction, and natural language processing. However, novel security threats are introduced in these applications. Attackers try to steal, bug, and destroy the models, incurring immeasurable losses. We do not fully understand these threats yet, because NNs are black boxes and under active development. The complexity of NNs also exposes more vulnerabilities than traditional ML algorithms. To address these security threats, this dissertation focuses on identifying novel security threats against NNs and revisiting traditional issues from the NN perspective. We also grasp the key elements of these attacks, explore their variations, and develop robust defenses against them.
One of our works aims at preventing attackers with physical access from learning the proprietary algorithm implemented by the neuromorphic hardware, i.e., a replication attack. For this purpose, we leverage the obsolescence effect in memristors to judiciously reduce the accuracy of outputs for any unauthorized user. Our methodology is verified to be compatible with mainstream classification applications, memristor devices, and security and performance constraints. In many applications, public data may be poisoned when being collected as inputs for re-training DNNs. Although poisoning attacks against support vector machines (SVMs) have been extensively studied, we still have very limited knowledge and understanding of how such an attack can be implemented against neural networks. Thus, we examine the possibility of directly applying a gradient-based method to generate poisoned samples against neural networks. We then propose a generative method to accelerate the generation of poisoned samples while maintaining a high attack efficiency. Experiment results show that the generative method can significantly accelerate the generation rate of the poisoned samples compared with the numerical gradient method, with marginal degradation in model accuracy. Deepfake represents a category of face-swapping attacks that leverage machine learning models such as autoencoders or generative adversarial networks. Various detection techniques for Deepfake attacks have been explored. These methods, however, are passive measures against Deepfakes, as they are mitigation strategies applied after the high-quality fake content is generated. More importantly, we would like to think ahead of the attackers with robust defenses. This work aims to take an offensive measure to impede the generation of high-quality fake images or videos. We propose to use novel transformation-aware adversarially perturbed faces as a defense against GAN-based Deepfake attacks. Additionally, we explore techniques for data preprocessing and augmentation to enhance models' robustness. Specifically, we leverage convolutional neural networks (CNNs) to automate the wafer inspection process and propose several techniques to preprocess and augment wafer images to enhance our model's generalization to unseen wafers (e.g., from other fabs).
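The sketch below illustrates, in a single gradient step, the general idea of adding an adversarial "protective" perturbation to a face so that a differentiable Deepfake generator produces degraded output. The names generator, loss_fn, and epsilon are assumed placeholders, and the dissertation's transformation-aware attack is considerably more elaborate than this one-step example.

```python
import torch

def protective_perturbation(face, generator, loss_fn, epsilon=8 / 255):
    """One-step sketch of an adversarial protective perturbation: move the face
    image within an epsilon-ball in the direction that increases the generator's
    loss, degrading the quality of the fake output it produces."""
    face = face.clone().detach().requires_grad_(True)
    output = generator(face)          # fake image produced from the face (assumed differentiable)
    loss = loss_fn(output, face)      # e.g., a distortion measure between fake and source
    loss.backward()
    # FGSM-style step along the sign of the gradient.
    perturbed = face + epsilon * face.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```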
Item Embargo: Testing and Fault Diagnosis Solutions for Monolithic 3D ICs (2024), Hung, Shao-Chun.
As Moore’s Law hits physical limits, three-dimensional (3D) integration constitutes a promising technology to continue power, performance, and area (PPA) improvement. Among modern 3D integration technologies, monolithic 3D (M3D) has attracted a lot of attention because it offers better performance and lower power consumption compared to conventional 2D integrated circuits (ICs). However, the benefits of M3D integration are accompanied by new challenges. Recent research has shown that low-temperature manufacturing processes necessary for upper-tier fabrication can cause performance mismatch between device tiers. Interconnects between tiers, referred to as monolithic inter-tier vias (MIVs), are prone to defects due to the surface roughness of the inter-tier dielectric. These M3D-specific defects tend to be manifested in the form of delay faults that impact circuit timing. Moreover, power supply noise (PSN) is another concern for M3D ICs because of high power and current densities. Excessive voltage droop during delay testing may cause good chips to fail on the tester and lead to yield loss.
Stacking memory on logic is one of the major applications of M3D integration. Combining M3D with the emerging resistive random-access memory (RRAM) has been shown to achieve extremely high memory density and improve power efficiency. However, both M3D and RRAM suffer from high defect rates due to immature manufacturing processes and process variations. Testing and fault diagnosis of memory-on-logic designs are therefore important to facilitate yield learning and shorten the time-to-market.
Motivated by the aforementioned challenges, this dissertation focuses on developing effective testing and diagnosis solutions for M3D ICs. The dissertation first addresses the PSN-induced yield loss problem through test pattern reshaping. The dissertation presents an analysis framework to identify test patterns that are most likely to lead to yield loss. These patterns are then reshaped through two distinct algorithms based on integer linear programming (ILP) and simulated annealing (SA). Simulation results show that PSN-induced yield loss is eliminated with reshaped patterns. The dissertation also employs two design-for-test (DfT) methodologies, namely test point insertion (TPI) and scan segmentation, to minimize switching activity during testing. The TPI framework leverages reinforcement learning (RL) to determine the optimal types and locations of test points (TPs) for test power reduction; the RL-based scan segmentation framework effectively partitions scan D-flip-flops (SDFFs) into segments and assigns enable signals to control these segments. Both frameworks have been demonstrated to ensure power-safe testing for M3D ICs without any adverse impact on test coverage.
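As an illustration of the simulated annealing option mentioned above, the sketch below perturbs don't-care bits of a test pattern to reduce a switching-activity cost, occasionally accepting worse solutions at high temperature. Both switching_cost and flip_candidates are assumed stand-ins for the framework's actual cost model and coverage constraints.

```python
import math
import random

def sa_reshape(pattern, switching_cost, flip_candidates, steps=1000, t0=1.0, alpha=0.995):
    """Simulated-annealing sketch for reshaping a test pattern (list of 0/1 bits)
    to reduce switching activity and hence power supply noise. Only bits listed
    in flip_candidates (assumed don't-care positions) are ever flipped."""
    best = cur = list(pattern)
    best_cost = cur_cost = switching_cost(cur)
    t = t0
    for _ in range(steps):
        cand = list(cur)
        bit = random.choice(flip_candidates)
        cand[bit] ^= 1                              # flip one don't-care bit
        cost = switching_cost(cand)
        # Always accept improvements; accept worse moves with temperature-scaled probability.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / max(t, 1e-9)):
            cur, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        t *= alpha
    return best, best_cost
```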
Next, the dissertation presents a fault localization framework using graph neural networks (GNNs) to identify tier-level fault locations during diagnosis. Leveraging circuit netlists and failure log files from the tester, the GNN-based framework efficiently localizes faults to device tiers, offering rapid feedback to the foundry and enhancing the quality of diagnosis reports. Moreover, this research develops a diagnosis procedure to identify the fault origin when an M3D-integrated RRAM device fails the manufacturing test. The dissertation presents a detailed characterization of RRAM faulty behaviors in the presence of concurrent process variations and manufacturing defects. Based on RRAM characteristics, a diagnosis sequence is developed by identifying appropriate reference resistances and applied voltages to efficiently distinguish fault origins. The dissertation also introduces a test sequence to detect faults due to PSN and quantify the magnitude of noise and defects within M3D-integrated multi-level cell (MLC) arrays. Simulation results have demonstrated the efficacy of the proposed test and diagnosis sequences on memory-on-logic stacking M3D devices.
In summary, this dissertation addresses critical issues in the testing and diagnosis of M3D ICs. The outcomes of the dissertation provide theoretical insights and effective solutions for ensuring power-safe testing and facilitating yield learning. It is expected that the evolving M3D technology will derive significant benefits from these solutions as it progresses towards commercial viability.