Browsing by Subject "Computer Architecture"
Item (Open Access): Hybrid Digital/Analog In-Memory Computing (2024), Zheng, Qilin

The relentless advancement of deep learning applications, particularly the highly potent yet computationally intensive deep unsupervised learning models, is pushing the boundaries of what modern general-purpose CPUs and GPUs can handle in terms of computation, communication, and storage capacities. To meet these burgeoning memory and computational demands, computing systems based on in-memory computing, which extensively utilize accelerators, are emerging as the next frontier in computing technology. This thesis delves into my research efforts aimed at overcoming these obstacles to develop a processing-in-memory based computing system tailored for machine learning tasks, with a focus on employing a hybrid digital/analog design approach.
In the initial part of my work, I introduce a novel concept that leverages hybrid digital/analog in-memory computing to enhance the efficiency of depth-wise convolution applications. This approach not only optimizes computational efficiency but also paves the way for more energy-efficient machine learning operations.
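To make the target operator concrete, here is a minimal NumPy sketch of depth-wise convolution (the thesis itself includes no code; all array names and sizes below are illustrative). Each channel is convolved with its own filter and channels are never summed, which is a plausible reason the operator underutilizes arrays sized for dense layers and invites a hybrid digital/analog treatment.

```python
import numpy as np

def depthwise_conv2d(x, w):
    """Depth-wise convolution: each input channel is filtered by its
    own k x k kernel, with no summation across channels."""
    C, H, W = x.shape                 # channels, height, width
    _, k, _ = w.shape                 # one k x k filter per channel
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                # channels stay independent
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * w[c])
    return out

x = np.random.randn(8, 16, 16)        # 8-channel input (illustrative)
w = np.random.randn(8, 3, 3)          # one 3x3 filter per channel
y = depthwise_conv2d(x, w)            # shape (8, 14, 14)
```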
Following this, I expand upon the initial concept by presenting a design methodology that applies hybrid digital/analog in-memory computing to the processing of sparse attention operators. This extension significantly improves mapping efficiency, making it a vital enhancement for the processing capabilities of deep learning models that rely heavily on attention mechanisms.
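For reference, here is a minimal sketch of attention restricted by a sparsity mask, the operator family this work maps; the mask pattern, shapes, and names are illustrative assumptions, not the thesis's mapping scheme.

```python
import numpy as np

def sparse_attention(q, k, v, mask):
    """Attention under a sparsity mask: scores outside the mask are
    dropped before the softmax, so only the surviving query-key
    pairs need to be computed and mapped onto hardware."""
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n, n) similarities
    scores = np.where(mask, scores, -np.inf)   # keep masked-in pairs only
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 16, 8
q, k, v = (np.random.randn(n, d) for _ in range(3))
mask = np.tril(np.ones((n, n), dtype=bool))    # e.g. a causal pattern
out = sparse_attention(q, k, v, mask)          # shape (16, 8)
```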
In my third piece of work, I detail the implementation strategies aimed at augmenting the power efficiency of in-memory computing macros. By integrating hybrid digital/analog computing concepts, this implementation focuses on general-purpose neural network acceleration, showcasing a significant step forward in reducing the energy consumption of such computational processes.
Lastly, I introduce a system-level simulation tool designed for simulating general-purpose in-memory-computing based systems. This tool facilitates versatile architecture exploration, allowing for the assessment and optimization of various configurations to meet the specific needs of machine learning workloads. Through these comprehensive research efforts, this thesis contributes to the advancement of in-memory computing technologies, offering novel solutions to the challenges posed by the next generation of machine learning applications.
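To give a flavor of what such architecture exploration looks like, below is a toy analytical sweep over a macro count; every parameter, formula, and number is an illustrative stand-in, not the thesis's simulator or its models.

```python
# Toy design-space sweep of the kind a system-level simulator enables:
# estimate latency and energy for one layer across configurations.
def estimate(macros, ops_per_macro_cycle, freq_hz, layer_macs, energy_per_mac_j):
    cycles = layer_macs / (macros * ops_per_macro_cycle)
    latency_s = cycles / freq_hz
    energy_j = layer_macs * energy_per_mac_j
    return latency_s, energy_j

layer_macs = 512 * 512 * 9            # one conv layer's multiply-accumulates
for macros in (16, 64, 256):          # sweep the in-memory macro count
    lat, en = estimate(macros, 1024, 200e6, layer_macs, 10e-15)
    print(f"{macros:4d} macros: {lat * 1e6:8.2f} us, {en * 1e6:.3f} uJ")
```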
Item (Open Access): Joint Optimization of Algorithms, Hardware, and Systems for Efficient Deep Neural Networks (2024), Li, Shiyu

Deep learning has enabled remarkable performance breakthroughs across various domains, including computer vision, natural language processing, and recommender systems. However, the typical deep neural network (DNN) models employed in these applications require millions of parameters and billions of operations, leading to substantial computational and memory requirements. While researchers have proposed compression methods, optimized frameworks, and specialized accelerators to improve efficiency, outstanding challenges persist, limiting the achievable gains.
A fundamental challenge lies in the inherent irregularity and sparsity of DNNs. Although these models exhibit significant sparsity, with a considerable fraction of weights and activations being zero or near-zero values, exploiting this sparsity efficiently on modern hardware is problematic due to the irregular distribution of non-zero elements. This irregularity leads to substantial overhead in indexing, gathering, and processing sparse data, resulting in poor utilization of computational and memory resources. Furthermore, recent research has identified a significant gap between the theoretical and practical improvements achieved by compression methods. Additionally, emerging DNN architectures with novel operators often nullify previous optimization efforts in software frameworks and hardware accelerators, necessitating continuous adaptation.
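To illustrate the indexing and gathering overhead described above, here is a small sketch of a sparse matrix-vector product in CSR (compressed sparse row) form; the format choice is mine for illustration. Each useful multiply drags along an index load and an irregular gather from x, which is precisely the per-element bookkeeping that erodes theoretical speedups on real hardware.

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """CSR sparse matrix-vector product: every multiply requires an
    index load and an irregular gather from x."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        for p in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[p] * x[col_idx[p]]   # indexed gather
    return y

# A 3x4 matrix with irregularly placed non-zeros (illustrative).
values  = np.array([2.0, 1.0, 3.0, 4.0])
col_idx = np.array([1, 3, 0, 2])
row_ptr = np.array([0, 2, 3, 4])       # rows own 2, 1, 1 non-zeros
x = np.arange(4, dtype=float)
print(csr_matvec(values, col_idx, row_ptr, x))   # [5. 0. 8.]
```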
To address these critical challenges, this dissertation targets building a holistic approach that jointly optimizes algorithms, hardware architectures, and system designs to enable efficient deployment of DNNs in the presence of irregularity and sparsity. On the algorithm level, a novel hardware-friendly compression method based on matrix decomposition is proposed. The original convolutional kernels are decomposed into common basis kernels and a series of coefficients, with conventional pruning applied to the coefficients. This compressed DNN forms a hardware-friendly structure where the sparsity pattern is shared across input feature map pixels, alleviating sparse pattern processing costs.
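A minimal sketch of the decomposition idea as described: kernels rebuilt as linear combinations of shared basis kernels, with pruning applied only to the coefficient matrix. The shapes, the pruning rule, and the keep ratio below are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: 64 original 3x3 kernels rebuilt from 8 shared bases.
n_kernels, n_basis, k = 64, 8, 3
basis = np.random.randn(n_basis, k, k)       # common basis kernels (dense)
coef = np.random.randn(n_kernels, n_basis)   # per-kernel coefficients

# Prune the coefficients, not the kernels: small entries are zeroed,
# so every reconstructed kernel shares the same compact basis.
threshold = np.quantile(np.abs(coef), 0.75)  # keep top 25% (illustrative)
sparse_coef = np.where(np.abs(coef) >= threshold, coef, 0.0)

# Reconstruct the kernels as sparse linear combinations of the bases.
kernels = np.einsum('nb,bkl->nkl', sparse_coef, basis)   # (64, 3, 3)

# Because the sparsity lives in `sparse_coef`, the same non-zero
# pattern applies at every input pixel, which is what makes the
# structure hardware-friendly.
```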
On the hardware level, a novel sparse DNN accelerator is introduced to support the inference of the compressed DNN. Low-precision quantization is applied to sparse coefficients, and high-precision to basis kernels. By involving only low-precision coefficients in sparse processing, the hardware efficiently matches non-zero weights and activations using inverted butterfly networks. The shared basis kernels and sparse coefficients significantly reduce buffer size and bandwidth requirements, boosting performance and energy efficiency.
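As a software analogue of the matching step (the thesis performs it in hardware with inverted butterfly networks), the sketch below intersects non-zero weight and activation indices so that only coincident pairs are multiplied; the index arrays and values are illustrative.

```python
import numpy as np

# Only positions where BOTH a weight and an activation are non-zero
# contribute a multiply; np.intersect1d stands in for the routing
# fabric that performs this matching in hardware.
w_idx = np.array([1, 4, 7, 9])           # non-zero weight positions
w_val = np.array([0.5, -1.0, 2.0, 0.25])
a_idx = np.array([0, 4, 9, 12])          # non-zero activation positions
a_val = np.array([3.0, 2.0, -1.0, 5.0])

common, wi, ai = np.intersect1d(w_idx, a_idx, return_indices=True)
partial_sum = np.dot(w_val[wi], a_val[ai])   # only matched pairs: 4 and 9
print(common, partial_sum)                   # [4 9] -2.25
```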
At the system level, a near-data processing framework is proposed to address the challenge of training large DNN-based recommendation models. This framework adopts computational storage devices and coherent system interconnects to partition the model into subtasks. Data-intensive embedding operations run on computational storage devices with customized memory hierarchies, while compute-intensive feature processing and aggregation operations are assigned to GPUs for maximum efficiency. This framework enables training large DNN-based recommendation models without expensive hardware investments.
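A toy sketch of the partitioning idea: data-intensive embedding gathers run where the tables are stored, while compute-intensive feature interaction runs on the accelerator side. All function names, shapes, and the stand-in MLP layer are illustrative, not the framework's API.

```python
import numpy as np

def near_storage_embedding(tables, sparse_ids):
    """Runs where the embedding tables live: pure gather traffic."""
    return [tables[t][ids] for t, ids in enumerate(sparse_ids)]

def gpu_side_interaction(dense_features, embeddings):
    """Runs on the accelerator: concatenate and apply an MLP-style op."""
    x = np.concatenate([dense_features] + embeddings, axis=1)
    w = np.random.randn(x.shape[1], 16)   # stand-in MLP layer
    return np.maximum(x @ w, 0.0)         # ReLU

tables = [np.random.randn(1000, 8) for _ in range(2)]  # two embedding tables
sparse_ids = [np.array([3, 42]), np.array([7, 999])]   # one id per sample
dense = np.random.randn(2, 4)
emb = near_storage_embedding(tables, sparse_ids)       # storage-side gathers
out = gpu_side_interaction(dense, emb)                 # shape (2, 16)
```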
Through joint optimization across algorithms, hardware architectures, and system designs, this research aims to overcome the limitations imposed by irregularity and sparsity, enabling efficient deployment of DNNs in a broad range of applications and resource-constrained environments. By addressing these critical issues, this work paves the way for fully harnessing the potential of deep learning technologies in practical settings.
Item (Embargo): Processing-in-Memory Accelerators Toward Energy-Efficient Real-World Machine Learning (2024), Kim, Bokyung

Artificial intelligence (AI) has permeated the real world with unprecedented success. Countless applications exploit machine learning (ML) technologies built on big data and compute-intensive algorithms. Moreover, the pursuit of authentic machine intelligence is moving computing toward the edge to handle complex tasks conventionally reserved for human beings. Alongside this rapid development, the gap between the growing resource requirements of ML and the constrained environments of the edge demands urgent attention to efficiency. Closing this gap requires solutions across hardware disciplines, beyond algorithm development alone.
Unfortunately, hardware development falls far behind because of heterogeneity. While the remarkable advance of ML algorithms has changed the computing paradigm, conventional hardware is ill-suited to the new paradigm due to fundamental limitations in its architecture and technology. The traditional architecture, which separates storage and computation, is deeply inefficient for the massive data processing these algorithms demand, exhibiting high power consumption and low performance. Recognition of these fundamental limitations motivates efficient, non-conventional hardware accelerators.
As a new hardware paradigm, processing-in-memory (PIM) accelerators have raised significant expectations because they directly address the limitations of traditional hardware. PIM merges memory and processing units, saving the resources otherwise spent moving data between them, pursuing non-heterogeneity and ultimately improving efficiency. Previous PIM accelerators have shown promising outcomes in high-performance computing, thanks in particular to emerging memories known as memristors.
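For intuition, here is an idealized sketch of the crossbar-style matrix-vector multiplication that memristor-based PIM performs in place; device non-idealities, ADCs, and peripheral circuits are deliberately ignored, and all numbers are illustrative.

```python
import numpy as np

def crossbar_matvec(conductances, voltages):
    """Idealized memristor-crossbar MVM: weights are stored as cell
    conductances G, inputs arrive as row voltages V, and each column
    current is the analog sum I_j = sum_i V_i * G_ij (Kirchhoff's
    current law). The multiply-accumulate happens where the data is
    stored, which is the core of the PIM argument."""
    return voltages @ conductances     # one current read out per column

G = np.abs(np.random.randn(128, 64)) * 1e-6   # conductances in siemens
V = np.random.rand(128) * 0.2                 # read voltages in volts
I = crossbar_matvec(G, V)                     # 64 column currents, one MVM
```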
Despite this motivation for non-heterogeneity, PIM-based designs have not fully escaped heterogeneity, which causes inefficiency and high costs. While emerging memories bring revolutions at the device and circuit levels, PIM at higher levels must contend with the many other components in a system (horizontal heterogeneity). Furthermore, PIM must be designed holistically across hierarchical levels (vertical heterogeneity), which complicates efficient design. Even robustness can be significantly affected by heterogeneity.
Confronting these challenges in heterogeneity, efficiency, and robustness, my research has cultivated PIM hardware through cross-layer design for practically efficient ML acceleration. Focusing on architecture- and system-level innovations, I have pioneered novel 3D architectures and systemic paradigms that provide a strong foundation for future computing. For ML acceleration, I have proposed new methodologies for efficiently operating the 3D architecture, along with a novel dataflow and a new 3D design that pursue non-heterogeneity for energy efficiency. These innovations have been examined through rigorous hardware experiments, and their practical efficiency has been proven with a fabricated chip for seizure classification, a real-world application. Responding to the needs of future ML, my research is evolving to achieve robustness in hardware ML platforms. In this dissertation, I summarize the research impacts based on my diverse design experience, spanning architecture and system design to chip fabrication.