Browsing by Subject "Sparsity"
Results Per Page
Sort Options
Item Open Access Bayesian Models for Causal Analysis with Many Potentially Weak Instruments(2015) Jiang, ShengThis paper investigates Bayesian instrumental variable models with many instruments. The number of instrumental variables grows with the sample size and is allowed to be much larger than the sample size. With some sparsity condition on the coefficients on the instruments, we characterize a general prior specification where the posterior consistency of the parameters is established and calculate the corresponding convergence rate.
In particular, we show the posterior consistency for a class of spike and slab priors on the many potentially weak instruments. The spike and slab prior shrinks the number of instrumental variables, which avoids overfitting and provides uncertainty quantifications on the first stage. A simulation study is conducted to illustrate the convergence notion and estimation/selection performance under dependent instruments. Computational issues related to the Gibbs sampler are also discussed.
Item Open Access Efficient and Scalable Deep Learning(2019) Wen, WeiDeep Neural Networks (DNNs) can achieve accuracy superior to traditional machine learning models, because of their large learning capacity and the availability of large amounts of labeled data. In general, larger DNNs can obtain higher accuracy. However, there are two obstacles which hinder us building larger DNNs: (1) inference of large DNNs is slow which limits their deployment to small devices; (2) training large DNNs is also slow which slows down research exploration. To remove those obstacles, this dissertation focuses on acceleration of DNN inference and training. To accelerate DNN inference, original DNNs are compressed while keeping original accuracy. More specific, Structurally Sparse Deep Neural Networks (SSDNNs) are proposed to remove neural components. In Convolutional Neural Networks (CNNs), neurons, filters, channels and layers can be removed; in Recurrent Neural Networks (RNNs), hidden sizes can be reduced. The study shows that SSDNNs can achieve higher speedup than sparse DNNs which have non-structured sparsity. Besides SSDNNs, a Force Regularization is proposed to enforce DNNs to lower-rank space, such that DNNs can be decomposed to lower-rank architectures with fewer ranks than traditional methods. The dissertation also demonstrates that SSDNNs and Force Regularization are orthogonal and can be combined for higher speedup. To accelerate DNN training, distributed deep learning is required. However, two problems hinder us using more compute nodes for higher training speed: Communication Bottleneck and Generalization Gap. Communication Bottleneck is that communication time will increase and dominate when the distributed systems scale to many compute nodes. To reduce gradient communication in Stochastic Gradient Descent (SGD), SGD with low-precision gradients (TernGrad) is proposed. Moreover, in distributed deep learning, a large batch size is required to exploit system computing power; unfortunately, accuracy will decrease when the batch size is very large, which is referred to as the Generalization Gap. One hypothesis to explain Generalization Gap is that large-batch SGD sticks at sharp minima. The dissertation proposes a stochastic smoothing (SmoothOut) to escape sharp minima. The dissertation will show that TernGrad overcomes Communication Bottleneck and SmoothOut helps to close the Generalization Gap.
Item Open Access Joint Optimization of Algorithms, Hardware, and Systems for Efficient Deep Neural Networks(2024) Li, ShiyuDeep learning has enabled remarkable performance breakthroughs across various domains, including computer vision, natural language processing, and recommender systems. However, the typical deep neural network (DNN) models employed in these applications require millions of parameters and billions of operations, leading to substantial computational and memory requirements. While researchers have proposed compression methods, optimized frameworks, and specialized accelerators to improve efficiency, outstanding challenges persist, limiting the achievable gains.
A fundamental challenge lies in the inherent irregularity and sparsity of DNNs. Although these models exhibit significant sparsity, with a considerable fraction of weights and activations being zero or near-zero values, exploiting this sparsity efficiently on modern hardware is problematic due to the irregular distribution of non-zero elements. This irregularity leads to substantial overhead in indexing, gathering, and processing sparse data, resulting in poor utilization of computational and memory resources. Furthermore, recent research has identified a significant gap between the theoretical and practical improvements achieved by compression methods. Additionally, emerging DNN architectures with novel operators often nullify previous optimization efforts in software frameworks and hardware accelerators, necessitating continuous adaptation.
To address these critical challenges, this dissertation targets building a holistic approach that jointly optimizes algorithms, hardware architectures, and system designs to enable efficient deployment of DNNs in the presence of irregularity and sparsity. On the algorithm level, a novel hardware-friendly compression method based on matrix decomposition is proposed. The original convolutional kernels are decomposed into common basis kernels and a series of coefficients, with conventional pruning applied to the coefficients. This compressed DNN forms a hardware-friendly structure where the sparsity pattern is shared across input feature map pixels, alleviating sparse pattern processing costs.
On the hardware level, a novel sparse DNN accelerator is introduced to support the inference of the compressed DNN. Low-precision quantization is applied to sparse coefficients, and high-precision to basis kernels. By involving only low-precision coefficients in sparse processing, the hardware efficiently matches non-zero weights and activations using inverted butterfly networks. The shared basis kernels and sparse coefficients significantly reduce buffer size and bandwidth requirements, boosting performance and energy efficiency.
At the system level, a near-data processing framework is proposed to address the challenge of training large DNN-based recommendation models. This framework adopts computational storage devices and coherent system interconnects to partition the model into subtasks. Data-intensive embedding operations run on computational storage devices with customized memory hierarchies, while compute-intensive feature processing and aggregation operations are assigned to GPUs for maximum efficiency. This framework enables training large DNN-based recommendation models without expensive hardware investments.
Through joint optimization across algorithms, hardware architectures, and system designs, this research aims to overcome the limitations imposed by irregularity and sparsity, enabling efficient deployment of DNNs in a broad range of applications and resource-constrained environments. By addressing these critical issues, this work paves the way for fully harnessing the potential of deep learning technologies in practical settings.