Efficient and Scalable Deep Learning

dc.contributor.advisor

Li, Hai

dc.contributor.advisor

Chen, Yiran

dc.contributor.author

Wen, Wei

dc.date.accessioned

2020-02-10T17:28:11Z

dc.date.available

2020-02-10T17:28:11Z

dc.date.issued

2019

dc.department

Electrical and Computer Engineering

dc.description.abstract

Deep Neural Networks (DNNs) can achieve accuracy superior to that of traditional machine learning models because of their large learning capacity and the availability of large amounts of labeled data. In general, larger DNNs obtain higher accuracy. However, two obstacles hinder us from building larger DNNs: (1) inference of large DNNs is slow, which limits their deployment on small devices; (2) training of large DNNs is also slow, which slows down research exploration. To remove these obstacles, this dissertation focuses on accelerating DNN inference and training. To accelerate DNN inference, the original DNNs are compressed while their accuracy is preserved. More specifically, Structurally Sparse Deep Neural Networks (SSDNNs) are proposed to remove entire neural components: in Convolutional Neural Networks (CNNs), neurons, filters, channels, and layers can be removed; in Recurrent Neural Networks (RNNs), hidden sizes can be reduced. The study shows that SSDNNs achieve higher speedup than sparse DNNs with non-structured sparsity. Besides SSDNNs, Force Regularization is proposed to push DNNs toward a lower-rank space, so that they can be decomposed into low-rank architectures with fewer ranks than traditional methods yield. The dissertation also demonstrates that SSDNNs and Force Regularization are orthogonal and can be combined for higher speedup. To accelerate DNN training, distributed deep learning is required; however, two problems hinder us from using more compute nodes for higher training speed: the Communication Bottleneck and the Generalization Gap. The Communication Bottleneck arises because communication time increases and eventually dominates as a distributed system scales to many compute nodes. To reduce gradient communication in Stochastic Gradient Descent (SGD), SGD with low-precision gradients (TernGrad) is proposed. Moreover, in distributed deep learning, a large batch size is required to exploit the computing power of the system; unfortunately, accuracy decreases when the batch size is very large, a phenomenon referred to as the Generalization Gap. One hypothesis explaining the Generalization Gap is that large-batch SGD becomes trapped in sharp minima. The dissertation proposes a stochastic smoothing method (SmoothOut) to escape sharp minima. The dissertation shows that TernGrad overcomes the Communication Bottleneck and that SmoothOut helps close the Generalization Gap.
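
For readers unfamiliar with structured sparsity, the following is a minimal sketch of one common way to induce it, assuming a group Lasso penalty whose groups are the output filters of a convolutional layer. The names `group_lasso_penalty`, `total_loss`, and the coefficient `lambda_group` are illustrative assumptions, not the dissertation's exact formulation.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(conv_weight: torch.Tensor) -> torch.Tensor:
    # conv_weight has shape (out_channels, in_channels, kH, kW).
    # Treat each output filter as one group and sum the L2 norms of the
    # groups; the penalty drives whole filters toward exactly zero so they
    # can be removed after training, yielding structured sparsity.
    return conv_weight.flatten(start_dim=1).norm(dim=1).sum()

def total_loss(model: nn.Module, task_loss: torch.Tensor,
               lambda_group: float = 1e-4) -> torch.Tensor:
    # Add the group penalty of every convolutional layer to the task loss.
    penalty = sum(group_lasso_penalty(m.weight)
                  for m in model.modules() if isinstance(m, nn.Conv2d))
    return task_loss + lambda_group * penalty
```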

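Similarly, the low-precision gradient idea behind TernGrad can be illustrated by stochastically quantizing each gradient tensor to three levels before communication. This sketch assumes a per-tensor scaler equal to the maximum absolute gradient value and unbiased stochastic rounding; the helper `ternarize_gradient` is hypothetical and not the dissertation's implementation.

```python
import numpy as np

def ternarize_gradient(grad: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Scale by the per-tensor maximum absolute value (an assumed choice),
    # then keep each element's sign with probability |g_i| / s and zero it
    # otherwise. The expectation equals the original gradient, so the
    # result is an unbiased three-level (-s, 0, +s) estimate that is much
    # cheaper to communicate between workers than full-precision gradients.
    s = np.abs(grad).max()
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < (np.abs(grad) / s)
    return s * np.sign(grad) * keep

# Example: each worker would ternarize its local gradient before sending it
# to the parameter server or the all-reduce step.
rng = np.random.default_rng(0)
g = rng.normal(size=(4, 4)).astype(np.float32)
print(ternarize_gradient(g, rng))
```
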
dc.identifier.uri

https://hdl.handle.net/10161/20143

dc.subject

Artificial intelligence

dc.subject

Computer science

dc.subject

Computer engineering

dc.subject

Deep neural networks

dc.subject

Distributed Training

dc.subject

Model Compression

dc.subject

Quantization

dc.subject

Sharp Minima

dc.subject

Sparsity

dc.title

Efficient and Scalable Deep Learning

dc.type

Dissertation

Files

Original bundle

Name: Wen_duke_0066D_15445.pdf
Size: 4.44 MB
Format: Adobe Portable Document Format
