Algorithm-hardware co-optimization for neural network efficiency improvement

Deep neural networks (DNNs) are widely applied in the artificial intelligence field. While deeper and more complex structures continue to improve the performance of DNNs, the feasibility of deployment on edge devices remains a critical problem. In this thesis, we present algorithm-hardware co-optimization approaches that address the challenges of efficient DNN deployment from three aspects: 1) reducing computational cost, 2) reducing memory cost, and 3) reducing data movement.

First, we present a joint regularization technique that extends compression beyond the weights to neuron activations. By distinguishing and leveraging the significant differences among neuron responses and connections during learning, the jointly pruned network, namely JPNet, optimizes the sparsity of both activations and weights. Second, to structurally regulate the dynamic activation sparsity (DAS), we propose a generic low-cost approach based on a winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level, which can be easily exploited to achieve acceleration on conventional embedded systems. The effectiveness of JPNet and DASNet has been thoroughly evaluated on various network models with different activation functions and on different datasets. Third, we propose BitSystolic, a neural processing unit based on a systolic array structure, to fully support mixed-precision inference. In BitSystolic, the numerical precision of both weights and activations can be configured in the range of 2b to 8b, fulfilling the different requirements of mixed-precision models and tasks. Moreover, the design supports the various data flows present in different types of neural layers and adaptively optimizes data reuse by switching between the matrix-matrix mode and the vector-matrix mode. We designed and fabricated the proposed BitSystolic in a 65 nm process. Our measurement results show that BitSystolic features a unified power efficiency of up to 26.7 TOPS/W with 17.8 mW peak power consumption across various layer types. Finally, we briefly examine computing-in-memory architectures based on resistive random-access memory (ReRAM), which realize in-place storage and computation. A quantized training method is proposed to enhance the accuracy of ReRAM-based neuromorphic systems by alleviating the impact of limited parameter precision.
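As a rough illustration of the winners-take-all idea mentioned above, the following minimal sketch keeps only the largest activations in a layer and zeroes the rest. It is a generic top-k masking example, not the dissertation's DASNet implementation: the function name `wta_mask` and the parameter `keep_ratio` are hypothetical, and the abstract does not specify how winners are selected or how the selection is structured.

```python
def wta_mask(activations, keep_ratio):
    """Zero all but the largest activations (generic top-k sketch).

    `wta_mask` and `keep_ratio` are illustrative names, not taken from
    the dissertation. `keep_ratio` is the fraction of activations kept.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of winners; keep at least one for a non-empty input.
    k = max(1, int(len(activations) * keep_ratio))

    # Indices of the k largest activations (the "winners").
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i],
                   reverse=True)
    winners = set(order[:k])

    # Losers are set to zero, which makes the output sparse.
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


sparse = wta_mask([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5)
print(sparse)
```

Because the number of surviving activations is fixed by `keep_ratio` rather than by the input values, the sparsity level in this sketch is known in advance, which is one plausible reason such a scheme is easier to exploit on conventional hardware than unstructured sparsity.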





Yang, Qing (2020). Algorithm-hardware co-optimization for neural network efficiency improvement. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.