Understanding Deep Learning via Analyzing Training Dynamics
Abstract
Deep learning has achieved tremendous success in practice, yet our theoretical understanding lags behind. How does gradient descent successfully optimize the highly non-convex training objective, and how does it find a solution that also generalizes well to unseen data despite the model being over-parameterized? Answering these questions requires a characterization of the training dynamics of gradient descent. In this thesis, we first develop techniques for analyzing the training dynamics of gradient descent in tensor decompositions, and then show how two empirical phenomena can be explained by analyzing these dynamics.
In the first part, we analyze the gradient descent dynamics in over-parameterized tensor decompositions. For non-orthogonal low-rank tensors, we show that gradient descent from a small initialization identifies the subspace in which the ground-truth components lie, and automatically exploits this structure to reduce the amount of over-parameterization required. Then, for orthogonal tensors, we show that gradient descent fits the ground-truth components one by one, from larger to smaller, analogous to a tensor deflation process. Since tensor decomposition is closely related to the optimization of neural networks, we believe many of the techniques developed here will apply to neural networks as well.
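To make the setting concrete, here is a minimal sketch of the experiment this part studies: plain gradient descent from a small initialization on an over-parameterized symmetric third-order tensor decomposition. All dimensions, the learning rate, and the initialization scale are illustrative choices, not the constants from the thesis.

```python
import torch

torch.manual_seed(0)

d, r, k = 20, 3, 30                        # ambient dimension, true rank, over-parameterized width (k > r)
lr, steps, init_scale = 0.02, 5000, 1e-2   # illustrative hyperparameters, not the thesis constants

# Hypothetical ground truth: T = sum_i a_i ⊗ a_i ⊗ a_i with unit-norm components a_i
A = torch.randn(r, d)
A = A / A.norm(dim=1, keepdim=True)
T = torch.einsum('ia,ib,ic->abc', A, A, A)

# Over-parameterized model with k components, started near zero (small initialization)
X = (init_scale * torch.randn(k, d)).requires_grad_()

for step in range(steps):
    T_hat = torch.einsum('ia,ib,ic->abc', X, X, X)   # sum_j x_j ⊗ x_j ⊗ x_j
    loss = ((T_hat - T) ** 2).sum()                  # squared Frobenius reconstruction error
    loss.backward()
    with torch.no_grad():
        X -= lr * X.grad
        X.grad.zero_()
    if step % 1000 == 0:
        print(f"step {step:5d}  loss {loss.item():.6f}")
```

Tracking the singular values of X over training would show the subspace-identification behavior described above: the components first align with the span of the ground-truth directions while still small, and only then grow in norm.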
In the second part, we explain two phenomena by analyzing the training dynamics of gradient descent. We first explain the representation learning process of non-contrastive self-supervised methods by analyzing the training dynamics on a linear network. Our analysis reveals the role of weight decay in discarding nuisance features while keeping robust features. Then we show that, on a deep network, there is a long plateau in both the loss and the accuracy along the linear interpolation between a random initialization and the minimizer it converges to, whenever different classes have different last-layer biases. We also show, by analyzing a simple model, how the last-layer biases for different classes can differ even on a perfectly balanced dataset.
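As an illustration of the interpolation experiment, the sketch below evaluates the loss and accuracy along the straight line between a random initialization and a trained minimizer; the long plateau discussed above would appear in these curves. Here `make_model` and `loader` are hypothetical stand-ins for the user's model constructor and evaluation data, and the helper assumes both models share an architecture whose state dict contains only floating-point parameters (e.g., a plain MLP without BatchNorm buffers).

```python
import torch
import torch.nn as nn

def interpolation_curve(model_init, model_final, make_model, loader, num_points=21):
    """Loss/accuracy at theta(alpha) = (1 - alpha) * theta_0 + alpha * theta_star."""
    criterion = nn.CrossEntropyLoss()
    init_state = model_init.state_dict()
    final_state = model_final.state_dict()
    curve = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        # Linearly interpolate every tensor in the state dict
        # (assumes all entries are float parameters)
        state = {name: (1 - alpha) * init_state[name] + alpha * final_state[name]
                 for name in init_state}
        model = make_model()
        model.load_state_dict(state)
        model.eval()
        total_loss, correct, n = 0.0, 0, 0
        with torch.no_grad():
            for x, y in loader:
                out = model(x)
                total_loss += criterion(out, y).item() * y.numel()
                correct += (out.argmax(dim=1) == y).sum().item()
                n += y.numel()
        curve.append((alpha, total_loss / n, correct / n))
    return curve
```

In the plateau phenomenon described above, both the loss and the accuracy would stay nearly flat over a long initial stretch of alpha before improving, rather than decreasing monotonically along the path.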
Citation
Wang, Xiang (2022). Understanding Deep Learning via Analyzing Training Dynamics. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/26854.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.