Understanding Deep Learning via Analyzing Training Dynamics
Deep learning has achieved tremendous success in practice, yet our theoretical understanding lags behind. How does gradient descent successfully optimize a highly non-convex training objective, and how does it find a solution that also generalizes well to unseen data even though the model is over-parameterized? Answering these questions requires characterizing the training dynamics of gradient descent. In this thesis, we first develop techniques for analyzing the training dynamics of gradient descent in tensor decomposition, and then use such dynamics analyses to explain two empirical phenomena.
In the first part, we analyze the gradient descent dynamics of over-parameterized tensor decomposition. For non-orthogonal low-rank tensors, we show that gradient descent from a small initialization can identify the subspace spanned by the ground-truth components and automatically exploit this structure to reduce the amount of over-parameterization required. For orthogonal tensors, we show that gradient descent fits the ground-truth components one by one, from the largest to the smallest, much like a tensor deflation process. Since tensor decomposition is closely related to the optimization of neural networks, we believe many of the techniques developed here will apply to neural networks as well.
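As a rough illustration of this setting, the sketch below runs gradient descent from a small random initialization on an over-parameterized symmetric third-order tensor decomposition. The dimensions, step size, component weights, and iteration count are illustrative choices, not the ones analyzed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 8, 2, 16        # dimension, true rank, number of model components (m > r)

# Ground truth: an orthogonally decomposable 3rd-order tensor with distinct weights
U = np.linalg.qr(rng.standard_normal((d, r)))[0]   # orthonormal ground-truth directions
a = np.array([3.0, 1.0])                           # larger and smaller component sizes
T = sum(a[i] * np.einsum('p,q,s->pqs', U[:, i], U[:, i], U[:, i]) for i in range(r))

# Over-parameterized model: m rank-one components, small initialization
W = 1e-3 * rng.standard_normal((d, m))

lr = 0.02
for step in range(30000):
    That = np.einsum('pj,qj,sj->pqs', W, W, W)          # current reconstruction
    R = That - T                                        # residual tensor
    W -= lr * 3 * np.einsum('pqs,qj,sj->pj', R, W, W)   # gradient of 0.5*||R||_F^2

err = np.linalg.norm(np.einsum('pj,qj,sj->pqs', W, W, W) - T)
print(err)
```

In runs like this, tracking the projections of the columns of `W` onto each ground-truth direction over time makes the deflation-like behavior visible: components aligned with the larger ground-truth direction escape the small-initialization regime first.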
In the second part, we explain two phenomena by analyzing the training dynamics of gradient descent. First, we explain the representation learning process of non-contrastive self-supervised methods by analyzing the training dynamics on a linear network; our analysis reveals the role of weight decay in discarding nuisance features while retaining robust features. Second, we show that linearly interpolating both the loss and the accuracy (between a random initialization and the minimizer it converges to) exhibits a long plateau when different classes have different last-layer biases in a deep network. We also show, by analyzing a simple model, how the last-layer biases of different classes can differ even on a perfectly balanced dataset.
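To make the interpolation procedure concrete, the following sketch evaluates the loss along the straight line between a random initialization and the minimizer gradient descent converges to. It uses a toy logistic-regression model purely as a stand-in assumption to show how the curve is computed; the plateau phenomenon itself arises in deep networks with differing last-layer biases, not in this convex toy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data (hypothetical stand-in for the deep-network setting)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) > 0).astype(float)

def loss(theta):
    """Mean logistic loss of a linear classifier with parameters theta."""
    logits = X @ theta
    return np.mean(np.log1p(np.exp(-logits)) + (1 - y) * logits)

# theta0: random initialization; theta1: minimizer reached by gradient descent
theta0 = rng.standard_normal(d)
theta1 = theta0.copy()
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ theta1)))      # sigmoid predictions
    theta1 -= 0.1 * (X.T @ (p - y)) / n      # gradient step on the logistic loss

# Evaluate the loss along the line theta(alpha) = (1-alpha)*theta0 + alpha*theta1
alphas = np.linspace(0.0, 1.0, 11)
curve = [loss((1 - a) * theta0 + a * theta1) for a in alphas]
print([round(v, 3) for v in curve])
```

The same recipe applies to a deep network by interpolating all parameters jointly and recording loss and accuracy at each `alpha`; the thesis's plateau would then appear as a long stretch of the curve where neither quantity improves.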
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations