Optimization Dynamics in Mildly Overparametrized Models

Limited Access: this item is unavailable until 2025-09-08.

Date

2024

Authors

Zhou, Mo

Abstract

Modern machine learning, especially deep learning, has shown remarkable empirical success. The common practice is to train large overparametrized neural networks on non-convex objectives using simple gradient-based algorithms. From a theoretical perspective, it is surprising that neural networks trained this way perform well and efficiently learn useful representations. To address this gap between practice and theory, this thesis examines overparametrized models that capture essential practical aspects with minimal requirements: feature learning, polynomial size, and polynomial convergence time. We refer to these as mildly overparametrized models.

Analyzing the training dynamics of these mildly overparametrized models offers valuable insight into the underlying mechanisms and allows us to go beyond worst-case analysis. In this thesis, we leverage natural properties of these problems to provide a theoretical analysis of their optimization dynamics.

In the first part of this dissertation, we focus on local analysis of the optimization dynamics of mildly overparametrized models. We start with the simplest case: learning two-layer neural networks with low-dimensional structure. We show that once the loss falls below a certain threshold, the local loss landscape is benign and gives rise to training dynamics unique to overparametrized settings. Specifically, we show that (1) two-layer neural networks with positive second-layer weights trained by gradient descent, and (2) two-layer neural networks with standard second-layer weights trained by gradient descent with weight decay, can efficiently recover the target network. These results also suggest a strong form of feature learning: student neurons align with the ground-truth directions at the end of training.
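To make the teacher-student setting concrete, here is a minimal sketch of setting (2): a mildly overparametrized two-layer ReLU student trained by gradient descent with weight decay on data generated by a small teacher network. The widths, step size, weight-decay strength, and Gaussian input distribution are illustrative assumptions, not the exact conditions analyzed in the thesis.

```python
import numpy as np

# Illustrative teacher-student setup (assumed details, not the thesis's exact setting):
# teacher: y = sum_j a*_j relu(<w*_j, x>) with a few neurons;
# student: mildly overparametrized two-layer ReLU network trained by
# gradient descent on the squared loss with weight decay.

rng = np.random.default_rng(0)
d, m_teacher, m_student, n = 10, 3, 12, 2000

# Ground-truth (teacher) network with unit-norm directions
W_star = rng.standard_normal((m_teacher, d))
W_star /= np.linalg.norm(W_star, axis=1, keepdims=True)
a_star = np.ones(m_teacher)

X = rng.standard_normal((n, d))
y = np.maximum(X @ W_star.T, 0.0) @ a_star        # teacher labels (no label noise)

# Student initialization at a small scale
W = 0.1 * rng.standard_normal((m_student, d))
a = 0.1 * rng.standard_normal(m_student)

lr, weight_decay = 0.05, 1e-3
for step in range(3000):
    H = np.maximum(X @ W.T, 0.0)                  # hidden activations, shape (n, m_student)
    pred = H @ a
    r = pred - y                                  # residuals
    loss = 0.5 * np.mean(r ** 2)

    # Gradients of the squared loss plus L2 (weight-decay) terms
    grad_a = H.T @ r / n + weight_decay * a
    mask = (X @ W.T > 0).astype(float)            # ReLU derivative
    grad_W = ((r[:, None] * mask) * a).T @ X / n + weight_decay * W

    a -= lr * grad_a
    W -= lr * grad_W

# Cosine alignment of student neurons with ground-truth directions (feature learning)
cos = (W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)) @ W_star.T
print(f"final loss: {loss:.4f}, best alignment per teacher neuron: {cos.max(axis=0)}")
```

The alignments printed at the end serve as a simple proxy for the feature-learning claim: each teacher direction should be closely matched by at least one student neuron when recovery succeeds.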

In the second part of this dissertation, we investigate global analysis of the optimization dynamics of mildly overparametrized models. We extend our analysis beyond the commonly studied two-layer networks to two settings that better reflect real-world scenarios: multi-layer networks and feature learning under noisy data. First, we show that gradient descent can learn certain three-layer networks that cannot be efficiently represented by any two-layer network. Using a newly proposed multi-layer mean-field framework, we reveal a hierarchical feature learning process from the bottom layer to the top layer, even without direct supervision for learning the first layer's features. Additionally, we address the noisy sparse linear regression problem by proposing a new parametrization that guides gradient descent to first learn the features and then memorize the noise, ultimately achieving benign overfitting.
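As a hedged illustration of the reparametrization idea, the sketch below uses a quadratic (elementwise) parametrization beta = u*u - v*v for noisy sparse linear regression, a standard device for inducing sparsity-friendly gradient-descent dynamics; it is an assumption here and not necessarily the parametrization proposed in the thesis.

```python
import numpy as np

# Illustrative noisy sparse linear regression with a quadratic (Hadamard-product)
# reparametrization beta = u*u - v*v. This is a common reparametrization used to
# study implicit bias toward sparsity; it is an assumption for illustration, not
# necessarily the parametrization proposed in the thesis.

rng = np.random.default_rng(1)
n, d, k, noise_std = 100, 400, 3, 0.5

beta_star = np.zeros(d)
beta_star[rng.choice(d, size=k, replace=False)] = 1.0     # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ beta_star + noise_std * rng.standard_normal(n)    # noisy labels

alpha = 1e-3                                              # small initialization scale
u = alpha * np.ones(d)
v = alpha * np.ones(d)

lr = 0.01
for step in range(20000):
    beta = u * u - v * v
    r = X @ beta - y
    g = X.T @ r / n                                       # gradient w.r.t. beta
    u -= lr * (2 * g * u)                                 # chain rule through u*u
    v -= lr * (-2 * g * v)                                # chain rule through -v*v

beta = u * u - v * v
print("train error:", np.mean((X @ beta - y) ** 2))
print("estimation error:", np.linalg.norm(beta - beta_star))
print("largest recovered coordinates:", np.argsort(-np.abs(beta))[:k])
```

With a small initialization, the signal coordinates tend to grow first while the remaining coordinates fit the noise only later, mirroring the learn-features-then-memorize-noise behavior described above.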

Citation

Zhou, Mo (2024). Optimization Dynamics in Mildly Overparametrized Models. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/31952.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.