Optimization Dynamics in Mildly Overparametrized Models

Limited Access: this item is unavailable until 2025-09-08.

Date

2024

Authors

Zhou, Mo

Abstract

Modern machine learning, especially deep learning, has shown remarkable empirical success. The common practice is to train large overparametrized neural networks on non-convex objectives using simple gradient-based algorithms. From a theoretical perspective, it is surprising that neural networks trained this way perform well and efficiently learn useful representations. To address this gap between practice and theory, this thesis examines overparametrized models that capture essential practical aspects with minimal requirements: feature learning, polynomial size, and polynomial convergence time. We refer to these as mildly overparametrized models.

Analyzing the training dynamics of these mildly overparametrized models offers valuable insight into the underlying mechanisms and allows us to go beyond worst-case analysis. In this thesis, we leverage natural properties of these problems to provide a theoretical analysis of their optimization dynamics.

In the first part of this dissertation, we focus on local analysis of the optimization dynamics of mildly overparametrized models. We start with the simplest case: learning two-layer neural networks with low-dimensional structure. We show that once the loss falls below a certain threshold, the local loss landscape is benign and gives rise to training dynamics unique to overparametrized settings. Specifically, we show that (1) two-layer neural networks with positive second-layer weights trained by gradient descent, and (2) two-layer neural networks with standard second-layer weights trained by gradient descent with weight decay, can efficiently recover the target network. These results also suggest a strong form of feature learning: student neurons align with the ground-truth directions at the end of training.
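To make the teacher-student setting concrete, here is a minimal sketch of setting (2): a mildly overparametrized two-layer ReLU student trained by gradient descent with weight decay on data generated by a small teacher network. The widths, step size, weight-decay strength, and Gaussian input distribution are illustrative assumptions, not the exact conditions analyzed in the thesis.

```python
import numpy as np

# Illustrative teacher-student setup (assumed details, not the thesis's exact setting):
# teacher: y = sum_j a*_j relu(<w*_j, x>) with a few neurons;
# student: mildly overparametrized two-layer ReLU network trained by
# gradient descent on the squared loss with weight decay.

rng = np.random.default_rng(0)
d, m_teacher, m_student, n = 10, 3, 12, 2000

# Ground-truth (teacher) network with unit-norm directions
W_star = rng.standard_normal((m_teacher, d))
W_star /= np.linalg.norm(W_star, axis=1, keepdims=True)
a_star = np.ones(m_teacher)

X = rng.standard_normal((n, d))
y = np.maximum(X @ W_star.T, 0.0) @ a_star        # teacher labels (no label noise)

# Student initialization at a small scale
W = 0.1 * rng.standard_normal((m_student, d))
a = 0.1 * rng.standard_normal(m_student)

lr, weight_decay = 0.05, 1e-3
for step in range(3000):
    H = np.maximum(X @ W.T, 0.0)                  # hidden activations, shape (n, m_student)
    pred = H @ a
    r = pred - y                                  # residuals
    loss = 0.5 * np.mean(r ** 2)

    # Gradients of the squared loss plus L2 (weight-decay) terms
    grad_a = H.T @ r / n + weight_decay * a
    mask = (X @ W.T > 0).astype(float)            # ReLU derivative
    grad_W = ((r[:, None] * mask) * a).T @ X / n + weight_decay * W

    a -= lr * grad_a
    W -= lr * grad_W

# Cosine alignment of student neurons with ground-truth directions (feature learning)
cos = (W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)) @ W_star.T
print(f"final loss: {loss:.4f}, best alignment per teacher neuron: {cos.max(axis=0)}")
```

The alignments printed at the end serve as a simple proxy for the feature-learning claim: each teacher direction should be closely matched by at least one student neuron when recovery succeeds.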

In the second part of this dissertation, we investigate global analysis of the optimization dynamics of mildly overparametrized models. We extend our analysis beyond the commonly studied two-layer networks to two settings that better reflect real-world scenarios: multi-layer networks and feature learning under noisy data. First, we show that gradient descent can learn certain three-layer networks that cannot be efficiently represented by any two-layer network. Using a newly proposed multi-layer mean-field framework, we reveal a hierarchical feature learning process from the bottom layer to the top layer, even without direct supervision for learning the first layer's features. Additionally, we address the noisy sparse linear regression problem by proposing a new parametrization that guides gradient descent to first learn the features and then memorize the noise, ultimately achieving benign overfitting.
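As a hedged illustration of the reparametrization idea, the sketch below uses a quadratic (elementwise) parametrization beta = u*u - v*v for noisy sparse linear regression, a standard device for inducing sparsity-friendly gradient-descent dynamics; it is an assumption here and not necessarily the parametrization proposed in the thesis.

```python
import numpy as np

# Illustrative noisy sparse linear regression with a quadratic (Hadamard-product)
# reparametrization beta = u*u - v*v. This is a common reparametrization used to
# study implicit bias toward sparsity; it is an assumption for illustration, not
# necessarily the parametrization proposed in the thesis.

rng = np.random.default_rng(1)
n, d, k, noise_std = 100, 400, 3, 0.5

beta_star = np.zeros(d)
beta_star[rng.choice(d, size=k, replace=False)] = 1.0     # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ beta_star + noise_std * rng.standard_normal(n)    # noisy labels

alpha = 1e-3                                              # small initialization scale
u = alpha * np.ones(d)
v = alpha * np.ones(d)

lr = 0.01
for step in range(20000):
    beta = u * u - v * v
    r = X @ beta - y
    g = X.T @ r / n                                       # gradient w.r.t. beta
    u -= lr * (2 * g * u)                                 # chain rule through u*u
    v -= lr * (-2 * g * v)                                # chain rule through -v*v

beta = u * u - v * v
print("train error:", np.mean((X @ beta - y) ** 2))
print("estimation error:", np.linalg.norm(beta - beta_star))
print("largest recovered coordinates:", np.argsort(-np.abs(beta))[:k])
```

With a small initialization, the signal coordinates tend to grow first while the remaining coordinates fit the noise only later, mirroring the learn-features-then-memorize-noise behavior described above.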

Citation

Zhou, Mo (2024). Optimization Dynamics in Mildly Overparametrized Models. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/31952.

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.