Browsing by Author "Ge, Rong"
Item Open Access: Learning Representations With Linear-Algebraic Structure (2022) Frandsen, Abraham

Representation learning is a key step for enabling algorithms to make sense of data and output good decisions. Good data representations preserve useful information, discard irrelevant features, and simplify complex relationships between data. We approach the problem of representation learning through the lens of latent variable models. In such models, the representations are directly given as unobserved variables encoding the core structure of the data. We utilize linear-algebraic structure to specify the properties of the representations, which enables rigorous analysis and efficient, provable algorithms.
We first consider the area of natural language processing, where the data consist of words. We propose a novel model for word representations that encodes compositional syntactic and semantic structure as latent multilinear structure. We prove that the representations can be efficiently recovered and develop a practical learning algorithm.
We show that learning the word embedding model is closely connected to the Tucker decomposition, a fundamental operation in tensor analysis that also arises in the context of other latent variable models. We formulate the Tucker decomposition as a nonconvex optimization problem and prove that its landscape is benign. We then give a local search algorithm that provably finds the global optimum.
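To make the optimization problem concrete, here is a minimal numpy sketch of Tucker decomposition by plain gradient descent on half the squared reconstruction loss. It illustrates the nonconvex objective and the flavor of local search; function names, step sizes, and the initialization scale are illustrative assumptions, and the thesis's algorithm and guarantees differ.

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    # Multiply the core G (r1 x r2 x r3) by the factor matrices along each mode.
    T = np.einsum('abc,ia->ibc', G, A)
    T = np.einsum('ibc,jb->ijc', T, B)
    return np.einsum('ijc,kc->ijk', T, C)

def tucker_local_search(T, ranks, steps=2000, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    (n1, n2, n3), (r1, r2, r3) = T.shape, ranks
    # Random initialization; joint gradient descent over core and factors.
    G = 0.1 * rng.standard_normal((r1, r2, r3))
    A = 0.1 * rng.standard_normal((n1, r1))
    B = 0.1 * rng.standard_normal((n2, r2))
    C = 0.1 * rng.standard_normal((n3, r3))
    for _ in range(steps):
        R = tucker_reconstruct(G, A, B, C) - T      # residual tensor
        # Gradients of 0.5 * ||R||_F^2 with respect to each variable.
        gG = np.einsum('ijk,ia,jb,kc->abc', R, A, B, C)
        gA = np.einsum('ijk,ajk->ia', R, np.einsum('abc,jb,kc->ajk', G, B, C))
        gB = np.einsum('ijk,bik->jb', R, np.einsum('abc,ia,kc->bik', G, A, C))
        gC = np.einsum('ijk,cij->kc', R, np.einsum('abc,ia,jb->cij', G, A, B))
        G, A, B, C = G - lr * gG, A - lr * gA, B - lr * gB, C - lr * gC
    return G, A, B, C
```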
Finally, we consider the area of reinforcement learning and control, where the temporal dynamics of the data are essential. We propose a model in which the state observations are high-dimensional with nonlinear dynamics, but depend on a latent low-dimensional linear control system. We develop state representation learning algorithms based on both forward and inverse dynamics that provably and efficiently recover the hidden linear system.
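As a concrete picture of this model, the following numpy sketch simulates a latent low-dimensional linear system whose states are observed only through a high-dimensional nonlinear map. All matrices, noise scales, and the observation map are illustrative assumptions, not the thesis's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, T = 4, 100, 50                  # latent dim, observation dim, horizon

A = 0.9 * np.eye(k) + 0.05 * rng.standard_normal((k, k))  # latent dynamics
B = rng.standard_normal((k, 1))                           # control matrix
M = rng.standard_normal((d, k))                           # lifting map

def observe(s):
    # Nonlinear, high-dimensional observation of the latent state.
    return np.tanh(M @ s)

s = np.zeros(k)
controls, observations = [], []
for t in range(T):
    u = rng.standard_normal(1)        # control input
    s = A @ s + B @ u + 0.01 * rng.standard_normal(k)
    controls.append(u)
    observations.append(observe(s))
# A representation learner sees only `observations` and `controls`, and must
# recover a state encoding under which the dynamics are linear.
```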
Item Open Access: Online Algorithms with Learned Predictions (2022) Anand, Keerti

Optimization under uncertainty is a classic theme in algorithm design and machine learning. The traditional design of online algorithms, however, has proved insufficient for practical instances, since it optimizes only for worst-case (but possibly highly unlikely) future scenarios. The availability of data and the advent of powerful machine learning paradigms have led to a promising approach: leveraging predictions about the future to aid decision-making under uncertainty. This motivates us to explore whether algorithm design can go beyond the worst case, i.e., whether it is possible to design efficient algorithms that perform well on typical instances while retaining a suitable robustness guarantee on worst-case instances. This is our vision of combining the power of machine learning and algorithm design to get the best of both worlds. The problem can be divided into two interdependent parts: (1) how to design the prediction pipeline that generates forecasts about future unknowns, and (2) given such predictions, how to redesign the decision-making algorithm to best leverage them. In this thesis, we investigate both of these questions, and summarise the key contributions as follows:
Rent or Buy Problem: We investigate the rent-or-buy problem and design a simple online algorithm that leverages predictions from a black-box classification model, whose performance depends directly on the prediction error of the classification task. We then demonstrate that incorporating the optimization benchmark into the prediction model leads to significantly better performance, while maintaining a worst-case adversarial guarantee.
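For intuition, here is a minimal sketch of a prediction-augmented rent-or-buy rule in the spirit of this abstract, where a black-box classifier predicts whether usage will outlast the buy cost. The parameter lam and the specific thresholds are illustrative, not the thesis's algorithm.

```python
import math

def rent_or_buy_day(buy_cost: int, predict_long: bool, lam: float = 0.5) -> int:
    """Return the day on which to buy (rent, at cost 1/day, on earlier days).

    predict_long: black-box classifier output, True if the predicted number
    of usage days exceeds buy_cost.
    lam in (0, 1] trades consistency (trust the prediction) against
    robustness (worst-case competitive ratio).
    """
    if predict_long:
        # Prediction says buying is worthwhile: commit early.
        return max(1, math.ceil(lam * buy_cost))
    # Prediction says renting is better: delay buying past the break-even
    # point, but still buy eventually so a wrong prediction stays bounded.
    return max(1, math.ceil(buy_cost / lam))

def total_cost(buy_cost: int, true_days: int, buy_day: int) -> int:
    # Rent until buy_day; buy on buy_day only if usage lasts that long.
    if true_days < buy_day:
        return true_days
    return (buy_day - 1) + buy_cost
```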
Online Search: We define a general online search framework that captures classic problems like (generalized) ski rental, bin packing, minimum makespan scheduling, etc. We then model the task of making predictions as a regression problem and show nearly tight bounds on its sample complexity.
Online Algorithms with Multiple Predictions: We study the setting of online covering problems that are supplemented with multiple predictions. We give algorithmic techniques that obtain a solution competitive with the performance of the best predictor. We apply our algorithmic framework to solve classical problems such as online set cover, (weighted) caching, and online facility location in the multiple-predictions setting.
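As background for competing with the best of several predictors, here is a generic multiplicative-weights sketch. This is a standard tool, not the thesis's technique; it deliberately ignores the covering-LP structure that the thesis actually exploits, and losses are assumed to lie in [0, 1].

```python
import numpy as np

def follow_predictors(loss_stream, k, eta=0.1):
    """Hedge over k predictors.

    loss_stream: iterable of length-k arrays, the loss of each predictor
    in each round. Returns the combiner's total expected loss, which is
    close to that of the best single predictor in hindsight.
    """
    w = np.ones(k)
    total = 0.0
    for losses in loss_stream:
        losses = np.asarray(losses, dtype=float)
        p = w / w.sum()                 # distribution over predictors
        total += float(p @ losses)      # expected loss this round
        w *= np.exp(-eta * losses)      # downweight poorly performing predictors
    return total
```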
Item Open Access: SVRG Escapes Saddle Points (2018-04-25) Wang, Weiyao

In this paper, we consider a fundamental problem in non-convex optimization that has wide applications and implications in machine learning. Previous work has shown that stochastic gradient descent with the variance reduction technique (SVRG) converges to a first-order stationary point in O(n^(2/3)/ε^2) iterations in non-convex settings. However, many problems in non-convex optimization require an efficient way to find second-order stationary points, which are a subset of first-order stationary points, so algorithms that converge only to a first-order stationary point are not satisfactory. This paper shows that SVRG with a uniformly random perturbation converges to a second-order stationary point in non-convex settings. To find an ε-second-order stationary point, it takes O(n^(3/4) polylog(d/(nε))/ε^2) iterations. Compared with the convergence rate of SVRG to a first-order stationary point, the loss in convergence rate depends only poly-logarithmically on the dimension d and involves a small polynomial factor in n, O(n^(1/12)). This is the best known result for finding second-order stationary points. We also give intuitions and proof sketches for a new analysis framework based on the stable manifold theorem; this framework may help eliminate the n^(1/12) loss from our current analysis. Our results can be directly applied to problems such as the training of neural networks and tensor decomposition, and the stable-manifold intuition may be of independent interest to the non-convex optimization community.
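The following numpy sketch shows the shape of the analyzed scheme: standard SVRG epochs, plus a uniform random perturbation whenever the full gradient is small, so that subsequent variance-reduced steps can escape strict saddle points. Step sizes, radii, and the trigger condition are illustrative, not the paper's constants.

```python
import numpy as np

def perturbed_svrg(grad_i, n, x0, lr=0.01, epochs=50, inner=None,
                   grad_tol=1e-3, radius=1e-3, seed=0):
    """grad_i(x, i): gradient of the i-th component function at x."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = inner or n                       # inner-loop length per epoch
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        if np.linalg.norm(full_grad) <= grad_tol:
            # Near a first-order stationary point: add a uniform-ball
            # perturbation so the iterates can leave a strict saddle.
            u = rng.standard_normal(x.shape)
            x = x + radius * u / np.linalg.norm(u)
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient (the SVRG estimator).
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x = x - lr * v
    return x
```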
Item Open Access: Theoretical Understanding of Neural Network Optimization Landscape and Self-Supervised Representation Learning (2023) Wu, Chenwei

Neural networks have achieved remarkable empirical success in various areas. One key factor in their success is their ability to automatically learn useful representations from data. Self-supervised representation learning, which learns representations during pre-training and applies them in downstream tasks, has become the dominant approach to representation learning in recent years. However, theoretical understanding of self-supervised representation learning is scarce. Two main bottlenecks in understanding it are the large differences between pre-training and downstream tasks and the difficulty of neural network optimization. In this thesis, we present an initial exploration of the benefit of pre-training in self-supervised representation learning and of two heuristics in neural network optimization.

The first part of this thesis presents our attempts to understand why the representations produced by pre-trained models are useful in downstream tasks; throughout this part, we assume the training objective can be optimized well. For the over-realized sparse coding model with noise, we show that the masking objective used in pre-training ensures recovery of the ground-truth model parameters. For a more complicated log-linear word model, we characterize which downstream tasks can benefit from the representations learned in pre-training. Our experiments validate these theoretical results.
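To illustrate the kind of objective meant here, the sketch below computes a masked-reconstruction loss for a sparse coding model y = Ax + noise: random coordinates of each sample are hidden, and the model is scored on reconstructing them from the visible coordinates. The encode interface and all constants are hypothetical; the thesis's exact objective and recovery argument differ.

```python
import numpy as np

def masked_loss(A_hat, encode, Y, mask_frac=0.3, seed=0):
    """A_hat: current dictionary estimate, shape (d, k).
    encode(y_visible, mask): the model's sparse code for a masked sample
    (hypothetical interface).
    Y: data matrix, shape (n, d), with rows y = A @ x + noise."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    total = 0.0
    for y in Y:
        mask = rng.random(d) < mask_frac           # True = hidden coordinate
        x_hat = encode(np.where(mask, 0.0, y), mask)
        y_hat = A_hat @ x_hat
        # Score only the masked coordinates, as in masked pre-training.
        total += np.sum((y_hat[mask] - y[mask]) ** 2)
    return total / n
```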
The second part of this thesis provides explanations for two important phenomena in the neural network optimization landscape. We first propose a novel conjecture that explains the low-rank structure of the layer-wise neural network Hessian; the conjecture is verified experimentally and can be used to tighten generalization bounds for neural networks. We also study training stability and generalization in the learning-to-learn framework, where machine learning algorithms are used to learn parameters for training neural networks. We rigorously prove our conjectures in simple models and empirically verify the theoretical results in experiments with practical neural networks and real data.
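One way to observe the low-rank phenomenon empirically is to form a layer-wise Hessian of a tiny network and inspect its spectrum, as in the hedged PyTorch sketch below. The model, data, and sizes are toy placeholders, not the thesis's experimental setup.

```python
import torch

torch.manual_seed(0)
X = torch.randn(64, 10)                   # toy inputs
y = torch.randint(0, 3, (64,))            # toy labels, 3 classes
W1 = torch.randn(10, 16) * 0.1            # first-layer weights (fixed here)
W2 = torch.randn(16, 3) * 0.1             # last-layer weights

def loss_of_W2(w2_flat):
    # Cross-entropy loss as a function of the last layer's weights only.
    h = torch.relu(X @ W1)
    logits = h @ w2_flat.view(16, 3)
    return torch.nn.functional.cross_entropy(logits, y)

# Layer-wise Hessian: 48 x 48 matrix for the flattened last layer.
H = torch.autograd.functional.hessian(loss_of_W2, W2.flatten())
eigvals = torch.linalg.eigvalsh(H)
# The spectrum is typically dominated by a few large eigenvalues, on the
# order of the number of classes, which is the low-rank structure at issue.
print(eigvals[-5:])
```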
Our results provide theoretical understanding of the benefits of pre-training for downstream tasks and of two important heuristics in the neural network optimization landscape. We hope these insights can further improve the performance of self-supervised representation learning approaches and inspire the design of new algorithms.
Item Embargo: Understanding Deep Learning via Analyzing Training Dynamics (2022) Wang, Xiang

Deep learning has achieved tremendous success in practice, yet theoretical understanding lags behind. How does gradient descent successfully optimize the highly non-convex training objective, and how does it find a solution that also generalizes well to unseen data despite the model being over-parameterized? Answering these questions requires a characterization of the training dynamics of gradient descent. In this thesis, we first develop techniques for analyzing training dynamics in tensor decomposition, and then showcase how analyzing gradient descent dynamics explains two phenomena.
In the first part, we analyze gradient descent dynamics in over-parameterized tensor decomposition. For non-orthogonal low-rank tensors, we show that gradient descent from a small initialization can identify the subspace in which the ground-truth components lie, and automatically exploit this structure to reduce the requirement on over-parameterization. For orthogonal tensors, we show that gradient descent fits the ground-truth components one by one, from larger to smaller, similar to a tensor deflation process. Since tensor decomposition is closely related to the optimization of neural networks, we believe many of the techniques developed here will apply to neural networks as well.
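A minimal numpy sketch of the orthogonal setting, assuming a symmetric third-order ground-truth tensor and gradient descent from a small initialization on the over-parameterized squared loss; it illustrates the deflation-like behavior, not the proof. All sizes and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_true, r_over = 20, 3, 30

# Orthogonal ground-truth components with decreasing weights.
Q, _ = np.linalg.qr(rng.standard_normal((d, r_true)))
w = np.array([3.0, 2.0, 1.0])
T = sum(w[i] * np.einsum('a,b,c->abc', Q[:, i], Q[:, i], Q[:, i])
        for i in range(r_true))

U = 1e-3 * rng.standard_normal((d, r_over))    # small initialization
lr = 0.02
for step in range(5000):
    R = np.einsum('ai,bi,ci->abc', U, U, U) - T    # residual tensor
    grad = 3 * np.einsum('abc,bi,ci->ai', R, U, U)  # grad of 0.5*||R||_F^2
    U -= lr * grad
# Components are fitted roughly from the largest weight to the smallest,
# mirroring a tensor deflation process.
```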
In the second part, we explain two phenomena by analyzing the training dynamics of gradient descent. We first explain the representation learning process of non-contrastive self-supervised methods by analyzing the training dynamics on a linear network; our analysis reveals the role of weight decay in discarding nuisance features and keeping robust features. We then show that there is a long plateau in both the loss and accuracy interpolation (between a random initialization and the minimizer it converges to) when different classes have different last-layer biases on a deep network. By analyzing a simple model, we also show how the last-layer biases for different classes can differ even on a perfectly balanced dataset.
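For concreteness, here is a hedged PyTorch sketch of a non-contrastive objective on a linear network: two augmented views, a linear predictor head, a stop-gradient on the target branch, and weight decay, whose feature-pruning role the analysis identifies. The augmentations and hyperparameters are illustrative, not the thesis's setting.

```python
import torch

torch.manual_seed(0)
d, k, n = 20, 5, 256
X = torch.randn(n, d)                      # toy unlabeled data

W = torch.nn.Linear(d, k, bias=False)      # linear encoder
P = torch.nn.Linear(k, k, bias=False)      # linear predictor head
opt = torch.optim.SGD(list(W.parameters()) + list(P.parameters()),
                      lr=0.05, weight_decay=1e-2)  # weight decay is the key knob

for step in range(1000):
    x1 = X + 0.1 * torch.randn_like(X)     # two augmented views
    x2 = X + 0.1 * torch.randn_like(X)
    z1, z2 = W(x1), W(x2)
    # Negative cosine similarity with a stop-gradient on the target branch,
    # the usual non-contrastive (SimSiam-style) objective.
    p1 = torch.nn.functional.normalize(P(z1), dim=1)
    t2 = torch.nn.functional.normalize(z2.detach(), dim=1)
    loss = -(p1 * t2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```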