Dimensionality Reduction and Learning on Networks
Machine learning is a powerful branch of mathematics and statistics that allows the automation of tasks that would otherwise require humans a long time to perform. Two particular fields of machine learning that have been developing in the last two decades are dimensionality reduction and semi-supervised learning.
Dimensionality reduction is a powerful tool in the analysis of high dimensional data by reducing the number of variables under consideration while approximately preserving some quantity of interest (usually pairwise distances). Methods such as Principal Component Analysis (PCA) or Isometric Feature Mapping (ISOMAP) do this do this by embedding the data, equipped with a nonnegative, symmetric, similarity kernel or adjacency matrix into Euclidean space and finding a linear subspace or low dimensional submanifold which best fits the data, respectively.
When the data takes the form of network data, how to perform such dimensionality reduction intrinsically, without resorting to an embedding, that can be extended to the case of nonnegative, non-symmetric adjacency matrices remains an important open problem. In the first part of my dissertation, using current techniques in local spectral clustering to partition the network using a Markov process induced by the adjacency matrix, we deliver an intrinsic dimensionality reduction of the network in terms of a non-Markov process on a reduced state space that approximately preserves transitions of the original Markov process between clusters. By iterating the process, one obtains a family of non-Markov processes on successively finer state spaces representing the original network ands its diffusion at different scales, which can be used to approximate the law of the original process at a particular time scale. We give applications of this theory to a variety of synthetic data sets and evaluate its performance accordingly.
Next, consider the case of detecting astronomical phenomenon solely in terms of the light intensities observed. There already exists a large database of prior recorded phenomena that has been categorized by humans as a function of the observed light intensity. Given these so-called class labels then, how can we automate the procedure of extending these class labels to the massive amount of data that is currently being observed? This is the problem of concern in semi-supervised learning.
In the second part of my thesis, we consider data sets in which relations between data points are more complex than simply pairwise. Examples include gene networks where the the data points are random variables, and similarities of a subset are measured by non-independence of the corresponding random variables. Such data sets can be illustrated as a hypergraph, and the natural question for diagnosis becomes: how does one perform transductive inference (a particular form of semi-supervised learning)? Using the simple case of pairwise and threewise similarities, we construct a reversible random walk on undirected edges induced by threewise relations (faces). By pulling the random walk back to a random walk on the vertex set and mixing it with the random walk induced by pairwise similarities, we perform diffusive transductive inference. We present applications and results of this technique, any analyze its performance on a variety of data sets.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations