Learning and Exploiting Low-Dimensional Structure in High-Dimensional Data

Li, Didong

Date

2020

Authors

Li, Didong

Advisors

Dunson, David B

Mukherjee, Sayan

Abstract

Data lying in a high-dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional manifold. If one does not exploit the hidden geometry in the data but instead works directly in the ambient high-dimensional Euclidean space, both statistical and computational efficiency are extremely low. In contrast, an accurate approximation of the unknown manifold benefits a variety of tasks, including dimension reduction, feature selection, density estimation, classification, clustering, data denoising and data visualization. Most of the data analysis literature relies on linear or locally linear methods. However, when the manifold has essential curvature, these linear methods suffer from low accuracy and efficiency. There is also an immense literature on non-linear methods, such as Variational Auto-Encoders and Gaussian Process Latent Variable Models, that aim to improve approximation performance. However, these methods are complex black boxes lacking reproducibility, identifiability and interpretability. As a result, new non-linear tools need to be developed without introducing too much extra complexity.

This dissertation focuses on exploiting the geometry in the data, through the curvature of the unknown manifold, to estimate the manifold efficiently while keeping the simple and clean closed forms of linear methods. In particular, a simple and general alternative to locally linear manifold learning methods is proposed, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown manifold. Spherical principal components analysis (SPCA) is developed as a non-linear alternative to PCA that finds the best sphere fitting the data. SPCA provides simple tools that can be implemented efficiently for big and complex data and that allow one to learn about geometric structure in the data without introducing much more complexity than linear methods.
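
The idea of fitting a sphere in closed form can be illustrated with a minimal sketch. It is not taken from the dissertation and may differ from the exact SPCA estimator: it assumes a two-step procedure (PCA projection onto a (d+1)-dimensional affine subspace, followed by a standard algebraic least-squares sphere fit). The function name `fit_sphere` and its arguments are hypothetical.

```python
import numpy as np

def fit_sphere(X, d):
    """Fit a d-dimensional sphere to the rows of X (n x D, with d < D).

    Illustrative sketch only; not necessarily the dissertation's SPCA estimator.
    Returns (center in R^D, radius, D x (d+1) basis of the fitted subspace).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)

    # Step 1: ordinary PCA. Keep the top d+1 directions, since a
    # d-dimensional sphere lives in a (d+1)-dimensional affine subspace.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[: d + 1].T              # D x (d+1) orthonormal basis
    Y = (X - mean) @ V             # n x (d+1) coordinates in the subspace

    # Step 2: algebraic least squares. Points on a sphere satisfy
    # ||y||^2 = 2 y.c + b with b = r^2 - ||c||^2, which is linear in (c, b).
    A = np.hstack([2.0 * Y, np.ones((len(Y), 1))])
    rhs = (Y ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    c_low, b = sol[:-1], sol[-1]

    radius = np.sqrt(b + c_low @ c_low)
    center = mean + V @ c_low      # map the center back to R^D
    return center, radius, V
```

Both steps reduce to linear algebra (one SVD and one least-squares solve), which is the sense in which a spherical fit can stay close in cost to a linear one.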

Inspired by spherelets, a curved kernel called the Fisher-Gaussian (FG) kernel is introduced, which outperforms multivariate Gaussians for density estimation. In particular, a Dirichlet process mixture of FG kernels is studied for density estimation and is proved to be posterior consistent. In addition, several applications of spherelets, including classification, geodesic distance estimation and clustering, are considered and illustrated on a variety of real data sets.
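
To give a feel for what a curved kernel looks like, the sketch below draws samples in the plane under the assumption that an FG kernel is generated by a von Mises-Fisher distribution on a sphere perturbed by Gaussian noise. The abstract does not spell out this construction, so it is an assumption here, and the restriction to a circle in two dimensions (where von Mises-Fisher reduces to the von Mises distribution) is a simplification. The function name `sample_fg_circle` and all parameter names are hypothetical.

```python
import numpy as np

def sample_fg_circle(n, center, radius, mu_angle, kappa, sigma, seed=None):
    """Draw n points in R^2 from a circle-supported kernel plus Gaussian noise.

    Illustrative sketch only, under the assumptions stated above.
    mu_angle and kappa are the mean direction and concentration on the circle;
    sigma is the standard deviation of the isotropic Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    center = np.asarray(center, dtype=float)

    # Angles concentrated around mu_angle (von Mises on the circle).
    theta = rng.vonmises(mu_angle, kappa, size=n)

    # Place the draws on the circle with the given center and radius.
    on_circle = center + radius * np.column_stack([np.cos(theta), np.sin(theta)])

    # Perturb with isotropic Gaussian noise in the ambient plane.
    noise = rng.normal(scale=sigma, size=(n, 2))
    return on_circle + noise
```

With small `sigma` the samples hug an arc of the circle, a shape that a single multivariate Gaussian cannot follow without spreading mass away from the curve.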

Type

Dissertation

Department

Mathematics

Subjects

Mathematics, Statistics

Permalink

https://hdl.handle.net/10161/21433

Citation

Li, Didong (2020). Learning and Exploiting Low-Dimensional Structure in High-Dimensional Data. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/21433.

Collections

Dissertations

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.
