Separating Features from Noise with Persistence and Statistics
In this thesis, we explore techniques in statistics and persistent homology, which detect features among data sets such as graphs, triangulations and point cloud. We accompany our theorems with algorithms and experiments, to demonstrate their effectiveness in practice.
We start with the derivation of graph scan statistics, a measure useful to assess the statistical significance of a subgraph in terms of edge density. We cluster graphs into densely-connected subgraphs based on this measure. We give algorithms for finding such clusterings and experiment on real-world data.
We next study statistics on persistence, for piecewise-linear functions defined on the triangulations of topological spaces. We derive persistence pairing probabilities among vertices in the triangulation. We also provide upper bounds for total persistence in expectation.
We continue by examining the elevation function defined on the triangulation of a surface. Its local maxima obtained by persistence pairing are useful in describing features of the triangulations of protein surfaces. We describe an algorithm to compute these local maxima, with a run-time ten-thousand times faster in practice than previous method. We connect such improvement with the total Gaussian curvature of the surfaces.
Finally, we study a stratification learning problem: given a point cloud sampled from a stratified space, which points belong to the same strata, at a given scale level? We assess the local structure of a point in relation to its neighbors using kernel and cokernel persistent homology. We prove the effectiveness of such assessment through several inference theorems, under the assumption of dense sample. The topological inference theorem relates the sample density with the homological feature size. The probabilistic inference theorem provides sample estimates to assess the local structure with confidence. We describe an algorithm that computes the kernel and cokernel persistence diagrams and prove its correctness. We further experiment on simple synthetic data.
spatial scan statistics
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations