Geometric Computing over Uncertain Data

Zhang, Wuzhou

Geometric Computing over Uncertain Data

View / Download2.38 MB

Date

2015

Authors

Zhang, Wuzhou

Advisors

Agarwal, Pankaj Kumar

Repository Usage Stats

320
views

244
downloads

Abstract

Entering the era of big data, human beings are faced with an unprecedented amount of geometric data today. Many computational challenges arise in processing the new deluge of geometric data. A critical one is data uncertainty: the data is inherently noisy and inaccuracy, and often lacks of completeness. The past few decades have witnessed the influence of geometric algorithms in various fields including GIS, spatial databases, and computer vision, etc. Yet most of the existing geometric algorithms are built on the assumption of the data being precise and are incapable of properly handling data in the presence of uncertainty. This thesis explores a few algorithmic challenges in what we call geometric computing over uncertain data.

We study the nearest-neighbor searching problem, which returns the nearest neighbor of a query point in a set of points, in a probabilistic framework. This thesis investigates two different nearest-neighbor formulations: expected nearest neighbor (ENN), where we consider the expected distance between each input point and a query point, and probabilistic nearest neighbor (PNN), where we estimate the probability of each input point being the nearest neighbor of a query point.

For the ENN problem, we consider a probabilistic framework in which the location of each input point and/or query point is specified as a probability density function and the goal is to return the point that minimizes the expected distance. We present methods for computing an exact ENN or an \\eps-approximate ENN, for a given error parameter 0 < \\eps < 1, under different distance functions. These methods build an index of near-linear size and answer ENN queries in polylogarithmic or sublinear time, depending on the underlying function. As far as we know, these are the first nontrivial methods for answering exact or \\eps-approximate ENN queries with provable performance guarantees. Moreover, we extend our results to answer exact or \\eps-approximate k-ENN queries. Notably, when only the query points are uncertain, we obtain state-of-the-art results for top-k aggregate (group) nearest-neighbor queries in the L1 metric using the weighted SUM operator.

For the PNN problem, we consider a probabilistic framework in which the location of each input point is specified as a probability distribution function. We present efficient algorithms for (i) computing all points that are nearest neighbors of a query point with nonzero probability; (ii) estimating, within a specified additive error, the probability of a point being the nearest neighbor of a query point; (iii) using it to return the point that maximizes the probability being the nearest neighbor, or all the points with probabilities greater than some threshold to be the nearest neighbor. We also present some experimental results to demonstrate the effectiveness of our approach.

We study the convex-hull problem, which asks for the smallest convex set that contains a given point set, in a probabilistic setting. In our framework, the uncertainty of each input point is described by a probability distribution over a finite number of possible locations including a null location to account for non-existence of the point. Our results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of \\beta-hull that may be a useful representation of uncertain hulls.

We study contour trees of terrains, which encode the topological changes of the level set of the height value \\ell as we raise \\ell from -\\infty to +\\infty on the terrains, in a probabilistic setting. We consider a terrain that is defined by linearly interpolating each triangle of a triangulation. In our framework, the uncertainty lies in the height of each vertex in the triangulation, and we assume that it is described by a probability distribution. We first show that the probability of a vertex being a critical point, and the expected number of nodes (resp. edges) of the contour tree, can be computed exactly efficiently. Then we present efficient sampling-based methods for estimating, with high probability, (i) the probability that two points lie on an edge of the contour tree, within additive error; (ii) the expected distance of two points p, q and the probability that the distance of p, q is at least \\ell on the contour tree, within additive error and/or relative error, where the distance of p, q on a contour tree is defined to be the difference between the maximum height and the minimum height on the unique path from p to q on the contour tree.