Efficient Algorithms for Querying Large and Uncertain Data
Query processing is an important problem in many research fields including database systems, data mining, and geometric computing. The goal is to preprocess input data into an index to efficiently answer queries over data. In many real applications we need to handle a large amount of data or/and data that are ambiguous because of human errors and data integration. With the data sets becoming increasingly large and complex, queries are also becoming complex, therefore, new challenging problems have emerged in the area of query processing.
Data summarization helps to accelerate expensive data queries by running the query procedure in a small data summary rather than the entire large dataset.
The first part of this thesis studies the problem of finding small high quality data summaries in the Euclidean space.
Data summarization can be applied either in the entire dataset or among points in a query range given by the user.
Efficient algorithms are proposed to get a small summarization of the entire dataset so that Top-$k$ or user preference queries are answered efficiently with high quality guarantees by looking only in the data summary.
Such a summary can also be used in multi-criteria decision problems.
Furthermore, near-linear space indexes are designed so that given a query rectangle, a diverse high-valued summary of $k$ points in the query range is returned efficiently.
The second part of the thesis focuses on data queries over uncertain data.
Uncertainty is typically captured using stochastic data models, and querying data requires either statistics about the probabilistic behavior of the underlying data, or data cleaning to reduce the uncertainty of the query answer.
Small size indexes are built so that, given a query rectangle, statistics of the MAX (or top-$k$) operator among the points in the query range are computed in sublinear time.
In addition, approximation algorithms are proposed for selecting data to clean in order to minimize the variance of the result of a query function over a probabilistic database.
While most of the methods hold for general functions, the main focus is on minimizing the variance in fact checking operations.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations
Works are deposited here by their authors, and represent their research and opinions, not that of Duke University. Some materials and descriptions may include offensive content. More info