Perturbation Analysis of Database Queries
dc.contributor.advisor | Yang, Jun | |
dc.contributor.author | Walenz, Brett | |
dc.date.accessioned | 2020-01-27T16:51:50Z | |
dc.date.available | 2020-01-27T16:51:50Z | |
dc.date.issued | 2019 | |
dc.department | Computer Science | |
dc.description.abstract | Data-driven decision making plays a dominant role across all domains, from health, business, government, to sports. High impact decisions, decisions that may affect an entire organization, are now being automated by analytical queries. A bank account may be shut down if a query determines there may be fraudulent activity, a sporting event may decrease ticket costs if the rate of ticket sales is not fast enough. These data-driven decisions are often ad-hoc and resource intensive: a bank has to compare and analyze all users, sporting events might use previous events to estimate an acceptable ticket sales rate. In this dissertation, I describe efficient methods for optimizing analytic queries where the query has a complex user-defined function or join. I introduce an analysis methodology for queries called query perturbation analysis. Query perturbation analysis models a query as a collection of parameter settings rather than a collection of database relations. In this model, a query conceptually involves evaluating the same query template multiple times using perturbed parameter settings. I begin with a discussion of the motivation for modeling complex database queries in a perturbation analysis framework. Perturbation analysis often produces an effi- cient and natural (to the end user) way to break down complex queries into isolated parameterized queries. Analytical queries such as analyzing performance in sports, or efficiently finding nearest neighbors to identify intrusion attacks in computer net- works, are prime candidates for perturbation analysis. I start by describing optimizations introduced by modeling a problem using per- turbation analysis. I discuss four common optimizations to query perturbation anal- ysis: parallelization, memoization, grouping, and pruning. Each optimization can be used in different scenarios and under different constraints. I provide examples where each optimization can be used. I show how to tackle this problem from three distinct angles: as a distributed system, within a database system, and using approximation methods. As a distributed system, I introduce Perada, a system for perturbation analysis which exposes these common optimizations and simplifies the development of general, ad hoc perturbation analysis jobs. I first describe how to specify and interact with the underlying optimizations, and then describe the implications of each optimization to job performance. Perada hides this job performance complexity by runtime tuning of optimizations through multiple stages and on/off switches for optimizations. I describe this runtime strategy to balance tradeoffs of various optimizations and the performance gain from a naive always-on strategy. Within a database system, I push perturbation analysis through a SQL optimizer and show how a certain class of queries called iceberg queries can be rewritten for an order of magnitude decrease in execution time. I describe specific constraints and requirements to first identify iceberg queries and ensure that rewritten execution returns the same set as a traditional SQL compiler. Using approximation methods, I describe solutions to approximately count per- turbation analysis queries, and push the approach to a general class of queries with expensive filters. All approaches rely on constructing a fast, probabilistic classifier in place of the expensive filter and classifying individual perturbations. Using the trained classifier, I describe two approaches to approximately count perturbation analysis queries. Quantification learning uses the classifier directly to count the pre- dicted labels. Sampling uses the classifier to create unequal probability sampling schemes to produce highly accurate, low variance estimates. For each distinct angle, I provide empirical results that show the effectiveness of perturbation analysis in data systems. This dissertation shows that perturbation analysis is a practical alternative method to traditional query execution techniques and can be used in diverse settings: distributed, SQL, and approximated. | |
dc.identifier.uri | ||
dc.subject | Computer science | |
dc.title | Perturbation Analysis of Database Queries | |
dc.type | Dissertation |
Files
Original bundle
- Name:
- Walenz_duke_0066D_15260.pdf
- Size:
- 3.15 MB
- Format:
- Adobe Portable Document Format