Browsing by Author "Walenz, Brett"
Results Per Page
Sort Options
Item Open Access Perturbation Analysis of Database Queries(2019) Walenz, BrettData-driven decision making plays a dominant role across all domains, from health,
business, government, to sports. High impact decisions, decisions that may affect
an entire organization, are now being automated by analytical queries. A bank
account may be shut down if a query determines there may be fraudulent activity, a
sporting event may decrease ticket costs if the rate of ticket sales is not fast enough.
These data-driven decisions are often ad-hoc and resource intensive: a bank has to
compare and analyze all users, sporting events might use previous events to estimate
an acceptable ticket sales rate.
In this dissertation, I describe efficient methods for optimizing analytic queries
where the query has a complex user-defined function or join. I introduce an analysis
methodology for queries called query perturbation analysis. Query perturbation
analysis models a query as a collection of parameter settings rather than a collection
of database relations. In this model, a query conceptually involves evaluating the
same query template multiple times using perturbed parameter settings.
I begin with a discussion of the motivation for modeling complex database queries
in a perturbation analysis framework. Perturbation analysis often produces an effi-
cient and natural (to the end user) way to break down complex queries into isolated
parameterized queries. Analytical queries such as analyzing performance in sports,
or efficiently finding nearest neighbors to identify intrusion attacks in computer net-
works, are prime candidates for perturbation analysis.
I start by describing optimizations introduced by modeling a problem using per-
turbation analysis. I discuss four common optimizations to query perturbation anal-
ysis: parallelization, memoization, grouping, and pruning. Each optimization can be
used in different scenarios and under different constraints. I provide examples where
each optimization can be used.
I show how to tackle this problem from three distinct angles: as a distributed
system, within a database system, and using approximation methods.
As a distributed system, I introduce Perada, a system for perturbation analysis
which exposes these common optimizations and simplifies the development of general,
ad hoc perturbation analysis jobs. I first describe how to specify and interact with the
underlying optimizations, and then describe the implications of each optimization to
job performance. Perada hides this job performance complexity by runtime tuning
of optimizations through multiple stages and on/off switches for optimizations. I
describe this runtime strategy to balance tradeoffs of various optimizations and the
performance gain from a naive always-on strategy.
Within a database system, I push perturbation analysis through a SQL optimizer
and show how a certain class of queries called iceberg queries can be rewritten for
an order of magnitude decrease in execution time. I describe specific constraints and
requirements to first identify iceberg queries and ensure that rewritten execution
returns the same set as a traditional SQL compiler.
Using approximation methods, I describe solutions to approximately count per-
turbation analysis queries, and push the approach to a general class of queries with
expensive filters. All approaches rely on constructing a fast, probabilistic classifier
in place of the expensive filter and classifying individual perturbations. Using the
trained classifier, I describe two approaches to approximately count perturbation
analysis queries. Quantification learning uses the classifier directly to count the pre-
dicted labels. Sampling uses the classifier to create unequal probability sampling
schemes to produce highly accurate, low variance estimates.
For each distinct angle, I provide empirical results that show the effectiveness
of perturbation analysis in data systems. This dissertation shows that perturbation
analysis is a practical alternative method to traditional query execution techniques
and can be used in diverse settings: distributed, SQL, and approximated.