Perturbation Analysis of Database Queries

dc.contributor.advisor

Yang, Jun

dc.contributor.author

Walenz, Brett

dc.date.accessioned

2020-01-27T16:51:50Z

dc.date.available

2020-01-27T16:51:50Z

dc.date.issued

2019

dc.department

Computer Science

dc.description.abstract

Data-driven decision making plays a dominant role across all domains, from health,

business, government, to sports. High impact decisions, decisions that may affect

an entire organization, are now being automated by analytical queries. A bank

account may be shut down if a query determines there may be fraudulent activity, a

sporting event may decrease ticket costs if the rate of ticket sales is not fast enough.

These data-driven decisions are often ad-hoc and resource intensive: a bank has to

compare and analyze all users, sporting events might use previous events to estimate

an acceptable ticket sales rate.

In this dissertation, I describe efficient methods for optimizing analytic queries

where the query has a complex user-defined function or join. I introduce an analysis

methodology for queries called query perturbation analysis. Query perturbation

analysis models a query as a collection of parameter settings rather than a collection

of database relations. In this model, a query conceptually involves evaluating the

same query template multiple times using perturbed parameter settings.

I begin with a discussion of the motivation for modeling complex database queries

in a perturbation analysis framework. Perturbation analysis often produces an effi-

cient and natural (to the end user) way to break down complex queries into isolated

parameterized queries. Analytical queries such as analyzing performance in sports,

or efficiently finding nearest neighbors to identify intrusion attacks in computer net-

works, are prime candidates for perturbation analysis.

I start by describing optimizations introduced by modeling a problem using per-

turbation analysis. I discuss four common optimizations to query perturbation anal-

ysis: parallelization, memoization, grouping, and pruning. Each optimization can be

used in different scenarios and under different constraints. I provide examples where

each optimization can be used.

I show how to tackle this problem from three distinct angles: as a distributed

system, within a database system, and using approximation methods.

As a distributed system, I introduce Perada, a system for perturbation analysis

which exposes these common optimizations and simplifies the development of general,

ad hoc perturbation analysis jobs. I first describe how to specify and interact with the

underlying optimizations, and then describe the implications of each optimization to

job performance. Perada hides this job performance complexity by runtime tuning

of optimizations through multiple stages and on/off switches for optimizations. I

describe this runtime strategy to balance tradeoffs of various optimizations and the

performance gain from a naive always-on strategy.

Within a database system, I push perturbation analysis through a SQL optimizer

and show how a certain class of queries called iceberg queries can be rewritten for

an order of magnitude decrease in execution time. I describe specific constraints and

requirements to first identify iceberg queries and ensure that rewritten execution

returns the same set as a traditional SQL compiler.

Using approximation methods, I describe solutions to approximately count per-

turbation analysis queries, and push the approach to a general class of queries with

expensive filters. All approaches rely on constructing a fast, probabilistic classifier

in place of the expensive filter and classifying individual perturbations. Using the

trained classifier, I describe two approaches to approximately count perturbation

analysis queries. Quantification learning uses the classifier directly to count the pre-

dicted labels. Sampling uses the classifier to create unequal probability sampling

schemes to produce highly accurate, low variance estimates.

For each distinct angle, I provide empirical results that show the effectiveness

of perturbation analysis in data systems. This dissertation shows that perturbation

analysis is a practical alternative method to traditional query execution techniques

and can be used in diverse settings: distributed, SQL, and approximated.

dc.identifier.uri

https://hdl.handle.net/10161/19797

dc.subject

Computer science

dc.title

Perturbation Analysis of Database Queries

dc.type

Dissertation

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Walenz_duke_0066D_15260.pdf
Size:
3.15 MB
Format:
Adobe Portable Document Format

Collections