Perturbation Analysis of Database Queries

Walenz, Brett

Perturbation Analysis of Database Queries

View / Download3.15 MB

Date

2019

Authors

Walenz, Brett

Advisors

Yang, Jun

Repository Usage Stats

145
views

377
downloads

Abstract

Data-driven decision making plays a dominant role across all domains, from health,

business, government, to sports. High impact decisions, decisions that may affect

an entire organization, are now being automated by analytical queries. A bank

account may be shut down if a query determines there may be fraudulent activity, a

sporting event may decrease ticket costs if the rate of ticket sales is not fast enough.

These data-driven decisions are often ad-hoc and resource intensive: a bank has to

compare and analyze all users, sporting events might use previous events to estimate

an acceptable ticket sales rate.

In this dissertation, I describe efficient methods for optimizing analytic queries

where the query has a complex user-defined function or join. I introduce an analysis

methodology for queries called query perturbation analysis. Query perturbation

analysis models a query as a collection of parameter settings rather than a collection

of database relations. In this model, a query conceptually involves evaluating the

same query template multiple times using perturbed parameter settings.

I begin with a discussion of the motivation for modeling complex database queries

in a perturbation analysis framework. Perturbation analysis often produces an effi-

cient and natural (to the end user) way to break down complex queries into isolated

parameterized queries. Analytical queries such as analyzing performance in sports,

or efficiently finding nearest neighbors to identify intrusion attacks in computer net-

works, are prime candidates for perturbation analysis.

I start by describing optimizations introduced by modeling a problem using per-

turbation analysis. I discuss four common optimizations to query perturbation anal-

ysis: parallelization, memoization, grouping, and pruning. Each optimization can be

used in different scenarios and under different constraints. I provide examples where

each optimization can be used.

I show how to tackle this problem from three distinct angles: as a distributed

system, within a database system, and using approximation methods.

As a distributed system, I introduce Perada, a system for perturbation analysis

which exposes these common optimizations and simplifies the development of general,

ad hoc perturbation analysis jobs. I first describe how to specify and interact with the

underlying optimizations, and then describe the implications of each optimization to

job performance. Perada hides this job performance complexity by runtime tuning

of optimizations through multiple stages and on/off switches for optimizations. I

describe this runtime strategy to balance tradeoffs of various optimizations and the

performance gain from a naive always-on strategy.

Within a database system, I push perturbation analysis through a SQL optimizer

and show how a certain class of queries called iceberg queries can be rewritten for

an order of magnitude decrease in execution time. I describe specific constraints and

requirements to first identify iceberg queries and ensure that rewritten execution

returns the same set as a traditional SQL compiler.

Using approximation methods, I describe solutions to approximately count per-

turbation analysis queries, and push the approach to a general class of queries with

expensive filters. All approaches rely on constructing a fast, probabilistic classifier

in place of the expensive filter and classifying individual perturbations. Using the

trained classifier, I describe two approaches to approximately count perturbation

analysis queries. Quantification learning uses the classifier directly to count the pre-

dicted labels. Sampling uses the classifier to create unequal probability sampling

schemes to produce highly accurate, low variance estimates.

For each distinct angle, I provide empirical results that show the effectiveness

of perturbation analysis in data systems. This dissertation shows that perturbation

analysis is a practical alternative method to traditional query execution techniques

and can be used in diverse settings: distributed, SQL, and approximated.

Type

Dissertation

Department

Computer Science

Subjects

Computer science

Permalink

https://hdl.handle.net/10161/19797

Citation

Walenz, Brett (2019). Perturbation Analysis of Database Queries. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/19797.

Collections

Dissertations

Full item page

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.

Perturbation Analysis of Database Queries

Date

Authors

Advisors

Journal Title

Journal ISSN

Volume Title

Repository Usage Stats

Abstract

Type

Department

Description

Provenance

Subjects

Citation

Permalink

Citation

Collections