# Duke Student Scholarship

## Permanent URI for this community

## Browse

# Browsing Duke Student Scholarship by Department "Computer Science"

###### Results Per Page

###### Sort Options

Item Open Access 3D Object Representations for Robot Perception(2019) Burchfiel, Benjamin Clark MalloyReasoning about 3D objects is one of the most critical perception problems robots face; outside of navigation, most interactions between a robot and its environment are object-centric. Object-centric robot perception has long relied on maintaining an explicit database of 3D object models with the assumption that encountered objects will be exact copies of entries in the database; however, as robots move into unstructured environments such as human homes, the variation of encountered objects increases and maintaining an explicit object database becomes infeasible. This thesis introduces a general-purpose 3D object representation that allows the joint estimation of a previously unencountered object's class, pose, and 3D shape---crucial foundational tasks for general robot perception.

We present the first method capable of performing all three of these tasks simultaneously, Bayesian Eigenobjects (BEOs), and show that it outperforms competing approaches which estimate only object shape and class given a known object pose. BEOs use an approximate Bayesian version of Principal Component Analysis to learn an explicit low-dimensional subspace containing the 3D shapes of objects of interest, which allows for efficient shape inference at high object resolutions. We then extend BEOs to produce Hybrid Bayesian Eigenobjects (HBEOs), a fusion of linear subspace methods with modern convolutional network approaches, enabling realtime inference from a single depth image. Because HBEOs use a Convolutional Network to project partially observed objects onto the learned subspace, they allow the object to be larger and more expressive without impacting the inductive power of the model. Experimentally, we show that HBEOs offer significantly improved performance on all tasks compared to their BEO predecessors. Finally, we leverage the explicit 3D shape estimate produced by BEOs to further extend the state-of-the-art in category level pose estimation by fusing probabilistic pose predictions with a silhouette-based reconstruction prior. We also illustrate the advantages of combining both probabilistic pose estimation and shape verification, via an ablation study, and show that both portions of the system contribute to its performance. Taken together, these methods comprise a significant step towards creating a general-purpose 3D perceptual foundation for robotics systems, upon which problem-specific systems may be built.

Item Open Access A Logical Controller Architecture for Network Security(2020) Yao, YuanjunNetworked infrastructure-as-a-service testbeds are evolving with higher capacity and more advanced capabilities. Modern testbeds offer stitched virtual circuit capability, programmable dataplanes with software-defined networking (SDN), and in-network processing on adjacent cloud servers. With these capabilities they are able to host virtual network service providers (NSPs) that peer and exchange traffic with edge subnets and with other NSPs in the testbeds. Testbed tenants may configure and program their NSPs to provide a range of functions and capabilities. Programmable NSPs enable innovation in network services and protocols following the pluralist philosophy of network architecture.

Advancing testbeds offer an opportunity to harness their power to deploy production NSPs with topology and value-added features tailored to the needs of specific user communities. For example, one objective of this research is to define abstractions and tools to support built-to-order virtual science networks for data-intensive science collaborations that share and exchange datasets securely at high speed. A virtual science network may combine dedicated high-speed circuits on advanced research fabrics with integrated in-network processing on virtual cloud servers, and links to exchange traffic with customer campus networks and/or testbed slices. We propose security-managed science networks with additional security features including access control, embedded virtual security appliances, and managed connectivity according to customer policy. A security-managed NSP is in essence a virtual software-defined exchange (SDX) that applies customer-specified policy to mediate connectivity.

This dissertation proposes control abstractions for dynamic NSPs, with a focus on managing security in the control plane based on programmable security policy. It defines an architecture for automated NSP controllers that orchestrate and program an NSP's SDN dataplane and manage its interactions with customers and peer NSPs. A key element of the approach is to use declarative trust logic to program the control plane: all control-plane interactions---including route advertisements, address assignments, policy controls, and governance authority---are represented as signed statements in a logic (trust datalog). NSP controllers use a logical inference engine to authorize all interactions and check for policy compliance.

To evaluate these ideas, we develop the ExoPlex controller framework for secure policy-based networking over programmable network infrastructures. An ExoPlex NSP combines a logical NSP controller with an off-the-shelf SDN controller and an existing trust logic platform (SAFE), both of which were enhanced for this project. Experiments with the software on testbeds---ExoGENI, ESnet, and Chameleon---demonstrate the power and potential of the approach. The dissertation presents the research in four parts.

The first part introduces the foundational capabilities of research testbeds that enables the approach, and presents the design of the ExoPlex controller framework to leverage those capabilities for hosted NSPs. We demonstrate a proof-of-concept deployment of an NSP with network function virtualization, an elastic dataplane, and managed traffic security on the ExoGENI testbed.

The second part introduces logical trust to structure control-plane interactions and program security policy. We show how to use declarative trust logic to address the challenges for managing identity, resource access, peering, connectivity and secure routing. We present off-the-shelf SAFE logic templates and rules to demonstrate a virtual SDX that authorizes network stitching and connectivity with logical trust.

The third part applies the controller architecture to secure policy-based interdomain routing among transit NSPs based on a logical trust plane. Signed logic exchanges propagate advertised routes and policies through the network. We show that trust logic rules capture and represent current and evolving Internet security protocols, affording protection equivalent to BGPsec for secure routing and RPKI for origin authentication. The logic also supports programmable policy for managed connectivity with end-to-end trust, allowing customers to permission the NSPs so that customer traffic does not pass through untrusted NSPs (path control).

The last part introduces SCIF, which extends logical peering and routing to incorporate customizable policies to defend against packet spoofing and route leaks. It uses trust logic to define more expressive route advertisements and compliance checks to filter advertisements that propagate outside of their intended scope. For SCIF, we extended the ExoPlex SDN dataplanes to configure ingress packet filters automatically from accepted routes (unicast Reverse Path Forwarding). We present logic templates that capture the defenses of valley-free routing and the Internet MANRS approach based on a central database of route ingress/egress policies (RADb/RPSL). We show how to extend their expressive power for stronger routing security, and complement it with path control policies that constrain the set of trusted NSPs for built-to-order internetworks.

Item Open Access A Semi-Supervised Predictive Model to Link Regulatory Regions to Their Target Genes(2015) Hafez, Dina MohamedNext generation sequencing technologies have provided us with a wealth of data profiling a diverse range of biological processes. In an effort to better understand the process of gene regulation, two predictive machine learning models specifically tailored for analyzing gene transcription and polyadenylation are presented.

Transcriptional enhancers are specific DNA sequences that act as ``information integration hubs" to confer regulatory requirements on a given cell. These non-coding DNA sequences can regulate genes from long distances, or across chromosomes, and their relationships with their target genes are not limited to one-to-one. With thousands of putative enhancers and less than 14,000 protein-coding genes, detecting enhancer-gene pairs becomes a very complex machine learning and data analysis challenge.

In order to predict these specific-sequences and link them to genes they regulate, we developed McEnhancer. Using DNAseI sensitivity data and annotated in-situ hybridization gene expression clusters, McEnhancer builds interpolated Markov models to learn enriched sequence content of known enhancer-gene pairs and predicts unknown interactions in a semi-supervised learning algorithm. Classification of predicted relationships were 73-98% accurate for gene sets with varying levels of initial known examples. Predicted interactions showed a great overlap when compared to Hi-C identified interactions. Enrichment of known functionally related TF binding motifs, enhancer-associated histone modification marks, along with corresponding developmental time point was highly evident.

On the other hand, pre-mRNA cleavage and polyadenylation is an essential step for 3'-end maturation and subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50\% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-UTRs, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered by the lack of appropriate tests for determining APAs with significant differences across multiple libraries.

We specified a linear effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive kernel-based SVM models successfully classified constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. The main cis-regulatory elements described for polyadenylation were found to be a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific sites. We applied this model on SRp20 data, an RNA binding protein that might be involved in oncogene activation and obtained interesting insights.

Together, these two models contribute to the understanding of enhancers and the key role they play in regulating tissue-specific expression patterns during development, as well as provide a better understanding of the diversity of post-transcriptional gene regulation in multiple tissue types.

Item Open Access A Theoretical and Experimental Study of DNA Self-assembly(2012) Chandran, HarishThe control of matter and phenomena at the nanoscale is fast becoming one of the most important challenges of the 21st century with wide-ranging applications from energy and health care to computing and material science. Conventional top-down approaches to nanotechnology, having served us well for long, are reaching their inherent limitations. Meanwhile, bottom-up methods such as self-assembly are emerging as viable alternatives for nanoscale fabrication and manipulation.

A particularly successful bottom up technique is DNA self-assembly where a set of carefully designed DNA strands form a nanoscale object as a consequence of specific, local interactions among the different components, without external direction. The final product of the self-assembly process might be a static nanostructure or a dynamic nanodevice that performs a specific function. Over the past two decades, DNA self-assembly has produced stunning nanoscale objects such as 2D and 3D lattices, polyhedra and addressable arbitrary shaped substrates, and a myriad of nanoscale devices such as molecular tweezers, computational circuits, biosensors and molecular assembly lines. In this dissertation we study multiple problems in the theory, simulations and experiments of DNA self-assembly.

We extend the Turing-universal mathematical framework of self-assembly known as the Tile Assembly Model by incorporating randomization during the assembly process. This allows us to reduce the tile complexity of linear assemblies. We develop multiple techniques to build linear assemblies of expected length N using far fewer tile types than previously possible.

We abstract the fundamental properties of DNA and develop a biochemical system, which we call meta-DNA, based entirely on strands of DNA as the only component molecule. We further develop various enzyme-free protocols to manipulate meta-DNA systems and provide strand level details along with abstract notations for these mechanisms.

We simulate DNA circuits by providing detailed designs for local molecular computations that involve spatially contiguous molecules arranged on addressable substrates via enzyme-free DNA hybridization reaction cascades. We use the Visual DSD simulation software in conjunction with localized reaction rates obtained from biophysical modeling to create chemical reaction networks of localized hybridization circuits that are then model checked using the PRISM model checking software.

We develop a DNA detection system employing the triggered self-assembly of a novel DNA dendritic nanostructure. Detection begins when a specific, single-stranded target DNA strand triggers a hybridization chain reaction between two distinct DNA hairpins. Each hairpin opens and hybridizes up to two copies of the other, and hence each layer of the growing dendritic nanostructure can in principle accommodate an exponentially increasing number of cognate molecules, generating a nanostructure with high molecular weight.

We build linear activatable assemblies employing a novel protection/deprotection strategy to strictly enforce the direction of tiling assembly growth to ensure the robustness of the assembly process. Our system consists of two tiles that can form a linear co-polymer. These tiles, which are initially protected such that they do not react with each other, can be activated to form linear co-polymers via the use of a strand displacing enzyme.

Item Open Access Accelerating Data Parallel Applications via Hardware and Software Techniques(2020) Bashizade, RaminThe unprecedented amount of data available today opens the door to many new applications in areas such as finance, scientific simulation, machine learning, etc. Many such applications perform the same computations on different data, which are called data-parallel. However, processing this enormous amount of data is challenging, especially in the post-Moore's law era. Specialized accelerators are a promising solution to meet the performance requirements of data-parallel applications. Among these are graphics processing units (GPUs), as well as more application-specific solutions.

One of the areas with high performance requirements is statistical machine learning, which has widespread applications in various domains. These methods include probabilistic algorithms, such as Markov Chain Monte-Carlo (MCMC), which rely on generating random numbers from probability distributions. These algorithms are computationally expensive on conventional processors, yet their statistical properties, namely, interpretability and uncertainty quantification compared to deep learning, make them an attractive alternative approach. Therefore, hardware specialization can be adopted to address the shortcomings of conventional processors in running these applications.

In addition to hardware techniques, probabilistic algorithms can benefit from algorithmic optimizations that aim to avoid performing unnecessary work. To be more specific, we can skip a random variable (RV) whose probability distribution function (PDF) is concentrated on only one value, i.e., there is only one value to choose, and the values of its neighboring RVs have not changed. In other words, if a RV has a concentrated PDF, its PDF will remain concentrated until at least one of its neighbors changes. Due to their high throughput and centralized scheduling mechanism, GPUs are a suitable target for this optimization.

Other than probabilistic algorithms, GPUs can be utilized to accelerate a variety of applications. GPUs with their Single-Instruction Multiple-Thread (SIMT) execution model offer massive parallelism that is combined with a relative ease of programming. The large amount and diversity of resources on the GPU is intended to ensure applications with different characteristics can achieve high performance, but at the same time it means that some of these resources will remain under-utilized, which is inefficient in a multi-tenant environment.

In this dissertation, we propose and evaluate solutions to the challenges mentioned above, namely i) accelerating probabilistic algorithms with uncertainty quantification, ii) optimizing probabilistic algorithms on GPUs to avoid unnecessary work, and iii) increasing resource utilization of GPUs in multi-tenant environments.

Item Open Access Adapting a Kidney Exchange Algorithm to Incorporate Human Values(2017-05-04) Freedman, RachelArtificial morality is moral behavior exhibited by automated or artificially intelligent agents. A primary goal of the field of artificial morality is to design artificial agents that act in accordance with human values. One domain in which computer systems make such value decisions is that of kidney exchange. In a kidney exchange, incompatible patient-donor pairs exchange kidneys with other incompatible pairs instead of waiting for cadaver kidney transplants. This allows these patients to be removed from the waiting list and to receive live transplants, which typically have better outcomes. When the matching of these pairs is automated, algorithms must decide which patients to prioritize. In this paper, I develop a procedure to align these prioritization decisions with human values. Many previous attempts to impose human values on artificial agents have relied on the “top-down approach” of defining a coherent framework of rules for the agent to follow. Instead, I develop my value function by gathering survey participant responses to relevant moral dilemmas, using machine learning to approximate the value system that these responses are based on, and then encoding these into the algorithm. This “bottom-up approach” is thought to produce more accurate, robust, and generalizable moral systems. My method of gathering, analyzing, and incorporating public opinion can be easily generalized to other domains. Its success here therefore suggests that it holds promise for the future development of artificial morality.Item Open Access Algorithms for Allocation Problems in Online Settings(2018) Kell, Nathaniel BrianA fundamental computational challenge that arises in the operation of online systems, services, and platforms is that of resource allocation. Broadly deﬁned, a resource allocation problem is one where set of users generate demands, which then have to be satisﬁed using a limited set of resources. In many of these scenarios, requests and jobs arrive in an online sequence, meaning they arrive over time and must be allocated without knowledge of future requests.

In this thesis, we examine resource allocation problems in online settings, focusing on problems in data center scheduling and internet advertising. Our results are summarized as follows.

• Vector Scheduling: We resolve the complexity of the vector scheduling problem, a variant of the classic load balancing problem ﬁrst considered by Graham, where jobs have vectors loads (as opposed to a single scalar). We study the problem in the three classical settings—identical, unrelated, and related—giving competitive algorithms for optimizing generic q-norms of the machines loads. In each setting we show these algorithm are optimal by giving asymptotically matching lower bounds. Also as a consequence of these results, we give the ﬁrst constant competitive algorithm for optimizing q-norms on related machines in the scalar setting.

• Budgeted Allocation: We study a variant of the online budgeted allocation (also called AdWords) problem where advertising budgets are expressed over multiple tiers of user-attribute granularity. We show that, unlike in the single-budget AdWords problem, obtaining a constant competitive ratio is impossible and give asymptotically tight upper and lower bounds. However, we then observe that in many real-world scenarios, multi-tier budgets have a laminar structure. In this setting we obtain a competitive ratio of e/(e−1) in the small bids case, which matches the best known AdWords result for single budgets.

Item Open Access Algorithms for Analyzing Spatio-Temporal Data(2018) Nath, AbhinandanIn today's age, huge data sets are becoming ubiquitous. In addition to their size, most of these data sets are often noisy, have outliers, and are incomplete. Hence, analyzing such data is challenging. We look at applying geometric techniques to tackle some of these challenges, with an emphasis on designing provably efficient algorithms. Our work takes two broad approaches -- distributed algorithms, and concise descriptors for big data sets.

With the massive amounts of data available today, it is common to store and process data using multiple machines. Parallel programming frameworks such as MapReduce and its variants are becoming popular for handling such large data. We present the first provably efficient algorithms to compute, store, and query data structures for range queries and approximate nearest neighbor queries in a popular parallel computing abstraction that captures the salient features of MapReduce and other massively parallel communication (MPC) models. Our algorithms are competitive in terms of running time and workload to their classical counterparts.

We propose parallel algorithms in the MPC model for processing large terrain elevation data (represented as a 3D point cloud) that are too big to fit on one machine. In particular, we present a simple randomized algorithm to compute the Delaunay triangulation of the $xy$-projections of the input points. Next we describe an efficient algorithm to compute the contour tree (a topological descriptor that succinctly encodes the contours of a terrain) of the resulting triangulated terrain.

We then look at comparing real-valued functions, by computing a distance function between their merge trees (a small-sized descriptor that succinctly captures the sublevel sets of a function). Merge trees are robust to noise in the data, and can be used effectively as proxies for the data sets themselves in some cases. We also use it give an algorithm to compute the Gromov-Hausdorff distance, a natural way to measure distance between two metric spaces. We give the first proof of hardness and the first non-trivial approximation algorithm for computing the Gromov-Hausdorff distance between metric spaces defined on trees, where the distance between two points is given by the length of the unique path between them in the tree.

Finally we look at the problem of capturing shared portions between large number of input trajectories. We formulate it as a subtrajectory clustering problem - the clustering of subsequences of trajectories. We propose a new model for clustering subtrajectories. Each cluster of subtrajectories is represented as a \emph{pathlet}, a sequence of points that is not necessarily a subsequence of an input trajectory. We present a single objective function for finding the optimal collection of pathlets that best represents the trajectories taking into account noise and other artifacts of the data. We show that the subtrajectory clustering problem is NP-hard and present fast approximation algorithms. We further improve the running time of our algorithm if the input trajectories are ``well-behaved". We also present experimental results on both real and synthetic data sets.

Item Open Access Algorithms for Clustering, Partitioning, and Planning(2023) Taylor, Erin C.We explore three geometric optimization problems with multifaceted goals, including efficiency, fairness, solution quality, and interpretability. The problems we study pose significant challenges because they involve complex geometric objects that interact in high-dimensional spaces. To contend with these challenges, our work utilizes realistic input structure to design algorithms that provide provably good solutions. Alternatively, we can approach the general problem statement with heuristic or approximate schemes.

The first problem domain addressed is trajectory clustering, where the goal is to partition noisy trajectories (polygonal curves) into clusters based on shared structure and find a representative trajectory for each cluster. We present a near-linear time (1+\eps)-approximation algorithm for k-median clustering of polygonal trajectories under the discrete Fréchet distance and finite point sets under the Hausdorff distance. The algorithm leverages the geometric properties of the trajectories, specifically that centers come from a "simpler" metric space, to achieve efficient clustering results.

The second problem we consider is redistricting, aiming to partition a given population into geographic regions in a fair and representative way. Redistricting involves drawing boundaries for electoral districts, which can significantly impact political representation and decision-making processes. We introduce a new model of fairness for the problem of fair redistricting and investigate the existence and computability of fair partitions. We consider this problem in both one and two dimensions. We also propose a new method of generating redistricting plans that leads to more compact (geometrically "nice") districts.

The final problem addressed is optimal motion planning for multiple robots in polygonal environments with obstacles. Each robot is represented a simple polygonal object, such as a unit disc or square. We present algorithms for generating motion plans which approximately minimize the sum of distances traveled from a start to target configuration. For the simplified case of two robots and given the paths the robots must traverse, we also give an algorithm to minimize the makespan (latest arrival time) of our schedule.

Item Open Access Algorithms for continuous queries: A geometric approach(2013) Yu, AlbertThere has been an unprecedented growth in both the amount of data and the number of users interested in different types of data. Users often want to keep track of the data that match their interests over a period of time. A continuous query, once issued by a user, maintains the matching results for the user as new data (as well as updates to the existing data) continue to arrive in a stream. However, supporting potentially millions of continuous queries is a huge challenge. This dissertation addresses the problem of scalably processing a large number of continuous queries over a wide-area network.

Conceptually, the task of supporting distributed continuous queries can be divided into two components--event processing (computing the set of affected users for each data update) and notification dissemination (notifying the set of affected users). The first part of this dissertation focuses on event processing. Since interacting with large-scale data can easily frustrate and overwhelm the users, top-k queries have attracted considerable interest from the database community as they allow users to focus on the top-ranked results only. However, it is nearly impossible to find a set of common top-ranked data that everyone is interested in, therefore, users are allowed to specify their interest in different forms of preferences, such as personalized ranking function and range selection. This dissertation presents geometric frameworks, data structures, and algorithms for answering several types of preference queries efficiently. Experimental evaluations show that our approaches outperform the previous ones by orders of magnitude.

The second part of the dissertation presents comprehensive solutions to the problem of processing and notifying a large number of continuous range top-k queries across a wide-area network. Simple solutions include using a content-driven network to notify all continuous queries whose ranges contain the update (ignoring top-k), or using a server to compute only the affected continuous queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. This dissertation presents a geometric framework which allows the set of affected continuous queries to be described succinctly with messages that can be efficiently disseminated using content-driven networks. Fast algorithms are also developed to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all continuous queries.

The final component of this dissertation is the design of a wide-area dissemination network for continuous range queries. In particular, this dissertation addresses the problem of assigning users to servers in a wide-area content-based publish/subscribe system. A good assignment should consider both users' interests and locations, and balance multiple performance criteria including bandwidth, delay, and load balance. This dissertation presents a Monte Carlo approximation algorithm as well as a simple greedy algorithm. The Monte Carlo algorithm jointly considers multiple performance criteria to find a broker-subscriber assignment and provides theoretical performance guarantees. Using this algorithm as a yardstick, the greedy algorithm is also concluded to work well across a wide range of workloads.

Item Open Access Algorithms for Geometric Matching, Clustering, and Covering(2016) Pan, JiangweiWith the popularization of GPS-enabled devices such as mobile phones, location data are becoming available at an unprecedented scale. The locations may be collected from many different sources such as vehicles moving around a city, user check-ins in social networks, and geo-tagged micro-blogging photos or messages. Besides the longitude and latitude, each location record may also have a timestamp and additional information such as the name of the location. Time-ordered sequences of these locations form trajectories, which together contain useful high-level information about people's movement patterns.

The first part of this thesis focuses on a few geometric problems motivated by the matching and clustering of trajectories. We first give a new algorithm for computing a matching between a pair of curves under existing models such as dynamic time warping (DTW). The algorithm is more efficient than standard dynamic programming algorithms both theoretically and practically. We then propose a new matching model for trajectories that avoids the drawbacks of existing models. For trajectory clustering, we present an algorithm that computes clusters of subtrajectories, which correspond to common movement patterns. We also consider trajectories of check-ins, and propose a statistical generative model, which identifies check-in clusters as well as the transition patterns between the clusters.

The second part of the thesis considers the problem of covering shortest paths in a road network, motivated by an EV charging station placement problem. More specifically, a subset of vertices in the road network are selected to place charging stations so that every shortest path contains enough charging stations and can be traveled by an EV without draining the battery. We first introduce a general technique for the geometric set cover problem. This technique leads to near-linear-time approximation algorithms, which are the state-of-the-art algorithms for this problem in either running time or approximation ratio. We then use this technique to develop a near-linear-time algorithm for this

shortest-path cover problem.

Item Open Access Algorithms for Networks With Uncertainty(2019) Haney, Samuel MitchellIn this dissertation, we study algorithmic problems motivated by the optimization of networks under uncertainty.

We summarize our contributions:

\begin{itemize}

\item \textbf{Subset $k$-server:} We propose and give algorithms for the \emph{all-or-one $k$-server}, a generalization of classical $k$-server problem.

$k$-server is an example of a problem in an \emph{online setting}: the algorithm must make irrevocable decisions without knowledge of future inputs.

In the all-or-one $k$-server generalization, clients can make a request for a particular server, or they can submit a general request that may be served by any server.

We give an $O(\log k)$-competitive randomized algorithm for this problem on a uniform metric.

We additionally study other, similar generalizations of $k$-server.

\item \textbf{Retraction:} Motivated by the problem of deploying distributed applications in the cloud, we initiate the algorithmic study of \emph{graph retraction} to a cycle, which seeks a mapping of a graph to a cycle in the graph so as to minimize the maximum stretch of any edge, subject to the constraint that each vertex in the cycle is mapped to itself.

Our main results are an $O(\min\{k, \sqrt{n}\})$-approximation for retracting any graph on $n$ nodes to a cycle with $k$ nodes, and an optimal algorithm when the graph is planar.

\item \textbf{Symmetric Interdiction:} We study the symmetric matching interdiction problem. This problem can be simply stated as follows: find a matching whose removal minimizes the size of the maximum matching in the remaining graph. We show that this problem is APX-hard, and obtain a $3/2$-approximation algorithm. We additionally introduce symmetric interdiction as a general model.

We give a general framework that relates optimization to symmetric interdiction for a broad class of optimization problems.

We are motivated by applications in traffic engineering, where a network operator wishes to route traffic in a datacenter, but cannot distinguish between malicious and legitimate traffic.

\item \textbf{Multicast Games:}

We study the effect of strategic behavior on network design.

In \emph{multicast} and \emph{broadcast} games, agents in a graph attempt to connect to a \emph{root node} at minimum cost to themselves, sharing the cost of each edge along their path to the root equally with the other agents using the edge.

Such games can have many Nash equilibria, and networks formed dynamically by these agents could end up in any one of these equilibria, and may be very expensive.

The main open problem in this field has been to bound the ratio of least expensive Nash equilibrium to the least expensive network, called the price of stability (PoS).

We make progress towards a better price of stability bound for multicast games.

In particular, we show a constant upper bound on the PoS of multicast games for quasi-bipartite graphs. These are

graphs where all edges are between two terminals (as in broadcast games) or between a terminal and a

nonterminal, but there is no edge between nonterminals. This represents a natural class of intermediate

generality between broadcast and multicast games.

\end{itemize}

Item Open Access Algorithms for Public Decision Making(2019) Fain, Brandon ThomasIn public decision making, we are confronted with the problem of aggregating the conflicting preferences of many individuals about outcomes that affect the group. Examples of public decision making include allocating shared public resources and social choice or voting. We study these problems from the perspective of an algorithm designer who takes the preferences of the individuals and the constraints of the decision making problem as input and efficiently computes a solution with provable guarantees with respect to fairness and welfare, as defined on individual preferences.

Concerning fairness, we develop the theory of group fairness as core or proportionality in the allocation of public goods. The core is a stability based notion adapted from cooperative game theory, and we show extensive algorithmic connections between the core solution concept and optimizing the Nash social welfare, the geometric mean of utilities. We explore applications in public budgeting, multi-issue voting, memory sharing, and fair clustering in unsupervised machine learning.

Regarding welfare, we extend recent work in implicit utilitarian social choice to choose approximately optimal public outcomes with respect to underlying cardinal valuations using limited ordinal information. We propose simple randomized algorithms with strong utilitarian social cost guarantees when the space of outcomes is metric. We also study many other desirable properties of our algorithms, including approximating the second moment of utilitarian social cost. We explore applications in voting for public projects, preference elicitation, and deliberation.

Item Open Access Algorithms for Rectangular Robot Motion Planning(2023) Steiger, Alexander JohnWith the proliferation of robot deployment in large-scale operations, such as warehouse management and manufacturing, there is an increasing need for multi-robot motion planning solutions. The basic multi-robot motion planning problem is to decide whether robots can translate from given start positions to given target positions without colliding with each other or any obstacles in the environment on their way, and if so, to compute such collision-free paths. Together, such paths compose a collision-free motion plan.

In practice, there are other qualities to consider besides being collision-free, such as the lengths of the paths, the overall time for all robots to follow their paths, and robustness. We explore this problem space, called optimal multi-robot motion planning. In light of previous hardness results, we focus on instances where the robot(s) and obstacles are assumed to have simple but realistic geometric properties with the goal of describing algorithms that are provably efficient and/or compute high-quality solutions. A frequent aspect of our work is to suppose that the input objects (robots and/or obstacles) are rectangular objects with bounded aspect ratio; that is, the side lengths of the objects are similar in proportion to each other.

Motivated by applications to motion planning and others,we explore the problem of computing and decomposing the common exterior of a set of n 3D axis-aligned cubes. We first present an algorithm to compute the common exterior, whose runtime is nearly optimal and output-sensitive in the sense that the running time depends linearly on the complexity of the common exterior, k, and not the worst-case complexity, Θ(n^2), by exploiting the geometry of the cubes.

Second, we describe an algorithm to further process the common exterior of n axis-aligned cubes into a partition composed of simple regions, specifically axis-aligned boxes. Our partition has size near-linear in the complexity of the union, k, and Ω(k) is a lower bound on the size of any partition.We also describe a simpler algorithm to partition the common exterior of n axis-aligned boxes of arbitrary sizes into O(n^(3/2) + k) axis-aligned boxes, which is optimal in the sense that min{n^(3/2),k} is a worst-case lower bound. In the context of motion planning, our algorithms can be used to compute a small partition of the free space for a box-shaped robot amid obstacles approximated by a set of boxes.

Lastly, we consider a special case of the min-sum motion planning problem, which falls under the umbrella of optimal motion planning. The goal is to compute a motion plan that minimizes the total length of the paths to be followed by the robots. This problem is well-studied for a single robot and polynomial-time algorithms to compute the optimal motion plan are known.However, we are not aware of any polynomial time algorithm for this problem, nor do we know whether the problem is NP-hard. We consider instances of this problem for congruent square robots amid polygonal obstacles in the plane. Our result is a fully polynomial-time approximation scheme (FPTAS) for these instances, which is the first (1+ε)-approximation algorithm for an optimal motion planning problem involving two robots moving in a polygonal environment.

Item Open Access Algorithms for the Reeb Graph and Related Concepts(2014) Parsa, SalmanThis thesis is concerned with a structure called the Reeb graph. There are three main problems considered. The first is devising an efficient algorithm for comnstructing the Reeb graph of a simplicial complex with respect to a generic simplex-wise linear real-valued function. We present an algorithm that builds the Reeb graph in almost optimal worst-case deterministic time. This was the first deterministic algorithm with the time bound which is linear up to a logarithmic factor. Without loss of generality, the complex is assumed to be 2-dimensional. The algorithm works by sweeping the function values and maintaining a spanning forest for the preimage, or the level-set, of the value. Using the observation that the operations that change the level-sets are off-line, we was able to achieve the above bound.

The second topic is the dynamic Reeb graphs. As the function changes its values, the Reeb graph also changes. We are interested in updating the Reeb graph. We reduce this problem to a graph problem that we call retroactive graph connectivity. We obtain algorithms for dynamic Reeb graphs, by using data structures that solve the graph problem.

The third topic is an argument regarding the complexity of computing Betti numbers. This problem is also related to the Reeb graph by means of the vertical Homology classes. The problem considered here is whether the realization of a simplicial complex in the Euclidean 4-space can result in an algorithm for computing its Homology groups faster than the usual matrix reduction methods. Using the observation that the vertical Betti numbers can always be computed more efficiently using the Reeb graph, we show that the answer to this question is in general negative. For instance, given a square matrix over the field with two elements, we construct a simplicial complex in linear time, realized in euclidean 4-space and a function on it, such that its Horizontal homology group, over the same field, is isomorphic with the null-space of the matrix. It follows that the Betti number computation for complexes realized in the 4-space is as hard as computing the rank for a sparse matrix.

Item Open Access Algorithms for Two-sided Online Marketplaces(2020) Alijani, RezaThe recent emergence of new successful two-sided online platforms has transformed the concept of a marketplace. Numerous two-sided mechanism design problems, including those studied in this thesis, are motivated by these platforms. Similar to any other mechanism design problem, we wish to optimize an objective in the presence of selfish agents. However, there are unique features to a two-sided market, such as supply uncertainty and the need for ensuring budget balance, which make these problems particularly challenging. In our problems, we design models to capture the two-sidedness, identify the challenges, and use algorithmic techniques to solve these problems.

We start with a Bayesian single item auction with n independent buyers. For this problem, we show how the existence of a trusted intermediary can result in a better outcome for buyers without hurting the seller's revenue. In this model, the intermediary gets to see the true valuations of buyers and will reveal some information after that in the form of a signal. The intermediary does not have any control over agents after sending this signal, and any agent only maximizes their utility. This essentially means that the seller will run an optimal auction conditioned on receiving any signal. For this problem, we design approximately optimal ways of revealing information.

The previous problem is an interesting model to show how a platform can mediate in the market and improve the outcome for every agent. That problem is a single item static problem. Next, we focus on pricing in two problems with multiple dynamic agents on the seller side. The first problem is an extension of the multiple-choice prophet inequality to the setting in which each item might disappear after an a priori unknown amount of time. This can be viewed as a way to model the supply uncertainty arising because of the fact that different sellers might depart after some time. Considering the importance of prophet inequalities in online posted pricing mechanisms, we hope that incorporating features of two-sided markets in the model and finding new prophet inequalities will be useful and provide new insights.

In our last problem, we design a model to capture a general dynamic two-sided market. In our model, agents (buyers and sellers) with heterogeneous valuations/costs, service quality requirements, and patience levels (or deadlines) arrive over time. Both buyers and sellers arrive to the market following Poisson processes, and each buyer/seller has a private value/cost for service, as well as a private patience level for receiving/providing service. In addition, each agent has a known location in an underlying metric space, where the metric distance between any buyer and seller captures the quality of the corresponding match. The platform knows the distribution over values/costs and patience levels and can post prices and wages at nodes in the metric space, as well as choose agents for matching. It uses these controls to maximize the social surplus subject to weak budget balance while guaranteeing a high match quality and a service time (with a high probability) smaller than their patience.

Item Open Access All-Near-Neighbor Search Among High-Dimensional Data via Hierarchical Sparse Graph Code Filtering(2020) Iliopoulos, Alexandros StavrosThis thesis addresses the problem of all-to-all near-neighbor (all-NN) search among a large dataset of discrete points in a high-dimensional feature space endowed with a distance metric. Near-neighbor search is fundamental to numerous computational tasks arising in modern data analysis, modeling, and knowledge discovery. The accuracy and efficiency of near-neighbor search depends on the underlying structure of the dataset. In emerging data-driven analysis applications, the dataset is not necessarily stationary, the structure is not necessarily built once for all queries, and search is not necessarily limited to a few query points. To facilitate accurate and efficient near-neighbor search in a stationary or dynamic dataset, we make a systematic investigation of all-NN search at multiple resolution levels, with attention to several standing or previously overlooked issues associated with high-dimensional feature spaces.

The key contributions are outlined as follows. (i) We proclaim the inherent sparse distribution of a finite discrete dataset in a high-dimensional space. We demonstrate that exploiting this sparsity is pivotal to dispelling the so-called curse of dimensionality in theoretical analysis and practical algorithm design. (ii) We introduce the adjacency graph hierarchy (AGH) as an analysis apparatus as well as base infrastructure for efficient all-NN search algorithm design. (iii) We present an efficient AGH construction algorithm by directly and deterministically pinpointing adjacent nodes at each resolution level via sparse adjacency code filtering, in lieu of exhaustive search within local node neighborhoods. We provide at the same time a statistical estimate of the significant gap in algorithm complexity between direct location of adjacent nodes and local search. (iv) We introduce an adjacency-sensitive, graph-based but decentralized approach for hierarchical partition of a discrete dataset, with potential advantage for accurate and efficient all-NN search in comparison to existing non-interacting partition structures.

Item Open Access An Exploratory Study of Cell Growth and Division Data(2016) Niu, TongUnderstanding the mechanism behind cell-size control of bacteria such as E. coli has been an active research topic for decades. Until about 2010, most studies were limited to measurements at the cell population level. Recently advanced technologies and techniques, including Mother Machine and modern microscopic imaging, as well as new biological insights and experiment techniques have enabled measurements of cell growth and division at individual cell level with high spatial and temporal resolution. This unprecedented data allowed scientists to take a closer look into the patterns of cell growth and division, to search for new answers, and to gain a better understanding of the natural law governing cell growth. This thesis study is based on six data sets provided by the You lab at Duke University and by the Jun lab at University of California, San Diego. We focus on exploratory and quantitative study of inter-generation relationships among certain cell growth parameters and variables, including the variation in each relationship.

We report and describe three particular findings. (1) The variation in cell size at birth between two successive generations, i.e., mother and daughter generations, is better captured in two separate zones in the mother-daughter plane of initial cell size. This leads to a two-phase linear model with much narrower noise spread. The model can be used to connect or compensate for certain well-known models. (2) The variation in doubling time between two successive generations is confined in a finite region with non-uniform density. Specifically, the confining region is apparently triangular in the successive-generation plot of doubling time. The highest variation of doubling time in daughter-generation corresponds to the median for the mother- generation. This variation structure of successive-generation is further used to predict the variations in succeeding generations via an iterative simulation method. (3) The relationship between cell size and doubling time is studied with Fourier analysis. We found that the two dominant frequencies of the cell-size time series are nearly constant for each lineage. Furthermore, the top dominant frequency location is invariant under random re-ordering of generations. This invariance suggests that the dominant frequency is an intra-generation property, i.e., an inherent property. These findings are descriptive rather than explanatory. Nonetheless, they provide additional angles to examine, re-examine and improve our understanding of cell growth control mechanism. The findings also show, perhaps in a subtle way, the effect of computational concepts and techniques, which we used to probe into data on the finding, artifacts and their separation.

Item Open Access An Operating System Architecture for Networked Server Infrastructure(2007-12-14) Irwin, David EmoryCollections of hardware components are the foundation of computation and consist of interconnections of different types of the same core elements: processors, disks, memory cards, I/O devices, and network links. Designing a system for managing collections of hardware is challenging because modern infrastructures (i) distribute resource control across multiple autonomous sites, (ii) operate diverse sets of hardware, and (iii) support a variety of programming models for developing and executing software services. An operating system is a software layer that manages hardware by coordinating its interaction with software. This thesis defines and evaluates an architecture for a networked operating system that manages collections of hardware in infrastructures spread across networks, such as the Internet. The foundation of a networked operating system determines how software services share a common hardware platform. A fundamental property common to all forms of resource sharing is that software services, by definition, share hardware components and do not use them forever. A lease is a natural construct for restricting the use of a shared resource to a well-defined length of time. Our architecture employs a general neutrality principle, which states that a networked operating system should be policy-neutral, since only users and site administrators, and not operating system developers, know how to manage their software and hardware. Experience building, deploying, and using a prototype has led us to view neutrality as a guiding design principle. Our hypothesis is that an operating system architecture for infrastructure resource management that focuses narrowly on leasing control of hardware provides a foundation for multi-lateral resource negotiation, arbitration, and fault tolerance. In evaluating our hypothesis we make the following contributions:*Introduce a set of design principles for networked operating systems. The principles adapt and extend principles from node operating system design to a networked environment. We evaluate existing systems with respect to these principles, describe how they deviate from them, and explore how these deviations limit the capabilities of higher level software.*Combine the idea of a reconfigurable data center with the Sharp framework for secure resource peering to demonstrate a prototype networked operating system capable of sharing aggregations of resources in infrastructures. *Design, implement, and deploy the architecture using a single programming abstraction---the lease---and show how the lease abstraction embodies the design principles of a networked operating system.*Show that leases are a foundational primitive for addressing arbitration in a networked operating system. Leasing currency defines a configurable tradeoff between proportional-share scheduling and a market economy, and also serves as a basis for implementing other forms of arbitration. *Show how combining the use of leases for long-term resource management with state recovery mechanisms provides robustness to transient faults and failures in a loosely coupled distributed system that coordinates resource allocation.*Evaluate the flexibility and performance of a prototype by managing aggregations of physical and virtual hardware present in modern data centers, and showing that the architecture could scale to manage thousands of machines. *Present case studies of integrating multiple software services including the PlanetLab network testbed, the Plush distributed application manager, and the GridEngine batch scheduler, and leverage the architecture to prototype and evaluate Jaws, a new light-weight batch scheduler that instantiates one or more virtual machines per task.Item Open Access Answering and Explaining SQL Queries Privately(2022) Tao, YuchaoData privacy has been receiving an increasing amount of attention in recent years. While large-scale personal information is collected for scientific research and commercial products, a privacy breach is not acceptable as a trade-off. In the last decade, differential privacy has become a gold standard to protect data privacy and has been applied in many organizations. Past work focused on building a differentially private SQL query answering system as a building block for wider applications. However, answering counting queries with joins under differential privacy appears as a challenge. The join operator allows any user to have an unbounded impact on the query result, which impedes hiding the existence of a single user by differential privacy. On the other hand, the introduction of differential privacy to the query answering also prevents the users from understanding the query results correctly, since she needs to distinguish the effect of differential privacy from the contribution of data.

In this thesis, we study two problems about answering and explaining SQL queries privately. First, we present efficient algorithms to compute local sensitivities of counting queries with joins, which is an important premise for answering these queries under differential privacy. We track the sensitivity contributed by each tuple, based on which we propose a truncation mechanism that answers counting queries with joins privately with high utility. Second, we propose a formal framework DPXPlain, a three-phase framework that allows users to get explanations for group-by COUNT/SUM/AVG query results while preserving DP. We utilize confidence intervals to help users understand the uncertainty in the query results introduced by differential privacy, and further provide top-k explanations under differential privacy to explain the contribution of data to the results.