# Browsing by Author "Parr, Ronald"

###### Results Per Page

###### Sort Options

Item Unknown Efficient selection of disambiguating actions for stereo vision(2010) Schaeffer, MonikaIn many domains that involve the use of sensors, such as robotics or sensor networks, there are opportunities to use some form of active sensing to disambiguate data from noisy or unreliable sensors. These disambiguating actions typically take time and expend energy. One way to choose the next disambiguating action is to select the action with the greatest expected entropy reduction, or information gain. In this work, we consider active sensing in aid of stereo vision for robotics. Stereo vision is a powerful sensing technique for mobile robots, but it can fail in scenes that lack strong texture. In such cases, a structured light source, such as vertical laser line, can be used for disambiguation. By treating the stereo matching problem as a specially structured HMM-like graphical model, we demonstrate that for a scan line with n columns and maximum stereo disparity d, the entropy minimizing aim point for the laser can be selected in O(nd) time - cost no greater than the stereo algorithm itself. A typical HMM formulation would suggest at least O(nd2) time for the entropy calculation alone.Item Unknown Feature Selection for Value Function Approximation(2011) Taylor, GavinThe field of reinforcement learning concerns the question of automated action selection given past experiences. As an agent moves through the state space, it must recognize which state choices are best in terms of allowing it to reach its goal. This is quantified with value functions, which evaluate a state and return the sum of rewards the agent can expect to receive from that state. Given a good value function, the agent can choose the actions which maximize this sum of rewards. Value functions are often chosen from a linear space defined by a set of features; this method offers a concise structure, low computational effort, and resistance to overfitting. However, because the number of features is small, this method depends heavily on these few features being expressive and useful, making the selection of these features a core problem. This document discusses this selection.

Aside from a review of the field, contributions include a new understanding of the role approximate models play in value function approximation, leading to new methods for analyzing feature sets in an intuitive way, both using the linear and the related kernelized approximation architectures. Additionally, we present a new method for automatically choosing features during value function approximation which has a bounded approximation error and produces superior policies, even in extremely noisy domains.

Item Unknown Non-parametric approximate linear programming for MDPs(2012) Pazis, JasonOne of the most difficult tasks in value function based methods for learning in Markov Decision Processes is finding an approximation architecture that is expressive enough to capture the important structure in the value function, while at the same time not overfitting the training samples. This thesis presents a novel Non-Parametric approach to Approximate Linear Programming (NP-ALP), which requires nothing more than a smoothness assumption on the value function. NP-ALP can make use of real-world, noisy sampled transitions rather than requiring samples from the full Bellman equation, while providing the first known max-norm, finite sample performance guarantees for ALP under mild assumptions. Additionally NP-ALP is amenable to problems with large (multidimensional) or even infinite (continuous) action spaces, and does not require a model to select actions using the resulting approximate solution.Item Open Access Object Discovery with a Mobile Robot(2013) Mason, JulianThe world is full of objects: cups, phones, computers, books, and

countless other things. For many tasks, robots need to understand that

this object is a stapler, that object is a textbook, and this other

object is a gallon of milk. The classic approach to this problem is

object recognition, which classifies each observation into one of

several previously-defined classes. While modern object recognition

algorithms perform well, they require extensive supervised training:

in a standard benchmark, the training data average more than four

hundred images of each object class.

The cost of manually labeling the training data prohibits these

techniques from scaling to general environments. Homes and workplaces

can contain hundreds of unique objects, and the objects in one

environment may not appear in another.

We propose a different approach: object discovery. Rather than rely on

manual labeling, we describe unsupervised algorithms that leverage the

unique capabilities of a mobile robot to discover the objects (and

classes of objects) in an environment. Because our algorithms are

unsupervised, they scale gracefully to large, general environments

over long periods of time. To validate our results, we collected 67

robotic runs through a large office environment. This dataset, which

we have made available to the community, is the largest of its kind.

At each step, we treat the problem as one of robotics, not disembodied

computer vision. The scale and quality of our results demonstrate the

merit of this perspective, and prove the practicality of long-term

large-scale object discovery.

Item Open Access PAC-optimal, Non-parametric Algorithms and Bounds for Exploration in Concurrent MDPs with Delayed Updates(2015) Pazis, JasonAs the reinforcement learning community has shifted its focus from heuristic methods to methods that have performance guarantees, PAC-optimal exploration algorithms have received significant attention. Unfortunately, the majority of current PAC-optimal exploration algorithms are inapplicable in realistic scenarios: 1) They scale poorly to domains of realistic size. 2) They are only applicable to discrete state-action spaces. 3) They assume that experience comes from a single, continuous trajectory. 4) They assume that value function updates are instantaneous. The goal of this work is to bridge the gap between theory and practice, by introducing an efficient and customizable PAC optimal exploration algorithm, that is able to explore in multiple, continuous or discrete state MDPs simultaneously. Our algorithm does not assume that value function updates can be completed instantaneously, and maintains PAC guarantees in realtime environments. Not only do we extend the applicability of PAC optimal exploration algorithms to new, realistic settings, but even when instant value function updates are possible, our bounds present a significant improvement over previous single MDP exploration bounds, and a drastic improvement over previous concurrent PAC bounds. We also present Bellman error MDPs, a new analysis methodology for online and offline reinforcement learning algorithms, and TCE, a new, fine grained metric for the cost of exploration.

Item Open Access Sparse Value Function Approximation for Reinforcement Learning(2013) Painter-Wakefield, Christopher RobertA key component of many reinforcement learning (RL) algorithms is the approximation of the value function. The design and selection of features for approximation in RL is crucial, and an ongoing area of research. One approach to the problem of feature selection is to apply sparsity-inducing techniques in learning the value function approximation; such sparse methods tend to select relevant features and ignore irrelevant features, thus automating the feature selection process. This dissertation describes three contributions in the area of sparse value function approximation for reinforcement learning.

One method for obtaining sparse linear approximations is the inclusion in the objective function of a penalty on the sum of the absolute values of the approximation weights. This L

_{1}regularization approach was first applied to temporal difference learning in the LARS-inspired, batch learning algorithm LARS-TD. In our first contribution, we define an iterative update equation which has as its fixed point the L_{1}regularized linear fixed point of LARS-TD. The iterative update gives rise naturally to an online stochastic approximation algorithm. We prove convergence of the online algorithm and show that the L_{1}regularized linear fixed point is an equilibrium fixed point of the algorithm. We demonstrate the ability of the algorithm to converge to the fixed point, yielding a sparse solution with modestly better performance than unregularized linear temporal difference learning.Our second contribution extends LARS-TD to integrate policy optimization with sparse value learning. We extend the L

_{1}regularized linear fixed point to include a maximum over policies, defining a new, "greedy" fixed point. The greedy fixed point adds a new invariant to the set which LARS-TD maintains as it traverses its homotopy path, giving rise to a new algorithm integrating sparse value learning and optimization. The new algorithm is demonstrated to be similar in performance with policy iteration using LARS-TD.Finally, we consider another approach to sparse learning, that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the L

_{1}regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. We consider variants of orthogonal matching pursuit (OMP) applied to RL. The resulting algorithms are analyzed and compared experimentally with existing L_{1}regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant provides promising theoretical guarantees under certain assumptions on the feature dictionary while another variant empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.Item Open Access Transfer Learning in Value-based Methods with Successor Features(2023) Nemecek, Mark WilliamThis dissertation investigates the concept of transfer learning in a reinforcement learning (RL) context. Transfer learning is based on the idea that it is possible for an agent to use what it has learned in one task to improve the learning process in another task as compared to learning from scratch. This improvement can take multiple forms, such as reducing the number of samples required to reach a given level of performance or even increasing the best performance achieved. In particular, we examine properties and applications of successor features, which are a useful representation that allows efficient calculation of action-value functions for a given policy in different contexts.

Our first contribution is a method for incremental construction of a cache of policies for a family of tasks. When a family of tasks share transition dynamics but differ in reward function, successor features allow us to efficiently compute the action-value functions for known policies in new tasks. As the optimal policy for a new task might be the same as or similar to that for a previous task, it is not always necessary for an agent to learn a new policy for each new task it encounters, especially if it is allowed some amount of suboptimality. We present new bounds for the performance of optimal policies in a new task, as well as an approach to use these bounds to decide, when presented with a new task, whether to use cached policies or learn a new policy.

In our second contribution, we examine the problem of hierarchical reinforcement learning, which involves breaking a task down into smaller subtasks which are easier to solve, through the lens of transfer learning. Within a single task, a subtask may encapsulate a behavior which could be used multiple times for completing the task, but occur in different contexts, such as opening doors while navigating a building. When the reward function changes between tasks, a given subtask may be unaffected, i.e., the optimal behavior within that subtask may remain the same. If so, the behavior may be immediately reused to accelerate training of behaviors for other subtasks. In both of these cases, reusing the learned behavior can be viewed as a transfer learning problem. We introduce a method based on the MAXQ value function decomposition which uses two applications of successor features to facilitate both transfer within a task and transfer between tasks with different reward functions.

The final contribution of this dissertation introduces a method for transfer using a value-based approach in domains with continuous actions. When an environment's action space is continuous, finding the action which maximizes an action-value function approximator efficiently often requires defining a constrained approximator which results in suboptimal behavior. Recently the RBF-DQN approach was proposed to use deep radial-basis value functions to allow efficient maximization of an action-value approximator over the actions while not losing the universal approximator property of neural networks. We present a method which extends this approach to use successor features in order to allow for effective transfer learning between tasks which differ in reward function.

Item Open Access Transition Space Distance Learning(2019) Nemecek, Mark WilliamThe notion of distance plays and important role in many reinforcement learning (RL) techniques. This role may be explicit, as in some non-parametric approaches, or it may be implicit in the architecture of the feature space. The ability to learn distance functions tailored for RL tasks could, thus, benefit many different RL paradigms. While several approaches to learning distance functions from data do exist, they are frequently intended for use in clustering or classification tasks and typically do not take into account the inherent structure present in trajectories sampled from RL environments. For those that do, this structure is generally used to define a similarity between states rather than to represent the mechanics of the domain. Based on the idea that a good distance function in such a domain would reflect the number of transitions necessary to get to from one state to another, we detail an approach to learning distance functions which accounts for the nature of state transitions in a Markov decision process, including their inherent directionality. We then present the results of experiments performed in multiple RL environments in order to demonstrate the benefit of learning such distance functions.