# Browsing by Subject "Reinforcement learning"

Item Open Access A Pathway from the Midbrain to the Striatum is Critical to Multiple Forms of Vocal Learning and Modification in the Songbird (2017) Hisey, Erin

Many of the skills we value most as humans, such as speech and learning to play musical instruments, are learned in the absence of external reinforcement. However, the model systems most commonly used to study motor learning employ learning paradigms in which animals perform behaviors in response to external rewards or punishments. Here I use the zebra finch, an Australian songbird that can learn its song as a juvenile in the absence of external reinforcement as well as modify its song in response to external cues as an adult, to study the circuit mechanisms underlying both internally and externally reinforced forms of learning. Using a combination of intersectional genetic and microdialysis techniques, I show that a striatonigral pathway and its downstream effectors, namely D1-type dopamine receptors, are necessary for both internally reinforced juvenile learning and externally reinforced adult learning, as well as for song modification in response to social cues or to deafening. In addition, I employ optogenetic stimulation during singing to demonstrate that this striatonigral projection is sufficient to drive learning. Interestingly, I find that neither the striatonigral pathway nor D1-type dopamine receptors are necessary for recovery of pitch after externally driven pitch learning. In all, I establish that a common mechanism underlies both internally and externally reinforced vocal learning.

Item Open Access An Actor-Critic Circuit in the Songbird Enables Vocal Learning (2020) Kearney, Matthew

The ability to learn and to modify complex vocal sequences requires extensive practice coupled with performance evaluation through auditory feedback. An efficient solution to the challenge of vocal learning, stemming from reinforcement learning theory, proposes that an “actor” learns correct vocal behavior through the instructive guidance of an auditory “critic.” However, the neural circuit mechanisms supporting performance evaluation and even how “actor” and “critic” circuits are instantiated in biological brains are fundamental mysteries. Here, I use a songbird model to dissociate “actor” and “critic” circuits and uncover biological mechanisms for vocal learning.

First, I employ closed-loop optogenetic methods in singing birds to identify two inputs to midbrain dopamine neurons that operate in an opponent fashion to guide vocal learning. Next, I employ electrophysiological methods to establish a microcircuit architecture underlying this opponent mechanism. Notably, I show that disrupting activity in these midbrain dopamine inputs precisely when auditory feedback is processed impairs learning, showing that they function as “critics.” Conversely, I show that disrupting activity in a downstream premotor region prior to vocal production prevents learning, consistent with an “actor” role. Taken together, these experiments dissociate discrete “actor” and “critic” circuits in the songbird’s brain and elucidate neural circuit and microcircuit mechanisms by which “actors” and “critics” working cooperatively enable vocal learning.

Item Open Access Computational Modeling of Multi-Agent, Continuous Decision Making in Competitive Contexts (2021) McDonald, Kelsey

Humans are able to make adaptive decisions with the aim of achieving a goal, earning a reward, or avoiding punishment. While much is known about the behavior and the underlying neural mechanisms relating to this aspect of decision-making, the field of cognitive neuroscience has focused almost exclusively on how these types of decisions are made in discrete choices where the set of possible actions is comparatively much smaller. We know much less about how human brains are able to make similar types of goal-directed decisions in continuous contexts, which are more akin to the types of choices humans make in real life. Further, how these processes are modified by the presence of other humans whose goals might influence one's own future behavior is currently unknown. Across three empirical studies, I address some of these gaps in the literature by studying human competitive decision-making in a dynamic control paradigm in which humans interacted with both social and non-social opponents (Chapter 2 and Chapter 4). In Chapter 3, I show that brain regions heavily implicated in social cognition and value-based decision-making also play a role in tracking continuous decision metrics involved in monitoring instantaneous coupling between opponents, advantageous decision timing, and constructing social context. Collectively, the results in this dissertation demonstrate the utility of studying decision-making in less-constrained paradigms with the overall goal of gaining further understanding of how humans make complex, goal-directed decisions closer to real-world conditions.

Item Open Access Efficient Bayesian Nonparametric Methods for Model-Free Reinforcement Learning in Centralized and Decentralized Sequential Environments (2014) Liu, Miao

As a growing number of agents are deployed in complex environments for scientific research and human well-being, there are increasing demands for designing efficient learning algorithms for these agents to improve their control policies. Such policies must account for uncertainties, including those caused by environmental stochasticity, sensor noise and communication restrictions. These challenges exist in missions such as planetary navigation, forest firefighting, and underwater exploration. Ideally, good control policies should allow the agents to deal with all the situations in an environment and enable them to accomplish their mission within the budgeted time and resources. However, a correct model of the environment is not typically available in advance, requiring the policy to be learned from data. Model-free reinforcement learning (RL) is a promising candidate for agents to learn control policies while engaged in complex tasks, because it allows the control policies to be learned directly from a subset of experiences and with time efficiency. Moreover, to ensure persistent performance improvement for RL, it is important that the control policies be concisely represented based on existing knowledge, and have the flexibility to accommodate new experience. Bayesian nonparametric methods (BNPMs) both allow the complexity of models to be adaptive to data, and provide a principled way for discovering and representing new knowledge.

In this thesis, we investigate approaches for RL in centralized and decentralized sequential decision-making problems using BNPMs. We show how the control policies can be learned efficiently under model-free RL schemes with BNPMs. Specifically, for centralized sequential decision-making, we study Q learning with Gaussian processes to solve Markov decision processes, and we also employ hierarchical Dirichlet processes as the prior for the control policy parameters to solve partially observable Markov decision processes. For decentralized partially observable Markov decision processes, we use stick-breaking processes as the prior for the controller of each agent. We develop efficient inference algorithms for learning the corresponding control policies. We demonstrate that by combining model-free RL and BNPMs with efficient algorithm design, we are able to scale up RL methods for complex problems that cannot be solved due to the lack of model knowledge. We adaptively learn control policies with concise structure and high value, from a relatively small amount of data.

Item Open Access Feature Selection for Value Function Approximation (2011) Taylor, Gavin

The field of reinforcement learning concerns the question of automated action selection given past experiences. As an agent moves through the state space, it must recognize which state choices are best in terms of allowing it to reach its goal. This is quantified with value functions, which evaluate a state and return the sum of rewards the agent can expect to receive from that state. Given a good value function, the agent can choose the actions which maximize this sum of rewards. Value functions are often chosen from a linear space defined by a set of features; this method offers a concise structure, low computational effort, and resistance to overfitting. However, because the number of features is small, this method depends heavily on these few features being expressive and useful, making the selection of these features a core problem. This document discusses this selection.

Aside from a review of the field, contributions include a new understanding of the role approximate models play in value function approximation, leading to new methods for analyzing feature sets in an intuitive way, both using the linear and the related kernelized approximation architectures. Additionally, we present a new method for automatically choosing features during value function approximation which has a bounded approximation error and produces superior policies, even in extremely noisy domains.
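As a rough illustration of the linear architecture described in this abstract, the sketch below represents a value function as a weighted sum of features and fits the weights with TD(0) on a tiny deterministic chain. The chain, the two-element feature map, the step size, and the discount factor are all assumptions chosen for illustration; none of them is taken from the dissertation, and the sketch does not implement its feature selection method.

```python
import numpy as np

def features(state, n_states):
    """Hypothetical feature map: a bias term plus the normalized state index."""
    return np.array([1.0, state / (n_states - 1)])

def td0_linear(n_states=5, gamma=0.9, alpha=0.05, episodes=2000):
    """Fit V(s) ~= w . phi(s) with TD(0) on a chain that always steps right.

    The last state is terminal; entering it yields reward 1, and all other
    transitions yield reward 0.
    """
    w = np.zeros(2)
    terminal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            next_state = state + 1
            reward = 1.0 if next_state == terminal else 0.0
            phi = features(state, n_states)
            # The value of the terminal state is defined to be zero.
            if next_state == terminal:
                v_next = 0.0
            else:
                v_next = w @ features(next_state, n_states)
            td_error = reward + gamma * v_next - w @ phi
            w = w + alpha * td_error * phi
            state = next_state
    return w

def value(w, state, n_states=5):
    """Approximate value of a state under the learned weights."""
    return float(w @ features(state, n_states))

weights = td0_linear()
```

With only two features the approximation cannot match every state exactly, which is the dependence on feature quality that the abstract points to: states closer to the rewarding transition should still receive higher estimated values than states further away.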

Item Embargo Innovations in Decompression Sickness Prediction and Adaptive Ascent Algorithms (2023) Di Muro, Gianluca

Decompression Sickness (DCS) is a potentially serious medical condition which can occur in humans when there is a decrease in ambient pressure. While it is generally accepted that DCS is initiated by the formation and growth of inert gas bubbles in the body, the mechanisms of its various forms are not completely understood. Complicating matters, divers often face challenges in adhering to predetermined safe ascent paths due to unpredictable environmental conditions. Therefore, the challenge of improving dive safety is twofold: 1) enhancing the accuracy of models in predicting DCS risk for a given dive profile; 2) developing algorithms that recommend safe ascent profiles and can adapt in real time to new, unforeseen diving conditions. This dissertation addresses both problems in the context of diving applications.

First, we examine how the DCS risk is partitioned in air decompression dives to identify which portion of the dive is the most challenging. Our findings show that most of the risk might be accrued at the surface, or during the ascent phase, depending on the specific mission parameters. Subsequently, we conducted a comprehensive investigation into DCS models incorporating inter-tissue perfusion dynamics. We proposed a novel algorithm to optimize these models efficiently. Our results determined that a model neglecting the coupling of faster tissue to slower tissues outperformed all other models on O2 surface decompression dive profiles. We further conducted experiments with various compartment tissue connections, involving diffusion phenomena and introducing delayed dynamics, while also exploring different risk functions. By adopting the Akaike Information Criterion, we found that the best performing model on the training set was BQE22AXT4, a four-compartment model featuring a risk threshold term only in the fourth compartment. Conversely, the classical Linear-Exponential model demonstrated superior performance on the extrapolation set.

Finally, we introduce a groundbreaking real-time algorithm that delivers a secure and time optimized ascent path capable of adapting to unanticipated conditions. Our approach harnesses the power of advanced machine learning techniques and backward optimal control. Through our comprehensive analysis, we demonstrate that this innovative methodology attains a safety level on par with precomputed NAVY tables, while offering the added advantage of dynamic adaptation in response to unexpected events.

Item Open Access Locally Adaptive Protocols for Quantum State Discrimination (2021) Brandsen, Sarah

This dissertation makes contributions to two rapidly developing fields: quantum information theory and machine learning. It has recently been demonstrated that reinforcement learning is an effective tool for a wide variety of tasks in quantum information theory, ranging from quantum error correction to quantum control to preparation of entangled states. In this work, we demonstrate that reinforcement learning is additionally highly effective for the task of multiple quantum hypothesis testing.

Quantum hypothesis testing consists of finding the quantum measurement which allows one to discriminate with minimal error between $m$ possible states $\{\rho_{k}\}_{k=1}^{m}$ of a quantum system with corresponding prior probabilities $p_{k} = \text{Pr}[\rho = \rho_{k}]$. In the general case, although semi-definite programming offers a way to numerically approximate the optimal solution, a closed-form analytical solution for the optimal measurement is not known.

Additionally, when the quantum system is large and consists of many subsystems, the optimal measurement may be experimentally difficult to implement. In this work, we provide a comprehensive study of locally adaptive approaches to quantum hypothesis testing where only a single subsystem is measured at a time and the order and types of measurements implemented may depend on previous measurement results. Thus, these locally adaptive protocols present an experimentally feasible approach to quantum state discrimination.

We begin with the case of binary hypothesis testing (where $m=2$), and generalize previous work by Acin et al. (Phys. Rev. A 71, 032338) to show that a simple Bayesian-updating scheme can optimally distinguish between any pair of arbitrary pure, tensor product quantum states. We then demonstrate that this same Bayesian-updating scheme has poor asymptotic behaviour when the candidate states are not pure, and based on this we introduce a modified scheme with strictly better performance. Finally, a dynamic programming (DP) approach is used to find the optimal local protocol for binary state discrimination and numerical simulations are run for both qubit and qutrit subsystems.
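For the binary case, a useful reference point is the Helstrom bound, the standard closed-form minimum error probability over all (collective) measurements: $P_{\text{err}} = \tfrac{1}{2}\left(1 - \lVert p_{1}\rho_{1} - p_{2}\rho_{2} \rVert_{1}\right)$. The sketch below evaluates it numerically. It is only a benchmark calculation under assumed inputs; it does not implement the Bayesian-updating or dynamic programming protocols of the dissertation, and the function names are hypothetical.

```python
import numpy as np

def pure_state(vec):
    """Density matrix |v><v| for a (possibly unnormalized) state vector."""
    v = np.asarray(vec, dtype=complex)
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())

def helstrom_error(rho1, rho2, p1=0.5):
    """Minimum error probability for discriminating rho1 (prior p1) from rho2."""
    gamma = p1 * rho1 - (1.0 - p1) * rho2  # Hermitian by construction
    # The trace norm of a Hermitian matrix is the sum of |eigenvalues|.
    trace_norm = np.abs(np.linalg.eigvalsh(gamma)).sum()
    return 0.5 * (1.0 - trace_norm)

zero = pure_state([1, 0])
one = pure_state([0, 1])
plus = pure_state([1, 1])

err_orthogonal = helstrom_error(zero, one)   # perfectly distinguishable states
err_identical = helstrom_error(zero, zero)   # indistinguishable states
err_overlap = helstrom_error(zero, plus)     # partially overlapping states
```

Orthogonal states can be told apart without error, identical states with equal priors can only be guessed, and overlapping states fall in between. Any local protocol can at best match this collective bound.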

Based on these results, we then turn to the more general case of multiple hypothesis testing where there may be several candidate states. Given that the dynamic-programming approach has a high complexity when there are a large number of subsystems, we turn to reinforcement learning methods to learn adaptive protocols for even larger systems. Our numerical results support the claim that reinforcement learning with neural networks (RLNN) is able to successfully find the optimal locally adaptive approach for up to 20 subsystems. We additionally find the optimal collective measurement through semidefinite programming techniques, and demonstrate that the RLNN approach meets or comes close to the optimal collective measurement in every random trial.

Next, we focus on quantum information theory and provide an operational interpretation for the entropy of a channel. This task is motivated by the central role of entropy across several areas of physics and science. We use games of chance as a more systematic and unifying approach to define entropy, as a system's performance in any game of chance depends solely on the uncertainty of the output. We construct families of games which result in a pre-order on channels and provide an operational interpretation for all pre-orders (corresponding to majorization, conditional majorization, and channel majorization respectively), and this defines the unique asymptotically continuous entropy function for classical channels.

Item Open Access Model-based Reinforcement Learning in Modified Levy Jump-Diffusion Markov Decision Model and Its Financial Applications (2017-11-15) Zhu, Zheqing

This thesis intends to address an important cause of the 2007-2008 financial crisis, the non-normality of asset returns, by incorporating prediction of asset pricing jumps into asset pricing models. Several different machine learning techniques, including the Unscented Kalman Filter and Approximate Planning, are used, and an improvement in Approximate Planning is developed to improve algorithm time complexity with limited loss in optimality. We obtain significant results in predicting jumps with market sentiment memory extracted from Twitter. With the model, we develop a reinforcement learning module that achieves good performance and captures over 60% of profitable periods in the market.

Item Open Access Nonlinear Energy Harvesting With Tools From Machine Learning (2020) Wang, Xueshe

Energy harvesting is a process where self-powered electronic devices scavenge ambient energy and convert it to electrical power. Traditional linear energy harvesters, which operate based on linear resonance, work well only when the excitation frequency is close to their natural frequency. While various control methods applied to an energy harvester realize resonant frequency tuning, they are either energy-consuming or exhibit low efficiency when operating under multi-frequency excitations. In order to overcome these limitations in a linear energy harvester, researchers recently suggested using "nonlinearity" for broad-band frequency response.

Based on existing investigations of nonlinear energy harvesting, this dissertation introduced a novel type of energy harvester design for space efficiency and intentional nonlinearity: translational-to-rotational conversion. Two dynamical systems were presented: 1) vertically forced rocking elliptical disks, and 2) non-contact magnetic transmission. Both systems realize the translational-to-rotational conversion and exhibit nonlinear behaviors which are beneficial to broad-band energy harvesting.

This dissertation also explores novel methods to overcome the limitation of nonlinear energy harvesting: the presence of coexisting attractors. A control method was proposed to make a nonlinear harvesting system operate on the desired attractor. This method is based on reinforcement learning and was shown to work with various control constraints and optimized energy consumption.

Apart from investigations of energy harvesting, several techniques were presented to improve the efficiency for analyzing generic linear/nonlinear dynamical systems: 1) an analytical method for stroboscopically sampling general periodic functions with arbitrary frequency sweep rates, and 2) a model-free sampling method for estimating basins of attraction using hybrid active learning.

Item Open Access PAC-optimal, Non-parametric Algorithms and Bounds for Exploration in Concurrent MDPs with Delayed Updates (2015) Pazis, Jason

As the reinforcement learning community has shifted its focus from heuristic methods to methods that have performance guarantees, PAC-optimal exploration algorithms have received significant attention. Unfortunately, the majority of current PAC-optimal exploration algorithms are inapplicable in realistic scenarios: 1) They scale poorly to domains of realistic size. 2) They are only applicable to discrete state-action spaces. 3) They assume that experience comes from a single, continuous trajectory. 4) They assume that value function updates are instantaneous. The goal of this work is to bridge the gap between theory and practice by introducing an efficient and customizable PAC-optimal exploration algorithm that is able to explore in multiple continuous or discrete state MDPs simultaneously. Our algorithm does not assume that value function updates can be completed instantaneously, and maintains PAC guarantees in real-time environments. Not only do we extend the applicability of PAC-optimal exploration algorithms to new, realistic settings, but even when instant value function updates are possible, our bounds present a significant improvement over previous single MDP exploration bounds, and a drastic improvement over previous concurrent PAC bounds. We also present Bellman error MDPs, a new analysis methodology for online and offline reinforcement learning algorithms, and TCE, a new, fine-grained metric for the cost of exploration.

Item Open Access Semantic Understanding for Augmented Reality and Its Applications (2020-04-08) DeChicchis, Joseph

Although augmented reality (AR) devices and developer toolkits are becoming increasingly ubiquitous, current AR devices lack a semantic understanding of the user’s environment. Semantic understanding in an AR context is critical to improving the AR experience because it aids in narrowing the gap between the physical and virtual worlds, making AR more seamless as virtual content interacts naturally with the physical environment. A granular understanding of the user’s environment has the potential to be applied to a wide variety of problems, such as visual output security, improved mesh generation, and semantic map building of the world. This project investigates semantic understanding for AR by building and deploying a system which uses a semantic segmentation model and Magic Leap One to bring semantic understanding to a physical AR device, and explores applications of semantic understanding such as visual output security using reinforcement learning trained policies and the use of semantic context to improve mesh quality.

Item Open Access Sparse Value Function Approximation for Reinforcement Learning (2013) Painter-Wakefield, Christopher Robert

A key component of many reinforcement learning (RL) algorithms is the approximation of the value function. The design and selection of features for approximation in RL is crucial, and an ongoing area of research. One approach to the problem of feature selection is to apply sparsity-inducing techniques in learning the value function approximation; such sparse methods tend to select relevant features and ignore irrelevant features, thus automating the feature selection process. This dissertation describes three contributions in the area of sparse value function approximation for reinforcement learning.
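To illustrate how a sparsity-inducing penalty can select features automatically, the sketch below fits a linear model with a penalty on the sum of absolute weights, using proximal gradient descent (ISTA) on a synthetic regression problem. This is a generic supervised stand-in and not LARS-TD or any other algorithm from the dissertation; the data, the penalty strength `lam`, and the function names are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink toward zero, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista_lasso(X, y, lam, n_iters=500):
    """Minimize (1 / 2n) * ||X w - y||^2 + lam * ||w||_1 by proximal gradient."""
    n = X.shape[0]
    # Step size 1/L, where L is the Lipschitz constant of the smooth part.
    step = n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic data: only features 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [2.0, -1.5]
y = X @ true_w + 0.01 * rng.normal(size=200)

w_hat = ista_lasso(X, y, lam=0.1)
```

The weights of the two relevant features stay large (slightly shrunk by the penalty), while the weights of the irrelevant features are driven to zero or very near it.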

One method for obtaining sparse linear approximations is the inclusion in the objective function of a penalty on the sum of the absolute values of the approximation weights. This $L_1$ regularization approach was first applied to temporal difference learning in the LARS-inspired, batch learning algorithm LARS-TD. In our first contribution, we define an iterative update equation which has as its fixed point the $L_1$ regularized linear fixed point of LARS-TD. The iterative update gives rise naturally to an online stochastic approximation algorithm. We prove convergence of the online algorithm and show that the $L_1$ regularized linear fixed point is an equilibrium fixed point of the algorithm. We demonstrate the ability of the algorithm to converge to the fixed point, yielding a sparse solution with modestly better performance than unregularized linear temporal difference learning.

Our second contribution extends LARS-TD to integrate policy optimization with sparse value learning. We extend the $L_1$ regularized linear fixed point to include a maximum over policies, defining a new, "greedy" fixed point. The greedy fixed point adds a new invariant to the set which LARS-TD maintains as it traverses its homotopy path, giving rise to a new algorithm integrating sparse value learning and optimization. The new algorithm is demonstrated to be similar in performance to policy iteration using LARS-TD.

Finally, we consider another approach to sparse learning, that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the $L_1$ regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. We consider variants of orthogonal matching pursuit (OMP) applied to RL. The resulting algorithms are analyzed and compared experimentally with existing $L_1$ regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant provides promising theoretical guarantees under certain assumptions on the feature dictionary while another variant empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.

Item Open Access The Characteristics and Neural Substrates of Feedback-based Decision Process in Recognition Memory (2008-04-10) Han, Sanghoon

The judgment of prior stimulus occurrence, generally referred to as item recognition, is perhaps the most heavily studied of all memory skills. A skilled recognition observer not only recovers high fidelity memory evidence, he or she is also able to flexibly modify how much evidence is required for affirmative responding (the decision criterion) depending upon whether the context calls for a cautious or liberal task approach. The ability to adaptively adjust the decision criterion is a relatively understudied recognition skill, and the goal of this thesis is to examine reinforcement learning mechanisms contributing to recognition criterion adaptability. In Chapter 1, I review a measurement model whose theoretical framework has been successfully applied to recognition memory research (i.e., Signal Detection Theory). I also review major findings in the recognition literature examining the adaptive flexibility of criteria. Chapter 2 reports behavioral experiments that examine the sensitivity of decision criteria to trial-by-trial feedback by manipulating feedback validity in a potentially covert manner.
Chapter 3 presents another series of behavioral experiments that used even subtler feedback manipulations based on predictions from reinforcement learning and category learning literatures. The findings suggested that feedback-induced criterion shifts may rely upon procedural learning mechanisms that are largely implicit. The data also revealed that the magnitudes of induced criterion shifts were significantly correlated with personality measures linked to reward seeking outside the laboratory. In Chapter 4, functional magnetic resonance imaging (fMRI) was used to explore possible neurobiological links between brain regions traditionally linked to reinforcement processing and recognition decisions. Prominent activations in striatum tracked the intrinsic goals of the subjects with greater activation for correct responding to old items compared to correct responding to new items during standard recognition testing. Furthermore, the pattern was amplified and reversed by the addition of extrinsic rewards. Finally, activation in ventral striatum tracked individual differences in personality reward seeking measures. Together, the findings further support the idea that a reinforcement learning system contributes to recognition decision-making. In the final chapter, I review the main implications arising from the research and suggest future research that could bolster the current results and implications.

Item Open Access The Neurocomputational Basis of Serial Decision-Making (2017) Abzug, Zachary Mitchell

A hallmark of human behavior is serial decision-making, in which decisions are linked across time: the choices we make are informed by our past decisions and, in turn, influence our future decisions. Flexible, accurate goal-directed behavior breaks down when decisions become inconsistent with previous decisions and their outcomes. Such impairments contribute to the difficulty that people with schizophrenia and other psychiatric disorders have functioning in society.
While there has been a large amount of research investigating the behavioral and neuronal mechanisms responsible for making individual decisions, there is a dearth of research on serial decision-making. The goal of my work has been to establish the formal study of serial decision-making and provide a psychophysical, computational, and neural foundation for future work. In Study 1, we showed that rhesus monkeys, a prime animal model for decision-making, can perform serial decision-making in a novel rule-selection task. The animals selected behavioral rules rationally and used those rules to flexibly discriminate between complex visual stimuli. In Study 2, we had human and monkey subjects perform variations on the rule-selection task to study how behavioral strategies for serial decision-making are dependent on task characteristics. We developed a set of normative probabilistic behavioral models and used Bayesian model selection to determine which model features best explained the observed behavioral data. Specifically, we found that whether or not humans use sensory information (in addition to reward information) to guide their future decisions is dependent on the lower-level features of the task. In Study 3, we investigated the role of one particular brain region, the supplementary eye field (SEF), in serial decision-making. The SEF is part of frontal cortex and sits at the intersection of oculomotor function and broader cognition, and previous studies have implicated it in linking sequences of decisions. We found that neuronal activity in the SEF encoded the rules used for decisions, predicted the outcomes of future decisions, and reacted to the outcomes of past decisions. The two outcome-related signals match what we expect of control signals necessary for flexibly and adaptively updating stimulus values in accordance with past decisions. 
Taken together, these three studies demonstrate that serial decision-making strategies are dependent on decision context and that the SEF may contribute to serial decision-making in dynamic environments.

Item Open Access Topics in Online Markov Decision Processes (2015) Guan, Peng

This dissertation describes sequential decision making problems in non-stationary environments. Online learning algorithms deal with non-stationary environments, but generally there is no notion of a dynamic state to model future impacts of past actions. State-based models are common in stochastic control settings, but well-known frameworks such as Markov decision processes (MDPs) assume a known stationary environment. In recent years, there has been a growing interest in fusing the above two important learning frameworks and considering an MDP setting in which the cost function is allowed to change arbitrarily over time. A number of online MDP algorithms have so far been designed to work under various assumptions about the dynamics of state transitions and to provide performance guarantees, i.e., bounds on the regret, defined as the performance gap between the total cost incurred by the learner and the total cost of the best available stationary policy that could have been chosen in hindsight.
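The regret described in this paragraph can be written out as follows. The notation is an assumption made here for illustration and is not taken from the dissertation: $c_t$ is the cost function at time $t$, $(x_t, a_t)$ is the learner's state-action pair at time $t$, $\Pi$ is the set of stationary policies, and $(x_t^{\pi}, a_t^{\pi})$ is the state-action pair that would arise at time $t$ from following $\pi$ throughout.

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t, a_t)\right] - \min_{\pi \in \Pi} \mathbb{E}\left[\sum_{t=1}^{T} c_t\left(x_t^{\pi}, a_t^{\pi}\right)\right]$$

A sublinear bound, $R_T = o(T)$, means that the learner's average cost per step approaches that of the best stationary policy in hindsight as $T$ grows.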

However, most of the work in this area has been algorithmic: given a problem, one would develop an algorithm almost from scratch and prove the performance guarantees on a case-by-case basis. Moreover, the presence of the state and the assumption of an arbitrarily varying environment complicate both the theoretical analysis and the development of computationally efficient methods. Another potential issue is that, by removing distributional assumptions about the mechanism generating the cost sequences, the existing methods have to consider the worst-case scenario, which may render their solutions too conservative in situations where the environment exhibits some degree of predictability.

This dissertation contributes several novel techniques to address the above challenges of the online MDP framework and opens up new research directions for online MDPs.

Our proposed general framework for deriving algorithms in the online MDP setting leads to a unifying view of existing methods and provides a general procedure for constructing new ones. Several new algorithms are developed and analyzed using this framework. We develop convex-analytical algorithms that take advantage of possible regularity of observed sequences, yet maintain the worst case performance guarantees. To further study the convex-analytic methods we applied above, we take a step back to consider the traditional MDP problem and extend the LP approach to MDPs by adding a relative entropy regularization term. A computationally efficient algorithm for this class of MDPs is constructed under mild assumptions on the state transition models. Two-player zero-sum stochastic games are also investigated in this dissertation as an important extension of the online MDP setting. In short, this dissertation provides in-depth analysis of the online MDP problem and answers several important questions in this field.

Item Open Access Towards Uncertainty and Efficiency in Reinforcement Learning (2021) Zhang, Ruiyi

Deep reinforcement learning (RL) has achieved great success in playing video games and strategic board games, where a simulator is well-defined and massive samples are available. However, in many real-world applications, samples are not easy to collect, and the collection process may be expensive and risky. We consider designing sample-efficient RL algorithms for online exploration and learning from offline interactions. In this thesis, I will introduce algorithms that quantify uncertainty by exploiting intrinsic structures within observations to improve sample complexity. These proposed algorithms are theoretically sound and show broad applicability in recommendation, computer vision, operations management, and natural language processing. This thesis consists of two parts: (i) efficient exploration and (ii) data-driven reinforcement learning.

The exploration-exploitation trade-off has been widely recognized as fundamental. An agent can take exploration actions to learn a better policy or take exploitation actions with the highest reward. A good exploration strategy can improve sample complexity, as a policy can converge faster to near optimality by collecting informative data. Better estimation and usage of uncertainty lead to more efficient exploration, as the agent can efficiently explore to better understand environments, i.e., minimize uncertainty. In the efficient exploration part, we cast reinforcement learning in the space of probability measures and formulate it as Wasserstein gradient flows. The proposed method can quantify the uncertainty of value, policy, and constraint functions to provide efficient exploration.

Running a policy in real environments can be expensive and risky. At the same time, massive logged datasets are available. Data-driven RL can effectively exploit these fixed datasets to perform policy improvement or evaluation. In the data-driven RL part, we consider auto-regressive sequence generation as a real-world sequential decision-making problem, where exploiting uncertainty is useful for generating faithful and informative sequences. Specifically, a planning mechanism is integrated into generation as model-predictive sequence generation. We also find that most RL-based training schemes are not aligned with human evaluations because of poor lexical rewards or simulators. To alleviate this issue, we consider semantic rewards, implemented by the generalized Wasserstein distance. These new schemes can also be interpreted as Wasserstein gradient flows.

Item Open Access Transfer Learning in Value-based Methods with Successor Features (2023) Nemecek, Mark William

This dissertation investigates the concept of transfer learning in a reinforcement learning (RL) context. Transfer learning is based on the idea that it is possible for an agent to use what it has learned in one task to improve the learning process in another task as compared to learning from scratch. This improvement can take multiple forms, such as reducing the number of samples required to reach a given level of performance or even increasing the best performance achieved. In particular, we examine properties and applications of successor features, which are a useful representation that allows efficient calculation of action-value functions for a given policy in different contexts.

Our first contribution is a method for incremental construction of a cache of policies for a family of tasks. When a family of tasks share transition dynamics but differ in reward function, successor features allow us to efficiently compute the action-value functions for known policies in new tasks. As the optimal policy for a new task might be the same as or similar to that for a previous task, it is not always necessary for an agent to learn a new policy for each new task it encounters, especially if it is allowed some amount of suboptimality. We present new bounds for the performance of optimal policies in a new task, as well as an approach to use these bounds to decide, when presented with a new task, whether to use cached policies or learn a new policy.
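The reuse of known policies across reward functions can be illustrated with a minimal sketch. It assumes the standard successor-feature setting in which rewards are linear in some features, so that the action value of a cached policy in a new task is the dot product of its successor features with the new task's reward weights. The names `action_values`, `best_cached_choice`, `cache`, and `w_new` are hypothetical and chosen for illustration; this is not the dissertation's method, and it does not implement the performance bounds mentioned above.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def action_values(psi: Dict[str, Vector], w: Vector) -> Dict[str, float]:
    """Q(s, a) = psi(s, a) . w for one cached policy in one fixed state.

    `psi` maps each action to that policy's successor features in the state;
    `w` holds the reward weights of the new task (assumed linear rewards).
    """
    return {action: dot(features, w) for action, features in psi.items()}


def best_cached_choice(
    cache: Dict[str, Dict[str, Vector]], w: Vector
) -> Tuple[str, str, float]:
    """Return (policy name, action, value) with the highest Q over the cache."""
    best = None
    for name, psi in cache.items():
        for action, q in action_values(psi, w).items():
            if best is None or q > best[2]:
                best = (name, action, q)
    if best is None:
        raise ValueError("the policy cache is empty")
    return best


# Hypothetical successor features of two cached policies in a single state.
cache = {
    "pi_1": {"left": [1.0, 0.0], "right": [0.0, 2.0]},
    "pi_2": {"left": [0.5, 0.5], "right": [1.0, 1.0]},
}
w_new = [1.0, -1.0]  # hypothetical reward weights of a new task

print(action_values(cache["pi_1"], w_new))
print(best_cached_choice(cache, w_new))
```

No new policy is learned here: only the weight vector changes between tasks, which is why evaluating cached policies in a new task is cheap in this setting.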

In our second contribution, we examine the problem of hierarchical reinforcement learning, which involves breaking a task down into smaller subtasks which are easier to solve, through the lens of transfer learning. Within a single task, a subtask may encapsulate a behavior which could be used multiple times for completing the task but occurs in different contexts, such as opening doors while navigating a building. When the reward function changes between tasks, a given subtask may be unaffected, i.e., the optimal behavior within that subtask may remain the same. If so, the behavior may be immediately reused to accelerate training of behaviors for other subtasks. In both of these cases, reusing the learned behavior can be viewed as a transfer learning problem. We introduce a method based on the MAXQ value function decomposition which uses two applications of successor features to facilitate both transfer within a task and transfer between tasks with different reward functions.

The final contribution of this dissertation introduces a method for transfer using a value-based approach in domains with continuous actions. When an environment's action space is continuous, finding the action which maximizes an action-value function approximator efficiently often requires defining a constrained approximator which results in suboptimal behavior. Recently the RBF-DQN approach was proposed to use deep radial-basis value functions to allow efficient maximization of an action-value approximator over the actions while not losing the universal approximator property of neural networks. We present a method which extends this approach to use successor features in order to allow for effective transfer learning between tasks which differ in reward function.

Item Open Access Transition Space Distance Learning (2019) Nemecek, Mark William

The notion of distance plays an important role in many reinforcement learning (RL) techniques. This role may be explicit, as in some non-parametric approaches, or it may be implicit in the architecture of the feature space. The ability to learn distance functions tailored for RL tasks could, thus, benefit many different RL paradigms. While several approaches to learning distance functions from data do exist, they are frequently intended for use in clustering or classification tasks and typically do not take into account the inherent structure present in trajectories sampled from RL environments. For those that do, this structure is generally used to define a similarity between states rather than to represent the mechanics of the domain. Based on the idea that a good distance function in such a domain would reflect the number of transitions necessary to get from one state to another, we detail an approach to learning distance functions which accounts for the nature of state transitions in a Markov decision process, including their inherent directionality. We then present the results of experiments performed in multiple RL environments in order to demonstrate the benefit of learning such distance functions.
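To make the idea of a directional, transition-counting distance concrete, the sketch below computes the exact number of transitions between states in a tiny hand-written graph using breadth-first search. It assumes deterministic transitions and a fully known graph, and the function name `transition_distance` and the example states are hypothetical. It is not the learning approach described in the abstract, only an illustration of the kind of quantity such a learned distance would reflect.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Optional

Graph = Dict[Hashable, Iterable[Hashable]]


def transition_distance(graph: Graph, start: Hashable, goal: Hashable) -> Optional[int]:
    """Minimum number of transitions from `start` to `goal`, or None if unreachable.

    `graph` maps each state to the states reachable in one transition.
    Edges are directed, so the result need not be symmetric.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, steps = queue.popleft()
        for nxt in graph.get(state, ()):
            if nxt == goal:
                return steps + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


# Hypothetical one-way loop: A -> B -> C -> A.
loop = {"A": ["B"], "B": ["C"], "C": ["A"]}

print(transition_distance(loop, "A", "C"))
print(transition_distance(loop, "C", "A"))
```

In this example, going from A to C takes two transitions while going from C to A takes one, which is the kind of asymmetry that a symmetric similarity measure between states cannot express.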

Item Embargo Understanding and Modeling Human Planners’ Strategy in Human-automation Interaction in Treatment Planning Using Deep Learning and Reinforcement Learning (2023) Yang, Dongrong

Purpose: Radiation therapy aims to deliver high-energy radiation beams to eradicate cancer cells. Because radiation is toxic to normal tissue, a treatment planning process is needed to customize the radiation beam to the patient-specific treatment geometry while minimizing the radiation dose to normal tissue. Treatment planning is often, however, a trial-and-error process for arriving at the optimal dose distribution. Breast cancer radiation therapy is one of the most common treatments in a modern radiation oncology department. Whole breast radiation therapy (WBRT) using electronic compensation is an iterative, time-consuming manual process. Our institution has been using an artificial intelligence (AI) based planning tool for WBRT for 3 years. It is unclear how human planners interact with AI in a real clinical setting and whether a human planner can inject additional insight into a well-established AI model. Therefore, the first aim of this study is to model planners’ interaction with AI using a deep neural network (NN). In addition, we propose a multi-agent reinforcement learning based framework (MultiRL-FE) that self-interacts with the treatment planning system with location awareness to improve plan quality via fluence editing.

Methods: A total of 1151 patients have been treated since the in-house AI-based planning tool was released for clinical use in 2019. All 526 patients treated with single-energy beams were included in this study. The AI tool automatically generates fluence maps and creates an “AI plan”. The planner then evaluates the plan and attempts manual fluence modification before the physician’s approval (“final plan”). The manual-modification-value (MMV) of each beamlet is the difference between the fluence maps of the “AI plan” and the “final plan”. The MMV was recorded for each planner. In the first aim, a deep NN using the UNet3+ architecture was developed to predict the MMV from the AI fluence map and the corresponding dose map and organ map in the beam’s eye view (BEV). The predicted MMV maps were then applied to the initial “AI plans” to generate AI-modified plans (“AI-m plans”). In the second aim, we developed MultiRL-FE to self-interact with a given plan to improve the plan quality. A simplified treatment planning system was built in the Python environment to train the agent. For each pixel in the fluence map, an individual agent was assigned to interact with the environment by editing the fluence value and to receive rewards based on the dose profile of the projected beam ray. The asynchronous advantage actor critic (A3C) algorithm was used as the backbone for training the reinforcement learning agents. To train the agents effectively, we developed the MultiRL-FE framework by embedding A3C in a fully convolutional neural network. To test the feasibility of the proposed framework, twelve patients from the same cohort were selected (6 for training and 6 for testing). “Final plans” were perturbed with a 10% dose variation to evaluate the potential of the framework to improve the plan. The agent was designed to iteratively modify the fluence maps for 10 iterations. The modified fluence intensity was imported into the Eclipse treatment planning system for dose calculation. For both aims, plan quality was evaluated by dosimetric endpoints including breast PTV V95%(%), V105%(%), V110%(%), lung V20Gy(%) and heart V5Gy(%).

Results: In the first aim, the “AI-m plans” generated by the HAI (human-automation interaction) network showed statistically significant improvement (p<.05) in hotspot control compared with the initial “AI plan”, with an average change of -25.2 cc in breast V105% volume and -0.805% in Dmax. The planning target volume (PTV) coverage was similar to that of the “AI plan” and the “final plan”. In the second aim of MultiRL-FE testing, the RL-modified plans showed a substantial hotspot reduction from the initial plans. The average PTV V105%(%) of the testing set was reduced from 77.78 (±2.78) to 16.97 (±9.42), while that of the clinical plans was 3.34 (±2.73). Meanwhile, the modified plans showed improved dose coverage over the clinical plans, with 70.45 (±3.94) compared to 65.44 (±5.39) for V95%(%).

Conclusions: In the first part of this study, we proposed an HAI model to enhance the clinical AI tool by reducing hotspot volume from a human perspective. By understanding and modeling the human-automation interaction, this study could advance the widespread clinical application of AI tools in radiation oncology departments with improved robustness and acceptability. In the second part, we developed a self-interactive treatment planning agent with multi-agent reinforcement learning. It offers the advantage of fast location-aware dose editing and can serve as an alternative optimization tool for intensity-modulated radiation therapy and electronic tissue compensation-based treatment planning.