Sparse Value Function Approximation for Reinforcement Learning
A key component of many reinforcement learning (RL) algorithms is the approximation of the value function. The design and selection of features for approximation in RL is crucial, and an ongoing area of research. One approach to the problem of feature selection is to apply sparsity-inducing techniques in learning the value function approximation; such sparse methods tend to select relevant features and ignore irrelevant features, thus automating the feature selection process. This dissertation describes three contributions in the area of sparse value function approximation for reinforcement learning.
One method for obtaining sparse linear approximations is the inclusion in the objective function of a penalty on the sum of the absolute values of the approximation weights. This <italic>L1</italic> regularization approach was first applied to temporal difference learning in the LARS-inspired, batch learning algorithm LARS-TD. In our first contribution, we define an iterative update equation which has as its fixed point the <italic>L1</italic> regularized linear fixed point of LARS-TD. The iterative update gives rise naturally to an online stochastic approximation algorithm. We prove convergence of the online algorithm and show that the <italic>L1</italic> regularized linear fixed point is an equilibrium fixed point of the algorithm. We demonstrate the ability of the algorithm to converge to the fixed point, yielding a sparse solution with modestly better performance than unregularized linear temporal difference learning.
Our second contribution extends LARS-TD to integrate policy optimization with sparse value learning. We extend the <italic>L1</italic> regularized linear fixed point to include a maximum over policies, defining a new, "greedy" fixed point. The greedy fixed point adds a new invariant to the set which LARS-TD maintains as it traverses its homotopy path, giving rise to a new algorithm integrating sparse value learning and optimization. The new algorithm is demonstrated to be similar in performance with policy iteration using LARS-TD.
Finally, we consider another approach to sparse learning, that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the <italic>L1</italic> regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. We consider variants of orthogonal matching pursuit (OMP) applied to RL. The resulting algorithms are analyzed and compared experimentally with existing <italic>L1</italic> regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant provides promising theoretical guarantees under certain assumptions on the feature dictionary while another variant empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations