dc.description.abstract |
<p>As the reinforcement learning community has shifted its focus from heuristic methods
to methods with performance guarantees, PAC-optimal exploration algorithms have
received significant attention. Unfortunately, the majority of current PAC-optimal
exploration algorithms are inapplicable in realistic scenarios: 1) they scale poorly
to domains of realistic size; 2) they are only applicable to discrete state-action
spaces; 3) they assume that experience comes from a single, continuous trajectory;
and 4) they assume that value function updates are instantaneous. The goal of this work
is to bridge the gap between theory and practice by introducing an efficient and
customizable PAC-optimal exploration algorithm that is able to explore in multiple
continuous- or discrete-state MDPs simultaneously. Our algorithm does not assume that
value function updates can be completed instantaneously, and it maintains PAC guarantees
in real-time environments. Not only do we extend the applicability of PAC-optimal exploration
algorithms to new, realistic settings, but even when instant value function updates
are possible, our bounds represent a significant improvement over previous single-MDP
exploration bounds and a drastic improvement over previous concurrent PAC bounds.
We also present Bellman error MDPs, a new analysis methodology for online and offline
reinforcement learning algorithms, and TCE, a new, fine-grained metric for the cost
of exploration.</p>