Towards Uncertainty and Efficiency in Reinforcement Learning

Deep reinforcement learning (RL) has achieved great success in playing video games and strategic board games, where a simulator is well defined and massive numbers of samples are available. In many real-world applications, however, samples are hard to collect, and the collection process can be expensive and risky. We consider the design of sample-efficient RL algorithms for online exploration and for learning from offline interactions. In this thesis, I introduce algorithms that quantify uncertainty by exploiting intrinsic structure within observations to improve sample complexity. The proposed algorithms are theoretically sound and show broad applicability in recommendation, computer vision, operations management, and natural language processing. The thesis consists of two parts: (i) efficient exploration and (ii) data-driven reinforcement learning.

The trade-off between exploration and exploitation is widely recognized as fundamental. An agent can take exploratory actions to learn a better policy, or exploitative actions that yield the highest reward. A good exploration strategy improves sample complexity, because a policy converges faster to near-optimality when it collects informative data. Better estimation and use of uncertainty lead to more efficient exploration, as the agent can explore efficiently to better understand its environment, i.e., to minimize uncertainty. In the efficient exploration part, we place reinforcement learning in the space of probability measures and formulate it as Wasserstein gradient flows. The proposed method quantifies the uncertainty of value, policy, and constraint functions to enable efficient exploration.

Running a policy in real environments can be expensive and risky. At the same time, massive logged datasets are available. Data-driven RL can effectively exploit these fixed datasets for policy improvement or evaluation. In the data-driven RL part, we treat auto-regressive sequence generation as a real-world sequential decision-making problem, where exploiting uncertainty is useful for generating faithful and informative sequences. Specifically, a planning mechanism is integrated into generation, yielding model-predictive sequence generation. We also observe that most RL-based training schemes are not aligned with human evaluation because of poor lexical rewards or simulators. To alleviate this issue, we consider semantic rewards, implemented by the generalized Wasserstein distance. These new schemes can also be interpreted as Wasserstein gradient flows.





Zhang, Ruiyi (2021). Towards Uncertainty and Efficiency in Reinforcement Learning. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.