New Directions in Bandit Learning: Singularities and Random Walk Feedback
Date
2021
Authors
Advisors
Journal Title
Journal ISSN
Volume Title
Repository Usage Stats
views
downloads
Abstract
My thesis focuses new directions in bandit learning problems. In Chapter 1, I give an overview of the bandit learning literature, which lays the discussion framework for studies in Chapters 2 and 3. In Chapter 2, I study bandit learning problem in metric measure spaces. I start with multi-armed bandit problem with Lipschitz reward, and propose a practical algorithm that can utilize greedy tree training methods and adapts to the landscape of the reward function. In particular, the study provides a Bayesian perspective to this problem. Also, I study bandit learning for Bounded Mean Oscillation (BMO) functions, where the goal is to ``maximize'' a function that may go to infinity in parts of the space. For an unknown BMO function, I will present algorithms that efficiently finds regions with high function values. To handle possible singularities and unboundedness in BMO functions, I will introduce the new notion of $\delta$-regret -- the difference between the function values along the trajectory and a point that is optimal after removing a $\delta$-sized portion of the space. I will show that my algorithm has $ \mathcal{O} \left( \frac{\kappa \log T}{T} \right) $ average $T$-step $\delta$-regret, where $ \kappa $ depends on $\delta$ and adapts to the landscape of the underlying reward function. In Chapter 3, I will study bandit learning with random walk trajectories as feedback. In domains including online advertisement and social networks, user behaviors can be modeled as a random walk over a network. To this end, we study a novel bandit learning problem, where each arm is the starting node of a random walk in a network and the reward is the length of the walk. We provide a comprehensive understanding of this formulation by studying both the stochastic and the adversarial setting. In the stochastic setting, we observe that, there exists a difficult problem instance on which the following two seemingly conflicting facts simultaneously hold: 1. No algorithm can achieve a regret bound independent of problem intrinsics information theoretically; and 2. There exists an algorithm whose performance is independent of problem intrinsics in terms of tail of mistakes. This reveals an intriguing phenomenon in general semi-bandit feedback learning problems. In the adversarial setting, we establish novel algorithms that achieve regret bound of order $\widetilde{\mathcal{O}} \left( \sqrt{ \kappa T}\right) $, where $\kappa$ is a constant that depends on the structure of the graph, instead of number of arms (nodes). This bounds significantly improves regular bandit algorithms, whose complexity depends on number of arms (nodes).
Type
Department
Description
Provenance
Citation
Permalink
Citation
Wang, Tianyu (2021). New Directions in Bandit Learning: Singularities and Random Walk Feedback. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/23014.
Collections
Dukes student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.