New Directions in Bandit Learning: Singularities and Random Walk Feedback

dc.contributor.advisor

Rudin, Cynthia

dc.contributor.author

Wang, Tianyu

dc.date.accessioned

2021-05-19T18:08:03Z

dc.date.available

2021-05-19T18:08:03Z

dc.date.issued

2021

dc.department

Computer Science

dc.description.abstract

My thesis focuses new directions in bandit learning problems. In Chapter 1, I give an overview of the bandit learning literature, which lays the discussion framework for studies in Chapters 2 and 3. In Chapter 2, I study bandit learning problem in metric measure spaces. I start with multi-armed bandit problem with Lipschitz reward, and propose a practical algorithm that can utilize greedy tree training methods and adapts to the landscape of the reward function. In particular, the study provides a Bayesian perspective to this problem. Also, I study bandit learning for Bounded Mean Oscillation (BMO) functions, where the goal is to ``maximize'' a function that may go to infinity in parts of the space. For an unknown BMO function, I will present algorithms that efficiently finds regions with high function values. To handle possible singularities and unboundedness in BMO functions, I will introduce the new notion of $\delta$-regret -- the difference between the function values along the trajectory and a point that is optimal after removing a $\delta$-sized portion of the space. I will show that my algorithm has $ \mathcal{O} \left( \frac{\kappa \log T}{T} \right) $ average $T$-step $\delta$-regret, where $ \kappa $ depends on $\delta$ and adapts to the landscape of the underlying reward function. In Chapter 3, I will study bandit learning with random walk trajectories as feedback. In domains including online advertisement and social networks, user behaviors can be modeled as a random walk over a network. To this end, we study a novel bandit learning problem, where each arm is the starting node of a random walk in a network and the reward is the length of the walk. We provide a comprehensive understanding of this formulation by studying both the stochastic and the adversarial setting. In the stochastic setting, we observe that, there exists a difficult problem instance on which the following two seemingly conflicting facts simultaneously hold: 1. No algorithm can achieve a regret bound independent of problem intrinsics information theoretically; and 2. There exists an algorithm whose performance is independent of problem intrinsics in terms of tail of mistakes. This reveals an intriguing phenomenon in general semi-bandit feedback learning problems. In the adversarial setting, we establish novel algorithms that achieve regret bound of order $\widetilde{\mathcal{O}} \left( \sqrt{ \kappa T}\right) $, where $\kappa$ is a constant that depends on the structure of the graph, instead of number of arms (nodes). This bounds significantly improves regular bandit algorithms, whose complexity depends on number of arms (nodes).

dc.identifier.uri

https://hdl.handle.net/10161/23014

dc.subject

Computer science

dc.subject

decision science

dc.subject

Machine learning

dc.subject

multi-armed bandits

dc.title

New Directions in Bandit Learning: Singularities and Random Walk Feedback

dc.type

Dissertation

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Wang_duke_0066D_16073.pdf
Size:
4.89 MB
Format:
Adobe Portable Document Format

Collections