# New Directions in Bandit Learning: Singularities and Random Walk Feedback

## Date

2021

## Authors

## Advisors

## Journal Title

## Journal ISSN

## Volume Title

## Repository Usage Stats

views

downloads

## Abstract

My thesis focuses new directions in bandit learning problems. In Chapter 1, I give an overview of the bandit learning literature, which lays the discussion framework for studies in Chapters 2 and 3. In Chapter 2, I study bandit learning problem in metric measure spaces. I start with multi-armed bandit problem with Lipschitz reward, and propose a practical algorithm that can utilize greedy tree training methods and adapts to the landscape of the reward function. In particular, the study provides a Bayesian perspective to this problem. Also, I study bandit learning for Bounded Mean Oscillation (BMO) functions, where the goal is to ``maximize'' a function that may go to infinity in parts of the space. For an unknown BMO function, I will present algorithms that efficiently finds regions with high function values. To handle possible singularities and unboundedness in BMO functions, I will introduce the new notion of $\delta$-regret -- the difference between the function values along the trajectory and a point that is optimal after removing a $\delta$-sized portion of the space. I will show that my algorithm has $ \mathcal{O} \left( \frac{\kappa \log T}{T} \right) $ average $T$-step $\delta$-regret, where $ \kappa $ depends on $\delta$ and adapts to the landscape of the underlying reward function. In Chapter 3, I will study bandit learning with random walk trajectories as feedback. In domains including online advertisement and social networks, user behaviors can be modeled as a random walk over a network. To this end, we study a novel bandit learning problem, where each arm is the starting node of a random walk in a network and the reward is the length of the walk. We provide a comprehensive understanding of this formulation by studying both the stochastic and the adversarial setting. In the stochastic setting, we observe that, there exists a difficult problem instance on which the following two seemingly conflicting facts simultaneously hold: 1. No algorithm can achieve a regret bound independent of problem intrinsics information theoretically; and 2. There exists an algorithm whose performance is independent of problem intrinsics in terms of tail of mistakes. This reveals an intriguing phenomenon in general semi-bandit feedback learning problems. In the adversarial setting, we establish novel algorithms that achieve regret bound of order $\widetilde{\mathcal{O}} \left( \sqrt{ \kappa T}\right) $, where $\kappa$ is a constant that depends on the structure of the graph, instead of number of arms (nodes). This bounds significantly improves regular bandit algorithms, whose complexity depends on number of arms (nodes).

## Type

## Department

## Description

## Provenance

## Citation

## Permalink

## Citation

Wang, Tianyu (2021). *New Directions in Bandit Learning: Singularities and Random Walk Feedback*. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/23014.

## Collections

Dukes student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.