Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

dc.contributor.author

Shen, Y

dc.contributor.author

Xu, P

dc.contributor.author

Zavlanos, MM

dc.date.accessioned

2025-11-03T16:29:39Z

dc.date.available

2025-11-03T16:29:39Z

dc.date.issued

2024-01-01

dc.description.abstract

Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values, assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to overfitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally computationally more expensive than KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite-sample complexity and iteration complexity of our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.
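
For orientation only (this sketch is not part of the record, and the symbols are generic assumptions rather than the authors' notation): a Wasserstein DRO policy value of the kind the abstract describes is typically written as

\[
V_{\mathrm{rob}}(\pi) \;=\; \inf_{Q \,:\, W(Q,\widehat{P}_n) \le \varepsilon}\; \mathbb{E}_{(x,r)\sim Q}\Big[\textstyle\sum_{a} \pi(a \mid x)\, r(a)\Big],
\]

where \(\widehat{P}_n\) is the empirical distribution of the logged contexts and rewards, \(W\) is the Wasserstein distance, \(\varepsilon\) is the radius of the uncertainty set, and \(r(a)\) is the reward of action \(a\). Replacing \(W\) with the KL divergence recovers the KL uncertainty set the abstract contrasts against.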

dc.identifier.issn

2835-8856

dc.identifier.uri

https://hdl.handle.net/10161/33467

dc.relation.ispartof

Transactions on Machine Learning Research

dc.rights.uri

https://creativecommons.org/licenses/by-nc/4.0

dc.title

Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

dc.type

Journal article

duke.contributor.orcid

Xu, P|0000-0002-2559-8622

pubs.organisational-group

Duke

pubs.organisational-group

Pratt School of Engineering

pubs.organisational-group

School of Medicine

pubs.organisational-group

Trinity College of Arts & Sciences

pubs.organisational-group

Basic Science Departments

pubs.organisational-group

Biostatistics & Bioinformatics

pubs.organisational-group

Electrical and Computer Engineering

pubs.organisational-group

Computer Science

pubs.organisational-group

Biostatistics & Bioinformatics, Division of Translational Biomedical

pubs.publication-status

Published

pubs.volume

2024

Files

Original bundle

Name: 1645_Wasserstein_Distributiona.pdf
Size: 1.25 MB
Format: Adobe Portable Document Format
Description: Published version