Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Publication: Contribution to book/anthology/report › Conference article in proceedings › Research › peer-reviewed

Standard

Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. / Saber, Hassan ; Pesquerel, Fabien ; Maillard, Odalric-Ambrym ; Talebi, Mohammad Sadegh.

Proceedings of the 15th Asian Conference on Machine Learning. PMLR, 2024. pp. 1167-1182 (Proceedings of Machine Learning Research, Vol. 222).

Harvard

Saber, H, Pesquerel, F, Maillard, O-A & Talebi, MS 2024, Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. in Proceedings of the 15th Asian Conference on Machine Learning. PMLR, Proceedings of Machine Learning Research, vol. 222, pp. 1167-1182, 15th Asian Conference on Machine Learning, İstanbul, Turkey, 11/11/2023. <https://proceedings.mlr.press/v222/saber24a.html>

APA

Saber, H., Pesquerel, F., Maillard, O-A., & Talebi, M. S. (2024). Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. In Proceedings of the 15th Asian Conference on Machine Learning (pp. 1167-1182). PMLR. Proceedings of Machine Learning Research Vol. 222. https://proceedings.mlr.press/v222/saber24a.html

Vancouver

Saber H, Pesquerel F, Maillard O-A, Talebi MS. Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. In Proceedings of the 15th Asian Conference on Machine Learning. PMLR. 2024. p. 1167-1182. (Proceedings of Machine Learning Research, Vol. 222).

Author

Saber, Hassan ; Pesquerel, Fabien ; Maillard, Odalric-Ambrym ; Talebi, Mohammad Sadegh. / Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. Proceedings of the 15th Asian Conference on Machine Learning. PMLR, 2024. pp. 1167-1182 (Proceedings of Machine Learning Research, Vol. 222).

Bibtex

@inproceedings{bfce6f65134c41b0bdded35bcdc9ae56,
title = "Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits",
abstract = "We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.",
author = "Hassan Saber and Fabien Pesquerel and Odalric-Ambrym Maillard and Talebi, {Mohammad Sadegh}",
year = "2024",
language = "English",
series = "Proceedings of Machine Learning Research",
pages = "1167--1182",
booktitle = "Proceedings of the 15th Asian Conference on Machine Learning",
publisher = "PMLR",
note = "15th Asian Conference on Machine Learning ; Conference date: 11-11-2023 Through 14-11-2023",

}

RIS

TY - GEN

T1 - Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

T2 - 15th Asian Conference on Machine Learning

AU - Saber, Hassan

AU - Pesquerel, Fabien

AU - Maillard, Odalric-Ambrym

AU - Talebi, Mohammad Sadegh

PY - 2024

Y1 - 2024

N2 - We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.

AB - We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.

M3 - Article in proceedings

T3 - Proceedings of Machine Learning Research

SP - 1167

EP - 1182

BT - Proceedings of the 15th Asian Conference on Machine Learning

PB - PMLR

Y2 - 11 November 2023 through 14 November 2023

ER -

ID: 386154817
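
Method note: the abstract refers to the Indexed Minimum Empirical Divergence (IMED) strategy for bandits. As a minimal, hedged sketch of that underlying bandit rule only (the classical IMED index of Honda and Takemura for Bernoulli rewards, not the paper's MDP algorithm; the function names below are illustrative assumptions), each arm a with N_a plays and empirical mean mu_a receives the index N_a * kl(mu_a, mu*) + log N_a, where mu* is the best empirical mean, and the arm with the smallest index is played next:

import math

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for numerical stability
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_choice(counts, means):
    # IMED index per arm: N_a * kl(mu_a, mu*) + log(N_a); play the argmin.
    # The empirically best arm has kl = 0, so its index is just log(N_a),
    # which grows slowly and forces periodic re-exploration of other arms.
    mu_star = max(means)
    indices = [n * bernoulli_kl(m, mu_star) + math.log(n)
               for n, m in zip(counts, means)]
    return min(range(len(indices)), key=indices.__getitem__)

# Toy usage: three arms, each already played a few times
counts = [10, 12, 5]
means = [0.60, 0.75, 0.40]
print(imed_choice(counts, means))  # selects the arm minimizing the IMED index

The paper extends this index idea to communicating MDPs with known transitions, adding a stopping criterion based on the recurrent classes induced by stationary policies; that extension is not reproduced here.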