Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review



We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than when the dynamics are also unknown, these problems remain challenging: they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard even with known transitions, we focus on a computationally efficient relaxation that achieves order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion that leverages the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, logarithmic regret bound for this algorithm, relying on a novel regret decomposition that exploits this structure. We also provide an efficient implementation and experiments illustrating its promising empirical performance.
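As background, the Indexed Minimum Empirical Divergence (IMED) strategy that the proposed algorithm builds on assigns each arm an index combining an empirical KL divergence to the best empirical mean with a log-count term, and pulls the arm with the smallest index. The sketch below is a minimal illustration of that bandit index for Bernoulli rewards, assuming per-arm counts and empirical means are tracked externally; it is not the paper's MDP algorithm or stopping criterion.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_indices(counts, means):
    """IMED index per arm: N_a * KL(mu_hat_a, mu_hat_max) + log(N_a).
    The arm minimizing this index is pulled next."""
    mu_max = max(means)
    return [n * bernoulli_kl(mu, mu_max) + np.log(n)
            for n, mu in zip(counts, means)]

# Hypothetical arm statistics after a few initial pulls of each arm.
counts = [10, 25, 5]
means = [0.4, 0.6, 0.5]
next_arm = int(np.argmin(imed_indices(counts, means)))
```

Note that the empirically best arm has zero divergence, so its index reduces to log(N_a); under-sampled suboptimal arms keep small indices until their divergence term grows, which is what drives the logarithmic exploration rate.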
Original language: English
Title of host publication: Proceedings of the 15th Asian Conference on Machine Learning
Publisher: PMLR
Publication date: 2024
Pages: 1167-1182
Publication status: Published - 2024
Event: 15th Asian Conference on Machine Learning - İstanbul, Turkey
Duration: 11 Nov 2023 - 14 Nov 2023

Conference

Conference: 15th Asian Conference on Machine Learning
Country: Turkey
City: İstanbul
Period: 11/11/2023 - 14/11/2023
Series: Proceedings of Machine Learning Research
Volume: 222
ISSN: 2640-3498
