Logarithmic regret in communicating MDPs

Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Publikation: Bidrag til bog/antologi/rapport › Konferencebidrag i proceedings › Forskning › fagfællebedømt

Dokumenter

Fulltext
Forlagets udgivne version, 883 KB, PDF-dokument

Hassan Saber
Fabien Pesquerel
Odalric-Ambrym Maillard
Talebi, Sadegh

We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.

Originalsprog	Engelsk
Titel	Proceedings of the 15th Asian Conference on Machine Learning
Forlag	PMLR
Publikationsdato	2024
Sider	1167-1182
Status	Udgivet - 2024
Begivenhed	15th Asian Conference on Machine Learning - İstanbul, Tyrkiet Varighed: 11 nov. 2023 → 14 nov. 2023

Konference

Konference	15th Asian Conference on Machine Learning
Land	Tyrkiet
By	İstanbul
Periode	11/11/2023 → 14/11/2023

Navn	Proceedings of Machine Learning Research
Vol/bind	222
ISSN	2640-3498

Datalogisk Institut

Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Dokumenter

Konference

Links