Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review



We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than when the dynamics are also unknown, these problems remain challenging: they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard even with known transitions, we focus on a computationally efficient relaxation that achieves order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion that leverages the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, logarithmic regret bound for this algorithm, relying on a novel regret decomposition that exploits this structure. We also provide an efficient implementation and experiments illustrating its promising empirical performance.
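As background, the Indexed Minimum Empirical Divergence (IMED) strategy that the proposed algorithm builds on assigns each arm an index combining an empirical KL divergence to the best empirical mean with a log-count term, and pulls the arm with the smallest index. The sketch below is a minimal illustration of that bandit index for Bernoulli rewards, assuming per-arm counts and empirical means are tracked externally; it is not the paper's MDP algorithm or stopping criterion.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_indices(counts, means):
    """IMED index per arm: N_a * KL(mu_hat_a, mu_hat_max) + log(N_a).
    The arm minimizing this index is pulled next."""
    mu_max = max(means)
    return [n * bernoulli_kl(mu, mu_max) + np.log(n)
            for n, mu in zip(counts, means)]

# Hypothetical arm statistics after a few initial pulls of each arm.
counts = [10, 25, 5]
means = [0.4, 0.6, 0.5]
next_arm = int(np.argmin(imed_indices(counts, means)))
```

Note that the empirically best arm has zero divergence, so its index reduces to log(N_a); under-sampled suboptimal arms keep small indices until their divergence term grows, which is what drives the logarithmic exploration rate.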
Original language: English
Title of host publication: Proceedings of the 15th Asian Conference on Machine Learning
Publisher: PMLR
Publication date: 2024
Pages: 1167-1182
Publication status: Published - 2024
Event: 15th Asian Conference on Machine Learning - İstanbul, Turkey
Duration: 11 Nov 2023 - 14 Nov 2023

Conference

Conference: 15th Asian Conference on Machine Learning
Country: Turkey
City: İstanbul
Period: 11/11/2023 - 14/11/2023
Series: Proceedings of Machine Learning Research
Volume: 222
ISSN: 2640-3498
