Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits

Publication: Contribution to book/anthology/report - Conference contribution in proceedings - Research - peer-reviewed



We study regret minimization in average-reward, communicating Markov Decision Processes (MDPs) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, it remains challenging, since these MDPs include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of targeting order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion that leverages the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, relying on a novel regret decomposition that exploits this structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
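For context on the bandit strategy named in the abstract: the Indexed Minimum Empirical Divergence (IMED) rule assigns each arm a the index N_a * KL(mu_hat_a, mu_hat_max) + log(N_a), where mu_hat_max is the largest empirical mean, and pulls the arm with the smallest index. The sketch below illustrates this rule for Bernoulli rewards; it is only background for the abstract, not the paper's MDP algorithm, and all function and variable names are ours.

    # Minimal, illustrative IMED rule for a Bernoulli bandit (names are ours;
    # this is the base bandit strategy referenced in the abstract, not the
    # paper's MDP algorithm).
    import numpy as np

    def bernoulli_kl(p, q, eps=1e-12):
        # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability.
        p = min(max(p, eps), 1.0 - eps)
        q = min(max(q, eps), 1.0 - eps)
        return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

    def imed_arm(counts, means):
        # IMED index: I_a = N_a * KL(mu_hat_a, mu_hat_max) + log(N_a);
        # the next arm to pull is the one minimizing this index.
        mu_max = max(means)
        indices = [
            n * bernoulli_kl(mu, mu_max) + np.log(n) if n > 0 else -np.inf
            for n, mu in zip(counts, means)
        ]
        return int(np.argmin(indices))

    # Example: empirical pull counts and means for three arms.
    counts = [10, 5, 2]
    means = [0.6, 0.55, 0.3]
    print(imed_arm(counts, means))  # arm IMED would pull next (here the under-explored arm 2)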
Original language: English
Title: Proceedings of the 15th Asian Conference on Machine Learning
Publisher: PMLR
Publication date: 2024
Pages: 1167-1182
Status: Published - 2024
Event: 15th Asian Conference on Machine Learning - İstanbul, Turkey
Duration: 11 Nov 2023 - 14 Nov 2023

Conference

Conference: 15th Asian Conference on Machine Learning
Country: Turkey
City: İstanbul
Period: 11/11/2023 - 14/11/2023
Name: Proceedings of Machine Learning Research
Volume: 222
ISSN: 2640-3498
