Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
Standard
Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. / Saber, Hassan; Pesquerel, Fabien; Maillard, Odalric-Ambrym; Talebi, Mohammad Sadegh.
Proceedings of the 15th Asian Conference on Machine Learning. PMLR, 2024. pp. 1167-1182 (Proceedings of Machine Learning Research, Vol. 222).
RIS
TY - GEN
T1 - Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
T2 - 15th Asian Conference on Machine Learning
AU - Saber, Hassan
AU - Pesquerel, Fabien
AU - Maillard, Odalric-Ambrym
AU - Talebi, Mohammad Sadegh
PY - 2024
Y1 - 2024
N2 - We study regret minimization in an average-reward, communicating Markov Decision Process (MDP) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, it remains challenging, since this setting includes as special cases large problem classes such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion that leverages the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, logarithmic regret bound for this algorithm, which relies on a novel regret decomposition that exploits this structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
AB - We study regret minimization in an average-reward, communicating Markov Decision Process (MDP) with known dynamics but an unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, it remains challenging, since this setting includes as special cases large problem classes such as combinatorial semi-bandits. Leveraging knowledge of the transition function for regret minimization in a statistically efficient way appears largely unexplored. Since achieving exact optimality in generic MDPs is conjectured to be NP-hard even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence (IMED) strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion that leverages the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, logarithmic regret bound for this algorithm, which relies on a novel regret decomposition that exploits this structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
M3 - Article in proceedings
T3 - Proceedings of Machine Learning Research
SP - 1167
EP - 1182
BT - Proceedings of the 15th Asian Conference on Machine Learning
PB - PMLR
Y2 - 11 November 2023 through 14 November 2023
ER -
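The abstract builds on the Indexed Minimum Empirical Divergence (IMED) strategy for bandits. As a point of reference, the classical IMED index for Bernoulli bandits can be sketched as below; this is a minimal illustration of the bandit-side ingredient only, not the paper's MDP algorithm, and all function names are hypothetical.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_indices(counts, means):
    # IMED index of arm a: N_a * KL(mu_hat_a, mu_hat_max) + log(N_a).
    # Arms close to the empirical best, or rarely pulled, get a small index.
    mu_max = max(means)
    return [n * bernoulli_kl(m, mu_max) + math.log(n)
            for n, m in zip(counts, means)]

def imed_play(counts, means):
    # IMED plays the arm minimizing the index (no explicit confidence bounds).
    idx = imed_indices(counts, means)
    return min(range(len(idx)), key=idx.__getitem__)
```

For example, with pull counts `[100, 100]` and empirical means `[0.9, 0.1]`, the empirically best arm has index `log(100)` while the other pays an additional `100 * KL(0.1, 0.9)` exploration cost, so IMED keeps playing arm 0 until the suboptimal arm's statistics warrant another look.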