Exploration in Reward Machines with Low Regret

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

Documents

  • Fulltext

    Final published version, 1.34 MB, PDF document

We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.

Original languageEnglish
Title of host publicationProceedings of The 26th International Conference on Artificial Intelligence and Statistics
Number of pages33
Volume206
PublisherPMLR
Publication date2023
Pages4114-4146
Publication statusPublished - 2023
Event26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023 - Valencia, Spain
Duration: 25 Apr 202327 Apr 2023

Conference

Conference26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023
LandSpain
ByValencia
Periode25/04/202327/04/2023
SeriesProceedings of Machine Learning Research
Volume206
ISSN2640-3498

Bibliographical note

Publisher Copyright:
Copyright © 2023 by the author(s)

ID: 360396783