Exploration in Reward Machines with Low Regret

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Documents

Fulltext
Publisher's published version, 1.34 MB, PDF document

Bourel, Hippolyte Raymond
Anders Jonsson
Odalric Ambrym Maillard
Talebi, Sadegh

We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploit the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.

Original language	English
Title	Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
Number of pages	33
Volume	206
Publisher	PMLR
Publication date	2023
Pages	4114-4146
Status	Published - 2023
Event	26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023 - Valencia, Spain. Duration: 25 Apr 2023 → 27 Apr 2023

Conference

Conference	26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023
Country	Spain
City	Valencia
Period	25/04/2023 → 27/04/2023

Name	Proceedings of Machine Learning Research
Volume	206
ISSN	2640-3498

Bibliographical note

Funding Information:
Talebi are partially supported by the Independent Research Fund Denmark, grant number 1026-00397B. Anders Jonsson is partially supported by the Spanish grant PID2019-108141GB-I00 and the European project TAILOR (H2020, GA 952215). Odalric-Ambrym Maillard is supported by the French Ministry of Higher Education and Research, Inria, Scool, the Hauts-de-France region, the MEL and the I-Site ULNE regarding project R-PILOTE-19-004-APPRENF.

Publisher Copyright:
Copyright © 2023 by the author(s)

Datalogisk Institut