Exploration in Reward Machines with Low Regret

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Standard

Exploration in Reward Machines with Low Regret. / Bourel, Hippolyte; Jonsson, Anders; Maillard, Odalric Ambrym; Talebi, Mohammad Sadegh.

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Vol. 206. PMLR, 2023. pp. 4114-4146 (Proceedings of Machine Learning Research, Vol. 206).

Harvard

Bourel, H, Jonsson, A, Maillard, OA & Talebi, MS 2023, Exploration in Reward Machines with Low Regret. in Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. vol. 206, PMLR, Proceedings of Machine Learning Research, vol. 206, pp. 4114-4146, 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023, Valencia, Spain, 25/04/2023. <https://proceedings.mlr.press/v206/bourel23a.html>

APA

Bourel, H., Jonsson, A., Maillard, O. A., & Talebi, M. S. (2023). Exploration in Reward Machines with Low Regret. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (Vol. 206, pp. 4114-4146). PMLR. (Proceedings of Machine Learning Research, Vol. 206). https://proceedings.mlr.press/v206/bourel23a.html

Vancouver

Bourel H, Jonsson A, Maillard OA, Talebi MS. Exploration in Reward Machines with Low Regret. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Vol. 206. PMLR. 2023. pp. 4114-4146. (Proceedings of Machine Learning Research, Vol. 206).

Author

Bourel, Hippolyte ; Jonsson, Anders ; Maillard, Odalric Ambrym ; Talebi, Mohammad Sadegh. / Exploration in Reward Machines with Low Regret. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Vol. 206. PMLR, 2023. pp. 4114-4146 (Proceedings of Machine Learning Research, Vol. 206).

Bibtex

@inproceedings{37094884d0cc4db28cb5b7be1dc7bc84,
title = "Exploration in Reward Machines with Low Regret",
abstract = "We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.",
author = "Hippolyte Bourel and Anders Jonsson and Maillard, {Odalric Ambrym} and Talebi, {Mohammad Sadegh}",
note = "Publisher Copyright: Copyright {\textcopyright} 2023 by the author(s); 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023 ; Conference date: 25-04-2023 Through 27-04-2023",
year = "2023",
language = "English",
volume = "206",
series = "Proceedings of Machine Learning Research",
pages = "4114--4146",
booktitle = "Proceedings of The 26th International Conference on Artificial Intelligence and Statistics",
publisher = "PMLR",

}

RIS

TY - GEN
T1 - Exploration in Reward Machines with Low Regret
AU - Bourel, Hippolyte
AU - Jonsson, Anders
AU - Maillard, Odalric Ambrym
AU - Talebi, Mohammad Sadegh
N1 - Publisher Copyright: Copyright © 2023 by the author(s)
PY - 2023
Y1 - 2023
N2 - We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploit the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, in either the episodic or average-reward setting. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
AB - We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploit the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, in either the episodic or average-reward setting. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
UR - http://www.scopus.com/inward/record.url?scp=85165183137&partnerID=8YFLogxK
M3 - Article in proceedings
AN - SCOPUS:85165183137
VL - 206
T3 - Proceedings of Machine Learning Research
SP - 4114
EP - 4146
BT - Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
PB - PMLR
T2 - 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023
Y2 - 25 April 2023 through 27 April 2023
ER -

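The abstract assumes familiarity with reward machines: finite automata over high-level events that emit rewards, making the reward non-Markovian in the environment state alone. As a rough illustration only (a hypothetical sketch, not code from the paper; the two-landmark task, state numbering, and event labels below are invented for the example), a minimal reward machine in Python:

# Illustrative reward machine: a finite-state machine driven by
# high-level events, emitting a reward on each transition.
class RewardMachine:
    def __init__(self, initial_state, transitions, rewards):
        # transitions: (rm_state, event) -> next rm_state
        # rewards:     (rm_state, event) -> scalar reward
        self.state = initial_state
        self.transitions = transitions
        self.rewards = rewards

    def step(self, event):
        # Advance on one observed event; return the emitted reward.
        key = (self.state, event)
        reward = self.rewards.get(key, 0.0)
        self.state = self.transitions.get(key, self.state)
        return reward

# Hypothetical task: reward 1 for visiting landmark A and then landmark B.
rm = RewardMachine(
    initial_state=0,
    transitions={(0, "A"): 1, (1, "B"): 0},
    rewards={(1, "B"): 1.0},
)
print(rm.step("A"))  # 0.0 -- machine advances to state 1
print(rm.step("B"))  # 1.0 -- task completed, machine resets to state 0

The regret the abstract refers to is, in the standard average-reward formulation, Regret(T) = T g* - sum_{t=1}^{T} r_t, where g* is the optimal long-run average reward; per the abstract, the paper's bounds improve on structure-agnostic baselines by a multiplicative factor proportional to the number of reward-machine states.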