Improved Exploration in Factored Average-Reward MDPs

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Improved Exploration in Factored Average-Reward MDPs. / Talebi, Sadegh; Jonsson, Anders; Maillard, Odalric-Ambrym.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 2021. p. 3988-3996 (Proceedings of Machine Learning Research, Vol. 130).

Harvard

Talebi, S, Jonsson, A & Maillard, O-A 2021, Improved Exploration in Factored Average-Reward MDPs. in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, Proceedings of Machine Learning Research, vol. 130, pp. 3988-3996, 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021), San Diego, United States, 13/04/2021. <https://proceedings.mlr.press/v130/>

APA

Talebi, S., Jonsson, A., & Maillard, O-A. (2021). Improved Exploration in Factored Average-Reward MDPs. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) (pp. 3988-3996). PMLR. (Proceedings of Machine Learning Research; Vol. 130). https://proceedings.mlr.press/v130/

Vancouver

Talebi S, Jonsson A, Maillard O-A. Improved Exploration in Factored Average-Reward MDPs. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR. 2021. p. 3988-3996. (Proceedings of Machine Learning Research, Vol. 130).

Author

Talebi, Sadegh; Jonsson, Anders; Maillard, Odalric-Ambrym. / Improved Exploration in Factored Average-Reward MDPs. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR, 2021. pp. 3988-3996 (Proceedings of Machine Learning Research, Vol. 130).

Bibtex

@inproceedings{f490979210094d93820d4aaa069176f6,
title = "Improved Exploration in Factored Average-Reward MDPs",
abstract = "We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal{X}$ and the state space $\mathcal{S}$ admit the respective factored forms $\mathcal{X} = \otimes_{i=1}^{n} \mathcal{X}_i$ and $\mathcal{S} = \otimes_{i=1}^{m} \mathcal{S}_i$, and the transition and reward functions are factored over $\mathcal{X}$ and $\mathcal{S}$. Assuming a known factorization structure, we introduce DBN-UCRL, a novel regret minimization strategy inspired by the popular UCRL strategy, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of its dependence on the sizes of the $\mathcal{S}_i$'s and on the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially smaller regret than existing algorithms with frequentist regret guarantees.",
keywords = "NOISE",
author = "Sadegh Talebi and Anders Jonsson and Odalric-Ambrym Maillard",
year = "2021",
language = "English",
series = "Proceedings of Machine Learning Research",
pages = "3988--3996",
booktitle = "Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS)",
publisher = "PMLR",
note = "24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021) ; Conference date: 13-04-2021 Through 15-04-2021",

}
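
For orientation, the Bernstein-type confidence sets mentioned in the abstract are maintained per transition factor: for each factor i and each configuration x of that factor's parent scope, the set constrains every element of the estimated distribution over S_i. A minimal empirical-Bernstein sketch of such a set (the constants and log terms below are illustrative assumptions, not taken from this record):

\[
\mathcal{C}_i(x) = \Big\{ q \in \Delta(\mathcal{S}_i) \;:\; \forall s_i' \in \mathcal{S}_i,\;
\big| q(s_i') - \hat{p}_i(s_i' \mid x) \big| \le
\sqrt{\frac{2\,\hat{p}_i(s_i' \mid x)\,\big(1 - \hat{p}_i(s_i' \mid x)\big)\,\beta}{N_i(x)}}
+ \frac{3\,\beta}{N_i(x)} \Big\}
\]

where N_i(x) counts visits to the parent configuration x, \hat{p}_i is the empirical transition factor, and \beta is of order log(1/\delta) for confidence level \delta. Element-wise sets of this shape scale with the factor sizes |S_i| rather than with the full product state space, which is consistent with the improved dependence on the sizes of the S_i's claimed in the abstract.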

RIS

TY - GEN

T1 - Improved Exploration in Factored Average-Reward MDPs

AU - Talebi, Sadegh

AU - Jonsson, Anders

AU - Maillard, Odalric-Ambrym

PY - 2021

Y1 - 2021

N2 - We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space X and the state space S admit the respective factored forms X = ⊗_{i=1}^n X_i and S = ⊗_{i=1}^m S_i, and the transition and reward functions are factored over X and S. Assuming a known factorization structure, we introduce DBN-UCRL, a novel regret minimization strategy inspired by the popular UCRL strategy, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of its dependence on the sizes of the S_i's and on the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially smaller regret than existing algorithms with frequentist regret guarantees.

AB - We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space X and the state space S admit the respective factored forms X = ⊗_{i=1}^n X_i and S = ⊗_{i=1}^m S_i, and the transition and reward functions are factored over X and S. Assuming a known factorization structure, we introduce DBN-UCRL, a novel regret minimization strategy inspired by the popular UCRL strategy, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of its dependence on the sizes of the S_i's and on the diameter. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially smaller regret than existing algorithms with frequentist regret guarantees.

KW - NOISE

M3 - Article in proceedings

T3 - Proceedings of Machine Learning Research

SP - 3988

EP - 3996

BT - Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS)

PB - PMLR

T2 - 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021)

Y2 - 13 April 2021 through 15 April 2021

ER -
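
As a concrete illustration of the element-wise confidence intervals sketched above, the following Python snippet computes an empirical-Bernstein radius for a single transition-probability element. The function name and constants are hypothetical and for illustration only; they are not taken from the paper's implementation.

import math

def bernstein_radius(p_hat: float, n: int, delta: float = 0.05) -> float:
    # Empirical-Bernstein confidence radius for one element of a transition
    # factor. Constants are illustrative assumptions, not the paper's.
    if n == 0:
        return 1.0  # no observations: trivial interval covering [0, 1]
    beta = math.log(1.0 / delta)  # confidence term, of order log(1/delta)
    variance_term = math.sqrt(2.0 * p_hat * (1.0 - p_hat) * beta / n)
    correction = 3.0 * beta / n   # second-order correction term
    return variance_term + correction

# Example: one value of factor S_i observed 50 times across 200 visits
# to its parent configuration.
print(bernstein_radius(p_hat=50 / 200, n=200))

Because the radius shrinks with the variance of the individual element, low-probability transitions get tighter intervals than a Hoeffding-style bound would give, which is the intuition behind the sharper regret dependence described in the abstract.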
