Scaling Up Q-Learning via Exploiting State–Action Equivalence

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Scaling Up Q-Learning via Exploiting State–Action Equivalence. / Lyu, Yunlian; Côme, Aymeric; Zhang, Yijie; Talebi, Mohammad Sadegh.

In: Entropy, Vol. 25, No. 4, 584, 2023.


Harvard

Lyu, Y, Côme, A, Zhang, Y & Talebi, MS 2023, 'Scaling Up Q-Learning via Exploiting State–Action Equivalence', Entropy, vol. 25, no. 4, 584. https://doi.org/10.3390/e25040584

APA

Lyu, Y., Côme, A., Zhang, Y., & Talebi, M. S. (2023). Scaling Up Q-Learning via Exploiting State–Action Equivalence. Entropy, 25(4), [584]. https://doi.org/10.3390/e25040584

Vancouver

Lyu Y, Côme A, Zhang Y, Talebi MS. Scaling Up Q-Learning via Exploiting State–Action Equivalence. Entropy. 2023;25(4):584. https://doi.org/10.3390/e25040584

Author

Lyu, Yunlian ; Côme, Aymeric ; Zhang, Yijie ; Talebi, Mohammad Sadegh. / Scaling Up Q-Learning via Exploiting State–Action Equivalence. In: Entropy. 2023 ; Vol. 25, No. 4.

Bibtex

@article{e8ccfddad0bc474e9f1abee2011b20e6,
title = "Scaling Up Q-Learning via Exploiting State–Action Equivalence",
abstract = "Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key in devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning, where some equivalence relation in the environment exists. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically, which shows that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning empirically. They imply that the empirical performance gain obtained by exploiting the equivalence structure could be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.",
keywords = "equivalence structure, Markov decision process, Q-learning, reinforcement learning",
author = "Yunlian Lyu and Aymeric C{\^o}me and Yijie Zhang and Talebi, {Mohammad Sadegh}",
note = "Publisher Copyright: {\textcopyright} 2023 by the authors.",
year = "2023",
doi = "10.3390/e25040584",
language = "English",
volume = "25",
journal = "Entropy",
issn = "1099-4300",
publisher = "MDPI AG",
number = "4",

}

RIS

TY - JOUR

T1 - Scaling Up Q-Learning via Exploiting State–Action Equivalence

AU - Lyu, Yunlian

AU - Côme, Aymeric

AU - Zhang, Yijie

AU - Talebi, Mohammad Sadegh

N1 - Publisher Copyright: © 2023 by the authors.

PY - 2023

Y1 - 2023

N2 - Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key in devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning, where some equivalence relation in the environment exists. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically, which shows that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning empirically. They imply that the empirical performance gain obtained by exploiting the equivalence structure could be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.

AB - Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key in devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning, where some equivalence relation in the environment exists. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically, which shows that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning empirically. They imply that the empirical performance gain obtained by exploiting the equivalence structure could be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.

KW - equivalence structure

KW - Markov decision process

KW - Q-learning

KW - reinforcement learning

UR - http://www.scopus.com/inward/record.url?scp=85156237886&partnerID=8YFLogxK

U2 - 10.3390/e25040584

DO - 10.3390/e25040584

M3 - Journal article

C2 - 37190372

AN - SCOPUS:85156237886

VL - 25

JO - Entropy

JF - Entropy

SN - 1099-4300

IS - 4

M1 - 584

ER -

ID: 347308519
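
Illustrative sketch

The abstract describes QL-ES only at a high level: when the MDP contains equivalent state–action pairs, information gathered about one pair can be reused for the others, which speeds up Q-learning. The Python sketch below is a simplified, hypothetical illustration of that general idea, not the paper's QL-ES algorithm: it merely lets mirror-symmetric state–action pairs of a toy corridor MDP share a single Q-value estimate. The environment, the class mapping cls, and all hyperparameters are assumptions made for this example only.

import numpy as np

# Toy corridor MDP: states 0..N-1, goal in the middle, actions 0 = left, 1 = right.
# The corridor is mirror-symmetric, so (s, a) and (N-1-s, 1-a) have identical
# optimal Q-values -- a simple instance of state-action equivalence.
# (Hypothetical example; not the environment or algorithm from the article.)
N = 9
GOAL = N // 2
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.2      # hypothetical hyperparameters
rng = np.random.default_rng(0)

def step(s, a):
    """One deterministic transition; reward 1 only upon reaching the goal."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, float(s2 == GOAL), s2 == GOAL

def cls(s, a):
    """Canonical representative of the equivalence class of (s, a)."""
    return min((s, a), (N - 1 - s, 1 - a))

Q = {}                                   # one shared Q estimate per class
def q(s, a):
    return Q.get(cls(s, a), 0.0)

for episode in range(500):
    s = int(rng.integers(N))
    if s == GOAL:
        continue
    for t in range(200):                 # cap episode length
        # epsilon-greedy over the shared class values
        if rng.random() < EPS:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax([q(s, 0), q(s, 1)]))
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(q(s2, 0), q(s2, 1))
        # one update improves the estimate for both equivalent pairs at once
        Q[cls(s, a)] = q(s, a) + ALPHA * (target - q(s, a))
        s = s2
        if done:
            break

In this symmetric domain every sampled transition effectively informs two state–action pairs at once; the article's QL-ES and its PAC-type sample complexity bound formalize and generalize this kind of sample-efficiency gain for finite MDPs with an equivalence structure.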