Scaling Up Q-Learning via Exploiting State–Action Equivalence
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Scaling Up Q-Learning via Exploiting State–Action Equivalence. / Lyu, Yunlian; Côme, Aymeric; Zhang, Yijie; Talebi, Mohammad Sadegh.
In: Entropy, Vol. 25, No. 4, 584, 2023.
Bibtex
@article{lyu2023scaling,
  title   = {Scaling Up Q-Learning via Exploiting State–Action Equivalence},
  author  = {Lyu, Yunlian and Côme, Aymeric and Zhang, Yijie and Talebi, Mohammad Sadegh},
  journal = {Entropy},
  volume  = {25},
  number  = {4},
  pages   = {584},
  year    = {2023},
  doi     = {10.3390/e25040584},
  issn    = {1099-4300},
}
RIS
TY - JOUR
T1 - Scaling Up Q-Learning via Exploiting State–Action Equivalence
AU - Lyu, Yunlian
AU - Côme, Aymeric
AU - Zhang, Yijie
AU - Talebi, Mohammad Sadegh
N1 - Publisher Copyright: © 2023 by the authors.
PY - 2023
Y1 - 2023
N2 - Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key to devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning in environments that admit an equivalence relation among state–action pairs. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We prove a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify analytically the advantage of QL-ES over Q-learning, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning; these results indicate that the empirical performance gain from exploiting the equivalence structure can be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.
AB - Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key to devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning in environments that admit an equivalence relation among state–action pairs. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We prove a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify analytically the advantage of QL-ES over Q-learning, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning; these results indicate that the empirical performance gain from exploiting the equivalence structure can be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.
KW - equivalence structure
KW - Markov decision process
KW - Q-learning
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85156237886&partnerID=8YFLogxK
U2 - 10.3390/e25040584
DO - 10.3390/e25040584
M3 - Journal article
C2 - 37190372
AN - SCOPUS:85156237886
VL - 25
JO - Entropy
JF - Entropy
SN - 1099-4300
IS - 4
M1 - 584
ER -
ID: 347308519
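
The abstract describes the core mechanism of QL-ES only in words: a Q-learning update observed for one state–action pair is shared across all pairs declared equivalent to it. Below is a minimal Python sketch of that idea; it is not the authors' QL-ES algorithm, and the environment interface (reset, step, actions), the eq_class map from a pair to its equivalence class, and the exact propagation rule are all illustrative assumptions.

import random
from collections import defaultdict

def ql_es_sketch(env, eq_class, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    # Q-values default to 0.0 for unseen (state, action) pairs.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the environment's action set.
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in env.actions)
            # Plain Q-learning would update only Q[(s, a)]. Here the same
            # temporal-difference target also updates every pair declared
            # equivalent to (s, a); eq_class[(s, a)] is assumed to contain
            # (s, a) itself.
            for pair in eq_class[(s, a)]:
                Q[pair] += alpha * (target - Q[pair])
            s = s_next
    return Q

Under these assumptions, the speed-up the abstract reports corresponds to each observed transition updating an entire equivalence class rather than a single table entry, so fewer samples are needed before all entries have received updates.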