Tsallis-INF for decoupled exploration and exploitation in multi-armed bandits
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Tsallis-INF for decoupled exploration and exploitation in multi-armed bandits. / Rouyer, Chloé; Seldin, Yevgeny.
Proceedings of Thirty Third Conference on Learning Theory (COLT). PMLR, 2020. p. 3227-3249 (Proceedings of Machine Learning Research, Vol. 125).
RIS
TY - GEN
T1 - Tsallis-INF for decoupled exploration and exploitation in multi-armed bandits
AU - Rouyer, Chloé
AU - Seldin, Yevgeny
PY - 2020
Y1 - 2020
N2 - We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best-of-both-worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal O(√(KT)) regret bound, slightly improving on the result of Avner et al. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).
AB - We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best-of-both-worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal O(√(KT)) regret bound, slightly improving on the result of Avner et al. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).
M3 - Article in proceedings
T3 - Proceedings of Machine Learning Research
SP - 3227
EP - 3249
BT - Proceedings of Thirty Third Conference on Learning Theory (COLT)
PB - PMLR
ER -
ID: 272647540
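The abstract describes an algorithm based on follow-the-regularized-leader with a Tsallis entropy regularizer. As a rough, self-contained illustration of that family of methods (this is the standard 1/2-Tsallis-INF sampling step, not the paper's decoupled algorithm, whose exploitation rule and loss estimators differ; the step size `eta` and the binary-search tolerance are illustrative choices), the sampling distribution admits a closed form whose normalizer can be found numerically:

```python
import math

def tsallis_inf_probs(losses, eta, iters=100):
    """Sampling distribution of 1/2-Tsallis-INF over K arms.

    FTRL with the 1/2-Tsallis entropy regularizer yields the closed form
        p_i = 1 / (4 * eta^2 * (L_i - x)^2),
    where L_i is the cumulative (estimated) loss of arm i and the
    normalizer x < min_i L_i is chosen so the probabilities sum to one.
    The total mass is monotone increasing in x, so binary search applies.
    """
    K = len(losses)
    m = min(losses)
    # At x = m - sqrt(K)/(2*eta) each term is at most 1/K, so the mass is <= 1;
    # as x approaches m from below the mass blows up. Search between the two.
    lo = m - math.sqrt(K) / (2 * eta)
    hi = m - 1e-12
    for _ in range(iters):
        x = (lo + hi) / 2
        total = sum(1.0 / (4 * eta**2 * (L - x) ** 2) for L in losses)
        if total < 1.0:
            lo = x
        else:
            hi = x
    x = (lo + hi) / 2
    p = [1.0 / (4 * eta**2 * (L - x) ** 2) for L in losses]
    s = sum(p)  # renormalize away the residual binary-search error
    return [pi / s for pi in p]
```

Arms with smaller cumulative loss receive larger probability, but the square in the denominator decays more slowly than the exponential weighting of EXP3, which is what enables the best-of-both-worlds behavior mentioned in the abstract.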