Tsallis-INF for decoupled exploration and exploitation in multi-armed bandits
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Tsallis-INF for decoupled exploration and exploitation in multi-armed bandits. / Rouyer, Chloé; Seldin, Yevgeny.
Proceedings of Thirty Third Conference on Learning Theory (COLT). PMLR, 2020. p. 3227-3249 (Proceedings of Machine Learning Research, Vol. 125).
RIS
TY - GEN
T1 - Tsallis-INF for decoupled exploration and exploitation in multi-armed bandits
AU - Rouyer, Chloé
AU - Seldin, Yevgeny
PY - 2020
Y1 - 2020
N2 - We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best-of-both-worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal O(√(KT)) regret bound, slightly improving on the result of Avner et al. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).
AB - We consider a variation of the multi-armed bandit problem, introduced by Avner et al. (2012), in which the forecaster is allowed to choose one arm to explore and one arm to exploit at every round. The loss of the exploited arm is blindly suffered by the forecaster, while the loss of the explored arm is observed without being suffered. The goal of the learner is to minimize the regret. We derive a new algorithm using regularization by Tsallis entropy to achieve best-of-both-worlds guarantees. In the adversarial setting we show that the algorithm achieves the minimax optimal O(√(KT)) regret bound, slightly improving on the result of Avner et al. In the stochastic regime the algorithm achieves a time-independent regret bound, significantly improving on the result of Avner et al. The algorithm also achieves the same time-independent regret bound in the more general stochastically constrained adversarial regime introduced by Wei and Luo (2018).
M3 - Article in proceedings
T3 - Proceedings of Machine Learning Research
SP - 3227
EP - 3249
BT - Proceedings of Thirty Third Conference on Learning Theory (COLT)
PB - PMLR
ER -
ID: 272647540
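The abstract describes an algorithm based on follow-the-regularized-leader with a Tsallis entropy regularizer. As a rough, self-contained illustration of that family of methods (this is the standard 1/2-Tsallis-INF sampling step, not the paper's decoupled algorithm, whose exploitation rule and loss estimators differ; the step size `eta` and the binary-search tolerance are illustrative choices), the sampling distribution admits a closed form whose normalizer can be found numerically:

```python
import math

def tsallis_inf_probs(losses, eta, iters=100):
    """Sampling distribution of 1/2-Tsallis-INF over K arms.

    FTRL with the 1/2-Tsallis entropy regularizer yields the closed form
        p_i = 1 / (4 * eta^2 * (L_i - x)^2),
    where L_i is the cumulative (estimated) loss of arm i and the
    normalizer x < min_i L_i is chosen so the probabilities sum to one.
    The total mass is monotone increasing in x, so binary search applies.
    """
    K = len(losses)
    m = min(losses)
    # At x = m - sqrt(K)/(2*eta) each term is at most 1/K, so the mass is <= 1;
    # as x approaches m from below the mass blows up. Search between the two.
    lo = m - math.sqrt(K) / (2 * eta)
    hi = m - 1e-12
    for _ in range(iters):
        x = (lo + hi) / 2
        total = sum(1.0 / (4 * eta**2 * (L - x) ** 2) for L in losses)
        if total < 1.0:
            lo = x
        else:
            hi = x
    x = (lo + hi) / 2
    p = [1.0 / (4 * eta**2 * (L - x) ** 2) for L in losses]
    s = sum(p)  # renormalize away the residual binary-search error
    return [pi / s for pi in p]
```

Arms with smaller cumulative loss receive larger probability, but the square in the denominator decays more slowly than the exponential weighting of EXP3, which is what enables the best-of-both-worlds behavior mentioned in the abstract.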