DeLTA seminar by Saeed Masoudian: Best-of-Both-Worlds Learning in Bandits with Delayed Feedback
Speaker
Saeed Masoudian, postdoc in the ML section at DIKU
Title
Best-of-Both-Worlds Learning in Bandits with Delayed Feedback
Abstract
This research addresses two pivotal challenges in multi-armed bandits: achieving best-of-both-worlds guarantees and effectively handling delayed feedback. In practical applications such as recommender systems and clinical trials, environments may exhibit a blend of stochastic and adversarial characteristics. At the same time, feedback in such applications is often delayed.
We present a comprehensive analysis of the Tsallis-INF algorithm introduced by Zimmert and Seldin (2019) that deepens the understanding of the self-bounding technique underlying it and yields improved regret bounds for intermediate regimes between fully adversarial and fully stochastic, such as the stochastic regime with adversarial corruptions and the stochastically constrained adversarial regime.
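For intuition, here is a minimal sketch of the Tsallis-INF loop: FTRL over the probability simplex with the 1/2-Tsallis entropy regularizer and anytime learning rate \eta_t = 2/\sqrt{t}, where the weights admit a closed form up to a normalizer found by Newton's method. This is an illustrative sketch under those assumptions, not the exact implementation analyzed in the talk; the environment callback observe_loss is a hypothetical stand-in.

```python
import numpy as np

def tsallis_inf(K, T, observe_loss, rng=None):
    """Sketch of Tsallis-INF (Zimmert and Seldin, 2019): FTRL with the
    1/2-Tsallis entropy regularizer and importance-weighted loss
    estimates. `observe_loss(t, arm)` is a hypothetical callback
    returning the loss in [0, 1] of the arm played in round t."""
    rng = np.random.default_rng() if rng is None else rng
    L_hat = np.zeros(K)                       # cumulative loss estimates
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t)                # anytime learning rate
        # Closed-form FTRL weights: w_i = 4 / (eta * (L_hat_i - x))^2,
        # with the normalizer x < min_i L_hat_i found by Newton's
        # method so that the weights sum to one.
        x = L_hat.min() - 2.0 / eta           # start with sum(w) >= 1
        for _ in range(50):
            w = 4.0 / (eta * (L_hat - x)) ** 2
            if abs(w.sum() - 1.0) < 1e-10:
                break
            x -= (w.sum() - 1.0) / (eta * (w ** 1.5).sum())
        w = w / w.sum()                       # guard against residual error
        arm = rng.choice(K, p=w)
        loss = observe_loss(t, arm)
        L_hat[arm] += loss / w[arm]           # importance-weighted estimate
    return L_hat
```

The sketch only captures the algorithmic loop; the self-bounding technique discussed in the talk enters in the regret analysis of these updates, not in the code itself.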
Addressing the challenge of delayed feedback in bandits, we establish a best-of-both-worlds regret guarantee. Existing research within the Follow the Regularized Leader (FTRL) framework addresses the delayed setting only in the adversarial regime. We propose a minor adaptation of the algorithm of Zimmert and Seldin (2020) that relies on knowledge of the maximal delay d_{\max} and ensures control over the drift of the distribution over arms played by the algorithm, thereby achieving a best-of-both-worlds guarantee.
Furthermore, we complement our best-of-both-worlds algorithm for delayed bandits with the skipping technique (Zimmert and Seldin, 2020) and implicit exploration (Neu, 2015), eliminating the need for prior knowledge of d_{\max}. These techniques facilitate efficient control of the distribution drift, further strengthening our best-of-both-worlds guarantees.
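To illustrate the second of these ingredients, below is a hedged sketch of the implicit exploration (IX) loss estimator of Neu (2015): biasing the importance weight by a parameter \gamma caps every estimate at 1/\gamma, which is what makes drift control possible without knowing d_{\max} in advance. The parameter gamma and the helper itself are illustrative, not the tuning used in the talk.

```python
import numpy as np

def ix_loss_estimate(loss, arm, p, gamma):
    """Implicit exploration (IX) estimator of Neu (2015).

    Instead of the unbiased estimate loss / p[arm], IX divides by
    p[arm] + gamma, introducing a small bias but bounding every
    estimate by 1/gamma.  `gamma` is an illustrative parameter."""
    l_hat = np.zeros_like(p)
    l_hat[arm] = loss / (p[arm] + gamma)      # biased, but bounded
    return l_hat
```

The skipping technique complements this on the delay side: loss estimates from rounds whose observed delay exceeds an adaptive threshold are discarded, so a few extremely delayed observations cannot destabilize the distribution over arms.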
Lastly, we explore leveraging intermediate observations to mitigate the impact of delays on the learning process. These observations, drawn from a finite set of S states, provide the learner with real-time information: in each round, the corresponding state is revealed immediately after the learner's action, while the actual loss arrives only after an adversarially set delay. We find that the complexity of the problem hinges on the nature of the mapping from states to losses rather than on the mapping from actions to states. For adversarial state-loss mappings, intermediate observations yield no advantage. However, for stochastic state-loss mappings, we improve the worst-case regret, replacing \sqrt{(K+d)T} with \sqrt{(K+\min\{S,d\})T}, where d is the fixed delay, T is the time horizon, and K is the number of arms. The improvement extends to arbitrary delay settings and comes with robust high-probability guarantees.
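As a purely illustrative example of the gain (the numbers below are chosen for this announcement, not taken from the talk): with K = 10 arms, fixed delay d = 1000, and S = 5 states, the delay-dependent factor in the worst-case bound shrinks by roughly a factor of eight:

\[
\sqrt{(K+d)\,T} = \sqrt{1010\,T} \approx 31.8\,\sqrt{T}
\qquad\text{versus}\qquad
\sqrt{(K+\min\{S,d\})\,T} = \sqrt{15\,T} \approx 3.9\,\sqrt{T}.
\]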
_____________________________
You can subscribe to the DeLTA Seminar mailing list by sending an empty email to delta-seminar-join@list.ku.dk.
Online calendar
DeLTA Lab page
DeLTA is a research group affiliated with the Department of Computer Science at the University of Copenhagen studying diverse aspects of Machine Learning Theory and its applications, including, but not limited to, Reinforcement Learning, Online Learning and Bandits, and PAC-Bayesian analysis.