Best-of-Both-Worlds Learning in Bandits with Delayed Feedback

Research output: Book/Report › Ph.D. thesis › Research

This thesis addresses two pivotal challenges in multi-armed bandits: achieving best-of-both-worlds guarantees and effectively handling delayed feedback. In practical scenarios, such as recommender systems and clinical trials, environments may exhibit a blend of stochastic and adversarial characteristics. At the same time, delayed feedback is prevalent in such applications.
The Tsallis-INF algorithm introduced by Zimmert and Seldin (2019) marked a breakthrough, demonstrating optimal performance in both adversarial and stochastic bandits. Building upon this foundation, our thesis refines the Tsallis-INF regret bound in intermediate scenarios that bridge stochastic and adversarial environments. We provide a comprehensive analysis that deepens the understanding of the self-bounding technique used by Zimmert and Seldin (2019) and yields improved regret bounds for the stochastically constrained adversarial regime and for stochastic bandits with adversarial corruptions.
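For concreteness, the following is a minimal sketch of the Tsallis-INF step with the 1/2-Tsallis regularizer: the FTRL distribution has the closed form w_i = 4 / (eta * (L_hat_i - x))^2, with the normalization constant x found by a one-dimensional search, and losses are fed back through importance-weighted estimates. The learning-rate constant, the bisection routine, and the loss_fn interface are illustrative choices, not the thesis pseudocode.

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta, iters=60):
    """FTRL distribution for the 1/2-Tsallis regularizer:
    w_i = 4 / (eta * (L_hat_i - x))**2 with x < min(L_hat) chosen so that sum(w) = 1."""
    L_hat = np.asarray(L_hat, dtype=float)
    K = len(L_hat)
    top = L_hat.min()                        # x must stay strictly below min(L_hat)
    lo = top - 2.0 * np.sqrt(K) / eta        # at this point sum(w) <= 1
    hi = top                                 # approaching this point sum(w) blows up
    for _ in range(iters):                   # bisection on the monotone normalization equation
        x = 0.5 * (lo + hi)
        if (4.0 / (eta * (L_hat - x)) ** 2).sum() < 1.0:
            lo = x
        else:
            hi = x
    w = 4.0 / (eta * (L_hat - lo)) ** 2
    return w / w.sum()                       # renormalize away the residual bisection error

def tsallis_inf(loss_fn, K, T, rng=np.random.default_rng(0)):
    """Minimal Tsallis-INF loop; loss_fn(t, arm) -> loss in [0, 1] is a stand-in environment."""
    L_hat = np.zeros(K)
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t)               # learning rate of order 1 / sqrt(t)
        w = tsallis_inf_weights(L_hat, eta)
        arm = rng.choice(K, p=w)
        L_hat[arm] += loss_fn(t, arm) / w[arm]   # unbiased importance-weighted loss estimate
    return L_hat
```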
Addressing the challenge of delayed feedback in bandits, we establish a best-of-both-worlds regret guarantee. Existing research within the Follow the Regularized Leader (FTRL) framework addresses the delayed-feedback problem only in the adversarial regime. We propose a minor adaptation of the algorithm of Zimmert and Seldin (2020) that relies on knowledge of the maximal delay dmax and controls the drift of the distribution over arms played by the algorithm, thereby achieving a best-of-both-worlds guarantee.
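The sketch below (reusing tsallis_inf_weights from the sketch above) illustrates how delayed feedback can enter such a scheme: the loss estimate of round t is buffered until it arrives, and the learning rate is slowed down using dmax. The specific schedule and the delay_fn interface are illustrative assumptions, not the exact adaptation analysed in the thesis.

```python
import numpy as np
from collections import defaultdict

def delayed_tsallis_inf(loss_fn, delay_fn, K, T, d_max, rng=np.random.default_rng(0)):
    """Schematic delayed-feedback loop: the feedback of round t only becomes available
    at round t + delay_fn(t) <= t + d_max, and only then enters the FTRL estimates."""
    L_hat = np.zeros(K)
    inbox = defaultdict(list)                 # arrival round -> buffered loss estimates
    for t in range(1, T + 1):
        eta = 2.0 / np.sqrt(t + d_max)        # slowed-down schedule: limits how far the played
                                              # distribution drifts while feedback is in flight
        w = tsallis_inf_weights(L_hat, eta)   # same FTRL step as in the undelayed sketch
        arm = rng.choice(K, p=w)
        inbox[t + delay_fn(t)].append((arm, loss_fn(t, arm) / w[arm]))
        for a, estimate in inbox.pop(t, []):  # feedback generated earlier arrives now
            L_hat[a] += estimate
    return L_hat
```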
Furthermore, we complement our best-of-both-worlds algorithm for delayed bandits with the skipping technique (Zimmert and Seldin, 2020) and implicit exploration (Neu, 2015), eliminating the requirement for prior knowledge of dmax. These techniques facilitate efficient distribution drift control, further enhancing our established best-of-both-worlds guarantees.
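A compressed sketch of these two ingredients, with gamma and the skip threshold as placeholder parameters rather than the tuned quantities from the analysis:

```python
def apply_arrivals(L_hat, arrivals, t, gamma, skip_threshold):
    """Process feedback arriving at round t. `arrivals` holds tuples
    (play_round, arm, loss, prob) for rounds whose feedback becomes available now."""
    for s, arm, loss, prob in arrivals:
        if t - s > skip_threshold:            # skipping: overly delayed rounds are dropped,
            continue                          # so no prior knowledge of dmax is required
        L_hat[arm] += loss / (prob + gamma)   # implicit exploration (Neu, 2015): the extra
                                              # gamma keeps the estimate bounded by 1 / gamma
    return L_hat
```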
Lastly, we explore leveraging intermediate observations to mitigate the impact of delays on the learning process. These observations, taking values in a finite set of S states, provide the learner with real-time information. In each round, the corresponding state is revealed immediately upon the learner’s action, followed by the actual loss after an adversarially set delay. We find that the complexity of the problem hinges on the nature of the state-loss mapping rather than on the action-state relationship. For adversarial state-loss mappings, intermediate observations yield no advantage. However, for stochastic state-loss mappings, we improve the worst-case regret, replacing √((K + d)T) with √((K + min{S, d})T), where d is the fixed delay, T is the time horizon, and K is the number of arms. This improvement extends to settings with arbitrary delays, with robust high-probability guarantees.
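The interaction protocol with intermediate observations can be summarized by the following schematic loop, where policy and env are hypothetical interfaces introduced purely for illustration:

```python
def run_with_intermediate_observations(policy, env, T, d):
    """Each round: play an arm, immediately observe a state from a finite set,
    and receive the corresponding loss only d rounds later."""
    pending = {}
    for t in range(1, T + 1):
        arm = policy.act(t)
        state = env.state(t, arm)                     # intermediate observation, revealed at once
        policy.observe_state(t, arm, state)           # usable immediately when states predict losses
        pending[t + d] = (t, arm, env.loss(t, arm))   # the loss itself is withheld for d rounds
        if t in pending:
            s, a, loss = pending.pop(t)
            policy.observe_loss(s, a, loss)           # delayed full feedback
    return policy
```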
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Number of pages: 166
Publication status: Published - 2023
