DeLTA seminar by Saeed Masoudian: Best-of-Both-Worlds Learning in Bandits with Delayed Feedback

Speaker

Saeed Masoudian, postdoc in the ML section at DIKU

Title

Best-of-Both-Worlds Learning in Bandits with Delayed Feedback

Abstract

This research addresses two pivotal challenges in multi-armed bandits: achieving best-of-both-worlds guarantees and effectively handling delayed feedback. In practical applications such as recommender systems and clinical trials, environments may exhibit a blend of stochastic and adversarial characteristics, and feedback frequently arrives with delay.

We present a comprehensive analysis of the Tsallis-INF algorithm introduced by Zimmert and Seldin (2019) that deepens the understanding of their self-bounding technique and yields improved regret bounds for intermediate regimes between the fully adversarial and fully stochastic settings, such as the stochastic regime with adversarial corruptions and the stochastically constrained adversarial regime.
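
As a rough illustration (not part of the abstract): the 1/2-Tsallis-INF arm distribution has the closed form w_i = 4(eta_t(L_i - x))^{-2}, where the normalizer x is found by Newton's method. The Python sketch below implements that Newton iteration following Zimmert and Seldin (2019); the learning-rate value and the cumulative loss estimates in the usage line are illustrative placeholders.

    import numpy as np

    def tsallis_inf_weights(cum_loss_est, eta, n_iters=50):
        # 1/2-Tsallis-INF distribution: w_i = 4 / (eta * (L_i - x))^2, with the
        # normalizer x < min_i L_i found by Newton's method so that sum_i w_i = 1.
        L = np.asarray(cum_loss_est, dtype=float)
        x = L.min() - 2.0 / eta          # here the weights sum to at least 1,
        for _ in range(n_iters):         # so Newton converges monotonically
            w = 4.0 / (eta * (L - x)) ** 2
            x -= (w.sum() - 1.0) / (eta * (w ** 1.5).sum())
        w = 4.0 / (eta * (L - x)) ** 2
        return w / w.sum()               # renormalize away residual Newton error

    # Illustrative usage with K = 3 arms and an assumed schedule eta_t = 2 / sqrt(t):
    print(tsallis_inf_weights([4.2, 5.0, 7.5], eta=2.0 / np.sqrt(10)))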

Addressing the challenge of delayed feedback in bandits, we establish a best-of-both-worlds regret guarantee. Existing work within the Follow the Regularized Leader (FTRL) framework handles delays only in the adversarial regime. We propose a minor adaptation of the algorithm of Zimmert and Seldin (2020) that relies on knowledge of the maximal delay d_{\max} and controls the drift of the distribution over arms played by the algorithm, thereby achieving a best-of-both-worlds guarantee.
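
A schematic sketch of the delayed-feedback loop may help fix ideas: the loss of the arm played at round s becomes observable only at round s + delay, and loss estimates are importance-weighted by the probability the arm had when it was played. The environment, the delays, and the bound d_max = 20 below are placeholders, and the sketch reuses tsallis_inf_weights from above.

    from collections import defaultdict
    import numpy as np

    rng = np.random.default_rng(0)
    K, T = 3, 1000
    cum_loss_est = np.zeros(K)
    pending = defaultdict(list)            # arrival round -> [(arm, prob, loss)]

    for t in range(T):
        w = tsallis_inf_weights(cum_loss_est, eta=2.0 / np.sqrt(t + 1))
        arm = rng.choice(K, p=w)
        loss = rng.uniform()               # placeholder loss
        delay = int(rng.integers(0, 20))   # placeholder delay, bounded by d_max = 20
        pending[t + delay].append((arm, w[arm], loss))
        for a, p, l in pending.pop(t, []): # observations arriving this round
            cum_loss_est[a] += l / p       # importance-weighted loss estimate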

Furthermore, we complement our best-of-both-worlds algorithm for delayed bandits with the skipping technique (Zimmert and Seldin, 2020) and implicit exploration (Neu, 2015), eliminating the need for prior knowledge of d_{\max}. These techniques enable efficient control of the distribution drift and further strengthen our best-of-both-worlds guarantees.
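
Both ingredients are small in code. Below is a minimal sketch: the IX estimator of Neu (2015) shrinks the importance weight by gamma to keep loss estimates bounded, and a skipping rule in the spirit of Zimmert and Seldin (2020) discards observations with excessively large delays; the sqrt(t) threshold is an illustrative choice, not the tuning from the paper.

    import numpy as np

    def ix_loss_estimate(loss, prob, gamma):
        # Implicit-exploration (IX) estimate: dividing by prob + gamma instead
        # of prob biases the estimate slightly downward, but bounds it by
        # loss / gamma, which keeps the algorithm's distribution drift in check.
        return loss / (prob + gamma)

    def should_skip(observed_delay, t, scale=1.0):
        # Permanently discard an observation whose delay is too large, so that
        # a few outlier delays cannot dominate the regret.  Illustrative threshold.
        return observed_delay > scale * np.sqrt(t + 1)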

Lastly, we explore leveraging intermediate observations to mitigate the impact of delays on the learning process. These observations take values in a finite set of S states and provide the learner with real-time information: in each round, the state corresponding to the learner's action is revealed immediately, while the actual loss arrives only after an adversarially chosen delay. We find that the complexity of the problem is governed by the nature of the mapping from states to losses rather than by the mapping from actions to states. When the state-loss mapping is adversarial, intermediate observations yield no advantage. However, when the state-loss mapping is stochastic, we improve the worst-case regret from \sqrt{(K+d)T} to \sqrt{(K+\min\{S,d\})T}, where d is the fixed delay, T is the time horizon, and K is the number of arms. The improvement extends to arbitrary delay settings with robust high-probability guarantees.
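
To make the timing of the protocol concrete, here is a schematic round structure under a stochastic state-loss mapping; the action-to-state map, the loss generator, and the fixed delay d = 10 are placeholders.

    from collections import defaultdict
    import numpy as np

    rng = np.random.default_rng(1)
    K, S, d, T = 3, 5, 10, 200                 # arms, states, fixed delay, horizon
    pending = defaultdict(list)

    for t in range(T):
        arm = int(rng.integers(K))             # stand-in for the learner's policy
        state = int((arm + rng.integers(S)) % S)  # placeholder action-to-state map,
                                               # state revealed immediately
        loss = (state + rng.uniform()) / S     # stochastic state-to-loss mapping
        pending[t + d].append((arm, state, loss))
        for a, s, l in pending.pop(t, []):
            pass                               # delayed loss observed; update here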

_____________________________

You can subscribe to the DeLTA Seminar mailing list by sending an empty email to delta-seminar-join@list.ku.dk.
Online calendar
DeLTA Lab page