Pioneer Centre Science Talk by Marco De Nadai

Title

Efficient Training of Vision Transformers

Speaker

Marco De Nadai

Abstract

Vision Transformers (VTs) are emerging as an architectural alternative to convolutional neural networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements and potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs: VTs must learn the local properties of the visual domain from the data, whereas CNNs have them embedded in their architectural design. In this talk, I will show how hybrid architectures work and how we can regularize VTs with auxiliary self-supervised tasks to tame these novel architectures in small and medium data regimes.
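To make the contrast with convolutions concrete, here is a minimal, illustrative sketch (not the speaker's implementation) of the self-attention operation at the heart of a Vision Transformer: every patch token attends to every other patch token, so each output mixes information from the whole image in a single layer, whereas a convolution only sees a local neighbourhood. All names and shapes below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # tokens: (num_patches, dim). Every patch attends to every other
    # patch, giving the global receptive field a single conv layer lacks.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (num_patches, num_patches)
    return softmax(scores) @ v               # (num_patches, dim)

rng = np.random.default_rng(0)
num_patches, dim = 16, 8  # e.g. a 4x4 grid of patch embeddings
tokens = rng.normal(size=(num_patches, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # each patch token is updated using all 16 patches
```

Because the attention matrix is learned rather than fixed by a kernel, nothing in this operation encodes locality or translation equivariance, which is exactly why VTs need more data (or extra regularization) to recover those priors.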

Bio

Marco De Nadai is an Applied Scientist at Zalando, Germany, and a fellow of the Bruno Kessler Foundation, Italy. His expertise lies in computer vision, particularly generative models for images, and in human behavioural understanding through multi-modal data analysis. Lately, he has focused on outfit generation and Vision Transformers. Before Zalando, Marco was a Research Scientist at the Bruno Kessler Foundation. He holds a PhD in Computer Science from the University of Trento, Italy, and has collaborated with numerous international institutions, including Samsung Electronics, MIT, the MIT Media Lab, and Nokia Bell Labs.