PhD defence by Concetto Emanuele Bugliarello
On Multimodal Representations Learned at Scale
As humans, we attribute the meaning of words and objects through a rich mental representation of 'how the world works.' By processing perceptual inputs with our senses and interacting with our environments, we acquire grounded representations of concepts in the world, which we map to words in language to communicate with each other. Recent advances in artificial intelligence have been driven by deep neural networks that build a compressed yet complex view of the world. This process is referred to as representation learning, and it is now commonly associated with a learning phase called pretraining, which aims at acquiring general-purpose understanding from exposure to large amounts of data. While successful, the majority of work in representation learning is centred on data from a single modality (e.g., text, images, videos, speech). In response, this dissertation presents critical and in-depth studies of the emerging framework of learning meaning representations from multiple modalities, specifically vision and language. Multimodal representation learning is a promising direction towards a human-like form of artificial intelligence, enabling machines to interpret and reason about multimodal signals, and to acquire knowledge of the world that aligns with ours.
Throughout this thesis, we aim to develop an in-depth understanding of multimodal networks trained on large datasets harvested from the Internet. We start by assessing which key factors lead to strong, general-purpose models in controlled setups, and by inspecting whether the network representations are indeed cross-modal through a novel data-centric approach. We then delve into the abilities of multimodal representations to make fine-grained mappings between the visual and textual modalities. Our investigation shows the importance of diverse, object-centric data, and leads us to novel relation-aware approaches for enhanced multimodal alignment. Finally, we closely scrutinise typical benchmarking practices used by the community to measure the performance of pretrained multimodal networks. We construct datasets and evaluation suites that reveal the inability of state-of-the-art multimodal models to grasp geographically diverse data and languages, encouraging the community to develop multimodal technologies that exhibit consistent performance across different demographics.
Professor Serge Belongie, Computer Science
Full Professor Mirella Lapata, University of Edinburgh
Full Professor Mohit Bansal, University of North Carolina at Chapel Hill
Leader of defence: Björg Birkholm Magnúsdóttir
Principal Supervisor Desmond Elliott
Co-Supervisor Anders Østerskov Søgaard
For an electronic copy of the thesis, please visit the PhD Programme page.