Probing Cross-Modal Representations in Multi-Step Relational Reasoning

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Documents

  • Fulltext: Final published version, 674 KB, PDF document

We investigate the representations learned by vision and language models in tasks that require relational reasoning. Focusing on the problem of assessing the relative size of objects in abstract visual contexts, we analyse both one-step and two-step reasoning. For the latter, we construct a new dataset of three-image scenes and define a task that requires reasoning at the level of the individual images and across images in a scene. We probe the learned model representations using diagnostic classifiers. Our experiments show that pretrained multimodal transformer-based architectures can perform higher-level relational reasoning, and are able to learn representations for novel tasks and data that are very different from what was seen in pretraining.
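The probing method mentioned in the abstract (diagnostic classifiers trained on frozen model representations) can be illustrated with a minimal sketch. The example below is only a schematic, assuming scikit-learn's LogisticRegression as the diagnostic classifier; the feature matrices, label semantics, and 768-dimensional size are hypothetical placeholders, not the paper's actual representations or data.

```python
# Minimal sketch of a diagnostic-classifier probe.
# The feature matrices below are random placeholders; in practice they would be
# hidden representations extracted from a pretrained multimodal transformer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder "representations": one 768-d vector per example (hypothetical size).
X_train = rng.normal(size=(1000, 768))
y_train = rng.integers(0, 2, size=1000)  # e.g. binary relational label such as "A larger than B"
X_test = rng.normal(size=(200, 768))
y_test = rng.integers(0, 2, size=200)

# The diagnostic classifier: a simple linear model trained on frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

If the probe's accuracy on held-out examples is clearly above chance, the frozen representations are taken to encode the relational property being tested.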
Original language: English
Title of host publication: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Publisher: Association for Computational Linguistics
Publication date: 2021
Pages: 152-162
DOIs
Publication status: Published - 2021
Event: 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), Online
Duration: 1 Aug 2021 to 1 Aug 2021

Conference

Conference: 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Location: Online
City: Online
Period: 01/08/2021 to 01/08/2021
