Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects

Research output: Contribution to journal › Conference article › Research › peer-review

Standard

Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects. / Scarlini, Bianca; Pasini, Tommaso; Navigli, Roberto.

In: AAAI Conference on Artificial Intelligence, Vol. 36, No. 10, 2022, pp. 11267-11275.

Research output: Contribution to journal › Conference article › Research › peer-review

Harvard

Scarlini, B, Pasini, T & Navigli, R 2022, 'Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects', AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 11267-11275. https://doi.org/10.1609/aaai.v36i10.21377

APA

Scarlini, B., Pasini, T., & Navigli, R. (2022). Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects. AAAI Conference on Artificial Intelligence, 36(10), 11267-11275. https://doi.org/10.1609/aaai.v36i10.21377

Vancouver

Scarlini B, Pasini T, Navigli R. Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects. AAAI Conference on Artificial Intelligence. 2022;36(10):11267-11275. https://doi.org/10.1609/aaai.v36i10.21377

Author

Scarlini, Bianca ; Pasini, Tommaso ; Navigli, Roberto. / Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects. In: AAAI Conference on Artificial Intelligence. 2022 ; Vol. 36, No. 10. pp. 11267-11275.

Bibtex

@inproceedings{a467a22fca074a59b372facb20f24926,
title = "Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects",
abstract = "Architectures that model language and vision together have received much attention in recent years. Nonetheless, most tasks in this field focus on end-to-end applications without providing insights on whether it is the underlying semantics of visual objects or words that is captured. In this paper we draw on the established Definition Modeling paradigm and enhance it by grounding, for the first time, textual definitions to visual representations. We name this new task Visual Definition Modeling and put forward DEMETER and DIONYSUS, two benchmarks where, given an image as context, models have to generate a textual definition for a target being either i) a word that describes the image, or ii) an object patch therein. To measure the difficulty of our tasks we finetuned six different baselines and analyzed their performances, which show that a text-only encoder-decoder model is more effective than models pretrained for handling inputs of both modalities concurrently. This demonstrates the complexity of our benchmarks and encourages more research on text generation conditioned on multimodal inputs. The datasets for both benchmarks are available at https://github.com/SapienzaNLP/visual-definition-modeling as well as the code to reproduce our models.",
author = "Bianca Scarlini and Tommaso Pasini and Roberto Navigli",
year = "2022",
doi = "10.1609/aaai.v36i10.21377",
language = "English",
volume = "36",
pages = "11267--11275",
booktitle = "AAAI Conference on Artificial Intelligence",
issn = "2159-5399",
publisher = "Association for the Advancement of Artificial Intelligence",
number = "10",
note = "36th AAAI Conference on Artificial Intelligence / 34th Conference on Innovative Applications of Artificial Intelligence / 12th Symposium on Educational Advances in Artificial Intelligence ; Conference date: 22-02-2022 Through 01-03-2022",

}
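For convenience, a minimal LaTeX sketch of how the entry above could be cited once it is stored in a bibliography file; the file name references.bib is an assumption, while the citation key is the one from the BibTeX record above.

\documentclass{article}
\begin{document}
% Cite the work by the key exported in the BibTeX record above
Visual Definition Modeling was introduced by Scarlini et al.~\cite{a467a22fca074a59b372facb20f24926}.

% Standard BibTeX setup; references.bib is an assumed file name containing the entry above
\bibliographystyle{plain}
\bibliography{references}
\end{document}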

RIS

TY - GEN

T1 - Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects

T2 - 36th AAAI Conference on Artificial Intelligence / 34th Conference on Innovative Applications of Artificial Intelligence / 12th Symposium on Educational Advances in Artificial Intelligence

AU - Scarlini, Bianca

AU - Pasini, Tommaso

AU - Navigli, Roberto

PY - 2022

Y1 - 2022

N2 - Architectures that model language and vision together have received much attention in recent years. Nonetheless, most tasks in this field focus on end-to-end applications without providing insights on whether it is the underlying semantics of visual objects or words that is captured. In this paper we draw on the established Definition Modeling paradigm and enhance it by grounding, for the first time, textual definitions to visual representations. We name this new task Visual Definition Modeling and put forward DEMETER and DIONYSUS, two benchmarks where, given an image as context, models have to generate a textual definition for a target being either i) a word that describes the image, or ii) an object patch therein. To measure the difficulty of our tasks we finetuned six different baselines and analyzed their performances, which show that a text-only encoder-decoder model is more effective than models pretrained for handling inputs of both modalities concurrently. This demonstrates the complexity of our benchmarks and encourages more research on text generation conditioned on multimodal inputs. The datasets for both benchmarks are available at https://github.com/SapienzaNLP/visual-definition-modeling as well as the code to reproduce our models.

AB - Architectures that model language and vision together have received much attention in recent years. Nonetheless, most tasks in this field focus on end-to-end applications without providing insights on whether it is the underlying semantics of visual objects or words that is captured. In this paper we draw on the established Definition Modeling paradigm and enhance it by grounding, for the first time, textual definitions to visual representations. We name this new task Visual Definition Modeling and put forward DEMETER and DIONYSUS, two benchmarks where, given an image as context, models have to generate a textual definition for a target being either i) a word that describes the image, or ii) an object patch therein. To measure the difficulty of our tasks we finetuned six different baselines and analyzed their performances, which show that a text-only encoder-decoder model is more effective than models pretrained for handling inputs of both modalities concurrently. This demonstrates the complexity of our benchmarks and encourages more research on text generation conditioned on multimodal inputs. The datasets for both benchmarks are available at https://github.com/SapienzaNLP/visual-definition-modeling as well as the code to reproduce our models.

U2 - 10.1609/aaai.v36i10.21377

DO - 10.1609/aaai.v36i10.21377

M3 - Conference article

VL - 36

SP - 11267

EP - 11275

JO - AAAI Conference on Artificial Intelligence

JF - AAAI Conference on Artificial Intelligence

SN - 2159-5399

IS - 10

Y2 - 22 February 2022 through 1 March 2022

ER -
