What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis

Research output: Contribution to journal › Journal article › Research › peer-review

Documents

Fulltext
Submitted manuscript, 1.1 MB, PDF document

Dimitris Gkoumas
Qiuchi Li
Christina Lioma
Yijun Yu
Dawei Song

Multimodal video sentiment analysis is a rapidly growing area. It combines verbal (i.e., linguistic) and non-verbal (i.e., visual and acoustic) modalities to predict the sentiment of utterances. A recent trend has been towards modality fusion models that use various attention, memory and recurrent components. However, there has been no systematic investigation of how these components contribute to solving the problem, or of their limitations. This paper aims to fill that gap. We present the first large-scale and comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches in two video sentiment analysis tasks, using three SOTA benchmark corpora. An in-depth analysis of the results yields four findings. First, attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and using the linguistic modality as a pivot for the non-verbal modalities improve performance. We expect these findings to provide helpful insights and guidance for the development of more effective modality fusion models.

Original language	English
Journal	Information Fusion
Volume	66
Pages (from-to)	184-197
ISSN	1566-2535
DOIs	https://doi.org/10.1016/j.inffus.2020.09.005
Publication status	Published - 2021

Bibliographical note

Publisher Copyright:
© 2020 Elsevier B.V.

Research areas

Emotion recognition, Multimodal human language understanding, Reproducibility in multimodal machine learning, Video sentiment analysis

ID: 306691667

Datalogisk Institut
