Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Research output: Working paper › Preprint › Research

Standard

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. / Qian, Rui; Xu, Zheng; Yang, Ming Hsuan; Belongie, Serge; Cui, Yin.

arXiv.org, 2022.

Research output: Working paper › Preprint › Research

Harvard

Qian, R, Xu, Z, Yang, MH, Belongie, S & Cui, Y 2022 'Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models' arXiv.org.

APA

Qian, R., Xu, Z., Yang, M. H., Belongie, S., & Cui, Y. (2022). Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. arXiv.org.

Vancouver

Qian R, Xu Z, Yang MH, Belongie S, Cui Y. Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. arXiv.org. 2022.

Author

Qian, Rui ; Xu, Zheng ; Yang, Ming Hsuan ; Belongie, Serge ; Cui, Yin. / Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. arXiv.org, 2022.

Bibtex

@techreport{c8ab9cd620b9404591330459b701ca12,
title = "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models",
abstract = "Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.",
author = "Rui Qian and Zheng Xu and Yang, {Ming Hsuan} and Serge Belongie and Yin Cui",
year = "2022",
language = "English",
publisher = "arXiv.org",
type = "WorkingPaper",
institution = "arXiv.org",

}

RIS

TY - UNPB

T1 - Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

AU - Qian, Rui

AU - Xu, Zheng

AU - Yang, Ming Hsuan

AU - Belongie, Serge

AU - Cui, Yin

PY - 2022

Y1 - 2022

N2 - Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.

AB - Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.

M3 - Preprint

BT - Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

PB - arXiv.org

ER -

ID: 384580230
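
Note: as a rough illustration of the pipeline described in the abstract (a pre-trained VLM vision encoder producing features for video frames and for an auxiliary modality such as optical flow or an audio spectrogram, a cross-modal fusion step, and classification by similarity to class-name text embeddings), the following is a minimal PyTorch sketch. It is not the authors' released implementation; the module names, embedding width, pooling, and fusion details are assumptions made for illustration only.

# Minimal, illustrative sketch of an MOV-style pipeline as described in the
# abstract; NOT the paper's released code. Shapes and module choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512  # assumed shared embedding width of the pre-trained VLM

class CrossModalFusion(nn.Module):
    """Aggregate video features with flow/audio features via cross-attention."""
    def __init__(self, dim=d_model, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, aux_tokens):
        # Video tokens attend to the auxiliary modality (flow or audio spectrogram).
        fused, _ = self.attn(video_tokens, aux_tokens, aux_tokens)
        return self.norm(video_tokens + fused)

def classify(video_tokens, aux_tokens, class_text_emb, fusion):
    """Open-vocabulary classification: cosine similarity between the pooled,
    fused video embedding and text embeddings of the class names."""
    fused = fusion(video_tokens, aux_tokens)            # (B, T, D)
    video_emb = F.normalize(fused.mean(dim=1), dim=-1)  # (B, D)
    text_emb = F.normalize(class_text_emb, dim=-1)      # (C, D)
    return video_emb @ text_emb.t()                     # (B, C) similarity logits

# Toy usage with random tensors standing in for encoder outputs.
B, T, C = 2, 8, 5
fusion = CrossModalFusion()
logits = classify(torch.randn(B, T, d_model),   # frame features from the VLM vision encoder
                  torch.randn(B, T, d_model),   # flow or audio-spectrogram features
                  torch.randn(C, d_model),      # class-name embeddings from the text encoder
                  fusion)
print(logits.shape)  # torch.Size([2, 5])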