Occluded Video Instance Segmentation: A Benchmark

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Standard

Occluded Video Instance Segmentation: A Benchmark. / Qi, Jiyang; Gao, Yan; Hu, Yao; Wang, Xinggang; Liu, Xiaoyu; Bai, Xiang; Belongie, Serge; Yuille, Alan; Torr, Philip H. S.; Bai, Song.

In: International Journal of Computer Vision, Vol. 130, 2022, pp. 2022-2039.


Harvard

Qi, J, Gao, Y, Hu, Y, Wang, X, Liu, X, Bai, X, Belongie, S, Yuille, A, Torr, PHS & Bai, S 2022, 'Occluded Video Instance Segmentation: A Benchmark', International Journal of Computer Vision, vol. 130, pp. 2022-2039. https://doi.org/10.1007/s11263-022-01629-1

APA

Qi, J., Gao, Y., Hu, Y., Wang, X., Liu, X., Bai, X., Belongie, S., Yuille, A., Torr, P. H. S., & Bai, S. (2022). Occluded Video Instance Segmentation: A Benchmark. International Journal of Computer Vision, 130, 2022-2039. https://doi.org/10.1007/s11263-022-01629-1

Vancouver

Qi J, Gao Y, Hu Y, Wang X, Liu X, Bai X et al. Occluded Video Instance Segmentation: A Benchmark. International Journal of Computer Vision. 2022;130:2022-2039. https://doi.org/10.1007/s11263-022-01629-1

Author

Qi, Jiyang ; Gao, Yan ; Hu, Yao ; Wang, Xinggang ; Liu, Xiaoyu ; Bai, Xiang ; Belongie, Serge ; Yuille, Alan ; Torr, Philip H. S. ; Bai, Song. / Occluded Video Instance Segmentation: A Benchmark. In: International Journal of Computer Vision. 2022 ; Vol. 130. pp. 2022-2039.

Bibtex

@article{18ad6d742993412f83a0a419da0a6716,
title = "Occluded Video Instance Segmentation: A Benchmark",
abstract = "Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.",
keywords = "Benchmark, Dataset, Occlusion reasoning, Video instance segmentation, Video understanding",
author = "Jiyang Qi and Yan Gao and Yao Hu and Xinggang Wang and Xiaoyu Liu and Xiang Bai and Serge Belongie and Alan Yuille and Torr, {Philip H. S.} and Song Bai",
note = "Publisher Copyright: {\textcopyright} 2022, The Author(s).",
year = "2022",
doi = "10.1007/s11263-022-01629-1",
language = "English",
volume = "130",
pages = "2022--2039",
journal = "International Journal of Computer Vision",
issn = "0920-5691",
publisher = "Springer",
}

RIS

TY - JOUR

T1 - Occluded Video Instance Segmentation

T2 - A Benchmark

AU - Qi, Jiyang

AU - Gao, Yan

AU - Hu, Yao

AU - Wang, Xinggang

AU - Liu, Xiaoyu

AU - Bai, Xiang

AU - Belongie, Serge

AU - Yuille, Alan

AU - Torr, Philip H. S.

AU - Bai, Song

N1 - Publisher Copyright: © 2022, The Author(s).

PY - 2022

Y1 - 2022

N2 - Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.

AB - Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.

KW - Benchmark

KW - Dataset

KW - Occlusion reasoning

KW - Video instance segmentation

KW - Video understanding

UR - http://www.scopus.com/inward/record.url?scp=85132288284&partnerID=8YFLogxK

U2 - 10.1007/s11263-022-01629-1

DO - 10.1007/s11263-022-01629-1

M3 - Journal article

AN - SCOPUS:85132288284

VL - 130

SP - 2022

EP - 2039

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

SN - 0920-5691

ER -