M.Sc. Defense by Aslak Niclasen and Jonathan Kurish

Video Action Recognition in Basketball


Recognizing actions in videos is valuable in a variety of tasks; the surveillance, tourism and medical sectors are examples of industries that benefit from Video Action Recognition. Because actions unfold over a period of time and structural information may change from frame to frame, recognizing them accurately is a complex task. Deep Learning has shown promising results in the field of Video Action Recognition.
We implement a two-stream model that combines two separate ResNet-50 models, as presented in [1]. One stream uses spatial information and the other uses temporal information. Besides late fusion, the two streams interact through Multiplicative Gating. Furthermore, Temporal Injections are used to enhance temporal support in the spatial model.
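
The two ways the streams are combined can be illustrated with a minimal NumPy sketch. It is not the implementation from [1]: it assumes that late fusion averages the per-class softmax scores of the two streams, and that Multiplicative Gating scales spatial features element-wise by temporal features of the same shape. The function names and the toy logits are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def late_fusion(spatial_logits: np.ndarray, temporal_logits: np.ndarray) -> np.ndarray:
    """Assumed fusion rule: average the class probabilities of both streams."""
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))


def multiplicative_gate(spatial_features: np.ndarray,
                        temporal_features: np.ndarray) -> np.ndarray:
    """Assumed gating rule: element-wise product of equally shaped feature maps."""
    if spatial_features.shape != temporal_features.shape:
        raise ValueError("feature maps must have identical shapes")
    return spatial_features * temporal_features


# Hypothetical toy logits for 2 clips and 3 classes.
spatial_logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]])
temporal_logits = np.array([[1.0, 1.2, 0.0], [0.1, 2.0, 0.4]])

fused = late_fusion(spatial_logits, temporal_logits)
predictions = fused.argmax(axis=-1)  # predicted class index per clip
```

Late fusion only combines the final scores, whereas gating lets motion information modulate the spatial features inside the network, which is why the two mechanisms can be used together.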

We conduct a number of experiments on UCF101, a benchmark dataset with 101 classes, in order to study the ResNet-50 model. We examine the influence of various hyper-parameters, as well as the depth of the model. Our final two-stream model achieves a test accuracy of 80.5%, 3% higher than using only late fusion to combine the streams.

In addition, we create a new dataset called the Basketball Simple Shooting Statistics Dataset, or B3SD for short. It is a fine-grained dataset consisting of various basketball videos. We construct multiple versions of B3SD in order to distinguish videos containing shots from videos without shots, as well as to distinguish between shot types. We explore the effect of various input modalities, such as Optical Flow and gray-scaled difference frames. Our final two-part two-stream model achieves a test accuracy of 74.7%, using a single RGB frame as spatial information and a stack of gray-scaled frames as temporal information.
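
As an illustration of the gray-scaled difference-frame modality, the following sketch builds a temporal input from a short clip. It rests on assumptions not stated above: frames are converted to grayscale with the ITU-R BT.601 luma weights, differences are taken between consecutive frames, and the results are stacked along the channel axis. The function names and the synthetic clip are hypothetical.

```python
import numpy as np


def to_gray(frames_rgb: np.ndarray) -> np.ndarray:
    """Convert (T, H, W, 3) RGB frames to (T, H, W) grayscale.

    Uses the ITU-R BT.601 luma weights; this choice is an assumption.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return frames_rgb.astype(np.float32) @ weights


def difference_stack(frames_rgb: np.ndarray) -> np.ndarray:
    """Return an (H, W, T-1) stack of consecutive grayscale frame differences."""
    gray = to_gray(frames_rgb)
    diffs = gray[1:] - gray[:-1]      # frame t+1 minus frame t
    return np.moveaxis(diffs, 0, -1)  # move the time axis to the channel position


# Synthetic clip: 4 black frames of 8x8 pixels, with frame 2 set to white.
clip = np.zeros((4, 8, 8, 3), dtype=np.uint8)
clip[2] = 255
temporal_input = difference_stack(clip)
```

Static regions cancel out in the differences, so the stack emphasizes motion while being much cheaper to compute than Optical Flow.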

We recommend that future work examine which input modality is best suited for temporal information. Additionally, the quality of the dataset needs further work before models trained on B3SD can be used in more practical applications. Lastly, as B3SD includes bounding box information and multiple actions occurring in a single video, it makes exploration of continuous action recognition and localization models possible.

Supervisor and contact person: Kim Steenstrup Pedersen
External examiner: Rasmus Paulsen