Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame level. This study focuses on real-world scenarios where cameras are distributed to cover a wide area and only weak labels are available at the video level. We propose MultiASL (Multi-view Action Selection Learning), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder that extracts spatial and temporal features from multi-viewpoint videos. Action Selection Learning is applied at the frame level, using pseudo ground-truth derived from the video-level weak labels, to identify the frames most relevant to action recognition. Experiments in a real-world office environment using the MM-Office dataset show that the proposed method outperforms existing methods.
Action recognition has garnered significant interest due to its wide applications in surveillance, robotics, and video content analysis. With advancements in multi-camera systems, there is a growing need to capture and analyze actions from different viewpoints to obtain a comprehensive understanding of a scene. Conventional single-view methods, however, are limited to one perspective and may therefore capture an action only partially. This study proposes MultiASL, a method designed to improve multi-view action recognition in real-world scenarios by selecting the most relevant information from the various viewpoints.
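To make the idea of frame-level selection under video-level supervision more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a model has already produced per-frame class scores, and it uses a simple top-k rule to build pseudo ground-truth: for every action present in the video-level label, the k highest-scoring frames are marked as positive. The function names, the top-k rule, and the value of `k` are all illustrative assumptions; the paper's actual selection mechanism may differ.

```python
import numpy as np


def pseudo_frame_labels(frame_scores, video_labels, k):
    """Build frame-level pseudo labels from video-level weak labels.

    Assumption (hypothetical rule, not taken from the paper): for each
    action marked present in the video, the k highest-scoring frames
    are treated as positives; every other entry is a negative.

    frame_scores: array of shape (T, C), per-frame class scores.
    video_labels: multi-hot array of shape (C,), video-level labels.
    Returns an integer array of shape (T, C) with values 0 or 1.
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    video_labels = np.asarray(video_labels)
    num_frames, num_classes = frame_scores.shape
    k = max(1, min(k, num_frames))  # clamp k to a valid range

    pseudo = np.zeros((num_frames, num_classes), dtype=np.int64)
    for c in np.flatnonzero(video_labels):
        # Indices of the k frames with the highest score for class c.
        top = np.argsort(frame_scores[:, c])[-k:]
        pseudo[top, c] = 1
    return pseudo


def video_scores_topk(frame_scores, k):
    """Aggregate frame scores to video level by averaging the top-k
    frames per class (a common weak-supervision pooling choice,
    assumed here for illustration)."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    k = max(1, min(k, frame_scores.shape[0]))
    # Sort each class column in ascending order and keep the last k rows.
    top = np.sort(frame_scores, axis=0)[-k:]
    return top.mean(axis=0)


# Toy example: 4 frames, 2 action classes, only class 0 is present.
scores = np.array([[0.9, 0.1],
                   [0.8, 0.2],
                   [0.1, 0.3],
                   [0.2, 0.7]])
labels = np.array([1, 0])

pseudo = pseudo_frame_labels(scores, labels, k=2)
video_level = video_scores_topk(scores, k=2)
```

In this toy example, the two highest-scoring frames for class 0 become pseudo positives, while class 1, which is absent from the video-level label, receives no positives. In a training loop, such pseudo labels could supervise a frame-level head while the pooled scores are compared against the video-level labels.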
Figure 1. Configuration of multi-view settings. (a) Multiple cameras arranged to surround a target in a narrow area. (b) Multiple distributed cameras covering a wide-range area.
Figure 2. Overview of the proposed MultiASL method.
Table 1. Comparison of the proposed MultiASL and other methods. The best and second-best results are highlighted in bold and underlined text, respectively.
@inproceedings{nguyen2024MultiASL,
title={Action Selection Learning for Multilabel Multiview Action Recognition},
author={Nguyen, Trung Thanh and Kawanishi, Yasutomo and Komamizu, Takahiro and Ide, Ichiro},
booktitle={ACM Multimedia Asia 2024},
pages={1--7},
year={2024},
}