Action Selection Learning for Multi-label Multi-view Action Recognition

Abstract

Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame level. This study focuses on real-world scenarios where cameras are distributed to capture a wide-range area and only weak labels are available at the video level. We propose MultiASL (Multi-view Action Selection Learning), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder that extracts spatial and temporal features from multi-view videos. Action Selection Learning is employed at the frame level, using pseudo ground-truth obtained from the weak video-level labels, to identify the most relevant frames for action recognition. Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.

Introduction

Action recognition has garnered significant interest due to its wide range of applications in surveillance, robotics, and video content analysis. With advancements in multi-camera systems, there is a growing need to capture and analyze actions from different viewpoints to obtain a comprehensive understanding of a scene. Conventional single-view methods are limited to one perspective and may therefore miss actions that occur outside it or are occluded. This study proposes MultiASL, a method designed to improve multi-view action recognition in real-world scenarios by selecting the most relevant information from the various viewpoints.

Figure 1. Configuration of multi-view settings. (a) Multiple cameras arranged to surround a target in a narrow area. (b) Multiple distributed cameras covering a wide-range area.

Methodology

The proposed MultiASL method includes two main components: a Multi-view Spatial-Temporal Transformer Video Encoder and an Action Selection Learning (ASL) module. The video encoder extracts spatial and temporal features from the videos of multiple viewpoints. Action Selection Learning is then applied at the frame level to select the frames most relevant to action recognition, using pseudo ground-truth obtained from the video-level labels.
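To make the idea of frame-level pseudo ground-truth concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible rule: a frame is treated as relevant when its highest-scoring class is among the classes in the video-level label and that score reaches a threshold. The function names, the `threshold` parameter, and the toy scores are all hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
from typing import List


def pseudo_frame_targets(
    frame_scores: List[List[float]],
    video_labels: List[int],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[int]:
    """Return one 0/1 pseudo target per frame.

    frame_scores[t][c] is the score of class c at frame t.
    video_labels[c] is 1 if class c occurs somewhere in the video (weak label).
    A frame is marked 1 when its top-scoring class is present in the
    video-level label and the score is at least `threshold`.
    """
    targets = []
    for scores in frame_scores:
        best_class = max(range(len(scores)), key=lambda c: scores[c])
        is_relevant = (
            video_labels[best_class] == 1 and scores[best_class] >= threshold
        )
        targets.append(1 if is_relevant else 0)
    return targets


def select_frames(targets: List[int]) -> List[int]:
    """Return the indices of frames whose pseudo target is 1."""
    return [t for t, flag in enumerate(targets) if flag == 1]


# Toy example: 4 frames, 3 classes; the video is weakly labeled with classes 0 and 2.
frame_scores = [
    [0.9, 0.1, 0.2],  # confident in class 0, which is in the label
    [0.2, 0.8, 0.1],  # confident in class 1, which is not in the label
    [0.1, 0.2, 0.3],  # top class is in the label, but the score is low
    [0.3, 0.1, 0.7],  # confident in class 2, which is in the label
]
video_labels = [1, 0, 1]

targets = pseudo_frame_targets(frame_scores, video_labels)
selected = select_frames(targets)
```

In this toy case only the first and last frames are selected. A trainable selection module would typically be supervised with such targets rather than applying the rule directly at inference time; that training loop is omitted here.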

Figure 2. Overview of the proposed MultiASL method.

Experiments

The experiments were conducted on the MM-Office dataset, a collection of videos recorded in a real-world office environment using multiple distributed cameras. The proposed MultiASL method demonstrates superior performance compared to existing multi-view action recognition methods. Various view-level fusion strategies, such as max pooling and mean pooling, were evaluated, with max pooling consistently yielding the best results.
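The two view-level fusion strategies named above can be sketched as follows. This is an illustrative example under stated assumptions, not the evaluated implementation: it assumes each camera view yields one score per action class, and the function name `fuse_views` and the toy scores are hypothetical.

```python
from typing import List


def fuse_views(view_scores: List[List[float]], mode: str = "max") -> List[float]:
    """Fuse per-view class scores into one score per class.

    view_scores[v][c] is the score of class c from view v.
    mode="max" keeps the highest score across views for each class;
    mode="mean" averages the scores across views for each class.
    """
    if not view_scores:
        raise ValueError("view_scores must contain at least one view")
    per_class = list(zip(*view_scores))  # regroup scores by class
    if mode == "max":
        return [max(scores) for scores in per_class]
    if mode == "mean":
        return [sum(scores) / len(scores) for scores in per_class]
    raise ValueError(f"unknown fusion mode: {mode!r}")


# Toy example: 3 views, 2 classes. Only the first view observes class 0.
view_scores = [
    [0.75, 0.0],
    [0.0, 0.5],
    [0.0, 0.25],
]

fused_max = fuse_views(view_scores, mode="max")    # [0.75, 0.5]
fused_mean = fuse_views(view_scores, mode="mean")  # [0.25, 0.25]
```

The toy scores suggest one possible intuition: when cameras are distributed over a wide area, an action may be visible from a single view, so averaging dilutes its score while max pooling preserves it. This is an interpretation offered for illustration, not a claim taken from the experiments.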

Table 1: Comparison of the proposed MultiASL and other methods. The best and second-best results are highlighted in bold and underlined text, respectively.

Acknowledgment

This work was partly supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI JP21H03519 and JP24H00733. The computation was carried out using the General Projects on the supercomputer "Flow" at the Information Technology Center, Nagoya University.

BibTeX

@inproceedings{nguyen2024MultiASL,
      title={Action Selection Learning for Multilabel Multiview Action Recognition},
      author={Nguyen, Trung Thanh and Kawanishi, Yasutomo and Komamizu, Takahiro and Ide, Ichiro},
      booktitle={ACM Multimedia Asia 2024},
      pages={1--7},
      year={2024},
}