Trung Thanh Nguyen

I am a PhD Candidate in the Department of Intelligent Systems at Nagoya University. My research focuses on 3D Computer Vision, Multimodal Recognition, and Vision-Language Models, with applications to real-world problems.

PhD Candidate

Nagoya University — Graduate School of Informatics, Japan

Student Researcher

RIKEN National Science Institute — Guardian Robot Project, Japan

Rising Star Fellowship

University of Freiburg — Excellence Cluster Future Forests, Germany

Higher Education — Industry Collaboration

Center for AI, Mathematical and Data Science, Nagoya University

Research Internship (Incoming)

Toshiba Corporation — Japan

nguyent (at) cs.is.i.nagoya-u.ac.jp Google Scholar LinkedIn

News ↗︎

Jul 28, 2026	🌲 ForestMamba has been integrated into the 3Dtrees.earth platform.
Jul 27, 2026	I will serve as an Organizing Committee member of the 3rd Workshop on Computer Vision for Developing Countries (CV4DC), in conjunction with ACCV 2026, Osaka 🇯🇵.
Jul 26, 2026	Our paper, “Context-aware and View-consistent Learning for Multi-view Action Recognition,” has been accepted in ACM TOMM (IF: 5.6).
Jun 29, 2026	The project webpage of 🌲 SelectAnyTree is now live.
Jun 05, 2026	Our paper, “TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance,” has been accepted to IEEE AVSS 2026, Lecce 🇮🇹.
May 27, 2026	Our paper, “PRIMS: Physics-guided Representation for Fluid Identification in Multimodal Sensing,” has been accepted to ECML PKDD 2026, Naples 🇮🇹.
Apr 19, 2026	Our paper on the MultiSensor-Home dataset was accepted in Pattern Recognition (IF: 9.1).
Apr 17, 2026	I was selected to present my PhD research at the Doctoral Consortium of IEEE FG2026, Kyoto 🇯🇵.
Mar 31, 2026	🇩🇪 Universität Freiburg: “International researchers are networking at the Freiburg Rising Stars Academy”
Feb 16, 2026	My interview in a special feature “The Reality of the Doctoral Program” (in Japanese) by Nagoya University is now published on Tamatebako (玉手箱).

Latest Posts ↗︎

Jul 28, 2026	🌲 ForestMamba on 3Dtrees.earth Platform
May 25, 2026	🇯🇵 Presenting at IEEE FG 2026
Mar 20, 2026	🇩🇪 Attending the Freiburg Rising Stars Conference
Mar 10, 2026	🇺🇸 Presenting at IEEE/CVF WACV 2026
Feb 03, 2026	🇯🇵 Attending ACM Asian School on HPC and AI 2026

Selected Publications ↗︎

Pattern Recognit.

MultiSensor-Home: Multi-modal multi-view dataset and benchmarks for action recognition in home environments

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, and Ichiro Ide

Pattern Recognition, 2026

This paper is an open access article.

@article{Nguyen2026PR,
  publisher = {Elsevier},
  title = {MultiSensor-Home: Multi-modal multi-view dataset and benchmarks for action recognition in home environments},
  journal = {Pattern Recognition},
  pages = {113810},
  year = {2026},
  issn = {0031-3203},
  doi = {10.1016/j.patcog.2026.113810},
  impact_factor = {9.1},
  pub_group = {ij2026},
  author = {Nguyen, Trung Thanh and Kawanishi, Yasutomo and John, Vijay and Komamizu, Takahiro and Ide, Ichiro}
}

ACM TOMM
Action Selection Learning for Weakly Labeled Multi-modal Multi-view Action Recognition

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, and Ichiro Ide

ACM Transactions on Multimedia Computing, Communications, and Applications, 2026

This paper is an open access article.

Multi-view action recognition is a critical task in computer vision, with broad applications in surveillance, robotics, and video-content analysis. Traditional single-view action recognition approaches suffer from a limited field of view and occlusion, leading to incomplete understanding of actions and a higher likelihood of misclassification. Moreover, most existing methods rely on constrained environments with strong label annotations, where the onset and offset times of each action are meticulously labeled at the frame-level. However, annotating strong labels for multi-view video sequences in real-world scenarios is time consuming and labor intensive. In many cases, only a weak video-level (sequence-level) label is available, where only the action class label for the entire video sequence is provided. This limits the performance of accurate action recognition. To overcome this limitation, we propose Multi-modal Multi-view Action Selection Learning (MMASL), which integrates audio and video data to perform frame-level action recognition in large-area environments using sequence-level weak labels. The key components of MMASL include modality-specific Shared Audio Encoder and Shared Video Encoder, and an Action Selection Learning (ASL) mechanism. The encoder processes input data from multiple views by extracting and unifying features from audio and video modalities. Meanwhile, ASL dynamically selects relevant frames across views and filters out irrelevant information while focusing on critical action segments to enhance action recognition accuracy. By incorporating audio data with video data, MMASL improves recognition accuracy for visually ambiguous actions that are distinguishable through sound. 
Experiments in a real-world office environment using the MM-Office dataset demonstrate that MMASL outperforms state-of-the-art methods, achieving up to 8.81% improvement in mAP_C (Class-wise mean Average Precision) and 8.43% in mAP_S (Sample-wise mean Average Precision), highlighting the significance of multi-modal multi-view action recognition with ASL in real-world scenarios.

@article{Nguyen2025ACMTOMM,
  author = {Nguyen, Trung Thanh and Kawanishi, Yasutomo and John, Vijay and Komamizu, Takahiro and Ide, Ichiro},
  title = {Action Selection Learning for Weakly Labeled Multi-modal Multi-view Action Recognition},
  journal = {ACM Transactions on Multimedia Computing, Communications, and Applications},
  doi = {10.1145/3744742},
  impact_factor = {6.0},
  pub_group = {ij2026},
  publisher = {ACM},
  year = {2026}
}

ACM TOMM
Hierarchical Local-Global Fusion for One-stage Open-vocabulary Temporal Action Detection

Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, and Ichiro Ide

ACM Transactions on Multimedia Computing, Communications, and Applications, 2026

This paper is an open access article.

Open-vocabulary Temporal Action Detection (Open-vocab TAD) extends the detection scope of Closed-vocabulary Temporal Action Detection (Closed-vocab TAD) to unseen action classes specified by vocabularies not included in the training data, within untrimmed video. Typical Open-vocab TAD methods adopt a two-stage approach that first proposes candidate action intervals and then identifies those actions. However, errors in the first stage can affect the subsequent stage and the final detection results. Moreover, conventional methods for temporal context analyses tend to focus solely on either global or local context. Focusing solely on the global context can lead to lack of momentary detail, making it difficult to distinguish one action from another. Conversely, focusing only on the local context makes it challenging to determine the start and end timings of action intervals. To address these challenges, we introduce a one-stage approach named Hierarchical Open-vocab TAD (HOTAD), consisting of two branches: Temporal Context Analysis (TCA) and Video-Text Alignment (VTA). The former utilizes Hierarchical Encoder (HE) to fuse global and local temporal features, enabling a comprehensive capture of temporal actions, while the latter branch exploits the synergy between visual and textual modalities for precisely detecting unseen actions in the Open-vocab setting. Experiments and in-depth analysis using the widely recognized datasets THUMOS14 and ActivityNet-1.3 are performed to show the effectiveness of the proposed method. The results highlight remarkable accuracy in detecting a wide range of unseen actions. Furthermore, the proposed method significantly reduces wrong labels and localizes action instances with high precision, showcasing its robustness in complex and dynamic video settings.

@article{nguyen2025_HOTAD,
  title = {Hierarchical Local-Global Fusion for One-stage Open-vocabulary Temporal Action Detection},
  author = {Nguyen, Trung Thanh and Kawanishi, Yasutomo and Komamizu, Takahiro and Ide, Ichiro},
  journal = {ACM Transactions on Multimedia Computing, Communications, and Applications},
  year = {2026},
  publisher = {ACM},
  doi = {10.1145/3773986},
  impact_factor = {6.0},
  pub_group = {ij2026}
}

IEEE/CVF WACV

View-aware Cross-modal Distillation for Multi-view Action Recognition

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, and Ichiro Ide

In Proceedings of the 2026 IEEE/CVF Winter Conference on Applications of Computer Vision, 2026

@inproceedings{nguyentWACV2026,
  title = {View-aware Cross-modal Distillation for Multi-view Action Recognition},
  author = {Nguyen, Trung Thanh and Kawanishi, Yasutomo and John, Vijay and Komamizu, Takahiro and Ide, Ichiro},
  booktitle = {Proceedings of the 2026 IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages = {7769--7778},
  year = {2026},
  doi = {10.1109/WACV61042.2026.00750},
  pub_group = {ic2026}
}

IEEE FG
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, and Ichiro Ide

In Proceedings of the 19th IEEE International Conference on Automatic Face and Gesture Recognition, 2025

🏆 Best Student Paper Award

We presented a new multimodal, multi-view dataset called “MultiSensor-Home”, which provides high-resolution and fine-grained frame-level annotations for action recognition in wide-area distributed environments, along with a Transformer-based sensor fusion method called “MultiTSF”, at the international conference FG 2025, and received the Best Student Paper Award.

Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area distributed settings, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this paper, we introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments, and also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the proposed method integrates a human detection module to enhance spatial feature learning, guiding the model to prioritize frames with human activity to enhance the action recognition accuracy. Experiments on the proposed MultiSensor-Home and the existing MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. Quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.

@inproceedings{nguyen2025multisensor,
  title = {MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion},
  author = {Nguyen, Trung Thanh and Kawanishi, Yasutomo and John, Vijay and Komamizu, Takahiro and Ide, Ichiro},
  booktitle = {Proceedings of the 19th IEEE International Conference on Automatic Face and Gesture Recognition},
  year = {2025},
  doi = {10.1109/FG61629.2025.11099071},
  pub_group = {ic2025},
  primaryclass = {cs.CV}
}
