Trung Thanh Nguyen

🔬 PhD Candidate @ Nagoya University | Student Researcher @ RIKEN


I am a PhD candidate in the Department of Intelligent Systems at Nagoya University. My research focuses on vision-language models, multimodal recognition, and video captioning, with applications to real-world problems.

Currently, I am a student researcher at RIKEN, working on the Guardian Robot Project. My research there covers open-world action detection and multi-view, multi-modal action recognition from multimodal sensory data.

Additionally, I work at the Center for Artificial Intelligence, Mathematical and Data Science, collaborating with Japanese corporations to develop practical AI solutions.

📩 Contact: nguyent (at) cs.is.i.nagoya-u.ac.jp

Google Scholar   LinkedIn

news

Dec 03, 2025 I have successfully completed my PhD pre-defense. Onward to the final defense!
Nov 11, 2025 Two papers — “View-aware Cross-modal Distillation for Multi-view Action Recognition” and “PADM: A Physics-aware Diffusion Model for Attenuation Correction” — have been accepted to IEEE/CVF WACV 2026, United States.
Oct 21, 2025 Our paper “Hierarchical Global-Local Fusion for One-stage Open-vocabulary Temporal Action Detection” has been accepted to ACM TOMM (IF: 6.0) journal.
Oct 08, 2025 I was selected as a Rising Star for the Freiburg Rising Stars Academy, Universität Freiburg, Germany.
Oct 03, 2025 I was selected to present my PhD research at the Doctoral Symposium of ACM MMAsia, Malaysia.
Oct 01, 2025 Our paper “Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning” has been accepted to ACM MMAsia, Malaysia.
Sep 18, 2025 Our paper “Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation” has been accepted to NeurIPS, United States.
Aug 25, 2025 I was awarded a research grant from Murata Foundation (est. 1970), Japan.
Aug 01, 2025 I was awarded a research grant from THERS (National University Corporation), Japan.
Aug 01, 2025 We presented two papers (IS3-038, IS3-148) at MIRU2025, Japan.

selected publications

  1. IEEE/CVF WACV
    View-aware Cross-modal Distillation for Multi-view Action Recognition
    Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, and 2 more authors
    In Proceedings of the 2026 IEEE/CVF Winter Conference on Applications of Computer Vision, 2026
  2. ACM TOMM
    Hierarchical Local-Global Fusion for One-stage Open-vocabulary Temporal Action Detection
    Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, and 1 more author
    ACM Transactions on Multimedia Computing, Communications, and Applications, 2025
  3. ACM TOMM
    Action Selection Learning for Weakly Labeled Multi-modal Multi-view Action Recognition
    Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, and 2 more authors
    ACM Transactions on Multimedia Computing, Communications, and Applications, 2025
  4. IEEE FG
    MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
    Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, and 2 more authors
    In Proceedings of the 19th IEEE International Conference on Automatic Face and Gesture Recognition, 2025
  5. IEEE Access
    Zero-shot Pill-Prescription Matching with Graph Convolutional Network and Contrastive Learning
    Trung Thanh Nguyen, Phi Le Nguyen, Yasutomo Kawanishi, and 2 more authors
    IEEE Access, 2024
  6. IEEE TNSM
    Fuzzy Q-Learning-Based Opportunistic Communication for MEC-Enhanced Vehicular Crowdsensing
    Trung Thanh Nguyen, Truong Thao Nguyen, Thanh-Hung Nguyen, and 1 more author
    IEEE Transactions on Network and Service Management, 2022