ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments
Abstract
Human Activity Recognition (HAR) in real-world settings remains challenging due to occlusion, dynamic backgrounds, and visual noise. Traditional models such as CNNs, RNNs, and ST-GCNs are constrained by small receptive fields and a reliance on local features, which limits generalisation. We present ViT-HAR, a Vision Transformer framework that learns global spatio-temporal interactions and introduces two new modules, Contextual Patch Reweighting (CPR) and Attention-Guided Occlusion Masking (AGOM), to address these issues. These components direct attention selectively toward motion-relevant, non-occluded regions, improving robustness and interpretability in cluttered scenes. Unlike earlier Vision Transformer architectures (e.g., TimeSformer and ViViT), which rely on fixed attention patterns, ViT-HAR applies adaptive attention: it redistributes weight across contextual patches and masks occluded regions with dynamically varying weights so that semantically salient information is retained. The combined pipeline couples dynamic frame sampling, contextual reweighting, and occlusion-based masking, achieving a favourable trade-off between spatial and temporal coherence. Evaluations on NTU RGB+D, Kinetics-700, and UCF101 show up to 6.5% higher Top-1 accuracy and better F1-scores than 3D-CNN and RNN hybrids. Visualisation of attention maps confirms that ViT-HAR attends to meaningful motion cues, making it applicable to healthcare monitoring, smart surveillance, and AR/VR. Lightweight and multimodal extensions are left to future work.
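The abstract describes CPR as reweighting patch tokens by contextual relevance and AGOM as masking occluded regions by attention weight. The paper does not give the exact formulation here, so the sketch below is only a minimal illustration of that general idea under assumed definitions: relevance is taken as a dot product between each patch embedding and a context vector, normalised by softmax (a stand-in for CPR), and patches whose attention weight falls below a threshold `tau` are zeroed out (a stand-in for AGOM). All function names, shapes, and the threshold are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextual_patch_reweight(patches, context):
    # Hypothetical CPR stand-in: score each patch token against a
    # context vector, then rescale tokens by the softmaxed scores.
    scores = patches @ context            # (num_patches,)
    weights = softmax(scores)             # sums to 1 across patches
    return patches * weights[:, None], weights

def occlusion_mask(patches, attn_weights, tau):
    # Hypothetical AGOM stand-in: drop (zero out) patches whose
    # attention weight is below the threshold tau.
    keep = attn_weights >= tau            # boolean (num_patches,)
    return patches * keep[:, None], keep

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 8))    # 16 patch tokens, dim 8
context = rng.standard_normal(8)          # assumed context vector

rw, w = contextual_patch_reweight(patches, context)
masked, keep = occlusion_mask(rw, w, tau=1.0 / 16)
print(w.sum())   # softmax weights sum to 1 (up to float error)
```

In a real ViT pipeline these operations would act on learned token embeddings inside the attention blocks rather than on raw numpy arrays; the sketch only shows the reweight-then-mask control flow the abstract outlines.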
Details
Primary Language: English
Subjects: Computer Software
Journal Section: Research Article
Authors
Early Pub Date: March 16, 2026
Publication Date: March 16, 2026
Submission Date: July 18, 2025
Acceptance Date: November 18, 2025
Published in Issue: Year 2026, Volume 9, Number 1
