ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments
Abstract
Human Activity Recognition (HAR) in real-world settings remains challenging due to occlusion, dynamic backgrounds, and visual noise. Traditional models such as CNNs, RNNs, and ST-GCNs are constrained by small receptive fields and a reliance on local features, which limits generalisation. We present ViT-HAR, a Vision Transformer framework that learns global spatio-temporal interactions and introduces two new modules, Contextual Patch Reweighting (CPR) and Attention-Guided Occlusion Masking (AGOM), to address these issues. These components direct attention selectively toward motion-relevant, non-occluded regions, improving robustness and interpretability in cluttered scenes. Unlike earlier Vision Transformer architectures (e.g., TimeSformer and ViViT), which rely on fixed attention patterns, ViT-HAR applies adaptive attention: it redistributes weight across contextual patches and masks occluded regions with dynamically varying weights so that semantically salient information is retained. The combined pipeline couples dynamic frame sampling, contextual reweighting, and occlusion-based masking, achieving a favourable trade-off between spatial and temporal coherence. Evaluations on NTU RGB+D, Kinetics-700, and UCF101 show up to 6.5% higher Top-1 accuracy and better F1-scores than 3D-CNN and RNN hybrids. Visualisation of attention maps confirms that ViT-HAR attends to meaningful motion cues, making it applicable to healthcare monitoring, smart surveillance, and AR/VR. Lightweight and multimodal extensions are left to future work.
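The abstract describes CPR as reweighting patch tokens by contextual relevance and AGOM as masking occluded regions by attention weight. The paper does not give the exact formulation here, so the sketch below is only a minimal illustration of that general idea under assumed definitions: relevance is taken as a dot product between each patch embedding and a context vector, normalised by softmax (a stand-in for CPR), and patches whose attention weight falls below a threshold `tau` are zeroed out (a stand-in for AGOM). All function names, shapes, and the threshold are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextual_patch_reweight(patches, context):
    # Hypothetical CPR stand-in: score each patch token against a
    # context vector, then rescale tokens by the softmaxed scores.
    scores = patches @ context            # (num_patches,)
    weights = softmax(scores)             # sums to 1 across patches
    return patches * weights[:, None], weights

def occlusion_mask(patches, attn_weights, tau):
    # Hypothetical AGOM stand-in: drop (zero out) patches whose
    # attention weight is below the threshold tau.
    keep = attn_weights >= tau            # boolean (num_patches,)
    return patches * keep[:, None], keep

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 8))    # 16 patch tokens, dim 8
context = rng.standard_normal(8)          # assumed context vector

rw, w = contextual_patch_reweight(patches, context)
masked, keep = occlusion_mask(rw, w, tau=1.0 / 16)
print(w.sum())   # softmax weights sum to 1 (up to float error)
```

In a real ViT pipeline these operations would act on learned token embeddings inside the attention blocks rather than on raw numpy arrays; the sketch only shows the reweight-then-mask control flow the abstract outlines.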
Details
Primary Language: English
Subjects: Computer Software
Journal Section: Research Article
Authors
Early Pub Date: March 16, 2026
Publication Date: March 16, 2026
Submission Date: July 18, 2025
Acceptance Date: November 18, 2025
Published in Issue: Year 2026, Volume 9, Number 1
