Research Article

ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments

Volume: 9 Number: 1 March 16, 2026

Abstract

Human Activity Recognition (HAR) in real-world settings remains challenging due to occlusion, dynamic backgrounds, and visual noise. Traditional models such as CNNs, RNNs, and ST-GCNs suffer from constraints, including small receptive fields and a reliance on local features, which limit generalisation. We present ViT-HAR, a Vision Transformer framework that learns global spatio-temporal interactions and introduces two new modules, Contextual Patch Reweighting (CPR) and Attention-Guided Occlusion Masking (AGOM), to address these challenges. These components enable selective attention to motion-relevant, non-occluded regions, improving robustness and interpretability in cluttered scenes. In contrast to earlier Vision Transformer architectures (e.g., TimeSformer and ViViT), which rely on fixed attention, ViT-HAR applies adaptive attention, reweighting contextual patches and masking occluded regions with dynamically varying weights so that semantically salient information is retained. The combined pipeline uses dynamic frame sampling, contextual reweighting, and occlusion-based masking, yielding an effective trade-off between spatial and temporal coherence. Evaluations on NTU RGB+D, Kinetics-700, and UCF101 show up to 6.5% higher Top-1 accuracy and better F1-scores than 3D-CNN and RNN hybrids. Visualisation of the attention maps confirms that ViT-HAR attends to meaningful motion cues, making it well suited to healthcare monitoring, smart surveillance, and AR/VR. Lightweight and multimodal extensions are left to future work.
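The abstract describes CPR and AGOM only at a high level. As a purely illustrative sketch, the following PyTorch code shows one plausible way such contextual patch reweighting and attention-guided occlusion masking could be wired into a ViT pipeline; the module names, tensor shapes, and the fixed occlusion threshold are assumptions made for exposition, not the authors' published implementation.

```python
# Illustrative sketch only: one plausible realisation of the CPR and AGOM
# ideas described in the abstract. Module names, shapes, and the fixed
# occlusion threshold are assumptions, not the authors' published code.
import torch
import torch.nn as nn


class ContextualPatchReweighting(nn.Module):
    """Scores each patch token against a global context vector and rescales it."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        context = tokens.mean(dim=1, keepdim=True)             # global scene context
        weights = torch.sigmoid(self.score(tokens + context))  # (batch, num_patches, 1)
        return tokens * weights                                # reweighted patch tokens


class AttentionGuidedOcclusionMasking(nn.Module):
    """Zeroes out patches that receive negligible attention mass (treated as occluded)."""

    def __init__(self, threshold: float = 0.02):
        super().__init__()
        self.threshold = threshold  # assumed fixed cutoff; could equally be learned

    def forward(self, tokens: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, num_patches, num_patches) from a preceding ViT block
        received = attn.mean(dim=1).mean(dim=1)                # avg attention per patch
        mask = (received >= self.threshold).float().unsqueeze(-1)
        return tokens * mask                                   # occluded patches suppressed


# Minimal usage with random stand-ins for ViT patch tokens and attention maps.
tokens = torch.randn(2, 196, 768)                              # 14x14 patches, ViT-Base dim
attn = torch.softmax(torch.randn(2, 12, 196, 196), dim=-1)     # 12 attention heads
out = AttentionGuidedOcclusionMasking()(ContextualPatchReweighting(768)(tokens), attn)
print(out.shape)                                               # torch.Size([2, 196, 768])
```

Deriving the occlusion score from the attention maps themselves, rather than from a separate detector, matches the abstract's claim that the masking is attention-guided; whether ViT-HAR uses a fixed or learned threshold is not stated, so the cutoff above is a placeholder.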

Keywords


Details

Primary Language

English

Subjects

Computer Software

Journal Section

Research Article

Early Pub Date

March 16, 2026

Publication Date

March 16, 2026

Submission Date

July 18, 2025

Acceptance Date

November 18, 2025

Published in Issue

Year 2026 Volume: 9 Number: 1

APA
Mewada, A., Ahmad, S., & Ansari, M. A. (2026). ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments. Sakarya University Journal of Computer and Information Sciences, 9(1), 190-204. https://doi.org/10.35377/saucis.1745614
AMA
1. Mewada A, Ahmad S, Ansari MA. ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments. SAUCIS. 2026;9(1):190-204. doi:10.35377/saucis.1745614
Chicago
Mewada, Arvind, Shahnawaz Ahmad, and Mohd Aquib Ansari. 2026. “ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments”. Sakarya University Journal of Computer and Information Sciences 9 (1): 190-204. https://doi.org/10.35377/saucis.1745614.
EndNote
Mewada A, Ahmad S, Ansari MA (March 1, 2026) ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments. Sakarya University Journal of Computer and Information Sciences 9 1 190–204.
IEEE
[1] A. Mewada, S. Ahmad, and M. A. Ansari, “ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments”, SAUCIS, vol. 9, no. 1, pp. 190–204, Mar. 2026, doi: 10.35377/saucis.1745614.
ISNAD
Mewada, Arvind - Ahmad, Shahnawaz - Ansari, Mohd Aquib. “ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments”. Sakarya University Journal of Computer and Information Sciences 9/1 (March 1, 2026): 190-204. https://doi.org/10.35377/saucis.1745614.
JAMA
1. Mewada A, Ahmad S, Ansari MA. ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments. SAUCIS. 2026;9:190–204.
MLA
Mewada, Arvind, et al. “ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments”. Sakarya University Journal of Computer and Information Sciences, vol. 9, no. 1, Mar. 2026, pp. 190-204, doi:10.35377/saucis.1745614.
Vancouver
1. Arvind Mewada, Shahnawaz Ahmad, Mohd Aquib Ansari. ViT-HAR: Vision Transformer-Based Human Activity Recognition in Cluttered Environments. SAUCIS. 2026 Mar. 1;9(1):190-204. doi:10.35377/saucis.1745614

 

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.