Image-based gesture recognition is an important area of Human-Computer Interaction (HCI). Despite substantial advances, reliable and accurate gesture recognition in unrestricted, real-world settings remains difficult. Conventional techniques often struggle with changes in lighting, background noise, occlusions, scale variations, and the inherent similarity between different gestures. To enhance the discriminative ability of the Vision Transformer (ViT) model for intricate hand gestures, this work presents a carefully designed fine-tuning methodology. To encourage ViT to concentrate on salient gesture regions while remaining resilient to environmental noise, the proposed method combines an adaptive learning rate scheduling scheme with a novel spatial attention regularizer during fine-tuning. Experiments on a challenging and varied gesture dataset show that the proposed approach significantly outperforms state-of-the-art methods, attaining accuracy of up to 100% and demonstrating generalization capability. This study opens the door to more user-friendly human-computer interaction systems by providing a highly effective and flexible framework for sophisticated image-based gesture recognition.
Keywords
Computer Vision, ViT, Gesture Images, Spatial Attention Regularization, Image Analysis, Hand Gesture Recognition
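The abstract names two ingredients, a spatial attention regularizer and adaptive learning rate scheduling, without giving their exact form. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the regularizer is an entropy penalty on a single attention distribution over image patches (lower entropy means more focused attention) and that the schedule reduces the learning rate when validation loss stops improving. All names (attention_entropy, regularized_loss, PlateauScheduler) and parameters (lam, factor, patience, min_lr) are hypothetical.

```python
import math


def attention_entropy(attn_weights):
    """Shannon entropy of one attention distribution over patches.

    Assumption: attn_weights are non-negative scores (e.g. attention from a
    class token to each patch); they are normalized to sum to 1 here.
    """
    total = sum(attn_weights)
    probs = [w / total for w in attn_weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_loss(task_loss, attn_weights, lam=0.01):
    """Task loss plus a hypothetical entropy penalty favoring focused attention."""
    return task_loss + lam * attention_entropy(attn_weights)


class PlateauScheduler:
    """Hypothetical adaptive schedule: shrink the learning rate on a plateau."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor      # multiplicative decay applied on a plateau
        self.patience = patience  # non-improving epochs tolerated before decay
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the learning rate from the latest validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

Under these assumptions, attention spread evenly over all patches incurs the largest penalty, while attention concentrated on a few patches incurs almost none, which is one way to push the model toward salient gesture regions. The weight lam would need tuning so that the penalty does not dominate the classification loss.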