Authors :
Harikumar M.; Prasanth D.; Reuben Abraham George
Volume/Issue :
Volume 11 - 2026, Issue 2 - February
Google Scholar :
https://tinyurl.com/4xr5hmbc
Scribd :
https://tinyurl.com/yvumv3um
DOI :
https://doi.org/10.38124/ijisrt/26feb1459
Abstract :
Recent advances in computing have increased the demand for interaction paradigms that enable intuitive and
efficient communication between users and digital systems. Traditional interfaces such as keyboards and mice often limit
accessibility and natural interaction, particularly in complex or hands-free environments. This paper proposes a multimodal
human–computer interaction system that integrates hand gesture recognition and voice command processing to enable
seamless desktop control and interaction with external applications. The system employs MediaPipe-based hand landmark
extraction and speech-to-text processing, coordinated through an AI agent using the Model Context Protocol (MCP) for task
orchestration and service integration. Experimental evaluation demonstrates that the proposed framework achieves high
recognition accuracy with low latency, supporting real-time interaction across local and cloud-based services. The results
indicate that multimodal fusion combined with agent-based automation enhances usability, responsiveness, and accessibility,
positioning the system as a scalable solution for next-generation human–computer interaction.
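As a rough illustration of the gesture side of such a pipeline, a rule-based classifier over MediaPipe-style hand landmarks might look like the sketch below. It assumes 21 (x, y) points indexed per MediaPipe's hand model (wrist = 0, index fingertip = 8, etc.) with y increasing downward; the gesture names and thresholds are illustrative, not the paper's actual method.

```python
# Sketch: rule-based gesture labeling from MediaPipe-style hand landmarks.
# Input: list of 21 (x, y) tuples in image coordinates, y increasing downward.
# Landmark indices follow MediaPipe's hand model; gesture labels are hypothetical.

FINGER_TIPS = [8, 12, 16, 20]   # index, middle, ring, pinky fingertips
FINGER_PIPS = [6, 10, 14, 18]   # corresponding PIP joints

def finger_extended(landmarks, tip, pip):
    # With y increasing downward, an extended finger's tip sits above its PIP joint.
    return landmarks[tip][1] < landmarks[pip][1]

def classify_gesture(landmarks):
    extended = sum(finger_extended(landmarks, t, p)
                   for t, p in zip(FINGER_TIPS, FINGER_PIPS))
    if extended == 0:
        return "fist"
    if extended == 4:
        return "open_palm"
    if extended == 1 and finger_extended(landmarks, 8, 6):
        return "point"
    return "unknown"
```

In a full system, the resulting label would be one input to the AI agent, which fuses it with the transcribed voice command and dispatches the corresponding action over MCP.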
Keywords :
Human–Computer Interaction, Hand Gesture Recognition, MediaPipe, Lightweight CNN, Multimodal Systems.
References :
- Jia J, Hu Z, Wang R, et al. "A Multimodal Human-Computer Interaction System and Its Applications." Sensors, 2020;20(11):3215.
- Ridhun M, et al. "Multimodal Human Computer Interaction Using Hand and Speech Recognition." Human-Computer Interaction. ICICT, 2022.
- Wu J, et al. "Fusing multi-modal features for gesture recognition." Proc. 15th ACM Int. Conf. Multimodal Interaction, 2013:453-6.
- Siddiqui N, Chan RHM. "Multimodal hand gesture recognition using single IMU and acoustic measurements at wrist." PLoS ONE, 2020;15(1):e0227039.
- Agrawal A, et al. "Vision-based multimodal human-computer interaction technique using dynamic hand gesture recognition." 2013 IEEE Int. Conf. Image Information Processing.
- Ravanbakhsh S, Pitsikalis V, Katsamanis A, et al. "Multimodal Gesture Recognition via Multiple Hypotheses Rescoring." J. Machine Learning Research, 2015;16:261-294.
- Gao Q, Liu J, Ju Z. "Hand gesture recognition using multimodal data fusion and multiscale parallel CNN." Expert Systems, 2021.
- Cohen PR, Oviatt S, Wu L, et al. "The role of voice input for human-machine communication." PNAS, 1995;92(22):9921-9927.
- Liu J, Li Y, Sun J, et al. "A survey of speech-hand gesture recognition for the development of multimodal interfaces in human-computer interaction." IEEE Trans. Human-Machine Systems, 2010;40(6):465-79.
- El-Azazy AAMEH, et al. "Enhancing Human-Computer Interaction through Speech Recognition and AI." Engineering Research Journal, 2025;54(1):59-102.
- Wu X, et al. "Multimodal gesture recognition." Proc. 19th ACM Int. Conf. Multimodal Interaction, 2017.
- Merge AI. "5 real-world Model Context Protocol integration examples." Merge AI Blog, 2025.
- OpenAI. "Model Context Protocol (MCP) - OpenAI Agents SDK." 2025.
- Anthropic. "Introducing the Model Context Protocol." Anthropic News, 2024.
- Cyclr. "Model Context Protocol (MCP) for AI Integration." Cyclr.com, 2025.
- Oviatt S. "Multimodal Interfaces." CRC Press, 2003.
- Montero CS, et al. "Multimodal interaction: A review." Computer Science Review, 2022;43:1-15.
- Cardenas EJE, et al. "Multimodal hand gesture recognition combining temporal information." J. Visual Communication and Image Representation, 2020.
- Jaimes A, Sebe N. "Multimodal human–computer interaction: A survey." Computer Vision and Image Understanding, 2007;108(1–2):116-34.
- Dysnix. "MCP Architecture: Advanced Techniques Review." Dysnix Blog, 2025. https://dysnix.com/mcp-architecture-review.
- Katsamanis A, et al. "Multimodal Gesture Recognition for HCI." Artificial Intelligence Review, 2016;43(1):1–54.
- Katsamanis A. "Understanding Gesture and Speech Multimodal Communication." ACM Digital Library, 2020.
- Wu Y, et al. "A new human-computer interaction paradigm: Agent interaction model based on large models and its prospects." Virtual Reality & Intelligent Hardware, 2025;7(3):237-266.
- Rusan HA, et al. "Human-Computer Interaction Through Voice Commands Based on Deep Learning." Proc. 2022 Int. Conf. Electrical and Computing Technologies and Applications. IEEE, 2022.
- Chaturvedi S. "Voice Recognition Systems: An example of human–computer interaction." SSRN Electronic Journal, 2024.
- Kettebekov S, et al. "Understanding gestures in multimodal human-computer interaction." International Journal of Human–Computer Studies, 2000;53:153-170.