


Multimodal Command System for Human–Computer Interaction


Authors : Harikumar M.; Prasanth D.; Reuben Abraham George

Volume/Issue : Volume 11 - 2026, Issue 2 - February


Google Scholar : https://tinyurl.com/4xr5hmbc

Scribd : https://tinyurl.com/yvumv3um

DOI : https://doi.org/10.38124/ijisrt/26feb1459



Abstract : Recent advances in computing have increased the demand for interaction paradigms that enable intuitive and efficient communication between users and digital systems. Traditional interfaces such as keyboards and mice often limit accessibility and natural interaction, particularly in complex or hands-free environments. This paper proposes a multimodal human–computer interaction system that integrates hand gesture recognition and voice command processing to enable seamless desktop control and interaction with external applications. The system employs MediaPipe-based hand landmark extraction and speech-to-text processing, coordinated through an AI agent using the Model Context Protocol (MCP) for task orchestration and service integration. Experimental evaluation demonstrates that the proposed framework achieves high recognition accuracy with low latency, supporting real-time interaction across local and cloud-based services. The results indicate that multimodal fusion combined with agent-based automation enhances usability, responsiveness, and accessibility, positioning the system as a scalable solution for next-generation human–computer interaction.

Keywords : Human–Computer Interaction, Hand Gesture Recognition, MediaPipe, Lightweight CNN, Multimodal Systems.
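
As a concrete illustration of the gesture pathway the abstract describes, the Python sketch below extracts the 21 hand landmarks per frame with MediaPipe and maps a crude finger count to a desktop command. It is a minimal sketch only: it assumes the legacy mp.solutions.hands API, and the finger-count heuristic and the names COMMANDS and count_extended are hypothetical placeholders for illustration, not the authors' classifier or their MCP-based orchestration layer.

import cv2
import mediapipe as mp

FINGER_TIPS = [8, 12, 16, 20]   # index, middle, ring, pinky fingertip landmarks
FINGER_PIPS = [6, 10, 14, 18]   # the corresponding PIP joints

# Hypothetical gesture-to-command table, for illustration only.
COMMANDS = {1: "cursor_move", 2: "left_click", 5: "stop_listening"}

def count_extended(hand):
    # A finger counts as extended when its tip sits above its PIP joint
    # (normalized image coordinates, where y grows downward).
    lm = hand.landmark  # 21 normalized (x, y, z) landmarks per hand
    return sum(lm[t].y < lm[p].y for t, p in zip(FINGER_TIPS, FINGER_PIPS))

cap = cv2.VideoCapture(0)  # default webcam
with mp.solutions.hands.Hands(max_num_hands=1,
                              min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            n = count_extended(results.multi_hand_landmarks[0])
            command = COMMANDS.get(n)
            if command:
                print(f"{n} finger(s) extended -> {command}")
        cv2.imshow("hand tracking", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
            break
cap.release()
cv2.destroyAllWindows()

In the full system described in the abstract, the recognized command would be passed to the AI agent, which invokes local or cloud services over MCP, rather than simply being printed.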

