SpeakVision: A Comprehensive Survey on End-to-End Sentence Level Lipreading


Authors : Ashwini M Rayannavar; Rakshit Chouhan; Aman Ali Gazi; Maitree Rajesh Patel

Volume/Issue : Volume 9 - 2024, Issue 11 - November


Google Scholar : https://tinyurl.com/ycx8vppc

Scribd : https://tinyurl.com/5n7sfsy3

DOI : https://doi.org/10.5281/zenodo.14330071


Abstract : SpeakVision is a speech-reading framework that extracts spoken text from audio-video inputs using an AI-based model. An integrated approach that uses both sight and sound is needed in situations where the audio signal is obscured, or where the speaker's lip movements are easier to see than to hear. SpeakVision combines 3D convolutional layers for spatial feature extraction, bidirectional LSTMs for temporal modelling, and CTC decoding for text generation. Video preprocessing techniques were applied to optimize model performance, and the results were presented through an easy-to-use Streamlit interface for interactive visualization.
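SpeakVision's own implementation is not reproduced on this page, but the pipeline the abstract describes — 3D convolutions for spatiotemporal features, a bidirectional LSTM over the frame sequence, and per-frame character scores for CTC decoding — can be sketched in PyTorch as follows. Layer widths, the vocabulary size, and the 46×140 mouth-crop resolution are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """Sketch of the described pipeline: 3D convolutions extract
    spatiotemporal lip features, a bidirectional LSTM models the frame
    sequence, and a linear layer emits per-frame character scores that
    a CTC decoder turns into text."""

    def __init__(self, vocab_size: int = 28, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # downsample space, keep time
        )
        # Adaptive pooling fixes the spatial grid so the LSTM input size
        # does not depend on the exact crop resolution.
        self.pool = nn.AdaptiveAvgPool3d((None, 4, 8))
        self.lstm = nn.LSTM(32 * 4 * 8, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)  # vocab includes the CTC blank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) grayscale mouth crops
        f = self.pool(self.conv(x))                  # (batch, 32, frames, 4, 8)
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.lstm(f)                        # (batch, frames, 2*hidden)
        return self.fc(out)                          # per-frame character logits

# Forward pass on a dummy clip: 75 frames of 46x140 mouth crops.
model = LipReadingNet()
clip = torch.randn(2, 1, 75, 46, 140)
logits = model(clip)                                 # (batch, frames, vocab)

# Training would apply log_softmax over characters and minimize CTC loss;
# greedy CTC decoding collapses repeats and drops blanks at inference.
log_probs = logits.log_softmax(-1).permute(1, 0, 2)  # (frames, batch, vocab)
```

At inference, the per-frame distributions are decoded into a sentence by a CTC decoder (greedy or beam search), which is what turns frame-level character scores into the final transcript.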

References :

  1. Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. LipNet: End-to-End Sentence-level Lipreading. arXiv preprint arXiv:1611.01599, 2016.
  2. Y. Fu, S. Yan, and T. S. Huang. Classification and feature extraction by simplification. IEEE Transactions on Information Forensics and Security, 3(1):91–100, 2008.
  3. A. Garg, J. Noyola, and S. Bagadia. Lip reading using CNN and LSTM. Technical report, Stanford University, CS231n project report, 2016.
  4. A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
  5. H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.
  6. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Lipreading using convolutional neural network. In INTERSPEECH, pp. 1149–1153, 2014.
  7. M. F. Woodward and C. G. Barber. Phoneme perception in lipreading. Journal of Speech, Language, and Hearing Research, 3(3):212–222, 1960.
