Authors :
Ashwini M Rayannavar; Rakshit Chouhan; Aman Ali Gazi; Maitree Rajesh Patel
Volume/Issue :
Volume 9 - 2024, Issue 11 - November
Google Scholar :
https://tinyurl.com/ycx8vppc
Scribd :
https://tinyurl.com/5n7sfsy3
DOI :
https://doi.org/10.5281/zenodo.14330071
Abstract :
SpeakVision is a speech-reading framework capable of extracting speech from audio-video inputs using an AI-based model. An integrated approach that combines sight and sound is needed in situations where the voice signal is obscured, or where the speaker can be seen more easily than heard. SpeakVision leverages AI techniques such as 3D convolutional layers for extracting spatial features, bidirectional LSTMs for modelling temporal information, and CTC decoding for generating text. Video preprocessing techniques were applied to optimize model performance, and the results were presented through an easy-to-use Streamlit interface for interactive visualization.
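As a rough illustration of the pipeline the abstract describes (3D convolutions for spatial features, bidirectional LSTMs for temporal context, and a CTC-style per-frame character output), the following is a minimal sketch assuming TensorFlow/Keras. The frame count, crop size, vocabulary, and layer widths are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of a 3D-Conv + BiLSTM + CTC-output lipreading model (assumed Keras stack).
# All hyperparameters below are illustrative assumptions, not values from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB = "abcdefghijklmnopqrstuvwxyz' "   # assumed character set
NUM_FRAMES, HEIGHT, WIDTH = 75, 46, 140  # assumed mouth-crop dimensions

def build_lipreading_model():
    inputs = layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 1))

    # 3D convolutional layers extract spatio-temporal features from the frame sequence.
    x = layers.Conv3D(32, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(64, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

    # Flatten each frame's spatial feature maps so every time step is one vector.
    x = layers.TimeDistributed(layers.Flatten())(x)

    # Bidirectional LSTMs model temporal dependencies in both directions.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # Per-frame character probabilities; the extra class is the CTC blank token.
    outputs = layers.Dense(len(VOCAB) + 1, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_lipreading_model()
model.summary()
```

At inference time, the per-frame probabilities would be collapsed into text with a CTC decoder (for example, tf.keras.backend.ctc_decode), and a model of this kind could then be wrapped in a Streamlit app for the interactive visualization the abstract mentions.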
References :
- Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
- Y. Fu, S. Yan, and T. S. Huang. Classification and feature extraction by simplification. IEEE Transactions on Information Forensics and Security, 3(1):91–100, 2008.
- A. Garg, J. Noyola, and S. Bagadia. Lip reading using CNN and LSTM. Technical report, Stanford University, CS231n project report, 2016.
- A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
- H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.
- K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Lipreading using convolutional neural network. In INTERSPEECH, pp. 1149–1153, 2014.
- M. F. Woodward and C. G. Barber. Phoneme perception in lipreading. Journal of Speech, Language, and Hearing Research, 3(3):212–222, 1960.