Deep Learning Model for Lip Reading to Improve Accessibility


Authors : Sonia Singh B; Shubhaprada K P

Volume/Issue : Volume 8 - 2023, Issue 8 - August

Google Scholar : https://bit.ly/3TmGbDi

Scribd : https://tinyurl.com/4kehztc6

DOI : https://doi.org/10.5281/zenodo.8327791

Abstract : This project proposes an end-to-end deep learning architecture for word-level visual speech recognition that requires no explicit word boundary information. The architecture combines spatiotemporal convolutional layers, Residual Networks (ResNets), and bidirectional Long Short-Term Memory (Bi-LSTM) networks, and is trained with the Connectionist Temporal Classification (CTC) loss function. Data preprocessing isolates the mouth region through facial landmark extraction, image cropping, resizing, grayscale conversion, and data augmentation. The model is implemented in TensorFlow and trained with an adaptive learning rate schedule. With this approach, the proposed system performs end-to-end lip reading from video frames and implicitly identifies keywords in utterances, and analysis of the CTC loss confirms the model's effectiveness. The results suggest practical applications in dictation, hearing aids, and biometric authentication, advancing visual speech recognition beyond traditional methods.

Keywords : Recurrent Neural Network, Long Short-Term Memory, Graphics Processing Unit, Solid State Drive, Text-to-Speech, Application Programming Interface, Audio-Visual, Lip Reading, Bidirectional Long Short-Term Memory, Graphical User Interface, Red Green Blue, Mean Squared Error, Mean Absolute Error, Adaptive Moment Estimation.
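The preprocessing pipeline described in the abstract (facial landmark extraction, mouth cropping, resizing, and grayscale conversion) can be illustrated with a short sketch. This is a minimal illustration rather than the authors' code: it assumes dlib's standard 68-point landmark model (mouth points 48 to 67), OpenCV for cropping and resizing, and a hypothetical 100x50 output size.

    import cv2
    import dlib
    import numpy as np

    # Assumption: the standard dlib 68-point facial landmark predictor file.
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def mouth_roi(frame, size=(100, 50)):
        """Return a normalized grayscale crop of the mouth region, or None."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return None
        shape = predictor(gray, faces[0])
        # Points 48-67 of the 68-point model outline the mouth.
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
        x, y, w, h = cv2.boundingRect(pts)
        pad = 10  # small margin around the lips
        crop = gray[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
        return cv2.resize(crop, size) / 255.0  # grayscale, resized, normalized

The architecture itself (a spatiotemporal 3D convolutional front end, residual blocks, Bi-LSTM layers, and CTC loss) can be sketched in TensorFlow/Keras as follows. The frame count (75), crop size (50x100), vocabulary size, and learning rate schedule here are assumptions for illustration, not values reported in the paper.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    NUM_CLASSES = 40  # hypothetical: output characters plus the CTC blank

    def residual_block(x, filters):
        """A basic 2D residual block applied frame by frame."""
        shortcut = x
        y = layers.TimeDistributed(layers.Conv2D(filters, 3, padding="same"))(x)
        y = layers.TimeDistributed(layers.BatchNormalization())(y)
        y = layers.Activation("relu")(y)
        y = layers.TimeDistributed(layers.Conv2D(filters, 3, padding="same"))(y)
        y = layers.TimeDistributed(layers.BatchNormalization())(y)
        if shortcut.shape[-1] != filters:  # match channels on the skip path
            shortcut = layers.TimeDistributed(layers.Conv2D(filters, 1))(shortcut)
        return layers.Activation("relu")(layers.add([y, shortcut]))

    inputs = layers.Input(shape=(75, 50, 100, 1))        # (frames, height, width, channels)
    x = layers.Conv3D(32, (3, 5, 5), strides=(1, 2, 2),  # spatiotemporal front end
                      padding="same", activation="relu")(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    x = residual_block(x, 64)
    x = residual_block(x, 128)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)  # one vector per frame
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)    # per-frame posteriors
    model = Model(inputs, outputs)

    def ctc_loss(y_true, y_pred):
        # CTC aligns per-frame predictions with the label sequence, which is
        # what removes the need for explicit word boundaries. This sketch
        # assumes labels are padded to a fixed length.
        batch = tf.shape(y_pred)[0]
        input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])
        label_len = tf.fill([batch, 1], tf.shape(y_true)[1])
        return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

    # Adam ("Adaptive Moment Estimation" in the keywords) with a simple
    # adaptive schedule: hold the rate, then decay it per epoch (assumed values).
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
    lr_schedule = tf.keras.callbacks.LearningRateScheduler(
        lambda epoch, lr: lr if epoch < 30 else lr * 0.95)

At inference time, tf.keras.backend.ctc_decode can collapse the per-frame posteriors back into a predicted label sequence, which is how such a model reads words directly from a video clip.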
