Authors :
K. Sandhya; P. Nithik Roshan; K. Bala Manikanta; Priyanka Pandarinath
Volume/Issue :
Volume 10 - 2025, Issue 5 - May
Google Scholar :
https://tinyurl.com/4bt3hhja
DOI :
https://doi.org/10.38124/ijisrt/25may1394
Abstract :
Automatic speech recognition (ASR) has advanced from recognizing a limited set of sounds to fluently understanding natural
language. It is used in voice search, virtual assistants, and speech-to-text systems to enhance user experience and
productivity, having begun with basic sound recognition and evolved into comprehensive language comprehension. Despite
significant advancements in ASR technology, existing systems often struggle to accurately transcribe spoken language in
contexts where semantic nuances and contextual cues play a crucial role. The problem arises from the inherent limitations of
conventional ASR approaches, which cannot comprehensively understand and interpret contextual information efficiently,
resulting in inaccuracies, misinterpretations, and errors in transcriptions, especially in scenarios involving ambiguous or
context-dependent speech. Incorporating Natural Language Processing (NLP) techniques into ASR systems presents a
promising avenue to address this challenge.
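The contextual disambiguation described above can be illustrated with a toy rescoring sketch: when several homophone candidates receive nearly identical acoustic scores, a language model supplies the contextual cue that picks the right word. This is a minimal, illustrative example only; the bigram probabilities and scores below are invented for demonstration, and a real system would train a language model on a large corpus.

```python
import math

# Toy bigram language model with hand-picked log-probabilities.
# (Illustrative values only, not trained on real data.)
BIGRAM_LOGP = {
    ("over", "there"): math.log(0.8),
    ("over", "their"): math.log(0.1),
    ("over", "they're"): math.log(0.1),
}

def rescore(prev_word, candidates):
    """Pick the candidate maximizing acoustic + language-model score,
    i.e. a noisy-channel decode: argmax_W [log P(X|W) + log P(W)]."""
    return max(
        candidates,
        # Unseen bigrams get a small floor probability.
        key=lambda c: c[1] + BIGRAM_LOGP.get((prev_word, c[0]), math.log(1e-6)),
    )[0]

# Three homophones whose acoustic log-likelihoods are nearly identical;
# the acoustic model alone slightly prefers "their".
candidates = [("there", -1.0), ("their", -0.9), ("they're", -1.1)]
print(rescore("over", candidates))  # context selects "there"
```

Here the acoustic model alone would output "their" (score -0.9), but adding the bigram context after "over" flips the decision to "there", mirroring how NLP techniques can correct context-dependent ASR errors.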
Keywords :
Audio Signal Processing, Feature Extraction, Acoustic Modeling, Language Modeling, Decoding, Post-Processing.