


Fusing Self-Supervised Speech Representations with MFCC Stability for Robust Deepfake Audio Detection


Authors : Sadineni Havesa; Golla Susanth Paul; Dr. S. Jagadeesan

Volume/Issue : Volume 11 - 2026, Issue 3 - March


Google Scholar : https://tinyurl.com/27ava5fw

Scribd : https://tinyurl.com/388487w6

DOI : https://doi.org/10.38124/ijisrt/26mar2080



Abstract : Recent advances in neural speech synthesis have enabled the creation of highly realistic deepfakes that are difficult to distinguish from genuine human voices. While these technologies have legitimate applications, they also pose serious risks of voice impersonation, misinformation, and financial fraud, making fake-speech detection a pressing concern in speech forensics and cybersecurity. This paper proposes a dual-branch deep learning framework for deepfake audio detection that combines self-supervised speech representations with handcrafted acoustic stability features. The first branch extracts semantic speech embeddings using the pre-trained WavLM model, which captures contextual and phonetic information from the speech signal. The second branch extracts Mel-Frequency Cepstral Coefficient (MFCC) stability features and processes them with a Temporal Convolutional Network (TCN). The features from both branches are fused by a multilayer perceptron classifier that decides whether an audio sample is real or fake. Experiments on the Fake-or-Real (FoR) dataset show that the proposed fusion approach improves detection performance over single-feature baselines. The results suggest that merging deep contextual embeddings with handcrafted stability features offers better resilience against modern deepfake audio generation methods.
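The abstract describes a two-branch pipeline: MFCC stability features feed one branch, WavLM embeddings feed the other, and a multilayer perceptron fuses both for a real/fake decision. The sketch below illustrates only the stability-feature and fusion steps in plain Python; the paper does not specify the exact stability statistic, so per-coefficient variance of the MFCC delta trajectory is assumed here, and the function names, weights, and omitted WavLM/TCN branches are hypothetical.

```python
import math

def delta(frames):
    """First-order temporal difference between consecutive MFCC frames."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]

def stability_features(mfcc_frames):
    """Per-coefficient variance of the delta trajectory (assumed 'stability' statistic).
    Low variance = stable coefficient trajectory; synthetic speech often differs here."""
    d = delta(mfcc_frames)
    n_coeff = len(mfcc_frames[0])
    feats = []
    for c in range(n_coeff):
        traj = [frame[c] for frame in d]
        mean = sum(traj) / len(traj)
        feats.append(sum((x - mean) ** 2 for x in traj) / len(traj))
    return feats

def mlp_score(fused, w1, w2):
    """Tiny one-hidden-layer MLP: ReLU hidden units, sigmoid output = P(fake)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, fused))) for row in w1]
    logit = sum(w * h for w, h in zip(w2, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

# Fusion: concatenate the (here, stand-in) WavLM embedding with the stability vector
embedding = [0.3, -0.1]                      # placeholder for a pooled WavLM embedding
mfcc = [[1.0, 2.0], [1.1, 2.2], [0.9, 2.1]]  # toy MFCC frames (2 coefficients)
fused = embedding + stability_features(mfcc)
```

In the actual system the embedding would come from a pre-trained WavLM model and the stability vector would pass through a TCN before fusion; this sketch only makes the concatenate-then-classify structure concrete.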

Keywords : Deepfake Audio Detection, Self-Supervised Learning, WavLM, MFCC Stability, Speech Forensics, Temporal Convolutional Networks.


