Transformer-Based Sentiment Analysis on Amazon Reviews Using Kaggle Dataset


Authors: Shwetaba B. Chauhan; Japan M. Mavani

Volume/Issue: Volume 10 - 2025, Issue 12 - December

Google Scholar: https://tinyurl.com/bddsdeef

Scribd: https://tinyurl.com/4wnck3uk

DOI: https://doi.org/10.38124/ijisrt/25dec103



Abstract: Sentiment analysis of online product reviews is a crucial natural language processing task with significant business applications. This paper presents a comprehensive study of sentiment classification on a large Amazon Reviews dataset from Kaggle, containing millions of reviews with star ratings, by fine-tuning state-of-the-art transformer models. We experiment with BERT, RoBERTa, DistilBERT, and a generative GPT-2 model for sentiment classification (positive vs. negative). A complete machine learning pipeline is implemented in Python using Google Colab, including data cleaning, tokenization with attention masks, and transformer fine-tuning with classification layers. We conduct an extensive literature review of sentiment analysis techniques, from early lexicon-based and machine learning methods to modern deep learning approaches. Our experiments involve hyperparameter tuning and rigorous evaluation using accuracy, precision, recall, F1-score, and ROC-AUC metrics. Results show that RoBERTa achieves the highest accuracy (around 95–96%), slightly outperforming BERT, while the lightweight DistilBERT model offers competitive performance (~94% F1) with a smaller size and faster inference. GPT-2, despite its unidirectional nature, also achieves good accuracy (~92–93%) but lags behind the bidirectional models. We analyze performance trade-offs between model size and accuracy, along with issues such as class imbalance and overfitting. An ablation study highlights the impact of including review title text and the importance of fine-tuning the entire model. We also perform error analysis, revealing that models occasionally misclassify reviews with mixed sentiments or sarcasm. The findings demonstrate that transformer-based approaches substantially improve sentiment analysis accuracy on large-scale review data, offering robust solutions for real-world e-commerce analytics. Future work will explore larger transformer models, multi-class sentiment ratings, and domain adaptation techniques.
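To make the pipeline described in the abstract concrete, the sketch below illustrates one way such a fine-tuning setup could look in Python with the Hugging Face Transformers library, PyTorch, and scikit-learn. It is a minimal illustration under assumptions, not the authors' published code: the variables train_texts, train_labels, val_texts, and val_labels are hypothetical placeholders for the preprocessed Kaggle review data, the hyperparameters are typical fine-tuning values rather than the paper's tuned settings, and the example uses DistilBERT only because it is the lightest of the models studied.

# Minimal sketch (assumptions noted above) of fine-tuning DistilBERT for
# binary sentiment classification with tokenization, attention masks,
# and evaluation by accuracy, precision, recall, F1, and ROC-AUC.
import torch
from torch.utils.data import Dataset
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

class ReviewDataset(Dataset):
    """Wraps tokenized reviews (input_ids + attention_mask) and binary labels."""
    def __init__(self, texts, labels):
        # Truncate/pad reviews to a fixed length; attention masks are produced automatically.
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def compute_metrics(eval_pred):
    """Accuracy, precision, recall, F1, and ROC-AUC on the validation split."""
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=-1)[:, 1].numpy()
    preds = logits.argmax(axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc_score(labels, probs),
    }

training_args = TrainingArguments(
    output_dir="./sentiment-distilbert",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    learning_rate=2e-5,        # illustrative value typical for BERT-family fine-tuning
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    # train_texts / train_labels / val_texts / val_labels are hypothetical
    # placeholders for the cleaned Amazon review data and its 0/1 labels.
    train_dataset=ReviewDataset(train_texts, train_labels),
    eval_dataset=ReviewDataset(val_texts, val_labels),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())

Swapping in BERT, RoBERTa, or a GPT-2 classification head would follow the same pattern, changing only the tokenizer and model classes; the full-model fine-tuning shown here (rather than freezing the encoder) reflects the ablation finding reported in the abstract.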

References:

  1. Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 79–86. (One of the first works to apply machine learning to sentiment analysis of reviews.)
  2. Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the ACL, 417–424. (Introduced an unsupervised method using phrase semantic orientation for review sentiment.)
  3. Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of ICWSM, 216–225. (Introduced a lexicon- and rule-based sentiment tool widely used as a baseline for sentiment analysis.)
  4. Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of EMNLP 2014, 1746–1751. (Demonstrated that a simple CNN with pre-trained word vectors can achieve strong results on sentiment and other text classification tasks.)
  5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, 4171–4186. (Introduced BERT, a transformative model for NLP that we fine-tune in this paper.)
  6. Liu, Y., Ott, M., Goyal, N., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692. (Improved BERT's pre-training method; our results show RoBERTa's efficacy on sentiment analysis.)
  7. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108. (Proposed model compression for BERT; we validated that DistilBERT maintains strong performance with fewer parameters.)
  8. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (the GPT-3 paper). (Demonstrated the power of very large transformers; mentioned for context about GPT models.)
  9. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient Text Classification. Proceedings of EACL 2017, 427–431. (Introduced fastText, achieving strong sentiment classification on Amazon review data with simple models; provides a baseline around 94–95% accuracy for comparison.)
  10. Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28, 649–657. (Created the Amazon Review Polarity dataset we used via Kaggle and reported initial benchmark results; our work significantly improves on those with transformers.)
  11. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2020). Deep Sentiment Classification: A Comprehensive Review. arXiv:2006.00388. (A survey paper on deep learning approaches to sentiment analysis, providing context and background.)
  12. Maheshwary, P., & Sastry, V. N. (2020). Sentiment Analysis of Amazon Product Reviews using Machine Learning Techniques. International Journal of Intelligent Systems and Applications, 12(3), 9. (Analyzed Amazon reviews with machine learning; supports some observations on the effectiveness of advanced models over traditional ones.)
