Authors :
Shwetaba B. Chauhan; Japan M. Mavani
Volume/Issue :
Volume 10 - 2025, Issue 12 - December
Google Scholar :
https://tinyurl.com/bddsdeef
Scribd :
https://tinyurl.com/4wnck3uk
DOI :
https://doi.org/10.38124/ijisrt/25dec103
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Note : Google Scholar may take 30 to 40 days to display the article.
Abstract :
Sentiment analysis of online product reviews is a crucial natural language processing task with significant
business applications. This paper presents a comprehensive study of sentiment classification on a large Amazon Reviews
dataset from Kaggle, containing millions of reviews with star ratings, by fine-tuning state-of-the-art transformer models.
We experiment with BERT, RoBERTa, DistilBERT, and a generative GPT-2 model for sentiment classification (positive
vs. negative). A complete machine learning pipeline is implemented in Python using Google Colab, including data cleaning,
tokenization with attention masks, and transformer fine-tuning with classification layers. We conduct an extensive
literature review of sentiment analysis techniques, from early lexicon-based and machine learning methods to modern
deep learning approaches. Our experiments involve hyperparameter tuning and rigorous evaluation using accuracy, precision,
recall, F1-score, and ROC-AUC metrics. Results show that RoBERTa achieves the highest accuracy (around 95–96%),
slightly outperforming BERT, while the lightweight DistilBERT model offers competitive performance (~94% F1) with
smaller size and faster inference. GPT-2, despite its unidirectional nature, also achieves good accuracy (~92–93%) but lags
behind the bidirectional models. We analyze performance trade-offs, model size vs. accuracy, and issues like class
imbalance and overfitting. An ablation study highlights the impact of including review title text and the importance of
fine-tuning the entire model. We also perform error analysis, revealing that models occasionally misclassify reviews with
mixed sentiments or sarcasm. The findings demonstrate that transformer-based approaches substantially improve
sentiment analysis accuracy on large-scale review data, offering robust solutions for real-world e-commerce analytics.
Future work will explore larger transformer models, multi-class sentiment ratings, and domain adaptation techniques.
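The evaluation protocol described in the abstract (accuracy, precision, recall, F1-score, and ROC-AUC) can be sketched with scikit-learn. The labels and scores below are illustrative placeholders, not results from the paper; ROC-AUC is computed from the model's positive-class probabilities, while the other metrics use thresholded predictions.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical ground truth and model probabilities (1 = positive, 0 = negative)
y_true  = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4]

# Hard predictions at the conventional 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    # ROC-AUC is threshold-free, so it takes the raw scores
    "roc_auc":   roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting F1 alongside accuracy, as the paper does, matters when the positive/negative classes are imbalanced, since accuracy alone can look strong while one class is poorly predicted.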
References :
- Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 79–86. (One of the first works to apply machine learning to sentiment analysis of reviews.)
- Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the ACL, 417–424. (Introduced an unsupervised method using phrase semantic orientation for review sentiment.)
- Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of ICWSM, 216–225. (Introduced a lexicon and rule-based sentiment tool widely used as a baseline for sentiment analysis.)
- Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of EMNLP 2014, 1746–1751. (Demonstrated that a simple CNN with pre-trained word vectors can achieve strong results on sentiment and other text classification tasks.)
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, 4171–4186. (Introduced BERT, a transformative model for NLP that we fine-tune in this paper.)
- Liu, Y., Ott, M., Goyal, N., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692. (Improved BERT’s pre-training method; our results show RoBERTa’s efficacy on sentiment analysis.)
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108. (Proposed model compression for BERT; we validated that DistilBERT maintains strong performance with fewer parameters.)
- Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (the GPT-3 paper). (Demonstrated the power of very large transformers; mentioned for context about GPT models.)
- Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient Text Classification. Proceedings of EACL 2017, 427–431. (Introduced fastText, achieving strong sentiment classification on Amazon review data with simple models; provides a baseline around 94–95% accuracy for comparison.)
- Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. Advances in NeurIPS 28, 649–657. (Created the Amazon Review Polarity dataset we used via Kaggle and reported initial benchmark results; our work significantly improves on those with transformers.)
- Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2020). Deep Sentiment Classification: A Comprehensive Review. arXiv:2006.00388. (A survey paper on deep learning approaches to sentiment analysis, providing context and background.)
- Maheshwary, P., & Sastry, V. N. (2020). Sentiment Analysis of Amazon Product Reviews using Machine Learning Techniques. International Journal of Intelligent Systems and Applications, 12(3), 9. (Analyzed Amazon reviews with machine learning; supports some observations on the effectiveness of advanced models over traditional ones.)