Authors :
Himanshu Gupta
Volume/Issue :
Volume 9 - 2024, Issue 8 - August
Google Scholar :
https://tinyurl.com/5dn4m77j
Scribd :
https://tinyurl.com/j877upmt
DOI :
https://doi.org/10.38124/ijisrt/IJISRT24AUG1656
Abstract :
The rapid proliferation of data in the digital age
has made big data analytics a critical tool for deriving
insights and making informed decisions. However,
processing and analyzing large datasets, often reaching
hundreds of terabytes, presents significant challenges. This
paper explores the use of Apache Spark, a powerful
distributed computing framework, for batch processing in
big data analytics using artificial intelligence (AI)
techniques. We evaluate the scalability, efficiency, and
accuracy of AI models when applied to massive datasets
processed in Spark. Our experiments demonstrate that
Apache Spark, coupled with machine learning and deep
learning techniques, offers a robust solution for handling
large-scale data analytics tasks. We also discuss the
challenges associated with such large-scale processing and
propose strategies for optimizing performance and
resource utilization.