Big Data Analytics using Artificial Intelligence: Apache Spark for Scalable Batch Processing


Authors : Himanshu Gupta

Volume/Issue : Volume 9 - 2024, Issue 8 - August

Google Scholar : https://tinyurl.com/5dn4m77j

Scribd : https://tinyurl.com/j877upmt

DOI : https://doi.org/10.38124/ijisrt/IJISRT24AUG1656

Abstract : The rapid proliferation of data in the digital age has made big data analytics a critical tool for deriving insights and making informed decisions. However, processing and analyzing large datasets, often reaching hundreds of terabytes, presents significant challenges. This paper explores the use of Apache Spark, a powerful distributed computing framework, for batch processing in big data analytics using artificial intelligence (AI) techniques. We evaluate the scalability, efficiency, and accuracy of AI models when applied to massive datasets processed in Spark. Our experiments demonstrate that Apache Spark, coupled with machine learning and deep learning techniques, offers a robust solution for handling large-scale data analytics tasks. We also discuss the challenges associated with such large-scale processing and propose strategies for optimizing performance and resource utilization.


