Authors :
S. Saraswathi; Vignesh M.; Vijayalakshmi S.; Tholkapian M.
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/2u7swwpd
Scribd :
https://tinyurl.com/4zymy2tx
DOI :
https://doi.org/10.38124/ijisrt/26apr952
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
The increasing adoption of Large Language Models (LLMs) has created a critical need for high-quality, structured
training data derived from real-world, unstructured sources such as documents and web content. However, existing research
primarily focuses on model architectures or isolated data processing stages, lacking a unified solution for end-to-end dataset
preparation. This project proposes a reusable, production-ready Dataset Preparation Pipeline that automates the ingestion,
cleaning, chunking, task-specific data generation, and quality evaluation of raw data for multiple LLM applications
including question answering, summarization, and classification. The system integrates automated quality assessment to
filter low-quality or biased samples and exports curated datasets in standard formats for seamless downstream use. By
adopting a data-centric approach, the proposed pipeline bridges the gap between raw real-world data and reliable LLM
training datasets, enabling scalable, continuous, and efficient dataset generation for modern language model development.
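The staged flow the abstract describes (ingestion → cleaning → chunking → task-specific generation → quality filtering → export) can be sketched as a minimal Python skeleton. All function names, thresholds, and the instruction template below are illustrative assumptions, not the authors' actual implementation; the task-specific generation step, which would normally call an LLM, is stubbed out.

```python
import json
import re

def clean(text: str) -> str:
    """Normalize whitespace and collapse runs into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, max_words: int = 100) -> list[str]:
    """Split cleaned text into fixed-size word windows (assumed chunking policy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def quality_ok(sample: dict, min_len: int = 20) -> bool:
    """Toy quality gate: drop samples whose input is too short.
    A real pipeline would also screen for bias and duplication."""
    return len(sample["input"].split()) >= min_len

def build_dataset(raw_docs: list[str], task: str = "summarization") -> list[dict]:
    """Run documents through the full staged pipeline and keep passing samples."""
    samples = []
    for doc in raw_docs:
        for piece in chunk(clean(doc)):
            # Task-specific generation would invoke an LLM here; we only
            # record an instruction/input pair as a placeholder.
            samples.append({"task": task,
                            "instruction": f"Perform {task} on the input.",
                            "input": piece})
    return [s for s in samples if quality_ok(s)]

def export_jsonl(samples: list[dict], path: str) -> None:
    """Export curated samples as JSON Lines, a common fine-tuning format."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

JSON Lines is chosen here only as one example of the "standard formats" the abstract mentions; swapping in CSV or Parquet export would leave the pipeline stages unchanged.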
Keywords :
Large Language Models (LLMs), Dataset Preparation Pipeline, Data-Centric AI, Quality Assessment, Training Data Generation.
References :
- Aparna Nayak, Bojan Božić, and Luca Longo, “Data Quality Assessment and Recommendation of Feature Selection Algorithms: An Ontological Approach”, Journal of Web Engineering, Vol. 22, No. 1, pp. 175–196, 2023, River Publishers.
- Khalid M. Kahloot and Peter Ekler, “Algorithmic Splitting: A Method for Dataset Preparation”, IEEE Access, published September 6, 2021.
- Ndung’u Rachael Njeri, “Data Preparation for Machine Learning Modelling”, International Journal of Computer Applications Technology and Research, 2022.
- Hongming Li, Yang Liu, and Chao Huang, “Entropy-Based Data Selection for Language Models”, IEEE open-access journal, published September 2, 2025.
- Adam Lahouari, Jutta Rogal, and Mark E. Tuckerman, “Automated Machine Learning Pipeline for Training and Analysis Using Large Language Models”, International Journal of Innovative Science and Research Technology, published September 29, 2025.
- Mihai Nadas, Laura Diosan, and Andreea Tomescu, “Synthetic Data Generation Using Large Language Models: Advances in Text and Code”, accepted for publication in IEEE Access.
- Alex Tacuri, Sergio Firmenich, Alejandro Fernández, Florencia Riva, Matías Urbieta, and Gustavo Rossi, “Web Scraping by End User”, IEEE Access, November 25, 2025.
- Mehedi Hasan, Shayma Islam Shifa, Kashif Niaz, and Md Mahedi Hasan Shuvo, “Continuous Data Curation and Valuation for Long-Term Machine Learning Model Health”, European Journal of Science and Modern Technologies, Vol. 2, No. 1, pp. 58–78, December 19, 2025.
- Taja Tuzman and Nikola Ljubešić, “LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data”, IEEE Access, published February 24, 2025.
- Xinyue Feng, “Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction”, Journal of Web Engineering, Vol. 24, No. 5, pp. 713–738, 2025, River Publishers.