


Automated Dataset Preparation and Management from Raw Data Using Context-Aware Chunking and LLMs


Authors : S. Saraswathi; Vignesh M.; Vijayalakshmi S.; Tholkapian M.

Volume/Issue : Volume 11 - 2026, Issue 4 - April


Google Scholar : https://tinyurl.com/2u7swwpd

Scribd : https://tinyurl.com/4zymy2tx

DOI : https://doi.org/10.38124/ijisrt/26apr952



Abstract : The increasing adoption of Large Language Models (LLMs) has created a critical need for high-quality, structured training data derived from real-world, unstructured sources such as documents and web content. However, existing research primarily focuses on model architectures or isolated data processing stages, lacking a unified solution for end-to-end dataset preparation. This project proposes a reusable, production-ready Dataset Preparation Pipeline that automates the ingestion, cleaning, chunking, task-specific data generation, and quality evaluation of raw data for multiple LLM applications, including question answering, summarization, and classification. The system integrates automated quality assessment to filter low-quality or biased samples and exports curated datasets in standard formats for seamless downstream use. By adopting a data-centric approach, the proposed pipeline bridges the gap between raw real-world data and reliable LLM training datasets, enabling scalable, continuous, and efficient dataset generation for modern language model development.

Keywords : Large Language Models (LLMs), Dataset Preparation Pipeline, Data-Centric AI, Quality Assessment, Training Data Generation.
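The pipeline stages named in the abstract — ingestion, cleaning, context-aware chunking, task-specific sample generation, quality filtering, and export — could be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the paragraph-boundary chunker, the `max_chars` limit, and the stubbed generation stage are all hypothetical, and a real pipeline would invoke an LLM in the generation step and a richer scorer in the quality step.

```python
import json
import re

def clean(text: str) -> str:
    """Cleaning stage: collapse runs of spaces/tabs and trim each line."""
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.splitlines()).strip()

def chunk(text: str, max_chars: int = 200) -> list[str]:
    """Context-aware chunking sketch: split on paragraph boundaries and
    pack whole paragraphs into chunks of at most max_chars characters,
    so no paragraph is cut mid-thought."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def generate_samples(chunks: list[str]) -> list[dict]:
    """Stand-in for the LLM generation stage: wrap each chunk as a
    task-specific training record (here, a summarization sample)."""
    return [{"task": "summarization", "input": c, "output": ""} for c in chunks]

def quality_filter(samples: list[dict], min_chars: int = 40) -> list[dict]:
    """Quality-assessment stage (toy version): drop samples whose input
    is too short to carry useful training signal."""
    return [s for s in samples if len(s["input"]) >= min_chars]

def export_jsonl(samples: list[dict], path: str) -> None:
    """Export stage: write one JSON record per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

A run of the pipeline would simply compose the stages: `export_jsonl(quality_filter(generate_samples(chunk(clean(raw_text)))), "dataset.jsonl")`. Packing whole paragraphs rather than splitting at a fixed character offset is the simplest form of the "context-aware" chunking the abstract refers to.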

References :

  1. A. Nayak, B. Božić, and L. Longo, "Data Quality Assessment and Recommendation of Feature Selection Algorithms: An Ontological Approach," Journal of Web Engineering, vol. 22, no. 1, pp. 175–196, 2023, River Publishers.
  2. K. M. Kahloot and P. Ekler, "Algorithmic Splitting: A Method for Dataset Preparation," IEEE Access, published September 6, 2021.
  3. R. N. Ndung'u, "Data Preparation for Machine Learning Modelling," International Journal of Computer Applications Technology and Research, 2022.
  4. H. Li, Y. Liu, and C. Huang, "Entropy-Based Data Selection for Language Models," IEEE Open Access Journal, published September 2, 2025.
  5. A. Lahouari, J. Rogal, and M. E. Tuckerman, "Automated Machine Learning Pipeline for Training and Analysis Using Large Language Models," International Journal of Innovative Science and Research Technology, published September 29, 2025.
  6. M. Nadas, L. Diosan, and A. Tomescu, "Synthetic Data Generation Using Large Language Models: Advances in Text and Code," accepted for publication in IEEE Access.
  7. A. Tacuri, S. Firmenich, A. Fernández, F. Riva, M. Urbieta, and G. Rossi, "Web Scraping by End User," IEEE Access, published November 25, 2025.
  8. M. Hasan, S. I. Shifa, K. Niaz, and M. M. H. Shuvo, "Continuous Data Curation and Valuation for Long-Term Machine Learning Model Health," European Journal of Science and Modern Technologies, vol. 2, no. 1, pp. 58–78, December 19, 2025.
  9. T. Tuzman and N. Ljubešić, "LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data," IEEE Access, published February 24, 2025.
  10. X. Feng, "Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction," Journal of Web Engineering, vol. 24, no. 5, pp. 713–738, 2025, River Publishers.

