Authors :
S. Saraswathi; Vignesh M.; Vijayalakshmi S.; Tholkapian M.
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/2u7swwpd
Scribd :
https://tinyurl.com/4zymy2tx
DOI :
https://doi.org/10.38124/ijisrt/26apr952
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
The increasing adoption of Large Language Models (LLMs) has created a critical need for high-quality, structured
training data derived from real-world, unstructured sources such as documents and web content. However, existing research
primarily focuses on model architectures or isolated data processing stages, lacking a unified solution for end-to-end dataset
preparation. This project proposes a reusable, production-ready Dataset Preparation Pipeline that automates the ingestion,
cleaning, chunking, task-specific data generation, and quality evaluation of raw data for multiple LLM applications
including question answering, summarization, and classification. The system integrates automated quality assessment to
filter low-quality or biased samples and exports curated datasets in standard formats for seamless downstream use. By
adopting a data-centric approach, the proposed pipeline bridges the gap between raw real-world data and reliable LLM
training datasets, enabling scalable, continuous, and efficient dataset generation for modern language model development.
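The staged flow the abstract describes (ingestion → cleaning → chunking → task-specific generation → quality filtering → export) can be sketched as a minimal Python skeleton. All function names, thresholds, and the instruction template below are illustrative assumptions, not the authors' actual implementation; the task-specific generation step, which would normally call an LLM, is stubbed out.

```python
import json
import re

def clean(text: str) -> str:
    """Normalize whitespace and collapse runs into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, max_words: int = 100) -> list[str]:
    """Split cleaned text into fixed-size word windows (assumed chunking policy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def quality_ok(sample: dict, min_len: int = 20) -> bool:
    """Toy quality gate: drop samples whose input is too short.
    A real pipeline would also screen for bias and duplication."""
    return len(sample["input"].split()) >= min_len

def build_dataset(raw_docs: list[str], task: str = "summarization") -> list[dict]:
    """Run documents through the full staged pipeline and keep passing samples."""
    samples = []
    for doc in raw_docs:
        for piece in chunk(clean(doc)):
            # Task-specific generation would invoke an LLM here; we only
            # record an instruction/input pair as a placeholder.
            samples.append({"task": task,
                            "instruction": f"Perform {task} on the input.",
                            "input": piece})
    return [s for s in samples if quality_ok(s)]

def export_jsonl(samples: list[dict], path: str) -> None:
    """Export curated samples as JSON Lines, a common fine-tuning format."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```

JSON Lines is chosen here only as one example of the "standard formats" the abstract mentions; swapping in CSV or Parquet export would leave the pipeline stages unchanged.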
Keywords :
Large Language Models (LLMs), Dataset Preparation Pipeline, Data-Centric AI, Quality Assessment, Training Data Generation.
References :
- Aparna Nayak, Bojan Božić, and Luca Longo, “Data Quality Assessment and Recommendation of Feature Selection Algorithms: An Ontological Approach”, Journal of Web Engineering, Vol. 22, No. 1, pp. 175–196, 2023, River Publishers.
- Khalid M. Kahloot and Peter Ekler, “Algorithmic Splitting: A Method for Dataset Preparation”, IEEE Access, published September 6, 2021.
- Ndung’u Rachael Njeri, “Data Preparation for Machine Learning Modelling”, International Journal of Computer Applications Technology and Research, 2022.
- Hongming Li, Yang Liu, and Chao Huang, “Entropy-Based Data Selection for Language Models”, IEEE open-access journal, published September 2, 2025.
- Adam Lahouari, Jutta Rogal, and Mark E. Tuckerman, “Automated Machine Learning Pipeline for Training and Analysis Using Large Language Models”, International Journal of Innovative Science and Research Technology, published September 29, 2025.
- Mihai Nadas, Laura Diosan, and Andreea Tomescu, “Synthetic Data Generation Using Large Language Models: Advances in Text and Code”, accepted for publication in IEEE Access.
- Alex Tacuri, Sergio Firmenich, Alejandro Fernández, Florencia Riva, Matías Urbieta, and Gustavo Rossi, “Web Scraping by End User”, IEEE Access, November 25, 2025.
- Mehedi Hasan, Shayma Islam Shifa, Kashif Niaz, and Md Mahedi Hasan Shuvo, “Continuous Data Curation and Valuation for Long-Term Machine Learning Model Health”, European Journal of Science and Modern Technologies, Vol. 2, No. 1, pp. 58–78, December 19, 2025.
- Taja Tuzman and Nikola Ljubešić, “LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data”, IEEE Access, published February 24, 2025.
- Xinyue Feng, “Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction”, Journal of Web Engineering, Vol. 24, No. 5, pp. 713–738, 2025, River Publishers.