Using Large Language Models for Machine-Generated Data Creation


Authors : Dr. Payal Gulati

Volume/Issue : Volume 10 - 2025, Issue 12 - December


Google Scholar : https://tinyurl.com/4vmw5hhj

Scribd : https://tinyurl.com/mrxb3ud3

DOI : https://doi.org/10.38124/ijisrt/25dec1606



Abstract : Modern machine learning systems rely on large volumes of high-quality labeled data. However, collecting real-world data for constructing complex datasets is often expensive, time-consuming, and restricted by privacy and ethical concerns. Machine-generated data has emerged as a reliable alternative that addresses these challenges. With the advancement of Large Language Models (LLMs), it has become possible to generate realistic, domain-specific, and task-oriented data at scale, creating quality data without the need to manually collect, clean, and annotate large datasets. This paper presents a detailed study of machine-generated data creation using LLMs. Building upon existing practical frameworks, the study proposes a structured pipeline that integrates prompt engineering, retrieval-augmented generation, quality filtering, and iterative refinement. The paper also discusses evaluation strategies, real-world applications, and ethical challenges, making the proposed approach suitable for both academic research and industrial deployment.

Keywords : Large Language Models, Data Augmentation, Retrieval-Augmented Generation, Privacy-Preserving Artificial Intelligence, and Machine Learning.
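The four-stage pipeline named in the abstract (prompt engineering, retrieval-augmented generation, quality filtering, iterative refinement) can be sketched as a minimal, runnable Python program. All function names are illustrative assumptions, the retrieval step uses naive keyword overlap in place of a real vector search, and the LLM call is stubbed with a deterministic generator; this is a structural sketch, not the paper's implementation.

```python
# Hypothetical sketch of a synthetic-data pipeline:
# prompt engineering -> retrieval-augmented generation -> quality filtering
# -> iterative refinement. The LLM is stubbed so the sketch runs offline.

def build_prompt(task, context):
    # Prompt engineering: combine task instructions with retrieved context.
    return f"Task: {task}\nContext: {context}\nGenerate one labeled example."

def retrieve(task, corpus):
    # Retrieval step: pick the document with the highest keyword overlap
    # (a stand-in for embedding-based similarity search).
    task_words = set(task.lower().split())
    return max(corpus, key=lambda doc: len(task_words & set(doc.lower().split())))

def generate(prompt):
    # Stubbed LLM call: echoes the retrieved context as a labeled record.
    context_line = prompt.splitlines()[1]
    return {"text": context_line.removeprefix("Context: "), "label": "positive"}

def passes_filter(record):
    # Quality filtering: drop empty or incorrectly labeled records.
    return bool(record.get("text")) and record.get("label") in {"positive", "negative"}

def synthesize(task, corpus, n=3, max_rounds=2):
    # Iterative refinement: keep generating until n records pass the filter
    # or the round budget is exhausted.
    dataset = []
    for _ in range(max_rounds):
        while len(dataset) < n:
            context = retrieve(task, corpus)
            record = generate(build_prompt(task, context))
            if not passes_filter(record):
                break  # regenerate in the next refinement round
            dataset.append(record)
        if len(dataset) >= n:
            break
    return dataset

corpus = [
    "reviews of consumer electronics",
    "weather reports",
    "sentiment in product reviews",
]
data = synthesize("product review sentiment", corpus)
print(len(data))
```

In a real deployment, `generate` would wrap a model API call and `passes_filter` would apply stronger checks (deduplication, schema validation, model-based scoring), but the control flow stays the same.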

