Authors :
Rucha Dhage; Dr. Manisha Bharati
Volume/Issue :
Volume 11 - 2026, Issue 5 - May
Google Scholar :
https://tinyurl.com/592dsef7
Scribd :
https://tinyurl.com/4kwcv528
DOI :
https://doi.org/10.38124/ijisrt/26may2022
Abstract :
Traditional Retrieval-Augmented Generation (RAG) systems are very effective at querying text-based documents,
but real-world documents are rarely text-only: they are complex and contain images, graphs, tables, and more. As a result,
text-only RAG systems struggle to process such multimodal documents effectively. This project presents the development
of a Multimodal RAG system designed to bridge this gap. Using LangChain, HuggingFace embeddings, ChromaDB, and the
LLaVA 1.5 Vision-Language Model (VLM), the system processes documents that contain text, images, and tabular data,
extracts their textual and visual elements (such as images and graphs), and answers user queries based on both textual
and visual information. By indexing both textual and visual summaries into a unified vector space, the system retrieves
multimodal context and gives accurate, grounded responses while reducing hallucinations related to chart colors and
visual data trends.
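The idea of indexing textual and visual summaries into one vector space can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a HuggingFace embedding model, the in-memory `UnifiedIndex` stands in for ChromaDB, and the image and table summaries are hard-coded where a VLM such as LLaVA would normally generate them. All class names, function names, and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    text: str      # a text chunk, or a summary describing an image or table
    modality: str  # "text", "image", or "table"
    source: str    # hypothetical pointer back to the original element


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class UnifiedIndex:
    """One index for every modality; an assumed stand-in for a vector store."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, Entry]] = []

    def add(self, entry: Entry) -> None:
        # Text chunks and visual summaries are embedded the same way,
        # so they land in the same vector space.
        self._items.append((embed(entry.text), entry))

    def query(self, question: str, k: int = 2) -> list[Entry]:
        q = embed(question)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [entry for _, entry in ranked[:k]]


index = UnifiedIndex()
index.add(Entry("Quarterly revenue grew steadily across all regions.", "text", "page 1"))
index.add(Entry("Bar chart: the blue bars show revenue rising from Q1 to Q4.", "image", "figure 1"))
index.add(Entry("Table listing employee headcount by department.", "table", "table 1"))

# A question about a visual detail retrieves the image summary first,
# followed by the related text chunk.
hits = index.query("What color are the revenue bars in the chart?", k=2)
for hit in hits:
    print(hit.modality, "|", hit.source, "|", hit.text)
```

In a fuller pipeline, the retrieved entries (and, for image hits, the original image referenced by `source`) would be passed to the language model as context, which is how a question about chart colors can be grounded in what the chart actually shows.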
Keywords :
Retrieval-Augmented Generation (RAG), LangChain, ChromaDB, HuggingFace Embeddings, LLaVA, Multimodal Retrieval.