Multimodal RAG Based System to Handle Financial Documents | International Journal of Innovative Science and Research Technology

Multimodal RAG Based System to Handle Financial Documents

Authors : Rucha Dhage; Dr. Manisha Bharati

Volume/Issue : Volume 11 - 2026, Issue 5 - May

Google Scholar : https://tinyurl.com/592dsef7

Scribd : https://tinyurl.com/4kwcv528

DOI : https://doi.org/10.38124/ijisrt/26may2022

Abstract : Traditional Retrieval-Augmented Generation (RAG) systems are effective at querying text-based documents, but real-world documents are rarely text-only: they are complex and contain images, graphs, tables, and more. Text-only RAG systems therefore struggle to process such multimodal documents effectively. This project presents a Multimodal RAG system designed to bridge this gap. Using LangChain, HuggingFace embeddings, ChromaDB, and the LLaVA 1.5 Vision-Language Model (VLM), the system processes documents that contain text, images, and tabular data, extracts their textual and visual elements (such as images and graphs), and answers user queries based on both text and visual information. By indexing both textual and visual summaries into a unified vector space, the system retrieves multimodal context and gives accurate, grounded responses while reducing hallucinations related to chart colors and visual data trends.

Keywords : Retrieval-Augmented Generation (RAG), LangChain, ChromaDB, HuggingFace Embeddings, LLaVA, Multimodal Retrieval.
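
The idea of indexing textual and visual summaries into one vector space can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for HuggingFace embeddings, the in-memory `UnifiedIndex` is a hypothetical stand-in for ChromaDB, and the image and table summaries are hand-written strings where the described system would use LLaVA 1.5 output. All names and sample sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    modality: str    # "text", "table", or "image"
    content: str     # raw text chunk, or a summary standing in for VLM output
    vector: Counter  # toy embedding of `content`


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: replaces a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class UnifiedIndex:
    """Hypothetical in-memory index holding every modality in one vector space."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def add(self, modality: str, content: str) -> None:
        self.entries.append(Entry(modality, content, embed(content)))

    def query(self, question: str, k: int = 2) -> list[Entry]:
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e.vector), reverse=True)
        return ranked[:k]


index = UnifiedIndex()
# Text chunk taken directly from a document (made-up sample sentence).
index.add("text", "Operating costs increased in the second quarter.")
# Summaries that a vision-language model would normally generate (hand-written here).
index.add("table", "Table summary: net profit by quarter, in millions.")
index.add("image", "Chart summary: the blue line shows revenue rising each quarter.")

top = index.query("What does the blue line in the revenue chart show?", k=1)
print(top[0].modality, "->", top[0].content)
```

Because the image summary lives in the same index as the text chunks, a question about chart colors can retrieve visual context through ordinary text similarity. The retrieved entries would then be passed to the generator as grounding context.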
