Hallucination Detection and Mitigation in Chatbots: A Multi-Agent Approach with Llama2
Author: Md. Ashfaqur Rahman
Abstract:
Hallucination in Large Language Models (LLMs) poses a significant challenge to chatbot reliability, especially in critical domains such as healthcare, finance, and education. This paper presents a multi-agent approach to hallucination detection and mitigation using Llama2, integrating retrieval-based verification, fact-checking mechanisms, and response-correction strategies. The proposed framework consists of specialized agents: a Web Retrieval Agent that fetches factual information from external sources (e.g., Wikipedia, DuckDuckGo, and Google Serper), Fact-Checking Agents that evaluate response accuracy using semantic similarity scoring, a Correction Agent that refines outputs when hallucinations are detected, and a Monitoring Agent that logs hallucination scores and computes truthfulness metrics. Experimental results demonstrate that incorporating retrieval-augmented generation (RAG) and multi-agent verification significantly reduces hallucination rates. The study highlights the effectiveness of combining Llama2 with external knowledge sources and multi-agent collaboration to improve chatbot reliability and factual accuracy. Future research will explore reinforcement learning for dynamic agent optimization and enhanced real-time fact verification.
Keywords: Hallucination detection, Llama2, multi-agent systems, retrieval-augmented generation, fact-checking, chatbot reliability, truth scoring, large language models, response correction, semantic similarity.
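
To make the described pipeline concrete, the following minimal Python sketch shows one way the four agents could interact. It is an illustrative assumption, not the paper's actual implementation: the function names, the stubbed retrieval step, the all-MiniLM-L6-v2 sentence encoder, and the 0.6 hallucination threshold are all hypothetical choices made for the example.

    # Minimal sketch of the multi-agent verification loop from the abstract.
    # Assumes: pip install sentence-transformers, and an `llm` callable
    # (e.g., a Llama2 wrapper) that maps a prompt string to a response string.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

    def retrieval_agent(query: str) -> list[str]:
        """Web Retrieval Agent: fetch evidence passages from external sources
        (Wikipedia, DuckDuckGo, Google Serper in the paper). Stubbed here."""
        return ["<evidence passage fetched from an external source>"]

    def fact_check_agent(response: str, evidence: list[str]) -> float:
        """Fact-Checking Agent: score the response against retrieved evidence
        via cosine similarity of sentence embeddings; higher = more grounded."""
        resp_vec = embedder.encode(response, convert_to_tensor=True)
        ev_vecs = embedder.encode(evidence, convert_to_tensor=True)
        return float(util.cos_sim(resp_vec, ev_vecs).max())

    def correction_agent(llm, query: str, response: str, evidence: list[str]) -> str:
        """Correction Agent: ask the base model to rewrite its draft so that
        every claim is consistent with the retrieved evidence."""
        prompt = (f"Question: {query}\nEvidence: {' '.join(evidence)}\n"
                  f"Draft answer: {response}\n"
                  "Rewrite the draft so every claim is supported by the evidence.")
        return llm(prompt)

    def monitoring_agent(log: list, query: str, score: float, corrected: bool) -> None:
        """Monitoring Agent: log per-response truthfulness scores."""
        log.append({"query": query, "truth_score": score, "corrected": corrected})

    def answer(llm, query: str, log: list, threshold: float = 0.6) -> str:
        response = llm(query)                  # Llama2 draft answer
        evidence = retrieval_agent(query)
        score = fact_check_agent(response, evidence)
        corrected = score < threshold          # low grounding = likely hallucination
        if corrected:
            response = correction_agent(llm, query, response, evidence)
            score = fact_check_agent(response, evidence)
        monitoring_agent(log, query, score, corrected)
        return response

In this sketch the truthfulness score for a response is simply its maximum embedding similarity to any retrieved evidence passage; a Monitoring Agent as described in the abstract would aggregate such per-response scores into overall truthfulness metrics.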