Fine-Tuning Llama2 for Question Answering Enhancement Using Low-Rank Adaptation

Authors: Md. Ashfaqur Rahman

Abstract:

Fine-tuning large language models (LLMs) for question-answering (QA) tasks often incurs high computational cost, demanding substantial memory and increasing inference latency. In this study, we investigate the effectiveness of Low-Rank Adaptation (LoRA) in optimizing the Llama2 7B model by reducing the number of trainable parameters, minimizing memory consumption, and improving processing efficiency. Experimental results show that LoRA reduces memory usage from 8.49 GB to 4.52 GB and lowers the trainable parameter count from 6.73 billion to 4.19 million. We also analyze inference latency across different input configurations, evaluating single-question, two-question, three-question, and four-question inputs separately. The fine-tuned model consistently outperforms the base model, achieving latency reductions of 6.23 s, 11.66 s, 23.97 s, and 43.11 s, respectively. These results indicate that LoRA enhances efficiency without compromising model performance. To assess overall effectiveness, we employ multiple evaluation metrics, including memory usage, parameter pruning, inference latency, and human evaluation, ensuring a balanced trade-off between computational efficiency and accuracy. The dataset, sourced from the Hugging Face library, is partitioned into 9.85k training samples and 518 testing samples. Our findings establish LoRA-based fine-tuning as a robust method for improving LLM performance in QA applications while reducing computational overhead, making large-scale QA systems more practical and resource-efficient.

Keywords: Large Language Models, fine-tuning, Low-Rank Adaptation, Llama2, question answering, inference latency, memory optimization, parameter pruning, computational efficiency, Hugging Face dataset.
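
For context, LoRA freezes the base model weights and injects small trainable low-rank matrices, which is what makes the reported drop from 6.73 billion to 4.19 million trainable parameters possible. The minimal sketch below shows how such a setup can be expressed with the Hugging Face PEFT library; the checkpoint name, LoRA rank, scaling factor, and target modules are illustrative assumptions, not the exact configuration used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,  # half precision to lower the memory footprint
    device_map="auto",
)

# LoRA keeps the base weights frozen and trains only small low-rank update
# matrices attached to selected layers.
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts

Training the wrapped model with a standard Trainer would then update only the adapter weights, which is consistent with the parameter and memory reductions described in the abstract.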

