Imputing Missing Data in Electronic Health Records: A Comparison of Linear and Non-Linear Imputation Models
Authors: Alireza Vafaei Sadr, Vida Abedi, Jiang Li, Ramin Zand
Abstract:
Missing data is a common challenge in medical research and can lead to biased or incomplete results. When this bias leaks into models, it further exacerbates health disparities: biased algorithms can misclassify patients and reduce the resources allocated to monitoring and prevention for certain minorities and vulnerable segments of the patient population, which in turn further reduces the data footprint of those same populations, creating a vicious cycle. This study compares the performance of six imputation techniques, grouped into Linear and Non-Linear models, on two real-world electronic health record (EHR) datasets representing 17,864 patient records. Mean absolute percentage error (MAPE) and root mean squared error (RMSE) are used as performance metrics, and the results show that the Linear models outperform the Non-Linear models on both metrics. These results suggest that Linear models can sometimes be an optimal choice for imputing laboratory variables, in terms of both imputation efficiency and the uncertainty of the predicted values.
Keywords: EHR, Machine Learning, imputation, laboratory variables, algorithmic bias.
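The abstract names MAPE and RMSE as the performance metrics but does not give their formulas. The following is a minimal sketch that assumes the standard textbook definitions, with MAPE expressed in percent; the function names and the laboratory values are hypothetical illustrations, not data or code from the study.

```python
import math


def rmse(actual, imputed):
    """Root mean squared error between held-out true values and imputed values."""
    if len(actual) != len(imputed) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, imputed):
    """Mean absolute percentage error, in percent (assumed convention).

    Undefined when a true value is zero, so such inputs are rejected.
    """
    if len(actual) != len(imputed) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when a true value is zero")
    ratios = [abs((a - p) / a) for a, p in zip(actual, imputed)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical held-out laboratory values and their imputed counterparts.
actual_values = [100.0, 200.0, 50.0, 80.0]
imputed_values = [110.0, 190.0, 55.0, 80.0]

print(f"RMSE: {rmse(actual_values, imputed_values):.2f}")
print(f"MAPE: {mape(actual_values, imputed_values):.2f}%")
```

RMSE is reported in the units of the variable and penalizes large errors more heavily, while MAPE is scale-free, which makes it easier to compare across laboratory variables measured on different scales.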