A Large Dataset Imputation Approach Applied to Country Conflict Prediction Data
Authors: Benjamin D. Leiby, Darryl K. Ahner
Abstract:
This study demonstrates an alternative stochastic imputation approach for large datasets on which preferred commercial packages struggle to iterate because of numerical problems. A large country conflict dataset, with missingness well above the commonly used 20% threshold, motivates the approach. The methodology capitalizes on correlation and uses model residuals to represent the uncertainty in estimating unknown values. Examining the methodology provides insight into choosing linear or nonlinear modeling terms. The static tolerances common in most packages are replaced with tailorable tolerances that exploit residuals to fit each data element. The evaluation considers computation time, model fit, and a comparison of known values with the replacement values that imputation produces for them. Overall, the country conflict dataset shows promise for modeling first-order interactions, while indicating a need for further refinement that mimics predictive mean matching.
Keywords: Correlation, country conflict, imputation, stochastic regression.
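The general idea named in the abstract, stochastic regression imputation guided by correlation, can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic textbook version under simplifying assumptions (a single, fully observed predictor chosen by highest absolute correlation, a simple linear fit, and normally distributed noise scaled by the residual standard deviation). The function names `pearson` and `stochastic_impute`, the data layout, and the toy data are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def stochastic_impute(target, predictors, rng):
    """Fill None entries of `target` using simple regression plus residual noise.

    target:     list of floats, with None marking missing values
    predictors: dict of name -> fully observed list of floats (assumed layout)
    rng:        random.Random instance, so results are reproducible
    Returns (filled list, name of the predictor used, residual std. deviation).
    """
    obs = [i for i, y in enumerate(target) if y is not None]
    ys = [target[i] for i in obs]

    # Exploit correlation: choose the predictor most correlated with the target.
    best = max(
        predictors,
        key=lambda k: abs(pearson([predictors[k][i] for i in obs], ys)),
    )
    xs = [predictors[best][i] for i in obs]

    # Ordinary least squares fit on the observed rows only.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx

    # Residual spread supplies the uncertainty added to each imputed value.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    dof = max(len(obs) - 2, 1)
    sigma = math.sqrt(sum(r * r for r in residuals) / dof)

    filled = [
        y if y is not None
        else intercept + slope * predictors[best][i] + rng.gauss(0.0, sigma)
        for i, y in enumerate(target)
    ]
    return filled, best, sigma


# Toy data (made up for illustration): y depends on x1, x2 is unrelated noise.
rng = random.Random(0)
x1 = [float(i) for i in range(20)]
x2 = [rng.uniform(0, 1) for _ in range(20)]
y = [2.0 * a + 1.0 + rng.gauss(0, 0.5) for a in x1]
y_missing = [None if i % 4 == 0 else v for i, v in enumerate(y)]

filled, used, sigma = stochastic_impute(y_missing, {"x1": x1, "x2": x2}, rng)
```

Adding a residual-scaled draw to each prediction, rather than imputing the fitted value itself, keeps the imputed values from understating the variability of the data. A fuller treatment would use several predictors and, as the abstract suggests, interaction terms or a matching step in the spirit of predictive mean matching.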