Machine Learning Development Audit Framework: Assessment and Inspection of Risk and Quality of Data, Model and Development Process

Authors: Jan Stodt, Christoph Reich


The use of machine learning models for prediction is growing rapidly, and proof that the intended requirements are met is essential. Audits are a proven method for determining whether requirements or guidelines are met. However, machine learning models have intrinsic characteristics, such as their dependence on the quality of the training data, that make it difficult to demonstrate the required behavior and make audits more challenging. This paper describes an ML audit framework that evaluates and reviews the risks of machine learning applications, the quality of the training data, and the machine learning model. We evaluate and demonstrate the functionality of the proposed framework by auditing a steel plate fault prediction model.

Keywords: Audit, machine learning, assessment, metrics.


