A Bayesian Classification System for Facilitating an Institutional Risk Profile Definition

Authors: Roman Graf, Sergiu Gordea, Heather M. Ryan


This paper presents an approach for the easy creation and classification of institutional risk profiles that support endangerment analysis of file formats. The main contribution of this work is the use of data mining techniques to support the setup of the most important risk factors. Risk profiles then employ a risk factor classifier and associated configurations to support digital preservation experts with a semi-automatic estimation of the endangerment group for file format risk profiles. Our goal is to make use of an expert knowledge base, acquired through a digital preservation survey, in order to detect preservation risks for a particular institution. Another contribution is support for the visualisation of risk factors along a dimension required for analysis. Using the naive Bayes method, the decision support system recommends to an expert the matching risk profile group for the previously selected institutional risk profile. The proposed methods improve the visibility of risk factor values and the quality of a digital preservation process. The presented approach is designed to facilitate decision-making for the preservation of digital content in libraries and archives using domain expert knowledge and the values of file format risk profiles. To facilitate decision-making, the aggregated information about the risk factors is presented as a multidimensional vector. The goal is to visualise particular dimensions of this vector for analysis by an expert and to define its profile group. A sample risk profile calculation and the visualisation of some risk factor dimensions are presented in the evaluation section.
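
The recommendation step described in the abstract can be illustrated with a minimal sketch of a categorical naive Bayes classifier. This is not the authors' implementation: the risk factor names (`software_support`, `documentation`, `adoption`), the value scale (`low`, `medium`, `high`), the group labels (`high_risk`, `low_risk`), the training data, and the use of Laplace smoothing are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(profiles, groups):
    """Count group frequencies and, per group, the frequency of each factor value."""
    group_counts = Counter(groups)
    # value_counts[group][factor][value] -> number of training profiles
    value_counts = defaultdict(lambda: defaultdict(Counter))
    factor_values = defaultdict(set)  # distinct values seen for each factor
    for profile, group in zip(profiles, groups):
        for factor, value in profile.items():
            value_counts[group][factor][value] += 1
            factor_values[factor].add(value)
    return group_counts, value_counts, factor_values


def recommend_group(profile, model):
    """Return the most probable group and the log-score of every group."""
    group_counts, value_counts, factor_values = model
    total = sum(group_counts.values())
    scores = {}
    for group, n in group_counts.items():
        log_p = math.log(n / total)  # prior P(group)
        for factor, value in profile.items():
            # Laplace smoothing (an assumption) avoids zero probabilities
            # for factor values never observed within a group.
            k = len(factor_values[factor] | {value})
            count = value_counts[group][factor][value]
            log_p += math.log((count + 1) / (n + k))
        scores[group] = log_p
    return max(scores, key=scores.get), scores


# Hypothetical expert-labelled risk profiles (illustrative data only).
training_profiles = [
    {"software_support": "low", "documentation": "low", "adoption": "low"},
    {"software_support": "low", "documentation": "medium", "adoption": "low"},
    {"software_support": "high", "documentation": "high", "adoption": "high"},
    {"software_support": "high", "documentation": "medium", "adoption": "high"},
]
training_groups = ["high_risk", "high_risk", "low_risk", "low_risk"]

model = train_naive_bayes(training_profiles, training_groups)

# A new institutional risk profile, expressed as a vector of risk factor values.
query = {"software_support": "low", "documentation": "medium", "adoption": "low"}
recommended, scores = recommend_group(query, model)
print(recommended, scores)
```

Each profile is a vector with one entry per risk factor, matching the multidimensional representation mentioned in the abstract. The classifier treats the factors as conditionally independent given the group, which is the defining simplification of naive Bayes. An expert would review the recommended group rather than accept it automatically.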

Keywords: linked open data, information integration, digital libraries, data mining.


