Application of a Similarity Measure for Graphs to Web-based Document Structures

Matthias Dehmer; Frank Emmert Streib; Alexander Mehler; Jürgen Kilian; Max Mühlhauser

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

Application of a Similarity Measure for Graphs to Web-based Document Structures

Authors: Matthias Dehmer, Frank Emmert Streib, Alexander Mehler, Jürgen Kilian, Max Mühlhauser

Abstract:

Due to the tremendous amount of information provided by the World Wide Web (WWW) developing methods for mining the structure of web-based documents is of considerable interest. In this paper we present a similarity measure for graphs representing web-based hypertext structures. Our similarity measure is mainly based on a novel representation of a graph as linear integer strings, whose components represent structural properties of the graph. The similarity of two graphs is then defined as the optimal alignment of the underlying property strings. In this paper we apply the well known technique of sequence alignments for solving a novel and challenging problem: Measuring the structural similarity of generalized trees. In other words: We first transform our graphs considered as high dimensional objects in linear structures. Then we derive similarity values from the alignments of the property strings in order to measure the structural similarity of generalized trees. Hence, we transform a graph similarity problem to a string similarity problem for developing a efficient graph similarity measure. We demonstrate that our similarity measure captures important structural information by applying it to two different test sets consisting of graphs representing web-based document structures.

Keywords: Graph similarity, hierarchical and directed graphs, hypertext, generalized trees, web structure mining.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1084774

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1896

References:

[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957
[2] R. A. Botafogo, B. Shneiderman: Structural analysis of hypertexts: Identifying hierarchies and useful metrics, ACM Trans. Inf. Syst. 10 (2), 1992, 142-180
[3] S. Chakrabarti: Mining the Web. Discovering Knowledge from Hypertext Data, Morgen and Kaufmann Publishers, 2003
[4] S. Chakrabarti: Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction, Proc. of the 10th International World Wide Web Conference, Hong Kong, 2001, 211- 220
[5] I. F. Cruz, S. Borisov, M. A. Marks, T. R. Webb: Measuring Structural Similarity Among Web Documents: Preliminary Results , Lecture Notes In Computer Science, Vol. 1375, 1998
[6] M. Dehmer, Strukturelle Analyse web-basierter Dokumente, Ph.D Thesis, Department of Computer Science, Technische Universit┬¿at Darmstadt, 2005, unpublished
[7] R. Gleim: HyGraph - Ein Framework zur Extraktion, Repr┬¿asentation und Analyse webbasierter Hypertextstrukturen, Beitrage zur GLDVTagung 2005, Bonn/Germany, 2005
[8] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997
[9] T. Jiang, L. Wang, K. Zhang: Alignment of trees - An alternative to tree edit, Theoretical Computer Science, Elsevier, Vol. 143, 1995, 137-148
[10] S. Joshi, N. Agrawal, R. Krishnapuram, S. Negi,: Bag of Paths Model for Measuring Structural Similarity in Web Documents, Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003, 577-582.
[11] A. Mehler, M. Dehmer, R. Gleim: Towards logical hypertext structure. A graph-theoretic perspective, Proc. of I2CS-04, Guadalajara/Mexico, Lecture Notes in Computer Science, Berlin-New York: Springer, 2004
[12] A. Mehler, R. Gleim, M. Dehmer: Towards structure-sensitive hypertext categorization, to appear in: Proceedings of the 29-th Annual Conference of the German Classification Society, 2005
[13] S. M. Selkow: The tree-to-tree editing problem, Information Processing Letters, Vol. 6 (6), 1977, 184-186
[14] T. F. Smith, M. S. Waterman: Identification of common molecular subsequences, Journal of Molecular Biology, Vol. 147 (1), 1981, 195- 197
[15] F. Sobik, Graphmetriken und Klassifikation strukturierter Objekte, ZKIInformationen, Akad. Wiss. DDR, Vol. 2 (82), 1982, 63-122
[16] J. R. Ullman, An algorithm for subgraph isomorphism, J. ACM, Vol. 23 (1), 1976, 31-42
[17] P. H. Winne., L. Gupta, J. C. Nesbit: Exploring individual differences in studying strategies using graph theoretic statistics, The Alberta Journal of Educational Research, Vol. 40, 177-193, 1994
[18] A. Winter: Exchanching Graphs with GXL, http://www.gupro. de/GXL
[19] K. Zhang, D. Shasha: Simple fast algorithms for the editing distance between trees and related problems, SIAM Journal of Computing, Vol. 18 (6), 1989, 1245-1262
[20] B. Zelinka, On a certain distance between isomorphism classes of graphs, ╦çCasopis pro ╦çpest. Mathematiky, Vol. 100, 1975, 371-373