Weighted-Distance Sliding Windows and Cooccurrence Graphs for Supporting Entity-Relationship Discovery in Unstructured Text

Paolo Fantozzi; Luigi Laura; Umberto Nanni

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 32797

Weighted-Distance Sliding Windows and Cooccurrence Graphs for Supporting Entity-Relationship Discovery in Unstructured Text

Authors: Paolo Fantozzi, Luigi Laura, Umberto Nanni

Abstract:

The problem of Entity relation discovery in structured data, a well covered topic in literature, consists in searching within unstructured sources (typically, text) in order to find connections among entities. These can be a whole dictionary, or a specific collection of named items. In many cases machine learning and/or text mining techniques are used for this goal. These approaches might be unfeasible in computationally challenging problems, such as processing massive data streams. A faster approach consists in collecting the cooccurrences of any two words (entities) in order to create a graph of relations - a cooccurrence graph. Indeed each cooccurrence highlights some grade of semantic correlation between the words because it is more common to have related words close each other than having them in the opposite sides of the text. Some authors have used sliding windows for such problem: they count all the occurrences within a sliding windows running over the whole text. In this paper we generalise such technique, coming up to a Weighted-Distance Sliding Window, where each occurrence of two named items within the window is accounted with a weight depending on the distance between items: a closer distance implies a stronger evidence of a relationship. We develop an experiment in order to support this intuition, by applying this technique to a data set consisting in the text of the Bible, split into verses.

Keywords: Cooccurrence graph, entity relation graph, unstructured text, weighted distance.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1474421

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 626

References:

[1] “10 key marketing trends for 2017,” https://www-01.ibm.com/common/ ssi/cgi-bin/ssialias?htmlfid=WRL12345USEN, accessed: 2018-05-15.
[2] E. Agichtein and L. Gravano, “Snowball: Extracting Relations from Large Plain-Text Collections,” Proceedings of the fifth ACM conference on Digital libraries - DL ’00, vol. I, no. 58, pp. 85–94, 2000.
[3] E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik, E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik, “Snowball: a prototype system for extracting relations from large text collections,” in Proceedings of the 2001 ACM SIGMOD international conference on Management of data - SIGMOD ’01, vol. 30, no. 2. New York, New York, USA: ACM Press, 2001, p. 612.
[4] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J.-R. Wen, “StatSnowball : a Statistical Approach to Extracting Entity,” Proceedings of the 18th international conference on World wide web (WWW2009), pp. 101–110, 2009. (Online). Available: http://www2009.eprints.org/11/1/p101.pdf.
[5] H. Cunningham, “Information Extraction, Automatic Introduction: Extraction and Retrieval,” Oxford: Elsevier. Heath S B Kortmann B Miller J, vol. 5, pp. 665–677, 2006. (Online). Available: http://www.elsevier.com/locate/permissionusematerial.
[6] P. Pantel and M. Pennacchiotti, “Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations,” Acl2006, no. July, pp. 113–120, 2006.
[7] A. O¨ zgu¨r, B. Cetin, and H. Bingl, “Co-Occurrence Network Of Reuters News,” International Journal of Modern Physics C, vol. 19, no. 05, pp. 689–702, may 2008. (Online). Available: http://www.worldscientific.com/doi/abs/10.1142/S0129183108012431.
[8] H. Sayyadi, M. Hurst, and A. Maykov, “Event Detection and Tracking in Social Streams *,” Proceedings of the Third International ICWSM Conference (2009) Event, 2009. (Online). Available: https: //www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/170/493.
[9] K. Tanaka-Ishii and H. Iwasaki, “Clustering Co-Occurrence Graph based on Transitivity,” Fifth Workshop on Very Large Corpora, 1997. (Online). Available: http://www.aclweb.org/anthology/W97-0111.
[10] D. Benz, C. K¨orner, A. Hotho, G. Stumme, and M. Strohmaier, “One Tag to Bind Them All: Measuring Term Abstractness in Social Metadata,” in Extended Semantic Web Conference. Springer, Berlin, Heidelberg, 2011, pp. 360–374. (Online). Available: http: //link.springer.com/10.1007/978-3-642-21064-8{ }25.
[11] J. V´eronis, “HyperLex: lexical cartography for information retrieval,” Computer Speech & Language, vol. 18, no. 3, pp. 223–252, jul 2004. (Online). Available: https://www.sciencedirect.com/science/article/pii/ S0885230804000142.
[12] E. Agirre, D. Mart´ınez, O. L´opez De Lacalle, and A. Soroa, “Two graph-based algorithms for state-of-the-art WSD,” Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 585–593, 2006.
[13] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,” Proceedings of the 2004 conference on empirical methods in natural language processing, 2004. (Online). Available: http://www.aclweb.org/ anthology/W04-3252.
[14] Y. Matsuo and M. Ishizuka, “Keyword Extraction From A Single Document Using Word Co-occurrence Statistical Information,” International Journal on Artificial Intelligence Tools, vol. 13, no. 01, pp. 157–169, mar 2004. (Online). Available: http://www.worldscientific.com/doi/abs/10.1142/S0218213004001466.
[15] Y. Liu, F. Wang, C. Kang, Y. Gao, and Y. Lu, “Analyzing relatedness by toponym co-occurrences on web pages,” Transactions in GIS, 2014.
[16] X. Zhong, J. Liu, Y. Gao, and L. Wu, “Analysis of co-occurrence toponyms in web pages based on complex networks,” Physica A: Statistical Mechanics and its Applications, vol. 466, pp. 462–475, jan 2017. (Online). Available: https://www.sciencedirect.com/science/ article/pii/S0378437116306409.
[17] Z. Zhang and V. Saligrama, “PRISM: Person Re-Identification via Structured Matching,” IEEE Transaction On Pattern Analysis And Machine Intelligence, 2015.
[18] I. Ali and A. Melton, “Semantic-Based Text Document Clustering Using Cognitive Semantic Learning and Graph Theory,” in 2018 IEEE 12th International Conference on Semantic Computing (ICSC). IEEE, jan 2018, pp. 243–247. (Online). Available: https://ieeexplore.ieee.org/ document/8334465/.
[19] B. Lemaire and G. Denhi`ere, “Incremental Construction of an Associative Network from a Corpus,” cogprints.org, 2004. (Online). Available: http://cogprints.org/3779/.
[20] K. Ghoorchian, S. Girdzijauskas, and F. Rahimian, “DeGPar: Large Scale Topic Detection Using Node-Cut Partitioning on Dense Weighted Graphs,” in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, jun 2017, pp. 775–785. (Online). Available: http://ieeexplore.ieee.org/document/7980020/.
[21] L. Jiang, P. Luo, J. Wang, Y. Xiong, B. Lin, M. Wang, and N. An, “GRIAS: An entity-relation graph based framework for discovering entity aliases,” Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 310–319, 2013.
[22] D. V. Kalashnikov and S. Mehrotra, “Domain-independent data cleaning via analysis of entity-relationship graph,” ACM Transactions on Database Systems, vol. 31, no. 2, pp. 716–767, 2006.
[23] M. Laclav´ık, ˇ S. Dlugolinsk´y, and M. Ciglan, “Discovering Relations by Entity Search in Lightweight Semantic Text Graphs,” Computing And Informatics, vol. 33, no. 4, pp. 877–906, 2015. (Online). Available: http://www.cai.sk/ojs/index.php/cai/article/viewArticle/2242.
[24] “Bible: Characters in the bible,” Collins Dictionary. (Online). Available: https://www.collinsdictionary.com/word-lists/ bible-characters-in-the-bible.
[25] T. Bodruk, “Bible: Json + xml,” https://github.com/thiagobodruk/bible.
[26] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source software for exploring and manipulating networks,” 2009. (Online). Available: http://www.aaai.org/ocs/index.php/ICWSM/ 09/paper/view/154.