Interactive, Topic-Oriented Search Support by a Centroid-Based Text Categorisation

Authors: Mario Kubek, Herwig Unger

Abstract:

Centroid terms are single words that semantically and topically characterise text documents and can therefore serve as a very compact representation of them in automatic text processing. In the present paper, centroids are used to measure the relevance of text documents with respect to a given search query. On this basis, a new graph-based paradigm for searching texts in large corpora is proposed and evaluated against keyword-based methods. First, promising experimental results demonstrate the usefulness of the centroid-based search procedure. In particular, it is shown that the routing of search queries in interactive and decentralised search systems can be greatly improved by applying this approach. A detailed discussion of further fields of application completes this contribution.
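
The idea of matching a query to documents via centroid terms can be illustrated with a small sketch. It assumes that a centroid term is the node of a co-occurrence graph with the smallest average shortest-path distance to the terms of a text, and that relevance is the graph distance between the query centroid and each document centroid. The abstract does not spell out these definitions, so this is an illustration under stated assumptions rather than the authors' procedure. The toy graph, its edge distances, the document names and all function names are hypothetical.

```python
import heapq
from math import inf


def build_graph(edges):
    """Build an undirected graph {term: {neighbour: distance}} from edge triples."""
    graph = {}
    for a, b, distance in edges:
        graph.setdefault(a, {})[b] = distance
        graph.setdefault(b, {})[a] = distance
    return graph


def shortest_distances(graph, source):
    """Dijkstra: shortest-path distance from source to every reachable term."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, inf):
            continue  # stale heap entry
        for neighbour, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbour, inf):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist


def centroid_term(graph, terms):
    """Return the term with the smallest average distance to the given terms.

    Assumed definition; terms missing from the graph are ignored.
    """
    known = [t for t in terms if t in graph]
    if not known:
        return None
    dist_from = {t: shortest_distances(graph, t) for t in known}
    best, best_avg = None, inf
    for node in sorted(graph):  # sorted for deterministic tie-breaking
        avg = sum(dist_from[t].get(node, inf) for t in known) / len(known)
        if avg < best_avg:
            best, best_avg = node, avg
    return best


def rank_documents(graph, query_terms, doc_centroids):
    """Rank documents by distance between query centroid and document centroid."""
    query_centroid = centroid_term(graph, query_terms)
    if query_centroid is None:
        return []
    dist = shortest_distances(graph, query_centroid)
    scored = [(doc, dist.get(c, inf)) for doc, c in doc_centroids.items()]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Hypothetical co-occurrence graph; smaller distance = stronger association.
GRAPH = build_graph([
    ("search", "query", 1.0), ("search", "engine", 1.0),
    ("search", "index", 1.0), ("engine", "index", 1.5),
    ("index", "apple", 5.0), ("apple", "fruit", 1.0),
    ("apple", "tree", 1.0), ("fruit", "tree", 1.5),
])

# Hypothetical documents, each represented only by its centroid term.
DOC_CENTROIDS = {"doc_a": "engine", "doc_b": "apple", "doc_c": "search"}

print(centroid_term(GRAPH, ["query", "engine", "index"]))
print(rank_documents(GRAPH, ["query", "engine", "index"], DOC_CENTROIDS))
```

In this toy setting each document is reduced to a single term, so comparing a query with a document needs only one distance lookup in the graph instead of a keyword-by-keyword comparison.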

Keywords: Search algorithm, centroid, query, keyword, co-occurrence, categorisation.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.2643818
