Compressed Suffix Arrays to Self-Indexes Based on Partitioned Elias-Fano

Guo Wenyu; Qu Youli

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

Compressed Suffix Arrays to Self-Indexes Based on Partitioned Elias-Fano

Authors: Guo Wenyu, Qu Youli

Abstract:

A practical and simple self-indexing data structure, Partitioned Elias-Fano (PEF) - Compressed Suffix Arrays (CSA), is built in linear time for the CSA based on PEF indexes. Moreover, the PEF-CSA is compared with two classical compressed indexing methods, Ferragina and Manzini implementation (FMI) and Sad-CSA on different type and size files in Pizza & Chili. The PEF-CSA performs better on the existing data in terms of the compression ratio, count, and locates time except for the evenly distributed data such as proteins data. The observations of the experiments are that the distribution of the φ is more important than the alphabet size on the compression ratio. Unevenly distributed data φ makes better compression effect, and the larger the size of the hit counts, the longer the count and locate time.

Keywords: Compressed suffix array, self-indexing, partitioned Elias-Fano, PEF-CSA.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1129125

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1088

References:

[1] Manber U, Myers G. Suffix arrays: a new method for on-line string searches (J). Siam Journal on Computing, 1993, 22(5):935-948.
[2] Ukkonen E. On-line construction of suffix trees (J). Algorithmica, 1995, 14(3):249-260.
[3] Mccreight E M. A Space-Economical Suffix Tree Construction Algorithm (J). Journal of the Acm, 1976, 23(2):262-272
[4] Grossi R, Vitter J S. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching (Extended Abstract) (C)// ACM Symposium on Theory of Computing. ACM, 2000:397-406.
[5] Grossi R, Vitter J S. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching (J). Siam Journal on Computing, 2006, 35(2):397--406.
[6] Sadakane K. Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array (M)// Algorithms and Computation. Springer Berlin Heidelberg, 2000:410--421.
[7] Sadakane K. New text indexing functionalities of the compressed suffix arrays (J). Journal of Algorithms, 2003, 48(2):294-313.
[8] Ferragina P, Manzini G. Opportunistic data structures with applications (M). University of Pisa, 2000.
[9] Ferragina P, Manzini G. An experimental study of an opportunistic index (C)// Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2001:269-278.
[10] Burrows M. A block-sorting lossless data compression algorithm (J). Technical Report Digital Src Research Report, 1995, 57(4):425.
[11] Ferragina, Paolo, Manzini, Giovanni. On compressing and indexing data (J). Università Di Pisa, 2002.
[12] Ottaviano G, Venturini R. Partitioned Elias-Fano indexes (C)// International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2014:273-282.
[13] Ferragina P, Nitto I, Venturini R. On Optimally Partitioning a Text to Improve Its Compression (J). Algorithmica, 2011, 61(1):51-74.
[14] Vigna S. Quasi-succinct indices (J). 2012:83-92.
[15] Navarro G, Mäkinen V. Compressed full-text indexes (J). Acm Computing Surveys, 2007, 39(1):2.