Compression of Semistructured Documents

Leo Galambos; Jan Lansky; Katsiaryna Chernik

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

Compression of Semistructured Documents

Authors: Leo Galambos, Jan Lansky, Katsiaryna Chernik

Abstract:

EGOTHOR is a search engine that indexes the Web and allows us to search the Web documents. Its hit list contains URL and title of the hits, and also some snippet which tries to shortly show a match. The snippet can be almost always assembled by an algorithm that has a full knowledge of the original document (mostly HTML page). It implies that the search engine is required to store the full text of the documents as a part of the index. Such a requirement leads us to pick up an appropriate compression algorithm which would reduce the space demand. One of the solutions could be to use common compression methods, for instance gzip or bzip2, but it might be preferable if we develop a new method which would take advantage of the document structure, or rather, the textual character of the documents. There already exist a special compression text algorithms and methods for a compression of XML documents. The aim of this paper is an integration of the two approaches to achieve an optimal level of the compression ratio

Keywords: Compression, search engine, HTML, XML.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1328998

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1580

References:

[1] Adiego, J., Feunte, P.: On the Use of Words as Source Alphabet Symbols in PPM. Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (2006) 435.
[2] Arnavut, Z.: Move-to-front and inversion coding. Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (2000) 193-202.
[3] Burrows, M., Wheeler, D. J.: A Block Sorting Loseless Data Compression Algorithm. Technical report, Digital Equipment Corporation, Palo Alto, CA, U.S.A (2003).
[4] Dvorsky, J., Pokorny, J., Snasel, V.: Word-based Compression Methods for Large Text Documents. Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (1999) 523.
[5] Gailly, J. L.: Gzip program and documentation (1993). Source code aviable from ftp://prep.ai.mit.edu/pub/gnu/gzip-*.tar
[6] Galambos, L.: Dynamization in IR Systems. Mieczyslaw A. Klopotek, Slawomir T. Wierzchon, Krzysztof Trojanowski (Eds.): IIPWM, Proc. of the Int. IIS: IIPWM-04, Poland, 2004. ASC Springer 2004, ISBN 3-540- 21331-7.
[7] Galambos, L.: EGOTHOR. http://www.egothor.org/
[8] Cheney, J.: Compressing XML with Multiplexed Hierarchical PPM Models. Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (2001) 163.
[9] Chernik, K., Lansky, J., Galambos, L.: Syllable-based compression for XML documents. In: Snasel, V., Richta, K., and Pokorny, J.: Proceedings of the Dateso 2006 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, Vol. 176, (2006) 21-31
[10] Isal, R.Y.K., Moffat, A.: Word-based Block-sorting Text Compression. Proc. 24th Australasian Computer Science Conference, Gold Coast, Australia, (2001) 92-99
[11] Lansky, J., Zemlicka, M.: Compression of a Dictionary. In: Snasel, V., Richta, K., and Pokorny, J.: Proceedings of the Dateso 2006 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, Vol. 176, (2006) 11-20
[12] Lansky, J., Zemlicka, M.: Compression of Small Text Files Using Syllables. Data Compression Conference, IEEE CS Press, Los Alamitos, CA, USA (2006) 458.
[13] Lansky, J., Zemlicka, M.: Text Compression: Syllables. In: Richta, K., Snasel, V., Pokorny, J.: Proceedings of the Dateso 2005 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, Vol. 129, (2005) 32-45
[14] Liefke, H., Suciu, D.: XMill: an Efficient Compressor for XML Data. In Proc. ACM SIGMOD Conference (2000) 153-164
[15] Megginson, D.: SAX: A Simple API for XML. http://www. saxproject.org
[16] Min, J. K., Park, M. J., Chung, C. W.: XPRESS: A Queriable Compression for XML Data. SIGMOD 2003, San Diego, CA, USA (2003) 122-133
[17] Moffat, A., Neal, R. M., Witten, I. H.: Arithmetic Coding Revisited. ACM Transactions on Information Systems, 16, (1998) 256-294
[18] O-Neill, E. T., Lavoie, B. F., Bennett, R.: Trends in the Evolution of the Public Web: 1998-2002. D-Lib Magazine (2003) 1082-9873
[19] Seward, J.: On the Performance of BWT Sorting Algorithms. DCC, IEEE CS Press, Los Alamitos, CA, USA (2000) 173.
[20] Seward, J.: The bzip2 and libbzip2 official home page. http:// sources.redhat.com/bzip2/
[21] Tolani, P., Haritsa, J. R.: XGrind: A Query-friendly XML Compressor. In Proc. IEEE International Conference on Data Engineering (2002).
[22] Turpin, A., Moffat, A.: Housekeeping for prefix coding. IEEE Transaction on Communications, 48(4), (2000) 622-628.
[23] Welch, T. A.: A technique for high performance data compression. IEEE Computer, 17(6) (1984) 8-19.
[24] Witten, I., Moffat, A., Bell, T.: Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold (1994).
[25] World Wide Web Consorcium: Extensive Markup Language (XML). http://www.w3.org/XML/
[26] World Wide Web Consorcium: HyperText Markup Language (HTML). http://www.w3.org/MarkUp/