A File Splitting Technique for Reducing the Entropy of Text Files
Authors: Abdel-Rahman M. Jaradat, Mansour I. Irshid, Talha T. Nassar
Abstract:
A novel file splitting technique for reducing the nth-order entropy of text files is proposed. The technique maps the original text file into a non-ASCII binary file using a new codeword assignment method, and then splits the resulting binary file into several subfiles, each containing one or more bits from each codeword of the mapped binary file. The statistical properties of the subfiles are studied, and it is found that they reflect the statistical properties of the original text file, which is not the case when the ASCII code is used as the mapper. The nth-order entropies of these subfiles are determined, and it is found that their sum is less than the nth-order entropy of the original text file for the same extension orders. These statistical properties can be used to achieve better compression ratios when conventional compression techniques are applied to the subfiles individually, on a bit-wise rather than a character-wise basis.
Keywords: Bit-wise compression, entropy, file splitting, source mapping.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1334752
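The map-then-split idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the codeword assignment method, so the mapper below is only an assumption: each character receives a fixed-length codeword equal to its frequency rank written in binary. The function names (`map_to_codewords`, `split_into_subfiles`, `block_entropy`) are hypothetical, one bit per subfile is just one of the possible splits, and the sketch makes no attempt to reproduce the entropy figures reported in the paper.

```python
from collections import Counter
from math import ceil, log2


def block_entropy(symbols, n=1):
    """Empirical entropy, in bits per block, of non-overlapping blocks of n symbols."""
    blocks = [tuple(symbols[i:i + n]) for i in range(0, len(symbols) - n + 1, n)]
    total = len(blocks)
    counts = Counter(blocks)
    return -sum(c / total * log2(c / total) for c in counts.values())


def map_to_codewords(text):
    """ASSUMED mapper (not the paper's): codeword = frequency rank in fixed-width binary."""
    ranked = [ch for ch, _ in Counter(text).most_common()]
    width = max(1, ceil(log2(len(ranked))))
    table = {ch: format(rank, f"0{width}b") for rank, ch in enumerate(ranked)}
    return [table[ch] for ch in text], width


def split_into_subfiles(codewords, width):
    """Subfile k collects bit k of every codeword (one bit per subfile)."""
    return [[int(cw[k]) for cw in codewords] for k in range(width)]


if __name__ == "__main__":
    text = "abracadabra"
    codewords, width = map_to_codewords(text)
    subfiles = split_into_subfiles(codewords, width)
    print("character entropy:", block_entropy(text, n=1))
    for k, bits in enumerate(subfiles):
        print(f"subfile {k}:", bits, "entropy:", block_entropy(bits, n=1))
```

Because every codeword has the same width, the subfiles can be interleaved again to recover the mapped file, so the split itself loses no information. Passing a larger `n` to `block_entropy` gives a block-based estimate for higher extension orders.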