Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm

Jehad A. H. Hammad; Nur'Aini binti Abdul Rashid

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 32797

Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm

Authors: Jehad A. H. Hammad, Nur'Aini binti Abdul Rashid

Abstract:

With the rapid development in the field of life sciences and the flooding of genomic information, the need for faster and scalable searching methods has become urgent. One of the approaches that were investigated is indexing. The indexing methods have been categorized into three categories which are the lengthbased index algorithms, transformation-based algorithms and mixed techniques-based algorithms. In this research, we focused on the transformation based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied the parallel methods to speed up the index building time and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the use of N-Gram transformation algorithm is an economical solution; it saves time and space too. The result shows that the size of the index is smaller than the size of the dataset when the size of N-Gram is 5 and 6. The parallel N-Gram transformation algorithm-s results indicate that the uses of parallel programming with large dataset are promising which can be improved further.

Keywords: Biological sequence, Database index, N-gram indexing, Parallel computing, Sequence retrieval.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1328980

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2077

References:

[1] R. Bader, "OpenMP - Parallel programming on shared memory systems ": Leibniz-rechenzentrum. Retrieved September 14,2008 from: http://www.lrz-muenchen.de/services/software/parallel/openmp/, 2008.
[2] B. Barney, "Introduction to Parallel Computing." vol. 2008: Livermore Computing,National Laboratory. Retrieved July 4, 2008 from: https://computing.llnl.gov/tutorials/parallel_comp, 2007.
[3] O. Beng Chin, P. Hwee Hwa, W. Hao, W. Limsoon, and Y. Cui, "Fast filter-and-refine algorithms for subsequence selection," in Database Engineering and Applications Symposium, 2002. Proceedings. International, 2002, pp. 243-254.
[4] A. Califano and I. Rigoutsos, "FLASH: a fast look-up algorithm for string homology," in Computer Vision and Pattern Recognition, 1993. Proceedings CVPR '93., 1993 IEEE Computer Society Conference on, New York, NY, USA, 1993, pp. 353-359.
[5] G. Cooper, M. Raymer, T. Doom, D. Krane, and N. Futamura, "Indexing genomic databases," in Bioinformatics and Bioengineering, 2004. BIBE 2004. Proceedings. Fourth IEEE Symposium on, 2004, pp. 587-591.
[6] Q. Cory, "Introduction to programming shared-memory and distributedmemory parallel computers," Crossroads, vol. 8, pp. 16-22, 2002.
[7] H. Ela, P. A. Malcolm, and W. I. Robert, "A Database Index to Large Biological Sequences," in Proceedings of the 27th International Conference on Very Large Data Bases: Morgan Kaufmann Publishers Inc., 2001.
[8] H. Ela, P. A. Malcolm, and W. I. Robert, "Database indexing for large DNA and protein sequence collections," The VLDB Journal, vol. 11, pp. 256-271, 2002.
[9] Z. Elberrichi and B. Aljohar, "N-grams in Texts Categorization," Scientific Journal of King Faisal University (Basic and Applied Sciences), vol. Vol.8 No.2, pp. 25-38, 2007.
[10] C. Fondrat and P. Dessen, "A rapid access motif database (RAMdb) with a search algorithm for the retrieval patterns in nucleic acids or protein databanks," Comput. Appl. Biosci., vol. 11, pp. 273-279, June 1, 1995 1995.
[11] I. Foster, Designing and Building Parallel Programs. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc, 1995.
[12] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Second Edition ed.: Addison Wesley, 2003.
[13] U. Hobohm and C. Sander, "A Sequence Property Approach to Searching Protein Databases," Journal of Molecular Biology, vol. 251, pp. 390-399, 1995.
[14] L. Hsiao Ping, T. Yin Te, S. Ching Hua, S. Tzu Fang, and T. Chuan Yi, "An IDC-based algorithm for efficient homology filtration with guaranteed seriate coverage," in Bioinformatics and Bioengineering, 2004. BIBE 2004. Proceedings. Fourth IEEE Symposium on, 2004, pp. 395-402.
[15] M. Huerta, F. Haseltine, and Y. Liu. vol. 2008: National Institute of Mental Health (NIH) , Biomedical information science and technology initiative (BISTI), Retrieved February 6, 2008, from: http://www.bisti.nih.gov/CompuBioDef.pdf., 2000.
[16] M. N. Hwang and J. Kim, "Protein Sequence Search based on N-gram Indexing," Bioinformatics and Biosystems, vol. Vol. 1, pp. 53-57, 2006.
[17] X. Jiang, P. Zhang, X. Liu, and S. S. T. Yau, "Survey on index based homology search algorithms," Springer Science and Business /media, pp. 40:185-212, 23 March 2007 2007.
[18] T. Kahveci and A. K. Singh, "An Efficient Index Structure for String Databases," in Proceedings of the37th VLDB conference, Roma, Italy, 2001, pp. 351--360.
[19] T. Kahveci and A. K. Singh, "MAP: searching large genome databases," in Pacific Symposium on Biocomputing, Hawaii, 2003, pp. 303-314.
[20] W. J. Kent, "BLAT---The BLAST-Like Alignment Tool," Genome Res., pp. GR-2292R, March 20, 2002.
[21] M. S. Kim, K. Y. Whang, J. G. Lee, and M. J. Lee, "n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure," in VLDB, Trondheim, Norway, 2005, pp. 325-336.
[22] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval: Cambridge University Press, 2008.
[23] S. Microsystems, "Multithreaded programming guide," Sun Microsystems, Inc Business. Retrieved September 14,2008 from: http://docsun.cites.uiuc.edu/sun_docs/C/solaris_9/SUNWdev/MTP/toc.h tml 2002.
[24] Z. B. Miled, N. Li, M. Mahoui, and O. Bukhres, "Information Retrieval in Biomedical Research," Wiley encyclopedia of biomedical engineering, 2006.
[25] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: A Fast Search Method for Large DNA Databases," Genome Res., vol. 11, pp. 1725-1729, October 1, 2001 2001.
[26] T. H. Ong, K. L. Tan, and H. Wang, "Indexing Genomic Databases for Fast Homology Searching," in Proceedings of the 13th International Conference on Database and Expert Systems Applications, 2002.
[27] B. Parhami, Introduction to Parallel Processing Algorithms and Architectures New York: Kluwer Academic Publishers, 2002.
[28] C. Weimin and K. Aberer, "Efficient querying on genomic databases by using metric space indexing techniques," in Proceedings of the 8th International Workshop on Database and Expert Systems Applications: IEEE Computer Society, 1997.
[29] H. Williams and J. Zobel, "Indexing nucleotide databases for fast query evaluation," in Advances in Database Technology ÔÇö EDBT '96, 1996, pp. 275-288.
[30] H. E. WILLIAMS, "Effective query filtering for fast homology searching," Pacific Symposium on Biocomputing, pp. 214 - 225 1999.
[31] H. E. Williams and J. Zobel, "Indexing and retrieval for genomic databases," Knowledge and Data Engineering, IEEE Transactions on, vol. 14, pp. 63-78, 2002.
[32] C. Xia, L. Shuai Cheng, O. Beng Chin, and K. H. T. Anthony, "Piers: an efficient model for similarity search in DNA sequence databases," SIGMOD Rec., vol. 33, pp. 39-44, 2004.
[33] L. Yip Chi and B. Kao, "A study on n-gram indexing of musical features," in International Conference on Multimedia and Expo, 2000 IEEE New York, NY, USA, 2000, pp. 869-872 vol.2.