A New Edit Distance Method for Finding Similarity in Dna Sequence

Patsaraporn Somboonsak; Mud-Armeen Munlin

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

A New Edit Distance Method for Finding Similarity in Dna Sequence

Authors: Patsaraporn Somboonsak, Mud-Armeen Munlin

Abstract:

The P-Bigram method is a string comparison methods base on an internal two characters-based similarity measure. The edit distance between two strings is the minimal number of elementary editing operations required to transform one string into the other. The elementary editing operations include deletion, insertion, substitution two characters. In this paper, we address the P-Bigram method to sole the similarity problem in DNA sequence. This method provided an efficient algorithm that locates all minimum operation in a string. We have been implemented algorithm and found that our program calculated that smaller distance than one string. We develop PBigram edit distance and show that edit distance or the similarity and implementation using dynamic programming. The performance of the proposed approach is evaluated using number edit and percentage similarity measures.

Keywords: Edit distance, String Matching, String Similarity

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1333320

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3322

References:

[1] Saul B. Needleman, Christlan D. Wunsch, "A General Applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins", J Mol. Biol, Vol. 48, 1970, pp. 443-453.
[2] David Sankoff, "Simultaneous solution of the RNA folding alignment and protosequence problems", Siam J. Appl Math, Vol. 45, 1985, pp. 810-825.
[3] A drien Coyette, et al., "Trainable Sketch Recognizer of Graphical User Interface Design", International Federation for Information Processing, Vol. 1, 2007, pp. 124-135.
[4] Levenshtein V.I. "Binary codes capable of correcting deletions, insertions and reversals", Soviet Physics Doklady, Vol. 8, 1966, pp. 705- 710.
[5] Gusfield. "Algorithms on String Trees and Sequences", Computer science and Computational Biology Cambridge University Press, 1997.
[6] Hall, P.A.V, Dowling, G.R. "Approximate string matching", ACM Computing Surveys, Vol. 4, 1980, pp. 381-402.
[7] Christen, P. "A Comparison of Personal Name Matching Techniques and Practical Issues", Technical Report TR-CS-06-02, Joint Computer Science Technical Report Series, Department of Computer Science, 2006.
[8] Cohen, W. Ravikumar, P. Fienberg, S."A comparison of string distance metrics for name-matching tasks", IJCAI Workshop on Information Integration on the Web, Acapulco, Mexico, 2003, pp. 73-78.
[9] AbdulJaleel, N. Larkey, L.S. "Statistical transliteration for English- Arabic cross language information retrieval", CIKM, 2003, PP. 139-146.
[10] Linden, K, "Multilingual Modeling of Cross-Lingual Spelling Variants spelling variants Information Retrieval", Vol. 3, 2006, pp. 295-310.
[11] Ristad, E.S. Yianilos, P.N. "Learning string-edit distance", IEEE Transactions or Pattern Analysis and Machine Intelligence, 1998.
[12] Carlo Batini, Monica Scannapieco, "Data Quality Concepts, Methodologies and Techniqes", Springer, DCSA, 1998, pp. 117-127.
[13] Heikki Hyyrö, Ayumi Shinohara, "New Bit-Parallel-Distance Algorithm", Nikoletseas, LNCS 3503, 2005, pp. 380-390.
[14] Adrian. Horia Dediu, et al., "A fast Longest Common Subsequence Algorithm for Similar Strings", Language and Automata Theory and Applications, International Conference, LATA, 2010, pp. 82-93.
[15] Heikki Hyyrö, "Restricted Transposition Invariant Approximate String Matching Under Edit Distance", SPIRE, LNCS 3772, 2005, pp. 256- 266.
[16] M.H. Alsuwaiyel, "Algorithms design techniques and analysis", World Scientific Connecting Great Minds, Vol. 7, 1999, pp. 203-208.
[17] Dekang Lin(1998), "An Information Theoretic Definition of similarity", Proceedings of the 15th international conference on statistic", Citeseer.