Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 4

k-mers Related Abstracts

4 Identification of Disease Causing DNA Motifs in Human DNA Using Clustering Approach

Authors: G. Tamilpavai, C. Vishnuppriya

Abstract:

Studying DNA (deoxyribonucleic acid) sequence is useful in biological processes and it is applied in the fields such as diagnostic and forensic research. DNA is the hereditary information in human and almost all other organisms. It is passed to their generations. Earlier stage detection of defective DNA sequence may lead to many developments in the field of Bioinformatics. Nowadays various tedious techniques are used to identify defective DNA. The proposed work is to analyze and identify the cancer-causing DNA motif in a given sequence. Initially the human DNA sequence is separated as k-mers using k-mer separation rule. The separated k-mers are clustered using Self Organizing Map (SOM). Using Levenshtein distance measure, cancer associated DNA motif is identified from the k-mer clusters. Experimental results of this work indicate the presence or absence of cancer causing DNA motif. If the cancer associated DNA motif is found in DNA, it is declared as the cancer disease causing DNA sequence. Otherwise the input human DNA is declared as normal sequence. Finally, elapsed time is calculated for finding the presence of cancer causing DNA motif using clustering formation. It is compared with normal process of finding cancer causing DNA motif. Locating cancer associated motif is easier in cluster formation process than the other one. The proposed work will be an initiative aid for finding genetic disease related research.

Keywords: Bioinformatics, Dna, SOM, cancer motif, k-mers, Levenshtein distance

Procedia PDF Downloads 57
3 Finding the Longest Common Subsequence in Normal DNA and Disease Affected Human DNA Using Self Organizing Map

Authors: G. Tamilpavai, C. Vishnuppriya

Abstract:

Bioinformatics is an active research area which combines biological matter as well as computer science research. The longest common subsequence (LCSS) is one of the major challenges in various bioinformatics applications. The computation of the LCSS plays a vital role in biomedicine and also it is an essential task in DNA sequence analysis in genetics. It includes wide range of disease diagnosing steps. The objective of this proposed system is to find the longest common subsequence which presents in a normal and various disease affected human DNA sequence using Self Organizing Map (SOM) and LCSS. The human DNA sequence is collected from National Center for Biotechnology Information (NCBI) database. Initially, the human DNA sequence is separated as k-mer using k-mer separation rule. Mean and median values are calculated from each separated k-mer. These calculated values are fed as input to the Self Organizing Map for the purpose of clustering. Then obtained clusters are given to the Longest Common Sub Sequence (LCSS) algorithm for finding common subsequence which presents in every clusters. It returns nx(n-1)/2 subsequence for each cluster where n is number of k-mer in a specific cluster. Experimental outcomes of this proposed system produce the possible number of longest common subsequence of normal and disease affected DNA data. Thus the proposed system will be a good initiative aid for finding disease causing sequence. Finally, performance analysis is carried out for different DNA sequences. The obtained values show that the retrieval of LCSS is done in a shorter time than the existing system.

Keywords: Clustering, SOM, k-mers, longest common subsequence

Procedia PDF Downloads 167
2 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Mpho Mokoatle, Darlington Mapiye, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on k-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isoloates. Cluster analysis showed that k-mers maybe used to discriminate phenotypes and the discrimination becomes more concise as the size of k-mers increase. The best performing classification model had a k-mer size of 10 (longest k-mer) an accuracy, recall, precision, specificity, and Matthews Correlation coeffient of 72.0 %, 80.5 %, 80.5 %, 63.6 %, and 0.4 respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biological explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms

Keywords: Next Generation Sequencing, Bootstrapping, k-mers, AWD-LSTM

Procedia PDF Downloads 1
1 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Mpho Mokoatle, Darlington Mapiye, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on $k$-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isoloates. Cluster analysis showed that k-mers maybe used to discriminate phenotypes and the discrimination becomes more concise as the size of k-mers increase. The best performing classification model had a k-mer size of 10 (longest k-mer) an accuracy, recall, precision, specificity, and Matthews Correlation coeffient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4 respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biological explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms.

Keywords: Next Generation Sequencing, Bootstrapping, k-mers, AWD-LSTM

Procedia PDF Downloads 1