Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 14

Next Generation Sequencing Related Abstracts

14 Postmortem Genetic Testing to Sudden and Unexpected Deaths Using the Next Generation Sequencing

Authors: Eriko Ochiai, Fumiko Satoh, Keiko Miyashita, Yu Kakimoto, Motoki Osawa

Abstract:

Sudden and unexpected deaths from unknown causes occur in infants and youths. Recently, molecular links between a part of these deaths and several genetic diseases are examined in the postmortem. For instance, hereditary long QT syndrome and Burgada syndrome are occasionally fatal through critical ventricular tachyarrhythmia. There are a large number of target genes responsible for such diseases, the conventional analysis using the Sanger’s method has been laborious. In this report, we attempted to analyze sudden deaths comprehensively using the next generation sequencing (NGS) technique. Multiplex PCR to subject’s DNA was performed using Ion AmpliSeq Library Kits 2.0 and Ion AmpliSeq Inherited Disease Panel (Life Technologies). After the library was constructed by emulsion PCR, the amplicons were sequenced 500 flows on Ion Personal Genome Machine System (Life Technologies) according to the manufacture instruction. SNPs and indels were analyzed to the sequence reads that were mapped on hg19 of reference sequences. This project has been approved by the ethical committee of Tokai University School of Medicine. As a representative case, the molecular analysis to a 40 years old male who received a diagnosis of Brugada syndrome demonstrated a total of 584 SNPs or indels. Non-synonymous and frameshift nucleotide substitutions were selected in the coding region of heart disease related genes of ANK2, AKAP9, CACNA1C, DSC2, KCNQ1, MYLK, SCN1B, and STARD3. In particular, c.629T-C transition in exon 3 of the SCN1B gene, resulting in a leu210-to-pro (L210P) substitution is predicted “damaging” by the SIFT program. Because the mutation has not been reported, it was unclear if the substitution was pathogenic. Sudden death that failed in determining the cause of death constitutes one of the most important unsolved subjects in forensic pathology. The Ion AmpliSeq Inherited Disease Panel can amplify the exons of 328 genes at one time. We realized the difficulty in selection of the true source from a number of candidates, but postmortem genetic testing using NGS analysis deserves of a diagnostic to date. We now extend this analysis to SIDS suspected subjects and young sudden death victims.

Keywords: Next Generation Sequencing, sudden death, postmortem genetic testing, SIDS

Procedia PDF Downloads 247
13 Whole Exome Sequencing in Characterizing Mysterious Crippling Disorder in India

Authors: Swarkar Sharma, Ekta Rai, Ankit Mahajan, Parvinder Kumar, Manoj K Dhar, Sushil Razdan, Kumarasamy Thangaraj, Carol Wise, Shiro Ikegawa M.D., K.K. Pandita M.D.

Abstract:

Rare disorders are poorly understood hence, remain uncharacterized or patients are misdiagnosed and get poor medical attention. A rare mysterious skeletal disorder that remained unidentified for decades and rendered many people physically challenged and disabled for life has been reported in an isolated remote village ‘Arai’ of Poonch district of Jammu and Kashmir. This village is located deep in mountains and the population residing in the region is highly consanguineous. In our survey of the region, 70 affected people were reported, showing similar phenotype, in the village with a population of approximately 5000 individuals. We were able to collect samples from two multi generational extended families from the village. Through Whole Exome sequencing (WES), we identified a rare variation NM_003880.3:c.156C>A NP_003871.1:p.Cys52Ter, which results in introduction of premature stop codon in WISP3 gene. We found this variation perfectly segregating with the disease in one of the family. However, this variation was absent in other family. Interestingly, a novel splice site mutation at position c.643+1G>A of WISP3 gene, perfectly segregating with the disease was observed in the second family. Thus, exploiting WES and putting different evidences together (familial histories and genetic data, clinical features, radiological and biochemical tests and findings), the disease has finally been diagnosed as a very rare recessive hereditary skeletal disease “Progressive Pseudorheumatoid Arthropathy of Childhood” (PPAC) also known as “Spondyloepiphyseal Dysplasia Tarda with Progressive Arthropathy” (SEDT-PA). This genetic characterization and identification of the disease causing mutations will aid in genetic counseling, critically required to curb this rare disorder and to prevent its appearance in future generations in the population. Further, understanding of the role of WISP3 gene the biological pathways should help in developing treatment for the disorder.

Keywords: Next Generation Sequencing, whole exome sequencing, rare disorders

Procedia PDF Downloads 251
12 Accurate HLA Typing at High-Digit Resolution from NGS Data

Authors: Yan Zhang, Yazhi Huang, Jing Yang, Dingge Ying, Vorasuk Shotelersuk, Nattiya Hirankarn, Pak Chung Sham, Yu Lung Lau, Wanling Yang

Abstract:

Human leukocyte antigen (HLA) typing from next generation sequencing (NGS) data has the potential for applications in clinical laboratories and population genetic studies. Here we introduce a novel technique for HLA typing from NGS data based on read-mapping using a comprehensive reference panel containing all known HLA alleles and de novo assembly of the gene-specific short reads. An accurate HLA typing at high-digit resolution was achieved when it was tested on publicly available NGS data, outperforming other newly-developed tools such as HLAminer and PHLAT.

Keywords: Next Generation Sequencing, whole exome sequencing, human leukocyte antigens, HLA typing

Procedia PDF Downloads 351
11 Comprehensive Longitudinal Multi-omic Profiling in Weight Gain and Insulin Resistance

Authors: Christine Y. Yeh, Brian D. Piening, Sarah M. Totten, Kimberly Kukurba, Wenyu Zhou, Kevin P. F. Contrepois, Gucci J. Gu, Sharon Pitteri, Michael Snyder

Abstract:

Three million deaths worldwide are attributed to obesity. However, the biomolecular mechanisms that describe the link between adiposity and subsequent disease states are poorly understood. Insulin resistance characterizes approximately half of obese individuals and is a major cause of obesity-mediated diseases such as Type II diabetes, hypertension and other cardiovascular diseases. This study makes use of longitudinal quantitative and high-throughput multi-omics (genomics, epigenomics, transcriptomics, glycoproteomics etc.) methodologies on blood samples to develop multigenic and multi-analyte signatures associated with weight gain and insulin resistance. Participants of this study underwent a 30-day period of weight gain via excessive caloric intake followed by a 60-day period of restricted dieting and return to baseline weight. Blood samples were taken at three different time points per patient: baseline, peak-weight and post weight loss. Patients were characterized as either insulin resistant (IR) or insulin sensitive (IS) before having their samples processed via longitudinal multi-omic technologies. This comparative study revealed a wealth of biomolecular changes associated with weight gain after using methods in machine learning, clustering, network analysis etc. Pathways of interest included those involved in lipid remodeling, acute inflammatory response and glucose metabolism. Some of these biomolecules returned to baseline levels as the patient returned to normal weight whilst some remained elevated. IR patients exhibited key differences in inflammatory response regulation in comparison to IS patients at all time points. These signatures suggest differential metabolism and inflammatory pathways between IR and IS patients. Biomolecular differences associated with weight gain and insulin resistance were identified on various levels: in gene expression, epigenetic change, transcriptional regulation and glycosylation. This study was not only able to contribute to new biology that could be of use in preventing or predicting obesity-mediated diseases, but also matured novel biomedical informatics technologies to produce and process data on many comprehensive omics levels.

Keywords: Next Generation Sequencing, insulin resistance, proteogenomics, multi-omics, type ii diabetes

Procedia PDF Downloads 265
10 Efficient Reuse of Exome Sequencing Data for Copy Number Variation Callings

Authors: Chen Wang, Jared Evans, Yan Asmann

Abstract:

With the quick evolvement of next-generation sequencing techniques, whole-exome or exome-panel data have become a cost-effective way for detection of small exonic mutations, but there has been a growing desire to accurately detect copy number variations (CNVs) as well. In order to address this research and clinical needs, we developed a sequencing coverage pattern-based method not only for copy number detections, data integrity checks, CNV calling, and visualization reports. The developed methodologies include complete automation to increase usability, genome content-coverage bias correction, CNV segmentation, data quality reports, and publication quality images. Automatic identification and removal of poor quality outlier samples were made automatically. Multiple experimental batches were routinely detected and further reduced for a clean subset of samples before analysis. Algorithm improvements were also made to improve somatic CNV detection as well as germline CNV detection in trio family. Additionally, a set of utilities was included to facilitate users for producing CNV plots in focused genes of interest. We demonstrate the somatic CNV enhancements by accurately detecting CNVs in whole exome-wide data from the cancer genome atlas cancer samples and a lymphoma case study with paired tumor and normal samples. We also showed our efficient reuses of existing exome sequencing data, for improved germline CNV calling in a family of the trio from the phase-III study of 1000 Genome to detect CNVs with various modes of inheritance. The performance of the developed method is evaluated by comparing CNV calling results with results from other orthogonal copy number platforms. Through our case studies, reuses of exome sequencing data for calling CNVs have several noticeable functionalities, including a better quality control for exome sequencing data, improved joint analysis with single nucleotide variant calls, and novel genomic discovery of under-utilized existing whole exome and custom exome panel data.

Keywords: Bioinformatics, Next Generation Sequencing, Exome sequencing, Computational Genetics, copy number variations, data reuse

Procedia PDF Downloads 129
9 De Novo Assembly and Characterization of the Transcriptome during Seed Development, and Generation of Genic-SSR Markers in Pomegranate (Punica granatum L.)

Authors: Ozhan Simsek, Dicle Donmez, Burhanettin Imrak, Ahsen Isik Ozguven, Yildiz Aka Kacar

Abstract:

Pomegranate (Punica granatum L.) is known to be one of the oldest edible fruit tree species, with a wide geographical global distribution. Fruits from the two defined varieties (Hicaznar and 33N26) were taken at intervals after pollination and fertilization at different sizes. Seed samples were used for transcriptome sequencing. Primary sequencing was produced by Illumina Hi-Seq™ 2000. Firstly, we had raw reads, and it was subjected to quality control (QC). Raw reads were filtered into clean reads and aligned to the reference sequences. De novo analysis was performed to detect genes expressed in seeds of pomegranate varieties. We performed downstream analysis to determine differentially expressed genes. We generated about 27.09 gb bases in total after Illumina Hi-Seq sequencing. All samples were assembled together, we got 59,264 Unigenes, the total length, average length, N50, and GC content of Unigenes are 84.547.276 bp, 1.426 bp, 2,137 bp, and 46.20 %, respectively. Unigenes were annotated with 7 functional databases, finally, 42.681(NR: 72.02%), 39.660 (NT: 66.92%), 30.790 (Swissprot: 51.95%), 20.212 (COG: 34.11%), 27.689 (KEGG: 46.72%), 12.328 (GO: 20.80%), and 33,833 (Interpro: 57.09%) Unigenes were annotated. With functional annotation results, we detected 42.376 CDS, and 4.999 SSR distribute on 16.143 Unigenes.

Keywords: Next Generation Sequencing, RNA-Seq, SSR, Illumina

Procedia PDF Downloads 55
8 South African Breast Cancer Mutation Spectrum: Pitfalls to Copy Number Variation Detection Using Internationally Designed Multiplex Ligation-Dependent Probe Amplification and Next Generation Sequencing Panels

Authors: Jaco Oosthuizen, Nerina C. Van Der Merwe

Abstract:

The National Health Laboratory Services in Bloemfontien has been the diagnostic testing facility for 1830 patients for familial breast cancer since 1997. From the cohort, 540 were comprehensively screened using High-Resolution Melting Analysis or Next Generation Sequencing for the presence of point mutations and/or indels. Approximately 90% of these patients stil remain undiagnosed as they are BRCA1/2 negative. Multiplex ligation-dependent probe amplification was initially added to screen for copy number variation detection, but with the introduction of next generation sequencing in 2017, was substituted and is currently used as a confirmation assay. The aim was to investigate the viability of utilizing internationally designed copy number variation detection assays based on mostly European/Caucasian genomic data for use within a South African context. The multiplex ligation-dependent probe amplification technique is based on the hybridization and subsequent ligation of multiple probes to a targeted exon. The ligated probes are amplified using conventional polymerase chain reaction, followed by fragment analysis by means of capillary electrophoresis. The experimental design of the assay was performed according to the guidelines of MRC-Holland. For BRCA1 (P002-D1) and BRCA2 (P045-B3), both multiplex assays were validated, and results were confirmed using a secondary probe set for each gene. The next generation sequencing technique is based on target amplification via multiplex polymerase chain reaction, where after the amplicons are sequenced parallel on a semiconductor chip. Amplified read counts are visualized as relative copy numbers to determine the median of the absolute values of all pairwise differences. Various experimental parameters such as DNA quality, quantity, and signal intensity or read depth were verified using positive and negative patients previously tested internationally. DNA quality and quantity proved to be the critical factors during the verification of both assays. The quantity influenced the relative copy number frequency directly whereas the quality of the DNA and its salt concentration influenced denaturation consistency in both assays. Multiplex ligation-dependent probe amplification produced false positives due to ligation failure when ligation was inhibited due to a variant present within the ligation site. Next generation sequencing produced false positives due to read dropout when primer sequences did not meet optimal multiplex binding kinetics due to population variants in the primer binding site. The analytical sensitivity and specificity for the South African population have been proven. Verification resulted in repeatable reactions with regards to the detection of relative copy number differences. Both multiplex ligation-dependent probe amplification and next generation sequencing multiplex panels need to be optimized to accommodate South African polymorphisms present within the genetically diverse ethnic groups to reduce the false copy number variation positive rate and increase performance efficiency.

Keywords: Next Generation Sequencing, South Africa, multiplex ligation-dependent probe amplification, familial breast cancer

Procedia PDF Downloads 76
7 Exhaled Breath Condensate in Lung Cancer: A Non-Invasive Sample for Easier Mutations Detection by Next Generation Sequencing

Authors: Omar Youssef, Aija Knuuttila, Paivi Piirilä, Virinder Sarhadi, Sakari Knuutila

Abstract:

Exhaled breath condensate (EBC) is a unique sample that allows studying different genetic changes in lung carcinoma through a non-invasive way. With the aid of next generation sequencing (NGS) technology, analysis of genetic mutations has been more efficient with increased sensitivity for detection of genetic variants. In order to investigate the possibility of applying this method for cancer diagnostics, mutations in EBC DNA from lung cancer patients and healthy individuals were studied by using NGS. The key aim is to assess the feasibility of using this approach to detect clinically important mutations in EBC. EBC was collected from 20 healthy individuals and 9 lung cancer patients (four lung adenocarcinomas, four 8 squamous cell carcinoma, and one case of mesothelioma). Mutations in hotpot regions of 22 genes were studied by using Ampliseq Colon and Lung cancer panel and sequenced on Ion PGM. Results demonstrated that all nine patients showed a total of 19 cosmic mutations in APC, BRAF, EGFR, ERBB4, FBXW7, FGFR1, KRAS, MAP2K1, NRAS, PIK3CA, PTEN, RET, SMAD4, and TP53. In controls, 15 individuals showed 35 cosmic mutations in BRAF, CTNNB1, DDR2, EGFR, ERBB2, FBXW7, FGFR3, KRAS, MET, NOTCH1, NRAS, PIK3CA, PTEN, SMAD4, and TP53. Additionally, 45 novel mutations not reported previously were also seen in patients’ samples, and 106 novel mutations were seen in controls’ specimens. KRAS exon 2 mutations G12D was identified in one control specimen with mutant allele fraction of 6.8%, while KRAS G13D mutation seen in one patient sample showed mutant allele fraction of 17%. These findings illustrate that hotspot mutations are present in DNA from EBC of both cancer patients and healthy controls. As some of the cosmic mutations were seen in controls too, no firm conclusion can be drawn on the clinical importance of cosmic mutations in patients. Mutations reported in controls could represent early neoplastic changes or normal homeostatic process of apoptosis occurring in lung tissue to get rid of mutant cells. At the same time, mutations detected in patients might represent a non-invasive easily accessible way for early cancer detection. Follow up of individuals with important cancer mutations is necessary to clarify the significance of these mutations in both healthy individuals and cancer patients.

Keywords: Lung cancer, Next Generation Sequencing, mutations, exhaled breath condensate

Procedia PDF Downloads 41
6 Microbial Dark Matter Analysis Using 16S rRNA Gene Metagenomics Sequences

Authors: Hana Barak, Alex Sivan, Ariel Kushmaro

Abstract:

Microorganisms are the most diverse and abundant life forms on Earth and account for a large portion of the Earth’s biomass and biodiversity. To date though, our knowledge regarding microbial life is lacking, as it is based mainly on information from cultivated organisms. Indeed, microbiologists have borrowed from astrophysics and termed the ‘uncultured microbial majority’ as ‘microbial dark matter’. The realization of how diverse and unexplored microorganisms are, actually stems from recent advances in molecular biology, and in particular from novel methods for sequencing microbial small subunit ribosomal RNA genes directly from environmental samples termed next-generation sequencing (NGS). This has led us to use NGS that generates several gigabases of sequencing data in a single experimental run, to identify and classify environmental samples of microorganisms. In metagenomics sequencing analysis (both 16S and shotgun), sequences are compared to reference databases that contain only small part of the existing microorganisms and therefore their taxonomy assignment may reveal groups of unknown microorganisms or origins. These unknowns, or the ‘microbial sequences dark matter’, are usually ignored in spite of their great importance. The goal of this work was to develop an improved bioinformatics method that enables more complete analyses of the microbial communities in numerous environments. Therefore, NGS was used to identify previously unknown microorganisms from three different environments (industrials wastewater, Negev Desert’s rocks and water wells at the Arava valley). 16S rRNA gene metagenome analysis of the microorganisms from those three environments produce about ~4 million reads for 75 samples. Between 0.1-12% of the sequences in each sample were tagged as ‘Unassigned’. Employing relatively simple methodology for resequencing of original gDNA samples through Sanger or MiSeq Illumina with specific primers, this study demonstrates that the mysterious ‘Unassigned’ group apparently contains sequences of candidate phyla. Those unknown sequences can be located on a phylogenetic tree and thus provide a better understanding of the ‘sequences dark matter’ and its role in the research of microbial communities and diversity. Studying this ‘dark matter’ will extend the existing databases and could reveal the hidden potential of the ‘microbial dark matter’.

Keywords: Bioinformatics, Dark Matter, Bacteria, Next Generation Sequencing, unknown

Procedia PDF Downloads 50
5 Changes in Skin Microbiome Diversity According to the Age of Xian Women

Authors: Hanbyul Kim, Hye-Jin Kin, Taehun Park, Woo Jun Sul, Susun An

Abstract:

Skin is the largest organ of the human body and can provide the diverse habitat for various microorganisms. The ecology of the skin surface selects distinctive sets of microorganisms and is influenced by both endogenous intrinsic factors and exogenous environmental factors. The diversity of the bacterial community in the skin also depends on multiple host factors: gender, age, health status, location. Among them, age-related changes in skin structure and function are attributable to combinations of endogenous intrinsic factors and exogenous environmental factors. Skin aging is characterized by a decrease in sweat, sebum and the immune functions thus resulting in significant alterations in skin surface physiology including pH, lipid composition, and sebum secretion. The present study gives a comprehensive clue on the variation of skin microbiota and the correlations between ages by analyzing and comparing the metagenome of skin microbiome using Next Generation Sequencing method. Skin bacterial diversity and composition were characterized and compared between two different age groups: younger (20 – 30y) and older (60 - 70y) Xian, Chinese women. A total of 73 healthy women meet two conditions: (I) living in Xian, China; (II) maintaining healthy skin status during the period of this study. Based on Ribosomal Database Project (RDP) database, skin samples of 73 participants were enclosed with ten most abundant genera: Chryseobacterium, Propionibacterium, Enhydrobacter, Staphylococcus and so on. Although these genera are the most predominant genus overall, each genus showed different proportion in each group. The most dominant genus, Chryseobacterium was more present relatively in Young group than in an old group. Similarly, Propionibacterium and Enhydrobacter occupied a higher proportion of skin bacterial composition of the young group. Staphylococcus, in contrast, inhabited more in the old group. The beta diversity that represents the ratio between regional and local species diversity showed significantly different between two age groups. Likewise, The Principal Coordinate Analysis (PCoA) values representing each phylogenetic distance in the two-dimensional framework using the OTU (Operational taxonomic unit) values of the samples also showed differences between the two groups. Thus, our data suggested that the composition and diversification of skin microbiomes in adult women were largely affected by chronological and physiological skin aging.

Keywords: age, Next Generation Sequencing, Xian, skin microbiome

Procedia PDF Downloads 20
4 Data Analysis for Taxonomy Prediction and Annotation of 16S rRNA Gene Sequences from Metagenome Data

Authors: Suchithra V., Shreedhanya, Kavya Menon, Vidya Niranjan

Abstract:

Skin metagenomics has a wide range of applications with direct relevance to the health of the organism. It gives us insight to the diverse community of microorganisms (the microbiome) harbored on the skin. In the recent years, it has become increasingly apparent that the interaction between skin microbiome and the human body plays a prominent role in immune system development, cancer development, disease pathology, and many other biological implications. Next Generation Sequencing has led to faster and better understanding of environmental organisms and their mutual interactions. This project is studying the human skin microbiome of different individuals having varied skin conditions. Bacterial 16S rRNA data of skin microbiome is downloaded from SRA toolkit provided by NCBI to perform metagenomics analysis. Twelve samples are selected with two controls, and 3 different categories, i.e., sex (male/female), skin type (moist/intermittently moist/sebaceous) and occlusion (occluded/intermittently occluded/exposed). Quality of the data is increased using Cutadapt, and its analysis is done using FastQC. USearch, a tool used to analyze an NGS data, provides a suitable platform to obtain taxonomy classification and abundance of bacteria from the metagenome data. The statistical tool used for analyzing the USearch result is METAGENassist. The results revealed that the top three abundant organisms found were: Prevotella, Corynebacterium, and Anaerococcus. Prevotella is known to be an infectious bacterium found on wound, tooth cavity, etc. Corynebacterium and Anaerococcus are opportunist bacteria responsible for skin odor. This result infers that Prevotella thrives easily in sebaceous skin conditions. Therefore it is better to undergo intermittently occluded treatment such as applying ointments, creams, etc. to treat wound for sebaceous skin type. Exposing the wound should be avoided as it leads to an increase in Prevotella abundance. Moist skin type individuals can opt for occluded or intermittently occluded treatment as they have shown to decrease the abundance of bacteria during treatment.

Keywords: taxonomy, Next Generation Sequencing, skin microbiome, bacterial 16S rRNA, skin metagenomics

Procedia PDF Downloads 27
3 Metagenomic analysis of Irish cattle faecal samples using Oxford Nanopore MinION Next Generation Sequencing

Authors: Niamh Higgins, Dawn Howard

Abstract:

The Irish agri-food sector is of major importance to Ireland’s manufacturing sector and to the Irish economy through employment and the exporting of animal products worldwide. Infectious diseases and parasites have an impact on farm animal health causing profitability and productivity to be affected. For the sustainability of Irish dairy farming, there must be the highest standard of animal health. There can be a lack of information in accounting for > 1% of complete microbial diversity in an environment. There is the tendency of culture-based methods of microbial identification to overestimate the prevalence of species which grow easily on an agar surface. There is a need for new technologies to address these issues to assist with animal health. Metagenomic approaches provide information on both the whole genome and transcriptome present through DNA sequencing of total DNA from environmental samples producing high determination of functional and taxonomic information. Nanopore Next Generation Technologies have the ability to be powerful sequencing technologies. They provide high throughput, low material requirements and produce ultra-long reads, simplifying the experimental process. The aim of this study is to use a metagenomics approach to analyze dairy cattle faecal samples using the Oxford Nanopore MinION Next Generation Sequencer and to establish an in-house pipeline for metagenomic characterization of complex samples. Faecal samples will be obtained from Irish dairy farms, DNA extracted and the MinION will be used for sequencing, followed by bioinformatics analysis. Of particular interest, will be the parasite Buxtonella sulcata, which there has been little research on and which there is no research on its presence on Irish dairy farms. Preliminary results have shown the ability of the MinION to produce hundreds of reads in a relatively short time frame of eight hours. The faecal samples were obtained from 90 dairy cows on a Galway farm. The results from Oxford Nanopore ‘What’s in my pot’ (WIMP) using the Epi2me workflow, show that from a total of 926 classified reads, 87% were from the Kingdom Bacteria, 10% were from the Kingdom Eukaryota, 3% were from the Kingdom Archaea and < 1% were from the Kingdom Viruses. The most prevalent bacteria were those from the Genus Acholeplasma (71 reads), Bacteroides (35 reads), Clostridium (33 reads), Acinetobacter (20 reads). The most prevalent species present were those from the Genus Acholeplasma and included Acholeplasma laidlawii (39 reads) and Acholeplasma brassicae (26 reads). The preliminary results show the ability of the MinION for the identification of microorganisms to species level coming from a complex sample. With ongoing optimization of the pipe-line, the number of classified reads are likely to increase. Metagenomics has the potential in animal health for diagnostics of microorganisms present on farms. This would support wprevention rather than a cure approach as is outlined in the DAFMs National Farmed Animal Health Strategy 2017-2022.

Keywords: animal health, Infectious Disease, Metagenomics, Next Generation Sequencing, buxtonella sulcata, irish dairy cattle, minION

Procedia PDF Downloads 1
2 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Mpho Mokoatle, Darlington Mapiye, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on k-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isoloates. Cluster analysis showed that k-mers maybe used to discriminate phenotypes and the discrimination becomes more concise as the size of k-mers increase. The best performing classification model had a k-mer size of 10 (longest k-mer) an accuracy, recall, precision, specificity, and Matthews Correlation coeffient of 72.0 %, 80.5 %, 80.5 %, 63.6 %, and 0.4 respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biological explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms

Keywords: Next Generation Sequencing, Bootstrapping, k-mers, AWD-LSTM

Procedia PDF Downloads 1
1 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Mpho Mokoatle, Darlington Mapiye, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on $k$-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isoloates. Cluster analysis showed that k-mers maybe used to discriminate phenotypes and the discrimination becomes more concise as the size of k-mers increase. The best performing classification model had a k-mer size of 10 (longest k-mer) an accuracy, recall, precision, specificity, and Matthews Correlation coeffient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4 respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biological explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms.

Keywords: Next Generation Sequencing, Bootstrapping, k-mers, AWD-LSTM

Procedia PDF Downloads 1