Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 24266

Search results for: genomic data

24266 The Various Legal Dimensions of Genomic Data

Abstract:

When human genomic data is considered, this is often done through only one dimension of the law, or the interplay between the various dimensions is not considered, thus providing an incomplete picture of the legal framework. This research considers and analyzes the various dimensions in South African law applicable to genomic sequence data – including property rights, personality rights, and intellectual property rights. The effective use of personal genomic sequence data requires the acknowledgement and harmonization of the rights applicable to such data.

Keywords: artificial intelligence, data, law, genomics, rights

Procedia PDF Downloads 109

24265 Genomic Prediction Reliability Using Haplotypes Defined by Different Methods

Authors: Sohyoung Won, Heebal Kim, Dajeong Lim

Abstract:

Genomic prediction is an effective way to measure the breeding ability of livestock based on genomic estimated breeding values, which are statistically predicted from genotype data using best linear unbiased prediction (BLUP). Using haplotypes, clusters of linked single nucleotide polymorphisms (SNPs), as markers instead of individual SNPs can improve the reliability of genomic prediction, since the probability that a quantitative trait locus is in strong linkage disequilibrium (LD) with the markers is higher. To use haplotypes efficiently in genomic prediction, optimal ways of defining haplotypes need to be found. In this study, 770K SNP chip data were collected from a Hanwoo (Korean cattle) population consisting of 2506 cattle. Haplotypes were first defined in three different ways using the 770K SNP chip data: based on 1) the length of haplotypes (bp), 2) the number of SNPs, and 3) k-medoids clustering by LD. To compare the methods in parallel, haplotypes defined by all methods were set to have comparable sizes; in each method, haplotypes defined to have an average of 5, 10, 20 or 50 SNPs were tested. A modified GBLUP method using haplotype alleles as predictor variables was implemented to test the prediction reliability of each haplotype set. The conventional genomic BLUP (GBLUP) method, which uses individual SNPs, was also tested to evaluate the performance of the haplotype sets in genomic prediction. Carcass weight was used as the phenotype for testing. Haplotypes defined by all three methods showed increased reliability compared to conventional GBLUP. There were few differences in reliability between the haplotype defining methods. The reliability of genomic prediction was highest when the average number of SNPs per haplotype was 20 in all three methods, implying that haplotypes including around 20 SNPs can be optimal to use as markers for genomic prediction.
When the number of alleles generated by each haplotype defining method was compared, clustering by LD generated the fewest alleles. Using haplotype alleles for genomic prediction showed better performance, suggesting improved accuracy in genomic selection. The number of predictor variables decreased when the LD-based method was used, while all three haplotype defining methods showed similar performance. This suggests that defining haplotypes based on LD can reduce computational costs and allow efficient prediction. Finding optimal ways to define haplotypes and using the haplotype alleles as markers can provide improved performance and efficiency in genomic prediction.
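
As a rough illustration of the second definition above (haplotypes defined by the number of SNPs), the following Python sketch splits phased chromosomes into consecutive blocks and counts the distinct haplotype alleles in each block. The function name, the input layout (one '0'/'1' string per phased chromosome) and the toy data are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
from collections import Counter


def haplotype_alleles(phased, snps_per_block):
    """Count haplotype alleles in consecutive blocks of SNPs.

    phased: list of equal-length strings of '0'/'1', one string per
            phased chromosome (hypothetical input layout).
    snps_per_block: number of SNPs per haplotype block.
    Returns one Counter per block, mapping allele string -> frequency.
    """
    n_snps = len(phased[0])
    blocks = []
    for start in range(0, n_snps, snps_per_block):
        end = min(start + snps_per_block, n_snps)
        # Each distinct substring in this window is one haplotype allele.
        blocks.append(Counter(chrom[start:end] for chrom in phased))
    return blocks


# Toy data: four phased chromosomes genotyped at six SNPs.
phased = ["010011", "010111", "110011", "010011"]
blocks = haplotype_alleles(phased, 3)
n_alleles = [len(b) for b in blocks]
```

The number of distinct alleles per block corresponds to the number of predictor variables that block would contribute in a haplotype-based model, which is the quantity the abstract compares across methods.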

Keywords: best linear unbiased predictor, genomic prediction, haplotype, linkage disequilibrium

Procedia PDF Downloads 109

24264 Genomic Evidence for Ancient Human Migrations Along South America's East Coast

Authors: Andre Luiz Campelo dos Santos, Amanda Owings, Henry Socrates Lavalle Sullasi, Omer Gokcumen, Michael DeGiorgio, John Lindo

Abstract:

An increasing body of archaeological and genomic evidence has indicated a complex settlement process of the Americas. Here, four newly sequenced ancient genomes from Northeast Brazil and Uruguay are reported to share strong relationships with previously published samples from Panama and Southeast Brazil. Moreover, an unexpectedly high genomic affinity with present-day Onge is found in ancient individuals unearthed along the northern portion of South America’s Atlantic coast. These results provide genomic evidence for ancient migrations along South America’s Atlantic coast.

Keywords: archaeogenomics, atlantic coast, paleomigrations, South America

Procedia PDF Downloads 195

24263 Sparse Modelling of Cancer Patients’ Survival Based on Genomic Copy Number Alterations

Authors: Khaled M. Alqahtani

Abstract:

Copy number alterations (CNA) are variations in the structure of the genome, where certain regions deviate from the typical two chromosomal copies. These alterations are pivotal in understanding tumor progression and are indicative of patients' survival outcomes. However, effectively modeling patients' survival based on their genomic CNA profiles while identifying relevant genomic regions remains a statistical challenge. Various methods, such as the Cox proportional hazard (PH) model with ridge, lasso, or elastic net penalties, have been proposed but often overlook the inherent dependencies between genomic regions, leading to results that are hard to interpret. In this study, we enhance the elastic net penalty by incorporating an additional penalty that accounts for these dependencies. This approach yields smooth parameter estimates and facilitates variable selection, resulting in a sparse solution. Our findings demonstrate that this method outperforms other models in predicting survival outcomes, as evidenced by our simulation study. Moreover, it allows for a more meaningful interpretation of genomic regions associated with patients' survival. We demonstrate the efficacy of our approach using both real data from a lung cancer cohort and simulated datasets.
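
The idea of extending the elastic net with a dependency-aware penalty can be written schematically as below. The first two penalty terms are the standard elastic net applied to the Cox partial log-likelihood; the third term, a squared difference between coefficients of neighbouring genomic regions, is only an assumed example of a smoothness penalty, since the abstract does not state the exact form used. The symbols $\lambda_1, \lambda_2, \lambda_3$, $\delta_i$ and $R(t_i)$ are introduced here for illustration.

```latex
\ell(\beta) = \sum_{i:\,\delta_i = 1} \Big[ x_i^{\top}\beta
      - \log \sum_{k \in R(t_i)} \exp\big(x_k^{\top}\beta\big) \Big],
\qquad
\hat{\beta} = \arg\max_{\beta} \Big\{ \ell(\beta)
      - \lambda_1 \sum_{j=1}^{p} |\beta_j|
      - \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
      - \lambda_3 \sum_{j=2}^{p} (\beta_j - \beta_{j-1})^{2} \Big\}
```

Here $\delta_i$ is the event indicator for patient $i$, $R(t_i)$ is the set of patients at risk at time $t_i$, and $x_i$ is the CNA profile. The $\lambda_1$ term drives sparsity, while a term like the third would pull adjacent regions toward similar estimates.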

Keywords: copy number alterations, cox proportional hazard, lung cancer, regression, sparse solution

Procedia PDF Downloads 0

24262 Genodata: The Human Genome Variation Using BigData

Authors: Surabhi Maiti, Prajakta Tamhankar, Prachi Uttam Mehta

Abstract:

Since the completion of the Human Genome Project, there has been an unparalleled escalation in the sequencing of genomic data. This project was the first major milestone in the field of medical research, especially in genomics. It won accolades by using a concept called Bigdata, which was earlier used extensively to gain value for business. Bigdata makes use of data sets that are generally in the form of files of terabyte, petabyte, or exabyte size; such data sets were traditionally used and managed with Excel sheets and RDBMS. The voluminous data made the process tedious and time consuming, and hence a stronger framework called Hadoop was introduced in the field of genetic sciences to make data processing faster and more efficient. This paper focuses on using SPARK, which is gaining momentum with the advancement of BigData technologies. Cloud storage is an effective medium for storing the large data sets generated by genetic research and the resultant sets produced by SPARK analysis.

Keywords: human genome project, Bigdata, genomic data, SPARK, cloud storage, Hadoop

Procedia PDF Downloads 225

24261 Evaluation of Four Different DNA Targets in Polymerase Chain Reaction for Detection and Genotyping of Helicobacter pylori

Authors: Abu Salim Mustafa

Abstract:

Polymerase chain reaction (PCR) assays targeting genomic DNA segments have been established for the detection of Helicobacter pylori in clinical specimens. However, the data on comparative evaluations of various targets in the detection of H. pylori are limited. Furthermore, the frequencies of vacA (s1 and s2) and cagA genotypes, which are suggested to be involved in the pathogenesis of H. pylori in other parts of the world, are not well studied in Kuwait. The aim of this study was to evaluate PCR assays for the detection and genotyping of H. pylori by amplifying DNA targets from four genomic segments. Genomic DNA was isolated from 72 clinical isolates of H. pylori and tested in PCR with four pairs of oligonucleotide primers, i.e. ECH-U/ECH-L, ET-5U/ET-5L, CagAF/CagAR and Vac1F/Vac1XR, which were expected to amplify targets of various sizes (471 bp, 230 bp, 183 bp and 176/203 bp, respectively) from the genomic DNA of H. pylori. The PCR-amplified DNA was analyzed by agarose gel electrophoresis. PCR products of the expected size were obtained with all primer pairs using genomic DNA isolated from H. pylori. DNA dilution experiments showed that the most sensitive PCR target was the 471 bp DNA amplified by the primers ECH-U/ECH-L, followed by the targets of Vac1F/Vac1XR (176/203 bp DNA), CagAF/CagAR (183 bp DNA) and ET-5U/ET-5L (230 bp DNA). However, when tested with undiluted genomic DNA isolated from single colonies of all isolates, the Vac1F/Vac1XR target provided the maximum positive results (71/72 (99% positives)), followed by ECH-U/ECH-L (69/72 (93% positives)), ET-5U/ET-5L (51/72 (71% positives)) and CagAF/CagAR (26/72 (46% positives)). The results of genotyping experiments showed that vacA s1 (46% positive) and vacA s2 (54% positive) genotypes were almost equally associated with vacA⁺/cagA⁻ isolates (P > 0.05), but in vacA⁺/cagA⁺ isolates, the s1 genotype (92% positive) was detected more frequently than the s2 genotype (8% positive) (P < 0.0001).
In conclusion, among the primer pairs tested, Vac1F/Vac1XR provided the best results for detection of H. pylori. The genotyping experiments showed that vacA s1 and vacA s2 genotypes were almost equally associated with vacA⁺/cagA⁻ isolates, but the vacA s1 genotype had a significantly increased association with vacA⁺/cagA⁺ isolates.

Keywords: H. pylori, PCR, detection, genotyping

Procedia PDF Downloads 106

24260 TARF: Web Toolkit for Annotating RNA-Related Genomic Features

Authors: Jialin Ma, Jia Meng

Abstract:

Genomic features, i.e., genome-based coordinates, are commonly used to represent biological features such as genes, RNA transcripts and transcription factor binding sites. For the analysis of RNA-related genomic features, such as RNA modification sites, a common task is to correlate these features with transcript components (5'UTR, CDS, 3'UTR) to explore their distribution characteristics in terms of transcriptomic coordinates, e.g., to examine whether a specific type of biological feature is enriched near transcription start sites. Existing approaches for performing these tasks involve the manipulation of a gene database, conversion from genome-based coordinates to transcript-based coordinates, and visualization methods that are capable of showing RNA transcript components and the distribution of the features. These steps are complicated and time consuming, especially for researchers who are not familiar with the relevant tools. To overcome this obstacle, we developed a dedicated web app, TARF, which stands for web Toolkit for Annotating RNA-related genomic Features. The TARF web tool is intended to provide a web-based way to easily annotate and visualize RNA-related genomic features. Once a user has uploaded the features in BED format and specified a built-in transcript database or uploaded a customized gene database in GTF format, the tool can fulfill its three main functions. First, it adds annotation on gene and RNA transcript components. For every feature provided by the user, the overlaps with RNA transcript components are identified, and the information is combined in one table, which is available for copy and download. Summary statistics about ambiguous assignments are also computed. Second, the tool provides a convenient visualization method for the features at the single gene/transcript level.
For the selected gene, the tool shows the features with the gene model in a genome-based view, and also maps the features to transcript-based coordinates and shows their distribution against a single spliced RNA transcript. Third, a global transcriptomic view of the genomic features is generated utilizing the Guitar R/Bioconductor package. The distribution of features on RNA transcripts is normalized with respect to RNA transcript landmarks, and the enrichment of the features on different RNA transcript components is demonstrated. We tested the newly developed TARF toolkit with three different types of genomic features related to chromatin H3K4me3, RNA N6-methyladenosine (m6A) and RNA 5-methylcytosine (m5C), which were obtained from ChIP-Seq, MeRIP-Seq and RNA BS-Seq data, respectively. TARF successfully revealed their respective distribution characteristics, i.e. H3K4me3, m6A and m5C are enriched near transcription start sites, stop codons and 5’UTRs, respectively. Overall, TARF is a useful web toolkit for annotation and visualization of RNA-related genomic features, and should help simplify the analysis of various RNA-related genomic features, especially those related to RNA modifications.
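
The conversion from genome-based to transcript-based coordinates mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TARF's code: the function name is hypothetical, exons are given as 0-based half-open intervals (the BED convention), and a single transcript is handled at a time.

```python
def genome_to_transcript(pos, exons, strand="+"):
    """Map a 0-based genomic position onto the spliced transcript.

    exons: list of (start, end) half-open genomic intervals for one
           transcript (BED-style coordinates; an assumed input layout).
    strand: "+" or "-".
    Returns the 0-based transcript coordinate, or None if the position
    lies in an intron or outside the transcript.
    """
    # Walk exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # number of transcript bases contributed by earlier exons
    for start, end in ordered:
        if start <= pos < end:
            if strand == "+":
                return offset + (pos - start)
            return offset + (end - 1 - pos)
        offset += end - start
    return None


# Toy transcript with two exons of 100 and 50 bases.
exons = [(100, 200), (300, 350)]
plus_hit = genome_to_transcript(320, exons, "+")
minus_hit = genome_to_transcript(100, exons, "-")
intronic = genome_to_transcript(250, exons, "+")
```

Once features are expressed in transcript coordinates, their positions can be compared with landmarks such as the start and stop codons, which is the kind of enrichment analysis the abstract describes.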

Keywords: RNA-related genomic features, annotation, visualization, web server

Procedia PDF Downloads 180

24259 Evolutionary Genomic Analysis of Adaptation Genomics

Authors: Agostinho Antunes

Abstract:

The completion of human genome sequencing in 2003 opened a new perspective on the importance of whole genome sequencing projects, and currently multiple species are having their genomes completely sequenced, from simple organisms, such as bacteria, to more complex taxa, such as mammals. The voluminous sequencing data generated across multiple organisms also provide the framework to better understand the genetic makeup of such species and related ones, allowing exploration of the genetic changes underlying the evolution of diverse phenotypic traits. Here, recent results from our group retrieved from comparative evolutionary genomic analyses of varied species will be considered to exemplify how gene novelty and gene enhancement by positive selection might have been determinant in the success of adaptive radiations into diverse habitats and lifestyles.

Keywords: adaptation, animals, evolution, genomics

Procedia PDF Downloads 392

24258 Genomic Adaptation to Local Climate Conditions in Native Cattle Using Whole Genome Sequencing Data

Authors: Rugang Tian

Abstract:

In this study, we generated whole-genome sequence (WGS) data from 110 native cattle. Together with whole-genome sequences from worldwide cattle populations, we estimated the genetic diversity and population genetic structure of different cattle populations. Our findings revealed clustering of cattle groups in line with their geographic locations. We identified noticeable differences in genetic diversity between indigenous cattle breeds and commercial populations. Among all studied cattle groups, lower genetic diversity measures were found in commercial populations; however, high genetic diversity was detected in some local cattle, particularly in Rashoki and Mongolian breeds. Our search for potential genomic regions under selection in native cattle revealed several candidate genes on multiple chromosomes related to immune response and cold shock proteins, such as TRPM8, NMUR1, PRKAA2, SMTNL2 and OXR1, that are involved in energy metabolism and metabolic homeostasis.

Keywords: cattle, whole-genome, population structure, adaptation

Procedia PDF Downloads 17

24257 Genomics of Adaptation in the Sea

Authors: Agostinho Antunes

Abstract:

The completion of human genome sequencing in 2003 opened a new perspective on the importance of whole genome sequencing projects, and currently multiple species are having their genomes completely sequenced, from simple organisms, such as bacteria, to more complex taxa, such as mammals. The voluminous sequencing data generated across multiple organisms also provide the framework to better understand the genetic makeup of such species and related ones, allowing exploration of the genetic changes underlying the evolution of diverse phenotypic traits. Here, recent results from our group retrieved from comparative evolutionary genomic analyses of selected marine animal species will be considered to exemplify how gene novelty and gene enhancement by positive selection might have been determinant in the success of adaptive radiations into diverse habitats and lifestyles.

Keywords: marine genomics, evolutionary bioinformatics, human genome sequencing, genomic analyses

Procedia PDF Downloads 578

24256 Genomic Analysis of Whole Genome Sequencing of Leishmania Major

Authors: Fatimazahrae Elbakri, Azeddine Ibrahimi, Meryem Lemrani, Dris Belghyti

Abstract:

Leishmaniasis represents a major public health problem because of the number of cases recorded each year and the wide distribution of the disease. It is a parasitic disease caused by flagellated protozoa transmitted by the bite of certain species of sandfly, causing a spectrum of clinical pathology in humans ranging from disfiguring skin lesions to fatal visceral leishmaniasis. Cutaneous leishmaniasis due to Leishmania major is a polymorphic disease; the infection can be asymptomatic, localized, or disseminated. The objective of this work is to determine the genomic diversity that contributes to clinical variability by identifying variation in chromosome number and extracting SNPs and InDels. It is based on four whole genome sequences (WGS) of Leishmania major available on NCBI in FASTQ format, from three countries: Tunisia, Algeria, and Israel. The analysis is set up as a pipeline to facilitate the discovery of genetic diversity, in particular SNPs and chromosomal somy.

Keywords: Leishmania major, cutaneous Leishmania, NGS, genomic, somy, variant calling

Procedia PDF Downloads 40

24255 Genomic Diversity and Relationship among Arabian Peninsula Dromedary Camels Using Full Genome Sequencing Approach

Authors: H. Bahbahani, H. Musa, F. Al Mathen

Abstract:

Dromedary camels (Camelus dromedarius) are single-humped even-toed ungulates populating the African Sahara, Arabian Peninsula, and Southwest Asia. The genome of this desert-adapted species has been minimally investigated using autosomal microsatellite and mitochondrial DNA markers. In this study, the genomes of 33 dromedary camel samples from different parts of the Arabian Peninsula were sequenced using the Illumina Next Generation Sequencing (NGS) platform. These data were combined with Genotyping-by-Sequencing (GBS) data from African (Sudanese) dromedaries to investigate the genomic relationship between African and Arabian Peninsula dromedary camels. Principal Component Analysis (PCA) and average genome-wide admixture analysis were conducted on these data to address the objectives of the study. Both analyses revealed a phylogeographic distinction between these two camel populations. However, no breed-wise genetic classification was revealed among the African (Sudanese) camel breeds. The Arabian Peninsula camel populations also show higher heterozygosity than the Sudanese camels. The results of this study explain the evolutionary history and migration of African dromedary camels from their center of domestication in the southern Arabian Peninsula. These outputs help scientists to further understand the evolutionary history of dromedary camels, which might help in conserving the favorable genetics of this species.

Keywords: dromedary, genotyping-by-sequencing, Arabian Peninsula, Sudan

Procedia PDF Downloads 162

24254 Genetic Instabilities in Marine Bivalve Following Benzo(α)pyrene Exposure: Utilization of Combined Random Amplified Polymorphic DNA and Comet Assay

Authors: Mengjie Qu, Yi Wang, Jiawei Ding, Siyu Chen, Yanan Di

Abstract:

Marine ecosystems are facing intensified multiple stresses caused by environmental contaminants from human activities. Xenobiotics such as benzo(α)pyrene (BaP) have been discharged into the marine environment and have hazardous impacts on both marine organisms and human beings. As filter-feeders, marine mussels, Mytilus spp., have been extensively used to monitor the marine environment. However, their genomic alterations induced by such xenobiotics remain unknown. In the present study, gills, as the first defense barrier in mussels, were selected to evaluate the genetic instability induced by exposure to BaP both in vivo and in vitro. Both the random amplified polymorphic DNA (RAPD) assay and the comet assay were applied as rapid tools to assess environmental stresses because of their low cost and time requirements. All mussels were identified as the single species Mytilus coruscus before being used in BaP exposure at a concentration of 56 μg/l for 1 and 3 days (in vivo exposure) or 1 and 3 hours (in vitro). Both RAPD and comet assay results showed significantly increased genomic instability with a time-specific pattern of alteration. After the recovery period following in vivo exposure, the genomic status was the same as in the control condition. However, relatively higher genomic instability was still observed in gill cells after recovery from in vitro exposure. Different repair mechanisms or signaling pathways might be involved in isolated gill cells compared with intact tissues. The study provides robust and rapid techniques to examine genomic stability in marine organisms in response to marine environmental changes and provides basic information for further research on the mechanisms of stress responses in marine organisms.

Keywords: genotoxic impacts, in vivo/vitro exposure, marine mussels, RAPD and comet assay

Procedia PDF Downloads 249

24253 Analysis of Saudi Breast Cancer Patients’ Primary Tumors using Array Comparative Genomic Hybridization

Authors: L. M. Al-Harbi, A. M. Shokry, J. S. M. Sabir, A. Chaudhary, J. Manikandan, K. S. Saini

Abstract:

Breast cancer is the second most common cause of cancer death worldwide and is the most common malignancy among Saudi females. During breast carcinogenesis, a wide array of cytogenetic changes involving deletions, amplifications, or translocations of part or the whole of chromosome regions has been observed. Because of the limitations of various earlier technologies, newer tools are being developed to scan for changes at the genomic level. Recently, the Array Comparative Genomic Hybridization (aCGH) technique has been applied for detecting segmental genomic alterations at the molecular level. In this study, aCGH was performed on twenty breast cancer tumors and their matching non-tumor (normal) counterparts using the Agilent 2x400K. Several regions were identified as either amplified or deleted in a tumor-specific manner. The most frequent alterations were amplifications of chromosomes 1q, 8q and 20q; deletions at 16q were also detected. The amplifications at 1q and 8q were further validated by FISH analysis using probes targeting 1q25 and 8q (MYC gene). The copy number changes at these loci can potentially cause a significant change in tumor behavior, as deletions in the E-Cadherin (CDH1) tumor suppressor gene as well as amplification of the oncogenes Aurora Kinase A (AURKA) and MYC could make these tumors highly metastatic. This study validates the use of aCGH in Saudi breast cancer patients and sets the foundations necessary for performing larger cohort studies searching for ethnicity-specific biomarkers and gene copy number variations.

Keywords: breast cancer, molecular biology, ecology, environment

Procedia PDF Downloads 347

24252 SPARK: An Open-Source Knowledge Discovery Platform That Leverages Non-Relational Databases and Massively Parallel Computational Power for Heterogeneous Genomic Datasets

Authors: Thilina Ranaweera, Enes Makalic, John L. Hopper, Adrian Bickerstaffe

Abstract:

Data are the primary asset of biomedical researchers, and the engine for both discovery and research translation. As the volume and complexity of research datasets increase, especially with new technologies such as large single nucleotide polymorphism (SNP) chips, so too does the requirement for software to manage, process and analyze the data. Researchers often need to execute complicated queries and conduct complex analyses of large-scale datasets. Existing tools to analyze such data, and other types of high-dimensional data, unfortunately suffer from one or more major problems. They typically require a high level of computing expertise, are too simplistic (i.e., do not fit realistic models that allow for complex interactions), are limited by computing power, do not exploit the computing power of large-scale parallel architectures (e.g. supercomputers, GPU clusters etc.), or are limited in the types of analysis available, compounded by the fact that integrating new analysis methods is not straightforward. Solutions to these problems, such as those developed and implemented on parallel architectures, are currently available to only a relatively small portion of medical researchers with access and know-how. The past decade has seen a rapid expansion of data management systems for the medical domain. Much attention has been given to systems that manage phenotype datasets generated by medical studies. The introduction of heterogeneous genomic data for research subjects that reside in these systems has highlighted the need for substantial improvements in software architecture. To address this problem, we have developed SPARK, an enabling and translational system for medical research, leveraging existing high performance computing resources, and analysis techniques currently available or being developed. It builds these into The Ark, an open-source web-based system designed to manage medical data.
SPARK provides a next-generation biomedical data management solution that is based upon a novel Micro-Service architecture and Big Data technologies. The system serves to demonstrate the applicability of Micro-Service architectures for the development of high performance computing applications. When applied to high-dimensional medical datasets such as genomic data, relational data management approaches with normalized data structures suffer from unfeasibly high execution times for basic operations such as insert (i.e. importing a GWAS dataset) and the queries that are typical of the genomics research domain. SPARK resolves these problems by incorporating non-relational NoSQL databases that have been driven by the emergence of Big Data. SPARK provides researchers across the world with user-friendly access to state-of-the-art data management and analysis tools while eliminating the need for high-level informatics and programming skills. The system will benefit health and medical research by eliminating the burden of large-scale data management, querying, cleaning, and analysis. SPARK represents a major advancement in genome research technologies, vastly reducing the burden of working with genomic datasets, and enabling cutting edge analysis approaches that have previously been out of reach for many medical researchers.

Keywords: biomedical research, genomics, information systems, software

Procedia PDF Downloads 235

24251 Isolation and Identification of Diacylglycerol Acyltransferase Type-2 (DGAT2) Genes from Three Egyptian Olive Cultivars

Authors: Yahia I. Mohamed, Ahmed I. Marzouk, Mohamed A. Yacout

Abstract:

The aim of this work was to study the genetic basis for oil accumulation in olive fruit by tracking the DGAT2 (diacylglycerol acyltransferase type-2) gene in three olive cultivars of Egyptian origin, namely Toffahi, Hamed and Maraki, using molecular marker techniques and bioinformatics tools. The results show, first, that a specific genomic band of the Maraki cultivar was identified as DGAT2 and was identical to this gene in Olea europaea, with 100% similarity. Second, a differential genomic band of the Maraki cultivar produced by the RAPD fingerprinting technique yielded a distinct predicted sequence identified as DGAT2 in Fragaria vesca subsp. vesca, with 76% sequence similarity. Third, a specific genomic band of the Hamed cultivar was identified as two fragments: 1) Olea europaea cultivar Koroneiki diacylglycerol acyltransferase type 2 mRNA, complete cds, with two matching regions at 99%, or 2) PREDICTED: Fragaria vesca subsp. vesca diacylglycerol O-acyltransferase 2-like (LOC101313050), mRNA, with 86% similarity.

Keywords: Olea europaea, fingerprinting, diacylglycerol acyltransferase type-2 (DGAT2), Egypt

Procedia PDF Downloads 469

24250 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Mpho Mokoatle, Darlington Mapiye, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in the availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on k-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isolates. Cluster analysis showed that k-mers may be used to discriminate phenotypes and that the discrimination becomes more concise as the size of k-mers increases. The best performing classification model had a k-mer size of 10 (the longest k-mer) and an accuracy, recall, precision, specificity, and Matthews correlation coefficient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4, respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction.
The analysis also highlights the importance of increasing the k-mer size to produce more biologically explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms.
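
To make the k-mer representation concrete, here is a minimal Python sketch that counts overlapping k-mers in a DNA string. The function name and the choice to skip k-mers containing ambiguous bases (such as N) are assumptions for illustration; the abstract does not specify how the authors built their feature vectors.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a DNA sequence.

    k-mers containing characters other than A, C, G, T are skipped
    (an assumed convention for handling ambiguous bases).
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Toy sequence with one ambiguous base.
counts = kmer_counts("ACGTACGTNA", 3)
```

Because there are up to 4^k possible k-mers, the feature space grows quickly with k, which is one way to see the trade-off between k-mer size and computing resources that the abstract raises.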

Keywords: AWD-LSTM, bootstrapping, k-mers, next generation sequencing

Procedia PDF Downloads 131

24249 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Darlington Mapiye, Mpho Mokoatle, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in the availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on k-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isolates. Cluster analysis showed that k-mers may be used to discriminate phenotypes and that the discrimination becomes more concise as the size of k-mers increases. The best performing classification model had a k-mer size of 10 (the longest k-mer) and an accuracy, recall, precision, specificity, and Matthews correlation coefficient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4, respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. 
The analysis also highlights the importance of increasing the k-mer size to produce more biologically explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms.
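
The k-mer representation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the rule of skipping windows that contain ambiguous bases, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a DNA sequence.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped; this filtering rule is an assumption for illustration.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(counts: Counter, vocabulary: list[str]) -> list[float]:
    """Turn k-mer counts into relative frequencies over a fixed vocabulary."""
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Toy sequence, not real MTB data.
toy = "ATGCGATGCA"
counts = kmer_counts(toy, k=3)
vocab = sorted(counts)
print(counts.most_common(3))
print(feature_vector(counts, vocab))
```

The vocabulary can hold up to 4^k distinct k-mers, so the feature vector grows quickly with k, which is one way to see the trade-off between k-mer size and computing resources that the abstract describes.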

Keywords: AWD-LSTM, bootstrapping, k-mers, next generation sequencing

Procedia PDF Downloads 120

24248 Mitigating Ruminal Methanogenesis Through Genomic and Transcriptomic Approaches

Authors: Muhammad Adeel Arshad, Faiz-Ul Hassan, Yanfen Cheng

Abstract:

According to the FAO, enteric methane (CH4) production accounts for about 44% of all greenhouse gas emissions from the livestock sector. Ruminants produce CH4 as a result of the fermentation of feed in the rumen, especially from roughages, which yield more CH4 per unit of biomass ingested than concentrates. Efficient ruminal fermentation is not possible without abating CO2 and CH4. Methane abatement strategies are required to curb the predicted rise in emissions associated with greater ruminant production in the future to meet ever-increasing animal protein requirements. The ecology of ruminal methanogenesis and avenues for its mitigation can be identified through various genomic and transcriptomic techniques. Programs such as Hungate1000 and the Global Rumen Census have been launched to enhance our understanding of global ruminal microbial communities. Through the Hungate1000 project, a comprehensive reference set of rumen microbial genome sequences has been developed from cultivated rumen bacteria and methanogenic archaea, along with representative rumen anaerobic fungi and ciliate protozoa cultures. However, many species of rumen microbes are still underrepresented, especially uncultivable microbes. The lack of sequence information specific to the rumen's microbial community has inhibited efforts to use genomic data to identify the specific set of species and their target genes involved in methanogenesis. Metagenomic and metatranscriptomic studies of entire rumen microbial populations offer new perspectives for understanding the interaction of methanogens with other rumen microbes and their potential association with total gas and methane production. A deep understanding of the methanogenic pathway will help to devise potentially effective strategies to abate methane production while increasing feed efficiency in ruminants.

Keywords: Genome sequences, Hungate1000, methanogens, ruminal fermentation

Procedia PDF Downloads 106

24247 Suppression Subtractive Hybridization Technique for Identification of the Differentially Expressed Genes

Authors: Tuhina-khatun, Mohamed Hanafi Musa, Mohd Rafii Yosup, Wong Mui Yun, Aktar-uz-Zaman, Mahbod Sahebi

Abstract:

Suppression subtractive hybridization (SSH) is a valuable tool for identifying differentially regulated genes, such as disease-specific or tissue-specific genes important for cellular growth and differentiation. It is a widely used method for separating DNA molecules that distinguish two closely related DNA samples. SSH is one of the most powerful and popular methods for generating subtracted cDNA or genomic DNA libraries. It is based primarily on a suppression polymerase chain reaction (PCR) technique and combines normalization and subtraction in a single procedure. The normalization step equalizes the abundance of DNA fragments within the target population, and the subtraction step excludes sequences that are common to the populations being compared. This dramatically increases the probability of obtaining low-abundance differentially expressed cDNAs or genomic DNA fragments and simplifies analysis of the subtracted library. The SSH technique is applicable to many comparative and functional genetic studies for the identification of disease, developmental, tissue-specific, or other differentially expressed genes, as well as for the recovery of genomic DNA fragments distinguishing the samples under comparison.

Keywords: suppression subtractive hybridization, differentially expressed genes, disease specific genes, tissue specific genes

Procedia PDF Downloads 405

24246 Benefit Sharing of Research Participants in Human Genomic Research: Ethical Concerns and Ramifications

Authors: Tamanda Kamwendo

Abstract:

The concept of benefit sharing has been a prominent global debate, gaining traction in human research ethics. Despite its prevalence, the concept of benefit sharing is not without controversy over its meaning and justification. This is because it lacks a broadly accepted definition, and many proponents discuss benefit sharing by arguing for its necessity rather than critically engaging with technical issues such as what it implies. What is clear in the literature is that the underlying premise of benefit sharing is that research involving underprivileged and marginalized people is currently unjust and inequitable because these people are denied access to its gains; thus, benefit-sharing arrangements are required for these research projects to be just and equitable. This paper, therefore, investigates the discourses and justifications behind the concept of benefit sharing with human participants, particularly in human genomics research. Furthermore, because benefit sharing is generally viewed as a transaction between research organizations and research participants, it raises ethical concerns about the commodification of human material and the undermining of the sanctity of the human genome. This is predicated on the idea that research sponsors would be compelled to deliver a minimum set of possible benefits to research participants and communities in exchange for their involvement in the study. There is, therefore, a need to protect benefit-sharing practices in international health research by developing a legal governance framework. A legal framework for benefit sharing will also dispel the issue of commodification of human material where human genomic research is done.

Keywords: benefit sharing, human participants, human genomic research, ethical concerns

Procedia PDF Downloads 39

24245 Analysis of Expression Data Using Unsupervised Techniques

Authors: M. A. I Perera, C. R. Wijesinghe, A. R. Weerasinghe

Abstract:

This study was conducted to review and identify the unsupervised techniques that can be employed to analyze gene expression data in order to better identify subtypes of tumors. Identifying subtypes of cancer helps improve the efficacy and reduce the toxicity of treatments by providing clues for finding targeted therapeutics. The process of gene expression data analysis is described in three steps: preprocessing, clustering, and cluster validation. Feature selection is important since genomic data are high dimensional, with a large number of features compared to samples. Hierarchical clustering and K-means are often used in the analysis of gene expression data. Several cluster validation techniques are used to validate the clusters. Heatmaps are an effective external validation method that allows comparison of the identified classes with clinical variables and visual analysis of the classes.
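
As a rough illustration of the clustering step named in this abstract, the following Python sketch runs a plain K-means (Lloyd's algorithm) on a toy expression matrix. The data, the function name, and the deterministic choice of initial centroids are assumptions made for illustration; preprocessing, feature selection, and cluster validation are not shown.

```python
import math


def kmeans(samples, k, iterations=100):
    """Plain Lloyd's k-means on rows of `samples`.

    Initial centroids are simply the first k samples, a deterministic
    simplification; practical analyses usually use random restarts.
    """
    centroids = [list(s) for s in samples[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(s, centroids[c]))
            for s in samples
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(samples, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Hypothetical expression values: 6 samples (rows) x 3 genes (columns).
expression = [
    [1.0, 1.2, 0.9],
    [5.0, 5.1, 4.8],
    [0.8, 1.1, 1.0],
    [5.2, 4.9, 5.0],
    [1.1, 0.9, 1.2],
    [4.9, 5.0, 5.3],
]
labels, centroids = kmeans(expression, k=2)
print(labels)
```

In a real study the rows would be tumor samples and the columns a selected subset of genes, and the resulting labels would then be compared with clinical variables during validation.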

Keywords: cancer subtypes, gene expression data analysis, clustering, cluster validation

Procedia PDF Downloads 110

24244 Bioinformatics High Performance Computation and Big Data

Authors: Javed Mohammed

Abstract:

Right now, biomedical infrastructure lags well behind the curve. Our healthcare system is dispersed and disjointed; medical records are a bit of a mess; and we do not yet have the capacity to store and process the enormous amounts of data coming our way from widespread whole-genome sequencing. And then there are privacy issues. Despite these infrastructure challenges, some researchers are plunging into biomedical Big Data now, in hopes of extracting new and actionable knowledge. They are delving into molecular-level data to discover biomarkers that help classify patients based on their response to existing treatments, and pushing their results out to physicians in novel and creative ways. Computer scientists and biomedical researchers are able to transform data into models and simulations that will enable scientists for the first time to gain a profound understanding of the deepest biological functions. Solving biological problems may require high-performance computing (HPC) due either to the massive parallel computation required to solve a particular problem or to algorithmic complexity that may range from difficult to intractable. Many problems involve seemingly well-behaved polynomial time algorithms (such as all-to-all comparisons) but have massive computational requirements due to the large data sets that must be analyzed. High-throughput techniques for DNA sequencing and analysis of gene expression have led to exponential growth in the amount of publicly available genomic data. With the increased availability of genomic data, traditional database approaches are no longer sufficient for rapidly performing life science queries involving the fusion of data types. Computing systems are now so powerful that it is possible for researchers to consider modeling the folding of a protein or even the simulation of an entire human body. This research paper emphasizes computational biology's growing need for high-performance computing and Big Data. 
It illustrates the indispensability of HPC in meeting the scientific and engineering challenges of the twenty-first century, and how Protein Folding (the structure and function of proteins) and Phylogeny Reconstruction (evolutionary history of a group of genes) can use HPC that provides sufficient capability for evaluating or solving more limited but meaningful instances. This article also indicates solutions to optimization problems, and benefits Big Data and Computational Biology. The article illustrates the current state of the art and the future generation of HPC with Big Data in biology.

Keywords: high performance, big data, parallel computation, molecular data, computational biology

Procedia PDF Downloads 332

24243 Antibody Reactivity of Synthetic Peptides Belonging to Proteins Encoded by Genes Located in Mycobacterium tuberculosis-Specific Genomic Regions of Differences

Authors: Abu Salim Mustafa

Abstract:

Comparisons of mycobacterial genomes have identified several Mycobacterium tuberculosis-specific genomic regions that are absent in other mycobacteria and are known as regions of differences. Due to their M. tuberculosis-specificity, the peptides encoded by these regions could be useful in the specific diagnosis of tuberculosis. To explore this possibility, overlapping synthetic peptides corresponding to 39 proteins predicted to be encoded by genes present in regions of differences were tested for antibody reactivity with sera from tuberculosis patients and healthy subjects. The results identified four immunodominant peptides corresponding to four different proteins, with three of the peptides showing significantly stronger antibody reactivity and rate of positivity with sera from tuberculosis patients than healthy subjects. The fourth peptide was recognized equally well by the sera of tuberculosis patients and healthy subjects. Bioinformatics analyses using the ABCpred server predicted multiple linear antibody epitopes in each peptide. Furthermore, peptide sequence analysis for sequence identity using BLAST suggested M. tuberculosis-specificity for the three peptides that had preferential reactivity with sera from tuberculosis patients, but the peptide with equal reactivity with sera of TB patients and healthy subjects showed significant identity with sequences present in non-tuberculous mycobacteria. The three identified M. tuberculosis-specific immunodominant peptides may be useful in the serological diagnosis of tuberculosis.

Keywords: genomic regions of differences, Mycobacterium tuberculosis, peptides, serodiagnosis

Procedia PDF Downloads 156

24242 Genomic Sequence Representation Learning: An Analysis of K-Mer Vector Embedding Dimensionality

Authors: James Jr. Mashiyane, Risuna Nkolele, Stephanie J. Müller, Gciniwe S. Dlamini, Rebone L. Meraba, Darlington S. Mapiye

Abstract:

When performing language tasks in natural language processing (NLP), the dimensionality of word embeddings is either chosen ad hoc or calculated by optimizing the Pairwise Inner Product (PIP) loss. The PIP loss is a metric that measures the dissimilarity between word embeddings, and it is obtained through matrix perturbation theory by utilizing the unitary invariance of word embeddings. Unlike in natural language processing, in genomics, especially in genome sequence processing, there is no notion of a “word”; rather, there are sequence substrings of length k called k-mers. K-mer sizes matter, and they vary depending on the goal of the task at hand. The dimensionality of word embeddings in NLP has been studied using matrix perturbation theory and the PIP loss. In this paper, the sufficiency and reliability of applying word-embedding algorithms to various genomic sequence datasets are investigated to understand the relationship between the k-mer size and the embedding dimension. This is done by studying the scaling capability of three embedding algorithms, namely Latent Semantic Analysis (LSA), Word2Vec, and Global Vectors (GloVe), with respect to the k-mer size. Utilising the PIP loss as a metric to train embeddings on different datasets, we also show that Word2Vec outperforms LSA and GloVe in accurately computing embeddings as both the k-mer size and vocabulary increase. Finally, the shortcomings of natural language processing embedding algorithms in performing genomic tasks are discussed.
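
The PIP loss referred to in this abstract is commonly defined as the Frobenius norm of the difference between the pairwise inner product matrices of two embeddings over the same vocabulary. The Python sketch below computes it for small hypothetical k-mer embeddings and shows the unitary invariance the abstract mentions: rotating an embedding leaves its PIP matrix unchanged. The embedding values and helper names are assumptions made for illustration, not the authors' code.

```python
import math


def pip_matrix(embedding):
    """Pairwise inner products E E^T for an embedding given as rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embedding]
            for u in embedding]


def pip_loss(e1, e2):
    """Frobenius norm of the difference between two PIP matrices."""
    p1, p2 = pip_matrix(e1), pip_matrix(e2)
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(p1, p2)
                         for x, y in zip(r1, r2)))


def rotate(embedding, theta):
    """Apply a 2-D rotation (a unitary transform) to every row."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * x - s * y, s * x + c * y] for x, y in embedding]


# Hypothetical 2-D embeddings for four 3-mers.
kmers = ["ATG", "TGC", "GCA", "CAT"]
emb = [[0.9, 0.1], [0.2, 0.8], [-0.5, 0.4], [0.3, -0.7]]
rotated = rotate(emb, math.pi / 3)
perturbed = [[x + 0.1, y] for x, y in emb]

print(pip_loss(emb, rotated))    # close to zero: rotation preserves inner products
print(pip_loss(emb, perturbed))  # positive: the geometry actually changed
```

Because the PIP matrix has one row and column per vocabulary item, the two embeddings being compared may have different dimensionalities, which is what makes the loss usable for choosing an embedding dimension.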

Keywords: word embeddings, k-mer embedding, dimensionality reduction

Procedia PDF Downloads 90

24241 Cytogenetic Characterization of the VERO Cell Line Based on Comparisons with the Subline; Implication for Authorization and Quality Control of Animal Cell Lines

Authors: Fumio Kasai, Noriko Hirayama, Jorge Pereira, Azusa Ohtani, Masashi Iemura, Malcolm A. Ferguson Smith, Arihiro Kohara

Abstract:

The VERO cell line was established in 1962 from normal tissue of an African green monkey, Chlorocebus aethiops (2n=60), and has been commonly used worldwide for screening for toxins or as a cell substrate for the production of viral vaccines. The VERO genome was sequenced in 2014; however, its cytogenetic features have not been fully characterized as it contains several chromosome abnormalities and different karyotypes coexist in the cell line. In this study, the VERO cell line (JCRB0111) was compared with one of the sublines. In contrast to 59 chromosomes as the modal chromosome number in the VERO cell line, the subline had two peaks of 56 and 58 chromosomes. M-FISH analysis using human probes revealed that the VERO cell line was characterized by a translocation t(2;25) found in all metaphases, which was absent in the subline. Different abnormalities detected only in the subline show that the cell line is heterogeneous, indicating that the subline has the potential to change its genomic characteristics during cell culture. The various alterations in the two independent lineages suggest that genomic changes in both VERO cells can be accounted for by progressive rearrangements during their evolution in culture. Both t(5;X) and t(8;14) observed in all metaphases of the two cell lines might have a key role in VERO cells and could be used as genetic markers to identify VERO cells. The flow karyotype shows distinct differences from normal. Further analysis of sorted abnormal chromosomes may uncover other characteristics of VERO cells. Because of the absence of STR data, cytogenetic data are important in characterizing animal cell lines and can be an indicator of their quality control.

Keywords: VERO, cell culture passage, chromosome rearrangement, heterogeneous cells

Procedia PDF Downloads 380

24240 Distribution of HLA-DQA1 and HLA-DQB1 Alleles in Thais: Genetics Database Insight for COVID-19 Severity

Authors: Jinu Phonamontham

Abstract:

Coronavirus disease 2019 (COVID-19) is caused by the SARS-CoV-2 virus. The pandemic had caused over 10 million cases and 500,000 deaths worldwide by the end of June 2020. In a previous study, the HLA-DQA1*01:02 allele was associated with COVID-19 disease (p-value = 0.0121). Furthermore, there was a statistically significant association between HLA-DQB1*06:02 and COVID-19 in the Italian population after Bonferroni’s correction (p-value = 0.0016). Nevertheless, there are no data describing the distribution of HLA alleles as a valid marker for prediction of COVID-19 in the Thai population. We aimed to investigate the prevalence of the HLA-DQA1*01:02 and HLA-DQB1*06:02 alleles that are associated with severe COVID-19 in the Thai population. In this study, we recruited 200 healthy Thai individuals. Genomic DNA samples were isolated from EDTA blood using the Genomic DNA Mini Kit. HLA genotyping was conducted using the Lifecodes HLA SSO typing kits (Immucor, West Avenue, Stamford, USA). The HLA-DQA1 alleles observed in the Thai population included HLA-DQA1*01:01 (27.75%), HLA-DQA1*01:02 (24.50%), HLA-DQA1*03:03 (13.00%), HLA-DQA1*06:01 (10.25%) and HLA-DQA1*02:01 (6.75%). Furthermore, the HLA-DQB1 alleles included HLA-DQB1*05:02 (21.50%), HLA-DQB1*03:01 (15.75%), HLA-DQB1*05:01 (14.50%), HLA-DQB1*03:03 (11.00%) and HLA-DQB1*02:02 (8.25%). In particular, the HLA-DQA1*01:02 allele had its highest frequency (29.00%) in the NorthEast group, but the difference was not significant when compared with the other regions of Thailand (p-value = 0.4202). The HLA-DQB1*06:02 allele was similarly distributed in the Thai population, and there was no significant difference between Thais and populations from China (3.8%), South Korea (6.4%), and Japan (8.2%) (p-value > 0.05), whereas the frequency in South Africa (15.7%) differed significantly from that in Thais (p-value = 0.0013). This study supports the specific genotyping of the HLA-DQA1*01:02 and HLA-DQB1*06:02 alleles to screen for severe COVID-19 in Thai and many other populations.
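
The reported percentages can be related to the sample size with a short Python sketch. It assumes, which the abstract does not state explicitly, that each frequency is the number of allele copies divided by 2N chromosomes for N = 200 diploid individuals; the function names are hypothetical.

```python
N_INDIVIDUALS = 200                 # sample size given in the abstract
N_CHROMOSOMES = 2 * N_INDIVIDUALS   # assumption: diploid, frequency = copies / 2N

# Percentages quoted in the abstract.
REPORTED_PERCENT = {
    "HLA-DQA1*01:01": 27.75,
    "HLA-DQA1*01:02": 24.50,
    "HLA-DQA1*03:03": 13.00,
    "HLA-DQA1*06:01": 10.25,
    "HLA-DQA1*02:01": 6.75,
    "HLA-DQB1*05:02": 21.50,
    "HLA-DQB1*03:01": 15.75,
    "HLA-DQB1*05:01": 14.50,
    "HLA-DQB1*03:03": 11.00,
    "HLA-DQB1*02:02": 8.25,
}


def implied_allele_count(percent, n_chromosomes=N_CHROMOSOMES):
    """Number of allele copies implied by a percentage of all chromosomes."""
    return percent / 100 * n_chromosomes


def allele_frequency(count, n_individuals=N_INDIVIDUALS):
    """Allele frequency in percent from a copy count in a diploid sample."""
    return 100 * count / (2 * n_individuals)


implied = {a: implied_allele_count(p) for a, p in REPORTED_PERCENT.items()}
for allele, count in implied.items():
    print(f"{allele}: {count:.2f} copies out of {N_CHROMOSOMES}")
```

Under this assumption every listed percentage corresponds to a whole number of allele copies out of 400, which is consistent with frequencies counted per chromosome rather than per individual.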

Keywords: HLA-DQA1*01:02, HLA-DQB1*06:02, Asian, Thai population

Procedia PDF Downloads 61

24239 Allelic Diversity of Productive, Reproductive and Fertility Traits Genes of Buffalo and Cattle

Authors: M. Moaeen-ud-Din, G. Bilal, M. Yaqoob

Abstract:

Identification of genes important for production traits in buffalo is impaired by a paucity of genomic resources. One option to fill this gap is to exploit data available for cattle. The cross-species application of comparative genomics tools is a potential means of investigating the buffalo genome. However, this depends on nucleotide sequence similarity. In this study, gene diversity between buffalo and cattle was determined using 86 gene orthologues. There was about 3% difference across all genes in terms of nucleotide diversity, and 0.267±0.134 in amino acids, indicating the possibility of successfully using cross-species strategies for genomic studies. There were significantly more non-synonymous substitutions in both cattle and buffalo; however, the difference in terms of dN – dS was similar (4.414 vs 4.745 in buffalo and cattle, respectively). A higher rate of non-synonymous substitutions at a similar level in buffalo and cattle indicated a similar positive selection pressure. Results of the relative rate test were assessed with the chi-squared test. There was no significant difference in unique mutations between cattle and buffalo lineages at synonymous sites. However, there was a significant difference in unique mutations at non-synonymous sites, indicating an ongoing mutagenic process that generates substitutional mutations at approximately the same rate at silent sites. Moreover, despite common ancestry, our results indicate different divergence times among genes of cattle and buffalo. This is the first demonstration that variable rates of molecular evolution may be present within the family Bovidae.
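
A relative rate test assessed with a chi-squared statistic can be sketched in Python as follows. The abstract does not name the exact procedure, so this sketch assumes Tajima's relative rate test, which compares the numbers of changes unique to each of two lineages relative to an outgroup; the sequences are toy data and alignment gaps are not handled.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Tajima's relative rate test on three aligned sequences.

    m1 counts sites where only seq1 differs (seq2 matches the outgroup);
    m2 counts sites where only seq2 differs. Under equal rates the
    statistic (m1 - m2)^2 / (m1 + m2) follows a chi-squared
    distribution with 1 degree of freedom.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b and b == o:
            m1 += 1   # change unique to lineage 1
        elif a != b and a == o:
            m2 += 1   # change unique to lineage 2
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p_value


# Toy aligned sequences, not real cattle or buffalo genes.
cattle = "ATGCCGTA"
buffalo = "ATGACGTT"
outgroup = "ATGACGTA"
print(tajima_relative_rate(cattle, buffalo, outgroup))
```

Running the test separately on synonymous and non-synonymous sites would mirror the comparison described in the abstract, although how the authors partitioned sites is not stated.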

Keywords: buffalo, cattle, gene diversity, molecular evolution

Procedia PDF Downloads 456

24238 Single Cell and Spatial Transcriptomics: A Beginners Viewpoint from the Conceptual Pipeline

Authors: Leo Nnamdi Ozurumba-Dwight

Abstract:

Messenger ribonucleic acid (mRNA) molecules encode proteins. These protein-encoding mRNA molecules (which collectively constitute the transcriptome), when analyzed by RNA sequencing (RNAseq), unveil the nature of gene expression. The obtained gene expression provides clues to cellular traits and their dynamics. These can be studied in relation to function and responses. RNAseq is a practical tool in genomics, as it enables detection and quantitative analysis of mRNA molecules. Single cell and spatial transcriptomics both present varying avenues for exposition of the genomic characteristics of single cells and pooled cells in disease conditions such as cancer, auto-immune diseases, and hematopoietic diseases, among others, from investigated biological tissue samples. Single cell transcriptomics helps conduct a direct assessment of each building unit of tissues (the cell) during diagnosis and molecular gene expression studies. A typical technique to achieve this is single-cell RNA sequencing (scRNAseq), which enables high-throughput gene expression studies. However, this technique generates gene expression data for many cells without information on the cells’ positional coordinates within the tissue. Complementary, pre-established tissue reference maps built with molecular and bioinformatics techniques have since emerged and are now used to resolve this setback, producing both levels of data in one scRNAseq analysis. This is an emerging conceptual approach in methodology for integrative and progressively dependable transcriptomics analysis. This can support in-situ analysis for a better understanding of tissue functional organization, unveil new biomarkers for early-stage detection of diseases and biomarkers for therapeutic targets in drug development, and reveal the nature of cell-to-cell interactions. 
These are also vital genomic signatures and characterizations for clinical applications. Over the past decades, RNAseq has generated a wide array of information that is igniting breakthroughs and innovations in biomedicine. Spatial transcriptomics, on the other hand, is tissue-level based and is used to study biological specimens with heterogeneous features. It reveals the gross identity of investigated mammalian tissues, which can then be used to study cell differentiation, track cell line trajectory patterns and behavior, and examine regulatory homeostasis in disease states. It also requires referenced positional analysis to complement the genomic signatures that will be assessed from the single cells in the tissue sample. Given these two approaches to transcriptomics, applied to varying quantities of cells with avenues for appropriate resolution, both have made the study of gene expression from mRNA molecules interesting and progressive, and are helping to tackle health challenges head-on.

Keywords: transcriptomics, RNA sequencing, single cell, spatial, gene expression

Procedia PDF Downloads 95

24237 Copy Number Variants in Children with Non-Syndromic Congenital Heart Diseases from Mexico

Authors: Maria Lopez-Ibarra, Ana Velazquez-Wong, Lucelli Yañez-Gutierrez, Maria Araujo-Solis, Fabio Salamanca-Gomez, Alfonso Mendez-Tenorio, Haydeé Rosas-Vargas

Abstract:

Congenital heart diseases (CHD) are the most common congenital abnormalities. These conditions can occur either as an element of distinct chromosomal malformation syndromes or as non-syndromic forms. Their etiology is not fully understood. Genetic variants such as copy number variants have been associated with CHD. The aim of our study was to analyze these genomic variants in peripheral blood from Mexican children diagnosed with non-syndromic CHD. We included 16 children with atrial and ventricular septal defects and 5 healthy subjects without heart malformations as controls. To exclude the most common heart disease-associated syndromic alteration, we performed a fluorescence in situ hybridization test to identify the 22q11.2 deletion, responsible for congenital heart abnormalities associated with DiGeorge syndrome. Then, microarray-based comparative genomic hybridization was used to identify global copy number variants. Copy number variants were identified by comparing our results with data from the main genetic variation databases. We identified copy number gains in three chromosomal regions in pediatric patients, 4q13.2 (31.25%), 9q34.3 (25%) and 20q13.33 (50%), where several genes associated with cellular, biosynthetic, and metabolic processes are located: UGT2B15, UGT2B17, SNAPC4, SDCCAG3, PMPCA, INPP6E, C9orf163, NOTCH1, C20orf166, and SLCO4A1. In addition, after a hierarchical cluster analysis based on the fluorescence intensity ratios from the comparative genomic hybridization, two congenital heart disease groups were generated corresponding to children with atrial or ventricular septal defects. Further analysis with a larger sample size is needed to corroborate these copy number variants as possible biomarkers to differentiate between heart abnormalities. 
Interestingly, the 20q13.33 gain was present in 50% of children with these CHD, which could suggest that alterations in both coding and non-coding elements within this chromosomal region may play an important role in distinct heart conditions.
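
How gain frequencies of this kind can arise from array comparative genomic hybridization data is sketched below in Python. The log2-ratio thresholds of ±0.25, the function names, and the intensity values are hypothetical choices for illustration; the abstract does not report the calling criteria that were used.

```python
import math


def call_copy_number(test_intensity, reference_intensity,
                     gain_threshold=0.25, loss_threshold=-0.25):
    """Classify one probe from its test/reference fluorescence ratio.

    The thresholds are illustrative assumptions, not values from the study.
    """
    log2_ratio = math.log2(test_intensity / reference_intensity)
    if log2_ratio >= gain_threshold:
        return "gain"
    if log2_ratio <= loss_threshold:
        return "loss"
    return "neutral"


def gain_frequency(calls):
    """Percentage of samples called as a gain at a given region."""
    return 100 * sum(call == "gain" for call in calls) / len(calls)


# Toy data: 16 hypothetical patients at one region, 8 with elevated signal.
toy_intensities = [(1.5, 1.0)] * 8 + [(1.0, 1.0)] * 8
calls = [call_copy_number(t, r) for t, r in toy_intensities]
print(gain_frequency(calls))
```

With 16 patients, the reported frequencies of 31.25%, 25% and 50% correspond to 5, 4 and 8 children, respectively.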

Keywords: aCGH, bioinformatics, congenital heart diseases, copy number variants, fluorescence in situ hybridization

Procedia PDF Downloads 257