64 Analysis of Genomics Big Data in Cloud Computing Using Fuzzy Logic

Authors: Mohammad Vahed, Ana Sadeghitohidi, Majid Vahed, Hiroki Takahashi


In genomics, next-generation sequencers (NGS) have produced huge amounts of data. Data volumes are growing very rapidly; it is postulated that more than one billion bases will be produced per year in 2020. The growth rate of the data is much faster than Moore's law in computer technology. This makes genomics data increasingly difficult to handle, whether storing it, searching it, or finding hidden information in it. An analysis platform for genomics big data is therefore required. Recently developed cloud computing enables us to deal with big data more efficiently. Hadoop is a distributed computing framework that serves as the core of Big Data as a Service (BDaaS). Although many services, e.g. Amazon, have adopted this technology, there are only a few applications in biology. Here, we propose a new algorithm to deal more efficiently with genomics big data, e.g. sequencing data. Our algorithm consists of two parts. First, BDaaS is applied to handle the data more efficiently. Second, a hybrid method of MapReduce and fuzzy logic is applied for data processing; this step can be parallelized in implementation. Our algorithm has great potential in the computational analysis of genomics big data, e.g. de novo genome assembly and sequence similarity search. We will discuss our algorithm and its feasibility.

Keywords: big data, fuzzy logic, MapReduce, Hadoop, cloud computing
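
The abstract does not spell out how MapReduce and fuzzy logic are combined, so the following is only a minimal sketch under stated assumptions: the map step weights each read by a fuzzy membership in the set "high-quality read" (a linear ramp between two hypothetical Phred thresholds), and the reduce step sums those weights per k-mer. The function names, thresholds, and toy reads are all illustrative and are not taken from the authors' work.

```python
from collections import defaultdict
from itertools import chain


def high_quality_membership(mean_q, low=20.0, high=30.0):
    """Fuzzy membership in 'high-quality read' (assumed linear ramp)."""
    if mean_q <= low:
        return 0.0
    if mean_q >= high:
        return 1.0
    return (mean_q - low) / (high - low)


def map_read(read, k=3):
    """Map step: emit (k-mer, membership weight) pairs for one read."""
    seq, quals = read
    weight = high_quality_membership(sum(quals) / len(quals))
    return [(seq[i:i + k], weight) for i in range(len(seq) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the fuzzy weights for each k-mer key."""
    totals = defaultdict(float)
    for kmer, weight in pairs:
        totals[kmer] += weight
    return dict(totals)


# Toy reads (sequence, per-base Phred scores), invented for illustration.
reads = [
    ("ACGTA", [35, 36, 34, 33, 37]),  # mean 35, membership 1.0
    ("ACGTT", [25, 25, 25, 25, 25]),  # mean 25, membership 0.5
    ("CGTAA", [10, 12, 11, 15, 12]),  # mean 12, membership 0.0
]

# Each map_read call is independent, so it could run on a separate worker.
weighted_kmers = reduce_counts(chain.from_iterable(map_read(r) for r in reads))
```

Because the map step touches one read at a time and the reduce step only adds weights per key, the same structure could in principle be distributed across Hadoop workers, which is the parallelism the abstract refers to.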

63 Changing the Landscape of Fungal Genomics: New Trends

Authors: Igor V. Grigoriev


Understanding the biological processes encoded in fungi is instrumental in addressing the future food, feed, and energy demands of the growing human population. Genomics is a powerful and quickly evolving tool for understanding these processes. The Fungal Genomics Program of the US Department of Energy Joint Genome Institute (JGI) partners with researchers around the world to explore fungi in several large-scale genomics projects, changing the fungal genomics landscape. The key trends of these changes include: (i) rapidly increasing scale of sequencing and analysis, (ii) developing approaches to go beyond culturable fungi and explore fungal ‘dark matter,’ or unculturables, and (iii) functional genomics and multi-omics data integration. The power of comparative genomics has recently been demonstrated in several JGI projects targeting mycorrhizae, plant pathogens, wood decay fungi, and sugar-fermenting yeasts. The largest JGI project, ‘1000 Fungal Genomes,’ aims to explore the diversity across the Fungal Tree of Life in order to better understand fungal evolution and to build a catalogue of genes, enzymes, and pathways for biotechnological applications. At this point, at least 65% of over 700 known families have one or more reference genomes sequenced, enabling metagenomics studies of microbial communities and their interactions with plants. For many of the remaining families, no representative species are available from culture collections. To sequence genomes of unculturable fungi, two approaches have been developed: (a) sequencing DNA from fruiting bodies of ‘macro’ fungi and (b) single-cell genomics using fungal spores. The latter has been tested using zoospores from the early diverging fungi and resulted in several near-complete genomes from underexplored branches of the Fungal Tree, including the first genomes of Zoopagomycotina. A genome sequence serves as a reference for transcriptomics studies, the first step towards functional genomics.
In the JGI fungal mini-ENCODE project, transcriptomes of the model fungus Neurospora crassa grown on a spectrum of carbon sources have been collected to build regulatory gene networks. Epigenomics is another tool for understanding gene regulation, and recently introduced single-molecule sequencing platforms not only provide better genome assemblies but can also detect DNA modifications. For example, the 6mC methylome was surveyed across many diverse fungi, and the highest levels of 6mC methylation among Eukaryota have been reported. Finally, data production at such a scale requires data integration to enable efficient data analysis. Over 700 fungal genomes and other -omes have been integrated in the JGI MycoCosm portal and equipped with comparative genomics tools to enable researchers to address a broad spectrum of biological questions and applications for bioenergy and biotechnology.

Keywords: fungal genomics, single cell genomics, DNA methylation, comparative genomics

62 High-Throughput Mechanized Microfluidic Test Groundwork for Precise Microbial Genomics

Authors: Pouya Karimi, Ramin Gasemi Shayan, Parsa Sheykhzade


Low-cost shotgun DNA sequencing is transforming the microbial sciences. Sequencing instruments are now so effective that sample preparation has become the key limiting factor. Here, we present a microfluidic sample preparation platform that integrates the key steps from cells to sequencing library for up to 96 samples and reduces DNA input requirements 100-fold while maintaining or improving data quality. The general-purpose microarchitecture we demonstrate supports workflows with arbitrary numbers of reaction and clean-up or capture steps. By reducing the sample quantity requirements, we enabled low-input (∼10,000 cells) whole-genome shotgun (WGS) sequencing of Mycobacterium tuberculosis and soil micro-colonies with superior results. We also used the enhanced throughput to sequence ∼400 clinical Pseudomonas aeruginosa libraries and demonstrate excellent single-nucleotide polymorphism detection performance that explained phenotypically observed antibiotic resistance. Fully integrated lab-on-chip sample preparation overcomes technical barriers to enable broader deployment of genomics across many basic research and translational applications.

Keywords: clinical microbiology, DNA, microbiology, microbial genomics

61 Evolutionary Genomic Analysis of Adaptation Genomics

Authors: Agostinho Antunes


The completion of human genome sequencing in 2003 opened a new perspective on the importance of whole-genome sequencing projects, and multiple species are currently having their genomes completely sequenced, from simple organisms, such as bacteria, to more complex taxa, such as mammals. The voluminous sequencing data generated across multiple organisms also provide the framework to better understand the genetic makeup of these species and related ones, making it possible to explore the genetic changes underlying the evolution of diverse phenotypic traits. Here, recent results from our group, drawn from comparative evolutionary genomic analyses of varied species, will be considered to exemplify how gene novelty and gene enhancement by positive selection might have been decisive in the success of adaptive radiations into diverse habitats and lifestyles.

Keywords: adaptation, animals, evolution, genomics

60 Genomics of Aquatic Adaptation

Authors: Agostinho Antunes


The completion of human genome sequencing in 2003 opened a new perspective on the importance of whole-genome sequencing projects, and multiple species are currently having their genomes completely sequenced, from simple organisms, such as bacteria, to more complex taxa, such as mammals. The voluminous sequencing data generated across multiple organisms also provide the framework to better understand the genetic makeup of these species and related ones, making it possible to explore the genetic changes underlying the evolution of diverse phenotypic traits. Here, recent results from our group, drawn from comparative evolutionary genomic analyses of selected marine animal species, will be considered to exemplify how gene novelty and gene enhancement by positive selection might have been decisive in the success of adaptive radiations into diverse habitats and lifestyles.

Keywords: comparative genomics, adaptive evolution, bioinformatics, phylogenetics, genome mining

59 Genomics of Adaptation in the Sea

Authors: Agostinho Antunes


The completion of human genome sequencing in 2003 opened a new perspective on the importance of whole-genome sequencing projects, and multiple species are currently having their genomes completely sequenced, from simple organisms, such as bacteria, to more complex taxa, such as mammals. The voluminous sequencing data generated across multiple organisms also provide the framework to better understand the genetic makeup of these species and related ones, making it possible to explore the genetic changes underlying the evolution of diverse phenotypic traits. Here, recent results from our group, drawn from comparative evolutionary genomic analyses of selected marine animal species, will be considered to exemplify how gene novelty and gene enhancement by positive selection might have been decisive in the success of adaptive radiations into diverse habitats and lifestyles.

Keywords: marine genomics, evolutionary bioinformatics, human genome sequencing, genomic analyses

58 A Systems Approach to Targeting Cyclooxygenase: Genomics, Bioinformatics and Metabolomics Analysis of COX-1 -/- and COX-2-/- Lung Fibroblasts Providing Indication of Sterile Inflammation

Authors: Abul B. M. M. K. Islam, Mandar Dave, Roderick V. Jensen, Ashok R. Amin


A systems approach was applied to characterize differentially expressed transcripts, bioinformatics pathways, proteins, and prostaglandins (PGs) from lung fibroblasts procured from wild-type (WT), COX-1-/-, and COX-2-/- mice in order to understand system-level control mechanisms. Bioinformatics analysis showed that COX-2 and COX-1 ablated cells induced COX-1- and COX-2-specific signatures, respectively, which significantly overlapped with an 'IL-1β induced inflammatory signature'. This defined novel cross-talk signals that orchestrated coordinated activation of pathways of sterile inflammation sensed by cellular stress. The overlapping signals showed significant over-representation of shared pathways for interferon γ and immune responses, T cell functions, NOD, and toll-like receptor signaling. Gene Ontology Biological Process (GOBP) and pathway enrichment analysis specifically showed an increase in mRNA expression associated with: (a) organ development and homeostasis in COX-1-/- cells and (b) oxidative stress and response, spliceosome and proteasome activity, and mTOR and p53 signaling in COX-2-/- cells. COX-1 and COX-2 showed signs of functional pathways committed to cell cycle and DNA replication at the genomics level. Compared with WT, metabolomics analysis revealed a significant increase in COX-1 mRNA and in the synthesis of basal levels of eicosanoids (PGE2, PGD2, TXB2, LTB4, PGF1α, and PGF2α) in COX-2 ablated cells, and an increase in the synthesis of PGE2 and PGF1α in COX-1 null cells. There was a compensation of PGE2 and PGF1α in COX-1-/- and COX-2-/- cells. Collectively, these results support a broader, differential, and collaborative regulation of both COX-1 and COX-2 pathways at the metabolic, signaling, and genomics levels in cellular homeostasis and sterile inflammation induced by cellular stress.

Keywords: cyclooxygenases, inflammation, lung fibroblasts, systemic

57 Nutritional Genomics Profile Based Personalized Sport Nutrition

Authors: Eszter Repasi, Akos Koller


Our genetic information determines our appearance, physiology, sports performance, and all our other features. Efforts to maximize athletes' performance have adopted a science-based approach to nutritional support. Genetic studies have now blended with the nutritional sciences, and a new, dynamically evolving research field has appeared. Nutritional experts need to use nutritional genomics. This recent field of nutritional science can help athletes reach their best sports performance by using correlations between the athlete’s genome, nutrition, and molecules, including the human microbiome (links between food, microbiome, and epigenetics), nutrigenomics, and nutrigenetics. Nutritional genomics has tremendous potential to change the future of dietary guidelines and personal recommendations. Experts need to use new technology to gather information about athletes, such as a nutritional genomics profile (including determination of the oral and gut microbiome and of DNA-coded reactions to food components), which can modify the preparation period and sports performance. The influence of nutrients on gene expression is called nutrigenomics. The heterogeneous response of gene variants to nutrients and dietary components is called nutrigenetics. The human microbiome plays a critical role in health and well-being, and there are further links between food or nutrition and the composition of the human microbiome, which can lead to disease and epigenetic changes as well. A nutritional genomics-based profile of an athlete can be the best technique for a dietitian to create a unique sports nutrition diet plan. Using functional food and the right food components can affect the state of health and thus sports performance. Scientists need to determine the best response, given that nutrients affect health by altering the genome, promoting metabolites, and changing physiology.
Nutritional biochemistry explains why polymorphisms in genes for the absorption, circulation, or metabolism of essential nutrients (such as n-3 polyunsaturated fatty acids or epigallocatechin-3-gallate) would affect the efficacy of that nutrient. Controlling nutritional deficiencies and failures, and preventing changes in health state or a newly discovered food intolerance, under the observation of a proper medical team, can support better sports performance. It is important that the dietetics profession be informed about gene-diet interactions that may lead to optimal health and a reduced risk of injury or disease. A dedicated medical application for documenting and monitoring health state and risk factors can support and alert the medical team to take early action and help it provide proper health services in time. This model can provide personalized nutrition advice from status control, through recovery, to monitoring. However, more studies are needed to understand the mechanisms and to be able to change the composition of the microbiome and the environmental and genetic risk factors in athletes.

Keywords: gene-diet interaction, multidisciplinary team, microbiome, diet plan

56 Isolate-Specific Variations among Clinical Isolates of Brucella Identified by Whole-Genome Sequencing, Bioinformatics and Comparative Genomics

Authors: Abu S. Mustafa, Mohammad W. Khan, Faraz Shaheed Khan, Nazima Habibi


Brucellosis is a zoonotic disease of worldwide prevalence. There are at least four species and several strains of Brucella that cause human disease. Brucella genomes have very limited variation across strains, which hinders strain identification using classical molecular techniques, including PCR and 16S rDNA sequencing. The aim of this study was to perform whole-genome sequencing of clinical isolates of Brucella, followed by bioinformatics and comparative genomics analyses, to determine whether genetic differences exist across isolates of a single Brucella species and strain. Draft sequence data were generated from 15 clinical isolates of Brucella melitensis (biovar 2 strain 63/9) using the MiSeq next-generation sequencing platform. The generated reads were used for assembly and analysis. All analyses were performed on a bioinformatics workstation (8-core i7 processor, 8 GB RAM, Bio-Linux operating system). FastQC was used to determine the quality of the reads, and low-quality reads were trimmed or eliminated using Fastx_trimmer. Assembly was done using the Velvet and ABySS software. The assembled contigs were ordered with Mauve. The online server RAST was employed to annotate the contig assemblies. Annotated genomes were compared using the Mauve and ACT tools. The QC score for the DNA sequence data generated by MiSeq was higher than 30 for 80% of reads, with more than 100x coverage, which suggested that the data could be used for further analysis. However, when analyzed by FastQC, the read quality of four samples was not good enough to create a complete genome draft, so the remaining 11 samples were used for further analysis. The comparative genome analyses showed that, despite sharing the same gene sets, the genomes differed by single nucleotide polymorphisms and insertions/deletions, which gave these bacteria a variable extent of diversity.
In conclusion, next-generation sequencing, bioinformatics, and comparative genome analysis can be used to find variations (point mutations, insertions, and deletions) across different genomes of Brucella within a single strain. This information could be useful in surveillance and epidemiological studies. The study was supported by Kuwait University Research Sector grants MI04/15 and SRUL02/13.

Keywords: brucella, bioinformatics, comparative genomics, whole genome sequencing
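
The acceptance criterion quoted in the abstract (quality score above 30 for 80% of reads, at more than 100x coverage) can be illustrated with a short sketch. It assumes Phred+33 quality encoding, as used in current Illumina FASTQ files, and it assumes that "score of a read" means the mean Phred score of that read; the abstract does not say how the score was aggregated. The function names and the toy quality strings are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_passing(quality_strings, threshold=30):
    """Fraction of reads whose mean Phred score reaches the threshold."""
    passing = 0
    for quality in quality_strings:
        scores = phred_scores(quality)
        if sum(scores) / len(scores) >= threshold:
            passing += 1
    return passing / len(quality_strings)


def mean_coverage(read_lengths, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return sum(read_lengths) / genome_size


# Toy data: 'I' decodes to Q40, '?' to Q30, '#' to Q2.
toy_qualities = ["IIIII", "IIIII", "?????", "?????", "#####"]
passing_fraction = fraction_passing(toy_qualities)

# Toy coverage: four 250-base reads over a 10-base "genome".
toy_depth = mean_coverage([250, 250, 250, 250], genome_size=10)
```

With these toy inputs, four of the five reads reach Q30, so the sample would just meet an 80% criterion of the kind described above.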

55 Diversity, Biochemical and Genomic Assessment of Selected Benthic Species of Two Tropical Lagoons, Southwest Nigeria

Authors: G. F. Okunade, M. O. Lawal, R. E. Uwadiae, D. Portnoy


The diversity, physico-chemical, biochemical, and genomic assessment of macrofauna species of Ologe and Badagry Lagoons was carried out between August 2016 and July 2018. The concentrations of Fe, Zn, Mn, Cd, Cr, and Pb in water were determined by Atomic Absorption Spectrophotometer (AAS). Particle size distribution was determined by wet sieving and sedimentation using the hydrometer method. Genomic analyses were carried out using 25 P. fusca (quadriseriata) and 25 P. fusca from each lagoon, owing to their abundance in both lagoons throughout the two years of collection. DNA was isolated from each sample using the Mag-Bind Blood and Tissue DNA HD 96 kit, a method designed to isolate high-quality DNA. The biochemical characteristics were analysed in the dominant species (P. aurita and T. fuscatus) using ELISA kits. Physico-chemical parameters such as pH, total dissolved solids (TDS), dissolved oxygen, and conductivity were analysed using APHA standard protocols. The physico-chemical parameters of water quality had mean values of 32.46 ± 0.66 mg/L and 41.93 ± 0.65 mg/L for COD, 27.28 ± 0.97 mg/L and 34.82 ± 0.1 mg/L for BOD, 0.04 ± 4.71 mg/L for DO, and 6.65 and 6.58 for pH in Ologe and Badagry lagoons, respectively, with significant variations (p ≤ 0.05) across seasons. The mean and standard deviation of salinity for Ologe and Badagry Lagoons ranged from 0.43 ± 0.30 to 0.27 ± 0.09. A total of 4210 individuals were recorded: 2008 in Ologe lagoon, belonging to one phylum, two classes, and four families, and 2202 in Badagry lagoon, belonging to one phylum, two classes, and five families. The class composition at Ologe lagoon was 99% gastropod and 1% bivalve, while in Badagry lagoon gastropods contributed 98.91% and bivalves 1.09%.
Particle sizes ranged from 0.002 mm to 2.00 mm. Ologe lagoon recorded 0.83% gravel, 97.83% sand, and 1.33% silt particles, while Badagry lagoon recorded 7.43% sand, 24.71% silt, and 67.86% clay particles, a pattern associated with the excessive dredging activities going on in the lagoon. The maximum percentage of sand (100%) was seen at station 6 in Ologe lagoon, while the minimum (96%) was found at station 1. P. aurita (Ologe Lagoon) and T. fuscatus (Badagry Lagoon) were the most abundant benthic species, contributing 61.05% and 64.35%, respectively. The enzymatic activities of P. aurita in Ologe Lagoon had mean values of 21.03 mg/dl for AST, 10.33 mg/dl for ALP, 82.16 mg/dl for ALT, and 73.06 mg/dl for CHO, while T. fuscatus (Badagry Lagoon) recorded mean values of 29.76 mg/dl for AST, 11.69 mg/L for ALP, 140.58 mg/dl for ALT, and 45.98 mg/dl for CHO. There were significant variations (P < 0.05) in the AST and CHO activity levels in the muscles of the species.

Keywords: benthos, biochemical responses, genomics, metals, particle size

54 Genomic Surveillance of Bacillus Anthracis in South Africa Revealed a Unique Genetic Cluster of B- Clade Strains

Authors: Kgaugelo Lekota, Ayesha Hassim, Henriette Van Heerden


Bacillus anthracis, the causative agent of anthrax, is composed of three genetic groups, namely A, B, and C. The A-clade is distributed worldwide, while B sub-clades have been identified in Kruger National Park (KNP), South Africa. KNP is one of the endemic anthrax regions in South Africa and has distinctive genetic diversity. Genomic surveillance of KNP B. anthracis strains was carried out on historical culture collection isolates (n=67) dating from the 1990s to 2015 using a whole-genome sequencing approach. Whole-genome single nucleotide polymorphism (SNP) and pan-genomics analyses were used to define the B. anthracis genetic population structure. This study showed that KNP has heterologous B. anthracis strains grouping in the A-clade, with the ABr.005/006 (Ancient A) SNP lineage more prominent. The 2012 and 2015 anthrax isolates are dispersed amongst minor sub-clades that prevail in strains whose genetic evolution has not stabilized. This was supported by non-parsimony-informative SNPs of the B. anthracis strains across minor sub-clades of the Ancient A clade. Pan-genomics of B. anthracis showed a clear distinction between A- and B-clade genomes, with 11 374 predicted clusters of protein-coding genes. Unique accessory genes of the B-clade genomes included cell wall biosynthesis genes and multidrug resistance genes for fosfomycin. South Africa harbours diverse B. anthracis strains with uniquely defined SNPs. The B. anthracis strains sequenced in this study will serve as a means to further trace the dissemination of B. anthracis outbreaks globally and especially in South Africa.

Keywords: bacillus anthracis, whole genome single nucleotide polymorphisms, pangenomics, kruger national park
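
The distinction between core genes and clade-specific accessory genes that a pan-genome analysis draws can be sketched with plain set operations. This is an illustration only: the abstract does not name the pan-genomics software used, and the isolate names, clade labels, and gene-cluster identifiers below are invented placeholders.

```python
def core_genes(genomes):
    """Gene clusters present in every genome."""
    return set.intersection(*genomes.values())


def clade_specific_genes(genomes, clade_of, clade):
    """Clusters shared by all genomes of one clade and absent from all others."""
    inside = [genes for name, genes in genomes.items() if clade_of[name] == clade]
    outside = [genes for name, genes in genomes.items() if clade_of[name] != clade]
    shared_inside = set.intersection(*inside)
    seen_outside = set().union(*outside)
    return shared_inside - seen_outside


# Hypothetical presence lists of gene clusters per isolate.
genomes = {
    "A_isolate_1": {"cluster_001", "cluster_002", "cluster_010"},
    "A_isolate_2": {"cluster_001", "cluster_002"},
    "B_isolate_1": {"cluster_001", "cluster_002", "cluster_020", "cluster_021"},
    "B_isolate_2": {"cluster_001", "cluster_002", "cluster_020"},
}
clade_of = {
    "A_isolate_1": "A",
    "A_isolate_2": "A",
    "B_isolate_1": "B",
    "B_isolate_2": "B",
}

core = core_genes(genomes)
b_specific = clade_specific_genes(genomes, clade_of, "B")
```

Here `cluster_020` is the only cluster carried by every B-clade genome and by no A-clade genome, which is the kind of unique accessory gene the abstract reports for the B-clade.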

53 C-eXpress: A Web-Based Analysis Platform for Comparative Functional Genomics and Proteomics in Human Cancer Cell Line, NCI-60 as an Example

Authors: Chi-Ching Lee, Po-Jung Huang, Kuo-Yang Huang, Petrus Tang


Background: Recent advances in high-throughput research technologies such as new-generation sequencing and multi-dimensional liquid chromatography make it possible, for the first time, to dissect the complete transcriptome and proteome in a single run. However, it is almost impossible for many laboratories to handle and analyze these “BIG” data without support from a bioinformatics team. We aimed to provide a web-based analysis platform that lets users with only limited knowledge of bio-computing study functional genomics and proteomics. Method: We use NCI-60 as an example dataset to demonstrate the power of the web-based analysis platform and data delivery system. C-eXpress takes as input a simple text file that contains standard NCBI gene or protein IDs and expression levels (rpkm or fold) and generates a distribution map of gene/protein expression levels as a heatmap organized by color gradients. The diagram is hyperlinked to a dynamic HTML table that allows users to filter the datasets based on various gene features. A dynamic summary chart is generated automatically after each filtering step. Results: We implemented an integrated database that contains pre-defined annotations such as gene/protein properties (ID, name, length, MW, pI); pathways based on KEGG and GO biological process; subcellular localization based on GO cellular component; and functional classification based on GO molecular function, kinase, peptidase, and transporter. Multiple ways of sorting columns and rows are also provided for comparative analysis and visualization of multiple samples.

Keywords: cancer, visualization, database, functional annotation
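
To make the input format concrete, the sketch below parses a small text table of IDs and expression levels, log-transforms the values as is common before drawing a heatmap, and applies a simple filter. It is not C-eXpress code: the tab-delimited layout, the log2 transform with a pseudocount, the function names, and the values are all assumptions made for illustration.

```python
import math


def parse_expression(text):
    """Parse lines of 'ID<TAB>value<TAB>value...' into {ID: [values]}."""
    table = {}
    for line in text.strip().splitlines():
        gene_id, *values = line.split("\t")
        table[gene_id] = [float(v) for v in values]
    return table


def log2_transform(table, pseudocount=1.0):
    """log2(value + pseudocount), a common scaling before heatmap coloring."""
    return {
        gene_id: [math.log2(v + pseudocount) for v in values]
        for gene_id, values in table.items()
    }


def filter_by_max(table, minimum):
    """Keep rows whose highest value reaches the minimum."""
    return {g: vals for g, vals in table.items() if max(vals) >= minimum}


# Placeholder IDs and expression values for two samples.
raw_text = "7157\t3.0\t7.0\n1956\t0.0\t1.0\n"
expression = parse_expression(raw_text)
scaled = log2_transform(expression)
kept = filter_by_max(scaled, minimum=2.0)
```

The filtered dictionary is the kind of reduced table that a dynamic filter would hand back to the heatmap and summary chart.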

52 In silico Subtractive Genomics Approach for Identification of Strain-Specific Putative Drug Targets among Hypothetical Proteins of Drug-Resistant Klebsiella pneumoniae Strain 825795-1

Authors: Umairah Natasya Binti Mohd Omeershffudin, Suresh Kumar


Klebsiella pneumoniae is a Gram-negative enteric bacterium that causes nosocomial and urinary tract infections. Of particular concern is the global emergence of multidrug-resistant (MDR) strains of Klebsiella pneumoniae. Characterization of antibiotic resistance determinants at the genomic level plays a critical role in understanding, and potentially controlling, the spread of MDR pathogens. In this study, drug-resistant Klebsiella pneumoniae strain 825795-1 was investigated with extensive computational approaches aimed at identifying novel drug targets among hypothetical proteins. We analyzed the 1099 hypothetical proteins available in the genome. We used an in silico genome subtraction methodology to identify potential pathogen-specific drug targets against Klebsiella pneumoniae. We employed bioinformatics tools to subtract the strain-specific paralogous and host-specific homologous sequences from the bacterial proteome. The resulting 645 proteins were further refined to identify the essential genes in the pathogenic bacterium using the Database of Essential Genes (DEG). We found 135 unique essential proteins in the target proteome that could be used as novel targets for designing new drugs. We then identified 49 cytoplasmic proteins as potential drug targets through subcellular localization prediction. Finally, we searched for these proteins in the DrugBank database, and 11 of the unique essential proteins showed druggability against FDA-approved drug entries with diverse broad-spectrum properties. The results of this study will facilitate the discovery of new drugs against Klebsiella pneumoniae.

Keywords: pneumonia, drug target, hypothetical protein, subtractive genomics
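
The subtraction funnel described above (1099 hypothetical proteins, then 645 after removing paralogs and host homologs, then 135 essential, 49 cytoplasmic, and 11 druggable) amounts to a chain of set differences and intersections. The sketch below shows that logic on toy data; in the study each reference set would come from a separate tool or database search, and the protein IDs here are invented for illustration.

```python
def subtractive_screen(hypothetical, paralogs, host_homologs,
                       essential, cytoplasmic, druggable):
    """Apply each subtraction or intersection and keep every intermediate set."""
    non_redundant = hypothetical - paralogs - host_homologs
    essential_hits = non_redundant & essential
    return {
        "non_redundant": non_redundant,
        "essential": essential_hits,
        "cytoplasmic": essential_hits & cytoplasmic,
        "druggable": essential_hits & druggable,
    }


# Hypothetical protein IDs, not taken from the strain 825795-1 genome.
stages = subtractive_screen(
    hypothetical={"hp1", "hp2", "hp3", "hp4", "hp5", "hp6"},
    paralogs={"hp2"},
    host_homologs={"hp3"},
    essential={"hp1", "hp4", "hp5", "hp9"},
    cytoplasmic={"hp1", "hp4"},
    druggable={"hp4", "hp6"},
)
funnel = {name: len(ids) for name, ids in stages.items()}
```

Keeping every intermediate set, rather than only the final one, makes it easy to report the count at each step, as the abstract does.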

51 The Development and Provision of a Knowledge Management Ecosystem, Optimized for Genomics

Authors: Matthew I. Bellgard


The field of bioinformatics has made, and continues to make, substantial progress and contributions to life science research and development. However, this paper contends that a systems approach integrates bioinformatics activities for any project in a defined manner. The application of critical control points in this bioinformatics systems approach may be useful to identify and evaluate points in a pathway where specified activity risk can be reduced, monitored and quality enhanced.

Keywords: bioinformatics, food security, personalized medicine, systems approach

50 Systematic Identification of Noncoding Cancer Driver Somatic Mutations

Authors: Zohar Manber, Ran Elkon


Accumulation of somatic mutations (SMs) in the genome is a major driving force of cancer development. Most SMs in the tumor's genome are functionally neutral; however, some cause damage to critical processes and provide the tumor with a selective growth advantage (termed cancer driver mutations). Current research on functional significance of SMs is mainly focused on finding alterations in protein coding sequences. However, the exome comprises only 3% of the human genome, and thus, SMs in the noncoding genome significantly outnumber those that map to protein-coding regions. Although our understanding of noncoding driver SMs is very rudimentary, it is likely that disruption of regulatory elements in the genome is an important, yet largely underexplored mechanism by which somatic mutations contribute to cancer development. The expression of most human genes is controlled by multiple enhancers, and therefore, it is conceivable that regulatory SMs are distributed across different enhancers of the same target gene. Yet, to date, most statistical searches for regulatory SMs have considered each regulatory element individually, which may reduce statistical power. The first challenge in considering the cumulative activity of all the enhancers of a gene as a single unit is to map enhancers to their target promoters. Such mapping defines for each gene its set of regulating enhancers (termed "set of regulatory elements" (SRE)). Considering multiple enhancers of each gene as one unit holds great promise for enhancing the identification of driver regulatory SMs. However, the success of this approach is greatly dependent on the availability of comprehensive and accurate enhancer-promoter (E-P) maps. To date, the discovery of driver regulatory SMs has been hindered by insufficient sample sizes and statistical analyses that often considered each regulatory element separately. 
In this study, we analyzed more than 2,500 whole-genome sequence (WGS) samples provided by The Cancer Genome Atlas (TCGA) and The International Cancer Genome Consortium (ICGC) in order to identify such driver regulatory SMs. Our analyses took into account the combinatorial aspect of gene regulation by considering all the enhancers that control the same target gene as one unit, based on E-P maps from three genomics resources. The identification of candidate driver noncoding SMs is based on their recurrence. We searched for SREs of genes that are "hotspots" for SMs (that is, they accumulate SMs at a significantly elevated rate). To test the statistical significance of recurrence of SMs within a gene's SRE, we used both global and local background mutation rates. Using this approach, we detected - in seven different cancer types - numerous "hotspots" for SMs. To support the functional significance of these recurrent noncoding SMs, we further examined their association with the expression level of their target gene (using gene expression data provided by the ICGC and TCGA for samples that were also analyzed by WGS).

Keywords: cancer genomics, enhancers, noncoding genome, regulatory elements
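
The recurrence test described above, which pools all enhancers of a gene into one SRE and compares the observed number of somatic mutations with a background rate, can be sketched as follows. The abstract does not name the statistical model, so the Poisson tail probability used here is an assumption, as are the function names and every number in the example.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def sre_hotspot_pvalue(observed, sre_length_bp, n_samples, rate_per_bp):
    """Tail probability of the observed mutation count under a background rate."""
    expected = rate_per_bp * sre_length_bp * n_samples
    return poisson_sf(observed, expected)


# Illustrative values: a 2,000 bp SRE pooled over 500 samples.
observed_mutations = 10
p_global = sre_hotspot_pvalue(observed_mutations, 2_000, 500, rate_per_bp=1e-6)
p_local = sre_hotspot_pvalue(observed_mutations, 2_000, 500, rate_per_bp=3e-6)

# Conservative call: require significance against both backgrounds.
p_conservative = max(p_global, p_local)
```

Pooling the enhancers raises the expected count for the unit as a whole, so recurrent mutations spread thinly over several enhancers of one gene can reach significance even when no single enhancer does.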

49 Predictive Pathogen Biology: Genome-Based Prediction of Pathogenic Potential and Countermeasures Targets

Authors: Debjit Ray


Horizontal gene transfer (HGT) and recombination lead to the emergence of bacterial antibiotic resistance and pathogenic traits. Comparing a large number of fully sequenced genomes across a species or genus makes it possible to identify HGT events, define the phylogenetic range of HGT, and find potential sources of new resistance genes. In-depth comparative phylogenomics can also identify subtle genome or plasmid structural changes or mutations associated with phenotypic changes. Comparative phylogenomics requires accurately sequenced, complete, and properly annotated genomes of the organism. Assembling closed genomes requires additional mate-pair reads or “long read” sequencing data to accompany short-read paired-end data. To bring down the cost and time required to produce assembled genomes and to annotate genome features that inform drug resistance and pathogenicity, we are analyzing the genome assembly performance of data from the Illumina NextSeq, which has faster throughput than the Illumina HiSeq (~1-2 days versus ~1 week), and shorter reads (150 bp paired-end versus 300 bp paired-end) but higher capacity (150-400M reads per run versus ~5-15M) compared to the Illumina MiSeq. Bioinformatics improvements are also needed to make rapid, routine production of complete genomes a reality. Modern assemblers such as SPAdes 3.6.0 running on a standard Linux blade can convert mixes of reads from different library preps into high-quality assemblies with only a few gaps in a few hours. Remaining breaks in scaffolds, which are generally due to repeats (e.g., rRNA genes), are addressed by our gap-closure software, which avoids custom PCR or targeted sequencing. Our goal is to improve the understanding of the emergence of pathogenesis using sequencing, comparative genomics, and machine learning analysis of ~1000 pathogen genomes.
Machine learning algorithms will be used to digest the diverse features (change in virulence genes, recombination, horizontal gene transfer, patient diagnostics). Temporal data and evolutionary models can thus determine whether the origin of a particular isolate is likely to have been from the environment (could it have evolved from previous isolates). It can be useful for comparing differences in virulence along or across the tree. More intriguing, it can test whether there is a direction to virulence strength. This would open new avenues in the prediction of uncharacterized clinical bugs and multidrug resistance evolution and pathogen emergence.

Keywords: genomics, pathogens, genome assembly, superbugs

48 PTFE Capillary-Based DNA Amplification within an Oscillatory Thermal Cycling Device

Authors: Jyh J. Chen, Fu H. Yang, Ming H. Liao


This study describes a capillary-based device integrated with heating and cooling modules for polymerase chain reaction (PCR). The device consists of a polytetrafluoroethylene (PTFE) reaction capillary and aluminum blocks, and is equipped with two cartridge heaters, a thermoelectric (TE) cooler, a fan, and several thermocouples for temperature control. The cartridge heaters are placed in the heating blocks and maintained at two different temperatures to achieve the denaturation and extension steps. Thermocouples inserted into the capillary are used to obtain the transient temperature profiles of the reaction sample during thermal cycles. A 483-bp DNA template is amplified successfully in both the designed system and a traditional thermal cycler. This work should be of interest to those working on high-temperature reactions, genomics, or cell analysis.

Keywords: polymerase chain reaction, thermal cycles, capillary, TE cooler

47 A New Approach for Improving Accuracy of Multi Label Stream Data

Authors: Kunal Shah, Swati Patel


Many real-world problems involve data that can be considered multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as the learners must be able to adapt to change using limited time and memory. Classification is used to predict the class of an unseen instance as accurately as possible. Multi-label classification is a variant of single-label classification in which a set of labels is associated with a single instance. Multi-label classification is used by modern applications such as text classification, functional genomics, image classification, and music categorization. This paper introduces the task of multi-label classification, methods for multi-label classification, and evaluation measures for multi-label classification. A comparative analysis of multi-label classification methods was also carried out, first on the basis of theoretical study and then on the basis of simulation on various data sets.

Keywords: binary relevance, concept drift, data stream mining, MLSC, multiple window with buffer
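
The binary relevance approach named in the keywords can be illustrated with a minimal sketch. It assumes one online perceptron per label, a stand-in chosen for brevity and not a method taken from the paper. The class names `Perceptron` and `BinaryRelevance`, the `hamming_loss` helper (Hamming loss is a commonly used multi-label evaluation measure), and the feature layout are all hypothetical.

```python
class Perceptron:
    """Tiny online binary classifier (assumed base learner, not from the paper)."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        score = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score > 0 else 0

    def learn_one(self, x, y):
        # Classic perceptron update: adjust weights only on a mistake.
        err = y - self.predict(x)
        if err:
            self.w = [wi + err * xi for wi, xi in zip(self.w, x)]
            self.b += err


class BinaryRelevance:
    """Binary relevance: one independent binary model per label."""

    def __init__(self, labels, n_features):
        self.models = {lab: Perceptron(n_features) for lab in labels}

    def learn_one(self, x, label_set):
        # Each arriving instance updates every per-label model once,
        # so memory stays constant as the stream grows.
        for lab, model in self.models.items():
            model.learn_one(x, 1 if lab in label_set else 0)

    def predict(self, x):
        return {lab for lab, model in self.models.items() if model.predict(x)}


def hamming_loss(true_sets, pred_sets, labels):
    """Fraction of (instance, label) slots that are predicted wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * len(labels))
```

Because every label gets its own model, this design ignores correlations between labels; that simplicity is what makes it a common baseline in comparative studies.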

46 Genodata: The Human Genome Variation Using BigData

Authors: Surabhi Maiti, Prajakta Tamhankar, Prachi Uttam Mehta


Since the completion of the Human Genome Project, there has been an unparalleled escalation in the sequencing of genomic data. This project has been the first major leap in the field of medical research, especially in genomics. This project won accolades by using a concept called Bigdata, which was earlier used extensively to gain value for business. Bigdata makes use of data sets that are generally in the form of files of terabyte, petabyte, or exabyte size; such data sets were traditionally used and managed using Excel sheets and RDBMS. The voluminous data made the process tedious and time-consuming, and hence a stronger framework called Hadoop was introduced in the field of genetic sciences to make data processing faster and more efficient. This paper focuses on using SPARK, which is gaining momentum with the advancement of BigData technologies. Cloud storage is an effective medium for storing the large data sets generated by genetic research and the result sets produced by SPARK analysis.

Keywords: human genome project, Bigdata, genomic data, SPARK, cloud storage, Hadoop

45 Single Cell Analysis of Circulating Monocytes in Prostate Cancer Patients

Authors: Leander Van Neste, Kirk Wojno


The innate immune system reacts to foreign insult in several unique ways, one of which is phagocytosis of perceived threats such as cancer, bacteria, and viruses. The goal of this study was to look for evidence of phagocytosed RNA from tumor cells in circulating monocytes. While all monocytes possess phagocytic capabilities, the non-classical CD14+/FCGR3A+ monocytes and the intermediate CD14++/FCGR3A+ monocytes most actively remove threatening ‘external’ cellular materials. Purified CD14-positive monocyte samples from fourteen patients recently diagnosed with clinically localized prostate cancer (PCa) were investigated by single-cell RNA sequencing using the 10X Genomics protocol followed by paired-end sequencing on Illumina’s NovaSeq. Control samples were processed in the same way: one patient who underwent biopsy but was found not to harbor prostate cancer (benign), three young, healthy men, and three men previously diagnosed with prostate cancer who had recently undergone (curative) radical prostatectomy (post-RP). Sequencing data were mapped using 10X Genomics’ CellRanger software and viable cells were subsequently identified using CellBender, removing technical artifacts such as doublets and non-cellular RNA. Next, data analysis was performed in R, using the Seurat package. Because the main goal was to identify differences between PCa patients and ‘control’ patients, rather than exploring differences between individual subjects, the individual Seurat objects of all 21 patients were merged into one Seurat object per Seurat’s recommendation. Finally, the single-cell dataset was normalized as a whole prior to further analysis. Cell identity was assessed using the SingleR and celldex packages. The Monaco Immune Data was selected as the reference dataset, consisting of bulk RNA-seq data of sorted human immune cells. 
The Monaco classification was supplemented with normalized PCa data obtained from The Cancer Genome Atlas (TCGA), which consists of bulk RNA sequencing data from 499 prostate tumor tissues (including 1 metastatic) and 52 (adjacent) normal prostate tissues. SingleR was subsequently run on the combined immune cell and PCa datasets. As expected, the vast majority of cells were labeled as having a monocytic origin (~90%), with the most noticeable difference being the larger number of intermediate monocytes in the PCa patients (13.6% versus 7.1%; p<.001). In men harboring PCa, 0.60% of all purified monocytes were classified as harboring PCa signals when the TCGA data were included. This was 3-fold, 7.5-fold, and 4-fold higher compared to post-RP, benign, and young men, respectively (all p<.001). In addition, with 7.91%, the number of unclassified cells, i.e., cells with pruned labels due to high uncertainty of the assigned label, was also highest in men with PCa, compared to 3.51%, 2.67%, and 5.51% of cells in post-RP, benign, and young men, respectively (all p<.001). It can be postulated that actively phagocytosing cells are hardest to classify due to their dual immune cell and foreign cell nature. Hence, the higher number of unclassified cells and intermediate monocytes in PCa patients might reflect higher phagocytic activity due to tumor burden. This also illustrates that small numbers (~1%) of circulating peripheral blood monocytes that have interacted with tumor cells might still possess detectable phagocytosed tumor RNA.

Keywords: circulating monocytes, phagocytic cells, prostate cancer, tumor immune response

44 SPARK: An Open-Source Knowledge Discovery Platform That Leverages Non-Relational Databases and Massively Parallel Computational Power for Heterogeneous Genomic Datasets

Authors: Thilina Ranaweera, Enes Makalic, John L. Hopper, Adrian Bickerstaffe


Data are the primary asset of biomedical researchers, and the engine for both discovery and research translation. As the volume and complexity of research datasets increase, especially with new technologies such as large single nucleotide polymorphism (SNP) chips, so too does the requirement for software to manage, process and analyze the data. Researchers often need to execute complicated queries and conduct complex analyses of large-scale datasets. Existing tools to analyze such data, and other types of high-dimensional data, unfortunately suffer from one or more major problems. They typically require a high level of computing expertise, are too simplistic (i.e., do not fit realistic models that allow for complex interactions), are limited by computing power, do not exploit the computing power of large-scale parallel architectures (e.g. supercomputers, GPU clusters etc.), or are limited in the types of analysis available, compounded by the fact that integrating new analysis methods is not straightforward. Solutions to these problems, such as those developed and implemented on parallel architectures, are currently available to only a relatively small portion of medical researchers with access and know-how. The past decade has seen a rapid expansion of data management systems for the medical domain. Much attention has been given to systems that manage phenotype datasets generated by medical studies. The introduction of heterogeneous genomic data for research subjects that reside in these systems has highlighted the need for substantial improvements in software architecture. To address this problem, we have developed SPARK, an enabling and translational system for medical research that leverages existing high performance computing resources and analysis techniques currently available or being developed. It builds these into The Ark, an open-source web-based system designed to manage medical data. 
SPARK provides a next-generation biomedical data management solution that is based upon a novel Micro-Service architecture and Big Data technologies. The system serves to demonstrate the applicability of Micro-Service architectures for the development of high performance computing applications. When applied to high-dimensional medical datasets such as genomic data, relational data management approaches with normalized data structures suffer from unfeasibly high execution times for basic operations such as insert (i.e. importing a GWAS dataset) and the queries that are typical of the genomics research domain. SPARK resolves these problems by incorporating non-relational NoSQL databases that have been driven by the emergence of Big Data. SPARK provides researchers across the world with user-friendly access to state-of-the-art data management and analysis tools while eliminating the need for high-level informatics and programming skills. The system will benefit health and medical research by eliminating the burden of large-scale data management, querying, cleaning, and analysis. SPARK represents a major advancement in genome research technologies, vastly reducing the burden of working with genomic datasets, and enabling cutting edge analysis approaches that have previously been out of reach for many medical researchers.

Keywords: biomedical research, genomics, information systems, software

43 Complete Genome Sequence Analysis of Pasteurella multocida Subspecies multocida Serotype A Strain PMTB2.1

Authors: Shagufta Jabeen, Faez J. Firdaus Abdullah, Zunita Zakaria, Nurulfiza M. Isa, Yung C. Tan, Wai Y. Yee, Abdul R. Omar


Pasteurella multocida (PM) is an important veterinary opportunistic pathogen particularly associated with septicemic pasteurellosis, pneumonic pasteurellosis and hemorrhagic septicemia in cattle and buffaloes. P. multocida serotype A has been reported to cause fatal pneumonia and septicemia. Pasteurella multocida subspecies multocida serotype A Malaysian isolate PMTB2.1 was first isolated from buffaloes that died of septicemia. In this study, the genome of P. multocida strain PMTB2.1 was sequenced using a third-generation sequencing technology, the PacBio RS2 system, and analyzed bioinformatically via de novo assembly followed by in-depth analysis based on comparative genomics. De novo assembly of the PacBio raw reads generated 3 contigs; gap filling of the aligned contigs with PCR sequencing then produced a single contiguous circular chromosome with a genomic size of 2,315,138 bp and a GC content of approximately 40.32% (Accession number CP007205). The PMTB2.1 genome comprises 2,176 protein-coding sequences, 6 rRNA operons, 56 tRNAs and 4 ncRNA sequences. A comparative genome sequence analysis of PMTB2.1 with nine complete genomes, which include Actinobacillus pleuropneumoniae, Haemophilus parasuis, Escherichia coli and five P. multocida complete genome sequences including PM70, PM36950, PMHN06, PM3480, PMHB01 and PMTB2.1, was carried out based on OrthoMCL analysis and a Venn diagram. The analysis showed that 282 CDSs (13%) are unique to PMTB2.1 and 1,125 CDSs have orthologs in all genomes. This reflects the overall close relationship of these bacteria and supports their classification in the Gamma subdivision of the Proteobacteria. In addition, genomic distance analysis among all nine genomes indicated that PMTB2.1 is closely related to the other five Pasteurella genomes, with genomic distances of less than 0.13. 
Synteny analysis shows subtle differences in genetic structure among the P. multocida strains, indicating frequent gene transfer events among them. However, PM3480 and PM70 exhibited exceptionally large structural variation since they were swine and chicken isolates. Furthermore, the genomic structure of PMTB2.1 most closely resembles that of PM36950, with a genomic size difference of approximately 34,380 kb (smaller than PM36950), and the strain-specific Integrative and Conjugative Element (ICE) found only in PM36950 is absent from PMTB2.1. Meanwhile, two intact prophage sequences of approximately 62 kb were found only in PMTB2.1. One of the phages is similar to the transposable phage SfMu. The phylogenomic tree was constructed based on OrthoMCL analysis and rooted with E. coli, A. pleuropneumoniae and H. parasuis. The genome of P. multocida strain PMTB2.1 clustered with the bovine isolates PM36950 and PMHB01, was separated from the avian isolate PM70 and the swine isolates PM3480 and PMHN06, and is distant from Actinobacillus and Haemophilus. Previous studies based on single nucleotide polymorphisms (SNPs) and multilocus sequence typing (MLST) were unable to show a clear phylogenetic relationship between Pasteurella multocida and its different hosts. In conclusion, this study has provided insight into the genomic structure of PMTB2.1 in terms of potential genes that can function as virulence factors, for future study in elucidating the mechanisms behind the ability of the bacteria to cause disease in susceptible animals.

Keywords: comparative genomics, DNA sequencing, phage, phylogenomics
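
The Venn-diagram comparison described above (coding sequences unique to one genome versus those with orthologs in all genomes) reduces to set operations once each genome is represented by the ortholog groups it contains. The following is a minimal sketch under that assumption. The function names and the input layout are hypothetical and do not reflect OrthoMCL's actual output format; `gc_content` is included only to show how a figure such as the reported GC percentage is defined.

```python
def core_and_unique(groups_by_genome, target):
    """Return (core, unique) ortholog groups for the target genome.

    groups_by_genome: dict mapping genome name -> set of ortholog group IDs
    (a hypothetical layout, not OrthoMCL's native format).
    core   = groups present in the target and in every other genome
    unique = groups present in the target and in no other genome
    """
    target_groups = set(groups_by_genome[target])
    others = [g for name, g in groups_by_genome.items() if name != target]
    core = target_groups.intersection(*others)
    unique = target_groups.difference(*others)
    return core, unique


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

With real data, the size of the `unique` set divided by the total number of coding sequences gives the kind of percentage quoted in the abstract (282 of 2,176 is roughly 13%).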

42 Fuzzy Data, Random Drift, and a Theoretical Model for the Sequential Emergence of Religious Capacity in Genus Homo

Authors: Margaret Boone Rappaport, Christopher J. Corbally


The ancient ape ancestral population from which living great ape and human species evolved had demographic features affecting their evolution. The population was large, had great genetic variability, and natural selection was effective at honing adaptations. The emerging populations of chimpanzees and humans were affected more by founder effects and genetic drift because they were smaller. Natural selection did not disappear, but it was not as strong. Consequences of the 'population crash' and the human effective population size are introduced briefly. The history of the ancient apes is written in the genomes of living humans and great apes. The expansion of the brain began before the human line emerged. Coalescence times for some genes are very old – up to several million years, long before Homo sapiens. The mismatch between gene trees and species trees highlights the anthropoid speciation processes, and gives the human genome history a fuzzy, probabilistic quality. However, it suggests traits that might form a foundation for capacities emerging later. A theoretical model is presented in which the genomes of early ape populations provide the substructure for the emergence of religious capacity later on the human line. The model does not search for religion, but its foundations. It suggests a course by which an evolutionary line that began with prosimians eventually produced a human species with biologically based religious capacity. The model of the sequential emergence of religious capacity relies on cognitive science, neuroscience, paleoneurology, primate field studies, cognitive archaeology, genomics, and population genetics. And, it emphasizes five trait types: (1) Documented, positive selection of sensory capabilities on the human line may have favored survival, but also eventually enriched human religious experience. 
(2) The bonobo model suggests a possible down-regulation of aggression and increase in tolerance while feeding, as well as paedomorphism – but, in a human species that remains cognitively sharp (unlike the bonobo). The two species emerged from the same ancient ape population, so it is logical to search for shared traits. (3) An up-regulation of emotional sensitivity and compassion seems to have occurred on the human line. This finds support in modern genetic studies. (4) The authors’ published model of morality's emergence in Homo erectus encompasses a cognitively based, decision-making capacity that was hypothetically overtaken, in part, by religious capacity. Together, they produced a strong, variable, biocultural capability to support human sociability. (5) The full flowering of human religious capacity came with the parietal expansion and smaller face (klinorhynchy) found only in Homo sapiens. Details from paleoneurology suggest the stage was set for human theologies. Larger parietal lobes allowed humans to imagine inner spaces, processes, and beings, and, with the frontal lobe, led to the first theologies composed of structured and integrated theories of the relationships between humans and the supernatural. The model leads to the evolution of a small population of African hominins that was ready to emerge with religious capacity when the species Homo sapiens evolved two hundred thousand years ago. By 50-60,000 years ago, when human ancestors left Africa, they were fully enabled.

Keywords: genetic drift, genomics, parietal expansion, religious capacity

41 Development of DNA Fingerprints in Selected Medicinal Plants of India

Authors: V. Verma, Hazi Raja


Conventionally, morphological descriptors are routinely used for establishing the identity of varieties. But these morphological descriptors suffer from many drawbacks, such as the influence of environment on trait expression, epistatic interactions, and pleiotropic effects. Furthermore, the paucity of these descriptors for unequivocal identification of an increasing number of reference collection varieties makes it necessary to look for alternatives. Therefore, DNA fingerprint-based techniques were selected to define the systematic position of selected medicinal plants such as Plumbago zeylanica, Desmodium gangeticum, and Uraria picta. DNA fingerprinting of herbal plants can be useful in authenticating the various claims of medical uses related to the plants, and in germplasm characterization and conservation. In plants it has not only helped in identifying species but also in defining a new realm in plant genomics, plant breeding and biodiversity conservation. With the world paving the way for developments in biotechnology, DNA fingerprinting promises to be a very powerful tool in our future endeavors. Data will be presented on the development of microsatellite markers (SSR) used to fingerprint, characterize, and assess genetic diversity among 12 accessions of Plumbago zeylanica, 4 accessions of Desmodium gangeticum, and 4 accessions of Uraria picta.

Keywords: Plumbago zeylanica, Desmodium gangeticum, Uraria picta, microsatellite markers

40 High-Value Health System for All: Technologies for Promoting Health Education and Awareness

Authors: M. P. Sebastian


Health for all is considered as a sign of well-being and inclusive growth. New healthcare technologies are contributing to the quality of human lives by promoting health education and awareness, leading to the prevention, early diagnosis and treatment of the symptoms of diseases. Healthcare technologies have now migrated from the medical and institutionalized settings to the home and everyday life. This paper explores these new technologies and investigates how they contribute to health education and awareness, promoting the objective of high-value health system for all. The methodology used for the research is literature review. The paper also discusses the opportunities and challenges with futuristic healthcare technologies. The combined advances in genomics medicine, wearables and the IoT with enhanced data collection in electronic health record (EHR) systems, environmental sensors, and mobile device applications can contribute in a big way to high-value health system for all. The promise by these technologies includes reduced total cost of healthcare, reduced incidence of medical diagnosis errors, and reduced treatment variability. The major barriers to adoption include concerns with security, privacy, and integrity of healthcare data, regulation and compliance issues, service reliability, interoperability and portability of data, and user friendliness and convenience of these technologies.

Keywords: big data, education, healthcare, information communication technologies (ICT), patients, technologies

39 Modern Proteomics and the Application of Machine Learning Analyses in Proteomic Studies of Chronic Kidney Disease of Unknown Etiology

Authors: Dulanjali Ranasinghe, Isuru Supasan, Kaushalya Premachandra, Ranjan Dissanayake, Ajith Rajapaksha, Eustace Fernando


Proteomics studies of organisms are considered to be significantly more information-rich than their genomic counterparts because the proteome of an organism represents the expressed state of all its proteins at a given time. In modern top-down and bottom-up proteomics workflows, the primary analysis methods employed are gel-based methods such as two-dimensional (2D) electrophoresis and mass spectrometry-based methods. Machine learning (ML) and artificial intelligence (AI) have been used increasingly in modern biological data analyses. In particular, the fields of genomics, DNA sequencing, and bioinformatics have seen an increasing trend in the usage of ML and AI techniques in recent years. The use of these techniques in proteomics studies is only now beginning to materialise. Although there is a wealth of information available in the scientific literature pertaining to proteomics workflows, no comprehensive review addresses the various aspects of the combined use of proteomics and machine learning. The objective of this review is to provide a comprehensive outlook on the application of machine learning to known proteomics workflows in order to extract more meaningful information that could be useful in a plethora of applications such as medicine, agriculture, and biotechnology.

Keywords: proteomics, machine learning, gel-based proteomics, mass spectrometry

38 An Integrative Computational Pipeline for Detection of Tumor Epitopes in Cancer Patients

Authors: Tanushree Jaitly, Shailendra Gupta, Leila Taher, Gerold Schuler, Julio Vera


Genomics-based personalized medicine is a promising approach to fighting aggressive tumors based on a patient's specific tumor mutation and expression profiles. A remarkable case is dendritic cell-based immunotherapy, in which tumor epitopes targeting a patient's specific mutations are used to design a vaccine that helps stimulate cytotoxic T cell-mediated anticancer immunity. Here we present a computational pipeline for epitope-based personalized cancer vaccines using patient-specific haplotype and cancer mutation profiles. In the proposed workflow, we analyze Whole Exome Sequencing and RNA Sequencing patient data to detect patient-specific mutations and their expression levels. Epitopes including the tumor mutations are computationally predicted using the patient's haplotype and filtered based on their expression level, binding affinity, and immunogenicity. We calculate the binding energy for each filtered major histocompatibility complex (MHC)-peptide complex using docking studies, and use this feature to further select good epitope candidates.

Keywords: cancer immunotherapy, epitope prediction, NGS data, personalized medicine
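
The filter-then-rank step of the workflow can be sketched as follows. This is only an illustration of the described logic: the `Epitope` fields, the function name `select_epitopes`, and every threshold value are assumptions made for the example, since the abstract does not state the actual cut-offs or the tools that produce these scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epitope:
    peptide: str
    expression_tpm: float   # expression of the mutated transcript (assumed unit)
    ic50_nm: float          # predicted MHC binding affinity; lower binds better
    immunogenicity: float   # predicted score in [0, 1] (assumed scale)
    binding_energy: float   # docking energy; more negative is more favorable


def select_epitopes(candidates, top_n,
                    min_tpm=1.0, max_ic50_nm=500.0, min_immunogenicity=0.5):
    """Filter on expression, affinity and immunogenicity, then rank by docking energy.

    All thresholds are illustrative placeholders, not values from the paper.
    """
    passed = [
        e for e in candidates
        if e.expression_tpm >= min_tpm
        and e.ic50_nm <= max_ic50_nm
        and e.immunogenicity >= min_immunogenicity
    ]
    # Docking is applied only to the filtered set in the described workflow,
    # so the ranking here uses binding_energy of the survivors.
    ranked = sorted(passed, key=lambda e: e.binding_energy)
    return ranked[:top_n]
```

Filtering before ranking mirrors the order in the abstract, where the comparatively expensive docking calculation is run only on epitopes that already passed the cheaper criteria.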

37 Whole Coding Genome Inter-Clade Comparison to Predict Global Cancer-Protecting Variants

Authors: Lamis Naddaf, Yuval Tabach


In this research, we identified missense genetic variants that have the potential to enhance resistance against cancer. This field has not been widely explored, as researchers tend to investigate mutations that cause diseases, in response to the suffering of patients, rather than mutations that protect against them. In conjunction with the genomic revolution and the advances in genetic engineering and synthetic biology, identifying the protective variants will increase the power of genotype-phenotype predictions and can have significant implications for improved risk estimation, diagnostics, prognosis and even personalized therapy and drug discovery. To approach our goal, we systematically investigated the sites of the coding genomes and picked the alleles that showed a correlation with the species’ cancer resistance. We predicted 250 protecting variants (PVs) with a 0.01 false discovery rate and more than 20 thousand PVs with a 0.25 false discovery rate. Cancer resistance in mammals and reptiles was significantly predicted by the number of PVs a species has. Moreover, genes enriched with the protecting variants are enriched in pathways relevant to tumor suppression, such as Hedgehog signaling and silencing, whose improper activation is associated with the most common form of cancer malignancy. We also showed that the PVs are more abundant in healthy people than in cancer patients within different human races.

Keywords: comparative genomics, machine learning, cancer resistance, cancer-protecting alleles
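
The abstract reports variant counts at two false discovery rates but does not say how the rate was controlled. As an illustration only, the sketch below uses the Benjamini-Hochberg procedure, a widely used method that is assumed here and not confirmed by the source. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return a list of booleans marking which hypotheses are rejected.

    Benjamini-Hochberg step-up procedure: sort p-values, find the largest
    rank k with p_(k) <= k * alpha / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-site p-values; a looser rate admits more candidates.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
strict = sum(benjamini_hochberg(example, alpha=0.05))
loose = sum(benjamini_hochberg(example, alpha=0.25))
```

The same trade-off appears in the abstract: a tighter rate yields a short, high-confidence list, while a looser rate yields far more candidates at the cost of more expected false positives.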

36 Elucidation of the Sequential Transcriptional Activity in Escherichia coli Using Time-Series RNA-Seq Data

Authors: Pui Shan Wong, Kosuke Tashiro, Satoru Kuhara, Sachiyo Aburatani


Functional genomics and gene regulation inference have readily expanded our knowledge and understanding of gene interactions with regard to expression regulation. With the advancement of time-series transcriptome sequencing comes the ability to study sequential changes in the transcriptome. The method presented here augments existing regulation networks accumulated in the literature with transcriptome data gathered from time-series experiments to construct a sequential representation of transcription factor activity. The method is applied to a time-series RNA-Seq data set from Escherichia coli as it transitions from growth to stationary phase over five hours. Investigations are conducted on the various metabolic activities in gene regulation processes by taking advantage of the correlation between regulatory gene pairs to examine their activity on a dynamic network. In particular, the changes in metabolic activity during phase transition are analyzed with a focus on the pagP gene as well as other associated transcription factors. The visualization of the sequential transcriptional activity is used to describe the change in metabolic pathway activity originating from the pagP transcription factor, phoP. The results show a shift from amino acid and nucleic acid metabolism to energy metabolism during the transition to stationary phase in E. coli.

Keywords: Escherichia coli, gene regulation, network, time-series
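
One simple way to picture how correlation between a regulator and its target can vary over a time series is a sliding-window Pearson correlation. The sketch below is an assumption-laden illustration, not the authors' method: the window approach, the function names, and the expression values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def sliding_correlation(regulator, target, window):
    """Correlation of a regulator-target pair in each consecutive time window."""
    return [
        pearson(regulator[i:i + window], target[i:i + window])
        for i in range(len(regulator) - window + 1)
    ]
```

A change in sign or strength of the windowed correlation along the series is the kind of signal that could mark a regulatory edge as switching activity during a phase transition.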

35 Agile Methodology for Modeling and Design of Data Warehouses -AM4DW-

Authors: Nieto Bernal Wilson, Carmona Suarez Edgar


Organizations hold structured and unstructured information in different formats, sources, and systems. Part of it comes from ERP systems under the OLTP processing that supports the information system. At the OLAP processing level, however, these organizations show some deficiencies; part of the problem lies in a lack of interest in extracting knowledge from their data sources, as well as in the absence of the operational capabilities needed to tackle this kind of project. Data warehouses and their applications are considered non-proprietary tools that are of great interest to business intelligence, since they are the base repositories for creating models or patterns (behavior of customers, suppliers, products, social networks and genomics) and facilitate corporate decision making and research. This paper presents a simple, structured methodology inspired by agile development models such as Scrum, XP and AUP. It also draws on object-relational models, spatial data models, and the baseline of data modeling under UML and Big Data; in this way, we seek to deliver an agile methodology for developing data warehouses that is simple and easy to apply. The methodology naturally takes into account the application of processes for the respective information analysis, visualization and data mining, particularly for generating patterns and derived models from the structured fact objects.

Keywords: data warehouse, model data, big data, object fact, object relational fact, process developed data warehouse

