SIMGraph: Simplifying Contig Graph to Improve de Novo Genome Assembly Using Next-generation Sequencing Data

Authors: Chien-Ju Li, Chun-Hui Yu, Chi-Chuan Hwang, Tsunglin Liu, Darby Tien-Hao Chang

Abstract:

De novo genome assemblies are invariably fragmented. Fragmentation is more severe with the popular next-generation sequencing (NGS) data because NGS reads are shorter than traditional Sanger reads. Because NGS throughput is high, fragmentation is usually not the result of missing data. Instead, the assembled sequences, called contigs, are often connected to more than one other contig in a complicated manner, which leads to fragmentation. False connections within this network of connections between contigs, called a contig graph, are inevitable because of repeats and sequencing or assembly errors. Simplifying a contig graph by removing false connections directly improves genome assembly. In this work, we developed a tool, SIMGraph, that resolves ambiguous connections between contigs using NGS data. Applying SIMGraph to the assemblies of a fungal genome and a fish genome, we resolved 27.6% and 60.3% of the ambiguous contig connections, respectively. These results can reduce the experimental effort needed to resolve contig connections.
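
To make the terms concrete, the following Python sketch builds a toy contig graph, flags contigs with more than one outgoing connection as ambiguous, and keeps a single connection where read evidence clearly favors it. This is not SIMGraph's algorithm, which the abstract does not describe. The function names, the read-pair support counts, and the thresholds `min_support` and `max_noise` are all hypothetical, and the example counts resolved contigs rather than resolved connections.

```python
def ambiguous_contigs(graph):
    """Return contigs that connect to more than one successor contig."""
    return sorted(c for c, succs in graph.items() if len(succs) > 1)


def simplify(graph, support, min_support=5, max_noise=1):
    """Keep one successor where read support clearly favors it.

    graph:   dict mapping contig -> set of successor contigs
    support: dict mapping (contig, successor) -> supporting read pairs
             (hypothetical evidence; thresholds are illustrative only)
    """
    simplified = {c: set(s) for c, s in graph.items()}
    resolved = []
    for contig in ambiguous_contigs(graph):
        # Rank candidate successors by how many read pairs support them.
        ranked = sorted(
            graph[contig],
            key=lambda s: support.get((contig, s), 0),
            reverse=True,
        )
        best, rest = ranked[0], ranked[1:]
        best_ok = support.get((contig, best), 0) >= min_support
        rest_weak = all(support.get((contig, s), 0) <= max_noise for s in rest)
        if best_ok and rest_weak:
            # Treat the weakly supported connections as false and drop them.
            simplified[contig] = {best}
            resolved.append(contig)
    return simplified, resolved


# Toy contig graph: c1 and c3 each have two possible successors.
graph = {
    "c1": {"c2", "c3"},
    "c2": {"c4"},
    "c3": {"c4", "c5"},
    "c4": set(),
    "c5": set(),
}
support = {
    ("c1", "c2"): 12, ("c1", "c3"): 0,
    ("c2", "c4"): 9,
    ("c3", "c4"): 6, ("c3", "c5"): 7,
}

simplified, resolved = simplify(graph, support)
fraction = len(resolved) / len(ambiguous_contigs(graph))
print(resolved, f"{fraction:.1%}")
```

In this toy input, c1 is resolved because only one of its connections has read support, while c3 stays ambiguous because both of its connections are well supported, as would happen at a repeat.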

Keywords: Contig graph, NGS, de novo assembly, scaffold.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1058153

