Genome-Wide SNP Discovery from de novo Assemblies of Pepper (Capsicum annuum) Transcriptomes
Hamid Ashrafi1, Jiqiang Yao2, Kevin Stoffel1, Sebastian R. Chin-Wo3, Theresa Hill1, Alexander Kozik3 and Allen Van Deynze1
1 Department of Plant Sciences, Seed Biotechnology Center, University of California, Davis, CA 95616
2 Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, Gainesville, FL 32610
3 Genome Center, University of California, Davis, CA 95616
Objectives
To capture as many transcribed genes as possible by sampling peppers from different cultivars and tissues at multiple stages of growth and development.
To discover putative SNPs among the three sampled pepper cultivars by sequencing their transcriptomes on the Illumina Genome Analyzer.
To annotate the transcriptome sequence in order to gain insight into pepper biological processes.
To use annotated genes for QTL analysis and candidate gene discovery.
Materials and Methods
Plant Materials and cDNA Library Preparation
Seeds of three pepper (C. annuum) lines, ‘CM334’, ‘Maor’ and ‘Early Jalapeño’, were planted.
Three cDNA libraries (one per pepper line) were prepared from pooled RNA extracted from four tissues (root, young leaf, flower and fruit) using the Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA, USA). Fruit tissues were collected at several developmental stages: developing fruit at 5, 10 and 20 days post pollination, breaker fruit and ripe fruit.
The libraries were constructed by shearing the cDNAs, and fragments of 300-350 bp were selected on gels.
The libraries were normalized using a double-stranded nuclease protocol.
The cDNA libraries were sequenced on an Illumina Genome Analyzer IIx (GAIIx) (Illumina Inc., San Diego, CA) for 80-120 cycles at the UC Davis Genome Center core facility.
De Novo Assembly of NGS Sequences
The NGS data (GAIIx) were run through our standard preprocessing pipeline, developed at UC Davis (Kozik, 2010).
Velvet (Zerbino and Birney, 2008) and CLC (CLCBIO, 2010) software packages were used to assemble the sequences.
CAP3 was used to merge the three per-cultivar assemblies into the final assembly.
[Fig. 1 flow diagram. For each cultivar (CM334, Early Jalapeño, Maor), two sets of trimmed reads were used: min 40 nt to max 85 nt, and min 25 nt to max 60 nt. Velvet Assembler: assemblies at Velvet K-mers 31, 35 and 41, with all K-mer assemblies assembled with CAP3. CLC Assembler: one iteration of CLC assembly with all reads. The Velvet and CLC results were combined into one assembly per cultivar with CAP3 (CM334, Early Jalapeño and Maor assemblies), and the three cultivar assemblies were combined into the pepper final assembly made with CAP3 (Reference Sequence).]
Results
Assembly Statistics
Assembly                               No. of Contigs   Total nt       N50
CM334                                  83,113           84,792,180     1,488
Early Jalapeño                         82,614           84,973,865     1,488
Maor                                   76,375           79,383,673     1,526
Pepper assembly (CM334, EJ and Maor)   123,261          135,019,787    1,647
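The N50 column can be read as the contig length at which the longest contigs, taken together, first cover half of the assembled bases. A minimal Python sketch of that standard definition follows; the function name and the toy lengths are illustrative only and are not taken from the poster's pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy example (hypothetical lengths, not the pepper contigs).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

Applied to the contig lengths of the pepper assembly, a function like this would be expected to reproduce the reported N50 of 1,647, although the poster does not state which tool produced the statistic.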
Annotation
A total of 63,202 contigs (51.3%), with an average length of 1,495 nucleotides, had at least one hit in the non-redundant database of GenBank.
Contigs with a hit covered 94.5 M bases (70%) of the total assembly.
The 60,055 contigs (48.7%) without any hit in GenBank were on average 674 nucleotides long and covered 40.5 M bases (30%) of the total assembly.
Across all BLASTX results, Vitis vinifera, Arabidopsis thaliana and Oryza sativa were the top three species in the BLAST hits (Fig 3).
The mapping step of Blast2GO identified 37,918 contigs (30.7%) with Gene Ontology (GO) terms.
Biological Processes (BP) at different GO levels were generated. Fig 4 shows the BP at level 2. The number of annotated sequences for each BP is shown in Fig 5.
KEGG maps for 150 biological pathways were generated and the contigs within each pathway were determined. For instance, Fig 6 depicts the KEGG map of the Pyrimidine Metabolism pathway.
SNP discovery
A total of 22,863 putative SNPs within 11,869 contigs were identified by our SNP discovery pipeline.
The contigs with identified putative SNPs comprised 23,794 kb (17.6%) of the pepper transcriptome assembly.
On average, 1 SNP was identified per 1,040 bp of exonic sequence in the SNP-containing contigs (23,794 kb across 22,863 SNPs).
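The SNP figures above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives ratios from numbers reported on the poster (SNP count, kb of SNP-containing contigs, total assembled bases, and the 3,500 Mb genome size); the variable names are illustrative.

```python
# Values reported on the poster.
snp_count = 22_863
snp_contig_bases = 23_794 * 1_000      # 23,794 kb of SNP-containing contigs
assembly_bases = 135_019_787           # total nt of the pepper assembly
genome_bases = 3_500 * 1_000_000       # 3,500 Mb pepper genome

# Average spacing of SNPs within the SNP-containing contigs.
bp_per_snp = snp_contig_bases / snp_count

# Share of the assembly made up of SNP-containing contigs.
snp_contig_share = snp_contig_bases / assembly_bases

# Share of the genome represented by the transcriptome assembly.
genome_share = assembly_bases / genome_bases

print(f"bp per SNP in SNP-containing contigs: {bp_per_snp:.0f}")
print(f"SNP-containing contigs / assembly:    {snp_contig_share:.1%}")
print(f"assembly / genome:                    {genome_share:.1%}")
```

Note that the density is relative to the 23,794 kb of SNP-containing contigs; dividing the whole 135 Mb assembly by the SNP count would give a much sparser figure (roughly one SNP per 5,900 bp).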
Conclusions
Assembling the transcriptomes of three pepper cultivars together increased the total assembled bases by 50%.
The present pepper transcriptome assembly represents ~4% of the pepper genome (3500 Mb).
We demonstrated that, for plants whose genome sequences are not yet available, transcriptome assembly is an alternative approach to SNP calling.
Annotation covered 51% of contigs but 70% of the total assembled bases, indicating that the ~49% of contigs without annotation are mostly small contigs that make up the remaining 30% of the sequence.
References
Conesa, A., S. Götz, et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21(18): 3674-3676.
Kozik, A. (2010). Tool to process and manipulate Illumina sequences. http://code.google.com/p/atgc-illumina/downloads/list
Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25(14): 1754-1760.
Li, H., B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
Zerbino, D. and E. Birney (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Res 18: 821-829.
Background and Significance
Molecular breeding of pepper (Capsicum spp.) has been hampered by the paucity of molecular markers.
This is primarily due to the lack of a pepper genome sequence and the limited sequence resources available.
In recent years, more cost-effective sequencing technologies such as Illumina have made sequencing of expressed genes (transcriptomes), gene discovery and allele mining feasible.
To exploit the speed and scale of data from new sequencing technologies, and to enrich the sequence resources of pepper, we sequenced the transcriptomes (RNA-seq) of three pepper lines: Maor, Early Jalapeño (EJ) and Criollo de Morelos 334 (CM334).
We selected a wide range of tissues to represent as many expressed genes as possible.
The reference sequence was constructed from >200 million Illumina reads (80-120 nt) using a combination of the Velvet, CLC and CAP3 software packages.
BWA (Li and Durbin, 2009), SAMtools (Li et al., 2009) and in-house Perl scripts were used to identify SNPs among the three pepper lines.
The SNPs were filtered to lie at least 100 bp from any putative intron-exon junction and from adjacent SNPs. After filtering, >22,000 high-quality putative SNPs were identified and bioinformatically mapped to pepper genetic maps.
The reference sequence was annotated with the Blast2GO software (Conesa et al., 2005).
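The 100 bp spacing rule for SNPs can be illustrated with a small Python sketch. The original filtering was done with in-house Perl scripts that are not shown on the poster, so everything here is an assumption made for illustration: the function names, the dictionary inputs, the choice to drop both members of a too-close SNP pair, and the use of "at least 100 bp" as the threshold.

```python
from bisect import bisect_left


def min_distance(sorted_positions, pos):
    """Distance from pos to the nearest value in sorted_positions (inf if empty)."""
    i = bisect_left(sorted_positions, pos)
    neighbours = sorted_positions[max(i - 1, 0):i + 1]
    return min((abs(pos - p) for p in neighbours), default=float("inf"))


def filter_snps(snps, junctions, min_gap=100):
    """Keep SNPs at least min_gap bp from junctions and from other SNPs.

    snps and junctions map a contig name to a list of positions.
    Assumption: when two SNPs are closer than min_gap, both are dropped.
    """
    kept = {}
    for contig, positions in snps.items():
        positions = sorted(positions)
        contig_junctions = sorted(junctions.get(contig, []))
        good = []
        for i, pos in enumerate(positions):
            left_ok = i == 0 or pos - positions[i - 1] >= min_gap
            right_ok = (i == len(positions) - 1
                        or positions[i + 1] - pos >= min_gap)
            far_from_junction = min_distance(contig_junctions, pos) >= min_gap
            if left_ok and right_ok and far_from_junction:
                good.append(pos)
        kept[contig] = good
    return kept


# Hypothetical positions on two made-up contigs.
example_snps = {"contig_1": [50, 120, 400, 900], "contig_2": [10]}
example_junctions = {"contig_1": [950]}
print(filter_snps(example_snps, example_junctions))
```

In this toy input the SNPs at 50 and 120 are removed because they are 70 bp apart, and the SNP at 900 is removed because it lies 50 bp from the inferred junction at 950. A spacing rule of this kind leaves clean flanking sequence on both sides of each retained SNP, which matters for assay design.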
Acknowledgments
The authors thank Enza Zaden, Nunhems, Rijk Zwaan, Syngenta, Vilmorin and the UC Discovery program for financial support. We also thank the sequencing facility of the UC Davis Genome Center and the Bioinformatics core facility for providing servers and computational power. The annotation would not have been possible without collaboration with Dr. R. Michelmore’s laboratory.
SNP Discovery Pipeline
BWA was used to map the reads of each of the three genotypes individually to the pepper final transcriptome assembly.
SAMtools was used to generate a pileup for each cultivar and to identify differences between each cultivar and the reference sequence.
Indels were screened out of the pileup files.
Intron-exon junction positions were inferred in the reference sequence based on Arabidopsis gene models, using the intron finder on the Solanaceae Genome Network (SGN) website.
In-house Perl scripts were used to create an allele call table for all three genotypes; the SNPs were then filtered against adjacent SNPs and the identified intronic regions.
Sequences surrounding the SNPs (100 bases on each side) were extracted from the reference sequence to design assays.
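The allele-call and flank-extraction steps above can be sketched in Python. This is not the authors' Perl code: the function names, the depth and allele-fraction thresholds, and the rule that a site counts as a SNP only when all three lines have a confident call are assumptions for illustration. The parsing follows the read-bases column of the SAMtools pileup format, where `.` and `,` match the reference, `^` plus one character marks a read start, `$` marks a read end, and `+N`/`-N` introduce indels of N bases.

```python
import re
from collections import Counter

_INDEL = re.compile(r"[+-](\d+)")


def count_alleles(ref_base, read_bases):
    """Count base calls in one pileup read-bases string, ignoring indels."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":                      # read start + mapping-quality char
            i += 2
        elif ch in "+-":                   # indel: skip the length and bases
            m = _INDEL.match(read_bases, i)
            i = m.end() + int(m.group(1))
        elif ch in ".,":                   # match to the reference base
            counts[ref_base.upper()] += 1
            i += 1
        elif ch.upper() in "ACGT":         # mismatch on either strand
            counts[ch.upper()] += 1
            i += 1
        else:                              # '$', '*', 'N' and other symbols
            i += 1
    return counts


def call_allele(counts, min_depth=4, min_fraction=0.9):
    """Return the major allele, or None if depth or support is too low.

    min_depth and min_fraction are hypothetical thresholds.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return None
    base, n = counts.most_common(1)[0]
    return base if n / depth >= min_fraction else None


def is_snp(calls):
    """True if every line has a call and the calls are not all identical."""
    return None not in calls and len(set(calls)) > 1


def flanks(seq, pos, size=100):
    """Return (left flank, SNP base, right flank) for a 1-based position."""
    i = pos - 1
    return seq[max(i - size, 0):i], seq[i], seq[i + 1:i + 1 + size]


# Hypothetical pileup strings for one site in the three lines.
calls = [call_allele(count_alleles("A", s))
         for s in ("......,,", "..,,..,.", "GGgGgggG")]
print(calls, is_snp(calls))
```

Requiring a near-unanimous call within each line is one plausible way to build an allele call table for largely homozygous lines; the actual criteria used by the in-house scripts are not given on the poster.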
Annotation of Reference Sequence
The Blast2GO program was used to annotate the reference sequence, obtain the statistics and generate KEGG maps (http://www.genome.jp/kegg/pathway.html).
Fig. 1 De novo assembly of pepper transcriptomes
Fig. 2 Distribution of contig length in pepper transcriptome assembly (x-axis: contig length; y-axis: frequency, 0 to 30,000; N50 = 1,647; mean = 1,095; max = 19,089; min = 265)
Fig. 3, Fig. 4, Fig. 5
Fig. 6 KEGG map of Pyrimidine Metabolism