

Genome-Wide SNP Discovery from de novo Assemblies of Pepper (Capsicum annuum) Transcriptomes

Hamid Ashrafi1, Jiqiang Yao2, Kevin Stoffel1, Sebastian R. Chin-Wo3, Theresa Hill1, Alexander Kozik3 and Allen Van Deynze1

1 Department of Plant Sciences, Seed Biotechnology Center, University of California, Davis, CA 95616
2 Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, Gainesville, FL 32610
3 Genome Center, University of California, Davis, CA 95616

Objectives

To obtain as many transcribed genes as possible by sampling peppers from different cultivars and tissues at multiple stages of growth and development.

To discover putative SNPs among the three sampled pepper cultivars by sequencing their transcriptomes with the Illumina Genome Analyzer.

To annotate the transcriptome sequence in order to gain insight into pepper biological processes.

To use the annotated genes for QTL analysis and candidate gene discovery.

Materials and Methods

Plant Materials and cDNA Library Preparation

Seeds of three pepper (C. annuum) lines, ‘CM334’, ‘Maor’ and ‘Early Jalapeño’, were planted.

Three cDNA libraries (one from each pepper variety) were prepared from pooled RNA extracted from four tissues (root, young leaf, flower and fruit) using the Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA, USA). Fruit tissues were collected at different developmental stages: developing fruit at 5, 10 and 20 days post pollination, breaker fruit and ripe fruit.

The libraries were constructed by shearing cDNAs, and 300-350 bp fragments were selected on gels.

The libraries were normalized using a double-stranded nuclease protocol.

The cDNA libraries were sequenced on an Illumina Genome Analyzer IIx (GAIIx) (Illumina Inc., San Diego, CA) for 80-120 cycles at the UC Davis Genome Center core facility.

De Novo Assembly of NGS Sequences

The NGS data (GAIIx) were processed with our standard preprocessing pipeline, developed at UC Davis (Kozik, 2010).

The Velvet (Zerbino and Birney, 2008) and CLC (CLCBIO, 2010) software packages were used to assemble the sequences.

CAP3 was used to merge the three cultivar assemblies into the final assembly.

Assembly workflow (Fig. 1), applied to each cultivar (CM334, Early Jalapeño and Maor):

Trimmed reads were prepared in two sets: min 40 nt to max 85 nt, and min 25 nt to max 60 nt.

Velvet Assembler: each trimmed read set was assembled with Velvet K-mers 31, 35 and 41, and all K-mer assemblies were assembled with CAP3.

CLC Assembler: one iteration of CLC assembly with all reads.

The Velvet and CLC assemblies of each cultivar were combined with CAP3 (CM334 assembly, Early Jalapeño assembly and Maor assembly).

The three cultivar assemblies were combined with CAP3 into the Pepper final assembly (Reference Sequence).
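The interplay between the Velvet K-mer sizes and the trimmed read lengths can be illustrated with a short sketch. It relies only on the general property that a read of length L contains L - k + 1 k-mers (none if L < k); the function names are hypothetical and this is not the preprocessing or assembly code used for the poster.

```python
# Illustrative sketch: how many k-mers a trimmed read can contribute to a
# de Bruijn graph assembler at the k values named in the workflow.
# Function names are hypothetical, not part of Velvet.

KMER_SIZES = (31, 35, 41)         # Velvet K-mers listed in the workflow
READ_LENGTHS = (25, 40, 60, 85)   # bounds of the two trimmed read sets (nt)


def kmer_count(read_length, k):
    """Number of k-mers in a read of the given length (0 if the read is shorter than k)."""
    return max(0, read_length - k + 1)


def kmers(read, k):
    """List every k-mer of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# k -> {read length -> number of k-mers}
table = {k: {n: kmer_count(n, k) for n in READ_LENGTHS} for k in KMER_SIZES}

if __name__ == "__main__":
    for k in KMER_SIZES:
        print(k, table[k])
```

Under this property, a 25 nt read contributes no k-mers at any of the three k values, whereas an 85 nt read contributes 45 k-mers even at k = 41.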

Results

Assembly Statistics

Assembly                               No. of Contigs   Total nt      N50
CM334                                  83,113           84,792,180    1,488
Early Jalapeño                         82,614           84,973,865    1,488
Maor                                   76,375           79,383,673    1,526
Pepper assembly (CM334, EJ and Maor)   123,261          135,019,787   1,647
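For readers unfamiliar with the N50 column, the following is a minimal sketch of the usual definition: the contig length at which contigs of that length or longer contain at least half of all assembled bases. The function name and the toy contig lengths are hypothetical; the poster does not state which tool computed its N50 values.

```python
# Minimal N50 sketch under the common definition; not the authors' script.

def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical contig lengths
    print(n50(toy_lengths))
```

In the toy example the total is 54 bases, and the three longest contigs (10 + 9 + 8 = 27) already reach half of it, so the N50 is 8.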

Annotation

A total of 63,202 contigs (51.3%), with an average length of 1,495 nucleotides, had at least one hit in the non-redundant database of GenBank.

Contigs with a hit covered 94.5 M bases (70%) of the total assembly.

A total of 60,055 contigs (48.7%) had no hit in GenBank; they were on average 674 nucleotides long and covered 40.5 M bases (30%) of the total assembly.
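The base totals above follow from the contig counts and average lengths. The sketch below redoes that arithmetic using only numbers reported on the poster; the variable and function names are hypothetical, and the products are approximate because the average lengths are rounded.

```python
# Consistency check of the annotation figures: contigs x average length ~ bases covered.
# All numbers are taken from the poster; names are hypothetical.

with_hit = {"contigs": 63_202, "mean_len": 1_495}
no_hit = {"contigs": 60_055, "mean_len": 674}


def approx_bases(group):
    """Approximate bases covered by a group of contigs (count x rounded mean length)."""
    return group["contigs"] * group["mean_len"]


hit_bases = approx_bases(with_hit)      # about 94.5 M bases
no_hit_bases = approx_bases(no_hit)     # about 40.5 M bases
hit_share = hit_bases / (hit_bases + no_hit_bases)

if __name__ == "__main__":
    print(f"{hit_bases / 1e6:.1f} M, {no_hit_bases / 1e6:.1f} M, {hit_share:.0%} with a hit")
```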

Based on all BLASTX results, Vitis vinifera, Arabidopsis thaliana and Oryza sativa were the top three species in the BLAST hits (Fig. 3).

The mapping step of Blast2GO identified 37,918 contigs (30.7%) with Gene Ontology (GO) terms.

Biological Processes (BP) at different GO levels were generated. Fig. 4 shows the BP at level 2. The number of annotated sequences for each BP is shown in Fig. 5.

KEGG maps for 150 biological pathways were generated, and the contigs within each pathway were determined. For instance, Fig. 6 depicts the KEGG map of the Pyrimidine Metabolism pathway.

SNP Discovery

A total of 22,863 putative SNPs within 11,869 contigs were identified by our SNP discovery pipeline.

The contigs with identified putative SNPs comprised 23,794 kb (17.6%) of the pepper transcriptome assembly.

On average, 1 SNP was identified per 1,040 bp of exonic regions of the pepper genome.
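The SNP density can be reproduced from the reported totals. The sketch below assumes the density was obtained by dividing the bases in SNP-containing contigs by the number of SNPs, which is the reading that matches the reported figure; the variable names are hypothetical.

```python
# Reproducing the SNP density and the share of the assembly in SNP-containing contigs.
# Numbers are from the poster; the division used for the density is an assumption.

SNP_COUNT = 22_863
SNP_CONTIG_KB = 23_794        # total length of contigs with putative SNPs
ASSEMBLY_NT = 135_019_787     # total length of the pepper assembly

snp_contig_nt = SNP_CONTIG_KB * 1_000
bp_per_snp = snp_contig_nt / SNP_COUNT           # bases per SNP within SNP-containing contigs
share_of_assembly = snp_contig_nt / ASSEMBLY_NT  # fraction of the assembly in those contigs

if __name__ == "__main__":
    print(f"1 SNP per {bp_per_snp:.0f} bp; {share_of_assembly:.1%} of the assembly")
```

Both values agree with the figures above to within rounding, which suggests the density refers to the SNP-containing contigs rather than to the whole assembly.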

Conclusions

Assembling the transcriptomes of three pepper cultivars increased the total assembled bases by 50%.

The present pepper transcriptome assembly represents ~4% of the pepper genome (3500 Mb).

We demonstrated that, for plants whose genome sequences are not yet available, transcriptome assembly is an alternative approach to SNP calling.

Annotation of 51% of the contigs, or 70% of the total assembled bases, indicates that the remaining ~49% of contigs are small contigs that cover the remaining 30% of the sequence, which is unannotated.
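The genome fraction is a one-line calculation from the assembly size and the 3500 Mb genome size quoted above; a small sketch with hypothetical variable names:

```python
# Fraction of the pepper genome represented by the transcriptome assembly.
# Both inputs are figures reported on the poster.

ASSEMBLY_NT = 135_019_787   # pepper final assembly, total nt
GENOME_MB = 3_500           # pepper genome size used on the poster

genome_fraction = ASSEMBLY_NT / (GENOME_MB * 1_000_000)

if __name__ == "__main__":
    print(f"{genome_fraction:.1%} of the genome")
```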

References

Conesa, A., S. Götz, et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21(18): 3674-3676.

Kozik, A. (2010). Tool to process and manipulate Illumina sequences. http://code.google.com/p/atgc-illumina/downloads/list

Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25(14): 1754-1760.

Li, H., B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.

Zerbino, D. and E. Birney (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Res 18: 821-829.

Background and Significance

Molecular breeding of pepper (Capsicum spp.) has been hampered by the paucity of molecular markers.

This is primarily due to the lack of a pepper genome sequence and the limited sequence resources available.

In recent years, with more cost-effective sequencing technologies such as Illumina, sequencing of expressed genes (transcriptomes), gene discovery and allele mining are no longer insurmountable.

To exploit the speed and scale of data from new sequencing technologies, and in an effort to enrich the sequence resources of pepper, we sequenced the transcriptomes (RNA-seq) of three pepper lines: Maor, Early Jalapeño (EJ) and Criollo de Morelos 334 (CM334).

We selected a wide range of tissues to represent as many expressed genes as possible.

The reference sequence was constructed from >200 million Illumina reads (80-120 nt) using a combination of the Velvet, CLC and CAP3 software packages.

BWA (Li and Durbin, 2009), SAMtools (Li et al., 2009) and in-house Perl scripts were used to identify SNPs among the three pepper lines.

The SNPs were filtered to be 100 bp away from any putative intron-exon junction and from adjacent SNPs. After filtering, >22,000 high-quality putative SNPs were identified and bioinformatically mapped to pepper genetic maps.

The reference sequence was annotated with the Blast2GO software (Conesa et al., 2005).

Acknowledgments

The authors would like to thank Enza Zaden, Nunhems, Rijk Zwaan, Syngenta, Vilmorin and the UC Discovery program for financial support. We also thank the sequencing facility of the UC Davis Genome Center and the Bioinformatics core facility for providing servers and computational power. The annotation would not have been possible without collaboration with Dr. R. Michelmore’s laboratory.

SNP Discovery Pipeline

BWA was used to map the reads of each of the three genotypes individually to the Pepper final transcriptome assembly.

SAMtools was used to make a pileup for each cultivar and to discover the differences between each cultivar and the reference sequence.

Indels were screened out of pileup files.
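The indel screening step could look roughly like the sketch below. It assumes the legacy SAMtools consensus pileup layout, in which indel records carry an asterisk in the third (reference base) column; the function names and the example records are hypothetical and are not taken from the authors' pileup files or Perl scripts.

```python
# Sketch of screening indel records out of a tab-separated pileup.
# Assumption: indel records are marked by "*" in the third column, as in the
# legacy SAMtools consensus pileup format. Example records are made up.

def is_indel_record(line):
    """True if the reference-base column (third field) marks an indel record."""
    fields = line.rstrip("\n").split("\t")
    return len(fields) > 2 and fields[2] == "*"


def screen_indels(lines):
    """Keep non-empty pileup records that are not indel records."""
    return [line for line in lines if line.strip() and not is_indel_record(line)]


# Hypothetical records: contig, position, reference base, then further columns.
EXAMPLE_PILEUP = [
    "contig_1\t120\tA\tG\t45\t45\t60\t12\tGGGGGGGGGGGG\tIIIIIIIIIIII",
    "contig_1\t250\t*\t*/+A\t30\t30\t60\t10\t*\t+A\t8\t2\t0\t0\t0",
    "contig_2\t77\tC\tT\t50\t50\t60\t9\tTTTTTTTTT\tIIIIIIIII",
]

if __name__ == "__main__":
    for record in screen_indels(EXAMPLE_PILEUP):
        print(record.split("\t")[:4])
```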

Intron-exon junction positions in the reference sequence were inferred from Arabidopsis gene models using the intron finder of the Solanaceae Genomics Network (SGN) website.

In-house Perl scripts were used to create an allele call table for all three genotypes; the SNPs were then filtered against adjacent SNPs and the identified intronic regions.
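The allele call table and the distance filter can be sketched as follows. This is written in Python rather than Perl and is not the authors' script: the data layout, the function names, the use of None for "no difference reported", the example calls and the choice to reject sites strictly closer than 100 bp are all assumptions made for illustration.

```python
# Sketch: build an allele call table across cultivars, then drop SNPs that lie
# within 100 bp of another SNP or of an inferred intron-exon junction.
# Data structures and names are hypothetical.

MIN_DISTANCE = 100  # bp; sites closer than this to a neighbour are rejected (assumed strict <)


def build_allele_table(calls_by_cultivar):
    """Map each (contig, position) site to {cultivar: allele or None if no call was reported}."""
    sites = sorted({site for calls in calls_by_cultivar.values() for site in calls})
    return {
        site: {cultivar: calls.get(site) for cultivar, calls in calls_by_cultivar.items()}
        for site in sites
    }


def filter_sites(sites, junctions, min_distance=MIN_DISTANCE):
    """Keep sites at least min_distance bp from adjacent SNPs and from junctions on the same contig."""
    by_contig = {}
    for contig, pos in sorted(set(sites)):
        by_contig.setdefault(contig, []).append(pos)
    kept = []
    for contig, positions in by_contig.items():
        for i, pos in enumerate(positions):
            # positions are sorted, so only the immediate neighbours need checking
            neighbours = positions[max(0, i - 1):i] + positions[i + 1:i + 2]
            near_snp = any(abs(pos - other) < min_distance for other in neighbours)
            near_junction = any(
                abs(pos - junction) < min_distance for junction in junctions.get(contig, [])
            )
            if not near_snp and not near_junction:
                kept.append((contig, pos))
    return kept


# Hypothetical calls: cultivar -> {(contig, 1-based position): allele}
EXAMPLE_CALLS = {
    "CM334": {("c1", 150): "A", ("c1", 180): "T", ("c2", 500): "G"},
    "Maor": {("c1", 150): "A", ("c1", 900): "C"},
    "Early Jalapeño": {("c2", 500): "G", ("c2", 130): "T"},
}
EXAMPLE_JUNCTIONS = {"c2": [200]}  # hypothetical inferred junction positions

allele_table = build_allele_table(EXAMPLE_CALLS)
filtered = filter_sites(allele_table.keys(), EXAMPLE_JUNCTIONS)

if __name__ == "__main__":
    for site in filtered:
        print(site, allele_table[site])
```

In the toy data, the two SNPs on c1 at positions 150 and 180 eliminate each other, and the SNP at c2:130 is rejected for lying 70 bp from the junction at 200, so only c1:900 and c2:500 remain.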

Sequences surrounding the SNPs (100 bases on each side) were extracted from the reference sequence to design assays.
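Extracting the flanking sequence amounts to slicing the reference contig around the SNP position. The sketch below assumes 1-based SNP positions and truncates the flanks at contig ends; the function name and the toy contig are hypothetical, and how the authors handled SNPs near contig ends is not stated.

```python
# Sketch: extract up to 100 bases on each side of a SNP from its reference contig.
# Assumes 1-based positions; flanks are truncated at contig ends (an assumption).

def extract_flanks(contig_sequence, position, flank=100):
    """Return (left flank, reference base at the SNP, right flank) for a 1-based position."""
    if not 1 <= position <= len(contig_sequence):
        raise ValueError("position lies outside the contig")
    index = position - 1
    left = contig_sequence[max(0, index - flank):index]
    right = contig_sequence[index + 1:index + 1 + flank]
    return left, contig_sequence[index], right


if __name__ == "__main__":
    toy_contig = "ACGT" * 75  # hypothetical 300 nt contig
    left, base, right = extract_flanks(toy_contig, 150)
    print(len(left), base, len(right))
```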

Annotation of Reference Sequence

The Blast2GO program was used to annotate the reference sequence, obtain the statistics and generate KEGG maps (http://www.genome.jp/kegg/pathway.html).

Fig. 3 Fig. 4 Fig. 5

Fig. 2 Distribution of contig length in pepper transcriptome assembly (x-axis: Contig Length; y-axis: Frequency, 0 to 30,000; N50 = 1,647; Mean = 1,095; Max = 19,089; Min = 265)

Fig. 6 KEGG map of Pyrimidine Metabolism

Fig. 1 De Novo assembly of pepper transcriptomes