Fruit breedomics workshop wp6 from marker assisted breeding to genomics assisted breeding michaela...
From marker-assisted breeding to genomics-assisted breeding?
Genotyping strategy
Genotyping: analysis of DNA-sequence variation
Molecular markers: represent variation (polymorphism) in DNA sequence between two homologous chromosomes
What for?
• Phylogenetic studies
• Varietal characterization
• Allele identification
• MAS/MAB
• Linkage analysis and QTL mapping
• ‘Linkage disequilibrium’ mapping
• Physical and genetic integration
Which strategy? Which marker type? (RFLP, AFLP, SSR, SNP)
How many markers? (100, 500, 2000?.. 1M?)
How many samples? (100, 1000, 2000?)
early ’80s: RFLP (Restriction Fragment Length Polymorphism)
early ’90s: RAPD (Random Amplified Polymorphic DNA)
early ’90s: SSR (Simple Sequence Repeats)
1993: AFLP (Amplified Fragment Length Polymorphism)
mid ’90s: SNP (Single Nucleotide Polymorphism)
Evolution in the type of molecular markers
Di-, tri-, tetranucleotide repeats
GAACGTACTCACACACACACACATTTGACTTCGATGATAGATAGATAGATAGATACGT
Microsatellites, also known as Simple Sequence Repeats (SSRs), are repeating sequences of 2-6 base pairs of DNA.
• The number of repeats varies
• Highly polymorphic and transferable between species
• High info content and widely applicable
• Distributed evenly throughout the genome
• Easy to detect by PCR but low throughput
Microsatellite markers
Single Nucleotide Polymorphisms (SNPs)
GTGGACGTGCTT[G/C]TCGATTTACCTAG
• A SNP is a single-base mutation in DNA
• The simplest and most common type of polymorphism
• Biallelic, hence less informative
• Highly abundant: one every 1000 bp along the human genome, one every 50 bp in apple/grapevine
• They are specific to the population in which they are developed
• There are two types of nucleotide base substitutions resulting in SNPs:
– Transition: substitution between purines (A, G) or between pyrimidines (C, T). Transitions constitute two thirds of all SNPs
– Transversion: substitution between a purine and a pyrimidine
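The transition/transversion rule above can be sketched in a few lines of Python. This is only an illustration; the function name `classify_snp` is hypothetical and not part of any tool mentioned in the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a biallelic SNP."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a valid SNP: {ref}/{alt}")
    # Transition: both alleles are purines, or both are pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# The [G/C] SNP shown on the slide is a purine/pyrimidine exchange.
print(classify_snp("G", "C"))
print(classify_snp("A", "G"))
```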
SNP markers
SNP distribution and type of variation
– SNP distribution is not uniform: 1/3 in coding, 2/3 in non-coding regions
– Synonymous mutation: DNA mutations that do not result in a change to the amino acid sequence of a protein
– Nonsynonymous mutation: results in a change in amino acid, which may be further (somewhat arbitrarily) classified as conservative (change to an amino acid with similar physicochemical properties), semi-conservative (e.g. negatively to positively charged amino acid), or radical (vastly different amino acid). Nonsynonymous mutations that change an amino acid codon to a stop codon are considered nonsense mutations
– About ½ of the SNPs in coding regions are nonsynonymous
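As a rough illustration of the synonymous/nonsynonymous distinction, the sketch below translates a codon before and after a single-base change using the standard genetic code. The function name and the example codons are assumptions chosen for illustration. The finer conservative/radical classes are not modelled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, alt_base: str) -> str:
    """Classify a SNP at `position` (0, 1 or 2) within a reference codon."""
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":  # amino acid codon changed to a stop codon
        return "nonsense"
    return "nonsynonymous"


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both encode Glu
print(classify_coding_snp("GAA", 2, "C"))  # GAA -> GAC, Glu -> Asp
print(classify_coding_snp("GAA", 0, "T"))  # GAA -> TAA, Glu -> stop
```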
No prior information on the polymorphisms:
• SSCP/DGGE/HRM
Prior information on the polymorphisms is required:
• SNP discovery: amplicon sequencing/EST databases/whole-genome sequence/resequencing
• SNP genotyping: low-throughput/high-throughput techniques
SNP discovery and genotyping: Genotyping by Sequencing
SNP genotyping techniques
dbSNP: the NCBI database of genetic variation
http://www.ncbi.nlm.nih.gov/SNP/
SNP identification in amplicons
• Sanger sequencing of a specific amplicon
• Sequence capture methods (SureSelect, Nimblegen…)
Genomic reduction with restriction enzymes + multiplex identifiers (barcodes) + Next-Generation Sequencing
OR
Whole-genome resequencing?
from marker-based mapping approach to genome-based high-throughput strategies
Reference genome available? Missing data... Sequencing errors...
RNA-seq
Sequence capture (whole exome…)
Genome-wide genetic marker discovery and genotyping using next-generation sequencing
http://genomevolution.org/wiki/index.php/Sequenced_plant_genomes
Sequenced plant genomes (tree up to date as of December 2014)
Reference genome?
Diploid vs Polyploid genomes?
Different applications = different depth of sequencing
Depth of sequencing = genotype calling accuracy
Pooling vs non-pooling
WGG, GBS or SNP array
SNP calling pipeline
Read alignment to the reference
Alignment tools:
• BWA http://bio-bwa.sourceforge.net/
• BFAST http://bfast.sourceforge.net/
• Tophat (for RNAseq) http://tophat.cbcb.umd.edu/
Read cleaning, quality filtering, demultiplexing
1. Index the reference (genome) sequence
2. Perform the alignment. Mapping parameters:
• Number of mismatches per read
• Scores for mismatches or gaps
3. Output results in SAM/BAM format
4. Pre-processing of the alignments:
• No. of mismatches (how diverse is the reference?)
• PCR duplicates
• Uniqueness (to avoid paralogous loci and repeats)
• If paired-end sequences, insert size and read pair orientation
[Figure: sample reads aligned to the reference genome]
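The pre-processing criteria in step 4 can be sketched as a filter over SAM text records. This is a minimal pure-Python illustration, not the pipeline used in the talk. The thresholds `min_mapq` and `max_edit_distance` are assumed values, mapping quality is used here as a proxy for uniqueness, and the `NM` tag (edit distance) stands in for the number of mismatches.

```python
# Bit flags defined by the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_alignment(sam_line: str, min_mapq: int = 20,
                   max_edit_distance: int = 4) -> bool:
    """Decide whether one SAM alignment record passes the filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq, rnext = int(fields[1]), int(fields[4]), fields[6]

    # Drop unmapped reads, PCR duplicates and non-primary alignments.
    if flag & (UNMAPPED | DUPLICATE | SECONDARY | SUPPLEMENTARY):
        return False
    # Low mapping quality suggests a non-unique placement (repeats, paralogs).
    if mapq < min_mapq:
        return False
    # Too many differences from the reference (NM:i: optional tag).
    for tag in fields[11:]:
        if tag.startswith("NM:i:") and int(tag[5:]) > max_edit_distance:
            return False
    # Paired-end: require a proper pair with the mate on the same contig.
    if flag & PAIRED:
        if not (flag & PROPER_PAIR) or rnext != "=":
            return False
    return True


record = "\t".join(["r1", "99", "ctg1", "100", "60", "50M", "=", "300",
                    "250", "ACGT", "IIII", "NM:i:1"])
print(keep_alignment(record))
```

In practice the same criteria are usually applied with dedicated tools on BAM files; the sketch only makes the logic of each filter explicit.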
SNP calling: SAMtools
1. Convert SAM to BAM for sorting
2. Sort BAM for SNP calling
Variant calling software/algorithms
• samtools mpileup / bcftools http://samtools.sourceforge.net/
Includes a computation of genotype likelihoods (SAMtools) and SNP and genotype calling (BCFtools)
• GATK http://www.broadinstitute.org/gatk/
Package for aligned NGS data analysis, which includes a SNP and genotype caller (Unified Genotyper), SNP filtering (Variant Filtration) and SNP quality recalibration (Variant Recalibrator)
• Freebayes https://github.com/ekg/freebayes
Bayesian genetic variant detector (SNPs, indels, MNPs)
[Figure: SNP identification. Aligned reads and haplotypes, distinguishing true polymorphisms from sequencing errors]
SNP genotyping techniques
• Over 100 different approaches
• Ideal SNP genotyping platform:
– high-throughput capacity
– multiplexed: multiple markers tested simultaneously
– simple assay design
– robust
– affordable price
– automated genotype calling
– accurate and reliable results
– quick and dirty DNA extraction protocol
• Major limitations:
– development costs high
– very specific -> poorly transferable to distinct germplasm
– individual markers have low information content
– expensive if only a few genomic segments or just a few individuals have to be analysed
SNP genotyping techniques
Detection of the allelic discrimination:
– light emitted by the products
– mass
– change in the electrical property
Gel-based electrophoresis
1x
Low cost-low throughput vs high cost-high throughput
Fluorescence-based PCR (real-time PCR)
1x
High Resolution Melting
SSCP: non-denaturing gel
CAPS/dCAPS: agarose gel
[Gel images: P1, P2, progeny]
Kompetitive Allele Specific PCR (KASP)
The method is based on detecting small differences in PCR melting (dissociation) curves. It is enabled by improved dsDNA-binding dyes used in conjunction with real-time PCR instrumentation that has precise temperature ramp control and advanced data capture capabilities.
High Resolution Melting
+ low cost
+ no prior information on the SNP needed
+ does not need high-quality DNA (low-cost DNA extraction may be OK…)
+ easy to perform and to automate
- interpretation of the data (optimization is mandatory!)
- no multiplexing
1x
Fluorescence-based capillary electrophoresis
[Figure: capillary electrophoresis profiles of the parents (P1, P2) and six individuals (Ind1 to Ind6)]
SSCP capillary electrophoresis
Multiplex Minisequencing
SNPlex Genotyping System
2x
7x
48x
MassARRAY® SNP Genotyping SEQUENOM
36x
TaqMan® OpenArray® technology
16-256x
TaqMan OpenArray… problems with poor-quality or too highly concentrated DNA…
Illumina GoldenGate® Assay: 384-1536x
Illumina Infinium® Assay: 1536-1Mx
Affymetrix Axiom®
1.5k-2Mx
SNP genotyping arrays in agriculture
Grape: Myles et al. (2010) PLoS ONE 5(1), e8219
Potato: Felcher et al. (2012) PLoS ONE 7(4), e36347
Corn: Ganal et al. (2011) PLoS ONE 6(12), e28334
Peach: Verde et al. (2012) PLoS ONE 7(4), e35668
Apple: Chagné et al. (2012) PLoS ONE 7(2), e31745
Cherry: Peace et al. (2012) PLoS ONE 7(12), e48305
Tomato: Sim et al. (2012) PLoS ONE 7(7), e40563
Soybean: Song et al. (2013) PLoS ONE 8(1), e54985
Pine: Pavy et al. Mol Ecol Resour 13(2): 324-335
…
9K SNP array in apple and peach
Apple (from Chagné et al. 2012, PLoS ONE): 27 apple varieties
Peach (from Verde et al. 2012, PLoS ONE): 54 peach breeding accessions
Development of the apple Illumina Infinium 20K array
from Bianco et al. 2014. PLoS ONE
• Apple major founders and parents of mapping populations
• Doubled haploids
• Pooling (read groups)/30x coverage
• Uniformly distributed (1 cM/400 Kbp)
• Haploblocks within 2040 FPs (focal points)
• 6-8K SNPs/progeny – 3 working days!!!
                        Mapped         Mono    Failed   Total
Newly identified SNPs   12,611 (86%)   1,091   1,012    14,714
SNPs from 8K array      3,160 (96%)    26      119      3,305
Total                   15,771 (88%)   1,117   1,131    18,019
from Bianco et al. 2014. PLoS ONE
High-density linkage mapping: consensus map with 16K SNP
In apple LD decays quite fast (20-55K)
20K Chip Sequencing panel limited to apple major founders
We decided to design a much denser (487K Apple SNPs) chip with Affymetrix covering more diversity in European Apple
From Illumina Infinium 20K to Axiom Apple480K
Cultivar selection
Un-weighted Neighbor Joining tree representing the selected cvrs
14 cvrs from the 20K Illumina array (in red)
+ 45 newly resequenced cvrs
+ 8 additional cvrs (not shown), including 4 Russian cvrs
Total: 67 resequenced cvrs (Illumina HiSeq, 30X avg coverage)
[courtesy of Charles-Eric Durel]
SNP discovery panel
Accession name              Tot read pairs  Accession name        Tot read pairs
Budimka                     210,405,926     Priscilla             117,058,356
Aivaniya                    119,632,134     Worcester Pearmain    96,410,779
Fyriki                      81,179,122      X9273                 193,831,694
Keswick Codlin              69,130,834      X9748                 184,325,907
Rosa                        70,443,859      Kmenotvorná           44,450,139
Filippa                     90,247,516      Košíkové              86,098,710
Åkerö                       91,666,140      Malinové holovouské   81,050,254
Godelieve Hegmans           98,149,222      Jantarnoe             90,682,815
Court-Pendu Henry           99,053,865      Chodské               62,687,249
Belle et Bonne              113,928,709     Panenské české        76,470,501
Reinette Dubois             102,336,676     Patte de Loup         110,531,953
Sonderskow                  74,020,116      Precoce de Karage     110,480,537
Ijunskoe ranee              123,440,174     Ajmi                  82,950,154
Hetlina                     88,568,362      De L’estre            110,615,195
Ovčí hubička                114,799,463     Shiemer               128,123,768
Renetta Grigia di Torriana  64,416,556      Cabarette             83,997,287
Durello di Forli            46,088,993      Douce Rayotte         20,841,799
Borowitsky                  96,122,310      Amadou                96,043,319
Alfred Jolibois             92,947,232      Busiard               84,486,599
Reinette Clochard           111,511,186     Mela Rozza            79,694,154
President Roulin            106,767,106     Gelata                83,186,975
Pepino Jaune                144,185,284     Annurca               86,930,085
Golden Delicious            80,333,013      Spässerud             84,123,712
Antonovka                   122,361,758     Young America         96,202,193
Braeburn                    113,833,008     Abbondanza            90,321,460
Cox’s Orange Pippin         92,500,881      Mela Rosa             95,875,984
Delicious                   130,861,887     Papirovka             90,957,520
Dr. Oldenburg               125,373,896     Skry                  78,536,012
F22682922                   103,188,147     Ag Alma               127,334,651
Fuji                        109,365,440     Aport Kuba            71,297,068
Jonathan                    128,267,363     Kronprins             62,159,897
Lady Williams               108,963,567     Antonovka Pamtorutka  39,392,157
Macoun                      88,835,758      Heta                  44,608,537
McIntosh                    126,063,436     Maikki                38,025,283
Malus micromalus            106,789,660
63 of 67 re-seq apple accessions:
• 13 from previous array (in red); M. micromalus removed
• 50 new re-seq; Kmenotvorná and Shiemer removed because of library contamination, Douce Rayotte removed because of very low coverage
+ 2 doubled haploids
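To see why an accession with few read pairs ends up with very low coverage, the read-pair totals can be converted into an approximate sequencing depth. The sketch below is only a back-of-the-envelope estimate: the read length (100 bp) and genome size (750 Mb) are assumptions that are not stated in the slides, and the 10x threshold is a hypothetical cut-off.

```python
def estimated_coverage(read_pairs: int, read_length: int = 100,
                       genome_size: int = 750_000_000) -> float:
    """Approximate depth = sequenced bases / genome size.

    read_length and genome_size are ASSUMED values, not from the slides.
    """
    return read_pairs * 2 * read_length / genome_size


# Read-pair totals taken from the SNP discovery panel table.
panel = {
    "Golden Delicious": 80_333_013,
    "Budimka": 210_405_926,
    "Douce Rayotte": 20_841_799,
}

LOW_COVERAGE = 10.0  # hypothetical threshold for flagging an accession
for name, pairs in panel.items():
    depth = estimated_coverage(pairs)
    status = "LOW" if depth < LOW_COVERAGE else "ok"
    print(f"{name}: ~{depth:.1f}x ({status})")
```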
The Chip Design Pipeline
Alignment and SNP calling
Alignment on 94,079 contigs
Uniqueness at contig level
Paired reads kept only if same contig and opposite in strand (no insert size selection)
SNP calling in pool (read groups)
BCFtools default parameters
SNP filtering
“Minimal” set of quality filters to avoid removing “true” SNPs in probes
Removed potential sequencing errors in Golden (or very rare alleles of Golden)
Tot no. of contigs targeted 73,099
Affymetrix specific filtering
Tot no. of contigs targeted after Affy filters: 64,319 (68% of tot contigs/611Mb)
Focal points and tagging
Focal point strategy:
In GENIC regions (29,314 contigs/70K gene predictions)
[FEM and INRA predictions]
In NON GENIC regions (35,005 contigs)
Additional Focal points to cover genome every 10Kb
Tagging:
In each focal point, group together SNPs with LD r2 >= 0.85
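The tagging step can be illustrated with a simple greedy grouping. This is a sketch under stated assumptions, not the chip design pipeline itself: r2 is approximated here by the squared Pearson correlation between genotype vectors coded as 0/1/2 copies of one allele, and the SNP names and genotypes are made up.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic SNP: r2 is undefined, use 0
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def tag_snps(genotypes, threshold=0.85):
    """Greedy grouping: each unassigned SNP becomes a tag and absorbs every
    later unassigned SNP whose r2 with the tag is >= threshold."""
    names = list(genotypes)
    assigned = set()
    groups = {}
    for i, tag in enumerate(names):
        if tag in assigned:
            continue
        groups[tag] = [tag]
        assigned.add(tag)
        for other in names[i + 1:]:
            if other in assigned:
                continue
            if r_squared(genotypes[tag], genotypes[other]) >= threshold:
                groups[tag].append(other)
                assigned.add(other)
    return groups


# Hypothetical genotypes for four SNPs in one focal point, six accessions.
focal_point = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [2, 1, 0, 2, 1, 0],
    "snp4": [0, 0, 1, 2, 2, 1],
}
print(tag_snps(focal_point))
```

One SNP per group would then be enough to represent the others on the array.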
SNP Selection
1. 20,549 contigs/54,735 gene pred
2. 3,197 contigs/4,910 gene pred
3. 15,237 contigs
4. 15,691 contigs
5. 4,690 contigs
Tot. 39,365 unique contigs covering 553Mb
24,954 untargeted contigs covering 58Mb (avg size 2.3Kb)
The Apple480K SNPs
Tot. 40,192 contigs 562Mb
39,365 contigs 553Mb
5,809 contigs 126Mb
SNP distribution across LGs
Elshire et al. (2011) PLoS ONE 6:e19379
Myles et al. (2013) Trends in Genetics 29:190
Genotyping by Sequencing: enormous promise and significant challenges…
From Davey et al. (2011) Nat Rev Genet 12
GBS
Read depth   Sample 1   Sample 2   Sample 3
Allele C     19         13         0
Allele T     0          12         3
Genotype     CC         CT         ?
GBS: Genotype calling
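The read-depth table above shows why low depth leaves a genotype undetermined. A minimal likelihood sketch makes this concrete. It assumes a simple binomial model with a flat prior over the three genotypes; the per-base error rate (0.01) and the posterior threshold (0.99) are assumed values, and this is not the model of any specific caller named in the slides.

```python
def call_genotype(n_ref: int, n_alt: int, ref: str = "C", alt: str = "T",
                  error_rate: float = 0.01,
                  min_posterior: float = 0.99) -> str:
    """Call a diploid genotype from allele read counts, or '?' if unsure."""
    likelihoods = {
        # Homozygous reference: alt reads can only be sequencing errors.
        ref + ref: (1 - error_rate) ** n_ref * error_rate ** n_alt,
        # Heterozygous: each read shows either allele with probability 1/2.
        ref + alt: 0.5 ** (n_ref + n_alt),
        # Homozygous alternative: ref reads can only be sequencing errors.
        alt + alt: error_rate ** n_ref * (1 - error_rate) ** n_alt,
    }
    total = sum(likelihoods.values())
    best = max(likelihoods, key=likelihoods.get)
    # Flat prior, so the posterior is the normalised likelihood.
    return best if likelihoods[best] / total >= min_posterior else "?"


# Allele counts (C, T) for the three samples in the table.
for sample, (c_reads, t_reads) in {"Sample 1": (19, 0),
                                   "Sample 2": (13, 12),
                                   "Sample 3": (0, 3)}.items():
    print(sample, call_genotype(c_reads, t_reads))
```

With only three T reads, a heterozygote that happened to show only its T allele (probability 1/8 under this model) cannot be excluded, which is the under-calling of heterozygosity discussed on the following slides.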
Average bps: 19,123,102
STD bps: 3,182,282
MIN bps: 7,951,958
MAX bps: 24,882,057
GBS: Genome coverage 3-5% of the genome!
for only 1-5% of the SNPs identified
UNDERCALL HETEROZYGOSITY
Call dominant markers?
from Myles, 2013. Trends in Genetics 29:190
• Sequence is too divergent from reference
• SNPs at the restriction site
Allelic dropout
273,835 SNPs
3967 SNPs
SNP calling with GATK
GBS in Golden Delicious x Scarlet
• Two-enzyme combination: HindIII-MspI
• 1 lane Illumina, single-end reads
• Read alignment on ref genome (4% of mismatches)
• GBS in 96 samples
30,393 SNPs
x
Kept SNPs with < 20% missing genotype data (uneven coverage across samples)
Kept SNPs with read depth (rd) > 6 for the genotype call in at least 80% of samples (uneven depth of coverage)
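The two filters above can be sketched as a single per-SNP check. The data layout (one `(genotype, read_depth)` tuple per sample, with `None` for a missing genotype) and the function name are assumptions made for illustration.

```python
def keep_snp(calls, max_missing=0.20, min_depth=6, min_fraction=0.80):
    """Apply the two GBS filters to one SNP.

    calls: list of (genotype, read_depth) tuples, one per sample;
           genotype is None when no call could be made.
    """
    n_samples = len(calls)
    n_missing = sum(1 for genotype, _ in calls if genotype is None)
    n_deep = sum(1 for genotype, depth in calls
                 if genotype is not None and depth > min_depth)
    # Filter 1: fewer than 20% of samples with missing genotype data.
    # Filter 2: read depth > 6 in at least 80% of samples.
    return (n_missing / n_samples < max_missing
            and n_deep / n_samples >= min_fraction)


good = [("CT", 10)] * 9 + [(None, 0)]
shallow = [("CT", 10)] * 7 + [("CC", 3)] * 3
print(keep_snp(good), keep_snp(shallow))
```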
11%
1.4%
Golden x Scarlet genetic linkage map
1994 mapped GBS-SNPs (0.73% of the more than 270K initially discovered)
1272 cM (1 SNP/0.63 cM)
GBS and ‘QTL mapping’ for apple skin color
1 lane Illumina
94 progenies
from Myles et al. 2014. G3
Chi-square test for association
Case/control single marker analysis in PLINK
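For a single marker, a case/control chi-square test on allele counts reduces to a 2x2 contingency table. The sketch below shows that basic calculation. It is not PLINK's code, and the allele counts are invented for illustration; the p-value uses the closed form for 1 degree of freedom.

```python
from math import erfc, sqrt


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    Rows: cases / controls (e.g. red vs non-red skin).
    Columns: counts of allele A / allele B.
    """
    n = case_a + case_b + control_a + control_b
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / denominator
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts at one GBS-SNP.
chi2, p = allelic_chi_square(case_a=30, case_b=10, control_a=15, control_b=25)
print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

Repeating the test for every marker along a linkage group gives the kind of association profile shown in the QTL analysis slide.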
QTL analysis – LG 9
GenomeStudio
• Sample Sheet: comma-delimited text file (.csv file)
• Data Repository: directory that contains intensity files
• Manifest Repository: directory that contains SNP manifest. The SNP manifest contains the mapping between bead-type identifier and SNP, as well as all SNP annotations.
Sample sheet columns:
• Sample_ID: Sample identifier (used only for display in the table).
• SentrixBarcode_A: The barcode of the Universal Array Product that this sample was hybridized to for Manifest A.
• SentrixPosition_A: The position within the Universal Array Product this sample was hybridized to for Manifest A (and similarly for _B, _C, etc. depending on how many manifests are used with your project).
GenomeStudio: sample sheet
GenomeStudio – Ideal SNP
AB (P1) x AB (P2) -> ¼ AA + ½ AB + ¼ BB
[Cluster plot: AA, AB and BB clusters, with P1 and P2 marked]
Genotypes are called for each sample (dots) by their signal intensity (Norm R, y-axis) and allele frequency (Norm Theta, x-axis)
• Y-axis: intensity of signal
• X-axis: colour ratio, allele frequency
• GenCall Score: a quality metric that indicates the reliability of each genotype call. Genotypes with lower GenCall scores are located further from the center of a cluster and have a lower reliability.
• GenTrain score for a SNP: measure of the cluster quality for the SNP (distance between clusters, shape of the clusters)
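AB (P1) x Aᴓ (P2) -> ¼ AA + ¼ Aᴓ + ¼ AB + ¼ Bᴓ
[Cluster plot: AA, Aᴓ, AB and Bᴓ clusters, with P1 and P2 marked]
AB (P1) x Bᴓ (P2) -> ¼ AB + ¼ Aᴓ + ¼ BB + ¼ Bᴓ
[Cluster plot: Aᴓ, AB and BB+Bᴓ clusters, with P1 and P2 marked]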
• Segregation distortion
• Homozygous cluster with two subgroups with different intensity signal
• Genotype calls from re-sequencing may differ
• Genotype calls may be incorrect
Genotypes Aᴓ and Bᴓ = additional SNPs
Effect of additional polymorphisms
Critical SNPs at “probe site” → null allele
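The expected progeny classes for crosses with and without a null allele can be enumerated by combining one allele from each parent. In this sketch the null allele (ᴓ) is written as "0"; the function name and genotype encoding are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ALLELE_ORDER = "AB0"  # "0" stands for the null allele (ᴓ on the slides)


def expected_offspring(parent1: str, parent2: str) -> dict:
    """Expected genotype fractions in the progeny of two diploid parents.

    Each parent is a two-character genotype such as "AB" or "A0".
    """
    counts = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Write genotypes in a fixed allele order so that AB == BA.
        genotype = "".join(sorted(allele1 + allele2, key=ALLELE_ORDER.index))
        counts[genotype] += 1
    total = sum(counts.values())
    return {g: Fraction(c, total) for g, c in counts.items()}


print(expected_offspring("AB", "AB"))  # ideal SNP
print(expected_offspring("AB", "A0"))  # P2 carries a null allele
print(expected_offspring("AB", "B0"))
```

Because A0 and B0 individuals carry a single detectable allele, they fall near the homozygous clusters with a weaker signal, which is what produces the apparent segregation distortion listed above.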
• Segregation distortion
• Homozygous cluster with two distinct subgroups along x axis
• Genotype calls from re-sequencing differ
• Genotype calls may be incorrect
Genotypes BB’ and AB’ = additional SNPs (not so critical)
Effect of additional polymorphisms
Non-critical SNPs at the probe site → decreased probe-sequence “hybridization” efficiency
• A anneals better than B, or the opposite...
• Product A > B -> fluorescent label A > B -> cluster moves towards A
• Product B > A -> fluorescent label B > A -> cluster moves towards B
• Decrease in total fluorescent intensity
• AB genotypes -> lower intensity
AB’ (P1) x AB (P2) -> ¼ AA + ¼ AB + ¼ AB’ + ¼ BB’
[Cluster plot: AA, AB’, AB and BB’ clusters]
A wide, horizontally oriented cloud
Effect of additional polymorphisms in different genetic background...
many clusters....
Paralogous regions: assume 2 copies at different LGs
AA-BB x AB-BB -> ½ AABB + ½ ABBB
[Cluster plot: AA BB and AB BB clusters, with P1 and P2 marked]
Paralogs (example for 2 loci, one locus monomorphic, one polymorphic)
AB-BB (P1) x AB-BB (P2) -> ¼ AABB + ½ ABBB + ¼ BBBB
[Cluster plot: AA BB, AB BB and BB BB clusters, with P1 and P2 marked]
• No segregation distortion
• Clusters in a narrow range of the plot
• Genotype calls from re-seq may differ
AB-BB (P1) x AB-BB (P2) -> ¾ (AABB + ABBB) + ¼ BBBB
[Cluster plot: AA BB, AB BB and BB BB clusters, with P1 and P2 marked]
• Segregation distortion (3:1)
• AB may have sub-groups
• Position of the clusters along x axis not ideal
• Genotype calls from re-seq may differ
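Segregation distortion is usually checked with a chi-square goodness-of-fit test of the observed cluster counts against the expected Mendelian ratio. The sketch below is a generic illustration with invented counts; the critical values are the standard 5% thresholds of the chi-square distribution for 1 to 3 degrees of freedom.

```python
# 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def segregation_chi_square(observed, expected_ratio):
    """Goodness-of-fit test of observed class counts against a ratio.

    Returns (chi2, distorted), where distorted is True when the fit is
    rejected at the 5% level.
    """
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    chi2 = 0.0
    for count, ratio in zip(observed, expected_ratio):
        expected = n * ratio / ratio_total
        chi2 += (count - expected) ** 2 / expected
    degrees_of_freedom = len(observed) - 1
    return chi2, chi2 > CRITICAL_0_05[degrees_of_freedom]


# Hypothetical progeny counts scored as AA, AB, BB in an AB x AB cross.
print(segregation_chi_square([25, 50, 25], (1, 2, 1)))
print(segregation_chi_square([40, 45, 15], (1, 2, 1)))
```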
Dedicated software required!!! Gidskehaug et al. (2011) Bioinformatics 27:303
paralogy…
Robust SNP scoring on:
• F1 population
• F2 population
• Whole germplasm
Handles problematic clusters due to paralogy, additional SNPs (null alleles and preferential hybridization)
Advantages:
• Ease of use and install
• Standard Illumina input
• Point and Click graphical interface
• Adjustable parameters
• Different outputs for common genetic analysis software (JoinMap, PLINK, Structure, and HapMap)
[Cluster labels: AA, Ab, AB, Bb]
ASSIsT: Automatic SNP ScorIng Tool
Di Guardo, Micheletti et al. in preparation
available at http://compbiotoolbox.fmach.it/assist/
[Figure: cluster plot for SNP_FB_0399644 (x-axis: Norm Theta, y-axis: Norm R). Marker Classification: AB_2-subclusters]