Fruit breedomics workshop wp6 from marker assisted breeding to genomics assisted breeding michaela...

From marker-assisted breeding to genomics-assisted breeding?

Genotyping strategy

Genotyping: analysis of DNA-sequence variation

Molecular markers: represent variation (polymorphism) in DNA sequence between two homologous chromosomes

What for?
• Phylogenetic studies
• Varietal characterization
• Allele identification
• MAS/MAB
• Linkage analysis and QTL mapping
• ‘Linkage disequilibrium’ mapping
• Physical and genetic integration

Which strategy? Which marker type? (RFLP, AFLP, SSR, SNP)

How many markers? (100, 500, 2000?.. 1M?)

How many samples? (100, 1000, 2000?)

early ’80: RFLP (Restriction Fragment Length Polymorphism)

early ’90: RAPD (Random Amplified Polymorphic DNA)

early ’90: SSR (Simple Sequence Repeats)

1993: AFLP (Amplified Fragment Length Polymorphism)

middle ’90: SNP (Single Nucleotide Polymorphism)

Evolution in the type of molecular markers

Di-, tri-, tetranucleotide repeats

GAACGTACTCACACACACACACATTTGACTTCGATGATAGATAGATAGATAGATACGT

Microsatellites, also known as Simple Sequence Repeats (SSRs), are repeating sequences of 2-6 base pairs of DNA.

• The number of repeats varies
• Highly polymorphic and transferable between species
• High info content and widely applicable
• Distributed evenly throughout the genome
• Easy to detect by PCR but low throughput

Microsatellite markers
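As a rough illustration of the SSR definition above, the sketch below scans a sequence for tandem repeats of 2-6 bp motifs with a regular expression. The function name `find_ssrs` and the threshold of at least 4 repeat units are assumptions made for this example, not part of the slides.

```python
import re

def find_ssrs(seq, min_repeats=4):
    """Return (motif, repeat_count, start) for tandem repeats of 2-6 bp motifs.

    min_repeats is an illustrative threshold, not a standard value.
    """
    # A lazily matched 2-6 bp motif followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:  # skip homopolymer runs such as AAAAAAAA
            continue
        hits.append((motif, len(m.group(0)) // len(motif), m.start()))
    return hits

# Sequence shown on the slide (di- and tetranucleotide repeats).
example = "GAACGTACTCACACACACACACATTTGACTTCGATGATAGATAGATAGATAGATACGT"
for motif, n, start in find_ssrs(example):
    print(f"({motif}){n} starting at position {start}")
```

Dedicated SSR-mining tools handle imperfect and compound repeats, which this sketch ignores.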

Single Nucleotide Polymorphisms (SNPs)

GTGGACGTGCTT[G/C]TCGATTTACCTAG

• A SNP is a single-base mutation in DNA
• The simplest and most common type of polymorphism
• Biallelic, hence less informative per marker
• Highly abundant: about one every 1000 bp along the human genome, one every 50 bp in apple/grapevine
• They are specific to the population in which they are developed

There are two types of nucleotide base substitutions resulting in SNPs:

Transition: substitution between purines (A, G) or between pyrimidines (C, T). Constitute two thirds of all SNPs

Transversion: substitution between a purine and a pyrimidine
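The two substitution classes can be told apart with a few lines of code. This is a minimal sketch; the function name `classify_substitution` is chosen here for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases out of A, C, G, T")
    # Transition: both bases are purines, or both are pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

# The [G/C] SNP from the SNP slide swaps a purine for a pyrimidine.
print(classify_substitution("G", "C"))
```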

SNP markers

SNP distribution and type of variation

– SNP distribution is not uniform: 1/3 in coding, 2/3 in non-coding

– Synonymous mutation: DNA mutations that do not result in a change to the amino acid sequence of a protein

– Nonsynonymous mutation: results in a change in amino acid that may be arbitrarily further classified as conservative (change to an amino acid with similar physicochemical properties), semi-conservative (e.g. negatively to positively charged amino acid), or radical (vastly different amino acid). Nonsynonymous mutations that change an amino acid to a stop codon are considered nonsense mutations

– ½ SNPs in coding regions are nonsynonymous
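To make the synonymous/nonsynonymous/nonsense distinction concrete, here is a small sketch that classifies a SNP inside a codon. It assumes the standard genetic code; the function name `classify_coding_snp` is hypothetical, and the finer conservative/radical grading is not attempted.

```python
# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_coding_snp(codon, position, alt_base):
    """Classify a SNP at `position` (0, 1 or 2) of `codon`."""
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "nonsynonymous"

print(classify_coding_snp("GAA", 2, "G"))  # Glu -> Glu
print(classify_coding_snp("GAA", 1, "T"))  # Glu -> Val
print(classify_coding_snp("TAC", 2, "A"))  # Tyr -> stop
```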

No prior information on the polymorphisms:

SSCP/DGGE/HRM

Prior information on the polymorphisms is required:

SNP discovery: amplicon sequencing/EST databases/whole genome sequence/resequencing

SNP genotyping: low-throughput/high-throughput techniques

SNP discovery and genotyping: Genotyping by Sequencing

SNP genotyping techniques

dbSNP: the NCBI database of genetic variation

http://www.ncbi.nlm.nih.gov/SNP/

SNP identification in amplicons

• Sanger sequencing of a specific amplicon
• Sequence capture methods (SureSelect, Nimblegen…)

Genomic reduction with restriction enzymes + multiplex identifiers (barcodes) + Next-Generation Sequencing

OR

Whole-genome resequencing

?

from marker-based mapping approach to genome-based high-throughput strategies

Reference genome available? Missing data ...Sequencing errors..

RNA-seq

Sequence capture (whole exome…)

Genome-wide genetic marker discovery and genotyping using next-generation sequencing

http://genomevolution.org/wiki/index.php/Sequenced_plant_genomes

Sequenced plant genomes (tree up to date as of December 2014)

Reference genome?

Diploid vs Polyploid genomes?

Different applications = different depth of sequencing

Depth of sequencing = genotype calling accuracy
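The link between depth and genotype calling accuracy can be illustrated with simple arithmetic. Assuming that each read samples one of the two alleles of a heterozygote with probability 1/2 and that there are no sequencing errors (both simplifications made for this example), the chance that all n reads show the same allele is 2 × (1/2)^n.

```python
def prob_het_looks_homozygous(depth):
    """Probability that every read at a heterozygous site carries the same allele.

    Assumes unbiased allele sampling and error-free reads.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5 ** depth

for depth in (1, 3, 6, 10, 30):
    print(f"depth {depth:>2}: {prob_het_looks_homozygous(depth):.6f}")
```

Under these assumptions a heterozygote is miscalled as homozygous one time in four at depth 3, which is why low-depth applications tend to undercall heterozygosity.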

Pooling vs non-pooling

WGG, GBS or SNP array

SNP calling pipeline

Read alignment to the reference

Alignment tools
• BWA http://bio-bwa.sourceforge.net/
• BFAST http://bfast.sourceforge.net/
• Tophat (for RNAseq) http://tophat.cbcb.umd.edu/

Read cleaning: quality filtering, demultiplexing

1. Index the reference (genome) sequence
2. Perform the alignment

Mapping parameters
• Number of mismatches per read
• Scores for mismatch or gaps

Reference genome

Sample

1. Index the reference (genome) sequence
2. Perform the alignment
3. Output results in SAM/BAM format
4. Pre-process the alignments:
• no. of mismatches (how diverse is the reference?)
• PCR duplicates
• uniqueness (to avoid paralogous loci and repeats)
• if paired-end sequences, insert size and read pair orientation
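The pre-processing criteria above can be sketched as a filter over SAM records. The flag bits (0x1 paired, 0x2 proper pair, 0x4 unmapped, 0x400 duplicate) and the `NM` edit-distance tag come from the SAM format; the thresholds, the use of MAPQ as a proxy for uniqueness, and the function name `keep_alignment` are assumptions for this example.

```python
def keep_alignment(sam_line, max_mismatches=3, min_mapq=20):
    """Decide whether one SAM alignment line passes basic pre-processing filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    # Optional tags look like "NM:i:2" (name, type, value).
    tags = {t[:2]: t[5:] for t in fields[11:]}
    if flag & 0x4:            # read is unmapped
        return False
    if flag & 0x400:          # marked as PCR/optical duplicate
        return False
    if mapq < min_mapq:       # low MAPQ: likely multi-mapping (paralogs, repeats)
        return False
    if int(tags.get("NM", 0)) > max_mismatches:  # NM counts mismatches plus indel bases
        return False
    if flag & 0x1 and not flag & 0x2:  # paired, but orientation/insert size not as expected
        return False
    return True

line = "\t".join(["read1", "99", "chr1", "100", "60", "50M", "=", "300", "250",
                  "ACGT", "IIII", "NM:i:1"])
print(keep_alignment(line))
```

The duplicate flag is only set if a duplicate-marking tool has been run on the alignments beforehand.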


SNP calling SAMtools

1. Convert SAM to BAM for sorting
2. Sort BAM for SNP calling

Variant calling software/algorithms

samtools mpileup/bcftools http://samtools.sourceforge.net/
Includes a computation of genotype likelihoods (SAMtools) and SNP and genotype calling (BCFtools)

GATK http://www.broadinstitute.org/gatk/
Package for aligned NGS data analysis, which includes a SNP and genotype caller (UnifiedGenotyper), SNP filtering (VariantFiltration) and SNP quality recalibration (VariantRecalibrator)

Freebayes https://github.com/ekg/freebayes
Bayesian genetic variant detector (SNPs, indels, MNPs)
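The alignment and SNP calling steps can be strung together roughly as follows. This sketch only builds the command lines and does not run them. It assumes recent BWA, SAMtools and BCFtools releases, where `samtools sort` reads SAM directly and `bcftools mpileup` replaces the older `samtools mpileup`; the file names and the function name `snp_calling_commands` are placeholders.

```python
import shlex

def snp_calling_commands(reference, reads, sample):
    """Return shell command strings for a basic align-sort-call workflow."""
    ref, fq = shlex.quote(reference), shlex.quote(reads)
    sam = shlex.quote(f"{sample}.sam")
    bam = shlex.quote(f"{sample}.sorted.bam")
    vcf = shlex.quote(f"{sample}.vcf.gz")
    return [
        f"bwa index {ref}",                     # 1. index the reference
        f"bwa mem {ref} {fq} > {sam}",          # 2. align reads, SAM output
        f"samtools sort -o {bam} {sam}",        # 3. convert to BAM and sort
        f"samtools index {bam}",                # 4. index the sorted BAM
        # 5. genotype likelihoods, then SNP and genotype calling
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Oz -o {vcf}",
    ]

for command in snp_calling_commands("ref.fa", "sample1.fq", "sample1"):
    print(command)
```

Mapping parameters (mismatch and gap scores) and the pre-processing filters are left at tool defaults here and would need tuning for a divergent reference.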

[Figure: reads aligned to two haplotypes, distinguishing true polymorphisms from a sequencing error]

eSNP identification

SNP genotyping techniques
• over 100 different approaches

• Ideal SNP genotyping platform:
  – high-throughput capacity
  – multiplexed: multiple markers tested simultaneously
  – simple assay design
  – robust
  – affordable price
  – automated genotype calling
  – accurate and reliable results
  – quick and dirty DNA extraction protocol

• Major limitations:
  – development costs high
  – very specific -> poorly transferable to distinct germplasm
  – individual markers have low information
  – expensive if only few genomic segments or just a few individuals have to be analysed

SNP genotyping techniques

Detection of the allelic discrimination:
– light emitted by the products
– mass
– change in the electrical property

Gel-based electrophoresis (1x)
• SSCP: non-denaturing gel
• CAPS/dCAPS: agarose gel

Fluorescence-based PCR (real-time PCR) (1x)
• High Resolution Melting

Low cost, low throughput vs high cost, high throughput

[Gel images: P1, P2, progeny]

Kompetitive Allele Specific PCR (KASP)

The method is based on detecting small differences in PCR melting (dissociation) curves. It is enabled by improved dsDNA-binding dyes used in conjunction with real-time PCR instrumentation that has precise temperature ramp control and advanced data capture capabilities.

High Resolution Melting

+ cost
+ no info of SNP
+ do not need high quality DNA (low-cost DNA extraction may be ok…)
+ easy to perform, automation
- interpretation of the data (optimization is mandatory!)
- no multiplex

1x

Fluorescence-based capillary electrophoresis

[Figure: capillary electrophoresis peak profiles for P1, P2 and individuals Ind1 to Ind6]

SSCP capillary electrophoresis

Multiplex Minisequencing

SNPlex Genotyping System

2x

7x

48x

Low cost, low throughput vs high cost, high throughput

MassARRAY® SNP Genotyping SEQUENOM

36x

TaqMan® OpenArray® technology

16-256x

TaqMan OpenArray… poor-quality or overly concentrated DNA…

Illumina GoldenGate® Assay: 384-1536x

Illumina Infinium® Assay: 1536-1Mx

Affymetrix Axiom®

1.5k-2Mx

http://www.illumina.com/media.ilmn?Title=GoldenGate%20Assay%20Overview%20and%20GoldenGate%20Assay%20Assay%20Workflow&Img=aboutTechAssayGGWorkflowLg.gif&Cap=&PageName=goldengate%20assay&PageURL=11

http://www.illumina.com/media.ilmn?Title=Infinium%20I%20Assay%20Overview%20and%20Infinium%20Assay%20Workflow&Img=aboutTechAssayInfiniumWorkflowLg.gif&Cap=&PageName=infinium%20assay&PageURL=12

SNP genotyping arrays in agriculture

• Grape: Myles et al. (2010) PLoS ONE 5(1), e8219
• Potato: Felcher et al. (2012) PLoS ONE 7(4), e36347
• Corn: Ganal et al. (2011) PLoS ONE 6(12), e28334
• Peach: Verde et al. (2012) PLoS ONE 7(4), e35668
• Apple: Chagné et al. (2012) PLoS ONE 7(2), e31745
• Cherry: Peace et al. (2012) PLoS ONE 7(12), e48305
• Tomato: Sim et al. (2012) PLoS ONE 7(7), e40563
• Soybean: Song et al. (2013) PLoS ONE 8(1), e54985
• Pine: Pavy et al. Mol Ecol Resour 13(2): 324-335
• …

9k SNP array in apple and peach

Apple (from Chagné et al. 2012, PLoS ONE): 27 apple varieties
Peach (from Verde et al. 2012, PLoS ONE): 54 peach breeding accessions

Development of the apple Illumina Infinium 20K array

from Bianco et al. 2014. PLoS ONE

• Apple major founders and parents of mapping populations
• Doubled haploids
• Pooling (read groups)/30x coverage
• Uniformly distributed (1 cM/400 kb)
• Haploblocks within 2040 FPs
• 6-8K SNPs/progeny: 3 working days!!!

                        Mapped          Mono     Failed   Total
Newly identified SNPs   12,611 (86%)    1,091    1,012    14,714
SNPs from 8K array      3,160 (96%)     26       119      3,305
Total                   15,771 (88%)    1,117    1,131    18,019

from Bianco et al. 2014. PLoS ONE

High-density linkage mapping: consensus map with 16K SNP

In apple LD decays quite fast (20-55K)

20K Chip Sequencing panel limited to apple major founders

We decided to design a much denser (487K Apple SNPs) chip with Affymetrix covering more diversity in European Apple

From Illumina Infinium 20K to Axiom Apple480K

Cultivar selection

Un-weighted Neighbor Joining tree representing the selected cultivars (cvrs)

14 cvrs (from 20K Illumina array, in red)
+ 45 newly resequenced cvrs
+ 8 additional cvrs (not shown), including 4 Russian cvrs
-----------------------------------------
Total: 67 resequenced cvrs (Illumina HiSeq, 30X avg coverage)

[courtesy of Charles-Eric Durel]

SNP discovery panel

Accession name              Tot read pairs Accession name        Tot read pairs
Budimka                     210,405,926    Priscilla             117,058,356
Aivaniya                    119,632,134    Worcester Pearmain    96,410,779
Fyriki                      81,179,122     X9273                 193,831,694
Keswick Codlin              69,130,834     X9748                 184,325,907
Rosa                        70,443,859     Kmenotvorná           44,450,139
Filippa                     90,247,516     Košíkové              86,098,710
Åkerö                       91,666,140     Malinové holovouské   81,050,254
Godelieve Hegmans           98,149,222     Jantarnoe             90,682,815
Court-Pendu Henry           99,053,865     Chodské               62,687,249
Belle et Bonne              113,928,709    Panenské české        76,470,501
Reinette Dubois             102,336,676    Patte de Loup         110,531,953
Sonderskow                  74,020,116     Precoce de Karage     110,480,537
Ijunskoe ranee              123,440,174    Ajmi                  82,950,154
Hetlina                     88,568,362     De L’estre            110,615,195
Ovčí hubička                114,799,463    Shiemer               128,123,768
Renetta Grigia di Torriana  64,416,556     Cabarette             83,997,287
Durello di Forli            46,088,993     Douce Rayotte         20,841,799
Borowitsky                  96,122,310     Amadou                96,043,319
Alfred jolibois             92,947,232     Busiard               84,486,599
Reinette Clochard           111,511,186    Mela Rozza            79,694,154
President Roulin            106,767,106    Gelata                83,186,975
Pepino Jaune                144,185,284    Annurca               86,930,085
Golden Delicious            80,333,013     Spässerud             84,123,712
Antonovka                   122,361,758    Young America         96,202,193
Braeburn                    113,833,008    Abbondanza            90,321,460
Cox’s Orange Pippin         92,500,881     Mela Rosa             95,875,984
Delicious                   130,861,887    Papirovka             90,957,520
DrOldenburg                 125,373,896    Skry                  78,536,012
F22682922                   103,188,147    Ag Alma               127,334,651
Fuji                        109,365,440    Aport Kuba            71,297,068
Jonathan                    128,267,363    Kronprins             62,159,897
Lady Williams               108,963,567    Antonovka Pamtorutka  39,392,157
Macoun                      88,835,758     Heta                  44,608,537
McIntosh                    126,063,436    Maikki                38,025,283
Malus micromalus            106,789,660

63 of 67 re-seq apple accessions

13 from previous array (in red): M. micromalus removed
+ 50 new re-seq: Kmenotvorná and Shiemer removed because of library contamination; Douce Rayotte removed because of very low coverage
+ 2 doubled haploids

The Chip Design Pipeline

Alignment and SNP calling

Alignment on 94,079 contigs

Uniqueness at contig level

Paired reads kept only if same contig and opposite in strand (no insert size selection)

SNP calling in pool (read groups)

BCFtools default parameters

SNP filtering

“Minimal” set of quality filters to avoid removing “true” SNPs in probes

Removed potential sequencing errors in Golden (or very rare alleles of Golden)

Tot no. of contigs targeted 73,099

Affymetrix specific filtering

Tot no. of contigs targeted after Affy filters: 64,319

(68% of tot contigs/611Mb)

Focal points and tagging

Focal point strategy:

In GENIC regions (29,314 contigs/70K gene predictions)

[FEM and INRA predictions]

In NON GENIC regions (35,005 contigs)

Additional Focal points to cover genome every 10Kb

Tagging:

In each focal point, group together SNPs with LD r2 >= 0.85
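A minimal sketch of the tagging idea follows, under stated assumptions: genotypes are coded as allele dosages (0, 1, 2), r2 is approximated by the squared Pearson correlation of dosages (phase is ignored), and grouping is greedy. The actual pipeline used for the array design is not described in the slides, so the function names and the procedure are illustrative only.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:  # monomorphic SNP: correlation undefined
        return 0.0
    return sxy ** 2 / (sxx * syy)

def tag_groups(genotypes, threshold=0.85):
    """Greedily group SNPs whose r2 with the first ungrouped SNP is >= threshold."""
    remaining = list(genotypes)
    groups = []
    while remaining:
        tag = remaining.pop(0)
        linked = [s for s in remaining
                  if r_squared(genotypes[tag], genotypes[s]) >= threshold]
        remaining = [s for s in remaining if s not in linked]
        groups.append([tag] + linked)
    return groups

genotypes = {  # hypothetical dosages for six samples
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [2, 1, 0, 2, 1, 0],
    "snp4": [0, 0, 1, 2, 2, 1],
}
print(tag_groups(genotypes))
```

One SNP per group can then be placed on the array as the tag for the others.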

SNP Selection

1. 20,549 contigs/54,735 gene pred

2. 3,197 contigs/4,910 gene pred

3. 15,237 contigs

4. 15,691 contigs

5. 4,690 contigs

Tot. 39,365 unique contigs covering 553Mb

24,954 untargeted contigs covering 58Mb (avg size 2.3Kb)

The Apple480K SNPs

Tot. 40,192 contigs 562Mb

39,365 contigs 553Mb

5,809 contigs 126Mb

SNP distribution across LGs

Elshire et al. (2011) PLoS ONE 6:e19379
Myles et al. (2013) Trends in Genetics 29:190

Genotyping by Sequencing: enormous promise and significant challenges…

From Davey et al. (2011) Nat Rev Genet 12

GBS


Read depth    Sample 1   Sample 2   Sample 3
Allele C      19         13         0
Allele T      0          12         3
Genotype      CC         CT         ?

GBS: Genotype calling
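The read-depth table can be turned into genotype calls with a simple binomial likelihood model. This is a sketch, not the method of any particular caller: the 1% error rate, the requirement that the best genotype be 10 times more likely than the runner-up, and the function name `call_genotype` are assumptions for illustration.

```python
from math import comb

def call_genotype(n_ref, n_alt, error=0.01, min_ratio=10.0):
    """Call hom_ref, het or hom_alt from allele read counts, or '?' if ambiguous."""
    n = n_ref + n_alt
    # Probability that a single read shows the alternative allele, per genotype.
    p_alt = {"hom_ref": error, "het": 0.5, "hom_alt": 1 - error}
    likelihood = {
        g: comb(n, n_alt) * p ** n_alt * (1 - p) ** n_ref
        for g, p in p_alt.items()
    }
    ranked = sorted(likelihood, key=likelihood.get, reverse=True)
    best, second = ranked[0], ranked[1]
    if likelihood[best] < min_ratio * likelihood[second]:
        return "?"  # not enough evidence to separate the two best genotypes
    return best

# Samples 1-3 from the table, with C as reference and T as alternative allele.
for c_reads, t_reads in [(19, 0), (13, 12), (0, 3)]:
    print(c_reads, t_reads, call_genotype(c_reads, t_reads))
```

With only three T reads, a heterozygote that happened to yield no C reads is still fairly plausible, which is why sample 3 stays uncalled under these settings.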

Average bps 19,123,102
STD bps 3,182,282
MIN bps 7,951,958
MAX bps 24,882,057

GBS: Genome coverage 3-5% of the genome!

for only 1-5% of the SNPs identified

UNDERCALL HETEROZYGOSITY

Call dominant markers?

from Myles, 2013. Trends in Genetics 29:190

Sequence is too divergent from reference

SNPs at the restriction site

Allelic dropout

GBS in Golden Delicious x Scarlet
• Two-enzyme combination: HindIII-MspI
• GBS in 96 samples, 1 lane Illumina, single-end reads
• Read alignment on ref genome (4% of mismatches)
• SNP calling with GATK

SNP filtering
• 273,835 SNPs called
• Kept SNPs with < 20% missing genotype data (uneven coverage across samples): 30,393 SNPs (11%)
• Kept SNPs if rd > 6 for genotype call in at least 80% of samples (uneven depth of coverage): 3967 SNPs (1.4%)
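The two filters can be expressed as a short function over per-sample read depths. A hedged sketch: the input layout (a dict of depth lists, with 0 meaning no genotype data) and the function name `filter_snps` are assumptions for this example; the thresholds are the ones quoted on the slide.

```python
def filter_snps(depths, max_missing=0.20, min_rd=6, min_fraction=0.80):
    """Keep SNPs with < max_missing missing data and rd > min_rd in >= min_fraction of samples.

    depths maps a SNP id to a list of per-sample read depths (0 = missing).
    """
    kept = []
    for snp, per_sample in depths.items():
        n = len(per_sample)
        missing = sum(1 for d in per_sample if d == 0) / n
        well_covered = sum(1 for d in per_sample if d > min_rd) / n
        if missing < max_missing and well_covered >= min_fraction:
            kept.append(snp)
    return kept

depths = {  # hypothetical read depths for ten samples
    "snp_a": [10] * 10,
    "snp_b": [0, 0, 0] + [10] * 7,  # 30% missing
    "snp_c": [3, 3, 3] + [10] * 7,  # only 70% of samples with rd > 6
    "snp_d": [0, 3] + [10] * 8,
}
print(filter_snps(depths))
```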

Golden x Scarlet genetic linkage map

1994 mapped GBS-SNPs (from more than 270K initially discovered, 0.73%)
1272 cM (1 SNP/0.63 cM)

GBS and ‘QTL mapping’ for apple skin color

1 lane Illumina
94 progenies

from Myles et al. 2014. G3

Chi-square test for association
Case/control single marker analysis in PLINK
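For a single marker, a case/control chi-square test compares allele counts between the two groups in a 2x2 table. The sketch below computes the 1-df statistic and its p-value with the standard library only; it illustrates the test itself rather than PLINK's implementation, and the counts are made up.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]].

    Rows: cases and controls; columns: counts of allele A and allele B.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Hypothetical allele counts: red-skinned progeny (cases) vs the rest (controls).
chi2, p = chi_square_2x2(20, 5, 5, 20)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Testing thousands of markers this way calls for a multiple-testing correction before declaring an association.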

QTL analysis – LG 9

GenomeStudio

• Sample Sheet: comma-delimited text file (.csv file)

• Data Repository: directory that contains intensity files

• Manifest Repository: directory that contains the SNP manifest. The SNP manifest contains the mapping between bead-type identifier and SNP, as well as all SNP annotations.

Sample sheet columns:
• Sample_ID: sample identifier (used only for display in the table).
• SentrixBarcode_A: the barcode of the Universal Array Product that this sample was hybridized to for Manifest A.
• SentrixPosition_A: the position within the Universal Array Product this sample was hybridized to for Manifest A (and similarly for _B, _C, etc., depending on how many manifests are used with your project).

GenomeStudio: sample sheet

GenomeStudio – Ideal SNP

AB (P1) x AB (P2) → ¼ AA + ½ AB + ¼ BB

[Cluster plot: P1 and P2 in the AB cluster; progeny clusters AA, AB, BB]

Genotypes are called for each sample (dots) by their signal intensity (Norm R, y-axis) and (Norm Theta , x-axis)

• Y-axis: intensity of signal
• X-axis: colour ratio, allele frequency
• GenCall Score: a quality metric that indicates the reliability of each genotype call. Genotypes with lower GenCall scores are located further from the center of a cluster and have a lower reliability.
• GenTrain score for a SNP: measure of the cluster quality for the SNP (distance between clusters, shape of the clusters)
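The two plot axes are a polar-style transformation of the normalized intensities of the two allele channels. The sketch below uses the transformation commonly described for Illumina arrays, Norm Theta = (2/π)·arctan(Y/X) and Norm R = X + Y; treat it as an approximation of what GenomeStudio displays, since the normalization step itself is not reproduced here.

```python
from math import atan2, pi

def polar_coordinates(x_norm, y_norm):
    """Convert normalized allele A (x) and allele B (y) intensities to (theta, R)."""
    theta = (2 / pi) * atan2(y_norm, x_norm)  # 0 = pure A signal, 1 = pure B signal
    r = x_norm + y_norm                       # total signal intensity
    return theta, r

# Idealized AA, AB and BB samples (hypothetical intensities).
for label, (x, y) in {"AA": (1.0, 0.0), "AB": (0.5, 0.5), "BB": (0.0, 1.0)}.items():
    theta, r = polar_coordinates(x, y)
    print(f"{label}: Norm Theta = {theta:.2f}, Norm R = {r:.2f}")
```

In this representation a sample with one null allele keeps the theta of a homozygote but shows roughly half the R.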

AB (P1) x Aᴓ (P2) → ¼ AA + ¼ Aᴓ + ¼ AB + ¼ Bᴓ

[Cluster plot: P1, P2; clusters AA, Aᴓ, AB, Bᴓ]

AB (P1) x Bᴓ (P2) → ¼ AB + ¼ Aᴓ + ¼ BB + ¼ Bᴓ

[Cluster plot: P1, P2; clusters Aᴓ, AB, BB+Bᴓ]

• Segregation distortion
• Homozygous cluster with two subgroups with different intensity signal
• Genotype calls from re-sequencing may differ
• Genotype calls may be incorrect

genotypes Aᴓ and Bᴓ = additional SNPs

Effect of additional polymorphisms

Critical SNPs at “probe site” → null allele
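The expected ratios for crosses involving a null allele, and the clusters an array would actually show, can be enumerated with a small Punnett-square sketch. The function names and the rule that a genotype with one null allele is scored as a homozygote for the remaining allele are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

NULL = "ᴓ"
ORDER = ["A", "B", NULL]

def cross(parent1, parent2):
    """Expected progeny genotype frequencies for two diploid parents (allele pairs)."""
    counts = Counter(
        tuple(sorted((a, b), key=ORDER.index)) for a, b in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

def observed_classes(expected):
    """Collapse true genotypes into the classes seen when the null allele gives no signal."""
    seen = Counter()
    for genotype, freq in expected.items():
        alleles = [a for a in genotype if a != NULL]
        if not alleles:
            label = "no signal"
        elif len(set(alleles)) == 1:
            label = alleles[0] * 2  # Aᴓ is scored as AA, at lower intensity
        else:
            label = "AB"
        seen[label] += freq
    return dict(seen)

expected = cross(("A", "B"), ("A", NULL))
print({"".join(g): str(f) for g, f in expected.items()})
print({k: str(v) for k, v in observed_classes(expected).items()})
```

For AB x Aᴓ the observed classes come out as ½ AA, ¼ AB and ¼ BB, which matches the segregation distortion and the two-intensity homozygous cluster listed above.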

• Segregation distortion
• Homozygous cluster with two distinct subgroups along x axis
• Genotype calls from re-sequencing differ
• Genotype calls may be incorrect

genotypes BB’ and AB’ = additional SNPs (not so critical)

Effect of additional polymorphisms

Non-critical SNPs at the probe site → decrease probe-sequence “hybridization” efficiency

• A anneals better than B, or the opposite...
• Product A > B -> fluorescent label A > B -> cluster moves towards A
• Product B > A -> fluorescent label B > A -> cluster moves towards B
• Decrease in total fluorescent intensity
• AB genotypes -> lower intensity

AB’ (P1) x AB (P2) → ¼ AA + ¼ AB + ¼ AB’ + ¼ BB’

[Cluster plot: clusters AA, AB’, AB, BB’]

A wide, horizontally oriented cloud

Effect of additional polymorphisms in different genetic background...

many clusters....

Paralogous regions: assume 2 copies at different LGs

AA-BB x AB-BB → ½ AABB + ½ ABBB

[Cluster plot: P1, P2; clusters AABB, ABBB]

Paralogs (example for 2 loci, one locus monomorphic, one polymorphic)

AB-BB (P1) x AB-BB (P2) → ¼ AABB + ½ ABBB + ¼ BBBB

[Cluster plot: P1, P2; clusters AABB, ABBB, BBBB]

• No segregation distortion
• Clusters in a narrow range of the plot
• Genotype calls from re-seq may differ

AB-BB (P1) x AB-BB (P2) → ¾ (AABB + ABBB) + ¼ BBBB

[Cluster plot: P1, P2; clusters AABB + ABBB, BBBB]

• Segregation distortion (3:1)
• AB may have sub-groups
• Position of the clusters along x axis not ideal
• Genotype calls from re-seq may differ

Dedicated software required!!! Gidskehaug et al. (2011) Bioinformatics 27:303

paralogy…

Robust SNP scoring on:
• F1 population
• F2 population
• Whole germplasm

Handles problematic clusters due to paralogy, additional SNPs (null alleles and preferential hybridization)

Advantages:
• Ease of use and install
• Standard Illumina input
• Point and Click graphical interface
• Adjustable parameters
• Different outputs for common genetic analysis software (JoinMap, PLINK, Structure, and HapMap)

AA Ab AB Bb

ASSIsT: Automatic SNP ScorIng Tool

Di Guardo, Micheletti et al. in preparation

available at http://compbiotoolbox.fmach.it/assist/

[Figure: cluster plot for SNP_FB_0399644, Norm R (y-axis) against Norm Theta (x-axis)]

Marker Classification: AB_2-subclusters
