From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation,...

56
From sequence data to genomic From sequence data to genomic prediction prediction

Transcript of From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation,...

Page 1: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

From sequence data to From sequence data to genomic predictiongenomic prediction

Page 2: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Course overview

• Day 1– Introduction – Generation, quality control, alignment of sequence data– Detection of variants, quality control and filtering

• Day 2– Imputation from SNP array genotypes to sequence data

• Day 3– Genome wide association studies with SNP array and

sequence variant genotypes

• Day 4 & 5– Genomic prediction with SNP array and sequence variant

genotypes (BLUP and Bayesian methods)– Use of genomic selection in breeding programs

Page 3: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation

• Why impute?• Approaches for imputation• Factors affecting accuracy of imputation• Does imputation give you more power?• Imputation to whole genome sequence

variant genotypes

Page 4: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Why impute?

• Fill in missing genotypes from the lab• Merge data sets with genotypes on different

arrays– Eg. Affy and Illumina data

• Impute from low density to high density– 7K-> 50K (save $$$)– 50K->800K

– capture power of higher density?– Better persistence of accuracy

• Sequence expensive, can we impute to full sequence data?

Page 5: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Core concept•Identity by state (IBS)

– A pair of individuals have the same allele at a locus

•Identity by descent (IBD)– A pair of individuals have the same

alleles at a locus and it traces to a common ancestor

•Imputation methods determine whether a chromosome segment is IBD

Page 6: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

• A chunk of ancestral chromosome is conserved in the current population

1 1 1 2

Marker Haplotype

Causes of LD

Page 7: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Core concept 2• Any individuals in a population may share a

proportion of their genome identical by descent (IBD)– IBD segments are the same and have originated in a

common ancestor

• The closer the relationship the longer the IBD segments– Pedigree relationships

Page 8: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Several methods for imputation•Two main categories:

– Family based– Population based– Or combination of the two

– Some of the most effective are Beagle (Browning and Browning, 2009), MACH (Li et al., 2010), Impute2 (Howie et al., 2009), AlphaPhase (Hickey et al 2011)

Page 9: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Several methods for imputation•Two main categories:

– Family based– Population based– Or combination of the two

– Some of the most effective are Beagle (Browning and Browning, 2009), MACH (Li et al., 2010), Impute2 (Howie et al., 2009), AlphaPhase (Hickey et al 2011)

Page 10: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Finding an IBD segment

0 22 0 2 20

00 2 0 2 2 2

Sire

0 ?2 0 2 2?

?2 ? 0 0 2 0

Progeny

Page 11: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

0 22 0 2 20

02 2 0 2 2 2

Sire

0 ?2 0 2 2?

?2 ? 0 0 2 0

Progeny

IBD segment

Page 12: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

0 22 0 2 20

02 2 0 2 2 2

Sire

0 22 0 2 20

?2 ? 0 0 2 0

Progeny

Page 13: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Several methods for imputation•Two main categories:

– Family based– Population based (exploits LD)– Or combination of the two

– Some of the most effective are Beagle (Browning and Browning, 2009), MACH (Li et al., 2010), Impute2 (Howie et al., 2009), AlphaPhase (Hickey et al 2011)

Page 14: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Population based imputation• Hidden Markov Models

– Has “hidden states” – For target individuals these are “map” of

reference haplotypes that have been inherited

– Imputation problem is to derive genotype probabilities given hidden states, sparse genotypes, recombination rates, other population parameters

Page 15: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Population based imputation

Reference population

Target population

Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010 11:499-511.

Page 16: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Population based imputation

• Consider three markers, 4 reference haplotypes

• 0 1 1 • 0 1 0• 1 0 1• 0 0 1

• Imputation?

Page 17: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Li and Stephens

Page 18: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Beagle

Page 19: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation accuracy

• Accuracy = correlation of real and imputed genotypes

• Concordance = percentage (%) of genotypes called correctly

Page 20: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation accuracy

• Depends on – Size of reference set

• bigger the better!

– Density of markers • extent of LD, effective population size

– Frequency of SNP alleles

– Genetic relationship to reference

Page 21: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Table 6. Accuracy of imputation from BovineLD genotypes to BovineSNP50 genotypes for Australian, French, and North American breeds.

Boichard D, Chung H, Dassonneville R, David X, et al. (2012) Design of a Bovine Low-Density SNP Array Optimized for Imputation. PLoS ONE 7(3): e34130. doi:10.1371/journal.pone.0034130http://www.plosone.org/article/info:doi/10.1371/journal.pone.0034130

Page 22: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation accuracy

• Density of markers (extent of LD)– In Holstein Dairy cattle

• 3K -> 50K accuracy 0.93• 7K -> 50K accuracy 0.98

Page 23: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Illumina Bovine HD array

• We genotyped

— 898 Holstein heifers— 47 Holstein Key ancestor bulls

• After (stringent) QC 634,307 SNPs

Page 24: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation 50K -> 800K

• HolsteinsCross validation % Correct

Heifers only 1 96.7%2 96.7%

Average 96.7%

Heifers 1 97.8%using key 2 97.7%ancestors Average 97.7%

Page 25: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation accuracy• Rare alleles?

Page 26: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation accuracy• Relationship to reference?

Page 27: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation accuracy• Effect of map errors?

Page 28: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Why more power with imputation

• High accuracies of imputation demonstrate that we can infer haplotypes of animal genotyped with e.g. 3K accurately

• But potentially large number of haplotypes

• With imputed data can test single snp, only use 1 degree of freedom, rather than number of haplotypes

Page 29: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Why more power with imputation

• Weigel et al. (2010)

Page 30: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation

• Why impute?• Approaches for imputation• Factors affecting accuracy of imputation• Does imputation give you more power?• Imputation to whole genome sequence

variant genotypes

Page 31: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Those which capture greatest genetic diversity?

• Select set of individuals which are likely to capture highest proportion of unique chromosome segments

Page 32: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Let total number of individuals in population be n, number of individuals that can be sequenced be m.

• A = average relationship matrix among n individuals, from pedigree

Page 33: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Animal Sire Dam1 0 02 0 03 0 04 1 25 1 26 1 3

Pedigree

Animal 1 Animal 2 Animal 3 Animal 4 Animal 5 Animal 6Animal 1 1Animal 2 0 1Animal 3 0 0 1Animal 4 0.5 0.5 0 1Animal 5 0.5 0.5 0 0.5 1Animal 6 0.5 0 0.5 0.25 0.25 1

Animals 6 is a half sib of 4 and 5

• An example A matrix……..

Page 34: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Let total number of individuals in population be n, number of individuals that can be sequenced be m.

• A = average relationship matrix among n individuals, from pedigree

• c is a vector of size n, which for each animal has the average relationship to the population (eg. Sum up the elements of A down the column for individual i, take mean)

Page 35: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• If we choose a group of m animals for sequencing, how much of the diversity do they capture

• pm = Am-1cm

– Where Am is the sub matrix of A for the m individuals, and cm is the elements of the c vector for the m individuals

• Proportion of diversity = pm’1n

Page 36: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Example

Page 37: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Then choose set of individuals to sequence (m) which maximise pm’1n

• Step wise regression– Find single individual with largest pi, set

ci to zero, next largest pi, set ci to zero…..

• Genetic algorithm

Page 38: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Then choose set of individuals to sequence (m) which maximise pm’1n

• Step wise regression– Find single individual with largest pi, set

ci to zero, next largest pi, set ci to zero…..

• Genetic algorithm

• No A? Use G

Page 39: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Which individuals to sequence?

• Poll Dorset sheep

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

0 10 20 30 40 50 60 70 80

Rams sequenced, ranked from most influential

Pro

port

ion

of g

en

etic

div

ersi

ty c

aptu

red

Page 40: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

• Two groups of individuals– Sequenced individuals: reference

population– Individuals genotyped on SNP array:

target individuals

Page 41: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

• Steps:– Step 1. Find polymorphisms in sequence

data– Step 2. Genotype all sequenced animals

for polymorphisms (SNP, Indels)– Step 3. Phase genotypes (eg Beagle) in

sequenced individuals, create reference file

– Step 4. Impute all polymorphisms into individuals genotyped with SNP array

Page 42: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

Create BAM files

1. Filter reads on quality score, trim ends2. Remove PCR duplicates3. Align with BWA

Variant calling

SamTools mPileupVcf file -> filter (number forward /reverse reads of each allele, read depth, quality, filter number of variants in 5bp window)

Beagle Phasing in ReferenceInput genotype probs from Phred scoresQC with 800K

BAM

Reference file for imputation

Beagle Imputation in Target

SNP array data in target population

Analysis

Genome wide association

Genomic selection

Genotype probabilities

Page 43: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

• How accurate?

Page 44: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

• 1147 animals sequenced• 27 breeds • 20 Partners• Average 11X

Run4.0 1000 bull genomes Run 4.0Breed/Cross NumberHolstein (Black and White) 288Simmental (Dual and Beef) 216Angus (Black and Red) 138

Jersey 61Brown Swiss 59Gelbvieh 34Charolais 33Hereford 31

Limousin 31Guelph Composite 30Beef Booster 29Alberta Composite 28Montbeliarde 28AyrshireFinnish 25

Normande 24Holstein (Red and White) 23Swedish Red 16Danish Red 15Other Crosses 11Belgian Blue 10

Piedmontese 5Eringer 2Galloway 2Unknown 2Scottish Highland 2Pezzata Rossa Italiana 1

Romagnola 1Salers 1Tyrolean Grey 1Total 1147

CRV

Page 45: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

• 36.9 million filtered variants

• 35.2 million SNP• 1.7 million INDEL

X

1000 bull genomes Run 4.0

Page 46: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

– Accuracy?• Chromosome 14

• Remove 50 Holsteins, 20 Jerseys from data set

• Reduce genotypes to 800K for these animals

• Impute full sequence using rest of animals as reference

Imputation of full sequence data

Page 47: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

Page 48: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

Page 49: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Imputation of full sequence data

Page 50: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

– Why so difficult to impute rare mutations?

– Examples Complex Veterbral Malformation (CVM) and Bovine Leukocyte Deficiency (BLAD)• All cases of CVM trace back to Ivanhoe Bell

• BLAD traces to Osbornedale Ivanhoe

Imputation of full sequence data

Page 51: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

– Why so difficult to impute rare mutations?

Imputation of full sequence data

BLAD CVM

Location Chr1:145114963 Chr3:43412427

Frequency 0.0014 0.0103

Bulls genotyped 5987 5987

Imputed correctly 5970 5836

Accuracy 0.9972 0.9748

# Carriers 17 123

# Carriers correctly imputed 13 5

Prop. Carriers correctly imputed 0.765 0.041

Page 52: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

– Why so difficult to impute rare mutations?

– The BLAD mutation is in a unique 250kb haplotype, which does not occur in any non-carriers

– The CVM mutation is in a 250kb haplotype which occurs in many non carriers, and also occurs in breeds without mutation

– Hypothesis – BLAD mutation occurred on rare haplotype, while CVM a recent mutation that occurred on a common haplotype background

Imputation of full sequence data

Page 53: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

– Computationally efficient strategies

– Beagle – run imputation in chromosome segments, say 5MB with 0.5MB overlap (to avoid edge effects)

– Fimpute – much faster than Beagle, used to impute 32,500 animals from 800K to 16 million SNP!• Does not give probabilties

– Beagle phasing + Minimac

Imputation of full sequence data

Page 54: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,
Page 55: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,
Page 56: From sequence data to genomic prediction. Course overview Day 1 –Introduction –Generation, quality control, alignment of sequence data –Detection of variants,

Conclusion• Impute

– to fill in missing genotypes– low density to high density to save $$

• Accuracy depends on size of reference, effective population size, relationship to reference, marker density

• Imputation to sequence possible, relatively low accuracies for rare alleles

• Use genotype probabilities from imputation in GWAS and genomic prediction