Uniting the world’s maize diversity for dissection of complex traits
and accelerating breeding.
Edward S. Buckler (1,2) and The Maize Diversity Project (1,2,3,4,5,6,7)
1 USDA-Agricultural Research Service; 2 Cornell University, Ithaca, NY; 3 Cold Spring
Harbor Laboratory, NY; 4 University of California-Irvine, CA; 5 North Carolina State
University, Raleigh, NC; 6 University of Missouri, Columbia, MO; 7 University of Wisconsin,
Madison, WI
Maize is one of the most diverse crops and species in the world. This diversity is a tremendous resource for understanding the genetic basis of complex traits and for plant breeding in general. However, it poses both a serious problem and a substantial opportunity in relating these tens of millions of variable sites to the complex traits they control. We have recently combined whole genome next generation sequencing with QTL mapping on the maize Nested Association Mapping (NAM) population of 5,000 recombinant inbred lines. Through genome sequencing of 100 maize and teosinte lines, we have identified more than 50 million variable regions in the genome, and more than 50 million segregate in maize. We have also combined this deeper sequencing with reduced representation genotyping (or Genotyping-By-Sequencing), which is now bringing genotyping costs down to $10-20/sample. By combining these two approaches to understand genome diversity, we were able to conduct a genome-wide association analysis (GWAS) for the first time in diverse maize. We will discuss the results from maize GWAS for traits involved in heterosis and carbon and nitrogen metabolism. We are also now able to estimate the relative contributions of copy number variation versus SNPs for a series of traits. While in many cases the associations and genes are very interpretable, numerous associations remain complex. Despite this complexity, the ability to make predictions using diverse germplasm is high for many traits, which suggests great opportunities for merging and mining the world’s germplasm resources.
Edward Buckler USDA-ARS
Cornell University
http://www.maizegenetics.net
Uniting the world’s maize
diversity for dissection of
complex traits and
accelerating breeding
The Maize Diversity Project
McMullen & Flint-Garcia, at University
of Missouri
Holland, at North Carolina State Univ.
Ware, at Cold Spring Harbor Lab.
Sun & Kresovich, Cornell University
Doebley, University of Wisconsin
USDA-ARS & NSF Plant Genome
www.panzea.org
Goal: Create the global model
of Genotype To Phenotype to
decrease cycle time
G > P is Key for Genomic Selection
Standard Breeding (20th Century): Make Crosses -> Inbreed -> Small Scale Hybrid -> Large Area Hybrid Trials -> Sell or Release Winner Hybrids
Fast Genomic Selection (GS): Make Crosses -> Doubled Haploid -> Genotype -> Predict Value (from The Model and Data From Other Efforts) -> Small Scale Hybrid -> Large Area Hybrid Trials -> Sell or Release Winner Hybrids
Time labels on the diagram: 5 years, 4 months, 4 years
With perfect knowledge it could run 15X faster,
current reality ~3X
Outline
• Uniting World’s Maize Variation
– Skim sequencing (GBS)
– HapMapV2
• Heterosis and recombination
• C and N metabolism QTL
• Challenges
Maize has more molecular diversity
than humans and apes combined
Silent Diversity (Zhao PNAS 2000; Tenaillon et al, PNAS 2001)
1.34% 0.09%
1.42%
Maize likely has functional variation at every gene. In total, there could
be 100,000s of functional SNPs (Single Nucleotide Polymorphisms)
Only 50% of the maize genome is
shared between two varieties
Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005
Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010
[Figure: overlap diagrams of genome content shared among three individuals. Maize (Plant 1, Plant 2, Plant 3): 50%. Humans (Person 1, Person 2, Person 3): 99%.]
GS versus GWAS
• Same data: genome-wide markers and
phenotypes, different statistics
• GWAS – Genome Wide Association
Studies are aimed at identifying causative
genes and variants
• GS – Genomic Selection aims to predict
phenotype using the complete genotype
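The contrast between the two statistics can be sketched on simulated data. This is a minimal illustration, not the analysis used in the talk: the simulated lines, the 0/1 marker coding, the effect sizes, the squared-correlation score standing in for a GWAS test, and ridge regression standing in for a GS model are all assumptions chosen for brevity.

```python
import numpy as np


def simulate(n_lines=400, n_markers=200, n_causal=5, seed=0):
    """Simulate inbred lines (markers coded 0/1) and a trait driven by a few causal markers."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 2, size=(n_lines, n_markers)).astype(float)
    causal = rng.choice(n_markers, size=n_causal, replace=False)
    effects = np.zeros(n_markers)
    effects[causal] = rng.choice([-1.5, 1.5], size=n_causal)
    phenotypes = genotypes @ effects + rng.normal(0.0, 1.0, size=n_lines)
    return genotypes, phenotypes, causal


def gwas_scores(genotypes, phenotypes):
    """GWAS-style: score each marker on its own (squared correlation with the trait)."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotypes - phenotypes.mean()
    covariance = g.T @ y
    scale = np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    return (covariance / scale) ** 2


def ridge_predict(train_g, train_y, test_g, penalty=50.0):
    """GS-style: fit all markers jointly with ridge shrinkage, then predict new lines."""
    mean_g = train_g.mean(axis=0)
    mean_y = train_y.mean()
    g = train_g - mean_g
    # Dual form of ridge regression: solve an (n_lines x n_lines) system.
    kernel = g @ g.T + penalty * np.eye(g.shape[0])
    alpha = np.linalg.solve(kernel, train_y - mean_y)
    marker_effects = g.T @ alpha
    return mean_y + (test_g - mean_g) @ marker_effects


genotypes, phenotypes, causal = simulate()
scores = gwas_scores(genotypes, phenotypes)

train, test = slice(0, 300), slice(300, 400)
predicted = ridge_predict(genotypes[train], phenotypes[train], genotypes[test])
accuracy = np.corrcoef(predicted, phenotypes[test])[0, 1]

print("top GWAS marker:", int(np.argmax(scores)), "causal markers:", sorted(causal.tolist()))
print("GS prediction accuracy (r):", round(float(accuracy), 2))
```

The GWAS function asks which individual markers matter; the GS function never names a gene and is judged only by how well it predicts lines it has not seen.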
Sequencing the
world’s maize
germplasm diversity
Massive Number of Variants
• Sanger Diversity Studies
– SNP = 1-1.6% (maize to maize+teosinte)
– Small Indel = 0.5%
• Expectation for whole genome with a
sample of 100 maize and teosinte
– 167 Million SNPs
– 59 Million Small Indels
– Vast number of large indels
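The slide does not say how these expectations were derived. One standard route from a per-site diversity estimate to an expected count in a larger sample is Watterson's relation, S = theta x L x a_n, where a_n is the sum of 1/i for i from 1 to n-1. The sketch below assumes a 2.3 Gbp genome and a per-site SNP rate of 1.4% (inside the 1-1.6% range on the slide); both values are assumptions, and they give totals close to the slide's figures.

```python
def watterson_a(n_sequences):
    """Sum of 1/i for i = 1 .. n-1, the sample-size factor in Watterson's relation."""
    return sum(1.0 / i for i in range(1, n_sequences))


def expected_variants(per_site_rate, genome_size_bp, n_sequences):
    """Expected number of segregating sites in a sample of n sequences."""
    return per_site_rate * genome_size_bp * watterson_a(n_sequences)


GENOME_SIZE_BP = 2.3e9   # assumption: approximate maize genome size
SNP_RATE = 0.014         # assumption: a value inside the slide's 1-1.6% range
INDEL_RATE = 0.005       # from the slide: small indel = 0.5%
SAMPLE_SIZE = 100        # from the slide: 100 maize and teosinte lines

snps = expected_variants(SNP_RATE, GENOME_SIZE_BP, SAMPLE_SIZE)
indels = expected_variants(INDEL_RATE, GENOME_SIZE_BP, SAMPLE_SIZE)

print(f"expected SNPs: {snps / 1e6:.0f} million")
print(f"expected small indels: {indels / 1e6:.0f} million")
```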
Genotypes of Diverse Maize
• B73 (Reference Genome)
• Whole Genome Resequencing (HapMapV2)
• Genotyping By Sequencing (Ames & SEED)
Genotyping By Sequencing
(GBS) to skim sequence all
maize
RJ Elshire, Glaubitz et al. (2011) PLoS ONE
Detailed Protocol and Bioinformatics at
http://www.maizegenetics.net/
What is GBS?
• Use next generation sequencing to
genotype a reduced representation
portion of a genome
• RAD, RRL, CROPS, GBS
• Molecularly the most effective
approaches use restriction enzymes
– The first maize HapMap was RRL (Gore
et al 2009 Science)
– Recent efforts are driving the price down
GBS 96 or 384-plex Protocol http://www.maizegenetics.net
Workflow diagram labels:
• Plate DNA & adapter pair
• Digest DNA with RE
• Ligate adapters (may be done simultaneously)
• Pool DNAs
• Clean-up
• PCR (Primers)
• Evaluate fragment sizes
CTGCAATCTTGGACAATGTATGTAGGGACTAGGGACAGTGATGTAATTAC
CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT
CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT
CAGCACTAATTCATACAATTTTGTTGGTTGATGTTACTGCAGTGGATCTT
CTGCGATCGCCGCGCCGATGAACGGGCCTACCCAGAAGATCCACTGCAGT
CTGCGATCGCCGCGCCGATGAACGGGCCTACCCAGAAGATCCACTGCAGT
CTGCCGTTGCTGGCAGTGCTACAACTCTTCACCTGACTGAAAGCTACTAA
CAGCTAGCGCAAGTGTTTGTGTTGCGCGCGCGCTGTGGAAAAGTGTGCCG
CAGCTAATTTTTTGGTATTTATTTGAAATAAGTTCCCACTACTCGCGGTT
CAGCTAATTTTTTGGTATTTGTTTGAAATAAGTTCCCACTACTCGCGGTT
CAGCCACTTCCCTCATTTGAAACTTTTTGGATCTTTGAAGACCAATAGAT
CAGCTAAGAAGATAGAGCCAAACAAGGTGGGCCTGCCAACGTCTCCTTCC
CAGCTAAGAAGATAGAGCCAAACAAGGTGGGCCTGCCAACGTCTCCTTCC
CTGCGACTCGTGCTTCGCCGCGGCCTGAAGAACCCGGTCTTTCACCGCCG
CTGCTCGGTAGTAAACGGGTACAGAATTTAATCCCGCATCATTTGGAAGC
Sequence (8 x 96 or 384
samples per flowcell)
2 million reads per sample
200 Mbp (today)
In House Molecular Costs per DNA
Sample at various multiplex levels
[Stacked bar chart: cost in US dollars (axis 0 to 35) per sample, split into Sequencing, Labor, and Reagents & Consumables]
48-plex: $33.00
96-plex: $19.00
384-plex: $9.00
GBS has been used in over
20 species
Bioinformatics Problems
• Massive amounts of data
• Complex genomes with many
unstable parts of a genome
• No reference genome
• Missing data
• Phasing and imputation
GBS Bioinformatic Pipelines
[Flow diagram with two pipelines, Discovery and Production. Labels shown:]
• QSeq (input to both pipelines)
• Tag Counts by Taxa
• Map Tags Genetically
• Map Tags by Homology
• Genetic Logic
• Reference Genetic Map
• Reference Genome
• Alleles and synonyms
• Alleles to SNPs
• Tags by Taxa
• Assign Tags to Alleles
• Alleles to SNPs and locations
• Genotypes (HapMap format)
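The "Tag Counts by Taxa" step can be pictured as collapsing identical reads into tags and counting how often each tag appears in each line. The sketch below is an illustration only, not the pipeline's actual code. It assumes reads are already split by taxon, that valid reads begin with CAGC or CTGC (as the example reads on the protocol slide do), and that tags are truncated to a fixed length; the function name, the 64-base tag length, the taxon names, and the short reads are all hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: valid reads start with one of these prefixes, as in the slide's example reads.
CUT_SITE_PREFIXES = ("CAGC", "CTGC")


def tags_by_taxa(reads_by_taxon, tag_length=64):
    """Return {tag: Counter({taxon: count})} for reads carrying an expected prefix."""
    counts = defaultdict(Counter)
    for taxon, reads in reads_by_taxon.items():
        for read in reads:
            if read.startswith(CUT_SITE_PREFIXES):
                tag = read[:tag_length]  # truncate so every tag has a comparable length
                counts[tag][taxon] += 1
    return counts


# Hypothetical toy input: two taxa, short reads for readability.
example_reads = {
    "line_A": ["CAGCACTAATTCACACAATT", "CAGCACTAATTCACACAATT", "GGGCACTAATTCACACAATT"],
    "line_B": ["CAGCACTAATTCATACAATT", "CTGCGATCGCCGCGCCGATG"],
}

table = tags_by_taxa(example_reads)
for tag, per_taxon in sorted(table.items()):
    print(tag, dict(per_taxon))
```

A tag present in one line and absent in another is a candidate presence/absence variant, while tags that differ by a single base are candidate alleles of the same SNP, which is roughly what the later "Alleles to SNPs" step has to sort out.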
Maize GBS Results
• Genotyped 20,000 lines of maize
• 14% of tags are too repetitive to do
much with.
• Over 5 million frequent tags mapped
• 686,000 SNPs frequently called
• Another 600K SNPs at very low
coverage
• Millions of Presence/Absence
Variants
Uniting World’s Germplasm
22K done, another 100K to go
[Diagram of germplasm sources:]
• USDA inbreds
• USDA Genetic Resources
• USDA Landraces
• Mexican Landraces
• CIMMYT Breeding Lines
• Chinese Genetic Resources
• Embrapa
• Developing World National Programs
Key Germplasm
• US NAM (5000)
• USDA-ARS Inbred Lines (3500)
• Chinese NAM (2000)
• CIMMYT Breeding Lines (7500)
• Teosinte Mapping (2000)
We are discovering and can map
alleles with frequency of 0.001.
B73 reference genome highly accurate for B73…
…but far less so for other maize lines
• 9.3% of Mo17 tags genetically map to different chromosome than they align to
• 0.4% of B73 tags genetically map to different chromosome than they align to
Less Ascertainment Bias than 50K array
Ram Sharma – Visiting Scientist, Buckler lab, Cornell (unpublished)
GBS Association Mapping the
USDA Ames Collection directly
hits known Mendelian genes
[Figure label: Y1]
Cinta Romay
Merge GBS of Ames and NAM to
directly map a herbicide sensitivity
locus
Zhiwu Zhang & Nick Lepak
GBS GWAS of α-tocopherol with
only 282 lines
• Two top SNPs – one within the
gene and one 70kb away
[Figure label: gTMT]
Alex Lipka
GBS GWAS with Ames 2500 lines
for flowering time
Direct hits on the two most important flowering time
genes; however, even with 680K SNPs we almost missed
Ghd7 (only one significant SNP)
[Figure labels: Ghd7, Vgt1]
Cinta Romay
The Maize HapMapV2 Project
Ware, at Cold Spring Harbor Lab.
Ross-Ibarra, Univ. California, Davis
X. Xun & S. Chi, Beijing Genome Inst.
Y. Xu, CIMMYT
J. Lai, Chinese Agri. Univ.
Q. Sun, Cornell Univ.
N. Springer, Univ. of Minnesota
McMullen, at University of Missouri
Doebley & Kaeppler, Univ. of Wisconsin
USDA-ARS, NSF, BGI, JGI
Maize HapMap2
• Increase the breadth of samples (teosinte, landraces, improved lines)
– All inbred lines
• Whole Genome Shotgun, Illumina Paired-End, 76-100bp
• 103 lines, 13 Billion reads, 1Tbp of sequence
• Median 5X coverage
Samples:
Maize Improved Lines (including NAM): 60 Inbred lines
Maize Landraces: 23 Inbred lines
Teosinte (Zea mays ssp. parviglumis): 17 Inbred lines
Teosinte (Zea mays ssp. mexicana): 2 Inbred lines
Tripsacum dactyloides: 1 sample
Chia et al
The Warning & It Applies To
Many Other Studies
• CSHL & BGI alignment pipelines only
agree 50% of the time with the same data
• ~160M SNPs identified – most probably
really exist somewhere
– MOST DO NOT EXIST WHERE ALIGNED
– GENETIC AND EVOLUTIONARY CONTROLS
• >50% errors if accept standard pipelines
• 55M pass various population & genetic
filters
Chia et al
HapMapV2 Results
• 55M SNPs identified (34X
higher than HapMapV1)
• Over 80% of genome now in
high LD
• Copy number and PAV
identified
–80-90% of the genome in flux
Chia et al
Combining Whole
Genome Sequencing with
Organized Genetics:
Maize NAM
The Hammer: Maize Nested Association Mapping (NAM)
• Crossed and sequenced 25 diverse maize lines to capture a substantial portion of world’s breeding diversity
• Derived 5000 inbred lines from the crosses
• Grew millions of plants
• Largest genetic dissection system ever
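[Figure: NAM crossing design. B73 is crossed to each of 25 diverse founder lines, and each F1 is inbred into RILs (RIL1, RIL2, ... RIL199, RIL200). Example crosses shown: B73 x CML52 and B73 x P39.]
Founder lines: Tx303, Mo18W, MS71, Hp301, CML333, CML247, P39, CML228, Ki11, M37W, CML103, NC350, Oh43, Ky21, CML52, Oh7B, M162W, CML69, Tzi8, Ki3, NC358, CML322, CML277, IL14H, B97
McMullen et al 2009 Science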
Advantages of
GWAS on NAM
NAM
GWAS
Population structure
eliminated by design
Project HapMap SNPs To
5000 RILs Accurately
Control for 90%
Genetic Variance
Wide range of incremental
models can be evaluated
Can Compare Linkage & GWAS
Summary Of NAM Studies
• NAM maps everything: developmental,
disease, kernel quality, metabolites
• Almost all small effect QTL – 20+ loci
– Younger traits have larger effects
• Additive effects
• Prediction works very well – explains
nearly ALL the heritability
• Hitting specific genes reasonably well
(except in low recombination zones)
Buckler, Holland et al. 2009 Science; Tian, Bradbury et al 2011 Nature Genetics; Kump et al 2011 Nature Genetics; Poland et al 2011 PNAS; Brown et al 2011 PLoS Genetics; Cook et al 2011 Plant Phys
NAM: HETEROSIS
Heterosis = 1/Inbreeding Depression
Excess Residual
Heterozygosity
NAM is 5000 parallel
experiments on inbreeding
depression
We expect 3.2% residual
heterozygosity
[Figure: expected heterozygosity by generation for the CML52 x B73 cross]
F1: 100%
S1: 50%
S2: 25%
S3: 12.5%
S4: 6.3%
S5: 3.2%
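The expectation follows from selfing halving heterozygosity each generation, starting from a fully heterozygous F1. A minimal sketch of that arithmetic, assuming no selection and independent loci (exact halving gives 3.125% at S5, which the slide shows as 3.2%):

```python
def expected_heterozygosity(generations_of_selfing):
    """Expected fraction of heterozygous loci after selfing an F1 for the given generations."""
    return 0.5 ** generations_of_selfing


for generation in range(6):
    label = "F1" if generation == 0 else f"S{generation}"
    print(f"{label}: {expected_heterozygosity(generation):.3%}")
```

Regions that hold more heterozygosity than this baseline in the finished RILs are the "excess residual heterozygosity" the following slides examine.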
Every chromosome had more
residual heterozygosity in the
recombination-suppressed centromeres
[Bar chart: residual heterozygosity (axis 0% to 6%) for chromosomes 1 to 10, comparing centromeric regions (within 10cM) with arms]
McMullen, Kresovich et al 2009 Science
Low cross-over rates over nearly half of
every chromosome
McMullen et al 2009 Science
GBS by Peter Bradbury
Recombination predicts Residual
Heterozygosity
[Figure: panels for chromosomes 1 to 10 plotting 1/Recombination and Residual Hets]
Recombination w/ RH r2=32%
Gene density and diversity explain
nothing once recombination is
controlled for
Gore, Chia et al 2009 Science
Yield Heterosis Studies Point to
Low Recombination Regions
• Chris Schön et al 2010 TAG – reanalyzed
three heterosis studies
• Larièpe et al 2012 Genetics – 6
populations (NCIII design)
• Found substantial overlap in QTL
locations
• The locations agree with the low
recombination and residual
heterozygosity regions from NAM
• A direct NAM study is in progress
Restricted Recombination
Hill-Robertson Interference
(Repulsion QTL)
Pseudo-Overdominance
Heterosis
What is the problem?
• 31% of all genes are in regions with
significant residual heterozygosity
• They still retain 91% of molecular
variation, so there is lots of useful
genetic variation
• The breeding values are unknown for
these genomic regions
Select for paracentromeric
recombinants, and THEN field
evaluate and select
Analyzing C and N metabolism in
the leaf
Nengyi Zhang
• Collaborators: Stitt &
Gibon
• 12,000 field sampled
plants (NAM)
• 12 basic carbon &
nitrogen metabolites
across all of NAM
• Statistically factored
out maturity, weather,
position, and other
confounding factors
Natural variation in invertases was
critical for starch and glucose levels
[Figure labels: Star, Gluc, Prin2; Acid invertase (Vacuoles); Acid invertase (Cell Wall); QTL M211; Star, Gluc, Fruc, Prin2; QTL M632]
Zhang, Gibon et al
Direct GWAS hits a C4 key gene:
Carbonic anhydrase (CA) gene
[Figure: GWAS (BPP) and linkage (-log(P)) results, with CA labeled]
CA is the single most important gene
controlling Chlorophyll, Malate, Nitrate,
Glutamine, and overall protein content.
Zhang, Gibon et al
[Figure: RMIP (axis 0 to 60) against Chr3 AGP position (212 to 216 Mbp) for Chla and Mala. Gene labels: CA (four times), Mala transporter]
Able to resolve QTL effects 500Kb
apart. C4 metabolism genes were
really key (e.g. Malate transporter)
Zhang, Gibon et al
Carbonic anhydrase (CA) is a critical
enzyme in C fixation in C4 plants
Ludwig M. et al. Plant Physiol. 1998
[Figure labels: CA, Mala]
• CO2 <-> HCO3-
• CAs are upstream regulators
of CO2-controlled stomatal
movements in guard cells
• Water use efficiency, heat
stress
Hu H. et al. Nature Cell Biology 2010
CA SNP associations: Chla, Mala, Nitr, Glut, Prot, Prin1
Multi-Trillion Data Point
Opportunity/Problem
• By end of 2011:
– GBS on ~22,000 public samples worldwide
– 200M variants known from whole genome
sequencing
– Combine and impute missing data:
2 alleles x 22,000 lines x 200,000,000 variants =
8.8 trillion data points
Doing the statistics and math will be a
challenge.
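The size of the imputed matrix is easy to check, and the same arithmetic gives a feel for raw storage. The line and variant counts come from the slide; the two storage encodings (one byte or two bits per data point) are hypothetical illustrations, not a description of any actual file format.

```python
ALLELES_PER_LINE = 2       # from the slide
LINES = 22_000             # from the slide
VARIANTS = 200_000_000     # from the slide


def data_points(alleles, lines, variants):
    """Total number of allele calls in a fully imputed matrix."""
    return alleles * lines * variants


def storage_terabytes(points, bits_per_point):
    """Raw storage in terabytes (10**12 bytes) for a given encoding width."""
    return points * bits_per_point / 8 / 1e12


total = data_points(ALLELES_PER_LINE, LINES, VARIANTS)
print(f"data points: {total:,}")
print(f"at 1 byte per point: {storage_terabytes(total, 8):.1f} TB")   # hypothetical encoding
print(f"at 2 bits per point: {storage_terabytes(total, 2):.1f} TB")   # hypothetical encoding
```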
Conclusion
• Genotyping can now be done
cheaply with relatively little bias
• GWAS is now possible for diverse
maize, and resolution can be gene level
• Heterosis related to recombination and
pseudo-overdominance
• Genetics of C4 metabolism may be key
to yield
THE CHALLENGES
Rare Alleles
• The purpose of breeding is to find good rare
alleles and increase their frequency
• We are approaching a point where we can
out-genotype our phenotyping
– Knowledge of 100s of millions of rare alleles
• Intelligent informatics, biological
modeling, and strategic phenotyping will
be key.
Greater Partnerships to
Understand GxE
• Public sector has expertise in genomics,
modeling, statistics, etc.
• Lacking robust US GxE data sets for yield
• Largest public sector trials are now
CIMMYT and in China
• Alleles are all shared, many PVP available
to make relevant testcrosses.
• 2000 hybrids x 40 locations per year?
• Laboratory for better phenotyping and
modeling.
What should and can we do
in the next decades?
• Double yield with same fertilizer and
water
– Perhaps even more in the developing world.
• Perennialize our crops
• Breed crops to cope with drought and to
use nitrogen efficiently
• Biofortify crops to improve nutrition in
the developing world
Who do I contact to learn more?
• NAM – Jim Holland, Mike McMullen, and Sherry Flint-Garcia
• HapMapV2 – Doreen Ware, Jer-Ming Chia, Jeff Ross-Ibarra, Matt Hufford
• QTL Mapping on NAM – Peter Bradbury, Zhiwu Zhang, Feng Tian
• Ames Inbreds – Cinta Romay, Candy Gardner
• C & N Metabolites – Nengyi Zhang, Yves Gibon, Mark Stitt
• GBS Methods & Bioinformatics – Rob Elshire & Sharon Mitchell, Qi Sun, Jeff Glaubitz, James Harriman
Web: www.panzea.org & www.maizegenetics.net
Supported by USDA-ARS & NSF
Statistics
• Peter Bradbury
• Huihui Li
• Jianming Yu
• Zhiwu Zhang
Informatics
• Dallas Kroon
• Hector Sanchez Villeda
• Qi Sun
Germplasm & Experimental Design
• Sherry Flint-Garcia
• Major Goodman
• Nick Lepak
• Chris Browne
• Susan Romero
• Stella Salvo
• Marco Oropeza Rosas
• Carlos Harjes
Phenotyping Collaborators
• Torbert Rocheford
• Narasimham Upadyayula
Genotyping & Molecular Anal.
• Michael Gore
• Elhan Ersoz
• Charlotte Acharya
• Heather Yates
• Kate Hutchins
NAM PIs
• Jim Holland
• Steve Kresovich
• Mike McMullen
w/ Maize Diversity Project
• Jeff Glaubitz (PM)
• John Doebley
Supported by NSF and USDA-ARS
• Cinta Romay
• Jason Peiffer
• Feng Tian
• Sara Larsson
• Rob Elshire
• Sharon Mitchell
Next Gen PIs
• Doreen Ware
• George Grills