Uniting the world’s maize diversity for dissection of complex traits and accelerating breeding

Edward S. Buckler1,2 and The Maize Diversity Project1,2,3,4,5,6,7

1 USDA-Agricultural Research Service; 2 Cornell University, Ithaca, NY; 3 Cold Spring Harbor Laboratory, NY; 4 University of California-Irvine, CA; 5 North Carolina State University, Raleigh, NC; 6 University of Missouri, Columbia, MO; 7 University of Wisconsin, Madison, WI

Maize is one of the most diverse crops and species in the world. This diversity is a tremendous resource for understanding the genetic basis of complex traits and for plant breeding in general. However, relating these tens of millions of variable sites to the complex traits they control poses both a serious problem and a substantial opportunity. We have recently combined whole-genome next-generation sequencing with QTL mapping on the maize Nested Association Mapping (NAM) population of 5,000 recombinant inbred lines. Through genome sequencing of 100 maize and teosinte lines, we have identified more than 50 million variable regions in the genome, and more than 50 million segregate in maize. We have also combined this deeper sequencing with reduced-representation genotyping (Genotyping-By-Sequencing), which is now bringing genotyping costs down to $10-20/sample. By combining these two approaches to understand genome diversity, we were able to conduct a genome-wide association study (GWAS) for the first time in diverse maize. We will discuss the results from maize GWAS for traits involved in heterosis and in carbon and nitrogen metabolism. We are also now able to estimate the relative contributions of copy number variation versus SNPs for a series of traits. While in many cases the associations and genes are readily interpretable, numerous associations remain complex. Despite this complexity, the ability to make predictions using diverse germplasm is high for many traits, which suggests great opportunities for merging and mining the world’s germplasm resources.

Edward Buckler USDA-ARS

Cornell University

http://www.maizegenetics.net

Uniting the world’s maize diversity for dissection of complex traits and accelerating breeding

The Maize Diversity Project

McMullen & Flint-Garcia, at University of Missouri

Holland, at North Carolina State Univ.

Ware, at Cold Spring Harbor Lab.

Sun & Kresovich, Cornell University

Doebley, University of Wisconsin

USDA-ARS & NSF Plant Genome

www.panzea.org

Goal: Create the global model of Genotype To Phenotype to decrease cycle time

G > P is Key for Genomic Selection

Standard Breeding (20th Century): Make Crosses -> Inbreed -> Small Scale Hybrid -> Large Area Hybrid Trials -> Sell or Release Winner Hybrids

Fast Genomic Selection (GS): Make Crosses -> Doubled Haploid -> Genotype -> Predict Value -> Small Scale Hybrid -> Large Area Hybrid Trials -> Sell or Release Winner Hybrids

Other diagram labels: The Model; Data From Other Efforts

Time labels on the diagram: 5 years; 4 months; 4 years

With perfect knowledge it could run 15X faster, current reality ~3X

Outline

• Uniting World’s Maize Variation

–Skim sequencing (GBS)

–HapMapV2

• Heterosis and recombination

• C and N metabolism QTL

• Challenges

Maize has more molecular diversity than humans and apes combined

Silent Diversity (Zhao PNAS 2000; Tenaillon et al, PNAS 2001)

[Figure: silent diversity values of 1.34%, 0.09%, and 1.42%]

Maize likely has functional variation at every gene. In total, there could be 100,000s of functional SNPs (Single Nucleotide Polymorphisms)

Only 50% of the maize genome is shared between two varieties

Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005

Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010

[Figure: Maize (Plant 1, Plant 2, Plant 3): 50% shared; Humans (Person 1, Person 2, Person 3): 99% shared]

GS versus GWAS

• Same data: genome-wide markers and phenotypes, different statistics

• GWAS – Genome Wide Association Studies are aimed at identifying causative genes and variants

• GS – Genomic Selection aims to predict phenotype using the complete genotype
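The "same data, different statistics" point can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the genotypes and phenotypes are simulated, the GWAS step is a plain single-marker squared correlation, and the GS step is ordinary ridge regression. Neither is claimed to be the statistical model used in the work described here.

```python
import numpy as np


def simulate(n_lines=400, n_markers=50, n_causal=5, seed=0):
    """Simulate inbred genotypes (0/1) and a phenotype; every value is synthetic."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 2, size=(n_lines, n_markers)).astype(float)
    effects = np.zeros(n_markers)
    effects[:n_causal] = 1.0  # hypothetical causal markers
    phenotype = genotypes @ effects + rng.normal(0.0, 1.0, n_lines)
    return genotypes, phenotype


def gwas_scan(genotypes, phenotype):
    """GWAS-style statistic: squared correlation of each marker, one at a time."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    return (g.T @ y) ** 2 / ((g ** 2).sum(axis=0) * (y ** 2).sum())


def gs_fit(genotypes, phenotype, ridge=1.0):
    """GS-style model: all markers fitted jointly by ridge regression."""
    g_mean, y_mean = genotypes.mean(axis=0), phenotype.mean()
    g = genotypes - g_mean
    lhs = g.T @ g + ridge * np.eye(g.shape[1])
    beta = np.linalg.solve(lhs, g.T @ (phenotype - y_mean))
    return beta, g_mean, y_mean


def gs_predict(genotypes, beta, g_mean, y_mean):
    """Predict phenotype from the complete genotype."""
    return (genotypes - g_mean) @ beta + y_mean


genotypes, phenotype = simulate()
train, test = slice(0, 300), slice(300, 400)

# GWAS question: which markers are associated with the trait?
marker_r2 = gwas_scan(genotypes[train], phenotype[train])

# GS question: how well does the whole genotype predict unseen lines?
model = gs_fit(genotypes[train], phenotype[train])
predicted = gs_predict(genotypes[test], *model)
accuracy = np.corrcoef(predicted, phenotype[test])[0, 1]
```

Both functions take the same marker matrix and phenotype vector; only the question asked of them differs.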

Sequencing the world’s maize germplasm diversity

Massive Number of Variants

• Sanger Diversity Studies

– SNP = 1-1.6% (maize to maize+teosinte)

– Small Indel = 0.5%

• Expectation for whole genome with a sample of 100 maize and teosinte

– 167 Million SNPs

– 59 Million Small Indels

– Vast number of large indels
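The slide does not say how these expectations were derived. One calculation that gives numbers of the same order is the standard expected number of segregating sites, S = θ · L · a_n with a_n = 1 + 1/2 + … + 1/(n−1). The sketch below assumes a genome size of about 2.3 Gbp, a per-site SNP diversity of 1.4% (a value inside the 1-1.6% range above), and one haplotype per inbred line; all three are assumptions, not figures taken from the talk.

```python
def watterson_correction(n_sequences):
    """a_n = sum of 1/i for i = 1 .. n-1 (sample-size term in Watterson's formula)."""
    return sum(1.0 / i for i in range(1, n_sequences))


def expected_segregating_sites(theta_per_site, genome_size_bp, n_sequences):
    """Expected number of variable sites in a sample of n sequences."""
    return theta_per_site * genome_size_bp * watterson_correction(n_sequences)


GENOME_SIZE_BP = 2.3e9   # assumption: approximate maize genome size
N_LINES = 100            # from the slide; inbred lines, so one haplotype each (assumption)
SNP_DIVERSITY = 0.014    # assumption: a value within the slide's 1-1.6% range
INDEL_DIVERSITY = 0.005  # from the slide: small indel = 0.5%

expected_snps = expected_segregating_sites(SNP_DIVERSITY, GENOME_SIZE_BP, N_LINES)
expected_indels = expected_segregating_sites(INDEL_DIVERSITY, GENOME_SIZE_BP, N_LINES)

print(f"Expected SNPs:   {expected_snps / 1e6:.0f} million")
print(f"Expected indels: {expected_indels / 1e6:.0f} million")
```

Under these assumptions the result lands close to the 167 million SNPs and 59 million small indels quoted on the slide, which suggests (but does not prove) that a calculation of this kind was used.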

Genotypes of Diverse Maize

B73 (Reference Genomes)

Whole Genome Resequencing (HapMapV2)

Genotyping By Sequencing (Ames & SEED)

Genotyping By Sequencing (GBS) to skim sequence all maize

RJ Elshire, Glaubitz et al. (2011) PLoS ONE

Detailed Protocol and Bioinformatics at http://www.maizegenetics.net/

What is GBS?

• Use next generation sequencing to genotype a reduced representation portion of a genome

• RAD, RRL, CROPS, GBS

• Molecularly the most effective approaches use restriction enzymes

– The first maize HapMap was RRL (Gore et al 2009 Science)

– Recent efforts are driving the price down
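How a restriction enzyme produces a reduced representation can be sketched with an in-silico digest. The slides do not name the enzyme; the sketch assumes ApeKI (recognition site GCWGC, W = A or T, cut after the first G), the enzyme described in the cited GBS protocol. The toy sequence, the tag length, and the forward-strand-only simplification are illustrative assumptions.

```python
import re

# Assumption: ApeKI recognition site GCWGC (W = A or T), cut between G and CWGC.
# The lookahead lets overlapping sites be found as well.
RECOGNITION_SITE = re.compile(r"(?=GC[AT]GC)")
CUT_OFFSET = 1


def cut_positions(sequence):
    """Positions at which the enzyme would cut the forward strand."""
    return [m.start() + CUT_OFFSET for m in RECOGNITION_SITE.finditer(sequence)]


def digest(sequence):
    """Split the sequence into restriction fragments."""
    boundaries = [0] + cut_positions(sequence) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(boundaries, boundaries[1:])]


def tags(sequence, tag_length=8):
    """Bases read downstream of each cut site: the 'reduced representation'."""
    return [sequence[p:p + tag_length] for p in cut_positions(sequence)]


toy_genome = "AAAAGCAGCTTTTTTGCTGCGGGG"  # hypothetical sequence with two sites
fragments = digest(toy_genome)
site_tags = tags(toy_genome)
```

Only the bases next to cut sites are sequenced, so every sample is sampled at the same subset of the genome. Reads produced this way start with CAGC or CTGC, the remnant of the cut site.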

GBS 96 or 384-plex Protocol http://www.maizegenetics.net

[Figure: protocol workflow. Labels: Plate DNA & adapter pair; Pool DNAs; PCR Primers; Digest DNA with RE; Ligate adapters (may be done simultaneously); Evaluate fragment sizes; Clean-up]

CTGCAATCTTGGACAATGTATGTAGGGACTAGGGACAGTGATGTAATTAC

CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT

CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT

CAGCACTAATTCATACAATTTTGTTGGTTGATGTTACTGCAGTGGATCTT

CTGCGATCGCCGCGCCGATGAACGGGCCTACCCAGAAGATCCACTGCAGT

CTGCGATCGCCGCGCCGATGAACGGGCCTACCCAGAAGATCCACTGCAGT

CTGCCGTTGCTGGCAGTGCTACAACTCTTCACCTGACTGAAAGCTACTAA

CAGCTAGCGCAAGTGTTTGTGTTGCGCGCGCGCTGTGGAAAAGTGTGCCG

CAGCTAATTTTTTGGTATTTATTTGAAATAAGTTCCCACTACTCGCGGTT

CAGCTAATTTTTTGGTATTTGTTTGAAATAAGTTCCCACTACTCGCGGTT

CAGCCACTTCCCTCATTTGAAACTTTTTGGATCTTTGAAGACCAATAGAT

CAGCTAAGAAGATAGAGCCAAACAAGGTGGGCCTGCCAACGTCTCCTTCC

CAGCTAAGAAGATAGAGCCAAACAAGGTGGGCCTGCCAACGTCTCCTTCC

CTGCGACTCGTGCTTCGCCGCGGCCTGAAGAACCCGGTCTTTCACCGCCG

CTGCTCGGTAGTAAACGGGTACAGAATTTAATCCCGCATCATTTGGAAGC

Sequence (8 x 96 or 384 samples per flowcell)

2 million reads per sample

200 Mbp (today)

In House Molecular Costs per DNA Sample at various multiplex levels

[Figure: stacked bar chart of cost in US dollars (axis 0 to 35) by multiplex level, split into Sequencing, Labor, and Reagents & Consumables. Totals: 48-plex $33.00; 96-plex $19.00; 384-plex $9.00]

GBS has been used in over 20 species

Bioinformatics Problems

• Massive amounts of data

• Complex genomes with many unstable parts of a genome

• No reference genome

• Missing data

• Phasing and imputation

GBS Bioinformatic Pipelines

[Figure: Discovery and Production pipelines. Box labels: Tag Counts by Taxa; Map Tags Genetically; Map Tags by Homology; Genetic Logic; Reference Genetic Map; Alleles and synonyms; Alleles to SNPs; Tags by Taxa; QSeq; Assign Tags to Alleles; Alleles to SNPs and locations; Genotypes (HapMap format); QSeq; Reference Genome]
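Two of the pipeline ideas, counting tags and pairing near-identical tags as candidate alleles, can be sketched on a few of the reads shown on the protocol slide. This is a toy illustration, not the project's pipeline: the function names and the mismatch threshold are assumptions, and a real pipeline would also handle barcodes, sequencing errors, and genetic or physical mapping of tags.

```python
from collections import Counter
from itertools import combinations

# Five of the example reads shown on the GBS protocol slide.
SLIDE_READS = [
    "CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT",
    "CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT",
    "CAGCACTAATTCATACAATTTTGTTGGTTGATGTTACTGCAGTGGATCTT",
    "CAGCTAATTTTTTGGTATTTATTTGAAATAAGTTCCCACTACTCGCGGTT",
    "CAGCTAATTTTTTGGTATTTGTTTGAAATAAGTTCCCACTACTCGCGGTT",
]


def count_tags(reads):
    """Collapse identical reads into tags and count each tag."""
    return Counter(reads)


def mismatches(tag_a, tag_b):
    """Number of differing positions between two equal-length tags."""
    return sum(a != b for a, b in zip(tag_a, tag_b))


def candidate_allele_pairs(tag_counts, max_mismatches=2):
    """Pair tags that differ at only a few positions (hypothetical threshold)."""
    pairs = []
    for tag_a, tag_b in combinations(sorted(tag_counts), 2):
        if len(tag_a) != len(tag_b):
            continue
        distance = mismatches(tag_a, tag_b)
        if 0 < distance <= max_mismatches:
            pairs.append((tag_a, tag_b, distance))
    return pairs


tag_counts = count_tags(SLIDE_READS)
allele_pairs = candidate_allele_pairs(tag_counts)
```

Tags that pair up this way are candidates for alternative alleles at one locus; the mismatching positions are the candidate SNPs.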

Maize GBS Results

• Genotyped 20,000 lines of maize

• 14% of tags are too repetitive to do much with

• Over 5 million frequent tags mapped

• 686,000 SNPs frequently called

• Another 600K SNPs at very low coverage

• Millions of Presence/Absence Variants

Uniting World’s Germplasm

22K done, another 100K to go

• USDA inbreds

• USDA Genetic Resources

• USDA Landraces

• Mexican Landraces

• CIMMYT Breeding Lines

• Chinese Genetic Resources

• Embrapa

• Developing World National Programs

Key Germplasm

• US NAM (5000)

• USDA-ARS Inbred Lines (3500)

• Chinese NAM (2000)

• CIMMYT Breeding Lines (7500)

• Teosinte Mapping (2000)

We are discovering and can map alleles with frequency of 0.001.

B73 reference genome highly accurate for B73…

…but far less so for other maize lines

• 9.3% of Mo17 tags genetically map to a different chromosome than the one they align to

• 0.4% of B73 tags genetically map to a different chromosome than the one they align to

Less Ascertainment Bias than 50K array

Ram Sharma – Visiting Scientist, Buckler lab, Cornell (unpublished)

GBS Association Mapping the USDA Ames Collection directly hits known Mendelian genes

Cinta Romay

[Figure label: Y1]

Merge GBS of Ames and NAM to directly map a herbicide sensitivity locus

Zhiwu Zhang & Nick Lepak

GBS GWAS of α-tocopherol with only 282 lines

Alex Lipka

• Two top SNPs – one within the gene and one 70kb away

[Figure label: gTMT]

GBS GWAS with Ames 2500 lines for flowering time

Cinta Romay

[Figure labels: Ghd7; Vgt1]

Direct hits on the two most important flowering time genes; however, even with 680K SNPs we almost missed Ghd7 (only one significant SNP)

The Maize HapMapV2 Project

Ware, at Cold Spring Harbor Lab.

Ross-Ibarra, Univ. California, Davis

X. Xun & S. Chi, Beijing Genome Inst.

Y. Xu, CIMMYT

J. Lai, Chinese Agri. Univ.

Q. Sun, Cornell Univ.

N. Springer, Univ. of Minnesota

McMullen, at University of Missouri

Doebley & Kaeppler, Univ. of Wisconsin

USDA-ARS, NSF, BGI, JGI

Maize HapMap2

• Increase the breadth of samples (teosinte, landraces, improved lines)

– All inbred lines

• Whole Genome Shotgun, Illumina Paired-End, 76-100bp

• 103 lines, 13 Billion reads, 1Tbp of sequence

• Median 5X coverage

Samples:

Maize Improved Lines (including NAM): 60 Inbred lines

Maize Landraces: 23 Inbred lines

Teosinte (Zea mays ssp. parviglumis): 17 Inbred lines

Teosinte (Zea mays ssp. mexicana): 2 Inbred lines

Tripsacum dactyloides: 1 sample

Chia et al

The Warning & It Applies To Many Other Studies

• CSHL & BGI alignment pipelines only agree 50% of time with same data

• ~160M SNPs identified – most probably really exist somewhere

– MOST DO NOT EXIST WHERE ALIGNED

– GENETIC AND EVOLUTIONARY CONTROLS

• >50% errors if accept standard pipelines

• 55M pass various population & genetic filters

Chia et al

HapMapV2 Results

• 55M SNPs identified (34X higher than HapMapV1)

• Over 80% of genome now in high LD

• Copy number and PAV identified

– 80-90% of the genome in flux

Chia et al

Combining Whole Genome Sequencing with Organized Genetics: Maize NAM

The Hammer: Maize Nested Association Mapping (NAM)

• Crossed and sequenced 25 diverse maize lines to capture a substantial portion of world’s breeding diversity

• Derived 5000 inbred lines from the crosses

• Grew millions of plants

• Largest genetic dissection system ever

[Figure: NAM crossing design. B73 is crossed to each diverse parent: Tx303, Mo18W, MS71, Hp301, CML333, CML247, P39, CML228, Ki11, M37W, CML103, NC350, Oh43, Ky21, CML52, Oh7B, M162W, CML69, Tzi8, Ki3, NC358, CML322, CML277, IL14H, B97. Example families shown: CML52 x B73 and B73 x P39, each giving an F1 and then RIL1, RIL2, …, RIL199, RIL200]

McMullen et al 2009 Science

Advantages of GWAS on NAM

• Population structure eliminated by design

• Project HapMap SNPs To 5000 RILs Accurately

• Control for 90% Genetic Variance

• Wide range of incremental models can be evaluated

• Can Compare Linkage & GWAS
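Projecting parental SNPs onto RILs can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: each RIL is scored at a sparse set of markers as 0 (B73 segment) or 1 (segment from the other parent), parentage at a SNP between two markers is linearly interpolated by position, and the non-B73 parent's allele is coded 0 (same as B73) or 1 (alternate). The function and argument names are hypothetical.

```python
import numpy as np


def project_snp(ril_marker_calls, marker_positions, snp_position, parent_allele):
    """Project one parental SNP onto every RIL of a family.

    ril_marker_calls : (n_rils, n_markers) array, 0 = B73 segment, 1 = other parent
    marker_positions : increasing marker positions
    snp_position     : position of the SNP to be projected
    parent_allele    : 0 if the non-B73 parent matches B73 at the SNP, else 1
    """
    calls = np.asarray(ril_marker_calls, dtype=float)
    positions = np.asarray(marker_positions, dtype=float)
    # Probability-like parentage at the SNP, interpolated between flanking markers.
    parentage = np.array([np.interp(snp_position, positions, row) for row in calls])
    # A RIL carries the alternate allele only where it inherited the other parent.
    return parentage * parent_allele


# Hypothetical family: three RILs scored at two markers (positions 100 and 200).
calls = [[0, 0],   # B73 on both sides of the SNP
         [1, 1],   # other parent on both sides
         [0, 1]]   # a crossover between the markers
projected = project_snp(calls, [100, 200], snp_position=150, parent_allele=1)
```

Because each RIL is a mosaic of only two known parents, a few thousand markers are enough to carry tens of millions of parental SNPs onto all 5000 lines.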

Summary Of NAM Studies

• NAM maps everything: developmental, disease, kernel quality, metabolites

• Almost all small effect QTL – 20+ loci

– Younger traits have larger effects

• Additive effects

• Prediction works very well – explains nearly ALL the heritability

• Hitting specific genes reasonably well (except in low recombination zones)

Buckler, Holland et al. 2009 Science; Tian, Bradbury et al 2011 Nature Genetics; Kump et al 2011 Nature Genetics; Poland et al 2011 PNAS; Brown et al 2011 PLoS Genetics; Cook et al 2011 Plant Phys

NAM: HETEROSIS

Heterosis = 1/Inbreeding Depression

Excess Residual Heterozygosity

NAM is 5000 parallel experiments on inbreeding depression

We expect 3.2% residual heterozygosity

[Figure: CML52 x B73 cross with expected heterozygosity by generation: F1 100%; S1 50%; S2 25%; S3 12.5%; S4 6.3%; S5 3.2%]
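The expected values in the figure follow from heterozygosity halving with each generation of selfing. A short sketch of that arithmetic, assuming no selection and starting from a fully heterozygous F1:

```python
def expected_heterozygosity(generations_of_selfing, f1_heterozygosity=1.0):
    """Expected heterozygosity after selfing: halves every generation."""
    return f1_heterozygosity * 0.5 ** generations_of_selfing


for generation in range(6):
    label = "F1" if generation == 0 else f"S{generation}"
    print(f"{label}: {expected_heterozygosity(generation):.3%}")
```

The exact S5 value is 3.125%, which the slide shows as 3.2%. Regions that retain more heterozygosity than this expectation are the "excess" the talk refers to.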

Every chromosome had more residual heterozygosity in the recombination-suppressed centromeres

[Figure: bar chart of residual heterozygosity (0% to 6%) by chromosome (1 to 10), centromeric (within 10cM) versus arms]

McMullen, Kresovich et al 2009 Science

Low cross-over rates over nearly half of every chromosome

McMullen et al 2009 Science

GBS by Peter Bradbury

Recombination predicts Residual Heterozygosity

[Figure: one panel per chromosome (1 to 10) plotting 1/Recombination and Residual Hets along the chromosome]

Recombination w/ RH r2=32%

Gene density and diversity explain nothing once recombination is controlled for

Gore, Chia et al 2009 Science

Yield Heterosis Studies Point to Low Recombination Regions

• Chris Schön et al 2010 TAG – reanalyzed three heterosis studies

• Larièpe et al 2012 Genetics – 6 populations (NCIII design)

• Found substantial overlap in QTL locations

• The locations agree with the low recombination and residual heterozygosity regions from NAM

• A direct NAM study is in progress

Restricted Recombination

Hill-Robertson Interference

(Repulsion QTL)

Pseudo-Overdominance

Heterosis

What is the problem?

• 31% of all genes are in regions with significant residual heterozygosity

• They still retain 91% of molecular variation, so there is lots of useful genetic variation

• The breeding values are unknown for these genomic regions

Select for paracentromeric recombinants, and THEN field evaluate and select

Analyzing C and N metabolism in the leaf

Nengyi Zhang

• Collaborators: Stitt & Gibon

• 12,000 field sampled plants (NAM)

• 12 basic carbon & nitrogen metabolites across all of NAM

• Statistically factored out maturity, weather, position, and other confounding factors

Natural variation in invertases was critical for starch and glucose levels

[Figure labels: Acid invertase (Vacuoles); Acid invertase (Cell Wall); QTL M211; QTL M632; Star, Gluc, Prin2; Star, Gluc, Fruc, Prin2]

Zhang, Gibon et al

Direct GWAS hits a C4 key gene: Carbonic anhydrase (CA) gene

[Figure: GWAS (BPP) and linkage (-log(P)) profiles, with the peak labeled CA]

CA is the single most important gene controlling Chlorophyll, Malate, Nitrate, Glutamine, and overall protein content.

Zhang, Gibon et al

[Figure: RMIP (0 to 60) against Chr3 AGP position (212 to 216 Mbp) for Chla and Mala; labeled genes: CA (four labels) and Mala transporter]

Able to resolve QTL effects 500Kb apart. C4 metabolism genes were really key (e.g. Malate transporter)

Zhang, Gibon et al

Carbonic anhydrase (CA) is a critical enzyme in C fixation in C4 plants

Ludwig M. et al. Plant Physiol. 1998

[Figure labels: CA; Mala]

• CO2 <-> HCO3-

• CAs are upstream regulators of CO2-controlled stomatal movements in guard cells

• Water use efficiency, heat stress

Hu H. et al. Nature Cell Biology 2010

CA SNP associations: Chla, Mala, Nitr, Glut, Prot, Prin1

Multi-Trillion Data Point Opportunity/Problem

• By end of 2011:

– GBS on ~22,000 public samples worldwide

– 200M variants known from whole genome sequencing

– Combine and impute missing data: 2 alleles x 22,000 lines x 200,000,000 variants = 8.8 trillion data points

Doing the statistics and math will be a challenge.
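The size of the imputed data set can be checked with a few lines of arithmetic. The storage figures are assumptions added for scale (one byte versus two bits per data point); the talk does not state an encoding.

```python
ALLELES_PER_LINE = 2
LINES = 22_000
VARIANTS = 200_000_000

data_points = ALLELES_PER_LINE * LINES * VARIANTS

# Assumed encodings, for scale only.
terabytes_at_one_byte = data_points / 1e12
terabytes_at_two_bits = data_points * 2 / 8 / 1e12

print(f"{data_points:,} data points")
print(f"{terabytes_at_one_byte:.1f} TB at 1 byte each")
print(f"{terabytes_at_two_bits:.1f} TB at 2 bits each")
```

Even with compact encoding the full matrix runs to terabytes, so any genome-wide model has to stream or subsample it rather than hold it in memory.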

Conclusion

• Genotyping can now be done cheaply with relatively little bias

• GWAS is now possible for diverse maize, and resolution can be gene level

• Heterosis related to recombination and pseudo-overdominance

• Genetics of C4 metabolism may be key to yield

THE CHALLENGES

Rare Alleles

• The purpose of breeding is to find good rare alleles and increase their frequency

• We are approaching a point where we can out-genotype our phenotyping

– Knowledge of 100s of millions of rare alleles

• Intelligent informatics, biological modeling, and strategic phenotyping will be key.

Greater Partnerships to Understand GxE

• Public sector has expertise in genomics, modeling, statistics, etc.

• Lacking robust US GxE data sets for yield

• Largest public sector trials are now CIMMYT and in China

• Alleles are all shared, many PVP available to make relevant testcrosses.

• 2000 hybrids x 40 locations per year?

• Laboratory for better phenotyping and modeling.

What should and can we do in the next decades?

• Double yield with same fertilizer and water

– Perhaps even more in the developing world.

• Perennialize our crops

• Breed crops to cope with drought and to use nitrogen efficiently

• Biofortify crops to improve nutrition in the developing world

Who do I contact to learn more?

• NAM – Jim Holland, Mike McMullen, and Sherry Flint-Garcia

• HapMapV2 – Doreen Ware, Jer-Ming Chia, Jeff Ross-Ibarra, Matt Hufford

• QTL Mapping on NAM – Peter Bradbury, Zhiwu Zhang, Feng Tian

• Ames Inbreds – Cinta Romay, Candy Gardner

• C & N Metabolites – Nengyi Zhang, Yves Gibon, Mark Stitt

• GBS Methods & Bioinformatics – Rob Elshire & Sharon Mitchell, Qi Sun, Jeff Glaubitz, James Harriman

Web: www.panzea.org & www.maizegenetics.net

Supported by USDA-ARS & NSF

Statistics

• Peter Bradbury

• Huihui Li

• Jianming Yu

• Zhiwu Zhang

Informatics

• Dallas Kroon

• Hector Sanchez Villeda

• Qi Sun

Germplasm & Experimental Design

• Sherry Flint-Garcia

• Major Goodman

• Nick Lepak

• Chris Browne

• Susan Romero

• Stella Salvo

• Marco Oropeza Rosas

• Carlos Harjes

Phenotyping Collaborators

• Torbert Rocheford

• Narasimham Upadyayula

Genotyping & Molecular Anal.

• Michael Gore

• Elhan Ersoz

• Charlotte Acharya

• Heather Yates

• Kate Hutchins

NAM PIs

• Jim Holland

• Steve Kresovich

• Mike McMullen

w/ Maize Diversity Project

• Jeff Glaubitz (PM)

• John Doebley

Supported by NSF and USDA-ARS

• Cinta Romay

• Jason Peiffer

• Feng Tian

• Sara Larsson

• Rob Elshire

• Sharon Mitchell

Next Gen PIs

• Doreen Ware

• George Grills
