
ANR-11-BTBR-0005

The SUNRISE on the sunflower genome

sequence

[email protected]

Stéphane Muños

Baptiste Mayjonade: molecular biology

Jérôme Gouzy: bioinformatics

Barcelona, November 10th 2015


4th EMEA User Group Meeting

Toulouse: a unique place for sunflower genetics and genomics

Many facilities are available for phenotyping, physiology, genomics, bioinformatics, microscopy…

Major sunflower seed companies (Syngenta Seeds, Pioneer, Biogemma, SOLTIS, Maïsadour…) are located around Toulouse

INRA National Seed Resources Center

Three main topics in our team: water stress tolerance, broomrape resistance, downy mildew resistance.


Sunflower (Helianthus annuus) history

3000 BC: domestication

16th century: imported into Europe by the Spanish.

18th century: beginning of sunflower breeding in Russia.

Sunflower oil became successful because of Orthodox Christianity: it was the only oil allowed to be consumed during Lent.

20th century: modern breeding

Source: National Sunflower Association (www.sunflowernsa.com) Photo © NASA


Diversity in cultivated sunflower lines

Cadic et al., 2013

Introgression from wild

Elite lines

Very little diversity in the elite lines due to breeding.


Sunflower crops should be highly affected by climate change

Oilseed crop cultivated in dry and marginal land

High impact of climate change

Yield losses (Moriondo et al. 2010):

20 to 50% in the Mediterranean region

0.4 q/ha per day of stress

France: 620 000 ha, 8 M€/day

World: 25 000 000 ha, >100 M€/day

Definition of a new ideotype: a combination of phenotypes, genetically realistic and adapted to crop management

Sources: Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report; Moriondo et al., Climatic Change, 2010; FAO 2013


The SUNRISE project

2012-2019

21 million € project

7 million € from ANR (French government)

10 partners

6 private partners

8 public labs

Goal: improve the stability of oil yield under water stress

Coordinated by Nicolas Langlade (INRA, LIPM)

[email protected]


GWA-hybrid genetic design

36 males x 36 females = 1296 possible hybrids

471 hybrids produced

[Figure: crossing matrix, 36 x 36 grid of male and female lines with one X per cross. Margins of the matrix:]

Crosses per row (lines 1 to 36):
15 15 15 15 15 15 15 15 14 14 13 13 14 13 13 13 13 12 14 14 13 13 14 13 13 13 13 12 14 14 13 13 14 13 13 13

Crosses per column (lines 1 to 36):
15 15 15 15 15 15 15 15 14 14 13 13 12 14 14 13 13 12 14 14 13 13 12 14 14 13 13 12 14 14 13 13 12 14 14 13

Total crosses marked in the matrix: 491
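The figures of the crossing design can be checked with a few lines of Python. This is only a sketch: the two lists are the per-line cross counts copied from the margins of the matrix, and the variable names are illustrative.

```python
# Crosses per line, copied from the row and column margins of the crossing matrix.
ROW_COUNTS = (
    [15] * 8
    + [14, 14, 13, 13, 14, 13, 13, 13, 13, 12]   # rows 9 to 18
    + [14, 14, 13, 13, 14, 13, 13, 13, 13, 12]   # rows 19 to 28
    + [14, 14, 13, 13, 14, 13, 13, 13]           # rows 29 to 36
)
COL_COUNTS = [15] * 8 + [14, 14, 13, 13, 12] * 5 + [14, 14, 13]

N_MALES = 36
N_FEMALES = 36
PRODUCED = 471  # hybrids actually produced, as stated on the slide

possible = N_MALES * N_FEMALES
marked = sum(ROW_COUNTS)

# Row and column margins must describe the same set of crosses.
assert marked == sum(COL_COUNTS)

print(f"possible hybrids: {possible}")
print(f"crosses marked in the matrix: {marked}")
print(f"hybrids produced: {PRODUCED} ({100 * PRODUCED / possible:.1f}% of possible)")
```

The design is therefore sparse: roughly one third of the possible male x female combinations are sampled.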


Molecular characterization of the 72 parental lines

Goal: identification of SNPs in the 72 parental lines to deduce the genotypes of the 471 hybrids for association mapping analysis

SNPs have been identified by resequencing experiments and by mapping the data on the available sunflower genome sequence

The sunflower genome:

Diploid, 2n = 34 (17 chromosome pairs)

3.6 Gb
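Deducing hybrid genotypes from the parents can be sketched as follows. This is a minimal illustration, assuming that every parental line is a fully homozygous inbred line, so that each parent transmits its single allele at each SNP; the function names, line names and SNP identifiers are hypothetical.

```python
from typing import Dict, List, Tuple

Alleles = Dict[str, str]  # SNP id -> allele carried by a homozygous parent


def hybrid_genotype(female_allele: str, male_allele: str) -> str:
    """Genotype of an F1 hybrid at one SNP (alleles sorted, e.g. 'AG')."""
    return "".join(sorted(female_allele + male_allele))


def deduce_hybrids(
    females: Dict[str, Alleles],
    males: Dict[str, Alleles],
    crosses: List[Tuple[str, str]],
) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Deduce hybrid genotypes for each (female, male) cross.

    Only SNPs called in both parents are reported.
    """
    result = {}
    for female, male in crosses:
        shared = females[female].keys() & males[male].keys()
        result[(female, male)] = {
            snp: hybrid_genotype(females[female][snp], males[male][snp])
            for snp in sorted(shared)
        }
    return result


# Toy data: two SNPs, one cross.
females = {"F1": {"snp1": "A", "snp2": "C"}}
males = {"M1": {"snp1": "G", "snp2": "C"}}
hybrids = deduce_hybrids(females, males, [("F1", "M1")])
print(hybrids)
```

With this approach only the 72 parents need to be sequenced, and the genotypes of all hybrids follow from the crossing matrix.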


Assembly of the XRQ sunflower line genome obtained by INRA

127X depth of HiSeq sequences used

PE and MP sequences (300, 2300, 6700, 18500)

Metric     With N        Without N (noN)
NUM        1007165       1007165
MIN        250           250
MAX        288882        237406
N50 BP     18479         9402
N50 NUM    25861         34006
MEAN       1888          1545
MEDIAN     392           392
BP         1902331496    1556693944

43% of the genome
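For readers unfamiliar with the N50 BP and N50 NUM metrics used in these tables, here is a minimal sketch of how they are usually computed from a list of sequence lengths. The function name is illustrative and this is not the script used to produce the tables.

```python
from typing import Iterable, Tuple


def n50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50 BP, N50 NUM) for a set of sequence lengths.

    N50 BP is the length L such that sequences of length >= L contain
    at least half of the total bases. N50 NUM is the number of
    sequences needed to reach that half, longest first.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, rank
    raise ValueError("n50() needs at least one sequence length")


# Toy example: 30 bases in total, half is 15, reached by the 2nd longest sequence.
print(n50([10, 8, 6, 4, 2]))
```

A higher N50 BP with a lower N50 NUM means that the same amount of sequence is held in fewer, longer pieces.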


Use of the XRQ assembly to identify SNPs from resequencing experiments

1 lane of HiSeq (2x100 nt) per sunflower line

Total sequences produced (Q30): 2 546 Gb (727X the sunflower genome size)

Mean depth: 10.1X

%GC: 38

Combination of bwa, mpileup and varscan: 6 348 868 SNPs identified on 161 955 genomic scaffolds

But the SNPs are not exhaustive

AND no structural variations have been identified

The sunflower genome sequence needs to be improved!
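The depth figures can be cross-checked with simple arithmetic. The sketch below assumes that the total of 2 546 Gb covers the 72 parental lines mentioned earlier (one lane each); that count is an assumption taken from the previous slide, not stated on this one.

```python
TOTAL_GB = 2546        # total Q30 sequence produced, from the slide
REPORTED_FOLD = 727    # reported as "727X the sunflower genome size"
N_LINES = 72           # assumption: the 72 parental lines, one lane each

# Genome size implied by the two reported totals.
implied_genome_gb = TOTAL_GB / REPORTED_FOLD

# Mean depth and mean yield per line.
depth_per_line = REPORTED_FOLD / N_LINES
gb_per_line = TOTAL_GB / N_LINES

print(f"implied genome size: {implied_genome_gb:.2f} Gb")
print(f"mean depth per line: {depth_per_line:.1f}X")
print(f"mean yield per line: {gb_per_line:.1f} Gb")
```

Under that assumption the per-line depth agrees with the reported mean depth of 10.1X, and the implied genome size is close to 3.5 Gb.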


The international consortium for the sunflower genome initiative (HA412 sunflower line)

Coordinated by Loren Rieseberg (University of British Columbia, Canada)

Obtained by combining 454 and HiSeq data, together with a high-throughput genetic map and a physical map (BAC clone fingerprinting)

Metric     With N        Without N (noN)
NUM        21            21
MIN        9814          9814
MAX        359367108     254638407
N50 BP     226777971     119919823
N50 NUM    8             7
MEAN       190544064     106877764
MEDIAN     208730832     114409345
BP         4001425362    2244433056

62% of the genome

A good assembly. But difficult to improve (due to the repeated sequences)


Why is it so difficult to assemble the sunflower genome?

Analysis of the composition of the LTR retrotransposons with LTRharvest (D. Ellinghaus et al. 2008, default parameters)

30% of the sunflower genome sequence is composed of LTR retrotransposons, versus 8.8% of the human genome. And the repeats are highly conserved in sunflower.

[Figure: number of hits as a function of the length of LTR retrotransposons (nt)]

There are many repeated sequences in the sunflower genome. They are large (9-12 kb) and highly conserved.

For de novo assembly, it is important to have very long reads that fully cross the length of the repeats.
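The read-length argument can be made concrete with a small sketch. It assumes that a read resolves a repeat only if it covers the whole repeat plus some unique sequence on each side; the anchor size of 500 nt and the function names are hypothetical choices for illustration.

```python
from typing import Sequence


def spans_repeat(read_len: int, repeat_len: int, min_anchor: int = 500) -> bool:
    """True if a read is long enough to cross a repeat of repeat_len
    with at least min_anchor unique bases on both sides."""
    return read_len >= repeat_len + 2 * min_anchor


def fraction_spanning(
    read_lengths: Sequence[int], repeat_len: int, min_anchor: int = 500
) -> float:
    """Fraction of reads long enough to cross the repeat."""
    if not read_lengths:
        return 0.0
    spanning = sum(spans_repeat(n, repeat_len, min_anchor) for n in read_lengths)
    return spanning / len(read_lengths)


# A 100 nt short read cannot cross a 9 kb repeat; a 13 kb read can cross a 12 kb one.
print(spans_repeat(100, 9_000))
print(spans_repeat(13_000, 12_000))
print(fraction_spanning([100, 9_000, 13_000, 20_000], 12_000))
```

With repeats of 9-12 kb, only reads well above 10 kb are useful for resolving them, which is why the read length distribution matters more than raw depth.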


Why could PacBio sequencing help to improve the sunflower genome assembly?

1. Jérôme Gouzy et al. obtained very good results on other organisms (bacteria and fungi)

2. In map-based cloning projects on sunflower, BAC clone sequencing has been greatly improved thanks to PacBio sequencing (coll. LIPM-CNRGV). We systematically obtained a single contig without any N, even when mixing several overlapping BAC clones.

We decided to improve the sunflower genome assembly of the XRQ line by sequencing it at 100X depth with PacBio sequences only.


PacBio data obtained for the XRQ sunflower line

The first PacBio sequencing machine in France was installed at the end of March 2015. It is located at the GeT-PlaGe platform (INRA-Toulouse)

http://get.genotoul.fr

[email protected]

Toulouse, November 12th and 13th, 2014

• In 3 months, 407 SMRT Cells were produced with P6/C4 chemistry (GeT-PlaGe; IGM; Lausanne Univ.)

• Subread statistics:

#       MAX      N50 BP   NUM >= N50  MEAN     BP
37.5M   80.9 kb                                367 Gb (102x)


Improvements of the molecular biology steps have increased the length of the PacBio sequences (B. Mayjonade)

Statistics per SMRT cell, by sequencing site:

Site                             Cells        NUM     MAX    N50 BP  N50 NUM  N90 BP  N90 NUM  MEAN   MEDIAN  BP/SMRT cell
IGM (San Diego, USA)             202   mean   98666   45457  12211   28413    5666    68251    9176   9032    0.906 Gb
                                       max    146374  52725  12981   41602    6166    100937   9997   9809    1.36 Gb
Lausanne University (Switzerland) 59   mean   106800  46800  15172   28371    6505    71812    10773  9821    1.15 Gb
                                       max    144358  53253  16132   38325    7024    96979    11436  10568   1.6 Gb
GeT-PlaGe (France)               146   mean   77087   52317  15365   19706    6153    50240    10327  9153    800 Mb
                                       max    126777  80974  20507   33133    8422    83662    13635  12295   1.3 Gb

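As a consistency check, the per-site mean yields can be combined to recover the total amount of data. This sketch assumes a genome size of 3.6 Gb (the value given earlier in the talk) and reads the IGM mean yield as 0.906 Gb per cell; the variable names are illustrative.

```python
# (number of SMRT cells, mean yield per cell in Gb) for each sequencing site
SITES = {
    "IGM": (202, 0.906),
    "Lausanne": (59, 1.15),
    "GeT-PlaGe": (146, 0.800),
}
GENOME_GB = 3.6  # sunflower genome size used elsewhere in the talk

total_cells = sum(cells for cells, _ in SITES.values())
total_gb = sum(cells * mean_gb for cells, mean_gb in SITES.values())
depth = total_gb / GENOME_GB

print(f"SMRT cells: {total_cells}")
print(f"total yield: {total_gb:.1f} Gb")
print(f"depth: {depth:.0f}x")
```

The result agrees with the 407 SMRT cells, 367 Gb and 102x reported for the full data set.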


Preliminary and quick pre-filtering of the raw data: removal of repeats.

– Mapping of 1x of data on 2x of long reads (>= 20 kb)

– Analysis of the coverage of the long reads (only hits > 3 kb are analyzed)

– Repeat pattern identification (MHAP/MinHash)

[Figure: example of pattern, ~9 kb, Maxcov = 750]

[Figure axis: max depth of each repeat (max = 3750)]
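The coverage analysis step can be sketched as follows: hits from the mapped reads are piled up along a long read, and stretches where the depth exceeds a threshold are reported as candidate repeats. Only the 3 kb minimum hit length comes from the slide; the threshold value, the half-open coordinates and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

MIN_HIT_LEN = 3_000  # from the slide: only hits > 3 kb are analyzed


def coverage_profile(read_len: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Per-base depth along a long read from half-open (start, end) hits."""
    diff = [0] * (read_len + 1)
    for start, end in hits:
        if end - start > MIN_HIT_LEN:
            diff[start] += 1
            diff[end] -= 1
    profile, depth = [], 0
    for delta in diff[:read_len]:
        depth += delta
        profile.append(depth)
    return profile


def high_coverage_intervals(
    profile: List[int], threshold: int
) -> List[Tuple[int, int]]:
    """Half-open intervals where depth >= threshold (candidate repeats)."""
    intervals, start = [], None
    for pos, depth in enumerate(profile):
        if depth >= threshold and start is None:
            start = pos
        elif depth < threshold and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(profile)))
    return intervals


# Toy long read of 30 kb: five hits pile up on a 9 kb stretch,
# one hit covers 25 kb, and one 1.9 kb hit is ignored as too short.
hits = [(10_000, 19_000)] * 5 + [(0, 25_000), (100, 2_000)]
profile = coverage_profile(30_000, hits)
print(high_coverage_intervals(profile, threshold=3))
```

In a real data set the threshold would be set relative to the expected depth (here 1x of data mapped on the long reads), so that only regions covered far more often than expected are treated as repeats.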


Construction of a database containing repeated sequences

Length of the repeat (nt, max 36kb)


Removal of repeats from the PacBio raw data (102X)

[Figure: PacBio subreads matching a repeat in the database are suppressed; subreads with specific sequences are kept.]

PacBio subreads kept:

#       MAX      N50 BP   NUM >= N50  MEAN     BP
32.8M   80.9 kb  13.7 kb  9.1M        10.3 kb  339 Gb (94x)

8% of the raw data sequences are removed for the next steps in the assembly!


Assembly protocol with 100% PacBio data

• The 2 pipelines are similar

• Reads are first corrected by WGS (CABOG)

• The main differences are the default parameters used in the different versions of the software.

Step                         HGAP 3 (PacBio = PB)   PBcR (Koren et al.)
Correction of the reads
  Alignment                  PB/BLASR               MHAP (Berlin et al.) or PB/BLASR
  Correction                 PB/dagcon              PBcR (PB/falconcns | PB/dagcon)
Contiging
  Overlap                    CA/overlap             CA/overlap
  Layout                     CA/unitigger           CA/unitigger (bogart)
  Consensus                  CA/utgcns              CA/utgcns (pbutgcns)
Correction of the assembly
  Polishing                  PB/Quiver              PB/Quiver

We used PBcR to assemble the sunflower genome

For any questions, contact Jérôme Gouzy:

[email protected]


Correction of the data during the first assembly process

2 strategies for filtering of the HSPs

CR1: Max_Overhang < 2000 nt, MIN_Overlap >= 5000 nt

#       MAX     N50 BP   MEAN     BP
11.2M   59 kb   13.6 kb  11.2 kb  125 Gb (34x)

CR2: Reads >= 12 kb + Reads >= 3 kb, MIN_Overlap >= 3000 nt, LEN_Overlap >= 50% of the short sequence

#       MAX     N50 BP   MEAN     BP
19.7M   58 kb   11.5 kb  9 kb     180 Gb (50x)

The CR2 strategy is less stringent but seems to be accurate (evaluated on previously characterized genomic regions from sequenced BAC clones)
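The two filtering strategies can be written as simple predicates on an overlap (HSP) between two reads. This is a sketch under assumptions: the `Hsp` fields and function names are hypothetical, only the thresholds come from the slide, and the read-length selection of CR2 (reads >= 12 kb plus reads >= 3 kb) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One overlap between two reads (hypothetical field names)."""
    overlap_len: int        # length of the aligned region, in nt
    overhang: int           # longest unaligned tail inside the overlap zone, in nt
    shorter_read_len: int   # length of the shorter of the two reads, in nt


def passes_cr1(hsp: Hsp) -> bool:
    """CR1: Max_Overhang < 2000 nt and MIN_Overlap >= 5000 nt."""
    return hsp.overhang < 2_000 and hsp.overlap_len >= 5_000


def passes_cr2(hsp: Hsp) -> bool:
    """CR2: MIN_Overlap >= 3000 nt and overlap >= 50% of the shorter read."""
    return (
        hsp.overlap_len >= 3_000
        and hsp.overlap_len >= 0.5 * hsp.shorter_read_len
    )


examples = [
    Hsp(overlap_len=4_000, overhang=500, shorter_read_len=6_000),
    Hsp(overlap_len=6_000, overhang=2_500, shorter_read_len=20_000),
    Hsp(overlap_len=6_000, overhang=100, shorter_read_len=8_000),
]
for hsp in examples:
    print(hsp, "CR1:", passes_cr1(hsp), "CR2:", passes_cr2(hsp))
```

The first example shows why CR2 retains more data: a 4 kb overlap on a 6 kb read is rejected by CR1 but accepted by CR2.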


Sunflower genome assembly evolution according to PacBio sequencing depth

Coverage of the genome and N50 of the contigs increase with the depth of the raw data.


Effect of CR2 correction strategy

With only 18X depth and 2 days of computation (PBcR 8.3rc1), we obtained an assembly with metrics similar to the previous assembly obtained with 127X of HiSeq data


Assembly with full data set (102X)

#ctg     MAX    N50 BP   # > N50  MEDIAN   Gb
13 124   4.4M   498 kb   1700     118 kb   3.03

• Once the raw data (102X) are corrected, the assembly metrics seem very good!

Better assembly than ever!

The coverage is twice that obtained with 127X of HiSeq data (4 sizes of PE and MP)

3 Gb of sequence without N, only 13 124 contigs, with an N50 of almost 500 kb.

84% of the sunflower genome is covered
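The coverage percentages quoted in the talk follow from dividing the assembled bases without N by the genome size. A minimal sketch, assuming the 3.6 Gb genome size given earlier; the dictionary keys are just labels.

```python
GENOME_BP = 3.6e9  # sunflower genome size given earlier in the talk

# Assembled bases without N for each assembly, from the slides.
ASSEMBLIES_BP = {
    "XRQ, HiSeq 127X": 1_556_693_944,
    "HA412, 454 + HiSeq": 2_244_433_056,
    "XRQ, PacBio 102X": 3.03e9,
}

coverage = {name: bp / GENOME_BP for name, bp in ASSEMBLIES_BP.items()}
for name, fraction in coverage.items():
    print(f"{name}: {100 * fraction:.0f}% of the genome")

# Gain of the PacBio assembly over the HiSeq assembly of the same line.
gain = ASSEMBLIES_BP["XRQ, PacBio 102X"] / ASSEMBLIES_BP["XRQ, HiSeq 127X"]
print(f"PacBio vs HiSeq (XRQ): {gain:.2f}x more sequence")
```

The ratio of about 1.9 is what the slide summarizes as "twice" the coverage.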


Does the assembly contain the gene content?

We have obtained a reference transcriptome from the same XRQ sunflower line that can be accessed using this web link: https://www.heliagene.org/HaT13l

Jérôme Gouzy, Sébastien Carrère, Nicolas Langlade (LIPM, INRA-Toulouse)

eFP browser


Does the genome assembly contain the gene content?

• The gene space is almost complete:

– 98% of the cDNAs from the reference transcriptome can be fully aligned (gmap)


What's next to obtain the sequences of the 17 sunflower chromosomes of the nuclear genome?

• Several high-throughput genetic maps and the physical map (fingerprinting of the BAC clones) will be used to produce the 17 pseudomolecules of the nuclear genome.

• We are confident that we should be able to anchor 90% of the PacBio contigs.

• This sequence is expected for the end of 2015

• But the task is not easy because of the repeats, and we already know that some of the contigs are chimeric.

Chris Grassa (INRA):

[email protected]


Accuracy of the assembly is our priority

We want the sunflower genome sequence to be accurate and reliable.

Metrics of an assembly are one thing, accuracy is another!

And the quality of the genome sequence is important for genetics and breeding!


Conclusions

– Using PacBio sequences only, we have improved the coverage of the sunflower genome from 43% (127X HiSeq) to 84% (102X PacBio), and the size of the contigs has greatly increased.

– The software (smrtanalysis from PacBio or PBcR) is « easy » to use and efficient for small or simple genomes.

– But for complex genomes, it is difficult. It should be easier with longer sequences (a majority of sequences >30 kb or 40 kb are needed)!


Comparison of different software

• Sequencing of a new sunflower line: PSC8 (52x)

• From reads corrected by PBcR:

#      MAX     N50 BP   MEAN   BP
7.5M   59 kb   13.6 kb  9 kb   70.1 Gb (19.6x)

Assembler                                                             #ctg     MAX     N50 BP   # > N50  MEDIAN  Gb
PBcR/WGS                                                              26 273   2.5M    223 kb   3 799    66 kb   3.1
FALCON, default parameters                                            35 066   1.0M    101 kb   6 212    38 kb   2.05
FALCON, control of the repeats at the end of the reads deactivated    36 197   1.45M   202 kb   4 832    36 kb   3.2


Perspectives

• Assembly of heterozygous genomes (wild sunflowers, for example)

• Evaluation of other de novo assembly software

The genome of Orobanche cumana (19 chromosome pairs, 1.42 Gb) will be sequenced (100X PacBio). Sequencing has begun.

Many thanks


Get-PlaGe:

Cécile Donadieu

Gérald Salin

Céline Vandecasteele

Denis Milan

CNRGV:

Hélène Bergès

William Marande

Sonia Vautrin

LIPM:

Jérôme Gouzy

Baptiste Mayjonade

Nicolas Langlade

Chris Grassa

Sébastien Carrère

Erika Sallet

Ludovic Legrand

Marie-Claude Boniface

Nicolas Pouilly