ANR-11-BTBR-0005
The SUNRISE on the sunflower genome
sequence
Stéphane Muños
Baptiste Mayjonade: molecular biology
Jérôme Gouzy: bioinformatics
Barcelona, November 10th 2015
4th EMEA User Group Meeting
Toulouse: a unique place for sunflower genetics and genomics
Many facilities are available for phenotyping, physiology,
genomics, bioinformatics, microscopy…
Major sunflower seed companies (Syngenta Seeds,
Pioneer, Biogemma, SOLTIS, Maïsadour…) are
located around Toulouse
INRA National Seeds Resources Center
Three main topics in our team: water stress tolerance,
broomrape resistance,
downy mildew resistance.
Sunflower (Helianthus annuus) history
3000 BC: domestication
16th century: introduced to Europe by the Spanish.
18th century: beginning of sunflower breeding in Russia.
Sunflower oil became successful thanks to Orthodox Christianity: it was
the only oil allowed to be consumed during Lent.
20th century: modern breeding
Source: National Sunflower Association (www.sunflowernsa.com) Photo © NASA
Diversity in cultivated sunflower lines
Cadic et al., 2013
Introgression from wild
Elite lines
Very little diversity in the elite lines due to breeding.
Oilseed crop cultivated in dry and marginal land
High impact of climate change
Source: Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report
Moriondo et al., Climatic Change, 2010
FAO 2013
Yield losses: Moriondo et al. 2010
20 to 50% in Mediterranean region
0.4q /ha /day of stress
France: 620 000 ha 8 M€ / day
World: 25 000 000 ha >100M€ /day
Definition of a new ideotype: a combination of phenotypes that is genetically
realistic and adapted to crop management
Sunflower crops are expected to be strongly affected by climate change
2012-2019
21 million € project
7 million € from ANR
(French government)
10 partners
6 private partners
8 public labs
Goal: improve the stability of oil yield under water stress
Coordinated by Nicolas Langlade (INRA, LIPM)
The SUNRISE project
GWA-hybrid genetic design
36 males x 36 females = 1296 possible hybrids
471 hybrids produced
[Table: 36 x 36 crossing matrix of male and female parental lines (numbered 1 to 36); an X marks each cross. The positions of the individual X marks are not recoverable; only the marginal totals are.]
Crosses per row line (1 to 36): 15 15 15 15 15 15 15 15 14 14 13 13 14 13 13 13 13 12 14 14 13 13 14 13 13 13 13 12 14 14 13 13 14 13 13 13
Crosses per column line (1 to 36): 15 15 15 15 15 15 15 15 14 14 13 13 12 14 14 13 13 12 14 14 13 13 12 14 14 13 13 12 14 14 13 13 12 14 14 13
Matrix total: 491
Molecular characterization of the 72 parental lines
Goal: identification of SNPs in the 72 parental lines to deduce the genotypes of
the 471 hybrids for association mapping analysis
SNPs have been identified by resequencing experiments and
by mapping the data on the available sunflower genome
sequence
The sunflower genome:
Diploid, 2n = 34 (17 chromosome pairs)
3.6 Gb
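The goal stated on this slide, deducing hybrid genotypes from the SNPs of the parental lines, can be sketched as follows. This is only an illustration: it assumes that each parental line is fully homozygous (so it carries a single allele per SNP), and the SNP names, alleles and function names are hypothetical.

```python
from typing import Dict


def hybrid_genotype(female_allele: str, male_allele: str) -> str:
    """Genotype of an F1 hybrid at one SNP.

    Assumption: each parent is homozygous, so it transmits its single allele.
    The two alleles are sorted so that 'CT' and 'TC' are written the same way.
    """
    return "".join(sorted(female_allele + male_allele))


def deduce_hybrid(female: Dict[str, str], male: Dict[str, str]) -> Dict[str, str]:
    """Deduce hybrid genotypes at every SNP genotyped in both parents."""
    shared_snps = female.keys() & male.keys()
    return {snp: hybrid_genotype(female[snp], male[snp]) for snp in sorted(shared_snps)}


# Hypothetical parental genotypes (one allele per SNP, homozygous lines).
female_line = {"snp1": "A", "snp2": "C", "snp3": "G"}
male_line = {"snp1": "A", "snp2": "T", "snp4": "G"}

hybrid = deduce_hybrid(female_line, male_line)
print(hybrid)  # {'snp1': 'AA', 'snp2': 'CT'}
```

SNPs genotyped in only one parent (snp3 and snp4 here) are left out, since the hybrid genotype cannot be deduced for them.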
Assembly of the XRQ sunflower line genome obtained by
INRA
NUM 1007165
MIN 250
MAX 288882
N50 BP 18479
N50 NUM 25861
MEAN 1888
MEDIAN 392
BP 1902331496
=========================
NUM-noN 1007165
MIN-noN 250
MAX-noN 237406
N50 BP-noN 9402
N50 NUM-noN 34006
MEAN-noN 1545
MEDIAN-noN 392
BP-noN 1556693944
127X depth of HiSeq sequences used
PE and MP sequences (4 insert sizes: 300, 2300, 6700, 18500)
Coverage: 43% of the genome
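For readers unfamiliar with the metrics above, here is a minimal sketch of how an N50 is usually computed. It assumes that "N50 BP" is the length of the contig at which the cumulative length (contigs sorted from longest to shortest) reaches half of the total, and that "N50 NUM" is the number of contigs needed to get there. The function name and the toy lengths are illustrative only.

```python
from typing import List, Tuple


def n50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50 length, number of sequences needed to reach half the total)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, rank
    raise ValueError("cannot compute N50 of an empty list")


# Toy example: total = 30, half = 15; 10 + 8 = 18 reaches it at the 2nd contig.
n50_bp, n50_num = n50([2, 10, 4, 8, 6])
print(n50_bp, n50_num)  # 8 2
```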
Use of the XRQ assembly to identify SNPs from resequencing experiments
1 lane of HiSeq (2x100 nt) per sunflower line
Total sequences produced (Q30): 2 546 Gb (727X the sunflower genome size)
Mean depth: 10.1X
%GC: 38
Combination of bwa, mpileup and varscan:
6 348 868 SNPs identified on 161 955 genomic
scaffolds
But the SNPs are not exhaustive
AND no structural variations have been identified
The sunflower genome sequence needs to be
improved!
The international consortium for the sunflower genome initiative
(HA412 sunflower line)
Coordinated by Loren Rieseberg
(University of British Columbia, Canada)
NUM 21
MIN 9814
MAX 359367108
N50 BP 226777971
N50 NUM 8
MEAN 190544064
MEDIAN 208730832
BP 4001425362
=========================
NUM-noN 21
MIN-noN 9814
MAX-noN 254638407
N50 BP-noN 119919823
N50 NUM-noN 7
MEAN-noN 106877764
MEDIAN-noN 114409345
BP-noN 2244433056
Obtained by combining 454 and
HiSeq data, together with a
high-throughput genetic map and a
physical map (BAC clone fingerprinting)
Coverage: 62% of the genome
A good assembly.
But difficult to improve
(due to the repeated sequences)
Why is it so difficult to assemble the sunflower genome?
Analysis of the composition of the LTR retrotransposons with LTRharvest (D. Ellinghaus
et al. 2008, default parameters)
30% of the sunflower genome
sequence is composed of LTR
retrotransposons.
For comparison: 8.8% of the human genome.
And the repeats are highly
conserved in sunflower.
[Figure: number of hits versus length of LTR retrotransposons (nt)]
There are many repeated sequences in the sunflower genome. They are large (9-12 kb)
and highly conserved.
For de novo assembly, it is important to have very long reads
that span the full length of the repeats.
Why could PacBio sequencing help to improve the sunflower
genome assembly?
1. Jérôme Gouzy et al. obtained very good results on other organisms (bacteria and
fungi)
2. In map-based cloning projects on sunflower, BAC clone sequencing has been
greatly improved thanks to PacBio sequencing (coll. LIPM-CNRGV). We systematically
obtained a single contig without any N, even when mixing several overlapping
BAC clones.
We decided to improve the sunflower genome assembly
of the XRQ line by sequencing it at 100X depth with
PacBio sequences only.
PacBio data obtained for the XRQ sunflower line
The first PacBio sequencing machine in France was installed
at the end of March 2015
It is located in the GeT-PlaGe platform (INRA-Toulouse)
http://get.genotoul.fr
• In 3 months, 407 SMRT Cells were produced
with P6/C4 chemistry (GeT-PlaGe; IGM; Lausanne Univ.)
• Subread statistics:
#: 37.5M
MAX: 80.9 kb
BP: 367 Gb (102x)
[N50 BP, NUM >= N50 and MEAN values not recoverable]
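The depth figure quoted with the subread statistics can be checked with a one-line calculation: depth is the total number of sequenced bases divided by the genome size. The sketch below assumes the 3.6 Gb genome size given earlier in the talk; the function and constant names are illustrative.

```python
GENOME_SIZE_BP = 3.6e9  # sunflower genome size quoted in the talk (3.6 Gb)


def sequencing_depth(total_bp: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Average depth = total sequenced bases / genome size."""
    return total_bp / genome_size_bp


raw_depth = sequencing_depth(367e9)  # 367 Gb of PacBio subreads
print(f"{raw_depth:.1f}x")  # 101.9x
```

Rounded to the nearest integer, this gives the 102x reported on the slide.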
Improvements of the molecular biology steps have increased the length of
the PacBio sequences (B. Mayjonade)
                                                     NUM     MAX  N50 BP  N50 NUM  N90 BP  N90 NUM   MEAN  MEDIAN  BP/SMRT cell
IGM (San Diego, USA), 202 SMRT cells
  mean                                             98666   45457   12211    28413    5666    68251   9176    9032  0.906 Gb
  max                                             146374   52725   12981    41602    6166   100937   9997    9809  1.36 Gb
Lausanne University (Switzerland), 59 SMRT cells
  mean                                            106800   46800   15172    28371    6505    71812  10773    9821  1.15 Gb
  max                                             144358   53253   16132    38325    7024    96979  11436   10568  1.6 Gb
GeT-PlaGe (France), 146 SMRT cells
  mean                                             77087   52317   15365    19706    6153    50240  10327    9153  800 Mb
  max                                             126777   80974   20507    33133    8422    83662  13635   12295  1.3 Gb
Preliminary and quick pre-filtering of the raw data: removal of
repeats.
– Mapping of 1x of data on 2x of long reads (>= 20Kb)
– Analysis of the coverage of the long reads (only hits > 3kb are analyzed)
– Repeats pattern identification (MHAP/MinHash)
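The coverage analysis in the list above can be illustrated with a small sketch: hits on a long read are turned into a per-base coverage profile, and stretches whose coverage exceeds a threshold are reported as candidate repeats. This is not the authors' code. The half-open interval representation, the function names and the threshold value are assumptions made for the example; only the "hits > 3 kb" rule comes from the slide.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on the long read


def coverage_profile(read_length: int, hits: List[Interval], min_hit_len: int = 3000) -> List[int]:
    """Per-base coverage of a long read, counting only hits longer than min_hit_len."""
    diff = [0] * (read_length + 1)
    for start, end in hits:
        if end - start > min_hit_len:  # only hits > 3 kb are analyzed
            diff[start] += 1
            diff[end] -= 1
    coverage, running = [], 0
    for delta in diff[:read_length]:
        running += delta
        coverage.append(running)
    return coverage


def high_coverage_regions(coverage: List[int], threshold: int) -> List[Interval]:
    """Maximal stretches where coverage is strictly above the threshold."""
    regions, start = [], None
    for pos, cov in enumerate(coverage):
        if cov > threshold and start is None:
            start = pos
        elif cov <= threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


# Toy long read of 20 kb: one unique hit, three hits piling up on the same
# 9 kb stretch, and one 2 kb hit that is ignored because it is too short.
hits = [(0, 4000), (5000, 14000), (5000, 14000), (5000, 14000), (5000, 7000)]
cov = coverage_profile(20000, hits)
print(high_coverage_regions(cov, threshold=2))  # [(5000, 14000)]
```

With real data the threshold would be set relative to the expected depth of unique sequence, which the toy value of 2 does not attempt to model.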
[Figure: example of a repeat pattern, ~9 kb long, Maxcov = 750]
Construction of a database containing repeated sequences
[Figure: maximum depth of each repeat (max = 3750) versus length of the repeat (nt, max 36 kb)]
Removal of repeats from the PacBio raw data (102X)
PacBio subreads:
Repeat in database: suppressed
Specific sequences: kept
Kept:
#       MAX       N50 BP    NUM >= N50   MEAN      BP
32.8M   80.9 kb   13.7 kb   9.1M         10.3 kb   339 Gb (94x)
8% of the raw sequence data are removed before the next steps of the
assembly!
Assembly protocol with 100% PacBio data
• The 2 pipelines are close
• Reads are first corrected by WGS (CABOG)
• The main differences are the default parameters used in the different versions of the
software.
Step                          HGAP 3 (PacBio = PB)   PBcR (Koren et al.)
Correction of the reads
  Alignment                   PB/BLASR               MHAP (Berlin et al.) or PB/BLASR
  Correction                  PB/dagcon              PBcR (PB/falconcns | PB/dagcon)
Contiging
  Overlap                     CA/overlap             CA/overlap
  Layout                      CA/unitigger           CA/unitigger (bogart)
  Consensus                   CA/utgcns              CA/utgcns (pbutgcns)
Correction of the assembly
  Polishing                   PB/Quiver              PB/Quiver
For any questions, contact
Jérôme Gouzy:
We used PBcR to assemble the sunflower genome
Correction of the data during the first
assembly process
2 strategies for filtering the HSPs
CR1: Max_Overhang < 2000 nt, MIN_Overlap >= 5000 nt
#       MAX     N50 BP    MEAN      BP
11.2M   59 kb   13.6 kb   11.2 kb   125 Gb (34x)
CR2: Reads >= 12 kb + Reads >= 3 kb, MIN_Overlap >= 3000 nt, LEN_Overlap >= 50% of the short sequence
#       MAX     N50 BP    MEAN      BP
19.7M   58 kb   11.5 kb   9 kb      180 Gb (50x)
The CR2 strategy is less stringent but seems to be accurate (evaluated on previously
characterized genomic regions from sequenced BAC clones)
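To make the two filtering strategies more concrete, here is one possible reading of the thresholds written as Python predicates. This is an interpretation, not the authors' implementation. The `Hsp` fields, the function names and the way the read-length conditions are assigned to the two reads are assumptions, and whether CR2 also applies an overhang limit is not stated, so none is applied here.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical summary of one alignment (HSP) between two reads, in nt."""
    target_len: int   # length of the read being corrected
    query_len: int    # length of the read aligned onto it
    overlap_len: int  # length of the aligned region
    overhang: int     # unaligned sequence beyond the overlap


def passes_cr1(h: Hsp) -> bool:
    """CR1 as read from the slide: Max_Overhang < 2000 nt and MIN_Overlap >= 5000 nt."""
    return h.overhang < 2000 and h.overlap_len >= 5000


def passes_cr2(h: Hsp) -> bool:
    """CR2 as read from the slide (assumed mapping of thresholds to reads):
    a read >= 12 kb aligned with a read >= 3 kb, overlap >= 3000 nt,
    and overlap covering at least 50% of the shorter read."""
    shorter = min(h.target_len, h.query_len)
    return (
        h.target_len >= 12000
        and h.query_len >= 3000
        and h.overlap_len >= 3000
        and h.overlap_len >= 0.5 * shorter
    )


short_overlap = Hsp(target_len=15000, query_len=6000, overlap_len=3500, overhang=500)
long_overlap = Hsp(target_len=15000, query_len=14000, overlap_len=8000, overhang=100)

print(passes_cr1(short_overlap), passes_cr2(short_overlap))  # False True
print(passes_cr1(long_overlap), passes_cr2(long_overlap))    # True True
```

Under this reading, CR2 keeps shorter overlaps that CR1 rejects, which is consistent with CR2 being described as less stringent and yielding more corrected data.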
Sunflower genome assembly evolution according to
PacBio sequencing depth
Coverage of the genome and N50 of the contigs increase with
the depth of the raw data.
Effect of the CR2 correction strategy:
with only 18X depth and 2 days of computation (PBcR 8.3rc1), we
obtained an assembly with metrics similar to those of
the previous assembly obtained with 127X of HiSeq data
Assembly with full data set (102X)
#ctg     MAX    N50 BP   # > N50   MEDIAN   Gb
13 124   4.4M   498 kb   1700      118 kb   3.03
• Once the raw data (102X) are corrected, the assembly
metrics seem very good!
Better assembly than ever!
The coverage is twice that obtained with 127X of
HiSeq data (4 insert sizes of PE and MP)
3 Gb of sequence without any N, in only 13 124 contigs, with
an N50 of almost 500 kb.
84% of the sunflower genome is covered
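The coverage percentages quoted in the talk are consistent with dividing the amount of non-N assembled sequence by the 3.6 Gb genome size. A short sketch of that arithmetic follows, using the figures from the assembly slides; the dictionary labels and the function name are illustrative.

```python
GENOME_SIZE_BP = 3.6e9  # sunflower genome size quoted in the talk


def coverage_percent(assembled_bp: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Share of the genome represented by assembled (non-N) sequence, in percent."""
    return 100 * assembled_bp / genome_size_bp


# Non-N assembled bases reported for each assembly.
assemblies = {
    "XRQ, HiSeq 127X": 1_556_693_944,
    "HA412, consortium": 2_244_433_056,
    "XRQ, PacBio 102X": 3.03e9,
}

for name, assembled_bp in assemblies.items():
    print(f"{name}: {coverage_percent(assembled_bp):.0f}%")
# XRQ, HiSeq 127X: 43%
# HA412, consortium: 62%
# XRQ, PacBio 102X: 84%
```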
Does the assembly contain the gene content?
We have obtained a reference transcriptome from the same XRQ sunflower line that can be accessed using this web link: https://www.heliagene.org/HaT13l
Jérôme Gouzy, Sébastien Carrère, Nicolas Langlade (LIPM, INRA-Toulouse)
eFP browser
Does the genome assembly contain the gene content?
• The gene space is almost complete:
– 98% of the cDNAs from the reference transcriptome can be
fully aligned (gmap)
What’s next to obtain the sequences of the 17 chromosomes
of the sunflower nuclear genome?
• Several high-throughput genetic maps and the physical map (fingerprinting of the BAC clones) will be used to produce the 17 pseudomolecules of the nuclear genome.
• We are confident that we should be able to anchor 90% of the PacBio contigs.
• This sequence is expected by the end of 2015
• But the task is not easy because of the repeats, and we already know that some of the contigs are chimeric.
Chris Grassa (INRA):
Accuracy of the assembly is our priority
We want the sunflower genome sequence to be accurate and reliable.
Metrics of an assembly are one thing, accuracy is another!
And the quality of the genome sequence is important for genetics and
breeding!
Conclusions
– Using PacBio sequences only, we have improved the coverage
of the sunflower genome from 43% (127X HiSeq) to 84% (102X
PacBio), and the size of the contigs has greatly
increased.
– The software (smrtanalysis from PacBio or PBcR) is
« easy » to use and efficient for small or simple genomes.
– But for complex genomes, it is difficult. It should be easier
with longer sequences (a majority of sequences >30 kb
or 40 kb is needed)!
Comparison of different software
• Sequencing of a new sunflower line: PSC8 (52x)
• From reads corrected by PBcR:
#      MAX     N50 BP    MEAN   BP
7.5M   59 kb   13.6 kb   9 kb   70.1 Gb (19.6x)
Assembler                                               #ctg     MAX     N50 BP   # > N50   MEDIAN   Gb
PBcR/WGS                                                26 273   2.5M    223 kb   3 799     66 kb    3.1
FALCON, default parameters                              35 066   1.0M    101 kb   6 212     38 kb    2.05
FALCON, control of repeats at read ends deactivated     36 197   1.45M   202 kb   4 832     36 kb    3.2
Perspectives
• Assembly of heterozygous genomes (wild sunflowers, for example)
• Evaluation of other de novo assembly software
The genome of Orobanche cumana (19 chromosome
pairs, 1.42 Gb) will be sequenced (100X PacBio).
Sequencing has begun.
Many thanks
Toulouse, November 12th & 13th 2014, SUNRISE
Get-PlaGe:
Cécile Donadieu
Gérald Salin
Céline Vandecasteele
Denis Milan
CNRGV:
Hélène Bergès
William Marande
Sonia Vautrin
LIPM:
Jérôme Gouzy
Baptiste Mayjonade
Nicolas Langlade
Chris Grassa
Sébastien Carrère
Erika Sallet
Ludovic Legrand
Marie-Claude Boniface
Nicolas Pouilly