Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010
1
Computing for the Analysis of Genomic Data at CRS4
Chris Jones, 24th March 2010
Thursday 25 March 2010
2
Who is Chris Jones?
• 10 years of particle physics research at Oxford and CERN in Geneva
• Strong interest in the use of computers to do things, especially science, BETTER
• The ’70s brought digital detectors and massive waves of new data to particle physics, causing exciting major changes in the use of, and attitudes towards, computers
• 20 years of innovating, building, developing and running services in the CERN Computer Centre Facility
3
Wellcome Trust Genome Campus
• Escaped on sabbatical to the European Bioinformatics Institute (EBI)
• Strong links to Sanger Institute
• And to Roche – Roche Genetics IT Plan
• Founded the PRISM Forum
5
Why Sequence Genomes?
• I hope Francesco has explained that very well
• Genomic sequence is the most fundamental information, the starting point, when you look at how living objects work…
• And studies of “genotype” versus “phenotype” can bring us an understanding of the origins of disease which has been completely out of reach until now
• The technology is just becoming available…
6
DNA sequence and genes look like…
cacaattacttccacaaatgcagttgaagcttctactcttcttgcataggtaacctgagtcggagcagttttcctcgtggcttcatctttggtgctggatcttcagcataccaatttgaaggtgcagtaaacgaaggcggtagaggaccaagtatttgggataccttcacccataaatatccagaaaaaataagggatggaagcaatgcagacatcacggttgc
7
The Human Genome
• The nucleotide bases are: a = adenine, c = cytosine, g = guanine, t = thymine
• It took 15 years to produce the first human genome sequence
• It was released between 2003 and 2005
• There are 3 × 10^9 bases, or 3 Gigabases, in the human genome
• Pine trees have ~10 times more bases! Why?
• Do not confuse Gb (gigabits), GB (gigabytes) and Gbases (also written Gb)!
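The difference between the three units can be made concrete with a small calculation. This is only a sketch: the 2-bits-per-base packing and the one-letter-per-byte text encoding are illustrative assumptions, not something stated on the slides.

```python
# Storage for one human genome under two illustrative encodings.
GENOME_BASES = 3 * 10**9          # 3 Gbases, from the slide

BITS_PER_BASE_PACKED = 2          # assumption: a, c, g, t packed into 2 bits
BYTES_PER_BASE_TEXT = 1           # assumption: one ASCII letter per base

packed_bits = GENOME_BASES * BITS_PER_BASE_PACKED
packed_gigabits = packed_bits / 10**9          # Gb, gigabits
packed_gigabytes = packed_bits / 8 / 10**9     # GB, gigabytes
text_gigabytes = GENOME_BASES * BYTES_PER_BASE_TEXT / 10**9

print(f"{GENOME_BASES / 10**9:.0f} Gbases")
print(f"packed:  {packed_gigabits:.0f} Gb = {packed_gigabytes:.2f} GB")
print(f"as text: {text_gigabytes:.0f} GB")
```

The same 3 Gbases come out as 6 gigabits, 0.75 gigabytes or 3 gigabytes depending on the encoding, which is why the unit has to be spelled out.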
8
Genome Analyzer IIx
In Building 3 (Edificio 3)
• Two GAIIx machines, each of which delivers:
− 40 Gbases / run
− Paired-end reads
− 4 Gbases / day
• but which are complex, forefront technology...
9
Genome Analyzer IIx
Preparation Workflow
Sample Prep
Pipeline Analysis
10
Genome Analyzer IIx
FlowCell
• 8 lanes
• 120 tiles per lane (2 columns of 60 tiles)
• 4 pictures per tile (one per fluorophore: A, T, G, C)
• ~220k clusters on each tile
11
How much data per run?
• 7.3 MBytes image data per tile * 120 tiles * 8 lanes = 7 000 MBytes = 7 GigaBytes
• * 4 images per cycle (one per base) * read length (say 100 cycles) = 2 800 GBytes or 2.8 TeraBytes (TB)
• * 2 for the paired end = 5.6 TBytes
• A run of ~1 week on both machines results in 11.2 TeraBytes of image data
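The arithmetic on this slide can be reproduced in a few lines. The sketch below uses only the slide's own figures; the constant and function names are made up for illustration, and 1 TB is taken as 10^6 MB.

```python
# Image data volume for one paired-end run, using the figures on the slide.
MB_PER_TILE = 7.3        # MBytes of image data per tile
TILES_PER_LANE = 120
LANES = 8
IMAGES_PER_CYCLE = 4     # one image per base: A, C, G, T
READ_LENGTH = 100        # cycles per read ("say 100" on the slide)
PAIRED_END = 2           # both ends of each fragment are read
MACHINES = 2

def run_volume_tb(read_length=READ_LENGTH, paired=PAIRED_END):
    """Image data for one run on one machine, in TBytes (1 TB = 10**6 MB)."""
    mb_per_image_set = MB_PER_TILE * TILES_PER_LANE * LANES
    total_mb = mb_per_image_set * IMAGES_PER_CYCLE * read_length * paired
    return total_mb / 10**6

per_machine = run_volume_tb()
print(f"per machine:   {per_machine:.2f} TB")
print(f"both machines: {per_machine * MACHINES:.2f} TB")
```

Without the slide's intermediate rounding (7 008 MBytes rounded to 7 GigaBytes) the totals come out marginally higher, at roughly 5.6 TB per machine and 11.2 TB for both.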
12
Keeping the raw data?
• If we run for ~40 weeks a year we have nearly 0.5 PetaBytes (1 PB = 10^15 Bytes or 1 000 000 000 000 000 Bytes)
• But if we throw the images away there is no chance to recover more sequence data from the images when a better (promised) algorithm comes along…
• So biology now faces the problem the physicists faced 35 years ago
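The yearly figure follows directly from the weekly one. A minimal sketch, assuming the 11.2 TB per week for both machines quoted earlier in the talk and 1 PB = 1000 TB; the function name is illustrative only.

```python
# Yearly raw image volume if every weekly run is kept.
TB_PER_WEEK = 11.2       # both machines, from the per-run estimate
WEEKS_PER_YEAR = 40      # "~40 weeks a year" on the slide

def yearly_volume_pb(tb_per_week, weeks):
    """Raw image data per year in PetaBytes (1 PB = 1000 TB)."""
    return tb_per_week * weeks / 1000

yearly_pb = yearly_volume_pb(TB_PER_WEEK, WEEKS_PER_YEAR)
print(f"{yearly_pb:.2f} PB per year")
```

This gives a little under 0.45 PB, which the slide rounds to "nearly 0.5 PetaBytes".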
13
Genome Analyzer IIx
Attach single molecules to surface
Amplify to form clusters
Cluster generation
10^3 molecules / µm
2.2·10^5 molecules/tile
15
Genome Analyzer IIx
• The identity of each base of each cluster is read off from sequential images (cycle by cycle)
Base Calling
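The idea of reading a base off each cycle's images can be illustrated with a deliberately naive caller. This is a toy sketch, not the Illumina pipeline's algorithm: the channel order, the per-cluster intensity tuples and the use of "N" for a blank call are all assumptions made for the example.

```python
# Toy base caller: one cluster, four channel intensities per cycle.
CHANNELS = "ACGT"   # assumption: order of the four intensity channels

def call_cluster(cycles):
    """Return the read for one cluster, one base per sequencing cycle.

    `cycles` is a list of 4-tuples of channel intensities. The brightest
    channel wins; a dark or tied cycle gives a blank call ("N").
    """
    read = []
    for intensities in cycles:
        best = max(intensities)
        if best <= 0 or intensities.count(best) > 1:
            read.append("N")
        else:
            read.append(CHANNELS[intensities.index(best)])
    return "".join(read)

example_cycles = [
    (0.9, 0.1, 0.0, 0.1),   # brightest channel: A
    (0.1, 0.2, 0.8, 0.1),   # brightest channel: G
    (0.0, 0.0, 0.0, 0.0),   # nothing detected: blank call
    (0.1, 0.7, 0.1, 0.2),   # brightest channel: C
]
print(call_cluster(example_cycles))
```

A production base caller also has to correct for channel cross-talk and signal decay over cycles, which this sketch ignores.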
18
Illumina Pipeline
ACTGCTATCTTTCGATTCGTACTGCTAGGCACCATCGCATTTCAGGACGTCCTGCTAGGCACCATCGCATCTCCATC
19
Experiment timeline for a 115-cycle experiment on the GA IIx
• Day 1: GA IIx start
• Day 10: Illumina Pipeline
• Day 13: BWA and Yun Li workflow
• Day 15: Quality-check tools
21
How much computing?
A software pipeline has been implemented at CRS4 to perform such operations automatically after a sequencing run ends
• 40 Gbases per run
• 370,000,000 sequences
• 4 samples per flowcell
• 7,000,000 megabytes of raw data produced per run
• 5 days for processing sequence data on the cluster
A huge load for the computer centre
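These per-run figures can be cross-checked against the flowcell geometry given earlier in the talk. The sketch below is a rough consistency check only; the variable names are illustrative, and it makes no assumption about how the 370,000,000 sequences were counted or filtered.

```python
# Rough consistency check of the per-run figures quoted in the talk.
GBASES_PER_RUN = 40
SEQUENCES_PER_RUN = 370_000_000
LANES = 8
TILES_PER_LANE = 120
CLUSTERS_PER_TILE = 220_000      # "~220k clusters" per tile

# Average number of bases delivered per sequence.
mean_read_length = GBASES_PER_RUN * 10**9 / SEQUENCES_PER_RUN

# Number of clusters on one flowcell.
clusters_per_flowcell = LANES * TILES_PER_LANE * CLUSTERS_PER_TILE

print(f"mean bases per sequence: {mean_read_length:.0f}")
print(f"clusters per flowcell:   {clusters_per_flowcell:,}")
```

The mean of about 108 bases per sequence is in line with the 115-cycle experiment described in the timeline.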
22
How much computing?
23
Quality Control
We realised we needed an audit by external experts of how well we were doing (or how badly)
We asked experts from the Sanger Institute and from Cancer Research, Cambridge, UK
We developed a quality-check process:
− Qualitative and quantitative evaluation of Illumina summary file parameters
− Evaluation of sequence quality (avg. number of “blank” base calls)
− Evaluation of coverage / holes
− Evaluation of the known/all SNPs found ratio
• This has been very successful
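Three of the checks listed above lend themselves to simple metrics. The following is a hypothetical sketch of what such metrics could look like, not the CRS4 quality-check code: the function names, the "N" symbol for a blank call, the (chromosome, position) SNP keys and the per-position depth list are all assumptions.

```python
# Hypothetical QC metrics in the spirit of the checks listed above.

def blank_call_rate(reads, blank="N"):
    """Fraction of base calls that are blank across all reads."""
    total = sum(len(read) for read in reads)
    blanks = sum(read.count(blank) for read in reads)
    return blanks / total if total else 0.0

def coverage_holes(depth, min_depth=1):
    """Half-open (start, end) intervals where depth falls below min_depth."""
    holes, start = [], None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos                      # a hole opens here
        elif d >= min_depth and start is not None:
            holes.append((start, pos))       # the hole closes here
            start = None
    if start is not None:
        holes.append((start, len(depth)))    # hole runs to the end
    return holes

def known_snp_ratio(found_snps, known_snps):
    """Share of the SNPs found that are already in a known-SNP catalogue."""
    found = set(found_snps)
    if not found:
        return 0.0
    return len(found & set(known_snps)) / len(found)

found = {("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr2", 70)}
known = {("chr1", 100), ("chr2", 50), ("chr3", 5)}
print(blank_call_rate(["ACGN", "NNAC"]))
print(coverage_holes([3, 0, 0, 2, 0]))
print(known_snp_ratio(found, known))
```

A low known/all ratio would suggest that many of the calls are artefacts rather than real variants, which is presumably why the ratio is worth tracking.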
24
Quality Check – Weekly Team Meeting
Qualitative and quantitative evaluation of Illumina summary file parameters:
− Based on Sanger QC protocol
− Quantitative examination of run results
− Qualitative inspection of plots
27
Summary of results
• In October 2008 we foresaw 6 Gbases per run per machine
• We started at the end of February 2009
• We started a Quality Control initiative in Sept. 2009
• We have continuously improved the number of bases per run:
− Upgrades of machines
− Preparation of samples (reagents, PCR)
− Increasing number of cycles
− New algorithms for image processing and base-calling – better alignment software
− Quality control
30
Activity summary - statistics
• 67 samples sequenced and aligned
• 6 samples currently running on the GAs
• Average coverage of samples: 2.98X
• ~800 Gbases of raw data
• ~590 Gbases of aligned data
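The quoted coverage can be sanity-checked from the other numbers. This is a back-of-the-envelope sketch, assuming the 3-Gbase genome size given earlier in the talk and an even split of aligned bases across samples; it is not necessarily how the 2.98X figure was obtained.

```python
# Back-of-the-envelope check of the activity statistics.
SAMPLES_ALIGNED = 67
RAW_GBASES = 800
ALIGNED_GBASES = 590
GENOME_GBASES = 3            # human genome size from the earlier slide

# Mean depth if the aligned bases were spread evenly over all samples.
mean_coverage = ALIGNED_GBASES / (SAMPLES_ALIGNED * GENOME_GBASES)

# Share of the raw bases that ended up aligned.
aligned_fraction = ALIGNED_GBASES / RAW_GBASES

print(f"mean coverage:    {mean_coverage:.2f}X")
print(f"aligned fraction: {aligned_fraction:.0%}")
```

The estimate lands close to the reported 2.98X; a small difference is expected because the slide's totals are rounded and the average may have been taken per sample.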
31
Imputation
• Program from Gonçalo Abecasis and Serena Sanna
• Very powerful tool in the analysis of population genetics
• Extrapolate measured data to infer more genomic variations that you have not measured
• Excellent e-Science, use the computer to do better science
• This certainly merits a seminar to itself
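The principle of inferring unmeasured variants from measured ones can be shown with a toy example. This is not the program mentioned on the slide: it swaps in a much simpler nearest-haplotype copy, and the function name, the reference panel and the use of `None` for unmeasured sites are all invented for illustration.

```python
# Toy imputation: copy missing sites from the closest reference haplotype.

def impute(sample, reference_panel):
    """Fill the None entries of `sample` from the best-matching reference.

    `sample` is a list of alleles with None at unmeasured sites;
    `reference_panel` is a list of fully known haplotypes of equal length.
    """
    def matches(ref):
        # Count agreement at the sites that were actually measured.
        return sum(1 for s, r in zip(sample, ref) if s is not None and s == r)

    best = max(reference_panel, key=matches)
    return [r if s is None else s for s, r in zip(sample, best)]

panel = ["ACGTA", "ATGCA", "GCGTT"]
measured = ["A", None, "G", "C", None]
print("".join(impute(measured, panel)))
```

Real imputation tools weigh many reference haplotypes statistically and report a confidence for each inferred genotype, rather than copying from a single best match.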
32
Plans and Visions
• Illumina has announced its latest sequencers, which will measure 200 Gbases in a run of 8 days
• 5 times our current performance in 20% less time
• Easy to predict 400 or 600 Gbases – 10 to 15 times as much data per run
• For the plans to sequence 2000 Sardinians together with NIH and with University at Ann Arbor, and also for other requests from the Park and from Sardinia, we would like to acquire some of these new machines
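The scaling claims can be checked against the current figures from earlier in the talk. In this sketch the 10-day current run length is read off the experiment timeline, and the ~3X coverage target for the 2000-genome estimate is an assumption based on the current average; the estimate also ignores bases lost in alignment.

```python
import math

# Current GA IIx figures versus the announced sequencer.
CURRENT_GBASES, CURRENT_DAYS = 40, 10    # 10 days: taken from the timeline
NEW_GBASES, NEW_DAYS = 200, 8

speedup = NEW_GBASES / CURRENT_GBASES
time_saving = 1 - NEW_DAYS / CURRENT_DAYS

# Runs needed for the 2000-genome plan.
GENOMES = 2000
GENOME_GBASES = 3
COVERAGE = 3                             # assumption: ~3X per sample
total_gbases = GENOMES * GENOME_GBASES * COVERAGE
runs_current = math.ceil(total_gbases / CURRENT_GBASES)
runs_new = math.ceil(total_gbases / NEW_GBASES)

print(f"{speedup:.0f}x the data in {time_saving:.0%} less time")
print(f"runs for {GENOMES} genomes: {runs_current} now, {runs_new} with new machines")
```

Under these assumptions the project drops from several hundred runs to fewer than a hundred, which is the practical argument for acquiring the new machines.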
33
My personal view
• This is an opportunity for Sardinia to play frontier science on a world stage
• It exploits the Sardinian genomic heritage and its increased “signal to noise” to find the origins and mechanisms of diseases that affect people around the world,
• and which ultimately cost Sardinia (and the rest of humanity) a lot of money
• It is driven by a predominantly Sardinian team doing excellent work
• It necessarily binds together the strong computer centre of CRS4 and modern digital sequencing technology to build a forefront Sequencing Facility
• If we don’t do this now we will lose a golden opportunity for ever
• Where else would you set up such a Facility?
34
Thank you for your attention!