Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

description

"Computing for the Analysis of Genomic Data at CRS4" (Chris Jones), presentation at the CRS4 Research Center. CRS4 Staff Meeting, 24-03-2010 (Pula, Sardinia, Italy)

Transcript of Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Page 1: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Computing for the Analysis of Genomic Data at CRS4

Chris Jones
24th March 2010

Thursday 25 March 2010

Page 8: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Who is Chris Jones?

• 10 years of particle physics research at Oxford and CERN in Geneva

• Strong interest in the use of computers to do things, especially science, BETTER

• The ’70s brought digital detectors and massive waves of new data to particle physics, causing exciting major changes in the use of, and attitudes towards, computers

• 20 years of innovating, building, developing and running services in the CERN Computer Centre Facility

Page 14: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Wellcome Trust Genome Campus

• Escaped on sabbatical to the European Bioinformatics Institute (EBI)

• Strong links to the Sanger Institute

• And to Roche: Roche Genetics IT Plan

• Founded the PRISM Forum

Page 15: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Why Sequence Genomes?

• I hope Francesco has explained that very well

• Genomic sequence is the most fundamental information, the starting point, when you look at how living organisms work…

• And studies of “genotype” versus “phenotype” can bring us an understanding of the origins of disease which has been completely out of reach until now

• The technology is just becoming available…

Page 16: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

DNA sequence and genes look like…

cacaattacttccacaaatgcagttgaagcttctactcttcttgcataggtaacctgagtcggagcagttttcctcgtggcttcatctttggtgctggatcttcagcataccaatttgaaggtgcagtaaacgaaggcggtagaggaccaagtatttgggataccttcacccataaatatccagaaaaaataagggatggaagcaatgcagacatcacggttgc

Page 23: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

The Human Genome

• The nucleotide bases are: a - adenine, c - cytosine, g - guanine, t - thymine

• It took 15 years to produce the first human genome sequence

• Which was released between 2003 and 2005

• There are 3*10^9 bases, or 3 Gigabases, in the human genome

• Pine trees have ~10 times more bases! Why?

• Do not confuse Gb (gigabits), GB (gigabytes) and Gbases (also written Gb)!
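To make the difference between Gbases, gigabits and gigabytes concrete, here is a minimal Python sketch. It assumes a packed encoding of 2 bits per base (four possible bases, hence two bits), which is an illustrative assumption and not something stated on the slide; the function name `genome_storage` is hypothetical.

```python
# Illustrative unit conversion for genome sizes.
# Assumption (not from the slide): each base is stored in 2 bits,
# since there are only four possible bases (a, c, g, t).

GENOME_BASES = 3 * 10**9  # 3 Gbases, as quoted on the slide
BITS_PER_BASE = 2         # assumed packed encoding


def genome_storage(bases: int, bits_per_base: int = BITS_PER_BASE) -> dict:
    """Return the size of `bases` bases expressed in three different units."""
    bits = bases * bits_per_base
    return {
        "gbases": bases / 1e9,        # count of bases
        "gigabits": bits / 1e9,       # storage in bits
        "gigabytes": bits / 8 / 1e9,  # storage in bytes
    }


if __name__ == "__main__":
    for unit, value in genome_storage(GENOME_BASES).items():
        print(f"{unit}: {value:g}")
```

Under this assumption the same genome is 3 Gbases, 6 gigabits and 0.75 gigabytes, which is why the three units must not be mixed up.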

Page 25: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Genome Analyzer IIx

In Edificio 3 (Building 3):

• Two GAIIx machines

• Each of which: 40 Gbases / run, paired-end reads, 4 Gbases / day

• But which are complex and forefront technology...

Page 26: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Genome Analyzer IIx

Preparation Workflow

Sample Prep

Pipeline Analysis

Page 27: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Genome Analyzer IIx

FlowCell

• 8 lanes

• 120 tiles per lane (2 columns of 60 tiles)

• 4 pictures per tile (one per fluorescence channel: A, T, G, C)

• On each tile ~220k clusters
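The flowcell figures can be turned into a rough capacity estimate. The sketch below is only illustrative: the read length of 100 bases and the factor of 2 for paired-end reads are taken from other slides of this talk, and the function names are hypothetical.

```python
# Rough flowcell capacity estimate from the figures on the slide.
LANES = 8
TILES_PER_LANE = 120          # 2 columns of 60 tiles
CLUSTERS_PER_TILE = 220_000   # "~220k clusters"

# Assumptions taken from other slides of this talk, not from this one:
READ_LENGTH = 100             # bases per read ("say 100")
READS_PER_CLUSTER = 2         # paired-end reads


def clusters_per_flowcell() -> int:
    """Total number of clusters on one flowcell."""
    return LANES * TILES_PER_LANE * CLUSTERS_PER_TILE


def gbases_per_run() -> float:
    """Approximate yield of one run in Gbases, if every cluster gives two full reads."""
    bases = clusters_per_flowcell() * READ_LENGTH * READS_PER_CLUSTER
    return bases / 1e9


if __name__ == "__main__":
    print(f"clusters per flowcell: {clusters_per_flowcell():,}")
    print(f"approximate yield: {gbases_per_run():.1f} Gbases per run")
```

The estimate comes out at roughly 211 million clusters and about 42 Gbases, in line with the 40 Gbases per run quoted for each machine.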

Page 32: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

How much data per run?

• 7.3 MBytes of image data per tile * 120 tiles * 8 lanes ≈ 7 000 MBytes = 7 GBytes

• * 4 images per cycle (one per base) * read length (say 100 cycles) = 2 800 GBytes or 2.8 TeraBytes (TB)

• * 2 for the paired end = 5.6 TBytes

• A run of ~1 week on both machines results in 11.2 TBytes of image data
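The arithmetic on this slide can be checked with a few lines of Python. This is a sketch under stated assumptions: it reads the factor of 4 as the four images taken per cycle (one per base, as on the FlowCell slide), uses decimal units (1 TB = 10^6 MB), and the constant and function names are hypothetical.

```python
# Image data volume per sequencing run, following the slide step by step.
MB_PER_TILE_IMAGE = 7.3   # MBytes of image data per tile
TILES_PER_LANE = 120
LANES = 8
IMAGES_PER_CYCLE = 4      # assumed: one image per base (A, C, G, T)
READ_LENGTH = 100         # cycles per read ("say 100")
PAIRED_END_FACTOR = 2     # two reads per cluster
MACHINES = 2


def image_data_tb(machines: int = MACHINES) -> float:
    """Image data in TBytes produced by one run on `machines` sequencers."""
    mb_per_flowcell_image = MB_PER_TILE_IMAGE * TILES_PER_LANE * LANES  # ~7 008 MB
    mb_per_run = (
        mb_per_flowcell_image * IMAGES_PER_CYCLE * READ_LENGTH * PAIRED_END_FACTOR
    )
    return mb_per_run * machines / 1e6  # MB -> TB (decimal units)


if __name__ == "__main__":
    print(f"one machine:   {image_data_tb(1):.1f} TB per run")
    print(f"both machines: {image_data_tb(2):.1f} TB per run")
```

Without the intermediate rounding to 7 000 MBytes the totals are about 5.6 TB for one machine and 11.2 TB for both, matching the slide.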

Page 33: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Keeping the raw data?

• If we run for ~40 weeks a year we have nearly 0.5 PetaBytes (1 PB = 10^15 Bytes or 1 000 000 000 000 000 Bytes)

• But if we throw the images away there is no chance to recover more sequence data from the images when a better (promised) algorithm comes along…

• So biology now faces the problem the physicists faced 35 years ago
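The yearly figure follows directly from the weekly one. A minimal sketch, assuming the 11.2 TBytes per week for both machines quoted on the previous slide and decimal units; the names are hypothetical.

```python
# Yearly raw image volume if every run's images were kept.
TB_PER_WEEK = 11.2       # both machines, one run of ~1 week (previous slide)
WEEKS_PER_YEAR = 40      # "~40 weeks a year"
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15


def yearly_image_data_pb(
    tb_per_week: float = TB_PER_WEEK, weeks: int = WEEKS_PER_YEAR
) -> float:
    """Image data accumulated in one year, in PetaBytes."""
    total_bytes = tb_per_week * weeks * BYTES_PER_TB
    return total_bytes / BYTES_PER_PB


if __name__ == "__main__":
    print(f"{yearly_image_data_pb():.2f} PB per year")
```

This gives about 0.45 PB per year, the "nearly 0.5 PetaBytes" of the slide.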

Page 34: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Genome Analyzer IIx

Attach single molecules to surface

Amplify to form clusters

Cluster generation

10^3 molecules / µm

2.2·10^5 molecules/tile

Page 35: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Genome Analyzer IIx

• The identity of each base of each cluster is read off from sequential images (cycle by cycle)

Base Calling
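The idea of reading a base off each cycle's images can be sketched as a toy base caller. This is not the Illumina pipeline's algorithm, only an illustration of the principle: for each cycle, the channel with the strongest signal for a cluster gives the base. The channel order, the `min_signal` threshold and the use of `N` for a call with no signal are all assumptions made for this example.

```python
from typing import Sequence

CHANNELS = "ACGT"  # assumed order of the four intensity channels


def call_bases(
    intensities: Sequence[Sequence[float]], min_signal: float = 0.0
) -> str:
    """Call one base per cycle for a single cluster.

    `intensities[i]` holds the four channel intensities measured for the
    cluster in cycle i. The brightest channel wins; if no channel exceeds
    `min_signal`, the base is reported as 'N'.
    """
    read = []
    for cycle in intensities:
        if len(cycle) != len(CHANNELS):
            raise ValueError("expected four intensities per cycle")
        best = max(range(len(CHANNELS)), key=lambda k: cycle[k])
        read.append(CHANNELS[best] if cycle[best] > min_signal else "N")
    return "".join(read)


if __name__ == "__main__":
    cluster = [
        (0.9, 0.1, 0.0, 0.1),  # cycle 1: A is brightest
        (0.1, 0.8, 0.1, 0.0),  # cycle 2: C is brightest
        (0.0, 0.1, 0.1, 0.7),  # cycle 3: T is brightest
        (0.0, 0.0, 0.0, 0.0),  # cycle 4: no signal at all
    ]
    print(call_bases(cluster))  # ACTN
```

Real base callers do considerably more work on the images before a decision like this is made, which is one reason keeping the raw images can matter.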

Page 36: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Illumina Pipeline

ACTGCTATCTTTCGATTCGTACTGCTAGGCACCATCGCATTTCAGGACGTCCTGCTAGGCACCATCGCATCTCCATC

Page 37: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Timing for 115 Cycles Experiment on GA IIx

Experiment Timeline

• Day 1: GA IIx start

• Day 10: Illumina Pipeline

• Day 13: BWA and Yun Li workflow

• Day 15: Quality-check tools

Page 38: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

How much computing?

A software pipeline has been implemented at CRS4 to perform such operations automatically after a sequencing run ends

• 40 Gbases per run

• 370,000,000 sequences

• 4 samples per flowcell

• 7,000,000 megabytes of raw data produced per run

• 5 days for processing sequence data on the cluster

A huge load for the computer centre

Page 39: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

How much computing?

Page 44: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Quality Control

• We realised we needed an audit by external experts of how well we were doing (or how badly)

• We asked experts from the Sanger Institute and from Cancer Research, Cambridge, UK

• We developed a quality check process:

− Qualitative and quantitative evaluation of Illumina summary file parameters

− Evaluation of sequence quality (avg. number of “blank” base calls)

− Evaluation of coverage / holes

− Evaluation of the ratio of known SNPs to all SNPs found

• This has been very successful
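Three of the checks listed above lend themselves to a small sketch. The code below is not the CRS4 quality-check process, only an illustration of what such metrics could look like: it assumes a "blank" base call is written as `N`, that coverage is given as a per-position depth list, and that SNPs are identified by hashable keys such as positions. All function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def mean_blank_calls(reads: Iterable[str], blank: str = "N") -> float:
    """Average number of blank base calls per read."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(r.count(blank) for r in reads) / len(reads)


def coverage_holes(depth: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where the depth is zero."""
    holes, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos                     # a hole begins here
        elif d != 0 and start is not None:
            holes.append((start, pos))      # the hole ended just before pos
            start = None
    if start is not None:
        holes.append((start, len(depth)))   # hole runs to the end
    return holes


def known_snp_ratio(found: Set[int], known: Set[int]) -> float:
    """Fraction of the SNPs found that are already known."""
    return len(found & known) / len(found) if found else 0.0


if __name__ == "__main__":
    print(mean_blank_calls(["ACGT", "ANNT", "NCGT"]))
    print(coverage_holes([3, 0, 0, 2, 1, 0]))
    print(known_snp_ratio({1, 2, 3, 4}, {2, 4, 9}))
```

A low ratio of known to found SNPs, many blank calls, or long holes in coverage would each be a reason to look at a run more closely.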

Page 45: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Quality Check: Weekly Team Meeting

Qualitative and quantitative evaluation of Illumina summary file parameters:

− Based on Sanger QC protocol

− Quantitative examination of run results

− Qualitative inspection of plots

Page 46: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Summary of results

• In October 2008 we foresaw 6 Gbases per run per machine

• We started at the end of February 2009

• We started a Quality Control initiative in Sept. 2009

• We have continuously improved the number of bases per run:

− Upgrades of machines

− Preparation of samples (reagents, PCR)

− Increasing number of cycles

− New algorithms for image processing and base-calling; better alignment software

− Quality control

Page 48: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Activity summary - statistics

• 67 samples sequenced and aligned

• 6 samples currently running on the GAs

• Average coverage of samples 2.98X

• ~800 Gbases of raw data

• ~590 Gbases of aligned data
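As a rough cross-check of these statistics, coverage can be estimated as aligned bases per sample divided by genome size. The sketch below assumes a genome of 3 Gbases (from the Human Genome slide) and that the aligned data are spread evenly over the 67 samples; the slide does not say how the 2.98X figure was actually computed, so this is only a plausibility check with hypothetical function names.

```python
# Plausibility check of the activity statistics.
RAW_GBASES = 800       # "~800 Gbases of raw data"
ALIGNED_GBASES = 590   # "~590 Gbases of aligned data"
SAMPLES = 67
GENOME_GBASES = 3      # assumed genome size, from the Human Genome slide


def aligned_fraction(raw: float = RAW_GBASES, aligned: float = ALIGNED_GBASES) -> float:
    """Share of the raw bases that ended up aligned."""
    return aligned / raw


def mean_coverage(
    aligned: float = ALIGNED_GBASES,
    samples: int = SAMPLES,
    genome: float = GENOME_GBASES,
) -> float:
    """Average coverage per sample, assuming an even split across samples."""
    return aligned / samples / genome


if __name__ == "__main__":
    print(f"aligned fraction: {aligned_fraction():.1%}")
    print(f"estimated mean coverage: {mean_coverage():.2f}X")
```

The estimate is a little over 2.9X with about 74% of raw bases aligned. The small difference from the quoted 2.98X is unsurprising, since the totals on the slide are rounded.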

Page 49: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Imputation

• Program from Gonçalo Abecasis and Serena Sanna

• Very powerful tool in the analysis of population genetics

• Extrapolate measured data to infer more genomic variations that you have not measured

• Excellent e-Science: use the computer to do better science

• This certainly merits a seminar to itself
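The principle of inferring unmeasured variants from measured ones can be shown with a deliberately simple toy. This is not the program by Abecasis and Sanna, which is not described in the talk; it is a nearest-match illustration that assumes a small reference panel of fully known haplotypes coded as 0/1 alleles, with `None` marking positions that were not measured in the sample. The function name `impute` is hypothetical.

```python
from typing import List, Optional, Sequence


def impute(sample: Sequence[Optional[int]], panel: List[Sequence[int]]) -> List[int]:
    """Fill unmeasured positions (None) from the most similar reference haplotype.

    Similarity is the number of measured positions at which the sample and
    a reference haplotype carry the same allele.
    """
    if not panel:
        raise ValueError("reference panel is empty")

    def matches(ref: Sequence[int]) -> int:
        return sum(1 for s, r in zip(sample, ref) if s is not None and s == r)

    best = max(panel, key=matches)  # first haplotype with the most matches
    return [r if s is None else s for s, r in zip(sample, best)]


if __name__ == "__main__":
    panel = [
        [0, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 1, 0, 0, 1],
    ]
    sample = [1, None, 0, 1, None]  # positions 1 and 4 were not measured
    print(impute(sample, panel))    # [1, 0, 0, 1, 1]
```

Real imputation tools rely on statistical models over many haplotypes rather than a single best match, but the input and output have the same shape: measured genotypes in, inferred genotypes at the unmeasured positions out.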

Page 50: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Plans and Visions

• Illumina has announced its latest sequencers, which will measure 200 Gbases in a run of 8 days

• 5 times our current performance in 20% less time

• Easy to predict 400 or 600 Gbases, 10 to 15 times as much data per run

• For the plans to sequence 2000 Sardinians together with NIH and with the University at Ann Arbor, and also for other requests from the Park and from Sardinia, we would like to acquire some of these new machines
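The "5 times in 20% less time" comparison can be reproduced from numbers given earlier in the talk. The sketch assumes that the current run length is 40 Gbases per run divided by 4 Gbases per day (the GAIIx figures), which is an inference rather than something stated on this slide; the names are hypothetical.

```python
# Comparing the announced sequencers with the current GAIIx figures.
CURRENT_GBASES_PER_RUN = 40
CURRENT_GBASES_PER_DAY = 4            # used to infer the current run length
NEW_GBASES_PER_RUN = 200
NEW_RUN_DAYS = 8
PROJECTED_GBASES_PER_RUN = [400, 600]


def compare_generations() -> dict:
    """Yield and run-time ratios between the new and the current machines."""
    current_run_days = CURRENT_GBASES_PER_RUN / CURRENT_GBASES_PER_DAY
    return {
        "current_run_days": current_run_days,
        "yield_ratio": NEW_GBASES_PER_RUN / CURRENT_GBASES_PER_RUN,
        "time_saving": 1 - NEW_RUN_DAYS / current_run_days,
        "projected_ratios": [
            g / CURRENT_GBASES_PER_RUN for g in PROJECTED_GBASES_PER_RUN
        ],
    }


if __name__ == "__main__":
    for key, value in compare_generations().items():
        print(f"{key}: {value}")
```

With these inputs the current run lasts about 10 days, the new machines give 5 times the yield in roughly 20% less time, and the projected 400 or 600 Gbases correspond to 10 and 15 times the current yield per run.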

Page 58: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

My personal view

• This is an opportunity for Sardinia to play a part in frontier science on a world stage

• It exploits the Sardinian genomic heritage and its increased “signal to noise” to find the origins and mechanisms of diseases that affect people around the world,

• and which ultimately cost Sardinia (and the rest of humanity) a lot of money

• It is driven by a predominantly Sardinian team doing excellent work

• It necessarily binds together the strong computer centre of CRS4 and modern digital sequencing technology to build a forefront Sequencing Facility

• If we don’t do this now we will lose a golden opportunity for ever

• Where else would you set up such a Facility?

Page 59: Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Thank you for your attention!
