Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

1

Computing for the Analysis of Genomic Data at CRS4

Chris Jones, 24th March 2010

Thursday 25 March 2010

2

Who is Chris Jones?

• 10 years of particle physics research at Oxford and CERN in Geneva

• Strong interest in the use of computers to do things, especially science, BETTER

• The ’70s brought digital detectors and massive waves of new data to particle physics, causing exciting major changes in the use of, and attitude towards, computers

• 20 years of innovating, building, developing and running services in the CERN Computer Centre Facility


3

Wellcome Trust Genome Campus

• Escaped on sabbatical to the European Bioinformatics Institute – EBI

• Strong links to the Sanger Institute

• And to Roche – Roche Genetics IT Plan

• Founded the PRISM Forum


5

Why Sequence Genomes?

• I hope Francesco has explained that very well

• Genomic sequence is the most fundamental information, the starting point, when you look at how living objects work…

• And studies of “genotype” versus “phenotype” can bring us an understanding of the origins of disease which has been completely out of reach until now

• The technology is just becoming available…


6

DNA sequence and genes look like…

cacaattacttccacaaatgcagttgaagcttctactcttcttgcataggtaacctgagtcggagcagttttcctcgtggcttcatctttggtgctggatcttcagcataccaatttgaaggtgcagtaaacgaaggcggtagaggaccaagtatttgggataccttcacccataaatatccagaaaaaataagggatggaagcaatgcagacatcacggttgc


7

The Human Genome

• The nucleotide bases are: a - adenine, c - cytosine, g - guanine, t - thymine

http://en.wikipedia.org/wiki/Adenine
http://en.wikipedia.org/wiki/Cytosine
http://en.wikipedia.org/wiki/Guanine
http://en.wikipedia.org/wiki/Thymine

• It took 15 years for the first human genome sequence

• Which was released between 2003 and 2005

• There are 3 × 10^9 bases, or 3 Gigabases, in the human genome





• Pine trees have ~10 times more bases! Why?

• Do not confuse Gb - bits, GB - Bytes, Gbases (Gb)!
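To make the difference between the three units concrete, here is a minimal sketch that converts a number of bases into gigabits and gigabytes. It assumes an idealised encoding of 2 bits per base (enough for the four symbols a, c, g, t) with no quality scores or ambiguity codes; the function name and the constant are illustrative only.

```python
BITS_PER_BASE = 2  # assumption: only a, c, g, t are stored, nothing else


def genome_storage(n_bases, bits_per_base=BITS_PER_BASE):
    """Return (gigabases, gigabits, gigabytes) using decimal (10^9) prefixes."""
    gbases = n_bases / 1e9
    gbits = n_bases * bits_per_base / 1e9
    gbytes = gbits / 8  # 8 bits per byte
    return gbases, gbits, gbytes


# The 3 Gigabase human genome under the 2-bit assumption
print(genome_storage(3_000_000_000))  # (3.0, 6.0, 0.75)
```

So the same genome is 3 Gbases, 6 Gbits and 0.75 GBytes under this assumption, which is why the abbreviation "Gb" alone is ambiguous.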










8

Genome Analyzer IIx

In Edificio 3 (Building 3):

• Two GAIIx machines, each of which delivers:
  − 40 Gbases / run
  − Paired end reads
  − 4 Gbases / day

• but which are complex and forefront technology...


9

Genome Analyzer IIx

Preparation Workflow

Sample Prep

Pipeline Analysis


10

Genome Analyzer IIx

FlowCell

• 8 lanes
• 120 tiles per lane (2 columns of 60 tiles)
• 4 pictures per tile (one for each of the A, T, G, C fluorophores)
• ~220k clusters on each tile
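The flowcell geometry can be turned into a rough estimate of bases per run. The sketch below assumes that every cluster yields a usable read, a read length of 100 cycles, and paired end reads (two reads per cluster); the read length and the function names are assumptions made for illustration, not machine specifications.

```python
LANES = 8
TILES_PER_LANE = 120          # 2 columns of 60 tiles
CLUSTERS_PER_TILE = 220_000   # ~220k clusters per tile
READ_LENGTH = 100             # assumption: 100 cycles per read
READS_PER_CLUSTER = 2         # paired end


def clusters_per_flowcell():
    """Total clusters on one flowcell."""
    return LANES * TILES_PER_LANE * CLUSTERS_PER_TILE


def bases_per_run():
    """Upper bound on bases per run if every cluster gives two full-length reads."""
    return clusters_per_flowcell() * READ_LENGTH * READS_PER_CLUSTER


print(clusters_per_flowcell())   # 211200000
print(bases_per_run() / 1e9)     # 42.24 (Gbases)
```

Under these assumptions the estimate is about 42 Gbases, the same order as the 40 Gbases per run quoted for the GAIIx.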


11

How much data per run?

• 7.3 MBytes of image data per tile * 120 tiles * 8 lanes = 7 000 MBytes = 7 GigaBytes

• * 4 images per cycle (one per base) * read length (say 100 cycles) = 2 800 GBytes or 2.8 TeraBytes (TB)

• * 2 for the paired end = 5.6 TBytes

• A run of ~1 week on both machines results in 11.2 TeraBytes of image data
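The arithmetic on this slide can be reproduced with a short sketch. It assumes that 7.3 MBytes is the size of one image of one tile, that four images are taken per cycle, and that a read is 100 cycles long; the constant and function names are illustrative.

```python
MB_PER_TILE_IMAGE = 7.3   # MBytes per image of one tile
TILES_PER_LANE = 120
LANES = 8
IMAGES_PER_CYCLE = 4      # one image per base (A, C, G, T)
CYCLES_PER_READ = 100     # "say 100"
READS_PER_CLUSTER = 2     # paired end
MACHINES = 2


def image_data_tb(machines=MACHINES):
    """Image data per run in TeraBytes (decimal: 1 TB = 10^6 MB)."""
    mb = (MB_PER_TILE_IMAGE * TILES_PER_LANE * LANES
          * IMAGES_PER_CYCLE * CYCLES_PER_READ
          * READS_PER_CLUSTER * machines)
    return mb / 1e6


print(f"{image_data_tb(1):.1f} TB per machine, {image_data_tb():.1f} TB for both")
```

The slide rounds 7 008 MBytes to 7 GigaBytes at the first step, which is why the unrounded figures differ slightly in the second decimal place.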


12

Keeping the raw data?

• If we run for ~40 weeks a year we have nearly 0.5 PetaBytes (1 PB = 10^15 Bytes or 1 000 000 000 000 000 Bytes)

• But if we throw the images away there is no chance to recover more sequence data from them when a better (promised) algorithm comes along…

• So biology now faces the problem the physicists faced 35 years ago


13

Genome Analyzer IIx

Attach single molecules to surface

Amplify to form clusters

Cluster generation

10^3 molecules / µm

2.2·10^5 molecules/tile


15

Genome Analyzer IIx

• The identity of each base of each cluster is read off from sequential images (cycle by cycle)

Base Calling
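As a toy illustration of the idea, and not the method used by the Illumina pipeline, the sketch below calls the base for one cluster at each cycle by picking the brightest of the four fluorescence channels. The intensity values are invented, and real base callers must also handle noise and overlap between channels, which this sketch ignores.

```python
CHANNELS = "ACGT"  # assumed channel order for this example


def call_bases(intensities):
    """Call one base per cycle for a single cluster.

    intensities: list of cycles, each a 4-tuple of channel
    intensities in the order A, C, G, T.
    """
    read = []
    for cycle in intensities:
        # index of the brightest channel in this cycle
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Four cycles of made-up intensities for one cluster
example = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.8, 0.1, 0.0),
    (0.0, 0.1, 0.1, 0.7),
    (0.1, 0.0, 0.9, 0.2),
]
print(call_bases(example))  # ACTG
```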


18

Illumina Pipeline

ACTGCTATCTTTCGATTCGTACTGCTAGGCACCATCGCATTTCAGGACGTCCTGCTAGGCACCATCGCATCTCCATC


19

Timing for 115 Cycles Experiment on GA IIx

GA IIx Start Day 1

Illumina Pipeline Day 10

BWA and Yun Li workflow Day 13

Quality-Check Tools Day 15

Experiment Timeline


21

How much computing?

A software pipeline has been implemented at CRS4 to perform such operations automatically after a sequencing run ends

40 Gbases per run

370,000,000 sequences

4 samples per flowcell

7,000,000 megabytes of raw data produced per run

5 days for processing sequence data on the cluster

A huge load for the computer centre


22

How much computing?


23

Quality Control

We realised we needed an audit by external experts of how well we were doing (or how badly)

We asked experts from the Sanger Institute and from Cancer Research, Cambridge, UK

We developed a Quality check process:
− Qualitative and quantitative evaluation of Illumina summary file parameters
− Evaluation of sequence quality (avg. number of “blank” base calls)
− Evaluation of coverage / holes
− Evaluation of known/all SNPs found ratio

• This has been very successful


24

Quality Check: – Weekly Team Meeting

Qualitative and quantitative evaluation of Illumina summary file parameters:
− Based on Sanger QC protocol
− Quantitative examination of run results
− Qualitative inspection of plots


27

Summary of results

• In October 2008 we foresaw 6 Gbases per run per machine
• We started at the end of February 2009
• We started a Quality Control initiative in Sept. 2009
• We have continuously improved the number of bases per run:
  − Upgrades of machines
  − Preparation of samples (reagents, PCR)
  − Increasing number of cycles
  − New algorithms for image processing and base-calling, and better alignment software
  − Quality control


28


30

Activity summary - statistics

• 67 samples sequenced and aligned
• 6 samples currently running on the GAs
• Average coverage of samples 2.98X
• ~800 Gbases of raw data
• ~590 Gbases of aligned data
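As a rough cross-check of these figures, average coverage can be estimated as aligned bases divided by the number of samples times the genome size. The sketch assumes a 3 Gigabase genome and treats the approximate total of ~590 Gbases as exact, so it is only expected to land near the quoted 2.98X; the function name is illustrative.

```python
HUMAN_GENOME_GBASES = 3.0  # assumption: 3 Gigabases per human genome


def mean_coverage(aligned_gbases, n_samples, genome_gbases=HUMAN_GENOME_GBASES):
    """Average depth of coverage per sample (e.g. 3.0 means 3X)."""
    return aligned_gbases / (n_samples * genome_gbases)


print(round(mean_coverage(590, 67), 2))  # 2.94
```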


31

Imputation

• Program from Gonçalo Abecasis and Serena Sanna
• Very powerful tool in the analysis of population genetics
• Extrapolate measured data to infer more genomic variations that you have not measured
• Excellent e-Science, use the computer to do better science
• This certainly merits a seminar to itself


32

Plans and Visions

• Illumina has announced its latest sequencers, which will measure 200 Gbases in a run of 8 days

• 5 times our current performance in 20% less time

• Easy to predict 400 or 600 Gbases – 10 to 15 times as much data per run

• For the plans to sequence 2000 Sardinians together with NIH and with the University at Ann Arbor, and also for other requests from the Park and from Sardinia, we would like to acquire some of these new machines
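The comparison with the current machines can be checked with a short sketch. It assumes a current run of 40 Gbases in 10 days (derived from 40 Gbases per run at 4 Gbases per day) against the announced 200 Gbases in 8 days; the function names are illustrative.

```python
CURRENT_GBASES, CURRENT_DAYS = 40, 10   # assumption: 40 Gbases at 4 Gbases/day
NEW_GBASES, NEW_DAYS = 200, 8           # announced figures


def yield_ratio():
    """How many times more bases per run the new machine delivers."""
    return NEW_GBASES / CURRENT_GBASES


def time_saving():
    """Fraction by which the run time is reduced."""
    return (CURRENT_DAYS - NEW_DAYS) / CURRENT_DAYS


def throughput_ratio():
    """Ratio of Gbases per day, new versus current."""
    return (NEW_GBASES / NEW_DAYS) / (CURRENT_GBASES / CURRENT_DAYS)


print(yield_ratio(), time_saving(), throughput_ratio())  # 5.0 0.2 6.25
```

Per day of running, the new machines would therefore deliver about 6 times the current output under these assumptions.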


33

My personal view

• This is an opportunity for Sardinia to do frontier science on a world stage

• It exploits the Sardinian genomic heritage and its increased “signal to noise” to find the origins and mechanisms of diseases that affect people around the world, and which ultimately cost Sardinia (and the rest of humanity) a lot of money

• It is driven by a predominantly Sardinian team doing excellent work

• It necessarily binds together the strong computer centre of CRS4 and modern digital sequencing technology to build a forefront Sequencing Facility

• If we don’t do this now we will lose a golden opportunity for ever

• Where else would you set up such a Facility?


34

Thank you for your attention!

