Grid Computing The Illinois Bio-Grid


Page 1: Grid Computing The Illinois Bio-Grid

Illinois Bio-Grid

Grid Computing: The Illinois Bio-Grid

Alexander B. Schilling, Ph.D.

University of Chicago

Proteomics Core Lab

[email protected]

Page 2: Grid Computing The Illinois Bio-Grid


Outline

• Bio-Medical Informatics
– Show how computability is growing exponentially

• Illinois Bio-Grid
– Describe this Grid founded at DePaul

• IBG Workbench
– Describe these grid enabled BioInformatics tools

• Mass Spec Toolkit in Cactus
– Describe plans to implement tools for spectral interpretation in Cactus

Page 3: Grid Computing The Illinois Bio-Grid


BioInformatics and Computability

• Growth of data in GenBank is exponential and doesn't show signs of slowing down yet.
– Source: GenBank/NCBI

• The compute time needed to process this data is growing equivalently
– Twice Moore's law

• Biologists don't have access to supercomputers for everyday work

• Grid computing gives Biologists more computing power affordably
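The gap behind these bullets can be put in numbers with a short sketch. It assumes one reading of "twice Moore's law": that the data doubles in half the time compute capacity does. The 24-month and 12-month doubling periods are illustrative assumptions, not figures from the slides.

```python
def growth_factor(months, doubling_months):
    """Multiplicative growth after `months` for a given doubling period."""
    return 2.0 ** (months / doubling_months)


def compute_gap(months, data_doubling_months=12.0, compute_doubling_months=24.0):
    """How far data growth outruns compute growth after `months`.

    Both doubling periods are assumed values used only for illustration.
    """
    data = growth_factor(months, data_doubling_months)
    compute = growth_factor(months, compute_doubling_months)
    return data / compute


if __name__ == "__main__":
    for years in (2, 4, 6):
        print(years, "years: data outgrows compute by a factor of",
              compute_gap(12 * years))
```

Under these assumptions the shortfall itself doubles every two years, which is the case for pooling processors on a grid rather than waiting for faster single machines.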

Page 4: Grid Computing The Illinois Bio-Grid


Illinois Bio-Grid

• A consortium of
– Educational Institutions
– National Labs
– Private Industry
– City & State entities
– Museums

Page 5: Grid Computing The Illinois Bio-Grid


Goals

1. Provide an infrastructure of computational (and other) resources to Biological and Medical researchers

2. Provide an infrastructure of computational (and other) resources to Computer Scientists working on BioMedical problems

3. Provide a tool suite of BioMedical software for BioMedical researchers to use on the IBG computational resources
– Also for open source distribution worldwide

4. Provide an environment for CS researchers to work with BioMedical researchers

5. Try to solve some computationally intense BioMedical Informatics problems

6. Create a workbench of BioMedical software modules in open source distribution to facilitate more rapid BioMedical Informatics research by researchers worldwide

Page 6: Grid Computing The Illinois Bio-Grid

Illinois Bio-Grid Infrastructure

[Diagram: network linking the participating sites]

• IIT
• Argonne MCS
• Chicago Technology Park (Supercomputing Center of Chicago)
• DePaul
• U Chicago
• Canadian NRC
• Field Museum

Page 7: Grid Computing The Illinois Bio-Grid


Bio-Grid Workbench

• Consists of many applications important to Biological and Medical Researchers

• All Grid enabled to provide enhanced computational power
– Genomics
– Proteomics
– Phylogenetics
– Computational Fluid Dynamics / Medical Imaging
– Cell membrane modeling
– Data Modeling LSG-RG in GGF Reference Implementation

Page 8: Grid Computing The Illinois Bio-Grid


Genomics and Proteomics 1

• Homology Searching
– Searching for proteins with the same evolutionary "ancestor"
– Smith-Waterman / BLAST / FASTA
– Database against database searches (instead of single sequence against database searches)
– Allow groups of input sequences to search for sequences homologous to all in the set

• Mass Spec Data Interpretation
– Ionize peptides and fragment them inside mass spectrometer
– Measure mass/charge (m/z) ratio of peptide ions and fragments
– Interpret resulting spectra

[Figure: mass spectrum "All, 0.0-0.5min (#1-#10)", intensity (0-1250) vs m/z (400-2000); labeled peaks at m/z 441.1, 562.1, 1163.9, 1249.9, 1305.9, 1420.8, 1479.9, 1640.0, 1780.8 and 1882.9]
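The homology searching bullets on this slide name Smith-Waterman and database against database searches. A minimal sketch of both follows. The scoring values (match +2, mismatch -1, linear gap -1) and the function names are assumptions chosen for illustration; real protein searches use substitution matrices and affine gap penalties, and this is not the IBG Workbench code.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def database_vs_database(queries, targets):
    """Score every query against every target.

    Each pair is independent, so the pairs could be farmed out to
    separate grid nodes; here they simply run in a loop.
    """
    return {
        (q_name, t_name): smith_waterman_score(q_seq, t_seq)
        for q_name, q_seq in queries.items()
        for t_name, t_seq in targets.items()
    }


if __name__ == "__main__":
    print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

The work grows with the product of the two database sizes, which is why the all-against-all case benefits from being split across many processors.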

Page 9: Grid Computing The Illinois Bio-Grid


Genomics and Proteomics 2

• Mass Spec Based Protein Identification
– Conduct "In Silico" digestion of protein database
– Predict mass/charge (m/z) ratio of all possible peptide ions resulting from database
– Search actual ions in spectra against predicted ions
– Return identifications of proteins based on scoring match
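The four identification steps on this slide can be sketched in a few lines. Several things here are assumptions, since the slides do not specify them: trypsin-like cleavage (after K or R, but not before P), singly protonated ions, a 0.5 m/z tolerance, and a score that just counts matched peaks. Real search engines use far more careful scoring.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein):
    """In silico digestion: cut after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mz(peptide, charge=1):
    """Predicted m/z of a peptide ion carrying `charge` protons."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def identify(observed_mz, database, tol=0.5):
    """Rank proteins by how many observed peaks match a predicted peptide."""
    scores = {}
    for name, sequence in database.items():
        predicted = [peptide_mz(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(obs - pred) <= tol for pred in predicted)
            for obs in observed_mz
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Each protein in the database is digested and scored independently of the others, so the database can be partitioned across grid nodes and the ranked lists merged afterwards.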

Page 10: Grid Computing The Illinois Bio-Grid


Genomics and Proteomics 3

• Predict 3D Protein folding given sequence of amino acids

• Solution to Schrödinger equation is intractable

• Search space of possible folds is immense

• Current methods of searching
– ab-initio
– AI
– Lego
– Monte Carlo
– Lattice

• On Grids can run multiple searches
– In parallel
– In series

• On Grids can run at higher resolutions
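To make "multiple searches in parallel" concrete, here is a toy sketch that combines two of the listed methods, Monte Carlo and lattice. It uses the 2D HP lattice model (H = hydrophobic, P = polar) with plain random sampling of self-avoiding walks, which is the simplest Monte Carlo variant and not necessarily what the IBG codes do. Local threads stand in for grid nodes, and all names are illustrative.

```python
import random
from concurrent.futures import ThreadPoolExecutor

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def random_fold(length, rng):
    """Grow a random self-avoiding walk; return None if it traps itself."""
    path, occupied = [(0, 0)], {(0, 0)}
    while len(path) < length:
        x, y = path[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return None
        step = rng.choice(free)
        path.append(step)
        occupied.add(step)
    return path


def energy(sequence, path):
    """-1 for each pair of H residues adjacent on the lattice but not in the chain."""
    index_at = {point: i for i, point in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):   # look right and up: each contact once
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                total -= 1
    return total


def monte_carlo_search(sequence, seed, samples=2000):
    """One independent search: keep the lowest-energy fold sampled."""
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(samples):
        path = random_fold(len(sequence), rng)
        if path is not None:
            e = energy(sequence, path)
            if e < best[0]:
                best = (e, path)
    return best


def parallel_search(sequence, seeds):
    """Run one search per seed concurrently and return the best result."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: monte_carlo_search(sequence, s), seeds))
    return min(results, key=lambda result: result[0])
```

Because every seed gives an independent search, adding nodes widens the coverage of the fold space without any communication between the searches.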

Page 11: Grid Computing The Illinois Bio-Grid


Phylogenetics

• Sequence various taxa (individuals or species)
– Frequently sequence mitochondrial DNA
– Mitochondrial DNA much like prokaryote DNA

• Compare sequences
– Form hypothetical evolutionary tree
– Each branch is a mutation
– Shows mutations from hypothetical ancestor

• Search space is immense
– Runs for 6 months on a single processor
– Then crashes!
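How immense the search space is can be shown with a standard counting result: the number of distinct unrooted, fully bifurcating trees on n labeled taxa is (2n-5)!! = 1 × 3 × 5 × ... × (2n-5). The sketch below evaluates it; the function name is illustrative.

```python
def unrooted_tree_count(n_taxa):
    """Number of unrooted bifurcating tree topologies for n labeled taxa."""
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, "taxa:", unrooted_tree_count(n), "possible trees")
```

Ten taxa already allow about two million topologies, so exhaustive search is out of reach for realistic data sets and heuristic searches have to be spread over many processors.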

Page 12: Grid Computing The Illinois Bio-Grid


Computational Fluid Dynamics / Medical Imaging

• Monitor and collect real time CAT scan data
– Arterial blood flow

• Use Grid to interpret data
– Use Computational Fluid Dynamics to model blood flow
– Produce real time imaging
– Locate aneurysms and other anomalies
– Aid in diagnosis and decision making for surgical procedures
– Non-invasive

Page 13: Grid Computing The Illinois Bio-Grid


Cell membrane modeling

• Run simulations using both
– Configurational Bias Monte Carlo Method (CBMC)
– Molecular Dynamics (MD)

• Current simulations being done involve the properties of cholesterol in lipid membranes
– Cholesterol is known to be an essential component of mammalian cell membranes
– Its exact role is not well understood

• Previous simulations have been run
– Up to 1600 lipid or cholesterol molecules
– And 52,000 water molecules

• We're increasing these simulations by
– An order of magnitude in the physical dimensions
– And 2 to 3 orders of magnitude in time

Page 14: Grid Computing The Illinois Bio-Grid


Data Modeling

• Data Modeling LSG-RG in GGF Reference Implementation
– Automatic Data Synchronization
– Flagging "dirty" data
– Flagging data sources (including versioning)

Page 15: Grid Computing The Illinois Bio-Grid


IBG Workbench

[Diagram: layered architecture of the IBG Workbench]

• Grid Fabric (Resources)
• Grid Services (Middleware)
• DB Access
• Applications: Homology Searching, Phylogenetic Trees, Mass Spec, CFD, Proteomics, Membrane Modeling

Page 16: Grid Computing The Illinois Bio-Grid


The Purpose of Mass Spectrometry in Proteomics

• Identify and sequence all proteins involved in an organism’s biology.

• Use this knowledge to identify proteins (or peptides) that can be used to study and understand different biological states.

• Correlate protein expression levels to biological function. Use protein or peptide biomarkers to identify disease states in patients.

• Use the structure of the relevant proteins as targets for developing new therapeutic techniques (drugs, etc.).

Page 17: Grid Computing The Illinois Bio-Grid


Mass Spectrometers in Proteomics

• Mass spectrometers measure the masses of proteins and peptides by moving their ions through the instrument in a controlled way.

• Proteins can be degraded using enzymes and the peptides produced can be analyzed by the mass spectrometer.

• An MS/MS instrument can cause the peptide ions to fragment into smaller pieces, which can be used to deduce the peptide’s sequence.

• Once the sequence of the peptides has been determined, the protein’s complete sequence can be reassembled from the peptide sequences.

• The intensity of peaks can be used to determine the expression level of a protein in a sample.

• Samples from healthy and diseased tissue can be compared to locate biomarkers for disease.

Page 18: Grid Computing The Illinois Bio-Grid


The MS/MS Experiment Produces Multidimensional Data

• Chromatograms (Time vs Intensity)
• Precursor Ion Spectra of Peptides (Mass vs Intensity)
• Product Ion Spectra of Peptides (Precursor Mass, Mass vs Intensity)

[Figure: three stacked panels. MS TIC chromatogram (intensity vs time); precursor ion spectrum "+MS, 4.7min (#36)" (intensity vs m/z); product ion spectrum "+MS2(1535.8), 5.6min (#44)", captioned "MS/MS of m/z 1535.8", with fragment ions y5, b6, y6, y7 and y11 labeled; m/z axis from 400 to 2000]

Page 19: Grid Computing The Illinois Bio-Grid


What the tandem mass spectrum of a peptide looks like.

[Figure: backbone of a four-residue peptide (side chains R1 to R4) with each peptide bond marked as a cleavage site, giving B ions 1 to 3 and Y ions 3 to 1]

• Y-ions from C to N terminus

• B-ions from N to C terminus
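The b/y numbering translates directly into predicted fragment masses. The sketch below assumes singly protonated fragments and monoisotopic masses, and ignores modifications and neutral losses; the function name is illustrative. A b ion is the sum of the first i residues plus a proton, and a y ion is the sum of the last i residues plus water plus a proton.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i-1] is b_i, counted from the N terminus;
    y_ions[i-1] is y_i, counted from the C terminus.
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:            # N-terminal prefixes
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):   # C-terminal suffixes
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("GAS")
    print("b ions:", b)
    print("y ions:", y)
```

Complementary fragments from the same bond (b_i and y_(n-i)) always sum to the neutral peptide mass plus two protons, which gives interpretation software a consistency check.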

Page 20: Grid Computing The Illinois Bio-Grid


Important Issues In Computation for Proteomics

• DeNovo Sequencing
– Many computationally efficient algorithms exist
– Many times algorithms produce incorrect results very quickly!
– Issue of posttranslational modifications introduces complexity into interpretations
– Much data must be discarded to accommodate workstation based computational capacity
– A strong desire exists to use intensity data as well as mass data in interpretations

• Database Search (Protein ID)
– Most packages are commercial, few open source (BLAST based only)
– The more posttranslational modifications you allow for, the longer the searches take. Area is ripe for parallelism.
– Serious problems with false positive identifications
    • Many groups are actively researching this problem
    • Could be reduced by more front end interpretation before search
    • Could combine spectra from multiple MS types before search instead of correlating ID results after searches

• Datamining
– What do you do with all the identifications? Systems Biology!
    • Create models for signal pathways using protein ID and expression data
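The remark that allowing more posttranslational modifications lengthens the search follows from simple counting. If each modifiable residue can independently stay unmodified or carry one of its allowed variable modifications, the number of candidate forms of a peptide is the product of (1 + options) over its residues. The sketch below counts them; the modification table passed in is a made-up example.

```python
from math import prod


def count_modified_forms(peptide, variable_mods):
    """Number of distinct modified forms of `peptide`.

    `variable_mods` maps a residue letter to the list of variable
    modifications allowed on it (hypothetical example data).
    """
    return prod(1 + len(variable_mods.get(aa, ())) for aa in peptide)


if __name__ == "__main__":
    mods = {"S": ["phospho"], "T": ["phospho"], "Y": ["phospho"]}
    for peptide in ("ASTK", "SSTTYYK"):
        print(peptide, count_modified_forms(peptide, mods))
```

Every candidate form is scored separately against each spectrum, so the extra work splits cleanly into independent jobs.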

Page 21: Grid Computing The Illinois Bio-Grid


GridProt: A Cactus Based Proteomics Tool Kit

Thorns:

GridMass – handles basic data extraction, chromatographic peak integration, mass detection

GridTAG - partial sequence mass tag extraction

GridID - grid based database search using mass spec data

GridDeNS - grid based denovo sequencing

Visualization – OpenDX

Data Storage – mzXML and HDF-5
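As an illustration of what partial sequence mass tag extraction involves, the sketch below looks for pairs of fragment peaks whose spacing equals one amino acid residue mass. It is a generic example of the idea, not the GridTAG thorn; the function name, the 0.01 Da tolerance and the peak list in the usage example are all assumptions.

```python
# Monoisotopic residue masses in daltons. "L" also stands for the
# isobaric residue I, which mass alone cannot distinguish.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extract_tags(peaks, tol=0.01):
    """Find peak pairs separated by one residue mass.

    Returns (lower_mz, upper_mz, residue) tuples for singly charged peaks.
    """
    peaks = sorted(peaks)
    steps = []
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            diff = high - low
            for residue, mass in MONO.items():
                if abs(diff - mass) <= tol:
                    steps.append((low, high, residue))
    return steps


if __name__ == "__main__":
    # Made-up peak list: spacings correspond to G, then S.
    for low, high, residue in extract_tags([100.0, 157.02146, 244.05349]):
        print(f"{low:.3f} -> {high:.3f}: {residue}")
```

Chaining consecutive steps yields a short sequence tag, which can narrow a database search before any full scoring is done.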

Page 22: Grid Computing The Illinois Bio-Grid


Conclusions

• Illinois Bio-Grid
– Excellent resource for Biological and Medical researchers

• IBG Workbench
– Excellent software architecture for compute intensive applications
– Will be source of BioMedical Informatics software sharing for a plethora of different research areas
– Will be source of workbench tools for researchers in other related Informatics software creation

• Cactus is an ideal platform for HPC of Mass Spec data
– Modular thorns allow generalization for MS, specialization for Proteomics
– Ideal base for open source, extendable software ready for HPC as Proteomics data sets grow.

• http://facweb.cs.depaul.edu/bioinformatics
• http://facweb1.cs.depaul.edu/~dangulo

Page 23: Grid Computing The Illinois Bio-Grid

Acknowledgements

University of Chicago

Howard Hughes Medical Institute

Ben May Cancer Center

Pfizer Inc.

Illinois Biogrid:

Dave Angulo, DePaul University

Gregor von Laszewski, ANL

Kevin Drew, Tim Freeman