“Biomedical computing is entering an age where creative exploration

Post on 10-Jan-2016


“Biomedical computing is entering an age where creative exploration of huge amounts of data will lay the foundation of hypotheses. Much work must still be done to collect data and create the tools to analyse it. Bioinformatics, which provides the tools to extract and combine knowledge from isolated data, gives us ways to think about the vast amounts of information now available. It is changing the way biologists do science.”

A report to Harold Varmus, June 3 1999.

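• 10^3 bytes: Kilobyte
• 10^6 bytes: Megabyte
• 10^9 bytes: Gigabyte
• 10^12 bytes: Terabyte
• 10^15 bytes: Petabyte
• 10^18 bytes: Exabyte
• 10^21 bytes: Zettabyte
• 10^24 bytes: Yottabyte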
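Each decimal prefix on this scale is a factor of 10^3 larger than the one before it. A minimal sketch of that scaling, assuming decimal (SI) rather than binary units; the function name `describe_bytes` is hypothetical and used only for illustration:

```python
# Decimal (SI) unit symbols, each a factor of 10**3 larger than the last.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def describe_bytes(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    if n_bytes < 0:
        raise ValueError("byte count must be non-negative")
    index = 0
    scale = 1
    # Step up one unit at a time until the next unit would be too large.
    while n_bytes >= scale * 1000 and index < len(UNITS) - 1:
        scale *= 1000
        index += 1
    return f"{n_bytes / scale:g} {UNITS[index]}"


if __name__ == "__main__":
    for exponent in (3, 6, 9, 12, 15, 18, 21, 24):
        print(f"10^{exponent} bytes = {describe_bytes(10 ** exponent)}")
```

Integer scaling is used in the loop so that large powers of ten are compared exactly before the single final division.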

GAATTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTAAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGAGCCCAAATGAGCGATAGATAGATAGATCGTGCGGCGATCTCGTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGGTTCTGGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGCTTACGATCGGGTTTTGGGCTTTGGTTGTGGCCTCCAGTTCTCTGGCTCGTTGCCTGTGCCAATTCAAGTGCGCATCCGGCCGTGTGTGTGGGCGCAATTATGTTTATTTACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCC
TTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGACTGGTAACTGGTAATTTGATCGATTCAAACGATTCTGGGTCTCCCCGGTTTTCTGTCCCGGTTCAATCTCGTAGAACTTGCCCTTGGTGGACAGTGGGACGTACAACACCTGCCGGTTTTCATTAAGCAGCTGGGCATACTTCTTTTCCTTCTCCCTTCCCATGTACCCACTGCCATGGGACCTGGTCGCATTGCCGTTGCCATGTTGCGACATATTGACCTGATCCTGTTTGCCATCCTCGAAGACGGCCAACAGACGGAATACCTGCCCGCCCCTTGCCGTCGTTTTCACGTACTGTGGTCGTCCCTTGTTTATGGGCAGGCATCCCTCGTGCGTTGGACTGCTCGTACTGTTGGGCGAGGATTCCGTAAACGCCGGCATGTTGTCCACTGAGACAAACTTGTAAACCCGTTCCCGAACCAGCTGTATCAGAGATCCGTATTGTGTGGCCGTGGGGAGACCCTTCTCGCTTAGCATCGAAAAGTAACCTGCGGGAATTCCACGGAAATGTCAGGAGATAGGAGAAGAAAACAGAACAACAGCAAATACTGTGCGGCGATCTCGTACTGGACGGAAATGTCAGGAGATAGGAGAAGAAAA

Nucleotide sequence database.

[Figure: size of the nucleotide sequence database in megabases (axis 0 to 600) by year, 1982 to 1996.]

The Human Proteome

• ~ 30,000 protein coding genes

• Expansion of the number of different protein molecules due to:
  – (a) alternative splicing (30 to 50% increase);
  – (b) post-translational modifications (5 to 10 fold increase)

• There could well be about 1 million different protein molecules in the human body
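As a rough check on the scale involved, the figures above can be combined arithmetically. This is only a sketch: treating the two expansion factors as independent multipliers is an assumption, not something the slide states, and the function name is hypothetical.

```python
GENES = 30_000                  # ~30,000 protein coding genes (from the slide)
SPLICING_INCREASE_PCT = (30, 50)  # alternative splicing: 30 to 50% increase
PTM_FOLD = (5, 10)              # post-translational modifications: 5 to 10 fold


def protein_count_range(genes, splicing_pct, ptm_fold):
    """Return (low, high) protein counts, assuming the factors simply multiply."""
    low = genes * (100 + splicing_pct[0]) // 100 * ptm_fold[0]
    high = genes * (100 + splicing_pct[1]) // 100 * ptm_fold[1]
    return low, high


if __name__ == "__main__":
    low, high = protein_count_range(GENES, SPLICING_INCREASE_PCT, PTM_FOLD)
    print(f"{low:,} to {high:,} distinct protein molecules")
```

Under this multiplicative assumption the range is 195,000 to 450,000, so the figure of about 1 million is best read as an order-of-magnitude statement rather than a product of the listed factors.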

Annotated genome

Annotation

Depth

of

know

ledge

Breadth of knowledge

Detailed analysis (typically biological)

of single genes

Large-scale analysis (typically

computational) of entire genome

The two major methods of gene prediction

• sequence comparison

• ab initio

Approaches to gene finding: Generalized hidden Markov models
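To make the idea concrete, here is a minimal sketch of decoding with a plain two-state hidden Markov model. It is a simplification, not a generalized HMM: a generalized HMM additionally models how long each state lasts, which this sketch omits. The states, probabilities, and function name are hypothetical values chosen only for illustration and are not trained on real data.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, for illustration only.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
}


def viterbi(sequence):
    """Return the most probable state path for a DNA string (log-space Viterbi)."""
    if not sequence:
        return []
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}
    backpointers = []
    for base in sequence[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][base]))
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("AAAAAAAAGGGGGGGG"))
```

With these toy parameters an AT-rich run followed by a GC-rich run is labelled noncoding and then coding, with the switch at the boundary between the two runs.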

Limitations of Gene Prediction Programs

• Good at predicting ORF-containing sequence

• Prediction of exact exon-intron boundaries difficult

• May fuse or split genes

• Cannot predict UTRs

• Cannot predict nested genes
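The first point can be illustrated with a naive open reading frame (ORF) scan. This is a sketch under stated assumptions, not how any particular gene prediction program works: it reads the forward strand only, treats ATG as the only start codon, and the function name `find_orfs` is hypothetical. Because it looks for an uninterrupted in-frame run, it has no notion of introns, which is one reason exact exon-intron boundaries are the harder problem.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find forward-strand ORFs: an ATG followed in frame by a stop codon.

    Returns sorted (start, end) index pairs; end is exclusive and includes
    the stop codon. min_codons counts codons from ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGCC"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```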

Computational Analysis

Fly Alignments

• Known genes/cDNAs
• ESTs
• Transposons

Cross-species Sequence Similarities

Proteins & ESTs

• Fly
• Primate
• Rodent
• Worm
• Yeast
• Plant
• Other Insects
• Other Vertebrates
• Other Invertebrates

Gene Predictions

• Genie
• Genscan
• tRNAscan-SE

Drosophila Gene Collection 1 Pavel Tomancak

• Embryonic expression of wild-type eve (rust) and a transgene containing the stripe 3 + 7 tertiary element (blue)

• Alignment of eve 5’ regulatory region

• D. melanogaster vs (A) D. erecta, (B) D. pseudoobscura, (C) D. willistoni and (D) D. littoralis

[Figure labels: stripe 3 + 7; eve]

Gene_Ontology

• FlyBase - Drosophila - Cambridge & EBI, Harvard, Berkeley & Bloomington
• Saccharomyces Genome Database - Stanford
• Mouse Genome Informatics - Jackson Labs
• The Arabidopsis Information Resource - Stanford
• WormBase - Caltech & CSHL
• DictyBase - Chicago
• SwissProt - Hinxton & Geneva
• The Institute for Genomic Research - MD

With support from NIH (NHGRI) & AstraZeneca.

The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.

What is an Ontology?

An ontology is a specification of a conceptualization that is designed for reuse across multiple applications and implementations. …a specification of a conceptualization is a written, formal description of a set of concepts and relationships in a domain of interest.

Peter Karp (2000) Bioinformatics 16:269

• The Gene Ontology Consortium subscribes to the Manifesto of Liberation Bioinformatics:

• Open source

• Open standards

• Open annotation

• Open data

• Thanks to Tim Hubbard, liberationist extraordinaire of Hinxton

Introduction to GO

GO: A Gene Ontology

GO Objectives:

Provide a controlled vocabulary for the description of the molecular function and cellular location of gene products, as well as the role of the gene products in basic biological processes

Use these terms as attributes of gene products in the collaborating databases

Allow queries across databases using GO terms, providing the linking of biological information across species
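The third objective can be sketched as a lookup keyed on GO terms. This is a minimal illustration, not the consortium's actual data model: the gene product names are invented placeholders, the annotations are made up, and the function names are hypothetical. Only the database names and term names come from the slides.

```python
from collections import defaultdict

# Hypothetical annotations: (database, gene product, GO term).
# The gene product names are placeholders, not real identifiers.
ANNOTATIONS = [
    ("FlyBase", "fly_gene_A", "nucleus"),
    ("FlyBase", "fly_gene_D", "vacuolar membrane"),
    ("MGI", "mouse_gene_B", "nucleus"),
    ("SGD", "yeast_gene_C", "nucleus"),
    ("SGD", "yeast_gene_E", "vacuolar membrane"),
]


def build_index(annotations):
    """Map each GO term to the set of (database, gene product) pairs using it."""
    index = defaultdict(set)
    for database, gene_product, term in annotations:
        index[term].add((database, gene_product))
    return index


def query(index, term):
    """Return every gene product annotated with a term, across all databases."""
    return sorted(index.get(term, set()))


if __name__ == "__main__":
    index = build_index(ANNOTATIONS)
    for database, gene_product in query(index, "nucleus"):
        print(database, gene_product)
```

Because every database attaches the same shared term to its gene products, a single query returns results from fly, mouse, and yeast without any per-database translation.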

GO = Three Ontologies

• Biological Process = goal or objective within cell

• Molecular Function = elemental activity or task

• Cellular Component = location or complex

Parent-Child Relationships

Hierarchy: one-to-many parental relationship. Each child has only one parent.

Directed acyclic graph (DAG): many-to-many parental relationship. Each child may have one or more parents.

Classes of parent-child relationship:

• ISA (hyponymy) - as in: an elephant is a mammal.

• PARTOF (meronymy) - as in: a trunk is part of an elephant.

cellular_component

%membrane %vacuolar membrane %nuclear membrane %intracellular %cell <cytoplasm <vacuole <vacuolar membrane <vacuolar lumen <nucleus <nuclear membrane

[Figure: the same terms drawn as a graph, with nodes cellular_component, membrane, intracellular, cell, cytoplasm, nucleus, vacuole, vacuolar membrane, vacuolar lumen and nuclear membrane.]

instance of (%), part of (<).
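A directed acyclic graph of this kind can be sketched as a mapping from each term to its parents, with the relationship type attached to each edge. The edges below are one reading of the flattened listing above and should be treated as an assumption; "intracellular" is left out because its placement cannot be recovered from the text. The names `PARENTS` and `ancestors` are hypothetical.

```python
# child -> list of (relationship, parent); edges are an assumed reading
# of the slide, not an authoritative copy of the ontology.
PARENTS = {
    "membrane": [("is_a", "cellular_component")],
    "cell": [("is_a", "cellular_component")],
    "cytoplasm": [("part_of", "cell")],
    "nucleus": [("part_of", "cell")],
    "vacuole": [("part_of", "cytoplasm")],
    "vacuolar membrane": [("is_a", "membrane"), ("part_of", "vacuole")],
    "vacuolar lumen": [("part_of", "vacuole")],
    "nuclear membrane": [("is_a", "membrane"), ("part_of", "nucleus")],
}


def ancestors(term):
    """Return every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in PARENTS.get(current, []):
            if parent not in seen:      # a term may be reached by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("vacuolar membrane")))
```

Here "vacuolar membrane" has two parents, one by each relationship type, which is what distinguishes the graph from a strict hierarchy: a query for either "membrane" or "vacuole" can reach it.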

Structure of the Ontologies

• molecular function: 5232 terms
• biological process: 6416 terms
• cellular component: 1111 terms
• all: 12,759 terms
• definitions: 7735 (61%)

As of September 13 2002

Content of GO

Thank yous

• Genome annotation: Colleagues in the European and Berkeley Drosophila Genome Projects.

• FlyBase: Colleagues in Harvard, Berkeley, Bloomington & Cambridge.

• Gene Ontology: Colleagues in Berkeley, Jackson Labs, Stanford and EBI.