CSCI6904 Genomics and Biological Computing


Page 1: CSCI6904 Genomics and Biological Computing

CSCI6904

Genomics and Biological Computing

Lecture 3 – Conceptual Biology

Cells, Gene circuits

Conceptual Biology

Page 2: CSCI6904 Genomics and Biological Computing

Overview

Computing in biological systems: Cells compute information and react programmatically to various situations. We will take a brief look at what a cell is and how cells “compute”.

Evolutionary emergence of networks: These circuits of gene products arise in a stochastic manner. We will take a quick look at how this random walk results in a combinatorial strategy for evolving solutions.

Investigating networks: None of these networks is visible; investigating the relationships in the physical world is a resource-consuming operation.

Building knowledge models of cells using text mining: We present a test case called GeneWays.

Page 3: CSCI6904 Genomics and Biological Computing

Cells

Page 4: CSCI6904 Genomics and Biological Computing

Scope of molecular Biology

Molecular biology tries to organize a stochastically evolved system comprising hundreds of thousands of components.

None of these components can be seen, even under the most powerful microscopes.

They are usually present in quantities of 10^-8 to 10^-12 grams.

They degrade in a matter of seconds to hours.

The bottom line is:

Everything we know about this system comes from fragments of information.

Many of these are going to be refuted over time.

Page 5: CSCI6904 Genomics and Biological Computing

Cells as processors

Page 6: CSCI6904 Genomics and Biological Computing

Scope of Biological research

Research is usually structured such that individual contributions can be pieced together into a “pathway”.

Page 7: CSCI6904 Genomics and Biological Computing

Scope of Biological research

Research is usually structured such that individual contributions can be pieced together into a “pathway”.

Sugar

Essential oils (plants)

Vitamin K

Bile

Eye Pigments

Sexual Hormones

Amino-Acids

Page 8: CSCI6904 Genomics and Biological Computing

Networks

How do they come into being? Combinatorial assembly during a stochastic process.

What is done to understand the main pathways? Grasping even the smallest fact about one edge in the graph is a feat.

Page 9: CSCI6904 Genomics and Biological Computing

Evolutionary Quandary

Intelligent design opposition to evolution of complex systems

A -> B -> C -> D

Page 10: CSCI6904 Genomics and Biological Computing

Evolutionary Quandary

Intelligent design opposition to evolution of complex systems

A -> B -> C -> D

Useless metabolites

Page 11: CSCI6904 Genomics and Biological Computing

Evolutionary Quandary

Intelligent design opposition to evolution of complex systems

A -> D

Impossible

Page 12: CSCI6904 Genomics and Biological Computing

Evolutionary Quandary

Intelligent design opposition to evolution of complex systems

A -> B -> C -> D

Therefore, the pathway A->D had to be designed by an intelligent entity which had knowledge of the intended purpose of the pathway!

Page 13: CSCI6904 Genomics and Biological Computing

Closer look at high-level gene organization

A modular system: Proteins can be broken down into domains.

A combinatorial effect: Domains can assemble in a combinatorial fashion to try out a vast array of potential biological activities.

Page 14: CSCI6904 Genomics and Biological Computing

Proteins are made of domains

Proteins are organized into domains

Transcription factor eF1eF1/ (PDB: 1IJF)

http://www.ncbi.nlm.nih.gov

Page 15: CSCI6904 Genomics and Biological Computing

Proteins are made of domains

Domains have several interesting properties.

Transcription factor eF1eF1/ (PDB: 1IJF)

http://www.ncbi.nlm.nih.gov

Page 16: CSCI6904 Genomics and Biological Computing

Proteins are made of domains

Domains fold onto themselves such that it is possible to express them separately (in most cases).

They are small relative to whole proteins, which may make it easier for them to fold rapidly into the right conformation.

Transcription factor eF1eF1/ (PDB: 1IJF)

Page 17: CSCI6904 Genomics and Biological Computing

Proteins are made of domains

They usually provide a biological function through binding or catalysis.

Transcription factor eF1eF1/ (PDB: 1IJF)

Page 18: CSCI6904 Genomics and Biological Computing

A stochastic process

Page 19: CSCI6904 Genomics and Biological Computing

A molecular network

= An interaction

Page 20: CSCI6904 Genomics and Biological Computing

Interfaces are expensive to evolve

Transcription factor eF1/ (PDB: 1IJF)

Interfaces are very sensitive to mutation as they must provide a perfect match.

Page 21: CSCI6904 Genomics and Biological Computing

Network of Metabolites

Metabolites essentially form a network with a scale-free property, which parallels the stochastic assembly of domains.

At least, this appears to be true with the data available so far.

Rzhetsky and Gomez, 2001. Bioinformatics, 17:988-996

http://www.genego.com/about/products.shtml
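To give a feel for how a stochastic growth process can yield a scale-free network, here is a minimal sketch. It uses generic preferential attachment (new nodes link to existing nodes in proportion to their degree), which is an assumption made for illustration and not the specific model of the cited paper. The function name and the network size are hypothetical.

```python
import random
from collections import Counter

def grow_network(n_nodes, seed=0):
    """Grow a network one node at a time; each new node links to an
    existing node chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in `endpoints` once per edge it touches, so a
    # uniform draw from this list is a degree-proportional draw.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges

edges = grow_network(2000)

# Degree of every node.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Number of nodes having each degree; printing the low end of the
# distribution and the largest hub shows how uneven the degrees are.
histogram = Counter(degree.values())
for k in sorted(histogram)[:5]:
    print(k, histogram[k])
print("largest hub degree:", max(degree.values()))
```

Plotting `histogram` on log-log axes is the usual way to check whether the degree distribution looks like a power law.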

Page 22: CSCI6904 Genomics and Biological Computing

Evolutionary Quandary

Back to our A to D problem.

A -> B -> C -> D

An observed pathway therefore is simply a path connecting an input molecule and a required output. Each edge can be seen as a gene product (protein).

Overall, the pathway offers some kind of advantage to the host organism.

With positive selection, the pathway gets better and looks as if it were designed for a specific purpose.
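The idea that a pathway is just a path through a larger network can be sketched with a breadth-first search. The network below is hypothetical: the metabolites X and Y stand in for the "useless metabolites" of the earlier slide, and each edge stands for one gene product.

```python
from collections import deque

# Hypothetical reaction network: each edge is a gene product (enzyme)
# converting one metabolite into another. Names are illustrative only.
reactions = {
    "A": ["B", "X"],
    "B": ["C", "Y"],
    "C": ["D"],
    "X": ["Y"],
    "Y": [],
    "D": [],
}

def find_pathway(network, source, target):
    """Breadth-first search: return a shortest path from source to target,
    or None if the target cannot be reached."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for product in network.get(node, []):
            if product not in visited:
                visited.add(product)
                queue.append(path + [product])
    return None

pathway = find_pathway(reactions, "A", "D")
print(" -> ".join(pathway))
```

Nothing in the search knows about a "purpose": the path A to D is simply the one that happens to connect an available input to a useful output.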

Page 23: CSCI6904 Genomics and Biological Computing

Scope of Biological research

Density of knowledge-generating statements per article with respect to source journals

Page 24: CSCI6904 Genomics and Biological Computing

Where it becomes a bioinformatics problem:

Nature of the problem: Building a global model from plain English text sources.

Size

Complexity

What is done in the GeneWays project: The workflow of their integrated system

What I think it really means in the long run: The relationship between research and researchers

(The right information system will be the next big thing)

Page 25: CSCI6904 Genomics and Biological Computing

Motivation

Human limitations and data-heavy, knowledge-heavy disciplines

Synthesizing, Hypothesis building

Visualizing, Records keeping

Modeling Knowledge Streamlining, Structuring (Directing)

(Changing the way research is communicated?)

Page 26: CSCI6904 Genomics and Biological Computing

Motivation

In knowledge-intensive fields, the connection between investigators and background information is thinning.

Data

Hypothesis

Experiment

Information (data, concepts)

Knowledge

This arrow does not scale up as quickly as the others.

Bioinformatics, Computational Biology

Page 27: CSCI6904 Genomics and Biological Computing

Scope of GeneWays

Build a model for molecular biology from plain-English publications.

Allow a more holistic approach to hypothesis formulation.

Page 28: CSCI6904 Genomics and Biological Computing

Scope of GeneWays

~ 3 million statements

150 K full text articles

Page 29: CSCI6904 Genomics and Biological Computing

Scope of GeneWays

What are we looking for, ultimately?

protein A binds gene B

gene B regulates gene C

gene C expresses protein D

protein D inactivates protein A
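The four statements above form a closed loop, which is the kind of structure a knowledge model should make easy to find. A minimal sketch, assuming statements are stored as (subject, verb, object) triples; the helper names are hypothetical and not part of GeneWays:

```python
from collections import defaultdict

# Binary statements in (subject, verb, object) form, taken from the slide.
statements = [
    ("protein A", "binds", "gene B"),
    ("gene B", "regulates", "gene C"),
    ("gene C", "expresses", "protein D"),
    ("protein D", "inactivates", "protein A"),
]

def build_graph(triples):
    """Map each subject to a list of (verb, object) outgoing edges."""
    graph = defaultdict(list)
    for subject, verb, obj in triples:
        graph[subject].append((verb, obj))
    return graph

def find_cycle(graph, start):
    """Depth-first search for a path that leaves `start` and returns to it."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for _verb, nxt in graph.get(node, []):
            if nxt == start:
                return path + [start]
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

graph = build_graph(statements)
loop = find_cycle(graph, "protein A")
print(loop)
```

Here the loop is a feedback circuit: protein A ultimately triggers the production of the protein that inactivates it.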

Page 30: CSCI6904 Genomics and Biological Computing

Scope of GeneWays

Doc Sorting

Terms identification

Disambiguation

Information extraction

Ontology

Visualization

Page 31: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Doc Sorting

From abstracts, using either clustering (unsupervised) or Naïve Bayes.

This system uses a mixture of methods to achieve the binary classification:

Relevant / irrelevant
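As an illustration of the Naïve Bayes option, here is a small multinomial classifier with add-one smoothing. It is a sketch only: the training abstracts are invented, and the actual GeneWays classifier (a mixture of methods) is not described in enough detail here to reproduce.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labelled_abstracts):
    """Count words per class; labels are 'relevant' or 'irrelevant'."""
    word_counts = {"relevant": Counter(), "irrelevant": Counter()}
    doc_counts = Counter()
    for text, label in labelled_abstracts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["relevant"]) | set(word_counts["irrelevant"])
    return word_counts, doc_counts, vocab

def classify(text, model):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy training set (hypothetical abstracts).
training = [
    ("protein kinase phosphorylates the receptor", "relevant"),
    ("the gene regulates expression of the protein", "relevant"),
    ("survey of hospital staffing and budgets", "irrelevant"),
    ("patient satisfaction with clinic waiting times", "irrelevant"),
]
model = train(training)
label = classify("kinase binds the gene", model)
print(label)
```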

Page 32: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Tagging terms

Especially hard in biology(?)

Morphological rules, Grammatical rules

Rules/dictionary methods, SVM, HMM

Naïve Bayes, Decision Trees

Recall in the 70s to 80s (percent)

Page 33: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Tagging terms

HTML -> XML-like format

Page 34: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Tagging terms

Vertices:

Gene, Protein

Geneorprotein, Process

Smallmolecules, Species, Complex, Disease

Domain (protein)

Page 35: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Tagging terms

Edges:

N-acylate, acetylate

N-glycosylate, O-glycosylate

Bind, Degrade

(De-)methylate, (De-)phosphorylate, [Make|break] bond

Express, Transcribe, Release, Interact

Substitute, …

n = 125 (2001)

Page 36: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Learning new verbs:

AVAD system

χ² statistics of occurrence of terms before and after tagged items.

Log-likelihood test based on frequency of occurrence in corpus-specific literature

Co-localize and synergize were discovered using AVAD
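A χ² test of this kind can be sketched on a 2x2 contingency table: does a candidate verb occur next to tagged terms more often than chance would predict? The counts below are invented, and the table layout is an assumption about how such a test could be set up, not a description of AVAD itself.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator if denominator else 0.0

# Hypothetical counts over 1000 word positions for one candidate verb:
#                        next to a tagged term   elsewhere
# candidate verb                   30               10
# any other word                   70              890
score = chi_square_2x2(30, 10, 70, 890)
print(round(score, 1))

# With one degree of freedom, a value above 3.84 is significant at p = 0.05.
is_candidate = score > 3.84
```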

Page 37: CSCI6904 Genomics and Biological Computing

Nomenclature

There are obscure ways to agree:

Protein kinase A phosphorylates protein B

Is the same as :

B + ATP -> B-P + ADP (reaction catalyzed by A)

Page 38: CSCI6904 Genomics and Biological Computing

Nomenclature

There are obscure ways, period:

Gene named:

“Forever Young” in Arabidopsis thaliana (mustard family)

“Mothers against decapentaplegic” in the fruit fly

Page 39: CSCI6904 Genomics and Biological Computing

Never mind the jargon!

Fight fire with fire:

They developed a method that uses BLAST, a popular sequence database search algorithm, to mine for biological terms.

(Krauthammer et al., 2000. Gene. 259:245-252)
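The underlying idea is that a term in running text can be matched against a dictionary the way a sequence is matched against a database: by local alignment that tolerates small differences. The sketch below is not BLAST and not the authors' method; it swaps in a plain Smith-Waterman local alignment over characters, with scoring parameters chosen arbitrarily for illustration.

```python
def local_alignment_score(query, text, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score over characters."""
    query, text = query.lower(), text.lower()
    previous = [0] * (len(text) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        current = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            same = query[i - 1] == text[j - 1]
            diagonal = previous[j - 1] + (match if same else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

# Hypothetical dictionary of known terms and a sentence with a spelling variant.
dictionary = ["interleukin-2", "protein kinase A"]
sentence = "Expression of interleukine-2 was induced"

scores = {term: local_alignment_score(term, sentence) for term in dictionary}
best_term = max(scores, key=scores.get)
print(best_term)
```

Because the alignment allows a gap, the variant spelling in the sentence still scores highly against the dictionary entry.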

Page 40: CSCI6904 Genomics and Biological Computing

Never mind the jargon!

Fight fire with fire:

N-(2-Hydroxyethyl)piperazine-N'-(2-ethanesulfonic acid) (HEPES)

2-(N-Morpholino)ethanesulfonic acid (MES)

3-(N-Morpholino)propanesulfonic acid (MOPS)

N-tris[Hydroxymethyl]methyl-3-aminopropanesulfonic acid (TAPS)

tris(Hydroxymethyl)aminomethane (TRIS)

Page 41: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Disambiguation

il2 and interleukin-2 can both be used to refer to the gene, the protein, or the mRNA.

Page 42: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Disambiguation

Use canonical name as much as possible.

Learn Semantic classes
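Both steps can be sketched with lookup tables: one mapping surface forms to a canonical name, one mapping context words to a semantic class. Everything in the tables below is hypothetical, and the hand-written cue lists only stand in for classes that the real system would learn.

```python
# Hypothetical synonym table: surface form -> canonical name.
CANONICAL = {
    "il2": "interleukin-2",
    "il-2": "interleukin-2",
    "interleukine-2": "interleukin-2",
    "interleukin-2": "interleukin-2",
}

# Hypothetical context cues used to guess the semantic class.
CLASS_CUES = {
    "gene": {"gene", "promoter", "locus", "transcription"},
    "mrna": {"mrna", "transcript", "transcripts"},
    "protein": {"protein", "receptor", "secreted", "binds"},
}

def disambiguate(term, sentence):
    """Return (canonical name, semantic class) for a term in its sentence."""
    canonical = CANONICAL.get(term.lower(), term)
    words = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    # Fall back to the undecided class when no cue is present.
    best_class, best_hits = "geneorprotein", 0
    for semantic_class, cues in CLASS_CUES.items():
        hits = len(words & cues)
        if hits > best_hits:
            best_class, best_hits = semantic_class, hits
    return canonical, best_class

result = disambiguate("il2", "The il2 promoter drives transcription.")
print(result)
```

The fallback class mirrors the "Geneorprotein" vertex type listed earlier, used when the text does not say which one is meant.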

Page 43: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Information extraction

Correlation methods, HMM

Formal grammar (lexicon)

GeneWays uses the GENIES NLP system.

It attempts complete parsing, then defaults to segmenting and partial parsing.

Page 44: CSCI6904 Genomics and Biological Computing

Details of the NLP system

GENIES (GENomics Information Extraction System)

Based on MedLEE (medical NLP system)

Term tagging component uses rules and external knowledge

Nested relationships; normalized and agentive forms of verbs (inhibit, inhibition, inhibitor).

Page 45: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Information simplification

Convert nested relationships into a collection of binary statements.
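One way to picture this simplification step: a nested statement is a tree of (subject, verb, object) tuples, and flattening emits one binary statement per node. The placeholder naming scheme below is an assumption made for illustration; the slides do not say how GeneWays labels inner events.

```python
def simplify(statement):
    """Flatten a nested (subject, verb, object) tuple into binary statements."""
    flat = []

    def visit(node):
        # A plain string is an entity; a tuple is an event to be flattened.
        if isinstance(node, str):
            return node
        subject, verb, obj = node
        subject_name = visit(subject)
        object_name = visit(obj)
        flat.append((subject_name, verb, object_name))
        # Refer to the whole event by a readable placeholder.
        return f"[{subject_name} {verb} {object_name}]"

    visit(statement)
    return flat

# "protein A inhibits the phosphorylation of protein C by protein B"
nested = ("protein A", "inhibits", ("protein B", "phosphorylates", "protein C"))
binary = simplify(nested)
for triple in binary:
    print(triple)
```

Inner events are emitted first, so every placeholder refers to a statement that already appears earlier in the list.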

Page 46: CSCI6904 Genomics and Biological Computing

Details of GeneWays

Ontology

Knowledge Models

Page 47: CSCI6904 Genomics and Biological Computing

Uses for GeneWays

Visualization

Synthesis and querying facility

The only filter described at the time of the publication is a filter based on the number of statements supporting an edge.
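Such a support filter amounts to counting identical statements and keeping edges above a threshold. A minimal sketch, with invented statements and a hypothetical `min_support` parameter:

```python
from collections import Counter

# Hypothetical extracted statements; duplicates come from different
# sentences or articles asserting the same relationship.
extracted = [
    ("protein A", "binds", "gene B"),
    ("protein A", "binds", "gene B"),
    ("protein A", "binds", "gene B"),
    ("gene B", "regulates", "gene C"),
    ("protein D", "inactivates", "protein A"),
    ("protein D", "inactivates", "protein A"),
]

def filter_edges(statements, min_support):
    """Keep edges backed by at least `min_support` statements."""
    support = Counter(statements)
    return {edge: n for edge, n in support.items() if n >= min_support}

kept = filter_edges(extracted, min_support=2)
for edge, n in kept.items():
    print(n, edge)
```

As the validation slides point out, redundant statements are not necessarily "more true", so the threshold is a display filter and not a test of correctness.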

Page 48: CSCI6904 Genomics and Biological Computing

Uses for GeneWays

Visualization

Synthesis and querying facility

Page 49: CSCI6904 Genomics and Biological Computing

Validation of GeneWays

Expert Review

125 of 2,500 statements were erroneous or “phantoms”.

Of these 125:

- 100 due to term identification.
- 12 NLP errors.
- 5 Simplifier errors.
- 8 Actually correct!

System’s precision: 95%

Expert’s precision: 93.5%
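The 95% figure follows directly from the counts on this slide, as the short check below shows. The last line is an inference, not a number reported on the slide: it shows what the precision would be if the 8 statements judged "actually correct" were counted as correct.

```python
reviewed = 2500   # statements examined by the expert
flagged = 125     # statements flagged as erroneous or "phantoms"

# Breakdown of the flagged statements, as listed on the slide.
causes = {
    "term identification": 100,
    "NLP": 12,
    "simplifier": 5,
    "actually correct": 8,
}

# Precision as reported: fraction of reviewed statements not flagged.
precision = (reviewed - flagged) / reviewed
print(f"{precision:.1%}")

# Inferred variant: reinstating the 8 wrongly flagged statements.
adjusted = (reviewed - flagged + causes["actually correct"]) / reviewed
print(f"{adjusted:.2%}")
```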

Such a system should be seen as a means to enrich

Page 50: CSCI6904 Genomics and Biological Computing

Validation of GeneWays

Redundancy

Redundant statements are not necessarily “more true”.

Redundancy due to indirect relationships.

Page 51: CSCI6904 Genomics and Biological Computing

Validation of GeneWays

A parser’s nightmare:

Statement : “mitogen-activated protein kinase kinase kinase (MAPKKK) phosphorylates protein B”

Interpretations:

1. Protein kinase [protein] is activated by the mitogen [complex]
2. MAPK [protein] phosphorylates MAPKK [protein]
3. MAPKK [protein] phosphorylates MAPKKK [protein]
4. MAPKKK [protein] phosphorylates B [protein]

Potential historical artifacts:

1. B [protein] is activated by the mitogen [complex]
2. MAPKK [wrongly thought to be MAPK] phosphorylates B [protein]
3. …

Page 52: CSCI6904 Genomics and Biological Computing

Perspective

References

Main: Rzhetsky et al., 2004. GeneWays: a system for extracting, analysing, visualizing, and integrating molecular pathway data. J. Biomed. Informatics, 37:43-53

Learning Verbs: Hatzivassiloglou, V., Weng, W. Learning Anchor Verbs for Biological Interactions Patterns from published text articles. www.cs.columbia.edu/nlp/papers/2002/hatzivassiloglou_weng_02.pdf

NLP processor: Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. 2001. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17:S74-S82

Acknowledgement: Aditya Aggarwal, the student who dug out this paper to present in CSCI 6904 (2004)