Kyle Jensen MIT Ph.D. Thesis Defense

Motif discovery in sequential data

Kyle Jensen

Thesis Offense
Department of Chemical Engineering
Massachusetts Institute of Technology

Thesis committee:

Greg Stephanopoulos (ChE, MIT)
William Green (ChE, MIT)
Robert Berwick (EECS, MIT)
Isidore Rigoutsos (IBM)

Sequencing throughput, like processor power, is growing exponentially

As a result, Genbank is overflowing

Anatomics, Biomics, Chromosomics, Cytomics, Enviromics, Epigenomics, Fluxomics, Glycomics, Glycoproteomics, Immunogen., Immunomics, Immunoproteomics, Integromics, Interactomics, Ionomics, Lipidomics, Metabolomics, Metabonomics, Metagenomics, Metallomics, Metalloproteomics, Methylomics, Mitogenomics, Neuromics, Neuropeptido., Oncogenomics, Peptidomics, Phenomics, Phospho-prot., Phosphoproteomics, Physiomics, Physionomics, Postgenomics, Pregenomics, Rnomics, Secretomics, Subproteomics, Surfaceomics, Syndromics, Transcriptomics

And the ome-ome keeps growing

Together, these data form a rich network of information

CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC

This data glut motivates the need for automated methods of discovery and analysis

Here, I focus on motif discovery in sequential data using a linguistic metaphor

S -> NP VP
NP -> D NP1 | PN
NP1 -> ADJ NP1 | N
VP -> V NP
D -> a | the
PN -> peter | paul | mary
ADJ -> large | black
N -> dog | cat | horse
V -> is | likes | hates

A grammar is a mathematical system for describing the structure of a language

GRAMMAR

S -> NP VP
NP -> D NP1 | PN
NP1 -> ADJ NP1 | N
VP -> V NP
D -> a | the
PN -> peter | paul | mary
ADJ -> large | black
N -> dog | cat | horse
V -> is | likes | hates

S => NP VP => PN VP => mary VP => mary V NP => mary hates NP
  => mary hates D NP1 => mary hates the NP1 => mary hates the N
  => mary hates the dog

S => NP VP => NP V NP => NP V D NP1 => NP V a NP1 => NP V a ADJ NP1
  => NP is a ADJ NP1 => NP is a ADJ ADJ NP1 => NP is a large ADJ NP1
  => NP is a large ADJ N => NP is a large black N => NP is a large black cat
  => PN is a large black cat => peter is a large black cat
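The derivations on this slide can be reproduced mechanically. What follows is a minimal Python sketch, assuming the toy grammar is stored as a dictionary of productions. The names `GRAMMAR` and `derive` are illustrative and do not come from the slides. The sketch always rewrites the leftmost non-terminal, which is the order used in the first derivation.

```python
# Toy grammar from the slide; keys are non-terminals, values list the alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "NP1"], ["PN"]],
    "NP1": [["ADJ", "NP1"], ["N"]],
    "VP": [["V", "NP"]],
    "D": [["a"], ["the"]],
    "PN": [["peter"], ["paul"], ["mary"]],
    "ADJ": [["large"], ["black"]],
    "N": [["dog"], ["cat"], ["horse"]],
    "V": [["is"], ["likes"], ["hates"]],
}


def derive(choices, start="S"):
    """Rewrite the leftmost non-terminal once per entry in `choices`.

    Each entry is the index of the alternative to apply. Returns every
    sentential form produced, beginning with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Position of the leftmost symbol that still has productions.
        pos = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        form[pos:pos + 1] = GRAMMAR[form[pos]][choice]
        steps.append(" ".join(form))
    return steps


if __name__ == "__main__":
    # Choices that spell out "mary hates the dog".
    print(" => ".join(derive([0, 1, 2, 0, 2, 0, 1, 1, 0])))
```

Any sentence that can be reached this way is, by definition, in the language of the grammar.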

Grammars can describe biological phenomena in the same manner as natural languages

Two examples

Example: a declarative sentence in English

S -> NP V A P NP
NP -> D N

the boy is upset over the girl
the advisor is pleased with the research

Example: eukaryotic gene structure

[Figure: parse tree of a gene drawn over a DNA sequence; node labels: gene, upstream, primary transcript, TATA box, start codon, exon, intron, exon, stop codon]

ATGACTGACTGATCGATCGATCGATCGATGATCGTACGATCGATGCATCGATCGATCGATCGATCGA

Grammars are suitable for describing any complex arrangement of sequential data

The grammar of biological sequences

[Table columns: language, grammar, linguistic example, biological example; side label: complexity]

Simple, regular grammars are compactly written as regular expressions

[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
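A pattern like the one above can be run almost directly as a regular expression. The Python sketch below assumes that `(9,20)` denotes a gap of 9 to 20 residues, which Python writes as `{9,20}`. The helper names `to_python_regex` and `find_motif` are illustrative only.

```python
import re

PATTERN = "[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]"


def to_python_regex(pattern):
    """Rewrite gap ranges written as (m,n) into Python's {m,n} quantifier."""
    return re.sub(r"\((\d+),(\d+)\)", r"{\1,\2}", pattern)


def find_motif(pattern, sequence):
    """Return the (start, end) span of each non-overlapping match."""
    regex = re.compile(to_python_regex(pattern))
    return [m.span() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(to_python_regex(PATTERN))
```

Bracketed classes such as `[LIVF]` and the `.` wildcard already have the same meaning in Python's `re` module, so only the gap notation needs translating.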

Motif discovery is the inverse problem: given the sentences, find the grammar

CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC

Part 1: Rational design of antimicrobial peptides using linguistic methods

Antimicrobial peptides are small proteins that attack and kill bacteria

Functional characteristics:
  Part of innate immune system
    all multicellular eukaryotes
  Attack bacterial membrane
    electrostatic attraction
    effective at µg/mL concentrations

Applications of AmPs:
  Novel class of antibiotics
    low bacterial resistance
    activity against MDR pathogens
    currently topical: acne, etc.
  Other clinical applications
    AIDS, certain cancers, biodefense

[Figure: positively charged AmPs (+) drawn to the negatively charged bacterial membrane (-)]

AmP sequences contain many repeated motifs, suggesting a linguistic model

AmP amino acid sequences
  ~1000 natural AmP sequences from many different species

Numerous conserved motifs
  suggest rules for building AmPs
  similar to grammar of languages

[Figure: cecropins, with the cecropin motif marked]

The language of AmP sequences
  Can we find the underlying grammar of this language?
  Will this grammar capture the sequence/function relationships?
  Knowing the grammar, can we build novel AmPs?

The AmP sequences were modeled using simple regular grammars

Given a language, is there a regular grammar?
Example: the cecropin sub-sequences
  seq1: QSEAGWLKKLGK
  seq2: QSEAGWLRKAAK
  seq3: QTEAGGLKKFGK
What grammar describes these sequences?
  cecropin motif: Q.EAG.L.K..K

Automated grammar induction: Teiresias
Regular grammars of the form $G = (V, \Sigma, R, S)$, where
  $V$ = non-terminal symbols
  $\Sigma$ = amino acids
  $R$ = set of replacement rules
  $S$ = starting amino acid
Rules $R$: $V_i \to \sigma V_j$, where $\sigma \in \Sigma$ (type A, amino acid) or $\sigma$ is a wildcard (type B)
Find all $G$ for which $a/b > w$ and $a + b > L$
Subject to maximal $|R|$ and maximal occurrences of $G$
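For the three cecropin sub-sequences, the motif can be recovered with a much simpler procedure than Teiresias, because the sequences are already aligned and equal in length. The Python sketch below is only a column-wise consensus under that assumption. It is not the Teiresias algorithm, which works on unaligned input, and the name `column_consensus` is illustrative.

```python
import re

SEQS = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]


def column_consensus(seqs, wildcard="."):
    """Keep a residue where every sequence agrees; otherwise emit a wildcard."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be pre-aligned and of equal length")
    return "".join(
        col[0] if len(set(col)) == 1 else wildcard for col in zip(*seqs)
    )


motif = column_consensus(SEQS)
print(motif)  # Q.EAG.L.K..K

# The motif doubles as a regular expression that accepts all three inputs.
assert all(re.fullmatch(motif, s) for s in SEQS)
```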

Our goal was to use this linguistic model to design novel AmPs

Protein design space is combinatorially large
  20^N possible sequences of N amino acids
    N = 18, number of stars in universe
    N = 50, number of atoms in Earth
    N = 100, number of electrons in universe
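The size of the design space is simple to compute. A short Python sketch follows, assuming a 20-letter amino acid alphabet; the function name `design_space` is illustrative.

```python
import math


def design_space(n, alphabet=20):
    """Number of distinct sequences of length n over the given alphabet."""
    return alphabet ** n


for n in (18, 50, 100):
    size = design_space(n)
    print(f"N = {n:3d}: 20^N is about 10^{math.log10(size):.0f}")
```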

Why design novel AmPs?
  Concern over RamPs
  Cross-resistance

Other approaches
  Folding & thermodynamics
  Combinatorial libraries

[Figure: nested regions labeled sequence space, grammatical space, natural AmPs, and true AmPs]

We used Teiresias to discover ~700 grammars defining the language of AmPs

[Figure: a query sequence shown against grammar 1 and grammar 2]

These grammars were used to design novel AmPs
  No more than 5 amino acids in a row shared with natural AmPs
  12 million grammatical sequences
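One way to read the 5-in-a-row constraint is that a designed sequence may not share a contiguous run longer than five residues with any natural AmP. The Python sketch below implements that reading as a longest-common-substring check. The interpretation and the names `longest_shared_run` and `is_novel` are assumptions, not details taken from the slides.

```python
def longest_shared_run(a, b):
    """Length of the longest contiguous substring common to a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended one position earlier in both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_novel(candidate, natural_seqs, max_run=5):
    """True if no natural sequence shares more than max_run residues in a row."""
    return all(longest_shared_run(candidate, nat) <= max_run for nat in natural_seqs)


if __name__ == "__main__":
    # Cecropin sub-sequences from an earlier slide, used here only as test data.
    print(longest_shared_run("QSEAGWLKKLGK", "QSEAGWLRKAAK"))  # 7
    print(is_novel("QSEAGWLKKLGK", ["QTEAGGLKKFGK"]))          # True
```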

40 novel AmPs were chosen for experimental validation

Tested against B. subtilis & E. coli
  serial dilutions
  replicates

Expect activity?    Y                  N
Test                42 motif-based     42 shuffled
Control             9 natural AmPs     9 non-AmPs

Our results show significant enrichment for activity in the designed set

Expected activity?   Y                         N
Test                 42 motif-based: 18 / 42   42 shuffled: 2 / 42
Control              9 natural AmPs: 6 / 9     9 non-AmPs: 0 / 9
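The slides do not say which statistical test backs the enrichment claim. As one hedged illustration, the counts for the test set (18 of 42 motif-based versus 2 of 42 shuffled) can be put through a one-sided Fisher exact test using only the Python standard library. The function name `fisher_one_sided` is illustrative.

```python
from math import comb


def fisher_one_sided(active_a, total_a, active_b, total_b):
    """P(group A holds at least active_a of the actives) under random labeling.

    This is the upper tail of the hypergeometric distribution for a 2x2 table.
    """
    actives = active_a + active_b
    total = total_a + total_b
    upper = min(actives, total_a)
    favorable = sum(
        comb(actives, k) * comb(total - actives, total_a - k)
        for k in range(active_a, upper + 1)
    )
    return favorable / comb(total, total_a)


if __name__ == "__main__":
    p = fisher_one_sided(18, 42, 2, 42)
    print(f"one-sided p-value: {p:.2e}")
```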

Optimized leads showed strong activity against anthrax and staph

Part 2: A generic motif discovery algorithm for diverse biomolecular data

Motif discovery is the automated search for similar regions in streams of data

Un-sequential data
  No ordering

Sequential data
  A natural ordering of the data
  Nucleotide and amino acid sequences
  Stock prices, protein structures

MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA

A motif is just a collection of mutually similar regions in the data stream

There are two classes of motif discovery tools commonly used for sequence analysis

Exhaustive regular-expression based tools
  Teiresias
  Pratt

Descriptive position weight matrix-based tools
  Gibbs sampler
  MEME
  Consensus

TGCTGTATATACTCACAGCAAACTGTATATACACCCAGGGTACTGTATGAGCATACAGTAACCTGAATGAATATACAGTATACTGTACATCCATACAGTATACTGTATATTCATTCAGGTAACTGTTTTTTTATCCAGTAATCTGTATATATACCCAGCTTACTGTATATAAAAACAGTA

CT[AT].[GT]....A..CAG

Gemoda was designed to be exhaustive and have descriptive power

Gemoda exhaustively returns maximal motifs
  Uses convolution of Teiresias
  Way of stitching together smaller patterns combinatorially

Gets descriptiveness from similarity metric
  Generic, context-dependent definition of similarity

MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA

F(w1, w2) = square error

F(w1, w2) = aa scoring matrix

Gemoda proceeds in three steps: comparison, clustering, and convolution

The comparison stage is used to map the pairwise similarities between all windows in the data streams

Creates a distance matrix
  Does an all-by-all comparison of windows in the data

Comparison function is context-specific

F(w1, w2)
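The comparison stage can be sketched in a few lines of Python. This is a minimal illustration, not Gemoda's implementation: `identity` is a hypothetical stand-in for F(w1, w2) that counts matching positions, and the names `windows` and `similar_pairs` are likewise assumptions.

```python
from itertools import combinations


def windows(seq, length):
    """All (offset, window) pairs of the given length in one sequence."""
    return [(i, seq[i:i + length]) for i in range(len(seq) - length + 1)]


def identity(w1, w2):
    """Stand-in for F(w1, w2): number of positions where the windows agree."""
    return sum(a == b for a, b in zip(w1, w2))


def similar_pairs(seqs, length, threshold, score=identity):
    """All-by-all comparison; returns pairs of (sequence index, offset)."""
    nodes = [
        (s, i, w) for s, seq in enumerate(seqs) for i, w in windows(seq, length)
    ]
    return {
        ((s1, i1), (s2, i2))
        for (s1, i1, w1), (s2, i2, w2) in combinations(nodes, 2)
        if score(w1, w2) >= threshold
    }


if __name__ == "__main__":
    print(similar_pairs(["ABCDE", "XBCDY"], length=3, threshold=3))
```

Swapping `score` for an amino acid scoring matrix or a squared-error function changes the notion of similarity without touching the rest of the loop.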

The clustering phase is used to find groups of mutually similar windows

Different clustering functions have different uses
  Clique-finding is provably exhaustive
  K-means and other methods are faster

Output clusters become elementary motifs which are convolved to make longer, maximal motifs
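Clique-finding over the similarity graph can be illustrated with the Bron-Kerbosch algorithm, a standard method for enumerating maximal cliques. The slides do not say which clique algorithm Gemoda uses, so the choice here is an assumption, as are the names `adjacency_from_pairs`, `maximal_cliques`, and `elementary_motifs`.

```python
def adjacency_from_pairs(pairs):
    """Build an undirected adjacency map from pairs of similar windows."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate all maximal cliques (Bron-Kerbosch, without pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


def elementary_motifs(pairs, min_support):
    """Cliques with at least min_support mutually similar windows."""
    adjacency = adjacency_from_pairs(pairs)
    return [c for c in maximal_cliques(adjacency) if len(c) >= min_support]
```

Every member of a clique is similar to every other member, which is what makes each clique a candidate elementary motif.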

The convolution phase is used to stitch together the clusters into maximal motifs

The motifs should be as long as possible, without decreasing the support

[Figure: elementary motifs (clusters) arranged by window ordering]
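The stitching idea can be sketched as a greedy rightward extension: a cluster of windows grows by one position whenever some cluster contains the successor of every one of its windows, so the support never drops. This Python sketch is a simplification under that assumption. The convolution described for Gemoda is combinatorial, and the names `convolve` and `extend_right` are illustrative.

```python
def convolve(left, right):
    """Windows in `left` whose successor (same sequence, offset + 1) is in `right`."""
    return {(s, i) for (s, i) in left if (s, i + 1) in right}


def extend_right(seed, clusters, window_len):
    """Grow a seed cluster to the right while every occurrence survives.

    `seed` and each cluster are sets of (sequence index, offset) windows.
    Returns the starting windows and the total motif length reached.
    """
    current, span = set(seed), window_len
    extended = bool(current)
    while extended:
        extended = False
        for cluster in clusters:
            # Accept the step only if the support does not decrease.
            if len(convolve(current, cluster)) == len(current):
                current = {(s, i + 1) for s, i in current}
                span += 1
                extended = True
                break
    return set(seed), span


if __name__ == "__main__":
    clusters = [
        {(0, 1), (1, 4)},
        {(0, 2), (1, 5)},
        {(0, 3), (1, 6)},
        {(0, 4), (1, 9)},
    ]
    print(extend_right(clusters[0], clusters, window_len=3))
```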

Here we show a few representative ways in which Gemoda can be used

Motif discovery in...

Protein sequences
  (ppGpp)ase enzymes & finding known domains

DNA sequences
  The LD-motif challenge problem

Protein structures
  Conserved structures without conserved sequences

Gemoda can be applied to amino acid sequences as well

Example: (ppGpp)ase family from ENZYME database
  Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes
  EC 3.1.7.2
  Ave. length ~700 amino acids
  8 sequences from 8 species

Searched using Gemoda
  Minimum length = 50 amino acids
  Minimum Blosum62 bit score = 50 bits
  Minimum support = 100% (8/8 sequences)
  Clustering method = clique finding

Can Gemoda find this known motif?

How sensitive is Gemoda to noise?

(ppGpp)ase example: the comparison phase shows many regions of local similarity

Dots indicate 50aa windows that are pairwise similar

Streaks indicate regions that will probably be convolved into a maximal motif

(ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences

(ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database

Maximal motif (one of three, ~100 aa in length)

This particular cluster represents the first set of 8 50aa windows in the above motif.

Results are insensitive to noise

The LD-motif problem models the subtle binding site discovery problem

GACTCGATAGCGACG

Sequence #1: ATGATGAGTCTATTGCGCCGCGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...

Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCTCTCGATTGCGACTTTCGACTAGCTA...

Sequence #3: ATGTACTACGAGTCTCCATAGCGTTGCTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...

Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGACGACTCGTGGGCGGCG...

...

Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACTGATTCGTAAGGGACGATAGCTACTATCTTATTCGACTAGTACGACT...
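A test set of this kind can be generated by planting a mutated copy of the motif in each random background sequence. The Python sketch below assumes the commonly cited challenge settings of 20 sequences of length 600 with 4 mutations per occurrence. Those numbers, and the function names, are assumptions and are not stated on the slide.

```python
import random


def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions changed."""
    out = list(motif)
    for i in rng.sample(range(len(motif)), d):
        out[i] = rng.choice([base for base in "ACGT" if base != motif[i]])
    return "".join(out)


def plant(motif, d, n_seqs, seq_len, rng):
    """Embed one mutated motif copy at a random offset in each random sequence.

    Returns a list of (sequence, offset of the planted occurrence).
    """
    data = []
    for _ in range(n_seqs):
        background = [rng.choice("ACGT") for _ in range(seq_len)]
        start = rng.randrange(seq_len - len(motif) + 1)
        background[start:start + len(motif)] = mutate(motif, d, rng)
        data.append(("".join(background), start))
    return data


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    motif = "GACTCGATAGCGACG"
    for seq, start in plant(motif, d=4, n_seqs=3, seq_len=60, rng=rng):
        print(seq[start:start + len(motif)], hamming(seq[start:start + len(motif)], motif))
```

Because every occurrence differs from the motif at d positions, two occurrences can differ from each other at up to 2d positions, which is what makes the signal subtle.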

Gemoda can solve both the LD-motif problem and a more generalized version of the same

[Figure: the example sequences from the LD-motif problem, with occurrences of the motif GACTCGATAGCGACG varied in each of the ways listed below; panels show a lengthened motif (GGGACTCGATAGCGACGCCG) and a sequence marked X]

Generalizations:
  Total motif length?
  All sequences?
  Number of mutations?
  Number of unique motifs?

Gemoda can also be applied to protein structures

Treat protein structure as alpha-carbon trace
  Series of x,y,z coordinates
    (x1, y1, z1)
    (x2, y2, z2)
    (x3, y3, z3)
    ...
    (xM, yM, zM)

Use a clustering function that compares x,y,z windows
  Root mean square deviation (RMSD)
  unit-RMSD
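Comparing two coordinate windows by RMSD can be sketched as follows. This Python version assumes the windows are already expressed in a common frame. Structural comparison usually superimposes the two windows first, and that step is omitted here. The unit-RMSD variant is not defined on the slide and is not implemented. The names `rmsd` and `coordinate_windows` are illustrative.

```python
import math


def rmsd(window_a, window_b):
    """Root mean square deviation between two equal-length (x, y, z) windows."""
    if len(window_a) != len(window_b) or not window_a:
        raise ValueError("windows must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(window_a, window_b))
    return math.sqrt(total / len(window_a))


def coordinate_windows(trace, length):
    """Slide a fixed-length window along an alpha-carbon trace."""
    return [trace[i:i + length] for i in range(len(trace) - length + 1)]


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(a, b))
```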

Protein structure example: human FIT vs. uridylyltransferase

fin


language            grammar     linguistic example   biological example
unrestricted        any         ???
context-sensitive   Zb -> aX    Dutch                RNA pseudo-knots
context-free        Z -> aXy    Swiss-German         RNA hairpin loop
right-linear        Z -> aX     English phonology    ATP-binding motif

05/09/2006, 08:02:46
