CLUNCH Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar.


Page 1

Applying NLP models to the Biological Domain

Eugen Buehler, Lyle Ungar

Page 2

Overview

• “Languages” of Computers and Biology

• Probability Models for NL and Biology

• Maximum Entropy

• Basic ME amino acid model

• The “Whole Protein Model”

• Results in a gene prediction model

Page 3

Bits and Bytes: The Alphabet of Computers

• Computer electronics are complicated: RAM, processor, etc.

• It all comes down to bits (1s and 0s).

• Bits can be organized into bytes (8 bits each).

• Bytes can represent, among other things, letters (ASCII), which can form sentences.
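The bits-to-bytes-to-letters chain can be sketched in a few lines of Python. This is only an illustration: the function names and the sample message are made up here, and the sketch assumes plain 8-bit ASCII.

```python
def text_to_bits(text: str) -> str:
    """Encode each character as an 8-bit binary string (ASCII assumed)."""
    return "".join(format(ord(ch), "08b") for ch in text)


def bits_to_text(bits: str) -> str:
    """Decode a string of '0'/'1' characters, 8 bits per letter."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Slice into bytes, parse each slice as base 2, map it to a character.
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


message = "Mary"                 # sample text, chosen for illustration
encoded = text_to_bits(message)  # one long run of 1s and 0s
decoded = bits_to_text(encoded)  # back to the original letters
```

Without knowing where the byte boundaries fall, the encoded string looks like noise, which is the point of the analogy with unannotated DNA.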


Page 5

DNA: Biology’s Alphabet

• Biology is complicated.

• It comes down to nucleotides (A,C,G,T).

• Nucleotides can be grouped into codons.

• Codons represent amino acids, amino acids make proteins/genes.

Codons: CTG TGC AGC GCU ATG

Amino acids: L C S A M
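The codon-to-amino-acid step can be sketched as a table lookup. The table below is deliberately partial and holds only the five codons shown on this slide; a real translation would need all 64. The slide writes the fourth codon in RNA notation (GCU), so the sketch normalizes U to T before the lookup. The names `CODON_TABLE` and `translate` are illustrative.

```python
# Partial codon table: only the five codons from the slide's example.
CODON_TABLE = {"CTG": "L", "TGC": "C", "AGC": "S", "GCT": "A", "ATG": "M"}


def translate(dna: str) -> str:
    """Translate a nucleotide string codon by codon; unknown codons become '?'."""
    dna = dna.upper().replace("U", "T")  # accept RNA-style input
    usable = len(dna) - len(dna) % 3     # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)


protein = translate("CTGTGCAGCGCUATG")
```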

Page 6

Find the words!

0101000110010100100100011010100011100101101101101001011101010101000001110101010100010001001110011101001100111001001110010100110010100100010010010001000100100010001001001100010001001100110010011101010100110011001001100101010001000110100100010000100100100010100100100010001101010100010101011100101011100011110001111000110011101001111101000011010000011110100111110010011000111100101111000111010101011001

Page 7

Find the genes!

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC
TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT
TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG
CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT
CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG
CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC
ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG
CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA
ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC

Page 8

NL and Biological Modeling

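“Mary went to the ____ .”

$$p(w \mid h), \quad \text{e.g. } p(\text{store} \mid h) \text{ and } p(\text{sore} \mid h)$$

MSGTIPSCPTAL ___

$$p(a \mid h), \qquad P(\text{protein}) = \prod_{i=1}^{n} p(a_i \mid h_i)$$

Here $h$ is the history, $w$ the next word, and $a$ the next amino acid.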

Page 9

Markov Models

$$p(w_n \mid h) \approx p(w_n \mid w_{n-1}, w_{n-2})$$

$$p(a_n \mid h) \approx p(a_n \mid a_{n-1}, a_{n-2})$$
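A second-order Markov model of this kind, where the next amino acid depends only on the previous two, can be sketched with simple counts. The add-one smoothing, the function names, and the two toy training sequences are assumptions made for this example; the slides do not say how the probabilities were estimated.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def train_trigram(sequences):
    """Count how often each amino acid follows each two-residue context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    return counts


def prob(counts, context, aa):
    """Add-one smoothed estimate of p(aa | previous two residues)."""
    seen = counts.get(context, Counter())
    return (seen[aa] + 1) / (sum(seen.values()) + len(AMINO_ACIDS))


# Toy training data, for illustration only.
counts = train_trigram(["MSGTIPSCPTAL", "MSGTVPSCPTAL"])
p_known = prob(counts, "MS", "G")   # "G" follows "MS" in both sequences
p_unseen = prob(counts, "MS", "W")  # never observed after "MS"
```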

Page 10

ME, In a Nutshell

• Constrain the model.

• Maximize entropy.

Page 11

Constraining features

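• “is the” occurs with frequency 1/10000.

• Define a feature:

$$f_{\{\text{is},\text{the}\}}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\ 0 & \text{otherwise} \end{cases}$$

• Require that:

$$\sum_{h, w} p(h, w)\, f_{\{\text{is},\text{the}\}}(h, w) = \frac{1}{10000}$$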

Page 12

Exponential Solution

• A unique solution exists with maximum entropy:

$$p_{ME}(w \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(h, w) \right)$$
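The exponential form, in which each active feature contributes a weight inside the exponent and Z(h) normalizes over the vocabulary, can be sketched as follows. The three-word vocabulary, the two features, and the weights are all hypothetical; in a real model the weights would be fitted so that the feature constraints hold, and that training step is omitted here.

```python
import math

VOCAB = ["the", "a", "store"]  # toy vocabulary, for illustration


def f_is_the(history, word):
    """Bigram feature: history ends in "is" and the next word is "the"."""
    return 1.0 if history and history[-1] == "is" and word == "the" else 0.0


def f_store(history, word):
    """Unigram feature for the word "store"."""
    return 1.0 if word == "store" else 0.0


FEATURES = [f_is_the, f_store]
LAMBDAS = [1.5, 0.5]  # hypothetical weights, not trained


def p_me(word, history):
    """p(word | history) = exp(sum_i lambda_i * f_i(history, word)) / Z(history)."""
    def score(candidate):
        total = sum(lam * f(history, candidate) for lam, f in zip(LAMBDAS, FEATURES))
        return math.exp(total)

    z = sum(score(candidate) for candidate in VOCAB)  # normalizer Z(h)
    return score(word) / z


history = ["this", "is"]
dist = {word: p_me(word, history) for word in VOCAB}
```

A word with no active features still gets probability exp(0)/Z(h), so the model never assigns zero probability within the vocabulary.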

Page 13

Triggers

• Triggers – Words that increase the likelihood of other words.

Crop → Harvest

Cuban → Havana

Iran → Hashemi

Hate → Hate

Page 14

Unigram and Bigram Caches

• Caches – frequency tables built from the history.

• Is “supercalifragilisticexpialidocious” a common word?

• Allow for model adaptation.

Page 15

Applying ME Models in Computational Biology

• Significant improvement for NLP.

• Same for biological models?

• AA sequences: a simple test case.

Page 16

Feature Sets

• Unigrams and Bigrams

• Self-triggers - frequency of a specific amino acid.

• Class based self-triggers - frequency of a specific amino acid class.

• Unigram Cache - Amino acid frequency for this protein.
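The history-based features in this list might look roughly like the sketch below. Two loud assumptions: the amino acid classes are a made-up four-way grouping (the slides do not say which classes were used), and the triggers are simplified to binary "seen before" indicators rather than frequency-based ones.

```python
from collections import Counter

# Hypothetical grouping of the 20 amino acids; not taken from the slides.
_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_CLASS = {aa: name for name, members in _GROUPS.items() for aa in members}


def self_trigger(history, aa):
    """1 if this amino acid already occurred earlier in the protein."""
    return 1 if aa in history else 0


def class_self_trigger(history, aa):
    """1 if any earlier residue belongs to the same class as `aa`."""
    target = AA_CLASS[aa]
    return 1 if any(AA_CLASS[prev] == target for prev in history) else 0


def unigram_cache(history):
    """Relative frequency of each amino acid seen so far in this protein."""
    n = len(history)
    return {aa: count / n for aa, count in Counter(history).items()} if n else {}


history = "MSGT"
cache = unigram_cache("AAG")
```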

Page 17

Training and testing data

• Burset et al. set of 571 proteins.

• Homologous proteins eliminated.

• Resulting set of 204 proteins split into 2 groups of 102 each.

Page 18

Perplexity of Amino Acid Models

[Figure: perplexity (y-axis, 16.6 to 18.4) of the amino acid models on test data and training data.]
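For reference, perplexity is the exponential of the average negative log-probability that a model assigns to each symbol. A minimal sketch, with `perplexity` as an illustrative name:

```python
import math


def perplexity(probs):
    """Perplexity of a model, given the probability it gave each observed symbol."""
    mean_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_neg_log)


# A model that spreads probability evenly over the 20 amino acids.
uniform = perplexity([1 / 20] * 50)
```

The uniform model scores a perplexity of 20, which gives a baseline for reading the values on this chart.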

Page 19

Results

• “Long distance” features help.

• Best model gives a 30% reduction in perplexity over unigram reduction.

• Our model may improve predictions made by Genscan, a eukaryotic gene finding algorithm.

Page 20

Limitations of this model

• Artificial model.

• Cannot represent all global features.

Page 21

The “Whole Sentence” Model

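$$P_{WS}(s) = \frac{1}{Z} P_0(s) \exp\left( \sum_j \lambda_j f_j(s) \right)$$

$$P_0(s) = \prod_{n=1}^{|s|} p(w_n \mid h_n) = \prod_{n=1}^{|s|} \frac{1}{Z(h_n)} \exp\left( \sum_i \lambda_i f_i(h_n, w_n) \right)$$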

Page 22

Secondary Structure

MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITSVWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISGYFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSWVWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILLCYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGYAFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSSVSNSSVSPA

-----HHHHHHHHHHH--------------EEE--------------------EEEE---EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE---EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE---------HHHHHEHEEEEEEEHH-------------H------------E----------EEEEEEEEE------EHHHHHHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE-------EEEHHH----------HHHHHHHHHH--------EEEEEH-----HHHHHHH---------------EEEEE---------

Page 23

“Whole Protein” Results

• 19 features evaluated

• Two were selected:

– Mean length of alpha helix region

– Maximum length of any structural region

• 59% increase in protein likelihood
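Whole-protein features like the two selected here can be computed directly from a secondary-structure string. The sketch assumes the usual convention that H marks helix, E marks strand, and "-" marks no assigned structure, and it assumes that "structural region" excludes the "-" runs; the slides confirm neither. The function names are illustrative.

```python
from itertools import groupby


def structural_regions(ss):
    """Run-length encode a structure string: 'HHH--EE' -> [('H', 3), ('-', 2), ('E', 2)]."""
    return [(label, len(list(run))) for label, run in groupby(ss)]


def mean_helix_length(ss):
    """Average length of the helix ('H') runs; 0.0 if there are none."""
    helices = [n for label, n in structural_regions(ss) if label == "H"]
    return sum(helices) / len(helices) if helices else 0.0


def max_region_length(ss):
    """Longest helix or strand run; '-' runs are assumed not to count."""
    lengths = [n for label, n in structural_regions(ss) if label != "-"]
    return max(lengths, default=0)


# Small made-up example: an 11-residue helix, a 3-residue strand, a 2-residue helix.
ss = "-" * 5 + "H" * 11 + "-" * 14 + "E" * 3 + "-" * 4 + "H" * 2
mean_h = mean_helix_length(ss)
longest = max_region_length(ss)
```

Each function maps a whole protein to a single number, which is the shape a feature needs in order to enter a whole-sequence model.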

Page 24

Improved Glimmer Models

• Glimmer uses interpolated Markov models (IMMs) to predict genes in bacteria.

• Will adding amino acid triggers improve these models? How much?

Page 25

H. Pylori Genome

• 1562 Coding Sequences

• Split into:

– Training (>500bp) – 1154 genes, 1,354,167 bp

– Testing (<500bp) – 408 genes, 129,045 bp

Page 26

Glimmer Depth

[Figure: change in PPC (y-axis, -0.1 to 1.5) plotted against change in model depth/features.]

Page 27

Lateral Gene Transfer

• Many genes in bacteria come not from their ancestors but from other bacterial species.

• Different bacteria “prefer” to use different codons.

• Analogous to plagiarism detection?
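One way to make the codon-preference idea concrete is to compare a gene's codon usage against the usage of the genome it sits in. The sketch below uses KL divergence as the comparison; that choice, the tiny made-up sequences, and the `eps` floor for unseen codons are all assumptions for illustration, not a method taken from the slides.

```python
import math
from collections import Counter


def codon_usage(dna):
    """Relative frequency of each codon in an in-frame coding sequence."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return {codon: n / len(codons) for codon, n in Counter(codons).items()}


def kl_divergence(gene, genome, eps=1e-6):
    """KL(gene || genome); `eps` stands in for codons never seen in the genome."""
    return sum(p * math.log(p / genome.get(codon, eps)) for codon, p in gene.items())


# Made-up sequences: the "native" gene reuses the genome's preferred codons,
# the "foreign" gene uses codons the genome never uses.
genome_usage = codon_usage("CTGCTGCTGGCGGCGAAA")
native = kl_divergence(codon_usage("CTGGCGCTG"), genome_usage)
foreign = kl_divergence(codon_usage("TTATTAGCA"), genome_usage)
```

A gene whose codon usage diverges sharply from its host genome would be a candidate for lateral transfer, much as an unusual writing style flags a borrowed passage.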

Page 28

Model Adaptation

• Gene models are trained for every organism.

• Lots of unused information

• Analogous to cross-domain application of NLP models.

Page 29

Thanks

• Lyle Ungar

• Roni Rosenfeld

• NIH Grant


Page 31

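N-Gram Features

• Unigram (frequency of individual words)

$$f_{w_1}(h, w) = \begin{cases} 1 & \text{if } w = w_1 \\ 0 & \text{otherwise} \end{cases}$$

• Bigram (frequency of pairs of words)

$$f_{\{w_1, w_2\}}(h, w) = \begin{cases} 1 & \text{if } h \text{ ends in } w_1 \text{ and } w = w_2 \\ 0 & \text{otherwise} \end{cases}$$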

Page 32

Model with 44 features
Feature: A Parameter: 1.1283516023198754
Feature: V Parameter: 1.0452969493978286
Feature: L Parameter: 1.6105829266787726
Feature: I Parameter: 0.6972079195742815
Feature: P Parameter: 0.8475961041799492
Feature: W Parameter: 0.25978594248392567
Feature: F Parameter: 0.7344119941888887
Feature: M Parameter: 0.4493594016115246
Feature: G Parameter: 1.0298040447402417
Feature: S Parameter: 1.4262351925813321
Feature: T Parameter: 0.9964109799704354
Feature: Y Parameter: 0.5683263852913849
Feature: C Parameter: 0.4212841347252633
Feature: N Parameter: 0.714248861570727
Feature: Q Parameter: 1.0039856305693131
Feature: K Parameter: 1.0584367118507696
Feature: R Parameter: 1.0280086647521616
Feature: H Parameter: 0.45401904480206107
Feature: D Parameter: 0.7901584163893435
Feature: Self Trigger (FREQUENCY) : 1 Parameter: 1.4162727357372282
Feature: Self Trigger (FREQUENCY) : 2 Parameter: 1.0953472527852288
Feature: Self Trigger (FREQUENCY) : 3 Parameter: 0.9310253183080955
Feature: Self Trigger (FREQUENCY) : 4 Parameter: 5.441497902570791
Feature: Self Trigger (FREQUENCY) : 5 Parameter: 1.9764385350098654
Feature: Self Trigger (FREQUENCY) : A Parameter: 2.9856234039913274
Feature: Self Trigger (FREQUENCY) : V Parameter: 1.864040230431935
Feature: Self Trigger (FREQUENCY) : L Parameter: 2.9274415604129906
Feature: Self Trigger (FREQUENCY) : I Parameter: 2.4563140770039715
Feature: Self Trigger (FREQUENCY) : P Parameter: 3.7708734693326558
Feature: Self Trigger (FREQUENCY) : W Parameter: 2.672577458866343
Feature: Self Trigger (FREQUENCY) : F Parameter: 1.479212103590432
Feature: Self Trigger (FREQUENCY) : M Parameter: 0.5107656797047934
Feature: Self Trigger (FREQUENCY) : G Parameter: 4.495511648228042
Feature: Self Trigger (FREQUENCY) : S Parameter: 5.91039344990589
Feature: Self Trigger (FREQUENCY) : T Parameter: 2.449321508559543
Feature: Self Trigger (FREQUENCY) : Y Parameter: 2.3542114958521925
Feature: Self Trigger (FREQUENCY) : C Parameter: 82.68056436437357
Feature: Self Trigger (FREQUENCY) : N Parameter: 2.4258773271617287
Feature: Self Trigger (FREQUENCY) : Q Parameter: 14.611485492431102
Feature: Self Trigger (FREQUENCY) : K Parameter: 3.1913667655121665
Feature: Self Trigger (FREQUENCY) : R Parameter: 17.76347525956296
Feature: Self Trigger (FREQUENCY) : H Parameter: 2.6972280092545198
Feature: Self Trigger (FREQUENCY) : D Parameter: 1.5621090399310904
Feature: Self Trigger (FREQUENCY) : E Parameter: 34.508027837307324
Correction Parameter: 0.9499284545145702
Iteration 26
Perplexity Training Set = 17.48022432377071
Perplexity of Test Set = 17.94122518955002
26 17.48022432377071 17.94122518955002

Page 33

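Trigger feature function

$$f_{\text{bank} \to \text{teller}}(h, w) = \begin{cases} 1 & \text{if ``bank''} \in h \text{ and } w = \text{``teller''} \\ 0 & \text{otherwise} \end{cases}$$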
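A trigger feature of this form, which fires when the trigger word appears anywhere in the history and the next word is the target, can be written as a small closure. The helper name `make_trigger` and the example sentences are illustrative.

```python
def make_trigger(trigger_word, target_word):
    """Build a binary feature f(history, word) for the pair trigger -> target."""
    def feature(history, word):
        return 1 if trigger_word in history and word == target_word else 0
    return feature


f_bank_teller = make_trigger("bank", "teller")

fires = f_bank_teller(["she", "walked", "into", "the", "bank"], "teller")
silent = f_bank_teller(["she", "walked", "home"], "teller")
```

Unlike a bigram feature, the trigger word may sit any distance back in the history, which is what makes it a "long distance" feature.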