CSCE555 Bioinformatics
Lecture 6: Hidden Markov Models
Meeting: MW 4:00PM-5:15PM, SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering, 2008
www.cse.sc.edu
HAPPY CHINESE NEW YEAR
Roadmap
Probabilistic Models of Sequences
Introduction to HMM
Profile HMMs as MSA models
Measuring Similarity between Sequence and HMM Profile model
Summary
04/19/23 2
Multiple Sequence Alignment
Alignment containing multiple DNA / protein sequences
Look for conserved regions → similar function
Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT
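One simple way to "look for conserved regions" in the example alignment is to list the columns in which every sequence carries the same residue. The sketch below does only that. The function name conserved_columns and the strict "identical in all rows, no gaps" criterion are assumptions chosen for illustration, not a method taken from the lecture.

```python
# Alignment rows copied from the slide ('-' marks a gap).
ALIGNMENT = {
    "Rat":     "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Mouse":   "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Rabbit":  "ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC",
    "Human":   "ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC",
    "Opossum": "ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG",
    "Chicken": "ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT",
    "Frog":    "---ATGGGTTTGACAGCACATGATCGT---CAGCT",
}

def conserved_columns(alignment):
    """Return 0-based indices of columns where all rows share one non-gap residue."""
    rows = list(alignment.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [i for i, col in enumerate(zip(*rows))
            if "-" not in col and len(set(col)) == 1]

print(conserved_columns(ALIGNMENT))
```

A softer criterion (for example, the fraction of rows agreeing with the most common residue) would be more tolerant of single substitutions. The strict version is used here only because it is the easiest to check by eye.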
Probabilistic Model: Position-specific scoring matrices (PSSM)
Difficulty in biological sequences
Variation in a family of sequences
◦ Gaps of variable lengths
◦ Conserved segments with different degrees of conservation
◦ PSSM cannot handle variable-length gaps
◦ Need a statistical sequence model
Regular Expressions Model
Regular expressions
◦ Protein spelling is much more free than English spelling
◦ [AT] [CG] [AC] [ACGT]* A [TG] [GC]
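The pattern on this slide can be tried directly with Python's re module. The sketch below is a minimal illustration. The four test strings are hypothetical and were chosen only to show that a regular expression answers yes or no and cannot say that one matching sequence is a better family member than another.

```python
import re

# The slide's pattern written as a Python regular expression.
PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

# Hypothetical test sequences (not taken from the lecture).
candidates = ["ACAATG", "TCAACTATC", "TGCTAGG", "GGGGGG"]

# fullmatch requires the whole sequence to fit the pattern.
matches = {seq: PATTERN.fullmatch(seq) is not None for seq in candidates}
for seq, ok in matches.items():
    print(seq, "matches" if ok else "does not match")
```

All three matching strings are treated as equally good, which is one reason a probabilistic model is preferred.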
Roadmap
Probabilistic Models of Sequences
Introduction to HMM
Profile HMMs as MSA models
Measuring Similarity between Sequence and HMM Profile model
Summary
Hidden Markov Model (HMM)
HMM is:
◦ A statistical model
◦ Well suited for many tasks in molecular biology
Using HMM in molecular biology
◦ Probabilistic profile (profile HMM)
  - Built from a family of proteins, for searching a database for other members of the family
  - Resembles the profile and weight matrix methods
◦ Grammatical structure
  - Gene finding
  - Recognize signals
  - Prediction (must follow the rules of a gene)
Detect Cheating in Coin Toss Game
Fair and biased coins could be used
Question: is it possible to determine whether a biased coin has been used, based only on the output sequence of heads and tails?
HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
Example: Fair Coin Toss
Consider the single-coin scenario
We could model the process producing the sequence of H's and T's as a Markov model with two states, H and T, and equal transition probabilities:
[Figure: two-state Markov chain with states H and T; each of the four transitions (H→H, H→T, T→H, T→T) has probability 0.5]
Only one fair coin is used here
Example: Fair and Biased Coins
Consider the scenario where there are two coins: a fair coin and a biased coin
Visible states do not correspond to hidden states
- Visible state: output of H or T
- Hidden state: which coin was tossed
HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
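To make the split between hidden and visible states concrete, the sketch below simulates the two-coin game. The slides give no numbers for this example, so every probability here (a 0.75 head bias, a 0.1 chance of switching coins) is an assumption, and the helper names draw and simulate are hypothetical.

```python
import random

# All numbers below are assumed for illustration; the slides give none.
START = {"F": 0.5, "B": 0.5}                      # F = fair coin, B = biased coin
TRANS = {"F": {"F": 0.9, "B": 0.1},               # chance of keeping / switching coins
         "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5},                # fair coin
        "B": {"H": 0.75, "T": 0.25}}              # biased coin favours heads

def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def simulate(n, seed=0):
    """Return (hidden coin sequence, visible toss sequence), each of length n."""
    rng = random.Random(seed)
    coins, tosses = [], []
    coin = draw(START, rng)
    for _ in range(n):
        coins.append(coin)
        tosses.append(draw(EMIT[coin], rng))      # what the observer sees
        coin = draw(TRANS[coin], rng)             # what the observer does not see
    return "".join(coins), "".join(tosses)

hidden, visible = simulate(29)
print(visible)
print(hidden)
```

An observer is given only the first printed line. The second line is the hidden path that the HMM algorithms later in the lecture try to recover.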
Hidden Markov Models
Ingredients of an HMM
Collection of states: {S_1, S_2, …, S_N}
State transition probabilities (transition matrix):
A_ij = P(q_{t+1} = S_j | q_t = S_i)
Initial state distribution:
π_i = P(q_1 = S_i)
Observations: {O_1, O_2, …, O_M}
Observation probabilities:
B_j(k) = P(v_t = O_k | q_t = S_j)
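The ingredients listed on this slide map naturally onto three arrays. The sketch below stores them for a hypothetical two-coin model and checks that each is a proper probability distribution. It assumes the common row convention in which A[i][j] is the probability of moving from state i to state j. All numbers and the helper names are assumptions for illustration.

```python
import math

# Hypothetical two-state, two-symbol model; every number is an assumption.
states = ["fair", "biased"]
symbols = ["H", "T"]
pi = [0.5, 0.5]            # pi[i]   = P(q_1 = S_i)
A = [[0.9, 0.1],           # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
     [0.1, 0.9]]
B = [[0.5, 0.5],           # B[j][k] = P(v_t = O_k | q_t = S_j)
     [0.75, 0.25]]

def is_distribution(p):
    """True if p is non-negative and sums to 1 (within rounding)."""
    return all(x >= 0 for x in p) and math.isclose(sum(p), 1.0)

def is_valid_hmm(pi, A, B):
    """Check shapes and that pi and every row of A and B sum to 1."""
    n = len(pi)
    return (is_distribution(pi)
            and len(A) == n and all(len(row) == n and is_distribution(row) for row in A)
            and len(B) == n and all(is_distribution(row) for row in B))

print(is_valid_hmm(pi, A, B))
```

A check like this is a cheap way to catch a mistyped matrix entry before running any inference on the model.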
Ingredients of Our HMM
States: {S_sunny, S_rainy, S_snowy}
State transition probabilities (transition matrix):
A =
.08 .15 .05
.38 .6  .02
.75 .05 .2
Initial state distribution:
π = (.7 .25 .05)
Observations: {O_1, O_2, …, O_M}
Observation probabilities (emission matrix):
B =
.08 .15 .05
.38 .6  .02
.75 .05 .2
Probability of a Sequence of Events
P(O) = P(O_gloves, O_gloves, O_umbrella, …, O_umbrella)
     = Σ_{all Q} P(O | Q) P(Q)
     = Σ_{q_1,…,q_7} P(O | q_1,…,q_7) P(q_1,…,q_7)
     = 0.7 × 0.8^6 × 0.3^2 × 0.1^4 × 0.6 + …
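The sum over all state paths can be written out literally for short sequences, and the forward recursion gives the same total at far lower cost. The sketch below compares the two on a hypothetical two-coin model. The parameters are assumptions rather than the weather model from the slides, and the function names are illustrative.

```python
from itertools import product

# Hypothetical coin model (F = fair, B = biased); all numbers are assumed.
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

def path_probability(obs, path):
    """P(O, Q) = P(q_1) P(o_1|q_1) * product over t of P(q_t|q_{t-1}) P(o_t|q_t)."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][o]
    return p

def brute_force(obs):
    """Sum P(O, Q) over every state path Q; cost grows as N^T."""
    return sum(path_probability(obs, q) for q in product(STATES, repeat=len(obs)))

def forward(obs):
    """Same total via the forward recursion; cost grows as T * N^2."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        f = {s: EMIT[s][o] * sum(f[r] * TRANS[r][s] for r in STATES) for s in STATES}
    return sum(f.values())

obs = "HTTHHHH"  # seven observations, as in the seven-step example
print(brute_force(obs), forward(obs))
```

With seven observations and two states the brute-force sum has 2^7 = 128 terms. With three states, as in the weather example, it would already have 3^7 = 2187.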
Typical HMM Problems
Annotation: Given a model M and an observed string S, what is the most probable path through M generating S?
Classification: Given a model M and an observed string S, what is the total probability of S under M?
Consensus: Given a model M, what is the string having the highest probability under M?
Training: Given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings
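The annotation problem is usually solved with the Viterbi algorithm, which the lecture names later when searching a profile HMM. The sketch below is a minimal log-space version for a hypothetical two-coin model. All parameters are assumptions, and the code is an illustration rather than the course's implementation.

```python
import math

# Hypothetical coin model (F = fair, B = biased); all numbers are assumed.
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}

def viterbi(obs):
    """Return (log-probability, path) of the most probable hidden state path."""
    # v[s] = best log-probability of any path that ends in state s so far.
    v = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            nxt[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[last], "".join(reversed(path))

score, path = viterbi("HHHHHHHH")
print(path, score)
```

Working in log space avoids numerical underflow on long sequences. Replacing max with a sum over predecessors (in probability space) turns the same recursion into the forward algorithm used for the classification problem.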
Roadmap
Probabilistic Models of Sequences
Introduction to HMM
Profile HMMs as MSA models
Measuring Similarity between Sequence and HMM Profile model
Summary
HMM Profiles as Sequence Models
Given the multiple alignment of sequences, we can use an HMM to model the sequences
Each column of the alignment may be represented by a hidden state that produced that column
Insertions and deletions may be represented by other states
Profile HMMs
An HMM with a structure that, in a natural way, allows position-dependent gap penalties
◦ Main (match) states model the columns of the alignment
◦ Insert states model highly variable regions
◦ Delete states jump over one or more columns, i.e. they model the situation when just a few of the sequences have a "-" in the multiple alignment at a position
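The three kinds of states can be wired together mechanically once the number of match columns is known. The sketch below lists, for each state, the states it may move to. It assumes the commonly used layout in which Begin acts as match state 0, End follows the last match state, and all nine transition types (MM, MD, MI, IM, ID, II, DM, DD, DI) are allowed. The function name and state labels are hypothetical.

```python
def profile_hmm_transitions(length):
    """Map each state name to the list of states it may move to.

    States: Begin, M1..M{length}, I0..I{length}, D1..D{length}, End.
    """
    def match(k):
        if k == 0:
            return "Begin"
        return "End" if k == length + 1 else f"M{k}"

    edges = {}
    for k in range(length + 1):
        # From position k: advance to the next match state, stay in insert k,
        # or skip the next column through its delete state (if there is one).
        targets = [match(k + 1), f"I{k}"]
        if k < length:
            targets.append(f"D{k + 1}")
        edges[match(k)] = list(targets)
        edges[f"I{k}"] = list(targets)
        if k >= 1:
            edges[f"D{k}"] = list(targets)
    return edges

for state, targets in profile_hmm_transitions(3).items():
    print(state, "->", ", ".join(targets))
```

For a model with three match states this yields 11 emitting or silent states with outgoing edges. The missing D0 and D4 states are why some transition types do not exist at the first and last positions.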
HMM Sequences Continued
Profile HMM Example
Consider the six sequences shown below
A multiple sequence alignment of these sequences is the first step toward inducing the hidden Markov model
SEQ1 G C C C A
SEQ2 A G C
SEQ3 A A G C
SEQ4 A G A A
SEQ5 A A A C
SEQ6 A G C
Profile HMM Topology
The topology of the HMM is established using the consensus sequence
The structure of a profile HMM is shown below:
- Square boxes represent match states
- Diamonds represent insert states
- Circles represent delete states
Profile HMM Example Continued
The aligned columns correspond to either emissions from the match state or to emissions from the insert state
The consensus columns are used to define the match states M1,M2,M3 for the HMM
After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology
Transition Probabilities
The values of the transition probabilities are computed using the frequency of the transitions as each sequence is considered
The model parameters are computed using the state transition sequences shown in the figure below:
Transition Probabilities Continued
The frequency of each of the transitions and the corresponding emission probabilities are shown below
State   0  1  2  3
MM      4  5  6  4
MD      1  0  0  -
MI      1  0  0  2
IM      1  0  0  2
ID      0  0  0  -
II      0  0  0  2
DM      -  1  0  0
DD      -  0  0  -
DI      -  0  0  0
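Turning the transition counts into probabilities amounts to normalizing, at each position, over the transitions that leave the same kind of state. The sketch below does this for the counts in the table. The optional pseudocount parameter is an assumption added for illustration, since the slides do not say whether pseudocounts are used.

```python
# Transition counts copied from the table; None marks a "-" entry
# (a transition that does not exist in the topology).
COUNTS = {
    "MM": [4, 5, 6, 4], "MD": [1, 0, 0, None], "MI": [1, 0, 0, 2],
    "IM": [1, 0, 0, 2], "ID": [0, 0, 0, None], "II": [0, 0, 0, 2],
    "DM": [None, 1, 0, 0], "DD": [None, 0, 0, None], "DI": [None, 0, 0, 0],
}

def transition_probabilities(counts, pseudocount=0.0):
    """Normalize counts per source state type (M, I or D) and position."""
    n_pos = len(next(iter(counts.values())))
    probs = {name: [None] * n_pos for name in counts}
    for source in "MID":
        names = [n for n in counts if n[0] == source]
        for pos in range(n_pos):
            present = [n for n in names if counts[n][pos] is not None]
            total = sum(counts[n][pos] + pseudocount for n in present)
            if total == 0:
                continue  # state never visited: estimate left undefined (None)
            for n in present:
                probs[n][pos] = (counts[n][pos] + pseudocount) / total
    return probs

probs = transition_probabilities(COUNTS)
print(probs["MM"][0], probs["MD"][0], probs["MI"][0])
```

At position 0 the six sequences split 4 : 1 : 1 over MM, MD and MI, so the estimates are 4/6, 1/6 and 1/6. States that no training sequence visits get no estimate unless a pseudocount is supplied.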
Emission Probabilities
The emission probability is computed using the formula:
The emission probability specifies the probability of emitting each of the symbols of the alphabet Σ in state k
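The slide's formula did not survive in this transcript. A common choice, assumed here and not confirmed by the source, is the relative frequency of each symbol among the residues aligned to state k, optionally smoothed with a pseudocount. The sketch below implements that assumed estimate on a hypothetical alignment column.

```python
from collections import Counter

ALPHABET = "ACGT"

def emission_probabilities(column, pseudocount=0.0):
    """Estimate e_k(a) for one state from the residues aligned to it.

    Gap characters ('-') are ignored. With pseudocount c the estimate is
    (count(a) + c) / (total + c * len(ALPHABET)). This is an assumed
    formula, not necessarily the one shown on the original slide.
    """
    counts = Counter(ch for ch in column if ch in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    if total == 0:
        raise ValueError("no residues in column and no pseudocount given")
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

# Hypothetical alignment column (not taken from the lecture's example).
column = "AAGA-A"
print(emission_probabilities(column))
print(emission_probabilities(column, pseudocount=1))
```

Without a pseudocount, symbols never seen in the column get probability zero, so any sequence containing them at that position would score zero under the model. A small pseudocount avoids this.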
Emission Probabilities Continued
The emission probability for each state is computed as shown below:
Searching the Profile HMM
Sequences can be searched against the HMM to detect whether or not they belong to the family of sequences described by the profile HMM
Using a global alignment, the most probable alignment of the sequence to the model, and its probability, can be determined using the Viterbi algorithm
The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm
How Well Does a Sequence Fit a Model?
◦ Probability depends on the length of the sequence
◦ Not suitable to use as a score
Length-independent Score
Log-odds score
◦ The logarithm of the probability of the sequence under the model divided by its probability under a null model
◦ score(S) = log [ P(S | M) / P(S | null) ]
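The effect of dividing by a null model can be seen with a very small example. The sketch below scores sequences against a hypothetical gapless position-specific model (in effect a profile HMM with match states only) and a uniform background. All numbers are assumptions, and base-2 logarithms are used so that the score is in bits.

```python
import math

# Hypothetical models over DNA; none of these numbers come from the lecture.
NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # background (null) model
MODEL = [                                             # one emission table per position
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def log_odds(seq):
    """log2 of P(seq | model) / P(seq | null), summed position by position."""
    if len(seq) != len(MODEL):
        raise ValueError("this gapless sketch needs len(seq) == len(MODEL)")
    return sum(math.log2(MODEL[i][x] / NULL[x]) for i, x in enumerate(seq))

print(log_odds("AGC"))   # fits the model better than the background
print(log_odds("TTT"))   # fits the background better than the model
```

A positive score means the model explains the sequence better than the background does, and a negative score means the opposite. Because both probabilities shrink with sequence length, their ratio is much less sensitive to length than the raw probability.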
Length-independent Score
HMM using log-odds
Summary
HMM
How to build a profile HMM model
Scoring the fit between a sequence and an HMM model
Next Lecture
Gene-finding
Reading:
◦ Textbook (CG) chapter 4
◦ Textbook (EB) chapter 8