Post on 17-Dec-2015
Finite State Transducers
Mark Stamp
Finite State Automata
FSA states and transitions
o Represented as labeled directed graphs
o An FSA has one label per edge
States are circles
o Double circles for end (accepting) states
Beginning state
o Denoted by an arrowhead
o Or, sometimes, a bold circle is used
FSA Example
Nodes are states
Transitions are (labeled) arrows
For example…
[Figure: an example FSA with states 1, 2, and 3, and edges labeled a, y, c, and z]
Finite State Transducer
FST has input & output labels on each edge
o That is, 2 labels per edge
o Can be more labels (e.g., edge weights)
o Recall, an FSA has one label per edge
FST represented as a directed graph
o And the same symbols are used as for FSA
o FSTs may be useful in malware analysis…
Finite State Transducer
FST has input and output “tapes”
o Transducer, i.e., can map input to output
o Often viewed as a “translating” machine
o But somewhat more general
FST is a finite automaton with output
o The usual finite automaton only has input
o Used in natural language processing (NLP)
o Also used in many other applications
FST Graphically
Edges/transitions are (labeled) arrows
o Of the form i : o, that is, input:output
Nodes labeled numerically
For example…
[Figure: an example FST with states 1, 2, and 3, and edges labeled a:b, y:q, c:d, and z:x]
FST Modes
FST usually viewed as a translating machine
But an FST can operate in several modes
o Generation
o Recognition
o Translation (left-to-right or right-to-left)
Examples of modes considered next…
FST Modes
Consider this simple example:
[Figure: a one-state FST with a single self-loop labeled a:b]
Generation mode
o Write equal numbers of a and b to the first and second tape, respectively
Recognition mode
o “Accept” when 1st tape has the same number of a as 2nd tape has b
Translation mode next slide
FST Modes
Consider this simple example:
[Figure: a one-state FST with a single self-loop labeled a:b]
Translation mode
o Left-to-right: for every a read from 1st tape, write b to 2nd tape
o Right-to-left: for every b read from 2nd tape, write a to 1st tape
Translation is the mode we usually want to consider
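The modes of this one-state a:b machine can be sketched in a few lines of Python. This is an illustrative sketch only; the slides give no implementation, and the function names are made up here.

```python
# One-state FST with a single self-loop labeled a:b, run in each mode.
# Illustrative sketch -- not code from the slides.

def generate(n):
    """Generation mode: write n a's to tape 1 and n b's to tape 2."""
    return "a" * n, "b" * n

def recognize(tape1, tape2):
    """Recognition mode: accept when tape 1 holds only a's, tape 2 holds
    only b's, and the counts are equal."""
    return (set(tape1) <= {"a"} and set(tape2) <= {"b"}
            and len(tape1) == len(tape2))

def translate_lr(tape1):
    """Left-to-right translation: for every a read from tape 1, write b."""
    assert set(tape1) <= {"a"}
    return "b" * len(tape1)

def translate_rl(tape2):
    """Right-to-left translation: for every b read from tape 2, write a."""
    assert set(tape2) <= {"b"}
    return "a" * len(tape2)

print(generate(3))            # ('aaa', 'bbb')
print(recognize("aa", "bb"))  # True
print(translate_lr("aaa"))    # bbb
```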
WFST
WFST == Weighted FST
o Include a “weight” on each edge
o That is, edges of the form i : o / w
Often, probabilities serve as weights…
[Figure: an example WFST with states 1, 2, and 3, and edges labeled a:b/1, y:q/1, c:d/0.6, and z:x/0.4]
FST Example
Homework…
Operations on FSTs
Many well-defined operations on FSTs
o Union, intersection, composition, etc.
o These also apply to WFSTs
Composition is especially interesting
In a malware context, might want to…
o Compose detectors for the same family
o Compose detectors for different families
Why might this be useful?
FST Composition
Compose 2 FSTs (or WFSTs)
o Suppose 1st WFST has nodes 1,2,…,n
o Suppose 2nd WFST has nodes 1,2,…,m
o Possible nodes in composition labeled (i,j), for i = 1,2,…,n and j = 1,2,…,m
o Generally, not all of these will appear
Edge from (i1,j1) to (i2,j2) only when composed labels “match” (next slide…)
FST Composition
Suppose we have the following labels
o In 1st WFST, edge from i1 to i2 is x:y/p
o In 2nd WFST, edge from j1 to j2 is w:z/q
Consider nodes (i1,j1) and (i2,j2) in the composed WFST
o Edge between the nodes provided y == w
o I.e., output from 1st matches input to 2nd
o And, the resulting edge label is x:z/pq
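The matching rule just described can be sketched in Python over a hypothetical edge-list representation (tuples of source, destination, input, output, weight); this layout is an assumption for illustration, not a format from the slides.

```python
# An edge x:y/p in the first WFST composes with an edge w:z/q in the
# second exactly when y == w, producing x:z/(p*q) between paired states.

def compose(edges1, edges2):
    composed = []
    for (i1, i2, x, y, p) in edges1:
        for (j1, j2, w, z, q) in edges2:
            if y == w:  # output of 1st matches input of 2nd
                composed.append(((i1, j1), (i2, j2), x, z, p * q))
    return composed

# a:b/0.1 composed with b:b/0.1 yields a:b/0.01
e1 = [(1, 2, "a", "b", 0.1)]
e2 = [(1, 2, "b", "b", 0.1)]
print(compose(e1, e2))
```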
WFST Composition
Consider composition of WFSTs
And…
[Figure: two WFSTs, each with states 1–4; edge labels include a:b/0.1, a:b/0.2, b:b/0.3, a:b/0.5, a:a/0.6, b:b/0.4, b:b/0.1, a:b/0.3, b:a/0.5, a:b/0.4, b:a/0.2]
WFST Composition Example
[Figure: the two WFSTs above and their composition, with states (1,1), (1,2), (2,2), (3,2), (4,2), (4,3), (4,4) and edges including a:b/.01, a:a/.04, a:a/.02, b:a/.08, b:a/.06, a:a/.1, a:b/.24, a:b/.18]
WFST Composition
In the previous example, the composition is…
[Figure: the composed WFST shown on the previous slide]
But the (4,3) node is useless
o Must always end in a final state
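Dropping useless states like (4,3) amounts to keeping only states from which some final state is reachable. A sketch, over a hypothetical (src, dst) edge list loosely modeled on the composed graph above (the exact topology is not recoverable from the slides, so the edge list here is an assumption):

```python
def coaccessible(edges, finals):
    """Return the set of states from which some final state is reachable."""
    keep = set(finals)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if dst in keep and src not in keep:
                keep.add(src)
                changed = True
    return keep

# Hypothetical topology: (4,3) has no path to the final state (4,4)
edges = [((1, 1), (2, 2)), ((1, 1), (1, 2)), ((1, 2), (2, 2)),
         ((2, 2), (3, 2)), ((3, 2), (4, 2)), ((2, 2), (4, 2)),
         ((4, 2), (4, 3)), ((4, 2), (4, 4))]
print((4, 3) in coaccessible(edges, finals=[(4, 4)]))   # False
```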
FST Approximation of HMM
Why would we want to approximate an HMM by an FST?
o Faster scoring using FST
o Easier to correct misclassification in FST
o Possible to compose FSTs
o Most important, it’s really cool and fun…
Down side?
o FST may be less accurate than the HMM
FST Approximation of HMM
How to approximate an HMM by an FST?
We consider 2 methods, known as
o n-type approximation
o s-type approximation
These usually focus on “problem 2”
o That is, uncovering the hidden states
o This is the usual concern in NLP, such as “part of speech” tagging
n-type Approximation
Let V be the distinct observations in an HMM
o Let λ = (A,B,π) be a trained HMM
o Recall, A is N x N, B is N x M, π is 1 x N
Let (input : output / weight) = (Vi : Sj / p)
o Where i ∈ {1,2,…,M} and j ∈ {1,2,…,N}
o And Sj are hidden states (rows of B)
o And the weight is the max probability (from λ)
Examples later…
More n-type Approximations
Range of n-type approximations
o n0-type only uses the B matrix
o n1-type (see previous slide)
o n2-type for 2nd order HMM
o n3-type for 3rd order HMM, and so on
What is a 2nd order HMM?
o Transitions depend on 2 consecutive states
o In 1st order, they only depend on the previous state
s-type Approximation
“Sentence type” approximation
Use sequences and/or natural breaks
o In n-type, max probability over one transition, using A and B matrices
o In s-type, all sequences up to some length
Ideally, break at boundaries of some sort
o In NLP, a sentence is such a boundary
o For malware, not so clear where to break
o So in malware, maybe just use a fixed length
HMM to FST
Exact representation also possible
o That is, the resulting FST is the “same” as the HMM
Given model λ = (A,B,π)
Nodes for each (input : output) = (Vi : Sj)
o Edge from each node to all other nodes…
o …including a loop to the same node
o Edges labeled with the target node
o Weights computed from probabilities in λ
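This construction can be sketched for a generic λ. The weight formula below (a[s1][s2]·b[s2][v2] into node (v2, s2), with π used out of the start state) is one natural reading of “weights computed from probabilities in λ” and is an assumption here; the slides leave the weight computation as homework. The model values match the 2-coin worked example.

```python
from itertools import product

def hmm_to_fst(A, B, pi, obs_syms, state_syms):
    """Exact HMM-to-FST sketch: one node per (observation : state) pair,
    fully connected (loops included), with assumed weight
    a[s1][s2] * b[s2][v2]; zero-weight edges are dropped."""
    nodes = list(product(obs_syms, state_syms))
    edges = []
    for v2, s2 in nodes:                      # edges out of the start node
        w = pi[s2] * B[s2][v2]
        if w > 0:
            edges.append(("start", (v2, s2), w))
    for (v1, s1), (v2, s2) in product(nodes, repeat=2):
        w = A[s1][s2] * B[s2][v2]
        if w > 0:
            edges.append(((v1, s1), (v2, s2), w))
    return nodes, edges

A  = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.8, "U": 0.2}}
B  = {"F": {"H": 0.5, "T": 0.5}, "U": {"H": 0.7, "T": 0.3}}
pi = {"F": 1.0, "U": 0.0}
nodes, edges = hmm_to_fst(A, B, pi, "HT", "FU")
print(len(nodes), len(edges))   # 4 nodes, 18 edges (pi[U] = 0 prunes 2)
```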
HMM to FST
Note that some probabilities may be 0
o Remove edges with 0 probabilities
A lot of probabilities may be small
o So, maybe approximate by removing edges with “small” probabilities?
o Could be an interesting experiment…
o A reasonable way to approximate an HMM that does not seem to have been studied
HMM Example
Suppose we have 2 coins
o 1 coin is fair and 1 unfair
o Roll a die to decide which coin to flip
o We see the resulting sequence of H and T
o We do not know which coin was flipped…
o …and we do not see the roll of the die
Observations? Hidden states?
HMM Example
Suppose probabilities are as given
o Then what is λ = (A,B,π) ?
[Figure: two-state diagram; state “fair” has self-loop 0.9 and transition 0.1 to “unfair”; state “unfair” has self-loop 0.2 and transition 0.8 to “fair”; fair emits H and T with probability 0.5 each; unfair emits H with probability 0.7 and T with probability 0.3]
Observations: H, T
Hidden states: fair, unfair
HMM Example
HMM is given by λ = (A,B,π), where

  A = |0.9  0.1|   B = |0.5  0.5|   π = |1.0  0.0|
      |0.8  0.2|       |0.7  0.3|

(rows correspond to states F and U; columns of B to observations H and T)
This π implies we start in the F (fair) state
o Also, state 1 is F and state 2 is U (unfair)
Suppose we observe HHTHT
o Then the probability of, say, FUFFU is
  πF bF(H) aFU bU(H) aUF bF(T) aFF bF(H) aFU bU(T)
  = 1.0(0.5)(0.1)(0.7)(0.8)(0.5)(0.9)(0.5)(0.1)(0.3) = 0.000189
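The arithmetic above can be checked with a few lines of Python, using the model values from the worked example (the function name is made up here):

```python
# State-sequence probability under lambda = (A, B, pi):
# pi[s0] * b[s0](o0) * a[s0][s1] * b[s1](o1) * ...
A  = {("F", "F"): 0.9, ("F", "U"): 0.1, ("U", "F"): 0.8, ("U", "U"): 0.2}
B  = {("F", "H"): 0.5, ("F", "T"): 0.5, ("U", "H"): 0.7, ("U", "T"): 0.3}
pi = {"F": 1.0, "U": 0.0}

def seq_prob(states, obs):
    """Joint probability of a hidden state sequence and an observation
    sequence of the same length."""
    p = pi[states[0]] * B[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= A[(states[t - 1], states[t])] * B[(states[t], obs[t])]
    return p

print(round(seq_prob("FUFFU", "HHTHT"), 6))   # 0.000189
```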
HMM Example
We have

  A = |0.9  0.1|   B = |0.5  0.5|   π = |1.0  0.0|
      |0.8  0.2|       |0.7  0.3|

And observe HHTHT
o Probabilities in table

state   score     probability
FFFFF   .020503   .664086
FFFFU   .001367   .044272
FFFUF   .002835   .091824
FFFUU   .000425   .013774
FFUFF   .001215   .039353
FFUFU   .000081   .002624
FFUUF   .000387   .012243
FFUUU   .000057   .001836
FUFFF   .002835   .091824
FUFFU   .000189   .006122
FUFUF   .000392   .012697
FUFUU   .000059   .001905
FUUFF   .000378   .012243
FUUFU   .000025   .000816
FUUUF   .000118   .003809
FUUUU   .000018   .000571
HMM Example
So, the most likely state sequence is
o FFFFF
o Solves problem 2
Problem 1, scoring?
o Next slide
Problem 3?
o Not relevant here
(Table repeated from the previous slide)
HMM Example
How to score the sequence HHTHT ?
Sum over all state sequences
o Sum the “score” column in the table: P(HHTHT) = .030874
o The forward algorithm is way more efficient
(Table repeated from the previous slides)
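The forward algorithm mentioned above can be sketched as follows. This is a standard implementation using the model values from the worked example, not code from the slides; it replaces the exponential sum over state sequences with a dynamic program that is linear in the sequence length.

```python
A  = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.8, "U": 0.2}}
B  = {"F": {"H": 0.5, "T": 0.5}, "U": {"H": 0.7, "T": 0.3}}
pi = {"F": 1.0, "U": 0.0}

def forward_score(obs):
    """P(observation sequence), summed over all hidden state sequences."""
    # alpha[j] = P(observations so far, current state is j)
    alpha = {j: pi[j] * B[j][obs[0]] for j in A}
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in A) * B[j][o]
                 for j in A}
    return sum(alpha.values())

print(round(forward_score("HHTHT"), 6))   # 0.030874, matching the table sum
```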
n-type Approximation
Consider the 2-coin HMM with

  A = |0.9  0.1|   B = |0.5  0.5|   π = |1.0  0.0|
      |0.8  0.2|       |0.7  0.3|

For each observation, only include the most probable hidden state
o So, the only possible FST labels in this case are
  H:F/w1, H:U/w2, T:F/w3, T:U/w4
o Where the weights wi are probabilities
n-type Approximation
Consider the example

  A = |0.9  0.1|   B = |0.5  0.5|   π = |1.0  0.0|
      |0.8  0.2|       |0.7  0.3|

For each observation, take the most probable state
o Weight is the probability
[Figure: resulting n-type FST with states 1, 2, 3; edges H:F/0.5 and T:F/0.5 from the start state, then H:F/0.45 and T:F/0.45 thereafter]
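The edge weights in the figure can be reproduced by the n1-type selection rule: max over j of a[i][j]·b[j][k] for interior edges, and π[j]·b[j][k] for the initial ones. A sketch (hypothetical function names):

```python
A  = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.8, "U": 0.2}}
B  = {"F": {"H": 0.5, "T": 0.5}, "U": {"H": 0.7, "T": 0.3}}
pi = {"F": 1.0, "U": 0.0}

def best_initial(obs):
    """Most probable first hidden state for this observation, with weight."""
    return max(((pi[j] * B[j][obs], j) for j in A), key=lambda t: t[0])

def best_next(state, obs):
    """Most probable next hidden state, given current state and observation."""
    return max(((A[state][j] * B[j][obs], j) for j in A), key=lambda t: t[0])

print(best_initial("H"))    # (0.5, 'F')  -> edge label H:F/0.5
print(best_next("F", "H"))  # weight 0.45, state 'F' -> edge label H:F/0.45
```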
n-type Approximation
Suppose instead…
A = B = π =
Most probable state for each observation?
o Weight is the probability
[Figure: resulting n-type FST with states 1, 2, 3, 4; edges H:U/0.35 and T:F/0.25 from the start state, and edges including H:U/0.42, T:F/0.30, T:F/0.20, H:F/0.30, H:F/0.30, T:F/0.30]
HMM as FST
Consider the 2-coin HMM where
A = B = π =
Then FST nodes correspond to…
o Initial state
o Heads from fair coin (H:F)
o Tails from fair coin (T:F)
o Heads from unfair coin (H:U)
o Tails from unfair coin (T:U)
HMM as FST
Suppose the HMM is specified by
A = B = π =
Then the FST is…
[Figure: FST with initial state 1 and states 2 (H:F), 3 (T:F), 4 (H:U), 5 (T:U); every state has an edge to each of states 2–5, labeled with the target’s input:output pair]
HMM as FST
This FST is boring and not very useful
o Weights make it a little more interesting
Computing the weights is homework…
[Figure: the same FST as on the previous slide]
Why Consider FSTs?
FST used as a “translating machine”
Well-defined operations on FSTs
o Composition is an interesting example
Can convert HMM to FST
o Either exact or approximate
o Approximations may be much simplified, but might not be as accurate
Advantages of FST over HMM?
Why Consider FSTs?
Scoring/translating is faster with an FST
Able to compose multiple FSTs
o Where the FSTs may be derived from HMMs
One idea…
o Multiple HMMs trained on malware (same family and/or different families)
o Convert each HMM to an FST
o Compose the resulting FSTs
Bottom Line
Can we get the best of both worlds?
o Fast scoring, composition with FSTs
o Simplify/approximate HMMs via FSTs
o Tweak FST to improve scoring
o Efficient training using HMMs
Other possibilities?
o Directly compute an FST without an HMM
o Or FST as a first pass (e.g., disassembly?)
References
A. Kempe, Finite state transducers approximating hidden Markov models
J. R. Novak, Weighted finite state transducers: Important algorithms
K. Striegnitz, Finite state transducers