7/27/2019 markovChains.pdf
Markov Chains
Equipped with the basic tools of probability theory, we can now revisit the stochastic models we considered starting on page 47 of these notes. The recurrence (26) for the stochastic version of the sand-hill crane model is an instance of the following template:

y(n) = a sample from p_{Y_n | Y_{n-1}, ..., Y_{n-m}}(y | y(n-1), ..., y(n-m)) .   (37)
The stochastic sand-hill crane model is an example of the special case m = 1:

y(n) = a sample from p_{Y_n | Y_{n-1}}(y | y(n-1)) .   (38)
Recall that, given the value y(n-1) for the population Y_{n-1} (a random variable) of cranes in year n-1, we formulated probability distributions for the numbers of births (Poisson) and deaths (binomial) between year n-1 and year n. Because these probabilities depend on y(n-1), they are conditioned upon the fact that Y_{n-1} = y(n-1). If there are b births and d deaths, then

Y_n = Y_{n-1} + b - d

so knowing the value y(n-1) of Y_{n-1} and the conditional probability distributions p_{b | Y_{n-1}} and p_{d | Y_{n-1}} of b and d given Y_{n-1} is equivalent to knowing the conditional distribution18 p_{Y_n | Y_{n-1}} of Y_n given Y_{n-1}.
A sequence of random variables Y_0, Y_1, ... whose values y(0), y(1), ... are produced by a stochastic recurrence of the form (37) is called a discrete Markov process of order m. The initial value y(0) of Y_0 is itself a random variable, rather than a fixed number.
In the sand-hill crane model, the value that the population Y_n in any given year n can assume is unbounded. This presents technical difficulties that it is wise to avoid in a first pass through the topic of stochastic modeling. We therefore make the additional assumption that the random variables Y_n are drawn out of a finite alphabet Y with K values. In the sand-hill crane example, we would have to assume a maximum population of K-1 birds (including zero, this yields K possible values). In other examples, including the one examined in the next section, the restriction to a finite alphabet is more natural. A discrete Markov process defined on a finite alphabet is called a Markov chain. Thus:
18 To actually determine the form of this conditional distribution would require a bit of work. It turns out that the probability distribution of the sum (or difference) of two independent random variables is the convolution (or correlation) of the two distributions. The assumption that births and deaths are independent is an approximation, and is more or less valid when the death rate is small relative to the population. Should this assumption be unacceptable, one would have to provide the joint distribution of b and d.
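The convolution fact in this footnote is easy to check numerically. Here is a sketch in Python with NumPy (the notes use Matlab); the Poisson and binomial parameters, and the truncation point, are made up for illustration:

```python
import numpy as np
from math import comb, exp, factorial

# Hypothetical parameters: births b ~ Poisson(2), truncated at 10,
# and deaths d ~ Binomial(5, 0.1); b and d are assumed independent.
pb = np.array([exp(-2) * 2**k / factorial(k) for k in range(11)])
pb /= pb.sum()  # renormalize after truncation
pd = np.array([comb(5, k) * 0.1**k * 0.9**(5 - k) for k in range(6)])

# The pmf of the net change b - d is the correlation of pb and pd,
# i.e. the convolution of pb with pd reversed.  Entry n of the result
# corresponds to b - d = n - 5 (n = 0 means all deaths, no births).
pnet = np.convolve(pb, pd[::-1])
```

Summing pnet over all entries gives 1, and pnet[5] (births exactly balancing deaths) agrees with the direct computation sum_k pb(k) pd(k).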
74 6 SCALAR, STOCHASTIC, DISCRETE DYNAMIC SYSTEMS
To specify a Markov chain of order m (equation (37)) requires specifying the initial probability distribution

p_{Y_0}(y(0))

and the transition probabilities

p_{Y_n | Y_{n-1}, ..., Y_{n-m}}(y(n) | y(n-1), ..., y(n-m))

where all variables y(n) range over a finite alphabet Y of K elements. Note that in this expression y(0) and y(n), ..., y(n-m) are variables, not fixed numbers. For instance, we need to know the distribution p_{Y_0}(y(0)) for all possible values of y(0). The Markov chain (37) is said to be stationary if the transition probabilities are the same for all n. In that case, their expression can be simplified as follows:

p_{Y | Y_{-1}, ..., Y_{-m}}(y | y_{-1}, ..., y_{-m}) .
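As a concrete illustration of this specification, here is a minimal Python sketch (the notes use Matlab) that simulates a stationary first-order chain from an initial distribution p0 and a transition matrix P; the two-symbol alphabet and all the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# p0[i] is the initial distribution; P[i, j] = p(Y_n = j | Y_{n-1} = i).
p0 = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def simulate(p0, P, n):
    """Draw a length-n sample path of a stationary first-order chain."""
    y = [rng.choice(len(p0), p=p0)]  # draw y(0) from the initial distribution
    for _ in range(n - 1):
        # each later value is drawn from the row of P selected by y(n-1)
        y.append(rng.choice(P.shape[1], p=P[y[-1]]))
    return y

path = simulate(p0, P, 1000)
```

Specifying p0 and P is all it takes: the whole process is determined by these two tables.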
A Language Model
To illustrate the concept of a Markov chain we consider the problem of modeling the English
language at the low level of word utterances (that is, we do not attempt to model the structure
of sentences). More specifically, we attempt to write a computer program that generates random
strings of letters that in some sense look like English words.
A first, crude attempt would draw letters and blank spaces at random out of a uniform probability distribution over the alphabet (augmented with a blank character, for a total of 27 symbols).
This would be a poor statistical model of the English language: all it specifies is what characters
are allowed. Here are a few samples of 65-character sentences (one per line):
earryjnv anr jakroyvnbqkrxtgashqtzifzstqaqwgktlfgidmxxaxmmhzmgbya
mjgxnlyattvc rwpsszwfhimovkvgknlgddou nmytnxpvdescbg k syfdhwqdrj
jmcovoyodzkcofmlycehpcqpuflje xkcykcwbdaifculiluyqerxfwlmpvtlyqkv
This is not quite compelling: words (that is, strings between blank spaces) have implausible
lengths, the individual letters seem to come from some foreign language, and letter combinations
are unpronounceable. The algorithm that generated these sequences is a zeroth-order stationary
Markov chain where the initial probability distribution and the transition probabilities are all equal
to the uniform distribution. "Transition" here is even a misnomer, because each letter is independent of the previous one:

p_{Y_0}(y) = p_{Y | Y_{-1}, ..., Y_{-m}}(y | y_{-1}, ..., y_{-m}) = p_Y(y) = p_{U27}(y)   (39)

(p_{U27} is the uniform distribution over 27 points).
A moderate improvement comes from replacing the uniform distribution with a sample distribution from real text. To this end, let us take as input all of James Joyce's Ulysses, a string of 1,564,672 characters (including a few lines of header).19 Some cleanup first converts all characters to lowercase, replaces non-alphabetic characters (commas, periods, and so forth) with blanks, and compacts any resulting sequences of consecutive blanks into single blanks.
As an exercise in vector-style text processing, here is the Matlab code for the cleanup function:
function out = cleanup(in, alphabet)
% Convert to lower case
in = lower(in);
% Change unknown symbols to blank spaces
known = false(size(in));
for letter = alphabet
    known(in == letter) = true;
end
in(~known) = ' ';
% Find first of each sequence of nonblanks
first = in ~= ' ' & [' ', in(1:(end-1))] == ' ';
% Find last of each sequence of nonblanks
last = in ~= ' ' & [in(2:end), ' '] == ' ';
% Convert from logical flags to indices
first = find(first);
last = find(last);
% Replace runs of blanks with single blanks
out = '';
for k = 1:length(first)
    out = [out, in(first(k):last(k)), ' '];
end
The input argument in is a string of characters (the text of Ulysses, a very long string indeed),
and alphabet is a string of the 26 lowercase letters of the English alphabet, plus a blank
added as the first character. The output string out contains the cleaned-up result. Loops
were avoided as far as possible. The loop on letter is very small (27 iterations), and the last
loop on k seemed unavoidable.20
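As an aside on footnote 20: in a language with regular expressions, the whole cleanup, including the final loop, collapses to a couple of substitutions. A Python sketch (not part of the original notes):

```python
import re

def cleanup(text):
    """Lowercase, map non-letters to blanks, squeeze runs of blanks."""
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)  # unknown symbols -> blank
    text = re.sub(r' +', ' ', text)      # runs of blanks -> single blank
    return text

print(cleanup("Stately, plump Buck Mulligan"))  # stately plump buck mulligan
```

This differs from the Matlab version only in how leading and trailing blanks are handled.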
After this cleanup operation, it is a simple matter to tally the frequency of each of the letters in
the alphabet (including the blank character) in Ulysses. A plot of this frequency distribution would
look very jagged. For a more pleasing display, Figure 17(a) shows the frequency distribution of the
characters sorted by their frequency of occurrence. This function is used as an approximation of
the probability distribution of letters and spaces in the English language.
19 This text is available for instance at http://www.gutenberg.org/ebooks/4300.
20 Please let me know if you find a way to avoid this loop.
[Figure 17 appears here: two panels over the characters sorted as e t a o i n s h r l d u m c g f w y p b k v j x q z; panel (a) plots p_Y(y) on a scale from 0 to 0.2, panel (b) plots the cumulative distribution c_Y(y) from 0 to 1.]

Figure 17: (a) Frequency distribution of the 26 letters of the alphabet and the blank character in Joyce's Ulysses. Letters were sorted in order of decreasing frequency. The first value is for the blank character. (b) Corresponding cumulative distribution. The probability that a number (diamond) drawn uniformly between 0 and 1 falls in the interval shown on the ordinate is equal to the difference c_Y(o) - c_Y(a). By definition of c_Y, this difference is equal to the probability p_Y(o), so we select the letter "o" whenever the diamond falls in the interval shown.
The resulting language model is still the zeroth-order Markov chain of equation (39), but with a more plausible choice of individual letters. Rather than drawing from the uniform distribution over the alphabet, we now draw from a distribution p_Y(y) that we estimated from a piece of text.

How do we draw from a given distribution? If the set Y on which the distribution p_Y(y) is defined is finite, there is a simple method for converting a uniform (pseudo)random sample generator into a generator that draws from p_Y(y) instead. First, compute the cumulative distribution of Y:

c_Y(y) = P[Y <= y] = sum_{k <= y} p_Y(k)

where P[...] stands for the phrase "the probability that ...". This is shown in Figure 17(b). If we now append a zero to the left of the sequence of values c_Y(y), the resulting sequence spans monotonically the entire range between 0 and 1, since by construction

c_Y(y-1) <= c_Y(y)   and   c_Y(y_K) = 1

(y_K is the last element in the alphabet Y = {y_1, ..., y_K}). In addition, the differences between consecutive values of c_Y(y) are equal to the probabilities p_Y(y). We can then use a uniform (pseudo)random number generator to draw a number u between 0 and 1, map that to the ordinate of the plot in Figure 17(b), and find the first entry y_u in the domain Y of the function c_Y such that c_Y(y_u) equals or exceeds u. This construction is illustrated in Figure 17(b). The probability of hitting a particular value y_u (the letter "o" in the example in the Figure) is equal to the probability that u (the diamond on the ordinate) is between c_Y(y_{u-1}) and c_Y(y_u), and this is, by definition of c_Y, the probability p_Y(y_u). Thus, this procedure draws from p_Y, as desired.
Here is how to do this in Matlab:
function v = draw(y, p, n)
if nargin < 3 || isempty(n)
    n = 1; % Draw a single number
end
c = cumsum(p);
c = c(:);
if c(end) == 0
    v = [];
else
    c = c / c(end);
    c = [0; c];
    K = length(c);
    r = rand(n, 1);
    v = zeros(n, 1);
    for i = 2:K
        v(find(r > c(i - 1))) = y(i - 1);
    end
end
The argument y lists the values in the domain Y. In the example, these would be numbers between 1 and 27, which represent the alphabet plus the blank character. The argument p lists the probabilities of each element in the domain, and the argument n specifies how many numbers to draw (1 if n is left unspecified). The Matlab built-in function cumsum computes the cumulative sum of the elements of the input vector.
A note on efficiency: The function draw spends most of its time looking for the first index v
where c(v) equals or exceeds the random number(s) r. This search is implemented here
with the Matlab built-in function find, which scans the entire array that is passed to it as
argument. In principle, this is an inefficient method, because one could use binary search:
if there are m elements in the vector c, first try c(round(m/2)), an element roughly in the
middle of the vector. If this element is too large, the target must be in the first half of c,
otherwise it must be in the second half. Repeat this procedure in the correct half until the
desired element is found. Since the interval is halved at every step, binary search requires
only log2(K) comparisons rather than K. However, unless K is very large, this theoretical speedup is more than canceled by the fact that the built-in function find is pre-compiled, and therefore very fast. A binary search would have to be written in Matlab, an interpreted language, which is much slower.
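For reference, the binary-search variant described above is short in a language whose library provides the search routine; here is a Python sketch using bisect (an illustration, not the course code):

```python
import bisect
import random
from itertools import accumulate

def draw(values, probs, n=1, rng=random):
    """Inverse-CDF sampling; binary search costs O(log K) per draw."""
    c = list(accumulate(probs))  # cumulative distribution
    total = c[-1]
    # bisect_left finds the first index where c equals or exceeds u
    return [values[bisect.bisect_left(c, rng.random() * total)]
            for _ in range(n)]

sample = draw('abc', [0.2, 0.5, 0.3], n=1000)
```

The bisect_left call implements exactly the "first entry such that c_Y(y_u) equals or exceeds u" rule from the text.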
Here are some sample sentences obtained by drawing from a realistic frequency distribution
for the letters in English:
ooyusdii eltgotoroo tih ohnnattti gyagditghreay nm roefnnasos r
naa euuecocrrfca ayas el s yba anoropnn laeo piileo hssiod idlif
beeghec ebnnioouhuehinely neiis cnitcwasohs ooglpyocp h trog l
This still does not look anywhere near English. However, both letters and blanks are now drawn with the frequency they have in Ulysses. Common letters now appear more often than with the uniform distribution, because they occur with their actual English frequencies. Even so, words still have implausible lengths: a correct frequency of blanks only ensures that the mean number of blanks in any substantial stretch of text is correct, not that the lengths of the runs between blanks (words) are correct. For instance, you will notice several multiple blanks in the text above. Three four-letter words separated by two blanks (a plausible sequence in English) are equivalent to one 12-letter word followed by two consecutive blanks (a much less plausible sequence in English) in the sense that the frequency of blanks is the same in both cases (2/14). Similar considerations hold for letters: frequencies are correct, but the order in which letters follow each other is entirely unrelated to English. Adjacent letters are statistically independent in the model, but not in reality.

To address this shortcoming, we can collect statistics about the conditional probability of one letter given the previous one,

p_{Y | Y_{-1}}(y | y_{-1}) = P[Y = y | Y_{-1} = y_{-1}] .
These transition probabilities are displayed graphically in Figure 18. Do not confuse these with the joint probabilities. For instance, the conditional probability that the current letter Y is equal to "u" given that the previous letter Y_{-1} is equal to "q" is one, because "u" always follows "q". However,
the joint probability of the pair "qu" is equal to the probability of "q". From Figure 17(a) we see that this probability is very small (it turns out to be about 9 × 10^{-3}). More generally, from the definition (34) of conditional probability we see that the joint probability can be found from the conditional probabilities (Figure 18) and the probabilities of the individual letters (Figure 17(a)) as follows:

p_{Y_{-1}, Y}(y_{-1}, y) = p_{Y | Y_{-1}}(y | y_{-1}) p_{Y_{-1}}(y_{-1}) = p_{Y | Y_{-1}}(y | y_{-1}) p_Y(y_{-1}) .

The last equality is justified by our assumption that language statistics are stationary.
Values in each row of the transition matrix p_{Y | Y_{-1}} add up to one, because the probability that a letter is followed by some character is 1 (except for the very last letter in the text):

sum_{y in Y} p_{Y | Y_{-1}}(y | y_{-1}) = 1

where Y is the alphabet, plus the blank character. A matrix with this property is said to be stochastic.
The matrix of transition probabilities can be estimated from a given piece of text (Ulysses in our case) by initializing a 27 × 27 matrix to all zeros, and incrementing the entry corresponding to previous letter y_{-1} and current letter y every time we encounter such a pair. Once the whole text has been scanned, we divide each row by the sum of its entries.21
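The counting-and-normalizing procedure just described can be sketched in a few lines of Python (the notes do this in Matlab); the short training string is a placeholder:

```python
import numpy as np

ALPHABET = ' abcdefghijklmnopqrstuvwxyz'  # blank first, as in the notes

def transition_matrix(text, alphabet=ALPHABET):
    """Count letter pairs in cleaned-up text, then normalize each row."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    K = len(alphabet)
    counts = np.zeros((K, K))
    for prev, cur in zip(text, text[1:]):
        counts[index[prev], index[cur]] += 1
    sums = counts.sum(axis=1, keepdims=True)
    # leave all-zero rows at zero instead of dividing by zero
    return np.divide(counts, sums, out=np.zeros_like(counts),
                     where=sums > 0)

P = transition_matrix('the theory of the thing ')
```

Each nonempty row of P sums to one, so P is stochastic in the sense just defined.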
We can now generate a new set of sentences by drawing out of the transition probabilities, thereby generating samples out of a (stationary) first-order Markov chain as in equation (38). More specifically, the first letter is drawn out of p_Y(y), just as we did in our earlier attempt. After that, we look at the specific value y_{-1} of the previous letter, and draw from the transition probability p_{Y | Y_{-1}}(y | y_{-1}). Since y_{-1} is now a known value, this amounts to drawing from the distribution represented by the row corresponding to y_{-1} in Figure 18. Here are a few sample sentences:
icke inginatenc blof ade and jalorghe y at helmin by hem owery fa
st sin r d n cke s t w anks hinioro e orin en s ar whes ore jot j
whede chrve blan ted sesourethegebe inaberens s ichath fle watt o
On occasion, this is almost pronounceable. Word lengths are now slightly more plausible. Note
however that the distribution of word lengths in the second line is very different from that in the
third: statistics are just statistics, and one must expect variability in an experiment that is this small.
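The generation step described above, first letter from p_Y and each later letter from the row of its predecessor, can be sketched in Python with toy tables (all the probabilities below are invented; note how "q" is always followed by "u", as in the Ulysses statistics):

```python
import random

def generate(first_probs, trans, alphabet, length, rng=random):
    """Sample text from a first-order Markov chain given by dict tables."""
    chars = [rng.choices(alphabet, weights=first_probs)[0]]
    for _ in range(length - 1):
        row = trans[chars[-1]]  # transition row for the previous letter
        chars.append(rng.choices(list(row), weights=list(row.values()))[0])
    return ''.join(chars)

# Toy three-symbol model: 'q', 'u', and blank.
trans = {'q': {'u': 1.0},
         'u': {'q': 0.4, 'u': 0.2, ' ': 0.4},
         ' ': {'q': 0.5, 'u': 0.5}}
s = generate([0.5, 0.5, 0.0], trans, 'qu ', 50)
```

In the output, every "q" is necessarily followed by "u", because that transition probability is one.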
Now, however, we can see the pattern: incorporating more information on how letters follow each other leads to more plausible results. Instead of a first-order Markov chain, we can try a second-order one:

y(n) = a sample from p_{Y | Y_{-1}, Y_{-2}}(y | y_{-1}, y_{-2}) .
This of course requires compiling the necessary frequency tables from Ulysses (or from your favorite text). The distribution p_{Y | Y_{-1}, Y_{-2}} is a function of three variables rather than two, and each

21 If we were to divide by the sum of all the entries in the matrix, we would obtain the joint probabilities instead of the conditional ones.
[Figure 18 appears here: a 27 × 27 array of squares, with the current letter on the horizontal axis and the previous letter on the vertical axis, both ordered as blank, a, b, c, ..., z.]

Figure 18: Conditional frequencies of the 26 letters of the alphabet and the blank character in Joyce's Ulysses, given the previous character. The first row and first column are for the blank character. The area of each square is proportional to the conditional probability of the current letter given the previous letter. For instance, the largest square in the picture corresponds to the probability, equal to one, that the letter "q" is followed by the letter "u".
of them can take one of 27 values. So the new table has 27^3 = 19683 entries. Other than this, the procedures for data collection and sequence generation are essentially the same, and the idea can then be repeated for higher-order chains as well.

The cost of exponentially large tables (27^(n+1) entries for an n-th order chain) can be curbed substantially by observing that most conditional probabilities are zero. For instance, a sequence of three letters drawn entirely (that is, uniformly) at random is unlikely to be a plausible English sequence. If it is not, it will never show up in Ulysses, and the corresponding conditional probability in the table is zero. Matlab can deal very well with sparse matrices like these. If a matrix A has, say, 10,000 entries only 200 of which are nonzero, the instruction

A = sparse(A);

will create a new data structure that only stores the nonzero values of A, together with information necessary to reconstruct where in the original matrix each entry belongs. As a consequence, the storage space (and processing time) for sparse matrices is proportional to the number of nonzero entries, rather than to the formal size of the matrix.
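The same principle can be sketched in pure Python: a dictionary keyed by (row, column) stores only the nonzero counts, so storage grows with the number of nonzero entries rather than with the formal size of the table (this illustrates the idea, not how Matlab implements sparse):

```python
class SparseCounts(dict):
    """Sparse counting matrix: only nonzero cells occupy storage."""
    def increment(self, row, col):
        self[(row, col)] = self.get((row, col), 0) + 1

counts = SparseCounts()
counts.increment(3, 7)
counts.increment(3, 7)
counts.increment(40, 1)
# Only 2 cells are stored, although the matrix is conceptually
# as large as we like (say 27**3 rows by 27 columns).
print(len(counts))  # 2
```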
Here is the code that collects text statistics up to a specified order:
function [ng, alphabet] = ngrams(text, maxOrder)
nmax = maxOrder + 1;
if nmax > 4
    error('Unwise to do more than quadrigrams: too much storage, computation')
end
alphabet = ' abcdefghijklmnopqrstuvwxyz'; % blank first
na = length(alphabet);
da = double(alphabet);
ng = {};
for n = 1:nmax
    ng{n} = sparse(zeros(na^(n-1), na));
end
text = cleanup(text, alphabet);
% Wrap around to avoid zero probabilities
text = [text((end - maxOrder):end), text];
for k = 1:length(text)
    ps = place(text(k), da);
    last = k - 1;
    ne = min(nmax, k);
    for n = 1:ne
        j = k - n + 1;
        prefix = text(j:last);
        pp = place(prefix, da);
        ng{n}(pp, ps) = ng{n}(pp, ps) + 1;
    end
end
% Normalize conditional distributions
o = ones(na, 1);
for n = 1:nmax
    s = ng{n} * o;
    for p = 1:size(ng{n}, 1)
        if s(p) ~= 0
            ng{n}(p, :) = ng{n}(p, :) / s(p);
        end
    end
end
This function is called ngrams because it computes statistics for digrams (pairs of letters), trigrams (triples of letters), and so forth. The output ng is a cell array with maxOrder + 1 matrices, each describing the statistics of a different order.

Logically, the statistics of order n should be stored in an array with n+1 dimensions. Instead, the code above flattens these arrays into two-dimensional matrices to make access and later computation faster. This requires mapping, say, a four-dimensional vector of indices (i, j, k, l) to a two-dimensional vector (v, l), where v(i, j, k) is some invertible function of (i, j, k). How this is done is not important, but it must be done consistently when collecting statistics and when using them. To this end, the mapping has been encapsulated into a function place that takes a string of letters prefix and (a numerical representation da of) the alphabet and returns the place of that string within a matrix of appropriate size. Here is how place works:
function pos = place(string, da)
na = length(da);
len = length(string);
pos = 0;
for k = 1:len
    i = find(double(string(k)) == da);
    if isempty(i)
        % Convert unknown characters to blanks (position 1 in da)
        i = 1;
    end
    pos = pos * na + i - 1;
end
% Convert to Matlab-style array indexing (so minimum pos is 1, not 0)
pos = pos + 1;
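In other words, place reads the prefix as a number written in base 27, with the blank as digit zero. A Python transcription of the same mapping (for illustration; the Matlab version above is the one ngrams uses):

```python
def place(string, alphabet=' abcdefghijklmnopqrstuvwxyz'):
    """Flatten a prefix string into a 1-based row index by reading
    each character as a digit in base len(alphabet)."""
    na = len(alphabet)
    pos = 0
    for ch in string:
        i = alphabet.find(ch)
        if i < 0:
            i = 0  # unknown characters count as blanks (digit 0)
        pos = pos * na + i
    return pos + 1  # 1-based, to match Matlab indexing

print(place(''))    # 1 (the empty prefix indexes the unigram row)
print(place('aa'))  # 29
```

The mapping is invertible because distinct prefixes of the same length give distinct base-27 numbers.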
Going back to ngrams, the function cleanup has already been discussed. The instruction with a comment about "Wrap around" prevents a little quirk that concerns the end of the text: if the string, say, "ix" appears at the end of the text and nowhere else, all the transition probabilities from "ix" to a third character are zero. If "ix" is ever generated in the Markov chain, then there is no next character to go to. To avoid this, the tail end of the text is also copied to the beginning, so that every sequence of letters is followed by some letter. The
rest of the code is straightforward: initialize storage for ng, compute the statistics by scanning
text and incrementing the proper entries of ng, and normalize entries to obtain conditional
probabilities.
The following are examples of gibberish generated by a second-order Markov chain:
he ton th a s my caroodif flows an the er ity thayertione wil ha
m othenre re creara quichow mushing whe so mosing bloack abeenem
used she sighembs inglis day p wer wharon the graiddid wor thad k
Some of these look like actual words: common sequences of letters and blanks. A third-order
model does even better:
es in angull o shoppinjust stees ther a kercourats allech is hote
ternal liked be weavy because in coy mrs hand room him rolio undceran in that he mound a dishine when what to bitcho way forgot p
Almost looks like real text...
All the gibberish22 in this Section has been generated with the following Matlab code.
function s = randomSentence(ng, alphabet, len, order)
if nargin < 4 || isempty(order)
    order = length(ng) - 1; % Use maximum order possible
end
if order >= length(ng)
    error('Only statistics up to order %d are available', length(ng) - 1);
end
if order < -1
    error('order must be at least -1')
end
da = double(alphabet);
na = length(alphabet);
if order == -1
    % Don't even consider letter statistics:
    % draw uniformly from the alphabet
    dr = draw(1:na, ones(1, na) / na, len);
    s = alphabet(dr);
else
    s = char(double(' ') * ones(1, len));
    for k = 1:len
        j = max(1, k - order);
        pp = place(s(j:(k-1)), da);
        g = min(k, order + 1);
        s(k) = alphabet(draw(1:na, ng{g}(pp, :)));
    end
end

22 Well, at least the gibberish in fixed-width font...
The real meat of this code is the for loop at the bottom, which computes the place of the
previous letters within the appropriate frequency table in ng, draws a new letter index from
the corresponding row, and appends the alphabet letter for that index to the output string s.
The same principle of modeling sequences with Markov chains can of course be extended
to sequences of words instead of characters: collect word statistics for a dictionary rather than
character statistics for an alphabet, and generate pseudo-sentences out of real words.
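A word-level version can be sketched in a few lines of Python (a toy illustration with a made-up training sentence, not part of the original notes):

```python
import random
from collections import defaultdict

def word_bigrams(text):
    """Collect, for each word, the list of words that follow it."""
    words = text.split()
    table = defaultdict(list)
    for prev, cur in zip(words, words[1:]):
        table[prev].append(cur)
    return table

def babble(table, start, length, rng=random):
    """Generate a pseudo-sentence by walking the bigram table."""
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break  # dead end: the word only occurred last in the text
        out.append(rng.choice(followers))
    return ' '.join(out)

table = word_bigrams('the cat sat on the mat and the cat ran')
s = babble(table, 'the', 8)
```

Storing every follower occurrence in the list makes a uniform choice from the list equivalent to drawing from the estimated transition probabilities.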
By now you are probably wondering why in the world anyone would attempt to mimic the English language with statistically correct gibberish, other than as an exercise for a modeling class. An important application of this principle, at more levels than just letters and words, is speech recognition. In a nutshell, a speech recognition software system typically takes a stream of speech signals coming from a microphone, and parses this stream first into phonemes,23 then phonemes into words, and words into sentences. Parsing means cutting up the input into the proper units, and recognizing each unit (as a specific phoneme, word, or sentence). To understand the
difficulty of this computation, think of listening to an unfamiliar language and trying to determine
the boundaries between words, let alone understand the words. Apparently, the two must be done
together.
Markov statistical models of speech have encountered a great deal of success in the past decade
or two. Rather than generating gibberish, the statistical model is used to measure the likelihood
of different candidate parsing results for the same input, and to choose the most likely interpre-
tation. The Markov models capture the likelihoods of individual links (of varying orders, and at
different levels) between units. Interesting computational techniques can then accrue these values
to compute the likelihoods of long chains of units. The methods for actually doing so are beyond
the scope of this course. See for instance F. Jelinek, Statistical Methods for Speech Recognition,
MIT Press, 1997.
23 A phoneme is similar to a syllable, but less ambiguous in its pronunciation.