    Markov Chains

Equipped with the basic tools of probability theory, we can now revisit the stochastic models we considered starting on page 47 of these notes. The recurrence (26) for the stochastic version of the sand-hill crane model is an instance of the following template:

$y(n) = \text{a sample from } p_{Y_n \mid Y_{n-1},\ldots,Y_{n-m}}(y \mid y(n-1), \ldots, y(n-m))$ .   (37)

The stochastic sand-hill crane model is an example of the special case m = 1:

$y(n) = \text{a sample from } p_{Y_n \mid Y_{n-1}}(y \mid y(n-1))$ .   (38)

Recall that, given the value y(n-1) for the population $Y_{n-1}$ (a random variable) of cranes in year n-1, we formulated probability distributions for the numbers of births (Poisson) and deaths (binomial) between year n-1 and year n. Because these probabilities depend on y(n-1), they are conditioned upon the fact that $Y_{n-1} = y(n-1)$. If there are b births and d deaths, then

$Y_n = Y_{n-1} + b - d$

so knowing the value y(n-1) of $Y_{n-1}$ and the conditional probability distributions $p_{b \mid Y_{n-1}}$ and $p_{d \mid Y_{n-1}}$ of b and d given $Y_{n-1}$ is equivalent to knowing the conditional distribution^{18} $p_{Y_n \mid Y_{n-1}}$ of $Y_n$ given $Y_{n-1}$.

A sequence of random variables $Y_0, Y_1, \ldots$ whose values $y(0), y(1), \ldots$ are produced by a stochastic recurrence of the form (37) is called a discrete Markov process of order m. The initial value y(0) of $Y_0$ is itself a random variable, rather than a fixed number.

In the sand-hill crane model, the value that the population $Y_n$ in any given year n can assume is unbounded. This presents technical difficulties that it is wise to avoid in a first pass through the topic of stochastic modeling. We therefore make the additional assumption that the random variables $Y_n$ are drawn out of a finite alphabet $\mathcal{Y}$ with K values. In the sand-hill crane example, we would have to assume a maximum population of K-1 birds (including zero, this yields K possible values). In other examples, including the one examined in the next Section, the restriction to a finite alphabet is more natural. A discrete Markov process defined on a finite alphabet is called a Markov chain. Thus:

^{18}To actually determine the form of this conditional distribution would require a bit of work. It turns out that the probability distribution of the sum (or difference) of two independent random variables is the convolution (or correlation) of the two distributions. The assumption that births and deaths are independent is an approximation, and is more or less valid when the death rate is small relative to the population. Should this assumption be unacceptable, one would have to provide the joint distribution of b and d.
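To make the footnote's remark concrete, here is a minimal numerical sketch (not part of these notes) of how the transition distribution $p_{Y_n \mid Y_{n-1}}$ could be tabulated for one fixed value y(n-1) = y0. The birth rate lambda, the death probability delta, and the truncation bmax are made-up illustration values, and the Statistics Toolbox functions poisspdf and binopdf are assumed to be available:

% Sketch only: transition distribution of the crane model for a fixed
% previous population y0, assuming independent births and deaths.
y0 = 40; lambda = 5; delta = 0.1;        % illustrative parameters
bmax = 30;                               % truncate the Poisson support
pb = poisspdf(0:bmax, lambda);           % births b = 0, ..., bmax
pd = binopdf(0:y0, y0, delta);           % deaths d = 0, ..., y0
% Distribution of the net change b - d: the correlation of pb and pd,
% computed as a convolution with one of the two distributions reversed.
pnet = conv(pb, fliplr(pd));             % support runs from -y0 to bmax
yn = y0 + (-y0:bmax);                    % possible values of Yn = y0 + b - d
pYnGivenY0 = pnet / sum(pnet);           % renormalize after the truncation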


To specify a Markov chain of order m (equation (37)) requires specifying the initial probability distribution

$p_{Y_0}(y(0))$

and the transition probabilities

$p_{Y_n \mid Y_{n-1},\ldots,Y_{n-m}}(y(n) \mid y(n-1), \ldots, y(n-m))$

where all variables y(n) range over a finite alphabet $\mathcal{Y}$ of K elements. Note that in this expression y(0) and y(n), ..., y(n-m) are variables, not fixed numbers. For instance, we need to know the distribution $p_{Y_0}(y(0))$ for all possible values of y(0).

The Markov chain (37) is said to be stationary if the transition probabilities are the same for all n. In that case, their expression can be simplified as follows:

$p_{Y \mid Y_1,\ldots,Y_m}(y \mid y_1, \ldots, y_m)$ .

    A Language Model

    To illustrate the concept of a Markov chain we consider the problem of modeling the English

    language at the low level of word utterances (that is, we do not attempt to model the structure

    of sentences). More specifically, we attempt to write a computer program that generates random

    strings of letters that in some sense look like English words.

A first, crude attempt would draw letters and blank spaces at random out of a uniform probability distribution over the alphabet (augmented with a blank character, for a total of 27 symbols).

    This would be a poor statistical model of the English language: all it specifies is what characters

    are allowed. Here are a few samples of 65-character sentences (one per line):

    earryjnv anr jakroyvnbqkrxtgashqtzifzstqaqwgktlfgidmxxaxmmhzmgbya

    mjgxnlyattvc rwpsszwfhimovkvgknlgddou nmytnxpvdescbg k syfdhwqdrj

    jmcovoyodzkcofmlycehpcqpuflje xkcykcwbdaifculiluyqerxfwlmpvtlyqkv

    This is not quite compelling: words (that is, strings between blank spaces) have implausible

    lengths, the individual letters seem to come from some foreign language, and letter combinations

    are unpronounceable. The algorithm that generated these sequences is a zeroth-order stationary

    Markov chain where the initial probability distribution and the transition probabilities are all equal

to the uniform distribution. "Transition" here is even a misnomer, because each letter is independent of the previous one:

$p_{Y_0}(y) = p_{Y \mid Y_1,\ldots,Y_m}(y \mid y_1, \ldots, y_m) = p_Y(y) = p_{U_{27}}(y)$   (39)

($p_{U_{27}}$ is the uniform distribution over 27 points).
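As an aside, a zeroth-order uniform sample like the ones above can be generated in essentially one line (a sketch, not the notes' code; the 65-character length matches the samples shown):

alphabet = [' ', 'a':'z'];            % blank plus the 26 lowercase letters
line = alphabet(randi(27, 1, 65));    % one 65-character "sentence"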


A moderate improvement comes from replacing the uniform distribution with a sample distribution from real text. To this end, let us take as input all of James Joyce's Ulysses, a string of 1,564,672 characters (including a few lines of header).^{19} Some cleanup first converts all characters to lowercase, replaces non-alphabetic characters (commas, periods, and so forth) with blanks, and compacts any resulting sequences of consecutive blanks into single blanks.

    As an exercise in vector-style text processing, here is the Matlab code for the cleanup function:

function out = cleanup(in, alphabet)
% Convert to lower case
in = lower(in);
% Change unknown symbols to blank spaces
known = false(size(in));
for letter = alphabet
    known(in == letter) = true;
end
in(~known) = ' ';
% Find first of each sequence of nonblanks
first = in ~= ' ' & [' ' in(1:(end-1))] == ' ';
% Find last of each sequence of nonblanks
last = in ~= ' ' & [in(2:end) ' '] == ' ';
% Convert from logical flags to indices
first = find(first);
last = find(last);
% Replace runs of blanks with single blanks
out = '';
for k = 1:length(first)
    out = [out, in(first(k):last(k)), ' '];
end

The input argument in is a string of characters (the text of Ulysses, a very long string indeed), and alphabet is a string of the 26 lowercase letters of the English alphabet, plus a blank ' ' added as the first character. The output string out contains the cleaned-up result. Loops were avoided as far as possible. The loop on letter is very small (27 iterations), and the last loop on k seemed unavoidable.^{20}

    After this cleanup operation, it is a simple matter to tally the frequency of each of the letters in

    the alphabet (including the blank character) in Ulysses. A plot of this frequency distribution would

look very jagged. For a more pleasing display, Figure 17(a) shows the frequency distribution of the

    characters sorted by their frequency of occurrence. This function is used as an approximation of

    the probability distribution of letters and spaces in the English language.

^{19}This text is available for instance at http://www.gutenberg.org/ebooks/4300.
^{20}Please let me know if you find a way to avoid this loop.
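For reference, the tally itself could be done along these lines (a sketch under the assumption that text holds the cleaned-up string and alphabet the 27-character string with the blank first; the variable names are not from the notes):

counts = zeros(1, length(alphabet));
for k = 1:length(alphabet)
    counts(k) = sum(text == alphabet(k));   % occurrences of each character
end
pY = counts / sum(counts);                  % empirical letter distribution
[pSorted, order] = sort(pY, 'descend');     % sorted, as in Figure 17(a)
bar(pSorted);                               % plot as in Figure 17(a)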


[Figure 17 appears here: two bar-chart panels, (a) p_Y(y) and (b) c_Y(y), over the characters sorted as (blank) e t a o i n s h r l d u m c g f w y p b k v j x q z.]

Figure 17: (a) Frequency distribution of the 26 letters of the alphabet and the blank character in Joyce's Ulysses. Letters were sorted in order of decreasing frequency. The first value is for the blank character. (b) Corresponding cumulative distribution. The probability that a number (diamond) drawn uniformly between 0 and 1 falls on the interval shown on the ordinate is equal to the difference $c_Y(o) - c_Y(a)$. By definition of $c_Y$, this difference is equal to the probability $p_Y(o)$, so we select the letter "o" whenever the diamond falls in the interval shown.


The resulting language model is still the zeroth-order Markov chain of equation (39), but with a more plausible choice of individual letters. Rather than drawing from the uniform distribution over the alphabet, we now draw from a distribution $p_Y(y)$ that we estimated from a piece of text.

How do we draw from a given distribution? If the set $\mathcal{Y}$ on which the distribution $p_Y(y)$ is defined is finite, there is a simple method for converting a uniform (pseudo)random sample generator into a generator that draws from $p_Y(y)$ instead. First, compute the cumulative distribution of Y:

$c_Y(y) = P[Y \le y] = \sum_{k \le y} p_Y(k)$

where P[·] stands for the phrase "the probability that...". This is shown in Figure 17(b). If we now append a zero to the left of the sequence of values $c_Y(y)$, the resulting sequence spans monotonically the entire range between 0 and 1, since by construction

$c_Y(y-1) \le c_Y(y)$ and $c_Y(y_K) = 1$

($y_K$ is the last element in the alphabet $\mathcal{Y} = \{y_1, \ldots, y_K\}$). In addition, the differences between consecutive values of $c_Y(y)$ are equal to the probabilities $p_Y(y)$. We can then use a uniform (pseudo)random number generator to draw a number u between 0 and 1, map that to the ordinate of the plot in Figure 17(b), and find the first entry $y_u$ in the domain $\mathcal{Y}$ of the function $c_Y$ such that $c_Y(y_u)$ equals or exceeds u. This construction is illustrated in Figure 17(b). The probability of hitting a particular value $y_u$ (the letter "o" in the example in the Figure) is equal to the probability that u (the diamond on the ordinate) is between $c_Y(y_u - 1)$ and $c_Y(y_u)$, and this is, by definition of $c_Y$, the probability $p_Y(y_u)$. Thus, this procedure draws from $p_Y$, as desired.

    Here is how to do this in Matlab:

function v = draw(y, p, n)
if nargin < 3 || isempty(n)
    n = 1;    % Draw a single number
end
c = cumsum(p);
c = c(:);
if c(end) == 0
    v = [];
else
    c = c / c(end);
    c = [0; c];
    K = length(c);
    r = rand(n, 1);
    v = zeros(n, 1);
    for i = 2:K
        v(find(r >= c(i - 1))) = y(i - 1);
    end
end


The argument y lists the values in the domain $\mathcal{Y}$. In the example, these would be numbers between 1 and 27, which represent the alphabet plus the blank character. The argument p lists the probabilities of each element in the domain, and the argument n specifies how many numbers to draw (1 if n is left unspecified). The Matlab built-in function cumsum computes the cumulative sum of the elements of the input vector.

A note on efficiency: The function draw spends most of its time looking for the first index v where c(v) equals or exceeds the random number(s) r. This search is implemented here with the Matlab built-in function find, which scans the entire array that is passed to it as argument. In principle, this is an inefficient method, because one could use binary search: if there are K elements in the vector c, first try c(round(K/2)), an element roughly in the middle of the vector. If this element is too large, the target must be in the first half of c, otherwise it must be in the second half. Repeat this procedure in the correct half until the desired element is found. Since the interval is halved at every step, binary search requires only about log2(K) comparisons rather than K. However, unless K is very large, this theoretical speedup is more than canceled by the fact that the built-in function find is pre-compiled, and therefore very fast. A binary search would have to be written in Matlab, an interpreted language, which is much slower.
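For completeness, a binary search in Matlab might look like this (a sketch with a hypothetical name, assuming c is sorted in nondecreasing order and u does not exceed c(end)); it returns the first index v such that c(v) equals or exceeds u:

function v = firstAtLeast(c, u)
lo = 1;
hi = length(c);
while lo < hi
    mid = floor((lo + hi) / 2);
    if c(mid) >= u
        hi = mid;          % the answer is mid or somewhere to its left
    else
        lo = mid + 1;      % the answer is strictly to the right of mid
    end
end
v = lo;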

    Here are some sample sentences obtained by drawing from a realistic frequency distribution

    for the letters in English:

    ooyusdii eltgotoroo tih ohnnattti gyagditghreay nm roefnnasos r

    naa euuecocrrfca ayas el s yba anoropnn laeo piileo hssiod idlif

    beeghec ebnnioouhuehinely neiis cnitcwasohs ooglpyocp h trog l

This still does not look anywhere near English. However, both letters and blanks are now drawn with a frequency that equals that in Ulysses. The letters look more common than with the uniform distribution because they correspond to actual frequencies in English. Even so, words still have implausible lengths: a correct frequency of blanks only ensures that the mean number of blanks in any substantial length of text is correct, not that the lengths of the runs between blanks (words) are correct. For instance, you will notice several multiple blanks in the text above. Three four-letter words separated by two blanks (a plausible sequence in English) are equivalent to one 12-letter word followed by two consecutive blanks (a much less plausible sequence in English) in the sense that the frequency of blanks is the same for both cases (2/14). Similar considerations hold for letters: frequencies are correct, but the order in which letters follow each other is entirely unrelated to English. Adjacent letters are statistically independent in the model, but not in reality.

To address this shortcoming, we can collect statistics about the conditional probability of one letter given the previous one,

$p_{Y \mid Y_1}(y \mid y_1) = P[Y = y \mid Y_1 = y_1]$ .

These transition probabilities are displayed graphically in Figure 18. Do not confuse these with the joint probabilities. For instance, the conditional probability that the current letter Y is equal to "u" given that the previous letter $Y_1$ is equal to "q" is one, because "u" always follows "q". However,


the joint probability of the pair "qu" is equal to the probability of "q". From Figure 17(a) we see that this probability is very small (it turns out to be about $9 \times 10^{-3}$). More generally, from the definition (34) of conditional probability we see that the joint probability can be found from the conditional probabilities (Figure 18) and the probabilities of the individual letters (Figure 17(a)) as follows:

$p_{Y_1, Y}(y_1, y) = p_{Y \mid Y_1}(y \mid y_1) \, p_{Y_1}(y_1) = p_{Y \mid Y_1}(y \mid y_1) \, p_Y(y_1)$ .

The last equality is justified by our assumption that language statistics are stationary.

Values in each row of the transition matrix $p_{Y \mid Y_1}$ add up to one, because the probability that a letter is followed by some character is 1 (except for the very last letter in the text):

$\sum_{y \in \mathcal{Y}} p_{Y \mid Y_1}(y \mid y_1) = 1$

where $\mathcal{Y}$ is the alphabet, plus the blank character. A matrix with this property is said to be stochastic.

The matrix of transition probabilities can be estimated from a given piece of text (Ulysses in our case) by initializing a 27 × 27 matrix to all zeros, and incrementing the entry corresponding to previous letter $y_1$ and current letter y every time we encounter such a pair. Once the whole text has been scanned, we divide each row by the sum of its entries.^{21}
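As a sketch of this procedure (not the ngrams function given later; text, alphabet, and pY are the names assumed in the earlier sketches):

na = length(alphabet);                 % 27
T = zeros(na, na);
idx = zeros(1, length(text));
for k = 1:na
    idx(text == alphabet(k)) = k;      % map each character to 1..27
end
for k = 2:length(text)
    T(idx(k-1), idx(k)) = T(idx(k-1), idx(k)) + 1;   % count the pair
end
rowSums = sum(T, 2);
rowSums(rowSums == 0) = 1;             % leave empty rows at zero
T = bsxfun(@rdivide, T, rowSums);      % each nonempty row now sums to one
% Sanity check: the row for 'q' should put nearly all of its mass on 'u'
qi = find(alphabet == 'q'); ui = find(alphabet == 'u');
[T(qi, ui), T(qi, ui) * pY(qi)]        % roughly [1, p(q)]: the joint probability of "qu"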

We can now generate a new set of sentences by drawing out of the transition probabilities, thereby generating samples out of a (stationary) first-order Markov chain as in equation (38). More specifically, the first letter is drawn out of $p_Y(y)$, just as we did in our earlier attempt. After that, we look at the specific value $y_1$ of the previous letter, and draw from the transition probability $p_{Y \mid Y_1}(y \mid y_1)$. Since $y_1$ is now a known value, this amounts to drawing from the distribution represented by the row corresponding to $y_1$ in Figure 18. Here are a few sample sentences:

    icke inginatenc blof ade and jalorghe y at helmin by hem owery fa

    st sin r d n cke s t w anks hinioro e orin en s ar whes ore jot j

    whede chrve blan ted sesourethegebe inaberens s ichath fle watt o

    On occasion, this is almost pronounceable. Word lengths are now slightly more plausible. Note

    however that the distribution of word lengths in the second line is very different from that in the

    third: statistics are just statistics, and one must expect variability in an experiment that is this small.
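Concretely, the first-order generation just described could be written with the draw function and the transition matrix T from the sketch above (again, a sketch rather than the notes' code):

len = 65;
s = zeros(1, len);
s(1) = draw(1:na, pY);                 % first letter from the letter distribution
for k = 2:len
    s(k) = draw(1:na, T(s(k-1), :));   % next letter from the row of the previous one
end
disp(alphabet(s));                     % convert the indices back to characters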

Now however we can see the pattern: incorporating more information on how letters follow each other leads to more plausible results. Instead of a first-order Markov chain, we can try a second-order one:

$y(n) = \text{a sample from } p_{Y \mid Y_1, Y_2}(y \mid y_1, y_2)$ .

This of course requires compiling the necessary frequency tables from Ulysses (or from your favorite text). The distribution $p_{Y \mid Y_1, Y_2}$ is a function of three variables rather than two, and each

^{21}If we were to divide by the sum of all the entries in the matrix, we would obtain the joint probabilities instead of the conditional ones.


[Figure 18 appears here: a 27 × 27 array of squares, with the previous letter indexing the rows and the current letter indexing the columns.]

Figure 18: Conditional frequencies of the 26 letters of the alphabet and the blank character in Joyce's Ulysses, given the previous character. The first row and first column are for the blank character. The area of each square is proportional to the conditional probability of the current letter given the previous letter. For instance, the largest square in the picture corresponds to the probability, equal to one, that the letter "q" is followed by the letter "u".


of them can take one of 27 values. So the new table has 27^3 = 19,683 entries. Other than this, the procedures for data collection and sequence generation are essentially the same, and the idea can then be repeated for higher-order chains as well.

The cost of exponentially large tables (27^{n+1} entries for an n-th order chain) can be curbed substantially by observing that most conditional probabilities are zero. For instance, a sequence of three letters drawn entirely (that is, uniformly) at random is unlikely to be a plausible English sequence. If it is not, it will never show up in Ulysses, and the corresponding conditional probability in the table is zero. Matlab can deal very well with sparse matrices like these. If a matrix A has, say, 10,000 entries only 200 of which are nonzero, the instruction

A = sparse(A);

will create a new data structure that only stores the nonzero values of A, together with information necessary to reconstruct where in the original matrix each entry belongs. As a consequence, the storage space (and processing time) for sparse matrices is proportional to the number of nonzero entries, rather than to the formal size of the matrix.
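A quick way to see the saving (illustrative numbers only, not from the notes):

A = zeros(100, 100);                     % 10,000 entries
A(randi(10000, 200, 1)) = rand(200, 1);  % roughly 200 nonzero entries
S = sparse(A);
whos A S                                 % S uses far less memory than A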

    Here is the code that collects text statistics up to a specified order:

function [ng, alphabet] = ngrams(text, maxOrder)
nmax = maxOrder + 1;
if nmax > 4
    error('Unwise to do more than quadrigrams: too much storage, computation')
end
alphabet = ' abcdefghijklmnopqrstuvwxyz';
na = length(alphabet);
da = double(alphabet);
ng = {};
for n = 1:nmax
    ng{n} = sparse(zeros(na^(n-1), na));
end
text = cleanup(text, alphabet);
% Wrap around to avoid zero probabilities
text = [text((end - maxOrder):end), text];
for k = 1:length(text)
    ps = place(text(k), da);
    last = k - 1;
    ne = min(nmax, k);
    for n = 1:ne
        j = k - n + 1;
        prefix = text(j:last);
        pp = place(prefix, da);
        ng{n}(pp, ps) = ng{n}(pp, ps) + 1;


    end
end
% Normalize conditional distributions
o = ones(na, 1);
for n = 1:nmax
    s = ng{n} * o;
    for p = 1:size(ng{n}, 1)
        if s(p) ~= 0
            ng{n}(p, :) = ng{n}(p, :) / s(p);
        end
    end
end

This function is called ngrams because it computes statistics for digrams (pairs of letters), trigrams (triples of letters), and so forth. The output ng is a cell array with maxOrder + 1 matrices, each describing the statistics of a different order. Logically, the statistics of order n should be stored in an array with n+1 dimensions. Instead, the code above flattens these arrays into two-dimensional matrices to make access and later computation faster. This requires mapping, say, a four-dimensional vector of indices (i, j, k, l) to a two-dimensional vector (v, l), where v(i, j, k) is some invertible function of (i, j, k). How this is done is not important, but it must be done consistently when collecting statistics and when using them. To this end, the mapping has been encapsulated into a function place that takes a string of letters prefix and (a numerical representation da of) the alphabet, and returns the place of that string within a matrix of appropriate size. Here is how place works:

function pos = place(string, da)
na = length(da);
len = length(string);
pos = 0;
for k = 1:len
    i = find(double(string(k)) == da);
    if isempty(i)
        % Convert unknown characters to blanks (position 1 in da)
        i = 1;
    end
    pos = pos * na + i - 1;
end
% Convert to Matlab style array indexing (so minimum pos is 1, not 0)
pos = pos + 1;
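For example (assuming the 27-character alphabet with the blank first):

da = double([' ', 'a':'z']);
place('a', da)      % returns 2: 'a' is the second character of the alphabet
place('ab', da)     % returns (2-1)*27 + (3-1) + 1 = 30
place('', da)       % the empty prefix maps to 1 (used by the zeroth-order table)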

Going back to ngrams, the function cleanup has been already discussed. The instruction with a comment about "Wrap around" prevents a little quirk that concerns the end of the text: if the string, say, "ix" appears at the end of the text and nowhere else, all the transition probabilities from "ix" to a third character are zero. If "ix" is ever generated in the Markov chain, then there is no next character to go to. To avoid this, the tail end of the text is also copied to the beginning, so that every sequence of letters is followed by some letter. The


    rest of the code is straightforward: initialize storage for ng, compute the statistics by scanning

    text and incrementing the proper entries of ng, and normalize entries to obtain conditional

    probabilities.

    The following are examples of gibberish generated by a second-order Markov chain:

    he ton th a s my caroodif flows an the er ity thayertione wil ha

    m othenre re creara quichow mushing whe so mosing bloack abeenem

    used she sighembs inglis day p wer wharon the graiddid wor thad k

    Some of these look like actual words: common sequences of letters and blanks. A third-order

    model does even better:

    es in angull o shoppinjust stees ther a kercourats allech is hote

ternal liked be weavy because in coy mrs hand room him rolio und
ceran in that he mound a dishine when what to bitcho way forgot p

    Almost looks like real text...

All the gibberish^{22} in this Section has been generated with the following Matlab code.

^{22}Well, at least the gibberish in fixed-width font...

function s = randomSentence(ng, alphabet, len, order)
if nargin < 4 || isempty(order)
    order = length(ng) - 1;    % Use maximum order possible
end
if order >= length(ng)
    error('Only statistics up to order %d are available', length(ng) - 1);
end
if order < -1
    error('order must be at least -1')
end
da = double(alphabet);
na = length(alphabet);
if order == -1
    % Don't even consider letter statistics:
    % draw uniformly from the alphabet
    dr = draw(1:na, ones(1, na) / na, len);
    s = alphabet(dr);
else

    s = char(double(' ') * ones(1, len));
    for k = 1:len
        j = max(1, k - order);
        pp = place(s(j:(k-1)), da);
        g = min(k, order + 1);
        s(k) = alphabet(draw(1:na, ng{g}(pp, :)));
    end
end

    The real meat of this code is the for loop at the bottom, which computes the place of the

    previous letters within the appropriate frequency table in ng, draws a new letter index from

    the corresponding row, and appends the alphabet letter for that index to the output string s.
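Putting the pieces together, a complete run might look like this (a usage sketch; the file name ulysses.txt is an assumption):

text = fileread('ulysses.txt');             % the raw text of Ulysses
[ng, alphabet] = ngrams(text, 3);           % collect statistics up to order 3 (quadrigrams)
disp(randomSentence(ng, alphabet, 65, 2));  % one 65-character second-order sample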

    The same principle of modeling sequences with Markov chains can of course be extended

    to sequences of words instead of characters: collect word statistics for a dictionary rather than

    character statistics for an alphabet, and generate pseudo-sentences out of real words.
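A first step in that direction might be the following (a sketch, not from the notes; strsplit requires a reasonably recent Matlab):

words = strsplit(strtrim(cleanup(text, alphabet)));   % cell array of words
dictionary = unique(words);                           % the "alphabet" is now a word list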

By now you are probably wondering why in the world anyone would attempt to mimic the English language with statistically correct gibberish, other than as an exercise for a modeling

    class. An important application of this principle, at more levels than just letters and words, is

    speech recognition. In a nutshell, a speech recognition software system typically takes a stream

of speech signals coming from a microphone, and parses this stream first into phonemes,^{23} then

    phonemes into words, and words into sentences. Parsing means cutting up the input into the proper

    units, and recognizing each unit (as a specific phoneme, word, or sentence). To understand the

    difficulty of this computation, think of listening to an unfamiliar language and trying to determine

    the boundaries between words, let alone understand the words. Apparently, the two must be done

    together.

    Markov statistical models of speech have encountered a great deal of success in the past decade

    or two. Rather than generating gibberish, the statistical model is used to measure the likelihood

of different candidate parsing results for the same input, and to choose the most likely interpretation. The Markov models capture the likelihoods of individual links (of varying orders, and at

    different levels) between units. Interesting computational techniques can then accrue these values

    to compute the likelihoods of long chains of units. The methods for actually doing so are beyond

    the scope of this course. See for instance F. Jelinek, Statistical Methods for Speech Recognition,

    MIT Press, 1997.

^{23}A phoneme is similar to a syllable, but less ambiguous in its pronunciation.