    Markov Chains

Equipped with the basic tools of probability theory, we can now revisit the stochastic models we considered starting on page 47 of these notes. The recurrence (26) for the stochastic version of the sand-hill crane model is an instance of the following template:

$y(n) = \text{a sample from } p_{Y_n \mid Y_{n-1},\ldots,Y_{n-m}}(y \mid y(n-1), \ldots, y(n-m))$ .   (37)

The stochastic sand-hill crane model is an example of the special case m = 1:

$y(n) = \text{a sample from } p_{Y_n \mid Y_{n-1}}(y \mid y(n-1))$ .   (38)

Recall that, given the value y(n-1) for the population $Y_{n-1}$ (a random variable) of cranes in year n-1, we formulated probability distributions for the numbers of births (Poisson) and deaths (binomial) between year n-1 and year n. Because these probabilities depend on y(n-1), they are conditioned upon the fact that $Y_{n-1} = y(n-1)$. If there are b births and d deaths, then

$Y_n = Y_{n-1} + b - d$

so knowing the value y(n-1) of $Y_{n-1}$ and the conditional probability distributions $p_{b \mid Y_{n-1}}$ and $p_{d \mid Y_{n-1}}$ of b and d given $Y_{n-1}$ is equivalent to knowing the conditional distribution^{18} $p_{Y_n \mid Y_{n-1}}$ of $Y_n$ given $Y_{n-1}$.

A sequence of random variables $Y_0, Y_1, \ldots$ whose values $y(0), y(1), \ldots$ are produced by a stochastic recurrence of the form (37) is called a discrete Markov process of order m. The initial value y(0) of $Y_0$ is itself a random variable, rather than a fixed number.

In the sand-hill crane model, the value that the population $Y_n$ in any given year n can assume is unbounded. This presents technical difficulties that it is wise to avoid in a first pass through the topic of stochastic modeling. We therefore make the additional assumption that the random variables $Y_n$ are drawn out of a finite alphabet $\mathcal{Y}$ with K values. In the sand-hill crane example, we would have to assume a maximum population of K-1 birds (including zero, this yields K possible values). In other examples, including the one examined in the next Section, the restriction to a finite alphabet is more natural. A discrete Markov process defined on a finite alphabet is called a Markov chain. Thus:

^{18}To actually determine the form of this conditional distribution would require a bit of work. It turns out that the probability distribution of the sum (or difference) of two independent random variables is the convolution (or correlation) of the two distributions. The assumption that births and deaths are independent is an approximation, and is more or less valid when the death rate is small relative to the population. Should this assumption be unacceptable, one would have to provide the joint distribution of b and d.
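To make the footnote's remark concrete, here is a minimal numerical sketch (not part of these notes) of how the transition distribution $p_{Y_n \mid Y_{n-1}}$ could be tabulated for one fixed value y(n-1) = y0. The birth rate lambda, the death probability delta, and the truncation bmax are made-up illustration values, and the Statistics Toolbox functions poisspdf and binopdf are assumed to be available:

% Sketch only: transition distribution of the crane model for a fixed
% previous population y0, assuming independent births and deaths.
y0 = 40; lambda = 5; delta = 0.1;        % illustrative parameters
bmax = 30;                               % truncate the Poisson support
pb = poisspdf(0:bmax, lambda);           % births b = 0, ..., bmax
pd = binopdf(0:y0, y0, delta);           % deaths d = 0, ..., y0
% Distribution of the net change b - d: the correlation of pb and pd,
% computed as a convolution with one of the two distributions reversed.
pnet = conv(pb, fliplr(pd));             % support runs from -y0 to bmax
yn = y0 + (-y0:bmax);                    % possible values of Yn = y0 + b - d
pYnGivenY0 = pnet / sum(pnet);           % renormalize after the truncation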


To specify a Markov chain of order m (equation (37)) requires specifying the initial probability distribution

$p_{Y_0}(y(0))$

and the transition probabilities

$p_{Y_n \mid Y_{n-1},\ldots,Y_{n-m}}(y(n) \mid y(n-1), \ldots, y(n-m))$

where all variables y(n) range over a finite alphabet $\mathcal{Y}$ of K elements. Note that in this expression y(0) and y(n), ..., y(n-m) are variables, not fixed numbers. For instance, we need to know the distribution $p_{Y_0}(y(0))$ for all possible values of y(0).

The Markov chain (37) is said to be stationary if the transition probabilities are the same for all n. In that case, their expression can be simplified as follows:

$p_{Y \mid Y_1,\ldots,Y_m}(y \mid y_1, \ldots, y_m)$ .

    A Language Model

    To illustrate the concept of a Markov chain we consider the problem of modeling the English

    language at the low level of word utterances (that is, we do not attempt to model the structure

    of sentences). More specifically, we attempt to write a computer program that generates random

    strings of letters that in some sense look like English words.

A first, crude attempt would draw letters and blank spaces at random out of a uniform probability distribution over the alphabet (augmented with a blank character, for a total of 27 symbols).

    This would be a poor statistical model of the English language: all it specifies is what characters

    are allowed. Here are a few samples of 65-character sentences (one per line):

    earryjnv anr jakroyvnbqkrxtgashqtzifzstqaqwgktlfgidmxxaxmmhzmgbya

    mjgxnlyattvc rwpsszwfhimovkvgknlgddou nmytnxpvdescbg k syfdhwqdrj

    jmcovoyodzkcofmlycehpcqpuflje xkcykcwbdaifculiluyqerxfwlmpvtlyqkv

    This is not quite compelling: words (that is, strings between blank spaces) have implausible

    lengths, the individual letters seem to come from some foreign language, and letter combinations

    are unpronounceable. The algorithm that generated these sequences is a zeroth-order stationary

    Markov chain where the initial probability distribution and the transition probabilities are all equal

to the uniform distribution. "Transition" here is even a misnomer, because each letter is independent of the previous one:

$p_{Y_0}(y) = p_{Y \mid Y_1,\ldots,Y_m}(y \mid y_1, \ldots, y_m) = p_Y(y) = p_{U_{27}}(y)$   (39)

($p_{U_{27}}$ is the uniform distribution over 27 points).
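As an aside, a zeroth-order uniform sample like the ones above can be generated in essentially one line (a sketch, not the notes' code; the 65-character length matches the samples shown):

alphabet = [' ', 'a':'z'];            % blank plus the 26 lowercase letters
line = alphabet(randi(27, 1, 65));    % one 65-character "sentence"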


A moderate improvement comes from replacing the uniform distribution with a sample distribution from real text. To this end, let us take as input all of James Joyce's Ulysses, a string of 1,564,672 characters (including a few lines of header).^{19} Some cleanup first converts all characters to lowercase, replaces non-alphabetic characters (commas, periods, and so forth) with blanks, and compacts any resulting sequences of consecutive blanks into single blanks.

    As an exercise in vector-style text processing, here is the Matlab code for the cleanup function:

function out = cleanup(in, alphabet)
% Convert to lower case
in = lower(in);
% Change unknown symbols to blank spaces
known = false(size(in));
for letter = alphabet
    known(in == letter) = true;
end
in(~known) = ' ';
% Find first of each sequence of nonblanks
first = in ~= ' ' & [' ' in(1:(end-1))] == ' ';
% Find last of each sequence of nonblanks
last = in ~= ' ' & [in(2:end) ' '] == ' ';
% Convert from logical flags to indices
first = find(first);
last = find(last);
% Replace runs of blanks with single blanks
out = '';
for k = 1:length(first)
    out = [out, in(first(k):last(k)), ' '];
end

The input argument in is a string of characters (the text of Ulysses, a very long string indeed), and alphabet is a string of the 26 lowercase letters of the English alphabet, plus a blank ' ' added as the first character. The output string out contains the cleaned-up result. Loops were avoided as far as possible. The loop on letter is very small (27 iterations), and the last loop on k seemed unavoidable.^{20}

    After this cleanup operation, it is a simple matter to tally the frequency of each of the letters in

    the alphabet (including the blank character) in Ulysses. A plot of this frequency distribution would

look very jagged. For a more pleasing display, Figure 17(a) shows the frequency distribution of the

    characters sorted by their frequency of occurrence. This function is used as an approximation of

    the probability distribution of letters and spaces in the English language.

^{19}This text is available for instance at http://www.gutenberg.org/ebooks/4300.
^{20}Please let me know if you find a way to avoid this loop.
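For reference, the tally itself could be done along these lines (a sketch under the assumption that text holds the cleaned-up string and alphabet the 27-character string with the blank first; the variable names are not from the notes):

counts = zeros(1, length(alphabet));
for k = 1:length(alphabet)
    counts(k) = sum(text == alphabet(k));   % occurrences of each character
end
pY = counts / sum(counts);                  % empirical letter distribution
[pSorted, order] = sort(pY, 'descend');     % sorted, as in Figure 17(a)
bar(pSorted);                               % plot as in Figure 17(a)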


[Figure 17 appears here: two bar-chart panels, (a) p_Y(y) and (b) c_Y(y), over the characters sorted as (blank) e t a o i n s h r l d u m c g f w y p b k v j x q z.]

Figure 17: (a) Frequency distribution of the 26 letters of the alphabet and the blank character in Joyce's Ulysses. Letters were sorted in order of decreasing frequency. The first value is for the blank character. (b) Corresponding cumulative distribution. The probability that a number (diamond) drawn uniformly between 0 and 1 falls on the interval shown on the ordinate is equal to the difference $c_Y(o) - c_Y(a)$. By definition of $c_Y$, this difference is equal to the probability $p_Y(o)$, so we select the letter "o" whenever the diamond falls in the interval shown.


The resulting language model is still the zeroth-order Markov chain of equation (39), but with a more plausible choice of individual letters. Rather than drawing from the uniform distribution over the alphabet, we now draw from a distribution $p_Y(y)$ that we estimated from a piece of text.

How do we draw from a given distribution? If the set $\mathcal{Y}$ on which the distribution $p_Y(y)$ is defined is finite, there is a simple method for converting a uniform (pseudo)random sample generator into a generator that draws from $p_Y(y)$ instead. First, compute the cumulative distribution of Y:

$c_Y(y) = P[Y \le y] = \sum_{k \le y} p_Y(k)$

where P[·] stands for the phrase "the probability that...". This is shown in Figure 17(b). If we now append a zero to the left of the sequence of values $c_Y(y)$, the resulting sequence spans monotonically the entire range between 0 and 1, since by construction

$c_Y(y-1) \le c_Y(y)$ and $c_Y(y_K) = 1$

($y_K$ is the last element in the alphabet $\mathcal{Y} = \{y_1, \ldots, y_K\}$). In addition, the differences between consecutive values of $c_Y(y)$ are equal to the probabilities $p_Y(y)$. We can then use a uniform (pseudo)random number generator to draw a number u between 0 and 1, map that to the ordinate of the plot in Figure 17(b), and find the first entry $y_u$ in the domain $\mathcal{Y}$ of the function $c_Y$ such that $c_Y(y_u)$ equals or exceeds u. This construction is illustrated in Figure 17(b). The probability of hitting a particular value $y_u$ (the letter "o" in the example in the Figure) is equal to the probability that u (the diamond on the ordinate) is between $c_Y(y_u - 1)$ and $c_Y(y_u)$, and this is, by definition of $c_Y$, the probability $p_Y(y_u)$. Thus, this procedure draws from $p_Y$, as desired.

    Here is how to do this in Matlab:

function v = draw(y, p, n)
if nargin < 3 || isempty(n)
    n = 1;    % Draw a single number
end
c = cumsum(p);
c = c(:);
if c(end) == 0
    v = [];
else
    c = c / c(end);
    c = [0; c];
    K = length(c);
    r = rand(n, 1);
    v = zeros(n, 1);
    for i = 2:K
        v(find(r >= c(i - 1))) = y(i - 1);
    end
end


The argument y lists the values in the domain $\mathcal{Y}$. In the example, these would be numbers between 1 and 27, which represent the alphabet plus the blank character. The argument p lists the probabilities of each element in the domain, and the argument n specifies how many numbers to draw (1 if n is left unspecified). The Matlab built-in function cumsum computes the cumulative sum of the elements of the input vector.

A note on efficiency: The function draw spends most of its time looking for the first index v where c(v) equals or exceeds the random number(s) r. This search is implemented here with the Matlab built-in function find, which scans the entire array that is passed to it as argument. In principle, this is an inefficient method, because one could use binary search: if there are K elements in the vector c, first try c(round(K/2)), an element roughly in the middle of the vector. If this element is too large, the target must be in the first half of c, otherwise it must be in the second half. Repeat this procedure in the correct half until the desired element is found. Since the interval is halved at every step, binary search requires only about log2(K) comparisons rather than K. However, unless K is very large, this theoretical speedup is more than canceled by the fact that the built-in function find is pre-compiled, and therefore very fast. A binary search would have to be written in Matlab, an interpreted language, which is much slower.
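For completeness, a binary search in Matlab might look like this (a sketch with a hypothetical name, assuming c is sorted in nondecreasing order and u does not exceed c(end)); it returns the first index v such that c(v) equals or exceeds u:

function v = firstAtLeast(c, u)
lo = 1;
hi = length(c);
while lo < hi
    mid = floor((lo + hi) / 2);
    if c(mid) >= u
        hi = mid;          % the answer is mid or somewhere to its left
    else
        lo = mid + 1;      % the answer is strictly to the right of mid
    end
end
v = lo;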

    Here are some sample sentences obtained by drawing from a realistic frequency distribution

    for the letters in English:

    ooyusdii eltgotoroo tih ohnnattti gyagditghreay nm roefnnasos r

    naa euuecocrrfca ayas el s yba anoropnn laeo piileo hssiod idlif

    beeghec ebnnioouhuehinely neiis cnitcwasohs ooglpyocp h trog l

This still does not look anywhere near English. However, both letters and blanks are now drawn with a frequency that equals that in Ulysses. The letters look more common than with the uniform distribution because they correspond to actual frequencies in English. Even so, words still have implausible lengths: a correct frequency of blanks only ensures that the mean number of blanks in any substantial length of text is correct, not that the lengths of the runs between blanks (words) are correct. For instance, you will notice several multiple blanks in the text above. Three four-letter words separated by two blanks (a plausible sequence in English) are equivalent to one 12-letter word followed by two consecutive blanks (a much less plausible sequence in English) in the sense that the frequency of blanks is the same for both cases (2/14). Similar considerations hold for letters: frequencies are correct, but the order in which letters follow each other is entirely unrelated to English. Adjacent letters are statistically independent in the model, but not in reality.

To address this shortcoming, we can collect statistics about the conditional probability of one letter given the previous one,

$p_{Y \mid Y_1}(y \mid y_1) = P[Y = y \mid Y_1 = y_1]$ .

These transition probabilities are displayed graphically in Figure 18. Do not confuse these with the joint probabilities. For instance, the conditional probability that the current letter Y is equal to "u" given that the previous letter $Y_1$ is equal to "q" is one, because "u" always follows "q". However,


the joint probability of the pair "qu" is equal to the probability of "q". From Figure 17(a) we see that this probability is very small (it turns out to be about $9 \times 10^{-3}$). More generally, from the definition (34) of conditional probability we see that the joint probability can be found from the conditional probabilities (Figure 18) and the probabilities of the individual letters (Figure 17(a)) as follows:

$p_{Y_1, Y}(y_1, y) = p_{Y \mid Y_1}(y \mid y_1) \, p_{Y_1}(y_1) = p_{Y \mid Y_1}(y \mid y_1) \, p_Y(y_1)$ .

The last equality is justified by our assumption that language statistics are stationary.

Values in each row of the transition matrix $p_{Y \mid Y_1}$ add up to one, because the probability that a letter is followed by some character is 1 (except for the very last letter in the text):

$\sum_{y \in \mathcal{Y}} p_{Y \mid Y_1}(y \mid y_1) = 1$

where $\mathcal{Y}$ is the alphabet, plus the blank character. A matrix with this property is said to be stochastic.

The matrix of transition probabilities can be estimated from a given piece of text (Ulysses in our case) by initializing a 27 × 27 matrix to all zeros, and incrementing the entry corresponding to previous letter $y_1$ and current letter y every time we encounter such a pair. Once the whole text has been scanned, we divide each row by the sum of its entries.^{21}
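As a sketch of this procedure (not the ngrams function given later; text, alphabet, and pY are the names assumed in the earlier sketches):

na = length(alphabet);                 % 27
T = zeros(na, na);
idx = zeros(1, length(text));
for k = 1:na
    idx(text == alphabet(k)) = k;      % map each character to 1..27
end
for k = 2:length(text)
    T(idx(k-1), idx(k)) = T(idx(k-1), idx(k)) + 1;   % count the pair
end
rowSums = sum(T, 2);
rowSums(rowSums == 0) = 1;             % leave empty rows at zero
T = bsxfun(@rdivide, T, rowSums);      % each nonempty row now sums to one
% Sanity check: the row for 'q' should put nearly all of its mass on 'u'
qi = find(alphabet == 'q'); ui = find(alphabet == 'u');
[T(qi, ui), T(qi, ui) * pY(qi)]        % roughly [1, p(q)]: the joint probability of "qu"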

We can now generate a new set of sentences by drawing out of the transition probabilities, thereby generating samples out of a (stationary) first-order Markov chain as in equation (38). More specifically, the first letter is drawn out of $p_Y(y)$, just as we did in our earlier attempt. After that, we look at the specific value $y_1$ of the previous letter, and draw from the transition probability $p_{Y \mid Y_1}(y \mid y_1)$. Since $y_1$ is now a known value, this amounts to drawing from the distribution represented by the row corresponding to $y_1$ in Figure 18. Here are a few sample sentences:

    icke inginatenc blof ade and jalorghe y at helmin by hem owery fa

    st sin r d n cke s t w anks hinioro e orin en s ar whes ore jot j

    whede chrve blan ted sesourethegebe inaberens s ichath fle watt o

    On occasion, this is almost pronounceable. Word lengths are now slightly more plausible. Note

    however that the distribution of word lengths in the second line is very different from that in the

    third: statistics are just statistics, and one must expect variability in an experiment that is this small.
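Concretely, the first-order generation just described could be written with the draw function and the transition matrix T from the sketch above (again, a sketch rather than the notes' code):

len = 65;
s = zeros(1, len);
s(1) = draw(1:na, pY);                 % first letter from the letter distribution
for k = 2:len
    s(k) = draw(1:na, T(s(k-1), :));   % next letter from the row of the previous one
end
disp(alphabet(s));                     % convert the indices back to characters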

Now however we can see the pattern: incorporating more information on how letters follow each other leads to more plausible results. Instead of a first-order Markov chain, we can try a second-order one:

$y(n) = \text{a sample from } p_{Y \mid Y_1, Y_2}(y \mid y_1, y_2)$ .

This of course requires compiling the necessary frequency tables from Ulysses (or from your favorite text). The distribution $p_{Y \mid Y_1, Y_2}$ is a function of three variables rather than two, and each

^{21}If we were to divide by the sum of all the entries in the matrix, we would obtain the joint probabilities instead of the conditional ones.


[Figure 18 appears here: a 27 × 27 array of squares, with the previous letter indexing the rows and the current letter indexing the columns.]

Figure 18: Conditional frequencies of the 26 letters of the alphabet and the blank character in Joyce's Ulysses, given the previous character. The first row and first column are for the blank character. The area of each square is proportional to the conditional probability of the current letter given the previous letter. For instance, the largest square in the picture corresponds to the probability, equal to one, that the letter "q" is followed by the letter "u".


of them can take one of 27 values. So the new table has 27^3 = 19,683 entries. Other than this, the procedures for data collection and sequence generation are essentially the same, and the idea can then be repeated for higher-order chains as well.

The cost of exponentially large tables (27^{n+1} entries for an n-th order chain) can be curbed substantially by observing that most conditional probabilities are zero. For instance, a sequence of three letters drawn entirely (that is, uniformly) at random is unlikely to be a plausible English sequence. If it is not, it will never show up in Ulysses, and the corresponding conditional probability in the table is zero. Matlab can deal very well with sparse matrices like these. If a matrix A has, say, 10,000 entries only 200 of which are nonzero, the instruction

A = sparse(A);

will create a new data structure that only stores the nonzero values of A, together with information necessary to reconstruct where in the original matrix each entry belongs. As a consequence, the storage space (and processing time) for sparse matrices is proportional to the number of nonzero entries, rather than to the formal size of the matrix.
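A quick way to see the saving (illustrative numbers only, not from the notes):

A = zeros(100, 100);                     % 10,000 entries
A(randi(10000, 200, 1)) = rand(200, 1);  % roughly 200 nonzero entries
S = sparse(A);
whos A S                                 % S uses far less memory than A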

    Here is the code that collects text statistics up to a specified order:

function [ng, alphabet] = ngrams(text, maxOrder)
nmax = maxOrder + 1;
if nmax > 4
    error('Unwise to do more than quadrigrams: too much storage, computation')
end
alphabet = ' abcdefghijklmnopqrstuvwxyz';
na = length(alphabet);
da = double(alphabet);
ng = {};
for n = 1:nmax
    ng{n} = sparse(zeros(na^(n-1), na));
end
text = cleanup(text, alphabet);
% Wrap around to avoid zero probabilities
text = [text((end - maxOrder):end), text];
for k = 1:length(text)
    ps = place(text(k), da);
    last = k - 1;
    ne = min(nmax, k);
    for n = 1:ne
        j = k - n + 1;
        prefix = text(j:last);
        pp = place(prefix, da);
        ng{n}(pp, ps) = ng{n}(pp, ps) + 1;


    end
end
% Normalize conditional distributions
o = ones(na, 1);
for n = 1:nmax
    s = ng{n} * o;
    for p = 1:size(ng{n}, 1)
        if s(p) ~= 0
            ng{n}(p, :) = ng{n}(p, :) / s(p);
        end
    end
end

This function is called ngrams because it computes statistics for digrams (pairs of letters), trigrams (triples of letters), and so forth. The output ng is a cell array with maxOrder + 1 matrices, each describing the statistics of a different order. Logically, the statistics of order n should be stored in an array with n+1 dimensions. Instead, the code above flattens these arrays into two-dimensional matrices to make access and later computation faster. This requires mapping, say, a four-dimensional vector of indices (i, j, k, l) to a two-dimensional vector (v, l), where v(i, j, k) is some invertible function of (i, j, k). How this is done is not important, but it must be done consistently when collecting statistics and when using them. To this end, the mapping has been encapsulated into a function place that takes a string of letters prefix and (a numerical representation da of) the alphabet, and returns the place of that string within a matrix of appropriate size. Here is how place works:

function pos = place(string, da)
na = length(da);
len = length(string);
pos = 0;
for k = 1:len
    i = find(double(string(k)) == da);
    if isempty(i)
        % Convert unknown characters to blanks (position 1 in da)
        i = 1;
    end
    pos = pos * na + i - 1;
end
% Convert to Matlab style array indexing (so minimum pos is 1, not 0)
pos = pos + 1;
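For example (assuming the 27-character alphabet with the blank first):

da = double([' ', 'a':'z']);
place('a', da)      % returns 2: 'a' is the second character of the alphabet
place('ab', da)     % returns (2-1)*27 + (3-1) + 1 = 30
place('', da)       % the empty prefix maps to 1 (used by the zeroth-order table)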

Going back to ngrams, the function cleanup has been already discussed. The instruction with a comment about "Wrap around" prevents a little quirk that concerns the end of the text: if the string, say, "ix" appears at the end of the text and nowhere else, all the transition probabilities from "ix" to a third character are zero. If "ix" is ever generated in the Markov chain, then there is no next character to go to. To avoid this, the tail end of the text is also copied to the beginning, so that every sequence of letters is followed by some letter. The


    rest of the code is straightforward: initialize storage for ng, compute the statistics by scanning

    text and incrementing the proper entries of ng, and normalize entries to obtain conditional

    probabilities.

    The following are examples of gibberish generated by a second-order Markov chain:

    he ton th a s my caroodif flows an the er ity thayertione wil ha

    m othenre re creara quichow mushing whe so mosing bloack abeenem

    used she sighembs inglis day p wer wharon the graiddid wor thad k

    Some of these look like actual words: common sequences of letters and blanks. A third-order

    model does even better:

    es in angull o shoppinjust stees ther a kercourats allech is hote

ternal liked be weavy because in coy mrs hand room him rolio und
ceran in that he mound a dishine when what to bitcho way forgot p

    Almost looks like real text...

All the gibberish^{22} in this Section has been generated with the following Matlab code.

^{22}Well, at least the gibberish in fixed-width font...

function s = randomSentence(ng, alphabet, len, order)
if nargin < 4 || isempty(order)
    order = length(ng) - 1;    % Use maximum order possible
end
if order >= length(ng)
    error('Only statistics up to order %d are available', length(ng) - 1);
end
if order < -1
    error('order must be at least -1')
end
da = double(alphabet);
na = length(alphabet);
if order == -1
    % Don't even consider letter statistics:
    % draw uniformly from the alphabet
    dr = draw(1:na, ones(1, na) / na, len);
    s = alphabet(dr);
else

    s = char(double(' ') * ones(1, len));
    for k = 1:len
        j = max(1, k - order);
        pp = place(s(j:(k-1)), da);
        g = min(k, order + 1);
        s(k) = alphabet(draw(1:na, ng{g}(pp, :)));
    end
end

    The real meat of this code is the for loop at the bottom, which computes the place of the

    previous letters within the appropriate frequency table in ng, draws a new letter index from

    the corresponding row, and appends the alphabet letter for that index to the output string s.
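Putting the pieces together, a complete run might look like this (a usage sketch; the file name ulysses.txt is an assumption):

text = fileread('ulysses.txt');             % the raw text of Ulysses
[ng, alphabet] = ngrams(text, 3);           % collect statistics up to order 3 (quadrigrams)
disp(randomSentence(ng, alphabet, 65, 2));  % one 65-character second-order sample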

    The same principle of modeling sequences with Markov chains can of course be extended

    to sequences of words instead of characters: collect word statistics for a dictionary rather than

    character statistics for an alphabet, and generate pseudo-sentences out of real words.
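A first step in that direction might be the following (a sketch, not from the notes; strsplit requires a reasonably recent Matlab):

words = strsplit(strtrim(cleanup(text, alphabet)));   % cell array of words
dictionary = unique(words);                           % the "alphabet" is now a word list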

By now you are probably wondering why in the world anyone would attempt to mimic the English language with statistically correct gibberish, other than as an exercise for a modeling

    class. An important application of this principle, at more levels than just letters and words, is

    speech recognition. In a nutshell, a speech recognition software system typically takes a stream

of speech signals coming from a microphone, and parses this stream first into phonemes,^{23} then

    phonemes into words, and words into sentences. Parsing means cutting up the input into the proper

    units, and recognizing each unit (as a specific phoneme, word, or sentence). To understand the

    difficulty of this computation, think of listening to an unfamiliar language and trying to determine

    the boundaries between words, let alone understand the words. Apparently, the two must be done

    together.

    Markov statistical models of speech have encountered a great deal of success in the past decade

    or two. Rather than generating gibberish, the statistical model is used to measure the likelihood

of different candidate parsing results for the same input, and to choose the most likely interpretation. The Markov models capture the likelihoods of individual links (of varying orders, and at

    different levels) between units. Interesting computational techniques can then accrue these values

    to compute the likelihoods of long chains of units. The methods for actually doing so are beyond

    the scope of this course. See for instance F. Jelinek, Statistical Methods for Speech Recognition,

    MIT Press, 1997.

^{23}A phoneme is similar to a syllable, but less ambiguous in its pronunciation.