An Investigation into Stochastic Processes for Modelling Human Generated Data


An Investigation into Stochastic Processes for Modelling Human-Generated Data

    A 20-cp 3rd year project

    Author:

    Tim Jones

    Supervisor:

    Dr. Gordon Ross

    Rendered on Thursday 25th April, 2013


    Acknowledgement of Sources

For all ideas taken from other sources (books, articles, internet), the source of the ideas is mentioned in the main text and fully referenced at the end of the report.

All material which is quoted essentially word-for-word from other sources is given in quotation marks and referenced.

Pictures and diagrams copied from the internet or other sources are labelled with a reference to the web page, book, article etc.

    Signed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


    Contents

1 Introduction

2 A Zoo of Stochastic Processes
  2.1 Markov Chains
    2.1.1 A Formal Definition
    2.1.2 An Intuitive Interpretation
    2.1.3 The Poisson Process
  2.2 Hidden Markov Models
  2.3 The Markov-Modulated Poisson Process

3 Potential Modelling Techniques
  3.1 Fitting a Non-Homogeneous Poisson Process
  3.2 Fitting a Markov-Modulated Poisson Process
    3.2.1 A derivation of the classical Viterbi algorithm
    3.2.2 A First Approximation of the Viterbi Algorithm for the Markov-Modulated Poisson Process
    3.2.3 Applying the Algorithm
    3.2.4 An Alternative Approach
  3.3 A Diagnosis

4 Conclusion

A Fun with Integrals


    Chapter 1

    Introduction

Modelling human-generated random data is notoriously difficult. Human event data arises in a variety of real-world situations, for example when inspecting network traffic or email communications. Being able to spot anomalies has important applications in botnet detection and in spotting suspicious behaviour on social networks. In this project, we attempted to model the behaviour of one user of twitter using various random processes. Ideally, we would find a good model for twitter in general, but one user makes a worthy starting point, and twitter provides an easily accessible representation of such data without trawling through network logs or other people's email accounts. The user was observed over the course of around 6 months, posting (emitting) a little over 3,300 tweets. The goal is to find some kind of model to fit these data, without using a hideously large number of parameters.

The data are visualised in Figure 1.1. We see a little seasonality - the user seems to be going to sleep at some time and waking up at another - as well as some burstiness: the user will produce a very rapid series of tweets in a very short time. Being able to detect these things would let us intuitively decide whether a fit is good or not, but principally we'll be reliant on statistical tests to judge how good our model is.

Chapter 2 will define a series of stochastic processes which we can use for modelling the data, and then chapter 3 will apply the more relevant ones. Of particular note is the Markov-Modulated Poisson Process defined in 2.3, which several authors [1][2][3] have postulated will provide a good model due to its doubly-stochastic nature - the intuition being that the tweets are a stochastic process, and that the parameters of that stochastic process themselves follow another stochastic process. We will attempt to fit this and a series of other models, before concluding that a DTHMM with lognormal emissions provides the best fit.

Appendix A also describes a minor result which was spotted during this project, and which may have rather useful applications in other fields.

Figure 1.1: The raw data gathered from the twitter user. Each blue cross marks a tweet at a particular time. The 24 hours of a day run across the x-axis, and the days within our 6 months run up the y-axis, so a point at (17.5, 256) is a tweet at 5:30pm, 256 days into the observation.


    Chapter 2

    A Zoo of Stochastic Processes

    2.1 Markov Chains

    2.1.1 A Formal Definition

A stochastic process [4, p.590] is a collection {X_t : t ∈ T} (sometimes X(t) for continuous T) of random variables. These collections may be indexed arbitrarily, but tend to be used to describe the evolution of some random series of events, using T as some representation of either discrete or continuous time; for instance, X_t may be the number of observed emissions from a radioactive source after t minutes, or the licence plate of the t-th car to go past a speed camera.

A Discrete Time Markov Chain (DTMC) is a particular type of discrete-time stochastic process - one which obeys the Markov Property. We usually set T = {0, 1, 2, ...} = ℕ, and we say that (X_t)_{t∈ℕ} obeys the Markov Property iff

∀t ∈ T, ∀s ∈ S:  P(X_{t+1} = s | X_0, X_1, ..., X_t) = P(X_{t+1} = s | X_t)  [5]

We refer to S as the state-space, each s ∈ S is a state, and X_t represents the state at time t. The Markov Property essentially states that, given the present, the future is conditionally independent of the past [5]. There is also a Continuous Time Markov Chain (CTMC, sometimes CTMP for Continuous Time Markov Process). We set T = ℝ⁺, let h be some positive value close to 0, and define a similar Markov Property:

∀s ∈ S:  P(X(t + h) = s | {X(τ) : τ ≤ t}) = P(X(t + h) = s | X(t))

We'll be dealing exclusively with the discrete-space case in this project, i.e. a Markov Chain where S is discrete, though we'll need to use both continuous and discrete times. It is popular for many authors to set S ⊆ ℤ, though our states can be integers, real numbers, popes, or any other completely arbitrary non-empty set. Where S and T are discrete, we define δ = (δ_i)_{i∈S}, the initial probability vector, and Γ = (γ_ij)_{(i,j)∈S²}, the matrix of transition probabilities, such that

δ_i = P(X_0 = i)
∀t ∈ ℕ:  γ_ij = P(X_{t+1} = j | X_t = i)

In this case, Γ is constant - it does not depend on time - a property referred to as homogeneity. Every DTMC can be uniquely defined by the triple (S, δ, Γ). Naturally, the row sums of Γ should be 1, i.e.

∀i ∈ S:  Σ_{j∈S} γ_ij = 1

For CTMCs, rather than a transition probability matrix, a transition rate matrix Q = (q_ij)_{(i,j)∈S²} is defined, for small h, by

∀i, j ∈ S, i ≠ j:  P(X(t + h) = j | X(t) = i) = q_ij h + o(h),

where o(h) is some function such that o(h)/h → 0 as h → 0. This gives us that

∀i, j ∈ S, ∀h > 0:  P(X(t + h) = j | X(t) = i) = (e^{Qh})_{ij},

where (e^{Qh})_{ij} is the (i, j)-th element of the matrix exponential of Qh. With the initial probability vector δ defined as before, we can uniquely define any CTMC by the triple (S, δ, Q). Q's row sums are always 0, with all off-diagonal elements non-negative and all diagonal elements non-positive, i.e.

∀i, j ∈ S, i ≠ j:  q_ij ≥ 0
∀i ∈ S:  q_ii = −Σ_{j≠i} q_ij

    2.1.2 An Intuitive Interpretation

Whilst the above defines the Markov Process, it fails to describe it in any intuitive way. Before attempting to use them, it is important to be able to deal with both the continuous and discrete-time Markov Chains in an intuitive way. My personal preferred method is to use edge-weighted directed graphs [6]. Each state in S is given a node in the graph, and for each i, j ∈ S the edge (i, j) is given weight γ_ij in the discrete case and q_ij in the continuous case. Generally, in the continuous case we don't draw transition rates from each node to itself, since these are implied by the other edges leaving that node.

As an example, let's use the following definitions, with δ and Γ indexed in the order the elements of S are written:

S = {John Paul II, Adrian I}
δ = (1, 0)

Γ =
        JP2   A1
JP2     0.3   0.7
A1      0.9   0.1

This defines a Discrete Time Markov Chain with the graph in Figure 2.1.

Figure 2.1: A DTMC's graph - the initial probability vector may be represented by arrows entering from the outside, but this is not universal.

We can then say that a DTMC will hop from node to node at each time step, with probabilities defined by the weights of the edges between the current node and its neighbours, and can be simulated by algorithm 1. A CTMC is slightly more complex; on arriving in a state i, a random time T ~ Exp(−q_ii) is generated. The CTMC will remain in state i for T units of time, then jump to state j with probability q_ij/(−q_ii). We can simulate a CTMC with algorithm 2.


input : (S, δ, Γ), a Markov Chain, and T, a maximum number of steps to simulate
output: x, a vector of states x_i, recording the sequence of states visited by the chain
begin
    x_1 ← σ with probability δ_σ            /* x_1 takes on state sigma with probability delta_sigma */
    for n ← 2 to T do
        x_n ← σ with probability γ_{x_{n−1},σ}    /* x_n takes on state sigma with the relevant probability */
    end
    return x
end

Algorithm 1: A Simulation Algorithm for the generic Markov Chain

input : (S, δ, Q), a CTMC, and T, a maximum time for which to simulate the process
output: t, a vector of pairs (t_i, x_i), recording the time and destination of the i-th transition
begin
    t_0 ← 0
    x_0 ← σ with probability δ_σ            /* x_0 takes on state sigma with probability delta_sigma */
    n ← 0
    while t_n < T do
        τ ~ Exp(−q_{x_n x_n})               /* tau takes on an exponentially distributed random value */
        n ← n + 1
        t_n ← t_{n−1} + τ
        x_n ← σ with probability q_{x_{n−1} σ}/(−q_{x_{n−1} x_{n−1}})    /* x_n takes on state sigma with the given probability */
    end
    return ((t_0, x_0), ..., (t_{n−1}, x_{n−1}))
end

Algorithm 2: A Simulation Algorithm for the generic CTMC
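As a concrete illustration, algorithms 1 and 2 translate almost directly into code. The following is a minimal Python sketch (assuming only numpy), with states encoded as 0, ..., |S|−1 so that they can index straight into δ, Γ and Q; it is an illustration of the two algorithms above rather than the code used later in the project.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_dtmc(delta, gamma, T):
        """Algorithm 1: simulate T steps of a DTMC with initial vector delta
        and transition matrix gamma; states are 0..len(delta)-1."""
        x = [rng.choice(len(delta), p=delta)]
        for _ in range(1, T):
            x.append(rng.choice(len(delta), p=gamma[x[-1]]))
        return x

    def simulate_ctmc(delta, Q, T):
        """Algorithm 2: simulate a CTMC with generator Q until time T,
        returning (jump time, state) pairs."""
        t, x = 0.0, rng.choice(len(delta), p=delta)
        path = [(t, x)]
        while True:
            t += rng.exponential(1.0 / -Q[x, x])   # holding time is Exp(-q_ii)
            if t >= T:
                return path
            jump_p = Q[x].copy()
            jump_p[x] = 0.0
            jump_p /= -Q[x, x]                     # jump to j with probability q_ij / (-q_ii)
            x = rng.choice(len(delta), p=jump_p)
            path.append((t, x))

    # The two-state pope chain from Figure 2.1:
    gamma = np.array([[0.3, 0.7], [0.9, 0.1]])
    print(simulate_dtmc(np.array([1.0, 0.0]), gamma, 10))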


    2.1.3 The Poisson Process

The simplest form of CTMC is the homogeneous Poisson process, where S = ℕ and, for all i ∈ S, q_ii = −λ and q_{i,i+1} = λ. We call λ the rate of this process. If N(t) is a Poisson process, we then have that, for small h,

∀i ∈ ℕ, ∀t:  P(N(t + h) = i + 1 | N(t) = i) = λh + o(h)

Note that N is the associated counting process, in which N(t) counts the number of events up to time t. N is a stochastic process, not a distribution or a random variable. The graph for this process is similarly simple, and is shown in Figure 2.2.

Figure 2.2: The graph of a Poisson process

Implicit from this, we see that the process can only increase - once leaving a state, we never return - so we can define the emission times t_n as t_n = min{t : N(t) = n}. We can also define τ_n = t_n − t_{n−1}, the inter-arrival time for the n-th jump, for n ∈ {1, 2, ...}. We refer to these as emissions and arrivals since a Poisson process generally models a counting process whereby we record the times at which we observe events happening, e.g. the times at which radioactive particles are detected from a radioactive material. The most crucial property of the Poisson process for this work, implicit from algorithm 2, is that

∀n ∈ ℕ:  τ_n ~ Exp(λ)

That is, all inter-arrival times follow an Exponential distribution. A modification of the generic CTMC simulation algorithm can be made for simulating a Poisson Process. Since it is only possible to jump from state i to state i + 1, we need not record the destinations of each jump; jump i will always take us to state i. This is detailed in algorithm 3.

input : λ, a Poisson process rate, and T, a maximum time for which to simulate
output: t, the emission times of a Poisson process of rate λ terminating before time T
begin
    t_0 ← 0
    n ← 0
    while t_n < T do
        n ← n + 1
        τ_n ~ Exp(λ)               /* tau takes on an exponentially distributed random value */
        t_n ← t_{n−1} + τ_n
    end
    return (t_0, ..., t_{n−1})
end

Algorithm 3: A Simulation Algorithm for the Poisson Process

We can also define an inhomogeneous Poisson process. Everything remains as before, except that rather than having a rate parameter λ ∈ ℝ⁺, we have a rate function λ : ℝ⁺ → ℝ⁺, where, for sufficiently small h,

∀i ∈ ℕ, ∀t ∈ ℝ⁺:  P(N(t + h) = i + 1 | N(t) = i) = λ(t)h + o(h)

This process is no longer homogeneous, but it remains a powerful and more generic tool. Simulating one of these is, however, less simple than for the homogeneous case. The algorithm for doing so is based on Bernoulli thinning [7] and is equivalent to algorithm 4.

To simulate an inhomogeneous Poisson process, we first simulate a homogeneous Poisson process of constant rate λ_max with algorithm 3, where λ_max is some upper bound on the inhomogeneous rate function λ, and then keep each emission at time t_i with probability λ(t_i)/λ_max, i.e. proportional to the inhomogeneous rate function, using algorithm 4.
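As a concrete sketch of this two-step recipe (algorithm 4 below gives the pseudocode version), the following minimal Python illustrates both steps, assuming only numpy; the particular rate function and bound used in the demonstration at the bottom are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(1)

    def simulate_poisson(lam, T):
        """Algorithm 3: emission times of a homogeneous Poisson process of rate lam on [0, T)."""
        times, t = [], 0.0
        while True:
            t += rng.exponential(1.0 / lam)    # inter-arrival times are Exp(lam)
            if t >= T:
                return np.array(times)
            times.append(t)

    def thin(times, rate_fn, lam_max):
        """Algorithm 4: keep each emission at time t_i with probability rate_fn(t_i) / lam_max."""
        keep = rng.uniform(size=len(times)) <= rate_fn(times) / lam_max
        return times[keep]

    # Arbitrary example: an inhomogeneous process with rate 2 + sin(t), bounded above by 3.
    rate_fn = lambda t: 2.0 + np.sin(t)
    emissions = thin(simulate_poisson(3.0, 100.0), rate_fn, 3.0)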


input : λ : [0, T] → [0, λ_max], a desired rate function for the resulting Poisson Process, and t, the emission times of a Poisson Process of rate no greater than λ_max, terminating before time T, indexed from 1 to n
output: t′, the emission times for a single realisation of an inhomogeneous Poisson Process with rate function λ
begin
    j ← 0
    for i ← 1 to n do
        r ~ U(0, 1)                 /* r takes on a uniformly distributed random value in [0,1] */
        if r ≤ λ(t_i)/λ_max then
            j ← j + 1
            t′_j ← t_i
        end
    end
    return (t′_1, ..., t′_j)
end

Algorithm 4: A Simulation Algorithm for the inhomogeneous Poisson Process, by thinning

Let N_1 and N_2 be Poisson processes with rate functions λ_1 and λ_2, and let N be their concatenation at time T_1 < T, so that N(t) = N_1(t) for t < T_1 and N(t) = N_1(T_1) + N_2(t − T_1) for t ≥ T_1, with λ the corresponding concatenation of the rate functions. First let t ∈ [0, T_1) and h > 0 be such that t + h < T_1. We then have that

P(N(t + h) − N(t) = 1) = P(N_1(t + h) − N_1(t) = 1)
                       = λ_1(t)h + o(h)
                       = λ(t)h + o(h)

So N is identical to a Poisson process of rate λ within the range [0, T_1). Now let t ∈ [T_1, T) and h > 0 be such that t + h < T. We have that

P(N(t + h) − N(t) = 1) = P(N_2(t − T_1 + h) + N_1(T_1) − N_2(t − T_1) − N_1(T_1) = 1)
                       = P(N_2(t − T_1 + h) − N_2(t − T_1) = 1)
                       = λ_2(t − T_1)h + o(h)
                       = λ(t)h + o(h)

So N is identical to a Poisson process of rate λ within the range [T_1, T). We need not deal with the case where t and t + h are either side of T_1, since we can always choose an h small enough that this is not the case - these probabilities assume that h is close to 0. So we have that N is distributed identically to a Poisson process of rate λ.

So concatenating two Poisson processes of rates λ_1 and λ_2 produces a new Poisson process of rate λ_1||λ_2, and it follows by simple induction that an MMPP can be thought of as the concatenation of multiple bounded-length homogeneous Poisson processes, with rates determined by the state of the underlying CTMC and time bounds determined by the length of time spent in each state of the underlying CTMC.


    Chapter 3

    Potential Modelling Techniques

Now that we've defined a selection of random processes, we can start discussing what might be appropriate to fit to our data. Before fitting data to a model, however, it is important to reassure ourselves that, given a realisation of a known process with known parameters, we can recover those parameters from the realisation to a reasonable degree of accuracy. The first step in any consideration will be to ensure that we have some method of recovery.

    3.1 Fitting a Non-Homogeneous Poisson Process

The simplest place to start would be a homogeneous Poisson process, though a cursory glance at our data in Figure 1.1 suggests that this would not be appropriate - the user is clearly tweeting at different rates at different times of day, and so is not homogeneous. Let's try a non-homogeneous Poisson process. We start by simulating a Poisson process of known rate function, and seeing how well we can recover it. Let the rate function be defined as:

λ(t) = 5   for 0 ≤ t < 30
       10  for 30 ≤ t < 50
       5   for 50 ≤ t < 100

The process was simulated for 100 hours, producing a trace such as the one displayed in Figure 3.1a. We can fit a step function by observing multiple traces of these Poisson processes, taking the differences in emission times, then attempting to cluster them with the k-means algorithm [18], as sketched below. Other approaches are possible, but because a step function can approximate an arbitrary function, and because it is easy to find implementations of k-means, these will suffice as an early heuristic.
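The report does not spell out exactly how the clustered gaps become the step estimate of Figure 3.1b, so the following Python sketch is one plausible reading of the heuristic, assuming SciPy's k-means implementation [18]: cluster the inter-emission gaps, then read a rate for each cluster off the reciprocal of its mean gap.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def step_rate_estimate(emission_times, k):
        """Cluster inter-arrival gaps into k groups and assign each gap the rate 1/centroid,
        giving a piecewise-constant rate estimate over time."""
        gaps = np.diff(emission_times)
        centroids, labels = kmeans2(gaps.reshape(-1, 1), k, minit='++')
        rates = 1.0 / centroids.ravel()           # an Exp(lambda) gap has mean 1/lambda
        return emission_times[1:], rates[labels]  # (time of each emission, estimated rate there)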

Figure 3.1: A trace of a Poisson process, and its estimated rate. Panel (a) shows the simulated Poisson process; panel (b) its estimated rate.


The average number of emissions per trace is the integral of the rate with respect to time over the time interval for which we observe, i.e. if N is the number of observed emissions in a single trace,

E(N) = ∫₀¹⁰⁰ λ(t) dt = 600

So if we run 6 traces we'll observe roughly 3,600 emissions, a little more than our real data set. It is possible to attempt to fit this function with less data, but more data will give a more accurate result. Fitting a function to these gives us the results in Figure 3.1b. Eyeballing this, we see that it is not a terrible fit for a first attempt, but if we try a more complex rate function, a situation similar to Figure 3.2 happens. The estimations are well off the mark. We could instead attempt to fit a polynomial function with maximum likelihood, least squares or similar, but this requires more parameters, and completely ignores bursts. These bursts occur at random times throughout the day, and last for random lengths of time, but they all take on the same form. This heavily restricts our rate function: we cannot simply say that at some particular time there is always a burst, but we also cannot deny their existence by smoothing them out, since these bursts, by their very nature, account for the majority of the observed tweets.

    Clearly, a different approach is needed.

    Figure 3.2: A slightly more complex rate function, whose estimation barely resembles the original


    3.2 Fitting a Markov-Modulated Poisson Process

An MMPP seems ideal, then, capturing the simplicity of a step function and letting us model the idea of randomly distributed bursts throughout a day. Indeed, several authors have postulated that such a model would be ideal for simulating such data [1][2][3], but there have so far been no quantifiable studies of its relevance. This could be for a number of reasons, partially due to the lack of algorithms, but also possibly because these studies focus on somewhat smaller data sets. The twitter user in this study is hugely active, yielding a vast quantity of data to fit, and making the true quality of the model all the clearer.

Fitting a Hidden Markov Model of any kind relies on two main algorithms, Baum-Welch [14] and Viterbi [16]. The Baum-Welch algorithm is an expectation-maximisation algorithm for estimating the transition probabilities/rates and the emission probabilities, given a set of possible emissions, a number of states to fit, and an observed sequence of emissions. The algorithm runs iteratively over the observed data, incrementally increasing the likelihood of the estimated model given the observations, and as such it needs an artificially defined stopping condition. In this project, we'll either use some fixed number of iterations, or stop when the log-likelihood increases by less than 10⁻⁶ between a pair of successive iterations, at which point we say that the model has converged sufficiently. Viterbi will take an observed sequence of emissions and the parameters of an HMM, usually those estimated by Baum-Welch, and produce the most likely state in which each of these emissions happened. The efficacy of these algorithms hinges on knowing the number of states, an issue which will be discussed later.

The HiddenMarkov package hosted on CRAN [8] is the only easily accessible Hidden Markov Model package which supports the Markov-Modulated Poisson Process, but it does not contain any implementation of the Viterbi algorithm, and no implementation of Viterbi for the MMPP can easily be found. We have two ways of working around this: either write our own, or discretise the process.

Since the times between emissions in an MMPP are usually exponentially distributed, we could work in discrete time by letting y_t be the time between the t-th and (t + 1)-th emissions, though this sacrifices some information. We no longer consider the possibility of making multiple transitions between emissions, and we ignore the intrinsic link between the times between emissions and the times between state transitions, but this model may also yield some useful results. We will try both of these approaches.

    3.2.1 A derivation of the classical Viterbi algorithm

The Viterbi algorithm for a standard Discrete Time Hidden Markov Model relies on a known, finite observation space Y, a known or estimated distribution p_s on Y for each state s, known or estimated transition probabilities γ_{i,j}, and known or estimated initial probabilities δ_s for each state s. The goal is, given a sequence of T observations (y_n)_{n∈[T]}, y_i ∈ Y, to find the sequence of states (x_n)_{n∈[T]} satisfying

x* = argmax_{x∈S^T} P(x|y)

We call x* the Viterbi Path.

Let V_{t,s} be the probability of the most probable state sequence responsible for the first t observations which ends in state s, that is

V_{t,s} = max_{x∈S^{t−1}} P((x_1, x_2, ..., x_{t−1}, s) | (y_1, ..., y_t))

We have that V_{1,s} is the probability of both being in state s at time 1 and seeing observation y_1 from state s. This gives us that

V_{1,s} = P(y_1 | x_1 = s) P(x_1 = s)

Recall that δ_s is the probability of being in state s at time 1, i.e. P(x_1 = s), and that p_s(y_1) is the probability of observing y_1 from state s, i.e. P(y_1 | x_1 = s). In practice, these won't be known, but estimates of them will be given by the Baum-Welch algorithm, so we will use their maximum-likelihood estimates to give us

V_{1,s} = p_s(y_1) δ_s


Given V_{ℓ,s} for ℓ < t, we can find V_{t,s} by noting that the Markov Property implies a form of memorylessness: given the present, the future is conditionally independent of the past. As such, we need only consider V_{t−1,s′} for each s′, as well as our known parameters.

The probability of the most likely path that leads us to state s at time t is given by the probability of the most likely path that led us to some state s′ at time t − 1, then jumped to s at time t, and then emitted y_t from state s. The probability of jumping from s′ to s is γ_{s′,s}. The probability of emitting y_t from state s is p_s(y_t). The probability of the most likely path that leads us to s′ at time t − 1 is V_{t−1,s′}. Hence,

V_{t,s} = p_s(y_t) max_{s′∈S} (γ_{s′,s} V_{t−1,s′})

Using this recurrence, we can find V_{t,s} for all t ∈ [T], s ∈ S by a standard dynamic programming algorithm. From here, we can then work backwards to find the Viterbi path: x_T = argmax_{s∈S} V_{T,s} - the most likely final state is the state in which the path of maximum probability ends.

Let

T_{t,s} = argmax_{s′∈S} (γ_{s′,s} V_{t−1,s′}),

i.e. T_{t,s} is the state from which we are most likely to have come at time t − 1, given that we are in state s at time t. We can then see that x_{t−1} = T_{t,x_t}. Note the similarities in the definitions of V and T - both can be calculated simultaneously - V is the maximum, T is the argument that maximises it. T_{1,s} is never used, so need never be defined.

Since we have an expression for x_{t−1} in terms of x_t, and an expression for x_T, we can then recover x*, the Viterbi Path. Algorithm 5 gives this in full.

Data: (S, Γ, δ, Y, p), a DTHMM
input : y, an observed sequence of T emissions, indexed from 1 to T
output: x, the most likely sequence of states generating these emissions
begin
    for s ∈ S do
        V_{1,s} ← p_s(y_1) δ_s
    end
    for t ← 2 to T do
        for s ∈ S do
            V_{t,s} ← p_s(y_t) max_{s′∈S}(γ_{s′,s} V_{t−1,s′})
            T_{t,s} ← argmax_{s′∈S}(γ_{s′,s} V_{t−1,s′})
        end
    end
    x_T ← argmax_{s∈S}(V_{T,s})
    for t ← T to 2 do
        x_{t−1} ← T_{t,x_t}
    end
    return x
end

Algorithm 5: The Viterbi Algorithm for DTHMMs
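A direct Python transcription of algorithm 5 is given below as a sketch, assuming numpy; emission_density here stands for whatever p_s happens to be for the model at hand. In practice one would work with logarithms to avoid underflow on long traces, but the structure is identical.

    import numpy as np

    def viterbi(delta, gamma, emission_density, y):
        """Algorithm 5 for a DTHMM.  delta: initial probabilities, gamma: transition matrix,
        emission_density(s, obs): probability (or density) of obs in state s, y: observations."""
        S, T = len(delta), len(y)
        V = np.zeros((T, S))                  # V[t, s]: probability of the best path ending in s at t
        back = np.zeros((T, S), dtype=int)    # back[t, s]: the T_{t,s} of the text
        V[0] = delta * np.array([emission_density(s, y[0]) for s in range(S)])
        for t in range(1, T):
            for s in range(S):
                cand = gamma[:, s] * V[t - 1]
                back[t, s] = np.argmax(cand)
                V[t, s] = emission_density(s, y[t]) * cand[back[t, s]]
        x = [int(np.argmax(V[-1]))]           # most likely final state
        for t in range(T - 1, 0, -1):         # trace the path backwards through the back-pointers
            x.append(int(back[t, x[-1]]))
        return x[::-1]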

    This algorithm is only valid in discrete time. For the continuous time MMPP, modifications are necessary.

3.2.2 A First Approximation of the Viterbi Algorithm for the Markov-Modulated Poisson Process

Recall the dependencies of the DTHMM Viterbi Algorithm: we require knowledge of a finite Y, of p_s for each s, and of γ_{ij} for each pair of states i, j.

In an MMPP, we observe a Poisson Process whose rate varies randomly between various known rates - the rates being our states. Let S = {λ_1, ..., λ_m}.


Our observations can be interpreted as exponential random variables of these rates. Let τ_0 = 0, and let τ_i be the time of the i-th Poisson emission for i ∈ [n]. Let y_i = τ_i − τ_{i−1} for i ∈ [n]. Given that the underlying CTMC was in state s at time τ_i, we have that y_i ~ Exp(λ_s). From the properties of the generic CTMC, the probability that the process is in state j at time τ_i, given that it was in state i at time τ_{i−1}, is given by (e^{Q y_i})_{i,j}. This gives a quasi-discretised model, where all jumps and emissions indeed happen in discrete time, but the transition probabilities acknowledge continuity. To ease notation, we'll write this simply as (e^{Q y_i})_{i,j}.

The state space S and transition rates Q are estimated by the Baum-Welch algorithm as before, so after running Baum-Welch over an observed trace, we can start to find the most likely state at each emission. Note that this Viterbi Path is not the most likely sequence of state transitions; it is instead the most likely state in which the underlying CTMC resides at the time of each emission.

Since our emissions are continuous, we don't have any notion of most probable - if we model height continuously, the probability that I meet someone exactly 1.8m tall is the same as the probability that I meet someone exactly 18m tall: they're both 0. So instead we'll base our likelihood calculations on probability density, capturing the idea that, even though I don't know for certain that I'll meet either of the two, it's more likely for me to meet the 1.8m tall person.

We let p_s(t) = λ_s e^{−λ_s t}, the probability density of an exponential random variable of rate λ_s evaluated at t. Let V_{t,s} be the probability density of the most likely path that leads us to emitting y_t from state s. We have that

V_{1,s} = δ_s p_s(y_1)

The probability density of the most likely path that leads us to waiting for time y_1 before making an emission is given by the probability of starting in state s, multiplied by the probability density of waiting y_1 for an emission from state s. The memoryless property of a CTMC allows us to consider only V_{t−1,s′} when calculating V_{t,s}. We have that

V_{t,s} = p_s(y_t) max_{s′∈S} (V_{t−1,s′} (e^{Q y_t})_{s′,s})

The probability density of the most likely path that leads us to waiting for time y_t between the (t − 1)-th and t-th emissions in state s is given by the probability density of the most likely path that takes us to state s′ for the (t − 1)-th emission, followed by jumping (along any arbitrary path) into state s for emission t, multiplied by the probability density of emitting y_t in state s.

From here, we can proceed as normal. We define T as before to record our most likely states at each transition, and work backwards to find x*, producing the algorithm given in full as algorithm 6.

The reason that this is an approximation is that the algorithm assumes that a jump either happens instantaneously, or not at all - we always evaluate p_s(y_t), rather than p_s(y_t − 𝔱)¹, where 𝔱 represents the time we wait for all the relevant transitions to occur. The times between state transitions are exponentially distributed, and we are dealing with the most likely outcomes. The most likely outcome of any exponential distribution is 0, so if the underlying CTMC jumps from one state to another, the most likely time for that jump to happen is immediately, so 𝔱 = 0. If multiple jumps happen, their individual times are exponentially distributed, but their sum does not have a mode of 0.

This first approximation is in fact very powerful. Approximating in this way assumes that multiple transitions between emissions are rare, or alternatively that emissions within each state are more frequent than transitions out of that state. The converse can be true: when the tweeter is asleep, his emission rate is near 0, but he will have a positive transition rate for when he wakes up, and in this case the algorithm will spot large periods of inactivity and estimate them as being the inactive state. If there are multiple states with low-rate emissions but high-rate transitions between them, then we would expect to see very few emissions occurring on a path through these states, so arguably the information on the exact route through these states doesn't exist, and cannot be recovered by any algorithm.

As a final note, we can further refine the fitted model based on the results of the Viterbi algorithm, by changing the estimated rate of each state to the observed rate of the emissions estimated to occur in that state. By the nature of these estimation algorithms, these results are likely to differ, and with a large sample size what we observe is usually closer to the truth than what we assert. In practice, if Baum-Welch gives a good estimation of the underlying MC, the difference between the two is minor.

¹𝔱 - tinco - represents the voiceless alveolar stop in the Tengwar alphabet, as devised by J.R.R. Tolkien. When dealing with time so frequently, we eventually run out of ways to write the letter t.


Data: (S, δ, Q, p), an MMPP
input : y, the absolute times of an observed sequence of T emissions, indexed from 1 to T
output: x, the most likely sequence of states in which the underlying CTMC resides for each emission in y
begin
    for t ← 2 to T do
        Δ_{t−1} ← y_t − y_{t−1}
    end
    T ← T − 1
    for s ∈ S do
        V_{1,s} ← p_s(Δ_1) δ_s
    end
    for t ← 2 to T do
        A ← e^{Q Δ_t}
        for s ∈ S do
            V_{t,s} ← p_s(Δ_t) max_{s′∈S}(A_{s′,s} V_{t−1,s′})
            T_{t,s} ← argmax_{s′∈S}(A_{s′,s} V_{t−1,s′})
        end
    end
    x_T ← argmax_{s∈S}(V_{T,s})
    for t ← T to 2 do
        x_{t−1} ← T_{t,x_t}
    end
    return x
end

Algorithm 6: An Approximate Viterbi Algorithm for MMPPs
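For completeness, a minimal Python sketch of algorithm 6 follows, assuming numpy and SciPy's matrix exponential; it illustrates the approximation above and is not the R code that was actually added to the HiddenMarkov package. It works in log space, which matters for a trace of several thousand emissions.

    import numpy as np
    from scipy.linalg import expm

    def mmpp_viterbi(delta, Q, lam, times):
        """Approximate MMPP Viterbi: delta initial probabilities, Q generator matrix,
        lam per-state Poisson rates, times the absolute emission times."""
        y = np.diff(times)                           # gaps between consecutive emissions
        S, T = len(delta), len(y)
        with np.errstate(divide='ignore'):           # log(0) = -inf is acceptable here
            logV = np.log(delta) + np.log(lam) - lam * y[0]   # log of delta_s * lam_s * exp(-lam_s * y_1)
            back = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                logA = np.log(expm(Q * y[t]))        # log transition probabilities over the gap y_t
                cand = logA + logV[:, None]          # cand[s_prev, s] = logV[s_prev] + log A[s_prev, s]
                back[t] = np.argmax(cand, axis=0)
                logV = np.log(lam) - lam * y[t] + cand[back[t], np.arange(S)]
        x = [int(np.argmax(logV))]
        for t in range(T - 1, 0, -1):                # trace back through the most likely predecessors
            x.append(int(back[t, x[-1]]))
        return x[::-1]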


    3.2.3 Applying the Algorithm

The algorithm was written in R and added to the pre-existing HiddenMarkov [8] package, which was then recompiled to be loaded into an R environment. A Python script was written to call into this library from a more convenient language using RPy2 [12], which also allowed mathematical, statistical and visualisation functions to be loaded in from matplotlib [9] and SciPy [13].

As before, we first simulate a model, then see if we can recover it, reassuring ourselves that the Viterbi implementation is correct. The simulated model had the following parameters:

(λ_1, λ_2, λ_3) = (0.01, 0.5, 2)

S = {1, 2, 3}

Q =
        1        2        3
 1   -1/20     1/60     1/30
 2    1/10    -2/15     1/30
 3    1/10     1/60    -7/60

δ = (1/3, 1/3, 1/3)

The expected course of an MMPP is somewhat harder to calculate, but 6,000 hours gave a similar number of emissions to the number of emissions in our data. The resulting trace can be seen in Figure 3.3.


Figure 3.3: A trace of an MMPP, with the states shaded in colour.²

Running the Baum-Welch algorithm over the trace with a known state size of 3 for 208 iterations, at which point the model had converged sufficiently, gives us the following predictions, rounded to 2 significant figures:

(λ_1, λ_2, λ_3) = (0.0066, 0.53, 2.00)

S = {1, 2, 3}

Q =
        1         2         3
 1   -0.061     0.022     0.039
 2    0.10     -0.14      0.053
 3    0.094     0.020    -0.11

δ = (1.00, 0.00, 0.00)

On inspection, most of these parameters give a reasonable approximation of the originals, but δ seems to be well off the mark. The reason for this is fairly simple - the simulated trace had to start somewhere, and in this case it started in state 1. The algorithm was only given one trace, so to minimise the probability of error, it estimated that the underlying CTMC always starts in the state in which it was estimated to start. The initial distribution of the underlying CTMC isn't hugely relevant, however. We're interested in how the user tweets in the long term, and regardless of its initial distribution this Markov Chain will reach an invariant distribution.
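That invariant distribution is the vector ψ satisfying ψQ = 0 with Σ_s ψ_s = 1, and once Q has been estimated it is cheap to compute; a minimal sketch (assuming numpy) replaces one of the redundant balance equations with the normalisation constraint:

    import numpy as np

    def invariant_distribution(Q):
        """Solve psi Q = 0 subject to sum(psi) = 1 for an irreducible CTMC generator Q."""
        S = Q.shape[0]
        A = np.vstack([Q.T[:-1], np.ones(S)])   # first S-1 balance equations plus normalisation
        b = np.zeros(S)
        b[-1] = 1.0
        return np.linalg.solve(A, b)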

Running the modified Viterbi algorithm over the process gives the estimate in Figure 3.4.

Figure 3.4: The same trace as Figure 3.3, but with the shading matching estimated states, rather than actual states.

²Originally, all the diagrams in this document were vector graphics, but diagrams like this are so intricate that their vector versions can crash PDF renderers.


Comparing the estimated path to the original in a way that clearly displays the results seems daunting, but we can do this in what I find a fairly beautiful way: simply align the two traces along the same scale, render them as bitmaps, then take the difference of the colours between the two images using GIMP [10, eqn 8.15]. Any black pixels are locations where the original two images match perfectly; all others are mistakes. Stripping the top row of colours will show mistakes over time. Figure 3.5 is a black-and-white thresholded [11] version of the preceding, where all mistakes appear as white lines. The image is 91.2% black, so my estimated model matches the simulation 91.2% of the time.
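When the true and estimated state sequences are available as arrays (as they are for a simulated trace), the same accuracy figure can also be read off directly, without the image step; a trivial check of this kind might look like:

    import numpy as np

    def state_match_fraction(true_states, estimated_states):
        """Fraction of emissions whose estimated state agrees with the simulated state."""
        return float(np.mean(np.asarray(true_states) == np.asarray(estimated_states)))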

Figure 3.5: A comparison between the estimated state sequences of Figure 3.4 (top) and 3.3 (middle). Their image difference is shown at the bottom, with adjustments for clarity.

Now that we've verified that our algorithms are capable of recovering known MMPPs to a high degree of accuracy, it's time to pump our data into them.

The data were preprocessed so that, rather than recording exact times and dates of tweets, they instead recorded the time since the beginning of the observation, in hours, at which each tweet occurred. This sequence was then fed into an MMPP data structure, over which Baum-Welch and Viterbi were run based on the assumption of 3 states. This assumption is arbitrary, but serves as a nice starting point to generate some early results - we'll address the problem of selecting the number of states later. The resulting model was as follows:

(λ_1, λ_2, λ_3) = (0.0289, 1.28, 15.5)

S = {1, 2, 3}

Q =
        1         2         3
 1   -0.123     0.120     0.004
 2    0.441    -1.09      0.648
 3    0.00      5.25     -5.25

δ = (1.00, 0.00, 0.00)

And, with the states shaded in the colours given above, the transitions were as represented in Figure 3.6.

Figure 3.6: The twitter data, with predicted states shaded - state 1 in red, state 2 in yellow, state 3 in green. Since state 3 is a burst state which, on average, only lasts for 12 minutes and contains several tweets, it can be difficult to see the green behind those tweets.

From this distance, we already see at least some kind of sensible behaviour - the user seems to have regular sleeping patterns, tweeting during the day and not tweeting at night-time. The three states correspond to


intuitive states for a human to occupy: in state 1 the user is asleep or otherwise away from the computer, tweeting on average once every 50 hours, but only staying there for 10 hours on average. In state 2, the user is awake, online, and going about his day as usual - as an active twitter user he tweets about 1.3 times per hour. On average, every hour, the user will then either go to bed with probability 0.4, or enter into a conversation with someone and produce a burst with probability 0.6. This already seems a little odd, and shows the limitation of using a homogeneous model: we seem to be suggesting here that the user goes to bed on average every 1.5 hours, which doesn't fit our notions of how humans actually behave. Bursts are tweets produced at a rate of 15 per hour, or one every 4 minutes, and they last around 12 minutes on average.

We can test whether this model gives an accurate representation of our data by performing a Kolmogorov-Smirnov (KS) [15] test on the data for each state. This test takes the maximum difference between the cumulative distribution functions of the two distributions being compared, which is then compared to a critical value from the Kolmogorov distribution. The test generally requires a large data set to give meaningful results, but we certainly have one here.

We take our null hypothesis to be that the emissions within each state follow an exponential distribution whose rate matches the observed rate. Running this test, we find, however, that they very much do not - the p-values for each state were less than 2.2 × 10⁻¹⁶; to say that this is low would be an understatement.
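Concretely, each per-state test of this kind can be run with SciPy [13]; the sketch below assumes the inter-arrival times have already been grouped by their Viterbi-estimated state, and compares one such group against an exponential distribution whose rate matches the observed mean gap.

    import numpy as np
    from scipy.stats import expon, kstest

    def ks_exponential(gaps_in_state):
        """KS test of one state's inter-arrival times against Exp(rate = 1/mean gap)."""
        gaps = np.asarray(gaps_in_state)
        scale = gaps.mean()                       # for an exponential, the MLE of 1/lambda
        return kstest(gaps, expon(scale=scale).cdf)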

Whilst 3 states makes sense intuitively, there's no reason why this is definitely the right number, so let's try different numbers of states. We can find the optimal number based on the Bayesian Information Criterion [17]. Given an estimated model, we define the Bayesian Information Criterion, BIC, as

BIC = −2 ln L + k ln n

where ln L is the log-likelihood of the fitted model, i.e. the natural logarithm of the probability of observing the data given that the model is correct, k is the number of parameters fitted by the model, and n is the number of data points used to fit the model. Faced with the choice of two different models, we select the one with the lower BIC.

An MMPP of |S| states has k = |S|² + |S| − 1 free parameters: Q contains |S|² elements, but each of the |S| diagonal elements is determined by the rest of the row on which it resides; we fit |S| rates, each of which is freely choosable; and δ has |S| elements, one of which is determined by the other |S| − 1. The number of data points is one less than the number of tweets we've observed, and the log-likelihood is returned by the Baum-Welch algorithm.
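In code, the model-selection loop then reduces to computing this quantity for each candidate number of states; a small helper, using the parameter count derived above, might look like:

    import numpy as np

    def mmpp_bic(log_likelihood, n_states, n_obs):
        """BIC = -2 ln L + k ln n, with k = |S|^2 + |S| - 1 free parameters for an MMPP."""
        k = n_states ** 2 + n_states - 1
        return -2.0 * log_likelihood + k * np.log(n_obs)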

500 iterations of Baum-Welch were run over the data for varying numbers of states, until the BIC stopped decreasing. This optimum occurred at 4 states, with a BIC of 1470, described as follows:

(λ_1, λ_2, λ_3, λ_4) = (0.0223, 0.511, 6.14, 28.3)

S = {1, 2, 3, 4}

Q =
        1         2         3         4
 1   -0.111     0.0950    0.0160    0.000
 2    0.262    -1.21      0.950     0.000
 3    0.287     2.94     -3.67      0.445
 4    0.000     0.000     5.56     -5.56

δ = (1.00, 0.00, 0.00, 0.00)

Fitted by the same methods, we see the results in Figure 3.7. Unfortunately, the Kolmogorov-Smirnov tests still fail - the greatest p-value for any state was 3.85 × 10⁻⁶. Fitting more states would probably give a better fit, but due to the quadratic scaling it would also start sending the number of parameters up to absurd levels - the Bayesian Information Criterion would not improve. This is the best MMPP we can fit, and it still isn't very good.

    3.2.4 An Alternative Approach

From this point, it is starting to seem like the tweeter does not follow an MMPP, but there is one last thing we can attempt in this vein. A Discrete Time Hidden Markov Model can still have a continuous observation


Figure 3.7: The twitter data, with predicted states shaded - state 1 in red, state 2 in yellow, state 3 in green and state 4 in cyan. As with state 3 in Figure 3.4, state 4's emissions can be difficult to see.

space, and all the previous methods will work, just with probability density in place of probability. Rather than observing a series of times, we will instead observe a series of time differences, and look for any small number of states that lets us gather these differences into the same exponential distribution. The results from such a model will be similar, but lose some information about how long our tweeter actually spends in each state. The crucial difference between the two is that in an MMPP emissions and transitions take place along the same timeline: compare the Viterbi algorithms for the MMPP and the DTHMM, and note that the transition probabilities between a DTHMM's states do not depend on the observed emissions, whilst the transition probabilities in the MMPP version do.

So we go through the same procedure again - the intuitive 3 states didn't result in anything worthwhile, so we apply the BIC-based method again to arrive at an optimum of 5 states and a BIC of 1446, resulting in a DTHMM with the following parameters:

S = {1, 2, 3, 4, 5}

(λ_1, λ_2, λ_3, λ_4, λ_5) = (0.108, 1.18, 4.27, 14.8, 32.4)

δ = (1, 0, 0, 0, 0)

Γ =
    0.295  0.380  0.000  0.312  0.013
    0.153  0.573  0.006  0.268  0.000
    0.051  0.052  0.579  0.156  0.161
    0.176  0.250  0.286  0.260  0.027
    0.007  0.000  0.196  0.012  0.785

Y = ℝ⁺

∀s ∈ S, ∀y ∈ Y:  p_s(y) = λ_s e^{−λ_s y}

This, when we add the Viterbi path to the plot, gives us Figure 3.8. We perform some new KS tests, to find one p-value of 0.03 and four others below 10⁻⁷. From this, we can definitively conclude that the data cannot be well clustered into a small number of exponentially distributed subsets. We can then conclude that these data do not follow any kind of simple Poisson process, in spite of appearances.

    3.3 A Diagnosis

Whilst this gives a fairly strong negative result - this tweeter is not a Poisson process - it doesn't give any real explanation as to why. What went wrong? Inspecting the fitted models, we see that the first few emissions don't quite fit in with the rest, but removing these anomalous results does little to the results of the Kolmogorov-Smirnov tests; they remain heavily negative in all cases.

Let's start looking at the emissions estimated to occur in each state. Since the DTHMM gave slightly better p-values, we'll use that as a jumping-off point.


Figure 3.8: The twitter data, with predicted states shaded - state 1 in red, state 2 in yellow, state 3 in green, state 4 in cyan and state 5 in black. Burst states are, as usual, very difficult to see.

And now, observing Figure 3.9, we see an issue. The fits are close, but the heads of the distributions are too light and their tails too heavy. What we really need is a distribution which will allow for these tails, such as the lognormal distribution. If X follows a lognormal distribution, then ln(X) follows a normal distribution [19].

Summing lognormal distributions into a Markov-Modulated Renewal Process, in the same way that we sum exponential distributions into a Poisson process, carries all manner of problems, however. The lognormal distribution is not memoryless, nor is its mode 0, so the resulting processes within each state would be somewhat harder to fit, meaning that simple fitting algorithms like Baum-Welch and Viterbi require much more sophisticated modifications to work correctly in continuous time.

The algorithms for fitting a discrete model to the data using a DTHMM with lognormal emissions are, however, completely unchanged, so let's do that. We take the natural logarithm of all the differences in emission times, fit multiple models, and evaluate their BICs. Here, each state requires 2 emission parameters, a mean and a standard deviation, so the number of free parameters in an s-state model is now s² + 2s − 1. We find that the optimal number of states is 3, and go ahead again with the KS tests, making minor corrections to the parameters for the estimated emissions within each state, resulting in p-values of 0.312, 0.156 and 0.420, with densities shown in Figure 3.10. At last, we have some statistically significant results, with the following parameters:
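Since the lognormal fit reduces to a Gaussian fit on the logged gaps, the per-state check and the BIC bookkeeping are both short; a minimal sketch, again assuming the gaps have already been grouped by Viterbi state:

    import numpy as np
    from scipy.stats import kstest, norm

    def ks_lognormal(gaps_in_state):
        """KS test of log(gaps) in one state against a normal fitted to those logs."""
        logs = np.log(np.asarray(gaps_in_state))
        mu, sigma = logs.mean(), logs.std(ddof=1)
        return kstest(logs, norm(loc=mu, scale=sigma).cdf)

    def lognormal_dthmm_bic(log_likelihood, n_states, n_obs):
        """BIC with k = s^2 + 2s - 1 free parameters (two emission parameters per state)."""
        k = n_states ** 2 + 2 * n_states - 1
        return -2.0 * log_likelihood + k * np.log(n_obs)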

S = {1, 2, 3}

(μ_1, μ_2, μ_3) = (3.33, 1.33, 2.34)
(σ_1, σ_2, σ_3) = (1.47, 1.86, 0.366)

δ = (0, 1, 0)

Γ =
    0.926  0.069  0.006
    0.047  0.906  0.047
    0.000  0.865  0.135

Y = ℝ

∀s ∈ S:  p_s(y) = exp(−(y − μ_s)² / (2σ_s²)) / (σ_s √(2π))

Given a sequence of emissions y from this DTHMM, the inter-arrival times are then t, with t_i = e^{y_i}. Perhaps a little strangely, shading the states onto our graph as in Figure 3.11 gives a less obvious seasonality to the tweeter's behaviour. We still have observable and highlighted bursts, as well as some regularity to the user's sleeping patterns, but they're less obvious here than with the MMPP. More work would certainly throw up more results - perhaps a continuous-time model is required for such things to be seen - but regardless, this is a positive result for a problem whose solution has so far only been speculated upon, and one which shows that those speculations are very likely to be wrong.


Figure 3.9: The estimated densities of emissions in each state (blue), alongside the actual density of an exponential random variable of rate equal to the rate of the observed emissions in that state.


Figure 3.10: The estimated densities of the logarithms of emissions in each state (blue), alongside the actual density of a normal random variable of mean and variance equal to those of the logarithms of the observed emissions in that state.

Figure 3.11: The state transitions estimated by Viterbi for a DTHMM with lognormal emissions - state 1 in red, 2 in yellow, 3 in green.


    Chapter 4

    Conclusion

We conclude, then, in a swarm of negative results, but with an implementation of a previously unwritten algorithm, and exactly one statistically valid result for an as yet unsolved problem. Further refinements can, of course, be made. Various path-finding algorithms could probably be adapted for a CTMC to give the most likely sequence of states between two known endpoints, which would allow us to find the modal transition time between states and create a more accurate MMPP Viterbi. We could even try for a Markov-Modulated Renewal Process, with inter-arrival times defined by some arbitrary distribution whose parameters are defined by the states of an underlying CTMC.

I hope that this project has shone a light into what was once a dark area, and that the resulting methods prove useful to others for detecting botnets in networks, spotting suspicious social network activity, or simply creating some rather beautiful diagrams. All code used for this project, as well as the LaTeX used to generate this document and SVGs of most of the graphics, is hosted on GitHub at https://github.com/Ymbirtt/maths_project. If you, for instance, had difficulty seeing some of the diagrams, or want more concrete information on exactly what happened in this project, I'd recommend finding it all there.


    Appendix A

    Fun with Integrals

Recall Figure 3.1a. It is a trace of an inhomogeneous Poisson process with the following rate:

λ(t) = 5   for 0 ≤ t < 30
       10  for 30 ≤ t < 50
       5   for 50 ≤ t < 100

Figure A.1 shows a Poisson process of rate λ, alongside the integral of λ. With only 600 samples, the two functions seem similar. Let N be an inhomogeneous Poisson process of rate λ : ℝ⁺ → ℝ⁺, let h > 0, and consider the following:

P(N(t + h) − N(t) = 1) = λ(t)h + o(h)
P(N(t + h) − N(t) > 1) = o(h)

E[N(t + h) − N(t)] = Σ_{i=1}^∞ i · P(N(t + h) − N(t) = i)
                   = λ(t)h + o(h)

lim_{h→0} E[N(t + h) − N(t)] / h = lim_{h→0} (λ(t)h + o(h)) / h = λ(t)

So if N is a Poisson process of rate λ, then it is also a function whose value we expect to increase at a rate of λ - meaning that, with enough samples, N would approximate an indefinite integral of the function λ. Currently, the standard method for evaluating difficult integrals computationally is to use the Monte-Carlo dartboard algorithm [21, §2], though this is only capable of evaluating a single, definite integral. If we had an approximation of the indefinite integral function, we could evaluate arbitrarily many definite integrals by just looking up values from the function. Figure A.2 shows an approximation of the integral of sin(x/5) + 1, taken by simulating 50 Poisson processes for a total of roughly 5,000 emissions, each emission incrementing the process by only 0.02, rather than 1. Here, we see a very close approximation of the true integral.

The dartboard algorithm usually uses tens of thousands of randomly generated points to evaluate a single definite integral.
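A minimal Python sketch of this estimator, assuming numpy: simulate a thinned Poisson process whose rate is the integrand divided by the per-emission increment, and let the running total of increments stand in for the indefinite integral. The 0.02 increment matches the figure quoted above; the integrand passed in at the bottom is simply an illustrative choice.

    import numpy as np

    rng = np.random.default_rng(2)

    def approx_indefinite_integral(f, f_max, T, increment=0.02):
        """Estimate F(t) = integral of f over [0, t]: run a Poisson process of rate f/increment
        (by thinning a rate f_max/increment process) and add `increment` per emission,
        so that E[count * increment] equals the integral."""
        rate_max = f_max / increment
        t, ts, F = 0.0, [0.0], [0.0]
        while True:
            t += rng.exponential(1.0 / rate_max)       # candidate emission at the bounding rate
            if t >= T:
                return np.array(ts), np.array(F)
            if rng.uniform() <= f(t) / f_max:          # thin down to the desired rate
                ts.append(t)
                F.append(F[-1] + increment)

    ts, F = approx_indefinite_integral(lambda x: np.sin(x / 5) + 1, 2.0, 100.0)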

Of course, the actual practicalities of such an algorithm and its relevance to higher-order integrals are yet to be confirmed, though it could be an interesting area for further study.


Figure A.1: A trace of an inhomogeneous Poisson process used to estimate an integral, alongside its true integral, plotted with 620 Poisson emissions.

Figure A.2: A trace of an inhomogeneous Poisson process used to estimate an integral, alongside its true integral, plotted with 5153 Poisson emissions.


    Bibliography

[1] Ihler, A., Hutchins, J., & Smyth, P. Learning to detect events with Markov-Modulated Poisson Processes. ACM Transactions on Knowledge Discovery from Data, 1(3), 13, 2007. http://dl.acm.org/citation.cfm?doid=1297332.1297337

[2] Malmgren, R.D., Stouffer, D.B., Motter, A.E., & Amaral, L.A.N. A Poissonian explanation for heavy tails in e-mail communication. PNAS, 105(47):18153-18158, 2008. http://www.pnas.org/content/105/47/18153

[3] Scott, S.L. & Smyth, P. The Markov modulated Poisson process and Markov Poisson cascade with applications to web traffic data. Bayesian Statistics 7, 671-680, 2003. http://dl.acm.org/citation.cfm?doid=1297332.1297337

[4] Doob, J.L. The Development of Rigor in Mathematical Probability (1900-1950). The American Mathematical Monthly, Vol. 103, No. 7 (Aug.-Sep. 1996), pp. 586-595. http://www.jstor.org/stable/2974673

[5] Weisstein, Eric W. Markov Chain. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/MarkovChain.html

[6] Weisstein, Eric W. Graph. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/Graph.html

[7] Lewis, P.A.W. & Shedler, G.S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics, 26: 403-413. http://onlinelibrary.wiley.com/doi/10.1002/nav.3800260304/abstract

[8] Harte, D. HiddenMarkov v1.7-0, a Hidden Markov Model library written in R and Fortran. http://cran.r-project.org/web/packages/HiddenMarkov/index.html

[9] Hunter, J.D. Matplotlib, a 2D graphics environment. Computing in Science & Engineering, Vol. 9, No. 3 (2007), pp. 90-95. http://matplotlib.org/

[10] The GIMP Development Team. The GIMP docs, the documentation for the GNU Image Manipulation Program. http://docs.gimp.org/en/gimp-concepts-layer-modes.html

[11] The GIMP Development Team. The GIMP docs, the documentation for the GNU Image Manipulation Program. http://docs.gimp.org/en/gimp-tool-threshold.html

[12] Moreira, W., Warnes, G.R., & Gautier, L. RPy v2-2.3, a Python to R interface. http://rpy.sourceforge.net/

[13] Jones, E., Oliphant, T., Peterson, P., & others. SciPy, open source scientific tools for Python, 2001-. http://www.scipy.org

[14] Baum, L.E., Petrie, T., Soules, G., & Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1), 164-171, 1970. http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoms/1177697196

[15] Weisstein, Eric W. Kolmogorov-Smirnov Test. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/Kolmogorov-SmirnovTest.html

[16] Forney, G.D., Jr. The Viterbi algorithm. Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, March 1973. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1450960&isnumber=31166

[17] Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics, 6, 461-464. http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176344136

[18] Jones, E., Oliphant, T., Peterson, P., & others. SciPy's k-means algorithm. http://docs.scipy.org/doc/scipy/reference/cluster.vq.html

[19] Weisstein, Eric W. Log Normal Distribution. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/LogNormalDistribution.html

[20] Yiying Lu. Lifting a Dreamer. http://www.yiyinglu.com/?portfolio=lifting-a-dreamer-aka-twitter-fail-whale

[21] Caflisch, R.E. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, vol. 7, Cambridge University Press, 1998, pp. 1-49. http://websrv.cs.fsu.edu/~mascagni/Caflisch_1998_Acta_Numerica.pdf