AN EM ALGORITHM FOR HAWKES PROCESS
Peter F. Halpin
New York University
December 17, 2012
Correspondence should be sent to Dr. Peter F. Halpin, 246 Greene Street, Office 316E, New York, NY 10003-6677. E-Mail: [email protected]. Phone: (212)-998-5197. Fax: (212)-995-4832. Webpage: https://files.nyu.edu/pfh3/public
Psychometrika Submission December 17, 2012 2
AN EM ALGORITHM FOR HAWKES PROCESS
Abstract
This manuscript addresses the EM algorithm developed in Halpin & De
Boeck (in press). The runtime of the algorithm grows quadratically in the number
of observations, making its application to large data sets impractical. A strategy
for improving efficiency is introduced, and this results in linear growth for many
applications. The performance of the modified algorithm is assessed using data
simulation.
Key words: Hawkes process; EM algorithm; maximum likelihood; runtime
Introduction
Halpin & De Boeck (in press) considered the time series analysis of bivariate event
data in the context of dyadic interaction. They proposed the use of point processes, and in
particular Hawkes process, as way to capture the temporal dependence between the actions
of two individuals. Estimation was based on the so-called branching structure
representation of Hawkes process, which they showed to be amenable to estimation via the
EM algorithm (see also Veen & Schoenberg, 2008). Unfortunately, the runtime of the
algorithm grows quadratically in the number of observations, making its application to large
data sets impractical. The present paper provides a modification of the original algorithm
that substantially improves its runtime. The modification reduces the number of
computations in the algorithm by tolerating a specified degree of rounding error, and this
results in linear growth for many applications.
The next section outlines Hawkes process in sufficient detail for this paper to be
self-contained and gives an intuitive description of the problem to be addressed. The
subsequent section presents the modification to the EM algorithm and illustrates some
cases where this yields linear growth. The final section uses data simulation to arrive at a
magnitude of rounding error that has a negligible effect on parameter recovery.
Hawkes Process
Under mild conditions, a point process can be uniquely defined in terms of its
conditional intensity function (CIF). The main reason for specifying a point process in
terms of its CIF is that this leads directly to an expression for its likelihood. A general
form for the CIF is
λ(t) = lim_{∆↓0} E( M{(t, t + ∆)} | H_t ) / ∆    (1)
where M{(a, b)} is a random counting measure representing the number of events (i.e.,
isolated points) falling in the interval (a, b), E(M{(a, b)}) is the expected value, and Ht is
the σ-algebra generated by the time points tk, k ∈ N, occurring before time t ∈ R+ (see
Daley & Vere-Jones, 2003). In this paper it is assumed that the probability of multiple
events occurring simultaneously is negligible, in which case M is said to be orderly. Then
for fixed t and sufficiently small values of ∆, λ(t)∆ is an approximation to the Bernoulli
probability of an event occurring in the interval (t, t+ ∆), conditional on all of the events
happening before time t. In applications, this means we are concerned with how the
probability of an event changes over continuous time as a function of previous events.
Point processes extend immediately to the multivariate case. M{(a, b)} is then
vector-valued and each univariate margin gives the number of a different type of event
occurring in the time period (a, b). Although Halpin and De Boeck (in press) considered a
bivariate model, this paper focuses on the univariate case since the problem to be
addressed can be most simply explained in that situation.
The CIF of Hawkes process can be specified as a linear causal filter:
λ(t) = µ + ∫_0^t φ(t − s) dM(s).    (2)
The interpretation of equation (2) is unpacked in the following three points.
1. µ > 0 is a baseline, which can be a function of time but is here treated as a constant.
2. φ(u) is a response function that governs how the process depends on its past.
Hawkes process requires the following three assumptions:
φ(u) ≥ 0 for u ≥ 0;    φ(u) = 0 for u < 0;    ∫_0^∞ φ(u) du ≤ 1.
Together these assumptions imply that
φ(u) = α× f(u; ξ) (3)
where 0 ≤ α ≤ 1 and f(u; ξ) is a probability density function on R+ with parameter ξ.
Equation (3) presents a convenient method for parametrizing φ, with some common choices
for f(u; ξ) being the exponential (e.g., Ogata, 1988; Truccolo, Eden, Fellows, Donoghue, &
Brown, 2005), the two-parameter gamma (Halpin & De Boeck, in press), and the power law
distribution (Barabasi, 2005; Crane & Sornette, 2008). Under this parameterization, α is
referred to as the intensity parameter and f(u; ξ) the response kernel.
3. In the case that M is orderly, dM(s) is representable as a series of right-shifted Dirac
delta functions, and the integral reduces to a sum over all events in [0, t], yielding

∫_0^t φ(t − s) dM(s) = Σ_{t_j < t} φ(t − t_j).    (4)
Thus each new time point is associated with a response function describing how that time
point affects the future of the process. Under the assumptions of Hawkes process, each new
time point increases the probability of further events occurring in the immediate future
(i.e., φ(u) is non-negative). The summation shows that the effect of multiple time points on
the probability of further events is cumulative. For these reasons, Hawkes process is often
referred to as self-exciting; the occurrence of one event increases the probability of further
events, whose occurrence in turn increases the probability of even more events. In terms of
applications this means that Hawkes process is appropriate for modelling clustering, which
occurs when periods of high event frequency are separated by periods of relative inactivity.
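As a concrete sketch of equations (2) through (4), the CIF with an exponential response kernel can be evaluated by direct summation over previous events. The parameter values below are illustrative only, not drawn from the paper:

```python
import math

def hawkes_cif(t, events, mu, alpha, beta):
    """Evaluate lambda(t) = mu + sum over t_j < t of alpha * f(t - t_j; beta),
    where f(u; beta) = beta * exp(-beta * u) is an exponential response kernel."""
    rate = mu
    for tj in events:
        if tj < t:
            rate += alpha * beta * math.exp(-beta * (t - tj))
    return rate

events = [1.0, 1.1, 1.2]                                     # a small burst of events
print(hawkes_cif(0.5, events, mu=0.1, alpha=0.4, beta=2.0))  # no prior events: baseline mu
print(hawkes_cif(1.3, events, mu=0.1, alpha=0.4, beta=2.0))  # after the burst: well above mu
```

The cumulative effect of the three nearby events illustrates the self-exciting behaviour described above.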
As noted, the CIF leads directly to an expression for the log-likelihood (see Daley &
Vere-Jones, 2003):
l(θ | X) = Σ_k ln(λ(t_k)) − ∫_0^T λ(s) ds    (5)
where [0, T ] is the observation period, X = t1, t2, . . . denotes the observed event times, and
θ contains the parameters of the model. Substitution of equations (2) through (4) into
equation (5) shows that the log-likelihood of Hawkes process contains the logarithm of a
weighted sum of density functions. A similar situation occurs in finite mixture modelling
(e.g., McLachlan & Peel, 2000) and nonlinear regression (e.g., Seber & Wild, 2003), where
it is known to lead to numerical optimization problems related to ill-conditioning of, and
multiple roots in, the likelihood function. In the present case the problem is aggravated by
the fact that the number of densities appearing in the likelihood increases with the number
of observations, as shown in equation (4). It is important to note that the number of
model parameters does not grow with the number of time points; the densities are simply
right-shifted. In general, if there are a total of n observed events, then there are a total of
n(n− 1)/2 response functions appearing in the log-likelihood of a univariate Hawkes
process, not including the "duplicated" response functions appearing in the integral. This is
the source of the quadratic growth of the optimization problem, which is the issue to be
dealt with in this paper.
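To make the quadratic growth concrete, the log-likelihood in equation (5) can be written out for an exponential response kernel; the nested loop below visits all n(n − 1)/2 event pairs. This is a minimal sketch, not the authors' implementation:

```python
import math

def hawkes_loglik(events, T, mu, alpha, beta):
    """Log-likelihood (5) of a univariate Hawkes process with exponential
    kernel f(u; beta) = beta * exp(-beta * u) on the observation window [0, T]."""
    ll = 0.0
    for k, tk in enumerate(events):
        rate = mu
        for tj in events[:k]:          # every earlier event: n(n - 1)/2 pairs in total
            rate += alpha * beta * math.exp(-beta * (tk - tj))
        ll += math.log(rate)
    # integral of lambda over [0, T]: closed form for the exponential kernel
    ll -= mu * T + sum(alpha * (1.0 - math.exp(-beta * (T - tj))) for tj in events)
    return ll
```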
The quadratic growth is especially problematic because the EM algorithm proposed by
Halpin and De Boeck (in press) requires the use of multiple starting values. This means
that even moderately sized data sets cannot be estimated in a reasonable amount of time.
For example, an actual runtime of over 24 hours was recorded for a problem with N ≈ 1500
events and 50 starting values (implemented in the C language on a machine with a 2 GHz
processor). Because one of the most exciting potential applications of Hawkes
process is to “big” data collected via computer-mediated communication (e.g., email
databases, twitter), it is important to have an estimation approach that is feasible for large
samples. The following section outlines how that can be accomplished.
Reducing Runtime by Introducing Rounding Error
This section outlines the original EM algorithm suggested by Halpin and De Boeck (in
press) and then considers how to reduce its runtime. The algorithm is based on an alternative
representation of Hawkes process, which is referred to as its branching structure. In terms
of the EM algorithm, the branching structure provides the complete data representation of
the model, whereas the causal filter in equation (2) is the incomplete data representation.
Taking this approach, the logarithm of the sum of densities in equation (5) is replaced by
the sum of their logarithms, which results in better conditioning of the numerical
optimization problem and was shown to perform satisfactorily with relatively small data
sets (N ≈ 400). Although the considerations of this section could also be made for
equation (5), the focus is on the EM approach.
The branching structure representation of Hawkes process is in terms of a cluster
Poisson process. It was first proposed by Hawkes and Oakes (1974), who proved it to be
equivalent to the representation given in the foregoing section. Their argument was very
technical and it served to establish the existence and uniqueness of the process. The
branching structure has also found more intuitive applications. For example, in ecology it
is used to describe the growth of wildlife populations in terms of subsequent generations of
offspring due to each immigrant (e.g., Rasmussen, 2011). In the context of disease control,
it is interpreted as the number of people contaminated by each subsequent carrier (e.g.,
Daley & Vere-Jones, 2003). Veen and Schoenberg (2008) were the first to consider the
branching structure as a strategy for obtaining maximum likelihood estimates (MLEs) of
Hawkes process.
For the present purpose, the effect of the branching structure is to decompose Hawkes
process into n independent Poisson processes whose rate functions are given by the response
functions in equation (3). These processes govern the number of “offspring” of each event.
There is also an additional Poisson process governing the number of “immigrant” events;
this process has a rate function given by the baseline parameter µ. Importantly, each event
tk is assumed to be due to one and only one of these independent Poisson processes: either
one centered at its parent, tj, with tj < tk, or the baseline process. Consequently, if we
knew which process each event belonged to, estimation of Hawkes process would reduce to
that for a collection of independent Poisson processes. It is therefore natural to introduce a
missing variable that describes the specific process to which each event tk belongs, and
proceed by means of the EM algorithm. As with other applications of the EM algorithm,
the missing data need not correspond to the hypothesized data generating process; it can
be treated merely as a tool for obtaining MLEs.
The following notation is employed to set up the algorithm. Let Z = (Z1, Z2, · · · , Zn)
denote the missing data. If an event tk is an offspring of event tj, tj < tk, this is denoted by
setting Zk = j. If an event tk is an immigrant then Zk = 0. Also let φj(u) denote the
response functions governing each Poisson process, where it is understood that φ0(u) = µ.
For j > 0, these response functions are identical to those introduced in equation (3) above,
except the subscript serves to make explicit the centering event tj.
Letting l(θ | X,Z) denote the complete data log-likelihood, Halpin and De Boeck (in
press) showed that
Q(θ) = E_{Z|X,θ} l(θ | X, Z)
     = Σ_{j=0}^n ( Σ_{k>j} ln(φ_j(t_k − t_j)) × Prob(Z_k = j | X, θ) − ∫_0^T φ_j(s − t_j) ds )    (6)
where
Prob(Z_k = j | X, θ) = φ_j(t_k − t_j) / Σ_{r<k} φ_r(t_k − t_r).    (7)
Equations (6) and (7) provide the necessary components of an EM algorithm for Hawkes
process. Equation (7) is readily computed on the E step. On the M step these probabilities
are treated as fixed and entered into equation (6). Using this approach, Halpin and De
Boeck (in press) provided closed form solutions for the baseline parameter µ and the
intensity parameter α. However, in order to obtain the parameters of the response kernel,
it is necessary to numerically optimize the Q function. This is the computationally
expensive part of the algorithm.
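For an exponential response kernel, the E step in equation (7) reduces to normalizing the candidate parent contributions for each event, with φ_0(u) = µ for the baseline process. A minimal sketch under these assumptions (parameter names are illustrative, not the authors' code):

```python
import math

def e_step(events, mu, alpha, beta):
    """Compute Prob(Z_k = j | X, theta) of equation (7) for each event t_k.
    Row k lists the probability of the baseline process (index 0) followed by
    the probabilities of each earlier event t_j being the parent."""
    probs = []
    for k, tk in enumerate(events):
        contrib = [mu]  # phi_0(u) = mu: the immigrant (baseline) process
        contrib += [alpha * beta * math.exp(-beta * (tk - tj)) for tj in events[:k]]
        total = sum(contrib)
        probs.append([c / total for c in contrib])
    return probs
```

Each row sums to one, reflecting that every event is attributed to exactly one of the independent Poisson processes.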
Since the sum over k > j is the source of the quadratic growth of the Q function, let’s
first consider how this can be reduced. Recall that for j > 0, φ_j(u) = α × f(u; ξ) is just a
weighted density on R+. For usual choices of the response kernel, f(u; ξ) → 0 as u becomes
large (i.e., response functions typically have a right tail that asymptotes at zero).
Intuitively, this means that when tk − tj is large, the contribution of φj(tk − tj) to equation
(6) will be negligible.
In order to make this idea more formal, consider the sets

W_j = {k : k > j, f(t_k − t_j; ξ) ≥ w}

and let |W| denote the average of the cardinalities of the W_j. Replacing the sum over k > j
with the sum over k ∈ W_j in equation (6) results in |W| × n densities appearing in the
double summation. This substitution will be referred to as the modified Q function and
denoted Q̃. |W| is the linear growth factor of Q̃. The relative efficiency of Q̃ over Q is

R = (|W| × n) / (n(n − 1)/2) = 2|W| / (n − 1).
The value of |W | depends on (a) ξ, which is updated throughout the optimization
process, (b) w, which can be determined by the researcher, and (c) the actual observations
tk, which are fixed. This makes it difficult to obtain analytical results on |W|. However,
Table 1 provides evidence that it does not grow with n and it can be much smaller than
(n− 1)/2.
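The behaviour of |W| is easy to check empirically for a given kernel. The sketch below counts, for each event t_j, the later events whose kernel value is at least w, using an exponential kernel and regularly spaced events as a hypothetical example:

```python
import math

def mean_kept_terms(events, beta, w):
    """Average cardinality |W| of the sets W_j: for each t_j, count the later
    events t_k whose kernel value f(t_k - t_j; beta) is at least w."""
    sizes = []
    for j, tj in enumerate(events):
        kept = [tk for tk in events[j + 1:]
                if beta * math.exp(-beta * (tk - tj)) >= w]
        sizes.append(len(kept))
    return sum(sizes) / len(sizes)

events = [float(i) for i in range(100)]            # n = 100 regularly spaced events
print(mean_kept_terms(events, beta=2.0, w=1e-10))  # far smaller than (n - 1)/2 = 49.5
```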
=========================
Insert Table 1 about here
=========================
The table was produced by simulating data using the inverse method (see Daley &
Vere-Jones, 2003). The causal filter in equation (2) was used for simulation, not the
branching structure. Three different sample sizes (N = 500, 1500, and 5000) were simulated
from each of three different models. Model 1 and Model 2 used exponential response
functions, with Model 1 having moderate intensity (α = .4) and Model 2 having high
intensity (α = .8). This means that the data from Model 2 showed a much higher degree of
clustering (i.e., a larger number of events occurring in close proximity to one another).
Model 3 is also high intensity (α = .8) but used a two-parameter gamma kernel with shape
parameter set to .5. The result is heavier-tailed response functions, which have been
reported in various applications to human communication data (e.g., Barabasi, 2005; Crane
& Sornette, 2008; Halpin & De Boeck, in press). The choices of intensity parameter are
intended to reflect its possible range rather than realistic values; I have not seen intensity
estimates greater than .5 in real data applications. For each simulated data set, Q̃ was
computed using the true parameter values and w = 1e-10.
The main point to be taken from Table 1 is that the values of |W| did not increase
with n, and therefore the rate of growth of Q̃ was linear. The exact rate of linear growth
depended on the parameters of the data generating model, with more clustered data
showing faster growth. However, even at extraordinarily high intensities and even at the
smallest sample size, the growth rate was much smaller than (n − 1)/2. Based on these
results, it is reasonable to conclude that Q̃ is more efficient to compute than Q. It should be
emphasized that this depends on the type of response kernel; the approach outlined here
will not work unless the response kernel has a right tail that asymptotes at zero.
Table 1 does not address how the rounding error w affects the MLEs produced by the
EM algorithm. That is the topic of the next section. Although this section has only
focussed on the computation of the Q function, entirely similar remarks can be made about
the computation of equation (7) on the E step, and about the computation of equation (5).
Effect of Rounding Error on the EM algorithm
This section considers how the rounding error w affects convergence and parameter
recovery. Data were again simulated using the inverse method with the incomplete data
model (equation (2)). The data-generating model used a two-parameter gamma density as
the response kernel. The parameters of the data generating model are stated in Table 3
and were based on the real data example reported in Halpin and De Boeck (in press).
N = 250 data sets of size n = 500 time points were generated from the model. For each
data set, the EM algorithm described in Halpin and De Boeck (in press) was implemented
using Q̃ in place of Q. The starting values for the estimation algorithm were obtained by
randomly disturbing the data generating values, which avoided the need for multiple
starting values. Convergence was evaluated using the incomplete data log-likelihood
(equation (5)). The convergence criterion was an absolute difference of less than 1e-5 on
subsequent M steps.
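The convergence rule can be expressed as a generic driver around the E and M steps; the update and log-likelihood functions below are placeholders, not the authors' code:

```python
def run_em(loglik, em_update, theta0, tol=1e-5, max_iter=500):
    """Iterate an EM update until the (incomplete data) log-likelihood changes
    by less than tol between successive M steps."""
    theta = theta0
    ll_old = loglik(theta)
    for _ in range(max_iter):
        theta = em_update(theta)    # one combined E + M step
        ll_new = loglik(theta)
        if abs(ll_new - ll_old) < tol:
            break
        ll_old = ll_new
    return theta
```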
The simulation compared the rounding errors w = 0, 1e-10, 1e-5, 1e-3. Because a
rounding error of 0 is not possible in practice, this was implemented using w = 2.22e-16,
the double precision machine epsilon on most modern computers.
Therefore the value w = 0 represents the amount of error that is intrinsic to the specific
realization of the estimation process (i.e., with the given sample size, convergence criterion,
etc). The remaining values of w represent the introduction of rounding error for
computation efficiency.
Let’s first consider the role of rounding error in the convergence of the algorithm.
Figure 1 shows the relationship between the log-likelihoods evaluated at the MLEs and the
log-likelihoods evaluated at the data generating parameters. The relation is quite similar
for the three smallest values of w, but is appreciably worse for the largest value. It is
important to note that even for w = 0, the relationship is not perfect. The amount of
additional error introduced by the two middle values of w is not perceptible in the figure.
=========================
Insert Figure 1 about here
=========================
Table 2 provides a closer look at the log-likelihoods. It reports the mean and standard
deviation for the differences between the log-likelihoods of the estimated models and the
log-likelihoods computed using the true values. The table entries are reported as
percentages of the difference between the log-likelihoods of w = 0 and of the true values
(i.e., as percentages of the intrinsic estimation error). If w > 0 did not affect the
convergence of the EM algorithm, all values in the table would be 100. Based on the table
we can conclude that all values of w > 0 introduced additional error into the convergence
of the EM algorithm. For w = 1e-10 this was less than .1 percent of the intrinsic estimation
error.
=========================
Insert Table 2 about here
=========================
Turning now to address parameter recovery, Table 3 reports the bias and error of the
MLEs for each level of rounding error. The entries are reported as percentages of the data
generating parameters. It can be seen that bias and error were very similar for the lowest
two values of w, but for larger values of w there is increased bias and reduced error. Figure
2 shows the distribution of estimates of the gamma response kernels for w = 0 and
w = 1e-10.
=========================
Insert Table 3 about here
=========================
=========================
Insert Figure 2 about here
=========================
Based on this simulation it may be concluded that there is little to distinguish the
results obtained using a rounding error of w = 1e-10 from the intrinsic error in the
algorithm (i.e., w = 0). On the other hand, w ≥ 1e-5 has a relatively large influence both
on the convergence of the algorithm and on the bias and error of the resulting parameter
estimates.
Conclusions
The number of computations required by the EM algorithm proposed by Halpin and
De Boeck (in press) grows quadratically in the number of observed events, making its
application to large data sets infeasible. This paper has shown that the runtime of the
algorithm can be reduced by introducing rounding error into the computation of the Q
function (i.e. the objective function of the M step of the EM algorithm). In three
applications involving response functions with right tails asymptoting at zero, this was
shown to result in linear growth. The consequences for convergence of the algorithm and
parameter recovery were also considered. A rounding error of 1e− 10 was found to have
negligible effects, but larger values did not. While more research can be done to optimize
the rounding error for specific applications of the algorithm, it can be concluded that the
approach presented here provides an acceptable compromise between runtime and
computational accuracy.
References
Barabasi, A. L. (2005). The origin of bursts and heavy tails in human dynamics. Nature,
435 , 207–211.
Crane, R., & Sornette, D. (2008). Robust dynamic classes revealed by measuring the
response function of a social system. Proceedings of the National Academy of
Sciences , 105 , 15649–15653.
Daley, D. J., & Vere-Jones, D. (2003). An introduction to the theory of point processes:
Elementary theory and methods (Second ed., Vol. 1). New York: Springer.
Halpin, P. F., & De Boeck, P. (in press). Modeling dyadic interaction using Hawkes
process. Psychometrika.
Hawkes, A. G., & Oakes, D. (1974). A cluster representation of a self-exciting process.
Journal of Applied Probability , 11 , 493–503.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: John Wiley and
Sons.
Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for
point processes. Journal of the American Statistical Association, 83 , 9-27.
Rasmussen, J. G. (2011). Bayesian inference for Hawkes’ processes. Methodology and
Computing in Applied Probability , DOI: 10.1007/s11009-011-9272-5.
Seber, G. A. F., & Wild, C. J. (2003). Non-linear regression (2nd ed.). Hoboken, NJ: John
Wiley & Sons.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A
point process framework for relating neural spiking activity to spiking history, neural
ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93,
1074–1089.
Veen, A., & Schoenberg, F. P. (2008). Estimation of space-time branching process models
in seismology using an EM-type algorithm. Journal of the American Statistical
Association, 103 , 614–624.
Tables
Table 1. Growth of the Modified Q Function in Number of Time Points (Simulated Data)

            n = 500    n = 1500    n = 5000
Model 1      6.665      6.589       6.632
Model 2     19.870     23.609      25.083
Model 3     28.226     24.0133     23.567

Note: n is the number of simulated time points and the table entries are the linear growth
factor, |W|, of the modified Q function, Q̃, computed using the true parameter values.
|W| × n gives the number of computations required for Q̃, and 2|W|/(n − 1) gives the
efficiency of Q̃ relative to the original Q function proposed by Halpin and De Boeck (in
press). The models are described in the text.
Table 2. Effect of Rounding Error on Log-likelihoods (Simulated Data)

         w = 0    w = 1e-10    w = 1e-5    w = 1e-3
Mean      100      99.966      113.163     652.114
SD        100     100.014       99.268     284.921
Note: Table entries are means (M) and standard deviations (SD) of differences
between log-likelihoods of the estimated models and the log likelihoods computed using the
true values. The means and standard deviations are reported as percentages of the values
for w = 0 (i.e., percentages of the intrinsic estimation error). The MLEs were obtained
using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q
function and the indicated levels of rounding error, w.
Table 3. Effect of Rounding Error on Parameter Recovery (Simulated Data)

              µ          α          κ           β
True values   .1        .45        .6          10
w = 0          2.289    -2.662      2.782      -0.4757
             (12.707)  (14.282)   (11.812)    (49.986)
w = 1e-10      2.266    -2.634      2.775      -0.2537
             (12.725)  (14.315)   (11.824)    (50.664)
w = 1e-5       4.458    -5.475      5.7890    -19.507
             (10.857)  (11.592)   (11.215)    (22.937)
w = 1e-3      29.083   -35.617     28.390     -78.925
              (9.969)   (8.786)   (17.114)     (3.618)
Note: Table entries are bias (error) of maximum likelihood estimates (MLEs) as
percentages of the true values. µ denotes the baseline parameter of Hawkes process, α the
intensity parameter, κ the shape parameter of the two-parameter gamma response kernel,
and β its scale parameter. MLEs were obtained using the EM algorithm described by
Halpin and De Boeck (in press) with the modified Q function and the indicated levels of
rounding error, w.
[Figure 1: Four scatterplots of the log-likelihoods at the MLEs (horizontal axis) against the log-likelihoods at the data generating values (vertical axis), one panel per rounding error: w = 0 (r = .998), w = 1e-10 (r = .998), w = 1e-5 (r = .998), w = 1e-3 (r = .985).]

Figure 1. Relation of log-likelihoods at convergence with log-likelihoods computed using the data generating values (simulated data). The model was estimated using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.
[Figure 2: Histograms of the MLEs of the two-parameter gamma kernel's shape and scale parameters, one panel per combination of parameter and rounding error (w = 0 and w = 1e-10); shape estimates range over roughly .5 to .9 and scale estimates over roughly 5 to 30.]

Figure 2. Histograms of maximum likelihood estimates (MLEs) of the two-parameter gamma density kernel (simulated data). The bold vertical line indicates the value of the data generating parameters. MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.