AN EM ALGORITHM FOR HAWKES PROCESS
Peter F. Halpin
New York University
December 17, 2012
Correspondence should be sent to Dr. Peter F. Halpin, 246 Greene Street, Office 316E, New York, NY 10003-6677. E-Mail: [email protected]. Phone: (212)-998-5197. Fax: (212)-995-4832. Webpage: https://files.nyu.edu/pfh3/public
Psychometrika Submission December 17, 2012 2
AN EM ALGORITHM FOR HAWKES PROCESS
Abstract
This manuscript addresses the EM algorithm developed in Halpin & De
Boeck (in press). The runtime of the algorithm grows quadratically in the number
of observations, making its application to large data sets impractical. A strategy
for improving efficiency is introduced, and this results in linear growth for many
applications. The performance of the modified algorithm is assessed using data
simulation.
Key words: Hawkes process; EM algorithm; maximum likelihood; runtime
Introduction
Halpin & De Boeck (in press) considered the time series analysis of bivariate event
data in the context of dyadic interaction. They proposed the use of point processes, and in
particular Hawkes process, as way to capture the temporal dependence between the actions
of two individuals. Estimation was based on the so-called branching structure
representation of Hawkes process, which they showed to be amenable to estimation via the
EM algorithm (see also Veen & Schoenberg, 2008). Unfortunately, the runtime of the
algorithm grows quadratically in the number of observations, making its application to large
data sets impractical. The present paper provides a modification of the original algorithm
that substantially improves its runtime. The modification reduces the number of
computations in the algorithm by tolerating a specified degree of rounding error, and this
results in linear growth for many applications.
The next section outlines Hawkes process in sufficient detail for this paper to be
self-contained and gives an intuitive description of the problem to be addressed. The
subsequent section presents the modification to the EM algorithm and illustrates some
cases where this yields linear growth. The final section uses data simulation to arrive at a
magnitude of rounding error that has a negligible effect on parameter recovery.
Hawkes Process
Under mild conditions, a point process can be uniquely defined in terms of its
conditional intensity function (CIF). The main reason for specifying a point process in
terms of its CIF is that this leads directly to an expression for its likelihood. A general
form for the CIF is
λ(t) = lim_{∆↓0} E( M{(t, t + ∆)} | H_t ) / ∆    (1)
where M{(a, b)} is a random counting measure representing the number of events (i.e.,
isolated points) falling in the interval (a, b), E(M{(a, b)}) is the expected value, and Ht is
the σ-algebra generated by the time points tk, k ∈ N, occurring before time t ∈ R+ (see
Daley & Vere-Jones, 2003). In this paper it is assumed that the probability of multiple
events occurring simultaneously is negligible, in which case M is said to be orderly. Then
for fixed t and sufficiently small values of ∆, λ(t)∆ is an approximation to the Bernoulli
probability of an event occurring in the interval (t, t+ ∆), conditional on all of the events
happening before time t. In applications, this means we are concerned with how the
probability of an event changes over continuous time as a function of previous events.
Point processes extend immediately to the multivariate case. M{(a, b)} is then
vector-valued and each univariate margin gives the number of a different type of event
occurring in the time period (a, b). Although Halpin and De Boeck (in press) considered a
bivariate model, this paper focuses on the univariate case since the problem to be
addressed can be most simply explained in that situation.
The CIF of Hawkes process can be specified as a linear causal filter:
λ(t) = µ + ∫_0^t φ(t − s) dM(s).    (2)
The interpretation of equation (2) is unpacked in the following three points.
1. µ > 0 is a baseline, which can be a function of time but is here treated as a constant.
2. φ(u) is a response function that governs how the process depends on its past.
Hawkes process requires the following three assumptions:
φ(u) ≥ 0 for u ≥ 0;    φ(u) = 0 for u < 0;    ∫_0^∞ φ(u) du ≤ 1.
Together these assumptions imply that
φ(u) = α× f(u; ξ) (3)
where 0 ≤ α ≤ 1 and f(u; ξ) is a probability density function on R+ with parameter ξ.
Equation (3) presents a convenient method for parametrizing φ, with some common choices
for f(u; ξ) being the exponential (e.g., Ogata, 1988; Truccolo, Eden, Fellows, Donoghue, &
Brown, 2005), the two-parameter gamma (Halpin & De Boeck, in press), and the power law
distribution (Barabasi, 2005; Crane & Sornette, 2008). Under this parameterization, α is
referred to as the intensity parameter and f(u; ξ) the response kernel.
3. In the case that M is orderly, dM(s) is representable as a series of right-shifted Dirac
delta functions, and the integral reduces to a sum over all events in [0, t], yielding

∫_0^t φ(t − s) dM(s) = Σ_{t_j < t} φ(t − t_j).    (4)
Thus each new time point is associated with a response function describing how that time
point affects the future of the process. Under the assumptions of Hawkes process, each new
time point increases the probability of further events occurring in the immediate future
(i.e., φ(u) is non-negative). The summation shows that the effect of multiple time points on
the probability of further events is cumulative. For these reasons, Hawkes process is often
referred to as self-exciting; the occurrence of one event increases the probability of further
events, whose occurrence in turn increases the probability of even more events. In terms of
applications this means that Hawkes process is appropriate for modelling clustering, which
occurs when periods of high event frequency are separated by periods of relative inactivity.
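As a concrete sketch of equations (2) through (4), the CIF with an exponential response kernel can be evaluated by direct summation over previous events. The parameter values below are illustrative only, not drawn from the paper:

```python
import math

def hawkes_cif(t, events, mu, alpha, beta):
    """Evaluate lambda(t) = mu + sum over t_j < t of alpha * f(t - t_j; beta),
    where f(u; beta) = beta * exp(-beta * u) is an exponential response kernel."""
    rate = mu
    for tj in events:
        if tj < t:
            rate += alpha * beta * math.exp(-beta * (t - tj))
    return rate

events = [1.0, 1.1, 1.2]                                     # a small burst of events
print(hawkes_cif(0.5, events, mu=0.1, alpha=0.4, beta=2.0))  # no prior events: baseline mu
print(hawkes_cif(1.3, events, mu=0.1, alpha=0.4, beta=2.0))  # after the burst: well above mu
```

The cumulative effect of the three nearby events illustrates the self-exciting behaviour described above.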
As noted, the CIF leads directly to an expression for the log-likelihood (see Daley &
Vere-Jones, 2003):
l(θ | X) = Σ_k ln(λ(t_k)) − ∫_0^T λ(s) ds    (5)
where [0, T ] is the observation period, X = t1, t2, . . . denotes the observed event times, and
θ contains the parameters of the model. Substitution of equations (2) through (4) into
equation (5) shows that the log-likelihood of Hawkes process contains the logarithm of a
weighted sum of density functions. A similar situation occurs in finite mixture modelling
(e.g., McLachlan & Peel, 2000) and nonlinear regression (e.g., Seber & Wild, 2003), where
it is known to lead to numerical optimization problems related to ill-conditioning of, and
multiple roots in, the likelihood function. In the present case the problem is aggravated by
the fact that the number of densities appearing in the likelihood increases with the number
of observations, as shown in equation (4). It is important to note that the number of
model parameters does not grow with the number of time points; the densities are simply
right-shifted. In general, if there are a total of n observed events, then there are a total of
n(n− 1)/2 response functions appearing in the log-likelihood of a univariate Hawkes
process, not including the "duplicated" response functions appearing in the integral. This is
the source of the quadratic growth of the optimization problem, which is the issue to be
dealt with in this paper.
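To make the quadratic growth concrete, the log-likelihood in equation (5) can be written out for an exponential response kernel; the nested loop below visits all n(n − 1)/2 event pairs. This is a minimal sketch, not the authors' implementation:

```python
import math

def hawkes_loglik(events, T, mu, alpha, beta):
    """Log-likelihood (5) of a univariate Hawkes process with exponential
    kernel f(u; beta) = beta * exp(-beta * u) on the observation window [0, T]."""
    ll = 0.0
    for k, tk in enumerate(events):
        rate = mu
        for tj in events[:k]:          # every earlier event: n(n - 1)/2 pairs in total
            rate += alpha * beta * math.exp(-beta * (tk - tj))
        ll += math.log(rate)
    # integral of lambda over [0, T]: closed form for the exponential kernel
    ll -= mu * T + sum(alpha * (1.0 - math.exp(-beta * (T - tj))) for tj in events)
    return ll
```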
The quadratic growth is especially problematic because the EM algorithm proposed by
Halpin and De Boeck (in press) requires the use of multiple starting values. This means
that even moderately sized data sets cannot be estimated in a reasonable amount of time.
For example, an actual runtime of over 24 hours was recorded for a problem with N ≈ 1500
events and 50 starting values (implemented in the C language on a machine with a 2 GHz
processor). Because one of the most exciting potential applications of Hawkes
process is to “big” data collected via computer-mediated communication (e.g., email
databases, twitter), it is important to have an estimation approach that is feasible for large
samples. The following section outlines how that can be accomplished.
Reducing Runtime by Introducing Rounding Error
This section outlines the original EM algorithm suggested by Halpin and De Boeck (in
press) and then considers how to reduce its runtime. The algorithm is based on an alternative
representation of Hawkes process, which is referred to as its branching structure. In terms
of the EM algorithm, the branching structure provides the complete data representation of
the model, whereas the causal filter in equation (2) is the incomplete data representation.
Taking this approach, the logarithm of the sum of densities in equation (5) is replaced by
the sum of their logarithms, which results in better conditioning of the numerical
optimization problem and was shown to perform satisfactorily with relatively small data
sets (N ≈ 400). Although the considerations of this section could also be made for
equation (5), the focus is on the EM approach.
The branching structure representation of Hawkes process is in terms of a cluster
Poisson process. It was first proposed by Hawkes and Oakes (1974), who proved it to be
equivalent to the representation given in the foregoing section. Their argument was very
technical and it served to establish the existence and uniqueness of the process. The
branching structure has also found more intuitive applications. For example, in ecology it
is used to describe the growth of wildlife populations in terms of subsequent generations of
offspring due to each immigrant (e.g., Rasmussen, 2011). In the context of disease control,
it is interpreted as the number of people contaminated by each subsequent carrier (e.g.,
Daley & Vere-Jones, 2003). Veen and Schoenberg (2008) were the first to consider the
branching structure as a strategy for obtaining maximum likelihood estimates (MLEs) of
Hawkes process.
For the present purpose, the effect of the branching structure is to decompose Hawkes
process into n independent Poisson processes whose rate functions are given by the response
functions in equation (3). These processes govern the number of “offspring” of each event.
There is also an additional Poisson process governing the number of “immigrant” events;
this process has a rate function given by the baseline parameter µ. Importantly, each event
tk is assumed to be due to one and only one of these independent Poisson processes: either
one centered at its parent, tj, with tj < tk, or the baseline process. Consequently, if we
knew which process each event belonged to, estimation of Hawkes process would reduce to
that for a collection of independent Poisson processes. It is therefore natural to introduce a
missing variable that describes the specific process to which each event tk belongs, and
proceed by means of the EM algorithm. As with other applications of the EM algorithm,
the missing data need not correspond to the hypothesized data generating process; it can
be treated merely as a tool for obtaining MLEs.
The following notation is employed to set up the algorithm. Let Z = (Z1, Z2, · · · , Zn)
denote the missing data. If an event tk is an offspring of event tj, tj < tk, this is denoted by
setting Zk = j. If an event tk is an immigrant then Zk = 0. Also let φj(u) denote the
response functions governing each Poisson process, where it is understood that φ0(u) = µ.
For j > 0, these response functions are identical to those introduced in equation (3) above,
except the subscript serves to make explicit the centering event tj.
Letting l(θ | X,Z) denote the complete data log-likelihood, Halpin and De Boeck (in
press) showed that
Q(θ) = E_{Z|X,θ} l(θ | X, Z)
     = Σ_{j=0}^n ( Σ_{k>j} ln(φ_j(t_k − t_j)) × Prob(Z_k = j | X, θ) − ∫_0^T φ_j(s − t_j) ds )    (6)
where
Prob(Z_k = j | X, θ) = φ_j(t_k − t_j) / Σ_{r<k} φ_r(t_k − t_r).    (7)
Equations (6) and (7) provide the necessary components of an EM algorithm for Hawkes
process. Equation (7) is readily computed on the E step. On the M step these probabilities
are treated as fixed and entered into equation (6). Using this approach, Halpin and De
Boeck (in press) provided closed form solutions for the baseline parameter µ and the
intensity parameter α. However, in order to obtain the parameters of the response kernel,
it is necessary to numerically optimize the Q function. This is the computationally
expensive part of the algorithm.
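For an exponential response kernel, the E step in equation (7) reduces to normalizing the candidate parent contributions for each event, with φ_0(u) = µ for the baseline process. A minimal sketch under these assumptions (parameter names are illustrative, not the authors' code):

```python
import math

def e_step(events, mu, alpha, beta):
    """Compute Prob(Z_k = j | X, theta) of equation (7) for each event t_k.
    Row k lists the probability of the baseline process (index 0) followed by
    the probabilities of each earlier event t_j being the parent."""
    probs = []
    for k, tk in enumerate(events):
        contrib = [mu]  # phi_0(u) = mu: the immigrant (baseline) process
        contrib += [alpha * beta * math.exp(-beta * (tk - tj)) for tj in events[:k]]
        total = sum(contrib)
        probs.append([c / total for c in contrib])
    return probs
```

Each row sums to one, reflecting that every event is attributed to exactly one of the independent Poisson processes.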
Since the sum over k > j is the source of the quadratic growth of the Q function, let’s
first consider how this can be reduced. Recall that for j > 0, φ_j(u) = α × f(u; ξ) is just a
weighted density on R+. For usual choices of the response kernel, f(u; ξ) → 0 as u becomes
large (i.e., response functions typically have a right tail that asymptotes at zero).
Intuitively, this means that when tk − tj is large, the contribution of φj(tk − tj) to equation
(6) will be negligible.
In order to make this idea more formal, consider the sets

W_j = {k : k > j, f(t_k − t_j; ξ) ≥ w}

and let |W| denote the average of the cardinalities of the W_j. Replacing the sum over k > j
with the sum over k ∈ W_j in equation (6) results in |W| × n densities appearing in the
double summation. This substitution will be referred to as the modified Q function and
denoted Q̃. |W| is the linear growth factor of Q̃. The relative efficiency of Q̃ over Q is

R = (|W| × n) / (n(n − 1)/2) = 2|W| / (n − 1).
The value of |W | depends on (a) ξ, which is updated throughout the optimization
process, (b) w, which can be determined by the researcher, and (c) the actual observations
tk, which are fixed. This makes it difficult to obtain analytical results on |W|. However,
Table 1 provides evidence that it does not grow with n and it can be much smaller than
(n− 1)/2.
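The behaviour of |W| is easy to check empirically for a given kernel. The sketch below counts, for each event t_j, the later events whose kernel value is at least w, using an exponential kernel and regularly spaced events as a hypothetical example:

```python
import math

def mean_kept_terms(events, beta, w):
    """Average cardinality |W| of the sets W_j: for each t_j, count the later
    events t_k whose kernel value f(t_k - t_j; beta) is at least w."""
    sizes = []
    for j, tj in enumerate(events):
        kept = [tk for tk in events[j + 1:]
                if beta * math.exp(-beta * (tk - tj)) >= w]
        sizes.append(len(kept))
    return sum(sizes) / len(sizes)

events = [float(i) for i in range(100)]            # n = 100 regularly spaced events
print(mean_kept_terms(events, beta=2.0, w=1e-10))  # far smaller than (n - 1)/2 = 49.5
```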
=========================
Insert Table 1 about here
=========================
The table was produced by simulating data using the inverse method (see Daley &
Vere-Jones, 2003). The causal filter in equation (2) was used for simulation, not the
branching structure. Three different sample sizes (N = 500, 1500, and 5000) were simulated
from each of three different models. Model 1 and Model 2 used exponential response
functions, with Model 1 having moderate intensity (α = .4) and Model 2 having high
intensity (α = .8). This means that the data from Model 2 showed a much higher degree of
clustering (i.e., a larger number of events occurring in close proximity to one another).
Model 3 is also high intensity (α = .8) but used a two-parameter gamma kernel with shape
parameter set to .5. The result is heavier-tailed response functions, which have been
reported in various applications to human communication data (e.g., Barabasi, 2005; Crane
& Sornette, 2008; Halpin & De Boeck, in press). The choices of intensity parameter are
intended to reflect its possible range rather than realistic values; I have not seen intensity
estimates greater than .5 in real data applications. For each simulated data set, Q̃ was
computed using the true parameter values and w = 1e-10.
The main point to be taken from Table 1 is that the values of |W| did not increase
with n, and therefore the rate of growth of Q̃ was linear. The exact rate of linear growth
depended on the parameters of the data generating model, with more clustered data
showing faster growth. However, even at extraordinarily high intensities and even at the
smallest sample size, the growth rate was much smaller than (n − 1)/2. Based on these
results, it is reasonable to conclude that Q̃ is more efficient to compute than Q. It should be
emphasized that this depends on the type of response kernel; the approach outlined here
will not work unless the response kernel has a right tail that asymptotes at zero.
Table 1 does not address how the rounding error w affects the MLEs produced by the
EM algorithm. That is the topic of the next section. Although this section has only
focussed on the computation of the Q function, entirely similar remarks can be made about
the computation of equation (7) on the E step, and about the computation of equation (5).
Effect of Rounding Error on the EM algorithm
This section considers how the rounding error w affects convergence and parameter
recovery. Data were again simulated using the inverse method with the incomplete data
model (equation (2)). The data-generating model used a two-parameter gamma density as
the response kernel. The parameters of the data generating model are stated in Table 3
and were based on the real data example reported in Halpin and De Boeck (in press).
N = 250 data sets of size n = 500 time points were generated from the model. For each
data set, the EM algorithm described in Halpin and De Boeck (in press) was implemented
using Q̃ in place of Q. The starting values for the estimation algorithm were obtained by
randomly disturbing the data generating values, which avoided the need for multiple
starting values. Convergence was evaluated using the incomplete data log-likelihood
(equation (5)). The convergence criterion was an absolute difference of less than 1e-5 on
subsequent M steps.
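The convergence rule can be expressed as a generic driver around the E and M steps; the update and log-likelihood functions below are placeholders, not the authors' code:

```python
def run_em(loglik, em_update, theta0, tol=1e-5, max_iter=500):
    """Iterate an EM update until the (incomplete data) log-likelihood changes
    by less than tol between successive M steps."""
    theta = theta0
    ll_old = loglik(theta)
    for _ in range(max_iter):
        theta = em_update(theta)    # one combined E + M step
        ll_new = loglik(theta)
        if abs(ll_new - ll_old) < tol:
            break
        ll_old = ll_new
    return theta
```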
The simulation compared the rounding errors w = 0, 1e-10, 1e-5, 1e-3. Because a
rounding error of 0 is not possible in practice, this was implemented using w = 2.22e-16,
the double precision machine epsilon on most modern computers.
Therefore the value w = 0 represents the amount of error that is intrinsic to the specific
realization of the estimation process (i.e., with the given sample size, convergence criterion,
etc). The remaining values of w represent the introduction of rounding error for
computation efficiency.
Let’s first consider the role of rounding error in the convergence of the algorithm.
Figure 1 shows the relationship between the log-likelihoods evaluated at the MLEs and the
log-likelihoods evaluated at the data generating parameters. The relation is quite similar
for the three smallest values of w, but is appreciably worse for the largest value. It is
important to note that even for w = 0, the relationship is not perfect. The amount of
additional error introduced by the two middle values of w is not perceptible in the figure.
=========================
Insert Figure 1 about here
=========================
Table 2 provides a closer look at the log-likelihoods. It reports the mean and standard
deviation for the differences between the log-likelihoods of the estimated models and the
log-likelihoods computed using the true values. The table entries are reported as
percentages of the difference between the log-likelihoods of w = 0 and of the true values
(i.e., as percentages of the intrinsic estimation error). If w > 0 did not affect the
convergence of the EM algorithm, all values in the table would be 100. Based on the table
we can conclude that all values of w > 0 introduced additional error into the convergence
of the EM algorithm. For w = 1e-10 this was less than .1 percent of the intrinsic estimation
error.
=========================
Insert Table 2 about here
=========================
Turning now to address parameter recovery, Table 3 reports the bias and error of the
MLEs for each level of rounding error. The entries are reported as percentages of the data
generating parameters. It can be seen that bias and error were very similar for the lowest
two values of w, but for larger values of w there is increased bias and reduced error. Figure
2 shows the distribution of estimates of the gamma response kernels for w = 0 and
w = 1e-10.
=========================
Insert Table 3 about here
=========================
=========================
Insert Figure 2 about here
=========================
Based on this simulation it may be concluded that there is little to distinguish the
results obtained using a rounding error of w = 1e-10 from the intrinsic error in the
algorithm (i.e., w = 0). On the other hand, w ≥ 1e-5 has a relatively large influence both
on the convergence of the algorithm and on the bias and error of the resulting parameter
estimates.
Conclusions
The number of computations required by the EM algorithm proposed by Halpin and
De Boeck (in press) grows quadratically in the number of observed events, making its
application to large data sets infeasible. This paper has shown that the runtime of the
algorithm can be reduced by introducing rounding error into the computation of the Q
function (i.e. the objective function of the M step of the EM algorithm). In three
applications involving response functions with right tails asymptoting at zero, this was
shown to result in linear growth. The consequences for convergence of the algorithm and
parameter recovery were also considered. A rounding error of 1e− 10 was found to have
negligible effects, but larger values did not. While more research can be done to optimize
the rounding error for specific applications of the algorithm, it can be concluded that the
approach presented here provides an acceptable compromise between runtime and
computational accuracy.
References
Barabasi, A. L. (2005). The origin of bursts and heavy tails in human dynamics. Nature,
435 , 207–211.
Crane, R., & Sornette, D. (2008). Robust dynamic classes revealed by measuring the
response function of a social system. Proceedings of the National Academy of
Sciences , 105 , 15649–15653.
Daley, D. J., & Vere-Jones, D. (2003). An introduction to the theory of point processes:
Elementary theory and methods (Second ed., Vol. 1). New York: Springer.
Halpin, P. F., & De Boeck, P. (in press). Modeling dyadic interaction using Hawkes
process. Psychometrika.
Hawkes, A. G., & Oakes, D. (1974). A cluster representation of a self-exciting process.
Journal of Applied Probability , 11 , 493–503.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: John Wiley and
Sons.
Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for
point processes. Journal of the American Statistical Association, 83 , 9-27.
Rasmussen, J. G. (2011). Bayesian inference for Hawkes’ processes. Methodology and
Computing in Applied Probability , DOI: 10.1007/s11009-011-9272-5.
Seber, G. A. F., & Wild, C. J. (2003). Non-linear regression (2nd ed.). Hoboken, NJ: John
Wiley & Sons.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A
point process framework for relating neural spiking activity to spiking history, neural
ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93,
1074–1089.
Veen, A., & Schoenberg, F. P. (2008). Estimation of space-time branching process models
in seismology using an EM-type algorithm. Journal of the American Statistical
Association, 103 , 614–624.
Tables
Table 1. Growth of the Modified Q Function in Number of Time Points (Simulated Data)

            n = 500    n = 1500    n = 5000
Model 1      6.665      6.589       6.632
Model 2     19.870     23.609      25.083
Model 3     28.226     24.0133     23.567

Note: n is the number of simulated time points and the table entries are the linear growth
factor, |W|, of the modified Q function, Q̃, computed using the true parameter values.
|W| × n gives the number of computations required for Q̃, and 2|W|/(n − 1) gives the
efficiency of Q̃ relative to the original Q function proposed by Halpin and De Boeck (in
press). The models are described in the text.
Table 2. Effect of Rounding Error on Log-likelihoods (Simulated Data)

         w = 0    w = 1e-10    w = 1e-5    w = 1e-3
Mean      100      99.966      113.163     652.114
SD        100     100.014       99.268     284.921
Note: Table entries are means (M) and standard deviations (SD) of differences
between log-likelihoods of the estimated models and the log likelihoods computed using the
true values. The means and standard deviations are reported as percentages of the values
for w = 0 (i.e., percentages of the intrinsic estimation error). The MLEs were obtained
using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q
function and the indicated levels of rounding error, w.
Table 3. Effect of Rounding Error on Parameter Recovery (Simulated Data)

              µ          α          κ           β
True values   .1        .45        .6          10
w = 0          2.289    -2.662      2.782      -0.4757
             (12.707)  (14.282)   (11.812)    (49.986)
w = 1e-10      2.266    -2.634      2.775      -0.2537
             (12.725)  (14.315)   (11.824)    (50.664)
w = 1e-5       4.458    -5.475      5.7890    -19.507
             (10.857)  (11.592)   (11.215)    (22.937)
w = 1e-3      29.083   -35.617     28.390     -78.925
              (9.969)   (8.786)   (17.114)     (3.618)
Note: Table entries are bias (error) of maximum likelihood estimates (MLEs) as
percentages of the true values. µ denotes the baseline parameter of Hawkes process, α the
intensity parameter, κ the shape parameter of the two-parameter gamma response kernel,
and β its scale parameter. MLEs were obtained using the EM algorithm described by
Halpin and De Boeck (in press) with the modified Q function and the indicated levels of
rounding error, w.
[Figure 1: Four scatterplots of the log-likelihoods at the MLEs (horizontal axis) against the log-likelihoods at the data generating values (vertical axis), one panel per rounding error: w = 0 (r = .998), w = 1e-10 (r = .998), w = 1e-5 (r = .998), w = 1e-3 (r = .985).]

Figure 1. Relation of log-likelihoods at convergence with log-likelihoods computed using the data generating values (simulated data). The model was estimated using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.
[Figure 2: Histograms of the MLEs of the two-parameter gamma kernel's shape and scale parameters, one panel per combination of parameter and rounding error (w = 0 and w = 1e-10); shape estimates range over roughly .5 to .9 and scale estimates over roughly 5 to 30.]

Figure 2. Histograms of maximum likelihood estimates (MLEs) of the two-parameter gamma density kernel (simulated data). The bold vertical line indicates the value of the data generating parameters. MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.