
Streaming statistical analysis: new challenges and old problems

Niall Adams

Department of Mathematics, Imperial College London

Heilbronn Institute for Mathematical Research, University of Bristol

June 2016

1/37

Contents

1. Background and context

2. Streaming analytics

3. Example: continuous monitoring

4. Conclusion

Collaborators:

- ETH: Dean Bodenham

- Imperial: Christoforos Anagnostopoulos, Jordan Noble, Josh Plasse

Funding:

- EPSRC, GCHQ.

2/37

1: Background and Context

Data Science is often said to consist of 5 V's:

- Volume

- Velocity

- Variety

- Veracity: data quality and more?

- Value: does data have value?

Putting aside that this may simply be marketing hype, I want to think about streaming statistical methods in Big Data contexts.

3/37

Further prejudicial remarks:

The big data hype seems to push collecting data at all costs, without consideration for whether the data contains relevant signal.

We have been here before: data mining was claimed to be“extracting new information from data as a secondary process”.

4/37

Data Streams

A data stream consists of a sequence of data items arriving at high frequency, generated by a process that is subject to unknown changes (generically called drift).

Most examples arise from automatic data collection and the need for timely inference and decision making, often financial. Examples include:

- credit card transaction data (6K/s for Barclaycard Europe)

- stock market tick data

- computer network traffic (17K/s NETFLOW events at Imperial)

Aside: These examples consist of a population of entities (customers, stocks, computers). Taken in aggregate, they represent high volume data. But considering the entities suggests a huge number of small (or moderate sized) data problems.

It is in contexts such as these that streaming methods are directly relevant.

5/37

A simple formulation of streaming data is a sequence of p-dimensional vectors, arriving in discrete time

…, x_{t−2}, x_{t−1}, x_t

where x_i ∈ R^p. It is assumed that the process generating the data is changing over time.

Even massaging the raw data into this shape requires a processing system: one of the tasks of stream processing infrastructures.

Examples with real applications:

- Credit card transaction streams. Variables include: transaction value and time, location, card response codes, etc.

- Computer network monitoring (NETFLOW). Variables include: source and destination information, bytes, packets, etc.

6/37

Streaming Architectures

Real applications of streaming data involve specialist streaming infrastructure, hardware and software, to process the data as, or soon after, it is recorded. Examples:

- IBM InfoSphere Streams [1]

- Apache STORM [2]

Significant investment is typically needed in both hardware and expertise. (At Imperial, we are trying a very lightweight and cheap(?) approach - around 60K investment!)

- Such systems involve chained analytics and data processing

- Need: self-monitoring statistical models

7/37

Example

Our example is NETFLOW data collected at Imperial College London. This

- has ∼ 40K computers

- generates ∼ 12Tb of flow data per month, or ∼ 15Gb per hour

- experience suggests no smoking gun in NETFLOW; prefer to focus on weak signals and combining evidence.

Some interests and idiosyncrasies of Imperial:

- Particularly concerned about: illegal transfer of copyright material, protecting IP

- Few constraints on network usage (academic freedom, Halls of residence, …)

- Spearphishing is a major compromise route.

We are developing a HADOOP-based system for statistical querying with bulk NETFLOW.

8/37

NETFLOW

A NETFLOW record is a summary of a connection between two network devices, collected as the connection traverses a router.

Example NETFLOW data (anonymised). This consists of two flow records from the same source address to the same destination address, on destination port 80. The two events started within 2 seconds of each other.

Date flow start          Duration  Proto  Src IP Addr    Dst IP Addr     Src Pt  Dst Pt  Packets  Bytes
2009-04-22 12:04:44.664  13.632    TCP    126.253.5.69   124.195.12.246  49882   80      1507     2.1 M
2009-04-22 12:04:42.613  16.768    TCP    126.253.5.69   124.195.12.246  49881   80      2179     3.1 M
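A sketch of turning export lines like these into typed records, following the anonymised layout above (the exact export format varies by tool, and real exports often add extra separators, so treat the field positions here as an assumption):

```python
from dataclasses import dataclass

@dataclass
class Flow:
    """One NETFLOW summary record, with the fields shown above
    (a hypothetical in-memory layout, not a wire format)."""
    start: str
    duration: float
    proto: str
    src_ip: str
    dst_ip: str
    src_pt: int
    dst_pt: int
    packets: int
    bytes: str  # '2.1 M' style, as exported

def parse_flow(line: str) -> Flow:
    """Parse one whitespace-separated export line like the examples above."""
    f = line.split()
    return Flow(f[0] + " " + f[1], float(f[2]), f[3], f[4], f[5],
                int(f[6]), int(f[7]), int(f[8]), " ".join(f[9:]))
```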

Numerous challenges with NETFLOW data:

- quality: duplication, direction, timing, etc.

- scale, human change, machine change, anonymity, etc.

- data analysis focus: event, node, edge, neighbourhood, …

I don't think of this as big data, but rather a vast collection of small, connected problems.

9/37

2: Streaming Analytics

Why streaming analytics?

- The data arrives quickly, and must be processed quickly on arrival:

  - for timeliness reasons related to action and decision making, or because the data will not be stored.

- The data is very large, and processing requires one-pass procedures

  - constant memory and compute demand

So, a streaming statistical analytic¹ must

- be efficient in data usage, ideally one-pass

- not have varying computational burden

- be capable of handling temporal variation

Clearly, such procedures cannot in general be as good as batch procedures! However, batch procedures cannot be deployed → trade-off!

¹I don't like the term 'analytic' either.

10/37

Really??

Homer Simpson: Kids, there's three ways to do things; the right way, the wrong way and the Homer Simpson way!

Bart Simpson: Isn't that the wrong way?

Homer Simpson: Yeah, but faster!

11/37

Agenda

The research agenda is then (first!) to develop the streaming equivalents of standard statistical tools, particularly

1. Anomaly detection (e.g. is this NETFLOW record anomalous?)

2. Change-point detection (e.g. has the average amount of traffic into a computer changed?)

3. Clustering (e.g. has the cluster identity of a group of computers changed?)

4. Prediction: regression and classification (e.g. how many events will a computer generate in the next five minutes?)

3 and 4 have received most attention in the "streaming" literature. Batch versions, and some aspects of the streaming problem, have been studied extensively in statistics and machine learning.

4 - much of the work on streaming classification is not realistic. It is usually assumed that the label associated with x_t arrives at time t + 1. This makes method development a little easier, but is not consistent with the character of many real problems.

12/37

3: Change detection on the stream

Change-point detection has a long history. Traditional applications often related to a single change in manufacturing processes, for which it was possible

- to make strong assertions about the pre-change process (distribution, parameters),

- to have a strong sense of the size of the change being sought,

- to stop the process once a change is detected, and reset.

An impressive collection of theory has been developed for sequential methods in this context.

However, modern applications have different characteristics:

- unending sequence of data

- no sense of expected change size

- no chance of intervention, no reset

We don't know what the future holds, so a change-point detector needs to adapt continuously.

13/37

Basic change-point detection methods

Well studied methods include:

- CUSUM

- Shiryaev-Roberts procedures (SRP, e.g. [5])

- EWMA

In the frequentist approach, the performance of such methods is characterised by two measures related to the capacity to detect a single change.

CUSUM and SRP have optimality properties under special conditions, including

- the pre- and post-change distributions are fully specified,

- the change size is known.

14/37

CUSUM

CUmulative SUM - first proposed in [8].

The CUSUM statistic S_j, used to detect an increase in the mean of Gaussian observations N(µ_1, σ_1), is

S_j = max(0, S_{j−1} + x_j − kµ_1)

where

- µ_1 is the pre-change mean, assumed known

- S_0 = µ_1

A change is detected when S_j > hσ_1. Here

- h, k are two control parameters, selected to

  - guarantee a performance rate

  - match the magnitude of the changes for detection
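A minimal one-pass sketch of the idea. Note this uses the common textbook parameterisation (allowance kσ_1 and S_0 = 0), which differs slightly from the slide's; the h, k values and the simulated jump are illustrative:

```python
import random

def cusum(stream, mu1, sigma1, h=5.0, k=0.5):
    """Textbook one-sided CUSUM for an increase in the mean:
    S_j = max(0, S_{j-1} + (x_j - mu1) - k*sigma1), alarm when
    S_j > h*sigma1. Returns the index of the first alarm, or None."""
    s = 0.0
    for j, x in enumerate(stream):
        s = max(0.0, s + (x - mu1) - k * sigma1)
        if s > h * sigma1:
            return j
    return None

# Hypothetical demo: N(0,1) data with the mean jumping to 1 at t = 500.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(500)] + \
     [random.gauss(1, 1) for _ in range(500)]
print(cusum(xs, mu1=0.0, sigma1=1.0))  # alarms shortly after t = 500
```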

15/37

Standard performance measures

In the context of a single change-point, two standard measures:

1. ARL0 (Average Run Length): the number of observations until a false alarm occurs

2. ARL1: the average number of observations between a change-point occurring and its detection: the average detection delay

These are usually determined in MC simulation studies.

However, these are not enough for continuous monitoring, because more is going on: notably a sequence of changes at random times, of random sizes.

16/37

For change detection on the stream (continuous monitoring) we need:

- refined performance metrics

- a reduced burden for setting control parameters, since there is no chance for intervention

The approaches that interest me use parameters called forgetting factors.

17/37

Forgetting-factor methods

We are interested in modifying standard statistical procedures to incorporate forgetting factors - parameters that control the contribution of old data to parameter estimation.

We adapt ideas from adaptive filter theory [3] to tune the forgetting factor automatically.

It is simplest to illustrate with an example: consider computing the mean vector and covariance matrix of a sequence of n multivariate vectors. Standard recursion:

m_t = m_{t−1} + x_t,   µ̂_t = m_t/t,   m_0 = 0

S_t = S_{t−1} + (x_t − µ̂_t)(x_t − µ̂_t)^T,   Σ̂_t = S_t/t,   S_0 = 0

18/37

For vectors coming from a non-stationary system, simple averaging of this type is biased.

Knowing the precise dynamics of the system would give a chance to construct an optimal filter. However, this is not possible with streaming data (though there are interesting links between adaptive and optimal filtering).

Incorporating a forgetting factor, λ ∈ (0, 1], in the previous recursion:

n_t = λn_{t−1} + 1,   n_0 = 0

m_t = λm_{t−1} + x_t,   µ̂_t = m_t/n_t

S_t = λS_{t−1} + (x_t − µ̂_t)(x_t − µ̂_t)^T,   Σ̂_t = S_t/n_t

λ down-weights old information more smoothly than a window. Denote these forgetting estimates as µ̂_t^λ, Σ̂_t^λ, etc.

n_t is the effective sample size, or memory. λ = 1 gives the offline solutions, with n_t = t. For fixed λ < 1 the memory size tends to 1/(1 − λ) from below.
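A concrete sketch of these recursions (the class name and λ default are mine; λ = 1 reproduces the standard recursion above):

```python
import numpy as np

class ForgettingMoments:
    """Fixed-forgetting estimates of the mean vector and covariance
    matrix, following the recursions on this slide."""
    def __init__(self, p, lam=0.99):
        self.lam = lam
        self.n = 0.0                # effective sample size n_t
        self.m = np.zeros(p)        # weighted sum of observations m_t
        self.S = np.zeros((p, p))   # weighted sum of squared deviations S_t

    def update(self, x):
        self.n = self.lam * self.n + 1.0
        self.m = self.lam * self.m + x
        mu = self.m / self.n
        d = x - mu
        self.S = self.lam * self.S + np.outer(d, d)
        return mu, self.S / self.n  # (mean estimate, covariance estimate)
```

With λ = 0.99 the memory tends to 1/(1 − λ) = 100 observations, matching the limit quoted above.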

19/37

Setting λ

Two choices for λ: a fixed value, or variable forgetting, λ_t. Fixed forgetting: set by trial and error, change detection, etc. (cf. a window).

Variable forgetting: ideas from adaptive filter theory suggest tuning λ_t according to a local stochastic gradient descent rule applied to a suitable cost function ξ:

λ_t = λ_{t−1} − α ∂ξ_t²/∂λ,   ξ_t: residual error at time t, α small

For the mean and covariance matrix, efficient updating rules can be implemented via results from numerical linear algebra (O(p²)).

Performance is very sensitive to α. Very careful implementation is required, including bracketing λ_t and selecting the learning rate α.

The framework provides an adaptive means for balancing old and new data. Note the slight hack in the interpretation of λ_t (though there is a rigorous development in Bodenham [4]).
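A univariate sketch of this scheme, assuming squared one-step residual as the cost (class and parameter names are mine; the derivative recursions for m_t and n_t follow by differentiating the updates on the previous slide, and λ_t is bracketed as the slide advises):

```python
class AdaptiveForgettingMean:
    """Adaptive-forgetting mean: lam_t moves by stochastic gradient
    descent on xi_t^2 = (x_t - muhat_{t-1})^2, a sketch only."""
    def __init__(self, lam=0.99, alpha=1e-4, bracket=(0.7, 1.0)):
        self.lam, self.alpha, self.bracket = lam, alpha, bracket
        self.m = self.n = 0.0    # weighted sum m_t and effective size n_t
        self.dm = self.dn = 0.0  # their derivatives w.r.t. lam
        self.mu = 0.0

    def update(self, x):
        if self.n > 0:
            xi = x - self.mu                              # residual xi_t
            dmu = (self.dm * self.n - self.m * self.dn) / self.n ** 2
            grad = -2.0 * xi * dmu                        # d(xi^2)/d(lam)
            self.lam = min(max(self.lam - self.alpha * grad,
                               self.bracket[0]), self.bracket[1])
        # derivative recursions: d m_t = m_{t-1} + lam * d m_{t-1}, etc.
        self.dm = self.m + self.lam * self.dm
        self.dn = self.n + self.lam * self.dn
        # the forgetting updates themselves
        self.m = self.lam * self.m + x
        self.n = self.lam * self.n + 1.0
        self.mu = self.m / self.n
        return self.mu, self.lam
```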

20/37

Link to state-space models

While this framework is heuristic, it does have a link to simple state-space models, which allows us to reason about a true forgetting factor. Consider the simple univariate state-space model

µ_t = µ_{t−1} + ε_µ,   ε_µ ∼ N(0, σ²_µ)

x_t = µ_t + ε_x,   ε_x ∼ N(0, σ²_x)

In this case, the Kalman filter provides MSE-optimal filtered estimates of µ. The key updating equation involves the Kalman gain, k_t:

µ̂_t^K = µ̂_{t−1}^K + k_t(x_t − µ̂_{t−1}^K)

21/37

The adaptive forgetting framework, using squared error, is written as a recursion for this model to give

µ̂_t^F = µ̂_{t−1}^F + (1/n_t)(x_t − µ̂_{t−1}^F)

where n_t = λ_{t−1}n_{t−1} + 1.

To make the AF estimates match the KF, we have

λ_t = 1 / ((σ²_µ/σ²_x) n_t + 1)

At least in this simple case, we can reason about an optimal choice of λ. The more general formulation is much more difficult, and implies a matrix of forgetting factors (and corresponding difficulties of interpretation).
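A quick numerical check of this equivalence, assuming hypothetical variances σ²_µ = 0.5 and σ²_x = 4 (a sketch; with a diffuse Kalman initialisation the two gains match only approximately at the first step, after which they follow the same recursion and coincide):

```python
import random

random.seed(2)
s2mu, s2x = 0.5, 4.0
mu, xs = 0.0, []
for _ in range(200):                 # simulate the local-level model
    mu += random.gauss(0, s2mu ** 0.5)
    xs.append(mu + random.gauss(0, s2x ** 0.5))

kf, P = 0.0, 1e6                     # Kalman filter, diffuse start
af, n, lam = 0.0, 0.0, 1.0           # adaptive forgetting (AF) mean
for x in xs:
    k = (P + s2mu) / (P + s2mu + s2x)   # Kalman gain k_t
    kf += k * (x - kf)
    P = (1 - k) * (P + s2mu)
    n = lam * n + 1.0                   # n_t = lam_{t-1} n_{t-1} + 1
    af += (1.0 / n) * (x - af)
    lam = 1.0 / ((s2mu / s2x) * n + 1.0)  # the slide's matching lam_t

print(abs(kf - af))   # negligible: the AF gain 1/n_t tracks k_t
```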

22/37

Change-point detection

Extending an adaptive estimator for change detection requires the imposition of a decision rule, which in turn requires the distribution of the estimator. Options:

1. probability bounds

2. strong parametric assumptions

3. non-parametric approach (probably best in context, but compute demand?)

For the fixed forgetting estimator, based on the (naive) assumption that data points are IID from N(µ, σ), it can be shown that

X̄_{N,λ} ∼ N(µ, u_{N,λ} σ²)

where

u_{N,λ} = (1/(w_{N,λ})²) Σ_{k=1}^{N} λ^{2(N−k)}

(here w_{N,λ} = Σ_{k=1}^{N} λ^{N−k} is the total weight, i.e. the n_t of the earlier recursion).

This again admits an efficient sequential formulation.
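A sketch of that sequential formulation: maintain the weight sums w and v in O(1) per observation, then u = v/w² (function name is mine):

```python
def u_sequence(lam, N):
    """One-pass recursion for u_{N,lam} = (1/w^2) * sum_k lam^{2(N-k)},
    with w_{N,lam} = sum_k lam^{N-k} (the effective sample size)."""
    w = v = 0.0
    us = []
    for _ in range(N):
        w = lam * w + 1.0           # w_N = lam * w_{N-1} + 1
        v = lam * lam * v + 1.0     # v_N = lam^2 * v_{N-1} + 1
        us.append(v / (w * w))
    return us

print(u_sequence(0.95, 5))
# As N grows, u -> (1 - lam)/(1 + lam), the familiar EWMA variance factor.
```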

23/37

For given α, compute the prediction interval

P_{N,λ,α} = (L_1, L_2)

where

P[X̄_{N,λ} ≤ L_1] = α/2,   P[X̄_{N,λ} ≤ L_2] = 1 − α/2

A detection occurs if X̄_{N,λ} ∉ P_{N,λ,α}.

This reasoning can be extended, approximately, to the time-varying forgetting factor.
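The rule itself is just a two-sided tail test; a minimal sketch (function name and α default are mine; in practice µ and σ² would be estimated from a burn-in period):

```python
from statistics import NormalDist

def interval_detect(xbar, mu, sigma2, u, alpha=0.01):
    """Flag a detection if the forgetting-weighted mean xbar falls
    outside the level-alpha prediction interval (L1, L2) of
    N(mu, u * sigma2)."""
    dist = NormalDist(mu, (u * sigma2) ** 0.5)
    L1, L2 = dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)
    return not (L1 <= xbar <= L2)
```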

24/37

Change-point detection on the stream

We seek to determine if adaptive estimation will handle missed detections and false positives better, in a streaming context. Additionally, with adaptive forgetting, perhaps there are fewer parameters to set?

While standard frequentist approaches are concerned with single change-points, we will be concerned with continual monitoring of the stream, which means the detector needs to restart.

The data model is:

X_1, X_2, …, X_{τ_1} ∼ D_1

X_{τ_1+1}, X_{τ_1+2}, …, X_{τ_2} ∼ D_2

X_{τ_2+1}, X_{τ_2+2}, …, X_{τ_3} ∼ D_3

etc.

where

- τ_i are the change-points

- D_k ≠ D_{k+1} are the inter-change distributions

25/37

In traditional change detection, a detection would result in an intervention - for example, to reset a manufacturing process.

In continuous monitoring [6], no intervention occurs, and the stream of data continues to flow through the detector.

This calls for:

- methods that can restart after a detection

- extra performance measures

26/37

Continuous monitoring - performance

There is a finite amount of time between change-points, and some changes might not be detected before another change-point occurs; we classify these as missed changes.

Suppose a stream has C change-points, and an algorithm makes a total of D detections, T of which are true (correct) detections, while D − T are false detections. We then define:

- True Detection Rate (TDR), the proportion of change-points correctly detected: TDR = T/C

- Missed Detection Rate (MDR), the proportion of change-points not detected: MDR = 1 − T/C

- False Detection Rate (FDR), the proportion of detections that are false: FDR = (D − T)/D

These measures are familiar from epidemiological surveillance.
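These definitions translate directly to code (function name and the example counts are illustrative):

```python
def detection_rates(C, D, T):
    """Continuous-monitoring performance measures: C change-points,
    D detections in total, T of them true."""
    return {"TDR": T / C,
            "MDR": 1 - T / C,
            "FDR": (D - T) / D if D else 0.0}

print(detection_rates(C=50, D=40, T=25))
# {'TDR': 0.5, 'MDR': 0.5, 'FDR': 0.375}
```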

27/37

Simulation Studies

We now look at the performance of adaptive estimation, compared to other methods.

Restarting will be handled with a burn-in period, for all methods.

Ingredients of the simulation engine:

- All distributions D_i = N(µ_i, σ)

- Random jump sizes and positions: µ + δ, where δ ∈ {±δ_1, ±δ_2, …} = S
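A sketch of such a simulation engine (names, segment-length range and jump-size set S are illustrative, not the settings used in [4,6]):

```python
import random

def changing_stream(n_segments, seg_range=(50, 200), sigma=1.0,
                    jump_sizes=(0.5, 1.0, 3.0), seed=0):
    """Segment i is N(mu_i, sigma); at each change-point mu jumps by a
    randomly signed delta drawn from jump_sizes (the set S), and segment
    lengths (change positions) are random too.
    Yields (x_t, is_change_point) pairs."""
    rng = random.Random(seed)
    mu = 0.0
    for i in range(n_segments):
        if i > 0:
            mu += rng.choice((-1, 1)) * rng.choice(jump_sizes)
        for j in range(rng.randint(*seg_range)):
            yield rng.gauss(mu, sigma), (i > 0 and j == 0)
```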

28/37

Schematic representation (figure not captured in transcript):

- G: grace period, in which no changes occur

- D: burn-in period, for restart estimation

- ξ: period where changes can occur

29/37

Definitions, given detected change-point τ̂_n and subsequent true change-point τ_m:

- if τ̂_{n+1} ∈ [τ̂_n + B, τ_m], a false detection;

- if τ̂_{n+1} > τ_m, a correct detection;

- if τ_m and τ_{m+1} occur without a detected change-point, a missed detection.

A little extra work is needed for ARL0 and ARL1 - see [6].
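One way to code this bookkeeping, as a simplification of the rules above (the burn-in-B rule for detections shortly after a restart is omitted for brevity; sorted event times are assumed):

```python
def classify_detections(true_cps, detections, stream_end):
    """The first detection in (tau_m, tau_{m+1}] credits tau_m as
    correctly detected; any other detection counts as false; a tau_m
    with no detection before tau_{m+1} counts as missed."""
    bounds = list(true_cps)[1:] + [stream_end]
    T = F = 0
    dets = sorted(detections)
    i = 0
    for tau, nxt in zip(true_cps, bounds):
        credited = False
        while i < len(dets) and dets[i] <= nxt:
            if dets[i] > tau and not credited:
                credited, T = True, T + 1
            else:
                F += 1
            i += 1
    return {"T": T, "F": F, "missed": len(true_cps) - T}

print(classify_detections([100, 300], [150, 160, 410], 500))
# {'T': 2, 'F': 1, 'missed': 0}
```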

30/37

Methods

Test the following methods:

- CUSUM, with values of h and k recommended(!) in the literature for different size changes (ditto EWMA)

- Adaptive estimation with fixed forgetting factor λ and test level α

- Time-varying adaptive estimation, with learning rate η and test level α

Note there are more sophisticated sequential methods, but they tend to have more parameters to set.

Results for various settings will be shown.

Choice of S: different size changes should test any algorithm with fixed settings.

31/37

[Table of simulation results, not captured in this transcript]

32/37

[Table of simulation results, continued, not captured in this transcript]

33/37

The results in these tables, and many others in [4,6], are consistent in demonstrating that adaptive estimation methods, as framed in Section 4, have some utility for the continuous monitoring problem, as compared to restarting CUSUM. Specifically, such methods have improved detection performance at the cost of more false positives.

The benefit of the adaptive forgetting factor approach is a slight relaxation of the number of control parameters to be determined in advance, and the capability to determine the control parameters sequentially.

34/37

Realistic Example: Adaptive change detection for relay

Context: construct a continuous monitoring change detector for relay-like behaviour: focusing on B, B → C happens anomalously soon after A → B.

Focus on relevant events in the stream, and extend the adaptive change detector to operate on a minimum value, borrowing ideas from EVT [7]. Note: no restarting here.

[Figure] Left: transformed data and adaptively estimated minima. Right: raw data and flagged changes (finds the known relay).

35/37

4: Conclusion

- Constructing streaming analytics is a relatively new field, facing familiar problems (control parameters, speed, one-pass, etc.)

- New challenges arise because the analytics interact. In a streaming architecture:

  - output from one analytic may be input to a downstream analytic

  - outputs from different analytics may be combined

- Since these systems are intended to operate without human intervention, and proceed automatically, there is a pressing need for:

  - self-monitoring analytics

- How do we create efficient streaming equivalents of statistical diagnostic procedures, and how do we use the diagnostic information? (This is the new challenge.)

36/37

Thank you!

Questions

37/37

References

1. IBM InfoSphere Streams: http://www-03.ibm.com/software/products/en/infosphere-streams

2. Apache Storm: http://storm.apache.org

3. Haykin, S. 'Adaptive Filter Theory', 4th edition, Prentice Hall.

4. Bodenham, D.A. (2014) 'Adaptive estimation with change detection for streaming data'. PhD thesis, Department of Mathematics, Imperial College London.

5. Tartakovsky, A.G. (2014) 'Rapid detection of attacks in computer networks by quickest changepoint detection methods'. In Data Analysis for Network Cyber-Security, Imperial College Press.

6. Bodenham, D.A. and Adams, N.M. (2015) 'Continuous changepoint monitoring of data streams using adaptive estimation'. Statistics and Computing, under revision.

7. Bodenham, D.A. and Adams, N.M. (2014) 'Adaptive change detection for relay-like behaviour'. IEEE Joint Intelligence and Security Informatics Conference.

8. Page, E.S. (1954) 'Continuous inspection schemes'. Biometrika, 41, 100-115.

38/37