
Adaptive Sparse Representations for Video Anomaly Detection

Xuan Mo, Student Member, IEEE, Vishal Monga, Senior Member, IEEE, Raja Bala, Member, IEEE and Zhigang Fan, Senior Member, IEEE

Abstract—Video anomaly detection can be used in the transportation domain to identify unusual patterns such as traffic violations, accidents, unsafe driver behavior, street crime, and other suspicious activities. A common class of approaches relies upon object tracking and trajectory analysis. Very recently, sparse reconstruction techniques have been employed in video anomaly detection. The fundamental underlying assumption of these methods is that any new feature representation of a normal/anomalous event can be approximately modeled as a (sparse) linear combination of pre-labeled feature representations (of previously observed events) in a training dictionary. Sparsity can be a powerful prior on model coefficients but challenges remain in: a.) the detection of anomalies involving multiple objects, and b.) the ability of the linear sparsity model to effectively allow for class separation. The proposed research addresses both these issues. First, we develop a new joint sparsity model for anomaly detection that enables the detection of joint anomalies involving multiple objects. This extension is highly non-trivial since it leads to a new simultaneous sparsity problem which we solve using a greedy pursuit technique. Second, we introduce non-linearity into, i.e. kernelize, the linear sparsity model to enable superior class separability and hence anomaly detection. We extensively test on several real-world video data sets involving both single and multiple object anomalies. Results show marked improvements in detection of anomalies in both supervised and unsupervised scenarios when using the proposed sparsity models.

Keywords: anomaly detection, joint sparsity model, kernel function, outlier rejection.

I. INTRODUCTION

With an increasing demand for security and safety, video-based surveillance systems are being increasingly used in urban traffic locations. Vast amounts of video footage are collected and analyzed for traffic violations, accidents, crime, terrorism, vandalism, and other suspicious activities. Since manual analysis of such large volumes of data is prohibitively costly, there is a desire to develop effective algorithms that can aid in the automatic or semi-automatic interpretation and analysis of video data for surveillance and law enforcement. An active area of research within this domain is video anomaly detection, which refers to the problem of finding patterns in data that do not conform to expected behavior, and that may warrant special attention or action.

Two precursors to anomaly detection are an effective encoding of events, and a systematic means of modeling normal

X. Mo and V. Monga are with the Dept. of Electrical Engineering, The Pennsylvania State University, University Park, PA USA. [email protected], [email protected]. R. Bala and Z. Fan are with the Xerox Research Center Webster, [email protected], [email protected]. Research was supported by a grant from the Xerox Research Center in Webster, NY.

events and normal event classes. The purpose of event encoding is to extract features from the video that are most useful in differentiating among different events. Xiang et al. [1] use a 7-D vector to represent each moving blob. An expectation maximization (EM) algorithm is then used to cluster these 7-D vectors into a pre-defined number of clusters. Events which cannot be clustered into any of these pre-defined clusters are regarded as anomalies. A ratio histogram approach is proposed by Chuang et al. [2] to represent object features, and suspicious events such as abandoned luggage are detected using a finite state machine. Saligrama et al. [3] propose a motion label representation to encode events with decisions facilitated via a two-state Markov chain model. Wang et al. [4] propose using hierarchical Bayesian models, where the video data is divided into so-called "documents" and events are subsequently encoded as quantized features, or "words", within these documents. Simon et al. [5] encode events via spatio-temporal volumes, and employ decision trees to identify events. In [6], the authors model shape activities of objects by a Hidden Markov Model (HMM), and define anomalies as a change in the shape activity model. Instead of extracting moving features, Malinici et al. [7] extract features of the scene as a whole and build an infinite hidden Markov model on these features to identify anomalies. An excellent review of video anomaly detection techniques can be found in [3].

Recent relevant work and challenges: Very recently, sparse reconstruction techniques [8], [9] have been employed in video anomaly detection. The fundamental underlying assumption of these methods is that any new feature representation of a normal/anomalous event can be approximately modeled as a (sparse) linear combination of pre-labeled feature representations (of previously observed events) in a training dictionary. Li et al. use object trajectories while Zhao et al. use spatio-temporal volumes. Their work was motivated by the sparsity based face recognition approach of Wright et al. [10], which claimed that sparse representations could exhibit robustness to significant amounts of noise and face occlusion. We note that the assertions of Wright et al. have been challenged in recent work [11], [12] and [13], which claim similar noise robustness without the use of the l1 norm techniques that Wright et al. advocate and [8], [9] employ. Both [8] and [9] show promise for the use of sparsity in video anomaly detection but exhibit two important limitations. First, they only address anomalies involving single objects. While such events arguably account for a large proportion of anomalies, there are important scenarios wherein the anomaly arises from an interaction among multiple objects. Consider, for example, two vehicles following nominal individual trajectories but approaching within a dangerously close vicinity of each other.


The second limitation, associated particularly with [8], is that the anomalous events must be characterized a priori into their own classes, i.e. supervised anomaly detection where representative training for anomalous events is available. In real-world applications, it is often not possible to gather a sufficiently large number of training samples representing anomalous events.

Contributions: We propose a novel and general trajectory based joint sparse reconstruction framework for video anomaly detection. Trajectories have long been popular in video analysis and anomaly detection [14], [15], [16], [17]. A common characteristic of trajectory-based approaches [18]–[21] is the derivation of nominal classes of object trajectories in a training phase, and the comparison of new test trajectories against the nominal classes in an evaluation phase. A statistically significant deviation from all classes indicates an anomaly.

We must emphasize that our choice of trajectories (as opposed to, for example, the spatio-temporal volumes in [9]) as the event encoder is motivated by two principal reasons: 1.) interactions between multiple objects are quite naturally captured in trajectory representations, e.g. vehicles approaching within a dangerously close vicinity of each other can be caught, and 2.) recent advances in object tracking ensure trajectory extraction is both fast and reliable [22]. Nevertheless, in theory any event representation can be used with our model.

Why sparsity: Our goal is to build a linear model where joint trajectory representations of multiple objects are written as linear combinations of corresponding joint trajectories in a training dictionary. Sparsity is a powerful prior in this model for multiple reasons: a.) as in [8] and [9], when a new collection of multi-object trajectories manifests, it is expected to invoke only a few columns of the training dictionary that combine to create it; b.) even more crucially, object-wise correspondence is important in the linear combination for this model to be physically meaningful, leading to a (non-standard) block-diagonal sparse structure on the coefficients detailed later in Section III-B; and c.) finally, we observe that while the sparse structure conveys information about normal/anomalous event classes, in the absence of training data for anomalous events we can develop and use outlier rejection measures on the sparse coefficient matrix that can help with multi-object anomaly detection in unsupervised settings - a very challenging problem.

Our contributions over [8] and [9] are as follows:

• We focus on multi-object anomaly detection and extend the approaches in [8], [9] towards a joint sparsity model where a matrix (instead of a vector) of sparse coefficients results. This extension is highly non-trivial because the structure of this matrix of sparse coefficients is not naturally row sparse. The model is meaningful in the multi-object scenario only when there is object-wise correspondence in the linear combinations. To incorporate this very challenging constraint, we develop and solve a new simultaneous sparsity problem with the help of a new greedy pursuit algorithm.

• Because linear models are not always adequate, we propose a kernelization of our joint sparsity model.

If the data set does not obey a linear model, kernel methods, which are popular in machine learning, can be applied to project the data into a high-dimensional nonlinear feature space in which the data becomes more linearly separable [23]–[25]. Kernel orthogonal and basis pursuit [26], [27] algorithms and their applications [28] have been of much recent interest. We develop a new kernelization of our joint sparsity model for the purposes of video anomaly detection. This involves the development of a numerical algorithm that does not significantly increase complexity over the regular linear joint sparsity model.

• Finally, a suitable outlier rejection measure is developed for the multiple-object case that obviates the need to build anomalous event classes, and enables unsupervised anomaly detection with high accuracy (note, labeled training for normal events is still assumed available).

We evaluate our algorithms by testing on several real transportation data sets for both single-object and multiple-object anomalies. Additionally, both supervised and unsupervised scenarios are used in testing. In the supervised case, training dictionaries contain examples of both normal and anomalous trajectories. In the unsupervised case, only normal event trajectories are available for training and the aforementioned outlier rejection measure is used for anomaly detection. The test datasets include the well-known CAVIAR and AVSS [29], [30] datasets and two transportation data sets provided by Xerox Corporation. To benchmark our findings, we compare our results against widely cited trajectory based techniques. These include: a.) the one-class SVM-based method by Piciarelli et al. [18], and b.) the multiple-object tracking and anomaly detection approach of Han et al. [31]. For single-object anomaly detection, we also compare against the recent proposal of Li et al. [8], which was the first approach to suggest sparsity for video anomaly detection. Our experimental results include confusion matrices and ROC curves, obtained across a variety of real-world scenarios, which reveal that our proposed sparsity models outperform the alternatives.

The rest of the paper is organized as follows. Section II briefly reviews sparsity based video anomaly detection as first proposed by Li et al. Section III motivates and details our central contribution, which is a joint sparsity model. This involves setting up a new simultaneous sparsity optimization problem which effectively captures the representation of events as described by multiple trajectories. Section IV then presents a kernelization of this joint sparsity model and proposes a new algorithm for solving the optimization problem that results from the joint kernel sparsity model (JKSM). Section V presents experimental findings. We first show the merits of sparsity under occlusion leading to missing trajectory information. Next, we provide figures of merit such as confusion matrices as well as statistical evaluation via ROC curves on real transportation video data sets to establish the virtues of our proposed approach against well-known trajectory based anomaly detection methods. Section VI concludes the paper with a discussion of possible future research directions.


Fig. 1. An example illustration of trajectory classification using a sparse reconstruction model. The training dictionary consists of 2 classes (red trajectories are class 1, blue trajectories are class 2), and each class contains 4 different trajectories. The test trajectory (the green trajectory) can be well represented by the linear combination of trajectory no. 1 and trajectory no. 3 from class 1.

II. REVIEW OF SPARSITY-BASED ANOMALY DETECTION

Abnormal behavior detection via sparse reconstruction analysis of trajectories [8] is a recent and promising idea in the field of video anomaly detection. The fundamental underlying assumption in [8] is that any new trajectory can approximately be modeled as a (sparse) linear combination of training trajectories (or equivalent features). Let each trajectory representation lie in $\mathbb{R}^n$, and let $T$ denote the number of training samples (i.e. example trajectory representations) from each of $K$ different classes, i.e. behavior patterns in a video which may be normal or anomalous. The $T$ training samples (trajectory representations) from the $i$-th class are arranged as the columns of a matrix $A_i \in \mathbb{R}^{n\times T}$. The dictionary $A \in \mathbb{R}^{n\times KT}$ of training samples from all classes is then formed as $A = [A_1\ A_2\ \ldots\ A_K]$.

Given a sufficient number of training samples from the $m$-th trajectory class, a test trajectory $\mathbf{y} \in \mathbb{R}^n$ from the same class is conjectured to approximately lie in the linear span of those training samples. Any trajectory feature vector is synthesized by a linear combination of the set of all training trajectory samples as follows:

$$\mathbf{y} \approx A\boldsymbol{\alpha} = [A_1\ A_2\ \ldots\ A_K]\begin{bmatrix}\boldsymbol{\alpha}_1\\ \boldsymbol{\alpha}_2\\ \vdots\\ \boldsymbol{\alpha}_K\end{bmatrix}, \qquad (1)$$

where each $\boldsymbol{\alpha}_i \in \mathbb{R}^T$. Typically, for an example trajectory $\mathbf{y}$, only one of the $\boldsymbol{\alpha}_i$'s will be active (corresponding to the class/event from which $\mathbf{y}$ is generated). Thus the coefficient vector $\boldsymbol{\alpha} \in \mathbb{R}^{KT}$ is modeled as sparse and is recovered by solving the following optimization problem:

$$\boldsymbol{\alpha} = \arg\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_1 \quad \text{subject to} \quad \|\mathbf{y} - A\boldsymbol{\alpha}\|_2 < \varepsilon, \qquad (2)$$

where the objective is a convex surrogate for minimizing the number of non-zero elements in $\boldsymbol{\alpha}$. The $\ell_1$ norm can capture sparsity in a convex set-up [32] but has been shown to be quite expensive [11]. The residual error between the test trajectory and each class behavior pattern is computed to find the class to which the test trajectory belongs:

$$r_i(\mathbf{y}) = \|\mathbf{y} - A_i\boldsymbol{\alpha}_i\|_2, \quad i = 1,2,\ldots,K. \qquad (3)$$

Fig. 1 shows an example of classification using the sparsity model. The training dictionary consists of 2 classes, and each class contains 4 different trajectories. The test trajectory can be well represented by the linear combination of trajectory no. 1 and trajectory no. 3 from class 1 (see Fig. 1). This is in fact tantamount to saying that the coefficient vector $\boldsymbol{\alpha}$ is indeed sparse - in this example, two of eight entries being active.
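To make the classification rule concrete, the following Python sketch mirrors Eqs. (1)-(3) under simplifying assumptions: a Lasso solve (scikit-learn) stands in for the constrained l1 problem of Eq. (2), and the function name, regularization weight lam, and data layout are illustrative choices rather than the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A_classes, y, lam=0.01):
    # A_classes: list of K matrices, each n x T, of training trajectory features.
    # Returns the predicted class index (Eq. (3)) and the sparse code (Eq. (2), Lasso surrogate).
    A = np.hstack(A_classes)                              # dictionary A = [A_1 ... A_K]
    alpha = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A, y).coef_
    residuals, start = [], 0
    for A_i in A_classes:                                 # class-wise residuals r_i(y)
        T = A_i.shape[1]
        residuals.append(np.linalg.norm(y - A_i @ alpha[start:start + T]))
        start += T
    return int(np.argmin(residuals)), alpha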

Based on the aforementioned model, Li et al. [8] identify anomalies by a priori defining normal and anomalous event classes in video. Their advocacy of sparsity over other trajectory classification techniques, viz. one-class SVMs [18], is based on two arguments: 1.) recent work in face recognition [10] has shown that sparsity based classification can be powerful even when feature descriptions are missing, e.g. occlusion of objects leading to missing trajectory information, and 2.) the sparse coefficients can additionally withstand noise and other image distortions. The claim of noise/distortion robustness, particularly of l1-type sparse representations, has been challenged recently [11], [12] and [13].

Nevertheless, in models based on a training dictionary, sparsity is a powerful prior, particularly in multi-object scenarios where the structure of the sparse coefficients in a matrix can be very instrumental both in normal vs. anomalous event classification and in developing outlier rejection measures. The next section details our first contribution.

III. A JOINT SPARSITY MODEL FOR TRAJECTORY BASED VIDEO ANOMALY DETECTION

A. Observations and Motivation for a Joint Sparsity Model

A variety of anomaly detection algorithms have been designed for video surveillance. However, only a few of them have considered the interaction between multiple objects [4], [6], [31]. While many anomalies are generated by the atypical trajectory/behavior of a single object, "collective anomalies" that are caused by the joint behavior of multiple objects are also significant. For example, in the area of transportation, some events, e.g. accidents and dangerous driver-pedestrian behavior, are indeed based on joint and not just individual object behavior. Previous methods have employed probabilistic models to learn the relationship between different individual events. Han et al. [31] and Vaswani et al. [6] use an HMM-based method to track multiple trajectories followed by defining a set of rules to distinguish between normal and anomalous events. Wang et al. [4] present an unsupervised framework using hierarchical Bayesian models to model individual events and the interactions between them.

The sparsity based approach reviewed in Section II, while powerful, does not capture interactions and hence cannot detect anomalies involving two or more objects. We describe next a new "joint sparsity model" for video anomaly detection which incorporates multiple object trajectories and their interactions. Hence, even if the trajectories may individually be considered normal, "collective anomalies" could occur and can be successfully detected in our proposed framework.


B. A Joint Sparsity Model for Anomaly Detection

We are interested in the detection of anomalies involving $P \geq 1$ objects. Their corresponding $P$ trajectories can be represented as a matrix $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \ldots\ \mathbf{y}_P] \in \mathbb{R}^{n\times P}$, where $\mathbf{y}_i$ corresponds to the $i$-th trajectory. The training dictionary can be defined as $A = [A_1\ A_2\ \ldots\ A_P] \in \mathbb{R}^{n\times PKT}$, where each dictionary $A_i = [A_{i,1}\ A_{i,2}\ \ldots\ A_{i,K}] \in \mathbb{R}^{n\times KT}$, $i = 1,2,\ldots,P$, is formed by the concatenation of the sub-dictionaries from all classes belonging to the $i$-th trajectory. The crucial aspect of this formulation is that the training trajectories for any class $j$, i.e. $A_{i,j}$, $i = 1,2,\ldots,P$, are observed "jointly" from example videos. This generalizes the set-up of [8], [9].

The $P$ test trajectories can now be represented as a linear combination of training samples as follows:

$$\mathbf{Y} \approx AS = [A_{1,1}\ A_{1,2}\ \ldots\ A_{1,K}\ \ldots\ A_{P,1}\ A_{P,2}\ \ldots\ A_{P,K}]\,[\boldsymbol{\alpha}_1\ \ldots\ \boldsymbol{\alpha}_P], \qquad (4)$$

where the coefficient vectors $\boldsymbol{\alpha}_i$ lie in $\mathbb{R}^{PKT}$ and $S = [\boldsymbol{\alpha}_1\ \ldots\ \boldsymbol{\alpha}_i\ \ldots\ \boldsymbol{\alpha}_P]$.

It is important to note that the i-th object trajectory of any observed set of test trajectories should only lie in the span of training trajectories corresponding to the i-th object. Therefore, the columns of S should have the following structure:

$$\boldsymbol{\alpha}_1 = \begin{bmatrix}\boldsymbol{\alpha}_{1,1}\\ \vdots\\ \boldsymbol{\alpha}_{1,K}\\ \mathbf{0}\\ \vdots\\ \mathbf{0}\end{bmatrix},\qquad \boldsymbol{\alpha}_i = \begin{bmatrix}\mathbf{0}\\ \vdots\\ \boldsymbol{\alpha}_{i,1}\\ \vdots\\ \boldsymbol{\alpha}_{i,K}\\ \vdots\\ \mathbf{0}\end{bmatrix},\qquad \boldsymbol{\alpha}_P = \begin{bmatrix}\mathbf{0}\\ \vdots\\ \mathbf{0}\\ \boldsymbol{\alpha}_{P,1}\\ \vdots\\ \boldsymbol{\alpha}_{P,K}\end{bmatrix}, \qquad (5)$$

where each of the sub-vectors $\boldsymbol{\alpha}_{i,j}$, $j = 1,\ldots,K$, $i = 1,2,\ldots,P$, lies in $\mathbb{R}^T$, while $\mathbf{0}$ denotes a vector of all zeros in $\mathbb{R}^{KT}$. As a result, S exhibits a block-diagonal structure.

From [8], we know that for a single object, its trajectory can be represented by a sparse linear combination of all the training samples. For the multiple-trajectory scenario, we assume that training samples with non-zero weights (in the sparse linear combination) exhibit a one-to-one correspondence across the different trajectories. In other words, if the i-th trajectory's training sample from the j-th class is chosen for the i-th test trajectory, then with very high probability the other P − 1 trajectories also choose training samples from the j-th class, albeit with possibly different weights.

We take a simple scenario with only 2 objects and 2 training classes (a normal and an anomalous class) as an example to explain the structure of Eq. (4). In this situation, P = 2, K = 2, and Eq. (4) becomes:

$$\mathbf{Y} \approx AS = [A_{1,1}\ A_{1,2}\ A_{2,1}\ A_{2,2}]\begin{bmatrix}\boldsymbol{\alpha}_{1,1} & \mathbf{0}\\ \boldsymbol{\alpha}_{1,2} & \mathbf{0}\\ \mathbf{0} & \boldsymbol{\alpha}_{2,1}\\ \mathbf{0} & \boldsymbol{\alpha}_{2,2}\end{bmatrix}. \qquad (6)$$

The test trajectory sample is thought of as a collective event. Therefore, all trajectories of the sample should be classified into one class. If the 1st trajectory is classified into the j-th class, the 2nd trajectory should also be classified into the j-th class, which means $\boldsymbol{\alpha}_{1,j}$ and $\boldsymbol{\alpha}_{2,j}$ should be activated simultaneously. This characteristic - that some coefficients should be activated jointly - captures the interaction between objects.

Moreover, we only care about the non-zero elements in the matrix S. Define a new matrix S′:

$$S' = \begin{bmatrix}\boldsymbol{\alpha}_{1,1} & \boldsymbol{\alpha}_{2,1}\\ \boldsymbol{\alpha}_{1,2} & \boldsymbol{\alpha}_{2,2}\end{bmatrix}. \qquad (7)$$

In the structure of S′, "joint coefficients" are moved into the same row. The joint information can be captured by enforcing certain entire rows of S′ to be activated simultaneously.

In general, when there are K classes and P objects, the structure of S′ is:

$$S' = \begin{bmatrix}\boldsymbol{\alpha}_{1,1} & \cdots & \boldsymbol{\alpha}_{i,1} & \cdots & \boldsymbol{\alpha}_{P,1}\\ \boldsymbol{\alpha}_{1,2} & \cdots & \boldsymbol{\alpha}_{i,2} & \cdots & \boldsymbol{\alpha}_{P,2}\\ \vdots & & \vdots & & \vdots\\ \boldsymbol{\alpha}_{1,K} & \cdots & \boldsymbol{\alpha}_{i,K} & \cdots & \boldsymbol{\alpha}_{P,K}\end{bmatrix} \in \mathbb{R}^{KT\times P}. \qquad (8)$$

The question that remains to be addressed is the particular way of transforming S into S′. Such a transformation is realized by defining matrices $H \in \mathbb{R}^{PKT\times P}$ and $J \in \mathbb{R}^{KT\times PKT}$:

$$H = \begin{bmatrix}\mathbf{1} & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{1} & \cdots & \mathbf{0}\\ \vdots & \vdots & & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}\end{bmatrix}, \qquad J = \begin{bmatrix}I_{KT} & I_{KT} & \cdots & I_{KT}\end{bmatrix}. \qquad (9)$$

The vectors $\mathbf{1}$ and $\mathbf{0}$ are in $\mathbb{R}^{KT}$ and contain all ones and zeros respectively, and $I_{KT}$ is the KT-dimensional identity matrix.

With these definitions,

$$H \circ S = \begin{bmatrix}\boldsymbol{\alpha}_{1,1} & \cdots & \mathbf{0} & \cdots & \mathbf{0}\\ \vdots & & \vdots & & \vdots\\ \boldsymbol{\alpha}_{1,K} & \cdots & \boldsymbol{\alpha}_{i,1} & \cdots & \mathbf{0}\\ \mathbf{0} & & \vdots & & \vdots\\ \vdots & \cdots & \boldsymbol{\alpha}_{i,K} & \cdots & \boldsymbol{\alpha}_{P,1}\\ \vdots & & \vdots & & \vdots\\ \mathbf{0} & \cdots & \mathbf{0} & \cdots & \boldsymbol{\alpha}_{P,K}\end{bmatrix}, \qquad (10)$$

and we have

$$J(H \circ S) = S', \qquad (11)$$

where $\circ$ indicates the matrix Hadamard (entry-wise) product. Therefore, we can now solve for the sparse coefficients via the following optimization problem:

$$\text{minimize } \|J(H \circ S)\|_{\text{row},0} \quad \text{subject to } \|\mathbf{Y} - AS\|_F \leq \varepsilon, \qquad (12)$$

where $\|\cdot\|_{\text{row},0}$ refers to the number of non-zero rows in the matrix and the cost function minimization seeks the $J(H \circ S)$ with the minimum number of non-zero rows, while the constraint ensures good approximation ($\|\cdot\|_F$ denotes the Frobenius norm). It is worth re-emphasizing that the matrix S′ has a special structure - elements in the same row should be activated simultaneously - which is captured by using the matrix $\|\cdot\|_{\text{row},0}$ norm. Enforcing sparsity of S′ (equivalently S) using other traditional matrix norms, e.g. entry-wise $l_p$ norms, can be detrimental to performance [33], [34], particularly under the quadratic reconstruction constraint, because they could lead to solutions that depart significantly from row sparsity.
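The transformation in Eqs. (9)-(11) is easy to verify numerically. The short NumPy sketch below (the sizes P, K, T are arbitrary illustrative values) builds H and J and checks that J(H ∘ S) indeed stacks the P diagonal blocks of S into the KT × P matrix S′.

import numpy as np

P, K, T = 3, 4, 5                                 # illustrative sizes
KT = K * T
S = np.random.randn(P * KT, P)                    # coefficient matrix as in Eq. (4)

H = np.kron(np.eye(P), np.ones((KT, 1)))          # Eq. (9): column i selects the i-th block of KT rows
J = np.tile(np.eye(KT), (1, P))                   # Eq. (9): P horizontally stacked KT x KT identities

S_prime = J @ (H * S)                             # Eq. (11): S' = J (H o S), shape (KT, P)

# Direct block extraction for comparison: column i of S' is the i-th diagonal block of S.
S_prime_direct = np.column_stack([S[i*KT:(i+1)*KT, i] for i in range(P)])
assert np.allclose(S_prime, S_prime_direct)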


The well-known row sparsity problem

$$\text{minimize } \|S\|_{\text{row},0} \quad \text{subject to } \|\mathbf{Y} - AS\|_F \leq \varepsilon, \qquad (13)$$

is non-convex but can be solved using greedy pursuit algorithms widely used in the literature. Simultaneous Orthogonal Matching Pursuit (SOMP) [34], [35] - enumerated in Algorithm 1 - is amongst the most popular such algorithms. In SOMP, the support of the solution is sequentially updated (i.e., the atoms in the dictionary A are sequentially selected). At each iteration, the atom that simultaneously yields the best approximation to all of the residual vectors is selected:

$$\lambda_k = \arg\max_i \|R_{k-1}^T \mathbf{a}_i\|_2. \qquad (14)$$

Our proposed joint sparsity model for representing multiple object trajectories involves solving Eq. (12), which looks quite similar to Eq. (13), but the Hadamard operator that maps S to S′ makes the problem much more involved. We can observe from Algorithm 1 that, over $k_0$ iterations, the original SOMP algorithm effectively selects $k_0$ distinct atoms from the dictionary A that best approximate the data matrix Y; we apply the same general formulation even when the Hadamard operator is present. At every iteration k, SOMP measures the correlation of each atom in A with the current residuals, selects the atom with the highest correlation, and recomputes an orthogonal projection.

Algorithm 1 SOMP
Input: Dictionary A = [a_1 a_2 ... a_{PKT}], data matrix Y = [y_1 y_2 ... y_P], stopping criterion ‖Y − A_{Λ_k} S_k‖_F / ‖Y − A_{Λ_{k−1}} S_{k−1}‖_F > 1 − μ, where μ is a small positive number
1: Initialization: residual R_0 = Y, index set Λ_0 = ∅, iteration counter k = 1
2: while the stopping criterion has not been met do
  1) Find the index of the atom that best approximates all residuals: λ_k = argmax_i ‖R_{k−1}^T a_i‖_2
  2) Update the index set Λ_k = Λ_{k−1} ∪ {λ_k}
  3) Compute G_k = (A_{Λ_k}^T A_{Λ_k})^{−1} A_{Λ_k}^T Y, where A_{Λ_k} consists of the k atoms in A indexed by Λ_k
  4) Determine the residual R_k = Y − A_{Λ_k} G_k
  5) k ← k + 1
3: end while
Output: Index set Λ = Λ_{k−1}; the sparse representation S whose nonzero rows, indexed by Λ, are the k rows of the matrix (A_Λ^T A_Λ)^{−1} A_Λ^T Y

This idea can be extended to our proposed joint sparsity setting. If the atom selected for the j-th trajectory comes from the i-th training sample, the atoms for the other P − 1 trajectories should also be chosen from the i-th training sample. Then Eq. (14) in SOMP is modified as:

$$\lambda_k = \arg\max_i \sum_j \|R_{j,k-1}^T \mathbf{a}_{j,i}\|_2, \qquad (15)$$

where $R_{j,k-1}$ refers to the residual of the j-th trajectory at iteration k − 1, and $\mathbf{a}_{j,i}$ represents the i-th training atom of the j-th trajectory. After employing this rule for atom selection, each row of the parameter matrix S′ will be activated or deactivated simultaneously; thus the row sparsity requirement inherently holds.

Fig. 2. An example illustration of an anomalous event versus a normal event. To the left of the figure are the sparse coefficients of an anomalous event; the activated coefficients are scattered all over the normal classes. To the right of the figure are the sparse coefficients of a normal event; the activated coefficients are clustered in normal class 2.

The implementation details of this algorithm can be found in a technical report [36].
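The following Python sketch illustrates the modified selection rule of Eq. (15) under simplifying assumptions: A_list holds the per-object dictionaries A_i of size n × KT, a fixed number of atoms replaces the relative-residual stopping criterion of Algorithm 1, and the least-squares update is done per object on the shared support. Function and variable names are hypothetical.

import numpy as np

def joint_somp(A_list, Y, max_atoms=5, tol=1e-6):
    # A_list[j]: n x KT dictionary of the j-th object; Y[:, j]: j-th test trajectory.
    P = len(A_list)
    KT = A_list[0].shape[1]
    residuals = [Y[:, j].copy() for j in range(P)]
    support = []                                    # shared atom indices (same training sample for all objects)
    coeffs = np.zeros((KT, P))                      # per-object coefficients on the shared support

    for _ in range(max_atoms):
        # Eq. (15): score each candidate training index by summing correlations over objects.
        scores = np.zeros(KT)
        for j in range(P):
            scores += np.abs(A_list[j].T @ residuals[j])
        scores[support] = -np.inf                   # do not reselect atoms already in the support
        support.append(int(np.argmax(scores)))

        # Object-wise least squares on the shared support, then update the residuals.
        for j in range(P):
            A_sub = A_list[j][:, support]
            c, *_ = np.linalg.lstsq(A_sub, Y[:, j], rcond=None)
            coeffs[support, j] = c
            residuals[j] = Y[:, j] - A_sub @ c

        if sum(np.linalg.norm(r) for r in residuals) < tol:
            break
    return support, coeffs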

Supervised Anomaly Detection as Event Classification

Having obtained the sparse coefficient matrix S, we compute class-specific residual errors and identify the class of the test event Y as that which gives the minimum residual:

$$\text{identity}(\mathbf{Y}) = \arg\min_i \|\mathbf{Y} - A\,\delta_i(S)\|_F, \qquad (16)$$

where $\delta_i(S)$ is the matrix whose only nonzero entries are the same as those in S associated with class i (in all P trajectories). When sufficient representation (example training trajectories) for anomalous events is available, the anomalous classes are simply one or more of the K classes in this joint sparsity based classification framework.

Unsupervised Anomaly Detection via Outlier Rejection

If training for anomalies is missing or statistically insignificant, we cannot use (16) to identify anomalies. Instead, we model anomalies as outliers, given training from the expected normal event classes that form the dictionary A. We are inspired by the outlier rejection measure in [10]:

$$\text{SCI}(\boldsymbol{\alpha}) = \frac{K \cdot \max_i \|\rho_i(\boldsymbol{\alpha})\|_1 / \|\boldsymbol{\alpha}\|_1 - 1}{K - 1}, \qquad \text{SCI}(\boldsymbol{\alpha}) < \tau_1 \;\rightarrow\; \mathbf{y} : \text{outlier}, \qquad (17)$$

where $\rho_i(\boldsymbol{\alpha})$ is the vector whose only nonzero entries are the entries in $\boldsymbol{\alpha}$ that are associated with class i. Eq. (17) can be used to detect single object anomalies. We extend it to the multiple object case:

$$\text{JSCI}(S') = \frac{K \cdot \max_i \|\delta_i(S')\|_{\text{row},0} / \|S'\|_{\text{row},0} - 1}{K - 1}, \qquad (18)$$

where JSCI is the Joint-SCI. Note that 0 ≤ JSCI(S′) ≤ 1; if JSCI(S′) is close to 1, the activated coefficients are concentrated in a single normal class and the event is normal, whereas if JSCI(S′) is close to 0, the coefficients are scattered and the event is anomalous. We choose a threshold τ_2 ∈ (0,1); if JSCI(S′) < τ_2, a multiple object anomaly is identified. A nominal choice of τ_2 = 0.5 can be made, but this can further be optimized experimentally based on the underlying video data set by observing the range of the measure JSCI(S′) for normal events.
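A small sketch of the JSCI computation in Eq. (18) follows; S_prime is the KT × P matrix of Eq. (8) with rows grouped into K blocks of T rows (one block per training class), and the tolerance tol used to decide whether a row is non-zero is an illustrative choice.

import numpy as np

def jsci(S_prime, K, tol=1e-8):
    # Joint sparsity concentration index of Eq. (18).
    KT = S_prime.shape[0]
    T = KT // K
    row_active = np.linalg.norm(S_prime, axis=1) > tol       # non-zero rows of S'
    total = row_active.sum()                                  # ||S'||_{row,0}
    if total == 0:
        return 0.0                                            # no active rows: treat as scattered/anomalous
    # ||delta_i(S')||_{row,0}: number of active rows falling in each class block
    per_class = [row_active[i*T:(i+1)*T].sum() for i in range(K)]
    return (K * max(per_class) / total - 1.0) / (K - 1.0)

# An event is flagged as a multi-object anomaly when jsci(S_prime, K) < tau2,
# e.g. tau2 = 0.5 as the nominal choice discussed above.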

Fig. 2 shows an example of sparse coefficients under normal versus anomalous events.


Fig. 3. A kernel function is used to project non-linearly separable data into a higher-dimensional space in which it becomes linearly separable. Figure obtained from [38].

To the left of the figure are the sparse coefficients of an anomalous event: the activated coefficients are scattered across the normal classes, so the corresponding event cannot be classified into any single class and the JSCI value in Eq. (18) will therefore be small. On the other hand, the activated coefficients in the right figure are clustered in normal class 2; correspondingly, the JSCI value will be large.

IV. KERNEL SPARSITY FOR TRAJECTORY BASED VIDEO ANOMALY DETECTION

The effectiveness of the proposed joint sparsity model largely depends on the structure of the trajectory data. If the data is not sufficiently linearly separable, the trajectory-based sparsity model may not enable sufficiently accurate reconstruction to be reliable from a classification standpoint. Kernel-based algorithms [37] implicitly exploit the higher-order nonlinear structure of the data which is not captured by linear models. Kernel methods can be applied to transform the data into a feature space via a transformation φ(·) such that the resulting transformed trajectory vectors become more separable [23]–[25] and comply with the linear sparsity model.

The kernel function $\kappa : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$ is usually defined as the inner product $\kappa(\mathbf{x},\mathbf{z}) = \langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle$.

The benefits of a kernel function are illustrated in Fig. 3. In the left figure, it is impossible to use a linear classifier to separate the two types of data. After applying the kernel function, however, the data is projected into a higher-dimensional space where it becomes more linearly separable.

Kernel Sparsity Model for Object Trajectory

The transformed training trajectory vectors are written as $\mathbf{a}_i \mapsto \phi(\mathbf{a}_i)$, where $\mathbf{a}_i$ is the i-th column of A. Let $\phi(\mathbf{y})$ denote the representation of the test trajectory in this space. The test trajectory of the original sparsity model in Eq. (1) can now be represented as follows:

$$\phi(\mathbf{y}) = \underbrace{[\phi(\mathbf{a}_1)\ \cdots\ \phi(\mathbf{a}_{KT})]}_{A_\phi}\ \underbrace{[\alpha'_1\ \cdots\ \alpha'_{KT}]^T}_{\boldsymbol{\alpha}'} = A_\phi \boldsymbol{\alpha}', \qquad (19)$$

where $\boldsymbol{\alpha}'$ is also assumed to be sparse.

Similar to Eq. (2), the new sparse coefficient vector $\boldsymbol{\alpha}' \in \mathbb{R}^{KT}$ can be recovered by solving:

$$\boldsymbol{\alpha}' = \arg\min_{\boldsymbol{\alpha}'} \|\boldsymbol{\alpha}'\|_1 \quad \text{subject to} \quad \|\phi(\mathbf{y}) - A_\phi \boldsymbol{\alpha}'\|_2 < \varepsilon. \qquad (20)$$

The problem in Eq. (20) can be approximately solved by kernel orthogonal/basis matching pursuit algorithms [26]–[28], [39]. Note that in the above problem formulation, we are solving for the sparse vector $\boldsymbol{\alpha}'$ directly in the feature space using the implicit feature vectors, rather than evaluating the kernel functions at the training points.

Algorithm 2 KSOMP
Input: Dictionary A = [a_1 a_2 ... a_{PKT}], data matrix Y = [y_1 y_2 ... y_P], kernel function κ, stopping criterion ‖Y_φ − A_{φ,Λ_k} S_{φ,k}‖_F / ‖Y_φ − A_{φ,Λ_{k−1}} S_{φ,k−1}‖_F > 1 − μ
1: Initialization: compute the kernel matrices Θ_A ∈ R^{PKT×PKT}, whose (i, j)-th entry is κ(a_i, a_j), and Θ_{A,Y} ∈ R^{PKT×P}, whose (i, j)-th entry is κ(a_i, y_j). Set the index set Λ_0 = argmax_i ‖(Θ_{A,Y})_{i,:}‖_2 and the iteration counter t = 1.
2: while the stopping criterion has not been met do
  1) Compute the correlation matrix C = Θ_{A,Y} − (Θ_A)_{:,Λ_{t−1}}((Θ_A)_{Λ_{t−1},Λ_{t−1}} + λI)^{−1}(Θ_{A,Y})_{Λ_{t−1},:}
  2) Select the new index λ_t = argmax_i ‖C_{i,:}‖_2
  3) Update the index set Λ_t = Λ_{t−1} ∪ {λ_t}
  4) t ← t + 1
3: end while
Output: Index set Λ = Λ_{t−1}; the sparse representation S_φ whose nonzero rows, indexed by Λ, are given by S_φ = ((Θ_A)_{Λ,Λ} + λI)^{−1}(Θ_{A,Y})_{Λ,:}

The well-known row sparsity problem in Eq. (13) can be extended to the (kernelized) feature space as follows:

$$\text{minimize } \|S_\phi\|_{\text{row},0} \quad \text{subject to } \|\mathbf{Y}_\phi - A_\phi S_\phi\|_F \leq \varepsilon. \qquad (21)$$

We propose the kernel SOMP algorithm, i.e. KSOMP, in order to solve Eq. (21) - see Algorithm 2. Note that a regularization term λI is added in Algorithm 2 in order to enable a stable inversion. Like SOMP (Algorithm 1), the goal of KSOMP is to pick out the non-zero rows that minimize ‖S_φ‖_{row,0}, but in the transformed kernel space.

The kernelized version of our proposed joint sparsity model in Eq. (12) is given as:

$$\text{minimize } \|J(H \circ S_\phi)\|_{\text{row},0} \quad \text{subject to } \|\mathbf{Y}_\phi - A_\phi S_\phi\|_F \leq \varepsilon, \qquad (22)$$

which is very similar to Eq. (21) but for the presence of the matrices J, H and the Hadamard operator. Recalling the modification to SOMP employed in Section III, a similar trick allows us to adapt kernelized SOMP to yield a solution for our problem in Eq. (22). In particular, instead of using the selection rule

$$\lambda_t = \arg\max_i \|C_{i,:}\|_2 \qquad (23)$$

in Step 2 of KSOMP, we jointly select atoms of all trajectories from the same training sample:

$$\lambda_t = \arg\max_i \sum_j \|C^j_{i,:}\|_2, \qquad (24)$$

where $C^j$ refers to the correlation matrix of the j-th trajectory.
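A sketch of the kernelized joint selection rule of Eq. (24) is shown below, again under simplifying assumptions: per-object RBF kernel matrices take the place of the full Θ_A and Θ_{A,Y} blocks of Algorithm 2, a fixed atom budget replaces the stopping criterion, and the ridge parameter lam plays the role of the stabilizing term λI. All names are illustrative.

import numpy as np

def rbf(X, Z, gamma=1.0):
    # RBF kernel matrix between the columns of X (n x N) and Z (n x M).
    d2 = (np.sum(X**2, axis=0)[:, None] + np.sum(Z**2, axis=0)[None, :] - 2.0 * X.T @ Z)
    return np.exp(-gamma * d2)

def joint_ksomp(A_list, Y, gamma=1.0, lam=1e-3, max_atoms=5):
    # A_list[j]: n x KT dictionary of the j-th object; Y[:, j]: j-th test trajectory.
    P = len(A_list)
    KT = A_list[0].shape[1]
    K_AA = [rbf(A, A, gamma) for A in A_list]                            # per-object KT x KT kernels
    k_Ay = [rbf(A, Y[:, [j]], gamma).ravel() for j, A in enumerate(A_list)]  # per-object KT vectors

    support = []
    for _ in range(max_atoms):
        scores = np.zeros(KT)
        for j in range(P):
            if support:
                G = np.linalg.inv(K_AA[j][np.ix_(support, support)] + lam * np.eye(len(support)))
                corr = k_Ay[j] - K_AA[j][:, support] @ (G @ k_Ay[j][support])
            else:
                corr = k_Ay[j]
            scores += np.abs(corr)                                       # Eq. (24): sum correlations over objects
        scores[support] = -np.inf
        support.append(int(np.argmax(scores)))

    # Coefficients on the final support, per object (regularized kernel solve as in Algorithm 2).
    coeffs = np.zeros((KT, P))
    for j in range(P):
        G = np.linalg.inv(K_AA[j][np.ix_(support, support)] + lam * np.eye(len(support)))
        coeffs[support, j] = G @ k_Ay[j][support]
    return support, coeffs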


Note that in the transformed space, the residual can be expanded as:

$$\begin{aligned}
\|\phi(\mathbf{y}) - A_\phi \boldsymbol{\alpha}'\|_2 &= \Big[\sum_i \big(\phi(\mathbf{y})_i - (A_\phi\boldsymbol{\alpha}')_i\big)^2\Big]^{\frac{1}{2}} \\
&= \Big[\sum_i \Big(\phi(\mathbf{y})_i - \sum_{j=1}^{KT} \alpha'_j (A_\phi)_{i,j}\Big)^2\Big]^{\frac{1}{2}} \\
&= \Big[\sum_i \phi(\mathbf{y})_i^2 - 2\sum_{j=1}^{KT}\alpha'_j \sum_i \phi(\mathbf{y})_i (A_\phi)_{i,j} + \sum_i \Big(\sum_{j=1}^{KT}\alpha'_j (A_\phi)_{i,j}\Big)^2\Big]^{\frac{1}{2}} \\
&= \big(\kappa(\mathbf{y},\mathbf{y}) - 2\boldsymbol{\alpha}'^T \boldsymbol{\theta}_{A,\mathbf{y}} + \boldsymbol{\alpha}'^T \Theta_A \boldsymbol{\alpha}'\big)^{\frac{1}{2}}.
\end{aligned}$$

Let $S_\phi$ be the optimum solution of Eq. (22). The residual for the kernelized joint sparsity model corresponding to the i-th class is then given by:

$$\|\mathbf{Y}_\phi - A_\phi \delta_i(S_\phi)\|_F = \Big(\sum_{j=1}^{P} \|\phi(\mathbf{y}_j) - A_\phi(\delta_i(S_\phi))_{:,j}\|_2^2\Big)^{\frac{1}{2}}, \qquad (25)$$

where $\phi(\mathbf{y}_j)$ is the j-th column of $\mathbf{Y}_\phi$ and $(\delta_i(S_\phi))_{:,j}$ is the j-th column of $\delta_i(S_\phi)$. More precisely, in terms of the kernel function, this is given by:

$$r_i(\mathbf{Y}_\phi) = \Big(\sum_{j=1}^{P}\big[\kappa(\mathbf{y}_j,\mathbf{y}_j) - 2(\delta_i(S_\phi))_{:,j}^T(\Theta_{A,Y})_{\Omega_i,j} + (\delta_i(S_\phi))_{:,j}^T(\Theta_A)_{\Omega_i,\Omega_i}(\delta_i(S_\phi))_{:,j}\big]\Big)^{\frac{1}{2}},$$

where $\Omega_i$ is the index set associated with the i-th training class.

Supervised Anomaly Detection as Event Classification

Similar to Eq. (16), the class of $\mathbf{Y}_\phi$ is determined by:

$$\text{identity}(\mathbf{Y}_\phi) = \arg\min_i \|\mathbf{Y}_\phi - A_\phi \delta_i(S_\phi)\|_F, \qquad (26)$$

where $\delta_i(S_\phi)$ is the matrix whose only nonzero entries are the same as those in $S_\phi$ associated with class i.

Unsupervised Anomaly Detection via Outlier Rejection

As in Section III, we can define a transformed coefficient matrix $S'_\phi = J(H \circ S_\phi)$. The outlier rejection measure in Eq. (18) can then be extended here as:

$$\text{JSCI}(S'_\phi) = \frac{K \cdot \max_i \|\delta_i(S'_\phi)\|_{\text{row},0} / \|S'_\phi\|_{\text{row},0} - 1}{K - 1}. \qquad (27)$$

Kernel Parameter Optimization

We focus on picking parameters for the RBF kernel $\kappa(\mathbf{x},\mathbf{z}) = e^{-\gamma\|\mathbf{x}-\mathbf{z}\|^2}$, which will be used in all our experiments. For different choices of the RBF parameter γ, multiple training dictionaries $A_\phi(\gamma)$ are generated, i.e. $A_\phi(\gamma)$ is a function of γ. Inspired by cross-validation, we split the training data $A_\phi(\gamma)$ into two subsets $B_\phi(\gamma)$ and $C_\phi(\gamma)$, such that both $B_\phi(\gamma)$ and $C_\phi(\gamma)$ have representation from the K classes. Now, if the dictionary in the sparsity model is chosen to be equal to $B_\phi(\gamma)$ and a (transformed) test trajectory is picked from $C_\phi(\gamma)$, then ideally we expect perfect classification into one of the K classes. Therefore, a good kernel is one that enables close-to-ideal classification of test samples from $C_\phi(\gamma)$ - which means that only a small number of entries of $S_\phi(\gamma)$ are activated (non-zero), and for one particular class. Here we use the outlier rejection measure

$$\text{JSCI}(S'_\phi(\gamma)) = \frac{K \cdot \max_i \|\delta_i(S'_\phi(\gamma))\|_1 / \|S'_\phi(\gamma)\|_1 - 1}{K - 1}, \qquad (28)$$

where $\delta_i(S'_\phi(\gamma))$ is the vector whose only nonzero entries are the same as those in $S'_\phi(\gamma)$ associated with class i. $\text{JSCI}(S'_\phi(\gamma))$ will be very close to 1 if the classification is accurate. Therefore the best parameter γ can be chosen by solving the following kernel parameter optimization problem:

$$\arg\max_\gamma\ \text{JSCI}(S'_\phi(\gamma)). \qquad (29)$$
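A minimal grid-search sketch of the γ selection in Eq. (29) follows: held-out trajectories from the C split are coded against the dictionary half B for each candidate γ, and the γ with the largest average JSCI is kept. The helpers solve_sparse_codes (e.g. the KSOMP sketch above) and jsci (the sketch after Eq. (18)) are placeholders passed in as arguments, and the candidate grid is an assumption.

import numpy as np

def select_gamma(B, C_samples, gammas, K, solve_sparse_codes, jsci):
    # B: dictionary half of the training data; C_samples: held-out training trajectories.
    best_gamma, best_score = None, -np.inf
    for gamma in gammas:
        scores = [jsci(solve_sparse_codes(B, y, gamma), K) for y in C_samples]
        avg = float(np.mean(scores))
        if avg > best_score:                      # Eq. (29): keep the gamma maximizing JSCI
            best_gamma, best_score = gamma, avg
    return best_gamma

# Example (illustrative grid): select_gamma(B, C_samples, gammas=np.logspace(-3, 1, 9),
#                                           K=10, solve_sparse_codes=..., jsci=jsci)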

Computational Complexity

Suppose that the dimension of a trajectory is n, i.e. y ∈ R^n. For the SOMP Algorithm 1, the complexity is O(nKPT) [40] in our set-up, where K, P and T were defined in Section III and refer to the number of classes, objects and training samples per class respectively. Changing the selection rule in our modified version of SOMP does not increase the complexity much, so the computational complexity of our joint sparsity model is also ≈ O(nKPT). The kernelized joint sparsity model has to compute the kernel matrix Θ_A and the correlation matrix C; the computational complexity therefore increases to O(nKP²T). Note that the number of objects P is usually small. Therefore, the kernel trick does not significantly increase complexity over sparsity based anomaly detection while providing significant performance improvements, as will be seen shortly in Section V.

V. EXPERIMENTAL VALIDATION

A. Trajectory Extraction

Trajectory extraction is accomplished using well-known techniques. First, background subtraction is performed via the use of a Gaussian Mixture Model (GMM) [41]. In order to eliminate the effect of noise, blob analysis is then used to identify the location of the moving vehicle: we calculate the number of connected foreground pixels and deem the connected segment to be a vehicle if this exceeds a threshold. As seen in Fig. 4(a), the car is successfully detected by this technique. Next, we calculate and track the centroid of the blob over time in order to obtain the object trajectory. Fig. 4(b) shows an example of the extracted trajectory, which is represented mathematically as a coordinate pair [x(t), y(t)]. Li et al. [8] use an LCSCA (Least-squares Cubic Spline Curves Approximation) representation of trajectories. We first approximate a raw trajectory using a basic B-spline function [42] with 50 knots (50 x-coordinates and 50 y-coordinates), and these knots are extracted to represent the trajectory.
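A minimal sketch of this extraction pipeline is given below, assuming OpenCV and SciPy are available; the foreground threshold, minimum blob area, and single-largest-blob heuristic are simplifying assumptions rather than the paper's exact settings.

import cv2
import numpy as np
from scipy.interpolate import splprep, splev

def extract_trajectory(video_path, min_area=500, n_knots=50):
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2()          # GMM background model
    centroids = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)
        # Blob analysis: connected components on the foreground mask.
        num, labels, stats, cents = cv2.connectedComponentsWithStats((mask > 200).astype(np.uint8))
        # Keep the largest foreground blob if it is big enough to be a vehicle.
        if num > 1:
            areas = stats[1:, cv2.CC_STAT_AREA]
            k = int(np.argmax(areas)) + 1
            if areas[k - 1] >= min_area:
                centroids.append(cents[k])              # (x, y) blob centroid
    cap.release()
    pts = np.array(centroids)                           # raw trajectory [x(t), y(t)]
    # B-spline approximation resampled at n_knots points.
    tck, _ = splprep([pts[:, 0], pts[:, 1]], s=len(pts))
    u = np.linspace(0, 1, n_knots)
    x_k, y_k = splev(u, tck)
    return np.concatenate([x_k, y_k])                   # 2*n_knots feature vector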

B. Brief Review of Competing Approaches

Here we elaborate on three widely cited techniques from the literature that will be used as benchmarks against which to compare our approach.


Fig. 4. Trajectory extraction: (a) background subtraction using Gaussian mixture models and blob analysis to identify objects, (b) blob centroid calculation and trajectory derivation by collecting the blob centroid over time.

Fig. 5. Illustration of the one-class SVM method by Piciarelli et al. [18] on 2-D data. Figure obtained from [18]. The classification hyperplane intersects the hypersphere, thus defining a hyperspherical cap containing the majority of the data, while outliers lie outside the cap.

Anomaly Detection via SVM Trajectory Clustering

Piciarelli et al. [18] propose a technique that uses one-class SVMs for anomaly detection, utilizing trajectory information as features. In their method, trajectories are represented using 8 pairs of x and y coordinates, thus leading to feature vectors composed of 16 elements. These trajectories are then clustered with a one-class SVM. During the training phase, a classification hyperplane is learned. Fig. 5 shows an illustration of their method on 2-D data. Based on the classification hyperplane, an outlier detection technique can be used to detect anomalies: if the angle θ_X between a test trajectory X and the center C is greater than a threshold, the test trajectory X is regarded as an anomaly:

$$\theta_X > \theta_{th} \;\rightarrow\; X : \text{outlier}. \qquad (30)$$
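For reference, a minimal one-class SVM baseline in the spirit of Piciarelli et al. [18] can be sketched with scikit-learn as below; the 16-dimensional features (8 resampled (x, y) pairs per trajectory), the nu and kernel settings, and the random data are illustrative assumptions, not the authors' configuration.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train_features = rng.normal(size=(200, 16))        # normal training trajectories (16-dim features)
test_features = rng.normal(size=(10, 16))          # trajectories to score

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train_features)
labels = ocsvm.predict(test_features)              # +1 = normal, -1 = outlier/anomaly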

Online Anomaly Detection using a Sparsity Model

Zhao et al. [9] propose an online sparsity-based method for video anomaly detection. They use sparse linear combinations of spatio-temporal volumes under an l1 sparsity model. But instead of using a fixed training dictionary and only solving for the sparse coefficients, they employ a principled convex optimization formulation that allows both a sparse reconstruction code and an online dictionary to be jointly inferred and updated:

$$(\boldsymbol{\alpha}_1^*,\ldots,\boldsymbol{\alpha}_m^*, A^*) = \arg\min_{\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_m, A} \sum_{i=1}^{m} J(\mathbf{y}_i, \boldsymbol{\alpha}_i, A), \qquad (31)$$

where $J(\mathbf{y}_i, \boldsymbol{\alpha}_i, A)$ is the objective function that measures how normal an event is. It includes a reconstruction error, sparsity regularization via the l1 norm, and smoothness terms. Subsequent to the optimization, the dictionary A is augmented using newly observed events.

For anomaly detection, they define a threshold ε that controls the sensitivity of the algorithm to anomalous events. A test spatio-temporal volume y′ is detected as an anomalous event if the following criterion is satisfied:

$$J(\mathbf{y}', \boldsymbol{\alpha}', A^*) > \varepsilon. \qquad (32)$$

Our implementation of Zhao's method is based on the pseudocode in Algorithm 1 on page 4 of their paper [9].

Multi-Object Trajectory Tracking and Anomaly Detection

Han et al. [31] propose a multiple object tracking algorithm and a corresponding rule-based anomaly detection approach. In their tracking algorithm, each object is identified by an index i and its state at time t is represented by:

$$\mathbf{x}_i^t = (\mathbf{p}_i^t, \mathbf{v}_i^t, \mathbf{a}_i^t, s_i^t), \qquad (33)$$

where $\mathbf{p}_i^t$ is the image location, $\mathbf{v}_i^t$ represents the 2D velocity, and $\mathbf{a}_i^t$ and $s_i^t$ denote the appearance and scale of object i at time t, respectively. $\mathbf{p}_i^t$ and $\mathbf{v}_i^t$ use continuous image coordinates. They then use a Hidden Markov Model as the probabilistic model to maximize the joint probability between the state sequence and the observation sequence.

For anomaly detection, they first collect information from the tracking result, including the number of objects, their motion history and interactions, and the timing of their behaviors. They then interpret events based on this basic information about WHO (how many), WHEN, WHERE and WHAT. Based on this interpretation, anomalies can be defined by rules. For example, in a traffic intersection scenario there are always 0 - 8 cars around; if at some time 15 cars arrive at this intersection simultaneously, it can be regarded as an anomalous event.

C. Video Datasets and Intuitive Illustration

As discussed above, if an anomaly is generated by a single object, we call it a single-object anomaly. Figs. 6 (a) and (b) show consecutive frames from 2 examples of single-object anomalies. In Fig. 6 (a), a man suddenly falls on the floor (see Fig. 6(a2)) while walking across the lobby. In Fig. 6 (b), instead of turning left or right in front of the stop sign, the driver suddenly backs his car up - see Figs. 6 (b3)-(b4). By setting the number of objects P = 1, our joint sparsity model reduces to the model by Li et al. [8] and can be used to detect anomalous trajectories of individual (single) objects.

On the other hand, if an anomaly happens via the interaction of multiple objects, we call it a multiple-object anomaly. Video clip frames from 3 examples of multiple-object anomalies are shown in Figs. 7 (a), (b) and (c). In Fig. 7 (a), a pedestrian walking across the street loses his hat and retraces his footsteps to pick it up from the road.


Fig. 6. Example frames of single-object anomalies: (a) a man suddenly falls on the floor – from the CAVIAR data set and (b) a driver backs his car up in front of a stop sign – from the Xerox Stop Sign data set.

Fig. 7. Example frames of multiple-object anomalies: (a) a vehicle almost hits a pedestrian – from the AVSS data set, (b) a car (marked by a red rectangle) violates the stop sign rule – from the Xerox Stop Sign data set and (c) a car fails to yield to an oncoming car while turning left – from the Xerox Intersection data set.

At this time, a vehicle comes into very close proximity to the pedestrian and comes to a sudden halt - see Fig. 7 (a2). In Fig. 7 (b), the second vehicle (marked by a red rectangle) comes to a complete stop when waiting for the vehicle in front of it (Fig. 7 (b1)), but does not actually stop at the stop sign - see Figs. 7 (b2)-(b3). In Fig. 7 (c), a car fails to yield to an oncoming car while turning left - see Figs. 7 (c2)-(c3). The examples in Fig. 7 (a) and (b) are in fact from a real-world transportation database (which cannot be made public for proprietary reasons) which we refer to as the Xerox Stop Sign database.

An example video clip is, however, made available at: http://youtu.be/M6 PJigg5CY. The example in Fig. 7 (c) comes from another proprietary transportation database, which we refer to as the Xerox Intersection database. A representative video clip is made available at: http://youtu.be/ZGKtkVtWEFU.

Our proposed algorithm for multi-object trajectory based anomaly detection is called the Joint Kernel-based Sparsity Model (abbreviated JKSM), or equivalently the Kernel Sparsity Model (KSM) in the case of single-object anomaly detection.


Fig. 8. Occlusion examples: (a) moving occlusion: a car is occluded by another car; (b) static occlusion: a car is occluded by the stop sign.

We test the KSM and JKSM algorithms on several challenging video datasets. For single-object anomalies, we test on the CAVIAR [29] data set shown in Fig. 6 (a) and the Xerox Stop Sign video database represented in Fig. 6 (b). For multiple-object anomalies, we test on the AVSS [30] data set - see Fig. 7 (a), the Xerox Stop Sign data set - multi-object example in Fig. 7 (b), and the Xerox Intersection data set - see Fig. 7 (c). We also compare our experimental results against four well-recognized techniques in trajectory based anomalous event detection: 1.) the recent sparsity based technique of Li et al. [8], 2.) the approach of Piciarelli et al. [18] using one-class SVMs, 3.) the sparsity model with online dictionary learning of Zhao et al. [9], and 4.) the multiple-object tracking and rule-based anomaly detection technique of Han et al. [31].

Benefits of sparsity under object occlusion or missing trajectory information:

Because of the limitation of the camera's viewing angle, occlusions often occur in video data. Figs. 8 (a) and (b) show two examples of occlusions. In Fig. 8 (a), a car is occluded by another car (we call this a moving occlusion) - see Fig. 8(a3). In Fig. 8 (b), the car is occluded by the stop sign (static occlusion) - see Fig. 8 (b2).

Tracking algorithms continue to strive to improve and perform well even under occlusion [43]–[45]. Since this is a fundamentally hard problem, occlusion occurs in practice and leads to missing or corrupted trajectory data. Because our optimization problems in (12) and (22) are well-conditioned, perturbation theory arguments apply and lend our method robustness under limited trajectory occlusion. In particular, consider:

$$S_o = \arg\min \|J(H \circ S_o)\|_{\text{row},0} \quad \text{subject to } \|\mathbf{Y}_o - AS_o\|_F \leq \varepsilon, \qquad (34)$$

where $\mathbf{Y}_o$ is a set of occluded test trajectories and $S_o$ denotes the corresponding sparse coefficients. If $\|\mathbf{Y}_o - \mathbf{Y}\|_2 < \eta$ then $\|S_o - S\|_2 < \zeta$, i.e. a small perturbation to the problem should cause only a small perturbation to the solution [46]. This means that if the occluded trajectory set $\mathbf{Y}_o$ is not very different from the original trajectory set $\mathbf{Y}$, then the optimized sparse coefficients under trajectory occlusion will be close to S, i.e. the solution in the absence of occlusion. Because anomaly detection rests on the structure of the sparse coefficient matrix, this intuitively provides occlusion robustness.

In addition, we perform a simple experiment to illustrate this. We work with the Xerox Stop Sign data set and 95 trajectories (including 76 training trajectories without occlusion and 19 occluded trajectories) obtained from 39 video clips. Our training dictionary consists of 9 normal trajectory classes (containing 8 trajectories each) and 1 anomalous trajectory class (containing 4 trajectories). An independent set of 13 normal but occluded trajectories and 6 anomalous but occluded trajectories is used to test our approach. Fig. 9 shows an example of occluded trajectories from the Xerox Stop Sign data. Note that occluded trajectory locations are replaced by a constant value (e.g. zero) for the duration that object tracking is lost. This is a common characteristic of frame-based tracking approaches [41], [47], [48].

Because training for anomalous events is well-represented in this database, Eq. (3) is used to classify the test trajectories. The RBF function is chosen as the kernel for the KSM algorithm. The confusion matrices of four methods - KSM and the techniques by Piciarelli et al., Li et al. and Zhao et al. - are reported in Table I. Note that Li et al. and Zhao et al., which are also sparsity based anomaly detection techniques, as well as the proposed KSM method, do better than Piciarelli et al. owing to the robustness of the sparse coefficients under occlusion. Further, KSM is mildly better than Li et al. and Zhao et al. for this example owing to the non-linearity introduced into the sparsity model by the use of the kernel. A more thorough evaluation of these methods on real-world databases is reported next.

D. Experimental Results

1) Detection Rates for Single-Object Anomaly Detection: For the CAVIAR data set, we test on 27 video clips from which 170 trajectories are extracted. Our training dictionary consists of 10 normal trajectory classes and 3 anomalous trajectory classes, with each class containing 10 different training trajectories. 21 normal trajectories and 19 anomalous trajectories are used as independent test data.1


TABLE I
CONFUSION MATRICES OF PROPOSED AND STATE-OF-THE-ART TRAJECTORY BASED METHODS ON XEROX STOP SIGN OCCLUDED DATA – SUPERVISED, SINGLE-OBJECT ANOMALY DETECTION. KSM DETECTS ANOMALIES USING EQ. (3)

             Piciarelli et al. [18]    Li et al. [8]        Zhao et al. [9]      KSM
             Normal     Anomaly        Normal    Anomaly    Normal    Anomaly    Normal    Anomaly
Normal       61.5%      50.0%          84.6%     33.3%      76.9%     16.7%      84.6%     16.7%
Anomaly      38.5%      50.0%          15.4%     66.7%      23.1%     83.3%      15.4%     83.3%

(Columns give the true class for each method; rows give the predicted label, so each column sums to 100%.)

Fig. 9. An example of occluded trajectories from the Xerox Stop Sign video data.

Because training for anomalous events is well represented in this database, Eq. (3) is used to classify the test trajectories as normal or anomalous. Our proposed KSM method employs the RBF function as the kernel. The RBF function is defined as: κ(x, z) = e^{−γ‖x−z‖²}, for γ > 0. The confusion matrices of KSM are compared with the approach in [18] and the sparsity models of Li et al. [8] and Zhao et al. [9] in Table II. First, we note the benefits of sparsity-based anomaly detection in the form of improved detection rates of Li et al. and Zhao et al. over Piciarelli's approach. Second, we conjecture that Zhao et al. performs mildly better than Li et al. because they can optimize and update dictionaries. Third, KSM serves to further improve detection rates by virtue of a non-linear kernel that enhances the accuracy of the underlying sparse reconstruction - which is assumed linear by Li et al. and Zhao et al.
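The RBF kernel evaluation above can be computed compactly over whole sets of feature vectors. Below is a minimal sketch that builds a kernel (Gram) matrix between two sets of trajectory features; the function name and the bandwidth value gamma are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - z_j||^2).

    X : (n_x, d) and Z : (n_z, d) arrays of feature vectors (one per row).
    """
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # clip tiny negative values caused by floating-point round-off
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```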

For the Xerox Stop Sign data set, 118 trajectories from 39 video clips are extracted. The training dictionary comprises 9 normal trajectory classes (containing 8 trajectories each) and 1 anomalous trajectory class (containing 4 trajectories). Again, we have training for anomalies, so we use Eq. (3) to classify a given test trajectory. An independent set of 34 normal trajectories and 8 anomalous trajectories is used to test our approach. Table III shows the confusion matrices of the four approaches. Again, the benefits of classification using the sparsity model over an SVM-based classifier are readily apparent. As with the CAVIAR data, a comparison of Li et al., Zhao et al., and KSM demonstrates a significant advantage brought about by employing the nonlinear kernel.

ROC Curves: In order to test the robustness of the proposed method, we remove all the training for anomalous events from the dictionary and use all 40 test trajectories of the CAVIAR data set and all 42 test trajectories of the Xerox Stop Sign data set. In this case, (17) is used to determine whether the test trajectory is normal or anomalous.

¹All training and test trajectories, for both normal and anomalous events, are manually hand-labeled by video analysis experts. The training and test sets picked for computing detection accuracy are completely non-overlapping.

Fig. 10. ROC curves for unsupervised single-object anomaly detection – CAVIAR video data set. KSM detects anomalies using Eq. (17), where τ1 varies from 0.1 to 0.9.

We can observe from Eq. (17) that if we increase the threshold τ1, more events will be identified as anomalies, which may lead to an increase in the false positive rate. On the other hand, if τ1 decreases, more events will be classified as normal events; therefore, the false negative rate will increase correspondingly. By varying the threshold τ1 in (17), we obtain the relationship between true positive rate and false positive rate for both the Li et al. and KSM approaches. For the method of Piciarelli et al., we obtain a similar relationship by changing the threshold angle θth in Eq. (30). Also, for Zhao et al., a similar relationship can be obtained by using different ε in Eq. (32). By varying the individual thresholds for each algorithm, we obtain receiver operating characteristic (ROC) curves for these methods on the CAVIAR data (Fig. 10) as well as the Xerox Stop Sign video database (Fig. 11). Figs. 10 and 11 reveal that the proposed KSM outperforms the approaches of Piciarelli et al., Li et al., and Zhao et al. by a considerable margin.
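The ROC curves above are traced out by sweeping a decision threshold over per-event anomaly scores. A minimal sketch of this sweep follows, assuming each test event has already been assigned a scalar anomaly score and a ground-truth label; the function and variable names are hypothetical.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep.

    scores : anomaly scores, one per test event (higher = more anomalous).
    labels : ground-truth labels, 1 for anomaly and 0 for normal.
    thresholds : decision thresholds to sweep, e.g. np.linspace(0.1, 0.9, 9).
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for tau in thresholds:
        pred = scores >= tau                      # flag event as anomalous
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tpr = tp / max(np.sum(labels == 1), 1)
        fpr = fp / max(np.sum(labels == 0), 1)
        points.append((fpr, tpr))
    return points
```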

2) Detection Rates for Multiple-Object Anomaly Detection: For multiple (here, two) object anomaly detection, we compare against Han et al. [31]², Piciarelli et al. [18], and Zhao et al. [9]. Since the one-class SVM method of Piciarelli et al. is really proposed for single-object anomaly detection, we build two intuitively motivated extensions to evaluate its performance in the multi (here, two) object setting:

1) If either of the trajectories corresponding to the two objects is an anomaly, the joint event is called anomalous. We denote this method as Piciarelli et al. 1.

2) If each of the two trajectories corresponding to the two objects is individually found to be anomalous, only then is the joint event called anomalous. We denote this method as Piciarelli et al. 2.

²The predefined anomaly detection rules in Han et al. are based on the underlying scenario.



TABLE II
CONFUSION MATRICES OF PROPOSED AND STATE-OF-THE-ART TRAJECTORY BASED METHODS ON CAVIAR DATA – SUPERVISED, SINGLE-OBJECT ANOMALY DETECTION. KSM DETECTS ANOMALIES USING EQ. (3)

              Piciarelli et al. [18]    Li et al. [8]        Zhao et al. [9]      KSM
              Normal     Anomaly        Normal    Anomaly    Normal    Anomaly    Normal    Anomaly
   Normal     85.7%      26.3%          90.5%     15.8%      95.2%     15.8%      95.2%     10.5%
   Anomaly    14.3%      73.7%          9.5%      84.2%      4.8%      84.2%      4.8%      89.5%

TABLE III
CONFUSION MATRICES OF PROPOSED AND STATE-OF-THE-ART TRAJECTORY BASED METHODS ON THE XEROX STOP SIGN DATA – SUPERVISED, SINGLE-OBJECT ANOMALY DETECTION. KSM DETECTS ANOMALIES USING EQ. (3)

              Piciarelli et al. [18]    Li et al. [8]        Zhao et al. [9]      KSM
              Normal     Anomaly        Normal    Anomaly    Normal    Anomaly    Normal    Anomaly
   Normal     85.3%      37.5%          91.2%     25.0%      94.1%     25.0%      97.1%     12.5%
   Anomaly    14.7%      62.5%          8.8%      75.0%      5.9%      75.0%      2.9%      87.5%

TABLE IV
DETECTION RATES OF PROPOSED AND STATE-OF-THE-ART TRAJECTORY BASED METHODS ON THE XEROX STOP SIGN DATA – UNSUPERVISED MULTIPLE-OBJECT ANOMALY DETECTION. THE PROPOSED JKSM DETECTS ANOMALIES USING EQ. (27) AND A THRESHOLD VALUE OF 0.5.

                       Piciarelli et al. 1    Piciarelli et al. 2    Zhao et al.    Han et al. [31]    JKSM
   Detection Rates:    3/3                    1/3                    2/3            2/3                3/3

TABLE V
DETECTION RATES OF PROPOSED AND STATE-OF-THE-ART METHODS ON AVSS DATA – UNSUPERVISED MULTIPLE-OBJECT ANOMALY DETECTION. THE PROPOSED JKSM DETECTS ANOMALIES USING EQ. (27) AND A THRESHOLD VALUE OF 0.5.

                       Piciarelli et al. 1    Piciarelli et al. 2    Zhao et al.    Han et al. [31]    JKSM
   Detection Rates:    2/2                    0/2                    1/2            2/2                2/2

Fig. 11. ROC curves for unsupervised single-object anomaly detection – Xerox Stop Sign video data set. KSM detects anomalies using Eq. (17), where τ1 varies from 0.1 to 0.9.


To enable the application of the Piciarelli et al. method, we separate every 2-object event into 2 individual events with known individual class labels (normal or anomalous) so that we can obtain results for the aforementioned extensions, i.e., Piciarelli et al. 1 and Piciarelli et al. 2.

For the Xerox Stop Sign data set, we identify 4 different 2-object normal event classes. Each class contains 15 training trajectory pairs. This database inherently contains 3 multiple-object anomalies, one of which is illustrated in Fig. 7(b), with the actual video at: http://youtu.be/tPR-LI3NmiM. Using the outlier rejection in Eq. (18), all three multiple-object anomalies are successfully detected by our proposed JKSM. The detection rates of all methods are shown in Table IV.

For the AVSS data set, 3 different 2-object normal event classes (containing 24 training trajectory pairs each) are chosen. The database was experimentally found to contain 2 different anomalies - the corresponding videos can be seen at: http://youtu.be/mU5R056zInc and http://youtu.be/jEzLkWF65Io. Our outlier rejection measure in Eq. (18) was again able to successfully detect both of these anomalies. Table V shows the detection rates of all methods.

In the Xerox Intersection data, 91 trajectory pairs are extracted from 13 video clips. We manually build our training dictionary with 6 different 2-object normal event classes (containing 6 trajectory pairs each) and 6 different anomalous classes (containing 4 trajectory pairs each). 17 normal trajectory pairs and 14 anomalous trajectory pairs are used for testing. Again, we use the RBF kernel function. The confusion matrices of our method against the competing trajectory-based techniques are shown in Table VI.

It is easy to see from Table VI that the proposed JKSM method leads to the best detection rates. The improvement over the techniques of Piciarelli et al. [18] is expected since that technique is really for single-object anomaly detection, and the extensions Piciarelli et al. 1 and Piciarelli et al. 2 will either strongly compromise detection or lead to high false alarm rates. For Zhao et al., spatio-temporal volumes are used as features, which cannot distinguish between single-object and multiple-object cases. Therefore, the performance of Zhao et al. is slightly worse than that of Han et al., which is designed for the multiple-object scenario. In [31], anomalies are detected by the use of context-based rules applied to the result of multiple-object tracking. This puts an unreasonable burden on defining these rules and is often restrictive in practice, i.e., not all anomalies can be anticipated. In the proposed joint sparsity model, interactions between distinct object trajectories are better captured, and a measure of departure from expected "joint behavior" (particularly valuable in the case where training for anomalies is absent or limited) is employed, which enables the improvement in detection rates.



TABLE VI
CONFUSION MATRICES OF PROPOSED AND STATE-OF-THE-ART TRAJECTORY BASED METHODS ON THE XEROX INTERSECTION DATASET – SUPERVISED, MULTIPLE-OBJECT ANOMALY DETECTION. THE PROPOSED JKSM DETECTS ANOMALIES USING EQ. (26)

              Piciarelli et al. 1        Piciarelli et al. 2
              Normal     Anomaly         Normal     Anomaly
   Normal     58.8%      7.1%            94.1%      64.3%
   Anomaly    41.2%      92.9%           5.9%       35.7%

              Zhao et al.                Han et al. [31]
              Normal     Anomaly         Normal     Anomaly
   Normal     79.4%      35.7%           82.4%      35.7%
   Anomaly    20.6%      64.3%           17.6%      64.3%

              JSM                        JKSM
              Normal     Anomaly         Normal     Anomaly
   Normal     88.2%      28.6%           88.2%      14.3%
   Anomaly    11.8%      71.4%           11.8%      85.7%

Fig. 12. ROC curves for unsupervised multiple-object anomaly detection – Xerox Intersection data set. The proposed JKSM detects anomalies using Eq. (27), where the detection threshold varies from 0.1 to 0.9.

ROC Curves: By changing the threshold τ2 corresponding to Eq. (18), θth in Eq. (30), and ε in Eq. (32), we can generate the ROC curves for JKSM, JSM (the joint sparsity model without the kernel), Piciarelli et al., and Zhao et al. (Here, we average the ROC curves of Piciarelli et al. 1 and Piciarelli et al. 2 to obtain the ROC curve of Piciarelli et al.) For Han et al., we loosen and tighten the anomaly detection rules to obtain the relationship between true positive rate and false positive rate. The ROC curves of these methods on all 31 test trajectory pairs from the Xerox Intersection data set are shown in Fig. 12. The statistical merits of the joint sparsity model are clearly revealed by the ROC curves. Again, the use of a kernel enables JKSM to further improve performance.

VI. CONCLUSION

We present a new joint sparsity model for video anomaly detection. An over-complete training dictionary is derived comprising joint observations of multi-object events, and a test event is reconstructed as a sparse linear combination of events in the training dictionary. For the joint sparsity model to be meaningful, constraints are posed on the structure of the sparse coefficients, which leads to a new simultaneous sparsity optimization problem. In the supervised case, where anomalies are pre-characterized into classes, anomaly detection reduces to a sparsity-based classification problem. In the more realistic unsupervised scenario, where anomalies cannot be sufficiently pre-characterized, anomaly detection is accomplished via a multi-object outlier rejection measure. The merits of a principled joint sparsity model for multi-object anomaly detection are strongly corroborated via experiments on challenging real-world databases. Additionally, we propose a kernelization of the joint sparsity model so as to further improve anomaly detection where linear sparse reconstruction models do not hold directly. While we use fixed, expert-designed training dictionaries that can be updated periodically, online dictionary updating/learning, e.g. as in [9], is a useful extension of our work and can be pursued in future research.

REFERENCES

[1] T. Xiang and S. Gong, "Video behavior profiling for anomaly detection," IEEE Trans. on Pattern Analysis and Machine Int., vol. 30, no. 5, pp. 893–908, May 2008.

[2] C.-H. Chuang, J.-W. Hsieh, L.-W. Tsai, S.-Y. Chen, and K.-C. Fan, "Carried object detection using ratio histogram and its application to suspicious event analysis," IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 6, pp. 911–916, Jun. 2009.

[3] V. Saligrama, J. Konrad, and P. Jodoin, "Video anomaly identification," IEEE Signal Processing Magazine, vol. 27, no. 5, pp. 18–33, Sept. 2010.

[4] X. Wang, X. Ma, and W. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. on Pattern Analysis and Machine Int., vol. 31, no. 3, pp. 539–555, Mar. 2009.

[5] C. Simon, J. Meessen, and C. De Vleeschouwer, "Visual event recognition using decision trees," Multimedia Tools Appl., vol. 50, no. 1, pp. 95–121, Oct. 2010.

[6] N. Vaswani, A. Roy-Chowdhury, and R. Chellappa, "Shape activity: a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection," IEEE Trans. on Image Processing, vol. 14, no. 10, pp. 1603–1616, Oct. 2005.

[7] I. Pruteanu-Malinici and L. Carin, "Infinite hidden Markov models for unusual-event detection in video," IEEE Trans. on Image Processing, vol. 17, no. 5, pp. 811–822, May 2008.

[8] C. Li, Z. Han, Q. Ye, and J. Jiao, "Abnormal behavior detection via sparse reconstruction analysis of trajectory," in Proc. IEEE Int. Conf. Image and Graphics, Aug. 2011, pp. 807–810.

[9] B. Zhao, L. Fei-Fei, and E. Xing, "Online detection of unusual events in videos via dynamic sparse coding," in Proc. IEEE Conf. Computer Vision Pattern Recognition, Jun. 2011, pp. 3313–3320.

[10] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. on Pattern Analysis and Machine Int., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[11] D. Zhang, M. Yang, and X. Feng, "Sparse representation or collaborative representation: Which helps face recognition?" in Proc. IEEE Conf. on Computer Vision, 2011, pp. 471–478.

[12] R. Rigamonti, M. Brown, and V. Lepetit, "Are sparse representations really relevant for image classification?" in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2011, pp. 1545–1552.

[13] Q. Shi, A. Eriksson, A. van den Hengel, and C. Shen, "Is face recognition really a compressive sensing problem?" in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2011, pp. 553–560.

[14] N. Johnson and D. Hogg, "Learning the distribution of object trajectories for event recognition," Image and Vision Computing, vol. 14, pp. 583–592, 1996.

[15] P. Kumar, S. Ranganath, H. Weimin, and K. Sengupta, "Framework for real-time behavior interpretation from traffic video," IEEE Trans. on Intelligent Transportation Systems, vol. 6, no. 1, pp. 43–53, Mar. 2005.

[16] D. Makris and T. Ellis, "Learning semantic scene models from observing activity in visual surveillance," IEEE Trans. on Systems, Man, and Cybernetics, vol. 35, no. 3, pp. 397–408, Jun. 2005.



[17] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, "A system for learning statistical motion patterns," IEEE Trans. on Pattern Analysis and Machine Int., vol. 28, no. 9, pp. 1450–1464, Sep. 2006.

[18] C. Piciarelli, C. Micheloni, and G. Foresti, "Trajectory-based anomalous event detection," IEEE Trans. on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1544–1554, Nov. 2008.

[19] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Trans. on Systems, Man, and Cybernetics, vol. 34, no. 3, pp. 334–352, Aug. 2004.

[20] I. Junejo, O. Javed, and M. Shah, "Multi feature path modeling for video surveillance," in Proc. IEEE Conf. on Pattern Recognition, vol. 2, Aug. 2004, pp. 716–719.

[21] Z. Fu, W. Hu, and T. Tan, "Similarity based vehicle trajectory clustering and anomaly detection," in Proc. IEEE Conf. on Image Processing, vol. 2, Sept. 2005, pp. 602–605.

[22] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, no. 4, Dec. 2006.

[23] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121–167, 1998.

[24] A. J. Smola and B. Schölkopf, "On a kernel-based method for pattern recognition, regression, approximation, and operator inversion," Algorithmica, vol. 22, pp. 211–231, 1998.

[25] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Proc. IEEE Conf. on Computer Vision, vol. 2, Oct. 2005, pp. 1458–1465.

[26] P. Vincent and Y. Bengio, "Kernel matching pursuit," Machine Learning, vol. 48, pp. 165–187, 2002.

[27] V. Guigue, A. Rakotomamonjy, and S. Canu, "Kernel basis pursuit," in Proc. European Conf. on Machine Learning, 2005, pp. 146–157.

[28] S. Gao, I. W.-H. Tsang, and L.-T. Chia, "Kernel sparse representation for image classification and face recognition," in Proc. European Conf. on Computer Vision: Part IV, 2010, pp. 1–14.

[29] "CAVIAR datasets." [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/

[30] "AVSS2007 datasets." [Online]. Available: ftp://motinas.elec.qmul.ac.uk/pub/iLids/

[31] M. Han, W. Xu, H. Tao, and Y. Gong, "An algorithm for multiple object trajectory tracking," in Proc. IEEE Conf. Computer Vision Pattern Recognition, Jun. 2004, pp. 864–871.

[32] R. Baraniuk, "Compressive sensing [lecture notes]," IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, Jul. 2007.

[33] J. Chen and X. Huo, "Theoretical results on sparse representations of multiple-measurement vectors," IEEE Trans. on Signal Processing, vol. 54, no. 12, pp. 4634–4643, 2006.

[34] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, "Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit," Signal Processing, vol. 86, no. 3, pp. 572–588, 2006.

[35] J. Tropp and A. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. on Info. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.

[36] C. Jeon, V. Monga, and U. Srinivas, "A greedy pursuit approach to classification using multi-task multivariate sparse representations," The Pennsylvania State University, Tech. Rep., 2012. [Online]. Available: http://personal.psu.edu/uxs113/MTMV SOMP.pdf

[37] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[38] "Kernel machines." [Online]. Available: http://en.wikipedia.org/wiki/File:Kernel Machine.png

[39] X.-T. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Conf. Computer Vision Pattern Recognition, Jun. 2010, pp. 3493–3500.

[40] J. Tropp and S. Wright, "Computational methods for sparse solution of linear inverse problems," Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, Jun. 2010.

[41] C. Stauffer and W. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. on Pattern Analysis and Machine Int., vol. 22, no. 8, pp. 747–757, Aug. 2000.

[42] G. Knott, Interpolating Cubic Splines. Progress in Computer Science and Applied Logic, vol. 18, 2000.

[43] D. Koller, J. Weber, and J. Malik, "Robust multiple car tracking with occlusion reasoning," in Proc. European Conf. on Computer Vision, vol. 800, 1994, pp. 189–196.

[44] L. Torresani, D. Yang, E. Alexander, and C. Bregler, "Tracking and modeling non-rigid objects with rank constraints," in Proc. IEEE Conf. Computer Vision Pattern Recognition, vol. 1, 2001, pp. 493–500.

[45] T. Yang, Q. Pan, J. Li, and S. Li, "Real-time multiple objects tracking with occlusion handling in dynamic scenes," in Proc. IEEE Conf. Computer Vision Pattern Recognition, vol. 1, Jun. 2005, pp. 970–975.

[46] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[47] M. Piccardi, "Background subtraction techniques: a review," in Proc. IEEE Conf. on Systems, Man and Cybernetics, vol. 4, Oct. 2004, pp. 3099–3104.

[48] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. IEEE Conf. Computer Vision Pattern Recognition, vol. 1, Jun. 2006, pp. 798–805.

Xuan Mo received the Master's degree in Automation from Tsinghua University, China, in 2010. He is currently a PhD candidate in the Electrical Engineering Department of the Pennsylvania State University. His research interests include computational color & imaging, signal processing, and computer vision. His current work focuses on video anomaly detection for transportation applications.

Vishal Monga is currently an Assistant Professor of Electrical Engineering at the main campus of Pennsylvania State University in University Park, PA. He was with Xerox Research from 2005-2009. His undergraduate work was completed at the Indian Institute of Technology (IIT), Guwahati, and his doctoral degree in Electrical Engineering was obtained from the University of Texas, Austin, in Aug. 2005.

Dr. Monga's research interests are broadly in statistical signal and image processing. He established the Information Processing and Algorithms Laboratory (iPAL) at Penn State. Current research themes in his lab include computational color and imaging, multimedia mining and security, and robust image classification and recognition.

He currently serves as an Associate Editor for the IEEE Transactions on Image Processing and the SPIE Journal of Electronic Imaging. While with Xerox Research in Rochester, he was selected as the 2007 Rochester Engineering Society (RES) Young Engineer of the Year. He is the recipient of a 2011 Air Force Faculty Fellowship and a Monkowski Early Career Award from the College of Engineering at Penn State.

Raja Bala received the Ph.D. degree in Electrical Engineering from Purdue University. He is currently a Principal Scientist with the Xerox Innovation Group. His research interests include color management, novel image rendering techniques, image processing for augmented reality applications, and image and video analytics. Dr. Bala holds over 150 patents and publications in the field of color and imaging. He is a Fellow of IS&T and a member of IEEE.

Dr. Zhigang (Zeke) Fan received his MS and PhD degrees in electrical engineering from the University of Rhode Island, Kingston, RI, in 1986 and 1988, respectively. He joined Xerox Corporation in 1988, where he is currently a principal scientist in Xerox Corporate Research and Technology. Dr. Fan's research interests include various aspects of image processing and recognition. He has authored and co-authored more than 90 technical papers, as well as over 200 patents and pending applications. Dr. Fan is a Fellow of SPIE and a Fellow of IS&T.