Mining Sensor Data in Cyber-Physical Systems

Tsinghua Science and Technology, Volume 19, Issue 3, Article 1, 2014

Lu-An Tang, NEC Laboratory America, Princeton, NJ 08540, USA.
Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Guofei Jiang, NEC Laboratory America, Princeton, NJ 08540, USA.

Recommended Citation: Lu-An Tang, Jiawei Han, Guofei Jiang. Mining Sensor Data in Cyber-Physical Systems. Tsinghua Science and Technology 2014, 19(03): 225-234.


TSINGHUA SCIENCE AND TECHNOLOGY, ISSN 1007-0214, 01/11, pp225-234, Volume 19, Number 3, June 2014

Mining Sensor Data in Cyber-Physical Systems

Lu-An Tang*, Jiawei Han, and Guofei Jiang

Abstract: A Cyber-Physical System (CPS) integrates physical devices (i.e., sensors) with cyber (i.e., informational) components to form a context sensitive system that responds intelligently to dynamic changes in real-world situations. Such a system has wide applications in the scenarios of traffic control, battlefield surveillance, environmental monitoring, and so on. A core element of CPS is the collection and assessment of information from noisy, dynamic, and uncertain physical environments integrated with many types of cyber-space resources. The potential of this integration is unbounded. To achieve this potential, the raw data acquired from the physical world must be transformed into usable knowledge in real time. Therefore, CPS brings a new dimension to knowledge discovery because of the emerging synergism of the physical and the cyber. The various properties of the physical world must be addressed in information management and knowledge discovery. This paper discusses the problems of mining sensor data in CPS: With a large number of wireless sensors deployed in a designated area, the task is real-time detection of intruders that enter the area based on noisy sensor data. The framework of IntruMine is introduced to discover intruders from untrustworthy sensor data. IntruMine first analyzes the trustworthiness of sensor data, then detects the intruders’ locations, and verifies the detections based on a graph model of the relationships between sensors and intruders.

Key words: cyber-physical system; sensor network; data trustworthiness

1 Introduction

A Cyber-Physical System (CPS) is an integration of a sensor network with cyber resources. The CPS collects sensor data from the physical world and links them to various information sources for real-time analysis. Such a system has many promising applications in both military and civilian fields, including missile defense[1], battlefield awareness[2, 3], traffic control[4, 5], neighborhood watch[6], environment monitoring[7], and wildlife tracking[8]. Research topics in CPS have been placed at the top of the priority list for federal research investment in the fiscal year report of the U.S. President’s Council of Advisors on Science and Technology.

• Lu-An Tang and Guofei Jiang are with the NEC Laboratory America, Princeton, NJ 08540, USA. E-mail: [email protected]; [email protected].
• Jiawei Han is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. E-mail: [email protected].
* To whom correspondence should be addressed.

Manuscript received: 2014-05-07; accepted: 2014-05-09

The key task in such a scenario is to mine the real intruder information from a large set of untrustworthy sensor data. Such a problem is considered one of the major challenges in the CPS research field, partly due to the following problems:

• Untrustworthy data: The collected data are highly unreliable due to hardware and communication limits. Many deployment experiences have shown that untrustworthy data is the most serious problem that impacts CPS performance. Tolle et al. pointed out that faulty data can occur in various unexpected ways and less than 69% of their data could be used for meaningful interpretation[7]. Szewczyk et al. also found that about 30% of data are faulty in their deployment[9]. It is difficult to filter out untrustworthy data records solely based on the data values, since most faulty records have values similar to real ones.

• Complex requirements: Many intruder detection algorithms rely on prior knowledge of the number of intruders, movement speed, and so on[10-13]. However, the users often cannot provide such attributes of intruders in real applications. In contrast, they would like to obtain fine-grained situational awareness of the battlefield and require the system to generate this information automatically.

• Unsupervised learning: Several intrusion detection methods build up the models or classifiers based on a training dataset[14, 15]. However, such training datasets are hard to get in realistic deployments. It is costly and error-prone to manually label a large sensor dataset. In addition, the data are based on a specific deployment plan, hence it is usually difficult to apply a training set from one deployment to another. To increase the system feasibility, the intruder mining algorithm should use unsupervised learning that does not require a large training dataset.

• Big data: A typical CPS includes hundreds, even thousands of sensors. Each sensor generates a reading every few minutes, and the readings form a huge data stream. Furthermore, many applications require immediate action against intruders. The mining must be efficient to process the huge data stream and find intruders in real time.

In this paper, we introduce the framework of IntruMine to find real intruders from untrustworthy data. IntruMine iteratively models the relationship between sensors and intruders via a monitoring graph, and estimates the attribute values of the intruders based on the link information of such a graph. The confidence of intruder detection is computed based on the difference between the real sensor readings and the estimated ones. This measurement is used to verify the detected intruders and filter out false positives. A preliminary version of this paper has been published in Refs. [16, 17].

2 Related Works

The research problems of CPS are relatively new. However, many related topics, such as detecting faulty sensor signals or target tracking, have been studied extensively in the past decades. The community of data management and data mining also proposes some methods to find outliers or anomalies for sensor network applications. According to the methodology, the related works can be roughly classified into three categories: statistical model-based approaches, spatial-and-temporal similarity-based methods, and feature retrieving techniques.

2.1 Statistical model-based approaches

A large category of statistical models has been proposed to detect faulty sensor data. The faulty data are defined as the ones that do not follow the distribution of those models. Deshpande et al. used models of time-varying multivariate Gaussians to respond to predetermined queries[18]. The tool responds to a predetermined set of query types, treating the sensor network like a database. Elnahrawy and Nath utilized a Bayesian Classifier (BC) to clean the data[19]. They modeled the sensor data as a standard normal distribution, and generated the prior knowledge of the noise model from training data. Koushanfar et al. developed a cross validation method for Online False Alarm Detection (OFAD) based on multiple fault models[20].

To some extent, those methods can help users filter false sensor data. However, most of them need training datasets or prior knowledge to construct the models and tune the parameters. Such information is not available in many real scenarios. Moreover, with so many statistical models, it is hard for the user to determine which one is the most appropriate. As mentioned in Ref. [18], the existing models are still not good enough, and the statistical models cannot fit many complex cases in real applications.

2.2 Spatial-and-temporal similarity-based methods

The spatial-and-temporal similarity-based methods are based on the assumption that there are strong correlations between the sensor data and their neighbors (spatial similarity), as well as their histories (temporal similarity). Krishnamachari and Iyengar exploited spatial and temporal relations of faulty sensor data[21]. Jeffery et al. attempted to take advantage of both spatial and temporal relations to correct faulty records[22]. Their methods assume that all data within each spatial and temporal granule are homogeneous. The fault recognition programs treat any value exceeding a high value threshold as faulty. Subramaniam et al. proposed the Non-Parametric Outlier Detection (NPOD) model for sensor data[23]. This framework detects the outliers in a distributed manner by checking each sensor’s k nearest neighbors. The data distributions are estimated by a kernel density function and multi-dimensional outliers are discovered by monitoring heterogeneous readings. Xiao et al. provided a sensor rank based outlier detection method[24]. The system generates clusters of sensor readings and detects the outliers by measuring a sensor reading’s dissimilarity to its neighbors.

These techniques have several limits: (1) The spatial similarity hypothesis may not be valid in all cases, since the correlation of sensors is influenced by multiple factors, including the deployment of sensors, the surrounding environment, and the target movement; (2) the temporal similarity assumption might fail in several cases, since a sensor’s reliability may decrease over time, e.g., the sensors might be damaged in a harsh environment or run out of power.

2.3 Feature retrieving techniques

Feature retrieving techniques detect faulty data by comparing distinguishing features. Such methods first exploit several data features like environmental type, connecting degree, and temporal patterns, and then construct classifiers to distinguish different types of faults. Ni et al. developed some common features, including system features, environment features, and data features[25]. They combined different features to define and detect commonly observed faults. Ni and Pottie deployed sensors to detect the presence of arsenic in groundwater[26]. A Fault Remediation System (FRS) is developed for determining faults and suggesting solutions using rule-based methods and static thresholds on the water pressure and other domain specific features. Tang et al. proposed a Pattern Growth Graph (PGG) based method to detect variations and filter noise over evolving medical streams[27]. The feature of wave-pattern is proposed to capture the major information of medical data evolution and represent it compactly. The variations are detected by a wave-pattern matching algorithm and meaningful data changes are distinguished from noise. Yu et al. proposed a two-stage approach to find anomalies in complicated datasets[28]. The algorithm employs an efficient deterministic space partition to eliminate obvious normal instances and generates a small set of anomaly candidates, and then checks each candidate with density-based multiple criteria to determine the final results.

The feature-based approaches usually perform better than the other methods, but they are more domain specific. Such methods require users to provide detailed context information and define the faulty records carefully. Scalability and adaptiveness are the major problems that prevent their application in a wider range of CPS.

3 Background and Preliminaries

Recent advances in sensor technology have produced many types of sensors for area-monitoring and intruder detection purposes. Such sensors can be roughly classified into two categories: (1) active sensors (e.g., infrared sensors and radar sensors): These sensors radiate signal pulses and detect objects by the echo bouncing off the intruders; (2) passive sensors (e.g., acoustic sensors, seismic sensors, and magnetic sensors): These sensors only receive signals from the environment. Active sensors achieve higher accuracy, but require significantly more power to operate and drain batteries quickly. Furthermore, when active sensors radiate signal pulses, they are at high risk of being detected by the intruders. As a result, the CPS is usually deployed with a large number of low-cost, energy-saving passive sensors.

Though having different mechanisms and measurements, most passive sensors report the detected signals as a numeric value. For example, an acoustic sensor measures the air pressure of a sound wave and a magnetic sensor generates readings about magnetic force. Such measurements are influenced by two factors: (1) the intruder’s energy (i.e., the strength of emitted signals from the intruder); (2) the distance between the sensor and the intruder. Usually we can model the relationship between intruder $o$ and sensor $s$ as Eq. (1).

$$f(o, s) = \frac{e}{\alpha \cdot d(o, s)^{\beta} + \gamma} \tag{1}$$

where $e$ is $o$’s energy and $d(o, s)$ is the Euclidean distance between them. The parameters $\alpha$, $\beta$, and $\gamma$ are determined by the sensor types and mechanisms.

If there are multiple intruders in the monitoring area, we assume that their signals aggregate at each sensor. Let $O$ be the intruder set; sensor $s$’s reading is estimated as Eq. (2).

$$\hat{r}(s) = \sum_{o \in O} f(o, s) = \sum_{o \in O} \frac{e}{\alpha \cdot d(o, s)^{\beta} + \gamma} \tag{2}$$

All real world signals are influenced by noise, hence the observed reading of $s$ is

$$r(s) = \hat{r}(s) + \epsilon = \sum_{o \in O} f(o, s) + \epsilon = \sum_{o \in O} \frac{e}{\alpha \cdot d(o, s)^{\beta} + \gamma} + \epsilon \tag{3}$$

Without loss of generality, we assume that the background noise is zero mean Gaussian noise, i.e., $\epsilon \sim N(0, \sigma^2)$.
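The signal model of Eqs. (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple layout `(x, y, e)` for an intruder, and the default parameter values (taken from the notes of Table 1) are choices made here, not part of the original system.

```python
import math
import random

# Illustrative defaults; the notes of Table 1 list alpha = 10, beta = 1, gamma = 1.
ALPHA, BETA, GAMMA = 10.0, 1.0, 1.0


def signal(intruder, sensor, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Eq. (1): f(o, s) = e / (alpha * d(o, s)**beta + gamma)."""
    xo, yo, energy = intruder          # assumed layout: (x, y, e)
    xs, ys = sensor                    # assumed layout: (x, y)
    d = math.hypot(xo - xs, yo - ys)   # Euclidean distance d(o, s)
    return energy / (alpha * d ** beta + gamma)


def estimated_reading(intruders, sensor):
    """Eq. (2): the signals of all intruders aggregate at the sensor."""
    return sum(signal(o, sensor) for o in intruders)


def observed_reading(intruders, sensor, sigma, rng):
    """Eq. (3): estimated reading plus zero-mean Gaussian noise."""
    return estimated_reading(intruders, sensor) + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    rng = random.Random(42)
    intruders = [(0.0, 0.0, 11.0), (2.0, 0.0, 11.0)]
    sensor = (1.0, 0.0)
    print("noise-free:", estimated_reading(intruders, sensor))
    print("noisy     :", observed_reading(intruders, sensor, 0.1, rng))
```

Passing the random generator explicitly keeps simulated snapshots reproducible, which is convenient when comparing mining algorithms on the same synthetic data.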

Note that the sensor readings are collected and transmitted by various gateways, such as the aircraft bridges in Table 1. There are many state-of-the-art works on gateway design, sensor deployment, and message transmission[29, 30]. Since the main theme of this study is data mining, we assume that the data can be collected by the CPS (i.e., the sensors have already been deployed and the command center can receive timeslice snapshots of the data in real time). The task then boils down to finding the intruders’ information from such data.

Definition 1 (Problem Definition) Let $S$ be the set of deployed sensors in a CPS, $S = \{s_1(x_{s_1}, y_{s_1}), s_2(x_{s_2}, y_{s_2}), s_3(x_{s_3}, y_{s_3}), \cdots, s_n(x_{s_n}, y_{s_n})\}$, and $R$ be the sensors’ readings in a snapshot, $R = \{r(s_1), r(s_2), \cdots, r(s_n)\}$. The task of IntruMine is to estimate the intruders $O = \{o_1(x_{o_1}, y_{o_1}, e_1), o_2(x_{o_2}, y_{o_2}, e_2), \cdots, o_k(x_{o_k}, y_{o_k}, e_k)\}$ based on $S$ and $R$.

4 Trustworthiness Analysis of Sensor Data

Although there are a large number of sensors in a CPS, not all of them are relevant to the mining task. Typically, only a few of them have detected the intruders. The first step is to select the set of relevant sensors and analyze the trustworthiness of their data.

If a sensor works well, it should detect all intruder movements inside its detecting range. Formally, we define the monitored intruder set.

Definition 2 Let $O$ be the intruder set and $d_s$ be the detecting range of a sensor $s$. The monitored intruder set of $s$ is defined as $O_s = \{o \mid o \in O, \mathrm{dist}(s, o) < d_s\}$.

Similarly, we obtain the monitoring sensor set as below.

Definition 3 Let $S$ be a sensor set and $d_s$ be the detecting range of a sensor $s$. The monitoring sensor set of an intruder $o$ is defined as $S_o = \{s \mid s \in S, \mathrm{dist}(s, o) < d_s\}$. If $s \in S_o$ generates an alarm record $r_a(s, t)$, then $r_a(s, t)$ is said to be related to $o$.

Table 1 Experiment settings.

| Dataset | Number of objects | Number of sensors | Number of readings | Faulty (%) |
|---|---|---|---|---|
| Real ($D_1$) | 5 | 213 | $2.1 \times 10^5$ | $\approx 10$ |
| Syn 1 ($D_2$) | 64 | 400 | $4.3 \times 10^5$ | 20 |
| Syn 2 ($D_3$) | 64 | 2500 | $2.7 \times 10^6$ | 30 |
| Syn 3 ($D_4$) | 64 | 10 000 | $1.1 \times 10^7$ | 40 |

Notes: $\delta_r$: 0.3 to 0.9, default 0.7; $\delta_o$: 0.2 to 0.8, default 0.6; $\alpha = 10$; $\beta = 1$; $\gamma = 1$.

Note that an alarm may be related to multiple intruders and an intruder usually has several related alarms. In this way, we build up a relational graph between the intruders and sensors. This monitoring graph contains two kinds of nodes: the intruders and the data records of their monitoring sensors. The sensor-intruder relationship is modeled as an edge in the graph.
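As a rough illustration of Definitions 2 and 3, the sketch below builds both sides of the sensor-intruder relationship in one pass. It assumes a single detecting range shared by all sensors and uses hypothetical dictionary-based inputs; neither choice comes from the original framework.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def build_monitoring_graph(sensors, intruders, detect_range):
    """Link every sensor to the intruders inside its detecting range.

    sensors, intruders: dicts mapping an id to an (x, y) position.
    detect_range: one range d_s shared by all sensors (an assumption).

    Returns (monitored, monitoring) where
      monitored[s]  is the monitored intruder set O_s (Definition 2),
      monitoring[o] is the monitoring sensor set S_o (Definition 3).
    """
    monitored = defaultdict(set)
    monitoring = defaultdict(set)
    for sid, spos in sensors.items():
        for oid, opos in intruders.items():
            if dist(spos, opos) < detect_range:
                monitored[sid].add(oid)    # edge seen from the sensor side
                monitoring[oid].add(sid)   # same edge seen from the intruder side
    return monitored, monitoring


if __name__ == "__main__":
    sensors = {"s1": (0, 0), "s2": (3, 0), "s3": (10, 10)}
    intruders = {"o1": (1, 0), "o2": (9, 9)}
    monitored, monitoring = build_monitoring_graph(sensors, intruders, 2.5)
    print(dict(monitoring))
```

The nested loop is quadratic in the worst case; for thousands of sensors a grid or tree index over sensor positions would be the natural replacement.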

The monitoring graph partitions the CPS dataset and links the intruders to relevant data records. There are two kinds of edges in the graph: the normal edge linking the intruder with a normal record, and the alarm edge. The weight of an alarm edge represents the probability that the alarm is caused by such an intruder. Hence the system has to infer the weights of alarm edges, i.e., compute the conditional alarm trustworthiness $\tau(r_a(s_i, t) \mid o)$.

$\tau(r_a(s_i, t) \mid o)$ is determined by the coherence of the other sensors’ readings in the monitoring sensor set $S_o$. If the other sensors’ readings are all coherent with $r_a(s_i, t)$, its trustworthiness is high; otherwise it is unlikely to be caused by $o$.

$$\tau(r_a(s_i, t) \mid o) = \frac{\sum_{s_j \in S_o,\, s_j \neq s_i} \mathrm{coh}(r(s_j, t), r_a(s_i, t))}{|S_o| - 1} \tag{4}$$

To estimate the coherence score between two sensors’ records, we should consider both their reading difference and positions. When computing $\mathrm{coh}(r_a(s_i, t), r(s_j, t))$, the system should consider whether $s_j$ would report the same severity if it were located at $s_i$’s position.

The equation for computing $r(s, t)$ can be learned in advance. Then it is easy to deduce the inverse function of intruder $o$’s signal strength. For example, from Eq. (1) we can estimate $\Omega_j(o) = (r(s_j, t) - b)/\alpha$. The expected severity of sensor $s_j$ at $s_i$’s location is computed as

$$r'(s_j, t) = f(\mathrm{dist}(s_i, o), \Omega_j(o)).$$

Coherence $\mathrm{coh}(r_a(s_i, t), r(s_j, t))$ is judged by the difference between the expected reading and the real reading of $s_j$, as shown in Eq. (5). Its value range is [0, 1]. If sensor $s_j$’s severity is the same as the expected value, the coherence score reaches the maximum of 1; if the difference is larger than the standard deviation $\sigma$, i.e., $s_j$’s severity is quite different from the expected value, the coherence score is set to 0.

$$\mathrm{diff}(r', r) = |r'(s_j, t) - r(s_j, t)|, \qquad \mathrm{coh}(r_a(s_i, t), r(s_j, t)) = \begin{cases} 1 - \dfrac{\mathrm{diff}(r', r)}{\sigma}, & \text{if } \mathrm{diff}(r', r) < \sigma, \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

A low $\tau(r_a(s, t) \mid o)$ indicates two possibilities: (1) $r_a(s, t)$ is a false alarm; or (2) $r_a(s, t)$ is a true alarm, but it is not caused by intruder $o$. In either case, intruder $o$’s trustworthiness should be decreased. Therefore, we compute intruder $o$’s trustworthiness as the average of all its conditional alarm trustworthiness (Eq. (6)).

$$\tau(o) = \frac{\sum_{s \in S_o,\, r_a \in R_a} \tau(r_a(s, t) \mid o)}{|S_o|} \tag{6}$$

The equation to compute alarm trustworthiness is a bit different. If an alarm has different conditional trustworthiness with different intruders, the system takes the maximum one as the result (Eq. (7)), because the alarm is still meaningful even if only one meaningful intruder causes it.

$$\tau(r_a(s, t)) = \max(\tau(r_a(s, t) \mid o)), \quad o \in O_s \tag{7}$$
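A minimal sketch of how Eqs. (4) to (7) fit together is given below. It assumes that the expected readings of the other monitoring sensors have already been computed elsewhere and are passed in as `(expected, actual)` pairs; the function names and the convention of returning 0 when no other sensor is available are illustrative assumptions.

```python
def coherence(expected, actual, sigma):
    """Eq. (5): 1 - diff/sigma when diff < sigma, otherwise 0."""
    diff = abs(expected - actual)
    return 1.0 - diff / sigma if diff < sigma else 0.0


def conditional_trust(pairs, sigma):
    """Eq. (4): average coherence over the other sensors in S_o.

    pairs holds one (expected, actual) reading pair per sensor s_j != s_i,
    so len(pairs) plays the role of |S_o| - 1.
    """
    if not pairs:                      # assumption: nothing can corroborate the alarm
        return 0.0
    return sum(coherence(e, a, sigma) for e, a in pairs) / len(pairs)


def intruder_trust(conditional_trusts, n_monitoring_sensors):
    """Eq. (6): conditional alarm trustworthiness summed and divided by |S_o|."""
    return sum(conditional_trusts) / n_monitoring_sensors


def alarm_trust(trust_by_intruder):
    """Eq. (7): the maximum conditional trustworthiness over intruders in O_s."""
    return max(trust_by_intruder.values())


if __name__ == "__main__":
    sigma = 2.0
    # Three other sensors: perfectly coherent, half coherent, incoherent.
    pairs = [(5.0, 5.0), (4.0, 5.0), (3.0, 6.0)]
    print(conditional_trust(pairs, sigma))
    print(intruder_trust([0.5, 1.0], 4))
    print(alarm_trust({"o1": 0.5, "o2": 0.9}))
```

Taking the maximum in `alarm_trust` but the average in `intruder_trust` mirrors the asymmetry described above: one credible intruder is enough to make an alarm meaningful, whereas an intruder needs broad support from its monitoring sensors.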

5 Mining Intruders

After selecting the trustworthy sensor data, we can proceed to detect the intruders’ appearances in each snapshot. The sensors near the intruders usually have higher readings than sensors that are far away. Such sensors are denoted as the peak reading sensors. They can be easily obtained by a single scan of the sensors’ readings.

For each peak reading sensor $s$, the system initializes an intruder $o$ at $s$’s position. In some rare cases, several neighboring sensors may have the same readings. Then the system randomly picks one of them to initialize the intruder. It is worth noting that other strategies can also be used to initialize the intruders, such as the sampling method[16], random selection[10], and so on.
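The initialization step can be sketched as follows. Several details are assumptions made for the example rather than the paper's definitions: a sensor counts as a peak when its reading reaches a threshold and no sensor within a given radius reads higher, ties are broken by the smaller sensor id instead of randomly, and the initial energy is obtained by inverting Eq. (1) at distance zero, which gives $e = \gamma \cdot r(s)$.

```python
import math


def peak_sensors(positions, readings, radius, threshold):
    """Return ids of sensors whose reading is a local maximum.

    positions: dict id -> (x, y); readings: dict id -> reading.
    radius and threshold are illustrative parameters (assumptions).
    """
    peaks = []
    for sid, pos in positions.items():
        r = readings[sid]
        if r < threshold:
            continue
        is_peak = True
        for other, opos in positions.items():
            if other == sid or math.dist(pos, opos) > radius:
                continue
            ro = readings[other]
            # A strictly higher neighbor wins; among equal readings the
            # smaller id wins (deterministic stand-in for a random pick).
            if ro > r or (ro == r and other < sid):
                is_peak = False
                break
        if is_peak:
            peaks.append(sid)
    return peaks


def init_intruders(positions, readings, peaks, gamma=1.0):
    """Place one intruder (x, y, e) at every peak sensor.

    The energy guess e = gamma * r(s) follows from Eq. (1) with d = 0
    and a single intruder; this is an assumption of the sketch.
    """
    return [(positions[s][0], positions[s][1], gamma * readings[s]) for s in peaks]


if __name__ == "__main__":
    positions = {1: (0, 0), 2: (1, 0), 3: (5, 0), 4: (6, 0)}
    readings = {1: 0.9, 2: 0.5, 3: 0.8, 4: 0.8}
    peaks = peak_sensors(positions, readings, radius=1.5, threshold=0.3)
    print(peaks, init_intruders(positions, readings, peaks))
```

The neighbor search here compares every pair of sensors for clarity; with a spatial index the same test reduces to the single scan mentioned in the text.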

Since the initialization of the intruders’ appearances is derived from the peak reading sensors, the locations may not be accurate. To overcome this problem, we propose a method to adjust the attributes of the intruders iteratively.

In a snapshot, the reading of each sensor, $r(s)$, is already known. With the information of the intruders, we can further calculate the estimated reading of $s$ as

$$\hat{r}(s) = \sum_{o \in g} f(o, s) \tag{8}$$

Then the probability of observing the sensor’s reading as $r(s)$ is

$$p(s) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(r(s) - \hat{r}(s))^2}{2\sigma^2}} \tag{9}$$

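The joint probability of observing the readings within monitoring graph $g$ is

$$p(S) = \prod_{s \in g} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(r(s) - \hat{r}(s))^2}{2\sigma^2}} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{|g|} e^{-\frac{\sum_{s \in g} (r(s) - \hat{r}(s))^2}{2\sigma^2}} \tag{10}$$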

Hence the log-likelihood of observing the readings of sensors within monitoring graph $g$ is

$$\log(p(S)) \propto -|g| \log(\sigma^2) - \frac{\sum_{s \in g} (r(s) - \hat{r}(s))^2}{\sigma^2} \tag{11}$$
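To make the step from Eq. (9) to the log-likelihood concrete, the sketch below evaluates the exact logarithm of Eq. (10), that is $-\frac{|g|}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{s \in g}(r(s) - \hat{r}(s))^2$, and compares it with the sum of per-sensor log densities. Eq. (11) is this quantity up to a positive scale factor and an additive constant. Function names and the list-based inputs are assumptions made for illustration.

```python
import math


def density(reading, estimate, sigma2):
    """Eq. (9): Gaussian density of one observed reading."""
    return math.exp(-(reading - estimate) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def log_likelihood(readings, estimates, sigma2):
    """Logarithm of Eq. (10) for the sensors of one monitoring graph."""
    n = len(readings)
    sse = sum((r - rh) ** 2 for r, rh in zip(readings, estimates))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)


if __name__ == "__main__":
    readings = [1.0, 2.0, 3.0]
    estimates = [1.1, 1.9, 3.2]
    sigma2 = 0.25
    closed_form = log_likelihood(readings, estimates, sigma2)
    summed = sum(math.log(density(r, rh, sigma2)) for r, rh in zip(readings, estimates))
    print(closed_form, summed)
```

For a fixed $\sigma^2$ only the squared-error term depends on the intruder attributes, which is why the likelihood maximization reduces to a least-squares problem.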

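Suppose there are $k$ intruders in the monitoring graph $g$, with the intruders’ attribute vector $\boldsymbol{\theta} = \{(x_{o_1}, y_{o_1}, e_1), (x_{o_2}, y_{o_2}, e_2), \cdots, (x_{o_k}, y_{o_k}, e_k)\}$. Based on the maximum likelihood criterion, the estimation of $\boldsymbol{\theta}$ is equivalent to

$$\arg\max_{\boldsymbol{\theta}} \; -\sum_{s \in g} (r(s) - \hat{r}(s))^2 = \arg\min_{\boldsymbol{\theta}} \sum_{s \in g} (r(s) - \hat{r}(s))^2 \tag{12}$$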

The difference between the estimated reading $\hat{r}(s)$ and the real reading $r(s)$ represents the error $\Delta$ of the intruder information. In the best case, the sensors’ readings just fit the intruders’ information and the difference is minimized.

Let $\Delta_g = \sum_{s \in g} (r(s) - \hat{r}(s))^2$. We can use a gradient descent algorithm to compute the attribute vector $\boldsymbol{\theta}$ iteratively. At the first iteration, $\boldsymbol{\theta}$ is initialized with the information of the peak reading sensors. The value of the $n$-th iteration can be calculated from the gradient of the $(n-1)$-th iteration as follows.

$$\theta_i^{n} = \theta_i^{n-1} - \frac{1}{\sqrt{n}} \frac{\partial \Delta_g^{n-1}}{\partial \theta_i^{n-1}} \tag{13}$$

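Let $S_{o_i}$ be the monitoring sensor set of intruder $o_i$; the gradients of $\Delta_g$ with respect to $x_{o_i}$, $y_{o_i}$, and $e_i$ are computed as shown in Eqs. (14)-(16).

$$\frac{\partial \Delta_g}{\partial x_{o_i}} = \sum_{s_j \in S_{o_i}} \frac{4\alpha\,(r(s_j) - \hat{r}(s_j)) \cdot e_i \cdot (x_{s_j} - x_{o_i})}{(\alpha\, d(o_i, s_j)^{\beta} + \gamma)^2} \tag{14}$$

$$\frac{\partial \Delta_g}{\partial y_{o_i}} = \sum_{s_j \in S_{o_i}} \frac{4\alpha\,(r(s_j) - \hat{r}(s_j)) \cdot e_i \cdot (y_{s_j} - y_{o_i})}{(\alpha\, d(o_i, s_j)^{\beta} + \gamma)^2} \tag{15}$$

$$\frac{\partial \Delta_g}{\partial e_i} = \sum_{s_j \in S_{o_i}} \frac{-2\,(r(s_j) - \hat{r}(s_j))}{\alpha\, d(s_j, o_i)^{\beta} + \gamma} \tag{16}$$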
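The iterative estimation can be sketched as plain gradient descent on the squared reading difference. The sketch below is not the authors' implementation: it derives the partial derivatives directly from Eqs. (1) and (2) by the chain rule for a general $\beta$, so its constants and signs need not coincide with the closed forms printed above, and it adds a hypothetical `lr` multiplier to the $1/\sqrt{n}$ step of Eq. (13) so the step can be scaled to the data.

```python
import math


def model(theta, sensor, alpha, beta, gamma):
    """Estimated reading of one sensor: sum of Eq. (1) over intruders."""
    total = 0.0
    for (x, y, e) in theta:
        d = math.hypot(x - sensor[0], y - sensor[1])
        total += e / (alpha * d ** beta + gamma)
    return total


def objective(theta, sensors, readings, alpha, beta, gamma):
    """Sum of squared differences between real and estimated readings."""
    return sum((r - model(theta, s, alpha, beta, gamma)) ** 2
               for s, r in zip(sensors, readings))


def gradient(theta, sensors, readings, alpha, beta, gamma):
    """Partial derivatives of the objective w.r.t. each (x, y, e)."""
    grads = [[0.0, 0.0, 0.0] for _ in theta]
    for s, r in zip(sensors, readings):
        resid = r - model(theta, s, alpha, beta, gamma)
        for i, (x, y, e) in enumerate(theta):
            d = math.hypot(x - s[0], y - s[1])
            denom = alpha * d ** beta + gamma
            grads[i][2] += -2.0 * resid / denom           # same form as Eq. (16)
            if d > 0.0:                                   # position terms vanish at d = 0
                common = 2.0 * resid * e * alpha * beta * d ** (beta - 2) / denom ** 2
                grads[i][0] += -common * (s[0] - x)
                grads[i][1] += -common * (s[1] - y)
    return grads


def descend(theta, sensors, readings, alpha, beta, gamma, iters=100, lr=1.0):
    """Eq. (13) with step lr / sqrt(n); lr is an added assumption."""
    theta = [list(o) for o in theta]
    for n in range(1, iters + 1):
        g = gradient(theta, sensors, readings, alpha, beta, gamma)
        step = lr / math.sqrt(n)
        for o, go in zip(theta, g):
            for k in range(3):
                o[k] -= step * go[k]
    return [tuple(o) for o in theta]


if __name__ == "__main__":
    params = (1.0, 2.0, 1.0)   # illustrative alpha, beta, gamma
    sensors = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
    truth = [(1.2, 0.8, 5.0)]
    readings = [model(truth, s, *params) for s in sensors]
    start = [(0.9, 1.1, 4.0)]
    fitted = descend(start, sensors, readings, *params, iters=200, lr=0.05)
    print(objective(start, sensors, readings, *params),
          objective(fitted, sensors, readings, *params))
```

Checking an analytic gradient against finite differences, as the accompanying test does, is a cheap safeguard whenever the signal model or its parameters change.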

The estimation algorithm adjusts the intruder’s attribute values and updates the monitoring graphs. However, there are still two problems influencing the mining accuracy: (1) Some monitoring graphs are created due to faulty readings and may not contain any real intruders. Hence false positive intruders will be generated by such graphs. (2) Some sensors in the monitoring graph of real intruders are unreliable, such as s4 in g1. To solve these problems, the system needs to verify the estimated results.

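From Eq. (11), the derivative of the log-likelihood $\log(p(S))$ with respect to $\sigma^2$ is

$$\frac{\partial \log(p(S))}{\partial \sigma^2} = -\frac{|g|}{\sigma^2} + \frac{\sum_{s \in g} (r(s) - \hat{r}(s))^2}{(\sigma^2)^2} \tag{17}$$

Setting the derivative to zero, we obtain

$$\sigma^2 = \frac{\sum_{s \in g} (r(s) - \hat{r}(s))^2}{|g|} \tag{18}$$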

Based on the estimated $\sigma^2$, we can verify the reliability of the sensor’s reading. A classic measurement in statistics is the three-standard-deviation rule: if the deviation of the estimated reading $\hat{r}(s)$ from the true reading $r(s)$ is not within 3 standard deviations, i.e., $(r(s) - \hat{r}(s))^2 > 9\sigma^2$, then we judge the reading of that sensor to be unreliable.
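The variance estimate of Eq. (18) and the three-standard-deviation check can be sketched as below. The dictionary-based inputs and function names are assumptions for the example.

```python
def estimate_variance(readings, estimates):
    """Eq. (18): mean squared difference over the sensors of one graph.

    readings, estimates: dicts mapping a sensor id to r(s) and r_hat(s).
    """
    return sum((readings[s] - estimates[s]) ** 2 for s in readings) / len(readings)


def unreliable_sensors(readings, estimates):
    """Ids of sensors with (r(s) - r_hat(s))**2 > 9 * sigma**2."""
    sigma2 = estimate_variance(readings, estimates)
    return [s for s in readings
            if (readings[s] - estimates[s]) ** 2 > 9.0 * sigma2]


if __name__ == "__main__":
    # Nineteen sensors that agree with the model and one that does not.
    estimates = {f"s{i}": 1.0 for i in range(1, 21)}
    readings = {f"s{i}": 1.1 for i in range(1, 20)}
    readings["s20"] = 3.0
    print(estimate_variance(readings, estimates))
    print(unreliable_sensors(readings, estimates))
```

Because $\sigma^2$ is estimated from the same residuals it is later compared against, a single faulty sensor can only exceed the $9\sigma^2$ bound when the graph contains enough well-behaved sensors to keep the average small.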

To filter out false positives, we define the confidence of intruder detection as follows.

Definition 4 The confidence of detected intruder $o$ is the probability that $o$ really exists, denoted as $\phi(o)$.

Intuitively, the readings of monitoring sensors are caused by intruders. If the actual readings are similar to the estimated ones, this suggests a high confidence that the intruder is real. For a false positive, the difference between actual and estimated readings will be large. Therefore, we can estimate the confidence of a detected intruder from the reading difference of its monitoring sensor set.

In the verification process, the system first calculates the reading difference $\Delta_{o_i}$ for each intruder and the average difference of all the intruders.

$$\Delta_{o_i} = \frac{\sum_{s \in S_{o_i}} (r(s) - \hat{r}(s))^2}{|S_{o_i}|} \tag{19}$$

$$\bar{\Delta} = \frac{\sum_{o_i \in O} \Delta_{o_i}}{|O|} \tag{20}$$

Then the intruder detection confidence is estimated as Eq. (21). For an intruder $o_i$, if the monitoring sensors’ readings are coherent with the information (small reading difference), the intruder’s confidence is high. If the reading difference is larger than $\bar{\Delta}$, which indicates that the real readings are quite different from the estimated ones, the intruder detection confidence is set to zero.

$$\phi(o_i) = \begin{cases} 1 - \dfrac{\Delta_{o_i}}{\bar{\Delta}}, & \Delta_{o_i} < \bar{\Delta}; \\ 0, & \Delta_{o_i} \geq \bar{\Delta} \end{cases} \tag{21}$$
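The verification step of Eqs. (19) to (21) can be sketched as follows. Two points are assumptions of this example rather than statements from the paper: intruders whose confidence falls below the confidence threshold $\delta_o$ are discarded, and when every intruder fits perfectly (average difference zero) all confidences are set to 1.

```python
def reading_difference(readings, estimates):
    """Eq. (19): mean squared reading difference over one intruder's sensors."""
    return sum((r - rh) ** 2 for r, rh in zip(readings, estimates)) / len(readings)


def detection_confidence(differences):
    """Eqs. (20) and (21): confidence of every intruder from its difference.

    differences: dict mapping an intruder id to its reading difference.
    """
    mean_diff = sum(differences.values()) / len(differences)       # Eq. (20)
    if mean_diff == 0.0:                 # assumption: every intruder fits perfectly
        return {o: 1.0 for o in differences}
    return {o: 1.0 - d / mean_diff if d < mean_diff else 0.0       # Eq. (21)
            for o, d in differences.items()}


def verified_intruders(differences, delta_o=0.6):
    """Keep intruders whose confidence reaches delta_o (assumed filter rule)."""
    confidence = detection_confidence(differences)
    return [o for o, c in confidence.items() if c >= delta_o]


if __name__ == "__main__":
    differences = {"o1": 0.25, "o2": 0.5, "o3": 2.25}
    print(detection_confidence(differences))
    print(verified_intruders(differences))
```

Since the confidence is measured relative to the average difference of the current snapshot, the same absolute error can pass in a noisy snapshot and fail in a clean one.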

6 Performance Evaluation

6.1 Experiment setup

Datasets: We conduct extensive experiments to evaluate the proposed methods, using both real-world and synthetic datasets. To test the performance of IntruMine on large and untrustworthy data collected from CPS, we generate three synthetic datasets based on the military trajectories in the CBMANET project[31], in which an infantry battalion of 64 vehicles moves from Fort Dix to Lakehurst during a mission lasting 3 hours. The data generator simulates monitoring fields along their routes with 400 to 10 000 deployed sensors, and each sensor reports a reading every 10 seconds.

Baselines: The proposed algorithm (IM) is compared with the TruAlarm method (TA) in Ref. [16]. We also implemented the Maximum Likelihood based estimation (ML) method based on the principles proposed in Ref. [10]. We evaluated both the efficiency and the effectiveness of the algorithms in the experiments.

Environments: The experiments are conducted on a PC with Intel 7500 Dual CPU 2.20 GHz and 3.00 GB RAM. The operating system is Windows 7 Enterprise. All the algorithms are implemented in Java on the Eclipse 3.3.1 platform with JDK 1.5.0. The datasets and parameter settings are listed in Table 1.

6.2 Comparisons in mining efficiency

In the first experiment, we evaluate the efficiency of different algorithms on default settings. The system runs IM, TA, and ML on the four datasets and records their average time cost on each snapshot. Figure 1a shows the results by dataset and Fig. 1b records the algorithms’ running time w.r.t. the total number of sensors, $N$. Note that the y-axes are in logarithmic scale. IM achieves the best efficiency on all the datasets. In the largest dataset $D_4$, IM is an order of magnitude faster than ML and two orders of magnitude faster than TA.

We also study the factors that influence IM’s efficiency. We set the reading threshold $\delta_r$ to 0.3 to 0.9 of the typical intruder energy and record the algorithm’s time cost on datasets $D_1$ to $D_4$ in Fig. 2a. Then we carry out the same experiments for the confidence threshold $\delta_o$ (Fig. 2b). The results show that the influence of $\delta_r$ is larger than that of $\delta_o$, because $\delta_o$ only influences the verification step, but $\delta_r$ determines the size of the monitoring graphs. With higher $\delta_r$, the system can generate smaller monitoring graphs and increase the mining efficiency, especially in the large datasets with densely deployed sensors (e.g., $D_4$).

6.3 Evaluations of detecting effectiveness

Fig. 1 Efficiency: (a) different datasets and (b) sensor number. Both panels plot time (ms) on a logarithmic axis for IM, TA, and ML; panel (a) is indexed by dataset ($D_1$ to $D_4$) and panel (b) by the number of sensors $N$ (400, 2500, 10 000).

Fig. 2 Efficiency: IntruMine on different $\delta_r$ and $\delta_o$. Panel (a) plots time (ms) against $\delta_r$ (0.3 to 0.9) and panel (b) against $\delta_o$ (0.2 to 0.8) for datasets $D_1$ to $D_4$.

To evaluate the quality of mining results, we retrieve the intruder’s true position and energy in each snapshot as the ground truth and compare the mining results with them. The system first compares the number of detected intruders with the ground truth to calculate the measures of precision and recall, then matches each detected intruder to the nearest one in the ground truth and computes the relative errors of energy and position.
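The evaluation procedure can be sketched as below. The text does not spell out when a detection counts as correct or how the position error is normalized, so this example makes two labeled assumptions: a detection is a true positive when an unmatched ground-truth intruder lies within a hypothetical `match_radius`, and the position error is reported relative to that radius.

```python
import math


def evaluate(detected, truth, match_radius):
    """Precision, recall, and mean relative errors for one snapshot.

    detected, truth: lists of (x, y, e) tuples.
    match_radius: assumed matching distance, also used to normalize
    the position error.
    """
    remaining = list(truth)
    energy_errs, pos_errs = [], []
    for (x, y, e) in detected:
        if not remaining:
            break
        # Greedy match to the nearest still-unmatched ground-truth intruder.
        nearest = min(remaining, key=lambda t: math.hypot(x - t[0], y - t[1]))
        gap = math.hypot(x - nearest[0], y - nearest[1])
        if gap <= match_radius:
            remaining.remove(nearest)
            energy_errs.append(abs(e - nearest[2]) / nearest[2])
            pos_errs.append(gap / match_radius)
    tp = len(energy_errs)
    return {
        "precision": tp / len(detected) if detected else 0.0,
        "recall": tp / len(truth) if truth else 0.0,
        "energy_error": sum(energy_errs) / tp if tp else None,
        "position_error": sum(pos_errs) / tp if tp else None,
    }


if __name__ == "__main__":
    truth = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]
    detected = [(0.3, 0.4, 9.0), (10.0, 1.0, 22.0), (50.0, 50.0, 5.0)]
    print(evaluate(detected, truth, match_radius=2.0))
```

Greedy nearest matching is order dependent; an optimal assignment would be a reasonable alternative when detections are dense.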

Since ML requires the exact number of real intruders as the input (i.e., it has 100% precision and recall by construction), we only record the precision and recall of IM and TA in Fig. 3. Both of them can detect all the real intruders, but IM’s precision is about 20% higher. The number of false positives reported by IM is only about half that of TA, because IM’s peak sensor based initialization and intruder verification step filter out the false positives effectively.

Since all three methods can detect the intruders, we further check their detecting effectiveness by calculating the relative errors of energy and position. The results are shown in Fig. 4. ML has the largest errors: The average energy error is more than 50% and the position error is about 40% in $D_3$ and $D_4$. The reason for ML’s failure is that the algorithm takes into account the readings from all the sensors, so it is inevitably influenced by the faulty readings and noise. ML’s accuracy degenerates rapidly on datasets $D_3$ and $D_4$, since there are more faulty readings. This result indicates that ML is not feasible for processing untrustworthy data. The errors of IM are much lower, with no more than 5% in all the datasets. The performance of IM even improves on $D_3$ and $D_4$, because with more deployed sensors, IM can effectively filter the untrustworthy ones and utilize the information for accurate estimation.

Fig. 3 Effectiveness: (a) precision and (b) recall on different datasets. Both panels plot percentages (0 to 100) for IM and TA on datasets $D_1$ to $D_4$.

Fig. 4 Effectiveness: (a) energy error and (b) position error on different datasets. Both panels plot the error (%) of IM, TA, and ML on datasets $D_1$ to $D_4$.

7 Conclusions and Future Work

This paper studies the problem of sensor data mining in cyber-physical systems. A method called IntruMine is proposed to detect and verify the intruders from untrustworthy sensor data. The system constructs the monitoring graph and estimates the intruder attributes with the link information. The information of reading difference is used to filter out the unreliable sensors and false positives. There are many interesting directions of future work in the line of cyber-physical data mining, such as combining CPS with social networks, developing novel mining functions on feature-rich movement data, and integrating the technology with real-world interdisciplinary applications.

• Combining CPS with social networks: Social network analysis has attracted much attention in recent years. It is attractive to integrate CPS with social networks. There are several novel problems in combining the physical data with social networks, including: (1) indexing, searching, and querying the CPS data via the social network structure; (2) considering privacy and security issues when publishing CPS data on a social network; (3) discovering spatial-temporal patterns from CPS data with social networks, e.g., mining traveling patterns in a location-based social network. These studies will enrich the research on both CPS and social networks.

• Mining feature-rich sensor data: The sensor data are usually collected with rich information features of the objects, e.g., a traveler’s trajectory can be collected by the smart phone, with the information of the user’s profile, text messages, contacts, and so on. This information not only helps understand the semantics and purpose of the data records, but also contributes to improving many mining functions, such as target prediction, traveler clustering, route recommendation, and so on.

• Integrating the real-world interdisciplinary applications: Information management and data mining on CPS represent an important research frontier in database, data mining, sensor network, and information technology. This technology has a wide range of applications across different domains, such as patient healthcare, battlefield surveillance, traffic monitoring, and other cases in science, engineering, education, society, and any field with massive, dynamic, heterogeneous, and interrelated physical and virtual data. It is important to integrate the algorithms with real applications to improve system performance.

Acknowledgements

The work was supported in part by the U.S. Army Research Laboratory under Cooperative Agreement Nos. W911NF-09-2-0053 (NS-CTA) and W911NF-11-2-0086 (Cyber-Security), the U.S. Army Research Office under Cooperative Agreement No. W911NF-13-1-0193, DTRA, and U.S. National Science Foundation grants CNS-0931975, IIS-1017362, IIS-1320617, and IIS-1354329. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

[1] I. Hwang, H. Balakrishnan, K. Roy, and C. Tomlin,Multiple-target tracking and identity management inclutter, with application to aircraft tracking, in Proceedingsof the American Control Conference, 2004.

[2] M. Hewish, Reformatting fighter tactics, http://www.cs.berkeley.edu/�prabal/nest/resources/Hewish2001.pdf,2001.

[3] L. Tang, X. Yu, Q. Gu, J. Han, A. Leung, and T. La Porta,Mining lines data in cyber-physical system, in KDD, 2013.

[4] C. Lo, W. Peng, C. Chen, T. Lin, and C. Lin,Carweb: A traffic data collection platform, in InternationalConference on Mobile Data Management, 2008.

[5] Y. Zheng and X. Zhou, Computing with SpatialTrajectories. Springer, 2011.

[6] X. Li, R. Lu, X. Liang, X. Shen, J. Chen, and X. Lin,Smart community: An internet of things application, IEEECommunications Magazine, vol. 49, no. 11, pp. 68-75,2011.

[7] G. Tolle, J. Polastre, R. Szewczyk, D. E. Culler, N. Turner,K. Tu, S. Burgess, T. Dawson, P. Buonadonna, D. Gay,and W. Hong, A macroscope in the redwoods, in the ACMConference on Embedded Networked Sensor Systems,2005.

[8] Z. Li, J. Han, M. Ji, L. Tang, Y. Yu, B. Ding,J. Lee, and R. Kays, Movemine: Mining moving objectdata for discovery of animal movement patterns, ACMTransactions on Intelligent Systems and Technology, vol.2, no. 4, p. 27, 2011.

[9] R. Szewczyk, J. Polastre, A. Mainwaring, and D. Culler,Lessons from a sensor network expedition, in EuropeanWorkshop on Wireless Sensor Networks, 2004.

[10] X. Sheng and Y. Hu, Maximum likelihood multiple sourcelocalization using acoustic energy measurements withwireless sensor networks, IEEE Transactions on SignalProcessing, 2005.

[11] M. A. Hammad, W. G. Aref, and A. K. Elmagarmid,Stream window join: Tracking moving objects in sensornetwork databases, in International Conference onScientific and Statistical Database Management, 2003.

[12] J. Aslam, Z. Butler, F. Constantin, V. Crespi, G. Cybenko,and D. Rus, Tracking a moving object with a binary sensornetwork, in the ACM Conference on Embedded NetworkedSensor Systems, 2003.

[13] O. Ozdemir, R. Niu, and P. K. Varshney, Trackingin wireless sensor network using particle filtering:Physical layer considerations, IEEE Transactions onSignal Processing, 2009.

[14] S. J. Pan, J. T. Kwok, Q. Yang, and J. J. Pan, Adaptivelocalization in a dynamic wifi environment through multi-view learning, in AAAI, 2007.

[15] R. Pan, J. Zhao, V. W. Zheng, J. J. Pan, D. Shen, S. J. Pan,and Q. Yang, Domain constrained semisupervised miningof tracking models in sensor networks, in KDD, 2007.

[16] L. Tang, X. Yu, S. Kim, J. Han, C. Hung, and W. Peng,Trualarm: Trustworthiness analysis of sensor networks incyber-physical systems, in ICDM, 2010.

[17] L. Tang, Q. Gu, X. Yu, J. Han, T. La Porta, A. Leung,T. Abdelzaher, and L. Kaplan, Intrumine: Mining intrudersin untrustworthy data of cyber-physical systems, in Proc. ofSIAM International Conference on Data Mining (SDM),2012.

[18] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein,and W. Hong, Model-driven data acquisition in sensornetworks, in VLDB, 2004.

[19] E. Elnahrawy and B. Nath, Cleaning and querying noisysensors, in WSNA, 2003.

[20] F. Koushanfar, M. Potkonjak, and A. Sangiovanni-Vincentelli, On-line fault detection of sensormeasurements, in IEEE Conference on Sensors, 2003.

[21] B. Krishnamachari and S. Iyengar, Distributed bayesianalgorithms for fault-tolerant event region detection inwireless sensor networks, IEEE Trans. Comput., vol. 53,no. 3, pp. 241-250, 2004.

[22] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, andJ. Widom, Declarative support for sensor data cleaning, inICPC, 2006.

234 Tsinghua Science and Technology, June 2014, 19(3): 225-234

[23] S. Subramaniam, T. Palpanas, D. Papadopoulos,V. Kalogeraki, and D. Gunopulos, Online outlierdetection in sensor data using non-parametric models, inVLDB, 2006.

[24] X. Xiao, W. Peng, C. Hung, and W. Lee, Using sensorranksfor in-network detection of faulty readings in wirelesssensor networks, in DEWMA, 2007.

[25] K. Ni, N. Ramanathan, M. N. H. Chehade, L. Balzano,S. Nair, S. Zahedi, E. Kohler, G. J. Pottie, M. H. Hansen,and M. B, Srivastava, Sensor network data fault types,ACM Transactions on Sensor Networks, 2009.

[26] K. Ni and G. Pottie, Bayesian selection of non-faultysensors, in IEEE International Symposium on InformationTheory, 2007.

[27] L. Tang, B. Cui, H. Li, G. Miao, D. Yang, and X. Zhou,

Effective variation management for pseudo periodicalstreams, in SIGMOD, 2007.

[28] X. Yu, L. Tang, and J. Han, Filtering and refinement:A two-stage approach for efficient and effective anomalydetection, in ICDM, 2009.

[29] C. Lin, W. Peng, and Y. Tseng, Efficient in-networkmoving object tracking in wireless sensor network, IEEETransaction on Mobile Computing, vol. 5, no. 8, pp. 1044-1056, 2006.

[30] V. Cevher and L. M. Kaplan, Acoustic sensor networkdesign for position estimation, ACM Transactions onSensor Networks, vol. 5, no. 3, 2009.

[31] T. Krout, Cb manet scenario data distribution, in BBNTechnique Report, 2007.

Lu-An Tang is a researcher at NEC Laboratories America. He received the PhD degree from the University of Illinois at Urbana-Champaign in 2013. His principal research interests are in data mining, cyber-physical systems, and computer security. He has over 30 publications in books, journals, and major conferences.

Jiawei Han received his PhD degree from the University of Wisconsin-Madison in 1985. He is the Abel Bliss Professor of Computer Science, University of Illinois at Urbana-Champaign. He has been researching data mining, information network analysis, database systems, and data warehousing, with over 600 journal and conference publications. He has chaired or served on many program committees of international conferences, including as PC co-chair for the KDD, SDM, and ICDM conferences, and as Americas Coordinator for the VLDB conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data and is serving as the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab. He is a fellow of ACM and IEEE, and received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book “Data Mining: Concepts and Techniques” has been widely used as a textbook worldwide.

Guofei Jiang is the Vice President of solution technology at NEC Laboratories America (NECLA), assisting the transformation of NEC’s research portfolio toward solutions and services. He received his PhD degree from Beijing Institute of Technology in 1998. Concurrently, he also leads a large group of several dozen research staff from the global network of NEC R&D units. His group conducts fundamental and applied research in the areas of big data analytics, distributed systems and cloud platforms, software-defined networking, and computer security. He has published over 120 technical papers and has over 50 patents granted or applied for. His inventions have been successfully commercialized as award-winning NEC products and solutions, and have directly created new business lines for NEC with tens of millions of US dollars in revenue.