On Measuring the Similarity of Network Hosts: Pitfalls, New Metrics, and Empirical Analyses

Scott E. Coull, RedJack, LLC., Silver Spring, MD, [email protected]

Fabian Monrose, University of North Carolina, Chapel Hill, NC, [email protected]

Michael Bailey, University of Michigan, Ann Arbor, MI, [email protected]

Abstract

As the scope and scale of network data grows, security practitioners and network operators are increasingly turning to automated data analysis methods to extract meaningful information. Underpinning these methods are distance metrics that represent the similarity between two values or objects. In this paper, we argue that many of the obvious distance metrics used to measure behavioral similarity among network hosts fail to capture the semantic meaning imbued by network protocols. Furthermore, they also tend to ignore long-term temporal structure of the objects being measured. To explore the role of these semantic and temporal characteristics, we develop a new behavioral distance metric for network hosts and compare its performance to a metric that ignores such information. Specifically, we propose semantically meaningful metrics for common data types found within network data, show how these metrics can be combined to treat network data as a unified metric space, and describe a temporal sequencing algorithm that captures long-term causal relationships. In doing so, we bring to light several challenges inherent in defining behavioral metrics for network data, and put forth a new way of approaching network data analysis problems. Our proposed metric is empirically evaluated on a dataset of over 30 million network flows, with results that underscore the utility of a holistic approach to network data analysis.

1 Introduction

Network operators face a bewildering array of security and operational challenges that require significant instrumentation and measurement of their networks, including denial of service attacks, supporting quality of service, and capacity planning. Unfortunately, the complexity of modern networks often makes manual examination of the data provided by such instrumentation impractical. As a result, operators and security practitioners turn to automated analysis methods to pinpoint interesting or anomalous network activities. Underlying each of these methods and their associated applications is a fundamental question: to what extent are two sets of network activities similar?

As straightforward as this may seem, the techniques for determining this similarity are as varied as the number of domains they have been applied to. These range from identifying duplicates and minor variations using cryptographic hashes and edit distance of payloads, to the use of distributions of features and their related statistics (e.g., entropy). While each has been shown to provide value in solving specific problems, the diversity of approaches clearly indicates that there is no generally accepted method for reasoning about the similarity of network activities.

With this in mind, the goal of this paper is to make progress in developing a unified approach to measuring the similarity of network activities. In doing so, we hope to encourage a more rigorous method for describing network behaviors, which will hopefully lead to new applications that would be difficult to achieve otherwise. While a complete framework for rigorously defining distance is beyond the scope of any one paper, we address two key aspects of network similarity that we believe must be considered in any such framework: namely, the spatial and temporal characteristics of the network data.

Here, spatial characteristics refer to the unique semantic relationships between the values in two identical or related fields. For example, if we wish to cluster tuples of data that included IP addresses and port numbers, we would have two obvious ways of accomplishing the task: (1) treat the IPs and ports as numeric values (i.e., integers) and use subtraction to calculate their distance, or (2) treat them as discrete values with no relationship to one another (e.g., in a probability distribution). Clearly, neither of these two options is exactly correct, since the network protocols that define these data types also define unique semantic relationships.

Temporal characteristics, on the other hand, describe the causal relationships among the network activities over time. One way to capture temporal information is to examine short n-grams or k-tuples of network activities. However, this too may ignore important aspects of network data. As an example, changes in traffic volume may alter the temporal locality captured by these short windows of activity. Moreover, network data has no restriction on this temporal locality, which means that activities can have long-term causal relationships with each other, extending to minutes, hours, or even days. Successfully building robust notions of similarity that address these temporal characteristics means addressing similarity over large time scales.

Contributions. In this paper, we explore these spatial and temporal properties in an effort to learn what impact they may have on the performance of automated network analysis methods. To focus our study, we examine the problem of classifying host behaviors, which requires both strong notions of semantically-meaningful behavior and causal relationships among these behaviors to provide a meaningful classification. Furthermore, we develop an example unified metric space for network data that encodes the unique spatial and temporal properties of host behavior, and evaluate our proposed metric by comparing it to one that ignores those properties. We note that it is not our intention to state that current analysis methodologies are "wrong," nor that we present the best metric for computing similarity. Instead, we explore whether general metrics for network activities can be created, how such metrics might lead to improved analysis, and examine the potential for new directions of research.

More concretely, we begin by defining metric spaces for a subset of data types commonly found within network data. We design these metric spaces to encode the underlying semantic relationship among the values for the given data type, including non-numeric types like IP addresses and port numbers. Then, we show how to combine these heterogeneous metric spaces into a unified metric space that treats network data records (e.g., packets, network flows) as points in a multi-dimensional space. Finally, we describe host behaviors as a time series of these points, and provide a dynamic time warping algorithm that efficiently measures behavioral distance between hosts, even when those time series contain millions of points. In doing so, we develop a geometric interpretation of host behavior that is grounded in semantically-meaningful measures of behavior, and the long-term temporal characteristics of those behaviors. Wherever possible, we bring to light several challenges that arise in the development of general metrics for network data.

To evaluate our proposed metric, we use a dataset containing over 30 million flows collected at the University of Michigan. Our experiments compare the performance of our metric to that of the L1 (i.e., Manhattan distance) metric, which ignores semantic and temporal information, in a variety of cluster analysis tasks. The results of these experiments indicate that semantic and temporal information play an important role in capturing a realistic and intuitive notion of network behaviors. Furthermore, these results imply that it may be possible to treat network data in a more rigorous way, similar to traditional forms of numeric data. To underscore this potential, we show how our metric may be used to measure the privacy provided by anonymized network data in the context of well-established privacy definitions for real-valued, numeric data; namely, Chawla et al.'s (c, t)-isolation definition [6].

2 Related Work

There is a long, rich body of work related to developing automated methods of network data analysis. These include, but are not limited to, supervised [18, 8, 2, 30, 9, 31] and unsupervised [17, 15, 29, 14, 10, 32, 3] traffic classification, anomaly and intrusion detection [11, 28, 27, 24, 13], and data privacy applications [7, 25, 16, 4]. Rather than attempt to enumerate the various approaches to automated data analysis, we instead point the interested reader to recent surveys of traffic classification [22] and anomaly detection techniques [5]. Also, we reiterate that the purpose of this paper is to explore the role of semantics and temporal information in developing a framework for defining similarity among network activities – a task that, to the best of our knowledge, has not been thoroughly examined in the literature thus far.

Of the network data analysis approaches proposed thus far, the work of Eskin et al. [11] is most closely related to our own. Eskin et al. address the problem of unsupervised anomaly detection by framing it as a form of outlier detection in a geometric space. Their approach uses kernel functions to map arbitrary input spaces to a high-dimensional feature space that encodes the distances among the input values. They show how to use these kernel functions to encode short windows of k system calls for host-based anomaly detection, and fields found within network data for network-based detection.

Clearly, Eskin et al. share our goal of describing a unified metric (i.e., feature) space for measuring the similarity of network data. However, there are two very important distinctions. First, while Eskin et al. show how to map non-numeric types (e.g., IP addresses) to a common feature space, they do so in such a way that all semantic relationships among the values are removed. In particular, their kernel function treats each value as a discrete value with a binary distance measure: either the value is the same or it is different. Second, their approach for encoding sequences of activities (i.e., system calls in their case) is limited to relatively short windows due to the exponential growth in dimensions of the feature space and the sparsity of that space (i.e., the "curse of dimensionality"). By contrast, we seek to explore the semantic and temporal information that is missing from their metrics by developing semantically-meaningful spatial metrics and long-term temporal metrics based on dynamic time warping of time series data.

3 Preliminaries

For ease of exposition, we first provide definitions and notation that describe the network data we analyze in a format-independent manner. We also define the concepts of metric spaces and product metrics, which we use to create a foundation for measuring similarity among network hosts.

Network Data. Data describing computer network activities may come in many different forms, including packet traces, flow logs, and web proxy logs. Rather than describe each of the possible formats individually, we instead define network data as a whole in more abstract terms. Specifically, we consider all network data to be a database of m rows and n columns, which we represent as an m × n matrix. The rows represent individual records, such as packets or flows, with n fields, which may include source and destination IP addresses, port numbers, and time stamps. We denote the ith row as the n-dimensional vector v_i = <v_{i,1}, ..., v_{i,n}>, and the database as V = <v_1, ..., v_m>^T. For our purposes, we assume a total ordering on the rows v_i ≤ v_{i+1} based on when the record was added to the database by the network sensor. Furthermore, we associate each column in the matrix (i.e., field) with an atomic data type that defines the semantic relationship among its values. More formally, we define a set of types T = {t_1, ..., t_ℓ} and an injective function F : [1, n] → T that maps a column to its associated data type.

Metric Spaces. To capture a notion of similarity among values in each column, we define a metric space for each data type in the set T. A metric space is simply a tuple (X, d), where X is a non-empty set of values being measured and d : X × X → R+ is a non-negative distance metric. The metric function must satisfy three properties: (1) d(x, y) = 0 iff x = y; (2) d(x, y) = d(y, x); and (3) d(x, y) + d(y, z) ≥ d(x, z). We denote the metric space for the type t_j as (X_{t_j}, d_{t_j}).

Given the metric spaces associated with each of the columns via the data type mapping described above, (X_{F(1)}, d_{F(1)}), ..., (X_{F(n)}, d_{F(n)}), we can define a p-product metric to combine the heterogeneous metric spaces into a single metric space that measures similarity among records as if they were points in an n-dimensional space. Specifically, the p-product metric is defined as (X_{F(1)} × ... × X_{F(n)}, d_p), where X_{F(1)} × ... × X_{F(n)} denotes the Cartesian product of the sets and d_p is the p-norm:

d_p(x, y) = (d_{F(1)}(x_1, y_1)^p + ... + d_{F(n)}(x_n, y_n)^p)^(1/p)

We note that the metrics we propose are straightforward generalizations of well-known metrics [19]; hence, the proofs are omitted for brevity.
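To make the p-product construction concrete, the following sketch combines per-field metrics with a p-norm. The field metrics and example records are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical per-field metric for numeric fields (e.g., time, size):
# absolute difference, as described in Section 4.1.
def d_numeric(x, y):
    return abs(x - y)

def product_metric(record_x, record_y, field_metrics, p=2):
    """p-product metric: combine per-field distances with a p-norm."""
    total = sum(d(x, y) ** p
                for d, x, y in zip(field_metrics, record_x, record_y))
    return total ** (1.0 / p)

# Two toy 2-field records (say, a timestamp and a byte count).
x = (10, 300)
y = (13, 304)
print(product_metric(x, y, [d_numeric, d_numeric], p=2))  # Euclidean: 5.0
```

With p = 2 this is exactly the Euclidean distance between records, as used later in Section 4.2.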

4 Metric Spaces for Network Data

To provide a foundation for measuring similarity among network hosts, we define a metric space that captures both the spatial and temporal characteristics of the host's behaviors as follows. We begin by defining metric spaces that capture semantically rich relationships among the values for each data type found in the network data. For the purposes of this initial study, we restrict ourselves to providing example metrics for four prevalent data types: IP addresses, port numbers, time fields, and size fields. Next, we show how to combine these heterogeneous metric spaces into a single, unified metric space using a p-product metric and a novel normalization procedure that retains the semantic relationships of the constituent metric spaces. This allows us to treat each network data record as a point and capture the spatial characteristics of the host's behavior in a meaningful way. Finally, we model a host's temporal behavior as a time series of points made up of records associated with the host (e.g., flows sent or received by the host), and show how dynamic time warping may be used to efficiently measure distance between the behavior of two hosts.


4.1 Data Types and Metric Spaces

Network data may contain a wide variety of data types, each with its own unique semantics and range of possible values. That said, without loss of generality, we can classify these types as being numeric (e.g., timestamps, TTL values, sizes) or categorical (e.g., TCP flags, IPs, ports) in nature. The primary distinction between these two categories of types is that numeric types have distance metrics that naturally follow from the integer or real number values used to represent them, while categorical types often do not maintain obvious linear relationships among the values. Here, we describe example metric spaces for two data types in each category: time and size as numeric types, and IP and port as categorical.

Numeric Types. The time and size data types contain values syntactically encoded as 16- or 32-bit integers, and have semantics that mirror those of the integers themselves. That is, a distance of ten in the integers is semantically the same as ten bytes or ten seconds¹. Both types can be represented by metric spaces of the form (X = {0, ..., M}, d(x, y) = |x − y|), where M is simply the largest value possible in the encoding for that type. In most cases, that means M = 2^32 − 1 for 32-bit integers, or M = 2^16 − 1 for 16-bit integers.

An obvious caveat, however, is that one must ensure that values are encoded in such a way that they are relevant to the analysis at hand. For example, when comparing host behavior over consecutive days, it makes sense to ensure time fields are made relative to midnight of the current day, thereby allowing one to measure differences in activity relative to that time scale (i.e., a day) rather than to a fixed epoch (e.g., UNIX timestamps).
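As a minimal sketch of this re-encoding, UNIX timestamps can be reduced modulo the length of a day so that flows at the same time of day on different days compare as identical (using UTC for simplicity; a real analysis would account for the sensor's timezone):

```python
SECONDS_PER_DAY = 86400

def seconds_since_midnight(unix_ts):
    # Make the timestamp relative to midnight of its own day.
    return unix_ts % SECONDS_PER_DAY

# Two flows at 09:00:00 on consecutive days are identical in time-of-day.
a = seconds_since_midnight(32400)                    # day 0, 09:00
b = seconds_since_midnight(SECONDS_PER_DAY + 32400)  # day 1, 09:00
print(abs(a - b))  # 0
```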

Categorical Types. While metric spaces for numeric types are straightforward, categorical data types pose a problem in developing metric spaces due to the non-linear relationship between the syntactic encoding (i.e., integers) and the underlying semantic meaning of the values. Moreover, the unique semantics of each data type prohibits us from creating a single metric for use by all categorical types. Instead, we follow a general procedure whereby a hierarchy is defined to represent the relationships among the values, and the distance between two values is determined by the level of the hierarchy at which the values diverge. While this is by no means the only approach for defining metrics for categorical types, we believe it is a reasonable first step in representing the semantics of these complex types.

¹Note that we will address issues arising from different units of measure in the following section.

To define the distance hierarchy for the port type, we decompose the space of values (i.e., port numbers) into groups based on the IANA port listings: well-known (0-1023), registered (1024-49151), and dynamic (49152-65535) ports. In this hierarchy, well-known and registered ports are considered to be closer to one another than to dynamic ports since these ports are typically statically associated with a given service. Of course, ports within the same group are closer than those in different groups, and the same port has a distance of zero. The hierarchy is shown in Figure 1(a). More formally, we create the metric space (X_port, d_port), where X_port = {0, ..., 65535} and d_port is an extension of the discrete metric defined as:

d_port(x, y) =
    0  if x = y
    1  if δ_port(x) = δ_port(y)
    2  if δ_port(x) ∈ {0, 1} and δ_port(y) ∈ {0, 1}
    4  if δ_port(x) ∈ {0, 1} and δ_port(y) ∈ {2}

where the cases are evaluated in order, and

δ_port(x) =
    0  if x ∈ [0, 1023]
    1  if x ∈ [1024, 49151]
    2  if x ∈ [49152, 65535]

The choice of distances in d_port is not absolute, but must follow the metric properties from Section 3 and also faithfully encode the relative importance of the differences in the groups that the ports belong to. One can also envision extending the hierarchy to include finer-grained information about the applications or operating systems typically associated with the port numbers. Many such refinements are possible, but it is important to note that they make certain assumptions on the data which may not hold in all cases. For example, one could further refine the distance measure to ensure that ports 80 and 443 are near to one another, as both are typically used for HTTP traffic; however, there are clearly cases where other ports carry HTTP traffic or where ports 80 and 443 carry non-HTTP traffic. One of the benefits of this approach to similarity measurement is that it requires the analyst to carefully consider these assumptions when defining metrics.
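The port metric above translates directly into code. The following sketch implements δ_port and d_port as defined, with the piecewise cases checked in order:

```python
def delta_port(x):
    """Map a port to its IANA group: 0 well-known, 1 registered, 2 dynamic."""
    if 0 <= x <= 1023:
        return 0
    if 1024 <= x <= 49151:
        return 1
    return 2

def d_port(x, y):
    """Hierarchy distance for ports, following the definition in the text."""
    if x == y:
        return 0
    gx, gy = delta_port(x), delta_port(y)
    if gx == gy:
        return 1          # same IANA group, different ports
    if gx in (0, 1) and gy in (0, 1):
        return 2          # well-known vs. registered
    return 4              # one port is dynamic, the other is not

print(d_port(80, 80))     # 0: identical ports
print(d_port(80, 443))    # 1: both well-known
print(d_port(80, 8080))   # 2: well-known vs. registered
print(d_port(80, 60000))  # 4: well-known vs. dynamic
```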

The metric space for IP addresses² is slightly more complex due to the variety of address types used in practice. There is, for instance, a distinction between routable and non-routable, private and public, broadcast and multicast, and many others. In Figure 1(b), we show the hierarchy used to represent the semantic relationships among these groupings. At the leaves of this hierarchy, we perform a simple Hamming distance calculation (denoted as d_H) between the IPs to determine distance within the same functional group. Due to the complexity of the hierarchy, we do not formally define its metric space here. We again reiterate that this is just one of potentially many ways to create a metric for functional similarity of IP addresses. Again, we may imagine a more refined metric that further subdivides IPs by autonomous system or CIDR block, and again we must ensure that the assumptions made by these metrics are supported by the data being analyzed.

²The provided metric is for IPv4 addresses. IPv6 addresses would simply require a suitable increase in distances for each level to retain their relative severity.

[Figure 1 diagrams omitted: (a) the distance hierarchy for the port type, with d_port(x, y) = 1, 2, or 4 depending on where the values diverge among the well-known, registered, and dynamic groups; (b) the distance hierarchy for the IP type, with d_IP(x, y) = 48, 64, or 128 at successive levels (unicast vs. other; public vs. private; individual private blocks, multicast, broadcast, and link-local groups) and Hamming distance d_H(x, y) at the leaves.]

Figure 1. Distance metrics for categorical types of port and IP. Distances listed to the left indicate the distance when values diverge at that level of the hierarchy.
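A simplified sketch of such an IP hierarchy metric is below. The level distances (64 and 128) are read off Figure 1(b), and the grouping is deliberately coarse, omitting intermediate levels such as the 48-distance tier in the figure; it is an illustration of the approach, not a reimplementation of the paper's metric.

```python
import ipaddress

PRIVATE_NETS = [ipaddress.ip_network(n) for n in
                ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def group(ip):
    """Coarse functional group: unicast public/private vs. everything else."""
    a = ipaddress.ip_address(ip)
    if a.is_multicast or a == ipaddress.ip_address("255.255.255.255"):
        return "other"
    return "private" if any(a in n for n in PRIVATE_NETS) else "public"

def hamming(ip_x, ip_y):
    """Hamming distance between the 32-bit address encodings (d_H)."""
    x = int(ipaddress.ip_address(ip_x))
    y = int(ipaddress.ip_address(ip_y))
    return bin(x ^ y).count("1")

def d_ip(ip_x, ip_y):
    gx, gy = group(ip_x), group(ip_y)
    if gx != gy and "other" in (gx, gy):
        return 128                 # unicast vs. multicast/broadcast
    if gx != gy:
        return 64                  # public vs. private unicast
    return hamming(ip_x, ip_y)     # same functional group: leaf-level d_H

print(d_ip("192.168.0.1", "192.168.0.2"))  # 2: same group, 2 differing bits
print(d_ip("8.8.8.8", "192.168.0.1"))      # 64: public vs. private
```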

4.2 Network Data Records as Points

Given the aforementioned distance metrics, the next challenge is to find a way to combine them into a unified metric space that captures the distance between records (i.e., flows or packets) by treating them as points within that space. At first glance, doing so would appear to be a relatively simple procedure: assign one of the available types to each field in the record, and then combine the respective metric spaces using a p-product metric. However, when combining heterogeneous spaces it is important to normalize the distances to ensure that one dimension does not dominate the overall distance calculation by virtue of its relatively larger distance values. Typically, this is accomplished with a procedure known as affine normalization. Given a distance metric d on values x, y ∈ X, the affine normalization is calculated as:

d̄(x, y) = (d(x, y) − min(X)) / (max(X) − min(X))

However, a naive application of this normalization method fails to fairly weight all of the dimensions used in the overall distance calculation. In particular, affine normalization ignores the distribution of values in the underlying space by normalizing by the largest distance possible. As a result, very large and sparse spaces, such as IPs or sizes, may actually be undervalued.

To see why, consider a field containing flow size data. In the evaluation that follows, the vast majority of flows have less than 100 bytes transferred, but the maximum seen is well over ten gigabytes in size. As a result, affine normalization would give extremely small distances to the vast majority of flow pairs even though those distances may actually still be semantically meaningful. In essence, affine normalization procedures remove or minimize the semantic information of distances that are not on the same scale as the maximum distance. Yet another approach might be to measure distance as the difference in rank between the values (i.e., the indices of the values in sorted order). Doing so, however, will remove all semantic information about the relative severity of the difference in those values.
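The pitfall is easy to demonstrate numerically. The 10 GB figure below mirrors the flow-size example from the text; the helper itself is a generic affine normalizer:

```python
def affine_normalize(d, lo, hi):
    """Scale a distance into [0, 1] by the largest possible distance."""
    return (d - lo) / (hi - lo)

# Most flow-size distances are tens of bytes, but a single ~10 GB outlier
# sets the normalization scale.
max_dist = 10 * 1024**3
print(affine_normalize(50, 0, max_dist))  # ≈ 4.7e-09: a semantically real
                                          # 50-byte difference all but vanishes
```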

To balance these two extremes, we propose a new procedure that normalizes the data according to common units of measure for the data type. We begin by defining the overall range of distances that are possible given the values seen in the data being analyzed. Then, we divide this space into non-overlapping ranges based upon the unit of measure that most appropriately suits the values in the range. In other words, the range associated with a given unit of measure will contain all distances that have a value of at least one in that unit of measure, but less than one in the next largest unit. In this paper, we use seconds, minutes, and hours for the time data type, and bytes, kilobytes, and megabytes for size types. Categorical types, like IPs and ports, are assigned units of measure for each level in their respective distance hierarchies.

[Figure 2 diagram omitted: size distances are divided into byte, kilobyte (1,024), and megabyte (1,048,576) ranges, and IP distances into hierarchy levels 0-2 plus Hamming distance, with each range mapped piecewise onto the normalized interval from 0.0 to 1.0.]

Figure 2. Example piecewise normalization of size and IP types. Mapping each range independently ensures they are weighted in accordance with the relative severity of the distance.

Once each distance range is associated with a unit of measure, we can then independently map them into a common (normalized) interval such that the normalized distances represent the relative severity of each unit of measure with respect to units from the same type, as well as those from other data types that are being normalized. It is this piecewise mapping to the normalized distance range that allows us to maintain the semantic meaning of the underlying distances without unduly biasing certain dimensions in the overall distance calculation. For simplicity, we map all metric spaces with three units of measure to the ranges [0, 0.25), [0.25, 0.5), and [0.5, 1.0]. Types with four units are mapped as [0, 0.125), [0.125, 0.25), [0.25, 0.5), and [0.5, 1.0]. Figure 2 shows how this mapping is achieved for size and IP types. Then, by denoting the function that produces the normalized distance for type t_j as d̄_{t_j}(x, y), we can use the p-product metric to define the distance between records in the network data as:

d_p(x, y) = (d̄_{F(1)}(x_1, y_1)^p + ... + d̄_{F(n)}(x_n, y_n)^p)^(1/p)

The metric space is now (X_{F(1)} × ... × X_{F(n)}, d_p), and we set p = 2 in order to calculate the Euclidean distance between records.
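As a sketch of the piecewise normalization for the size type: the three target ranges come from the text, while the affine mapping within each unit's span is our assumption about the unstated details.

```python
KB, MB = 1024, 1024**2

def normalize_size_distance(d, max_dist):
    """Map a size distance into [0, 1] piecewise by unit of measure:
    bytes -> [0, 0.25), kilobytes -> [0.25, 0.5), megabytes -> [0.5, 1.0]."""
    if d < KB:                      # byte-scale distances
        return 0.25 * d / KB
    if d < MB:                      # kilobyte-scale distances
        return 0.25 + 0.25 * (d - KB) / (MB - KB)
    # megabyte-scale distances, up to the maximum seen in the data
    return 0.5 + 0.5 * (d - MB) / max(max_dist - MB, 1)

max_dist = 10 * 1024**3
print(normalize_size_distance(50, max_dist))        # byte range: stays visible
print(normalize_size_distance(500 * KB, max_dist))  # lands in [0.25, 0.5)
```

Unlike plain affine normalization, a 50-byte difference here retains a non-negligible normalized distance even with a 10 GB maximum.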

4.3 Network Objects as Time Series

Thus far, we have introduced distance metrics that attempt to capture the semantics that underlie various data types found in network data, and showed how that semantic information may be propagated to metrics for network data records by carefully normalizing and creating a unified metric space. The final step in our study requires us to take these records and show how to use the metrics we have developed to capture the similarity in behavior between two hosts. In effect, we defined a method for reasoning about the spatial similarity of individual records associated with a host, but must also address the concept of temporal similarity to capture its behavior in a meaningful way.

Common wisdom suggests that host similarity should be calculated by independently evaluating records associated with that host, or only examining short subsequences of activities. This is due primarily to the so-called "curse of dimensionality" and the sheer volume of behavioral information associated with each host. Our assertion is that doing so may be insufficient when trying to capture a host's activities and behaviors as a whole, since these behaviors often have strong causal relationships that extend far beyond limited n-gram windows.

In order to capture the entirety of a host's behavior and appropriately measure similarity among hosts, we instead model a host's behavior as a time series of points made up of the records embedded into the n-dimensional metric space described in the previous section. Given two hosts represented by their respective time series, a dynamic programming method known as dynamic time warping (DTW) may be used to find an order-preserving matching between points in the two series which minimizes the overall distance. The DTW approach has been used for decades in the speech recognition community for aligning words that may be spoken at different speeds. Essentially, the DTW procedure "speeds up" or "slows down" the time series to determine which points should be aligned based on the distances between those two points, which in our case are calculated using the p-product metric.

Unfortunately, DTW runs in O(m1m2) time andspace for two time series of length m1 and m2, re-spectively. Considering that most real-world hostsproduce millions of records per day, such an ap-proach would quickly become impractical. Luck-ily, several heuristics have been proposed that limitthe area of the dynamic programming matrix be-ing evaluated. In particular, Sakoe and Chiba [26]provide a technique which reduces the computa-tional requirements of the DTW process by evalu-ating only those cells of the matrix which lie within


[Figure 3 matrices omitted: (a) DTW with heuristic for constant sampling; (b) DTW with heuristic for variable sampling.]

Figure 3. Example dynamic programming matrices with the original Sakoe-Chiba heuristic (a) and our adapted heuristic (b) for variable sampling rate time series. Gradient cells indicate the “diagonals” and dark shaded cells indicate the window around the diagonal that is evaluated. In the variable rate example, points in the same subsequence are grouped together.

a small window around the diagonal of the matrix. Despite only examining a small fraction of cells, their approach has been shown to produce accurate results in measuring similarity due to the fact that most high-quality matches often occur in a small window around the diagonal, and that it prevents so-called pathological warpings wherein a single point may map to large sections of the opposite time series.

4.3.1 Efficient Dynamic Time Warping for Network Data

Ideally, we would apply the Sakoe-Chiba heuristic to make the DTW computation feasible even for very large datasets. Unfortunately, the Sakoe-Chiba heuristic makes assumptions on the time series that prevent its direct application to network data. In particular, it assumes that the points in the time series are sampled at a constant rate, and so the slope of the diagonal that is evaluated in the matrix depends only on the length of the time series (i.e., the time taken to speak the word). For network data, however, the rate at which records are produced is based on the traffic volumes sent and received by the host, which are inherently bursty in nature and dependent on the host’s activities. Consequently, a single diagonal with a fixed slope would yield a warping path that attempts to align points that may be recorded at significantly different times, leading to a meaningless measurement of behavior.

To address this, we propose an adaptation where we break the two time series into subsequences based on the time period those sequences cover, and calculate individual diagonals and slopes for each subsequence. The individual subsequences can then be pieced together to form a warping path that accommodates the variable sampling rate of network data, and from which we can extend a window to appropriately measure similarity without evaluating the entire dynamic programming matrix. Figure 3 shows an example of the traditional Sakoe-Chiba heuristic and our adaptation for network data.

Our extension of the Sakoe-Chiba heuristic for network data proceeds as follows. Assume that we are given two time series A and B, where |A| ≤ |B|. We begin by splitting the two time series being aligned into subsequences such that all points in the same subsequence were recorded during the same second (according to their timestamps). The subsequences in the two time series are paired if they contain points for the same second. Any subsequence that has not been paired is merged with the subsequence in the same time series that is closest to its timestamp. Thus, we have k subsequences A1, . . . , Ak and B1, . . . , Bk for the sequences A and B, with Ai and Bi being mapped to one another. Next, we iterate through each of the k subsequence pairs and calculate the slope of the pair as Si = |Bi| / |Ai|.

The slope and subsequence mapping uniquely determine which cells should be evaluated as the “diagonal” in our variable sampling rate adaptation. Specifically, the jth point in the ith subsequence (i.e., Ai,j) is mapped to Bi,j′ where j′ = ⌈j · Si⌉.
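As an illustration, the subsequence pairing and diagonal mapping just described can be sketched as follows. This is a simplified rendering: unpaired seconds are skipped rather than merged into the nearest paired subsequence, and all names are illustrative.

```python
from collections import defaultdict
from math import ceil

def split_by_second(timestamps):
    """Group record indices into subsequences by integer second."""
    groups = defaultdict(list)
    for idx, ts in enumerate(timestamps):
        groups[int(ts)].append(idx)
    return groups

def diagonal_map(ts_a, ts_b):
    """For each point of series A, the index in B that the per-subsequence
    "diagonal" maps it to, plus the slope S_i used there.

    Simplification vs. the paper: seconds present in only one series are
    ignored here instead of being merged into the nearest paired
    subsequence.
    """
    ga, gb = split_by_second(ts_a), split_by_second(ts_b)
    mapping, slopes = {}, {}
    for sec in sorted(set(ga) & set(gb)):
        A_i, B_i = ga[sec], gb[sec]
        s = len(B_i) / len(A_i)              # S_i = |B_i| / |A_i|
        for j, a_idx in enumerate(A_i, start=1):
            jp = min(ceil(j * s), len(B_i))  # j' = ceil(j * S_i)
            mapping[a_idx] = B_i[jp - 1]
            slopes[a_idx] = s
    return mapping, slopes

# Two points of A in second 0 map onto four points of B (slope 2),
# one point in second 1 maps onto two points of B.
m, s = diagonal_map([0.1, 0.5, 1.2], [0.0, 0.2, 0.4, 0.6, 1.1, 1.3])
print(m)  # → {0: 1, 1: 3, 2: 5}
```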

With this mapping among points in hand, we then place a window around each cell of length at least ⌈Si⌉ to ensure continuity of the path in the dynamic programming matrix (i.e., a transition exists between cells). For those points that occur at the beginning or end of a subsequence, the window length is set based on the index of the mappings for the adjacent points to ensure that the subsequences are connected to one another. In order to provide a


Algorithm 1 DynamicTimeWarp(A, B, map, slope, c)

Require: |A| ≤ |B|
Require: c > 0

  Initialize m as an (|A|+1) × (|B|+1) matrix
  m[i][j] ← ∞ for all i, j
  m[0][0] ← 0.0
  for i ← 0 to |A| do
    start ← max(0, map[i] − slope[i] · c)
    end ← min(|B|, map[i] + slope[i] · c)
    for j ← start to end do
      if i ≠ 0 and j ≠ 0 then
        distance ← dp(A[i−1], B[j−1])
        left ← m[i][j−1] + distance
        up ← m[i−1][j] + distance
        diag ← m[i−1][j−1] + distance
        m[i][j] ← min(left, up, diag)
      end if
    end for
  end for
  return m[|A|][|B|] / (|A| + |B|)

tunable trade-off between computational complexity and accuracy of the warping, we extend the window by a multiplicative parameter c so that c · Si cells on either side of the “diagonal” are evaluated.

Algorithm 1 shows the dynamic time warping procedure using the restricted warping path from our variable sampling rate heuristic³. For ease of presentation, we provide as input to the algorithm lists containing the two time series A and B, the mapping between indices in A and their “diagonal” mapping in B, a list containing the slope values to be used to ensure continuity of the matrix slope⁴, and the multiplicative factor c that controls the number of cells evaluated around the “diagonals.” As with any dynamic programming technique, the minimum distance for the allowable warping path is given in the lower-right cell of the matrix. Furthermore, in order to ensure the computed warping distances are comparable among time series of varying lengths, we normalize the distance by dividing by the number of cells in the longest possible warp path, |A| + |B|, which also ensures that the distance remains symmetric. We use this normalized distance to compare the behaviors of hosts and determine their similarity. The computational complexity of our proposed heuristic is O(|B| · c) in the worst case where both series A and B are of equal length, which represents a significant reduction from the quadratic complexity required when evaluating the entire dynamic programming matrix.
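A runnable rendering of Algorithm 1 might look like the following Python sketch, where `mapping` holds each point's 1-based "diagonal" column in B and `slope` holds the per-point slope values (both assumed precomputed by the subsequence-pairing step described earlier), and `dp` stands in for the p-product point metric. Names and the toy inputs are ours.

```python
import math

def banded_dtw(A, B, mapping, slope, c, dp):
    """DTW restricted to roughly c * slope[i] cells around each point's
    "diagonal" mapping, normalized by the longest possible warp path.

    mapping[i] / slope[i] are assumed precomputed by the
    subsequence-pairing step; dp is the point-to-point distance.
    """
    assert len(A) <= len(B) and c > 0
    m = [[math.inf] * (len(B) + 1) for _ in range(len(A) + 1)]
    m[0][0] = 0.0
    for i in range(1, len(A) + 1):
        lo = max(1, int(mapping[i - 1] - slope[i - 1] * c))
        hi = min(len(B), int(mapping[i - 1] + slope[i - 1] * c))
        for j in range(lo, hi + 1):
            d = dp(A[i - 1], B[j - 1])
            m[i][j] = d + min(m[i][j - 1], m[i - 1][j], m[i - 1][j - 1])
    # normalize so distances are comparable across series lengths
    return m[len(A)][len(B)] / (len(A) + len(B))

# Toy alignment: B is a "slowed down" copy of A, slope |B|/|A| = 5/3.
A = [0.0, 1.0, 2.0]
B = [0.0, 0.0, 1.0, 1.0, 2.0]
d = banded_dtw(A, B, [2, 4, 5], [5 / 3] * 3, 1, lambda x, y: abs(x - y))
print(d)  # → 0.0
```

Only O(|B| · c) cells are ever touched, matching the stated worst-case complexity; keeping just two matrix rows would likewise reduce space to O(|B|), as the footnote notes.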

³ Note that this algorithm can be easily altered so that only two rows of the matrix are maintained in memory at any given time, thereby reducing the space complexity to O(|B|).

⁴ In practice, there are actually two lists – a forward and a backward list – to accommodate cases where the point is at the beginning or end of a subsequence.

5 Evaluation

The underlying hypothesis behind the behavioral distance metric proposed in the previous section is that the semantics of the network data and long-term sequences of activities provide a more robust notion of host behavior than measures that ignore such information. To evaluate this hypothesis, we compare our proposed dynamic time warp (DTW) metric to the well-known L1 distance metric, which does not explicitly capture semantic and sequencing information. The experiments in our evaluation compare these two metrics using a broad range of analysis methodologies on a large dataset collected on a live network segment in an effort to pinpoint the benefits of our approach over naive metrics.

We begin by measuring the trade-off between speed and accuracy of our dynamic time warping-based distance calculation under varying settings of the multiplicative window parameter c. Using the results from this experiment, we choose a window that provides a balance between fidelity and speed, and show that our approach is feasible even on datasets containing millions of flows. Next, we compare our approach to a metric that ignores semantics and sequencing. In particular, we examine the clusters produced by a variety of cluster analysis techniques when using L1 distance and our proposed DTW metric. Moreover, we look at the consistency of these behaviors across consecutive days of observation to show that our approach captures the most robust notion of network behaviors. Finally, we examine a concrete application of our metric to the problem of quantifying privacy of network hosts, and show how it enables application of well-known privacy definitions, such as the notion of (c, t)-isolation defined by Chawla et al. [6].

Data. In our evaluation, we make use of a dataset containing uni-directional flow records collected within the Computer Science and Engineering department of the University of Michigan. The CSE dataset, as we refer to it throughout this section, was collected over two consecutive twenty-four hour periods, and captures all local and outbound traffic from a single /24 subnet. The data contains 137 active hosts that represent a broad mix of network activities, including high-traffic web and mail servers, general purpose workstations, and hosts running specialized research projects. The relatively small number of hosts allows us to manually verify the behavioral information with system and network administrators at the University of Michigan. Table 2 provides a summary of the traffic within the CSE dataset on each of the observed days.


Window Parameter c   Avg. Time per Warp in seconds (σ)   Avg. Distance Increase in percentage (σ)
 25                  1.0 (5.5)                           5.0% (6.2%)
 50                  1.8 (10.1)                          3.3% (4.5%)
100                  3.2 (19.3)                          1.9% (3.1%)
200                  5.5 (36.2)                          0.9% (1.9%)
500                 11.1 (79.7)                          0.0% (0.0%)

Table 1. Time vs. accuracy comparison, including averages and standard deviations for each window parameter tested. Accuracy is measured as the percentage increase in distances relative to window parameter c = 500.

                        Day 1       Day 2
Number of Hosts:        135         137
Total Flows:            14,400,974  16,249,269
Avg. Flows per Host:    106,674     118,608
Median Flows per Host:  4,670       2,955
Max Flows per Host:     6,357,810   6,620,785

Table 2. Traffic properties for both days of the University of Michigan CSE dataset.

Throughout all of our experiments, we consider six fields within the flow log data: start time, flow size, source IP and port, and destination IP and port. These fields are associated with the appropriate distance metrics and used to measure behavioral distance according to the dynamic time warping procedure outlined in the previous section, except for instances where we explicitly make use of an alternate distance metric. Furthermore, we remove all non-TCP traffic from the dataset in an effort to minimize noise and backscatter caused by UDP and ICMP traffic. We note that since we are examining flow data, the use of TCP traffic should not unfairly bias our analysis since causal relationships among the flows will likely be dictated by application-layer protocols or other higher-order behaviors of the host.

Finally, we performed an initial analysis of the data using k-means clustering in combination with our DTW metric to pinpoint a variety of scanning activities, including pervasive Nessus scans [21] from two hosts within the CSE network and several IPs from China performing fingerprinting on CSE hosts. These scanning hosts were removed from the data to ensure that we focus our analysis on legitimate forms of network behavior that may be verified via the University of Michigan CSE department’s system and network administrators. The specifics of this technique are discussed in Section 5.2, but the fact that our method was able to pinpoint these scanning activities is interesting in and of itself.

5.1 Efficiency

Network datasets collected on real-world networks may contain millions of records. As such, it is important to understand the efficiency of our proposed metric and its ability to reasonably operate on large datasets. Moreover, we must also understand the impact of choosing the window parameter c on the accuracy of the resultant distance and the speed with which it is calculated. To do so, we choose five settings of the window parameter (25, 50, 100, 200, 500) and run distance measures among a random subset of 50% of the hosts from each day of the CSE dataset. The hosts in the sample selected for our experiment had an average of 63,984 flows each, with the largest host having 6,357,811 flows. For each of the window parameter settings, we measure the time it takes to perform the DTW procedure on a single 3.16GHz processor and the increase in distance from that which was calculated with the largest parameter setting (i.e., c = 500). The results are shown in Table 1.

The results of our efficiency experiments show that most parameter settings can perform DTW-based behavioral distance measures in a few seconds with relatively small changes to the overall distance, even when calculating distances among hosts with thousands or millions of flows. For example, with a parameter setting of c = 100, the DTW metric takes an average of only 3.2 seconds with an increase in the calculated distance of just under 2%. For the remainder of our evaluation, we fix the window parameter at c = 100 since it appears to provide an appropriate trade-off between distance calculation accuracy and speed.

5.2 Impact of Semantics and Causality

To evaluate the potential benefits of semantics and long-term causal information in behavioral metrics, we compare the performance of our DTW metric to the L1 distance metric. In our experiments, the L1 metric operates by creating distributions of values for each of the six fields, calculating the L1


Figure 4. Dendrogram illustrating agglomerative clustering using the DTW-based metric on hosts in the first day of the CSE dataset.

distance between the two hosts’ respective distributions, and summing the distances. The DTW metric operates exactly as discussed in Section 4.
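The baseline just described can be sketched as follows; the flow representation (dicts keyed by field name) and the field names are illustrative, not the paper's.

```python
from collections import Counter

def l1_metric(flows_a, flows_b, fields):
    """Per-field empirical value distributions, L1 distance between
    them, summed over fields. Flows are dicts; field names are
    illustrative stand-ins for the six flow-log fields."""
    total = 0.0
    for f in fields:
        pa = Counter(x[f] for x in flows_a)
        pb = Counter(x[f] for x in flows_b)
        na, nb = len(flows_a), len(flows_b)
        # sum |P_a(v) - P_b(v)| over every value seen in either host
        for v in set(pa) | set(pb):
            total += abs(pa[v] / na - pb[v] / nb)
    return total

# Identical hosts are at distance 0; fully disjoint value sets give 2
# per field (the maximum L1 distance between two distributions).
a = [{'dport': 80}, {'dport': 22}]
print(l1_metric(a, a, ['dport']))  # → 0.0
```

Note how this representation discards both the ordering of flows and any relationship among fields, which is precisely the information the DTW metric is designed to retain.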

Our evaluation is broken into three parts. In the first, we use single-linkage agglomerative (i.e., hierarchical) clustering to visualize and examine the behavioral similarity among all hosts in our experiments. The second experiment uses k-means clustering to explore the ability of the two metrics to produce clusters with coherent semantics. The final experiment compares the consistency of the clusters produced by the above techniques, as well as the similarity of the hosts across consecutive days, to measure the robustness of the behavioral information captured by the two metrics. In the first two experiments, we examine only the first day of traffic from the CSE dataset, while both days are used in the final experiment.

In all of these experiments, we make use of information obtained from the University of Michigan CSE system and network administrators about the known usage of the hosts in our analyses to provide a general notion of the correctness of the clustering and to highlight specific cases for deeper inspection. This information allows us to label the hosts in our data with one of ten labels that describe the stated usage of the host when its IP was registered with the computer support staff. These labels include: web server (WEB), mail server (SMTP), DNS server (DNS), a variety of host types involving general client activities (LOGIN, CLASS, DHCP), specialized research hosts for PlanetLab (PLANET), an artificial intelligence research project (AI), and auxiliary power units (APC). For the purposes of these clustering experiments, we only examine the subset of hosts that we have labels for (76 of the 137 hosts).

Agglomerative Clustering. The agglomerative clusterings of hosts using the DTW and L1 metrics are shown as dendrograms in Figures 4 and 5, respectively. The dendrograms visualize the agglomerative clustering process, which begins with each host in its own cluster and then iteratively merges clusters with their nearest neighbor. The leaves in the dendrogram are labeled with the stated usage of the host obtained from system administrators and a unique identifier to facilitate comparison between dendrograms. The branches of the dendrogram illustrate the groupings of hosts, with shorter branches indicating higher levels of similarity.

At first glance, we see that both the DTW and L1 metrics group hosts with the same label in fairly close groups. In fact, it appears as though L1 distance actually produces a better clustering according to these labels, for instance by grouping SMTP servers, PlanetLab hosts, and VM servers in very tight groupings. However, there are two


Figure 5. Dendrogram illustrating agglomerative clustering using L1 distance on hosts in the first day of the CSE dataset.

subtle shortcomings that require deeper investigation. First, the L1 dendrogram clearly indicates that there is relatively little separability among the various clusters, as evidenced by the small differences in branch lengths in the dendrogram from distances of 5 to 6.5. The impact of this is that even small changes in the underlying distributions will cause significant changes to the groupings. We will investigate this particular shortcoming during our consistency experiments. The second shortcoming is that while the labels provide a general idea of the potential usage of the hosts, they say nothing about the actual activities being performed during recording of this dataset. As such, it is quite possible that hosts with the same labels can have wildly different behaviors.

To more closely examine the clustering provided by the DTW and L1 metrics, we manually observe two groups of hosts (highlighted in Figures 4 and 5): virtual machine hosts (VM) and major servers (SMTP and WEB). In the L1 dendrogram, all VM hosts are grouped together, whereas in the DTW dendrogram the hosts are grouped into pairs. Moreover, these pairs are on opposite sides of the dendrogram, indicating significant differences in behavior. When we manually examine these hosts’ activities, we see that one pair (127,128) appears to be performing typical client activities, such as outgoing SSH and web connections, with only approximately 2,000 flows each. The other pair (183,184), however, had absolutely no client activities; instead, most of the approximately 6,000 flows consisted of VMware management traffic or Nagios system monitoring traffic [20]. Clearly, these are very distinct behaviors – client activity and basic system management activity – and yet the L1 metric was unable to distinguish them.

For the primary servers in the dataset, the L1 distance metric groups the two SMTP servers together; however, our DTW metric ends up grouping one of the SMTP servers (18) with the web server, while the second SMTP server is much further away. Upon closer examination, the SMTP servers are differentiated by the fact that one server (18) is the primary mail server that receives about five times the amount of traffic as the second server (27). In addition, the first server (18) receives the majority of its connections from IPs external to the CSE network, while the second server (27) receives many of its mail connections from hosts within the CSE IP prefix and has a significant number of connections from hosts performing SSH password dictionary attacks. It is this local vs. external preference that causes the first server (18) to be grouped with the web server, since the web server also has a significant amount of traffic, almost all of which is associated with external hosts. Naively, the L1 clustering appears to make the most sense given these labels, but again its lack of semantic and causal informa-


Cluster 1
                DTW Metric                              L1 Distance
  Hosts/Flows:  25 hosts / 139.2 flows                  12 hosts / 165.8 flows
  Host Types:   APC 72%, DHCP 24%, CLASS 4%             APC 50%, DHCP 41%, CLASS 9%
  Activities:   <SPORT = {22, 6000}>                    <SPORT = {22, 6000}>
                <SPORT = 21, DADDR = Chinese IP>        <SPORT = 21, DADDR = Chinese IP>
                <SPORT = 443, DADDR = UMich DNS>

Cluster 2
                DTW Metric                              L1 Distance
  Hosts/Flows:  25 hosts / 1,166.0 flows                18 hosts / 908.2 flows
  Host Types:   DHCP 72%, CLASS 16%, VM 8%, AI 4%       DHCP 89%, CLASS 11%
  Activities:   <SPORT = {22, 80}>                      <SPORT = 22>

Cluster 3
                DTW Metric                              L1 Distance
  Hosts/Flows:  26 hosts / 539,438 flows                46 hosts / 305,211 flows
  Host Types:   LOGIN 46%, PLANET 11%, CLASS 11%,       APC 26%, LOGIN 26%, CLASS 10%,
                VM 8%, SMTP 8%, WEB 4%, AI 4%,          DHCP 8%, VM 8%, PLANET 6%, AI 4%,
                DNS 4%, DHCP 4%                         SMTP 4%, WEB 2%, DNS 2%
  Activities:   <SPORT = {22, 25, 80, 111, 113, 993}>   <SPORT = {21, 22, 80, 111}>
                <SPORT = 5666, DADDR = UMich DNS>       <SPORT = 5666, DADDR = UMich DNS>
                <SPORT = 443, DADDR = UMich DNS>

Table 3. k-Means clustering of hosts in the first day of the CSE dataset using the DTW and L1 metrics. Activities represent the dominant state profiles obtained from applying our extension to the Xu et al. dominant state algorithm [32].

tion has caused it to ignore the actual behaviors.

As a side effect of our close examination of some hosts within the dendrogram, we also gained some insight into the general structure of the dendrogram in the case of the DTW metric. That is, the DTW metric produced a dendrogram where the hosts on the right side of the dendrogram perform general client activities, while most hosts on the left side act as servers. This separation is illustrated by vertical dotted lines in Figure 4. The DTW dendrogram is further refined by the type of hosts (internal vs. external) that they communicate with, the volume of traffic, and other interesting behavioral properties.

k-Means Clustering. Another way to evaluate the performance of the DTW and L1 metrics is to consider whether the clusters produced maintain some level of behavioral coherence, and whether the groupings are similar to those that might be produced by a human network analyst looking at the same data. To examine these properties, we use k-means++ clustering [1] to produce clusters from each of the distance metrics and then examine the properties of the resultant clusters. In particular, we use the dominant state analysis technique of Xu et al. [32] to characterize the dominant behaviors in each cluster, examine the distribution of host labels in each cluster, and manually examine a random sample of each cluster to determine the overall behavior being captured.

While a full discussion of the Xu et al. dominant state algorithm is beyond the scope of this evaluation, we provide a high-level notion of its operation and how we extend it to capture behaviors of groups rather than individual hosts. The algorithm begins by calculating the normalized entropy of each field being analyzed, and then ordering those fields from smallest entropy value to greatest. The algorithm then selects all values from the smallest-entropy field whose probability is greater than a predetermined threshold t, thereby creating “profiles” for each value. Then, each of those profiles is extended by examining values in the next smallest-entropy field whose joint probability of occurrence with the profile values is above the threshold t. The profiles are extended iteratively until no new values may be added from the current field, at which point they are considered to be final profiles.
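A simplified single-host sketch of this profile-extraction idea is shown below. The field names, records, and threshold handling are illustrative; it makes no claim to match the exact details of the Xu et al. algorithm.

```python
from collections import Counter
from math import log2

def norm_entropy(values):
    """Entropy of the empirical distribution, normalized to [0, 1]."""
    counts = Counter(values)
    n = len(values)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return h / log2(len(counts)) if len(counts) > 1 else 0.0

def dominant_states(records, fields, t=0.2):
    """Visit fields from lowest to highest normalized entropy and
    greedily extend profiles with values whose joint relative frequency
    with the profile exceeds t. A sketch of the idea, not the exact
    thresholding of the original algorithm."""
    order = sorted(fields, key=lambda f: norm_entropy([r[f] for r in records]))
    profiles = [{}]
    for f in order:
        extended = []
        for p in profiles:
            # records matching the partial profile built so far
            match = [r for r in records if all(r[k] == v for k, v in p.items())]
            hits = Counter(r[f] for r in match)
            for v, c in hits.items():
                if c / len(records) > t:       # joint probability test
                    extended.append({**p, f: v})
            if not any(e.items() >= p.items() for e in extended):
                extended.append(p)             # cannot extend; keep as final
        profiles = extended
    return [p for p in profiles if p]

# A host that is mostly an SSH server yields one dominant profile.
recs = [{'sport': 22, 'daddr': 'a'}] * 8 + [{'sport': 80, 'daddr': 'b'}] * 2
print(dominant_states(recs, ['sport', 'daddr'], t=0.2))
```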

In the original paper, Xu et al. defined the distribution of values on a per-host basis to quickly determine host activities. To apply the same procedure to groups of hosts, we apply the dominant state algorithm at two levels. At the first level, we run the algorithm on distributions of values for each host in the cluster. This produces dominant state profiles for each of those hosts. From these profiles, we extract the values for each of the present fields to create new distributions. These distributions represent the probability of a value occurring in the dominant state profiles for the hosts in the current cluster. Finally, we apply the dominant state algorithm to these cluster-wide distributions to extract the profiles that occur most consistently among all hosts in the cluster.

For this experiment, we manually examined a sample of hosts throughout the entire dataset and determined that there were, roughly, three high-level classes of activities: server activities, client ac-


tivities, and noise/scanning activities. This manual classification was supported by the results of our agglomerative clustering analysis, which showed several levels of increasingly subtle behavioral differences within each of these three classes. Given this rough classification of behavior, we set k = 3 and see if the resultant clustering indeed captures the intuitive understanding of the activities as determined by a human analyst. Table 3 shows the breakdown of the three clusters produced by the k-means++ algorithm using the DTW and L1 metrics. The results of the DTW metric clustering show that the clusters are indeed broken into the three classes of activity found via manual inspection. Cluster 1, which contains primarily the power supply devices (APC) and DHCP hosts, had significant scanning activity from IP addresses in China, and relatively low traffic volumes indicating only sporadic use. By comparison, hosts in Cluster 2 exhibit traditional client behaviors of significant SSH and web browsing activity. Finally, Cluster 3 contains hosts with significant server-like activities. For instance, the hosts related to the artificial intelligence research project (AI) were found to be running web servers that provide statistics to participants in the project, and most of their activity is made up of web requests.

The L1 metric, on the other hand, produced somewhat incoherent clusters. While there is some overlap in the clusters and observed behaviors, it is clear from the distribution of host types that these clusters mix behaviors. The most prominent example of this is that many of the power supply hosts (APC) that we verified as having low traffic volumes and scanning activity were grouped in with the server cluster (Cluster 3). Moreover, the client cluster (Cluster 2) contains many of the DHCP hosts with scanning activity, which in turn alters the behavioral profile for that cluster to remove web activities. The results of this experiment certainly indicate that the L1 metric simply does not capture a coherent notion of behavior. Moreover, when the number of clusters is increased, the differences between the DTW and L1 metrics become more significant. That is, the L1 metric continues to mix behaviors, while the DTW metric creates clusters that represent increasingly fine-grained behaviors, such as the preference for communicating with internal versus external hosts.

As mentioned earlier in this section, this clustering approach was also used to filter a large portion of the scan traffic found in the dataset. Specifically, when we first ran this experiment, we found that the behavioral profiles produced by Cluster 1 were consistent with a wide range of scan traffic, and that the number of hosts in that cluster was significantly larger than expected. We were then able to use the behavioral profiles to remove traffic from scanning IPs and produce a much cleaner clustering without most of the initial scanning activity. As can be seen from the scanning IP from China found in the profile of Cluster 1, it appears that we can continue applying this clustering approach to iteratively identify increasingly subtle scanning activity.

Changes in Behavior Over Time. Perhaps one of the most important properties of any behavioral metric is its robustness to small changes in underlying network activity. That is, we want any changes in the measured distance to be related to some high-level behavioral change and not minute changes in the specifics of the traffic. As our final experiment, we use both days of traffic in the CSE dataset to determine the sensitivity of the DTW and L1 metrics to changes in behavior over time. To do so, we take the set of all hosts that occur in both days (as determined by IP address), and compare their day one time series to those of all hosts in day two. This produces a list of distances for each host in the day one data to the day two hosts. With consistent behavior and a behavioral metric that is robust to minute changes in activity, we would expect to find that each day one host is closest to itself in the day two data. Of course, there are also instances where host behavior did significantly change between the two observation periods. Therefore, our analysis looks at both the number of hosts within each label class whose behaviors appear to remain the same and the reasons that some hosts’ behaviors apparently change.

To begin, we provide a summary of the above consistency experiment for each host label when using the DTW and L1 metrics in Table 4. The table shows two values: the number of hosts with perfect consistency and the average consistency rank of all hosts in the group. By perfect consistency, we mean hosts in day one whose closest host in day two is itself. Consistency rank refers to the rank of the day one host in the sorted list of day two distances. Ideally, if the behaviors of the hosts in the group are exactly the same, we would have all hosts with perfect consistency and an average rank of 1.0. The most obvious observation we can make from the results of this experiment is that the L1 metric appears to be more brittle than DTW, with a significantly greater average rank and fewer perfect consistency hosts overall. What is more interesting, however, are the cases where DTW indicates change in behavior and L1 does not, and vice versa.
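The consistency-rank computation can be sketched as follows; host "series" are stand-in scalars here purely to keep the example small, and all names are illustrative.

```python
def consistency_ranks(day1, day2, dist):
    """For each day-one host, the rank of its own day-two series in the
    sorted list of distances to all day-two hosts (1 = perfect
    consistency). `day1` and `day2` map host identifiers to their
    behavioral time series; `dist` is the behavioral distance."""
    ranks = {}
    for h, series1 in day1.items():
        # day-two hosts sorted by distance from this day-one host
        ordered = sorted(day2, key=lambda g: dist(series1, day2[g]))
        ranks[h] = ordered.index(h) + 1
    return ranks

# Both hosts stay closest to themselves across days: rank 1 for each.
day1 = {'a': 1.0, 'b': 5.0}
day2 = {'a': 1.1, 'b': 4.9}
print(consistency_ranks(day1, day2, lambda x, y: abs(x - y)))
# → {'a': 1, 'b': 1}
```

A perfect-consistency count is then just the number of hosts whose rank equals 1, and the average rank per label group follows directly.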

For the case where DTW indicates change and L1 does not, we again examine the two SMTP servers


                         DTW Metric                 L1 Distance
Host Type   Num. Hosts   Num. Perfect   Avg. Rank   Num. Perfect   Avg. Rank
                         Consistency                Consistency
WEB          1            1              1.0         1              1.0
DNS          1            1              1.0         0             17.0
PLANET       3            2              1.3         3              1.0
VM           4            2              1.5         0             41.3
LOGIN       12            8              2.1         8             14.2
SMTP         2            1              3.5         2              1.0
APC         18            2             16.2         0             39.9
AI           2            0             41.5         0             40.5
DHCP        24            2             48.4         1             31.8
CLASS        8            1             54.4         0             31.1
TOTAL       75           20             17.1        15             21.9

Table 4. Host behavioral consistency for hosts occurring in both days of the CSE dataset.

found in our data. Recall from the previous clustering experiments that one SMTP server in the first day of data acts as the primary server with many connections to external hosts, while the other receives many fewer connections, primarily involving internal hosts. Upon closer inspection of the two servers’ behaviors in day two, we find that the primary SMTP server’s (18) activities are effectively unchanged. The secondary server (27), however, has a significant increase in the proportion of traffic that is related to mail activities. In particular, the secondary SMTP server (27) in the first day of data has a roughly even split between general mail traffic and an SSH password dictionary attack (i.e., hundreds of consecutive SSH login attempts), while in the second day the SSH attacks stop almost completely and the general mail traffic to both internal and external hosts increases significantly. Intuitively, this does indeed indicate a change in behavior due to the presence of the SSH attack and the change in mail activity, although the L1 metric indicates that no such change took place.

Next, we look at the group of VM servers for the case where the L1 metric indicates a change and the DTW metric does not. In this case, manual inspection of all four VM hosts indicates that there were no significant changes in traffic volume or activities for any host. The only noticeable change was that one of the client-like VMs (127) was port scanned for a few seconds late at night on the second day. Given this information, our DTW metric’s ranking makes perfect sense – the two VMs running management protocols (183,184) were exactly the same as in their previous day’s activity, while the client-like VMs (127,128) were confused for one another in the previous day, as evidenced by the average ranking of 1.5 for that group (i.e., rank of 1.0 for the management VMs, 2.0 for the client VMs). With the L1 metric, not only were the VM behaviors confused for one another in the previous day, but they were also confused with dozens of other client-like hosts (i.e., DHCP, LOGIN, etc.) and management hosts (i.e., APC), thereby causing the significantly higher average rank for that group. In fact, the L1 metric only seems to achieve a lower average rank in groups where changes in behavior are expected, such as the DHCP, APC, or CLASS groups, and where the rankings are likely to improve simply by chance due to tiny changes in activities.

5.3 Applications

To conclude our evaluation of the proposed DTW behavioral metric, we discuss its potential applications. Certainly, the fact that our proposed metric provides a robust, semantically meaningful notion of host behavioral similarity means that it may be a good foundation for a number of network analysis tasks, such as anomaly detection or host classification. In fact, the cluster analysis technique described above can be thought of as a form of host classification and an anomaly identification method. There are, however, some non-obvious forms of network data analysis which may benefit from our more formalized approach. One such application is the use of the DTW metric as the basis for analyzing the privacy of hosts within anonymized network data.

Roughly speaking, anonymized network data is simply a transformation of data collected on a computer network such that certain sensitive information is not revealed to those who use the data, but the data remains generally useful to researchers and network analysts. This sensitive information includes the real identities of hosts found within the data, despite those identities being replaced with pseudonyms (e.g., prefix-preserving anonymization [23, 12]). One of the most difficult parts of achieving network data anonymization is the lack of applicable privacy definitions upon which these transformations can be based, due primarily to the fact that rigorously defining the behavior of hosts and the ways these behaviors may change remains an open problem. Although prior works (including our own) have attempted to define privacy analysis techniques for anonymized data


Figure 6. Privacy neighborhood sizes for different distance radius settings. [Two CDF panels, (a) Day 1 and (b) Day 2, plotting the percentage of hosts (0 to 1) against neighborhood size (20 to 140) for radii of 0.25, 0.50, 0.75, and 1.0.]

[7, 25], their methods have been based upon techniques like the L1 distance metric evaluated above, and consequently are likely to provide a rather loose privacy analysis. We argue that, given the DTW metric's unique ability to capture a robust notion of host behavior and the fact that it embeds this behavior in a well-defined metric space, it holds good promise in bringing formal privacy definitions to the field of anonymized network data.

We illustrate this point by examining one way in which a privacy analysis for network data may be built around the definition of (c, t)-isolation put forth by Chawla et al. in the context of multidimensional, real-valued spaces [6]. This privacy notion essentially states that any "anonymized" point should have t real points from the original data within a ball of radius proportional to c centered at the anonymized point. That is, each anonymized point should have a conserved neighborhood of real points that it may "blend in" with, and therefore provide the potential adversary with some level of uncertainty about the point's real identity. The definition can be used as the basis of an analysis methodology simply by looking at the distribution of neighborhood sizes for each anonymized point at increasing radius sizes; such information may be used by a data publisher to, for instance, determine the relative risk of publishing the anonymized data in its current form. The distribution may be visualized via a cumulative distribution function that shows the percentage of points with a number of neighbors (i.e., neighborhood size) less than the given value for a specified radius.
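To make the analysis methodology concrete, the neighborhood-size computation can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes a precomputed matrix of pairwise behavioral distances between hosts (the values below are hypothetical), and the helper names `neighborhood_sizes` and `cdf_at` are our own.

```python
# Sketch of the (c, t)-isolation-style neighborhood analysis described above.
# Assumes a precomputed, symmetric matrix of pairwise behavioral distances
# between hosts (e.g., DTW distances); the values below are hypothetical.

def neighborhood_sizes(dist, radius):
    """For each point, count how many other points fall within the radius."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
            for i in range(n)]

def cdf_at(sizes, k):
    """Fraction of points whose neighborhood size is less than k."""
    return sum(1 for s in sizes if s < k) / len(sizes)

# Hypothetical distance matrix for four hosts.
dist = [
    [0.0, 0.3, 0.6, 0.9],
    [0.3, 0.0, 0.4, 0.8],
    [0.6, 0.4, 0.0, 0.5],
    [0.9, 0.8, 0.5, 0.0],
]

sizes = neighborhood_sizes(dist, radius=0.5)  # -> [1, 2, 2, 1]
```

Evaluating `cdf_at(sizes, k)` over a range of `k` values and radii yields the cumulative distribution curves of the kind shown in Figure 6.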

To adapt the definition and analysis methodology for network hosts, we simply consider the entire time series for the hosts to be a "point" and use the DTW metric to calculate the radius. Figure 6 shows examples of this privacy analysis methodology applied to hosts within the two days of our CSE dataset, under the assumption that none of the six fields in our metric have been altered by the anonymization process. One way of interpreting this analysis is that the radius bounds the potential error in the adversary's knowledge of the host's behaviors, while the number of hosts within the neighborhood provides a sense of the privacy of that host. Therefore, if a data publisher assumes the adversary could gain significant knowledge of a host's behaviors, perhaps derived from publicly available information, it would be prudent to consider the neighborhood sizes when the radius is relatively small. As Figure 6(a) shows, for example, even for a radius of 0.5, well over 80% of the hosts in the data have neighborhoods of size 20 or greater, thereby indicating potentially significant privacy for those hosts. Of course, as discussed earlier in Section 4.1, this is contingent upon the assumptions made about the adversary's "view" of the data via the metric definitions.
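For readers unfamiliar with DTW, the core dynamic-programming recurrence (in the spirit of Sakoe and Chiba [26]) can be sketched as follows. This is a generic, one-dimensional textbook version with an absolute-difference step cost, not the paper's multi-field metric, which would substitute a combined per-record distance for `cost`.

```python
# Minimal DTW sketch: aligns two time series and returns the cumulative
# cost of the cheapest warping path between them.

def dtw(a, b, cost=lambda x, y: abs(x - y)):
    n, m = len(a), len(b)
    INF = float("inf")
    # d[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            d[i][j] = step + min(d[i - 1][j],      # a[i-1] repeats a match
                                 d[i][j - 1],      # b[j-1] repeats a match
                                 d[i - 1][j - 1])  # one-to-one match
    return d[n][m]

# Identical sequences have zero distance, and a locally time-stretched
# copy is absorbed by the warping path.
print(dtw([1, 2, 3, 4], [1, 2, 3, 4]))     # 0.0
print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0
```

In this framing, the `dtw` value between two hosts' activity sequences would serve as the radius measurement in the neighborhood analysis above.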

An obvious downside of this approach is that it is difficult to interpret the semantics of the overall DTW distances. That is, understanding what exactly a radius of 0.5 means with respect to the underlying behaviors may be difficult. One way to address this issue is to provide distributions of the distances for each dimension computed during the time warping process, in addition to the Euclidean distance. Additionally, the semantically meaningful metric spaces for each field may also need to be altered to accommodate measuring behavioral distance between fields that have been altered by the anonymization process and those in the original data (e.g., comparing prefix-preserving IP pseudonyms to the original


IPs). However, we believe that the fact that our approach allows for such changes to the underlying semantics is a contribution in and of itself.

6 Conclusion

Many types of network data analysis rely on well-known distance metrics to adequately capture a meaningful notion of the behavioral similarity among network hosts. Despite the importance of these metrics in ensuring sound analysis practices, there has been relatively little research on the impact that using generic distance metrics and ignoring long-term temporal characteristics has on analysis tasks. Rather, the distance metrics used in practice tend to take a simplistic view of network data, assuming either that the data inherits the semantics of its syntactic representation (e.g., 16- or 32-bit integers) or that the values have no relationship at all. Moreover, they examine network activities in isolation or within short windows (e.g., n-gram distributions), which removes much of the long-term causal information found in the data. Consequently, these approaches are likely to provide brittle or unrealistic metrics for host behavior.

In this paper, we explored an alternative approach to defining host similarity that attempts to incorporate semantically meaningful spatial analysis of network activities and long-term temporal sequencing information into a single, unified metric space that describes host behaviors. To accomplish this goal, we developed metric spaces for several prevalent network data types, showed how to combine the metric spaces to measure the spatial characteristics of individual network data records, and finally proposed a method of measuring host behavior using dynamic time warping (DTW) methods. At each stage in the development of this framework, we brought to light potential pitfalls and attempted to explain the unique requirements surrounding the analysis of network data, including the need to carefully define normalization procedures and to consider assumptions about the data made in developing the metrics. Our proposed metric was evaluated against the well-known L1 distance metric, which ignores both semantic and temporal characteristics of the data, by applying cluster analysis techniques to a dataset containing a variety of realistic network host activities. Despite the admitted simplicity of our example metrics, the results of these experiments showed that our approach provides more consistent and useful characterizations of host behavior than the L1 metric.

As a whole, these results indicate that it is useful to consider long-term temporal characteristics of network hosts, as well as the semantics of the underlying network data, when measuring behavioral similarity. Furthermore, our results point toward several potentially interesting areas of future work. In the short term, one may consider the development of more refined distance metrics, including fine-grained metric spaces for a wider range of data types and time warping methods that allow for localized reordering of points. The results also call for a study of the performance of our DTW metric when applied to non-TCP protocols and to network objects other than hosts, such as web pages. More generally, a more formal method for characterizing behaviors, which may be used as the basis for provable network data anonymization techniques or robust traffic generation methods, seems warranted.

Data Access. To encourage continued research on general network data similarity metrics, we have made the complete dataset used in our study available to the public via the PREDICT data repository [24] as "Departmental-Netflow-Trace-1" (Hosted by: Merit Network, Inc.; Keywords: NetFlow).

Acknowledgments. This work was supported by the U.S. Department of Homeland Security Science & Technology Directorate under Contracts FA8750-08-2-0147 and NBCHC080037, and by the National Science Foundation under Grant No. 0937060, awarded to the Computing Research Association for the CIFellows Project.

References

[1] D. Arthur and S. Vassilvitskii. k-Means++: The Advantages of Careful Seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, January 2007.

[2] T. Auld, A. W. Moore, and S. F. Gull. Bayesian Neural Networks for Internet Traffic Classification. IEEE Transactions on Neural Networks, 18:223–239, 2007.

[3] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian. Traffic Classification on the Fly. SIGCOMM Computer Communications Review, 36(2):23–26, 2006.

[4] M. Bezzi. An Entropy-based Method for Measuring Anonymity. In Proceedings of the 3rd International Conference on Security and Privacy in Communications Networks and the Workshops, pages 28–32, 2007.

[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3):15, 2009.

[6] S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee. Toward Privacy in Public Databases. In Proceedings of the 2nd Annual Theory of Cryptography Conference, pages 363–385, 2005.

[7] S. E. Coull, C. V. Wright, A. D. Keromytis, F. Monrose, and M. K. Reiter. Taming the Devil: Techniques for Evaluating Anonymized Network Data. In Proceedings of the 15th Network and Distributed Systems Security Symposium, pages 125–135, 2008.

[8] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli. Traffic Classification Through Simple Statistical Fingerprinting. SIGCOMM Computer Communications Review, 37(1):5–16, 2007.

[9] J. Early, C. Brodley, and C. Rosenberg. Behavioral Authentication of Server Flows. In Proceedings of the 19th Annual Computer Security Applications Conference, pages 46–55, December 2003.

[10] J. Erman, M. Arlitt, and A. Mahanti. Traffic Classification Using Clustering Algorithms. In Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, pages 281–286, 2006.

[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A Geometric Framework for Unsupervised Anomaly Detection. In Applications of Data Mining in Computer Security, pages 77–92, 2002.

[12] J. Fan, J. Xu, M. Ammar, and S. Moon. Prefix-preserving IP Address Anonymization: Measurement-based Security Evaluation and a New Cryptography-based Scheme. Computer Networks, 46(2):263–272, October 2004.

[13] V. Frias-Martinez, S. J. Stolfo, and A. D. Keromytis. Behavior-Profile Clustering for False Alert Reduction in Anomaly Detection Sensors. In Proceedings of the Annual Computer Security Applications Conference, pages 367–376, 2008.

[14] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: Multilevel Traffic Classification in the Dark. In Proceedings of ACM SIGCOMM, pages 229–240, August 2005.

[15] T. Karagiannis, K. Papagiannaki, N. Taft, and M. Faloutsos. Profiling the End Host. In Proceedings of the Passive and Active Network Measurement Conference, pages 186–196, 2007.

[16] A. Kounine and M. Bezzi. Assessing Disclosure Risk in Anonymized Datasets. In Proceedings of FloCon, 2008.

[17] A. McGregor, M. Hall, P. Lorier, and J. Brunskill. Flow Clustering Using Machine Learning Techniques. In Proceedings of the Passive and Active Network Measurement Conference, pages 205–214, 2004.

[18] A. W. Moore and D. Zuev. Internet Traffic Classification Using Bayesian Analysis Techniques. In Proceedings of ACM SIGMETRICS, pages 50–60, June 2005.

[19] J. Munkres. Topology. Prentice Hall, Englewood Cliffs, NJ, 2000.

[20] Nagios IT Infrastructure Monitoring. http://www.nagios.org/.

[21] Nessus Vulnerability Scanner. http://www.nessus.org.

[22] T. Nguyen and G. Armitage. A Survey of Techniques for Internet Traffic Classification Using Machine Learning. IEEE Communications Surveys & Tutorials, 10(4):56–76, 2008.

[23] R. Pang, M. Allman, V. Paxson, and J. Lee. The Devil and Packet Trace Anonymization. ACM Computer Communication Review, 36(1):29–38, January 2006.

[24] M. Ramadas, S. Ostermann, and B. Tjaden. Detecting Anomalous Network Traffic with Self-Organizing Maps. In Proceedings of the Recent Advances in Intrusion Detection Conference, pages 36–54, 2003.

[25] B. Ribeiro, W. Chen, G. Miklau, and D. Towsley. Analyzing Privacy in Enterprise Packet Trace Anonymization. In Proceedings of the 15th Network and Distributed Systems Security Symposium, to appear, 2008.

[26] H. Sakoe and S. Chiba. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26:43–49, 1978.

[27] R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based Anomaly Detection: A New Approach for Detecting Network Intrusions. In Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 265–274, 2002.

[28] Y. Song, A. Keromytis, and S. Stolfo. Spectrogram: A Mixture-of-Markov-Chains Model for Anomaly Detection in Web Traffic. In Proceedings of the 16th Annual Network and Distributed System Security Symposium, pages 121–136, 2009.

[29] S. Wei, J. Mirkovic, and E. Kissel. Profiling and Clustering Internet Hosts. In Proceedings of the 2006 International Conference on Data Mining, pages 269–275, 2006.

[30] N. Williams, S. Zander, and G. Armitage. A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification. SIGCOMM Computer Communications Review, 36(5):5–16, 2006.

[31] C. Wright, F. Monrose, and G. M. Masson. On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research, Special Topic on Machine Learning for Computer Security, 7:2745–2769, December 2006.

[32] K. Xu, Z. Zhang, and S. Bhattacharyya. Profiling Internet Backbone Traffic: Behavior Models and Applications. In Proceedings of ACM SIGCOMM, pages 169–180, August 2005.