A k main routes approach to spatial network activity summarization

15
1464 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014 A K-Main Routes Approach to Spatial Network Activity Summarization Dev Oliver, Student Member, IEEE, Shashi Shekhar, Fellow, IEEE, James M. Kang, Renee Laubscher, Veronica Carlan, and Abdussalam Bannur Abstract—Data summarization is an important concept in data mining for finding a compact representation of a dataset. In spatial network activity summarization (SNAS), we are given a spatial network and a collection of activities (e.g., pedestrian fatality reports, crime reports) and the goal is to find k shortest paths that summarize the activities. SNAS is important for applications where observations occur along linear paths such as roadways, train tracks, etc. SNAS is computationally challenging because of the large number of k subsets of shortest paths in a spatial network. Previous work has focused on either geometry or subgraph-based approaches (e.g., only one path), and cannot summarize activities using multiple paths. This paper proposes a K-Main Routes (KMR) approach that discovers k shortest paths to summarize activities. KMR generalizes K-means for network space but uses shortest paths instead of ellipses to summarize activities. To improve performance, KMR uses network Voronoi, divide and conquer, and pruning strategies. We present a case study comparing KMR’s network-based output (i.e., shortest paths) to geometry-based outputs (e.g., ellipses) on pedestrian fatality data. Experimental results on synthetic and real data show that KMR with our performance-tuning decisions yields substantial computational savings without reducing summary path coverage. Index Terms—Spatial network, partitioning, hot spots, hot routes, activity summarization 1 I NTRODUCTION S PATIAL network activity summarization (SNAS) has important applications in domains where observa- tions occur along linear paths in the network. Table 1 provides examples of such applications where the gen- erative model of observations is inherently linear. For example, transportation planners and engineers may need to identify road segments/stretches that pose risks for pedestrians and require redesign [1]; crime analysts may look for concentrations of crimes along certain streets to guide law enforcement [2]; and hydrologists may try to summarize environmental change on water resources to understand the behavior of river networks and lakes [3]. Informally, the SNAS problem can be defined as fol- lows: Given a spatial network, a collection of activities and their locations (e.g., placed on a node or an edge), and a desired number of paths k, find a set of k short- est paths that maximizes the sum of activities on the paths (counting activities that are on overlapping paths D. Oliver, S. Shekhar, and A. Bannur are with the Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455 USA. E-mail: {oliver, shekhar, bannur}@cs.umn.edu. J. M. Kang, R. Laubscher, and V. Carlan are with the National Geospatial- Intelligence Agency, Springfield, VA 22150-7500 USA. E-mail: {james.m.kang, renee.b.laubscher, veronica.m.carlan}@nga.mil. Manuscript received 2 Nov. 2012; revised 16 July 2013; accepted 27 July 2013. Date of publication 4 Aug. 2013; date of current version 29 May 2014. Recommended for acceptance by C. Shahabi. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier 10.1109/TKDE.2013.135 only once) and a partitioning of activities across the paths. Depending on the domain, an activity may be the location of a pedestrian fatality, a carjacking, a train accident, etc. SNAS assumes that every path is a shortest path because in applications such as transportation planning, the aim is usually to help people to arrive at their destination as fast as possible. Fig. 1(a) and 1(b) illustrate an input and out- put example of SNAS, respectively. The input consists of eight nodes, seven edges (with edge weights of 1 for sim- plicity), eleven activities, and k = 2, indicating that two routes and groups are desired. Table 2 shows the shortest paths for the spatial network in Fig. 1 and their respective activity coverages (i.e., the sum of activities on each short- est path). The output contains two shortest paths and two groups of activities. The shortest paths are representatives for each group and each shortest path maximizes the activ- ity coverage for the group it represents. For example, route A, B, C is the representative for the group comprised of activities 1, 2, 3, 4, and 5, and route D, E, F is the repre- sentative for the group comprised of activities 6, 7, 8, 9, 10, and 11. Finding a set of k shortest paths that maximizes the number of activities on selected paths is computationally challenging. This is due to the fact that if k shortest paths are selected from all shortest paths in a spatial network, there are a large number of possibilities for large k, i.e., ( n k ) , where n is the number of shortest paths. This is because dif- ferent subsets of k shortest paths could be overlapping or have the same shortest paths. For disjoint paths, the prob- lem would be relatively less computationally challenging. However, due to overlapping paths, the general problem of SNAS is NP-complete and its proof is provided in this paper (Section 4). 1041-4347 c 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

description

2014 IEEE / Non IEEE / Real Time Projects & Courses for Final Year Students @ Wingz Technologies It has been brought to our notice that the final year students are looking out for IEEE / Non IEEE / Real Time Projects / Courses and project guidance in advanced technologies. Considering this in regard, we are guiding for real time projects and conducting courses on DOTNET, JAVA, NS2, MATLAB, ANDROID, SQL DBA, ORACLE, JIST & CLOUDSIM, EMBEDDED SYSTEM. So we have attached the pamphlets for the same. We employ highly qualified developers and creative designers with years of experience to accomplish projects with utmost satisfaction. Wingz Technologies help clients’ to design, develop and integrate applications and solutions based on the various platforms like MICROSOFT .NET, JAVA/J2ME/J2EE, NS2, MATLAB,PHP,ORACLE,ANDROID,NS2(NETWORK SIMULATOR 2), EMBEDDED SYSTEM,VLSI,POWER ELECTRONICS etc. We support final year ME / MTECH / BE / BTECH( IT, CSE, EEE, ECE, CIVIL, MECH), MCA, MSC (IT/ CSE /Software Engineering), BCA, BSC (CSE / IT), MS IT students with IEEE Projects/Non IEEE Projects and real time Application projects in various leading domains and enable them to become future engineers. Our IEEE Projects and Application Projects are developed by experienced professionals with accurate designs on hot titles of the current year. We Help You With… Real Time Project Guidance Inplant Training(IPT) Internship Training Corporate Training Custom Software Development SEO(Search Engine Optimization) Research Work (Ph.d and M.Phil) Offer Courses for all platforms. Wingz Technologies Provide Complete Guidance 100% Result for all Projects On time Completion Excellent Support Project Completion & Experience Certificate Real Time Experience Thanking you, Yours truly, Wingz Technologies Plot No.18, Ground Floor,New Colony, 14th Cross Extension, Elumalai Nagar, Chromepet, Chennai-44,Tamil Nadu,India. Mail Me : [email protected], [email protected] Call Me : +91-9840004562,044-65622200. Website Link : www.wingztech.com,www.finalyearproject.co.in

Transcript of A k main routes approach to spatial network activity summarization

Page 1: A k main routes approach to spatial network activity summarization

1464 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

A K-Main Routes Approach to Spatial NetworkActivity Summarization

Dev Oliver, Student Member, IEEE, Shashi Shekhar, Fellow, IEEE, James M. Kang, Renee Laubscher,Veronica Carlan, and Abdussalam Bannur

Abstract—Data summarization is an important concept in data mining for finding a compact representation of a dataset. In spatialnetwork activity summarization (SNAS), we are given a spatial network and a collection of activities (e.g., pedestrian fatality reports,crime reports) and the goal is to find k shortest paths that summarize the activities. SNAS is important for applications whereobservations occur along linear paths such as roadways, train tracks, etc. SNAS is computationally challenging because of the largenumber of k subsets of shortest paths in a spatial network. Previous work has focused on either geometry or subgraph-basedapproaches (e.g., only one path), and cannot summarize activities using multiple paths. This paper proposes a K-Main Routes (KMR)approach that discovers k shortest paths to summarize activities. KMR generalizes K-means for network space but uses shortestpaths instead of ellipses to summarize activities. To improve performance, KMR uses network Voronoi, divide and conquer, andpruning strategies. We present a case study comparing KMR’s network-based output (i.e., shortest paths) to geometry-based outputs(e.g., ellipses) on pedestrian fatality data. Experimental results on synthetic and real data show that KMR with our performance-tuningdecisions yields substantial computational savings without reducing summary path coverage.

Index Terms—Spatial network, partitioning, hot spots, hot routes, activity summarization

1 INTRODUCTION

SPATIAL network activity summarization (SNAS) hasimportant applications in domains where observa-

tions occur along linear paths in the network. Table 1provides examples of such applications where the gen-erative model of observations is inherently linear. Forexample, transportation planners and engineers may needto identify road segments/stretches that pose risks forpedestrians and require redesign [1]; crime analystsmay look for concentrations of crimes along certainstreets to guide law enforcement [2]; and hydrologistsmay try to summarize environmental change on waterresources to understand the behavior of river networks andlakes [3].

Informally, the SNAS problem can be defined as fol-lows: Given a spatial network, a collection of activitiesand their locations (e.g., placed on a node or an edge),and a desired number of paths k, find a set of k short-est paths that maximizes the sum of activities on thepaths (counting activities that are on overlapping paths

• D. Oliver, S. Shekhar, and A. Bannur are with the Departmentof Computer Science and Engineering, University of Minnesota,Minneapolis, MN 55455 USA.E-mail: {oliver, shekhar, bannur}@cs.umn.edu.

• J. M. Kang, R. Laubscher, and V. Carlan are with the National Geospatial-Intelligence Agency, Springfield, VA 22150-7500 USA.E-mail: {james.m.kang, renee.b.laubscher, veronica.m.carlan}@nga.mil.

Manuscript received 2 Nov. 2012; revised 16 July 2013; accepted 27 July2013. Date of publication 4 Aug. 2013; date of current version 29 May 2014.Recommended for acceptance by C. Shahabi.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier 10.1109/TKDE.2013.135

only once) and a partitioning of activities across the paths.Depending on the domain, an activity may be the locationof a pedestrian fatality, a carjacking, a train accident, etc.SNAS assumes that every path is a shortest path becausein applications such as transportation planning, the aim isusually to help people to arrive at their destination as fastas possible. Fig. 1(a) and 1(b) illustrate an input and out-put example of SNAS, respectively. The input consists ofeight nodes, seven edges (with edge weights of 1 for sim-plicity), eleven activities, and k = 2, indicating that tworoutes and groups are desired. Table 2 shows the shortestpaths for the spatial network in Fig. 1 and their respectiveactivity coverages (i.e., the sum of activities on each short-est path). The output contains two shortest paths and twogroups of activities. The shortest paths are representativesfor each group and each shortest path maximizes the activ-ity coverage for the group it represents. For example, route〈A, B, C〉 is the representative for the group comprised ofactivities 1, 2, 3, 4, and 5, and route 〈D, E, F〉 is the repre-sentative for the group comprised of activities 6, 7, 8, 9, 10,

and 11.Finding a set of k shortest paths that maximizes the

number of activities on selected paths is computationallychallenging. This is due to the fact that if k shortest pathsare selected from all shortest paths in a spatial network,there are a large number of possibilities for large k, i.e.,

(nk

),

where n is the number of shortest paths. This is because dif-ferent subsets of k shortest paths could be overlapping orhave the same shortest paths. For disjoint paths, the prob-lem would be relatively less computationally challenging.However, due to overlapping paths, the general problemof SNAS is NP-complete and its proof is provided in thispaper (Section 4).

1041-4347 c© 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1465

TABLE 1Examples of Linear Generators where Observations Occur along Paths in the Spatial Network

TABLE 2Shortest Paths from Fig. 1 (Activity Coverage Refers to the Number of Activities on a Path)

1.1 A Framework for Data SummarizationData summarization is an important concept in data miningthat entails techniques for finding a compact description orrepresentation of a dataset. The process typically involvesdefining a set of groups, finding a representative for eachgroup, and reporting a statistic for each group (e.g., sum,mean, standard deviation). These notions differ dependingon the genre of the data being summarized. Table 3 presentsa summarization framework for three genres of data. An

example of the first, relational table summarization, is theGROUP BY clause in SQL that is used to group rows havingcommon values to report SQL aggregation functions suchas mean and standard deviation. The group definition inthis case is a partition of rows and the group representationis distinct values of attributes such as age-group, income-group, etc.

The second genre is spatial Euclidean summarization,which includes heat maps and hotspot analysis. Heat maps

(b)(a)

Fig. 1. Example (a) input and (b) output of spatial network activity summarization (Best in color).

Page 3: A k main routes approach to spatial network activity summarization

1466 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

TABLE 3Summarization Framework for Various Data Genres

provide a graphical representation of data in which individ-ual values contained in a matrix are represented as colors.The group definition for heat maps might include a set ofpixels, and the group representation may be a subset ofthese pixels. Hotspots are a special kind of partitioned pat-tern where objects in hotspot regions have high similarityin comparison to one another and are dissimilar to all theobjects outside the hotspot [6]. These spatial summaries arebased on spatial point locations where the group definitionis a partition of space and the groups could be representedby points, polygons, ellipses, or line-strings.

In this work we explore the third genre, spatial networksummarization, which defines groups based on partitioninga network and may represent groups using nodes, paths,trees, etc.

1.2 An Illustrative Application Domain: PreventingPedestrian Fatalities

To illustrate the applicability of SNAS, we focus on theproblem of pedestrian fatalities. According to a recent pol-icy report, more than 47,700 pedestrians were killed in theUnited States from 2000 and 2009 [1]. More than 688,000pedestrians were injured over the same time period, whichis equivalent to a pedestrian being struck by a vehicle every7 minutes. Pedestrian fatalities have increased in manyplaces, including 15 of the country’s largest metro areas,even as overall traffic deaths have fallen [1].

Domain experts attribute pedestrian fatalities largely tothe design of streets, which have been engineered for speed-ing traffic with little or no provision for people on foot,in wheelchairs or on bicycles [1]. Daily activities haveshifted away from main streets towards higher speed arte-rials based on the emphasis on traffic movement. This

has resulted in more than half of fatal pedestrian crashesoccurring on these wide, high capacity and high-speed thor-oughfares. Typically designed with four or more lanes andhigh travel speeds, arterials are not built with pedestrians inmind (Fig. 2(a)). They lack sidewalks, crosswalks (or havecrosswalks spaced too far apart), pedestrian refuges, streetlighting, and school and public bus shelters [1].

The consequences of this lack of basic infrastructure arefatal. For example, forty percent of fatalities occurred whereno crosswalk was available [1]. Fig. 2(b) shows a map ofpedestrian fatalities that occurred on Orlando roads from2000 - 2009. Transportation planners and engineers needtools to assist them in identifying which frequently usedroad segments/stretches pose risks for pedestrians and con-sequently should be redesigned. Road segments/stretchesthat pose risks to pedestrians may be conceptualized as lin-ear concentrations because the generation model of pedes-trian fatalities is inherently linear, i.e., they occur on roads.This paper presents an approach for identifying linear con-centrations of activities such as pedestrian fatalities in aspatial network.

1.3 Related WorkSummarizing activities by grouping is a significant researcharea in data mining. Previous techniques have generallybeen geometry-based [8]–[12] or network-based [13]–[22].

In geometry-based summarization, partitioning of spa-tial data is based on grouping similar points distributed inplanar space where distance is calculated using Euclideandistance, not network distance. Such techniques focus onthe discovery of the geometry (e.g., circle, ellipse) of highdensity regions [2] and include K-Means [8], K-medoid [9],[10], P-median [11] and Nearest Neighbor Hierarchical

Fig. 2. (a) Pedestrian at risk on a road without proper sidewalks [1]. (b) Pedestrian fatalities occurring on arterials in Orlando, FL [7].

Page 4: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1467

(a) (b) (c)

Fig. 3. Two methods of summarizing activities on a network. (a) Input. (b) NT-VCM output [22]. (c) K-Means output (network distance).

Clustering [12]. These methods do not consider the under-lying spatial network; they group spatial objects that areclose in terms of Euclidean distance but not close in termsof network distance. Thus, they may fail to group activitiesthat occur on the same street.

In network-based summarization, spatial objects aregrouped using network (e.g., road) distance. Existingmethods of network-based summarization such as MeanStreets [13], Maximal Subgraph Finding (MSGF) [14], andClumping [15]–[22] group activities over multiple paths, asingle path/subgraph, or no paths at all. Mean Streets [13]finds anomalous streets or routes with unusually high activ-ity levels. It is not designed to summarize activities over kpaths because the number of high crime streets returned isalways relatively small. MSGF [14] identifies the maximalsubgraph (e.g., a single path, k = 1) under the constraintof a user specified length and cannot summarize activ-ities when k > 1. The Network-Based Variable-DistanceClumping Method (NT-VCM) [22] is an example of theclumping technique [15]–[22]. NT-VCM groups activitiesthat are within a certain shortest path distance of eachother on the network; in order to run NT-VCM, a distancethreshold is needed.

An example output of NT-VCM is shown in Fig. 3(b).The threshold distance for NT-VCM is the unit distancebetween activities 8 and 11. However, NT-VCM is notdesigned to summarize activities using k paths because thegrouping of activities is based on the given distance thresh-old. As such, activities 1, 2, 3, 4, and 5, which occur on thesame street would not belong to the same group, given thisthreshold distance.

Network distance may also be applied to certaingeometry-based techniques such as K-Means [8]. Fig. 3(c)shows an example of the ellipses output by K-Meansusing network distance. The left ellipse groups activities1, 2, 3, 6, 7, 8, and 11 whereas the right ellipse groupsactivities 4, 5, 9, and 10. Even when generalized withnetwork distances, these methods output point-based orellipsoid-based groups, not paths. The objective of network-based K-Means (e.g., minimize the within-cluster sumof squares) is different from that of SNAS (e.g., maxi-mize activity coverage), which explains their difference inoutput.

Previously, we proposed K-Main Routes (KMR) usinginactive node pruning as a heuristic for summarizingactivities in a spatial network [23]. We demonstrated thecorrectness of inactive node pruning, experimentally eval-uated KMR, and presented a case study comparing theoutput of network based with geometry based summa-rization. However, our previous approach was limited interms of performance and use of heuristics without proofof NP-completeness.

1.4 ContributionsIn this paper, we show that SNAS is NP-complete andpropose two additional techniques for improving the per-formance of KMR: 1) Network Voronoi activity Assignment,which allocates activities to the nearest summary path,and 2) Divide and conquer Summary PAth REcomputation,which calculates the new summary path of each groupusing only group nodes. Applying these two performance-tuning techniques results in significant computationalsavings. Specifically, our research contributions are asfollows:

• We introduce a summarization framework demon-strating how different genres of data may be sum-marized.

• We show that SNAS is NP-complete.• We propose two new techniques for improving the

performance of K-Main Routes (KMR): NetworkVoronoi activity Assignment (NOVA_TKDE) andDivide and conquer Summary PAth REcomputation(D-SPARE_TKDE).

• We analytically demonstrate the correctness ofNOVA_TKDE and D-SPARE_TKDE.

• We analyze the computation costs of KMR.• We present a case study comparing KMR with

geometry-based summarization techniques onpedestrian fatality data.

• We test the performance and scalability of KMRusing both synthetic and real-world data sets anddemonstrate the computational efficiency of ourperformance-tuning strategies.

1.5 Scope and Outline of the PaperThis work focuses on summarizing discrete activity events(e.g., pedestrian fatalities, crime reports) associated with apoint on a network. This does not imply that all activi-ties must necessarily be associated with a point in a street.Furthermore, other network properties such as GPS trajec-tories and traffic densities of road networks [24] are notconsidered. The objective function used in SNAS is basedon maximizing the activity coverage of summary paths, noton minimizing the distance of activities to summary paths.The summary paths are shortest paths but other spatialconstraints are not considered (e.g., nearest neighbors) [25].Additionally, it is assumed that the number of activities onthe road network is fixed and does not change over time.A dynamically changing number of activities is presentlybeyond the scope of this research.

The paper is organized as follows: Section 2 presents thebasic concepts and problem statement of SNAS. Section 3explains KMR, NOVA_TKDE, and D-SPARE_TKDE.Section 4 details the analytical evaluation. Section 5

Page 5: A k main routes approach to spatial network activity summarization

1468 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

presents a case study comparing KMR with other sum-marization techniques. The experimental evaluation iscovered in Section 6. Section 7 presents a discussion andSection 8 concludes the paper and previews future work.

2 BASIC CONCEPTS AND PROBLEMSTATEMENT

This section introduces several key concepts in SNAS andpresents a formal problem statement.

2.1 Basic ConceptsWe define our basic concepts as follows:

Definition 1. A spatial network G = (N, E) consists of anode set N and an edge set E, where each element u in Nis associated with a pair of real numbers (x, y) representingthe spatial location of the node in a Euclidean plane [26].Edge set E is a subset of the cross product N × N. Eachelement e = (u, v) in E is an edge that joins node u tonode v.

An example of a spatial network is shown in Fig. 1(a).In the figure, circles represent nodes and lines representedges. A road network is an example of a spatial net-work where nodes represent street intersections and edgesrepresent streets.

Definition 2. An activity set A is a collection of activities. Anactivity a ∈ A is an object of interest associated with onlyone edge e ∈ E or one node n ∈ N.

In Fig. 1(a), activities are represented as squares. In trans-portation planning, an activity may be the location of apedestrian fatality; in crime analysis, an activity may bethe location of a theft; and in disaster response an activitymay be the location of a request for relief supplies.

Definition 3. A summary path set P̂ is a collection of sum-mary paths where each path pi ∈ P̂ is a shortest path. A sum-mary path imposes a partitioning on an activity set A suchthat network distance(a, pi) ≤ network distance(a, pj) ∀pj ∈P̂, ∀a ∈ A.

Fig. 1(b) shows two summary paths 〈A, B, C〉 and〈D, E, F〉. Activities 1, 2, 3, 4, and 5 form a partition around〈A, B, C〉 because they are closer to 〈A, B, C〉 whereasactivities 6, 7, 8, 9, 10 and 11 form a partition around〈D, E, F〉 because they are closer to 〈D, E, F〉. Here thenetwork distance between an activity and a path, i.e.,network distance(a, pi), is the network distance between aand the closest node in pi.

Definition 4. The activity coverage AC(p) of a path p is thesum of activities having network distance = 0 from an edgee ∈ p. The activity coverage AC(P) of a set of paths P isthe sum of activities across individual paths pi in set P, havingnetwork distance = 0 from each edge e ∈ P, counting activitiesthat are covered several times only once. If two paths share anedge, the activities on that edge are only counted once.

For example, in Fig. 1(a) the activity coverage of theshortest path from node A to node B is 3 because thereare 3 activities occurring on that path. Likewise, the activ-ity coverage of the shortest path from node D to node F is5 because there are 5 activities occurring on 〈D, E, F〉. If the

set of paths are 〈A, B, C〉 and 〈D, E, F〉, then the activity cov-erage is 10 because there are 10 activities on all the edgesof the paths in P. If the set of paths are 〈A, B, C〉 and 〈A, B〉,then the activity coverage is 5. Note that the activities onedge AB are only counted once even though AB is an edgein both paths.

2.2 Problem StatementThe problem of spatial network activity summarization(SNAS) can be expressed as follows:

Given:

1) A spatial network G = (N, E) with weight func-tion w(u, v) ≥ 0 for each edge e = (u, v) ∈ E (e.g.,network distance),

2) A set of activities A and their locations (e.g., a nodeor an edge),

3) A desired number of summary paths, k, where k ≥ 1.Find:

1) A summary path set of size k,2) A partitioning of activities across these summary

paths.Objective: Maximize the activity coverage of each sum-

mary path for the group it represents.Constraints:

1) Each summary path is a shortest path between itsend-nodes,

2) Each activity a ∈ A is associated with only one edgee ∈ E.

The spatial network input for SNAS is defined inDefinition 1. The activities input are objects of interest inthe spatial network such as the locations of pedestrianfatalities. The k input represents the desired number of sum-mary paths. The output for SNAS is a summary path setof size k and a partitioning of activities across the paths.The summary paths are representatives for each group andeach summary path maximizes the activity coverage for thegroup it represents. The optimal solution for SNAS may notbe unique and this is shown in lemma 1.

Example. The network in Fig. 1(a) can be viewed as aroad network, composed of streets (edges) and intersec-tions (nodes) with eleven activities (squares). The aim isto find two routes and two groups of activities; the routesare representatives for each of the groups. In a transporta-tion planning scenario, identifying such routes would guidestreet redesign efforts to reduce the risk of pedestrian fatali-ties (e.g., adding sidewalks, crosswalks, pedestrian refuges,street lighting, etc.). In Fig. 1(b), route 〈A, B, C〉 is the rep-resentative for the group comprised of activities 1, 2, 3, 4,and 5; and route 〈D, E, F〉 is the representative for the groupcomprised of activities 6, 7, 8, 9, 10 and 11.

3 SPATIAL NETWORK ACTIVITYSUMMARIZATION

This section describes the computational structure of SNAS.It also describes the K-Main Routes (KMR) algorithmand its performance-tuning decisions Network Voronoi

Page 6: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1469

activity Assignment, Divide and conquer Summary PAthREcomputation, and Inactive Node Pruning.

3.1 Computational Structure of SNASIn SNAS, the optimal solution may not be unique.Additionally, among the optimal solutions there are somewhere every path starts and ends at active nodes. Theseproperties are formally shown via Lemmas 1 and 2.

Definition 5. An active edge is an edge e ∈ E that has 1 ormore activities. An active node is a node u joined by anactive edge or a node that has one or more activities, or both.An inactive node is a node that is not joined by any activeedges.

Edges AB and BC in Fig. 1(a) are active edges becausethey each have at least one activity and nodes A, B, C, D,E, F, and G are all active nodes because they are all joinedby active edges. By contrast, Node H is an inactive nodebecause it is not joined by any active edges.

Lemma 1. The optimal solution for SNAS may not be unique.

Proof. There may be multiple solutions for different val-ues of k. For example, given k = 1 in Fig. 1(a) whereall eleven activities are members of the same group, thesummary path could be 〈A, B, E, D〉 or 〈D, E, B, A〉, sinceboth these paths have a maximum activity coverage of6 based on the one group. Given k = 2 and the groupsshown in Fig. 1(b), the summary paths could be 〈A, B, C〉and 〈D, E, F〉 or 〈C, B, A〉 and 〈F, E, D〉 as both sets ofpaths have a maximum activity coverage of 11 based ontheir respective groups.

Lemma 2. Among the optimal solutions for SNAS, there existoptimal solutions where every path starts and ends at activenodes.

Proof. Let’s begin with an arbitrary optimal solution. Let pbe a shortest path that starts or ends with inactive nodes.If inactive nodes that start or end p are removed suchthat p starts and ends with active nodes, the resultingsubpath p′ is still optimal in terms of activity cover-age, because no active edges were removed. p′ is alsostill a shortest path due to the optimal substructure ofshortest paths wherein subpaths of shortest paths areshortest paths [27]. In other words, eliminating inac-tive nodes from the beginning and end of a shortestpath does not reduce coverage and does not split thepath. KMR takes advantage of this property to achievecomputational savings.

3.2 K-Main Routes AlgorithmAlgorithm 1 presents the pseudocode for the proposed K-Main Routes (KMR) approach. The basic structure of KMRresembles that of K-Means [8] in terms of selecting initialseeds, forming k groups, and updating the representative ofeach group until the assignments no longer change. Line 1of Algorithm 1 selects all shortest paths P that start and endwith active nodes (inactive node pruning). Next, k pathsfrom P are selected as initial summary paths, which arethe “seeds" for KMR (line 2). The algorithm then proceedsin two main phases. First, it forms k groups by assigningeach activity to its closest summary path (line 4). Then, it

Algorithm 1 K-Main Routes (KMR) AlgorithmInput:

1) a spatial network G = (N, E),2) a set of activities A,3) a number of routes k,4) mode1 ∈ {naive, NOVA_TKDE},5) mode2 ∈ {naive, D-SPARE_TKDE}

Output:A summary path set of size k and a partitioning of activ-ities across these summary paths, where the objectiveis to maximize the activity coverage of each summarypath for the group it represents.

Algorithm:1: P← shortest paths between active nodes of G2: P̂← k summary paths ∈ P; stableGroups← false;3: while not stableGroups do4: Phase 1: currentGroups← AssignActivities-

ToSummaryPaths(G, A, k, P̂, mode1)

5: Phase 2: P̂′ ← RecomputeSummaryPaths(G, A, k, currentGroups, mode2)

6: if P̂ = P̂′ then stableGroups← true7: P̂← P̂′8: return currentGroups

updates the summary path of each group by calculatingthe shortest path that maximizes activity coverage (line 5).Assigning and updating repeat until the summary paths nolonger change and the final summary paths and groups arereturned (line 8).

KMR Example: Fig. 4 shows an example execution traceof KMR. The spatial network shown has eight nodes, sevenedges, and eleven activities. For illustration purposes, wechoose 〈A, B〉 and 〈D, E〉 as initial summary paths, k = 2,and set all edge weights to 1.

In step 2 of Fig. 4, two groups are formed by assign-ing each activity to its closest summary path. In thisexample, activities 1, 2, 3, 4, and 5 are assigned to thesummary path 〈A, B〉, and activities 6, 7, 8, 9, 10 and 11are assigned to the summary path 〈D, E〉. The groups arehighlighted with dashed lines. In cases where the distancebetween an activity and several summary paths is equal,the activity is assigned to one of those summary paths atrandom.

Once groups are formed, the summary paths of eachgroup have to be recalculated, as shown in step 3. Inthis case, the new summary paths (shown in step 4) are〈A, B, C〉 and 〈D, E, F〉. These summary paths are chosenbecause they further maximize the activity coverage. Inother words, based on the activities of each group, sum-mary paths 〈A, B, C〉 and 〈D, E, F〉 have maximum activitycoverage.

Step 5 repeats step 2 using the new summary paths〈A, B, C〉 and 〈D, E, F〉. The groups formed in this case areindicated by the dashed lines. Step 6 involves another recal-culation of the summary paths to further maximize theactivity coverage. The summary paths that are recalculateddo not change and as a result, the algorithm terminates. Wenow discuss each phase of KMR in detail.

Page 7: A k main routes approach to spatial network activity summarization

1470 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

Fig. 4. Execution trace of K-Main Routes (KMR). Circles represent nodes, lines represent edges, and squares represent activities (Best in color).

3.2.1 Phase 1: Assign Activities to NearestSummary Paths

In this phase, k groups are formed by assigning each activ-ity to its closest summary path. Algorithm 2 presents thepseudocode for the activity assignment algorithm whichhas two modes: naive and NOVA_TKDE. The naive modeenumerates all the distances between every activity andsummary path, and then assigns each activity to its clos-est summary path. NOVA_TKDE, by contrast, avoids theenumeration that is done in the naive mode while stillproviding correct results.

Performance-Tuning for Phase 1: The Network Voronoiactivity Assignment (NOVA_TKDE) technique is a fasterway of assigning activities to the closest summary path.Consider a virtual node, V, that is connected to every nodeof all summary paths by edges of weight zero. The basicidea is to calculate the distance from V to all active nodesand discover the closest summary path to each activity. Theshortest path from V to each activity a will go through anode in the summary path that is closest to a.

NOVA_TKDE starts by initializing all the relevant datastructures. Line 5 of Algorithm 2 initializes the virtual nodeV connected to each node of all summary paths by edges ofweight 0. Line 6 initializes the Open list and Tnodes to V andthe Closed and Tactivities to the empty set. NOVA_TKDE thenexpands every node in the Open list based on how close itis to the summary paths; closer nodes get expanded first.Once a node n is expanded, it is moved to the Closed list(line 8). Next, each of n′s neighbors xi /∈ Closed is examined,and Tnodes is updated with xi’s distance and sp information,where xi.distance is the network distance of xi from the near-est summary path, and xi.sp is xi’s assigned summary path.xi.distance is calculated by adding n.distance to the distanceof edge (n, xi) (line 9). If xi is not in Open, it is then addedto the Open list (line 10).

Once NOVA_TKDE finds activities on an edge connect-ing node n to a summary path, it records the activitydistance to that summary path. Every activity ai that is onedge (n, xi) is examined, and Tactivities is updated with ai’sdistance and sp information, where ai.distance is the networkdistance of ai from the nearest summary path (based on n),and ai.sp is the assigned summary path of ai. Next, an activ-ity is assigned to a summary path (line 12). If the activitywas previously assigned to another summary path, it is

removed from that path before being assigned to the newsummary path. Once all active nodes have been added tothe Closed list, or the Open list is empty, NOVA_TKDE’smain loop is stopped (line 13). If unassigned activitiesremain due to no connectivity to summary paths, theseactivities are randomly assigned to any summary path ∈ P̂(line 14). NOVA_TKDE then returns the current groups andterminates (line 15).

Algorithm 2 AssignActivitiesToSummaryPathsInput:

1) a spatial network G = (N, E),2) a set of activities A,3) a number of routes k,4) a set of summary paths P̂,5) mode ∈ {naive, NOVA_TKDE}

Output:k groups formed by assigning each activity ∈ A to theclosest summary path ∈ P̂.

Algorithm:1: if mode = “naive" then2: Enumerate all distances between each activity

ai ∈ A and each summary path pi ∈ P̂3: currentGroups← assign each ai to the closest pi4: else if mode = “NOVA_TKDE" then5: V← virtual node connected to all nodes of all

pi ∈ P̂ with zero-weight edges6: Open, Tnodes ← V; Closed, Tactivities ← ∅7: repeat8: n← closest node in Open to any

pi ∈ P̂; Closed← n9: update Tnodes with xi.distance and xi.sp for

each of n’s neighbors xi /∈ Closed10: if xi /∈ Open then Open← xi

11: update Tactivities with ai.distance and ai.sp foreach activity ai ∈ edge(n, xi)

12: currentGroups← assign each ai to the closestpi based on Tactivities

13: until all active nodes ∈ Closed or |Open| = 014: if ∃ unassigned activities, aj then

currentGroups← assign each aj to any pi

15: return currentGroups

Page 8: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1471

Fig. 5. Example of NOVA_TKDE. Activity 10 gets assigned to summarypath 〈D, E〉 because the shortest path 〈V , E, F 〉 from V to activity 10goes through node E of summary path 〈D, E〉 (Best in color).

An example of NOVA_TKDE activity assignment isshown in Fig. 5. Virtual node V is connected by zeroweight edges to nodes A and B of summary path 〈A, B〉and nodes D and E of summary path 〈D, E〉 (Algorithm 2,line 5). Activity 10 gets assigned to summary path 〈D, E〉because the shortest path 〈V, E, F〉 from V to activity 10 goesthrough node E of summary path 〈D, E〉. Similarly, activity5 gets assigned to summary path 〈A, B〉 because the short-est path 〈V, B, C〉 from V to activity 5 goes through node Bof summary path 〈A, B〉.

3.2.2 Phase 2: Recompute Summary PathsIn phase 2, the summary path of each group is recomputedso as to further maximize activity coverage (i.e., the num-ber of activities covered by the set of paths). Algorithm 3presents the pseudocode, which has two modes: naive andD-SPARE_TKDE. The naive mode enumerates the shortestpaths between all active nodes in the spatial network whileD-SPARE_TKDE considers only the set of shortest pathsbetween the active nodes of a group, which gives the correctresults.

Performance-Tuning for Phase 2: The Divide andConquer Summary PAth REcomputation (D-SPARE_TKDE)technique chooses the summary path of each group withmaximum activity coverage but only considers the setof shortest paths between the nodes of a given group(Algorithm 3, lines 8-9). D-SPARE_TKDE assumes rich con-nectivity and otherwise may return a summary path goingoutside a given fragment. If maxPath is null after lookingat the shortest paths in a given group, ci ∈ currentGroups,the shortest path which has the maximum activity coveragebased on the activities in ci is selected as maxPath (lines 10-12). maxPath is then added to P̂ as the new summary pathfor ci (line 13). Once all groups ci ∈ currentGroups havebeen considered, P̂, which contains the summary paths withmaximum activity coverage for each group, is returned.

Fig. 6 shows an example of D-SPARE_TKDE where thegiven group consists of activities 1, 2, 3, 4, and 5, and sum-mary path 〈A, B〉. The goal of D-SPARE_TKDE is to choosea new summary path that maximizes activity coverage forevery group based on the activities of each group. In thiscase, summary paths 〈A, B, C〉 and 〈C, B, A〉 both maximizeactivity coverage based on activities 1, 2, 3, 4, and 5, so D-SPARE_TKDE will choose one of these summary paths asthe new representative for this group.

KMR with Inactive Node Pruning Performance-Tuning:We also applied a pruning strategy to K-Main Routes toimprove its performance. Rather than checking all shortestpaths, inactive node pruning considers only paths between

Algorithm 3 RecomputeSummaryPathsInput:

1) a spatial network G = (N, E),2) a set of activities A,3) a number of routes k,4) a set of groups currentGroups,5) mode ∈ {naive, D-SPARE_TKDE}

Output:k summary paths, P̂′, that maximize activity coverage(AC) for each group ∈ currentGroups

Algorithm:1: if mode = “naive" then2: for each ci ∈ currentGroups do3: P← shortest paths between active nodes of G4: maxPath← path in P with Max AC based on

ci’s activities5: P̂′ ← maxPath6: else if mode = “D-SPARE_TKDE" then7: for each ci ∈ currentGroups do8: P′ ← the set of shortest paths between the active

nodes of ci9: maxPath← path in P′ with Max AC based on

ci’s activities10: if maxPath = ∅ then11: P← shortest paths between active nodes

of G12: maxPath← path in P with Max AC based

on ci’s activities13: P̂′ ← maxPath14: return P̂′

active nodes, thereby reducing the total number of pathsconsidered (Algorithm 1, line 1). For example, Fig. 1(a)shows a spatial network with eight nodes, seven edges, andeleven activities. The active nodes in this network are A, B,C, D, E, F, and G. Without inactive node pruning, the num-ber of shortest paths considered would be 56 because thereare eight nodes, and the shortest path between each nodeand every other node is considered. With inactive nodepruning, the number becomes 42, as only the shortest pathsbetween the seven active nodes are considered.

4 THEORETICAL ANALYSIS

In this section, we present a proof of NP-Completeness forspatial network activity summarization. We also presentproofs of correctness for our proposed performance-tuningtechniques.

4.1 Proof of NP-CompletenessFor simplicity, we begin by defining a generalized, decisionversion of SNAS where the set of paths might be arbitraryand show this problem to be NP-complete. We then presenta proof sketch showing that the decision version of SNASwith shortest paths is also NP-complete.

Definition 6 (Snas Decision). Decision version of SNAS:INSTANCE: A spatial network G = (N, E) with weight func-tion w(u, v) ≥ 0 for each edge e = (u, v) ∈ E, a set of activitiesA and their locations, a set of paths P, a desired number of

Page 9: A k main routes approach to spatial network activity summarization

1472 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

Fig. 6. Example of D-SPARE_TKDE. Only shortest paths between nodes in the group are considered when choosing the new summary path whichmaximizes activity coverage. In this case, summary paths 〈A, B, C〉 and 〈C, B, A〉 both maximize activity coverage based on activities 1, 2, 3, 4, and5 so D-SPARE_TKDE will choose one of these summary paths as the new representative for this group (Best in color).

routes k, and a bound B ∈ Z+, where Z+ denotes the set ofpositive integers.

QUESTION: Does P contain a cardinality k subset P′ ofP, i.e., a subset P′ ⊆ P with

∣∣P′

∣∣ = k and AC(P′) ≥ B, where

AC(P′) denotes the activity coverage of P′?Theorem 1. SNAS is NP-complete.

Proof. The process of devising an NP-completeness prooffor a decision problem � consists of the following foursteps [28]:

1) showing that � is in NP,2) selecting a known NP-complete problem �′,3) constructing a transformation f from � to �′, and4) proving that f is a (polynomial) transformation

In step 1, to show that SNAS ∈ NP, assume that a cer-tificate and a number B are given. The certificate consistsof a spatial network G = (N, E) with weight functionw(u, v) ≥ 0 for each edge ei = (u, v) ∈ E, a set of activ-ities A, a set of paths P, a desired number of routes k,and a cardinality k subset P′ of P, i.e., a subset P′ ⊆P with |P′| = k. We can then verify in polynomial timewhether AC(P′) ≥ B because AC(P′) involves countingthe number of activities in P′.

Step 2 selects the Maximum Coverage problem [29] asa known NP-complete problem �′. Although the opti-mization version of Maximum Coverage [29] is knownto be NP-Hard, its decision version is NP-Complete. Thedecision version is specified as follows:

INSTANCE: A number k and a collection of sets S =S1, S2, . . . , Sm , where Si ⊆ {l1, l2, . . . , ln}, and a boundB ∈ Z+, where Z+ denotes the positive integers.

QUESTION: Does S contain a subset S′ ⊆ S of setssuch that |S′| ≤ k and the number of covered elements∣∣∣∣∣

Si∈S′Si

∣∣∣∣∣≥ B?

Steps 3 and 4 construct a transformation f from �

to �′ and prove that f is a (polynomial) transformation.The reduction entails a polynomial time transformationof the input of Maximum Coverage to the input of SNASfollowed by a polynomial time transformation of the out-put of SNAS to the output of Maximum Coverage. Theinput of Maximum Coverage may be transformed to theinput of SNAS using the following steps:

1) Impose a total order TO on n elements in L ={l1, l2, . . . , ln}.

2) Convert each element in L into a node with oneactivity.

3) Convert each set Si to a path Pi .

• Add edge (lj, lj+1) ∀ j ∈ 1... |Si| .

The transformation computation time is dominatedby the polynomial step of sorting the elements in Siusing TO. Thus it is easy to see that the entire trans-formation is indeed polynomial. Consider an instance ofthe Maximum Coverage problem as shown in Fig. 7(a)where L = {l1, l2, l3, l5, l6}, k = 2, S1 = {l1, l2}, S2 = {l2, l3},S3 = {l1, l2, l3}, and S4 = {l5, l6}. The resulting instanceof SNAS would thus be P = {(l1 → l2), (l2 → l3),(l1 → l2 → l3), (l5 → l6)}, k = 2, A = {a1, a2, a3, a5, a6},and ActivityNode = {a1(l1), a2(l2), a3(l3), a5(l5), a6(l6)}.

Next, we convert the instance of SNAS output to aninstance of Maximum Coverage output. The transforma-tion, which is also polynomial, is as follows:

• For each k path Pi produced by SNAS, convert theactivities on the path into elements and form a set Si.The candidate solutions for the instance of SNAS

shown in Fig. 7(a) are (l1 → l2 → l3) and (l5 → l6).The resulting instance of the Maximum Coverage outputwould be S1 = {l1, l2, l3} and S2 = {l5, l6}.

The reduction of Maximum Coverage to SNAS is apolynomial time reduction since the input of MaximumCoverage can be reduced to SNAS in polynomial time,and the output of SNAS can be reduced to MaximumCoverage in polynomial time.

Since SNAS belongs to the class of NP and a knownNP-complete problem is reduced to it, the decisionversion of SNAS is NP-complete.

We now present a proof sketch that the decision ver-sion of SNAS where the set of paths are shortest pathsis also NP-complete. We follow the same construction asbefore but assign a cost for the edge (li, lj) to be j − i(Fig. 7(b)). This step ensures that all paths generated byconstruction are shortest paths.

4.2 Correctness of Performance-Tuning DecisionsWe prove the correctness of inactive node pruning,NOVA_TKDE, and D-SPARE_TKDE as follows:

Theorem 2. Inactive node pruning is a correct filtering method.

Proof. A filtering method is considered correct if it doesnot exclude all optimal solutions. By Lemma 2, amongthe optimal solutions for SNAS, there exist optimal solu-tions where every path starts and ends at active nodes.Since inactive node pruning does not prune out any paththat starts and ends at active nodes, no such optimalsolutions will be pruned. Thus, inactive node pruning iscorrect.

Theorem 3. Network Voronoi activity Assignment(NOVA_TKDE) is a correct activity assignment method, i.e.,it assigns every activity to the nearest summary path.

Page 10: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1473

Fig. 7. SNAS instance resulting from maximum coverage instance for (a) arbitrary paths and (b) shortest paths.

Proof. An activity assignment method is considered correctif it assigns every activity to its nearest summary path.Let each activity be located at a particular node n ∈ Nand let a virtual source node V be added to G = (N, E)

such that V is connected by an edge of zero weight toeach node in the set of summary paths, P̂. The correct-ness of NOVA_TKDE can be argued in a similar fashionto that of Dijkstra’s algorithm [27], where each activenode containing activity a is connected to V via the near-est node n ∈ P̂. Since a is assigned to the summary pathp ∈ P̂ containing node n, it is assigned to the nearestsummary path. Thus, NOVA_TKDE is correct.

Theorem 4. Divide and Conquer Summary Path Recomputation(D-SPARE_TKDE) is a correct summary path recomputationmethod, i.e., it returns a shortest path with maximum activitycoverage for a given group c ∈ C.

Proof. A summary path recomputation method is consid-ered correct if it returns a shortest path with maximumactivity coverage for a given group c ∈ C. In recomputinga summary path p ∈ c, all shortest paths between nodesin a given group c are considered. Thus, the optimalsolution, i.e., the shortest path with maximum activitycoverage in c, will not be filtered and will be returned.Thus, D-SPARE_TKDE is correct.

Lemma 3. AssignActivitiesToSummaryPaths terminates on allinputs

Proof. AssignActivitiesToSummaryPaths has two modes:naive and NOVA_TKDE. The naive mode terminates onall inputs because there are a finite number of distancesconsidered in assigning activities to the nearest summarypath since there are a finite number of activities and sum-mary paths. If the distances between an activity a andseveral summary paths are equal, then a gets assignedto one of the summary paths at random. Therefore, evenif no path exists between an activity and any summarypath, the distances between that activity and all sum-mary paths will all be equal to infinity and the activitywill be assigned to one of the summary paths at ran-dom. In the NOVA_TKDE mode, each node from theOpen list that is examined gets moved to the Closed list,and the main loop ends when either all active nodesare moved to the Closed list or the Open list is empty.In the case where no path exists between an activityand any summary path, the activity will be assigned toone of the summary paths at random. Since the num-ber of nodes, summary paths, and activities is finite,NOVA_TKDE will also terminate on all inputs. Sinceboth naive and NOVA_TKDE terminate on all inputs,

AssignActivitiesToSummaryPaths also terminates on allinputs.

Lemma 4. RecomputeSummaryPaths terminates on all inputs.

Proof. RecomputeSummaryPaths has two modes: naiveand D-SPARE_TKDE. The naive mode terminates on allinputs because it considers a finite set of shortest pathsbetween active nodes and returns the path with maxi-mum activity coverage. D-SPARE_TKDE also terminateson all inputs because it considers a subset of the finite setof shortest paths considered in the naive mode (i.e. short-est paths between the active nodes of a group). Sinceboth naive and D-SPARE_TKDE terminate on all inputs,RecomputeSummaryPaths also terminates on all inputs.

Lemma 5. KMR terminates on all inputs.

Proof. KMR terminates when either global or local max-ima are reached. Termination is guaranteed becauseactivity coverage must increase for the next itera-tion to happen and the activity coverage of thecurrently selected paths is bounded by the totalnumber of input activities. Each iteration has twophases, each of which is guaranteed to terminate,i.e., AssignActivitiesToSummaryPaths terminates on allinputs (Lemma 3) and RecomputeSummaryPaths ter-minates on all inputs (Lemma 4). Since each iterationis guaranteed to terminate, the activity coverage mustincrease for the next iteration to occur, and there are afinite number of activities, KMR terminates on all inputs.

5 CASE STUDY

We conducted a qualitative evaluation of KMR compar-ing its output with the output of Crimestat K-means [8],[30] on a real pedestrian fatality data set [7], shown inFig. 8(a). The input consists of 43 pedestrian fatalities (rep-resented as dots) in Orlando, Florida occurring between2000 and 2009. As we’ve explained, KMR uses paths andnetwork distance to group activities on a spatial network.By contrast, in geometry-based summarization, the parti-tioning of spatial data is based on grouping similar pointsdistributed in planar space where the distance is calcu-lated using Euclidean distance. Such techniques focus onthe discovery of the geometry (e.g., circle, ellipse) of highdensity regions [2] and include K-means [8], [31]–[35],K-medoid [9], [10], P-median [11] and Nearest NeighborHierarchical Clustering [12] algorithms.

When evaluating the techniques, we consider two out-puts: 1) the groups (represented by colors) and 2) therepresentatives of each group (e.g., paths or ellipses). Asnoted earlier, pedestrian fatalities usually occur on streets,

Page 11: A k main routes approach to spatial network activity summarization

1474 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

Fig. 8. Comparing KMR and Crimestat K-means output for k = 4 on pedestrian fatality data from Orlando, FL [7] (Best in color). (a) Input.(b) Crimestat K-means with Euclidean distance. (c) Crimestat K-means with network distance. (d) KMR.

particularly along arterial roadways [1]. Thus this activitycan be said to have a linear generator. However, the resultsgenerated by Crimestat do not capture this. From Fig. 8(b),it is clear that the ellipse-based output is meant for areas,not streets. Results with Crimestat K-Means using networkdistance are no better. Although the red ellipse in Fig. 8(c)is aligned with a part of the arterial road, not all the activi-ties on this arterial are captured. Instead, the activities thatoccur on the road towards the bottom of the figure are splitamong the red, green, and blue groups. In contrast, thegroups of activities in KMR fully capture the activities onthe arterial roads (Fig. 8(d)). For example, the blue groupand summary path capture the activities on the arterialroad that were split across three groups in network-basedK-Means. Furthermore, the path-based representatives inthe figure make sense in this context due to the inherentlylinear nature of the activities.

6 EXPERIMENTAL EVALUATION

The goal of our experiments was to evaluate KMR withand without performance-tuning. Scalability was evalu-ated by varying and observing the effect of four workloadparameters: nodes, activities, routes, and active node ratio.

Definition 7. The active node ratio, ANR, is the ratio ofactive nodes to all nodes.

The active node ratio in Fig. 1(a) is 7/8, i.e., the totalnumber of active nodes, 7, divided by the total number ofnodes, 8.

For each workload experiment, we ran seven versions ofKMR:

• KMR with NOVA_TKDE only (KMRV)• KMR with D-SPARE_TKDE only (KMRD)• KMR with both NOVA_TKDE and D-SPARE_TKDE

(KMRVD)• KMR with inactive node pruning only (KMRI) [23]• KMR with both inactive node pruning and

NOVA_TKDE (KMRIV)• KMR with both inactive node pruning and D-

SPARE_TKDE (KMRID)• KMR with all three performance-tuning decisions

(KMRIVD)All experiments were performed on a Mac Pro with a

2 x Xeon Quad Core 2.26 GHz processor and 16 GB RAM.In all experiments, the summary paths were initialized tothe top-k edge-disjoint shortest paths.

6.1 Experiment Data SetsOur experiments were performed on both synthetic andreal-world data. Synthetic data were used to evaluate thescalability of KMR and to vary key parameters such as theactive node ratio, that could not be varied in the real-worlddata set. The synthetic data set was a road network gen-erated using the U.S. Census Bureau’s 2010 TIGER/LineShapefiles for Hennepin County, Minnesota [36]; a numberbetween 0 and 100, representing activities, was associatedwith each street.

Page 12: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1475

(a) (b)

Fig. 9. Scalability of KMR with increasing number of nodes. (a) Synthetic data. (b) Real data.

The real-world data set, obtained from the FatalityAnalysis Reporting System (FARS) encyclopedia [7], is ageospatial and temporal data set describing pedestrianfatalities in Orange County, FL (which includes Orlando),from 2001 to 2011. Every activity was associated withits closest street (Euclidean distance) in the road net-work, obtained from the U.S. Census Bureau’s TIGER/LineShapefiles for Orange County [36].

6.2 Experimental Results6.2.1 Effect of the Number of NodesBoth synthetic and real-world data sets were used toobserve the effect of increasing the number of nodes onexecution time. For the synthetic data, the number of routesk was set to 2, the number of activities was set to 1, 200,and the active node ratio was set to 0.2. For the real-worlddata, the number of routes k was set to 30, the number ofactivities was 602 for 60, 000 nodes, and the active noderatio for the 60, 000 nodes was 0.007. Fig. 9 gives the exe-cution times for both data sets. As can be seen, KMRIVD isthe fastest. Computational savings increases as the numberof nodes increases due to NOVA_TKDE, D-SPARE_TKDE,and inactive node pruning.

6.2.2 Effect of the Number of RoutesAgain, both synthetic and real-world data sets were usedto observe the effect of increasing the number of routes

on execution time. For the synthetic data set, the numberof nodes was set to 1, 000, the number of activities was1, 200 and the active node ratio was 0.2. For the real-worlddata, the number of nodes was set to 15, 000, the numberof activities for the 15, 000 nodes was 369, and the activenode ratio for the 15, 000 nodes was 0.005. Fig. 10 gives theexecution times on the two data sets as the number of routesin the network increases. As before, KMRIVD performs thebest. Computational savings increase when performance-tuning is introduced.

6.2.3 Effect of the Number of ActivitiesOnly synthetic data was used to observe the effect of activ-ity number, since this parameter cannot be varied in realdata. The number of nodes was set to 1, 000, the numberof routes k was 2, and the active node ratio was 0.2. Asshown in Fig. 11(a), KMRIVD performs the best followed byKMRIV and KMRID. Computational savings increases whenperformance-tuning is introduced.

6.2.4 Effect of Active Node RatioRecall that the active node ratio (ANR) is the number ofactive nodes divided by the total number of nodes. Hereagain, only synthetic data was tested because the param-eter cannot be varied in the real data set. The number ofnodes was set to 1, 000, the number of routes k was 2, andthe number of activities was 1, 200. Fig. 11(b) shows that

(a) (b)

Fig. 10. Scalability of KMR with increasing number of routes k . (a) Synthetic data. (b) Real data.

Page 13: A k main routes approach to spatial network activity summarization

1476 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

(a) (b)

Fig. 11. Scalability of KMR with increasing (a) number of activities and (b) active node ratio on synthetic data.

(a) (b) (c)

Fig. 12. Example (a) input and output of (b) top-k queries and (c) SNAS (Best in color).

KMRIVD performs the best followed by KMRID and KMRIV .Computational savings increases with higher active noderatios due to NOVA_TKDE, D-SPARE_TKDE, and inactivenode pruning.

In summary, the experiments uniformly show thatKMRIVD, the combination of all three performance-tuningdecisions, gives the best results, much better than KMRI [23]alone. This is because KMRIVD filters shortest pathsthat are not between active nodes, enables faster assign-ment of activities, and enables faster summary pathcalculation.

7 DISCUSSION

Individual runs of KMR are greedy like individual exe-cutions of the K-Means algorithm and risk convergenceto local minima. For example, Table 4 shows four pairsof initial seeds in the sample network dataset shown inFig. 12 and the corresponding final solutions. When initial-ized with seeds of 〈A, B〉 and 〈D, E〉, KMR converges to afinal solution of 〈A, B, C〉 and 〈D, E, F〉 with a total activ-ity coverage of 10 as shown in the first row of Table 4.However, when initialized with 〈A, B, E, D〉 and 〈B, C〉, thealgorithm converges to initial seeds (a local minima) witha total coverage of 8. Just like K-Means, the solution qual-ity of KMR, may be enhanced by running it multiple timeswith different initial seeds.

There is more than one way to detect linear activ-ity concentration. Different formulations result in differentoutputs. For example, top-k queries return the k shortestpaths whereas SNAS returns groups of activities whereeach group is represented by a summary path with max-imum activity coverage. When initially seeded with theresults of the top-k query, KMR’s results will be no worsein terms of activity coverage as shown in bottom tworows of Table 4. However, KMR can find better solu-tions with different seeds as shown in first two rows ofTable 4.

Fig. 12 shows example outputs of top-k queries andSNAS on a given network for k = 2. In this case, the top-kpaths are constrained to be disjoint shortest paths in termsof edges (versus overlapping paths) and the two paths withthe most activities are 〈A, B, E, D〉 (Activity Coverage = 6)and either 〈B, C〉 (as illustrated) or 〈E, F〉 (Activity Coverage= 2), resulting in a total activity coverage of 8. Top-k queriesare not designed to group activities and thus activities thatare not on a top-k path do not belong to any group (e.g.,activities 9, 10, and 11). SNAS, by contrast, returns groupsof activities represented by summary paths 〈A, B, C〉 and〈D, E, F〉 shown by the dashed lines (Activity Coverage =10). 〈A, B, C〉 and 〈D, E, F〉 are representatives that maxi-mize activity coverage for the set of activities in each oftheir respective groups. As can be seen, the output of SNASis different from that of top-k. An activity that is not on asummary path in SNAS may still be a member of the same

TABLE 4Initial Seeds and Final Solutions of KMR

Page 14: A k main routes approach to spatial network activity summarization

OLIVER ET AL.: A K-MAIN ROUTES APPROACH TO SPATIAL NETWORK ACTIVITY SUMMARIZATION 1477

group (e.g., activity 11 in Fig. 12(c)) whereas the notion ofgroups does not exist in top-k queries.

8 CONCLUSION

This work explored the problem of spatial network activ-ity summarization in relation to important applicationdomains such as preventing pedestrian fatalities and crimeanalysis. We proposed a K-Main Routes (KMR) algorithmthat discovers a set of k shortest paths to group activi-ties. KMR uses inactive node pruning, Network Voronoiactivity Assignment (NOVA_TKDE) and Divide and con-quer Summary PAth REcomputation (D-SPARE_TKDE) toenhance its performance and scalability. We presented acase study comparing KMR with other summarization tech-niques on pedestrian fatality data. Experimental evaluationusing both synthetic and real-world data sets indicated thatthe performance-tuning decisions utilized by KMR yieldedsubstantial computational savings without reducing thecoverage of the resulting summary paths.

In future work, we plan to explore other types ofdata that may not be associated with a point in a street(e.g., aggregated pedestrian fatality data at the zip codelevel). We plan to investigate a distance-based rather thancoverage-based objective function. We will also generalizeSNAS for all paths and explore spatial constraints (e.g.,nearest neighbors).

In an alternative formulation of the problem, activitiescould be modeled as simple edge weights. This mightbe useful for applications where summarization does notrequire an exact location for a given activity. We intend toinvestigate this in future work. We also plan to character-ize graphs where D-SPARE_TKDE will find a path within agiven fragment as well as extend our approach to accountfor different types of activities. Finally, incorporating timeinto activity summarization and choosing the right k will beexplored.

ACKNOWLEDGMENTS

This work was supported in part by the US NationalScience Foundation under Grant 1029711, III-CXT IIS-0713214, IGERT DGE-0504195, CRI:IAD CNS-0708604, andin part by USDOD under Grant HM1582-08-1-0017,HM1582-07-1-2035, and W9132V-09-C-0009. The authorswould like to thank K. Koffolt, J. Schiappa, D. Ross, Jr.,M. Masli, V. M. V. Gunturi, D. C. Cugler, and the membersof the University of Minnesota Spatial Database and SpatialData Mining Research Group for their comments.

REFERENCES

[1] M. Ernst, M. Lang, and S. Davis. (2011). Dangerous bydesign: Solving the epidemic of preventable pedestriandeaths. Transportation for America: Surface TransportationPolicy Partnership. Washington, DC, USA. [Online]. Available:http://trid.trb.org/view.aspx?id=1148931

[2] J. Eck, S. Chainey, J. Cameron, M. Leitner, and R. Wilson.(2005, Aug.). Mapping Crime: Understanding Hot Spots, U.S.Department of Justice. Washington, DC, USA [Online]. Available:https://www.ncjrs.gov/pdffiles1/nij/209393.pdf

[3] D. Matthews, S. Effler, C. Driscoll, S. O’Donnell, and C. Matthews,“Electron budgets for the hypolimnion of a recovering urban lake,1989-2004: Response to changes in organic carbon deposition andavailability of electron acceptors,” Limnol. Oceanogr., vol. 53, no. 2,pp. 743–759, 2008.

[4] Huffington Post. (2013, Apr.). Hungary: Snowstorm StrandsThousands in Their Cars [Online]. Available:http://www.huffingtonpost.com/huff-wires/20130315/eu-europe-snow/?utm_hp_ref=travel&ir=travelM1

[5] R. Wronski. (2013, Apr. 9). Metra argues for delay of ‘fail-safe’rail system. Chicago Tribune [Online]. Available:http://www.chicagotribune.com/news/local/ct-met-metra-collision-prevention-20130409,0,3710438.story

[6] S. Shekhar, M. Evans, J. Kang, and P. Mohan, “Identifying patternsin spatial information: A survey of methods,” WIREs Data MiningKnowl. Discov., vol. 1, no. 3, pp. 193–214, Apr. 2011.

[7] Fatality Analysis Reporting System (FARS). Encyclopedia, NationalHighway Traffic Safety Administration (NHTSA) [Online]. Available:http://www.nhtsa.gov/FARS

[8] J. MacQueen et al., “Some methods for classification and analysisof multivariate observations,” in Proc. 5th Berkeley Symp. Math.Statist. Probab., vol. 1. Berkeley, CA, USA, 1967, p. 14.

[9] L. Kaufman and P. Rousseeuw, Finding Groups in Data: AnIntroduction to Cluster Analysis. Wiley Online Library, 1990.

[10] R. Ng and J. Han, “Efficient and effective clustering methods forspatial data mining,” in Proc. 20th Int. Conf. VLDB, San Francisco,CA, USA, 1994, pp. 144–155.

[11] M. Resende and R. Werneck, “A hybrid heuristic for the p-medianproblem,” J. Heuristics, vol. 10, no. 1, pp. 59–88, Jan. 2004.

[12] R. D’Andrade, “U-statistic hierarchical clustering,” Psychometrika,vol. 43, no. 1, pp. 59–67, Mar. 1978.

[13] M. Celik, S. Shekhar, B. George, J. Rogers, and J. Shine,“Discovering and quantifying mean streets: A summary ofresults,” Univ. Minnesota, Minneapolis, MN, USA, Tech. Rep.07-025, 2007.

[14] K. Buchin et al., “Detecting hotspots in geographic networks,” inProc. Adv. GIScience, Berlin, Germany, 2009, pp. 217–231.

[15] S. Roach, The Theory of Random Clumping. London, U.K.: Methuen,1968.

[16] A. Okabe, K. Okunuki, and S. Shiode, “The SANET toolbox: Newmethods for network spatial analysis,” Trans. GIS, vol. 10, no. 4,pp. 535–550, Jul. 2006.

[17] S. Shiode and A. Okabe, “Network variable clumping methodfor analyzing point patterns on a network,”paper presentedat the Annu. Meet. Associations of American Geographers,Philadelphia, PA, USA, 2004.

[18] K. Aerts, C. Lathuy, T. Steenberghen, and I. Thomas, “Spatial clus-tering of traffic accidents using distances along the network,” inProc. 19th Workshop ICTCT, 2006, pp. 1–10.

[19] P. Spooner, I. Lunt, A. Okabe, and S. Shiode, “Spatial analysis ofroadside Acacia populations on a road network using the net-work K-function,” Landscape Ecol., vol. 19, no. 5, pp. 491–499,Jul. 2004.

[20] T. Steenberghen, T. Dufays, I. Thomas, and B. Flahaut, “Intra-urban location and clustering of road accidents using GIS:A Belgian example,” Int. J. Geogr. Inform. Sci., vol. 18, no. 2,pp. 169–181, 2004.

[21] I. Yamada and J. Thill, “Local indicators of network-constrainedclusters in spatial point patterns,” Geogr. Anal., vol. 39, no. 3,pp. 268–292, Jul. 2007.

[22] S. Shiode and N. Shiode, “Detection of multi-scale clusters in net-work space,” Int. J. Geogr. Inform. Sci., vol. 23, no. 1, pp. 75–92,Jan. 2009.

[23] D. Oliver, A. Bannur, J. M. Kang, S. Shekhar, and R. Bousselaire,“A K-main routes approach to spatial network activity summa-rization: A summary of results,” in Proc. IEEE ICDMW, Sydney,NSW, Australia, 2010, pp. 265–272.

[24] X. Li, J. Han, J. Lee, and H. Gonzalez, “Traffic density-based dis-covery of hot routes in road networks,” in Proc. 10th SSTD, Berlin,Germany, 2007, pp. 441–459.

[25] M. Terrovitis, S. Bakiras, D. Papadias, and K. Mouratidis,“Constrained shortest path computation,” in Proc. SSTD, Berlin,Germany, 2005, p. 923.

[26] S. Shekhar and D. Liu, “CCAM: A connectivity-clustered accessmethod for networks and network computations,” IEEE Trans.Knowl. Data Eng., vol. 9, no. 1, pp. 102–119, Jan./Feb. 1997.

[27] T. Cormen, Introduction to Algorithms. Cambridge, MA, USA:MIT Press, 2001.

[28] M. Garey and D. Johnson, Computers and Intractability: A Guide tothe Theory of NP-Completeness. San Francisco, CA, USA: Freeman,1979.

Page 15: A k main routes approach to spatial network activity summarization

1478 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

[29] D. Hochbaum, “Approximating covering and packing problems:Set cover, vertex cover, independent set, and related problems,”in Approximation Algorithms for NP-Hard Problems, Boston, MA,USA: PWS Publishing Co., 1996, pp. 94–143.

[30] N. Levine, CrimeStat: A Spatial Statistics Program for the Analysisof Crime Incident Locations (v 2.0). Houston, TX, USA: Ned Levine& Associates; Washington, DC, USA: The National Institute ofJustice, 2002.

[31] S. Borah and M. Ghose, “Performance analysis of AIM-k-means& k-means in quality cluster generation,” J. Comput., vol. 1, no. 1,pp. 175–178, Dec. 2009.

[32] A. Barakbah and Y. Kiyoki, “A pillar algorithm for k-meansoptimization by distance maximization for initial centroid des-ignation,” in Proc. IEEE Symp. CIDM, Nashville, TN, USA, 2009,pp. 61–68.

[33] S. Khan and A. Ahmad, “Cluster center initialization algorithmfor k-means clustering,” Pattern Recognit. Lett., vol. 25, no. 11,pp. 1293–1302, Aug. 2004.

[34] D. Pelleg and A. Moore, “X-means: Extending k-means with effi-cient estimation of the number of clusters,” in Proc. 17th Int. Conf.Mach. Learn., San Francisco, CA, USA, 2000, pp. 727–734.

[35] P. Bradley and U. Fayyad, “Refining initial points for k-meansclustering,” in Proc. 15th Int. Conf. Mach. Learn., San Francisco,CA, USA, 1998, pp. 91–99.

[36] U.S. Census Bureau, (2010). Census TIGER/Line Shapefiles[Online]. Available: http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.html

Dev Oliver is a Ph.D. candidate at the Universityof Minnesota, Minneapolis, MN, USA in com-puter science. He received the bachelor’s degreein computer science at Macalester College,St Paul, MN, USA, in 2004 and the master’sdegree in computer engineering at the Universityof Florida, Gainesville, FL, USA, in 2008. Hiscurrent research interests include spatial datamining, spatial databases, and spatial data sum-marization. He is a student member of the IEEE.

Shashi Shekhar is a McKnight DistinguishedUniversity Professor at the University ofMinnesota, Minneapolis, MN, USA. For contribu-tions to spatial databases, spatial data mining,and geographic information systems (GIS), hereceived the IEEE-CS Technical AchievementAward and was elected as fellow of the IEEE andthe AAAS. He co-authored a textbook on SpatialDatabases, and co-edited an Encyclopediaof GIS. He is serving as a member of theComputing Community Consortium Council,

a Co-Editor-in-Chief of Geo-Informatica journal, and a ProgramCo-Chair of G. I. Science Conference in 2012. Earlier, he served onthe National Academies’ Committees (Mapping Sciences, GEOINTResearch Priorities, and GEOINT Workforce), and Editorial Boardof IEEE Transactions on Knowledge and Data Engineering. He alsoCo-Chaired Symposium on Spatial and Temporal Databases and ACMConference on GIS.

James M. Kang received the bachelor’s degreefrom Purdue University, West Lafayette, IN,USA, in 2000, the master’s degree from theRochester Institute of Technology, NY, USA, in2004, and the Ph.D. degree from the University ofMinnesota, Minneapolis, MN, USA, in 2010, all incomputer science. His current research interestsinclude spatial and spatio-temporal data min-ing, spatial and spatio-temporal databases, andcontinuous query processing.

Renee Laubscher started her career with the Federal Government in2005. Her first position was a Staff Officer in the former InformationTechnology Team in InnoVision’s Basic and Applied Research Office.In 2008, she worked with the NGA’s University of Research Initiative(NURI) Program. In 2009, she received the master’s degree in geo-graphic information systems at the University of Redlands, CA, USA.She is presently a Project Scientist with IB’s GEOINT Analytics Divisionwhere she applies her GIS knowledge to correlate sensor, imagery, fea-ture, and text data into formats that can be loaded into a database tofacilitate geospatial analysis.

Veronica Carlan received the bachelor’s degree in mathematics andforeign affairs with a specialization in the Middle East in 2011 from theUniversity of Virginia, Charlottesville, VA, USA. She is currently work-ing on the master’s degree in statistics from Georgetown University,Washington, DC, USA.

Abdussalam Bannur is currently a Lecturer atthe University of Tripoli, Libya, in the ComputerScience Department. He received the first mas-ter’s degree in 1995 from the InternationalInstitute for Aerospace Survey and EarthScience (ITC), the Netherlands, in Applicationof GIS in Urban Planning. He received the sec-ond master’s degree in 2012 from the Universityof Minnesota, Minneapolis, MN, USA, in com-puter science. His current research interestsinclude spatial databases infrastructure, volun-

teered geographic information, and metadata taxonomy.

� For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.