Frequence: Interactive Mining and Visualization of Temporal ...of the mining algorithm, such as...

Frequence: Interactive Mining and Visualization ofTemporal Frequent Event SequencesAdam Perer

IBM T.J. Watson Research CenterYorktown Heights, New York, USA

[email protected]

Fei WangIBM T.J. Watson Research Center

Yorktown Heights, New York, [email protected]

ABSTRACTExtracting insights from temporal event sequences is an im-portant challenge. In particular, mining frequent patternsfrom event sequences is a desired capability for many do-mains. However, most techniques for mining frequent pat-terns are ineffective for real-world data that may be low-resolution, concurrent, or feature many types of events, orthe algorithms may produce results too complex to interpret.To address these challenges, we propose Frequence, an intel-ligent user interface that integrates data mining and visual-ization in an interactive hierarchical information explorationsystem for finding frequent patterns from longitudinal eventsequences. Frequence features a novel frequent sequencemining algorithm to handle multiple levels-of-detail, tempo-ral context, concurrency, and outcome analysis. Frequencealso features a visual interface designed to support insights,and support exploration of patterns of the level-of-detail rele-vant to users. Frequence’s effectiveness is demonstrated withtwo use cases: medical research mining event sequences fromclinical records to understand the progression of a disease,and social network research using frequent sequences fromFoursquare to understand the mobility of people in an urbanenvironment.

Author KeywordsVisual Analytics, Frequent Sequence Mining, TemporalVisualization

ACM Classification KeywordsH.5.m. Information Interfaces and Presentation (e.g. HCI):Miscellaneous

INTRODUCTIONOne of the challenges of the Big Data era is to leverage thevoluminous data that is being captured to drive decision mak-ing and insights. Common to such data are temporal events,data points with both a timestamp and event type, so under-standing patterns of temporal event sequences is an importantproblem to many. For example, medical researchers wish toleverage the data captured by electronic health records to de-termine if certain sequences of medical events correlate withpositive outcomes. Similarly, city government officials wishPermission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than theauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, orrepublish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected]’14, February 24–27, 2014, Haifa, Israel.Copyright is held by the owner/author(s). Publication rights licensed to ACM.ACM 978-1-4503-2184-6/14/02..$15.00.http://dx.doi.org/10.1145/2557500.2557508

to leverage the temporal data collected from their transporta-tion systems, call centers, and law enforcement agencies toimprove their cities’ services.

However, despite the availability of temporal data and thecommon desire to extract knowledge, mining patterns fromtemporal event sequences is still a fundamental challenge indata mining [7]. Frequent Sequence Mining (FSM) tech-niques have emerged in the data mining community to findsets of frequently occurring subsequences. However, thesealgorithms often have constraints that limit its applicability toreal-world data:

• Level of Detail. Temporal events are often recorded ata specific level-of-detail to record maximum informationabout an event’s type. FSM techniques applied to data witha large dictionary of event types will often suffer from com-putational complexity. Perhaps even more of a fundamen-tal problem is that patterns extracted from a specific level-of-detail may impair an interpretable overview of patternsfor users. In order for mining techniques to be technicallyfeasible and usable, we propose multiple levels-of-detailshould be available to users.

• Temporal Context. Many FSM techniques ignore thetemporal context associated with data, and instead focus onthe pure sequentiality of events. However, for certain real-world scenarios, if a certain amount of time elasped be-tween events, the events should not be considered as part ofthe same sequence, even if events are technically sequen-tial in the event log. We propose mining techniques needto take into consideration the temporal context of users.

• Concurrency. Many FSM algorithms suffer from patternexplosion when there are many concurrent events. This isparticularly troubling for real-world data, as many systemsmay record data in low-resolution precision, such as a day,and many events may occur on the same day. For otherdatasets, even when there is extreme temporal precision todata (e.g. millisecond precision), the exact order of eventsmay be irrelevant, so if they occur within a domain-relevanttime window, they should be treated as concurrent. Wepropose mining techniques need to handle concurrency.

• Outcome. Many FSM algorithms are agnostic to the typesof patterns mined. However, in real-world data, analystsmay not just need a list of patterns but instead how each ofthe patterns correlate to an outcome measure. We proposemining techniques need to support outcome analysis.

In this paper, we enhance a fast and popular FSM algorithmto handle these real-world needs to serve actionable insights.

But to support each need properly, the parameters should beflexible to derive from user hypotheses or evolve during aniterative data exploration process. In order to support flexi-ble decision making, we present Frequence, a intelligent userinterface designed to reach actionable insights by integratingvisualization with interactive mining algorithms.

We demonstrate Frequence’s utility in two real-world usecases: a medical informatics researcher using frequent se-quences in electronic medical records to understand the pro-gression of a disease among patients, and a social networkresearcher using frequent sequences from a location-basedsocial network to understand the relationship between on-line popularity and offline mobility throughout urban envi-ronments.

Concretely, Frequence’s contributions include both a novelalgorithm for Frequent Sequence Mining whose design han-dles real-world constraints of level-of-detail, temporal con-text, concurrency, and outcome, and a novel intelligent userinterface that integrates mining and visualization to supportinteractive parameterization and exploration to reach insights.

This paper begins with related work, providing an overviewof sequence mining and visualization techniques. Next, anoverview of the design of Frequence is described, as wellas the iterative workflow it supports. Then, Frequence’s se-quence mining algorithms are described as well as its novelenhancements. Next, Frequence’s visual interface is de-scribed in terms of its visual encoding and interactive UI toparameterize the analytics. Then, use cases are described todemonstrate Frequence’s utility on real-world data. Finally,this paper concludes with a discussion of future work.

RELATED WORKThere has been little work that integrates techniques for min-ing and visualizing frequent event sequences, so we presentprior work on mining and visualization separately.

Mining Frequent Event SequencesFrequent Sequence Mining (FSM) is a popular techniquefor finding sets of frequently occurring subsequences froma larger set of temporal event sequences. It is challengingsince one may need to examine a combinatorially explosivenumber of possible frequent subsequences.

FSM has been researched for a long time and many ap-proaches have been proposed. Initially, most of the ap-proaches (e.g., Generalized Sequential Pattern miner (GSP)[1]) are A priori-like [1], which utilizes the fact that anysuper-pattern of a non-frequent pattern cannot be frequent.These approaches start by constructing a frequent patternset (the appearance frequencies of the patterns included areabove a certain percentage threshold) containing only oneevent, then they grow those patterns pass by pass, with eachpass appending one single event to the detected patterns fromlast pass, until no frequent patterns can be found. This type ofapproach can be computationally expensive due to the hugecandidate sequence set and multiple database scans. Manystrategies have been proposed to make FSM more efficient

and practical. For example, PrefixSpan (Prefix-projected Se-quential PAtterN mining) [8] explores prefix projection inFSM to reduce the efforts of candidate subsequence gener-ation. SPADE (Sequential PAttern Discovery using Equiv-alence classes) [17] utilizes combinatorial properties to de-compose the original problem into smaller sub-problems, thatcan be independently solved in main-memory using efficientlattice search techniques, which greatly reduces the numberof database scans. SPAM (Sequential Pattern Mining) [3] de-signs a smart bitmap based representation for those event se-quences, so that all event sequences become 0-1 strings, andall operations needed for FSM will become bitwise AND/ORoperations. This makes the FSM procedure much more ef-ficient. However, despite these improvements, there are stillsome difficulties for applying such algorithms directly to realworld data, especially when there are concurrent events ortemporal constraints. Frequence’s FSM algorithm enhancesSPAM to support these user needs.

Temporal Event Sequence VisualizationsThere has been a great body of work in the visualization com-munity designing techniques for temporal event sequences.A common approach is to place records on a series of hor-izontal timelines to show multiple records in parallel [2, 9]and support interactive search [6, 13, 16], filtering [13], andclustering [4]. Recently, LifeFlow [15] introduced a way toaggregate multiple event sequences into a tree, and Outflow[14] later designed a way to aggregate events into a graph, aswell as integrating statistics.

The visual design of Frequence’s visualization is similar toSankey [10] and alluvial [11] diagrams. However, these di-agrams focus on the flow of resources, as Sankey diagramsignore sequential ordering whereas alluvial diagrams weredesigned to show how network structures change over time.Outflow also looks visually similar to our approach, how-ever, Outflow aggregates subsequences and outcomes [14].In Frequence, each subsequence is represented as an individ-ual edge to provide an overview of all sequences and theirindividual outcomes and support, as we support user tasksto find meaningful individual sequences. Furthermore, onlyFrequence supports navigation by level-of-detail which is anovel contribution to prior approaches.

SYSTEM DESIGN AND WORKFLOWFrequence’s intelligent user interface is designed to put usersat the center of both analytics and visualization. In linewith many advanced visual interfaces, Frequence supports thecommon browsing and searching strategy of the Visual Infor-mation Seeking Mantra, which suggests to support Overviewfirst, Zoom and Filter, then Details-on-Demand [12]. Whenusers begin their exploration, as illustrated in Figure 6, theyare first shown frequent patterns at a coarse level-of-detail togive them an overview. If they discover an pattern of interest,they can select the pattern and then zoom and filter to see theunderlying patterns at a finer level-of-detail. Whenever userswant more information about a particular pattern, at any level-of-detail, they can select it to get details-on-demand, whichinclude statistics about the pattern’s frequency, outcome, andthe events composed within.

However, parallel to browsing and searching, Frequence alsosupports interaction with the analytics to support a refinementstrategy, where users can adapt the mining algorithm basedupon their data-driven hypotheses to generate more meaning-ful patterns to explore. By interactively refining parametersof the mining algorithm, such as temporal context, outcomemeasures, and other temporal constraints, users can be em-powered to make sure they are reaching meaningful results.

In the next section, we explain the novel frequent pattern min-ing algorithms necessary to support this iterative workflow.

MINING FREQUENT PATTERNSFrequence’s frequent pattern mining algorithm is based on theSPAM algorithm [3], with several enhancements. In order tounderstand the enhancements, we first provide an overviewof the SPAM algorithm.

The goal of Frequent Sequence Mining (FSM) is to mine fre-quent subsequences from a set of event sequences. Formally,event sequences and subsequences are defined as follows:Definition 1. An event sequence θ = 〈θ1, θ2, · · · , θm〉(θi ∈D) is an ordered list of events, whereD is the event dictionaryand θi happened no later than θi+1.Definition 2. A sequence τ = 〈τ1, τ2, · · · , τmτ 〉 is said to bea subsequence of another sequence θ = 〈θ1, θ2, · · · , θmθ 〉,denoted by τ ⊆ θ if ∃ i1, i2, · · · , im such that 1 6 i1 6 i2 6· · · 6 im 6 mθ and τ1 = θi1 , τ2 = θi2 , · · · , τmτ = θim .

Support is the measure that determines whether a subse-quence is frequent or not:Definition 3. Given a set of event sequences S ={θ1,θ2, · · · ,θn}, the support of a sequence τ is the percent-age of the sequences in S which have τ as a subsequence.

With the above definition, we say a subsequence (or temporalpattern) is frequent for sequence set S if its appearance per-centage in S is above a certain pre-specified support value.

As described in our overview of related work, most FSM al-gorithms adopt a pattern growing strategy, which is illustratedin Figure 1. Initially, the algorithms start with an empty fre-quent pattern set F = {}, then each single event in the eventdictionary D is checked, and the events that are frequent areadded toF . These patterns are referred to as frequent patternsof length 1, as the length of a pattern is defined as the num-ber of events it contains. The pattern can be grown with twotypes of extensions: an S-extension and an I-extension [3]:

[ ]

A

A à A A à B A,B

AàAàA AàAàB AàA,B AàBàA AàBàB A,BàA A,BàB

Level

0

1

2

3

S-Extension I-Extension

Figure 1. A graphical illustration of the pattern growing procedure.

1 A B D B C D B C D

A B C

B C D

B

A B

23

1 1 0 1

0 1 1 1

0 1 1 1

0 1 0 0

1 1 1 0

1 1 0 0

0 1 1 1

B C DA

Figure 2. An example of the bitmap representation of the sequences,where there are four different events and three sequences.

Definition 4. An S-extension of a sequence τ =〈τ1, τ2, · · · , τmτ 〉 with event θ, denoted by τ → θ, is to ap-pend θ at the end of τ so that θ happens after τmτ .Definition 5. An I-extension of a sequence τ =〈τ1, τ2, · · · , τmτ 〉 with event θ, denoted by τ , θ, is to appendθ at the end of τ so that θ and τmτ happen concurrently.

After a pattern growing stage, all candidate patterns of length2 are obtained and the frequent ones are added toF . Then thealgorithm grows each pattern of length 2 via an S-extensionand an I-extension to get the frequent patterns of length 3,and the pattern growing procedure repeats until no more fre-quent patterns are found. If one considers the complete pat-tern growing process as a tree (see Figure 1), then the proce-dure can be seen as a breadth-first search on that tree, whichchecks the tree level-by-level. A depth-first search strategyhas also been proposed [3], which checks the tree branch bybranch, and can be a more efficient recursion procedure forlong sequences.

SPAM is able to be computationally efficient by using abitmap representation for event sequences [3], which is il-lustrated in Figure 2. In this representation, a sequence θ isrepresented as a |θ|×|D| bitmap B(θ), where |θ| is the lengthof θ and |D| is the size ofD, such that B(θ)(i, j) = 1 if the j-th event in D happened at the i-th timestamp of θ, otherwiseB(θ)(i, j) = 0. With this representation, both S-extensionsand I-extensions can be performed with a bitwise AND op-eration. Figure 3 shows how to extend event A with eventB for the example in Figure 2. For an S-extension, the firstappearance of A in each sequence is detected, so the corre-sponding bit value is changed to 0 and all the following bitsto 1 (denoted as T (A)), and then an AND-operation is ex-ecuted between the bitmap vectors of T (A) and B. For anI-extension, an AND-operation is directly executed betweenthe bitmap vectors of A and B. After the extension, the valueof 1 will exist in each sequence if the extended pattern existsin those sequences. Therefore, the appearance frequency ofthe extended pattern is simply the number of sequences thathave a value of 1.

Temporal ContextAlthough SPAM uses a smart bitmap representation to makethe FSM procedure efficient, it does not take into considera-tion the temporal context of detected patterns. This is prob-lematic for real-world applications, as patterns lasting years

1

0

0

0

1

1

0

A

0

1

1

0

0

0

1

T(A)

1

1

1

1

1

1

1

B

0

1

1

0

0

0

1

AàB

à & =

1

0

0

0

1

1

0

A

1

1

1

1

1

1

1

B

1

0

0

0

1

1

0

A,B

à =

S-Extension I-Extension

Figure 3. An example of S-extension and I-extension of event A withevent B for the example sequences in Figure 2.

and other patterns lasting days may have completely differentmeanings in different contexts, but SPAM would treat themequivalently.

In order to make SPAM capable of detecting temporal pat-terns within a temporal context (that is, within the thresh-old of a user-defined duration), we propose the following en-hancement: Suppose that there is an explicit timestamp asso-ciated with every event in every sequence, then when a can-didate pattern appears in a specific sequence, the pattern mustnot only appear, but also its duration in the sequence shouldbe below the threshold. Specifically, we define the patternduration in a sequence as:Definition 6. Let τ = 〈τ1, τ2, · · · , τmτ 〉 be a candidate pat-tern of sequence θ = 〈θ1, θ2, · · · , θmθ 〉, where each event θiis associated with timestamp ti, and (i1, i2, · · · , im) be a setof indices of θ satisfying 1 6 i1 6 i2 6 · · · 6 im 6 mθ andτ1 = θi1 , τ2 = θi2 , · · · , τmτ = θim , then the duration of τ inθ is min(i1,i2,··· ,im) (tim − ti1).

Algorithm 1 summarizes the our enhanced SPAM approach,which we call Time-Aware SPAM. Note that in step 3 and step8 of the Pattern Grow subroutine, when checking the ap-pearance frequencies of τS/τ I , there is a requirement thatpatterns should both appear in the counted sequence and theirduration should be shorter than L.

ConcurrencyA bottleneck for Time-Aware SPAM occurs when events hap-pen at the same time, as this greatly increases the search treesize in the pattern growing procedure. This is particularlyproblematic as concurrent events are often common in real-world applications. For example, the finest resolution of tem-poral data in many Electronic Health Records systems is aday, and during a day, multiple medical events may occurto a patient (e.g. a patient often gets multiple lab tests dur-ing a single lab visit). For other datasets, even when there isextreme temporal precision to data (e.g. millisecond preci-sion), the analyst may not be concerned about the exact orderof events, as long as they occurred within a domain-relevanttime window.

To alleviate this problem, we pre-process event sequencesprior to Time-Aware SPAM to contain pattern explosion. Asthere can be many Concurrent Event Sets (CESs) contained in

Algorithm 1 Time-Aware SPAMRequire: Sequence set S = {θ1,θ2, · · · ,θn}, Support

value α, Pattern duration L1. Construct bitmap representations for all sequences in S2. Set the output frequent pattern set F = {}3. Count all the single event frequencies, and add the oneswhose frequencies above α to F4. For every event θ ∈ F5. Do Pattern Grow(θ,F ,F , L)6. Output F

Pattern Grow(τ ,Sn, In, L)1. St = {}, It = {}2. For every θ ∈ Sn3. if τS=S extension(τ , θ) is frequent w.r.t. L4. St = St ∪ θ, add τS to output pattern set5. For every θ ∈ St6. Pattern Grow(τS ,St,St � θ)7. For every θ ∈ In8. if τ I=I extension(τ , θ) is frequent w.r.t. L9. It = It ∪ θ, , add τ I to output pattern set

10. For every θ ∈ It11. Pattern Grow(τ I ,St, It � θ)

event sequences, the first step is to detect the frequent EventPackages (EPs) that are frequent subsets of CESs. If we treateach CES in every event sequence as a transaction, then theproblem of detecting EPs is equivalent to the problem of fre-quent item set mining [1], and each detected EP can be usedas a super event. Then, a greedy approach is applied based onTwo-Way Sorting to break down each CES as a combinationof regular and super events, such that the number of eventscontained in each CES is greatly reduced.

To better explain the process of breaking down CESs, weprovide the following example: Suppose there exists a CESABCDE that needs to be broken down using the detected EPs(shown in the center of Figure 4). The algorithm then sorts thepackages according to the two-way sorting strategy as shownin Figure 4 – that is, the EPs are first sorted according to theircardinalities. Then, for packages with the same cardinality,they are sorted with respect to their appearance frequency. Tobreakdown ABCDE, the algorithm first finds the longest eventpackages that are subsets. In this case, ABC and ACE are thelongest packages which are subsets of ABCDE. Then, becauseABC occurs more frequently than ACE, ABC is selected as asuper event contained in ABCDE. Besides ABC, the rest of theevents are DE. Then the procedure is applied again to breakdown ABCDE as ABC,D, E. Using this technique, there areonly 3 super events in ABCDE after the break-down proce-dure. The full details of the algorithm are described in Algo-rithm 2.

Level of DetailIn real-world data sets, temporal events are often recorded ata very specific level-of-detail to retain maximum informationabout an event. However, applying many FSM techniques(e.g. [17, 3, 8]) or Time-Aware SPAM to data with a large

A B C A C E

A B A C B C A E C E C D

C A B D E

Appearance FrequencyC

ardi

nalit

yHigh Low

Long

Short

Figure 4. A graphical illustration of how the two-way sorting procedureworks. We first sort the mined event packages according to their car-dinalities, and then for the packages with the same cardinality, we sortthem according to their frequencies.

Algorithm 2 Breaking Down a CESRequire: An CES S to be broken down into frequent Event

Packages (EPs).1: Sort the detected EPs into buckets according to their car-

dinalities (number of events contained), such that thepackages within the same bucket have the same cardi-nality.

2: Sort the packages within the same bucket with their ap-pearance frequencies in the patient traces.

3: O = ∅4: for Every bucket B do5: if length(B) <length(S) then6: for Every EP E in B do7: if E is a subset of S then8: Add E to O, Set S = S\E9: if S == ∅ then

10: Return O11: else12: Return to Step 413: end if14: end if15: end for16: end if17: end for

dictionary of event types will often suffer from computationalcomplexity. Technical complexity aside, another fundamentalproblem is patterns extracted from a fine levels-of-detail mayimpair gaining an interpretable overview from the results. Inorder for mining to be technically feasible and usable, Fre-quence is able to mine data on multiple levels-of-detail.

For each level-of-detail available, Frequence runs its miningalgorithm on event sequences using the event dictionary avail-able for each level-of-detail. Typically, levels-of-detail arehierarchical and thus the coarse level will likely have a smallevent dictionary. In this case, special techniques for handlingconcurrency or collapsing event sets may not be required toremain performant. However, for finer levels-of-detail, largedata sets may require such features to be computationally fea-sible. That said, in practice, if an overview of coarse patternsare shown first, a coarse pattern of interest may be selectedto focus on, and then Frequence only needs to mine from thecohort that matched the coarse pattern, which drastically re-duces the number of sequences and event types. Frequence’siterative workflow was designed to support this type of explo-ration.

OutcomeOnce a set of frequent patterns has been identified, correla-tions between the mined patterns and the event sequences’outcome measure can be identified. For example, in a med-ical informatics scenario, analysts may be curious how cer-tain patterns correlate with discrete outcome variables (e.g.,deceased vs. alive) or continuous ones (e.g., blood pressuremeasurements).

To enable outcome analysis, a Bag-of-Pattern (BoP) repre-sentation is constructed for each event sequence. Formally,given a set of n frequent patterns, the BoP representation isan n-dimensional vector, where the i-th element of that vec-tor stores the frequency of the i-th pattern within the corre-sponding event sequence. If there are m event sequences,then we construct an m × n sequence-pattern matrix X =[x1,x2, · · · ,xn] whose (j, i)-th element indicates the num-ber of times the i-th pattern appeared in the j-th event se-quence. Thus, the i-th column xj summarizes the frequencyof the i-th pattern in all m sequences. We can also constructan m-dimensional sequence outcome vector y, such that yjis the outcome of the j-th event sequence. For example, witha binary outcome, yi ∈ {+1,−1}, +1 represents a positiveoutcome whereas -1 represents a negative outcome. Giventhis formulation, statistics are then computed to measure thecorrelation between each xi and y to measure the informa-tiveness of the i-th pattern in terms of predicting a sequence’soutcome. Frequence has implemented a variety of statisticalmeasures including Pearson correlation, P-value, informationgain, and odds ratio.

FREQUENCE’S VISUAL USER INTERFACEFrequence is a web-based interface that features a visualiza-tion panel in the center of the interface to navigate the visu-alization, as well as a panel on the left side of the interface tocontrol the interactive analytics.

Interactive VisualizationFrequence mines sequential knowledge temporal from eventsequences so analysts can gain insights. However, the quan-tity of patterns discovered is often too large for users to makesense of them. The goal of Frequence is not only to mine thepatterns but also to present the data in a user-centric way sothat the patterns mined can be utilized in real-world settings.Information visualization is an effective way of communicat-ing complex data, and thus the key part of the Frequence UIis a flow visualization.

We describe the characteristics of our visualization using Fig-ure 5 as an illustrative example. In this example, the patternsare sequences of places that users have traveled, and each userhas an outcome measure of popularity. This type of data willbe subsequentially described in more detail as a use-case.

Events in the frequent sequences are represented as nodes,and event nodes that belong to the same sequence are con-nected by edges. The nodes and edges are positioned us-ing a modified Sankey diagram layout [10]. However, whileSankey layouts aggregate common subsequences into a sin-gle edge, such an aggregation in Frequence would makeit impossible to reach insights about individual patterns.

Temporal Event Sequence Outcome SupportArts & Entertainment→Food→Travel & Transport Unpopular 0.25Arts & Entertainment→Food→Nightlife Spot Popular 0.50Shop & Service→Food→Nightlife Spot Popular 0.50Food→Outdoors & Recreation Neutral 0.25

Arts & Entertainment

Shop & Service

Food Outdoors & Recreation

Nightlife Spot

Travel & Transport

Food

Figure 5. An illustrative example of Frequence’s visual encoding for aset of frequent patterns. Patterns are represented by a sequence of nodes(events) connected by edges (event subsequences). Patterns are coloredaccording to their correlation with users’ outcomes.

Thus, in Frequence, subsequences are represented as in-dividual edges. For instance, the simple pattern Food →Outdoors&Recreation, is visualized as a Food node con-nected to a Outdoors&Recreation node, as shown at the bot-tom of Figure 5. Patterns that share similar subsequences,such as Arts&Entertainment→ Food→ NightlifeSpot andArts&Entertainment→ Food→ Travel&Transport, involvetwo edges from Arts&Entertainment to Food representingeach subsequence. Thus, prominent subsequences also be-come visually prominent due to the thickness of the combinedmultiple edges.

Of course, not all event subsequences are equal as some corre-late to a positive outcome, whereas others correlate to a neg-ative outcome, as determined by Frequence’s outcome ana-lytics. The visualization uses color to encode each pattern’sassociation with an associated outcome. For this scenario, thepatterns that occur more often with popular users (those whohave a lot of followers on Twitter) are more blue. The patternsthat occur more often with unpopular users (those who havefew followers) are more red. The neutral patterns that appearcommon to both popular and unpopular users are gray.

The frequency of each pattern also varies, and the visualiza-tion uses the thickness of edges to encode how often eachpattern occurs using the pattern’s support, as defined in Defi-nition 3. For instance, in the example shown in Figure 5, thepattern Arts&Entertainment→ Food→ Travel&Transporthas a support of 0.25, which is why its thickness is halfthe size of Arts&Entertainment→ Food→ NightlifeSpot,which has a support of 0.5.

Users can interact with the visualization by moving theirmouseover an edge, which will display statistics, such as sup-port and outcome, about the pattern the edge represents. Ifusers click a node, all patterns that do not match the selectednode have their opacity diminished so users can focus on pat-terns that match the user’s specification. Users can specify afrequent pattern of interest by clicking on multiple nodes un-til their pattern is fully selected. Once a pattern is selected,users can use this selection to zoom and filter. Frequence sup-ports two ways to zoom and filter. The first way is by cohort,which constrains the dataset to the population that has the se-lected pattern, and shows the most frequent patterns among

that population. The second way is by pattern, which usesthe whole population, but only mines events that feature thesubtypes of the selected pattern. This latter option is usefulfor navigating hierarchical levels-of-detail.

Interactive AnalyticsIn addition to exploring patterns using the visualization, userscan also adapt and modify the mining algorithm according totheir needs by using the control panel located on the left-sideof the interface.

By default, the Statistics section is selected to show a statis-tical overview of various aspects of the dataset and patterns.Users can navigate to the Pattern Definition section to refinethe patterns being mined. Here, users can adjust the supportvalue for patterns, to focus the analytics on either strongeror weaker patterns. By selecting the Level of Detail panel,users can change the level-of-detail of the displayed patterns,and choose if filtering is by cohort or by pattern. The Out-come panel allows users to select an outcome variable, aswell as a threshold on that variable to segment the popula-tion. Here, users can also interactively filter the visualizationusing a range slider to focus on positive or negative patterns.The Temporal Context panel allows users to turn on temporalconstraints, such as concurrency. This panel also features aslider to specify the time duration for events to be consideredconcurrent. The Database panel allows users choose fromwhere to load their data.

USE CASE: LOCATION SHARING SOCIAL NETWORKSSome social network researchers are curious about the rela-tionship between people’s online social media personas andtheir offline activities. While mobility data has been difficultto uncover in the past, this is changing due to the proliferationof Location Sharing Services (LSS), where people broadcastthe places they visit. One such LSS is Foursquare, a servicethat allows users to ”check-in” at different venues to save andshare their locations to friends. According to Foursquare1,there are over 40 million users and over 4.5 billion check-inswith Foursquare, as of September 2013. A corpus of a sub-set of Foursquare data was made publicly available[5], whichcontains over 200,000 users and over 22 million check-ins,and which was imported into Frequence.

In this use case, the researcher was interested in understand-ing the behavior of users in New York City (NYC), so theanalysis described here is limited to check-ins within the bor-oughs of Manhattan, Queens, and Brooklyn. This was doneby filtering all check-ins to fall within a bounding box of lat-itude and longitude coordinates, surrounding the city. Afterthe geographical filtering, the size of the NYC-constraineddatabase consisted of 17,739 users with 419,023 check-insin 17,182 unique venues over 11 months. Each uniquevenue was augmented with Foursquare’s hierarchy of 9 top-level categories and 289 sub-level categories, retrieved usingFoursquare’s Venue Search API2 to create three levels-of-dataon which to mine.

1http://www.foursquare.com/about2http://developer.foursquare.com/docs/venues/search

Food

Outdoors & Recreation

Food

Outdoors & Recreation


Food

Nightlife Spot


Food

Travel & Transport


Food


Food

Travel & Transport


Food

Nightlife Spot

The goal of the analysis was to understand if certain patternsof mobility correlate with popularity to procure evidence fortheir hypothesis that people who are popular in social mediago to different places in different sequences than people whowere less popular. In order to gain a proxy for social popular-ity, users’ Foursquare accounts are unified with their Twitteraccounts, and the number of Twitter followers was used asa metric of popularity. All Foursquare users in the databasehad an active Twitter account, and used Twitter to broadcasttheir check-ins. The analyst was only interested in users whowere active users of Twitter, so users that had less than 1000tweets in their lifetime were filtered out3. The median num-ber of Twitter followers for a user the dataset was 265, so theanalyst used the Outcome panel to consider patterns by usersgreater than 265 to be popular, and less than 265 to be un-popular. The analyst also used the Temporal Context panelto constrain event sequences within a 6-hour time window sothat only events that occured within that same window wereconsidered a part of the same event sequence. Each of theseconstraints were determined by rapidly testing multiple hy-potheses, and the constraints evolved from original hypothe-ses to different sensible parameters based upon data explo-ration.

With the parameters tuned from interactive exploration, anoverview of patterns at the coarsest level-of-detail is shownat the top of Figure 6. Interestingly, many of the patterns as-sociated with popular social media users (the blue patterns)contain a Shop&Service event. The analyst was also in-terested in the unpopular patterns, and selected one of thered patterns for further inspection. After selecting the redProfessional&OtherPlaces → Food pattern, the researcherused it as a cohort filter to show patterns only from peoplewho have this pattern. Again, the analyst was interested inunderstanding patterns among less popular users, and to focuson these patterns, the analyst filtered to only the red patterns,shown in the middle of Figure 6. The analyst determines thatOffice→ Hotel correlates with the least popular social me-dia users, and uses this pattern as a cohort filter to mine onlyusers who have this pattern. The bottom of Figure 6 showsthe patterns at the finest level of detail, where the names of theactual venues are now visible to the researcher. No specificoffices are frequent enough to appear in the patterns, due tothe countless companies spread throughout the city, but cer-tain hotels (e.g. the W Hotel in Times Square) do appear andcorrelate to unpopular users, whereas most popular patternsare those tied with Shopping (e.g. Macy’s, H&M). The re-searcher remarked the potential applications of these findingsto marketing professionals, who wish to understand who fre-quents their venues, and their ability to communicate to theirindividual social media audiences.

While the researcher acknowledges these insights only applyto a specific demographic that uses Foursquare and broadcaststheir check-ins on Twitter, this case study demonstrates theutility of Frequence for exploring temporal patterns of humanmobility and reaching insights.

3On average, there were 2144 tweets per user in the NYC dataset,before filtering.

USE CASE: MEDICAL INFORMATICSDue to a large number of medical institutions adopting Elec-tronic Health Records (EHRs), the opportunities to analyzeand derive insights from medical data has never been greater.In particular, clinicians and clinical researchers are very inter-ested in understanding patterns of medical events that oftenlead to positive or negative outcomes, so that they can under-stand or improve their clinical practices.

This use case involves a team of clinicians and clinical re-searchers interested in determining if there are particular pat-terns that lead to patients with lung disease developing sepsis,a potentially deadly medical condition. The institution used aset of 2,336 patients diagnosed with lung disease, each withlongitudinal events of diagnostic codes (which encode symp-toms, causes, and signs of diseases using ICD-9 standards 4).ICD-9 codes are organized according to a meaningful hierar-chy and this hierarchy was utilized by Frequence as multiplelevels of detail.

Of the patients with lung disease, 483 developed sepsis withinsix months of their diagnosis of lung disease, whereas 1,853managed to not contract the condition. Prior to using Fre-quence, the clinical researchers had difficulty drawing anyconclusions between these two cohorts, as both types of pa-tients tend to share many of the same diagnosis codes. How-ever, the researchers were interested in mining for frequentpatterns to see if the order of these diagnoses has any effecton their patients.

At the top of Figure 7, the coarsest patterns for all of the lungdisease patients are shown. The clinician was particularly in-terested in cardiovascular complications, and noticed that thepattern CardiacDisorders → SymptomDisorders was com-mon yet neutral (that is, this pattern was common to pa-tients who did and did not end up contracting sepsis). Af-ter selecting this pattern and filtering by cohort to see thematching patients, the finer level of detail (Level 1) allowedthe clinician to see more detailed cardiac conditions, suchas cardiac dysrhythmia and heart failure. Other complica-tions, such as acute renal failure (which medical literaturesuggests is linked to developing sepsis), also appear. How-ever, the clinician is interested in the patterns that led to pa-tients not developing sepsis, and filtered to the positive pat-terns in the middle of Figure 7. Surprised to see the patternHeartFailure → LungDiseases, the clinician filtered to thecohort that matched this pattern and pivoted to Level 2, asshown in the bottom of Figure 7. The clinician immediatelynoticed that patterns that featured both Atrial Fibrillation andAcute Respiratory Failure are red, which is sensible, as med-ical literature suggests both are risk factors for sepsis. How-ever, the clinician found it interesting that patterns beginningwith Acute Respiratory Failure alone were not predictive ofsepsis, but rather what happened next in the sequence wasmore predictive. From the Acute Respiratory Failure node inthe first column of the visualization, the patterns diverge intored and blue, making it clear that what happens immediatelyafter such Acute Respiratory Failure will likely determine ifthe patient will get sepsis or not.

4http://www.who.int/classifications/icd/

Shop & Service

Professional & Other Places

Food

Office

Hotel

Cardiac Disorders

Symptom Disorders

Heart Failure

Lung Diseases

CONCLUSION AND FUTURE WORKThis paper presents Frequence, a novel visual interface de-signed to mine and visualize frequent event sequences fromtemporal events. Frequence includes a novel FSM algorithmto handle real-world data requirements to serve actionable in-sights. Frequence also supports iterative and flexible explo-ration with a visual interface where users can explore resultsand refine parameters of the mining algorithm.

We demonstrate Frequence’s utility in two real-world usecases: a medical researcher using frequent sequences in Elec-tronic Medical Records to understand the relationship be-tween the progression of a disease and patient outcomes, anda social network researcher using frequent sequences from aLocation-based Sharing Service to understand the relation-ship between online popularity and offline mobility through-out urban environments.

While these use cases show that Frequence can lead to in-sights, many topics remain for future work. While Fre-quence supports interactive parameterization of the miningalgorithm, choosing the right parameters might be a chal-lenge for some users. A next goal of Frequence is to aug-ment the user interface with visual feedback to act as a pre-view of how the parameters will affect mining. An additionalitem for future work reflects the scalability of our frequentsequence mining algorithm. Although our bitmap-based ap-proach is computationally efficient, as data sets get larger andlarger, new approaches will be necessary for Frequence to re-main scalable. We plan to investigate applying our algorithmto a cloud-based distributed architecture, as the hierarchicalaspects of our mining algorithm are well-suited for such anenvironment. Finally, a more comprehensive user evaluationis critical to understanding the relationship between analyt-ics and visualization in Frequence. We plan to deploy Fre-quence to domain partners to conduct longitudinal case stud-ies, where we capture how the capabilities of Frequence affecttheir workflow and help them reach insights.

REFERENCES1. Agrawal, R., and Srikant, R. Fast algorithms for mining

association rules in large databases. In Proceedings ofthe 20th International Conference on Very Large DataBases (1994), 487–499.

2. Aigner, W., Miksch, S., Thurnher, B., and Biffl, S.Planninglines: novel glyphs for representing temporaluncertainties and their evaluation. In InformationVisualisation, 2005. Proceedings. Ninth InternationalConference on (2005), 457–463.

3. Ayres, J., Gehrke, J. E., Yiu, T., and Flannick, J.Sequential pattern mining using bitmaps. 429–435.

4. Burch, M., Beck, F., and Diehl, S. Timeline trees:visualizing sequences of transactions in informationhierarchies. In Proceedings of the working conferenceon Advanced visual interfaces, AVI ’08, ACM (NewYork, NY, USA, 2008), 75–82.

5. Cheng, Z., Caverlee, J., Lee, K., and Sui, D. Z.Exploring millions of footprints in location sharingservices. AAAI ICWSM (2011).

6. Fails, J., Karlson, A., Shahamat, L., and Shneiderman,B. A visual interface for multivariate temporal data:Finding patterns of events across multiple histories. InVisual Analytics Science And Technology, 2006 IEEESymposium On (2006), 167–174.

7. Mitsa, T. Temporal Data Mining, 1st ed. Chapman &Hall/CRC, 2010.

8. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H.,Chen, Q., Dayal, U., and Hsu, M. Mining sequentialpatterns by pattern-growth: The prefixspan approach.IEEE Transactions on Knowledge and Data Engineering16, 11 (2004), 1424–1440.

9. Plaisant, C., Milash, B., Rose, A., Widoff, S., andShneiderman, B. Lifelines: visualizing personalhistories. In Proceedings of the SIGCHI Conference onHuman Factors in Computing Systems, CHI ’96, ACM(New York, NY, USA, 1996), 221–227.

10. Riehmann, P., Hanfler, M., and Froehlich, B. Interactivesankey diagrams. In IEEE InfoVis (2005), 233–240.

11. Rosvall, M., and Bergstrom, C. T. Mapping change inlarge networks. PLoS ONE 5, 1 (01 2010), e8694.

12. Shneiderman, B. The eyes have it: a task by data typetaxonomy for information visualizations. In VisualLanguages, 1996. Proceedings., IEEE Symposium on(1996), 336–343.

13. Wang, T. D., Plaisant, C., Quinn, A. J., Stanchak, R.,Murphy, S., and Shneiderman, B. Aligning temporaldata by sentinel events: discovering patterns inelectronic health records. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems,CHI ’08, ACM (New York, NY, USA, 2008), 457–466.

14. Wongsuphasawat, K., and Gotz, D. Exploring flow,factors, and outcomes of temporal event sequences withthe outflow visualization. In IEEE InfoVis (2012).

15. Wongsuphasawat, K., Guerra Gomez, J. A., Plaisant, C.,Wang, T. D., Taieb-Maimon, M., and Shneiderman, B.Lifeflow: visualizing an overview of event sequences. InProceedings of the SIGCHI Conference on HumanFactors in Computing Systems, CHI ’11, ACM (NewYork, NY, USA, 2011), 1747–1756.

16. Wongsuphasawat, K., and Shneiderman, B. Findingcomparable temporal categorical records: A similaritymeasure with an interactive visualization. In VisualAnalytics Science and Technology, 2009. VAST 2009.IEEE Symposium on (2009), 27–34.

17. Zaki, M. J. Spade: An efficient algorithm for miningfrequent sequences. Machine Learning 42, 1/2 (2001),31–60.

Figure 6. The exploration process of the Foursquare dataset with Frequence. The top figure shows an overviewof the coarsest patterns. The middle figure shows a filtered view of unpopular patterns at a finer level-of-detailfor the cohort who matched the Professional&OtherPlaces → Food sequence. The bottom figure showsthe patterns at the finest level of detail, where the names of the actual venues become visible, after filtering to thecohort who have the Office→ Hotel pattern.

Professional & Other Places

Food

Office

Hotel

Figure 7. The exploration process of the lung disease patient dataset with Frequence. The top figure shows anoverview of the coarsest patterns. The middle figure shows the positive patterns at a finer level-of-detail for thecohort who matched the CardiacDisorders→ SymptomDisorders sequence. The bottom figure shows thepatterns at the finest level of detail, after selecting HeartFailure→ LungDiseases.

Cardiac Disorders

Symptom Disorders

Heart Failure

Lung Diseases

Frequence: Interactive Mining and Visualization of Temporal ...of the mining algorithm, such as...

Documents

Transcript of Frequence: Interactive Mining and Visualization of Temporal ...of the mining algorithm, such as...