THIC MedIX Summer 2015 Poster

Thresholded Hierarchical Itemset Clustering for Expert Explorations

Diana Zajac, Thomas Lux, Dr. Jacob Furst, Dr. Daniela Raicu

College of Computing and Digital Media, DePaul University

Summer 2015

Introduction

Traditional Machine Learning (ML) techniques can cluster datasets, yet the clusters they produce are difficult to interpret. Noise in the data, as well as high-dimensional and complex data, can make clustering difficult and produce undesirable results. In addition, most clustering algorithms produce clusters without explaining what patterns were found between data points or which patterns the clusters were based on. In an attempt to solve the problem of clustering high-dimensional, complex, and noisy datasets while producing interpretable results, we created an interactive user interface called THIC. THIC stands for Thresholded Hierarchical Itemset Clustering, a name that describes the method by which it clusters data. What makes THIC innovative is its ability to modify the clustering algorithm with ‘expert’ feedback, where an ‘expert’ is some outside source of information that can provide intuitive guidance as to which features the algorithm should cluster upon.

Figure 1: Part of the 2012 City Livability dataset, obtained with permission of the Economist Intelligence Unit (EIU) from their collaboration with BuzzData.

As another example, a well-traveled ‘expert’ could instruct THIC to group the “most homey” cities under one cluster, the “most beautiful” cities under another, and so on. THIC will cluster the cities based on the expert’s guidance, predict which clusters the cities the expert has not yet visited may fit into, and then explain which city features are most important in determining the clusters.

Other datasets we worked with included a large text corpus, lung cancer data, and Chronic Fatigue Syndrome data.

Clustering Algorithms

K-Means:

K-Means clustering is an algorithm that makes k clusters based on the distance of each data point from the cluster centers. It begins by plotting each data point (in the case of City Livability, each city is a point) with the features as dimensions: for n features, there are n dimensions, so each point has a coordinate (x1, x2, …, xn) determined by its features. K-Means chooses initial cluster centers and then iteratively moves them until the total distance from the points to their assigned centers can no longer be reduced, leaving the clusters as well separated as possible.
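
The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation behind the poster: the function name `kmeans`, the `initial_centers` parameter, and the toy points are assumptions, and the caller supplies the starting centers because the poster does not say how they are chosen.

```python
import math


def kmeans(points, initial_centers, max_iterations=100):
    """Lloyd's algorithm. Returns (centers, labels)."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical two-feature data: three points near (0, 0), three near (10, 10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 9), (9, 10)]
centers, labels = kmeans(points, initial_centers=[(0, 0), (0, 1)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centers: k-means only finds a local optimum, so a poor start can leave it with a worse split than the one shown here.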

K-Means with Feature Selection (KMFS):

KMFS uses feature selection algorithms to aid k-means clustering. Feature selection is usually used to strip a dataset of irrelevant, corrupted, or redundant features, thereby improving any analysis based on the remaining features. KMFS selects features one by one, starting with those that create the ‘best’ (most defined and separate) clusters, and continues to add features until the clusters become ‘bad’ (overlapping and spread out). Incorporating feature selection into k-means clustering allows k-means to cluster the data and report to the user the most relevant features it used. KMFS gives the user an idea of what each cluster is based on (which features ‘trend’ in each cluster), but it describes cluster features in terms of probabilities rather than with 100% accuracy, and it provides no user control.
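
A minimal sketch of the greedy selection loop described above, assuming details the poster leaves open. The quality score (smallest distance between centers divided by the mean distance of points to their own center), the `min_quality` cutoff that decides when clusters have become ‘bad’, the farthest-first choice of starting centers, and the toy data are all assumptions made for illustration; KMFS itself may score and stop differently.

```python
import math
from itertools import combinations


def cluster_quality(rows, features, k=2, iterations=50):
    """Run k-means (k >= 2) on the chosen columns and score the clusters.

    Assumed score: smallest center-to-center distance divided by the mean
    point-to-own-center distance. Higher means tighter, better separated.
    """
    pts = [tuple(r[f] for f in features) for r in rows]
    # Assumed initialisation: farthest-first, starting from the first row.
    centers = [pts[0]]
    while len(centers) < k:
        centers.append(max(pts, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in pts]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    within = sum(math.dist(p, centers[lab]) for p, lab in zip(pts, labels)) / len(pts)
    between = min(math.dist(a, b) for a, b in combinations(centers, 2))
    return between / within if within > 0 else math.inf


def select_features(rows, k=2, min_quality=10.0):
    """Add the best-scoring feature each round until quality drops below the cutoff."""
    remaining = list(range(len(rows[0])))
    selected = []
    while remaining:
        scored = [(cluster_quality(rows, selected + [f], k), f) for f in remaining]
        best_score, best_feature = max(scored)
        if best_score < min_quality:
            break  # clusters have become 'bad' by the assumed cutoff
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected


# Hypothetical data: features 0 and 2 split the rows into the same two groups,
# while feature 1 is noise.
rows = [(1, 5, 0), (2, 1, 1), (9, 4, 9), (10, 2, 10)]
print(select_features(rows))  # [2, 0]
```

On this toy data the noisy feature is never selected, which is the behaviour the description attributes to KMFS: only the features that keep the clusters well defined are reported back.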

Why THIC is better:

Expert-guided clustering

Better data interpretability

Many different possibilities (for results)

Provides a controllable tradeoff between optimal results and meaningful results

Doesn’t lose data dimensionality (no important information lost in feature selection)

THIC’s philosophy is focused on helping a user understand and explore datasets, find unseen patterns and correlations in them, and create unconventional clusterings of the data.

Group 1: High: Green Space, Sprawl
Group 2: Low: Sprawl, Culture and Environment, Infrastructure
Group 3: Low: Infrastructure; High: Green Space
Group 4: Low: Sprawl, Culture and Environment
Group 5: Low: Green Space, Sprawl
Group 6: Low: Green Space; High: Sprawl
Group 7: Low: Sprawl; High: Green Space

Datasets

The dataset below is one of the datasets we used in testing THIC. It is particularly interesting because of the opportunity for ‘expert feedback’. For example, an expert may want to know what European cities have in common: the expert would instruct THIC to group the European cities under one cluster, and THIC would produce results explaining which features all European cities share.
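
One way to picture the European-cities example is with thresholded items. The sketch below is an assumption about how such an explanation could be produced, not THIC’s actual code: each feature value is turned into a ‘High’ or ‘Low’ item using two hypothetical cutoffs, the items shared by every city in the expert’s group are reported, and other cities carrying the same items are listed as candidates for that group. The city names and scores are invented; only the feature names come from the poster’s group legend.

```python
def to_items(values, low_cut, high_cut):
    """Turn {feature: score} into a set of (level, feature) items."""
    items = set()
    for feature, value in values.items():
        if value <= low_cut:
            items.add(("Low", feature))
        elif value >= high_cut:
            items.add(("High", feature))
    return items


def shared_items(group, cities, low_cut=3, high_cut=7):
    """Items that every city in the expert-chosen (non-empty) group has in common."""
    item_sets = [to_items(cities[name], low_cut, high_cut) for name in group]
    return set.intersection(*item_sets)


def matching_cities(items, cities, low_cut=3, high_cut=7):
    """Cities whose own items include every item in `items`."""
    return sorted(
        name
        for name, values in cities.items()
        if items <= to_items(values, low_cut, high_cut)
    )


# Hypothetical scores on a 0-10 scale; not taken from the City Livability data.
cities = {
    "City A": {"Green Space": 8, "Sprawl": 2, "Infrastructure": 9},
    "City B": {"Green Space": 9, "Sprawl": 1, "Infrastructure": 5},
    "City C": {"Green Space": 7, "Sprawl": 3, "Infrastructure": 2},
    "City D": {"Green Space": 2, "Sprawl": 8, "Infrastructure": 6},
}
common = shared_items(["City A", "City B"], cities)
print(sorted(common))                   # [('High', 'Green Space'), ('Low', 'Sprawl')]
print(matching_cities(common, cities))  # ['City A', 'City B', 'City C']
```

Here the expert groups City A and City B, the shared items explain the group, and City C is suggested as a further member because it carries the same items.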

THIC

THIC is an interactive interface that allows users to import a numerical dataset and cluster the data based on their own preferences, such as:

Which features should be included or excluded

Which features should be given higher priority (more weight)

Sizes of groups

Whether to make subgroups

Number of groups

How groups are defined using features

The balance between optimal clustering and clustering that is meaningful to the user

Acknowledgments

Dr. Jacob Furst, PhD, 1998, professor, DePaul University, CDM

Dr. Daniela Raicu, PhD, 2002, professor, DePaul University, CDM

College of Computing and Digital Media, DePaul University

Science Research Fellows, DePauw University

Future Work

Although we completed THIC’s preliminary phase, there is still much to improve. The current THIC implementation focuses on single-item itemsets, because increasing itemset size increases the computation time and the amount of overlap between groups. Another interest is developing better ‘stopping criteria’ for the algorithm, which at the moment are based on group overlap and minimal coverage. With better stopping criteria, expanding to multi-item itemsets would be more feasible without contradicting the philosophy of THIC.
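
To make these terms concrete, the sketch below builds single-item itemsets and measures coverage and overlap for them. The poster does not define these quantities, so the definitions here are assumptions: a case carries the item (‘High’, feature) or (‘Low’, feature) when its value crosses a hypothetical cutoff, coverage is the fraction of cases that land in at least one group, and overlap is the fraction of group pairs that share a case. The data is invented.

```python
from itertools import combinations


def single_item_groups(cases, low_cut, high_cut):
    """Map each single item, e.g. ("High", "Sprawl"), to the cases that carry it."""
    groups = {}
    for name, values in cases.items():
        for feature, value in values.items():
            if value <= low_cut:
                groups.setdefault(("Low", feature), set()).add(name)
            elif value >= high_cut:
                groups.setdefault(("High", feature), set()).add(name)
    return groups


def coverage(groups, cases):
    """Fraction of cases that fall into at least one group (assumed definition)."""
    covered = set().union(*groups.values())
    return len(covered) / len(cases)


def overlap(groups):
    """Fraction of group pairs that share at least one case (assumed definition)."""
    pairs = list(combinations(groups.values(), 2))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a & b) / len(pairs)


# Hypothetical scores on a 0-10 scale.
cases = {
    "A": {"Green Space": 8, "Sprawl": 2},
    "B": {"Green Space": 9, "Sprawl": 8},
    "C": {"Green Space": 5, "Sprawl": 1},
    "D": {"Green Space": 5, "Sprawl": 5},
}
groups = single_item_groups(cases, low_cut=3, high_cut=7)
print(coverage(groups, cases))     # 0.75
print(round(overlap(groups), 2))   # 0.67
```

Case D crosses neither cutoff and so stays uncovered, while A and B each sit in two groups. A stopping rule of the kind mentioned above would weigh these two numbers against each other.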

When completed, THIC will be able to provide meaningful information in multiple domains, including but not limited to economics, medical sciences, and statistical analysis.

THIC produces diverse results depending on the preferences the user sets. The focus of THIC is therefore not necessarily the ‘best’ clusters or groupings, but rather results that aid in understanding a dataset, such as:

Finding patterns that may not be evident without THIC (due to the size or complexity of the dataset)

Producing results by defining ‘known’ clusters and matching the rest of the cases to those

Describing relationships between different features, as well as between different cases (in City Livability, the cases are the cities and the features are qualities such as pollution and quality of education)