Construction and Mining of Heterogeneous Information...

Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to

Web-Aged Information Management and Mining

Jiawei Han Department of Computer Science

University of Illinois at Urbana-Champaign Collaborated with many students, especially Yizhou Sun, Ming Ji, Chi Wang, Ahmed El-Kishky, Bo Zhao

Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), ARO, Microsoft, IBM, Yahoo!, LinkedIn, Google & Boeing

July 3, 2014

Outline Why Heterogeneous Information Networks?

On the Power of Mining Structured Heterogeneous Info. Networks

Clustering & Classification: A Ranking-Based Approach

Similarity Search & Meta Path in Heterogeneous Info. Net

Prediction & Recommendation: A Meta Path-Based Approach

From Unstructured Data to Structured Networks

From Unstructured Text to Structured Hierarchies

Type Extraction and Role Discovery: Enrich Network Structures

Truth Analysis: Enhance Quality of Structured Network

Looking Forward to the Future

Where There Is Information, There Are Networks!

Social Networking Websites Biological Network: Protein Interaction

Research Collaboration Network Product Recommendation Network via Emails 3

The Real World: Heterogeneous Networks Multiple object types and/or multiple link types

Venue Paper Author DBLP Bibliographic Network The IMDB Movie Network

Actor Movie

Director

Movie Studio

Homogeneous networks are information loss projection of heterogeneous networks!

The Facebook Network

Directly mining information-richer heterogeneous networks 4

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining! DBLP: A Computer Science bibliographic database

Knowledge hidden in DBLP Network Mining Functions How are CS research areas structured? Clustering

Who are the leading researchers on Web search? Ranking

What are the most essential terms, venues, authors in AI? Classification + Ranking

Who are the peer researchers of Jure Leskovec? Similarity Search

Whom will Christos Faloutsos collaborate with? Relationship Prediction

Which types of relationships are most influential for an author to decide her topics?

Relation Strength Learning

How was the field of Data Mining emerged or evolving? Network Evolution

Which authors are rather different from his/her peers in IR? Outlier/anomaly detection

A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …

Power of het. network modeling: Treat Author, Venue, Term & Paper all first-class citizens!

Basic functions for networks: Clustering, classification & ranking Why not integrate ranking with clustering & classification?

High-ranked objects should be more important (thus more weighted) in a cluster/class than low-ranked ones

Ranking will make more sense within a particular cluster/class Einstein in physics vs. Turing in Coputer Science

Integrate ranking with clustering/classification in the same process Ranking, as the feature, is conditional (i.e., relative) to a specific

cluster/class Ranking and clustering/classification mutually enhance each other Ranking-based clustering: RankClus [EDBT’09], NetClus [KDD’09],

PathSelClus [KDD’12] Ranking-based classification: GNetMine (for classification)

[ECMLPKDD’10], RankClass [KDD’11]

Ranking-Based Clustering & Classification

Initialization Randomly partition

Repeat Ranking

Ranking objects in each sub-network induced from each cluster

Sub-Network Ranking

Clustering

Generating new measure space Estimate mixture model coefficients

for each target object Adjusting cluster

Until stable

RankClus: Integrating Clustering with Ranking

An E-M framework for iterative enhancement

Difference on ranking rules: 1. Simple ranking rule: Ranking high if more papers 2. Highly ranked if publishing more in highly ranked-

venues

Step-by-Step Running of RankClus

Initially, ranking distributions are mixed together

Two clusters of objects mixed together, but

preserve similarity somehow

Improved a little Two clusters are

almost well separated

Improved significantly

Stable

Well separated

Clustering and ranking two fields: DB/DM & HW/CA

NetClus: Ranking-Based Clustering with Star Network Schema

Beyond bi-typed network: capture more semantics with multiple types Split a network into multi-subnets, each a net-cluster [KDD’09]

Research Paper

AuthorVenue

Publish Write

Contain

……

AVNetClus

Computer Science

Database

Hardware

Theory

DBLP network: using terms, venues & authors to jointly distinguish a sub-field, e.g., database

NetClus: Database System Cluster

database 0.0995511 databases 0.0708818

system 0.0678563 data 0.0214893

query 0.0133316 systems 0.0110413 queries 0.0090603

management 0.00850744 object 0.00837766

relational 0.0081175 processing 0.00745875

based 0.00736599 distributed 0.0068367

xml 0.00664958 oriented 0.00589557 design 0.00527672 web 0.00509167

information 0.0050518 model 0.00499396

efficient 0.00465707

Surajit Chaudhuri 0.00678065 Michael Stonebraker 0.00616469

Michael J. Carey 0.00545769 C. Mohan 0.00528346

David J. DeWitt 0.00491615 Hector Garcia-Molina 0.00453497

H. V. Jagadish 0.00434289 David B. Lomet 0.00397865

Raghu Ramakrishnan 0.0039278 Philip A. Bernstein 0.00376314

Joseph M. Hellerstein 0.00372064 Jeffrey F. Naughton 0.00363698 Yannis E. Ioannidis 0.00359853

Jennifer Widom 0.00351929 Per-Ake Larson 0.00334911 Rakesh Agrawal 0.00328274

Dan Suciu 0.00309047 Michael J. Franklin 0.00304099 Umeshwar Dayal 0.00290143

Abraham Silberschatz 0.00278185

VLDB 0.318495 SIGMOD Conf. 0.313903

ICDE 0.188746 PODS 0.107943 EDBT 0.0436849

One-level deeper: Authors in XML, Xquery cluster

Term Venue Author

Rank-Based Clustering: Works in Multiple Domains

RankCompete: Organize your photo album automatically!

MedRank: Rank treatments for AIDS from Medline

Use multi-typed image features to build up heterog. networks

Explore multiple types in a star schema network

Classification: Knowledge Propagation Across Heterogeneous Typed Networks

Labeled knowledge propagates through multi-typed objects across heterogeneous networks [ECMLPKDD'10, KDD’11]

RankClass: Ranking-Based Classification Class: a group of multi-typed nodes sharing the same topic + within-class ranking

node class

The RankClass Framework

Methodology of RankClass Classification in heterogeneous networks

Knowledge propagation: Class label knowledge propagated across multi-typed objects through a heterogeneous network

GNetMine [Ji et al., PKDD’10]: Objects are treated equally

RankClass [Ji et al., KDD’11]: Ranking-based classification

Highly ranked objects will play more role in classification

An object can only be ranked high in some focused classes

Class membership and ranking are stat. distributions

Let ranking and classification mutually enhance each other!

Output: Classification results + ranking list of objects within each class

Experiments on DBLP Class: Four research areas (communities)

Database, data mining, AI, information retrieval Four types of objects

Paper (14376), Conf. (20), Author (14475), Term (8920) Three types of relations

Paper-conf., paper-author, paper-term Algorithms for comparison

Learning with Local and Global Consistency (LLGC) [Zhou et al. NIPS 2003] – also the homogeneous version of our method

Weighted-vote Relational Neighbor classifier (wvRN) [Macskassy et al. JMLR 2007]

Network-only Link-based Classification (nLB) [Lu et al. ICML 2003, Macskassy et al. JMLR 2007]

Comparing Classification Accuracy on the DBLP Data

Object Ranking in Each Class: Experiment

DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network Rank objects within each class (with extremely limited label information) Obtain high classification accuracy and excellent ranking within each class

Database Data Mining AI IR

Top-5 ranked conferences

VLDB KDD IJCAI SIGIR

SIGMOD SDM AAAI ECIR

ICDE ICDM ICML CIKM

PODS PKDD CVPR WWW

EDBT PAKDD ECML WSDM

Top-5 ranked terms

data mining learning retrieval

database data knowledge information

query clustering reasoning web

system classification logic search

xml frequent cognition text

Similarity Search: Find Similar Objects in Networks

Who are most similar to Christos Faloutsos? Meta-Path: Meta-level description of a path between

two objects

Christos’s students or close collaborators Similar reputation at similar venues

Meta-Path: Author-Paper-Author (APA) Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)

Schema of the DBLP Network

Different meta-paths lead to very different results!

Different meta-paths carry rather different semantics

PathSim: A Het. Network Similarity Measure

Popular object similarity measures in networks Random walk (RW) (or Personalized PageRank): Favors highly

visible objects (i.e., objects with large degrees) Pairwise random walk (PRW) (or SimRank): Favors “pure”

objects (i.e., objects with highly skewed distribution in their in-links or out-links)

PathSim [VLDB’11] Favor “peers”: objects with strong connectivity and similar

visibility under the given meta-path

Note: P-PageRank and SimRank do not distinguish object type nor relationship type

Which Similarity Measure Is Better?

Anhai Doan CS, Wisconsin Database area PhD: 2002

Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)

• Jignesh Patel

• CS, Wisconsin

• Database area

• PhD: 1998

• Amol Deshpande

• CS, Maryland

• Database area

• PhD: 2004

• Jun Yang

• CS, Duke

• Database area

• PhD: 2001

Meta path-guided prediction of links and relationships

Philosophy: Meta path relationships among similar typed links share similar semantics and are comparable and inferable

paper topic

author

publish publish-1

mention-1

mention write write-1

contain/contain-1 cite/cite-1

Bibliographic network: Co-author prediction (A—P—A) Use topological features encoded by meta paths, e.g.,

citation relations between authors (A—P→P—A)

PathPredict: Meta-Path Based Relationship Prediction

Meta-Path Based Co-authorship Prediction

Co-authorship prediction problem Whether two authors are going to collaborate for the first time

Co-authorship encoded in meta-path Author-Paper-Author

Topological features encoded in meta-paths

Meta-paths between authors under length 4

Meta-Path Semantic Meaning

The Power of PathPredict: Experiments

Explain the prediction power of each meta-path Wald Test for logistic

regression Higher prediction accuracy

than using projected homogeneous network 11% higher in

prediction accuracy

Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Feature collected in [1996, 2002]; Test period in [2003,2009])

Enhancing the Power of Recommender Systems by Heterog. Info. Network Analysis

Avatar Titanic Aliens Revolutionary Road

James Cameron

Kate Winslet

Leonardo Dicaprio

Zoe Saldana

Adventure Romance

Collaborative filtering methods suffer from the data sparsity issue

# of users or items

A small set of users & items have a large number of ratings

Most users and items have a small number of ratings

Heterogeneous relationships complement each other Users and items with limited feedback can be connected to the

network by different types of paths Connect new users or items in the information network

Different users may require different models: Relationship heterogeneity makes personalized recommendation models easier to define

Personalized recommendation with heterog. Networks [WSDM’14]

Performance Study: Personalized Recommendation in Heterogeneous Networks

Winner: HeteRec personalized recommendation (HeteRec-p)

Datasets:

Methods to be compared: Popularity: Recommend the most popular items to users Co-click: Conditional probabilities between items NMF: Non-negative matrix factorization on user feedback Hybrid-SVM: Use Rank-SVM with plain features (utilize both user

feedback and information network)

Automated Construction of Heterogeneous Info. Networks

Rich semantics can be mined from structured heterog. networks

Unfortunately, much of the real world data is unstructured

News, Wikipedia, blogs, tweets, multimedia data, …

Too expensive to be structured by human: Automated & scalable

How to turn unstructured data to structured heterog. Networks?

Methodology: Incrementally & progressively explore links and structures to construct quality heterogeneous networks

Kert [Danilevsky et al, SDM’14]: LDA + Keyphrase extraction and ranking by topic (4 criteria: coverage, purity, phraseness, completeness)

CATHTY [Wang, et al., KDD’13]: Recursive generation of topic hierarchy based on term co-occurrence network

CATHYHIN [Wang, et al., ICDM’13]: Use hetero. network to help topic hierarchy generation

TopMine [El-Kishky et al., 2014]: Exploring frequent contiguous pattern mining

1. Hierarchical topic discovery with entities

2. Phrase mining

3. Rank phrases & entities per topic

Output hierarchy w/ phrases & entities

Input collection

text o

o/1/1 o/1/2

Entity

Hierarchy so generated can be used for network mining, exploration, etc.

Our recent effort:

Automatic Discovery of Clusters and Term Hierarchies from Collection of Documents

KERT: Keyphrase Extraction and Ranking by Topic [SDM’14] 1. Use background LDA (bLDA) to cluster unigrams into T topics + one background topic 2. Extract candidate keyphrases for each foreground topic according to the word topic assignment — should be order-free and non-consecutive

E.g.,‘mining frequent patterns’ vs.‘frequent pattern mining’ E.g., ‘mining top-k frequent closed patterns’

3. Rank the keyphrases in each foreground topic Coverage: information retrieval’ vs. ‘cross-language information

retrieval Purity: only frequent in documents about topic t Phraseness: ‘active learning’ vs.‘learning classification’ Completeness: ‘vector machine’ vs. ‘support vector machine’

Experiment: Machine Learning: Top Keyphrases

– Dataset: DBLP paper titles – KERT variations: ignore each criterion in turn – Baselines: two variations of an approach from a recent work

CATHYHIN: Topic Hierarchy Construction by Integration of Heterogeneous Info. Networks

Using DBLP heterog. Info. network to enhance topical hierarchy generation

Hierarchies generated not only on topical phrases but also on authors & venues

TopMine: Phrase Mining Then Top Modeling

Frequent Contiguous Phrase Mining Frequency: A low frequency phrase is likely unimportant --- a

generalization of the most likely unigrams in LDA Collocation: The co-occurrence of tokens in such frequency

that is significantly higher than expected due to chance. Completeness: Do not double count sub-phrases (e.g. support

vector is not a phrase when inside support vector machine) General frequent pattern mining may place words far apart in the

same phrase The pattern for a phrase: each phrase is a contiguous sequence of

ordered words. Markov Blanket Feature Selection for Support Vector Machines

Good Phrases Should Be Stat. Significant

Under the assumption that the text data was generated by a sequence of Bernoulli trials, we can derive a significance measure for potential mergings.

Good Phrases!

Bag-of-Phrase Construction: An Example

The ToPMine Framework

1. Preprocessing: Perform stemming to reduce vocabulary size and increase collisions of phrases, split on phrase-invariant punctuation

2. Frequent Contiguous Pattern Mining: Collects aggregate counts and candidate phrases

3. Phrase Construction: Agglomerative construction of significant phrases (addresses frequency, collocation, and completeness)

4. Use the phrase construction of “bag of phrases”: This can be considered an enriched input for later algorithms (not exclusive to topic modeling)

5. Bag-of-phrases Topic Modeling

Experiments on AP News

Experiments on Yelp Reviews

Scalability: Runtime Results

Role Discovery: Mining Advisor-Advisee Relationships in DBLP Network

Propagation of simple, commonly accepted constraints in Time-Constrained Probabilistic Factor Graph (TPFG) “Advisor has more publications and longer history than advisee at the time

of advising” “Once an advisee becomes advisor, s/he will not become advisee again”

Smithth

22000000000000000000

Ada Bob

Input: Temporalcollaborationnetwork

Output: Relationshipanalysis

(0.8, [1999,2000])

(0.7,[2000, 2001])

(0.65, [2002, 2004])

(0.2,[2001, 2003])

(0.5, [/, 2000])

(0.9, [/, 1998])

(0.4,[/, 1998])

(0.49,[/, 1999])

Visualizedchorological hierarchies

Role Discovery: Performance & Case Study

Datasets RULE SVM IndMAX TPFG

TEST1 69.9% 73.4% 75.2% 78.9% 80.2% 84.4%

TEST2 69.8% 74.6% 74.6% 79.0% 81.5% 84.3%

TEST3 80.6% 86.7% 83.1% 90.9% 88.8% 91.3%

Empirical parameter

optimized parameter

heuristics Supervised learning

DBLP data: 654, 628 authors, 1076,946 publications, years provided Labeled data: MathGealogy Project; AI Gealogy Project; Homepage

Advisee Top Ranked Advisor Time Note

David M. Blei 1. Michael I. Jordan 01-03 PhD advisor, 2004 grad

2. John D. Lafferty 05-06 Postdoc, 2006

Hong Cheng 1. Qiang Yang 02-03 MS advisor, 2003

2. Jiawei Han 04-08 PhD advisor, 2008

Sergey Brin 1. Rajeev Motawani 97-98 “Unofficial advisor”

Case study

Hierarchical Relationship Discovery From partially ordered objects to hierarchy (tree) Based on NLP or other techniques to extract

partially ordered objects Using constraints to discover relationships

Type Cognitive description Potential definition Homophile Parent and child are similar Polarity Parent is superior to child Support pattern

Patterns frequently occurring with child-parent pairs

Forbidden pattern

Patterns rarely occurring with child-parent pairs

Singleton Potential

Type Cognitive description Potential definition Attribute augment

Use inherited attributes from parents or children

Label propagate

Similar nodes share similar parents (or children)

Reciprocity Patterns altering in child-parent & parent-child pairs

Constraints Restrict certain patterns

Pairwise Potential Function: Cases

Discovery of the Kenny Family Tree

Truth Analysis: Enhancing the Quality of Heterogeneous Info. Networks

Info. networks could be untrustworthy, error-prone, missing, … TruthFinder [KDD’07]: Inference on trustworthiness by

constructing heterogeneous info. networks

Web sites (Book Sellers)

Facts (Book Authors)

Objects (Books)

True facts and trustable websites mutually enhance each other and will become apparent after some iterations

Truth Discovery: Multiple Truth Value and Handling False Negatives

• Voting may not always work well: Some sources tend to miss true values (False Negatives), while some others tend to produce false claims (False Positives)

• Why Latent Truth Model (LTM)? Modeling two-sided quality to support multiple true values per entity for truth finding [VLDB’12]

Negative Claim

Positive Claim

Generating Implicit Negative Claims:

Harry Potter

Netflix

BadSource

Correct Claim

Incorrect Claim

High Precision, High Recall

High Precision, Low Recall

Low Precision, Low Recall

Model source quality in other data integration tasks, e.g. entity resolution. Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.)

Truth Discovery: Effectiveness of Latent Truth Model

Experimental datasets: Large and real Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled) Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled) Effectiveness of Latent Truth Model:

On-Going and Future Work From small, structured data to real, big, unstructured data

From the DBLP network to networks constructed from news, PubMed, tweets, …

Construction of quality, semi-structured heterogeneous networks from massive, real data sets

Integration information network with data/text cube and OLAP technologies Data/text cubes have been shown its power in multi-

dimensional OLAP analysis, e.g., EventCube

Integration of heterogeneous network mining ⇒ OLAP mining on multi-dimensional social and information networks

Network exploration of mission- or query-based hidden networks

From Data Mining to Mining Info. Networks

Han, Kamber and Pei, Data Mining, 3rd ed. 2011

Yu, Han and Faloutsos (eds.), Link Mining, 2010

Sun and Han, Mining Heterogeneous Information Networks, 2012

Conclusions Heterogeneous information networks are ubiquitous

Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks

Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, … Surprisingly rich knowledge can be mined from structured

heterogeneous info. networks Clustering, ranking, classification, path prediction, ……

Knowledge is power, but knowledge is hidden in massive, but “relatively structured” nodes and links!

Key issue: Construction of trusted, semi-structured heterogeneous networks from unstructured data

From data to knowledge: Much more to be explored but heterogeneous network mining has shown high promise!

References M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information

Networks", KDD'11. Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies,

Morgan & Claypool Publishers, 2012 Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous

Information Network Analysis", EDBT’09 Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with

Star Network Schema", KDD’09 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in

Heterogeneous Information Networks”, VLDB'11 Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in

Heterogeneous Bibliographic Networks", ASONAM'11 Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, “When Will It Happen? Relationship Prediction in

Heterogeneous Information Networks”, WSDM'12 F. Tao, et al., “EventCube: Multi-Dimensional Search and Mining of Structured and Text Data”,

(system demo) KDD’13 C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication

Networks", KDD'10 C. Wang, M. Danilevsky, et al., “A Phrase Mining Framework for Recursive Construction of a

Topical Hierarchy”, KDD’13 C. Wang, Marina Danilevsky, et al., "Constructing Topical Hierarchies in Heterogeneous

Information Networks", ICDM'13 54

Construction and Mining of Heterogeneous Information...

Documents

Transcript of Construction and Mining of Heterogeneous Information...

Reading titanic

Titanic Crew - TeachingCave.com€¦ · · 2016-07-06John Pitman !!!!! Titanic Crew! Captain Edward John Smith ! ... Titanic Crew! Titanic Crew! Sixth Officer James Paul Moody !

TITANIC - Wormstedtwormstedt.com/Titanic/TITANIC-FIRE-AND-ICE-Article.pdf · 4 TITANIC: FIRE & ICE (OR WHAT YOU WILL) INTRODUCTION 5 their memories will continue to live on untarnished

Presentación1 titanic

Titanic - Титаник

Titanic ppoint

Eating Disorders. Distorting Reality Kate Winslet GQ United Kingdom.

Rosalie Ham’s Jocelyn Moorhouse Kate Winslet, Judy Davis ...carltonrotary.org.au/images/2015movienightposter.pdf · Starring Kate Winslet, Judy Davis, Liam Hemsworth, and Hugo Weaving

The Titanic - crowlees.co.uk€¦ · Titanic left Southampton on its first journey. Titanic hit an iceberg. Titanic was built in Belfast. 4. Fill in the missing word. Titanic was

Titanic Infographic - Titanic Facts | History of The · PDF fileTitle: Titanic Infographic Author: Subject: Titanic infographic, making it easy for you to visualize facts about the

“ TITANIC ”

Titanic Collection

Measurement & Statistics RMS TITANIC - Mangham Mathmanghammath.com/Chapter Packets/Chapter 9 Titanic data.pdf · Measurement & Statistics RMS TITANIC Royal Mail Steamer Titanic Information

Match The Face with the TITS. Cameron DiazAngelina JolieKate Winslet.

Names and Faces in the Newsgrauman/courses/spring...Kate Winslet Sam Mendes British director Sam Mendes and his part-ner actress Kate Winslet arrive at the London premiere of ’The

The Titanic

Charlotte Real Estate For Sale: 10432 Winslet Drive Charlotte NC 28277

Titanic History

Titanic Powerpoint

Who built the Titanic? Where was the Titanic built?