Evolving web, evolving search

Evolving Web, Evolving Search. Yuan Tian & Tianqi Chen, Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University

Transcript of Evolving web, evolving search

Page 1: Evolving web, evolving search

Evolving Web, Evolving Search

Yuan Tian & Tianqi Chen
Apex Data & Knowledge Management Lab
Shanghai Jiao Tong University

Page 2: Evolving web, evolving search

Agenda

• Introduction to SJTU
• Introduction to Apex Lab
• Research
• Demo

Page 4: Evolving web, evolving search

Shanghai Jiao Tong University

• Location
• History
• Student
• Campus

Page 5: Evolving web, evolving search

Agenda

• Introduction to SJTU
• Introduction to Apex Lab
• Research
• Demo

Page 6: Evolving web, evolving search

Apex Lab

Director: Professor Yong Yu

Associate Professor: Guirong Xue

Page 7: Evolving web, evolving search

Apex Lab

Research
• Web Search
• Social Web
• Semantic Search
• Machine Learning
• Image Search

Page 8: Evolving web, evolving search

Apex Lab

Project Partners

Page 9: Evolving web, evolving search

Apex Lab

Ph.D. Students
• Haofen Wang
• Jing Lu
• Jia Chen
• Guangcan Liu
• Xian Wu
• Yunbo Cao
• Ruihua Song

35 Master Students

Page 10: Evolving web, evolving search

Agenda

• Introduction to SJTU
• Introduction to Apex Lab
• Research
• Demo

Page 11: Evolving web, evolving search

Research

• Traditional Web
• Social Web
• Semantic Web
• Machine Learning

Page 13: Evolving web, evolving search

Search on Traditional Web

Focus on how to
• improve search relevance?
• rank pages?
• integrate mining technologies into search?
• search finer-grained objects instead of documents?

Search Applications
• General search engine
• Vertical search engine
• Meta search engine

Page 14: Evolving web, evolving search

Expert Search

Page 15: Evolving web, evolving search

Expert Search (introduction)

• Treat web page as bag of words
• Queries are not fully understood

Page 16: Evolving web, evolving search

Expert Search (motivation)

Searching for experts:
• An increasingly important information need
• PM searches for Dev
• Patient searches for Doctor
• Student searches for Professor
• ……
• Not only in the enterprise
• But also on the WWW

Page 17: Evolving web, evolving search

Query -> Ranked List of Experts

An evidence: an expert and a query co-occur in a document under a certain relation constraint
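One way to read the evidence definition on this slide is as a counting rule: each document in which an expert's name and all query terms appear close together contributes one piece of evidence, and experts are ranked by how much evidence they collect. The Python sketch below follows that reading. The token window stands in for the unspecified "relation constraint", and the function name `rank_experts`, the tokenizer, the expert names and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_experts(documents, experts, query, window=10):
    """Rank experts by the number of documents holding co-occurrence evidence.

    Evidence (assumed definition): the expert's name and every query term
    occur within `window` tokens of each other in the same document.
    """
    query_terms = tokenize(query)
    scores = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for expert in experts:
            name = tokenize(expert)
            # Positions where the full expert name starts.
            starts = [i for i in range(len(tokens) - len(name) + 1)
                      if tokens[i:i + len(name)] == name]
            for start in starts:
                lo = max(0, start - window)
                hi = start + len(name) + window
                nearby = set(tokens[lo:hi])
                if all(term in nearby for term in query_terms):
                    scores[expert] += 1
                    break  # count each document at most once per expert
    return [expert for expert, _ in scores.most_common()]


# Hypothetical documents and expert names, for illustration only.
docs = [
    "Alice Wang published a survey on semantic search engines.",
    "Bob Li maintains the build system. Semantic search is not his area.",
    "A tutorial on semantic search was given by Alice Wang last year.",
]
ranking = rank_experts(docs, ["Alice Wang", "Bob Li"], "semantic search", window=8)
```

The second document shows why plain co-occurrence is weak evidence: it still counts for "Bob Li" although the sentence says the opposite, which is the kind of case a stricter relation constraint is meant to filter out.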

Page 18: Evolving web, evolving search

Research

• Traditional Web
• Social Web
• Semantic Web
• Machine Learning

Page 19: Evolving web, evolving search

The Emergence of Web 2.0: Web gets social

Web 1.0 -> Web 2.0

Publishing -> Participation

Personal Websites -> Blogging

Content Management Systems -> Wikis

Britannica Online -> Wikipedia

Directories (taxonomy) -> Tagging ("folksonomy")

Lower the barrier for contribution. More people are involved. They are less professional.

Page 20: Evolving web, evolving search

Search on Web 2.0

Focus on how to
• exploit user-contributed data?
• search on new social media?

Page 21: Evolving web, evolving search

Deegle

(WWW 2006, WWW 2007, SIGIR 2008)

Page 22: Evolving web, evolving search

[Screenshot with labeled panels: Related facets, Related tags, Search results, Related users]

Page 23: Evolving web, evolving search

Emotion Analysis on the Blog

• A blog can be a source of news, but also a stage for expressing emotion
• Enhancing the blog search for different users

Page 24: Evolving web, evolving search

Blog Search

Informative article
• News that is similar to the news on traditional news websites
• Technical descriptions, e.g. programming techniques
• Commonsense knowledge
• Objective comments on the events in the world

Affective article
• Diaries about personal affairs
• Self-feelings or self-emotions descriptions

Page 25: Evolving web, evolving search

Two types of blog

Page 26: Evolving web, evolving search

Intent-driven blog search (WWW 2007)

Rank | Informative Sense | Snippet
1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …
2 | -0.94 | Crazy Me! I have hesitated between Acer and smuggled IBM for one week. I wouldn’t have taken into account the price, quality or service if I had enough money …
3 | 1.00 | Selling IBM laptop, t22p3-900, dvd S3/, independent accelerating display card. 3550 YUAN. (Post fee not included). Please contact 30316255. We guarantee the quality. This product is only sold within Tianjing city ...
4 | -0.35 | I got a laptop from my friend this week. Although outdated, it is still a classical one in IBM enthusiast’s mind. There are many second hand IBM laptops in the market. Although I have sold many IBM laptops …
5 | -0.53 | Doctor said that I should make more preparations mentally. You have stayed with me for three years, leaving without any words. Do you feel fair for me? Do you remember the moments we were together? You are heartless, I hate you! ...

Page 27: Evolving web, evolving search

Rank | Informative Sense | Snippet
1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …
2 | 1.00 | Biao Lin is a military talent. Stalin called him “the gifted general”. Americans called him “the unbeaten general”. Chiang Kai-shek called him “devil of war”. Biao Lin is a special person in modern history …
3 | 0.99 | Microsoft’s hotmail can only be registered with suffix “@hotmail.com” by default. You can register @msn.com by visiting…
4 | 0.95 | Yi Shang is still sending the file to me. I will practice it later. 1. Start up Instance (db2inst1) db2start; 2. Stop Instance (db2inst1) db2stop …
5 | 0.84 | Name: Lei Zhang. Student number: 5030309959. Class number: 007. The analysis and review about the tendency of Jilin Chemical Industry’ stock in 2005. Date, Increasing and Decreasing ranges, Open Price, Close Price, Amount of deals …
6 | 0.01 | Recently I like reading the Buddhist Scripture. I can learn philosophies in it. It makes me comfortable. It is from ...
7 | -0.11 | It’s out of my mind when I first saw it. The water seemed to be exuding from the building. There was much water on the floor of education building. Water was all around us, anywhere you can touch had water. …
8 | -0.51 | I read an article about the last emperor Po-yee today. I have watched “The Last Emperor” before, which realistically described his life without losing artistry. His love impressed me. As an emperor, he can’t choose the one he loved …
9 | -0.53 | She is 164 in height with white skin, black hair and long limp leg. I like the girl who has long hair and likes sport and dancing. I like sweet girls. …
10 | -0.94 | I have many things to do at the end of this semester. There are five final examinations, Discrete Mathematics, Communication Theory, Architecture of Computer, Algorithm and Law. I know little about them. OMG! Only four weeks are left. There are also two projects, Compiler and Operation System. Complier can be easily completed but Operation System …
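The snippets on these two slides each carry an informative sense score between -1 and 1, with values near 1 on informative posts and values near -1 on affective ones. A minimal sketch of how such a score could reorder results according to the user's intent is given below. The function name `rerank_by_intent` and the two intent labels are hypothetical, and the slides do not show how the score itself is computed.

```python
def rerank_by_intent(results, intent):
    """Order (score, snippet) pairs by the user's assumed intent.

    intent == "informative": highest informative sense first.
    intent == "affective":   lowest informative sense first.
    """
    if intent not in ("informative", "affective"):
        raise ValueError("intent must be 'informative' or 'affective'")
    reverse = intent == "informative"
    return sorted(results, key=lambda pair: pair[0], reverse=reverse)


# Scores and abbreviated snippets taken from the Page 26 slide.
results = [
    (1.00, "The catalogue of IBM certification ..."),
    (-0.94, "Crazy Me! I have hesitated between Acer and ..."),
    (-0.35, "I got a laptop from my friend this week ..."),
]
informative_first = rerank_by_intent(results, "informative")
affective_first = rerank_by_intent(results, "affective")
```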

Page 28: Evolving web, evolving search

Research

• Traditional Web
• Social Web
• Semantic Web
• Machine Learning

Page 29: Evolving web, evolving search

Our Vision of Semantic Web Search

• It covers most of the important topics in SW
• A lot of tools are built in each layer
• 10+ top papers (WWW’09, SIGMOD’09, SIGMOD’08, VLDB’07, ICDE’09, ISWC’07, etc)

Page 30: Evolving web, evolving search

Knowledge Engineering Layer

Ontology Engineering
• Orient: Integrating Ontology Engineering into Industry Tooling Environment (ISWC 2004)

Ontology Learning & Population
• EachWiki: Facilitating Semantics Reuse for Wikipedia Authoring (ISWC/ASWC 2007)
• PORE: Semi-supervised Positive Only Relation Extraction from Wikipedia (ISWC/ASWC 2007)
• HS Explorer: Unsupervised Hierarchical Semantics Explorer for Social Annotations (ISWC/ASWC 2007)
• Catriple: Extracting Triples from Wikipedia Categories (ASWC 2008)

Page 31: Evolving web, evolving search
Page 32: Evolving web, evolving search

Indexing and Search Layer

Ontology Query Engine based on DBMS
• SOR: A Practical System for OWL Ontology Storage, Reasoning and Search (VLDB 2007, SIGMOD 2008)

Annotation-based Semantic Search Engine (DB + IR)
• CE2: Towards Large Scale Annotation-based Semantic Search (CIKM 2008)

An Extension to IR index for Relational Search
• Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data (ISWC/ASWC 2007, ASWC 2008, WWW 2009, JWS)

Pattern-based RDF Store

Page 33: Evolving web, evolving search

SOR

Semantic Object Repository

• Based on IBM DB2
• Supports T-Box reasoning

Page 34: Evolving web, evolving search

Semplore

Extension to traditional IR engine

Ranking is considered

Page 35: Evolving web, evolving search

CE^2

Search over semantically annotated corpus

Combination of DB and IR search engines

Page 36: Evolving web, evolving search

Pattern-based RDF store

• Learning to materialize join results
• Efficient retrieval of pattern matches
• Reasonable extra space -> Significant performance increase (on some datasets)
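To illustrate the space-for-time trade-off named on this slide, the sketch below precomputes (materializes) the join for one two-step path pattern over a tiny triple set, so that later lookups for the pattern need no join at query time. It is an illustration only: the pattern, the predicate names, the sample triples and the dictionary layout are assumptions, and the learning step that decides which patterns are worth materializing is not shown.

```python
from collections import defaultdict


def materialize_path(triples, p1, p2):
    """Precompute matches of the pattern (?x p1 ?y) . (?y p2 ?z).

    Returns a dict mapping each ?x to the sorted list of ?z values it reaches,
    trading extra storage for faster pattern retrieval.
    """
    by_subject = defaultdict(list)  # ?y -> [?z] for predicate p2
    for s, p, o in triples:
        if p == p2:
            by_subject[s].append(o)

    joined = defaultdict(set)
    for s, p, o in triples:
        if p == p1:
            joined[s].update(by_subject.get(o, []))
    return {x: sorted(zs) for x, zs in joined.items() if zs}


# Hypothetical triples (subject, predicate, object).
triples = [
    ("paper1", "author", "alice"),
    ("paper2", "author", "bob"),
    ("alice", "affiliation", "SJTU"),
    ("bob", "affiliation", "SJTU"),
    ("bob", "affiliation", "ApexLab"),
]
author_affiliation = materialize_path(triples, "author", "affiliation")
```

After this step, answering "which affiliations are behind paper2" is a single dictionary lookup, at the cost of storing one extra entry per matching subject.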

Page 37: Evolving web, evolving search

Query Interface and User Interaction Layer

Keyword Interface for Semantic Search
• Q2Semantic: Lightweight Ontology based Keyword Interpretation for Semantic Search (ESWC 2008, ICDE 2009)

Natural Language Interface for Semantic Search
• PANTO: A Portable Natural Language Interface to Ontologies (ESWC 2007)

Snippet Generation
• Snippet Generation for Semantic Web Search Engines (ASWC 2008)

Ontology Presentation
• ZoomRDF: Semantic-driven Fisheye Zooming for RDF Data (WWW 2010)

Page 38: Evolving web, evolving search

Q2Semantic

Structured queries vs. keyword queries

Structural data

Page 39: Evolving web, evolving search

RDF Snippet

Representation of search results

How will you know which answers are most relevant?

Page 40: Evolving web, evolving search

ZoomRDF

Page 41: Evolving web, evolving search

Research

• Traditional Web
• Social Web
• Semantic Web
• Machine Learning

Page 42: Evolving web, evolving search

Agenda

• Introduction to SJTU
• Introduction to Apex Lab
• Research
• Demo

Page 43: Evolving web, evolving search

How to make them as a whole?

We focused on Semantic Web search
• Closed corpus / one single data source involved
• Scales only to millions of triples
• Uncertainty is not fully considered or used

We need Semantic Web search, however
• More than 11 million data sources (Web heterogeneity)
• More than 2 billion triples (Scalability)
• Uncertainty everywhere

Thus, we carefully consider the following topics
• Pay as you go for semantic data integration
• Semantic search engine towards billion triples
• User-friendly query interface for Semantic Web

Slide callouts: "Missing", "Let’s Forget"

Page 44: Evolving web, evolving search

Hermes (2nd place Billion Triple Challenge, SIGMOD 2009, JWS)

1. Integrate and index data sources
2. Understand user’s need
3. Search and refine

[Architecture figure. User steps: input keywords (example: “Article Stanford Turing Award”), select a query, refine or navigate. Result panel: Results (Rudi Studer, Semantic Web...) and Suggestions (Affiliations...). Components: Distributed Query Processing, Schema-level Mapping, Data-level Mapping, Graph Data Processing, Keyword Translation, Element Label Extraction, Keyword Mapping, Top-k Query Graph Search, Local Query Processing, Query Graph Decomposition, Query Planning & Optimization, Result Combination & Ranking, Data Graph Summarization, Graph Element Scoring, Mapping Discovery. Internal indices: Keyword Index, Schema Index, Mapping Index, Structure Index.]

Page 45: Evolving web, evolving search

Heterogeneous Transfer Learning

Machine Learning Team
APEXLAB
Shanghai Jiao Tong University

Page 46: Evolving web, evolving search

Machine Learning Team in APEX

Focus on machine learning and its application in Web mining and IR.
• Transfer learning
• Advertising techniques in Web
• Short text classification & clustering
• Multilingual search result integration

Page 47: Evolving web, evolving search

Outline

• Introduction to heterogeneous transfer learning
• Cross media: Text -> Image
  • Clustering
  • Classification
• Cross language: English -> Chinese
• Application: Visual Contextual Advertising

Page 49: Evolving web, evolving search

Traditional machine learning

Training data and test data come from the same distribution.

Training data: news
Test data

Page 50: Evolving web, evolving search

Transfer learning

Transfer learning: distributions are not identical.

Training data: news
Test data

Page 51: Evolving web, evolving search

Heterogeneous Transfer Learning

Learning across different feature spaces.

A fixed-wing aircraft, typically called an airplane, aeroplane or simply plane, is an aircraft capable of flight using forward motion that generates lift as the wing moves through the air…

An automobile, motor car or car is a wheeled motor vehicle used for transporting passengers, which also carries its own engine or motor...

Training data: Text Do…
Test data

Page 52: Evolving web, evolving search

Related Areas of Heterogeneous Learning

Multiple Domain Data
• Feature space among domains: Heterogeneous
  • Instance correspondences among domains: each instance in one domain has its correspondences in other domains -> Multi-view Learning
  • Instance correspondences among domains: there are few or no instance correspondences among domains -> Heterogeneous Transfer Learning
• Feature space among domains: Homogeneous
  • Data distribution among domains: Different -> Transfer Learning across Different Distributions
  • Data distribution among domains: Same -> Traditional Machine Learning

[Figure labels: Source Domain, Target Domain. Example text: "Apple is a fruit that can be found …", "Banana is the common name for…"]
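The taxonomy on this slide can be read as a small decision procedure: first ask whether the domains share a feature space, then ask either about the data distribution (same feature space) or about instance correspondences (different feature spaces). The sketch below encodes that reading; the function and parameter names are hypothetical.

```python
def learning_setting(same_feature_space, same_distribution=False,
                     instance_correspondence=False):
    """Name the learning setting for data drawn from multiple domains."""
    if same_feature_space:
        # Homogeneous feature spaces: the data distribution decides.
        if same_distribution:
            return "Traditional Machine Learning"
        return "Transfer Learning across Different Distributions"
    # Heterogeneous feature spaces: instance correspondences decide.
    if instance_correspondence:
        return "Multi-view Learning"
    return "Heterogeneous Transfer Learning"


# Text documents for training, images for testing, no paired instances.
text_to_image = learning_setting(same_feature_space=False,
                                 instance_correspondence=False)
```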

Page 57: Evolving web, evolving search

Outline

• Introduction to heterogeneous transfer learning
• Cross media: Text -> Image
  • Classification
  • Clustering
• Cross language: English -> Chinese
• Application: Visual Contextual Advertising

Page 58: Evolving web, evolving search

Text to Images [Dai et al. NIPS 2008] [Lin et al. APWeb 2010]

• Mining and learning from multimedia data is becoming increasingly important
• Limited by scarce labeled image data, can we use abundant text data in the Web?
• Our answer is YES

Page 59: Evolving web, evolving search

Objective

[Figure: diagrams with Input, Learning and Output blocks, including one labeled translating learning; the remaining labels are not recoverable]

Page 60: Evolving web, evolving search

Basic Ideas

Exploiting co-occurrence data as a bridge between text and image

Page 61: Evolving web, evolving search

Data Sets

• Documents from ODP
• Images from Caltech-256

Page 62: Evolving web, evolving search

Experimental Result

Page 63: Evolving web, evolving search

Approach 2: Naïve Bayes Way [Lin et al. APWeb 2010]

[Figure labels: P(w|c), P(v|w)]
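A sketch of how the two probabilities on this slide could be chained, assuming c is a class, w a text word and v an image feature, and assuming v is conditionally independent of c given w, so that P(v|c) = Σ_w P(v|w) P(w|c). An image, treated as a bag of image features, is then classified with Naïve Bayes. The symbol meanings, the toy probability tables and the function names are assumptions made for illustration and are not taken from [Lin et al. APWeb 2010].

```python
import math


def bridge(p_w_given_c, p_v_given_w):
    """P(v|c) = sum_w P(v|w) * P(w|c), using words as the bridge."""
    p_v_given_c = {}
    for c, word_probs in p_w_given_c.items():
        feats = {}
        for w, p_wc in word_probs.items():
            for v, p_vw in p_v_given_w[w].items():
                feats[v] = feats.get(v, 0.0) + p_vw * p_wc
        p_v_given_c[c] = feats
    return p_v_given_c


def classify(image_features, p_c, p_v_given_c, eps=1e-9):
    """Naive Bayes over image features; eps guards against log(0)."""
    def log_posterior(c):
        return math.log(p_c[c]) + sum(
            math.log(p_v_given_c[c].get(v, 0.0) + eps) for v in image_features)
    return max(p_c, key=log_posterior)


# Toy tables (made up): P(w|c) from labeled text, P(v|w) from co-occurrence data.
p_w_given_c = {"plane": {"wing": 0.8, "wheel": 0.2},
               "car":   {"wing": 0.1, "wheel": 0.9}}
p_v_given_w = {"wing":  {"v_sky": 0.7, "v_road": 0.3},
               "wheel": {"v_sky": 0.2, "v_road": 0.8}}
p_c = {"plane": 0.5, "car": 0.5}

p_v_given_c = bridge(p_w_given_c, p_v_given_w)
label = classify(["v_sky", "v_sky", "v_road"], p_c, p_v_given_c)
```

With these toy numbers, P(v_sky|plane) = 0.7·0.8 + 0.2·0.2 = 0.60 and P(v_sky|car) = 0.25, so an image dominated by the sky feature is assigned to the plane class even though no labeled image was used.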

Page 64: Evolving web, evolving search

Text-aided Image Classification (TAIC)

Page 65: Evolving web, evolving search

Experiments: TAIC

Data sets: 9 binary classification data sets and 5 six-class classification data sets
• Image data from Caltech-256 and Fifteen scene
• Auxiliary text data from Open Directory Project

Baseline methods
• Base classifiers: Naïve Bayes (NBC) and support vector machine (SVM)

Page 66: Evolving web, evolving search

Evaluation 1: Classification

 | Heterogeneous TL | No-Heterogeneous TL
Average Error Rate | 0.318 | 0.334

4.8% error reduction
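The quoted reduction appears to be relative to the baseline error rather than an absolute difference. A quick check under that assumption (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, improved):
    """Relative reduction of a lower-is-better metric, as a fraction."""
    return (baseline - improved) / baseline


error_reduction = relative_reduction(baseline=0.334, improved=0.318)
print(f"{error_reduction:.1%}")  # prints 4.8%
```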

Page 67: Evolving web, evolving search

Outline

• Introduction to heterogeneous transfer learning
• Cross media: Text -> Image
  • Classification
  • Clustering
• Cross language: English -> Chinese
• Application: Visual Contextual Advertising

Page 68: Evolving web, evolving search

Text-aided Image Clustering [Yang et al. ACL 2009]

• Image clustering is an effective method for increasing the accessibility of image search results

  Apple = [image] OR [image]

• But traditional clustering methods do not work well with a small amount of data
• We consider using annotated images in the social Web to help image clustering

Page 69: Evolving web, evolving search

Annotated PLSA Model for Clustering

Leveraging the auxiliary text data by using the topics as a bridge

[Figure: topics Z connect words from the auxiliary data (from Flickr) with image features from the images]

Page 70: Evolving web, evolving search

Making the transfer…

Log-likelihood objective function
• Two parts: image features and auxiliary text features
• Image feature to image instance correlation: A
• Word feature to image feature correlation: B

$$\mathcal{L} = \sum_{j} \left[ \lambda \sum_{i} \frac{A_{ij}}{\sum_{j'} A_{ij'}} \log P(f_j \mid v_i) + (1-\lambda) \sum_{l} \frac{B_{lj}}{\sum_{j'} B_{lj'}} \log P(f_j \mid w_l) \right]$$

Slide annotations: normalization (the denominators), trade-off parameter ($\lambda$), and the two likelihood terms.
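As a numerical illustration of the objective on this slide, the sketch below evaluates a weighted log-likelihood with two parts: one over image instances, weighted by the row-normalized matrix A, and one over auxiliary words, weighted by the row-normalized matrix B, mixed by a trade-off parameter. It assumes the objective has the form L = Σ_j [λ Σ_i (A_ij / Σ_j' A_ij') log P(f_j|v_i) + (1−λ) Σ_l (B_lj / Σ_j' B_lj') log P(f_j|w_l)]. The probability tables are passed in directly rather than derived from latent topics, and the toy matrices and the names `objective`, `normalize_rows` and `smooth` are made up for illustration.

```python
import math


def normalize_rows(matrix):
    """Divide each row by its sum so the row becomes a distribution."""
    return [[x / sum(row) for x in row] for row in matrix]


def smooth(matrix, alpha=0.5):
    """Add-alpha smoothing followed by row normalization (no zero entries)."""
    return normalize_rows([[x + alpha for x in row] for row in matrix])


def objective(A, B, p_f_given_v, p_f_given_w, lam):
    """Weighted log-likelihood over image features f_j.

    A[i][j]: co-occurrence of image instance v_i and image feature f_j.
    B[l][j]: co-occurrence of word w_l and image feature f_j.
    p_f_given_v[i][j] = P(f_j | v_i); p_f_given_w[l][j] = P(f_j | w_l).
    lam: trade-off between the image part and the auxiliary text part.
    """
    A_norm, B_norm = normalize_rows(A), normalize_rows(B)
    image_part = sum(A_norm[i][j] * math.log(p_f_given_v[i][j])
                     for i in range(len(A)) for j in range(len(A[0])))
    text_part = sum(B_norm[l][j] * math.log(p_f_given_w[l][j])
                    for l in range(len(B)) for j in range(len(B[0])))
    return lam * image_part + (1 - lam) * text_part


# Toy data (made up): 2 images, 2 words, 3 image features.
A = [[2, 1, 1], [0, 3, 1]]
B = [[4, 0, 4], [1, 1, 2]]
value = objective(A, B, smooth(A), smooth(B), lam=0.5)
image_only = objective(A, B, smooth(A), smooth(B), lam=1.0)
```

Setting `lam=1.0` ignores the auxiliary text entirely, which corresponds to clustering on the target images alone; smaller values let the text side contribute more.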

Page 71: Evolving web, evolving search

Experiment Setup

Data sets: Generated from Caltech-256 and 15-scene corpora

Baseline methods
• Baseline clustering methods: KMeans, PLSA and STC
• Strategies:
  • separate: clustering on target image data only
  • combined: clustering target image data and annotated image data together and evaluating the result for the target image data

Page 72: Evolving web, evolving search

Experimental Result

[Bar chart: Entropy (y-axis, 0 to 2) for KM_Seperate, KM_Combine, PLSA_Seperate, PLSA_Combine, STC, aPLSA]

 | Heterogeneous TL | No-Heterogeneous TL
Average Entropy | 0.741 | 0.786

5.7% entropy reduction
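The slides report entropy without defining it. A common definition for clustering evaluation, assumed here, is the size-weighted average of the entropy of the true class labels inside each cluster, where lower is better. The helper name `average_entropy` and the toy labels are illustrative.

```python
import math
from collections import Counter


def average_entropy(clusters):
    """Size-weighted mean label entropy (base 2) over clusters.

    `clusters` is a list of lists; each inner list holds the true class
    labels of the items assigned to one cluster.
    """
    total = sum(len(c) for c in clusters)
    result = 0.0
    for labels in clusters:
        n = len(labels)
        if n == 0:
            continue  # empty clusters contribute nothing
        entropy = -sum((k / n) * math.log2(k / n)
                       for k in Counter(labels).values())
        result += (n / total) * entropy
    return result


pure = average_entropy([["frog", "frog"], ["kayak", "kayak"]])
mixed = average_entropy([["frog", "kayak"], ["frog", "kayak"]])
```

A perfect clustering scores 0, while clusters that split two classes evenly score 1 bit.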

Page 73: Evolving web, evolving search

Clustering Results on Caltech256 [Griffin et al. TR 2007]

[Figure: example clusters for frog, kayak, bear, jesus-christ, watch]

Page 74: Evolving web, evolving search

Outline

• Introduction to heterogeneous transfer learning
• Cross media: Text -> Image
  • Clustering
  • Classification
• Cross language: English -> Chinese
• Application: Visual Contextual Advertising

Page 75: Evolving web, evolving search

Cross-language Classification [Ling et al. WWW 2008]

Text Classification

Labeled Chinese Web pages -> (learn) -> Classifier -> (classify) -> Unlabeled Chinese Web pages

Page 76: Evolving web, evolving search

Cross-language Classification

Much labelled data in English, but little in Chinese.

Labeled Data | English | Chinese
News | Reuters-21578 | ?
Newsgroups | 20 Newsgroups | ?
Web pages | Open Directory Project (> 1M) | Very few ODP data (< 20k, ~ 1%)

Page 77: Evolving web, evolving search

Cross-language Classification

Labeled English Web pages -> (learn) -> Classifier -> (classify) -> Unlabeled Chinese Web pages

Page 78: Evolving web, evolving search

Cross-language Classification

Information Bottleneck
• X: signals to be encoded (Web pages)
• X̃: codewords (class labels)
• Y: features related to X (terms)
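The Information Bottleneck is phrased in terms of mutual information. As a small illustration (not the algorithm of [Ling et al. WWW 2008]), the sketch below computes I(X;Y) from a joint distribution over Web pages X and terms Y, then merges pages into codewords (class labels) and computes the mutual information between codewords and terms. Merging can only lose information about Y, and that loss is the kind of quantity an Information Bottleneck method tries to keep small. The joint table, the page and term names, and the assignment of pages to codewords are made up.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def compress(joint, codeword_of):
    """Merge each x into its codeword, summing joint probabilities."""
    merged = defaultdict(float)
    for (x, y), p in joint.items():
        merged[(codeword_of[x], y)] += p
    return dict(merged)


# Toy joint distribution p(page, term); values sum to 1.
joint = {
    ("page1", "db2"): 0.20, ("page1", "java"): 0.05,
    ("page2", "db2"): 0.15, ("page2", "java"): 0.10,
    ("page3", "db2"): 0.05, ("page3", "java"): 0.45,
}
codeword_of = {"page1": "database", "page2": "database", "page3": "programming"}

i_xy = mutual_information(joint)
i_cy = mutual_information(compress(joint, codeword_of))
loss = i_xy - i_cy  # information about terms lost by the compression
```

A different assignment of pages to codewords gives a different loss, so comparing assignments by this loss is one way to picture what the optimization is choosing between.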

Page 79: Evolving web, evolving search

Cross-language Classification

Optimization

minimize: Information betw…

Minimize this distance

Page 80: Evolving web, evolving search

Cross-language Classification

Performance

Page 81: Evolving web, evolving search

Outline

• Introduction to heterogeneous transfer learning
• Cross media: Text -> Image
  • Clustering
  • Classification
• Cross language: English -> Chinese
• Application: Visual Contextual Advertising

Page 82: Evolving web, evolving search

Application: Visual Contextual Advertising [Chen et al. AAAI 2010]

• Previous research focused on advertising for text Web pages.
• With the booming of multimedia data, we need to recommend advertisements for these data.
• Difficulty: images and text are in different feature spaces.
• Use the co-occurrence data to bridge these two feature spaces.

Page 83: Evolving web, evolving search

Figure illustration of Visual Contextual Advertising

Page 84: Evolving web, evolving search

Visual Contextual Advertising

[Equations not recoverable. Surviving text: "We assume that there is …", "(based on the independent assumption)", "Where …"]

Page 85: Evolving web, evolving search

Experimental Resultsp

• Co-occurrence data from Flickr.
• Test images from Flickr and the Fifteen scene data set.
• Advertisements are crawled from the MSN search engine with queries chosen from the AOL query log.

Page 86: Evolving web, evolving search

Experimental Result

Page 87: Evolving web, evolving search

Experimental Result

Page 88: Evolving web, evolving search

Thank you

For more details of APEXLAB http://apex.sjtu.edu.cn/apex_wiki/FrontPage

Our works http://apex.sjtu.edu.cn/apex_wiki/Papers