A Model for Fast Web Mining Prototyping

Nivio Ziviani
UFMG – Brazil

Álvaro Pereira

Ricardo Baeza-Yates

Jesus Bisbal
UPF – Spain

2nd ACM International Conference on Web Search and Data Mining – WSDM'09

Motivation

• Our focus:

– Web mining as the process of discovering useful information in Web data by means of data mining techniques

• Web mining

– Computation-intensive task

– Iterative process

• Prototyping plays an important role

– Experimenting with different alternatives

– Incorporating the knowledge from previous iterations

• Mining software is developed ad hoc

– Time-consuming tasks

– Not scalable

– Not reusable

Main Objective: Design and Development of WIM

WIM – Web Information Mining model

• WIM goal: facilitate fast Web mining prototyping

• Main research challenges:

– Data model

– Algebra

– Software prototype

• Architecture and implementation issues

Web Mining Problems WIM Has Been Applied to So Far

• Study of genealogical trees on the Web (WWW'08)

– A study on how the Web textual content evolves

• A usage-based PageRank for ranking improvement

– A logical graph is created based on usage data

• Linkage Evolution for New Pages

– Hypothesis: duplicates tend to have no evolution of links (inlinks)

• A user intent study

– Identifying queries that cannot be classified as either navigational or informational

• Creation of a reference dataset for learning to rank

Outline

• Related work

• WIM data model

• WIM algebra

• Software architecture

• Conclusions and future work

Related Work

First Research Line: Data Mining Tools

• Business-driven solutions

• Not specifically designed for Web data

• SQL extensions

• Examples:

– Microsoft SQL Server

– Oracle Data Mining

– IBM DB2 Intelligent Miner

– BI tools:

• Angoss, Infor CRM Epiphany, Portrait Software, SAS

– Weka

Second Research Line: Query Languages for Web Data

• Not for mining

• Web data manipulation

– Acquisition, storage, management

• Examples:

– TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL, WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase, WEBVIEW

Data Model

Data Model – Design Goals

• Feasibility

• Simplicity

• Extensibility

• Data representativity

• Uniformity among operators

• Applicability to other scenarios

Relation Type

• Node relations represent nodes of a graph, such as:

– Documents of a Web dataset

– Terms of a document

– Queries of a query log

– Sessions of a query log

• Link relations represent edges of a graph, such as:

– Links between Web documents

– Word distance among terms of a document

– Similarity among queries

– Clicks of a query log

– Association between queries and sessions

• Usage data can be represented as either node or link relations

Node Relation

Node relation docs:

id  txt  url
1   to   w.a
2   fly  w.b
3   or   w.c
4   not  w.d
5   to   w.e
6   fly  w.f

Link Relation

• Main difference: link relations must represent start and end nodes of a graph

(Figure: a directed graph over documents 1–6 with labeled edges, shown next to the node relation docs and the corresponding link relation graph. The link relation has attributes id, St (start node), En (end node) and we.)

Compatibility

(Figure: the node relation docs shown next to the graph stored in the link relation graph, whose nodes are documents 1–6.)

• A link relation is compatible with a node relation if the nodes of its graph (the start and end nodes of the link relation) are foreign keys referencing the node relation
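
The compatibility rule can be sketched in a few lines of Python. This is only an illustration: the dictionary and list encoding, the attribute names `start` and `end`, the edges, and the helper `is_compatible` are assumptions made for this sketch, not WIM's actual storage format or API.

```python
# Hypothetical in-memory encoding of WIM relations (not the real WIM format).

# Node relation "docs": one tuple per node, keyed by the node id.
docs = {
    1: {"txt": "to", "url": "w.a"},
    2: {"txt": "fly", "url": "w.b"},
    3: {"txt": "or", "url": "w.c"},
    4: {"txt": "not", "url": "w.d"},
    5: {"txt": "to", "url": "w.e"},
    6: {"txt": "fly", "url": "w.f"},
}

# Link relation "graph": one tuple per edge (edges chosen for illustration).
graph = [
    {"start": 1, "end": 2},
    {"start": 2, "end": 4},
    {"start": 4, "end": 6},
]


def is_compatible(link_relation, node_relation):
    """True if every start and end node of the link relation is a key of the node relation."""
    return all(
        edge["start"] in node_relation and edge["end"] in node_relation
        for edge in link_relation
    )


print(is_compatible(graph, docs))  # True
```

A link relation that mentions a node id absent from `docs` would fail the check, which is the foreign-key reading of compatibility.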

Operation

• The act of applying an operator to a relation

• An operator is a function defined by the WIM algebra

– Unary or binary

(Figure: an operator applied to the relations docs and graph produces an output relation with the attributes id, txt and url plus a new attribute cl.)

WIM Program

• Sequence of operations applied to relations

– Result of users' interaction through the WIM language

• The WIM language:

– Is built upon the WIM algebra

– Is declarative

– Is a dataflow programming language

• Facilitates parallelism

• Allows graphical implementation

WIM Program Example – Genealogical Tree Study

(Figure: dataflow diagram of the WIM program for the genealogical tree study, built up step by step over several slides. Relations shown: relOld, relDupOld, relClusterOld, relClusterNew, relSearch, relSearchUrl, relSeDifUrl, relEnd, relEndInst, relGenSt and relGenEnd. Operator labels shown: compare, disc., search, cGr., sel., agg. and set.)

WIM Algebra

Two Classes of Operators

• Seven data manipulation operators

– Select, Calculate, CalcGraph, Aggregate, Set, Join, Materialize

• Eight data mining operators

– Search, Compare, CompGraph, Cluster, Disconnect, Associate, Analyze, Relink

Select

Select tuples from the input

Input relation countClick:

ses.Id  q.Id  num.C
1       11    1
2       12    2
3       13    1
4       11    1
5       13    1
6       12    2
7       14    1
8       11    1
9       13    1
10      13    1

Output relation one (after select):

ses.Id  q.Id  num.C
1       11    1
3       13    1
4       11    1
5       13    1
7       14    1
8       11    1
9       13    1
10      13    1
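
A minimal Python sketch of this operation, using the values of the countClick example. The function `select`, the attribute spellings and the predicate `num_c == 1` are assumptions: the predicate is inferred from the input and output shown on the slide, which does not state the actual WIM syntax.

```python
# Relation countClick from the slide, encoded as a list of dictionaries.
ses_ids = range(1, 11)
q_ids = [11, 12, 13, 11, 13, 12, 14, 11, 13, 13]
num_clicks = [1, 2, 1, 1, 1, 2, 1, 1, 1, 1]

count_click = [
    {"ses_id": s, "q_id": q, "num_c": n}
    for s, q, n in zip(ses_ids, q_ids, num_clicks)
]


def select(relation, predicate):
    """Keep only the tuples for which the predicate holds (hypothetical helper)."""
    return [t for t in relation if predicate(t)]


# Assumed condition: sessions with exactly one click.
one = select(count_click, lambda t: t["num_c"] == 1)

print([t["ses_id"] for t in one])  # [1, 3, 4, 5, 7, 8, 9, 10]
```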

Calculate

For mathematical and statistical calculations

Input relation tfidf1:

start  end            tf
11     7, 3, 1, 4, 8  0.4, 0.3, 0.3, 0.2, 0.1
12     3, 5, 9, 2, 6  0.6, 0.3, 0.2, 0.1, 0.1

Output relation tfidf2 (after calc.):

start  end            tf2
11     7, 3, 1, 4, 8  1.0, 0.7, 0.7, 0.4, 0.0
12     3, 5, 9, 2, 6  1.0, 0.5, 0.3, 0.0, 0.0

CalcGraph

For calculations between nodes of the graph

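Node relation relClusterOld:

id  text  url   clus
1   to    w.to  11
2   be    w.be  12
3   or    w.or  13
4   not   w.no  14
5   to    w.ta  11
6   be    w.bf  12

Node relation relClusterNew:

id  text  url   clus
21  to    w.tt  31
22  fly   w.fl  32
23  or    w.of  33
24  not   w.no  34
25  to    w.tn  31
26  fly   w.fy  32

Link relation relGenSt: edges 1 → 21 and 3 → 23

Output relation relGenCalc (after c.g.):

start  end  sum
1      21   42
3      23   46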
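
The relGenCalc example can be reproduced with a short Python sketch. The helper `calc_graph` and its parameters are hypothetical; the reading that sum adds the clus value of the start node to the clus value of the end node is inferred from the slide's numbers (11 + 31 = 42 and 13 + 33 = 46), not stated on it.

```python
# clus attribute of the two node relations, keyed by node id (values from the slide).
clus_old = {1: 11, 2: 12, 3: 13, 4: 14, 5: 11, 6: 12}
clus_new = {21: 31, 22: 32, 23: 33, 24: 34, 25: 31, 26: 32}

# Link relation relGenSt: (start, end) pairs.
rel_gen_st = [(1, 21), (3, 23)]


def calc_graph(links, start_attr, end_attr, op, name):
    """Apply op to an attribute of the start node and of the end node of each edge."""
    return [
        {"start": s, "end": e, name: op(start_attr[s], end_attr[e])}
        for s, e in links
    ]


rel_gen_calc = calc_graph(rel_gen_st, clus_old, clus_new, lambda a, b: a + b, "sum")

print(rel_gen_calc)
# [{'start': 1, 'end': 21, 'sum': 42}, {'start': 3, 'end': 23, 'sum': 46}]
```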

Aggregate

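Groups tuples with the same value for one or two attributes

(Figure: the graph of the link relation relCocit over nodes 1–4 and the aggregated graph relAgg.)

Input relation one:

ses.Id  q.Id  url.Id
1       11    21
3       13    22
4       11    21
5       13    22
7       14    26
8       11    21
9       13    26
10      13    22

Output relation (after aggregate; m.one is the number of input tuples with the same q.Id and url.Id):

ses.Id  q.Id  url.Id  m.one
1       11    21      3
3       13    22      3
7       14    26      1
9       13    26      1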
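
A minimal Python sketch of grouping on two attributes, using the values of the tabular example on this slide. The helper `aggregate`, the attribute spellings, and the choice to keep the first tuple of each group together with a count are assumptions that happen to reproduce the slide's output; the slide does not specify the actual WIM semantics.

```python
# Input relation "one": (ses.Id, q.Id, url.Id) values from the slide.
rows = [
    (1, 11, 21), (3, 13, 22), (4, 11, 21), (5, 13, 22),
    (7, 14, 26), (8, 11, 21), (9, 13, 26), (10, 13, 22),
]
one = [{"ses_id": s, "q_id": q, "url_id": u} for s, q, u in rows]


def aggregate(relation, keys, count_name):
    """Group tuples by the given attributes; keep the first tuple of each group plus a count."""
    groups = {}  # insertion-ordered, so groups appear in order of first occurrence
    for t in relation:
        key = tuple(t[k] for k in keys)
        if key not in groups:
            groups[key] = dict(t)
            groups[key][count_name] = 0
        groups[key][count_name] += 1
    return list(groups.values())


most = aggregate(one, keys=("q_id", "url_id"), count_name="m_one")

print([(t["ses_id"], t["m_one"]) for t in most])
# [(1, 3), (3, 3), (7, 1), (9, 1)]
```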

Set

For union, intersection and difference of the tuples of two relations

(Figure: two set operations involving the relations relClusterNew, relSeDifUrl and relEnd. relEnd contains the tuples of relClusterNew with id 21, 23 and 25.)

Join

Adds an external attribute to a given relation

(Figure: a join involving the node relation queryData (attributes id, q. and n.cli) and a relation with attributes ses.Id, q.Id, url.Id and m.one. The output relation mostOne has attributes q.Id and m.one.)

Search

Used for querying (TF-IDF, BM25, AND, OR)

(Figure: two search operations over the relations dataSet (attributes id, pr and text) and queryList (attributes id and text, with the queries "to fly" and "to buy"). One output is the link relation tfidf with attributes start, end and tf. An attribute c.Id also appears in the figure.)

Compare

Compare elements of a textual attribute

Input relation relOld:

id  text  url
1   to    w.to
2   be    w.be
3   or    w.or
4   not   w.no
5   to    w.ta
6   be    w.bf

Output relation relDupOld (after compare):

start  num  end
1      1    5
2      1    6
3      0
4      0
5      0
6      0
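
The relDupOld example can be reproduced with a short Python sketch. Exact string equality stands in for the comparison method, and each document is linked only to later documents with the same text; both choices are assumptions that match the slide's numbers, since the slide does not name the similarity measure that WIM uses. The helper `compare` is hypothetical.

```python
# text attribute of relOld, keyed by document id (values from the slide).
rel_old = {1: "to", 2: "be", 3: "or", 4: "not", 5: "to", 6: "be"}


def compare(texts):
    """For each document, list the later documents whose text is identical."""
    ids = sorted(texts)
    result = []
    for i in ids:
        ends = [j for j in ids if j > i and texts[j] == texts[i]]
        result.append({"start": i, "num": len(ends), "end": ends})
    return result


rel_dup_old = compare(rel_old)

for t in rel_dup_old:
    print(t["start"], t["num"], t["end"])
# 1 1 [5]
# 2 1 [6]
# 3 0 []
# 4 0 []
# 5 0 []
# 6 0 []
```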

Disconnect

Identify clusters in a graph

Input link relation relDupOld:

start  num  end
1      1    5
2      1    6
3      0
4      0
5      0
6      0

Output relation relClusterOld (after disconnect):

id  text  url   clus
1   to    w.to  11
2   be    w.be  12
3   or    w.or  13
4   not   w.no  14
5   to    w.ta  11
6   be    w.bf  12
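
One way to identify such clusters is to compute the connected components of the graph, sketched below in Python with a small union-find structure. The function `disconnect`, its parameters, and the numbering of clusters from 11 in order of first appearance are assumptions chosen to match the slide's example; the slide does not say which algorithm WIM uses.

```python
def disconnect(node_ids, edges, first_cluster_id=11):
    """Assign a cluster id to every node; nodes in the same connected component share an id."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    cluster_of_root = {}
    clusters = {}
    for n in sorted(parent):
        root = find(n)
        if root not in cluster_of_root:
            cluster_of_root[root] = first_cluster_id + len(cluster_of_root)
        clusters[n] = cluster_of_root[root]
    return clusters


# Edges of relDupOld from the slide: 1 -> 5 and 2 -> 6.
clus = disconnect(range(1, 7), [(1, 5), (2, 6)])

print(clus)  # {1: 11, 2: 12, 3: 13, 4: 14, 5: 11, 6: 12}
```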

Analyze

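For link analysis (PageRank, Authority, Indegree)

(Figure: the graph of the link relation relPruned over nodes 1–4, with labeled edges, and the node relation relUsDocs.)

Node relation relUsDocs:

id  u.pr
1   0.1
2   0.2
3   0.4
4   0.5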
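
Two of the measures named on this slide, indegree and PageRank, can be sketched in plain Python as follows. This is a textbook power-iteration PageRank, not the WIM implementation; the edge list is made up for illustration (the edges of relPruned cannot be recovered from the slide), and the damping factor and iteration count are conventional defaults.

```python
def indegree(nodes, edges):
    """Number of incoming links per node."""
    counts = {n: 0 for n in nodes}
    for _, end in edges:
        counts[end] += 1
    return counts


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration; dangling nodes spread their rank evenly."""
    nodes = list(nodes)
    out_links = {n: [] for n in nodes}
    for start, end in edges:
        out_links[start].append(end)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links[n] or nodes  # no out-links: distribute to every node
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph over four nodes.
example_nodes = [1, 2, 3, 4]
example_edges = [(1, 2), (1, 3), (2, 4), (3, 4)]

in_deg = indegree(example_nodes, example_edges)
ranks = pagerank(example_nodes, example_edges)

print(in_deg)  # {1: 0, 2: 1, 3: 1, 4: 2}
print(max(ranks, key=ranks.get))  # 4
```

The ranks sum to 1 after every iteration, which is a quick sanity check when swapping in a different graph.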

Software Architecture

(Figure: architecture diagram. Components shown: Web crawler, Pre-processor, Indexer, Compiler, Executor and Visualizer. Data stores shown: index, attributes (relations) att 1 … att n, metadata, temporary attributes tmp1 … tmp n, tmp index, new program, and output out 1 … out n.)

Conclusions and Future Work

Conclusions

• WIM – a model and software for fast Web mining prototyping

– Data model

– Algebra

– A software prototype

• Efficient

– Handles several tens of millions of tuples

– Running time is higher for the mining operations

• Ad-hoc solutions also need the mining step

• Scalable

– A future implementation could store the attributes on different servers and run different parts of a program in a distributed fashion

Conclusions

• Extensible

– New operators, and new options/methods for the current operators, can be added

• We have designed and implemented an extension of operator Analyze

– Calculates PageRank taking into account the labels of the graph

• Effective for a set of Web mining applications

Future Work on WIM

• Finish the implementation and make a version of the prototype available

– Users could contribute extensions

– Improve the prototype to become a tool

• Design new operators for other mining tasks

• Integrate a Web crawler and a data visualization interface

• Implement a graphical interface to the WIM language

Thank you!
