Web Modelling for Web Warehouse Design
Daniel Coelho Gomes
PhD in Informatics
Specialization in Informatics Engineering
19 March 2007
Daniel Gomes – http://xldb.fc.ul.pt/daniel/ 2
Harnessing the Web
• The web is the largest source of information
• Users need applications to extract knowledge from web data
• Each application has to manage its own data
[Diagram: Web, Applications, Users]
The need for Web Warehouses
• A Web Warehouse (WWh) releases applications from data management
  – Applications focus on their purposes
• Enables web data reuse
[Diagram: Web, Web Warehouse, Applications, Users]
Web Warehousing supports mining applications
Web vs. Data Warehousing
[Diagram, data warehousing: relational OLTP sources, then Extract / Transform / Load, then Data Warehouse, then Data Mining applications]
[Diagram, web warehousing: hypertextual web sites, then Extract / Transform / Load, then Web Warehouse, then Web Mining applications]
• Must know data to design a warehouse
• The Web does not follow a relational model
• Web data models are required
What is a web model?
• A web model describes the characteristics of a web portion
  – Distribution of sites per Top-Level Domain
  – Content media types
  – Incoming links per URL
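The three example features above can be computed from crawl records. The following is a minimal sketch, not the method used in this work: the record layout `(url, media_type, outgoing_links)`, the sample data and the function name `web_model` are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical crawl records: (URL, media type, URLs linked from this content).
CRAWL = [
    ("http://www.example.pt/", "text/html",
     ["http://www.example.pt/a.pdf", "http://blog.example.com/"]),
    ("http://www.example.pt/a.pdf", "application/pdf", []),
    ("http://blog.example.com/", "text/html", ["http://www.example.pt/"]),
]


def web_model(records):
    """Summarise a crawled web portion with three simple features."""
    # A site is identified here by its host name (an assumption of this sketch).
    sites = {urlsplit(url).hostname for url, _, _ in records}
    sites_per_tld = Counter(host.rsplit(".", 1)[-1] for host in sites)

    media_types = Counter(media for _, media, _ in records)

    # Count incoming links only for URLs that belong to the crawled portion.
    incoming = {url: 0 for url, _, _ in records}
    for _, _, links in records:
        for target in set(links):
            if target in incoming:
                incoming[target] += 1

    return {
        "sites_per_tld": dict(sites_per_tld),
        "media_types": dict(media_types),
        "incoming_links": incoming,
    }


model = web_model(CRAWL)
```

Which features are worth computing depends on the application that will use the warehouse, so this dictionary is only a starting point.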
What is a web portion?
• A WWh must be populated with contents relevant to its users
• A web portion is the set of relevant web contents selected to be warehoused
• The Portuguese web
  – Empirical definition: contents relevant to the Portuguese community
  – Formal definition:
    • Contents under the .PT domain
    • Contents outside .PT in Portuguese and linked from .PT
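The formal definition can be read as a membership test. The sketch below assumes that the language of each content has already been identified by some upstream classifier and that the URLs linking to it are known; the function names and parameters are hypothetical.

```python
from urllib.parse import urlsplit


def under_pt(url):
    """True when the host of `url` is under the .PT top-level domain."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "pt" or host.endswith(".pt")


def in_portuguese_web(url, language, linked_from):
    """Apply the two rules of the formal definition.

    `language` is a language code produced by a hypothetical classifier and
    `linked_from` is the collection of URLs that link to this content.
    """
    if under_pt(url):
        return True
    return language == "pt" and any(under_pt(source) for source in linked_from)
```

For example, a page in Portuguese hosted under .COM is accepted only if at least one .PT page links to it.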
Outline
– Motivation
• Objectives and methodology
• Contributions
• Conclusions
• Future Work
Research questions
1. Which features should be considered in a web model?
2. How can the boundaries of a web portion be defined?
3. What can bias a web model?
4. How persistent is information on the web?
5. How do web characteristics influence Web Warehouse design?
Experimental methodology
[Diagram, iterative cycle: Build/tune Webhouse, Model the Portuguese web, Analyze results]
• Successive versions of Webhouse made it possible to identify how web characteristics influence its design
Why the Portuguese Web?
• General models of the Web may not be representative of the data to be warehoused
  – The Portuguese Web can be exhaustively harvested and accurately modelled
  – Still provides a general model of web data because it contains several publication genres
  – The Portuguese Web is relevant to a significant number of users (10M)
Webhouse architecture
[Diagram: Webhouse comprises the Viúva Negra crawler (Extract), the Webcat converter (Transform) and the Versus repository (Load), and serves Web Applications]
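The Extract / Transform / Load split in this architecture can be illustrated with a toy pipeline. This is only a sketch of the general pattern and does not reproduce the interfaces of Viúva Negra, Webcat or Versus: the function names, the in-memory `web` dictionary and the naive tag stripping are all assumptions.

```python
import re


def extract(web):
    """Extract stage: yield (url, raw content) pairs.

    `web` is an in-memory stand-in for the live web.
    """
    yield from web.items()


def transform(raw_html):
    """Transform stage: convert a content to plain text (naive tag stripping)."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return " ".join(text.split())


def load(repository, url, text):
    """Load stage: append the converted content to the list kept for its URL."""
    repository.setdefault(url, []).append(text)


def run_pipeline(web, repository):
    """Chain the three stages and return the populated repository."""
    for url, raw in extract(web):
        load(repository, url, transform(raw))
    return repository


repo = run_pipeline({"http://example.pt/": "<html><p>Olá  mundo</p></html>"}, {})
```

Keeping a list per URL lets repeated crawls accumulate versions; that storage choice is an assumption of the sketch.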
Outline
– Motivation
– Objectives and methodology
• Contributions
• Conclusions
• Future Work
Innovation of this research
• Includes web modelling in the web data integration process
  – Web Warehousing has been done assuming that the data sources were well known
• Studies the influence of web characteristics on the several stages of web data integration
  – From extraction to access
• Combines knowledge from different research domains
  – Web Characterization: monitors and models the web
  – Web Crawling: automatic extraction of web data
  – Web Warehousing: web data integration
Web Characterization
A model of the Portuguese web
• Top-level domains: PT 84.2%, COM 12.5%, NET 2.5%, ORG 0.8%
• Web servers: Apache 57%, Microsoft IIS 39%, Netscape-enterprise 1%, Oracle9ias 1%, Others 2%
[Chart: percentage of contents (0% to 30%) by size in KB (0 to 2048)]
[Chart: number of contents (logarithmic scale, 1 to 10000000) by number of incoming links: 0, [1,10[, [10,100[, [100,1000[, >=1000]
Models for estimating web data persistence
[Chart: URLs (0% to 100%) versus age in days (0 to 1000), with fitted curve $y = -0.1373\ln(x) + 1.0683$, $R^2 = 0.928$]
• URL transience is much more problematic in a WWh than in bookmarking
• In 2 months, 50% of the URLs in a data set are no longer valid
  – Half-life = 61 days
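The fitted logarithmic curve can be inverted to estimate the age at which a given fraction of URLs is still valid. The sketch below uses only the rounded coefficients printed on the slide; the function names are assumptions.

```python
import math

# Rounded coefficients of the fitted curve y = A * ln(x) + B from the slide.
A = -0.1373
B = 1.0683


def persistence(age_days):
    """Estimated fraction of URLs still valid after `age_days` days."""
    return A * math.log(age_days) + B


def age_at(fraction):
    """Invert the fitted curve: age in days at which `fraction` of URLs remain."""
    return math.exp((fraction - B) / A)


half_life = age_at(0.5)
```

With these rounded coefficients the curve crosses 50% at roughly 63 days, close to the 61-day half-life reported on the slide; the small difference is consistent with rounding of the coefficients.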
Comparison with other studies on URL persistence
Study            Results  My estimation  Comparison
Koehler (2002)   50%      17%            >
Cho (2000)       70%      60%            >
Fetterly (2003)  88%      47%            >
Ntoulas (2004)   20%      26%            ~
Web Crawling
Crawling algorithms and techniques
[Sequence diagram. Participants: CP, Local Frontier, Global Frontier, Volume, Classifier, Text extractor, Collector, Parser, Site. Messages: checkOut(), start(), GET(), parse(), extractText(), join(), isContentRelevant(), getREP(), insertMetaData(), store(), courtesyPause(), checkIn(), getURL(), HEAD(). Two loops, one guarded by [hasNotUnvisitedURLs]]
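The per-site loop suggested by the diagram (take a URL from the local frontier, issue HEAD, then GET, check relevance, store, pause) can be sketched as follows. This is an illustrative toy, not the Viúva Negra implementation: the in-memory `WEB` dictionary replaces real HTTP requests, and every function name and the relevance rule are assumptions.

```python
import time
from collections import deque

# Hypothetical in-memory web: site -> {path: (media type, body, links)}.
WEB = {
    "www.example.pt": {
        "/": ("text/html", "home", ["/a", "/b.zip"]),
        "/a": ("text/html", "page a", ["/"]),
        "/b.zip": ("application/zip", "binary", []),
    }
}


def head(site, path):
    """Stand-in for an HTTP HEAD request: return the media type or None."""
    entry = WEB[site].get(path)
    return entry[0] if entry else None


def get(site, path):
    """Stand-in for an HTTP GET request: return (body, links)."""
    _, body, links = WEB[site][path]
    return body, links


def is_content_relevant(media_type):
    """Toy relevance rule (an assumption): keep only HTML contents."""
    return media_type == "text/html"


def crawl_site(site, seeds=("/",), courtesy_pause=0.0):
    """Crawl one site until its local frontier has no unvisited URLs."""
    frontier = deque(seeds)   # local frontier for this site
    seen = set(seeds)
    stored = {}
    while frontier:
        path = frontier.popleft()
        media_type = head(site, path)
        if media_type is None or not is_content_relevant(media_type):
            continue          # skip missing or irrelevant contents without a GET
        body, links = get(site, path)
        stored[path] = body
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(courtesy_pause)  # be polite to the visited site
    return stored


result = crawl_site("www.example.pt")
```

Issuing HEAD before GET lets the crawler discard irrelevant media types without downloading them, and the pause between requests limits the load placed on each site.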
Coping with hazardous situations
• Documentation of situations that are hazardous to crawling, and solutions to address them
• Spider traps
  – Infinite sites
• Duphosts
  – Sites with different names that provide the same content
  – tucows.com, www.tucows.com, tucows.ip.pt
  – Waste of WWh resources
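Both hazards can be mitigated with simple safeguards. The sketch below shows one possible approach and is not necessarily the solution adopted in this work: duphosts are grouped by hashing a sample of each site's contents, and spider traps are bounded with depth and URL-count limits. The function names and the threshold values are assumptions.

```python
import hashlib
from collections import defaultdict


def fingerprint(pages):
    """Hash a {path: content} sample of a site into a single digest."""
    digest = hashlib.sha256()
    for path in sorted(pages):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(pages[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def find_duphosts(samples):
    """Group hosts whose sampled paths and contents are identical."""
    groups = defaultdict(list)
    for host, pages in samples.items():
        groups[fingerprint(pages)].append(host)
    return [sorted(hosts) for hosts in groups.values() if len(hosts) > 1]


def within_limits(path, urls_crawled, max_depth=5, max_urls=1000):
    """Spider-trap guard: bound the path depth and the URLs taken per site."""
    trimmed = path.strip("/")
    depth = trimmed.count("/") + 1 if trimmed else 0
    return depth <= max_depth and urls_crawled < max_urls


samples = {
    "tucows.com": {"/": "x"},
    "www.tucows.com": {"/": "x"},
    "other.pt": {"/": "y"},
}
duplicates = find_duphosts(samples)
```

Crawling only one host from each duplicate group avoids storing the same contents several times.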
Web Warehousing
Applications of Webhouse
[Diagram: Web, Webhouse, Applications, Users]
Answers
1. Which features should be considered in a web model?
  – Vary according to application requirements
  – Site, content, link structure and data persistence
2. How can the boundaries of a web portion be defined?
  – Automatic harvesting policy
  – Domain restrictions and content classification
Answers
3. What can bias a web model?
  – Hazardous situations
  – Sampling methodology must emulate extraction stage
4. How persistent is information on the web?
  – The web is getting more transient but there is also persistent data
5. How do web characteristics influence Web Warehouse design?
  – Extraction stage
  – Storage requirements
  – Schedule maintenance operations
Future work
• Is a model of the Portuguese web representative of other web portions?
  – Differences due to sampling methods and dates?
  – Crawl different portions in parallel and compare models
• Web warehousing research is crucial to deploy large-scale web archiving
  – How to search among historical web collections?
Main publications
• Journals
  – Daniel Gomes and Mário J. Silva, The Viúva Negra crawler: an experience report, Software: Practice and Experience, Wiley InterScience (accepted for publication);
  – Daniel Gomes and Mário J. Silva, Characterizing a national community web, Transactions on Internet Technology, ACM, 2005.
• Conferences
  – Daniel Gomes, Sérgio Freitas, Mário J. Silva, Design and Selection Criteria for a National Web, ECDL’06 (best paper by young researcher);
  – Daniel Gomes, Mário J. Silva, Modelling information persistence on the web, ICWE’06 (best paper candidate);
  – Daniel Gomes, André Santos, Mário J. Silva, Managing duplicates in a web archive, SAC’06.
Thank you for your attention