Date posted: 21-Dec-2015
WEDAGEN: A Synthetic Web Database Generator
Presentation Outline
- Existing WWW search mechanisms
- WHOWEDA: A Warehouse of Web Data
- Modular structure of WEDAGEN
- Configuration parameters
- Performance evaluation
- Summary and future work
Existing W3 Search Mechanisms
- Time delay in manual navigation of the web
- Overwhelming results and unwanted information
- No tool for organizing and storing harnessed information for further manipulation
Search engines and browsers are not always the best ways to systematically harness information from the web
The WHOWEDA approach @ NTU
Overview of WHOWEDA
- A web warehousing system to store and manipulate web information
- Stores extracted information as ‘web tables’ and provides ‘web operators’ to manipulate web tables
- To extract information from W3, the user defines a ‘query graph’
- The result of an extraction is a set of web tuples; each tuple instantiates the query graph
- More information: http://www.cais.ntu.edu.sg:8000/~whoweda
Example: Query graph (web schema)
N1.URL EQUALS “http://sunsite.doc.ic.ac.uk/bySubject/Computing/UniSciDepts.html”
L2.LABEL EQUALS “faculty”
L3.LABEL EQUALS “research projects”
L4.LABEL CONTAINS “publications”
L5.LABEL CONTAINS “publications”
N5.TEXT CONTAINS “Internet computing”
Nodes in the query graph: N1, N2, N3, N4, N5
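The predicates in this example can be read as constraints on the attributes of node and link instances. The following is a minimal Python sketch of how such constraints might be checked, not WHOWEDA's actual implementation: the `Predicate` class, the `satisfies` helper and the dictionary-based instances are hypothetical names introduced for illustration, and case-sensitive substring matching for CONTAINS is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One query-graph constraint, e.g. N5.TEXT CONTAINS "Internet computing".

    Hypothetical type for illustration; not part of WHOWEDA.
    """
    target: str     # node or link variable, e.g. "N5" or a link variable
    attribute: str  # e.g. "URL", "LABEL", "TEXT"
    operator: str   # "EQUALS" or "CONTAINS"
    value: str

    def holds(self, instance: dict) -> bool:
        """Check this predicate against one node or link instance."""
        actual = instance.get(self.attribute, "")
        if self.operator == "EQUALS":
            return actual == self.value
        if self.operator == "CONTAINS":
            # Assumption: plain case-sensitive substring match.
            return self.value in actual
        raise ValueError(f"unsupported operator: {self.operator}")


def satisfies(binding: dict, predicates: list) -> bool:
    """Return True if every predicate holds for the instance bound to its variable.

    `binding` maps variable names to attribute dictionaries; a predicate whose
    variable is missing from the binding counts as not satisfied.
    """
    return all(
        p.target in binding and p.holds(binding[p.target])
        for p in predicates
    )
```

A candidate web tuple would then be accepted only if `satisfies` returns True for the binding of its node and link instances to the variables of the query graph.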
Example: Query results
Id Name Age
A1 John 23
C2 Wendy 35
B4 Jane 25
A2 Wendy 35
C9 Pete 42
B3 Kim 38
F8 Tom 22
G7 Cindy 47
Objectives
- Need to perform systematic evaluation of web operators during WHOWEDA development
- Limitations of testing using real web data
- To design a testbed that is controllable, comprehensive and systematic for evaluating web database systems
- To control the quantity and quality of synthetic web tuples by allowing users to specify configuration parameters and web schemas
WEDAGEN: A Web Database Generator
System Architecture of WEDAGEN
Configuration Input Parameters
WEDAGEN parameters (diagram labels: Web Schema, Control, Default, Specific, Instance Related, Selectivity, Fan-In)
- NumTuples, NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerHostName, NumWordsPerTitle, LocalGlobalLink
- NumSourceNodeInstances, FanOut, NumKeyWordsPerNodeInstance, NumWordsPerNodeInstance, NumWordsPerLinkLabel, NumWordsPerTitle, NumWordsPerHostName, LocalGlobalLink
- NodeSelectivity, TableSelectivity
Parameter Values Suggestion
[Flowchart] Boxes:
- Start
- Generate specific parameter values
- User changes specific parameters
- Calculate max. no. of tuples to be generated
- Is calculated value > NumTuples?
- Calculate NumSourceNodeInstances to generate specified number of tuples
- Store suggested values in file
- User changes specific parameters
- End
- Invoke instance generation module
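The suggestion flow compares the maximum number of tuples that the current parameters can produce with NumTuples and, if needed, recalculates NumSourceNodeInstances. The slides do not give the formulas, so the sketch below rests on loudly stated assumptions: a linear schema with `depth` links, where every node instance has exactly FanOut outgoing links, so that the maximum number of tuples is NumSourceNodeInstances × FanOut^depth. The function names and the `depth` argument are hypothetical, and the direction of the branch (recalculate when the maximum exceeds NumTuples) is a guess from the flowchart labels.

```python
def max_tuples(num_source: int, fan_out: int, depth: int) -> int:
    """Upper bound on tuples for a linear schema with `depth` links.

    Assumption: every node instance has exactly `fan_out` outgoing links.
    """
    return num_source * fan_out ** depth


def suggest_num_source_nodes(num_tuples: int, fan_out: int, depth: int) -> int:
    """Smallest number of source node instances that can yield `num_tuples`."""
    per_source = fan_out ** depth
    # Integer ceiling division, at least one source node.
    return max(1, -(-num_tuples // per_source))


def suggest(params: dict, depth: int) -> dict:
    """Return a copy of `params` with a suggested NumSourceNodeInstances.

    The value is only changed when the calculated maximum exceeds NumTuples
    (assumed branch direction of the flowchart).
    """
    suggested = dict(params)
    bound = max_tuples(params["NumSourceNodeInstances"], params["FanOut"], depth)
    if bound > params["NumTuples"]:
        suggested["NumSourceNodeInstances"] = suggest_num_source_nodes(
            params["NumTuples"], params["FanOut"], depth
        )
    return suggested
```

For example, with NumTuples = 600, FanOut = 5 and a two-link linear schema, the sketch suggests 24 source node instances, since 24 × 5² = 600. Tree-shaped schemas would need a different bound.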
Instance Generation Module (IGM)
Generators:
1. No. of node instance generator
2. URL generator
3. Node instance attribute generator
4. Linkset generator
5. Webpage generator
Inputs and outputs shown in the diagram:
- NumSourceNodeInstances, Fanout
- No. of node instances per node
- Num words per URL
- URLs of all node instances
- Linkset of each instance
- Node attributes, e.g. title, text, date
- Num words per node instance
- Images
- Webpage
- Num words per title
- Node Pool
- Webpages
- Web tables
- Tuple Extraction Module
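As a rough illustration of the first IGM stages (counting node instances, generating URLs and building link sets), here is a minimal Python sketch. It is not the WEDAGEN implementation: it assumes a linear schema with `depth` links, gives every node instance exactly FanOut fresh targets (so no node is shared between parents), and uses a hypothetical host name and URL pattern. Attribute and webpage generation are omitted.

```python
import itertools


def generate_instances(num_source, fan_out, depth, host="example.test"):
    """Generate node and link instances for a linear schema (assumption).

    Returns (levels, links): levels[i] lists the node URLs at schema
    position i, and links is a list of (source_url, target_url) pairs.
    """
    counter = itertools.count(1)

    def new_url():
        # Hypothetical URL pattern; real generated URLs are built from words.
        return f"http://{host}/page{next(counter)}.html"

    levels = [[new_url() for _ in range(num_source)]]
    links = []
    for _level in range(depth):
        next_level = []
        for src in levels[-1]:
            for _k in range(fan_out):
                dst = new_url()
                next_level.append(dst)
                links.append((src, dst))
        levels.append(next_level)
    return levels, links
```

With `num_source=2`, `fan_out=3` and `depth=2`, the levels hold 2, 6 and 18 node instances connected by 24 links.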
Directed Graph Output from IGM
Tuple Extraction Module (TEM)
- IGM generates all node and link instances interconnected as directed graph(s)
- TEM extracts and constructs individual web tuples from the directed graph(s)
- Node and link instances have IDs assigned
- Web tuples stored in a web table file
- A web table has been constructed that is complete with node, link and tuple information
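One way to picture tuple extraction from a directed graph is as path enumeration. The sketch below is an illustration under assumptions, not the TEM algorithm itself: it assumes a linear schema with `depth` links, treats each web tuple as one path from a source node instance, and uses the hypothetical function name `extract_tuples`. ID assignment and writing the web table file are left out.

```python
def extract_tuples(source_nodes, links, depth):
    """Enumerate every path of `depth` links that starts at a source node.

    `links` is a list of (source, target) pairs; each returned tuple lists
    the node instances along one path, in order.
    """
    # Build an adjacency list that preserves link order.
    outgoing = {}
    for src, dst in links:
        outgoing.setdefault(src, []).append(dst)

    tuples = []

    def walk(path):
        if len(path) == depth + 1:
            tuples.append(tuple(path))
            return
        for nxt in outgoing.get(path[-1], []):
            walk(path + [nxt])

    for node in source_nodes:
        walk([node])
    return tuples
```

Paths that end before reaching `depth` links are dropped, so only complete instantiations of the (assumed linear) schema become tuples.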
Extracted Web Tuples
Preliminary Evaluation
- Elapsed time used to measure overhead of web table generation
- A set of sample test configurations identified, consisting of typical combinations of 4 web schemas and input parameters
- Performance measured with respect to:
  - Complexity of schema
  - Total number of node instances and total number of tuples
Four Test Schemas
Three Table Sizes
Parameter                           Small   Medium  Large
NumTuples                           600     2,000   5,000
NumSourceNodeInstances - Schema 1   24      20      8
NumSourceNodeInstances - Schema 2   1       1       4
NumSourceNodeInstances - Schema 3   1       1       1
NumSourceNodeInstances - Schema 4   24      20      8
FanOut                              5       10      25
NumWordsPerNodeInstance             200     100     00
NumWordsPerLinkLabel                6       10      10
NumWordsPerHostname                 5       10      10
NumWordsPerTitle                    6       10      10
LocalGlobalLink                     0       0       0
NodeSelectivity                     60      90      90
Elapsed Time Vs No. of Tuples
Experimental Findings
- Time elapsed in generating web table increases with size of table
- Rate of growth is different for different schemas; i.e., schema complexity affects elapsed time
- Generating table of tree schema (schema 2) takes longer than that of linear schema (schema 1)
- Generating table of schema 2 takes longer than that of schema 4
Summary
- Parameters for creating web data of different sizes and complexities were successfully identified
- WEDAGEN was designed, implemented and successfully integrated into the WHOWEDA system
- WEDAGEN scales well with increasing web schema complexity and web table size
- WEDAGEN can reduce the time and effort required to evaluate web database system performance
Future Work
- Inclusion of more parameters:
  - Minimum and maximum depth of a tuple
  - Average ratio of bound and unbound nodes in a tuple
- Apply WEDAGEN to other database systems similar to WHOWEDA
- Develop WHOWEDA into a full-fledged benchmark toolkit