Result Page Analysis (Cheng Wang)
Cheng Wang
• A list of results decorated with
  - Side bars
  - Branding banners
  - Advertisements
  - Merchant information
  - Search forms
  - Navigation part
• Data Area Identification
• Record Segmentation
• Data Alignment
• Visual information
  - ViDE, ViPER
• Ontology
  - ODE
• HTML page based
  - FiVaTech
• Regular expression
  - EXALG, DeLa
• Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.
• 1. Domain ontology construction
  - Query interface
  - Query result pages
• 2. Data extraction using the ontology
  - Identify the data area
  - Segment records
  - Align data values
• Multiple query result pages
  - PADE
• 1. Match query interface elements to data values.
  - title="%orientalism%"
• 2. Search for voluntary labels in table headers.
• 3. Search for voluntary labels encoded together with data values.
  - ISBN No: 0814756654
  - ISBN No: 0789204592
• 4. Data value formats
  - 18/09/2008 : 20080918
  - 03/18/98 : 19980318
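Steps 3 and 4 above can be illustrated with a small sketch. It is not taken from any of the systems named on the slides: the function names `split_embedded_label` and `normalize_date`, the colon separator, and the list of candidate date formats are all assumptions, chosen only to reproduce the slide's examples.

```python
from datetime import datetime

# Assumed candidate formats, tried in order (not from the slides).
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%y")

def split_embedded_label(text):
    """Split 'Label: value' into (label, value); label is None without a colon."""
    label, sep, value = text.partition(":")
    if not sep:
        return None, text.strip()
    return label.strip(), value.strip()

def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None

print(split_embedded_label("ISBN No: 0814756654"))  # ('ISBN No', '0814756654')
print(normalize_date("18/09/2008"))                 # 20080918
print(normalize_date("03/18/98"))                   # 19980318
```

Normalizing values to one canonical format lets the same date written two ways be compared as equal during value matching.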
• 1. Value-level matching
  - Data value similarity
• 2. Label-level matching
  - Label co-occurrence
• 3. Label-value matching
  - Check the assigned label
  - Assign a suitable label to columns
  - Matching conflict resolution
• 1. Matching is unique → create attribute
• 2. Matching is 1:1 → alias
  - Category : Subject
• 3. Matching is 1:n → n+1 attributes
  - Author: {Last Name, First Name}
• 4. Matching is n:m → n:1 + 1:m
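The first three matching cases above can be sketched as updates to a toy ontology. This is only an illustration under stated assumptions: the dictionary layout, the function name `apply_matching`, and the choice to store "Category" as an alias under "Subject" are hypothetical, and the n:m case (which the slide reduces to n:1 plus 1:m) is left out.

```python
def _new_attribute():
    # Hypothetical record layout for one ontology attribute.
    return {"aliases": set(), "parts": []}

def apply_matching(ontology, label, matched):
    """Update `ontology` for `label`, given the labels it was matched to."""
    if not matched:
        # Unique: nothing to merge with, so create a new attribute.
        ontology.setdefault(label, _new_attribute())
    elif len(matched) == 1:
        # 1:1: record the label as an alias of the matched attribute.
        ontology.setdefault(matched[0], _new_attribute())["aliases"].add(label)
    else:
        # 1:n: keep the composite attribute plus its n parts (n+1 attributes).
        ontology.setdefault(label, _new_attribute())["parts"] = list(matched)
        for part in matched:
            ontology.setdefault(part, _new_attribute())
    return ontology

ontology = {}
apply_matching(ontology, "Title", [])
apply_matching(ontology, "Category", ["Subject"])
apply_matching(ontology, "Author", ["Last Name", "First Name"])
print(sorted(ontology))
# ['Author', 'First Name', 'Last Name', 'Subject', 'Title']
```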
• One result page → one data area
• Maximum Entropy Model
  - Maximum Correlation Subtree Identification
• 1 result
• Several results (CABABABAD)
  - Find continuous repeated patterns
  - Visual gap
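Finding a continuous repeated pattern in a sequence such as CABABABAD can be sketched as follows. This is a brute-force illustration, not the method of any system on the slides; it assumes each letter stands for one child node of the data area, and the function name `find_repeated_pattern` is hypothetical. Visual gaps are not modeled.

```python
def find_repeated_pattern(tags, min_repeats=2):
    """Return (start, unit, count) for the consecutive repetition that covers
    the most positions, or None. Ties go to the shortest unit, then the
    earliest start."""
    best = None  # (covered_length, start, unit, count)
    n = len(tags)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            unit = tags[start:start + size]
            count = 1
            # Count how many times the unit repeats back to back.
            while tags[start + count * size:start + (count + 1) * size] == unit:
                count += 1
            covered = count * size
            if count >= min_repeats and (best is None or covered > best[0]):
                best = (covered, start, unit, count)
    return None if best is None else best[1:]

print(find_repeated_pattern("CABABABAD"))  # (1, 'AB', 3)
```

Here the unit AB repeats three times starting at index 1, which suggests three records of the form AB surrounded by non-record nodes.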
• Each data value is assigned a label
  - Maximum Entropy Model
  - Match with ontology
• Label → Column
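Once every data value carries a label, the label-to-column step amounts to placing values with the same label in the same column. A minimal sketch, assuming each record is a list of (label, value) pairs; the function name `align_by_label` and the placeholder titles are hypothetical, while the ISBN values are the ones from the earlier slide.

```python
def align_by_label(records):
    """Return (columns, rows): one column per label, one row per record."""
    columns = []
    for record in records:
        for label, _ in record:
            if label not in columns:
                columns.append(label)  # keep first-seen label order
    rows = []
    for record in records:
        by_label = dict(record)
        # Missing (optional) values become None.
        rows.append([by_label.get(label) for label in columns])
    return columns, rows

records = [
    [("Title", "Book A"), ("ISBN", "0814756654")],
    [("ISBN", "0789204592"), ("Title", "Book B"), ("Price", "10.00")],
]
columns, rows = align_by_label(records)
print(columns)  # ['Title', 'ISBN', 'Price']
print(rows)     # [['Book A', '0814756654', None], ['Book B', '0789204592', '10.00']]
```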
• Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
• ViDRE: Data Record Extractor
• ViDIE: Data Item Extractor
• New measure: revision
• 1. Build a Visual Block tree
• 2. Extract data records
  - Noise block filtering
  - Block clustering
  - Block regrouping
• 3. Partition data records into data items and align them
• Mandatory data items
• Optional data items
• Static data items
• Simple one-pass clustering algorithm
  - Take the first block from the list and use it to form a cluster.
  - For each remaining block, compute its similarity to the existing clusters.
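The one-pass clustering above can be sketched as follows. The slide does not say what happens after the similarities are computed, so this sketch assumes a common rule: a block joins its most similar cluster if the similarity reaches a threshold, and otherwise starts a new cluster. The toy `visual_similarity` function, the feature names, and the threshold value are likewise assumptions, not ViDE's actual measure.

```python
def visual_similarity(a, b):
    # Toy measure: fraction of visual features that are equal (assumption).
    keys = ("font", "size", "color")
    return sum(a[k] == b[k] for k in keys) / len(keys)

def one_pass_cluster(blocks, similarity, threshold=0.8):
    clusters = []
    for block in blocks:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = max(similarity(block, member) for member in cluster)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([block])  # the first block always lands here
        else:
            best.append(block)
    return clusters

blocks = [
    {"font": "Arial", "size": 14, "color": "blue", "text": "Title 1"},
    {"font": "Arial", "size": 10, "color": "black", "text": "Price 1"},
    {"font": "Arial", "size": 14, "color": "blue", "text": "Title 2"},
    {"font": "Arial", "size": 10, "color": "black", "text": "Price 2"},
]
clusters = one_pass_cluster(blocks, visual_similarity)
print([[b["text"] for b in c] for c in clusters])
# [['Title 1', 'Title 2'], ['Price 1', 'Price 2']]
```

Each block is visited once, so the result depends on the order of the input list.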
• ViDE assumes
  - 1. Blocks in the same cluster all come from different data records.
  - 2. The cluster with the maximum number n of blocks may contain the mandatory values of the data records.
• Step 1: Rearrange the blocks in each cluster.
• Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
• Step 3: For every block (in all clusters), determine which group it belongs to.
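The three regrouping steps can be sketched as below. The slide does not state how a block's group is decided, so this sketch makes a strong simplifying assumption: blocks are ordered by a vertical position `y`, every record starts with its seed block, and a block therefore belongs to the closest seed at or above it. The field names and the function name `regroup` are hypothetical.

```python
from bisect import bisect_right

def regroup(clusters):
    # Step 1: order the blocks in each cluster by page position.
    ordered = [sorted(c, key=lambda b: b["y"]) for c in clusters]
    # Step 2: the largest cluster is the seed; one group per seed block.
    seed = max(ordered, key=len)
    seed_ys = [b["y"] for b in seed]
    groups = [[b] for b in seed]
    # Step 3: assign every other block to a group (assumed rule, see above).
    for cluster in ordered:
        if cluster is seed:
            continue
        for block in cluster:
            idx = max(bisect_right(seed_ys, block["y"]) - 1, 0)
            groups[idx].append(block)
    return groups

clusters = [
    [{"y": 0, "text": "Title 1"}, {"y": 100, "text": "Title 2"}, {"y": 200, "text": "Title 3"}],
    [{"y": 120, "text": "Price 2"}, {"y": 20, "text": "Price 1"}, {"y": 220, "text": "Price 3"}],
    [{"y": 140, "text": "Discount 2"}],
]
groups = regroup(clusters)
print([[b["text"] for b in g] for g in groups])
# [['Title 1', 'Price 1'], ['Title 2', 'Price 2', 'Discount 2'], ['Title 3', 'Price 3']]
```

Each resulting group corresponds to one data record; the optional "Discount 2" block ends up only in the record where it appears.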
• WDBt: total number of web databases processed
• WDBc: number of web databases whose precision and recall are both 100%
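The slides define WDBt and WDBc but not the formula for the revision measure. Assuming revision is the share of web databases that are not extracted perfectly, i.e. (WDBt - WDBc) / WDBt, it can be computed as below. The formula is an assumption here, and the numbers in the example are made up for illustration, not experimental results.

```python
def revision(wdb_total, wdb_correct):
    """Fraction of web databases needing manual revision (assumed formula)."""
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must be between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total

# Hypothetical numbers: 100 web databases, 85 extracted perfectly.
print(revision(100, 85))  # 0.15
```

Under this reading, a lower revision means fewer sites whose extraction has to be corrected by hand.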
[Figure: page tree with Root, Data Area (LCA), and alternating Record and Separator nodes]
• Real-estate domain
• 60 agents' websites
  - MRP: 95.0%
  - ERP: 90.0%
[Figure: page tree with Root and Data Area, in which Records 1, 2 and 3 are each split into Part A and Part B]
• DIADEM 0.1:
  - Construct real-estate result page ontology
  - Ontological record segmentation
    - (More features)
  - Data labeling and data alignment
• After:
  - Add visual information