Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.
![Page 1: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/1.jpg)
Building Knowledge Bases from the Web
Rajeev Rastogi, Yahoo! Labs Bangalore
![Page 2: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/2.jpg)
The Web is a vast repository of human knowledge
Basic premise
![Page 3: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/3.jpg)
Diverse information spanning multiple verticals
• Wikipedia, Product, Business, People, …
![Page 4: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/4.jpg)
Grand challenge
Mine the Web to build knowledge bases (KBs) of people, places, things, events,…
| Name | Address | Phone |
| --- | --- | --- |
| Chinese Mirch | 120 Lexington Ave (between 28th St & 29th St) New York, NY 10016 | (212) 532-3663 |

| Camera | Aspect Ratio | Mega-pixels |
| --- | --- | --- |
| Canon Powershot 600 | 4:3 | 0.5 |
| Olympus D-300L | 4:3 | 0.8 |

| Product Name | List Price | Sale Price |
| --- | --- | --- |
| Apple iPod nano 8 GB Black (5th Generation) | $145.00 | $139.99 |

| Name | Affiliation | # connections |
| --- | --- | --- |
| Rajeev Rastogi | Yahoo! Labs Bangalore | 142 |
![Page 5: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/5.jpg)
What did search look like in the past?
![Page 6: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/6.jpg)
Search results of the future: Structured abstracts
yelp.com
babycenter
epicurious
answers.com
webmd
New York Times
Gawker
![Page 7: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/7.jpg)
Rank by price
Comparison shopping
![Page 8: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/8.jpg)
Product near me
![Page 9: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/9.jpg)
Topic entity pages
Celebrity Music Videos
Related Topics
Relevant multimedia content, including music, videos, and information from Wikipedia
A topic based page automatically generated in real time
Up to the minute: Latest info using News feeds, blogs, Twitter, Flickr to stay up to date on Madonna
![Page 10: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/10.jpg)
Building KBs from the Web is a hard problem
• Billions of pages with diverse structure, conflicting information, noise
Example sites: yelp.com, superpages.com
![Page 11: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/11.jpg)
Page content/structure changes constantly
Old / New
• ~2% of sites change each day
![Page 12: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/12.jpg)
KB creation pipeline
1. Content acquisition: acquire content from the Web
2. Information extraction: extract structured data for entities from Web pages
3. Disambiguation & integration: identify and integrate data for each entity
Example entity: Roma Bistro Paris (shown twice in the figure)
![Page 13: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/13.jpg)
IE example
Fields on the page: Name, Address, Cuisine, Phone, Price, Reviews

| Name | Address | Phone |
| --- | --- | --- |
| Chinese Mirch | 120 Lexington Ave New York, NY 10016 | (212) 532-3663 |
![Page 14: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/14.jpg)
Template-based Web pages
• From head/torso sites
• Pages have similar structure
• ~30% of crawled Web pages
• Information rich: 31% of search results
![Page 15: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/15.jpg)
Hand-crafted pages
• Mainly from tail sites
• Pages have diverse structures
![Page 16: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/16.jpg)
Browse pages
Similar-structured records
![Page 17: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/17.jpg)
Unstructured text
![Page 18: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/18.jpg)
Web extraction landscape
[Figure: extraction techniques (pattern-based, wrapper, record identification, content matching, machine learning models) laid out by the signal they exploit (structure: site structure, page structure; content: content redundancy, content features, context) and by page type (unstructured text; template-based pages; hand-crafted and browse pages)]
Systems cited in the figure: Snowball [AG 00], HCRF [ZNWZM 06], MLN [YCWZZM 09], RoadRunner [CMM 01], DEPTA [ZL 05], [KWD 97], [MMK 99], [GRST 10]
![Page 19: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/19.jpg)
Web extraction landscape
![Page 20: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/20.jpg)
Wrapper induction
• Technique for extraction from template-based pages
Learn: Website pages → Cluster → Sample pages → Annotate pages → Annotations → Learn rules → XPath rules
Extract: Website pages → Apply rules → Records
Monitor: Monitor rules (site change)
![Page 21: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/21.jpg)
Clustering pages
• Group structurally similar pages using shingle signatures
![Page 22: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/22.jpg)
Page shingle signature
Tags: html body @id textarea @id div /div /textarea … br/ /body /html
Steps: tags → windows → hash → min
Example: window hashes 55, 5, 20, 30 → shingle (min) = 5
Page signature: vector of shingles
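The tags → windows → hash → min steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the window size of 4, the use of MD5 with a seed prefix to simulate several hash functions, and the regex-based tag extraction (which ignores attribute names such as @id) are not from the talk.

```python
import hashlib
import re


def tag_sequence(html):
    """Return opening and closing tag names in document order (text and attributes ignored)."""
    return [m.group(1).lower() for m in re.finditer(r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)", html)]


def shingle_signature(tags, window=4, num_hashes=8):
    """Build a vector of shingles: for each hash function, the minimum hash over all tag windows."""
    # Slide a fixed-size window over the tag sequence (assumed window size).
    windows = [" ".join(tags[i:i + window]) for i in range(max(1, len(tags) - window + 1))]
    signature = []
    for seed in range(num_hashes):
        # A seed prefix stands in for an independent hash function (assumption).
        hashes = [
            int.from_bytes(hashlib.md5(f"{seed}:{w}".encode()).digest()[:4], "big")
            for w in windows
        ]
        signature.append(min(hashes))
    return signature
```

Two pages generated from the same template but carrying different text yield the same signature here, because only the tag sequence is hashed; pages could then be grouped by comparing signatures.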
![Page 23: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/23.jpg)
Rule learning
XPath generalization: /html/body/div/div/div/div/div/div/span[@class="tel"] → //span[@class="tel"]
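A small sketch of why the generalized rule is preferable, using Python's standard `xml.etree.ElementTree`. The two page snippets and the `extract` helper are made up for illustration, and because ElementTree evaluates paths relative to the root element, the absolute rule is written as `./body/...` rather than `/html/body/...`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a layout change that removes one wrapper div.
OLD_PAGE = "<html><body><div><div><span class='tel'>(212) 532-3663</span></div></div></body></html>"
NEW_PAGE = "<html><body><div><span class='tel'>(212) 532-3663</span></div></body></html>"

ABSOLUTE_RULE = "./body/div/div/span[@class='tel']"  # full path learned from the old page
GENERAL_RULE = ".//span[@class='tel']"               # generalized rule


def extract(page, rule):
    """Apply an XPath rule to a page and return the text of every matching node."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(rule)]
```

On `OLD_PAGE` both rules return the phone number; on `NEW_PAGE` only `GENERAL_RULE` still does, since the absolute path no longer exists.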
![Page 24: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/24.jpg)
Learning robust XPaths
Lattice of candidate XPaths, specialized step by step starting from //*:
• //*
• //h1, //span
• //span[@class=tel], //*[@class=tel]
Most general XPath: the most general XPath that matches all the annotated values and none of the un-annotated values
• Use Apriori to generate candidate XPaths
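The search for the most general XPath can be sketched as a level-wise walk over the lattice, trying candidates with fewer constraints first. This is a simplification, not the talk's algorithm: it enumerates all constraint subsets of one annotated node rather than doing Apriori pruning, it only considers the node's own tag and attributes, and the function names are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET


def candidate_xpaths(node):
    """Yield XPaths that match `node`, from fewest constraints to most."""
    features = [("tag", node.tag)] + [("attr", item) for item in sorted(node.attrib.items())]
    for size in range(len(features) + 1):
        for combo in itertools.combinations(features, size):
            tag, predicates = "*", ""
            for kind, value in combo:
                if kind == "tag":
                    tag = value
                else:
                    predicates += f"[@{value[0]}='{value[1]}']"
            yield f".//{tag}{predicates}"


def most_general_xpath(root, annotated):
    """Return the first (most general) XPath selecting exactly the annotated nodes, or None."""
    wanted = {id(node) for node in annotated}
    for xpath in candidate_xpaths(annotated[0]):
        if {id(node) for node in root.findall(xpath)} == wanted:
            return xpath
    return None
```

For a page where only one element has class `tel`, this returns `.//*[@class='tel']`; if some un-annotated element also carries that class, the search falls through to `.//span[@class='tel']`.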
![Page 25: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/25.jpg)
Detecting site changes
During Learn
• For each cluster, store the page signature and extracted record for a small number of pages
Monitoring
• Crawl the pages daily and compare page signatures and extracted records
Timeline: Day 0 to Day n: signature & record match; Day m: signature/record mismatch
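A minimal sketch of the monitoring check, assuming each monitored page is keyed by URL and that a record is a flat attribute-to-value mapping. The `Snapshot` type and function names are illustrative, not part of the deployed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    signature: tuple  # page shingle signature
    record: tuple     # sorted (attribute, value) pairs of the extracted record


def snapshot(signature, record):
    """Freeze a signature and an extracted record so two crawls can be compared."""
    return Snapshot(tuple(signature), tuple(sorted(record.items())))


def mismatched_pages(stored, current):
    """Return URLs whose signature or record differs from what was stored at learn time."""
    return [url for url, snap in stored.items() if current.get(url) != snap]
```

A non-empty result on some day would be the signal that the site has changed and its rules need to be re-learned.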
![Page 26: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/26.jpg)
Wrapper system deployed in Yahoo!
• 250M extractions from 200 sites (product, business)
• Avg num of clusters per site: 24
• Avg num of pages annotated per cluster: 1.6
[Chart: average precision / recall (%), one series each for precision and recall; y-axis from 86 to 102]
![Page 27: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/27.jpg)
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive
![Page 28: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/28.jpg)
Holy grail of IE research
• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
• OK to annotate pages from a few sites initially to create training data
![Page 29: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/29.jpg)
Web extraction landscape
![Page 30: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/30.jpg)
Key observation
yelp.com superpages.com
• Web sites contain redundant content (that is, pages for same entity)
![Page 31: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/31.jpg)
Content matching approach
• Step 1: Populate seed database from a few initial sites (wrappers → seed DB)

Seed DB:

| Name | Address |
| --- | --- |
| Chinese Mirrch | 120 Lexington Ave, New York, NY 10016 |
| Tiffin Wallah | 127 E 28th St New York, NY 10079 |
![Page 32: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/32.jpg)
Content matching approach
• Step 2: Match values in a page from the new site with seed record values

Seed DB:

| Name | Address |
| --- | --- |
| Chinese Mirrch | 120 Lexington Ave, New York, NY 10016 |
| Tiffin Wallah | 127 E 28th St New York, NY 10079 |
![Page 33: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/33.jpg)
Content matching approach
• Step 3: Use matched values to extract records, expand seed database (new site Web pages → wrappers → seed DB)

Seed DB:

| Name | Address |
| --- | --- |
| Chinese Mirrch | 120 Lexington Ave, New York, NY 10016 |
| Tiffin Wallah | 127 E 28th St New York, NY 10079 |
| 21 Club (new record) | 21 W 52nd St New York, NY 10019 |
![Page 34: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/34.jpg)
Key challenge 1
• Diverse attribute value representations (impacts recall)

| Name | Address |
| --- | --- |
| Chinese Mirrch | 120 Lexington Ave, New York, NY 10016 |
| Tiffin Wallah | 127 E 28th St New York, NY 10079 |

Callouts in the figure: spelling error, variant
![Page 35: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/35.jpg)
Key challenge 2
• Noisy attribute value matches (impacts precision)

| Name | Address |
| --- | --- |
| Chinese Mirrch | 120 Lexington Ave, New York, NY 10016 |
| Tiffin Wallah | 127 E 28th St New York, NY 10079 |

Callout in the figure: noisy match
![Page 36: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/36.jpg)
Baseline similarity measure
• Use q-grams to handle spelling errors
Weak similarity (WS) = cosine similarity between IDF-weighted q-grams

| String | 3-grams |
| --- | --- |
| chinese mirch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch } |
| chinese mirrch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch } |
• Weight of a q-gram (attribute-specific) = Sum of the IDFs of the words it appears in
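The baseline measure can be sketched as below. The q-gram weighting is left as a pluggable `weight` function that defaults to 1.0 for every q-gram, which is an assumption made to keep the sketch self-contained; the attribute-specific IDF weights described above would be supplied through that parameter. Function names are illustrative.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Return the q-grams of a string, with spaces written as '#' as on the slide."""
    s = text.lower().replace(" ", "#")
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def weak_similarity(a, b, weight=lambda gram: 1.0, q=3):
    """Cosine similarity between weighted q-gram count vectors of two strings."""
    va = {g: c * weight(g) for g, c in Counter(qgrams(a, q)).items()}
    vb = {g: c * weight(g) for g, c in Counter(qgrams(b, q)).items()}
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0
```

With unit weights, "chinese mirch" and "chinese mirrch" share 10 of their 11 and 12 3-grams, giving 10 / sqrt(11 · 12) ≈ 0.87, so the misspelling still scores as a close match.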
![Page 37: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/37.jpg)
Strong similarity
| Address (Seed DB) | Address (Web site) | WS |
| --- | --- | --- |
| 120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53 |
| 312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49 |

Callout in the figure: templatized content

Strong similarity is defined between two sets of strings
1. Calculate the matching pattern between weakly similar pairs in the two sets
2. Pick matching patterns with sufficient "support"
3. Use only portions selected by the matching pattern in the final similarity calculation
![Page 38: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/38.jpg)
Computing matching pattern
Seed DB value: 120 Lexington Avenue New York NY 10016
Web site value: 120 Lexington Ave (Between 28th And 29th St) New York NY 10016
Edge weights shown in the figure: 1 1 1 1 1 1
1. Perform max-weight bipartite matching to find matching words
   • Edge weight = Jaccard similarity over 3-grams
2. Form segments by grouping contiguous matching words
3. Assign each segment si a label
   • 0 if non-matching
   • j if matching segment s'j
Matching pattern: 103 103
Segments s1 s2 s3 carry labels 1 0 3; segments s'1 s'2 s'3 carry labels 1 0 3
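Two pieces of the procedure above are easy to illustrate in isolation: the edge weight (Jaccard similarity over 3-grams of two words) and the grouping of contiguous matching words into segments. The sketch below deliberately does not implement the max-weight bipartite matching itself; the per-word `matched` flags are taken as given input, and all names are hypothetical.

```python
from itertools import groupby


def trigrams(word):
    """Set of 3-grams of a word; words shorter than 3 characters count as one gram."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)} or {w}


def jaccard(a, b):
    """Edge weight between two words: Jaccard similarity over their 3-gram sets."""
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)


def segments(words, matched):
    """Group contiguous words that share the same matched/unmatched flag."""
    return [
        (flag, [word for word, _ in group])
        for flag, group in groupby(zip(words, matched), key=lambda pair: pair[1])
    ]
```

For the Web site value above, flagging the five words of "(Between 28th And 29th St)" as unmatched yields three segments (matched, unmatched, matched), which corresponds to the s1 s2 s3 split on the slide.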
![Page 39: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/39.jpg)
Strong similarity score computation
Strong similarity: similarity between matching segments of values
Support of matching pattern: # distinct matching segments
Support(103 103) = 2
Strong similarity is only computed for patterns with sufficient support
![Page 40: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/40.jpg)
Need for support of a matching pattern
Support(010 010) = 1
Hence strong similarity = weak similarity
![Page 41: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/41.jpg)
Pruning noisy matches
| Name | Address |
| --- | --- |
| Chinese Mirrch | 120 Lexington Ave, New York, NY 10016 |
| Tiffin Wallah | 127 E 28th St New York, NY 10079 |

Match marks in the figure: ✓ ✓ ✗
• Match combinations of values in page
• Prune combinations that don't match attribute values in any seed record
![Page 42: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/42.jpg)
Apriori-style enumeration
Positions on the page: X1, X2, X3
Round 1:
• <Name, X1> (sup=2)
• <Addr, X2> (sup=2)
• <Name, X3> (sup=2)
Round 2:
• <Name, X1> <Addr, X2> (sup=2)
• <Name, X3> <Addr, X2> (sup=0)
• Prune attribute position combinations with low support
  – support = # pages in which values at positions match attribute values in a seed record
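The rounds above can be reproduced with a small level-wise enumeration. This is a sketch under simplifying assumptions: values are compared by exact equality instead of the similarity measures from the talk, the `min_support` threshold of 1 is arbitrary, and the data layout (each page as a position-to-value dict, each seed record as an attribute-to-value dict) is invented for illustration.

```python
from itertools import combinations


def support(assignment, pages, seed):
    """Number of pages where some single seed record agrees with every <attribute, position> pair."""
    return sum(
        any(all(page.get(pos) == rec.get(attr) for attr, pos in assignment.items()) for rec in seed)
        for page in pages
    )


def frequent_assignments(pages, seed, attrs, positions, min_support=1):
    """Grow attribute-position combinations round by round, pruning those with low support."""
    level = [{a: p} for a in attrs for p in positions]
    level = [c for c in level if support(c, pages, seed) >= min_support]
    frequent = list(level)
    while level:
        seen, next_level = set(), []
        for c1, c2 in combinations(level, 2):
            # Skip pairs that place the same attribute at different positions.
            if any(c1[a] != c2[a] for a in c1.keys() & c2.keys()):
                continue
            merged = {**c1, **c2}
            key = frozenset(merged.items())
            # Keep only merges that add exactly one attribute at a distinct position.
            if len(merged) != len(c1) + 1 or len(set(merged.values())) != len(merged) or key in seen:
                continue
            seen.add(key)
            if support(merged, pages, seed) >= min_support:
                next_level.append(merged)
        frequent.extend(next_level)
        level = next_level
    return frequent
```

With two pages where X3 holds the name of a different restaurant than X1 and X2 describe, <Name, X3> survives round 1 but <Name, X3> <Addr, X2> gets support 0 in round 2 and is pruned, matching the slide.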
![Page 43: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/43.jpg)
Experimental results
Datasets
Attributes:

| Restaurant | Bibliography |
| --- | --- |
| Name (core) | Title (core) |
| Address (core) | Author (core) |
| Phone | Source |
| Payment | |
| Cuisine | |
![Page 44: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/44.jpg)
Strong vs Weak similarity
• Extraction precision of WS and SS is comparable; precision increases with threshold
• Coverage of SS is steady with respect to threshold; coverage of WS drops at high thresholds
![Page 45: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/45.jpg)
Strong similarity scores
SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs
| String 1 | String 2 | WS | SS |
| --- | --- | --- | --- |
| 980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1 |
| 1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 | 0.74 |
![Page 46: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/46.jpg)
Extraction Precision
![Page 47: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/47.jpg)
Coverage
Seed data size (Restaurant)
![Page 48: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.](https://reader033.fdocuments.us/reader033/viewer/2022061305/5514320d550346ec488b6006/html5/thumbnails/48.jpg)
Summary
• Web is a vast repository of human knowledge
• Building (structured) knowledge base can improve search, help users find relevant information
• Key challenge: Unsupervised information extraction from Web pages
• Content redundancy on Web can be used for unsupervised extraction with high precision
• Future work
  – Handling numeric attributes, browse pages
  – Detecting and integrating records for the same entity