Web Search Environments
Web Crawling Metadata using RDF and Dublin Core
Dave Beckett
http://purl.org/net/dajobe/
Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/
Introduction
• Overview of SGs and Web Crawling
• Why WSE, what’s new? Novel results
• Future work (or stuff we didn’t do) and conclusions
Overview
• Digital Library community
• In UK, subject-specific gateways (SGs)
• Want to improve: scope (more), timeliness (fresh), cost (less)
• Stay professional – the Quality word
• Compete with web search engines – the Google Test
Human Cataloguing of the Web
• Pros: High quality, domain knowledge selection, subject-specialised, cataloguing done to well-known and developed standards
• Cons: Expensive, slow, descriptions need to be reviewed regularly to keep them relevant
Software running web crawls
• Pros: vastly comprehensive (Con: too much), can be very up-to-date
• Cons: cannot distinguish “this page sucks” from “this page rocks”, indiscriminate, subject to spamming, very general (but…)
Combining Web Crawling and High Quality Description
A solution
• Seed the web crawl from high quality records
• Crawl to other (presumably) good quality pages
• Track the provenance of the crawled pages
• Provenance can be used for querying and result ranking
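The seeding and provenance steps above can be sketched roughly as follows. This is a minimal illustration and not the WSE implementation: the link graph, the seed record URI, the `crawl` function and the depth limit are all assumptions made up for the example, and a real crawl would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the live web:
# page URL -> list of URLs that page links to.
LINKS = {
    "http://example.org/a": ["http://example.org/a/1", "http://example.org/b"],
    "http://example.org/a/1": ["http://example.org/a/2"],
    "http://example.org/b": [],
    "http://example.org/a/2": [],
}

# Hypothetical seeds: catalogued resource URL -> URI of the record describing it.
SEEDS = {"http://example.org/a": "http://gateway.example/record/42"}


def crawl(seeds, links, max_depth=2):
    """Breadth-first crawl from the seeds, recording provenance per page."""
    provenance = {}  # page URL -> (seed record URI, link distance from the seed)
    queue = deque((url, record, 0) for url, record in seeds.items())
    while queue:
        url, record, depth = queue.popleft()
        if url in provenance or depth > max_depth:
            continue  # already visited, or too far from the seed
        provenance[url] = (record, depth)
        for target in links.get(url, []):
            queue.append((target, record, depth + 1))
    return provenance


pages = crawl(SEEDS, LINKS)
# One possible use of provenance for ranking: pages closer to a seed come first.
ranked = sorted(pages, key=lambda url: pages[url][1])
```

Every crawled page keeps a pointer to the human-written record it was reached from, which is what makes later querying and ranking by provenance possible.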
Web Search Environments (WSE) Project
• Research by ILRT and later Resource Discovery Network (RDN)
• RDN funds UK SGs (ILRT also had DutchESS)
WSE Technologies
• Simple Dublin Core (DC) records extracted from SGs
• OAI protocol used to collect these records in one place (not required)
• Combine Web Crawler
• RDF framework to connect the resource descriptions together
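As a rough illustration of the record collection step, the sketch below builds a harvesting request and reads Dublin Core fields out of a response fragment. It assumes OAI-PMH 2.0 conventions (`verb=ListRecords`, `metadataPrefix=oai_dc`); the slides do not say which protocol version WSE used. The gateway URL, the sample record and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def list_records_url(base_url):
    """Build a ListRecords request URL asking for simple Dublin Core."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


# Hand-written sample of the metadata part of one harvested record (assumption).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example resource</dc:title>
  <dc:description>A hand-written description.</dc:description>
  <dc:identifier>http://example.org/a</dc:identifier>
  <dc:source>http://gateway.example/record/42</dc:source>
</oai_dc:dc>"""


def parse_dc(xml_text):
    """Return the Dublin Core elements of one record as a name -> text dict."""
    root = ET.fromstring(xml_text)
    record = {}
    for child in root:
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]  # strip the namespace prefix
            record[name] = (child.text or "").strip()
    return record


url = list_records_url("http://gateway.example/oai")
record = parse_dc(SAMPLE)
```

Since the slides mark central collection as not required, the same parsing could equally run against records read directly from each gateway.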
Simple DC Records
Really simple:
• Title
• Description
• Identifier (URI of resource)
• Source (URI of record)
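The four fields could be modelled as a small immutable structure. The class name `SimpleDCRecord` and the sample values are assumptions for illustration; only the field list comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleDCRecord:
    """One catalogue record reduced to the four fields listed on the slide."""
    title: str
    description: str
    identifier: str  # URI of the described resource
    source: str      # URI of the catalogue record itself


# Hypothetical example values.
record = SimpleDCRecord(
    title="Example resource",
    description="A hand-written description.",
    identifier="http://example.org/a",
    source="http://gateway.example/record/42",
)
```

Keeping both URIs matters later: `identifier` is where a crawl would start, and `source` is what a crawled page can be linked back to.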
Information model 1
• DC records describe all the resources
• Web crawler reads these and returns crawled web pages
• These generate a new web crawled resource
Information model 2
• Link back to original record(s), plus web page properties
• RDF model lets these be connected via page, record URIs
• Giving one large RDF graph of the total information
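A minimal sketch of how the two information-model slides fit together, using plain Python tuples as RDF-style triples instead of an RDF library. The `WSE` vocabulary (`crawledFrom`, `hops`) is invented for the example and is not the project's actual schema; only the Dublin Core namespace is real.

```python
DC = "http://purl.org/dc/elements/1.1/"
WSE = "http://example.org/wse#"  # hypothetical vocabulary for crawl properties

triples = set()  # the whole graph: (subject, predicate, object)


def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))


record_uri = "http://gateway.example/record/42"
resource_uri = "http://example.org/a"
page_uri = "http://example.org/a/1"

# Statements taken from the human-written DC record.
add(resource_uri, DC + "title", "Example resource")
add(resource_uri, DC + "source", record_uri)

# Statements produced by the crawler about a page it reached.
add(page_uri, WSE + "crawledFrom", resource_uri)
add(page_uri, WSE + "hops", "1")


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def records_for_page(page):
    """Follow page -> seed resource -> catalogue record through shared URIs."""
    return [
        rec
        for res in objects(page, WSE + "crawledFrom")
        for rec in objects(res, DC + "source")
    ]


linked = records_for_page(page_uri)
```

Because both sources of statements use the same URIs, no explicit join table is needed; the shared identifiers are what turn them into one graph.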
WSE graph
Novel Outcomes?
It is obvious that:
• Metadata gathering is not new (Harvest)
• Web crawling is not new (Lycos)
• Cataloguing is not new (1000s of years)
So what is new?
WSE – Areas Not Focused
I digress…
• Gathering data together – not crucial, Combine is a distributed harvester
• Full text indexing – not optimised
• Web crawling algorithm – the routes through the web were not selected in a sophisticated way
WSE – General Benefits
• Connecting separate systems (one less place needed to go)
• RDF graph allows more data mixing (not fragile)
• Leverages existing systems (Combine, Zebra), standards (RDF, DC)
WSE – Novel Searching
• “game theory napster” – zero hits
• Cross-subject searching in one system – “gmo”
• Can navigate resulting provenance
WSE – Gains
• Web crawling gains from high quality human description
• SGs gain from increase in relevant pages
• Fresher content than human-catalogued resource
• More focused than a general search engine
WSE as a new tool
• For subject experts
• Which includes cataloguers
• Gives fast, relevant search (no formal precision, recall analysis)
WSE – new areas
• Cross-subject searching possible in subjects not yet catalogued, or that fall between SGs
• Searching emerging topics is possible ahead of additions to catalogue standards
• Helps indicate where new SGs, thesauri are needed
WSE – deploying
• ILRT WSE
• RDN WSE
• RDN – investigating for the main search system
WSE for SGs
Individual SGs – enhancing subject-specific searches:
• Deep / full web crawling of high quality sites
• Granularity of cataloguing and cost
It is better for humans to describe entire sites (or large parts) and let the software do the detailed work of individual pages
Future
• Improve and target the crawling
• Use the SG information with result ranking
• Add other relevant data to the graph such as RSS news
• A Semantic Web application
Questions?
• Thank You
• Slides: http://ilrt.org/people/cmdjb/talks/tnc/2002/
• Project: http://wse.search.ac.uk/
References
• Combine Web Crawler: http://www.lub.lu.se/combine/
• Dublin Core: http://dublincore.org/
• ILRT: http://ilrt.org/
• RDF: http://www.w3.org/RDF/
• Semantic Web: http://www.w3.org/2001/sw/