The Architecture of a Large-Scale Web Search and Query Engine
Copyright 2005 Digital Enterprise Research Institute. All rights reserved.
www.deri.org
Andreas Harth
Joint work with Aidan Hogan, Juergen Umbrich, Stefan Decker
Current Search Climate
• Major search engines (Google, Yahoo, Microsoft) offer keyword searches over hypertext documents
• Search engines handle general keyword searches well, but are poor at expressing complex queries:
– e.g. podcasts about gardening
– e.g. pictures of your home town
– e.g. people that Rudi Studer knows
– e.g. pictures of friends of Norman Walsh
– e.g. weather-related WSDL services
• Smaller sites, such as online social networks, scientific databases, digital libraries, collaborative data repositories, etc., provide semantically rich data and offer specialized search interfaces – mostly backed by relational databases
Semantic Web Search Engine
• Data integration on Web scale, to leverage structured data available under open licenses
• Allow people to pose queries over the integrated corpus
• Allow for programmatic access to the corpus (via SPARQL)
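As a rough illustration of programmatic access, the sketch below phrases one of the earlier example queries ("people that Rudi Studer knows") in SPARQL and encodes it as an HTTP GET request. The endpoint URL is a placeholder, since the slides do not name one, and the query shape assumes the data uses the FOAF vocabulary.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give a SPARQL endpoint URL.
ENDPOINT = "http://example.org/sparql"

# "People that Rudi Studer knows", assuming FOAF-described people.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?friendName WHERE {
  ?person foaf:name "Rudi Studer" .
  ?person foaf:knows ?friend .
  ?friend foaf:name ?friendName .
}
"""


def build_request_url(endpoint, query):
    """Encode a query as a GET request using the SPARQL protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The sketch only builds the request URL; sending it and parsing the result format are left out.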
Hypertext Web vs. Semantic Web
[Figure: a hypertext page contrasted with an RDF graph. Hypertext side: the "Rental Car Group" page at http://www.rentalcargroup.com/ and its "Terms & Conditions" page at http://www.rentalcargroup.com/locationallterms.php ("What is included? -- Unlimited Mileage – ..."). RDF side: nodes ex:andh, ex:aidhog, foaf:Person, "Andreas Harth", "Aidan Hogan" and mailto:[email protected]; edge labels rdf:type, foaf:name, foaf:mbox, foaf:knows and rdfs:seeAlso; documents http://sw.deri.org/~aidanh/foaf/foaf.rdf, http://www.harth.org/~andreas/foaf.rdf and http://sw.deri.org/~aidanh/foaf.rdf]
Topical Subgraphs
• First, match nodes in a large graph satisfying a query
• Then, select the surrounding nodes and arcs
The topical subgraph contains all information required to further process results
[Figure legend: RDF Resource, RDF Literal, RDF Property; expansion levels n=1, n=2, n=3]
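The two steps above can be sketched as a breadth-first expansion from the matched nodes. This is a minimal sketch, not the SWSE implementation: the function name is hypothetical, and it assumes that n counts hops from a matched node and that arcs are followed in both directions.

```python
from collections import deque


def topical_subgraph(triples, seeds, n):
    """Return the triples within n hops of any seed node, ignoring arc direction."""
    # Adjacency list: each node maps to (neighbour, connecting triple) pairs.
    neighbours = {}
    for triple in triples:
        s, _p, o = triple
        neighbours.setdefault(s, []).append((o, triple))
        neighbours.setdefault(o, []).append((s, triple))

    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    selected = set()
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue  # do not expand beyond n hops
        for other, triple in neighbours.get(node, []):
            selected.add(triple)
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return selected


# Illustrative sample data, not taken from the slides.
triples = [
    ("ex:andh", "foaf:knows", "ex:aidhog"),
    ("ex:aidhog", "foaf:name", "Aidan Hogan"),
    ("ex:andh", "foaf:name", "Andreas Harth"),
]
for n in (1, 2, 3):
    print(n, len(topical_subgraph(triples, ["Aidan Hogan"], n)))
```

Growing n pulls in more of the surrounding graph, which trades result size against how much context is available for later processing.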
Semantic Web Search Engine Architecture
[Figure: architecture diagram with the components Crawler, Extraction, Consolidation, Indexing, Index, Query Proc, Ranking and UI]
Obtaining Information
• Data from the HTML Web
– DMOZ sites
• Data from the XML Web
– CiteSeer
– DBLP
– RSS, Podcasts
• Data from the RDF Web
– DMOZ categories
– SwissProt
– Wikipedia
– FOAF, SIOC, DC, …
Optimized Index on Quadruples
• Data model: subject/predicate/object/context
• 16 different lookup patterns for quads (nodes substituted by variables) – e.g. (s, ?, ?, ?), (?, p, o, ?), …
• Naive solution: put a separate index on each of s, p, o, and c, and compute joins for combinations
• But: joins are costly
• Solution: 16 indexes to cover all quadruple patterns
• But: very costly to maintain 16 indexes
• Indexes with concatenated keys allow access patterns to be re-used – saves 10 indexes
• Huffman coding to save space on disk and in memory
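The saving of 10 indexes can be checked with a short sketch: a sorted index over concatenated keys answers any pattern whose bound positions form a prefix of the key order, so six well-chosen orderings cover all 16 patterns. The six orderings below are an assumption (one set that works); the slide does not list them. The function names and sample quads are hypothetical too.

```python
import bisect
from itertools import combinations

# Assumed key orderings: the slide does not say which six are used.
ORDERINGS = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]


def covering_index(bound):
    """Return an ordering whose leading positions are exactly the bound positions."""
    bound = frozenset(bound)
    for order in ORDERINGS:
        if frozenset(order[:len(bound)]) == bound:
            return order
    return None


def build_index(quads, order):
    """Sorted list of quads with fields rearranged into the given key order."""
    return sorted(tuple(q["spoc".index(pos)] for pos in order) for q in quads)


def lookup(quads, pattern):
    """Answer a pattern such as {'p': 'foaf:name'} with a prefix range scan."""
    order = covering_index(pattern.keys())
    index = build_index(quads, order)  # a real system would keep this prebuilt
    prefix = tuple(pattern[pos] for pos in order[:len(pattern)])
    results = []
    for key in index[bisect.bisect_left(index, prefix):]:
        if key[:len(prefix)] != prefix:
            break  # left the contiguous range that shares the prefix
        results.append(tuple(key[order.index(pos)] for pos in "spoc"))
    return results


# All 16 patterns: every subset of the four positions may be bound.
patterns = [frozenset(c) for k in range(5) for c in combinations("spoc", k)]
uncovered = [p for p in patterns if covering_index(p) is None]

# Illustrative quads, not taken from the slides.
quads = [
    ("ex:andh", "foaf:knows", "ex:aidhog", "ctx:1"),
    ("ex:andh", "foaf:name", "Andreas Harth", "ctx:1"),
    ("ex:aidhog", "foaf:name", "Aidan Hogan", "ctx:2"),
]
print(len(patterns), len(uncovered), lookup(quads, {"p": "foaf:name"}))
```

Each lookup is a single range scan over one sorted index, which avoids the joins of the naive per-position design.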
Providing Information to the Casual User
• Ranking required in case of large result sets
• Link-based ranking algorithms (such as PageRank, HITS) not applicable to directed labeled graphs
• ReConRank:
– link-based ranking on structured data
– can exploit labeled links
– takes into account provenance of data
– operates on topical subgraph – local ranking yields higher quality
Example Input Dataset
• Example graph returned by keyword search for “ReConRank”, n = 1
• 4 keyword hits (red outline)
• 4 rankable resources (yellow outline)
[Figure: RDF graph. Resource nodes include ex:andh, ex:aidhog, ex:06reconrank and ex:98pagerank. Literals include “Andreas Harth”, “Aidan Hogan”, “ReConRank: A Scalable Ranking Algorithm for ...”, “ReConRank Metadata Page”, “YARS, Semantic Web, RDF, ReConRank” and “Semantic Web, Ranking, PageRank, RDF, ReConRank”. Classes: foaf:Person, foaf:Document. Edge labels: rdf:type, dc:title, dc:relation, foaf:name, foaf:mbox, foaf:homepage, foaf:interest, foaf:knows, foaf:currentProject, foaf:publications, rdfs:seeAlso. Other nodes: mailto:[email protected], http://sw.deri.org/~aidanh/, http://sw.deri.org/2005/07/n3rank/doap.rdf, http://sw.deri.org/~aidanh/foaf/foaf.rdf]
Example Input Dataset (with Context)
• Example graph returned by keyword search for “ReConRank”, n = 1
• 4 keyword hits (red outline)
• 4 rankable resources (yellow outline)
[Figure: RDF graph annotated with contexts. Resource nodes include ex:andh, ex:aidhog, ex:06reconrank and ex:98pagerank. Literals include “Andreas Harth”, “Aidan Hogan”, “ReConRank: A Scalable Ranking Algorithm for ...”, “ReConRank Metadata Page”, “YARS, Semantic Web, RDF, ReConRank” and “Semantic Web, Ranking, PageRank, RDF, ReConRank”. Classes: foaf:Person, foaf:Document. Edge labels: rdf:type, dc:title, dc:relation, foaf:name, foaf:mbox, foaf:homepage, foaf:interest, foaf:knows, foaf:currentProject, foaf:publications, rdfs:seeAlso. Other nodes: mailto:[email protected], http://sw.deri.org/~aidanh/. Contexts: http://sw.deri.org/~aharth/foaf.rdf, http://sw.deri.org/~aidanh/foaf/foaf.rdf, http://sw.deri.org/2005/07/n3rank/doap.rdf]
Solution: Combined Resource Context Graph
• Shown is the result of combining the resource graph with the context graph, including the implied links (depicted with hollow green arrowheads)
• Graph is well connected!
[Figure: combined graph over the resources ex:andh, ex:aidhog and ex:06reconrank and the contexts http://sw.deri.org/~aharth/foaf.rdf, http://sw.deri.org/~aidanh/foaf/foaf.rdf and http://sw.deri.org/2005/07/n3rank/doap.rdf; edge labels: foaf:knows, foaf:currentProject, foaf:publications, rdfs:seeAlso]
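To make the idea of a combined graph concrete, the sketch below builds one from quads and ranks its nodes. It is not the ReConRank algorithm: the ranking step is a plain PageRank-style power iteration that ignores edge labels, and the implied links (each subject linked to and from the context that describes it) are an assumption about how they are derived. The quads and the short context names are illustrative.

```python
def combined_edges(quads):
    """Resource links plus assumed implied links between subjects and their contexts."""
    edges = set()
    for s, _p, o, c in quads:
        edges.add((s, o))  # link from the resource graph
        edges.add((s, c))  # assumed implied link: resource -> describing context
        edges.add((c, s))  # assumed implied link: context -> described resource
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power iteration; nodes without out-links spread their rank evenly."""
    nodes = sorted({node for edge in edges for node in edge})
    out = {node: [] for node in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = out[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank


# Illustrative quads with resource-valued objects only; edges and context names are assumed.
QUADS = [
    ("ex:andh", "foaf:knows", "ex:aidhog", "ctx:aharth"),
    ("ex:andh", "foaf:currentProject", "ex:06reconrank", "ctx:aharth"),
    ("ex:aidhog", "foaf:currentProject", "ex:06reconrank", "ctx:aidanh"),
    ("ex:aidhog", "rdfs:seeAlso", "ctx:doap", "ctx:aidanh"),
]
ranks = pagerank(combined_edges(QUADS))
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {node}")
```

Because contexts are nodes in the same graph, the scores of documents and of the resources they describe influence each other, which is one way to take provenance into account.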
Performance Evaluation
Conclusion
• SWSE is a distributed system for processing large amounts of Web content
• Crawler does syntax integration
• Storage component features keyword index and complete index on quads for fast lookups
• Ranking is scalable and fast, applicable to arbitrary RDF, but needs more quality evaluation
• Design philosophy: keep the system simple, to be able to optimize and distribute easily
• Algorithms designed for distributed setting -- partition the data and task at hand and distribute to many machines
http://swse.deri.org/
• Prototype online with dataset crawled starting from ISWC 2006 web site
• plus DBLP in RDF
• plus Wikipedia in WikiOnt
• Acknowledgements: DERI Lion (SFI/02/CE1/l131)