The Architecture of a Large-Scale Web Search and Query Engine

Post on 19-Jan-2016


1 Copyright 2005 Digital Enterprise Research Institute. All rights reserved.

www.deri.org

The Architecture of a Large-Scale Web Search and Query Engine

Andreas Harth

Joint work with Aidan Hogan, Juergen Umbrich, Stefan Decker

2

Current Search Climate

• Major search engines (Google, Yahoo, Microsoft) offer keyword searches over hypertext documents

• Search engines handle general searches well, but are poor at expressing complex queries:

– e.g. podcasts about gardening

– e.g. pictures of your home town

– e.g. people that Rudi Studer knows

– e.g. pictures of friends of Norman Walsh

– e.g. weather-related WSDL services

• Smaller sites, such as online social networks, scientific databases, digital libraries, collaborative data repositories, etc. provide semantically rich data and offer specialized search interfaces – mostly backed by relational databases

3

Semantic Web Search Engine

• Data integration on Web scale, to leverage structured data available under open licenses

• Allow people to pose queries over the integrated corpus

• Allow for programmatic access to the corpus (via SPARQL)
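To make the idea of a structured query concrete (for instance "people that Rudi Studer knows"), here is a minimal Python sketch that answers such a query by matching triple patterns against a tiny in-memory graph. The sample data, the `ex:` identifiers and both helper functions are hypothetical illustrations and are not part of SWSE. The SPARQL text in the docstring is only an approximate equivalent.

```python
# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:rudi", "foaf:name", "Rudi Studer"),
    ("ex:rudi", "foaf:knows", "ex:alice"),
    ("ex:rudi", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def names_of_people_known_by(triples, name):
    """Join three triple patterns, roughly equivalent to:

    SELECT ?n WHERE { ?x foaf:name "<name>" . ?x foaf:knows ?y . ?y foaf:name ?n }
    """
    result = []
    for person, _, _ in match(triples, p="foaf:name", o=name):
        for _, _, friend in match(triples, s=person, p="foaf:knows"):
            for _, _, friend_name in match(triples, s=friend, p="foaf:name"):
                result.append(friend_name)
    return sorted(result)


print(names_of_people_known_by(TRIPLES, "Rudi Studer"))
```

A keyword engine has no way to express the join on `?x` and `?y`; a query language over structured data does.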

4

Hypertext Web vs. Semantic Web

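[Figure: two examples side by side. Hypertext Web: a page titled "Rental Car Group" (http://www.rentalcargroup.com/) and a "Terms & Conditions" page (http://www.rentalcargroup.com/locationallterms.php) with the text "What is included? -- Unlimited Mileage -- ...". Semantic Web: an RDF graph with the resources ex:andh and ex:aidhog, the class foaf:Person, the literals "Andreas Harth" and "Aidan Hogan", the mailbox mailto:andreas.harth@deri.org, and arcs labelled rdf:type, foaf:name, foaf:mbox, foaf:knows and rdfs:seeAlso. Documents shown: http://sw.deri.org/~aidanh/foaf/foaf.rdf, http://www.harth.org/~andreas/foaf.rdf, http://sw.deri.org/~aidanh/foaf.rdf.]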


6

Topical Subgraphs

• First, match nodes in a large graph satisfying a query

• Then, select the surrounding nodes and arcs

• The topical subgraph contains all information required to further process results

[Figure: example graph with a legend distinguishing RDF Resource, RDF Literal and RDF Property, and subgraph boundaries marked n=1, n=2 and n=3.]
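The two steps above can be sketched as a breadth-first expansion from the matched nodes. This is only an illustration under stated assumptions: the function name `topical_subgraph` is hypothetical, arcs are followed in both directions, and n is read as the number of hops from a matched node. The slides do not spell out these details.

```python
from collections import deque


def topical_subgraph(triples, seeds, n):
    """Return the set of triples within n hops of the seed (matched) nodes.

    Assumption of this sketch: arcs are followed in both directions.
    """
    # Adjacency list over an undirected view of the graph.
    neighbours = {}
    for s, p, o in triples:
        neighbours.setdefault(s, []).append((o, (s, p, o)))
        neighbours.setdefault(o, []).append((s, (s, p, o)))

    depth = {node: 0 for node in seeds}
    queue = deque(depth)
    selected = set()
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue  # do not expand beyond the n-th hop
        for other, triple in neighbours.get(node, []):
            selected.add(triple)
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return selected


# Hypothetical chain a -> b -> c -> d.
GRAPH = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:b", "ex:p", "ex:c"),
    ("ex:c", "ex:p", "ex:d"),
]
print(sorted(topical_subgraph(GRAPH, ["ex:a"], 2)))
```

Larger n pulls in more context for later processing at the cost of a larger subgraph.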

7

Semantic Web Search Engine Architecture

[Figure: architecture diagram with the components Crawler, Extraction, Consolidation, Indexing, Index, Query Processing, Ranking and UI.]

8

Obtaining Information

• Data from the HTML Web

– DMOZ sites

• Data from the XML Web

– CiteSeer

– DBLP

– RSS, Podcasts

• Data from the RDF Web

– DMOZ categories

– SwissProt

– Wikipedia

– FOAF, SIOC, DC, …

9

Optimized Index on Quadruples

• Data model: subject/predicate/object/context

• 16 different lookup patterns for quads (node substituted by variable) – e.g. (s, ?, ?, ?), (?, p, o, ?), …

• Naive solution: put a separate index on each of s, p, o, and c, and compute joins for combinations

• But: joins are costly

• Solution: 16 indexes to cover all quadruple patterns

• But: very costly to maintain 16 indexes

• An index with concatenated keys allows access patterns to be reused – saves 10 indexes (6 instead of 16)

• Huffman coding to save space on disk and in memory
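The arithmetic behind these numbers: each of the four positions is either bound or a variable, giving 2^4 = 16 patterns, and a sorted index on a concatenated key answers every pattern whose bound positions form a prefix of that key. The Python sketch below shows that six key orderings are enough and uses them for prefix range scans. The six orderings chosen here, the `QuadStore` class and the sorted-list representation are assumptions for illustration; the slides do not list the actual orderings or storage layout, and Huffman coding is not shown.

```python
from bisect import bisect_left
from itertools import product

# Six concatenated-key orderings (an assumption of this sketch). Every one of
# the 16 patterns has its bound positions as a prefix of at least one of them.
ORDERINGS = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]
POS = {"s": 0, "p": 1, "o": 2, "c": 3}


def choose_ordering(bound):
    """Pick an ordering whose first len(bound) positions are exactly `bound`."""
    for ordering in ORDERINGS:
        if set(ordering[:len(bound)]) == set(bound):
            return ordering
    raise ValueError(f"no index covers pattern {bound!r}")


class QuadStore:
    def __init__(self, quads):
        # One sorted list of permuted tuples per ordering.
        self.indexes = {
            ordering: sorted(tuple(q[POS[x]] for x in ordering) for q in quads)
            for ordering in ORDERINGS
        }

    def lookup(self, s=None, p=None, o=None, c=None):
        """Return (s, p, o, c) quads matching the pattern; None is a variable."""
        pattern = {"s": s, "p": p, "o": o, "c": c}
        bound = [k for k, v in pattern.items() if v is not None]
        ordering = choose_ordering(bound)
        prefix = tuple(pattern[x] for x in ordering[:len(bound)])
        index = self.indexes[ordering]
        results = []
        # Binary search to the first key with this prefix, then scan the range.
        i = bisect_left(index, prefix)
        while i < len(index) and index[i][:len(prefix)] == prefix:
            quad = dict(zip(ordering, index[i]))  # undo the permutation
            results.append(tuple(quad[x] for x in "spoc"))
            i += 1
        return results


# All 16 bound/unbound combinations; choose_ordering raises if one is uncovered.
PATTERNS = [[k for k, b in zip("spoc", bits) if b] for bits in product([0, 1], repeat=4)]
COVERED = sum(1 for bound in PATTERNS if choose_ordering(bound))

# Hypothetical sample quads (contexts ex:doc1 and ex:doc2 are made up).
STORE = QuadStore([
    ("ex:andh", "foaf:knows", "ex:aidhog", "ex:doc1"),
    ("ex:andh", "foaf:name", "Andreas Harth", "ex:doc1"),
    ("ex:aidhog", "foaf:name", "Aidan Hogan", "ex:doc2"),
])
print(COVERED, STORE.lookup(p="foaf:name", o="Aidan Hogan"))
```

No lookup needs a join: each pattern maps to a single contiguous range in one index.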

10

Providing Information to the Casual User

• Ranking required in case of large result sets

• Link-based ranking algorithms (such as PageRank, HITS) not applicable to directed labeled graphs

• ReConRank:

– link-based ranking on structured data

– can exploit labeled links

– takes into account provenance of data

– operates on topical subgraph – local ranking yields higher quality

11

Example Input Dataset

• Example graph returned by keyword search for “ReConRank”, n = 1

• 4 keyword hits (red outline)

• 4 rankable resources (yellow outline)

[Figure: RDF graph with the resources ex:06reconrank, ex:98pagerank, ex:andh and ex:aidhog; the classes foaf:Person and foaf:Document; the nodes http://sw.deri.org/~aidanh/, mailto:andreas.harth@deri.org, http://sw.deri.org/2005/07/n3rank/doap.rdf and http://sw.deri.org/~aidanh/foaf/foaf.rdf; the literals "Andreas Harth", "Aidan Hogan", "YARS, Semantic Web, RDF, ReConRank", "Semantic Web, Ranking, PageRank, RDF, ReConRank", "ReConRank: A Scalable Ranking Algorithm for ...", "ReConRank Metadata" and "Page"; and arcs labelled dc:title, dc:relation, rdf:type, foaf:homepage, foaf:interest, foaf:name, foaf:mbox, foaf:knows, foaf:currentProject, foaf:publications and rdfs:seeAlso.]

12

Example Input Dataset (with Context)

• Example graph returned by keyword search for “ReConRank”, n = 1

• 4 keyword hits (red outline)

• 4 rankable resources (yellow outline)

[Figure: the same example graph as on the previous slide, additionally showing the contexts (source documents) http://sw.deri.org/~aharth/foaf.rdf, http://sw.deri.org/~aidanh/foaf/foaf.rdf and http://sw.deri.org/2005/07/n3rank/doap.rdf.]

13

Solution: Combined Resource Context Graph

• Shown is the result of combining the resource graph with the context graph, including the implied links (depicted with hollow green arrowheads)

• The graph is well connected!

[Figure: combined graph with the resources ex:06reconrank, ex:andh and ex:aidhog, the contexts http://sw.deri.org/2005/07/n3rank/doap.rdf, http://sw.deri.org/~aidanh/foaf/foaf.rdf and http://sw.deri.org/~aharth/foaf.rdf, and arcs labelled foaf:knows, foaf:currentProject, foaf:publications and rdfs:seeAlso.]
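To illustrate why a well-connected combined graph matters for link-based ranking, here is a minimal sketch that builds a combined resource/context edge list from quads and runs a plain PageRank-style power iteration over it. This is not the ReConRank algorithm: it ignores link labels, and the direction of the implied links (each context links to the subjects it describes, and each subject links back) is an assumption, as are the function names and the sample contexts `ex:docA` and `ex:docB`.

```python
def combined_edges(quads):
    """Resource-to-resource links plus assumed resource/context links."""
    edges = set()
    for s, p, o, c in quads:
        if o.startswith(("ex:", "http://")):  # crude test: object is a resource
            edges.add((s, o))
        # Assumed implied links between a subject and its source document.
        edges.add((c, s))
        edges.add((s, c))
    return sorted(edges)


def pagerank(edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration; returns a dict node -> score."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    new[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank


# Hypothetical quads loosely modelled on the example graph.
QUADS = [
    ("ex:andh", "foaf:knows", "ex:aidhog", "ex:docA"),
    ("ex:andh", "foaf:currentProject", "ex:06reconrank", "ex:docA"),
    ("ex:aidhog", "foaf:currentProject", "ex:06reconrank", "ex:docB"),
]
RANKS = pagerank(combined_edges(QUADS))
for node, score in sorted(RANKS.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Because contexts and resources end up in one graph, rank can flow between a document and the resources it describes, which is one way to take provenance into account.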

14

Performance Evaluation

15

Conclusion

• SWSE is a distributed system for processing large amounts of Web content

• Crawler does syntax integration

• Storage component features keyword index and complete index on quads for fast lookups

• Ranking is scalable and fast, applicable to arbitrary RDF, but needs more quality evaluation

• Design philosophy: keep the system simple, to be able to optimize and distribute easily

• Algorithms designed for distributed setting -- partition the data and task at hand and distribute to many machines
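One common way to partition data across machines is to hash a key of each record. The sketch below assigns quads to machines by a hash of the subject, so all statements about one resource land on the same machine. The slides do not say how SWSE partitions its data, so the choice of key, the hash function and the function names here are assumptions.

```python
import zlib


def partition_of(key, num_machines):
    """Map a key to a machine number with a stable hash.

    crc32 is used because Python's built-in hash() of strings is salted per
    process and would give different placements on different machines.
    """
    return zlib.crc32(key.encode("utf-8")) % num_machines


def partition_quads(quads, num_machines):
    """Split quads into num_machines lists, keyed by subject."""
    parts = [[] for _ in range(num_machines)]
    for quad in quads:
        parts[partition_of(quad[0], num_machines)].append(quad)
    return parts


# Hypothetical sample quads (context ex:doc1 is made up).
SAMPLE = [
    ("ex:andh", "foaf:knows", "ex:aidhog", "ex:doc1"),
    ("ex:andh", "foaf:name", "Andreas Harth", "ex:doc1"),
    ("ex:aidhog", "foaf:name", "Aidan Hogan", "ex:doc1"),
]
PARTS = partition_quads(SAMPLE, 4)
print([len(part) for part in PARTS])
```

Each machine can then index and query its share independently, and a lookup with a bound subject only needs to contact one machine.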

16

http://swse.deri.org/

• Prototype online with dataset crawled starting from ISWC 2006 web site

• plus DBLP in RDF

• plus Wikipedia in WikiOnt

• Acknowledgements: DERI Lion (SFI/02/CE1/l131)