SemTech 2011 Semantic Search tutorial

1. Peter Mika| Yahoo Research, Spain
[email protected]
Thanh Tran | Institute AIFB, KIT, Germany
[email protected]
Semantic Search TutorialIntroduction

2. About the speakers
Peter Mika
Semantic Search group at Yahoo! Barcelona
Semantic Search, Web Object Retrieval, Natural Language Processing
Tran Duc Thanh
Semantic Search group at AIFB
Semantic Search, Semantic Data Management, Linked Data Query Processing
3. Agenda
Introduction (5 min)
Semantic Web data (50 min)
The RDF data model
Publishing RDF
Crawling and indexing RDF data
Query processing (35 min)
Ranking (25 min)
Result presentation (15 min)
Semantic Search evaluation (15 min)
Demos (15 min)
Questions (5 min)
4. Why Semantic Search? I.
We are at the beginning of search. (Marissa Mayer)
Solved large classes of queries, e.g. navigational
Heavy investment in computational power
Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition
Background knowledge and metadata can help to address poorly solved queries
5. Poorly solved information needs
Ambiguous searches
parishilton
Long tail queries
george bush (and I mean the beer brewer in Arizona)
Multimedia search
parishilton sexy
Imprecise or overly precise searches
jimhendler
pictures of strong adventures people
Precise searches for descriptions
countries in africa
32 year old computer scientist living in barcelona
reliable digital camera under 300 dollars
Many of these queries would not be asked by users, who learned over time what search technology can and can not do.
6. Example: multiple interpretations
7. Why Semantic Search? II.
The Semantic Web is now a reality
Large amounts of data published in RDF
Heterogeneous data of varying quality
Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain
Searching data instead or in addition to searching documents
Direct answers
Novel search tasks
8. Information box with content from and links to Yahoo! Travel
Example: direct answers in search
Points of interest in Vienna, Austria
Shopping results from
Yahoo! Shopping
Since Aug, 2010, regular search results are Powered by Bing
9. Novel search tasks
Aggregation of search results
e.g. price comparison across websites
Analysis and prediction
e.g. world temperature by 2020
Semantic profiling
recommendations based on particular interests
Semantic log analysis
understanding user behavior in terms of objects
Support for complex tasks
e.g. booking a vacation using a combination of services
10. Document retrieval and data retrieval
Information Retrieval (IR) support the retrieval of documents (document retrieval)
Representation based on lightweight syntax-centric models
Work well for topical search
Not so well for more complex information needs
Web scale
Database (DB)and Knowledge-based Systems (KB) deliver more precise answers (data retrieval)
More expressive models
Allow for complex queries
Retrieve concrete answers that precisely match queries
Not just matching and filtering, but also joins
Limitations in scalability
11. Combination of document and data retrieval
Documents with metadata
Metadata may be embeddedinside the document
Im looking for documents that mention countries in Africa.
Data retrieval
Structured data, but searchable text fields
Im looking for directors, who have directed movies where the synopsis mentions dinosaurs.
12. Semantic Search
Target (combination of) document and data retrieval
Semantic search is a retrieval paradigm that
Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content
Incorporates the intent of the query and the meaning of content into the search process (semantic models)
Wide range of semantic search systems
Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
13. Semantic models
Semantics is concerned with the meaning of the resources made available for search
Various representations of meaning
Linguistic models: models of relationships among words
Taxonomies, thesauri, dictionaries of entity names
Inference along linguistic relations, e.g. broader/narrower terms
Natural language search
Conceptual models: models of relationships among objects
Ontologies capture entities in the world and their relationships
Inference along domain-specific relations
Knowledge-based search
We will focus on conceptual models in this tutorial
In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
14. Semantic Search a process view
Knowledge Representation
Semantic Models
Resources
Documents
DocumentRepresentation
15. Semantic Search systems
For data / document retrieval, semantic search systems might combine a range of techniques, ranging from statistics-based IR methods for ranking, database methods for efficient indexing and query processing, up to complex reasoning techniques for making inferences!
16. Semantic Web data
17. Semantic Web
Sharing data across the Web
Standard data model
RDF
A number of syntaxes (file formats)
RDF/XML, RDFa
Powerful, logic-based languages for schemas
OWL, RIF
Query languages and protocols
HTTP, SPARQL
18. Resource Description Framework (RDF)
Each resource (thing, entity) is identified by a URI
Globally unique identifiers
Locators of information
Data is broken down into individual facts
Triples of (subject, predicate, object)
A set of triples (an RDF graph) is published together in an RDF document
RDF document
foaf:Person
type
example:roi
name
Roi Blanco
19. Linking resources
Friend-of-a-Friend ontology
Rois homepage
type
example:roi
foaf:Person
name
Roi Blanco
sameAs
Yahoo
type
worksWith
example:roi2
example:peter
email
[email protected]
20. Publishing RDF
Linked Data
Data published as RDF documents linked to other RDF documents
Community effort to re-publish large public datasets (e.g. Dbpedia, open government data)
RDFa
Data embedded inside HTML pages
Recommended for site owners by Yahoo, Google, Facebook
SPARQL endpoints
Triple stores (RDF databases) that can be queried through the web
21. Linked Data
A web of RDF documents in parallel to the current Web
often implemented as wrappers around databases or APIs
The four rules of Linked Data:
Use URIs to identify things.
Use HTTPURIs so that these things can be referred to and looked up ("dereference") by people and user agents.
Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF-XML.
Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
22. Linked Data
Advantages:
No change to the publishing of the HTML documents
Data can be published by third party (e.g. Dbpedia)
Disadvantages:
Web servers need to be configured to properly handle URIs that identify concepts instead of documents
Not favored by search engines
Lack of use cases
Crawling needs to be changed
Authority is difficult to determine
Tools
Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
RDB-to-RDF mappers (e.g. D2RQ, Triplify)
Validators (Vapour)
Linked Data browsers (many)
23. The state of Linked Data
Rapidly growing community effort to (re)publish open datasets as Linked Data
In particular, scientific and government datasets
see linkeddata.org
Less commercial interest, real usage
24. Metadata in HTML
1995: HTML meta tags
1996: Simple HTML Ontology Extensions (SHOE)
1998: RDF/XML
RDF/XML in HTML
RDF linked from HTML
2003: Web 2.0
Tagging
Microformats
Metadata in Wikipedia
Machine tags in Flickr
2005: eRDF
2008: RDFa 1.0
2011: RDFa 1.1
2012: Microdata?
25. HTML meta tags

href= "http://www.cs.vu.nl/~pmika/foaf.rdf">

26. Microformats (f)
Agreements on the way to encode certain kinds metadata in HTML
Reuse of semantic-bearing HTML elements
Based on existing standards
Minimality
Microformats exist for a limited set of objects
hCard (persons and organizations)
hCalendar (events)
hResume
hProduct
hRecipe
Varying degrees of support and stability
hCard and rel-tag are widely supported
Community centered around microformats.org
Specifications and discussions are hosted there
27. Microformats: limitations
No shared syntax
Each microformat has a separate syntax tailored to the vocabulary
No formal schemas
Limited reuse, extensibility of schemas
Unclear which combinations are allowed
No datatypes
No namespaces, unique identifiers (URIs)
no interlinking
mapping between instances is required
Always appears in the HTML
28. Example: the hCard microformat

Eric Meyer wrote a post (

Tax Relief) about an unintentionally humorous letter he received from
the
Internal Revenue Service
.
29. RDFa
W3C standard for embedding RDF data in HTML documents
A set of new HTML attributes to be used in head or body
A specification of how to extract the data from these attributes
RDFa is just a syntax, you have to choose a vocabulary separately
RDFa 1.0 is a W3C Recommendation since October, 2008
RDFa Primer
RDFa 1.1 is a small update on RDFa to make it easier to use
Currently Working Draft (March 31, 2011)
Updated version of the RDFa Primer (April 19, 2011)
RDFa API for accessing RDFa data in a webpage in the browser from JavaScript
Currently Working Draft (April 19, 2011)
30. RDFa 1.1
Changes
New vocab attribute to define the default namespace for the document or subtree
Profile documents to define multiple namespace prefixes
The prefix attribute as a recommended replacement of xmlns
You can use URIs even where only CURIEs where allowed before
RDFa 1.1 is backward compatible with RDFa 1.0
RDFa 1.1 is recommended if you want to use HTML5
31. When to use RDFa
Choose microformats when you find a microformat that fits your needs and supported by your consumers
Microformats are first option because they are simple
Yahoo supports all major microformats, see the documentation
Its a common misconception that RDFa requires XHTML or that its compatible with HTML5
Its compatible with HTML4, HTML5, XHTML
If you find none that perfectly fits your needs then you need RDFa
Microformats have a fixed schema: you can not add your own attributes
Example: a social networking site with user profiles
VCard is a good candidate, but for example it doesnt have a way to express the users social connections
You either live without this, or go with RDFa
32. Example: Facebooks Like and the Open Graph Protocol
The Like button provides publishers with a way to promote their content on Facebook and build communities
Shows up in profiles and news feed
Site owners can later reach users who have liked an object
Facebook Graph API allows 3rd party developers to access the data
Open Graph Protocol is an RDFa-based format that allows to describe the object that the user Likes
33. Example: Facebooks Open Graph Protocol
RDF vocabulary to be used in conjunction with RDFa
Simplify the work of developers by restricting the freedom in RDFa
Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
Only HTML accepted

The Rock (1996)

...
34. Example: Yahoo! Enhanced Results (was: SearchMonkey)
Guide for publishers to mark-up their pages for common types of objects
Product, Local, News, Video, Events, Documents, Discussion, Games
Using popular microformats and RDF vocabularies
Copy-paste code
Validator
Yahoo as a consumer
See later
35. Example: Googles Rich Snippets
Google accepts popular microformats and its own RDFa vocabulary
Similar approach to RDFa as Facebook
Validator to check if the markup is correct
Google displays enhanced results based on this metadata
Rich Snippets
36. RDFa on the rise
510% increase between March, 2009 and October, 2010
Percentage of URLs with embedded metadata in various formats
37. Microdata
Currently under standardization at the W3C
Originally part of the HTML5 spec, but now a separate document
Similar to microformats, but with the extensibility of RDFa
Introduce new terms using reverse domain names or full URIs
HTML5 also has a number of semantic elements such as , ,
38. Microdata example

My name is Neil.

My band is called
Four Parts Water.
I was born on
May 10th 2009.

39. The state of metadata in HTML
5-10% of webpages contain some explicit metadata
Depending on how you count
Too many competing approaches
Too many formats: microformats vs RDFa vs Microdata
When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone
40. Crawling the Semantic Web
Linked Data
Similar to HTML crawling, but the the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled
Semantic Sitemap/VOID descriptions
RDFa
Same as HTML crawling, but data is extracted after crawling
Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
SPARQL endpoints
Endpoints are not linked, need to be discovered by other means
Semantic Sitemap/VOID descriptions
41. Data fusion
Ontology matching
Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org
Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies
Entity resolution
Logic-based approaches in the Semantic Web
Studied as record linkage in the database literature
Machine learning based approaches, focusing on attributes
Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to RDF data
Improvements over only attribute based matching
Blending
Merging objects that represent the same real world entity and reconciling information from multiple sources
42. Data quality assessment and curation
Heterogeneity, quality of data is an even larger issue
Quality ranges from well-curated data sets (e.g. Freebase) to microformats
In the worst of cases, the data becomes a graph of words
Short amounts of text: prone to mistakes in data entry or extraction
Example: mistake in a phone number or state code
Quality assessment and data curation
Quality varies from data created by experts to user-generated content
Automated data validation
Against known-good data or using triangulation
Validation against the ontology or using probabilistic models
Data validation by trained professionals or crowdsourcing
Sampling data for evaluation
Curation based on user feedback
43. Indexing
Search requires matching and ranking
Matching selects a subset of the elements to be scored
The goal of indexing is to speed up matching
Retrieval needs to be performed in milliseconds
Without an index, retrieval would require streaming through the collection
The type of index depends on the query model to support
DB-style indexing
IR-style indexing
44. IR-style indexing
Index data as text
Create virtual documents from data
One virtual document per subgraph, resource or triple
typically: resource
Key differences to Text Retrieval
RDF data is structured
Minimally, queries on property values are required
45. Horizontal index structure
Two fields (indices): one for terms, one for properties
For each term, store the property on the same position in the property index
Positions are required even without phrase queries
Query engine needs to support the alignment operator
Dictionary is number of unique terms + number of properties
Occurrences is number of tokens * 2
46. Vertical index structure
One field (index) per property
Positions are not required
But useful for phrase queries
Query engine needs to support fields
Dictionary is number of unique terms
Occurrences is number of tokens
Number of fields is a problem for merging, query performance
47. Indexing using MapReduce
MapReduce is the perfect model for building inverted indices
Map creates (term, {doc1}) pairs
Reduce collects all docs for the same term: (term, {doc1, doc2}
Sub-indices are merged separately
Term-partitioned indices
Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
48. Query Processing
49. Structure
Taxonomy of retrieval approaches
Query processing for semantic search
Types of semantic data
Formalisms for querying semantic data
Approaches
General task: hybrid graph pattern matching
Matching keyword query against text
Matching structured query against structured data
Matching keyword query against structured data
Matching structured query against text (a hybrid case)
Main tasks, challenges and opportunities
50. Taxonomy of retrieval approaches (1)
Data retrieval problem
A collection of resources represented by the data model G
Information needs expressed as queries in Q
Retrieval is the task of efficiently computing results from G that are relevant to the queries in Q
Document retrieval vs. data retrieval
Differences in query and data representation and matching
Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries
Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)
51. Taxonomy of retrieval approaches (2)
Exact
Complete
Sound
Query
Matching

Approximate

52. Not complete 53. Not sound 54. Ranked 55. Best effort 56. Top-kData
Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)!
57. Query processing for Semantic Search (1)
The underlying collection of resources is represented by semantic data G ranging from
Structured data with well defined schemas
Semi-structured data with incomplete or no schemas
Data that largely comprise text
Hybrid / embedded data
Targeted information needs Q are of varying complexity, captured using different formalisms and querying paradigms
Natural language texts and keywords
Form-based inputs
Formal structured queries
Semantic search, mainly, is the task of efficiently computing results(query processing) from G that are relevantto the queries in Q (ranking)
Keywords
NL Questions
Form- / facet-based Inputs
Structured Queries (SPARQL)
Query
Matching
Data
OWL ontologies with rich, formal semantics
Structured RDF data
Semi-Structured RDF data
RDF data embedded in text (RDFa)
Textual Data
Keyword query on textual data
(IR/document retrieval)
Structured query on textual data
Semantic Search target different group of users, information needs, and types of data. Query processing for semantic search is hybrid combination of techniques!
Unstructured Query
Structured
Query
Keyword query on structured data
Structured query on structured data
(DB/data retrieval)
Structured Data
60. Types of data models (1)
Textual
Bag-of-words
Represent documents, text in structured data,, real-world objects (captured as structured data)
Lacks structure
in text, e.g. linguistic structure, hyperlinks, (positional information)
Structure in structured data representation
term (statistics)
In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.
combination
Cloud
Computing
Technologies
solutions
management
`big data'
industry
solutions
support
complex

Textual
Structured
Resource Description Framework (RDF)
Represent real-world objects, services, applications, . documents
Resource attribute values and relationships between resources
Schema
Picture
creator
Person
Bob
Textual
Structured
Hybrid
RDF data embedded in text (RDFa)
63. Types of data models RDFa (1)

adoptedfrom : http://www.w3.org/TR/xhtml-rdfa-primer/
64. Types of semantic data RDFa (2)
Bob is a good friend of mine. We went to the same university, andalso shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:
content
content
adoptedfrom : http://www.w3.org/TR/xhtml-rdfa-primer/
65. Types of semantic data - conclusion
Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges represent different types of relationships including explicit semantic relationships and vaguely specified ones such as hyperlinks!
66. Formalisms for querying semantic data (1)
Example information need
Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.
Unstructured queries
Fully-structured queries
Hybrid queries: unstructured + structured
Unstructured
NL
Keywords
apartment
Berlin
Alice
shared
Information about a friend of Alice, who shared an apartment with her in Berlinand knows someone working at KIT.
Unstructured
Fully-structured
SPARQL: BGP, filter, optional, union, select, construct, ask, describe
PREFIX ns:
SELECT ?x
WHERE { ?x ns:knows ? y. ?y ns:name Alice.
?x ns:knows ?z.?z ns: works ?v. ?v ns:name KIT }
Fully-structured
Unstructured
Hybrid: content and structure constraints
shared apartment Berlin Alice
?x ns:knows ? y. ?y ns:name Alice.
?x ns:knows ?z.?z ns: works ?v.
?v ns:name KIT
Fully-structured
Unstructured
Hybrid: content and structure constraints
shared apartment Berlin Alice
?x ns:knows ? y. ?y ns:name Alice.
?x ns:knows ?z.?z ns: works ?v.
?v ns:name KIT
71. Formalisms for querying semantic data - conclusion
Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
72. Processing hybrid graph patterns (1)
?x ns:knows ?z.?z ns: works ?v. ?v ns:name KIT
apartment shared Berlin Alice
?y ns:name Alice. ?x ns:knows ? y
age
works
34
trouble with bob
FluidOps
Peter
sunset.jpg
Bob is a good friend of mine. We went to the same university, andalso shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:
Beautiful
Sunset
author
title
Semantic Search
Germany
Alice
creator
author
creator
knows
year
Germany
2009
Bob
Thanh
knows
located
works
KIT
73. Processing hybrid graph patterns (2)
Matching hybrid graph patterns against data
74. Matching keyword query against text

Retrieve documents

75. Inverted list (inverted index)keyword {,
, ...}

AND-semantics: top-k join

shared
Berlin
Alice
shared
Berlin
Alice
D1
D1
D1
shared
berlin
alice
=
=
shared
76. Matching structured query against structured data

Retrieve data for triple patterns

77. Index on tables 78. Multiple redundant indexes to cover different access patterns 79. Join (conjunction of triples) 80. Blocking, e.g.linear merge join (required sorted input) 81. Non-blocking, e.g. symmetric hash-join 82. Materialized join indexes?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name KIT
Per1 ns:works?v
?vns:name KIT
SP-index
PO-index
=
=
=
Per1 ns:worksIns1
Ins1ns:name KIT
Per1 ns:works Ins1
Ins1 ns:name KIT
83. Matching keyword query against structured data

Retrieve keyword elements

84. Using inverted indexkeyword {, ,}

Exploration / Join

85. Data indexes for triple lookup 86. Materialized index (paths up to graphs) 87. Top-k Steiner tree search, top-k subgraph exploration Alice
Bob
Bob
KIT
Alice
KIT

=
=
Alice ns:knowsBob
Inst1ns:name KIT
Bobns:worksInst1
88. Matching structured query against text

Based on offline IE (offline see Peters slides)

89. Based on online IE, i.e., retrieve is as follows 90. Derive keywords to retrieve relevant documents 91. On-the-fly information extraction, i.e., phrase pattern matchingX title Y 92. Retrieve extracted data for structured part 93. Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp. 94. Index 95. Inverted index for document retrieval and pattern matching 96. Join index inverted index for storing materializedjoins between keywords 97. Neighborhood indexes for phrase patterns?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name KIT
KIT
name
knows
name
KIT
Hybrid case
98. Query processing main tasks
Retrieval
Documents , data elements, triples, paths, graphs
Inverted index,, but also other indexes (B+ tree)
Index documents, triples materialized join paths
Join
Different join implementations, efficiency depends on availability of indexes
Non-blocking join good for early result reporting and for unpredictable linked data scenario
Query
Matching
Data
99. Query processing more tasks
Disjunction, aggregation, grouping
Join order optimization
Approximate
Approximate the search space
Approximate the results (matching, join)
Parallelization
Top-k
Use only some entries in the input streams to produce k results
Multiple sources
Federation, routing
On-the-fly mapping, similarity join
Hybrid
Join text and data
Query
Matching
Data
100. Query processing on the Web - research challenges and opportunities
Large amount of semantic data
Data inconsistent, redundant, and low quality
Large amount of data embedded in text
Large amount of sources
Large amount of links between sources

Optimization parallelization,

101. Approximation 102. Hybrid querying and data management 103. Federation, routing 104. Online schema mappings 105. Similarity join

SemTech 2011 Semantic Search tutorial

Education

Transcript of SemTech 2011 Semantic Search tutorial