Publishing data on the Semantic Web

Publishing Data on the Semantic Web Peter Mika Researcher, Data Architect Yahoo! Research


Tutorial given at the University of Oviedo, covering Linked Data and RDFa.

Transcript of Publishing data on the Semantic Web

Page 1: Publishing data on the Semantic Web

Publishing Data on the Semantic Web

Peter Mika

Researcher, Data Architect

Yahoo! Research

Page 2: Publishing data on the Semantic Web

Intro to the Semantic Web

Page 3: Publishing data on the Semantic Web


Vague, but exciting… Berners-Lee and the dawn of the Web

Page 4: Publishing data on the Semantic Web


Semantic Web

• Publish information in a way that is easier to process for machines

• Web of Data instead of Web of Documents

• Two main architectural challenges

– A common format for sharing data

– Sharing the meaning of data

• Through social means (shared schemas)

• By using powerful schema languages

• Semantic Web standards from W3C

– Languages (RDF, OWL, RIF)

– Serializations (RDF/XML, RDFa)

– Protocols (SPARQL, HTTP)

• Semantic Web research into knowledge representation and reasoning, data integration, data quality and many other topics

• Community efforts to publish data and develop schemas

Page 5: Publishing data on the Semantic Web


RDF (Resource Description Framework)

• The basic data model of the Semantic Web

– A universal model to capture all sorts of data: networks, relational, object-oriented…

• Basic unit of information is a triple

– A tuple of (subject, predicate, object)

– Example: (Joe, loves, Mary)

– Each triple gives the value of a property for a given resource or relates two objects to one another

• Object is either a resource or a literal

• An RDF model is a set of triples

– Ordering of statements in an RDF document is irrelevant (unlike XML)
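A minimal Python sketch of the triple model on this slide. It assumes plain tuples stand in for triples and a `frozenset` stands in for an RDF model; both are illustrative choices, not the API of any RDF library.

```python
# A triple is a (subject, predicate, object) tuple; the names are illustrative.
doc_a = [
    ("Joe", "loves", "Mary"),
    ("Joe", "name", "Joe A."),
]

# The same statements in a different order, with one of them repeated.
doc_b = [
    ("Joe", "name", "Joe A."),
    ("Joe", "loves", "Mary"),
    ("Joe", "loves", "Mary"),
]


def to_model(statements):
    """An RDF model is a set of triples: order and repetition carry no meaning."""
    return frozenset(statements)


same_model = to_model(doc_a) == to_model(doc_b)
print(same_model)
```

Because the model is a set, two documents that list the same statements in a different order describe the same model.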

Page 6: Publishing data on the Semantic Web


Resources vs. literals

• Resources are identified by a URI; a resource without a URI is called a blank node

– URIs are a generalization of URLs

– Notation: <http://www.example.org/Person> or ex:Person

• Literals have an optional language tag or datatype (string, integer etc.)

– Literals cannot be subjects of statements

– Datatypes are identified by URIs, e.g. XML Schema datatypes

– Two literals are the same if their components are the same

– Notation: “Joe B.”, “Joe”@en or “Joe”^^<http://…#string>
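The distinction between resources and literals can be sketched in Python. The class names `URIRef` and `Literal` and the helper `make_triple` are assumptions made for this illustration; the only external fact used is the XML Schema string datatype URI.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class URIRef:
    """A resource identified by a URI."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A literal: a lexical form plus an optional language tag or datatype URI."""
    lexical: str
    language: Optional[str] = None
    datatype: Optional[str] = None


def make_triple(subject, predicate, obj):
    """Build a triple, rejecting literals in the subject position."""
    if isinstance(subject, Literal):
        raise TypeError("a literal cannot be the subject of a statement")
    return (subject, predicate, obj)


joe = URIRef("http://www.example.org/Joe")
name = URIRef("http://www.example.org/name")
triple = make_triple(joe, name, Literal("Joe", language="en"))
```

Frozen dataclasses compare field by field, which mirrors the rule that two literals are the same if their components are the same.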

Page 7: Publishing data on the Semantic Web


Advanced topic: Resources vs Literals

• Resources are objects, Literals are strings

• Resources are instances of classes, Literals have datatypes

• Whether something is a resource or literal sometimes depends on the detail of modeling

<meta property="myvocab:knows">Paris Hilton</meta>

<item rel="foaf:knows"><meta property="foaf:name">Paris Hilton</meta>

</item>

• You cannot make statements about literals (literals are always the object in a triple)

• Resources can carry a globally unique identifier, literals have no identity

• Web resources such as documents and images are resources

– <item rel="rdfs:seeAlso" resource="http://www.some.related.page.com/"/>

– <item rel="foaf:img" resource="http://photosite.example.org/photo.jpg"/>

• When in doubt: it’s a resource

Page 8: Publishing data on the Semantic Web


Graphical and textual notation

• A number of ways to serialize an RDF model into an RDF document

– RDF/XML, Turtle, N3, N-Triples

– Example: http://www.cs.vu.nl/~pmika/foaf.rdf

[Figure: RDF graph in which the node my:Joe has a name edge to the literal “Joe A.” and a type edge to foaf:Person]

Page 9: Publishing data on the Semantic Web


Informational versus non-informational resources

• Informational resource: an HTML document, image, any other file on the Web

– Retrievable in its entirety from the Web

– Retrieving it can return a 200 OK

• Conceptual (non-informational) resource: a person, an event, a place, etc.

– A description of it may be retrievable from the Web

– When identified by a URL, retrieving it should return a 303 Redirect

• Never confuse a webpage with what it describes!

– You are not your Facebook profile: one is a document, the other is a person. A document has properties such as byte-size, media-type etc, a person has name, age, etc.

– Make sure you don’t use the URL of an existing webpage as the URI of a resource
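The 200 versus 303 behaviour described above can be sketched as a small lookup function. The URIs and the two tables are hypothetical, and a real server would sit behind an HTTP framework; this only shows the decision.

```python
from typing import Dict, Tuple

# Hypothetical site: the URI of a person and the URI of the page describing them.
DESCRIBED_BY = {
    "http://example.org/id/peter": "http://example.org/doc/peter",
}
DOCUMENTS = {
    "http://example.org/doc/peter": "<html>A page about Peter</html>",
}


def respond(uri: str) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a GET request on `uri`."""
    if uri in DOCUMENTS:
        # Informational resource: it can be returned in its entirety.
        return 200, {}, DOCUMENTS[uri]
    if uri in DESCRIBED_BY:
        # Non-informational resource: redirect to a document that describes it.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    return 404, {}, ""


person_status, person_headers, _ = respond("http://example.org/id/peter")
page_status, _, page_body = respond("http://example.org/doc/peter")
```

Keeping two separate URIs is what lets statements about the person and statements about the page stay apart.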

Page 10: Publishing data on the Semantic Web


Vocabularies (ontologies)

• Ontologies are collections of classes and properties used to describe objects in a particular domain

– OWL (the Web Ontology Language) is the standard ontology language

– OWL has an RDF serialization: ontologies are part of the Semantic Web

• Classes can be described by sub- and superclasses, required properties

– Class membership in RDF is expressed using the rdf:type property

– An instance can have multiple classes (types)

– A class can have multiple superclasses

• Properties can be described by their domain, range, cardinalities, etc.

Page 11: Publishing data on the Semantic Web


RDF is designed for distributed systems

• URIs provide web-wide global identification across documents

– A resource may be described by multiple documents

– We know it’s the same resource because the same URI is used or through reasoning (advanced topic…)

– URIs are intended to be reused

– Unique, but not single identifiers: two URIs may denote the same thing

• URIs are dereferenceable (can be retrieved)

– A well-behaved URI returns a description of the resource

– Provides authority: the definition of foaf:Person lives at that URI

• Ontologies can be looked up as well

– Typically at the root of the URIs, also known as the namespace

– Example: http://xmlns.com/foaf/0.1/Person redirects to the specification
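The relation between a prefixed name such as foaf:Person and its full URI can be sketched as follows. The `PREFIXES` table and the `expand` helper are assumptions for this illustration; the two namespaces are the ones used on these slides.

```python
# Prefix table: each prefix abbreviates a namespace URI.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://www.example.org/",
}


def expand(name: str, prefixes=PREFIXES) -> str:
    """Expand a prefixed name to a full URI; leave other strings untouched."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name


print(expand("foaf:Person"))
print(expand("ex:Person"))
```

The namespace is simply the shared leading part of all term URIs in a vocabulary, which is why the vocabulary document is typically published there.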

Page 12: Publishing data on the Semantic Web


URIs implicitly link data together

Joe’s homepage: (#joe, #name, “Joe A.”), (#joe, #email, mailto:[email protected])

Mary’s homepage: (#mary, #name, “Mary B.”), (#mary, #gender, “female”)

A dating site: (#joe, #loves, #mary)

Schema doc: (#name, #type, #Property), (#name, #domain, #Person)

Page 13: Publishing data on the Semantic Web


Put together, triples form a single ‘global’ graph

[Figure: the triples from the previous slide merged into one graph: #joe has #name “Joe A.” and #email [email protected]; #joe #loves #mary; #mary has #name “Mary B.” and #gender “female”]
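How triples from separate documents combine into one graph can be sketched with set union. The base URI and the helper names are hypothetical; the triples mirror the Joe and Mary example.

```python
BASE = "http://example.org/"  # hypothetical base for the fragment identifiers


def uri(local: str) -> str:
    return BASE + local


# Three independently published documents.
joe_homepage = {(uri("#joe"), uri("#name"), "Joe A.")}
mary_homepage = {
    (uri("#mary"), uri("#name"), "Mary B."),
    (uri("#mary"), uri("#gender"), "female"),
}
dating_site = {(uri("#joe"), uri("#loves"), uri("#mary"))}

# Shared URIs are all it takes to merge them: the union is the 'global' graph.
global_graph = joe_homepage | mary_homepage | dating_site


def names_of_loved_by(graph, person):
    """Follow #loves from `person`, then read #name off the resources found."""
    loved = {o for s, p, o in graph if s == person and p == uri("#loves")}
    return sorted(o for s, p, o in graph if s in loved and p == uri("#name"))


print(names_of_loved_by(global_graph, uri("#joe")))  # ['Mary B.']
```

The query only succeeds on the merged graph: the dating site alone knows whom Joe loves but not her name.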

Page 14: Publishing data on the Semantic Web

Publishing for the Semantic Web

Page 15: Publishing data on the Semantic Web


Motivation

• Why publish data on the (Semantic) Web?

– In a business context

• Increase the potential for linking, reuse and aggregation

– Drive traffic back from other sites on the Web

– Pre-competitive data integration (e.g. drug discovery)

• Make your data more easily findable

– Drive traffic from search engines

– In a non-profit context

• Increase industry or government transparency, accountability

• Support research and education by making data accessible

Page 16: Publishing data on the Semantic Web


Publishing and consuming data on the Semantic Web

• Publishing data involves

– Deciding in which format to publish your data

– Deciding which schema (ontology, vocabulary) to use

• OR you can create a new schema and publish it as well

• Multiple ways of publishing RDF data:

1. Linked Data

2. Metadata in HTML

3. SPARQL endpoints

4. Feeds

5. GRDDL

6. Automated tools

Note: you may implement more than one

Page 17: Publishing data on the Semantic Web


Option 1: Linked Data

• A web of RDF documents in parallel to the current Web

– Most often implemented as wrappers around databases or APIs

• The four rules of Linked Data:

– Use URIs to identify things.

– Use HTTP URIs so that these things can be referred to and looked up ("dereference") by people and user agents.

– Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.

– Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
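Dereferencing a Linked Data URI usually means asking for RDF through the HTTP Accept header. The sketch below only builds the request with the standard library and does not send it; the URI is hypothetical.

```python
from urllib.request import Request


def linked_data_request(uri: str) -> Request:
    """Build a GET request that asks for an RDF/XML description of `uri`."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


req = linked_data_request("http://example.org/id/peter")  # hypothetical URI
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the lookup and follow any redirect issued by the server.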

[Figure: three linked RDF documents, each showing the same example graph: nodes #PeterM, #Bud and #Hun; edges born, label, capital-of and population; literals “Peter Mika”, “Budapest” and “2,000,000”]

Page 18: Publishing data on the Semantic Web


Option 1: Linked Data

• Advantages:

– No change to the publishing of the HTML documents

– Data can be published by a third party (e.g. DBpedia)

• Disadvantages:

– Web servers need to be configured to properly handle URIs that identify concepts instead of documents

– Not favored by search engines

• Lack of use cases

• Crawling needs to be changed

• Authority is difficult to determine

• Tools

– Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)

– RDB-to-RDF mappers (e.g. D2RQ, Triplify)

– Validators (Vapour)

– Linked Data browsers (many)

Page 19: Publishing data on the Semantic Web


Linked Data as a movement

• Rapidly growing community effort to (re)publish open datasets as Linked Data

– In particular, scientific and government datasets

– see linkeddata.org

Page 20: Publishing data on the Semantic Web


Option 2: Metadata in HTML

• Using microformats, RDFa, Microdata (more later)

• Advantages:

– Data and document are always in sync

– Browser plug-in friendly

– Search engine friendly

– Copy-paste friendly

• Tools:

– XML editors (e.g. Oxygen)

– Triplr

– RDFa Distiller

– RDFa bookmarklet

– Ubiquity RDFa plugin

– Optimus microformat parser

• Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…

[Figure: a page containing the sentence “Peter Mika was born in Budapest.” shown together with the example graph: nodes #PeterM, #Bud and #Hun; edges born, label, capital-of and population; literals “Peter Mika”, “Budapest” and “2,000,000”]

Page 21: Publishing data on the Semantic Web


Option 3: SPARQL endpoints

• An API for accessing RDF databases on the Web

– A query language and an HTTP protocol

• Advantages:

– Flexible access: make any query you want

– Also possible to expose a traditional RDBMS via a wrapper

• Disadvantages:

– For the publisher: cost of supporting arbitrary queries

– For the search engine: discovery of SPARQL servers is unsolved

• Tools:

– Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.)

– RDB-to-RDF mappers such as D2RQ and Triplify
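In the SPARQL protocol a query can be sent as the `query` parameter of an HTTP GET request. The sketch below only builds such a URL; the endpoint address and the example.org property URIs are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint

# Ask for the label of the city where #PeterM was born.
QUERY = """
SELECT ?label WHERE {
  <http://example.org/#PeterM> <http://example.org/born> ?city .
  ?city <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Since any query can be sent this way, the publisher carries the cost of evaluating whatever arrives, which is the disadvantage noted above.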

[Figure: the example graph: nodes #PeterM, #Bud and #Hun; edges born, label, capital-of and population; literals “Peter Mika”, “Budapest” and “2,000,000”]

Page 22: Publishing data on the Semantic Web


Option 4: Feeds

• Disadvantages:

– No standard feed format for RDF: data needs to be formatted and often manually submitted for each search engine

• Advantages

– Submit your data without making it public

• Competing and incompatible formats

– DataRSS (Yahoo!)

– Google Data Protocol

– Open Data Protocol (Microsoft)

[Figure: three copies of the example graph: nodes #PeterM, #Bud and #Hun; edges born, label, capital-of and population; literals “Peter Mika”, “Budapest” and “2,000,000”]

Page 23: Publishing data on the Semantic Web


Option 5: Publishing a transformation of the data

• Publish the rule to transform the HTML to structured data

• GRDDL is a standard for linking an HTML page to a transformation that produces RDF data

• Advantages

– No change to the page

• Disadvantages

– Transformation needs to be executed to get to the data

– Not much support by search engines

• Tools

– Intel MashMaker

– Dapper

– Glue API from AdaptiveBlue

[Figure: a table with columns xx and yy and values 1 and 2, linked to an XSLT transformation]
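The idea of publishing a transformation can be illustrated with a small stand-in. GRDDL transformations are typically XSLT; Python's standard library has no XSLT processor, so the sketch below uses a plain function in its place. The table content, the row URI scheme and the function name are all assumptions.

```python
import xml.etree.ElementTree as ET

# A well-formed page fragment: one header row and one data row.
PAGE = """<table>
  <tr><th>xx</th><th>yy</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>"""


def table_to_triples(xhtml: str, base: str = "http://example.org/row/"):
    """Turn each data row into triples (row URI, column name, cell value)."""
    rows = ET.fromstring(xhtml).findall("tr")
    headers = [th.text for th in rows[0].findall("th")]
    triples = []
    for number, row in enumerate(rows[1:], start=1):
        cells = [td.text for td in row.findall("td")]
        for header, cell in zip(headers, cells):
            triples.append((f"{base}{number}", header, cell))
    return triples


triples = table_to_triples(PAGE)
```

The page itself stays unchanged; a consumer has to run the transformation to obtain the data.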

Page 24: Publishing data on the Semantic Web


Option 6: Automatic markup

• Web services that annotate HTML automatically

• Advantages

– No manual effort

• Disadvantages

– Limited to finding relevant entities in text

• Tools

– OpenCalais

– Zemanta API

[Figure: the sentence “Peter Mika was born in Budapest.” is turned into the annotated text below]

<person>Peter Mika</person> was born in <location>Budapest</location>.
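A toy version of automatic markup, assuming a fixed lookup table of known entities. Services such as OpenCalais and Zemanta use far richer analysis; the `GAZETTEER` table and the `annotate` function are illustrative only.

```python
import re

# Hypothetical lookup table from entity name to entity type.
GAZETTEER = {"Peter Mika": "person", "Budapest": "location"}


def annotate(text: str, gazetteer=GAZETTEER) -> str:
    """Wrap every known entity name in a tag named after its type."""
    # Try longer names first so they win over shorter overlapping ones.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def tag(match):
        kind = gazetteer[match.group(0)]
        return f"<{kind}>{match.group(0)}</{kind}>"

    return pattern.sub(tag, text)


annotated = annotate("Peter Mika was born in Budapest.")
print(annotated)
```

The sketch shows the limitation named above: it finds entities in the text but cannot state the relation between them.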

Page 25: Publishing data on the Semantic Web


Example: Zemanta

• A personal writing assistant for bloggers

– Plugin for popular blogging platforms and web mail clients

• Analyzes text as you type and suggests hyperlinks, tags, categories, images and related articles

• API available with the same functionality

Page 26: Publishing data on the Semantic Web


Choosing a vocabulary

• No vocabularies in many domains

– Books, movies, stuff people care about…

• Too many competing proposals in other domains

– Often versions of the same proposal

– Example: vocabularies for microformats

• Not maintained

– I cannot maintain your vocabulary for you

• Limited tool support

– Too many expert tools until now

• Many vocabularies are not designed for annotation

• Missing meeting point and social process

– An ontology is a shared, formal representation of a domain

Page 27: Publishing data on the Semantic Web


Choosing a vocabulary

• Search the Web or ask for advice on mailing lists

[email protected]

[email protected]

• Wikis

– semanticweb.org

– vocamp.org

• Beware of people who claim to have the vocabulary of everything

– Preferably you want something small and targeted

• There is never a 100% fit: you will need to introduce your own vocabulary terms (classes and properties)

– Do not introduce new classes/properties in existing namespaces

– Example: the namespace http://xmlns.com/foaf/0.1/ is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list.

Page 28: Publishing data on the Semantic Web


Advanced topic: creating a vocabulary

1. Get advice on methodology

– vocamp.org and semanticweb.org

2. Choose a namespace and a prefix

– Give sensible names, e.g. name it after your site, but don’t call it searchmonkey

– Namespace ends either with a slash or a hash

3. Create an RDF or OWL document describing your classes and properties

– Use an ontology editor such as Protégé 4.0

– Follow naming conventions

4. Publish your vocabulary

– Make sure the URIs of your properties and classes are resolvable

– E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam

5. Convince others to adopt your vocabulary

– If you are in fishing, convince other fishing businesses

Page 29: Publishing data on the Semantic Web


How do we build communities? www.vocamp.org

Page 30: Publishing data on the Semantic Web

Metadata in HTML

Page 31: Publishing data on the Semantic Web


Brief history of the Annotated Web

• 1995: HTML meta tags

• 1996: Simple HTML Ontology Extensions (SHOE)

• 1998: RDF/XML

– RDF/XML in HTML

– RDF linked from HTML

• 2003: Web 2.0

– Tagging

– Microformats

– Metadata in Wikipedia

– Machine tags in Flickr

• 2005: eRDF

• 2008: RDFa 1.0

• 2011: RDFa 1.1

• 2012: Microdata?

Page 32: Publishing data on the Semantic Web


HTML meta tags

<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
  <META name="DC.author" content="Peter Mika">
  <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
  <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
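How a consumer might read such meta and link tags can be sketched with Python's built-in HTML parser. The class name `MetaCollector` is an assumption; the page is a shortened copy of the example above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs and <link rel=... href=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "meta" and "name" in attributes:
            self.meta[attributes["name"]] = attributes.get("content")
        elif tag == "link" and "rel" in attributes:
            self.links.append((attributes["rel"], attributes.get("href")))


PAGE = (
    '<HTML><HEAD profile="http://dublincore.org/documents/dcq-html/">'
    '<META name="DC.author" content="Peter Mika">'
    '<LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />'
    "</HEAD></HTML>"
)

collector = MetaCollector()
collector.feed(PAGE)
print(collector.meta, collector.links)
```

Note that the keys are bare strings such as DC.author; nothing in the markup ties them to a globally unique identifier.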

Page 33: Publishing data on the Semantic Web


SHOE example (Heflin & Hendler, 1996)

<ONTOLOGY "our-ontology" VERSION="1.0">
  <ONTOLOGY-EXTENDS "organization-ontology" VERSION="2.1" PREFIX="org" URL="http://www.ont.org/orgont.html">
  <ONTDEF CATEGORY="Person" ISA="org.Thing">
  <ONTDEF RELATION="lastName" ARGS="Person STRING">
  <ONTDEF RELATION="firstName" ARGS="Person STRING">
  <ONTDEF RELATION="marriedTo" ARGS="Person Person">
  <ONTDEF RELATION="employee" ARGS="org.Organization Person">
</ONTOLOGY>

<HEAD>
  <META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
  <USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
  <CATEGORY "our.Person">
  <RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
  <RELATION "our.employee" FROM="http://www.cs.umd.edu">
  My name is
  <ATTRIBUTE "our.firstName"> George </ATTRIBUTE>
  <ATTRIBUTE "our.lastName"> Cook </ATTRIBUTE> and I live at...

Page 34: Publishing data on the Semantic Web


SHOE system

Page 35: Publishing data on the Semantic Web


SHOE Text-based query interface

Page 36: Publishing data on the Semantic Web


SHOE Graphical Query Interface

Page 37: Publishing data on the Semantic Web


Example: Creative Commons

Embedding CC license in HTML (now deprecated):

<HTML><HEAD>… </HEAD><BODY>…
<!--
<rdf:RDF xmlns="http://creativecommons.org/ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="http://www.yergler.net/averages/">
    <dc:title>The Law of Averages</dc:title>
    <dc:description>...because eventually i&apos;ll be right...</dc:description>
    <license rdf:resource="http://creativecommons.org/licenses/by-nc/1.0/" />
  </Work>
  <License rdf:about="http://creativecommons.org/licenses/by-nc/1.0/">
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
  </License>
</rdf:RDF>
-->

Page 38: Publishing data on the Semantic Web


Example: Creative Commons

• Current: rel attribute (HTML4)

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.

• Use of the “rel” attribute for semantic annotation is the birth of the microformat…

Page 39: Publishing data on the Semantic Web


Microformats (μf)

• Agreements on the way to encode certain kinds of metadata in HTML

– Reuse of semantic-bearing HTML elements

– Based on existing standards

– Minimality

• Microformats exist for a limited set of objects

– hCard (persons and organizations)

– hCalendar (events)

– hResume

– hProduct

– hRecipe

• Varying degrees of support and stability

– hCard and rel-tag are widely supported

• Community centered around microformats.org

– Specifications and discussions are hosted there

Page 40: Publishing data on the Semantic Web


Microformats: limitations

• No shared syntax

– Each microformat has a separate syntax tailored to the vocabulary

• No formal schemas

– Limited reuse, extensibility of schemas

– Unclear which combinations are allowed

• No datatypes

• No namespaces, unique identifiers (URIs)

– no interlinking

– mapping between instances is required

• Always appears in the HTML <body>

Page 41: Publishing data on the Semantic Web


Example: the hCard microformat

<cite class="vcard"><a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a> </cite>
wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
about an unintentionally humorous letter he received from the
<span class="vcard"> <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a> </span>.

<div class="vcard">
  <a class="email fn" href="mailto:[email protected]">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
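A rough sketch of how a consumer could pull properties out of an hCard, using Python's built-in HTML parser. It is a simplification and not the full hCard parsing rules: it assumes every element has an end tag, takes text properties from element content and link properties from href. The class name and the e-mail address are hypothetical.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    TEXT_PROPS = {"fn", "tel", "title", "org"}  # value is the element text
    LINK_PROPS = {"email", "url"}               # value is the href attribute

    def __init__(self):
        super().__init__()
        self.cards = []  # one dict per vcard found
        self.stack = []  # per open element: (starts_card, text properties)
        self.depth = 0   # number of currently open vcard roots

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        starts_card = "vcard" in classes
        if starts_card:
            self.cards.append({})
            self.depth += 1
        text_props = set()
        if self.depth:
            card = self.cards[-1]
            for prop in classes & self.LINK_PROPS:
                card[prop] = attributes.get("href")
            text_props = classes & self.TEXT_PROPS
            for prop in text_props:
                card[prop] = ""
        self.stack.append((starts_card, text_props))

    def handle_data(self, data):
        if self.depth:
            # Text belongs to every open element that declared a text property.
            for _, text_props in self.stack:
                for prop in text_props:
                    self.cards[-1][prop] += data

    def handle_endtag(self, tag):
        if self.stack:
            starts_card, _ = self.stack.pop()
            if starts_card:
                self.depth -= 1


HTML = (
    '<div class="vcard">'
    ' <a class="email fn" href="mailto:joe.friday@example.org">Joe Friday</a>'
    ' <div class="tel">+1-919-555-7878</div>'
    ' <div class="title">Area Administrator, Assistant</div>'
    " </div>"
)

parser = HCardParser()
parser.feed(HTML)
card = parser.cards[0]
```

The property names are hard-coded in the parser, which reflects the point that each microformat brings its own fixed vocabulary and syntax.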

Page 42: Publishing data on the Semantic Web


RDFa

• W3C standard for embedding RDF data in HTML documents

– A set of new HTML attributes to be used in head or body

– A specification of how to extract the data from these attributes

• RDFa is just a syntax, you have to choose a vocabulary separately

• RDFa 1.0 has been a W3C Recommendation since October 2008

– RDFa Primer

• RDFa 1.1 is a small update on RDFa to make it easier to use

– Currently Working Draft (March 31, 2011)

– Updated version of the RDFa Primer (April 19, 2011)

• RDFa API for accessing RDFa data in a webpage in the browser from JavaScript

– Currently Working Draft (April 19, 2011)

Page 43: Publishing data on the Semantic Web


RDFa 1.1

• Changes

– New vocab attribute to define the default namespace for the document or subtree

– Profile documents to define multiple namespace prefixes

– The prefix attribute as a recommended replacement of xmlns

– You can use URIs even where only CURIEs were allowed before

• RDFa 1.1 is backward compatible with RDFa 1.0

– RDFa 1.1 is recommended if you want to use HTML5

Page 44: Publishing data on the Semantic Web


When to use RDFa

• Choose microformats when you find a microformat that fits your needs and is supported by your consumers

– Microformats are the first option because they are simple

– Yahoo supports all major microformats, see the documentation

– It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5

• It’s compatible with HTML4, HTML5, XHTML

• If you find none that perfectly fits your needs then you need RDFa

– Microformats have a fixed schema: you cannot add your own attributes

• Example: a social networking site with user profiles

– VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections

– You either live without this, or go with RDFa

Page 45: Publishing data on the Semantic Web


RDFa intro: metadata in the header

• More info in the

<html prefix="og: http://ogp.me/ns#">
<head>
  <title>The Trouble with Bob</title>
  <meta property="og:title" content="The Trouble with Bob" />
  <meta property="og:type" content="text" />
  <meta property="og:image" content="http://example.com/alice/bob-ugly.jpg" />
  ...
</head>
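What a consumer extracts from such a header can be sketched with Python's built-in HTML parser. This is a deliberately tiny subset of RDFa processing: it handles only the prefix attribute and meta elements with property and content, and treats the page itself as the subject. The class name and the page URI are hypothetical.

```python
from html.parser import HTMLParser


class RDFaHeadParser(HTMLParser):
    """Turn <meta property=... content=...> elements into triples about the page."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.prefixes = {}
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("prefix"):
            # prefix="og: http://ogp.me/ns#" is a list of "name: URI" pairs.
            parts = attributes["prefix"].split()
            for name, uri in zip(parts[0::2], parts[1::2]):
                self.prefixes[name.rstrip(":")] = uri
        if tag == "meta" and "property" in attributes and "content" in attributes:
            prop = attributes["property"]
            prefix, sep, local = prop.partition(":")
            if sep and prefix in self.prefixes:
                prop = self.prefixes[prefix] + local
            self.triples.append((self.page_uri, prop, attributes["content"]))


PAGE_URI = "http://example.com/alice/posts/trouble_with_bob"  # hypothetical
PAGE = (
    '<html prefix="og: http://ogp.me/ns#"><head>'
    "<title>The Trouble with Bob</title>"
    '<meta property="og:title" content="The Trouble with Bob" />'
    '<meta property="og:type" content="text" />'
    "</head></html>"
)

parser = RDFaHeadParser(PAGE_URI)
parser.feed(PAGE)
triples = parser.triples
```

Unlike the plain meta tags of 1995, each property here expands to a full URI, so the meaning of og:title does not depend on the page it appears in.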

Page 46: Publishing data on the Semantic Web


RDFa intro: links with a flavor

• More info in the

All content on this site is licensed under
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/"> a Creative Commons License </a>.

Page 47: Publishing data on the Semantic Web


RDFa links: talking about subjects other than the page

• More info in the

The trouble with Bob is that he takes much better photos than me:
<div about="http://example.com/bob/photos/sunset.jpg">
  <img src="http://example.com/bob/photos/sunset.jpg" />
  <span property="og:title">Beautiful Sunset</span>
  by <span property="dc:creator">Bob</span>.
</div>

Page 48: Publishing data on the Semantic Web


RDFa links: talking about subjects other than the page

• More info in the

<div typeof="foaf:Person">
  <p property="foaf:name"> Alice Birpemswick </p>
  <p> Email: <a rel="foaf:mbox" href="mailto:[email protected]"> [email protected] </a> </p>
  <p> Phone: <a rel="foaf:phone" href="tel:+1-617-555-7332">+1 617.555.7332</a> </p>
</div>

Page 49: Publishing data on the Semantic Web


The process of annotating with RDFa

• Find a vocabulary that fits your needs and is supported by your consumers

– A vocabulary describes a set of types and attributes within a given domain

– If you don’t find a good candidate, extend an existing one or create a new one

• Annotate your page.

– Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.

– No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting.

– Use the RDFa Distiller to validate which data can be extracted from your page.

– If you like, use the RDF Validator to visualize the extracted RDF graph.

• Put the annotated page online

– The data will be extracted by Google/Bing/Yahoo the next time your page is crawled and indexed

– The data will be available to browser extensions, bookmarklets etc.

• See http://rdfa.info/rdfa-implementations for new tools and APIs

Page 50: Publishing data on the Semantic Web


RDFa can be hard to get right…

• Validation problems can stop us from extracting data

– Use the W3C validator

– Use the right DOCTYPE declaration if using XHTML

– Set the encoding of your page properly (using HTTP headers or XML declaration)

• Prefixes need to be defined using the xmlns attribute

• Unless you are making statements about the document, set the subject using the about attribute

• Do not include HTML elements in literal values

– Incorrect: <div property="foaf:name"><b>Peter Mika</b></div>

• Use absolute URIs as the value of the resource attribute

– Or make sure you specify HTML base

Page 51: Publishing data on the Semantic Web


RDFa can be hard to get right… II.

• Be careful when using rel and typeof in combination because of the precedence rules

• BAD example:

<div about="#id">
  <span property="foaf:name">Peter Mika</span>
  <span rel="foaf:img" typeof="foaf:Image">
    <span property="dc:format">jpg</span>
  </span>
</div>

• To correct, move the typeof onto an element nested inside the <span> that carries rel="foaf:img"

Page 52: Publishing data on the Semantic Web


RDFa can be hard to get right… III.

• Typeof does two things at once: it creates a new subject resource and assigns the type to it

• BAD example:

<div about="#id">
  <span property="foaf:name">Peter Mika</span>
  <span rel="foaf:img" resource="http://www.example.org/photo.jpg">
    <span typeof="foaf:Image">
      <span property="dc:format">jpg</span>
    </span>
  </span>
</div>

• To correct, you have to repeat the resource attribute on the span node with the typeof

Page 53: Publishing data on the Semantic Web


RDFa can be hard to get right… IV.

• Marking up <h1>:

– <h1 property="dc:title">My homepage</h1>

– NOT: <h1><div property="dc:title">My homepage</div></h1>

• Marking up an image:

– <span rel="foaf:img"> <img alt="Alex" src="http://example.org/alex.jpg"/> </span>

– NOT: <img rel="foaf:img" src="photo.jpg"/>

• Header

– <meta property="…" content="…">

– NOT: <meta name="…" content="…">

Page 54: Publishing data on the Semantic Web


RDFa can be hard to get right… V.

• You cannot break up a description like this:

<span rel="foaf:knows"> <span property="foaf:name">Peter Mika</span> </span>
….
<span rel="foaf:knows"> <a rel="foaf:email" href="mailto:[email protected]" /> </span>

• This is not the same as:

<span rel="foaf:knows">
  <span property="foaf:name">Peter Mika</span>
  <a rel="foaf:email" href="mailto:[email protected]" />
</span>

• In the first case there are two related resources, with one attribute each, in the second case there is a single related resource with two attributes.
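The difference between the two markups can be made concrete by writing out the triples each one yields. The sketch below models only the relevant behaviour, namely that each element with rel="foaf:knows" and no explicit object introduces a fresh blank node. The `describe` helper, the `#me` subject and the e-mail address are hypothetical.

```python
import itertools


def describe(subject, groups):
    """Each inner list models one <span rel="foaf:knows"> element and the
    properties nested inside it."""
    counter = itertools.count(1)
    triples = []
    for properties in groups:
        node = f"_:b{next(counter)}"  # a fresh blank node per rel element
        triples.append((subject, "foaf:knows", node))
        for predicate, value in properties:
            triples.append((node, predicate, value))
    return triples


NAME = ("foaf:name", "Peter Mika")
EMAIL = ("foaf:email", "mailto:peter@example.org")

split = describe("#me", [[NAME], [EMAIL]])  # two separate spans
joined = describe("#me", [[NAME, EMAIL]])   # one span holding both


def known(triples):
    return {o for s, p, o in triples if p == "foaf:knows"}


print(len(known(split)), len(known(joined)))  # 2 1
```

Nothing in the split version says that the two blank nodes denote the same person, so a consumer has to treat them as two acquaintances.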

Page 55: Publishing data on the Semantic Web


Tips

• Hiding information from being displayed

– Links without content will not be rendered

– Use <span property="foaf:name" content="Peter Mika"/>

• Use datatypes to provide the expected type of a literal.

– This helps validation because any tool can check whether the literal is indeed of that type.
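A sketch of the kind of check a tool can run once a literal carries a datatype. Only two XML Schema datatypes are covered, and the lexical rules are approximations (for example, time zones on dates are not handled); the `is_valid` function is an assumption for this illustration.

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def is_valid(lexical: str, datatype: str) -> bool:
    """Check a literal's lexical form against its declared datatype."""
    if datatype == XSD + "integer":
        return re.fullmatch(r"[+-]?[0-9]+", lexical) is not None
    if datatype == XSD + "date":
        if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", lexical) is None:
            return False
        try:
            date.fromisoformat(lexical)  # rejects impossible calendar dates
        except ValueError:
            return False
        return True
    return True  # datatypes unknown to this sketch are not checked


print(is_valid("2009-05-10", XSD + "date"))
print(is_valid("May 10th 2009", XSD + "date"))
```

Without the datatype, a tool has no way of telling that “May 10th 2009” was meant to be a date in the first place.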

Page 56: Publishing data on the Semantic Web


Example: Facebook’s Like and the Open Graph Protocol

• The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities

– Shows up in profiles and news feed

– Site owners can later reach users who have liked an object

– Facebook Graph API allows 3rd party developers to access the data

• Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’

Page 57: Publishing data on the Semantic Web


Example: Facebook’s Open Graph Protocol

• RDF vocabulary to be used in conjunction with RDFa

– Simplify the work of developers by restricting the freedom in RDFa

• Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment

• Only HTML <head> accepted

• http://opengraphprotocol.org/

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...

Page 58: Publishing data on the Semantic Web


Example: Yahoo! Enhanced Results (was: SearchMonkey)

• Guide for publishers to mark-up their pages for common types of objects

– Product, Local, News, Video, Events, Documents, Discussion, Games

• Using popular microformats and RDF vocabularies

– Copy-paste code

– Validator

• Yahoo as a consumer

– See later

Page 59: Publishing data on the Semantic Web


Example: Google’s Rich Snippets

• Google accepts popular microformats and its own RDFa vocabulary

– Similar approach to RDFa as Facebook

• Validator to check if the markup is correct

• Google displays enhanced results based on this metadata

– Rich Snippets

Page 60: Publishing data on the Semantic Web


Microdata example

<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
    I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
    <img itemprop="image" src="me.png" alt="me">
  </p>
</div>

Page 61: Publishing data on the Semantic Web


Microdata

• Currently under standardization at the W3C

– Originally part of the HTML5 spec, but now a separate document

• Similar to microformats, but with the extensibility of RDFa

– Introduce new terms using reverse domain names or full URIs

• HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…

Page 62: Publishing data on the Semantic Web


RDFa on the rise

Percentage of URLs with embedded metadata in various formats

510% increase between March, 2009 and October, 2010

Page 63: Publishing data on the Semantic Web


The state of metadata in HTML

• 5-10% of webpages contain some explicit metadata

– Depending on how you count…

• Too many competing approaches

– Too many formats: microformats vs RDFa vs Microdata

– When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone