data linkage and sem web - pitt.eduvizclass/classes/infsci2711_s1m/data/data_linkage... · 5...
Transcript of data linkage and sem web - pitt.eduvizclass/classes/infsci2711_s1m/data/data_linkage... · 5...
1
Advanced Topics in Database Management (INFSCI 2711)Book: Semantic Web for the Working Ontologies - 2011
Vladimir Zadorozhny, DINS, University of Pittsburgh
Data Linkage and Semantic Web
Data Linkage and Semantic WebThe Semantic Web applies the idea of data linkage to the Web as a whole. The current Web infrastructure supports a distributed network of web pages that can refer to one another with global links called Uniform Resource Locators (URLs). The main idea of the Semantic Web is to support a distributed Web at the level of the data rather than at the level of the presentation. Instead of having one web page point to another, one data item can point to another, using global references called Uniform Resource Identifiers (URIs). Example: an update to the location of hotels would be reflected in the list of hotels at any particular location. We’d like the two sources to stay synchronized; then inconsistent conclusions will not be drawn from information taken from different pages of the same site (referential integrity?)The data model that the Semantic Web infrastructure uses to represent this distributed web of data is called the Resource Description Framework (RDF).
2
Distributing Data Across the Web
■ For simplicity consider tabular data
Table: ELM (Elizabethan Literature and Music)ID Title Author Medium Year1 As You Like It Shakespear Play 1599
2 Hamlet Shakespear Play 1604
3 Othello Shakespear Play 1603
4 Sonet78 Shakespear Poem 16095 Astrophil and Stella Sir Philip Sidney Poem 1590
6 Edward II Christopher Marlowe Play 1592
7 Hero and Leander Christopher Marlowe Poem 1593
8 Greensleeves Henry VIII Rex Song 1595
Distributing Data Across the Web by rows
1 As You Like It Shakespear Play 1599
4 Sonet78 Shakespear Poem 1609
6 Edward II Christopher Marlowe Play 1592
3 Othello Shakespear Play 1603
7 Hero and Leander Christopher Marlowe Poem 1593
Requires some coordination between the servers. In particular, each server must share information about the columns (global schema).
3
Distributing Data Across the Web by columns The coordination has to do with the identities of the entities to be described. How do I know that row 3 on one server refers to the same entity as row 3 on another server? This solution requires a global identifier for the entities being described.
Medium YearPlay 1599
Play 1604
Play 1603Poem 1609
Poem 1590
Play 1592
Poem 1593Song 1595
AuthorShakespear
Shakespear
Shakespear
ShakespearSir Philip Sidney
Christopher Marlowe
Christopher Marlowe
Henry VIII Rex
TitleAs You Like It
Hamlet
Othello
Sonet78Astrophil and Stella
Edward II
Hero and Leander
Greensleeves
Distributing Data Across the Web by cells
TitleRow 2 Hamlet
MediumRow 7 Poem
YearRow 2 1604
MediumRow 6 Play
AuthorRow 4 Shakespear
• Flexibility supports the AAA slogan: “Anyone can say Anything about Any topic.”
• Any server is able to make a statement about any entity (as is the case “by column”)
• Any server is able to specify any property of an entity (as is the case of “by rows”).
• Cost: both global schema (column name) and global entity identifier (raw id) are required.
• Each is represented with three values: a global reference for the row, a global reference for the column, and the value in the cell itself. à RDF
4
RDF Triples
TitleRow 2 Hamlet
MediumRow 7 Poem
YearRow 2 1604
MediumRow 6 Play
AuthorRow 4 Shakespear
Subject Predicate ObjectRow 7 Medium PoemRow 2 Title HamletRow 2 Year 1604Row 4 Author ShakespearRow 6 Medium Play
Global Entity ID
Global Column Name(Schema) Value
More of RDF TriplesSubject Predicate Object
Shakespear wrote King LearShakespear wrote Macbeth
Ann Hathaway married ShakespearShakespear livedin Stratford
Stratford isIn EnglandMacbeth setIn ScotlandEngland partOf UKScotland partOf UK
MERGING DATA FROM MULTIPLE SOURCESWe started off describing RDF as a way to distribute data over several sources. But when we want touse that data, we will need to merge those sources back together again. One value of the triplesrepresentation is the ease with which this kind of merger can be accomplished. Since information isrepresented simply as triples, merged information from two graphs is as simple as forming the graphof all of the triples from each individual graph, taken together. Let’s see how this is accomplishedin RDF.
Suppose that we had another source of information that was relevant to our example from Table3.3—that is, a list of plays that Shakespeare wrote or a list of parts of the United Kingdom. Thesewould be represented as triples as in Tables 3.4 and 3.5. Each of these can also be shown as a graph,just as in the original table, as shown in Figure 3.5.
What happens when we merge together the information from these three sources? We simply getthe graph of all the triples that show up in Figures 3.4 and 3.5. Merging graphs like those in Figures 3.4and 3.5 to create a combined graph like the one shown in Figure 3.6 is a straightforward process—butonly when it is known which nodes in each of the source graphs match.
FIGURE 3.4
Graph display of triples from Table 3.3. Eight triples appear as eight labeled edges.
Table 3.4 Triples about Shakespeare’s Plays
Subject Predicate Object
Shakespeare Wrote As You Like It
Shakespeare Wrote Henry V
Shakespeare Wrote Love’s Labour’s Lost
Shakespeare Wrote Measure for Measure
Shakespeare Wrote Twelfth Night
Shakespeare Wrote The Winter’s Tale
Shakespeare Wrote Hamlet
Shakespeare Wrote Othello
etc.
32 CHAPTER 3 RDF—The basis of the Semantic Web
Graph display of triples:
5
Basic Ideas of RDF
■ Basic building block: subject-predicate-object (also known as object-attribute-value) triple● It is called a statement
■ RDF has been given a syntax in XML● This syntax inherits the benefits of XML● Other syntactic representations of RDF are possible
■ The fundamental concept of RDF is resource.
Resources
■ We can think of a resource as an object, a �thing� we want to talk about● E.g. authors, books, publishers, places, people, hotels
■ Every resource has a URI, a Universal Resource Identifier ■ A URI can be
● a URL (Web address) or ● some other kind of unique identifier
6
Properties
■ Properties are a special kind of resources■ They describe relations between resources
● E.g. “written by”, “age”, “title”, etc. ■ Properties are also identified by URIs ■ Advantages of using URIs:
● Α global, worldwide, unique naming scheme● Reduces the homonym problem of distributed data representation
Statements
■ Statements assert the properties of resources■ A statement is an object-attribute-value triple
● It consists of a resource, a property, and a value■ Values can be resources or literals
● Literals are atomic values (strings)
7
Three Views of a Statement
■ A triple■ A piece of a graph■ A piece of XML codeThus an RDF document can be viewed as:■ A set of triples■ A graph (semantic net)■ An XML document
Statements as Triples
(http://www.cit.gu.edu.au/~db,http://www.mydomain.org/site-owner,
#David Billington)■ The triple (x,P,y) can be considered as a logical formula P(x,y)
● Binary predicate P relates object x to object y ● RDF offers only binary predicates (properties)
8
Statement as a Graph
■ A directed graph with labeled nodes and arcs● from the resource (the subject of the statement) ● to the value (the object of the statement)
■ Known in AI as a semantic net■ The value of a statement may be a resource
● Ιt may be linked to other resources
A Set of Triples as a Semantic Net
9
Statements in XML Syntax
■ Graphs are a powerful tool for human understanding but■ The Semantic Web vision requires machine-accessible and machine-
processable representations■ There is a 3rd representation based on XML
● But XML is not a part of the RDF data model● E.g. serialisation of XML is irrelevant for RDF
Statements in XML (2)<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"xmlns:mydomain="http://www.mydomain.org/my-rdf-ns">
<rdf:Descriptionrdf:about="http://www.cit.gu.edu.au/~db">
<mydomain:site-owner rdf:resource=�#David Billington�/>
</rdf:Description></rdf:RDF>
■ The rdf:Description element makes a statement about the resource http://www.cit.gu.edu.au/~db
■ Within the description● the property is used as a tag● the content is the value of the property
10
Example of University Courses
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"xmlns:xsd="http://www.w3.org/2001/XMLSchema#"xmlns:uni="http://www.mydomain.org/uni-ns"><rdf:Description rdf:about="949318">
<uni:name>David Billington</uni:name><uni:title>Associate Professor</uni:title><uni:age rdf:datatype="&xsd:integer">27<uni:age>
</rdf:Description><rdf:Description rdf:about="CIT1111">
<uni:courseName>Discrete Maths</uni:courseName><uni:isTaughtBy>David Billington</uni:isTaughtBy>
</rdf:Description><rdf:Description rdf:about="CIT2112">
<uni:courseName>Programming III</uni:courseName><uni:isTaughtBy>Michael Maher</uni:isTaughtBy>
</rdf:Description>
</rdf:RDF>
SPARQL: RDF Query Language
■ SPARQL is based on matching graph patterns■ The simplest graph pattern is the triple pattern :- like an RDF triple, but with the possibility of a variable instead of an RDF
term in the subject, predicate, or object positions■ Combining triple patterns gives a basic graph pattern, where an exact
match to a graph is needed to fulfill a pattern
11
Using select-from-where
■ As in SQL, SPARQL queries have a SELECT-FROM-WHERE structure:● SELECT specifies the projection: the number and order of retrieved
data● FROM is used to specify the source being queried (optional)● WHERE imposes constraints on possible solutions in the form of
graph pattern templates and boolean constraints■ Retrieve all phone numbers of staff members:
SELECT ?x ?yWHERE { ?x uni:phone ?y .}
■ Here ?x and ?y are variables, and ?x uni:phone ?y represents a resource-property-value triple pattern
Implicit Join ■ Retrieve all lecturers and their phone numbers:
SELECT ?x ?yWHERE{ ?x rdf:type uni:Lecturer ;
uni:phone ?y . }
■ Implicit join: We restrict the second pattern only to those triples, the resource of which is in the variable ?x● Here we use a syntax shortcut as well: a semicolon indicates that the
following triple shares its subject with the previous one
■ The previous query is equivalent to writing:SELECT ?x ?yWHERE{
?x rdf:type uni:Lecturer .?x uni:phone ?y .
}
12
Explicit Join
■ Retrieve the name of all courses taught by the lecturer with ID 949352SELECT ?nWHERE{
?x rdf:type uni:Course ;uni:isTaughtBy :949352 .
?c uni:name ?n .FILTER (?c = ?x) .
}
Optional Patterns<uni:lecturer rdf:about=�949352�>
<uni:name>Grigoris Antoniou</uni:name></uni:lecturer><uni:professor rdf:about=�94318�>
<uni:name>David Billington</uni:name><uni:email>[email protected]</uni:email>
</uni:professor>
■ For one lecturer it only lists the name■ For the other it also lists the email address
13
Optional Patterns (2)
■ All lecturers and their email addresses:SELECT ?name ?emailWHERE{ ?x rdf:type uni:Lecturer ;
uni:name ?name ;uni:email ?email .
}
■ The result of this query would be:
■ Grigoris Antoniou is listed as a lecturer, but he has no e-mail address
?name ?email
David Billington [email protected]
Optional Patterns (3)
■ As a solution we can adapt the query to use an optional pattern:SELECT ?name ?emailWHERE{ ?x rdf:type uni:Lecturer ;
uni:name ?name .OPTIONAL { x? uni:email ?email }
}
■ The meaning is roughly �give us the names of lecturers, and if known also their e-mail address�
■ The result looks like this:?name ?email
Grigoris Antoniou
David Billington [email protected]
14
Summary■ RDF provides a foundation for representing and processing metadata ■ RDF has a graph-based data model ■ RDF has an XML-based syntax to support syntactic interoperability
● XML and RDF complement each other because RDF supports semantic interoperability
■ RDF has a decentralized philosophy and allows incremental building of knowledge, and its sharing and reuse
■ There exist query languages for RDF, including SPARQL
■ RDF Schema is quite primitive as a modelling language for the Web■ Many desirable modelling primitives are missing■ Therefore we need an ontology layer on top of RDF