Date posted: 21-Dec-2015
Semi-structured Data
Facts about the Web
• Growing fast
• Popular
• Semi-structured data
  – Data is presented for ‘human’ processing
  – Data is often ‘self-describing’ (the names of attributes are included within the data fields)
Vision for Web data
• Object-like – it can be represented as a collection of objects of the form described by the conceptual data model
• Schemaless – does not conform to any fixed type structure
• Self-describing – necessary for machine-readable data
Facts about database systems
• Integration of databases with different schemas is often needed
• Sharing information between different databases on the World Wide Web is becoming increasingly important for business
Semi-structured data
• Bridging different data models (relational, object-oriented
Semi-structured data representation
• A database of semi-structured data is a graph with
  – A set of nodes, each of which is either a leaf or an interior node;
  – A set of labeled arcs: each interior node has one or more arcs coming out of it, each connecting it with another node and carrying a label; and
  – A root that does not have an arc entering it. Every node must be reachable from the root.
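The definition above can be sketched in Python. This is only an illustration under assumptions: the dictionary encoding, the node identifiers (`root`, `cf`, `mv`, `cf_name`, ...) and the function name `is_valid_database` are hypothetical and do not come from the slides.

```python
from collections import deque

# Assumed encoding: an interior node maps to its outgoing (label, target) arcs;
# a leaf maps to an atomic value.
interior = {
    "root": [("star", "cf"), ("movie", "mv")],
    "cf": [("name", "cf_name"), ("starsIn", "mv")],
    "mv": [("title", "mv_title"), ("year", "mv_year"), ("starOf", "cf")],
}
leaves = {"cf_name": "Carrie Fisher", "mv_title": "Star Wars", "mv_year": 1977}


def is_valid_database(root, interior, leaves):
    """Check that no arc enters the root and every node is reachable from it."""
    all_nodes = set(interior) | set(leaves)
    targets = {target for arcs in interior.values() for _, target in arcs}
    # Reject arcs into the root and arcs that point at unknown nodes.
    if root in targets or not targets <= all_nodes:
        return False
    # Breadth-first search from the root along the labeled arcs.
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for _, target in interior.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen == all_nodes


print(is_valid_database("root", interior, leaves))
```

Note that the graph may contain cycles (here `starsIn` and `starOf` point at each other), which is why the traversal keeps a `seen` set.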
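[Figure: Example of semi-structured data representing a movie and stars. A Root node has arcs labeled star, star, and movie to the interior nodes cf, mh, and mv. Node cf has a name arc (Carrie Fisher) and two address arcs, each with street and city (Maple, H’wood; Locust, Malibu). Node mh has name (Mark Hamill), street (Oak), and city (H’wood) arcs. Node mv has title (Star Wars) and year (1977) arcs. Arcs labeled starsIn and starOf connect the star nodes and the movie node.]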
Information integration via semi-structured data
• Simple
• Semi-structured data as interface between users of different databases (with different schemas)
[Figure: diagram showing a User, an Interface, DB1 and DB2, the Application of DB1, and the Application of DB2.]
XML – Overview
• Simplifies data exchange between software agents
• Popular thanks to the involvement of the W3C (World Wide Web Consortium, an independent organization; www.w3.org)
XML – Characteristics
• Simple, open, widely accepted
• HTML-like (tags) but extensible by users (no fixed set of tags)
• No predefined semantics for the tags (because XML was not developed for display purposes)
• Semantics are defined by a stylesheet (covered later)
XML Documents
• User-defined tags: <tag> info </tag>
• Properly nested: <tag1>.. <tag2>…</tag1></tag2> is not allowed
• Root element: an element that contains all other elements
• Processing instructions: <?command ….?>
• Comments: <!-- comment -->
• CDATA type
• DTD
XML element
• Begins with an opening tag of the form
<XML_element_name>
• Ends with a closing tag
</XML_element_name>
• The text between the opening tag and the closing tag is called the content of the element
XML element
<Star-Movie-Data>
  <Star>
    <Name> Carrie Fisher </Name>
    <Address> <Street> 123 Maple St. </Street> <City> Hollywood </City> </Address>
    <Address> <Street> 5 Locust Ln. </Street> <City> Malibu </City> </Address>
  </Star>
  <Star>
    <Name> Mark Hamill </Name>
    <Address> <Street> 456 Oak Rd. </Street> <City> Brentwood </City> </Address>
  </Star>
  <Movie>
    <Title> Star Wars </Title> <Year> 1977 </Year>
  </Movie>
</Star-Movie-Data>
Each <Star>…</Star> block is a Star element; each <Name>…</Name> block is a Name element.
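To see how such a document can be read by a program, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The string `doc` is a condensed copy of the example (names, cities, and the movie title only), and the variable names are illustrative, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the Star-Movie-Data example (streets and year omitted).
doc = """
<Star-Movie-Data>
  <Star>
    <Name>Carrie Fisher</Name>
    <Address><City>Hollywood</City></Address>
    <Address><City>Malibu</City></Address>
  </Star>
  <Star>
    <Name>Mark Hamill</Name>
    <Address><City>Brentwood</City></Address>
  </Star>
  <Movie><Title>Star Wars</Title></Movie>
</Star-Movie-Data>
"""

root = ET.fromstring(doc)

# For every Star element, collect the City of each of its Address children.
cities_by_star = {
    star.findtext("Name"): [addr.findtext("City") for addr in star.findall("Address")]
    for star in root.findall("Star")
}
print(cities_by_star)
```

The repeated Address element under one Star is what a fixed relational schema would not allow directly; here it simply becomes a list.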
XML element
<Star-Movie-Data>
  <Star name="Carrie Fisher">
    ….
  </Star>
  …
</Star-Movie-Data>
Here name is the attribute and "Carrie Fisher" is the value of the attribute.
Relationship between XML elements
• Child-parent relationship
  – Elements nested directly in an element are the children of this element (Student is a child of PersonList, Name is a child of Student, etc.)
• Ancestor/descendant relationship: important for querying XML documents (extends the child/parent relationship)
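The difference between the two relationships can be sketched with Python's standard `xml.etree.ElementTree` module. The tiny PersonList document below is an assumed instance built from the element names mentioned above; it is not taken from the slides.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<PersonList><Student><Name>XYZ PQR</Name></Student></PersonList>"
)

# Children: elements nested directly inside PersonList.
children = [child.tag for child in root]

# Descendants: elements nested at any depth below PersonList.
descendants = [elem.tag for elem in root.iter() if elem is not root]

print(children)     # ['Student']
print(descendants)  # ['Student', 'Name']
```

Name is a descendant of PersonList but not one of its children, which is the distinction query languages for XML rely on.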
XML elements & Database Objects
• XML elements can be converted into objects by
  – considering the tag names of the children as attributes of the objects
  – applying the conversion recursively
<Student StudentID="123">
  <Name> "XYZ PQR" </Name>
  <CrsTaken>
    <CrsName>CS582</CrsName>
    <Grade>"A"</Grade>
  </CrsTaken>
</Student>
Partially converted object:
(#099,
  Name: "XYZ PQR"
  CrsTaken:
    <CrsName>"CS582"</CrsName>
    <Grade>"A"</Grade>
)
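The recursive conversion can be sketched as follows. This is one possible encoding, not the slides' method: the function name `to_object`, the use of Python dicts as objects, the `@` prefix for XML attributes, and the choice to store every child under a list are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET


def to_object(elem):
    """Recursively convert an element into a dict keyed by child tag names."""
    children = list(elem)
    if not children:
        # Leaf element: its value is just its text (attributes of leaves are ignored here).
        return (elem.text or "").strip()
    # XML attributes are kept under '@'-prefixed keys so they cannot clash with child tags.
    obj = {"@" + name: value for name, value in elem.attrib.items()}
    for child in children:
        # A tag may repeat, so every child tag maps to a list of converted values.
        obj.setdefault(child.tag, []).append(to_object(child))
    return obj


student = ET.fromstring(
    '<Student StudentID="123">'
    "<Name>XYZ PQR</Name>"
    "<CrsTaken><CrsName>CS582</CrsName><Grade>A</Grade></CrsTaken>"
    "</Student>"
)
print(to_object(student))
```

Unlike the partially converted object on the slide, this sketch carries the recursion all the way down, so CrsTaken becomes a nested object as well.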
XML elements & Database Objects
• Differences: Additional text within XML elements
<Student StudentID="123">
  <Name> "XYZ PQR" </Name>
  has taken the following course
  <CrsTaken>
    Database management system II
    <CrsName>CS582</CrsName>
    with the grade
    <Grade>"A"</Grade>
  </CrsTaken>
</Student>
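A parser has to keep this extra text somewhere. As a hedged illustration, Python's standard `xml.etree.ElementTree` stores it in the `text` and `tail` fields of the surrounding elements; the condensed CrsTaken element below is adapted from the example above.

```python
import xml.etree.ElementTree as ET

crs = ET.fromstring(
    "<CrsTaken>Database management system II"
    "<CrsName>CS582</CrsName>with the grade"
    "<Grade>A</Grade></CrsTaken>"
)

# Text that appears before the first child belongs to the parent's .text field.
leading_text = crs.text

# Text that follows a child element is stored in that child's .tail field.
text_after_name = crs.find("CrsName").tail

print(repr(leading_text))
print(repr(text_after_name))
```

A plain object with only named attributes has no natural slot for these strings, which is the difference this slide points out.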
XML elements & Database Objects
• Differences: XML elements are ordered
<CrsTaken>
  <CrsName>"CS582"</CrsName>
  <Grade>"A"</Grade>
</CrsTaken>
<CrsTaken>
  <Grade>"A"</Grade>
  <CrsName>"CS582"</CrsName>
</CrsTaken>
These are two different XML elements, but both correspond to the same (unordered) object:
{#901, Grade: "A", CrsName: "CS582"}
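The effect of ordering can be sketched in Python with the standard `xml.etree.ElementTree` module. The helper names `as_sequence` and `as_object` are hypothetical; they model the ordered XML view and the unordered object view respectively.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    "<CrsTaken><CrsName>CS582</CrsName><Grade>A</Grade></CrsTaken>"
)
second = ET.fromstring(
    "<CrsTaken><Grade>A</Grade><CrsName>CS582</CrsName></CrsTaken>"
)


def as_sequence(elem):
    """Ordered view: the children as a list of (tag, text) pairs."""
    return [(child.tag, child.text) for child in elem]


def as_object(elem):
    """Unordered view: the children as attribute/value pairs of an object."""
    return {child.tag: child.text for child in elem}


print(as_sequence(first) == as_sequence(second))  # False: document order differs
print(as_object(first) == as_object(second))      # True: same attribute values
```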
XML Attributes
• Can occur within an element (arbitrarily many attributes, order unimportant, each attribute at most once)
• Allow a more concise representation
• Could be replaced by elements
• Less powerful than elements (only string values, no children)
• Can be declared to have a unique value, good for integrity constraint enforcement (next slide)
XML Attributes
• Can be declared to be of type ID, IDREF, or IDREFS
• ID: unique value throughout the document
• IDREF: refers to a valid ID declared in the same document
• IDREFS: space-separated list of references to valid IDs
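These three types amount to integrity constraints that a validating parser checks. Python's standard `xml.etree.ElementTree` does not read attribute types from a DTD, so the sketch below takes the attribute names of each type as parameters. The function `check_references`, the sample document, and the attribute names `starID`, `movieID`, `starredIn`, and `starsOf` as used here are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def check_references(root, id_attrs, idref_attrs, idrefs_attrs=()):
    """Return a list of violations of the ID / IDREF / IDREFS rules."""
    problems, ids = [], set()
    # Pass 1: every ID value must be unique throughout the document.
    for elem in root.iter():
        for name in id_attrs:
            value = elem.get(name)
            if value is None:
                continue
            if value in ids:
                problems.append(f"duplicate ID {value!r}")
            ids.add(value)
    # Pass 2: every reference must point at an ID collected in pass 1.
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name in idref_attrs:
                refs = [value]
            elif name in idrefs_attrs:
                refs = value.split()  # IDREFS is a space-separated list
            else:
                continue
            problems.extend(
                f"dangling reference {ref!r}" for ref in refs if ref not in ids
            )
    return problems


sample = ET.fromstring(
    "<Stars-Movies>"
    '<Star starID="cf" starredIn="sw"/>'
    '<Movie movieID="sw" starsOf="cf mh"/>'
    "</Stars-Movies>"
)
problems = check_references(
    sample, {"starID", "movieID"}, {"starredIn"}, {"starsOf"}
)
print(problems)
```

In the sample, `mh` is referenced but never declared as an ID, so it is reported as a dangling reference.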
Well-formed XML Document
• It has a root element
• Every opening tag is followed by a matching closing tag, and elements are properly nested
• Any attribute can occur at most once in a given opening tag; its value must be provided and quoted
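These rules can be tried out with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that is not well-formed; the helper name `is_well_formed` and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "well-formed": '<Star name="Carrie Fisher"><Name>CF</Name></Star>',
    "improper nesting": "<Star><Name>CF</Star></Name>",
    "missing closing tag": '<Star name="CF">',
    "duplicate attribute": '<Star name="a" name="b"></Star>',
    "unquoted value": "<Star name=CF></Star>",
    "missing value": "<Star name></Star>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample is accepted; each of the others breaks exactly one of the rules listed above.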
Document Type Definition
• Set of rules (defined by the user) for structuring an XML document
• Can be part of the document itself, or can be specified via a URL where the DTD can be found
• A document that conforms to a DTD is said to be valid
• Viewed as a grammar that specifies a legal XML document, based on the tags used in the document
DTD Components
• A name – must coincide with the tag of the root element of the document conforming to the DTD
• A set of ELEMENTs – one ELEMENT for each allowed tag, including the root tag
• ATTLIST statements – specify the allowed attributes and their types for each tag
• *, +, ? – as in grammar definitions
  – * : zero or more (finitely many)
  – + : at least one
  – ? : zero or one
DTD Components – Element
<!ELEMENT Name definition>
– Name: name of the element
– definition: type, element list, etc.
definition can be EMPTY, (#PCDATA), or an element list (e1,e2,…,en), where the list can be shortened using grammar-like notation
DTD Components – Element
<!ELEMENT Name (e1,…,en)>
– Name: name of the element
– e1: 1st element
– en: nth element
Examples:
<!ELEMENT PersonList (Title,Contents)>
<!ELEMENT Contents (Person*)>
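Because an element list with *, +, and ? behaves like a grammar rule, a flat content model can be checked with a regular expression over the sequence of child tags. The sketch below is an illustration under assumptions, not how a real DTD validator is implemented: the function names are hypothetical, and only flat comma-separated lists are handled (no nested groups, no `|` choice, no EMPTY or #PCDATA).

```python
import re


def content_model_to_regex(model):
    """Translate a flat model such as 'NAME,ADDRESS+,MOVIES' into a regex."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        # An optional trailing *, + or ? is the occurrence indicator.
        suffix = item[-1] if item[-1] in "*+?" else ""
        name = item.rstrip("*+?")
        # Each child tag is encoded as its name followed by one space.
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def children_match(model, child_tags):
    """Check whether a sequence of child tag names conforms to the model."""
    encoded = "".join(tag + " " for tag in child_tags)
    return content_model_to_regex(model).fullmatch(encoded) is not None


print(children_match("Title,Contents", ["Title", "Contents"]))  # True
print(children_match("Title,Contents", ["Contents", "Title"]))  # False: wrong order
print(children_match("Person*", []))                            # True: zero is allowed
print(children_match("NAME,ADDRESS+,MOVIES", ["NAME", "MOVIES"]))  # False: needs an ADDRESS
```

Encoding each tag with a trailing space keeps names from running together, which works because XML names cannot contain spaces.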
DTD Components – Element
<!ELEMENT Name EMPTY>
– the element Name has no children
<!ELEMENT Name (#PCDATA)>
– the value of Name is a character string
Examples:
<!ELEMENT Title EMPTY>
<!ELEMENT Id (#PCDATA)>
DTD Components – Attribute List
<!ATTLIST EName Att {Type} Property> where
- EName – name of an element defined in the DTD
- Att – name of an attribute allowed to occur in the opening tag of EName
- {Type} – may or may not be present; specifies the type of the attribute (CDATA, ID, IDREF, IDREFS)
- Property – either #REQUIRED or #IMPLIED
<!DOCTYPE STARS [
  <!ELEMENT STARS (STAR*)>
  <!ELEMENT STAR (NAME, ADDRESS+, MOVIES)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT ADDRESS (STREET, CITY)>
  <!ELEMENT STREET (#PCDATA)>
  <!ELEMENT CITY (#PCDATA)>
  <!ELEMENT MOVIES (MOVIE*)>
  <!ELEMENT MOVIE (TITLE, YEAR)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
]>
A simple DTD for the movie and star database (no integrity constraints)
<!DOCTYPE STARS-MOVIES [
  <!ELEMENT STARS-MOVIES (STAR*, MOVIE*)>
  <!ELEMENT STAR (NAME, ADDRESS+)>
  <!ATTLIST STAR starID ID starredIn IDREF>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT ADDRESS (STREET, CITY)>
  <!ELEMENT STREET (#PCDATA)>
  <!ELEMENT CITY (#PCDATA)>
  <!ELEMENT MOVIE (TITLE, YEAR)>
  <!ATTLIST MOVIE movieID ID starsOf IDREF>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
]>
A DTD for the movie and star database with attributes and integrity constraints
Homework 5 (Due Oct 23)
• 4.2.3 (Pg 146, complete book) (10pt)
• 4.4.1 (part c, Pg 164, complete book) (10pt)
• 4.5.4 (Pg 172, complete book) (10pt)