COMP20008 Elements of Data Processing Week 2: Lecture 1

Post on 27-Jun-2020


COMP20008 Elements of Data Processing Week 2: Lecture 1 Data format and storage

Today

•  Complete section on data format and representation
   –  HTML/XML/JSON
      •  Format
      •  Namespaces and schema
   –  Linked (graph) data
      •  JSON-LD
      •  RDF

Announcements

•  Lecture recordings are now available on the LMS
•  Workshops start this week

We are asking undergraduate students to participate in a series of interviews about your preferences for the delivery of assessment feedback to improve your learning.

What's involved? You will be asked to participate in four (4) short interviews (approx. 10-20 mins long) throughout the semester to ask you about what is effective feedback and what form of feedback would be most useful to your studies.

Why should you get involved? This will help us to understand more about how we can deliver effective feedback to students at university. This research is part of a nationally funded project looking at how we can use technology to improve the provision of feedback. PLUS, you will be given a $50 gift voucher to thank you for your time and commitment to the study.

How do you get involved? If you would like to take part in the study (or have any further questions about it) please contact Paula de Barba on paula.de@unimelb.edu.au.

What feedback do you want to receive to help your learning at university?

RESEARCH PARTICIPATION

HTML – Hypertext Markup language

•  Marked up with elements, delineated by start and end tags. Elements correspond to logical units, such as a heading, paragraph or itemised list.

•  Tags: Keywords contained in pairs of angle brackets. –  Not case sensitive.

•  Browser determines how to display/present the logical units
•  Not all elements need both start and end tags.
•  Some elements can have attributes. Ordering of attributes is not significant.

HTML Example

<div class="icon section5">
  <h2><a href="about/index.html">About the Melbourne School of Engineering</a></h2>
  <ul>
    <li><a href="about/dean_welcome.html">Dean's Welcome</a></li>
    <li><a href="about/staff.html">Leadership &amp; Professional Staff</a></li>
    <li><a href="about/contact.html">Contact Us</a></li>
    <li><a href="http://www.ecr.unimelb.edu.au">ECR: Computer Resources</a></li>
    <li><a href="intranet/index.html">For Staff (intranet)</a></li>
    <li><a href="casual_staff/index.html">For Casual Staff</a></li>
    <li><a href="intranet/review/prof_staff.html">Professional Staff Review</a></li>
    <li><a href="/about/safety/index.html">Environment, Health &amp; Safety</a></li>
    <li><a href="/about/committees/index.html">Committees</a></li>
  </ul>
</div>
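As a rough illustration of how a program (rather than a browser) can read this kind of markup, the sketch below uses Python's standard html.parser module to collect the links from a shortened copy of the list. The class name LinkCollector and the variable names are made up for this example, and the second HREF is written in upper case on purpose to show that tag and attribute names are not case sensitive.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


snippet = """
<ul>
  <li><a href="about/dean_welcome.html">Dean's Welcome</a></li>
  <li><A HREF="about/staff.html">Leadership &amp; Professional Staff</A></li>
</ul>
"""

collector = LinkCollector()
collector.feed(snippet)
collector.close()
links = collector.links
for href, text in links:
    print(href, "->", text)
```

Note that the entity &amp; is decoded to a plain "&" in the collected link text.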

XML – Extensible Markup Language

•  Allows new elements to be defined
•  Applications may generate and process XML
•  Enables data exchange between different platforms
•  Facilitates better encoding of semantics

<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE CURRENCY="USD">10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE CURRENCY="USD">9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>
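To show what "applications may generate and process XML" can look like in practice, here is a minimal sketch that reads a trimmed copy of the CD catalog with Python's standard ElementTree module. The dictionary keys and variable names are choices made for this example only.

```python
import xml.etree.ElementTree as ET

catalog_xml = """
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <PRICE CURRENCY="USD">10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <PRICE CURRENCY="USD">9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>
"""

root = ET.fromstring(catalog_xml)

cds = []
for cd in root.findall("CD"):
    price = cd.find("PRICE")
    cds.append({
        "title": cd.findtext("TITLE"),        # element content
        "artist": cd.findtext("ARTIST"),
        "currency": price.get("CURRENCY"),    # attribute value
        "price": float(price.text),           # XML text is always a string
        "year": int(cd.findtext("YEAR")),
    })

for cd in cds:
    print(cd)
```

Element content and attribute values both arrive as strings, so any numeric interpretation (price, year) has to be applied by the program.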

JSON: JavaScript Object Notation

{
  "Catalog": [
    {
      "CD": {
        "title": "Empire Burlesque",
        "artist": "Bob Dylan",
        "Country": "USA",
        "price": { "currency": "USD", "value": 10.90 },
        "year": 1985
      }
    },
    {
      "CD": {
        "title": "Hide your heart",
        "artist": "Bonnie Tyler",
        "Country": "UK",
        "price": { "currency": "USD", "value": 9.90 },
        "year": 1988
      }
    }
  ]
}

JSON is simpler and more compact/lightweight than XML, and it is easy to parse. A common JSON application is reading and displaying data from a web server using JavaScript.

•  http://www.w3schools.com/json/json_http.asp
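The claim that JSON is easy to parse can be illustrated with Python's standard json module. This is only a sketch: the document below is a shortened version of the catalog, and the variable names are invented for the example.

```python
import json

text = """
{"Catalog": [
  {"CD": {"title": "Empire Burlesque", "artist": "Bob Dylan",
          "price": {"currency": "USD", "value": 10.90}, "year": 1985}},
  {"CD": {"title": "Hide your heart", "artist": "Bonnie Tyler",
          "price": {"currency": "USD", "value": 9.90}, "year": 1988}}
]}
"""

# json.loads maps JSON objects to dicts, arrays to lists, numbers to int/float.
data = json.loads(text)

titles = [entry["CD"]["title"] for entry in data["Catalog"]]
total = sum(entry["CD"]["price"]["value"] for entry in data["Catalog"])

# Writing the structure back out and re-reading it gives an equal structure.
round_trip = json.loads(json.dumps(data))

print(titles)
print(round(total, 2))
```

Unlike the XML version, numbers such as the year and price are already typed once parsed, so no manual conversion is needed.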

XML comes with a large family of other standards for querying and transforming (XQuery, XML Schema, XPath, XSLT, namespaces, …)

JSON format (from json.org)

Exercise

•  Represent the following information in JSON

<Person>
  <FirstName>Homer</FirstName>
  <LastName>Simpson</LastName>
  <Relatives>
    <Relative>Grandpa</Relative>
    <Relative>Marge</Relative>
    <Relative>Lisa</Relative>
    <Relative>Bart</Relative>
  </Relatives>
  <FavouriteBeer>Duff</FavouriteBeer>
</Person>

{
  "Person": {
    "firstname": "Homer",
    "lastname": "Simpson",
    "relatives": [ "Grandpa", "Marge", "Lisa", "Bart" ],
    "favouritebeer": "Duff"
  }
}

Check that the result is well formed
–  http://jsonlint.com
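One possible way to carry out this exercise programmatically is sketched below with the standard ElementTree and json modules. The key names follow the JSON answer above; the helper is_well_formed is a made-up name that simply tries to parse the text, which is a local stand-in for an online checker.

```python
import json
import xml.etree.ElementTree as ET

person_xml = """
<Person>
  <FirstName>Homer</FirstName>
  <LastName>Simpson</LastName>
  <Relatives>
    <Relative>Grandpa</Relative>
    <Relative>Marge</Relative>
    <Relative>Lisa</Relative>
    <Relative>Bart</Relative>
  </Relatives>
  <FavouriteBeer>Duff</FavouriteBeer>
</Person>
"""

root = ET.fromstring(person_xml)

# The repeated <Relative> elements become a JSON array.
person = {
    "firstname": root.findtext("FirstName"),
    "lastname": root.findtext("LastName"),
    "relatives": [relative.text for relative in root.find("Relatives")],
    "favouritebeer": root.findtext("FavouriteBeer"),
}
document = json.dumps({"Person": person}, indent=2)
print(document)


def is_well_formed(text):
    """Return True if text parses as JSON (hypothetical helper for this sketch)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_well_formed(document))
print(is_well_formed('{"firstname": "Homer",}'))  # trailing comma is not valid JSON
```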

Some more HTML/XML

<catalog>
  <book price="55" currency="USD">
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    <date>
      <year>1995</year>
      <month>January</month>
    </date>
  </book>
</catalog>

•  book, catalog, title, author, date, year, month are elements
•  price is an attribute (provides further information about an element, in this case the book element).
•  currency is an attribute.

Exercise

•  Given the following data: Yellow Balloon, $99.99
   –  i) What are three possible XML encodings of the balloon?
   –  ii) What are some of the circumstances in which one encoding might be better than the others?

XML Namespaces

•  Here is some information about an HTML table

<table>
  <tr> <td>Dogs</td> <td>Cats</td> </tr>
</table>

•  Here is some information about furniture

<table>
  <name>Australian Coffee Table</name>
  <width>90</width>
  <length>149</length>
</table>

•  What happens if we add these together in the one document?

XML Namespaces [example adapted from w3schools.com]

•  Namespace declarations are used to qualify names with Uniform Resource Identifiers (URIs). A URI uniquely identifies a resource on the Web. The name consists of two parts
   –  namespace:local-name

•  This is achieved indirectly by using namespace declarations and associated user-specified prefixes

<... xmlns:tabular-info="http://www.tabularinfo.com">
  <tabular-info:table>
    <tr> <td>Dogs</td> <td>Cats</td> </tr>
  </tabular-info:table>

•  The xmlns:tabular-info attribute declares a namespace with the prefix tabular-info

•  The URI doesn’t have to refer to a real Web resource

Namespace Scope

•  The scope of a namespace declaration is
   –  the element that contains the namespace declaration
   –  all its descendants (i.e. nested within the element)
   –  The declaration may be overridden by further nested namespace declarations
•  Namespaces can be used to describe both elements and attributes. Elements without a namespace prefix are in the default namespace, if one is declared. Attributes without a prefix are not in any namespace; the default namespace does not apply to them.

Namespace example

<collection xmlns="http://www.tabularinfo.com"
            xmlns:furniture="http://www.furniture.com">
  <table>
    <tr> <td>Dogs</td> <td>Cats</td> </tr>
  </table>
  <furniture:table>
    <furniture:name>Australian Coffee Table</furniture:name>
    <furniture:width>90</furniture:width>
    <furniture:length>149</furniture:length>
  </furniture:table>
</collection>

•  collection, the first table, tr and td use the default namespace (tabularinfo)
•  the second table, name, width and length use the furniture namespace
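A quick way to see which namespace each element ends up in is to parse the document with Python's standard ElementTree module, which reports every tag as {namespace-URI}local-name. The sketch below uses a shortened copy of the collection example; the short prefixes "t" and "f" in the lookup table are arbitrary choices for this example and need not match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

doc = """
<collection xmlns="http://www.tabularinfo.com"
            xmlns:furniture="http://www.furniture.com">
  <table>
    <tr><td>Dogs</td><td>Cats</td></tr>
  </table>
  <furniture:table>
    <furniture:name>Australian Coffee Table</furniture:name>
    <furniture:width>90</furniture:width>
  </furniture:table>
</collection>
"""

root = ET.fromstring(doc)

# Each tag is expanded to "{URI}local-name", so the two tables are distinguishable.
tags = [child.tag for child in root]
print(tags)

# Searches must also be namespace-qualified; the prefixes here are local to the query.
ns = {"t": "http://www.tabularinfo.com", "f": "http://www.furniture.com"}
name = root.findtext("f:table/f:name", namespaces=ns)
cells = [td.text for td in root.findall("t:table/t:tr/t:td", ns)]
print(name, cells)
```

Only the URI matters when comparing names: the prefix is just a local abbreviation for it.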

Namespace exercise [adapted from http://saxadapter.sourceforge.net/XMLNamespaceTutorial.html]

<a:Envelope xmlns="http://default/"
            xmlns:a="http://urla"
            xmlns:b="http://urlb"
            xmlns:c="http://urlc"
            a:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <a:Header xmlns="" xmlns:b="http://alturlb">
    <b:type>HelloWorld</b:type>
    <c:to xmlns:c="http://alturlc">John Doe</c:to>
    <from fromType="name">Jane Seymour</from>
  </a:Header>
  <a:Body>
    <text xmlns="http://newdefault">Hello</text>
    <b:mood>Tired</b:mood>
    <c:day>Thursday</c:day>
    <month>March</month>
  </a:Body>
</a:Envelope>

•  For each of the following, give its namespace URI:
   i) a:Envelope  ii) a:Header  iii) a:encodingStyle  iv) b:type  v) month  vi) from  vii) fromType  viii) a:Body  ix) text  x) b:mood

Schema

•  We need to ensure the integrity of our data – define its expected structure and content

•  The format of the data can be specified by a schema and a document validated using schema checking software
   –  Browsers use the HTML 5 Schema (see <!DOCTYPE html> at the start of an HTML document)
   –  Schemas are also used for other data formats
•  XML Schema (a W3C standard)
   –  Large and complex, uses regular-expression-like rules
•  JSON Schema (a draft standard)
   –  Python library jsonschema

JSON Schema Example [http://json-schema.org/examples.html]

{
  "type": "object",
  "properties": {
    "Catalog": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "artist": { "type": "string" },
          "Country": { "type": "string" },
          "price": {
            "type": "object",
            "properties": {
              "currency": { "type": "string" },
              "value": { "type": "number" }
            }
          }
        }
      }
    }
  }
}

JSON Schema

•  Written in JSON itself
•  Describes the structure of other data
•  Easy to validate a JSON document against its schema using a schema validator
   –  E.g. http://jsonschemalint.com/draft4/
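To give a feel for what a schema validator does, the sketch below hand-rolls a check for a very small subset of JSON Schema (only the "type", "properties" and "items" keywords). It is a toy written for illustration, not the jsonschema library and not a complete implementation of the standard; the function name conforms and the sample documents are invented here. In real work the third-party jsonschema package, with its validate(instance, schema) function, would be the more sensible choice.

```python
def conforms(value, schema):
    """Check value against a tiny subset of JSON Schema: type, properties, items."""
    type_checks = {
        "object": lambda v: isinstance(v, dict),
        "array": lambda v: isinstance(v, list),
        "string": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    }
    expected = schema.get("type")
    if expected is not None and not type_checks[expected](value):
        return False
    if isinstance(value, dict):
        # Only properties that are present are checked (nothing is "required" here).
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], subschema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True


schema = {
    "type": "object",
    "properties": {
        "Catalog": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {
                        "type": "object",
                        "properties": {
                            "currency": {"type": "string"},
                            "value": {"type": "number"},
                        },
                    },
                },
            },
        }
    },
}

good = {"Catalog": [{"title": "Empire Burlesque",
                     "price": {"currency": "USD", "value": 10.90}}]}
bad = {"Catalog": [{"title": 42,
                    "price": {"currency": "USD", "value": "cheap"}}]}

print(conforms(good, schema))
print(conforms(bad, schema))
```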

Python libraries

•  json
•  ElementTree
•  html.parser

Other forms of data

•  Sequences
•  Graphs
•  Can be represented in multiple ways

Sequence data: biology

Biological sequences (DNA, proteins)

>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP

https://en.wikipedia.org/wiki/S100_protein

Graphs: Social networks

https://www.flickr.com/photos/marc_smith/5592302165

Protein-Protein Interactions

http://www.nature.com/nrg/journal/v5/n2/fig_tab/nrg1272_F2.html

The Internet Graph (https://en.wikipedia.org/wiki/Opte_Project)

Linked Data

•  We need to connect data together by forming links.
   –  A key part of the Semantic Web
   –  Also important for the Internet of Things
      •  (26 billion things by 2020, each continuously producing data)

•  Principles of links from Tim Berners-Lee
   1.  All kinds of conceptual things, they have names now that start with HTTP.
   2.  If I take one of these HTTP names and I look it up, I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
   3.  When I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.

Linked Data Examples

•  DBPedia
•  Freebase
•  FOAF (friend of a friend)
•  Google Knowledge Graph
   –  https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html

Standards for Linked Data

•  Widely used standards (W3C Recommendations)
   –  JSON-LD (JSON Linked Data)
   –  RDF (Resource Description Framework)

JSON-LD (example from json-ld.org)

•  Provides mechanisms for specifying unambiguous meaning in JSON data
•  Provides extra keys with “@” sign
   –  “@context” (used to define meanings of terms, map to identifiers)
   –  “@type”
   –  “@id”
•  Use cases
   –  Google Knowledge Graph

JSON-LD Example (from https://en.wikipedia.org/wiki/JSON-LD)

{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
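The role of "@context" can be made concrete with a small sketch that looks up each short term in the context and replaces it with the full identifier it maps to. This is a deliberately simplified illustration using only the standard json module; it is not the JSON-LD expansion algorithm from the specification, and the helper name term_iri is invented for this example.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
""")

context = doc["@context"]


def term_iri(term):
    """Map a short term to its identifier using the context (simplified lookup)."""
    definition = context.get(term, term)
    if isinstance(definition, dict):   # expanded form: {"@id": ..., "@type": ...}
        return definition["@id"]
    return definition


# Replace every ordinary key by the identifier it stands for.
expanded = {term_iri(key): value
            for key, value in doc.items() if not key.startswith("@")}
type_iri = term_iri(doc["@type"])

print(doc["@id"], "is a", type_iri)
for key, value in expanded.items():
    print(key, "=", value)
```

After this lookup the key "name" is no longer ambiguous: it refers specifically to the FOAF notion of a name, which is what allows data from different sources to be linked.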

Graphs – RDF (resource description framework) [materials from w3.org]

Serialisation of RDF Example Graph

This graph can be serialised as XML (don’t worry about syntax!)

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>

RDF – Triple Store

•  An alternative format for storing RDF type data: a list of subject, predicate, object triples (written here in N-Triples syntax), as held in a triple store

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> <mailto:em@w3.org> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
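The idea of storing a graph as triples can be sketched in a few lines of plain Python: each statement is a (subject, predicate, object) tuple, and a query is a filter over the list. This is an in-memory toy for illustration only, not a real triple store; the constant names and the helper objects are invented for this example.

```python
# Shorthand constants for the long identifiers (names chosen for this sketch).
ME = "http://www.w3.org/People/EM/contact#me"
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Each triple is (subject, predicate, object).
triples = [
    (ME, CONTACT + "fullName", "Eric Miller"),
    (ME, CONTACT + "mailbox", "mailto:em@w3.org"),
    (ME, CONTACT + "personalTitle", "Dr."),
    (ME, RDF_TYPE, CONTACT + "Person"),
]


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in triples if s == subject and p == predicate]


full_name = objects(ME, CONTACT + "fullName")
types = objects(ME, RDF_TYPE)
predicates = sorted({p for s, p, o in triples if s == ME})

print(full_name, types)
print(len(predicates), "distinct predicates about", ME)
```

Because every statement has the same three-part shape, new facts about any resource can be added without changing a schema, which is part of the appeal of the triple representation.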

Freebase

•  A large database that connects entities together as a graph
   –  www.freebase.com
   –  Freebase is the basis of the Google Knowledge Graph that is used to improve search.
      •  https://developers.google.com/knowledge-graph/
•  Retrieving data from the Google Knowledge Graph
   –  Example adapted from http://www.nolan-nichols.com/knowledge-graph-via-sparql.html

Other formats for Graphs: Matrix Representation

[Figure: directed graph on nodes A, B, C, D]

      A  B  C  D
  A   0  0  1  0
  B   0  0  0  0
  C   0  1  0  0
  D   0  1  0  0

The entry in row X, column Y is ‘1’ iff there is an edge from node X to node Y.

Or use a relational table

  Source  Destination
  A       C
  C       B
  D       B
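The matrix and the source/destination table above hold the same information, and converting between them is mechanical. The following sketch shows one way to do it in plain Python; the variable names are arbitrary choices for this example.

```python
nodes = ["A", "B", "C", "D"]

# Row = source node, column = destination node.
matrix = [
    [0, 0, 1, 0],  # A -> C
    [0, 0, 0, 0],  # B has no outgoing edges
    [0, 1, 0, 0],  # C -> B
    [0, 1, 0, 0],  # D -> B
]

# Matrix -> relational table (list of (source, destination) rows).
edges = [
    (nodes[i], nodes[j])
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
    if value == 1
]

# Relational table -> matrix.
index = {name: position for position, name in enumerate(nodes)}
rebuilt = [[0] * len(nodes) for _ in nodes]
for source, destination in edges:
    rebuilt[index[source]][index[destination]] = 1

print(edges)
print(rebuilt == matrix)
```

The matrix always needs one cell per pair of nodes, while the table needs one row per edge, so the table is usually the more compact choice for sparse graphs.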

Next week

•  Workshop for this week
   –  Useful Unix tools
      •  Directory navigation and file manipulation, redirection, pipes, awk, sed, regex, grep ...
   –  Look at Section 1a before you attend your workshop
•  Lecture on Friday
   –  Data quality and data cleaning (lasting ~2 weeks)

What you should know about data formats

•  Why do we have different data formats and why do we wish to transform between different formats?
•  Motivation for using relational databases to manage information
•  Difference between a (standard) relational database and a NoSQL database
•  What is a CSV, what is a spreadsheet, what is the difference?
•  Be able to write regular expressions in Python format (operators .^$*+|[])
•  Difference between HTML and XML and when to use each
•  Motivation behind using XML and XML namespaces
•  Be able to read and write data in XML (elements, attributes, namespaces)
•  Be able to read and write data in JSON
•  Difference between XML and JSON. Applications where each can be used.
•  The purpose of using schemas for XML and JSON data.
•  The motivation behind Linked Data and the purpose of using JSON-LD or RDF to represent it.

Further reading

–  Relational databases
   •  Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
–  XML
   •  http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
–  JSON and JSON-LD
   •  http://json.org
   •  http://crypt.codemancers.com/posts/2014-02-11-An-introduction-to-json-schema/
   •  https://cloudant.com/blog/webizing-your-database-with-linked-data-in-json-ld/#.Vtp_UMfB_Gw
–  RDF
   •  https://www.w3.org/DesignIssues/LinkedData.html
   •  http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf
   •  http://www.dlib.org/dlib/may98/miller/05miller.html