Post on 31-Mar-2015
XML Basics
Wednesday May 12, 1999 SD99
Copyright 1999 Elliotte Rusty Harold
elharo@metalab.unc.edu
http://metalab.unc.edu/xml/slides/
What is XML?
• Extensible Markup Language
• A syntax for documents
• A Meta-Markup Language
• A Structural and Semantic language, not a formatting language
• Not just for Web pages
XML is a Meta Markup Language
• Not like HTML, troff, LaTeX
• Make up the tags you need as you need them
• The tags you create can be documented in a Document Type Definition (DTD)
• A meta syntax for domain-specific markup languages like MusicML, MathML, and CML
XML describes structure and semantics, not formatting
• XML documents form a tree
• Element and attribute names reflect the kind of the element
• Formatting can be added with a style sheet
A Song Description in HTML
<dt>Hot Cop<dd> by Jacques Morali, Henri Belolo, and Victor Willis
<ul><li>Producer: Jacques Morali<li>Publisher: PolyGram Records<li>Length: 6:20<li>Written: 1978<li>Artist: Village People</ul>
A Song Description in XML
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
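Because the markup names each piece of data, a program can pull out fields by name. A minimal sketch, assuming Python's standard xml.etree.ElementTree module (the deck itself does not prescribe a parser):

```python
import xml.etree.ElementTree as ET

# The song description from the slide above.
SONG_XML = """<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>"""

song = ET.fromstring(SONG_XML)            # root of the element tree
title = song.findtext("TITLE")            # text of the first TITLE child
composers = [c.text for c in song.findall("COMPOSER")]
child_names = [child.tag for child in song]  # document order is preserved

print(title, "by", ", ".join(composers))
```

Doing the same with the HTML version would mean guessing which list item holds which fact.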
Style Sheets provide formatting
SONG {display: block}
TITLE {display: block; font-family: Helvetica, serif; font-size: 20pt; font-weight: bold}
COMPOSER {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt; font-style: italic}
ARTIST {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt; font-weight: bold; font-style: italic}
PUBLISHER {display: block; font-size: 14pt; font-family: Times, Times New Roman, serif}
LENGTH {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt}
YEAR {display: block; font-family: Times, Times New Roman, serif; font-size: 14pt}
Attaching style sheets to documents
• Processing Instruction
<?xml-stylesheet type="text/css" href="song.css"?>
• Converter Program
What is XML used for?
• Domain-Specific Markup Languages
• Self-Describing Data
• Interchange of Data Among Applications
• Structured and Integrated Data
Domain-Specific Markup Languages
• Non proprietary format
• Don’t pay for what you don’t use
Self-Describing Data
• Much data is lost due to format problems
• XML is very simple
• XML is self-describing
• XML is well documented
<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>
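To illustrate what "self-describing" buys a program, here is a sketch that reads the PERSON record into a dictionary using only the element and attribute names. It assumes Python's standard xml.etree.ElementTree; the dictionary keys are illustrative choices, not part of the original.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH><DATE>21 Feb 1834</DATE></BIRTH>
  <DEATH><DATE>9 Dec 1905</DATE></DEATH>
</PERSON>"""

person = ET.fromstring(PERSON_XML)

# Attributes come from get(); nested elements are reached with simple paths.
record = {
    "id": person.get("ID"),
    "sex": person.get("SEX"),
    "name": f'{person.findtext("NAME/GIVEN")} {person.findtext("NAME/SURNAME")}',
    "born": person.findtext("BIRTH/DATE"),
    "died": person.findtext("DEATH/DATE"),
}
print(record)
```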
Interchange of Data Among Applications
• E-commerce
• Syndication
Structured and Integrated Data
• Can specify relationships between elements
• Can assemble data from multiple sources
XML Applications
• A specific markup language that uses the XML meta-syntax is called an XML application
• Different XML applications have their own more restricted syntaxes and vocabularies within the broader XML syntax
• Further syntax can be layered on top of this; e.g. data typing through DCDs or other schemas
Example XML Applications
• Web Pages
• Mathematical Equations
• Music Notation
• Vector Graphics
• Metadata
• and more…
Mathematical Markup Language
Channel Definition Format
<?xml version="1.0"?>
<CHANNEL HREF="http://metalab.unc.edu/xml/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://metalab.unc.edu/xml/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://metalab.unc.edu/xml/tradeshows.html">
    <TITLE>Trade shows and conferences about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://metalab.unc.edu/xml/lists.htm">
    <TITLE>Mailing Lists dedicated to XML</TITLE>
  </ITEM>
</CHANNEL>
Classic Literature
• The Complete Plays of Shakespeare
• The Bible
• The Koran
• The Book of Mormon
Vector Graphics
• Vector Markup Language (VML)
– Internet Explorer 5.0
– Microsoft Office 2000
• Scalable Vector Graphics (SVG)
The Resource Description Framework (RDF)
• Meta-data
• Dublin Core
• Better Web searching
An Example of RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>
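The rdf: and dc: prefixes are only shorthand for the namespace URIs declared on the root element, and a program matches on the URIs. A sketch of reading the description, assuming Python's standard xml.etree.ElementTree (the NS mapping is a local convenience, not something RDF defines):

```python
import xml.etree.ElementTree as ET

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/DC/">
  <rdf:Description about="http://metalab.unc.edu/xml/">
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI mapping used only for lookups in this script.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/DC/",
}

root = ET.fromstring(RDF_XML)
description = root.find("rdf:Description", NS)
about = description.get("about")          # the resource being described
creator = description.findtext("dc:CREATOR", namespaces=NS)
title = description.findtext("dc:TITLE", namespaces=NS)

print(f"{about}: {title!r} by {creator}")
```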
XML for XML
• XSL: The Extensible Stylesheet Language
• DCD: The Document Content Description Schema Language
• XLL: The Extensible Linking Language
XSL: The Extensible Stylesheet Language
• XSL Transformations
• XSL Formatting Objects
DCD: The Document Content Description Schema Language
• Data Typing in XML is Weak
• <MONTH>9</MONTH>
<DCD> <ElementDef Type="MONTH" Model="Data" Datatype="i1" Min="1" Max="12" /></DCD>
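The DCD fragment above says a MONTH must be an integer between 1 and 12. Without schema support, an application has to enforce that itself. A hand-written sketch of the same constraint, assuming Python's standard xml.etree.ElementTree (month_is_valid is a hypothetical helper, not a DCD processor):

```python
import xml.etree.ElementTree as ET

def month_is_valid(xml_text, minimum=1, maximum=12):
    """Return True if xml_text is a MONTH element holding an integer in range."""
    element = ET.fromstring(xml_text)
    if element.tag != "MONTH":
        return False
    try:
        # To the parser the content is just text; the typing happens here.
        value = int((element.text or "").strip())
    except ValueError:
        return False
    return minimum <= value <= maximum

print(month_is_valid("<MONTH>9</MONTH>"))
print(month_is_valid("<MONTH>September</MONTH>"))
```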
XLL: The Extensible Linking Language
• Any element can be a link
• Links can be bi-directional
• Links can be separated from the documents they connect
<footnote xlink:form="simple" href="footnote7.xml">7</footnote>
File Formats, In-house applications, and other behind the scenes uses
• Microsoft Office 2000
• Federal Express Web API
• Netscape What’s Related
Hello XML
<?xml version="1.0" standalone="yes"?>
<FOO>Hello XML!</FOO>
• Plain ASCII or UTF-8 text
• .xml is standard file extension
• Any standard text editor will work
The XML Declaration
• version attribute
– required
– always has the value 1.0
• standalone attribute
– yes
– no
• encoding attribute
– UTF-8
– 8859_1
– etc.
<?xml version="1.0" standalone="yes"?>
The FOO element
• Start tag <FOO>
• Contents "Hello XML!"
• End tag </FOO>
<FOO>Hello XML!</FOO>
greeting.xml
<?xml version="1.0" standalone="yes"?>
<GREETING>Hello XML!</GREETING>
Style sheets
• Separate from the XML document
• Different Languages
– Cascading Style Sheets Level 1 (CSS1): Internet Explorer 5.0, Mozilla 5.0
– Cascading Style Sheets Level 2 (CSS2): Internet Explorer 5 (partial), Mozilla 5.0 (partial)
– Extensible Stylesheet Language (XSL): Internet Explorer 5.0 (older draft, buggy); LotusXSL, XT, other non-browser converters
– Document Style Semantics and Specification Language (DSSSL): Jade
xml-stylesheet
• Style sheets are attached via an xml-stylesheet processing instruction in the prolog
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<GREETING>Hello XML!</GREETING>
– type attribute has the value text/css or text/xsl
– href attribute is a URL to the stylesheet, possibly relative
• Can also use non-browser converters like XT, LotusXSL, and Jade
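A converter program has to find the xml-stylesheet processing instruction itself. One possible way to locate it, sketched with Python's standard xml.dom.minidom (an assumption; the converters named above each have their own mechanisms):

```python
from xml.dom import minidom

DOC = '''<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<GREETING>Hello XML!</GREETING>'''

document = minidom.parseString(DOC)

# Processing instructions in the prolog are children of the document node,
# siblings of the root element. The XML declaration is not one of them.
stylesheets = [
    node.data
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    and node.target == "xml-stylesheet"
]
print(stylesheets)
```

The type and href "attributes" arrive as one uninterpreted string; the processing instruction's content is not parsed as real attributes.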
greeting.css
GREETING {display: block; font-size: 24pt; font-weight: bold}
A larger example: Baseball statistics
• Examine the data
• Design a vocabulary for the data
• Write a style sheet
Sample statistics
http://cbs.sportsline.com/u/baseball/mlb/stats.htm
Organizing the Data
• XML documents are trees.
• XML elements contain other elements as well as text
• Within these limits there's more than one way to organize the data
– Hierarchically
– Relationally
– Objects
What is the Root Element?
• The League?
• The Season?
• A custom Document element?
The Root Element
<?xml version="1.0"?>
<SEASON>
</SEASON>
• Choose SEASON for the root element
• Everything else will be a descendant of SEASON
• This is not the only possible choice
What are the Immediate Children of the Root?
• Leagues?
• Teams?
• Players?
• Games?
Child Elements
<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>
White space in XML is not especially significant
<?xml version="1.0"?>
<SEASON><YEAR>1998</YEAR></SEASON>
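In practice the parser still reports the extra white space, and it is the application that chooses to ignore it. A sketch comparing the two layouts, assuming Python's standard xml.etree.ElementTree (year_of is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

SPACED = """<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>"""

COMPACT = """<?xml version="1.0"?>
<SEASON><YEAR>1998</YEAR></SEASON>"""

def year_of(xml_text):
    """Return the YEAR content with surrounding white space removed."""
    return ET.fromstring(xml_text).findtext("YEAR").strip()

raw_spaced = ET.fromstring(SPACED).findtext("YEAR")   # includes line breaks
same_year = year_of(SPACED) == year_of(COMPACT)
print(repr(raw_spaced), same_year)
```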
Leagues
• Major league baseball is divided into two leagues
• Each league has
– a name
– three divisions
Divisions
• Each division has
– name
– 4-6 teams
Teams
• Each team has
– Name
– City
– Players
Player Data
• Each player has
– First name
– Last name
– Position
– Statistics
Player Batting Statistics
• G Games Played
• GS Games Started
• AB At Bats
• R Runs
• H Hits
• 2B Doubles
• 3B Triples
• HR Home Runs
• RBI Runs Batted In
• SB Stolen Bases
• CS Caught Stealing
• SH Sacrifice Hits
• SF Sacrifice Flies
• Err Errors
• PB Pitcher Balked
• BB Base on Balls (Walks)
• SO Strike Outs
• HBP Hit By Pitch
What does a player look like?
• Long names vs. short names
The Complete 1998 Major League
• Long version
A Style Sheet
• 1998shortstats.xml
• baseballstats.css
• <?xml-stylesheet type="text/css" href="baseballstats.css"?>
• styled1998shortstats.xml
Cascading Style Sheets
• Partially supported by Mozilla and IE 5.0
• Full W3C Recommendation
The Default Rule
• Not every element needs a rule
• The root element should be at least display: block
SEASON { font-size: 14pt; background-color: white; color: black; display: block}
A style rule for the YEAR element
• Make it look like a title
YEAR { display: block; font-size: 32pt; font-weight: bold; text-align: center}
Style Rules for Division and League Names
LEAGUE_NAME { display: block; text-align: center; font-size: 28pt; font-weight: bold}
DIVISION_NAME { display: block; text-align: center; font-size: 24pt; font-weight: bold}
Alternate Style Rules for Division and League Names
LEAGUE_NAME, DIVISION_NAME { display: block; text-align: center; font-weight: bold}
LEAGUE_NAME {font-size: 28pt }
DIVISION_NAME {font-size: 24pt }
Style Rules for Teams
• Team name and Team city must be one title
• Must be inline elements
• Previous and following must be block elements
TEAM_CITY { font-size: 20pt; font-weight: bold; font-style: italic}
TEAM_NAME { font-size: 20pt; font-weight: bold; font-style: italic}
TEAM, PLAYER {display: block}
Style Rules for Players
TEAM {display: table}
TEAM_CITY {display: table-caption}
TEAM_NAME {display: table-caption}
PLAYER {display: table-row}
SURNAME, GIVEN_NAME, POSITION, GAMES, GAMES_STARTED, AT_BATS, RUNS, HITS, DOUBLES, TRIPLES, HOME_RUNS, RBI, STEALS, CAUGHT_STEALING, SACRIFICE_HITS, SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT, HIT_BY_PITCH {display: table-cell}
Finished Style Sheet
SEASON {font-size: 14pt; background-color: white; color: black; display: block}
YEAR {display: block; font-size: 32pt; font-weight: bold; text-align: center}
LEAGUE_NAME {display: block; text-align: center; font-size: 28pt; font-weight: bold}
DIVISION_NAME {display: block; text-align: center; font-size: 24pt; font-weight: bold}
TEAM_CITY {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM_NAME {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM {display: block}
PLAYER {display: block}
Possible Extensions
• There should be captions like "RBI" or "At Bats."
• Derived numbers like batting averages are not included.
• The titles are short. E.g. "1998" instead of "1998 Major League Baseball".
• The document is so long it's hard to read. Something similar to IE5's collapsible outline view would be nice.
• Pitcher stats should be separated from batter stats.
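Derived numbers can be computed from the stored elements rather than stored themselves. A sketch of a batting average (hits divided by at bats), assuming Python's standard xml.etree.ElementTree and the long element names AT_BATS and HITS. The sample player and its numbers are made up for illustration and are not taken from the 1998 statistics.

```python
import xml.etree.ElementTree as ET

def batting_average(player):
    """Hits divided by at bats, or None when the player has no at bats."""
    at_bats = int(player.findtext("AT_BATS") or 0)
    hits = int(player.findtext("HITS") or 0)
    if at_bats == 0:
        return None  # avoid dividing by zero for players who never batted
    return round(hits / at_bats, 3)

# Hypothetical player record, for illustration only.
SAMPLE = """<PLAYER>
  <SURNAME>Example</SURNAME>
  <AT_BATS>400</AT_BATS>
  <HITS>120</HITS>
</PLAYER>"""

average = batting_average(ET.fromstring(SAMPLE))
print(average)
```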
Possible Solutions
• CSS Level 2
• XSL
• XSL + JavaScript
Well-formedness Rules
• Open and close all tags
• Empty tags end with />
• There is a unique root element
• Elements may not overlap
• Attribute values are quoted
• < and & are only used to start tags and entities
• Only the five predefined entity references are used
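Any conforming XML parser rejects a document that breaks these rules, so a quick way to test well-formedness is to try parsing. A sketch assuming Python's standard xml.etree.ElementTree (is_well_formed is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

checks = {
    "<GREETING>Hello XML!</GREETING>": "properly closed",
    "<P>never closed": "missing end tag",
    "<B><I>overlap</B></I>": "overlapping elements",
    "<A HREF=http://metalab.unc.edu/xml/>x</A>": "unquoted attribute value",
    "<A/><B/>": "two root elements",
}
for text, label in checks.items():
    print(label, is_well_formed(text))
```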
Open and close all tags
Empty tags end with />
• <BR/>, <HR/>, and <IMG/> instead of <BR>, <HR>, and <IMG>
• Web browsers deal inconsistently with these
• Can use <BR></BR> <HR></HR> <IMG></IMG> instead
There is a unique root element
• One element completely contains all other elements of the document
• In HTML files this is the HTML element
• XML Declaration is not an element
<?xml version="1.0" standalone="yes"?>
<GREETING>Hello XML!</GREETING>
Elements may not overlap
• If an element contains a start tag for an element, it must also contain the corresponding end tag
• Empty elements may appear anywhere
• Every non root element has a parent element
Attribute values are quoted
• Good:
– <A HREF="http://metalab.unc.edu/xml/">
• Bad:
– <A HREF=http://metalab.unc.edu/xml/>
< and & are only used to start tags and entities
• Good: <H1>O'Reilly &amp; Associates</H1>
• Bad: <H1>O'Reilly & Associates</H1>
• Good:
– <CODE>for (int i = 0; i &lt;= args.length; i++ ) { </CODE>
• Bad:
– <CODE>for (int i = 0; i <= args.length; i++ ) { </CODE>
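When text is generated by a program, escaping is best left to a library routine. A sketch using escape from Python's standard xml.sax.saxutils, which replaces &, <, and > with entity references:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "for (int i = 0; i <= args.length; i++ ) {"
escaped = escape(raw)                 # "<" becomes "&lt;"
company = escape("O'Reilly & Associates")

# Once escaped, the text can be wrapped in tags and parsed back unchanged.
round_trip = ET.fromstring("<CODE>" + escaped + "</CODE>").text

print(escaped)
print(company)
```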
Only the five predefined entity references are used
• Good:
– &amp;
– &lt;
– &gt;
– &quot;
– &apos;
• Bad:
– &copy;
– &reg;
– &tm;
– &alpha;
– &eacute;
– &nbsp;
– etc.
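A parser shows the difference directly: the five predefined references resolve without any DTD, while an HTML-style reference such as &copy; is an error unless a DTD declares it. A sketch assuming Python's standard xml.etree.ElementTree; the numeric character reference at the end is one common workaround, added here as an aside.

```python
import xml.etree.ElementTree as ET

# The five predefined references need no declarations.
predefined = ET.fromstring("<P>&amp;&lt;&gt;&quot;&apos;</P>").text

# &copy; is not declared anywhere, so the document is not well-formed.
try:
    ET.fromstring("<P>&copy; 1999</P>")
    copy_accepted = True
except ET.ParseError:
    copy_accepted = False

# A numeric character reference needs no declaration either.
numeric = ET.fromstring("<P>&#169; 1999</P>").text

print(predefined, copy_accepted, numeric)
```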
DTDs and Validity
• A Document Type Definition describes the elements and attributes that may appear in a document
• Validation compares a particular document against a DTD
• Well-formedness is a prerequisite for validity
What is a DTD?
• a list of the elements, tags, attributes, and entities contained in a document, and their relationship to each other
• internal vs. external DTDs
The importance of validation
• Ensures that data is correct before feeding it into a program
• Ensures that a format is followed
• Establishes what must be supported
• Not all documents need to be valid; sometimes well-formed is enough
A DTD for greeting.xml
• greeting.xml:
<?xml version="1.0"?>
<GREETING>Hello XML!</GREETING>
• greeting.dtd:
<!ELEMENT GREETING (#PCDATA)>
Document Type Declarations
<?xml version="1.0"?>
<!DOCTYPE GREETING SYSTEM "greeting.dtd">
<GREETING>Hello XML!</GREETING>
• specifies the root element
• gives a URL for the DTD
Invalid Documents
• Valid:
<GREETING>various random text but no markup</GREETING>
• Invalid: anything else including
<GREETING>
  <sometag>various random text</sometag>
  <someEmptyTag/>
</GREETING>
– or
<GREETING>
  <GREETING>various random text</GREETING>
</GREETING>
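What the validator checks for this one-line DTD can be written out by hand: the root must be GREETING and it may contain text but no child elements. The sketch below is a hand-coded stand-in for that single rule, assuming Python's standard xml.etree.ElementTree, which does not itself validate against DTDs; valid_greeting is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def valid_greeting(xml_text):
    """Mimic <!ELEMENT GREETING (#PCDATA)>: a GREETING root with no children."""
    root = ET.fromstring(xml_text)
    # len(root) counts child elements; text content does not count.
    return root.tag == "GREETING" and len(root) == 0

print(valid_greeting("<GREETING>various random text but no markup</GREETING>"))
print(valid_greeting("<GREETING><sometag>various random text</sometag><someEmptyTag/></GREETING>"))
print(valid_greeting("<GREETING><GREETING>various random text</GREETING></GREETING>"))
```

A real validating parser derives such checks from the DTD instead of hard-coding them.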
Validating Tools
• Command line programs like XJParse
• Online validators
– http://www.stg.brown.edu/service/xmlvalid/
– http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html
• Browsers
Element Declarations
• Each tag must be declared in a <!ELEMENT> declaration.
• A <!ELEMENT> declaration gives the name and content model of the element
• The content model uses a simple regular expression-like grammar to precisely specify what is and isn't allowed in an element
Content Specifications
• ANY
• #PCDATA
• Sequences
• Choices
• Mixed Content
• Modifiers
• Empty
ANY
<!ELEMENT SEASON ANY>
• A SEASON can contain any child element and/or raw text (parsed character data)
#PCDATA
<!ELEMENT YEAR (#PCDATA)>
• Parsed Character Data; i.e. raw text, no markup
#PCDATA
• Valid:
<YEAR>1999</YEAR>
<YEAR>99</YEAR>
<YEAR>1999 C.E.</YEAR>
<YEAR> The year of our Lord one thousand, nine hundred, and ninety-nine </YEAR>
• Invalid:
<YEAR>
  <MONTH>January</MONTH>
  <MONTH>February</MONTH>
  <MONTH>March</MONTH>
  <MONTH>April</MONTH>
  <MONTH>May</MONTH>
  <MONTH>June</MONTH>
  <MONTH>July</MONTH>
  <MONTH>August</MONTH>
  <MONTH>September</MONTH>
  <MONTH>October</MONTH>
  <MONTH>November</MONTH>
  <MONTH>December</MONTH>
</YEAR>
Child Elements
• To declare that a LEAGUE element must have a LEAGUE_NAME child:
<!ELEMENT LEAGUE (LEAGUE_NAME)>
<!ELEMENT LEAGUE_NAME (#PCDATA)>
Sequences
• Separate multiple required child elements with commas; e.g.
<!ELEMENT SEASON (YEAR, LEAGUE, LEAGUE)>
<!ELEMENT LEAGUE (LEAGUE_NAME, DIVISION, DIVISION, DIVISION)>
One or More Children +
<!ELEMENT DIVISION_NAME (#PCDATA)>
<!ELEMENT DIVISION (DIVISION_NAME, TEAM+)>
Zero or More Children *
<!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*)>
<!ELEMENT TEAM_CITY (#PCDATA)>
<!ELEMENT TEAM_NAME (#PCDATA)>
Zero or One Children ?
<!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION, GAMES, GAMES_STARTED, AT_BATS?, RUNS?, HITS?, DOUBLES?, TRIPLES?, HOME_RUNS?, RBI?, STEALS?, CAUGHT_STEALING?, SACRIFICE_HITS?, SACRIFICE_FLIES?, ERRORS?, WALKS?, STRUCK_OUT?, HIT_BY_PITCH?, WINS?, LOSSES?, SAVES?, COMPLETE_GAMES?, SHUT_OUTS?, ERA?, INNINGS?, EARNED_RUNS?, HIT_BATTER?, WILD_PITCHES?, BALK?, WALKED_BATTER?, STRUCK_OUT_BATTER?)>
Finished DTD
Choices
<!ELEMENT PAYMENT (CASH | CREDIT_CARD)>
<!ELEMENT PAYMENT (CASH | CREDIT_CARD | CHECK)>
Grouping With Parentheses
• Parentheses combine several elements into a single element.
• A parenthesized element can be nested inside other parentheses in place of a single element.
• A parenthesized element can be suffixed with a plus sign, an asterisk, or a question mark.
<!ELEMENT dl (dt, dd)*>
<!ELEMENT ARTICLE (TITLE, (P | PHOTO | GRAPH | SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)>
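Content models read much like regular expressions over the sequence of child element names. To make the analogy concrete, the ARTICLE model can be transcribed into a regular expression and matched against the child names. A sketch assuming Python's standard re and xml.etree.ElementTree; it checks only the order of child elements, ignores text content, and is not a general DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# (TITLE, (P | PHOTO | GRAPH | SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)
# Each child name is followed by a comma so that P cannot match part of PHOTO.
ARTICLE_MODEL = re.compile(
    r"TITLE,"
    r"(?:(?:P|PHOTO|GRAPH|SIDEBAR|PULLQUOTE|SUBHEAD),)*"
    r"(?:BYLINE,)?"
)

def matches_article_model(xml_text):
    """Return True if the ARTICLE's children appear in an allowed order."""
    article = ET.fromstring(xml_text)
    children = "".join(child.tag + "," for child in article)
    return ARTICLE_MODEL.fullmatch(children) is not None

print(matches_article_model("<ARTICLE><TITLE/><P/><PHOTO/><P/><BYLINE/></ARTICLE>"))
print(matches_article_model("<ARTICLE><P/><TITLE/></ARTICLE>"))
```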
Mixed Content
• Both #PCDATA and child elements in a choice
<!ELEMENT TEAM (#PCDATA | TEAM_CITY | TEAM_NAME | PLAYER)*>
• #PCDATA must come first
• #PCDATA cannot be used in a sequence
Empty elements
<!ELEMENT BR EMPTY>
<!ELEMENT IMG EMPTY>
<!ELEMENT HR EMPTY>
Internal DTDs
<?xml version="1.0"?>
<!DOCTYPE GREETING [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>Hello XML!</GREETING>
Internal DTD Subsets
<?xml version="1.0"?>
<!DOCTYPE GREETING SYSTEM "greeting.dtd" [
  <!ELEMENT GREETING (#PCDATA)>
]>
<GREETING>Hello XML!</GREETING>
• Internal entity and attribute-list declarations override external declarations; an element may be declared only once across the two subsets
Programming with XML
• Java works best
• C, Perl, Python etc. can also be used
• Unicode support is the biggest issue
SAX, the Simple API for XML
• Event based
• Programs can plug in different parsers
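"Event based" means the parser calls back into the program as it meets each start tag, end tag, and run of text, rather than building a tree first. A sketch using Python's standard xml.sax binding (the deck recommends Java; Python is used here only for brevity, and TagCounter is a hypothetical handler):

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count start tags and collect character data as events arrive."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text = []

    def startElement(self, name, attrs):
        # Called once for every start tag, in document order.
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # May be called several times for one run of text.
        self.text.append(content)

handler = TagCounter()
xml.sax.parseString(b"<SEASON><YEAR>1998</YEAR></SEASON>", handler)
print(handler.counts, "".join(handler.text))
```

The handler never sees the whole document at once, which is why this style suits large files.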
The Document Object Model (DOM)
To Learn More: Books
• XML: Extensible Markup Language
– IDG Books 1998
– ISBN 0-76453-199-9
• The XML Bible
– IDG Books 1999
– ISBN 0-76453-236-7
Questions?