XML and Linguistic Annotation

Chris Brew, Ohio State University
(credit to Marc Moens, Henry Thompson, David McKelvie, all Language Technology Group, University of Edinburgh)

http://www.ling.ohio-state.edu/~cbrew

Copyright 2000 Chris Brew

Summer School, July 2000

XML topics

What is XML?
  HTML, XML and SGML
  Wider context of XML

Data Description
  DTDs, Schemas

Query Languages
  XML Query, XQL, Quilt, LORE, LT QUERY

Style Languages
  CSS, XSL


What is XML?

It is a markup language used for annotating text.

It is concerned with logical structure:
  to identify sections, titles, section headers, chapters, paragraphs, …

It is not concerned with appearance:
  you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
  you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'

Derived from SGML.


Why is XML a big deal?

It is a W3C standard.

It is vendor-independent, platform-independent, application-independent, …
  unlike Word documents, RTF documents, PDF documents, Postscript documents, …

It is human readable
  ditto (for most values of 'human')

It is the Web interchange format.


Who is in charge of XML?

XML is a W3C Recommendation.

The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organizations.
  Membership costs serious money and confers voting rights.
  Procedures are complex, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.

The recommendation was written by the W3C’s XML Working Group.


XML as a career move?

Most of the big computer and entertainment companies believe XML is the solution. Exactly what was the problem?

Presenting a parts database over the Internet
Running an on-line job market (flipdog.com)
Usually not corpus creation.

Scholars win and lose
  SGML was a minority interest where we had serious influence on what facilities were used.
  XML is mainstream. We’re the minority now.
  This year’s .coms are busily hiring people who understand ontologies, NLP and web technology.


Does it live up to the hype?

Of course not, but…
  The basic idea is simple: labeled brackets. Lisp showed the power of this idea in knowledge representation.
  Knowledge representation is inherently hard. Lisp made it easier to state the problem, but it wasn’t itself the solution.
  XML won’t solve your knowledge representation problems either, but it will let you state them.

Labeled brackets++
  Labeled brackets, but designed for information exchange, with sophisticated input (and political pressures) from many interest groups.


Does it live up to the hype?

Yes. XML and allied standards (XSLT, XML Query, …) give us a framework for data interchange.

[Figure: three columns labeled Data, Transformation and End Users. Boxes: Weather Reports, XSL, Browser, Day Planner, Weather Model. Arrows labeled XML.]


Transformation

End users will differ in which parts of the weather reports they need, so the middle stage is the crux.
  One XML format defines the available data.
  Transformations map this format into what is needed by the different applications, leaving out bits that they don’t need.

One common transformation is to HTML, for browsers. (easy)

Another is to printed paper, for efficient random access. (difficult, because our quality expectations are so high)


Representing knowledge in text

Unformatted text Formatted text Structured Markup


Unformatted text

United Kingdom GeographyLocation: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and FranceMap references: Europe, Standard Time Zones of the World Area:total area: 244,820 km2 land area: 241,590 km2 comparative area: slightly smaller than Oregonnote: includes Rockall and Shetland IslandsLand boundaries: total 360 km, Ireland 360 km Coastline: 12,429 km


Formatted text

United Kingdom Geography Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France Map references: Europe, Standard Time Zones of the World Area: total area: 244,820 km2 land area: 241,590 km2 comparative area: slightly smaller than Oregon >> note: includes Rockall and Shetland Islands Land boundaries: total 360 km, Ireland 360 km Coastline: 12,429 km


XML marked up text

<chapter>
  <title>United Kingdom</title>
  <section>
    <title>Geography</title>
    <featlist>
      <feat name='Location'>Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France</feat>
      <feat name='Map references'>Europe, Standard Time Zones of the World</feat>
      <feat name='Area'>
        <featlist>
          <feat name='total area'>244,820 km2</feat>
          <feat name='land area'>241,590 km2</feat>
          <feat name='comparative area'>slightly smaller than Oregon
            <addendum>note: includes Rockall and Shetland Islands</addendum>
          </feat>
        </featlist>
      </feat>
      <feat name='Land boundaries'>total 360 km, Ireland 360 km</feat>
    </featlist>
  </section>
</chapter>


The syntax...

But aren't all those angle brackets still terribly cumbersome and complicated? Yes: XML is simple only relative to SGML. But…

There are tools that allow you to add XML annotation without the need to know XML

There are tools that allow you to search XML annotation without the need to know XML

XML is no more complex than other annotation schemes

If you roll your own scheme, you’ll have to write (and maintain) the tools.

If you use XML, part or all of your tool set will be provided by mainstream computer industry.


RTF Format

{\rtf1\ansi
\pard\plain\s1\fs36\ppscheme-3\lang2057 {\f1\lang1033 Formatted text\par }
\pard\plain\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs32\ppscheme-6\lang1033 United Kingdom}{\f1\fs20\lang1033 }{\f1\fs16\lang1033 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs24\lang1033 Geography}{\f1\fs12\lang1033 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Location: Western Europe, bordering on the North Atlantic Ocean \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 and the North Sea, between Ireland and France\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Map references: Europe, Standard Time Zones of the World \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Area: total area: 244,820 km2 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 land area: 241,590 km2 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 comparative area: slightly smaller than Oregon\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 >> note: includes}{\f1\fs20\lang1033 Rockall}{\f1\fs20\lang1033 and Shetland Islands\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Land boundaries: total 360 km, Ireland 360 km \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Coastline: 12,429 km\par}}


XHTML is a use of XML

HTML is derived from SGML, but it is an application, not a subset.
  SGML/XML let you define new types of document.
  HTML only gives you a language to write document instances.
    Hard-wired to a particular tag set (often with proprietary extensions, e.g. frames)
    Hard-wired to a particular typographic format, with limited style-sheets

XHTML is to XML as HTML is to SGML.


SGML/XML for computational linguists

What is XML?

SGML Lite
  Simpler to write
  Simpler to parse

HTML Heavy
  New user-definable tags
  Not (just) about browsing
  Data interchange
  Heavily legislated syntax


What is XML?

XML is just labeled brackets. You get elements with a start tag, some content, and an end tag.

<memo>
<sender>Marc Moens</sender>
<recipient>Henry, David</recipient>
<status>confidential</status>
<subject>GGP Contract</subject>
<message>The GGP contract is ready for signature.
Please sign the contract as well as the NDA.</message>
</memo>
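As a rough illustration of what labeled brackets buy you, the memo above can be read by any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the variable names (MEMO, fields) are ours and not part of the original material.

```python
import xml.etree.ElementTree as ET

MEMO = """<memo>
  <sender>Marc Moens</sender>
  <recipient>Henry, David</recipient>
  <status>confidential</status>
  <subject>GGP Contract</subject>
  <message>The GGP contract is ready for signature.
Please sign the contract as well as the NDA.</message>
</memo>"""

memo = ET.fromstring(MEMO)

# Each child element is one labeled bracket: a tag name plus its text content.
fields = {child.tag: child.text for child in memo}

print(fields["sender"], "->", fields["recipient"])
print("status:", fields["status"])
```

No memo-specific parsing code is needed: the element names alone are enough to pull the fields apart.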


XML is SGML made simple

SGML is labeled brackets too. You get elements with an optional start tag, some content, and an optional end tag.

<memo>
<sender>Marc Moens
<recipient>Henry, David
<status>confidential</status>
<subject>GGP Contract
<message>The GGP contract is ready for signature.
</memo>


XML Basics

Document Type Definition (DTD)
  Describes what can (and can’t) be in a particular type of document
  E.g. a memo DTD might specify that every memo has:
    sender (name),
    recipients (list of names),
    date (default: today),
    subject,
    message,
    status (confidential or unrestricted)

Document Instance
  Identifies the document type and contains the marked-up text
  E.g. a memo document instance:
    refers to the memo DTD
    contains text marked up in conformance with that DTD


XML and document structure

XML is used to make the structure of documents
• explicit
• machine readable

Document content:

  SGML Tags
  Marc Moens
  This is the first paragraph. It has some text.
  This is the second paragraph with some more text.


XML markup

<article status='draft'>
  <header>
    <title>XML tags</title>
    <author>Marc Moens</author>
  </header>
  <body>
    <para>This is the first paragraph. It has some text.</para>
    <para>This is the second paragraph with some more text <emph>and</emph> an embedded element.</para>
  </body>
</article>

Elements:
  start tags, e.g. <author>
  content, e.g. Marc Moens
  end tags, e.g. </author>

Elements mark up text to indicate structure and function of text (as opposed to appearance).

Tag name = element type
Elements can have attributes.

Elements and attributes are defined in the Document Type Definition.


XML markup: for structure and function

He shouted: 'Come here now, Mr Banks.'

<sentence>He <verb>shouted</verb>: <quote><verb mood='imperative'>Come </verb> here <emphasis>now</emphasis>, <person><title>Mr</title> <name>Banks</name></person></quote></sentence>

Encodes structure information to support rendering as well as data handling.

Data handling, e.g.
• search for all quotes inside sentences but not in footnotes;
• search for every mention of someone called Banks without finding the Banks of Scotland
[Use an XML-aware query tool]

Rendering, e.g.
• emphasis should be bold underline;
• quotes should be in italics
[Use a stylesheet]
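The two data-handling queries above can be sketched with a general-purpose XML library standing in for an XML-aware query tool. This is a minimal illustration using Python's xml.etree.ElementTree, not the tool the slides have in mind; the variable names are ours.

```python
import xml.etree.ElementTree as ET

SENTENCE = (
    "<sentence>He <verb>shouted</verb>: "
    "<quote><verb mood='imperative'>Come</verb> here "
    "<emphasis>now</emphasis>, "
    "<person><title>Mr</title> <name>Banks</name></person></quote>"
    "</sentence>"
)

root = ET.fromstring(SENTENCE)

# Quotes directly inside a sentence; a quote inside a footnote element
# would sit under a different parent and would not be selected.
quotes = root.findall("./quote") if root.tag == "sentence" else []

# People called Banks: match on <name> inside <person>, so a plain-text
# mention of "the Banks of Scotland" would not be returned.
people = [p for p in root.iter("person") if p.findtext("name") == "Banks"]

print(len(quotes), [p.findtext("title") for p in people])
```

The point is that the query is stated over markup structure, not over character strings.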


XML: Relevance for Linguists

Simplify and standardize appeal to context
  E.g. build a tokenizer which specifically works for headlines of newspaper articles:
  we need to be able to tell the tokenizer where the headline starts and ends.

Annotate text with interesting linguistic information
  E.g. use XML tags to record the results of a tokenizer or part-of-speech tagger, or of a human annotator.

Allow sharing of results between research efforts
  without having to write a new parser every time you get new material from somewhere.


XML: Relevance for Linguists (example)

cat text | lttok -q '.*/P' -m W | ltpos -q '.*/W' -m C

Use the tokeniser lttok on all paragraphs <P> in the text and mark the resulting words as <W> elements.
Then run the part-of-speech tagger ltpos over the text and tag all the <W> elements, putting the result in attribute C.

<W C=VBD>said</W><W C=DET>the</W><W C=NN>director</W><W C=IN>of</W> <W C=NNP>Russian</W><W C=NNP>Bear</W><W C=NNP>Ltd. </W><W C='.'>.</W>
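To make the idea of recording tool output as markup concrete, here is a small sketch in Python. It does not reproduce lttok or ltpos: the whitespace tokenizer and the TOY_TAGS lexicon are stand-ins invented for illustration.

```python
import xml.etree.ElementTree as ET

# Toy lexicon standing in for a real tagger (an assumption for illustration).
TOY_TAGS = {"said": "VBD", "the": "DET", "director": "NN", "of": "IN"}

def annotate(paragraph_text):
    """Wrap each whitespace-separated token in a W element with attribute C."""
    p = ET.Element("P")
    for token in paragraph_text.split():
        w = ET.SubElement(p, "W", {"C": TOY_TAGS.get(token, "UNK")})
        w.text = token
    return p

p = annotate("said the director")
print(ET.tostring(p, encoding="unicode"))
```

Once the tags are in the document, any later tool can read them back without knowing how they were produced.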


Associated Standards

XSLT
  Transforming documents

XML Query
  Find bits of documents

XML Schema
  Use element syntax for DTDs

Namespaces
  Ensure that
    <art:draw><cube/><cube/></art:draw>
  and
    <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw>
  both get processed correctly.
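A sketch of how a namespace-aware parser keeps the two kinds of draw apart, using Python's xml.etree.ElementTree. The wrapper element doc and the two namespace URIs are made up for the example; the slide does not say which URIs the prefixes are bound to.

```python
import xml.etree.ElementTree as ET

# The namespace URIs below are hypothetical placeholders.
DOC = """<doc xmlns:art="http://example.org/art"
     xmlns:soccer="http://example.org/soccer">
  <art:draw><cube/><cube/></art:draw>
  <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw>
</doc>"""

root = ET.fromstring(DOC)
ns = {"art": "http://example.org/art", "soccer": "http://example.org/soccer"}

# Same local name "draw", but two distinct element types once the
# prefix has been resolved to its namespace.
art_draws = root.findall("art:draw", ns)
soccer_draws = root.findall("soccer:draw", ns)
teams = [t.get("name") for t in soccer_draws[0].findall("team")]

print(len(art_draws), len(soccer_draws), teams)
```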


Infrastructure standards

XPath
  Referring to parts of documents

XPointer
  Pointing at documents and parts of documents

DOM
  Uniform programmer’s interface to document trees (abstracts away from some details)

SAX
  Stream-based document interface (essential for big documents)

Information Set
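The difference between the tree-based and the stream-based interface can be sketched with the DOM and SAX modules in Python's standard library. The tiny document and the WordCounter class are ours, for illustration only.

```python
import xml.dom.minidom
import xml.sax

DOC = "<S><W>The</W><W>cat</W><W>sat</W></S>"

# DOM: the whole tree is built in memory, then navigated.
dom = xml.dom.minidom.parseString(DOC)
dom_words = [w.firstChild.data for w in dom.getElementsByTagName("W")]

# SAX: the parser calls back as it streams through the input;
# nothing is kept unless the handler chooses to keep it.
class WordCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "W":
            self.count += 1

counter = WordCounter()
xml.sax.parseString(DOC.encode("utf-8"), counter)

print(dom_words, counter.count)
```

For a corpus-sized file the SAX style matters: memory use stays flat however long the document is.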


XML in detail

Well-formedness and validity
DTDs
XML tools
XSLT
XML Query


Well-formed and Valid documents

Well-formed XML
  Each start tag has an end tag
  XML content is rooted in a single “document element”
  Valid encoding declaration

Valid
  Well-formed
  All elements mentioned in DTD
  All entities defined
  All parent-child relations as described in DTD
  All attributes used as described in DTD
  All element IDs unique


Why well-formedness?

A simpler standard for documents to meet
  Can be determined without reference to a DTD
  Simplifies the parser
  Retains the “standalone” property of HTML, which was a big win.

Non-validating XML systems can thus still be conformant, provided they check well-formedness.

If you have a DTD (or a Schema) you can do more refined processing.
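A well-formedness check needs nothing but a parser. The sketch below uses Python's xml.etree.ElementTree, which is non-validating: it checks well-formedness and never consults a DTD for validity. The function name is_well_formed is ours.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; validity is not checked."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every start tag closed, single document element.
print(is_well_formed("<memo><sender>Marc Moens</sender></memo>"))
# <sender> is never closed.
print(is_well_formed("<memo><sender>Marc Moens</memo>"))
# Two top-level elements: not rooted in a single document element.
print(is_well_formed("<memo/><memo/>"))
```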


DTDs

Document Type Definitions: the grammar of a document family
  Elements
  Attributes & values
  Entities & parameter entities
  Comments


DTD: Elements

Elements are used to structure a document. Element types are declared in the DTD:

<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ELEMENT section (title, para+) >
  <!ELEMENT para (#PCDATA) >
  <!ELEMENT title (#PCDATA) >
]>


DTD: Attribute declarations

Attributes specify properties of elements. The attributes which may appear on elements of a given type are also declared in the DTD.

<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ATTLIST article artno CDATA #IMPLIED >
  <!ELEMENT section (title, para+) >
  <!ATTLIST section secid ID #REQUIRED >
  <!ELEMENT para (#PCDATA) >
  <!ELEMENT title (#PCDATA) >
]>


DTD: Entity declarations

Entities provide short names for commonly used strings, and are also declared in the DTD.

<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ATTLIST article artno CDATA #IMPLIED >
  <!ELEMENT section (title, para+) >
  <!ATTLIST section secid ID #REQUIRED >
  <!ENTITY ltg "Language Technology Group">
]>


DTD: IDs

IDs are rigid designators for particular elements in the document. They are declared using type ID.

<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ATTLIST article artno CDATA #IMPLIED >
  <!ELEMENT section (title, para+) >
  <!ATTLIST section secid ID #REQUIRED >
  <!ENTITY ltg "Language Technology Group">
]>

Potentially, IDs allow processors to provide fast random access to parts of documents.

IDs must be unique. Checking might be onerous.


XML tools

XML Parser
LT XML Toolkit
XSLT: xt and Saxon


XML Parser

Probably the most important single bit of XML software
Uses the DTD to check whether a document instance is valid


Example: >> cat memo.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
</para>
</article>


Add correct output

Example: >> xmlnorm -V memo.xml

Entity reference has been replaced with entity text by the parser.
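The same entity replacement can be observed with any parser that reads the internal DTD subset. The sketch below uses Python's xml.etree.ElementTree as a stand-in for xmlnorm (it is not xmlnorm and does not validate); the shortened memo text is ours.

```python
import xml.etree.ElementTree as ET

MEMO = """<?xml version="1.0"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>Here is a reference to the &ltg; entity.</para>
</article>"""

article = ET.fromstring(MEMO)

# By the time the application sees the text, &ltg; has already been
# replaced by the entity's replacement text.
para_text = article.findtext("para")
print(para_text)
```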


Exercise

Practice using xmlnorm to check your documents
Add some new entities to the memo.
Experience some of xmlnorm’s error messages
Begin to think about DTD design
Practice using Web browsers to look at XML files
Get a glimpse of what XSL is about

DTD: Comments

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ENTITY ltg 'Language Technology Group'>
]>


Element type declaration details

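<!ELEMENT chapter (title, section+) >

<!ELEMENT
  keyword

chapter
  element type
  starts with a letter
  may contain hyphens, digits, full stops
  case sensitive in XML (SGML names were not)
  one name per declaration in XML (SGML allowed more than one)

(title, section+)
  content model
  an unambiguous regular expression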


Element types: Content model

<!ELEMENT article (title, section+) >

+   at least one, possibly more
?   optional
*   zero or more

,   all occur, in that order
|   exclusive or

<!ELEMENT header ( ( (title, subtitle?), (author, affil)+ ), (date | status)? ) >

XML dropped SGML’s neat & connector (all occur, any order).


Element types: Content model options

<!ELEMENT graphic EMPTY >

EMPTY
  no content
  no separate end tag needed: written <graphic/> in XML
  point semantics: attributes may specialise

(#PCDATA)
  text only

ANY
  no constraint: sub-elements and/or text

(#PCDATA|emph)*
  'mixed content'


Element grammar

Since the content model is a regular expression, the markup grammar is context free
  Except for one thing: the ANY keyword

Note that any realistic application interprets the markup tree. The interpretation could be anything. All bets are off…
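Because a content model is a regular expression over child element names, a toy checker is easy to sketch. The Python below hand-translates two content models from the article DTD into Python regexes; the translation, the CONTENT_MODELS table and the function names are ours, and a real validating parser does considerably more (attributes, entities, IDs).

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: (title, section+) and (title, para+).
CONTENT_MODELS = {
    "article": r"title( section)+",
    "section": r"title( para)+",
}

def children_ok(element):
    """Check one element's child sequence against its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:  # no model recorded here: treat as unconstrained
        return True
    child_names = " ".join(child.tag for child in element)
    return re.fullmatch(model, child_names) is not None

def tree_ok(element):
    """Apply the check to an element and, recursively, to all descendants."""
    return children_ok(element) and all(tree_ok(c) for c in element)

good = ET.fromstring(
    "<article><title>T</title>"
    "<section><title>S</title><para>p</para></section></article>"
)
bad = ET.fromstring("<article><section><title>S</title></section></article>")

print(tree_ok(good), tree_ok(bad))
```

Each element is checked locally against a regular expression, and the recursion over the tree supplies the context-free part.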


Example: >> nsgmls exa2a.sgm

nsgmls:exa2a.sgm:7:42:E: element "PI" undefined
nsgmls:exa2a.sgm:8:24:E: general entity "T." not defined and no default entity
(ARTICLE
(PARA
-Here is some text with an inequality: a
(PI
-2and an abbreviation: AT
)PI
)PARA
)ARTICLE

<pi/ interpreted as start tag

&T. interpreted as entity reference, not defined so gone from output

No final C line to confirm validity.


Escaping special characters

There are several ways around the problem of introducing XML's meta-syntax characters into documents:

Use numeric character references
  AT&#38;T

Use CDATA marked sections
  <![CDATA[<this> is data &not markup]]>

XML provides built-in definitions for amp, lt, gt, quot and apos
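Both escaping routes can be tried out in a few lines of Python. The sketch uses xml.sax.saxutils.escape from the standard library, which rewrites the characters &, < and > using the built-in entities; the sample sentence is adapted from the nsgmls example and the variable names are ours.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "an inequality: a<pi/2 and an abbreviation: AT&T."

# escape() rewrites &, < and > as &amp;, &lt; and &gt;.
escaped = escape(raw)
via_entities = ET.fromstring("<para>" + escaped + "</para>")

# A CDATA section protects the same characters without rewriting them.
via_cdata = ET.fromstring("<para><![CDATA[" + raw + "]]></para>")

print(escaped)
print(via_entities.text == raw, via_cdata.text == raw)
```

Either way, the application gets back exactly the characters the author meant.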


Example: >> nsgmls exa2b.sgm

(ARTICLE
(PARA
-Here is some text with an inequality: a<pi/2\nand an abbreviation: AT&T.
)PARA
)ARTICLE
C

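DTD: Comments

In SGML, double hyphens inside a declaration act as comment delimiters. XML only allows comments of the form <!-- ... -->, placed between declarations.

<!ELEMENT article (title, section+)>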


DTD: Attributes

<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>


DTD Attribute declarations: syntax

<!ATTLIST article artno CDATA #IMPLIED >

<!ATTLIST
  keyword

article
  element type

artno
  attribute name

CDATA
  attribute type

#IMPLIED
  default type: #REQUIRED, #IMPLIED (= optional) or #FIXED


Attribute Value types (contd)

<!ATTLIST article artno CDATA #IMPLIED >

CDATA    valid SGML characters
ENTITY   declared entity name
ID       unique name
IDREF    reference to a unique name


Cross-references

<!DOCTYPE article [
<!ELEMENT article (section+)>
<!ELEMENT section (#PCDATA | xref)*>
<!ATTLIST section secid ID #IMPLIED>
<!ELEMENT xref EMPTY>
<!ATTLIST xref xrefid IDREF #REQUIRED>
]>
<article>
<section secid='s1'>Here is some text.</section>
<section>In section <xref xrefid='s1'/> we showed
you how to create cross-references.</section>
</article>


IDs and IDREFs

In a valid SGML/XML document
  IDs are unique
  IDREFs are discharged (each one refers to an ID that exists)

Applications may interpret IDREF/ID connections

Links from elsewhere may target IDs
  cf. HTML 'name' attribute as the target for #....
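The two conditions on IDs and IDREFs are easy to check by hand, which shows what a validating parser has to do. The sketch below is in Python; the function check_ids is ours, and it hard-codes the attribute names secid and xrefid from the cross-reference example instead of reading them from the DTD as a real validator would.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
<section secid="s1">Here is some text.</section>
<section>In section <xref xrefid="s1"/> we showed you
how to create cross-references.</section>
</article>"""

def check_ids(root, id_attr="secid", ref_attr="xrefid"):
    """Return (duplicate IDs, IDREFs that point at no ID)."""
    seen, duplicates = set(), set()
    for el in root.iter():
        value = el.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.add(value)
            seen.add(value)
    dangling = {
        el.get(ref_attr)
        for el in root.iter()
        if el.get(ref_attr) is not None and el.get(ref_attr) not in seen
    }
    return duplicates, dangling

print(check_ids(ET.fromstring(DOC)))
```

Two empty sets mean that the IDs are unique and every IDREF is discharged.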


Attribute value types: list

CDATA        valid SGML characters                      author='Robin Hood'
ENTITY/IES   declared entity name(s)                    figs='pict2 pict7'
ID           unique name                                id='foo37'
IDREF(S)     reference(s) to an ID                      refid='foo2 foo37'
NMTOKEN(S)   name(s) w/o initial-character restraint    code='96-mm01 98-a'
NOTATION     data content notation                      encoding='eps'


Enumerated attribute values

Attribute values can also be constrained to be one of a finite set of allowed values

<!ATTLIST section status (draft|alpha|beta|final) 'draft' >

<section status='alpha'>
<section status='final'>
<section>
<section status='gamma'>   Not valid


Elements vs Attributes

<!ELEMENT date (day, month, year)>
<!ELEMENT day (#PCDATA)>

Content is unconstrained
Order will be enforced

vs

<!ELEMENT date EMPTY>
<!ATTLIST date day   NUMBER #REQUIRED
               month NUMBER #REQUIRED
               year  NUMBER #REQUIRED>

Content is constrained
Order is unconstrained

(NUMBER is an SGML declared value. XML has no numeric attribute type; NMTOKEN is the nearest equivalent.)


DTD: Entities

<!DOCTYPE article [
<!ELEMENT article (#PCDATA)>
<!ENTITY ltg 'Language Technology Group'>
]>
<article>The &ltg; carries out application-oriented research in
language engineering. The &ltg; is based within
the HCRC.</article>

Each occurrence of &ltg; in the text is replaced by Language Technology Group during parsing.

Entities can be nested:

<!ENTITY hcl 'HCRC &ltg;'>


DTD: Parameter Entities

Like entities, except that they are used within the DTD

<!ENTITY % section '(title?, para+)'>

Each time the parser finds %section; in the DTD, it will replace it with (title?, para+)

<!ENTITY % section '(title?, para+)'>
<!ELEMENT article (title, %section;+)>
<!ELEMENT subsect (%section;+)>


DTD

That’s almost all there is to it
  For more detail, see the XML standard
    which, as Michael Kay puts it, is like tax legislation

DTD syntax differs from element syntax
  Harder to learn/use
  Hence XML Schema

Also, DTDs were designed to be used by document designers, not for distributed data interchange
  XML can use a DTD, but doesn’t assume one.
  Composite documents entail composite DTDs, but these don’t exist.
  Namespace prefixes add extra complexity


XSL Transformations

Content from one document.

Style from another

Structure


barts_stylish_memo.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="memo.xsl"?>
<!DOCTYPE article [
<!ELEMENT article (title,(para|credit)+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
<!ENTITY author "Bart Simpson">
<!ENTITY techie "Lisa Simpson">
<!ENTITY parents "Marge and Homer">
<!ENTITY school "M&amp;M University">
]>
<article>
<title>Bart's Ph.D Thesis</title>
<para> by &author;: &school;</para>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
Please may I stop now?</para>
<credit>&techie; of &school; for slick XML authoring.
</credit>
<credit>&parents; for unfailing support.
</credit>
</article>


memo.xsl

IE5 attempts to display the style in visual form, without any content.

Germ of a good idea here.


Source of memo.xsl

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
 <head><title><xsl:value-of select="//title"/></title></head>
 <body BGCOLOR='#FFFFCC'>
  <h1><xsl:value-of select="//title"/></h1>
  <xsl:for-each select="//para">
   <p><xsl:value-of/></p>
  </xsl:for-each>
  <hr/>
  <p><i> Thanks to: </i><br/>
   <xsl:for-each select="//credit">
    <xsl:value-of/><br/>
   </xsl:for-each>
  <hr/></p>
 </body>
</html>
</xsl:template>
</xsl:stylesheet>


Fill in the blanks

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
 <head><title>•••</title></head>
 <body BGCOLOR='#FFFFCC'>
  <h1>•••</h1>
  <xsl:for-each select="//para">
   <p>•••</p>
  </xsl:for-each>
  <hr/>
  <p><i> Thanks to: </i><br/>
   <xsl:for-each select="//credit">
    ••• <br/>
   </xsl:for-each>
  <hr/></p>
 </body>
</html>
</xsl:template>
</xsl:stylesheet>

XSLT gives you tools for sending part of document to one place, part to another.

Simplest use is pure fill in the blanks. Anybody who uses HTML, PHP and so on will be comfortable with this use of XSLT

If necessary, it is a Turing-complete programming language. It gives you the rope if you need it.


Fill in the blanks

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
 <head><title> <xsl:value-of select="//title"/> </title></head>
 <body BGCOLOR='#FFFFCC'>
  <h1> <xsl:value-of select="//title"/> </h1>
  <xsl:for-each select="//para">
   <p> <xsl:value-of/> </p>
  </xsl:for-each>
  <hr/>
  <p><i> Thanks to: </i><br/>
   <xsl:for-each select="//credit">
    <xsl:value-of/> <br/>
   </xsl:for-each>
  <hr/></p>
 </body>
</html>
</xsl:template>
</xsl:stylesheet>
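For readers more at home in a general-purpose language, the fill-in-the-blanks idea can be sketched in Python. This is an analogy, not XSLT and not what IE5 does internally: an HTML skeleton with holes, filled from the memo. The shortened memo and the function names are ours.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

MEMO = """<article>
<title>Bart's Ph.D Thesis</title>
<para>This is the text of a very short article.</para>
<credit>Lisa Simpson for slick XML authoring.</credit>
<credit>Marge and Homer for unfailing support.</credit>
</article>"""

def text_of(element):
    """Whitespace-trimmed text of an element, escaped for use in HTML."""
    return escape("".join(element.itertext()).strip())

def to_html(article):
    """Fill the blanks of a fixed HTML skeleton from the article."""
    title = text_of(article.find("title"))
    paras = "".join(f"<p>{text_of(p)}</p>" for p in article.iter("para"))
    credits = "".join(f"{text_of(c)}<br/>" for c in article.iter("credit"))
    return (
        f"<html><head><title>{title}</title></head><body>"
        f"<h1>{title}</h1>{paras}<hr/>"
        f"<p><i>Thanks to: </i><br/>{credits}</p><hr/></body></html>"
    )

html = to_html(ET.fromstring(MEMO))
print(html)
```

The stylesheet version has the advantage that the skeleton itself is an XML document, so it can be checked and edited with the same tools as the content.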


XSLT standards

Microsoft’s implementation in IE5 is non-standard (they put it out well before the standard existed). They are moving to conformance.

James Clark’s xt and Michael Kay’s Saxon are much more complete and conformant

W3C eats its own lunch. The HTML versions of the XML standard are generated with XSL

In practice, current best options are
  Static data: pre-generate HTML from XML at publication time
  Dynamic data: use Saxon or xt as Java Servlets


Generating HTML

HTML is generated by running Saxon on poem.xml and poem.xsl

saxon poem.xml poem.xsl > poem.html


Using IE5 to view poem.xml

<poem>
<author>Rupert Brooke</author>
<date>1912</date>
<title>Song</title>
<stanza>
<line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line>
<line>And the hawthorn quickens with buds of green</line>
<line>And my heart with buds of pain.</line>
</stanza>
<stanza>
<line>My heart all Winter lay so numb,</line>
<line>The earth so dead and frore,</line>
<line>That I never thought the Spring would come again</line>
<line>Or my heart wake any more.</line>
</stanza>
<stanza>
<line>But Winter's broken and earth has woken,</line>
<line>And the small birds cry again;</line>
<line>And the hawthorn hedge puts forth its buds,</line>
<line>And my heart puts forth its pain.</line>
</stanza>
</poem>


poem.xsl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:template match="poem">
<html>
<head>
<title><xsl:value-of select="title"/></title>
</head>
<body>
<xsl:apply-templates select="title"/>
<xsl:apply-templates select="author"/>
<xsl:apply-templates select="stanza"/>
<xsl:apply-templates select="date"/>
</body>
</html>
</xsl:template>

<xsl:template match="title">
<div align="center"><h1><xsl:value-of select="."/></h1></div>
</xsl:template>

<xsl:template match="author">
<div align="center"><h2>By <xsl:value-of select="."/></h2></div>
</xsl:template>

<xsl:template match="stanza">
<p><xsl:apply-templates select="line"/></p>
</xsl:template>

<xsl:template match="line">
<xsl:if test="position() mod 2 = 0">&#160;&#160;</xsl:if>
<xsl:value-of select="."/><br/>
</xsl:template>

<xsl:template match="date">
<p><i><xsl:value-of select="."/></i></p>
</xsl:template>

</xsl:stylesheet>

Namespace declaration is different (standard conforming) for Saxon.

+XSLT language is different.

+ Saxon and XT are really easy to install.

- IE5 has millions of current users


“Problems” with XML

Uses complex and weird terminology
  Yes. But so does the ANSI C standard. So do most fields…

Not convenient for specifying graphs (as opposed to trees)
  This is a point about graphs, not XML. Unification grammar notations get unwieldy too.

Not as convenient as plain text
  True for some tasks, but the extra structure of XML lets you do things that you wouldn’t even try with plain text.


XML tools for Unix

Simple equivalents of UN*X tools are available (for free) to do simple SGML processing

We'll introduce them using examples, and give details at the end


sggrep

LT XML program for searching for structure and text in XML files
  sggrep -q query -s subquery -t regexp in.xml

Options
  -d DTD : Specify a DTD explicitly. File is an XML file
  -r : Attribute values in queries are regular expressions.
  -v : Invert sense of sub-query+regexp.
  Other options

LT XML query language

Two-dimensional regular expressions
  First dimension is over tree paths
    Based on file path analogy:
    DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs
  Second dimension is regular expressions over text content of leaf nodes
    Select Ss containing Ws whose text is it's or its
    -q S -s './W' -t "^(it's|its)$"
    Full UTZOO (Henry Spencer) regular expression support

Influential, slightly dated now.
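The two dimensions (a path over the tree, then a regular expression over leaf text) can be approximated in a few lines of Python. This is a sketch of the idea only, not of sggrep itself: the function select is ours, and it matches the query element at any depth, roughly like a query of the form .*/S.

```python
import re
import xml.etree.ElementTree as ET

CORPUS = """<P>
<S><W>its</W><W>tail</W></S>
<S><W>the</W><W>cat</W></S>
<S><W>it's</W><W>late</W></S>
</P>"""

def select(root, query_tag, sub_tag, pattern):
    """Elements named query_tag that have a direct sub_tag child whose
    text matches pattern (tree dimension first, text dimension second)."""
    regexp = re.compile(pattern)
    return [
        el for el in root.iter(query_tag)
        if any(regexp.search(child.text or "") for child in el.findall(sub_tag))
    ]

hits = select(ET.fromstring(CORPUS), "S", "W", r"^(it's|its)$")
print([[w.text for w in s] for s in hits])
```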


sggrep: examples of use

sggrep -q ".*/P/S" -s "./W[TAG=NN]"
  find all S elements occurring inside a P element at any depth which immediately contain a W element with attribute TAG="NN".

sggrep -q ".*/P/S/W[TAG=NN]"
  find those W elements themselves

sggrep -q ".*/S/W[0]" -t "^[a-z]"
  find all sentence-initial words starting with a lower case letter.


sgmltrans

Converts XML into different formats.
  sgmltrans -r rulefile file.nsg > file.txt

Sample rule file:

.*/W            matches W
  ""            what to print at start tag
  "/$TAG\n"     what to print at end tag: value of TAG attribute
.*/W/#          matches text inside W
  " " --> ""    text replacement: eliminate space if any
.*/S            matches S
  ""            start tag: nothing
  "\n"          end tag: make each S on separate line
.*              matches other markup


sgmltrans: example of use

The previous rule file would do this:

<?xml version='1.0'?>
<TEST><P><S>
<W TAG='A'>The </W>
<W TAG='B'>cat </W>
<W>sat </W>
<C>.</C></S>
<S>
<W TAG='A'>on </W>
<W TAG='B'>the </W>
<W>mat </W>
<C>.</C>
</S></P></TEST>

becomes

The/A
cat/B
sat/

on/A
the/B
mat/
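The effect of the rule file can be imitated in Python to make the mapping explicit. This sketch is not sgmltrans and does not read rule files: the function word_slash_tag hard-codes the three rules used in the example (print word/TAG for each W, drop spaces inside W, end each S with a newline).

```python
import xml.etree.ElementTree as ET

TEXT = """<TEST><P><S>
<W TAG='A'>The </W><W TAG='B'>cat </W><W>sat </W><C>.</C></S>
<S><W TAG='A'>on </W><W TAG='B'>the </W><W>mat </W><C>.</C></S>
</P></TEST>"""

def word_slash_tag(root):
    """One 'word/TAG' line per W; an empty line closes each S."""
    lines = []
    for sentence in root.iter("S"):
        for w in sentence.iter("W"):
            word = (w.text or "").replace(" ", "")   # rule: " " --> ""
            lines.append(f"{word}/{w.get('TAG', '')}")
        lines.append("")                             # rule: "\n" at end of S
    return "\n".join(lines)

print(word_slash_tag(ET.fromstring(TEXT)))
```

A W without a TAG attribute yields an empty tag after the slash, as in the sat/ and mat/ lines above.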


sgrpg: SGML report generator

Program for making more complex queries of normalised SGML and for transforming SGML. Provides nested subqueries and sequencing

Usage:
  sgrpg query sub-query regexp out-fmt oargs < file.nsg > file.txt
  sgrpg -f pat-file < file.nsg > file.txt

This now looks like a design study for XSLT and XML Query.

It has one advantage: it was designed (from the outset) for big documents.


The British National Corpus

2 gigabytes of contemporary English
Marked up to word level with part of speech tags
Extract data:

  zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"

gives all singular nouns in a part of the corpus, e.g.

<W TYPE=NN1>part </W>
<W TYPE=NN1>meeting </W>
<W TYPE=NN1>while </W>
<W TYPE=NN1>funeral</W>
<W TYPE=NN1>loss</W>
<W TYPE=NN1>meeting</W>
<W TYPE=NN1>time </W>


The BNC: an example (2)

zcat medium.xml.gz | \
  sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
         -t "^[Rr]ight$"

gives sentences containing non-adjectival uses of the word 'right', e.g.

<S N=092> <W TYPE=ITJ>Yes </W> <W TYPE=DT0>that </W> <W TYPE=VBD>was</W> <C TYPE=PUN>, </C> <W TYPE=DT0>that </W> <W TYPE=VBD>was </W> <W TYPE=AV0>right</W> . . . </S>


The BNC: an example (3)

Format the output into a more readable form:

zcat medium.xml.gz | \
  sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
  sgmltrans -r test.rule

Yes/ITJ that/DT0 was/VBD , that/DT0 was/VBD right/AV0 erm/UNC there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ much/AV0 you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD n't/XX0 there/EX0 ?

He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC he/PNP says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB me/PNP see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ at/PRP the/AT0 menu/NN1 and/CJC said/VVD right/AV0 , he/PNP said/VVD .


An extended example: Noun Compounds

Noun compounds in British National Corpus
  What is a noun compound?
    Too hard.
  Simple approximation?
    Sequence of tags matching NN. . .
    BNC uses a version of the Brown tags, where NN0, NN1, . . . are all variants of Noun

A pipeline of SGML-aware tools will do the job
  sgrpg | sggrep [ | . . .]
  Use sgrpg to wrap such tag sequences in <G> ... </G>.
  Use sggrep to filter the output.
  Use further tools to tabulate, format, etc.


An extended example: The pipe

Step by step through the pipe

sgrpg -r -f np-pat.xml | ...
  Group the sequences
  -r use regexp matching
  -f script file

... sggrep -d groups.xml -q '.*/G'
  Extract the sequences
  -d DTD
  -q query (selects groups)

Result:
<G><W TYPE='AJ0-NN1'>Local</W>
<W TYPE='NN0'>government</W>
<W TYPE='NN2'>districts</W></G>
. . .
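The grouping step can be sketched in Python to show what the approximation amounts to: maximal runs of adjacent words whose tag contains NN. This is not sgrpg and does not use its pattern files; the sample sentence, its tags and the function noun_groups are ours.

```python
import xml.etree.ElementTree as ET

# A made-up sentence in the style of the BNC markup shown above.
SENTENCE = """<S>
<W TYPE='AJ0-NN1'>Local</W><W TYPE='NN0'>government</W>
<W TYPE='NN2'>districts</W><W TYPE='VBB'>are</W>
<W TYPE='AJ0'>small</W><W TYPE='NN1'>areas</W>
</S>"""

def noun_groups(sentence, min_length=2):
    """Maximal runs of adjacent W elements whose TYPE contains 'NN'."""
    groups, current = [], []
    for w in sentence.iter("W"):
        if "NN" in w.get("TYPE", ""):   # unanchored match, so AJ0-NN1 counts
            current.append(w.text)
        else:
            if len(current) >= min_length:
                groups.append(current)
            current = []
    if len(current) >= min_length:
        groups.append(current)
    return groups

print(noun_groups(ET.fromstring(SENTENCE)))
```

Because the match is unanchored, the unresolved tag AJ0-NN1 is swept into the group, which is exactly the behaviour the filtering step goes on to examine.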


An extended example: filtering

Find all words with unresolved tags, e.g. AJ0-NN1
  use regexp matching, which is unanchored by default
  ...| sggrep -r -q './W[TYPE="-"]' | ...

Find all words in second position
  ...| sggrep -q './W[1]' | ...

Find all words with unresolved tags in second position
  ...| sggrep -r -q './W[1 TYPE="-"]' | ...


An extended example: counting

Count all words in second position
  ...| sggrep -q './W[1]' | sgcount

Count all words with unresolved tags in second position
  ...| sggrep -r -q './W[1 TYPE="-"]' | sgcount

Results:
  all 2nd place W                     23283
  2nd place W with unresolved tag      5066
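The same two counts can be sketched in Python on a toy input. The GROUPS wrapper element and the two sample groups are invented for illustration (they are not BNC output), so the numbers printed bear no relation to the corpus figures above.

```python
import xml.etree.ElementTree as ET

# Invented sample groups in the style of the pipeline's <G> output.
GROUPS = """<GROUPS>
<G><W TYPE='AJ0-NN1'>Local</W><W TYPE='NN0'>government</W><W TYPE='NN2'>districts</W></G>
<G><W TYPE='NN1'>police</W><W TYPE='NN1-VVB'>report</W></G>
</GROUPS>"""

root = ET.fromstring(GROUPS)

# Positions in the queries are zero-based (W[0] is the first word),
# so W[1] is the second W of each group.
second = [g.findall("W")[1] for g in root.iter("G") if len(g.findall("W")) > 1]

# An unresolved tag is one that still contains a hyphen, e.g. NN1-VVB.
unresolved = [w for w in second if "-" in w.get("TYPE", "")]

print(len(second), len(unresolved))
```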


An extended example: long compounds

Long compounds including 'government'
  Use subquery to select <G>...</G>s with 'government':
    sggrep -q G -s './W' -t government
  Next step, discard short ones:
    sggrep -q G -s './W[2]'
  Then sgmltrans for neater format

Results:
  official/AJ0-NN1 government/NN0 report/NN1-VB
  Local/AJ0-NN1 government/NN0 districts/NN2
  ...


British International Corpus?

We are more francophone than we think!
  Longest 'noun-phrase' in 10% of BNC is:

  serai/NN1 mentionné/NN1 dans/NN2 le/NN1 rapport/NN1-VB qui/NN1 te/NN1 sera/NN1 remis/NN1

No disgrace that the part-of-speech tagger gave up here.
  Tools can't be better than their input allows


XML Conclusions

XML is the wave of the future
  Both Microsoft and Netscape have endorsed it
    Both Mozilla and IE5 have XML support built-in
    Very good free software is available
    Microsoft seem to be serious about standard compliance
  The W3C have made it clear that all subsequent W3C standards for web distribution of information will be based on XML (cf. SMIL, SVG and RDF)

Issues
  XSLT efficiency: space and time.


To read

Robin Cover's SGML/XML Web Page
http://www.sil.org/sgml/sgml.html
Includes many pointers to SGML tutorials, overviews, publications.

The Whirlwind Guide to SGML & XML Tools and Vendors
http://www.infotek.no/sgmltool/guide.htm

The XML FAQ
http://www.ucc.ie/xml/
An excellent introduction to XML with pointers to useful resources for newcomers to the standard.


SGML/XML for Linguistics

2.1 Programs for querying/modifying SGML: an example; what is needed; available tools

2.2 SGML marked-up corpora: some existing resources

2.3 Related developments: SSTML; SGML for X-waves


An example

You want to build a system that performs a particular LE task.

You have a corpus of texts for:
    analysis (detecting textual regularities)
    system training
    system testing

Use XML. Why? How?


Why use XML?

Use the structure of the text to fine-tune certain tools,
    e.g. build a tokeniser which works specifically for headlines of newspaper articles.

Annotate text with linguistic information,
    e.g. use SGML tags to record the results of a tokeniser or part-of-speech tagger, so that other tools can make use of this information.

Ensure that others (and you two years from now :-) will have easy access to your results:
    No special-purpose parser required
    Simple retrieval and tabulation with existing free tools
    DTD provides some self-documentation


What is needed to use XML?

XML is text. Therefore:
    you can use any UNIX text manipulation program, e.g. grep, sed, awk, perl, etc.

XML is annotated text. Therefore:
    Needed: versions of these tools that are XML-aware



SGML reflects the hierarchical structure of a text.

You want to be able to tell tools to operate on a particular part of the SGML-annotated text, for example all WORD elements with attribute POS set to JJ (i.e. all adjectives):
    occurring within the first PARAGRAPH of the main BODY of an ARTICLE; or
    occurring within the HEADLINE of an ARTICLE

Needed: a query language over XML structures
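To make the kind of query described above concrete, here is a sketch using the limited XPath subset in Python's standard xml.etree.ElementTree module. It is not the LT XML query language. The element and attribute names (ARTICLE, HEADLINE, BODY, PARAGRAPH, WORD, POS) come from the example above, while the sample document itself is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample document using the element names from the example above.
ARTICLE = """<ARTICLE>
  <HEADLINE><WORD POS="JJ">Local</WORD> <WORD POS="NN">elections</WORD></HEADLINE>
  <BODY>
    <PARAGRAPH><WORD POS="DT">The</WORD> <WORD POS="JJ">new</WORD> <WORD POS="NN">council</WORD></PARAGRAPH>
    <PARAGRAPH><WORD POS="JJ">old</WORD> <WORD POS="NN">rules</WORD></PARAGRAPH>
  </BODY>
</ARTICLE>"""

root = ET.fromstring(ARTICLE)

# Adjectives anywhere inside the first PARAGRAPH of the BODY.
# ElementTree position predicates count from 1.
first_para_adjectives = [
    w.text for w in root.findall("./BODY/PARAGRAPH[1]//WORD[@POS='JJ']")
]

# Adjectives anywhere inside the HEADLINE.
headline_adjectives = [
    w.text for w in root.findall("./HEADLINE//WORD[@POS='JJ']")
]

print(first_para_adjectives)
print(headline_adjectives)
```

The point is that the path expression names a position in the hierarchy, which a line-oriented tool such as grep cannot express.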



XML-aware versions of text processing tools
Query language

In fact sggrep is just a simple wrapper round our query language.

Our query language and interface are designed to work with big files, so they don't read the whole document into memory unless absolutely necessary. Most competitors do read the whole document into memory.
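The idea of processing a big file without holding it all in memory can be sketched with Python's standard iterparse, which delivers elements one at a time as they are parsed. This illustrates the general streaming approach only and says nothing about how LT XML is implemented. The function name count_matching and the sample input are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_matching(source, tag, attr, value):
    """Count elements named `tag` whose attribute `attr` equals `value`,
    reading `source` (a filename or binary file object) incrementally."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag and elem.get(attr) == value:
            count += 1
        # Discard the element's text, attributes and children once seen,
        # so finished content does not pile up in memory.
        elem.clear()
    return count


# Small in-memory stand-in for a large corpus file.
sample = io.BytesIO(
    b"<DOC><W TAG='at'>a</W><W TAG='nn'>park</W><W TAG='at'>the</W></DOC>"
)
print(count_matching(sample, "W", "TAG", "at"))
```

Each element is inspected when its end tag arrives and then emptied, so memory use depends far less on the size of the document.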


XML tools: the LT XML library

sggrep is part of an SGML toolset called LT XML.
Developed by the Language Technology Group (Edinburgh); see http://www.ltg.ed.ac.uk/software

XML library with:
    Command-line tools
    Application Programming Interface (API)

Available for WIN32, UN*X (and Mac).
LT XML processes XML or nSGML.
nSGML now looks like a design study for XML.


LT XML: Command-line tools

sggrep - retrieving context sensitive data
sgmltrans - transforming information
sgrpg - more complex queries/reformatting
textonly - strips out SGML markup
sgcount - counts SGML tags
knit - resolves XML-link links
others


LT NSL: APIs

LT NSL Application Program Interfaces: procedure calls to help you write your own programs to process nSGML
    C language API
    Python language API


C API for specialised access

Write your own programs to read/write SGML/XML.
LT XML provides a rich API.
Both event and tree views of the document stream.

The distribution includes two heavily commented example programs.
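The difference between an event view and a tree view can be shown with Python's standard library. This is only an analogy and does not use the LT XML C API. The function names event_view and tree_view are made up for this sketch, and the sample is one of the NP structures from the Map Task examples.

```python
import io
import xml.etree.ElementTree as ET

DOC = "<NP><NP>an old mill</NP><PP>on <NP>the right</NP></PP></NP>"


def event_view(text):
    """Event view: a flat sequence of start/end notifications
    in document order."""
    stream = io.BytesIO(text.encode("utf-8"))
    return [
        (event, elem.tag)
        for event, elem in ET.iterparse(stream, events=("start", "end"))
    ]


def tree_view(text):
    """Tree view: the whole document as nested nodes,
    rendered here as (tag, children) pairs."""
    def build(elem):
        return (elem.tag, [build(child) for child in elem])
    return build(ET.fromstring(text))


for event in event_view(DOC):
    print(event)
print(tree_view(DOC))
```

An event view suits one-pass filters over large streams. A tree view is more convenient when a program needs to move around within a structure.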


Python language API for LT XML

Experimental integration of the LT XML API into Python (a free, portable, object-oriented scripting language).

Uses the Tk portable widget library for graphical UI.
Reflects the document stream as Python objects.


Specialised XML editors

Using the Python API we have written a number of specialised processors:

A WYSIWYG XML instance editor (XED)

Several specialised annotation tools, e.g. PoS correctors, span coders:
    Limited set of operations
    Preserve validity
    Hide structure from the user


Dataflow in LT NSL programs

[Figure: dataflow diagram. Labels: file1.sgm, file2.sgm, ...; mknsg; unknit; parser; nSGML; NSL C(++) program with stream API; DDB file.]


The Edinburgh MapTask Corpus

Contents:
    128 task-oriented spontaneous Scottish dialogues
    Small corpus, but very dense and detailed SGML markup

Availability:
    Transcripts and digitized speech on 8 CD-ROMs: http://www.elsnet.org/resources.html or from the LDC

What is its markup like?
    (early) TEI-compliant
    Turns, pointers into the speech, identification of non-words
    Word-level transcripts with timing markup available soon via the Internet


HCRC Maptask: an example

mknsg q1ec1.turns.sgm | sggrep -q ".*/W[TAG=at]"

<W START=2.9644 DUR=0.0725 UTT=1 TAG=at>a</W>

<W START=17.1410 DUR=0.1779 UTT=3 TAG=at>an</W>

<W START=18.6693 DUR=0.0791 UTT=3 TAG=at>the</W>


Parsed HCRC Maptask : an example

mknsg q1ec1.g.syn.sgm | sggrep -q ".*/NP" | sgmltrans -r mt.rule

<NP>we </NP>

<NP>a caravan park </NP>

<NP>we </NP>

<NP>we </NP>

<NP><NP><NP>an old mill </NP></NP><PP>on <NP>the right hand side </NP></PP></NP>

<NP><NP>an old mill </NP><PP>on <NP>the right </NP></PP></NP>

<NP>you </NP>

...


The MLCC corpus

Contents:
    Financial newspaper texts: Dutch, English, French, German, Italian, Spanish
    Parallel texts:
        The Journal of the European Commission, Written Questions (1993)
        Corpus of European Parliamentary debates (1993-1994)
        (languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish)

Markup

Available from ELRA: http://www.inpg.fr/ELRA/catalog.html


The MLCC Corpus: an example

zcat exp.joc006.93.en.01.tei.gz | \
mknsg | \
sggrep -q ".*/DIV4[TYPE=Q]/HEAD"

<HEAD>Subject: The staffing in the Commission of the European Communities</HEAD>

<HEAD>Subject: Supplies of military equipment to Iraq</HEAD>

<HEAD>Subject: Commission plans to liberalize the postal sector and to abolish the State monopoly</HEAD>

<HEAD>Subject: New industries in Attika</HEAD>...


The same example for French

zcat exp.joc006.93.fr.01.tei.gz | \
mknsg | \
sggrep ".*/DIV4[TYPE=Q]/HEAD" ""

<HEAD>Objet: Organigramme de la Commission</HEAD>

<HEAD>Objet: Livraisons de matériel militaire à l'Irak</HEAD>

<HEAD>Objet: Projets de la Commission visant à libéraliser et à abolir le monopole d'État dans le secteur des postes</HEAD>

<HEAD>Objet: Nouvelles industries en Attique</HEAD>

Corresponds to the English data: Suitable input for multilingual alignment experiments.
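As a sketch of why matching output is convenient for alignment work, the following Python pairs up question headings from an English and a French document by position. It uses the standard ElementTree module rather than sggrep. The DIV4/HEAD structure and the heading texts follow the examples above, but the DIV3 wrapper, the function names, and the assumption that headings correspond one-to-one in order are illustrative only.

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for the two corpus files; DIV3 is an assumed wrapper.
EN = """<DIV3>
<DIV4 TYPE="Q"><HEAD>Subject: Supplies of military equipment to Iraq</HEAD></DIV4>
<DIV4 TYPE="Q"><HEAD>Subject: New industries in Attika</HEAD></DIV4>
</DIV3>"""

FR = """<DIV3>
<DIV4 TYPE="Q"><HEAD>Objet: Livraisons de matériel militaire à l'Irak</HEAD></DIV4>
<DIV4 TYPE="Q"><HEAD>Objet: Nouvelles industries en Attique</HEAD></DIV4>
</DIV3>"""


def question_heads(text):
    """Return the HEAD text of every DIV4 element with TYPE="Q"."""
    root = ET.fromstring(text)
    return [head.text for head in root.findall(".//DIV4[@TYPE='Q']/HEAD")]


def pair_heads(en_text, fr_text):
    """Pair headings by position; refuse to guess if the counts differ."""
    en, fr = question_heads(en_text), question_heads(fr_text)
    if len(en) != len(fr):
        raise ValueError("documents have different numbers of question headings")
    return list(zip(en, fr))


for en_head, fr_head in pair_heads(EN, FR):
    print(en_head, "|", fr_head)
```

Real alignment experiments would need something more robust than positional pairing, but shared markup makes even this naive baseline easy to write.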


The Text Encoding Initiative (TEI)

The TEI is a large and well-documented DTD for textual markup.
    Use it if you can
    Now has an XML version

Large and comprehensive hardcopy documentation available: http://www.uic.edu/orgs/tei/

DTDs available there as well.


The Linguistic Data Consortium

LDC - based in Pennsylvania, USA
Distributes text corpora
See: http://www.ldc.upenn.edu/

SGML corpora include:

The European Language Newspaper Text corpus
    French (100 million words), German (90 million words) and Portuguese (15 million words). SGML.

TIPSTER Information Retrieval Text Research Collection
    3 gigabytes. SGML-like. Various English texts.

United Nations Parallel Text Corpus (English, French, Spanish)
    Fully-compliant SGML, 2.5 gigabytes


Tutorials

XML: far too many to mention

XSL:

XSL specification http://www.w3.org/Style/XSL

Robin Cover's guide http://www.oasis-open.org/cover/xsl.html


Resources

LT-XML http://www.ltg.ed.ac.uk/software/xml/index.html

Full-text search: Witten, Moffat and Bell's Managing Gigabytes http://www.cs.mu.OZ.AU/mg/


Corpus Tools

Stuttgart Corpus Workbench http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench

Birmingham Qwick http://www-clg.bham.ac.uk/QWICK/

The MATE Workbench http://www.cogsci.ed.ac.uk/~dmck/MateCode
NB. Prototype


Bibliography

McKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural Language Processing. Computers and the Humanities, 31(5): 367-388, 1997.

Sinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for Corpus Exploration. Computers and the Humanities, 31(3): 229-255, 1998.

References on McKelvie's MATE workbench page: http://www.cogsci.ed.ac.uk/~dmck/MateCode

Welty and Ide: Using the right tools: enhancing retrieval from marked-up documents. Computers and the Humanities, 33(10): 59-84, 1999.

Alignment graphs (and much else): Steven Bird's Linguistic Annotation Page, http://www.ldc.upenn.edu/annotation/


Annotation topics

Item annotations: Words, Parts-of-speech, Lemmas

Simple annotations (one data stream): Boundaries, Spans, Partitions

Complex annotations (multiple data streams): Sequences, Graphs, Overlaps

Data models for annotation access: Streams, Trees, Graphs, Databases

Human factors in annotation: Writing instructions, Measuring and improving reliability


XML topics

Data formats HTML,XML and SGML

Data Description Formalisms DTDs, XML Schema

Style Languages XSLT

Query Languages Annotation Graphs, XML Query, XQL, Quilt, LORE


Exercises

On average, these exercises should take about one hour to complete. Try not to spend longer.

Create an XML document
    Create a very simple memo

Simple annotation
    Disambiguate parts-of-speech
    Compare results with those made by a partner

Style
    Create an XML DTD and an XSL style sheet for displaying POS-tagged text in a browser


Exercises (contd)

More complex annotation
    Syntactic annotation in Penn Treebank style
    As before, compare results

Search
    Exercise XML search tools on the newly annotated texts


Projects

These are open-ended projects hard enough to merit write-up in a research paper. I’d willingly supervise these.

Design a DTD and an XSL stylesheet for tree bank style syntactic annotations. Implement a convenient interface allowing these annotations to be edited over the Web.

Investigate the corpus search tools provided at the LDC web-site. What do they do? Could they and should they use XML/XSL technology for the same purpose? (Easiest if your institution has an LDC membership).


Projects (contd)

Critical review of the Talkbank tools (www.talkbank.org)

Design an XML query language that works well with very big documents

What sort of annotation structure for dialog? (cf. MATE)

Design an optimizing compiler for XSLT (cf. Sun’s very recent XSL compiler)

Does XSLT support language modeling and statistical computation? (If you put XSLT and Splus into a closed box and shake vigorously, what emerges?)


In Summary

Phew! </xmlstuff>
