XML and Linguistic Annotation
Chris Brew, Ohio State University (credit to Marc Moens, Henry Thompson, David McKelvie, all Language Technology Group, University of Edinburgh)
http://www.ling.ohio-state.edu/~cbrew
Ohio State University
Copyright 2000 Chris Brew
XML and Linguistic Annotation 2Summer School, July 2000
XML topics
What is XML?
  HTML, XML and SGML
  Wider context of XML
Data Description
  DTDs, Schemas
Query Languages
  XML Query, XQL, Quilt, LORE, LT QUERY
Style Languages
  CSS, XSL
What is XML?
It is a markup language used for annotating text.
It is concerned with logical structure:
  to identify sections, titles, section headers, chapters, paragraphs, …
It is not concerned with appearance:
  you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
  you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'
Derived from SGML.
Why is XML a big deal?
It is a W3C standard.
It is vendor-independent, platform-independent, application-independent, …
  unlike Word documents, RTF documents, PDF documents, PostScript documents, …
It is human readable.
  ditto (for most values of 'human')
The Web interchange format.
Who is in charge of XML?
XML is a W3C Recommendation.
The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organizations.
  Membership costs serious money and confers voting rights.
  The procedures are complex, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.
The Recommendation was written by the W3C's XML Working Group.
XML as a career move?
Most of the big computer and entertainment companies believe XML is the solution. Exactly what was the problem?
  Presenting a parts database over the Internet
  Running an on-line job market (flipdog.com)
  Usually not corpus creation.
Scholars win and lose.
  SGML was a minority interest where we had serious influence on what facilities were used.
  XML is mainstream. We're the minority now.
  This year's .coms are busily hiring people who understand ontologies, NLP and web technology.
Does it live up to the hype?
Of course not, but…
The basic idea is simple: labeled brackets.
  Lisp showed the power of this idea in knowledge representation.
  Knowledge representation is inherently hard. Lisp made it easier to state the problem, but it wasn't itself the solution.
  XML won't solve your knowledge representation problems either, but it will let you state them.
Labeled brackets++
  Labeled brackets, but designed for information exchange, with sophisticated input (and political pressures) from many interest groups.
Does it live up to the hype?
Yes. XML and allied standards (XSLT, XML Query) give us a framework for data interchange.
[Diagram: three stages labeled Data, Transformation and End Users, connected by XML; boxes labeled Weather Model, Weather Reports, XSL, Browser and Day Planner]
Transformation
End users will differ in which parts of the weather reports they need, so the middle stage is the crux.
  One XML format defines the available data.
  Transformations map this format into what is needed by the different applications, leaving out bits that they don't need.
One common transformation is to HTML, for browsers. (easy)
Another is to printed paper, for efficient random access. (difficult, because our quality expectations are so high)
Representing knowledge in text
Unformatted text
Formatted text
Structured Markup
Unformatted text
United Kingdom Geography
Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France
Map references: Europe, Standard Time Zones of the World Area:
total area: 244,820 km2 land area: 241,590 km2 comparative area: slightly smaller than Oregon
note: includes Rockall and Shetland Islands
Land boundaries: total 360 km, Ireland 360 km Coastline: 12,429 km
Formatted text
United Kingdom
Geography
  Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France
  Map references: Europe, Standard Time Zones of the World
  Area:
    total area: 244,820 km2
    land area: 241,590 km2
    comparative area: slightly smaller than Oregon
    >> note: includes Rockall and Shetland Islands
  Land boundaries: total 360 km, Ireland 360 km
  Coastline: 12,429 km
XML marked up text
<chapter><title>United Kingdom</title>
<section><title>Geography</title>
<featlist>
  <feat name='Location'>Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France</feat>
  <feat name='Map references'>Europe, Standard Time Zones of the World</feat>
  <feat name='Area'><featlist>
    <feat name='total area'>244,820 km2</feat>
    <feat name='land area'>241,590 km2</feat>
    <feat name='comparative area'>slightly smaller than Oregon
      <addendum>note: includes Rockall and Shetland Islands</addendum></feat>
  </featlist></feat>
  <feat name='Land boundaries'>total 360 km, Ireland 360 km</feat>
</featlist>
</section>
</chapter>
The syntax...
But aren't all those angle brackets still terribly cumbersome and complicated?
Yes. XML is simple only relative to SGML. But...
There are tools that allow you to add XML annotation without the need to know XML
There are tools that allow you to search XML annotation without the need to know XML
XML is no more complex than other annotation schemes
If you roll your own scheme, you’ll have to write (and maintain) the tools.
If you use XML, part or all of your tool set will be provided by mainstream computer industry.
RTF Format
{\rtf1\ansi \pard\plain\s1\fs36\ppscheme-3\lang2057 {\f1\lang1033 Formatted text\par }\pard\plain\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs32\ppscheme-6\lang1033 United Kingdom}{\f1\fs20\lang1033 }{\f1\fs16\lang1033 \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs24\lang1033 Geography}{\f1\fs12\lang1033 \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Location: Western Europe, bordering on the North Atlantic Ocean \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 and the North Sea, between Ireland and France\par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Map references: Europe, Standard Time Zones of the World \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Area: total area: 244,820 km2 \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 land area: 241,590 km2 \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 comparative area: slightly smaller than Oregon\par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 >> note: includes}{\f1\fs20\lang1033 Rockall}{\f1\fs20\lang1033 and Shetland Islands\par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Land boundaries: total 360 km, Ireland 360 km \par }\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Coastline: 12,429 km\par}}
XHTML is a use of XML
HTML is derived from SGML, but it is an application, not a subset.
  SGML/XML let you define new types of document.
  HTML only gives you a language to write document instances.
    Hard-wired to a particular tag set (often with proprietary extensions, e.g. frames)
    Hard-wired to a particular typographic format, with limited style-sheets
XHTML is to XML as HTML is to SGML.
SGML/XML for computational linguists
What is XML?
SGML Lite
  Simpler to write
  Simpler to parse
HTML Heavy
  New user-definable tags
  Not (just) about browsing
  Data interchange
  Heavily legislated syntax
What is XML?
XML is just labeled brackets. You get elements with a start tag, some content, and an end tag.
<memo>
  <sender>Marc Moens</sender>
  <recipient>Henry, David</recipient>
  <status>confidential</status>
  <subject>GGP Contract</subject>
  <message>The GGP contract is ready for signature. Please sign the contract as well as the NDA.</message>
</memo>
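The "labeled brackets" view can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module (the choice of Python and the helper name `list_elements` are assumptions for illustration, not part of the original slides). It reads the memo and lists each element type with its content.

```python
import xml.etree.ElementTree as ET

# The memo from the slide, as a well-formed XML document.
MEMO = """<memo>
  <sender>Marc Moens</sender>
  <recipient>Henry, David</recipient>
  <status>confidential</status>
  <subject>GGP Contract</subject>
  <message>The GGP contract is ready for signature. Please sign the contract
as well as the NDA.</message>
</memo>"""


def list_elements(xml_text):
    """Return (element type, text content) pairs for the children of the root."""
    root = ET.fromstring(xml_text)
    # Collapse runs of whitespace so line breaks inside content do not matter.
    return [(child.tag, " ".join((child.text or "").split())) for child in root]


if __name__ == "__main__":
    for tag, text in list_elements(MEMO):
        print(f"{tag}: {text}")
```

Each start tag, content, end tag triple comes back as one element object, which is all that "labeled brackets" amounts to at this level.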
XML is SGML made simple
SGML is labeled brackets too. You get elements with an optional start tag, some content, and an optional end tag.
<memo>
<sender>Marc Moens
<recipient>Henry, David
<status>confidential</status>
<subject>GGP Contract
<message>The GGP contract is ready for signature.
</memo>
XML Basics
Document Type Definition (DTD)
  Describes what can (and can't) be in a particular type of document
  E.g. a memo DTD might specify that every memo has: sender (name), recipients (list of names), date (default: today), subject, message, status (confidential or unrestricted)
Document Instance
  Identifies the document type and contains the marked-up text
  E.g. a memo document instance refers to the memo DTD and contains text marked up in conformance with that DTD
XML and document structure
XML is used to make the structure of documents
• explicit
• machine readable
Document content
SGML Tags
Marc Moens
This is the first paragraph. It has some text.
This is the second paragraph with some more text.
XML markup
<article status='draft'>
  <header>
    <title>XML tags</title>
    <author>Marc Moens</author>
  </header>
  <body>
    <para>This is the first paragraph. It has some text.</para>
    <para>This is the second paragraph with some more text <emph>and</emph> an embedded element.</para>
  </body>
</article>
Elements consist of:
  start tags, e.g. <author>
  content, e.g. Marc Moens
  end tags, e.g. </author>
Elements mark up text to indicate the structure and function of text (as opposed to appearance).
Tag name = element type.
Elements can have attributes.
Elements and attributes are defined in the Document Type Definition.
XML markup: for structure and function
He shouted: 'Come here now, Mr Banks.'
<sentence>He <verb>shouted</verb>: <quote><verb mood='imperative'>Come </verb> here <emphasis>now</emphasis>, <person><title>Mr</title> <name>Banks</name></person></quote></sentence>
Encodes structure information to support rendering as well as data handling.
Data handling, e.g.
• search for all quotes inside sentences but not in footnotes;
• search for every mention of someone called Banks without finding the Banks of Scotland
[Use an XML-aware query tool]
Rendering, e.g.
• emphasis should be bold underline;
• quotes should be in italics
[Use a stylesheet]
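To make the data-handling point concrete, here is a small sketch of the two searches using the path support in Python's `xml.etree.ElementTree`. The wrapper element `text`, the second sentence, and the function names are invented for illustration; a dedicated XML query tool would offer more.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-sentence document: one marked-up person called Banks,
# and one plain-text mention of "the Banks of Scotland".
DOC = """<text>
<sentence>He <verb>shouted</verb>: <quote><verb mood="imperative">Come </verb> here
<emphasis>now</emphasis>, <person><title>Mr</title> <name>Banks</name></person></quote></sentence>
<sentence>She works for the Banks of Scotland.</sentence>
</text>"""


def quotes_in_sentences(root):
    """All quote elements that occur somewhere inside a sentence element."""
    return root.findall(".//sentence//quote")


def mentions_of_person(root, surname):
    """All person elements whose name child has exactly the given text."""
    return [p for p in root.iter("person") if p.findtext("name") == surname]


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    print(len(quotes_in_sentences(root)), "quote(s) inside sentences")
    print(len(mentions_of_person(root, "Banks")), "person(s) called Banks")
```

Because the search is over `person` elements rather than raw characters, the plain-text "Banks of Scotland" is never a candidate.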
XML: Relevance for Linguists
Simplify and standardize appeal to context
  E.g. build a tokenizer which specifically works for headlines of newspaper articles: we need to be able to tell the tokenizer where the headline starts and ends.
Annotate text with interesting linguistic information
  E.g. use XML tags to record the results of a tokenizer or part of speech tagger, or of a human annotator.
Allow sharing of results between research efforts
  without having to write a new parser every time you get new material from somewhere
XML: Relevance for Linguists (example)
cat text | lttok -q '.*/P' -m W | ltpos -q '.*/W' -m C
Use the tokeniser lttok on all paragraphs <P> in the text and mark the resulting words as <W> elements.
Then run the part of speech tagger ltpos over the text and POS-tag all the <W> elements, putting the result in attribute C.
<W C='VBD'>said</W><W C='DET'>the</W><W C='NN'>director</W><W C='IN'>of</W> <W C='NNP'>Russian</W><W C='NNP'>Bear</W><W C='NNP'>Ltd. </W><W C='.'>.</W>
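Once the tagger output is XML, any downstream program can read it back without a custom parser. The sketch below assumes the output has been wrapped in a single `S` element and that attribute values are quoted, as XML requires; neither assumption is shown on the slide. It recovers (word, tag) pairs with Python's standard library.

```python
import xml.etree.ElementTree as ET

# Tagger output from the slide, wrapped in a hypothetical <S> element.
TAGGED = (
    "<S><W C='VBD'>said</W><W C='DET'>the</W><W C='NN'>director</W>"
    "<W C='IN'>of</W> <W C='NNP'>Russian</W><W C='NNP'>Bear</W>"
    "<W C='NNP'>Ltd. </W><W C='.'>.</W></S>"
)


def word_tag_pairs(xml_text):
    """Return (word, part-of-speech tag) pairs from the C attribute of each W."""
    root = ET.fromstring(xml_text)
    return [((w.text or "").strip(), w.get("C")) for w in root.iter("W")]


if __name__ == "__main__":
    for word, tag in word_tag_pairs(TAGGED):
        print(f"{word}\t{tag}")
```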
Associated Standards
XSLT
  Transforming documents
XML Query
  Finding bits of documents
XML Schema
  Use element syntax for DTDs
Namespaces
  Ensure that <art:draw><cube/><cube/></art:draw> and <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw> both get processed correctly.
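A short sketch of how a namespace-aware parser keeps the two `draw` elements apart. The namespace URIs and the `page` wrapper element are made up for the example; only the prefixes and element names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs bound to the two prefixes used on the slide.
NS = {"art": "http://example.org/art", "soccer": "http://example.org/soccer"}

DOC = """<page xmlns:art="http://example.org/art"
      xmlns:soccer="http://example.org/soccer">
  <art:draw><cube/><cube/></art:draw>
  <soccer:draw><team name="crew"/><team name="burn"/></soccer:draw>
</page>"""

root = ET.fromstring(DOC)
art_draw = root.find("art:draw", NS)        # the drawing
soccer_draw = root.find("soccer:draw", NS)  # the tied match

if __name__ == "__main__":
    # ElementTree spells out the namespace URI in the tag name.
    print(art_draw.tag)
    print(soccer_draw.tag)
    print([team.get("name") for team in soccer_draw.findall("team")])
```

The prefix is only a local abbreviation; the parser identifies each element by the URI plus the local name, so the two `draw` elements never collide.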
Infrastructure standards
XPath
  Referring to parts of documents
XPointer
  Pointing at documents and parts of documents
DOM
  Uniform programmer's interface to document trees (abstracts away from some details)
SAX
  Stream-based document interface (essential for big documents)
Information Set
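The difference between the two programming interfaces shows up even in a toy task such as counting elements. The sketch below uses Python's `xml.sax` and `xml.dom.minidom` modules as stand-ins for SAX and DOM; the class and function names are illustrative only.

```python
import xml.sax
from xml.dom import minidom

DOC = "<article><para>First.</para><para>Second.</para></article>"


class TagCounter(xml.sax.ContentHandler):
    """SAX style: react to each start tag as it streams past; no tree is built."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_with_sax(xml_text):
    handler = TagCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts


def count_with_dom(xml_text, name):
    """DOM style: build the whole document tree, then query it."""
    return len(minidom.parseString(xml_text).getElementsByTagName(name))


if __name__ == "__main__":
    print(count_with_sax(DOC))
    print(count_with_dom(DOC, "para"))
```

The SAX version keeps only a dictionary in memory, which is why a stream-based interface matters for big documents; the DOM version holds the whole tree but allows random access afterwards.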
XML in detail
Well-formedness and validity
DTDs
XML tools
XSLT
XML Query
Well-formed and Valid documents
Well-formed XML
  Each start tag has an end tag
  XML content is rooted in a single "document element"
  Valid encoding declaration
Valid
  Well-formed
  All elements mentioned in DTD
  All entities defined
  All parent-child relations as described in DTD
  All attributes used as described in DTD
  All element IDs unique
Why well-formedness?
A simpler standard for documents to meet
  Can be determined without reference to a DTD
  Simplifies the parser
  Retains the "standalone" property of HTML, which was a big win.
Non-validating XML systems can thus still be conformant, provided they check well-formedness.
If you have a DTD (or a Schema) you can do more refined processing.
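Because well-formedness needs no DTD, a check is a few lines with any non-validating parser. A minimal sketch with Python's standard library (the function name is an assumption for illustration):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if a non-validating parser accepts the text; no DTD is consulted."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<memo><sender>Marc Moens</sender></memo>"))  # closed tags
    print(is_well_formed("<memo><sender>Marc Moens</memo>"))           # missing </sender>
    print(is_well_formed("<memo/><memo/>"))                            # two document elements
```

Note that this says nothing about validity: a document can pass this check and still violate every constraint in its DTD.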
DTDs
Document Type Definitions: the grammar of a document family
  Elements
  Attributes & values
  Entities & parameter entities
  Comments
DTD: Elements
Elements are used to structure a document. Element types are declared in the DTD:
<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ELEMENT section (title, para+) >
  <!ELEMENT para (#PCDATA) >
  <!ELEMENT title (#PCDATA) >
]>
DTD: Attribute declarations
Attributes specify properties of elements. The attributes which may appear on elements of a given type are also declared in the DTD.
<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ATTLIST article artno CDATA #IMPLIED >
  <!ELEMENT section (title, para+) >
  <!ATTLIST section secid ID #REQUIRED >
  <!ELEMENT para (#PCDATA) >
  <!ELEMENT title (#PCDATA) >
]>
DTD: Entity declarations
Entities provide short names for commonly used strings, and are also declared in the DTD.
<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ATTLIST article artno CDATA #IMPLIED >
  <!ELEMENT section (title, para+) >
  <!ATTLIST section secid ID #REQUIRED >
  <!ENTITY ltg "Language Technology Group">
]>
DTD: IDs
IDs are rigid designators for particular elements in the document. They are declared using type ID.
<!DOCTYPE article [
  <!ELEMENT article (title, section+) >
  <!ATTLIST article artno CDATA #IMPLIED >
  <!ELEMENT section (title, para+) >
  <!ATTLIST section secid ID #REQUIRED >
  <!ENTITY ltg "Language Technology Group">
]>
Potentially, IDs allow processors to provide fast random access to parts of documents.
IDs must be unique. Checking might be onerous.
XML tools
XML Parser
LT XML Toolkit
XSLT: xt and Saxon
XML Parser
Probably the most important single bit of XML software
Uses the DTD to check whether a document instance is valid
Example: >> cat memo.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
</para>
</article>
Add correct output
Example: >> xmlnorm -V memo.xml
Entity reference has been replaced with entity text by parser
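The entity replacement can be observed with any XML parser, not only xmlnorm. As a rough sketch, Python's `xml.etree.ElementTree` (which does not validate, but does read entity declarations in the internal DTD subset) expands `&ltg;` while parsing. The variable and function names below are illustrative.

```python
import xml.etree.ElementTree as ET

# The memo.xml document, encoded as declared in its XML declaration.
MEMO_BYTES = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article [
<!ELEMENT article (para+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
]>
<article>
<para>This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
</para>
</article>""".encode("iso-8859-1")


def expanded_text(xml_bytes):
    """Parse the document and return its text with whitespace normalised."""
    root = ET.fromstring(xml_bytes)
    return " ".join("".join(root.itertext()).split())


if __name__ == "__main__":
    print(expanded_text(MEMO_BYTES))
```

The application never sees the reference itself, only the replacement text.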
Exercise
Practice using xmlnorm to check your documents
  Add some new entities to the memo.
  Experience some of xmlnorm's error messages
  Begin to think about DTD design
Practice using Web browsers to look at XML files
  Get a glimpse of what XSL is about
DTD: Comments
<!DOCTYPE article [
<!-- Just a simple example DTD -->
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ENTITY ltg 'Language Technology Group'>
]>
Element type declaration details
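<!ELEMENT chapter (title, section+) >
<!ELEMENT: keyword
chapter: element type
  start with a-z
  may contain hyphen, number, stops
  not case sensitive (in SGML; XML names are case sensitive)
  can be more than one (in SGML; an XML declaration names a single element type)
(title, section+): content model
  an unambiguous regular expression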
Element types: Content model
<!ELEMENT article (title, section+) >
  +   at least one, possibly more
  ?   optional
  *   zero or more
  ,   all occur, in that order
  |   exclusive or
<!ELEMENT header ( ( (title, subtitle?), (author, affil)+ ), (date | status)? ) >
XML dropped SGML's neat & connector (all occur, in any order).
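Since a content model is a regular expression over child element types, it can be hand-translated into an ordinary regex over a sequence of tag names. The sketch below does this for the `header` model; the encoding (each tag name followed by a space) and the helper name are choices made for this illustration, not something a real validator necessarily does.

```python
import re

# Hand translation of the content model
#   ( ( (title, subtitle?), (author, affil)+ ), (date | status)? )
# Each child element is written as its tag name followed by one space.
HEADER_MODEL = re.compile(
    r"(title )(subtitle )?((author )(affil ))+((date )|(status ))?"
)


def children_match(model, child_tags):
    """True if the sequence of child element types satisfies the model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return model.fullmatch(sequence) is not None


if __name__ == "__main__":
    print(children_match(HEADER_MODEL, ["title", "author", "affil"]))
    print(children_match(HEADER_MODEL, ["title", "subtitle", "author", "affil", "date"]))
    print(children_match(HEADER_MODEL, ["title", "author"]))  # affil missing
```

The comma becomes concatenation, `|` stays alternation, and `?`, `+`, `*` carry over unchanged, which is the sense in which the content model is "just" a regular expression.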
Element types: Content model options
EMPTY, e.g. <!ELEMENT graphic EMPTY >
  no content
  no end tag
  point semantics: attributes may specialise
(#PCDATA)
  text only
ANY
  no constraint: sub-elements and/or text
(#PCDATA|emph)*
  'mixed content'
Element grammar
Since the content model is a regular expression, the markup grammar is context free.
  Except for one thing: the ANY keyword
Note that any realistic application interprets the markup tree. The interpretation could be anything. All bets are off…
SGML/XML for computational linguists
Example: >> nsgmls exa2a.sgm
nsgmls:exa2a.sgm:7:42:E: element "PI" undefined
nsgmls:exa2a.sgm:8:24:E: general entity "T." not defined and no default entity
(ARTICLE
(PARA
-Here is some text with an inequality: a
(PI
-2and an abbreviation: AT
)PI
)PARA
)ARTICLE
<pi/ is interpreted as a start tag.
&T. is interpreted as an entity reference; it is not defined, so it is gone from the output.
There is no final C line to confirm validity.
Escaping special characters
There are several ways around the problem of introducing XML's meta-syntax characters into documents.
  Use numeric character references
    AT&#38;T
  Use CDATA marked sections
    <![CDATA[<this> is data &not markup]]>
  Use the predefined entities: XML provides built-in definitions for amp, lt, gt, quot and apos
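The three escaping routes can be checked side by side. In this sketch the sample sentence is adapted from the nsgmls example, and `para` is a hypothetical helper; `escape` is the standard `xml.sax.saxutils` function that substitutes the predefined entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RAW = "a<pi/2 and an abbreviation: AT&T."


def para(text):
    """Wrap text in a para element, escaping &, < and > with predefined entities."""
    return "<para>" + escape(text) + "</para>"


# Three ways to carry the same raw text safely inside XML.
ESCAPED = para(RAW)
WITH_CHAR_REFS = "<para>a&#60;pi/2 and an abbreviation: AT&#38;T.</para>"
WITH_CDATA = "<para><![CDATA[a<pi/2 and an abbreviation: AT&T.]]></para>"

if __name__ == "__main__":
    print(ESCAPED)
    for doc in (ESCAPED, WITH_CHAR_REFS, WITH_CDATA):
        # Every variant parses back to the same raw string.
        print(ET.fromstring(doc).text == RAW)
```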
SGML/XML for computational linguists
Example: >> nsgmls exa2b.sgm
(ARTICLE
(PARA
-Here is some text with an inequality: a<pi/2\nand an abbreviation: AT&T.
)PARA
)ARTICLE
C
DTD: Comments
<!-- Comments added here -->
double hyphens delimit the comment
<!ELEMENT article (title, section+)>
DTD: Attributes
<!DOCTYPE article [
<!ELEMENT article (title, section+) >
<!ATTLIST article artno CDATA #IMPLIED >
<!ELEMENT section (title, para+) >
<!ATTLIST section secid ID #REQUIRED >
<!ELEMENT para (#PCDATA) >
<!ELEMENT title (#PCDATA) >
]>
DTD Attribute declarations: syntax
<!ATTLIST article artno CDATA #IMPLIED >
<!ATTLIST: keyword
article: element type
artno: attribute name
CDATA: attribute type
#IMPLIED: default type, one of
  #REQUIRED
  #IMPLIED (= optional)
  #FIXED
Attribute Value types (contd)
<!ATTLIST article artno CDATA #IMPLIED >
CDATA    valid SGML characters
ENTITY   declared entity name
ID       unique name
IDREF    reference to a unique name
Cross-references
<!DOCTYPE article [
<!ELEMENT article (section+)>
<!ATTLIST section secid ID #IMPLIED>
<!ELEMENT section (#PCDATA | xref)*>
<!ELEMENT xref EMPTY>
<!ATTLIST xref xrefid IDREF #REQUIRED>
]>
<article>
<section secid='s1'>Here is some text.</section>
<section>In section <xref xrefid='s1'/> we showed you how to create cross-references.</section>
</article>
IDs and IDREFs
In a valid SGML/XML document
  IDs are unique
  IDREFs are discharged
Applications may interpret IDREF/ID connections.
Links from elsewhere may target IDs
  cf. HTML 'name' attribute as the target for #....
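A validating parser performs the ID/IDREF checks for you. Where only a non-validating parser is at hand, the two conditions can be approximated by hand, as in this sketch. It assumes you already know which attributes are declared as ID and IDREF (here `secid` and `xrefid`, taken from the cross-reference example); the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="secid", ref_attr="xrefid"):
    """Return (duplicate IDs, IDREFs that point at no ID)."""
    seen, duplicates = set(), []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    dangling = [
        element.get(ref_attr)
        for element in root.iter()
        if element.get(ref_attr) is not None and element.get(ref_attr) not in seen
    ]
    return duplicates, dangling


GOOD = ("<article><section secid='s1'>Here is some text.</section>"
        "<section>In section <xref xrefid='s1'/> we showed you how.</section></article>")
BAD = ("<article><section secid='s1'>One.</section><section secid='s1'>Two.</section>"
       "<section>See <xref xrefid='s9'/>.</section></article>")

if __name__ == "__main__":
    print(check_ids(ET.fromstring(GOOD)))
    print(check_ids(ET.fromstring(BAD)))
```

The check needs the whole set of IDs before any reference can be judged, which is one reason uniqueness checking can be onerous on large documents.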
Attribute value types: list
Type          Meaning                        Example
CDATA         valid SGML characters          author='Robin Hood'
ENTITY/IES    declared entity name(s)        figs='pict2 pict7'
ID            unique name                    id='foo37'
IDREF(S)      reference(s) to an ID          refid='foo2 foo37'
NMTOKEN(S)    name(s) w/o i.c. restraint     code='96-mm01 98-a'
NOTATION      data content notation          encoding='eps'
Enumerated attribute values
Attribute values can also be constrained to be one of a finite set of allowed values
<!ATTLIST section status (draft|alpha|beta|final) 'draft' >
<section status='alpha'>
<section status='final'>
<section>
<section status='gamma'>   Not valid
Elements vs Attributes
<!ELEMENT date (day, month, year)>
<!ELEMENT day (#PCDATA)>
  Content is unconstrained
  Order will be enforced
vs
<!ELEMENT date EMPTY>
<!ATTLIST date day   NUMBER #REQUIRED
               month NUMBER #REQUIRED
               year  NUMBER #REQUIRED>
  Content is constrained (NUMBER is an SGML declared value; the closest XML attribute type is NMTOKEN)
  Order is unconstrained
DTD: Entities
<!DOCTYPE article [
<!ELEMENT article - - (#PCDATA)>
<!ENTITY ltg 'Language Technology Group'>
]>
<article>The &ltg; carries out application-oriented research in language engineering. The &ltg; is based within the HCRC.</article>
Each occurrence of &ltg; in the text is replaced by Language Technology Group during parsing.
Entities can be nested:
<!ENTITY hcl 'HCRC &ltg;'>
DTD: Parameter Entities
Like entities, except that they are used within the DTD.
<!ENTITY % section '(title?, para+)'>
Each time the parser finds %section; in the DTD, it will replace it with (title?, para+).
<!ENTITY % section '(title?, para+)'>
<!ELEMENT article - - (title, %section;+)>
<!ELEMENT subsect - - (%section;+)>
DTD
That's almost all there is to it.
  For more detail, see the XML standard,
  which, as Michael Kay puts it, is like tax legislation.
DTD syntax differs from element syntax.
  Harder to learn/use
  XML Schema
Also, DTDs were designed to be used by document designers, not for distributed data interchange.
  XML can use a DTD, but doesn't assume one.
  Composite documents entail composite DTDs, but these don't exist.
  Namespace prefixes add extra complexity.
XSL Transformations
Content from one document.
Style from another
Structure
barts_stylish_memo.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="memo.xsl"?>
<!DOCTYPE article [
<!ELEMENT article (title,(para|credit)+)>
<!ELEMENT para (#PCDATA)>
<!ENTITY ltg "Language Technology Group">
<!ENTITY author "Bart Simpson">
<!ENTITY techie "Lisa Simpson">
<!ENTITY parents "Marge and Homer">
<!ENTITY school "M&amp;M University">
]>
<article>
<title>Bart's Ph.D Thesis</title>
<para> by &author;: &school;</para>
<para>
This is the text of a very short article,
with very little internal structure.
Here is a reference to the &ltg; entity.
Please may I stop now?</para>
<credit>&techie; of &school; for slick XML authoring.
</credit>
<credit>&parents; for unfailing support.
</credit>
</article>
memo.xsl
IE5 attempts to display the style in visual form, without any content.
Germ of a good idea here.
Source of memo.xsl
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title><xsl:value-of select="//title"/></title></head>
<body BGCOLOR='#FFFFCC'>
<h1><xsl:value-of select="//title"/></h1>
<xsl:for-each select="//para">
<p><xsl:value-of/></p>
</xsl:for-each>
<hr/>
<p><i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
  <xsl:value-of/><br/>
</xsl:for-each>
<hr/></p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Fill in the blanks
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title>•••</title></head>
<body BGCOLOR='#FFFFCC'>
<h1>•••</h1>
<xsl:for-each select="//para">
<p>•••</p>
</xsl:for-each>
<hr/>
<p><i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
  ••• <br/>
</xsl:for-each>
<hr/></p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
XSLT gives you tools for sending part of document to one place, part to another.
Simplest use is pure fill in the blanks. Anybody who uses HTML, PHP and so on will be comfortable with this use of XSLT
If necessary, it is a Turing-complete programming language. It gives you the rope if you need it.
Fill in the blanks
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:template match="/">
<html>
<head><title> <xsl:value-of select="//title"/> </title></head>
<body BGCOLOR='#FFFFCC'>
<h1> <xsl:value-of select="//title"/> </h1>
<xsl:for-each select="//para">
<p> <xsl:value-of/> </p>
</xsl:for-each>
<hr/>
<p><i> Thanks to: </i><br/>
<xsl:for-each select="//credit">
  <xsl:value-of/> <br/>
</xsl:for-each>
<hr/></p>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
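For readers who know template languages better than XSLT, the fill-in-the-blanks idea can be mimicked in a few lines of Python. This is an analogy only, not XSLT: it uses `string.Template` for the blanks and `xml.etree.ElementTree` to pull out the title, paragraphs and credits. The sample article is a shortened, entity-free stand-in for Bart's memo, and all names are illustrative.

```python
import xml.etree.ElementTree as ET
from html import escape
from string import Template

# The HTML skeleton with named blanks, mirroring the stylesheet's layout.
PAGE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1>\n"
    "$paras\n"
    "<hr/><p><i>Thanks to:</i><br/>\n"
    "$credits\n"
    "<hr/></p></body></html>"
)


def text_of(element):
    """Whitespace-normalised, HTML-escaped text content of an element."""
    return escape(" ".join("".join(element.itertext()).split()), quote=False)


def fill_in(xml_text):
    """Fill the blanks in PAGE from the title, para and credit elements."""
    root = ET.fromstring(xml_text)
    title = root.find(".//title")
    return PAGE.substitute(
        title=text_of(title) if title is not None else "",
        paras="\n".join(f"<p>{text_of(p)}</p>" for p in root.iter("para")),
        credits="\n".join(f"{text_of(c)}<br/>" for c in root.iter("credit")),
    )


ARTICLE = (
    "<article><title>Bart's Ph.D Thesis</title>"
    "<para>Please may I stop now?</para>"
    "<credit>Lisa Simpson for slick XML authoring.</credit></article>"
)

if __name__ == "__main__":
    print(fill_in(ARTICLE))
```

What XSLT adds over this kind of script is that the selection rules and the output skeleton live together in one declarative document.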
XSLT standards
Microsoft’s implementation in IE5 is non-standard (they put it out well before the standard existed). They are moving to conformance.
James Clark’s xt and Michael Kay’s Saxon are much more complete and conformant
W3C eats its own lunch. The HTML versions of the XML standard are generated with XSL
In practice, current best options are:
  Static data: pre-generate HTML from XML at publication time
  Dynamic data: use Saxon or xt as Java Servlets
Generating HTML
HTML is generated by running Saxon on poem.xml and poem.xsl
saxon poem.xml poem.xsl > poem.html
Using IE5 to view poem.xml
<poem>
<author>Rupert Brooke</author>
<date>1912</date>
<title>Song</title>
<stanza>
<line>And suddenly the wind comes soft,</line>
<line>And Spring is here again;</line>
<line>And the hawthorn quickens with buds of green</line>
<line>And my heart with buds of pain.</line>
</stanza>
<stanza>
<line>My heart all Winter lay so numb,</line>
<line>The earth so dead and frore,</line>
<line>That I never thought the Spring would come again</line>
<line>Or my heart wake any more.</line>
</stanza>
<stanza>
<line>But Winter's broken and earth has woken,</line>
<line>And the small birds cry again;</line>
<line>And the hawthorn hedge puts forth its buds,</line>
<line>And my heart puts forth its pain.</line>
</stanza>
</poem>
poem.xsl
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="poem">
<html>
<head><title><xsl:value-of select="title"/></title></head>
<body>
<xsl:apply-templates select="title"/>
<xsl:apply-templates select="author"/>
<xsl:apply-templates select="stanza"/>
<xsl:apply-templates select="date"/>
</body>
</html>
</xsl:template>
<xsl:template match="title">
<div align="center"><h1><xsl:value-of select="."/></h1></div>
</xsl:template>
<xsl:template match="author">
<div align="center"><h2>By <xsl:value-of select="."/></h2></div>
</xsl:template>
<xsl:template match="stanza">
<p><xsl:apply-templates select="line"/></p>
</xsl:template>
<xsl:template match="line">
<xsl:if test="position() mod 2 = 0">&#160;&#160;</xsl:if>
<xsl:value-of select="."/><br/>
</xsl:template>
<xsl:template match="date">
<p><i><xsl:value-of select="."/></i></p>
</xsl:template>
</xsl:stylesheet>
Namespace declaration is different (standard conforming) for Saxon.
+ XSLT language is different.
+ Saxon and XT are really easy to install.
- IE5 has millions of current users
“Problems” with XML
Uses complex and weird terminology
  Yes. But so does the ANSI C standard. So do most fields…
Not convenient for specifying graphs (as opposed to trees)
  This is a point about graphs, not XML. Unification grammar notations get unwieldy too.
Not as convenient as plain text
  True for some tasks, but the extra structure of XML lets you do things that you wouldn't even try with plain text.
XML tools for Unix
Simple equivalents of UN*X tools are available (for free) to do simple SGML processing
We'll introduce them using examples, and give details at the end
sggrep
LT XML program for searching for structure and text in XML files
  sggrep -q query -s subquery -t regexp in.xml
Options
  -d DTD: Specify a DTD explicitly. File is an XML file
  -r : Attribute values in queries are regular expressions.
  -v : Invert sense of sub-query+regexp.
  Other options
LT XML query language
Two-dimensional regular expressions
  First dimension is over tree paths
    Based on file path analogy: DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs
  Second dimension is regular expressions over text content of leaf nodes
    Select Ss containing Ws whose text is it's or its:
    -q S -s './W' -t "^(it's|its)$"
  Full UTZOO (Henry Spencer) regular expression support
Influential, slightly dated now.
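The two dimensions (a path over the tree, then a regular expression over text) can be imitated with Python's `xml.etree.ElementTree` paths plus the `re` module. This is only an analogue of the LT XML query language, not an implementation of it; the `select` function, the toy corpus and the `ID` attributes are invented for the example.

```python
import re
import xml.etree.ElementTree as ET


def select(root, query, subquery, regexp):
    """Elements matching `query` that contain a `subquery` element whose text matches."""
    pattern = re.compile(regexp)
    return [
        element
        for element in root.findall(query)
        if any(pattern.search(sub.text or "") for sub in element.findall(subquery))
    ]


CORPUS = (
    "<TEXT>"
    "<S ID='s1'><W>The</W><W>dog</W><W>wagged</W><W>its</W><W>tail</W></S>"
    "<S ID='s2'><W>The</W><W>cat</W><W>slept</W></S>"
    "<S ID='s3'><W>Perhaps</W><W>it's</W><W>raining</W></S>"
    "</TEXT>"
)

if __name__ == "__main__":
    root = ET.fromstring(CORPUS)
    # Rough counterpart of: -q S -s './W' -t "^(it's|its)$"
    for sentence in select(root, ".//S", "./W", r"^(it's|its)$"):
        print(sentence.get("ID"))
```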
sggrep: examples of use
sggrep -q ".*/P/S" -s "./W[TAG=NN]"
  find all S elements occurring inside a P element at any depth which immediately contain a W element with attribute TAG="NN".
sggrep -q ".*/P/S/W[TAG=NN]"
  find those W elements themselves
sggrep -q ".*/S/W[0]" -t "^[a-z]"
  find all sentence-initial words starting with a lower case letter.
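For comparison, roughly the same three searches can be written with the XPath subset in Python's `xml.etree.ElementTree`. Two differences from the sggrep notation are worth noting: attribute tests are written `[@TAG='NN']`, and positions count from 1 rather than 0. The toy corpus is made up for the example.

```python
import re
import xml.etree.ElementTree as ET

CORPUS = (
    "<TEXT><P>"
    "<S ID='s1'><W TAG='DT'>the</W><W TAG='NN'>cat</W><W TAG='VBD'>sat</W></S>"
    "<S ID='s2'><W TAG='PRP'>It</W><W TAG='VBD'>slept</W></S>"
    "</P></TEXT>"
)
root = ET.fromstring(CORPUS)

# S elements inside a P (at any depth) that directly contain a W with TAG="NN".
sentences_with_nn = [
    s for s in root.findall(".//P/S") if s.find("W[@TAG='NN']") is not None
]

# Those W elements themselves.
nn_words = root.findall(".//P/S/W[@TAG='NN']")

# Sentence-initial words that start with a lower case letter.
lowercase_initial = [
    w for w in root.findall(".//S/W[1]") if re.match(r"[a-z]", w.text or "")
]

if __name__ == "__main__":
    print([s.get("ID") for s in sentences_with_nn])
    print([w.text for w in nn_words])
    print([w.text for w in lowercase_initial])
```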
sgmltrans
Converts XML into different formats.
  sgmltrans -r rulefile file.nsg > file.txt
Sample rule file:
  .*/W          matches W
  ""            what to print at start tag
  "/$TAG\n"     what to print at end tag: value of TAG attribute
  .*/W/#        matches text inside W
  " " --> ""    text replacement: eliminate space if any
  .*/S          matches S
  ""            start tag: nothing
  "\n"          end tag: make each S on separate line
  .*            matches other markup
sgmltrans: example of use
The previous rule file would do this:
<?xml version='1.0'?>
<TEST><P><S>
<W TAG='A'>The </W>
<W TAG='B'>cat </W>
<W>sat </W>
<C>.</C></S>
<S>
<W TAG='A'>on </W>
<W TAG='B'>the </W>
<W>mat </W>
<C>.</C>
</S></P></TEST>
The/A
cat/B
sat/
on/A
the/B
mat/
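The effect of the rule file on the W elements can be reproduced in a few lines of Python, which may help when sgmltrans is not installed. This sketch covers only the word/TAG part of the transformation (it does not attempt to emulate the rule language itself), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version='1.0'?>
<TEST><P><S>
<W TAG='A'>The </W>
<W TAG='B'>cat </W>
<W>sat </W>
<C>.</C></S>
<S>
<W TAG='A'>on </W>
<W TAG='B'>the </W>
<W>mat </W>
<C>.</C>
</S></P></TEST>"""


def word_slash_tag(xml_text):
    """One 'word/TAG' string per W element; the tag is empty if TAG is absent."""
    root = ET.fromstring(xml_text)
    return [f"{(w.text or '').strip()}/{w.get('TAG', '')}" for w in root.iter("W")]


if __name__ == "__main__":
    print("\n".join(word_slash_tag(SAMPLE)))
```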
sgrpg: SGML report generator
Program for making more complex queries of normalised SGML and for transforming SGML.
Provides nested subqueries and sequencing.
Usage:
sgrpg query sub-query regexp out-fmt oargs < file.nsg > file.txt
sgrpg -f pat-file < file.nsg > file.txt
This now looks like a design study for XSLT and XML Query.
Has one advantage: it was designed from the outset for big documents.
The British National Corpus
2 gigabytes of contemporary English
Marked up to word level with part-of-speech tags
Extract data:
zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"
gives all singular nouns in a part of the corpus, e.g.
<W TYPE=NN1>part </W>
<W TYPE=NN1>meeting </W>
<W TYPE=NN1>while </W>
<W TYPE=NN1>funeral</W>
<W TYPE=NN1>loss</W>
<W TYPE=NN1>meeting</W>
<W TYPE=NN1>time </W>
The BNC: an example (2)
zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
-t "^[Rr]ight$"
gives sentences containing non-adjectival uses of the word 'right', e.g.
<S N=092> <W TYPE=ITJ>Yes </W> <W TYPE=DT0>that </W> <W TYPE=VBD>was</W> <C TYPE=PUN>, </C> <W TYPE=DT0>that </W> <W TYPE=VBD>was </W> <W TYPE=AV0>right</W> . . . </S>
The BNC: an example (3)
Format the output into a more readable form:
zcat medium.xml.gz | \
sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
sgmltrans -r test.rule
Yes/ITJ that/DT0 was/VBD , that/DT0 was/VBD right/AV0 erm/UNC there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ much/AV0 you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD n't/XX0 there/EX0 ?
He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC he/PNP says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB me/PNP see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ at/PRP the/AT0 menu/NN1 and/CJC said/VVD right/AV0 , he/PNP said/VVD .
An extended example: Noun Compounds
Noun compounds in the British National Corpus
What is a noun compound?
  Too hard. Simple approximation? A sequence of tags matching NN...
  BNC uses a version of the Brown tags, where NN0, NN1, ... are all variants of Noun
A pipeline of SGML-aware tools will do the job:
sgrpg | sggrep [ | ... ]
  Use sgrpg to wrap such tag sequences in <G> ... </G>.
  Use sggrep to filter the output.
  Use further tools to tabulate, format, etc.
An extended example: The pipe
Step by step through the pipe:
sgrpg -r -f np-pat.xml | ...
  Group the sequences
  -r use regexp matching
  -f script file
... sggrep -d groups.xml -q '.*/G'
  Extract the sequences
  -d DTD
  -q query (selects groups)
Result:
<G><W TYPE='AJ0-NN1'>Local</W><W TYPE='NN0'>government</W><W TYPE='NN2'>districts</W></G>
...
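The grouping step can be sketched in Python to make the approximation concrete. This is not the sgrpg pattern file (np-pat.xml is not reproduced in these slides); it is a stand-in that assumes a compound is a run of at least two adjacent W elements whose TYPE contains NN. The function name noun_groups, the min_len threshold and the sample sentence are all assumptions made for illustration.

```python
import itertools
import re
import xml.etree.ElementTree as ET

# Hypothetical sentence, invented for this sketch.
SAMPLE = (
    '<S><W TYPE="AJ0-NN1">Local</W><W TYPE="NN0">government</W>'
    '<W TYPE="NN2">districts</W><W TYPE="VVB">exist</W>'
    '<W TYPE="NN1">today</W></S>'
)

# Unanchored match, so ambiguous tags such as AJ0-NN1 also count as nouns.
NOUNISH = re.compile(r"NN")


def noun_groups(sentence, min_len=2):
    """Return runs of at least `min_len` adjacent W children of `sentence`
    whose TYPE attribute contains NN, as lists of word strings."""
    groups = []
    words = sentence.findall("W")
    keyed = itertools.groupby(
        words, key=lambda w: bool(NOUNISH.search(w.get("TYPE", "")))
    )
    for is_noun, run in keyed:
        run = list(run)
        if is_noun and len(run) >= min_len:
            groups.append([w.text for w in run])
    return groups


groups = noun_groups(ET.fromstring(SAMPLE))
print(groups)
```

The single noun "today" is dropped because it does not form a sequence; whether sgrpg was configured the same way is not stated in the slides.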
An extended example: filtering
Find all words with unresolved tags, e.g. AJ0-NN1
  Use regexp matching, which is unanchored by default:
  ... | sggrep -r -q './W[TYPE="-"]' | ...
Find all words in second position:
  ... | sggrep -q './W[1]' | ...
Find all words with unresolved tags in second position:
  ... | sggrep -r -q './W[1 TYPE="-"]' | ...
An extended example: counting
Count all words in second position:
  ... | sggrep -q './W[1]' | sgcount
Count all words with unresolved tags in second position:
  ... | sggrep -r -q './W[1 TYPE="-"]' | sgcount
Results:
  all 2nd place W                    23283
  2nd place W with unresolved tag     5066
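The two counts can be reproduced on a toy input with Python's standard library. This is a sketch, not sgcount: the GROUPS wrapper element and the three sample groups are invented, and an "unresolved" tag is taken to be any TYPE value containing a hyphen, as in the regexp query.

```python
import xml.etree.ElementTree as ET

# Hypothetical grouped output, invented for this sketch.
SAMPLE = """<GROUPS>
<G><W TYPE="AJ0-NN1">Local</W><W TYPE="NN0">government</W><W TYPE="NN2">districts</W></G>
<G><W TYPE="NN1">car</W><W TYPE="NN1-VVB">park</W></G>
<G><W TYPE="NN1">tea</W><W TYPE="NN1">cup</W></G>
</GROUPS>"""

root = ET.fromstring(SAMPLE)

# Second-position words: index 1, matching sggrep's zero-based W[1].
second = [g.findall("W")[1] for g in root.iter("G") if len(g.findall("W")) > 1]

# Unresolved tags contain a hyphen, e.g. NN1-VVB.
unresolved_second = [w for w in second if "-" in w.get("TYPE", "")]

print(len(second), len(unresolved_second))
```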
An extended example: long compounds
Long compounds including 'government'
Use a subquery to select <G>...</G>s with 'government':
sggrep -q G -s './W' -t government
Next step, discard short ones:
sggrep -q G -s './W[2]'
Then sgmltrans for a neater format
Results:
official/AJ0-NN1 government/NN0 report/NN1-VB
Local/AJ0-NN1 government/NN0 districts/NN2
...
An extended example: left context
Select for 'government' in 2nd place:
... | sggrep -q G -s './W[1]' -t government |
Pull words from first place:
sggrep -q './W[0]' |
Remove markup:
textonly |
Use UN*X for the rest:
sort | uniq -c | sort -nr | head -4
6 French
5 German
4 interim
4 Chinese
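The whole left-context pipe, including the sort | uniq -c | sort -nr | head step, can be approximated with collections.Counter. The sketch below is an illustration only: the function name left_context and the sample groups are invented, and the counts it prints come from that toy sample, not from the BNC.

```python
import re
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical grouped output, invented for this sketch.
SAMPLE = """<GROUPS>
<G><W>French</W><W>government</W><W>policy</W></G>
<G><W>German</W><W>government</W><W>spokesman</W></G>
<G><W>French</W><W>government</W><W>minister</W></G>
<G><W>local</W><W>council</W><W>tax</W></G>
</GROUPS>"""


def left_context(root, target, top=4):
    """Count first-position words of groups whose second word matches
    `target` (an unanchored regexp) and return the `top` most frequent."""
    counts = Counter()
    for g in root.iter("G"):
        words = [(w.text or "").strip() for w in g.findall("W")]
        if len(words) > 1 and re.search(target, words[1]):
            counts[words[0]] += 1
    return counts.most_common(top)


ranking = left_context(ET.fromstring(SAMPLE), "government")
print(ranking)
```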
British International Corpus?
We are more francophone than we think!
Longest 'noun-phrase' in 10% of BNC is:
serai/NN1 mentionné/NN1 dans/NN2 le/NN1 rapport/NN1-VB qui/NN1 te/NN1 sera/NN1 remis/NN1
No disgrace that the part-of-speech tagger gave up here.
Tools can't be better than their input allows.
XML Conclusions
XML is the wave of the future
  Both Microsoft and Netscape have endorsed it
  Both Mozilla and IE5 have XML support built in
  Very good free software is available
  Microsoft seem to be serious about standards compliance
The W3C have made it clear that all subsequent W3C standards for web distribution of information will be based on XML (cf. SMIL, SVG and RDF)
Issues
  XSLT efficiency: space and time.
To read
Robin Cover's SGML/XML Web Page
http://www.sil.org/sgml/sgml.html
  includes many pointers to SGML tutorials, overviews, publications
The Whirlwind Guide to SGML & XML Tools and Vendors
http://www.infotek.no/sgmltool/guide.htm
The XML FAQ
http://www.ucc.ie/xml/
  An excellent introduction to XML with pointers to useful resources for newcomers to the standard
SGML/XML for Linguistics
2.1 Programs for querying/modifying SGML
  an example
  what is needed
  available tools
2.2 SGML marked-up corpora
  some existing resources
2.3 Related developments
  SSTML SGML for X-waves
An example
You want to build a system that performs a particular LE task
You have a corpus of texts for
  analysis (detecting textual regularities)
  system training
  system testing
Use XML
  Why?
  How?
Why use XML?
Use the structure of the text to fine-tune certain tools
  e.g. build a tokeniser which works specifically for headlines of newspaper articles
Annotate text with linguistic information
  e.g. use SGML tags to record the results of a tokeniser or part-of-speech tagger, so that other tools can make use of this information
Ensure that others (and you two years from now :-) will have easy access to your results
  No special-purpose parser required
  Simple retrieval and tabulation with existing free tools
  DTD provides some self-documentation
What is needed to use XML?
XML is text
  Therefore: you can use any UNIX text manipulation program, e.g. grep, sed, awk, perl, etc.
XML is annotated text
  Therefore needed: versions of these tools that are XML-aware
What is needed to use XML?
SGML reflects the hierarchical structure of a text
You want to be able to tell tools to operate on a particular part of the SGML-annotated text, for example:
  all WORD elements with attribute POS set to JJ (i.e. all adjectives)
    occurring within the first PARAGRAPH of the main BODY of an ARTICLE; or
    occurring within the HEADLINE of an ARTICLE
Needed: a query language over XML structures
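To make the requirement concrete, here is how the two example selections look in one widely available path language, the XPath subset of Python's xml.etree.ElementTree. This is a sketch for comparison, not the LT XML query language; the sample article is invented, and only the element and attribute names come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical article, invented for this sketch.
SAMPLE = """<ARTICLE>
<HEADLINE><WORD POS="JJ">Big</WORD> <WORD POS="NN">news</WORD></HEADLINE>
<BODY>
<PARAGRAPH><WORD POS="DT">A</WORD> <WORD POS="JJ">small</WORD> <WORD POS="NN">town</WORD></PARAGRAPH>
<PARAGRAPH><WORD POS="JJ">late</WORD> <WORD POS="NN">edition</WORD></PARAGRAPH>
</BODY>
</ARTICLE>"""

article = ET.fromstring(SAMPLE)

# Adjectives within the first PARAGRAPH of the BODY.
in_first_para = [
    w.text for w in article.findall("./BODY/PARAGRAPH[1]/WORD[@POS='JJ']")
]

# Adjectives within the HEADLINE.
in_headline = [w.text for w in article.findall("./HEADLINE/WORD[@POS='JJ']")]

print(in_first_para, in_headline)
```

The adjective "late" is excluded because it sits in the second paragraph, which is exactly the kind of structural restriction that a plain grep cannot express.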
What is needed to use XML?
XML-aware versions of text processing tools
Query language
  In fact sggrep is just a simple wrapper round our query language.
  Our query language and interface are designed to work with big files, so they don't read the whole document into memory unless absolutely necessary. Most competitors do read the whole document into memory.
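The streaming idea can be illustrated with Python's xml.etree.ElementTree.iterparse, which delivers elements as they are completed instead of building the full tree first. This is a sketch of the general technique, not of LT XML's internals; the function name stream_words and the in-memory sample are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; in practice `source` would be a large file.
SAMPLE = (
    b"<TEXT><S><W TYPE='NN1'>part </W><W TYPE='AT0'>the </W></S>"
    b"<S><W TYPE='NN1'>meeting </W></S></TEXT>"
)


def stream_words(source, wanted_type):
    """Yield the text of W elements with TYPE == wanted_type, handling
    one completed S element at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "S":
            for w in elem.iter("W"):
                if w.get("TYPE") == wanted_type:
                    yield (w.text or "").strip()
            # Discard the finished sentence's children so memory use does
            # not grow with the number of words in the document.
            elem.clear()


nouns = list(stream_words(io.BytesIO(SAMPLE), "NN1"))
print(nouns)
```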
XML tools: the LT XML library
sggrep is part of an SGML toolset called LT XML
Developed by the Language Technology Group (Edinburgh)
  see: http://www.ltg.ed.ac.uk/software
XML library with
  Command-line tools
  Application Programming Interface (API)
Available for WIN32, UN*X (and Mac)
LT XML processes XML or nSGML
  nSGML now looks like a design study for XML
LT XML: Command-line tools
sggrep - retrieving context-sensitive data
sgmltrans - transforming information
sgrpg - more complex queries/reformatting
textonly - strips out SGML markup
sgcount - counts SGML tags
knit - resolves XML-link links
others
LT NSL: APIs
LT NSL Application Program Interfaces: procedure calls to help you write your own programs to process nSGML
  C language API
  Python language API
C API for specialised access
Write your own programs to read/write SGML/XML
LT XML provides a rich API
Both event and tree views of the document stream
The distribution includes two heavily commented example programs.
Python language API for LT XML
Experimental integration of the LT XML API into Python (a free, portable, object-oriented scripting language)
Uses the Tk portable widget library for graphical UI
Reflects document stream as Python objects
Specialised XML editors
Using the Python API we have written a number of specialised processors:
A WYSIWYG XML instance editor (XED)
Several specialised annotation tools, e.g. PoS correctors, span coders
  Limited set of operations
  Preserve validity
  Hide structure from the user
Dataflow in LT NSL programs
[Figure: dataflow diagram. Labelled components: mknsg, unknit, parser, nSGML, stream API, NSL C(++) program, DDB file; input files file1.sgm, file2.sgm, ...]
The Edinburgh MapTask Corpus
Contents
  128 task-oriented spontaneous Scottish dialogues
  small corpus, but very dense and detailed SGML markup.
Availability:
  Transcripts and digitized speech on 8 CD-ROMs: http://www.elsnet.org/resources.html or from the LDC
What is its markup like?
  (early) TEI-compliant
  Turns, pointers into the speech, identification of non-words.
  Word-level transcripts with timing markup available soon via the Internet
HCRC Maptask: an example
mknsg q1ec1.turns.sgm | sggrep -q ".*/W[TAG=at]"
<W START=2.9644 DUR=0.0725 UTT=1 TAG=at>a</W>
<W START=17.1410 DUR=0.1779 UTT=3 TAG=at>an</W>
<W START=18.6693 DUR=0.0791 UTT=3 TAG=at>the</W>
Parsed HCRC Maptask : an example
mknsg q1ec1.g.syn.sgm | sggrep -q ".*/NP" | sgmltrans -r mt.rule
<NP>we </NP>
<NP>a caravan park </NP>
<NP>we </NP>
<NP>we </NP>
<NP><NP><NP>an old mill </NP></NP><PP>on <NP>the right hand side </NP></PP></NP>
<NP><NP>an old mill </NP><PP>on <NP>the right </NP></PP></NP>
<NP>you </NP>
...
The MLCC corpus
Contents
  Financial newspaper texts: Dutch, English, French, German, Italian, Spanish
  Parallel texts:
    The Journal of the European Commission, Written Questions (1993).
    Corpus of European Parliamentary debates (1993-1994).
    (languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish)
Markup
Available from ELRA: http://www.inpg.fr/ELRA/catalog.html
The MLCC Corpus: an example
zcat exp.joc006.93.en.01.tei.gz |\
mknsg | \
sggrep -q ".*/DIV4[TYPE=Q]/HEAD"
<HEAD>Subject: The staffing in the Commission of the European Communities</HEAD>
<HEAD>Subject: Supplies of military equipment to Iraq</HEAD>
<HEAD>Subject: Commission plans to liberalize the postal sector and to abolish the State monopoly</HEAD>
<HEAD>Subject: New industries in Attika</HEAD>...
The same example for French
zcat exp.joc006.93.fr.01.tei.gz |\
mknsg | \
sggrep ".*/DIV4[TYPE=Q]/HEAD" ""
<HEAD>Objet: Organigramme de la Commission</HEAD>
<HEAD>Objet: Livraisons de matériel militaire à l'Irak</HEAD>
<HEAD>Objet: Projets de la Commission visant à libéraliser et à abolir le monopole d'État dans le secteur des postes</HEAD>
<HEAD>Objet: Nouvelles industries en Attique</HEAD>
Corresponds to the English data: Suitable input for multilingual alignment experiments.
The Text Encoding Initiative (TEI)
The TEI is a large and well-documented DTD for textual markup.
  Use it if you can
  Now has an XML version
Large and comprehensive hardcopy documentation available
  http://www.uic.edu/orgs/tei/
  DTDs available there as well
The Linguistic Data Consortium
LDC - based in Pennsylvania, USA
  Distributes text corpora
  See: http://www.ldc.upenn.edu/
SGML corpora include:
  The European Language Newspaper Text corpus
    French (100 million words), German (90 million words) and Portuguese (15 million words). SGML.
  TIPSTER Information Retrieval Text Research Collection
    3 gigabytes. SGML-like. Various English texts.
  United Nations Parallel Text Corpus (English, French, Spanish)
    Fully compliant SGML, 2.5 gigabytes
Tutorials
XML: far too many to mention
XSL:
XSL specification http://www.w3.org/Style/XSL
Robin Cover's guide http://www.oasis-open.org/cover/xsl.html
Resources
LT-XML http://www.ltg.ed.ac.uk/software/xml/index.html
Full-text search
  Witten, Moffat and Bell's Managing Gigabytes
  http://www.cs.mu.OZ.AU/mg/
Corpus Tools
Stuttgart Corpus Workbench
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench
Birmingham Qwick
http://www-clg.bham.ac.uk/QWICK/
The MATE Workbench
http://www.cogsci.ed.ac.uk/~dmck/MateCode
  NB. Prototype
Bibliography
McKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural Language Processing. Computers and the Humanities, 31(5): 367-388, 1997.
Sinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for Corpus Exploration. Computers and the Humanities, 31(3): 229-255, 1998.
References on McKelvie's MATE workbench page: http://www.cogsci.ed.ac.uk/~dmck/MateCode
Welty and Ide: Using the right tools: enhancing retrieval from marked-up documents. Computers and the Humanities, 33(10): 59-84, 1999.
Alignment graphs (and much else): Steven Bird's Linguistic Annotation Page, http://www.ldc.upenn.edu/annotation/
Annotation topics
Item annotations
  Words, parts-of-speech, lemmas
Simple annotations (one data stream)
  Boundaries, Spans, Partitions
Complex annotations (multiple data streams)
  Sequences, Graphs, Overlaps
Data models for annotation access
  Streams, Trees, Graphs, Databases
Human factors in annotation
  Writing instructions, measuring and improving reliability
XML topics
Data formats
  HTML, XML and SGML
Data Description Formalisms
  DTDs, XML Schema
Style Languages
  XSLT
Query Languages
  Annotation Graphs, XML Query, XQL, Quilt, LORE
Exercises
On average, these exercises should take about one hour to complete. Try not to spend longer.
Create an XML document
  Create a very simple memo
Simple annotation
  Disambiguate parts-of-speech
  Compare results with those made by a partner.
Style
  Create an XML DTD and an XSL style sheet for displaying POS-tagged text in a browser.
Exercises
More complex annotation
  Syntactic annotation in Penn Treebank style.
  As before, compare results
Search
  Exercise XML search tools on the newly annotated texts
Projects
These are open-ended projects, hard enough to merit write-up in a research paper. I'd willingly supervise these.
Design a DTD and an XSL stylesheet for treebank-style syntactic annotations. Implement a convenient interface allowing these annotations to be edited over the Web.
Investigate the corpus search tools provided at the LDC web-site. What do they do? Could they and should they use XML/XSL technology for the same purpose? (Easiest if your institution has an LDC membership).
Projects (contd)
Critical review of the Talkbank tools (www.talkbank.org)
Design an XML query language that works well with very big documents
What sort of annotation structure for dialog? (cf. MATE)
Design an optimizing compiler for XSLT (cf. Sun’s very recent XSL compiler)
Does XSLT support language modeling and statistical computation? (If you put XSLT and Splus into a closed box and shake vigorously, what emerges?)
In Summary
Phew! </xmlstuff>