27/03/01 CROSSMARC kick-off meeting

LTG Background

• XML-based Processing
  – Several years of experience in developing XML-based software
  – LT XML Tools
  – Pipeline architecture

• Named Entity Recognition
  – LT TTT Tools
  – MUC-7 system

http://www.ltg.ed.ac.uk/software/

LT XML

• Suite of tools which communicate using the LT XML API.

• All use the same query language to access and manipulate subparts of XML documents.

• Simple tools can be composed together into complex applications.

• Programs include sggrep, sgcount, sgsort, xmlnorm, rxp, knit.

• Additional programs: xmlperl, xmlquery

Pipeline Architecture

• An XML document is piped through a series of programs

• Each program targets a particular part of the document via a particular query

• Each program performs some operation, e.g. adding or removing mark-up, making other modifications to the structure of the XML, extracting or counting subparts of the document
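The three points above can be illustrated with a minimal Python sketch. It does not use the LT XML API: the `stage` and `run_pipeline` helpers, the two example operations and the ElementTree path queries are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from functools import reduce

def stage(query, operation):
    """Build a pipeline stage: apply `operation` to every element matching `query`."""
    def run(root):
        for element in root.findall(query):
            operation(element)
        return root
    return run

def mark_short(p):
    # Example operation: add mark-up (an attribute) to short paragraphs.
    if len((p.text or "").split()) < 4:
        p.set("LEN", "short")

def upper_title(title):
    # Example operation: modify the content of the targeted element.
    title.text = (title.text or "").upper()

def run_pipeline(root, stages):
    # Feed the output of each stage into the next, like a Unix pipe.
    return reduce(lambda doc, s: s(doc), stages, root)

doc = ET.fromstring(
    "<DOCS><TEXT><TITLE>results</TITLE>"
    "<P>Net rose.</P><P>The company announced a growth of 20%.</P>"
    "</TEXT></DOCS>"
)
pipeline = [stage(".//TITLE", upper_title), stage(".//P", mark_short)]
result = run_pipeline(doc, pipeline)
short_count = len(result.findall(".//P[@LEN='short']"))
print(ET.tostring(result, encoding="unicode"))
print("short paragraphs:", short_count)
```

Each stage only knows its own query and operation, so stages can be reordered or reused without changing the others.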

LT TTT: Text Tokenisation Tool

• Suite of XML tools designed to tokenise from the most basic level through to high level mark-up.

• Useful for many linguistic applications including corpus annotation.

• Used by the LTG for their MUC-7 system.

LT TTT: programs

• ltpos: a part-of-speech tagger and sentence boundary disambiguator

• fsgmatch: a transducer operating over strings of characters or strings of XML elements using hand-written grammar rules

• Other programs: sggrep, xmlperl, sgdelmarkup
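To make the idea of a rule-driven transducer concrete, here is a rough Python sketch that scans a character string and wraps whatever a hand-written rule matches in mark-up. The rule format, the `RULES` table and the `apply_rules` function are hypothetical; they do not reproduce fsgmatch's grammar syntax or behaviour.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical rules: (element name, attributes, pattern). Not fsgmatch syntax.
RULES = [
    ("NUMEX", {"TYPE": "PERCENT"}, re.compile(r"\d+(?:\.\d+)?%")),
    ("NUMEX", {"TYPE": "MONEY"},
     re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")),
]

def apply_rules(text, rules=RULES):
    """Wrap the first rule that matches at each position; copy other characters."""
    out, pos = [], 0
    while pos < len(text):
        for name, attrs, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                attr_str = "".join(f" {k}='{v}'" for k, v in attrs.items())
                out.append(f"<{name}{attr_str}>{escape(m.group())}</{name}>")
                pos = m.end()
                break
        else:
            # No rule matched here: emit the character unchanged and move on.
            out.append(escape(text[pos]))
            pos += 1
    return "".join(out)

marked = apply_rules("CEG Corp. posted net of $102 million and a growth of 20%.")
print(marked)
```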

LT TTT: grammar files for fsgmatch

• Titles and paragraphs

• Sub-word character sequences

• Words

• Numbers (300, three hundred)

• MUC7 style NUMEX and TIMEX elements

• In-text citations

• Reference lists

• Chunks: noun groups and verb groups (LT CHUNK)

ltpos

• Statistical (maximum entropy) component

• Disambiguates full stops (and optionally adds sentence mark-up)

• Also disambiguates sentence-initial capitals

• Uses Penn treebank tagset; trained on the Brown corpus

• Adds POS tag as value of attribute on W element
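The last point, writing the tag into an attribute of each W element, can be sketched in Python as follows. This is only an illustration: the toy lexicon stands in for the statistical model, and the `tag_words` function, the `.//W` query and the choice of `C` as the attribute name are assumptions rather than a description of how ltpos is implemented.

```python
import xml.etree.ElementTree as ET

# Toy lexicon standing in for the statistical model; tags are Penn Treebank tags.
TOY_LEXICON = {"the": "DT", "company": "NN", "announced": "VBD",
               "a": "DT", "growth": "NN", "of": "IN"}

def tag_words(root, word_query=".//W", attribute="C", default="NN"):
    """Set a POS tag attribute on each word element matched by `word_query`."""
    for w in root.findall(word_query):
        token = (w.text or "").lower()
        if token.isdigit():
            w.set(attribute, "CD")  # cardinal number
        else:
            w.set(attribute, TOY_LEXICON.get(token, default))
    return root

sent = ET.fromstring(
    "<SENT><W>the</W> <W>company</W> <W>announced</W> <W>a</W> "
    "<W>growth</W> <W>of</W> <W>20</W></SENT>"
)
tags = [w.get("C") for w in tag_words(sent).findall(".//W")]
print(tags)
```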

LT TTT: example pipeline

plain2xml.perl \
| fsgmatch -q ".*/TEXT" GRAM/char/paras.gr \
| fsgmatch -q ".*/P" GRAM/char/words.gr \
| ltpos -q ".*/TEXT" -qs ".*/P" -qw ".*/W" -std_form \
  -sent SENT resource.xml \
| fsgmatch -q ".*/P" GRAM/xml/numbers.gr \
| fsgmatch -q ".*/P" GRAM/xml/numex.gr \
| fsgmatch -q ".*/P" GRAM/xml/timex.gr

LT TTT: example input

In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share.

Late last night the company announced a growth of 20%.

LT TTT: example output

<?xml version='1.0'?>

<!DOCTYPE DOCS SYSTEM "general.dtd" >

<DOCS><TEXT><P><SENT><W L='SL' T='w' S='Y' C='W'>In</W> <TIMEX TYPE='DATE'><W C='W'>July</W> <W C='CD'>1995</W></TIMEX> <W C='W'>CEG</W> <W N='A' C='W'>Corp.</W> <W C='W'>posted</W> <W C='W'>net</W> <W C='W'>of</W> <NUMEX TYPE='MONEY'><W C='W'>$</W><PHR C='CD'><W C='CD'>102</W> <W C='W'>million</W></PHR></NUMEX><W C='CM'>,</W> <W C='W'>or</W> <NUMEX TYPE='MONEY'><W C='CD'>34</W> <W C='W'>cents</W></NUMEX> <W C='W'>a</W> <W C='W'>share</W><W T='.' C='.'>.</W></SENT></P> 

<P><SENT><TIMEX TYPE='TIME'><W S='Y' C='W'>Late</W> <W C='W'>last</W> <W C='W'>night</W></TIMEX> <W C='W'>the</W> <W C='W'>company</W> <W C='W'>announced</W> <W C='W'>a</W> <W C='W'>growth</W> <W C='W'>of</W> <NUMEX TYPE='PERCENT'><W C='CD'>20</W><W C='W'>%</W></NUMEX><W T='.' C='.'>.</W></SENT></P></TEXT></DOCS>
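Output of this shape can be consumed by any XML-aware program. As a small sketch, the following Python reads an abridged fragment of the first paragraph (declaration and DOCTYPE omitted, some words dropped) and lists the TIMEX and NUMEX elements with their surface text. The `extract_entities` function is a hypothetical helper, not part of LT TTT.

```python
import xml.etree.ElementTree as ET

# Abridged fragment of the output above, without the XML declaration and DOCTYPE.
SAMPLE = (
    "<P><SENT><W C='W'>In</W> <TIMEX TYPE='DATE'><W C='W'>July</W> "
    "<W C='CD'>1995</W></TIMEX> <W C='W'>net</W> <W C='W'>of</W> "
    "<NUMEX TYPE='MONEY'><W C='W'>$</W><PHR C='CD'><W C='CD'>102</W> "
    "<W C='W'>million</W></PHR></NUMEX><W C='CM'>,</W></SENT></P>"
)

def extract_entities(root):
    """Return (element name, TYPE, surface text) for each TIMEX and NUMEX element."""
    entities = []
    for element in root.iter():  # document order
        if element.tag in ("TIMEX", "NUMEX"):
            # itertext() joins word text and the whitespace between words.
            surface = "".join(element.itertext())
            entities.append((element.tag, element.get("TYPE"), surface))
    return entities

entities = extract_entities(ET.fromstring(SAMPLE))
for entity in entities:
    print(entity)
```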

Named Entity Recognition: MUC7 mark-up

He was one of 118 Nazi rocket engineers secretly brought to the <ENAMEX TYPE="LOCATION">United States</ENAMEX> after the war. The scientists included <ENAMEX TYPE="PERSON">Wernher von Braun</ENAMEX>, the father of the American rocket programs.

<ENAMEX TYPE="ORGANIZATION">MCI</ENAMEX> has long said it would be a bidder and would start the bidding at <NUMEX TYPE="MONEY">$175 million</NUMEX>. <ENAMEX TYPE="ORGANIZATION">MCI</ENAMEX> has teamed up with <ENAMEX TYPE="ORGANIZATION">News Corp.</ENAMEX>.
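One way to work with inline mark-up like this is to convert it into plain text plus character offsets for each entity. The Python sketch below does that for flat (non-nested) mark-up. The `to_standoff` function and the wrapping `DOC` element are assumptions for illustration, not part of the MUC-7 definition or of the LTG tools.

```python
import xml.etree.ElementTree as ET

def to_standoff(marked_up):
    """Return (plain text, [(start, end, element name, TYPE)]) for flat inline mark-up."""
    # Wrap the fragment so that it has a single root element.
    root = ET.fromstring(f"<DOC>{marked_up}</DOC>")
    lead = root.text or ""
    pieces, spans, offset = [lead], [], len(lead)
    for child in root:
        text = "".join(child.itertext())
        spans.append((offset, offset + len(text), child.tag, child.get("TYPE")))
        tail = child.tail or ""  # text between this element and the next one
        pieces += [text, tail]
        offset += len(text) + len(tail)
    return "".join(pieces), spans

text, spans = to_standoff(
    '<ENAMEX TYPE="ORGANIZATION">MCI</ENAMEX> has teamed up with '
    '<ENAMEX TYPE="ORGANIZATION">News Corp.</ENAMEX>.'
)
for start, end, name, type_ in spans:
    print(start, end, name, type_, repr(text[start:end]))
```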

LTG’s MUC7 System

• A pipeline made up of calls to LT TTT tools: ltpos and many calls to fsgmatch using different resource grammars.

• Early stages (before tagging) recognise NUMEX and TIMEX elements.

• Complex final stages (after tagging) recognise ENAMEX elements: calls to fsgmatch using ENAMEX grammars and lexical resources (e.g. first names, gazetteers of place names) are interspersed with calls to a statistical (maximum entropy) component.
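The role of the lexical resources can be sketched with a simple longest-match lookup over tokens. This is a toy illustration in Python: the two tiny word lists, the `lookup_candidates` function and the `PERSON?` label for unconfirmed candidates are hypothetical, and the real system combines such evidence with grammar rules and the statistical component.

```python
# Hypothetical lexical resources; real gazetteers would be far larger.
FIRST_NAMES = {"wernher", "john", "mary"}
PLACE_NAMES = {("united", "states"), ("edinburgh",)}

def lookup_candidates(tokens):
    """Return (start, end, TYPE) candidate spans found by longest-match lookup."""
    candidates, i = [], 0
    max_len = max(len(entry) for entry in PLACE_NAMES)
    while i < len(tokens):
        matched = False
        # Try the longest possible place name first.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(t.lower() for t in tokens[i:i + length])
            if window in PLACE_NAMES:
                candidates.append((i, i + length, "LOCATION"))
                i += length
                matched = True
                break
        if not matched:
            if tokens[i].lower() in FIRST_NAMES:
                # A known first name only suggests that a PERSON may start here.
                candidates.append((i, i + 1, "PERSON?"))
            i += 1
    return candidates

tokens = "brought to the United States with Wernher von Braun".split()
candidates = lookup_candidates(tokens)
print(candidates)
```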

Platforms

• LT XML
  – Unix (Solaris and Linux)
  – Windows/NT

• LT TTT
  – Unix (Solaris and Linux)
  – planned Windows/NT version

Further LTG Expertise

• XML
  – XSLT for document rendering
  – Document linking and stand-off annotation
  – XML query languages
  – Schemas

• NL Generation

• Automatic summarisation

What we hope to gain from CROSSMARC

• Continued maintenance and development of our existing tools.

• Extending our expertise beyond NER to fact extraction.

• Opportunity to experiment with the symbolic/statistical balance in our system and to experiment with alternative statistical methods.

• Automatic induction of NER rules.