ANNIE and JAPE GATE Training Course 23 November 2006 Diana Maynard Andrey Shafirin.
ANNIE and JAPE
GATE Training Course, 23 November 2006
Diana Maynard, Andrey Shafirin
GATE and Information Extraction
● Basic introduction to IE and GATE
● Overview of ANNIE
● JAPE: rule writing
● JAPE debugger
GATE and IE
● IE is one of the core tasks GATE is designed for
● IE is the basis for many other, more complex applications, e.g. semantic annotation
● Cornerstone of IE is Named Entity Recognition
A Typical IE System
1. Pre-processing
– format detection
– tokenisation
– word segmentation
– sense disambiguation
– sentence splitting
– POS tagging
2. Named entity detection
– entity detection
– coreference
Two Approaches to IE
Knowledge Engineering
● rule based
● developed by experienced language engineers
● make use of human intuition
● obtain marginally better performance
● development could be very time consuming
● some changes may be hard to accommodate
Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● requires large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus
Named Entity Recognition
● NE involves identification of proper names in texts, and classification into a set of predefined categories of interest.
● Three universally accepted categories: person, location and organisation
● Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.
● Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
ANNIE
[Figure: ANNIE IE modules. Input (URL or text) in a document format (XML, HTML, SGML, email, …) becomes a GATE Document and passes through the processes listed below. Output: GATE Document, XML dump of IE annotations. NOTE: square boxes are processes, rounded ones are data.]
● Unicode Tokeniser (data: character class sequence rules)
● FS Gazetteer Lookup (data: lists)
● Sentence Splitter (data: JAPE sentence patterns)
● Hepple POS Tagger (data: Brill rules, lexicon)
● Semantic Tagger (data: JAPE IE grammar cascade)
● OrthoMatcher
● Pronominal Coreferencer (data: JAPE grammar)
Unicode Tokeniser
● Bases tokenisation on Unicode character classes
● Language-independent tokenisation
● Declarative token specification language, e.g.:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"* >
Token; orth=upperInitial; kind=word
Look at the ANNIE English tokeniser and at tokenisers for other languages (in the plugins directory) for more information and examples
Gazetteer
● Set of lists compiled into Finite State Machines
● 60k entries in 80 types, inc.: organization; artifact; location; amount_unit; manufacturer; transport_means; company_designator; currency_unit; date; government_designator; ...
● Each list has attributes MajorType and MinorType (and Language):
city.lst: location: city: english
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
● Attributes are used as input to JAPE grammars
● List entries may be entities or parts of entities, or they may contain contextual information (e.g. job titles often indicate people)
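As a rough illustration of how gazetteer attributes feed a JAPE grammar, the sketch below matches a number followed by an entry from currency_unit.lst. It assumes the tokeniser and gazetteer have already run; the phase name, rule name and the Money annotation type are made up for this example.

```jape
// Hypothetical phase: names below are for illustration only.
Phase: Money
Input: Token Lookup

// A number token followed by a gazetteer Lookup whose
// majorType/minorType come from currency_unit.lst.
Rule: AmountWithUnit
(
  {Token.kind == number}
  {Lookup.majorType == currency_unit, Lookup.minorType == post_amount}
):amount
-->
:amount.Money = {kind = "amount", rule = "AmountWithUnit"}
```

The grammar only ever sees the Lookup attributes, not the list files themselves, so a list can be extended without touching the rule.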
The Named Entity Grammar
● JAPE phases run sequentially and constitute a cascade of FSTs over annotations
● hand-coded rules applied to annotations to identify NEs
● annotations from format analysis, tokeniser, POS tagger and gazetteer modules
● use of contextual information
● rule priority based on pattern length, rule status and rule ordering
● Common entities: persons, locations, organisations, dates, addresses.
Orthomatcher
● Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown
● Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation
● The latter is the only case where an annotation type can be changed
● Lookup tables of aliases and exceptions (i.e. overriding of matching rules)
● Also pronominal coreference (see User Guide)
JAPE: a Jolly And Pleasant Experience
● Grammars (cascades of phases)
– Phases (lists of rules)
● Rules
– LHS (patterns)
– RHS (actions)
● Priority
– Implicit
  ● longest match
  ● first mention
– Explicit
  ● priority
LHS of JAPE rules
● The LHS of the rule contains patterns to be matched, in the form of annotations (and optionally their attributes).
● Annotation types to be recognised must be declared at the beginning of the phase
● Annotations may be combined using traditional operators [ | * + ?]
● There is no negative operator
● More than one pattern can be matched in a single rule
● Left and right context (not to be annotated) can be matched
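A minimal sketch of how these points fit together in one phase file. The phase and rule names and the Year annotation type are assumptions for illustration; Token.kind and Token.length are features produced by the ANNIE tokeniser.

```jape
// Hypothetical phase: only annotation types listed under Input
// are visible to the rules in this phase.
Phase: YearInContext
Input: Token
Options: control = appelt

// "in" or "by" is left context: it must be present to match, but it
// lies outside the :date binding, so only the four-digit number
// receives the new annotation.
Rule: YearAfterPreposition
(
  {Token.string == "in"} | {Token.string == "by"}
)
(
  {Token.kind == number, Token.length == "4"}
):date
-->
:date.Year = {rule = "YearAfterPreposition"}
```

Note that the context token is still consumed by the match even though it is not annotated.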
Examples of LHS patterns
({Lookup.majorType == location}) :loc
---------------------
({Token.string == "in"} | {Token.string == "by"})
({Year}) :date
--------------------
(
({Lookup.majorType == jobtitle} ):jobtitle
{Surname}
):person
RHS of JAPE rules
({Lookup.majorType == location}) :loc
-->
:loc.Location = {kind = "city", rule = "Location1"}
----------------------
(
({Lookup.majorType == jobtitle} ):jobtitle
{Surname}
):person
-->
:jobtitle.JobTitle = {rule = "PersonJobTitle"},
:person.Person = {kind = "Surname", rule = "PersonJobTitle"}
Complex RHS
● JAPE RHS is quite limited in what you can do
● But you can use any Java you like on the RHS of the rule
● Useful for e.g. removing temporary annotations and percolating and manipulating features from previous annotations
● Also means you can use JAPE for many other things apart from just creating annotations, e.g. counting things, manipulating the text, adding annotations to the document, etc.
● And you don’t have to be a Java expert to do it.
● Although it helps to have friends who are….
Example of using Java in a rule
Rule: FirstName
(
  {Lookup.majorType == person_first}
):person
-->
{
  // annotations bound to the label "person" on the LHS
  gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
  gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
  // copy the gazetteer minorType into a "gender" feature
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("gender", personAnn.getFeatures().get("minorType"));
  features.put("rule", "FirstName");
  // create the new annotation over the matched span
  outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson", features);
}
Available Java objects
● bindings: binding variables
● doc: GATE Document
● annotations: all GATE Document annotations
● inputAS, outputAS: phase input and output annotations
● ontology
See documentation for more details…..
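One possible use of these objects is a clean-up phase that removes temporary annotations. This is only a sketch: TempPerson is a hypothetical annotation type assumed to have been created by an earlier phase, and the phase and rule names are invented.

```jape
// Hypothetical clean-up phase; TempPerson is an assumed temporary type.
Phase: Clean
Input: TempPerson
Options: control = appelt

Rule: RemoveTempPerson
(
  {TempPerson}
):temp
-->
{
  // bindings maps the label "temp" to the annotations matched on the LHS
  gate.AnnotationSet temp = (gate.AnnotationSet)bindings.get("temp");
  // remove the matched temporary annotations from the phase's input set
  inputAS.removeAll(temp);
}
```

Running such a phase last in the cascade keeps intermediate annotations out of the final output.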
JAPE Application modes
● Brill (fires all matches)
● First (shortest match fires)
● Once (Phase exits after first match)
● All (as for Brill, but matching continues from offset following the current one, not from the end of the last match)
● Appelt (priority ordering: longest match fires, then explicit rule priority, then first defined rule fires)
Note that prioritisation only operates within a single phase, not globally
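The mode is chosen per phase in the Options line of the phase header. A small sketch, with an invented phase name and a rule modelled on the gazetteer examples in these slides:

```jape
// Hypothetical phase header showing where the application mode is set.
Phase: Locations
Input: Token Lookup
// control may be brill, first, once, all or appelt;
// the Priority declaration below only matters in appelt mode.
Options: control = appelt

Rule: GazLocation
Priority: 20
(
  {Lookup.majorType == location}
):loc
-->
:loc.Location = {rule = "GazLocation"}
```

Because each phase file has its own header, different phases of the same cascade can use different modes.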
{A}+ Application Modes
[Figure: matches of the pattern {A}+ against the input A A A under each application mode (Appelt, Once, Brill, First, All).]
Example: “China Sea”
Rule: Location1
Priority: 25
(
({Lookup.majorType == loc_key, Lookup.minorType == pre})?
{Lookup.minorType == country}
({Lookup.majorType == loc_key, Lookup.minorType == post})?
) :locName -->
:locName.Location = {kind = "location", rule = "Location1"}
Rule: Location2
Priority: 20
({Lookup.minorType == location}) :location -->
:location.Name = {kind = "location", rule = "GazLocation"}
JAPE Hints and Tricks
● JAPE is quite limited in some respects as to what can be done
– There is no negative operator
– It can be slow if it is badly written, e.g. ({Token})*
– Context is consumed, which can make rule-writing awkward
– Priority can be difficult to set correctly
● But fear not, there is generally a sneaky way around it…..
How to prevent a pattern from matching
Rule: disablePattern
Priority: 1000
(<pattern>)
-->
{}
● Instead of having a negative operator, we can simply add a high-priority rule which does nothing when fired.
● Because this rule is preferred to any lower-priority rule matching the same text (in Appelt mode), the rule which performs the intended action fires only where the high-priority pattern does not apply.
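A concrete sketch of the trick, assuming a gazetteer in which "May" is listed as a first name. The phase and rule names are hypothetical. Both rules match exactly the same span for "May", so in Appelt mode the explicit priority decides and the empty rule wins.

```jape
// Hypothetical phase illustrating a "do nothing" blocking rule.
Phase: FirstNames
Input: Token Lookup
Options: control = appelt

// Blocking rule: same span as FirstName for the token "May",
// higher priority, empty Java RHS, so nothing is created.
Rule: NotAName
Priority: 1000
(
  {Token.string == "May"}
)
-->
{}

// Intended rule: fires for every other first-name Lookup.
Rule: FirstName
Priority: 10
(
  {Lookup.majorType == person_first}
):person
-->
:person.FirstPerson = {rule = "FirstName"}
```

The blocked text is consumed by the empty rule, so FirstName does not get a second chance at it within this phase.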
How to play with input annotations
Input: Person Organisation VerbWork Split
…
Rule: RelationWorkIn
(
  {Person} {VerbWork} {Organisation}
)
-->
{
  … /* create annotation of type "Relation" */ …
}
● Use existing annotations to find relations
● We ignore Tokens to enable more flexibility, i.e. there could be additional words between the annotations specified
● Split ensures we don’t cross sentence boundaries: because Split is in the input but not in the pattern, a Split between two of the annotations prevents a match
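The RHS of RelationWorkIn is elided above. One simple way it might be completed, without any Java, is sketched here; the phase name, the :rel label and the feature values are assumptions, not the original rule body.

```jape
// Hypothetical completion of the RelationWorkIn idea.
Phase: Relations
Input: Person Organisation VerbWork Split
Options: control = appelt

Rule: RelationWorkIn
(
  {Person} {VerbWork} {Organisation}
):rel
-->
// One Relation annotation spanning from the Person to the Organisation.
:rel.Relation = {kind = "worksIn", rule = "RelationWorkIn"}
```

A Java RHS would be needed instead if the Relation should record which Person and which Organisation it links, for example by storing their annotation ids as features.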
How to deal with overlapping annotations
● Because matched annotations are consumed, when two annotations overlap (e.g. in gazetteer lists), the second one will never be matched.
● E.g. for the string “hALCAM” with Lookups hAL, ALCAM, and CAM, ALCAM will never be matched
● Solution is to delete the annotations once matched, and then rerun the same grammar phase over the text
● The process may need to be repeated several times (the number of passes is determined by trial and error)
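A sketch of the delete-and-rerun approach, assuming the overlapping Lookups have a hypothetical majorType of protein and that the result should be a Protein annotation. On the first pass over "hALCAM" the rule consumes hAL and then CAM; once those Lookups are deleted, a second pass can reach ALCAM.

```jape
// Hypothetical phase: majorType "protein" and type "Protein" are assumed.
Phase: ConsumeLookups
Input: Lookup
Options: control = appelt

Rule: ConsumeLookup
(
  {Lookup.majorType == protein}
):match
-->
{
  gate.AnnotationSet match = (gate.AnnotationSet)bindings.get("match");
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("rule", "ConsumeLookup");
  // record the result over the matched span
  outputAS.add(match.firstNode(), match.lastNode(), "Protein", features);
  // delete the matched Lookup so that a later pass can see
  // the overlapping Lookups that were skipped this time
  inputAS.removeAll(match);
}
```

How the repeated passes are arranged in the application is left open here.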
More examples
● In the GATE User Guide under the section “Useful tricks with JAPE”
● Look in the ANNIE grammars and in the foreign language grammars – there are many examples of little tricks
● Check the GATE mailing list archives
Custom Processing Resource for your grammars
1. Java developer extends GATE's default JAPE Transducer, creating a Java class

package com.yourcompany;
import gate.creole.Transducer;
public class CustomTransducer extends Transducer {}

2. JAPE developer adds definition in the plugin’s creole.xml

<RESOURCE>
  <NAME>My custom JAPE Transducer</NAME>
  <CLASS>com.yourcompany.CustomTransducer</CLASS>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="inputASName" RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
  <PARAMETER NAME="outputASName" RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
  <PARAMETER NAME="grammarURL" DEFAULT="myDir/myMain.jape" SUFFIXES="jape">java.net.URL</PARAMETER>
  <PARAMETER NAME="encoding" DEFAULT="UTF-8">java.lang.String</PARAMETER>
</RESOURCE>

3. GATE user opens custom resource in GATE GUI

Right-click on “Processing Resources”
In the pop-up menu select “New >” --> “My custom JAPE Transducer”
JAPE debugger
● Speeds up the development of JAPE grammars
● Integrated in GATE GUI
● Friendly for non-experts
Allows you to:
● Inspect the pattern matching
● Find overridden rules
● Detect complex inter-rule influence
● And many other things
Inspection of pattern matching
Overridden rules
Inter-rule influence (finding problem)
Inter-rule influence (what is that?)
Inter-rule influence (problem synopsis)
Text processed:
… of the J. L. Kellog Graduate School of Management and the Indiana University School of Business …
Conflicting rule:
Rule: NotPersonFull
Priority: 80
// Det + Surname
// This rule was commented course
// J.L. Kellog processed without J.
// 17.06.03
(
  {Token.category == DT} | {Token.category == PRP} | {Token.category == RB}
)
(
  (PREFIX)* (UPPER) (PERSONENDING)?
):foo
Shadowed rule:
Rule: PersonFullExt
Priority: 100
// F.W. Jones Fred Jones
// Andrew "Flip" Filipowski
// Andrew J. "Flip" Filipowski
//({Token.category == DT})?
(
  ((FIRSTNAME | FIRSTNAMEAMBIG))+
  (INITIALS)?
  ((FIRSTNAME | FIRSTNAMEAMBIG))*
  (PREFIX)*
  ((UPPER)):surname
  (PERSONENDING)?
):person
-->
Coming soon….. JAPE4
What JAPE4 IS:
● a new version of the internal language in GATE release 4
● a language based on the original JAPE
● incorporates best practices from JAPE, Jape+ and Japec
● 3-5 times faster than JAPE
What JAPE4 IS NOT:
● an improved version of original Jape, Jape+ or Japec, but rather a new language
● a language backward compatible with JAPE
In most cases it seems to be possible to easily modify original Jape, Jape+ or Japec grammars to be compatible with JAPE4 specification.