Natural Language Processing Techniques for Managing Legal Resources

Managing Legal Resources on the Semantic Web

European University Institute, Fiesole, Italy

September 11, 2009

Adam Wyner

University College London

adam@wyner.info

Main Point

Legal text expressed in natural language can be automatically annotated with semantic mark ups using natural language processing systems such as the General Architecture for Text Engineering (GATE).

Overview

• Motivation and objectives of NLP in this context.

• General Architecture for Text Engineering (GATE).

• Processing and marking up text.

• Another technology for parsing and semantic interpretation (C&C/Boxer).

• Other approaches.

Motivations

• Annotate large legacy corpora.

• Address growth of corpora.

• Reduce number of human annotators and tedious work.

• Make annotation systematic and automatic.

• Annotate fine-grained information:

• Names, locations, addresses, web links, organisations, actions, argument structures, relations between entities....

• Map from well-drafted documents in NL to RDF/OWL.

Motivations

• Top-down vs. Bottom-up approaches:

• Both do initial (and iterative) analysis of the texts in the target corpora.

• Top-down defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application.

• Bottom-up defines the annotation system in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. The annotation system is applied automatically to text and outputs annotated text. Knowledge intensive in development.

• Convergent/complementary/integrated approaches.

• Bottom-up reconstructs and implements linguistic knowledge. However, there are limits....

Objectives of NLP

• NLP – automated processing of natural language.

• Generation – convert information in a database into natural language.

• Understanding – convert natural language into a machine readable form.

• Range of subtasks (focusing on text):

• Segment text (words, phrases, sentences, paragraphs, sections,....).

• Morphological analysis (plural/singular, tense,....).

• Tag each word for part of speech in context (noun, verb, adjective, number,....).

Objectives of NLP

• Range of subtasks:

• Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,....).

• Identify semantic roles (agent, patient,....).

• Entity recognition (organisations, people, places,....).

• Resolve pronominal anaphora and co-reference.

• Address ambiguity.

Objectives of NLP

• NLP useful for:

• Mark up documents in large corpora.

• Automatic mark up to overcome bottleneck.

• Semantic representation for modelling and inference.

• Semantic representation as an “interlanguage” for translation.

• To understand and work with human language capabilities.

Objectives of NLP

• Develop annotations, ontologies, and gold-standard corpora.

• Semantically annotated texts support activities such as:

• Maintenance, presentation, and navigation.

• Information extraction (find patterns -- words or statements -- among documents).

• Translation.

• Query (find all individuals who did a particular action).

• Inference.

Reminder

• Presentations on acquisition of ontologies using NLP.

• Ontology design patterns with natural language “tie-ins”.

• WordNet and FrameNet.

• The analysis cycle:

• Text -> Linguistic Analysis -> Knowledge Extraction -> Structural Content

• Cycle between Linguistic Analysis and Knowledge Extraction to improve the final Structural Content.

• Computational linguistic analysis “layer cake”.

Current State at OPSI, UK

• Office of Public Sector Information, United Kingdom

• Want to develop and leverage public information.

• http://www.opsi.gov.uk/

• The Stationery Office, which have used GATE to develop automated mark up for OPSI, have not (yet) made marked up documents or processes available. Public vs. private development.

• NLP for legislation is not an academic exercise.

• Applications?

The Crown XML Schema for Legislation

Terrorism Act 2000 (1.0)

Terrorism Act 2000 (1.1)

Terrorism Act 2000 (1.2)

Terrorism Act 2000 (2.0)

Terrorism Act 2000 (2.1)

Content in Notices

Not glamorous, but useful.

RuleBurst.

GATE

• General Architecture for Text Engineering (GATE) is an open source framework that supports plug-in NLP components to process a corpus of text. Is “open” open?

• Where to get it?

• http://gate.ac.uk/

• Components and sequences of processes, each process feeding the next in a “pipeline”.

• Annotated text output.

• Example of a case with screen shots.

GATE

References:

• “Building Search Applications: Lucene, LingPipe, and Gate” by Manu Konchady, 2008.

• “Introduction to Linguistic Annotation and Text Analytics” by Graham Wilcock, 2009.

GATE

• Language Resources: lexicons, corpora, ontologies.

• Processing Resources: parsers, generators, taggers.

• Visual Resources: visualisation and editing.

• The resources are plug-ins, so they can be added or removed.

• Document = text + annotations + features

• <Person, gender = “male”>John Smith</Person>

• <Verb, tense = “past”>ran</Verb>
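
As a minimal sketch of the "Document = text + annotations + features" idea, the Python below models an annotation as a character-offset span with a type and a feature map. The class and method names (Annotation, Document, annotate, covered_text) are illustrative assumptions and are not GATE's Java API.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the end of the span
    type: str    # e.g. "Person", "Verb"
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, surface, type, **features):
        """Annotate the first occurrence of `surface` in the text."""
        start = self.text.index(surface)
        ann = Annotation(start, start + len(surface), type, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        """Return the stretch of text that an annotation points at."""
        return self.text[ann.start:ann.end]


doc = Document("John Smith ran")
person = doc.annotate("John Smith", "Person", gender="male")
verb = doc.annotate("ran", "Verb", tense="past")
```

The text itself is never modified: the annotations sit beside it and refer to it only by offsets, which is what later allows several layers of mark up over the same words.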

GATE

• Computational linguistic analysis “layer cake”:

• Sentence segmentation

• Tokenisation (words identified by spaces between them).

• Morphological analysis (singular/plural, tense, nominalisation, ..., range of parts of speech such as noun, verb, adjective, ...).

• Part of speech tagging (noun or verb given other words nearby).

• Shallow syntactic parsing/chunking (noun phrase, verb phrase, subordinate clause, ...).

• Dependency analysis (subordinate clauses, pronominal anaphora,...).

• Pattern matching and rule application.
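
The lower layers of the "layer cake" can be sketched as a pipeline in which each process feeds the next. This is a deliberately naive illustration using regular expressions; the function names and the splitting rules are assumptions and do not reproduce the behaviour of GATE's own components.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tokenise_all(sentences):
    """Apply tokenisation to every sentence produced by the previous stage."""
    return [tokenise(s) for s in sentences]


def pipeline(text, stages):
    """Feed the output of each stage into the next one."""
    data = text
    for stage in stages:
        data = stage(data)
    return data


result = pipeline("Cyndi likes the dog. Bill ran.", [split_sentences, tokenise_all])
```

Further stages (morphology, part of speech tagging, chunking) would be appended to the list of stages in the same way.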

GATE

• Lists:

• List of verbs: like, run, jump, ....

• List of common nouns: dog, cat, hamburger, ....

• List of proper names: Cyndi, Bill, Lisa, ....

• List of determiners: the, a, two, ....

• Rules:

• (Determiner + Common Noun) | Proper Name => Noun Phrase

• Verb + Noun Phrase => Verb Phrase

• Noun Phrase + Verb Phrase => Sentence

• Output:

• [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].
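
The lists and rules above can be sketched in Python as follows. The word lists and the three rules come from the slide; the function names and the crude stripping of a final -s to match "likes" against "like" are assumptions made for illustration only.

```python
VERBS = {"like", "run", "jump"}
COMMON_NOUNS = {"dog", "cat", "hamburger"}
PROPER_NAMES = {"Cyndi", "Bill", "Lisa"}
DETERMINERS = {"the", "a", "two"}


def tag(word):
    """Look a word up in the lists; strip a final -s to find a verb stem."""
    if word in PROPER_NAMES:
        return ("pn", word)
    if word in DETERMINERS:
        return ("det", word)
    if word in COMMON_NOUNS:
        return ("cn", word)
    if word in VERBS or (word.endswith("s") and word[:-1] in VERBS):
        return ("v", word)
    raise ValueError(f"unknown word: {word}")


def parse_np(tags, i):
    """(Determiner + Common Noun) | Proper Name => Noun Phrase."""
    if i < len(tags) and tags[i][0] == "pn":
        return f"[np {tags[i][1]}]", i + 1
    if i + 1 < len(tags) and tags[i][0] == "det" and tags[i + 1][0] == "cn":
        return f"[np [det {tags[i][1]}] [cn {tags[i + 1][1]}]]", i + 2
    raise ValueError(f"no noun phrase at position {i}")


def parse_sentence(sentence):
    """Verb + Noun Phrase => Verb Phrase; Noun Phrase + Verb Phrase => Sentence."""
    tags = [tag(w) for w in sentence.split()]
    subject, i = parse_np(tags, 0)
    if i >= len(tags) or tags[i][0] != "v":
        raise ValueError("expected a verb after the subject")
    obj, j = parse_np(tags, i + 1)
    if j != len(tags):
        raise ValueError("unexpected trailing words")
    return f"[s {subject} [vp [v {tags[i][1]}] {obj}]]"


print(parse_sentence("Cyndi likes the dog"))
```

For the input "Cyndi likes the dog" this prints the same bracketing as the output shown on the slide, without the final full stop.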

GATE Offset

Annotations are: tokens (offsets of text from start space to end space) along with type/features which have a name or value.

GATE Annotations

Partial. Missing namespace and type needed for full definition.

GATE

Construction:

From smaller units, compose larger, derivative units.

Gazetteers:

Lists of words (or abbreviations) that fit an annotation: first names, street locations, organizations....

JAPE (Java Annotation Patterns Engine):

Build other annotations out of previously given/defined annotations. Use this where the mark up is not given by a gazetteer. Rules have a syntax.
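
A gazetteer lookup can be sketched as matching every list entry against the text and recording the offsets and annotation type of each hit. The dictionary layout, the function name lookup, and the sample entries are assumptions for illustration; GATE's gazetteers are configured through list files rather than Python code.

```python
import re

# Hypothetical gazetteer: annotation type -> set of list entries.
GAZETTEER = {
    "FirstName": {"Cyndi", "Bill", "Lisa"},
    "Organisation": {"University College London", "European University Institute"},
}


def lookup(text, gazetteer):
    """Return (start, end, type) for every whole-word match, sorted by offset."""
    found = []
    for ann_type, entries in gazetteer.items():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for m in re.finditer(pattern, text):
                found.append((m.start(), m.end(), ann_type))
    return sorted(found)


text = "Lisa works at University College London."
matches = lookup(text, GAZETTEER)
```

Anything that cannot be enumerated in a list has to be built by rules over these basic annotations, which is the role of JAPE.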

GATE Gazetteers

GATE Organisation Gazetteer

GATE JAPE

JAPE idea (here with mark up, but could be some feature).

<FirstName>aaaa</FirstName><LastName>bbbb</LastName>

=> <WholeName><FirstName>aaaa</FirstName><LastName>bbbb</LastName></WholeName>

FirstName and LastName we get from the Gazetteer. WholeName we construct using the rule.

For complex constructions, must have a range of alternatives.
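
The idea behind the WholeName rule can be sketched in Python over offset-based annotations. This is not JAPE syntax; the function name and the assumption that the two names are separated by exactly one character (such as a space) are illustrative choices.

```python
def build_whole_names(annotations):
    """Add a WholeName over each FirstName directly followed by a LastName.

    `annotations` is a list of (start, end, type) tuples. "Directly followed"
    is taken here to mean separated by exactly one character.
    """
    ordered = sorted(annotations)
    result = list(ordered)
    for first, second in zip(ordered, ordered[1:]):
        if (first[2] == "FirstName" and second[2] == "LastName"
                and second[0] - first[1] == 1):
            # The new annotation spans both of the existing ones.
            result.append((first[0], second[1], "WholeName"))
    return sorted(result)


text = "John Smith ran"
annotations = [(0, 4, "FirstName"), (5, 10, "LastName")]
combined = build_whole_names(annotations)
```

The original FirstName and LastName annotations are kept, and the WholeName annotation is added on top of them, mirroring the nested mark up in the slide.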

GATE Example

Organisations and Quotations. Case references.

GATE XML

Other GATE Components

• Develop an ontology, import it into GATE, then mark up elements manually.

• Use the ontology in writing the JAPE rules.

• Plug in other parsers, create gazetteers, work with other languages....

• Machine learning component.

• Have not discussed mark up for metadata, structure, or presentation (see de Maat, Winkels, and van Engers).

• Work to develop gazetteers and JAPE rules.

GATE – Problems and Issues

• Any difference in the characters of the basic text or in annotations is an absolute difference.

• theatre and theater are different strings for entities, so variants must be listed in Gazetteers.

• Organisation and Organization are different annotations.

• Output in XML is possible, but GATE mark up allows overlapping tags, which are barred in standard XML. Must rework GATE XML with XSLT to make it standard XML.

• Accuracy is not 100% for a variety of reasons, but it can be 80-95%.
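
The XML problem noted above arises when two annotations overlap without one containing the other, since such tags cannot be nested. A minimal sketch of detecting those cases over (start, end, type) spans follows; the function name and the sample offsets are assumptions for illustration.

```python
def crossing_pairs(annotations):
    """Return pairs of (start, end, type) spans that overlap without nesting."""
    pairs = []
    ordered = sorted(annotations)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # b starts strictly inside a but ends after it: the tags would cross.
            if a[0] < b[0] < a[1] < b[1]:
                pairs.append((a, b))
    return pairs


annotations = [
    (0, 5, "Token"),
    (0, 19, "Organisation"),
    (13, 30, "Quotation"),
]
crossing = crossing_pairs(annotations)
```

Here the Token annotation nests inside the Organisation annotation and is harmless, whereas the Organisation and Quotation annotations cross, so they would have to be reworked before being written out as standard XML.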

C&C/Boxer – Motivations and Objectives

• Fine-grained syntactic parsing – can identify not only parts of speech, but grammatical roles (subject, object) and phrases (e.g. verb plus direct object is verb phrase).

• Contributes to NL to RDF/OWL translation – individual entities, data and object properties?

• Input to semantic interpretation in FOL – test for consistency, support inference, allow rule extraction.

C&C/Boxer

• C&C is a parser for Combinatory Categorial Grammar (CCG).

• Boxer provides a semantic interpretation, given the parse. The semantic interpretation is a form of first-order logic (Discourse Representation Theory).

• Needs some manipulation. Parser outputs the “best” parse, but that might not be what one wants; the semantic representation might need to be selected.

• Try it out at:

• http://svn.ask.it.usyd.edu.au/trac/candc

• Various representations – C&C, Graphic, XML Parse, Prolog.

C&C/Boxer

∀x [man'(x) -> happy'(x)]

If Bill is rich and healthy, then he is happy.
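
To illustrate what a first-order representation makes possible, the sketch below checks two formulas against a small hand-built model. The model, the helper names, and the reading of the Bill sentence as (rich'(bill) ∧ healthy'(bill)) -> happy'(bill) are assumptions for illustration and are not Boxer's actual output.

```python
# A tiny model: a domain of individuals and, for each one-place predicate,
# the set of individuals for which it holds.
domain = {"bill", "lisa"}
model = {
    "man": {"bill"},
    "rich": {"bill"},
    "healthy": {"bill"},
    "happy": {"bill", "lisa"},
}


def holds(predicate, individual):
    """True if the predicate holds of the individual in the model."""
    return individual in model[predicate]


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


# ∀x [man'(x) -> happy'(x)]
every_man_is_happy = all(
    implies(holds("man", x), holds("happy", x)) for x in domain
)

# (rich'(bill) ∧ healthy'(bill)) -> happy'(bill)
bill_conditional = implies(
    holds("rich", "bill") and holds("healthy", "bill"),
    holds("happy", "bill"),
)
```

Both formulas are true in this model; removing "bill" from the set for "happy" would make both false, which is the kind of consistency test a logical form supports.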

A More Complex Example

A person commits an offence if he invites another to provide money or other property and intends that it should be used, or has reasonable cause to suspect that it may be used, for the purposes of terrorism.

From UK “Terrorism Act 2000, Interpretation, Terrorist Property” (Partial parse image).

Other Topics

• Controlled Languages

• An expressive subset of grammatical constructions and lexicon.

• Guided input, so that only well-formed, unambiguous expressions are accepted.

• Translation to FOL.

• Machine Learning

• Annotating a set of documents to make a “gold standard”.

• Train the system on the gold standard and unannotated documents.

• Test accuracy and adjust.

• No information on how the algorithm works.
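
The "test accuracy" step can be sketched as comparing system annotations with the gold standard and computing precision, recall, and F1. Exact matching on (start, end, type) is an assumption made here for simplicity, and the sample annotations, including the Location type, are hypothetical.

```python
def score(gold, predicted):
    """Precision, recall and F1 over exact-match (start, end, type) annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f1


gold = [(0, 4, "FirstName"), (5, 10, "LastName"), (14, 39, "Organisation")]
predicted = [(0, 4, "FirstName"), (14, 39, "Organisation"), (40, 45, "Location")]
precision, recall, f1 = score(gold, predicted)
```

In this example two of the three predicted annotations are correct and two of the three gold annotations are found, so precision and recall are both two thirds.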

Conclusion

• Different approaches to mark up.

• Burdens of initial analysis, coding, and labour.

• Top-down is far ahead of bottom-up, but this is a matter of focus of research effort.

• Converging, complementary, integrated approaches.

• Potential to enrich annotations further for finer-grained information.