March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

39
March 2006 March 2006 CLINT-CS CLINT-CS 1 Introduction to Introduction to Computational Computational Linguistics Linguistics Chunk Parsing Chunk Parsing

Transcript of March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

Page 1: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

March 2006March 2006 CLINT-CSCLINT-CS 11

Introduction toIntroduction toComputational Computational

LinguisticsLinguistics

Chunk ParsingChunk Parsing

Page 2: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

22March 2006March 2006 CLINT-CSCLINT-CS

SourcesSources

This presentation is largely based on the This presentation is largely based on the NLTK Chunking Tutorial by Steven Bird, NLTK Chunking Tutorial by Steven Bird, Ewan Klein and Edward Loper version Ewan Klein and Edward Loper version 0.6.3 (2006)0.6.3 (2006)

Page 3: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

33March 2006March 2006 CLINT-CSCLINT-CS

Two Kinds of ParsingTwo Kinds of Parsing

full parsing with formal grammars (HPSG, LFG, TAG, ...) to be compared with robust parsingchunk parsing in the context of tagging (also called partial, shallow, or robust parsing.Chunk parsing is an efficient and robust Chunk parsing is an efficient and robust approach to parsing natural languageapproach to parsing natural languageIt is a popular alternative to full parsing It is a popular alternative to full parsing

Page 4: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

44March 2006March 2006 CLINT-CSCLINT-CS

Example ChunksExample Chunks

[I begin] [with an intuition] :

[when I read] [a sentence] ,

[I read it] [a chunk] [at a time] .

Page 5: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

55March 2006March 2006 CLINT-CSCLINT-CS

Abney (1991): Parsing by ChunksAbney (1991): Parsing by Chunks

These chunks correspond in some way to prosodic patterns. … the strongest stresses in the sentence fall one to a chunk, and pauses are most likely to fall between chunks.The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template.

Page 6: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

66March 2006March 2006 CLINT-CSCLINT-CS

Content and Function WordsContent and Function Words

[I begin] [with an intuition] :

[when I read] [a sentence] ,

[I read it] [a chunk] [at a time] .

Page 7: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

77March 2006March 2006 CLINT-CSCLINT-CS

Inter and Intra Chunk StructureInter and Intra Chunk Structure

Within chunks A simple context-free or finite state grammar is quite

adequate to describe the structure of chunks.

Between chunks within a sentence By contrast, the relationships between chunks are

mediated more by lexical selection than by rigid templates.

The order in which chunks occur is much more flexible than the order of words within chunks.

Page 8: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

88March 2006March 2006 CLINT-CSCLINT-CS

SummarySummary

Chunks are non-overlapping regions of Chunks are non-overlapping regions of texttext

Usually consisting of a head word (such as Usually consisting of a head word (such as a noun) and the adjacent modifiers and a noun) and the adjacent modifiers and function words (such as adjectives and function words (such as adjectives and determiners).determiners).

Page 9: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

99March 2006March 2006 CLINT-CSCLINT-CS

Chunking and TaggingChunking and Tagging

ChunkingChunking

Segmentation – identifying tokensSegmentation – identifying tokens

Labelling – identifying the correct tagLabelling – identifying the correct tag

Chunk ParsingChunk Parsing

Segmentation – identifying strings of Segmentation – identifying strings of tokenstokens

Labelling- identifying the correct chunk Labelling- identifying the correct chunk typetype

Page 10: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1010March 2006March 2006 CLINT-CSCLINT-CS

Segmentation and LabellingSegmentation and Labelling

Page 11: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1111March 2006March 2006 CLINT-CSCLINT-CS

Chunking vs Full ParsingChunking vs Full Parsing

Both can be used to build hierarchical Both can be used to build hierarchical structuresstructuresChunking involves hierarchies of limited Chunking involves hierarchies of limited depth (usually 2)depth (usually 2)Complexity of full parsing typically O(nComplexity of full parsing typically O(n33), ), versus O(n) for chunkingversus O(n) for chunkingChunking leaves gaps between chunksChunking leaves gaps between chunksChunking can often give imperfectChunking can often give imperfectresults: results: I turned off the spectrorouteI turned off the spectroroute

Page 12: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1212March 2006March 2006 CLINT-CSCLINT-CS

Representing ChunksRepresenting ChunksIOB Tags vs TreesIOB Tags vs Trees

Each token is tagged with one of three Each token is tagged with one of three special chunk tags, INSIDE, OUTSIDE, or special chunk tags, INSIDE, OUTSIDE, or BEGIN.BEGIN.A token is tagged as BEGIN if it is at the A token is tagged as BEGIN if it is at the beginning of a chunk, and contained within beginning of a chunk, and contained within that chunk.that chunk.Subsequent tokens within the chunk are Subsequent tokens within the chunk are tagged INSIDE. All other tokens are tagged INSIDE. All other tokens are tagged OUTSIDE.tagged OUTSIDE.

Page 13: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1313March 2006March 2006 CLINT-CSCLINT-CS

Example IOB TagsExample IOB Tags

Page 14: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1414March 2006March 2006 CLINT-CSCLINT-CS

Example TreeExample Tree

Page 15: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1515March 2006March 2006 CLINT-CSCLINT-CS

Chunk ParsingChunk Parsing

A chunk parser finds contiguous, non-A chunk parser finds contiguous, non-overlapping spans of related tokens and overlapping spans of related tokens and groups them together into chunks.groups them together into chunks.It then combines these individual chunks It then combines these individual chunks together, along with the intervening together, along with the intervening tokens, to form a tokens, to form a chunk structurechunk structure..A A chunk structurechunk structure is a two-level tree that is a two-level tree that spans the entire text, and contains both spans the entire text, and contains both chunks and un-chunked tokens. chunks and un-chunked tokens.

Page 16: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1616March 2006March 2006 CLINT-CSCLINT-CS

Example Chunk StructureExample Chunk Structure

(S: (NP: 'I')(S: (NP: 'I')

'saw''saw'

(NP: 'the' 'big' 'dog')(NP: 'the' 'big' 'dog')

'on''on'

(NP: 'the' 'hill'))(NP: 'the' 'hill'))

Page 17: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1717March 2006March 2006 CLINT-CSCLINT-CS

Chunking with Regular ExpressionsChunking with Regular Expressions

NLTK-Lite provides a regular expression chunk parser, parse.RegexpChunk to define the kinds of chunk we are interested in, and then to chunk a tagged text.The chunk parser begins with a structure in which no tokens are chunkedEach regular-expression pattern (or chunk rule) is applied in turn, successively updating the chunk structure.Once all of the rules have been applied, the resulting chunk structure is returned.

Page 18: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1818March 2006March 2006 CLINT-CSCLINT-CS

Tag StringsTag Strings

A tag string is a string consisting of tags delimited with angle-brackets, e.g., <DT><JJ><NN><VBD><DT><NN>

tag strings do not contain any whitespace.

Page 19: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

1919March 2006March 2006 CLINT-CSCLINT-CS

Chunking with Regular ExpressionsChunking with Regular Expressions

The NLTK chunk parser operates using The NLTK chunk parser operates using chunk definitions that are supplied in the chunk definitions that are supplied in the form of form of tag patternstag patterns which are regular which are regular expressions over tags, e.g.expressions over tags, e.g.

<DT><JJ>?<NN><DT><JJ>?<NN>

NB: RE operators come NB: RE operators come afterafter the tag. the tag.

? means an optional element? means an optional element

Page 20: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2020March 2006March 2006 CLINT-CSCLINT-CS

Tag PatternsTag Patterns

Angle brackets group, so <NN>+ matches one or more repetitions of the tag string <NN>; and <NN|JJ> matches the tag strings <NN> or <JJ>.The operator * within angle brackets is a wildcard, so that <NN.*> matches any single tag starting with NN.<NN>* matches zero or more repetitions of <NN>

Page 21: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2121March 2006March 2006 CLINT-CSCLINT-CS

Another ExampleAnother Example

<DT>?<JJ.*>*<NN.*><DT>?<JJ.*>*<NN.*>

In this case the DT is optionalIn this case the DT is optional

It is followed by zero or more instances of It is followed by zero or more instances of <JJ.*><JJ.*>

<JJ.*> matches <JJ.*> matches any typeany type of adjective of adjective

The adjectives are followed by <NN.*>, i.e. The adjectives are followed by <NN.*>, i.e. any typeany type of NN. of NN.

Page 22: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2222March 2006March 2006 CLINT-CSCLINT-CS

Creating a Chunk Parser in NLTKCreating a Chunk Parser in NLTKFirst create one or more rulesrule1 =parse.ChunkRule('<DT|NN>+', 'Chunk sequences of DT and NN')

Then create the parserchunkparser =

parse.RegexpChunk([rule1], chunk_node='NP', top_node='S')

Note that RegexpChunk has optional second and third arguments that specify the node labels for chunks and for the top-level node, respectively.

Page 23: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2323March 2006March 2006 CLINT-CSCLINT-CS

First Match Takes PrecedenceFirst Match Takes Precedence

If a tag pattern matches at multiple overlapping locations, the rst match takes precedence.

For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then the first two nouns will be chunked

Page 24: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2424March 2006March 2006 CLINT-CSCLINT-CS

ExampleExample

>>> from nltk_lite import tag

>>> text = "dog/NN cat/NN mouse/NN"

>>> nouns = tag.string2tags(text)

>>> rule = parse.ChunkRule('<NN><NN>', 'Chunk two consecutive nouns')

>>> parser = parse.RegexpChunk([rule], chunk_node='NP', top_node='S')

>>> parser.parse(nouns)

(S: (NP: ('dog', 'NN') ('cat', 'NN')) ('mouse', 'NN'))

Page 25: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2525March 2006March 2006 CLINT-CSCLINT-CS

Two Rule ExampleTwo Rule Example

>>> sent = tag.string2tags("the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN")

>>> rule1 = parse.ChunkRule('<DT><JJ><NN>', 'Chunk det+adj+noun')

>>> rule2 = parse.ChunkRule('<DT|NN>+', 'Chunk sequences of NN and DT')

>>> chunkparser = parse.RegexpChunk([rule1, rule2], chunk_node='NP', top_node='S')

>>> chunk_tree = chunkparser.parse(sent, trace=1)

Page 26: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2626March 2006March 2006 CLINT-CSCLINT-CS

TraceTrace

Input:

<DT> <JJ> <NN> <VBD> <IN> <DT> <NN>

Chunk det+adj+noun:

{<DT> <JJ> <NN>} <VBD> <IN> <DT> <NN>

Chunk sequences of NN and DT:

{<DT> <JJ> <NN>} <VBD> <IN> {<DT> <NN>}

Page 27: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2727March 2006March 2006 CLINT-CSCLINT-CS

Rule InteractionRule Interaction

When a ChunkRule is applied to a chunking hypothesis, it will only create chunks that do not partially overlap with chunks already in the hypothesis.

Thus, if we apply these two rules in reverse order, we will get a different result:

Page 28: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2828March 2006March 2006 CLINT-CSCLINT-CS

Reverse Order of RulesReverse Order of Rules

>>> chunkparser = parse.RegexpChunk([rule2, rule1], chunk_node='NP', top_node='S')

Input:

<DT> <JJ> <NN> <VBD> <IN> <DT> <NN>

Chunk sequences of NN and DT:

{<DT>} <JJ> {<NN>} <VBD> <IN> {<DT> <NN>}

Chunk det+adj+noun:

{<DT>} <JJ> {<NN>} <VBD> <IN> {<DT> <NN>}

Page 29: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

2929March 2006March 2006 CLINT-CSCLINT-CS

Chinking RulesChinking Rules

Sometimes it is easier to define what we don't want to include in a chunk than we do want to include.Chinking is the process of removing a sequence of tokens from a chunk. If the sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before.If the sequence is at the beginning or end of the chunk, these tokens are removed, and a smaller chunk remains.

Page 30: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3030March 2006March 2006 CLINT-CSCLINT-CS

Creating Chink RulesCreating Chink Rules

ChinkRules are created with the ChinkRule constructor, >>> chink_rule = parse.ChinkRule('<VBD|IN>+', 'Chink sequences of VBD and IN')To show how it works, we first definechunkall_rule = parse.ChunkRule('<.*>+', 'Chunk everything')and then put the two together

Page 31: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3131March 2006March 2006 CLINT-CSCLINT-CS

Running Chink RulesRunning Chink Rules

>>> chunkparser = parse.RegexpChunk([chunkall_rule, chink_rule], chunk_node='NP', top_node='S')

Input:

<DT> <JJ> <NN> <VBD> <IN> <DT> <NN>

Chunk everything:{<DT> <JJ> <NN> <VBD> <IN> <DT> <NN>}

Chink sequences of VBD and IN:{<DT> <JJ> <NN>} <VBD> <IN> {<DT> <NN>}

Page 32: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3232March 2006March 2006 CLINT-CSCLINT-CS

The Unchunk RuleThe Unchunk Rule

Unchunk rules are very similar to Unchunk rules are very similar to ChinkRule except that it will only remove a ChinkRule except that it will only remove a chunk if the pattern matches an entire chunk if the pattern matches an entire chunk.chunk.

>>> unchunk_rule = parse.UnChunkRule('<NN|DT>+', 'Unchunk sequences of NN and DT')

Page 33: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3333March 2006March 2006 CLINT-CSCLINT-CS

Chunkparser with Unchunk RuleChunkparser with Unchunk Rule

>>> chunk_rule = parse.ChunkRule('<NN|DT|JJ>+','Chunk sequences of NN, JJ, and DT')

>>> chunkparser = parse.RegexpChunk([chunk_rule, unchunk_rule], chunk_node='NP', top_node='S')

Page 34: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3434March 2006March 2006 CLINT-CSCLINT-CS

Unchunk Parser TraceUnchunk Parser Trace

Input:<DT> <JJ> <NN> <VBD> <IN> <DT> <NN>Chunk sequences of NN, JJ, and DT:{<DT> <JJ> <NN>} <VBD> <IN> {<DT> <NN>}Unchunk sequences of NN and DT:{<DT> <JJ> <NN>} <VBD> <IN> <DT> <NN>

Page 35: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3535March 2006March 2006 CLINT-CSCLINT-CS

Merge RulesMerge Rules

MergeRules are used to merge two contiguous chunks.Each MergeRule is parameterized by two tag patterns: a left pattern and a right pattern.A MergeRule will merge two contiguous chunks C1 and C2 if the end of C1 matches the left pattern, and the beginning of C2 matches the right pattern.

Page 36: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3636March 2006March 2006 CLINT-CSCLINT-CS

Split RulesSplit Rules

SplitRules are used to split a single chunk into two smaller chunks.Each SplitRule is parameterized by two tag patterns: a left pattern and a right pattern.A SplitRule will split a chunk at any point p, where the left pattern matches the chunk to the left of p, and the right pattern matches the chunk to the right of p.

Page 37: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3737March 2006March 2006 CLINT-CSCLINT-CS

Evaluating Chunk ParsersEvaluating Chunk Parsers

Essentially, evaluation is about the Essentially, evaluation is about the comparing the behaviour of a chunk comparing the behaviour of a chunk parser against a standard.parser against a standard.Typically this involves the following phasesTypically this involves the following phases Save already chunked textSave already chunked text Unchunk itUnchunk it Chunk it using chunk parserChunk it using chunk parser Compare the result with the original chunked Compare the result with the original chunked

texttext

Page 38: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3838March 2006March 2006 CLINT-CSCLINT-CS

Evaluation MetricsEvaluation Metrics

Metrics are typically based on the following sets: guessed: The set of chunks returned by the

chunk parser. correct: The correct set of chunks, as defined

in the test corpus.

From these we can define five useful measures

Page 39: March 2006 CLINT-CS 1 Introduction to Computational Linguistics Chunk Parsing.

3939March 2006March 2006 CLINT-CSCLINT-CS

Evaluation MetricsEvaluation Metrics

Precision: what percentage of guessed chunks Precision: what percentage of guessed chunks were correct?were correct?Recall: what percentage of guessed chunks Recall: what percentage of guessed chunks were guessedwere guessedF-Measure: harmonic mean of Precision and F-Measure: harmonic mean of Precision and recallrecallMissed Chunks: what correct chunks were not Missed Chunks: what correct chunks were not guessedguessedIncorrect Chunks: what guessed chunks were Incorrect Chunks: what guessed chunks were not correctnot correct