An Introduction to Edison Vivek Srikumar 17 th April 2012.

An Introduction to Edison

Vivek Srikumar, 17th April 2012

Curator gives us easy access to several layers of annotation over text

What can we do with these?

Outline

• What is Edison?

• Installing Edison

• Using Edison
  – Creating Edison objects
  – Accessing the Curator
  – Adding and using views

What is Edison?

1. A uniform representation of diverse NLP annotations

2. A library of NLP data structures

3. A Java client to the Curator

NLP Annotations

Sentence: John Smith bought the car.

Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car, . .

Named Entities: PER John Smith

Shallow parse: [NP John Smith] [VP bought] [NP the car]

Semantic roles
  – Predicate: buy
  – A0: John Smith
  – A1: the car

Parse tree: (S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))

And many others…

A uniform representation

• Main ideas
  – All the annotations over text are graphs
  – Nodes: Labeled spans of text
    • Spans indexed by tokens in the text
  – Edges: Relations between the nodes

• Edison terminology
  – TextAnnotation: A container of tokens and views
  – View: A graph that denotes a specific annotation
  – Constituent: A labeled span of text (nodes)
  – Relation: A labeled directed edge between Constituents
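The terminology above can be pictured with a small stand-alone sketch. The classes below (`SketchTextAnnotation`, `SketchView`, `SketchConstituent`, `SketchRelation`) are hypothetical stand-ins written for illustration; they are not Edison's classes, and their fields and method names are assumptions. The sketch only shows the shape of the idea: tokens live in one container, and each named view is a graph of labeled spans.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT Edison's actual classes.
class DataModelSketch {
    // Build a container for the example sentence with one named-entity view.
    static SketchTextAnnotation build() {
        SketchTextAnnotation ta =
            new SketchTextAnnotation("John", "Smith", "bought", "the", "car", ".");
        SketchView ner = new SketchView("NER");
        ner.constituents.add(new SketchConstituent("PER", 0, 2));
        ta.addView(ner);
        return ta;
    }

    public static void main(String[] args) {
        SketchTextAnnotation ta = build();
        for (SketchConstituent c : ta.views.get("NER").constituents) {
            System.out.println(c.label + "\t" + ta.textOf(c));
        }
    }
}

// A node: a label attached to a span of tokens.
class SketchConstituent {
    final String label;
    final int start; // index of the first token
    final int end;   // index one past the last token

    SketchConstituent(String label, int start, int end) {
        this.label = label;
        this.start = start;
        this.end = end;
    }
}

// An edge: a labeled, directed link between two constituents.
class SketchRelation {
    final String name;
    final SketchConstituent source;
    final SketchConstituent target;

    SketchRelation(String name, SketchConstituent source, SketchConstituent target) {
        this.name = name;
        this.source = source;
        this.target = target;
    }
}

// A view: one annotation layer, stored as a graph.
class SketchView {
    final String name;
    final List<SketchConstituent> constituents = new ArrayList<>();
    final List<SketchRelation> relations = new ArrayList<>();

    SketchView(String name) {
        this.name = name;
    }
}

// The container: tokens plus views indexed by name.
class SketchTextAnnotation {
    final String[] tokens;
    final Map<String, SketchView> views = new LinkedHashMap<>();

    SketchTextAnnotation(String... tokens) {
        this.tokens = tokens;
    }

    void addView(SketchView view) {
        views.put(view.name, view);
    }

    // Join the tokens covered by a constituent's span.
    String textOf(SketchConstituent c) {
        return String.join(" ", Arrays.copyOfRange(tokens, c.start, c.end));
    }
}
```

A named-entity layer needs no relations, so its relation list stays empty; a parse tree or semantic-role layer would fill it in.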

A uniform representation

TextAnnotation
Raw text: John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}

Views

Name: SENTENCE Constituents: {…} Relations: {…}

Name: POS Constituents: {…} Relations: {…}

Name: PARSE_CHARNIAK Constituents: {…} Relations: {…}

and other views….

Getting started with Edison

• Download the jar from http://cogcomp.cs.illinois.edu/page/software_view/Edison
  – Click the download link and follow instructions
  – Add the edison jar and its dependencies to your class path

• Dependencies
  – Cogcomp core utilities
  – Apache commons libraries
  – Thrift (to communicate with the Curator)
  – Porter stemmer
  – LBJ Library
  – Java WordNet interface

• Javadoc available under “User Guide”

Edison using Maven

• Add the following repository definition to your pom.xml file

    <repositories>
        <repository>
            <id>CogcompSoftware</id>
            <name>CogcompSoftware</name>
            <url>http://cogcomp.cs.illinois.edu/m2repo/</url>
        </repository>
    </repositories>

• Add Edison as a dependency

    <dependency>
        <groupId>edu.illinois.cs.cogcomp</groupId>
        <artifactId>edison</artifactId>
        <version>0.2.9</version>
        <type>jar</type>
        <scope>compile</scope>
    </dependency>

So far…

1. What is Edison?
2. Installing Edison
3. Creating a TextAnnotation
4. Adding views from the Curator
5. Using views
6. …??
7. Profit!

A uniform representation

TextAnnotation
Raw text: John Smith bought the car.
Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}

Views

Name: SENTENCE Constituents: {…} Relations: {…}

Name: POS Constituents: {…} Relations: {…}

Name: PARSE_CHARNIAK Constituents: {…} Relations: {…}

and other views….

Three ways to create TextAnnotations

1. When you don’t know the tokenization
  – Use this for raw text, if you don’t want to use the Curator

2. When you know the tokenization
  – Use this for pre-tokenized text

3. Using the Curator
  – Use this for raw text
  – If your text is pre-tokenized, you can still use the Curator for adding views

Creating TextAnnotations (1)

• When to use this approach
  – If you don’t know the tokenization (i.e. the words)
  – If you want to use the LBJ tokenizer and sentence splitter

• Note: Every TextAnnotation has a textId and a corpusId. These could be used in the future for book-keeping.

Creating TextAnnotations (1)

    // Uses java.util.List in addition to the Edison classes.
    String corpus = "2001_ODYSSEY";
    String textId = "001";

    String text1 = "Good afternoon, gentlemen. I am a HAL-9000 computer.";

    TextAnnotation ta1 = new TextAnnotation(corpus, textId, text1);

    System.out.println(ta1.getText());
    System.out.println(ta1.getTokenizedText());

    // Print the sentences. The `Sentence` class has the same
    // methods as a `TextAnnotation`.
    List<Sentence> sentences = ta1.sentences();

    System.out.println(sentences.size() + " sentences found.");

    for (int i = 0; i < sentences.size(); i++) {
        Sentence sentence = sentences.get(i);
        System.out.println(sentence);
    }

Creating TextAnnotations (2)

• When to use this approach
  – When you know the tokenization
    • That is, when some external source specifies the tokens of the text

• After creating it, it can be used as before

Creating TextAnnotations (2)

    // Uses java.util.Arrays and java.util.List in addition to the Edison classes.
    String corpus = "2001_ODYSSEY";
    String textId = "002";

    // One string per sentence, with tokens separated by spaces.
    List<String> tokenizedSentences = Arrays.asList(
            "Good afternoon , gentlemen .",
            "I am a HAL-9000 computer .");

    TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

    System.out.println(ta2.getText());
    System.out.println(ta2.getTokenizedText());

    // Print the sentences. The `Sentence` class has the same
    // methods as a `TextAnnotation`.
    List<Sentence> sentences = ta2.sentences();

    System.out.println(sentences.size() + " sentences found.");

    for (int i = 0; i < sentences.size(); i++) {
        Sentence sentence = sentences.get(i);
        System.out.println(sentence);
    }

Connecting to the Curator (1)

If you don’t know anything about your text, the Curator can tokenize your text for you.

    String text = "Good afternoon, gentlemen. I am a HAL-9000 "
            + "computer. I was born in Urbana, Il. in 1992";

    String corpus = "2001_ODYSSEY";
    String textId = "001";

    // Create a curator client. We need to specify a host and a port
    // where the curator server is running.
    String curatorHost = "my-curator-server.cs.uiuc.edu";
    int curatorPort = 9090;

    CuratorClient client = new CuratorClient(curatorHost, curatorPort);

    // Should the curator's cache be forcibly updated?
    boolean forceUpdate = false;

    // Create a TextAnnotation. Get the text annotation object from the
    // curator, which splits the sentences and tokenizes the text.
    TextAnnotation ta = client.getTextAnnotation(corpus, textId, text, forceUpdate);

Connecting to the Curator (2)

If you know the tokenization and want all the Curator’s annotators to respect this tokenization:

    String corpus = "2001_ODYSSEY";
    String textId = "002";

    List<String> tokenizedSentences = Arrays.asList(
            "Good afternoon , gentlemen .",
            "I am a HAL-9000 computer .");

    // Create your TextAnnotation as before.
    TextAnnotation ta2 = new TextAnnotation(corpus, textId, tokenizedSentences);

    // We need to specify a host and a port where the curator server is
    // running.
    String curatorHost = "my-curator-server.cs.uiuc.edu";
    int curatorPort = 9090;

    // The third argument (true) says the Curator should respect the
    // tokenization.
    CuratorClient client = new CuratorClient(curatorHost, curatorPort, true);

Note: A CuratorClient in this mode cannot create TextAnnotations. Doing so will trigger an exception!

So far…

1. What is Edison?
2. Installing Edison
3. Creating a TextAnnotation
4. Adding views from the Curator
5. Using views
6. …??
7. Profit!

Views

• Views are graphs: Constituents are nodes and Relations are edges

• Every TextAnnotation can be seen as a container for views, indexed by their name

• View is a Java class that represents any graph over constituents
  – Specializations of the View class to deal with specific types
    • TokenLabelView, SpanLabelView, TreeView, PredicateArgumentView, CoreferenceView
  – You can create your own views or specializations too!

Example: Part-of-speech

Sentence: John Smith bought the car.

Part-of-speech: NNP John, NNP Smith, VBD bought, DT the, NN car, . .

Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}

Constituents (span: label)
  – 0-1: NNP
  – 1-2: NNP
  – 2-3: VBD
  – 3-4: DT
  – 4-5: NN
  – 5-6: .

No Relations!

Each constituent is associated with a span. The convention is to denote a span by the index of its first token and the index of its (last + 1)th token, so the span 0-2 covers tokens 0 and 1.

This specialization of the View class is called a TokenLabelView, where each constituent assigns a label to a token and there are no relations. Use for part-of-speech, stem/lemma, etc.
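To make the span convention concrete, here is a small sketch in plain Java. It does not use Edison, and `SpanConventionSketch` and its methods are hypothetical names chosen for this example. A span is given as a start index and an end index one past the last token, so its length is `end - start` and adjacent spans share a boundary without overlapping.

```java
import java.util.Arrays;

// Plain-Java illustration of the (first, last + 1) span convention.
// Hypothetical helper; not part of Edison.
class SpanConventionSketch {
    static final String[] TOKENS = {"John", "Smith", "bought", "the", "car", "."};

    // Text covered by the span start-end: tokens start .. end - 1.
    static String textOf(int start, int end) {
        return String.join(" ", Arrays.copyOfRange(TOKENS, start, end));
    }

    // Number of tokens in the span.
    static int length(int start, int end) {
        return end - start;
    }

    public static void main(String[] args) {
        System.out.println(textOf(0, 1)); // single-token span for "John"
        System.out.println(textOf(0, 2)); // two-token span for "John Smith"
        System.out.println(length(3, 5)); // "the car" has two tokens
    }
}
```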

Adding part-of-speech from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the part-of-speech annotation?
    boolean forceUpdate = false;

    // Curator call: add the part-of-speech view from the Curator.
    client.addPOSView(ta, forceUpdate);

    // Get the part-of-speech view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.POS'. Also, we know that
    // this view will be a TokenLabelView.
    TokenLabelView posView = (TokenLabelView) ta.getView(ViewNames.POS);

    // Iterate through the text and get the POS label for each token.
    for (int tokenId = 0; tokenId < ta.size(); tokenId++) {
        String token = ta.getToken(tokenId);

        // getLabel(tokenId) is available for TokenLabelViews.
        String posLabel = posView.getLabel(tokenId);

        System.out.println(token + "\t" + posLabel);
    }

Example: Shallow parse

Sentence: John Smith bought the car.

Shallow parse: [NP John Smith] [VP bought] [NP the car]

Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}

Constituents (span: label)
  – 0-2: NP
  – 2-3: VP
  – 3-5: NP

No Relations!

Each constituent is associated with a span. The convention is to denote a span by the index of its first token and the index of its (last + 1)th token, so the span 0-2 covers tokens 0 and 1.

This specialization of the View class is called a SpanLabelView, where each constituent assigns a label to a span of text and there are no relations. Use for named entities, shallow parse, Wikifier, etc.

Adding shallow parse from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the shallow parse annotation?
    boolean forceUpdate = false;

    // Curator call: add the shallow parse/chunk view from the Curator.
    client.addChunkView(ta, forceUpdate);

    // Get the shallow parse view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.SHALLOW_PARSE'. Also, we know that
    // this view will be a SpanLabelView.
    SpanLabelView chunkView = (SpanLabelView) ta.getView(ViewNames.SHALLOW_PARSE);

    // Get all constituents whose span is contained in the span (0, 2).
    // getSpanLabels is available for SpanLabelView.
    List<Constituent> constituents = chunkView.getSpanLabels(0, 2);

    // Iterate over them and print their labels.
    for (Constituent c : constituents) {
        String label = c.getLabel();
        System.out.println(label);
    }

Other SpanLabel views in the Curator

• Shallow parse
  – ViewNames.SHALLOW_PARSE
  – Use ‘client.addChunkView(ta, forceUpdate)’

• Named entities
  – ViewNames.NER
  – Use ‘client.addNamedEntityView(ta, forceUpdate)’

• Wikifier
  – ViewNames.WIKIFIER
  – Use ‘client.addWikifierView(ta, forceUpdate)’

Note: For these function calls to work, the corresponding annotator should exist in your instance of the Curator. Otherwise, an exception will be triggered.

Example: Parse view

Sentence: John Smith bought the car.

Tokens: {0:John, 1:Smith, 2:bought, 3:the, 4:car, 5:.}

Parse tree: (S (NP (NNP John) (NNP Smith)) (VP (VBD bought) (NP (DT the) (NN car))))

This specialization of the View class is called a TreeView, where the graph represents a tree. Use for full parse and dependency trees.

Constituents (span: label)
  – 0-5: S
  – 0-2: NP
  – 2-5: VP
  – 0-1: NNP

Relations
  – S (0-5) ParentOf NP (0-2)
  – S (0-5) ParentOf VP (2-5)
  – NP (0-2) ParentOf NNP (0-1)

Rest of the tree not shown.
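One way to picture how a tree becomes a graph of constituents and relations is the sketch below. It is plain Java with hypothetical names (`TreeAsGraphSketch`, `TreeNodeSketch`), not Edison's `TreeView`. It assumes, as the slide suggests, that each tree edge is stored as a `ParentOf` relation from parent to child, and it computes each span from the token indices of the example sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Edison's TreeView.
class TreeAsGraphSketch {
    // Parse tree of "John Smith bought the car" from the example.
    static TreeNodeSketch exampleTree() {
        return new TreeNodeSketch("S", 0, 5,
            new TreeNodeSketch("NP", 0, 2,
                new TreeNodeSketch("NNP", 0, 1),
                new TreeNodeSketch("NNP", 1, 2)),
            new TreeNodeSketch("VP", 2, 5,
                new TreeNodeSketch("VBD", 2, 3),
                new TreeNodeSketch("NP", 3, 5,
                    new TreeNodeSketch("DT", 3, 4),
                    new TreeNodeSketch("NN", 4, 5))));
    }

    // Collect one "ParentOf" edge per parent-child pair, in pre-order.
    static void collectEdges(TreeNodeSketch node, List<String> edges) {
        for (TreeNodeSketch child : node.children) {
            edges.add(node + " ParentOf " + child);
            collectEdges(child, edges);
        }
    }

    static List<String> edges() {
        List<String> edges = new ArrayList<>();
        collectEdges(exampleTree(), edges);
        return edges;
    }

    public static void main(String[] args) {
        for (String edge : edges()) {
            System.out.println(edge);
        }
    }
}

// A labeled span with its children in the tree.
class TreeNodeSketch {
    final String label;
    final int start; // index of the first token
    final int end;   // index one past the last token
    final List<TreeNodeSketch> children = new ArrayList<>();

    TreeNodeSketch(String label, int start, int end, TreeNodeSketch... kids) {
        this.label = label;
        this.start = start;
        this.end = end;
        for (TreeNodeSketch kid : kids) {
            children.add(kid);
        }
    }

    @Override
    public String toString() {
        return label + "(" + start + "-" + end + ")";
    }
}
```

A tree with nine nodes yields eight `ParentOf` edges, one for every node except the root.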

Adding Charniak parse from the Curator

    // Suppose we have a CuratorClient called 'client' and a TextAnnotation
    // called 'ta'.

    // Should the Curator forcibly update the parse annotation?
    boolean forceUpdate = false;

    // Curator call: add the Charniak parse view from the Curator.
    client.addCharniakParse(ta, forceUpdate);

    // Get the Charniak parse view from the TextAnnotation. This view will
    // be filed under the name 'ViewNames.PARSE_CHARNIAK'. Also, we know
    // that this view will be a TreeView.
    TreeView parseView = (TreeView) ta.getView(ViewNames.PARSE_CHARNIAK);

    // Do interesting things with the view.

    // Get all parse nodes.
    List<Constituent> treeNodes = parseView.getConstituents();

    // Get the tree structure for the first sentence (i.e. sentence #0).
    Tree<String> parseTree = parseView.getTree(0);

    // Get the path between parse tree nodes (a common feature).
    String parsePath = PathFeatureHelper.getFullParsePathString(
            treeNodes.get(0), treeNodes.get(1), 400);

Tree views from the curator

• Charniak parser
  – ViewNames.PARSE_CHARNIAK
  – client.addCharniakParse(ta, forceUpdate)

• Easy-first dependency parser
  – ViewNames.DEPENDENCY
  – client.addEasyFirstDependencyView(ta, forceUpdate)

• Stanford parser
  – ViewNames.PARSE_STANFORD
  – client.addStanfordParse(ta, forceUpdate)

• Stanford dependency parser
  – ViewNames.DEPENDENCY_STANFORD
  – client.addStanfordDependencyView(ta, forceUpdate)

Other Curator calls

• Verb semantic roles
  – View name: ViewNames.SRL
  – client.addSRLView(ta, forceUpdate)
    • Adds a view of type PredicateArgumentView, which is a subclass of the View class

• Nominal semantic roles
  – View name: ViewNames.NOM
  – client.addNOMView(ta, forceUpdate)
    • Adds a view of type PredicateArgumentView

• Coreference
  – View name: ViewNames.COREF
  – client.addCorefView(ta, forceUpdate)
    • Adds a view of type CoreferenceView, which is a subclass of the View class

So far…

1. What is Edison?
2. Installing Edison
3. Creating a TextAnnotation
4. Adding views from the Curator
5. Using views
6. …??
7. Profit!

Using views

• All views provide access to
  – Constituents: getConstituents, getConstituentsCoveringToken, getConstituentsCoveringSpan
  – Relations: getRelations

• Allows us to manipulate several different views
  – E.g.: Get the parse tree nodes that contain the named entity constituent whose label is “PER”:

    for (Constituent c : namedEntityView.getConstituents()) {
        if (c.getLabel().equals("PER")) {
            List<Constituent> parseConstituents = parseView.getConstituentsCovering(c);
            // do something with these
        }
    }
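The idea behind a covering query can be sketched without Edison: one constituent covers another when its span starts at or before, and ends at or after, the other's span. The names below (`CoveringSketch`, `SpanSketch`, `covering`) are hypothetical, and Edison's own `getConstituentsCovering` may differ in detail, for example in the order of the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a cross-view covering query; not Edison code.
class CoveringSketch {
    // Return the constituents of a view whose span covers the target's span.
    static List<SpanSketch> covering(List<SpanSketch> view, SpanSketch target) {
        List<SpanSketch> result = new ArrayList<>();
        for (SpanSketch c : view) {
            if (c.covers(target)) {
                result.add(c);
            }
        }
        return result;
    }

    // Labels of the parse nodes that cover the PER entity "John Smith".
    static List<String> example() {
        List<SpanSketch> parseView = Arrays.asList(
            new SpanSketch("S", 0, 5),
            new SpanSketch("NP", 0, 2),
            new SpanSketch("NNP", 0, 1),
            new SpanSketch("NNP", 1, 2),
            new SpanSketch("VP", 2, 5));
        SpanSketch person = new SpanSketch("PER", 0, 2);

        List<String> labels = new ArrayList<>();
        for (SpanSketch c : covering(parseView, person)) {
            labels.add(c.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}

// A labeled span, using the (first token, last token + 1) convention.
class SpanSketch {
    final String label;
    final int start;
    final int end;

    SpanSketch(String label, int start, int end) {
        this.label = label;
        this.start = start;
        this.end = end;
    }

    boolean covers(SpanSketch other) {
        return start <= other.start && other.end <= end;
    }
}
```

Because spans from every view index the same token sequence, the comparison works across views without any alignment step.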

Using constituents and relations

• Each constituent belongs to a view

• Constituents provide the following methods:
  – getLabel(): gets the label of the constituent
  – getSpan(): gets the span of the constituent
  – getIncomingRelations(): gets the list of Relations whose target is this constituent in this view
  – getOutgoingRelations(): gets the list of Relations whose source is this constituent in this view

• Relations provide the following accessors:
  – getRelationName(), getSource(), getTarget()
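As a rough picture of how incoming and outgoing relations fit together, the sketch below registers each directed edge on both of its endpoints. It is plain Java with hypothetical names (`RelationSketchDemo`, `NodeSketch`, `EdgeSketch`), not Edison's `Constituent` or `Relation`. It reuses the semantic-role example from earlier in the talk and assumes, for illustration only, that role edges point from the predicate to its arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Edison's Constituent or Relation classes.
class RelationSketchDemo {
    // Describe every relation whose source is the given node.
    static List<String> describeOutgoing(NodeSketch node) {
        List<String> out = new ArrayList<>();
        for (EdgeSketch e : node.outgoing) {
            out.add(e.source.label + " -" + e.name + "-> " + e.target.label);
        }
        return out;
    }

    // Predicate "buy" with arguments A0 "John Smith" and A1 "the car".
    static NodeSketch[] example() {
        NodeSketch buy = new NodeSketch("buy");
        NodeSketch agent = new NodeSketch("John Smith");
        NodeSketch thing = new NodeSketch("the car");
        new EdgeSketch("A0", buy, agent);
        new EdgeSketch("A1", buy, thing);
        return new NodeSketch[] {buy, agent, thing};
    }

    public static void main(String[] args) {
        NodeSketch[] nodes = example();
        for (String line : describeOutgoing(nodes[0])) {
            System.out.println(line);
        }
        // The argument sees the same edge from the other side.
        EdgeSketch in = nodes[2].incoming.get(0);
        System.out.println(nodes[2].label + " <-" + in.name + "- " + in.source.label);
    }
}

// A node that keeps track of the edges arriving at it and leaving it.
class NodeSketch {
    final String label;
    final List<EdgeSketch> incoming = new ArrayList<>();
    final List<EdgeSketch> outgoing = new ArrayList<>();

    NodeSketch(String label) {
        this.label = label;
    }
}

// A labeled directed edge that registers itself on both endpoints.
class EdgeSketch {
    final String name;
    final NodeSketch source;
    final NodeSketch target;

    EdgeSketch(String name, NodeSketch source, NodeSketch target) {
        this.name = name;
        this.source = source;
        this.target = target;
        source.outgoing.add(this);
        target.incoming.add(this);
    }
}
```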

Other useful functionality

• Supports
  – Top-K views
  – Custom views, for your application

• Provides helper functions for common tasks
  – Look at the functions in the classes of the package edu.illinois.cs.cogcomp.edison.features.helpers

• Provides an interface to WordNet
  – WordNetManager

• Collins’ head-finding rules

• Several feature extraction utilities
  – Look at the classes in edu.illinois.cs.cogcomp.edison.features

So far…

1. What is Edison?
2. Installing Edison
3. Creating a TextAnnotation
4. Adding views from the Curator
5. Using views
6. …??
7. Profit!

Links

• Edison download: http://cogcomp.cs.illinois.edu/page/software_view/Edison

• Example code: http://cogcomp.cs.illinois.edu/software/edison/

• API documentation: http://cogcomp.cs.illinois.edu/software/edison/apidocs