Natural Language Processing for Action Recognition JHU Summer School Evelyne Tzoukermann, Ph.D. Friday, June 11, 2010

Page 1

Natural Language Processing for Action Recognition

JHU Summer School

Evelyne Tzoukermann, Ph.D.

Friday, June 11, 2010

Page 2

What is the role of Natural Language in Action Recognition?

1. Provide temporal information
   – Where in the video is the action happening?
2. Provide semantic information
   – Parse the phrasal constituents to determine action type and human interaction through objects, instruments, and other contextual information
   – E.g.: semantic representation of "cut potatoes"
     • <instrument> knife
     • <human interaction> hands
     • <location> cutting board

Page 3

Function of Natural Language in Action Recognition?

1. Facilitate action recognition from the video
2. Ground video processing
3. Extract relevant entities and semantics associated with them
4. Allow fusion of knowledge from text with action primitives

Leverage already existing techniques and knowledge

Page 4

Completed

• Dataset domains:
  – Cooking
  – Crafts

• Classification of Actions

• Categorization of Actions

Page 5

Cooking domain

1. DVDs:
   – Cook like a chef
   – Martha’s Favorite Family Dinners
   – Joanne Weir’s cooking class
2. CMU Kitchen dataset
3. Food Network: 12 consecutive hours of recorded time
4. PBS Kids: Sprout – 5 shows
5. URADL: U. of Rochester Activities of Daily Living
   – 12 activities, 5 individuals, 3 recordings each

Page 6

Craft domain

• PBS Kids: Sprout – over 25 shows

Page 7

Tuples of Entities

– Time stamps for temporal information

– Verbs - capture actions

– Objects - what is acted upon

– Instruments - with what tool

– Location – for recognition

– Camera position – for scalability
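The entity tuple listed above can be sketched as a small record type. This is only an illustration: the field names, the `duration_s` helper, and the time stamps in the example are assumptions, not part of the original annotation scheme. The slot values come from the "cut potatoes" example earlier in the talk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ActionTuple:
    """One annotated action. Field names are illustrative, not from the slides."""
    verb: str                        # captures the action
    objects: Tuple[str, ...]         # what is acted upon
    instrument: Optional[str]        # with what tool
    human_interaction: Optional[str]
    location: Optional[str]          # used for recognition
    begin_s: float                   # time stamps, in seconds from video start
    end_s: float
    camera_position: Optional[str] = None  # kept for scalability

    @property
    def duration_s(self) -> float:
        # Rounded to tenths of a second, the precision used in the annotations
        return round(self.end_s - self.begin_s, 1)


# Hypothetical time stamps; the slot values follow the "cut potatoes" example.
cut = ActionTuple(
    verb="cut",
    objects=("potatoes",),
    instrument="knife",
    human_interaction="hands",
    location="cutting board",
    begin_s=10.0,
    end_s=12.5,
)
print(cut.verb, cut.objects, cut.duration_s)
```

Making the record immutable (`frozen=True`) keeps each annotation a fixed fact about the video, which is convenient when tuples are later compared against recognizer output.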

Page 8

Information Extraction

• Extract structured information from unstructured documents
  – Ex: “Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp.”
  – Entity identification and recognition

• Goal of IE: allow computation to be performed on unstructured data.

• More specific goal: allow logical reasoning to draw inferences based on the logical content of the input data.
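As a toy illustration of turning the example sentence into a structured record, the sketch below uses a single hand-written regular expression. The pattern, the function name `extract_acquisition`, and the output keys are assumptions made for this example; a real IE system would rely on entity recognition and parsing, not on one pattern.

```python
import re

SENTENCE = ("Yesterday, New York based Foo Inc. announced "
            "their acquisition of Bar Corp.")

# Hypothetical pattern that only covers this one sentence shape:
# "<Company> announced their acquisition of <Company>"
PATTERN = re.compile(
    r"(?P<acquirer>[A-Z]\w+ (?:Inc|Corp)\.) announced their acquisition of "
    r"(?P<target>[A-Z]\w+ (?:Inc|Corp)\.)"
)


def extract_acquisition(text):
    """Return a structured record for an acquisition statement, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "relation": "acquisition",
        "acquirer": match.group("acquirer"),
        "target": match.group("target"),
    }


print(extract_acquisition(SENTENCE))
```

Once the sentence is in this form, computation over it (for example, querying all acquisitions by one company) no longer needs the original text.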

Page 9

Entity Recognition for Video

• Can be considered an IE task with a list of entities

• Find a tuple or an ordered list with a temporal dimension

• Goal of text-based Information Extraction: “Who did what to whom where”
  – Find the different entities that fill these slots
• Goal of video and text IE
  – Find the temporal and other entities

Page 10

Angelina’s Ballet Slippers

1. Video

2. Web page

Page 11

Angelina’s Ballet Slippers

Ingredients

• 1 red pepper, cut in half with seeds removed
• 1⁄2 cup quick cook brown rice
• 1⁄2 cup vegetable stock
• 1 cup canned mixed vegetables, no added salt
• 1⁄4 tsp. black pepper
• 1 tsp. chopped fresh parsley
• 1 tsp. extra virgin olive oil
• 1 lemon
• Decorative cabbage
• 1⁄4 cup shredded cheddar cheese, divided

Supplies

• Measuring cups and spoons
• Cutting board & knife
• Cooking pot
• Small cooking pot
• Mixing spoons
• Slotted spoon
• High-sided baking dish
• Pastry brush
• Large serving plate

Page 12

Nr | Action | Objects | Human Interaction | Begin Time | End Time | Duration
1 | Washing | Sink, Soap | Washing Hands | 00:38.2 | 00:40.6 | 00:02.4
2 | Drying | Hand Towel | Drying Hands | 00:40.6 | 00:44.4 | 00:03.7
3 | Filling | Sink, Pot | Hands fill pot with water | 00:45.3 | 00:47.2 | 00:01.9
4 | Pouring | Bowl, Broth, Pot | Child pours broth from bowl to pot | 00:48.2 | 00:51.4 | 00:03.2
5 | Firing | Stove, Pot | Hand turns on the burner | 00:54.1 | 00:57.1 | 00:03.0
6 | Cutting | Red Pepper, Knife, Cutting Board | Adult Male cuts red pepper | 00:58.1 | 01:00.0 | 00:01.9
7 | Deseeding | Red Pepper, Scoop | Adult and child deseed red pepper | 01:03.0 | 01:03.9 | 00:00.8
8 | Placing | Pot, Spoon, Red Pepper | Adult places red pepper in pot | 01:09.7 | 01:12.2 | 00:02.5
9 | Adding | Bowl of Rice, Pot | Adult adds rice to pot | 01:14.2 | 01:17.7 | 00:03.4
10 | Opening | Can Opener, Can | Hands open a can | 01:20.2 | 01:23.3 | 00:03.0
11 | Tearing | Parsley, Measuring cup | Child tears off parsley leaves | 01:24.2 | 01:27.4 | 00:03.2
12 | Adding | Can, Pot | Hand adds can of veggies to pot | 01:32.0 | 01:35.0 | 00:03.0
13 | Adding | Measuring cup, Pot | Child adds parsley to pot | 01:35.6 | 01:38.2 | 00:03.0
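The begin and end stamps in this table use a `MM:SS.t` format. A minimal sketch of converting them to seconds and recomputing the duration is shown below; the function names are assumptions for this example, and the three rows are copied from the table.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'MM:SS.t' time stamp to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def duration(begin: str, end: str) -> float:
    """Duration in seconds, rounded to tenths like the annotations."""
    return round(to_seconds(end) - to_seconds(begin), 1)


# (action, begin, end) for rows 1, 4 and 6 of the table
rows = [
    ("Washing", "00:38.2", "00:40.6"),
    ("Pouring", "00:48.2", "00:51.4"),
    ("Cutting", "00:58.1", "01:00.0"),
]
for action, begin, end in rows:
    print(action, duration(begin, end))
```

Recomputing durations this way is a quick consistency check on hand-entered annotations: for row 13 the same arithmetic gives 2.6 s, whereas the slide lists 00:03.0.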

Page 13

Sprout - Alphabet book

Action Verb | Freq | Direct Object | Instrument | Human Interaction | Location
To Thread | 1 | Thread | Hand | Both Hands | Construction Paper
To Tie | 1 | Thread | Hand | Both Hands | Construction Paper
To Write | 1 | Ink | Pen | Both Hands | Paper
To Decorate | 2 | Ink | Pen | Both Hands | Paper
To Color | 2 | Ink | Pen | Both Hands | Paper
To Draw | 1 | Ink | Pen | Both Hands | Paper

Page 14

Baby Picture Frames

Crafts | Freq | Direct Object | Instrument | Human Interaction | Location
To Tape | 2 | Picture | Hand | Both Hands | Frame
To Glue | 2 | Glue | Hand | Both Hands | Popsicle sticks
To Decorate | 1 | Ink | Pen | Both Hands | Popsicle sticks

Page 15

Action Recognition and Complexity

Input
1. transcripts and closed captions
2. text transcripts alone
3. list of ingredients and utensils

Evaluation can follow these levels

Page 16

Sprout – Elmo’s Funny Face Pizza

Cooking | Freq | Direct Object | Instrument | Human Interaction | Location
To Wash | 1 | Hands | Faucet/Soap | Both Hands In action | Sink
To Dry | 1 | Hands | Paper Towels | Both Hands In action | Work Space
To Place | 1 | Bagels | Hands | Both Hands In action | Baking Sheet
To Spread | 1 | Sauce | Knife | Both Hands In action | Bagel
To Top | 1 | Olives | Hands | Both Hands In action | Bagel
To Cut | 1 | Peppers | Knife | Both Hands In action | Cutting Board
To Top | 1 | Peppers | Hands | Both Hands In action | Bagel
To Bake | 1 | Sheet Pan | Hands | Both Hands In action | Oven
To Clean | 1 | Food | Hands | Both Hands In action | Work Space
To Sponge | 1 | Food | Sponge | Both Hands In action | Work Space
To Remove | 1 | Sheet Pan | Oven Mitts | Both Hands In action | Oven

Page 17

Sprout – Caillou’s Crunchy Carrot Salad

Cooking | Freq | Direct Object | Instrument | Human Interaction | Location
To Peel | 1 | Carrots | Peeler | Both Hands In action | Work Space
To Add | 1 | Apples | Hands | Both Hands In action | Bowl
To Measure | 1 | Raisins | Hands | Both Hands In action | Measuring Cup
To Mix | 1 | Salad | Spoons | Both Hands In action | Salad Bowl
To Cut | 1 | Lemon | Knife | Both Hands In action | Cutting Board
To Squeeze | 1 | Lemon | Hands | Both Hands In action | Salad Bowl
To Measure | 1 | Honey | Bottle | Both Hands In action | Measuring Spoon
To Refrigerate | 1 | Bowl | Hands | Both Hands In action | Refrigerator
To Clean | 2 | Food | Hands | Both Hands In action | Table

Page 18

Martha Stewart Episode 2

Cooking | Frequency | Direct Object | Instrument | Human Interaction | Location
To Stir | 5 | Chili | Wooden Spoon | One hand | Pot
To Pour | 1 | Vinegar | Measuring Cup | Both hands | Food Processor
To Pour | 1 | Orange juice | Ramekin | Both hands | Pan
To Add | 1 | Salt | Hand | One hand | Pan
To Cut | 1 | Butter | Knife | Both hands | Butter Boat
To Beat | 1 | Egg | Fork | Both hands | Bowl
To Mix | 6 | Meatloaf | Hand | Both hands | Bowl
To Remove | 2 | Roast | Hand | Both hands | Crock Pot
To Slice | 7 | Roast | Knife | Both hands | Cutting Board
To Spoon | 1 | Dressing | Spoon | One hand | Plate of Oranges
To Spread | 2 | Mix | Hand | Both hands | Baking Dish

Page 19

Martha Stewart – 191 action verbs

Verb | Freq | Verb | Freq
to pour | 33 | to spoon | 4
to add | 20 | to measure | 4
to stir | 17 | to glaze | 3
to slice | 17 | to garnish | 2
to cut | 11 | to spread | 2
to place | 11 | to cover | 2
to mix | 6 | to tie | 2
to remove | 6 | to scrape | 2
to rub | 6 | to dry | 1
to turn | 6 | to beat | 1
to deglaze | 6 | to broil | 1
to serve | 5 | to sear | 1
to whisk | 5 | to wrap | 1
to top | 4 | to grate | 1
to process (in a food processor) | 4 | to bake | 1

Page 20

Semantic Categorization of Actions

To Apply Heat: to bake, to broil, to sear
To Separate into one or more parts: to cut, to slice, to grate, to tear, to peel, to score
To Sanitize: to wash, to dry
To Combine: to add, to mix, to process, to beat, to pour, to deglaze, to whisk
To Decorate: to top, to garnish, to spread, to glaze, to spoon, to rub
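One way to make this categorization usable downstream is a lookup from verb to category. The sketch below assumes the slide's two-column layout reads as five categories (apply heat, separate, sanitize, combine, decorate); the short category keys and the function name `categorize` are ours, not the author's.

```python
# Category membership as read from the slide layout (an assumption)
CATEGORIES = {
    "apply heat": ["bake", "broil", "sear"],
    "separate": ["cut", "slice", "grate", "tear", "peel", "score"],
    "sanitize": ["wash", "dry"],
    "combine": ["add", "mix", "process", "beat", "pour", "deglaze", "whisk"],
    "decorate": ["top", "garnish", "spread", "glaze", "spoon", "rub"],
}

# Invert the mapping once so each lookup is a single dictionary access
VERB_TO_CATEGORY = {
    verb: category
    for category, verbs in CATEGORIES.items()
    for verb in verbs
}


def categorize(verb_phrase):
    """Map a verb such as 'To Slice' to its category, or None if unlisted."""
    verb = verb_phrase.strip().lower()
    if verb.startswith("to "):
        verb = verb[3:]
    return VERB_TO_CATEGORY.get(verb)


print(categorize("To Slice"))
print(categorize("to deglaze"))
```

Returning `None` for unlisted verbs keeps unknown actions visible instead of forcing them into a category.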

Page 21

CMU Kitchen Set - Verbs

– take – put – open – fill – crack – beat – stir – pour – clean – switch_on
– read – spray – close – walk – twist_on – twist_off

Page 22

NLP Tools

• Part-of-speech tagger or phrase chunker
• Dependency parser for Verb-Object relations
  – We have tuples of Verb, Object, Instrument, Location
  – Ex: Stir (v) chili (o) with a wooden spoon (instr) in a pot (loc)
• Collocations for Instrument and Location
  – Co-occurrence from Google
  – Ex: “place a wooden spoon across the pot to keep it from boiling”
• And more
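To make the target output concrete, here is a sketch that fills the Verb, Object, Instrument, Location tuple for the "stir chili" example. It uses a hand-written regular expression as a stand-in for the dependency parser named on the slide, so it only handles imperative sentences of the shape "VERB OBJECT [with a INSTRUMENT] [in a LOCATION]". The pattern and the function name `extract_tuple` are assumptions for this example.

```python
import re

# Stand-in for a dependency parser: a pattern for one sentence shape only
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?P<object>.+?)"
    r"(?:\s+with\s+(?:a|an|the)\s+(?P<instrument>.+?))?"
    r"(?:\s+in\s+(?:a|an|the)\s+(?P<location>.+?))?$",
    re.IGNORECASE,
)


def extract_tuple(sentence):
    """Return verb/object/instrument/location slots, or None if no match."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    # Missing optional slots stay None; filled slots are lowercased
    return {
        slot: (value.lower() if value else None)
        for slot, value in match.groupdict().items()
    }


print(extract_tuple("Stir chili with a wooden spoon in a pot"))
```

When the instrument or location is not stated in the sentence, the slot stays empty, which is where the collocation statistics mentioned above could supply a likely value.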

Page 23

Ontology

• Need to capture:
  – Concepts
  – Relationships
  – Properties
  – Timestamps (video_name [beg_time, end_time])
  – Validation

Page 24

Ontology for cooking and craft

• Need to capture:
  – Actions
  – Food – including the state and transformation
    or
  – Objects – paper, paper roll, …
  – Instruments: kitchen utensils, scissors, crayons
  – Location
  – Timing
  – (Recipes)

Page 25

Ontology

• Use of Protégé http://protege.stanford.edu/
  – ontology editor and knowledge-base framework
• Knowtator: Protégé plug-in for annotation
  – can be used for evaluating or training a variety of NLP systems
• Write a plug-in that takes the output of a syntactic parser and connects it to visual frames

Page 26

Protégé knowledge-base

• class
  – represents the concepts of a domain
  – organized in a subsumption hierarchy
• instance: corresponds to individuals of a class
• slot: defines properties of a class or instance
• facet frames constrain the values that slots can have
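The four notions above (class, instance, slot, facet) can be illustrated with a small frame model in plain Python. This is not the Protégé API; every name below (`Slot`, `FrameClass`, `make_instance`, the `allowed_values` facet) is hypothetical and only mirrors the concepts on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    """A property of a class; `allowed_values` plays the role of a facet."""
    name: str
    allowed_values: Optional[List[str]] = None

    def accepts(self, value: str) -> bool:
        return self.allowed_values is None or value in self.allowed_values


@dataclass
class FrameClass:
    """A concept placed in a subsumption (is-a) hierarchy."""
    name: str
    parent: Optional["FrameClass"] = None
    slots: Dict[str, Slot] = field(default_factory=dict)

    def is_a(self, other: "FrameClass") -> bool:
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def all_slots(self) -> Dict[str, Slot]:
        # Slots are inherited; a subclass may override one with a stricter facet
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


def make_instance(cls: FrameClass, **values: str) -> dict:
    """Create an individual of `cls`, checking each value against its facet."""
    slots = cls.all_slots()
    for name, value in values.items():
        if name not in slots:
            raise KeyError(f"{cls.name} has no slot {name!r}")
        if not slots[name].accepts(value):
            raise ValueError(f"{value!r} is not allowed for slot {name!r}")
    return {"class": cls.name, **values}


action = FrameClass("Action", slots={"instrument": Slot("instrument")})
cutting = FrameClass(
    "Cutting",
    parent=action,
    slots={"instrument": Slot("instrument", ["knife", "scissors"])},
)
slice_roast = make_instance(cutting, instrument="knife")
print(slice_roast, cutting.is_a(action))
```

The facet check is what would reject an annotation such as cutting with a spoon, which is the kind of validation an ontology can contribute to action recognition.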

Page 27

Dependency Parser

Input Sentence: “Next we need to open the can of veggies”

ROOT [next-1] (
  SBAR [next-1] (
    next-1(Next)/IN
    S [need-6] (
      NP [we-3] ( we-3/PRP )
      VP [need-6] (
        need-6/VBP
        S [to-8] (
          VP [to-8] (
            to-8/TO
            VP [open-10] (
              open-10/VB
              NP [can-14] (
                NP [can-14] ( the-12/DT can-14/NN )
                PP [of-17] ( of-17/IN NP [veggy-19] ( veggy-19(veggies)/NNS ) )
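A rough sketch of how a verb-object pair could be read off this parser output follows. It scans the tagged leaves (`lemma-index/TAG`) and takes the last verb and the first noun after it. That heuristic, the regular expression, and the function names are assumptions for illustration; walking the head annotations in the tree would be more robust.

```python
import re

# The parser output from the slide, as one string (truncated as in the source)
PARSE = (
    "ROOT [next-1] ( SBAR [next-1] ( next-1(Next)/IN S [need-6] ( "
    "NP [we-3] ( we-3/PRP ) VP [need-6] ( need-6/VBP S [to-8] ( "
    "VP [to-8] ( to-8/TO VP [open-10] ( open-10/VB NP [can-14] ( "
    "NP [can-14] ( the-12/DT can-14/NN ) "
    "PP [of-17] ( of-17/IN NP [veggy-19] ( veggy-19(veggies)/NNS ) )"
)

# A leaf looks like lemma-index/TAG or lemma-index(surface)/TAG
TOKEN = re.compile(r"(\w+)-(\d+)(?:\((\w+)\))?/([A-Z]+)")


def tagged_tokens(parse):
    """Return (lemma, tag) pairs for every leaf, in sentence order."""
    return [(m.group(1), m.group(4)) for m in TOKEN.finditer(parse)]


def verb_object(parse):
    """Heuristic: the last verb in the sentence and the first noun after it."""
    tokens = tagged_tokens(parse)
    verb_positions = [i for i, (_, tag) in enumerate(tokens) if tag.startswith("VB")]
    if not verb_positions:
        return None
    last = verb_positions[-1]
    verb = tokens[last][0]
    for lemma, tag in tokens[last + 1:]:
        if tag.startswith("NN"):
            return verb, lemma
    return verb, None


print(verb_object(PARSE))
```

For this sentence the heuristic skips the auxiliary-like "need" and returns the action verb "open" with its object head "can".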

Page 29

Action concept and relations with other concepts

[Diagram: Action linked to Verb, Object, HumanInteraction, Instrument, Location, and Time; label: Vn,t1,t2]

Page 30

Knowtator: Annotation Plug-in

• General purpose annotation tool
• Facilitates creation of training and evaluation corpora for language processing tasks
• Ease of use
• Straightforward to incorporate domain knowledge

Page 31

Knowtator: an example

Page 32

Processes

Syntactic Parser

Ontology Creation

Ontology Annotation

Corpus enrichment using collocations

Page 33

Related Research

1. Ontology and cooking

2. Parsing “restricted” languages

3. Connecting text with images

Page 34

Related Research

• Dina Demner-Fushman, Sameer Antani, Matthew Simpson, George R. Thoma “Annotation and retrieval of clinically relevant images”, 2009

• Ricardo Ribeiro, Fernando Batista, Joana Paulo Pardal, Nuno J. Mamede, and H. Sofia Pinto “Cooking an Ontology?”, 2008

• Fernando Batista, Joana Paulo, Nuno Mamede, Paula Vaz, Ricardo Ribeiro “Ontology construction: cooking domain”, 2006

• Joana Paulo Pardal, “Dynamic Use of Ontologies in Dialogue Systems”, 2009

Page 35

Related Research

• Mutsuo Sano, Ichiro Ide, Kenzaburo Miyawaki “Overview of the ACM Multimedia 2009 Workshop on Multimedia for Cooking and Eating Activities (CEA’09)”

• Keigo Kitamura, Toshihiko Yamasaki, Kiyoharu Aizawa “FoodLog: Capture, Analysis and Retrieval of Personal Food Images via Web”, 2009
  – distinguishes food images from other images

• Dan Tasse and Noah Smith (CMU) “SOUR CREAM: Toward Semantic Processing of Recipes”, 2008
  – new techniques for semantic parsing by focusing on the domain of cooking recipes
  – first order logic