Natural Language Processing for Action Recognition
JHU Summer School
Evelyne Tzoukermann, Ph.D.
Friday, June 11, 2010
What is the role of Natural Language in Action Recognition?
1. Provide temporal information
– Where in the video is the action happening?
2. Provide semantic information
– Parse the phrasal constituents to determine action type and human interaction through objects, instruments, and other contextual information
– E.g.: semantic representation of "cut potatoes"
  • <instrument> knife
  • <human interaction> hands
  • <location> cutting board
Function of Natural Language in Action Recognition?
1. Facilitate action recognition from the video
2. Ground video processing
3. Extract relevant entities and semantics associated with them
4. Allow fusion of knowledge from text with action primitives
Leverage already existing techniques and knowledge
Completed
• Dataset domains:
– Cooking
– Crafts
• Classification of Actions
• Categorization of Actions
Cooking domain
1. DVDs:
– Cook like a chef
– Martha’s Favorite Family Dinners
– Joanne Weir’s cooking class
2. CMU Kitchen dataset
3. Food Network: 12 consecutive hours of recorded time
4. PBS Kids: Sprout – 5 shows
5. URADL: U. of Rochester Activities of Daily Living
– 12 activities, 5 individuals, 3 recordings each
Craft domain
• PBS Kids: Sprout – over 25 shows
Tuples of Entities
– Time stamps for temporal information
– Verbs - capture actions
– Objects - what is acted upon
– Instruments - with what tool
– Location – for recognition
– Camera position – for scalability
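The tuple of entities listed above could be held in a small record type. The following is a minimal sketch in Python, not the project's actual data model: the class name `ActionTuple`, its field names, and the choice of seconds as the time unit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionTuple:
    """One annotated action. All names here are illustrative assumptions."""
    verb: str                     # captures the action, e.g. "cut"
    obj: str                      # what is acted upon
    instrument: Optional[str]     # with what tool, if any
    location: Optional[str]       # where the action takes place
    begin_s: float                # start time stamp, in seconds
    end_s: float                  # end time stamp, in seconds
    camera: Optional[str] = None  # camera position, if known

    def duration_s(self) -> float:
        """Length of the action in seconds."""
        return self.end_s - self.begin_s


# Example values chosen for illustration only.
example = ActionTuple("cut", "red pepper", "knife", "cutting board", 58.1, 60.0)
print(example.verb, example.obj, round(example.duration_s(), 1))
```

Making the record immutable (`frozen=True`) is one possible design choice: annotations can then be used as dictionary keys or set members when merging text-derived and video-derived tuples.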
Information Extraction
• Extract structured information from unstructured documents
Ex: "Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp."
Entity identification and recognition
• Goal of IE: allow computation to be performed on unstructured data.
• More specific goal: allow logical reasoning to draw inferences based on the logical content of the input data.
Entity Recognition for Video
• Can be considered an IE task with a list of entities
• Find a tuple or an ordered list with a temporal dimension
• Goal of text-based Information Extraction: “Who did what to whom where”
– Find the different entities that fill these slots
• Goal of video and text IE
– Find the temporal, and other entities
Angelina’s Ballet Slippers
1. Video
2. Web page
Angelina’s Ballet Slippers
Ingredients
• 1 red pepper, cut in half with seeds removed
• 1⁄2 cup quick cook brown rice
• 1⁄2 cup vegetable stock
• 1 cup canned mixed vegetables, no added salt
• 1⁄4 tsp. black pepper
• 1 tsp. chopped fresh parsley
• 1 tsp. extra virgin olive oil
• 1 lemon
• Decorative cabbage
• 1⁄4 cup shredded cheddar cheese, divided
Supplies
• Measuring cups and spoons
• Cutting board & knife
• Cooking pot
• Small cooking pot
• Mixing spoons
• Slotted spoon
• High-sided baking dish
• Pastry brush
• Large serving plate
Nr | Action | Objects | Human Interaction | Begin Time | End Time | Duration
1 | Washing | Sink, Soap | Washing Hands | 00:38.2 | 00:40.6 | 00:02.4
2 | Drying | Hand Towel | Drying Hands | 00:40.6 | 00:44.4 | 00:03.7
3 | Filling | Sink, Pot | Hands fill pot with water | 00:45.3 | 00:47.2 | 00:01.9
4 | Pouring | Bowl, Broth, Pot | Child pours broth from bowl to pot | 00:48.2 | 00:51.4 | 00:03.2
5 | Firing | Stove, Pot | Hand turns on the burner | 00:54.1 | 00:57.1 | 00:03.0
6 | Cutting | Red Pepper, Knife, Cutting Board | Adult Male cuts red pepper | 00:58.1 | 01:00.0 | 00:01.9
7 | Deseeding | Red Pepper, Scoop | Adult and child deseed red pepper | 01:03.0 | 01:03.9 | 00:00.8
8 | Placing | Pot, Spoon, Red Pepper | Adult places red pepper in pot | 01:09.7 | 01:12.2 | 00:02.5
9 | Adding | Bowl of Rice, Pot | Adult adds rice to pot | 01:14.2 | 01:17.7 | 00:03.4
10 | Opening | Can Opener, Can | Hands open a can | 01:20.2 | 01:23.3 | 00:03.0
11 | Tearing | Parsley, Measuring cup | Child tears off parsley leaves | 01:24.2 | 01:27.4 | 00:03.2
12 | Adding | Can, Pot | Hand adds can of veggies to pot | 01:32.0 | 01:35.0 | 00:03.0
13 | Adding | Measuring cup, Pot | Child adds parsley to pot | 01:35.6 | 01:38.2 | 00:03.0
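The begin and end time stamps in the table can be turned into numbers for temporal alignment. Here is a minimal Python sketch, assuming the stamps read as minutes:seconds.tenths (an interpretation of the table, not something the slides state); the function names `parse_timestamp` and `duration` are illustrative.

```python
def parse_timestamp(ts: str) -> float:
    """Convert an 'MM:SS.s' time stamp (assumed format) to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)


def duration(begin: str, end: str) -> float:
    """Seconds elapsed between two time stamps, rounded to one decimal."""
    return round(parse_timestamp(end) - parse_timestamp(begin), 1)


# Row 1 of the table: washing hands.
print(duration("00:38.2", "00:40.6"))
```

Durations recomputed this way need not match the table's Duration column exactly for every row.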
Sprout - Alphabet book
Action Verb | Freq | Direct Object | Instrument | Human Interaction | Location
To Thread | 1 | Thread | Hand | Both Hands | Construction Paper
To Tie | 1 | Thread | Hand | Both Hands | Construction Paper
To Write | 1 | Ink | Pen | Both Hands | Paper
To Decorate | 2 | Ink | Pen | Both Hands | Paper
To Color | 2 | Ink | Pen | Both Hands | Paper
To Draw | 1 | Ink | Pen | Both Hands | Paper
Baby Picture Frames
Crafts | Freq | Direct Object | Instrument | Human Interaction | Location
To Tape | 2 | Picture | Hand | Both Hands | Frame
To Glue | 2 | Glue | Hand | Both Hands | Popsicle sticks
To Decorate | 1 | Ink | Pen | Both Hands | Popsicle sticks
Action Recognition and Complexity
Input
1. transcripts and closed captions
2. text transcripts alone
3. list of ingredients and utensils
Evaluation can follow these levels
Sprout – Elmo’s Funny Face Pizza
Cooking | Freq | Direct Object | Instrument | Human Interaction | Location
To Wash | 1 | Hands | Faucet/Soap | Both Hands In action | Sink
To Dry | 1 | Hands | Paper Towels | Both Hands In action | Work Space
To Place | 1 | Bagels | Hands | Both Hands In action | Baking Sheet
To Spread | 1 | Sauce | Knife | Both Hands In action | Bagel
To Top | 1 | Olives | Hands | Both Hands In action | Bagel
To Cut | 1 | Peppers | Knife | Both Hands In action | Cutting Board
To Top | 1 | Peppers | Hands | Both Hands In action | Bagel
To Bake | 1 | Sheet Pan | Hands | Both Hands In action | Oven
To Clean | 1 | Food | Hands | Both Hands In action | Work Space
To Sponge | 1 | Food | Sponge | Both Hands In action | Work Space
To Remove | 1 | Sheet Pan | Oven Mitts | Both Hands In action | Oven
Sprout – Caillou’s Crunchy Carrot Salad
Cooking | Freq | Direct Object | Instrument | Human Interaction | Location
To Peel | 1 | Carrots | Peeler | Both Hands In action | Work Space
To Add | 1 | Apples | Hands | Both Hands In action | Bowl
To Measure | 1 | Raisins | Hands | Both Hands In action | Measuring Cup
To Mix | 1 | Salad | Spoons | Both Hands In action | Salad Bowl
To Cut | 1 | Lemon | Knife | Both Hands In action | Cutting Board
To Squeeze | 1 | Lemon | Hands | Both Hands In action | Salad Bowl
To Measure | 1 | Honey | Bottle | Both Hands In action | Measuring Spoon
To Refrigerate | 1 | Bowl | Hands | Both Hands In action | Refrigerator
To Clean | 2 | Food | Hands | Both Hands In action | Table
Martha Stewart Episode 2
Cooking | Frequency | Direct Object | Instrument | Human Interaction | Location
To Stir | 5 | Chili | Wooden Spoon | One hand | Pot
To Pour | 1 | Vinegar | Measuring Cup | Both hands | Food Processor
To Pour | 1 | Orange juice | Ramekin | Both hands | Pan
To Add | 1 | Salt | Hand | One hand | Pan
To Cut | 1 | Butter | Knife | Both hands | Butter Boat
To Beat | 1 | Egg | Fork | Both hands | Bowl
To Mix | 6 | Meatloaf | Hand | Both hands | Bowl
To Remove | 2 | Roast | Hand | Both hands | Crock Pot
To Slice | 7 | Roast | Knife | Both hands | Cutting Board
To Spoon | 1 | Dressing | Spoon | One hand | Plate of Oranges
To Spread | 2 | Mix | Hand | Both hands | Baking Dish
Martha Stewart – 191 action verbs
to pour 33 | to spoon 4
to add 20 | to measure 4
to stir 17 | to glaze 3
to slice 17 | to garnish 2
to cut 11 | to spread 2
to place 11 | to cover 2
to mix 6 | to tie 2
to remove 6 | to scrape 2
to rub 6 | to dry 1
to turn 6 | to beat 1
to deglaze 6 | to broil 1
to serve 5 | to sear 1
to whisk 5 | to wrap 1
to top 4 | to grate 1
to process (in a food processor) 4 | to bake 1
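Verb frequencies like those on this slide can be tallied directly from a list of annotated actions. A minimal Python sketch follows; the input list `annotated_verbs` is hypothetical sample data, not the actual annotation, and the slides do not say how the counts were produced.

```python
from collections import Counter

# Hypothetical input: one verb per annotated action, in transcript order.
annotated_verbs = ["pour", "add", "stir", "pour", "slice", "pour", "add"]

counts = Counter(annotated_verbs)

# Print verbs from most to least frequent, in the slide's "to <verb> <n>" style.
for verb, n in counts.most_common():
    print(f"to {verb} {n}")
```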
Semantic Categorization of Actions
• To Apply Heat
– To Bake
– To Broil
– To Sear
• To Combine
– To Add
– To Mix
– To Process
– To Beat
– To Pour
– To Deglaze
– To Whisk
• To Separate into one or more parts
– To Cut
– To Slice
– To Grate
– To Tear
– To Peel
– To Score
• To Decorate
– To Top
– To Garnish
– To Spread
– To Glaze
– To Spoon
– To Rub
• To Sanitize
– To Wash
– To Dry
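The grouping on this slide can be treated as a lookup from verb to semantic category. Below is a minimal Python sketch; the category membership is read off the slide's two-column layout, and the names `CATEGORIES`, `VERB_TO_CATEGORY`, and `categorize` are illustrative assumptions.

```python
from typing import Optional

# Category -> verbs, as read from the slide.
CATEGORIES = {
    "apply heat": {"bake", "broil", "sear"},
    "combine": {"add", "mix", "process", "beat", "pour", "deglaze", "whisk"},
    "separate": {"cut", "slice", "grate", "tear", "peel", "score"},
    "decorate": {"top", "garnish", "spread", "glaze", "spoon", "rub"},
    "sanitize": {"wash", "dry"},
}

# Inverted index: verb -> category.
VERB_TO_CATEGORY = {
    verb: category for category, verbs in CATEGORIES.items() for verb in verbs
}


def categorize(verb: str) -> Optional[str]:
    """Return the category of a verb such as 'To Bake', or None if unlisted."""
    v = verb.lower().strip()
    if v.startswith("to "):
        v = v[3:]
    return VERB_TO_CATEGORY.get(v)


print(categorize("To Bake"), categorize("to slice"), categorize("stir"))
```

Verbs that the slide does not categorize, such as "stir", map to `None` rather than to a guessed category.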
CMU Kitchen Set - Verbs
– take – put – open – fill – crack – beat – stir – pour – clean – switch_on
– read – spray – close – walk – twist_on – twist_off
NLP Tools
• Part-of-speech tagger or phrase chunker
• Dependency parser for Verb-Object relations
– We have tuples of Verb, Object, Instrument, Location
– Ex: Stir (v) chili (o) with a wooden spoon (instr) in a pot (loc)
• Collocations for Instrument and Location
– Co-occurrence from Google
– Ex: “place a wooden spoon across the pot to keep it from boiling”
• And more
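The Verb, Object, Instrument, Location tuple from the "stir chili" example can be illustrated with a toy pattern matcher. This is only a sketch under strong assumptions: it uses a regular expression in place of the dependency parser named on the slide, handles only imperative sentences of the form "verb object [with instrument] [in location]", and the names `PATTERN` and `extract_tuple` are hypothetical.

```python
import re
from typing import Dict, Optional

# Toy stand-in for a dependency parser: "with" marks the instrument,
# "in" marks the location. Real text needs a real parser.
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?P<obj>.+?)"
    r"(?:\s+with\s+(?P<instrument>.+?))?"
    r"(?:\s+in\s+(?P<location>.+?))?$",
    re.IGNORECASE,
)
ARTICLE = re.compile(r"^(?:a|an|the)\s+", re.IGNORECASE)


def _clean(phrase: Optional[str]) -> Optional[str]:
    """Lowercase a phrase and drop a leading article."""
    if phrase is None:
        return None
    return ARTICLE.sub("", phrase.strip()).lower()


def extract_tuple(sentence: str) -> Optional[Dict[str, Optional[str]]]:
    """Fill the verb/obj/instrument/location slots, or return None."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return {slot: _clean(value) for slot, value in match.groupdict().items()}


print(extract_tuple("Stir chili with a wooden spoon in a pot"))
```

A pattern like this breaks on anything beyond the simplest imperatives, which is one reason to prefer parser output for the Verb-Object relations.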
Ontology
• Need to capture:
– Concepts
– Relationships
– Properties
– Timestamps (video_name [beg_time, end_time])
– Validation
Ontology for cooking and craft
• Need to capture:
– Actions
– Food – including the state and transformation
or
– Objects – paper, paper roll, …
– Instruments: kitchen utensils, scissors, crayons
– Location
– Timing
– (Recipes)
Ontology
• Use of Protégé http://protege.stanford.edu/
– ontology editor and knowledge-base framework
• Knowtator: Protégé plug-in for annotation
– can be used for evaluating or training a variety of NLP systems
• Write a plug-in that takes the output of a syntactic parser and connects it to visual frames
Protégé knowledge-base
• class
– represents the concepts of a domain
– organized in a subsumption hierarchy
• instance: corresponds to an individual of a class
• slot: defines a property of a class or instance
• facet: constrains the values that slots can have
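The four notions above (class, instance, slot, facet) can be mimicked in a few lines of Python. This is an illustrative sketch of a frame-style model only; it is not Protégé's API, and every name in it (`Slot`, `FrameClass`, `Instance`, `allowed_types`, `required`) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Slot:
    """A property; its facets constrain the values it may take."""
    name: str
    allowed_types: Tuple[type, ...] = (object,)  # facet: value type
    required: bool = False                       # facet: must be filled


@dataclass
class FrameClass:
    """A concept, placed in a subsumption hierarchy via `parent`."""
    name: str
    parent: Optional["FrameClass"] = None
    slots: Dict[str, Slot] = field(default_factory=dict)

    def all_slots(self) -> Dict[str, Slot]:
        """Own slots plus those inherited from ancestors."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}

    def is_a(self, other: "FrameClass") -> bool:
        """True if `other` is this class or one of its ancestors."""
        cls: Optional[FrameClass] = self
        while cls is not None:
            if cls is other:
                return True
            cls = cls.parent
        return False

    def instantiate(self, **values: Any) -> "Instance":
        """Create an individual after checking every facet."""
        slots = self.all_slots()
        for key, value in values.items():
            if key not in slots:
                raise KeyError(f"{self.name} has no slot {key!r}")
            if not isinstance(value, slots[key].allowed_types):
                raise TypeError(f"bad value for slot {key!r}")
        missing = [s.name for s in slots.values()
                   if s.required and s.name not in values]
        if missing:
            raise ValueError(f"missing required slots: {missing}")
        return Instance(self, dict(values))


@dataclass
class Instance:
    """An individual of a class, with its slot values."""
    cls: FrameClass
    values: Dict[str, Any]


action = FrameClass("Action", slots={
    "verb": Slot("verb", (str,), required=True),
    "instrument": Slot("instrument", (str,)),
})
cutting = FrameClass("Cutting", parent=action)
inst = cutting.instantiate(verb="cut", instrument="knife")
print(inst.cls.name, inst.values)
```

Here `Cutting` inherits the `verb` and `instrument` slots from `Action`, and the facets reject a missing verb or a non-string value at instantiation time.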
Dependency Parser
Input Sentence: “Next we need to open the can of veggies”
ROOT [next-1] (
  SBAR [next-1] (
    next-1(Next)/IN
    S [need-6] (
      NP [we-3] ( we-3/PRP )
      VP [need-6] (
        need-6/VBP
        S [to-8] (
          VP [to-8] (
            to-8/TO
            VP [open-10] (
              open-10/VB
              NP [can-14] (
                NP [can-14] ( the-12/DT can-14/NN )
                PP [of-17] ( of-17/IN NP [veggy-19] ( veggy-19(veggies)/NNS ) )
Action concept and relations with other concepts
[Diagram: the Action concept linked to Verb, Object, Human Interaction, Instrument, Location, and Time; node label: Vn, t1, t2]
Knowtator: Annotation Plug-in
• General purpose annotation tool
• Facilitates creation of training and evaluation corpora for language processing tasks
• Ease of use
• Straightforward to incorporate domain knowledge
Knowtator: an example
Processes
• Syntactic Parser
• Ontology Creation
• Ontology Annotation
• Corpus enrichment using collocations
Related Research
1. Ontology and cooking
2. Parsing “restricted” languages
3. Connecting text with images
Related Research
• Dina Demner-Fushman, Sameer Antani, Matthew Simpson, George R. Thoma “Annotation and retrieval of clinically relevant images”, 2009
• Ricardo Ribeiro, Fernando Batista, Joana Paulo Pardal, Nuno J. Mamede, and H. Sofia Pinto “Cooking an Ontology?”, 2008
• Fernando Batista, Joana Paulo, Nuno Mamede, Paula Vaz, Ricardo Ribeiro “Ontology construction: cooking domain”, 2006
• Joana Paulo Pardal, “Dynamic Use of Ontologies in Dialogue Systems”, 2009
Related Research
• Mutsuo Sano, Ichiro Ide, Kenzaburo Miyawaki “Overview of the ACM Multimedia 2009 Workshop on Multimedia for Cooking and Eating Activities (CEA’09)”
• Keigo Kitamura, Toshihiko Yamasaki, Kiyoharu Aizawa “FoodLog: Capture, Analysis and Retrieval of Personal Food Images via Web”, 2009
– distinguishes food images from other images
• Dan Tasse and Noah Smith (CMU) “SOUR CREAM: Toward Semantic Processing of Recipes”, 2008
– new techniques for semantic parsing by focusing on the domain of cooking recipes
– first order logic