CVPR Talk v2
Transcript of CVPR Talk v2
![Page 1: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/1.jpg)
Situation Recognition Visual Semantic Role Labeling for Image Understanding
1
Mark Yatskar
in collaboration w/ Luke Zettlemoyer, Ali Farhadi
![Page 2: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/2.jpg)
How can we summarize what is happening in an image?
LOADINGAGENT ITEM DESTINATION TOOL PLACE
WOMAN HORSE TRAILER ROPE OUTDOORS
![Page 3: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/3.jpg)
Is the same thing happening in two images?
turkers say…
0255075
100
yes somewhat no
![Page 4: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/4.jpg)
Is the same thing happening in two images?
turkers say…
0255075
100
yes somewhat no
why no?
0255075
100
activty other
Activity
![Page 5: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/5.jpg)
Is the same thing happening in two images?
turkers say…
0255075
100
yes somewhat no
why yes?
0255075
100
throwing playing
Activity
![Page 6: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/6.jpg)
Is the same thing happening in two images?
turkers say…
0255075
100
yes somewhat no
why yes?
0255075
100
throwing playing
why no?
0255075
100
ball sport
Activity
Object
![Page 7: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/7.jpg)
Activity
Object
turkers say…
0255075
100
yes somewhat no
why yes?
0255075
100
beer pouring glass man
Is the same thing happening in two images?
![Page 8: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/8.jpg)
Activity
Object
turkers say…
0255075
100
yes somewhat no
why yes?
0255075
100
beer pouring glass man
Is the same thing happening in two images?
why no?
0255075
100
destination source other
Role
![Page 9: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/9.jpg)
Systematically describe how objects participate
LOADINGAGENT ITEM DESTINATION TOOL PLACE
WOMAN HORSE TRAILER ROPE OUTDOORS
in activities through roles
![Page 10: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/10.jpg)
Situation Recognition
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
![Page 11: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/11.jpg)
Situation Recognition
POURING
AGENT MANSUBSTANCE BEER
SOURCE TAP
DESTINATION GLASS
PLACE BARROOM
POURING
AGENT MANSUBSTANCE BEER
SOURCE GLASS
DESTINATION MOUTH
PLACE BACKYARD
Same
Different
![Page 12: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/12.jpg)
Situation Recognition
POURING
AGENT MANSUBSTANCE BEER
SOURCE TAP
DESTINATION GLASS
PLACE BARROOM
POURING
AGENT MANSUBSTANCE BEER
SOURCE GLASS
DESTINATION MOUTH
PLACE BACKYARD
Same
Different
What is the space of possible situations?
![Page 13: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/13.jpg)
imSituA Large Scale Situation Dataset
120k+ images, 500+ verbs, 100k+ situations
![Page 14: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/14.jpg)
Natural Language Processing: Semantic Role Labeling
A boy is fixing a car tire with a tire iron outdoors.
Activity
Object
Role
![Page 15: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/15.jpg)
Natural Language Processing: Semantic Role Labeling
A boy is fixing a car tire with a tire iron outdoors.
Activity
Object
Role
![Page 16: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/16.jpg)
Natural Language Processing: Semantic Role Labeling
A boy is fixing a car tire with a tire iron outdoors.
Activity
Object
Role
![Page 17: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/17.jpg)
Natural Language Processing: Semantic Role Labeling
A boy is fixing a car tire with a tire iron outdoors.
Activity
Object
Role
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
![Page 18: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/18.jpg)
A jockey falling from a horse onto the ground at a racetrack.
Natural Language Processing: Semantic Role Labeling
FALLINGAGENT SOURCE DESTINATION PLACEJOCKEY HORSE GROUND RACETRACK
Activity
Object
Role
![Page 19: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/19.jpg)
FrameNet for Verb and Role Inventory
FIXINGAGENT OBJECT PART TOOL PLACE
semantic role labeling ontology:
FrameNet (8000 verbs)
creating imSitu
![Page 20: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/20.jpg)
Visualness
FIXINGAGENT OBJECT PART TOOL PLACE
semantic role labeling ontology:
FrameNet (8000 verbs)
~1000 visual verbs~3.5 roles/verb
filter verbs, semantic roles
creating imSituFrameNet
![Page 21: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/21.jpg)
WordNet for Noun Inventory
FIXINGAGENT OBJECT PART TOOL PLACE
values from noun ontology: WordNet (80, 000 nouns)
semantic role labeling ontology:
FrameNet (8000 verbs)
creating imSituFrameNetVisualness
![Page 22: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/22.jpg)
FIXINGAGENT OBJECT PART TOOL PLACE
values from noun ontology: WordNet (80, 000 nouns)
semantic role labeling ontology:
FrameNet (8000 verbs)
Google Images SearchWeb N-grams
Filter Images creating imSitu
WordNet
FrameNetVisualness
![Page 23: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/23.jpg)
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
values from noun ontology: WordNet (80, 000 nouns)
semantic role labeling ontology:
FrameNet (8000 verbs)
Fill Valuescreating imSitu
WordNet
FrameNetVisualness
Filter Images
![Page 24: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/24.jpg)
imSitu: Dataset Statistics
Verbs 504Images 126,102
Situation/Image 3Roles (types) 1,788 (190)Nouns ( >=3) 11,538 (6,794)Annotations 1,481,851Images/Verb 200-400
Uniq. situations (>= 3) 205,095 (21,505)
Despite 80,000 possible values, 2/3 annotators on 76.8% of role-value
WordNet
creating imSituFrameNetVisualness
Filter ImagesFill Values
![Page 25: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/25.jpg)
Skew - not all verbs are equal
0 175 350 525 700
scoopingputting
feeding
fetching
climbing
stumbling
laughingbowing
inflating
shellingsnuggling
twisting
dancing
flossing
VERBS # NOUNS
food: milk
receiver: piglet receiver: dolpin
food: carrot
tool:pot tool: scooper
item:poop item: seed
![Page 26: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/26.jpg)
Skew - not all verbs are equal
0 175 350 525 700
scoopingputting
feeding
fetching
climbing
stumbling
laughingbowing
inflating
shellingsnuggling
twisting
dancing
flossing
VERBS # NOUNS
food: milk
receiver: piglet receiver: dolpin
food: carrot
tool:pot tool: scooper
item:poop item: seed
![Page 27: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/27.jpg)
Skew - not all verbs are equal
0 175 350 525 700
scoopingputting
feeding
fetching
climbing
stumbling
laughingbowing
inflating
shellingsnuggling
twisting
dancing
flossing
VERBS # NOUNS
food: milk
receiver: piglet receiver: dolpin
food: carrot
tool:pot tool: scooper
item:poop item: seed
![Page 28: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/28.jpg)
1 10 100 1000
car
man
elephant
zebra
fireplace
octopus
priestflower
nostril
ice
baconvacuum
bow
cherry
NOUNS # OF SEMANTIC ROLES
splashing.agent swimming.agent
riding.vehicle attacking.victim
colliding.agentdriving.vehicle
pumping.destbuckling.place
Skew - not all nouns are equal
![Page 29: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/29.jpg)
1 10 100 1000
car
man
elephant
zebra
fireplace
octopus
priestflower
nostril
ice
baconvacuum
bow
cherry
NOUNS # OF SEMANTIC ROLES
splashing.agent swimming.agent
riding.vehicle attacking.victim
colliding.agentdriving.vehicle
pumping.destbuckling.place
Skew - not all nouns are equal
![Page 30: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/30.jpg)
Situation RecognitionModels, Evaluation and Basic Results
structure matterssituation recognition improves object and activity recognition
![Page 31: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/31.jpg)
CLEANAGENT SOURCE DIRT TOOL PLACE
man chimney soot brush roof
Conditional Random FieldVGG
Convolutional Layers
Verb-Role-Noun Potential
Verb Potential
Neural Conditional Random Field
Backpropogate CRF loss through VGG
p(S|i; ✓) / v(v, i; ✓)Y
(r,nr)2F
e(v, r, nr, i; ✓)
![Page 32: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/32.jpg)
Qualitative Examples
SPEARING
AGENT PERSON PERSON
VICTIM FISH FISHPLACE OCEAN OCEAN
FALLING
AGENT PERSON PERSONSOURCE HORSE HORSE
DEST. GRND. GRND.PLACE FIELD FIELD
SWIMMING
AGENT SNAKE SNAKEPLACE OCEAN OCEAN
Gold Correct Incorrect
![Page 33: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/33.jpg)
Qualitative Examples
SHAVING
AGENT MAN PERSONCO-AGENT MAN MANBODYPART HEAD HEADSUBSTANCE S. CREAM
TOOL RAZOR RAZORPLACE INSIDE INSIDE
GIVING
AGENT SOLDIERRECIPIENT GIRL
ITEM BAGPLACE OUTSIDE
DETAINING
AGENT SOLDIERVICTIM MANPLACE OUTSIDE
Gold Correct Incorrect
![Page 34: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/34.jpg)
Verb
010203040506070
top-1 top-5
Baseline: 5040-way CNN Predictor (10 most frequent situation/verb)Situation CRF
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
Quantitive : Structured Prediction Crucial
![Page 35: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/35.jpg)
Verb
010203040506070
top-1 top-5
Baseline: 5040-way CNN Predictor (10 most frequent situation/verb)Situation CRF
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
Verb-Role-Noun
top-1 top-5
Quantitive : Structured Prediction Crucial
![Page 36: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/36.jpg)
Verb
010203040506070
top-1 top-5
Baseline: 5040-way CNN Classifer (10 most frequent situation/verb)Situation CRF
Full Structure
top-1 top-5
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
Verb-Role-Noun
top-1 top-5
Quantitive : Structured Prediction Crucial
![Page 37: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/37.jpg)
FEEDING
AGENT MANEATER BABYFOOD MILK
SOURCE BOTTLEPLACE ROOM
FEEDING
AGENT GIRLEATER HORSEFOOD CARROT
SOURCE HANDPLACE PEN
FEEDING
AGENT WOMANEATER HORSEFOOD MILK
SOURCE BOTTLEPLACE BARN
Test
Instances in train : 35 Instances in train : 7 Instances in train : 0
Generalize to Unseen Combinations
Train
![Page 38: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/38.jpg)
Activity
20
30
40
50
60
70
top-1 top-5
activitysituation
Situations Improves Object and Activity Recognition
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
![Page 39: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/39.jpg)
Activity
20
30
40
50
60
70
top-1 top-5
objectactivitysituation
FIXINGAGENT OBJECT PART TOOL PLACE
BOY CAR TIRE TIRE IRON OUTDOORS
Object
60
70
80
90
100
top-1 top-5
Situations Improves Object and Activity Recognition
![Page 40: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/40.jpg)
Errors
PRYING
AGENT PERSONITEM WOOD
SOURCE FLOORTOOL CROWBARPLACE ROOM
PAINTING SPRAYING
PUMPINGAGENT PERSON
ITEM AIRSOURCE AIR
DESTINATION WHEELTOOL PUMPPLACE OUTSIDE
![Page 42: CVPR Talk v2](https://reader033.fdocuments.us/reader033/viewer/2022052013/6285fc68af077c4a7b116b08/html5/thumbnails/42.jpg)
Conclusion
Introduced situation recognitionrole-centric structured representation of whats happening
Collected imSitu120k+ images, 500+ verbs, 100k+ situations
Introduced simple model neural CRF for situationstructure mattersprovides strong context for activity and object recognition
data/browsing/demo/code
imsitu.org