CSE 595 Words and Pictures
Transcript of CSE 595 Words and Pictures
SBU
Digital
Media
CSE 595 Words and Pictures
Tamara L. Berg
SUNY Stony Brook
Class Info
CSE 595: Words & Pictures
Instructor: Tamara Berg ([email protected])
Office: 1411 Computer Science
Lectures: Tues/Thurs 1:20-2:20pm, Rm 2129 CS
Office Hours: Tues/Thurs 2:20-3:20pm and by appt.
Course Webpage: http://tamaraberg.com/teaching/Fall_12/wordspics
About Me
• Joined Stony Brook in 2008.
  – PhD from UC Berkeley, 2007.
  – Yahoo! Research, 2007-2008.
• Research in computer vision and natural language processing - combining information from multiple forms of digital media for applications like image search and recognition.
You?
MS/PhD?
Experience in Computer Vision, Natural Language Processing, AI, Machine Learning?
Familiar with Matlab?
What’s in this picture?
What does the picture tell us?
Green, textured region – maybe tree?
Fuzzy black thing with a face-like part -- maybe an animal?
What do the words tell us?
Tags: leaves, endangered, green, i love nature, chennai, nilgiri langur, monkey, forest, wildlife, perch, black, wallpaper, ARK OF WILDLIFE, topv111, WeeklySurvivor, top20HallFame, topv333, 100v10f, captive, simian
What do words+picture tell us?
Tags: leaves, endangered, green, i love nature, chennai, nilgiri langur, monkey, forest, wildlife, perch, black, wallpaper, ARK OF WILDLIFE, topv111, WeeklySurvivor, top20HallFame, topv333, 100v10f, captive, simian
Consumer Photo Collections
Title: Over the hills and far away
Tags: Road, Hills, Germany, Hoffenheim, Outstanding Shots, specland, Baden-Wuerttemberg
Title: Heavenly
Tags: Peacock, AlbinoPeacock, WhiteBeauty, Birds, Wildlife, FeathredaleWildlifePark, PictureAustralia, ImpressedBeauty
Title: End of the world - Verdens Ende - The lighthouse 1
Tags: Verdens ende, end of the world, norway, lighthouse, ABigFave, vippefyr, wood, coal
Flickr – 3+ billion photographs, 3-5 million uploaded per day
Museum and Library Collections
Fine Arts Museum of San Francisco (82,000 images)
Woman of Head Howard H G Mrs Gift America North bust States United Sculpture marble
bowl stemmed small Irridescent glass
New York Public Library Digital Collection
The new board walk, Rockaway, Long Island
Part of New England, New York, east New Iarsey and Long Iland.
Web Collections
Billions of Web Pages
Video
OUTSIDE IN THE RAIN THE SENATOR WEARING HIS UH BASEBALL CAP A BOSTON RED SOX CAP AS HE TALKED TO HIS SUPPORTERS HERE IN THE RAIN THE UH SENATOR THEY'RE DOING HIS BEST TO TRY TO MAKE HIS CASE THAT HE WILL BE THE MAN FOR THE MIDDLE CLASS AND UH TRY TO CONVINCE HIS SUPPORTERS TO EXPRESS THEIR SUPPORT THROUGH A VOTE ON TUESDAY IN THERE WE ARE TWENTY FOUR HOURS FROM THE GREAT MOMENT THAT THE WORLD IN AMERICA IS WAITING FOR IT I NEED TO YOU IN THESE HOURS TO GO OUT AND DO THE HARD WORK NOT ON THOSE DOORS MAKE THOSE PHONE CALLS TO TALK TO FRIENDS TAKE PEOPLE TO THE POLLS HELP US CHANGE THE DIRECTION OF THIS GREAT NATION FOR THE BETTER CAN YOU IMAGINE A UH SENATOR BEGINNING HIS DAY IN FLORIDA TODAY
TrecVid 2006 – video frames with speech processing output
Consumer Products
Soft and glossy patent calfskin trimmed with natural vachetta cowhide, open top satchel for daytime and weekends, interior double slide pockets and zip pocket, seersucker stripe cotton twill lining, kate spade leather license plate logo, imported. 2.8" drop length 14"h x 14.2"w x 6.9"d Katespade.com
It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long. * Measures 38" from center back, hits at the knee. * Scoopneck, full skirt. * Hidden side zip, fully lined. * 100% Linen. Dry clean. bananarepublic.com
Internet retail transactions: $145 billion in 2006 and $175 billion in 2007 (Forrester Research).
Lots of Data!
What do we want to do?
Organize
Search
Browse
Computing Iconic Summaries for General Visual Concepts. R. Raguram and S. Lazebnik, 2008.
Image Search circa 2007
Image Search now
Image re-ranking for “monkey”
Tamara L. Berg, David A. Forsyth. Animals on the Web. CVPR 2006.
Visual shopping at like.com
Visual attribute discovery
Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.
Visual attribute discovery
J. Wang, K. Markert, and M. Everingham. "Learning models for object recognition from natural language descriptions". BMVC 2009.
Types of Words & Pictures
General web pages
Image re-ranking for “monkey”
Tamara L. Berg, David A. Forsyth. Animals on the Web. CVPR 2006.
Improving Search
Harvesting Image Databases from the Web. Schroff, F., Criminisi, A. and Zisserman, A. ICCV 2007.
Mining to build big computer vision data sets.
General web pages
Pros?
Cons?
Tags or keywords + images
Tags: canon, eos, macro, japan, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.
Gang Wang, Derek Hoiem, and David Forsyth, Building text features for object image classification. CVPR, 2009.
Using tags and similar images for novel image classification
Tag Order as implicit cue to expected size
“Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags”. Sung Ju Hwang and Kristen Grauman.
Tags or keywords + images
Pros?
Cons?
President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters
Captioned images
Captioned images for face labeling
Captions provide direct information about depiction!
Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation. Jie Luo, Barbara Caputo, Vittorio Ferrari. NIPS 2009.
Captioned images for face and pose labeling
Videos with transcripts
M. Everingham, J. Sivic, and A. Zisserman. 'Hello! My name is... Buffy' - Automatic naming of characters in TV video. BMVC 2006.
Videos with transcripts for face labeling
Learning by Watching
P. Buehler, M. Everingham, and A. Zisserman. "Learning sign language by watching TV (using weakly aligned subtitles)". CVPR 2009.
Learning Sign Language
Learning to Sportscast: A Test of Grounded Language Acquisition (2008). David L. Chen and Raymond J. Mooney.
Learning to Sportscast
Learning About Semantics
Traditional Recognition
car
shoe
person
Beyond traditional recognition
“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Scarlett O’Hara, Gone with the Wind.
Attributes
Visual attribute learning from text
Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.
Object relationships
Object relationships – prepositions & adjectives
Beyond Nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. Abhinav Gupta and Larry S. Davis. ECCV 2008.
Car is on the street
Cross-Language Learning
Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images. Shane Bergsma and Benjamin Van Durme, 2011.
Descriptive Text
Visually descriptive language offers:
1) information about the world, especially the visual world.
2) training data for how people construct natural language to describe imagery.
“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Scarlett O’Hara, Gone with the Wind.
Generating descriptions for images
Generating Captions for News Images with Articles
“How Many Words is a Picture Worth? Automatic Caption Generation for News Images”
Feng & Lapata 2010
Generating Simple Descriptions for images
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”
Baby Talk: Understanding and Generating Simple Image Descriptions (2011). Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg.
Im2Text: Describing Images Using 1 Million Captioned Photographs
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Stony Brook University
NIPS 2011
One of the many stone bridges in town that carry the gravel carriage roads.
An old bridge over dirty green water.
A stone bridge over a peaceful river.
Generate Natural Sounding Descriptions
Summary
Enormous amounts of data.
Lots of commercial and academic applications.
We should combine information from words & pictures intelligently.
Overall Class Goal
Gain exposure to interesting and current research on Words & Pictures.
No prior experience in Computer Vision or Natural Language Processing is required.
We will be reading a variety of research papers over the course of the semester
Please read the papers!
General knowledge lectures
Computer Vision
Natural Language Processing
Features & Representations
Clustering
Discriminative Models & Classification
Generative & Topic Models
Your responsibilities
Homework (30%) – 3 relatively simple assignments.
Project (30%) – final project including proposal, update, and final presentation & write-up.
Participation (30%) – read papers and participate in topic discussions.
Topic presentations (10%) – one in-class topic presentation in groups of 4-5.
Late assignments/projects will be accepted with a 10% reduction in value per day late.
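As a rough illustration of how the weights and the late policy might combine, here is a minimal Python sketch. It is not official course policy: it assumes the listed percentages (30/30/30/10) map in order to homework, project, participation, and topic presentation, that the late reduction is linear (10% of the earned score per day), and that a score cannot drop below zero. The names `late_adjusted`, `final_grade`, and `WEIGHTS` are hypothetical.

```python
# Sketch only: the linear penalty, the floor at zero, and the mapping of
# weights to components are assumptions, not stated course policy.
LATE_PENALTY_PER_DAY = 0.10  # "10% reduction in value per day late"

# Assumed mapping of the listed percentages (30/30/30/10) to the components.
WEIGHTS = {
    "homework": 0.30,
    "project": 0.30,
    "participation": 0.30,
    "topic_presentation": 0.10,
}


def late_adjusted(score: float, days_late: int) -> float:
    """Return the score after a linear per-day reduction, never below zero."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    remaining = max(0.0, 1.0 - LATE_PENALTY_PER_DAY * days_late)
    return score * remaining


def final_grade(scores: dict) -> float:
    """Weighted sum of component scores, each on a 0-100 scale."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    # Hypothetical student: project handed in two days late.
    scores = {
        "homework": 85.0,
        "project": late_adjusted(90.0, days_late=2),
        "participation": 100.0,
        "topic_presentation": 80.0,
    }
    print(round(final_grade(scores), 2))
```

Under these assumptions a submission ten or more days late earns no credit. A compounding penalty (multiplying by 0.9 per day) would be a different reading of the same sentence.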
Homework & Projects
Assignments should be completed individually in Matlab.
Projects will be in groups of 3 and can be completed in the language of your choice on the topic of your choice (must involve text and images/video).
Participation Experiment
Goal: interesting, lively discussions about research topics.
To encourage this, at the end of each class please submit a sheet of paper noting how many (if any) questions you posed, answers you provided, or significant comments you made.
If this does not work, we will revert to having short sporadic pop quizzes on papers.
Note about papers
You won’t understand everything, especially at first.
Don’t sweat the small stuff.
Try to grasp the overall idea, what’s novel, what’s interesting, pros/cons of the method, how it relates to other things we’ve read.
Topic Presentations
You will give one topic presentation during the semester in groups of 4-5.
Suggested papers for each topic presentation are listed on the course website.
You are welcome to swap papers (if relevant to your topic), but please ask me at least 1 week prior to the presentation.
Reference Books
1) Forsyth, David A., and Ponce, J. Computer Vision: A Modern Approach, Prentice Hall, 2003.
2) Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, Cambridge University Press, 2002.
3) Jurafsky and Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2008.
4) Christopher D. Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing.
For next class
Get access to Matlab.
  Student Matlab licenses can be purchased from MathWorks for $99.
Do a Matlab tutorial.
  One link is on the course website; many others are available online.