CSE 595 Words and Pictures

SBU

Digital

Media


Tamara L. Berg

SUNY Stony Brook


Class Info

CSE 595: Words & Pictures

Instructor: Tamara Berg ([email protected])

Office: 1411 Computer Science

Lectures: Tues/Thurs 1:20-2:20pm, Rm 2129 CS

Office Hours: Tues/Thurs 2:20-3:20pm and by appt.

Course Webpage: http://tamaraberg.com/teaching/Fall_12/wordspics

http://tamaraberg.com/

http://tamaraberg.com/teaching/Spring_11/wordspics/






About Me

• Joined Stony Brook in 2008
  – PhD from UC Berkeley, 2007
  – Yahoo! Research, 2007-2008

• Research in computer vision and natural language processing - combining information from multiple forms of digital media for applications like image search and recognition.


You?

MS/PhD?

Experience in Computer Vision, Natural Language Processing, AI, Machine Learning?

Familiar with Matlab?


What’s in this picture?


What does the picture tell us?

Green, textured region – maybe tree?

Fuzzy black thing with a face-like part -- maybe an animal?


What do the words tell us?

Tags: leaves, endangered, green, i love nature, chennai, nilgiri langur, monkey, forest, wildlife, perch, black, wallpaper, ARK OF WILDLIFE, topv111, WeeklySurvivor, top20HallFame, topv333, 100v10f, captive, simian


What do words+picture tell us?

Tags: leaves, endangered, green, i love nature, chennai, nilgiri langur, monkey, forest, wildlife, perch, black, wallpaper, ARK OF WILDLIFE, topv111, WeeklySurvivor, top20HallFame, topv333, 100v10f, captive, simian


Consumer Photo Collections

Over the hills and far away

Road, Hills, Germany, Hoffenheim, Outstanding Shots, specland, Baden-Wuerttemberg

Heavenly

Peacock, AlbinoPeacock, WhiteBeauty, Birds, Wildlife, FeathredaleWildlifePark, PictureAustralia, ImpressedBeauty

End of the world - Verdens Ende - The lighthouse 1

Verdens ende, end of the world, norway, lighthouse, ABigFave, vippefyr, wood, coal

Flickr – 3+ billion photographs, 3-5 million uploaded per day


Museum and Library Collections

Fine Arts Museum of San Francisco (82,000 images)

Woman of Head Howard H G Mrs Gift America North bust States United Sculpture marble

bowl stemmed small Irridescent glass

New York Public Library Digital Collection

The new board walk, Rockaway, Long Island

Part of New England, New York, east New Iarsey and Long Iland.


Web Collections

Billions of Web Pages


Video

OUTSIDE IN THE RAIN THE SENATOR WEARING HIS UH BASEBALL CAP A BOSTON RED SOX CAP AS HE TALKED TO HIS SUPPORTERS HERE IN THE RAIN THE UH SENATOR THEY'RE DOING HIS BEST TO TRY TO MAKE HIS CASE THAT HE WILL BE THE MAN FOR THE MIDDLE CLASS AND UH TRY TO CONVINCE HIS SUPPORTERS TO EXPRESS THEIR SUPPORT THROUGH A VOTE ON TUESDAY IN THERE WE ARE TWENTY FOUR HOURS FROM THE GREAT MOMENT THAT THE WORLD IN AMERICA IS WAITING FOR IT I NEED TO YOU IN THESE HOURS TO GO OUT AND DO THE HARD WORK NOT ON THOSE DOORS MAKE THOSE PHONE CALLS TO TALK TO FRIENDS TAKE PEOPLE TO THE POLLS HELP US CHANGE THE DIRECTION OF THIS GREAT NATION FOR THE BETTER CAN YOU IMAGINE A UH SENATOR BEGINNING HIS DAY IN FLORIDA TODAY

TrecVid 2006 – video frames with speech processing output


Consumer Products

Soft and glossy patent calfskin trimmed with natural vachetta cowhide, open top satchel for daytime and weekends, interior double slide pockets and zip pocket, seersucker stripe cotton twill lining, kate spade leather license plate logo, imported. 2.8" drop length 14"h x 14.2"w x 6.9"d Katespade.com

It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long. * Measures 38" from center back, hits at the knee. * Scoopneck, full skirt. * Hidden side zip, fully lined. * 100% Linen. Dry clean. bananarepublic.com

Internet retail transactions totaled $145 billion in 2006 and $175 billion in 2007 (Forrester Research).


Lots of Data!


What do we want to do?



Organize

Search

Browse



Computing Iconic Summaries for General Visual Concepts. R. Raguram and S. Lazebnik, 2008.



Image Search circa 2007



Image Search now



Image re-ranking for “monkey”

Tamara L. Berg, David A. Forsyth. Animals on the Web. CVPR 2006.



Visual shopping at like.com



Visual attribute discovery

Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.



Visual attribute discovery

J. Wang, K. Markert, and M. Everingham. "Learning models for object recognition from natural language descriptions." BMVC 2009.


Types of Words & Pictures


General web pages


General web pages

Image re-ranking for “monkey”

Tamara L. Berg, David A. Forsyth. Animals on the Web. CVPR 2006.

Improving Search


General web pages

Schroff, F., Criminisi, A. and Zisserman, A. Harvesting Image Databases from the Web. ICCV 2007.

Mining to build big computer vision data sets.


General web pages

Pros?

Cons?


Tags or keywords + images

Tags: canon, eos, macro, japan, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.



Gang Wang, Derek Hoiem, and David Forsyth, Building text features for object image classification. CVPR, 2009.

Using tags and similar images for novel image classification



Tag Order as implicit cue to expected size

“Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags.” Sung Ju Hwang and Kristen Grauman.



Tags: canon, eos, macro, japan, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.

Pros?

Cons?


President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Captioned images


President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Captioned images for face labeling

Captions provide direct information about depiction!


Jie Luo, Barbara Caputo, Vittorio Ferrari. Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation. NIPS 2009.

Captioned images for face and pose labeling


Videos with transcripts


M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" - Automatic naming of characters in TV video. BMVC 2006.

Videos with transcripts for face labeling


Learning by Watching


P. Buehler, M. Everingham, and A. Zisserman. "Learning sign language by watching TV (using weakly aligned subtitles)". CVPR 2009.

Learning Sign Language


David L. Chen and Raymond J. Mooney. Learning to Sportscast: A Test of Grounded Language Acquisition. 2008.

Learning to Sportscast

http://www.cs.utexas.edu/~ai-lab/people-view.php?PID=289

http://www.cs.utexas.edu/~ai-lab/people-view.php?PID=125


Learning About Semantics


Traditional Recognition

car

shoe

person


Beyond traditional recognition


Beyond traditional recognition

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Scarlett O’Hara, Gone with the Wind.


Attributes

Visual attribute learning from text

Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.


Object relationships


Object relationships

Object relationships – prepositions & adjectives

Abhinav Gupta and Larry S. Davis. Beyond Nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. ECCV 2008.

Car is on the street


Cross-Language Learning

Shane Bergsma and Benjamin Van Durme. Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images. 2011.


Descriptive Text

Visually descriptive language offers:

1) Information about the world, especially the visual world.

2) Training data for how people construct natural language to describe imagery.

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Scarlett O’Hara, Gone with the Wind.


Generating descriptions for images


Generating Captions for News Images with Articles

“How Many Words is a Picture Worth? Automatic Caption Generation for News Images.” Feng & Lapata, 2010.


Generating Simple Descriptions for images

“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg. Baby Talk: Understanding and Generating Simple Image Descriptions. 2011.

http://www.tamaraberg.com/papers/generation_cvpr11.pdf

http://girishkulkarni008.web.officelive.com/aboutus.aspx

http://www.sagnikdhar.com/

http://www.cs.stonybrook.edu/~ychoi

http://acberg.com/

http://www.tamaraberg.com/
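
The sample description on the Baby Talk slide has a very regular shape: an inventory sentence, then one sentence per object listing its spatial relations. A minimal sketch of how such text could be produced by template filling is given below, in Python rather than Matlab. It is only an illustration: the function names, the hand-written object list, and the relation triples are all assumptions, and the sketch does not reproduce the paper's detectors or language model.

```python
from collections import Counter

# Hypothetical helper data: number words for small counts.
NUMBER_WORDS = {1: "one", 2: "two", 3: "three"}


def join_phrases(phrases):
    """Join phrases as 'a, b, and c' (a comma is kept even for two items)."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + ", and " + phrases[-1]


def noun_phrase(obj):
    """Build 'the [attribute] noun' from a (noun, attribute-or-None) pair."""
    noun, attribute = obj
    return f"the {attribute} {noun}" if attribute else f"the {noun}"


def describe(objects, relations):
    """Fill fixed templates from objects and (subject, preposition, object) index triples."""
    counts = Counter(noun for noun, _ in objects)
    inventory = []
    for noun, n in counts.items():
        word = NUMBER_WORDS.get(n, str(n))
        # Naive pluralization, an assumption that is only good enough for a toy example.
        inventory.append(f"{word} {noun}" if n == 1 else f"{word} {noun}s")
    sentences = ["This picture shows " + join_phrases(inventory) + "."]

    # Group relation phrases by subject, preserving first-seen order.
    by_subject = {}
    for subj, prep, obj in relations:
        by_subject.setdefault(subj, []).append(f"{prep} {noun_phrase(objects[obj])}")
    for subj, phrases in by_subject.items():
        subject_np = noun_phrase(objects[subj])
        sentences.append(subject_np[0].upper() + subject_np[1:] + " is " + join_phrases(phrases) + ".")
    return " ".join(sentences)


# Hand-written stand-ins for detector output (assumed, not from the paper).
objects = [("person", None), ("grass", "green"), ("chair", None), ("potted plant", None)]
relations = [(0, "near", 1), (0, "in", 2), (1, "by", 2), (1, "near", 3)]
print(describe(objects, relations))
```

With these inputs the sketch prints the same three sentences quoted on the slide, which shows how much of the "simple description" style can come from the templates alone once objects, attributes, and prepositions are known.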



Im2Text: Describing Images Using 1 Million Captioned Photographs

Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Stony Brook University

NIPS 2011

One of the many stone bridges in town that carry the gravel carriage roads.

An old bridge over dirty green water.

A stone bridge over a peaceful river.

Generate Natural Sounding Descriptions


Summary

• Enormous amounts of data.

• Lots of commercial and academic applications.

• We should combine information from words & pictures intelligently.


Overall Class Goal

Gain exposure to interesting and current research on Words & Pictures.

No prior experience in Computer Vision or Natural Language Processing is required.

We will be reading a variety of research papers over the course of the semester

Please read the papers!


General knowledge lectures

• Computer Vision

• Natural Language Processing

• Features & Representations

• Clustering

• Discriminative Models & Classification

• Generative & Topic Models


Your responsibilities

• Homework – 3 relatively simple assignments.

• Project – final project including proposal, update, and final presentation & write-up.

• Participation – read papers and participate in topic discussions.

• Topic presentations – one in-class topic presentation in groups of 4-5.

30%

30%

30%

10%

Late assignments/projects will be accepted with a 10% reduction in value per day late.
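
As a worked example of the late policy, the sketch below computes the credit kept after a given number of late days. It is written in Python for illustration, and it assumes the reduction is linear (each day removes 10% of the original value, with a floor at zero); the slide does not say whether the penalty is linear or compounding, and the function name is hypothetical.

```python
def late_adjusted_score(raw_score, days_late, penalty_per_day=0.10):
    """Return the score after the late penalty.

    Assumption: the penalty is linear in the original value, so each day late
    removes `penalty_per_day` of the raw score, and the result never drops below 0.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    kept_fraction = max(0.0, 1.0 - penalty_per_day * days_late)
    return raw_score * kept_fraction


# A submission worth 90 points handed in 2 days late keeps 80% of its value.
print(late_adjusted_score(90, 2))
```

Under this reading, a submission that is 10 or more days late earns no credit.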


Homework & Projects

Assignments should be completed individually in Matlab.

Projects will be in groups of 3 and can be completed in the language of your choice on the topic of your choice (must involve text and images/video).


Participation Experiment

Goal: interesting, lively discussions about research topics.

To encourage this, at the end of each class please submit a slip of paper noting how many (if any) questions you posed, answers you provided, or significant comments you made.

If this does not work, we will revert to having short sporadic pop quizzes on papers.


Note about papers

• You won’t understand everything, especially at first.

• Don’t sweat the small stuff.

• Try to grasp the overall idea, what’s novel, what’s interesting, pros/cons of the method, how it relates to other things we’ve read.


Topic Presentations

You will give one topic presentation during the semester in groups of 4-5.

Suggested papers for each topic presentation are listed on the course website.

You are welcome to swap papers (if relevant to your topic), but please ask me at least 1 week prior to the presentation.


Reference Books

1) Forsyth, David A., and Ponce, J. Computer Vision: A Modern Approach, Prentice Hall, 2003.

2) Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, Cambridge University Press, 2002.

3) Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2008.

4) Christopher D. Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing.

http://www.amazon.com/Computer-Vision-Approach-David-Forsyth/dp/0130851981

http://www.amazon.com/Multiple-View-Geometry-Computer-Vision/dp/0521540518

http://www.amazon.com/Speech-Language-Processing-Introduction-Computational/dp/0130950696



http://www.amazon.com/Foundations-Statistical-Natural-Language-Processing/dp/0262133601



For next class

• Get access to Matlab. Student Matlab licenses can be purchased from MathWorks for $99.

• Do a Matlab tutorial. One link is on the course website; many others are available online.







