Video Search: Opportunities and Challenges
Keynote Speech at ACM MIR WorkshopACM Multimedia Conference 2005
Singapore
Dr. Ramesh R. Sarukkai
Yahoo! Search{ [email protected] [email protected] }
About the Presenter
Dr. Ramesh Rangarajan Sarukkai currently heads up Video Search engineering at Yahoo where he manages video search/core media search infrastructure. Over his 6+ years tenure at Yahoo, Dr. Sarukkai has grown and managed different teams including the Small Business (e-commerce stores, hosting, domains) billing platform, and architected wireless applications & voice portals including 1-800-my-yahoo. Prior to Yahoo, he has worked in Kurzweil A.I., IBM Watson Lab, and has collaborated with HP Labs.
Dr. Sarukkai is the author of the book entitled “Foundations of Web Technology” (Kluwer/Springer 2002). In addition, he holds the first patent on large vocabulary voice web browsing, and has a number of patents awarded/pending in the areas of speech, natural language modeling, personalized search, information retrieval, optimal sponsored search, media and wireless technologies. Dr. Sarukkai also served on the World Wide Web Consortium (W3C) Voice Browser working group, recently chaired a panel in World Wide Web 2005 conference (Japan) on networking effects of the Web, and has contributed as a reviewer/Program Committee in a number of leading conferences/journals. His current interests include next-generation media search, social annotation, networking effects on the Web, emergent web technologies and new business models.
Outline• Introduction
• Market Trends
• Video Search
• Opportunities
• Challenges
• Conclusion
Introduction
Key questions discussed in this talk:• Why is video search important?• What broadly constitutes video search?• How is Web Video Search different?• What are some opportunities and challenges?
Outline• Introduction
• Market Trends
• Video Search
• Opportunities
• Challenges
• Conclusion
Market Trends
Whoever controls the media - the images - controls the culture.
- Allen Ginsberg
American Poet
Market Trends• Broadband doubling over next 3-5 years• Video enabled devices are emerging rapidly• Emergence of mass internet audience• Mainstream media moving to the Web• International trends are similar• Money Follows…
Market Trends• How many of you are aware of video on the Web?• How many have viewed a video on the Web in
• The last 3 months?• The last 6 months?• Ever?
• Would you watch video on your devices (ipod/wireless)?
• How many of you have produced video (personal or otherwise) recently?
• How many of you have shared that with your friends/community? Would you have liked to?
Market Trends• How many of you are aware of video on the Web?
• Large portion of online users
• How many have viewed a video on the Web in • The last 3 months?
• 50%• The last 6 months?• Ever?
• Would you watch video on your devices (ipod/wireless)?• 1M downloads in 20 days (iPod)
• How many of you have produced video (personal or otherwise) recently?
• Continuing to skyrocket with digital camera phones/devices
• How many of you have shared that with your friends/community? Would you have liked to?
• Huge interest & adoption in viral communities
* Source: Forrester Report
Market Trends
• Technology more media friendly• Storage costs plummeting (GB TB)• CPU speed continuing to double (Moore’s law)• Increased bandwidth• Device support for media• Adding media to sites drives traffic• Web continues to propel scalable infrastructure
for media products/communities
Market Trends
Democratization of Mass Media in the 21st Century (aka the long tail)
• Of the People (media is ultimately defined by users)
• By the people (production,tagging)
• For the people (consumption, devices, personalization)
- Abraham Lincoln as a Media Mogul in the 21st Century!
“The Long Tail” was coined by Chris Andersen’s article in the Wired magazine.
Market Trends
Mainstream media migrating online & interactive
• Media content online is growing• Video Streaming continuing to grow aggressively online.• More interactive in nature (user voting):
• MTV Interactive• American Idol• Reality shows
Market Trends
Video search is a key part of the future of digital media
Outline• Introduction
• Market Trends
• Video Search
• Opportunities
• Challenges
• Conclusion
Video Search
What is Video Search?
Multimedia? As far as I'm concerned, it's reading with the radio on!
- Rory BremnerActor/Writer
Video Search
Meta-Content
Video Production
UnstructuredData
Video Consumption/Community
Content Features +Meta-Content
Structured +Unstructured Data
Video Search
• Media Information Retrieval• Been around since the late 1970s
• Text Based/DB• Issues: Manual Annotation, Subjectivity of Human Perception
• Content Based• Color, texture, shape, face detection/recognition, speech
transcriptions, motion, segmentation boundaries/shots• High Dimensionality• Limited success to date.
Citations:
“Image Retrieval: Current Techniques, Promising Directions, and Open Issues” [Rui et al 99]
Video Search
Active Research Area
Graph from “A new perspective on Visual Information Retrieval”, Horst Eidenberger, 2004Black: “Image Retrieval”; Grey:”Video Retrieval”; IEEE Digital Library
Video Search
Traditional 3-Step Overview:
• Meta-Data/Visual Feature Extraction• Multi-dimensional indexing• Retrieval System design
Video Search
Popular features/techniques:• Color, Shape, Texture, Shape descriptors• OCR, ASR• A number of prototype or research products with
small data sets• More researched for visual queries
Video Search: Features
1970 200019901980 2010
Texture: Autocorrelation; Wavelet transforms; Gabor Filters
Shape: Edge Detectors; Moment invariants; Animate Vision Marr; Finite Element Methods; Shape from Motion;
Color: Color Moments Color Histograms Color Autocorrelograms
Segmentation: Scene segmentation; Scene Segmentation; Shot detection;
OCR: Modeling; Successful OCR deployments;
Face: Face Detection algorithms; Neural Networks; EigenFaces
ASR: Acoustic analysis; HMMS; N-grams; CSR; LVCSR; Domain Specific;
NIST Video TREC StartsMedia IR systems
Web Media Search
Video Search: Features
Color• Robust to background • Independent of size, orientation• Color Histogram [Swain &
Ballard]• “Sensitive to noise and
sparse”- Cumulative Histograms [Stricker & Orgengo]
• Color Moments• Color Sets: Map RGB Color
space to Hue Saturation Value, & quantize [Smith, Chang]
• Color layout- local color features by dividing image into regions
• Color Autocorrelograms
Texture• One of the earliest Image
features [Harlick et al 70s]• Co-occurrence matrix• Orientation and distance on gray-
scale pixels• Contrast, inverse deference
moment, and entropy [Gotlieb & Kreyszig]
• Human visual texture properties: coarseness, contrast, directionality, likeliness, regularity and roughness [Tamura et al]
• Wavelet Transforms [90s]• [Smith & Chang] extracted mean
and variance from wavelet subbands
• Gabor Filters• And so on
Region Segmentation• Partition image into regions• Strong Segmentation: Object
segmentation is difficult.• Weak segmentation: Region
segmentation based on some homegenity criteria
Scene Segmentation• Shot detection, scene detection• Look for changes in color,
texture, brightness• Context based scene
segmentation applied to certain categories such as broadcast news
Video Search: FeaturesFace
• Face detection is highly reliable- Neural Networks [Rwoley]- Wavelet based histograms of facial features [Schneiderman]
• Face recognition for video is still a challenging problem.- EigenFaces: Extract eigenvectors and use as feature space
OCR• OCR is fairly successful
technology.• Accurate, especially with good
matching vocabularies.• Script recognition still an open
problem.
ASR• Automatic speech recognition
fairly accurate for medium to large vocabulary broadcast type data
• Large number of available speech vendors.
• Still open for free conversational speech in noisy conditions.
Shape
• Outer Boundary based vs. region based
• Fourier descriptors• Moment invariants• Finite Element Method
(Stiffness matrix- how each point is connected to others; Eigen vectors of matrix)
• Turing function based (similar to Fourier descriptor) convex/concave polygons[Arkin et al]
• Wavelet transforms leverages multiresolution [Chuang & Kao]
• Chamfer matching for comparing 2 shapes (linear dimension rather than area)
• 3-D object representations using similar invariant features
• Well-known edge detection algorithms.
Video Search: Video TREC• Overview:
• Shot detection, story segmentation, semantic feature extraction, information retrieval
• Corpora of documentaries, advertising films• Broadcast news added in 2003• Interactive and non-interactive tests• CBIR features• Speech transcribed (LIMSI)• OCR
Video Search: Video TREC• 1st TREC (2001)
• Mean Average Precision @100 items (MAP) 0.033 (in general category)• Transcript only was better than transcript+image aspects+ASR• Also there were different types of test: known items versus general. Known items
had at least one result.• 2nd TREC (2002)
• Interactive runs to rephrase queries• Again text based on only ASR was the best performing system• Mean Average precision of 0.137• Leading system employed multiple systems: TF-IDF variants (Mean Average
Precision MAP 0.093). • Other was boolean with query expansion (0.101 MAP)• OCR was not particularly applicable; Phonetic ASR did not help.
• 3rd TREC (2003)• Data changes radically (Broadcast news added – CNN, ABC, CSPAN)• Baseline ASR + CC MAP 0.155• ASR + CC + VOCR + Image Similarity + Person X retrieval MAP 0.218
* “Successful Approaches in the TREC Video Retrieval Evaluations”, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004.
Video Search: Browsing• Search by text• Navigation with customized categories • Random Browsing • Search by example• Search by Sketch
Video Search:• Pre-2000 or Research
• QBIC (IBM Almaden)• Photobook (MIT Media Lab)• FourEyes (MIT Media Lab)• Netra (UCSB Digital Library)• MARS (UCI)• PicToSeek (UVA.nl)• VisualSEEK (Columbia)• PicHunter (NEC)• ImageRover (BU)• WebSEEK (Columbia)• Virage (now Autonomy)• Visual RetrievalWare (Convera)• AMORE (NEC)• BlobWorld (UC Berkeley)
Video Search
We have discussed the background of video search, techniques used, and applications. It should be clear that “video search” is a fairly broad term.
Video Search
Note the differences in these broader definitions of video search:• No mention of content based versus meta-data based.• No mention of the actual browsing paradigm.• Production and consumption are a key aspect of video
search, and should be taken into core consideration in the models.
• Device, personalization and on-demand needs addressed.
Video Search
Video Search has evolved a lot since its inception and opportunities much broader.
Multimedia technologists (We) are defining and influencing the emergent media culture!
Outline• Introduction
• Market Trends
• Video Search
• Opportunities
• Challenges
• Conclusion
Opportunities
Engineering is the art of compromise, and there is always room for improvement in the real world; but
engineering is also the art of the practical. Engineers realize that they must, at some point curtail design,
and begin to manufacture or build
- H. Petrovski
Invention by design (Harvard Univ. Press)
Opportunities
The Internet Revolution• Exposed the internet backbone to the masses• Standardized publication of web media
(HTML,CSS,XML)• Scaled to millions of users• Pre-dot-com bust- slow emergence of media
search (notably AltaVista Audio-Visual search)• Post-dot-com bust
• Recent trends of community tagging especially for Images: Flickr (Y!), DeliCious, and so on.
Opportunities• Media Search Now
• Yahoo! Image, Video, Audio, Podcast searches• Flickr(Y!)• AOL/SingingFish• BlinxTV• Google• Many smaller companies
OpportunitiesWeb Based Video Search
• Web Search meets Media Search• Leverage Meta-content from:
• Web (inferred meta-data)• Community (Tagging and user submission/mRSS)• Structured Meta-Data from different sources
• Augment with limited media analysis• OCR, ASR, Classification
Video Search
Meta-Content
Video Production
UnstructuredData
Video Consumption/Community
Content Features +Meta-Content
Structured +Unstructured Data
Web Video Search
Traditional M
edia Search
Opportunities Yahoo! Media Search
Billions of Images
Millions of Videos
Hundreds of Millions of users
Reasonable Precision rates
Opportunities
Example 1:• User query “zorro”
• User 1 wants to see Zorro videos• User 2 wants to see Legend of Zorro movie clips• User 3 just wants to see home videos about Zorro
• Can content based analysis help over structured meta-data query inference?
Opportunities
Example 2:• For main-stream head content such as news videos.• Meta-data are fairly descriptive • Usually queried based on non-visual attributes.
• Task: “Pull up recent Hurricane Katrina videos”
Opportunities
Example 3:• Creative Home Video• Community video rendering!
• The now “famous” Star Wars Kid• Example of “social buzz” combined with innovative tail
content video production.
Play Video 1 Play Video 2
Opportunities
A hard example:• Lets take an example:
“Supposing you want to find videos that depict a monkey/chimp doing karate”!
• CBIR Approach:
Train models for Chimps/Monkeys
Motion Analysis for Karate movement models
Many open issues/problems!
Opportunities
What are the key opportunities for Video Search?
Automatic Speech Recognition (ASR) as a comparable case study
OpportunitiesSpeech Recognition: Case study
• From small experimental models to usable large scale products in restricted domains.
• 60’s – Very small experimental tests; Digit recognition; • 70’s – Core Acoustic modeling work; Vocal Tract Modeling;
Formant analysis; Signal Processing (FFT, MFCC)• 80’s – HMMs; Statistical Language Models; Large collections of
acoustic data. Very Large collection of non-acoustic Text Data; Speaker ID;
• 90’s – Large vocabulary speech recognition deployments in many domains (IVR, LVCSR)
• Natural Language Understanding, Core Background Noise/Multiple speakers, Spontaneous Dialog still open research problems
Opportunities
1970 200019901980 2010
Acoustic Modeling Vocal Tract Modeling Formant Analysis DSP: FFT, MFCC
Digit Recognizers Viterbi search; HMMs Sub-phonetic Models
FSMs, Grammars, Statistical N-grams
Digit Recognition; Small Vocabulary Discrete Medium Vocabulary Continuous Large Vocabulary Continuous LVCSR Telephony/Broadcast
Opportunities
1970 200019901980 2010
Acoustic Modeling Vocal Tract Modeling Formant Analysis DSP: FFT, MFCC
Digit Recognizers Viterbi search; HMMs Sub-phonetic Models
FSMs, Grammars, Statistical N-grams
Digit Recognition; Small Vocabulary Discrete Medium Vocabulary Continuous Large Vocabulary Continuous LVCSR Telephony/Broadcast
Millions of non-audio Text Data
Hundreds of Hours of AudioData
Modeling FrameworkMilestones
Opportunities
Speech Recognition Case Study Key Milestones:• Foundations: HMMs, Viterbi search, Statistical Language
Models• Creative use of Data: Millions of non-audio text data from
Wall Street Journal Corpus applied for Language model training
• Data Crunching at the lowest levels: Hundreds of hours of data applied to model thousands of sub-phonetic models
Opportunities
Speech Recognition Case Study: • Exploiting non-media sources of meta/text data are
beneficial• Applying contextual restraints enables improved
applicability• Large scale data crunching does indeed help solve
hard problems
Opportunities
Web Video Search:• Mass Adoption: Large Community of
users/taggers/web developers• Different sources of Media and non-Media data• Highly Scalable systems
Opportunities
The ASR takeaways can be applied to video search: • Exploiting non-media sources of meta/text data are
beneficial (Web Meta-data, Community Tagging)• Applying contextual restraints enables improved
applicability (User models, structured data,Tailored)• Large scale data crunching does indeed help solve
hard problems (Throw Horsepower)
OpportunitiesWeb Browsing Paradigms
• Search by text (Very popular)• Navigation with customized categories (Very
Popular)• Random Browsing (Popular for certain
categories: Discovery)• Search by example (Not popular currently)• Search by Sketch (Not popular)• Combined usage models - users skipping to
scan and then pick selective videos to play.
Opportunities
• Leverage Community• Exploit Structured Meta-Data• Constrain with Context
Opportunities
Tying into social network modeling for media search
• How do you define social networks?• How do you integrate into video search
indexing/ranking?• How do you filter out the noise from the
authoritative ones?
Opportunities
Exploit Structured Meta-Data• Video has a number of different editorial
sources; Exploit such structured data• Tailor results to users’ perceived needs
Music Videos, Movies – aka traditional video markets
Opportunities
Constrain with context• Occam’s razor: Simpler the model, lesser the number of
parameters to train, the more general the model.• Rather than throw in all meta + media feature vectors,
tailor the model;• A decision tree based approach to picking the right
algorithms.• E.g. News broadcast will require certain type of content
features and meta-data.
• Context can also be constrained by user (preferences), device parameters (media type, geo-location)
Outline• Introduction
• Market Trends
• Video Search
• Opportunities
• Challenges
• Conclusion
Challenges A man will occasionally stumble over truth, but
usually manages to pick himself up, walk over or around it and carry on.
- Winston Churchill
Challenges
• Theoretical Frameworks• Leverage CBIR solutions• Learn User Needs/Behavior• Inherit Web Search Problems
ChallengesTheoretical Frameworks
• Establishing the right theoretical frameworks for video search is still a problem for a variety of reasons:
• Lack of proper classified data• Large dimensionality; Different sources of meta-data.• Difficult to define the correct “perceived relevance” metric
(“Semantic Gap*”)• Cost functions such as “average precision” are not
differentiable & need to be approximated with heuristics• Integrate with Application specific cost functions: E.g.
search: Click through rates, clicks, video views, customer time spent
• Sometimes ranking needs to be based on user buzz, quality and other non-IR metrics
* “Content-Based Image Retrieval at the End of the Early Years”, Arnold Smeulders et al, IEEE PAMI, Dec. 2000
ChallengesLeverage CBIR
• Video TREC Summary:• Text based retrieval still is much better than non-text based
searches.• Visual queries have poor MAP, but amenable for
interactive search.• Commercial detectors work okay with broadcast news, but
not okay with documentaries with no commercials. (also face detection)
• Visual features are seldom useful: Not without exceptions, edge detectors usually helped only in aircraft and animal categories when color features were not as discriminative.
• ASR was not as useful in tasks such as shot detection (as opposed to indexing whole clip). OCR is more closer to the actual occurrence of the word.
•“Successful Approaches in the TREC Video Retrieval Evaluations”, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004. “Video Retrieval using Speech and Image Information”, Alex Hauptmann et al.
Challenges
Leverage CBIR• Apply what works in a tailored manner:
• Color, Texture, OCR, Speech Recognition• Classification (in restricted domains)• Copyright infringement prevention• Story-boarding/Shots• Music/Speech Detection
• Apply other techniques in selective domains• Broadcast news: ASR, Segmentation, Speaker Detection• Motion for event detection restricted to time regions
isolated by ASR
Challenges Overcome the temptation to “solve the
computer vision” problem*.• Yet leverage computer vision techniques• Today, CBIR is mostly image based• Leverage motion flow information more• Apply techniques in compressed domain (or
leverage compressed data- e.g. MPEG-4 already computes local/global motion)
• Strong Image Segmentation is hard- Consistent Segmentation across image sequences in video can be exploited.
* “Guest Introduction: The changing Shape of Computer Vision in the Twenty-First Century”, M. Shah, IJCV, 2002
ChallengesUser Needs/Behavior
• Understand user needs• Better modeling of query and user intent for video search• Not much work has been done on user intent analysis for
video search
• Leverage User Behavior• Implicit versus explicit relevance feedback• Challenges in tracking user feedback• Blind/Implicit feedback is useful for limited domains
• How do you assess user relevance?• More challenging for video since there is a temporal
duration associated with the actions.• Explicit ratings are an option, but not incentivizable
Challenges
Inherit Web Search Problems• Meta-data Spam
• Users spam through Web or tagging• Leverage Web Search solutions of spam detection
based on text, linkage analysis, or user disrepute.
• Potential dilution due to “long tail”• How do you balance classic web search
ranking with media based ranking?
Outline• Introduction
• Market Trends
• Video Search
• Opportunities
• Challenges
• Conclusion
Conclusion If you look back too much, you will soon be
headed that way.
- Unknown
ConclusionIn this presentation, we covered:• Market trends that are fuelling large scale
demand globally for video search.• The changing nature of video search
largely driven by the Web.• Discussed different systems and the
techniques that have been applied to media search.
• Identified key areas of opportunities and challenges.
Conclusion
Video Search is here to stay. We (the attendees) have an opportunity to
define and shape next generation media production, consumption, usage and
ultimately the culture.
ConclusionMany thanks to:
• MIR Workshop organizers• Qi Tian• Alex Jaimes• ACM Multimedia 2005 Conference Organizers
• Y! Search• John Thrall• Qi Lu• Bradley Horowitz• Video Search team, esp. Ruofei (Bruce) Zhang for help with
references• Y! Blore R&D: Sesha Shah, Srinivasan, YBL (Marc Davis et al) & YRL
(Malcolm Slaney)
- Prof. S.K. Rangarajan for help with quotes
Conclusion
Thanks to the audience
Questions?
Appendix: Leveraging Research & Industrial Communities
• Scalable systems deployed and processing millions of media files
• End-user testing with huge volume of traffic• General problem areas covered in R&D (e.g. VideoTREC)
are very relevant• Selective application and proper modelling of information
fusion is key.• How do we share or leverage that data to foster research?• How can industries help?• How can universities help?
Top Related