Search. Search issues How do we say what we want? –I want a story about pigs –I want a picture...
patrick-booth -
Search
Search issues
• How do we say what we want?
  – I want a story about pigs
  – I want a picture of a rooster
  – How many televisions were sold in Vietnam during 2000?
  – Find a movie like this one
• How does the computer find what we said?
Things to search for
• Records
• Text
• Images
• Audio
• Video
Records
• Car
  – Price: $5,000
  – Miles: 20,000
  – Year: 1994
  – Make: Toyota
  – Doors: 2
• Queries
  – Price < 6000 & Miles < 100000
  – Make == Toyota & Year > 1993
Queries
• Make == Toyota & Year > 1993

   Make    Year  Miles   Price
0  Toyota  1994  20000   $6,000
1  Honda   1992  100000  $2,000
2  Ford    1997  5000    $1,000
3  Toyota  1992  150000  $3,000
4  Chevy   1996  30000   $2,000
5  BMW     1994  120000  $100,000
Queries
• Year > 1993 or Price < $3,000

   Make    Year  Miles   Price
0  Toyota  1994  20000   $6,000
1  Honda   1992  100000  $2,000
2  Ford    1997  5000    $1,000
3  Toyota  1992  150000  $3,000
4  Chevy   1996  30000   $2,000
5  BMW     1994  120000  $100,000
Databases
• Large collections of records
• Accessed by queries
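
The two queries above can be sketched in Python against the six car records from the table. This is only an illustration: the dictionary field names and the `query` helper are assumptions made for the example, not part of any particular database system.

```python
# The six records from the slide's table; field names follow the column headers.
cars = [
    {"make": "Toyota", "year": 1994, "miles": 20000,  "price": 6000},
    {"make": "Honda",  "year": 1992, "miles": 100000, "price": 2000},
    {"make": "Ford",   "year": 1997, "miles": 5000,   "price": 1000},
    {"make": "Toyota", "year": 1992, "miles": 150000, "price": 3000},
    {"make": "Chevy",  "year": 1996, "miles": 30000,  "price": 2000},
    {"make": "BMW",    "year": 1994, "miles": 120000, "price": 100000},
]


def query(records, predicate):
    """Return the row numbers of the records for which predicate is true."""
    return [i for i, rec in enumerate(records) if predicate(rec)]


# Make == Toyota & Year > 1993
toyota_after_1993 = query(cars, lambda c: c["make"] == "Toyota" and c["year"] > 1993)

# Year > 1993 or Price < $3,000
newer_or_cheap = query(cars, lambda c: c["year"] > 1993 or c["price"] < 3000)

print(toyota_after_1993)  # [0]
print(newer_or_cheap)     # [0, 1, 2, 4, 5]
```

Row 3 fails the second query because its year is 1992 and its price is exactly $3,000, which is not less than $3,000.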
Things to search for
• Records
• Text
• Images
• Audio
• Video
Text searching
• How do I say what I want?
  – Type some phrase
    • I want a story about pigs
• How will the computer match this?
  – What is text?
    • An array of characters
  – What can a computer do with text?
    • Match characters
Text searching
• People think in words not characters
• How do I convert an array of characters into an array of words?
  – Collect together sequences of letters
  – How do I know if character C is a letter?
    • (C >= “a” & C <= “z”) | (C >= “A” & C <= “Z”)
Convert to words
• Because people think in words
Characters:
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
 T  h  e  ␣  l  a  z  y  ␣  b  r  o  w  n  ␣  d  o  g
(␣ marks a space)

Words:
0 The
1 lazy
2 brown
3 dog
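
A minimal Python sketch of this conversion, using the letter test from the previous slide. The function names `is_letter` and `to_words` are made up for the example.

```python
def is_letter(c):
    """The slide's test: C is between 'a' and 'z' or between 'A' and 'Z'."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z")


def to_words(text):
    """Collect runs of consecutive letters into words."""
    words = []
    current = []
    for c in text:
        if is_letter(c):
            current.append(c)
        elif current:
            # A non-letter ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        # The text ended in the middle of a word.
        words.append("".join(current))
    return words


print(to_words("The lazy brown dog"))  # ['The', 'lazy', 'brown', 'dog']
```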
Every document is an array of words
• I want a story about pigs
• How will I find the right documents?
  – Find all documents that have the word “pigs”
Searching text
• How will I find pigs fast?
  – Hint: the “URL Lookup” assignment
  – Create an index of all words
    • With each word store the name or address of each document that contains that word
  – Search the index for “pigs”
    • Return the list of documents
    • Use a binary search on the word list (50,000 words)
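
The index and the binary search can be sketched as follows. The document names and contents are hypothetical sample data, and `build_index` and `lookup` are names chosen for this example; the binary search uses `bisect_left` from Python's standard library.

```python
from bisect import bisect_left


def build_index(documents):
    """documents maps a document name to its list of words.

    Returns a sorted word list and, in the same order, the sorted list of
    document names that contain each word.
    """
    postings = {}
    for name, words in documents.items():
        for word in words:
            postings.setdefault(word, set()).add(name)
    sorted_words = sorted(postings)
    doc_lists = [sorted(postings[w]) for w in sorted_words]
    return sorted_words, doc_lists


def lookup(word, sorted_words, doc_lists):
    """Binary-search the word list and return the matching documents."""
    i = bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return doc_lists[i]
    return []


# Hypothetical sample documents, already split into words.
documents = {
    "farm.txt": ["three", "little", "pigs"],
    "dog.txt": ["the", "lazy", "brown", "dog"],
    "zoo.txt": ["pigs", "and", "dogs"],
}
sorted_words, doc_lists = build_index(documents)
print(lookup("pigs", sorted_words, doc_lists))  # ['farm.txt', 'zoo.txt']
print(lookup("cow", sorted_words, doc_lists))   # []
```

With 50,000 words in the list, a binary search needs at most about 16 comparisons, since 2^16 = 65,536.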
Problems
• What if a document has the word “Pig” but not “pigs”?
• Normalize
  – Case - make all words lower case
    • Pig -> pig
  – Stemming - remove all suffixes and prefixes before putting a word into the index
    • pigs -> pig
    • piggy -> pig
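
A toy Python sketch of both normalization steps. The stemming rules here are an assumption invented to cover the slide's three examples (strip a trailing "s" or "y", then collapse a doubled final consonant); real stemmers use much larger rule sets, and this one will mangle many English words.

```python
# Toy suffix list, chosen only to handle the slide's examples.
SUFFIXES = ("s", "y")


def stem(word):
    """Strip one toy suffix, then collapse a doubled final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "pigg" -> "pig"
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def normalize(word):
    """Lower-case the word, then stem it."""
    return stem(word.lower())


for w in ("Pig", "pigs", "piggy"):
    print(w, "->", normalize(w))
# Pig -> pig
# pigs -> pig
# piggy -> pig
```

The same `normalize` function has to be applied both to the words going into the index and to the words in the search phrase, otherwise the two will not match.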
Problems
• I want a story about pigs
  – How does the computer know to search for pigs?
    • It doesn’t
  – How does the computer know what a story is?
    • It doesn’t
Searching
• I want a story about pigs
• Pick out the important words and search for them
  – Which words are important?
  – D = number of times a word appears in a document
  – A = average number of times the word appears per document, over all documents
  – Importance = D/A
    • Why?
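
A small worked example of Importance = D/A in Python. The three documents are hypothetical sample data and the function name `importance` is made up. One way to read the ratio: a word such as "the" appears about equally often everywhere, so D is close to A and the ratio stays near 1, while a word concentrated in one document scores well above 1 there.

```python
def importance(word, doc, all_docs):
    """D/A, where D is the count of word in doc and A is the average
    count of word per document over all_docs."""
    d = doc.count(word)
    total = sum(other.count(word) for other in all_docs)
    if total == 0:
        return 0.0  # the word appears nowhere
    # D / A = D / (total / N) = D * N / total
    return d * len(all_docs) / total


# Hypothetical sample documents, already split into words.
docs = [
    "the pig and the pig farm".split(),
    "the dog and the cat".split(),
    "the lazy brown dog".split(),
]

print(importance("pig", docs[0], docs))  # 3.0
print(importance("the", docs[0], docs))  # 1.2
```

Here "pig" appears twice in the first document and nowhere else, so A = 2/3 and D/A = 3. The word "the" appears five times across three documents, so A = 5/3 and D/A = 1.2.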
How do we create an index of all documents on the Web?
• Try = a list of URLs
• Seen = all URLs from Try

While (Try is not empty) {
    Page = remove a URL from Try
    Words = all the “important” words in Page
    add Page to the index using all of Words
    Links = all URLs in Page
    for every Link that is not in Seen
        add Link to Try and to Seen
}
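
The crawl loop can be sketched in runnable Python. To keep the example self-contained it does not touch the network: `FAKE_WEB` is a hypothetical stand-in for the Web, the `.example` URLs are invented, and every word on a page is treated as "important", which is a simplification.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example": ("a story about pigs", ["b.example", "c.example"]),
    "b.example": ("a picture of a rooster", ["a.example"]),
    "c.example": ("pigs and a rooster", ["b.example", "d.example"]),
}


def crawl(start_urls, fetch):
    """Visit every page reachable from start_urls and index its words.

    fetch(url) returns (text, links), or None for a dead link.
    """
    to_try = deque(start_urls)   # Try
    seen = set(start_urls)       # Seen
    index = {}                   # word -> set of URLs containing it
    while to_try:
        url = to_try.popleft()
        page = fetch(url)
        if page is None:
            continue  # nothing to index at a dead link
        text, links = page
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                to_try.append(link)
    return index


index = crawl(["a.example"], FAKE_WEB.get)
print(sorted(index["pigs"]))     # ['a.example', 'c.example']
print(sorted(index["rooster"]))  # ['b.example', 'c.example']
```

Adding each link to Seen at the moment it is queued is what stops the loop from visiting a page twice, even though the pages link back to each other.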
Other ways to find important words and important documents
• A document is important if many other documents point to it
• A word is important in document D if that word occurs frequently in documents that link to document D.
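
The first idea, counting how many other documents point to each document, can be sketched as below. The link graph is hypothetical sample data and `inbound_counts` is a name made up for the example. This is plain counting only; it is not meant to reproduce any real search engine's ranking.

```python
from collections import Counter


def inbound_counts(links):
    """links maps a page to the list of pages it links to.

    Returns how many distinct other pages point to each page.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page once
            if target != source:      # ignore links from a page to itself
                counts[target] += 1
    return counts


# Hypothetical link graph.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}
counts = inbound_counts(links)
print(counts.most_common(1))  # [('c', 3)]
```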
Images
• What will I say when searching for an image?
  – I want a rooster picture
  – Draw a picture of a rooster?
Search by picture?
What’s in a picture?
• Computers don’t understand the contents of images
• To a computer an image is an array of colors
I want a picture of a rooster
• Label all of the pictures
• How does Google do it?
  – File name of the picture “rooster-crossingSt.jpg”
  – Words around the picture in the HTML
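
As an illustration of the file-name idea, here is one way to turn a picture's file name into label words. The splitting rule (break on non-letters and on capital letters) and the function name `labels_from_filename` are assumptions for this sketch, not a description of what any search engine actually does.

```python
import re
from pathlib import PurePosixPath


def labels_from_filename(filename):
    """Split a file name into lower-case label words."""
    stem = PurePosixPath(filename).stem  # drop the directory and ".jpg"
    # A word is an optional capital followed by lower-case letters,
    # or a run of capitals.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+", stem)
    return [p.lower() for p in parts]


print(labels_from_filename("rooster-crossingSt.jpg"))
# ['rooster', 'crossing', 'st']
```

The resulting words can go into the same kind of word index used for text, with the picture's address stored in place of a document name.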
Audio
• Talking
  – Use speech recognition to convert audio to text
  – With each recognized word keep track of where in the audio it was recognized.
• Build an index using the recognized text
  – Normalize based on how words sound rather than are spelled.
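
A sketch of the index over recognized words, keeping the position of each word in the audio. The recognizer itself is out of scope here: the `recognized` list is hypothetical output with invented timings, and only lower-casing is applied, not sound-based normalization.

```python
def build_audio_index(recognized):
    """recognized is a list of (word, seconds) pairs from a recognizer.

    Returns a dict mapping each word to the times at which it was heard.
    """
    index = {}
    for word, seconds in recognized:
        index.setdefault(word.lower(), []).append(seconds)
    return index


# Hypothetical recognizer output: (word, offset into the audio in seconds).
recognized = [
    ("I", 0.0), ("want", 0.3), ("a", 0.6), ("story", 0.8),
    ("about", 1.3), ("pigs", 1.7), ("three", 4.0), ("pigs", 4.5),
]
audio_index = build_audio_index(recognized)
print(audio_index["pigs"])  # [1.7, 4.5]
```

A search for "pigs" then returns not just the recording but the points in it where the word was spoken.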
Video
• Where in “Casablanca” does Bogart say “Play it again Sam”?
  – He never does, he just says “play it”
• How can the computer find that?
  – Transcribe the audio
  – Speech recognition on the audio
Video
• Does Woody ever kiss Bo Peep?
• Exactly what color is a kiss?
Video
• Does Woody ever kiss Bo Peep?
• Annotate every frame with who is in the frame and search for frames with both Woody and Bo Peep.
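
The frame-annotation search can be sketched as below. The `annotations` table is hypothetical sample data (frame number mapped to the characters visible in that frame), and `frames_with_all` is a name made up for the example.

```python
def frames_with_all(annotations, characters):
    """Return the frame numbers in which every requested character appears."""
    wanted = set(characters)
    return [
        frame
        for frame, present in sorted(annotations.items())
        if wanted <= present  # wanted is a subset of who is in the frame
    ]


# Hypothetical annotations: frame number -> characters in the frame.
annotations = {
    0: {"Woody"},
    1: {"Woody", "Buzz"},
    2: {"Woody", "Bo Peep"},
    3: {"Bo Peep"},
    4: {"Woody", "Bo Peep", "Buzz"},
}
print(frames_with_all(annotations, ["Woody", "Bo Peep"]))  # [2, 4]
```

The search only narrows the video down to frames where both characters are present. It does not say what they are doing in those frames.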
So what’s with this?
Or this?
Is Woody cheating?
Search
• Records
  – Queries
    • < > = And Or
• Text
  – Normalized words (case, stemming, thesaurus)
• Images
  – Add words
• Audio
  – Transcribe or recognize as words
• Video
  – Transcribe
  – Annotate