Transcript of Digital Libraries: Steps toward Information Finding, Dr. Lillian N. Cassel, Villanova University.
- Slide 1
- Digital Libraries: Steps toward information finding Dr. Lillian
N. Cassel Villanova University
- Slide 2
- But first,
- Slide 3
- and
- Slide 4
- A brief introduction to Information Retrieval. Resource: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. The entire book is available online, free, at http://nlp.stanford.edu/IR-book/information-retrieval-book.html. I will use some of the slides that the authors provide to go with the book.
- Slide 5
- The authors' definition: Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Note the use of the word "usually." We will see examples where the material is not documents, and not text. We have already seen that the collection may be distributed over several computers.
- Slide 6
- Examples and Scaling. IR is about finding a needle in a haystack: finding some particular thing in a very large collection of similar things. Our examples are necessarily small, so that we can comprehend them. Do remember that everything we say must scale to very large quantities.
- Slide 7
- Just how much information? Libraries are about access to information. What sense do you have of information quantity? How fast is it growing? Are there implications of the quantity and the rate of increase?
- Slide 8
- How much information is there? Data summarization, trend detection, and anomaly detection are key technologies. [Figure: the scale of stored information, from kilo through mega, giga, tera, peta, exa, zetta, and yotta bytes, with reference points: a book, a photo, all books (words), all books multimedia, everything ever recorded. The smaller scale: milli (10^-3), micro (10^-6), nano (10^-9), pico (10^-12), femto (10^-15), atto (10^-18), zepto (10^-21), yocto (10^-24).] Slide source: Jim Gray, Microsoft Research (modified). See Mike Lesk, "How much information is there?": http://www.lesk.com/mlesk/ksg97/ksg.html. See Lyman & Varian, "How much information?": http://www.sims.berkeley.edu/research/projects/how-much-info/ Soon most everything will be recorded and indexed, and most bytes will never be seen by humans. Handling these quantities requires algorithms, data and knowledge representation, and knowledge of the domain.
- Slide 9
- Where does the information come from? Many sources: corporations, individuals, interest groups, news organizations. It is accumulated through crawling.
- Slide 10
- Basic Crawl Architecture. [Diagram: the WWW feeds a Fetch module, assisted by DNS; fetched pages are parsed; a "content seen?" test checks doc fingerprints (Doc FPs); URL filters and robots filters are applied; duplicate URL elimination checks the URL set; surviving URLs join the URL frontier, which feeds the next fetch.] Ref: Manning, Introduction to Information Retrieval.
- Slide 11
- Crawler Architecture. Modules: the URL frontier (the queue of URLs still to be fetched, or to be fetched again); a DNS resolution module (the translation from a URL to a web server to talk to); a fetch module (uses HTTP to retrieve the page); a parsing module to extract text and links from the page; a duplicate elimination module to recognize links already seen. Ref: Manning, Introduction to Information Retrieval.
- Slide 12
- Crawling threads. With so much space to explore and so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor, or threads may be distributed over many nodes in a distributed system.
- Slide 13
- Politeness. Not optional. Explicit politeness: the web site owner specifies, in a robots.txt file, what portions of the site may be crawled and what portions may not. Implicit politeness: even if no restrictions are specified, still restrict how often you hit a single site. You may have many URLs from the same site, and too much traffic can interfere with the site's operation. Crawler hits are much faster than ordinary traffic and could overtax the server (this constitutes a denial of service attack). Good web crawlers do not fetch multiple pages from the same server at one time.
- Slide 14
- Robots.txt. A protocol nearly as old as the web; see www.robotstxt.org/robotstxt.html. The file URL/robots.txt contains the access restrictions. Example:
  User-agent: *
  Disallow: /yoursite/temp/
  User-agent: searchengine
  Disallow:
  The first rule applies to all robots (spiders/crawlers); the second applies only to the robot named searchengine, for which nothing is disallowed. Source: www.robotstxt.org/wc/norobots.html
- Slide 15
- Another example:
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
  Disallow: /~joe/
- Slide 16
- Processing robots.txt. The first line, User-agent, identifies to whom the instruction applies: * = everyone; otherwise, a specific crawler name. Disallow: or Allow: provides a path to exclude from or include in robot access. Once the robots.txt file has been fetched from a site, it does not have to be fetched every time you return to the site; refetching just takes time and uses up hits on the server. Cache the robots.txt file for repeated reference.
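- A minimal sketch of this caching in Python, using the standard library's urllib.robotparser (the cache dict and function name are illustrative, not from the slides):

```python
import urllib.robotparser
from urllib.parse import urlparse

# One parsed robots.txt per host, fetched once and then reused
# (an illustrative in-memory cache).
_robots_cache = {}

def allowed(url, agent="*"):
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in _robots_cache:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(host + "/robots.txt")
        parser.read()                    # fetch and parse the file once
        _robots_cache[host] = parser
    return _robots_cache[host].can_fetch(agent, url)

print(allowed("http://example.com/tmp/page.html"))
```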
- Slide 17
- Robots meta tag. robots.txt provides information about access to a directory. A given file may also have an HTML meta tag that directs robot behavior; a responsible crawler will check for that tag and obey its direction. Options include: INDEX, NOINDEX, FOLLOW, NOFOLLOW. See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
- Slide 18
- Crawling. Pick a URL from the frontier (which one? a policy decision). Fetch the document at the URL. Parse the document and extract links from it to other docs (URLs). Check whether the URL has content already seen; if not, add it to the indices. For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.), then check whether it is already in the frontier (duplicate URL elimination). Ref: Manning, Introduction to Information Retrieval.
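- A single-threaded sketch of this loop in Python, omitting the politeness delays and robots.txt checks discussed above (the helper names are illustrative):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href"]

def crawl(seeds, passes_filter, max_pages=100):
    frontier = deque(seeds)          # the URL frontier
    seen_urls = set(seeds)           # duplicate URL elimination
    seen_content = set()             # "content seen?" via fingerprints
    while frontier and len(seen_content) < max_pages:
        url = frontier.popleft()                              # pick a URL
        page = urlopen(url).read().decode("utf-8", "ignore")  # fetch it
        if hash(page) in seen_content:                        # seen before?
            continue
        seen_content.add(hash(page))                          # (index it here)
        extractor = LinkExtractor()
        extractor.feed(page)                                  # parse the page
        for link in extractor.links:
            absolute = urljoin(url, link)                     # normalize
            if passes_filter(absolute) and absolute not in seen_urls:
                seen_urls.add(absolute)                       # dup URL elim
                frontier.append(absolute)

crawl(["http://example.com/"], lambda u: u.startswith("http://example.com"))
```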
- Slide 19
- Basic Crawl Architecture (repeated). [Diagram as on Slide 10: WWW, DNS, Fetch, Parse, the content-seen test against doc fingerprints, URL filter, robots filters, duplicate URL elimination against the URL set, and the URL frontier.] Ref: Manning, Introduction to Information Retrieval.
- Slide 20
- DNS (Domain Name System). An Internet service to resolve URLs into IP addresses. The servers are distributed, so significant latency is possible. In standard OS implementations, a DNS lookup is blocking: only one outstanding request at a time. Solutions: DNS caching, and a batch DNS resolver that collects requests and sends them out together. Ref: Manning, Introduction to Information Retrieval.
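- An illustrative sketch of both solutions in Python: an in-process DNS cache, plus a batch resolver that overlaps the blocking lookups in a thread pool (the function names are assumptions, not from the slides):

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=None)          # DNS cache: each host is resolved only once
def resolve(host):
    return socket.gethostbyname(host)   # a blocking lookup

def resolve_batch(hosts):
    # Overlap the blocking lookups in a thread pool, so one slow
    # resolution does not stall the rest of the batch.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return dict(zip(hosts, pool.map(resolve, hosts)))

print(resolve_batch(["example.com", "python.org"]))
```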
- Slide 21
- Parsing. A fetched page contains embedded links to more pages, plus the actual content for use in the application. Extract the links. Relative link? Expand (normalize) it. Seen before? Discard it. New and meets the criteria? Append it to the URL frontier. Does not meet the criteria? Discard it. Then examine the content.
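- For the "expand (normalize)" step, Python's standard urllib.parse can resolve relative links against the URL of the page being parsed; a small illustration:

```python
from urllib.parse import urljoin, urldefrag

base = "http://example.com/plays/hamlet.html"   # the page being parsed
for raw in ["scene2.html", "../sonnets/", "#act3", "http://other.org/page"]:
    absolute = urljoin(base, raw)        # expand a relative link
    normalized, _ = urldefrag(absolute)  # drop any fragment identifier
    print(raw, "->", normalized)
```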
- Slide 22
- Content seen before? How to tell? Fingerprints and shingles detect documents that are identical or similar. If the content is already in the index, do not process it again. Ref: Manning, Introduction to Information Retrieval.
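- A toy sketch of both ideas, assuming simple whitespace tokenization: an exact fingerprint catches identical documents, while overlapping k-shingles indicate similar ones:

```python
import hashlib

def fingerprint(text):
    # Exact-duplicate test: identical texts hash to the same digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    # Near-duplicate test: the set of k-word windows ("shingles");
    # two documents with high shingle overlap are similar.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

a = "to be or not to be that is the question"
b = "to be or not to be that is the answer"
common = shingles(a) & shingles(b)
print(len(common), "shared shingles out of", len(shingles(a) | shingles(b)))
```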
- Slide 23
- Distributed crawler. For big crawls: many processes, each doing part of the job, possibly on different nodes, even geographically distributed. How to distribute: give each node a set of hosts to crawl, using a hashing function to partition the set of hosts. How do these nodes communicate? They need to maintain a common index. Ref: Manning, Introduction to Information Retrieval.
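- A minimal sketch of hash-based host partitioning (a stable hash, so every node assigns a given host to the same owner; the function name is illustrative):

```python
import hashlib

def node_for_host(host, num_nodes):
    # A stable hash of the host name decides which crawler node owns it;
    # all URLs from one host go to one node, which eases politeness limits.
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

for h in ["example.com", "villanova.edu", "nlp.stanford.edu"]:
    print(h, "-> node", node_for_host(h, 4))
```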
- Slide 24
- Communication between nodes. [Diagram: the per-node crawl architecture from Slide 10, extended with a host splitter that routes URLs to other hosts and receives URLs from other hosts.] The output of the URL filter at each node is sent to the duplicate URL eliminator at all nodes. Ref: Manning, Introduction to Information Retrieval.
- Slide 25
- URL Frontier. Two requirements. Politeness: do not go too often to the same site. Freshness: keep pages up to date (news sites, for example, change frequently). The two requirements may be directly in conflict with each other. A complication: fetching the URLs embedded in a page will yield many URLs located on the same server, so delay fetching those. Ref: Manning, Introduction to Information Retrieval.
- Slide 26
- Now that we have a collection, how will we ever find the needle in the haystack, the one bit of information needed? After crawling, or some other resource acquisition step, we need to create a way to query the information we have. Next step: the index. Example content: Shakespeare's plays.
- Slide 27
- Searching Shakespeare. Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? See http://www.rhymezone.com/shakespeare/. One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer? It is slow (for large corpora); NOT Calpurnia is non-trivial; other operations (e.g., find the word Romans near countrymen) are not feasible; and there is no ranked retrieval (best documents to return).
- Slide 28
- Term-document incidence. First approach: make a matrix with all the terms on one axis and all the plays on the other, where an entry is 1 if the play contains the word and 0 otherwise. The query Brutus AND Caesar BUT NOT Calpurnia can then be answered from the term rows.
- Slide 29
- Incidence Vectors. So we have a 0/1 vector for each term. To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them: 110100 AND 110111 AND 101111 = 100100.
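- The same bitwise AND, sketched in Python using integers as bit vectors:

```python
# The 0/1 term vectors as integers, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b101111          # already complemented: NOT Calpurnia

answer = brutus & caesar & calpurnia
print(format(answer, "06b"))  # 100100: the 1st and 4th plays match
```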
- Slide 30
- Answer to the query. Antony and Cleopatra, Act III, Scene ii. Agrippa [Aside to DOMITIUS ENOBARBUS]: "Why, Enobarbus, / When Antony found Julius Caesar dead, / He cried almost to roaring; and he wept / When at Philippi he found Brutus slain." Hamlet, Act III, Scene ii. Lord Polonius: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."
- Slide 31
- Try another one. What is the vector for the query Antony AND mercy? What would we do to find Antony OR mercy?
- Slide 32
- Basic assumptions about information retrieval. Collection: a fixed set of documents. Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.
- Slide 33
- The classic search model. [Diagram: a TASK gives rise to an info need, verbalized and then expressed as a query posed to a SEARCH ENGINE over a corpus, yielding results, with a query refinement loop.] Ultimately, there is some task to perform. Some information is required in order to perform the task. The information need must be expressed in words (usually), and then in the form of a query that can be processed. It may be necessary to rephrase the query and try again.
- Slide 34
- The classic search model, illustrated. Task: get rid of mice in a politically correct way. Info need: information about removing mice without killing them. Verbal form: "How do I trap mice alive?" Query: mouse trap. Misconception? Mistranslation? Misformulation? There are potential pitfalls at every step between the task and the query results.
- Slide 35
- How good are the results? Precision: what fraction of the returned results are relevant to the information need? Recall: what fraction of the available correct results were retrieved? These are the basic concepts of information retrieval evaluation.
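- An illustrative example with made-up numbers: if a query returns 10 documents of which 8 are relevant, precision is 8/10 = 0.8; if the collection actually contains 20 relevant documents, recall is 8/20 = 0.4.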
- Slide 36
- Size considerations. Consider N = 1 million documents, each with about 1000 words. At an average of 6 bytes per word (including spaces and punctuation), that is 6 GB of data in the documents. Say there are M = 500K distinct terms among these.
- Slide 37
- The matrix does not work. A 500K x 1M matrix has half a trillion 0s and 1s. But it has no more than one billion 1s, so the matrix is extremely sparse. What's a better representation? We only record the 1 positions; i.e., we don't need to know which documents do not have a term, only those that do. Why?
- Slide 38
- Inverted index. For each term t, we must store a list of all documents that contain t, identifying each by a docID, a document serial number. Can we use fixed-size arrays for this? Brutus -> 1, 2, 4, 11, 31, 45, 173, 174; Caesar -> 1, 2, 4, 5, 6, 16, 57, 132; Calpurnia -> 2, 31, 54, 101. What happens if we add document 14, which contains Caesar?
- Slide 39
- Inverted index. We need variable-size postings lists. On disk, a contiguous run of postings is normal and best. In memory, we can use linked lists or variable-length arrays; there are tradeoffs in size and ease of insertion. Each entry in a postings list is a posting, and the list is sorted by docID (more later on why). Brutus -> 1, 2, 4, 11, 31, 45, 173, 174; Caesar -> 1, 2, 4, 5, 6, 16, 57, 132; Calpurnia -> 2, 31, 54, 101.
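- A sketch of the in-memory variant in Python, with postings as variable-length arrays kept sorted by docID; adding document 14 to Caesar's list is then an ordered insert:

```python
import bisect

# Postings lists as variable-length arrays, kept sorted by docID.
index = {
    "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

def add_posting(term, doc_id):
    postings = index.setdefault(term, [])
    if not postings or doc_id > postings[-1]:
        postings.append(doc_id)           # common case: a new highest docID
    else:
        bisect.insort(postings, doc_id)   # otherwise insert in sorted order

add_posting("caesar", 14)   # the "document 14" question from above
print(index["caesar"])      # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```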
- Slide 40
- Inverted index construction. [Pipeline: documents to be indexed ("Friends, Romans, countrymen.") go to a tokenizer, producing the token stream (Friends, Romans, Countrymen); linguistic modules (stop words, stemming, capitalization, cases, etc.) produce modified tokens (friend, roman, countryman); the indexer produces the inverted index (friend -> 2, 4; roman -> 1, 2; countryman -> 13, 16).]
- Slide 41
- Indexer steps: token sequence. Produce a sequence of (modified token, document ID) pairs. Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."
- Slide 42
- Indexer steps: sort. Sort by terms, and then by docID. This is the core indexing step.
- Slide 43
- Indexer steps: dictionary & postings. Multiple entries for a term in a single document are merged. The result is split into a dictionary and postings, and document frequency information is added.
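- The three indexer steps, sketched in Python on the two example documents (tokenization here is plain whitespace splitting, simpler than the linguistic modules above):

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: a sequence of (token, docID) pairs.
pairs = [(w.lower(), doc_id)
         for doc_id, text in docs.items()
         for w in text.split()]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge duplicates into dictionary + postings, with doc frequency.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

for term in ["brutus", "caesar", "capitol"]:
    print(term, "df =", len(postings[term]), "postings:", postings[term])
```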
- Slide 44
- Where do we pay in storage? In the dictionary: the terms and their counts. In the postings: the lists of docIDs. Plus the pointers from each term to its postings list.
- Slide 45
- Storage: a small diversion. Computer storage: processor caches, main memory, and external storage (hard disk, other devices), with very substantial differences in access speeds. Processor caches: used for rapid access to data that will be needed soon. Main memory: limited quantities, high-speed access. Hard disk: much larger quantities, restricted speed, access in fixed units (blocks).
- Slide 46
- Some size examples, from a MacBook Pro. Memory: 8 GB (two 4 GB SO-DIMMs) of 1333 MHz DDR3 SDRAM. Hard drive: 500 GB 7200-rpm Serial ATA. The cloud: a 2 TB PogoPlug; Dropbox, iCloud, etc. These are big numbers, but the potential size of a significant collection is larger still. The steps taken to optimize the use of storage are critical to satisfactory response time.
- Slide 47
- Implications of size limits. [Diagram: virtual memory paging. Real memory (RAM) holds a limited number of page slots; the full set of virtual pages lives on disk. When a program running in virtual page 2 references virtual page 71, which is not in real memory, a page fault occurs and virtual page 71 is brought from disk into a slot in real memory.]
- Slide 48
- How do we process a query? Using the index we just built,
examine the terms in some order, looking for the terms in the
query.
- Slide 49
- Query processing: AND. Consider processing the query Brutus AND Caesar. Locate Brutus in the dictionary and retrieve its postings; locate Caesar in the dictionary and retrieve its postings; then merge the two postings lists: Brutus -> 2, 4, 8, 16, 32, 64, 128; Caesar -> 1, 2, 3, 5, 8, 13, 21, 34.
- Slide 50
- The merge. Walk through the two postings lists simultaneously, in time linear in the total number of postings entries: Brutus -> 2, 4, 8, 16, 32, 64, 128 and Caesar -> 1, 2, 3, 5, 8, 13, 21, 34 intersect to give 2, 8. If the list lengths are x and y, the merge takes O(x+y) operations. What does that mean? Crucial: the postings are sorted by docID.
- Slide 51
- Intersecting two postings lists (a merge algorithm).
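- A Python rendering of the merge, walking a pointer through each sorted list:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # this docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```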
- Slide 52
- A small exercise. Assume the lists of documents shown here: Brutus -> 2, 4, 8, 16, 32, 34, 128; Antony -> 1, 3, 5, 8, 13, 16, 21, 34. Show the processing of the query Brutus AND Antony.
- Slide 53
- Boolean queries: exact match. The Boolean retrieval model is being able to ask a query that is a Boolean expression: Boolean queries use AND, OR and NOT to join query terms. The model views each document as a set of words. It is precise: a document matches the condition or it does not. Perhaps the simplest model to build, it was the primary commercial retrieval tool for three decades. Many search systems you still use are Boolean: email, library catalogs, Mac OS X Spotlight.
- Slide 54
- Ranked retrieval. Thus far, our queries have all been Boolean: documents either match or don't. That is good for expert users with a precise understanding of their needs and the collection, and also good for applications, which can easily consume thousands of results. But writing a precise query that will produce the desired result is difficult for most users.
- Slide 55
- Problem with Boolean search: feast or famine. Boolean queries often result in either too few (=0) or too many (thousands of) results. A query that is too broad yields hundreds of thousands of hits; a query that is too narrow may yield no hits. It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few, OR gives too many.
- Slide 56
- Ranked retrieval models. Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query. Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language. In principle these are two separate options, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa.
- Slide 57
- Feast or famine: not a problem in ranked retrieval. When a system produces a ranked result set, large result sets are not an issue. Indeed, the size of the result set is not an issue: we just show the top k (about 10) results, and we don't overwhelm the user. Premise: the ranking algorithm works.
- Slide 58
- Scoring as the basis of ranked retrieval. We wish to return, in order, the documents most likely to be useful to the searcher. How can we rank-order the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each document. This score measures how well the document and query match.
- Slide 59
- Query-document matching scores. We need a way of assigning a score to a query/document pair. Let's start with a one-term query. If the query term does not occur in the document, the score should be 0. The more frequent the query term in the document, the higher the score should be. We will look at a number of alternatives for this.
- Slide 60
- Take 1: Jaccard coefficient. A commonly used measure of the overlap of two sets A and B: jaccard(A,B) = |A ∩ B| / |A ∪ B|. jaccard(A,A) = 1, and jaccard(A,B) = 0 if A ∩ B = ∅. A and B don't have to be the same size, and the coefficient always assigns a number between 0 and 1.
- Slide 61
- Jaccard coefficient: scoring example. What is the query-document match score that the Jaccard coefficient computes for each of the two documents below? Query: "ides of march". Document 1: "caesar died in march". Document 2: "the long march". Which do you think is probably the better match?
- Slide 62
- Jaccard example done. Query: "ides of march". Document 1: "caesar died in march". Document 2: "the long march". A = {ides, of, march}; B1 = {caesar, died, in, march}; B2 = {the, long, march}. jaccard(A, B1) = 1/6; jaccard(A, B2) = 1/5.
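- The same computation, sketched in Python:

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

A  = {"ides", "of", "march"}
B1 = {"caesar", "died", "in", "march"}
B2 = {"the", "long", "march"}

print(jaccard(A, B1))   # 1/6, about 0.17
print(jaccard(A, B2))   # 1/5 = 0.20
```

- Note that the shorter document scores higher even though Document 1 looks like the better match: Jaccard ignores term frequency and the rarity of terms, which motivates the weighting schemes taken up below.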
cos(SAS,WH)?">
- 3 documents example contd. (SaS = Sense and Sensibility, PaP = Pride and Prejudice, WH = Wuthering Heights.) Log frequency weighting:
  term: SaS, PaP, WH
  affection: 3.06, 2.76, 2.30
  jealous: 2.00, 1.85, 2.04
  gossip: 1.30, 0, 1.78
  wuthering: 0, 0, 2.58
  After length normalization:
  term: SaS, PaP, WH
  affection: 0.789, 0.832, 0.524
  jealous: 0.515, 0.555, 0.465
  gossip: 0.335, 0, 0.405
  wuthering: 0, 0, 0.588
  cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94; cos(SaS,WH) ≈ 0.79; cos(PaP,WH) ≈ 0.69. Why do we have cos(SaS,PaP) > cos(SaS,WH)?
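- Since the normalized vectors have unit length, each cosine is just a dot product; a quick check in Python:

```python
# Length-normalized log-tf weights from the table above
# (order: affection, jealous, gossip, wuthering).
sas = [0.789, 0.515, 0.335, 0.0]    # Sense and Sensibility
pap = [0.832, 0.555, 0.0,   0.0]    # Pride and Prejudice
wh  = [0.524, 0.465, 0.405, 0.588]  # Wuthering Heights

def cos(u, v):
    # The vectors are already unit length, so cosine = dot product.
    return sum(x * y for x, y in zip(u, v))

print(round(cos(sas, pap), 2))  # 0.94
print(round(cos(sas, wh), 2))   # 0.79
print(round(cos(pap, wh), 2))   # 0.69
```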
- Slide 90
- tf-idf weighting has many variants. [Table of weighting variants for term frequency, document frequency, and normalization; the column headings are acronyms for the weight schemes.] Why is the base of the log in idf immaterial?
- Slide 91
- Weighting may differ in queries vs. documents. Many search engines allow different weightings for queries and for documents. SMART notation denotes the combination in use in an engine with the notation ddd.qqq, using the acronyms of the previous table. A very standard weighting scheme is lnc.ltc. Document: logarithmic tf (l as first character), no idf (n), and cosine normalization (c). Query: logarithmic tf (l in the leftmost column), idf (t in the second column), and cosine normalization (c).
- Slide 92
- tf-idf example: lnc.ltc. Document: "car insurance auto insurance". Query: "best car insurance".
  Query side (ltc): term: tf-raw, tf-wt, df, idf, wt, n'lized
  auto: 0, 0, 5000, 2.3, 0, 0
  best: 1, 1, 50000, 1.3, 1.3, 0.34
  car: 1, 1, 10000, 2.0, 2.0, 0.52
  insurance: 1, 1, 1000, 3.0, 3.0, 0.78
  Document side (lnc): term: tf-raw, tf-wt, wt, n'lized
  auto: 1, 1, 1, 0.52
  best: 0, 0, 0, 0
  car: 1, 1, 1, 0.52
  insurance: 2, 1.3, 1.3, 0.68
  Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92. Products: auto 0, best 0, car 0.52 × 0.52 ≈ 0.27, insurance 0.78 × 0.68 ≈ 0.53. Score = 0 + 0 + 0.27 + 0.53 = 0.8. Exercise: what is N, the number of docs?
- Slide 93
- Summary: vector space ranking. Represent the query as a weighted tf-idf vector. Represent each document as a weighted tf-idf vector. Compute the cosine similarity score for the query vector and each document vector. Rank the documents with respect to the query by score. Return the top K (e.g., K = 10) to the user.
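- Pulling the pieces together, a compact illustrative sketch of lnc.ltc ranking over toy documents (a real engine would score postings from an inverted index rather than building a dense vector per document):

```python
import math
from collections import Counter

def log_tf(c):
    return 1 + math.log10(c) if c > 0 else 0.0

def normalize(w):
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return [x / norm for x in w]

def rank(query, docs, k=10):
    """Score docs against the query with lnc.ltc weighting and cosine."""
    n = len(docs)
    tokens = {d: text.lower().split() for d, text in docs.items()}
    vocab = sorted(set(query.lower().split()).union(*tokens.values()))
    df = Counter(t for toks in tokens.values() for t in set(toks))
    idf = {t: math.log10(n / df[t]) if df[t] else 0.0 for t in vocab}

    qc = Counter(query.lower().split())
    q = normalize([log_tf(qc[t]) * idf[t] for t in vocab])   # ltc query

    scores = {}
    for d, toks in tokens.items():
        dc = Counter(toks)
        dv = normalize([log_tf(dc[t]) for t in vocab])       # lnc document
        scores[d] = sum(a * b for a, b in zip(q, dv))        # cosine
    return sorted(scores.items(), key=lambda p: -p[1])[:k]

docs = {1: "car insurance auto insurance", 2: "best auto repair"}
print(rank("best car insurance", docs))
```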
- Slide 94
- The web and its challenges. Unusual and diverse documents; unusual and diverse users, queries, and information needs. Beyond terms, exploit ideas from social networks: link analysis, clickstreams. How do search engines work, and how can we make them better? See http://video.google.com/videoplay?docid=-1243280683715323550&hl=en# for Marissa Mayer of Google on how a search happens at Google.
- Slide 95
- References. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008; the book is available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html. Many of these slides are taken directly from the authors' slides for the first chapters of the book, as provided by the authors. The paging figure is from Vittore Casarosa. Term weighting and cosine similarity tutorial: http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html