Digital Libraries: Steps toward information finding Dr. Lillian N. Cassel Villanova University.


  • Slide 1
  • Digital Libraries: Steps toward information finding Dr. Lillian N. Cassel Villanova University
  • Slide 2
  • But first,
  • Slide 3
  • and
  • Slide 4
  • A brief introduction to Information Retrieval. Resource: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. The entire book is available online, free, at http://nlp.stanford.edu/IR-book/information-retrieval-book.html. I will use some of the slides the authors provide to go with the book.
  • Slide 5
  • The authors' definition: Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Note the use of the word "usually." We will see examples where the material is not documents, and not text. We have already seen that the collection may be distributed over several computers.
  • Slide 6
  • Examples and scaling: IR is about finding a needle in a haystack, that is, finding some particular thing in a very large collection of similar things. Our examples are necessarily small, so that we can comprehend them. Do remember that everything we say must scale to very large quantities.
  • Slide 7
  • Just how much information? Libraries are about access to information. What sense do you have about information quantity? How fast is it growing? Are there implications for the quantity and rate of increase?
  • Slide 8
  • How much information is there? Data summarization, trend detection, and anomaly detection are key technologies. [Figure: a scale of storage prefixes from kilo through mega, giga, tera, peta, exa, zetta, and yotta, with markers for a photo, a book, all books (words), all books plus multimedia, and everything ever recorded. The smaller-scale prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto.] Soon most everything will be recorded and indexed; most bytes will never be seen by humans. These require algorithms, data and knowledge representation, and knowledge of the domain. Slide source: Jim Gray, Microsoft Research (modified). See Mike Lesk, How much information is there? http://www.lesk.com/mlesk/ksg97/ksg.html and Lyman & Varian, How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/
  • Slide 9
  • Where does the information come from? Many sources: corporations, individuals, interest groups, news organizations. It is accumulated through crawling.
  • Slide 10
  • Basic crawl architecture. [Diagram: the web (WWW) feeds a fetch module via DNS resolution; fetched pages go to a parser; parsed content passes a content-seen test (against document fingerprints), a URL filter, robots filters, and duplicate-URL elimination (against the URL set) before new URLs enter the URL frontier.] Ref: Manning, Introduction to Information Retrieval.
  • Slide 11
  • Crawler architecture modules: the URL frontier (the queue of URLs still to be fetched, or fetched again); a DNS resolution module (the translation from a URL to a web server to talk to); a fetch module (use HTTP to retrieve the page); a parsing module to extract text and links from the page; a duplicate elimination module to recognize links already seen. Ref: Manning, Introduction to Information Retrieval.
  • Slide 12
  • Crawling threads: with so much space to explore and so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor, or threads may be distributed over many nodes in a distributed system.
  • Slide 13
  • Politeness: not optional. Explicit politeness: the web site owner specifies, in a robots.txt file, which portions of the site may be crawled and which may not. Implicit politeness: even if no restrictions are specified, still restrict how often you hit a single site. You may have many URLs from the same site, and too much traffic can interfere with the site's operation: crawler hits arrive much faster than ordinary traffic and could overtax the server (in effect, a denial-of-service attack). Good web crawlers do not fetch multiple pages from the same server at one time.
  • Slide 14
  • Robots.txt: a protocol nearly as old as the web. See www.robotstxt.org/robotstxt.html. The file URL/robots.txt contains the access restrictions. Example (the first record applies to all robots (spiders/crawlers) and disallows /yoursite/temp/; the second applies only to the robot named searchengine, with nothing disallowed): User-agent: * Disallow: /yoursite/temp/ User-agent: searchengine Disallow: Source: www.robotstxt.org/wc/norobots.html
  • Slide 15
  • Another example: User-agent: * Disallow: /cgi-bin/ Disallow: /tmp/ Disallow: /~joe/
  • Slide 16
  • Processing robots.txt: the first line, User-agent, identifies to whom the instructions apply (* = everyone; otherwise, a specific crawler name). Disallow: or Allow: provides a path to exclude from or include in robot access. Once the robots.txt file is fetched from a site, it does not have to be fetched every time you return to the site; refetching just takes time and uses up hits on the server. Cache the robots.txt file for repeated reference.
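  • A minimal sketch (not from the slides) of this check using Python's standard urllib.robotparser, caching the parsed file per host so it is fetched only once; the host and crawler names are illustrative:

```python
# Sketch only: check robots.txt before fetching, caching the parsed file
# per host so it is fetched once, not on every visit to the site.
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache = {}  # host -> parsed robots.txt

def allowed(url, agent="MyCrawler"):
    host = urlparse(url).netloc
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        rp.read()                      # fetch and parse robots.txt once per host
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)

# allowed("https://example.com/tmp/page.html") -> False if /tmp/ is disallowed
```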
  • Slide 17
  • Robots meta tag: robots.txt provides information about access to a directory. A given file may also have an HTML meta tag that directs robot behavior, e.g. <meta name="robots" content="..."> where the content options are INDEX, NOINDEX, FOLLOW, NOFOLLOW. A responsible crawler will check for that tag and obey its direction. See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
  • Slide 18
  • Crawling: pick a URL from the frontier (which one?); fetch the document at the URL; parse the fetched document and extract links from it to other docs (URLs); check whether the URL's content has already been seen, and if not, add it to the indexes. For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.) and check whether it is already in the frontier (duplicate URL elimination). Ref: Manning, Introduction to Information Retrieval.
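  • A highly simplified, single-threaded sketch of that loop, assuming an in-memory frontier and a crude regex link extractor (a real crawler adds per-host politeness, robots.txt checks, proper HTML parsing, and a persistent, distributed frontier):

```python
# Sketch only: a single-threaded version of the crawl loop above.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)        # URL frontier
    seen_urls = set(seed_urls)         # duplicate URL elimination
    seen_content = set()               # crude "content seen?" test
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()       # pick a URL from the frontier
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue
        fingerprint = hash(html)
        if fingerprint in seen_content:
            continue                   # content already seen: skip indexing
        seen_content.add(fingerprint)
        pages[url] = html              # stand-in for "add to the index"
        for link in re.findall(r'href="([^"]+)"', html):   # extract links
            absolute = urljoin(url, link)                   # normalize relative links
            if absolute.startswith("http") and absolute not in seen_urls:
                seen_urls.add(absolute)                     # URL filter + dup elim
                frontier.append(absolute)
    return pages
```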
  • Slide 19
  • Basic crawl architecture (repeated). [Diagram as on slide 10: WWW → DNS/fetch → parse → content-seen test (doc fingerprints), URL filter, robots filters, duplicate-URL elimination (URL set) → URL frontier.] Ref: Manning, Introduction to Information Retrieval.
  • Slide 20
  • DNS (Domain Name System): the Internet service that resolves the host names in URLs into IP addresses. The servers are distributed, so significant latency is possible. In standard OS implementations, the DNS lookup is blocking: only one outstanding request at a time. Solutions: DNS caching, and a batch DNS resolver that collects requests and sends them out together. Ref: Manning, Introduction to Information Retrieval.
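  • A minimal sketch of the caching idea, using Python's blocking resolver and a plain dictionary (a production crawler would use a batched or asynchronous resolver instead):

```python
# Sketch only: cache DNS resolutions so each host name is resolved once.
import socket

_dns_cache = {}  # host name -> IP address

def resolve(host):
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)  # blocking DNS lookup
    return _dns_cache[host]

# resolve("example.com") pays the DNS latency once; later calls hit the cache.
```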
  • Slide 21
  • Parsing: the fetched page contains embedded links to more pages and the actual content for use in the application. Extract the links. Relative link? Expand (normalize) it. Seen before? Discard. New and meets the criteria? Append it to the URL frontier. Does not meet the criteria? Discard. Then examine the content.
  • Slide 22
  • Content seen before? How to tell? Fingerprints and shingles identify documents that are identical or similar. If the content is already in the index, do not process it again. Ref: Manning, Introduction to Information Retrieval.
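  • A small sketch of the shingle idea: hash word k-grams of each page and compare the overlap of the resulting fingerprint sets. The values k = 4 and the 0.9 threshold are illustrative choices, not values from the slides:

```python
# Sketch only: word k-shingles hashed to fingerprints; two documents are
# treated as near-duplicates when their shingle sets overlap heavily.
import hashlib

def shingle_fingerprints(text, k=4):
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def resemblance(text_a, text_b):
    a, b = shingle_fingerprints(text_a), shingle_fingerprints(text_b)
    return len(a & b) / len(a | b)        # Jaccard overlap of the shingle sets

def is_near_duplicate(text_a, text_b, threshold=0.9):
    return resemblance(text_a, text_b) >= threshold
```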
  • Slide 23
  • Distributed crawler: for big crawls, use many processes, each doing part of the job, possibly on different, geographically distributed nodes. How to distribute: give each node a set of hosts to crawl, using a hashing function to partition the set of hosts. How do these nodes communicate? They need to have a common index. Ref: Manning, Introduction to Information Retrieval.
  • Slide 24
  • Communication between nodes. [Diagram as on slide 10, with a host splitter after the URL filter that routes URLs to and from the other crawler nodes.] The output of the URL filter at each node is sent to the duplicate URL eliminator at all nodes. Ref: Manning, Introduction to Information Retrieval.
  • Slide 25
  • URL frontier: two requirements. Politeness: do not go too often to the same site. Freshness: keep pages up to date (news sites, for example, change frequently). The two requirements may be directly in conflict with each other. A further complication: fetching the URLs embedded in a page will yield many URLs located on the same server; delay fetching those. Ref: Manning, Introduction to Information Retrieval.
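  • A minimal sketch of a polite frontier: one FIFO queue per host plus an earliest-next-contact time per host. The 2-second delay and the data structures are illustrative; real frontiers also prioritize URLs for freshness:

```python
# Sketch only: a polite URL frontier with per-host queues and delays.
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, delay=2.0):
        self.delay = delay                        # minimum seconds between hits per host
        self.queues = defaultdict(deque)          # host -> FIFO queue of URLs
        self.next_ok = defaultdict(float)         # host -> earliest next fetch time

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        now = time.time()
        for host, queue in self.queues.items():
            if queue and self.next_ok[host] <= now:   # this host may be hit again
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None                                   # no host is ready yet
```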
  • Slide 26
  • Now that we have a collection, how will we ever find the needle in the haystack, the one bit of information needed? After crawling, or some other resource acquisition step, we need to create a way to query the information we have. Next step: the index. Example content: Shakespeare's plays.
  • Slide 27
  • Searching Shakespeare: which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? See http://www.rhymezone.com/shakespeare/. One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer? It is slow (for large corpora); NOT Calpurnia is non-trivial; other operations (e.g., find the word Romans near countrymen) are not feasible; and it does not support ranked retrieval (best documents to return).
  • Slide 28
  • Term-document incidence: a first approach is to build a matrix with all the terms on one axis and all the plays on the other, with a 1 if the play contains the word and 0 otherwise. The query Brutus AND Caesar BUT NOT Calpurnia can then be answered from the rows of this matrix.
  • Slide 29
  • Incidence vectors: so we have a 0/1 vector for each term. To answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them: 110100 AND 110111 AND 101111 = 100100.
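  • A tiny sketch of this in code, using the bit patterns from the slide as integer bit masks (one bit per play, leftmost bit = first play):

```python
# Sketch only: answer Brutus AND Caesar AND NOT Calpurnia with bitwise operators.
PLAYS = 6
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,   # its complement over 6 plays is 101111
}

mask = (1 << PLAYS) - 1      # keep only the 6 play bits after negation
answer = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & mask)
print(format(answer, "06b"))  # -> 100100: the 1st and 4th plays match
```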
  • Slide 30
  • Answer to query Antony and Cleopatra Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
  • Slide 31
  • Try another one: what is the vector for the query Antony AND mercy? What would we do to find Antony OR mercy?
  • Slide 32
  • Basic assumptions about information retrieval. Collection: a fixed set of documents. Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.
  • Slide 33
  • The classic search model. [Diagram: task → info need → verbal form → query → search engine → results, over a corpus, with a query refinement loop back to the query.] Ultimately, there is some task to perform, and some information is required in order to perform it. The information need must be expressed in words (usually), and then in the form of a query that can be processed. It may be necessary to rephrase the query and try again.
  • Slide 34
  • The classic search model, with an example of the potential pitfalls between task and query results. Task: get rid of mice in a politically correct way. Info need: information about removing mice without killing them. Verbal form: "How do I trap mice alive?" Query: "mouse trap". At each step there is room for misconception, mistranslation, or misformulation.
  • Slide 35
  • How good are the results? Precision: what fraction of the retrieved results are relevant to the information need? Recall: what fraction of the available correct (relevant) results were retrieved? These are the basic concepts of information retrieval evaluation.
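  • A toy illustration with made-up numbers: suppose the engine returns 8 documents, 6 of them relevant, out of 20 relevant documents in the whole collection:

```python
# Toy numbers, for illustration only.
retrieved = 8
relevant_retrieved = 6
relevant_in_collection = 20

precision = relevant_retrieved / retrieved               # 0.75: how much of what we returned is right
recall = relevant_retrieved / relevant_in_collection     # 0.30: how much of what is right we returned
```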
  • Slide 36
  • Size considerations: consider N = 1 million documents, each with about 1000 words, at an average of 6 bytes/word including spaces and punctuation. That is 6 GB of data in the documents. Say there are M = 500K distinct terms among these.
  • Slide 37
  • The matrix does not work: a 500K x 1M matrix has half a trillion 0s and 1s. But it has no more than one billion 1s, so the matrix is extremely sparse. What's a better representation? We only record the 1 positions; i.e., we don't need to know which documents do not have a term, only those that do. Why?
  • Slide 38
  • Inverted index: for each term t, we must store a list of all documents that contain t, identifying each by a docID, a document serial number. Can we use fixed-size arrays for this? Brutus → 1, 2, 4, 11, 31, 45, 173, 174; Caesar → 1, 2, 4, 5, 6, 16, 57, 132; Calpurnia → 2, 31, 54, 101. What happens if we add document 14, which contains Caesar?
  • Slide 39
  • Inverted index: we need variable-size postings lists. On disk, a contiguous run of postings is normal and best; in memory, we can use linked lists or variable-length arrays, with some tradeoffs in size and ease of insertion. Each entry in a postings list (e.g., Brutus → 1, 2, 4, 11, 31, 45, 173, 174; Caesar → 1, 2, 4, 5, 6, 16, 57, 132; Calpurnia → 2, 31, 54, 101) is a posting, and the postings are sorted by docID (more later on why).
  • Slide 40
  • Inverted index construction pipeline: documents to be indexed ("Friends, Romans, countrymen.") → tokenizer → token stream (Friends, Romans, Countrymen) → linguistic modules (stop words, stemming, capitalization, cases, etc.) → modified tokens (friend, roman, countryman) → indexer → inverted index (e.g., friend → 2, 4; roman → 1, 2; countryman → 13, 16).
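  • A rough sketch of the tokenizer and linguistic-modules stage, with an illustrative stop-word list and a deliberately naive suffix-stripping step standing in for a real stemmer such as the Porter stemmer:

```python
# Sketch only: lowercase tokenization, stop-word removal, naive "stemming".
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "i"}   # illustrative list

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())           # token stream

def normalize(tokens):
    modified = []
    for token in tokens:
        if token in STOP_WORDS:
            continue                                       # drop stop words
        token = re.sub(r"s$", "", token)                   # naive plural stripping
        modified.append(token)
    return modified

print(normalize(tokenize("Friends, Romans, countrymen.")))
# -> ['friend', 'roman', 'countrymen']  (close to the modified tokens on the slide)
```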
  • Slide 41
  • Indexer steps: token sequence. Produce the sequence of (modified token, document ID) pairs. Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."
  • Slide 42
  • Indexer steps: sort. Sort the pairs by term, and then by docID. This is the core indexing step.
  • Slide 43
  • Indexer steps: dictionary & postings. Multiple entries for a term within a single document are merged, the result is split into a dictionary and postings, and document frequency information is added.
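  • A compact sketch of these indexer steps on the two example documents (tokenization is hand-simplified here; the real pipeline would apply the linguistic modules first):

```python
# Sketch only: collect (term, docID) pairs, sort, merge into dictionary + postings.
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: token sequence of (modified token, docID) pairs.
pairs = [(term.lower(), doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicates per document into postings lists.
postings = defaultdict(list)          # term -> sorted list of docIDs
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

doc_freq = {term: len(plist) for term, plist in postings.items()}
# postings["caesar"] -> [1, 2]; doc_freq["caesar"] -> 2
```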
  • Slide 44
  • Where do we pay in storage? In the dictionary (terms and counts), in the postings (lists of docIDs), and in the pointers from dictionary entries to their postings lists.
  • Slide 45
  • Storage: a small diversion. Computer storage comprises processor caches, main memory, and external storage (hard disk, other devices), with very substantial differences in access speeds. Processor caches: mostly used by the operating system for rapid access to data that will be needed soon. Main memory: limited quantities, high-speed access. Hard disk: much larger quantities, restricted speed, access in fixed units (blocks).
  • Slide 46
  • Some size examples. From a MacBook Pro: memory, 8GB (two 4GB SO-DIMMs) of 1333MHz DDR3 SDRAM; hard drive, 500GB 7200-rpm Serial ATA. The cloud: a 2 TB PogoPlug; Dropbox, iCloud, etc. These are big numbers, but the potential size of a significant collection is larger still. The steps taken to optimize the use of storage are critical to satisfactory response time.
  • Slide 47
  • Implications of size limits. [Diagram: real memory (RAM) divided into a small number of slots, and a much larger virtual memory on disk divided into pages. A reference from virtual page 2 to virtual page 71 generates a page fault, and virtual page 71 is brought from disk into real memory.]
  • Slide 48
  • How do we process a query? Using the index we just built, examine the terms in some order, looking for the terms in the query.
  • Slide 49
  • Query processing: AND. Consider processing the query Brutus AND Caesar. Locate Brutus in the dictionary and retrieve its postings; locate Caesar in the dictionary and retrieve its postings; then merge the two postings lists: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34.
  • Slide 50
  • The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries. For Brutus → 2, 4, 8, 16, 32, 64, 128 and Caesar → 1, 2, 3, 5, 8, 13, 21, 34, the merge yields 2, 8. If the list lengths are x and y, the merge takes O(x+y) operations. What does that mean? Crucial: the postings are sorted by docID.
  • Slide 51
  • Intersecting two postings lists (a merge algorithm), sketched below.
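  • The standard two-pointer intersection of two sorted postings lists, shown here in Python; the postings values are the ones from the preceding slides:

```python
# Classic merge: always advance the pointer on the smaller docID, so the
# work is O(x + y) for lists of lengths x and y.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # -> [2, 8], as in the merge example above
```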
  • Slide 52
  • A small exercise. Assume the lists of documents shown here: Brutus → 2, 4, 8, 16, 32, 128; Antony → 1, 3, 5, 8, 13, 16, 21, 34. Show the processing of the query Brutus AND Antony.
  • Slide 53
  • Boolean queries: exact match. The Boolean retrieval model is being able to ask a query that is a Boolean expression: Boolean queries use AND, OR, and NOT to join query terms. The model views each document as a set of words and is precise: a document matches the condition or it does not. It is perhaps the simplest model to build, and it was the primary commercial retrieval tool for three decades. Many search systems you still use are Boolean: email, library catalogs, Mac OS X Spotlight.
  • Slide 54
  • Ranked retrieval. Thus far, our queries have all been Boolean: documents either match or don't. That is good for expert users with a precise understanding of their needs and the collection, and also good for applications, which can easily consume thousands of results. But writing the precise query that will produce the desired result is difficult for most users.
  • Slide 55
  • Problem with Boolean search: feast or famine. Boolean queries often result in either too few (= 0) or too many (1000s) results. A query that is too broad yields hundreds of thousands of hits; a query that is too narrow may yield no hits. It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few, OR gives too many.
  • Slide 56
  • Ranked retrieval models: rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query. Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language. In principle these are different options, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa.
  • Slide 57
  • Feast or famine: not a problem in ranked retrieval. When a system produces a ranked result set, large result sets are not an issue; indeed, the size of the result set is not an issue. We just show the top k (≈ 10) results, and we don't overwhelm the user. Premise: the ranking algorithm works.
  • Slide 58
  • Scoring as the basis of ranked retrieval: we wish to return, in order, the documents most likely to be useful to the searcher. How can we rank-order the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each document; this score measures how well the document and query match.
  • Slide 59
  • Query-document matching scores: we need a way of assigning a score to a query/document pair. Let's start with a one-term query. If the query term does not occur in the document, the score should be 0; the more frequent the query term is in the document, the higher the score should be. We will look at a number of alternatives for this.
  • Slide 60
  • Take 1: Jaccard coefficient, a commonly used measure of the overlap of two sets A and B: jaccard(A,B) = |A ∩ B| / |A ∪ B|. jaccard(A,A) = 1, and jaccard(A,B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. It always assigns a number between 0 and 1.
  • Slide 61
  • Jaccard coefficient: Scoring example What is the query-document match score that the Jaccard coefficient computes for each of the two documents below? Query: ides of march Document 1: caesar died in march Document 2: the long march Which do you think is probably the better match?
  • Slide 62
  • Jaccard example, worked. Query: ides of march. Document 1: caesar died in march. Document 2: the long march. A = {ides, of, march}, B1 = {caesar, died, in, march}, B2 = {the, long, march}. jaccard(A, B1) = 1/6 and jaccard(A, B2) = 1/5, so Document 2 scores slightly higher.
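  • A direct implementation of the coefficient, reproducing the two scores above:

```python
# Jaccard coefficient over term sets.
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()
print(jaccard(query, doc1))   # 1/6 ≈ 0.167
print(jaccard(query, doc2))   # 1/5 = 0.2
```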
  • 3 documents example, continued. Log frequency weighting:
        term        SaS    PaP    WH
        affection   3.06   2.76   2.30
        jealous     2.00   1.85   2.04
        gossip      1.30   0      1.78
        wuthering   0      0      2.58
    After length normalization:
        term        SaS    PaP    WH
        affection   0.789  0.832  0.524
        jealous     0.515  0.555  0.465
        gossip      0.335  0      0.405
        wuthering   0      0      0.588
    cos(SaS,PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0.0 + 0.0×0.0 ≈ 0.94; cos(SaS,WH) ≈ 0.79; cos(PaP,WH) ≈ 0.69. Why do we have cos(SaS,PaP) > cos(SaS,WH)?
  • Slide 90
  • tf-idf weighting has many variants Columns headed n are acronyms for weight schemes. Why is the base of the log in idf immaterial?
  • Slide 91
  • Weighting may differ in queries vs. documents. Many search engines allow different weightings for queries and for documents. SMART notation denotes the combination in use in an engine as ddd.qqq, using the acronyms of the previous table. A very standard weighting scheme is lnc.ltc. Document: logarithmic tf (l as the first character), no idf (n), and cosine normalization (c). Query: logarithmic tf (l in the leftmost column), idf (t in the second column), and cosine normalization (c).
  • Slide 92
  • tf-idf example: lnc.ltc. Document: "car insurance auto insurance". Query: "best car insurance".
        term        query: tf-raw  tf-wt  df      idf  wt    n'lize   document: tf-raw  tf-wt  wt    n'lize   prod
        auto               0       0      5000    2.3  0     0                   1       1      1     0.52     0
        best               1       1      50000   1.3  1.3   0.34                0       0      0     0        0
        car                1       1      10000   2.0  2.0   0.52                1       1      1     0.52     0.27
        insurance          1       1      1000    3.0  3.0   0.78                2       1.3    1.3   0.68     0.53
    Exercise: what is N, the number of docs? Doc length = sqrt(0² + 1² + 1² + 1.3²) ≈ 1.92. Score = 0 + 0 + 0.27 + 0.53 = 0.8.
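  • A short sketch reproducing that lnc.ltc computation; N = 1,000,000 is an assumption implied by df(auto) = 5000 together with idf(auto) = 2.3 (which also answers the exercise):

```python
# Sketch only: lnc.ltc scoring for the example above.
from math import log10, sqrt

N = 1_000_000
df = {"auto": 5000, "best": 50_000, "car": 10_000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}          # "best car insurance"
doc_tf = {"car": 1, "insurance": 2, "auto": 1}            # "car insurance auto insurance"

def log_tf(tf):
    return 1 + log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    length = sqrt(sum(w * w for w in weights.values()))   # vector length
    return {t: w / length for t, w in weights.items()}

# Query side (ltc): logarithmic tf * idf, then cosine normalization.
query_wt = cosine_normalize({t: log_tf(tf) * log10(N / df[t]) for t, tf in query_tf.items()})
# Document side (lnc): logarithmic tf only, then cosine normalization.
doc_wt = cosine_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

score = sum(query_wt[t] * doc_wt.get(t, 0.0) for t in query_wt)
print(round(score, 2))   # -> 0.8, matching the table
```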
  • Slide 93
  • Summary vector space ranking Represent the query as a weighted tf-idf vector Represent each document as a weighted tf-idf vector Compute the cosine similarity score for the query vector and each document vector Rank documents with respect to the query by score Return the top K (e.g., K = 10) to the user
  • Slide 94
  • The web and its challenges: unusual and diverse documents; unusual and diverse users, queries, and information needs. Beyond terms, exploit ideas from social networks: link analysis, clickstreams. How do search engines work, and how can we make them better? See http://video.google.com/videoplay?docid=-1243280683715323550&hl=en# (Marissa Mayer of Google on how a search happens at Google).
  • Slide 95
  • References: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008; available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html. Many of these slides are taken directly from the slides the authors provide to accompany the first chapter of the book. Paging figure from Vittore Casarosa. Term weighting and cosine similarity tutorial: http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html