Introduction to Information Retrieval
Artificial Intelligence, Avi Rosenfeld
Copyright © Victor Lavrenko, Hopkins IR Workshop 2005
What is Information Retrieval?
• Searching for information on the Internet ("Rabbi Dr. Google"? BING?)
  – a flourishing and important research field
• Searching for relevant information from many sources (websites, images, etc.)
What is the difference between queries in databases and information retrieval?

            Databases                           Information Retrieval
Data        Structured (key)                    Unstructured (web)
Fields      Clear semantics (SSN, age)          No fields (other than text)
Queries     Defined (relational algebra, SQL)   Free text ("natural language"), Boolean
Matching    Exact (results are always           Imprecise (need to measure
            "correct")                          effectiveness)
The Architecture of an Information Retrieval System

[Diagram: a Query String and a Document corpus go into the IR System, which returns Ranked Documents: 1. Doc1, 2. Doc2, 3. Doc3, …]
Ranking Results
• Early IR focused on set-based retrieval
  – Boolean queries: a set of conditions to be satisfied
  – a document either matches the query or not
    • like classifying the collection into relevant / non-relevant sets
  – still used by professional searchers
  – "advanced search" in many systems
• Modern IR: ranked retrieval
  – a free-form query expresses the user's information need
  – rank documents by decreasing likelihood of relevance
  – many studies show it is superior
Examples

Stages of a Search Engine
• Building the document collection (Web crawler)
• Building the indexes (Index)
  – cleaning the data of duplicates, STEMMING
• Building the answer
  – processing the query (removing STOP WORDS)
  – ranking the results (PAGERANK)
• Analyzing the results
  – FALSE POSITIVE / FALSE NEGATIVE
  – Recall / Precision
Indexing Process
• Text acquisition
  – identifies and stores documents for indexing
• Text transformation
  – transforms documents into index terms or features
• Index creation
  – takes index terms and creates data structures (indexes) to support fast searching
Web Crawler
– Identifies and acquires documents for the search engine
– http://en.wikipedia.org/wiki/Web_crawler
• A web crawler is a type of bot, a program that scans the WWW automatically and systematically.
• A selection policy defines which pages to download.
• A re-visit policy defines when to check pages for changes.
• A politeness policy defines how to avoid overloading sites and crashing their servers.
• A parallelization policy defines how to coordinate the different crawlers.
Text Acquisition
• Feeds
  – Real-time streams of documents
    • e.g., web feeds for news, blogs, video, radio, TV
  – RSS is a common standard
    • an RSS "reader" can provide new XML documents to the search engine
• Conversion
  – Convert a variety of documents into a consistent text-plus-metadata format
    • e.g., HTML, XML, Word, PDF, etc. → XML
  – Convert text encoding for different languages
    • using a Unicode standard like UTF-8
(http://www.search-engines-book.com/slides/)
Controlling Crawling
• Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
• A Robots.txt file can be used to control crawlers
Simple Crawler Thread
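A minimal sketch of the crawler loop in Python. The `fetch` callback and the in-memory `web` dict are illustrative stand-ins for real HTTP requests; a production thread would also honor robots.txt and the politeness delays described above.

```python
from collections import deque

def crawl(seed, fetch, max_pages=100):
    """Single-threaded crawler loop: pop a URL from the frontier,
    fetch it, store its text, and enqueue any unseen outlinks."""
    frontier = deque([seed])          # URLs waiting to be visited
    seen = {seed}                     # never enqueue the same URL twice
    pages = {}                        # url -> text, handed to the indexer
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        text, outlinks = fetch(url)   # a real crawler does an HTTP GET here
        pages[url] = text
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Tiny in-memory "web" standing in for real HTTP responses.
web = {
    "a.html": ("page a", ["b.html", "c.html"]),
    "b.html": ("page b", ["a.html"]),
    "c.html": ("page c", []),
}
crawled = crawl("a.html", lambda url: web[url])
```

The `seen` set implements the selection policy's "do not re-download" rule; the other crawling policies would hook into the same loop.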
Compression
• Text is highly redundant (or predictable)
• Compression techniques exploit this redundancy to make files smaller without losing any of the content
• Compression of indexes is covered later
• Popular algorithms can compress HTML and XML text by 80%
  – e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
  – may compress large files in blocks to make access faster
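The 80% figure is easy to reproduce on redundant markup with Python's built-in DEFLATE implementation (the sample text below is invented for illustration):

```python
import zlib

# Repetitive HTML/XML-like text is highly redundant, so DEFLATE
# (the algorithm behind zip and gzip) shrinks it dramatically.
text = ("<item><title>tropical fish</title></item>\n" * 200).encode("utf-8")
compressed = zlib.compress(text, 9)
ratio = 1 - len(compressed) / len(text)

# Lossless: decompressing restores the content byte-for-byte.
assert zlib.decompress(compressed) == text
```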
Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
  – A checksum is a value computed from the content of the document
    • e.g., the sum of the bytes in the document file
  – Files with different text can have the same checksum
  – Functions such as the cyclic redundancy check (CRC) have been developed that also consider the positions of the bytes
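A small demonstration of why the byte-sum checksum is weaker than a CRC (the example strings are invented; they contain the same bytes in a different order):

```python
import zlib

def byte_sum(data: bytes) -> int:
    """Naive checksum from the slide: the sum of the bytes."""
    return sum(data)

a = b"dog bites man"
b = b"man bites dog"   # same bytes, different order

# The byte-sum ignores positions, so reordered text collides...
assert byte_sum(a) == byte_sum(b)
# ...while a CRC, which is position-sensitive, tells them apart.
assert zlib.crc32(a) != zlib.crc32(b)
```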
Removing Noise
• Many web pages contain text, links, and pictures that are not directly related to the main content of the page
• This additional material is mostly noise that could negatively affect the ranking of the page
• Techniques have been developed to detect the content blocks in a web page
  – non-content material is either ignored or reduced in importance in the indexing process
Noise Example
Text Transformation
• Parser
  – Processes the sequence of text tokens in the document to recognize structural elements
    • e.g., titles, links, headings, etc.
  – Tokenizer recognizes "words" in the text
    • must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  – Markup languages such as HTML and XML are often used to specify structure and metatags
    • tags are used to specify document elements
      – e.g., <h2> Overview </h2>
    • the document parser uses the syntax of the markup language (or other formatting) to identify structure
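A minimal tokenizer sketch showing one possible set of answers to the issues above; real tokenizers make more nuanced choices (here hyphens and apostrophes simply become separators):

```python
import re

def tokenize(text):
    """Very simple tokenizer: lowercase the text, then split on anything
    that is not a letter or digit, so hyphens and apostrophes act as
    separators (one of the design choices a real tokenizer must weigh)."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("O'Brien's over-view"))
# → ['o', 'brien', 's', 'over', 'view']
```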
Index Creation
• Document statistics
  – Gathers counts and positions of words and other features
  – Used in the ranking algorithm
• Weighting
  – Computes weights for index terms
  – Used in the ranking algorithm
  – e.g., the tf.idf weight
    • a combination of term frequency in the document and inverse document frequency in the collection
Zipf's Law
• The distribution of word frequencies is very skewed
  – a few words occur very often; many words hardly ever occur
  – e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
• Zipf's "law":
  – the observation that the rank (r) of a word times its frequency (f) is approximately a constant (k), assuming words are ranked in order of decreasing frequency
  – i.e., r · f ≈ k, or r · Pr ≈ c, where Pr is the probability of word occurrence and c ≈ 0.1 for English
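The r · Pr product can be checked on any word-count table. The toy counts below are invented (chosen to roughly follow f = k / r); on a real corpus such as AP89 the products likewise cluster around one value:

```python
from collections import Counter

def rank_times_prob(counts):
    """Zipf predicts r * Pr is roughly constant when words are ranked
    by decreasing frequency (c is about 0.1 for English corpora)."""
    total = sum(counts.values())
    ranked = counts.most_common()       # sorted by decreasing frequency
    return [(word, rank * n / total)
            for rank, (word, n) in enumerate(ranked, start=1)]

# Toy counts for illustration only.
counts = Counter({"the": 100, "of": 50, "to": 33, "and": 25})
for word, product in rank_times_prob(counts):
    print(word, round(product, 2))      # the products are all near 0.48
```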
Top 50 Words from AP89
Term Frequency (TF)
• Observation:
  – key words tend to be repeated in a document
• Modify our similarity measure:
  – give more weight if a word occurs multiple times

    sim(D, Q) = Σ_{q ∈ Q∩D} tf_D(q)

• Problem:
  – biased towards long documents
  – spurious occurrences
  – normalize by length:

    sim(D, Q) = Σ_{q ∈ Q∩D} tf_D(q) / |D|
Inverse Document Frequency (IDF)
• Observation:
  – rare words carry more meaning: cryogenic, apollo
  – frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure:
  – give more weight to rare words
    … but don't be too aggressive (why?)
  – df(q) … total number of documents that contain q

    sim(D, Q) = Σ_{q ∈ Q∩D} ( tf_D(q) / |D| ) · log( |C| / df(q) )

  where |C| is the number of documents in the collection
tf*idf weighting scheme
• tf = term frequency
  – frequency of a term/keyword in a document
  – the higher the tf, the higher the importance (weight) for the document
• df = document frequency
  – the number of documents containing the term
  – the distribution of the term
• idf = inverse document frequency
  – the unevenness of the term's distribution in the corpus
  – the specificity of the term to a document
  – the more evenly the term is distributed, the less specific it is to a document

    weight(t, D) = tf(t, D) * idf(t)
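A runnable sketch of weight(t, D) = tf(t, D) * idf(t), using idf(t) = log(|C| / df(t)) as in the similarity formula above. The four toy documents are invented for illustration:

```python
import math

def tfidf(term, doc, corpus):
    """weight(t, D) = tf(t, D) * idf(t), with idf(t) = log(|C| / df(t))."""
    tf = doc.count(term)                    # occurrences in this document
    df = sum(term in d for d in corpus)     # documents containing the term
    return tf * math.log(len(corpus) / df)

docs = [["tropical", "fish", "tank"],
        ["tropical", "storm"],
        ["fish", "and", "chips"],
        ["tank", "tropical", "tropical"]]

# "tropical" occurs twice in docs[3] but appears in 3 of 4 documents,
# so its low idf outweighs its high tf; the rarer "tank" wins.
assert tfidf("tank", docs[3], docs) > tfidf("tropical", docs[3], docs)
```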
Stopwords / Stoplist
• Function words do not carry useful information for IR:
  of, in, about, with, I, although, …
• Stoplist: contains stopwords, which are not used as index terms
  – prepositions
  – articles
  – pronouns
  – some adverbs and adjectives
  – some frequent words (e.g., "document")
• The removal of stopwords usually improves IR effectiveness
• A few "standard" stoplists are commonly used
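Stopword removal is a simple filter against the stoplist. The stoplist below is a tiny invented sample; production systems use standard lists of a few hundred entries:

```python
# A tiny illustrative stoplist (real lists are much longer).
STOPLIST = {"the", "of", "in", "about", "with", "i", "although", "a"}

def remove_stopwords(tokens):
    """Drop any token that appears in the stoplist."""
    return [t for t in tokens if t.lower() not in STOPLIST]

tokens = "the history of information retrieval in a nutshell".split()
print(remove_stopwords(tokens))
# → ['history', 'information', 'retrieval', 'nutshell']
```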
Stemming
• Reason:
  – different word forms may carry similar meaning (e.g., search, searching): create a "standard" representation for them
• Stemming:
  – removing some endings of a word:
    computer, compute, computes, computing, computed, computation → comput
Stemming
• Generally a small but significant effectiveness improvement
  – can be crucial for some languages
  – e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Stemming
• Two basic types
  – Dictionary-based: uses lists of related words
  – Algorithmic: uses a program to determine related words
• Algorithmic stemmers
  – suffix-s: remove 's' endings, assuming plurals
    • e.g., cats → cat, lakes → lake, wiis → wii
    • many false negatives: supplies → supplie
    • some false positives: ups → up
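The suffix-s stemmer is one line of code, which is exactly why it makes the errors listed above:

```python
def suffix_s(word):
    """The suffix-s stemmer from the slide: strip a trailing 's',
    assuming it marks a plural."""
    return word[:-1] if word.endswith("s") else word

assert suffix_s("cats") == "cat"
assert suffix_s("lakes") == "lake"
assert suffix_s("supplies") == "supplie"   # false negative: not "supply"
assert suffix_s("ups") == "up"             # false positive: wrong conflation
```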
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)
• Step 1: plurals and past participles
  – SSES → SS                caresses → caress
  – (*v*) ING →              motoring → motor
• Step 2: adj→n, n→v, n→adj, …
  – (m>0) OUSNESS → OUS      callousness → callous
  – (m>0) ATIONAL → ATE      relational → relate
• Step 3:
  – (m>0) ICATE → IC         triplicate → triplic
• Step 4:
  – (m>1) AL →               revival → reviv
  – (m>1) ANCE →             allowance → allow
• Step 5:
  – (m>1) E →                probate → probat
  – (m>1 and *d and *L) → single letter    controll → control
N-Grams
• Frequent n-grams are more likely to be meaningful phrases
• N-grams form a Zipf distribution
  – a better fit than words alone
• Could index all n-grams up to a specified length
  – much faster than POS tagging
  – uses a lot of storage
    • e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
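The 3,990 figure follows from (1000−1) + (1000−2) + (1000−3) + (1000−4), which a short extractor confirms:

```python
def ngrams(tokens, n):
    """All contiguous word n-grams of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Sanity-check the slide's arithmetic: a 1,000-word document yields
# 999 + 998 + 997 + 996 = 3,990 n-grams of length 2..5.
tokens = ["w%d" % i for i in range(1000)]
total = sum(len(ngrams(tokens, n)) for n in range(2, 6))
assert total == 3990
```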
Google N-Grams
• Web search engines index n-grams
• Google sample:
• The most frequent trigram in English is "all rights reserved"
  – in Chinese, "limited liability corporation"
Document Structure and Markup
• Some parts of documents are more important than others
• The document parser recognizes structure using markup, such as HTML tags
  – headers, anchor text, and bolded text are all likely to be important
  – metadata can also be important
  – links are used for link analysis
Example Web Page
Link Analysis
• Links are a key component of the Web
• Important for navigation, but also for search
  – e.g., <a href="http://example.com">Example website</a>
  – "Example website" is the anchor text
  – "http://example.com" is the destination link
  – both are used by search engines
PageRank in Google
• Assign a numeric value to each page
• The more a page is referred to by important pages, the more important this page is

    PR(A) = (1 − d) + d · Σ_{I_i ∈ B_A} PR(I_i) / C(I_i)

  where B_A is the set of pages linking to A, C(I_i) is the number of outgoing links of page I_i, and d is a damping factor (0.85)
• Many other criteria: e.g., proximity of query words
  – "…information retrieval…" is better than "…information … retrieval…"
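The formula above can be solved by fixed-point iteration. This sketch uses an invented three-page link graph and the non-normalized variant from the slide, so scores need not sum to 1:

```python
def pagerank(links, d=0.85, iters=60):
    """Iterate PR(A) = (1-d) + d * sum over pages I linking to A
    of PR(I) / C(I), where C(I) is I's outlink count."""
    pr = {p: 1.0 for p in links}
    out = {p: len(targets) for p, targets in links.items()}
    for _ in range(iters):
        pr = {p: (1 - d) + d * sum(pr[q] / out[q]
                                   for q, targets in links.items()
                                   if p in targets)
              for p in links}
    return pr

# A is linked from both B and C; C has no inlinks at all.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
pr = pagerank(links)
assert pr["A"] > pr["B"] > pr["C"]   # more inlinks from good pages → higher PR
```

A page with no inlinks converges to exactly 1 − d = 0.15, the floor every page receives from the damping term.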
PageRank
• Billions of web pages, some more informative than others
• Links can be viewed as information about the popularity (authority?) of a web page
  – can be used by the ranking algorithm
• Inlink count could be used as a simple measure
• Link analysis algorithms like PageRank provide more reliable ratings
  – less susceptible to link spam
Indexes
• Indexes are data structures designed to make search faster
• Text search has unique requirements, which lead to unique data structures
• The most common data structure is the inverted index
  – a general name for a class of structures
  – "inverted" because documents are associated with words, rather than words with documents
    • similar to a concordance
Indexes and Ranking
• Indexes are designed to support search
  – faster response time, supports updates
• Text search engines use a particular form of search: ranking
  – documents are retrieved in sorted order according to a score computed using the document representation, the query, and a ranking algorithm
• What is a reasonable abstract model for ranking?
  – enables discussion of indexes without the details of the retrieval model
Abstract Model of Ranking
Inverted Index
• Each index term is associated with an inverted list
  – contains a list of documents, or a list of word occurrences in documents, and other information
  – each entry is called a posting
  – the part of the posting that refers to a specific document or location is called a pointer
  – each document in the collection is given a unique number
  – lists are usually document-ordered (sorted by document number)
Inverted List Information to be Published

Word (key)   Addresses
a            111-1111, 111-1112, 111-1113, 111-1114, 111-1115, 111-1116
aardvark     111-4323
the          111-1111, 111-1112, 111-1113, 111-1114, 111-1115, 111-1116
zoo          123-4214, 123-9714, 333-9714
zygote       548-4342
Simple Inverted Index
Inverted Indexwith counts
• supports better ranking algorithms
Inverted Indexwith positions
• supports proximity matches
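The three index variants above can be built with one pass over the collection; keeping full position lists subsumes the count-only index, since the count per document is just the length of the position list. The two sample documents are invented:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index with positions: each term maps to a document-ordered
    list of postings (doc_number, [word positions])."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):   # unique document numbers
        positions = defaultdict(list)
        for pos, term in enumerate(text.split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))  # document-ordered
    return index

docs = ["tropical fish include fish found in tropical environments",
        "fish tank"]
index = build_index(docs)
print(index["fish"])
# → [(1, [1, 3]), (2, [0])]; the per-document count is len(positions)
```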
Proximity Matches
• Matching phrases or words within a window
  – e.g., "tropical fish", or "find tropical within 5 words of fish"
• Word positions in inverted lists make these types of query features efficient
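Given positional postings, a window match reduces to comparing position lists within one document (the sentence and position lists below are invented):

```python
def within_window(pos_a, pos_b, k):
    """True if some occurrence of term A falls within k words of some
    occurrence of term B, given their position lists in one document."""
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)

# Positions in: "tropical fish are fish from tropical seas"
tropical, fish = [0, 5], [1, 3]
assert within_window(tropical, fish, 1)   # "tropical fish" is adjacent
assert not within_window([0], [9], 5)     # occurrences too far apart
```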
Query Process
• User interaction
  – supports creation and refinement of the query, display of results
• Ranking
  – uses the query and indexes to generate a ranked list of documents
• Evaluation
  – monitors and measures effectiveness and efficiency (primarily offline)
Evaluation – False Positive / Negative

                      Predicted Positive (A)   Predicted Negative (B)
Known Positive (A)    True Positive (TP)       False Negative (FN)
Known Negative (B)    False Positive (FP)      True Negative (TN)
Definitions

Measure       Formula                           Intuitive Meaning
Precision     TP / (TP + FP)                    The percentage of positive predictions that are correct.
Recall        TP / (TP + FN)                    The percentage of positively labeled instances that were predicted as positive.
Specificity   TN / (TN + FP)                    The percentage of negatively labeled instances that were predicted as negative.
Accuracy      (TP + TN) / (TP + TN + FP + FN)   The percentage of predictions that are correct.
Example

                      Predicted Positive (A)   Predicted Negative (B)
Known Positive (A)    500                      100
Known Negative (B)    500                      10,000

Precision = 50% (500/1000)
Recall = 83% (500/600)
Accuracy = 95% (10,500/11,100)
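The quoted recall (500/600) and accuracy (10,500/11,100) both imply FN = 100 for this confusion matrix; plugging the counts into the definitions reproduces the figures:

```python
def metrics(tp, fn, fp, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, accuracy

p, r, a = metrics(tp=500, fn=100, fp=500, tn=10_000)
assert p == 0.5                # 500 / 1000
assert round(r, 2) == 0.83     # 500 / 600
assert round(a, 2) == 0.95     # 10,500 / 11,100
```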
General form of precision/recall

[Figure: precision vs. recall curve, both axes from 0.0 to 1.0]
• Precision changes with recall (it is not a fixed point)
• Systems cannot be compared at a single precision/recall point
• Average precision (over 11 points of recall: 0.0, 0.1, …, 1.0)
Effectiveness Measures
A is the set of relevant documents, B is the set of retrieved documents
Classification Errors
• False Positive (Type I error)
  – a non-relevant document is retrieved
• False Negative (Type II error)
  – a relevant document is not retrieved
  – false-negative rate = 1 − Recall
• Precision is used when the probability that a positive result is correct is important
Evaluation Metrics (set-based)
• Precision
  – the proportion of the retrieved set that is relevant
  – Precision = |relevant ∩ retrieved| ÷ |retrieved| = P( relevant | retrieved )
• Recall
  – the proportion of all relevant documents in the collection included in the retrieved set
  – Recall = |relevant ∩ retrieved| ÷ |relevant| = P( retrieved | relevant )
• Single-number measures:
  – F1 = 2PR / (P + R) … the harmonic mean of precision and recall
  – Breakeven: (rank-based) the point where Recall = Precision
Caching
• Query distributions are similar to Zipf
  – about half of the queries each day are unique, but some are very popular
• Caching can significantly improve efficiency
  – cache popular query results
  – cache common inverted lists
• Inverted list caching can help with unique queries
• The cache must be refreshed to prevent stale data
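A sketch of a result cache with least-recently-used eviction, which keeps the popular (Zipf head) queries warm while unique queries pass through; the class name and sample queries are invented:

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for popular query results. Real engines also cache
    inverted lists and refresh entries so results do not go stale."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, query):
        if query not in self.data:
            return None                    # miss: run the query pipeline
        self.data.move_to_end(query)       # mark as most recently used
        return self.data[query]

    def put(self, query, results):
        self.data[query] = results
        self.data.move_to_end(query)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = QueryCache(capacity=2)
cache.put("tropical fish", [1, 4, 7])
cache.put("fish tank", [2])
cache.get("tropical fish")                 # popular query stays warm
cache.put("zipf law", [9])                 # evicts "fish tank"
assert cache.get("fish tank") is None
```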
Search Engine Optimization (SEO)