Modern Information Retrieval Chapter 7: Text Processing.


Transcript of Modern Information Retrieval Chapter 7: Text Processing.

Page 1: Modern Information Retrieval Chapter 7: Text Processing.

Modern Information Retrieval

Chapter 7: Text Processing

Page 2: Modern Information Retrieval Chapter 7: Text Processing.

Overview

1. Document pre-processing
   1. Lexical analysis
   2. Stopword elimination
   3. Stemming
   4. Index-term selection
   5. Thesauri

2. Text compression
   1. Statistical methods
   2. Huffman coding
   3. Dictionary methods
   4. Ziv-Lempel compression

Page 3: Modern Information Retrieval Chapter 7: Text Processing.

Document Preprocessing

• Document pre-processing is the process of incorporating a new document into an information retrieval system.

• The goal is to
  – Represent the document efficiently, in terms of both space (for storing the document) and time (for processing retrieval requests) requirements.
  – Maintain good retrieval performance (precision and recall).

• Document pre-processing is a complex process that leads to the representation of each document by a select set of index terms.

• However, some Web search engines are giving up on much of this process and index all (or virtually all) the words in a document.

Page 4: Modern Information Retrieval Chapter 7: Text Processing.

Document Preprocessing (cont.)

• Document pre-processing includes 5 stages:

1. Lexical analysis
2. Stopword elimination
3. Stemming
4. Index-term selection
5. Construction of thesauri

Page 5: Modern Information Retrieval Chapter 7: Text Processing.

Lexical analysis

• Objective: Determine the words of the document.

• Lexical analysis separates the input alphabet into
  – Word characters (e.g., the letters a-z)
  – Word separators (e.g., space, newline, tab)

• The following decisions may have an impact on retrieval:
  – Digits: Used to be ignored, but the trend now is to identify numbers (e.g., telephone numbers) and mixed strings as words.
  – Punctuation marks: Usually treated as word separators.
  – Hyphens: Should we interpret “pre-processing” as “pre processing” or as “preprocessing”?
  – Letter case: Often ignored, but then a search for “First Bank” (a specific bank) would retrieve a document with the phrase “Bank of America was the first bank to offer its customers…”
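A minimal tokenizer sketch reflecting one possible set of these decisions (the rules below are illustrative choices, not the only reasonable ones): digit strings are kept as words, punctuation and hyphens act as separators, and letter case is folded.

```python
import re

# Illustrative lexical-analysis rules:
# - letters and digits count as word characters
# - punctuation, whitespace and hyphens act as word separators
#   ("pre-processing" -> "pre", "processing")
# - letter case is ignored (everything is lower-cased)
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Pre-processing the 2,500 documents of Bank of America."))
# ['pre', 'processing', 'the', '2', '500', 'documents', 'of', 'bank', 'of', 'america']
```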

Page 6: Modern Information Retrieval Chapter 7: Text Processing.

Stopword elimination

• Objective: Filter out words that occur in most of the documents.

• Such words have no value for retrieval purposes.

• These words are referred to as stopwords. They include
  – Articles (a, an, the, …)
  – Prepositions (in, on, of, …)
  – Conjunctions (and, or, but, if, …)
  – Pronouns (I, you, them, it, …)
  – Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, …)

• A typical stopword list may include several hundred words.

• As seen earlier, the 100 most frequent words add up to about 50% of the words in a document.

• Hence, stopword elimination significantly reduces the size of the indexing structures.
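A sketch of this filtering step over the token stream (the stopword list below is a tiny illustrative sample; real lists contain several hundred entries):

```python
# Tiny illustrative stopword list; real systems use several hundred entries.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "but", "if",
             "i", "you", "them", "it"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "bank", "of", "america", "was", "the", "first", "bank"]))
# ['bank', 'america', 'was', 'first', 'bank']
```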

Page 7: Modern Information Retrieval Chapter 7: Text Processing.

Stemming

• Objective: Replace all the variants of a word with the single stem of the word.

• Variants include plurals, gerund forms (ing-form), third-person suffixes, past-tense suffixes, etc.

• Example: connect: connects, connected, connecting, connection, …

• All have similar semantics and relate to a single concept.

• In parallel, stemming must be performed on the user query.

Page 8: Modern Information Retrieval Chapter 7: Text Processing.

Stemming (cont.)

• Stemming improves
  – Storage and search efficiency: fewer terms are stored.
  – Recall:
    • Without stemming, a query about “connection” matches only documents that contain “connection”.
    • With stemming, the query is about “connect” and matches, in addition, documents that originally had “connects”, “connected”, “connecting”, etc.

• However, stemming may hurt precision, because users can no longer target just a particular form.

• Stemming may be performed using
  – Algorithms that strip off suffixes according to substitution rules.
  – Large dictionaries that provide the stem of each word.
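A toy suffix-stripping stemmer in the spirit of such substitution rules (the rules are illustrative only; production systems typically use something like the Porter stemmer):

```python
# Toy rule-based stemmer: strips a few common suffixes, longest first.
# Illustrative only -- far cruder than the Porter algorithm.
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Require a reasonably long remaining stem before stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connects", "connected", "connecting", "connection"]:
    print(w, "->", stem(w))      # all four variants reduce to "connect"
```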

Page 9: Modern Information Retrieval Chapter 7: Text Processing.

Index term selection (indexing)

• Objective: Increase efficiency by extracting from the resulting document a selected set of terms to be used for indexing the document.
  – If a full-text representation is adopted, then all words are used for indexing.

• Indexing is a critical process: a user's ability to find documents on a particular subject is limited by whether the indexing process created index terms for that subject.

• Indexing can be done manually or automatically.

• Historically, manual indexing was performed by professional indexers associated with library organizations.

• However, automatic indexing is more common now (or, with full-text representations, indexing is avoided altogether).

Page 10: Modern Information Retrieval Chapter 7: Text Processing.

Indexing (cont.)

• Relative advantages of manual indexing:
  – Ability to perform abstractions (conclude what the subject is) and determine additional related terms.
  – Ability to judge the value of concepts.

• Relative advantages of automatic indexing:
  – Reduced cost: once the initial hardware cost is amortized, operational cost is cheaper than wages for human indexers.
  – Reduced processing time.
  – Improved consistency.

• Controlled vocabulary: Index terms must be selected from a predefined set of terms (the domain of the index).
  – Use of a controlled vocabulary helps standardize the choice of terms.
  – Searching is improved, because users know the vocabulary being used.
  – Thesauri can compensate for the lack of a controlled vocabulary.

Page 11: Modern Information Retrieval Chapter 7: Text Processing.

Indexing (cont.)

• Index exhaustivity: the extent to which concepts are indexed.
  – Should we index only the most important concepts, or also more minor concepts?

• Index specificity: the preciseness of the index terms used.
  – Should we use general indexing terms or more specific terms?
  – Should we use the term "computer", "personal computer", or "Gateway E-3400"?

• Main effect:
  – High exhaustivity improves recall (but decreases precision).
  – High specificity improves precision (but decreases recall).

• Related issues:
  – Index the title and abstract only, or the entire document?
  – Should index terms be weighted?

Page 12: Modern Information Retrieval Chapter 7: Text Processing.

Indexing (cont.)

Reducing the size of the index:

• Recall that articles, prepositions, conjunctions, and pronouns have already been removed through a stopword list.
  – Recall that the 100 most frequent words account for 50% of all word occurrences.

• Words that are very infrequent (occur only a few times in a collection) are often removed, under the assumption that they would probably not be in the user's vocabulary.

• Reduction not based on probabilistic arguments: nouns are often preferred over verbs, adjectives, or adverbs.

Page 13: Modern Information Retrieval Chapter 7: Text Processing.

Indexing (cont.)

Indexing may also assign weights to terms.

• Non-weighted indexing:
  – No attempt to determine the value of the different terms assigned to a document.
  – Not possible to distinguish between major topics and casual references.
  – All retrieved documents are equal in value.
  – Typical of commercial systems through the 1980s.

• Weighted indexing:
  – An attempt is made to place a value on each term as a description of the document.
  – This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use this term (lower is better).
  – Query weights and document weights are combined into a value describing the likelihood that a document matches a query.
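A minimal sketch of one such weighting, using the familiar tf-idf idea (the exact formula below is an illustrative assumption; the slides do not fix a particular variant):

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, collection):
    """Weight each term of one document: higher within-document frequency
    raises the weight, wider use across the collection lowers it."""
    n_docs = len(collection)
    tf = Counter(doc_tokens)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for d in collection if term in d)   # documents containing the term
        idf = math.log(n_docs / df)                    # rarer terms score higher
        weights[term] = freq * idf
    return weights

docs = [["huffman", "coding", "text"], ["text", "retrieval"], ["text", "compression"]]
print(tf_idf_weights(docs[0], docs))
# 'huffman' and 'coding' get positive weights; 'text' (in every document) gets 0
```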

Page 14: Modern Information Retrieval Chapter 7: Text Processing.

Thesauri

Objective: Standardize the index terms that were selected.

• In its simplest form, a thesaurus is
  – A list of “important” words (concepts).
  – For each word, an associated list of synonyms.

• A thesaurus may be generic (covering all of English) or concentrate on a particular domain of knowledge.

• The role of a thesaurus in information retrieval:
  – Provide a standard vocabulary for indexing.
  – Help users locate proper query terms.
  – Provide hierarchies for automatic broadening or narrowing of queries.

• Here, our interest is in providing a standard vocabulary (a controlled vocabulary).

• Essentially, in this final stage, each indexing term is replaced by the concept that defines its thesaurus class.
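A minimal sketch of that replacement step, assuming a small hand-built thesaurus (the mapping and function name are purely illustrative):

```python
# Illustrative thesaurus: each index term maps to the concept naming its class.
THESAURUS_CLASS = {
    "automobile": "car", "auto": "car", "car": "car",
    "notebook": "computer", "laptop": "computer", "computer": "computer",
}

def normalize_terms(terms):
    # Unknown terms are kept as-is here (a strict controlled vocabulary
    # might drop them instead).
    return [THESAURUS_CLASS.get(t, t) for t in terms]

print(normalize_terms(["auto", "laptop", "retrieval"]))
# ['car', 'computer', 'retrieval']
```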

Page 15: Modern Information Retrieval Chapter 7: Text Processing.

Text Compression

• Data encoding: Transform encoding units (characters, words, etc.) into code values.
  – The objective is either to
    • Reduce size (compression), or
    • Hide contents (encryption).

• Lossless encoding: The transformation is reversible – the original file can be recovered from the encoded (compressed, encrypted) file.

• Compression ratio:
  – S: size of the uncompressed file.
  – C: size of the compressed file.
  – Compression ratio = C/S.
  – Example:
    • S = 300,000 bytes, C = 100,000 bytes.
    • Compression ratio: 100,000/300,000 = 0.33.

Page 16: Modern Information Retrieval Chapter 7: Text Processing.

Text Compression (cont.)

• Advantages of compression:
  – Reduction in storage size.
  – Reduction in transmission time.
  – Reduction in processing time (e.g., searching).

• Disadvantages:
  – Requires time for compression/decompression.
  – Processing of compressed text is more complex.

• Specific to information retrieval:
  – Decompression time is often more critical than compression time.
    • Unlike transmission-motivated compression (modems), documents in an information retrieval system are encoded once and decoded many times.
  – Prefer compression techniques that allow searching in the compressed file (without decompressing it).

Page 17: Modern Information Retrieval Chapter 7: Text Processing.

Text compression (cont.)

Basic methods:

• Statistical methods:
  – Estimate the probability of occurrence of each encoding unit (character or word) in the alphabet.
  – Assign codes to units: more frequent units are assigned shorter codes.
  – In information retrieval, word encoding is preferred over character encoding.

• Dictionary methods:
  – Substitute a phrase (string of units) with a pointer to a dictionary or to a previous occurrence of the phrase.
  – Compression is achieved because the pointer is shorter than the phrase.

Page 18: Modern Information Retrieval Chapter 7: Text Processing.

Statistical methods

• Recall from the discussion of information theory:
  – Assume a message from an alphabet of n symbols.
  – Assume that the probability of the i-th symbol is p_i.
  – The average information content (entropy) is:

    E = \sum_{i=1}^{n} p_i \log_2(1/p_i)

  – Optimal encoding is achieved when a symbol with probability p_i is assigned a code whose length is \log_2(1/p_i) = -\log_2(p_i).
  – Hence, E also represents the optimal average code length (measured in bits per character).
  – Therefore, E is the lower bound on compression.
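As a quick illustration, a few lines of Python that compute E for a given distribution (the probabilities are made up for the example):

```python
import math

def entropy(probs):
    """Average information content in bits per symbol:
    E = sum_i p_i * log2(1 / p_i)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative distribution over 4 symbols.
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits per symbol
```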

Page 19: Modern Information Retrieval Chapter 7: Text Processing.

Statistical methods (cont.)

• Statistical methods must first estimate the frequencies of the encoding units, and then assign codes based on these frequencies.

• Approaches:
  – Static: Use a single distribution for all texts.
    • Fast, but not optimal because different texts exhibit different distributions.
    • The encoding table is stored in the application (not in the text).
    • Decompression can start at any point in the file.
  – Dynamic: Determine the frequencies in a preliminary pass.
    • Excellent compression, but a total of two passes is required.
    • The encoding table is stored at the beginning of the text.
    • Decompression can start at any point in the file.
  – Adaptive: Progressively learn the distribution of the text while compressing; each character is encoded on the basis of the preceding characters in the text.
    • Fast, and close to optimal compression.
    • Decompression must start from the beginning.

Page 20: Modern Information Retrieval Chapter 7: Text Processing.

Huffman coding

• General:
  – Huffman coding is one of the best-known compression techniques (1952).
  – It is used in the Unix programs pack/unpack.
  – It is a statistical method based on variable-length codes.
  – Compression is achieved by assigning shorter codes to more frequent units.
  – Decompression is unique because no code is the prefix of another.
  – Encoding units may be either bytes or words.
  – Does not exploit the dependencies between the encoding units.
  – Yields optimum average code length when these units are independent.
  – Can be used with the static, dynamic, and adaptive approaches.

Page 21: Modern Information Retrieval Chapter 7: Text Processing.

Huffman coding (cont.)

• Method:
  1. Build a table of the encoding units and their frequencies (probabilities).
  2. Combine the two least frequent units into a new unit whose probability is the sum of their probabilities.
  3. Repeat this process until the entire dictionary is represented by a root whose probability is 1.0.
  4. When there is a tie for the two least frequent units, any tie-breaking procedure is acceptable.

    Unit1: p1   Unit2: p2   →   New unit: p1 + p2
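A compact sketch of this construction using a priority queue (the symbol probabilities are illustrative, and the exact 0/1 assignment depends on tie-breaking):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code: repeatedly merge the two least frequent units."""
    # Each heap entry: (probability, tie_breaker, {symbol: partial_code}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least frequent unit
        p2, _, codes2 = heapq.heappop(heap)   # second least frequent unit
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman_code({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1})
print(code)
# {'a': '0', 'b': '10', 'd': '110', 'c': '111'} -- frequent symbols get shorter codes
```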

Page 22: Modern Information Retrieval Chapter 7: Text Processing.

Huffman coding (cont.)

Example: (figure showing the step-by-step Huffman tree construction; not transcribed)

Page 23: Modern Information Retrieval Chapter 7: Text Processing.

Huffman coding (cont.)

• Example (cont.):
  – The resulting code: (code table not transcribed)
  – Average code length: \sum_{i=1}^{10} p_i l_i = 3.05 bits.
  – The entropy (compression lower bound) is: \sum_{i=1}^{10} p_i \log_2(1/p_i) = 3.01 bits.
  – A fixed code length would have required \log_2 10 = 3.32 bits (which, in practice, would require 4 bits).
  – Compression ratio: C/S = 3.05/3.32 = 0.92.

Page 24: Modern Information Retrieval Chapter 7: Text Processing.

Huffman coding (cont.)

• Example: When the letters A-Z are thus encoded:
  – Code lengths are between 3 and 10 bits.
  – Average code length is 4.12 bits.
  – A fixed code would have required log2 26 = 4.70 bits (i.e., 5 bits).

• More compression is obtained by encoding words:
  – When the 800 most frequent English words (a small table!) are encoded in this method (all other words are left in plain ASCII), 40-50% compression has been reported.

• Huffman codes are prefix codes:
  – No code is the beginning of another code.
  – Hence, a left-to-right decoding operation is unique.
  – It is possible to search the compressed text.

Page 25: Modern Information Retrieval Chapter 7: Text Processing.

Dictionary methods

• Dictionary methods construct a dictionary of phrases, and replace their occurrences with dictionary pointers.

• The choice of phrases may be static, dynamic, or adaptive.

• A simple method (digrams):
  – Construct a dictionary of pairs of letters that occur together frequently (e.g., ou, ea, ch, …).
  – If n such pairs are used, a pointer (location in the dictionary) requires log2 n bits.
  – At each step in the encoding, the next pair of characters is examined.
    • If it corresponds to a dictionary pair, it is replaced by its encoding, and the encoding position moves by 2 characters.
    • Otherwise, the single-character encoding is kept, and the position moves by one character.
  – To ensure that decoding is unambiguous, an extra bit is needed to indicate whether the next unit is a single-character code or a digram code.
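A sketch of this digram scheme (the pair table, the flag-bit layout, and the helper name digram_encode are illustrative assumptions):

```python
# Illustrative digram dictionary; a flag distinguishes single characters
# from digram pointers so decoding stays unambiguous.
DIGRAMS = ["ou", "ea", "ch", "th", "er", "in", "on", "re"]   # 8 pairs -> 3-bit pointer

def digram_encode(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAMS:
            out.append(("digram", DIGRAMS.index(pair)))   # flag + log2(8) = 3-bit index
            i += 2
        else:
            out.append(("char", text[i]))                  # flag + plain character code
            i += 1
    return out

print(digram_encode("teach"))
# [('char', 't'), ('digram', 1), ('digram', 2)]
```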

Page 26: Modern Information Retrieval Chapter 7: Text Processing.

Ziv-Lempel compression

• General:
  – The Ziv-Lempel method (1977) uses a single-pass adaptive scheme.
  – While compressing, it constructs a dictionary from phrases encountered so far.
  – Many popular programs (Unix compress/uncompress, GNU gzip/gunzip, and Windows WinZip) are based on the Ziv-Lempel algorithm.
  – Compression is slightly better than Huffman codes (C/S of 45% vs. 55%).
  – Disadvantage for information retrieval: the compressed file cannot be searched, and decoding cannot start at a random place in the file.

Page 27: Modern Information Retrieval Chapter 7: Text Processing.

Ziv-Lempel compression (cont.)

• Compression:
  1. Initialize the dictionary to contain all “phrases” of length one.
  2. Examine the input stream and search for the longest phrase which has appeared in the dictionary.
  3. Encode this phrase by its index in the dictionary.
  4. Add the phrase, followed by the next symbol in the input stream, to the dictionary.
  5. Go to step 2.
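A short sketch of these steps (an LZW-style variant of the Ziv-Lempel family; the sample input is illustrative, and details such as pointer widths are omitted):

```python
def lz_compress(text):
    """Sketch of the phrase-dictionary scheme described above (LZW-style)."""
    # Step 1: the dictionary initially holds all length-one "phrases".
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    phrase, output = "", []
    for symbol in text:
        # Step 2: grow the current phrase while it stays in the dictionary.
        if phrase + symbol in dictionary:
            phrase += symbol
        else:
            # Step 3: emit the dictionary index of the longest known phrase.
            output.append(dictionary[phrase])
            # Step 4: add phrase + next symbol as a new dictionary entry.
            dictionary[phrase + symbol] = len(dictionary)
            phrase = symbol
    if phrase:
        output.append(dictionary[phrase])
    return output

print(lz_compress("abababababab"))
# [0, 1, 2, 4, 3, 6] -- 12 characters encoded as 6 dictionary pointers
```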

Page 28: Modern Information Retrieval Chapter 7: Text Processing.

Ziv-Lempel compression (cont.)

Example:

• Assume a dictionary of 16 phrases (4-bit index).

• This case does not result in compression:
  – Source: 25 characters in a 2-character alphabet require a total of 25 bits.
  – Output: 13 pointers of 4 bits require a total of 52 bits.

• This is because the length of the input data in this example is too short.

• In practice, the Lempel-Ziv algorithm works well only when the input data is sufficiently large and there is sufficient redundancy in the data.