Indexing and Searching XML Documents based on Content and Structure Synopses
Special Topics in Computer Science
The Art of Information Retrieval
Chapter 8: Indexing and Searching
Alexander Gelbukh
www.Gelbukh.com
2
Previous Chapter: Conclusions
Text transformation: meaning instead of strings
o Lexical analysis
o Stopwords
o Stemming
POS, WSD, syntax, semantics
Ontologies to collate similar stems
Text compression
o Searchable (compress the query, then search)
o Random access
o Word-based statistical methods (Huffman)
Index compression
3
Previous Chapter: Research topics
All computational linguistics
o Improved POS tagging
o Improved WSD
Uses of thesaurus
o for user navigation
o for collating similar terms
Better compression methods
o Searchable compression
o Random access
4
5
Types of searching
Sequential
o Small texts
o Volatile, or space limited
Indexed
o Semi-static
o Space overhead
First, we discuss indexed searching, then sequential
6
Inverted files
Vocabulary: ~sqrt(n) (Heaps' law); e.g. 1 GB of text gives about 5 MB of vocabulary
Occurrences: about 40% of n (stopwords not indexed)
o positions (word, char), files, sections...
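A minimal sketch of the vocabulary/occurrences structure, assuming whitespace-separated words and word positions rather than character offsets. The name `build_inverted_index` and the sample text are illustrative, not from the lecture.

```python
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset()):
    """Map each non-stopword to the list of word positions where it occurs."""
    index = defaultdict(list)              # vocabulary -> occurrences
    for pos, word in enumerate(text.lower().split()):
        if word not in stopwords:
            index[word].append(pos)        # positions arrive in increasing order
    return dict(index)

index = build_inverted_index("to be or not to be that is the question",
                             stopwords={"to", "or", "the", "is"})
print(index["be"])   # [1, 5]
```

Because positions are appended in text order, every occurrence list is already sorted, which is what the later merging operations rely on.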
7
Compression: Block addressing
Block addressing: 5% overhead
o 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
o Equal size (faster search) or logical sections (retrieval units)
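A rough sketch of block addressing, assuming equal-size blocks measured in words. The index stores only block numbers, so exact positions are recovered by scanning the candidate blocks sequentially. Function names and the block size are hypothetical.

```python
def build_block_index(text, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    words = text.lower().split()
    index = {}
    for pos, word in enumerate(words):
        block = pos // block_size                  # equal-size blocks of block_size words
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:      # store each block once per word
            blocks.append(block)
    return words, index

def find_positions(words, index, block_size, query):
    """Use the index to pick candidate blocks, then scan them sequentially."""
    hits = []
    for block in index.get(query, []):
        start = block * block_size
        for pos in range(start, min(start + block_size, len(words))):
            if words[pos] == query:
                hits.append(pos)
    return hits
```

Larger blocks mean fewer and shorter pointers but more sequential scanning per query.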
8
Searching in inverted files
Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation with occurrences: ~sqrt (n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging: one list is shorter (Zipf law)
Only inverted files allow both space and time to be sublinear
Suffix trees and signature files don’t
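A small sketch of why a lexicographically sorted vocabulary supports prefix search while hashing does not: all words sharing a prefix are contiguous, so two binary searches delimit them. The sentinel `chr(0x10FFFF)` is an assumption (the highest code point, taken not to occur in real words).

```python
from bisect import bisect_left

def prefix_range(vocabulary, prefix):
    """Return the words of a sorted vocabulary that start with prefix."""
    lo = bisect_left(vocabulary, prefix)
    # Assumed sentinel: sorts after any real continuation of the prefix.
    hi = bisect_left(vocabulary, prefix + chr(0x10FFFF))
    return vocabulary[lo:hi]

vocabulary = ["index", "indexing", "inverted", "search", "searching", "suffix"]
print(prefix_range(vocabulary, "search"))   # ['search', 'searching']
```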
9
Building inverted file: 1
Infinite memory?
Use trie to store vocabulary
o append positions
O(n)
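A sketch of the in-memory construction, assuming a trie made of nested dictionaries with the key `"$"` marking the end of a word (assumed not to occur inside words). Each word costs time proportional to its length, so the whole pass is linear in the text size.

```python
def build_trie_index(text):
    """Insert each word into a trie; end-of-word nodes hold the position list."""
    root = {}
    for pos, word in enumerate(text.lower().split()):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(pos)   # append position to this word's list
    return root

def lookup(root, word):
    """Return the positions of word, or an empty list if it is not indexed."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])
```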
10
Building inverted file: 2
Finite memory?
Fill the memory
Write partial index; n/M pieces
Merge partial indices (hierarchically): n log (n/M)
Insertion: index, merge. n + n' log (n'/M)
Deleting: eliminate every occurrence. n
Very fast creating/maintenance
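A simplified sketch of the finite-memory construction. It assumes memory holds `memory_limit` words at a time and keeps the "partial indices" as sorted in-memory lists instead of disk files. Note that it merges all pieces in one multiway pass with `heapq.merge`, not hierarchically in pairs as on the slide.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def build_partial_indices(words, memory_limit):
    """Index memory_limit words at a time; each chunk yields one sorted partial index."""
    partials = []
    for start in range(0, len(words), memory_limit):
        chunk = {}
        for offset, word in enumerate(words[start:start + memory_limit]):
            chunk.setdefault(word, []).append(start + offset)
        partials.append(sorted(chunk.items()))   # stands in for writing a partial index to disk
    return partials

def merge_partials(partials):
    """Merge sorted partial indices, concatenating the lists of equal words."""
    merged = {}
    stream = heapq.merge(*partials, key=itemgetter(0))
    for word, group in groupby(stream, key=itemgetter(0)):
        merged[word] = [pos for _, positions in group for pos in positions]
    return merged
```

Inserting new text follows the same pattern: index the new text separately, then merge it with the existing index.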
11
Suffix trees
Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space
For text retrieval, inverted files are better
12
13
14
Suffix array
All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
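A minimal sketch of a suffix array with binary search. The construction simply sorts the suffixes and is meant only for small texts, not as an efficient builder. Function names are illustrative.

```python
def build_suffix_array(text):
    """Sort suffix start positions by the suffix they point to (simple, not efficient)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for all suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                        # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                        # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

The pattern need not be a word: any substring, prefix or phrase is found the same way.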
15
Searching. Construction
Searching
Patterns, prefixes, phrases. Not only words
Suffix tree: O(m), but: space (m = query size)
Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
o Large text: n^2 log (M)/M, more than for inverted files
o Skip details
Addition: n n' log (M)/M
Deletion: n
16
Signature files
Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all its bits are set
Sequential search for blocks
False drops!
o Design of the hash function
o Have to traverse the block
Good to search ANDs or proximity queries
o bit patterns are ORed
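A toy sketch of a signature file. The signature width, the number of bits per word, and the use of SHA-256 as the hash function are all assumptions made for illustration. A block is a candidate when every bit of the query signature is set in the block signature; candidates must still be traversed to rule out false drops.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word (fewer if they collide)

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    sig = 0
    for byte in digest[:BITS_PER_WORD]:
        sig |= 1 << (byte % SIGNATURE_BITS)
    return sig

def build_signatures(blocks):
    """OR together the word signatures of each block."""
    signatures = []
    for words in blocks:
        sig = 0
        for word in words:
            sig |= word_signature(word)
        signatures.append(sig)
    return signatures

def search(blocks, signatures, query):
    """Scan signatures sequentially; verify candidates and count false drops."""
    q = word_signature(query)
    hits, false_drops = [], 0
    for i, sig in enumerate(signatures):
        if (sig & q) == q:                 # all query bits set: candidate block
            if query in blocks[i]:
                hits.append(i)
            else:
                false_drops += 1           # bits set by other words
    return hits, false_drops
```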
17
18
Boolean operations
Merging file (occurrences) lists
o AND: to find repetitions
According to query syntax tree
Complexity linear in intermediate results
o Can be slow if they are huge
There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
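A sketch of the optimization mentioned above, assuming both occurrence lists are sorted lists of document numbers: each entry of the short list is located in the long list by binary search instead of scanning the long list linearly. The function name is illustrative.

```python
from bisect import bisect_left

def intersect(short, long):
    """AND of two sorted lists: binary-search each short-list item in the long list."""
    if len(short) > len(long):
        short, long = long, short
    result = []
    lo = 0
    for doc in short:
        lo = bisect_left(long, doc, lo)    # never look left of the previous search point
        if lo == len(long):
            break
        if long[lo] == doc:
            result.append(doc)
    return result

print(intersect([3, 40, 77], list(range(0, 100, 2))))   # [40]
```

The cost is about (short length) x log(long length), which beats a linear merge when one list is much shorter than the other.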
19
Sequential search
Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
Knuth-Morris-Pratt: linear worst case, but the same average
Boyer-Moore: n log(m) / m. Not all chars are examined!
o If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once
Shift-Or: uses logical operation on all 32 bits in parallel
BDM: automaton. Complexity same as Boyer-Moore
Combination of BDM with bit parallelism
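A sketch of Shift-Or. Bit i of `state` is 0 exactly when the first i + 1 pattern characters match the text ending at the current position, so one shift and one OR update all prefixes at once. Python integers are arbitrary precision, so this sketch is not limited to 32-character patterns as a machine-word implementation would be.

```python
def shift_or(text, pattern):
    """Return the start positions of pattern in text using the Shift-Or scan."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # mask[c] has bit i cleared exactly where pattern[i] == c
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, all_ones) & ~(1 << i)
    state = all_ones
    hits = []
    for j, c in enumerate(text):
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:   # bit m-1 clear: whole pattern ends at j
            hits.append(j - m + 1)
    return hits

print(shift_or("abracadabra", "abra"))   # [0, 7]
```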
20
Approximate string matching
Match with k errors
Levenshtein distance
Dynamic programming: O(mn), O(kn)
Automaton: non-deterministic
o Convert to deterministic: O(n), but huge structure
o Bit-parallel: O(n), the fastest known
Filtering: sublinear!
o k errors cannot alter all of k + 1 segments
o multipattern exact search; detect suspicious places
o uses approximate algorithm only when needed
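A sketch of the O(mn) dynamic-programming search with k errors under Levenshtein distance. One column of the matrix is kept per text position, and its first cell is always 0 so that a match may start anywhere in the text. The function name and the convention of reporting exclusive end positions are choices made for this example.

```python
def approx_find(text, pattern, k):
    """Return end positions j (exclusive) where pattern matches a substring
    of text ending at j with at most k errors."""
    m = len(pattern)
    # column[i] = min edit distance between pattern[:i] and a substring ending here
    column = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = column[0]             # old column[i-1], needed for the diagonal step
        column[0] = 0                     # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,       # match or substitution
                      column[i] + 1,          # extra character in the text
                      column[i - 1] + 1)      # pattern character missing from the text
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(j)
    return ends

print(approx_find("hello world", "wrld", 1))   # [11]
```

With k = 0 this degenerates to exact search; filtering methods avoid running it over text areas that cannot contain a match.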
21
Regular expressions
Regular expressions
o Automaton: O(m 2^m) + O(n), bad for long patterns
o Bit-parallel (simulates non-deterministic)
Using indices to search for words with errors
o Inverted files: search in vocabulary, then each word
o Suffix trees and Suffix arrays: the same algorithms!
22
Structural queries
Ad-hoc index for structure
Indexing tags as words
o Inverted files are good since they store occurrences in order
23
Search over compression
Improves both space AND time (fewer disk operations)
Compress query and search
o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
o Look up each query word in the vocabulary to get its code
o More sophisticated algorithms
Compressed inverted files: less disk, less time
Text and index compression can be combined
24
...compression
Suffix trees can be compressed almost to size of suffix arrays
Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression
Signature files are sparse, so can be compressed
o ratios up to 70%
25
26
Research topics
Perhaps, new details in integration of compression and search
“Linguistic” indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms
27
Conclusions
Inverted files seem to be the best option
Other structures are good for specific cases
o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching
Compression can be integrated with search
28
Thank you!
Till compensation lecture?