Transcript of wumpus-tr-2005-01

2. RELATED WORK

Methods to support dynamically changing text collections can be divided into two categories: support for document insertions and support for document deletions.

Techniques to support document insertions into an existing index have been studied by many researchers over the last decade. Most of them follow the same basic scheme. They maintain both an on-disk and an in-memory index. Postings for new documents are accumulated in main memory until it is exhausted, and then the data in memory are somehow combined with the on-disk index.

Tomasic et al. [12] present an in-place update scheme for inverted files, based on a distinction between short lists and long lists. They also discuss how different allocation strategies for the long lists affect index maintenance and query processing performance. Lester et al. [7] give an evaluation of three different methods to combine the in-memory information with the on-disk data. Kabra et al. [6] present a hybrid IR/DB system with delayed update operations through in-memory buffers. All of these solutions have in common that the entire on-disk index has to be read (or written) every time main memory is exhausted, which causes performance problems for large collections. We show how the number of disk operations can be significantly reduced, at minimal cost for query performance.

In contrast to the case of document insertions, a thorough evaluation of techniques for document deletions is not available. Chiueh and Huang [2] present a lazy invalidation approach that keeps an in-memory list of all deleted documents and performs a postprocessing step for every query, taking the contents of that list into account. The approach to document deletions presented in this paper is similar to theirs but more general; it is integrated into the actual query processing instead of being performed as a postprocessing step.

None of this related work provides a general discussion of how different index maintenance strategies affect query processing performance and how this implies opportunities for indexing versus query processing performance trade-offs.

3. THE WUMPUS SEARCH SYSTEM

The retrieval system described in this paper is the Wumpus file system search engine. Wumpus is similar to other file system search engines, such as Google Desktop Search², Apple Spotlight³, or Beagle⁴. Unlike most desktop search systems (except Spotlight), it is a true multi-user search system; only a single index is used for all files in the file system, and security restrictions are applied at query time in order to guarantee that the query results are consistent with all file permissions.

In addition to mere file system search capabilities, Wumpus supports several standard information retrieval methods, such as Okapi BM25 [10] and MultiText QAP [4]. The search system is based on the GCL query language and the implementation framework proposed by Clarke et al. [3].

In the remainder of this section, we give a short introduction to the specific requirements of file system search, explain how our system deals with the individual aspects of file system search, and describe the basic experimental setup as well as our performance evaluation methodology.

² http://desktop.google.com/
³ http://www.apple.com/macosx/features/spotlight/
⁴ http://www.gnome.org/projects/beagle/

[Figure 1 diagram: input files feed an index manager that maintains an in-memory index and on-disk indices 1 through n; a security manager sits between the index manager and the query processor, which serves users 1 through m.]

Figure 1: General layout of the Wumpus search system. Wumpus maintains an in-memory index for recent updates and several on-disk indices for persistent storage. User-specific security restrictions are applied to all data that are transferred from the index manager to the query processor.

3.1 File System Search

File system search is different from the traditional information retrieval task. The search engine not only has to deal with a large heterogeneous document collection, but a file system is also a truly dynamic environment: files are constantly created, modified, and deleted. The expected number of index update operations is much greater than the number of queries to be processed. Using Wumpus, one of the authors counted more than 4,000 index update operations (document insertions and deletions) on his laptop computer during a typical work day.

Furthermore, when an e-mail arrives, or a new file is created, the user expects the search system to reflect this change immediately. Delays greater than a few seconds are not acceptable. This, together with the great number of update operations that have to be performed, suggests that indexing performance plays a much greater role than query processing performance in this particular domain. Wumpus supports fast instantaneous updates (i.e., changes to the file system are reflected by the search system within fractions of a second). The indexing techniques used to achieve this are presented in section 3.2.

In addition to being a dynamic environment, file system search is a multi-user application. In order to avoid wasting disk space due to indexing the same file many times, a single index has to be used for all users in the system. Special care must then be taken to guarantee file system security. Wumpus's built-in security mechanisms guarantee that query results only depend on the content of files that are readable by the user who submitted the query. The security subsystem, which is also used to support document deletions, is briefly discussed in section 3.5.

3.2 Indexing (Document Insertions)

Wumpus uses inverted files [15] as its main index structure. All posting lists contain fully positional information (i.e., the exact locations of all occurrences of a term). Positional indexing is, for instance, necessary for phrase queries and term proximity ranking [5], both of which are supported by Wumpus.


[Figure 2 diagram: (a) in-memory index with hash table and linked posting lists; (b) on-disk index with a 2-level search tree over the sorted terms "dynamic", "information", "off", "retrieval", "trade" and their postings.]

Figure 2: Basic structure of the in-memory index and the on-disk index. Vocabulary terms in memory are arranged in a hash table. Terms on disk are sorted in lexicographical order, allowing for efficient prefix search operations.

However, the results presented in this paper can easily be applied to document-centered search systems that do not maintain a fully positional index.

The actual indexing is a one-pass process. It can be described as in-memory inversion with ad-hoc text-based partitioning (cf. Witten et al. [13, pp. 245-253]). When a new document is added to the index, tokens are read from the input file, and an inverted index is built in memory until the memory consumption reaches a predefined threshold. At that point, an on-disk inverted file is created from the postings in memory, the in-memory index is deleted, and the system continues to read further tokens from the input file. Wumpus is able to maintain multiple on-disk indices at the same time. Different strategies for merging these sub-indices are discussed in section 4.
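To make the text-based partitioning concrete, here is a minimal Python sketch of this indexing loop. InMemoryIndex, flush_to_disk, and MEMORY_LIMIT are illustrative names rather than Wumpus's actual API, and the memory threshold is counted in postings instead of bytes for simplicity.

```python
from collections import defaultdict

MEMORY_LIMIT = 100_000  # postings held in memory before flushing (illustrative)

class InMemoryIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> ascending index addresses
        self.size = 0

    def add(self, term, position):
        self.postings[term].append(position)
        self.size += 1

def index_tokens(token_stream, flush_to_disk):
    """Build an index in memory; flush it as a new on-disk sub-index
    whenever the memory threshold is reached (text-based partitioning)."""
    index = InMemoryIndex()
    for position, term in enumerate(token_stream):
        index.add(term, position)
        if index.size >= MEMORY_LIMIT:
            flush_to_disk(index)        # creates a new on-disk sub-index
            index = InMemoryIndex()     # continue with an empty index
    if index.size > 0:
        flush_to_disk(index)            # flush the final partial sub-index

# Example: collect the flushed sub-indices of a small token stream.
subs = []
index_tokens("to be or not to be".split(), lambda idx: subs.append(idx.postings))
```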

In the in-memory index, a hash table with move-to-front heuristics [14] is used to efficiently search for existing terms and add new terms to the vocabulary. All postings are stored in compressed form (byte-aligned gap compression [11]), using linked lists to allow for flexibility and efficient insertion of new postings. This is shown in Figure 2(a).
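As an illustration of byte-aligned gap compression, the following sketch follows the general vByte style of [11]: positions are delta-encoded, and each gap is emitted as a sequence of 7-bit chunks with a continuation bit. The exact byte layout used by Wumpus may differ; this only demonstrates the technique.

```python
def encode_postings(positions):
    """Delta-encode ascending positions; each gap becomes 7-bit chunks,
    with the high bit marking 'more chunks follow'."""
    out = bytearray()
    prev = 0
    for pos in positions:
        gap = pos - prev
        prev = pos
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # 7 payload bits + continue flag
            gap >>= 7
        out.append(gap)                      # final chunk, high bit clear
    return bytes(out)

def decode_postings(data):
    positions, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:                      # continuation: accumulate bits
            shift += 7
        else:                                # gap complete: emit position
            prev += current
            positions.append(prev)
            current, shift = 0, 0
    return positions

assert decode_postings(encode_postings([5, 13, 28])) == [5, 13, 28]
```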

When new data is added to the in-memory index, postings are immediately assigned to vocabulary terms using the hash table. After a posting has been added to a posting list, it can be accessed by the query processor without any further delay. Thus, instantaneous updates, which are needed for real-time file system search, are supported by this data structure.

In the on-disk index, as in memory, all postings are stored in compressed form. Terms are sorted in lexicographical order, which minimizes the number of disk seeks necessary for prefix searches and query-time stemming (a special kind of prefix search). Terms and their postings are kept together, which again decreases the number of disk seeks at query time. For every on-disk index that is part of the indexing system's main index, some meta-information is kept in memory. Conceptually, this meta-information can be thought of as a B-tree of height 2 that is used to reduce the search space when postings have to be fetched from the on-disk index (shown in Figure 2(b)).
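A minimal sketch of such a height-2 lookup, assuming the on-disk vocabulary is grouped into sorted blocks and only the first term of each block is kept in memory. The class and the toy data (borrowed from Figure 2) are illustrative, not the actual on-disk format.

```python
import bisect

class OnDiskIndex:
    def __init__(self, blocks):
        # blocks: address-ordered list of sorted (term, postings) groups
        self.blocks = blocks
        self.first_terms = [block[0][0] for block in blocks]  # kept in memory

    def lookup(self, term):
        # Binary search in memory picks the single block that could hold
        # the term; only that block needs to be read from disk.
        i = bisect.bisect_right(self.first_terms, term) - 1
        if i < 0:
            return []
        for t, postings in self.blocks[i]:
            if t == term:
                return postings
        return []

idx = OnDiskIndex([[("dynamic", [8, 16, 32]), ("information", [9, 25, 61, 97])],
                   [("off", [10, 42]), ("retrieval", [5, 13, 28]), ("trade", [6, 49])]])
assert idx.lookup("retrieval") == [5, 13, 28]
```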

3.3 Multiple Sub-Indices

Wumpus is able to maintain multiple on-disk indices at the same time. An important property of these sub-indices is that they form an ordered partitioning of the index address space, i.e., the postings in sub-index $I_k$ always come before those in sub-index $I_{k+1}$. This property is preserved by all index maintenance operations (sub-index merging and garbage collection). It has two important implications (see the sketch after this list):

• during query processing, the posting list for a given term can be created by concatenating the sub-lists retrieved from the individual sub-indices;

• during a sub-index merge process, postings do not have to be decompressed, but may simply be copied in their compressed form, avoiding expensive within-list merging and thus allowing for very fast sub-index merge operations.
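The second implication can be sketched as follows. Plain dicts mapping terms to compressed byte strings stand in for real on-disk indices, and one practical detail, re-basing the first gap of each copied chunk, is only noted in a comment.

```python
def merge_sub_indices(sub_indices):
    """sub_indices: {term: compressed_postings_bytes} dicts, listed in
    address-space order (all postings in sub-index k precede sub-index k+1).
    Each term's merged list is simply its sub-lists one after another."""
    merged = {}
    for sub in sub_indices:                       # iterate in partition order
        for term, chunk in sub.items():
            merged.setdefault(term, []).append(chunk)
    # A real implementation writes the chunks back to back, patching only
    # the first gap of each chunk so decoding stays correct; no per-posting
    # decompression and no within-list merging is needed.
    return {term: b"".join(chunks) for term, chunks in sorted(merged.items())}
```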

3.4 Query Processing

Wumpus's query processor is based on the GCL query language and its shortest-substring paradigm [3]. GCL supports a variety of operators that can be used to impose structural constraints on query results. In the context of this paper, we only need two of them:

• A ... B generates all intervals [x, y] (called index extents) such that A occurs at index address x and B occurs at index address y.

• E1 ◁ E2 generates all index extents that satisfy the GCL expression E1 and are contained in an extent that satisfies E2 (see the sketch below).
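To pin down the semantics, here is a toy evaluation of the "contained in" operator over explicit extent lists. Wumpus's actual evaluator works lazily on compressed postings, so this is illustrative only.

```python
def contained_in(e1_extents, e2_extents):
    """Keep the (start, end) extents from e1_extents that lie inside
    some extent from e2_extents."""
    return [(x, y) for x, y in e1_extents
            if any(a <= x and y <= b for a, b in e2_extents)]

# "dynamic" at address 8 lies inside the document extent (0, 20):
assert contained_in([(8, 8)], [(0, 20), (30, 50)]) == [(8, 8)]
```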

The exact sequence of operations performed by the system when a query is being processed, of course, depends on the scoring function used to rank the documents in the collection. In general, query processing consists of four steps:

1. Create a list of all index extents that are to be scored. Traditionally, this is the list of documents in the collection, but Wumpus supports arbitrary GCL expressions. For the text collections used in our experiments, document extents are generated by the GCL expression ("<doc>" ... "</doc>").

2. Fetch all postings for all query terms from the on-disk indices and the in-memory index.

3. Compute the score for every extent from Step 1, using the information found in the posting lists collected in Step 2.

4. Report the top n extents, along with their scores, to the user.

Since all on-disk indices and the in-memory index form an ordered partitioning of the index address space, the posting list for a given term can be created by concatenating the lists found in the individual sub-indices. The number of disk seeks necessary to fetch all postings for one term is therefore linear in the number of sub-indices, which is why at some point it is advisable to merge all sub-indices in order to increase query processing performance.
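Putting the four steps together, a schematic query driver might look like this. score_extent and doc_extents are deliberate placeholders, since the concrete scoring function (e.g., BM25) is orthogonal to the index structure.

```python
def process_query(terms, sub_indices, doc_extents, score_extent, n=10):
    """sub_indices: address-ordered list of {term: positions} indices
    (the in-memory index is simply the last entry)."""
    # Step 2: fetch each term's postings by concatenating sub-lists; this
    # is valid because the sub-indices partition the address space in order.
    postings = {t: [p for sub in sub_indices for p in sub.get(t, [])]
                for t in terms}
    # Steps 1 and 3: score every extent that is to be ranked.
    scored = [(score_extent(extent, postings), extent)
              for extent in doc_extents]
    # Step 4: report the top n extents together with their scores.
    return sorted(scored, reverse=True)[:n]
```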


[Figure 3 diagram: an index holding sub-indices of generations 0, 1, and 3 receives a new generation-0 sub-index built from in-memory postings; the collisions at generations 0 and 1 trigger merges into a new sub-index of generation 2, so the resulting index contains only 2 instead of 4 sub-indices: 01011 + 00001 = 01100.]

Figure 3: Logarithmic Merge by example. A new sub-index is added from in-memory postings. Sub-indices of the same generation n are merged into a new sub-index of generation n + 1.

In a static indexing environment, all sub-indices that have been created during the indexing process are merged into one big inverted file that represents the whole collection.

While this technique is perfectly reasonable for static collections, it cannot be used for dynamic collections, because the index is continuously updated and the point at which the whole collection has been indexed is never reached. However, the indices should be merged at some point, since a large number of sub-indices significantly decreases query processing performance due to disk seek latency. Therefore, we have to decide when it is acceptable to have multiple sub-indices and when it is sensible to merge them. We present three different merge strategies and evaluate their performance under different conditions.

    Strategy 1: Immediate Merge

The first merge strategy has been proposed by Lester et al. [7]. They analyze three different techniques to deal with dynamically growing text collections. Out of these techniques, a strategy called Re-Merge exhibits the best performance. The indexing system maintains one on-disk and one in-memory index. As soon as main memory is full, Re-Merge sorts the in-memory postings and merges them with the existing on-disk index, creating a new on-disk index that contains all postings gathered so far. We call it Immediate Merge because in-memory postings are immediately merged with the on-disk index when they are written to disk.
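A minimal sketch of one Immediate Merge step, with dicts standing in for the on-disk and in-memory indices; the real system streams through and rewrites the entire on-disk file, which is exactly what makes the strategy expensive.

```python
def immediate_merge(on_disk, in_memory):
    """Combine the single on-disk index with the in-memory postings,
    producing the new (again single) on-disk index."""
    merged = {}
    for term in sorted(set(on_disk) | set(in_memory)):
        # On-disk postings precede in-memory ones in the address space,
        # so concatenating in this order keeps positions ascending.
        merged[term] = on_disk.get(term, []) + in_memory.get(term, [])
    return merged
```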

The advantage of this strategy is that, in order to fetch a posting list, in most cases only a single disk seek is necessary, since there is only one sub-index from which postings have to be retrieved. The disadvantage is that for every merge operation the entire index has to be scanned. Since memory fills up about C/M times and the on-disk index that has to be rewritten each time grows to size C, the number of disk operations necessary to create the index is

$D_{\text{index}}(C, M) \in \Theta\!\left(\frac{C^2}{M}\right)$,

where C is the size of the text collection and M is the amount of available main memory. This is devastating for very large text collections and for systems with little available main memory (halving the available main memory doubles the number of disk operations).

    Strategy 2: No Merge

The second strategy does not perform any merge operations. When memory is full, postings are sorted and written to disk, creating a new on-disk sub-index. On-disk indices are never merged. When the posting list for a given term has to be retrieved from the index, sub-lists are fetched from all sub-indices. Since the sub-indices are ordered, the term's posting list can be created by concatenating the sub-lists.

The advantage of No Merge is its high indexing performance: each sub-index is accessed exactly once, when it is created from the in-memory postings. This implies

$D_{\text{index}}(C, M) \in \Theta(C)$.

The disadvantage is that, in order to fetch a term's posting list,

$D_{\text{fetch}}(C, M) \in \Theta\!\left(\frac{C}{M}\right)$

disk operations (and in particular $\Theta(C/M)$ disk seeks) are necessary, causing low query processing performance for collections that are much larger than the available main memory. This behavior is shown in Figure 4.

[Figure 4 plot, "Query processing performance for a growing text collection": average time per query (seconds, 0 to 4) versus number of documents indexed (in thousands, 0 to 4,500), for No Merge, Logarithmic Merge, and Immediate Merge.]

Figure 4: Development of query performance for a growing index. Logarithmic Merge and Immediate Merge stay very close to each other. No Merge gets worse as the number of sub-indices increases.

    Strategy 3: Logarithmic Merge

The two strategies described so far (Immediate Merge and No Merge) represent the two extremes. The third strategy is a compromise: a newly created on-disk sub-index is sometimes merged with an existing one, but not always.

In order to determine when to merge two sub-indices, we introduce a new concept: index generation. An on-disk index that was created directly from in-memory postings is of generation 0. An index that was created by merging several other indices is of generation g + 1, where g is the highest generation of any of the indices involved in the merging process. If, after a new on-disk index has been created, there are two indices of the same generation, they are merged. This is repeated until there are no more such collisions.

We call this strategy Logarithmic Merge after the logarithmic method by Bentley and Saxe [1]. It is similar to the incremental indexing scheme used by Lucene⁵.

When Logarithmic Merge is used, the vector

$a = (a_0, a_1, \ldots), \qquad a_i := \begin{cases} 0 & \text{if there is no sub-index of generation } i, \\ 1 & \text{otherwise,} \end{cases}$

behaves like a binary number that is increased by 1 every time a new on-disk index is created.
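The collision-and-carry rule translates directly into code. In this sketch, sub_indices maps generation numbers to index objects, and merge is any function that combines indices given in address-space order (for instance, a raw-copy merge as in section 3.3); all names are illustrative.

```python
def add_sub_index(sub_indices, new_index, merge):
    """Insert a freshly flushed generation-0 index, merging away
    collisions like the carry of a binary increment."""
    generation = 0
    while generation in sub_indices:
        # Collision: two generation-g indices become one of generation g+1.
        new_index = merge([sub_indices.pop(generation), new_index])
        generation += 1
    sub_indices[generation] = new_index

def concat_merge(parts):
    out = {}
    for part in parts:                    # older parts come first
        for term, positions in part.items():
            out.setdefault(term, []).extend(positions)
    return out

subs = {}
for batch in range(4):                    # four flushes: binary 100
    add_sub_index(subs, {"t": [batch]}, concat_merge)
assert list(subs) == [2] and subs[2] == {"t": [0, 1, 2, 3]}
```

A real implementation could detect the whole carry chain up front and replace the sequence of two-way merges by a single multi-way merge, which is the constant-factor optimization mentioned below.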

It follows immediately that at a given point in time the number of sub-indices is bounded by

$O\!\left(\log\frac{C}{M}\right)$,

where C and M are again the size of the collection and the available main memory, respectively.

⁵ http://lucene.apache.org/


[Figure 5 plots: total time (in minutes) versus queries per 1,000 document insertions (0 to 8), for No Merge, Logarithmic Merge, and Immediate Merge; panel (a) TREC-Genomics corpus with 128 MB for the in-memory index, panel (b) TREC-Genomics corpus with 256 MB for the in-memory index.]

Figure 5: Comparison of sub-index merging strategies for growing text collections: (a) TREC-Genomics with 128 MB main memory, (b) TREC-Genomics with 256 MB main memory. Every data point represents one experiment (indexing the whole collection and processing relevance queries in parallel). The time given is the total time for indexing the collection and processing all queries. $D_{QU}$ is varied between 0 and $10^{-2}$.

Table 2: Exact times for the experimental results shown in Figure 5(b). Extrapolated point of equality for Log. Merge and Immediate Merge: $D_{QU} \approx 0.024$.

  D_QU    No Merge    Log. Merge    Imm. Merge
  0.000   13m40s      21m08s        57m01s
  0.002   228m45s     205m17s       238m05s
  0.004   433m29s     389m22s       416m05s
  0.006   647m41s     571m08s       595m11s
  0.008   -           752m54s       774m26s
  0.024   -           2207m         2208m

This is much better than for the No Merge strategy and implies increased query processing performance. The number of disk operations necessary to build the index is

$D_{\text{index}}(C, M) \in \Theta\!\left(C \log\frac{C}{M}\right)$,

which is much smaller than for Immediate Merge.

The number of disk operations can be reduced by a constant factor by anticipating future merge operations and starting a multi-way merge process whenever merging two indices would cause a new collision and thus a new merge process. This process is shown in Figure 3.

    Results

The experiments run to evaluate the different merging strategies under different query/update conditions involve a monotonically growing document collection. For the TREC-Genomics collection, we ran two series of experiments, with 128 MB and 256 MB for the in-memory index, respectively. Experiments for TREC4+5-CR are not shown here, because the collection is too small to allow any conclusions.

Starting from an empty index, the system concurrently adds documents to the index and processes relevance queries until the whole collection has been indexed. The number of relevance queries per update operation (document insertion) is varied in order to examine the effect that each of the merge strategies has on both indexing performance and query processing performance. The total time needed by the system to finish the job, i.e., index the entire collection and process all queries, is measured. Results for both series of experiments are shown in Figure 5. In addition, the exact numbers for the 256 MB series are given in Table 2.

As expected, No Merge is the best strategy under a very light query load ($D_{QU} < 5 \times 10^{-4}$). As the number of queries per update operation increases, No Merge becomes worse and worse, due to the high number of sub-indices that have to be accessed every time a query is processed.

Logarithmic Merge shows excellent indexing performance; it is only 35% slower than No Merge for $D_{QU} = 0$ and 256 MB of memory. As $D_{QU}$ increases, Logarithmic Merge soon overtakes No Merge and becomes the best strategy. In all our experiments, overall performance for Logarithmic Merge was better than for Immediate Merge. However, by extrapolating from the experimental results, we can predict that Immediate Merge becomes the optimal strategy for $D_{QU} > 0.024$ (24 queries per 1,000 update operations). But even then, Logarithmic Merge stays very close. Its asymptotic performance ($D_{QU} \to \infty$) is less than 3% lower than that of Immediate Merge on our test collection.

It also has to be pointed out that Logarithmic Merge is very robust against decreasing the available memory. While going from 256 MB down to 128 MB significantly decreases No Merge's query performance and doubles indexing time with Immediate Merge, both indexing and query performance remain almost constant when Logarithmic Merge is used as the sub-index merging strategy.

    Discussion

We have shown that, depending on how dynamic the underlying text collection is, different sub-index merging strategies have to be chosen in order to achieve optimal overall performance for the search system. The No Merge strategy, which offers the best indexing performance, should only be used if the number of queries to be processed is extremely small. For most applications, Logarithmic Merge is the best choice, since it scales very well and can therefore be used to index very large text collections, where Immediate Merge causes problems because of its $\Theta(C^2/M)$ indexing complexity.

5. GARBAGE COLLECTION

In the previous section, we have discussed sub-index merging strategies for continuously growing document collections. In a truly dynamic environment, however, the collection does not grow monotonically. Update operations may be either document insertions or deletions.


[Figure 6 plot, "Impact of garbage postings on query performance": average time per query (seconds, 0 to 7) versus relative amount of garbage postings in the index (0 to 0.9), for Logarithmic Merge and Immediate Merge.]

Figure 6: Experimental results for a sub-collection of TREC-Genomics, containing 2.3 million active and a varying number of deleted documents. Query processing performance drops as the number of garbage postings is increased. Logarithmic Merge stays very close to Immediate Merge.

Postings that are added during an insert operation can simply be appended to existing posting lists (assuming that newly added documents reside at the very end of the index address space). Therefore, it is possible to create a new on-disk sub-index from the in-memory postings without having to access the other on-disk sub-indices, a great reduction in the number of disk accesses. This is not true for document deletions, as they can potentially affect the entire address space. Hence, their impact is not limited to a single sub-index (a small part of the index address space), which makes deletions more difficult to deal with than insertions.

Wumpus's security mechanism helps deal with document deletions. Every user can only access the part of a posting list that lies within files that may be read by that user, since the respective file permissions are automatically applied every time a query is processed (cf. section 3). This mechanism can be used to support document deletions: when a document is deleted from the collection, the file associated with that document is marked as "cannot be read by anybody". After that, all postings lying within the file are ignored by the query processor.

However, filtering posting lists during query processing can only be one part of the solution. At some point, postings that belong to deleted documents have to be removed, because the time it takes to process a query is essentially linear in the total number of postings in the index (as shown in Figure 6). We need a garbage collection strategy.
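As an illustration of the query-time filter, the following sketch drops postings that fall inside the address ranges of deleted files; deleted_ranges is an illustrative stand-in for the information the security manager actually provides.

```python
import bisect

def filter_postings(postings, deleted_ranges):
    """postings: ascending positions; deleted_ranges: disjoint, sorted
    (start, end) address ranges belonging to deleted files."""
    starts = [s for s, _ in deleted_ranges]
    kept = []
    for p in postings:
        i = bisect.bisect_right(starts, p) - 1
        if i < 0 or p > deleted_ranges[i][1]:  # not inside any deleted range
            kept.append(p)
    return kept

assert filter_postings([5, 13, 28], [(10, 20)]) == [5, 28]
```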

    Threshold-based Garbage Collection

A relatively simple strategy is to keep track of the relative number of postings in the index that belong to documents that have been deleted:

$r = \frac{\#\text{deleted postings}}{\#\text{postings}}$.

As soon as this number exceeds a predefined threshold (i.e., $r > \rho$, for some $\rho \in (0, 1]$), the garbage collector is started, and all superfluous postings are removed from the index.

This is the basic garbage collection strategy used by Wumpus. Different values for $\rho$ affect indexing performance and query processing performance in different ways: if the garbage collector is run after every single document deletion ($\rho \to 0$), this guarantees maximum query performance at the cost of very bad indexing performance. If, on the other hand, the garbage collector is never run ($\rho = 1$), indexing performance is maximal, but the query processor spends a great amount of time fetching and decompressing postings that are irrelevant to the query because they belong to deleted documents.

Since during garbage collection all sub-indices have to be read completely anyway, Wumpus automatically merges all sub-indices into one big index every time the garbage collector is run.
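A sketch of this strategy under the stated assumptions: the garbage ratio r is computed, and once it exceeds rho, all sub-indices are read and combined into a single index, dropping deleted postings along the way. is_deleted and the dict-based indices are illustrative.

```python
def maybe_collect_garbage(sub_indices, deleted_postings, total_postings,
                          is_deleted, rho=0.5):
    r = deleted_postings / total_postings if total_postings else 0.0
    if r <= rho:
        return sub_indices                     # below threshold: do nothing
    merged = {}
    for sub in sub_indices:                    # everything is read anyway,
        for term, positions in sub.items():    # so merge into one big index
            kept = [p for p in positions if not is_deleted(p)]
            if kept:
                merged.setdefault(term, []).extend(kept)
    return [merged]                            # one sub-index remains
```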

On-the-Fly Garbage Collection

The threshold-based garbage collection strategy described above has the disadvantage that it is completely independent of the sub-index merging strategy employed and therefore causes additional disk access operations that could have been avoided if the garbage collector had been integrated into the sub-index merging process. We call this integration on-the-fly garbage collection: every time two or more sub-indices are merged into a bigger sub-index, the garbage collector can be used to filter the postings that are read from the input indices and only write those postings to the output index that belong to documents that have not been deleted. On-the-fly garbage collection is employed in addition to the simple threshold-based approach.

Since garbage collection requires the decompression of posting lists, it causes a notable reduction of sub-index merging performance (ordinary sub-index merging does not require the decompression of postings; see section 3.3). It is therefore not desirable to run it for every merge operation. Instead, we define a new threshold value $\rho'$; the garbage collector is integrated into a merge process if

$\frac{\sum_{i=1}^{k} \#\text{deleted postings in index } I_i}{\sum_{i=1}^{k} \#\text{postings in index } I_i} > \rho'$,

where $I_1, \ldots, I_k$ are the input indices of the merge operation. Reasonable choices for $\rho'$ are in the interval $(0, \rho]$.
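Integrating the collector into the merge could then look as follows. garbage_ratio, is_deleted, and rho_prime (the $\rho'$ above) are illustrative names, and postings are shown uncompressed so the filtering step stays visible.

```python
def merge_with_gc(inputs, garbage_ratio, is_deleted, rho_prime=0.2):
    """Merge address-ordered input indices; filter out deleted postings
    only if the inputs' combined garbage ratio exceeds rho_prime."""
    collect = garbage_ratio(inputs) > rho_prime   # decided once per merge
    merged = {}
    for sub in inputs:                            # address-space order
        for term, positions in sub.items():
            if collect:
                # GC path: postings must be decompressed and filtered,
                # which makes this merge slower than a raw copy.
                positions = [p for p in positions if not is_deleted(p)]
            if positions:
                merged.setdefault(term, []).extend(positions)
    return merged
```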

    Experimental Results

We conducted several experiments to compare different garbage collection threshold values and study how the integration of the garbage collector into sub-index merging can increase overall system performance. For each text collection (TREC4+5-CR and TREC-Genomics), one series of experiments was conducted. For all experiments, we limited the size of the in-memory index to 256 MB and used Logarithmic Merge as the sub-index merging strategy.

Starting from an index that contains 50% of the documents in the text collection (2.3 million documents for TREC-Genomics and 260,000 documents for TREC4+5-CR), documents are added and removed in a random fashion, with equal probabilities for insertions and deletions. In contrast to the experiments described in the previous section, we had the system process a fixed number of queries (5,000) and varied $D_{UQ}$, the number of update operations between two queries, between 0 and 10,000.

The results visualized in Figure 7 show that for most update loads $\rho = 0.50$ is a safe choice, leading to an acceptable query performance. Only on TREC-Genomics is it outperformed by $\rho = 0.25$, for $0 \le D_{UQ} \le 5000$. This means that the additional work caused by eager garbage collection is usually not rewarded by superior query performance, even for low update loads. For both collections, the integration of on-the-fly garbage collection into the merge process improves the system's performance only marginally, indicating that the relatively simple threshold-based method is in fact not as bad as we had thought and is suited very well for most scenarios.

[Figure 7 plots: total time (in minutes) versus update operations per query; panel (a) TREC4+5-CR corpus (Logarithmic Merge; 256 MB RAM), panel (b) TREC-Genomics corpus (Logarithmic Merge; 256 MB RAM); curves for threshold-based garbage collection with ρ = 0.10, ρ = 0.25, and ρ = 0.50, and for on-the-fly garbage collection with ρ = 0.50, ρ′ = 0.20.]

Figure 7: Garbage collection strategies for fully dynamic text collections. (a) TREC4+5-CR. (b) TREC-Genomics. The number of update operations per query is varied between 0 and 10,000. The total time needed to process 5,000 queries and concurrently perform the update operations is given.

6. CONCLUSION AND FUTURE WORK

We have presented a fully dynamic information retrieval system, supporting document insertions and deletions. Our contribution is two-fold: We have pointed out different ways to realize index update operations and examined their respective effect on both indexing and query processing performance, leading to a discussion of the associated indexing vs. query processing performance trade-offs. We have shown that for a broad range of relative query and update loads, including those typical for desktop and file system search, a combination of logarithmic sub-index merging and threshold-based garbage collection with a garbage threshold around 50% is an appropriate way to address index updates. Only under extreme conditions (very large or very small number of updates per query) do other configurations exhibit better performance.

As one of the main results of this paper, we have demonstrated how security mechanisms necessary for multi-user support can easily be extended to support efficient document deletions with delayed index garbage collection.

Using $D_{UQ}$, the number of index updates per query, to describe how dynamic a text collection is, is a rather ad-hoc approach. We chose it because of its immediate perspicuity. Other measures, such as the relative number of documents or the relative number of postings changed between two consecutive queries, might give a better characterization of the system. An experimental evaluation using a large number of different corpora will be necessary to find a good measure for the degree of a retrieval system's dynamics.

    7. REFERENCES

[1] J. L. Bentley and J. B. Saxe. Decomposable Searching Problems I: Static-to-Dynamic Transformations. Journal of Algorithms, 1(4):301–358, 1980.

[2] T. Chiueh and L. Huang. Efficient Real-Time Index Updates in Text Retrieval Systems. Technical report, Stony Brook, New York, USA, August 1998.

[3] C. L. A. Clarke, G. V. Cormack, and F. J. Burkowski. An Algebra for Structured Text Search and a Framework for its Implementation. The Computer Journal, 38(1):43–56, 1995.

[4] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. Exploiting Redundancy in Question Answering. In Proceedings of the 24th Annual ACM SIGIR Conference, pages 358–365, November 2001.

[5] C. L. A. Clarke, G. V. Cormack, and E. A. Tudhope. Relevance Ranking for One to Three Term Queries. Information Processing and Management, 36(2):291–311, 2000.

[6] N. Kabra, R. Ramakrishnan, and V. Ercegovac. The QUIQ Engine: A Hybrid IR/DB System. Technical report, Madison, Wisconsin, USA, 2002.

[7] N. Lester, J. Zobel, and H. E. Williams. In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for Text Retrieval Systems. In CRPIT 26: Proceedings of the 27th Australasian Conference on Computer Science, pages 15–23, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.

[8] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered Document Retrieval with Frequency-Sorted Indexes. Journal of the American Society for Information Science, 47(10):749–764, 1996.

[9] M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3):130–137, 1980.

[10] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference, November 1994.

[11] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of Inverted Indexes for Fast Query Evaluation. In Proceedings of the 25th Annual ACM SIGIR Conference, pages 222–229, 2002.

[12] A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental Updates of Inverted Lists for Text Document Retrieval. In SIGMOD '94: Proceedings of the 1994 ACM SIGMOD Conference, pages 289–300, New York, NY, USA, 1994. ACM Press.

[13] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.

[14] J. Zobel, S. Heinz, and H. E. Williams. In-Memory Hash Tables for Accumulating Text Vocabularies. Information Processing Letters, 80(6), 2001.

[15] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted Files versus Signature Files for Text Indexing. ACM Transactions on Database Systems, 23(4):453–490, 1998.