A Novel Document Retrieval Method Using the Discrete Wavelet Transform

LAURENCE A. F. PARK, KOTAGIRI RAMAMOHANARAO, and MARIMUTHU PALANISWAMI
The University of Melbourne

Current information retrieval methods either ignore the term positions or deal with exact term positions; the former can be seen as coarse document resolution, the latter as fine document resolution. We propose a new spectral-based information retrieval method that is able to utilize many different levels of document resolution by examining the term patterns that occur in the documents. To do this, we take advantage of the multiresolution analysis properties of the wavelet transform. We show that we are able to achieve higher precision when compared to vector space and proximity retrieval methods, while producing fast query times and using a compact index.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Daubechies, document retrieval, Haar, multiresolution analysis, proximity search, vector space methods, wavelet transform

1. INTRODUCTION

Many current information retrieval systems are built around a similarity function. This function takes a query and a document as its arguments and generates a single score which represents the relevance of the query to the document. Current popular retrieval methods, namely, Vector Space Methods [Zobel and Moffat 1998; Buckley and Walz 1999] and Probabilistic Methods [Robertson and Walker 1999], base their similarity function on the hypothesis that a document is more likely to be relevant to a query if it contains more occurrences of the query terms. This implies that the similarity functions need the count of

This work was supported by the Australian Research Council.
Authors' addresses: L. A. F. Park, ARC Centre for Perceptive and Intelligent Machines in Complex Environments, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, Australia 3010; email: [email protected]; K. Ramamohanarao, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, Australia 3010; email: [email protected]; M. Palaniswami, Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria, Australia 3010; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].
© 2005 ACM 1046-8188/05/0700-0267 $5.00


occurrences of each of the terms, and ignore any other information in the document. Vectors are used to represent documents and queries, where the vector space contains one dimension for each of the terms found in the document set. The contribution of each dimension in the document vector is found by counting the appearances of the associated term in the document. The similarity function simply applies weights to the query and document vectors and compares them to generate a score based on their likelihood of relevance. By converting the documents into vectors, we only take into account the number of times each term appears in the documents and disregard the positional information. Therefore, the resolution of these methods at a single level (usually the document level, implying coarse document resolution) means that each document is taken as a unit.

Proximity methods [Clarke and Cormack 2000; Hawking and Thistlewaite 1996] base their similarity functions on the hypothesis that a document is more likely to be relevant to a query if the query terms are found within a smaller proximity to each other. Therefore, they use similarity functions that try to utilize the positional information to achieve higher precision. This is done by comparing the positions of the query terms in the document. By performing many comparisons, the query time grows. By examining positions, proximity methods use each term position as their resolution, implying a fine document resolution.

Rather than observing a single resolution, we can observe multiple resolutions of a document by analyzing the term patterns throughout it. We hypothesize that a document is more likely to be relevant if the patterns of all query term appearances are similar. To examine the pattern of a term throughout a document, we could take note of each of the term positions, leading to query times similar to the proximity method. Or we could map our term positions into another domain in which we could easily analyze the positional patterns and achieve faster query times. Wavelet transforms allow us to do such a thing.

The wavelet transform is able to break a given signal into wavelets (little waves) of different scale and position. This decomposition allows us to analyze the signal at different frequency resolutions and to identify the position of any spikes that may occur in the signal. Two-dimensional wavelet transforms have been used for image compression [Uhl 1994] and retrieval [Wang et al. 2001] for the mentioned reasons. Natural images (e.g., photographs) contain flowing colors that can be represented by low-frequency wavelets. Singularities in images (e.g., fast changes in color) can be shown as high-frequency, positioned wavelets. We see a stream of text in a similar way. If a term appears frequently and is scattered about the document, it can be represented by a low-frequency wavelet, while a term that appears once is shown as a high-frequency positional wavelet.

Wavelet transforms have been used in text visualization systems [Miller et al. 1998]. Such systems attempt to graphically display which regions of a document contain the desired topic. The wavelet transform can assist this visualization by allowing the analysis to be done at multiple document resolution levels.

We propose a new method of Spectral Document Ranking using the discrete wavelet transform (DWT). We will show that using the wavelet transform allows


us to achieve high precision, provides fast query times, and uses a compact index.

Our focus in this article is to introduce a new method of text document ranking using the discrete wavelet transform so that we can analyze the document term patterns at various resolutions. Also, we will compare this new method with existing methods of document ranking. The article is organized as follows: Section 2 introduces the spectral-based retrieval model and describes the use of term signals. Section 3 introduces the wavelet transform and its self-similarity properties, while Section 4 discusses the desired properties of wavelets in information retrieval. Section 5 discusses time and space complexity issues. Section 6 examines the compliance of our document ranking method, and Section 7 gives details of the experiments performed, with results displayed in different forms. Section 8 concludes the article.

2. SPECTRAL-BASED DOCUMENT RETRIEVAL

Spectral-based document retrieval [Park et al. 2001, 2002a, 2002b, 2004, 2005] finds relevant documents by considering the query term occurrence patterns. Documents that contain query terms which all follow a similar positional pattern are considered more relevant than documents whose query terms do not follow similar patterns.

Vector space retrieval calculates the document score based upon the occurrence of the query terms in the document. Proximity methods calculate the document score based on the proximity of the query terms to each other. If a document contains query terms that are within a small proximity of each other, the document would score higher than another document whose query terms were not within the same proximity. Since the vector space method only observes the count of the terms, it would give the same score independent of the query term positions. Proximity searches use more of the document information to calculate the document score, but this takes time. Each query term must be compared to each other query term in the scoring process; therefore, as the number of query terms grows, the number of comparisons grows combinatorially.

Spectral-based retrieval is able to overcome this problem by comparing the query terms in their spectral domain rather than their spatial domain. To do this, we create a term signal for each query term in each document, convert the term signals into term spectra using a spectral transform, and combine the term spectra to obtain a document score. The benefits of performing our calculation in the spectral domain are the following:

—the components are orthogonal to each other; therefore we do not need to cross compare components;

—the spectral domain magnitude and phase values are related to the spatial term count and position, respectively.

2.1 Term Signals

A term signal is a sequence of values that show the occurrence of a particular term in a particular section of a document. The term signal for term t in

ACM Transactions on Information Systems, Vol. 23, No. 3, July 2005.

Page 4: A novel document retrieval method using the discrete wavelet transform

270 • L. A. F. Park et al.

Fig. 1. An example of how the term signals are obtained. The top two lines, labeled "travel" and "wales", show the positions of the terms travel and wales in a document (the position is signified by the vertical stroke through the line). The bottom half shows the generation of the eight term signal components from the term positions.

document d is represented by

f_{d,t} = [ f_{d,t,0}  f_{d,t,1}  ···  f_{d,t,B−1} ],    (1)

where f_{d,t,b} is the value of the signal component. If we have B signal components and D terms in the document, we calculate the value of the bth component by counting the occurrences of term t between the (bD/B)th word in the document and the {(b + 1)D/B − 1}th word in the document. Therefore, if B = 8, f_{d,t,0} would contain the number of times term t occurred in the first eighth of document d. If B = 1, f_{d,t,0} would contain the count of term t throughout the whole document. Figure 1 shows an example of the term signal creation.
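To make the binning concrete, the following is a minimal Python sketch that builds a term signal from a list of term positions; the function name, the example positions, and the document length are our own illustrative choices, not taken from the paper.

import numpy as np

def term_signal(term_positions, doc_length, B=8):
    # Count the occurrences of the term in each of the B equal document
    # segments, giving the term signal f_{d,t} of Equation (1).
    f = np.zeros(B, dtype=int)
    for pos in term_positions:          # pos is a 0-based word offset
        f[pos * B // doc_length] += 1
    return f

# A document of 80 words in which the term appears at words 3, 7 and 55:
print(term_signal([3, 7, 55], doc_length=80))   # [2 0 0 0 0 1 0 0]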

2.2 Term Signal Weights

If we examine weighting schemes found in vector space and probabilistic methods, we can see that they are used to reduce the impact of certain document and term properties on the document score (e.g., document length should not affect the score) [Salton and Buckley 1988; Singhal et al. 1996]. These document and term properties exist in term signals as well, so we will use weighting to try to remove this dependence.

Each component of a term signal represents a portion of the document it was taken from (a passage); therefore, we are able to use the existing document weighting schemes to weight each of the term signal components.

The document weights [Zobel and Moffat 1998] used were

—BD-ACI-BCA: w_{d,t,b} = (1 + log f_{d,t,b}) / ((1 − s) + s·W_d/W̄_d),

—AB-AFD-BAA (Okapi): w_{d,t,b} = f_{d,t,b} / (f_{d,t,b} + τ_d/τ̄_d),

—BI-ACI-BCA: w_{d,t,b} = (1 + log f_{d,t,b}) / ((1 − s) + s·W_d/W̄_d),

—Lnu.ltu (SMART): w_{d,t,b} = ((1 + log f_{d,t,b}) / (1 + log f̄_{d,t})) / ((1 − s) + s·τ_d/τ̄_d),

ACM Transactions on Information Systems, Vol. 23, No. 3, July 2005.

Page 5: A novel document retrieval method using the discrete wavelet transform

Novel Document Retrieval Method • 271

where f_{d,t,b} is the bth component of the tth term in the dth document, f̄_{d,t} is the average term count for document d, W_d is the document vector l_2 norm and W̄_d is its average over the collection, τ_d and τ̄_d are the number of unique terms in document d and the average number of unique terms, respectively, and s is a slope parameter (set to 0.7 [Zobel and Moffat 1998]).
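As an illustration, here is a minimal sketch of the BD-ACI-BCA component weight as reconstructed above. The argument names (the component count f_dtb, the document norm W_d, the collection average W_avg, and the slope s) are our own labels for the quantities just defined.

import math

def bd_aci_bca_weight(f_dtb, W_d, W_avg, s=0.7):
    # w_{d,t,b} = (1 + log f_{d,t,b}) / ((1 - s) + s * W_d / W_avg),
    # applied only to components with a non-zero term count.
    if f_dtb == 0:
        return 0.0
    return (1.0 + math.log(f_dtb)) / ((1.0 - s) + s * W_d / W_avg)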

2.3 Term Spectra

If we were to compare our query term signals in order to obtain a document score, we could compare component b of each term or we could compare different components in different terms. The former method would reduce to passage retrieval, while the latter method would be a form of proximity measure.

As stated earlier, we do not want to compare the term signal positions; we want to compare their patterns. The most convenient way of doing this is to examine their wavelet spectrum, given by

ζ_{d,t} = [ ζ_{d,t,0}  ζ_{d,t,1}  ···  ζ_{d,t,B−1} ],    (2)

where ζ_{d,t,b} = H_{d,t,b} exp(iθ_{d,t,b}) is the bth spectral component with magnitude H_{d,t,b} and phase θ_{d,t,b}. Previous analysis has been performed on observing the Fourier transform and the cosine transform of the term signals [Park et al. 2001, 2002a, 2002b, 2004, 2005]. These transforms decompose the signal into a set of infinite sinusoidal waves. This implies that they are able to extract the frequency information from the signal, but they focus on the signal as a whole. The wavelet transform is able to focus on the signal portions at different resolutions. This implies that frequency information is extracted from parts of the document, providing us with frequency and position information. The resulting term spectrum contains orthogonal components, implying that there is no need to cross compare the spectral components. Therefore, document scores can be obtained by combining the term spectra components across terms.

2.4 Spectral-Based Retrieval Model

The spectral-based retrieval model (shown in Figure 2) uses the magnitude and phase information found in the query term spectra of each document to calculate the document score. If we obtain a set of term spectra consisting of complex valued components, we can treat the magnitude of the components as proportional to the occurrence of the term in the pattern, and we can treat the phase as the position of the pattern.

A relevant document would have a high occurrence of query terms (implying a high magnitude of components) and a similar position of each pattern of query terms (implying similar phase). Therefore, we will split our process into two, so we can deal with the magnitude and phase separately. We know that the components are orthogonal; therefore we need only compare the nth component of each term spectrum. The comparisons will lead to a score for each spectral component, which can be combined to obtain the overall document score.

We have stated that term occurrence is related to the magnitude of the term spectrum, and query term occurrence is likely to be related to the relevance of


Fig. 2. The spectral-based retrieval model.

a document to the query. Therefore, we can assume that the query term spectra magnitude is likely to be related to the relevance of the document. To take the magnitude of each spectral component into account, we will simply combine the magnitude values by adding them.

We stated that the spectral phase is related to the term position. Therefore, we want a similar phase for each component across all of the query terms. Phase is a radial value, so we cannot simply use the variance as a measure of phase similarity. Instead we will use phase precision. The phase precision method assigns each phase to a unit vector. The vectors are added and the magnitude is averaged. The phase precision value is the resulting magnitude. If all of the phases are the same, the unit vectors will add constructively and the resulting magnitude will be 1. If the phases are scattered, the unit vectors will add destructively and the resulting magnitude will be close to zero. The phase precision of component b in document d is

Φ̄_{d,b} = | Σ_{t∈Q} exp(iθ_{d,t,b}) / #(Q) |,    (3)

where Q is the set of query terms and #(Q) is the number of query terms. To take this method a step further, zero phase precision (Φ_{d,b}) ignores the phases of the components which have zero magnitude, because these phase values do not mean anything. This gives us

Φ_{d,b} = | Σ_{t∈Q, H_{d,t,b}≠0} exp(iθ_{d,t,b}) / #(Q) |.    (4)
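The sketch below computes the zero phase precision of Equation (4) directly from the complex spectral components of the query terms; the function and variable names are ours.

import numpy as np

def zero_phase_precision(zeta_b):
    # zeta_b holds the complex component zeta_{d,t,b} for every query term t.
    # Components with zero magnitude are ignored, but the average is still
    # taken over all #(Q) query terms, as in Equation (4).
    zeta_b = np.asarray(zeta_b, dtype=complex)
    nonzero = zeta_b[np.abs(zeta_b) > 0]
    return np.abs(np.sum(nonzero / np.abs(nonzero))) / len(zeta_b)

print(zero_phase_precision([1 + 1j, 2 + 2j]))   # 1.0 (identical phases add constructively)
print(zero_phase_precision([1 + 0j, -2 + 0j]))  # 0.0 (opposite phases cancel)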

The zero phase precision can be used as a measure of how important the corresponding component is (which represents a particular pattern of terms). We use this value as a weight of the magnitude values of the same component. We also apply weights to each query term using the selected weighting scheme. By doing this, we achieve a score for each spectral component:

s_{d,b} = Φ_{d,b} Σ_{t∈Q} w_{q,t} H_{d,t,b}.    (5)


Just as we selected the document weighting scheme, the query weighting scheme is also a matter of preference. The query weighting schemes we tried were

—BD-ACI-BCA: w_{q,t} = (1 + log f_{q,t}) log(1 + f_m/f_t),

—AB-AFD-BAA: w_{q,t} = log(1 + N/f_t),

—BI-ACI-BCA: w_{q,t} = (1 + log f_{q,t})(1 − n_t/log₂ N),

—Lnu.ltu (SMART): w_{q,t} = (1 + log f_{q,t}) log(N/f_t),

where f_{q,t} and w_{q,t} are the occurrence and weight of term t in query q, respectively, f_t is the number of documents term t appears in, f_m is the largest f_t for all t, N is the number of documents, and n_t is a noise measure of term t. Each of the weighting schemes was chosen due to its high precision for a particular query type [Zobel and Moffat 1998].

Experiments have shown that AB-AFD-BAA achieves high precision for short (1–10 terms) queries, BI-ACI-BCA achieves high precision results for long (about 80 terms) queries, and BD-ACI-BCA achieves high precision results when using both long and short queries [Zobel and Moffat 1998]. The Lnu.ltu method from SMART [Buckley et al. 1995] was chosen because it is a well-known method which has produced excellent results at TREC [Buckley et al. 1995].

To obtain the spectral document score, we combine the score components (s_{d,b}) using the norm function

S_d = ‖s_d‖_p,    (6)

where s_d = [ s_{d,0}  s_{d,1}  ···  s_{d,B−1} ], and ‖s_d‖_p is the l_p norm given by

‖s_d‖_p = Σ_{b=0}^{B−1} |s_{d,b}|^p.    (7)

We can see that, by increasing p, the dominant score components will have more effect on the score. In our experiments, we will be examining S_d for p = 1 and 2.
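Putting Equations (5)–(7) together, here is a minimal scoring sketch; the argument names are ours, and the magnitudes, zero phase precisions, and query weights are assumed to be precomputed.

import numpy as np

def document_score(H, phase_prec, w_q, p=1):
    # H[t, b]:       magnitude H_{d,t,b} of query term t, component b
    # phase_prec[b]: zero phase precision of component b
    # w_q[t]:        query term weight w_{q,t}
    s = phase_prec * (w_q @ H)      # s_{d,b}, Equation (5)
    return np.sum(np.abs(s) ** p)   # S_d = ||s_d||_p, Equations (6)-(7)

With p = 1 every component contributes equally; p = 2 emphasizes the dominant score components, as noted above.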

2.5 Generalization of Vector Space Model

If we examine our retrieval model when B = 1 (implying only one signal component is needed for each term), we can show that our model behaves in the same manner as the vector retrieval methods.

We must first note that the transform of a signal with one element, obtained by using any linear transform which generates orthogonal spectral values (such as the Fourier, cosine, and certain wavelet transforms), will be proportional to the original value. Therefore, if we call our transform Tr, we achieve the following:

Tr(w_{d,t,0}) = α w_{d,t,0}.    (8)

The transformed value is real, so the phase is zero. From this we can deduce that the zero phase precision will be 1 (since all query terms will have zero phase). The transformed value being real also implies that the magnitude of


the value is the value itself; therefore H_{d,t,0} = w_{d,t,0}. By substituting these values into our score component Equation (5), we achieve

s_{d,0} = Σ_{t∈Q} w_{q,t} w_{d,t,0}.    (9)

By choosing the l_1 norm to calculate the document score, we have

S_d = Σ_{t∈Q} w_{q,t} w_{d,t,0},    (10)

which is equivalent to the vector space retrieval model.

3. MULTIRESOLUTION ANALYSIS

The Fourier transform (FT) [Proakis and Manolakis 1996] and the sinusoidal family of unitary transforms [Jain 1979] enable us to analyze our information as a whole. When given a signal, the Fourier transform of the signal will provide information about every frequency component that exists in the signal. This type of analysis is sufficient for stationary signals, but does not allow us to examine properties of transient signals. For example, we might want to find the point in time where a signal contains certain frequency components, or for a spatial signal, we might want to find a position in space where certain frequency components exist.

Our previous work in the field of text retrieval [Park et al. 2001, 2002a, 2002b, 2004, 2005] gave us insight as to how we can use the Fourier and cosine transforms to easily combine the term magnitude and phase information into the document score and obtain high-precision results. But, as stated above, the Fourier transform provides frequency information for the whole document; hence we were not able to focus on important portions of the document. In this section, we will explain the limitations of the Fourier transform and show how we can utilize both the frequency and positional information using the wavelet transform.

3.1 Fourier Decomposition

If we wish to represent a discrete signal f[t] as a Fourier series, we must find the coefficients F[k] which satisfy the following formula:

f[t] = Σ_{k=0}^{T−1} F[k] exp(2πikt/T).    (11)

By doing so, we are able to show our signal (f[t]) as a linear combination of the sinusoidal waves exp(2πikt/T) (patterns), where k is the frequency of the wave, i = √−1, and T is the length of our signal. The coefficient F[k] is the calculated amplitude of the wave of frequency k. To calculate the values of the frequency coefficients (or frequency components), we can use the discrete Fourier transform

F[k] = Σ_{t=0}^{T−1} f[t] exp(−2πikt/T).    (12)
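For reference, Equation (12) can be written out directly in Python; the check against numpy's FFT (which uses the same convention) is our own illustration.

import numpy as np

def dft(f):
    # F[k] = sum_t f[t] exp(-2*pi*i*k*t/T), Equation (12)
    T = len(f)
    t = np.arange(T)
    return np.array([np.sum(f * np.exp(-2j * np.pi * k * t / T)) for k in range(T)])

f = np.array([2, 0, 0, 1, 1, 1, 0, 0])
print(np.allclose(dft(f), np.fft.fft(f)))   # True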


The discrete Fourier transform was used in our earlier experiments as our spectral transform. By using the Fourier transform, we scored documents higher if they had a cyclic positional pattern of query terms. The Fourier transform examines the signal as a whole; therefore, for a coefficient (the weight of a signal pattern, which is in this case the sinusoidal wave exp(2πikt/T)) to have a high weight, the pattern it represents must occur throughout the whole signal. The Fourier transform provides detailed frequency information, but with that we cannot pinpoint where in the original signal certain frequencies come from.

3.2 Short Time Fourier Transform

To try to focus on certain spans of a signal, the short time Fourier transform (STFT, also known as the windowed Fourier transform) [Mallat 2001] was developed. The STFT is similar to the FT in the way that it converts a signal to its frequency spectrum, but it windows the signal before doing so. By selecting many different intervals of the signal, the STFT allows the analyst to observe frequency components local to a certain time or position in the signal. The STFT comes in the form

S_f[u, k] = Σ_{t=0}^{T−1} f[t] g[t − u] exp(−2πikt/T),    (13)

where g[t − u] is the window function centred at u. By having a set window size, the pattern used in our analysis is smaller than the signal. Smaller patterns imply that the pattern analysis becomes local to a specific point in the document.

A cyclic transient signal contains frequency components which exist for all times. If our signals were of this form, we would not care about the time resolution, but we would like high frequency resolution. Therefore, an STFT with a large window size or the Fourier transform would suffice in obtaining the frequency information. If we wanted to find where certain impulses existed in a transient signal, the Fourier transform would fail. All it could tell would be that the impulses were in the signal (global pattern analysis), but it could not specify at which times they occurred. In this case, we would use an STFT with a small window size (local pattern analysis). The small window size implies high time resolution but low frequency resolution (due to the small variance in time selected). By using the STFT, we could tell where in time the impulses occurred, but we would only be able to notice high-frequency changes. Any slow changes to the signal would not be noticed in the frequency analysis due to the windowing.

3.3 Wavelet Transform

We have seen that the STFT can be used to find time and frequency information at a set resolution, and we have also seen that the resolution must be set in order to extract certain time-frequency information from a signal. If we want to examine a signal across different time-frequency resolution scales, we must look into multiresolution analysis (MRA). MRA is intended to allow for high time resolution with poor frequency resolution at high-frequency levels, and poor time resolution with high frequency resolution at low-frequency levels.


To perform this task, a wavelet [Mallat 2001] is used instead of a windowing function. The name wavelet means a small wave or a wave which is not of infinite length (as in a sinusoidal wave).

A wavelet is described by a function ψ ∈ L²(R) (where L²(R) is the set of functions f(t) which satisfy ∫ |f(t)|² dt < ∞) with a zero average and norm of 1. A wavelet can be scaled and translated by adjusting the parameters s and u, respectively:

ψ_{u,s}(t) = (1/√s) ψ((t − u)/s).    (14)

The scaling factor keeps the norm equal to one for all s and u. The wavelet transform of f ∈ L²(R) at time u and scale s is

W(u, s) = ⟨f, ψ_{u,s}⟩ = ∫_{−∞}^{+∞} f(t) (1/√s) ψ*((t − u)/s) dt,    (15)

where ψ* is the complex conjugate of ψ.

The appeal of the wavelet transform is its ability to focus on regions of the signal. If we were to compare two or more signals for similarity, we could examine the signals top down. When the signals differ at a certain level, we know that we do not have to delve any deeper. This is an exciting property which can be used in information retrieval. If we consider a word signal, the wavelet transform of this signal will give us the location of the words at any desired resolution. For example, the first wavelet component will tell us if the word exists in the document. The second will tell us where the main cluster of the word is in the document. The third and fourth will identify the general areas where the word appears in the first and second half of the document, respectively, and so on until we get to the exact location of the word.

To construct a wavelet function, ψ_{u,s}(t), we must first obtain a scaling function, φ_{u,s}(t) ∈ V_n. One of the properties that the scaling function must satisfy is that

··· ⊂ V_{n+1} ⊂ V_n ⊂ V_{n−1} ⊂ ···,

where the set of φ_{u,s}(t) for all u is a basis of V_n (s = 2^n for dyadic scaling), and ⋃_{n∈Z} V_n = L²(R). This implies that we can show each set of scaling functions in terms of its subset of scaled scaling functions:

V_{n−1} = V_n ∪ W_{n−1},  V_n ⊥ W_{n−1},    (16)

where ⊥ implies orthogonality. An example of the relationship between V_n and W_n can be seen in Figure 3. If we observe the set of functions W_n, we can see that it satisfies the following properties:

⋃_{n∈Z} W_n = L²(R),  ⋂_{n∈Z} W_n = ∅.    (17)

Therefore the set of functions W_n for all n is a basis for L²(R). This set W_n is the set of shifted wavelet functions at resolution n.


Fig. 3. An example of the enclosing scaling function (φ_{u,s}(t) ∈ V_n) spaces as ovals and the wavelet function spaces (ψ_{u,s}(t) ∈ W_n) as annuli.

Fig. 4. The high-pass filter (H) applies the wavelet transform ψ_{u,s}(t) to our data for a specific scale s; the low-pass filter (G) applies the scaling function φ_{u,s}(t) to extract all of the information not accounted for. Each result is decimated without loss of information.

3.4 Discrete Wavelet Transform

We have seen that the wavelet transform divides L²(R) into the sets ··· ⊂ V_{n+1} ⊂ V_n ⊂ V_{n−1} ⊂ ···, where W_n = V_n \ V_{n+1}, or V_n = W_n ∪ V_{n+1} with W_n ∩ V_{n+1} = ∅. This is a recursive filtering process, where each resolution of scaling functions (φ_{u,s}(t) ∈ V_n) is split into the next resolution of wavelet functions (ψ_{u,s}(t) ∈ W_n) and the next resolution of scaling functions (φ_{u,2s}(t) ∈ V_{n+1}).

When given in its discrete form, the dyadic wavelet transform can be shown as a sequence of high-pass1 and low-pass2 filters, where the filter coefficients of the high-pass filter describe the wavelet transform and the low-pass filter coefficients describe the scaling function which is taking place (shown in Figure 4). We observe that the output of the high-pass filter (the wavelet components) is part of the resulting transform coefficients, and the output of the low-pass filter is fed back into another high- and low-pass filter to be split again (the coefficients of the scaling function decomposition), as shown in Figure 5. Until the mid 1980s, there was no such filter that could provide perfect reconstruction of the decomposed signal; therefore there was no way this idea could be applied in practice. The Conjugate Mirror Filter (CMF) [Vetterli 1986; Vetterli and Herley 1992] provided a means to build these wavelet filter banks and provided the spark to the wavelet community. By performing this recursive filtering process,

1A high-pass filter allows high-frequency data through (the upper Fourier transform components).
2A low-pass filter allows low-frequency data through (the lower Fourier transform components).


Fig. 5. Recursive filtering process. H is a high-pass filter (using the wavelet coefficients) and G is a low-pass filter (using the scaling function coefficients). The low-pass data gets split and decimated repeatedly to obtain its wavelet transform. An example of this is shown in process (19).

we are able to complete the transform in linear time, which is faster than the Fourier transform.

We will now provide a simple example of how we can use the transform to provide us with the different levels of resolution of a signal. The Haar wavelet is equivalent to one cycle of a square wave. To perform the wavelet transform, we take every possible scaled and shifted version of the wavelet and find how much of this wavelet is within our signal (by finding the dot product). For example, if our signal is f_{d,t} = [2 0 0 1 1 1 0 0] and we use the Haar wavelet transform

W =
[  √(1/8)   √(1/8)   √(1/8)   √(1/8)   √(1/8)   √(1/8)   √(1/8)   √(1/8)
   √(1/8)   √(1/8)   √(1/8)   √(1/8)  −√(1/8)  −√(1/8)  −√(1/8)  −√(1/8)
   √(1/4)   √(1/4)  −√(1/4)  −√(1/4)        0        0        0        0
        0        0        0        0   √(1/4)   √(1/4)  −√(1/4)  −√(1/4)
   √(1/2)  −√(1/2)        0        0        0        0        0        0
        0        0   √(1/2)  −√(1/2)        0        0        0        0
        0        0        0        0   √(1/2)  −√(1/2)        0        0
        0        0        0        0        0        0   √(1/2)  −√(1/2) ],

the wavelet components will be

W f′_{d,t} = [ 5/√8   1/√8   1/√4   2/√4   2/√2   −1/√2   0   0 ]′,    (18)

where x′ is x transposed for any vector x. This transformed signal shows us the positions of the terms at many resolutions. The first component (5/√8) shows that there are five occurrences of the term. The second component (1/√8) shows that there is one more occurrence of the term in the first half of the signal than in the second half. The third component shows that there is one more occurrence of the term in the first quarter compared to the second quarter. The fourth component compares the third and fourth quarters. The next four components compare the eighths of the signal. Therefore, we can observe the signal at different levels of resolution by noting certain components of the transformed signal (as shown in Table I):

[ 5/√8 ] [ 1/√8 ] [ 1/√4  2/√4 ] [ 2/√2  −1/√2  0  0 ].
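The matrix form of this example can be checked with a few lines of Python; this is our own sketch reproducing Equation (18), not code from the paper.

import numpy as np

r8, r4, r2 = 1 / np.sqrt(8), 1 / np.sqrt(4), 1 / np.sqrt(2)
W = np.array([
    [ r8,  r8,  r8,  r8,  r8,  r8,  r8,  r8],
    [ r8,  r8,  r8,  r8, -r8, -r8, -r8, -r8],
    [ r4,  r4, -r4, -r4,   0,   0,   0,   0],
    [  0,   0,   0,   0,  r4,  r4, -r4, -r4],
    [ r2, -r2,   0,   0,   0,   0,   0,   0],
    [  0,   0,  r2, -r2,   0,   0,   0,   0],
    [  0,   0,   0,   0,  r2, -r2,   0,   0],
    [  0,   0,   0,   0,   0,   0,  r2, -r2],
])
f = np.array([2, 0, 0, 1, 1, 1, 0, 0])
# [5/sqrt(8), 1/sqrt(8), 1/sqrt(4), 2/sqrt(4), 2/sqrt(2), -1/sqrt(2), 0, 0]
print(W @ f)   # [ 1.77  0.35  0.5  1.  1.41 -0.71  0.  0. ]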

If each component of the original signal represented a portion of a document, we could use the wavelet transform to analyze the query term positions at


Table I. The Spectral Coefficients Produced from the Haar Wavelet Transform in (18) (We can see that the first component contains no signal position information (low resolution), and the last four components focus on two side-by-side components (high resolution).)

Transformed Value    Description
5/√8                 Sum of signal
1/√8                 First half − second half of the signal
1/√4                 First quarter − second quarter of the signal
2/√4                 Third quarter − fourth quarter of the signal
2/√2                 First eighth − second eighth of the signal
−1/√2                Third eighth − fourth eighth of the signal
0                    Fifth eighth − sixth eighth of the signal
0                    Seventh eighth − eighth eighth of the signal

multiple document resolutions. If we chose only the first wavelet component, we would treat the document without spatial information (as in the vector space methods). The matrix multiplication causes this transformation to be of order O(N²) for signals of N elements. To speed up this process, we must use the wavelet's scaling function as well as the wavelet function. Applying the scaling function to our signal allows us to extract all of the information orthogonal to the wavelet of the current resolution and also adjusts the signal to the next level of resolution. Therefore, we can obtain the wavelet transform by following a simple recursive process:

—set the input signal (x);
—initialize the output elements (y = ∅);
—initialize the counter (n = #(x));
—while n ≠ 1:
  (1) apply the wavelet function to the signal, decimate by a factor of 2, and store in the wavelet signal (y_{n/2,n−1} = D₂(H(x)));
  (2) apply the scaling function to the signal, decimate by a factor of 2, and use as the new input signal (x = D₂(G(x)));
  (3) halve the counter (n = n/2);
—assign the remaining input element to the zeroth element of the wavelet signal (y_{0,0} = x).

In the above, #() provides the number of elements in the signal, D₂() is the decimating function, and y_{n/2,n−1} is elements n/2 to n − 1 of signal y.
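A minimal Python sketch of this recursive process for the Haar filters follows; it assumes the signal length is a power of two, and the function name is our own.

import numpy as np

def haar_dwt(signal):
    # Recursive filter-bank Haar DWT following the process above.
    x = np.array(signal, dtype=float)
    h = np.array([1.0, -1.0]) / np.sqrt(2)   # wavelet (high-pass) coefficients
    g = np.array([1.0,  1.0]) / np.sqrt(2)   # scaling (low-pass) coefficients
    y = np.zeros_like(x)
    n = len(x)
    while n != 1:
        pairs = x[:n].reshape(-1, 2)   # filter-and-decimate = paired dot products
        y[n // 2:n] = pairs @ h        # wavelet components are kept in y
        x[:n // 2] = pairs @ g         # scaling output is fed back into the loop
        n //= 2
    y[0] = x[0]                        # the remaining scaling coefficient
    return y

# [5/sqrt(8), 1/sqrt(8), 1/sqrt(4), 2/sqrt(4), 2/sqrt(2), -1/sqrt(2), 0, 0]
print(haar_dwt([2, 0, 0, 1, 1, 1, 0, 0]))   # [ 1.77  0.35  0.5  1.  1.41 -0.71  0.  0. ]

Each iteration halves the working signal, so the total work is proportional to the signal length, which is the linear-time behaviour discussed in this section.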

We will show an example of this using the same f_{d,t} as before and the Haar wavelet, which has wavelet coefficients (high-pass filter) of [ 1/√2  −1/√2 ] and scaling function (low-pass filter) [ 1/√2  1/√2 ]. The first iteration of the scaling function application is a convolution between the scaling function and the signal; this can also be thought of as the dot product of the many shifted versions of the scaling function at the first resolution with the signal. Therefore, we have

f_{d,t} · [ 1/√2  1/√2  0  0  0  0  0  0 ]′ = 2/√2,
f_{d,t} · [ 0  0  1/√2  1/√2  0  0  0  0 ]′ = 1/√2,


f_{d,t} · [ 0  0  0  0  1/√2  1/√2  0  0 ]′ = 2/√2,
f_{d,t} · [ 0  0  0  0  0  0  1/√2  1/√2 ]′ = 0.

The convolution of the Haar scaling function and the signal f_{d,t} produces [ 2/√2  1/√2  2/√2  0 ]. By performing the same operation with the wavelet function, we receive

f_{d,t} · [ 1/√2  −1/√2  0  0  0  0  0  0 ]′ = 2/√2,
f_{d,t} · [ 0  0  1/√2  −1/√2  0  0  0  0 ]′ = −1/√2,
f_{d,t} · [ 0  0  0  0  1/√2  −1/√2  0  0 ]′ = 0,
f_{d,t} · [ 0  0  0  0  0  0  1/√2  −1/√2 ]′ = 0.

The convolution of the Haar wavelet function and the signal f_{d,t} produces [ 2/√2  −1/√2  0  0 ].

These results are concatenated to produce the first iteration of the wavelet transform (shown in process (19)). The scaling function result ([ 2/√2  1/√2  2/√2  0 ]) is passed on to the second iteration, and the wavelet result is kept as part of the answer.

The second iteration involves convolving the scaling function result ([ 2/√2  1/√2  2/√2  0 ]) with the scaling function:

[ 2/√2  1/√2  2/√2  0 ] · [ 1/√2  1/√2  0  0 ]′ = 3/√4,
[ 2/√2  1/√2  2/√2  0 ] · [ 0  0  1/√2  1/√2 ]′ = 2/√4.

It also involves convolving the scaling function result with the wavelet function:

[ 2/√2  1/√2  2/√2  0 ] · [ 1/√2  −1/√2  0  0 ]′ = 1/√4,
[ 2/√2  1/√2  2/√2  0 ] · [ 0  0  1/√2  −1/√2 ]′ = 2/√4.

We keep the result from the wavelet convolution and pass the scaling function convolution to the next iteration. We can see that the portion obtained from the wavelet function convolution is kept as a piece of the transform result, but the portion obtained from the scaling function is fed back into the system to be used again.

The complete wavelet transform process is

[ 2  0  0  1  1  1  0  0 ]′
  ⇒  { 2/√2  1/√2  2/√2  0 }  ( 2/√2  −1/√2  0  0 )
  ⇒  { 3/√4  2/√4 }  ( 1/√4  2/√4 )  ( 2/√2  −1/√2  0  0 )
  ⇒  { 5/√8 }  ( 1/√8 )  ( 1/√4  2/√4 )  ( 2/√2  −1/√2  0  0 )
  ⇒  [ 5/√8  1/√8  1/√4  2/√4  2/√2  −1/√2  0  0 ]′.    (19)


We have shown the result from the scaling function convolution in braces {} and the result from the wavelet function convolution in parentheses ().

The values produced by the shifted wavelet function (wavelet convolution) are kept during each iteration. The values produced by the scaling function are fed back into the recursive splitting process. We can see that this process is a divide-and-conquer application of the wavelet transform. We can collapse the recursive process into a single pass over the data to produce the transformed data and hence reduce the wavelet transform to order O(N). By using the wavelet and scaling functions together, we do not need to scale the wavelet function (only shifts are performed); the data is inversely scaled by the scaling function to allow for this. For a more rigorous proof see Mallat [2001].

4. CHOOSING A WAVELET

Before applying a wavelet transform to a data set, we must first choose a wavelet from the many varieties that exist. Some well-known wavelets come under the title of Daubechies [Daubechies 1988], Shannon, Battle-Lemarie, Meyer, Symmlets, Spline, and Biorthogonal [Mallat 2001]. Before we can choose one, it is necessary to understand the properties that define each wavelet.

4.1 Wavelet Properties

The two main factors which will influence our choice of wavelet are the number of vanishing moments and the size of support. Both of these must be considered to extract the most information in the least space from the data set which we are using.

4.1.1 Vanishing Moments. When storing or transmitting a signal, it is best if we can do so by using the least amount of storage or the least amount of bandwidth. In the document retrieval domain, we want to store our data in the smallest space possible (which is also simple to access and retrieve). When compressing a signal, whether it be for storage or transmission, we want to fit as much information as possible into the smallest possible space. Mapping data into a transformed space is a simple method of compression because the transform algorithm can be used at the compression and decompression sides without any statistical knowledge of the data. Optimal compression algorithms encode the frequent data in the smallest number of symbols. Therefore, we want a set of orthogonal wavelet basis functions which would be best for most of the signals we want to compress. Signal compression is measured in terms of the vanishing moments of the wavelet function.

The kth moment of a function f(t) is defined as

ν_k = ∫_{−∞}^{∞} t^k f(t) dt.    (20)

Therefore, for the kth moment to vanish, Equation (20) must equal zero. A wavelet is said to have n vanishing moments if Equation (20) is zero for 0 ≤ k < n. We will show that wavelets with a higher number of vanishing moments are able to represent smooth functions in a more compact manner.


Fig. 6. Functions exhibit compact support if they have no infinite interval of non-zero values.

The number of vanishing moments of a wavelet is related to the differentiability of the wavelet. If we consider the signal f which is m times differentiable in the region [v − h, v + h], we can find the Taylor polynomial approximation to this at v:

p_v(t) = Σ_{k=0}^{m−1} (f^(k)(v)/k!) (t − v)^k,    (21)

which has the error ε_v(t) = f(t) − p_v(t), where

|ε_v(t)| ≤ (|t − v|^m / m!) sup_{u∈[v−h,v+h]} |f^(m)(u)|   ∀t ∈ [v − h, v + h].    (22)

The wavelet transform of f is

W(u, s) = ∫_{−∞}^{∞} f(t) (1/√s) ψ((t − u)/s) dt    (23)
        = (1/√s) ∫_{−∞}^{∞} f(t) ψ(t′) dt    (24)
        = (1/√s) ∫_{−∞}^{∞} p_v(t) ψ(t′) dt + (1/√s) ∫_{−∞}^{∞} ε_v(t) ψ(t′) dt,    (25)

where ψ(t) has n vanishing moments. If the polynomial p_v has degree of at most n − 1, we notice that the first term vanishes and we are left with the wavelet transform of the error term:

W(u, s) = W ε_v(u, s).    (26)

Therefore, the more vanishing moments a wavelet has, the smaller the error term will be. This implies that if we use a wavelet with more vanishing moments, we will produce transformed data components in which only a few will be significant, requiring little storage space.
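As a rough numerical illustration of vanishing moments, the sketch below computes the discrete moments of the Haar and Daubechies-4 high-pass filter coefficients; this is our own check, the helper name is ours, and the filter values are the standard published ones.

import numpy as np

def discrete_moments(h, k_max=3):
    # Discrete analogue of Equation (20): sum_n n^k h[n] for k = 0 .. k_max-1.
    n = np.arange(len(h))
    return np.array([np.sum(n ** k * h) for k in range(k_max)])

haar_high = np.array([1.0, -1.0]) / np.sqrt(2)
d4_low = np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
                   3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))
d4_high = d4_low[::-1] * np.array([1, -1, 1, -1])   # quadrature mirror of the low-pass filter

print(np.round(discrete_moments(haar_high), 6))   # [ 0. -0.707107 -0.707107]: only the 0th moment vanishes
print(np.round(discrete_moments(d4_high), 6))     # [ 0.  0. -1.224745]: the first two moments vanish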

4.1.2 Size of Support. Another factor that affects the focusing of the wavelet transform is the support size of the wavelet. The support of a function f is the domain in which the function is nonzero [Weisstein 1999]. A function f has compact support if its support is bounded. For example, the square function (shown in Figure 6)

f_sq(t) = { 1  if 0 ≤ t ≤ 1,
            0  otherwise,    (27)


has support t ∈ [0, 1], which is compact; therefore f_sq has compact support. If we observe the function f_exp(t) = e^t, we observe that it has support t ∈ R, which is not compact. Therefore, f_exp does not have compact support.

When choosing a wavelet to use for a specific data set, it is essential that we examine the wavelet's size of support. If we examine the wavelet

ψ_{j,n}(t) = 2^{−j/2} ψ(2^{−j} t − n),    (28)

we notice that, if there exists a singularity of large magnitude at point t₀ in function f, then ⟨f, ψ_{j,n}⟩ may have a large magnitude. If there are K wavelets whose support includes the point t₀ for each level of scale 2^j, then the wavelet function has support of size K. Therefore, the greater the size of support of the wavelet, the more wavelet components will include the singularity t₀, and therefore the more likely many high-magnitude wavelet coefficients will exist. If we reduce the size of the support of the wavelet, we will have fewer high-magnitude components, and therefore fewer significant components to consider (for storage or transmission) when performing later calculations.

Choosing a small size of support is essential in our application of the wavelet transform. If we examine our term signals, we will see that they consist of a few singularities (most term signals contain one singularity). Therefore, the larger the support of the wavelet, the more nonzero components will exist in the transformed data. Our index size will be more compact if we choose a wavelet with a small size of support.

4.2 Selected Wavelets

To analyze the positions of the terms and their relationship with other terms in the document, we must be able to analyze their relationship document-wise (as the low-resolution set of wavelets should do, by including every term signal component in their calculations) and we must also be able to analyze the terms position-wise (as the high-resolution set of wavelets should do, by including only single components). Every wavelet has a lowest level of resolution which includes every element in the set to be transformed, but not every wavelet can focus tightly on the elements wanted. We mentioned before that, if we want to find singularities in a signal and obtain transformed data with the fewest nonzero coefficients, then we must choose a wavelet which has a small support. The wavelets with the smallest support size are the Haar wavelet and the Daubechies-4 wavelet [Daubechies 1988]. This can be seen by the number of filter coefficients needed to describe each of them (two for Haar and four for Daubechies-4).

4.2.1 The Haar Wavelet. The Haar wavelet [Haar 1910] (shown with its scaling function in Figure 7) has compact support of size 1 but is not continuously differentiable. The compact support of the wavelet implies that the transformed signal will require less storage space than one which does not have such compact support. For example, if we take the Fourier transform of an impulse, we get a significant value (not close to zero) for each of the frequency components (therefore, we have to store B values for B components). If we do the same with the Haar wavelet, we obtain B/2 significant values. This example


Fig. 7. The Haar scaling function (φ(t)) and wavelet (ψ(t)).

is a very common case for word signals; therefore, the size of the index should reduce by more than 50%.
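This difference can be checked numerically. The sketch below reuses the haar_dwt function sketched in Section 3.4 (an assumption of ours, not code from the paper) and compares the number of significant coefficients produced by the Fourier and Haar transforms of a single-occurrence term signal with B = 8.

import numpy as np

impulse = np.zeros(8)
impulse[0] = 1.0                      # a term that occurs once, in the first eighth

fourier = np.fft.fft(impulse)
haar = haar_dwt(impulse)              # haar_dwt as sketched in Section 3.4

print(np.count_nonzero(np.abs(fourier) > 1e-12))   # 8: every Fourier component is significant
print(np.count_nonzero(np.abs(haar) > 1e-12))      # 4: only B/2 Haar components are non-zero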

Due to the shape of the Haar wavelet, if we apply the Haar wavelet to a signal, we can see that we are calculating the difference between the left and right components of the signal. If the resulting inner product of a positive real signal is positive, we can deduce that the signal is biased to the left. If the inner product is negative, then the signal is biased to the right. This also goes for the positions of the signal when examining the focused components.

It is useful to recall that, when using the Fourier [Park et al. 2004] and cosine transforms [Park et al. 2002a], we obtained the word signal and mapped it to the corresponding domain using the transform. By obtaining the spectral information, we were able to compare signals and find their relative positions to each other using the phase of the spectrum, and we were able to identify the frequencies of the terms by examining the magnitude of the signal. The Fourier transform was initially chosen because of its ability to map shifts in a signal to a phase change. The cosine transform was chosen for its partial ability to map shifts to phase changes and because it produces real values (unlike the Fourier transform, which produces complex values), which require less storage (and hence a smaller, faster spectral index). If we examine the Haar transform, we will notice that it deals with shifts in its own unique way. For example, if we compare the transform of a signal with a value of 1 at the first element and zeros elsewhere, and the transform of a signal with a value of 1 at the second element and zeros elsewhere (see Table II), we observe that the only difference between these two signals is the fifth wavelet component (which changes its sign).

We notice from Table II that the same goes for impulses at components three and four, five and six, and seven and eight. If we examine the shift across a factor-of-2 boundary (e.g., compare positions 2 and 3, or 4 and 5), we see more


Table II. Haar Wavelet Decomposition of Signal with Impulse at Term Component Position

Term Signal          Haar Wavelet Transform of Term Signal
[1 0 0 0 0 0 0 0]    [ 0.35  0.35  0.50  0     0.71  0     0     0    ]
[0 1 0 0 0 0 0 0]    [ 0.35  0.35  0.50  0    −0.71  0     0     0    ]
[0 0 1 0 0 0 0 0]    [ 0.35  0.35 −0.50  0     0     0.71  0     0    ]
[0 0 0 1 0 0 0 0]    [ 0.35  0.35 −0.50  0     0    −0.71  0     0    ]
[0 0 0 0 1 0 0 0]    [ 0.35 −0.35  0     0.50  0     0     0.71  0    ]
[0 0 0 0 0 1 0 0]    [ 0.35 −0.35  0     0.50  0     0    −0.71  0    ]
[0 0 0 0 0 0 1 0]    [ 0.35 −0.35  0    −0.50  0     0     0     0.71 ]
[0 0 0 0 0 0 0 1]    [ 0.35 −0.35  0    −0.50  0     0     0    −0.71 ]

than one coefficient change. This is due to the decomposition of the discrete dyadic wavelet transform and the support of the Haar wavelet.

The last four wavelet coefficients show small changes in the signal position, the third and fourth wavelet coefficients display larger changes in the signal position, and the second component shows even greater changes in the signal position (the first coefficient is related to the sum of the signal and is not affected by the signal position). From this, we can see that, if we take only the first 1, 2, or 4 coefficients, we will be observing the function at a different resolution. Wavelets give us a new perspective by extracting the positions at different resolutions.

4.2.2 Daubechies Wavelets. A short time after the introduction of conjugate mirror filters, the conditions of the filters were found to be identical to those of orthogonal wavelets. Ingrid Daubechies [Daubechies 1988] took advantage of this relationship and found that for a wavelet (filter) to have p vanishing moments, it must have a support of at least 2p. This theorem set a lower limit on the number of filter coefficients needed to perform a wavelet decomposition of a certain smoothness. From this knowledge, she was also able to derive a set of wavelets which have this minimum support for a given number of vanishing moments; these are called Daubechies wavelets.

It is interesting to note that the wavelet derived with minimum support when given only one vanishing moment is in fact the Haar wavelet.

The wavelet we will be examining is the Daubechies-4 wavelet (so called because it has four filter coefficients in each of the high-pass and low-pass filters, and two vanishing moments), shown in Figure 8.

5. COMPUTATIONAL COMPLEXITY AND STORAGE

To thoroughly examine a retrieval method, we must not only examine the precision of the results, but also the computational and storage requirements.

5.1 Computational Complexity

If we take note of how the discrete wavelet transform can be performed using a recursive filter decomposition (shown in Section 3.4), we can easily see that the time taken to perform the discrete dyadic wavelet transform is linear with respect to the number of discrete elements chosen to transform [Mallat 2001].


Fig. 8. The Daubechies-4 scaling function (φ(t)) and wavelet (ψ(t)).

Therefore, if we choose B components for our term signal vectors, the wavelet transform will be of order O(B), while the FFT is of order O(B log B) [Proakis and Manolakis 1996]. This is performed for each query term. In the following lists, we will represent the number of documents in the document set as N, the number of query terms as τ, and the number of components as B.

Each vector space method of document retrieval at query time involves the following:

(1) Apply the specific weighting scheme to each query term in each document (O(Nτ)).

(2) Sum each weighted query term in each document to obtain the document scores for each document (O(Nτ)).

Therefore, the overall computational complexity of a vector space method is O(Nτ). The spectral methods require a bit more computation:

(1) Apply the specific weighting scheme to each component of each query term in each document (O(NτB)).

(2) Perform the selected transform on each weighted word signal in each document (O(NτB log B) for the FFT or O(NτB) for the DWT).

(3) Calculate the phase precision for each spectral component across each query term in each document (O(NτB)).

(4) Calculate the magnitude for each spectral component across each query term in each document (O(NτB)).

(5) Calculate the score components by multiplying the magnitude and phase precision of each component in each document (O(NB)).

(6) Find the document score by summing the score components from each document (O(NB)).


Therefore, the overall time complexity of the Fourier transform method is O(NτB log B) and the time complexity of the wavelet method is O(NτB) if the transform is performed during the query time.

We have taken a simplistic view of both the vector space and spectral methods by including calculations on all documents. In practice, we would precompute the spectral values and use accumulation schemes to find the approximate top n documents, and hence only the values associated with the top n documents would need to be processed at query time.

A typical query time for our experiments performed later in the article (Section 7.2) is 0.02 s for the vector space method and 0.1 s for the spectral retrieval method using eight components. Reducing the number of components used in the calculations reduces the query time in a linear fashion.

At this time, we should remind the reader that the vector space method of information retrieval is a special case of the wavelet information retrieval method where B = 1. Therefore, by using the wavelet method, we are able to have more freedom as to how we perform our searches by choosing alternative B values to trade off between speed and accuracy.

5.2 Storage Size

To examine the impact our wavelet method has on storage, we will look at the common case of a document containing one occurrence of a term. In this case, the previous methods using the Fourier and cosine transforms would transform a term signal with one nonzero element into a term spectrum with no zero elements. This leaves us with B times as many elements to store when compared to the vector space methods. Using compression techniques such as quantization and cropping [Park et al. 2002b], we were able to retain the high-precision results of the Fourier and cosine retrieval methods with an index only four times the size of that of the vector space methods.

By using the wavelet transform, we obtain a term spectrum with B/2 nonzero elements when transforming a term signal with one nonzero element. Therefore, by using the same compression techniques found in the Fourier and cosine methods, we would expect to reduce the index size even further while still retaining the high precision of the wavelet retrieval method.
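A quick check of this claim for B = 8, using a hypothetical single-occurrence, unit-weight term signal:

```python
import numpy as np

B = 8
signal = np.zeros(B)
signal[0] = 1.0                              # the term occurs once in the document

x, details = signal.copy(), []
while len(x) > 1:                            # Haar DWT, as sketched in Section 5.1
    details.append((x[0::2] - x[1::2]) / np.sqrt(2))
    x = (x[0::2] + x[1::2]) / np.sqrt(2)
haar_spectrum = np.concatenate([x] + details[::-1])

print(np.count_nonzero(haar_spectrum))       # 4: half of the wavelet components are zero
print(np.count_nonzero(np.fft.fft(signal)))  # 8: the Fourier spectrum has no zero elements
```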

Our experiments showed that, when we stored the spatial values (the term signals, consisting of positive integers and many zeros) in the index, we were able to achieve an index size 20% larger than that of the vector space method (74 MB compared to 60 MB). When storing the spectral values (the term spectra, consisting of floating-point values and fewer zeros) in the index, using 6-bit floating-point quantization, the index was 160% larger than the vector space method index (160 MB). In this index, all eight spectral components were stored for each signal. To reduce the index size, we could choose to store only the first two or four spectral components of each term spectrum. By removing these components, the index size would be reduced linearly, but the precision of the document rankings would be slightly poorer. Our experiments show this effect on precision.


6. ANALYSIS

Before we present the experimental results of our new method, we first demonstrate that our scoring function satisfies two important basic properties:

—The score must increase monotonically as the number of occurrences of a query term increases.

—The score must increase monotonically as the displacement between two query terms decreases.

We will examine these properties in the following sections with both the Haar and Daubechies wavelets.

6.1 Occurrence Analysis

A desired property of a document score calculation is an increase in score when there is an increase in the number of query terms found in the document. The spectral document ranking method takes into account the number of query terms in a document during the magnitude calculations. If we ignore the phase precision for the moment, we see that the document score is based on the $l_p$ norm of the sum of the query term spectra associated with the document.

The wavelet transform is a linear transformation; therefore, it satisfies the following properties:

(1) $W(x_1 + x_2) = W(x_1) + W(x_2)$,
(2) $W(\alpha x) = \alpha W(x)$,

where $x$ is a vector and $\alpha$ is a scalar. This implies that summing the query term signals and performing the wavelet transform on the combined signal will achieve the same result as summing the query term spectra. Therefore, we can examine the effect of one term signal and assume that it is the sum of all of the query term signals.
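A minimal check of this linearity argument, using the Haar transform sketch from Section 5.1 and two hypothetical term signals:

```python
import numpy as np

def haar_dwt(x):
    x, details = np.asarray(x, dtype=float), []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return np.concatenate([x] + details[::-1])

w1 = np.array([2.0, 0, 0, 1, 1, 1, 0, 0])    # hypothetical query term signals
w2 = np.array([0.0, 0, 0, 1, 0, 1, 0, 0])

# Transforming the summed term signals equals summing the term spectra.
assert np.allclose(haar_dwt(w1 + w2), haar_dwt(w1) + haar_dwt(w2))
```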

We want to show that the document score for the combined query term signal $w_{d,t} + x$ is greater than the document score for the term signal $w_{d,t}$, where $x$ is a vector containing real values greater than or equal to zero.

THEOREM 1. If we increase our term signal $w_{d,t}$ to $w_{d,t} + x$, where each element $x_b \ge 0$, then

$$\|\zeta_{d,t}\|_2 \le \|\zeta_{d,t} + \xi\|_2, \qquad (29)$$

where $\zeta_{d,t}$ and $\xi$ are the wavelet transforms of $w_{d,t}$ and $x$, respectively.

PROOF. If we take a single component $w_{d,t,b}$ of $w_{d,t}$, we can show that

$$w_{d,t,b} \le w_{d,t,b} + x_b, \qquad (30)$$

$$w_{d,t,b}^2 \le (w_{d,t,b} + x_b)^2, \qquad (31)$$

$$\|w_{d,t}\|_2 \le \|w_{d,t} + x\|_2, \qquad (32)$$

since the term signal elements are nonnegative. By using Plancherel's theorem [Mallat 2001], we observe that the signal energy is conserved after the wavelet transform has taken place:

$$\|W(w_{d,t})\|_2 \le \|W(w_{d,t} + x)\|_2. \qquad (33)$$

The linear property of the wavelet transform allows us to split the transform on the right-hand side to obtain

$$\|\zeta_{d,t}\|_2 \le \|\zeta_{d,t} + \xi\|_2, \qquad (34)$$

where $\zeta_{d,t} = W(w_{d,t})$ and $\xi = W(x)$.

If we use the $l_2$ norm in our document score calculations, Theorem 1 shows us that the document score will increase if we increase the number of query terms in the document.

To generalize the magnitude score calculation to $p$ being a natural number, we will treat the shifted and scaled wavelets as an orthonormal basis $\psi_b$. Therefore, each of the term spectrum coefficients can be written as the inner product of the term signal and one of the wavelet basis vectors ($\zeta_{d,t,b} = \langle w_{d,t}, \psi_b \rangle$),

$$\|\zeta_{d,t}\|_p = \sum_{b=0}^{B-1} |\zeta_{d,t,b}|^p = \sum_{b=0}^{B-1} |\zeta_{d,t,b}^p| = \sum_{b=0}^{B-1} |\langle w_{d,t}, \psi_b \rangle^p|, \qquad (35)$$

assuming that the wavelets are in the real domain. If we increase $w_{d,t}$ by $\alpha\delta_k$ (where each element $\delta_{k,b} = 1$ if $k = b$, and zero otherwise), we obtain

$$\sum_{b=0}^{B-1} |\langle w_{d,t} + \alpha\delta_k, \psi_b \rangle^p| = \sum_{b=0}^{B-1} |(\langle w_{d,t}, \psi_b \rangle + \alpha\psi_{b,k})^p|. \qquad (36)$$

Therefore, we want to show that

$$\sum_{b=0}^{B-1} |\langle w_{d,t}, \psi_b \rangle^p| \le \sum_{b=0}^{B-1} |(\langle w_{d,t}, \psi_b \rangle + \alpha\psi_{b,k})^p|. \qquad (37)$$

For the $p = 1$ case, the right-hand side becomes

$$\sum_{b=0}^{B-1} |\langle w_{d,t}, \psi_b \rangle + \alpha\psi_{b,k}|. \qquad (38)$$

This shows that, in the $p = 1$ case, there is no guarantee that Equation (37) will hold: whenever $\alpha\psi_{b,k}$ has the opposite sign to $\langle w_{d,t}, \psi_b \rangle$, the corresponding term can decrease, so the outcome depends on the wavelet set and the original signal. For the $p = 2$ case, the right-hand side of Equation (37) becomes

$$\sum_{b=0}^{B-1} \langle w_{d,t}, \psi_b \rangle^2 + 2\langle w_{d,t}, \psi_b \rangle\alpha\psi_{b,k} + \alpha^2\psi_{b,k}^2. \qquad (39)$$

If we split up the summation we obtain

$$\sum_{b=0}^{B-1} \langle w_{d,t}, \psi_b \rangle^2 + 2\alpha\sum_{b=0}^{B-1} \langle w_{d,t}, \psi_b \rangle\psi_{b,k} + \alpha^2\sum_{b=0}^{B-1} \psi_{b,k}^2, \qquad (40)$$

which simplifies to

$$\sum_{b=0}^{B-1} \langle w_{d,t}, \psi_b \rangle^2 + 2\alpha w_{d,t,k} + \alpha^2 \qquad (41)$$

due to the orthonormal properties of the wavelet basis vectors. We know that $w_{d,t,k}$ and $\alpha$ are positive; therefore, any increase in $w_{d,t,k}$ results in an increase in the magnitude calculation of the document score for $p = 2$ (which is also shown by Theorem 1).

This shows that we would expect the $l_2$ norm results to be more precise than the $l_1$ norm results. This fact is supported by our experimental results.
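The asymmetry between the two norms can also be seen numerically. In the sketch below (a hand-picked, hypothetical term signal; the Haar transform is the one sketched in Section 5.1), adding an extra occurrence of the term raises the l2 magnitude, as Theorem 1 guarantees, yet lowers the l1 magnitude.

```python
import numpy as np

def haar_dwt(x):
    x, details = np.asarray(x, dtype=float), []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return np.concatenate([x] + details[::-1])

w = np.array([0.0, 1, 1, 1, 2, 2, 0, 0])     # hypothetical term signal
x = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])     # one extra occurrence (alpha = 1, k = 0)

before, after = haar_dwt(w), haar_dwt(w + x)
print(np.sum(before ** 2) <= np.sum(after ** 2))        # True: the l2 magnitude grows
print(np.sum(np.abs(before)) <= np.sum(np.abs(after)))  # False: the l1 magnitude shrinks here
```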

6.2 Proximity Analysis

The spectral document retrieval methods achieve high precision because they combine ideas from the vector space methods, which only analyze the term counts in the documents, and the proximity methods, which take into account the displacement of the query terms from each other. In this section we will examine how each method reacts to the positions of two different terms. Ideally, each method would give a score which decreases monotonically as the displacement between the terms becomes larger, but we will see that this is not always the case.

To conduct the experiment, we have chosen two query terms ($q_m$, $q_n$) which exist in a document once each, at a displacement of $b$ components apart. Each query term has the same term weight and, since they appear in the same document, the same document normalization.

Due to the nature of the Haar wavelet decomposition, each wavelet coefficient of $q_m$ will have the same sign as the wavelet coefficients of $q_n$ if the word appears in the same position. If we normalize our weighted term signals for $q_m$ and $q_n$ so that only ones and zeros occur, we are able to show the effect of each position on the Haar wavelet transform.

Analyzing the proximity of terms is not as simple as in the Fourier transform case. We cannot use displacement as a parameter because the wavelets have compact support.

To analyze the change of the score when the proximity of the terms changes, we experimented with the Haar and Daubechies-4 methods. Each experiment compared a single term appearing at component position zero with another term whose position was adjusted. The score was then plotted against the displacement of the two terms. We hoped that the score would decrease as the displacement between the two terms grew. The results are shown in Figures 9(a), 9(b), 10(a), and 10(b). They show that both of the Haar methods exhibit a monotonic decrease as the displacement increases, but both of the Daubechies-4 methods do not. We expected problems with the Daubechies-4 wavelet method because the wavelet shape (shown in Figure 8) is not a simple rise and fall: the wavelet decreases, then increases above zero, and then returns to zero. This is common behavior among wavelets because of the property that they must integrate to zero. For this reason, we will assume that most other wavelets will behave in the same fashion as the Daubechies-4 wavelet.
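The Haar half of this experiment can be approximated with the following sketch, a simplified version using unit weights and no document normalization (so the absolute values need not match the published figures):

```python
import numpy as np

def haar_dwt(x):
    x, details = np.asarray(x, dtype=float), []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return np.concatenate([x] + details[::-1])

def score(displacement, B=8, p=2):
    """Score of a document with one occurrence of each of two query terms:
    the first in component 0, the second `displacement` components away."""
    s1, s2 = np.zeros(B), np.zeros(B)
    s1[0], s2[displacement] = 1.0, 1.0
    spectra = np.array([haar_dwt(s1), haar_dwt(s2)])
    magnitude = np.abs(spectra).sum(axis=0)
    precision = np.abs(np.sign(spectra).sum(axis=0)) / 2.0   # zero phase precision
    components = magnitude * precision
    return components.sum() if p == 1 else np.sum(components ** 2)

for b in range(8):
    print(b, score(b, p=1), score(b, p=2))   # neither score increases with displacement
```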

7. EXPERIMENTS AND RESULTS

We have performed extensive experiments using the spectral retrieval method to empirically show that the wavelet information retrieval method is superior to the vector space and proximity methods. But before we discuss our results in depth, we will go through a simple example of how this wavelet document retrieval method works.

Fig. 9. Scores derived from the Haar wavelet transform using (a) the sum and (b) the sum of squares of the score components of a document containing two query terms of equal weight. The first term is found in component zero; the second at the term bin displacement shown on the horizontal axis.

Fig. 10. Scores derived from the Daubechies-4 wavelet transform using (a) the sum and (b) the sum of squares of the score components of a document containing two query terms of equal weight. The first term is found in component zero; the second at the term bin displacement shown on the horizontal axis.

7.1 Sample Document Set

We will use the sample data found in Table III and their wavelet transforms found in Table IV. The tables contain three selected terms from three different documents. Each term in each document has a corresponding term signal containing eight elements. Each element represents the occurrences of that term in that portion of the document. For example, the first element of the term signal laugh in document 1 is 2. This implies that the term laugh occurs twice in the first eighth of document 1. The fourth element contains the value 1, which implies that the term laugh occurs once in the fourth eighth of document 1. Note: we will ignore the initial preweighting stage and process the data as if the weights were unitary. Preweighting is an important part of the retrieval process; it is only left out of this example to focus on the wavelet retrieval process.

Table III. A Sample Set of Terms with Their Term Signals Within Three Documents

Terms   Document 1          Document 2          Document 3
laugh   [2 0 0 1 1 1 0 0]   [0 0 0 1 0 1 0 0]   [0 1 0 0 1 0 0 1]
diary   [0 0 0 1 0 1 0 0]   [0 1 0 0 1 0 0 0]   [0 1 0 1 0 0 0 0]
smile   [0 0 0 0 0 0 1 0]   [0 0 0 0 1 0 0 0]   [0 0 0 0 0 1 0 0]

Table IV. The Haar Wavelet Transforms of the Term Signals Found in Table III

Terms   Document 1                                        Document 2
laugh   [ 5/√8   1/√8   1/√4   2/√4   2/√2  −1/√2   0      0    ]   [ 2/√8   0     −1/√4   1/√4   0     −1/√2  −1/√2   0 ]
diary   [ 2/√8   0     −1/√4   1/√4   0    −1/√2  −1/√2   0    ]   [ 2/√8   0      1/√4   1/√4   1/√2   0      1/√2   0 ]
smile   [ 1/√8  −1/√8   0     −1/√4   0     0      0      1/√2 ]   [ 1/√8  −1/√8   0      1/√4   0      0      1/√2   0 ]

Terms   Document 3
laugh   [ 3/√8  −1/√8   1/√4   0      1/√2   0      1/√2  −1/√2 ]
diary   [ 2/√8   2/√8   0      0     −1/√2  −1/√2   0      0    ]
smile   [ 1/√8  −1/√8   0      1/√4   0      0     −1/√2   0    ]

If we query our database with the terms smile and diary, the system will extract the data for those terms only (therefore, we will ignore the data for the term laugh from now on). The system will calculate the score for each document, so we will examine the score calculation for the first document in detail. Our wavelet components are

Terms   Document 1
diary   [ 2/√8   0      −1/√4   1/√4    0      −1/√2   −1/√2   0    ]
smile   [ 1/√8   −1/√8   0      −1/√4   0      0       0       1/√2 ]

Take the magnitudes and sum them:

Terms   Document 1 Magnitude
diary   [ 2/√8   0/√8    1/√4    1/√4    0/√2   1/√2    1/√2    0/√2 ]
smile   [ 1/√8   1/√8    0/√4    1/√4    0/√2   0/√2    0/√2    1/√2 ]
Total   [ 3/√8   1/√8    1/√4    2/√4    0/√2   1/√2    1/√2    1/√2 ]

Take the phase and find the zero phase precision:

Terms   Document 1 Phase
diary                  [ 1    0     −1    1    0    −1    −1    0   ]
smile                  [ 1    −1    0     −1   0    0     0     1   ]
Zero phase precision   [ 1    1/2   1/2   0    0    1/2   1/2   1/2 ]

Combine the magnitude and phase:

Document 1 magnitude              [ 3/√8   1/√8      1/√4      2/√4   0/√2   1/√2      1/√2      1/√2    ]
Document 1 zero phase precision   [ 1      1/2       1/2       0      0      1/2       1/2       1/2     ]
Document 1 score vector           [ 3/√8   1/(2√8)   1/(2√4)   0/√4   0/√2   1/(2√2)   1/(2√2)   1/(2√2) ]

We can choose either the sum or the squared sum of the score vector as our document score. If we choose the squared sum, we obtain

Document 1 score = 51/32 = 1.5938.

If we follow the same process for the other two documents we obtain

Document   Score
1          1.5938
2          4.3437
3          1.5625

If we examine the positions of the query terms (in the term signals), we can see that the document scores reflect the proximity and occurrence of the query terms in the documents. Document 2 scored the highest and has the query terms closest together within the document; we notice in document 2 that both query terms appear in the same component. Document 1 ranked second, and we observe that its query terms appear in neighboring components. Finally, document 3 ranked third, with its query terms more distant than in the other two documents.
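The whole calculation can be condensed into a short script. The sketch below uses unit weights and the Haar transform of Section 5.1 (the signs of individual detail coefficients depend only on the differencing convention and do not affect the scores); it reproduces the three document scores above.

```python
import numpy as np

def haar_dwt(x):
    x, details = np.asarray(x, dtype=float), []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return np.concatenate([x] + details[::-1])

# Term signals of the query terms (Table III); weights are assumed unitary.
query_signals = {
    1: [[0, 0, 0, 1, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0]],   # diary, smile
    2: [[0, 1, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0]],
    3: [[0, 1, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0]],
}

for doc, signals in query_signals.items():
    spectra = np.array([haar_dwt(s) for s in signals])
    magnitude = np.abs(spectra).sum(axis=0)                          # summed magnitudes
    precision = np.abs(np.sign(spectra).sum(axis=0)) / len(spectra)  # zero phase precision
    print(doc, np.sum((magnitude * precision) ** 2))
    # prints 1.59375, 4.34375, 1.5625, i.e. the 1.5938, 4.3437, 1.5625 above
```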

7.2 Application to TREC Document Set

To judge how accurate our wavelet document retrieval methods are, we will compare them to current methods which have been shown to achieve high precision. The experiments we performed involved the variety of methods shown in Table V. These methods are labeled so that the reader can identify the specific algorithm used. For example, daub4-9-4-6.b8 indicates that the Daubechies-4 wavelet was used; 9 as the first digit implies Lnu.ltu weighting, 4 as the second digit means zero phase precision, 6 as the third digit shows that the l2 norm was used to sum the score vectors, and b8 implies that all eight spectral components were used in the score calculations. We compared these against the vector space methods AB-AFD-BAA, BD-ACI-BCA, BI-ACI-BCA, and Lnu.ltu weighting from SMART. We also compared them against a successful term proximity measure [Clarke and Cormack 2000] called shortest-substring retrieval (shown as SSS). Included in the results are the precision values of our fds-7-4-1 method [Park et al. 2004]. This is a method similar to our wavelet method, but it uses a Fourier transform in place of the wavelet transform. The fds-7-4-1 method uses AB-AFD-BAA preweighting.

For each of the trials, we set the term signal length B = 8 (which is also the spectrum signal length). This value has been shown in previous experiments using the Fourier transform [Park et al. 2004] to obtain high precision without using excessive storage.


Table V. Experimental Methods (Method names are of the form wavelet-x-y-z.bn, where the values of x, y, z, n correspond to the descriptions in this table; e.g., haar-5-4-6.b4 implies use of the Haar wavelet, with BD-ACI-BCA preweighting, using summed magnitudes with zero phase precision, combining components with the l2 norm, and using only the first four of eight components)

Label   Value          Description
x       5              BD-ACI-BCA preweighting
        7              AB-AFD-BAA preweighting
        8              BI-ACI-BCA preweighting
        9              Lnu.ltu preweighting
y       1              Sum vectors with no phase precision
        4              Sum magnitudes with zero phase precision
z       1              Combine using l1 norm
        6              Combine using l2 norm
n       {1, 2, 4, 8}   Number of score components added

Fig. 11. Examples of queries taken from TREC queries 51–200 titles.

To perform any experiment in information retrieval, we need a substantial database of documents and a well-defined set of queries and documents which are relevant to these queries. The TREC collection3 is just this. We chose to experiment on the AP2WSJ2 (Associated Press disk 2 and Wall Street Journal disk 2) set containing 154,443 documents. We also selected the titles of queries 51 to 200 (from TREC 1, 2, and 3) as our query set. Examples of typical queries can be seen in Figure 11. We are interested in an algorithm that would be effective for text retrieval on the Web. We have observed that most Web search engine users will only examine the first 20 documents retrieved. If they do not obtain the results they want within this selection, the query is usually reformulated and the search is tried again. Therefore, we will only observe the precision after the first 5, 10, 15, and 20 documents (this represents the impatience of the typical Web user). The results sorted by precision after 5, 10, 15, and 20 documents are given in Tables VI(a), VI(b), VII(a), and VII(b), respectively. Each table shows the results for the top 20 and, below the bar, the results of any of the comparative methods that did not appear in the top 20.
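For reference, the precision measure we report is simply the fraction of relevant documents among the first k retrieved; a minimal helper (the names are illustrative, not part of our system) is:

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Precision after the first k documents retrieved."""
    retrieved = ranked_doc_ids[:k]
    return sum(1 for d in retrieved if d in relevant_doc_ids) / k

# e.g. precision_at_k(["d7", "d2", "d9", "d4", "d1"], {"d2", "d4", "d8"}, 5) == 0.4
```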

We notice from the tables that the methods that appear at the top of each list use all eight wavelet components to calculate the score and that most use the l2 method to combine the score.

3http://trec.nist.gov/.


Table VI. Top 18 Methods Using Data Set AP2WSJ2 with Queries 51 to 200, Sorted by Precision After (a) 5 and (b) 10 Documents (The methods under the lower bar are the vector space and proximity methods which did not make the top 20.)

(a)                              (b)
Method            Precision 5    Method            Precision 10
haar-5-4-6.b8     0.5000         daub4-7-4-6.b8    0.4687
haar-7-4-6.b8     0.4960         fds-7-4-1         0.4673
daub4-5-4-6.b8    0.4960         daub4-5-4-6.b8    0.4653
daub4-7-4-6.b8    0.4947         haar-5-4-6.b8     0.4633
fds-7-4-1         0.4947         daub4-9-4-6.b8    0.4620
daub4-9-4-6.b8    0.4907         haar-7-4-6.b8     0.4593
haar-9-4-6.b8     0.4893         haar-5-4-6.b4     0.4593
AB-AFD-BAA        0.4880         haar-9-4-6.b8     0.4593
daub4-7-4-1.b1    0.4827         haar-7-4-6.b2     0.4573
haar-7-4-6.b1     0.4827         haar-9-4-1.b8     0.4560
haar-7-4-1.b1     0.4827         haar-7-4-6.b4     0.4560
daub4-7-4-6.b1    0.4827         daub4-5-4-6.b2    0.4553
haar-5-4-6.b1     0.4813         haar-7-4-1.b2     0.4547
haar-5-4-6.b2     0.4813         daub4-7-4-1.b8    0.4547
haar-7-4-6.b2     0.4813         daub4-7-4-6.b4    0.4547
daub4-7-4-6.b4    0.4813         daub4-9-4-6.b4    0.4540
daub4-5-4-6.b1    0.4813         daub4-7-4-6.b2    0.4540
daub4-9-4-1.b8    0.4813         daub4-5-4-6.b4    0.4540
-----------------------------    -----------------------------
SMART             0.4693         AB-AFD-BAA        0.4493
BD-ACI-BCA        0.4440         SMART             0.4493
BI-ACI-BCA        0.4347         BD-ACI-BCA        0.4247
SSS               0.3718         BI-ACI-BCA        0.4100
                                 SSS               0.3362

Each of these methods uses either the AB-AFD-BAA, BD-ACI-BCA, or Lnu.ltu weighting with zero phase precision. We can also see that the shortest-substring (SSS) method performs poorly relative to the other methods. The shortest-substring method makes use of logic operators to find the shortest substring of text containing the query. Since there are no operators within the queries used in our experiments, we assumed that all of the query terms were needed (equivalent to inserting an AND operator between each of the terms). Therefore, if any of the terms did not appear in a document, there would be no shortest substring and hence no document score. By using this similarity function, many relevant documents would have received a zero score, resulting in an overall poor precision for the shortest-substring method.

It is interesting to see that, even though the Daubechies-4 proximity analysis was not monotonic, the results were just as good as those of the Haar method for the l2 combination case.

A plot giving the precision after 5, 10, 15, and 20 documents (Figure 12) shows that the wavelet methods are preferred at this level of retrieval (which is the level required for Web searching). The plot shows the Daubechies-4 method above the Haar method for most of the recall levels shown. The Daubechies-4 wavelet produces higher-precision results, but we must also take into account the size of the index produced.


Table VII. Top 18 Methods Using Data Set AP2WSJ2 with Queries 51 to 200, Sorted by Precision After (a) 15 and (b) 20 Documents (The methods under the lower bar are the vector space and proximity methods which did not make the top 20.)

(a)                              (b)
Method            Precision 15   Method            Precision 20
fds-7-4-1         0.4493         daub4-7-4-6.b8    0.4257
haar-7-4-6.b8     0.4449         daub4-9-4-6.b8    0.4223
daub4-7-4-6.b8    0.4449         haar-7-4-6.b8     0.4223
daub4-9-4-6.b8    0.4431         fds-7-4-1         0.4220
haar-9-4-6.b8     0.4413         haar-9-4-6.b8     0.4217
haar-5-4-6.b8     0.4404         AB-AFD-BAA        0.4217
AB-AFD-BAA        0.4404         haar-5-4-6.b8     0.4213
haar-9-4-1.b8     0.4396         haar-9-4-1.b8     0.4190
daub4-5-4-6.b8    0.4391         haar-7-4-1.b8     0.4183
haar-7-4-6.b4     0.4382         SMART             0.4180
haar-7-4-1.b8     0.4382         daub4-5-4-6.b8    0.4177
daub4-7-4-6.b4    0.4373         daub4-7-4-6.b4    0.4157
haar-7-4-6.b2     0.4373         haar-7-4-6.b4     0.4143
SMART             0.4356         haar-7-4-6.b2     0.4137
daub4-9-4-1.b8    0.4351         haar-5-4-1.b8     0.4123
daub4-7-4-6.b2    0.4347         daub4-5-4-6.b4    0.4117
haar-9-4-6.b4     0.4333         haar-9-4-6.b4     0.4107
daub4-9-4-6.b4    0.4329         daub4-9-4-1.b8    0.4103
-----------------------------    -----------------------------
BD-ACI-BCA        0.4142         BD-ACI-BCA        0.3953
BI-ACI-BCA        0.3862         BI-ACI-BCA        0.3657
SSS               0.3078         SSS               0.2856

Fig. 12. Precision-recall plot for recall of 5, 10, 15, and 20 documents for the Haar and Daubechies-4 wavelet methods, and the AB-AFD-BAA vector space method.


Since the Daubechies-4 wavelet has larger support than the Haar wavelet, we must expect that more nonzero values will be produced after the transform is applied. Our experiments have shown that the Daubechies-4 wavelet index is on average 1.4 times the size of the index produced from the Haar wavelet on the AP2WSJ2 document set. Therefore, the Daubechies-4 wavelet is preferable to the Haar wavelet for spectral document retrieval in terms of precision, but that comes at the cost of additional storage.

8. CONCLUSIONS

The wavelet transform is a tool which has been used in many areas of science and engineering to extract information about the self-similarity of signals. Using this information, we are able to encode signals in a more compact manner and easily extract desired content to perform other tasks (e.g., compare signals for similarity).

We have proposed a new spectral-based information retrieval method using the wavelet transform, based on our hypothesis that a document is more likely to be relevant if the patterns of all query term appearances are similar. We have shown through occurrence and proximity analysis that our new retrieval method behaves in the desired manner. By adjusting the document resolution used, we can reduce this scheme to the vector space method.

The techniques developed using the Haar and Daubechies-4 wavelets, when using the l2 component combination on all components and zero phase precision, were able to consistently produce higher-precision results than the vector space and proximity document retrieval methods. The spectral-based retrieval method produced on average a 4% increase in precision when compared to the corresponding vector space method on the TREC AP2 data set. This was achieved with fast query times comparable to those of the vector space method, using a larger index to store the extra information.

ACKNOWLEDGMENTS

Thanks to Andrew Liu for discussions on wavelets, and to the ARC Special Research Centre for Ultra-Broadband Information Networks for its support and funding of this research.

REFERENCES

BUCKLEY, C., SINGHAL, A., MITRA, M., AND SALTON, G. 1995. New retrieval approaches using SMART: TREC 4. See Harman [1995], pp. 25–48.

BUCKLEY, C. AND WALZ, J. 1999. SMART in TREC 8. See Voorhees and Harman [1999], pp. 577–582.

CLARKE, C. L. A. AND CORMACK, G. V. 2000. Shortest-substring retrieval and ranking. ACM Trans. Inform. Syst. 18, 1 (Jan.), 44–78.

DAUBECHIES, I. 1988. Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math. 41, 909–996.

HAAR, A. 1910. Zur Theorie der orthogonalen Funktionensysteme. Mathemat. Annalen 69, 331–371.

HARMAN, D., Ed. 1995. The Fourth Text REtrieval Conference (TREC-4). NIST Spec. Pub. 500-236. National Institute of Standards and Technology, Gaithersburg, MD.

HAWKING, D. AND THISTLEWAITE, P. 1996. Relevance weighting using distance between term occurrences. Tech. rep. TR-CS-96-08. The Australian National University, Canberra, Australia.

JAIN, A. K. 1979. A sinusoidal family of unitary transforms. IEEE Trans. Patt. Analys. Mach. Intell. PAMI-1, 4 (Oct.), 356–365.

MALLAT, S. 2001. A Wavelet Tour of Signal Processing, 2nd ed. Academic Press, San Diego, CA.

MILLER, N. E., WONG, P. C., BREWSTER, M., AND FOOTE, H. 1998. TOPIC ISLANDS—a wavelet-based text visualization system. In VIS '98: Proceedings of the Conference on Visualization '98. IEEE Computer Society Press, Los Alamitos, CA, 189–196.

PARK, L. A. F., PALANISWAMI, M., AND KOTAGIRI, R. 2001. Internet document filtering using Fourier domain scoring. In Principles of Data Mining and Knowledge Discovery, L. de Raedt and A. Siebes, Eds. Lecture Notes in Artificial Intelligence, vol. 2168. Springer-Verlag, Berlin, Germany, 362–373.

PARK, L. A. F., PALANISWAMI, M., AND RAMAMOHANARAO, K. 2002a. A novel Web text mining method using the discrete cosine transform. In 6th European Conference on Principles of Data Mining and Knowledge Discovery, T. Elomaa, H. Mannila, and H. Toivonen, Eds. Lecture Notes in Artificial Intelligence, vol. 2431. Springer-Verlag, Berlin, Germany, 385–396.

PARK, L. A. F., PALANISWAMI, M., AND RAMAMOHANARAO, K. 2005. A novel document ranking method using the discrete cosine transform. IEEE Trans. Patt. Analys. Mach. Intell. 27, 1 (Jan.), 130–135.

PARK, L. A. F., RAMAMOHANARAO, K., AND PALANISWAMI, M. 2002b. A new implementation technique for fast spectral based document retrieval systems. In IEEE International Conference on Data Mining, V. Kumar and S. Tsumoto, Eds. IEEE Computer Society, Los Alamitos, CA, 346–353.

PARK, L. A. F., RAMAMOHANARAO, K., AND PALANISWAMI, M. 2004. Fourier domain scoring: A novel document ranking method. IEEE Trans. Knowl. Data Eng. 16, 5 (May), 529–539.

PROAKIS, J. G. AND MANOLAKIS, D. G. 1996. Digital Signal Processing: Principles, Algorithms and Applications, 3rd ed. Prentice-Hall, Englewood Cliffs, NJ.

ROBERTSON, S. E. AND WALKER, S. 1999. Okapi/Keenbow at TREC-8. See Voorhees and Harman [1999], pp. 151–162.

SALTON, G. AND BUCKLEY, C. 1988. Term-weighting approaches in automatic text retrieval. Inform. Process. Manage. 24, 5, 513–523.

SINGHAL, A., BUCKLEY, C., AND MITRA, M. 1996. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '96, August 18–22, 1996, Zurich, Switzerland; special issue of the SIGIR Forum), H.-P. Frei, D. Harman, P. Schauble, and R. Wilkinson, Eds. ACM Press, New York, NY, 21–29.

UHL, A. 1994. Digital image compression based on non-stationary and inhomogeneous multiresolution analyses. In Proceedings of the IEEE International Conference on Image Processing (ICIP-94), vol. 3, 378–382.

VETTERLI, M. 1986. Filter banks allowing perfect reconstruction. Signal Process. 10, 3, 219–244.

VETTERLI, M. AND HERLEY, C. 1992. Wavelets and filter banks: Theory and design. IEEE Trans. Signal Process. 40, 9 (Sept.), 2207–2232.

VOORHEES, E. M. AND HARMAN, D. K., Eds. 1999. The Eighth Text REtrieval Conference (TREC-8). NIST Spec. Pub. 500-246. Department of Commerce, National Institute of Standards and Technology, Gaithersburg, MD.

WANG, J., LI, J., AND WIEDERHOLD, G. 2001. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Trans. Patt. Analys. Mach. Intell. 23, 9 (Sept.), 947–963.

WEISSTEIN, E. W. 1999. Support. In Eric Weisstein's World of Mathematics. CRC Press LLC, Boca Raton, FL. Also available at http://mathworld.wolfram.com.

ZOBEL, J. AND MOFFAT, A. 1998. Exploring the similarity space. ACM SIGIR For. 32, 1 (Spring), 18–34.

Received June 2003; revised July 2004; accepted March 2005
